ABSTRACT
Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining, and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. We thereby overcome a limitation of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding that of state-of-the-art supervised approaches.