Mining reference tables for automatic text segmentation
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Seattle, WA, USA
Session: Research track papers
Pages: 20 - 29  
Year of Publication: 2004
ISBN: 1-58113-888-1
Authors
Eugene Agichtein, Columbia University
Venkatesh Ganti, Microsoft Research
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1014052.1014058

ABSTRACT

Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining, and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. Thus, we overcome limitations of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding state-of-the-art supervised approaches.

