Automatic segmentation of text into structured records
Source International Conference on Management of Data archive
Proceedings of the 2001 ACM SIGMOD international conference on Management of data table of contents
Santa Barbara, California, United States
Pages: 175-186
Year of Publication: 2001
ISBN: 1-58113-332-4
Authors
Vinayak Borkar  Indian Institute of Technology, Bombay
Kaustubh Deshmukh  University of Washington, Seattle and Indian Institute of Technology, Bombay
Sunita Sarawagi  Indian Institute of Technology, Bombay
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 15,   Downloads (12 Months): 204,   Citation Count: 38
DOI: http://doi.acm.org/10.1145/375663.375682

ABSTRACT

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Many useful data sources today are human-generated as continuous text, whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific rule-based systems.

We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool builds on Hidden Markov Models (HMMs) to construct a powerful probabilistic model that combines multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
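
To make the idea of HMM-based segmentation concrete, the following is a minimal sketch, not DATAMOLD itself. It assumes a plain first-order HMM with one state per element, add-one smoothing, and Viterbi decoding. It does not model length distributions, a vocabulary hierarchy, or an external dictionary, all of which the abstract lists as parts of the actual tool. The training records, the label names, and the helper names (token_feature, train_hmm, log_prob, segment) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labeled records (hypothetical data, not taken from the paper).
TRAIN = [
    [("12", "HouseNo"), ("Main", "Street"), ("Street", "Street"), ("Springfield", "City")],
    [("7", "HouseNo"), ("Oak", "Street"), ("Avenue", "Street"), ("Portland", "City")],
    [("221", "HouseNo"), ("Baker", "Street"), ("Street", "Street"), ("London", "City")],
]


def token_feature(token):
    """Map a raw token to a coarse symbol so unseen numbers still match."""
    return "<NUM>" if token.isdigit() else token.lower()


def train_hmm(records):
    """Collect start, transition, and emission counts from labeled records."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for record in records:
        prev = None
        for token, label in record:
            sym = token_feature(token)
            vocab.add(sym)
            emit[label][sym] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label
    return sorted(emit), start, trans, emit, vocab


def log_prob(counter, key, num_outcomes):
    """Add-one (Laplace) smoothed log probability of `key`."""
    total = sum(counter.values())
    return math.log((counter[key] + 1) / (total + num_outcomes))


def segment(tokens, model):
    """Viterbi decoding: return the most likely label for each token."""
    if not tokens:
        return []
    states, start, trans, emit, vocab = model
    n_states, n_syms = len(states), len(vocab) + 1  # +1 slot for unknown symbols
    syms = [token_feature(t) for t in tokens]
    # Each column maps state -> (best log score ending in that state, backpointer).
    columns = [{
        s: (log_prob(start, s, n_states) + log_prob(emit[s], syms[0], n_syms), None)
        for s in states
    }]
    for sym in syms[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (columns[-1][p][0] + log_prob(trans[p], s, n_states), p)
                for p in states
            )
            column[s] = (score + log_prob(emit[s], sym, n_syms), prev)
        columns.append(column)
    # Follow backpointers from the best final state.
    labels = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        labels.append(column[labels[-1]][1])
    return labels[::-1]


model = train_hmm(TRAIN)
for token, label in zip("99 Oak Street London".split(),
                        segment("99 Oak Street London".split(), model)):
    print(f"{token}\t{label}")
```

With only three training records the sketch relies heavily on smoothing. A realistic setup would need far more examples and the richer model structure described in the abstract.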

