ABSTRACT
In this paper we present a method for automatically segmenting unformatted text records into structured elements. Many useful data sources today are human-generated as continuous text, whereas convenient use requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific, rule-based systems.
We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that combines multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded an accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
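To make the HMM-based segmentation idea concrete, the following is a minimal sketch of Viterbi decoding over address tokens. It is not DATAMOLD's actual model: the three element labels, all probabilities, the word lists, and the function names `emission`, `viterbi`, and `segment` are assumptions chosen for illustration. The probabilities are hand-set rather than learned from training examples, and element length distributions are not modeled.

```python
import math

# Hypothetical element labels and hand-set toy probabilities (assumptions,
# not values from the paper; a real system would learn them from examples).
STATES = ["HouseNo", "Street", "City"]
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.6, "City": 0.35},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.9},
}
# Stand-ins for "distinguishing words" and an external data dictionary.
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave"}
CITY_DICTIONARY = {"boston", "mumbai", "springfield"}


def emission(state, token):
    """Toy probability of seeing `token` in `state` (never zero)."""
    is_num = token.isdigit()
    word = token.lower()
    if state == "HouseNo":
        return 0.9 if is_num else 0.01
    if state == "Street":
        if word in STREET_WORDS:
            return 0.5
        return 0.05 if is_num else 0.2
    # state == "City"
    if word in CITY_DICTIONARY:
        return 0.6
    return 0.01 if is_num else 0.1


def viterbi(tokens):
    """Return the most probable label sequence, computed in log space."""
    if not tokens:
        return []
    score = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
              for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for tok in tokens[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, tok)))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def segment(text):
    """Split `text` on whitespace and merge adjacent tokens sharing a label."""
    tokens = text.split()
    fields = []
    for tok, lab in zip(tokens, viterbi(tokens)):
        if fields and fields[-1][0] == lab:
            fields[-1] = (lab, fields[-1][1] + " " + tok)
        else:
            fields.append((lab, tok))
    return fields


print(segment("221 Baker Street Boston"))
```

With these toy parameters, the transition probabilities encode the expected order of elements, while the emission function plays the role of the vocabulary and dictionary evidence; decoding picks the labeling that best agrees with both.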