Address standardization with latent semantic association
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Industrial track papers
Pages 1155-1164
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Honglei Guo, IBM China Research Lab., Beijing, China
Huijia Zhu, IBM China Research Lab., Beijing, China
Zhili Guo, IBM China Research Lab., Beijing, China
XiaoXun Zhang, IBM China Research Lab., Beijing, China
Zhong Su, IBM China Research Lab., Beijing, China
ABSTRACT
Address standardization is a very challenging task in data cleansing. To provide better customer relationship management and business intelligence for customer-oriented corporations, millions of free-text addresses need to be converted to a standard format for data integration, de-duplication and householding. Existing commercial tools usually employ many hand-crafted, domain-specific rules and reference data dictionaries of cities, states, etc. These rules work well for the region they were designed for. However, because address data are highly irregular and vary across countries and regions, rule-based methods usually require considerable human effort to rewrite the rules for each new domain. Supervised learning methods are usually more adaptable than rule-based approaches, but they need large-scale labeled training data, and building a large-scale annotated corpus for each target domain is labor-intensive and time-consuming. To minimize human effort and the size of the labeled training set, we present a free-text address standardization method with latent semantic association (LaSA). A LaSA model is constructed to capture latent semantic associations among words from the unlabeled corpus. The original term space of the target domain is first projected to a concept space using the LaSA model; the address standardization model is then trained by active learning from LaSA features and informative samples. The proposed method effectively captures the data distribution of the domain. Experimental results on large-scale English and Chinese corpora show that the proposed method significantly enhances standardization performance with less effort and less training data.
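The two steps named in the abstract (projecting words to a concept space learned from unlabeled text, then selecting informative samples for labeling) can be illustrated with a toy sketch. This is not the authors' LaSA model: the abstract does not say how the concept space is built or which selection criterion is used. The sketch assumes a simple stand-in, grouping words whose neighbouring words are similar, and uses least-confidence sampling as one common active-learning criterion. All function names, the similarity threshold, and the sample addresses are hypothetical.

```python
from collections import Counter, defaultdict
import math


def context_vectors(addresses, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for line in addresses:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def assign_concepts(vectors, threshold=0.8):
    """Greedy grouping (an assumption, not LaSA): a word joins the first
    concept whose seed word has a similar context, else starts a new concept."""
    seeds, concept_of = [], {}
    for word in sorted(vectors):
        for cid, seed in enumerate(seeds):
            if cosine(vectors[word], vectors[seed]) >= threshold:
                concept_of[word] = cid
                break
        else:
            seeds.append(word)
            concept_of[word] = len(seeds) - 1
    return concept_of


def least_confident(probabilities, k=1):
    """Uncertainty sampling: indices of the k samples whose top class probability is lowest."""
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]


# Hypothetical unlabeled addresses, for illustration only.
unlabeled = [
    "12 main street springfield",
    "45 main road springfield",
    "7 oak street shelbyville",
    "9 oak road shelbyville",
]
concepts = assign_concepts(context_vectors(unlabeled))

# Hypothetical per-sample class probabilities from some current model.
to_label = least_confident([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]], k=1)
```

On this toy corpus, "street" and "road" fall into one concept and the two city names into another, so a tagger that uses concept IDs as features could generalize from a labeled "street" to an unseen "road". The sample with probabilities 0.55/0.45 is the one selected for annotation, since the model is least sure about it.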