Address standardization with latent semantic association
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Industrial track papers
Pages 1155-1164
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Honglei Guo  IBM China Research Lab., Beijing, China
Huijia Zhu  IBM China Research Lab., Beijing, China
Zhili Guo  IBM China Research Lab., Beijing, China
XiaoXun Zhang  IBM China Research Lab., Beijing, China
Zhong Su  IBM China Research Lab., Beijing, China
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557144

ABSTRACT

Address standardization is a challenging task in data cleansing. To support customer relationship management and business intelligence in customer-oriented corporations, millions of free-text addresses must be converted to a standard format for data integration, de-duplication, and householding. Existing commercial tools usually rely on many hand-crafted, domain-specific rules and on reference dictionaries of cities, states, etc. These rules work well for the region they were designed for. However, because address data are highly irregular and vary across countries and regions, rule-based methods require substantial human effort to rewrite the rules for each new domain. Supervised learning methods are usually more adaptable than rule-based approaches, but they need large-scale labeled training data, and building a large annotated corpus for each target domain is labor-intensive and time-consuming. To minimize human effort and the size of the labeled training set, we present a free-text address standardization method based on latent semantic association (LaSA). A LaSA model is constructed to capture the latent semantic association among words in an unlabeled corpus. The original term space of the target domain is first projected to a concept space using the LaSA model; the address standardization model is then trained by active learning from LaSA features and informative samples. The proposed method effectively captures the data distribution of the domain. Experimental results on large-scale English and Chinese corpora show that the proposed method significantly improves standardization performance with less effort and less training data.
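The two ideas named in the abstract, projecting terms to a concept space learned from unlabeled text and picking informative samples for labeling, can be illustrated with a small sketch. The abstract does not say how the LaSA model is built or which selection criterion is used, so everything here is an assumption: words are grouped by the cosine similarity of their neighbor-word counts (a simple distributional stand-in, not the authors' LaSA model), and samples are ranked by least-confidence uncertainty sampling. The function names, the `window` and `threshold` parameters, and the toy addresses are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def context_vectors(addresses, window=1):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for address in addresses:
        tokens = address.lower().split()
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def assign_concepts(vectors, threshold=0.8):
    """Greedy grouping (an assumed stand-in for a concept space):
    a word joins the first concept whose seed word is similar enough,
    otherwise it becomes the seed of a new concept."""
    seeds = []        # seeds[c] is the seed word of concept c
    concept_of = {}
    for word in sorted(vectors):
        for cid, seed in enumerate(seeds):
            if cosine(vectors[word], vectors[seed]) >= threshold:
                concept_of[word] = cid
                break
        else:
            concept_of[word] = len(seeds)
            seeds.append(word)
    return concept_of

def least_confident(probabilities, k=1):
    """Uncertainty sampling: indices of the k samples whose most likely
    class has the lowest probability, i.e. candidates to label next."""
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]

# Toy unlabeled corpus (hypothetical data, not from the paper).
unlabeled = [
    "12 main street springfield",
    "48 main road springfield",
    "7 oak street shelbyville",
    "9 oak road shelbyville",
]
concept_of = assign_concepts(context_vectors(unlabeled))

# Words sharing a concept id can be fed to a tagger as one shared feature.
print(concept_of["street"] == concept_of["road"])

# Hypothetical per-sample class probabilities from some current model.
print(least_confident([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]], k=1))
```

In this toy corpus "street" and "road" occur in identical contexts, so they fall into the same concept even though neither was ever labeled. That is the kind of generalization from unlabeled data that the abstract attributes to LaSA features.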


