Robust and efficient fuzzy match for online data cleaning
Source: International Conference on Management of Data
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
San Diego, California
SESSION: Similarity queries I
Pages: 313-324
Year of Publication: 2003
ISBN:1-58113-634-X
Authors
Surajit Chaudhuri  Microsoft Research
Kris Ganjam  Microsoft Research
Venkatesh Ganti  Microsoft Research
Rajeev Motwani  Stanford University
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 25,   Downloads (12 Months): 221,   Citation Count: 73
DOI: http://doi.acm.org/10.1145/872757.872796

ABSTRACT

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
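
The fuzzy match operation described in the abstract can be illustrated with a minimal sketch. It assumes a plain token-set Jaccard similarity and a linear scan over the reference relation. Both are stand-ins chosen for brevity: the abstract does not specify the paper's similarity function or its algorithm, and this sketch is not either of them. The names `jaccard`, `tuple_similarity`, `fuzzy_match`, the `threshold` parameter, and the sample product tuples are all hypothetical.

```python
from __future__ import annotations

from typing import Optional, Sequence

Row = tuple[str, ...]


def tokens(value: str) -> set[str]:
    """Lowercase a field and split it into whitespace-delimited tokens."""
    return set(value.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1] (a stand-in, not the paper's function)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def tuple_similarity(x: Row, y: Row) -> float:
    """Average the per-field similarity over positionally aligned fields."""
    if not x:
        return 0.0
    return sum(jaccard(a, b) for a, b in zip(x, y)) / len(x)


def fuzzy_match(incoming: Row, reference: Sequence[Row],
                threshold: float = 0.5) -> Optional[Row]:
    """Return an exact match if one exists, otherwise the most similar
    reference tuple when its similarity reaches `threshold`, else None."""
    if incoming in reference:
        return incoming
    # Naive linear scan; an efficient algorithm would avoid comparing
    # the input against every reference tuple.
    best = max(reference, key=lambda r: tuple_similarity(incoming, r), default=None)
    if best is not None and tuple_similarity(incoming, best) >= threshold:
        return best
    return None


# Hypothetical product reference relation: (name, description).
REFERENCE = [
    ("acme widget", "steel widget 10 mm"),
    ("acme gadget", "plastic gadget blue"),
]

# The misspelled name fails the exact match but is close to the first tuple.
cleaned = fuzzy_match(("acme widgt", "steel widget 10 mm"), REFERENCE)
```

Here the misspelled input is mapped to the first reference tuple, while an input that shares no tokens with any reference tuple yields `None` and would be left for manual review. The threshold trades off wrongly "cleaning" a tuple against leaving a dirty tuple unmatched.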


