ABSTRACT
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function that overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.