|
ABSTRACT
Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrates this approach may work well in practice but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Closure over the results of independent runs considering alternative primary key attributes in each pass.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
 |
3
|
|
| |
4
|
Hasanat M. Dewan , Kui W. Mok , Mauricio Hernández , Salvatore J. Stolfo, Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew, Proceedings of the third international conference on on Parallel and distributed information systems, p.40-49, October 1994, Autin, Texas, United States
|
| |
5
|
|
| |
6
|
C. L. Forgy. OPS5 User's Manual. Technical Report CMU-CS-81-135, Carnegie Mellon University, July 1981.
|
| |
7
|
|
| |
8
|
R. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal o} Computzng, 17:416-429, 1969.
|
| |
9
|
M. A. gernKndez. A Generalization of Band-Joins and the Merge/Purge Problem. Technical Report CUCS- 005-1995, Department of Computer Science, Columbia University, February 1995.
|
 |
10
|
|
 |
11
|
|
| |
12
|
D. P. Miranker, B. Lofaso, G. Farmer, A. Chandra, and D. Brant. On a TREAT-based Production System Compiler. In Proc. l Oth Int'l Conf. on Expert Systems, pages 617-630, 1990.
|
 |
13
|
Chris Nyberg , Tom Barclay , Zarka Cvetanovic , Jim Gray , Dave Lomet, AlphaSort: a RISC machine sort, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.233-242, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
14
|
|
CITED BY 99
|
|
|
|
|
|
|
|
|
|
|
H. V. Jagadish , Raymond T. Ng , Divesh Srivastava, Substring selectivity estimation, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.249-260, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
|
|
Mong Li Lee , Tok Wang Ling , Wai Lup Low, IntelliClean: a knowledge-based intelligent data cleaner, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.290-294, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
|
|
|
Wynne Hsu , Mong Li Lee , Bing Liu , Tok Wang Ling, Exploration mining in diabetic patients databases: findings and conclusions, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.430-436, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
William W. Cohen , Henry Kautz , David McAllester, Hardening soft information sources, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.255-259, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
|
|
|
Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
P. Missier , G. Lalk , V. Verykios , F. Grillo , T. Lorusso , P. Angeletti, Improving Data Quality in Practice: A Case Study in the Italian Public Administration, Distributed and Parallel Databases, v.13 n.2, p.135-160, March 2003
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Surajit Chaudhuri , Kris Ganjam , Venky Ganti , Rahul Kapoor , Vivek Narasayya , Theo Vassilakis, Data cleaning in microsoft SQL server 2005, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland
|
|
|
Byung-Won On , Dongwon Lee , Jaewoo Kang , Prasenjit Mitra, Comparative study of name disambiguation problem using a scalable blocking-based framework, Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, June 07-11, 2005, Denver, CO, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Byung-Won On , Ergin Elmacioglu , Dongwon Lee , Jaewoo Kang , Jian Pei, An effective approach to entity resolution problem using quasi-clique and its application to digital libraries, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sudipto Guha , Nick Koudas , Amit Marathe , Divesh Srivastava, Merging the results of approximate match operations, Proceedings of the Thirtieth international conference on Very large data bases, p.636-647, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
Su Yan , Dongwon Lee , Min-Yen Kan , Lee C. Giles, Adaptive sorted neighborhood methods for efficient record linkage, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Omar Benjelloun , Hector Garcia-Molina , David Menestrina , Qi Su , Steven Euijong Whang , Jennifer Widom, Swoosh: a generic approach to entity resolution, The VLDB Journal — The International Journal on Very Large Data Bases, v.18 n.1, p.255-276, January 2009
|
|
|
|
|
|
Steven Euijong Whang , David Menestrina , Georgia Koutrika , Martin Theobald , Hector Garcia-Molina, Entity resolution with iterative blocking, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
Danushka Bollegala , Yutaka Matsuo , Mitsuru Ishizuka, Disambiguating Personal Names on the Web using Automatically Extracted Key Phrases, Proceeding of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence August 29 -- September 1, 2006, Riva del Garda, Italy, p.553-557, May 22, 2006
|
|
|
|
|
|
|
|
|
|
|