The merge/purge problem for large databases
Source: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data
San Jose, California, United States
Pages: 127 - 138  
Year of Publication: 1995
ISBN:0-89791-731-6
Authors
Mauricio A. Hernández  Department of Computer Science, Columbia University, New York, NY
Salvatore J. Stolfo  Department of Computer Science, Columbia University, New York, NY
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/223784.223807

ABSTRACT

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the transitive closure over the results of independent runs considering alternative primary key attributes in each pass.
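The abstract names the techniques without spelling them out, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes a sorted neighborhood pass means: sort the records on a key, then compare each record only with the few records just before it in sorted order. It then merges the matching pairs from several passes with a union-find transitive closure. The record fields, the two key functions, the window size, and the matching rule `is_match` are all hypothetical choices made for illustration.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """One pass: sort on `key`, compare each record with the
    previous `window - 1` records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices connected by matching pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, is_match, window=3):
    """Run one independent pass per key, then close over all pairs found."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood(records, key, is_match, window)
    return transitive_closure(len(records), pairs)


# Hypothetical toy data: records 0, 1, 2 describe the same person,
# with a misspelled last name in 1 and a mistyped SSN in 2.
records = [
    {"last": "Smith", "first": "John", "ssn": "123456789"},
    {"last": "Smyth", "first": "John", "ssn": "123456789"},
    {"last": "Smith", "first": "Jon", "ssn": "923456789"},
    {"last": "Jones", "first": "Mary", "ssn": "555000111"},
]


def is_match(a, b):
    # Assumed rule: same SSN, or same last name and same first initial.
    return a["ssn"] == b["ssn"] or (
        a["last"] == b["last"] and a["first"][0] == b["first"][0]
    )


def by_last(r):
    return r["last"]


def by_ssn(r):
    return r["ssn"]


single = sorted_neighborhood(records, by_last, is_match, window=2)
merged = multi_pass(records, [by_last, by_ssn], is_match, window=2)
print(single)  # {(0, 2)}
print(merged)  # [[0, 1, 2], [3]]
```

In this toy run, sorting on last name alone links only records 0 and 2, and sorting on SSN alone links only 0 and 1. Taking the closure over both passes recovers the full group, which is the effect the abstract attributes to the multi-pass approach. A small window keeps each pass cheap, at the cost of missing pairs that a single key places far apart.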


