|
ABSTRACT
The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
M. Bilenko and R. J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, Feb. 2002.
|
 |
3
|
William W. Cohen , Henry Kautz , David McAllester, Hardening soft information sources, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.255-259, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347141]
|
 |
4
|
|
| |
5
|
D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.
|
| |
6
|
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
|
| |
7
|
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
|
| |
8
|
|
| |
9
|
|
 |
10
|
|
| |
11
|
|
| |
12
|
|
 |
13
|
Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347123]
|
| |
14
|
A. E. Monge and C. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 267--270, Portland, OR, Aug. 1996.
|
| |
15
|
A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, Tuscon, AZ, May 1997.
|
| |
16
|
U. Y. Nahm and R. J. Mooney. Using information extraction to aid the discovery of prediction rules from texts. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, Boston, MA, Aug. 2000.
|
| |
17
|
S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443--453, 1970.
|
| |
18
|
H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954--959, 1959.
|
| |
19
|
J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 185--208. MIT Press, 1999.
|
| |
20
|
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, 1989.
|
| |
21
|
|
 |
22
|
|
 |
23
|
|
| |
24
|
V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
|
| |
25
|
W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Wachington, DC, 1999.
|
| |
26
|
|
| |
27
|
|
CITED BY 74
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Mikhail Bilenko , Sugato Basu , Raymond J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, Proceedings of the twenty-first international conference on Machine learning, p.11, July 04-08, 2004, Banff, Alberta, Canada
|
|
|
|
|
|
Moisés G. de Carvalho , Marcos André Gonçalves , Alberto H. F. Laender , Altigran S. da Silva, Learning to deduplicate, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Aparna S. Varde , Elke A. Rundensteiner , Carolina Ruiz , David C. Brown , Mohammmed Maniruzzaman , Richard D. Sisson, Designing semantics-preserving cluster representatives for scientific input conditions, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Clifton Phua , Vincent Lee , Kate Smith-Miles , Ross Gayler, Adaptive communal detection in search of adversarial identity crime, Proceedings of the 2007 international workshop on Domain driven data mining, p.1-10, August 12-12, 2007, San Jose, California
|
|
|
|
|
|
Moisés G. Carvalho , Albero H. F. Laender , Marcos André Gonçalves , Altigran S. da Silva, Replica identification using genetic programming, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Moisés G. de Carvalho , Alberto H. F. Laender , Marcos André Gonçalves , Thiago C. Porto, The impact of parameter setup on a genetic programming approach to record deduplication, Proceedings of the 23rd Brazilian symposium on Databases, October 13-17, 2008, Campinas, Sao Paulo, Brazil
|
|
|
|
|
|
Fatiha Saïs , Nathalie Pernelle , Marie-Christine Rousset, L2R: a logical method for reference reconciliation, Proceedings of the 22nd national conference on Artificial intelligence, p.329-334, July 22-26, 2007, Vancouver, British Columbia, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|