|
ABSTRACT
Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
|
| |
4
|
William W. Cohen and Jacob Richman. Learning to match and cluster entity names. In Proceedings of the ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, New Orleans, LA, 2001.
|
| |
5
|
William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243--270, 1999.
|
| |
6
|
Mark Craven , Dan DiPasquo , Dayne Freitag , Andrew McCallum , Tom Mitchell , Kamal Nigam , Seán Slattery, Learning to extract symbolic knowledge from the World Wide Web, Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, p.509-516, July 1998, Madison, Wisconsin, United States
|
| |
7
|
I. P. Felligi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.
|
 |
8
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon, AJAX: an extensible data cleaning tool, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.590, May 15-18, 2000, Dallas, Texas, United States
|
 |
9
|
|
| |
10
|
B. Kilss and W. Alvey. Record linkage techniques--1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.bts.gov/fcsm/methodology/, 1985.
|
| |
11
|
|
| |
12
|
|
 |
13
|
Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347123]
|
| |
14
|
A. Monge and C. Elkan. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
|
| |
15
|
H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954--959, 1959.
|
| |
16
|
Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In Proceedings of Machine Learning for Information Filtering Workshop, IJCAI '99, Stockholm, Sweden, 1999.
|
| |
17
|
H.A. Baler Saip and C.L. Lucchesi. Matching algorithm, for bipartite graph. Technical Report DCC-03/93, Departamento de Cincia da Computao, Universidade Estudal de Campinas, 1993.
|
| |
18
|
|
| |
19
|
W. E. Winkler. Improved decision rules in the Felligi-Sunter model of record linkage. Statistics of Income Division, Internal Revenue Service Publication RR93/12. Available from http://www.census.gov/srd/www/byname.html, 1993.
|
| |
20
|
W. E. Winkler. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html, 1999.
|
| |
21
|
William E. Winkler. Matching and record linkage. In Business Survey methods. Wiley, 1995.
|
CITED BY 35
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Moisés G. de Carvalho , Marcos André Gonçalves , Alberto H. F. Laender , Altigran S. da Silva, Learning to deduplicate, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Moisés G. Carvalho , Albero H. F. Laender , Marcos André Gonçalves , Altigran S. da Silva, Replica identification using genetic programming, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Moisés G. de Carvalho , Alberto H. F. Laender , Marcos André Gonçalves , Thiago C. Porto, The impact of parameter setup on a genetic programming approach to record deduplication, Proceedings of the 23rd Brazilian symposium on Databases, October 13-17, 2008, Campinas, Sao Paulo, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|