ABSTRACT
In many applications, there are a variety of ways of referring to the same underlying entity. Given a collection of references to entities, we would like to determine the set of true underlying entities and map the references to these entities. The references may be to entities of different types and more than one type of entity may need to be resolved at the same time. We propose similarity measures for clustering references taking into account the different relations that are observed among the typed references. We pose typed entity resolution in relational data as a clustering problem and present experimental results on real data showing improvements over attribute-based models when relations are leveraged.
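The clustering view described above can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not define its similarity measures, so everything here is an assumption made for illustration. The function names (`attr_sim`, `jaccard`, `resolve`), the use of `difflib.SequenceMatcher` as the attribute similarity, Jaccard overlap of linked clusters as the relational similarity, and the `alpha` and `threshold` parameters are all hypothetical choices.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    """Attribute similarity of two name strings, in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(s, t):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    return len(s & t) / len(s | t) if (s or t) else 0.0


def resolve(names, links, alpha=0.3, threshold=0.5):
    """Greedy agglomerative clustering of references (illustrative only).

    names: dict mapping reference id -> name string
    links: list of sets of reference ids observed together
           (for example, the author references on one paper)
    alpha: weight of the relational term (hypothetical parameter)
    threshold: merging stops when the best pair scores below this
    """
    clusters = {r: {r} for r in names}   # cluster label -> member references
    label = {r: r for r in names}        # reference -> current cluster label

    def neighbours(c):
        # Labels of the clusters that co-occur with any member of cluster c.
        out = set()
        for link in links:
            if link & clusters[c]:
                out |= {label[r] for r in link - clusters[c]}
        return out

    def sim(c1, c2):
        # Attribute term: best name match across the two clusters.
        a = max(attr_sim(names[x], names[y])
                for x in clusters[c1] for y in clusters[c2])
        # Relational term: overlap of the clusters each one is linked to.
        r = jaccard(neighbours(c1), neighbours(c2))
        return (1 - alpha) * a + alpha * r

    while len(clusters) >= 2:
        score, c1, c2 = max(
            ((sim(x, y), x, y) for x, y in combinations(list(clusters), 2)),
            key=lambda t: t[0],
        )
        if score < threshold:
            break
        # Merge c2 into c1 and relabel its members.
        clusters[c1] |= clusters.pop(c2)
        for r in clusters[c1]:
            label[r] = c1
    return list(clusters.values())


# Hypothetical toy data: two papers, each with two author references.
names = {"a1": "W. Wang", "a2": "Wei Wang",
         "b1": "A. Ansari", "b2": "Abdulla Ansari"}
links = [{"a1", "b1"}, {"a2", "b2"}]
entities = resolve(names, links)
```

Because the relational term is recomputed from current cluster labels, an early merge can raise the similarity of references linked to the merged cluster. That is the sense in which relations are leveraged in this sketch; an attribute-only baseline corresponds to `alpha=0`.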