ABSTRACT
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database that contains multiple references to the same entity. This can lead not only to data redundancy, but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
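To make the idea of collective resolution concrete, the following is a minimal Python sketch, not the algorithm from the article itself. It greedily merges clusters of references using a weighted sum of an attribute similarity and a relational similarity. Everything specific in it is an assumption made for illustration: the toy name matcher attr_sim, the Jaccard overlap of neighbouring clusters as the relational measure, the weight alpha, the merge threshold, and the example data.

```python
from itertools import combinations


def _is_initial(token):
    """True for an abbreviated first name such as 'a.'."""
    return len(token) == 2 and token.endswith(".")


def attr_sim(a, b):
    """Toy attribute similarity for 'First Last' names (illustrative assumption)."""
    first_a, last_a = a.lower().split()
    first_b, last_b = b.lower().split()
    if last_a != last_b:
        return 0.0
    if first_a == first_b:
        return 1.0
    # An initial is compatible with any first name sharing its first letter.
    if (_is_initial(first_a) or _is_initial(first_b)) and first_a[0] == first_b[0]:
        return 0.5
    return 0.0


def resolve(names, hyperedges, alpha=0.5, threshold=0.4):
    """Greedy agglomerative clustering over references.

    names: dict mapping reference id -> name string.
    hyperedges: list of sets of reference ids that co-occur (e.g. one per paper).
    alpha: hypothetical weight of the relational part of the similarity.
    """
    cluster_of = {r: r for r in names}   # each reference starts as its own cluster
    members = {r: {r} for r in names}

    def neighbours(c):
        # Clusters of all references that co-occur with a member of c.
        out = set()
        for edge in hyperedges:
            if edge & members[c]:
                out |= {cluster_of[r] for r in edge}
        out.discard(c)
        return out

    def sim(c1, c2):
        # Attribute part: best match between any two member references.
        a = max(attr_sim(names[x], names[y])
                for x in members[c1] for y in members[c2])
        # Relational part: Jaccard overlap of the neighbouring clusters.
        n1 = neighbours(c1) - {c2}
        n2 = neighbours(c2) - {c1}
        union = n1 | n2
        r = len(n1 & n2) / len(union) if union else 0.0
        return (1 - alpha) * a + alpha * r

    while True:
        best = max(((sim(c1, c2), c1, c2)
                    for c1, c2 in combinations(sorted(members), 2)),
                   default=None)
        if best is None or best[0] < threshold:
            break
        _, c1, c2 = best
        members[c1] |= members.pop(c2)   # merge c2 into c1
        for ref in members[c1]:
            cluster_of[ref] = c1
    return sorted(sorted(m) for m in members.values())


# Hypothetical data: three papers, each listing two author references.
names = {
    "r1": "A. Aho", "r2": "J. Ullman",
    "r3": "Alfred Aho", "r4": "J. Ullman",
    "r5": "Amy Aho", "r6": "S. Kim",
}
papers = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]

print(resolve(names, papers, alpha=0.5))  # collective: attributes plus co-occurrence
print(resolve(names, papers, alpha=0.0))  # attribute-only baseline
```

On this toy data, the collective setting first merges the two identical "J. Ullman" references, and that merge then gives "A. Aho" and "Alfred Aho" a shared neighbour, so they are merged while "Amy Aho" stays separate. With alpha=0.0 the relational evidence is ignored and all three "Aho" references end up in one cluster, which illustrates why resolving co-occurring references jointly can help.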