Domain-independent data cleaning via analysis of entity-relationship graph
Source: ACM Transactions on Database Systems (TODS)
Volume 31, Issue 2 (June 2006)
Pages: 716-767
Year of Publication: 2006
ISSN: 0362-5915
Authors
Dmitri V. Kalashnikov  University of California, Irvine, Irvine, CA
Sharad Mehrotra  University of California, Irvine, Irvine, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 192,   Citation Count: 11
DOI: http://doi.acm.org/10.1145/1138394.1138401

APPENDICES and SUPPLEMENTS
Online appendix to designing mediation for context-aware applications. The appendix supports the information on page 716.


ABSTRACT

In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real data sets and over synthetic datasets show that analysis of relationships significantly improves quality of the result.
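
To make the idea of relationship-based disambiguation concrete, the following is a minimal illustrative sketch and not the RelDC algorithm itself. It assumes a toy entity-relationship graph and scores each candidate entity by the number of short simple paths connecting it to the context in which the ambiguous reference appears. All node names, the path-counting score, and the functions build_graph, count_simple_paths, and disambiguate are hypothetical and chosen only for illustration; the article defines its own connection-strength measure.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def count_simple_paths(graph, source, target, max_len):
    """Count simple paths from source to target using at most max_len edges."""
    def walk(node, visited, remaining):
        if node == target:
            return 1
        if remaining == 0:
            return 0
        total = 0
        for nxt in graph[node]:
            if nxt not in visited:
                total += walk(nxt, visited | {nxt}, remaining - 1)
        return total

    return walk(source, {source}, max_len)


def disambiguate(graph, context_nodes, candidates, max_len=3):
    """Pick the candidate most strongly connected to the reference's context.

    The score used here (a plain path count) is an assumption made for this
    sketch, not the measure used in the article.
    """
    scores = {
        c: sum(count_simple_paths(graph, ctx, c, max_len) for ctx in context_nodes)
        for c in candidates
    }
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


# Hypothetical toy data: paper P1 lists authors "Alice Brown" and an
# ambiguous "J. Smith" that could be either John Smith or Jane Smith.
edges = [
    ("paper:P1", "author:Alice Brown"),
    ("paper:P2", "author:Alice Brown"),
    ("paper:P2", "author:John Smith"),
    ("author:Alice Brown", "org:UCI"),
    ("author:John Smith", "org:UCI"),
    ("author:Jane Smith", "org:MIT"),
]
graph = build_graph(edges)
best, scores = disambiguate(
    graph,
    context_nodes=["paper:P1"],
    candidates=["author:John Smith", "author:Jane Smith"],
)
print(best, scores)
```

In this toy graph, a feature-only comparison cannot separate the two candidates because both match "J. Smith" equally well. The relationship evidence (a shared coauthor and a shared affiliation) is what favors one candidate over the other.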


