APPENDICES and SUPPLEMENTS
Online appendix to designing mediation for context-aware applications. The appendix supports the information on page 716.
ABSTRACT
In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real datasets and over synthetic datasets show that analysis of relationships significantly improves the quality of the result.
INDEX TERMS
Primary Classification:
H. Information Systems
  H.2 DATABASE MANAGEMENT
    H.2.m Miscellaneous
Additional Classification:
H. Information Systems
  H.2 DATABASE MANAGEMENT
    H.2.5 Heterogeneous Databases
    H.2.8 Database applications
      Subjects: Data mining
  H.3 INFORMATION STORAGE AND RETRIEVAL
    H.3.3 Information Search and Retrieval
General Terms: Algorithms, Design, Experimentation, Performance, Theory
Keywords: Connection strength, RelDC, data cleaning, entity resolution, graph analysis, reference disambiguation, relationship analysis