ABSTRACT
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed that references belong to a single class with a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes, and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop.

Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
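To make the three features concrete, here is a minimal toy sketch in Python. It is not the algorithm from the paper: the matching rule (a shared email, or a shared name plus a common associated entity), the function name `reconcile`, and the sample data are all assumptions chosen for illustration, and negative evidence is not modeled. It only shows how association-based comparison, propagation of earlier decisions, and enrichment by merging attribute values can interact.

```python
from itertools import combinations


def reconcile(refs, links):
    """Toy reconciliation. refs: {id: {attr: set(values)}}; links: [(id, id)]."""
    parent = {r: r for r in refs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def values(root, attr):
        # Enrichment: a cluster carries the union of its members' values.
        out = set()
        for r in refs:
            if find(r) == root:
                out |= refs[r].get(attr, set())
        return out

    def neighbours(root):
        # Association evidence: clusters linked to any member of this cluster.
        out = set()
        for a, b in links:
            if find(a) == root:
                out.add(find(b))
            if find(b) == root:
                out.add(find(a))
        out.discard(root)
        return out

    def match(ra, rb):
        # Hypothetical rule: shared email, or shared name plus a common neighbour.
        if values(ra, "email") & values(rb, "email"):
            return True
        same_name = values(ra, "name") & values(rb, "name")
        return bool(same_name and neighbours(ra) & neighbours(rb))

    changed = True
    while changed:
        # Propagation: every merge can enable further merges on the next pass.
        changed = False
        for a, b in combinations(sorted(refs), 2):
            ra, rb = find(a), find(b)
            if ra != rb and match(ra, rb):
                parent[rb] = ra
                changed = True

    clusters = {}
    for r in sorted(refs):
        clusters.setdefault(find(r), []).append(r)
    return sorted(clusters.values())


# Made-up person references; p3 has a name only.
REFS = {
    "p1": {"name": {"Ann Lee"}, "email": {"ann@example.org"}},
    "p2": {"name": {"A. Lee"}, "email": {"ann@example.org"}},
    "p3": {"name": {"A. Lee"}},
    "p4": {"name": {"Bo Chen"}, "email": {"bo@example.org"}},
    "p5": {"name": {"Bo Chen"}, "email": {"bo@example.org"}},
}
# Made-up associations (e.g. co-authorship) between references.
LINKS = [("p3", "p4"), ("p1", "p5")]

print(reconcile(REFS, LINKS))
```

In this sample, `p3` joins the cluster of `p1` and `p2` only after two earlier decisions: `p2` contributes the name "A. Lee" to the cluster, and merging `p4` with `p5` gives both sides a common associate. Without the links, `p3` stays unreconciled.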