Parallel linkage
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: Record linkage and approximate matching (DB)
Pages 283-292  
Year of Publication: 2007
ISBN: 978-1-59593-803-9
Authors
Hung-sik Kim  The Pennsylvania State University, University Park, PA
Dongwon Lee  The Pennsylvania State University, University Park, PA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1321440.1321482

ABSTRACT

We study the parallelization of the (record) linkage problem, i.e., identifying matching records between two collections of records, A and B. One of the main idiosyncrasies of the linkage problem, compared to a database join, is that once two records a in A and b in B are matched and merged into c, c must be compared again with the rest of the records in A and B, since it may produce new matches. This re-feeding stage makes any solution to the linkage problem iterative and complicates the problem significantly. To address this problem, we first discuss three plausible input scenarios: both collections are clean, only one is clean, or both are dirty. We then show that the intricate interplay between match and merge can exploit the characteristics of each scenario to achieve good parallelization. With 8 processors, our parallel algorithms achieve speedups of 6.55-7.49 times over sequential ones, and an 11.15-18.56% improvement in efficiency compared to P-Swoosh.
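
The re-feeding behavior described in the abstract can be illustrated with a minimal sequential sketch. This is not the paper's parallel algorithm. The record model (a set of attribute values), the match rule (at least two shared values), the merge rule (set union), and the names match, merge, and link are all assumptions made for illustration only.

```python
from typing import FrozenSet, List, Optional

# Assumption: a record is modeled as a set of attribute values.
Record = FrozenSet[str]


def match(r: Record, s: Record) -> bool:
    """Hypothetical match rule: records match if they share >= 2 values."""
    return len(r & s) >= 2


def merge(r: Record, s: Record) -> Record:
    """Hypothetical merge rule: the union of both records' values."""
    return r | s


def link(a: List[Record], b: List[Record]) -> List[Record]:
    """Naive sequential linkage with re-feeding.

    Every record is compared against the records resolved so far.
    When a match is found, the merged record is pushed back onto the
    work list so that it is compared against everything again.
    """
    pending: List[Record] = list(a) + list(b)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        partner: Optional[Record] = next(
            (r for r in resolved if match(current, r)), None
        )
        if partner is None:
            resolved.append(current)
        else:
            resolved.remove(partner)
            # Re-feeding: the merged record may match records that
            # neither of its two sources matched on its own.
            pending.append(merge(current, partner))
    return resolved


# r3 matches neither r1 nor r2 alone (one shared value each), but it
# does match their merge (two shared values: "555" and "pa").
r1 = frozenset({"kim", "k@x", "pa"})
r2 = frozenset({"kim", "k@x", "555"})
r3 = frozenset({"555", "pa", "foo"})
linked = link([r1], [r2, r3])
print(linked)  # a single record containing all five values
```

In this example a single pass that only compares the original records would link r1 and r2 and leave r3 separate. Feeding the merged record back into the work list is what lets r3 join them, and it is this iteration that makes the problem harder to parallelize than a plain join.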



