ACM Home Page
Please provide us with feedback. Feedback
Entity resolution with iterative blocking
Full text PdfPdf (487 KB)
Source
International Conference on Management of Data archive
Proceedings of the 35th SIGMOD international conference on Management of data table of contents
Providence, Rhode Island, USA
SESSION: Research session 6: entity resolution table of contents
Pages 219-232  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Steven Euijong Whang  Stanford University, Stanford, CA, USA
David Menestrina  Stanford University, Stanford, CA, USA
Georgia Koutrika  Stanford University, Stanford, CA, USA
Martin Theobald  Stanford University, Stanford, CA, USA
Hector Garcia-Molina  Stanford University, Stanford, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 50,   Downloads (12 Months): 173,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559845.1559870
What is a DOI?

ABSTRACT

Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework where the ER results of blocks are reflected to subsequently processed blocks. Blocks are now iteratively processed until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Identification, 2003.
 
3
4
 
5
 
6
7
 
8
 
9
L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, 2003.
 
10
L. Gu and R. A. Baxter. Adaptive filtering for efficient record linkage. In SDM, 2004.
 
11
12
 
13
 
14
15
 
16
 
17
A. E. Monge and C. P. Elkan. An efficient domain independent algorithm for detecting approximately duplicate database records. In SIGMOD DMKD, 1997.
 
18
19
20
 
21
22
 
23
W. Winkler. Overview of record linkage and current research directions. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 2006.
 
24
W. E. Winkler. Approximate string comparator search strategies for very large administrative lists. Technical report, US Bureau of the Census, 2005.
 
25
W. Yancey. Bigmatch: A program for extracting probable matches from a large file for record linkage. Technical report, US Bureau of the Census, 2002.


Collaborative Colleagues:
Steven Euijong Whang: colleagues
David Menestrina: colleagues
Georgia Koutrika: colleagues
Martin Theobald: colleagues
Hector Garcia-Molina: colleagues