Entity resolution with iterative blocking

Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 6: entity resolution
Pages 219-232
Year of Publication: 2009
ISBN: 978-1-60558-551-2

Authors
Steven Euijong Whang, Stanford University, Stanford, CA, USA
David Menestrina, Stanford University, Stanford, CA, USA
Georgia Koutrika, Stanford University, Stanford, CA, USA
Martin Theobald, Stanford University, Stanford, CA, USA
Hector Garcia-Molina, Stanford University, Stanford, CA, USA

ABSTRACT
Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process computes the similarity between every pair of records, which can be very expensive for large datasets. Blocking techniques improve the performance of ER by dividing the records into blocks, possibly in multiple ways, and comparing only records within the same block. However, most blocking techniques process blocks separately and do not exploit the ER results of other blocks. In this paper, we propose an iterative blocking framework in which the ER results of one block are propagated to subsequently processed blocks. Blocks are processed iteratively until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because propagating the ER results of one block to other blocks may generate additional record matches. Iterative blocking may also be more efficient because the work done on one block can save processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.
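
The difference between simple and iterative blocking described in the abstract can be illustrated with a toy sketch. This is not the authors' system: the record layout, the match rule (shared email or shared phone), the two blocking criteria (name and zip), and all function names are assumptions made for illustration. The sketch rebuilds the blocks after every merge, so it shows only the accuracy effect, not the efficiency or scalability aspects.

```python
from itertools import combinations

FIELDS = ("name", "zip", "email", "phone")  # hypothetical record schema


def make_record(rid, **values):
    """Build a record whose fields are sets of values (empty if missing)."""
    rec = {f: frozenset(values.get(f, ())) for f in FIELDS}
    rec["ids"] = frozenset([rid])
    return rec


def matches(a, b):
    # Assumed match rule: two records match if they share an email or a phone.
    return bool(a["email"] & b["email"]) or bool(a["phone"] & b["phone"])


def merge(a, b):
    # A merged record carries the union of both records' values and ids.
    return {k: a[k] | b[k] for k in a}


def build_blocks(records):
    """Assumed blocking criteria: one block per name value and per zip value."""
    blocks = {}
    for idx, rec in enumerate(records):
        keys = {("name", v) for v in rec["name"]} | {("zip", v) for v in rec["zip"]}
        for key in keys:
            blocks.setdefault(key, []).append(idx)
    return blocks


def simple_blocking(records):
    """Compare pairs inside each block of the original records only."""
    records = list(records)
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for members in build_blocks(records).values():
        for i, j in combinations(members, 2):
            if matches(records[i], records[j]):
                parent[find(i)] = find(j)
    clusters = {}
    for idx, rec in enumerate(records):
        clusters.setdefault(find(idx), set()).update(rec["ids"])
    return sorted(sorted(c) for c in clusters.values())


def find_match_in_blocks(records):
    """Return the indices of one matching pair sharing a block, or None."""
    for members in build_blocks(records).values():
        for i, j in combinations(members, 2):
            if matches(records[i], records[j]):
                return i, j
    return None


def iterative_blocking(records):
    """Merge within blocks and re-block until no block holds a matching pair."""
    records = list(records)
    pair = find_match_in_blocks(records)
    while pair is not None:
        i, j = pair
        merged = merge(records[i], records[j])
        records = [r for k, r in enumerate(records) if k not in (i, j)]
        records.append(merged)  # the merged record now joins both records' blocks
        pair = find_match_in_blocks(records)
    return sorted(sorted(r["ids"]) for r in records)


records = [
    make_record("r1", name=["John Doe"], zip=["54321"], email=["jd@example.org"]),
    make_record("r2", name=["John Doe"], zip=["12345"], email=["jd@example.org"],
                phone=["555-1234"]),
    make_record("r3", name=["J. Doe"], zip=["54321"], phone=["555-1234"]),
]
simple = simple_blocking(records)
iterative = iterative_blocking(records)
print(simple)     # [['r1', 'r2'], ['r3']]
print(iterative)  # [['r1', 'r2', 'r3']]
```

In this example r2 and r3 share a phone number but never share a block, so simple blocking misses the match. Once r1 and r2 are merged in the name block, the merged record carries r2's phone into the zip block that r1 shares with r3, and the additional match is found.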
DOI: 10.1145/1559845.1559870
Conference dates: June 29-July 02, 2009