ABSTRACT
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can determine when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
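To make the idea of interactively discovering challenging training pairs concrete, here is a minimal sketch of one common form of active learning, committee-based selection. It is not the system described above. The function names (`features`, `train_stump`, `disagreement`, `pick_query`), the two similarity features, and the use of one-feature threshold rules trained on bootstrap resamples are all assumptions chosen for illustration.

```python
import random
from difflib import SequenceMatcher


def features(a, b):
    """Two illustrative similarity features for a record pair (assumed, not from the paper)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    edit_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return (jaccard, edit_sim)


def train_stump(labeled, rng):
    """Fit a one-feature threshold rule on a bootstrap resample of the labeled pairs.

    `labeled` is a list of (feature_vector, is_duplicate) tuples.
    """
    sample = [rng.choice(labeled) for _ in labeled]
    feat = rng.randrange(2)  # each committee member looks at one random feature
    best = None
    for x, _ in sample:
        thr = x[feat]
        # Count training errors of the rule "duplicate if feature >= thr".
        errors = sum((xi[feat] >= thr) != yi for xi, yi in sample)
        if best is None or errors < best[0]:
            best = (errors, thr)
    return feat, best[1]


def disagreement(committee, x):
    """Size of the minority vote: 0 when the committee is unanimous."""
    votes = sum(x[feat] >= thr for feat, thr in committee)
    return min(votes, len(committee) - votes)


def pick_query(labeled, unlabeled, n_members=15, seed=0):
    """Return the unlabeled record pair the committee disagrees on most."""
    rng = random.Random(seed)
    committee = [train_stump(labeled, rng) for _ in range(n_members)]
    return max(unlabeled, key=lambda pair: disagreement(committee, features(*pair)))
```

In an interactive loop, the pair returned by `pick_query` would be shown to a user, labeled as duplicate or non-duplicate, appended to `labeled`, and the committee retrained. Pairs on which the committee is unanimous are skipped, which is why this style of selection tends to need fewer labels than random sampling.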