ABSTRACT
Linking databases is an important step in a growing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. Traditionally, classification was based on manually set thresholds or on statistical procedures, whereas many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs; in the second step, they are used to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples to the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
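The two-step idea can be illustrated with a minimal sketch of the nearest-neighbour variation. The abstract does not say how training examples are selected, so everything here is an assumption made for illustration: record pairs are represented as vectors of per-field similarities in [0, 1], seeds are the vectors lying within a hypothetical `margin` of the all-agree or all-disagree corner, and the names `select_seeds` and `classify_nearest` are invented. It is not the author's implementation.

```python
def euclidean(p, q):
    """Euclidean distance between two equal-length similarity vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def select_seeds(vectors, margin=0.2):
    """Step 1 (assumed rule): keep only vectors that lie very close to
    total agreement (all 1.0) or total disagreement (all 0.0) as seeds."""
    match_seeds, non_match_seeds, rest = [], [], []
    for vec in vectors:
        ones = (1.0,) * len(vec)
        zeros = (0.0,) * len(vec)
        if euclidean(vec, ones) <= margin:
            match_seeds.append(vec)
        elif euclidean(vec, zeros) <= margin:
            non_match_seeds.append(vec)
        else:
            rest.append(vec)
    return match_seeds, non_match_seeds, rest


def classify_nearest(vec, match_seeds, non_match_seeds):
    """Step 2: label a vector as a match if its nearest seed is a match seed."""
    d_match = min(euclidean(vec, s) for s in match_seeds)
    d_non_match = min(euclidean(vec, s) for s in non_match_seeds)
    return d_match < d_non_match


# Hypothetical similarity vectors for six compared record pairs (three fields).
vectors = [
    (1.0, 1.0, 0.9), (0.95, 1.0, 1.0),   # near total agreement
    (0.0, 0.1, 0.0), (0.1, 0.0, 0.05),   # near total disagreement
    (0.8, 0.6, 0.9), (0.2, 0.4, 0.1),    # ambiguous, to be classified
]
match_seeds, non_match_seeds, rest = select_seeds(vectors)
labels = {vec: classify_nearest(vec, match_seeds, non_match_seeds) for vec in rest}
```

In the SVM-based variation, the same seed sets would instead serve as the initial training data for the classifier, with further examples added over successive iterations.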
REFERENCES
[1] R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In ACM KDD'03 workshop on Data Cleaning, Record Linkage and Object Consolidation, pages 25-27, Washington DC, 2003.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. Manual, Department of Computer Science, National Taiwan University, 2001. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] P. Christen. Probabilistic data generation for deduplication and data linkage. In IDEAL'05, Springer LNCS 3578, pages 109-116, Brisbane, 2005.
[7] P. Christen. Automatic training example selection for scalable unsupervised record linkage. In PAKDD'08, Springer LNAI 5012, pages 511-518, Osaka, 2008.
[9] P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In F. Guillet and H. Hamilton, editors, Quality Measures in Data Mining, volume 43 of Studies in Computational Intelligence. Springer, 2007.
[10] T. Churches, P. Christen, K. Lim, and J. X. Zhu. Preparation of name and address data for record linkage using hidden Markov models. BioMed Central Medical Informatics and Decision Making, 2(9), 2002.
[11] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string distance metrics for name-matching tasks. In IJCAI'03 workshop on Information Integration on the Web (IIWeb-03), pages 73-78, Acapulco, 2003.
[15] I. Fellegi and A. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[17] L. Gu and R. Baxter. Decision models for record linkage. In Selected Papers from AusDM, Springer LNCS 3755, pages 146-160, 2006.
[18] J. Jonas and J. Harper. Effective counterterrorism and the limited role of predictive data mining. Policy Analysis, (584), 2006.
[19] U. Y. Nahm, M. Bilenko, and R. J. Mooney. Two approaches to handling noisy variation in text mining. In TextML'02, pages 18-27, Sydney, 2002.
[20] J. S. Sanchez, J. M. Sotoca, and F. Pla. Efficient nearest neighbor classification with data reduction and fast search algorithms. In IEEE International Conference on Systems, Man and Cybernetics, volume 5, pages 4757-4762, 2004.
[23] W. E. Winkler. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Technical Report RR2000/05, US Bureau of the Census, 2000.