A strategy for allowing meaningful and comparable scores in approximate matching
Source
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: Record linkage and approximate matching (DB)
Pages 303-312
Year of Publication: 2007
ISBN: 978-1-59593-803-9
Authors
Carina F. Dorneles, UFRGS, Porto Alegre, Brazil
Carlos A. Heuser, UFRGS, Porto Alegre, Brazil
Viviane Moreira Orengo, UFRGS, Porto Alegre, Brazil
Altigran S. da Silva, UFAM, Manaus, Brazil
Edleno S. de Moura, UFAM, Manaus, Brazil
ABSTRACT
The goal of approximate data matching is to assess whether two distinct data instances represent the same real-world object. This is usually achieved with a similarity function, which returns a score that expresses how similar two data instances are. If this score exceeds a given threshold, both data instances are considered to represent the same real-world object. The score values returned by a similarity function depend on the algorithm that implements the function and have no meaning to the user (apart from the fact that a higher similarity value means that two data instances are more similar). In this paper, we propose that, instead of defining the threshold in terms of the scores returned by a similarity function, the user specifies the precision that is expected from the matching process. Precision is a well-known quality measure and has a clear interpretation from the user's point of view. Our approach relies on a mapping between similarity scores and precision values that is derived from a training data set. Experimental results show that the training may be executed against a representative data set and reused for other databases from the same domain.
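The mapping from similarity scores to precision values described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a training set of (score, is_match) pairs that were labeled by hand, and the function names `precision_by_threshold` and `threshold_for_precision` are hypothetical. For each candidate threshold it computes the precision observed on the training pairs, then translates a user-specified precision into a score threshold.

```python
from typing import List, Optional, Tuple


def precision_by_threshold(
    scored_pairs: List[Tuple[float, bool]],
) -> List[Tuple[float, float]]:
    """Map each distinct training score s to the precision obtained when
    every pair with score >= s is accepted as a match.

    scored_pairs: (similarity score, True if the pair is a real match).
    This labeled training set is an assumption of the sketch.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    mapping = []
    true_matches = 0
    for i, (score, is_match) in enumerate(ranked, start=1):
        true_matches += is_match
        # Emit one entry per distinct score, after all tied pairs are counted.
        if i == len(ranked) or ranked[i][0] != score:
            mapping.append((score, true_matches / i))
    return mapping


def threshold_for_precision(
    mapping: List[Tuple[float, float]], target: float
) -> Optional[float]:
    """Return the lowest score threshold whose training precision meets
    the target, or None if no threshold reaches it."""
    candidates = [score for score, precision in mapping if precision >= target]
    return min(candidates) if candidates else None


# Hypothetical training data: (score, is_match)
training = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, False), (0.50, False),
]
mapping = precision_by_threshold(training)
print(threshold_for_precision(mapping, 0.75))  # 0.7
```

Choosing the lowest qualifying threshold is a design choice of this sketch: precision is not necessarily monotonic in the threshold, and the lowest threshold that still meets the target accepts the most pairs. Because the user states a precision rather than a raw score, the same request can be applied to different similarity functions, each with its own learned mapping.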