A strategy for allowing meaningful and comparable scores in approximate matching
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: Record linkage and approximate matching (DB)
Pages 303-312  
Year of Publication: 2007
ISBN: 978-1-59593-803-9
Authors
Carina F. Dorneles  UFRGS, Porto Alegre, Brazil
Carlos A. Heuser  UFRGS, Porto Alegre, Brazil
Viviane Moreira Orengo  UFRGS, Porto Alegre, Brazil
Altigran S. da Silva  UFAM, Manaus, Brazil
Edleno S. de Moura  UFAM, Manaus, Brazil
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1321440.1321484

ABSTRACT

The goal of approximate data matching is to assess whether two distinct data instances represent the same real-world object. This is usually done with a similarity function, which returns a score indicating how similar two data instances are. If the score exceeds a given threshold, the two instances are considered to represent the same real-world object. The scores returned by a similarity function depend on the algorithm that implements it and have no meaning to the user, apart from the fact that a higher score means the two instances are more similar. In this paper, we propose that, instead of defining the threshold in terms of the scores returned by a similarity function, the user specify the precision expected from the matching process. Precision is a well-known quality measure with a clear interpretation from the user's point of view. Our approach relies on a mapping between similarity scores and precision values, built from a training data set. Experimental results show that the training may be executed against a representative data set and reused for other databases from the same domain.
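
The idea of mapping similarity scores to precision values can be illustrated with a minimal sketch. It is not the procedure from the paper, whose details are not given in the abstract. It assumes a training set of (score, is_match) pairs that have already been labeled. The function names `precision_table` and `threshold_for_precision` and the sample data are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple

ScoredPair = Tuple[float, bool]  # (similarity score, True if a real match)


def precision_table(training: List[ScoredPair]) -> List[Tuple[float, float]]:
    """Return (threshold, precision) rows, one per distinct score.

    Precision at threshold t is the fraction of true matches among
    the training pairs whose score is >= t.
    """
    ranked = sorted(training, key=lambda pair: pair[0], reverse=True)
    table = []
    true_matches = 0
    for i, (score, is_match) in enumerate(ranked, start=1):
        true_matches += is_match
        # Emit a row only after the last pair in a run of equal scores,
        # so every pair with this score is counted.
        if i == len(ranked) or ranked[i][0] != score:
            table.append((score, true_matches / i))
    return table


def threshold_for_precision(
    table: List[Tuple[float, float]], target: float
) -> Optional[float]:
    """Lowest score threshold whose training precision meets the target."""
    candidates = [t for t, p in table if p >= target]
    return min(candidates) if candidates else None


# Made-up labeled pairs, not data from the paper.
training = [
    (0.95, True), (0.90, True), (0.80, True),
    (0.70, False), (0.60, True), (0.40, False),
]
table = precision_table(training)
threshold = threshold_for_precision(table, target=0.9)
```

With a table like this, a user can ask for "at least 90% precision" and receive a score threshold, without having to interpret raw similarity scores. The sketch picks the lowest qualifying threshold so that as many pairs as possible are retained. Because precision need not decrease monotonically as the threshold drops, a stricter selection rule may be preferable in practice.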



