Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
DEMONSTRATION SESSION: Demonstrations
Pages 1065-1068  
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Author
Peter Christen, The Australian National University, Canberra, Australia
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 30, Downloads (12 Months): 233, Citation Count: 0
DOI: http://doi.acm.org/10.1145/1401890.1402020

ABSTRACT

Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, because data from multiple sources often needs to be matched in order to enrich it or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented only in research proof-of-concept systems or hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them in a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Febrl can therefore be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct its work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.
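
To make the three tasks named in the abstract concrete, the following is a minimal sketch of a deduplication pass: cleaning, blocking, field comparison and threshold classification. It does not use Febrl or reproduce its API. The records, field names, function names and the 0.8 threshold are all assumptions made for illustration, and the string comparison comes from Python's standard `difflib` module rather than from any technique described in the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": 1, "given": "Peter", "surname": "Christen", "city": "Canberra"},
    {"id": 2, "given": "Pete ", "surname": "christen", "city": "Canberra"},
    {"id": 3, "given": "Mary", "surname": "Smith", "city": "Sydney"},
    {"id": 4, "given": "Marie", "surname": "Smyth", "city": "Sydney"},
]


def clean(value):
    """Data cleaning step: strip surrounding whitespace and lowercase."""
    return value.strip().lower()


def blocking_key(record):
    """Blocking step: only records that share this key are compared."""
    return clean(record["city"])


def similarity(a, b):
    """Comparison step: mean string similarity over the name fields."""
    fields = ("given", "surname")
    scores = [
        SequenceMatcher(None, clean(a[f]), clean(b[f])).ratio() for f in fields
    ]
    return sum(scores) / len(scores)


def deduplicate(records, threshold=0.8):
    """Classification step: pairs scoring at or above threshold are matches."""
    blocks = {}
    for rec in records:
        blocks.setdefault(blocking_key(rec), []).append(rec)

    matches = []
    for block in blocks.values():
        # Compare every pair of records within the same block.
        for a, b in combinations(block, 2):
            score = similarity(a, b)
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 2)))
    return matches


matches = deduplicate(records)
```

With the assumed threshold of 0.8, only records 1 and 2 are paired. Lowering it to 0.7 also pairs records 3 and 4, which shows how sensitive the classification step is to that one parameter. Linking two separate databases would follow the same outline, comparing pairs drawn from both sources within each block rather than pairs within one source.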

