ACM Home Page
Please provide us with feedback. Feedback
A cost-based model and effective heuristic for repairing constraints by value modification
Full text PdfPdf (566 KB)
Source International Conference on Management of Data archive
Proceedings of the 2005 ACM SIGMOD international conference on Management of data table of contents
Baltimore, Maryland
SESSION: Research papers: data cleaning and mapping table of contents
Pages: 143 - 154  
Year of Publication: 2005
ISBN:1-59593-060-4
Authors
Philip Bohannon  Lucent Technologies-Bell Laboratories
Wenfei Fan  Univ. of Edinburgh & Bell Labs
Michael Flaster  Lucent Technologies-Bell Laboratories
Rajeev Rastogi  Lucent Technologies-Bell Laboratories
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 81,   Citation Count: 15
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1066157.1066175
What is a DOI?

ABSTRACT

Data integrated from multiple sources may contain inconsistencies that violate integrity constraints. The constraint repair problem attempts to find "low cost" changes that, when applied, will cause the constraints to be satisfied. While in most previous work repair cost is stated in terms of tuple insertions and deletions, we follow recent work to define a database repair as a set of value modifications. In this context, we introduce a novel cost framework that allows for the application of techniques from record-linkage to the search for good repairs. We prove that finding minimal-cost repairs in this model is NP-complete in the size of the database, and introduce an approach to heuristic repair-construction based on equivalence classes of attribute values. Following this approach, we define two greedy algorithms. While these simple algorithms take time cubic in the size of the database, we develop optimizations inspired by algorithms for duplicate-record detection that greatly improve scalability. We evaluate our framework and algorithms on synthetic and real data, and show that our proposed optimizations greatly improve performance at little or no cost in repair quality.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
L. Bertossi and J. Chomicki, Query Answering in Inconsistent Databases. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases, Springer, 2003.
 
3
L. Bravo and L. Bertossi. Logic programs for consistently querying data integration systems. In IJCAI, 2003.
4
5
 
6
A. Cali, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI, 2003.
 
7
J. Chomicki and J. Marcinkowski. On the Computational Complexity of Minimal-Change Integrity Maintenance in Relational Databases. In L. Bertossi, A. Hunter, and T. Schaub, editors, Integrity Tolerance. Springer, 2004.
 
8
 
9
W. Cohen, P. Ravikumar, and S. Feinberg. A comparison of string-distance metrics for name-matching tasks. In IIWeb, 2003.
 
10
M. G. Elfeky. V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE, 2002.
 
11
12
 
13
 
14
M. Gertz and U. Lipeck. A diagnostic approach to repairing constraint violations in databases. In DX, 1995.
 
15
 
16
 
17
International Standard ISO/IEC 9075-2:2003(E). Information technology: Database languages, SQL Part 2 (Foundation, 2nd edition), 2003.
 
18
M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa florida. Journal of the American Statistical Association, 89:414--420, 1989.
 
19
Lavastorm. Making the Case for Automated Revenue Assurance Solutions. http://www.lavastormtech.com.
 
20
Lucent Technologies. Revenue Assurance Products. http://www.lucent.com/solutions/revenue.html.
 
21
A. Monge. Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4), 2000.
 
22
A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In DKMD, 1997.
 
23
Network Resource Management: Inventory Takes Stage. Rhk research, July 2002. RHK Market Report.
 
24
 
25
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
 
26
 
27
Transaction Processing Performance Council. TPC-H Benchmark. http://www.tpc.org.
 
28
 
29
W. Winkler. Advanced methods for record linkage. Technical report. Statistical Research Division, U.S. Bureau of the Census., 1994.

CITED BY  15
Collaborative Colleagues:
Philip Bohannon: colleagues
Wenfei Fan: colleagues
Michael Flaster: colleagues
Rajeev Rastogi: colleagues