|
ABSTRACT
Data integrated from multiple sources may contain inconsistencies that violate integrity constraints. The constraint repair problem attempts to find "low cost" changes that, when applied, will cause the constraints to be satisfied. While in most previous work repair cost is stated in terms of tuple insertions and deletions, we follow recent work to define a database repair as a set of value modifications. In this context, we introduce a novel cost framework that allows for the application of techniques from record-linkage to the search for good repairs. We prove that finding minimal-cost repairs in this model is NP-complete in the size of the database, and introduce an approach to heuristic repair-construction based on equivalence classes of attribute values. Following this approach, we define two greedy algorithms. While these simple algorithms take time cubic in the size of the database, we develop optimizations inspired by algorithms for duplicate-record detection that greatly improve scalability. We evaluate our framework and algorithms on synthetic and real data, and show that our proposed optimizations greatly improve performance at little or no cost in repair quality.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Marcelo Arenas , Leopoldo Bertossi , Jan Chomicki, Consistent query answers in inconsistent databases, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.68-79, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303983]
|
| |
2
|
L. Bertossi and J. Chomicki, Query Answering in Inconsistent Databases. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases, Springer, 2003.
|
| |
3
|
L. Bravo and L. Bertossi. Logic programs for consistently querying data integration systems. In IJCAI, 2003.
|
 |
4
|
|
 |
5
|
|
| |
6
|
A. Cali, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI, 2003.
|
| |
7
|
J. Chomicki and J. Marcinkowski. On the Computational Complexity of Minimal-Change Integrity Maintenance in Relational Databases. In L. Bertossi, A. Hunter, and T. Schaub, editors, Integrity Tolerance. Springer, 2004.
|
| |
8
|
|
| |
9
|
W. Cohen, P. Ravikumar, and S. Feinberg. A comparison of string-distance metrics for name-matching tasks. In IIWeb, 2003.
|
| |
10
|
M. G. Elfeky. V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE, 2002.
|
| |
11
|
|
 |
12
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon, AJAX: an extensible data cleaning tool, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.590, May 15-18, 2000, Dallas, Texas, United States
|
| |
13
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
|
| |
14
|
M. Gertz and U. Lipeck. A diagnostic approach to repairing constraint violations in databases. In DX, 1995.
|
| |
15
|
|
| |
16
|
|
| |
17
|
International Standard ISO/IEC 9075-2:2003(E). Information technology: Database languages, SQL Part 2 (Foundation, 2nd edition), 2003.
|
| |
18
|
M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa florida. Journal of the American Statistical Association, 89:414--420, 1989.
|
| |
19
|
Lavastorm. Making the Case for Automated Revenue Assurance Solutions. http://www.lavastormtech.com.
|
| |
20
|
Lucent Technologies. Revenue Assurance Products. http://www.lucent.com/solutions/revenue.html.
|
| |
21
|
A. Monge. Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4), 2000.
|
| |
22
|
A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In DKMD, 1997.
|
| |
23
|
Network Resource Management: Inventory Takes Stage. Rhk research, July 2002. RHK Market Report.
|
| |
24
|
|
| |
25
|
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
|
| |
26
|
|
| |
27
|
Transaction Processing Performance Council. TPC-H Benchmark. http://www.tpc.org.
|
| |
28
|
|
| |
29
|
W. Winkler. Advanced methods for record linkage. Technical report. Statistical Research Division, U.S. Bureau of the Census., 1994.
|
CITED BY 15
|
|
|
|
|
|
|
|
|
|
|
Gao Cong , Wenfei Fan , Floris Geerts , Xibei Jia , Shuai Ma, Improving data quality: consistency and accuracy, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|