|
ABSTRACT
We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of transitional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong's axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
Armstrong, W. W. 1974. Dependency structures of data base relationships. In Proceedings of the IFIP World Computer Congress. 580--583.
|
| |
4
|
|
 |
5
|
|
 |
6
|
|
| |
7
|
Bertossi, L. and Chomicki, J. 2003. Query answering in inconsistent databases. In Logics for Emerging Applications of Databases. 43--83.
|
 |
8
|
|
| |
9
|
Bohannon, P., Fan, W., Geerts, F., Jia, X., and Kementsietsidis, A. 2007. Conditional functional dependencies for data cleaning. In Proceedings of the International Conference on Data Engineering (ICDE). 746--755.
|
| |
10
|
|
| |
11
|
Bravo, L. and Bertossi, L. 2003. Logic programs for consistently querying data integration systems. In Proceedings of the International Joint Conference on Artificial Intelligence. 10--15.
|
| |
12
|
Bravo, L., Fan, W., Geerts, F., and Ma, S. 2008. Increasing the expressivity of conditional functional dependencies without extra complexity. In Proceedings of the International Conference on Data Engineering (ICDE).
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
Cali, A., Lembo, D., and Rosati, R. 2003b. Query rewriting and answering under constraints in data integration systems. In Proceedings of the International Joint Conference on Artificial Intelligence. 16--21.
|
| |
17
|
|
| |
18
|
Chomicki, J. and Marcinkowski, J. 2005b. On the computational complexity of minimal-change integrity maintenance in relational databases. In Inconsistency Tolerance. 119--150.
|
| |
19
|
Codd, E. F. 1972. Relational completeness of data base sublanguages. In Database Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 65--98.
|
| |
20
|
Gao Cong , Wenfei Fan , Floris Geerts , Xibei Jia , Shuai Ma, Improving data quality: consistency and accuracy, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
| |
21
|
Eckerson, W. W. 2002. Data quality and the bottom line: Achieving business success through a commitment to high quality data. Tech. rep., The Data Warehousing Institute. http://www.tdwi.org/research/display.aspx?ID=6064.
|
| |
22
|
Fellegi, I. and Holt, D. 1976. A systematic approach to automatic edit and imputation. J. Amer. Statist. Assn. 71, 353, 17--35.
|
| |
23
|
Fellegi, I. P. and Sunter, A. B. 1969. A theory for record linkage. J. Amer. Statist. Assn. 64, 328, 1183--1210.
|
| |
24
|
|
 |
25
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon, AJAX: an extensible data cleaning tool, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.590, May 15-18, 2000, Dallas, Texas, United States
|
| |
26
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
|
| |
27
|
|
| |
28
|
|
| |
29
|
Gertz, M. and Lipeck, U. 1995. A diagnostic approach to repairing constraint violations in databases. In Proceedings of the International Workshop on Principles of Diagnosis (DX). 65--72.
|
| |
30
|
|
| |
31
|
|
| |
32
|
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
 |
36
|
|
 |
37
|
|
| |
38
|
|
| |
39
|
Maletic, J. I. and Marcus, A. 2000. Data cleansing: Beyond integrity analysis. In Proceedings of the Conference on Information Quality (IQ). 200--209.
|
| |
40
|
Monge, A. E. 2000. Matching algorithms within a duplicate detection system. IEEE Data Eng. Bull. 23, 4, 14--20.
|
| |
41
|
Papadimitriou, C. H. 1994. Computational Complexity. Addison Wesley.
|
| |
42
|
Rahm, E. and Do, H. H. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4, 3--13.
|
| |
43
|
|
| |
44
|
|
 |
45
|
|
| |
46
|
Shilakes, C. C. and Tylman, J. 1998. Enterprise information portals. Tech. rep., Merrill Lynch, Inc., New York, NY.
|
| |
47
|
|
 |
48
|
|
| |
49
|
Winkler, W. E. 1994. Advanced methods for record linkage. Tech. rep., Statistical Research Division, U.S. Bureau of the Census.
|
| |
50
|
Winkler, W. E. 1997. Set-covering and editing discrete data. In Proceedings of the American Statistical Association. Section on Survey Research Methods. 564--569.
|
| |
51
|
|
CITED BY 8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Lukasz Golab , Korn Flip , Andrew McGregor , Divesh Srivastava , Xi Zhang, Estimating the confidence of conditional functional dependencies, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
Lukasz Golab , Theodore Johnson , J. Spencer Seidel , Vladislav Shkapenyuk, Stream warehousing with DataDepot, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|