|
ABSTRACT
Real-world data sets often contain errors and inconsistency. Even though this is a very important problem it has received relatively little attention in the research community. In this paper we examine the problem of learning missing values when some summary information is available. We use linear algebra and constraint programming techniques to learn the missing values using apriori-known summary information and that derived from the raw data. We reconstruct the missing values by different methods in three scenarios: ideal-constrained, under-constrained, and over-constrained. Furthermore, for a range query involving missing values, we also give the lower bound and upper bound for the values using constraint programming techniques. We believe that theory of linear algebra and constraint programming constitutes a sound basis for learning missing values when summary information is available.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
|
| |
2
|
M. Berry. Large scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13-49, 1992.
|
| |
3
|
M. Berry and A. Sameh. An overview of parallel algorithms for the singular value and dense symmetric eigenvalue problems. Journal of Computational and Applied Mathematics, 27:191-213, 1989.
|
| |
4
|
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and regression trees. Wadsworth, Belmont, CA, 1984.
|
| |
5
|
N. Chakravarti. Some results concerning post-infeasibility analysis. European Journal of Operation Research, 73:139-143, 1994.
|
| |
6
|
|
| |
7
|
J. W. Chinneck and E. W. Dravnieks. Locating minimal infeasible constraint sets in linear programs. ORSA Journal on Computing, 3(2):157-168, 1991.
|
| |
8
|
J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470-1480, 1972.
|
| |
9
|
|
| |
10
|
J. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-141, 1991.
|
| |
11
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
|
| |
12
|
A. Golan, G. Judge, and D. Miller. Maximum entropy econometrics: robust estimation with limited data. John Wiley, New York, 1996.
|
| |
13
|
C. Golub and C. V. Loan. Matrix computations. Johns Hopkins University Press, 1989.
|
| |
14
|
G. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM Journal of Numerical Analysis, 2(3):205-224, 1965.
|
| |
15
|
I. Good. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Annals of Mathematical Statistics, 34:911-934, 1963.
|
| |
16
|
Jim Gray , Adam Bosworth , Andrew Layman , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total, Proceedings of the Twelfth International Conference on Data Engineering, p.152-159, February 26-March 01, 1996
|
| |
17
|
|
| |
18
|
|
 |
19
|
|
| |
20
|
|
 |
21
|
|
| |
22
|
|
| |
23
|
J. Kullback. Information theory and statistics. John Wiley, New York, 1959.
|
 |
24
|
Mong Li Lee , Tok Wang Ling , Wai Lup Low, IntelliClean: a knowledge-based intelligent data cleaner, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.290-294, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347154]
|
| |
25
|
|
| |
26
|
R. Lynch. An assessment of the RAS method for updating input-output Tables. In I. Sohn, editor, Readings in Input-Output Analysis, pages 271-284. Oxford University Press, 1986.
|
| |
27
|
R. Myers. Classical and modern regression with applications. Duxbury, Boston, MA, 1986.
|
| |
28
|
R. Neal. Probabilistic inference using markov chain monte carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.
|
| |
29
|
|
| |
30
|
A. L. Prodromidis and S. Stolfo. Mining databases with different schemas: integrating incompatible classifiers. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.
|
| |
31
|
|
| |
32
|
|
| |
33
|
|
| |
34
|
J. L. Schafer. Analysis of incomplete multivariate data. Chapman and Hall, London, 1997.
|
| |
35
|
|
| |
36
|
|
| |
37
|
C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.
|
| |
38
|
M. Wagner, J. Meller, and R. Elber. Large-scale linear programming techniques for he design of protein folding potentials. Technical Report 2002-02, Old Dominion University, 2002.
|
| |
39
|
|
CITED BY 4
|
|
Doug Burdick , Prasad M. Deshpande , T. S. Jayram , Raghu Ramakrishnan , Shivakumar Vaithyanathan, OLAP over uncertain and imprecise data, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
|
|
|
|
|
|
|
|
|
|
|