ACM Home Page
Please provide us with feedback. Feedback
Learning missing values from summary constraints
Full text PdfPdf (451 KB)
Source ACM SIGKDD Explorations Newsletter archive
Volume 4 ,  Issue 1  (June 2002) table of contents
COLUMN: Constraints in data mining table of contents
Pages: 21 - 30  
Year of Publication: 2002
ISSN:1931-0145
Authors
Xintao Wu  UNC at Charlotte, Charlotte, NC
Daniel Barbará  George Mason University, Fairfax, VA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 37,   Citation Count: 4
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/568574.568579
What is a DOI?

ABSTRACT

Real-world data sets often contain errors and inconsistency. Even though this is a very important problem it has received relatively little attention in the research community. In this paper we examine the problem of learning missing values when some summary information is available. We use linear algebra and constraint programming techniques to learn the missing values using apriori-known summary information and that derived from the raw data. We reconstruct the missing values by different methods in three scenarios: ideal-constrained, under-constrained, and over-constrained. Furthermore, for a range query involving missing values, we also give the lower bound and upper bound for the values using constraint programming techniques. We believe that theory of linear algebra and constraint programming constitutes a sound basis for learning missing values when summary information is available.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
 
2
M. Berry. Large scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13-49, 1992.
 
3
M. Berry and A. Sameh. An overview of parallel algorithms for the singular value and dense symmetric eigenvalue problems. Journal of Computational and Applied Mathematics, 27:191-213, 1989.
 
4
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and regression trees. Wadsworth, Belmont, CA, 1984.
 
5
N. Chakravarti. Some results concerning post-infeasibility analysis. European Journal of Operation Research, 73:139-143, 1994.
 
6
 
7
J. W. Chinneck and E. W. Dravnieks. Locating minimal infeasible constraint sets in linear programs. ORSA Journal on Computing, 3(2):157-168, 1991.
 
8
J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470-1480, 1972.
 
9
 
10
J. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-141, 1991.
 
11
 
12
A. Golan, G. Judge, and D. Miller. Maximum entropy econometrics: robust estimation with limited data. John Wiley, New York, 1996.
 
13
C. Golub and C. V. Loan. Matrix computations. Johns Hopkins University Press, 1989.
 
14
G. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM Journal of Numerical Analysis, 2(3):205-224, 1965.
 
15
I. Good. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Annals of Mathematical Statistics, 34:911-934, 1963.
 
16
 
17
 
18
19
 
20
21
 
22
 
23
J. Kullback. Information theory and statistics. John Wiley, New York, 1959.
24
 
25
 
26
R. Lynch. An assessment of the RAS method for updating input-output Tables. In I. Sohn, editor, Readings in Input-Output Analysis, pages 271-284. Oxford University Press, 1986.
 
27
R. Myers. Classical and modern regression with applications. Duxbury, Boston, MA, 1986.
 
28
R. Neal. Probabilistic inference using markov chain monte carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.
 
29
 
30
A. L. Prodromidis and S. Stolfo. Mining databases with different schemas: integrating incompatible classifiers. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.
 
31
 
32
 
33
 
34
J. L. Schafer. Analysis of incomplete multivariate data. Chapman and Hall, London, 1997.
 
35
 
36
 
37
C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.
 
38
M. Wagner, J. Meller, and R. Elber. Large-scale linear programming techniques for he design of protein folding potentials. Technical Report 2002-02, Old Dominion University, 2002.
 
39


Collaborative Colleagues:
Xintao Wu: colleagues
Daniel Barbará: colleagues