ACM Home Page
Please provide us with feedback. Feedback
Independence is good: dependency-based histogram synopses for high-dimensional data
Full text PdfPdf (237 KB)
Source International Conference on Management of Data archive
Proceedings of the 2001 ACM SIGMOD international conference on Management of data table of contents
Santa Barbara, California, United States
Pages: 199 - 210  
Year of Publication: 2001
ISBN:1-58113-332-4
Also published in ...
Authors
Amol Deshpande  Univ of California, Berkeley and Bell Laboratories
Minos Garofalakis  Bell Laboratories
Rajeev Rastogi  Bell Laboratories
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 53,   Citation Count: 43
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/375663.375685
What is a DOI?

ABSTRACT

Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Y. M.M. Bishop, S. E. Fienberg, and P. W. Holland. "Discrete Multivariate Analysis". The MIT Press, 1975.
 
2
J. R.S. Blair and B. Peyton. "An Introduction to Chordal Graphs and Clique Trees". In "Graph Theory and Sparse Matrix Computation". Springer-Verlag NY, Inc., 1993.
 
3
 
4
R. Christensen. "Log-Linear Models and Logistic Regression". Springer-Verlag NY, Inc. (Springer Series in Statistics), 1997.
 
5
 
6
J.N. Darroch, S.L. Lauritzen, and T.P. Speed. "Markov Fields and Log-Linear Interaction Models for Contigency Tables". The Annals of Statistics, 8(3):522-539, 1980.
 
7
A. Deshpande, M. Garofalakis, and R. Rastogi. "Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data". Bell Labs Tech. Report, 2001.
 
8
D. Edwards. Personal Communication, September 2000.
 
9
 
10
B. Fox. "Discrete Optimization Via Marginal Analysis". Management Science, 13(3):211-216, 1966.
11
 
12
13
 
14
Finn V. Jensen and Frank Jensen. "Optimal Junction Trees". In Proc. of the 10th Annual Conf. on Uncertainty in AI, 1994.
 
15
Francesco M. Malvestuto. "Approximating Discrete Probability Distributions with Decomposable Models". IEEE Trans. on Systems, Man, and Cybernetics, 21(5):1287-1294, 1991.
 
16
 
17
 
18
19
 
20
J. Whittaker. "Graphical Models in Applied Multivariate Statistics". John Wiley & Sons, Inc., 1990.

CITED BY  43

Collaborative Colleagues:
Amol Deshpande: colleagues
Minos Garofalakis: colleagues
Rajeev Rastogi: colleagues