|
ABSTRACT
Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Y. M.M. Bishop, S. E. Fienberg, and P. W. Holland. "Discrete Multivariate Analysis". The MIT Press, 1975.
|
| |
2
|
J. R.S. Blair and B. Peyton. "An Introduction to Chordal Graphs and Clique Trees". In "Graph Theory and Sparse Matrix Computation". Springer-Verlag NY, Inc., 1993.
|
| |
3
|
|
| |
4
|
R. Christensen. "Log-Linear Models and Logistic Regression". Springer-Verlag NY, Inc. (Springer Series in Statistics), 1997.
|
| |
5
|
|
| |
6
|
J.N. Darroch, S.L. Lauritzen, and T.P. Speed. "Markov Fields and Log-Linear Interaction Models for Contigency Tables". The Annals of Statistics, 8(3):522-539, 1980.
|
| |
7
|
A. Deshpande, M. Garofalakis, and R. Rastogi. "Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data". Bell Labs Tech. Report, 2001.
|
| |
8
|
D. Edwards. Personal Communication, September 2000.
|
| |
9
|
|
| |
10
|
B. Fox. "Discrete Optimization Via Marginal Analysis". Management Science, 13(3):211-216, 1966.
|
 |
11
|
Dimitrios Gunopulos , George Kollios , Vassilis J. Tsotras , Carlotta Domeniconi, Approximating multi-dimensional aggregate range queries over real attributes, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.463-474, May 15-18, 2000, Dallas, Texas, United States
|
| |
12
|
|
 |
13
|
|
| |
14
|
Finn V. Jensen and Frank Jensen. "Optimal Junction Trees". In Proc. of the 10th Annual Conf. on Uncertainty in AI, 1994.
|
| |
15
|
Francesco M. Malvestuto. "Approximating Discrete Probability Distributions with Decomposable Models". IEEE Trans. on Systems, Man, and Cybernetics, 21(5):1287-1294, 1991.
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
 |
19
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
20
|
J. Whittaker. "Graphical Models in Applied Multivariate Statistics". John Wiley & Sons, Inc., 1990.
|
CITED BY 43
|
|
|
|
|
|
|
|
|
|
|
V. Markl , N. Megiddo , M. Kutsch , T. M. Tran , P. Haas , U. Srivastava, Consistently estimating the selectivity of conjuncts of predicates, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
|
|
|
|
|
|
Ihab F. Ilyas , Volker Markl , Peter Haas , Paul Brown , Ashraf Aboulnaga, CORDS: automatic discovery of correlations and soft functional dependencies, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Amol Deshpande , Carlos Guestrin , Samuel R. Madden , Joseph M. Hellerstein , Wei Hong, Model-driven data acquisition in sensor networks, Proceedings of the Thirtieth international conference on Very large data bases, p.588-599, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
V. Markl , P. J. Haas , M. Kutsch , N. Megiddo , U. Srivastava , T. M. Tran, Consistent selectivity estimation via maximum entropy, The VLDB Journal — The International Journal on Very Large Data Bases, v.16 n.1, p.55-76, January 2007
|
|
|
|
|
|
|
|
|
Feng Yan , Wen-Chi Hou , Zhewei Jiang , Cheng Luo , Qiang Zhu, Selectivity estimation of range queries based on data density approximation via cosine series, Data & Knowledge Engineering, v.63 n.3, p.855-878, December, 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|