ABSTRACT
In this paper, we present a comparison of nonparametric estimation methods for approximating the selectivities of queries, in particular range queries. In contrast to previous studies, our comparison focuses on metric attributes with large domains, which occur, for example, in spatial and temporal databases. We also assume that only small sample sets of the required relations are available for estimating the selectivity. In addition to the popular histogram estimators, our comparison includes so-called kernel estimation methods. Although these methods have been proven to be among the most accurate estimators known in statistics, they have not so far been considered for selectivity estimation of database queries. We first show how to generate kernel estimators that deliver accurate approximate selectivities of queries. Thereafter, we reveal that two parameters, the number of samples and the so-called smoothing parameter, are important for the accuracy of both kernel estimators and histogram estimators. For histogram estimators, the smoothing parameter determines the number of bins (histogram classes). We present the optimal smoothing parameter as a function of the number of samples and show how to compute approximations of the optimal parameter. Moreover, we propose a new selectivity estimator that can be viewed as a hybrid of histogram and kernel estimators. Experimental results show the performance of the different estimators in practice. We found in our experiments that kernel estimators are most efficient for continuously distributed data sets, whereas for our real data sets the hybrid technique is most promising.
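To make the two estimator families concrete, the following minimal sketch estimates the selectivity of a range query [a, b] from a small sample. It is an illustration under stated assumptions, not the construction used in the paper: the Gaussian kernel, the normal reference rule for the bandwidth h, the equi-width binning, and all function names are choices made here for the example. In both cases the smoothing parameter (h, or the number of bins) controls how strongly the sample is smoothed.

```python
import math
import statistics


def gaussian_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_reference_bandwidth(sample):
    """Bandwidth h = 1.06 * sigma * n^(-1/5) (normal reference rule).

    Assumption for this sketch only; the paper derives its own
    approximation of the optimal smoothing parameter.
    """
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-1.0 / 5.0)


def kernel_selectivity(sample, a, b, h):
    """Estimated fraction of tuples in [a, b] with a Gaussian kernel.

    Each sample point contributes the probability mass that its
    kernel (centred at the point, scale h) places inside [a, b].
    """
    mass = sum(gaussian_cdf((b - x) / h) - gaussian_cdf((a - x) / h)
               for x in sample)
    return mass / len(sample)


def histogram_selectivity(sample, a, b, bins):
    """Estimated fraction of tuples in [a, b] with an equi-width histogram.

    Values are assumed to be uniformly distributed within each bin.
    """
    lo, hi = min(sample), max(sample)
    if hi == lo:
        raise ValueError("sample needs at least two distinct values")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # Clamp so that the maximum value falls into the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    estimate = 0.0
    for i, count in enumerate(counts):
        left = lo + i * width
        right = left + width
        overlap = max(0.0, min(b, right) - max(a, left))
        estimate += count * overlap / width
    return estimate / len(sample)
```

A call such as `kernel_selectivity(sample, a, b, normal_reference_bandwidth(sample))` returns a value between 0 and 1 that can be multiplied by the relation size to obtain an estimated result cardinality. Note that this plain Gaussian kernel places some mass outside the attribute domain near its boundaries, which the sketch does not correct.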