|
ABSTRACT
We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
|
| |
4
|
DEMOLOMBE, R. Estimation of the number of tuples satisfying a query expressed in predicate calculus language. In Proceedings of the 6th International Conference on Very Large Data Bases, 1980, pp. 55-63.
|
| |
5
|
FELLER, W. An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd ed., Wiley, New York, 1968.
|
| |
6
|
|
| |
7
|
JOHNSON, N. L., AND KOTZ, S. Urn Models and Their Application. Wiley, New York, 1977.
|
| |
8
|
MOOD, A. M., GRAYBILL, F. A., AND BOES, D.C. Introduction to the Theory of Statistics. 3rd ed., McGraw-Hill, New York, 1974.
|
| |
9
|
|
| |
10
|
|
 |
11
|
P. Griffiths Selinger , M. M. Astrahan , D. D. Chamberlin , R. A. Lorie , T. G. Price, Access path selection in a relational database management system, Proceedings of the 1979 ACM SIGMOD international conference on Management of data, May 30-June 01, 1979, Boston, Massachusetts
[doi> 10.1145/582095.582099]
|
| |
12
|
SHERMAN, B. A random variable related to the spacing of sample values. Ann. Math. Statistics 21 (1950), 339-361.
|
| |
13
|
|
 |
14
|
|
| |
15
|
WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Separability--An approach to physical database design. IEEE Trans. Comput. C-33, 3 (Mar. 1984), 209-222.
|
| |
16
|
|
| |
17
|
|
| |
18
|
YOUSSEFI, K., AND WONG, E. Query processing in a relational database management system. In Proceedings of the 5th International Conference on Very Large Data Bases, 1979, pp. 409-417.
|
 |
19
|
|
CITED BY 39
|
|
|
|
|
Wan-Sup Cho , Wook-Shin Han , Ki-Hyung Hong , Kyu-Young Whang, Estimating nested selectivity in object-oriented databases, Proceedings of the ninth international conference on Information and knowledge management, p.94-101, November 06-11, 2000, McLean, Virginia, United States
|
|
|
|
|
|
|
|
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
|
|
|
Moses Charikar , Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Towards estimation error guarantees for distinct values, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.268-279, May 15-18, 2000, Dallas, Texas, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Min Cai , Jianping Pan , Yu-Kwong Kwok , Kai Hwang, Fast and accurate traffic matrix measurement using adaptive cardinality counting, Proceeding of the 2005 ACM SIGCOMM workshop on Mining network data, August 26-26, 2005, Philadelphia, Pennsylvania, USA
|
|
|
Pere Barlet-Ros , Gianluca Iannaccone , Josep Sanjuàs-Cuxart , Diego Amores-López , Josep Solé-Pareta, Load shedding in network monitoring applications, 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, p.1-14, June 17-22, 2007, Santa Clara, CA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sumeet Singh , Cristian Estan , George Varghese , Stefan Savage, Automated worm fingerprinting, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation, p.4-4, December 06-08, 2004, San Francisco, CA
|
|
|
Kevin Beyer , Peter J. Haas , Berthold Reinwald , Yannis Sismanis , Rainer Gemulla, On synopses for distinct-value estimation under multiset operations, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
|
|
|
|
|
|
Young-Koo Lee , Kyu-Young Whang , Yang-Sae Moon , Il-Yeol Song, A one-pass aggregation algorithm with the optimal buffer size in multidimensional OLAP, Proceedings of the 28th international conference on Very Large Data Bases, p.790-801, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sunil Chakkappen , Thierry Cruanes , Benoit Dageville , Linan Jiang , Uri Shaft , Hong Su , Mohamed Zait, Efficient and scalable statistics gathering for large databases in Oracle 11g, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Tongqing Qiu , Jian Ni , Hao Wang , Nan Hua , Y. Richard Yang , Jun Jim Xu, Packet doppler: network monitoring using packet shift detection, Proceedings of the 2008 ACM CoNEXT Conference, p.1-12, December 09-12, 2008, Madrid, Spain
|
|
|
|
|