ACM Home Page
Please provide us with feedback. Feedback
A linear-time probabilistic counting algorithm for database applications
Full text PdfPdf (1.52 MB)
Source ACM Transactions on Database Systems (TODS) archive
Volume 15 ,  Issue 2  (June 1990) table of contents
Pages: 208 - 229  
Year of Publication: 1990
ISSN:0362-5915
Authors
Kyu-Young Whang  Korea Advanced Institute of Science and Technology, Seoul, Korea
Brad T. Vander-Zanden  Cornell Univ., Ithaca, NY
Howard M. Taylor  Univ. of Delaware, Newark
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 58,   Citation Count: 39
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/78922.78925
What is a DOI?

ABSTRACT

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
 
4
DEMOLOMBE, R. Estimation of the number of tuples satisfying a query expressed in predicate calculus language. In Proceedings of the 6th International Conference on Very Large Data Bases, 1980, pp. 55-63.
 
5
FELLER, W. An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd ed., Wiley, New York, 1968.
 
6
 
7
JOHNSON, N. L., AND KOTZ, S. Urn Models and Their Application. Wiley, New York, 1977.
 
8
MOOD, A. M., GRAYBILL, F. A., AND BOES, D.C. Introduction to the Theory of Statistics. 3rd ed., McGraw-Hill, New York, 1974.
 
9
 
10
11
 
12
SHERMAN, B. A random variable related to the spacing of sample values. Ann. Math. Statistics 21 (1950), 339-361.
 
13
14
 
15
WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Separability--An approach to physical database design. IEEE Trans. Comput. C-33, 3 (Mar. 1984), 209-222.
 
16
 
17
 
18
YOUSSEFI, K., AND WONG, E. Query processing in a relational database management system. In Proceedings of the 5th International Conference on Very Large Data Bases, 1979, pp. 409-417.
19

CITED BY  39

Collaborative Colleagues:
Kyu-Young Whang: colleagues
Brad T. Vander-Zanden: colleagues
Howard M. Taylor: colleagues