|
ABSTRACT
Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primal importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
2
|
|
| |
3
|
|
| |
4
|
Ziv Bar-Yossef , Ravi Kumar , D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.623-632, January 06-08, 2002, San Francisco, California
|
 |
5
|
|
| |
6
|
T. Bohman, C. Cooper, and A. Frieze. Min-Wise Independent Linear Permutations. Electronic Journal of Combinatorics, 7:R26, 2000.
|
 |
7
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
 |
8
|
Moses Charikar , Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Towards estimation error guarantees for distinct values, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.268-279, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335230]
|
| |
9
|
|
| |
10
|
Graham Cormode , Mayur Datar , Piotr Indyk , S. Muthukrishnan, Comparing data streams using Hamming norms (how to zero in), Proceedings of the 28th international conference on Very Large Data Bases, p.335-345, August 20-23, 2002, Hong Kong, China
|
 |
11
|
|
| |
12
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
13
|
M. Durand and P. Flajolet. Loglog Counting of Large Cardinalities (extended abstract). In Proceedings of the 11th ESA European Symposium on Algorithms, pages 605--617, 2003.
|
 |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
É. Fusy and F. Giroire. Estimating the Number of Active Flows in a Data Stream over a Sliding Window. In Proceedings of the 4th SIAM Workshop on Analytic Algorithmics and Combinatorics, 2007.
|
| |
18
|
S. Ganguly. Counting Distinct Items over Update Streams. In Proceedings of the 16th ISAAC International Symposium on Algorithms and Computation, pages 505--514, 2005.
|
 |
19
|
|
| |
20
|
|
 |
21
|
|
 |
22
|
|
| |
23
|
F. Giroire. Order Statistics and Estimating Cardinalities of Massive Datasets. In Proceedings of the 6th DMTCS Discrete Mathematics and Theoretical Computer Science, pages 157--166, 2005.
|
| |
24
|
Jim Gray , Surajit Chaudhuri , Adam Bosworth , Andrew Layman , Don Reichart , Murali Venkatrao , Frank Pellow , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Data Mining and Knowledge Discovery, v.1 n.1, p.29-53, 1997
[doi> 10.1023/A:1009726021843]
|
| |
25
|
|
| |
26
|
P. Haas and L. Stokes. Estimating the Number of Classes in a Finite Population. Journal of the American Statistical Association, 93:1475--1487, 1998.
|
| |
27
|
Jiawei Han , Yixin Chen , Guozhu Dong , Jian Pei , Benjamin W. Wah , Jianyong Wang , Y. Dora Cai, Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams, Distributed and Parallel Databases, v.18 n.2, p.173-197, September 2005
[doi> 10.1007/s10619-005-3296-1]
|
 |
28
|
|
 |
29
|
Wen-Chi Hou , Gultekin Ozsoyoglu , Erdogan Dogdu, Error-constrained COUNT query evaluation in relational databases, Proceedings of the 1991 ACM SIGMOD international conference on Management of data, p.278-287, May 29-31, 1991, Denver, Colorado, United States
|
 |
30
|
|
 |
31
|
|
| |
32
|
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
F. Olken and D. Rotem. Random Sampling from Databases: A Survey. Journal of Statistics and Computing, 5:25--42, 1995.
|
| |
39
|
G. Özsoyoglu, K. Du, A. Tjahjana, W.-C. Hou, and D. Y. Rowland. On Estimating COUNT, SUM, and AVERAGE Relational Algebra Queries. In Proceedings of the 2nd DEXA International Conference on Database and Expert Systems Applications, pages 406--412, 1991.
|
 |
40
|
|
 |
41
|
|
 |
42
|
|
| |
43
|
|
|