|
ABSTRACT
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with hardly any loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
2
|
Arasu, A., Babu, S., and Widom, J. 2003a. CQL: A language for continuous queries over streams and relations. In Proceedings of the 9th DBPL International Conference on Data Base and Programming Languages. 1--11.
|
| |
3
|
Arasu, A., Babu, S., and Widom, J. 2003b. The CQL Continuous Query Language: Semantic foundations and query execution. Tech. rep. 2002-67. Stanford University, Stanford, CA.
|
 |
4
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
 |
5
|
|
 |
6
|
|
| |
7
|
|
| |
8
|
Bose, P., Kranakis, E., Morin, P., and Tang, Y. 2003. Bounds for frequency estimation of packet streams. In Proceedings of the 10th SIROCCO International Colloquium on Structural Information and Communication Complexity. 33--42.
|
| |
9
|
Boyer, R. and Moore, J. 1981. A fast majority vote algorithm. Tech. rep. 1981-32. Institute for Computing Science, University of Texas, Austin, Austin, TX.
|
| |
10
|
|
 |
11
|
Jianjun Chen , David J. DeWitt , Feng Tian , Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.379-390, May 15-18, 2000, Dallas, Texas, United States
|
| |
12
|
Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. 2003. Finding hierarchical heavy hitters in data streams. In Proceedings of the 29th VLDB International Conference on Very Large Data Bases. 464--475.
|
 |
13
|
|
 |
14
|
|
 |
15
|
Corinna Cortes , Kathleen Fisher , Daryl Pregibon , Anne Rogers, Hancock: a language for extracting signatures from data streams, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.9-17, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347094]
|
| |
16
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
17
|
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
Fischer, M. and Salzberg, S. 1982. Finding a majority among N votes: Solution to problem 81-5. J. Algorith. 3, 376--379.
|
| |
22
|
|
 |
23
|
Johannes Gehrke , Flip Korn , Divesh Srivastava, On computing correlated aggregates over continual data streams, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.13-24, May 21-24, 2001, Santa Barbara, California, United States
|
 |
24
|
|
| |
25
|
|
 |
26
|
Lukasz Golab , David DeHaan , Erik D. Demaine , Alejandro Lopez-Ortiz , J. Ian Munro, Identifying frequent items in sliding windows over on-line packet streams, Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, October 27-29, 2003, Miami Beach, FL, USA
[doi> 10.1145/948205.948227]
|
 |
27
|
|
 |
28
|
|
| |
29
|
|
 |
30
|
|
 |
31
|
|
 |
32
|
Pankaj Gupta , Nick McKeown, Packet classification on multiple fields, Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, p.147-160, August 30-September 03, 1999, Cambridge, Massachusetts, United States
|
| |
33
|
|
 |
34
|
|
 |
35
|
Cheqing Jin , Weining Qian , Chaofeng Sha , Jeffrey X. Yu , Aoying Zhou, Dynamically maintaining frequent items over a data stream, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
[doi> 10.1145/956863.956918]
|
 |
36
|
|
| |
37
|
|
| |
38
|
|
| |
39
|
Manku, G. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th VLDB International Conference on Very Large Data Bases. 346--357.
|
 |
40
|
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.251-262, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
41
|
|
 |
42
|
|
| |
43
|
Metwally, A., Agrawal, D., and El Abbadi, A. 2005b. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th ICDT International Conference on Database Theory. 398--412. An extended version appears as a University of California, Santa Barbara, Department of Computer Science, Technical Report 2005-23.
|
| |
44
|
Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Programm. 2, 143--152.
|
 |
45
|
|
| |
46
|
Zhu, Y. and Shasha, D. 2002. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th VLDB International Conference on Very Large Data Bases. 358--369.
|
| |
47
|
Zipf, G. 1949. Human Behavior and The Principle of Least Effort. Addison-Wesley, Reading, MA.
|
|