ACM Home Page
Please provide us with feedback. Feedback
An integrated efficient solution for computing frequent and top-k elements in data streams
Full text PdfPdf (661 KB)
Source ACM Transactions on Database Systems (TODS) archive
Volume 31 ,  Issue 3  (September 2006) table of contents
Pages: 1095 - 1133  
Year of Publication: 2006
ISSN:0362-5915
Authors
Ahmed Metwally  University of California, Santa Barbara, Santa Barbara, CA
Divyakant Agrawal  University of California, Santa Barbara, Santa Barbara, CA
Amr El Abbadi  University of California, Santa Barbara, Santa Barbara, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 20,   Downloads (12 Months): 183,   Citation Count: 7
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1166074.1166084
What is a DOI?

ABSTRACT

We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with hardly any loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Arasu, A., Babu, S., and Widom, J. 2003a. CQL: A language for continuous queries over streams and relations. In Proceedings of the 9th DBPL International Conference on Data Base and Programming Languages. 1--11.
 
3
Arasu, A., Babu, S., and Widom, J. 2003b. The CQL Continuous Query Language: Semantic foundations and query execution. Tech. rep. 2002-67. Stanford University, Stanford, CA.
4
5
6
 
7
 
8
Bose, P., Kranakis, E., Morin, P., and Tang, Y. 2003. Bounds for frequency estimation of packet streams. In Proceedings of the 10th SIROCCO International Colloquium on Structural Information and Communication Complexity. 33--42.
 
9
Boyer, R. and Moore, J. 1981. A fast majority vote algorithm. Tech. rep. 1981-32. Institute for Computing Science, University of Texas, Austin, Austin, TX.
 
10
11
 
12
Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. 2003. Finding hierarchical heavy hitters in data streams. In Proceedings of the 29th VLDB International Conference on Very Large Data Bases. 464--475.
13
14
15
 
16
 
17
18
 
19
 
20
 
21
Fischer, M. and Salzberg, S. 1982. Finding a majority among N votes: Solution to problem 81-5. J. Algorith. 3, 376--379.
 
22
23
24
 
25
26
27
28
 
29
30
31
32
 
33
34
35
36
 
37
 
38
 
39
Manku, G. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th VLDB International Conference on Very Large Data Bases. 346--357.
40
 
41
42
 
43
Metwally, A., Agrawal, D., and El Abbadi, A. 2005b. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th ICDT International Conference on Database Theory. 398--412. An extended version appears as a University of California, Santa Barbara, Department of Computer Science, Technical Report 2005-23.
 
44
Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Programm. 2, 143--152.
45
 
46
Zhu, Y. and Shasha, D. 2002. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th VLDB International Conference on Very Large Data Bases. 358--369.
 
47
Zipf, G. 1949. Human Behavior and The Principle of Least Effort. Addison-Wesley, Reading, MA.

CITED BY  7

Collaborative Colleagues:
Ahmed Metwally: colleagues
Divyakant Agrawal: colleagues
Amr El Abbadi: colleagues