ACM Home Page
Please provide us with feedback. Feedback
Finding frequent items in data streams
Full text PdfPdf (498 KB)
Source
Proceedings of the VLDB Endowment archive
Volume 1 ,  Issue 2  (August 2008) table of contents
SESSION: EXPERIMENTS AND ANALYSES table of contents
Pages 1530-1541  
Year of Publication: 2008
ISSN:2150-8097
Authors
Graham Cormode  AT&T Labs-Research, Florham Park, NJ
Marios Hadjieleftheriou  AT&T Labs-Research, Florham Park, NJ
Publisher
Bibliometrics
Downloads (6 Weeks): 38,   Downloads (12 Months): 242,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1454159.1454225
What is a DOI?

ABSTRACT

The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented.

In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
3
 
4
S. Bhattacharrya, A. Madeira, S. Muthukrishnan, and T. Ye. How to scalably skip past streams. In Scalable Stream Processing Systems (SSPS) Workshop with ICDE 2007, 2007.
5
 
6
A. Blum, P. Gibbons, D. Song, and S. Venkataraman. New streaming algorithms for fast detection of superspreaders. Technical Report IRP-TR-04-23, Intel Research, 2004.
 
7
P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for frequency estimation of packet streams. In SIROCCO, 2003.
 
8
B. Boyer and J. Moore. A fast majority vote algorithm. Technical Report ICSCA-CMP-32, Institute for Computer Science, University of Texas, Feb. 1981.
 
9
 
10
 
11
 
12
G. Cormode and M. Hadjieleftheriou. Finding Frequent Items in Data Streams: Source Code. http://www.research.att.com/~marioh/frequent-items.html.
13
14
 
15
G. Cormode, F. Korn, and S. Tirthapura. Exponentially decayed aggregates on data streams. In IEEE International Conference on Data Engineering, 2008.
 
16
G. Cormode and S. Muthukrishnan. MassDAL Public Code Bank. http://www.cs.rutgers.edu/~muthu/massdal-code-index.html.
 
17
G. Cormode and S. Muthukrishnan. What's new: Finding significant differences in network data streams. In Proceedings of IEEE Infocom, 2004.
 
18
 
19
 
20
21
 
22
M. Fischer and S. Salzburg. Finding a majority among n votes: Solution to problem 81-5. Journal of Algorithms, 3(4):376--379, 1982.
 
23
24
 
25
J. Hershberger, N. Shrivastava, S. Suri, and C. Toth. Adaptive spatial partitioning for multidimensional data streams. In ISAAC, 2004.
26
27
 
28
G. Kollios, J. Byers, J. Considine, M. Hadjieleftheriou, and F. Li. Robust aggregation in sensor networks. IEEE Data Engineering Bulletin, 28(1), Mar. 2005.
29
 
30
 
31
 
32
A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, 2005.
33
 
34
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:143--152, 1982.
 
35
 
36
37
 
38
 
39


Collaborative Colleagues:
Graham Cormode: colleagues
Marios Hadjieleftheriou: colleagues