|
ABSTRACT
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
 |
2
|
|
 |
3
|
|
| |
4
|
S. Bhattacharrya, A. Madeira, S. Muthukrishnan, and T. Ye. How to scalably skip past streams. In Scalable Stream Processing Systems (SSPS) Workshop with ICDE 2007, 2007.
|
 |
5
|
Lakshminath Bhuvanagiri , Sumit Ganguly , Deepanjan Kesh , Chandan Saha, Simpler algorithm for estimating frequency moments of data streams, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.708-713, January 22-26, 2006, Miami, Florida
[doi> 10.1145/1109557.1109634]
|
| |
6
|
A. Blum, P. Gibbons, D. Song, and S. Venkataraman. New streaming algorithms for fast detection of superspreaders. Technical Report IRP-TR-04-23, Intel Research, 2004.
|
| |
7
|
P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for frequency estimation of packet streams. In SIROCCO, 2003.
|
| |
8
|
B. Boyer and J. Moore. A fast majority vote algorithm. Technical Report ICSCA-CMP-32, Institute for Computer Science, University of Texas, Feb. 1981.
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
G. Cormode and M. Hadjieleftheriou. Finding Frequent Items in Data Streams: Source Code. http://www.research.att.com/~marioh/frequent-items.html.
|
 |
13
|
Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007575]
|
 |
14
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Space- and time-efficient deterministic algorithms for biased quantiles over data streams, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 26-28, 2006, Chicago, IL, USA
[doi> 10.1145/1142351.1142389]
|
| |
15
|
G. Cormode, F. Korn, and S. Tirthapura. Exponentially decayed aggregates on data streams. In IEEE International Conference on Data Engineering, 2008.
|
| |
16
|
G. Cormode and S. Muthukrishnan. MassDAL Public Code Bank. http://www.cs.rutgers.edu/~muthu/massdal-code-index.html.
|
| |
17
|
G. Cormode and S. Muthukrishnan. What's new: Finding significant differences in network data streams. In Proceedings of IEEE Infocom, 2004.
|
| |
18
|
|
| |
19
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
20
|
|
 |
21
|
|
| |
22
|
M. Fischer and S. Salzburg. Finding a majority among n votes: Solution to problem 81-5. Journal of Algorithms, 3(4):376--379, 1982.
|
| |
23
|
Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, How to summarize the universe: dynamic maintenance of quantiles, Proceedings of the 28th international conference on Very Large Data Bases, p.454-465, August 20-23, 2002, Hong Kong, China
|
 |
24
|
|
| |
25
|
J. Hershberger, N. Shrivastava, S. Suri, and C. Toth. Adaptive spatial partitioning for multidimensional data streams. In ISAAC, 2004.
|
 |
26
|
|
 |
27
|
|
| |
28
|
G. Kollios, J. Byers, J. Considine, M. Hadjieleftheriou, and F. Li. Robust aggregation in sensor networks. IEEE Data Engineering Bulletin, 28(1), Mar. 2005.
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
| |
32
|
A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, 2005.
|
 |
33
|
|
| |
34
|
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:143--152, 1982.
|
| |
35
|
|
| |
36
|
Robert Schweller , Zhichun Li , Yan Chen , Yan Gao , Ashish Gupta , Yin Zhang , Peter A. Dinda , Ming-Yang Kao , Gokhan Memik, Reversible sketches: enabling monitoring and analysis over high-speed data streams, IEEE/ACM Transactions on Networking (TON), v.15 n.5, p.1059-1072, October 2007
[doi> 10.1109/TNET.2007.896150]
|
 |
37
|
Nisheeth Shrivastava , Chiranjeeb Buragohain , Divyakant Agrawal , Subhash Suri, Medians and beyond: new aggregation techniques for sensor networks, Proceedings of the 2nd international conference on Embedded networked sensor systems, November 03-05, 2004, Baltimore, MD, USA
[doi> 10.1145/1031495.1031524]
|
| |
38
|
|
| |
39
|
|
CITED BY 3
|
|
Radu Berinde , Graham Cormode , Piotr Indyk , Martin J. Strauss, Space-optimal heavy hitters with strong error bounds, Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 29-July 01, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|