|
ABSTRACT
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications.We present a new algorithm for dynamically determining the hot items at any time in the relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of "group testing", is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data shows that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Alfred V. Aho , John E. Hopcroft , Jeffrey Ullman , J. D. Ullman , J. E. Hopcroft, Data Structures and Algorithms, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1983
|
 |
2
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303978]
|
 |
3
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
 |
4
|
|
| |
5
|
D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusions using bayes estimators. In Proceedings of the first SIAM International Conference on Data Mining, 2001.
|
| |
6
|
B. Boyer and J. Moore. A fast majority vote algorithm. Technical Report 35, Institute for Computer Science, University of Texas, 1982.
|
| |
7
|
J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143--154, 1979.
|
| |
8
|
|
| |
9
|
|
| |
10
|
D.-Z. Du and F. Hwang. Combinatorial Group Testing and Its Applications, volume 3 of Series on Applied Mathematics. World Scientific, 1993.
|
| |
11
|
|
| |
12
|
M. Fischer and S. Salzberg. Finding a majority among n votes: Solution to problem 81--5. Journal of Algorithms, 3(4):376--379, 1982.
|
 |
13
|
|
| |
14
|
P. Gibbons and Y. Matias. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, A, 1999.
|
| |
15
|
|
 |
16
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
| |
17
|
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. QuickSAND: Quick summary and analysis of network data. Technical Report 2001--43, DIMACS, 2001.
|
| |
18
|
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of 28th International Conference on Very Large Data Bases, 2002.
|
 |
19
|
|
 |
20
|
|
 |
21
|
|
| |
22
|
W. Kautz and R. Singleton. Nonrandom binary superimposed codes. IEEE Transactions on on Information Theory, 10:363--377, 1964.
|
| |
23
|
G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of 28th International Conference on Very Large Data Bases, pages 346--357, 2002.
|
| |
24
|
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:143--152, 1982.
|
| |
25
|
|
| |
26
|
|
CITED BY 60
|
|
Lukasz Golab , David DeHaan , Erik D. Demaine , Alejandro Lopez-Ortiz , J. Ian Munro, Identifying frequent items in sliding windows over on-line packet streams, Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, October 27-29, 2003, Miami Beach, FL, USA
|
|
|
Cheqing Jin , Weining Qian , Chaofeng Sha , Jeffrey X. Yu , Aoying Zhou, Dynamically maintaining frequent items over a data stream, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
|
|
|
|
|
|
Mohamed Medhat Gaber , Shonali Krishnaswamy , Arkady Zaslavsky, Cost-efficient mining techniques for data streams, Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation, p.109-114, January 01, 2004, Dunedin, New Zealand
|
|
|
|
|
|
Balachander Krishnamurthy , Subhabrata Sen , Yin Zhang , Yan Chen, Sketch-based change detection: methods, evaluation, and applications, Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, October 27-29, 2003, Miami Beach, FL, USA
|
|
|
Yin Zhang , Sumeet Singh , Subhabrata Sen , Nick Duffield , Carsten Lund, Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications, Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, October 25-27, 2004, Taormina, Sicily, Italy
|
|
|
|
|
|
Jonathan Beaver , Nicholas Morsillo , Kirk Pruhs , Panos K. Chrysanthis , Vincenzo Liberatore, Scalable dissemination: what's hot and what's not, Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, June 17-18, 2004, Paris, France
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Qi (George) Zhao , Mitsunori Ogihara , Haixun Wang , Jun (Jim) Xu, Finding global icebergs over distributed data sets, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 26-28, 2006, Chicago, IL, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jeffery Xu Yu , Zhihong Chong , Hongjun Lu , Aoying Zhou, False positive or false negative: mining frequent itemsets from high speed transactional data streams, Proceedings of the Thirtieth international conference on Very large data bases, p.204-215, August 31-September 03, 2004, Toronto, Canada
|
|
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Finding hierarchical heavy hitters in data streams, Proceedings of the 29th international conference on Very large data bases, p.464-475, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|