ACM Home Page
Please provide us with feedback. Feedback
What's hot and what's not: tracking most frequent items dynamically
Full text PdfPdf (237 KB)
Source Symposium on Principles of Database Systems archive
Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
San Diego, California
Pages: 296 - 306  
Year of Publication: 2003
ISBN:1-58113-670-6
Authors
Graham Cormode  Rutgers University, Piscataway NJ
S. Muthukrishnan  Rutgers University, Piscataway NJ
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGMOD: ACM Special Interest Group on Management of Data
SIGART: ACM Special Interest Group on Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 57,   Citation Count: 60
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/773153.773182
What is a DOI?

ABSTRACT

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications.We present a new algorithm for dynamically determining the hot items at any time in the relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of "group testing", is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data shows that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
4
 
5
D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusions using bayes estimators. In Proceedings of the first SIAM International Conference on Data Mining, 2001.
 
6
B. Boyer and J. Moore. A fast majority vote algorithm. Technical Report 35, Institute for Computer Science, University of Texas, 1982.
 
7
J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143--154, 1979.
 
8
 
9
 
10
D.-Z. Du and F. Hwang. Combinatorial Group Testing and Its Applications, volume 3 of Series on Applied Mathematics. World Scientific, 1993.
 
11
 
12
M. Fischer and S. Salzberg. Finding a majority among n votes: Solution to problem 81--5. Journal of Algorithms, 3(4):376--379, 1982.
13
 
14
P. Gibbons and Y. Matias. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, A, 1999.
 
15
16
 
17
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. QuickSAND: Quick summary and analysis of network data. Technical Report 2001--43, DIMACS, 2001.
 
18
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of 28th International Conference on Very Large Data Bases, 2002.
19
20
21
 
22
W. Kautz and R. Singleton. Nonrandom binary superimposed codes. IEEE Transactions on on Information Theory, 10:363--377, 1964.
 
23
G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of 28th International Conference on Very Large Data Bases, pages 346--357, 2002.
 
24
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:143--152, 1982.
 
25
 
26

CITED BY  60

Collaborative Colleagues:
Graham Cormode: colleagues
S. Muthukrishnan: colleagues