|
ABSTRACT
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications.We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing.” They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Alfred V. Aho , John E. Hopcroft , Jeffrey Ullman , J. D. Ullman , J. E. Hopcroft, Data Structures and Algorithms, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1983
|
 |
2
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303978]
|
| |
3
|
|
 |
4
|
|
| |
5
|
Barbara, D., Wu, N., and Jajodia, S. 2001. Detecting novel network intrusions using Bayes estimators. In Proceedings of the First SIAM International Conference on Data Mining.
|
| |
6
|
Boyer, B. and Moore, J. 1982. A fast majority vote algorithm. Tech. Rep. 35. Institute for Computer Science, University of Texas, at Austin, Austin, TX.
|
| |
7
|
Carter, J. L. and Wegman, M. N. 1979. Universal classes of hash functions. J. Comput. Syst. Sci. 18, 2, 143--154.
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
Cormode, G. and Muthukrishnan, S. 2004b. What's new: Finding significant differences in network data streams. In Proceedings of IEEE Infocom.
|
| |
12
|
|
| |
13
|
Du, D.-Z. and Hwang, F. 1993. Combinatorial Group Testing and Its Applications. Series on Applied Mathematics, vol. 3. World Scientific, Singapore.
|
 |
14
|
|
| |
15
|
|
| |
16
|
Fischer, M. and Salzberg, S. 1982. Finding a majority among n votes: Solution to problem 81-5. J. Algorith. 3, 4, 376--379.
|
 |
17
|
|
 |
18
|
|
| |
19
|
Gibbons, P. and Matias, Y. 1999. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science A.
|
| |
20
|
|
 |
21
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
| |
22
|
Gilbert, A., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2001. QuickSAND: Quick summary and analysis of network data. DIMACS Tech. Rep. 2001--43, Available online at http://dimacs. crutgers.edu/Techniclts.
|
| |
23
|
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases. 454--465.
|
 |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
|
| |
28
|
Manku, G. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases. 346--357.
|
| |
29
|
Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Programm. 2, 143--152.
|
| |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
|
CITED BY 11
|
|
|
|
|
|
|
|
Bo Sheng , Chiu Chiang Tan , Qun Li , Weizhen Mao, Finding popular categories for RFID tags, Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing, May 26-30, 2008, Hong Kong, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|