| Finding hierarchical heavy hitters in streaming data |
| Full text |
Pdf
(886 KB)
|
Source
|
ACM Transactions on Knowledge Discovery from Data (TKDD)
archive
Volume 1 , Issue 4 (January 2008)
table of contents
Article No. 2
Year of Publication: 2008
ISSN:1556-4681
|
|
Authors
|
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 25, Downloads (12 Months): 224, Citation Count: 0
|
|
|
ABSTRACT
Data items that arrive online as streams typically have attributes which take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for summarization, visualization, and analysis. We develop an aggregate view based on certain organized sets of large-valued regions (“heavy hitters”) corresponding to hierarchically discounted frequency counts. We formally define the notion of hierarchical heavy hitters (HHHs). We first consider computing (approximate) HHHs over a data stream drawn from a single hierarchical attribute. We formalize the problem and give deterministic algorithms to find them in a single pass over the input. In order to analyze a wider range of realistic data streams (e.g., from IP traffic-monitoring applications), we generalize this problem to multiple dimensions. Here, the semantics of HHHs are more complex, since a “child” node can have multiple “parent” nodes. We present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. The product of hierarchical dimensions forms a mathematical lattice structure. Our algorithms exploit this structure, and so are able to track approximate HHHs using only a small, fixed number of statistics per stored item, regardless of the number of dimensions. We show experimentally, using real data, that our proposed algorithms yields outputs which are very similar (virtually identical, in many cases) to offline computations of the exact solutions, whereas straightforward heavy-hitters-based approaches give significantly inferior answer quality. Furthermore, the proposed algorithms result in an order of magnitude savings in data structure size while performing competitively.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Sameet Agarwal , Rakesh Agrawal , Prasad Deshpande , Ashish Gupta , Jeffrey F. Naughton , Raghu Ramakrishnan , Sunita Sarawagi, On the Computation of Multidimensional Aggregates, Proceedings of the 22th International Conference on Very Large Data Bases, p.506-521, September 03-06, 1996
|
 |
2
|
|
| |
3
|
|
 |
4
|
Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007575]
|
| |
5
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Finding hierarchical heavy hitters in data streams, Proceedings of the 29th international conference on Very large data bases, p.464-475, September 09-12, 2003, Berlin, Germany
|
 |
6
|
|
 |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
 |
11
|
Cristian Estan , Stefan Savage , George Varghese, Automatically inferring patterns of resource consumption in network traffic, Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, August 25-29, 2003, Karlsruhe, Germany
[doi> 10.1145/863955.863972]
|
 |
12
|
|
| |
13
|
Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, How to summarize the universe: dynamic maintenance of quantiles, Proceedings of the 28th international conference on Very Large Data Bases, p.454-465, August 20-23, 2002, Hong Kong, China
|
 |
14
|
|
 |
15
|
|
 |
16
|
|
| |
17
|
Laks V. S. Lakshmanan , Raymond T. Ng , Christine Xing Wang , Xiaodong Zhou , Theodore J. Johnson, The generalized MDL approach for summarization, Proceedings of the 28th international conference on Very Large Data Bases, p.766-777, August 20-23, 2002, Hong Kong, China
|
 |
18
|
Ju-Hong Lee , Deok-Hwan Kim , Chin-Wan Chung, Multi-dimensional selectivity estimation using compressed histogram information, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.205-214, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
19
|
|
| |
20
|
|
| |
21
|
Metwally, A., Agrawal, D., and Abbadi, A. E. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory.
|
| |
22
|
Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Program. 2, 143--152.
|
 |
23
|
Raymond T. Ng , Alan Wagner , Yu Yin, Iceberg-cube computation with PC clusters, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.25-36, May 21-24, 2001, Santa Barbara, California, United States
|
 |
24
|
|
 |
25
|
Jeffrey Scott Vitter , Min Wang , Bala Iyer, Data cube approximation and histograms via wavelets, Proceedings of the seventh international conference on Information and knowledge management, p.96-104, November 02-07, 1998, Bethesda, Maryland, United States
[doi> 10.1145/288627.288645]
|
 |
26
|
Yin Zhang , Sumeet Singh , Subhabrata Sen , Nick Duffield , Carsten Lund, Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications, Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, October 25-27, 2004, Taormina, Sicily, Italy
[doi> 10.1145/1028788.1028802]
|
|