|
ABSTRACT
Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions ("heavy hitters"). Such Hierarchical Heavy Hitters (HHHs) were previously introduced as a crucial aggregation technique in one dimension. In order to analyze the wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions.We identify and study two variants of HHHs for multi-dimensional data, namely the "overlap" and "split" cases, depending on how an aggregate computed for a child node in the multi-dimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees.We show experimentally, using real and synthetic data, that our proposed online algorithms yield outputs which are very similar (virtually identical, in many cases) to their offline counterparts. The lattice property of the product of hierarchical dimensions ("diamond") is crucially exploited in our online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding hierarchical heavy hitters in data streams. In International Conference on Very Large Databases, pages 464--475, 2003.
|
 |
4
|
|
 |
5
|
Cristian Estan , Stefan Savage , George Varghese, Automatically inferring patterns of resource consumption in network traffic, Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, August 25-29, 2003, Karlsruhe, Germany
[doi> 10.1145/863955.863972]
|
 |
6
|
|
| |
7
|
L. V. S. Lakshmanan, R. T. Ng, C. X. Wang, X. Zhou, and T. Johnson. The generalized MDL approach for summarization. In Proceedings of 28th International Conference on Very Large Data Bases, pages 766--777, 2002.
|
 |
8
|
Ju-Hong Lee , Deok-Hwan Kim , Chin-Wan Chung, Multi-dimensional selectivity estimation using compressed histogram information, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.205-214, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/304182.304200]
|
| |
9
|
G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of 28th International Conference on Very Large Data Bases, pages 346--357, 2002.
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
 |
13
|
Jeffrey Scott Vitter , Min Wang , Bala Iyer, Data cube approximation and histograms via wavelets, Proceedings of the seventh international conference on Information and knowledge management, p.96-104, November 02-07, 1998, Bethesda, Maryland, United States
[doi> 10.1145/288627.288645]
|
CITED BY 17
|
|
Yin Zhang , Sumeet Singh , Subhabrata Sen , Nick Duffield , Carsten Lund, Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications, Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, October 25-27, 2004, Taormina, Sicily, Italy
|
|
|
|
|
|
Vijay Karamcheti , Davi Geiger , Zvi Kedem , S. Muthukrishnan, Detecting malicious network traffic using inverse distributions of packet contents, Proceeding of the 2005 ACM SIGCOMM workshop on Mining network data, August 26-26, 2005, Philadelphia, Pennsylvania, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ajay Anil Mahimkar , Zihui Ge , Aman Shaikh , Jia Wang , Jennifer Yates , Yin Zhang , Qi Zhao, Towards automated performance diagnosis in a large IPTV network, ACM SIGCOMM Computer Communication Review, v.39 n.4, October 2009
|
|