ACM Home Page
Please provide us with feedback. Feedback
Finding hierarchical heavy hitters in streaming data
Full text PdfPdf (886 KB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 1 ,  Issue 4  (January 2008) table of contents
Article No. 2  
Year of Publication: 2008
ISSN:1556-4681
Authors
Graham Cormode  AT&T Labs--Research, Florham Park, NJ
Flip Korn  AT&T Labs--Research, Florham Park, NJ
S. Muthukrishnan  Rutgers University, Piscataway, NJ
Divesh Srivastava  AT&T Labs--Research, Florham Park, NJ
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 24,   Downloads (12 Months): 248,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1324172.1324174
What is a DOI?

ABSTRACT

Data items that arrive online as streams typically have attributes which take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for summarization, visualization, and analysis. We develop an aggregate view based on certain organized sets of large-valued regions (“heavy hitters”) corresponding to hierarchically discounted frequency counts. We formally define the notion of hierarchical heavy hitters (HHHs). We first consider computing (approximate) HHHs over a data stream drawn from a single hierarchical attribute. We formalize the problem and give deterministic algorithms to find them in a single pass over the input.

In order to analyze a wider range of realistic data streams (e.g., from IP traffic-monitoring applications), we generalize this problem to multiple dimensions. Here, the semantics of HHHs are more complex, since a “child” node can have multiple “parent” nodes. We present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. The product of hierarchical dimensions forms a mathematical lattice structure. Our algorithms exploit this structure, and so are able to track approximate HHHs using only a small, fixed number of statistics per stored item, regardless of the number of dimensions.

We show experimentally, using real data, that our proposed algorithms yields outputs which are very similar (virtually identical, in many cases) to offline computations of the exact solutions, whereas straightforward heavy-hitters-based approaches give significantly inferior answer quality. Furthermore, the proposed algorithms result in an order of magnitude savings in data structure size while performing competitively.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
4
 
5
6
7
 
8
9
 
10
11
12
 
13
14
15
16
 
17
18
 
19
 
20
 
21
Metwally, A., Agrawal, D., and Abbadi, A. E. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory.
 
22
Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Program. 2, 143--152.
23
24
25
26

Collaborative Colleagues:
Graham Cormode: colleagues
Flip Korn: colleagues
S. Muthukrishnan: colleagues
Divesh Srivastava: colleagues