ACM Home Page
Please provide us with feedback. Feedback
Processing top k queries from samples
Full text PdfPdf (516 KB)
Source International Conference On Emerging Networking Experiments And Technologies archive
Proceedings of the 2006 ACM CoNEXT conference table of contents
Lisboa, Portugal
SESSION: Supervision table of contents
Article No. 7  
Year of Publication: 2006
ISBN:1-59593-456-1
Authors
Edith Cohen  AT&T Labs--Research, Florham Park, NJ
Nadav Grossaug  Tel Aviv University, Tel Aviv, Israel
Haim Kaplan  Tel Aviv University, Tel Aviv, Israel
Sponsors
: CISCO
: Fundacao para a Ciencia e Tecnologia
: Thomson
SIGCOMM: ACM Special Interest Group on Data Communication
: Intel
Microsoft : Microsoft
: Associacao de Turismo de Lisboa
: E-Next
: ISCTE
: Camara Municipal de Lisboa
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 25,   Citation Count: 2
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1368436.1368446
What is a DOI?

ABSTRACT

Top-k queries are desired aggregation operations on data sets. Examples of queries on network data include the top 100 source AS's, top 100 ports, or top Domain names over IP packets or over IP flow records. Since the complete dataset is often not available or not feasible to examine, we are interested in processing top-k queries from samples.

If all records can be processed, the top-k items can be obtained by counting the frequency of each item. Even when the full dataset is observed, however, resources are often insufficient for such counting and techniques were developed to overcome this issue. When we can observe only a random sample of the records, an orthogonal complication arises: The top frequencies in the sample are biased estimates of the actual top-k frequencies. This bias as depends on the distribution and must be accounted for when seeking the actual value.

We address this by designing and evaluating several schemes that derive rigorous confidence bounds for top-k estimates. Simulations on various data sets that include IP flows data, show that schemes that exploit more of the structure of the sample distribution produce much tight confidence intervals with an order of magnitude fewer samples than simpler schemes that utilize only the sampled top-k frequencies. The simpler schemes, however, are more efficient in terms of computation.

Our work is basic and is widely applicable to all applications that process top-k and heavy hitters queries over a random sample of the actual records.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
3
4
 
5
 
6
E. Cohen, H. Kaplan, Y. Mansour, and M. Sharir. Tail estimates for dependent increasing sums. Manuscript, 2006.
7
8
 
9
B. Efron and R. Tibshrani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.
 
10
11
 
12
13
14
15
16
17
18
 
19
20
 
21

Collaborative Colleagues:
Edith Cohen: colleagues
Nadav Grossaug: colleagues
Haim Kaplan: colleagues