ACM Home Page
Please provide us with feedback. Feedback
Random sampling techniques for space efficient online computation of order statistics of large datasets
Full text PdfPdf (1.50 MB)
Source International Conference on Management of Data archive
Proceedings of the 1999 ACM SIGMOD international conference on Management of data table of contents
Philadelphia, Pennsylvania, United States
Pages: 251 - 262  
Year of Publication: 1999
ISBN:1-58113-084-8
Also published in ...
Authors
Gurmeet Singh Manku  IBM Almaden Research Center
Sridhar Rajagopalan  IBM Almaden Research Center
Bruce G. Lindsay  IBM Almaden Research Center
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 71,   Citation Count: 50
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/304182.304204
What is a DOI?

ABSTRACT

In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
ARS97
 
AS95
R. Agrawal ~md A. Swami. A One-Pass Space- Efficient Algorithm for Finding Quantiles. In Proc. 7th lntl. Conf. Management of Data (COMAD-95), Pune, india, 1995.
 
BFP+73
M. Blum, R. "W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. TaJ:jan. Time Bounds for Selection. In J. Compui. Syst. Sci., volume 7, pages 448-461, t973.
CMN98
 
CT91
 
DB2
DB2 MVS..
 
DNS91
GM98
 
GM99
 
GMP97
 
Hel97
J.M. Hellerstein. Online Processing Redux. Bulletin of the IEEE Computer Society, 1997.
 
Hoe63
W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. American Statistical Association Jornal, pages 13-30, March 1963.
 
Inf
Informix..
 
MP80
J.I. Munro eald M. S. Paterson. Selection and Sorting with Limited Storage. Theoretical Computer Science, 12:315-323, 1980.
MRL98
 
Pat97
M.R. Paterson. Progress in Selection. Deptt. of Computer Science, University of \~arwick, Coventry, UK, 1997.
PIHS96
 
Poh69
I. Pohl. A Minimum Storage Algorithm for Computing the Median. Technical Report IBM Research Report RC 2701 (# 12713), IBM T J Watson Center, November 1969.
SALP79
Vit85
 
Yao74

CITED BY  50

Collaborative Colleagues:
Gurmeet Singh Manku: colleagues
Sridhar Rajagopalan: colleagues
Bruce G. Lindsay: colleagues