|
ABSTRACT
In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper.
First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length.
Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
ARS97
|
|
| |
AS95
|
R. Agrawal ~md A. Swami. A One-Pass Space- Efficient Algorithm for Finding Quantiles. In Proc. 7th lntl. Conf. Management of Data (COMAD-95), Pune, india, 1995.
|
| |
BFP+73
|
M. Blum, R. "W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. TaJ:jan. Time Bounds for Selection. In J. Compui. Syst. Sci., volume 7, pages 448-461, t973.
|
 |
CMN98
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States
|
| |
CT91
|
|
| |
DB2
|
DB2 MVS..
|
| |
DNS91
|
|
 |
GM98
|
|
| |
GM99
|
|
| |
GMP97
|
|
| |
Hel97
|
J.M. Hellerstein. Online Processing Redux. Bulletin of the IEEE Computer Society, 1997.
|
| |
Hoe63
|
W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. American Statistical Association Jornal, pages 13-30, March 1963.
|
| |
Inf
|
Informix..
|
| |
MP80
|
J.I. Munro eald M. S. Paterson. Selection and Sorting with Limited Storage. Theoretical Computer Science, 12:315-323, 1980.
|
 |
MRL98
|
|
| |
Pat97
|
M.R. Paterson. Progress in Selection. Deptt. of Computer Science, University of \~arwick, Coventry, UK, 1997.
|
 |
PIHS96
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
Poh69
|
I. Pohl. A Minimum Storage Algorithm for Computing the Median. Technical Report IBM Research Report RC 2701 (# 12713), IBM T J Watson Center, November 1969.
|
 |
SALP79
|
P. Griffiths Selinger , M. M. Astrahan , D. D. Chamberlin , R. A. Lorie , T. G. Price, Access path selection in a relational database management system, Proceedings of the 1979 ACM SIGMOD international conference on Management of data, May 30-June 01, 1979, Boston, Massachusetts
[doi> 10.1145/582095.582099]
|
 |
Vit85
|
|
| |
Yao74
|
|
CITED BY 50
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Fatemah A. Alqallaf , Kjell P. Konis , R. Douglas Martin , Ruben H. Zamar, Scalable robust covariance and correlation estimates for data mining, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Xuemin Lin , Jian Xu , Qing Zhang , Hongjun Lu , Jeffrey Xu Yu , Xiaofang Zhou , Yidong Yuan, Approximate Processing of Massive Continuous Quantile Queries over High-Speed Data Streams, IEEE Transactions on Knowledge and Data Engineering, v.18 n.5, p.683-698, May 2006
|
|
|
|
|
|
|
|
|
|
|
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Space- and time-efficient deterministic algorithms for biased quantiles over data streams, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 26-28, 2006, Chicago, IL, USA
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Finding hierarchical heavy hitters in data streams, Proceedings of the 29th international conference on Very large data bases, p.464-475, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, How to summarize the universe: dynamic maintenance of quantiles, Proceedings of the 28th international conference on Very Large Data Bases, p.454-465, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|