|
ABSTRACT
We present new algorithms for computing approximate quantiles of large datasets in a single pass. The approximation guarantees are explicit, and apply for arbitrary value distributions and arrival distributions of the dataset. The main memory requirements are smaller than those reported earlier by an order of magnitude.
We also discuss methods that couple the approximation algorithms with random sampling to further reduce memory requirements. With sampling, the approximation guarantees are explicit but probabilistic, i.e. they apply with respect to a (user controlled) confidence parameter.
We present the algorithms, their theoretical analysis and simulation results on different datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
P. Griffiths Selinger , M. M. Astrahan , D. D. Chamberlin , R. A. Lorie , T. G. Price, Access path selection in a relational database management system, Proceedings of the 1979 ACM SIGMOD international conference on Management of data, May 30-June 01, 1979, Boston, Massachusetts
[doi> 10.1145/582095.582099]
|
 |
2
|
|
 |
3
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
4
|
"DB2 MVS:", To be completed.
|
| |
5
|
"Informix", To be completed.
|
| |
6
|
|
| |
7
|
M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan, "Time Bounds for Selection", in J. Cornput. Syst. Sci., vol. 7, pp. 448-461, 1973.
|
| |
8
|
M. R. Paterson, "Progress in Selection", Deptt. of Computer Science, University of Warwick, Coventry, UK, 1997.
|
| |
9
|
D. Dor, Selection Algorithms, PhD thesis, Tel-Aviv University, 1995.
|
| |
10
|
|
| |
11
|
D. Dot and U. Zwick, "Finding the anth Largest Element", Combinatorica, vol. 16, pp. 41-58, 1996.
|
| |
12
|
D. Dor and U. Zwick, "Median Selection Requires (2 q-e)n Comparisons", Technical Report 312/96, Department of Computer Science, Tel-Aviv University, Apr. 1996.
|
| |
13
|
|
| |
14
|
I. Pohl, "A Minimum Storage Algorithm for Computing the Median", Technical Report IBM Research Report RC 2701 (# 12713), IBM T J Watson Center, Nov. 1969.
|
| |
15
|
J. I. Munro and M. S. Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, vol. 12, pp. 315-323, 1980.
|
 |
16
|
|
| |
17
|
R. Agrawal and A. Swami, "A One-Pass Space-Efficient Algorithm for Finding Quantiles", in Proc. 7th Intl. Conf. Management of Data (COMAD-95), Pune, India, 1995.
|
| |
18
|
|
| |
19
|
W. Hoeffding, "Probability Inequalities for Sum8 of Bounded Random Variables", American Statistical Association Jornal, pp. 13-30, Mar. 1963.
|
CITED BY 66
|
|
Fei Chen , Diane Lambert , José C. Pinheiro, Incremental quantile estimation for massive tracking, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.516-522, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
M. Otey , S. Parthasarathy , A. Ghoting , G. Li , S. Narravula , D. Panda, Towards NIC-based intrusion detection, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2003, Washington, D.C.
|
|
|
Xuemin Lin , Jian Xu , Qing Zhang , Hongjun Lu , Jeffrey Xu Yu , Xiaofang Zhou , Yidong Yuan, Approximate Processing of Massive Continuous Quantile Queries over High-Speed Data Streams, IEEE Transactions on Knowledge and Data Engineering, v.18 n.5, p.683-698, May 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
|
|
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Finding hierarchical heavy hitters in data streams, Proceedings of the 29th international conference on Very large data bases, p.464-475, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, How to summarize the universe: dynamic maintenance of quantiles, Proceedings of the 28th international conference on Very Large Data Bases, p.454-465, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|