|
ABSTRACT
We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
BENTLEY, J.L. Personal communication, Apr. 1983; see {11}.
|
| |
2
|
ERNVALL, J. AND NEVALAINEN, O. An algorithm for unbiased random sampling. Comput. J. 25, 1 (Jan. 1982), 45-47.
|
| |
3
|
FAN, C. T., MULLER, M. E., AND REZUCHA, I. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Am. Stat. Assoc. J. 57 (June 1962), 387-402.
|
| |
4
|
FELLER, W. An Introduction to Probability Theory and Its Applications, vol. I, 3rd ed. Wiley, 1968.
|
| |
5
|
FELLER, W. An Introduction to Probability Theory and Its Applications, vol. II, 2nd ed. Wiley, 1971.
|
 |
6
|
|
| |
7
|
KAWARASAKI, J., AND SIBUYA, M. Random numbers for simple random sampling without replacement. Keio Math. Sere. Rep. No. 7 (1982), 1-9.
|
| |
8
|
|
| |
9
|
|
 |
10
|
|
| |
11
|
VITTER, J.S. Optimum algorithms for two random sampling problems. In Proceedings of the 24th Annual IEEE Symposium on Foundations of Computer Science (Tucson, Az., Nov. 7-9), IEEE, New York, 1983 pp. 65-75.
|
CITED BY 105
|
|
|
|
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
H. Wang , S. Parthasarathy , A. Ghoting , S. Tatikonda , G. Buehrer , T. Kurc , J. Saltz, Design of a next generation sampling service for large scale data analysis applications, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A. Balmin , T. Eliaz , J. Hornibrook , L. Lim , G. M. Lohman , D. Simmen , M. Wang , C. Zhang, Cost-based optimization in DB2 XML, IBM Systems Journal, v.45 n.2, p.299-319, January 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Luciana S. Buriol , Gereon Frahling , Stefano Leonardi , Alberto Marchetti-Spaccamela , Christian Sohler, Counting triangles in data streams, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 26-28, 2006, Chicago, IL, USA
|
|
|
Deepak Agarwal , Andrew McGregor , Jeff M. Phillips , Suresh Venkatasubramanian , Zhengyuan Zhu, Spatial scan statistics: approximations and performance study, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
Zhang Longbo , Li Zhanhuai , Zhao Yiqiang , Yu Min , Zhang Yang, A priority random sampling algorithm for time-based sliding windows over weighted streaming data, Proceedings of the 2007 ACM symposium on Applied computing, March 11-15, 2007, Seoul, Korea
|
|
|
Kevin Beyer , Peter J. Haas , Berthold Reinwald , Yannis Sismanis , Rainer Gemulla, On synopses for distinct-value estimation under multiset operations, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
|
|
|
|
|
|
|
|
|
Jeffery Xu Yu , Zhihong Chong , Hongjun Lu , Aoying Zhou, False positive or false negative: mining frequent itemsets from high speed transactional data streams, Proceedings of the Thirtieth international conference on Very large data bases, p.204-215, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gao Cong , Wenfei Fan , Floris Geerts , Xibei Jia , Shuai Ma, Improving data quality: consistency and accuracy, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Parikshit Gopalan , T. S. Jayram , Robert Krauthgamer , Ravi Kumar, Estimating the sortedness of a data stream, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, p.318-327, January 07-09, 2007, New Orleans, Louisiana
|
|
|
|
|
|
Surong Wang , Manoranjan Dash , Liang-Tien Chia , Min Xu, Efficient sampling of training set in large and noisy multimedia data, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), v.3 n.3, p.14-es, August 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edith Cohen , Nick Duffield , Haim Kaplan , Carsten Lund , Mikkel Thorup, Stream sampling for variance-optimal estimation of subset sums, Proceedings of the Nineteenth Annual ACM -SIAM Symposium on Discrete Algorithms, p.1255-1264, January 04-06, 2009, New York, New York
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edward Bortnikov , Maxim Gurevich , Idit Keidar , Gabriel Kliot , Alexander Shraer, Brahms: Byzantine resilient random membership sampling, Computer Networks: The International Journal of Computer and Telecommunications Networking, v.53 n.13, p.2340-2359, August, 2009
|
|
|
|
|
|
Graham Cormode , Lukasz Golab , Korn Flip , Andrew McGregor , Divesh Srivastava , Xi Zhang, Estimating the confidence of conditional functional dependencies, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
REVIEW
"Prokop Vondracek : Reviewer"
There is presented a fast algorithm for selecting a random sample of n>
records without replacement for a pool of N> records, where the value of
N> is unknown beforehand. Algorithm Z, which does the sampling, is des
more...
|