|
ABSTRACT
The method of stable random projections is popular in data stream computations, data mining, information retrieval, and machine learning, for efficiently computing the lα (0 < α ≤ 2) distances using a small (memory) space, in one pass of the data. We propose algorithms based on (1) the geometric mean estimator, for all 0 <α ≤ 2, and (2) the harmonic mean estimator, only for small α (e.g., α < 0.344). Compared with the previous classical work [27], our main contributions include: • The general sample complexity bound for α ≠ 1,2. For α = 1, [27] provided a nice argument based on the inverse of Cauchy density about the median, leading to a sample complexity bound, although they did not provide the constants and their proof restricted ε to be "small enough." For general α ≠ 1, 2, however, the task becomes much more difficult. [27] provided the "conceptual promise" that the sample complexity bound similar to that for α = 1 should exist for general α, if a "non-uniform algorithm based on t-quantile" could be implemented. Such a conceptual algorithm was only for supporting the arguments in [27], not a real implementation. We consider this is one of the main problems left open in [27]. In this study, we propose a practical algorithm based on the geometric mean estimator and derive the sample complexity bound for all 0 < α ≤ 2. • The practical and optimal algorithm for α = 0+ The l0 norm is an important case. Stable random projections can provide an approximation to the l0 norm using α → 0+. We provide an algorithm based on the harmonic mean estimator, which is simple and statistically optimal. Its tail bounds are sharper than the bounds derived based on the geometric mean. We also discover a (possibly surprising) fact: in boolean data, stable random projections using α = 0+ with the harmonic mean estimator will be about twice as accurate as (l2) normal random projections. Because high-dimensional boolean data are common, we expect this fact will be practically quite useful. • The precise theoretical analysis and practical implications We provide the precise constants in the tail bounds for both the geometric mean and harmonic mean estimators. We also provide the variances (either exact or asymptotic) for the proposed estimators. These results can assist practitioners to choose sample sizes accurately.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
|
 |
4
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
5
|
M. S. Bartlett. Approximate confidence intervals, II. Biometrika, 40(3/4):306--317, 1953.
|
| |
6
|
|
 |
7
|
|
| |
8
|
|
 |
9
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
 |
10
|
|
| |
11
|
Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493--507, 1952.
|
| |
12
|
|
| |
13
|
Graham Cormode and S. Muthukrishnan. Estimating dominance norms of multiple data streams. In ESA, pages 148--160, 2003.
|
| |
14
|
N. Cressie. A note on the behaviour of the stable distributions for small index. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 31(1):61--64, 1975.
|
| |
15
|
|
 |
16
|
Mayur Datar , Nicole Immorlica , Piotr Indyk , Vahab S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the twentieth annual symposium on Computational geometry, June 08-11, 2004, Brooklyn, New York, USA
[doi> 10.1145/997817.997857]
|
| |
17
|
David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289--1306, 2006.
|
| |
18
|
Eugene F. Fama and Richard Roll. Some properties of symmetric stable distributions. Journal of the American Statistical Association, 63(323):817--836, 1968.
|
| |
19
|
Eugene F. Fama and Richard Roll. Parameter estimates for symmetric stable distributions. Journal of the American Statistical Association, 66(334):331--338, 1971.
|
| |
20
|
|
| |
21
|
|
| |
22
|
Sumit Ganguly and Graham Cormode. On estimating frequency moments of data streams. In APPROX-RANDOM, pages 479--493, 2007.
|
| |
23
|
I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, sixth edition, 2000.
|
| |
24
|
Monika R. Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. Computing on data streams. American Mathematical Society, 1999.
|
| |
25
|
|
| |
26
|
|
 |
27
|
|
 |
28
|
|
| |
29
|
W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics, 26:189--206, 1984.
|
| |
30
|
W. B. Johnson and G. Schechtman. Embedding lp into l1. Acta. Math., 149:71--85, 1982.
|
| |
31
|
James R. Lee and Assaf Naor. Embedding the diamond graph in lp and dimension reduction in l1. Geometric And Functional Analysis, 14(4): 745--747, 2004.
|
| |
32
|
Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, second edition, 1998.
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873--880, 2007.
|
| |
37
|
|
| |
38
|
Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In COLT, pages 635--649, 2006.
|
| |
39
|
Ping Li, Trevor J. Hastie, and Kenneth W. Church. Nonlinear estimators and tail bounds for dimensional reduction in l1 using Cauchy random projections. In COLT, 2007.
|
| |
40
|
|
| |
41
|
J. Huston McCulloch. Simple consistent estimators of stable distribution parameters. Communications on Statistics-Simulation, 15(4): 1109--1136, 1986.
|
| |
42
|
L. R. Shenton and K. Bowman. Higher moments of a maximum-likelihood estimate. Journal of Royal Statistical Society B, 25(2):305--317, 1963.
|
| |
43
|
Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
|
| |
44
|
V. M. Zolotarev. One-dimensional Stable Distributions. American Mathematical Society, Providence, RI, 1986.
|
|