|
ABSTRACT
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard lp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances.Batu et al posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve.Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation invariant functions can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
S. M. Ali and S. D. Silvey, A general class of coefficients of divergence of one distribution from another, J. of Royal Statistical Society, Series B 28 (1966), 131--142.
|
| |
2
|
|
| |
3
|
S. Amari, Differential-geometrical methods in statistics, Springer-Verlag, New York, 1985.
|
 |
4
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
5
|
A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, Clustering with bregman divergences, JMLR (2005).
|
| |
6
|
|
 |
7
|
|
| |
8
|
|
| |
9
|
|
 |
10
|
|
 |
11
|
Tuǧkan Batu , Sanjoy Dasgupta , Ravi Kumar , Ronitt Rubinfeld, The complexity of approximating entropy, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.510005]
|
| |
12
|
|
| |
13
|
N. N. Cencov, Statistical decision rules and optimal inference, Transl. Math. Monographs, Am. Math. Soc. (Providence) (1981).
|
| |
14
|
|
| |
15
|
|
| |
16
|
Graham Cormode and S. Muthukrishnan, An improved data stream summary: The count-min sketch and its applications, Proc. of LATIN (2004), 29--38.
|
| |
17
|
|
| |
18
|
I. Csiszar, Why least squares and maximum entropy? an axiomatic approach to inference for linear inverse problems, Ann. Statist. (1991), 2032--2056.
|
| |
19
|
|
| |
20
|
|
| |
21
|
J. Feigenbaum , S. Kannan , M. Strauss , M. Viswanathan, Testing and spot-checking of data streams (extended abstract), Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.165-174, January 09-11, 2000, San Francisco, California, United States
|
| |
22
|
|
 |
23
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
| |
24
|
Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian, Streaming and sublinear approximation of entropy and information distances, CoRR cs/0508122 (2005).
|
 |
25
|
|
 |
26
|
Michael Kearns , Yishay Mansour , Dana Ron , Ronitt Rubinfeld , Robert E. Schapire , Linda Sellie, On the learnability of discrete distributions, Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, p.273-282, May 23-25, 1994, Montreal, Quebec, Canada
[doi> 10.1145/195058.195155]
|
| |
27
|
Balachander Krishnamurthy, Harsha V. Madhyastha, and Suresh Venkatasubramanian, On stationarity in internet measurements through an information-theoretic lens, Ist IEEE International Workshop on Networking Meets Databases (NetDB), 2005.
|
| |
28
|
F. Liese and F. Vajda, Convex statistical distances, Teubner-Texte zur Mathematik, Band 95, Leipzig (1987).
|
| |
29
|
|
| |
30
|
J. Munro and M. Paterson, Selection and Sorting with Limited Storage, Theoretical Computer Science (1980), 315--323.
|
| |
31
|
S. Muthukrishnan, Data streams: Algorithms and applications, Survey available on request at muthu@research.att.com (2003).
|
| |
32
|
Dana Ron, Property testing (a tutorial), Handbook on Randomization (2000).
|
| |
33
|
N. Tishby, F. Pereira, and W. Bialek, The information bottleneck method, Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, 1999, pp. 368--377.
|
| |
34
|
Flemming Topsøe, Some inequalities for information divergence and related measures of discrimination., IEEE Transactions on Information Theory 46 (2000), no. 4, 1602--1609.
|
CITED BY 10
|
|
|
|
|
|
|
|
S. Subramaniam , T. Palpanas , D. Papadopoulos , V. Kalogeraki , D. Gunopulos, Online outlier detection in sensor data using non-parametric models, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|