ACM Home Page
Please provide us with feedback. Feedback
Streaming and sublinear approximation of entropy and information distances
Full text PdfPdf (360 KB)
Source Symposium on Discrete Algorithms archive
Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm table of contents
Miami, Florida
Pages: 733 - 742  
Year of Publication: 2006
ISBN:0-89871-605-5
Authors
Sudipto Guha  University of Pennsylvania, Philadelphia
Andrew McGregor  University of Pennsylvania, Philadelphia
Suresh Venkatasubramanian  AT&T Research Labs, Florham Park, NJ
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
: SIAM Activity Group on Discrete Mathematics
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 64,   Citation Count: 10
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1109557.1109637
What is a DOI?

ABSTRACT

In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard lp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances.Batu et al posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve.Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation invariant functions can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
S. M. Ali and S. D. Silvey, A general class of coefficients of divergence of one distribution from another, J. of Royal Statistical Society, Series B 28 (1966), 131--142.
 
2
 
3
S. Amari, Differential-geometrical methods in statistics, Springer-Verlag, New York, 1985.
4
 
5
A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, Clustering with bregman divergences, JMLR (2005).
 
6
7
 
8
 
9
10
11
 
12
 
13
N. N. Cencov, Statistical decision rules and optimal inference, Transl. Math. Monographs, Am. Math. Soc. (Providence) (1981).
 
14
 
15
 
16
Graham Cormode and S. Muthukrishnan, An improved data stream summary: The count-min sketch and its applications, Proc. of LATIN (2004), 29--38.
 
17
 
18
I. Csiszar, Why least squares and maximum entropy? an axiomatic approach to inference for linear inverse problems, Ann. Statist. (1991), 2032--2056.
 
19
 
20
 
21
 
22
23
 
24
Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian, Streaming and sublinear approximation of entropy and information distances, CoRR cs/0508122 (2005).
25
26
 
27
Balachander Krishnamurthy, Harsha V. Madhyastha, and Suresh Venkatasubramanian, On stationarity in internet measurements through an information-theoretic lens, Ist IEEE International Workshop on Networking Meets Databases (NetDB), 2005.
 
28
F. Liese and F. Vajda, Convex statistical distances, Teubner-Texte zur Mathematik, Band 95, Leipzig (1987).
 
29
 
30
J. Munro and M. Paterson, Selection and Sorting with Limited Storage, Theoretical Computer Science (1980), 315--323.
 
31
S. Muthukrishnan, Data streams: Algorithms and applications, Survey available on request at muthu@research.att.com (2003).
 
32
Dana Ron, Property testing (a tutorial), Handbook on Randomization (2000).
 
33
N. Tishby, F. Pereira, and W. Bialek, The information bottleneck method, Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, 1999, pp. 368--377.
 
34
Flemming Topsøe, Some inequalities for information divergence and related measures of discrimination., IEEE Transactions on Information Theory 46 (2000), no. 4, 1602--1609.

CITED BY  10

Collaborative Colleagues:
Sudipto Guha: colleagues
Andrew McGregor: colleagues
Suresh Venkatasubramanian: colleagues