ACM Home Page
Please provide us with feedback. Feedback
Approximating entropy from sublinear samples
Full text PdfPdf (398 KB)
Source Symposium on Discrete Algorithms archive
Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms table of contents
New Orleans, Louisiana
Pages: 366 - 375  
Year of Publication: 2007
ISBN:978-0-898716-24-5
Authors
Mickey Brautbar  Hebrew University, Jerusalem, Israel
Alex Samorodnitsky  Hebrew University, Jerusalem, Israel
Sponsors
: SIAM Activity Group on Discrete Mathematics
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
Society for Industrial and Applied Mathematics  Philadelphia, PA, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 49,   Citation Count: 0
Additional Information:

abstract   references   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

We consider the problem of approximating the entropy of a discrete distribution P on a domain of size q, given access to n independent samples from the distribution. It is known that nq is necessary, in general, for a good additive estimate of the entropy. A problem of multiplicative entropy estimate was recently addressed by Batu, Dasgupta, Kumar, and Rubinfeld. They show that n = qα suffices for a factor-α approximation, α > 1.

We introduce a new parameter of a distribution--its effective alphabet size qef(P). This is a more intrinsic property of the distribution depending only on its entropy moments. We show qef ≤ Õ(q). When the distribution P is essentially concentrated on a small part of the domain qefq. We strengthen the result of Batu et al. by showing it holds with qef replacing q.

This has several implications. In particular the rate of convergence of the maximum-likelihood entropy estimator (the empirical entropy) for both finite and infinite alphabets is shown to be dictated by the effective alphabet size of the distribution. Several new, and some known, facts about this estimator follow easily.

Our main result is algorithmic. Though the effective alphabet size is, in general, an unknown parameter of the distribution, we give an efficient procedure, with access to the alphabet size only, that achieves a factor-α approximation of the entropy with n = Õ (exp {α 1/4 · log3/4 q · log1/4 qef}). Assuming (for instance) log qef ≪ log q this is smaller than any power of q. Taking α → 1 leads in this case to efficient additive estimates for the entropy as well. In particular, this result shows that for many natural scenarios, a tight estimation of the entorpy may be achieved using a sub-linear sample.

Several extensions of the results above are discussed.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
 
4
B. Efron and C. Stein. The jackknife estimate of variance. Ann. Stat., 9(3):586--596, 1981.
 
5
P. Grassberger. Entropy estimates from insufficient samplings, E-print physics/0307138, July 2003.
 
6
C. Haixiao, S. R. Kulkarni, and S. Verdu. Universal entropy estimation via block sorting. IEEE Transactions on Information Theory, 50, July, 2004.
 
7
C. McDiarmid. Surveys in Combinatorics, chapter On the method of bounded differences, pages 148--188. Cambridge University Press, 1989.
 
8
G. Miller. Information theory in Psychology, chapter II-B: Note on the bias of information estimates, pages 95--100. Glencoe IL: Free Press, 1955.
 
9
I. Nemenman, W. Bialek, and R. de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69, 2004.
 
10
 
11
L. Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50, 2004.
 
12
S. Strong, R. Koberle, R. de Ruyter van Steveninck, and W. Bialek. Entropy and information in neural spike trains. Phys. Rev. Let., 80(1), 1998.
 
13
A. Wyner and D. Foster. On the lower limits of entropy estimation. Submitted to IEEE Transactions on Information Theory.
 
14
M. Brautbar and A. Samorodnitsky. Approximating the entropy of large alphabets. Technical report, ECCC, 084, 2005.
Collaborative Colleagues:
Mickey Brautbar: colleagues
Alex Samorodnitsky: colleagues