Nonparametric estimation of entropy and discrete distributions
Liam Paninski
Department of Statistics
Columbia University
http://www.stat.columbia.edu/∼liam
[email protected]
March 5, 2009
The fundamental question in neuroscience
Stimuli:
• images
• sounds
• time-varying behavior
Responses:
• observations from single or multiple simultaneously-recorded
point processes (neural activity)
Avoiding the curse of insufficient data
$$I(X;Y) = \int_{\mathcal{X}\times\mathcal{Y}} dp(x,y)\,\log\frac{dp(x,y)}{dp(x)\times dp(y)}$$
Information-theoretic justifications:
• invariance
• “uncertainty” axioms
• data processing inequality
• channel and source coding theorems
I(X; Y ) ≥ I(S(X); T (Y ))
$$\hat I_{MLE}(X;Y) = \sum_{x=1}^{m_x}\sum_{y=1}^{m_y} \hat p_{MLE}(x,y)\,\log\frac{\hat p_{MLE}(x,y)}{\hat p_{MLE}(x)\,\hat p_{MLE}(y)},$$
$$\hat p_{MLE}(x) = p_N(x) = \frac{n(x)}{N} \quad \text{(empirical measure)}$$
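A minimal NumPy sketch of this plug-in estimator, for illustration only (the function name `mi_mle`, the use of bits, and the toy counts are my choices, not from the talk):

```python
import numpy as np

def mi_mle(counts):
    """Plug-in (MLE) mutual information estimate, in bits.

    counts : (m_x, m_y) array of joint counts n(x, y) from N samples.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_xy = counts / N                       # empirical joint p_hat(x, y) = n(x, y) / N
    p_x = p_xy.sum(axis=1, keepdims=True)   # empirical marginal p_hat(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # empirical marginal p_hat(y)
    mask = p_xy > 0                         # 0 log 0 = 0 convention
    ratio = p_xy[mask] / (p_x * p_y)[mask]
    return float(np.sum(p_xy[mask] * np.log2(ratio)))

# Toy check: X and Y independent and uniform, so the true I(X; Y) is 0,
# yet the plug-in estimate comes out strictly positive at small N.
rng = np.random.default_rng(0)
counts = rng.multinomial(100, np.full(100, 0.01)).reshape(10, 10)
print(mi_mle(counts))
```

The upward bias of this estimate when the table is undersampled is exactly the "curse of insufficient data" at issue here.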
Fix $p(x, y)$, $m_x$, $m_y$ and let the sample size $N \to \infty$. Then, if changing any single sample changes the estimator $F$ by at most $c$,
$$\operatorname{Var}(F) < \frac{c^2 N}{4}$$
(inequalities due to Azuma-Hoeffding, Efron-Stein, Steele, etc.).
Our case:
$$\hat H = \sum_i g\!\left(\frac{n(i)}{N}\right),$$
$$\max_j \left|\,g\!\left(\frac{j}{N}\right) - g\!\left(\frac{j-1}{N}\right)\right| < c \;\Longrightarrow\; \operatorname{Var}\!\left(\sum_i g\!\left(\frac{n(i)}{N}\right)\right) \le N c^2$$
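A quick Monte Carlo sanity check of this bound for the MLE entropy estimator, $g(t) = -t\log t$; the distribution, bin count, sample size, and number of replicates below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 50, 200                          # bins and sample size (arbitrary)
p = rng.dirichlet(np.ones(m))           # an arbitrary "true" distribution

def g(t):
    # MLE entropy estimator uses g(t) = -t log t (with g(0) = 0), in nats
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, -t * np.log(np.maximum(t, 1e-300)), 0.0)

# c = max_j |g(j / N) - g((j - 1) / N)|
j = np.arange(1, N + 1)
c = np.abs(g(j / N) - g((j - 1) / N)).max()

# Monte Carlo estimate of Var(H_hat) versus the bound N * c^2
H_hat = np.array([g(rng.multinomial(N, p) / N).sum() for _ in range(5000)])
print("empirical Var:", H_hat.var(), "  bound N c^2:", N * c**2)
```

The bound is loose here, but since $c \sim \log N / N$ for this $g$, it already shows $\operatorname{Var}(\hat H) \to 0$ as $N$ grows with $m$ fixed.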
Derivation of new estimator
[Figure: RMS error bound (bits) for the BUB and JK estimators as a function of sample size N, 10^1 to 10^6.]
Undersampling example
[Figure: undersampling example; three panels over a 20 × 20 (x, y) grid, color scale 0 to 1.]
Obvious conjecture: $\sum_i p_i^{\alpha}$, $0 < \alpha < 1$ (Rényi entropy) should behave similarly.
[Figure: the functions $-p\log(p)$ and $\sqrt{p}$ plotted against $p$.]
• $|F(p_0^N) - F(p_1^N)| > \epsilon > 0$
• $p_0^N$ and $p_1^N$ remain "statistically indistinguishable" (i.e., the simple hypothesis test error remains bounded away from zero)
Here, p0 places all of its mass on bin 1; p1 places most of its mass
on bin 1, but spreads some fraction uniformly on all other bins.
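A small numerical illustration of this construction (the particular m, N, and spread mass eps below are arbitrary choices of mine): the entropies differ, yet the optimal test between the two distributions, given N samples, still errs with probability $(1-\epsilon)^N/2$.

```python
import numpy as np

m, N, eps = 10**6, 100, 0.01            # bins, sample size, spread mass (arbitrary)

p0 = np.zeros(m); p0[0] = 1.0                       # p0: all mass on bin 1
p1 = np.full(m, eps / (m - 1)); p1[0] = 1.0 - eps   # p1: mass 1 - eps on bin 1

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print("H(p0) =", entropy_bits(p0))      # exactly 0 bits
print("H(p1) =", entropy_bits(p1))      # about 0.28 bits here

# The two N-sample distributions differ in total variation by 1 - (1 - eps)^N,
# so the minimum (equal-prior) hypothesis-test error is (1 - eps)^N / 2:
print("minimum test error:", (1 - eps) ** N / 2)
```

Roughly speaking, taking $\epsilon$ of order $1/N$ and $m$ exponentially large in $N$ keeps both the entropy gap and the test error bounded away from zero.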
Directions
[Figure: observed counts over 100 bins.]
$$\hat p_i = \frac{g(n_i)}{\sum_{i=1}^{m} g(n_i)},$$
where $n_i$ is the number of samples observed in bin $i$.
Example: "add-constant" estimators, $g_j \equiv g(j) = \dfrac{j+\alpha}{N+m\alpha}$, $\alpha > 0$.
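A short sketch of this family of estimators (the function names and toy counts are mine); note that taking $g(j) = j + \alpha$ and normalizing reproduces the add-constant rule $(n_i+\alpha)/(N+m\alpha)$:

```python
import numpy as np

def p_hat(counts, g):
    """Distribution estimate p_hat_i = g(n_i) / sum_k g(n_k)."""
    gn = g(np.asarray(counts, dtype=float))
    return gn / gn.sum()

def add_constant(alpha):
    # g(j) = j + alpha; after normalization this is (n_i + alpha) / (N + m * alpha)
    return lambda n: n + alpha

counts = np.array([10, 5, 0, 0, 1])           # toy counts: N = 16 samples, m = 5 bins
print(p_hat(counts, add_constant(0.0)))       # alpha = 0: the MLE / empirical measure
print(p_hat(counts, add_constant(1.0)))       # alpha = 1: Laplace's "add-one" rule
```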
[Figure: estimated p over 100 bins for α = 0 and α = 1.]
Basic upper bound
$$
\begin{aligned}
E_{\vec p}\bigl(D_{KL}(\vec p,\hat p)\bigr)
&= E_{\vec p}\!\left(\sum_{i=1}^m p_i \log\frac{p_i}{\hat p_i}\right)\\
&= \sum_i\left(-H(p_i) + \sum_{j=0}^{N}(-\log g_j)\,p_i\,B_{N,j}(p_i)\right) + E_{\vec p}\log\!\left(\sum_{k=1}^m g(n_k)\right)\\
&\le \sum_i\left(-H(p_i) + \sum_j(-\log g_j)\,p_i\,B_{N,j}(p_i)\right) + E_{\vec p}\!\left(-1+\sum_k g(n_k)\right)\\
&= \sum_i f(p_i) = \sum_i\left(-H(p_i) - p_i + \sum_j (g_j - p_i\log g_j)\,B_{N,j}(p_i)\right),
\end{aligned}
$$
using $\log p \le p-1$, $H(p) = -p\log p$, and $B_{N,j}(p) = \binom{N}{j}p^j(1-p)^{N-j}$.

Equality iff $\sum_k g(n_k)$ is constant (e.g., add-constant estimators).
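Since the bound is an equality for add-constant estimators, summing the per-bin term f(p) gives the exact expected KL risk. A sketch (requires SciPy for the binomial pmf; the values of m, N, α, and the uniform test distribution are arbitrary choices of mine):

```python
import numpy as np
from scipy.stats import binom

def f(p, N, m, alpha):
    """Per-bin term f(p) = -H(p) - p + sum_j (g_j - p log g_j) B_{N,j}(p),
    here for the add-constant choice g_j = (j + alpha) / (N + m * alpha)."""
    j = np.arange(N + 1)
    g = (j + alpha) / (N + m * alpha)
    B = binom.pmf(j, N, p)                 # B_{N,j}(p) = C(N, j) p^j (1 - p)^(N - j)
    H = -p * np.log(p) if p > 0 else 0.0   # H(p) = -p log p (nats)
    return float(-H - p + np.sum((g - p * np.log(g)) * B))

# For add-constant estimators sum_k g(n_k) = 1, so the bound holds with equality
# and sum_i f(p_i) is the exact expected KL risk E[D_KL(p, p_hat)]:
m, N, alpha = 100, 50, 0.5                 # arbitrary illustrative values
p = np.full(m, 1.0 / m)                    # e.g., the uniform distribution
print("E[D_KL] =", sum(f(pi, N, m, alpha) for pi in p))
```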
Two upper bounds
$$\sum_{i=1}^m f(p_i) \le m \max_{0\le p\le 1} f(p):$$
tight in the heavily-sampled limit.

$$\sum_i f(p_i) \le m \max_{0\le p\le 1/m} f(p) + \max_{1/m\le p\le 1}\frac{f(p)}{p}:$$
tight in the sparse-data limit.
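Both bounds can be evaluated numerically, e.g., by a brute-force grid search over p; the sketch below restates the same per-bin f as above for an add-constant g (the grid resolution and parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import binom

def f(p, N, m, alpha):
    # Same per-bin term f(p) as in the previous sketch.
    j = np.arange(N + 1)
    g = (j + alpha) / (N + m * alpha)
    B = binom.pmf(j, N, p)
    H = -p * np.log(p) if p > 0 else 0.0
    return float(-H - p + np.sum((g - p * np.log(g)) * B))

def two_bounds(N, m, alpha, grid=2000):
    """Approximate both upper bounds on sum_i f(p_i) by grid search over p."""
    ps = np.linspace(1e-6, 1.0, grid)
    fv = np.array([f(p, N, m, alpha) for p in ps])

    bound1 = m * fv.max()                              # m * max_{0 <= p <= 1} f(p)

    small, large = ps <= 1.0 / m, ps >= 1.0 / m
    bound2 = m * fv[small].max() + (fv[large] / ps[large]).max()
    return bound1, bound2

print(two_bounds(N=50, m=100, alpha=0.5))              # (heavily-sampled, sparse) bounds
```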
Dirichlet prior:
$$P(\vec p) = \frac{1}{Z(\vec\alpha)}\prod_i p_i^{\alpha_i - 1}$$
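For the symmetric case $\alpha_i \equiv \alpha$, the posterior mean, which is the Bayes estimate under KL loss, is exactly the add-constant estimator above. A minimal sketch (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, N, alpha = 100, 50, 0.5                    # arbitrary illustrative values

p = rng.dirichlet(np.full(m, alpha))          # p ~ Dirichlet(alpha, ..., alpha)
counts = rng.multinomial(N, p)                # N samples drawn from that p

# By conjugacy the posterior is Dirichlet(alpha + n_1, ..., alpha + n_m), so the
# posterior mean (the Bayes estimate under KL loss) is the add-constant estimator:
posterior_mean = (counts + alpha) / (N + m * alpha)
print(posterior_mean[:5])
print((counts / N)[:5])                       # compare with the raw empirical measure
```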
[Figure: as a function of N/m (10^{-4} to 10^{1}): the optimal α with an approximation; the lower bound with the j = 0 and (m − 1)/2N approximations; and the ratio (upper bound)/(lower bound) for the least-favorable Bayes, Braess-Sauer, and optimized estimators.]
— asymptotically tight as c → 0, c → ∞
— always tight within a factor of 2
Summary
References

Krichevsky, R. (1998). Laplace's law of succession and universal encoding. IEEE Transactions on Information Theory, 44:296–303.
McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pages 148–188.
Cambridge University Press.
Miller, G. (1955). Note on the bias of information estimates. In Information theory in psychology II-B, pages
95–100.
Nemenman, I., Shafee, F., and Bialek, W. (2002). Entropy and inference, revisited. Advances in Neural Information Processing Systems, 14.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.
Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on
Information Theory, 50:2200–2203.
Paninski, L. (2005). Variational minimax estimation of discrete distributions under KL loss. Advances in Neural
Information Processing Systems, 17.
Paninski, L. (2008). A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE
Transactions on Information Theory, 54:4750–4755.
Paninski, L. and Yajima, M. (2008). Undersmoothed kernel entropy estimators. IEEE Transactions on
Information Theory, 54:4384–4388.