
Nonparametric estimation of entropy and

discrete distributions

Liam Paninski
Department of Statistics
Columbia University
http://www.stat.columbia.edu/~liam
[email protected]
March 5, 2009
The fundamental question in
neuroscience

The neural code: what is P (response | stimulus)?

Main question: how to estimate P(r|s) from (sparse) experimental data?
Curse of dimensionality

Both stimulus and response can be very high-dimensional.

Stimuli:
• images
• sounds
• time-varying behavior
Responses:
• observations from single or multiple simultaneously-recorded
point processes (neural activity)
Avoiding the curse of insufficient data

1: Estimate some functional f(p) instead of full joint distribution p(r, s)
— information-theoretic functionals

2: Improved nonparametric estimators
— minimax theory for discrete distributions under KL loss

3: Select stimuli more efficiently
— optimal experimental design

(4: Parametric approaches)


Part 1: Estimation of information

Many central questions in neuroscience are inherently information-theoretic:
• What inputs are most reliably encoded by a given neuron?
• Are sensory neurons optimized to transmit information
about the world to the brain?
• Do noisy synapses limit the rate of information flow from
neuron to neuron?

Quantification of “information” is a fundamental problem.

(...interest in neuroscience but also physics, telecommunications, genomics, etc.)
Shannon mutual information

$$I(X;Y) = \int_{X\times Y} dp(x,y)\,\log\frac{dp(x,y)}{dp(x)\times dp(y)}$$

Information-theoretic justifications:
• invariance
• “uncertainty” axioms
• data processing inequality
• channel and source coding theorems

But obvious open experimental question:
• is this computable for real data?
How to estimate information

I is very hard to estimate in general...
... but lower bounds are easier.

Data processing inequality:

I(X; Y ) ≥ I(S(X); T (Y ))

Suggests a sieves-like approach.


Discretization approach
Discretize X, Y → Xdisc, Ydisc, estimate

Idiscrete(X; Y) = I(Xdisc; Ydisc)

• Data processing inequality =⇒ Idiscrete ≤ I
• Idiscrete ↗ I as partition is refined

Strategy: refine partition as samples N increases; if number of bins m doesn’t grow too fast, Î → Idiscrete ↗ I

Completely nonparametric, but obvious concerns:
• Want N >> m(N) samples, to “fill in” histograms p(x, y)
• How large is bias, variance for fixed m?
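As a concrete illustration of the discretization strategy above, here is a minimal plug-in estimator sketch in Python/NumPy. This is my own illustration, not code from the talk; the function name `plugin_mutual_information` and the toy Gaussian example are hypothetical.

```python
import numpy as np

def plugin_mutual_information(x, y, bins=10):
    """Plug-in (MLE) estimate of I(X;Y) in bits from paired samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)   # discretize X and Y jointly
    p_xy = joint / joint.sum()                      # empirical joint p_N(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)           # empirical marginals
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                                   # 0 log 0 = 0 convention
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz]))

# Toy usage: correlated Gaussians; the estimate approaches I_discrete as N grows,
# but is biased upward when N is small relative to the number of bins.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = 0.8 * x + 0.6 * rng.standard_normal(5000)
print(plugin_mutual_information(x, y, bins=16))
```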
Bias is major problem

$$\hat I_{MLE}(X;Y) = \sum_{x=1}^{m_x}\sum_{y=1}^{m_y} \hat p_{MLE}(x,y)\,\log\frac{\hat p_{MLE}(x,y)}{\hat p_{MLE}(x)\,\hat p_{MLE}(y)}$$

$$\hat p_{MLE}(x) = p_N(x) = \frac{n(x)}{N} \quad \text{(empirical measure)}$$
N
Fix p(x, y), m_x, m_y and let sample size N → ∞. Then:

• Bias(Î_MLE) ∼ (m_x − 1)(m_y − 1)/2N.
• Variance(Î_MLE) ∼ (log m)²/N; dominated by bias if m = m_x m_y large.
• No unbiased estimator exists.
(Miller, 1955; Paninski, 2003)
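A quick Monte Carlo sanity check of the Miller-type bias above (my own sketch, not from the talk): X and Y are independent and uniform, so the true information is zero, yet the plug-in estimate should sit near the first-order prediction (m_x − 1)(m_y − 1)/(2N ln 2) bits.

```python
import numpy as np

rng = np.random.default_rng(1)
mx, my, N = 20, 20, 4000                         # N / (mx*my) = 10: moderately sampled

def plugin_mi_bits(counts):
    p = counts / counts.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / (px * py)[nz]))

estimates = []
for _ in range(200):
    x = rng.integers(mx, size=N)                 # X, Y independent, uniform: true I = 0
    y = rng.integers(my, size=N)
    counts = np.zeros((mx, my))
    np.add.at(counts, (x, y), 1)
    estimates.append(plugin_mi_bits(counts))

print("mean plug-in estimate (bits):", np.mean(estimates))
print("Miller-type prediction (bits):", (mx - 1) * (my - 1) / (2 * N * np.log(2)))
```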
Convergence of common information
estimators

Result 1: If N/m → ∞, Î_MLE and related estimators are universally almost surely consistent.

Converse: if N/m → c < ∞, Î_MLE and related estimators typically converge to the wrong answer almost surely. (Asymptotic bias can often be computed explicitly.)

Implication: if N/m is small, the bias remains large even though the errorbars vanish, even if “bias-corrected” estimators are used! (Paninski, 2003)
Estimating information on m bins with
fewer than m samples

Result 2: A new estimator that is uniformly consistent as N → ∞ even if N/m → 0 (albeit sufficiently slowly)

Error bounds good for all underlying distributions: estimator works well even in worst case

Interpretation: information is strictly easier to estimate than p!
(Paninski, 2004)
Derivation of new estimator

Suffices to develop good estimator of discrete entropy:

$$I_{discrete}(X;Y) = H(X_{disc}) + H(Y_{disc}) - H(X_{disc}, Y_{disc})$$

$$H(X) = -\sum_{x=1}^{m_x} p(x)\log p(x)$$
Derivation of new estimator

Variational idea: choose estimator that minimizes upper bound on error over

$$\mathcal{H} = \Big\{\hat H : \hat H(p_N) = \sum_i g(p_N(i))\Big\} \qquad (p_N = \text{empirical measure})$$

Approximation-theoretic (binomial) bias bound:

$$\max_p \mathrm{Bias}_p(\hat H) \le B^*(\hat H) \equiv m\cdot\max_{0\le p\le 1}\Big|-p\log p - \sum_{j=0}^{N} g\Big(\frac{j}{N}\Big) B_{N,j}(p)\Big|$$

McDiarmid-Steele bound on variance:

$$\max_p \mathrm{Var}_p(\hat H) \le V^*(\hat H) \equiv N\,\max_j\Big(g\Big(\frac{j}{N}\Big) - g\Big(\frac{j-1}{N}\Big)\Big)^2$$
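A small numerical sketch (my own, following the notation above) of how the two bounds can be evaluated for a candidate g on a grid of p values; here it is applied to the plug-in choice g(j/N) = −(j/N) log(j/N).

```python
import numpy as np
from scipy.stats import binom

def bias_bound(g, N, m, grid=2000):
    """m * max_p | -p log p - sum_j g(j/N) B_{N,j}(p) |, evaluated on a p grid."""
    p = np.linspace(0.0, 1.0, grid)
    j = np.arange(N + 1)
    B = binom.pmf(j[None, :], N, p[:, None])        # Bernstein polynomials B_{N,j}(p)
    target = np.where(p > 0, -p * np.log(p), 0.0)   # -p log p, with 0 log 0 = 0
    return m * np.max(np.abs(target - B @ g))

def var_bound(g, N):
    """N * max_j ( g(j/N) - g((j-1)/N) )^2."""
    return N * np.max(np.diff(g) ** 2)

# Evaluate both bounds for the plug-in (MLE) choice in a sparse regime N/m = 0.25;
# here the bias bound dominates the variance bound.
N, m = 100, 400
jN = np.arange(N + 1) / N
g_mle = np.where(jN > 0, -jN * np.log(jN), 0.0)
print("B* =", bias_bound(g_mle, N, m), "  V* =", var_bound(g_mle, N))
```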
Entropy bias bound

$$\mathrm{Bias}_p(\hat H) = E_p(\hat H) - H(p) = \sum_{i=1}^{m}\left( p(i)\log p(i) + \sum_{j=0}^{N} g\Big(\frac{j}{N}\Big) B_{N,j}(p(i)) \right) \le m\cdot\max_{0\le p\le 1}\left| -p\log p - \sum_{j=0}^{N} g\Big(\frac{j}{N}\Big) B_{N,j}(p)\right|$$

• $B_{N,j}(p) = \binom{N}{j} p^j (1-p)^{N-j}$: polynomial in p
• If $\sum_j g(j/N) B_{N,j}(p)$ close to $-p\log p$ for all p, bias will be small
=⇒ standard uniform polynomial approximation theory
Entropy variance bound
“Method of bounded differences” (McDiarmid, 1989): let F(x_1, x_2, ..., x_N) be a function of N i.i.d. r.v.’s. If any single x_i has small effect on F, i.e.,

$$\sup |F(\ldots, x, \ldots) - F(\ldots, y, \ldots)| < c,$$

then

$$\mathrm{Var}(F) < \frac{N}{4}\, c^2$$

(inequalities due to Azuma-Hoeffding, Efron-Stein, Steele, etc.).

Our case:

$$\hat H = \sum_i g\Big(\frac{n(i)}{N}\Big)$$

$$\max_j \Big| g\Big(\frac{j}{N}\Big) - g\Big(\frac{j-1}{N}\Big) \Big| < c \;\Longrightarrow\; \mathrm{Var}\Big( \sum_i g\Big(\frac{n(i)}{N}\Big) \Big) \le N c^2$$
Derivation of new estimator

Choose estimator to minimize (convex) error bound over (convex) space $\mathcal{H}$:

$$\hat H_{BUB} = \mathrm{argmin}_{\hat H \in \mathcal{H}} \big[ B^*(\hat H)^2 + V^*(\hat H) \big].$$

Optimization of convex functions on convex parameter spaces is computationally tractable by simple descent methods

Consistency proof involves Stone-Weierstrass theorem, penalized polynomial approximation theory in Poisson limit N/m → c.
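A rough sketch of the variational step (my own, not the actual BUB implementation): the two max terms are smoothed with a log-sum-exp surrogate so an off-the-shelf quasi-Newton routine can descend on B*(g)² + V*(g). The grid sizes, the smoothing parameter, and the plug-in starting point are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

N, m = 20, 80                                       # sparse regime, N/m = 0.25
p_grid = np.linspace(0.0, 1.0, 500)
B = binom.pmf(np.arange(N + 1)[None, :], N, p_grid[:, None])   # Bernstein basis
target = np.where(p_grid > 0, -p_grid * np.log(p_grid), 0.0)   # -p log p

def smooth_max(v, beta=500.0):
    # log-sum-exp surrogate for max(v), keeping the objective differentiable
    return (np.log(np.sum(np.exp(beta * (v - v.max())))) + beta * v.max()) / beta

def bound(g):
    bias = m * smooth_max(np.abs(target - B @ g))   # ~ B*(g)
    var = N * smooth_max(np.diff(g) ** 2)           # ~ V*(g)
    return bias ** 2 + var

jN = np.arange(N + 1) / N
g0 = np.where(jN > 0, -jN * np.log(jN), 0.0)        # start from the plug-in g

res = minimize(bound, g0, method="L-BFGS-B")        # numerical gradients suffice here
print("bound at plug-in g  :", bound(g0))
print("bound at optimized g:", bound(res.x))
```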
Error comparisons: upper and lower bounds
[Figure: upper and lower bounds on the maximum RMS error (bits) for the BUB and JK estimators, plotted against N from 10^1 to 10^6; N/m = 0.25.]
Undersampling example

[Figure: true p(x, y), estimated p(x, y), and |error| (colorbar scale ×10^−5), plotted over x and y.]

m_x = m_y = 1000; N/m_xy = 0.25

Î_MLE = 2.42 bits
“bias-corrected” Î_MLE = −0.47 bits
Î_BUB = 0.74 bits; conservative (worst-case RMS upper bound) error: ±0.2 bits
true I(X; Y) = 0.76 bits
Shannon (−p log p) is special

Obvious conjecture: $\sum_i p_i^\alpha$, 0 < α < 1 (Rényi entropy) should behave similarly.
[Figure: −p log(p) (left) and √p (right) plotted for small p.]

Result 3: Surprisingly, not true: no estimator can uniformly estimate $\sum_i p_i^\alpha$, α ≤ 1/2, if N ∼ m (Paninski, 2004).
In fact, need $N > m^{(1-\alpha)/\alpha}$: smaller α =⇒ more data needed!
(Proof via Bayesian lower bounds on minimax error.)
Sketch of lower bound

Idea: find two sequences of distributions $p_0^N$ and $p_1^N$ such that:

• $|F(p_0^N) - F(p_1^N)| > \epsilon > 0$
• $p_0^N$ and $p_1^N$ remain “statistically indistinguishable” (i.e., simple hypothesis test error remains bounded away from zero)
Here, p0 places all of its mass on bin 1; p1 places most of its mass
on bin 1, but spreads some fraction uniformly on all other bins.
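A tiny numerical illustration (my own) of this construction for F(p) = ∑_i p_i^α with α = 0.4 < 1/2 and ε = c/N: in the N ∼ m regime the functional gap grows with m, while the two N-sample experiments remain nearly indistinguishable.

```python
import numpy as np

# p0: all mass on bin 1.  p1: mass 1 - eps on bin 1, eps spread over the other m - 1 bins.
alpha, c = 0.4, 0.1
for m in (10**3, 10**5, 10**7):
    N = m                                        # the N ~ m regime from the slide
    eps = c / N
    F0 = 1.0                                     # sum_i p_i^alpha for the point mass p0
    F1 = (1 - eps) ** alpha + (m - 1) * (eps / (m - 1)) ** alpha
    tv_bound = 1 - (1 - eps) ** N                # prob. of any sample leaving bin 1 under p1
    print(f"m = {m:>8}:  functional gap = {F1 - F0:7.2f},  distinguishability <= {tv_bound:.3f}")
```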
Directions

• 1/2 < α < 1? Other functionals?
• continuous (unbinned) entropy estimators: similar result holds for kernel density estimates (Paninski and Yajima, 2008)
• sparse hypothesis testing: much easier than estimation (Paninski, 2008)
Part 2: Estimating discrete distributions
[Figure: example of sparse data: bin counts (0–4) across 100 bins.]

• Want an estimator which works well even in worst case (“minimax” approach)
• Assume no knowledge of “topology” (smoothness); e.g. p(word) in large dictionary

Of interest: sparse data case: N/m small (N = samples, m = number of bins).
Connections to entropy estimation

Information-theoretic error measure: Kullback-Leibler (relative entropy coding) loss $D_{KL}(p; \hat p) = \sum_i p_i \log(p_i/\hat p_i)$

What to do with unoccupied bins?

Sparse data case more interesting mathematically.

Methods turn out to be similar:


• Optimal approximation (Paninski, 2003)
• Dirichlet priors (Nemenman et al., 2002)
Upper bound idea

1. Derive upper bound on worst-case expected loss
2. Minimize upper bound over tractable class of estimators
3. Use estimator with best upper bound on worst-case loss
— want upper bound to be tight but tractable

Tractability:
1. Find bounds which are convex in the estimator
2. Allow estimators to range over a large convex space
=⇒ no non-global local minima exist: descent methods work
Class of estimators

$$\hat p_i = \frac{g(n_i)}{\sum_{k=1}^{m} g(n_k)},$$

$n_i$ = number of samples observed in bin i

Example: “add-constant” estimators, $g_j \equiv g(j) = \frac{j+\alpha}{N+m\alpha}$, α > 0

[Figure: estimated p across 100 bins for α = 0 (left) and α = 1 (right), both on a 0–0.04 scale.]
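A minimal sketch of this estimator class and the KL loss from the previous slides (my own code; the Zipf-style “dictionary” distribution and the particular α values are just illustrative choices):

```python
import numpy as np

def add_alpha_estimate(counts, alpha):
    """Add-alpha estimate: p_hat_i = (n_i + alpha) / (N + m*alpha)."""
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

def kl_loss(p, p_hat):
    """D_KL(p; p_hat) = sum_i p_i log(p_i / p_hat_i), in nats."""
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / p_hat[nz]))

# Usage in the sparse regime of interest: m = 1000 bins, N = 250 samples (c = N/m = 0.25).
rng = np.random.default_rng(2)
m, N = 1000, 250
p = 1.0 / np.arange(1, m + 1)                    # a Zipf-like "dictionary" distribution
p /= p.sum()
counts = rng.multinomial(N, p)
for alpha in (1.0, 0.5, -0.25 * np.log(0.25)):   # add-1, add-1/2, add-|c log c|
    print(f"alpha = {alpha:.3f}:  KL loss = {kl_loss(p, add_alpha_estimate(counts, alpha)):.3f}")
```

The printed losses depend on the particular draw of counts; the point is only the form of the estimator and loss, not a ranking of the α values.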
Basic upper bound

$$E_{\vec p}\big(D_{KL}(\vec p, \hat p)\big) = E_{\vec p}\left(\sum_{i=1}^{m} p_i \log\frac{p_i}{\hat p_i}\right)$$
$$= \sum_i \left( -H(p_i) + \sum_{j=0}^{N} (-\log g_j)\, p_i\, B_{N,j}(p_i) \right) + E_{\vec p}\left( \log \sum_{k=1}^{m} g(n_k) \right)$$
$$\le \sum_i \left( -H(p_i) + \sum_j (-\log g_j)\, p_i\, B_{N,j}(p_i) \right) + E_{\vec p}\left( -1 + \sum_k g(n_k) \right)$$
$$= \sum_i f(p_i) = \sum_i \left( -H(p_i) - p_i + \sum_j (g_j - p_i\log g_j)\, B_{N,j}(p_i) \right).$$

$\log p \le p - 1$; $H(p) = -p\log p$; $B_{N,j}(p) = \binom{N}{j} p^j (1-p)^{N-j}$

Equality iff $\sum_k g(n_k)$ constant (e.g., add-constant estimator).
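A Monte Carlo check of the last step (my own sketch): for an add-constant estimator ∑_k g(n_k) = 1, so the inequality above is an equality and the expected KL loss should match ∑_i f(p_i) computed directly from the binomial formula.

```python
import numpy as np
from scipy.stats import binom

m, N, alpha = 50, 100, 1.0
g = (np.arange(N + 1) + alpha) / (N + m * alpha)            # g_j for an add-alpha estimator
rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(m))                               # one fixed "true" distribution

# Right-hand side: sum_i f(p_i), using n_i ~ Binomial(N, p_i) marginally.
B = binom.pmf(np.arange(N + 1)[None, :], N, p[:, None])     # B_{N,j}(p_i), shape (m, N+1)
rhs = np.sum(p * np.log(p) - p + B @ g - p * (B @ np.log(g)))

# Left-hand side by Monte Carlo: average KL loss of the add-alpha estimator.
losses = []
for _ in range(2000):
    n = rng.multinomial(N, p)
    p_hat = (n + alpha) / (N + m * alpha)
    losses.append(np.sum(p * np.log(p / p_hat)))
print("sum_i f(p_i)      :", rhs)
print("Monte Carlo E[KL] :", np.mean(losses))
```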
Two upper bounds

$$\sum_{i=1}^{m} f(p_i) \le m \max_{0\le p\le 1} f(p):$$
tight in heavily-sampled limit.

$$\sum_i f(p_i) \le m \max_{0\le p\le 1/m} f(p) + \max_{1/m\le p\le 1} \frac{f(p)}{p}:$$
tight in sparse-data limit.

Minimizing bounds = polynomial approximation problem, as in entropy estimation case (Paninski, 2003; Paninski, 2004).

Note: bounds are convex in $g_j$ =⇒ easy to minimize!

Lower bounds

Compare upper bounds to some well-defined optimum: lower bound on worst-case error of any estimator.
Derive family of bounds indexed by some parameter α, then optimize over α.
Key idea: average (Bayesian) loss ≤ maximum (worst-case) loss.
Dirichlet lower bounds
Dirichlet priors are convenient (Cover, 1972; Krichevsky, 1998;
Nemenman et al., 2002):

$$P(p) = \frac{1}{Z(\vec\alpha)} \prod_i p_i^{\alpha_i - 1}$$

1. Compute the average error under any Dirichlet distribution; can be done analytically for any N, m, $\vec\alpha$.
2. Maximize over the possible Dirichlet parameters $\vec\alpha$ (i.e., find “least favorable” Dirichlet prior) to obtain tightest bound
3. Simplification: restrict $\vec\alpha$ to be constant, $\vec\alpha = (\alpha, \alpha, \ldots, \alpha)$ (1-D maximization)
4. Bound achieved by add-α estimator.
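The average error above is computed analytically; the sketch below (my own) only approximates it by Monte Carlo, using the fact that under a symmetric Dirichlet(α) prior and KL loss the Bayes estimator is the posterior mean, i.e. the add-α estimator, so its average loss lower-bounds the worst-case loss of any estimator.

```python
import numpy as np

def bayes_risk_mc(N, m, alpha, reps=500, seed=None):
    """Monte Carlo estimate of the Dirichlet(alpha) Bayes risk under KL loss.

    The Bayes estimator here is the posterior mean, i.e. the add-alpha estimator,
    so this average loss is a lower bound on any estimator's worst-case KL loss.
    """
    rng = np.random.default_rng(seed)
    losses = np.empty(reps)
    for r in range(reps):
        p = rng.dirichlet(np.full(m, alpha))           # draw p from the prior
        n = rng.multinomial(N, p)
        p_hat = (n + alpha) / (N + m * alpha)          # posterior mean = add-alpha
        nz = p > 0
        losses[r] = np.sum(p[nz] * np.log(p[nz] / p_hat[nz]))
    return losses.mean()

# Sweep alpha to look for a nearly least-favorable (risk-maximizing) symmetric prior
# in a sparse regime (c = N/m = 0.25).
N, m = 250, 1000
for alpha in (0.1, 0.25, 0.5, 1.0):
    print(f"alpha = {alpha}:  Bayes risk ~ {bayes_risk_mc(N, m, alpha, seed=4):.3f}")
```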
Asymptotic analysis

Proposition 1. Any add-α estimator, α > 0, is uniformly KL-consistent if N/m → ∞.
Note: N/m is allowed to tend to infinity arbitrarily slowly.

Proposition 2 (Converse). No estimator is uniformly KL-consistent if lim sup N/m < ∞.
— Contrast with entropy estimation, where consistent estimators do exist in this regime (conjectured by (Nemenman et al., 2002; Paninski, 2003); proven in (Paninski, 2004)).

=⇒ entropy of p is strictly easier to estimate than p!
Main result

In data-sparse regime c ≡ N/m → 0, add-α estimator with α = −c log c is optimal.

Proposition 3. The least-favorable Dirichlet parameter is given by H(c) as c → 0; the corresponding add-H(c) estimator also asymptotically minimizes the upper bound. The maximal and average error behave as −log(c)(1 + o(1)) for c → 0.
Illustration of bounds
[Figure, as a function of N/m from 10^−4 to 10: (top) optimal α and its approximation; (middle) lower bounds, with the j = 0 and (m−1)/2N approximations; (bottom) ratio of upper to lower bounds for the least-favorable Bayes, Braess-Sauer, and optimized estimators.]

— asymptotically tight as c → 0, c → ∞
— always tight within a factor of 2
Summary

• New upper and lower bounds on discrete estimation error
• Useful applications of (convex) variational idea
• Proved asymptotic tightness of bounds
• Numerically, bounds turn out to be fairly tight
• Optimal sparse estimator: add-|c log c|
• See (Paninski, 2005) for details.
References
Cover, T. (1972). Admissibility properties of Gilbert’s encoding for unknown source probabilities. IEEE
Transactions on Information Theory, 18:216–217.

Krichevsky, R. (1998). Laplace’s law of succession and universal encoding. IEEE Transactions on Information
Theory, 44:296–303.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pages 148–188.
Cambridge University Press.

Miller, G. (1955). Note on the bias of information estimates. In Information theory in psychology II-B, pages
95–100.

Nemenman, I., Shafee, F., and Bialek, W. (2002). Entropy and inference, revisited. NIPS, 14.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.

Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on
Information Theory, 50:2200–2203.

Paninski, L. (2005). Variational minimax estimation of discrete distributions under KL loss. Advances in Neural
Information Processing Systems, 17.

Paninski, L. (2008). A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE
Transactions on Information Theory, 54:4750–4755.

Paninski, L. and Yajima, M. (2008). Undersmoothed kernel entropy estimators. IEEE Transactions on
Information Theory, 54:4384–4388.
