Information theory notes
Billy Fang
Instructor: Thomas Courtade
Fall 2016
These are my personal notes from an information theory course taught by Prof. Thomas Courtade.
Most of the material is from [1]. Any errors are mine.
Contents
1 Entropy
4 Data compression
5 Channel capacity
6 Differential entropy
7 Gaussian channel
H(X | Y) ≤ H(X) (conditioning reduces uncertainty), but it is not always true that H(X | Y = y) ≤ H(X) for each y (it is only true on average).

Pinsker's inequality: D(p‖q) ≥ (1/(2 ln 2)) ‖p − q‖₁².

I(X; Y) is concave in p_X for fixed p_{Y|X}, and convex in p_{Y|X} for fixed p_X. To see the first claim, note I(X; Y) = H(Y) − H(Y | X). For fixed p_{Y|X}, H(Y | X) is linear in p_X, while H(Y) is a concave function composed with a linear map of p_X.
We discuss a connection between information and estimation. Consider the Markov chain X → Y → X̂(Y), where Y is a noisy observation of X and X̂(Y) is an estimator of X based on Y. Let P_e = P(X ≠ X̂(Y)).

Theorem 1.1 (Fano's inequality).
H(P_e) + P_e log|X| ≥ H(X | X̂).

Proof. Let E = 1{X ≠ X̂}. Expand H(E, X | X̂) two ways. First,
H(E, X | X̂) = H(X | X̂) + H(E | X, X̂) = H(X | X̂),
since E is determined by (X, X̂). Second,
H(E, X | X̂) = H(E | X̂) + H(X | E, X̂)
≤ H(E) + P_e H(X | X̂, E = 1) + (1 − P_e) H(X | X̂, E = 0)
≤ H(P_e) + P_e log|X|,
since H(X | X̂, E = 0) = 0 and H(X | X̂, E = 1) ≤ log|X|.

By the data-processing inequality, I(X; X̂) ≤ I(X; Y), which implies H(X | X̂) ≥ H(X | Y).
We now clarify the interpretation of "entropy is the average number of bits needed to describe a random variable." Consider a function f : X → {0,1}* that maps the alphabet to arbitrary-length bit strings. We want f to be injective (able to distinguish between letters of the alphabet). Let ℓ(f(x)) be the length of the description of x, so E[ℓ(f(X))] is the average description length.
Define q(x) = 2^{−ℓ(f(x))} / Σ_{x′} 2^{−ℓ(f(x′))}. Then
E[ℓ(f(X))] = Σ_x p(x) ℓ(f(x))
= −Σ_x p(x) log 2^{−ℓ(f(x))}
= Σ_x p(x) log (p(x) / 2^{−ℓ(f(x))}) − Σ_x p(x) log p(x)
= H(X) + Σ_x p(x) log (p(x) / q(x)) − log Σ_{x′} 2^{−ℓ(f(x′))}
= H(X) + D(p_X ‖ q) − log Σ_{x′} 2^{−ℓ(f(x′))}
≥ H(X) − log Σ_{x′} 2^{−ℓ(f(x′))}.
Proposition 1.3.
Σ_{x′} 2^{−ℓ(f(x′))} ≤ log₂(2|X|).

Proof. Consider maximizing the sum on the left-hand side. Respecting injectivity of f, the maximum is the sum of the first |X| terms of the series
2^{−0} + 2^{−1} + 2^{−1} + 2^{−2} + 2^{−2} + 2^{−2} + 2^{−2} + ⋯
Each "level" of lengths contributes at most 1 to the sum, and |X| terms span at most log₂(2|X|) levels.
Suppose we observe i.i.d. X₁, …, Xₙ. The alphabet has size |X|ⁿ. Any injective description of these n outcomes requires at least H(Xⁿ) − log(1 + n log|X|) bits. Using independence, we have the following lower bound on the average number of bits per outcome:
(1/n) E[ℓ(fₙ(Xⁿ))] ≥ H(X₁) − (1/n) log(1 + n log|X|) = H(X) − O(log(n)/n).
Suppose X = {a, b, c} and f(a) = 0, f(b) = 1, f(c) = 01. Concatenating descriptions is no longer injective: 01101 could correspond to abbc or cbc. If we want a uniquely decodable f, then the bound of the previous proposition improves to Σ_x 2^{−ℓ(f(x))} ≤ 1.

Consider X uniform on the above alphabet. Let f(a) = ε (the empty string), f(b) = 0, and f(c) = 1. Then E[ℓ(f(X))] = 2/3 < 1.58 ≈ H(X), so merely injective codes can beat the entropy.
For an i.i.d. sequence,
−(1/n) log p_{Xⁿ}(xⁿ) = −(1/n) Σ_{i=1}^n log p_X(xᵢ) → H(X),
so p_{Xⁿ}(xⁿ) ≈ 2^{−nH(X)}.

Theorem 2.1 (AEP).
−(1/n) log p_{Xⁿ}(Xⁿ) → H(X) in probability.
Last time we saw a few examples where the formula for entropy "magically" appeared. One was that if f : X → {0,1}* is injective, then E[ℓ(f(X))] ≥ H(X) − log log₂(2|X|). Today we will show that there exists an f such that E[ℓ(f(X))] ≲ H(X).

We also saw that if we observe an i.i.d. sequence of Bern(1 − p) random variables, then a "typical" sequence of length n has about n(1 − p) ones and np zeros, and moreover p_{Xⁿ}(xⁿ) ≈ 2^{−nH(X)}.
To formally define "typical," we work backwards from the above example. Define the typical set
A_ε^{(n)} = { xⁿ : | −(1/n) log p_{Xⁿ}(xⁿ) − H(X) | < ε }.

1. For xⁿ ∈ A_ε^{(n)}, 2^{−n(H(X)+ε)} ≤ p_{Xⁿ}(xⁿ) ≤ 2^{−n(H(X)−ε)}.
2. P(Xⁿ ∈ A_ε^{(n)}) → 1.
3. |A_ε^{(n)}| ≤ 2^{n(H(X)+ε)}.
4. |A_ε^{(n)}| ≥ (1 − ε) 2^{n(H(X)−ε)} for n sufficiently large.

Proof. 1 follows by definition. 2 follows by the AEP/LLN. 3 follows from
1 ≥ Σ_{xⁿ ∈ A_ε^{(n)}} p(xⁿ) ≥ |A_ε^{(n)}| 2^{−n(H(X)+ε)},
and 4 from P(A_ε^{(n)}) ≥ 1 − ε for large n together with p(xⁿ) ≤ 2^{−n(H(X)−ε)} on A_ε^{(n)}.

In summary, A_ε^{(n)} carries almost all the probability mass, the probability of sequences in A_ε^{(n)} is roughly uniform, and its cardinality is roughly 2^{nH(X)}.

If B_δ^{(n)} is the smallest set with probability ≥ 1 − δ, then |B_δ^{(n)}| ≳ |A_ε^{(n)}| in some sense. More precisely,
1 − δ − ε ≤ P(A_ε^{(n)} ∩ B_δ^{(n)}) = Σ_{xⁿ ∈ A ∩ B} p(xⁿ) ≤ |B_δ^{(n)}| 2^{−n(H(X)−ε)},
so |B_δ^{(n)}| ≥ (1 − δ − ε) 2^{n(H(X)−ε)}.
We now describe a scheme that describes X with about H(X) bits on average: typical set encoding.

If we label all sequences in A_ε^{(n)}, we need at most log|A_ε^{(n)}| + 1 bits per label. To encode xⁿ ∈ A_ε^{(n)}, we put a flag 1 in front of the label (total length ≤ log|A_ε^{(n)}| + 2). If xⁿ ∉ A_ε^{(n)}, we put a flag 0 in front of the raw binary representation of the sequence (total length ≤ n(log|X| + 1) + 1). Then
(1/n) E[ℓ(f(Xⁿ))] ≤ (1/n) P(A_ε^{(n)}) (log|A_ε^{(n)}| + 2) + (1/n)(1 − P(A_ε^{(n)}))(n(log|X| + 1) + 1)
≤ (H(X) + ε) + 2/n + (1 − P(A_ε^{(n)}))((log|X| + 1) + 1/n)
≤ H(X) + ε′
for n large, where ε′ → 0 as ε → 0. Although this scheme is not practical, we get an upper bound matching our earlier lower bound H(X) − log log₂(2|X|) on the expected length.
Proof. 0 ≤ H(X_{n+1} | X₁, …, Xₙ) ≤ H(X_{n+1} | X₂, …, Xₙ) = H(Xₙ | X₁, …, X_{n−1}), using conditioning reduces entropy and then stationarity. So this is a nonnegative nonincreasing sequence and therefore converges.

If aₙ → a, then the Cesàro means bₙ := (1/n) Σ_{i=1}^n aᵢ also converge to a. Applying this with
aₙ := H(Xₙ | X₁, …, X_{n−1})  and  bₙ = (1/n) Σ_{i=1}^n H(Xᵢ | X₁, …, X_{i−1}) = (1/n) H(X₁, …, Xₙ)
concludes.
Markov chains are an example of stochastic processes. In particular, if there is a stationary distribution π (satisfying πP = π) and the chain starts from it, then the Markov chain is stationary and we may use the previous lemma. So H({Xᵢ}) = lim_{n→∞} H(Xₙ | X_{n−1}). We have
H(Xₙ | X_{n−1}) = Σᵢ π(i) H(Xₙ | X_{n−1} = i) = Σᵢ π(i) H(p_{i1}, …, p_{im}).
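This formula is easy to evaluate numerically: compute π as the left eigenvector of P, then average the row entropies. A sketch (the two-state chain and the function name are ours, for illustration):

```python
import numpy as np

def entropy_rate(P):
    """H({X_i}) = sum_i pi(i) * H(P[i, :]) for a stationary Markov chain."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                          # stationary distribution: pi P = pi
    logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    row_entropies = -(P * logP).sum(axis=1)     # H(p_{i1}, ..., p_{im})
    return float((pi * row_entropies).sum())

# Two-state chain: flip probability 0.1 from state 0, 0.3 from state 1.
P = np.array([[0.9, 0.1], [0.3, 0.7]])
print(entropy_rate(P))   # pi = (0.75, 0.25): 0.75*H(0.1) + 0.25*H(0.3)
```

For a doubly stochastic chain with uniform rows the rate is just the row entropy.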
The second law of thermodynamics: the entropy of an isolated system is nondecreasing. We might model such a system with a Markov process. [Example: Ehrenfest gas system?]

Suppose we start a Markov chain in two different states, and let their state distributions at time n be Pₙ and Qₙ respectively. Then
D(P_{n+1} ‖ Q_{n+1}) ≤ D(Pₙ ‖ Qₙ).
In particular, if π is stationary, D(Pₙ ‖ π) is nonincreasing.
4 Data compression
We have seen that it takes about H(X) bits on average to describe X. This is the fundamental idea behind data compression.

From the AEP, it takes about nH(X) bits to represent X₁, …, Xₙ. But this scheme (using the typical set) is wildly impractical.
Let c : X → {0,1}* be a source code. We are interested in L(c) = Σ_x p_X(x) ℓ(c(x)).

For example, the code
x  p_X(x)  c(x)  ℓ(c(x))
a  1/5     00    2
b  2/5     01    2
c  2/5     1     1
has L(c) = 1.6 and H(X) ≈ 1.52.
This defines the extension code c*, given by c*(x₁ ⋯ xₙ) = c(x₁) ⋯ c(xₙ). A nonsingular code c is an injective code. A code is uniquely decodable if its extension c* is nonsingular (injective). A code is prefix-free (a.k.a. instantaneous) if no code word is a prefix of any other code word. Confusingly, such codes are sometimes called prefix codes.
Theorem 4.1 (Kraft inequality). A prefix code with code word lengths ℓ₁, …, ℓ_N exists if and only if Σᵢ 2^{−ℓᵢ} ≤ 1.

Proof. Sort ℓ₁ ≤ ⋯ ≤ ℓ_N. Draw a binary tree of code words. If ℓ₁ = 2, for example, let the first code word be 00 and prune the children of that node. Then choose the next smallest available word of length ℓ₂, and so on. The inequality guarantees that this is possible and we do not run out of words.

We now show any prefix code satisfies the inequality. Consider feeding a sequence of random coin flips into a decoder, which spits out x as soon as the input so far is a code word. By the prefix property the events {x comes out} are disjoint, so
1 ≥ P(some x comes out) = P(∪_{x∈X} {x comes out}) = Σ_{x∈X} P(x comes out) = Σ_{x∈X} 2^{−ℓ(c(x))}.
Theorem 4.2 (McMillan inequality). Any uniquely decodable code has code word lengths satisfying Σᵢ 2^{−ℓᵢ} ≤ 1.

We prove McMillan's inequality on the homework. Recall that for any f, we proved
E[ℓ(f(X))] = H(X) + D(p_X ‖ q) − log Σ_x 2^{−ℓ(f(x))},
where q(x) ∝ 2^{−ℓ(f(x))}. If c is uniquely decodable, then applying McMillan's inequality shows E[ℓ(c(X))] ≥ H(X).
How do we design good codes? Consider choosing the ℓᵢ to minimize Σᵢ p_X(i) ℓᵢ subject to Σᵢ 2^{−ℓᵢ} ≤ 1. Ignoring integrality, the optimum is ℓᵢ = −log p_X(i) with value H(X); the gap for any other choice is D(p_X ‖ q). Rounding up to ℓᵢ = ⌈−log p_X(i)⌉ still satisfies the Kraft inequality, and
Σᵢ p_X(i) ℓᵢ ≤ Σᵢ p_X(i)(1 − log p_X(i)) = H(X) + 1.

Theorem 4.3. Any uniquely decodable code c satisfies E[ℓ(c(X))] ≥ H(X) (impossibility), and there exists a prefix code c such that E[ℓ(c(X))] ≤ H(X) + 1 (achievability).
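The rounding argument can be checked directly. A sketch (the helper name shannon_lengths is ours): the lengths ⌈−log pᵢ⌉ satisfy Kraft, and the resulting average length sits between H(X) and H(X) + 1.

```python
import math

def shannon_lengths(probs):
    """Shannon code lengths l_i = ceil(-log2 p_i)."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_lengths(probs)
kraft = sum(2.0 ** -l for l in lengths)                   # Kraft sum <= 1
H = -sum(p * math.log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))            # H <= L < H + 1
print(lengths, kraft, H, L)
```

For this dyadic distribution the Shannon lengths are exactly optimal (L = H).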
Huffman coding: at each stage we group the two least probable symbols into one and add their probabilities. For the distribution above, we start with {a}, {b}, {c}. After the first stage we have {a, b} w.p. 3/5 and c w.p. 2/5. Finally, we merge everything to get {a, b, c}. At each grouping, label the two edges 0 and 1. Read backwards to get the code: a = 00, b = 01, and c = 1.
E[ℓ(c(X))] = 2/5 + 4/5 + 2/5 = 8/5 = 1.6, while H(X) ≈ 1.52.
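The merging procedure just described is a few lines with a heap. A sketch (the implementation details, such as the tie-breaking counter, are ours):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code; returns {symbol: bitstring}."""
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)       # two least probable groups
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 1/5, "b": 2/5, "c": 2/5}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(code, avg_len)   # average length 8/5 = 1.6
```

The bit labels may differ from the hand construction above, but the multiset of lengths (and hence the average length) is the same.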
Theorem 4.4. Huffman coding is optimal among all uniquely decodable codes. That is, if c* is the Huffman code, then E[ℓ(c*(X))] ≤ E[ℓ(c(X))] for any uniquely decodable code c.

One "trick" to get rid of the +1 is to group symbols together and give a code Xⁿ → {0,1}*. Then we have the following bounds on the number of bits to describe n symbols: H(Xⁿ) ≤ E[ℓ(c*(Xⁿ))] ≤ H(Xⁿ) + 1. Dividing by n gives the following bounds on the number of bits per symbol:
H(X) ≤ (1/n) E[ℓ(c*(Xⁿ))] ≤ H(X) + 1/n.
However, the Huffman code for Xⁿ requires building a tree with |X|ⁿ leaves: we have a tradeoff between length of the code and computational complexity.
[See Lempel-Ziv for universal coding: variable length coding.]
This motivates arithmetic coding, which has complexity linear in n (instead of exponential as in the above discussion). We discuss Shannon-Fano-Elias coding. We will see that E[ℓ(c_SFE(X))] ≤ H(X) + 2.

Let the distribution of X be (p₁, …, p_m). We partition [0, 1) into half-open intervals of lengths pᵢ. [For m = 3: [0, p₁), [p₁, p₁ + p₂), and [p₁ + p₂, p₁ + p₂ + p₃).] To encode i, take the midpoint of the interval of length pᵢ, write it in binary, and truncate to ⌈−log pᵢ⌉ + 1 bits. The truncated point always lies in the same interval, since truncation subtracts at most
2^{−(⌈−log pᵢ⌉ + 1)} ≤ 2^{−(−log pᵢ + 1)} = pᵢ/2.
Since ℓᵢ = ⌈−log pᵢ⌉ + 1 ≤ −log pᵢ + 2, we readily see E[ℓ(c_SFE(X))] ≤ H(X) + 2.
Suppose X takes values a, b, c with probabilities 0.6, 0.3, 0.1 respectively. The intervals are [0, 0.6), [0.6, 0.9), [0.9, 1). The midpoints are 0.3 = 0.010011…₂, 0.75 = 0.11₂, and 0.95 = 0.1111001…₂, so our code words are 01, 110, and 11110. It is clear this is not really optimal; however, it scales up well.

Consider another example where X takes values a, b, c with probabilities 1/2, 1/3, 1/6 respectively. Then the product distribution is a distribution over 9 outcomes; e.g., ab has probability 1/6. Compare the interval representations for the X and X² codes: in the latter, we simply partition each of the three half-open intervals of the former into three, for a total of nine intervals.

In general, to encode a sequence of n symbols, keep partitioning the intervals in the same way, and return any binary string identifying the small interval (e.g., the midpoint truncated to ⌈−log p_{Xⁿ}(xⁿ)⌉ + 1 bits). Note this procedure is linear in n.
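The single-symbol Shannon-Fano-Elias step is short enough to implement directly; this sketch reproduces the three-symbol example above (the function name sfe_code is ours):

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias: binary expansion of each interval midpoint,
    truncated to ceil(-log2 p_i) + 1 bits."""
    codes, cum = [], 0.0
    for p in probs:
        mid = cum + p / 2                  # midpoint of [cum, cum + p)
        bits = math.ceil(-math.log2(p)) + 1
        word, frac = "", mid
        for _ in range(bits):              # truncated binary expansion
            frac *= 2
            word += str(int(frac >= 1))
            frac -= int(frac)
        codes.append(word)
        cum += p
    return codes

print(sfe_code([0.6, 0.3, 0.1]))   # ['01', '110', '11110']
```

The output matches the code words 01, 110, 11110 computed by hand above.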
Asymmetric Numeral System (ANS)? Coping with distributions that are less uniform; some
probabilities very high. Skew binary representation? See paper.
5 Channel capacity
The communication problem:
Assumptions:
1. Messages are random variables uniformly distributed over {1, …, M}.
2. The channel has known statistical properties: for input X and output Y, we know P_{Y|X}. Justification: in real life we can take measurements.
3. Discrete time, discrete alphabet, memoryless channel. The statistics of P_{Y|X} do not change over time. [If X₁, X₂ are i.i.d. and input sequentially, then the outputs Y₁, Y₂ are i.i.d.]
Example 5.1 (Binary symmetric channel). The binary symmetric channel BSC(p) takes binary input x ∈ {0, 1} and outputs x or 1 − x with probability 1 − p and p respectively.
An (M, n) block code for the channel (X, P_{Y|X}, Y) uses n symbols from X to transmit one of M messages W ∈ {1, …, M}:
W → encoder → Xⁿ(W) → P_{Yⁿ|Xⁿ} → Yⁿ → decoder → Ŵ(Yⁿ).
The rate of an (M, n) code is
R = (log M)/n = (number of information bits)/(number of channel uses)  bits/channel use.
We also write an (M, n) code as a (2^{nR}, n) code.
A good code should have high rate and good reliability, but these are in conflict.
A rate R is achievable if there exists a sequence of (2^{nR}, n) codes such that
max_i P(Ŵ ≠ i | W = i) → 0.
We call this reliable communication.
Definition 5.2 (Operational definition of channel capacity). The capacity C of a channel P_{Y|X} is
C = sup{R : R is achievable}.
Consequently:
1. If R < C, then R is achievable. That is, for any ε > 0, there exists an n and a (2^{nR}, n) code with max_i P(Ŵ ≠ i | W = i) < ε.
2. If R > C, then R is not achievable. That is, there exists ε_c > 0 such that for any sequence of (2^{nR}, n) codes, lim inf_{n→∞} max_i P(Ŵ ≠ i | W = i) > ε_c.

Theorem 5.3 (Shannon's channel coding theorem).
C = max_{P_X} I(X; Y).
Properties of C:
1. C ≥ 0.
2. C ≤ min(log|X|, log|Y|). (Just note I(X; Y) ≤ H(X) ≤ log|X|.) Equality holds when the channel is a deterministic invertible map from X to Y with |Y| = |X|.
3. Recall I(X; Y) is concave in P_X for fixed P_{Y|X}, so computing C is a convex optimization problem.

Example 5.4 (Binary symmetric channel). Consider BSC(p). Note I(X; Y) = H(Y) − H(Y | X) = H(Y) − H(p) ≤ 1 − H(p) for any distribution on X. If X ~ Ber(1/2), then Y ~ Ber(1/2), which gives equality, so this is the maximizing distribution and C = 1 − H(p). For example, if p = 0.1, we have C ≈ 0.53.
Example 5.5 (Binary erasure channel). Consider the binary erasure channel BEC(p): input x ∈ {0, 1}, output x with probability 1 − p or e (erasure) with probability p. Letting E be the indicator of erasure, we have
I(X; Y) = H(Y) − H(Y | X) = H(Y) − H(p)
= H(Y, E) − H(p)
= H(E) + H(Y | E) − H(p)
= H(Y | E)
= p H(Y | E = 1) + (1 − p) H(Y | E = 0)
≤ 1 − p.
Taking X ~ Ber(1/2) gives equality again, so C = 1 − p. This makes sense: if we knew where the erasures were (a proportion p of the time), we could communicate perfectly on the rest. What is surprising is that we don't need to know where the erasures are.
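The BSC claim that Ber(1/2) maximizes I(X; Y) can be checked numerically by sweeping input distributions Ber(q) and evaluating I(X; Y) = H(Y) − H(p). A sketch (helper names are ours):

```python
import numpy as np

def binary_H(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def mutual_info_bsc(q, p):
    """I(X;Y) for BSC(p) with input X ~ Ber(q): H(Y) - H(p)."""
    y1 = q * (1 - p) + (1 - q) * p      # P(Y = 1)
    return binary_H(y1) - binary_H(p)

p = 0.1
qs = np.linspace(0.001, 0.999, 999)
best = max(mutual_info_bsc(q, p) for q in qs)
print(best, 1 - binary_H(p))   # maximum occurs at q = 1/2
```

The sweep confirms C = 1 − H(p) ≈ 0.53 for p = 0.1, attained at the uniform input.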
Converse sketch. For any code with error probability P_e^{(n)},
nR = H(W)
≤ I(W; Ŵ) + n P_e^{(n)} R + 1   (Fano)
≤ I(Xⁿ; Yⁿ) + n P_e^{(n)} R + 1   (data processing)
≤ Σ_{i=1}^n H(Yᵢ) − H(Yⁿ | Xⁿ) + n P_e^{(n)} R + 1   (chain rule + conditioning reduces entropy)
= Σ_{i=1}^n I(Xᵢ; Yᵢ) + n P_e^{(n)} R + 1.
Each step is tight for a capacity-achieving code:
The use of Fano's inequality says we should use the best estimator Ŵ of W.
The use of the data-processing inequality implies W ↦ Xⁿ and Yⁿ ↦ Ŵ should be close to bijective.
The independence bound implies the Yᵢ should be close to independent (i.e., the Xᵢ should be close to independent).
The last inequality implies that the distribution of Xⁿ is close to i.i.d. from argmax_{P_X} I(X; Y). [The capacity-achieving input distribution (CAID).]
The set A_ε^{(n)} of jointly typical sequences (xⁿ, yⁿ) with respect to a distribution P_{XY} is the set of n-sequences with empirical entropies ε-close to the true entropies:
A_ε^{(n)} = { (xⁿ, yⁿ) :
| −(1/n) log p_{Xⁿ,Yⁿ}(xⁿ, yⁿ) − H(X, Y) | < ε,
| −(1/n) log p_{Xⁿ}(xⁿ) − H(X) | < ε,
| −(1/n) log p_{Yⁿ}(yⁿ) − H(Y) | < ε }.
If a rate R is close to C, then the function Xⁿ(W) ought to look "random" (specifically, i.i.d. from P_X*, the optimizer of I(X; Y)). Well-known algebraic codes (Reed-Solomon, Golay, BCH) have a lot of structure and redundancy for the sake of simple decoding; this is at odds with the above intuition about capacity-achieving codes.
Today we will prove the achievability direction of the channel coding theorem.
Joint AEP:
1. P((Xⁿ, Yⁿ) ∈ A_ε^{(n)}(X, Y)) → 1.
2. |A_ε^{(n)}(X, Y)| ≤ 2^{n(H(X,Y)+ε)}.
3. If (X̃ⁿ, Ỹⁿ) ~ (P_X P_Y)ⁿ, then P((X̃ⁿ, Ỹⁿ) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y)−3ε)}.

Proof. The first statement follows by the law of large numbers; the second as before. The third follows by
P((X̃ⁿ, Ỹⁿ) ∈ A_ε^{(n)}) = Σ_{(xⁿ,yⁿ)∈A_ε^{(n)}} p(xⁿ) p(yⁿ)
≤ Σ_{(xⁿ,yⁿ)∈A_ε^{(n)}} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
= |A_ε^{(n)}| 2^{−n(H(X)+H(Y)−2ε)}
≤ 2^{−n(H(X)+H(Y)−H(X,Y)−3ε)}.
Proof of achievability in the channel coding theorem. We now prove that all rates R < C are achievable. This is a non-constructive proof using the probabilistic method: we show that some object in a finite class C satisfies property P by exhibiting a distribution over C, drawing an object from this distribution, and showing that it satisfies P with probability > 0. In our setting C is a set of (2^{nR}, n) codes and P is "the code has small probability of error."

We will use random coding. Fix some P_X, ε > 0, and R. Generate a (2^{nR}, n) code at random according to P_X. The codebook is
C = [xᵢ(w)] ∈ X^{2^{nR} × n}.
The wth row of C is the code word for the wth message. Each letter of each code word is generated i.i.d. from P_X, so
P(C) = ∏_{w=1}^{2^{nR}} ∏_{i=1}^{n} P_X(xᵢ(w)).
We have decided the encoding scheme. For decoding, we use typical set decoding: the receiver declares that ŵ was sent if both of the following hold.
1. (Xⁿ(ŵ), Yⁿ) ∈ A_ε^{(n)}(X, Y), that is, the received sequence Yⁿ is jointly typical with code word Xⁿ(ŵ).
2. No other index k ≠ ŵ also satisfies (Xⁿ(k), Yⁿ) ∈ A_ε^{(n)}(X, Y).
Otherwise, the decoder declares an error.
We now compute the expected probability of error for this scheme, averaged over codebooks C drawn from the above distribution. The probability of error for message i is
λᵢ(C) = P(Ŵ ≠ i | Xⁿ = Xⁿ(i)).
This probability is over the randomness of the channel, with the codebook fixed. The average [over messages] probability of error for a fixed code C is
P_e^{(n)}(C) = 2^{−nR} Σ_{w=1}^{2^{nR}} λ_w(C).
The average over all codes is
P(error) = Σ_C P(C) P_e^{(n)}(C)
= Σ_C P(C) 2^{−nR} Σ_{w=1}^{2^{nR}} λ_w(C)
= 2^{−nR} Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)
= Σ_C P(C) λ₁(C)    (rows of C are exchangeable)
≤ P(E₁ᶜ | W = 1) + Σ_{k=2}^{2^{nR}} P(E_k | W = 1)
= P((Xⁿ, Yⁿ) ∉ A_ε^{(n)}) + Σ_{k=2}^{2^{nR}} P((X̃ⁿ, Ỹⁿ) ∈ A_ε^{(n)})
≤ ε + 2^{nR} 2^{−n(I(X;Y)−3ε)}
for n large, where E_k is the event that Xⁿ(k) is jointly typical with Yⁿ. (For k ≠ 1, the code word Xⁿ(k) is independent of Yⁿ, so the pair is distributed like (X̃ⁿ, Ỹⁿ) in the joint AEP.) If R < I(X; Y) − 3ε, the right-hand side tends to 0; choosing P_X capacity-achieving proves achievability for all R < C.
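The random-coding argument can be illustrated by simulation. The sketch below draws a random Ber(1/2) codebook for a BSC and uses minimum-Hamming-distance decoding (a practical stand-in for typical-set decoding); all parameters are illustrative choices of ours. For R = 0.25 < C ≈ 0.53, the error rate should fall as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def error_rate(n, R, p, trials=200):
    """Random Ber(1/2) codebook on BSC(p); min-distance decoding of message 0."""
    M = max(2, int(round(2 ** (n * R))))         # 2^{nR} codewords
    errors = 0
    for _ in range(trials):
        book = rng.integers(0, 2, size=(M, n))
        y = book[0] ^ (rng.random(n) < p)        # transmit codeword 0
        dists = (book ^ y).sum(axis=1)
        if dists.argmin() != 0:                  # ties resolve in favor of 0
            errors += 1
    return errors / trials

for n in (16, 28, 40):
    print(n, error_rate(n, R=0.25, p=0.1))
```

This averages over both the channel noise and the codebook draw, exactly as in the proof.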
Aside. Let Xⁿ be i.i.d. Bern(1/2), and let Yⁿ be obtained by passing Xⁿ through BSC(α). [That is, Yᵢ equals Xᵢ with probability 1 − α and is flipped with probability α.] For any Boolean function b : {0,1}ⁿ → {0,1} (a one-bit function), is it true that I(b(Xⁿ); Yⁿ) ≤ 1 − H(α) = I(X₁; Y₁)? This is the "most informative function conjecture."
In the last few lectures we proved the channel coding theorem, which states that the channel capacity is max_{P_X} I(X; Y).

Converse: if R is achievable, then R ≤ C. This holds using either definition of achievability (maximal probability of error max_i P(Ŵ ≠ i | W = i), or average probability of error P(Ŵ ≠ W)).
Since messages are uniformly distributed,
P(Ŵ ≠ W) = 2^{−nR} Σᵢ P(Ŵ ≠ i | W = i).
By Markov's inequality,
#{ i : P(Ŵ ≠ i | W = i) ≥ 2 P(Ŵ ≠ W) } ≤ 2^{nR}/2.
If we throw away these bad messages, the new rate is
log(#code words)/n ≥ log(2^{nR}/2)/n = R − 1/n.
In the new code, all code words have probability of error ≤ 2 P(Ŵ ≠ W) by definition, so
max_i P(Ŵ′ ≠ i | W′ = i) ≤ 2 P(Ŵ ≠ W) → 0.
What is the rate of information sent from eye to brain? Measure the signal X entering the eye and the signal Y entering the brain, and estimate I(X; Y); this gives an upper bound on the rate. [Estimating I(X; Y) needs some shift to align the signals due to delay, quantized time, etc.]
Suppose we are the encoder sending Xᵢ through a channel, which delivers Yᵢ to a decoder. What if we get feedback, i.e., we see the Yᵢ that the decoder receives? We argued before that in the binary erasure channel, the capacity with feedback is the same. Is this true in general? [We know trivially that capacity with feedback is at least the capacity without feedback: just ignore the feedback.] More explicitly, with feedback the encoder sends Xᵢ, which can depend on W as well as X^{i−1} and Y^{i−1}.

Theorem 5.9. Feedback does not improve the capacity of a memoryless channel. However, it can simplify the encoding scheme, and it can get us to capacity much more quickly: P_e(best code) ≈ 2^{−nE} without feedback, but with feedback the error can decay doubly exponentially, P_e(best code) ≈ 2^{−2^{nE′}}.
Proof.
nR = H(W)
≤ I(W; Ŵ) + n P_e^{(n)} R + 1   (Fano)
≤ I(W; Yⁿ) + n P_e^{(n)} R + 1   (data processing, W → Yⁿ → Ŵ)
= H(Yⁿ) − H(Yⁿ | W) + n P_e^{(n)} R + 1
≤ Σᵢ (H(Yᵢ) − H(Yᵢ | Xᵢ)) + n P_e^{(n)} R + 1   (Yᵢ is conditionally independent of all past things given Xᵢ)
= Σᵢ I(Xᵢ; Yᵢ) + n P_e^{(n)} R + 1
≤ n max_{P_X} I(X; Y) + n P_e^{(n)} R + 1.
Thus R ≤ C.

Remark: in the channel coding theorem (without feedback) we used H(Yⁿ | Xⁿ) = Σᵢ H(Yᵢ | Xᵢ) because the Yᵢ were conditionally independent given the corresponding Xᵢ. This is no longer the case with feedback, which is why we condition on W above instead.
Preview of another problem (to be continued later). Consider the following compression example. [No channel.] Suppose Xⁿ and Yⁿ are correlated, we encode each separately, and a decoder receives both descriptions. (If the first encoder sends nR_X bits and the second sends nR_Y, then the total rate is R_X + R_Y.) How much worse is this than encoding (Xⁿ, Yⁿ) together? See Chapter 15.4.

We say (R_X, R_Y) is achievable if there exist sequences of functions f : Xⁿ → [2^{nR_X}], g : Yⁿ → [2^{nR_Y}], and φ : [2^{nR_X}] × [2^{nR_Y}] → Xⁿ × Yⁿ such that
P((X̂ⁿ, Ŷⁿ) ≠ (Xⁿ, Yⁿ)) → 0
as n → ∞, where (X̂ⁿ, Ŷⁿ) := φ(f(Xⁿ), g(Yⁿ)). The achievable rate region is the closure of the set of achievable rate pairs.

Recall that if R_X ≥ H(X) then we can send Xⁿ losslessly, so the achievable rate region certainly contains [H(X), ∞) × [H(Y), ∞). Considering the special case where we can encode everything together, the achievable rate region must lie in {R_X + R_Y ≥ H(X, Y)}.

The answer (Slepian-Wolf): exactly the rates satisfying R_X ≥ H(X | Y), R_Y ≥ H(Y | X), and R_X + R_Y ≥ H(X, Y).

In channel coding, high rate is desirable but hard because of the channel. In compression, low rate is desirable (communicate using fewer bits) but hard. That is why achievable rates in channel coding have an upper bound, while in compression they have a lower bound.
We will discuss Problem 4 on the midterm. (Xⁿ, Yⁿ) are drawn i.i.d. from some joint distribution. Yⁿ is encoded into f(Yⁿ) ∈ {0,1}^{nR}. The decoder receives both f(Yⁿ) and Xⁿ, and outputs an estimate Ŷⁿ(Xⁿ, f(Yⁿ)). The probability of error is P^{(n)} = P(Ŷⁿ(Xⁿ, f(Yⁿ)) ≠ Yⁿ). A counting argument over the jointly typical set A_ε^{(n)}(X, Y), which has about 2^{n(H(X,Y)+ε)} elements, shows that rates R ≥ H(Y | X) suffice.
6 Differential entropy

Why can differential entropy be negative? Consider approximating the integral −∫ f log f by −Σᵢ (1/N) f(i/N) log f(i/N). Consider [X]_N, a discretized version of X taking value i/N with probability f(i/N)/N. Then the above approximation to the integral is H([X]_N) − log N = H([X]_N) − H([U]_N), where [U]_N is the discretization of a Unif(0, 1) random variable. This is the "differential" in the name: it is in some sense the difference between the discrete entropies of quantized versions of X and U.
The relative entropy is D(P‖Q) = ∫ p(x) log (dP/dQ)(x) dx = E_P log (dP/dQ)(X). Note that we can rewrite this as ∫ (dP/dQ)(x) log (dP/dQ)(x) dQ(x). So differential entropy can be written as h(X) = −D(P ‖ dx), where "dx" denotes the Lebesgue measure.
Example 6.2 (Gaussian). Let f(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}. Then
h(X) = −∫ f(x) log f(x) dx
= (1/2) log(2πσ²) + log(e) ∫ f(x) (x²/(2σ²)) dx
= (1/2) log(2πσ²) + (1/2) log e
= (1/2) log(2πeσ²).
Joint differential entropy is h(X₁, …, Xₙ) = −∫ f log f, where f is the joint density.

Example 6.3 (Multivariate Gaussian). Let f(x) = (2π)^{−n/2} |K|^{−1/2} exp(−(1/2)(x − μ)ᵀ K^{−1} (x − μ)). Then
h(X) = −∫ f log f
= (1/2) log((2π)ⁿ |K|) + (1/2) log(e) E[(X − μ)ᵀ K^{−1} (X − μ)]
= (1/2) log((2π)ⁿ |K|) + (n/2) log e   (trace trick: E[(X − μ)ᵀ K^{−1} (X − μ)] = tr(K^{−1} K) = n)
= (1/2) log((2πe)ⁿ |K|).
Note that μ does not appear. This is because entropy is invariant to shifting.
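The closed form can be sanity-checked by numerical integration in one dimension (the grid limits and names below are arbitrary choices of ours):

```python
import numpy as np

def gaussian_h(sigma2):
    """h(X) = (1/2) log2(2*pi*e*sigma^2), in bits."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma2)

sigma2 = 2.0
x = np.linspace(-12.0, 12.0, 200001)
f = np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
g = -f * np.log2(f)                                          # integrand of h(X)
h_num = float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x)))   # trapezoid rule
print(gaussian_h(sigma2), h_num)
```

The numerical integral agrees with (1/2) log₂(2πe·σ²) to several decimal places; the tails beyond ±12 are negligible for σ² = 2.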
Chain rule: h(X, Y) = h(X) + h(Y | X).
Conditional differential entropy: h(Y | X) = −∫∫ f_{X,Y}(x, y) log f_{Y|X}(y | x) dy dx = −E log f_{Y|X}(Y | X).
Relative entropy: if supp(f) ⊆ supp(g), then D(f‖g) = ∫ f log(f/g) ≥ 0 by Jensen's inequality.
Scaling: h(aX) = h(X) + log|a|, and more generally, for a matrix A, h(AX) = h(X) + log|det A|.
Proposition 6.4 (Entropy power inequality). Let X and Y be independent random vectors on Rⁿ. Then
2^{2h(X+Y)/n} ≥ 2^{2h(X)/n} + 2^{2h(Y)/n}.

Suppose X is uniform on some set A, so f_X = (1/vol(A)) 1_A; then h(X) = log vol(A). Similarly, if Y is uniform on another set B, then h(Y) = log vol(B). The Brunn-Minkowski inequality states vol(A + B)^{1/n} ≥ vol(A)^{1/n} + vol(B)^{1/n}, where A + B is the Minkowski sum {a + b : a ∈ A, b ∈ B}. If we could take 2^{2h(X+Y)/n} ≈ vol(A + B)^{2/n} (this is not valid, since the convolved density of X + Y is not uniform on A + B), then the entropy power inequality would suggest a stronger inequality than Brunn-Minkowski.

Rough volume argument for the strong converse of channel coding: the number of typical yⁿ is about 2^{nH(Y)}, and for each xⁿ the number of conditionally typical yⁿ is about 2^{nH(Y|X)}. If these "noise balls" are disjoint, the number of distinguishable inputs is about 2^{nH(Y)}/2^{nH(Y|X)} = 2^{nI(X;Y)}. If R > C, the balls must overlap.
If X is discrete and Y is continuous, there is no notion of joint entropy; however, we can still talk about mutual information. For general random variables, we have another equivalent definition:
I(X; Y) = sup over finite partitions of I([X]; [Y]),
where [X], [Y] denote the quantized variables. Note that this is essentially the definition of Lebesgue integration via approximation by simple functions. For continuous Y, we recover the earlier definition:
I(X; Y) = sup H([Y]_N) − H([Y]_N | [X]_N) ≈ (h(Y) + H([U]_N)) − (h(Y | [X]_N) + H([U]_N)) → h(Y) − h(Y | X).

The fact that mutual information can be defined between a discrete and a continuous random variable is good in practice: consider a code word indexed by W ∈ [2^{nR}] being sent through a channel; such channels usually produce continuous output.
Last time we showed for X ~ N(μ, K) that h(X) = (1/2) log((2πe)ⁿ |K|).

Theorem. If Y is a random vector with covariance K, then
h(Y) ≤ (1/2) log((2πe)ⁿ |K|),
with equality if and only if Y is Gaussian. So the Gaussian distribution maximizes entropy under a second moment constraint.
Aside: Note that for the discrete case, uniform distribution over finite alphabet maximizes entropy
without moment conditions. It does not usually make sense to impose moment conditions since we
usually do not care about the values of X, unlike the continuous case. For nonnegative integers with a
mean constraint, geometric distribution maximizes entropy.
Proof. Let φ be the N(0, K) density (working in nats, write h_e and log_e; assume mean zero WLOG). Then
0 ≤ D_e(f ‖ φ) = ∫ f log_e (f/φ)
= −h_e(f) + E_{X∼f}[−log_e φ(X)]
= −h_e(f) + (1/2) log_e((2π)ⁿ |K|) + (1/2) E_{X∼f}[Xᵀ K^{−1} X]
= −h_e(f) + (1/2) log_e((2π)ⁿ |K|) + n/2   (trace trick: E[Xᵀ K^{−1} X] = n)
= (1/2) log_e((2πe)ⁿ |K|) − h_e(f)
= h_e(φ) − h_e(f).
7 Gaussian channel
The Gaussian channel takes input X and outputs Y = X + Z, where Z ~ N(0, σ²) is independent of X. We have sup_{P_X} I(X; Y) = ∞ because we have no constraints on X: we could spread the distribution of X so widely that the Gaussian Z barely perturbs it, and Y can easily be decoded.

We consider instead sup_{P_X : EX² ≤ P} I(X; Y). [WLOG we assume X is zero mean; shifting does not change anything.] The supremum is achieved by X ~ N(0, P), giving
C = (1/2) log(1 + P/σ²),
where P/σ² is the signal-to-noise ratio (SNR).
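A numerical sanity check of C = (1/2) log(1 + P/σ²): estimate h(Y) from samples via a histogram and use I(X; Y) = h(Y) − h(Z), with h(Z) in closed form. The sample size and bin count below are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
P, sigma2 = 4.0, 1.0
C = 0.5 * np.log2(1 + P / sigma2)          # capacity, bits per channel use

# Monte Carlo: Y = X + Z with X ~ N(0, P), Z ~ N(0, sigma2).
N = 2_000_000
y = rng.normal(0, np.sqrt(P), N) + rng.normal(0, np.sqrt(sigma2), N)
counts, edges = np.histogram(y, bins=400, density=True)
w = np.diff(edges)
mask = counts > 0
h_y = -np.sum(counts[mask] * np.log2(counts[mask]) * w[mask])  # histogram h(Y)
h_z = 0.5 * np.log2(2 * np.pi * np.e * sigma2)                 # exact h(Z)
print(C, h_y - h_z)    # the two should roughly agree
```

The histogram estimate is slightly biased by binning, but with these settings it matches C to a couple of decimal places.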
Converse:
nR = H(W) ≤ I(Xⁿ; Yⁿ) + n P_e^{(n)} R + 1   (data processing, Fano),
so
R ≤ (1/n) Σᵢ I(Xᵢ; Yᵢ) + εₙ.
We cannot simply take a supremum over P_X because of the constraint ‖xⁿ(w)‖² ≤ nP. Instead we use the fact that for fixed P_{Y|X}, the map P_X ↦ I(X; Y) is concave. Define P̄_X = (1/n) Σ_{i=1}^n P_{Xᵢ}, with corresponding output P̄_Y = ∫ P_{Y|X} dP̄_X. Then
(1/n) Σᵢ I(Xᵢ; Yᵢ) ≤ I(X̄; Ȳ) ≤ max_{P_X : EX² ≤ P} I(X; Y)
because E X̄² = (1/n) Σᵢ E Xᵢ² ≤ P. With a little more work, the constraint ‖xⁿ(w)‖² ≤ nP could be relaxed to hold on average over messages w.
Let X₁, X₂, … be i.i.d. with density f, mean zero, and second moment equal to 1. The central limit theorem states
Sₙ := (1/√n) Σ_{i=1}^n Xᵢ →d N(0, 1).
Note that the Sₙ all have the same second moment 1. We suspect h(Sₙ) is increasing, since the limiting distribution is Gaussian, the entropy maximizer under this moment constraint.

Recall the entropy power inequality. Since h(aX) = h(X) + log a, it implies
2^{2h((X₁+X₂)/√2)} = (1/2) 2^{2h(X₁+X₂)} ≥ (1/2)(2^{2h(X₁)} + 2^{2h(X₂)}) = 2^{2h(X₁)},
so h(S₂) ≥ h(S₁). [Monotonicity along the full sequence is harder; reference: Dembo, Cover.]

The EPI is equivalent to: for independent X, Y and λ ∈ [0, 1],
h(√λ X + √(1−λ) Y) ≥ λ h(X) + (1−λ) h(Y).
(To relate the two forms, let U = √λ X and V = √(1−λ) Y.)
We will be using results from optimal transport theory (but we will derive things from scratch). This is an adaptation of Olivier Rioul's proof (2016). We prove the equivalent form
h(√λ X + √(1−λ) Y) ≥ λ h(X) + (1−λ) h(Y),  λ ∈ (0, 1),
for independent X, Y with nonvanishing densities.

Let F_X and F_Y be the CDFs of X and Y respectively. If X* ~ N(0, 1), then Φ(X*) is uniform on [0, 1], and F_X^{−1}(Φ(X*)) is distributed as X. So let T_X = F_X^{−1} ∘ Φ, so that T_X(X*) is distributed as X. Note that T_X is an increasing function. [We assumed the densities are nonvanishing, so the CDFs are strictly increasing.] Thus T_X′ > 0. Define T_Y similarly.

Let X*, Y* ~ N(0, 1) be i.i.d. Then (T_X(X*), T_Y(Y*)) = (X, Y) in distribution. Write the rotation
X̃ = √λ X* + √(1−λ) Y*,  Ỹ = −√(1−λ) X* + √λ Y*,
so that X̃, Ỹ are again i.i.d. N(0, 1) [the standard Gaussian pair is rotation invariant], and
X* = √λ X̃ − √(1−λ) Ỹ,  Y* = √(1−λ) X̃ + √λ Ỹ.
Then
√λ X + √(1−λ) Y = √λ T_X(X*) + √(1−λ) T_Y(Y*) =: g(X̃, Ỹ)  in distribution.
For fixed ỹ, the map x̃ ↦ g(x̃, ỹ) has derivative
∂g/∂x̃ = λ T_X′(X*) + (1−λ) T_Y′(Y*) =: Δ > 0,
so it is increasing, and the change-of-variables formula for monotone maps gives
h(g(X̃, Ỹ) | Ỹ) = h(X̃) + E log Δ.
Since conditioning reduces entropy,
h(√λ X + √(1−λ) Y) ≥ h(g(X̃, Ỹ) | Ỹ)
= h(X̃) + E log(λ T_X′(X*) + (1−λ) T_Y′(Y*))
≥ h(X̃) + λ E log T_X′(X*) + (1−λ) E log T_Y′(Y*)   (concavity of the logarithm)
= λ (h(X*) + E log T_X′(X*)) + (1−λ) (h(Y*) + E log T_Y′(Y*))   (using h(X̃) = h(X*) = h(Y*))
= λ h(X) + (1−λ) h(Y).
The last step is due to the change of variables f_{X*}(x*) = f_X(T_X(x*)) T_X′(x*), which gives
h(X*) + E log T_X′(X*) = −E log f_{X*}(X*) + E log T_X′(X*) = −E log f_X(T_X(X*)) = h(X),
and similarly for Y.
To relax the condition that the densities of X and Y are nonvanishing, we use the approximation h(X + √ε Z) → h(X) as ε → 0 (with Z standard Gaussian).

We proved the entropy power inequality in dimension 1. In higher dimensions it takes a similar form; the constants do not degrade, so the inequality is dimension-free.
The n-dimensional statement
2^{2h(Xⁿ+Yⁿ)/n} ≥ 2^{2h(Xⁿ)/n} + 2^{2h(Yⁿ)/n}
follows by induction: condition on the last coordinates and apply the (n−1)-dimensional EPI conditionally,
2^{2h(X^{n−1}+Y^{n−1} | Xₙ+Yₙ)/(n−1)} ≥ 2^{2h(X^{n−1}|Xₙ)/(n−1)} + 2^{2h(Y^{n−1}|Yₙ)/(n−1)},
then combine with the one-dimensional inequality for h(Xₙ + Yₙ) using the concavity of log-sum-exp.
Consider an adversarial channel that sees the distribution of X, chooses the distribution of Z, and outputs Y = X + Z. We want to find the capacity
sup_{P_X} inf_{P_Z} I(X; X + Z)
subject to EX² ≤ σ_X² and EZ² ≤ σ_Z². The answer is
(1/2) log(1 + σ_X²/σ_Z²),
the Gaussian channel capacity. Thus we have the Gaussian channel saddle point property:
(1/2) log(1 + σ_X²/σ_Z²) = sup_{P_X} inf_{P_Z} I(X; X + Z) = inf_{P_Z} sup_{P_X} I(X; X + Z).
In other words, with X*, Z* Gaussian,
I(X; X + Z*) ≤ I(X*; X* + Z*) ≤ I(X*; X* + Z).
So Gaussian noise is the worst noise under a power constraint: non-Gaussian channels have higher capacity.

Getting Gaussian codes to work for any channel: perform a unitary transformation before sending to the channel, then perform the inverse on the output. CLT?
9 Rate distortion

We now study lossy compression: the tradeoff between the rate (length of the representation) and the fidelity of the reconstruction.

To measure fidelity, we need some distortion function (measure) d : X × X̂ → R₊. For convenience we sometimes consider bounded distortion functions, which satisfy d_max := max_{x,x̂} d(x, x̂) < ∞. For example, squared error / quadratic loss d(x, x̂) = (x − x̂)² is unbounded on R². Most results generalize to the unbounded case.

We extend d to sequences by d(xⁿ, x̂ⁿ) = (1/n) Σ_{i=1}^n d(xᵢ, x̂ᵢ), the average per-symbol distortion.

We have an encoding function fₙ : Xⁿ → [2^{nR}] and a decoding function gₙ : [2^{nR}] → X̂ⁿ. The central quantity is the expected distortion
E d(Xⁿ, gₙ(fₙ(Xⁿ))) = Σ_{xⁿ} p(xⁿ) d(xⁿ, gₙ(fₙ(xⁿ))).
A rate-distortion pair (R, D) is achievable if there exists a sequence of (2^{nR}, n) codes (fₙ, gₙ) such that
lim_{n→∞} E d(Xⁿ, gₙ(fₙ(Xⁿ))) ≤ D.
Note that if (R; D) is achievable, then (R; D0) and (R0; D) are also achievable if D0 D and R0 R.
The achievable rate distortion region is also convex: if you have two codes, you can flip a coin to choose
which code to use for a particular block. The rate and distortion will just be the convex combinations.
The point where the boundary of the achievable region hits the D = 0 axis is (H(X), 0). Where the boundary hits R = 0 is (0, min_{x̂} E d(X, x̂)) (output the least offending guess on average). If d_max exists, it is larger than this minimum.

Note that the achievability definition can be restated as: P(d(Xⁿ, gₙ(fₙ(Xⁿ))) > D + ε) → 0 for all ε > 0. (???)
The rate-distortion function is R(D) = inf{R : (R, D) achievable}, the lower boundary of the achievable rate-distortion region.

Theorem 9.1. For Xᵢ ~ P_X i.i.d., |X|, |X̂| < ∞, and distortion bounded by d_max,
R(D) = min_{P_{X̂|X} : E d(X,X̂) ≤ D} I(X; X̂).
In channel coding, the channel is fixed and we control the distribution over the input. In this setting, the source P_X is given, and we control the encoding/decoding scheme that gives the output.

Recall that P_{X̂|X} ↦ I(X; X̂) is convex for fixed P_X, so this is a convex optimization problem. [Note that the constraint on the expected distortion is linear in P_{X̂|X}: E d(X, X̂) = Σ_{x,x̂} p(x) p(x̂ | x) d(x, x̂).]
In channel coding, if we did not choose too many inputs, their typical images under the channel would hopefully be disjoint. This is a packing argument, and we could use a volume argument to estimate how many inputs we can send.

In our setting, we have some subset of X̂ⁿ of size 2^{nR}, and we consider the "preimages" of these outputs x̂ⁿ(i): the inputs within distortion D of them. We want enough points in the output space so that their "preimages" cover Xⁿ. This is in some sense a dual of channel coding: covering rather than packing. Note that R controls the number of "preimages," and with lower R it becomes harder to cover; D controls the size of each "preimage," and lower D makes it harder to cover.
Example 9.2. Let X \sim \mathrm{Ber}(p) with p \le 1/2. Let \oplus be the XOR operation, and consider the Hamming distortion.
Lower bounding I(X; \hat X) subject to E d(X, \hat X) \le D gives
I(X; \hat X) = H(X) - H(X | \hat X)
  = H(p) - H(X \oplus \hat X | \hat X)
  \ge H(p) - H(X \oplus \hat X)
  \ge H(p) - H(D).
H(p) - H(D) is the mutual information corresponding to the BSC with transition probability D taking some input \hat X and having output X with distribution (p, 1-p). Indeed, I(X; \hat X) = H(X) - H(X | \hat X) = H(p) - H(D).
How do we use this to achieve this rate in our original problem? Put \hat X through a BSC(D) channel so that the output X has distribution (p, 1-p). Using Bayes's rule gives P(\hat x = 0) = \frac{1 - p - D}{1 - 2D}. So,
R(D) = \begin{cases} H(p) - H(D), & 0 \le D \le \min(p, 1-p), \\ 0, & \text{otherwise}. \end{cases}
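The closed form above is easy to evaluate numerically. A minimal sketch (function names are mine, not from the notes):

```python
import math

def H2(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R_bernoulli(p, D):
    """Rate distortion function of a Ber(p) source (p <= 1/2), Hamming distortion:
    R(D) = H(p) - H(D) for 0 <= D < p, and 0 otherwise."""
    return H2(p) - H2(D) if 0 <= D < p else 0.0

print(R_bernoulli(0.5, 0.11))   # ≈ 0.5: allowing 11% bit errors halves the rate
```

Note how quickly the required rate drops: for a fair-coin source, tolerating 11% distortion already saves half a bit per symbol.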
Example 9.3. Consider a Gaussian random variable X \sim N(0, \sigma^2). How would we do a one-bit quantization?
Let f(X) := \hat x \, 1[X \ge 0] - \hat x \, 1[X < 0]. If \hat x = \sigma \sqrt{2/\pi}, then
E(X - f(X))^2 = \sigma^2 \left(1 - \frac{2}{\pi}\right) \approx 0.36 \, \sigma^2.
We want to find R(D) = \min_{P_{\hat X|X} : E(X - \hat X)^2 \le D} I(X; \hat X).
I(X; \hat X) = h(X) - h(X | \hat X)
  = h(X) - h(X - \hat X | \hat X)
  \ge h(X) - h(X - \hat X)
  \ge \frac{1}{2} \log(2\pi e \sigma^2) - \frac{1}{2} \log(2\pi e D)
  = \frac{1}{2} \log \frac{\sigma^2}{D}.
We have shown a lower bound. Is equality attained?
Recall the theorem about the Gaussian channel: if Y = X + Z where X \sim N(0, P) and Z \sim N(0, N), then I(X; Y) = \frac{1}{2} \log \frac{N + P}{N}.
Consider a Gaussian test channel X = \hat X + Z where \hat X \sim N(0, \sigma^2 - D) and Z \sim N(0, D), so that X \sim N(0, \sigma^2). Then I(X; \hat X) = \frac{1}{2} \log \frac{\sigma^2}{D}. So, R(D) = \frac{1}{2} \log \frac{\sigma^2}{D}.
Now compare with our one-bit quantizer. With the same fidelity constraint D = 0.36 \sigma^2, the optimal rate is R(0.36 \sigma^2) = \frac{1}{2} \log \frac{1}{0.36} \approx 0.737. This is a significant improvement over the rate 1 of the one-bit quantizer. It is suboptimal to quantize on a symbol-by-symbol basis; we gain by quantizing on blocks.
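A quick numeric check of this comparison (a sketch; the quantizer's distortion is estimated by Monte Carlo):

```python
import math, random

sigma2 = 1.0
xhat = math.sqrt(2 / math.pi)          # optimal one-bit reproduction level
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
# distortion of the one-bit quantizer sign(X) * xhat
mse = sum((x - math.copysign(xhat, x)) ** 2 for x in xs) / len(xs)
# rate of the optimal block code at the same distortion: R(D) = (1/2) log2(sigma^2 / D)
R = 0.5 * math.log2(sigma2 / mse)
print(mse, R)   # mse ≈ 1 - 2/pi ≈ 0.36, and R ≈ 0.73 bits < 1 bit
```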
Suppose X \sim N(0, 1) is the input to a neural network which outputs \hat X. Suppose we have E(X - \hat X)^2 = 1/2. Then the information flow through any layer is at least R(1/2) = \frac{1}{2} \log 2 = 1/2 bit.
We now prove the theorem.
Proof of achievability. Fix P_{\hat X|X} such that E d(X, \hat X) \le D. We want to show that there exists a sequence of (2^{nR}, n) codes with rate R \ge I(X; \hat X) that achieve E[d(X^n, g_n(f_n(X^n)))] \to D as n \to \infty.
We define a distortion-typical set:
A_{d,\epsilon}^{(n)} := \{ (x^n, \hat x^n) :
  |-\tfrac{1}{n} \log p(x^n, \hat x^n) - H(X, \hat X)| < \epsilon,
  |-\tfrac{1}{n} \log p(x^n) - H(X)| < \epsilon,
  |-\tfrac{1}{n} \log p(\hat x^n) - H(\hat X)| < \epsilon,
  |d(x^n, \hat x^n) - E d(X, \hat X)| < \epsilon \}
= A_\epsilon^{(n)}(X, \hat X) \cap \{ (x^n, \hat x^n) : |d(x^n, \hat x^n) - E d(X, \hat X)| < \epsilon \}.
1. We have P(A_{d,\epsilon}^{(n)}) \to 1, since the two sets in the intersection each have probability tending to 1, both by the weak law of large numbers.
2. p(\hat x^n) \ge p(\hat x^n | x^n) \, 2^{-n(I(X; \hat X) + 3\epsilon)} for all (x^n, \hat x^n) \in A_{d,\epsilon}^{(n)}, since
p(\hat x^n | x^n) = \frac{p(x^n, \hat x^n)}{p(x^n)} = p(\hat x^n) \, \frac{p(x^n, \hat x^n)}{p(x^n) p(\hat x^n)} \le p(\hat x^n) \, 2^{n(H(X) + H(\hat X) - H(X, \hat X) + 3\epsilon)} = p(\hat x^n) \, 2^{n(I(X; \hat X) + 3\epsilon)}.
3. If 0 \le x, y \le 1 and n \ge 0, then (1 - xy)^n \le 1 - x + e^{-yn}. To see this, note x \mapsto (1 - xy)^n is convex and nonincreasing for fixed y, while x \mapsto 1 - x + e^{-yn} is linear, so it suffices to check the endpoints x = 0 and x = 1; at x = 1 we use 1 - y \le e^{-y}. [See Lemma 10.5.3 in [1].]
We describe the random code.
1. Codebook generation: draw 2^{nR} codewords \hat X^n(i), i = 1, \dots, 2^{nR}, i.i.d. according to p(\hat x^n) = \prod_{k=1}^n p(\hat x_k).
2. Typical set encoding: for each X^n, select i such that (X^n, \hat X^n(i)) \in A_{d,\epsilon}^{(n)} if possible. If there are multiple such i, break ties arbitrarily. If no such i exists, then send i = 1; this happens with probability P_e.
Then,
E d(X^n, \hat X^n(i)) \le (1 - P_e)(D + \epsilon) + P_e d_{\max} \le D + \epsilon + P_e d_{\max}.
[The d_{\max} term can be relaxed to \min_{\hat x} E d(X, \hat x).]
For a fixed x^n, the probability that a single random codeword fails to cover it is
P((x^n, \hat X^n) \notin A_{d,\epsilon}^{(n)}) = 1 - \sum_{\hat x^n} p(\hat x^n) \, 1_{A_{d,\epsilon}^{(n)}}(x^n, \hat x^n),
so by independence of the 2^{nR} codewords,
P(\nexists i : (x^n, \hat X^n(i)) \in A_{d,\epsilon}^{(n)}) = \left[ 1 - \sum_{\hat x^n} p(\hat x^n) \, 1_{A_{d,\epsilon}^{(n)}}(x^n, \hat x^n) \right]^{2^{nR}}.
Averaging over X^n,
P_e = \sum_{x^n} p(x^n) \left[ 1 - \sum_{\hat x^n} p(\hat x^n) \, 1_{A_{d,\epsilon}^{(n)}}(x^n, \hat x^n) \right]^{2^{nR}}
\le \sum_{x^n} p(x^n) \left[ 1 - 2^{-n(I(X; \hat X) + 3\epsilon)} \sum_{\hat x^n} p(\hat x^n | x^n) \, 1_{A_{d,\epsilon}^{(n)}}(x^n, \hat x^n) \right]^{2^{nR}}  (by property 2)
\le \sum_{x^n} p(x^n) \left[ 1 - \sum_{\hat x^n} p(\hat x^n | x^n) \, 1_{A_{d,\epsilon}^{(n)}}(x^n, \hat x^n) \right] + \exp\left(-2^{-n(I(X; \hat X) + 3\epsilon)} \, 2^{nR}\right)  (by the inequality in property 3)
= P((X^n, \hat X^n) \notin A_{d,\epsilon}^{(n)}) + \exp\left(-2^{n(R - I(X; \hat X) - 3\epsilon)}\right).
The first term tends to 0 by property 1, and the second term tends to 0 provided R > I(X; \hat X) + 3\epsilon. Thus P_e \to 0 and E d(X^n, \hat X^n) \le D + \epsilon + P_e d_{\max} \to D + \epsilon.
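The random-coding argument can be watched numerically at small blocklengths. A toy sketch for a Ber(1/2) source under Hamming distortion; as an illustrative simplification I use nearest-codeword encoding instead of the typical-set encoding in the proof:

```python
import random

def avg_distortion(n, R, trials, seed=0):
    """Per-symbol Hamming distortion of nearest-codeword encoding with a random
    codebook of 2^{nR} uniformly random binary codewords (Ber(1/2) source)."""
    rng = random.Random(seed)
    book = [rng.getrandbits(n) for _ in range(2 ** round(n * R))]
    total = 0
    for _ in range(trials):
        x = rng.getrandbits(n)
        # distance to the nearest codeword = achieved distortion for this block
        total += min(bin(x ^ c).count("1") for c in book)
    return total / (trials * n)

# Higher rate => smaller distortion (R(D) = 1 - H(D) for this source).
d_low_rate = avg_distortion(n=20, R=0.25, trials=100)
d_high_rate = avg_distortion(n=20, R=0.60, trials=100)
print(d_low_rate, d_high_rate)   # the second is noticeably smaller
```

Even at n = 20 the covering effect is visible: quadrupling the rate sharply shrinks the distance to the nearest random codeword.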
Recall the rate distortion setup. We give X^n (drawn i.i.d. from some P_X) to an encoder, who sends nR bits to a decoder, who then outputs \hat X^n, which hopefully has distortion E d(X^n, \hat X^n) \le D.
We want to characterize R(D) = \inf\{R : (R, D) \text{ achievable}\}, the lowest possible rate at which it is possible to obtain [asymptotically] expected distortion D.
It turns out that
R(D) = \min_{P_{\hat X|X} : E d(X, \hat X) \le D} I(X; \hat X).
Last time, we proved achievability: if R > R(D), then there exists a sequence of (2^{nR}, n) codes with \lim_{n \to \infty} E d(X^n, \hat X^n) \le D.
Sketch: if R > R(D), then we have chosen enough codewords \hat X^n(i) so that the union of the corresponding distortion balls is so large that the probability of error is small.
We now turn to proving the converse. First, we need the following lemma.
Lemma 9.4. The function D \mapsto \min_{P_{\hat X|X} : E d(X, \hat X) \le D} I(X; \hat X) is convex.
We proved convexity of the operational definition of R(D) last time by showing the set of achievable pairs (R, D) is convex. However, this lemma asserts convexity of the quantity that we have yet to prove is equal to R(D).
Proof. Let P^{(0)}_{\hat X|X} achieve R(D_0), and let P^{(1)}_{\hat X|X} achieve R(D_1).
Define P^{(\lambda)}_{\hat X|X} := (1 - \lambda) P^{(0)}_{\hat X|X} + \lambda P^{(1)}_{\hat X|X}. Since the distortion constraint is linear in P_{\hat X|X}, the expected distortion under P^{(\lambda)}_{\hat X|X} is at most (1 - \lambda) D_0 + \lambda D_1.
Since P_{\hat X|X} \mapsto I(X; \hat X) is convex for fixed P_X,
\min_{P_{\hat X|X} : E d(X, \hat X) \le (1-\lambda) D_0 + \lambda D_1} I(X; \hat X) \le I^{(\lambda)}(X; \hat X) \le (1 - \lambda) R(D_0) + \lambda R(D_1),
i.e., R((1-\lambda) D_0 + \lambda D_1) \le (1 - \lambda) R(D_0) + \lambda R(D_1).
We now prove the converse: if R < R(D), then (R, D) is not achievable by any scheme.
Proof of the converse. Consider a (2^{nR}, n) code with encoder f_n and decoder g_n that achieves E d(X^n, \hat X^n) \le D. Then
nR \ge H(\hat X^n)
  \ge I(\hat X^n; X^n)
  = H(X^n) - H(X^n | \hat X^n)
  = \sum_{i=1}^n [H(X_i) - H(X_i | \hat X^n, X_1, \dots, X_{i-1})]
  \ge \sum_{i=1}^n [H(X_i) - H(X_i | \hat X_i)]
  = \sum_{i=1}^n I(X_i; \hat X_i)
  \ge \sum_{i=1}^n R(E d(X_i, \hat X_i))
  \ge n R\left( \frac{1}{n} \sum_{i=1}^n E d(X_i, \hat X_i) \right)  (convexity of R and Jensen)
  = n R(E d(X^n, \hat X^n)) \ge n R(D),
the last step since R is nonincreasing.
Remark: nR \ge I(X^n; \hat X^n) could be deduced directly by data processing, since there is a "bottleneck" of nR bits. Also, H(\hat X^n | X^n) = 0, i.e., \hat X^n is a deterministic function of X^n; this shows that randomized decoding doesn't help. Finally, the step H(X_i | \hat X^n, X_1, \dots, X_{i-1}) \le H(X_i | \hat X_i) holds because conditioning reduces entropy.
We now consider joint source channel coding. Let V^m be the observation (i.i.d. from P_V). We encode V^m into X^n, which gets sent through a discrete memoryless channel (DMC) P_{Y|X}. The channel output Y^n is then decoded into a reproduction \hat V^m of the original observation. We would like E d(V^m, \hat V^m) \le D.
Theorem 9.5. Distortion D is achievable if and only if R(D) \le BC, where B = n/m is the bandwidth mismatch (channel uses per source symbol).
This is a separation result, in that the following scheme is optimal. Do the rate-distortion optimal encoding to mR(D) bits, which are [approximately] uniformly distributed; this is what the channel likes. Use a channel code at rate C and send it through the channel. Finally, do the corresponding channel decoding and the rate-distortion decoding.
The "if" part is simple: the rate is lower than the capacity, so everything works. For the converse,
nC \ge I(X^n; Y^n)  (channel coding converse)
  \ge I(V^m; \hat V^m)  (data processing)
  \ge m R(D).  (rate distortion converse)
Given a tree t on the vertices \{1, \dots, n\}, we approximate a joint distribution P by
P_t(x) = \prod_{i=1}^n P(x_i | x_{j(i)}),
where j(i) is the parent of i. Note that these approximations use the same P that we are estimating, so for each tree we have an explicit approximation. We are not approximating P with an arbitrary tree-structured distribution.
We want to optimize
\min_{t \in T_n} D(P \| P_t).
Note |T_n| = n^{n-2} (Cayley's formula). Also, D(P \| P_t) \ge \frac{2}{\ln 2} \| P - P_t \|_{TV}^2 by Pinsker's inequality, so a small objective means P_t is close to P.
In other words, if we consider the complete graph on n vertices with edge weights I(X_i; X_j), then the maximum-weight spanning tree is the maximum-dependence tree.
Theorem 10.1. t^* \in \arg\min_{t \in T_n} D(P \| P_t) if and only if t^* is a maximum-weight dependence tree, i.e., it maximizes \sum_{i=1}^n I(X_i; X_{j(i)}).
Proof.
D(P \| P_t) = \sum_x P(x) \log \frac{P(x)}{P_t(x)}
  = -H(X_1, \dots, X_n) - \sum_{i=1}^n \sum_x P(x) \log \frac{P(x_i | x_{j(i)})}{P(x_i)} - \sum_{i=1}^n \sum_x P(x) \log P(x_i)
  = -H(X_1, \dots, X_n) - \sum_{i=1}^n I(X_i; X_{j(i)}) + \sum_{i=1}^n H(X_i).
Only the middle term depends on t, so \arg\min_t D(P \| P_t) = \arg\max_t \sum_{i=1}^n I(X_i; X_{j(i)}).
If we want to do this approximation, we just need to estimate the O(n^2) mutual informations, rather than the O(2^n) probabilities (if the x_i are binary).
The maximum likelihood estimator of the true tree is the plug-in estimator: estimate the mutual informations using data and take the empirical maximum-weight spanning tree.
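The plug-in procedure is short enough to sketch end to end: estimate pairwise mutual informations from data, then run a maximum-weight spanning tree algorithm (Prim's, here). The function names and the toy Markov-chain data are my own illustration:

```python
import math, random
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (bits) between two symbol sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def chow_liu_edges(data):
    """data: list of samples, each a tuple of symbols. Returns the edges of the
    maximum-weight spanning tree with weights I(X_i; X_j), via Prim's algorithm."""
    cols = list(zip(*data))
    d = len(cols)
    w = {(i, j): mutual_information(cols[i], cols[j])
         for i, j in combinations(range(d), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: w[tuple(sorted(e))])
        in_tree.add(j)
        edges.append((i, j))
    return edges

# Markov chain X0 -> X1 -> X2: the recovered tree should be the path 0-1-2.
rng = random.Random(1)
data = []
for _ in range(5000):
    x0 = rng.random() < 0.5
    x1 = x0 if rng.random() < 0.9 else not x0
    x2 = x1 if rng.random() < 0.9 else not x1
    data.append((x0, x1, x2))
print(chow_liu_edges(data))
```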
Why does relative entropy end up being so nice here? The chain rule / factorization is one possible reason.
Our problem is as follows. We have samples X^{(1)}, \dots, X^{(m)} and we want to estimate the dependence tree. We estimate the mutual informations I(X_i; X_j) and find the empirical max-weight tree.
Recall I(X; Y) = H(X) + H(Y) - H(X, Y). So it suffices to estimate entropies.
How do we estimate H(P) given n i.i.d. samples drawn from P?
Classical statistics. Suppose |\mathcal{X}| = S is fixed. Find the optimal estimator of H(P) as n \to \infty. Let P_n denote the empirical distribution; the plug-in estimator is
H(P_n) = -\sum_i \hat p_i \log \hat p_i,
where \hat p_i is the relative frequency of symbol i in the data.
H(P_n) is the MLE for H(P), and it is asymptotically efficient (asymptotically attains equality in the Cramér-Rao bound) by Hájek-Le Cam theory.
What about non-asymptotics? What if n is not “huge” relative to the alphabet size S?
Decision theoretic framework. For an estimator \hat H_n, the worst-case risk is
R_n^{\max}(\hat H_n) = \sup_{P \in M_S} E_P (H(P) - \hat H_n)^2,
where M_S is the set of distributions on an alphabet of size S.
If n \gg S (e.g., n \to \infty while S is fixed), then the bias term is small and we get the classical asymptotics. Otherwise, the bias term becomes important. So, consistency of the MLE uniformly over M_S is equivalent to n = \omega(S).
Next time, we show we can do better.
Recall we are trying to estimate H(P) for some distribution P, using i.i.d. samples X^n. The MLE is H(P_n), where P_n is the empirical distribution.
From classical statistics,
E_P (H(P) - H(P_n))^2 \approx \frac{\mathrm{Var}(-\log P(X))}{n}
as n \to \infty. If S is the support size, this suggests the sample complexity is \Theta((\ln S)^2). However, this is only valid in the asymptotic regime n \to \infty while the support size is fixed.
If the support has size S,
\sup_{P \in M_S} E_P (H(P) - H(P_n))^2 \asymp \frac{S^2}{n^2} + \frac{(\ln S)^2}{n}.
So the sample complexity of the MLE is actually \Theta(S). If n is not large enough, the bias term (the first term) is too large. There is a phase transition: if the sample size is \gtrsim S, then the risk is nearly zero; if it is less, then the risk is high.
Can we do better than the MLE? Valiant and Valiant showed that the [minimax] phase transition for entropy estimation occurs instead at \Theta(S / \ln S).
We get an "effective sample size enlargement" phenomenon: this result implies that the risk with n samples matches the error of the MLE with n \log n samples.
Note that entropy is separable: H(P) = -\sum_i p_i \log p_i = \sum_i f(p_i), where f(x) = -x \log x.
Recall the issue with the MLE is large bias when n is not large enough. The plug-in estimator is H(P_n) = -\sum_i \hat p_i \log \hat p_i. Consider the function f(x) = -x \log x. If \hat p_i \approx p_i and p_i is near 1/2, then f(\hat p_i) \approx f(p_i) because the slope of f is low there. However, near zero f has infinite slope, so f(p_i) and f(\hat p_i) can differ greatly.
To fix this, we divide [0, 1] into a smooth regime (\log n / n, 1] and a non-smooth regime [0, \log n / n].
For the smooth regime, we use the bias-corrected estimate f(\hat p_i) - \frac{1}{2n} f''(\hat p_i) \hat p_i (1 - \hat p_i).
For the non-smooth regime [0, \log n / n], use the best (sup-norm) polynomial approximation of f of order \log n.
Note that this estimation procedure does not depend on S.
We compare the L^2 rates:
minimax: \frac{S^2}{(n \log n)^2} + \frac{(\ln S)^2}{n},
MLE: \frac{S^2}{n^2} + \frac{(\ln S)^2}{n}.
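The bias correction above is easy to see in action. A sketch comparing the plug-in estimator with its first-order bias-corrected version (the Miller-Madow form of the correction; the experimental setup is my own illustration):

```python
import math, random
from collections import Counter

def plugin_entropy(xs):
    """Plug-in / MLE estimate H(P_n) = -sum_i p_i log2 p_i (bits)."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def miller_madow(xs):
    """Bias-corrected estimate: add (S_observed - 1) / (2 n ln 2) bits, which is
    -(1/2n) f''(p) p (1-p) with f(x) = -x log2 x, summed over observed symbols."""
    n = len(xs)
    return plugin_entropy(xs) + (len(set(xs)) - 1) / (2 * n * math.log(2))

rng = random.Random(0)
S, n, reps = 500, 2000, 20          # alphabet size comparable to sample size
true_H = math.log2(S)               # entropy of the uniform distribution
mle = mm = 0.0
for _ in range(reps):
    xs = [rng.randrange(S) for _ in range(n)]
    mle += plugin_entropy(xs) / reps
    mm += miller_madow(xs) / reps
print(true_H - mle, true_H - mm)    # plug-in is biased low; the correction shrinks the bias
```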
11 Blahut-Arimoto algorithm
Let A be the set of joint distributions Q_{X, \hat X} with X-marginal equal to P_X and satisfying the distortion constraint \sum_{x, \hat x} Q(x, \hat x) d(x, \hat x) \le D.
Lemma 11.1.
R(D) = \min_{Q_{X, \hat X} \in A} \min_{R_{\hat X}} D(Q_{X, \hat X} \| P_X R_{\hat X}).
Proof.
D(Q_{X, \hat X} \| P_X R_{\hat X}) - D(Q_{X, \hat X} \| P_X Q_{\hat X}) = D(Q_{\hat X} \| R_{\hat X}) \ge 0,
so the inner minimum is attained at R_{\hat X} = Q_{\hat X}, where D(Q_{X, \hat X} \| P_X Q_{\hat X}) = I(X; \hat X).
To implement the alternating minimization algorithm, we need to solve two problems.
1. Given Q_{X, \hat X}, find R_{\hat X} minimizing D(Q_{X, \hat X} \| P_X R_{\hat X}). The above lemma implies R_{\hat X} = Q_{\hat X}.
2. Given R_{\hat X}, find Q_{X, \hat X} \in A minimizing D(Q_{X, \hat X} \| P_X R_{\hat X}). Since the marginal Q_X = P_X is fixed, we just need to find Q_{\hat X | X}. The following lemma gives a closed-form expression.
Lemma 11.2. The minimizing conditional distribution has the form
Q_{\hat X | X}(\hat x | x) = \frac{R_{\hat X}(\hat x) \, e^{-\lambda d(x, \hat x)}}{\sum_{\hat x'} R_{\hat X}(\hat x') \, e^{-\lambda d(x, \hat x')}},
where \lambda \ge 0 is chosen to meet the distortion constraint.
b b P b bb b
b j (bx)j
@QXj X (x j x)
J(QXjX ) = PX (x) log RX + PX (x) + 1PX (x)d(x; x) + 2PX (x)
b b b b b b
X X j
X r (x j y)
= p(y) r (x j y) log
r(x j y)
x;y
= p(y)D(r (x j y)kr(x j y))
0:
C = \max_{P_X} I(X; Y) = \max_{P_X} D(P_{XY} \| P_X P_Y).
The alternating updates are: for any R_X,
Q_{X|Y}(x | y) := \frac{R_X(x) \, P_{Y|X}(y | x)}{\sum_{x'} R_X(x') \, P_{Y|X}(y | x')};
and for any Q_{X|Y},
R_X(x) := \frac{\prod_y Q_{X|Y}(x | y)^{P_{Y|X}(y | x)}}{\sum_{x'} \prod_y Q_{X|Y}(x' | y)^{P_{Y|X}(y | x')}}.
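These two updates can be run directly. A minimal sketch for channel capacity (the function name and iteration count are my choices, not from the notes):

```python
import math

def ba_capacity(P_yx, iters=200):
    """Channel capacity (bits) of a DMC by alternating maximization.
    P_yx[x][y] = P(y|x)."""
    nx, ny = len(P_yx), len(P_yx[0])
    r = [1.0 / nx] * nx                                    # input distribution R_X
    for _ in range(iters):
        # Q(x|y) proportional to r(x) P(y|x)
        q = [[r[x] * P_yx[x][y] for x in range(nx)] for y in range(ny)]
        for y in range(ny):
            z = sum(q[y])
            q[y] = [v / z if z else 1.0 / nx for v in q[y]]
        # r(x) proportional to prod_y Q(x|y)^{P(y|x)}, computed in log space
        logr = [sum(P_yx[x][y] * math.log(max(q[y][x], 1e-300)) for y in range(ny))
                for x in range(nx)]
        m = max(logr)
        r = [math.exp(v - m) for v in logr]
        z = sum(r)
        r = [v / z for v in r]
    # evaluate I(X;Y) at the final input distribution
    py = [sum(r[x] * P_yx[x][y] for x in range(nx)) for y in range(ny)]
    return sum(r[x] * P_yx[x][y] * math.log2(P_yx[x][y] / py[y])
               for x in range(nx) for y in range(ny)
               if r[x] > 0 and P_yx[x][y] > 0)

# BSC with crossover 0.1: capacity should be 1 - H(0.1) ≈ 0.531
print(ba_capacity([[0.9, 0.1], [0.1, 0.9]]))
```

The same alternating structure, with Lemma 11.2 as the inner update, computes R(D).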
12 Information theory and statistics
12.1 Theory of types
Let X_1, \dots, X_n be a sequence of symbols from \mathcal{X} = \{a_1, \dots, a_{|\mathcal{X}|}\}.
The type of x^n, denoted P_{x^n}, is the empirical distribution associated to x^n = (x_1, \dots, x_n).
\mathcal{P}_n denotes the set of types with denominator n, i.e., the possible types associated with a sample of size n. The type class T(P) of P \in \mathcal{P}_n is the set of sequences x^n with type P.
Example 12.1. If \mathcal{X} = \{0, 1\}, then \mathcal{P}_n = \{ (\tfrac{k}{n}, \tfrac{n-k}{n}) : 0 \le k \le n \}.
Example 12.2. If P = (3/8, 5/8), then T(P) is the set of all \binom{8}{3} binary vectors of length 8 with exactly three zeros.
Theorem 12.3.
|\mathcal{P}_n| \le (n + 1)^{|\mathcal{X}|}.
[The above bound is very crude, but it is good enough because we will be comparing it to things that grow exponentially in n.]
Consequently, the number of type classes is polynomial in n.
Theorem 12.4. If X_1, \dots, X_n are drawn i.i.d. according to Q, then
Q^n(x^n) = 2^{-n(H(P_{x^n}) + D(P_{x^n} \| Q))}.
It makes sense that the probability depends only on the type (permuting does not affect the empirical distribution). Recall Q^n(x^n) \approx 2^{-nH(Q)} for typical x^n, and contrast this with the statement of the theorem.
Proof.
Q^n(x^n) = \prod_{a \in \mathcal{X}} Q(a)^{n P_{x^n}(a)} = 2^{n \sum_a P_{x^n}(a) \log Q(a)} = 2^{-n \left( \sum_a P_{x^n}(a) \log \frac{1}{P_{x^n}(a)} + \sum_a P_{x^n}(a) \log \frac{P_{x^n}(a)}{Q(a)} \right)} = 2^{-n(H(P_{x^n}) + D(P_{x^n} \| Q))},
where we note n P_{x^n}(a) is the number of times a appears in the sample.
Theorem 12.5.
\frac{1}{(n+1)^{|\mathcal{X}|}} \, 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}.
That is, the type class of P has about 2^{nH(P)} sequences.
This is a more precise notion than typical sets. Both show how entropy is a notion of volume.
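For small n the bounds can be checked exhaustively. A sketch using Example 12.2 (the brute-force enumeration is mine):

```python
import math
from itertools import product

n = 8
P = (3 / 8, 5 / 8)                       # a type with denominator n (Example 12.2)
HP = -sum(p * math.log2(p) for p in P)   # H(P) in bits

# count the type class T(P) by brute force over all 2^8 binary strings
T = sum(1 for xn in product((0, 1), repeat=n) if xn.count(0) == round(n * P[0]))
print(T)                                 # C(8,3) = 56

lower = 2 ** (n * HP) / (n + 1) ** 2     # (n+1)^{-|X|} 2^{nH(P)}, with |X| = 2
upper = 2 ** (n * HP)
print(lower, upper)                      # the bounds of Theorem 12.5 bracket 56
```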
Proof. For the upper bound,
1 \ge P^n(T(P)) = \sum_{x^n \in T(P)} P^n(x^n) = |T(P)| \, 2^{-nH(P)},
the last equality by Theorem 12.4, since D(P \| P) = 0 for x^n \in T(P).
For the lower bound, we use the fact that P^n(T(P)) \ge P^n(T(Q)) for all Q \in \mathcal{P}_n, which we prove below. Intuitively, it states that under a particular probability distribution, the type with maximum probability is the original distribution. Then
1 = \sum_{Q \in \mathcal{P}_n} P^n(T(Q)) \le \sum_{Q \in \mathcal{P}_n} P^n(T(P)) = |\mathcal{P}_n| \, P^n(T(P)) \le (n+1)^{|\mathcal{X}|} \, P^n(T(P)) = (n+1)^{|\mathcal{X}|} \, |T(P)| \, 2^{-nH(P)}.
It remains to prove the "maximum likelihood" result: P^n(T(P)) \ge P^n(T(\hat P)) for all \hat P \in \mathcal{P}_n. We compute
\frac{P^n(T(P))}{P^n(T(\hat P))} = \frac{|T(P)| \prod_a P(a)^{nP(a)}}{|T(\hat P)| \prod_a P(a)^{n \hat P(a)}}
= \frac{\binom{n}{nP(a_1), \dots, nP(a_{|\mathcal{X}|})}}{\binom{n}{n \hat P(a_1), \dots, n \hat P(a_{|\mathcal{X}|})}} \prod_a P(a)^{n(P(a) - \hat P(a))}
= \prod_a \frac{(n \hat P(a))!}{(n P(a))!} \, P(a)^{n(P(a) - \hat P(a))}.
Using \frac{m!}{\ell!} \ge \ell^{m - \ell}, this is
\ge \prod_a (n P(a))^{n(\hat P(a) - P(a))} \, P(a)^{n(P(a) - \hat P(a))} = \prod_a n^{n(\hat P(a) - P(a))} = n^{n \sum_a (\hat P(a) - P(a))} = n^0 = 1.
Theorem 12.6. For any P \in \mathcal{P}_n and any distribution Q, the probability of the type class T(P) under Q^n is
\frac{1}{(n+1)^{|\mathcal{X}|}} \, 2^{-nD(P \| Q)} \le Q^n(T(P)) \le 2^{-nD(P \| Q)}.
The probability of observing some empirical distribution under Q is exponentially small in the relative entropy.
Proof.
Q^n(T(P)) = \sum_{x^n \in T(P)} Q^n(x^n) = \sum_{x^n \in T(P)} 2^{-n(D(P \| Q) + H(P))} = |T(P)| \, 2^{-n(D(P \| Q) + H(P))},
and the result follows from the bounds on |T(P)| in Theorem 12.5.
Given \epsilon > 0, let
T_Q^\epsilon := \{ x^n : D(P_{x^n} \| Q) \le \epsilon \}.
Then
P(D(P_{X^n} \| Q) > \epsilon) \le (n+1)^{|\mathcal{X}|} \, 2^{-n\epsilon} = 2^{-n \left( \epsilon - |\mathcal{X}| \frac{\log(n+1)}{n} \right)}.
This is a law of large numbers: the probability of getting a sample whose empirical distribution is far from Q in relative entropy is exponentially small. Since \sum_n P(D(P_{X^n} \| Q) > \epsilon) < \infty, the Borel-Cantelli lemma implies D(P_{X^n} \| Q) \to 0 almost surely.
12.2 Large deviations
Let X_1, X_2, \dots \sim Q be i.i.d. on a finite alphabet \mathcal{X}.
The weak law of large numbers states
P\left( \frac{1}{n} \sum_{i=1}^n X_i > E X_1 + \epsilon \right) \to 0.
Large deviations theory shows that this probability behaves like 2^{-nE}. What is interesting is that E is an exponent that we can compute explicitly, and the upper bound is tight up to first order in the exponent.
Example 12.8. Let X_i \sim Q = \mathrm{Ber}(p). Note that \frac{1}{n} \sum_{i=1}^n X_i = P_{X^n}(1) (the proportion of ones). Then
P\left( \frac{1}{n} \sum_{i=1}^n X_i \ge p + \epsilon \right) = \sum_{P \in \mathcal{P}_n : P(1) \ge p + \epsilon} Q^n(T(P)) \le |\mathcal{P}_n| \, 2^{-n \min_P D(P \| Q)} \le (n+1)^{|\mathcal{X}|} \, 2^{-n \min_P D(P \| Q)},
where the minimum is over \{P \in \mathcal{P}_n : P(1) \ge p + \epsilon\}.
This example is a special case of the following theorem, with E being the collection of distributions on \{0, 1\} with expectation at least p + \epsilon.
Theorem 12.9 (Sanov's theorem). Let X_1, X_2, \dots \sim Q be i.i.d. and let E be a collection of probability distributions on \mathcal{X}. Then the probability that the empirical distribution P_{X^n} lies in E is
Q^n(E) = Q^n(E \cap \mathcal{P}_n) \le (n+1)^{|\mathcal{X}|} \, 2^{-n D(P^* \| Q)},
where P^* := \arg\min_{P \in E} D(P \| Q). Moreover, if E is the closure of its interior, then
-\frac{1}{n} \log Q^n(E) \to D(P^* \| Q).
A common example of the collection E is E := \{ P : \sum_{x \in \mathcal{X}} g(x) P(x) \ge \alpha \}, that is, the set of distributions P such that E_{X \sim P} \, g(X) \ge \alpha. If g(x) = x^k, this is a moment constraint. We could also have many constraints (g_j and \alpha_j for j = 1, \dots, J).
Proof. For the upper bound,
Q^n(E) = \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)) \le \sum_{P \in E \cap \mathcal{P}_n} 2^{-nD(P \| Q)} \le (n+1)^{|\mathcal{X}|} \, 2^{-n \min_{P \in E \cap \mathcal{P}_n} D(P \| Q)} \le (n+1)^{|\mathcal{X}|} \, 2^{-n D(P^* \| Q)}.
For the lower bound,
Q^n(E) = \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)) \ge Q^n(T(P_n)) \quad \text{for any } P_n \in E \cap \mathcal{P}_n
\ge \frac{1}{(n+1)^{|\mathcal{X}|}} \, 2^{-n D(P_n \| Q)}.
We need to find a sequence \{P_n : P_n \in E \cap \mathcal{P}_n\}_{n \ge 1} such that D(P_n \| Q) \to D(P^* \| Q). If E has nonempty interior and E is the closure of its interior, then we can approximate any point of E by a sequence of types lying in E, and this is possible.
We review Sanov’s theorem. We consider the collection of probability distributions on X and observe X1; : :
: ; Xn Q for one particular distribution. Let E be a collection of other distributions, e.g., set of distribu-tions with
expected value 0:8. We want to understand the probability that the empirical distribution P Xn is in E.
Sanov’s theorem implies that this probability exponentially small with exponent min P12E D(P kQ).
If P := argminP 2E D(P kQ), then we get the lower bound Qn(T (P )) (n+1)jXj 2 nD(P kQ)
for free. The
“closure of the interior” condition allows use to use denseness of types to extend to the case where P
is not a type.
Example 12.10. Suppose we have a fair coin. What is the probability of 700 heads in 1000 tosses? Let E
= fP : EX P X 0:7g. Why does this make sense? If the observations X n have 700 heads then PXn 2 E (here
n = 1000). Note that Q is the fair distribution. Sanov’s theorem implies
1
n log P( 700 heads) D((0:7; 0:3)k(0:5; 0:5)) 0:119
More precisely, the upper bound gives P(\ge 700 \text{ heads}) \le (n+1)^2 \, 2^{-n(0.119)}.
To compute P^* = \arg\min_{P \in E} D(P \| Q) when E is convex, this becomes a convex optimization problem: use Lagrange multipliers.
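The coin example can be checked against the exact binomial tail. A sketch (the comparison setup is mine):

```python
import math

def D_bin(p, q):
    """Binary relative entropy D((p,1-p) || (q,1-q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

n, k = 1000, 700
# exact P(at least 700 heads) for a fair coin
tail = sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n
exponent = -math.log2(tail) / n
print(D_bin(0.7, 0.5), exponent)   # Sanov exponent ≈ 0.119; the exact exponent is close
```

The exact exponent exceeds 0.119 only by a polynomial (in n) correction, as the (n+1)^2 factor suggests.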
There is also a more general version of Sanov's theorem (continuous distributions, etc.).
12.3 Conditional limit theorem
Suppose I am manufacturing bolts, each of which is supposed to weigh nominally 10 grams. I find a batch of 1000 bolts that weighs 10.5 kilograms. What is the probability that any given bolt weighs 11 grams?
[What does the bulk measurement tell us about the marginal distributions of the individual measurements?]
Theorem 12.12 (Conditional limit theorem). Suppose X_1, X_2, \dots \sim Q are i.i.d. and we observe P_{X^n} \in E, with Q \notin E and E closed and convex. Then, with P^* := \arg\min_{P \in E} D(P \| Q),
P(X_1 = a \mid P_{X^n} \in E) \to P^*(a).
Lemma (Pinsker's inequality).
D(P \| Q) \ge \frac{\log e}{2} \, \| P - Q \|_1^2.
Proof. For binary distributions, one can prove the following (exercise):
p \log \frac{p}{q} + \bar p \log \frac{\bar p}{\bar q} \ge \frac{\log e}{2} \, (2(p - q))^2,
where \bar p = 1 - p and \bar q = 1 - q; note 2|p - q| is exactly the L^1 distance in the binary case.
The data processing inequality for relative entropy: if P' = P_{Y|X} \circ P and Q' = P_{Y|X} \circ Q for some channel P_{Y|X}, then D(P' \| Q') \le D(P \| Q). (Applying this with the indicator channel Y = 1[X \in A], A = \{x : P(x) \ge Q(x)\}, reduces Pinsker's inequality to the binary case.)
Intuition for the conditional limit theorem: P^* completely dominates the behavior of the marginal.
Proof of conditional limit theorem. Let S_t := \{ P \in \mathcal{P}(\mathcal{X}) : D(P \| Q) \le t \}. This is a convex set.
Let D^* := D(P^* \| Q) = \min_{P \in E} D(P \| Q).
Let A := S_{D^* + 2\epsilon} \cap E and B := E \setminus A = E \setminus S_{D^* + 2\epsilon}. Then
Q^n(B) = \sum_{P \in E \cap \mathcal{P}_n : D(P \| Q) > D^* + 2\epsilon} Q^n(T(P)) \le \sum_{P \in E \cap \mathcal{P}_n : D(P \| Q) > D^* + 2\epsilon} 2^{-n D(P \| Q)} \le (n+1)^{|\mathcal{X}|} \, 2^{-n(D^* + 2\epsilon)},
while
Q^n(A) \ge Q^n(S_{D^* + \epsilon} \cap E) = \sum_{P \in E \cap \mathcal{P}_n : D(P \| Q) \le D^* + \epsilon} Q^n(T(P)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}} \, 2^{-n(D^* + \epsilon)}
for n large enough that E \cap \mathcal{P}_n contains a type P with D(P \| Q) \le D^* + \epsilon. Therefore
P(P_{X^n} \in B \mid P_{X^n} \in E) = \frac{Q^n(B \cap E)}{Q^n(E)} \le \frac{Q^n(B)}{Q^n(A)} \le (n+1)^{2|\mathcal{X}|} \, 2^{-n\epsilon} \to 0.
[So, the probability that our empirical distribution is outside a KL-ball around P^* (given it lies in E) vanishes.] Thus, P(P_{X^n} \in A \mid P_{X^n} \in E) \to 1.
By the Pythagorean inequality, for P \in A we have
D(P \| P^*) + D^* = D(P \| P^*) + D(P^* \| Q) \le D(P \| Q) \le D^* + 2\epsilon,
so D(P \| P^*) \le 2\epsilon. Combined with Pinsker's inequality, this implies
P(P_{X^n} \in A \mid P_{X^n} \in E) \le P(D(P_{X^n} \| P^*) \le 2\epsilon \mid P_{X^n} \in E) \le P(\| P_{X^n} - P^* \|_1 \le \delta \mid P_{X^n} \in E), \quad \delta := \sqrt{4\epsilon / \log e},
and these three quantities tend to 1 as n \to \infty. Consequently
P(|P_{X^n}(a) - P^*(a)| \le \delta \mid P_{X^n} \in E) \to 1.
Aside: after proving Sanov's theorem, we proved D(P_{X^n} \| P) \to 0 almost surely, which implies \| P_{X^n} - P \|_1 \to 0 almost surely, by Pinsker's inequality.
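The conditional limit theorem is easy to see in simulation: conditioned on an atypically large empirical mean, the marginal of a single coordinate moves to P^*. A sketch with a fair coin conditioned on at least 70% heads (the parameters are my own choices):

```python
import random

rng = random.Random(0)
n, trials = 30, 200_000
hits = first_coord = 0
for _ in range(trials):
    heads = [rng.random() < 0.5 for _ in range(n)]
    if sum(heads) >= 21:          # empirical distribution lands in E (mean >= 0.7)
        hits += 1
        first_coord += heads[0]   # look at the marginal of the first coordinate
print(first_coord / hits)         # ≈ 0.7, not 0.5: the marginal moves to P*
```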
12.4 Fisher information and Cramér-Rao lower bound
Let f(x; \theta) be a family of densities indexed by \theta. For example, the location family is f(x; \theta) = f(x - \theta) for some fixed f.
An estimator for \theta from a sample of size n is a function T : \mathcal{X}^n \to \Theta. The error of this estimator, T(X^n) - \theta, is a random variable.
For example, X_i \sim N(\theta, 1) i.i.d. and T(X^n) = \frac{1}{n} \sum_{i=1}^n X_i.
An estimator is unbiased if E_\theta T(X^n) = \theta for all \theta.
The Cramér-Rao bound states that the variance of an unbiased estimator is lower bounded by 1/J(\theta), where J(\theta) is the Fisher information defined below.
Example 12.15. Let f(x; \theta) := f(x - \theta). Then J(\theta) = \int \frac{(f'(x))^2}{f(x)} \, dx, free of \theta. This is a measure of the curvature/smoothness of f.
13 Entropy and Fisher information
Consider the heat equation \partial_t u = \frac{1}{2} \Delta u with initial condition u(0, x) = \delta(x) (all heat starts at 0). Then a solution is
u_0(t, x) = \frac{1}{\sqrt{2\pi t}} e^{-x^2 / (2t)}.
That is, at time t, the temperature profile is the Gaussian density with variance t.
More generally, if the initial condition is u(0, x) = f(x), then a solution is the convolution u(t, x) = \int f(s) u_0(t, x - s) \, ds, where u_0 is the Gaussian kernel above. Note that if we integrate over x, we get the "total energy," which is conserved (constant in t). This is easy to see in the special case where f is a density (in x), in which case the convolution is also a density, which integrates (over x) to 1, constant in t.
Let us focus on this special case. Let X \sim f and Z \sim N(0, I). Then u(t, x) = f_t(x), where f_t is the density of X + \sqrt{t} Z. [Convolution of densities is the density of the sum.]
We claim h(X + \sqrt{t} Z) is nondecreasing in t. [This matches our intuition from the second law of thermodynamics.] First,
h(X + \sqrt{t} Z) \ge h(X + \sqrt{t} Z \mid Z) = h(X).
More generally, with Z' an independent copy of Z,
h(X + \sqrt{t + t'} Z) = h(X + \sqrt{t} Z + \sqrt{t'} Z') \ge h(X + \sqrt{t} Z),
which implies our claim.
Amazingly, we not only know it is increasing in t, but we have a formula for the rate of increase.
We will define the Fisher information J now. Given a parametric family of densities \{f(x; \theta)\} parameterized by \theta, the Fisher information at \theta is
J(\theta) = E_{X \sim f(\cdot; \theta)} \left[ \left( \frac{\partial}{\partial \theta} \log f(X; \theta) \right)^2 \right].
We will focus on location families, where f(x; \theta) := f_X(x - \theta) for some fixed density f_X. In this case, the Fisher information is
J(\theta) = \int \frac{(f_X'(x))^2}{f_X(x)} \, dx,
which is free of \theta. We then use the compact notation J(\theta) = J(f) = J(X) to denote the Fisher information associated with this density. This is the Fisher information that we will consider from now on.
In n dimensions, this generalizes to
J(X) = \int \frac{|\nabla f(x)|^2}{f(x)} \, dx = 4 \int |\nabla \sqrt{f(x)}|^2 \, dx.
The last expression shows how the Fisher information corresponds to the smoothness of f (actually, of \sqrt{f}).
We now prove de Bruijn’s identity.
Proof. Z
dt
h(f ) =
t dt ft(x) log ft(x) dx
d d
Z Z f (x) dx
= (log ft(x)) @t ft(x) dx + @t t
@ @
Z 1
2
n @2 i=1 @ xi dt
d Z
X
= (log ft(x)) 2 ft(x) dx + ft(x) dx
1 n @2 | = dt
{z }
d
1=0
2 i=1 X
Z @xi
= (log ft(x)) 2 ft(x) dx :
Assuming (log ft(x)) @x@i ft(x) ! 0 as jxj ! 1, integration by parts for each summand indexed by i (with
respect to dxi) gives
d dt
1 2 i=1
Z
n @2 @ xi
X
h(ft) = (log ft(x)) 2 ft(x) dx
f
= 1 n @xi t(x) 2 dx
Z @
2 i=1 ft(x)
1 Z f (x) 2
= jr t j dx
2ft(x)
1
J(f ):
= 2 t
Note that nonnegativity of Fisher information also shows that h(f t) is nondecreasing in t.
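For a Gaussian X the identity can be verified in closed form, since X + \sqrt{t} Z is again Gaussian. A quick finite-difference check (the setup is mine):

```python
import math

def h_gauss(v):
    """Differential entropy (nats) of N(0, v): (1/2) log(2*pi*e*v)."""
    return 0.5 * math.log(2 * math.pi * math.e * v)

# X ~ N(0, 1), so X + sqrt(t) Z ~ N(0, 1 + t), and J(N(0, v)) = 1/v
t, dt = 0.5, 1e-6
lhs = (h_gauss(1 + t + dt) - h_gauss(1 + t)) / dt   # d/dt h(f_t)
rhs = 0.5 / (1 + t)                                  # (1/2) J(f_t)
print(lhs, rhs)                                      # both ≈ 1/3
```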
We will use de Bruijn’s identity to prove an uncertainty principle for entropy and Fisher
information. The entropy power inequality gives
p p
e n2 h(X+ tZ) 2 h(X) 2 h( tZ) 2 h(X)
p
e n +e n =e n + 2 et
2
e n2 h(X+ tZ) en h(X)
2 e:
t
37
=
Taking t ! 0 makes the left-hand side equal to
dt e n h( X+p
d 2
tZ)t=0 n J(X)e n h(X);
1 2
Recalling h(X) = -\int f \log f \, dx and J(X) = \int f \, |\nabla \log f|^2 \, dx, we see that these two quantities are parallel analogues of entropy and Fisher information. For Z \sim N(0, I_n) with density \varphi, the relative versions simplify to
D(X \| Z) = \frac{n}{2} \log(2\pi e) + \frac{1}{2} E|X|^2 - \frac{n}{2} - h(X),
I(X \| Z) := E \left| \nabla \log \tfrac{f}{\varphi}(X) \right|^2 = J(X) - 2n + E|X|^2.
Using the bound \log x \le x - 1 on the Stam inequality, we have
D(X \| Z) \le \frac{1}{2} I(X \| Z).
In summary,
EPI \implies J(X) \, e^{\frac{2}{n} h(X)} \ge 2\pi e n \iff D(X \| Z) \le \frac{1}{2} I(X \| Z).
We only showed the forward implication of the last "if and only if"; the reverse is simple too.
Note that the last result is dimension free.
Let g^2 := f / \varphi, and let \gamma denote the standard Gaussian measure. We can reformulate the last result as
2 \int |\nabla g|^2 \, d\gamma \ge \int g^2 \log g^2 \, d\gamma.
Noting that \int g^2 \, d\gamma = 1, we can write
2 \int |\nabla g|^2 \, d\gamma \ge \int g^2 \log g^2 \, d\gamma - \int g^2 \, d\gamma \, \log \int g^2 \, d\gamma.
This inequality still holds if we scale g by a constant (the log terms cancel), so it holds without the normalization.
Theorem 13.4 (Log-Sobolev inequality for Gaussian measure). For "smooth" g,
2 \int |\nabla g|^2 \, d\gamma \ge \int g^2 \log g^2 \, d\gamma - \int g^2 \, d\gamma \, \log \int g^2 \, d\gamma =: \mathrm{Ent}_\gamma(g^2).
EPI \implies Fisher information-entropy uncertainty principle \iff LSI (info. th.) \iff LSI (functional form).
As an application, we prove Gaussian concentration: if F is L-Lipschitz with \int F \, d\gamma = 0, then \int e^{F} \, d\gamma \le e^{L^2/2}.
Consider U \sim N(0, 1). We have P(U \ge r) \le e^{-r^2/2}. So the theorem states that under a Lipschitz function, the tail behavior is still the same.
Without loss of generality suppose L = 1. We consider g^2(x) := e^{\lambda F(x)} (and assume \int F \, d\gamma = 0) and plug it into the LSI. We have \nabla g(x) = \frac{\lambda}{2} (\nabla F(x)) \, e^{\lambda F(x)/2}, so, using |\nabla F| \le 1,
2 \int |\nabla g|^2 \, d\gamma \le \frac{\lambda^2}{2} \int e^{\lambda F} \, d\gamma.
Let \Lambda(\lambda) := \int e^{\lambda F} \, d\gamma. The LSI then implies
\frac{\lambda^2}{2} \Lambda(\lambda) \ge \int e^{\lambda F} (\lambda F) \, d\gamma - \Lambda(\lambda) \log \Lambda(\lambda) = \lambda \Lambda'(\lambda) - \Lambda(\lambda) \log \Lambda(\lambda).
Let H(\lambda) := \frac{1}{\lambda} \log \Lambda(\lambda) - \frac{\lambda}{2}; the inequality above rearranges to H'(\lambda) \le 0. We have H(0^+) = \lim_{\lambda \to 0} \frac{1}{\lambda} \log \Lambda(\lambda) = \int F \, d\gamma = 0. Thus H(\lambda) \le 0 for all \lambda \ge 0. This gives \Lambda(\lambda) \le e^{\lambda^2/2}, and so (taking \lambda = 1 and rescaling back to general L)
\int e^{F} \, d\gamma \le e^{L^2/2}.
13.4 Talagrand’s information-transportation inequality
The quadratic Wasserstein distance between two probability measures \mu and \nu (on the same space) is
W_2^2(\mu, \nu) := \inf_{X \sim \mu, Y \sim \nu} E|X - Y|^2,
where the infimum is over couplings of \mu and \nu. For product measures it tensorizes:
W_2^2\left( \prod_{i=1}^n \mu_i, \prod_{i=1}^n \nu_i \right) \le \sum_{i=1}^n W_2^2(\mu_i, \nu_i).
Theorem 13.5 (Talagrand's inequality). For the standard Gaussian measure \gamma and any \nu,
W_2^2(\nu, \gamma) \le 2 D(\nu \| \gamma).
Proof sketch (via large deviations). Let X_1, X_2, \dots \sim \gamma be i.i.d.
1. E W_2(P_{X^n}, \gamma) \to 0.
2. For E_t := \{\nu : W_2(\nu, \gamma) > t\}, the lower-bound half of Sanov's theorem gives
\inf_{\nu \in E_t} D(\nu \| \gamma) \ge \liminf_{n \to \infty} -\frac{1}{n} \log P(W_2(P_{X^n}, \gamma) > t).
3. The map x^n \mapsto W_2(P_{x^n}, \gamma) is \frac{1}{\sqrt{n}}-Lipschitz, so Gaussian concentration gives
-\frac{1}{n} \log P(W_2(P_{X^n}, \gamma) > t) \ge (t - E W_2(P_{X^n}, \gamma))^2 / 2 \to t^2 / 2.
Hence \inf_{\nu \in E_t} D(\nu \| \gamma) \ge t^2/2: if W_2(\nu, \gamma) > t (i.e., \nu \in E_t), then 2 D(\nu \| \gamma) \ge t^2. Taking t \uparrow W_2(\nu, \gamma) proves the theorem.
Combining with the previous results gives the nice chain
I(\nu \| \gamma) \ge 2 D(\nu \| \gamma) \ge W_2^2(\nu, \gamma).
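For one-dimensional Gaussians the chain can be checked numerically: D(N(m,1) \| N(0,1)) = m^2/2 nats, and a mean shift achieves equality in Talagrand's inequality. In one dimension the optimal W_2 coupling is the monotone (quantile) coupling, so W_2^2 can be estimated by pairing sorted samples (the Monte Carlo setup is mine):

```python
import random

rng = random.Random(0)
m, N = 1.7, 50_000
# quantile coupling: pair the i-th smallest sample of each measure
xs = sorted(rng.gauss(0, 1) for _ in range(N))
ys = sorted(rng.gauss(m, 1) for _ in range(N))
w2sq = sum((x - y) ** 2 for x, y in zip(xs, ys)) / N
kl = m * m / 2                     # D(N(m,1) || N(0,1)) in nats
print(w2sq, 2 * kl)                # Talagrand: W2^2 <= 2D; here both ≈ m^2
```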
References
[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.