0% found this document useful (0 votes)
59 views

Notes It

This document contains notes from an information theory course taught by Professor Thomas Courtade in Fall 2016. The notes cover topics such as entropy, asymptotic equipartition property, entropy rates of stochastic processes, data compression, channel capacity, differential entropy, Gaussian channels, entropy power inequality, rate distortion theory, and applications of information theory to statistics and mathematics. The notes are based primarily on a reference text.

Uploaded by

himanshu
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views

Notes It

This document contains notes from an information theory course taught by Professor Thomas Courtade in Fall 2016. The notes cover topics such as entropy, asymptotic equipartition property, entropy rates of stochastic processes, data compression, channel capacity, differential entropy, Gaussian channels, entropy power inequality, rate distortion theory, and applications of information theory to statistics and mathematics. The notes are based primarily on a reference text.

Uploaded by

himanshu
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 46

Information theory

Billy Fang
Instructor: Thomas Courtade
Fall 2016

These are my personal notes from an information theory course taught by Prof. Thomas Courtade.
Most of the material is from [1]. Any errors are mine.

Contents
1 Entropy 1

2 Asymptotic Equipartition Property 2

3 Entropy rates of a stochastic process 4

4 Data compression 5

5 Channel capacity 6

6 Differential entropy 14

7 Gaussian channel 16

8 Entropy power inequality 17

9 Rate distortion theory 20

10 Approximating distributions and entropy 25

11 Computing rate distortion and channel capacity 27

12 Information theory and statistics 29


12.1 Theory of types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
12.2 Large deviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
12.3 Conditional limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
12.4 Fisher information and Cramer-Rao lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

13 Entropy methods in mathematics 36


13.1 Fisher information and entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
13.2 The logarithmic Sobolev inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
13.3 Concentration of measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
13.4 Talagrand’s information-transportation inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
13.5 The blowing-up phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1 Entropy
[Absent from first lecture; refer to Chapter 2 of [1] for missing introductory material.]
X
H(X) = p(x) log p(x) = E log p(X):
x
X
H(Y j X) = p(x)H(Y j X = x) = E log p(Y j X):
x

Chain rule: H(X; Y ) = H(X) + H(Y j X).

H(X j Y ) H(X) (conditioning reduces uncertainty), but it is not always true that H(X j = y) H(X) for
Y each y (only true on average).
Pinsker’s inequality: D(pkq) 2 ln 2 kp qk12.
1

Entropy H(p1; : : : ; pn) is concave in (p1; : : : ; pn). and let p(Xi) :=


H(X)logjX j. Proof: use Jensen’s inequality. Let pX = (p1; : : : ; pn)
(pi; : : : ; pn; p1; : : : ; pn 1). Then H(pX ) = H(pX(i)). So,
H(X) = n H(pX ) H n pX ! = H(1=n; : : : ; 1=n) = log n:
n 1 (i) 1 n (i)
X Xi
i=1 =1

I(X; Y ) is concave in pX for fixed pY jX . It is convex in pY jX for fixed pX . To see the first one, note I(X; Y )
= H(Y ) H(Y j X). For fixed pY jX , we have H(Y j X) is linear in pX , and H(Y ) is concave function composed
P

with a linear function pY = x pY jX (y)pX (x) of pX .


Data processing inequality: if X ! Y ! Z is a Markov chain (X and Z independent given Y ), then I(X; Z)
I(X; Y ).

I(X; Z) I(X; Z) + I(X; Y j Z) = I(X; Y; Z) = I(X; Y ) + I(X; Z) = I(X; Y ):

We discuss a connection between information and estimation. Consider the Markov chain X ! Y ! Xb(Y )

where Y is the noisy observation of X, and Xb(Y ) is an estimator of X based on Y . Let P e = P(X 6= Xb(Y )).
Theorem 1.1 (Fano’s inequality).
H(Pe) + Pe logjX j H(X j Xb):

Proof. Let E be the indicator for X 6= Xb so that H(E) = H(Pe).


Using chain rule in two ways gives
H(E; X j Xb) = H(X j Xb) + H(E j X; Xb)

= H(X j Xb)

and
H(E; X j X) = H(E j X) + H(X j E; X)
b H(E) +bpeH(X j X; E b e j
= 1) + (1 p )H(X X;E = 0)

b
= H(Pe) + pe logjX j b

The logjX j can be replaced by log(jX j 1).


Corollary 1.2. With Pe := minX P(X(Y ) 6= X), we have

b b H(Pe) + Pe log H(X j Y )


jX j

1
Proof. By the data-processing inequality, I(X; Xb) I(X; Y ) which implies
H(X j Xb) H(X j Y ).

We now clarify the interpretation of “entropy is the average number of bits needed to describe a
random variable.” Consider a function f : X ! f0; 1g that maps the alphabet to arbitrary length bit strings.
We want f to be injec-tive (able to distinguish between letters in the alphabet). Let ‘(f(x)) be the length of
this description of x. Then E[‘(f(X))] is the average length of this description.

X
E[‘(f(X))] = p(x)‘(f(x))
x
X p(x)
= H(X) + p(x) log
x
2 ‘(f(x))
X
= H(X) p(x) log 2
0
‘(f(x )) + p(x) log p(x)
X2
X0
‘(f(x 0)) + D(p X kQ )
X 2 ‘(f(x))= P 2 ‘(f(x0))

x
x x x0
= H(X) log
x0
X 0
H(X) log 2 ‘(f(x )) :
0
x

Proposition 1.3.
X
2 ‘(f(x0)) log 2jX j:
0
x

Proof. Consider maximizing the sum on the left-hand side. Respecting injectivity of f, this is the first jX j
terms of the series
2 0+2 1+2 1+2 2+2 2+2 2+2 2+

Continuing from above, we have shown


E[‘(f(X))] H(X) log log 2jX j:

Suppose we observe i.i.d. X1; : : : ; Xn. The alphabet has size jX jn. Any description of these n
outcomes requires at least H(Xn) log(1 + n logjXj) bits. Using independence, we have the following lower
bound on the average number of bits per outcome.
1 1
n
n E[‘(fn(X ))] H(X1) n log(1 + n logjX j): = H(X) O(log(n)=n):
Suppose X = fa; b; cg and f(a) = 0, f(b) = 1, and f(c) = 01. Concatenating is no longer injective: 01101 could
correspond to abbc or cbc. If we want uniquely decodable f, then the upper bound of the previous proposition is 1.
Consider X being uniform on the above alphabet. Let f(a) = , f(b) = 0, and f(c) = 1. Then E[‘(f(X))] =
2=3 < 1:58 H(X).

2 Asymptotic Equipartition Property


Suppose we observe i.i.d. Ber(1 p) random variables. The “typical” sequence has pn zeros and (1 p)n ones.
The probability of a single such sequence is ppn(1 p)(1 p)n = 2 nH(X).
Applying the law of large numbers to f(x) = log pX (x) gives
1 1 1 p

X n
n i f(Xi) = n log pXn (x ) ! H(X)
So pXn (xn) 2 nH(X).

2
Theorem 2.1.
1 p

n
n log pX n (x ) ! H(X):

Last time we saw a few examples where the formula for entropy “magically” appeared. One was that if
f : X ! f0; 1g is injective, then E‘(f(X)) H(X) log log 2jX j. Today we will show that there exists an f such that
E‘(f (X)) . H(X).
We also saw that if we observe an i.i.d. sequence of Bern(1 p) random variables, then the “typical”
sequence of length n has n(1 p) ones and np zeros, and moreover p Xn (xn) 2 nH(X).
To formally define “typical,” we will work backwards from the above example.
(n)

A(n) = fxn : 2 n(H(X)+ ) pXn (xn) 2 n(H(X) )g:

Proposition 2.3 (Properties of typical sets).


n
1. x 2 A(n) () H(X) 1
n log pXn (x ) H(X) + .

2. P(Xn 2 A(n)) ! 1.

3. jA(n)j 2n(H(X)+ ).
4. jA(n)j (1 0)2n(H(X) ) for n sufficiently large.
Proof. 1. follows by definition. 2. follows by the AEP/LLN.
3. follows by X

1 pXn (x) jA(n)j2 n(H(X)+ ):


(n)
x2A
4. X

1 P (X 2 A(n)) =pXn (x) jA(n)j2 n(H(X) ): x2A(n)

In summary, A(n) together has almost all the proability mass, the probability on sequences in A (n) is
roughly uniform, and the cardinality is roughly 2nH(X).
If B(n) is the smallest set with probability 1 , then jB(n)j jA(n)j in some sense. More precisely,
1 (n)

n logjB j > H(X) 0


Proof.

1 P (B(n) \ A(n)) union bound


X
= pXn (xn)
xn2A\B

X
2 n(H(X) )
xn2A\B

jB(n)j2 n(H(X) )

1
jB(n)j (1)2n(H(X) )
(n)

n logjB j > H(X) + log(1 )

3
We now describe a scheme to describe X with H(X) bits on average: the typical set encoding.
sequences in A(n) , we need log A(n) + 1 bits per label.
If we label all (n) j j (n) (n)
To encode x 2 A , we put a flag 1 in front of the label (length logjA j + 2. If x 2= A , we put a flag 0 in front
of the binary representation of the sequence (length n(logjXj + 1) + 1).
1 1 1
n (n) (n) (n)
n E‘(f(X )) n P (A )(logjA j +2)+ n (1 P (A ))(n(logjX j + 1) + 1)
1
(H(X) + ) + n + (n)((logjX j + 1) + 1=n)
H(X) + 2
Although this scheme is not practical, we see that we get a matching upper bound to our earlier lower
bound H(X) log log 2jX j on the expected length.

3 Entropy rates of a stochastic process


For a random process fXig the entropy rate is defined as
1
H(fXig) = lim H(X1; : : : ; Xn)
n!1 n
provided the limit exists.

Theorem 3.1 (Shannon-McMillan-Breiman). For stationary ergodic processes,


1
n log pX n (X1; : : : ; Xn) ! H(fXig)
with probability 1.
This is the AEP generalized to more general processes. AEP-like properties also generalize.
A stationary process has shift-invariant joint probabilities: p X1n;:::;Xk = pX‘+1;:::;X‘+k . An ergodic process has
time averages equalling ensemble averages in some sense, e.g. 1n
p q
i=1 Xi ! EX (a LLN-type property). probability (p = q) and flip it repeatedly. The space

Non-ergodic process: choose a -coin or a -coin with equal P 6


average is (p + q)=2, but the time average is either p or q.
Lemma 3.2. For a stationary process,
H(fXig) = lim H(Xn j X1; : : : ; Xn 1):
n!1

Proof. 0 H(Xn+1 j X1; : : : ; Xn) H(Xn+1 j X2; : : : ; Xn) = H(Xn j X1; : : : ; Xn 1). So this is a nonnegative
decreasing sequence and therefore converges.
an := H(Xn j X1; : : : ; Xn 1) and bn = n 1P n 1 i 1 n 1 n concludes.
a
i=1 i j

If an ! a then the Cesaro means b n :=n n i=1 i ! a also converge to the same limit. Applying this to
1 1
H(X X ;:::;X ) = H(X ;:::;X )
stochastic processes. In particular if there is a stationary distribution (satisfies
Markov chains are an example of P
P = ), then the Markov chain is stationary and we may use the previous lemma. So H(fXig) = limn!1 H(Xn j Xn
1). We have
X X
H(Xn j Xn 1) = (i)H(Xn j Xn 1 = i) = (i)H(pi1; : : : ; pim)
i i
The Second Law of Thermodynamics: the entropy of an isolated system is not decreasing. We might model such
a system with a Markov process. [Example: Ehrenfest gas system?]
Suppose we start a Markov chain in two different states, and let their state distributions at time n be P n
and Qn respectively, then
D(Pn+1kQn+1) D(PnkQn):
If Pn is stationary, D(Pnk ) is decreasing.

4
4 Data compression
We have seen that it takes H(X) bits on average to describe X. This is the fundamental idea behind
data com-pression.
From AEP, it takes about nH(X) bits to represent X1; : : : ; Xn. But this scheme (using typical set) is
wildly impractical.
P
Let c : X ! f0; 1g be a source code. We are interested in L(c) = x pX (x)‘(c(x)).
x pX (x) c(x) ‘(c(x))
For example, a 1=5 00 2 has L(c) = 1:6 and H(X) = 1:52.
b 2=5 01 2
c 2=5 1 1
This defines the extension code c defined by c (x1 xn) = c(x1) c(xn).
A nonsingular code c is an injective code. A code is uniquely decodable if c is nonsingular (injective).
A code is prefix-free (a.k.a. instantaneous) if no code word is a prefix of any other code word.
Confusingly, such codes are sometimes called prefix codes.

codes nonsingular cords uniquely decodable codes prefix-free codes:


Theorem 4.1 (Kraft inequality). Any prefix N i=1
N
code satisfies N 2 ‘i 1.
(‘ ) P 2 ‘i there exists a prefix code with these codeword lengths.
Also, for any numbers i i=1 satisfying i=1 , P

Proof. Sort ‘1 ‘N . Draw a binary tree of code words. If ‘1 = 2 for example, let the first code word be 00 and
prune the children at that node. Then choose the next smallest available word of length ‘ 2 and so on. The
inequality guarantees that this is possible and we don’t run out of words.
We now show any prefix code satisfies the inequality. Consider inputting a sequence of random coin
flips into a decoder which spits out x if the input is a code word.
[ X X
1 P(some x comes out) = P fx comes outg = P(x comes out) = 2 ‘(c(x))
x2X x2X x2X

P ‘
Theorem 4.2 (McMillan inequality). Any uniquely decodable code has codeword lengths satisfying i2 i 1.

We prove McMillan’s inequality on the homework. Recall that for any f, we proved
X
E[‘(f(X))] = H(X) + D(pX kq) log 2 ‘(f(x))
x

where q(x) / 2 ‘(f(x)). If c is uniquely decodable, then applying McMillan’s inequality shows that E‘(c(X))
H(X).

How do we design good codes? Consider choosing ‘i to minimize i pX (i)‘i subject to i 2 i 1. Then the
k D(p q)

optimal ‘is satisfy ‘i = log pX (i). This is natural when considering P X . P


However, our code lengths must be integers, so we can consider instead ‘i = d log pX (i)e.

X
Xi pX (i)‘i i pX (i)(1 log pX (i)) = H(X) + 1:
Theorem 4.3. Any uniquely decodable code c satisfies E‘(c(X)) H(X) (impossibility), and there exists a
code c such that E‘(c (X)) H(X) + 1 (achievability).

The Huffman code is an “efficient” construction of the best prefix code.


Consider the following alphabet and distribution.
x a b c
pX 1=5 2=5 2=5

5
At each stage we group the two least probable symbols into one and add their probabilities. We start with fag, fbg,
and fcg. After the first stage we have fa; bg w.p. 3=5 and c w.p. 2=5. Finally, we merge everything to get fa; b; cg. At
each grouping, label the edges 0 and 1. Read backwards to get the code: a = 00, b = 01, and c = 1.
E‘(c(X)) = 2=5 + 4=5 + 2=5 = 8=5 1:6. While H(X) 0:722.

Theorem 4.4. Huffman coding is optimal among all uniquely decodable codes. That is, if c is the Huffman
code, then E‘(c (X)) E‘(c(X)) for any uniquely decodable code c.

See book for proof.


Aside: consider the above distribution with code a = 00, b = 1, and c = 0. This is not uniquely
decodable (although it is nonsingular). Its expected length is 6=5, shorter than the Huffman code.
So we showed
H(X) L H(X) + 1:

One “trick” to get rid of the +1 is to group symbols together and give a code X n ! f0; 1g . Then we have
the following bounds on the number of bits to describe n symbols: H(X n) E‘(c (X)) H(Xn) + 1. Dividing by n
gives the following bounds on the number of bits per symbol:
1 1
H(X) n E‘(c (Xn)) H(X) + n :

The Huffman code for X n requires building a tree with jX j n leaves. We have a tradeoff between length of
the code and computational complexity.
[See Lempel-Ziv for universal coding: variable length coding.]
This motivates arithmetic coding, which has complexity linear in n (instead of exponential as in the
above discus-sion). We discuss Shannon-Fano-Elias coding. We will see that E‘(cSF E(X)) H(X) + 2.
Let the distribution of X be (p1; : : : ; p3). We partition [0; 1) into half-open intervals each of length pi. [So, [0; p1),
[p1; p1 + p2), and [p1 + p2; p1 + p2 + p3).] To encode i, take the midpoint of interval of length p i, write it in binary, and
truncate to d log pie + 1 bits. The truncation will always lie in the same interval, since truncation subtracts at
( log p +1)
most 2 (d log pie+1) 2 i = pi=2. We can also readily see E‘(cSF E(X)) H(X) + 2.
Suppose X takes values a; b; c with probabilities 0:6; 0:3; 0:1 respectively. The intervals are
[0; 0:6); [0:6; 0:9); [0:9; 1). The mid points are 0:3 = 0:010011 2, 0:75 = 0:112, and 0:95 = 0:1111001 : : :.
So our code words are 01, 110, and 11110. It is clear this is not really optimal. However it is good for scaling up.
Consider another example where X takes values a; b; c with probabilities 1=2; 1=3; 1=6 respectively. Then the
product distribution is a distribution over 9 outcomes, e.g. ab has probability 1=6.
Compare the interval representations for the X and X 2 codes. In the latter, we would simply partition
each of the three half-open intervals of the former into three to get a total of nine intervals.
In general if we want to encode a sequence of n bits, keep partitioning the intervals in the same way,
and return any binary string in the small interval (e.g., truncating the midpoint to d log p Xn (xn)e + 1).
Note this procedure is linear in n.
Asymmetric Numeral System (ANS)? Coping with distributions that are less uniform; some
probabilities very high. Skew binary representation? See paper.

5 Channel capacity
The communication problem:

msg ! transmitter ! Channel (noisy) ! receiver ! msg

Assumptions
1. Messages are random variables uniformly distributed over f1; : : : ; Mg.

2. Channel has known statistical properties. Input X and output Y , know PY jX . Justification: in real life
we can take measurements.

6
3. Discrete time, discrete alphabet, memoryless channel. Statistics of PY jX do not change over time.
[If X1; X2 are i.i.d. and input sequentially, then output Y1; Y2 are i.i.d.]
Example 5.1 (Binary symmetric channel). The binary symmetric channel BSC(p) takes binary input x 2 f0; 1g
and outputs either x or 1 x with probability 1 p and p respectively.
A (M; n)-block code for the channel (X ; PY jX ; Y) uses n symbols from X to transmit one of M
messages W i, i = 1; : : : ; M.
n n
X (W ) Y
W ! (M; n)-block code ! PY njXn ! decoder ! W (Y n)
The rate of an (M; n) code is R = log M = number
n
of information bits =
number of channel uses bits/channel use. We also write an (M; n) code
c
nR
as a (2 ; n) code.
A good code should have high rate and good reliability, but these are in conflict.
A rate R is achievable if there exists a sequence of (2nR; n) such that

maxi P(Wc 6= W i j W = W i) ! 0
We call this reliable communication.
Definition 5.2 (Operational definition of channel capacity). The capacity C of a channel

PY jX is supfR : R is achievableg:

Consequently,

1. If R < C, then R is achievable. That is, for any > 0, there exists an n and a (2 nR; n) code with maxi
P(Wc 6= W i j W = W i) < .
2. If R > C, then R is not achievable. That is, there exists c such that for any sequence of (2 nR; n)
codes, lim infn!1 P(Wc 6= W i j W = W i) > c.
Theorem 5.3 (Shannon’s channel coding theorem).
C = max I(X; Y ):
PX

Properties of C.
1. C 0.
2. C min(logjX j; logjYj). (Just note I(X; Y ) H(X) logjX j.) This is equality when the channel is a
deterministic map from X ! Y, and jYj = jX j.
3. Recall I(X; Y ) is concave in PX for fixed PY jX . Computing C is a convex optimization problem.
Example 5.4 (Binary symmetric channel). Consider BSC(p). Note I(X; Y ) = H(Y ) H(Y j X) = H(Y ) H(p) 1
H(p) for any distribution on X. If X Ber(1=2), then Y Ber(1=2), which gives equality, so this is
the maximizing distribution. C = 1 H(p). For example if p = 0:1, we have C 0:6.
Example 5.5 (Binary erasure channel). Consider the binary erasure channel BEC(p). Input x 2 f0; 1g and output
x with probability 1 p or e (erasure) with probability p. Letting E be the indicator for erasure, we have
I(X; Y ) = H(Y ) H(Y j X) = H(Y ) H(p)
= H(Y; E) H(p)
= H(E) + H(Y j E) H(p)
= H(Y j E)
= pH(Y j E = 1) + (1 p)H(Y j E = 0)
(1 p):

7
Taking X Ber(1=2) gives equality again, so C = 1 p. This makes sense: if we knew where the erasures were
(proportion p of the time), we can send What is surprising that we don’t need to know where the erasures
perfectly. are.

Recall the setup.


nR n n n
W 2f1;:::;2 g X (W ) Y W (Y )
! ! nRY c

encoder PYijXi ! decoder !


A rate R is achievable if there exists a sequence of (2 ; n) codes such that maxi P(W 6= W i j W = W i) ! 0 as
c
n ! 1. The channel capacity C is a supremum of achievable rates; can be interpreted as the maximum rate at which
information can be transmitted reliably.
We showed that the channel capacities for BSC(p) and BEC(p) are 1 H(p) nad 1 p
respectively.
Heuristic explanation for channel capacity of BEC(p): First we argue an upper bound on achievable rates R.
Suppose we have extra information: we know the location of the [on average] pn erasures. Then we could
send [on
average] (1 p)n bits reliably, i.e. 1 p bits per channel use. This implies R 1 p for achievable R.
Now, how do we actually achieve this without the actual information?
Consider a G 0; 1 n (1 p )n that is full rank (over ). Let Xn(W ) := GW 0; 1 n where W
(1 p )n 2 f g F 2 2 f g 2
f0; 1g . Once we send it through the channel, [on average] proportion p bits are erased, so after the channel
we have n(1 p) bits that are not erased. Then we can invert the system since 1 p < 1 p. Regarding finding
a G, one can show that with i.i.d. Bernoulli entries, the resulting matrix is full rank with high probability.
Note that in this scheme, the probability of error is the probability that the number of erasures is > (p + )n.
Example 5.6 (Noisy typewriter). X = fA; : : : ; F g, and for each letter X, the distribution Y j X is equally
likely to be X or X + 1.
I(X; Y ) = H(Y ) H(Y j X) = H(Y ) 1 log(6) 1 = log 3:
If we choose PX to be uniform on fA; C; Eg, then we have equality I (X; Y ) = log 3.
Preview of Achievability in Channel Coding Theorem: all channels look like the noisy typewriter for n
sufficiently large.
Example 5.7 (Additive Gaussian white noise). Let P Y jX N (X; 1). Then the typical set is a ball centered at
Xn on p
the order of n. Then a simple encoding/decoding scheme is to choose (for the input distribution) a packing of the
space so that these balls are disjoint; decoding just chooses the nearest center. We now prove the
theorem.
Proof of channel coding theorem (converse/impossibility). We begin with the converse: if R is achievable, then R
C := maxPX I(X; Y ). If R is achievable, there exists a sequence of (2 nR; n) code with maxi P(Wc(n) 6= W i
j W (n) = W i) = n with n ! 0. This implies P(Wc(n) 6= W (n)) n (average is less than maximum).
Note

nR = H(W (n))

= H(W (n) j W c(n)) + I(W (n); W c(n))


Fano, Wc is estimator of W
1 + nnR + I(W (n); Wc(n))
1 + nnR + I(Xn; Y n) (n) n n (n)
n n n data proc. on W ! X ! Y ! Wc
= 1 + nnR + H(Y ) H(Y jX )
n
Xi

1 + nnR + H(Yi) H(Y n j Xn) indep. bound (chain rule + cond. reduces entropy)
=1
n n
Xi X

= 1 + nnR + H(Yi) H(Yi j Xn; Y i 1)


=1 i=1
n n
X
i X
= 1 + nnR + H(Yi) H(Yi j Xi) memorylessness/Markov
=1 i=1
8
n
X

= 1 + nnR + I(Xi; Yi)


i=1
1 + nnR + nC:

Divide by n and take n ! 1 (recall n ! 0) gives R C.


Note that the above proof works fine even if we only have the weaker condition P(W (n) 6= W (n))
n (average
error over uniform on W ). The other direction also only requires this weaker assumption.
c
Also note that the uniformity of W over the possible messages is crucial (appears in the step nR = H(W (n))).
C 1 C
Note at the end of the above proof, we had n 1 R nR 1 R . If R > C, then the probability of error is
bounded from below. This is the weak converse.
There is a strong converse: if R > C, P(W 6= W ) 1
Suppose R is ve ry clos e to C . Wha t do the abov e in equaliti es tell us?
2 E(R;C) where E(R; C) > 0 is some function(?)

c
The use of Fano’s inequality says we should use the best estimator Wc of W ...

The use of data processing inequality implies W 7!Xn and Y n ! Wc should be close to bijective.
The next inequality implies the Yi should be close to independent (i.e. the Xi should be close to independence).
The last inequality implies that the distribution of Xn is close to i.i.d. from argmaxPX I(X; Y ). [Capacity-
achieving input distribution (CAID).]

The set A(n) of jointly typical sequences (Xn; Y n) with respect to a distribution PXY is the set of n-
sequences with empirical entropies -close to true entropy, that is,
( log p n
A(n) = (xn; yn) : n X
n n
;Y n (x ; y ) H(X;Y) < ;
1

1
n n
n log pX (x ) H(X) <;

)
n
1 log p
Y n
(yn) H(Y ) <

If a rate R is close to C, then the function Xn(W ) ought to be “random” (specifically, i.i.d. from PX , the
optimizer of I(X; Y )).
Well-known codes like algebraic codes (Reed-Solomon, Gaulay, PCH) have a lot of structure and redundancy for
the sake of simple decoding. However, this is at odds with the above intuition of a capacity-achieving code.
Today we will prove the achievability direction of the channel coding theorem.

Theorem 5.8 (Joint AEP). Let (Xn; Y n) PXYn .

1. P((Xn; Y n) 2 A(n)(X; Y )) ! 1.

2. jA(n)(X; Y )j 2n(H(X;Y )+ ).
3. If (Xen; Yen) (PX PY )n, then

P((Xen; Yen) 2 A(n)) 2 n(I(X;Y ) 3)


:
The third statement is new. Independence causes the probability of being in the typical set to be
vanishing. Note all pairings of typical X sequences and typical Y sequences are not necessarily jointly
typical (in fact, most of them are not).

9
Proof. The first statement follows by the law of large numbers. The second statement follows by
X
x

1 n;yn p(xn; yn) jA(n)(X; Y )j2 n(H(X;Y )+ ):

For the third statement X


(xn;y ) A
P((Xn; Y n) 2 A(n)) = pXn (xn)pY n (yn)
e e n
2 (n)

X
2 n(H(X) )2 nH(Y ) )

(xn;yn)2A(n)
= jA(n)j2 n(H(X)+H(Y ) 2)
2 n(H(X)+H(Y ) H(X;Y ) 3 ):

Proof of achievability in channel coding theorem. We now prove that all rates R < C are achievable. This
is a non-constructive proof. We will use the probabilistic method: we will show that some object in a finite
class C satisfies property P by exhibiting a distribution over C, drawing an object from this distribution,
and show that this object satisfies P with probability > 0.
In our setting C is a set of (2nR; R) codes and P is “code has small probability of error.”
We will use random coding. Fix some P X and > 0 and R. Generate a (2nR; n) code at random according to
PX . The codebook is
C = [Xi(w)]i;w 2 R2nR n:
The wth row of C is the codeword for the wth message. Each letter of each code word is generated i.i.d. from P X , so

2nR n
YY

P(C) = PX (Xi(w)):
w=1 i=1

We have decided the encoding scheme. For decoding, we will use typical set decoding. The receiver
declares that Wc was sent if both of the following happen.
1. (Xn(W ); Y n) 2 A(n)(X; Y ), that is, the received message Y n is jointly typical with code word Xn(W ).
other index k = W
6c
(Xn(k); Y n) 2 A(n) (X;Y)

2. No c also satisfies . c
Otherwise, the decoder declares an error.
We now compute the expected probability of error for this scheme, averaged over codebooks C drawn
from the above distribution.
The probability of error is
n n
i(C) = P(Wc 6= i j X = X (i)):
This probability is over the randomness of the channel, but the codebook fixed.
The average [over all messages] probability of error for a fixed code C is
2nR
(n) 1X
Pe (C) = E W (C) = 2nR w(C)
w=1

[This is a function of C, can be thought of as a conditional probability.]

10
The average over all codes is
X
P =
error P(C)P (n)(C)
C 1
= P(C) 2nR 2nR w(C)
X X
C w=1

1
= 2nR 2nR P(C) w(C)
w=1 C
XX
X
= P(C) 1(C) rows of C are exchangeable
C

= P(error j W = 1): error prob. averaged over all codes


n n (n)
Define error events Ei = f(X (i); Y ) 2 A (X; Y )g for i = 1; : : : ; 2nR. Continuing from

above, P(error j W = 1) = P(E1c [ E2 [ [ E2nR j W = 1)


2nR
X

c
P(E1 j W = 1) + P(Ei j W = 1)
i=2
2nR
X

n n (n) n n (n)
= P((X ; Y ) 2= A )+ P((Xe ; Ye ) 2 A )
i=2

+ 2nR2 n(I(X;Y ) 3) for n sufficiently large

2 for n sufficiently large, if R < I(X; Y ) 3

For i 6= 1, Y n is independent of Xn(i) because Y n came from Xn(1).


So, the average [over codebooks and messages] probability of error is vanishing in n, so there exists
some sequence of codebooks with vanishing probability of error [averaged over messages].
In conclusion, for any R < C and > 0 there exists a (2nR; n) code with average [over messages, not
codebooks] probability of error < .
“Almost every code is a good code, except the ones we construct.” However, these codes are virtually
impossible to decode. [Codebook is of exponential size, decoding needs to check all codewords.]

Let Xn be i.i.d. Bern(1=2). Suppose Y n is obtained by passing Xn through BSC( ). [That is, Yi is equal
to Xi with probability 1 , and is flipped with probability .] For any b : f0; 1g n ! f0; 1g (one-bit function), is it
true that I(b(Xn); Y n) 1 H( ) = I(X1; Y1)? This is the “most informative function conjecture.”
In the last few lectures we proved the channel coding theorem, which stated that the channel capacity
is maxPX I(X; Y ).
Converse: If R is achievable, then R < C. This holds using either definition of achievability (involving
maximal probability of error maxi P (Wc 6= W i j W = W i) or average probability of error P (Wc 6= W )).

Achievability: There exist (2nR; n) codes with P (Wc 6= W ) ! 0 as n ! 1 provided R < C.


We just need to tidy up the achievability result by proving the version with maximum probability of error.
This is an application of Markov’s inequality. Note
2nR
X

P (Wc 6= W ) = P (Wc 6= W i j W = W i)P (W = W i)


i=1

11
Since messages are uniformly distributed,
P (Wc 6= W )
#fi : P (W 6= Wi j W = Wi) g=2nR :
c

= 2P (W = W ) gives
Taking c6 # i : P (W = W i W = W i) 2P(W = W) 2nR=2:
f 6 j 6 g
(left-hand side), then the new rate is
c
If we throw away these bad messages c
log(#codewords) log(2nR=2) 1
= =R :
n n n
In the new code, all the codewords have probability of error 2P (Wc 6= W ) by
definition, so max P (Wc0 6= W 0i j W 0 = W 0i) 2P (Wc 6=
W ) ! 0:
i

What is the rate of information sent from eye to brain? Measure signal X entering eye, signal Y
entering brain, estimate I(X; Y ), gives upper bound on rate. [Estimating I(X; Y ) needs some shift to align
due to delay, quantize time, etc.]
Suppose that we are the encoder that sends X i through a channel, which sends Y i to a decoder. What
if we get feedback: we see Yi (what the decoder receives)? We argued before that in the binary erasure
channel, the capacity with feedback is the same. Is this true in general? [We know trivially that capacity
with feedback must be at least as large as the capacity without feedback: just ignore the feedback.]
More explicitly, the encoder sends Xi, which can depend on W , as well as Xi 1 and Y i 1.
Theorem 5.9. The feedback does not improve the channel capacity. However, it can simplify the encoding scheme. It
can also get us to capacity much more quickly. That is, Pe(best code) 2 nE without feedback, but with
feedback Pe(best code) 2 2nE0 .

Proof. We already know C F B C so it suffices to prove the other direction.


nR = H(W ) messages are uniformly distributed

I(W ; W ) + nR + 1
cn n n Fano

I(W ; Y ) + nR n + 1 data-processing, W ! Y ! Wc
n n
= H(Y ) H(Y j W ) + nR n + 1

Xi (H(Yi) H(Yi j W; Y i 1)) + nR n + 1 independence bound, chain rule

= Xi (H(Yi) H(Yi j W; Y i 1; Xi)) + nR n + 1 Xi is function of (W; Y i 1)

X
= i (H(Yi) H(Yi j Xi)) + nR n + 1 Yi is conditionally independent of all past things given X i
X
= I(Xi; Yi) + nR n + 1
i
n max I(X; Y ) + nR n + 1:
PX

12
Thus R C.
Remark: In the channel coding theorem, we had

I(Xn; Y n) = H(Y n) H(Y n j Xn) + n n


X

H(Yi) H(Y n j Xn) + n n:


i

P
We then used H(Y n j X n) = i H(Yi j Xi) because the Yi were conditionally independent given the
corresponding Xi. This is no longer the case with feedback.
Preview of another problem (to be continued later):
Consider the following compression example. [No channel.] Suppose we have X n and Y n are correlated, and we
want to encode each separately, and then a decoder takes both. (If the first encoder sents nR X bits and the second
sends nRY , then the rate is RX + RY .) How much worse is this than encoding (Xn; Y n) together? Chapter 15.4.
We say RX and RY are achievable if there exists a sequence of functions f : X n ! [2nRX ] and g : Yn ! [2nRY ]
and : [2nRX ] [2nRY ] ! X n Yn such that

P ( (f(Xn); g(Y n)) 6= (Xn; Y n)) ! 0

as n ! 1. (Xbn; Ybn) := (f(Xn); g(Y n)). The achievable rate region is the closure of achievable rates.
Recall that if RX H(X) then we can send the information losslessly. So the achievable rate region
definitely contains [RX ; 1) [RY ; 1). Considering the special case where we can encode everything
together, we see that the achievable rate region must lie in the region fR X + RY H(X; Y )g.
The answer: any rates satisfying RX H(X j Y ), RY H(Y j X), and RX + RY H(X; Y ).
In channel coding, high rate is desirable, but hard because of the channel. In compression, low rate is
desirable (communicate using fewer bits) but hard. That is why achievable rates in channel coding has an
upper bound, while in compression there is a lower bound.
We will discuss Problem 4 on the midterm.
(Xn; Y n) are drawn i.i.d. from some distribution. Y n is encoded into f(Y n) 2 f0; 1gnR. The decoder
receives both f(Y n) and Xn, and gives an estimate Ybn.
error is P (n) = P (yn(Xn; f(Y n) = Y n) R

The probability of (n) e 6 . We say is achievable if there exists a sequence of


(2nR; n) codes with Pe ! 0.
j
b
First, we prove that if R is achievable, then R H(Y X).
nR H(f(Y n))
H(f(Y n) j Xn) conditioning reduces entropy
n n n n n n
= H(Y ; f(Y ) j X ) H(Y j f(Y ); X ) chain rule
n n n (n)
H(Y ; f(Y ) j X ) (nPe logjYj + 1)
n n n n n
= H(Y j X ) + H(f(Y ) j Y ; X ) n n
= nH(Y j X) + 0 n n:
Next, we prove a lemma about typical sets. We define A (n) (Y xn) = yn : (xn; yn) A(n) (X;Y) . We prove
(n) n
A (Y x ) 2nH(Y jX)+2 ). j f 2 g
j j j
2 n(H(X) ) p ( xn )
X
p(xn; yn)
n n (n)
yn :(x ;y )2A (X;Y )
X
y :(x ;y )

2 n(H(X;Y )+ )
n n n A(n) (X;Y )
2

= jA(n)(Y j xn)j2 n(H(X;Y )+ )


13
6 Differential entropy
R
The differentiable entropy of a continuous random variable with density f is h(X) := f(x) log f(x) dx =
E log f(X).
If X is discrete and Y is continuous, then I(X; Y ) = H(X) H(X j Y ) = h(Y ) h(Y j X).
negative! R 0a a
1 1
Example 6.1 (Uniform). If X is uniform on [0; a], then h(X) = a log dx = log a. In particular, h(X) can be

Why can entropy be negative? Consider approximating the integral by 1 1 f(i=N) log f(i=N). Consider
i= N
i=N)=N

[X]N , a discretized version of f taking values 1=N with probabilities f( P . Then the above approximation to
the integral is H([X]N ) log N = H([X]N ) H([U]N ), where [U]N is the discretization of a Unif(0; 1) random
variable. This is the “differential” in the name: it is in some sense the difference in discrete entropy of
quantized versions of X and U.
dp dp
The relative entropy is D(P kQ) = p(x) log dq (x) dx = EP log dq (X). Note that we can rewrite this as
R dp dp R
dq (x) log dq (x) dq(x). So, differential entropy can be written as D(P k dx) where “dx” denotes the Lebesgue
measure.
1 x2=2 2
2
Example 6.2 (Gaussian). Let f(x) = p 2 e .
Z 1
h(X) = f(x) log f(x) dx
1 x2
2 Z
= 2 log 2 + f(x) log(e) 2 2 dx
1 2 1
= 2 log 2 + 2 log e
1
= 2 log 2 e 2

R
Joint density is h(X1; : : : ; Xn) = f log f where f is the joint density.
1=2 1
exp 2 (x )>K 1(x ).
Example 6.3 (Multivariate Gaussian). Let f(x) = (2 ) n=2jKj
Z
h(X) = f log f
1

1 n 1 1 )K 1(X )
= 2 log(2 ) jKj + 2 log(e) + 2 log(e)E(X
1 n trace trick
= 2 log(2 )njKj + 2 log e
1
= 2 log(2 e)njKj:
Note that does not appear. This is because entropy is invariant to shifting.
Chain rule: h(Y; X) = h(X) + h(Y j Xj ). R X;Y f Y jX j E Y jX j
Conditional differential entropy is h(Y X) = f (x; y) log f (y x) dy dx = log f (Y X).
Relative entropy: if supp(f) supp(g), D(fkg) = f log 0 by Jensen’s inequality.
R
g

Mutual information is I(X; Y ) = D(fXY kfX fY ) = h(X)


h(X j Y ) 0. From this we see conditioning still
decreases entropy. n n
We still have h(X 1; : : : ; Xn) = i=1 h(Xi j X1
h(X + c) = h(X)
; : : : ; Xi 1)
h(aX) = h(X) + log a
i=1
h(Xi).
.] More if Y = aX then f (y) = f (y=a)= a

P
Also, and P j j. [Recall Y X j j
generally, for a matrix A, h(AX) = h(X) + logjAj.
Aside:

14
Proposition 6.4 (Entropy power inequality). Let X and Y be independent random vectors on R n.

22h(X+Y )=n 22h(X)=n + 22h(Y :


)=n

Suppose X is uniform on some set A, so f X (x) = 1= vol(A)1A. Then h(X) = log vol(A). Similarly if Y is
uniform on another set B, then h(Y ) = log vol(B). Then, the entropy power inequality implies

22h(X+Y )=n vol(A)2=n + vol(B)2=n:

The Brunn-Minkowski inequality states vol(A + B)1=n vol(A)1=n + vol(B)1=n, where A + B is the Minkowski sum
fa + b : a 2 A; b 2 Bg. If we take 22h(X+Y )=n vol(A + B)2=n (note this is not true, due to the convoluting), then
we see that the entropy power inequality suggests a stronger inequality than the Brunn-Minkowski inequality.

Rough volume argument for strong converse of channel coding: if R > C, The number of typical y n is
nH(Y )
2 . For each xn, number of conditionally typical yn is 2nH(Y jX). If these sets are disjoint, the number
of sets in yn is about 2nI(X;Y ). If we have R > C then we have overlap.
If X is discrete and Y is continuous, no notion of joint entropy. However, we can talk

about mutual information. I(X; Y ) = H(X) H(X j Y ) = h(Y ) h(Y j X):

For general random variables, we have another equivalent definition of mutual information.

I(X; Y ) = sup I([X]P ; [Y ]P );


P

where the supremum is over all partitions. Note that this is essentially the definition of Lebesgue
integration from approximation by simple functions. For continuous Y , we recover the earlier definition.

sup H([Y ] ) H([Y ] [X] ) h(Y )+H([U] ) h(Y [X] ) H([U] ) h(Y ) h(Y X):
I(X;Y) = N N Nj N N j N N ! j
The fact that mutual information can be defined between discrete and continuous random variables is
good in practice. Consider a codeword X n 2 [2nR] being sent through a channel; such channels usually
produce continuous output.
Last time we showed for X N( ; K), we have h(X) = 12 log(2 e)njKj.

Theorem 6.5. For any random variable Y with covariance K,

1
h(Y ) h(X) = 2 log[(2 e)njKj]:
Moreover, equality holds if and only if Y is Gaussian.

So, the Gaussian distribution maximizes entropy under a second moment constraint.
Aside: Note that for the discrete case, uniform distribution over finite alphabet maximizes entropy
without moment conditions. It does not usually make sense to impose moment conditions since we
usually do not care about the values of X, unlike the continuous case. For nonnegative integers with a
mean constraint, geometric distribution maximizes entropy.

Proof. Let (x) = (2 ) n=2


j
K
j
1=2 e x>K
1
x=2 the Gaussian density and f arbitrary. [WLOG both f and
be 1
1

have zero mean.] Then loge = 2 log[(2 )njKj] + 2 x >K 1


x.

15
0 De(fk ) f log e
= Z

= Z f loge he(f)
= EX f [ loge (X)] he(f)
1 n
1 > 1
= 2 log[(2 ) jKj] + 2 EX f X K X he(f)
= 2 log[(2 ) jKj] + 2| e {z e }
=n
1 n 1 n
log e h (f)
1
n
= 2 loge[(2 e) jKj] he(f)
= he( ) he(f):

7 Gaussian channel
The Gaussian channel takes input X and outputs Y = X + Z where Z N(0; 2) is independent of X.
We have supPX I(X; Y ) = 1 because we do not have constraints on X. We could spread the distribution of X
so widely such that the Gaussian Z does not perturb by much, and Y can easily be decoded.
We consider instead supPX :EX2 P I(X; Y ). [WLOG we assume X is zero mean; shifting does not change any-
thing.] We have

I(X; Y ) = h(Y ) h(Y j X) = h(Y ) h(Z)


1
= h(Y ) log 2 e 2
2
1 1
2 2
2 log 2 e( + P ) 2 log 2 e
= 2 log 1 + 2 :

1 P

n 2 X N(0; P) C= 2 2R
1 log 1 + P2 p= 2 is th e sign al-t o-n oise (SNR ).

Equality is attained by nR, so 2 . The ratio nR n n


For the Gaussian channel, a (2 ; n; P ) code c is a map from w [2 ] and outputs X (w) such that
kX (w)k nP . [The idea is that we have limited energy; we cannot amplify arbitrarily large.]

nR H(W )
I(Xn; Y n) + nPe(n)R + 1 data proc., Fano
X
n

I(Xi; Yi) + 1 + nPe(n)R


i=1
1 n
Xi

R n I(Xi; Yi) + n
=1

We cannot take the supremum over PX because we have the constraint kXn(w)k2 nP . We use the fact that for fixed PY jX , the map
P R
PX 7!I(X; Y ) is concave. We define PX = 1 n PX . Then PY = PY jX dPX . Then
n i=1 i

1 nXi
max I(X; Y )

n =1 I(Xi; Yi) I(X; Y ) PX:EX2 P

16
P
because EX2 = n1 ni=1 EXi2 P . With a little more work, the constraint kXn(W )k2 nP could be relaxed to
hold on average over messages W .
Let X1; X2; : : : be i.i.d. with density f, mean zero, and second moment equal to 1. The central limit theorem
states n
1 X d
Sn := p
n Xi ! N(0; 1):
i=1

Note that Sn all have the same second moment 1. We suspect h(Sn) is increasing, since the limiting
distribution is Gaussian.
Recall the entropy power inequality, which implies
log 2) 1
22h((X1+X2)=p2) = 22(h(X1+X2) 2 (22h(X ) + 22h(X )) = 22h(X ):
1 2 1

This argument shows h(S2k ) h(S2‘ ) for k ‘.


But does h(Sn) increase monotonically? This was an open problem since Shannon, and solved in 2004.
Note that h(Sm) h(Sn) is equivalent to D(SmkN(0; 1)) D(SnkN(0; 1)). [Convergence in entropy implies
convergence in distribution. See Pinsker?] So a strong central limit theorem holds.

8 Entropy power inequality


Theorem 8.1.
22h(X+Y ) 22h(X) + 22h(Y ):

1948: proposed by Shannon (proof was wrong)

1959: Stam (semigroup / Fisher information)


1965: Blachman (same technique)
1991: Carlen, Soffer (same technique)

?: Dembo, Cover
p p
Let U = X and V = 1 Y.

2h(U+V ) 22h(U) + 22h(V )


p
p 1
h( X) = h(X) + 2 log
22h( X+ p 1Y ) 22h(X) + (1 )22h(Y )

p p 22 h(X)+2(1 )h(Y ) Jensen


h( X+ 1 Y) h(X) + (1 )h(Y ):
This last inequality is Lieb’s inequality, and we have shown it is a consequence of the entropy power inequality. This
22h(U)
is actually equivalent to the entropy power inequality: choose = 22h(U)+22h(V ) .1 The latter form is more
convenient for proving, but the original form is more convenient for applications.
Without loss of generality, we may assume the densities of X and Y do not vanish. [Else convolve with
a Gaussian with low variance; does not change much.]
22h(U)=n
1 Suppose we want to prove the n-dimensional EPI 22h(U+V )=n 22h(U)=n + 22h(V )=n. Let = .p
22h(U)=n+22h(V )=n
p
h(U + V ) = h( X + 1 Y)
h(X) + (1 )h(Y )
n n
= h(U) 2 log + (1 ) h(Y ) 2 log
n

= 2 log 22h(U)=n + 22h(V )=n :

17
We will be using results from optimal transport theory (but we will derive things from scratch). This is
an adapta-tion of Olivier Rioul’s proof (2016).
Let FX and FY be the cdfs of X and Y respectively. If X N(0; 1), then (x ) is uniform on [0; 1], and F X 1(
(X )) is distributed as X. So let TX = FX 1 so that TX (X ) is distributed as X. Note that T X is an increasing
function. [We assumed the densities are nonvanishing so the CDFs are strictly increasing.] Thus T X0 > 0.
Define TY similarly.
d
Let X ; YN(0; 1) be i.i.d. Then (TX (X ); TY (Y )) = (X; Y ) in distribution.
X; Y N(0; 1) write

Let
ee be i.i.d. Then we canX =p X p1 Y
p Y = 1 X+
p Y

ee rotation invariant.]

[This is a unitary transformation; Gaussian distribution is e e


Consider p p
(x) = T (x ) + 1 T (y ):
X
Then
p
ye e d
p
Y

e
We also have Ye (X ) = X+ 1 Y:
0
d
x y
(x) = T (x ) + (1 )T 0 (y ):
X Y

p e
f
p X+ dY e
e
1

Let be the density of . By the change of variables formula,


f (x) = f( (x)) 0 (x)

e
is a density. ye e ye e ye

p p 1
p
h( X + 1 Y ) = E log f( X+ p1 Y)
1
= E log
f( Y (X))
0 x)
= E log y ( e e
x)
Ye (
f e

e e
= h(X) + E log (X) + E log 0 (X)
e
fY X) Y
e e e e

" log (X) 0


= h(X) + e e e
Y # + log (X)
j E
e fY ( e e e

EY EX X) Y
= h(X) + EY D(gkfY ) e E
e
Y
0

e + log (X)
h( 0
Ee Y e e e
X) + log (X)
X) + log Te (X) + (1 ) log T 0 (Y )
h( e E X0 e E Y concavity of logarithm
e e eE
= (h (X) + E log TX0( Y0
X)) + (1 )(h(Y ) + log T (Y ))

e e
= h(X ) + (1 )h(Y ): e e
The last step is due to the change of variables (xe) = fX (TX (xe))TX0(Xe),
0
h(X) + E log TX0 (X) = E log TX (X) (X e) = E log f X ( TX (x) )
1 = E log f X (X)
1 = h(X):

e e e e
18
p
To relax the condition that the densities of X and Y are non-vanishing, we use an approximation h(X +
Z) ! h(X) as ! 0.
We proved the entropy power inequality in dimension 1. In higher dimensions it takes a similar form in
that the constants do not degrade; it is dimension free.

Theorem 8.2 (Conditional EPI). If X and Y are conditionally independent given U.

22h(X+Y jU) 22h(XjU) + 22h(Y jU) :


Proof. By the EPI,

2h(X + Y j U = u) log 22h(XjU=u) + 22h(Y jU=u)


2h(X + Y U) log 22h(XjU) + 22h(Y jU) :
j concavity of log-sum-exp

Theorem 8.3 (EPI in n dimensions). If X n and Y n are random vectors in R n,


2 2 2
2n h(X +Y )
n n
2 n
h(X )
n
+ 2n h(Y n):
Proof.

h(Xn + Y n) = h(Xn + Yn) + h(Xn 1 + Y n 1 j Xn + Yn)


2 n 1 2 h(Y n 1 X ;Y )

2h(Xn 1 + Y n 1 j Xn; Yn) (n 1) log 2 n 2 1 h(X n 1


jXn;Yn)
+22 n 1 n 1 j n n

n 1 n 1 X)
2h(X +Y j Xn + Yn) (n 1) log 2 n 1 h(X j n + 2n 1 h(Y jYn)
2h(Xn) 2h(Yn)
2h(Xn + Yn) log 2 +2
n h(Xn + Y n) log 2 n h(X )
+ 2 n h(Y ) concavity of log-sum-exp
2 2 n 2 n

Consider a adversarial channel who sees the distribution of X, chooses distribution Z and outputs Y =
X + Z. We want to find the capacity
sup inf I(X; X + Z)
PZ
PX
2
subject to EX2 2
X and EZ Z
2
.

I(X; X + Z) = h(X + Z) h(Z)


h(X + Z) h(Z ) Z N(0; 2 )
Z
h(X + Z ) h(Z )
2 EPI
=1 2 log 1 + Z 2

sup inf h(X + Z) h(Z) sup h(X + Z ) h(Z )


PZ
PX PX
2
1
X

2
= 2 log 1 + Z Gaussian channel
19
Thus, we have the Gaussian channel saddle point property:
2 PZ PZ
1 2 Z2
X
PX PX
log 1+ = sup inf I(X; X + Z) = inf sup I(X; X + Z)

In other words,
I(X;X +Z ) I(X;X +Z) I(X ;X +Z):
So non-Gaussian channels have higher capacity.
Getting Gaussian codes to work for any channel: perform unitary transformation before sending to
channel, and then perform inverse on output. CLT?

9 Rate distortion theory


Lossy compression. Encoder observes X n, sends nR bits to decoder, who then outputs estimate Xn.
If R > H(X), then Xn = Xn is possible with high probability. bits in the repre-
If , then what can we say? Trade-off between dimension reduction (rate, number of b
R < H(X)

b
sentation) and fidelity of the reconstruction.
+
To measure fidelity, we need some distortion function (measure) d : X X ! R . For convenience 2we
sometimes consider bounded distortion functions that satisfy d = max d(x; x) < . For example, (x x) is
unbounded. Most results generalize to the unbounded case. max x;x b 1

Some distortion functions are b b b


Hamming distortion d(x; xb) = 1[x 6= xb], used in the case where X = Xb (or some subset
relationship). Note Ed(X; Xb) = P (X 6= Xb).

Squared error / quadratic loss d(x; x) = (x x)2. Again, this is unbounded on R2.
sequences by d(xn; xn) = 1 n d(x ; x ), the average per-symbol dis-

We can extend distortion functions to b b n n n i=1 i i )


tortion. Other possibilities that we will not consider include d(x ; x ) =P . n.
max d(x ;x

n nR
i nR
i i
We have a encoding function f
n :
X ! [2 ]
and a b
decoding function g
n : [2 ]
b
!X

b
The central quantity is the expected distortion X n
X b b
Ed(Xn; gn(fn(Xn))) = | {z } xn
n n n
p(x )d(x ; gn(fn(x ))):

b nR
A rate distortion pair (R; D) is achievable if there exists a sequence of (2 ; n) codes (fn; gn) such that
n n
nlim Ed(X ; gn(fn(X ))) D:
!1

Note that if (R; D) is achievable, then (R; D0) and (R0; D) are also achievable if D0 D and R0 R.
The achievable rate distortion region is also convex: if you have two codes, you can flip a coin to choose
which code to use for a particular block. The rate and distortion will just be the convex combinations.
The point where the boundary of the achievable region hits the D = 0 axis is (H(X); 0). Where the
boundary hits R = 0 is (0; minxb Ed(X; xb)) (output the least offending guess on average). If d max exists, it
is larger than this minimum.
Note that the achievability definition can be restated > 0 such that P (d(Xn; gn(fn(Xn)) > D + ) ! 0 for
all
> 0. (???)
The rate distortion function is R(D) = inffR : (R; D) achievableg, the lower boundary of the achievable
rate distortion region.

Theorem 9.1. For Xi PX i.i.d., jX j; jXb j < 1, and distortion bounded by dmax,

R(D) = min I(X; Xb):


PXcjX :Ed(X;Xb) D

20
In channel coding, the channel is fixed and we control the distribution over the input. In this setting,
the input PX is given, and we control the encoding/decoding scheme that gives the output.
Recall PXjX 7!I(X; X) is convex for fixed PX , so this is a convex optimization problem. [Note that the
constraint on bthe expected distortion
b
is a linear constraint: d(X; X) =
E x;x
p(x)p(x x)d(x; x).]
j In c hannel c odi ng, if we di d not c hoos e too many inputs, their typic al i mag es under the channel woul d hopefull y

P b
be disjoint. This is a packing argument. We could use a volume argument to estimate how many inputs we can send.

b b b
n
In our setting, we have some subset of of size 2nR, and we consider the “preimage” of inputs in
n
that are
X X n
within distortion D of these outputs X(i). We want enough in the output space so that their “preimages” cover X .
b

b
This is in some sense a dual of channel coding. Note that R controls the number of “preimages,” and with lower
R, it becomes harder to cover. D controls the size of each “preimage” and lower D makes it harder to cover.
Example 9.2. Let X Ber(p) with p 1=2. Let be the XOR operation, and consider the Hamming distortion.
Lower bounding I(X; X) subject to Ed(X; X) D gives
b
b I(X ;X) = H(X) H(X X)
b j b
H(p) H(X X j X) X)

H(p) H(X b b

H(p) H(D): b
H(p) H(D) is the mutual information corresponding to the BSC with transition probability D taking some
input X and having output distribution Y following (p; 1 p). Indeed, I(X; Y ) = H(Y ) H(Y j X) = H(p) H(D).
How do we use this information to achieve this rate in our original problem?
Put Xb through a BSC(D) channel so that X has distribution (p; 1 p). Using Bayes’s Rule gives P
(xb = 0) =
1 pD.1
2D
So, (

R(D) = H(p) H(D) D p


0 D>p
If D p, we can simply output 0 all the time, and then the expected distortion is p.

Example 9.3. Consider a Gaussian random variable X N(0; 2). How would we do a one-bit quantization?
Let
f(X) := x1[X 0] x1[X < 0]. Ifb x = q
2
, then E(X f(X))2 = 2 2
0:36 2.
We b b PXjX : E(X X) D
want to find R(D) = min 2 I(X; X).

c b b
I(X; X) = h(X) h(X j X)
b = h(X) h(X b j X)
X

b b
h(X) h(X Xb)
1 1
2
2 log(2 e ) 2 log(2 eD)
1 log 2
= :
2 D
We have shown a lower bound. Is equality attained?
the theorem about the Gaussian channel: if Y = X + Z where X N(0; P ) and Z N(0; N), then
Recall 1 N+P
I(X;Y) = log . channel X = X + Z where X
2 N N(0; 2 D) and Z N(0; D), then X N(0; 2).
Consider a Gaussian 2 b
Then I(X; X) = 2 lo g
. So, R(D) = 2 log 2 b D D .

1 1
b
Now compare with our one-bit quantizer. With the same second fidelity constraint D = 0:36 2, the optimal rate
2 1 1
is R(0:36 ) = log 0:737. This is a significant improvement over the rate 1 of the one-bit quantizer. It is
2 0:36
suboptimal to quantize on a symbol-by-symbol basis; we gain by quantizing on blocks.
21
Suppose X N(0; 1) is the input to a neural network which outputs X. Suppose we have E(X X)2 = 1=2.
information flow through any layer is R(1=2) = 1=2.
The
We now prove the theorem. b b
Proof of achievability. Fix PXbjX such that Ed(X; Xb) D. We want to show that there exists a sequence of
(2nR; n) codes that have rate R I(X; Xb) and achieve E[d(X n; gn(fn(Xn))) ! D as n ! 1.
We define a distortion-typical set.
n n
Ad; := (xn; xn) : n log p(x ; x ) H(X; X) < ;
(n) n 1

1 b n b b
n log p(x ) h(X) < ;

1 n
log p(x ) h(X) <;
n

n n b
d(x ; x ) b Ed(X; X) <
(n) n n o
j j
=A (X; X) (x ; x b) : d(x ; x ) Ed(X; X) <
b \f j n n j g
b
A (X; b b b
(n) X):

() b
1. We have P (Ad;n) ! 1 since the two sets in the intersection have probability tending to 1, both by the weak law
of large numbers.
n n n n(I(X;X )+3 )
2. p(x ) p(x j x )2 b for all (xn; xn) 2 Ad;(n).

b b
n n n
p(xn; xn) n
b
n(H(X;X) H(X) H(X)+3 ) n n(I(X;X)+3 )
b j b p(xb)p(x ) b b b b b

p(x x ) = p(x ) n n p(x )2 = p(x )2

b
3. If 0 x; y 1 and n 0, then (1 xy)n 1 x + e yn. To see this, note x
n
(1 xy) is convex and
yn 7! y
nonincreasing for fixed y. Also, x 7!1 x+e is linear and nonincreasing for fixed y. Also, 1 y e .
[See Lemma 10.5.3.]
We describe the random code.

1. Generate 2nR sequences Xbn(i), i = 1; : : : ; 2nR i.i.d. from PXb.

2. Typical set encoding: for each Xn, select i such that (xn; Xbn(i)) 2 A(d;n) if
possible. If there are multiple such i, break ties arbitrarily.
If no such i exists, then send i = 1. This happens with small probability P e.

Then,
n n min E d(X; x) <
Ed(X ; Xb (i)) (1 Pe)(D + ) + Pedmax dmax can be relaxed to x
1
b b
D + + Pedmax:

It now suffices to show Pe ! 0 provided R > I(X; Xb).

X
P ((xn; Xbn) 2= A(d;n)) = 1 b p(xn)1A(n) (xn; xn)
b b
2nR
xn d;

"
P (@i : (xn; Xbn(i)) 2 A(d;n)) = 1 n n n #
X p(x )1Ad;(n) (x ; x )
b
n
b b
independence
x

22
" #
Pe = n
p(xn) 1 n
p(xn)1Ad;(n) (xn; xn) 2nR

X Xb
b b 2nR

x x
" n n n(I(X;X)+3 ) n n #
n p(xn) 1 n p(x j x )2 1Ad;(n) (x ; x )
X Xb
b b b 2nR

x x

= x
n
p(x n 6
) 1 2
2 n(I(X;X)+3 )
y b
n
x p(xn j x n )1Ad;(n) (x ; x
n n )3
7
X 6 X 7

6 b x b 7
4 | {z }b 5

"
n p(x n ) 1 n p(x n j x )1Ad;(n ) ( x
n | n;x )+e
n {z 2 n(I(X;X c
} 2 # inequality in “3” above
)+ ) nR
X Xb
b b
x x

= P ((Xn; Xbn) 2= A(d;n)) + e 2n(R (I(X;Xc)+3 ))


!0+0 R > I(X; Xb) + 3

Recall the rate distortion setup. We give Xn (drawn i.i.d. from some PX ) to an encoder, who then
sends nR bits to a decoder, who then outputs Xbn which hopefully has distortion Ed(Xn; Xbn) D.
We want to characterize R(D) = inffR : (R; D) achievableg, the lowest possible rate at which it is
possible to obtain [asymptotically] expected distortion D.
It turns out that
R(D) = min I(X; Xb):
PXcjX :Ed(X;Xb) D

Last time, we proved achievability: if R > R(D), then there exists a sequence of (2nR; n) codes with
limn!1 Ed(Xn; Xbn) D.
Sketch:

1. Fix P such that E d(X; X ) D.


XbjX PX PXcjX b
2. We picked 2nR sequences Xbn(1); : : : ; Xbn(2nR). For each, there is a distortion ball fXn : d(Xn; Xb(i)) Dg.

3. If R > R(D), then we have chosen enough Xbn(i) so that the set of all corresponding distortion balls
is so large that the probability of error is small...

We now turn to proving the converse. First, we need the following lemma.

Lemma 9.4. D 7!min I(X; Xb) is convex.


PXcjX :Ed(X;Xb) D

We proved convexity of the operational definition of R(D) last time by showing the set of achievable pairs (R; D)
is convex. However, this lemma asserts convexity of the thing that we have yet to prove is equal to R(D).
Proof. Let P (0) achieve R(D0), and let P (1) achieve R(D1).
Define P (b
XjX = P + P . XjX
b
) (0) (1)
XjX
7!
XjX XjX
P bj b () (0) (1)

b
Since X X b is b X

I (X; Xb)I (X; Xb) + I (X; X ) = R(D ) +


PXcjX PXcjX PXcjX b 0 R(D1):
EP d(X; Xb) = E E d(X; X ) D
XcjX PXcjX d(X; Xb) + PXcjX b 0+ D1:
() (0) (1)

23
Thus
() min
I
P XcjX (X; Xb) PXcjX :Ed(X;Xb) D0+ D1 I(X; Xb) = R( D0 + D1):

We now prove the converse. We want to prove that if R < R(D), then (R; D) is not achievable by any scheme.

Proof of the converse. Consider a (2nR; n) code with encoder fn and decoder gn that achieves Ed(Xn;

Xbn) = D where Xbn = gn(fn(Xn)). We want to show R R(D).

nR H(Xbn)
I(Xbn; Xn)
= H(Xn) H(Xn j Xbn)
n
X

= (H(Xi) H(Xi j Xbn; Xi 1)


i=1
n
X

H(Xi) H(Xi j Xbi) conditioning reduces entropy


i=1
n
X

= I(Xi; Xbi)
i=1
n
X

R(Ed(Xi; Xbi)) def. of R(D)


i=1
1 n
Xi b 1 n
nR n Ed(Xi; Xi) ! convexity, Jensen
=1

Xi b b
= nR(D): En d(Xi; Xi) = Ed(Xn; Xn) = D
=1

Remark: nR I(Xn ; Xn) could be deduced directly by data-processing, since there is a “bottleneck” of nR bits

in the model. (?) b


Let us see what happens when the inequalities become tight, in order to characterize a good scheme.

nR = H(Xbn): all 2nR reproductions Xb(i) are equally likely.

H(Xbn j Xn) = 0, i.e. Xbn is a deterministic function of Xn. [This shows that randomized decoding
doesn’t help.]

H(Xi j Xbn; X1; : : : ; Xi 1) = H(Xi j Xbi), i.e. Xbi is a sufficient statistic for Xi.

P = argmin I(X; Xb).


XbijXi PXcjX :Ed(X;Xb D
In the application of Jensen, either R(D) is linear (usually isn’t), or, in the strictly convex case, Ed(Xi; Xbi)
= D (exactly the same) for all i.

We now consider joint source channel coding. Let V m be the observation (i.i.d. from PV ). We encode V m
and encodes it into Xn, which gets sent through a discrete memoryless channel (DMC) P Y jX . The channel
outputs Y n is then decoded into a reproduction Vbm of the original observation. We would like Ed(V m; Vbm) D.
Theorem 9.5. Distortion D is achievable if and only if R(D) BC where B = mn is the bandwith mismatch.

24
This has a separation result, in that the following scheme is optimal. Do the rate-distortion optimal encoding to
nR(D) bits, and these are uniformly distributed. This what the channel likes. Use a channel code at rate C and send it
through the channel. Finally, do the corresponding channel decoding and the rate-distortion decoding.
The “if” part is simple: the rate R is lower than the capacity, so everything works.
For the reverse,
nC I(Xn; Y n) channel coding converse
m m
I(V ; Vb )data processing
mR(D): rate distortion converse

Thus R(D) BC.

10 Approximating distributions and entropy


“Approximating Probability Distributions with Dependence Trees” (Chow-Liu 1968)
P is a joint didstribution on n variables x = (x1; : : : ; xn). Estimating this is hard (curse of dimensionality). Note
n
Y

P (x) = P (xmi j xm1 ; : : : ; xmi 1 )


i=1

where m1; : : : ; mn is any permutation of [n].


We want to approximate $P$ by a "second order" distribution, also known as a "tree dependence" approximation:
$$P_{\mathrm{tree}}(x) = \prod_{i=1}^n P(x_{m_i} \mid x_{m_{j(i)}}),$$
where $j(i)$ is the parent of $i$ in the tree. Note that these approximations use the same $P$ that we are estimating, so for each tree we have an explicit approximation. We are not approximating $P$ with an arbitrary distribution with a tree structure.
We want to optimize
$$\min_{t \in T_n} D(P \,\|\, P_t).$$
Note $|T_n| = n^{n-2}$ (Cayley's formula), so exhaustive search over trees is infeasible. Also note that, by Pinsker's inequality, $D(P\|P_t) \ge \frac{2}{\ln 2}\, \|P - P_t\|_{TV}^2$, so a good approximation in relative entropy is also good in total variation.

A maximum-weight dependence tree is a tree $t$ satisfying
$$\sum_{i=1}^n I(X_i; X_{j(i)}) \ge \sum_{i=1}^n I(X_i; X_{j'(i)}), \qquad \forall t' \in T_n.$$
In other words, if we consider the complete graph on $n$ vertices with edge weights $I(X_i; X_j)$, then the maximum-weight spanning tree is the maximum-weight dependence tree.

Theorem 10.1. $t \in \operatorname{argmin}_{t' \in T_n} D(P \| P_{t'})$ if and only if $t$ is a maximum-weight dependence tree.
Proof.
$$
\begin{aligned}
D(P \| P_t) &= \sum_x P(x) \log \frac{P(x)}{P_t(x)} \\
&= \sum_x P(x)\log P(x) - \sum_x P(x) \sum_{i=1}^n \log P(x_i \mid x_{j(i)}) \\
&= -H(X) - \sum_x P(x) \sum_{i=1}^n \log \frac{P(x_i, x_{j(i)})}{P(x_{j(i)})\, P(x_i)} - \sum_x P(x) \sum_{i=1}^n \log P(x_i) \\
&= \underbrace{-H(X) + \sum_{i=1}^n H(X_i)}_{\text{no dependence on } t} - \sum_{i=1}^n I(X_i; X_{j(i)}).
\end{aligned}
$$
So $\operatorname{argmin}_t D(P\|P_t) = \operatorname{argmax}_t \sum_{i=1}^n I(X_i; X_{j(i)})$.
If we want to do this approximation, we just need to estimate the $O(n^2)$ mutual informations, rather than the $O(2^n)$ probabilities (if the $x_i$ are binary).
The maximum likelihood estimator of the true tree is the plug-in estimator: estimate the mutual informations from data and take the maximum-weight spanning tree with respect to the estimated weights.
Why does relative entropy end up being so convenient here? The chain rule / factorization property is one explanation.
Our problem is as follows. We have samples $X^{(1)},\dots,X^{(m)}$ and we want to estimate the dependence tree. We estimate the mutual informations $I(X_i; X_j)$ and find the empirical maximum-weight tree (a small sketch of this procedure follows).
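As a concrete illustration of this procedure, here is a minimal sketch (the function names and interface are mine, not from the paper; it assumes the data is an `(m, n)` array of discrete symbols): it computes plug-in pairwise mutual informations and grows a maximum-weight spanning tree with Prim's algorithm.

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) in bits from two discrete sample vectors."""
    n = len(x)
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                p_a, p_b = np.mean(x == a), np.mean(y == b)
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(samples):
    """Return edges (i, parent(i)) of the empirical max-weight dependence tree."""
    m, n = samples.shape
    w = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        w[i, j] = w[j, i] = empirical_mi(samples[:, i], samples[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # pick the heaviest mutual-information edge crossing the cut (Prim's step)
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: w[e])
        edges.append((j, i))   # node j gets parent i
        in_tree.add(j)
    return edges
```

Since all edge weights are nonnegative, any maximum-weight spanning tree algorithm (Prim or Kruskal) works here.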
Recall $I(X;Y) = H(X) + H(Y) - H(X,Y)$. So it suffices to estimate entropies.
How do we estimate H(P ) given n i.i.d. samples drawn from P ?
Classical statistics: suppose $|\mathcal{X}| = S$ is fixed, and find the optimal estimator of $H(P)$ as $n \to \infty$. Let $\hat{P}_n$ denote the empirical distribution; the plug-in estimate is
$$H(\hat{P}_n) = -\sum_i \hat{p}_i \log \hat{p}_i,$$
where $\hat{p}_i$ is the relative frequency of symbol $i$ in the data.
$H(\hat{P}_n)$ is the MLE for $H(P)$, and it is asymptotically efficient (asymptotically it attains equality in the Cramér-Rao bound) by Hájek-Le Cam theory.
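For concreteness, a minimal sketch of the plug-in (MLE) estimator, assuming the samples over a finite alphabet are stored in a 1-d array:

```python
import numpy as np

def entropy_mle(samples):
    """Plug-in estimate H(P_hat) in bits from an array of discrete samples."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return -np.sum(p_hat * np.log2(p_hat))
```

When $n$ is small relative to the alphabet size, this estimator systematically underestimates $H(P)$; that bias is the subject of the next discussion.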
What about non-asymptotics? What if n is not “huge” relative to the alphabet size S?
Decision-theoretic framework: for an estimator $\hat{H}_n$, the worst-case risk is
$$R_n^{\max}(\hat{H}_n) = \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \big(H(P) - \hat{H}_n\big)^2,$$
and the minimax risk is
$$\inf_{\hat{H}_n}\, \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \big(H(P) - \hat{H}_n\big)^2.$$
Classical asymptotics: for the plug-in estimator,
$$\mathbb{E}_P \big(H(P) - H(\hat{P}_n)\big)^2 \approx \frac{\operatorname{Var}(-\log P(X))}{n}, \qquad \sup_{P \in \mathcal{M}_S} \operatorname{Var}(-\log P(X)) \lesssim (\log S)^2.$$
Does $n \gg (\log S)^2$ imply consistency?
No. Bias-variance decomposition:
$$\mathbb{E}_P \big(H(P) - \hat{H}\big)^2 = \big(\mathbb{E}[\hat{H}] - H(P)\big)^2 + \operatorname{Var}_P(\hat{H}).$$
Jiao-Venkat-Han-Weissman (2014): for the plug-in estimator,
$$R_n^{\max}\big(H(\hat{P}_n)\big) \asymp \underbrace{\frac{S^2}{n^2}}_{\text{bias squared}} + \underbrace{\frac{(\log S)^2}{n}}_{\text{variance}}.$$
If $n \gg S$ (e.g., $n \to \infty$ while $S$ is fixed), then the bias term is small and we recover the classical asymptotics. Otherwise, the bias term becomes important. So consistency of the MLE is equivalent to $n = \omega(S)$, i.e., $n/S \to \infty$.
Next time, we show we can do better.

Recall we are trying to estimate $H(P)$ for some distribution $P$, using i.i.d. samples $X^n$. The MLE is $H(\hat{P}_n)$, where $\hat{P}_n$ is the empirical distribution.
From classical statistics,
$$\mathbb{E}_P \big(H(P) - H(\hat{P}_n)\big)^2 \approx \frac{\operatorname{Var}(-\log P(X))}{n}$$
as $n \to \infty$. If $S$ is the support size, this suggests the sample complexity is $\Theta((\ln S)^2)$. However, this is only valid in the asymptotic regime $n \to \infty$ while the support size is fixed.

If the support has size $S$,
$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \big(H(P) - H(\hat{P}_n)\big)^2 \asymp \frac{S^2}{n^2} + \frac{(\ln S)^2}{n}.$$
So the sample complexity of the MLE is actually $\Theta(S)$. If $n$ is not large enough, the bias term (the first term) is too large. There is a phase transition: if $n \gtrsim S$ then the risk is nearly zero; if $n$ is smaller, then the risk is high.
Can we do better than the MLE? Valiant and Valiant showed that the minimax phase transition for entropy estimation occurs instead at $\Theta(S/\ln S)$.
We get an "effective sample size enlargement" phenomenon: an optimal estimator with $n$ samples has the same risk as the MLE with $n\log n$ samples.
Note that entropy is separable: $H(P) = -\sum_i p_i \log p_i = \sum_i f(p_i)$, where $f(x) = -x\log x$. Recall the issue with the MLE is large bias when $n$ is not large enough.
The plug-in estimator is $H(\hat{P}_n) = -\sum_i \hat{p}_i \log \hat{p}_i$. Consider the function $f(x) = -x\log x$. If $\hat{p}_i \approx p_i$ and $p_i$ is near $1/2$, then $f(\hat{p}_i) \approx f(p_i)$ because the slope of $f$ is low there. However, near zero $f$ has infinite slope, so $f(p_i)$ and $f(\hat{p}_i)$ can differ greatly.
To fix this, we divide $[0,1]$ into a smooth regime $(\log n/n,\ 1]$ and a non-smooth regime $[0,\ \log n/n]$.
For the smooth regime we use the bias-corrected estimate $f(\hat{p}_i) - \frac{1}{2n} f''(\hat{p}_i)\,\hat{p}_i(1-\hat{p}_i)$.
For the non-smooth regime $[0, \log n/n]$, we use the best (sup-norm) polynomial approximation of $f$ of order $\log n$.
Note that this estimation procedure does not depend on $S$.
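As a rough sketch of the smooth-regime correction only (the full minimax-optimal estimator also needs the polynomial approximation in the non-smooth regime): since $f(x) = -x\log_2 x$ has $f''(x) = -1/(x\ln 2)$, the correction $f(\hat{p}_i) - \frac{1}{2n}f''(\hat{p}_i)\hat{p}_i(1-\hat{p}_i)$ adds $(1-\hat{p}_i)/(2n\ln 2)$ bits per observed symbol, a Miller-Madow-type correction. The function name below is mine.

```python
import numpy as np

def entropy_bias_corrected(samples):
    """Plug-in entropy (bits) plus the second-order bias correction
    -(1/2n) f''(p_hat) p_hat (1 - p_hat) for f(x) = -x log2(x),
    i.e. +(1 - p_hat)/(2 n ln 2) for each observed symbol."""
    _, counts = np.unique(samples, return_counts=True)
    n = counts.sum()
    p_hat = counts / n
    h_plugin = -np.sum(p_hat * np.log2(p_hat))
    correction = np.sum(1.0 - p_hat) / (2.0 * n * np.log(2))
    return h_plugin + correction
```

This only reduces the bias for symbols whose frequencies sit in the smooth regime; symbols with $\hat{p}_i \lesssim \log n / n$ still need the polynomial-approximation treatment described above.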
We compare the L2 rates:

$$\text{minimax}: \quad \frac{S^2}{(n\log n)^2} + \frac{(\ln S)^2}{n}, \qquad\qquad \text{MLE}: \quad \frac{S^2}{n^2} + \frac{(\ln S)^2}{n}.$$

11 Computing rate distortion and channel capacity


Alternating minimization algorithm.
Example: Suppose we have disjoint sets $A, B$ and we want to find $\min_{a\in A,\, b\in B} \|a - b\|^2$. We maintain current iterates $a_t, b_t$ and update $a_{t+1} = \operatorname{argmin}_{a\in A} \|a - b_t\|^2$ and $b_{t+1} = \operatorname{argmin}_{b\in B} \|a_{t+1} - b\|^2$. This is guaranteed to converge to the minimum for well-behaved distance functions and convex $A$ and $B$.
For us, relative entropy is a good distance function.
Let $\mathcal{A}$ be the set of joint distributions with $X$-marginal $P_X$ and expected distortion at most $D$:
$$\mathcal{A} := \Big\{ Q_{X,\hat{X}} : \mathbb{E}_Q\, d(X,\hat{X}) \le D,\ \ \sum_{\hat{x}} Q(x,\hat{x}) = P_X(x) \Big\}.$$
Then
$$R(D) = \min_{Q \in \mathcal{A}} I_Q(X;\hat{X}) = \min_{Q \in \mathcal{A}} D(Q_{X,\hat{X}} \,\|\, Q_X Q_{\hat{X}}) = \min_{Q \in \mathcal{A}} D(Q_{X,\hat{X}} \,\|\, P_X Q_{\hat{X}}).$$
Lemma 11.1.
$$R(D) = \min_{Q \in \mathcal{A}}\ \min_{R_{\hat{X}}} D(Q_{X,\hat{X}} \,\|\, P_X R_{\hat{X}}).$$
Proof.
$$D(Q_{X,\hat{X}} \,\|\, P_X R_{\hat{X}}) - D(Q_{X,\hat{X}} \,\|\, P_X Q_{\hat{X}}) = D(Q_{\hat{X}} \,\|\, R_{\hat{X}}) \ge 0.$$
To implement the alternating minimization algorithm, we need to solve two problems.

- Given $Q_{X,\hat{X}}$, find the $R_{\hat{X}}$ that minimizes $D(Q_{X,\hat{X}} \| P_X R_{\hat{X}})$. The above lemma implies $R_{\hat{X}} = Q_{\hat{X}}$.
- Given $R_{\hat{X}}$, find the $Q_{X,\hat{X}} \in \mathcal{A}$ that minimizes $D(Q_{X,\hat{X}} \| P_X R_{\hat{X}})$. Since the marginal $Q_X = P_X$ is fixed, we just need to find $Q_{\hat{X}|X}$. The following lemma shows there is a closed-form expression.
Lemma 11.2. The minimizing conditional distribution has the form
$$Q_{\hat{X}|X}(\hat{x} \mid x) = \frac{R_{\hat{X}}(\hat{x})\, e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}'} R_{\hat{X}}(\hat{x}')\, e^{-\lambda d(x,\hat{x}')}},$$
where $\lambda \ge 0$ is such that $\mathbb{E}_Q\, d(X,\hat{X}) = D$.


Proof. Lagrange multipliers. Form
$$J(Q_{\hat{X}|X}) = D(Q_{X,\hat{X}} \,\|\, P_X R_{\hat{X}}) + \lambda_1\, \mathbb{E}_Q\, d(X,\hat{X}) + \sum_x \lambda_2(x) \sum_{\hat{x}} P_X(x)\, Q_{\hat{X}|X}(\hat{x}\mid x).$$
Differentiating,
$$\frac{\partial}{\partial Q_{\hat{X}|X}(\hat{x}\mid x)} J(Q_{\hat{X}|X}) = P_X(x) \log\frac{Q_{\hat{X}|X}(\hat{x}\mid x)}{R_{\hat{X}}(\hat{x})} + P_X(x) + \lambda_1 P_X(x)\, d(x,\hat{x}) + \lambda_2(x) P_X(x).$$
Setting this to zero and normalizing gives the stated form (with $\lambda = \lambda_1$).
Proof of second part of Lemma 10.8.1.


$$
\begin{aligned}
\sum_{x,y} p(x)\, p(y\mid x) \log\frac{r^*(x\mid y)}{p(x)} - \sum_{x,y} p(x)\, p(y\mid x)\log\frac{r(x\mid y)}{p(x)}
&= \sum_{x,y} p(x)\, p(y\mid x)\log\frac{r^*(x\mid y)}{r(x\mid y)} \\
&= \sum_y p(y) \sum_x r^*(x\mid y) \log\frac{r^*(x\mid y)}{r(x\mid y)} \\
&= \sum_y p(y)\, D\big(r^*(\cdot\mid y)\,\|\,r(\cdot\mid y)\big) \\
&\ge 0.
\end{aligned}
$$

Similarly, there is an alternating maximization procedure for channel capacity.

$$
\begin{aligned}
C &= \max_{P_X} I(X;Y) \\
&= \max_{P_X} D(P_{XY} \,\|\, P_X P_Y) \\
&= \max_{Q_{X|Y}}\ \max_{R_X} \sum_{x,y} R_X(x)\, P_{Y|X}(y\mid x) \log\frac{Q_{X|Y}(x\mid y)}{R_X(x)}.
\end{aligned}
$$
For any $R_X$, the inner maximum over $Q_{X|Y}$ is attained by the induced posterior
$$Q_{X|Y}(x\mid y) := \frac{R_X(x)\, P_{Y|X}(y\mid x)}{\sum_{x'} R_X(x')\, P_{Y|X}(y\mid x')}.$$
For any $Q_{X|Y}$, the maximum over $R_X$ is attained by
$$R_X(x) = \frac{\prod_y Q_{X|Y}(x\mid y)^{P_{Y|X}(y\mid x)}}{\sum_{x'} \prod_y Q_{X|Y}(x'\mid y)^{P_{Y|X}(y\mid x')}}.$$
12 Information theory and statistics
12.1 Theory of types
Let $X_1,\dots,X_n$ be a sequence of symbols from $\mathcal{X} = \{a_1,\dots,a_{|\mathcal{X}|}\}$.
The type of $X^n$, denoted $P_{X^n}$, is the empirical distribution associated to $X^n = (X_1,\dots,X_n)$.
$\mathcal{P}_n$ denotes the set of types with denominator $n$, i.e., the possible types associated with a sample of size $n$.
Example 12.1. If $\mathcal{X} = \{0,1\}$, then $\mathcal{P}_n = \big\{ \big(\tfrac{k}{n}, \tfrac{n-k}{n}\big) : 0 \le k \le n \big\}$.

The type class of $P \in \mathcal{P}_n$ is $T(P) = \{x^n \in \mathcal{X}^n : P_{x^n} = P\}$.
Example 12.2. If $P = (3/8, 5/8)$, then $T(P)$ consists of all $\binom{8}{3}$ binary vectors of length 8 with exactly three zeros.
Theorem 12.3.
$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}.$$
Proof. There are at most $n+1$ choices $\{0,1,\dots,n\}$ for each numerator.

[The above bound is very crude, but it is good enough because we will be comparing this to things
that grow exponentially in n.]
Consequently, the number of type classes is polynomial in n.

Theorem 12.4. Let $X_1,\dots,X_n \sim Q$ be i.i.d. Then the probability of $X^n$ is
$$Q^n(X^n) = 2^{-n\left(H(P_{X^n}) + D(P_{X^n}\|Q)\right)}.$$
It makes sense that the probability depends only on the type (permuting the symbols does not affect the empirical distribution). Recall $Q^n(X^n) \approx 2^{-nH(Q)}$ for typical $X^n$, and contrast this with the statement of the theorem.
Proof.
$$
\begin{aligned}
2^{-n\left(H(P_{X^n}) + D(P_{X^n}\|Q)\right)}
&= 2^{\,n\sum_{a\in\mathcal{X}} P_{X^n}(a)\log P_{X^n}(a) \;-\; n\sum_{a\in\mathcal{X}} P_{X^n}(a)\log\frac{P_{X^n}(a)}{Q(a)}} \\
&= 2^{\,n\sum_{a\in\mathcal{X}} P_{X^n}(a)\log Q(a)} \\
&= \prod_{a\in\mathcal{X}} Q(a)^{\,n P_{X^n}(a)} \\
&= Q^n(X^n),
\end{aligned}
$$
where we note $n P_{X^n}(a)$ is the number of times $a$ appears in the sample.

Theorem 12.5.
$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}.$$
That is, the type class of $P$ has about $2^{nH(P)}$ sequences.
This is a more precise statement than the one for typical sets. Both show how entropy is a notion of volume.

Proof. For the upper bound,
$$1 \ge P^n(T(P)) = \sum_{x^n \in T(P)} P^n(x^n) = |T(P)|\, 2^{-nH(P)}.$$
The last equality is due to
$$2^{-nH(P)} = 2^{\,n\sum_a P(a)\log P(a)} = \prod_a P(a)^{\,nP(a)} = P^n(x^n) \qquad \text{for } x^n \in T(P).$$
[Alternatively, apply the previous theorem with $Q = P_{X^n}$.]


We now prove the lower bound. We will assume
$$P^n(T(P)) \ge P^n(T(\hat{P})) \qquad \forall\, \hat{P} \in \mathcal{P}_n,$$
and prove it later. Intuitively, it states that under a particular probability distribution, the type with maximum probability is the original distribution. Then
$$
\begin{aligned}
1 &= \sum_{Q \in \mathcal{P}_n} P^n(T(Q)) \\
&\le \sum_{Q \in \mathcal{P}_n} \max_{Q'} P^n(T(Q')) \\
&= \sum_{Q \in \mathcal{P}_n} P^n(T(P)) \\
&= |\mathcal{P}_n|\, P^n(T(P)) \\
&\le (n+1)^{|\mathcal{X}|}\, P^n(T(P)) \\
&= (n+1)^{|\mathcal{X}|}\, |T(P)|\, 2^{-nH(P)}.
\end{aligned}
$$
It remains to prove the "maximum likelihood" result. Using $|T(P)| = \binom{n}{nP(a_1),\dots,nP(a_{|\mathcal{X}|})}$,
$$
\begin{aligned}
\frac{P^n(T(P))}{P^n(T(\hat{P}))}
&= \frac{|T(P)|\prod_a P(a)^{nP(a)}}{|T(\hat{P})|\prod_a P(a)^{n\hat{P}(a)}} \\
&= \prod_a \frac{(n\hat{P}(a))!}{(nP(a))!}\, P(a)^{n(P(a)-\hat{P}(a))} \\
&\ge \prod_a (nP(a))^{\,n\hat{P}(a) - nP(a)}\, P(a)^{n(P(a)-\hat{P}(a))} && \text{using } \tfrac{m!}{\ell!} \ge \ell^{\,m-\ell} \\
&= \prod_a n^{\,n(\hat{P}(a)-P(a))} \\
&= n^{\,n\sum_a(\hat{P}(a)-P(a))} \\
&= 1.
\end{aligned}
$$

Theorem 12.6. For any $P \in \mathcal{P}_n$ and any distribution $Q$, the probability of the type class $T(P)$ under $Q^n$ is
$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)}.$$

The probability of observing some empirical distribution under Q is exponentially small in the relative entropy.

Proof.
$$Q^n(T(P)) = \sum_{x^n \in T(P)} Q^n(x^n) = \sum_{x^n \in T(P)} 2^{-n(D(P\|Q)+H(P))} = |T(P)|\, 2^{-n(D(P\|Q)+H(P))}.$$
Applying the previous theorem (the bounds on $|T(P)|$) finishes the proof.
In summary:
- $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$ (bound on the number of types);
- $Q^n(x^n) = 2^{-n(H(P)+D(P\|Q))}$ for $x^n \in T(P)$;
- $|T(P)| \doteq 2^{nH(P)}$;
- $Q^n(T(P)) \doteq 2^{-nD(P\|Q)}$,
where $\doteq$ denotes equality to first order in the exponent (i.e., up to factors polynomial in $n$).
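These relations are easy to sanity-check numerically in the binary case, where $Q^n(T(P))$ is an exact binomial expression. A small sketch (the example parameters are mine):

```python
from math import comb, log2

def check_type_probability(n=100, k=30, q=0.5):
    """Compare the exact log2 Q^n(T(P)) with the estimate -n D(P||Q)
    for the binary type P = (k/n, 1 - k/n) and Q = Ber(q)."""
    p = k / n
    exact = log2(comb(n, k)) + k * log2(q) + (n - k) * log2(1 - q)   # log2 of C(n,k) q^k (1-q)^(n-k)
    d_pq = p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))
    print(f"log2 Q^n(T(P)) = {exact:.2f},   -n D(P||Q) = {-n * d_pq:.2f}")

check_type_probability()
```

The gap between the two printed numbers is at most $|\mathcal{X}|\log_2(n+1)$, the polynomial factor appearing in Theorem 12.6.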

Theorem 12.7. Let $X_1, X_2, \dots$ be i.i.d. from $P$. Then
$$\mathbb{P}\big(D(P_{X^n}\|P) > \epsilon\big) \le 2^{-n\left(\epsilon - |\mathcal{X}|\frac{\log(n+1)}{n}\right)}.$$

Note that the right-hand side does not depend on P .


The Borel-Cantelli lemma implies that if $\sum_n \mathbb{P}(D(P_{X^n}\|P) > \epsilon) < \infty$ for every $\epsilon > 0$, then $D(P_{X^n}\|P) \to 0$ almost surely.
Given $\epsilon > 0$, let
$$T_Q^\epsilon := \{x^n : D(P_{x^n}\|Q) \le \epsilon\}.$$
Then
$$
\begin{aligned}
Q^n\big(\{X^n : D(P_{X^n}\|Q) > \epsilon\}\big) &= 1 - Q^n(T_Q^\epsilon) \\
&= \sum_{P \in \mathcal{P}_n:\, D(P\|Q) > \epsilon} Q^n(T(P)) \\
&\le \sum_{P \in \mathcal{P}_n:\, D(P\|Q) > \epsilon} 2^{-nD(P\|Q)} \\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-n\epsilon} \\
&= 2^{-n\left(\epsilon - |\mathcal{X}|\frac{\log(n+1)}{n}\right)}.
\end{aligned}
$$

This is a law of large numbers: the probability of getting a sample whose empirical distribution is far from $Q$ in relative entropy is exponentially small. Applying Borel-Cantelli implies
$$D(P_{X^n}\|Q) \to 0 \quad \text{almost surely.}$$
This is a strengthening of the law of large numbers.


Note that relative entropy controls L1 distance between measures. On finite-dimensional spaces, all
norms are equivalent, so up to constants relative entropy controls all norms (in finite dimensions). More
generally, relative entropy controls many transportation distances.

12.2 Large deviations
Let $X_1, X_2, \dots \sim Q$ be i.i.d. on a finite alphabet $\mathcal{X}$.
The weak law of large numbers states
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n X_i > \mathbb{E}X_1 + \epsilon\right) \to 0.$$
The proof using Chebyshev's inequality actually gives us a rate:
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n X_i > \mathbb{E}X_1 + \epsilon\right) \le \frac{\operatorname{Var}(X)}{n\epsilon^2}.$$

We usually rewrite the left-hand side as
$$\mathbb{P}\left(\sum_{i=1}^n X_i > n\,\mathbb{E}X_1 + n\epsilon\right) \approx 2^{-nE},$$
where $n\epsilon$ is called the large deviation. What is interesting is that $E$ is an exponent that we can compute explicitly, and the upper bound is tight to first order in the exponent.
Example 12.8. Let $X_i \sim Q = \mathrm{Ber}(p)$ be i.i.d. Note that $\frac{1}{n}\sum_{i=1}^n X_i = P_{X^n}(1)$ (the proportion of 1s under the empirical distribution). Then
$$
\begin{aligned}
\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n X_i \ge p + \epsilon\right)
&= \sum_{P \in \mathcal{P}_n:\, P(1) \ge p+\epsilon} Q^n(T(P)) \\
&\le |\mathcal{P}_n|\, 2^{-n\min D(P\|Q)} \\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-n\min D(P\|Q)},
\end{aligned}
$$
where the minimum is over the same types as in the sum. In fact,
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n X_i \ge p + \epsilon\right) \doteq 2^{-n\min D(P\|Q)}.$$
This example is a special case of the following theorem, with $E$ being the collection of distributions on $\{0,1\}$ with expectation at least $p + \epsilon$.

Theorem 12.9 (Sanov’s theorem). Let X1; X2; : : : Q be i.i.d. and let E be a collection of probability
distributions on X . Then the probability that the empirical distribution P Xn lies in E is
Qn(E) = Qn(E \ Pn) (n + 1)jXj2 nD(P kQ);

where P := argminP 2E D(P kQ). Moreover, if E is the closure of its interior, then

1
n log Qn(E) ! D(P kQ);

that is, the lower bound matches the upper bound.


A common example of the collection $E$ is $E := \{P : \sum_{x\in\mathcal{X}} g(x)P(x) \ge \alpha\}$, the set of distributions $P$ such that $\mathbb{E}_{X\sim P}\, g(X) \ge \alpha$. If $g(x) = x^k$, this is a moment constraint. We could also have many constraints ($g_j$ and $\alpha_j$ for $j = 1,\dots,J$).

Proof. For the upper bound,
$$
Q^n(E) = \sum_{P \in E\cap\mathcal{P}_n} Q^n(T(P)) \le \sum_{P \in E\cap\mathcal{P}_n} 2^{-nD(P\|Q)} \le (n+1)^{|\mathcal{X}|}\, 2^{-n\min_{P\in E\cap\mathcal{P}_n} D(P\|Q)} \le (n+1)^{|\mathcal{X}|}\, 2^{-nD(P^*\|Q)}.
$$
For the lower bound, for any $P_n \in E \cap \mathcal{P}_n$,
$$
Q^n(E) = \sum_{P \in E\cap\mathcal{P}_n} Q^n(T(P)) \ge Q^n(T(P_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P_n\|Q)}.
$$
We need to find a sequence $\{P_n : P_n \in E\cap\mathcal{P}_n\}_{n\ge 1}$ such that $D(P_n\|Q) \to D(P^*\|Q)$. If $E$ has nonempty interior and $E$ is the closure of its interior, then we can approximate any interior point by a sequence of types, and this is possible.

We review Sanov’s theorem. We consider the collection of probability distributions on X and observe X1; : :
: ; Xn Q for one particular distribution. Let E be a collection of other distributions, e.g., set of distribu-tions with
expected value 0:8. We want to understand the probability that the empirical distribution P Xn is in E.
Sanov’s theorem implies that this probability exponentially small with exponent min P12E D(P kQ).
If P := argminP 2E D(P kQ), then we get the lower bound Qn(T (P )) (n+1)jXj 2 nD(P kQ)
for free. The
“closure of the interior” condition allows use to use denseness of types to extend to the case where P
is not a type.

Example 12.10. Suppose we have a fair coin. What is the probability of at least 700 heads in 1000 tosses? Let $E = \{P : \mathbb{E}_{X\sim P}\, X \ge 0.7\}$. Why does this make sense? If the observations $X^n$ have at least 700 heads then $P_{X^n} \in E$ (here $n = 1000$). Note that $Q$ is the fair distribution. Sanov's theorem implies
$$\frac{1}{n}\log \mathbb{P}(\ge 700 \text{ heads}) \approx -D\big((0.7,0.3)\,\|\,(0.5,0.5)\big) \approx -0.119.$$
More precisely,
$$2^{-138.9} = 2^{-n(0.119+0.0199)} \le \mathbb{P}(\ge 700 \text{ heads}) \le 2^{-n(0.119-0.0199)} = 2^{-99.1}.$$
Again, $n = 1000$ (but the same argument works for any $n$). The upper and lower bounds differ only by a factor whose exponent is of order $\log n / n$ per symbol; as $n \to \infty$ the two exponents agree.
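The numbers in this example are easy to reproduce; a small sketch (log base 2 throughout):

```python
from math import log2

n, q = 1000, 0.5
p = 0.7                                   # boundary type of E = {P : E_P X >= 0.7}
d_star = p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))   # D((0.7,0.3)||(0.5,0.5)) ~ 0.119
poly = 2 * log2(n + 1) / n                # |X| log2(n+1) / n ~ 0.0199
print(f"upper-bound exponent: {n * (d_star - poly):.1f}")   # compare with 99.1 above
print(f"lower-bound exponent: {n * (d_star + poly):.1f}")   # compare with 138.9 above
```

The printed exponents match the ones above up to rounding of $D$ to three decimals.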

To compute $P^* = \operatorname{argmin}_{P\in E} D(P\|Q)$ when $E$ is convex, this becomes a convex optimization problem: use Lagrange multipliers.
The more general version of Sanov’s theorem (continuous distributions, etc.) is as follows.

Theorem 12.11 (Sanov’s theorem).


1
inf D(P kQ) lim inf log Qn(E)
P 2int(E) n!1 n
1
lim sup log Qn(E)
n!1 n
inf D(P kQ):
P 2cl(E)

12.3 Conditional limit theorem
Suppose I am manufacturing bolts, each of which is supposed to weigh 10 grams nominally. I find a batch of 1000 bolts that weighs 10.5 kilograms. What is the probability that any given bolt weighs 11 grams? [What does the bulk measurement tell us about the marginal distributions of the individual measurements?]

Theorem 12.12 (Conditional limit theorem). Suppose $X_1, X_2, \dots \sim Q$ are i.i.d. and we observe $P_{X^n} \in E$, where $Q \notin E$ and $E$ is closed and convex. Then
$$\mathbb{P}(X_1 = a \mid P_{X^n} \in E) \to P^*(a),$$
where $P^* = \operatorname{argmin}_{P\in E} D(P\|Q)$.
We need two intermediate results along the way.
Theorem 12.13 (Pythagorean theorem). For $E \subseteq \mathcal{P}(\mathcal{X})$ closed and convex, and $Q \notin E$, let $P^* := \operatorname{argmin}_{P\in E} D(P\|Q)$. Then
$$D(P\|Q) \ge D(P\|P^*) + D(P^*\|Q)$$
for all $P \in E$.

Proof. Let $P \in E$ and define $P_\lambda = \lambda P + (1-\lambda)P^*$. By definition of $P^*$, we have $\frac{d}{d\lambda} D(P_\lambda\|Q) \ge 0$ at $\lambda = 0$. Now
$$D(P_\lambda\|Q) = \sum_x P_\lambda(x)\log\frac{P_\lambda(x)}{Q(x)},$$
$$\frac{d}{d\lambda} D(P_\lambda\|Q) = \sum_x \big(P(x)-P^*(x)\big)\log\frac{P_\lambda(x)}{Q(x)} + \sum_x \big(P(x)-P^*(x)\big) = \sum_x \big(P(x)-P^*(x)\big)\log\frac{P_\lambda(x)}{Q(x)},$$
since $\sum_x (P(x)-P^*(x)) = 0$. Hence
$$
0 \le \frac{d}{d\lambda}\Big|_{\lambda=0} D(P_\lambda\|Q) = \sum_x \big(P(x)-P^*(x)\big)\log\frac{P^*(x)}{Q(x)} = \sum_x P(x)\log\frac{P^*(x)}{Q(x)} - D(P^*\|Q) = D(P\|Q) - D(P\|P^*) - D(P^*\|Q).
$$

Theorem 12.14 (Pinsker's inequality).
$$D(P\|Q) \ge \frac{\log e}{2}\, \|P - Q\|_1^2.$$

Note that for $A = \{x : P(x) \ge Q(x)\}$,
$$\|P-Q\|_1 = \sum_x |P(x)-Q(x)| = \big(P(A)-Q(A)\big) - \big((1-P(A)) - (1-Q(A))\big) = 2\big(P(A)-Q(A)\big) = 2\max_{B\subseteq\mathcal{X}}\big(P(B)-Q(B)\big).$$

Proof. For binary distributions, one can prove the following (exercise):
$$p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} \ge \frac{\log e}{2}\,\big(2(p-q)\big)^2.$$

Recall the data processing inequality for relative entropy: if $P' = P_{Y|X}\circ P$ and $Q' = P_{Y|X}\circ Q$, then
$$D(P\|Q) \ge D(P'\|Q').$$
Define the channel $Y = \mathbf{1}\{X \in A\}$. Then
$$D(P\|Q) \ge D\big((P(A), 1-P(A))\,\|\,(Q(A), 1-Q(A))\big) \ge \frac{\log e}{2}\big(2(P(A)-Q(A))\big)^2 = \frac{\log e}{2}\,\|P-Q\|_1^2.$$

Intuition for the conditional limit theorem: $P^*$ completely dominates the behavior of the marginal.

Proof of conditional limit theorem. Let $S_t := \{P \in \mathcal{P}(\mathcal{X}) : D(P\|Q) \le t\}$. This is a convex set.
Let $D^* := D(P^*\|Q) = \min_{P\in E} D(P\|Q)$.
Let $A := S_{D^*+2\delta} \cap E$ and $B := E \setminus A = E \setminus S_{D^*+2\delta}$. Then
$$
Q^n(B) = \sum_{P \in E\cap\mathcal{P}_n:\, D(P\|Q) > D^*+2\delta} Q^n(T(P)) \le \sum_{P \in E\cap\mathcal{P}_n:\, D(P\|Q) > D^*+2\delta} 2^{-nD(P\|Q)} \le (n+1)^{|\mathcal{X}|}\, 2^{-n(D^*+2\delta)},
$$
while, for $n$ large enough that $E\cap\mathcal{P}_n$ contains a type $P$ with $D(P\|Q) \le D^*+\delta$,
$$
Q^n(A) \ge Q^n(S_{D^*+\delta}\cap E) = \sum_{P \in E\cap\mathcal{P}_n:\, D(P\|Q) \le D^*+\delta} Q^n(T(P)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-n(D^*+\delta)}.
$$
Therefore
$$
\mathbb{P}(P_{X^n} \in B \mid P_{X^n} \in E) = \frac{Q^n(B\cap E)}{Q^n(E)} \le \frac{Q^n(B)}{Q^n(A)} \le (n+1)^{2|\mathcal{X}|}\, 2^{-n\delta} \to 0.
$$
[So, the probability that our empirical distribution is outside a KL-ball around $P^*$ (given that it lies in $E$) vanishes.] Thus,
$$\mathbb{P}(P_{X^n} \in A \mid P_{X^n} \in E) \to 1.$$
By the Pythagorean inequality, for $P \in A$ we have
$$D(P\|P^*) + D^* = D(P\|P^*) + D(P^*\|Q) \le D(P\|Q) \le D^* + 2\delta,$$
so $D(P\|P^*) \le 2\delta$. Combined with Pinsker's inequality, this gives, for $\epsilon := \sqrt{4\delta/\log e}$,
$$\mathbb{P}(P_{X^n}\in A \mid P_{X^n}\in E) \le \mathbb{P}\big(D(P_{X^n}\|P^*) \le 2\delta \mid P_{X^n}\in E\big) \le \mathbb{P}\big(\|P_{X^n}-P^*\|_1 \le \epsilon \mid P_{X^n}\in E\big),$$
and by our earlier work, these three quantities tend to 1 as $n\to\infty$. Consequently
$$\mathbb{P}\big(|P_{X^n}(a) - P^*(a)| \le \epsilon \mid P_{X^n}\in E\big) \to 1.$$
Aside: after proving Sanov's theorem, we proved $D(P_{X^n}\|P) \to 0$ almost surely, which implies $\|P_{X^n}-P\|_1 \to 0$ almost surely, by Pinsker's inequality.
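A quick simulation illustrates the theorem for a fair coin conditioned on a large empirical mean: the conditional law of $X_1$ approaches $P^* = \mathrm{Ber}(0.7)$. A sketch with a modest $n$ so that the conditioning event is not too rare (all parameters here are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 30, 500_000
x = rng.integers(0, 2, size=(trials, n), dtype=np.int8)   # fair coin flips, Q = Ber(1/2)
cond = x.mean(axis=1) >= 0.7                              # event {P_{X^n} in E}
print("P(X_1 = 1 | P_{X^n} in E) ~", x[cond, 0].mean())   # approaches P*(1) = 0.7 as n grows
```

For this small $n$ the conditional probability comes out slightly above 0.7; the theorem is about the limit $n \to \infty$.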

12.4 Fisher information and Cramer-Rao lower bound
Let $f(x;\theta)$ be a family of densities indexed by $\theta$. For example, the location family is $f(x;\theta) = f(x-\theta)$ for some fixed $f$.
An estimator for $\theta$ from a sample of size $n$ is a function $T: \mathcal{X}^n \to \Theta$. The error $T(X^n) - \theta$ of this estimator is a random variable.
For example, $X_i \sim N(\theta, 1)$ i.i.d. and $T(X^n) = \frac{1}{n}\sum_{i=1}^n X_i$.
An estimator is unbiased if $\mathbb{E}_\theta\, T(X^n) = \theta$.
The Cramér-Rao bound states that the variance of an unbiased estimator based on a single observation is lower bounded by $1/J(\theta)$ (with $n$ i.i.d. observations the bound becomes $1/(nJ(\theta))$).
Example 12.15. Let $f(x;\theta) := f(x-\theta)$. Then $J(\theta) = \int \frac{(f'(x))^2}{f(x)}\, dx$. This is a measure of curvature/smoothness. If $f$ is very spread out, it is hard to estimate $\theta$; indeed, $1/J(\theta)$ will then be large.
We will see later that there is a finer inequality:
$$\frac{1}{J(X)} \le \frac{1}{2\pi e}\, e^{2h(X)} \le \operatorname{Var}(X).$$

13 Entropy methods in mathematics


13.1 Fisher information and entropy
Let $u(t,x)$ denote the temperature at position $x \in \mathbb{R}^n$ at time $t$. The heat equation is
$$\frac{\partial}{\partial t} u(t,x) = \frac{1}{2}\sum_{i=1}^n \frac{\partial^2}{\partial x_i^2} u(t,x).$$
Consider the initial condition $u(0,x) = \delta(x)$ (all heat starts at $0$). Then a solution is
$$u_0(t,x) = (2\pi t)^{-n/2} e^{-|x|^2/2t}.$$
That is, at time $t$, the temperature profile is the Gaussian density with variance $t$.
More generally, if the initial condition is $u(0,x) = f(x)$, then a solution is the convolution $u(t,x) = \int f(s)\, u_0(t, x-s)\, ds$, where $u_0$ is the Gaussian kernel above. Note that if we integrate over $x$, we get the "total energy," which is conserved (constant in $t$). This is easy to see in the special case where $f$ is a density (in $x$), in which case the convolution is also a density, which integrates (over $x$) to 1, a constant in $t$.
Let us focus on this special case. Let $X \sim f$ and $Z \sim N(0,I)$, independent. Then $u(t,x) = f_t(x)$, where $f_t$ is the density of $X + \sqrt{t}Z$. [Convolution of densities is the density of the sum.]
We claim $h(X+\sqrt{t}Z)$ is nondecreasing in $t$. [This matches our intuition from the second law of thermodynamics.] Indeed,
$$h(X+\sqrt{t}Z) \ge h(X+\sqrt{t}Z \mid Z) = h(X).$$
This implies, with $Z'$ an independent copy of $Z$,
$$h(X+\sqrt{t+t'}\,Z) = h(X+\sqrt{t}Z + \sqrt{t'}Z') \ge h(X+\sqrt{t}Z),$$
which implies our claim.
Amazingly, we not only know it is increasing in t, but we have a formula for the rate of increase.

Proposition 13.1 (de Bruijn's identity).
$$\frac{d}{dt}\, h(X+\sqrt{t}Z) = \frac{1}{2}\, J(X+\sqrt{t}Z).$$
That is,
$$\frac{d}{dt}\, h(f_t) = \frac{1}{2}\, J(f_t).$$
We now define the Fisher information $J$. Given a parametric family of densities $\{f(x;\theta)\}$ parameterized by $\theta$, the Fisher information at $\theta$ is
$$J(\theta) = \mathbb{E}_{X\sim f(\cdot\,;\theta)}\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right].$$
We will focus on location families, where $f(x;\theta) := f_X(x-\theta)$ for some fixed density $f_X$. In this case, the Fisher information is
$$J(\theta) = \int \frac{(f_X'(x))^2}{f_X(x)}\, dx,$$
which is free of $\theta$. We then use the compact notation $J(\theta) = J(f) = J(X)$ to denote the Fisher information associated with this density. This is the Fisher information that we will consider from now on.
In $n$ dimensions, this generalizes to
$$J(f) = J(X) = \int \frac{|\nabla f(x)|^2}{f(x)}\, dx = 4\int \big|\nabla\sqrt{f(x)}\big|^2\, dx.$$
The last expression shows how the Fisher information corresponds to the smoothness of $f$ (actually, of $\sqrt{f}$).
We now prove de Bruijn’s identity.
Proof.
$$
\begin{aligned}
\frac{d}{dt}\, h(f_t) &= -\frac{d}{dt}\int f_t(x)\log f_t(x)\, dx \\
&= -\int (\log f_t(x))\, \frac{\partial}{\partial t} f_t(x)\, dx - \underbrace{\frac{d}{dt}\int f_t(x)\, dx}_{=\frac{d}{dt} 1 = 0} \\
&= -\int (\log f_t(x))\, \frac{1}{2}\sum_{i=1}^n \frac{\partial^2}{\partial x_i^2} f_t(x)\, dx.
\end{aligned}
$$
Assuming $(\log f_t(x))\,\frac{\partial}{\partial x_i} f_t(x) \to 0$ as $|x|\to\infty$, integration by parts for each summand indexed by $i$ (with respect to $dx_i$) gives
$$
\frac{d}{dt}\, h(f_t) = \frac{1}{2}\sum_{i=1}^n \int \frac{\big(\frac{\partial}{\partial x_i} f_t(x)\big)^2}{f_t(x)}\, dx = \frac{1}{2}\int \frac{|\nabla f_t(x)|^2}{f_t(x)}\, dx = \frac{1}{2}\, J(f_t).
$$
Note that nonnegativity of Fisher information also shows that h(f t) is nondecreasing in t.
We will use de Bruijn’s identity to prove an uncertainty principle for entropy and Fisher
information. The entropy power inequality gives
p p
e n2 h(X+ tZ) 2 h(X) 2 h( tZ) 2 h(X)
p
e n +e n =e n + 2 et
2
e n2 h(X+ tZ) en h(X)
2 e:
t

37
=
Taking t ! 0 makes the left-hand side equal to
dt e n h( X+p

d 2
tZ)t=0 n J(X)e n h(X);
1 2

by de Bruijn’s identity. Thus, we arrive at the following.


Proposition 13.2 (Stam’s inequality).
h(X)
J(X)e n2 2 en:
This is an uncertainty principle: product of two uncertanties is greater than some constant.
Note that in dimension n = 1, we have h(X) 12 log[2 e Var(X)], so we recover the Cramer-Rao bound
J(X) Var(X) 1:
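As a quick numeric sanity check of the $n=1$ case (my own example; natural logarithms, so $h$ is in nats): $J(X)e^{2h(X)}$ equals $2\pi e$ exactly when $X$ is Gaussian and is strictly larger otherwise, e.g. for a two-component Gaussian mixture.

```python
import numpy as np

def stam_product(density, grid):
    """Numerically estimate J(X) * exp(2 h(X)) for a 1-d density sampled on a grid."""
    dx = grid[1] - grid[0]
    f = density(grid)
    f = f / np.sum(f * dx)                           # renormalize after truncation
    h = -np.sum(f * np.log(f + 1e-300) * dx)         # differential entropy (nats)
    fprime = np.gradient(f, dx)
    J = np.sum(fprime**2 / (f + 1e-300) * dx)        # Fisher information
    return J * np.exp(2 * h)

grid = np.linspace(-12, 12, 20001)
gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
mix = lambda x: 0.5 * gauss(x - 2) + 0.5 * gauss(x + 2)
print(stam_product(gauss, grid), ">=", 2 * np.pi * np.e)   # ~ 2*pi*e for the Gaussian
print(stam_product(mix, grid), ">=", 2 * np.pi * np.e)     # strictly larger for the mixture
```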

13.2 The logarithmic Sobolev inequality


Let $\phi$ be the density of $Z \sim N(0,I)$, and write $d\gamma = \phi(x)\, dx$ for the standard Gaussian measure. We have
$$D(X\|Z) = \int f(x)\log\frac{f(x)}{\phi(x)}\, dx = \int \frac{f}{\phi}\log\frac{f}{\phi}\, d\gamma.$$
The relative Fisher information is
$$I(X\|Z) = \int f(x)\left|\nabla\log\frac{f(x)}{\phi(x)}\right|^2 dx = \int \frac{f}{\phi}\left|\nabla\log\frac{f}{\phi}\right|^2 d\gamma.$$

Recalling h(X) = f log f dx and J(X) = f log f dx, we see that the above two quantities are parallel
R
analogues of entropy and Fisher information. R jr j
The above two quantities can be simplified to
$$
D(X\|Z) = \frac{n}{2}\log(2\pi e) + \frac{1}{2}\mathbb{E}|X|^2 - \frac{n}{2} - h(X), \qquad
I(X\|Z) = J(X) - 2n + \mathbb{E}|X|^2.
$$
Using the bound $\log x \le x - 1$ together with Stam's inequality, we have
$$
\log 2\pi e \le \log\frac{J(X)}{n} + \frac{2}{n}h(X) \le \frac{J(X)}{n} - 1 + \frac{2}{n}h(X).
$$
Combining this bound with the above implies the following.
Theorem 13.3 (Log Sobolev inequality, information-theoretic form).
$$\frac{1}{2}\, I(X\|Z) \ge D(X\|Z).$$

$$\text{EPI} \ \overset{\text{de Bruijn}}{\Longrightarrow}\ J(X)\, e^{\frac{2}{n}h(X)} \ge 2\pi e\, n \ \Longleftrightarrow\ D(X\|Z) \le \frac{1}{2}\, I(X\|Z).$$
We only showed the forward implication of the last “if and only if.” The reverse is simple too.
Note that the last result is dimension free.
Let $g^2 := f/\phi$. We can reformulate the last result as
$$2\int |\nabla g|^2\, d\gamma \ge \int g^2\log g^2\, d\gamma.$$
Noting that $\int g^2\, d\gamma = 1$, we can write
$$2\int|\nabla g|^2\, d\gamma \ge \int g^2\log g^2\, d\gamma - \int g^2\, d\gamma\, \log\int g^2\, d\gamma.$$
This inequality still holds if we scale g by a constant. (Log terms will cancel.)
Theorem 13.4 (Log Sobolev inequality for Gaussian measure). For "smooth" $g$,
$$2\int|\nabla g|^2\, d\gamma \ \ge\ \int g^2\log g^2\, d\gamma - \int g^2\, d\gamma\, \log\int g^2\, d\gamma \ =: \operatorname{Ent}_\gamma(g^2).$$

This is equivalent to the previous formulation of the log Sobolev inequality.

EPI =) Fisher information-entropy uncertainty principle () LSI (info. th.) () LSI (functional form)

13.3 Concentration of measure


$F : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz (denoted $\|F\|_{\mathrm{Lip}} \le L$) if $|F(x) - F(y)| \le L|x-y|$ for all $x$ and $y$.

Theorem 13.5 (Borell's inequality). Let $Z \sim N(0,I)$. If $\|F\|_{\mathrm{Lip}} \le L$, then
$$\mathbb{P}\big(F(Z) \ge \mathbb{E}[F(Z)] + r\big) \le e^{-\frac{r^2}{2L^2}}.$$

Consider $U \sim N(0,1)$. We have $\mathbb{P}(U \ge r) \le e^{-r^2/2}$. So the theorem states that, under a Lipschitz function, the tail behavior is still Gaussian.
Without loss of generality suppose $L = 1$. We consider $g^2(x) := e^{\lambda F(x)}$ (and assume $\int F\, d\gamma = 0$) and plug it into the LSI. We have $\nabla g(x) = \frac{\lambda}{2}(\nabla F(x))\, e^{\lambda F(x)/2}$, so
$$2\int|\nabla g|^2\, d\gamma = \frac{\lambda^2}{2}\int |\nabla F|^2 e^{\lambda F}\, d\gamma \le \frac{\lambda^2}{2}\int e^{\lambda F}\, d\gamma.$$
Let $\psi(\lambda) := \int e^{\lambda F}\, d\gamma$. The LSI implies
$$\frac{\lambda^2}{2}\psi(\lambda) \ge \int e^{\lambda F}(\lambda F)\, d\gamma - \psi(\lambda)\log\psi(\lambda) = \lambda\psi'(\lambda) - \psi(\lambda)\log\psi(\lambda).$$
Following the Herbst argument, define $H(\lambda) = \frac{1}{\lambda}\log\psi(\lambda) - \frac{\lambda}{2}$. Then
$$\lambda^2\psi(\lambda)\, H'(\lambda) = \lambda\psi'(\lambda) - \psi(\lambda)\log\psi(\lambda) - \frac{\lambda^2}{2}\psi(\lambda) \le 0.$$
Since $\lambda^2\psi(\lambda) > 0$ for $\lambda > 0$, we have $H'(\lambda) \le 0$.
We have $H(0) = \lim_{\lambda\to 0}\frac{1}{\lambda}\log\psi(\lambda) = \frac{\psi'(0)}{\psi(0)} = \int F\, d\gamma = 0$. Thus $H(\lambda) \le 0$ for all $\lambda \ge 0$. This gives $\psi(\lambda) \le e^{\lambda^2/2}$, and so
$$\int e^{\lambda F}\, d\gamma \le e^{\lambda^2/2}.$$
Markov's inequality with $\lambda = r$ gives
$$\mathbb{P}(F(Z) \ge r) \le e^{-\lambda r}\int e^{\lambda F}\, d\gamma \le e^{-r^2}e^{r^2/2} = e^{-r^2/2}.$$
(For general $L$, apply the argument to $F/L$ to obtain the bound $e^{-r^2/2L^2}$.)
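A quick Monte Carlo illustration (my own example): $F(z) = \max_i z_i$ is 1-Lipschitz with respect to the Euclidean norm, and its empirical upper tail beyond the mean stays below $e^{-r^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 200_000
F = rng.standard_normal((trials, n)).max(axis=1)   # F(Z) = max_i Z_i, 1-Lipschitz
mu = F.mean()
for r in (0.5, 1.0, 1.5, 2.0):
    print(r, (F >= mu + r).mean(), "<=", np.exp(-r**2 / 2))   # empirical tail vs. bound
```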

13.4 Talagrand’s information-transportation inequality
The quadratic Wasserstein distance between two probability measures $\mu$ and $\nu$ (on the same space) is
$$W_2^2(\mu,\nu) = \inf_{P_{XY}:\, P_X = \mu,\, P_Y = \nu} \mathbb{E}|X-Y|^2.$$


We can think of this as a distance between measures, motivated by moving probability mass from one to the
other. It is actually a metric (satisfies triangle inequality, etc.). Also, it admits a nice dimension decomposition.

For product measures $\mu = \prod_{i=1}^n \mu_i$ and $\nu = \prod_{i=1}^n \nu_i$, coupling each coordinate optimally gives
$$W_2^2(\mu,\nu) = \inf \mathbb{E}|X-Y|^2 \le \sum_{i=1}^n \mathbb{E}|X_i - Y_i|^2 = \sum_{i=1}^n W_2^2(\mu_i, \nu_i).$$

Theorem 13.6 (Talagrand's inequality). Let $\gamma$ be the standard Gaussian measure on $\mathbb{R}^n$. For any probability measure $\nu$,
$$2 D(\nu\|\gamma) \ge W_2^2(\nu,\gamma).$$
Note that both sides grow “linearly” in dimension n. Contrast this with Pinsker’s inequality, where the total
variation is bounded by 1 regardless of dimension, rendering it rather unhelpful.
P
Let PXn := 1 n X be the empirical distribution of X1; : : : ; XN N (0; I). The following are true.
n i=1 i

1. EW 2(PXn ; ) ! 0.

2. Et = f : W 2( ; ) > tg is open in the topology of weak convergence.


3. gn : (x1; : : : ; xn) 7!W 2(Pxn ; ) is n 1=2-Lipschitz.
Concentration of Lipschitz functions implies

P (W 2(PXn ; ) > t) e n(t EW2(PXn ; ))2=2


:
Sanov’s theorem implies

inf D( k )
E t n
lim inf 1 log P (W (P
n 2 X
n ; ) > t)
2 !1
lim sup(t EW 2(PXn ; ))2=2
n!1

= t2=2:
If W 2( ; ) > t (i.e. 2 Et) then
2D( k ) t2:
Taking t = W 2( ; ) and ! 0 proves the theorem.
Combining with the previous results gives the nice chain
$$I(\nu\|\gamma) \ge 2D(\nu\|\gamma) \ge W_2^2(\nu,\gamma).$$

13.5 The blowing-up phenomenon


For $B \subseteq \mathbb{R}^n$, let $B_t := \{x : d(x,B) \le t\}$ be the $t$-blowup of $B$.
Theorem 13.7. Let $B \subseteq \mathbb{R}^n$ and let $\gamma$ be the standard Gaussian measure. If $t \ge \sqrt{2\log(1/\gamma(B))}$, then
$$\gamma(B_t) \ge 1 - \exp\left(-\frac{1}{2}\Big(t - \sqrt{2\log(1/\gamma(B))}\Big)^2\right).$$
Roughly, if $B$ contains a sufficient amount of the mass of $\gamma$, then $B_t$ contains almost all of the mass! Concretely, if $\gamma(B) = 10^{-6}$, then $\gamma(B_{13}) \ge 1 - 3\times 10^{-13}$. If we consider $\mathbb{R}^n$ for $n$ large, most Gaussian vectors lie on a spherical shell of radius $\sqrt{n}$. Note that 13 is a very small distance compared to this $\sqrt{n}$.

References
[1] Cover, Thomas and Thomas, Joy. Elements of information theory. John Wiley & Sons. 2012.
