
Information Theory

Lecture 6: Differential Entropy and the Gaussian Channel

Basak Guler

1 / 17
Differential Entropy
• So far we have only worked with discrete random variables.

• We will now extend our information measures to continuous random variables.

• Recall that continuous random variables are characterized by a PDF (probability density function) f(x) instead of the PMF p(x). The PDF satisfies the properties f(x) ≥ 0 and \int_{-\infty}^{\infty} f(x)\,dx = 1.

• Definition 38 (Differential Entropy). Differential entropy of a continuous random variable X with a PDF f(x) is defined as:

    h(X) = -\int_S f(x) \log f(x)\,dx = \int_S f(x) \log \frac{1}{f(x)}\,dx = E[-\log f(X)]

where S is the set of all x for which f(x) > 0 (support set of X).

2 / 17
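The following is a small numerical sketch (not part of the original slides) showing how Definition 38 can be checked by direct integration; it assumes numpy and scipy are available, and the unit-rate exponential density is only an illustrative choice.

    # Hedged sketch: approximate h(X) = E[-log2 f(X)] by numerical integration.
    import numpy as np
    from scipy import integrate, stats

    def differential_entropy_bits(pdf, lower, upper):
        # h(X) = -\int_S f(x) log2 f(x) dx, approximated over [lower, upper].
        integrand = lambda x: -pdf(x) * np.log2(pdf(x)) if pdf(x) > 0 else 0.0
        value, _ = integrate.quad(integrand, lower, upper)
        return value

    # Example: the unit-rate exponential density has h(X) = 1 nat = log2(e) ≈ 1.4427 bits.
    print(differential_entropy_bits(stats.expon.pdf, 0, np.inf))
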
Example 26
• Example 26 (Uniform Random Variable). Let X ∼ U[a, b] be a uniform random variable with PDF

    f(x) = \begin{cases} \frac{1}{b-a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}

for some b > a. Find its differential entropy.

• Solution.

    h(X) = -\int_S f(x) \log f(x)\,dx = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a}\,dx = \log(b - a)

• Unlike regular entropy, which is always ≥ 0, differential entropy can be positive or negative.

• For example, in the above example, if b − a < 1, then h(X) < 0!

3 / 17
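A quick numerical illustration of this sign behavior (added here, not from the lecture; it assumes numpy):

    # h(X) = log2(b - a) bits for X ~ U[a, b]; negative once b - a < 1.
    import numpy as np

    for a, b in [(0.0, 2.0), (0.0, 1.0), (0.0, 0.5)]:
        print(f"U[{a}, {b}]: h(X) = {np.log2(b - a):.4f} bits")
    # Prints 1.0000, 0.0000, and -1.0000 bits, respectively.
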
Example 27
• Example 27 (Gaussian Random Variable). Let X ∼ N(0, σ²) be a Gaussian random variable with mean 0 and variance σ². The corresponding PDF is:

    f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}

Find its differential entropy.

4 / 17
Example 27
• Solution.

    h(X) = -\int_S f(x) \log f(x)\,dx
         = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \right) dx
         = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \left( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log e \right) dx
         = \log\sqrt{2\pi\sigma^2} \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}\,dx}_{=1} + \frac{\log e}{2\sigma^2} \underbrace{\int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}\,dx}_{=E[X^2]=\sigma^2+0^2=\sigma^2}
         = \frac{1}{2} \log(2\pi\sigma^2) + \frac{1}{2} \log e
         = \frac{1}{2} \log(2\pi e \sigma^2) \text{ bits}

5 / 17
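As a hedged check of Example 27 (not in the original slides; numpy and scipy assumed), the Monte Carlo estimate of E[−log2 f(X)] below should agree with the closed form ½ log2(2πeσ²):

    # Monte Carlo sketch: E[-log2 f(X)] for X ~ N(0, sigma^2) vs the closed form.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sigma = 2.0
    samples = rng.normal(0.0, sigma, size=1_000_000)

    h_monte_carlo = np.mean(-np.log2(stats.norm.pdf(samples, loc=0.0, scale=sigma)))
    h_closed_form = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
    print(h_monte_carlo, h_closed_form)  # both close to 3.047 bits for sigma = 2
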
Other properties of Differential Entropy
• Relative entropy (KL-distance) between two PDFs f(x) and g(x) is:

    D(f||g) = \int f(x) \log \frac{f(x)}{g(x)}\,dx \geq 0

• Joint and conditional differential entropy are defined similarly to the discrete case:

    h(X, Y) = E[-\log f(X, Y)]    (joint differential entropy)
    h(X|Y) = E[-\log f(X|Y)]      (conditional differential entropy)

• Mutual information is defined from these quantities as in the discrete case:

    I(X; Y) = h(X) + h(Y) - h(X, Y)
            = h(X) - h(X|Y)
            = h(Y) - h(Y|X)
            \geq 0

  Mutual information is always non-negative!

6 / 17
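The non-negativity of D(f||g) can be illustrated numerically; the sketch below (an addition, not from the slides; scipy assumed) integrates the KL divergence between two example Gaussian densities and shows it is positive, and zero when the two densities coincide.

    # Numerical sketch: D(f||g) = \int f(x) log2(f(x)/g(x)) dx for example Gaussians.
    import numpy as np
    from scipy import integrate, stats

    def kl_bits(f_pdf, g_pdf):
        integrand = lambda x: f_pdf(x) * np.log2(f_pdf(x) / g_pdf(x))
        value, _ = integrate.quad(integrand, -20, 20)  # tails beyond +/-20 are negligible here
        return value

    f = lambda x: stats.norm.pdf(x, loc=0.0, scale=1.0)
    g = lambda x: stats.norm.pdf(x, loc=1.0, scale=2.0)
    print(kl_bits(f, g))  # > 0
    print(kl_bits(f, f))  # = 0, consistent with equality iff f = g
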
Other properties of Differential Entropy
• We also have the following properties, for any constant c:

    h(X + c) = h(X)                  (translation does not change entropy)
    h(X · c) = h(X) + \log|c|        for c ≠ 0

7 / 17
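A short check of the scaling rule (added; not from the lecture; numpy assumed), using the Gaussian closed form from Example 27: if X ∼ N(0, σ²) then cX ∼ N(0, c²σ²), so h(cX) and h(X) + log2|c| can be compared directly.

    # Verify h(cX) = h(X) + log2|c| via the Gaussian closed form.
    import numpy as np

    def gaussian_h_bits(variance):
        return 0.5 * np.log2(2 * np.pi * np.e * variance)

    sigma2, c = 1.5, -3.0
    lhs = gaussian_h_bits(c**2 * sigma2)               # h(cX), since cX ~ N(0, c^2 sigma^2)
    rhs = gaussian_h_bits(sigma2) + np.log2(abs(c))    # h(X) + log2|c|
    print(lhs, rhs)  # equal up to floating-point error
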
Gaussian Distribution Maximizes Differential Entropy
• Recall that entropy was maximized by a uniform distribution.

• Differential entropy is maximized by a Gaussian distribution.

• Theorem 36 (Gaussian Maximizes Differential Entropy): Let X be a random variable with PDF f(x) and second moment E[X²] ≤ σ². Then,

    h(X) \leq \frac{1}{2} \log(2\pi e \sigma^2)

with equality if and only if X ∼ N(0, σ²).

• In other words, the differential entropy h(X) is maximized when X is a Gaussian random variable.

8 / 17
Gaussian Distribution Maximizes Differential Entropy
• Proof. Let g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} be the Gaussian PDF for N(0, σ²).

• Then, \log g(x) = -\log\sqrt{2\pi\sigma^2} - \frac{x^2}{2\sigma^2} \log e and therefore,

    0 \leq D(f||g)                                                                                      (1)
      = \int_{-\infty}^{\infty} f(x) \log \frac{f(x)}{g(x)}\,dx
      = \int_{-\infty}^{\infty} f(x) \log f(x)\,dx - \int_{-\infty}^{\infty} f(x) \log g(x)\,dx
      = \int_{-\infty}^{\infty} f(x) \log f(x)\,dx - \int_{-\infty}^{\infty} f(x) \left( -\log\sqrt{2\pi\sigma^2} - \frac{x^2}{2\sigma^2} \log e \right) dx
      = -h(X) + \log\sqrt{2\pi\sigma^2} + \frac{\log e}{2\sigma^2} \int_{-\infty}^{\infty} x^2 f(x)\,dx
      = -h(X) + \log\sqrt{2\pi\sigma^2} + \frac{\log e}{2\sigma^2} E[X^2]
      \leq -h(X) + \log\sqrt{2\pi\sigma^2} + \frac{\log e}{2\sigma^2} \sigma^2
      = -h(X) + \frac{1}{2} \log 2\pi e \sigma^2
9 / 17
Gaussian Distribution Maximizes Differential Entropy
• Hence

    h(X) \leq \frac{1}{2} \log 2\pi e \sigma^2

• For this to become an equality, we must have in equation (1),

    D(f||g) = 0                                                                                         (2)

which occurs if and only if f(x) = g(x), which means f(x) should be a Gaussian distribution N(0, σ²).

10 / 17
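To make Theorem 36 concrete, here is a small added comparison (not in the original slides; numpy assumed): among the two densities below with the same second moment σ², the Gaussian one has the larger differential entropy.

    # Uniform vs Gaussian with the same second moment sigma^2.
    import numpy as np

    sigma2 = 1.0
    a = np.sqrt(3 * sigma2)                # U[-a, a] has E[X^2] = a^2 / 3 = sigma^2
    h_uniform = np.log2(2 * a)             # log2(b - a) from Example 26
    h_gaussian = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
    print(h_uniform, h_gaussian)  # ~1.79 bits < ~2.05 bits, as Theorem 36 predicts
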
The Gaussian Channel
• Definition 39 (Gaussian Channel). The Gaussian channel is defined as:

    Y_i = X_i + Z_i

where X_i is the channel input (at time i), Y_i is the channel output, and Z_i ∼ N(0, N) is the noise, drawn i.i.d. from a Gaussian distribution. Z_i is independent of X_i.

  [Block diagram: channel input X_i and noise Z_i are added to produce channel output Y_i.]

• Most widely known continuous channel model.

• Channel is memoryless & discrete-time, but has a continuous alphabet.

• Simple and tractable model for “real-life” communication channels.

11 / 17
The Gaussian Channel
Intuition behind “Gaussian”:

• The intuition for using a Gaussian random variable to model the channel noise is based on the central limit theorem.

• Communication channels (wireless or wired) are subject to a variety of “random noise events”, such as thermal noise from electrical circuits, signals originating from unknown devices, etc.

• The central limit theorem says that the aggregate effect of a large number of these random events will have a Gaussian distribution.

• So we represent this “aggregate effect” as a Gaussian random variable.

12 / 17
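As an optional illustration of this intuition (added, not part of the lecture; numpy assumed), summing many small independent, non-Gaussian noise contributions yields an aggregate that behaves like Gaussian noise:

    # CLT sketch: aggregate of many small independent uniform noise events.
    import numpy as np

    rng = np.random.default_rng(1)
    n_events, n_samples = 200, 100_000
    aggregate = rng.uniform(-0.1, 0.1, size=(n_samples, n_events)).sum(axis=1)

    # For a Gaussian, about 95.4% of samples lie within two standard deviations.
    z = (aggregate - aggregate.mean()) / aggregate.std()
    print(np.mean(np.abs(z) < 2))  # close to 0.954
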
Gaussian Channel Capacity
• Recall the channel capacity for the discrete alphabet case:

    C = \max_{p(x)} I(X; Y)                                                                             (3)

• For the discrete case, we know that C ≤ min{log |X|, log |Y|}.

• For the continuous case, with no restrictions on X, (3) does not have an upper bound; X can take any value!

• For example, using very large powers we could send an infinite number of inputs by placing the X’s far away from each other, so that each signal is distinguishable with arbitrarily low error probability.

• However, this requires transmissions with infinite power, which is not physically meaningful.

• A common technique is to have an average power constraint E[X²] ≤ P. For a codeword {x_1, . . . , x_n} of length n → ∞ this would mean \frac{1}{n} \sum_{i=1}^{n} x_i^2 → E[X²] ≤ P.
13 / 17
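A brief added sketch of the average power constraint (not from the slides; numpy assumed): for a long codeword drawn i.i.d. from N(0, P), the empirical power (1/n)∑ x_i² concentrates around P.

    # Empirical average power of a length-n codeword drawn i.i.d. from N(0, P).
    import numpy as np

    rng = np.random.default_rng(2)
    P, n = 4.0, 1_000_000
    codeword = rng.normal(0.0, np.sqrt(P), size=n)
    print(np.mean(codeword**2))  # close to P = 4.0, so E[X^2] <= P holds on average
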
Gaussian Channel Capacity
• Definition 40. The capacity of the Gaussian channel with average power constraint P:

    C = \max_{p(x): E[X^2] \leq P} I(X; Y)                                                              (4)

  [Block diagram: channel input X and noise Z are added to produce channel output Y.]

14 / 17
Gaussian Channel Capacity
• Theorem 37 (Gaussian Channel Capacity). The capacity of the Gaussian channel with average power constraint P and noise variance N is equal to:

    C = \frac{1}{2} \log\left(1 + \frac{P}{N}\right)                                                    (5)

which is attained by a Gaussian input distribution X ∼ N(0, P).

15 / 17
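As a numerical companion to Theorem 37 (an addition, not from the lecture; numpy assumed), the snippet evaluates C = ½ log2(1 + P/N) in bits per channel use for a few signal-to-noise ratios.

    # Gaussian channel capacity C = 0.5 * log2(1 + P/N) bits per channel use.
    import numpy as np

    def gaussian_capacity_bits(P, N):
        return 0.5 * np.log2(1.0 + P / N)

    for P, N in [(1.0, 1.0), (10.0, 1.0), (100.0, 1.0)]:
        print(f"P/N = {P / N:5.1f}  ->  C = {gaussian_capacity_bits(P, N):.3f} bits/use")
    # Prints 0.500, 1.730, and 3.329 bits per channel use.
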
Gaussian Channel Capacity
• Proof. Let Y = X + Z, where X and Z are independent of each other, X ∼ p(x), and Z ∼ N(0, N).

• Then,

    I(X; Y) = h(Y) - h(Y|X)
            = h(Y) - h(X + Z|X)
            = h(Y) - h(Z|X)
            = h(Y) - h(Z)                                  (as X and Z are independent)
            = h(Y) - \frac{1}{2} \log 2\pi e N             (since Z ∼ N(0, N))

16 / 17
Gaussian Channel Capacity
• Now, note that the second moment of Y is:

    E[Y^2] = E[(X + Z)^2] = \underbrace{E[X^2]}_{\leq P} + 2\underbrace{E[XZ]}_{=E[X]E[Z]=0} + \underbrace{E[Z^2]}_{=N} \leq P + N    (6)

where E[XZ] = E[X]E[Z] = 0 as X and Z are independent and E[Z] = 0.

• We know that for a given second moment E[Y²], the Gaussian random variable maximizes the differential entropy.

• So h(Y) is maximized if Y is a Gaussian random variable with variance P + N, in which case:

    h(Y) = \frac{1}{2} \log 2\pi e (P + N)                                                              (7)

and therefore:

    I(X; Y) \leq \frac{1}{2} \log 2\pi e (P + N) - \frac{1}{2} \log 2\pi e N = \frac{1}{2} \log\left(1 + \frac{P}{N}\right)           (8)

where the upper bound can be achieved by choosing X ∼ N(0, P).
17 / 17
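Finally, a hedged simulation sketch (added; not in the original slides; numpy assumed): with the capacity-achieving input X ∼ N(0, P), the empirical second moment of Y = X + Z is close to P + N, and plugging the Gaussian closed forms for h(Y) and h(Z) into I(X; Y) = h(Y) − h(Z) recovers ½ log2(1 + P/N).

    # Simulation sketch of the capacity-achieving input for the Gaussian channel.
    import numpy as np

    rng = np.random.default_rng(3)
    P, N, n = 4.0, 1.0, 1_000_000
    X = rng.normal(0.0, np.sqrt(P), size=n)   # capacity-achieving input X ~ N(0, P)
    Z = rng.normal(0.0, np.sqrt(N), size=n)   # channel noise Z ~ N(0, N)
    Y = X + Z

    print(np.mean(Y**2))                                   # close to P + N = 5.0, as in (6)
    h_Y = 0.5 * np.log2(2 * np.pi * np.e * (P + N))        # (7)
    h_Z = 0.5 * np.log2(2 * np.pi * np.e * N)
    print(h_Y - h_Z, 0.5 * np.log2(1 + P / N))             # both ~1.161 bits, matching (8)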
