Lecture 2: Information Measures

Information Theory, Richard Combes, CentraleSupelec, 2024


Entropy

Definition
The entropy of X ∈ X a discrete random variable with distribution pX
is:
H(X) = E[ log2 (1/pX(X)) ] = Σ_{x ∈ X} pX(x) log2 (1/pX(x))

▶ Entropy is the most fundamental information measure


▶ Entropy measures the randomness of X
▶ Entropy measures how much information is contained in X
▶ Entropy is expressed in bits
▶ If log2 is replaced by log, entropy is expressed in nats
▶ A nat is 1/log(2) ≈ 1.44 bits
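
As a quick illustration (not part of the original slides), here is a minimal Python sketch of this definition; the function name and the example distribution are mine.

```python
import math

def entropy(pmf, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities.

    base=2 gives bits, base=math.e gives nats.
    """
    assert abs(sum(pmf) - 1.0) < 1e-9, "probabilities must sum to 1"
    # Outcomes with p = 0 contribute nothing (0 * log(1/0) = 0 by convention).
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit for a fair coin
print(entropy([0.5, 0.5], math.e))   # log(2) ~ 0.693 nats: 1 bit = log(2) nats
print(1.0 / math.log(2.0))           # ~1.4427: one nat is 1/log(2) bits
```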

Example 1: Binary entropy

The entropy of X ∼Bernoulli(a) is:


H(X) = h2(a) = a log2 (1/a) + (1 − a) log2 (1/(1 − a))

[Figure: binary entropy h2(p) as a function of p; x-axis: Probability p, y-axis: Entropy H(p) (bits). Maximum of 1 bit at p = 1/2.]
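
A short numerical companion to the plot (a sketch; the helper name h2 is mine):

```python
import math

def h2(a):
    """Binary entropy h2(a) in bits; the endpoints 0 and 1 give 0 by convention."""
    if a in (0.0, 1.0):
        return 0.0
    return a * math.log2(1.0 / a) + (1.0 - a) * math.log2(1.0 / (1.0 - a))

print(h2(0.5))   # 1.0 bit: a fair coin is maximally random
print(h2(0.1))   # ~0.469 bits: a biased coin is more predictable
print(h2(0.9))   # ~0.469 bits: h2 is symmetric around 1/2
```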

Information Theory, Richard Combes, CentraleSupelec, 2024


Example 2: Binomial Variables
Let X1, X2 be i.i.d. Bernoulli(a) and Y = X1 + X2.

        a^2              if y = 0
pY(y) = 2a(1 − a)        if y = 1
        (1 − a)^2        if y = 2

H(Y) = a^2 log2 (1/a^2) + 2a(1 − a) log2 (1/(2a(1 − a))) + (1 − a)^2 log2 (1/(1 − a)^2)
     = 2 h2(a) − 2a(1 − a).

[Figure: H(Y) as a function of a; x-axis: Probability p, y-axis: Entropy H(p) (bits). Maximum of 1.5 bits at a = 1/2.]
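
A numeric sanity check of the closed form above (a sketch; the value of a is arbitrary):

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

def h2(a):
    """Binary entropy in bits."""
    return entropy([a, 1.0 - a])

a = 0.3
pY = [a**2, 2 * a * (1 - a), (1 - a)**2]   # distribution of Y = X1 + X2
print(entropy(pY))                          # ~1.343 bits, computed from pY
print(2 * h2(a) - 2 * a * (1 - a))          # ~1.343 bits, from the closed form
```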


Jensen’s inequality

Definition
A function f : Rd → R is said to be convex if for all x, y ∈ Rd and all
λ ∈ [0, 1]:

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)

[Figure: graph of a convex function; x-axis: x, y-axis: f(x).]



Jensen’s inequality

Property
A twice differentiable function f : Rd → R is convex if and only if its
Hessian (∇2 f)(x) is positive semi-definite for all x ∈ Rd.

Examples:
▶ f(x) = x^2
▶ f(x) = e^(tx)
▶ f(x) = log(1/x)


Jensen’s inequality

Property (Jensen’s Inequality)


Consider f : Rd → R a convex function and X a random vector in Rd .
Then f (E(X)) ≤ E(f (X)) with equality if and only if f is linear over
the support of the distribution of X.

Examples, for any X:


▶ E(X)^2 ≤ E(X^2)
▶ e^(t E(X)) ≤ E(e^(tX))
▶ E(log X) ≤ log E(X)
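
A Monte Carlo illustration of these three inequalities (a sketch; taking X to be Exponential(1) is an arbitrary choice of mine):

```python
import math
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]   # X ~ Exponential(1), X > 0
m = sum(xs) / len(xs)                                     # empirical E(X)

print(m**2, "<=", sum(x * x for x in xs) / len(xs))       # E(X)^2 <= E(X^2)
t = 0.5
print(math.exp(t * m), "<=",
      sum(math.exp(t * x) for x in xs) / len(xs))         # e^(t E(X)) <= E(e^(tX))
print(sum(math.log(x) for x in xs) / len(xs), "<=",
      math.log(m))                                        # E(log X) <= log E(X)
```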


Positivity and maximal entropy
Property
For any X ∈ X, 0 ≤ H(X) ≤ log2 |X|, with equality in the upper bound if and only if X is uniform.

Proof: Since 0 ≤ pX (x) ≤ 1:


H(X) = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) ≥ Σ_{x ∈ X} pX(x) log2 1 = 0.

Since the logarithm is strictly concave, Jensen's inequality gives

H(X) = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) ≤ log2 ( Σ_{x ∈ X} pX(x) (1/pX(x)) ) = log2 |X|,

with equality if and only if 1/pX(x) is constant, i.e. X is uniform.

If X is uniform:
H(X) = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) = Σ_{x ∈ X} pX(x) log2 |X| = log2 |X|.
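
A quick numerical check of both bounds (a sketch; the alphabet size and the non-uniform pmf are arbitrary):

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

n = 8
uniform = [1.0 / n] * n
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]   # sums to 1, not uniform
deterministic = [1.0] + [0.0] * (n - 1)

print(entropy(uniform), math.log2(n))   # 3.0 and 3.0: the uniform pmf attains log2|X|
print(entropy(skewed))                   # ~2.22 bits, strictly below log2|X|
print(entropy(deterministic))            # 0.0: no randomness at all
```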


Joint entropy

Definition
The joint entropy of X ∈ X and Y ∈ Y, two discrete random variables
with joint distribution (X, Y) ∼ pX,Y(x, y), is:

H(X, Y) = E[ log2 (1/pX,Y(X, Y)) ] = Σ_{(x,y) ∈ X×Y} pX,Y(x, y) log2 (1/pX,Y(x, y))

▶ The joint entropy is simply the entropy of the vector (X, Y)


▶ Joint entropy measures the randomness of the vector (X, Y)


Conditional entropy
Definition
The conditional entropy of X ∈ X given Y ∈ Y, two discrete random
variables with joint distribution pX,Y and conditional distribution
pX|Y, is:

H(X|Y) = E[ log2 (1/pX|Y(X|Y)) ] = E[ log2 (1/pX,Y(X, Y)) ] − E[ log2 (1/pY(Y)) ]
       = H(X, Y) − H(Y),

which follows from Bayes' rule pX,Y(x, y) = pX|Y(x|y) pY(y).

▶ Measures the randomness left in X after Y has been revealed.


▶ It is also the average over Y of the entropy of the conditional distribution pX|Y(·|y)
▶ The last equality is an example of a "chain rule"

H(X, Y) = H(X|Y) + H(Y).
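
A small numerical sketch of the chain rule (the joint distribution below is an arbitrary example of mine, not from the lecture):

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Arbitrary joint distribution of (X, Y) on {0,1} x {0,1}.
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pY = {y: sum(p for (x, yy), p in pXY.items() if yy == y) for y in (0, 1)}

H_X_given_Y = H(pXY) - H(pY)        # chain rule: H(X|Y) = H(X,Y) - H(Y)
print(H(pXY), H(pY), H_X_given_Y)   # ~1.846, ~0.971, ~0.875
```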


Additivity of entropy

Property
Consider X, Y independent. Then H(X, Y) = H(X) + H(Y) and H(X|Y) = H(X).

Proof: From independence pX,Y (x, y) = pX (x)pY (y)


H(X, Y) = Σ_{(x,y) ∈ X×Y} pX,Y(x, y) log2 (1/pX,Y(x, y))
        = Σ_{(x,y) ∈ X×Y} pX,Y(x, y) [ log2 (1/pX(x)) + log2 (1/pY(y)) ]
        = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) + Σ_{y ∈ Y} pY(y) log2 (1/pY(y))
        = H(X) + H(Y).


Conditional entropy: symmetry

Property
Conditional entropy is not symmetrical unless H(X) = H(Y):

H(Y|X) − H(X|Y) = H(Y) − H(X)

Conditional entropy is symmetrical iff X and Y have the same entropy.


Example: two dice

Let X, Y be two independent rolls of a fair six-sided die and Z = X + Y.


The distribution of Z = X + Y is

z       2     3     4     5     6     7     8     9     10    11    12
pZ(z)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Knowing X, Z is uniform on {X + 1, . . . , X + 6} so:

H(X) = H(Y) = H(Z|X) = log2 6


H(Z) = (1/6) log2 6 + 2 Σ_{k=1,...,5} (k/36) log2 (36/k)
H(Z, X) = H(Z|X) + H(X)
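
The same computation in a few lines of Python (a sketch; exact probabilities via fractions):

```python
import math
from fractions import Fraction

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return sum(float(p) * math.log2(1.0 / float(p)) for p in pmf.values() if p > 0)

# Joint distribution of (X, Z): X a fair die, Z = X + Y with Y an independent fair die.
pXZ = {(x, x + y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}
pZ = {}
for (x, z), p in pXZ.items():
    pZ[z] = pZ.get(z, 0) + p

print(H(pZ))                     # ~3.274 bits
print(H(pXZ) - math.log2(6))     # H(Z|X) = H(Z,X) - H(X) = log2 6 ~ 2.585
```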


Relative entropy

Definition
Consider p, q two distributions over a discrete set X, and let X be a
random variable with distribution pX = p. The relative entropy between p and q is:

D(p||q) = E[ log2 (p(X)/q(X)) ] = Σ_{x ∈ X} p(x) log2 (p(x)/q(x))

if p is absolutely continuous with respect to q (i.e. q(x) = 0 implies p(x) = 0), and D(p||q) = +∞ otherwise.

▶ Relative entropy is also called the Kullback-Leibler divergence


▶ It is a dissimilarity measure between distributions
▶ Relative entropy is the expectation of the log-likelihood ratio
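
A minimal sketch of this definition (the function name and the two example distributions are mine):

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in bits for two pmfs given as lists."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0 and qi == 0:
            return math.inf            # p is not absolutely continuous w.r.t. q
        if pi > 0:
            d += pi * math.log2(pi / qi)
    return d

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q), kl(q, p))      # ~0.737 and ~0.531: not symmetric in general
print(kl(p, p))                 # 0.0: a distribution is at zero divergence from itself
```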


Relative entropy is positive

Property
Consider p, q two distributions. Then D(p||q) ≥ 0 with equality iff
p = q.
Proof: Since z ↦ log2(1/z) is convex, from Jensen's inequality:

D(p||q) = −E[ log2 (q(X)/p(X)) ] ≥ −log2 E[ q(X)/p(X) ]
        = −log2 ( Σ_{x ∈ X} p(x) (q(x)/p(x)) ) = −log2 1 = 0.


Example: binary variables

If p,q are two Bernoulli distributions then:

D(p||q) = p(0) log2 (p(0)/q(0)) + (1 − p(0)) log2 ((1 − p(0))/(1 − q(0))).

[Figure: D(p||q) as a function of p(0), for q(0) ∈ {0.1, 0.2, 0.3, 0.4, 0.5}; x-axis: Probability p(0), y-axis: Relative Entropy D(p||q).]

Convex function, positive and minimized for p(0) = q(0).


Relative entropy is not a distance

Consider |X| = 2, p = (1/2, 1/2) and q = (a, 1 − a).

Then D(p||q) ≠ D(q||p) if a ≠ 1/2.
Relative entropy is not symmetrical and may be infinite. The triangle
inequality does not hold either.


Mutual Information

Definition
Let (X, Y) ∈ X × Y discrete random variables with joint distribution
pX,Y and marginal distributions pX and pY respectively. The mutual
information between X and Y is:
XX pX,Y (x, y)
I(X; Y) = pX,Y (x, y) log2
pX (x)pY (y)
x∈X y∈Y
XX  1 1 
= p(x, y) log2 − log2
pX (x) pX|Y (x|y)
x∈X y∈Y

= H(X) − H(X|Y)
= H(Y) − H(Y|X)
= H(X) + H(Y) − H(X, Y)
= D(pX,Y ||pX pY )
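
A numeric check that the definition and the entropy identities agree (the joint distribution is an arbitrary example of mine):

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Arbitrary joint distribution on {0,1} x {0,1} with its marginals.
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pX = {0: 0.5, 1: 0.5}
pY = {0: 0.6, 1: 0.4}

I_def = sum(p * math.log2(p / (pX[x] * pY[y])) for (x, y), p in pXY.items())
I_ent = H(pX) + H(pY) - H(pXY)   # H(X) + H(Y) - H(X,Y)
print(I_def, I_ent)               # both ~0.125 bits
```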


Mutual Information

▶ Reduction in entropy of X when Y is revealed


▶ Symmetry: I(X; Y) = I(Y; X)
▶ Quantifies how much information can be transmitted with (X, Y)
▶ Dissimilarity measure between pX,Y and pX pY


Positivity of Mutual Information

Property
Let X, Y be discrete random variables. Then I(X; Y) ≥ 0, with equality
if and only if X and Y are independent.

Proof: By definition

I(X; Y) = D(pX,Y ||pX pY ) ≥ 0

since relative entropy is positive, with equality if and only if


pX,Y = pX pY so that X,Y are independent.


Conditioning Reduces Entropy

Property
Let X, Y be discrete random variables. Then H(X|Y) ≤ H(X), with
equality if and only if X, Y are independent, and
H(X, Y) ≤ H(X) + H(Y), with equality if and only if X, Y are
independent.

Proof: We have

0 ≤ I(X; Y) = H(X) − H(X|Y)

with equality iff X,Y are independent.


From the chain rule H(X, Y) = H(Y|X) + H(X) ≤ H(X) + H(Y)
using the previous result.
