Lecture 2: Information Measures

Information Theory, Richard Combes, CentraleSupelec, 2024


Entropy

Definition
The entropy of X ∈ X a discrete random variable with distribution pX
is:
H(X) = E[ log2 (1/pX(X)) ] = Σ_{x ∈ X} pX(x) log2 (1/pX(x))

▶ Entropy is the most fundamental information measure


▶ Entropy measures the randomness of X
▶ Entropy measures how much information is contained in X
▶ Entropy is expressed in bits
▶ If log2 is replaced by log, entropy is expressed in nats
▶ A nat is 1/log(2) ≈ 1.44 bits
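
As a quick illustration (not part of the original slides), here is a minimal Python sketch of this definition; the function name and the example distribution are mine.

```python
import math

def entropy(pmf, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities.

    base=2 gives bits, base=math.e gives nats.
    """
    assert abs(sum(pmf) - 1.0) < 1e-9, "probabilities must sum to 1"
    # Outcomes with p = 0 contribute nothing (0 * log(1/0) = 0 by convention).
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit for a fair coin
print(entropy([0.5, 0.5], math.e))   # log(2) ~ 0.693 nats: 1 bit = log(2) nats
print(1.0 / math.log(2.0))           # ~1.4427: one nat is 1/log(2) bits
```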

Example 1: Binary entropy

The entropy of X ∼Bernoulli(a) is:


H(X) = h2(a) = a log2 (1/a) + (1 − a) log2 (1/(1 − a))

[Figure: binary entropy h2(p) as a function of p; x-axis: Probability p, y-axis: Entropy H(p) (bits). Maximum of 1 bit at p = 1/2.]
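
A short numerical companion to the plot (a sketch; the helper name h2 is mine):

```python
import math

def h2(a):
    """Binary entropy h2(a) in bits; the endpoints 0 and 1 give 0 by convention."""
    if a in (0.0, 1.0):
        return 0.0
    return a * math.log2(1.0 / a) + (1.0 - a) * math.log2(1.0 / (1.0 - a))

print(h2(0.5))   # 1.0 bit: a fair coin is maximally random
print(h2(0.1))   # ~0.469 bits: a biased coin is more predictable
print(h2(0.9))   # ~0.469 bits: h2 is symmetric around 1/2
```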

Information Theory, Richard Combes, CentraleSupelec, 2024


Example 2: Binomial Variables
Let X1, X2 be i.i.d. Bernoulli(a) and Y = X1 + X2.

        a^2              if y = 0
pY(y) = 2a(1 − a)        if y = 1
        (1 − a)^2        if y = 2

H(Y) = a^2 log2 (1/a^2) + 2a(1 − a) log2 (1/(2a(1 − a))) + (1 − a)^2 log2 (1/(1 − a)^2)
     = 2 h2(a) − 2a(1 − a).

[Figure: H(Y) as a function of a; x-axis: Probability p, y-axis: Entropy H(p) (bits). Maximum of 1.5 bits at a = 1/2.]
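
A numeric sanity check of the closed form above (a sketch; the value of a is arbitrary):

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

def h2(a):
    """Binary entropy in bits."""
    return entropy([a, 1.0 - a])

a = 0.3
pY = [a**2, 2 * a * (1 - a), (1 - a)**2]   # distribution of Y = X1 + X2
print(entropy(pY))                          # ~1.343 bits, computed from pY
print(2 * h2(a) - 2 * a * (1 - a))          # ~1.343 bits, from the closed form
```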


Jensen’s inequality

Definition
A function f : Rd → R is said to be convex if for all x, y ∈ Rd and all
λ ∈ [0, 1]:

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)

[Figure: graph of a convex function; x-axis: x, y-axis: f(x).]



Jensen’s inequality

Property
A twice differentiable function f : Rd → R is convex if and only if its
Hessian (∇2 f)(x) is positive semi-definite for all x ∈ Rd.

Examples:
▶ f(x) = x^2
▶ f(x) = e^(tx)
▶ f(x) = log(1/x)


Jensen’s inequality

Property (Jensen’s Inequality)


Consider f : Rd → R a convex function and X a random vector in Rd .
Then f (E(X)) ≤ E(f (X)) with equality if and only if f is linear over
the support of the distribution of X.

Examples, for any X:


▶ E(X)^2 ≤ E(X^2)
▶ e^(t E(X)) ≤ E(e^(tX))
▶ E(log X) ≤ log E(X)
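
A Monte Carlo illustration of these three inequalities (a sketch; taking X to be Exponential(1) is an arbitrary choice of mine):

```python
import math
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]   # X ~ Exponential(1), X > 0
m = sum(xs) / len(xs)                                     # empirical E(X)

print(m**2, "<=", sum(x * x for x in xs) / len(xs))       # E(X)^2 <= E(X^2)
t = 0.5
print(math.exp(t * m), "<=",
      sum(math.exp(t * x) for x in xs) / len(xs))         # e^(t E(X)) <= E(e^(tX))
print(sum(math.log(x) for x in xs) / len(xs), "<=",
      math.log(m))                                        # E(log X) <= log E(X)
```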


Positivity and maximal entropy
Property
For any X ∈ X, 0 ≤ H(X) ≤ log2 |X|, with equality in the upper bound if and only if X is uniform.

Proof: Since 0 ≤ pX (x) ≤ 1:


H(X) = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) ≥ Σ_{x ∈ X} pX(x) log2 1 = 0.

Since the logarithm is strictly concave, Jensen's inequality gives

H(X) = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) ≤ log2 ( Σ_{x ∈ X} pX(x) (1/pX(x)) ) = log2 |X|,

with equality if and only if 1/pX(x) is constant, i.e. X is uniform.

If X is uniform:
H(X) = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) = Σ_{x ∈ X} pX(x) log2 |X| = log2 |X|.
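
A quick numerical check of both bounds (a sketch; the alphabet size and the non-uniform pmf are arbitrary):

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

n = 8
uniform = [1.0 / n] * n
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]   # sums to 1, not uniform
deterministic = [1.0] + [0.0] * (n - 1)

print(entropy(uniform), math.log2(n))   # 3.0 and 3.0: the uniform pmf attains log2|X|
print(entropy(skewed))                   # ~2.22 bits, strictly below log2|X|
print(entropy(deterministic))            # 0.0: no randomness at all
```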


Joint entropy

Definition
The joint entropy of X ∈ X and Y ∈ Y, two discrete random variables
with joint distribution (X, Y) ∼ pX,Y(x, y), is:

H(X, Y) = E[ log2 (1/pX,Y(X, Y)) ] = Σ_{(x,y) ∈ X×Y} pX,Y(x, y) log2 (1/pX,Y(x, y))

▶ The joint entropy is simply the entropy of the vector (X, Y)


▶ Joint entropy measures the randomness of the vector (X, Y)


Conditional entropy
Definition
The conditional entropy of X ∈ X given Y ∈ Y, two discrete random
variables with joint distribution pX,Y and conditional distribution
pX|Y, is:

H(X|Y) = E[ log2 (1/pX|Y(X|Y)) ] = E[ log2 (1/pX,Y(X, Y)) ] − E[ log2 (1/pY(Y)) ]
       = H(X, Y) − H(Y),

which follows from Bayes' rule pX,Y(x, y) = pX|Y(x|y) pY(y).

▶ Measures the randomness left in X after Y has been revealed.


▶ It is also the average over Y of the entropy of the conditional distribution pX|Y(·|y)
▶ The last equality is an example of a "chain rule"

H(X, Y) = H(X|Y) + H(Y).
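
A small numerical sketch of the chain rule (the joint distribution below is an arbitrary example of mine, not from the lecture):

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Arbitrary joint distribution of (X, Y) on {0,1} x {0,1}.
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pY = {y: sum(p for (x, yy), p in pXY.items() if yy == y) for y in (0, 1)}

H_X_given_Y = H(pXY) - H(pY)        # chain rule: H(X|Y) = H(X,Y) - H(Y)
print(H(pXY), H(pY), H_X_given_Y)   # ~1.846, ~0.971, ~0.875
```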


Additivity of entropy

Property
Consider X, Y independent. Then H(X, Y) = H(X) + H(Y) and H(X|Y) = H(X).

Proof: From independence pX,Y (x, y) = pX (x)pY (y)


H(X, Y) = Σ_{(x,y) ∈ X×Y} pX,Y(x, y) log2 (1/pX,Y(x, y))
        = Σ_{(x,y) ∈ X×Y} pX,Y(x, y) [ log2 (1/pX(x)) + log2 (1/pY(y)) ]
        = Σ_{x ∈ X} pX(x) log2 (1/pX(x)) + Σ_{y ∈ Y} pY(y) log2 (1/pY(y))
        = H(X) + H(Y).


Conditional entropy: symmetry

Property
Conditional entropy is not symmetrical unless H(X) = H(Y):

H(Y|X) − H(X|Y) = H(Y) − H(X)

Conditional entropy is symmetrical iff X and Y have the same entropy.


Example: two dice

Let X, Y be two independent rolls of a fair six-sided die and Z = X + Y.


The distribution of Z = X + Y is

z       2     3     4     5     6     7     8     9     10    11    12
pZ(z)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Knowing X, Z is uniform on {X + 1, . . . , X + 6} so:

H(X) = H(Y) = H(Z|X) = log2 6


H(Z) = (1/6) log2 6 + 2 Σ_{k=1,...,5} (k/36) log2 (36/k)
H(Z, X) = H(Z|X) + H(X)
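
The same computation in a few lines of Python (a sketch; exact probabilities via fractions):

```python
import math
from fractions import Fraction

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return sum(float(p) * math.log2(1.0 / float(p)) for p in pmf.values() if p > 0)

# Joint distribution of (X, Z): X a fair die, Z = X + Y with Y an independent fair die.
pXZ = {(x, x + y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}
pZ = {}
for (x, z), p in pXZ.items():
    pZ[z] = pZ.get(z, 0) + p

print(H(pZ))                     # ~3.274 bits
print(H(pXZ) - math.log2(6))     # H(Z|X) = H(Z,X) - H(X) = log2 6 ~ 2.585
```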


Relative entropy

Definition
Consider p, q two distributions over a discrete set X, and let X be a
random variable with distribution pX = p. The relative entropy between p and q is:

D(p||q) = E[ log2 (p(X)/q(X)) ] = Σ_{x ∈ X} p(x) log2 (p(x)/q(x))

if p is absolutely continuous with respect to q (i.e. q(x) = 0 implies p(x) = 0), and D(p||q) = +∞ otherwise.

▶ Relative entropy is also called the Kullback-Leibler divergence


▶ It is a dissimilarity measure between distributions
▶ Relative entropy is the expectation of the log-likelihood ratio
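
A minimal sketch of this definition (the function name and the two example distributions are mine):

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in bits for two pmfs given as lists."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0 and qi == 0:
            return math.inf            # p is not absolutely continuous w.r.t. q
        if pi > 0:
            d += pi * math.log2(pi / qi)
    return d

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q), kl(q, p))      # ~0.737 and ~0.531: not symmetric in general
print(kl(p, p))                 # 0.0: a distribution is at zero divergence from itself
```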


Relative entropy is positive

Property
Consider p, q two distributions. Then D(p||q) ≥ 0 with equality iff
p = q.
Proof: Since z ↦ log2(1/z) is convex, from Jensen's inequality:

D(p||q) = −E[ log2 (q(X)/p(X)) ] ≥ −log2 E[ q(X)/p(X) ]
        = −log2 ( Σ_{x ∈ X} p(x) (q(x)/p(x)) ) = −log2 1 = 0.


Example: binary variables

If p,q are two Bernoulli distributions then:

D(p||q) = p(0) log2 (p(0)/q(0)) + (1 − p(0)) log2 ((1 − p(0))/(1 − q(0))).

[Figure: D(p||q) as a function of p(0), for q(0) ∈ {0.1, 0.2, 0.3, 0.4, 0.5}; x-axis: Probability p(0), y-axis: Relative Entropy D(p||q).]

Convex function, positive and minimized for p(0) = q(0).


Relative entropy is not a distance

Consider |X| = 2, p = (1/2, 1/2) and q = (a, 1 − a).

Then D(p||q) ≠ D(q||p) if a ≠ 1/2.
Relative entropy is not symmetrical and may be infinite. The triangle
inequality does not hold either.


Mutual Information

Definition
Let (X, Y) ∈ X × Y discrete random variables with joint distribution
pX,Y and marginal distributions pX and pY respectively. The mutual
information between X and Y is:
XX pX,Y (x, y)
I(X; Y) = pX,Y (x, y) log2
pX (x)pY (y)
x∈X y∈Y
XX  1 1 
= p(x, y) log2 − log2
pX (x) pX|Y (x|y)
x∈X y∈Y

= H(X) − H(X|Y)
= H(Y) − H(Y|X)
= H(X) + H(Y) − H(X, Y)
= D(pX,Y ||pX pY )
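
A numeric check that the definition and the entropy identities agree (the joint distribution is an arbitrary example of mine):

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Arbitrary joint distribution on {0,1} x {0,1} with its marginals.
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pX = {0: 0.5, 1: 0.5}
pY = {0: 0.6, 1: 0.4}

I_def = sum(p * math.log2(p / (pX[x] * pY[y])) for (x, y), p in pXY.items())
I_ent = H(pX) + H(pY) - H(pXY)   # H(X) + H(Y) - H(X,Y)
print(I_def, I_ent)               # both ~0.125 bits
```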


Mutual Information

▶ Reduction in entropy of X when Y is revealed


▶ Symmetry: I(X; Y) = I(Y; X)
▶ Quantifies how much information can be transmitted with (X, Y)
▶ Dissimilarity measure between pX,Y and pX pY


Positivity of Mutual Information

Property
Let X, Y be discrete random variables. Then I(X; Y) ≥ 0, with equality
if and only if X and Y are independent.

Proof: By definition

I(X; Y) = D(pX,Y ||pX pY ) ≥ 0

since relative entropy is positive, with equality if and only if


pX,Y = pX pY so that X,Y are independent.


Conditioning Reduces Entropy

Property
Let X, Y be discrete random variables. Then H(X|Y) ≤ H(X), with
equality if and only if X, Y are independent, and
H(X, Y) ≤ H(X) + H(Y), with equality if and only if X, Y are
independent.

Proof: We have

0 ≤ I(X; Y) = H(X) − H(X|Y)

with equality iff X,Y are independent.


From the chain rule H(X, Y) = H(Y|X) + H(X) ≤ H(X) + H(Y)
using the previous result.
