Session 2
Definition
The entropy of a discrete random variable X ∈ \mathcal{X} with distribution p_X is:
H(X) = \mathbb{E}\left[\log_2 \frac{1}{p_X(X)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}.
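As a quick numerical companion to the definition, here is a minimal sketch that evaluates H(X) from a pmf; the function name entropy_bits and the use of NumPy are choices made here, not part of the lecture.

```python
import numpy as np

def entropy_bits(p):
    """Entropy H(X) in bits of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_bits([0.5, 0.5]))               # 1.0 bit for a fair coin
print(entropy_bits([0.9, 0.1]))               # about 0.47 bits for a biased coin
```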
[Figure: entropy H(p) in bits as a function of the probability p.]
For Y with distribution (a^2, 2a(1-a), (1-a)^2), e.g. the number of heads in two independent Bernoulli(a) coin flips:
H(Y) = a^2 \log_2 \frac{1}{a^2} + 2a(1-a) \log_2 \frac{1}{2a(1-a)} + (1-a)^2 \log_2 \frac{1}{(1-a)^2}
     = 2 h_2(a) - 2a(1-a).
[Figure: H(Y) = 2h_2(a) − 2a(1−a) in bits, plotted as a function of a.]
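A small sanity check of the closed form above, assuming Y has the pmf (a², 2a(1−a), (1−a)²); the helper h2 for the binary entropy is introduced here for convenience.

```python
import numpy as np

def h2(a):
    """Binary entropy h2(a) in bits."""
    return a * np.log2(1 / a) + (1 - a) * np.log2(1 / (1 - a))

a = 0.3
p = np.array([a**2, 2*a*(1 - a), (1 - a)**2])        # pmf of Y
direct = np.sum(p * np.log2(1 / p))                  # H(Y) straight from the definition
closed = 2 * h2(a) - 2 * a * (1 - a)                 # the closed form above
print(direct, closed)                                # both are approximately 1.343
```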
Definition
A function f : \mathbb{R}^d \to \mathbb{R} is said to be convex if for all x, y ∈ \mathbb{R}^d and all λ ∈ [0, 1]:
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).
[Figure: graph of a convex function f(x).]
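A minimal sketch illustrating the chord inequality on sampled points; the choice f(x) = x² and the grid of λ values are only for illustration.

```python
import numpy as np

f = lambda x: x ** 2                                 # a convex function
x, y = -1.0, 3.0
for lam in np.linspace(0, 1, 5):
    lhs = f(lam * x + (1 - lam) * y)                 # f(λx + (1-λ)y)
    rhs = lam * f(x) + (1 - lam) * f(y)              # λf(x) + (1-λ)f(y)
    assert lhs <= rhs + 1e-12
print("chord inequality holds on the sampled λ values")
```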
Property
A twice differentiable function f : \mathbb{R}^d \to \mathbb{R} is convex if and only if its Hessian (\nabla^2 f)(x) is positive semi-definite for all x ∈ \mathbb{R}^d.
Examples:
▶ f(x) = x^2
▶ f(x) = e^{tx}
▶ f(x) = \log \frac{1}{x}
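A sketch checking the three examples symbolically with SymPy (assumed available here); in one dimension the Hessian reduces to the second derivative f''.

```python
import sympy as sp

x, t = sp.symbols('x t', positive=True)              # restrict to x > 0 so log(1/x) is defined
for f in (x**2, sp.exp(t * x), sp.log(1 / x)):
    print(f, "->", sp.simplify(sp.diff(f, x, 2)))
# second derivatives: 2, t**2*exp(t*x), x**(-2) -- all nonnegative on the domain
```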
If X is uniform:
H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = \sum_{x \in \mathcal{X}} p_X(x) \log_2 |\mathcal{X}| = \log_2 |\mathcal{X}|.
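A quick numerical confirmation of the uniform case; the alphabet size n = 8 is arbitrary.

```python
import numpy as np

n = 8
p = np.full(n, 1.0 / n)                              # uniform pmf on an alphabet of size n
H = float(np.sum(p * np.log2(1.0 / p)))              # entropy from the definition
print(H, np.log2(n))                                 # both equal 3.0
```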
Definition
The joint entropy of two discrete random variables X ∈ \mathcal{X} and Y ∈ \mathcal{Y} with joint distribution (X, Y) ∼ p_{X,Y}(x, y) is:
H(X, Y) = \mathbb{E}\left[\log_2 \frac{1}{p_{X,Y}(X, Y)}\right] = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X,Y}(x, y)}.
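A minimal sketch computing H(X, Y) from a toy joint pmf; the 2×2 table of probabilities is an assumption made for illustration.

```python
import numpy as np

# Toy joint pmf p_{X,Y}(x, y); rows index x, columns index y.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

mask = p_xy > 0
H_xy = float(np.sum(p_xy[mask] * np.log2(1.0 / p_xy[mask])))
print(H_xy)                                          # joint entropy, roughly 1.86 bits
```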
H(X|Y) = \mathbb{E}\left[\log_2 \frac{1}{p_{X|Y}(X|Y)}\right] = \mathbb{E}\left[\log_2 \frac{1}{p_{X,Y}(X, Y)}\right] - \mathbb{E}\left[\log_2 \frac{1}{p_Y(Y)}\right]
       = H(X, Y) - H(Y).
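The identity H(X|Y) = H(X, Y) − H(Y) can be evaluated directly on the same kind of toy joint pmf; again, the numbers are illustrative.

```python
import numpy as np

def H(p):
    """Entropy in bits of the nonzero entries of a pmf (array of any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_xy = np.array([[0.25, 0.25],                       # toy joint pmf (rows: x, columns: y)
                 [0.40, 0.10]])
p_y = p_xy.sum(axis=0)                               # marginal of Y

H_x_given_y = H(p_xy) - H(p_y)                       # H(X|Y) = H(X,Y) - H(Y)
print(H_x_given_y)
```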
Property
Consider X, Y independent. Then H(X, Y) = H(X) + H(Y) and H(X|Y) = H(X).
Indeed, since p_{X,Y}(x, y) = p_X(x) p_Y(y),
H(X, Y) = \mathbb{E}\left[\log_2 \frac{1}{p_X(X) p_Y(Y)}\right] = \mathbb{E}\left[\log_2 \frac{1}{p_X(X)}\right] + \mathbb{E}\left[\log_2 \frac{1}{p_Y(Y)}\right] = H(X) + H(Y),
and therefore H(X|Y) = H(X, Y) - H(Y) = H(X).
Property
Conditional entropy is not symmetric: in general H(X|Y) ≠ H(Y|X). From the chain rule, H(X|Y) - H(Y|X) = H(X) - H(Y), so the two coincide if and only if H(X) = H(Y).
Definition
Consider p, q two distributions over a discrete alphabet \mathcal{X}, and X with distribution p_X = p. The relative entropy between p and q is:
D(p\|q) = \mathbb{E}\left[\log_2 \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}.
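A minimal sketch of the relative entropy as defined above; the function name kl_bits is an assumption, and q is required to be positive wherever p is.

```python
import numpy as np

def kl_bits(p, q):
    """Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

print(kl_bits([0.5, 0.5], [0.9, 0.1]))               # about 0.74 bits
print(kl_bits([0.9, 0.1], [0.5, 0.5]))               # about 0.53 bits -- D is not symmetric
```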
Property
Consider p, q two distributions. Then D(p\|q) \ge 0 with equality iff p = q.
Proof: Since z \mapsto \log_2 \frac{1}{z} is convex, from Jensen's inequality:
D(p\|q) = -\mathbb{E}\left[\log_2 \frac{q(X)}{p(X)}\right] \ge -\log_2 \mathbb{E}\left[\frac{q(X)}{p(X)}\right] = -\log_2 \sum_{x \in \mathcal{X}} p(x) \frac{q(x)}{p(x)} = -\log_2 1 = 0.
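A quick empirical illustration of the property (not a proof): sample random pairs of pmfs and check nonnegativity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))                    # two random pmfs on 5 symbols
    q = rng.dirichlet(np.ones(5))
    d = np.sum(p * np.log2(p / q))                   # D(p||q); Dirichlet samples are a.s. positive
    assert d >= -1e-12                               # nonnegative, up to floating-point error
print("D(p||q) >= 0 on all sampled pairs")
```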
For binary distributions, writing p(0) and q(0) for the probabilities of the symbol 0:
D(p\|q) = p(0) \log_2 \frac{p(0)}{q(0)} + (1 - p(0)) \log_2 \frac{1 - p(0)}{1 - q(0)}.
[Figure: relative entropy D(p || q) versus p(0), one curve for each q = 0.1, 0.2, 0.3, 0.4, 0.5.]
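A sketch that reproduces curves like the ones in the figure above, assuming matplotlib is available.

```python
import numpy as np
import matplotlib.pyplot as plt

def d_binary(p0, q0):
    """D(p||q) in bits for Bernoulli p, q with P(X = 0) equal to p0 and q0."""
    return p0 * np.log2(p0 / q0) + (1 - p0) * np.log2((1 - p0) / (1 - q0))

p0 = np.linspace(0.001, 0.999, 500)                  # avoid the endpoints where log2 blows up
for q0 in (0.1, 0.2, 0.3, 0.4, 0.5):
    plt.plot(p0, d_binary(p0, q0), label=f"q(0) = {q0}")
plt.xlabel("Probability p(0)")
plt.ylabel("Relative entropy D(p || q)")
plt.legend()
plt.show()
```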
Definition
Let (X, Y) ∈ \mathcal{X} \times \mathcal{Y} be discrete random variables with joint distribution p_{X,Y} and marginal distributions p_X and p_Y respectively. The mutual information between X and Y is:
I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)}
        = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \left( \log_2 \frac{1}{p_X(x)} - \log_2 \frac{1}{p_{X|Y}(x|y)} \right)
        = H(X) - H(X|Y)
        = H(Y) - H(Y|X)
        = H(X) + H(Y) - H(X, Y)
        = D(p_{X,Y} \| p_X p_Y)
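A minimal sketch evaluating I(X; Y) on a toy joint pmf both from the definition and via H(X) + H(Y) − H(X, Y), to see the identities agree numerically; the joint pmf is an assumption.

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf (array of any shape)."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_xy = np.array([[0.25, 0.25],                       # toy joint pmf (rows: x, columns: y)
                 [0.40, 0.10]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I_def = float(np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y))))    # from the definition
I_ent = H(p_x) + H(p_y) - H(p_xy)                                   # H(X) + H(Y) - H(X,Y)
print(I_def, I_ent)                                                 # the two values agree
```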
Property
Let X, Y be discrete random variables. Then I(X; Y) \ge 0 with equality if and only if X and Y are independent.
Proof: By definition, I(X; Y) = D(p_{X,Y} \| p_X p_Y) \ge 0, with equality iff p_{X,Y} = p_X p_Y, i.e. iff X and Y are independent.
Property
Let X, Y be discrete random variables. Then H(X|Y) \le H(X) with equality if and only if X, Y are independent, and H(X, Y) \le H(X) + H(Y) with equality if and only if X, Y are independent.
Proof: We have H(X) - H(X|Y) = I(X; Y) \ge 0 and H(X) + H(Y) - H(X, Y) = I(X; Y) \ge 0, with equality iff X and Y are independent.
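As with the previous properties, a quick empirical check (not a proof) of both inequalities on randomly sampled joint pmfs.

```python
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

rng = np.random.default_rng(1)
for _ in range(1000):
    p_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)  # random joint pmf on a 3x4 alphabet
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    assert H(p_xy) - H(p_y) <= H(p_x) + 1e-12        # H(X|Y) <= H(X)
    assert H(p_xy) <= H(p_x) + H(p_y) + 1e-12        # H(X,Y) <= H(X) + H(Y)
print("both inequalities hold on all sampled joint pmfs")
```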