DLAI4 Example Solutions

The document discusses solutions to math problems. It provides the solutions to three problems involving probability distributions, sigma algebras, and information entropy. Conditional and joint probability distributions are calculated. Properties of sigma algebras are outlined. An equation relating conditional entropies is derived.


MATH4267 Solutions JL

1. (a) The conditional PDFs of (X1, X2 | Y = 1) and (X1, X2 | Y = 0) are:

$$f_{X_1,X_2\mid Y=1}(x_1,x_2) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1-1)^2 + x_2^2\right)\right)$$

$$f_{X_1,X_2\mid Y=0}(x_1,x_2) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1+1)^2 + x_2^2\right)\right)$$

and we have P (Y = 1) = α. So the unconditional PDF of (X1 , X2 ) is given by:

$$\begin{aligned}
f_{X_1,X_2}(x_1,x_2) &= P(Y=1)\,f_{X_1,X_2\mid Y=1}(x_1,x_2) + P(Y=0)\,f_{X_1,X_2\mid Y=0}(x_1,x_2) \\
&= \alpha\,\frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1-1)^2 + x_2^2\right)\right) + (1-\alpha)\,\frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1+1)^2 + x_2^2\right)\right)
\end{aligned} \qquad (1)$$

[SEEN SIMILAR]
[3 marks]
(b) The generalisation error of d is:

$$E_{(X_1,X_2),Y}\left(\mathrm{Loss}(Y,\hat{Y})\right) = E_{(X_1,X_2),Y}\left(c(Y, d(X_1,X_2))\right) \qquad (2)$$

NB: it is fine not to write this out as an integral, but an answer written as an
integral is also acceptable. [SEEN]
[3 marks]

1 of 7 2024

(c) The generalisation error for a zero-one loss is minimised by the function

$$d_{\min}(x_1,x_2) = \begin{cases} 1 & \text{if } P(Y=1\mid (X_1,X_2)=(x_1,x_2)) > P(Y=0\mid (X_1,X_2)=(x_1,x_2)) \\ 0 & \text{otherwise} \end{cases}$$

Since we have

$$\begin{aligned}
P(Y=1\mid (X_1,X_2)=(x_1,x_2)) &= \frac{f_{X_1,X_2\mid Y=1}(x_1,x_2)\,P(Y=1)}{f_{X_1,X_2}(x_1,x_2)} = \frac{\alpha\,f_{X_1,X_2\mid Y=1}(x_1,x_2)}{f_{X_1,X_2}(x_1,x_2)} \\
P(Y=0\mid (X_1,X_2)=(x_1,x_2)) &= \frac{f_{X_1,X_2\mid Y=0}(x_1,x_2)\,P(Y=0)}{f_{X_1,X_2}(x_1,x_2)} = \frac{(1-\alpha)\,f_{X_1,X_2\mid Y=0}(x_1,x_2)}{f_{X_1,X_2}(x_1,x_2)}
\end{aligned}$$

d_min(x_1, x_2) in our case is:

$$d_{\min}(x_1,x_2) = \begin{cases} 1 & \text{if } \alpha\,f_{X_1,X_2\mid Y=1}(x_1,x_2) > (1-\alpha)\,f_{X_1,X_2\mid Y=0}(x_1,x_2) \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

[SEEN SIMILAR]
[4 marks]
NB: it is fine to use abbreviations of previously defined functions, rather than
writing them out in full each time.
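NB: the rule in (3) can be sanity-checked numerically. The following Python sketch (not part of the required answer; the function names are illustrative) implements the weighted-density comparison; with α = 1/2 the decision boundary is x1 = 0.

```python
import math

def class_density(x1, x2, mean):
    # Bivariate normal with identity covariance, centred at (mean, 0)
    return math.exp(-0.5 * ((x1 - mean) ** 2 + x2 ** 2)) / (2 * math.pi)

def d_min(x1, x2, alpha):
    # Bayes-optimal rule (3): predict 1 iff alpha * f(x|Y=1) > (1-alpha) * f(x|Y=0)
    f1 = class_density(x1, x2, 1.0)
    f0 = class_density(x1, x2, -1.0)
    return 1 if alpha * f1 > (1 - alpha) * f0 else 0

print(d_min(0.5, 0.0, 0.5))   # 1: right of the boundary when alpha = 1/2
print(d_min(-0.5, 0.0, 0.5))  # 0: left of the boundary
print(d_min(0.5, 0.0, 0.1))   # 0: a small prior on Y=1 shifts the boundary right
```

The third call shows how the prior α moves the boundary: the same point is classified as 0 once class 1 becomes sufficiently unlikely a priori.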


2. (a) Given a finite set S, a σ-algebra Σ on S is a collection of subsets of S (that is, Σ ⊆ 2^S) such that:
• S ∈ Σ;
• Σ is closed under complementarity, so if A ∈ Σ then (S \ A) ∈ Σ;
• Σ is closed under union/intersection, so if A, B ∈ Σ then (A ∪ B) ∈ Σ (and
(A ∩ B) ∈ Σ).
NB: the concept of countable union or intersection is not relevant here, since S is
finite, but there is no penalty for mentioning it. Only one of union/intersection
closure need be mentioned, but there is no penalty for mentioning both. [SEEN]
[3 marks]
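NB: for a finite S the three axioms can be checked exhaustively. A short Python sketch (illustrative only; the helper name is an assumption, not from the solutions):

```python
from itertools import combinations

def is_sigma_algebra(S, sigma):
    # S itself must be a member
    if S not in sigma:
        return False
    # Closed under complement
    for A in sigma:
        if S - A not in sigma:
            return False
    # Closed under pairwise union (intersection closure then follows, see (d))
    for A, B in combinations(sigma, 2):
        if A | B not in sigma:
            return False
    return True

S = frozenset({1, 2, 3})
sigma = {frozenset(), frozenset({1}), frozenset({2, 3}), S}
print(is_sigma_algebra(S, sigma))  # True
```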
(b) Since the map from X^(ℓ) to X^(ℓ+1) is a deterministic function, we have S(X^(ℓ)) ⊇ S(X^(ℓ+1)), from which the statement follows. Equality will hold in each case if the function from X^(ℓ) to X^(ℓ+1) is one-to-one (it is fine to say bijective, invertible, or injective). [SEEN]
[3 marks]
(c) Suppose that the output of the original network is Y = (Y_1, Y_2, ..., Y_m), where there are m neurons, and the output of the new network is Y' = (Y'_1, Y'_2, ..., Y'_{m-1}).
Then Y'_1 = Y_1, Y'_2 = Y_2, ..., and we can define a deterministic function from Y
to Y'. Hence S(Y') ⊆ S(Y). [SEEN SIMILAR]
[4 marks]
(d) Suppose A, B ∈ Σ and we have closure under union. Then by closure under
complement, (S \ A) ∈ Σ and (S \ B) ∈ Σ. By the assumed closure under union,
we have (S \ A) ∪ (S \ B) ∈ Σ. But the set T = (S \ A) ∪ (S \ B) contains
everything in S not in both A and B, so (A ∩ B) = (S \ T). Since T ∈ Σ, we
have S \ T ∈ Σ by closure under complement, so (A ∩ B) ∈ Σ.
Suppose instead A, B ∈ Σ and we have closure under intersection. Then by closure
under complement, (S \ A) ∈ Σ and (S \ B) ∈ Σ. By the assumed closure under
intersection, we have (S \ A) ∩ (S \ B) ∈ Σ. But the set T′ = (S \ A) ∩ (S \ B)
contains everything in S in neither A nor B, so (A ∪ B) = (S \ T′). Since
T′ ∈ Σ, we have S \ T′ ∈ Σ by closure under complement, so (A ∪ B) ∈ Σ.
NB: it is fine to prove just one direction and say ‘the other direction is similar’, as
long as the other direction is acknowledged.
[UNSEEN]
[4 marks]
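NB: both directions of the proof rest on De Morgan's laws; a quick numerical check (the sets below are chosen arbitrarily for illustration):

```python
S = frozenset(range(6))
A = frozenset({0, 1, 2})
B = frozenset({2, 3, 4})

# A ∩ B = S \ ((S \ A) ∪ (S \ B))
T = (S - A) | (S - B)
print((A & B) == S - T)   # True

# A ∪ B = S \ ((S \ A) ∩ (S \ B))
T2 = (S - A) & (S - B)
print((A | B) == S - T2)  # True
```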


3. (a) We have

$$H(X\mid Y) - H(Y\mid X) = \left(H(X,Y) - H(Y)\right) - \left(H(X,Y) - H(X)\right) = H(X) - H(Y) \qquad (4)$$
We can derive this (though doing so is not necessary for a correct answer) as follows:

$$\begin{aligned}
H(X\mid Y) - H(Y\mid X) &= -\int f_{X,Y}(x,y)\ln\left(\frac{f_{X,Y}(x,y)}{f_Y(y)}\right)dx\,dy + \int f_{X,Y}(x,y)\ln\left(\frac{f_{X,Y}(x,y)}{f_X(x)}\right)dx\,dy \\
&= \int f_{X,Y}(x,y)\ln\left(\frac{f_{X,Y}(x,y)\,f_Y(y)}{f_{X,Y}(x,y)\,f_X(x)}\right)dx\,dy \\
&= \int f_{X,Y}(x,y)\ln\left(\frac{f_Y(y)}{f_X(x)}\right)dx\,dy \\
&= -\int f_{X,Y}(x,y)\ln\left(f_X(x)\right)dx\,dy + \int f_{X,Y}(x,y)\ln\left(f_Y(y)\right)dx\,dy \\
&= -\int\left(\int f_{X,Y}(x,y)\,dy\right)\ln\left(f_X(x)\right)dx + \int\left(\int f_{X,Y}(x,y)\,dx\right)\ln\left(f_Y(y)\right)dy \\
&= -\int f_X(x)\ln\left(f_X(x)\right)dx + \int f_Y(y)\ln\left(f_Y(y)\right)dy \\
&= H(X) - H(Y)
\end{aligned}$$

Since H(X|Y ) − H(Y |X) = H(X) − H(Y ), we have H(X|Y ) = H(Y |X) if and
only if H(X) = H(Y ). [SEEN]
[3 marks]
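NB: the identity H(X|Y) − H(Y|X) = H(X) − H(Y) also holds for discrete entropy, which makes it easy to verify numerically. A Python sketch (the joint distribution is an arbitrary illustrative example), computing the conditional entropies directly from their definitions rather than via the chain rule:

```python
import math

# Arbitrary 2x2 joint distribution p(x, y) and its marginals
p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

def H(dist):
    # Discrete (Shannon) entropy in nats
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

# Conditional entropies from their definitions
H_x_given_y = -sum(q * math.log(q / py[y]) for (x, y), q in p.items())
H_y_given_x = -sum(q * math.log(q / px[x]) for (x, y), q in p.items())

print(abs((H_x_given_y - H_y_given_x) - (H(px) - H(py))) < 1e-12)  # True
```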
(b) The K-L divergence of random variables X and Y (where defined) is defined as

$$D_{KL}(f_X \,\|\, f_Y) = \int f_X(x)\ln\left(\frac{f_X(x)}{f_Y(x)}\right)dx$$

where f_X(x) and f_Y(x) are the PDFs of X and Y, and the integral is over the common domain of f_X and f_Y.
To show that the K-L divergence is non-negative, we use Jensen's inequality: since ln(x) is concave, we have, for a PDF p(x) and a positive-valued function q(x),

$$\int p(x)\ln\left(q(x)\right)dx \le \ln\left(\int p(x)\,q(x)\,dx\right)$$


so

$$\begin{aligned}
D_{KL}(f_X\,\|\,f_Y) &= -\int f_X(x)\ln\left(\frac{f_Y(x)}{f_X(x)}\right)dx \\
&\ge -\ln\left(\int f_X(x)\,\frac{f_Y(x)}{f_X(x)}\,dx\right) = -\ln\left(\int f_Y(x)\,dx\right) \\
&= -\ln(1) = 0
\end{aligned}$$

Alternatively, using the more specific inequality ln(x) ≤ x − 1, we have

$$\begin{aligned}
D_{KL}(f_X\,\|\,f_Y) &= -\int f_X(x)\ln\left(\frac{f_Y(x)}{f_X(x)}\right)dx \\
&\ge -\int f_X(x)\left(\frac{f_Y(x)}{f_X(x)} - 1\right)dx \\
&= -\int f_Y(x)\,dx + \int f_X(x)\,dx \\
&= -1 + 1 = 0
\end{aligned}$$

[SEEN]
[5 marks]
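NB: non-negativity is easy to confirm numerically in the discrete case, where the integral becomes a sum. A Python sketch (the distributions below are arbitrary):

```python
import math

def kl(p, q):
    # D_KL(p || q) = sum_i p_i ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
print(kl(p, q) > 0)    # True: distinct distributions give strictly positive divergence
print(kl(p, p) == 0.0) # True: divergence of a distribution from itself is zero
```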


(c) We have

$$\begin{aligned}
I(X,Y) &= H(X) + H(Y) - H(X,Y) \\
&= -\int f_X(x)\ln\left(f_X(x)\right)dx - \int f_Y(y)\ln\left(f_Y(y)\right)dy + \int f_{X,Y}(x,y)\ln\left(f_{X,Y}(x,y)\right)dx\,dy \\
&= -\int\left(\int f_{X,Y}(x,y)\,dy\right)\ln\left(f_X(x)\right)dx - \int\left(\int f_{X,Y}(x,y)\,dx\right)\ln\left(f_Y(y)\right)dy \\
&\quad + \int f_{X,Y}(x,y)\ln\left(f_{X,Y}(x,y)\right)dx\,dy \\
&= \int f_{X,Y}(x,y)\ln\left(\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)}\right)dx\,dy \quad \left(= D_{KL}\!\left(f_{X,Y}\,\big\|\,f_X f_Y\right)\right) \\
&= \int f_Y(y)\,\frac{f_{X,Y}(x,y)}{f_Y(y)}\,\ln\left(\frac{f_{X,Y}(x,y)/f_Y(y)}{f_X(x)}\right)dx\,dy \\
&= \int f_Y(y)\,f_{X\mid Y=y}(x)\,\ln\left(\frac{f_{X\mid Y=y}(x)}{f_X(x)}\right)dx\,dy \\
&= \int f_Y(y)\,D_{KL}\!\left(f_{X\mid Y=y}\,\big\|\,f_X\right)dy \\
&= E_{y\sim Y}\,D_{KL}\!\left(f_{X\mid Y=y}\,\big\|\,f_X\right)
\end{aligned} \qquad (5)$$

as required. [UNSEEN]
[7 marks]
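NB: the identity (5) has an exact discrete analogue, which can be verified numerically. A Python sketch (the joint table is illustrative), computing the mutual information directly from the joint and comparing it with the expected KL divergence of the conditionals from the marginal:

```python
import math

# Arbitrary 2x2 joint distribution p(x, y) and its marginals
p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Mutual information I(X, Y) straight from the joint
mi = sum(q * math.log(q / (px[x] * py[y])) for (x, y), q in p.items())

# E_{y~Y} D_KL(p(x|y) || p(x)), as in (5)
exp_kl = sum(
    py[y] * sum((p[(x, y)] / py[y]) * math.log((p[(x, y)] / py[y]) / px[x])
                for x in (0, 1))
    for y in (0, 1)
)
print(abs(mi - exp_kl) < 1e-12)  # True
```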


4. (a) We have h_i = ϕ(W h_{i−1} + U X_i + b). The value of W affects h_i both directly through this equation, and through h_{i−1}. Thus we have, for i > 1:

$$\alpha_i = \frac{dh_i}{dW} = \frac{\partial h_i}{\partial W} + \frac{\partial h_i}{\partial h_{i-1}}\,\frac{dh_{i-1}}{dW} = \beta_i + \gamma_i\,\alpha_{i-1}$$

[SEEN SIMILAR]
[4 marks]
(b) If n = 2 then we have

$$\alpha_2 = \beta_2 + \gamma_2\,\alpha_1 = \beta_2 + \sum_{i=1}^{2-1}\left(\prod_{j=i+1}^{2}\gamma_j\right)\beta_i$$

Suppose by induction that the formula in the question holds for n = 1, 2, ..., n − 1. Then we have

$$\begin{aligned}
\alpha_n &= \beta_n + \gamma_n\,\alpha_{n-1} \\
&= \beta_n + \gamma_n\left(\beta_{n-1} + \sum_{i=1}^{n-2}\left(\prod_{j=i+1}^{n-1}\gamma_j\right)\beta_i\right) \\
&= \beta_n + \gamma_n\,\beta_{n-1} + \sum_{i=1}^{n-2}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i \\
&= \beta_n + \sum_{i=n-1}^{n-1}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i + \sum_{i=1}^{n-2}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i \\
&= \beta_n + \sum_{i=1}^{n-1}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i
\end{aligned}$$

which gives the requisite result by induction. [UNSEEN]


[7 marks]
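NB: the closed form can be checked against the recurrence numerically. A Python sketch (the coefficient values are chosen arbitrarily; lists are padded so indices match the 1-based notation above):

```python
# Coefficients beta_i, gamma_i for i = 1..4 (index 0 unused, for 1-based indexing)
beta = [None, 0.5, -1.2, 0.7, 2.0]
gamma = [None, 0.9, 1.1, -0.3, 0.8]

def alpha_rec(n):
    # Unroll alpha_i = beta_i + gamma_i * alpha_{i-1}, with alpha_1 = beta_1
    a = beta[1]
    for i in range(2, n + 1):
        a = beta[i] + gamma[i] * a
    return a

def alpha_closed(n):
    # Closed form: alpha_n = beta_n + sum_{i=1}^{n-1} (prod_{j=i+1}^{n} gamma_j) beta_i
    total = beta[n]
    for i in range(1, n):
        prod = 1.0
        for j in range(i + 1, n + 1):
            prod *= gamma[j]
        total += prod * beta[i]
    return total

print(abs(alpha_rec(4) - alpha_closed(4)) < 1e-9)  # True
```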
(c) The exploding gradient problem can arise when activation functions have gradients which exceed 1. An example could be the scaled sigmoid function:

$$\varphi(x) = \left(1 + \exp(-5x)\right)^{-1} \qquad (6)$$

[SEEN SIMILAR]
[4 marks]
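NB: the derivative of (6) is ϕ′(x) = 5ϕ(x)(1 − ϕ(x)), which attains its maximum value 5/4 > 1 at x = 0, so products of such gradients across many layers can grow geometrically. A short numerical illustration:

```python
import math

def phi(x):
    # Scaled sigmoid from equation (6)
    return 1.0 / (1.0 + math.exp(-5.0 * x))

def dphi(x):
    # phi'(x) = 5 * phi(x) * (1 - phi(x)), maximised at x = 0 with value 5/4
    p = phi(x)
    return 5.0 * p * (1.0 - p)

print(dphi(0.0))        # 1.25
print(dphi(0.0) ** 20)  # product of 20 such gradients: 1.25**20 ≈ 87
```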

