EE 657 - Pattern Recognition and Machine Learning

Solutions for Practice Set

1. Let us denote apples, oranges and limes by a, o and l respectively. The marginal probability of
selecting an apple is given by,

   p(a) = p(a | r)p(r) + p(a | b)p(b) + p(a | g)p(g)
        = (3/10) × 0.2 + (1/2) × 0.2 + (3/10) × 0.6
        = 0.34

where the conditional probabilities are obtained from the proportions of apples in each box.
To find the probability that the box was green, given that the fruit we selected was an orange, we
can use Bayes' theorem

   p(g | o) = p(o | g)p(g) / p(o)                                          (1)
The denominator in (1) is given by,

   p(o) = p(o | r)p(r) + p(o | b)p(b) + p(o | g)p(g)
        = (4/10) × 0.2 + (1/2) × 0.2 + (3/10) × 0.6
        = 0.36

from which we obtain,


   p(g | o) = (3/10) × (0.6 / 0.36) = 1/2
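
As a quick numerical cross-check of the calculation above, here is a minimal Python sketch (the
dictionary and variable names are ours; the numbers are exactly those stated in the problem):

    # Numerical check of the box-and-fruit calculation above.
    p_box = {"r": 0.2, "b": 0.2, "g": 0.6}                 # prior over boxes
    frac_apple = {"r": 3 / 10, "b": 1 / 2, "g": 3 / 10}    # p(a | box)
    frac_orange = {"r": 4 / 10, "b": 1 / 2, "g": 3 / 10}   # p(o | box)

    p_a = sum(frac_apple[k] * p_box[k] for k in p_box)     # marginal p(a)
    p_o = sum(frac_orange[k] * p_box[k] for k in p_box)    # marginal p(o)
    p_g_given_o = frac_orange["g"] * p_box["g"] / p_o      # Bayes' theorem

    print(p_a)          # 0.34
    print(p_o)          # 0.36
    print(p_g_given_o)  # 0.5
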
2. The definition of covariance is given by (2),

   cov[x, y] = E_{x,y}[xy] − E[x]E[y]                                      (2)

Using

   E[f] = Σ_x p(x) f(x)                                                    (3)

and the fact that p(x, y) = p(x)p(y) when x and y are independent, we obtain
   E[xy] = Σ_x Σ_y p(x, y) x y
         = Σ_x p(x) x  Σ_y p(y) y
         = E[x] E[y]

and hence cov[x, y] = 0. The case where x and y are continuous variables is analogous, with (3)
replaced by (4) and the sums replaced by integrals.
   E[f] = ∫ p(x) f(x) dx                                                   (4)
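
As a small numerical illustration of this result, the sketch below builds an assumed (made-up)
factorized joint distribution over two discrete variables and confirms that the covariance vanishes:

    import numpy as np

    # Hypothetical marginals for two independent discrete variables x and y.
    x_vals, p_x = np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3])
    y_vals, p_y = np.array([-1.0, 1.0]), np.array([0.4, 0.6])

    # Independence: the joint factorizes as p(x, y) = p(x) p(y).
    p_xy = np.outer(p_x, p_y)

    E_x = np.dot(p_x, x_vals)
    E_y = np.dot(p_y, y_vals)
    E_xy = np.sum(p_xy * np.outer(x_vals, y_vals))

    print(E_xy - E_x * E_y)   # cov[x, y] = 0 (up to floating-point error)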

3. Since x and z are independent, their joint distribution factorizes as p(x, z) = p(x)p(z), and so

   E[x + z] = ∫∫ (x + z) p(x) p(z) dx dz
            = ∫ x p(x) dx + ∫ z p(z) dz
            = E[x] + E[z]

Similarly for the variances, we first note that

   (x + z − E[x + z])² = (x − E[x])² + (z − E[z])² + 2(x − E[x])(z − E[z])

where the final term will integrate to zero with respect to the factorized distribution p(x)p(z). Hence

   var[x + z] = ∫∫ (x + z − E[x + z])² p(x) p(z) dx dz
              = ∫ (x − E[x])² p(x) dx + ∫ (z − E[z])² p(z) dz
              = var[x] + var[z]

4. We have,
   σ²_ML = (1/N) Σ_{n=1}^N (x_n − µ)²                                      (5)

Now,

   E_{x_n}[σ²_ML] = E_{x_n}[ (1/N) Σ_{n=1}^N (x_n − µ)² ]
                  = (1/N) Σ_{n=1}^N E_{x_n}[ x_n² − 2 x_n µ + µ² ]
                  = (1/N) Σ_{n=1}^N ( µ² + σ² − 2µ² + µ² )
                  = σ²
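
Since (5) uses the known mean µ rather than the sample mean, the estimator comes out unbiased, as
the calculation shows. A brief Monte Carlo sketch of this, with arbitrarily chosen values of µ, σ and N:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, N, trials = 1.5, 2.0, 10, 200_000   # arbitrary illustrative values

    # sigma2_ml from (5), computed with the *known* mean mu, for many data sets.
    x = rng.normal(mu, sigma, size=(trials, N))
    sigma2_ml = np.mean((x - mu) ** 2, axis=1)

    print(sigma2_ml.mean())   # close to sigma**2 = 4.0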

5. To prove the result

   E_x[x] = E_y[ E_x[x | y] ]                                              (6)

we use the product rule of probability:

   E_y[ E_x[x | y] ] = ∫ ( ∫ x p(x | y) dx ) p(y) dy
                     = ∫∫ x p(x, y) dx dy
                     = ∫ x p(x) dx
                     = E_x[x]

To prove the result

   var_x[x] = E_y[ var_x[x | y] ] + var_y[ E_x[x | y] ]                    (7)

we use the definition of variance,

   var[x] = E[x²] − E[x]²                                                  (8)

together with the relation (6), to give

   E_y[var_x[x | y]] + var_y[E_x[x | y]]
       = E_y[ E_x[x² | y] − E_x[x | y]² ] + E_y[ E_x[x | y]² ] − E_y[ E_x[x | y] ]²
       = E_x[x²] − E_x[x]²
       = var_x[x]

where we have made use of E_y[ E_x[x² | y] ] = E_x[x²], which can be proved by analogy with (6).
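
Both (6) and (7) are easy to check by simulation. The sketch below assumes a simple hierarchical
model, y ~ N(0, 1) and x | y ~ N(y, 1), for which E_x[x | y] = y and var_x[x | y] = 1:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    y = rng.normal(0.0, 1.0, n)        # y ~ N(0, 1)
    x = rng.normal(y, 1.0)             # x | y ~ N(y, 1)

    # (6): E[x] should equal E_y[E_x[x | y]] = E[y]
    print(x.mean(), y.mean())

    # (7): var[x] should equal E[var[x | y]] + var[E[x | y]] = 1 + var[y], i.e. about 2
    print(x.var(), 1.0 + y.var())
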
6. The normalization of the uniform distribution is proved trivially:

   ∫_a^b 1/(b − a) dx = (b − a)/(b − a) = 1

For the mean of the distribution we have

   E[x] = ∫_a^b x/(b − a) dx = [x²/(2(b − a))]_a^b = (b² − a²)/(2(b − a)) = (a + b)/2

The variance can be found by first evaluating

   E[x²] = ∫_a^b x²/(b − a) dx = [x³/(3(b − a))]_a^b = (b³ − a³)/(3(b − a)) = (a² + ab + b²)/3

and then using

   var[x] = E[x²] − E[x]² = (a² + ab + b²)/3 − (a + b)²/4 = (b − a)²/12
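
These closed-form moments can be spot-checked numerically; a minimal sketch with arbitrarily chosen
endpoints a and b:

    import numpy as np

    rng = np.random.default_rng(2)
    a, b = -1.0, 3.0                      # arbitrary illustrative endpoints
    x = rng.uniform(a, b, 1_000_000)

    print(x.mean(), (a + b) / 2)          # both close to 1.0
    print(x.var(), (b - a) ** 2 / 12)     # both close to 1.333...
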
7. Consider a matrix M which is symmetric, so that M^T = M. The inverse matrix M^{-1} satisfies

   M M^{-1} = I.

Taking the transpose of both sides of this equation,

   (M^{-1})^T M^T = I^T = I

since the identity matrix is symmetric. Making use of the symmetry condition for M we then have

   (M^{-1})^T M = I

and hence, from the definition of the matrix inverse,

   (M^{-1})^T = M^{-1}

and so M^{-1} is also a symmetric matrix.
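
A quick numerical illustration of this result, using a randomly generated symmetric (and safely
invertible) matrix:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(4, 4))
    M = A @ A.T + 4 * np.eye(4)          # symmetric and comfortably invertible

    M_inv = np.linalg.inv(M)
    print(np.allclose(M_inv, M_inv.T))   # True: the inverse is symmetric as well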


8. We have

   Σ = Σ_{i=1}^D λ_i u_i u_i^T                                             (9)

We start from the eigenvector equation Σ u_i = λ_i u_i and pre-multiply both sides by u_i^†, the
conjugate transpose of u_i. This gives us

   u_i^† Σ u_i = λ_i u_i^† u_i                                             (10)

Next consider the conjugate transpose of the eigenvector equation and post-multiply it by u_i, which
gives us

   u_i^† Σ^† u_i = λ_i^* u_i^† u_i                                         (11)

where λ_i^* is the complex conjugate of λ_i. We now subtract (10) from (11) and use the fact that Σ
is real and symmetric, and hence Σ = Σ^†, to get

   0 = (λ_i^* − λ_i) u_i^† u_i

Since u_i^† u_i ≠ 0, it follows that λ_i^* = λ_i and so λ_i must be real. Now consider

   u_i^T u_j λ_j = u_i^T Σ u_j
                 = u_i^T Σ^T u_j
                 = (Σ u_i)^T u_j
                 = λ_i u_i^T u_j

where we have used the eigenvector equation and the fact that Σ is symmetric. If we assume that
0 ≠ λ_i ≠ λ_j ≠ 0, the only solution to this equation is that u_i^T u_j = 0, i.e., that u_i and u_j
are orthogonal.
If 0 ≠ λ_i = λ_j ≠ 0, any linear combination of u_i and u_j will be an eigenvector with eigenvalue
λ = λ_i = λ_j, since, from the eigenvector equation,

   Σ(a u_i + b u_j) = a λ_i u_i + b λ_j u_j = λ(a u_i + b u_j).

Assuming that u_i ≠ u_j, we can construct

   u_α = a u_i + b u_j,    u_β = c u_i + d u_j

such that u_α and u_β are mutually orthogonal and of unit length. Since u_i and u_j are orthogonal to
u_k (k ≠ i, k ≠ j), so are u_α and u_β.
Finally, if λ_i = 0, Σ must be singular, with u_i lying in the null space of Σ. In this case, u_i will be
orthogonal to the eigenvectors projecting onto the row space of Σ and we can choose ||u_i|| = 1, so
that (12) is satisfied:

   u_i^T u_j = I_ij                                                        (12)

If more than one eigenvalue equals zero, we can choose the corresponding eigenvectors arbitrarily, as
long as they remain in the null space of Σ, and so we can choose them to satisfy (12).
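
The conclusions of this argument (real eigenvalues and orthonormal eigenvectors satisfying (12)) can
be checked numerically for a random real symmetric matrix; in NumPy, np.linalg.eigh handles exactly
this case:

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.normal(size=(5, 5))
    Sigma = (B + B.T) / 2                      # a random real symmetric matrix

    lam, U = np.linalg.eigh(Sigma)             # eigenvalues and eigenvectors
    print(np.all(np.isreal(lam)))              # True: the eigenvalues are real
    print(np.allclose(U.T @ U, np.eye(5)))     # True: u_i^T u_j = I_ij as in (12)
    print(np.allclose(U @ np.diag(lam) @ U.T, Sigma))   # expansion (9) reconstructs Sigma
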
9. The covariance matrix Σ can be expressed as an expansion in terms of its eigenvectors in the form

   Σ = Σ_{i=1}^D λ_i u_i u_i^T                                             (13)

We can write the r.h.s. of (13) in matrix form as

   Σ_{i=1}^D λ_i u_i u_i^T = U Λ U^T = M                                   (14)

where U is a D × D matrix with the eigenvectors u_1, ..., u_D as its columns and Λ is a diagonal
matrix with the eigenvalues λ_1, ..., λ_D along its diagonal. Thus we have

   U^T M U = U^T U Λ U^T U = Λ

We also have, using the eigenvector equation Σ U = U Λ, that

   U^T Σ U = U^T U Λ = Λ,

and so M = Σ and (13) holds. Moreover, since U is orthonormal, U^{-1} = U^T and so

   Σ^{-1} = (U Λ U^T)^{-1} = (U^T)^{-1} Λ^{-1} U^{-1} = U Λ^{-1} U^T = Σ_{i=1}^D (1/λ_i) u_i u_i^T
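
A short numerical check of the final identity Σ^{-1} = Σ_i (1/λ_i) u_i u_i^T, using a randomly
generated positive definite matrix:

    import numpy as np

    rng = np.random.default_rng(5)
    B = rng.normal(size=(4, 4))
    Sigma = B @ B.T + np.eye(4)                 # random positive definite covariance

    lam, U = np.linalg.eigh(Sigma)
    Sigma_inv = U @ np.diag(1.0 / lam) @ U.T    # U Lambda^{-1} U^T

    print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))   # True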

10. Since u_1, ..., u_D constitute a basis for R^D, we can write

   a = â_1 u_1 + â_2 u_2 + · · · + â_D u_D,

where â_1, ..., â_D are coefficients obtained by projecting a onto u_1, ..., u_D. Note that they typically
do not equal the elements of a. Using this we can write

   a^T Σ a = (â_1 u_1^T + · · · + â_D u_D^T) Σ (â_1 u_1 + · · · + â_D u_D)

and combining this result with Σ u_i = λ_i u_i we get

   a^T Σ a = (â_1 u_1^T + · · · + â_D u_D^T)(â_1 λ_1 u_1 + · · · + â_D λ_D u_D)

Now, since u_i^T u_j = 1 only if i = j, and 0 otherwise, this becomes

   a^T Σ a = â_1² λ_1 + · · · + â_D² λ_D

and since a is real, we see that this expression will be strictly positive for any non-zero a if all the
eigenvalues are strictly positive. It is also clear that if an eigenvalue, λ_i, is zero or negative, there
exists a vector a (e.g. a = u_i) for which this expression will be less than or equal to zero. Thus, having
all eigenvalues strictly positive is a necessary and sufficient condition for the matrix to be positive
definite.
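
A numerical illustration of the "necessary" direction of this argument, built from an assumed spectrum
containing one negative eigenvalue:

    import numpy as np

    rng = np.random.default_rng(6)
    U, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random orthonormal basis
    lam = np.array([2.0, 0.5, -1.0])                  # one eigenvalue is negative
    Sigma = U @ np.diag(lam) @ U.T

    a = rng.normal(size=3)
    print(a @ Sigma @ a)               # may be positive or negative for a random a
    print(U[:, 2] @ Sigma @ U[:, 2])   # about -1.0: a = u_3 gives a^T Sigma a = lambda_3 < 0
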
11. From y = x + z we have trivially that E[y] = E[x] + E[z]. For the covariance we have

   cov[y] = E[ (x − E[x] + z − E[z]) (x − E[x] + z − E[z])^T ]
          = E[ (x − E[x])(x − E[x])^T ] + E[ (z − E[z])(z − E[z])^T ]
            + E[ (x − E[x])(z − E[z])^T ] + E[ (z − E[z])(x − E[x])^T ]
          = cov[x] + cov[z]

where we have used the independence of x and z, together with E[x − E[x]] = E[z − E[z]] = 0, to
set the third and fourth terms in the expansion to zero. For 1-dimensional variables the covariances
become variances and we obtain the result of problem 3 as a special case.
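
A Monte Carlo sketch of this result, using made-up covariances for the independent vectors x and z:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 500_000

    cov_x = np.array([[2.0, 0.3], [0.3, 1.0]])
    cov_z = np.array([[0.5, -0.2], [-0.2, 1.5]])

    x = rng.multivariate_normal([0, 0], cov_x, n)
    z = rng.multivariate_normal([1, -1], cov_z, n)    # drawn independently of x
    y = x + z

    print(np.cov(y.T))            # close to cov_x + cov_z
    print(cov_x + cov_z)
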
12. The posterior distribution is proportional to the product of the prior and the likelihood function,

   p(µ|X) ∝ p(µ) Π_{n=1}^N p(x_n|µ, Σ).

Thus the posterior is proportional to an exponential of a quadratic form in µ, given by

   −(1/2)(µ − µ_0)^T Σ_0^{-1} (µ − µ_0) − (1/2) Σ_{n=1}^N (x_n − µ)^T Σ^{-1} (x_n − µ)
       = −(1/2) µ^T (Σ_0^{-1} + N Σ^{-1}) µ + µ^T ( Σ_0^{-1} µ_0 + Σ^{-1} Σ_{n=1}^N x_n ) + const

where const denotes terms independent of µ. We see that the mean and covariance of the posterior
distribution are given by

   µ_N = (Σ_0^{-1} + N Σ^{-1})^{-1} (Σ_0^{-1} µ_0 + Σ^{-1} N µ_ML)
   Σ_N^{-1} = Σ_0^{-1} + N Σ^{-1}

where µ_ML is the maximum likelihood solution for the mean, given by

   µ_ML = (1/N) Σ_{n=1}^N x_n
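
The expressions for µ_N and Σ_N can be evaluated directly. The sketch below uses made-up prior and
likelihood parameters together with synthetic data:

    import numpy as np

    rng = np.random.default_rng(8)
    D, N = 2, 50
    mu0 = np.zeros(D)
    Sigma0 = np.eye(D)                       # prior covariance
    Sigma = 0.5 * np.eye(D)                  # known likelihood covariance
    X = rng.multivariate_normal([1.0, -2.0], Sigma, N)

    mu_ml = X.mean(axis=0)
    Sigma0_inv, Sigma_inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma)

    SigmaN = np.linalg.inv(Sigma0_inv + N * Sigma_inv)           # posterior covariance
    muN = SigmaN @ (Sigma0_inv @ mu0 + N * Sigma_inv @ mu_ml)    # posterior mean

    print(muN)       # shrunk slightly from mu_ml towards mu0
    print(SigmaN)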

13. Assume that the convex hulls of {x_n} and {y_m} intersect. Then there exists a point z such that

   z = Σ_n α_n x_n = Σ_m β_m y_m                                           (15)

where α_n ≥ 0 for all n with Σ_n α_n = 1, and β_m ≥ 0 for all m with Σ_m β_m = 1. If {x_n} and {y_m}
were also linearly separable, we would have

   ŵ^T z + w_0 = Σ_n α_n ŵ^T x_n + w_0 = Σ_n α_n (ŵ^T x_n + w_0) > 0       (16)

since ŵ^T x_n + w_0 > 0 for every n and the {α_n} are all non-negative and sum to 1, but by the
corresponding argument

   ŵ^T z + w_0 = Σ_m β_m ŵ^T y_m + w_0 = Σ_m β_m (ŵ^T y_m + w_0) < 0       (17)

which is a contradiction, and hence {x_n} and {y_m} cannot be linearly separable if their convex hulls
intersect.
If we instead assume that {x_n} and {y_m} are linearly separable and consider a point z in the
intersection of their convex hulls, the same contradiction arises. Thus no such point can exist and
the intersection of the convex hulls of {x_n} and {y_m} must be empty.
14. If the values of the t_n were known, then each data point for which t_n = 1 would contribute
ln p(t_n = 1 | φ(x_n)) to the log likelihood, and each point for which t_n = 0 would contribute
ln(1 − p(t_n = 1 | φ(x_n))). A data point whose probability of having t_n = 1 is given by π_n will
therefore contribute

   π_n ln p(t_n = 1 | φ(x_n)) + (1 − π_n) ln(1 − p(t_n = 1 | φ(x_n)))      (18)

and so the overall log likelihood for the data set is given by

   Σ_{n=1}^N [ π_n ln p(t_n = 1 | φ(x_n)) + (1 − π_n) ln(1 − p(t_n = 1 | φ(x_n))) ]    (19)

This can also be viewed from a sampling perspective by imagining sampling the value of each t_n
some number M times, with probability of t_n = 1 given by π_n, then constructing the likelihood
function for this expanded data set and dividing by M. In the limit M → ∞ we recover (19).
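
A minimal sketch of evaluating (19): the arrays p (standing in for p(t_n = 1 | φ(x_n))) and pi (the
soft labels π_n) below are made up purely for illustration:

    import numpy as np

    # Hypothetical model outputs p_n = p(t_n = 1 | phi(x_n)) and soft labels pi_n.
    p = np.array([0.9, 0.2, 0.6, 0.75])
    pi = np.array([1.0, 0.0, 0.5, 0.8])

    # Expected log likelihood (19): a pi_n-weighted cross-entropy over the data set.
    log_lik = np.sum(pi * np.log(p) + (1 - pi) * np.log(1 - p))
    print(log_lik)
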
15. For the mixture model the joint distribution can be written

   p(x_a, x_b) = Σ_{k=1}^K π_k p(x_a, x_b | k)

We can find the conditional density p(x_b | x_a) by making use of the relation

   p(x_b | x_a) = p(x_a, x_b) / p(x_a)

For the mixture model the marginal density of x_a is given by

   p(x_a) = Σ_{k=1}^K π_k p(x_a | k)

where

   p(x_a | k) = ∫ p(x_a, x_b | k) dx_b

Thus we can write the conditional density in the form

   p(x_b | x_a) = Σ_{k=1}^K π_k p(x_a, x_b | k) / Σ_{j=1}^K π_j p(x_a | j)

Now we decompose the numerator using

   p(x_a, x_b | k) = p(x_b | x_a, k) p(x_a | k)

which allows us finally to write the conditional density as a mixture model of the form

   p(x_b | x_a) = Σ_{k=1}^K λ_k p(x_b | x_a, k)

where the mixture coefficients are given by

   λ_k = p(k | x_a) = π_k p(x_a | k) / Σ_j π_j p(x_a | j)
16. Since the expectation of a sum is the sum of the expectations, we have

   E[x] = Σ_{k=1}^K π_k E_k[x] = Σ_{k=1}^K π_k µ_k

where E_k[x] denotes the expectation of x under the distribution p(x | k). To find the covariance we
use the general relation

   cov[x] = E[x x^T] − E[x] E[x]^T

to give

   cov[x] = E[x x^T] − E[x] E[x]^T
          = Σ_{k=1}^K π_k E_k[x x^T] − E[x] E[x]^T
          = Σ_{k=1}^K π_k { Σ_k + µ_k µ_k^T } − E[x] E[x]^T
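
These moment formulas can be verified by sampling from an assumed two-component Gaussian mixture:

    import numpy as np

    rng = np.random.default_rng(9)
    n = 500_000

    pi = np.array([0.4, 0.6])
    mus = [np.array([0.0, 0.0]), np.array([3.0, -1.0])]
    Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

    # Sample from the mixture: pick a component for each point, then draw from it.
    k = rng.choice(2, size=n, p=pi)
    x = np.empty((n, 2))
    for j in range(2):
        idx = (k == j)
        x[idx] = rng.multivariate_normal(mus[j], Sigmas[j], size=idx.sum())

    mean_formula = sum(pi[j] * mus[j] for j in range(2))
    cov_formula = sum(pi[j] * (Sigmas[j] + np.outer(mus[j], mus[j])) for j in range(2)) \
                  - np.outer(mean_formula, mean_formula)

    print(x.mean(axis=0), mean_formula)   # should agree closely
    print(np.cov(x.T), cov_formula)
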
17. Since y = Ax + b,

   p(y|x) = δ(y − Ax − b)

i.e. a delta function at Ax + b. From the sum and product rules, we have

   p(y) = ∫ p(y, x) dx = ∫ p(y|x) p(x) dx = ∫ δ(y − Ax − b) p(x) dx

When M = D and A is assumed to have full rank, we have

   x = A^{-1}(y − b)

and thus

   p(y) = N(A^{-1}(y − b) | µ, Σ) = N(y | Aµ + b, A Σ A^T)

When M > D, y will be strictly confined to a D-dimensional subspace and hence p(y) will be
singular. In this case we have

   x = A^{-L}(y − b)

where A^{-L} is the left inverse of A, and thus

   p(y) = N(A^{-L}(y − b) | µ, Σ) = N(y | Aµ + b, ((A^{-L})^T Σ^{-1} A^{-L})^{-1})

The covariance matrix on the last line cannot be computed, but we can still compute p(y) by using
the corresponding precision matrix and constraining the density to be zero outside the column space
of A:

   p(y) = N(y | Aµ + b, ((A^{-L})^T Σ^{-1} A^{-L})^{-1}) δ(y − A A^{-L}(y − b) − b)

Finally, when M < D, we can make use of the general linear-Gaussian result

   p(x) = N(x | µ, Λ^{-1})
   p(y|x) = N(y | Ax + b, L^{-1})
   p(y) = N(y | Aµ + b, L^{-1} + A Λ^{-1} A^T)

and set L^{-1} = 0 in p(y|x) = N(y | Ax + b, L^{-1}). While this means that p(y|x) is singular, the
marginal distribution p(y), given by

   p(y) = N(y | Aµ + b, L^{-1} + A Λ^{-1} A^T)

is non-singular as long as A and Λ^{-1} are assumed to be of full rank.
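
A Monte Carlo check of the M = D case, p(y) = N(y | Aµ + b, AΣA^T), with made-up values for A, b,
µ and Σ:

    import numpy as np

    rng = np.random.default_rng(10)
    n = 500_000

    mu = np.array([1.0, -1.0])
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    A = np.array([[2.0, 0.5], [0.0, 1.0]])      # full rank, M = D = 2
    b = np.array([0.5, -2.0])

    x = rng.multivariate_normal(mu, Sigma, n)
    y = x @ A.T + b                             # y = Ax + b for each sample

    print(y.mean(axis=0), A @ mu + b)           # should agree closely
    print(np.cov(y.T))
    print(A @ Sigma @ A.T)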
