Solutions For Practice Set
1. Let us denote apples, oranges and limes by a, o and l respectively. The marginal probability of selecting an apple is obtained by summing over the boxes,
$$p(a) = \sum_{k} p(a \mid k)\, p(k),$$
where the sum runs over the boxes k and the conditional probabilities p(a | k) are obtained from the proportions of apples in each box.
To find the probability that the box was green, given that the fruit we selected was an orange, we can use Bayes' theorem,
$$p(g \mid o) = \frac{p(o \mid g)\, p(g)}{p(o)}. \qquad (1)$$
The denominator in (1) is given by the corresponding marginal,
$$p(o) = \sum_{k} p(o \mid k)\, p(k),$$
where the sum again runs over the boxes.
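As a concrete illustration of these two computations, here is a minimal Python sketch. The box labels r, b, g and all the probabilities and proportions below are made-up numbers chosen only for illustration; they are not taken from the practice set.

# Hypothetical values for illustration only
p_box = {'r': 0.2, 'b': 0.2, 'g': 0.6}              # p(box), assumed
frac_apple  = {'r': 0.3, 'b': 0.5, 'g': 0.3}         # p(a | box), assumed proportions
frac_orange = {'r': 0.4, 'b': 0.3, 'g': 0.3}         # p(o | box), assumed proportions

# Marginal probability of picking an apple: p(a) = sum_box p(a | box) p(box)
p_a = sum(frac_apple[k] * p_box[k] for k in p_box)

# Bayes' theorem, equation (1): p(g | o) = p(o | g) p(g) / p(o)
p_o = sum(frac_orange[k] * p_box[k] for k in p_box)
p_g_given_o = frac_orange['g'] * p_box['g'] / p_o
print(p_a, p_g_given_o)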
2. Recalling that cov[x, y] = E[xy] − E[x]E[y], and using
$$E[f] = \sum_x p(x)\, f(x), \qquad (3)$$
together with the fact that p(x, y) = p(x)p(y) when x and y are independent, we obtain
$$E[xy] = \sum_x \sum_y p(x, y)\, xy = \sum_x p(x)\, x \sum_y p(y)\, y = E[x]\, E[y]$$
and hence cov[x, y] = 0. The case where x and y are continuous variables is analogous, with (3) replaced by (4) and the sums replaced by integrals,
$$E[f] = \int p(x)\, f(x)\, \mathrm{d}x. \qquad (4)$$
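A quick Monte Carlo check of this result (a sketch assuming NumPy; the Poisson and exponential distributions are arbitrary choices of independent variables):

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=1_000_000)        # any distribution works; Poisson chosen arbitrarily
y = rng.exponential(2.0, size=1_000_000)    # drawn independently of x

print(np.mean(x * y), np.mean(x) * np.mean(y))  # E[xy] vs E[x]E[y]: nearly equal
print(np.cov(x, y)[0, 1])                        # sample covariance: close to 0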
3. Since x and z are independent, their joint distribution factorizes, p(x, z) = p(x)p(z), and so
$$E[x + z] = \int\!\!\int (x + z)\, p(x)\, p(z)\, \mathrm{d}x\, \mathrm{d}z = \int x\, p(x)\, \mathrm{d}x + \int z\, p(z)\, \mathrm{d}z = E[x] + E[z].$$
Similarly for the variances, we first note that
$$\bigl(x + z - E[x + z]\bigr)^2 = \bigl(x - E[x]\bigr)^2 + \bigl(z - E[z]\bigr)^2 + 2\bigl(x - E[x]\bigr)\bigl(z - E[z]\bigr),$$
where the final term will integrate to zero with respect to the factorized distribution p(x)p(z). Hence
$$\operatorname{var}[x + z] = \operatorname{var}[x] + \operatorname{var}[z].$$
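The same kind of numerical check applies here (again a sketch with arbitrarily chosen independent distributions):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=1_000_000)
z = rng.normal(5.0, 3.0, size=1_000_000)        # independent of x
print(np.mean(x + z), np.mean(x) + np.mean(z))  # E[x+z] vs E[x]+E[z]
print(np.var(x + z), np.var(x) + np.var(z))     # var[x+z] vs var[x]+var[z]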
4. We have
$$\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2. \qquad (5)$$
Now,
$$\begin{aligned}
E_{\{x_n\}}\bigl[\sigma^2_{\mathrm{ML}}\bigr] &= E_{\{x_n\}}\!\left[\frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2\right] \\
&= \frac{1}{N} \sum_{n=1}^{N} E_{\{x_n\}}\bigl[x_n^2 - 2 x_n \mu + \mu^2\bigr] \\
&= \frac{1}{N} \sum_{n=1}^{N} \bigl(\mu^2 + \sigma^2 - 2\mu\mu + \mu^2\bigr) \\
&= \sigma^2,
\end{aligned}$$
where we have used E[x_n^2] = µ^2 + σ^2 and E[x_n] = µ.
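The unbiasedness of (5) when the true mean µ is used can be checked by simulation; the Gaussian parameters below are arbitrary:

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, N, trials = 1.5, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(trials, N))
sigma2_ml = np.mean((x - mu) ** 2, axis=1)   # (1/N) sum_n (x_n - mu)^2, one value per trial
print(np.mean(sigma2_ml), sigma ** 2)        # the average over trials approaches sigma^2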
5. We first note that
$$E_y\bigl[E_x[x \mid y]\bigr] = \int\!\left\{\int x\, p(x \mid y)\, \mathrm{d}x\right\} p(y)\, \mathrm{d}y = \int\!\!\int x\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y = E_x[x]. \qquad (6)$$
For the variances we then have
$$\begin{aligned}
E_y\bigl[\operatorname{var}_x[x \mid y]\bigr] + \operatorname{var}_y\bigl[E_x[x \mid y]\bigr]
&= E_y\bigl[E_x[x^2 \mid y] - E_x[x \mid y]^2\bigr] + E_y\bigl[E_x[x \mid y]^2\bigr] - E_y\bigl[E_x[x \mid y]\bigr]^2 \\
&= E_x[x^2] - E_x[x]^2 \\
&= \operatorname{var}_x[x],
\end{aligned}$$
where we have made use of $E_y\bigl[E_x[x^2 \mid y]\bigr] = E_x[x^2]$, which can be proved by analogy with (6).
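A numerical sanity check of this decomposition, using an assumed hierarchical construction in which y is gamma-distributed and x | y is Gaussian with mean 2y and variance y (so E_x[x | y] = 2y and var_x[x | y] = y):

import numpy as np

rng = np.random.default_rng(3)
M = 1_000_000
y = rng.gamma(2.0, 1.0, size=M)                    # arbitrary distribution for y
x = rng.normal(loc=2.0 * y, scale=np.sqrt(y))      # x | y ~ N(2y, y)

lhs = np.mean(y) + np.var(2.0 * y)                 # E_y[var_x[x|y]] + var_y[E_x[x|y]]
print(lhs, np.var(x))                              # agrees with var_x[x] up to Monte Carlo error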
6. The normalization of the uniform distribution is proved trivially:
$$\int_a^b \frac{1}{b - a}\, \mathrm{d}x = \frac{b - a}{b - a} = 1.$$
Its first and second moments are
$$E[x] = \int_a^b \frac{x}{b - a}\, \mathrm{d}x = \frac{a + b}{2}, \qquad E[x^2] = \int_a^b \frac{x^2}{b - a}\, \mathrm{d}x = \frac{a^2 + ab + b^2}{3},$$
so that
$$\operatorname{var}[x] = E[x^2] - E[x]^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12}.$$
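A short simulation confirming the mean and variance of the uniform distribution (the interval endpoints are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
a, b = -1.0, 3.0                        # arbitrary interval endpoints
x = rng.uniform(a, b, size=1_000_000)
print(np.mean(x), (a + b) / 2)          # mean vs (a+b)/2
print(np.var(x), (b - a) ** 2 / 12)     # variance vs (b-a)^2/12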
7. Consider a matrix M which is symmetric, so that M^T = M. The inverse matrix M^{-1} satisfies
$$M M^{-1} = I.$$
Taking the transpose of both sides of this equation, we have
$$(M^{-1})^T M^T = I^T = I,$$
since the identity matrix is symmetric. Making use of the symmetry condition for M we then have
$$(M^{-1})^T M = I$$
and hence, post-multiplying both sides by M^{-1},
$$(M^{-1})^T = M^{-1},$$
so that the inverse of a symmetric matrix is itself symmetric.
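This is easy to confirm numerically for a random symmetric matrix (a sketch assuming NumPy):

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
M = A + A.T                              # a random symmetric matrix
M_inv = np.linalg.inv(M)
print(np.allclose(M_inv, M_inv.T))       # True: the inverse is symmetric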
8. Let u_i be an eigenvector of Σ with eigenvalue λ_i, so that
$$\Sigma u_i = \lambda_i u_i. \qquad (9)$$
We start by pre-multiplying both sides of (9) by u_i^†, the conjugate transpose of u_i. This gives us
$$u_i^\dagger \Sigma u_i = \lambda_i\, u_i^\dagger u_i. \qquad (10)$$
Next consider the conjugate transpose of (9) and post-multiply it by u_i, which gives us
$$u_i^\dagger \Sigma^\dagger u_i = \lambda_i^*\, u_i^\dagger u_i, \qquad (11)$$
where λ_i^* is the complex conjugate of λ_i. We now subtract (10) from (11) and use the fact that Σ is real and symmetric, and hence Σ = Σ^†, to get
$$0 = (\lambda_i^* - \lambda_i)\, u_i^\dagger u_i.$$
Since u_i^† u_i ≠ 0, it follows that λ_i^* = λ_i, and so the eigenvalues are real.
Now consider two eigenvectors u_i and u_j with i ≠ j. If λ_i ≠ λ_j, pre-multiplying (9) by u_j^T and comparing with the transposed eigenvector equation for u_j (using Σ = Σ^T) gives (λ_i − λ_j) u_j^T u_i = 0, so that u_i and u_j are orthogonal and can be normalized to unit length. If λ_i = λ_j ≠ 0, any linear combination of u_i and u_j is again an eigenvector with the same eigenvalue, and so we can choose two such combinations u_α and u_β such that u_α and u_β are mutually orthogonal and of unit length. Since u_i and u_j are orthogonal to u_k (k ≠ i, k ≠ j), so are u_α and u_β. The eigenvectors can therefore be chosen to satisfy
$$u_i^T u_j = I_{ij}. \qquad (12)$$
Finally, if λ_i = 0, Σ must be singular, with u_i lying in the null space of Σ. In this case, u_i will be orthogonal to the eigenvectors projecting onto the row space of Σ and we can choose ‖u_i‖ = 1, so that (12) is satisfied. If more than one eigenvalue equals zero, we can choose the corresponding eigenvectors arbitrarily, as long as they remain in the null space of Σ, and so we can choose them to satisfy (12).
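The reality of the eigenvalues and the orthonormality condition (12) can be checked numerically for a random real symmetric matrix:

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T                                      # real, symmetric matrix
lam, U = np.linalg.eigh(Sigma)                       # eigendecomposition for symmetric matrices
print(lam)                                           # eigenvalues come out real
print(np.allclose(U.T @ U, np.eye(5)))               # eigenvectors orthonormal, eq. (12): True
print(np.allclose(Sigma, U @ np.diag(lam) @ U.T))    # Sigma = U Lambda U^T: True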
9. The covariance matrix Σ can be expressed as an expansion in terms of its eigenvectors in the form
$$\Sigma = \sum_{i=1}^{D} \lambda_i\, u_i u_i^T. \qquad (13)$$
Equivalently, collecting the eigenvectors into a matrix U with orthonormal columns and writing Λ = diag(λ_1, …, λ_D), a symmetric matrix M = UΛU^T satisfies
$$U^T M U = U^T U \Lambda U^T U = \Lambda.$$
An arbitrary vector a can be expanded in the eigenvector basis as
$$a = \hat{a}_1 u_1 + \cdots + \hat{a}_D u_D,$$
where â_1, …, â_D are coefficients obtained by projecting a on u_1, …, u_D. Note that they typically do not equal the elements of a. Using this we can write
$$a^T \Sigma a = \hat{a}_1^2 \lambda_1 + \cdots + \hat{a}_D^2 \lambda_D,$$
and since a is real, we see that this expression will be strictly positive for any nonzero a if all the eigenvalues are strictly positive. It is also clear that if an eigenvalue, λ_i, is zero or negative, there exists a vector a (e.g. a = u_i) for which this expression will be less than or equal to zero. Thus, the requirement that a matrix have eigenvalues which are all strictly positive is a necessary and sufficient condition for the matrix to be positive definite.
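The identity a^T Σ a = Σ_i â_i^2 λ_i, and hence the positive-definiteness criterion, can be verified numerically (the matrix and vector below are random):

import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 1e-3 * np.eye(4)              # symmetric with strictly positive eigenvalues
lam, U = np.linalg.eigh(Sigma)
a = rng.standard_normal(4)
a_hat = U.T @ a                                  # coefficients of a in the eigenvector basis
print(a @ Sigma @ a, np.sum(a_hat**2 * lam))     # the two expressions agree
print(np.all(lam > 0), a @ Sigma @ a > 0)        # positive eigenvalues -> a^T Sigma a > 0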
11. From y = x + z we have trivially that E[y] = E[x] + E[z]. For the covariance we have
$$\begin{aligned}
\operatorname{cov}[y] &= E\Bigl[\bigl(x - E[x] + z - E[z]\bigr)\bigl(x - E[x] + z - E[z]\bigr)^T\Bigr] \\
&= E\Bigl[\bigl(x - E[x]\bigr)\bigl(x - E[x]\bigr)^T\Bigr] + E\Bigl[\bigl(z - E[z]\bigr)\bigl(z - E[z]\bigr)^T\Bigr] \\
&\quad + E\Bigl[\bigl(x - E[x]\bigr)\bigl(z - E[z]\bigr)^T\Bigr] + E\Bigl[\bigl(z - E[z]\bigr)\bigl(x - E[x]\bigr)^T\Bigr] \\
&= \operatorname{cov}[x] + \operatorname{cov}[z],
\end{aligned}$$
where we have used the independence of x and z, together with E[(x − E[x])] = E[(z − E[z])] = 0, to set the third and fourth terms in the expansion to zero. For one-dimensional variables the covariances become variances and we obtain the result of problem 3 as a special case.
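A Monte Carlo check of cov[y] = cov[x] + cov[z] for independent random vectors (the means and covariances below are arbitrary):

import numpy as np

rng = np.random.default_rng(8)
M = 1_000_000
x = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.5], [0.5, 1.0]], size=M)
z = rng.multivariate_normal([3.0, -1.0], [[1.0, -0.3], [-0.3, 0.5]], size=M)  # independent of x
y = x + z
print(np.cov(y, rowvar=False))
print(np.cov(x, rowvar=False) + np.cov(z, rowvar=False))  # approximately equal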
12. The posterior distribution is proportional to the product of the prior and the likelihood function,
$$p(\mu \mid X) \propto p(\mu) \prod_{n=1}^{N} p(x_n \mid \mu, \Sigma).$$
Taking the logarithm and gathering the terms that are quadratic and linear in µ, we have
$$\ln p(\mu \mid X) = -\frac{1}{2}\, \mu^T \bigl(\Sigma_0^{-1} + N \Sigma^{-1}\bigr) \mu + \mu^T \Bigl(\Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n\Bigr) + \text{const.},$$
where const. denotes terms independent of µ. Completing the square, we see that the mean and covariance of the posterior distribution are given by
$$\mu_N = \bigl(\Sigma_0^{-1} + N \Sigma^{-1}\bigr)^{-1} \bigl(\Sigma_0^{-1} \mu_0 + N \Sigma^{-1} \mu_{\mathrm{ML}}\bigr)$$
$$\Sigma_N^{-1} = \Sigma_0^{-1} + N \Sigma^{-1},$$
where µ_ML = (1/N) Σ_{n=1}^N x_n is the maximum likelihood estimate of the mean.
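The posterior update can be implemented directly; the sketch below uses arbitrary synthetic data and illustrates that for large N the posterior mean µ_N approaches the maximum likelihood estimate µ_ML:

import numpy as np

rng = np.random.default_rng(9)
D, N = 2, 500
mu_true = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])       # known likelihood covariance
mu0, Sigma0 = np.zeros(D), 10.0 * np.eye(D)      # prior over mu
X = rng.multivariate_normal(mu_true, Sigma, size=N)
mu_ml = X.mean(axis=0)

Sigma_inv, Sigma0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
SigmaN = np.linalg.inv(Sigma0_inv + N * Sigma_inv)
muN = SigmaN @ (Sigma0_inv @ mu0 + N * Sigma_inv @ mu_ml)
print(muN, mu_ml)                                 # posterior mean close to the ML estimate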
13. Assume that the convex hulls of {x_n} and {y_m} intersect. Then there exists a point z such that
$$z = \sum_n \alpha_n x_n = \sum_m \beta_m y_m, \qquad (15)$$
where α_n ≥ 0 for all n with Σ_n α_n = 1, and β_m ≥ 0 for all m with Σ_m β_m = 1. If {x_n} and {y_m} also were to be linearly separable, we would have that
$$\widehat{w}^T z + w_0 = \widehat{w}^T \Bigl(\sum_n \alpha_n x_n\Bigr) + w_0 = \sum_n \alpha_n \bigl(\widehat{w}^T x_n + w_0\bigr) > 0, \qquad (16)$$
since $\widehat{w}^T x_n + w_0 > 0$ for each n and the {α_n} are all non-negative and sum to 1, but by the corresponding argument
$$\widehat{w}^T z + w_0 = \widehat{w}^T \Bigl(\sum_m \beta_m y_m\Bigr) + w_0 = \sum_m \beta_m \bigl(\widehat{w}^T y_m + w_0\bigr) < 0, \qquad (17)$$
which is a contradiction, and hence {x_n} and {y_m} cannot be linearly separable if their convex hulls intersect.
If we instead assume that {x_n} and {y_m} are linearly separable and consider a point z in the intersection of their convex hulls, the same contradiction arises. Thus no such point can exist and the intersection of the convex hulls of {x_n} and {y_m} must be empty.
14. If the values of the t_n were known, then each data point for which t_n = 1 would contribute ln p(t_n = 1 | φ(x_n)) to the log likelihood, and each point for which t_n = 0 would contribute ln(1 − p(t_n = 1 | φ(x_n))). A data point whose probability of having t_n = 1 is given by π_n will therefore contribute
$$\pi_n \ln p\bigl(t_n = 1 \mid \phi(x_n)\bigr) + (1 - \pi_n) \ln\bigl(1 - p(t_n = 1 \mid \phi(x_n))\bigr) \qquad (18)$$
and so the overall log likelihood for the data set is given by
$$\sum_{n=1}^{N} \Bigl[\pi_n \ln p\bigl(t_n = 1 \mid \phi(x_n)\bigr) + (1 - \pi_n) \ln\bigl(1 - p(t_n = 1 \mid \phi(x_n))\bigr)\Bigr]. \qquad (19)$$
This can also be viewed from a sampling perspective by imagining sampling the value of each t_n some number M times, with the probability of t_n = 1 given by π_n, and then constructing the likelihood function for this expanded data set and dividing by M. In the limit M → ∞ we recover (19).
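Evaluating (19) is a one-line computation. In the sketch below the predicted probabilities p(t_n = 1 | φ(x_n)) and the soft labels π_n are made-up numbers used purely for illustration:

import numpy as np

p = np.array([0.9, 0.2, 0.7, 0.5])    # hypothetical p(t_n = 1 | phi(x_n))
pi = np.array([1.0, 0.0, 0.8, 0.3])   # hypothetical soft labels pi_n
log_likelihood = np.sum(pi * np.log(p) + (1 - pi) * np.log(1 - p))   # equation (19)
print(log_likelihood)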
15. For the mixture model the joint distribution can be written
$$p(x_a, x_b) = \sum_{k=1}^{K} \pi_k\, p(x_a, x_b \mid k).$$
We can find the conditional density p(x_b | x_a) by making use of the relation
$$p(x_b \mid x_a) = \frac{p(x_a, x_b)}{p(x_a)},$$
where the marginal is itself a mixture,
$$p(x_a) = \sum_{k=1}^{K} \pi_k\, p(x_a \mid k), \qquad p(x_a \mid k) = \int p(x_a, x_b \mid k)\, \mathrm{d}x_b,$$
which allows us finally to write the conditional density as a mixture model of the form
$$p(x_b \mid x_a) = \sum_{k=1}^{K} \lambda_k\, p(x_b \mid x_a, k), \qquad \lambda_k = p(k \mid x_a) = \frac{\pi_k\, p(x_a \mid k)}{\sum_j \pi_j\, p(x_a \mid j)}.$$
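A small numerical check of this result, for an assumed two-component mixture whose components factorize over (x_a, x_b) so that p(x_b | x_a, k) = p(x_b | k); all parameter values are arbitrary:

import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

pi = np.array([0.3, 0.7])
mu_a, var_a = np.array([0.0, 2.0]), np.array([1.0, 0.5])
mu_b, var_b = np.array([-1.0, 1.0]), np.array([0.4, 2.0])

xa, xb = 1.3, 0.2
joint = np.sum(pi * norm_pdf(xa, mu_a, var_a) * norm_pdf(xb, mu_b, var_b))   # p(x_a, x_b)
p_xa = np.sum(pi * norm_pdf(xa, mu_a, var_a))                                # p(x_a)
lam = pi * norm_pdf(xa, mu_a, var_a) / p_xa                                  # lambda_k = p(k | x_a)
mixture_of_conditionals = np.sum(lam * norm_pdf(xb, mu_b, var_b))            # sum_k lambda_k p(x_b | x_a, k)
print(joint / p_xa, mixture_of_conditionals)                                 # the two values agree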
16. Since the expectation of a sum is the sum of the expectations, we have
$$E[x] = \sum_{k=1}^{K} \pi_k\, E_k[x] = \sum_{k=1}^{K} \pi_k\, \mu_k,$$
where E_k[x] denotes the expectation of x under the distribution p(x | k). To find the covariance we use the general relation
$$\operatorname{cov}[x] = E[x x^T] - E[x] E[x]^T$$
to give
$$\begin{aligned}
\operatorname{cov}[x] &= E[x x^T] - E[x] E[x]^T \\
&= \sum_{k=1}^{K} \pi_k\, E_k[x x^T] - E[x] E[x]^T \\
&= \sum_{k=1}^{K} \pi_k \bigl\{\Sigma_k + \mu_k \mu_k^T\bigr\} - E[x] E[x]^T,
\end{aligned}$$
where we have used E_k[x x^T] = Σ_k + µ_k µ_k^T.
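These moment formulas can be compared against Monte Carlo estimates for an assumed two-component Gaussian mixture (the parameters below are arbitrary):

import numpy as np

rng = np.random.default_rng(10)
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, -1.0]])
Sig = np.array([[[1.0, 0.2], [0.2, 0.5]], [[0.8, -0.3], [-0.3, 1.5]]])

# Analytic mixture moments from the solution above
Ex = np.sum(pi[:, None] * mu, axis=0)
cov = sum(pi[k] * (Sig[k] + np.outer(mu[k], mu[k])) for k in range(2)) - np.outer(Ex, Ex)

# Monte Carlo check
M = 500_000
counts = rng.multinomial(M, pi)
samples = np.vstack([rng.multivariate_normal(mu[j], Sig[j], size=counts[j]) for j in range(2)])
print(Ex, samples.mean(axis=0))
print(cov)
print(np.cov(samples, rowvar=False))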
17. Since y = Ax + b, we have
$$p(y \mid x) = \delta(y - A x - b),$$
i.e. a delta function at Ax + b. From the sum and product rules, we have
$$p(y) = \int p(y, x)\, \mathrm{d}x = \int p(y \mid x)\, p(x)\, \mathrm{d}x = \int \delta(y - A x - b)\, p(x)\, \mathrm{d}x.$$
When M = D and A is assumed to have full rank, we have x = A^{-1}(y − b), and the integral over the delta function contributes a Jacobian factor |det A|^{-1}, so that
$$p(y) = |\det A|^{-1}\, \mathcal{N}\bigl(A^{-1}(y - b) \mid \mu, \Sigma\bigr) = \mathcal{N}\bigl(y \mid A\mu + b,\; A \Sigma A^T\bigr).$$
When M > D, y will be strictly confined to a D-dimensional subspace and hence p(y) will be singular. In this case we have
$$x = A^{-L}(y - b),$$
where A^{-L} denotes a left inverse of A, satisfying A^{-L} A = I, and thus
$$p(y) = \mathcal{N}\bigl(A^{-L}(y - b) \mid \mu, \Sigma\bigr) = \mathcal{N}\Bigl(y \,\Big|\, A\mu + b,\; \bigl((A^{-L})^T \Sigma^{-1} A^{-L}\bigr)^{-1}\Bigr).$$
The covariance matrix on the last line cannot be computed, since (A^{-L})^T Σ^{-1} A^{-L} is an M × M matrix of rank D and hence singular, but we can still express p(y) by using the corresponding precision matrix and constraining the density to be zero outside the column space of A:
$$p(y) = \mathcal{N}\Bigl(y \,\Big|\, A\mu + b,\; \bigl((A^{-L})^T \Sigma^{-1} A^{-L}\bigr)^{-1}\Bigr)\, \delta\bigl(y - A A^{-L}(y - b) - b\bigr).$$
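For the M = D case, the result p(y) = N(y | Aµ + b, AΣA^T) can be confirmed by sampling (the parameters below are arbitrary):

import numpy as np

rng = np.random.default_rng(11)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
A = np.array([[2.0, 0.5], [0.0, 1.0]])      # square, full rank (the M = D case)
b = np.array([0.5, 3.0])

x = rng.multivariate_normal(mu, Sigma, size=1_000_000)
y = x @ A.T + b                              # y_i = A x_i + b for each sample
print(y.mean(axis=0), A @ mu + b)            # mean: A mu + b
print(np.cov(y, rowvar=False))
print(A @ Sigma @ A.T)                       # covariance: A Sigma A^T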