PRML Solution Manual
PRML Solution Manual
E DITED B Y
ZHENGQI GAO
0.1 Introduction
=>
∂E N
{ y(xn , w) − t n } xni = 0
X
=
∂w i n=1
=>
N N
y(xn , w)xni = xni t n
X X
n=1 n=1
=>
N XM N
j
( w j xn )xni = xni t n
X X
n=1 j =0 n=1
=>
N X
M N
( j+ i)
xni t n
X X
w j xn =
n=1 j =0 n=1
=>
M X
N N
( j+ i)
xni t n
X X
xn wj =
j =0 n=1 n=1
PN i+ j PN
If we denote A i j = n=1 xn and T i = n=1 xn i t n , the equation above can
be written exactly as (1.222), Therefore the problem is solved.
This problem is similar to Prob.1.1, and the only difference is the last
term on the right side of (1.4), the penalty term. So we will do the same thing
as in Prob.1.1 :
=>
∂E N
{ y(xn , w) − t n } xni + λw i = 0
X
=
∂w i n=1
=>
M X
N N
( j+ i)
xni t n
X X
xn w j + λw i =
j =0 n=1 n=1
=>
M X N N
( j+ i)
xni t n
X X
{ xn + δ ji λ}w j =
j =0 n=1 n=1
2
where
(
0 j 6= i
δ ji
1 j=i
Problem 1.3 Solution
P(o| g)P(g)
P(g| o) =
P(o)
We calculate the probability of selecting an orange P(o) first :
4 1 3
P(o) = P(o| r)P(r) + P(o| b)P(b) + P(o| g)P(g) = × 0.2 + × 0.2 + × 0.6 = 0.36
10 2 10
Therefore we can get :
3
P(o| g)P(g) 10 × 0.6
P(g| o) = = = 0.5
P(o) 0.36
Problem 1.4 Solution
d p x (x) ¯¯
=0
dx x̂
Therefore, when y = ŷ, s.t. x̂ = g( ŷ), the first term on the right side of (∗∗)
will be 0, leading the first term in (∗) equals to 0, however because of the
existence of the second term in (∗), the derivative may not equal to 0. But
3
when linear transformation is applied, the second term in (∗) will vanish,
(e.g. x = a y + b). A simple example can be shown by :
x̂ 6= sin( ŷ)
Therefore:
Z Z Z Z
x yp(x, y) dx d y = x yp x (x)p y (y) dx d y
Z Z
= ( xp x (x) dx)( yp y (y) d y)
=> E x,y [x y] = E[x]E[y]
Here we utilize :
x = r cos θ , y = r sin θ
Based on the fact :
r2 r 2 ¯+∞
Z +∞
exp(− 2 )r dr = −σ2 exp(− 2 )¯0 = −σ2 (0 − 1) = σ2
0 2σ 2σ
Z +∞ Z
1+∞ 1
N (x¯µ, σ2 ) dx exp{− 2 (x − µ)2 } dx
¯
= p
−∞ −∞ 2πσ 2 2σ
Z +∞
1 1
= p exp{− 2 y2 } d y ( y = x − µ )
−∞ 2πσ2 2σ
Z +∞
1 1
= p exp{− 2 y2 } d y
2πσ2 −∞ 2σ
= 1
The second problem has already be given hint in the description. Given
that :
d( f g) dg df
= f +g
dx dx dx
We differentiate both side of (1.127) with respect to σ2 , we will obtain :
+∞ 1 (x − µ)2
Z
)N (x¯µ, σ2 ) dx = 0
¯
(− +
−∞ 2σ 2 2σ 4
5
So the equation above has actually proven (1.51), according to the defini-
tion: Z +∞
(x − E[x])2 N (x¯µ, σ2 ) dx
¯
var[x] =
−∞
Where E[x] = µ has already been proved. Therefore :
var[x] = σ2
Finally,
E[x2 ] = var[x] + E[x]2 = σ2 + µ2
Problem 1.9 Solution
Here we only focus on (1.52), because (1.52) is the general form of (1.42).
Based on the definition : The maximum of distribution is known as its mode
and (1.52), we can obtain :
∂N (x¯µ, Σ)
¯
1
= − [Σ−1 + (Σ−1 )T ] (x − µ)N (x¯µ, Σ)
¯
∂x 2
= −Σ−1 (x − µ)N (x¯µ, Σ)
¯
∂xT Ax
= (A + AT )x and (Σ−1 )T = Σ−1
∂x
Therefore,
∂N (x¯µ, Σ)
¯
only when x = µ, =0
∂x
Note: You may also need to calculate Hessian Matrix to prove that it is
maximum. However, here we find that the first derivative only has one root.
Based on the description in the problem, this point should be maximum point.
Problem 1.10 Solution
and independence.
Z Z
E[x + z] = (x + z)p(x, z) dx dz
Z Z
= (x + z)p(x)p(z) dx dz
Z Z Z Z
= xp(x)p(z) dx dz + z p(x)p(z) dx dz
Z Z Z Z
= ( p(z) dz)xp(x) dx + ( p(x) dx)z p(z) dz
Z Z
= xp(x) dx + z p(z) dz
= E[x] + E[z]
Z Z
var[x + z] = ( x + z − E[x + z] )2 p(x, z) dx dz
Z Z
= { ( x + z )2 − 2 (x + z) E[x + z]) + E2 [x + z] } p(x, z) dx dz
Z Z Z Z
2
= (x + z) p(x, z) dx dz − 2E[x + z] (x + z)p(x, z) dx dz + E2 [x + z]
Z Z
= (x + z)2 p(x, z) dx dz − E2 [x + z]
Z Z
= (x2 + 2xz + z2 ) p(x)p(z) dx dz − E2 [x + z]
Z Z Z Z Z Z
2
= ( p(z)dz) x p(x) dx + 2xz p(x)p(z) dx dz + ( p(x)dx) z2 p(z) dz − E2 [x + z]
Z Z
= E[x2 ] + E[z2 ] − E2 [x + z] + 2xz p(x)p(z) dx dz
Z Z
= E[x2 ] + E[z2 ] − (E[x] + E[z])2 + 2xz p(x)p(z) dx dz
Z Z
= E[x2 ] − E2 [x] + E[z2 ] − E2 [z] − 2E[x] E[z] + 2 xz p(x)p(z) dx dz
Z Z
= var[x] + var[z] − 2E[x] E[z] + 2( xp(x) dx)( z p(z) dz)
= var[x] + var[z]
Based on prior knowledge that µ ML and σ2ML will decouple. We will first
calculate µ ML :
d(ln p(x¯ µ, σ2 ))
¯
N
1 X
= 2 (xn − µ)
dµ σ n=1
We let :
d(ln p(x¯ µ, σ2 ))
¯
=0
dµ
7
Therefore :
1 XN
µ ML = xn
N n=1
And because:
d(ln p(x¯ µ, σ2 ))
¯
1 X N
= ( (xn − µ)2 − N σ2 )
d σ2 2σ4 n=1
We let :
d(ln p(x¯ µ, σ2 ))
¯
=0
d σ2
Therefore :
1 XN
σ2ML = (xn − µ ML )2
N n=1
Problem 1.12 Solution
For E[σ2ML ], we need to take advantage of (1.56) and what has been given
in the problem :
1 XN
E[σ2ML ] = E[ (xn − µ ML )2 ]
N n=1
1 X N
= E[ (xn − µ ML )2 ]
N n=1
1 X N
= E[ (x2 − 2xn µ ML + µ2ML )]
N n=1 n
1 X N 1 X N 1 X N
= E[ x2n ] − E[ 2xn µ ML ] + E[ µ2 ]
N n=1 N n=1 N n=1 ML
2 X N 1 XN
= µ2 + σ 2 − E[ xn ( xn )] + E[µ2ML ]
N n=1 N n=1
2 N N 1 XN
= µ2 + σ 2 − E E xn )2 ]
X X
[ x n ( x n )] + [(
N 2 n=1 n=1 N n=1
2 N 1 N
= µ2 + σ 2 − E 2
E xn )2 ]
X X
[( x n ) ] + [(
N 2 n=1 N 2 n=1
1 N
= µ2 + σ 2 − E xn )2 ]
X
[(
N 2 n=1
1
= µ2 + σ 2 − 2 [ N ( N µ2 + σ 2 ) ]
N
8
Therefore we have:
N −1 2
E[σ2ML ] = ( )σ
N
Problem 1.13 Solution
1 XN
E[σ2ML ] = E[ (xn − µ)2 ] (Because here we use µ to replace µ ML )
N n=1
1 X N
= E[ (xn − µ)2 ]
N n=1
1 X N
= E[ (x2 − 2xn µ + µ2 )]
N n=1 n
1 X N 1 X N 1 X N
= E[ x2n ] − E[ 2xn µ] + E[ µ2 ]
N n=1 N n=1 N n=1
2µ X N
= µ2 + σ 2 − E[ x n ] + µ2
N n=1
= µ2 + σ2 − 2µ2 + µ2
= σ2
Note: The biggest difference between Prob.1.12 and Prob.1.13 is that the
mean of Gaussian Distribution is known previously (in Prob.1.13) or not (in
Prob.1.12). In other words, the difference can be shown by the following equa-
tions:
This problem is quite similar to the fact that any function f (x) can be
written into the sum of an odd function and an even function. If we let:
w i j + w ji w i j − w ji
wSij = and w iAj =
2 2
It is obvious that they satisfy the constraints described in the problem,
which are :
w i j = wSij + w iAj , wSij = wSji , w iAj = −w Aji
9
D
D X D
D X
(wSij + w iAj )x i x j
X X
wi j xi x j =
i =1 j =1 i =1 j =1
D X
D D X
D
wSij x i x j + w iAj x i x j
X X
=
i =1 j =1 i =1 j =1
Therefore, we only need to prove that the second term equals to 0, and
here we use a simple trick: we will prove twice of the second term equals to 0
instead.
D X
D D X
D
w iAj x i x j ( w iAj + w iAj )x i x j
X X
2 =
i =1 j =1 i =1 j =1
D
D X
( w iAj − w Aji )x i x j
X
=
i =1 j =1
D X
D D X
D
w iAj x i x j − w Aji x i x j
X X
=
i =1 j =1 i =1 j =1
D X
D D X
D
w iAj x i x j − w Aji x j x i
X X
=
i =1 j =1 j =1 i =1
= 0
D(D + 1)
D + D − 1 + ... + 1 =
2
Note: You can view this intuitively by considering if the upper triangular
part of a symmetric matrix is given, the whole matrix will be determined.
Problem 1.15 Solution
This problem is a more general form of Prob.1.14, so the method can also
be used here: we will find a way to use w i 1 i 2 ...i M to represent w
e i 1 i 2 ...i M .
We begin by introducing a mapping function:
M
[ M
[
s.t. x ik = x jk , and x j1 ≥ x j2 ≥ x j3 ... ≥ x jM
k=1 k=1
10
F(x5 x2 x3 x2 ) = x5 x3 x2 x2
F(x1 x3 x3 x2 ) = x3 x3 x2 x1
F(x1 x4 x2 x3 ) = x4 x3 x2 x1
F(x1 x1 x5 x2 ) = x5 x2 x1 x1
After introducing F, the solution will be very simple, based on the fact
that F will not change the value of the term, but only rearrange it.
D X
X D D
X j1
D X
X jX
M −1
... w i 1 i 2 ...i M x i1 x i2 ...x iM = ... w
e j 1 j 2 ... j M x j1 x j2 ...x jM
i 1 =1 i 2 =1 i M =1 j 1 =1 j 2 =1 j M =1
X
where w
e j 1 j 2 ... j M = w
w∈Ω
Ω = {w i 1 i 2 ...i M ¯ F(x i1 x i2 ...x iM ) = x j1 x j2 ...x jM , ∀ x i1 x i2 ...x iM }
¯
DX i1
+1 X iX
M −1
... w
e i 1 i 2 ...i M x i1 x i2 ...x iM (∗)
i 1 =1 i 2 =1 i M =1
We divide (∗) into two parts based on the first summation: the first part
is made up of i i = 1, 2, ..., D and the second part i 1 = D + 1. After division, the
first part corresponds to n(D, M), and the second part corresponds to n(D +
1, M − 1). Therefore we obtain:
D
X
n(D, M) = n(i, M − 1)
i =1
11
M M −1
Secondly, we rewrite the first term on the left side to C M , because C M −1
=
M
CM = 1. In other words, we only need to prove:
M M −1 M −1 M
CM + CM + ... C D + M −2 = C D + M −1
r r r −1
Thirdly, we take advantage of the property : C N = CN −1
+ CN −1
. So we
can recursively combine the first term and the second term on the left side,
and it will ultimately equal to the right side.
(1.137) gives the mathematical form of n(D, M), and we need all the con-
clusions above to prove it.
Let’s give some intuitive concepts by illustrating M = 0, 1, 2. When M = 0,
(1.134) will consist of only a constant term, which means n(D, 0) = 1. When
M = 1,it is obvious n(D, 1) = D, because in this case (1.134) will only have D
terms if we expand it. When M = 2, it degenerates to Prob.1.14, so n(D, 2) =
D (D +1)
2 is also obvious. Suppose (1.137) holds for M − 1, let’s prove it will also
hold for M.
D
X
n(D, M) = n(i, M − 1) ( based on (1.135) )
i =1
D
C iM+−M1−2
X
= ( based on (1.137) holds for M − 1 )
i =1
M −1 M −1 M −1 M −1
= CM −1 + C M + CM +1 ... + C D + M −2
M M −1 M −1 M −1
= ( CM + CM ) + CM +1 ... + C D + M −2
M M −1 M −1
= ( CM +1 + C M +1 )... + C D + M −2
M M −1
= CM +2 ... + C D + M −2
...
M
= CD + M −1
This problem can be solved in the same way as the one in Prob.1.15.
Firstly, we should write the expression consisted of all the independent terms
up to Mth order corresponding to N(D, M). By adding a summation regard-
ing to M on the left side of (1.134), we obtain:
M X
X i1
D X iX
m−1
... w
e i 1 i 2 ...i m x i1 x i2 ...x im (∗)
m=0 i 1 =1 i 2 =1 i m =1
M
X D
+1 X i1
X iX
m−1
... w
e i 1 i 2 ...i m x i1 x i2 ...x im (∗∗)
m=0 i 1 =1 i 2 =1 i m =1
M
X
N(D, M + 1) = n(D, m) + n(D, M + 1)
m=0
M
X +1
= n(D, m)
m=0
To prove (1.139), we will also use the same technique in Prob.1.15 instead
of Mathematical Induction. We begin based on already proved (1.138):
M
X
N(D, M) = n(D, M)
m=0
13
M
m
X
N(D, M) = CD + m−1
m=0
= C 0D −1 + C 1D + C 2D +1 + ... + C D M
+ M −1
= ( C 0D + C 1D ) + C 2D +1 + ... + C DM
+ M −1
= M
( C 1D +1 + C 2D +1 ) + ... + C D + M −1
= ...
M
= CD +M
Here as asked by the problem, we will view the growing speed of N(D, M).
We should see that in n(D, M), D and M are symmetric, meaning that we only
need to prove when D À M, it will grow like D M , and then the situation of
M À D will be solved by symmetry.
(D + M)! (D + M)D + M
N(D, M) = ≈
D! M! DD MM
1 D+M D
= ( ) (D + M) M
MM D
1 M D
= M
[(1 + ) M ] M (D + M) M
M D
e M
≈ ( ) (D + M) M
M
eM M
= M
(1 + ) M D M
M D
eM M D M2
= M
[(1 + ) M ] D D M
M D
M2
e M+ D eM
≈ DM ≈ DM
MM MM
M2
Where we use Stirling’s approximation, lim (1 + n1 )n = e and e D ≈ e0 =
n→+∞
1. According to the description in the problem, When D À M, we can actually
eM M
view M M as a constant, so N(D, M) will grow like D in this case. And by
D
symmetry, N(D, M) will grow like M , when M À D.
Finally, we are asked to calculate N(10, 3) and N(100, 3):
3
N(10, 3) = C 13 = 286
3
N(100, 3) = C 103 = 176851
Problem 1.17 Solution
14
Z +∞
Γ(x + 1) = u x e−u du
0
Z +∞
= − u x de−u
0
¯+∞ Z +∞
= − u x e−u ¯ − e−u d(− u x )
¯
0 0
¯+∞ Z +∞
x −u ¯
= −u e ¯ +x e−u u x−1 du
0 0
¯+∞
x −u ¯
= −u e ¯ + x Γ(x)
0
Hence, we obtain:
D
D SD D 2π 2
π2 = Γ( ) => SD =
2 2 Γ( D2 )
15
S D has given the expression of the surface area with radius 1 in dimen-
sion D, we can further expand the conclusion: the surface area with radius r
in dimension D will equal to S D · r D −1 , and when r = 1, it will reduce to S D .
This conclusion is naive, if you find that the surface area of different sphere
in dimension D is proportion to the D − 1th power of radius, i.e. r D −1 . Con-
sidering the relationship between V and S of a sphere with arbitrary radius
in dimension D: dVdr = S, we can obtain :
SD D
Z Z
V = S dr = S D r D −1 dr = r
D
The equation above gives the expression of the volume of a sphere with
radius r in dimension D, so we let r = 1 :
SD
VD =
D
For D = 2 and D = 3 :
S2 1 2π
V2 = = · =π
2 2 Γ(1)
3 3
S3 1 2π 2 1 2π 2 4
V3 = = · 3 = · p = π
3 3 Γ( 2 ) 3 π 3
2
Problem 1.19 Solution
2 π D 1
(∗) = ·( ) 2 · D
D 4 Γ( 2 )
Hence, it is now quite obvious, all the three terms will converge to 0 when
D → +∞. Therefore their product will also converge to 0. The last problem is
quite simple :
p
center to one corner a2 · D p p
= = D and lim D = +∞
center to one side a D →+∞
d p(r) SD D −2 r2 r2
= r exp( − ) (D − 1 − ) (∗)
dr D
(2πσ2 ) 2 2σ 2 σ2
We let p
the derivative equal to 0, we will obtain its unique root( stationary
point) r̂ = D − 1 σ, because r ∈ [0, +∞]. When r < r̂, the derivative is large
than 0, p(r) will increase as r ↑, and when r > r̂, the derivative is less than 0,
p(r) will decrease as r ↑. Therefore
p r̂ will be the only maximum point. And it
is obvious when D À 1, r̂ ≈ D σ.
2
p( r̂ + ² ) ( r̂ + ²)D −1 exp(− (r̂2+σ²2) )
= 2
p( r̂) r̂ D −1 exp(− 2r̂σ2 )
² 2² r̂ + ²2
= (1 + )D −1 exp(− )
r̂ 2σ 2
2² r̂ + ²2 ²
= exp( − + (D − 1)ln(1 + ) )
2σ 2 r̂
2² r̂ + ²2 ² 2² r̂ + ²2 ² ²2
− + (D − 1) ln(1 + ) ≈ − + (D − 1) ( − )
2σ2 r̂ 2σ 2 r̂ 2 r̂ 2
2² r̂ + ²2 2 r̂ ² − ²2
= − +
2σ 2 2σ2
2
²
= − 2
σ
2
Therefore, p( r̂ + ²) = p( r̂) exp(− σ² 2 ). Note: Here I draw a different con-
clusion compared with (1.149), but I do not think there is any mistake in
my deduction.
Finally, we see from (1.147) :
¯ 1
p(x)¯ =
¯
D
x=0
(2πσ2 ) 2
17
¯ 1 r̂ 2 1 D
p(x)¯ = exp(− )≈ exp(− )
¯
||x||2 = r̂ 2
(2πσ2 ) 2
D
2σ 2 D
(2πσ2 ) 2 2
Problem 1.21 Solution
Recall that the decision rule which can minimize misclassification is that
if p(x, C 1 ) > p(x, C 2 ), for a given value of x, we will assign that x to class
C 1 . We can see that in decision area R 1 , it should satisfy p(x, C 1 ) > p(x, C 2 ).
Therefore, using what we have proved, we can obtain :
Z Z
1
p(x, C 2 ) dx ≤ { p(x, C 1 ) p(x, C 2 ) } 2 dx
R1 R1
Given a specific x, the first term on the right side is a constant, which
equals to 1, no matter which class C j we assign
¯ x to. Therefore if we want to
minimize the loss, we will maximize p(C j x). Hence, we will
¯
¯ assign x to class
C j , which can give the biggest posterior probability p(C j ¯x).
The explanation of the loss matrix is quite simple. If we label correctly,
there is no loss. Otherwise, we will incur a loss, in the same degree whichever
class we label it to. The loss matrix is given below to give you an intuitive
view:
0 1 1 ... 1
1 0 1 . . . 1
. . . .
. . . ..
. . . . ..
1 1 1 ... 0
Problem 1.23 Solution
18
XXZ XXZ
E[L] =
¯
L k j p(x, C k ) dx = L k j p(C k )p(x¯C k ) dx
k j Rj k j Rj
Therefore, the reject criterion from the view of λ above is actually equiv-
alent to the largest posterior probability is larger than 1 − λ :
X
min L kl p(C k | x) < λ <=> max p(C l | x) > 1 − λ
l k l
And from the view of θ and posterior probability, we label a class for x (i.e.
we do not reject) is given by the constrain :
Hence from the two different views, we can see that λ and θ are correlated
with:
λ+θ =1
We can prove this informally by dealing with one dimension once a time
just as the same process in (1.87) - (1.89) until all has been done, due to the
fact that the total loss E can be divided to the summation of loss on every
19
dimension, and what’s more they are independent. Here, we will use a more
informal way to prove this. In this case, the expected loss can be written :
Z Z
E[L] = {y(x) − t}2 p(x, t) dt dx
Where ² means the small neighborhood,U means the whole space x lies
in. Note that y(x) has no correlation with the first term, but the second term
(because how to choose y(x) will affect the location of ²). Hence we will put ²
at the location where p(t|x) achieve its largest value, i.e. the mode, because
in this way we can obtain the largest reduction. Therefore, it is natural we
choose y(x) equals to t that maximize p(t|x) for every x.
Moreover, the left side of (∗) and (∗∗) are equivalent by definition, so we
can obtain:
h(A) + h(B) = f (P(A) · P(B))
=> f (p(A)) + f (p(B)) = f (P(A) · P(B))
21
yN f (p) = f (p yN ) ≤ f (p x ) ≤ f (p z N ) = z N f (p)
This problem is a little bit tricky. The entropy for a M-state discrete ran-
dom variable x can be written as :
M
X
H[x] = − λ i ln(λ i )
i
M
X M
X
ln( λi xi ) ≥ λ i ln(x i )
i =1 i =1
1
We choose x i = λi
and simplify the equation above, we will obtain :
M
X
lnM ≥ − λ i ln(λ i ) = H[x]
i =1
Based on definition :
p(x) s 1 1
ln{ } = ln( ) − [ 2 (x − µ)2 − 2 (x − m)2 ]
q(x) σ 2σ 2s
s 1 1 µ m µ2 m2
= ln( ) − [( 2 − 2 )x2 − ( 2 − 2 )x + ( 2 − 2 )]
σ 2σ 2s σ s 2σ 2s
We will take advantage of the following equations to solve this problem.
Z
E[x2 ] = x2 N (x|µ, σ2 ) dx = µ2 + σ2
Z
E[x] = x N (x|µ, σ2 ) dx = µ
Z
N (x|µ, σ2 ) dx = 1
q(x)
Z
K L(p|| q) = − p(x)ln{ } dx
p(x)
p(x)
Z
= N (x|µ, σ) ln{ } dx
q(x)
s 1 1 µ m µ2 m2
= ln( ) − ( 2 − 2 )(µ2 + σ2 ) + ( 2 − 2 )µ − ( 2 − 2 )
σ 2σ 2s σ s 2σ 2s
s σ2 + (µ − m)2 1
= ln( ) + −
σ 2s2 2
23
1 1 (µ − m)2
K L(p|| q) = ln(x) + 2
− + a, where a =
2x 2 2s2
We calculate the derivative of K L with respect to x, and let it equal to 0:
d(K L) 1
= − x−3 = 0 => x = 1 ( ∵ s, σ > 0 )
dx x
When x < 1 the derivative is less than 0, and when x > 1, it is greater than
0, which makes x = 1 the global minimum. When x = 1, K L(p|| q) = a. What’s
more, when µ = m, a will achieve its minimum 0. In this way, we have shown
that the K L distance between two Gaussian Distributions is not less than 0,
and only when the two Gaussian Distributions are identical, i.e. having same
mean and variance, K L distance will equal to 0.
p(y) p(x)p(y)
=
p(y|x) p(x, y)
Moreover, it is straightforward that if and only if x and y is statistically
independent, the equality holds, due to the property of KL distance. You can
also view this result by :
Z Z
H[x, y] = − p(x, y) lnp(x, y) dx dy
Z Z Z Z
= − p(x, y) lnp(x) dx dy − p(x, y) lnp(y) dx dy
Z Z Z
= − p(x) lnp(x) dx − p(y) lnp(y) dy
= H[x] + H[y]
∂y ∂y
=A and p(x) = p(y) | | = p(y) |A|
∂x ∂x
Z
p(x) dx = 1
If there are more than one possible value of random variable y given
x = x i , denoted as y j , such that p(y j | x i ) 6= 0 (Because x i , y j are both "possi-
ble", p(x i , y j ) will also not equal to 0), constrained by 0 ≤ p(y j | x i ) ≤ 1 and
P
j p(y j | x i ) = 1, there should be at least two value of y satisfied 0 < p(y j | x i ) <
1, which ultimately leads to (∗) > 0.
Therefore, for each possible value of x, there will only be one y such that
p(y| x) 6= 0. In other words, y is determined by x. Note: This result is quite
straightforward. If y is a function of x, we can obtain the value of y as soon
as observing a x. Therefore we will obtain no additional information when
observing a y j given an already observed x.
∂F
Z
F[y(x) + ²η(x)] = F[y(x)] + ²η(x) dx (∗∗)
∂y
Where y(x) can be viewed as an operator that for any input x it will give
an output value y, and equivalently, F[y(x)] can be viewed as an functional
operator that for any input value y(x), it will give an ouput value F[y(x)].
Then we consider a functional operator:
Z
I[p(x)] = p(x) f (x) dx
∂I
= f (x)
∂ p(x)
Note that ²η(x) is much smaller than p(x), we will write its Taylor Theo-
rems at point p(x):
²η(x)
ln( p(x) + ²η(x) ) = lnp(x) + + O(²η(x)2 )
p(x)
∂J
= lnp(x) + 1
∂ p(x)
Now we can go back to (1.108). Based on ∂ ∂pJ( x) and ∂ p∂(Ix) , we can calculate
the derivative of the expression just before (1.108) and let it equal to 0:
− lnp(x) − 1 + λ1 + λ2 x + λ3 (x − µ)2 = 0
Hence we rearrange it and obtain (1.108). From (1.108) we can see that
p(x) should take the form of a Gaussian distribution. So we rewrite it into
Gaussian form and then compare it to a Gaussian distribution with mean µ
and variance σ2 , it is straightforward:
1 (x − µ)2
exp(−1 + λ1 ) = , exp(λ2 x + λ3 (x − µ)2 ) = exp{ }
1
(2πσ2 ) 2 2σ 2
Finally, we obtain :
λ1 = 1 − ln(2πσ2 )
λ2 = 0
1
λ3 = −
2σ 2
Note that there is a typo in the official solution manual about λ3 . More-
over, in the following parts, we will substitute p(x) back into the three con-
straints and analytically prove that p(x) is Gaussian. You can skip the fol-
lowing part. (The writer would especially thank Dr.Spyridon Chavlis from
IMBB,FORTH for this analysis)
27
We already know:
b b2
ax2 + bx + c = a(x − d)2 + f , d = − , f = c−
2a 4a
Using this quadratic form, the constraints can be written as
R∞ R∞ 2
1. −∞ p(x)dx = −∞ e[a( x−d ) + f ] dx = 1
[ a ( x − d )2 + f ]
R∞ R∞
2. −∞ xp(x)dx = −∞ xe dx = µ
2 [ a ( x − d )2 + f ]
R∞ 2
R∞
3. −∞ (x − µ) p(x)dx = −∞ (x − µ) e dx = σ2
ef ∞
Z
2
I1 = p e−w dw
−a −∞
2
As e− x is an even function the integral is written as:
2e f ∞
Z
2
I1 = p e−w dw
−a 0
2
p 1
Let w = t ⇒ w = t ⇒ dw = p dt, and thus:
2 t
2e f ∞ 2e f ∞ ef π
r
1 1 −1 − t 1
Z Z
− 21 − t
I1 = p t e dt = p t 2 e dt = p Γ( ) = e f
−a 0 −a 0 2 −a 2 −a
π
r
f
e =1 (∗)
−a
The second constraint can be written as:
Z ∞ Z ∞
2 2
I2 = xe[a( x−d ) + f ] dx = e f xe a( x−d ) dx
−∞ −∞
ef d ∞
Z
2
I 22 = p e−w dw
−a −∞
2
As e− x is an even function the integral is written as:
2e f d ∞
Z
2
I 22 = p e−w dw
−a 0
29
p 1
Let w2 = t ⇒ w = t ⇒ dw = p dt, and thus:
2 t
2e f d ∞ 2e f d ∞ ef d 1 π
r
1 1 −1 − t
Z Z
− 21 − t
I 22 = p t e dt = p t 2 e dt = p Γ( ) = e f d
−a 0 −a 0 2 −a 2 −a
Thus, the second constraint can be rewritten
π
r
ef d =µ (∗∗)
−a
Combining (∗) and (∗∗), we can obtain that d = µ. Recall that:
b λ2 − 2λ3 µ
d=− =− = µ ⇒ λ2 − 2λ3 µ = −2λ3 µ ⇒ λ2 = 0
2a 2λ3
So far, we have:
b = −2λ3 µ
And
b2 4λ2 µ2
= λ3 µ2 + λ1 − 1 − 3 = λ1 − 1
f = c−
4a 4λ3
Finally, we deal with the third also the last constraint. Substituting λ2 = 0
into the last constraint we have:
Z ∞ Z ∞
2 2
I3 = (x − µ)2 e[λ3 ( x−µ) +λ1 −1] dx = eλ1 −1 (x − µ)2 eλ3 ( x−µ) dx
−∞ −∞
eλ1 −1 ∞ 2 −w2
Z ∞
1 dw
Z
2
I 3 = eλ1 −1 − w2 e − w p = w e dw
−∞ λ3
3
−λ3 −λ3 2 −∞
eλ1 −1 ∞
Z
3
= 3
t 2 −1 e− t dt
0
−λ3 2
eλ1 −1 3 eλ1 −1 π
= Γ( ) =
3 3
2 2
−λ3 2
−λ32
30
Thus, we obtain:
1
λ1 = 1 − ln(2πσ2 )
2
So far, we have obtainde λ i , where i = 1, 2, 3. We substitute them back
into p(x), yielding:
1 1
µ ¶
2 2
p(x) = exp −1 + 1 − ln(2πσ ) − 2 (x − µ)
2 2σ
1 1
µ ¶ µ ¶
2 2
= exp − ln(2πσ ) exp − 2 (x − µ)
2 2σ
1 1
µ µ ¶¶ µ ¶
2
= exp ln p exp − 2 (x − µ)
2πσ2 2σ
Thus,
1 1
µ ¶
2
p(x) = p exp − 2 (x − µ)
2πσ2 2σ
Just as required.
1 (x − µ)2
Z Z
= − p(x) ln{ } dx − p(x) {− } dx
2πσ2 2σ 2
1 σ2
= − ln{ } +
2πσ2 2σ2
1
= { 1 + ln(2πσ2 ) }
2
31
Here we should make it clear that if the second derivative is strictly pos-
itive, the function must be strictly convex. However, the converse may not be
true. For example f (x) = x4 , g(x) = x2 , x ∈ R are both strictly convex by def-
inition, but their second derivatives at x = 0 are both indeed 0 (See keyword
convex function on Wikipedia or Page 71 of the book Convex Optimization
written by Boyd, Vandenberghe for more details). Hence, here more precisely
we will prove that a convex function is equivalent to its second derivative is
non-negative by first considering Taylor Theorems:
f (x + ²) + f (x − ²) − 2 f (x)
f 00 (x) = lim
² →0 ²2
1 1 1 1
f (x) = f ( (x + ²) + (x − ²)) ≤ f (x + ²) + f (x − ²)
2 2 2 2
Hence f 00 (x) ≥ 0. The converse situation is a little bit complex, we will use
Lagrange form of Taylor Theorems to rewrite the Taylor Series Expansion
above :
f 00 (x? )
f (x) = f (x0 ) + f 0 (x0 )(x − x0 ) + (x − x0 )
2
Where x? lies between x and x0 . By hypothesis, f 00 (x) ≥ 0, the last term is
non-negative for all x. We let x0 = λ x1 + (1 − λ)x2 , and x = x1 :
See Prob.1.31.
X p(x i )p(y j )
I[x, y] = − p(x i , y j )ln
i, j p(x i , y j )
2 22 1 1 2
1 · 1 3·3 3 1 3·3
3
= − ln − ln − ln = 0.1744
3 1/3 3 1/3 3 1/3
1
We let λm = M , m = 1, 2, ..., M. Hence we will obtain:
x1 + x2 + ... + xm 1 1
ln( )≥ [ ln(x1 ) + ln(x2 ) + ... + ln(x M ) ] = ln(x1 x2 ...x M )
M M M
We take advantage of the fact that f (x) = lnx is strictly increasing and
then obtain :
x1 + x2 + ... + xm p
≥ M x1 x2 ...x M
M
Problem 1.41 Solution
p(x)p(y)
Z Z
I[x, y] = − p(x, y)ln dx dy
p(x, y)
p(x)
Z Z
= − p(x, y)ln dx dy
p(x|y)
Z Z Z Z
= − p(x, y)lnp(x) dx dy + p(x, y)lnp(x|y) dx dy
Z Z Z Z
= − p(x)lnp(x) dx + p(x, y)lnp(x|y) dx dy
= H[x] − H[x|y]
E[x] =
X
x i p(x i ) = 0 · (1 − µ) + 1 · µ = µ
x i =0,1
34
(x i − E[x])2 p(x i )
X
var[x] =
x i =0,1
= (0 − µ)2 (1 − µ) + (1 − µ)2 · µ
= µ(1 − µ)
X
H[x] = − p(x i ) lnp(x i ) = − µ lnµ − (1 − µ) ln(1 − µ)
x i =0,1
1−µ 1+µ
E[x] =
X
x i · p(x i ) = −1 · + 1· =µ
x i =−1,1 2 2
(x i − E[x])2 · p(x i )
X
var[x] =
x i =−1,1
1−µ 1+µ
= (−1 − µ)2 · + (1 − µ)2 ·
2 2
= 1 − µ2
X 1−µ 1−µ 1+µ 1+µ
H[x] = − p(x i ) · ln p(x i ) = − · ln − · ln
x i =−1,1 2 2 2 2
Problem 2.3 Solution
m N!
CN =
m! (N − m)!
We evaluate the left side of (2.262) :
m m−1 N! N!
CN + CN = +
m! (N − m)! (m − 1)! (N − (m − 1))!
N! 1 1
= ( + )
(m − 1)! (N − m)! m N − m + 1
(N + 1)! m
= = CN +1
m! (N + 1 − m)!
To proof (2.263), here we will proof a more general form:
N
(x + y) N = m m N −m
X
CN x y (∗)
m=0
35
N
(x + y) N +1 m m N −m
X
= (x + y) CN x y
m=0
N N
m m N −m m m N −m
X X
= x CN x y +y CN x y
m=0 m=0
N N
m m+1 N − m m m N +1− m
X X
= CN x y + CN x y
m=0 m=0
NX
+1 N
m−1 m N +1− m m m N +1− m
X
= CN x y + CN x y
m=1 m=0
N
m−1 m m N +1− m
+ x N +1 + y N +1
X
= (C N + CN )x y
m=1
N
m m N +1− m
+ x N +1 + y N +1
X
= CN +1 x y
m=1
NX
+1
m m N +1− m
= CN +1 x y
m=0
Solution has already been given in the problem, but we will solve it in a
36
m m m−1
of the property, C N = CN −1
+ CN −1
.
N
mµm−1 (1 − µ) N −m C N
X £ m−1 m m
Nµ −1 − C N + (1 − µ)C N
¤
var[m] =
m=1
N
mµm−1 (1 − µ) N −m (1 − µ)C N
m m−1 m
X
Nµ
£ ¤
= + CN −1 − C N
m=1
N
mµm−1 (1 − µ) N −m (1 − µ)C N
m m
X
Nµ
£ ¤
= − CN −1
m=1
N
nX N o
mµm−1 (1 − µ) N −m+1 C N
m
mµm−1 (1 − µ) N −m C N
m
X
= Nµ − −1
m=1 m=1
n o
= N µ · N(1 − µ)[µ + (1 − µ)] N −1 − (N − 1)(1 − µ)[µ + (1 − µ)] N −2
n o
= N µ N(1 − µ) − (N − 1)(1 − µ) = N µ(1 − µ)
Hints have already been given in the description, and let’s make a little
improvement by introducing t = y + x and x = tµ at the same time, i.e. we will
do following changes:
(
x = tµ t = x+ y
½
and
y = t(1 − µ) µ = x+x y
Note t ∈ [0, +∞], µ ∈ (0, 1), and that when we change variables in integral,
we will introduce a redundant term called Jacobian Determinant.
¯ ∂x ∂x ¯ ¯
∂(x, y) ¯¯ ∂µ ∂ t ¯¯ ¯¯ t µ ¯¯
¯
= ¯ ∂y ∂y ¯ = =t
∂(µ, t) ¯ ∂µ ∂ t ¯ ¯ − t 1 − µ ¯
a(a + 1) a 2
var[µ] = −( )
(a + b)(a + b + 1) a+b
ab
= 2
(a + b) (a + b + 1)
Therefore, we obtain the simple expression for the first term on the right
side of (2.271) :
Z Z Z Z
E y [var x [x| y]] = x2 p(x) dx − E x [x| y]2 p(y) d y (∗)
Then following the same procedure, we deal with the second term of the
equation above.
Z
2E[x] · E x [x| y]p(y) d y = 2E[x] · E y [E x [x| y]]] = 2E[x]2
Therefore, we obtain the simple expression for the second term on the
right side of (2.271) :
Z
var y [E x [x| y]] = E x [x| y]2 p(y) d y − E[x]2 (∗∗)
This problem is complexed, but hints have already been given in the de-
scription. Let’s begin by performing integral of (2.272) over µ M −1 . (Note :
41
We change variable by :
µ M −1
t=
1 − µ − m − ... − µ M −2
Γ(α1 + α2 + ... + α M )
CM =
Γ(α1 )Γ(α2 )...Γ(α M −1 )Γ(α M )
as required.
Γ(α0 ) K
Z
α −1
Y
= µj µ k dµ
Γ(α1 )Γ(α2 )...Γ(αK ) k=1 k
Γ(α0 ) K
Z
α −1
Y
= µj µk k d µ
Γ(α1 )Γ(α2 )...Γ(αK ) k=1
Γ(α0 ) Γ(α1 )Γ(α2 )...Γ(α j−1 )Γ(α j + 1)Γ(α j+1 )...Γ(αK )
=
Γ(α1 )Γ(α2 )...Γ(αK ) Γ(α0 + 1)
Γ(α0 )Γ(α j + 1) αj
= =
Γ(α j )Γ(α0 + 1) α0
Γ(α0 ) K
Z
α −1
µ2j
Y
= µk k dµ
Γ(α1 )Γ(α2 )...Γ(αK ) k=1
Γ(α0 ) Γ(α1 )Γ(α2 )...Γ(α j−1 )Γ(α j + 2)Γ(α j+1 )...Γ(αK )
=
Γ(α1 )Γ(α2 )...Γ(αK ) Γ(α0 + 2)
Γ(α0 )Γ(α j + 2) α j (α j + 1)
= =
Γ(α j )Γ(α0 + 2) α0 (α0 + 1)
Hence, we obtain :
α j (α j + 1) αj α j (α0 − α j )
var[µ j ] = E[µ2j ] − E[µ j ]2 = −( )2 =
α0 (α0 + 1) α0 α20 (α0 + 1)
in this case.
Γ(α0 )
= K(α)
Γ(α1 )Γ(α2 )...Γ(αK )
∂D ir(µ|α) K
α −1
Y
= ∂(K(α) µi i ) / ∂α j
∂α j i =1
QK α i −1
K
∂K(α) Y α −1
∂ i =1 µ i
= µ i i + K(α)
∂α j i =1 ∂α j
K
∂K(α) Y α −1
= µi i + lnµ j · D ir(µ|α)
∂α j i =1
∂ D ir(µ|α) d µ ∂1
R
left side = = =0
∂α j ∂α j
∂K(α) K
Z Y
α −1
right side = µi i d µ + E[lnµ j ]
∂α j i =1
∂K(α) 1
= + E[lnµ j ]
∂α j K(α)
∂ lnK(α)
= + E[lnµ j ]
∂α j
44
Therefore, we obtain :
∂ lnK(α)
E[lnµ j ] = −
∂α j
n o
∂ lnΓ(α0 ) − Ki =1 lnΓ(α i )
P
= −
∂α j
∂ lnΓ(α j ) ∂ lnΓ(α0 )
= −
∂α j ∂α j
∂ lnΓ(α j ) ∂ lnΓ(α0 ) ∂α0
= −
∂α j ∂α0 ∂α j
∂ lnΓ(α j ) ∂ lnΓ(α0 )
= −
∂α j ∂α0
= ψ(α j ) − ψ(α0 )
Since we have : Z b 1
dx = 1
a b−a
It is straightforward that it is normalized. Then we calculate its mean :
b 1 x2 ¯¯b a + b
Z
E[x] = x dx = ¯ =
a b−a 2(b − a) a 2
Hence we obtain:
(b − a)2
var[x] =
12
Problem 2.13 Solution
p(x) 1 |L| 1 1
ln( ) = ln ( ) + (x − m)T L−1 (x − m) − (x − µ)T Σ−1 (x − µ)
q(x) 2 |Σ| 2 2
1 1 £ 1
Z Z ½ ¾
T −1
µ Σ µ
¤
f (x)lng(x) dx = f (x)ln exp − (x − ) (x − ) dx
(2π)D /2 |Σ|1/2 2
1 1 £ 1
Z ½ ¾ Z
T −1
µ Σ µ
¤
= f (x)ln dx + f (x) − (x − ) (x − ) dx
(2π)D /2 |Σ|1/2 2
1 1 1 £
½ ¾
T −1
E µ Σ µ
¤
= ln − (x − ) (x − )
(2π)D /2 |Σ|1/2 2
1 1 1
½ ¾
= ln − tr{ I D }
(2π) D /2 |Σ| 1/2 2
1 D
½ ¾
= − ln|Σ| + (1 + ln(2π))
2 2
= − H(g)
We take advantage of two properties of PDF f (x), with mean µ and vari-
ance Σ, as listed below. What’s more, we also use the result of Prob.2.15,
which we will proof later. Z
f (x) dx = 1
H(g) ≥ H( f )
In other words, we have proved that an arbitrary PDF f (x) with the same
mean and variance as a Gaussian PDF g(x), its entropy cannot be greater
than that of Gaussian PDF.
We have already used the result of this problem to solve Prob.2.14, and
now we will prove it. Suppose x ∼ p(x) = N (µ|Σ) :
Z
H[x] = − p(x)lnp(x) dx
1 1 £ 1
Z ½ ¾
T −1
exp − (x − µ) Σ (x − µ) dx
¤
= − p(x)ln
(2π)D /2 |Σ|1/2 2
1 1 £ 1
Z ½ ¾ Z
f (x) − (x − µ)T Σ−1 (x − µ) dx
¤
= − p(x)ln dx −
(2π) D /2 |Σ| 1/2 2
1 1 1 £
½ ¾
+ E (x − µ)T Σ−1 (x − µ)
¤
= − ln
(2π) D /2 |Σ| 1/2 2
1 1 1
½ ¾
= − ln + tr{ I D }
(2π)D /2 |Σ|1/2 2
1 D
= ln|Σ| + (1 + ln(2π))
2 2
Note: f X ,Y (·) is the joint PDF of X and Y , and then we rearrange the
order, we will obtain :
Z z ·Z +∞ ¸
F Z (z) = f X ,Y (u − y, y) d y du
−∞ −∞
We can obtain : Z +∞
f Z (u) = f X ,Y (u − y, y) d y
−∞
And if X and Y are independent, which means f X ,Y (x, y) = f X (x) f Y (y), we
can simplify f Z (z) :
Z +∞
f Z (u) = f X (u − y) f Y (y) d y i.e. f Z = f X ∗ f Y
−∞
Until now we have proved that the PDF of the summation of two inde-
pendent random variable is the convolution of the PDF of them. Hence it is
straightforward to see that in this problem, where random variable x is the
summation of random variable x1 and x2 , the PDF of x should be the convo-
lution of the PDF of x1 and x2 . To find the entropy of x, we will use a simple
method, taking advantage of (2.113)-(2.117). With the knowledge :
1
p(x2 ) = N (µ2 , τ−
2 )
1
p(x| x2 ) = N (µ1 + x2 , τ−
1 )
1£ 1 −1
1 + ln2π(τ−
1 + τ2 )
¤
H[x] =
2
48
It is straightforward then :
D X
D D X
D
(x i − µ i )ΛSij (x j − µ j ) + (x i − µ i )Λ iAj (x j − µ j )
X X
(∗) =
i =1 j =1 i =1 j =1
D X
D
(x i − µ i )ΛSij (x j − µ j )
X
=
i =1 j =1
We will just follow the hint given in the problem. Firstly, we take complex
conjugate on both sides of (2.45) :
Σu i = λi u i => Σu i = λi u i
Where we have taken advantage of the fact that Σ is a real matrix, i.e.,
Σ = Σ. Then using that Σ is a symmetric, i.e., ΣT = Σ :
u i T Σu i = u i T ( Σu i ) = u i T ( λi u i ) = λi u i T u i
T
u i T Σ u i = ( Σ u i )T u i = ( λ i u i )T u i = λ i u i T u i
T
Since u i 6= 0, we have u i T u i 6= 0. Thus λTi = λ i , which means λ i is real.
Next we will proof that two eigenvectors corresponding to different eigenval-
ues are orthogonal.
λ i < u i , u j > = < λ i u i , u j > = < Σ u i , u j > = < u i , ΣT u j > = λ j < u i , u j >
For every N × N real symmetric matrix, the eigenvalues are real and the
eigenvectors can be chosen such that they are orthogonal to each other. Thus
a real symmetric matrix Σ can be decomposed as Σ = U ΛU T ,where U is an
orthogonal matrix, and Λ is a diagonal matrix whose entries are the eigen-
values of A. Hence for an arbitrary vector x, we have:
u 1T x λ1 u 1T x
D
. ..
Σ x = U ΛU T x = U Λ .. = U λk u k u T
X
=( k )x
.
k=1
uT
D
x λD u T
D
x
And since Σ−1 = U Λ−1U T , the same procedure can be used to prove
(2.49).
Since a is real,the expression above will be strictly positive for any non-
zero a, if all eigenvalues are strictly positive. It is also clear that if an eigen-
value, λ i , is zero or negative, there will exist a vector a (e.g. a = u i ), for
which this expression will be no greater than 0. Thus, that a real symmet-
ric matrix has eigenvectors which are all strictly positive is a sufficient and
necessary condition for the matrix to be positive definite.
SD 2πD /2
VD = =
D Γ( D2 + 1)
− MBD −1
· ¸ · ¸
A B M
×
C D −D −1 CM D + D −1 CMBD −1
−1
The result can also be partitioned into four blocks. The block located at
left top equals to :
AM − BD −1 CM = (A − BD −1 C)(A − BD −1 C)−1 = I
Where we have taken advantage of (2.77). And the right top equals to :
Where we have used the result of the left top block. And the left bottom
equals to :
CM − DD −1 CM = 0
−CMBD −1 + DD −1 + DD −1 CMDD −1 = I
we have proved what we are asked. Note: if you want to be more precise,
you should also multiply the block matrix on the right side of (2.76) and then
prove that it will equal to a identity matrix. However, the procedure above
can be also used there, so we omit the proof and what’s more, if two arbitrary
square matrix X and Y satisfied X Y = I, it can be shown that Y X = I also
holds.
µa Σaa Σab
µ ¶ · ¸
µa,b = Σ(a,b)(a,b) =
µb Σba Σbb
And the expression of Λ− aa and Λab can be given by using (2.76) and (2.77)
1
Here we will also directly calculate the inverse matrix instead to give
another solution. Let’s first begin by introducing two useful formulas.
(I + P)−1 = (I + P)−1 (I + P − P)
= I − (I + P)−1 P
And since
= A −1 − (I + A −1 BCD)−1 A −1 BCD A −1
Where we have assumed that A is invertible and also used the first for-
mula we introduced. Then we also assume that C is invertible and recur-
sively use the second formula :
Just as required.
Also the same procedure can be used here. We omit the proof for simplic-
ity.
54
µ Λ−1 Λ−1 A T
µ ¶ µ ¶ · ¸
x
z= E(z) = cov(z) =
y Aµ + b A Λ−1 L + A Λ−1 A T
−1
µ y| x = A µ + L − L−1 (−LA)(x − µ) = Ax + L
Λ−1 Λ−1 A T Λµ − A T Lb µ
µ ¶µ ¶ µ ¶
=
A Λ−1 L−1 + A Λ−1 A T Lb Aµ + b
Let’s make this problem more clear. The deduction in the main text, i.e.,
(2.101-2.110), firstly denote a new random variable z corresponding to the
joint distribution, and then by completing square according to z,i.e.,(2.103),
obtain the precision matrix R by comparing (2.103) with the PDF of a mul-
tivariate Gaussian Distribution, and then it takes the inverse of precision
matrix to obtain covariance matrix, and finally it obtains the linear term i.e.,
(2.106) to calculate the mean.
In this problem, we are asked to solve the problem from another perspec-
tive: we need to write the joint distribution p(x, y) and then perform inte-
gration over x to obtain marginal distribution p(y). Let’s begin by write the
quadratic form in the exponential of p(x, y) :
1 1
− (x − µ)T Λ(x − µ) − (y − Ax − b)T L(y − Ax − b)
2 2
We extract those terms involving x :
1
= − xT (Λ + A T LA) x + xT [Λµ + A T L(y − b) ] + const
2
1 1
= − (x − m)T (Λ + A T LA) (x − m) + mT (Λ + A T LA)m + const
2 2
Where we have defined :
Now if we perform integration over x, we will see that the first term van-
ish to a constant, and we extract the terms including y from the remaining
parts, we can obtain :
1 h i
= − yT L − LA (Λ + A T LA)−1 A T L y
2 n
+ yT L − LA (Λ + A T LA)−1 A T L b
£ ¤
o
+LA(Λ + A T LA)−1 Λµ
We firstly view the quadratic term to obtain the precision matrix, and
then we take advantage of (2.289), we will obtain (2.110). Finally, using the
56
linear term combined with the already known covariance matrix, we can ob-
tain (2.109).
Let’s follow the hint by firstly calculating the derivative of (2.118) with
respect to Σ and let it equal to 0 :
N ∂ 1 ∂ X N
− ln|Σ| − (xn − µ)T Σ−1 (xn − µ) = 0
2 ∂Σ 2 ∂Σ n=1
N ∂ N N
− ln|Σ| = − (Σ−1 )T = − Σ−1
2 ∂Σ 2 2
Provided with the result that the optimal covariance matrix is the sample
covariance, we denote sample matrix S as :
1 XN
S= (xn − µ)(xn − µ)T
N n=1
1 ∂ X N
second term = − (xn − µ)T Σ−1 (xn − µ)
2 ∂Σ n=1
N ∂
= − T r[Σ−1 S]
2 ∂Σ
N −1
= Σ S Σ−1
2
Where we have taken advantage of the following property, combined with
the fact that S and Σ is symmetric. (Note : this property can be found in The
Matrix Cookbook.)
∂
T r(A X −1 B) = −(X −1 BA X −1 )T = −(X −1 )T A T B T (X −1 )T
∂X
Thus we obtain :
N −1 N −1
− Σ + Σ S Σ−1 = 0
2 2
57
The proof of (2.62) is quite clear in the main text, i.e., from page 82 to
page 83 and hence we won’t repeat it here. Let’s prove (2.124). We first begin
by proving (2.123) :
1 X N 1
E[µ ML ] = E[ xn ] = · Nµ = µ
N n=1 N
Where we have taken advantage of the fact that xn is independently and
identically distributed (i.i.d).
Then we use the expression in (2.122) :
1 XN
E[Σ ML ] = E[ (xn − µ ML )(xn − µ ML )T ]
N n=1
1 XN
= E[(xn − µ ML )(xn − µ ML )T ]
N n=1
1 XN
= E[(xn − µ ML )(xn − µ ML )T ]
N n=1
1 XN
= E[xn xn T − 2µ ML xn T + µ ML µT
ML ]
N n=1
1 XN 1 XN 1 XN
= E[xn xn T ] − 2 E[µ ML xn T ] + E[µ ML µT
ML ]
N n=1 N n=1 N n=1
1 XN
second term = −2 E[µ ML xn T ]
N n=1
1 XN 1 XN
= −2 E[ ( xm )xn T ]
N n=1 N m=1
1 X N X N
= −2 E[xm xn T ]
N 2 n=1 m=1
1 X N X N
= −2 (µµT + I nm Σ)
N 2 n=1 m=1
1
= −2 2 (N 2 µµT + N Σ)
N
1
= −2(µµT + Σ)
N
58
1 XN
third term = E[µ ML µT
ML ]
N n=1
1 XN N
1 X 1 XN
= E[( x j) · ( x i )]
N n=1 N j=1 N i=1
1 X N N N
E
X X
= [( x j ) · ( x i )]
N 3 n=1 j=1 i =1
1 X N
= (N 2 µµT + N Σ)
N 3 n=1
1
= µµT + Σ
N
Finally, we combine those three terms, which gives:
N −1
E[Σ ML ] = Σ
N
Note: the same procedure from (2.59) to (2.62) can be carried out to prove
(2.291) and the only difference is that we need to introduce index m and n
to represent the samples. (2.291) is quite straightforward if we see it in this
way: If m = n, which means xn and xm are actually the same sample, (2.291)
will reduce to (2.262) (i.e. the correlation between different dimensions ex-
ists) and if m 6= n, which means xn and xm are different samples, also i.i.d,
then no correlation should exist, we can guess E[xn xm T ] = µµT in this case.
Let’s follow the hint. However, firstly we will find the sequential expres-
sion based on definition, which will make the latter process on finding coef-
ficient a N −1 more easily. Suppose we have N observations in total, and then
we can write:
1 XN
σ2(
ML
N)
= N) 2
(xn − µ(ML )
N n=1
" #
1 NX −1
= N) 2
(xn − µ(ML N) 2
) + (x N − µ(ML )
N n=1
N − 1 1 NX−1 1
= N) 2
(xn − µ(ML N) 2
) + (x N − µ(ML )
N N − 1 n=1 N
N − 1 2( N −1) 1
= σ ML + (x N − µ(ML N) 2
)
N N
1 h i
= σ2(
ML
N −1)
+ (x N − µ (N ) 2
ML
) − σ 2( N −1)
ML
N
59
1 XN ∂
·
∂
¸
lim lnp(x n | µ , σ ) = E x lnp(x n | µ , σ )
N →+∞ N n=1 ∂σ2 ∂σ2
∂
σ2(
ML
N)
= σ2(
ML
N −1)
+ a N −1 N)
lnp(x N |µ(ML N −1)
, σ(ML ) (∗)
∂σ2(
ML
N −1)
N) 2
(x N − µ(ML
" #
1 )
= σ2(
ML
N −1)
+ a N −1 − +
2σ2(
ML
N −1)
2σ4(
ML
N −1)
2σ4(
ML
N −1)
a N −1 =
N
Then we will obtain :
1 h 2( N −1) i
σ2(
ML
N)
= σ2(
ML
N −1)
+ −σ ML N) 2
+ (x N − µ(ML )
N
We can see that the results are the same. An important thing should be
noticed : In maximum likelihood, when estimating variance σ2( N)
ML
, we will
N)
first estimate mean µ(ML , and then we we will calculate variance σ2( ML
N)
.
In other words, they are decoupled. It is the same in sequential method.
For instance, if we want to estimate both mean and variance sequentially,
N −1)
after observing the Nth sample (i.e., x N ), firstly we can use µ(ML together
N)
with (2.126) to estimate µ(ML and then use the conclusion in this problem
N)
to obtain σ(ML N)
. That is why in (∗) we write lnp(x N |µ(ML N −1)
, σ(ML ) instead of
N −1) ( N −1)
lnp(x N |µ(ML , σ ML ).
1 XN
N) N) N) T
Σ(ML = (xn − µ(ML )(xn − µ(ML )
N n=1
" #
1 NX −1
(N ) (N ) T (N ) (N ) T
= (xn − µ ML )(xn − µ ML ) + (x N − µ ML )(x N − µ ML )
N n=1
N − 1 ( N −1) 1 N) N) T
= Σ ML + (x N − µ(ML )(x N − µ(ML )
N N
N −1) 1 h
N) N) T N −1)
i
= Σ(ML + (x N − µ(ML )(x N − µ(ML ) − Σ(ML
N
If we use Robbins-Monro sequential estimation formula, i.e., (2.135), we
can obtain :
N) N −1) ∂ N) N −1)
Σ(ML = Σ(ML + a N −1
N −1)
lnp(x N |µ(ML , Σ(ML )
∂Σ(ML
N −1) ∂ N) N −1)
= Σ(ML + a N −1 lnp(x N |µ(ML , Σ(ML )
N −1)
∂Σ(ML
1 1
· ¸
N −1) N −1) −1 N −1) −1 N −1) N −1) T N −1) −1
= Σ(ML + a N −1 − [Σ(ML ] + [Σ(ML ] (xn − µ(ML )(xn − µ(ML ) [Σ(ML ]
2 2
2 2( N −1)
a N −1 = Σ
N ML
We can see that the equation above will be identical with our previous
conclusion based on definition.
1 XN 1 1
− (xn − µ)2 − 2 (µ − µ0 )2 = − 2 (µ − µ N )2
2σ n=1
2 2σ 0 2σ N
N 1
quadratic term = −( + ) µ2
2σ 2 2σ20
PN
n=1 x n µ0
linear term = ( + )µ
σ2 σ20
61
We also rewrite the right side regarding to µ, and hence we will obtain :
PN
N 1 1 n=1 x n µ0 µN
−( 2 + ) µ2 = − 2 µ2 , ( + )µ = µ
2σ 2
2σ 0 2σ N σ2 σ20 σ2N
µN σ2 + N σ20 σ2 N σ20
= ( µ0 + N)
µ(ML )
σ2N σ20 σ2 N σ20 + σ2 N σ20 + σ2
σ2 + N σ20 σ2 N σ20
= ( µ0 + N)
µ(ML )
σ20 σ2 N σ20 + σ2 N σ20 + σ2
N)
N µ(ML
PN
µ0 µ0 n=1 x n
= + = +
σ20 σ2 σ20 σ2
P N −1
µ0 n=1 xn xN
= + +
σ20 σ2 σ2
µ N −1 xN
= +
σ2N −1 σ2
62
We focus on the exponential term on the right side and then rearrange it
regarding to µ.
" #
N 1 1
− (xn − µ) Σ (xn − µ) − (µ − µ0 )T Σ0 −1 (µ − µ0 )
T −1
X
right =
n=1 2 2
" #
N 1 1
− (xn − µ) Σ (xn − µ) − (µ − µ0 )T Σ0 −1 (µ − µ0 )
T −1
X
=
n=1 2 2
1 N
= − µ (Σ0 −1 + N Σ−1 ) µ + µT (Σ−1
0 µ0 + Σ
−1
X
xn ) + const
2 n=1
Σ−N1 = Σ0 −1 + N Σ−1
ba
Z +∞ Z +∞
1 a a−1
b λ exp(− bλ) d λ = λa−1 exp(− bλ) d λ
0 Γ(a) Γ(a) 0
ba
Z +∞
u 1
= ( )a−1 exp(− u) du
Γ(a) 0 b b
Z +∞
1
= u a−1 exp(− u) du
Γ(a) 0
1
= · Γ(a) = 1
Γ(a)
Where we first perform change of variable bλ = u, and then take advan-
tage of the definition of gamma function:
Z +∞
Γ(x) = u x−1 e−u du
0
ba
Z +∞ Z +∞
2 1 a a−1
λ b λ exp(− bλ) d λ = λa+1 exp(− bλ) d λ
0 Γ(a) Γ(a) 0
ba
Z +∞
u 1
= ( )a+1 exp(− u) du
Γ(a) 0 b b
Z +∞
1 a+1
= u exp(− u) du
Γ(a) · b2 0
1 a(a + 1)
= · Γ(a + 2) =
Γ(a) · b2 b2
64
a0 = 1 + β0 /2 = 1 + (N + β) 2
±
N x2 N
2 n
b0 = d 0 − c0 /2β0 = d + xn )2 (2(β + N))
X X ±
− (c +
n=1 2 n=1
Problem 2.45 Solution
Let’s begin by writing down the dependency of the prior distribution W (Λ|W, v)
and the likelihood function p(X |µ, Λ) on Λ.
N 1
p(X |µ, Λ) ∝ |Λ| N /2 exp − (xn − µ)T Λ(xn − µ)
£X ¤
n=1 2
And if we denote
1 XN
S= (xn − µ)(xn − µ)T
N n=1
Then we can rewrite the equation above as:
£ 1
p(X |µ, Λ) ∝ |Λ| N /2 exp − Tr(S Λ)
¤
2
Just as what we have done in Prob.2.34, and comparing this problem with
Prob.2.34, one important thing should be noticed: since S and Λ are both
symmetric, we have: Tr(S Λ) = Tr (S Λ)T = Tr(ΛT S T ) = Tr(ΛS). And we
¡ ¢
vN = N + v
W N = (W −1 + S)−1
Problem 2.46 Solution
It is quite straightforward.
Z ∞
p(x|µ, a, b) = N (x|µ, τ−1 )Gam(τ|a, b) d τ
0
b a exp(− bτ)τa−1 τ 1/2
∞ n τ
Z o
= ( ) exp − (x − µ)2 d τ
0 Γ(a) 2π 2
a
τ
Z ∞
b 1 n o
= ( )1/2 τa−1/2 exp − bτ − (x − µ)2 d τ
Γ(a) 2π 0 2
And if we make change of variable: z = τ[b + (x − µ)2 /2], the integral above
can be written as:
b a 1 1/2 ∞ a−1/2 τ
Z n o
p(x|µ, a, b) = ( ) τ exp − bτ − (x − µ)2 d τ
Γ(a) 2π 0 2
a Z ∞· ¸a−1/2
b 1 z 1
= ( )1/2 exp {− z} dz
Γ(a) 2π 0 b + (x − µ ) 2 /2 b + (x − µ)2 /2
¸a+1/2 Z ∞
b a 1 1/2 1
·
= ( ) z a−1/2 exp {− z} dz
Γ(a) 2π b + (x − µ)2 /2 0
¸−a−1/2
b a 1 1/2 (x − µ)2
·
= ( ) b+ Γ(a + 1/2)
Γ(a) 2π 2
And if we substitute a = v/2 and b = v/2λ, we will obtain (2.159).
Then we calculate E[xxT ]. The steps above can also be used here.
Z
T
E[xx ] = St(x|µ, Λ, v) · xxT dx
x∈RD
·Z +∞
¯v v
Z ¸
−1 T
N (x µ, (ηΛ) ) · Gam(η , ) d η xx dx
¯
= ¯ ¯
x∈RD 0 2 2
Z +∞
¯v v
Z
xxT N (x ¯ µ, (ηΛ)−1 ) · Gam(η ¯ , ) d η dx
¯
=
x∈RD 0 2 2
Z +∞ ·Z
¯v v
¸
T −1
xx N (x µ, (ηΛ) ) dx · Gam(η , ) d η
¯
= ¯ ¯
0 x∈RD 2 2
Z +∞ h
v v i
E[µµT ] · Gam(η ¯ , ) d η
¯
=
0 2 2
Z +∞ h i ¯v v
= µµT + (ηΛ)−1 Gam(η ¯ , ) d η
0 2 2
Z +∞
v v
= µµT + (ηΛ)−1 · Gam(η ¯ , ) d η
¯
0 2 2
Z +∞
T −1 1 v v/2 v/2−1 v
= µµ + (ηΛ) · ( ) η exp(− η) d η
0 Γ (v/2) 2 2
1 v v/2 +∞ v/2−2 v
Z
T −1
= µµ + Λ ( ) η exp(− η) d η
Γ(v/2) 2 0 2
vη
If we denote: z = 2 , the equation above can be reduced to :
1 v v/2 +∞ 2z v/2−2 2
Z
T T −1
E[xx ] = µµ +Λ ( ) ( ) exp(− z) dz
Γ(v/2) 2 0 v v
Z +∞
1 v
= µµT + Λ−1 · z v/2−2 exp(− z) dz
Γ(v/2) 2 0
Γ(v/2 − 1) v
= µµT + Λ−1 ·
Γ(v/2) 2
1 v
= µµT + Λ−1
v/2 − 1 2
v
= µµT + Λ−1
v−2
Where we have taken advantage of the property: Γ(x + 1) = xΓ(x), and
since we have cov[x] = E (x − E[x])(x − E[x])T , together with E[x] = µ, we
£ ¤
can obtain:
v
cov[x] = Λ−1
v−2
Problem 2.50 Solution
69
We first prove (2.177). Since we have exp(i A)· exp(− i A) = 1, and exp(i A) =
cosA + isinA. We can obtain:
Which gives cos2 A + sin2 A = 1. And then we prove (2.178) using the hint.
θ0 )].
(θ − θ0 )2
½· ¸¾
4
exp { mcos(θ − θ0 )} = exp m 1 − + O((θ − θ0 ) )
2
(θ − θ0 )2
½ ¾
= exp m − m − mO((θ − θ0 )4 )
2
(θ − θ0 )2
½ ¾
· exp − mO((θ − θ0 )4 )
© ª
= exp(m) · exp − m
2
It is same for exp(mcosθ ) :
θ2
exp { mcosθ } = ) · exp − mO(θ 4 )
© ª
exp(m) · exp(− m
2
Now we rearrange (2.179):
1
p(θ |θ0 , m) = exp { mcos(θ − θ0 )}
2π I 0 (m)
1
= R 2π exp { mcos(θ − θ0 )}
0 exp { mcosθ } d θ
2
exp(m) · exp − m (θ−2θ0 ) · exp − mO((θ − θ0 )4 )
n o © ª
= R 2π θ2
0 exp(m) · exp(− m 2 ) · exp − mO(θ ) d θ
4
© ª
1 (θ − θ 0 )2
½ ¾
= R 2π exp − m
exp(− m θ ) d θ
2
2
0 2
Where we have used (2.183), and then together with (2.182), we can ob-
tain :
XN N
X
cosθ0 sinθn − sinθ0 cosθn = 0
n=1 n=1
71
Which gives:
n sinθ n
½P ¾
θ0ML = tan −1
n cosθ n
P
θ = θ0 + k π (k ∈ Z)
1 XN
A(m ML ) = cos(θn − θ0ML )
N n=1
By using (2.178), we can write :
1 XN
A(m ML ) = cos(θn − θ0ML )
N n=1
1 XN ³ ´
= cosθn cosθ0ML + sinθn sinθ0ML
N n=1
à ! à !
1 XN 1 N
cosθ N cosθ0ML + sinθ N sinθ0ML
X
=
N n=1 N n=1
And then by using (2.169) and (2.184), it is obvious that θ̄ = θ0ML , and
hence A(m ML ) = r̄.
Recall that the distributions belonging to the exponential family have the
form:
p(x|η) = h(x) g(η) exp(ηT u(x))
And according to (2.13), the beta distribution can be written as:
Γ(a + b) a−1
Beta(x|a, b) = x (1 − x)b−1
Γ(a)Γ(b)
Γ(a + b)
= exp [(a − 1)lnx + (b − 1)ln(1 − x)]
Γ(a)Γ(b)
Γ(a + b) exp [alnx + bln(1 − x)]
=
Γ(a)Γ(b) x(1 − x)
η = [a, b]T
u(x) = [lnx, ln(1 − x)]T
1
p(x|θ0 , m) = exp(mcos(x − θ0 ))
2π I 0 (m)
1
= exp [m(cosxcosθ0 + sinxsinθ0 )]
2π I 0 (m)
73
Recall that the distributions belonging to the exponential family have the
form:
p(x|η) = h(x) g(η) exp(ηT u(x))
And the multivariate Gaussian Distribution has the form:
1 1 1
½ ¾
T −1
N (x|µ, Σ) = exp − (x − µ) Σ (x − µ)
(2π)D /2 |Σ|1 /2 2
We expand the exponential term with respect to µ.
1 1 1 T −1
½ ¾
T −1 −1
N (x|µ, Σ) = exp − (x Σ x − 2µ Σ x + µΣ µ)
(2π)D /2 |Σ|1/2 2
1 1 1 T −1 1
½ ¾ ½ ¾
T −1 −1
= exp − x Σ x + µ Σ x exp − µ Σ µ)
(2π)D /2 |Σ|1/2 2 2
η = [Σ−1 µ, − 12 vec(Σ−1 ) ]T
u(x) = [x, vec(xxT ) ]
g(η) = exp( 41 η 1 T η 2 −1 η 1 ) + | − 2η 2 |1/2
h(x) = (2π)−D /2
Then we calculate the derivative of both sides of the equation above with
respect to η using the Chain rule of Calculus:
Z Z
−∇∇ ln g(η) = ∇ g(η) h(x) exp{ηT u(x)} u(x)T dx + g(η) h(x) exp{ηT u(x)} u(x)u(x)T dx
One thing needs to be addressed here: please pay attention to the trans-
pose operation, and −∇∇ ln g(η) should be a matrix. Notice the relationship
∇ ln g(η) = ∇ g(η)/g(η), the first term on the right hand side of the above equa-
tion can be simplified as:
Z
(first term on the right) = ∇ ln g(η) · g(η) h(x) exp{ηT u(x)} u(x)T dx
= ∇ ln g(η) · E[u(x)T ]
= −E[u(x)] · E[u(x)T ]
Based on the definition, the second term on the right hand side is E[u(x)u(x)T ].
Therefore, combining these two terms, we obtain:
It is straightforward.
1 x
Z Z
p(x|σ) dx = f ( ) dx
σ σ
1
Z
= f (u)σ du
Z σ
= f (u) du = 1
N
X M
X
lnp(xn ) = n i ln(h i )
n=1 i =1
all the N observations, there are n i samples fall into region ∆ i , we can easily
write down the likelihood function just as the equation above, and note we use
M to denote the number of different regions. Therefore, an implicit equation
should hold:
XM
ni = N
i =1
We now need to take account of the constraint that p(x) must integrate
to unity, which can be written as M j =1 h j ∆ j = 1. We introduce a Lagrange
P
M M
n i ln(h i ) + λ( h j ∆ j − 1)
X X
i =1 j
N +λ=0
We also see that if we use "1NN" (K = 1), the probability density will be
well normalized. Note that if and only if the volume of all the spheres are
small enough and N is large enough, the equation above will hold. Fortu-
nately, these two conditions can be satisfied in KNN.
76
1X M wj
µ0 = w0 + wj and µ j =
2 j=1 2
N n o
r n t n − wT Φ(xn ) Φ(xn )T
X
∇E D (w) =
n=1
Where we have used y(xn , w) to denote the output of the linear model
when input variable is xn , without noise added. For the second term in the
equation above, we can obtain :
D D X
D D X
D D X
D
E² [( w i ² i )2 ] = E² [ w i w j E² [² i ² j ] = σ2
X X X X
wi w j ²i ² j ] = wi w j δi j
i =1 i =1 j =1 i =1 j =1 i =1 j =1
Which gives
D D
E² [( w i ² i )2 ] = σ2 w2i
X X
i =1 i =1
For the third term, we can obtain:
D D
E² [2 w i ² i y(xn , w) − t n ] = 2 y(xn , w) − t n E² [ w i ² i ]
¡X ¢¡ ¢ ¡ ¢ X
i =1 i =1
D
E² [w i ² i ]
¡ ¢X
= 2 y(xn , w) − t n
i =1
= 0
It is obvious that L(w, λ) and (3.29) has the same dependence on w. Mean-
while, if we denote the optimal w that can minimize L(w, λ) as w? (λ), we can
see that
M
|w?j | q
X
η=
j =1
N 1 XN £ ¤T
lnp(T | X ,W, β) = − ln|Σ| − t n − W T φ(xn ) Σ−1 t n − W T φ(xn )
£ ¤
2 2 n=1
79
Where we have already omitted the constant term. We set the derivative
of the equation above with respect to W equals to zero.
N
Σ−1 [t n − W T φ(xn ) φ(xn )T
X ¤
0=−
n=1
1 XN ¤T
Σ= [t n − W T T
ML φ(x n ) [t n − W ML φ(x n )
¤
N n=1
We can see that the solutions for W and Σ are also decoupled.
Let’s begin by writing down the prior distribution p(w) and likelihood
function p(t| X , w, β).
N
p(w) = N (w| m0 , S 0 ) , N (t n |wT φ(xn ), β−1 )
Y
p(t| X , w, β) =
n=1
Since the posterior PDF equals to the product of the prior PDF and likeli-
hood function, up to a normalized constant. We mainly focus on the exponen-
tial term of the product.
N n
β X o2 1
exponential term = − t n − wT φ(xn ) − (w − m0 )T S −1
0 (w − m0 )
2 n=1 2
β XN n o 1
= − t2n − 2t n wT φ(xn ) + wT φ(xn )φ(xn )T w − (w − m0 )T S −1
0 (w − m0 )
2 n=1 2
" #
1 T X N
T −1
= − w βφ(xn )φ(xn ) + S 0 w
2 n=1
" #
1 N
− −2m0T S − 1
2β t n φ(xn )T w
X
0 −
2 n=1
+ const
N
−2m N T S N −1 = −2m0T S −1
2β t n φ(xn )T
X
0 −
n=1
If we multiply −0.5 on both sides, and then transpose both sides, we can
easily see that m N = S N (S 0 −1 m 0 + βΦT t)
80
p(w) = N (m N , S N )
Since the posterior equals to the production of likelihood function and the
prior, up to a constant, we focus on the exponential term.
exponential term = (w − m N )T S −1 T
N (w − m N ) + β(t N +1 − w φ(x N +1 ))
2
wT S N −1 + β φ(x N +1 ) φ(x N +1 )T w
£ ¤
=
−2wT S −
£ 1
N m N + β φ(x N +1 ) t N +1
¤
+const
And
¡ 1
m N +1 = S N +1 S −
N m N + β φ(x N +1 ) t N +1
¢
p(w) = N (m N , S N )
p(w| x N +1 , t N +1 ) = N (Σ φ(x N +1 )β t N +1 + S N −1 m N , Σ)
© ª
Where Σ = (S N −1 + φ(x N +1 )βφ(x N +1 )T )−1 , and we can see that the result
is exactly the same as the one we obtained in the previous problem.
81
And
p(w|t, α, β) = N (w| m N , S N )
Where m N , S N are given by (3.53) and (3.54). As what we do in previous
problem, we can rewrite p(t|w, β) as:
And then we take advantage of (2.113), (2.114) and (2.115), we can obtain:
φ(x)T m N = m N T φ(x)
1 −1 T
S−
N +1 = S N + β φ(x N +1 ) φ(x N +1 )
βS N φ(x N +1 )φ(x N +1 )T S N
= SN −
1 + βφ(x N +1 )T S N φ(x N +1 )
N
N (t n |wT φ(xn ), β−1 )
Y
p(t|X, w, β) =
n=1
N £ β
β1/2 exp − (t n − wT φ(xn ))2
Y ¤
∝ (∗∗)
n=1 2
According to Bayesian Inference, we have p(w, β|t) ∝ p(t|X, w, β)× p(w, β).
We first focus on the quadratic term with regard to w in the exponent.
β N β
quadratic term = − wT S 0 −1 w + − wT φ(xn )φ(xn )T w
X
2 n=1 2
β N
= − wT S 0 −1 + φ(xn )φ(xn )T w
£ X ¤
2 n=1
Where the first term is generated by (∗), and the second by (∗∗). By now,
we know that:
N
S N −1 = S 0 −1 + φ(xn )φ(xn )T
X
n=1
N
linear term = β m 0 T S 0 −1 w + β t n φ(xn )T w
X
n=1
N
= β m 0 T S 0 −1 + t n φ(xn )T w
£ X ¤
n=1
Again, the first term is generated by (∗), and the second by (∗∗). We can
also obtain:
N
m N T S N −1 = m 0 T S 0 −1 + t n φ(xn )T
X
n=1
Which gives:
N
m N = S N S 0 −1 m 0 +
X
t n φ(xn )
£ ¤
n=1
83
β β XN
constant term = (− m 0 T S 0 −1 m 0 − b 0 β) − t2
2 2 n=1 n
£1 1 XN
= −β m 0 T S 0 −1 m 0 + b 0 + t2n
¤
2 2 n=1
1 1 1 XN
m N T S N −1 m N + b N = m 0 T S 0 −1 m 0 + b 0 + t2
2 2 2 n=1 n
Which gives :
1 1 XN 1
bN = m 0 T S 0 −1 m 0 + b 0 + t2 − m N T S N −1 m N
2 2 n=1 n 2
We know that:
and that:
We go back to (∗), and first deal with the integral with regard to w:
Z Z
N (t|φ(x)T w, β−1 ) N (w| m N , β−1 S N ) dw Gam(β|a N , b N ) d β
£ ¤
p(t|X, t) =
Z
= N (t|φ(x)T m N , β−1 + φ(x)T β−1 S N φ(x)) Gam(β|a N , b N ) d β
Z
N t|φ(x)T m N , β−1 (1 + φ(x)T S N φ(x)) Gam(β|a N , b N ) d β
£ ¤
=
84
Where we have used (2.113), (2.114) and (2.115). Then, we follow (2.158)-
(2.160), we can see that p(t|X, t) = St(t|µ, λ, v), where we have defined:
aN £ ¤−1
µ = φ(x)T m N , λ= · 1 + φ(x)T S N φ(x) , v = 2a N
bN
Firstly, according to (3.16), if we use the new orthonormal basis set spec-
ified in the problem to construct Φ, we can obtain an important property:
ΦT Φ = I. Hence, if α = 0, together with (3.54), we know that SN = 1/β.
Finally, according to (3.62), we can obtain:
We know that
N
N (φ(xn )T w, β−1 ) ∝ N (Φw, β−1 I)
Y
p(t|w, β) =
n=1
And
p(w|α) = N (0, α−1 I)
Comparing them with (2.113), (2.114) and (2.115), we can obtain:
We know that:
N
N (φ(xn )T w, β−1 )
Y
p(t|w, β) =
n=1
N 1 1 T 2
Y
φ
© ª
= exp − (t n − (x n ) w)
n=1 (2πβ )
−1 1/2 2β−1
β β N
) N /2 exp
− (t n − φ(xn )T w)2
©X ª
= (
2π n=1 2
β N /2 © β
= ( ) exp − ||t − Φw||2
ª
2π 2
85
And that:
β α
E(w) = ||t − Φw||2 + wT w
2 2
β T α
= (t t − 2tT Φw + wT ΦT Φw) + wT w
2 2
1£ T
w (βΦ Φ + αI)w − 2βt Φw + βtT t
T T
¤
=
2
Observing the equation above, we see that E(w) contains the following
term :
1
(w − m N )T A(w − m N ) (∗)
2
Now, we need to solve A and m N . We expand (∗) and obtain:
1 T
(∗) = (w Aw − 2m N T Aw + m N T Am N )
2
We firstly compare the quadratic term, which gives:
A = βΦT Φ + αI
m N T A = βtT Φ
m N = βA−1 ΦT t
86
You can just follow the steps from (3.87) to (3.92), which is already very
clear.
Problem 3.21 Solution
Let’s first prove (3.117). According to (C.47) and (C.48), we know that if A
is a M × M real symmetric matrix, with eigenvalues λ i , i = 1, 2, ..., M, |A| and
Tr(A) can be written as:
M
Y M
X
|A| = λi , Tr(A) = λi
i =1 i =1
d £YM ¤ X M d XM 1
left side = ln (α + λ i ) = ln(α + λ i ) =
dα i =1 i =1 d α i =1 α + λ i
87
d
A−1 A = A−1 I = A−1
dα
For the symmetric matrix A, its inverse A−1 has eigenvalues 1 / (α+λ i ) , i =
1, 2, ..., M. Therefore,
d M 1
Tr(A−1
X
A) =
dα i =1 α + λ i
The last two terms in the equation above can be further written as:
β d d α ©β d d α ª dm N
||t − Φ m N ||2 + mN T mN = ||t − Φ m N ||2 + mN T mN ·
2 dβ dβ 2 2 dm N dm N 2 dβ
©β α dm N
[−2ΦT (t − Φ m N )] + 2m N ·
ª
=
2 2 dβ
ª dm N
− βΦT (t − Φ m N ) + α m N ·
©
=
dβ
dm N
− βΦT t + (αI + βΦT Φ)m N ·
© ª
=
dβ
ª dm N
− βΦT t + Am N ·
©
=
dβ
= 0
d 1 1 XN
E(m N ) = ||t − Φ m N ||2 = (t n − m N T φ(xn ))2
dβ 2 2 n=1
d 1 γ
( ln|A|) =
dβ 2 2β
m N = SN (S0 −1 m 0 + ΦT t)
S N −1 = S0 −1 + ΦT Φ
N
a N = a0 +
2
1 N
b N = b 0 + (m 0 T S0 −1 m 0 − m N T SN −1 m N + t2n )
X
2 n=1
Which are exactly the same as those in Prob.3.12, and then we evaluate
the integral, taking advantage of the normalized property of multivariate
Gaussian Distribution and Gamma Distribution.
a
b0 0 2π M /2
Z
1/2
p(t) = ( ) | SN | βa N −1+ M /2 exp(− b N β) d β
(2π)( M + N )/2 |S0 |1/2 β
a
b0 0
Z
M /2 1/2
= (2π) |SN | βa N −1 exp(− b N β) d β
(2π)( M + N )/2 |S0 |1/2
a0
1 |SN |1/2 b 0 Γ(a N )
= a
(2π) N /2 |S0 |1/2 b NN Γ(b N )
89
Just as required.
Let’s just follow the hint and we begin by writing down expression for the
likelihood, prior and posterior PDF. We know that p(t|w, β) = N (t|Φw, β−1 I).
What’s more, the form of the prior and posterior are quite similar:
p(w, β) = N (w|m0 , β−1 S0 ) Gam(β|a 0 , b 0 )
And
p(w, β|t) = N (w|mN , β−1 SN ) Gam(β|a N , b N )
Where the relationships among those parameters are shown in Prob.3.12,
Prob.3.23. Now according to (3.119), we can write:
N (w|m0 , β−1 S0 ) Gam(β|a 0 , b 0 )
p(t) = N (t|Φw, β−1 I)
N (w|mN , β−1 SN ) Gam(β|a N , b N )
a
N (w|m0 , β−1 S0 ) b 0 β 0 exp(− b 0 β) / Γ(a 0 )
0 a −1
−1
= N (t|Φw, β I)
N (w|mN , β−1 SN ) b aNN βa N −1 exp(− b N β) / Γ(a N )
a
N (w|m0 , β−1 S0 ) b 0 Γ(a N ) a0 −a N
0
We look back to the previous problem and we notice that at the last step
in the deduction of p(t), we complete the square according to w. And if we
carefully compare the left and right side at the last step, we can obtain :
© β © β
exp − (t − Φw)T (t − Φw) exp − (w − m 0 )T S0 −1 (w − m 0 )
ª ª
2 2
© β
= exp − (w − m N )T SN −1 (w − m N ) exp − (b N − b 0 )β
ª © ª
2
90
β |SN |1/2
) N /2 exp − (b N − b 0 )β
© ª
Gaussian terms = (
2π |S0 |1/2
If the convex hulls of {x_n} and {y_n} intersect, we know that there will be a point z which can be written as z = Σ_n α_n x_n and also as z = Σ_n β_n y_n. Hence we can obtain:

ŵ^T z + w_0 = ŵ^T ( Σ_n α_n x_n ) + w_0
            = Σ_n α_n ( ŵ^T x_n ) + ( Σ_n α_n ) w_0
            = Σ_n α_n ( ŵ^T x_n + w_0 )    (∗)

where we have used Σ_n α_n = 1, and similarly ŵ^T z + w_0 = Σ_n β_n ( ŵ^T y_n + w_0 ). Now, if {x_n} and {y_n} were also linearly separable, we would have ŵ^T x_n + w_0 > 0 for every x_n and ŵ^T y_n + w_0 < 0 for every y_n, so that (∗) would force ŵ^T z + w_0 to be simultaneously positive and negative, a contradiction. Hence the two sets cannot both have intersecting convex hulls and be linearly separable.
E_D(W̃) = (1/2) Tr{ (XW + 1 w_0^T − T)^T (XW + 1 w_0^T − T) }

Differentiating with respect to w_0 gives:

∂E_D(W̃)/∂w_0 = 2N w_0 + 2 (XW − T)^T 1

where we have used

∂/∂X Tr[ (AXB + C)(AXB + C)^T ] = 2 A^T (AXB + C) B^T

We set the derivative equal to 0, which gives:

w_0 = −(1/N) (XW − T)^T 1 = t̄ − W^T x̄

where we have denoted:

t̄ = (1/N) T^T 1 ,    and    x̄ = (1/N) X^T 1

Minimizing over W then gives

W = X̂^† T̂ ,    where    X̂ = X − X̄ ,    T̂ = T − T̄

Hence the model prediction is

y(x) = W^T x + w_0
     = W^T x + t̄ − W^T x̄
     = t̄ + W^T (x − x̄)

Since every target satisfies a^T t_n + b = 0, we have

a^T t̄ = (1/N) a^T T^T 1 = (1/N) Σ_{n=1}^{N} a^T t_n = −b

Therefore,

a^T y(x) = a^T [ t̄ + W^T (x − x̄) ]
         = a^T t̄ + a^T W^T (x − x̄)
         = −b + a^T T̂^T ( X̂^† )^T (x − x̄)
         = −b

where the last step uses a^T T̂^T = 0, which follows because a^T t_n = −b for every n, so that a^T T^T = −b 1^T and subtracting the mean T̄ removes this constant.
w = −(1/(2λ)) (m_2 − m_1) ∝ (m_2 − m_1)

Problem 4.5 Solution

J(w) = (m_2 − m_1)^2 / ( s_1^2 + s_2^2 )
     = ( w^T (m_2 − m_1) )^2 / [ Σ_{n∈C_1} (w^T x_n − m_1)^2 + Σ_{n∈C_2} (w^T x_n − m_2)^2 ]

The numerator can be written as w^T S_B w, where

S_B = (m_2 − m_1)(m_2 − m_1)^T

and the denominator as w^T S_W w with S_W given by (4.28), which recovers (4.26). Just as required.
where in the second-to-last step we have rearranged the terms according to x, i.e., into its quadratic, linear and constant parts. We have also defined:

w = Σ^{−1} (µ_1 − µ_2)

and

w_0 = −(1/2) µ_1^T Σ^{−1} µ_1 + (1/2) µ_2^T Σ^{−1} µ_2 + ln[ p(C_1) / p(C_2) ]

Finally, since p(C_1|x) = σ(a) as stated in (4.57), we have p(C_1|x) = σ(w^T x + w_0), just as required.
∂L/∂π_k = Σ_{n=1}^{N} t_nk / π_k + λ

∂ ln p / ∂Σ = Σ_{n=1}^{N} Σ_{k=1}^{K} t_nk ( −(1/2) Σ^{−1} ) − (1/2) ∂/∂Σ Σ_{n=1}^{N} Σ_{k=1}^{K} t_nk (φ_n − µ_k)^T Σ^{−1} (φ_n − µ_k)
            = − Σ_{n=1}^{N} Σ_{k=1}^{K} (t_nk / 2) Σ^{−1} − (1/2) ∂/∂Σ Σ_{k=1}^{K} Σ_{n=1}^{N} t_nk (φ_n − µ_k)^T Σ^{−1} (φ_n − µ_k)
            = − (N/2) Σ^{−1} − (1/2) ∂/∂Σ Σ_{k=1}^{K} N_k Tr( Σ^{−1} S_k )
            = − (N/2) Σ^{−1} + (1/2) Σ_{k=1}^{K} N_k Σ^{−1} S_k Σ^{−1}

where

S_k = (1/N_k) Σ_{n=1}^{N} t_nk (φ_n − µ_k)(φ_n − µ_k)^T

Now we set the derivative equal to 0 and rearrange the equation, which gives:

Σ = Σ_{k=1}^{K} (N_k / N) S_k

Problem 4.11 Solution
p(φ|C_k) = ∏_{m=1}^{M} ∏_{l=1}^{L} µ_kml^{φ_ml}

Note that here only one of the values among φ_m1, φ_m2, ..., φ_mL is 1 and the others are all 0, because we have used a 1-of-L binary coding scheme; we have also taken advantage of the assumption that the M components of φ are independent conditioned on the class C_k. We substitute the expression above into (4.63), which gives:

a_k = Σ_{m=1}^{M} Σ_{l=1}^{L} φ_ml · ln µ_kml + ln p(C_k)
Based on the definition, i.e., (4.59), we know that the logistic sigmoid has the form:

σ(a) = 1 / ( 1 + exp(−a) )

Just as required.
Since y_n is the output of the logistic sigmoid function, we know that 0 < y_n < 1 and hence y_n(1 − y_n) > 0. Then we use (4.97): for an arbitrary non-zero real vector a ≠ 0, we have:

a^T H a = a^T [ Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T ] a
        = Σ_{n=1}^{N} y_n (1 − y_n) (φ_n^T a)^T (φ_n^T a)
        = Σ_{n=1}^{N} y_n (1 − y_n) b_n^2

where we have denoted b_n = φ_n^T a.
We still denote yn = p(t = 1|φn ), and then we can write down the log
likelihood by replacing t n with πn in (4.89) and (4.90).
N
X
ln p(t|w) = { πn ln yn + (1 − πn ) ln(1 − yn ) }
n=1
Problem 4.17 Solution
1 1
Φ0 (a) = N (θ |0, 1)¯θ=a = p exp(− a2 )
¯
2π 2
N a2
X yn − t n exp(− 2n )
∇w ln p = p φn
n=1 yn (1 − yn ) 2π
And
a2 a2
∂ exp(− n ) ∂ exp(− n ) ∂a n
{ p 2 } = { p 2 }
∂w 2π ∂a n 2π ∂w
2
an a
= − p exp(− n )φn
2π 2
a2 a2 a2
∂ yn − t n exp(− 2n ) ∂ yn − t n exp(− 2n ) yn − t n ∂ exp(− 2n )
{ p } = { } p + { p }
∂w yn (1 − yn ) 2π ∂w yn (1 − yn ) 2π yn (1 − yn ) ∂w 2π
a2 a2
£ yn2 + t n − 2yn t n exp(− 2n ) ¤ exp(− 2n )
= p − a n (yn − t n ) p φn
yn (1 − yn ) 2π 2π yn (1 − yn )
H = ∇∇w ln p
N ∂ a2
X yn − t n exp(− 2n )
= { p } · φn
n=1 ∂w yn (1 − yn ) 2π
N £ y2 + t − 2y t exp(− n ) a2 a2
n n n n 2 ¤ exp(− 2n )
φn φn T
X
= p − a n (yn − t n ) p
n=1 yn (1 − yn ) 2π 2π yn (1 − yn )
K X
K N
uT ynk (I k j − yn j ) φn φn T }uk
X X
(∗) = j {−
j =1 k=1 n=1
K X N
K X
T
uT
X
= j {− ynk (I k j − yn j ) φ n φ n }uk
j =1 k=1 n=1
K X N
K X K X N
K X
T T
uT uT
X X
= j {− ynk I k j φ n φ n }uk + j { ynk yn j φ n φ n }uk
j =1 k=1 n=1 j =1 k=1 n=1
K X
N K X
K X
N
T T
uT yn j uT
X X
= k {− ynk φ n φ n }uk + j { φ n φ n } ynk uk
k=1 n=1 j =1 k=1 n=1
It is quite obvious.

Φ(a) = ∫_{−∞}^{a} N(θ|0, 1) dθ
     = 1/2 + ∫_{0}^{a} N(θ|0, 1) dθ
     = 1/2 + (1/√(2π)) ∫_{0}^{a} exp(−θ^2/2) dθ
     = 1/2 + (1/√(2π)) · √(π/2) · √(2/π) ∫_{0}^{a} exp(−θ^2/2) dθ
     = (1/2) [ 1 + (1/√2) · (2/√π) ∫_{0}^{a} exp(−θ^2/2) dθ ]
     = (1/2) { 1 + (1/√2) erf(a) }
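The identity can be checked numerically. Note that scipy's erf is the standard error function (defined with exp(−t^2)), whereas the erf used above is defined with exp(−θ^2/2); in terms of the standard one the same statement reads Φ(a) = (1/2)[1 + erf(a/√2)]. The sketch below is only an illustrative check.

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

a = np.linspace(-3, 3, 7)
lhs = norm.cdf(a)                        # Phi(a), the probit function
rhs = 0.5 * (1.0 + erf(a / np.sqrt(2)))  # standard erf; equivalent to (1/2){1 + (1/sqrt(2)) erf_PRML(a)}
print(np.allclose(lhs, rhs))             # True
```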
Using the Laplace approximation, the model evidence can be written as:

p(D) = (2π)^{M/2} / |A|^{1/2} · f(θ_MAP)
     = (2π)^{M/2} / |A|^{1/2} · p(D|θ_MAP) p(θ_MAP)

Taking the logarithm gives:

ln p(D) = ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π − (1/2) ln|A|

Just as required.

Substituting the Gaussian prior p(θ) = N(θ|m, V_0), we can further write:

ln p(D) = ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π − (1/2) ln|A|
        = ln p(D|θ_MAP) − (M/2) ln 2π − (1/2) ln|V_0| − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) + (M/2) ln 2π − (1/2) ln|A|
        = ln p(D|θ_MAP) − (1/2) ln|V_0| − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln|A|
where we have used the definition of the multivariate Gaussian distribution. Then, from (4.138), we can write:

A = H + V_0^{−1}

so the expression above becomes:

ln p(D) = ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln{ |V_0| · |H + V_0^{−1}| }
        = ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln|V_0 H + I|
        ≈ ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln|V_0| − (1/2) ln|H|
        ≈ ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln|H| + const

where we have used the determinant property |A|·|B| = |AB|, and the fact that the prior is broad, i.e., I can be neglected with respect to V_0 H. What's more, since the prior is fixed in advance, we can view V_0 as constant. And if the data set is large, we can write:

H = Σ_{n=1}^{N} H_n = N Ĥ

where Ĥ = (1/N) Σ_{n=1}^{N} H_n, and then

ln p(D) ≈ ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln|H| + const
        ≈ ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (1/2) ln|N Ĥ| + const
        ≈ ln p(D|θ_MAP) − (1/2)(θ_MAP − m)^T V_0^{−1} (θ_MAP − m) − (M/2) ln N − (1/2) ln|Ĥ| + const
        ≈ ln p(D|θ_MAP) − (M/2) ln N
This is because when N >> 1, other terms can be neglected.
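The final line is the BIC criterion. As a hedged illustration only (the data set, the Gaussian model and the parameter count M = 2 are all assumptions, not from the text), the sketch below evaluates ln p(D|θ) − (M/2) ln N for a maximum-likelihood Gaussian fit.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=200)     # hypothetical data set D

mu_ml, sigma_ml = x.mean(), x.std()              # ML estimates of the two free parameters
log_lik = norm.logpdf(x, mu_ml, sigma_ml).sum()  # ln p(D | theta)
M, N = 2, x.size                                 # number of parameters, number of data points
bic = log_lik - 0.5 * M * np.log(N)              # ln p(D|theta) - (M/2) ln N
print(bic)
```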
We first need to obtain the expression for the first derivative of the probit function Φ(λa) with respect to a. According to (4.114), we can write down:

d/da Φ(λa) = dΦ(λa)/d(λa) · d(λa)/da = (λ/√(2π)) exp{ −(λa)^2 / 2 }

so its value at the origin is λ/√(2π). The derivative of the logistic sigmoid at the origin is σ(0)[1 − σ(0)] = 1/4, where we have used σ(0) = 0.5. Setting the two derivatives at the origin equal, we have:

λ/√(2π) = 1/4

i.e., λ = √(2π)/4, and hence λ^2 = π/8, just as required.
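How good the resulting approximation σ(a) ≈ Φ(λa) is can be seen with a short numerical comparison (a sketch only; the grid of points is arbitrary):

```python
import numpy as np
from scipy.stats import norm

lam = np.sqrt(np.pi / 8.0)                 # lambda^2 = pi / 8
a = np.linspace(-5, 5, 101)
sigmoid = 1.0 / (1.0 + np.exp(-a))
probit = norm.cdf(lam * a)                 # Phi(lambda * a)
print(np.max(np.abs(sigmoid - probit)))    # maximum absolute gap, roughly 0.02
```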
±
We will prove (4.152) in a simpler and more intuitive way. But first we need to prove a trivial yet useful statement: suppose we have a random variable following a normal distribution, X ∼ N(X|µ, σ^2); then the probability of X ≤ x is P(X ≤ x) = Φ((x − µ)/σ), where x is a given real number. We can see this by writing down the integral:

P(X ≤ x) = ∫_{−∞}^{x} (1/√(2πσ^2)) exp[ −(1/(2σ^2))(X − µ)^2 ] dX
         = ∫_{−∞}^{(x−µ)/σ} (1/√(2πσ^2)) exp( −γ^2/2 ) σ dγ
         = ∫_{−∞}^{(x−µ)/σ} (1/√(2π)) exp( −γ^2/2 ) dγ
         = Φ( (x − µ)/σ )
Where we have changed the variable X = µ + σγ. Now consider two ran-
dom variables X ∼ N (0, λ−2 ) and Y ∼ N (µ, σ2 ). We first calculate the condi-
tional probability P(X ≤ Y | Y = a):
a−0
P(X ≤ Y | Y = a) = P(X ≤ a) = Φ( ) = Φ(λa)
λ−1
Together with Bayesian Formula, we can obtain:
Z +∞
P(X ≤ Y ) = P(X ≤ Y | Y = a) pd f (Y = a) dY
−∞
Z +∞
= Φ(λa) N (a|µ, σ2 ) da
−∞
Where pd f (·) denotes the probability density function and we have also
used pd f (Y ) = N (µ, σ2 ). What’s more, we know that X − Y should also sat-
isfy normal distribution, with:
E[X − Y ] = E[X ] − E[Y ] = 0 − µ = −µ
And
var[X − Y ] = var[X ] + var[Y ] = λ−2 + σ2
Therefore, X − Y ∼ N (−µ, λ−2 + σ2 ) and it follows that:
P(X − Y ≤ 0) = Φ( (0 − (−µ)) / √(λ^{−2} + σ^2) ) = Φ( µ / √(λ^{−2} + σ^2) )

Since P(X ≤ Y) = P(X − Y ≤ 0), we obtain what was required.
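The identity ∫ Φ(λa) N(a|µ, σ^2) da = Φ( µ / √(λ^{−2} + σ^2) ) can also be checked by Monte Carlo. The sketch below uses arbitrary example values of λ, µ and σ (my own assumptions) and compares the sample average of Φ(λa) against the closed form.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
lam, mu, sigma = 0.7, 0.3, 1.5                        # hypothetical values
a = rng.normal(mu, sigma, size=1_000_000)
mc = norm.cdf(lam * a).mean()                         # Monte Carlo estimate of the integral
closed = norm.cdf(mu / np.sqrt(lam**-2 + sigma**2))   # right-hand side of (4.152)
print(mc, closed)                                     # the two agree to about three decimals
```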
tanh(a) = ( e^a − e^{−a} ) / ( e^a + e^{−a} )
        = −1 + 2e^a / ( e^a + e^{−a} )
        = −1 + 2 / ( 1 + e^{−2a} )
        = 2σ(2a) − 1
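A one-line numerical confirmation of this identity (purely illustrative):

```python
import numpy as np

a = np.linspace(-4, 4, 9)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
print(np.allclose(np.tanh(a), 2.0 * sigma(2.0 * a) - 1.0))   # True
```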
For the network with tanh hidden units, the k-th output activation is:

a_k^{(t)} = Σ_{j=1}^{M} w_kj^{(2t)} tanh( a_j^{(t)} ) + w_k0^{(2t)}
          = Σ_{j=1}^{M} w_kj^{(2t)} [ 2σ( 2a_j^{(t)} ) − 1 ] + w_k0^{(2t)}
          = Σ_{j=1}^{M} 2 w_kj^{(2t)} σ( 2a_j^{(t)} ) + [ w_k0^{(2t)} − Σ_{j=1}^{M} w_kj^{(2t)} ]

For the network with logistic-sigmoid hidden units, the corresponding activation is:

a_k^{(s)} = Σ_{j=1}^{M} w_kj^{(2s)} σ( a_j^{(s)} ) + w_k0^{(2s)}

To make the two networks equivalent, i.e., a_k^{(s)} = a_k^{(t)}, we should make sure that:

a_j^{(s)} = 2 a_j^{(t)} ,    w_kj^{(2s)} = 2 w_kj^{(2t)} ,    w_k0^{(2s)} = w_k0^{(2t)} − Σ_{j=1}^{M} w_kj^{(2t)}

and, for the first-layer weights,

w_ji^{(1s)} = 2 w_ji^{(1t)} ,    w_j0^{(1s)} = 2 w_j0^{(1t)}
E(w, β) = −ln p(T|X, w, β) = (β/2) Σ_{n=1}^{N} ( y(x_n, w) − t_n )^T ( y(x_n, w) − t_n ) − (NK/2) ln β + const

Here we have used const to denote the terms independent of both w and β. Note that here we have used the definition of the multivariate Gaussian distribution. What's more, we see that the covariance matrix β^{−1}I and the weight parameter w have decoupled, which is distinct from the next problem. We can first solve for w_ML by minimizing the first term on the right of the equation above, or equivalently (5.11), i.e., imagining that β is fixed. Then, setting the derivative of E(w, β) with respect to β to zero, we obtain (5.17) and hence β_ML.
Following the process in the previous question, we first write down the
negative logarithm of the likelihood function.
1 XN © ª N
E(w, Σ) = [ y(xn , w) − tn ]T Σ−1 [ y(xn , w) − tn ] + ln |Σ| + const (∗)
2 n=1 2
Note here we have assumed Σ is unknown and const denotes the term
independent of both w and Σ. In the first situation, if Σ is fixed and known,
the equation above will reduce to:
1 XN ©
[ y(xn , w) − tn ]T Σ−1 [ y(xn , w) − tn ] + const
ª
E(w) =
2 n=1
Note that here we use t to denote the observed target label, t r to denote
its real label, and that our network is aimed to predict the real label t r not t,
i.e., p(t r = 1|x, w) = y(x, w), hence we see that:
Where y is short for y(x, w). Therefore, taking the negative logarithm, we
can obtain the error function:
N
t 1− t ª
ln (1 − ²) · ynn (1 − yn )1− t n + ² · (1 − yn ) t n yn n
X ©
E(w) = −
n=1
y_nk = y_k(x_n, w)

We know that y_k = σ(a_k), where σ(·) represents the logistic sigmoid function. Moreover,

dσ/da = σ(1 − σ)

Therefore,

dE(w)/da_k = −t_k (1/y_k) y_k (1 − y_k) + (1 − t_k) (1/(1 − y_k)) y_k (1 − y_k)
           = y_k (1 − y_k) [ (1 − t_k)/(1 − y_k) − t_k/y_k ]
           = (1 − t_k) y_k − t_k (1 − y_k)
           = y_k − t_k

Just as required.

Here we have used the fact that I_kj = 1 ≠ 0 only when k = j, and that Σ_{k=1}^{K} t_kn = 1.
We know that the logistic sigmoid function σ(a) ∈ [0, 1], therefore if we
perform a linear transformation h(a) = 2σ(a) − 1, we can find a mapping func-
tion h(a) from (−∞, +∞) to [−1, 1]. In this case, the conditional distribution
of targets given inputs can be similarly written as:
p(t|x, w) = [ (1 + y(x, w))/2 ]^{(1+t)/2} [ (1 − y(x, w))/2 ]^{(1−t)/2}

where [1 + y(x, w)]/2 represents the conditional probability p(C_1|x). Since now y(x, w) ∈ [−1, 1], we also need to perform this linear transformation to make it satisfy the constraints of a probability. Then we can further obtain:

E(w) = − Σ_{n=1}^{N} { (1 + t_n)/2 · ln[ (1 + y_n)/2 ] + (1 − t_n)/2 · ln[ (1 − y_n)/2 ] }
     = −(1/2) Σ_{n=1}^{N} { (1 + t_n) ln(1 + y_n) + (1 − t_n) ln(1 − y_n) } + N ln 2
Therefore, every λ i should be positive. On the other hand, If all the eigen-
values λ i are positive, from (5.38) and (5.39), we see that H is positive defi-
nite.
It is obvious. We follow (5.35) and then write the error function in the
form of (5.36). To obtain the contour, we enforce E(w) to equal to a constant
C.
1X
E(w) = E(w∗ ) + λ i α2i = C
2 i
We rearrange the equation above, and then obtain:
λ i α2i = B
X
i
E n (w ji + ²) − E n (w ji − ²) = 2²E 0n (w ji ) + O(²3 )
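The central-difference formula above is the standard way to check back-propagated gradients numerically. A minimal sketch follows; the error function, its analytic gradient and the chosen weights are hypothetical examples of mine, not from the text.

```python
import numpy as np

def E(w):
    # a hypothetical error function of a weight vector w
    return 0.5 * np.sum(w ** 2) + np.sin(w[0]) * w[1]

def grad_E(w):
    # its analytic gradient, to be verified against central differences
    return np.array([w[0] + np.cos(w[0]) * w[1], w[1] + np.sin(w[0])])

w = np.array([0.3, -1.2])
eps = 1e-6
numerical = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                      for e in np.eye(len(w))])       # central differences
print(np.max(np.abs(numerical - grad_E(w))))          # discrepancy of order eps^2
```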
∂a k X ∂a k ∂a j ∂a j
wk j h0 (a j )
X
= = (∗∗)
∂xi j ∂a j ∂ x i j ∂xi
Using recursive formula (∗∗) and then (∗), we can obtain the Jacobian
Matrix.
E = (1/2) Σ_{n=1}^{N} ||y_n − t_n||^2 = (1/2) Σ_{n=1}^{N} Σ_{m=1}^{M} ( y_{n,m} − t_{n,m} )^2

where the subscript m denotes the m-th element of the vector. Then we can write down the Hessian matrix as before:

H = ∇∇E = Σ_{n=1}^{N} Σ_{m=1}^{M} ∇y_{n,m} (∇y_{n,m})^T + Σ_{n=1}^{N} Σ_{m=1}^{M} ( y_{n,m} − t_{n,m} ) ∇∇y_{n,m}

Similarly, we now see that the Hessian matrix can be approximated as:

H ≈ Σ_{n=1}^{N} Σ_{m=1}^{M} b_{n,m} b_{n,m}^T

where b_{n,m} = ∇y_{n,m}.
It is obvious.
∂2 E ∂ 1 ∂y
Z Z
= 2(y − t) p(x, t)dxdt
∂wr ∂ws ∂wr 2 ∂ws
∂ y2 ∂y ∂y ¤
Z Z
£
= (y − t) + p(x, t)dxdt
∂wr ∂ws ∂ws ∂wr
R
Note that in the last step, we have used y = tp(t|x) dt. Then we substi-
tute it into the second derivative, which gives,
∂2 E ∂y ∂y
Z Z
= p(x, t)dxdt
∂wr ∂ws ∂ws ∂wr
∂y ∂y
Z
= p(x)dx
∂ws ∂wr
∂E n ∂E n ∂a k
= = δk x i
skip
∂wki ∂a k ∂wskip
ki
Where the first term on the right side corresponds to those information
conveying from the hidden unit to the output and the second term corre-
sponds to the information conveying directly from the input to output.
XN ∂E
∇E(w) = ∇a n
n=1 ∂a n
N
X ∂ £ ¤
= − t n ln yn + (1 − t n ) ln(1 − yn ) ∇a n
n=1 ∂a n
N © ∂(t ln y ) ∂ y ∂(1 − t n ) ln(1 − yn ) ∂ yn ª
X n n n
= − + ∇a n
n=1 ∂ yn ∂a n ∂ yn ∂a n
N £t −1
X n ¤
= − · yn (1 − yn ) + (1 − t n ) · yn (1 − yn ) ∇a n
n=1 yn 1 − yn
N £
X ¤
= − t n (1 − yn ) − (1 − t n )yn ∇a n
n=1
N
X
= (yn − t n )∇a n
n=1
Where we have used the conclusion of problem 5.6. Now we calculate the
second derivative.
N ©
X ª
∇∇E(w) = yn (1 − yn )∇a n ∇a n + (yn − t n )∇∇a n
n=1
Similarly, we can drop the last term, which gives exactly what has been
asked.
N X
X K
E(w) = − t nk ln ynk
n=1 k=1
Here we assume that the output of the network has K units in total and
there are W weights parameters in the network. WE first calculate the first
derivative:
XN dE
∇E = · ∇an
n=1 da n
XN £ d K
X ¤
= − ( t nk ln ynk ) · ∇an
n=1 da n k=1
N
X
= cn · ∇an
n=1
∂ K
X
c n, j = − ( t nk ln ynk )
∂a j k=1
XK ∂
= − (t nk ln ynk )
k=1 ∂a j
K t
X nk
= − ynk (I k j − yn j )
y
k=1 nk
K
X K
X
= − t nk I k j + t nk yn j
k=1 k=1
K
X
= − t n j + yn j ( t nk )
k=1
= yn j − t n j
Here dcn /dan is a matrix with size K × K. Therefore, the second term can
be neglected as before, which gives:
N dc
X n
H= ( ∇an ) · ∇an
n=1 da n
K
b N +1,k bT T
X
H N +1,K = H N,K + N +1,k = H N,K + B N +1 B N +1
k=1
NX
+1
H N +1,K +1 = H N +1,K + bn,K +1 bT T
n,K +1 = H N +1,K + BK +1 BK +1
n=1
1
Where H−
N +1,K
is defined by (∗). If we substitute (∗) into the expression
1 1
above, we can obtain the relationship between H−
N +1,K +1
and H−
N,K
.
∂2 E n ∂ ∂E n
= ( )
∂w(2)
kj
∂w(2)
k0 j 0
∂wk j ∂w(2)
(2)
k0 j 0
∂ ∂ E n ∂ a k0
= ( )
∂w(2) ∂a k0 ∂w(2)0 0
kj k j
∂ ∂ E n ∂ j 0 w k0 j 0 z j 0
P
= ( )
∂w(2) ∂ a k0 ∂w(2)
kj k0 j 0
∂ ∂E n
= ( z j0 )
∂w(2) ∂ a k0
kj
∂ ∂E n ∂E n ∂ z j0
= ( )z j0 +
∂w(2) ∂a k0 ∂a k0 ∂w(2)
kj kj
∂ ∂E n ∂a k
= ( ) z j0 + 0
∂a k ∂a k0 ∂w(2)
kj
∂ ∂E n
= ( )z j z j0
∂ a k ∂ a k0
= z j z j0 M kk0
∂2 E n ∂ ∂E n
= ( )
∂w(1)
ji
∂w(1)
j0 i0
∂w(1)
ji
∂w(1)
j0 i0
∂ X ∂ E n ∂ a k0
= ( )
∂w(1) 0 ∂a k0 ∂w(1)
ji k 0 0 ji
∂ ∂E n
w(2) h0 (a j0 )x i0 )
X
= ( k0 j 0
(1)
k0 ∂w ji ∂ a k0
∂ ∂E n
h0 (a j0 )x i0 w(2)
X
= ( k0 j 0
)
k0 ∂w(1) ∂ a k0
ji
X ∂ ∂E n (2) ∂a k
h0 (a j0 )x i0
X
= ( wk0 j0 ) (1)
k0 k ∂ a k ∂ a k0 ∂w ji
X ∂ ∂E n (2)
h0 (a j0 )x i0 wk0 j0 ) · (w(2) h0 (a j )x i )
X
= (
k 0 k ∂ a k ∂ a k 0
kj
∂2 E n ∂ ∂E n
w(2) h0 (a j )x i0 )
X
= ( k0 j
∂w(1) ∂w(1) k0 ∂w ji (1) ∂ a k0
ji ji 0
∂ ∂E n X ∂E n (2) ∂ h0 (a j )
w(2) )h0 (a j ) + x i0
X
= x i0 ( k0 j
( w k0 j )
k0 ∂w ji ∂a k0
(1)
k0 ∂ a k0 ∂w(1)
ji
X ∂E n (2) ∂ h0 (a j )
x i0 x i h0 (a j )h0 (a j ) w(2) (2)
XX
= k0 j
· w M kk 0 + x i 0 ( w k0 j )
k0 ∂ a k0
kj
k0 k ∂w(1)
ji
X ∂E n (2) 00
x i0 x i h0 (a j )h0 (a j ) w(2) (2)
XX
= 0 · w M kk 0 + x i0 ( wk0 j )h (a j )x i
k0 ∂ a k0
k j kj
k0 k
X X (2)
x i0 x i h0 (a j )h0 (a j ) wk0 j · w(2) M kk0 + h00 (a j )x i x i0 δk0 w(2)
X
= kj k0 j
k0 k k0
It seems that what we have obtained is slightly different from (5.94) when j = j'. However, this is not the case, since the summation over k' in the second term of our formulation and the summation over k in the first term of (5.94) are actually the same (i.e., they both represent a summation over all the output units). Combining the cases j = j' and j ≠ j', we obtain (5.94) just as required. Finally, we deal with the third case. Similarly, we first focus on j ≠ j':
∂2 E n ∂ ∂E n
= ( )
∂w(1)
ji
∂w(2)
k j0
∂w(1)
ji
∂w(2)
k j0
∂ ∂E n ∂a k
= ( )
∂w ji (1) ∂a k ∂w(2)0
kj
∂ ∂E n ∂ j0 wk j0 z j0
P
= ( )
∂w(1) ∂a k ∂w(2)
ji k j0
∂ ∂E n
= ( z j0 )
∂w(1) ∂a k
ji
X ∂ ∂ E n ∂ a k0
= z j0 ( ) (1)
k0 ∂a k0 ∂a k ∂w ji
z j0 M kk0 w(2) h0 (a j )x i
X
= k0 j
k0
0
M kk0 w(2)
X
= x i h (a j )z j0 k0 j
k0
Note that in (5.95), there are two typos: (i)H kk0 should be M kk0 . (ii) j should
∂2 E n ∂ ∂E n
= ( )
∂w(1)
ji
∂w(2)
kj
∂w ji ∂w(2)
(1)
kj
∂ ∂E n ∂a k
= ( )
∂w(1) ∂a k ∂w(2)
ji kj
∂ ∂E n ∂ j wk j z j
P
= ( )
∂w(1) ∂a k ∂w(2)
ji kj
∂ ∂E n
= ( z j)
∂w(1) ∂a k
ji
∂ ∂E n ∂E n ∂ z j
= ( )z j +
∂w(1) ∂a k ∂a k w(1)
ji ji
∂E n ∂ z j
x i h0 (a j )z j M kk0 w(2)
X
= k0 j
+
k0 ∂a k w(1)
ji
x i h0 (a j )z j M kk0 w(2) + δk h0 (a j )x i
X
= k0 j
k0
∂2 E n ∂ ∂E n
= ( )
∂ w k0 i 0 ∂ w k j ∂ w k0 i 0 ∂ w k j
∂ ∂E n
= ( z j)
∂wk i ∂a k
0 0
∂ w k0 i 0 ∂ ∂ E n
= zj ( )
∂ a k0 ∂ a k0 ∂ a k
= z j x i0 M kk0
And
∂2 E n ∂
X ∂E n ∂a k
= ( )
∂wk0 i0 ∂w ji ∂wk0 i0 k ∂a k ∂w ji
∂ X ∂E n
= ( wk j h0 (a j )x i )
∂ w k0 i 0 k ∂ a k
X 0 ∂ ∂E n
= h (a j )x i wk j ( )
k ∂ w k0 i 0 ∂ a k
X 0 ∂ ∂ E n a k0
= h (a j )x i wk j ( )
k ∂ a k0 ∂ a k w k0 i 0
X 0
= h (a j )x i wk j M kk0 x i0
k
x i x i0 h0 (a j )
X
= wk j M kk0
k
Finally, we have
∂2 E n ∂ ∂E n
= ( )
∂wk0 i0 wki ∂wk0 i0 ∂wki
∂ ∂E n
= ( xi )
∂ w k0 i 0 ∂ a k
∂ ∂ E n ∂ a k0
= xi ( )
∂ a k0 ∂ a k w k0 i 0
= x i x i0 M kk0
Problem 5.24 Solution
Since we know the gradient of the error function with respect to w is:

∇E = H(w − w*)

the gradient-descent update gives:

w^{(τ)} = w^{(τ−1)} − ρ ∇E = w^{(τ−1)} − ρ H( w^{(τ−1)} − w* )

Writing this in terms of the components w_j along the eigenvectors of H (with eigenvalues η_j) and using w^{(0)} = 0, the first step gives:

w_j^{(1)} = (1 − ρη_j) w_j^{(0)} + ρη_j w_j* = ρη_j w_j* = [ 1 − (1 − ρη_j) ] w_j*

Suppose (5.197) holds for τ; we now prove that it also holds for τ + 1:

w_j^{(τ+1)} = (1 − ρη_j) w_j^{(τ)} + ρη_j w_j*
            = { (1 − ρη_j) [ 1 − (1 − ρη_j)^τ ] + ρη_j } w_j*
            = [ 1 − (1 − ρη_j)^{τ+1} ] w_j*
1 X ∂ ynk ¯¯
Ωn = ( )2
2 k ∂ξ ξ=0
1 X X ∂ ynk ∂ x i ¯¯
= ( )2
2 k i ∂ x i ∂ξ ξ=0
1X X ∂
= ( τi ynk )2
2 k i ∂xi
X ∂z j X ∂ h(a j )
αj = τi = τi
i ∂xi i ∂xi
X ∂ h(a j ) ∂a j
= τi
i ∂a j ∂ x i
∂
h0 (a j ) τ i a j = h0 (a j )β j
X
=
i ∂ x i
Moreover,
∂a j ∂
P
X X i 0 w ji 0 z i 0
βj = τi = τi
i ∂xi i ∂xi
X X ∂w ji0 z i0 X X ∂ z i0
= τi = τ i w ji0
i i0 ∂xi i i0 ∂xi
X X ∂ z i0 X
= w ji0 τi = w ji0 α i0
i0 i ∂xi i0
So far we have proved that (5.204) holds and now we aim to find a forward
propagation formula to calculate Ωn . We firstly begin by evaluating {β j } at
the input units, and then use the first equation in (5.204) to obtain {α j } at the
input units, and then the second equation to evaluate {β j } at the first hidden
layer, and again the first equation to evaluate {α j } at the first hidden layer.
We repeatedly evaluate {β j } and {α j } in this way until reaching the output
∂Ωn ∂ ©1 X ª 1 X ∂(G yk )2
= (G yk )2 =
∂wrs ∂wrs 2 k 2 k ∂wrs
1 X ∂(G yk )2 ∂(G yk ) X ∂G yk
= = G yk
2 k ∂(G yk ) ∂wrs k ∂wrs
£ ∂ yk ¤ X £ ∂ yk ∂a r ¤
G yk G αk G
X
= =
k ∂wrs k ∂a r ∂wrs
αk G δkr z s = αk G [δkr ]z s + G [z s ]δkr
X £ ¤ X © ª
=
k k
X
αk φkr z s + αs δkr
© ª
=
k
Provided with the idea in section 5.3, the backward propagation formula
is easy to derive. We can simply replace E n with yk to obtain a backward
equation, so we omit it here.
Here a(jm) denotes the activation of the jth unit in th mth feature map,
whereas w(im) denotes the ith element of the corresponding feature vector
and finally z(im
j
)
denotes the ith input for the jth unit in the mth feature map.
Note that δ(jm) can be computed recursively from the units in the following
layer.
∂ © wi − µ j
π j N (w i |µ j , σ2j ) = −π j N (w i |µ j , σ2j )
ª
∂w i σ2j
∂E
e ∂E ∂λΩ(w)
= +
∂w i ∂w i ∂w i
( Ã !)
∂E ∂ M
π j N (w i |µ j , σ2j )
X X
= −λ ln
∂w i ∂w i i j =1
( Ã !)
∂E ∂ M
π jN (w i |µ j , σ2j )
X
= −λ ln
∂w i ∂w i j =1
( )
∂E 1 ∂ M
π jN (w i |µ j , σ2j )
X
= − λ PM
∂w i (w i |µ j , σ2j ) ∂w i
j =1 π j N j =1
( )
∂E 1 M wi − µ j
N (w i |µ j , σ2j )
X
= + λ PM πj
∂w i 2 σ2j
j =1 π j N (w i |µ j , σ j ) j =1
PM w i −µ j 2
∂E j =1 π j σ2 N (w i |µ j , σ j )
j
= +λ
∂w i k πk N (w i |µk , σ2k )
P
∂E M
X π j N (w i |µ j , σ2j ) wi − µ j
= +λ
∂w i k πk N (w i |µk , σ2k ) σ2j
P
j =1
∂E M
X wi − µ j
= +λ γ j (w i )
∂w i j =1 σ2j
∂ © wi − µ j
π j N (w i |µ j , σ2j ) = π j N (w i |µ j , σ2j )
ª
∂µ j σ2j
We can derive:
∂E
e ∂λΩ(w)
=
∂µ j ∂µ j
( Ã !)
∂ M
π jN (w i |µ j , σ2j )
X X
= −λ ln
∂µ j i j =1
( Ã !)
X ∂ M
π j N (w i |µ j , σ2j )
X
= −λ ln
i ∂µ j j =1
( )
1 ∂ M
π jN (w i |µ j , σ2j )
X X
= −λ
(w i |µ j , σ2j ) ∂µ j
PM
i j =1 π j N j =1
1 wi − µ j
N (w i |µ j , σ2j )
X
= −λ PM πj 2
i j =1 π j N (w i | µ j , σ 2
j
) σ j
2
X π j N (w i |µ j , σ j ) µ j − wi X µ j − wi
= λ PK 2
= λ γ j (w i )
i k=1 π k N (w i |µ k , σ k )
2 σj i σ2j
We can derive:
∂E
e ∂λΩ(w)
=
∂σ j ∂σ j
( Ã !)
∂ M
π jN (w i |µ j , σ2j )
X X
= −λ ln
∂σ j i j =1
( Ã !)
X ∂ M
π j N (w i |µ j , σ2j )
X
= −λ ln
i ∂σ j j =1
( )
1 ∂ M
π jN (w i |µ j , σ2j )
X X
= −λ
(w i |µ j , σ2j ) ∂σ j
PM
i j =1 π j N j =1
1 ∂ n o
π j N (w i |µ j , σ2j )
X
= −λ PM 2 ∂σ
i j =1 π j N (w i |µ j , σ j ) j
à !
1 1 (w i − µ j )2
π j N (w i |µ j , σ2j )
X
= λ PM −
π N (w | µ , σ 2) σ σ 3
i j =1 j i j j j j
2
π j N (w i |µ j , σ j )
à !
X 1 (w i − µ j )2
= λ PM −
π N (w | µ , σ 2) σ σ3j
i k=1 k i k k j
à !
X 1 (w i − µ j )2
= λ γ j (w i ) −
i σj σ3j
Just as required.
∂πk ∂ exp(η k )
½ ¾
=
∂η j ∂η j k exp(η k )
P
− exp(η k )exp(η j )
= ¤2
k exp(η k )
£P
= −π j πk
∂πk ∂ exp(η k )
½ ¾
=
∂η k ∂η k k exp(η k )
P
= πk − πk πk
If we combine these two cases, we can easily see that (5.208) holds. Now
we prove (5.147).
∂E
e ∂Ω(w)
= λ
∂η j ∂η j
( ( ))
∂ M
π j N (w i |µ j , σ2j )
X X
= −λ ln
∂η j i j =1
( ( ))
X ∂ M
π jN (w i |µ j , σ2j )
X
= −λ ln
i ∂η j j =1
( )
1 ∂ M
πk N (w i |µk , σ2k )
X X
= −λ
(w i |µ j , σ2j ) ∂η j
PM
i j =1 π j N k=1
1 M ∂ ©
πk N (w i |µk , σ2k )
X X
= −λ
ª
∂η j
PM 2
i j =1 π j N (w i |µ j , σ j ) k=1
1 M ∂ © ª ∂πk
πk N (w i |µk , σ2k )
X X
= −λ
∂πk ∂η j
PM 2
i j =1 π j N (w i |µ j , σ j ) k=1
1 M
N (w i |µk , σ2k )(δ jk π j − π j πk )
X X
= −λ PM 2)
i j =1 j π N (w |
i jµ , σ j k=1
( )
1 M
2 2
π j N (w i |µ j , σ j ) − π j πk N (w i |µk , σk ))
X X
= −λ P M 2
i j =1 π j N (w i |µ j , σ j ) k=1
Just as required.
It is trivial. We set the attachment point of the lower arm with the ground
as the origin of the coordinate. We first aim to find the vertical distance from
the origin to the target point, and this is also the value of x2 .
x2 = L 1 sin(π − θ1 ) + L 2 sin(θ2 − (π − θ1 ))
= L 1 sin θ1 − L 2 sin(θ1 + θ2 )
Similarly, we calculate the horizontal distance from the origin to the tar-
get point.
x1 = −L 1 cos(π − θ1 ) + L 2 cos(θ2 − (π − θ1 ))
= L 1 cos θ1 − L 2 cos(θ1 + θ2 )
From these two equations, we can clearly see the ’forward kinematics’ of
the robot arm.
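A minimal sketch of this forward mapping follows; the link lengths and joint angles are arbitrary example values of mine. The printed check uses the law of cosines, which shows that the reach of the arm depends only on θ_2, one reason the inverse problem is multimodal.

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=1.0, L2=0.7):
    # 'forward kinematics' of the two-link arm, using the conventions derived above
    x1 = L1 * np.cos(theta1) - L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) - L2 * np.sin(theta1 + theta2)
    return np.array([x1, x2])

theta1, theta2, L1, L2 = 2.0, 0.8, 1.0, 0.7          # assumed example configuration
x = forward_kinematics(theta1, theta2, L1, L2)
# squared distance to the target depends on theta2 only: |x|^2 = L1^2 + L2^2 - 2 L1 L2 cos(theta2)
print(np.isclose(x @ x, L1**2 + L2**2 - 2 * L1 * L2 * np.cos(theta2)))   # True
```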
∂πk (x)
= δ jk π j (x) − π j (x)πk (x)
∂aπj
1 K £
δ jk π j (xn ) − π j (xn )πk (xn ) N (tn |µk , σ2k )
X ¤
= − PK
(tn |µk , σ2k ) k=1
k=1 π k N
( )
1 K
π j (xn )N (tn |µ j , σ2j ) − π j (xn ) πk (xn )N (tn |µk , σ2k )
X
= − PK 2)
k=1 π k N (t n | µ k , σ k k =1
( )
1 K
2 2
−π j (xn )N (tn |µ j , σ j ) + π j (xn ) πk (xn )N (tn |µk , σk )
X
= PK 2)
k=1 π k N (t n | µ k , σ k k =1
∂E n
= −γ j + π j
∂aπj
Note that our result differs from (5.155) only by the subindex. They are actually the same if we substitute index j with index k in the final expression.
∂ © ª tn − µk
πk N (tn |µk , σ2k ) = πk N (tn |µk , σ2k )
∂µk σk2
One thing worth noticing is that here we focus on the isotropic case, as stated on page 273 of the textbook. To be more precise, N(t_n|µ_k, σ_k^2) should be understood as N(t_n|µ_k, σ_k^2 I). Provided with the equation above, we can further obtain:
( )
∂E n ∂ K
πk N (tn |µk , σ2k )
X
= − ln
∂µk ∂µk k=1
( )
1 ∂ K
πk N (tn |µk , σ2k )
X
= − PK
(tn |µk , σ2k ) ∂µk k=1
k=1 π k N
1 tn − µk
= − PK · 2
πk N (tn |µk , σ2k )
π
k=1 k N (t |
n kµ , σ 2
k
) σ k
tn − µk
= −γk
σ2k
Hence noticing (5.152), the lth element of the result above is what we are
required.
∂E n ∂E n µkl − tl
µ = = γk
∂a ∂µkl σ2k
kl
Problem 5.36 Solution
Note that there is a typo in (5.157) and the underlying reason is that:
= (σ2k )D
|σ2k ID ×D |
First we know two properties for the Gaussian distribution N (t|µ, σ2 I):
Z
E[t] = tN (t|µ, σ2 I) dt = µ
And Z
E[||t||2 ] = ||t||2 N (t|µ, σ2 I) dt = Lσ2 + ||µ||2
Note that there is a typo in (5.160), i.e., the coefficient L in front of σ2k is
missing.
From (5.167) and (5.171), we can write down the expression for the pre-
dictive distribution:
Z
p(t|x, D, α, β) = p(w|D, α, β)p(t|x, w, β) dw
Z
≈ q(w|D)p(t|x, w, β) dw
Z
= N (w|wMAP , A−1 )N (t|gT w − gT wMAP + y(x, wMAP ), β−1 ) dw
Just as required.
(2π)W /2
= p(D |wMAP , β)p(wMAP |α)
|A|1/2
N (2π)W /2
N (t n | y(xn , wMAP ), β−1 )N (wMAP |0, α−1 I)
Y
=
n=1 |A|1/2
Hessian matrix should be derived from (5.24) and the cross entropy in (5.184)
will also be replaced by (5.24).
(2π)W /2
p(D |α) = p(D |wMAP )p(wMAP |α)
|A|1/2
Since we know that the prior p(w|α) follows a Gaussian distribution, i.e.,
(5.162), as stated in the text. Therefore we can obtain:
1
ln p(D |α) = ln p(D |wMAP ) + ln p(wMAP |α) − ln |A| + const
2
α W 1
= ln p(D |wMAP ) − wT w + ln α − ln |A| + const
2 2 2
W 1
= −E(wMAP ) + ln α − ln |A| + const
2 2
Just as required.
1
an = − {wT φ(xn ) − t n }
λ
1
= − {w1 φ1 (xn ) + w2 φ2 (xn ) + ... + w M φ M (xn ) − t n }
λ
w1 w2 wM tn
= − φ1 (xn ) − φ2 (xn ) − ... − φ M (xn ) +
λ λ λ λ
w1 w2 wM
= (c n − )φ1 (xn ) + (c n − )φ2 (xn ) + ... + (c n − )φ M (xn )
λ λ λ
Here we have defined:
t n /λ
cn =
φ1 (xn ) + φ2 (xn ) + ... + φ M (xn )
From what we have derived above, we can see that a n is a linear com-
bination of φ(xn ). What’s more, we first substitute K = ΦΦT into (6.7), and
then we will obtain (6.5). Next we substitute (6.3) into (6.5) we will obtain
(6.2) just as required.
where N is the total number of samples and c_n is the number of times that t_n φ_n has been added from step 0 to step τ + 1. Therefore, it is obvious that we have:
N
X
w= αn t n φn
n=1
In other words, the update process is to add learning rate η to the coeffi-
cient αn corresponding to the misclassified pattern xn , i.e.,
α(nτ+1) = α(nτ) + η
y(x) = f ( wT φ(x) )
N
α n t n φT
X
= f( n φ(x) )
n=1
N
X
= f( αn t n k(xn , x) )
n=1
||x − xn ||2 = (x − xn )T (x − xn )
= (xT − xT
n )(x − x n )
= xT x − 2xT T
n x + xn xn
To construct such a matrix, let us suppose that the two eigenvalues are 1 and 2 and that the matrix has the form

[ a  b ]
[ c  d ]

The characteristic equation det(A − λI) = (a − λ)(d − λ) − bc = 0 must hold at λ = 1 and at λ = 2:

(a − 1)(d − 1) − bc = 0    (1)
(a − 2)(d − 2) − bc = 0    (2)

Subtracting, (2) − (1) yields:

a + d = 3

Therefore, we set a = 4 and d = −1. Then we substitute them into (1), and thus we see:

bc = −6

Finally, we choose b = 3 and c = −2. The constructed matrix is:

[  4   3 ]
[ −2  −1 ]
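A quick numerical check that the constructed matrix really has eigenvalues 1 and 2 (illustrative only):

```python
import numpy as np

A = np.array([[4.0, 3.0],
              [-2.0, -1.0]])
print(sorted(np.linalg.eigvals(A).real))   # [1.0, 2.0]
```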
We can obtain:
£p ¤T £p
k(x, x0 ) = ck 1 (x, x0 ) = cφ(x) cφ(x0 )
¤
Just as required.
We now obtain:
n n−1
k(x, x0 ) = a n k 1 (x, x0 ) + a n−1 k 1 (x, x0 ) + ... + a 1 k 1 (x, x0 ) + a 0
By repeatedly using (6.13), (6.17) and (6.18), we can easily verify k(x, x0 )
is a valid kernel. For (6.16), we can use Taylor expansion, and since the
coefficients of Taylor expansion are all positive, we can similarly prove its
validity.
To prove (6.17), we will use the property stated below (6.12). Since we
know k 1 (x, x0 ) and k 2 (x, x0 ) are valid kernels, their Gram matrix K1 and K2
are both positive semidefinite. Given the relation (6.12), it can be easily
shown K = K1 + K2 is also positive semidefinite and thus k(x, x0 ) is also a
valid kernel.
To prove (6.18), we assume the map function for kernel k 1 (x, x0 ) is φ(1) (x),
and similarly φ(2) (x) for k 2 (x, x0 ). Moreover, we further assume the dimension
of φ(1) (x) is M, and φ(2) (x) is N. We expand k(x, x0 ) based on (6.18):
where φ_i^{(1)}(x) is the i-th element of φ^{(1)}(x), and φ_j^{(2)}(x) is the j-th element of φ^{(2)}(x). To be more specific, we have proved that k(x, x') can be written as φ(x)^T φ(x'). Here φ(x) is an MN × 1 column vector, and its k-th element (k = 1, 2, ..., MN) is given by φ_i^{(1)}(x) × φ_j^{(2)}(x). What's more, we can also express i, j in terms of k:

i = (k − 1) div N + 1    and    j = (k − 1) mod N + 1

where div and mod denote integer division and remainder, respectively.
where we have denoted g(φ(x)) = f (x) and now it is obvious that (6.19)
holds. To prove (6.20), we suppose x is a N × 1 column vector and A is a N × N
symmetric positive semidefinite matrix. We know that A can be decomposed
to QBQT . Here Q is a N × Northogonal matrix, and B is a N × N diagonal
matrix whose elements are no less than 0. Now we can derive:
To be more specific, we have proved that k(x, x0 ) = φ(x)T φ(x0 ), and here
φ
p(x) is a N ×p1 column vector, whose ith (i = 1, 2, ..., N) element is given by
B ii yi , i.e., B ii (QT x) i .
We see that if we choose k(x, x0 ) = f (x) f (x0 ) we will always find a solution
y(x) proportional to f (x).
= φ(x)T φ(x0 )
First, let’s explain the problem a little bit. According to (6.27), what we
need to prove here is:
The biggest difference from the previous problem is that φ(A) is a 2|D | × 1
column vector and instead of indexed by 1, 2, ..., 2|D | here we index it by {U |U ⊆
D } (Note that {U |U ⊆ D } is all the possible subsets of D and thus there are
2|D | elements in total). Therefore, according to (6.95), we can obtain:
φ(A 1 )T φ(A 2 ) =
X
φU (A 1 )φU (A 2 )
U ⊆D
By using the summation, we actually iterate through all the possible sub-
sets of D. If and only if the current iterating subset U is a subset of both A 1
and A 2 simultaneously, the current adding term equals to 1. Therefore, we
actually count how many subsets of D is in the intersection of A 1 and A 2 .
Moreover, since A 1 and A 2 are both defined in the subset space of D, what
we have deduced above can be written as:
φ(A 1 )T φ(A 2 ) = 2| A 1 ∩ A 2 |
Just as required.
g(µ, x) = ∇_µ ln p(x|µ) = ∂/∂µ [ −(1/2)(x − µ)^T S^{−1} (x − µ) ] = S^{−1}(x − µ)

We rewrite the problem. What we are required to prove is that the Gram matrix K:

K = [ k_11  k_12 ]
    [ k_21  k_22 ]

where k_ij (i, j = 1, 2) is short for k(x_i, x_j), should be positive semidefinite. A positive semidefinite matrix has a non-negative determinant, i.e.,

k_12 k_21 ≤ k_11 k_22 .
Note that here φn is short for φ(xn ). Based on the equation above, we can
obtain:
N ∂f
· φT
X
∇w f = n
n=1 ∂(w Tφ )
n
∂g
∇w g = · 2wT
∂(wT w)
N ∂f ∂g
· φT · 2wT = 0
X
∇w J = ∇w f + ∇w g = n +
n=1 ∂(wT φ n) ∂(wT w)
1 XN ∂f
w= ·φ
2a n=1 ∂(wT φn ) n
∂g
Where we have defined: a = 1 ÷ ∂(wT w) , and since g is a monotonically
increasing function, we have a > 0.
1 XN Z © ª2
E[y + ²η] = y + ²η − t n v(ξ)d ξ
2 n=1
1 XN Z ©
(y − t n )2 + 2 · (²η) · (y − t n ) + (²η)2 v(ξ)d ξ
ª
=
2 n=1
N Z
{ y − t n } ηvd ξ + O(²2 )
X
= E[y] + ²
n=1
Note that here y is short for y(xn + ξ), η is short for η(xn + ξ) and v is short
for v(ξ) respectively. Several clarifications must be made here. What we have
done is that we vary the function y by a little bit (i.e., ²η) and then we expand
the corresponding error with respect to the small variation ². The coefficient
before ² is actually the first derivative of the error E[y + ²η] with respect to
² at ² = 0. Since we know that y is the optimal function that can make E
the smallest, the first derivative of the error E[y + ²η] should equal to zero at
² = 0, which gives:
N Z
X
{ y(xn + ξ) − t n } η(xn + ξ)v(ξ)d ξ = 0
n=1
Now we are required to find a function y that can satisfy the equation
above no matter what η is. We choose:
η(x) = δ(x − z)
We set it to zero and rearrange it, which finally gives (6.40) just as re-
quired.
According to the main text below Eq (6.48), we know that f (x, t), i.e., f (z),
follows a zero-mean isotropic Gaussian:
f (z) = N (z|0, σ2 I)
zm = (xm , t m )
The integral ∫ f(z − z_m) dt corresponds to the marginal distribution with respect to the remaining variable x and, thus, we obtain:

∫ f(z − z_m) dt = N( x | x_m, σ^2 )

= Σ_n π_n · N( t | t_n, σ^2 )
E[t|x] = Σ_n k(x, x_n) · t_n
order by putting the last element (i.e., t_{N+1}) in the first position, denoted as t̄_{N+1}; it should also satisfy a Gaussian distribution, with covariance

C̄_{N+1} = [ c    k^T ]
           [ k    C_N ]

where k and c have been given in the main text below Eq (6.65). The conditional distribution p(t_{N+1}|t_N) should also be a Gaussian. By analogy to Eq (2.94)-(2.98), we can simply treat t_{N+1} as x_a, t_N as x_b, c as Σ_aa, k as Σ_ba, k^T as Σ_ab and C_N as Σ_bb. Substituting them into Eq (2.79) and Eq (2.80) yields:

Λ_aa = ( c − k^T C_N^{−1} k )^{−1}

and:

Λ_ab = −( c − k^T C_N^{−1} k )^{−1} k^T C_N^{−1}

Then we substitute them into Eq (2.96) and (2.97), which yields:

p(t_{N+1}|t_N) = N( µ_{a|b}, Λ_aa^{−1} )

with (using the zero means)

µ_{a|b} = −Λ_aa^{−1} Λ_ab t_N = k^T C_N^{−1} t_N = m(x_{N+1})
Where < ·, · > represents the inner product. Comparing the result above
with Eq (3.58), (3.54) and (3.53), we conclude that the means are equal. It is
similar for the variance. We substitute c, k and C N into Eq (6.67). Then we
simplify the expression using matrix identity (C.7). Finally, we will observe
that it is equal to Eq (3.59).
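A compact sketch of the Gaussian-process predictive equations used here, m(x_{N+1}) = k^T C_N^{−1} t and σ^2(x_{N+1}) = c − k^T C_N^{−1} k; the kernel choice, noise level and data below are hypothetical assumptions of mine.

```python
import numpy as np

def kernel(a, b, ell=0.3):
    # squared-exponential kernel, an assumed choice
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
beta = 100.0                                    # noise precision (assumed)

C_N = kernel(x, x) + np.eye(x.size) / beta      # C_N = K + beta^{-1} I
x_new = np.array([0.5])
k = kernel(x, x_new)[:, 0]
c = kernel(x_new, x_new)[0, 0] + 1.0 / beta

mean = k @ np.linalg.solve(C_N, t)              # predictive mean
var = c - k @ np.linalg.solve(C_N, k)           # predictive variance
print(mean, var)
```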
Based on Eq (6.64) and (6.65), We first write down the joint distribution
for t N +L = [t 1 (x), t 2 (x), ..., t N +L (x)]T :
p(t N +L ) = N (t N +L |0, C N +L )
The expression above has already implicitly divided the vector t N +L into
two parts. Similar to Prob.6.20, for later simplicity we rearrange the order
of t N +L denoted as t̄ N +L = [t N +1 , ..., t N +L , t 1 , ..., t N ]T . Moreover, t̄ N +L should
also follows a Gaussian distribution:
KT
µ ¶
C N +1,N +L
C̄ N +L =
K C1,N
and Λab :
Λab = − (C N +1,N +L − KT · C−1
1,N · K)
−1
· KT · C−1
1,N
µa|b = 0 + KT · C−1 T −1
1,N · t N = K · C1,N · t N
where we have used the fact that W is a diagonal matrix. Since Wii > 0,
we obtain xT Wx > 0 just as required. Suppose we have two positive definite
matrix, denoted as A1 and A2 , i.e., for arbitrary vector x, we have xT A1 x > 0
and xT A2 x > 0. Therefore, we can obtain:
xT (A1 + A2 )x = xT A1 x + xT A2 x > 0
Just as required.
Problem 6.25 Solution
anew
N = a N − (−W N − C−1 −1 −1
N ) (t N − σ N − C N a N )
1 −1 −1
= a N + (W N + C−N ) (t N − σ N − C N a N )
1 −1 1 −1
= (W N + C− (W N + C−
N )a N + t N − σ N − C N a N
£ ¤
N )
1 −1 −1
= C N C−
N (W N + C N ) (t N − σ N + W N a N )
= C N (C N W N + I)−1 (t N − σ N + W N a N )
Just as required.
We can obtain:
A = kT C−1
N , b = 0, L
−1
= c − kT C−1
N k
And
µ = a?
N, Λ = H
Where we have used Eq (6.85) and the fact that C N is symmetric. Then we
use matrix identity (C.7) to further reduce the expression, which will finally
give Eq (6.88).
The class-conditional densities given by the kernel density estimates are:

p(x|t = +1) = (1/N_{+1}) Σ_{n=1}^{N_{+1}} (1/Z_k) k(x, x_n)

p(x|t = −1) = (1/N_{−1}) Σ_{n=1}^{N_{−1}} (1/Z_k) k(x, x_n)

and hence

p(t = +1|x) = (1/Z) (1/N_{+1}) Σ_{n=1}^{N_{+1}} k(x, x_n) ,    p(t = −1|x) = (1/Z) (1/N_{−1}) Σ_{n=1}^{N_{−1}} k(x, x_n)

so that the decision rule is

t* = +1   if   (1/N_{+1}) Σ_{n=1}^{N_{+1}} k(x, x_n) ≥ (1/N_{−1}) Σ_{n=1}^{N_{−1}} k(x, x_n) ,   and   t* = −1 otherwise    (∗)

If we choose k(x, x') = x^T x', then

(1/N_{+1}) Σ_{n=1}^{N_{+1}} k(x, x_n) = (1/N_{+1}) Σ_{n=1}^{N_{+1}} x^T x_n = x^T x̃_{+1} ,   where   x̃_{+1} = (1/N_{+1}) Σ_{n=1}^{N_{+1}} x_n

and similarly for x̃_{−1}. Therefore, the classification criterion (∗) can be written as:

t* = +1   if   x^T x̃_{+1} ≥ x^T x̃_{−1} ,   and   t* = −1 otherwise

When we choose the kernel function as k(x, x') = φ(x)^T φ(x'), we similarly obtain the classification criterion:

t* = +1   if   φ(x)^T φ̃_{+1} ≥ φ(x)^T φ̃_{−1} ,   and   t* = −1 otherwise

where

φ̃_{+1} = (1/N_{+1}) Σ_{n=1}^{N_{+1}} φ(x_n)
Suppose we have found w_0 and b_0 which satisfy Eq (7.5) for all points and simultaneously minimize Eq (7.3). The hyperplane decided by w_0 and b_0 is the optimal classification margin. Now suppose the constraint in Eq (7.5) becomes:

t_n ( w^T φ(x_n) + b ) ≥ γ

We can conclude that if we perform the change of variables w_0 → γ w_0 and b_0 → γ b_0, the constraint will still be satisfied and Eq (7.3) will still be minimized. In other words, if the right-hand side of the constraint changes from 1 to γ, the new hyperplane decided by γw_0 and γb_0 is the optimal classification margin. However, the minimum distance from the points to the classification margin is still the same.

Suppose x_1 belongs to class one, with target value t_1 = 1, and x_2 belongs to class two, with target value t_2 = −1. Since we only have two points, they must both satisfy t_i · y(x_i) = 1, as shown in Fig. 7.1. Therefore, we have an equality-constrained optimization problem:

minimize (1/2)||w||^2   s.t.   w^T φ(x_1) + b = 1 ,   w^T φ(x_2) + b = −1

This is a convex optimization problem, and it has been proved that a global optimum exists.
||w||^2 = Σ_{n=1}^{N} a_n

When we find the optimal solution, the second term on the right-hand side of Eq (7.7) vanishes. Based on Eq (7.8) and Eq (7.10), we also observe that its dual is given by:

L̃(a) = Σ_{n=1}^{N} a_n − (1/2)||w||^2

Therefore, we have:

(1/2)||w||^2 = L(a) = L̃(a) = Σ_{n=1}^{N} a_n − (1/2)||w||^2

and rearranging gives the stated result.
If the target variable can only choose from {−1, 1}, and we know that
p(t = 1| y) = σ(y)
We can obtain:
p(t| y) = σ(yt)
N
Y N
X N
X
− ln p(D) = − ln σ(yn t n ) = − ln σ(yn t n ) = E LR (yn t n )
n=1 n=1 n=1
The derivatives are easy to obtain. Our main task is to derive Eq (7.61) using Eq (7.57)-(7.60):

L = C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)||w||^2 − Σ_{n=1}^{N} ( µ_n ξ_n + µ̂_n ξ̂_n )
    − Σ_{n=1}^{N} a_n ( ε + ξ_n + y_n − t_n ) − Σ_{n=1}^{N} â_n ( ε + ξ̂_n − y_n + t_n )

  = C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)||w||^2 − Σ_{n=1}^{N} (a_n + µ_n) ξ_n − Σ_{n=1}^{N} (â_n + µ̂_n) ξ̂_n
    − Σ_{n=1}^{N} a_n ( ε + y_n − t_n ) − Σ_{n=1}^{N} â_n ( ε − y_n + t_n )

  = C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)||w||^2 − C Σ_{n=1}^{N} ξ_n − C Σ_{n=1}^{N} ξ̂_n
    − Σ_{n=1}^{N} (a_n + â_n) ε − Σ_{n=1}^{N} (a_n − â_n)( y_n − t_n )

  = (1/2)||w||^2 − Σ_{n=1}^{N} (a_n + â_n) ε − Σ_{n=1}^{N} (a_n − â_n)( w^T φ(x_n) + b − t_n )

  = (1/2)||w||^2 − Σ_{n=1}^{N} (a_n + â_n) ε − Σ_{n=1}^{N} (a_n − â_n)( w^T φ(x_n) + b ) + Σ_{n=1}^{N} (a_n − â_n) t_n

  = (1/2)||w||^2 − Σ_{n=1}^{N} (a_n + â_n) ε − Σ_{n=1}^{N} (a_n − â_n) w^T φ(x_n) + Σ_{n=1}^{N} (a_n − â_n) t_n

  = (1/2)||w||^2 − Σ_{n=1}^{N} (a_n + â_n) ε − ||w||^2 + Σ_{n=1}^{N} (a_n − â_n) t_n

  = −(1/2)||w||^2 − Σ_{n=1}^{N} (a_n + â_n) ε + Σ_{n=1}^{N} (a_n − â_n) t_n

where we have used a_n + µ_n = C and â_n + µ̂_n = C in the third line, y_n = w^T φ(x_n) + b in the fourth line, Σ_n (a_n − â_n) = 0 from (7.58) in the sixth line, and w = Σ_n (a_n − â_n) φ(x_n) from (7.57) in the seventh line.

Just as required.
This obviously follows from the KKT condition, described in Eq (7.67) and
(7.68).
p(t|X, w, β) = ∏_{n=1}^{N} p(t_n | x_n, w, β)
             = ∏_{n=1}^{N} N( t_n | w^T φ(x_n), β^{−1} )
             = N( t | Φw, β^{−1} I )

p(w|t, X, α, β) = N( m, Σ )

where

Σ = ( A + βΦ^T Φ )^{−1}

and

m = βΣΦ^T t
Just as required.
M
1 −1
N (0, α−
i ) = N (w|0, A )
Y
p(w|α) =
i =1
β 1 M Z
) N /2 · α1/2
Y
= ( · i · exp{−E(w)} dw
2π (2π) M /2 m=1
Note that in the last step we have used matrix identity Eq (C.7). There-
fore, as we know that the pdf is Gaussian and the exponential term has been
given by E(t), we can easily write down Eq (7.85) considering those normal-
ization constant.
What’s more, as required by Prob.7.11, the evaluation of the integral can
be easily performed using Eq(2.113)- Eq(2.117).
According to the previous problem, we can explicitly write down the log
marginal likelihood in an alternative form:
N N 1 1X M
ln p(t|X, α, β) = ln β − ln 2π + ln |Σ| + ln α i − E(t)
2 2 2 2 i=1
We first derive:
dE(t) 1 d
= − (mT Σ−1 m)
dαi 2 dαi
1 d
= − (β2 tT ΦΣΣ−1 ΣΦT t)
2 dαi
1 d
= − (β2 tT ΦΣΦT t)
2 dαi
1 £ d 2 T T d Σ−1
= − Tr (β t ΦΣΦ t) · ]
2 d Σ−1 dαi
1 2 £ 1
= β T r Σ(ΦT t)(ΦT t)T Σ · I i ] = m2ii
2 2
In the last step, we have utilized the following equation:
d
T r(AX−1 B) = −X−T AT BT X−T
dX
Moreover, here I i is a matrix with all elements equal to zero, expect the
i-th diagonal element, and the i-th diagonal element equals to 1. Then we
utilize matrix identity Eq (C.22) to derive:
d ln |Σ| d ln |Σ−1 |
= −
dαi dαi
h d i
= −T r Σ (A + βΦT Φ)
dαi
= −Σ ii
d ln p 1 1 1
= − m2 − Σ ii
dαi 2α i 2 i 2
d ln |Σ| d ln |Σ−1 |
= −
dβ dβ
h d i
= −T r Σ (A + βΦT Φ)
dβ
h i
= −T r ΣΦT Φ
Then we continue:
dE(t) 1 T 1 d
= t t− (mT Σ−1 m)
dβ 2 2 dβ
1 T 1 d 2 T
= t t− (β t ΦΣΣ−1 ΣΦT t)
2 2 dβ
1 T 1 d 2 T
= t t− (β t ΦΣΦT t)
2 2 dβ
1 T 1 d T
= t t − βtT ΦΣΦT t − β2 (t ΦΣΦT t)
2 2 dβ
1n T d T o
= t t − 2βtT ΦΣΦT t − β2 (t ΦΣΦT t)
2 dβ
1 T
n d T o
= t t − 2tT (Φm) − β2 (t ΦΣΦT t)
2 dβ
1 T
n £ d d Σ−1 ¤o
= t t − 2tT (Φm) − β2 T r (t T
ΦΣΦ T
t) ·
2 d Σ−1 dβ
1 T
n ¤o
t t − 2tT (Φm) + β2 T r Σ(ΦT t)(ΦT t)T Σ · ΦT Φ
£
=
2
1n T o
t t − 2tT (Φm) + T r mmT · ΦT Φ]
£
=
2
1n T o
t t − 2tT (Φm) + T r ΦmmT · ΦT ]
£
=
2
1
= ||t − Φm||2
2
Therefore, we have obtained:
d ln p 1³ N ´
= − ||t − Φm||2 − T r[ΣΦT Φ]
dβ 2 β
Just as required.
p(t|x, X, t, α∗ , β∗ ) = N (µ, σ2 )
And
σ2 = (β∗ )−1 + φ(x)T Σφ(x)
Just as required.
1
L(α) = − { N ln 2π + ln |C| + tT C−1 t}
2
1n 1 T −1
= − N ln 2π + ln |C− i | + ln |1 + α−
i ϕ i C− i ϕ i |
2
1
C− ϕ ϕT C−1 o
−i i i −i
+tT (C−1
−i − )t
α i + ϕT C−1 ϕ
i −i i
−1 T −1
1 1 T C− i ϕ i ϕ i C− i
= L(α− i ) − ln |1 + α−i 1 ϕTi C−1
ϕ
−i i | + t t
2 2 α i + ϕTi C− 1ϕ
−i i
2
1 −1 1 qi
= L(α− i ) − ln |1 + α i s i | +
2 2 αi + s i
2
1 αi + s i 1 q i
= L(α− i ) − ln +
2 αi 2 αi + s i
1 h q2i i
= L(α− i ) + ln α i − ln(α i + s i ) + = L(α− i ) + λ(α i )
2 αi + s i
∂λ 1 1 1 q2i
= [ − − ]
∂α i 2 α i α i + s i (α i + s i )2
∂2 λ 1 1 1 2q2i
= [− 2 + + ]
∂α2i 2 α i (α i + s i )2 (α i + s i )3
Next we aim to prove that when α i is given by Eq (7.101), i.e., setting the
first derivative equal to 0, the second derivative (i.e., the expression above) is
negative. First we can obtain:
s2i s i q2i
αi + s i = + si =
q2i − s i q2i − s i
Just as required.
Qi = ϕT −1
i C t
= ϕT −1 −1 T −1
i (β I + ΦA Φ ) t
= ϕT T −1 T
i (βI − βIΦ(A + Φ βIΦ) Φ βI)t
= ϕT 2 T −1 T
i (β − β Φ(A + βΦ Φ) Φ )t
= ϕT 2 T
i (β − β ΦΣΦ )t
= βϕT 2 T T
i t − β ϕ i ΦΣΦ t
Si = ϕT −1
i C ϕi
= ϕT 2 T
i (β − β ΦΣΦ )ϕ i
= βϕT 2 T T
i ϕ i − β ϕ i ΦΣΦ ϕ i
Just as required.
N
∂ nX o XN
t n ln yn + (1 − t n ) ln(1 − yn ) = (t n − yn )φn = ΦT (t − y)
∂w n=1 n=1
∂ n o N
ΦT (t − y) = − σ(1 − σ) · φn · φT T
n = −Φ BΦ
X
∂w n=1
∂ ln p(t|α) ¯¯ ∂ ln p(t|α) ∂w ¯¯
=
∂α i ∂w ∂α i w = w∗
¯ ¯
w = w∗
∂ 1 1 ∂ £X 1
ln α−i 1 =
¤
[− ln |A|] = −
∂α i 2 2 ∂α i i 2α i
∂ 1 1
[ ln |Σ|] = − Σ ii
∂α i 2 2
Therefore, we obtain:
∂ ln p(t|α) 1 1
= − Σ ii
∂α i 2α i 2
Note: here I draw a different conclusion from the main text. I have also verified my result in another way. You can write the prior as the product of N(w_i|0, α_i^{−1}) instead of N(w|0, A^{−1}). In this form, since we know that:

∂/∂α_i Σ_{i=1}^{M} ln N(w_i|0, α_i^{−1}) = ∂/∂α_i ( (1/2) ln α_i − (α_i/2) w_i^2 ) = 1/(2α_i) − (1/2)(w_i*)^2

the expression above can be used in place of the derivative of −(1/2) w^T A w + (1/2) ln|A|. Since the derivative of the likelihood with respect to α_i is not zero at w*, (7.115) seems not right anyway.
Here we adopt the same assumption as in the main text: no arrows lead from a higher-numbered node to a lower-numbered node. According to Eq (8.5), we can write:

∫_x p(x) dx = ∫_x ∏_{k=1}^{K} p(x_k | pa_k) dx
            = ∫_x p(x_K | pa_K) ∏_{k=1}^{K−1} p(x_k | pa_k) dx
            = ∫_{x_1, ..., x_{K−1}} [ ∫_{x_K} p(x_K | pa_K) ∏_{k=1}^{K−1} p(x_k | pa_k) dx_K ] dx_1 dx_2 ... dx_{K−1}
            = ∫_{x_1, ..., x_{K−1}} [ ∏_{k=1}^{K−1} p(x_k | pa_k) ∫_{x_K} p(x_K | pa_K) dx_K ] dx_1 dx_2 ... dx_{K−1}
            = ∫_{x_1, ..., x_{K−1}} ∏_{k=1}^{K−1} p(x_k | pa_k) dx_1 dx_2 ... dx_{K−1}

Note that from the third line to the fourth line we have used the fact that x_1, x_2, ..., x_{K−1} do not depend on x_K, so the product from k = 1 to K − 1 can be moved outside the integral with respect to x_K, and from the fourth line to the fifth line we have used the fact that the conditional probability is correctly normalized. This procedure is repeated K times until all the variables have been integrated out, giving ∫ p(x) dx = 1.
And ½
0.592, if b = 0
p(b) = p(a = 0, b) + p(a = 1, b) =
0.408, if b = 1
Therefore, we conclude that p(a, b) 6= p(a)p(b). For instance, we have
p(a = 1, b = 1) = 0.144, p(a = 1) = 0.4 and p(b = 1) = 0.408. It is obvious
that:
0.144 = p(a = 1, b = 1) 6= p(a = 1)p(b = 1) = 0.4 × 0.408
To prove the conditional dependency, we first calculate p(c):
½
X 0.480, if c = 0
p(c) = p(a, b, c) =
a,b = 0,1
0.520, if c = 1
Now we can easily verify the statement p(a, b| c) = p(a| c)p(b| c). For in-
stance, we have:
This problem follows the previous one. We have already calculated p(a)
and p(b| c), we rewrite it here.
½
0.6, if a = 0
p(a) = p(a, b = 0) + p(a, b = 1) =
0.4, if a = 1
And
0.384/0.480 = 0.800, if b = 0, c =0
p(b, c) 0.208/0.520 = 0.400, if b = 0, c =1
p(b| c) = =
p(c)
0.096/0.480 = 0.200, if b = 1, c =0
0.312/0.520 = 0.600, if b = 1, c =1
a→c→b
It looks quite like Figure 8.6. The difference is that we introduce α i for
each w i , where i = 1, 2, ..., M.
E[x1 ] = b 1
Just as required.
This statement is easy to see but a little bit difficult to prove. We put Fig
8.26 here to give a better illustration.
However, we can see that the value of the right hand side depends on a,b
and d, while the left hand side only depends on b and d. In general, this
expression will not hold, and, thus, a and b are not dependent conditioned on
d.
We will calculate each of the term on the right hand side. Let’s begin from
the numerator p(D = 0). According to the sum rule, we have:
Where we have used Eq(8.30), Eq(8.105) and Eq(8.106). Note that the
second term in the denominator, i.e., p(F = 0), equals 0.1, which can be easily
derived from the main test above Eq(8.30). We now only need to calculate
p(D = 0|F = 0). Similarly, according to the sum rule, we have:
X
p(D = 0|F = 0) = p(D = 0,G |F = 0)
G =0,1
X
= p(D = 0|G, F = 0) p(G |F = 0)
G =0,1
X
= p(D = 0|G) p(G |F = 0)
G =0,1
= 0.9 × 0.81 + (1 − 0.9) × (1 − 0.81)
= 0.748
Several clarifications must be made here. First, from the second line
to the third line, we simply eliminate the dependence on F = 0 because we
know that D only depends on G according to Eq(8.105) and Eq(8.106). Sec-
ond,from the third line to the fourth line, we have used Eq(8.31), Eq(8.105)
and Eq(8.106). Now, we substitute all of them back to (∗), yielding:
p(D = 0, B = 0, F = 0)
p(F = 0|D = 0, B = 0) =
p(D = 0, B = 0)
P
G p(D = 0, B = 0, F = 0,G)
= P
p(D = 0, B = 0,G)
P G
G p(B = 0, F = 0,G) p(D = 0|B = 0, F = 0,G)
= P
G p(B = 0,G) p(D = 0|B = 0,G)
P
G p(B = 0, F = 0,G) p(D = 0|G)
= P (∗∗)
G p(B = 0,G) p(D = 0|G)
Just as required. The intuition behind this result coincides with the com-
mon sense. Moreover, by analogy to Fig.8.54, the node a and b in Fig.8.54
represents B and F in our case. Node c represents G, while node d represents
D. You can use d-separation criterion to verify the conditional properties.
Using the property of symmetry, we only need to count the free variables
on the lower triangle of the matrix. In the first column, there are (M − 1) free
variables. In the second column, there are (M − 2) free variables. Therefore,
the total free variables are given by:
M(M − 1)
(M − 1) + (M − 2) + ... + 0 =
2
Each value of these free variables has two choices, i.e., 1 or 0. Therefore,
the total number of such matrix is 2 M ( M −1)/2 . In the case of M = 3, there are
8 possible undirected graphs:
We can obtain:
1 X
· hX hX ii ¸
p(xn , x N ) = ψn−1,n (xn−1 , xn )... ψ2,3 (x2 , x3 ) ψ1,2 (x1 , x2 ) ...
Z xn−1 x2 x1
· hX i ¸
X
× ψn,n+1 (xn , xn+1 )... ψ N −2,N −1 (x N −2 , x N −1 )ψ N −1,N (x N −1 , x N ) ...
xn+1 x N −1
Note that in the second line, the summation term with respect to x N −1 is
the product of ψ N −2,N −1 (x N −2 , x N −1 ) and ψ N −1,N (x N −1 , x N ). So here we can
actually draw an undirected graph with N − 1 nodes, and adopt the proposed
algorithm to solve p(xn , x N ). If we use x? n to represent the new node, then the
joint distribution can be written as:
1 ? ? ? ? ? ?
p(x? ) = ψ (x , x ) ψ (x , x ) ... ψ? ? ?
N −2,N −1 (x N −2 , x N −1 )
Z ? 1,2 1 2 2,3 2 3
Where ψ? ? ?
n,n+1 (x n , x n+1 ) is defined as:
1
p(x) = ψ1,3 (x1 , x3 ) ψ2,3 (x2 , x3 ) ψ3,4 (x3 , x4 ) ψ4,5 (x4 , x5 )
Z
We simply choose x4 as the root and the corresponding directed tree is
well defined by working outwards. In this case, the distribution defined by
the directed tree is:
only depends on only one node (except the root), i.e., its parent. Thus we can
easily change a undirected tree to a directed one by matching the potential
function with the corresponding conditional PDF, as shown in the example.
Moreover, we can choose any node in the undirected tree to be the root and then work outwards to obtain a directed tree. Therefore, for an undirected tree with n nodes, there are n corresponding directed trees in total.
For each n, only one of the r_nk (k = 1, 2, ..., K) equals 1 and the others are all 0, so there are K possible choices per data point. When N data points are given, there are K^N possible assignments for { r_nk ; n = 1, 2, ..., N; k = 1, 2, ..., K }. For each assignment, the optimal {µ_k ; k = 1, 2, ..., K} is uniquely determined by Eq (9.4).

As discussed in the main text, by iteratively performing the E-step and the M-step, the distortion measure in Eq (9.1) is gradually minimized. In the worst case we find the optimal assignment and {µ_k} only in the last iteration, i.e., K^N iterations are required. However, the algorithm is guaranteed to converge because the number of assignments is finite and the optimal {µ_k} is determined once the assignment is given.

J_N = J_{N−1} + Σ_{k=1}^{K} r_Nk || x_N − µ_k ||^2

In the E-step, we still assign the N-th data point x_N to the closest center; suppose this closest center is µ_m. Then the expression above reduces to:

J_N = J_{N−1} + || x_N − µ_m ||^2

In the M-step, we set the derivative of J_N with respect to µ_k to 0, for k = 1, 2, ..., K. We observe that for those µ_k with k ≠ m we have:

∂J_N/∂µ_k = ∂J_{N−1}/∂µ_k

so these centers are unchanged, while solving for µ_m gives the sequential update

µ_m^{(N)} = µ_m^{(N−1)} + ( x_N − µ_m^{(N−1)} ) / ( 1 + Σ_{n=1}^{N−1} r_nm )
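The update above is the usual online K-means rule. The sketch below is a minimal illustration (the data and initial centers are assumptions of mine): after one sequential pass, each center equals the mean of the points that were assigned to it during that pass.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2)) + np.array([[3, 0]]) * rng.integers(0, 2, size=(60, 1))
mu = X[:2].copy()                      # two initial centers (arbitrary choice)
counts = np.zeros(2)
assigned = [[], []]

for x in X:                            # one sequential (online) pass
    m = int(np.argmin(((x - mu) ** 2).sum(axis=1)))   # assign to the closest center
    counts[m] += 1
    mu[m] += (x - mu[m]) / counts[m]   # mu_m <- mu_m + (x - mu_m) / N_m, the update derived above
    assigned[m].append(x)

# each center is now the mean of the points assigned to it during the pass
print(all(np.allclose(mu[m], np.mean(assigned[m], axis=0)) for m in range(2)))   # True
```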
Note that we have used 1-of-K coding scheme for z = [z1 , z2 , ..., zK ]T . To
be more specific, only one of z1 , z2 , ..., zK will be 1 and all others will equal 0.
Therefore, the summation over z actually consists of K terms and the k-th
term corresponds to z k equal to 1 and others 0. Moreover, for the k-th term,
the product will reduce to πk N (x|µk , Σk ). Therefore, we can obtain:
p(x) = Σ_z ∏_{k=1}^{K} [ π_k N(x|µ_k, Σ_k) ]^{z_k} = Σ_{k=1}^{K} π_k N(x|µ_k, Σ_k)
Just as required.
In other words, in this case the only modification is that the term p(X, Z|θ) in Eq (9.29) is replaced by p(X, Z|θ)p(θ). Therefore, in the E-step we still need to calculate the posterior p(Z|X, θ^old), and in the M-step we are required to maximize Q'(θ, θ^old). By analogy to Eq (9.30), we can write down Q'(θ, θ^old):

Q'(θ, θ^old) = Σ_Z p(Z|X, θ^old) ln[ p(X, Z|θ) p(θ) ]
             = Σ_Z p(Z|X, θ^old) [ ln p(X, Z|θ) + ln p(θ) ]
             = Σ_Z p(Z|X, θ^old) ln p(X, Z|θ) + Σ_Z p(Z|X, θ^old) ln p(θ)
             = Σ_Z p(Z|X, θ^old) ln p(X, Z|θ) + ln p(θ) · Σ_Z p(Z|X, θ^old)
             = Σ_Z p(Z|X, θ^old) ln p(X, Z|θ) + ln p(θ)
             = Q(θ, θ^old) + ln p(θ)

Just as required.
Notice that the condition on µ, Σ and π can be omitted here, and we only
need to prove p(Z|X) can be written as the product of p(zn |xn ). Correspond-
ingly, the small dots representing µ, Σ and π can also be omitted in Fig 9.6.
Observing Fig 9.6 and based on definition, we can write :
p(X|Z)p(Z)
p(Z|X) =
p(X)
hQ ih Q i
N N
n=1 p(x n |z n ) n=1 p(z n )
= QN
n=1 p(x n )
N p(x |z )p(z )
Y n n n
=
n=1 p(xn )
N
Y
= p(zn |xn )
n=1
Just as required. The essence behind the problem is that in the directed
graph, there are only links from zn to xn . The deeper reason is that (i) the
mixture model is given by Fig 9.4, and (ii) we assume the data {xn } is i.i.d,
and thus there is no link from xm to xn .
∂ ln N (xn |µk , Σ) 1 1
= − Σ−1 + Σ−1 Snk Σ−1
∂Σ 2 2
Where we have defined:
∂ ln p N X
∂ n X K o
= z nk [ ln πk + ln N (xn |µk , Σk ) ]
∂µk ∂µk n=1 k=1
∂ n X N o
= z nk [ ln πk + ln N (xn |µk , Σk ) ]
∂µk n=1
N ∂ n o
z nk ln N (xn |µk , Σk )
X
=
n=1 ∂µ k
X ∂ n o
= ln N (xn |µk , Σk )
xn ∈C k ∂µ k
∂L N z
X nk
= +λ = 0
∂πk π
n=1 k
We multiply both sides by πk and sum over k making use of the constraint
Eq (9.9), yielding λ = − N. Substituting it back into the expression, we can
obtain:
1 XN
πk = z nk
N n=1
Just as required.
∂E z [ln p] N
∂ nX o
= γ(z nk ) ln N (xn |µk , Σk )
∂µk ∂µk n=1
N
X ∂ ln N (xn |µk , Σk )
= γ(z nk ) ·
n=1 ∂µk
N h i
1
γ(z nk ) · − Σ−
X
= k (x n − µ k )
n=1
∂E z N
∂ nX o
= γ(z nk ) ln N (xn |µk , Σk )
∂Σk ∂Σk n=1
N
X ∂ ln N (xn |µk , Σk )
= γ(z nk )
n=1 ∂Σk
N h 1 1 −1 i
1 −1
γ(z nk ) · − Σ− Σ Σ
X
= + S nk
n=1 2 k 2 k k
∂L N γ(z )
X nk
= +λ = 0
∂πk n=1 π k
We multiply both sides by πk and sum over k making use of the constraint
Eq (9.9), yielding λ = − N (you can see Eq (9.20)- Eq (9.22) for more details).
Substituting it back into the expression, we can obtain:
1 XN Nk
πk = γ(z nk ) =
N n=1 N
Just as Eq (9.22).
p(xa , xb ) p(x) K
X πk
p(xb |xa ) = = = · p(x| k)
p(xa ) p(xa ) k=1 p(xa )
Then we deal with the covariance matrix. For an arbitrary random vari-
able x, according to Eq (2.63) we have:
Since E[x] is already obtained, we only need to solve E[xxT ]. First we only
focus on the k-th component and rearrange the expression above, yielding:
First, let’s make this problem more clear. In a mixture of Bernoulli dis-
tribution, whose complete-data log likelihood is given by Eq (9.54) and whose
model parameters are πk and µk . If we want to obtain those parameters, we
can adopt EM algorithm. In the E-step, we calculate γ(z nk ) as shown in Eq
(9.56). In the M-step, we update πk and µk according to Eq (9.59) and Eq
(9.60), where Nk and x¯k are defined in Eq (9.57) and Eq (9.58). Now let’s
back to this problem. The expectation of x is given by Eq (9.49):
K
( opt) ( opt)
E[x] =
X
πk µk
k=1
( opt) ( opt)
Here πk and µk are the parameters obtained when EM is converged.
K
( opt) ( opt)
E[x] =
X
πk µk
k=1
K 1 N
( opt)
γ(z nk )( opt) xn
X X
= πk ( opt)
k=1 NK n=1
K N ( opt) N
k 1
γ(z nk )( opt) xn
X X
= ( opt)
k=1 N NK n=1
K 1 X N
γ(z nk )( opt) xn
X
=
k=1 N n=1
N X K γ(z )( opt) x
X nk n
=
n=1 k=1 N
N x X K
n
γ(z nk )( opt)
X
=
n=1 N k=1
1 XN
= xn = x̄
N n=1
And
Nk(1) (0)
PN PN
n=1 π k
(1)
n=1 γ(z nk )
π(1)
k
= = = π(0)
k
=
N N N
In other words, in this case, after the first EM iteration, we find that the
new µ(1)
k
are all identical, which are all given by x̄. Moreover, the new π(1)
k
are
identical to their corresponding initial value π(0)
k
. Therefore, in the second
EM iteration, we can similarly conclude that:
µ(2)
k
= µ(1)
k
= x̄ , π(2)
k
= π(1)
k
= π(0)
k
In other words, the EM algorithm actually stops after the first iteration.
X K h
XY i zk
p(x|µ) = p(x, z|µ, π) = πk p(x|µk )
z z k=1
The summation over z is made up of K terms and the k-th term corre-
sponds to z k = 1 and other z j , where j 6= k, equals 0. Therefore, the k-th term
will simply reduce to πk p(x|µk ). Hence, performing the summation over z
will finally give Eq (9.47) just as required. To be more clear, we summarize
the aforementioned statement:
K h
XY i zk
p(x|µ) = πk p(x|µk )
z k=1
K h
Y i zk ¯ K h
Y i zk ¯
= πk p(x|µk ) ¯ + ... + πk p(x|µk ) ¯
¯ ¯
z1 = 1 zK = 1
k=1 k=1
= π1 p(x|µ1 ) + ... + πK p(x|µK )
K
X
= πk p(x|µk )
k=1
Noticing that πk doesn’t depend on any µki , we can omit the first term in
the open brace when calculating the derivative of Eq (9.55) with respect to
µki :
∂E z [ln p] ∂ N X
X K n D £
X ¤o
= γ(z nk ) xni ln µki + (1 − xni ) ln(1 − µki )
∂µki ∂µki n=1 k=1 i =1
∂ N X
X K X
D n ¤o
γ(z nk ) xni ln µki + (1 − xni ) ln(1 − µki )
£
=
∂µki n=1 k=1 i =1
N
X ∂ n ¤o
γ(z nk ) xni ln µki + (1 − xni ) ln(1 − µki )
£
=
n=1 ∂µ ki
XN µ
xni 1 − xni
¶
= γ(z nk ) −
n=1 µki 1 − µki
N
X xni − µki
= γ(z nk )
n=1 µki (1 − µki )
K
L = E z [ln p(X, Z|µ, π)] + λ(
X
πk − 1)
k=1
X K
N X K
X
γ(z nk ) + λπk = 0
n=1 k=1 k=1
PK
Noticing that k=1 π k equals 1, we can obtain:
N X
X K
λ=− γ(z nk )
n=1 k=1
Finally, substituting it back into (∗) and rearranging it, we can obtain:
PK PK
k=1 γ(z nk ) γ(z nk ) Nk
πk = − = P N kP
=1
=
λ K
n=1 k=1 γ(z nk )
N
The incomplete-data log likelihood is given by Eq (9.51), and p(x_n|µ_k) lies in the interval [0, 1], which can be easily verified from its definition, i.e., Eq (9.44). Therefore, we can obtain:

ln p(X|µ, π) = Σ_{n=1}^{N} ln{ Σ_{k=1}^{K} π_k p(x_n|µ_k) } ≤ Σ_{n=1}^{N} ln{ Σ_{k=1}^{K} π_k × 1 } = Σ_{n=1}^{N} ln 1 = 0

where we have used the fact that the logarithm is monotonically increasing, and that the summation of π_k over k equals 1. Moreover, to achieve equality we would need p(x_n|µ_k) to equal 1 for all n = 1, 2, ..., N, which is hardly possible.

To illustrate this, suppose that p(x_n|µ_k) equals 1 for all data points. Without loss of generality, consider two data points x_1 = [x_11, x_12, ..., x_1D]^T and x_2 = [x_21, x_22, ..., x_2D]^T whose i-th entries are different, say x_1i = 1 and x_2i = 0, since x_i is a binary variable. According to Eq (9.44), if we want p(x_1|µ_k) = 1 we must have µ_i = 1 (otherwise it must be less than 1). However, this makes p(x_2|µ_k) equal to 0, since there is then a factor 1 − µ_i = 0 in the product in Eq (9.44).

Therefore, when the data set is pathological, we can reach this singular point by running EM. Note that in the main text the author states that the condition should be a pathological initialization; this is also true. For instance, in the extreme case, even when the data set is not pathological, if we initialize one π_k to 1 and the others to 0, and some of the µ_i to 1 and the others to 0, we may also reach the singularity.
Γ(α0 ) YK
α −1
p(π|α) = π k
Γ(α1 )...Γ(αK ) k=1 k
Therefore, the contribution of the Dirichlet prior to ln p(θ ) should be given
by:
K
X
(αk − 1) ln πk
k=1
0
Therefore, now Q (θ , θ old ) can be written as:
0
D h
K X K
i X
Q (θ , θ old ) = E z [ln p] +
X
(a i − 1) ln µki + (b i − 1) ln(1 − µki ) + (αk − 1) ln πk
k=1 i =1 k=1
0
Similarly, we calculate the derivative of Q (θ , θ old ) with respect to µki .
This can be simplified by reusing the deduction in Prob.9.15:
0
∂Q ∂E z [ln p] ai − 1 bi − 1
= + −
∂µki ∂µki µki 1 − µki
N
X xni 1 − xni ai − 1 bi − 1
= γ(z nk )( − )+ −
n=1 µki 1 − µki µki 1 − µki
PN PN
n=1 x ni · γ(z nk ) + a i − 1 (1 − xni )γ(z nk ) + b i − 1
= − n=1
µki 1 − µki
Nk x̄ki + a i − 1 Nk − Nk x̄ki + b i − 1
= −
µki 1 − µki
Note that here x̄ki is defined as the i-th entry of x̄k defined in Eq (9.58).
To be more clear, we have used Eq (9.57) and Eq (9.58) in the last step:
N
X h 1 XN i
xni · γ(z nk ) = Nk · xni · γ(z nk ) = Nk · x̄ki
n=1 Nk n=1
Nk x̄ki + a i − 1
µki =
Nk + a i − 1 + b i − 1
0
Next we maximize Q (θ , θ old ) with respect to π. By analogy to Prob.9.16,
we introduce Lagrange multiplier:
K K
L ∝ Ez +
X X
(αk − 1) ln πk + λ( πk − 1)
k=1 k=1
0
Note that the second term on the right hand side of Q in its definition has
been omitted, since that term can be viewed as a constant with regard to π.
We then calculate the derivative of L with respect to πk by taking advantage
of Prob.9.16:
∂L N γ(z ) αk − 1
X nk
= + +λ = 0
∂πk n=1 π k πk
Similarly, We first multiply both sides of the expression by πk and then adopt
summation with respect to k, which gives:
K X
X N K
X K
X
γ(z nk ) + (αk − 1) + λπk = 0
k=1 n=1 k=1 k=1
184
PK
Noticing that k=1 π k equals 1, we can obtain:
K
X K
X
λ=− Nk − (αk − 1) = − N − α0 + K
k=1 k=1
K
p(x|µk ) zk
Y
p(x|z, µ) =
k=1
K
z
Y
p(z|π) = πkk
k=1
N
Y N Y
Y K h i znk
p(X, Z|µ, π) = p(zn |π)p(xn |zn , µ) = πk p(x|µ)
n=1 n=1 k=1
N X
K D Y
M
xni j
X h Y i
ln p(X, Z|µ, π) = z nk ln πk µki j
n=1 k=1 d =1 j =1
N X
X K h D X
X M i
= z nk ln πk + xni j ln µki j
n=1 k=1 d =1 j =1
πk p(xn |µk )
γ(z nk ) = E[z nk ] = PK
j =1 π j p(x n |µ j )
185
Next, in the M-step, we are required to maximize E z [ln p(X, Z|µ, π)] with
respect to π and µk , where E z [ln p(X, Z|µ, π)] is given by:
K
N X h D X
M i
E z [ln p(X, Z|µ, π)] =
X X
γ(z nk ) ln πk + xni j ln µki j
n=1 k=1 i =1 j =1
Notice that there exists two constraints: (i) the summation of πk over k
equals 1, and (ii) the summation of µki j over j equals 1 for any k and i, we
need to introduce Lagrange multiplier:
K D
K X M
L = E z [ln p] + λ(
X X X
πk − 1) + η ki ( µki j − 1)
k=1 k=1 i =1 j =1
Nk
πk =
N
Where Nk is defined as:
N
X
Nk = γ(z nk )
n=1
∂L N γ(z )x
X nk ni j
= + η ki
∂µki j n=1 µki j
N
X
γ(z nk ) xni j + η ki µki j = 0
n=1
M X
X N N
X M
hX i N
X
η ki = − γ(z nk ) xni j = − γ(z nk ) xni j = − γ(z nk ) = − N k
j =1 n=1 n=1 j =1 n=1
186
P
Where we have used the fact that j x ni j = 1. Substituting back into the
derivative, we can obtain:
PN
n=1 γ(z nk ) x ni j 1 X N
µki j = − = γ(z nk ) xni j
η ki Nk n=1
M
α= (∗)
E[wT w]
Therefore, we now need to calculate the expectation E[wT w]. Notice that
the posterior has already been given by Eq (3.49):
p(w|t) = N (m N , S N )
To calculate E[wT w], here we write down an property for a Gaussian ran-
dom variable: if x ∼ N (m, Σ), we have:
This property has been shown in Eq(378) in ’the Matrix Cookbook’. Uti-
lizing this property, we can obtain:
E[wT w] = Tr[S N ] + mT
N mN
N
β = PN
n=1 E[(t n − w
Tφ 2
n) ]
p(w|t) = N (m N , S N )
187
E[(t n − wT φn )2 ] = E[t2n − 2t n · wT φn + wT φn φT
n w]
= E[t2n ] − E[2t n · wT φn ] + E[wT (φn φT
n ) w]
= t2n − 2t n · E[φT T T T
n w] + Tr[φ n φ n S N ] + m N φ n φ n m N
= t2n − 2t n φT T T T
n · E[w] + Tr[φ n φ n S N ] + m N φ n φ n m N
= t2n − 2t n φT T T T
n m N + Tr[φ n φ n S N ] + m N φ n φ n m N
= (t n − mT 2 T
N φ N ) + Tr[φ n φ n S N ]
1 1 XN n o
= (t n − mT φ
N N ) 2
+ Tr[ φ φ T
n n NS ]
β N n=1
1n o
= ||t − Φm N ||2 + Tr[ΦT ΦS N ]
N
Note that in the last step, we have performed vectorization. Here the j-th
row of Φ is given by φ j , identical to the definition given in Chapter 3.
N β N
β X 1X M αi XM α
i
Ew [ln p] = ln − Ew [(t n − wT φn )2 ] + ln − Ew [w2i ]
2 2π 2 n=1 2 i=1 2π i=1 2
Here the subscript (i, i) represents the entry on the i-th row and i-th
column of the matrix Ew [wwT ]. So now, we are required to calculate the
expectation. To be more clear, this expectation is with respect to the posterior
defined by Eq (7.81):
p(w|t, X, α, β) = N (m, Σ)
Here we use Eq (377) described in ’the Matrix Cookbook’. We restate it
here: if w ∼ N (m, Σ), we have:
E[wwT ] = Σ + mmT
1 1 1
αi = = =
Ew [wwT ] ( i,i ) (Σ + mmT ) ( i,i ) Σ ii + m2i
N
β(new) = P N
n=1 Ew [(t n − w
Tφ 2
n) ]
1 1 XN n o
= (t n − mT φ N )2 + Tr[φn φT
n Σ]
β(new) N n=1
1n o
= ||t − Φm||2 + Tr[ΦT ΦΣ]
N
To make it consistent with Eq (9.68), let’s first prove a statement:
This can be easily shown by substituting Σ, i.e., Eq(7.83), back into the
expression:
Now we start from this statement and rearrange it, which gives:
1 1n o
= ||t − Φm||2 + Tr[ΦT ΦΣ]
β(new) N
1n o
= ||t − Φm||2 + Tr[β−1 (I − AΣ)]
N
1n o
= ||t − Φm||2 + β−1 Tr[I − AΣ]
N
1n o
||t − Φm||2 + β−1 (1 − α i Σ ii )
X
=
N i
||t − Φm||2 + β−1 i γ i
P
=
N
Here we have defined γ i = 1 − α i Σ ii as in Eq (7.89). Note that there is a
typo in Eq (9.68), m N should be m.
1 − α? Σ ii
α? =
m2i
1
α? =
m2i + Σ ii
||t − Φm||2
(β? )−1 =
N − i γi
P
Note that in the last step, we have used the fact that ln p(X|θ ) doesn’t
depend on Z, and that the summation of q(Z) over Z equal to 1 because q(Z)
is a PDF.
∂1 ¯¯ ∂ ln p(X|θ ) ¯ ¯
= ¯ (old) +
∂θ ∂θ
¯ (old)
θ θ
∂ ln p(X|θ ) ¯ ¯
=
∂θ
¯ (old)
θ
This problem can be much easier to prove if we view it from the perspec-
tive of KL divergence. Note that when q(Z) = p(Z|X, θ (old) ), the KL divergence
vanishes, and that in general KL divergence is less or equal to zero. There-
fore, we must have:
∂K L(q|| p) ¯¯
¯ (old) = 0
∂θ θ
Nkold = γold (z nk )
X
n
If now we just re-evaluate the responsibilities for one data point xm , we can
obtain:
Nknew =
X old
γ (z nk ) + γnew (z mk )
n6= m
1 γnew (z mk )xm
µnew γold (z nk )xn +
X
k =
Nknew n6=m Nknew
1 γnew (z mk )xm γold (z mk )xm
γold (z nk )xn +
X
= −
Nknew n Nknew Nknew
Nkold 1 X γnew (z mk )xm γold (z mk )xm
= γold (z nk )xn + −
Nknew N old n Nknew Nknew
k
Nkoldold
h
new old
i x
m
= µ
new k + γ (z mk ) − γ (z mk ) new
Nk Nk
Nknew − Nkold h i x
m
= µold
k − µ old
k + γnew
(z mk ) − γ old
(z mk )
Nknew Nknew
γnew (z mk ) − γold (z mk ) old h new i x
m
= µold
k − new µk + γ (z mk ) − γold (z mk )
Nk Nknew
γnew (z mk ) − γold (z mk ) ³ ´
= µold
k + · x m − µ old
k
Nknew
Just as required.
Nknew 1 n old o
πnew
k = =Nk + γnew (z mk ) − γold (z mk )
N N
new
γ (z mk ) − γold (z mk )
= πold
k +
N
Here we have used the conclusion (the update formula for Nknew ) in the
previous problem. Next we deal with the covariance matrix Σ. By analogy to
193
One important thing worthy mentioned is that in the second step, there
is an approximate equal sign. Note that in the previous problem, we have
194
shown that if we only recompute the data point xm , all the center µk will
also change from µold k
to µnew
k
, and the update formula is given by Eq (9.78).
However, for the convenience of computing, we have made an approximation
here. Other approximation methods can also be applied here. For instance,
you can replace µnewk
with µold
k
whenever it occurs.
The complete solution should be given by substituting Eq (9.78) into the
right side of the first equal sign and then rearranging it, in order to construct
a relation between Σnew k
and Σold
k
. However, this is too complicated.
Note that in the last step, we have used the fact that ln p(X) doesn’t de-
pend on Z, and that the integration of q(Z) over Z equal to 1 because q(Z) is
a PDF.
m 1 = µ 1 − Λ−
11 Λ12 (m 2 − µ2 )
1
½
m 2 = µ 2 − Λ−
22 Λ21 (m 1 − µ1 )
1
(10.14). When none of Λ− 11 and Λ22 equals 0, we substitute m 1 , i.e., the first
1 −1
195
The first term at the left hand side will equal 0 only when the distribu-
tion is singular, i.e., the determinant of the precision matrix Λ (i.e., Λ11 Λ22 −
Λ12 Λ21 ) is 0. Therefore, if the distribution is nonsingular, we must have
m 2 = µ2 . Substituting it back into the first line, we obtain m 1 = µ1 .
Note that in the third step, since all the factors q i (Z i ), where i 6= j, are
fixed, they can be absorbed into the ’Const’ variable. In the last step, we have
denoted the marginal distribution:
Z Y
p(Z j ) = p(Z) dZ i
i 6= j
Using the functional derivative (for more details, you can refer to Ap-
pendix D or Prob.1.34), we calculate the functional derivative of L with re-
spect to q j (Z j ) and set it to 0:
p(Z j )
− +λ = 0
q j (Z j )
196
λ q j (Z j ) = p(Z j )
q?j (Z j ) = p(Z j )
∂KL
= −Σ−1 E[x] + Σ−1 µ = 0
∂µ
1 1 1
KL (p|| q) = ln |Σ| + Tr[Σ−1 E(xxT )] − µT Σ−1 µ + const
2 2 2
197
Now it is clear that when q z (z) equals p(z|θ 0 , X), the KL divergence is
minimized. This corresponds to the E-step. Next, we calculate the optimal
q θ (θ ), i.e., θ 0 , by maximizing L(q) given in Eq (10.3), but fixing q θ (θ ):
Z Z n p(X, Z) o
L(q) = q(Z) ln dZ
q(Z)
Z Z n p(X, z, θ ) o
= q z (z)q θ (θ ) ln dz d θ
q z (z)q θ (θ )
Z Z n p(X, z, θ ) o Z
= q z (z)q θ (θ ) ln dz d θ − q θ (θ ) ln q θ (θ ) d θ
q z (z)
Z Z n o Z
= q z (z)q θ (θ ) ln p(X, z, θ ) dz d θ − q θ (θ ) ln q θ (θ ) d θ + const
Z Z
= q θ (θ ) E qz [ln p(X, z, θ )] d θ − q θ (θ ) ln q θ (θ ) d θ + const
Z
= E qz (z) [ln p(X, z, θ 0 )] − q θ (θ ) ln q θ (θ ) d θ + const
4 ³
Z ´
(1+α)/2 (1−α)/2
D α (p|| q) = 1 − p q dx
1 − α2
4 n p 1−α 1−α 2 ¤ o
Z
£
= 1 − 1 + ln q + O( ) dx
1 − α2 p(1−α)/2 2 2
4 n 1 + 1−2α ln q + O( 1−2α )2
Z o
= 1 − p · dx
1 − α2 1 + 1−2α ln p + O( 1−2α )2
4 n 1 + 1−2α ln q
Z o
≈ 1 − p · dx
1 − α2 1 + 1−2α ln p
4 n £ 1 + 1−2α ln q
Z o
¤
= − p · − 1 dx
1 − α2 1 + 1−2α ln p
1−α 1−α
4 n 2 ln q − 2 ln p
Z o
= − p · dx
1 − α2 1 + 1−2α ln p
2 ln q − ln p
n Z o
= − p· dx
1+α 1 + 1−2α ln p
q
Z Z
≈ − p · (ln q − ln p) dx = − p · ln dx
p
Here p and q is short for p(x) and q(x). It is similar when α → −1. One
important thing worthy mentioning is that if we directly approximate p(1+α)/2
by p instead of p/p(1−α)/2 in the first step, we won’t get the desired result.
E[ τ ] n N
ln q?
o
λ0 (µ − µ0 )2 + (xn − µ)2 + const
X
µ (µ) = −
2 n=1
E[ τ ] n N N o
λ0 µ2 − 2 λ0 µ0 µ + λ0 µ20 + N µ2 − 2 ( x2n + const
X X
= − xn ) µ +
2 n=1 n=1
E[ τ ] n N N o
(λ0 + N) µ2 − 2 ( λ0 µ0 + xn ) µ + (λ0 µ20 + x2n ) + const
X X
= −
2 n=1 n=1
PN 2 PN
E[τ] (λ0 + N) 2 n λ0 µ0 + n=1 xn λ0 µ0 + n=1 x2n o
= − µ −2 µ+ + const
2 λ0 + N λ0 + N
1 λN λN
ln q?
µ (µ) = ln − (µ − µ N )2
2 2π 2
200
We match the terms related to µ (the quadratic term and linear term),
yielding:
λ0 µ0 + nN=1 xn
P
λ N = E[τ] · (λ0 + N) , and λ N µ N = E[τ] · (λ0 + N) ·
λ0 + N
Therefore, we obtain:
λ0 µ0 + N x̄
µN =
λ0 + N
Where x̄ is the mean of xn , i.e.,
1 XN
x̄ = xn
N n=1
Then we deal with the other factor q τ (τ). Note that there is a typo in
Eq (10.28), the coefficient ahead of ln τ should be N2+1 . Let’s verify this by
considering the terms introducing ln τ. The first term inside the expectation,
i.e., ln p(D |µ, τ), gives N 2 ln τ, and the second term inside the expectation, i.e.,
1
ln p(µ|τ), gives 2 ln τ. Finally the last term ln p(τ) gives (a 0 − 1) ln τ. Therefore,
Eq (10.29), Eq (10.31) and Eq (10.33) will also change consequently. The right
forms of these equations will be given in this and following problems.
Now suppose that q τ (τ) is a Gamma distribution, i.e., q τ (τ) ∼ Gam(τ|a N , b N ),
we have:
ln q τ (τ) = − ln Γ(a N ) + a N ln b N + (a N − 1) ln τ − b N τ
Comparing it with Eq (10.28) and matching the coefficients ahead of τ
and ln τ, we can obtain:
N +1 N +1
a N − 1 = a0 − 1 + ⇒ a N = a0 +
2 2
And similarly
1 hX N i
b N = b 0 + Eµ (xn − µ)2 + λ0 (µ − µ0 )2
2 n=1
Just as required.
N/2
≈ hP i
1 N
2 Eµ n=1 (x n − µ)
2
N
= hP i
N
Eµ n=1 (x n − µ)
2
n1 hXN io−1
= · Eµ (xn − µ)2
N n=1
201
a 0 + (N + 1)/2
var[τ] = ³ ¤´2
b 0 + 12 Eµ N
µ 2 + λ (µ − µ )2
£ P
n=1 (x n − ) 0 0
N/2
≈
1
£ PN ¤2 ≈ 0
4 Eµ n=1 (x n − µ)
2
Just as required.
1 λ0 µ0 + N x 2
E[µ2 ] = λ−1 2
N + E[ µ ] = +( )
(λ0 + N)E[τ] λ0 + N
1
= + x2
N E[τ]
Note that since there is a typo in Eq (10.29) as stated in the previous
problem, i.e., missing a term 21 . E[τ]−1 actually equals:
hP i
1 N 2 2
1 bN b 0 + E
2 µ (x
n=1 n − µ ) + λ0 ( µ − µ 0 )
= =
E[τ] aN a 0 + (N + 1)/2
hP i
1 N 2
E
2 µ (x
n=1 n − µ )
=
(N + 1)/2
1 X N
= Eµ [ (xn − µ)2 ]
N + 1 n=1
N 1 XN
= Eµ [ (xn − µ)2 ]
N +1 N n=1
N n 2 o
= x − 2xE[µ] + E[µ2 ]
N +1
N n 2 1 o
= x − 2x2 + + x2
N +1 N E[τ]
N n 2 1 o
= x − x2 +
N +1 N E[τ]
N n 2 2
o 1
= x −x +
N +1 (N + 1)E[τ]
Rearranging it, we can obtain:
1 1 XN
= (x2 − x2 ) = (xn − x)2
E[τ] N n=1
Just as required.
Notice that the first term is actually L m defined in Eq (10.35) and that
the summation of q(m) over m equals 1, we can obtain:
λ = Lm + 1
203
One last thing worthy mentioning is that you can directly start from L m
given in Eq (10.35), without enforcing Lagrange Multiplier, to obtain q(m).
In this way, we can actually obtain:
00
p(m) exp(L )
Lm =
X
q(m) ln
m q(m)
00
It is actually the KL divergence between q(m) and p(m) exp(L ). Note
00 00
that p(m) exp(L ) is not normalized, we cannot let q(m) equal to p(m) exp(L )
to achieve the minimum of a KL distance, i.e., 0, since q(m) is a probability
distribution and should sum to 1 over m.
Therefore, we can guess that the optimal q(m) is given by the normal-
00
ized p(m) exp(L ). In this way, the constraint, i.e., summation of q(m) over
m equals 1, is implicitly guaranteed. The more strict proof using Lagrange
Multiplier has been shown above.
n=1 k=1
205
K K
z
XY X
A= ρ 11kk = ρ1 j
z1 k=1 j =1
Here we have used 1-of-K coding scheme for z1 = [z11 , z12 , ..., z1K ]T , i.e.,
only one of { z11 , z12 , ..., z1K } will be 1 and others all 0. Therefore the summa-
tion over z1 is made up of K terms, and the j-th term corresponds to z1 j = 1
and other z1 i equals 0. In this case, we have obtained:
1 YK K ³ ρ 1k ´ z1 k
q? (Z) =
z
Y
ρ 11kk = PK
j =1 ln ρ 1 j
A k=1 k=1
K
N Y
q? (Z) =
z
X X Y
nk
r nk
Z z1 ,...,z N n=1 k=1
Xn K
N Y o
z
X Y
nk
= r nk
zN z1 ,...,z N −1 n=1 k=1
Xn K NY K
−1 Y o
z ¤ £ z nk ¤
X £Y
= r NNkk · r nk
zN z1 ,...,z N −1 k=1 n=1 k=1
K
X n£ Y NY K
−1 Y o
zN k ¤ z
X
nk
= rNk · r nk
zN k=1 z1 ,...,z N −1 n=1 k=1
K
X n£ Y o
z ¤
= r NNkk · 1
zN k=1
K K
z
X Y X
= r NNkk = rNk = 1
z N k=1 k=1
The proof of the final step is exactly the same as that for N = 1. So now,
with the assumption Eq (10.48) and Eq (10.49) are right for N − 1, we have
shown that they are also correct for N. The proof is complete.
With the knowledge that the optimal q? (µ, Λ) can be written as:
K K
q? (µ, Λ) = q? (µk |Λk )q? (Λk ) = N (µk |mk , (βk Λk )−1 )W (Λk |Wk , vk ) (∗)
Y Y
k=1 k=1
207
1 K X N 1 1 T
− µT Λ r nk µT
k Λ k µ k = − µ k (β0 Λ k + N k Λ k ) µ k
X
k (β0 k ) µ k −
2 k=1 n=1 2 2
βk = β0 + N k
N N
µT
k (β0 Λ k m0 ) + r nk · µT T
k (Λ k x n ) = µ k (β0 Λ k m0 + r nk Λk xn )
X X
n=1 n=1
N
= µT
k Λ k (β0 m0 +
X
r nk xn )
n=1
= µT
k Λ k (β0 m0 + N k x̄ k )
1 N
X 1 X N
x̄k = P N r nk xn = r nk xn
n=1 r nk n=1
Nk n=1
1
mk = (β0 m0 + Nk x̄k )
βk
Now we have obtained q? (µk |Λk ) = N (µk |mk , (βk Λk )−1 ), using the rela-
tion:
ln q? (Λk ) = ln q? (µk , Λk ) − ln q? (µk |Λk )
208
Where in the last step we have used Eq (10.51). Now we are ready to
prove that T is exactly given by Eq (10.62). Let’s first consider the coefficients
ahead of the quadratic term with repsect to µk :
N N
(quad) = β0 µk µT r nk µk µT T
r nk − βk )µk µT
X X
k + k − β k µ k µ k = (β0 + k =0
n=1 n=1
1 T T 1 2
= W−
0 + N k S k + β0 m0 m0 + N k x̄ k x̄ k − β m k mT
βk k k
1 T T 1
= W−
0 + N k S k + β0 m0 m0 + N k x̄ k x̄ k − (β0 m0 + NK x̄k )(β0 m0 + NK x̄k )T
βk
1
β20 Nk2 1
= W−
0 + N k S k + (β0 − )m0 m0T + (Nk − )x̄k x̄T
k − 2(β0 m0 ) · (NK x̄k )T
βk βk βk
1 β N
0 k β N
0 k β0 NK
= W−
0 + Nk Sk + m0 m0T + x̄k x̄T
k − 2(m0 ) · (x̄k )T
βk βk βk
1 β0 N k
= W−
0 + Nk Sk + (m0 − x̄k )(m0 − x̄k )T
βk
Just as required.
h
1
= D β−k + E Λ k
(mk − xn )T Λk (mk − xn )]
n o
1 T
= D β−k + E Λk Tr[ Λ k · (m k − x n )(m k − x n ) ]
n o
1 T
= D β−k + Tr E Λ k
[ Λ k ] · (m k − x n )(m k − x n )
n o
= D βk + Tr vk Wk · (mk − xn )(mk − xn )T
−1
1 T
= D β−
k + v k (m k − x n ) W k (m k − x n )
Just as required.
Where we have used the definition of Nk . Next we deal with the last term
inside the bracket, i.e.,
1 XN X K n o 1 XN X K
r nk · − vk (xn − mk )T Wk (xn − mk ) = − Tr[r nk vk · (xn − mk )(xn − mk )T · Wk ]
2 n=1 k=1 2 n=1 k=1
1X K N
r nk vk · (xn − mk )(xn − mk )T · Wk ]
X
= − Tr[
2 k=1 n=1
Since we have:
N N
r nk vk · (xn − mk )(xn − mk )T r nk · (x̄k − mk + xn − x̄k )(x̄k − mk + xn − x̄k )T
X X
= vk
n=1 n=1
N
r nk · (x̄k − mk )(x̄k − mk )T
X
= vk
n=1
N
r nk · (xn − x̄k )(xn − x̄k )T
X
+vk
n=1
N
r nk · 2(x̄k − mk )(xn − x̄k )T
X
+vk
n=1
= vk Nk · (x̄k − mk )(x̄k − mk )T
+vk N k Sk
N N
r nk x̄k )T
X X
+vk · 2(x̄k − mk )( r nk xn −
n=1 n=1
= vk Nk · (x̄k − mk )(x̄k − mk )T + vk Nk Sk
+vk · 2(x̄k − mk )(N k x̄k − N k x̄k )T
= vk Nk · (x̄k − mk )(x̄k − mk )T + vk Nk Sk
Just as required.
K
E[ln p(π)] = ln C(α0 ) + (α0 − 1) E[ln πk ]
X
k=1
K
X
= ln C(α0 ) + (α0 − 1) ln π
ek
k=1
K K
E[ln p(µ, Λ)] = E[ln N (µk |m0 , (β0 Λk )−1 )] + E[ln W (Λk |W0 , v0 )]
X X
k=1 k=1
K n D 1 1 o
E − ln 2π + ln |β0 Λk | − (µk − m0 )T (β0 Λk )(µk − m0 )
X
=
k=1 2 2 2
K n v0 − D − 1 1 o
1
E ln B(W0 , v0 ) + ln |Λk | − Tr[W− 0 Λk ]
X
+
k=1 2 2
K n D D 1 1 o
E − ln 2π + ln β0 + ln |Λk | − (µk − m0 )T (β0 Λk )(µk − m0 )
X
=
k=1 2 2 2 2
K n v0 − D − 1 1 o
1
E ln B(W0 , v0 ) + ln |Λk | − Tr[W− Λ
X
+ 0 k ]
k=1 2 2
K · D β0 1 X K 1X K n o
= ln + ln Λ
fk − E (µk − m0 )T (β0 Λk )(µk − m0 )
2 2π 2 k=1 2 k=1
v0 − D − 1 X K 1X K n o
1
K · ln B(W0 , v0 ) + ln Λ
fk − E Tr[W− 0 Λ k ]
2 k=1 2 k=1
E[ µ k ] = m k , E[µk µT T −1 −1
k ] = mk mk + βk Λk
213
K n o K n o
E (µk − m0 )T (β0 Λk )(µk − m0 ) E Tr[Λk · (µk − m0 )(µk − m0 )T ]
X X
= β0
k=1 k=1
K n £ ¤o
Eµk ,Λk Tr Λk · (µk µT T T
X
= β0 k − 2 µ m
k 0 + m0 m0 )
k=1
K n £ ¤o
EΛk Tr Λk · (mk mT −1 −1
Λ T T
X
= β0 k + β k k − 2m m
k 0 + m m
0 0 )
k=1
K n £ ¤o
1 T T T
EΛk Tr β−
k I + Λ k · (m k m k − 2m k m0 + m0 m0 )
X
= β0
k=1
K n ¤o
1 T
EΛk D · β− Λ
X
= β0
£
k + Tr k · (m k − m0 )(m k − m0 )
k=1
K D β0 K n o
EΛk (mk − m0 )Λk (mk − m0 )T
X
= + β0
βk k=1
K D β0 K
(mk − m0 ) · EΛk [Λk ] · (mk − m0 )T
X
= + β0
βk k=1
K D β0 K
(mk − m0 ) · (vk Wk ) · (mk − m0 )T
X
= + β0
βk k=1
N,K N,K
E[ln q(Z)] = E[z nk ] · ln r nk =
X X
r nk · ln r nk
n,k=1 n,k=1
K
E[ln q(π)] = ln C(α) + (αk − 1) E[ln πk ]
X
k=1
K
X
= ln C(α0 ) + (αk − 1) ln π
ek
k=1
214
K K
E[ln q(µ, Λ)] = E[ln N (µk |mk , (βk Λk )−1 )] + E[ln W (Λk |Wk , vk )]
X X
k=1 k=1
K n D D 1 1 o
E − ln 2π + ln βk + ln |Λk | − (µk − mk )T (βk Λk )(µk − mk )
X
=
k=1 2 2 2 2
K n vk − D − 1 1 o
1
E ln B(Wk , vk ) + ln |Λk | − Tr[W− Λ
X
+ k k ]
k=1 2 2
K · D βk 1 X K 1X K n o
= ln + ln Λ
fk − E (µk − mk )T (βk Λk )(µk − mk )
2 2π 2 k=1 2 k=1
vk − D − 1 XK 1X K n o
1
K · ln B(Wk , vk ) + ln Λ
fk − E Tr[W− k Λ k ]
2 k=1 2 k=1
K · D βk 1 X K KD
= ln + ln Λ
fk −
2 2π 2 k=1 2
vk − D − 1 XK 1X K n o
1
K · ln B(Wk , vk ) + ln Λ
fk − vk E Tr[W−
k W k ]
2 k=1 2 k=1
K · D βk 1 X K KD
= ln + ln Λ
fk −
2 2π 2 k=1 2
vk − D − 1 XK 1X K
K · ln B(Wk , vk ) + ln Λ
fk − vk · D
2 k=1 2 k=1
It is identical to Eq (10.77).
Similarly, the two brackets in the first line correspond to the derivative
with respect to Eq (10.71) and (10.74). Rearranging it, we obtain Eq (10.61).
Next noticing that vk and Wk are always coupled in L , e.g.,vk occurs ahead of
quadratic terms in Eq (10.71). We will deal with vk and Wk simultaneously.
Let’s first make this more clear by writing down those terms depend on vk
and Wk in L :
K n1 o
ln Λ
e k − H[q(Λk )]
X
(10.77) ∝ (∗)
k=1 2
1X K n o
(10.71) ∝ Nk ln Λ
e k − vk · Tr[(Sk + Ak )Wk ]
2 k=1
1X K n o v −D −1 XK 1X K
(10.74) ∝ e k − β0 vk · Tr[Bk Wk ] + 0
ln Λ ln Λ vk Tr[W−1
0 Wk ]
ek −
2 k=1 2 k=1 2 k=1
K
v0 − D X 1X K
1
= ln Λ
ek − vk Tr[(β0 Bk + W−
0 )W k ]
2 k=1 2 k=1
Where ln Λ
e k is given by Eq (10.65) and Ak and Bk are given by:
vk vk D D vk + 1 − i
Γ(
X
ln B(Wk , vk ) ∝ − ln |Wk | − ln 2 − )
2 2 i =1 2
216
Note that Eq (10.77) has a minus sign in L , the negative of (∗) has been
used in the first line. We first calculate the derivative of L with respect to vk
and set it to zero:
∂L 1 e k ln Λ
d ln Λ ek 1 D
= (Nk + v0 − vk ) − − Tr[Fk Wk ] +
∂vk 2 dvk 2 2 2
D
|Wk | D ln 2 1 X 0 vk + 1 − i
+ + + Γ( )
2 2 2 i=1 2
1h d ln Λ
ek i
= (Nk + v0 − vk ) − Tr[Fk Wk ] + D = 0
2 dvk
Where in the last step, we have used the definition of ln Λ
e k , i.e., Eq
(10.65). Then we calculate the derivative of L with respect to Wk and set
it to zero:
∂L 1 1 vk vk 1
= (Nk + v0 − vk )W−
k − F k + W−
∂Wk 2 2 2 k
1 1 vk 1
= (Nk + v0 − vk )W−
k − (Fk − W−k )=0
2 2
Staring at these two derivatives long enough, we find that if the following
two conditions:
1
Nk + v0 − vk = 0 , and Fk = W− k
are satisfied, the derivatives of L with respect to vk and Wk will all be
zero. Rearranging the first condition, we obtain Eq (10.63). Next we prove
that the second condition is exactly Eq (10.62), by simplifying Fk .
1
Fk = Nk Sk + Nk Ak + β0 Bk + W−
0
1 T T
= W−
0 + N k S k + N k · (x¯k − m k )(x¯k − m k ) + β0 · (m k − m0 )(m k − m0 )
217
∂L N
X d ln π
ek d ln π
ek £ d ln π
ek d ln C(α) ¤
= r nk + (α0 − 1) − (αk − 1) + ln π
ek +
∂αk n=1 d αk d αk d αk d αk
d ln π
ek d ln C(α)
= (N k + α0 − αk ) − ln πek −
d αk d αk
£ 0 0 ¤ d[ ln Γ(αb) − ln Γ(αk ) ]
= (N k + α0 − αk ) φ (αk ) − φ (αb) − φ(αk ) − φ(α
¤ £
b) −
d αk
£ 0 0
= (N k + α0 − αk ) φ (αk ) − φ (αb) − φ(αk ) − φ(α
b) − [ φ(α
b) − φ(αk ) ]
¤ £ ¤
£ 0 0
= (N k + α0 − αk ) φ (αk ) − φ (α
¤
b) = 0
P
Note that constraint exists for r nk : k r nk = 1, we cannot calculate the
derivative and set it to zero. We must introduce a Lagrange Multiplier. Before
doing so, let’s simplify Sk + Ak :
1 X N
Sk + Ak = r nk (xn − x¯k )(xn − x¯k )T + (x¯k − mk )(x¯k − mk )T
Nk n=1
1 X N h i
= r nk xn xT T
n − 2r nk x n x¯k + r nk x¯k x¯k
T
+ x¯k x¯k T − 2x¯k mk + mk mT k
Nk n=1
PN PN
1 X N ¯ T
n=1 2r nk x n x k r nk x¯k x¯k T
= T
r nk xn xn − + n=1 + x¯k x¯k T − 2x¯k mk + mk mT
k
Nk n=1 Nk Nk
1 X N 2Nk x¯k x¯k T Nk x¯k x¯k T
= r nk xn xT
n − + + x¯k x¯k T − 2x¯k mk + mk mT
k
Nk n=1 Nk Nk
1 X N
= r nk xn xT T
n − 2x¯k m k + m k m k
Nk n=1
1 X N
= ( r nk xn xT T
n − 2N k x¯k m k + N k m k m k )
Nk n=1
1 hXN i
= r nk (xn xT
n − 2x¯k m k + m k m T
k )
Nk n=1
1 X N
= r nk (xn − mk )(xn − mk )T
Nk n=1
Therefore, we obtain:
1X n o X
L e k − D β−1 + r nk ln π
r nk ln Λ
X
∝ k e k − r nk ln r nk
2 k,n k,n k,n
1X
− Nk vk Tr[(Sk + Ak )Wk ]
2 k
1X n o X
e k − D β−1 + r nk ln π
r nk ln Λ
X
= k e k − r nk ln r nk
2 k,n k,n k,n
1X K X N
− vk r nk (xn − mk )T Wk (xn − mk )
2 k=1 n=1
1X n o X
e k − D β−1 + r nk ln π
r nk ln Λ
X
(Lagrange) = k e k − r nk ln r nk
2 k,n k,n k,n
1X K X N N
vk r nk (xn − mk )T Wk (xn − mk ) +
X X
− λn (1 − r nk )
2 k=1 n=1 n=1 k
obtain:
∂(Lagrange) 1
= e k − D β−1 } + ln π
{ln Λ e k − [ln r nk + 1]
k
∂ r nk 2
1
− vk (xn − mk )T Wk (xn − mk ) + λn = 0
2
Moving ln r nk to the right side and then exponentiating both sides, we
obtain Eq (10.67), and the normalized r nk is given by Eq (10.49), (10.46), and
(10.64)-(10.66).
z
b k=1 k=1
K £
XZ Z Z Y ¤ bzk
1
≈ b |µk , Λ−
N (x k ) · πk · q(π, µ, Λ) d π d µ d Λ
z
b k=1
K Z Z Z
1
b |µk , Λ−
N (x k ) · π k · q(π, µ, Λ) d π d µ d Λ
X £ ¤
=
k=1
K Z Z Z K
1
b |µk , Λ−
N (x q(µ j , Λ j ) d π d µ d Λ
X Y
k ) · π k · q(π) ·
£ ¤
=
k=1 j =1
Where we have used the fact that z uses a one-of-k coding scheme. Recall
that µ = {µk } and Λ = {Λk }, the term inside the summation can be further
simplified. Namely, for those index j 6= k, the integration with respect to µ j
and Λ j will equal 1, i.e.,
K Z Z Z K
1
b |µk , Λ−
N (x q(µ j , Λ j ) d π d µ d Λ
X Y
π π
£ ¤
p(x
b |X) = k ) · k · q( ) ·
k=1 j =1
K Z Z Z
1
b |µk , Λ−
N (x k ) · π k · q(π) · q(µ k , Λ k ) d π d µ k d Λ k
X
=
k=1
K Z Z Z
1 −1
b |µk , Λ−
N (x k ) · π k · Dir(π|α) · N (µ k |m k , (β k Λ k ) )W (Λ k |W k , v k ) d π d µ k d Λ k
X
=
k=1
K Z Z α
k 1 −1
N (x b |µk , Λ−
k ) · N (µ k |m k , (β k Λ k ) ) · W (Λ k |W k , v k ) d µ k d Λ k
X
p(x
b |X) =
k=1 α
b
K nZ £Z ¤ αk o
1 −1
N (x b |µk , Λ− N Λ W Λ Λ
X
= k ) · ( µ k | m k , (β k k ) ) d µ k · · ( k | W k , v k ) d k
k=1 αb
K
−1 α k
n Z o
1
N (xb |mk , (1 + β− Λ W Λ Λ
X
= k ) k ) · · ( k | W ,
k k v ) d k
k=1 αb
K α Z
k 1 −1
N (x b |mk , (1 + β− k )Λ k ) · W (Λ k |W k , v k ) d Λ k
X
=
k=1 α
b
Notice that the Wishart distribution is a conjugate prior for the Gaussian
distribution with known mean and unknown precision. We conclude that the
product of N (x
b |mk , (1 + β−
k
1
)Λ−
k
1
) · W (Λk |Wk , vk ) is again a Wishart distribu-
tion without normalized, which can be verified by focusing on the dependency
on Λk :
n Tr[Λ · (x b − m k )T ] 1
k b − m k )(x
o
(product) ∝ |Λk |1/2+(vk −D −1)/2 · exp − 1)
− Tr[Λ k W −1
k ]
2(1 + β−
k
2
0 0
∝ W (Λ k | W , v )
|I + abT | = 1 + aT b
221
vk vk D D(D − 1) D vk + 1 − i
ln Γ(
X
ln B(Wk , vk ) = − ln |Wk | − ln 2 − ln π − )
2 2 4 i =1 2
Nk Nk D D Nk + 1 − i
ln Γ(
X
→ ln | Nk Sk | − ln 2 − )
2 2 i =1 2
Nk D Nk − 1 − i
ln Γ(
X
= (D ln Nk + ln |Sk | − D ln 2) − + 1)
2 i =1 2
Nk Nk
≈ (D ln + ln |Sk |)
2 2
X D £1 Nk − 1 − i Nk − 1 − i 1 Nk − 1 − i ¤
− ln 2π − +( + ) ln
i =1 2 2 2 2 2
Nk Nk D £ N Nk Nk ¤
X k
≈ (D ln + ln |Sk |) − − + ln
2 2 i =1 2 2 2
Nk Nk Nk D Nk D Nk
= (D ln + ln |Sk |) + − ln
2 2 2 2 2
Nk
= (D + ln |Sk |)
2
Where we have used Eq (1.146) to approximate the logarithm of Gamma
222
αk (α
b − αk ) α
b·α 1
var[µk ] =
b
≤ = →0
α
b 2 (α
b + 1) α
b3 α
b
We can also conclude that pi k only achieves one value NNk , which is iden-
tical to the EM of Gaussian Mixture, i.e., Eq (9.26). Now it is trivial to see
that the predictive distribution will reduce to a Mixture of Gaussian using Eq
(10.80). Beause π, µk and Λk all reduce to a Dirac function, the integration
is easy to perform.
This can be verified directly. The total number of labeling equals assign
K labels to K object. For the first label, we have K choice, K − 1 choice for the
second label, and so on. Therefore, the total number is given by K!.
Let’s explain this problem in details. Suppose that now we have a mix-
ture of Gaussian p(Z|X), which are required to approximate. Moreover, it
has K components and each of the modes is denoted as {µ1 , µ2 , ..., µK }. We
use the variational inference, i.e., Eq (10.3), to minimize the KL divergence:
KL(q|| p), and obtain an approximate distribution q s (Z) and a corresponding
lower bound L(q s ).
According to the problem description, this approximate distribution q s (Z)
will be a single mode Gaussian located at one of the modes of p(Z|X), i.e.,
q s (Z) = N (Z|µs , Σs ), where s ∈ {1, 2, ..., K }. Now, we replicate this q s for K!
times in total. Each of the copies is moved to one mode’s center.
Now we can write down the mixing distribution made up of K! Gaussian
distribution:
1 X K!
q m (Z) = N (Z|µC (m) , Σs )
K! m=1
Where C(m) represents the mode of the m-th component. C(m) ∈ {1, 2, ..., K }.
What the problem wants us to prove is:
L(q s ) + ln K! ≈ L(q m )
p(Z|X)
Z
KL(q m || p) = − q m (Z) ln dZ
q m (Z)
Z Z
= − q m (Z) ln p(Z|X) dZ + q m (Z) ln q m (Z) dZ
Z Zn 1 XK! o
= − q m (Z) ln p(Z|X) dZ + N (Z|µC (m) , Σs ) dZ
q m (Z) ln
K! m=1
Z Z nXK! o 1
= − q m (Z) ln p(Z|X) dZ + q m (Z) ln N (Z|µC (m) , Σs ) dZ + ln
m=1 K!
Z
= − ln K! − q m (Z) ln p(Z|X) dZ
1 K!
Z X K!
nX o
+ N (Z|µC (m) , Σs ) ln N (Z|µC (m) , Σs ) dZ
K! m=1 m=1
1 K!
Z X K!
nX o
+ N (Z|µC (m) , Σs ) ln N (Z|µC (m) , Σs ) dZ
K! m=1 m=1
Z
≈ − ln K! − q m (Z) ln p(Z|X) dZ
1 K!
Z X n o
+ N (Z|µC (m) , Σs ) ln N (Z|µC (m) , Σs ) dZ
K! m=1
Z
= − ln K! − q m (Z) ln p(Z|X) dZ
Z n o
+ N (Z|µC (m) , Σs ) ln N (Z|µC (m) , Σs ) dZ (∀ m ∈ {1, 2, ..., K })
Z Z
= − ln K! − q m (Z) ln p(Z|X) dZ + q s (Z) ln q s (Z) dZ
1
Z nX K! o Z
= − ln K! − N (Z|µC (m) , Σs ) ln p(Z|X) dZ + q s (Z) ln q s (Z) dZ
K! m=1
Z Z
≈ − ln K! − q s (Z) ln p(Z|X) dZ + q s (Z) ln q s (Z) dZ
p(Z|X)
Z
= − ln K! − q s (Z) ln dZ = − ln K! + KL(q s || p)
q s (Z)
To obtain the desired result, we have adopted an approximation here,
however, you should notice that this approximation is rough.
1 XN
c= (v0 − D + z nk )
2 n=1
and
N
1n X o
B= z nk (xn − µk )(xn − µk )T + β0 (µk − m0 )(µk − m0 )T + W−
0
1
2 n=1
therefore, we obtain:
11
Λ−
k = B
c
Note that in the MAP framework, we need to solve z nk first, and then
substitute them in c and B in the expression above. Nevertheless, from the
expression above, we can see that Λ−
k
1
won’t have zero determinant.
Where the first term on the right hand side is given by Eq (10.87), the
second one is given by Eq (10.88), the third one is given by Eq (10.89), and
the last one is given by Gam(β| c 0 , d 0 ). Using the variational framework, we
assume a posterior variational distribution:
q(w, α, β) = q(w)q(α)q(β)
227
m N = Eβ [β]S N ΦT t
and n o−1
S N = Eβ [β] · ΦT Φ + Eα [α] · I
N
cN = + c0
2
and
1n o
d N = d0 + Tr[ΦT ΦS N ] + ||Φm N − t||2
2
228
where we have used (B.31). Finally, we deal with the modification of the
first term on the right hand side of Eq (10.107):
nN N β o
Eβ,w [ln p(t|w, β)] = Eβ ln β − ln 2π − Ew [||Φw − t||2 ]
2 2 2
N N Eβ [β]
= Eβ [ln β] − ln 2π − Ew [||Φw − t||2 ]
2 2 2
N cN
= (ϕ(c N ) − ln d N − ln 2π) − Ew [||Φw − t||2 ]
2 2d N
N cN n o
= (ϕ(c N ) − ln d N − ln 2π) − Tr[ΦT ΦS N ] + ||Φm N − t||2
2 2d N
The last question is the predictive distribution. It is not difficult to ob-
serve that the predictive distribution is still given by Eq (10.105) and Eq
(10.106), with 1/β replaced by 1/E[β].
Let’s deal with the terms in Eq(10.107) one by one. Noticing Eq (10.87),
we have:
N N β hXN i
E[ln p(t|w)]w = − ln(2π) + ln β − E (t n − wT φn )2
2 2 2 n=1
N N β hXN N N i
= − ln(2π) + ln β − E t2n − 2 t n · wT φ n + wT φ n · φT
X X
nw
2 2 2 n=1 n=1 n=1
N N β h T i
= − ln(2π) + ln β − E t t − 2wT ΦT t + wT · (ΦT Φ) · w
2 2 2
N N β T h £ i
= − ln(2π) + ln β − t t − βE[wT ] · ΦT t + Tr E (wwT ) · (ΦT Φ)]
¤
2 2 2
229
M M E[α]α
E ln p(w|α) w,α ln(2π) + E[ln α]α − · E[wwT ]w
£ ¤
= −
2 2 2
Then using Eq (10.93)-(10.95), (B.27), (B.30) and Eq (10.103), we obtain
Eq (10.109) just as required. Then we deal with the third term in Eq (10.107)
by noticing Eq (10.89):
and
1 M
−E[ln q(w)]w = H[w] =
ln |S N | + (1 + ln(2π))
2 2
Problem 10.28 Solution(waiting for update)
d2 d 1 1
2
(ln x) = ( ) = − 2 <0
dx dx x x
Therefore, f (x) = ln x is concave for 0 < x < ∞. Based on definition, i.e.,
Eq (10.133), we can obtain:
g(λ) = min{λ x − ln x}
x
We observe that:
d 1
(λ x − ln x) = λ −
dx x
In other words, when λ ≤ 0, λ x−ln x will always decrease as x increase. On
the other hand, when λ > 0, λ x − ln x will achieve its minimum when x = 1/λ.
Therefore, we conclude that:
1 1
g(λ) = λ · − ln = 1 + ln λ
λ λ
230
= λ · x + (1 − λ) ln(1 − λ) + λ ln λ
= λ · x − g(λ)
231
Just as required.
By now, we have already proved (∗), and thus f (x) is convex with respect
to x2 . Utilizing the convex property of f (x) with respect to x2 , we can obtain:
p p
p
y/2
p
− y/2
p
²/2
p
− ²/2 1 e ²/2 − e− ²/2
− ln(e +e ) ≥ − ln(e +e ) − ²−1/2 · p p · (y − ²)
4 e ²/2 + e− ²/2
1 eξ/2 − e−ξ/2
− ln(e x/2 + e− x/2 ) ≥ − ln(eξ/2 + e−ξ/2 ) − ξ−1 · ξ/2 · (x2 − ξ2 )
4 e + e−ξ/2
= − ln(eξ/2 + e−ξ/2 ) − λ(ξ) · (x2 − ξ2 )
Where we have used the definition of λ(ξ), i.e., Eq (10.141). Notice that
the expression above is identical to Eq (10.143), from which we can easily
obtain Eq (10.144).
product of p(t n |w). Let’s begin by deriving a sequential update formula for
1
S−
N
:
NX
+1
1 1
S−
N +1 = S−
0 +2 λ(ξn )φn φT
n
n=1
N
= 2λ(ξ N +1 )φ N +1 φT −1
λ(ξn )φn φT
X
N +1 + S0 + 2 n
n=1
= 2λ(ξ N +1 )φ N +1 φT −1
N +1 + S N
1
In conclusion, when a new data (φ N +1 , t N +1 ) arrives, we first update S−
N +1
,
and then update m N +1 based on the formula we obtained above.
d σ (ξ )
= σ(ξ) · (1 − σ(ξ))
dξ
First, we should clarify one thing and that is there is typos in Eq(10.164).
It is not difficult to observe these error if we notice that for q(w) = N (w|m N , S N ),
in its logarithm, i.e.,ln q(w), 21 ln |S N | should always have the same sign as
1 T −1
2 m N S N m N . This is our intuition. However, this is not the case in Eq(10.164).
Based on Eq(10.159), Eq(10.153) and the Gaussian prior p(w) = N (w|m0 , S0 ),
we can analytically obtain the correct lower bound L(ξ) (this will also be
strictly proved by the next problem):
1 1 1 1 T −1
L(ξ) = ln |S N | − ln |S0 | + mT −1
N S N m N − m0 S0 m0
2 2 2 2
N © 1
ln σ(ξn ) − ξn + λ(ξn )ξ2n
X ª
+
n=1 2
1 1 N © 1
ln |S N | + mT −1
ln σ(ξn ) − ξn + λ(ξn )ξ2n + const
X ª
= N SN mN +
2 2 n=1 2
Where const denotes the term unrelated to ξn because m0 and S0 don’t
1
depend on ξn . Moreover, noticing that S− N
· m N also doesn’t depend on ξn
according to Eq(10.157),thus it will be convenient to define a variable: z N =
1
S−
N
· m N , and we can easily verify:
mT −1 −1 T −1 −1 T −1 T −1
N S N m N = [S N S N m N ] S N [S N S N m N ] = [S N z N ] S N [S N z N ] = z N S N z N
ξ2n = Tr (S−1 T T
N + z N z N ) · S N · φn φn · S N
£ ¤
= (S N · φn )T · (S−1 T
N + z N z N ) · (S N · φ n )
= φT T
n · (S N + S N z N z N S N ) · φ n
= φT T
n · (S N + m N m N ) · φ n
1
Where we have used the defnition of z N ,i.e., z N = S−
N
· m N and also re-
peatedly used the symmetry property of S N .
There is a typo in Eq (10.164), for more details you can refer to the previ-
ous problem. Let’s calculate L(ξ) based on Based on Eq(10.159), Eq(10.153)
and the Gaussian prior p(w) = N (w|m0 , S0 ):
N n
h(w, ξ) p(w) = N (w|m0 , S0 ) · σ(ξn ) exp wT φn tn − (wT φn + ξn )/2
Y
n=1
o
T
−λ(ξn )([w φn ]2 − ξ2n )
n N o n 1 o
(2π)−W /2 · |S0 |−1/2 · σ(ξn ) · exp − (w − m0 )T S− 1
Y
= 0 (w − m0 )
n=1 2
N n o
exp wT φn tn − (wT φn + ξn )/2 − λ(ξn )([wT φn ]2 − ξ2n )
Y
·
n=1
n N ¡ 1 N ξ N ¢o
n
(2π)−W /2 · |S0 |−1/2 · σ(ξn ) · exp − m0T S− 1 2
Y X X
= 0 m 0 − + λ (ξ n )ξ n
n=1 2 n=1 2 n=1
n 1 ³ N ´ ³ N 1 ´o
· exp − wT S− 1 T T −1
X X
0 + 2 λ(ξ n )φ φ
n n w + w S 0 m 0 + φ n (t n −
2 n=1 n=1 2
1 |S N | 1 T −1 1 N n 1 o
− m0 S0 m0 + mT −1
ln σ(ξn ) − ξn + λ(ξn )ξ2n
X
= ln N SN mN +
2 |S0 | 2 2 n=1 2
Let’s clarify this problem. What this problem wants us to prove is that
suppose at beginning the joint distribution comprises a product of j −1 factors,
i.e.,
jY
−1
p j−1 (D, θ ) = f j−1 (θ )
i =1
and now the joint distribution comprises a product of j factors:
j
Y
p j (D, θ ) = f j (θ ) = p j−1 (D, θ ) · f j (θ )
i =1
qinit (θ ) = fe0 (θ )
Y Y
fei (θ ) = f 0 (θ ) fei (θ )
i 6=0 i 6=0
q(θ )
q/0 (θ ) f 0 (θ ) = fei (θ ) · f 0 (θ ) = qinit (θ )
Y
=
f 0 (θ ) i6=0
e
qnew (θ ) qinit (θ )
fe0 (θ ) = Z0 /0 = 1 · /0 = f 0 (θ )
q (θ ) q (θ )
Where we have completed squares over θ in the last step, and we have
defined:
h i
A = (vI)−1 − (vn I)−1 and B = 2 · − (vI)−1 · m + (vn I)−1 · mn
v/ n
= v/n · v−1 · m − · mn
vn
v/ n
= v/n ([v/n ]−1 − v−1
n )·m− · mn
vn
v/ n
= m+ · (m − mn )
vn
Which is identical to Eq (10.214). One important thing worthy clar-
ified is that: for arbitrary two Gaussian random variable, their division is
not a Gaussian. You can find more details by typing "ratio distribution" in
Wikipedia. Generally speaking, the division of two Gaussian random vari-
able follows a Cauchy distribution. Moreover, the product of two Gaussian
random variables is not a Gaussian random variable.
However, the product of two Gaussian PDF, e.g.,p(x) and p(y), can be a
Gaussian PDF because when x and y are independent, p(x, y) = p(x)p(y),
is a Gaussian PDF. In the EP framework,according to Eq (10.204), we have
already assumed that q(θ ), i.e., Eq (10.212), is given by the product of fej (θ ),
i.e.,(10.213). Therefore, their division still gives by the product of many re-
maining Gaussian PDF, which is still a Gaussian.
Finally, based on Eq (10.206) and (10.209), we can obtain:
Z
Zn = q/n (θ )p(xn |θ ) d θ
Z
N (θ |m/n , v/n I) · (1 − w)N (xn |θ , I) + wN (xn |0, αI) d θ
© ª
=
Z Z
= (1 − w) N (θ |m , v I)N (xn |θ , I) d θ + w N (θ |m/n , v/n I) · N (xn |0, αI) d θ
/n /n
This problem is really complicated, but hint has already been given in
Eq (10.244) and (10.255). Notice that in Eq (10.244), we have a quite com-
plicated term ∇m/n ln Z n , which we know that ∇m/n ln Z n = (∇m/n Z n )/Z n based
on the Chain Rule, and since we know the exact form of Z n which has been
derived in the previous problem, we guess that we can start from dealing
with ∇m/n ln Z n to obtain Eq (10.244). Before starting, we write down a basic
formula here: for a Gaussian random variable x ∼ N (x|µ, Σ), we have:
Here we have used q/n (θ ) = N (θ |m/n , v/n I), and q/n (θ ) · p(xn |θ ) = Z n ·
new
q (θ ). Rearranging the equation above, we obtain Eq (10.244). Then we
use Eq (10.216), yielding:
1
∇v/n ln Z n = · ∇ /n Z n
Zn v
1
Z
= · ∇v/n q/n (θ )p(xn |θ )d θ
Zn
1
Z n o
= · ∇v/n q/n (θ ) p(xn |θ )d θ
Zn
1 1 D o /n
Z n
/n 2
= · || m − θ || − q (θ ) · p(xn |θ )d θ
Z 2(v/n )2 2v/n
Zn n 1 D o
= qnew (θ ) · (m /n
− θ ) T
(m /n
− θ ) − dθ
2(v/n )2 2v/n
1 n T /n /n 2
o D
= / n 2
E [ θθ ] − 2 E [θ ]m + ||m || − /n
2(v ) 2v
There is a typo in Eq (10.255), and the intrinsic reason is that when calcu-
lating ∇v/n q/n (θ ), there are two terms in q/n (θ ) dependent on v/n : one is inside
the exponential, and the other is in the fraction |v/n1I|1/2 , which is outside the
exponential. Now, we still use Eq (10.216), yielding:
1 1 D
(1 − w)N (xn |m/n , (v/n + 1)I) · /n 2
£ ¤
∇v/n ln Z n = || x n − m || −
Zn 2(v/n + 1)2 2(v/n + 1)
1 D
= ρn · ||xn − m/n ||2 −
£ ¤
/ n
2(v + 1)2 2(v n + 1)
/
1 n o 1 n o
v = · E[θ T θ ] − E[θ T ]E[θ ] = · E[θ T θ ] − ||E[θ ]||2
D D
1 n o
= · 2(v ) · ∇v/n ln Z n + 2E[θ ]m/n − ||m/n ||2 + D · v/n − ||E[θ ]||2
/n 2
D
1 n o
= · 2(v/n )2 · ∇v/n ln Z n − ||E[θ ] − m/n ||2 + D · v/n
D
1 n 1 o
= · 2(v/n )2 · ∇v/n ln Z n − ||v/n · ρ n · /n (xn − m/n )||2 + D · v/n
D v +1
If we substitute ∇v/n ln Z n into the expression above, we will just obtain
Eq (10.215) as required.
241
1X L
E[ fb] = E[ f (z(l ) )]
L l =1
1X L
= E[ f (z(l ) )]
L l =1
1
= · L · E[ f ] = E[ f ]
L
Where we have used the fact that the expectation and the summation can
exchange order because all the z(l ) are independent, and that E[ f (z(l ) )] = E[ f ]
because all the z(l ) are drawn from p(z). Next, we deal with the variance:
1 X L
2 (l ) L2 − L
= E [ f (z )] + E[ f ]2 − E[ f ]2
L2 l =1 L2
1 X L 1
= E[ f 2 (z(l ) )] − E[ f ]2
L2 l =1 L
1 1
= 2
· L · E[ f 2 ] − E[ f ] 2
L L
1 1 1
= E[ f ] − E[ f ]2 = E[( f − E[ f ])2 ]
2
L L L
Just as required.
What this problem wants us to prove is that if we use y = h−1 (z) to trans-
form the value of z to y, where z satisfies a uniform distribution over [0, 1]
and h(·) is defined by Eq(11.6), we can enforce y to satisfy a specific desired
distribution p(y). Let’s prove it beginning by Eq (11.1):
Z y
dz d
p? (y) = p(z) · |
0
| = 1 · h (y) = p( b
y)d b
y = p(y)
dy d y −∞
242
Just as required.
Therefore, since we know that z = h(y) = tan−1 (y), we can obtain the
transformation from z to y: y = tan(z).
−2 ln r 2 1/2
y1 = r cos θ ( ) = cos θ (−2 ln r 2 )1/2 (∗)
r2
Similarly, we also have:
It is easily to obtain:
∂(y1 , y2 )
| | = (−2r −1 (cos2 θ + sin2 θ )) = −2r −1
∂(r, θ )
∂(z1 , z2 ) 1 r2 1 2 1 n y2 + y2 o
p(y1 , y2 ) = p(z1 , z2 )| |= ·|− | = r = exp 1 2
∂(y1 , y2 ) π 2 2π 2π −2
Just as required.
Just as required.
pe(z0 )
P[accept|z0 ] =
kq(z0 )
Since we know z0 is drawn from q(z), we can obtain the total acceptance
rate by integral:
pe(z0 )
Z Z
P[accept] = P[accept|z0 ] · q(z0 ) dz0 = dz0
k
It is identical to Eq (11.14). We substitute Eq (11.13) into the expression
above, yielding:
Zp
P[accept] =
k
We define a very small vector ², and we can obtain:
Notice that the symbols used in the main text is different from those in
the problem description. in the following, we will use those in the main text.
Namely, y satisfies a uniform distribution on interval [0, 1], and z = b tan y+ c.
Then we aims to prove Eq (11.16). Since we know that:
dy
q(z) = p(y) · | |
dz
and that:
z−c dy 1 1
y = arctan => = ·
b dz b 1 + [(z − c)/b]2
1 1
q(z) = 1 · ·
b 1 + [(z − c)/b]2
Here we use e z i,i+1 to represent the intersection point of the i-th and i + 1-
th envelope, q i (z) to represent the comparison function of the i-th envelope,
and N is the total number of the envelopes.Notice that e z0,1 and e
z N,N +1 could
be −∞ and ∞ correspondingly.
First, from Fig.11.6, we see that: q(z i ) = pe(z i ), substituting the expres-
sion above into the equation and yielding:
k i λ i = pe(z i ) (∗)
One important thing should be made clear is that we can only evaluate
pe(z) at specific point z, but not the normalized PDF p(z). This is the assump-
tion of rejection sampling. For more details, please refer to section 11.1.2.
Notice that q i (z) and q i+1 (z) should have the same value at e z i,i+1 , we
obtain:
k i λ i exp{−λ i ( e
z i,i+1 − z i )} = k i+1 λ i+1 exp{−λ i+1 ( e
z i,i+1 − z i+1 )}
246
1 n k i λi o
z i,i+1 = ln + λ i z i − λ i+1 z i+1 (∗∗)
λ i − λ i+1 k i+1 λ i+1
e
Before moving on, we should make some clarifications: the adaptive re-
jection sampling begins with several grid points, e.g., z1 , z2 , ..., z N , and then
we evaluate the derivative of pe(z) at those points, i.e., λ1 , λ2 , ..., λ N . Then we
can easily obtain k i based on (∗), and next the intersection points e z i,i+1 based
on (∗∗).
In this problem, we will still use the same notation as in the previous one.
First, we need to know the probability of sampling from each segment. Notice
that Eq (11.17) is not correctly normalized, we first calculate its normaliza-
tion constant Z q :
Z z N,N +1
e N Z
X z i,i+1
e
Zq = q(z) dz = q i (z i ) dz i
z 0, 1
e z i−1,i
i =1 e
N Z
X z i,i+1
e
= k i λ i exp{−λ i (z − z i )} dz i
z i−1,i
i =1 e
N
X ¯ ez i,i+1
= − k i exp{−λ i (z − z i )}¯
¯
z i−1,i
i =1
e
N
X h i XN
= − k i exp{−λ i ( e
z i,i+1 − z i )} − exp{−λ i ( e
z i−1,i − z i )} = k
bi
i =1 i =1
From this derivation, we know that the probability of sampling from the
b i /Z q , where Z q = P N k
i-th segment is given by k i =1 i . Therefore, now we define
b
an auxiliary random variable η, which is uniform in interval [0, 1], and then
define:
1 jX −1
bm , 1
Xj
i = j if η ∈ [ k k
b m ], j = 1, 2, ..., N (∗∗)
Z q m=0 Z q m=0
−k i h i
= · exp{−λ i (z − z i )} − exp{−λ i ( e z i−1,i − z i )}
k
bi
ki h i
= · exp(λ i z i ) exp{−λ i e z i−1,i } − exp{−λ i z}
k
bi
1 h ξ i
h−i 1 (ξ) = · ln exp{−λ i e z i−1,i } −
−λ i ki
· exp(λ i z i )
k
bi
h i
1 ln exp{−λ i z
e i −1,i }
= ·
−λ i ln
bi ξ
k
k i ·exp(λ i z i )
z i−1,i
e
=
ln ξ + ln kk ii − λ i z i
b
Eτ [(z(τ) )2 ] = 0.5 · Eτ−1 [(zτ−1 )2 ] + 0.25 · Eτ−1 [(zτ−1 + 1)2 ] + 0.25 · Eτ−1 [(zτ−1 − 1)2 ]
= Eτ−1 [(zτ−1 )2 ] + 0.5
If the initial state is z(0) = 0 (there is a typo in the line below Eq (11.36)),
we can obtain Eτ [(z(τ) )2 ] = τ/2 just as required.
This problem requires you to know the definition of detailed balance, i.e.,
Eq (11.40):
p? (z)T(z, z0 ) = p? (z0 )T(z0 , z)
Note that here z and z0 are the sampled values of [z1 , z2 , ..., z M ]T in two
consecutive Gibbs Sampling step. Without loss of generality, we assume that
we are now updating zτj to zτj +1 in step τ:
To be more specific, we write down the first line based on Gibbs sampling,
where zτ/ j denotes all the entries in vector zτ except zτj . In the second line,
we use the conditional property, i.e, p(a, b) = p(a| b)p(b) for the first term.
In the third line, we use the fact that zτ/ j = zτ/ j+1 . Then we reversely use
the conditional property for the last two terms in the fourth line, and finally
obtain what has been asked.
Obviously, Gibbs Sampling is not ergodic for this specific distribution, and
the quick reason is that neither the projection of the two shaded region on z1
axis nor z2 axis overlaps. For instance, we denote the left down shaded region
as region 1. If the initial sample falls into this region, no matter how many
steps have been carried out, all the generated samples will be in region 1. It
is the same for the right up region.
Where we have used the fact that the mean of z i is µ i , i.e., E[z i ] = µ i , and
that the mean of v is 0, i.e., E[v] = 0. Then we deal with the variance:
var[z0i ] = E[(z0i − µ i )2 ]
= E[(α(z i − µ i ) + σ i (1 − α2i )1/2 v)2 ]
= E[α2 (z i − µ i )2 ] + E[σ2i (1 − α2i )v2 ] + E[2α(z i − µ i ) · σ i (1 − α2i )1/2 v]
= α2 · E[(z i − µ i )2 ] + σ2i (1 − α2i ) · E[v2 ] + 2α · σ i (1 − α2i )1/2 · E[(z i − µ i )v]
= α2 · var[z i ] + σ2i (1 − α2i ) · (var[v] + E[v]2 ) + 2α · σ i (1 − α2i )1/2 · E[(z i − µ i )] · E[v]
= α2 · σ2i + σ2i (1 − α2i ) · 1 + 0
= σ2i
Where we have used the fact that z i and v are independent and thus
E[(z i − µ i )v] = E[z i − µ i ] · E[v] = 0
∂H ∂K
= = ri
∂r i ∂r i
∂H ∂E
=
∂zi ∂zi
250
There are typos in Eq (11.68) and (11.69). The signs in the exponential of
the second argument of the min function is not right. To be more specific, Eq
(11.68) should be:
1 1
exp(− H(R))δV min{1, exp(H(R) − H(R 0 ))} (∗)
ZH 2
and Eq (11.69) is given by:
1 1
exp(− H(R 0 ))δV min{1, exp(H(R 0 ) − H(R))} (∗∗)
ZH 2
When H(R) = H(R 0 ), they are clearly equal. When H(R) > H(R 0 ), (∗) will
reduce to:
1 1
exp(− H(R))δV
ZH 2
Because the min function will give 1, and in this case (∗∗) will give:
1 1 1 1
exp(− H(R 0 ))δV exp(H(R 0 ) − H(R))} = exp(− H(R))δV
ZH 2 ZH 2
Therefore, they are identical, and it is similar when H(R) < H(R 0 ).
By analogy to Eq (12.2), we can conclude that the projected data with re-
spect to a vector u M +1 should have variance given by uT M +1
Su M +1 . Moreover,
there are two constraints for u M +1 : (1) it should be correctly normalized, i.e.,
uT u
M +1 M +1
= 1, and (2) it should be orthogonal to all the previous M chosen
eigenvectors {u1 , u2 , ..., u M }. We aim to maximize the variance with respect
to u M +1 satisfying these two constraints. This can be done by enforcing the
Lagrange Multiplier:
M
L = uT T
η m uT
X
M +1 Su M +1 + λ(1 − u M +1 u M +1 ) + m u M +1 (∗)
m=1
251
∂L M
X
= 2Su M +1 − 2λu M +1 + η m um = 0
∂u M +1 m=1
We left multiply uT
m , yielding:
M
2uT T T
= 2uT
X
m Su M +1 − 2λu m u M +1 + u m η m um m Su M +1 − 0 + η m
m=1
= 2uT
M +1 Su m + η m
= 2uT
M +1 λ m u m + η m
= ηm
where we have used the property of orthogonality and in the second line
we have transpose the first term and use the property that S is symmetric. So
now we obtain η m = 0. This will directly lead (∗) reduce to the form as shown
in Eq (12.4), and thus consequently we now need to choose a eigenvector of S
among those not chosen, which has the largest eigenvalue.
1 T
uTi u i = v XXT v i
N λi i
1 T
v XXT v i = λ i vTi v i = λ i ||v i ||2 = λ i
N i
Here we have used the fact that v i has unit length. Substituting it back
into uTi u i , we can obtain:
uTi u i = 1
Just as required.
and
b = WΣ1/2
W
Therefore, in the general case, the final form of p(x) can still be written
as Eq (12.35).