10/36-702 Statistical Machine Learning Homework #2 Solutions
Solution.
We have
$$
\frac{\int y\,\hat p(x,y)\,dy}{\hat p(x)}
= \frac{\frac{1}{nh^{2}}\int y\sum_i K\!\big(\tfrac{x-X_i}{h}\big)K\!\big(\tfrac{y-Y_i}{h}\big)\,dy}{\frac{1}{nh}\sum_i K\!\big(\tfrac{x-X_i}{h}\big)}
= \frac{\sum_i K\!\big(\tfrac{x-X_i}{h}\big)\int y\,\tfrac{1}{h}K\!\big(\tfrac{y-Y_i}{h}\big)\,dy}{\sum_i K\!\big(\tfrac{x-X_i}{h}\big)}
= \frac{\sum_i K\!\big(\tfrac{x-X_i}{h}\big)Y_i}{\sum_i K\!\big(\tfrac{x-X_i}{h}\big)}
= \hat m(x),
$$
where the third equality uses ∫ y (1/h) K((y − Y_i)/h) dy = Y_i, which holds because K integrates to one and has mean zero (substitute u = (y − Y_i)/h).
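This identity is easy to verify numerically. The sketch below is an illustrative check only, with simulated data, a Gaussian kernel, and an arbitrary bandwidth: it computes ∫ y p̂(x, y) dy / p̂(x) by numerical integration and compares it with the Nadaraya–Watson estimator m̂(x).

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 200, 0.3                          # illustrative sample size and bandwidth
X = rng.uniform(-2, 2, n)
Y = np.sin(X) + 0.2 * rng.standard_normal(n)

def K(u):
    """Gaussian kernel: integrates to one and has mean zero."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

x0 = 0.5

# Nadaraya-Watson estimator: sum_i K((x0 - X_i)/h) Y_i / sum_i K((x0 - X_i)/h)
w = K((x0 - X) / h)
m_nw = np.sum(w * Y) / np.sum(w)

# Ratio of kernel density estimates, with the y-integral done on a fine grid
y_grid = np.linspace(Y.min() - 5 * h, Y.max() + 5 * h, 4000)
dy = y_grid[1] - y_grid[0]
p_joint = (K((x0 - X)[None, :] / h) * K((y_grid[:, None] - Y[None, :]) / h)).sum(axis=1) / (n * h**2)
p_marg = np.sum(w) / (n * h)
m_ratio = np.sum(y_grid * p_joint) * dy / p_marg

print(m_nw, m_ratio)   # the two values agree up to numerical-integration error
```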
$$
\begin{pmatrix} X \\ Y \end{pmatrix}
\sim N\!\left( \begin{pmatrix} \mu \\ \eta \end{pmatrix},\;
\begin{pmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{pmatrix} \right).
$$
(a) (5 pts.) Show that m(x) = E[Y ∣X = x] = α + βx and find explicit expressions for α and β.
(b) (5 pts.) Find the maximum likelihood estimator m̂(x) = α̂ + β̂x.
(c) (5 pts.) Show that ∣m̂(x) − m(x)∣² = O_P(n⁻¹).
Solution.
(a) The conditional distribution is
$$
Y \mid X = x \;\sim\; N\!\left(\eta + \frac{\rho\tau}{\sigma}(x-\mu),\; (1-\rho^2)\tau^2\right),
$$
which gives
$$
\alpha = \eta - \frac{\tau\rho\mu}{\sigma}
\qquad\text{and}\qquad
\beta = \frac{\tau\rho}{\sigma}.
$$
(b) Given a sample (X_1, Y_1), . . . , (X_n, Y_n), the MLEs for the bivariate normal parameters are
$$
\hat\mu = \bar X, \qquad
\hat\eta = \bar Y, \qquad
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2, \qquad
\hat\tau^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)^2,
$$
$$
\widehat{\mathrm{Cov}}(X,Y) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y).
$$
Note that β = τρ/σ = ρστ/σ² = Cov(X, Y)/σ². Then by the equivariance property of the MLE,
$$
\hat\beta = \frac{\widehat{\mathrm{Cov}}(X,Y)}{\hat\sigma^2}
\qquad\text{and}\qquad
\hat\alpha = \bar Y - \hat\beta \bar X.
$$
Again by equivariance, m̂(x) = α̂ + β̂x.
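As a quick numerical check of these plug-in formulas (a sketch with arbitrary illustrative parameter values), α̂ and β̂ computed from a large simulated sample should be close to α = η − τρμ/σ and β = τρ/σ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, eta, sigma, tau, rho = 1.0, -0.5, 2.0, 1.5, 0.6   # illustrative parameter values
n = 100_000

cov = np.array([[sigma**2, rho * sigma * tau],
                [rho * sigma * tau, tau**2]])
X, Y = rng.multivariate_normal([mu, eta], cov, size=n).T

# MLE plug-ins: np.var and np.cov with ddof=0 give the 1/n versions used above
beta_hat = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)     # Cov_hat(X, Y) / sigma_hat^2
alpha_hat = Y.mean() - beta_hat * X.mean()            # Ybar - beta_hat * Xbar

beta_true = tau * rho / sigma
alpha_true = eta - beta_true * mu
print(beta_hat, beta_true)    # close for large n
print(alpha_hat, alpha_true)
```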
(c) m̂(x) is an MLE and satisfies the regularity conditions for asymptotic normality. Therefore,
$$
\sqrt{n}\,\big(\hat m(x) - m(x)\big) \rightsquigarrow N\!\big(0,\ I^{-1}(m(x))\big),
$$
which implies
$$
\sqrt{n}\,|\hat m(x) - m(x)| = O_P(1)
\quad\Longrightarrow\quad
|\hat m(x) - m(x)|^2 = O_P(n^{-1}).
$$
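The O_P(n⁻¹) rate can also be seen in simulation: across repeated samples, n·∣m̂(x) − m(x)∣² stays stochastically bounded as n grows. A brief sketch, again with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eta, sigma, tau, rho, x0 = 0.0, 0.0, 1.0, 1.0, 0.7, 0.5   # illustrative values
beta = tau * rho / sigma
alpha = eta - beta * mu
cov = [[sigma**2, rho * sigma * tau], [rho * sigma * tau, tau**2]]

for n in (100, 1_000, 10_000):
    scaled_sq_errors = []
    for _ in range(200):                                   # repeated samples at each n
        X, Y = rng.multivariate_normal([mu, eta], cov, size=n).T
        b = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
        a = Y.mean() - b * X.mean()
        scaled_sq_errors.append(n * (a + b * x0 - (alpha + beta * x0)) ** 2)
    # the distribution of n * |m_hat(x0) - m(x0)|^2 is stable in n, i.e. O_P(n^{-1})
    print(n, np.median(scaled_sq_errors))
```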
where B(x) is the cube containing x and n(x) = ∑_i 1(X_i ∈ B(x)). Assume that ∣m(x) − m(y)∣ ≤ L∥x − y∥₂ for all x, y. You may further assume that sup_x Var(Y ∣ X = x) < ∞.
(a) (10 pts.) Show that
$$
|E[\hat m(x)] - m(x)| \le C_1 h
$$
for some C₁ > 0. Also show that
$$
\mathrm{Var}(\hat m(x)) \le \frac{C_2}{n(x)}
$$
for some C₂ > 0.
(b) (10 pts.) Now let X be random and assume that X has a uniform density on [0, 1]^d. Let h ≡ h_n = (C log n/n)^{1/d}. Show that, for C > 0 large enough, P(min_j n_j = 0) → 0 as n → ∞, where n_j is the number of observations in cube B_j.
Solution.
(a) We treat the X_i as fixed, so that E[Y_i] = m(X_i). Were they random, the argument below would still apply after conditioning on X_1, . . . , X_n, using the law of iterated expectation and the law of total variance. We have
$$
\begin{aligned}
\big|E[\hat m(x)] - m(x)\big|
&= \left| E\!\left[\frac{1}{n(x)}\sum_i Y_i\, 1\{X_i \in B(x)\}\right] - m(x) \right| \\
&= \left| \frac{1}{n(x)}\sum_i \big(E[Y_i] - m(x)\big)\, 1\{X_i \in B(x)\} \right| \\
&= \left| \frac{1}{n(x)}\sum_i \big(m(X_i) - m(x)\big)\, 1\{X_i \in B(x)\} \right| \\
&\le \frac{1}{n(x)}\sum_i \big|m(X_i) - m(x)\big|\, 1\{X_i \in B(x)\} \\
&\le \frac{1}{n(x)}\sum_i L\sqrt{d}\,h \cdot 1\{X_i \in B(x)\} \\
&= L\sqrt{d}\,h,
\end{aligned}
$$
with the first upper bound due to the triangle inequality and the second because, for x, y in the same cube B_j with side length h,
$$
\|x - y\|_2^2 = \sum_{j=1}^d (x_j - y_j)^2 \le d h^2
\quad\Longrightarrow\quad
\|x - y\|_2 \le \sqrt{d}\, h,
$$
so that ∣m(X_i) − m(x)∣ ≤ L∥X_i − x∥₂ ≤ L√d h whenever X_i ∈ B(x). Hence the bias bound holds with C₁ = L√d.
For the variance, using the independence of the Y_i,
$$
\begin{aligned}
\mathrm{Var}(\hat m(x))
&= \mathrm{Var}\!\left(\frac{1}{n(x)}\sum_i Y_i\, 1\{X_i \in B(x)\}\right) \\
&= \frac{1}{n^2(x)}\sum_i \mathrm{Var}(Y_i)\, 1\{X_i \in B(x)\} \\
&\le \frac{M}{n(x)},
\end{aligned}
$$
where M = sup_x Var(Y ∣ X = x), so the variance bound holds with C₂ = M.
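For intuition, here is a minimal sketch of the partitioning estimator itself (d = 1, with a Lipschitz m chosen only for illustration); the estimate at x differs from m(x) by an O(h) bias plus noise whose variance is at most M/n(x), matching the two bounds above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, h = 2_000, 0.05                       # h is the side length of the cubes
X = rng.uniform(0, 1, n)                 # d = 1 for simplicity
m = lambda x: np.abs(x - 0.5)            # Lipschitz regression function (L = 1)
Y = m(X) + 0.3 * rng.standard_normal(n)  # sup_x Var(Y | X = x) = 0.09

def m_hat(x):
    """Partitioning estimate: average the Y_i whose X_i fall in the cube B(x) containing x."""
    in_cube = np.floor(X / h) == np.floor(x / h)
    n_x = in_cube.sum()
    return Y[in_cube].mean() if n_x > 0 else np.nan

x0 = 0.3
print(m_hat(x0), m(x0))   # differ by O(h) bias plus noise of variance <= 0.09 / n(x0)
```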
(b) We have
$$
\begin{aligned}
P\big(\min_j n_j = 0\big)
&= P\!\left(\bigcup_{j=1}^B \{n_j = 0\}\right) \\
&\le \sum_{j=1}^B P(n_j = 0) \\
&= \sum_{j=1}^B \prod_{i=1}^n \big(1 - P(X_i \in B_j)\big) \\
&= \frac{1}{h^d}\,(1 - h^d)^n \\
&= \frac{n}{C\log n}\left(1 - \frac{C\log n}{n}\right)^n,
\end{aligned}
$$
since P(X_i ∈ B_j) = h^d under the uniform density and B = 1/h^d, assuming 1/h is an integer (otherwise we could use that as an upper bound). Take C = 1. Then, using 1 − u ≤ e^{−u},
$$
\frac{n}{C\log n}\left(1 - \frac{C\log n}{n}\right)^n
< \frac{n}{C\log n}\, e^{-\frac{C\log n}{n}\cdot n}
= \frac{n}{C\log n}\, n^{-C}
= \frac{1}{C\log n}
\;\to\; 0.
$$
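The displayed bound can also be evaluated directly; the short sketch below (with illustrative choices of d and C) tabulates (n/(C log n))(1 − C log n/n)ⁿ against the limit expression 1/(C log n), confirming the slow decay to zero.

```python
import numpy as np

# Evaluate the union bound B (1 - h^d)^n = (n / (C log n)) (1 - C log n / n)^n derived above.
d, C = 2, 1.0                            # illustrative dimension and constant
for n in (10**3, 10**5, 10**7, 10**9):
    hd = C * np.log(n) / n               # h^d when h = (C log n / n)^(1/d)
    B = 1.0 / hd                         # number of cubes, treating 1/h^d as an integer
    bound = B * (1.0 - hd) ** n
    print(n, bound, 1.0 / (C * np.log(n)))   # bound ~ 1/(C log n) -> 0
```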
for some Mercer kernel function K : R^d × R^d → R. In this problem, you will prove that the above problem is equivalent to the finite-dimensional one
(a) (5 pts.) Let f = ∑_{i=1}^n α_i K(⋅, x_i), and consider defining a function f̃ = f + ρ, where ρ is any function orthogonal to K(⋅, x_i), i = 1, . . . , n. Using the properties of the representers, prove that f̃(x_i) = f(x_i) for all i = 1, . . . , n, and ∥f̃∥²_H ≥ ∥f∥²_H.
(b) (10 pts.) Conclude from part (a) that in the infinite-dimensional problem (1), we need only consider functions of the form f = ∑_{i=1}^n α_i K(⋅, x_i), and that this in turn reduces to (2).
Solution.
(a) By the reproducing property of the representers K(⋅, x_i),
$$
\tilde f(x_i) = \langle \tilde f, K(\cdot, x_i)\rangle_{\mathcal{H}_K}
= \langle f, K(\cdot, x_i)\rangle_{\mathcal{H}_K} + \langle \rho, K(\cdot, x_i)\rangle_{\mathcal{H}_K}
= f(x_i) + 0 = f(x_i)
$$
for all i = 1, . . . , n, since ρ is orthogonal to each K(⋅, x_i). Also, because
$$
\langle \rho, f\rangle_{\mathcal{H}_K}
= \Big\langle \rho,\ \sum_{i=1}^n \alpha_i K(\cdot, x_i)\Big\rangle_{\mathcal{H}_K}
= \sum_{i=1}^n \alpha_i \langle \rho, K(\cdot, x_i)\rangle_{\mathcal{H}_K}
= 0,
$$
we have
$$
\|\tilde f\|_{\mathcal{H}_K}^2 = \|f + \rho\|_{\mathcal{H}_K}^2
= \|f\|_{\mathcal{H}_K}^2 + 2\langle f, \rho\rangle_{\mathcal{H}_K} + \|\rho\|_{\mathcal{H}_K}^2
= \|f\|_{\mathcal{H}_K}^2 + \|\rho\|_{\mathcal{H}_K}^2
\ge \|f\|_{\mathcal{H}_K}^2.
$$
(b) For any f̃ ∈ H_K, let ỹ = (f̃(x_1), . . . , f̃(x_n))ᵀ ∈ Rⁿ. Let f ∈ H_K be f = ∑_{i=1}^n α_i K(⋅, x_i), where α = K⁻¹ỹ. Then
$$
\langle \tilde f - f, K(\cdot, x_i)\rangle_{\mathcal{H}_K}
= \langle \tilde f, K(\cdot, x_i)\rangle_{\mathcal{H}_K} - \sum_{j=1}^n \alpha_j \langle K(\cdot, x_j), K(\cdot, x_i)\rangle_{\mathcal{H}_K}
= \tilde f(x_i) - \sum_{j=1}^n \alpha_j K(x_i, x_j)
= \tilde f(x_i) - (K\alpha)_i
= \tilde f(x_i) - \tilde y_i = 0.
$$
Hence f̃ − f ⊥ K(⋅, x_i) for all i = 1, . . . , n, and from (a) this implies f̃(x_i) = f(x_i) for all i = 1, . . . , n and ∥f̃∥²_{H_K} ≥ ∥f∥²_{H_K}, where equality holds if and only if f̃ = f. Therefore,
$$
\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}_K}^2
\le \sum_{i=1}^n \big(y_i - \tilde f(x_i)\big)^2 + \lambda \|\tilde f\|_{\mathcal{H}_K}^2,
$$
where equality holds if and only if f̃ = f. Hence if f̃ = argmin_{f ∈ H_K} ∑_{i=1}^n (y_i − f(x_i))² + λ∥f∥²_{H_K}, then f̃ = ∑_{i=1}^n α_i K(⋅, x_i) with α = K⁻¹ỹ. So we only need to consider functions of the form f = ∑_{i=1}^n α_i K(⋅, x_i). Plugging in, we have
$$
\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}_K}^2
= \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j)\Big)^2
+ \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j),
$$
which is exactly the finite-dimensional problem (2).
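The finite-dimensional problem in α is a kernel ridge regression and has a standard closed form: when K is invertible, setting the gradient 2K(Kα − y) + 2λKα to zero gives α = (K + λI)⁻¹y. A brief sketch with a Gaussian kernel and arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam, bw = 100, 0.1, 0.5
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)

# Gram matrix of a Gaussian (Mercer) kernel K(x, x') = exp(-(x - x')^2 / (2 bw^2))
Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw**2))

# Minimize ||y - K alpha||^2 + lam * alpha^T K alpha:
# gradient 2 K (K alpha - y) + 2 lam K alpha = 0  =>  alpha = (K + lam I)^{-1} y (K invertible)
alpha = np.linalg.solve(Kmat + lam * np.eye(n), y)

def f_hat(x_new):
    """Fitted function f(x) = sum_i alpha_i K(x, x_i)."""
    return np.exp(-(x_new - x) ** 2 / (2 * bw**2)) @ alpha

print(f_hat(0.0), np.sin(0.0))   # fitted value near x = 0 vs. the noiseless target
```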
Solution.
Write r(c) = E[∣Y − c∣ ∣ X = x]. Then r′(c) = P(Y ≤ c ∣ X = x) − P(Y > c ∣ X = x), and setting r′(c) = 0 gives P(Y ≤ c ∣ X = x) = 1/2, so that c = m(x), the median of p_{Y∣X=x}(y). It is a minimum since r′(c) < 0 for c < m(x) and r′(c) > 0 for c > m(x). Since m minimizes E[∣Y − c∣ ∣ X = x] at every x, for any g we get
$$
E\big[\,|Y - m(X)|\ \big|\ X = x\,\big] \;\le\; E\big[\,|Y - g(X)|\ \big|\ X = x\,\big]
\quad\text{for every } x,
$$
which implies, taking expectations over X,
$$
E\,|Y - m(X)| \;\le\; E\,|Y - g(X)|.
$$
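The median characterization is easy to check numerically: for a skewed distribution, the Monte Carlo estimate of E∣Y − c∣ over a grid of c is minimized near the sample median rather than the mean. A small illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
Y = rng.exponential(scale=2.0, size=50_000)       # skewed: median = 2 ln 2 != mean = 2

c_grid = np.linspace(0.0, 5.0, 201)
risk = np.abs(Y[None, :] - c_grid[:, None]).mean(axis=1)   # Monte Carlo E|Y - c| per c

c_star = c_grid[np.argmin(risk)]
print(c_star, np.median(Y), Y.mean())   # minimizer is close to the median, not the mean
```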
(c) Setting the first derivative of the risk equal to 0, we obtain
$$
\frac{\partial R(\beta)}{\partial \beta} = 0
\;\Longrightarrow\; \frac{\partial}{\partial \beta}\, E\big[(Y - \beta^T X)^2\big] = 0
\;\Longrightarrow\; E\big[-2X(Y - \beta^T X)\big] = 0
\;\Longrightarrow\; 2B\beta - 2\alpha = 0
\;\Longrightarrow\; \beta_* = B^{-1}\alpha,
$$
where B = E[XXᵀ] and α = E[XY], and where we can exchange the derivative and expectation by the dominated convergence theorem. The risk R(β) is strictly convex (provided B is positive definite), so β∗ is its unique minimizer.
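The sample analogue of β∗ = B⁻¹α replaces B = E[XXᵀ] and α = E[XY] by their empirical averages, which recovers the usual least-squares estimator; a brief sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 100_000, 3
beta_true = np.array([1.0, -2.0, 0.5])

X = rng.standard_normal((n, d))
Y = X @ beta_true + rng.standard_normal(n)

B_hat = X.T @ X / n                       # sample analogue of B = E[X X^T]
a_hat = X.T @ Y / n                       # sample analogue of alpha = E[X Y]
beta_star_hat = np.linalg.solve(B_hat, a_hat)

print(beta_star_hat, beta_true)           # close for large n
```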
Find an explicit form for θ̂ in three cases: (i) (10 pts.) J(θ) = ∥θ∥₀, (ii) (10 pts.) J(θ) = ∥θ∥₁, and (iii) (5 pts.) J(θ) = ∥θ∥₂².
Solution.
(i) The objective separates across coordinates:
$$
\sum_{i=1}^d (Y_i - \theta_i)^2 + \lambda\|\theta\|_0
= \sum_{i=1}^d \Big( (Y_i - \theta_i)^2 + \lambda\, 1(\theta_i \ne 0) \Big).
$$
For each coordinate, the minimum of (Y_i − θ_i)² + λ1(θ_i ≠ 0) over θ_i is min{Y_i², λ}: taking θ_i = 0 costs Y_i², while taking θ_i = Y_i costs λ, so
$$
\hat\theta_i =
\begin{cases}
0 & \text{if } Y_i^2 < \lambda \\
0 \text{ or } Y_i & \text{if } Y_i^2 = \lambda \\
Y_i & \text{if } Y_i^2 > \lambda.
\end{cases}
$$
Hence
$$
\sum_{i=1}^d (Y_i - \theta_i)^2 + \lambda\|\theta\|_0
= \sum_{i=1}^d \Big( (Y_i - \theta_i)^2 + \lambda\, 1(\theta_i \ne 0) \Big)
\ge \sum_{i=1}^d \min\{Y_i^2, \lambda\},
$$
and equality holds if and only if
$$
\hat\theta_i =
\begin{cases}
0 & \text{if } |Y_i| < \sqrt{\lambda} \\
0 \text{ or } Y_i & \text{if } |Y_i| = \sqrt{\lambda} \\
Y_i & \text{if } |Y_i| > \sqrt{\lambda},
\end{cases}
$$
which is the hard-thresholding estimator.
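In code, the hard-thresholding rule from (i) is one line (a sketch; the tie ∣Y_i∣ = √λ is resolved to Y_i here, and either choice attains the minimum):

```python
import numpy as np

def hard_threshold(y, lam):
    """Minimizer of sum_i (y_i - theta_i)^2 + lam * ||theta||_0: keep y_i iff y_i^2 >= lam."""
    return np.where(y**2 >= lam, y, 0.0)

print(hard_threshold(np.array([0.2, -1.5, 0.9, -0.1]), lam=1.0))   # [ 0.  -1.5  0.   0. ]
```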
(ii) The objective again separates across coordinates. Writing θ̂_i^{OLS} = Y_i, the subgradient optimality condition for minimizing (Y_i − θ_i)² + λ∣θ_i∣ is
$$
0 \in -2\big(Y_i - \theta_i\big) + \lambda\,\partial|\theta_i|
$$
for all i = 1, . . . , d. When θ̂_i^{OLS} ≥ 0, then θ̂_i ≥ 0, so θ̂_i = (θ̂_i^{OLS} − λ/2) 1{θ̂_i^{OLS} ≥ λ/2}; symmetrically, when θ̂_i^{OLS} ≤ 0, θ̂_i = (θ̂_i^{OLS} + λ/2) 1{θ̂_i^{OLS} ≤ −λ/2}. Combining the cases,
$$
\hat\theta_i =
\begin{cases}
\hat\theta_i^{OLS} - \frac{\lambda}{2} & \hat\theta_i^{OLS} \ge \frac{\lambda}{2} \\[2pt]
0 & \hat\theta_i^{OLS} \in \big(-\frac{\lambda}{2}, \frac{\lambda}{2}\big) \\[2pt]
\hat\theta_i^{OLS} + \frac{\lambda}{2} & \hat\theta_i^{OLS} \le -\frac{\lambda}{2},
\end{cases}
$$
i.e. the soft-thresholding estimator with threshold λ/2.
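Likewise, the soft-thresholding rule from (ii), written compactly as a sketch:

```python
import numpy as np

def soft_threshold(y, lam):
    """Minimizer of sum_i (y_i - theta_i)^2 + lam * ||theta||_1: shrink each y_i toward 0 by lam/2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

print(soft_threshold(np.array([0.2, -1.5, 0.9, -0.1]), lam=1.0))   # [ 0.  -1.   0.4  0. ]
```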
(iii) Here the objective function is differentiable everywhere. Taking the gradient with respect to θ, the i-th component is
$$
\frac{\partial}{\partial \theta_i}\Big( \sum_{j=1}^d (Y_j - \theta_j)^2 + \lambda\|\theta\|_2^2 \Big)
= -2(Y_i - \theta_i) + 2\lambda\theta_i.
$$
Setting each component to zero gives
$$
\hat\theta_i = \frac{Y_i}{1 + \lambda}, \qquad i = 1, \ldots, d,
$$
i.e. θ̂ = Y/(1 + λ), a pure shrinkage (ridge) estimator.
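And the closed form for (iii) shrinks every coordinate by the factor 1/(1 + λ) (a sketch):

```python
import numpy as np

def ridge_shrink(y, lam):
    """Minimizer of sum_i (y_i - theta_i)^2 + lam * ||theta||_2^2: theta_i = y_i / (1 + lam)."""
    return y / (1.0 + lam)

print(ridge_shrink(np.array([0.2, -1.5, 0.9, -0.1]), lam=1.0))   # [ 0.1  -0.75  0.45 -0.05]
```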