
Carnegie Mellon University Department of Statistics & Data Science

10/36-702 Statistical Machine Learning Homework #2


Solutions

DUE: 3:00 PM February 22, 2019

Problem 1 [10 pts.]


Consider the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $X_i \in \mathbb{R}$ and $Y_i \in \mathbb{R}$. Inspired by the fact that $\mathbb{E}[Y \mid X = x] = \int y\, p(x,y)\, dy / p(x)$, define
$$
\hat{m}(x) = \frac{\int y\, \hat{p}(x,y)\, dy}{\hat{p}(x)}
$$
where
$$
\hat{p}(x) = \frac{1}{n} \sum_i \frac{1}{h} K\!\left(\frac{X_i - x}{h}\right)
$$
and
$$
\hat{p}(x,y) = \frac{1}{n} \sum_i \frac{1}{h^2} K\!\left(\frac{X_i - x}{h}\right) K\!\left(\frac{Y_i - y}{h}\right).
$$
Assume that $\int K(u)\, du = 1$ and $\int u K(u)\, du = 0$. Show that $\hat{m}(x)$ is exactly the kernel regression estimator that we defined in class.

Solution.

$$
\frac{\int y\, \hat{p}(x,y)\, dy}{\hat{p}(x)}
= \frac{\frac{1}{nh^2} \sum_i K\!\left(\frac{X_i - x}{h}\right) \int y\, K\!\left(\frac{Y_i - y}{h}\right) dy}{\frac{1}{nh} \sum_i K\!\left(\frac{X_i - x}{h}\right)}
= \frac{\sum_i K\!\left(\frac{X_i - x}{h}\right) \int y\, \frac{1}{h} K\!\left(\frac{Y_i - y}{h}\right) dy}{\sum_i K\!\left(\frac{X_i - x}{h}\right)}
= \frac{\sum_i K\!\left(\frac{X_i - x}{h}\right) Y_i}{\sum_i K\!\left(\frac{X_i - x}{h}\right)}
= \hat{m}(x),
$$

where the third equality follows from the substitution $u = (y - Y_i)/h$, which gives
$$
\int y\, \frac{1}{h} K\!\left(\frac{Y_i - y}{h}\right) dy = \int (Y_i + hu)\, K(u)\, du = Y_i
$$
using $\int K(u)\, du = 1$ and $\int u K(u)\, du = 0$. The final expression is exactly the Nadaraya–Watson kernel regression estimator defined in class.
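As a concrete illustration (not part of the original assignment), here is a minimal Nadaraya–Watson estimator with a Gaussian kernel; the function name and the simulated data are my own and purely for demonstration.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Kernel regression: sum_i K((X_i - x0)/h) Y_i / sum_i K((X_i - x0)/h)."""
    # Gaussian kernel; any kernel with integral 1 and first moment 0 works.
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(K * Y) / np.sum(K)

# Quick check on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 500)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=500)
print(nadaraya_watson(0.25, X, Y, h=0.05))  # should be close to sin(pi/2) = 1
```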


Problem 2 [15 pts.]


Suppose that $(X, Y)$ is bivariate Normal:
$$
\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu \\ \eta \end{pmatrix},\; \begin{pmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{pmatrix} \right).
$$

(a) (5 pts.) Show that $m(x) = \mathbb{E}[Y \mid X = x] = \alpha + \beta x$ and find explicit expressions for $\alpha$ and $\beta$.

(b) (5 pts.) Find the maximum likelihood estimator $\hat{m}(x) = \hat{\alpha} + \hat{\beta} x$.

(c) (5 pts.) Show that $|\hat{m}(x) - m(x)|^2 = O_P(n^{-1})$.

Solution.

(a) Standard calculations for the bivariate Normal show that
$$
Y \mid X = x \sim N\!\left( \eta + \frac{\tau\rho}{\sigma}(x - \mu),\; (1 - \rho^2)\tau^2 \right),
$$
which gives
$$
\alpha = \eta - \frac{\tau\rho\mu}{\sigma} \quad \text{and} \quad \beta = \frac{\tau\rho}{\sigma}.
$$

(b) Given a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, the MLEs of the bivariate Normal parameters are
$$
\hat{\mu} = \bar{X}, \qquad
\hat{\eta} = \bar{Y}, \qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, \qquad
\hat{\tau}^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2, \qquad
\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}).
$$
Note that $\beta = \frac{\tau\rho}{\sigma} = \frac{\tau\rho\sigma}{\sigma^2} = \frac{\mathrm{Cov}(X, Y)}{\sigma^2}$. Then by the equivariance property of the MLE,
$$
\hat{\beta} = \frac{\widehat{\mathrm{Cov}}(X, Y)}{\hat{\sigma}^2}
\quad \text{and} \quad
\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}.
$$
Again by equivariance, $\hat{m}(x) = \hat{\alpha} + \hat{\beta} x$.


(c) $\hat{m}(x)$ is an MLE and satisfies the regularity conditions for asymptotic normality. Therefore,
$$
\sqrt{n}\,\big(\hat{m}(x) - m(x)\big) \rightsquigarrow N\!\left(0,\, I^{-1}(m(x))\right),
$$
which implies
$$
\sqrt{n}\,|\hat{m}(x) - m(x)| = O_P(1) \;\Longrightarrow\; |\hat{m}(x) - m(x)|^2 = O_P(n^{-1}).
$$
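A small simulation, purely illustrative, that computes the plug-in MLE $\hat{m}(x) = \hat{\alpha} + \hat{\beta}x$ and checks the $O_P(n^{-1})$ rate empirically; the parameter values and evaluation point are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, eta, sigma, tau, rho = 1.0, 2.0, 1.5, 0.8, 0.6
alpha, beta = eta - tau * rho * mu / sigma, tau * rho / sigma  # true m(x) = alpha + beta*x
x0 = 0.5

for n in [100, 1000, 10000]:
    cov = [[sigma**2, rho * sigma * tau], [rho * sigma * tau, tau**2]]
    X, Y = rng.multivariate_normal([mu, eta], cov, size=n).T
    beta_hat = np.cov(X, Y, bias=True)[0, 1] / np.var(X)   # MLE plug-in: Cov_hat / sigma_hat^2
    alpha_hat = Y.mean() - beta_hat * X.mean()
    m_hat, m_true = alpha_hat + beta_hat * x0, alpha + beta * x0
    print(n, n * (m_hat - m_true) ** 2)  # n * |m_hat - m|^2 stays bounded as n grows
```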


Problem 3 [20 pts.]


Let $m(x) = \mathbb{E}[Y \mid X = x]$ and let $X \in [0,1]^d$. Divide $[0,1]^d$ into cubes $B_1, \ldots, B_N$ whose sides have length $h$. The data are $(X_1, Y_1), \ldots, (X_n, Y_n)$. In this problem, treat the $X_i$'s as fixed. Assume that the number of observations in each bin is positive. Let
$$
\hat{m}(x) = \frac{1}{n(x)} \sum_i Y_i\, \mathbb{1}(X_i \in B(x))
$$
where $B(x)$ is the cube containing $x$ and $n(x) = \sum_i \mathbb{1}(X_i \in B(x))$. Assume that
$$
|m(y) - m(x)| \le L \|x - y\|_2
$$
for all $x, y$. You may further assume that $\sup_x \mathrm{Var}(Y \mid X = x) < \infty$.

(a) (10 pts.) Show that
$$
|\mathbb{E}[\hat{m}(x)] - m(x)| \le C_1 h
$$
for some $C_1 > 0$. Also show that
$$
\mathrm{Var}(\hat{m}(x)) \le \frac{C_2}{n(x)}
$$
for some $C_2 > 0$.

(b) (10 pts.) Now let $X$ be random and assume that $X$ has a uniform density on $[0,1]^d$. Let $h \equiv h_n = (C \log n / n)^{1/d}$. Show that, for $C > 0$ large enough, $P(\min_j n_j = 0) \to 0$ as $n \to \infty$, where $n_j$ is the number of observations in cube $B_j$.

Solution.

(a) Since the $X_i$'s are fixed, $\mathbb{E}[Y_i] = m(X_i)$. (If they were not fixed, the argument below would still apply after conditioning on the $X_i$'s, using the law of iterated expectation and the law of total variance.) Then
$$
\begin{aligned}
|\mathbb{E}[\hat{m}(x)] - m(x)|
&= \left| \mathbb{E}\!\left[ \frac{1}{n(x)} \sum_i Y_i\, \mathbb{1}\{X_i \in B(x)\} \right] - m(x) \right| \\
&= \left| \frac{1}{n(x)} \sum_i \big( \mathbb{E}[Y_i] - m(x) \big)\, \mathbb{1}\{X_i \in B(x)\} \right| \\
&= \left| \frac{1}{n(x)} \sum_i \big( m(X_i) - m(x) \big)\, \mathbb{1}\{X_i \in B(x)\} \right| \\
&\le \frac{1}{n(x)} \sum_i |m(X_i) - m(x)|\, \mathbb{1}\{X_i \in B(x)\} \\
&\le \frac{1}{n(x)} \sum_i L\sqrt{d}\,h\, \mathbb{1}\{X_i \in B(x)\} \\
&= L\sqrt{d}\,h,
\end{aligned}
$$
so $C_1 = L\sqrt{d}$. The first inequality is the triangle inequality, and the second holds because, for $x, y$ in the same cube $B(x)$,
$$
\|x - y\|_2^2 = \sum_{j=1}^d (x_j - y_j)^2 \le d h^2 \;\Longrightarrow\; \|x - y\|_2 \le \sqrt{d}\,h.
$$


Let $M = \sup_x \mathrm{Var}(Y \mid X = x)$. Since the $Y_i$'s are independent,
$$
\mathrm{Var}(\hat{m}(x))
= \mathrm{Var}\!\left( \frac{1}{n(x)} \sum_i Y_i\, \mathbb{1}\{X_i \in B(x)\} \right)
= \frac{1}{n^2(x)} \sum_i \mathrm{Var}(Y_i)\, \mathbb{1}\{X_i \in B(x)\}
\le \frac{M}{n(x)},
$$
so $C_2 = M$.
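For concreteness, a minimal sketch of the piecewise-constant (binned) estimator $\hat{m}$ above, assuming the cubes are aligned with the grid $\{0, h, 2h, \ldots\}$; the function name, bandwidth choice, and data are my own illustrative choices.

```python
import numpy as np

def binned_estimate(x0, X, Y, h):
    """Average of Y_i over the cube of side h that contains x0 (grid-aligned bins)."""
    in_bin = np.all(np.floor(X / h) == np.floor(np.asarray(x0) / h), axis=1)
    if not in_bin.any():
        raise ValueError("empty bin")  # part (b) shows this is unlikely when h^d >= C log(n)/n
    return Y[in_bin].mean()

rng = np.random.default_rng(0)
n, d = 2000, 2
X = rng.uniform(0, 1, size=(n, d))
Y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=n)
h = (np.log(n) / n) ** (1 / d)
print(binned_estimate([0.3, 0.7], X, Y, h))
```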

(b) Let $N$ denote the number of cubes; if $1/h$ is an integer then $N = 1/h^d$ (otherwise $1/h^d$ can be used as an upper bound). Since $X$ has the uniform density on $[0,1]^d$, $P(X_i \in B_j) = h^d$ for each cube $B_j$, and the $X_i$'s are independent. Therefore
$$
\begin{aligned}
P\Big(\min_j n_j = 0\Big)
&= P\left( \bigcup_{j=1}^{N} \{n_j = 0\} \right) \\
&\le \sum_{j=1}^{N} P(n_j = 0) \\
&= \sum_{j=1}^{N} \prod_{i=1}^{n} \big(1 - P(X_i \in B_j)\big) \\
&= \frac{1}{h^d} \left(1 - h^d\right)^n \\
&= \frac{n}{C \log n} \left(1 - \frac{C \log n}{n}\right)^n.
\end{aligned}
$$
Take $C \ge 1$ (the choice $C = 1$ already suffices). Using $1 - t \le e^{-t}$,
$$
\frac{n}{C \log n} \left(1 - \frac{C \log n}{n}\right)^n
\le \frac{n}{C \log n}\, e^{-C \log n}
= \frac{n^{1-C}}{C \log n}
\longrightarrow 0.
$$
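A small Monte Carlo check (illustrative only, with a helper function of my own) that with $h_n = (C \log n / n)^{1/d}$ and $C = 1$ the probability of an empty bin is already modest at moderate $n$ and decreases slowly, consistent with the $1/\log n$ bound above.

```python
import numpy as np

def any_empty_bin(n, d, C=1.0, rng=None):
    """Return True if some grid cube of side h_n = (C log n / n)^(1/d) contains no point."""
    if rng is None:
        rng = np.random.default_rng()
    h = (C * np.log(n) / n) ** (1 / d)
    k = int(np.ceil(1 / h))                       # bins per axis
    X = rng.uniform(0, 1, size=(n, d))
    idx = np.minimum((X / h).astype(int), k - 1)  # cube index of each point
    flat = np.ravel_multi_index(idx.T, (k,) * d)
    return np.unique(flat).size < k ** d

rng = np.random.default_rng(0)
for n in [500, 5000, 50000]:
    hits = sum(any_empty_bin(n, d=2, rng=rng) for _ in range(200))
    print(n, hits / 200)   # empirical P(min_j n_j = 0); decreases slowly, roughly like 1/log n
```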


Problem 4 [15 pts.]


Consider the RKHS problem
$$
\hat{f} = \operatorname*{argmin}_{f \in \mathcal{H}} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2, \tag{1}
$$
for some Mercer kernel function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. In this problem, you will prove that the above problem is equivalent to the finite-dimensional one
$$
\hat{\alpha} = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha, \tag{2}
$$
where $K \in \mathbb{R}^{n \times n}$ denotes the kernel matrix $K_{ij} = K(x_i, x_j)$.

Recall that the functions $K(\cdot, x_i)$, $i = 1, \ldots, n$, are called the representers of evaluation, and that

• $\langle f, K(\cdot, x_i) \rangle_{\mathcal{H}} = f(x_i)$ for any function $f \in \mathcal{H}$,

• $\|f\|_{\mathcal{H}}^2 = \sum_{i,j=1}^n \alpha_i \alpha_j K(x_i, x_j)$ for any function $f = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$.

(a) (5 pts.) Let $f = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$, and consider defining a function $\tilde{f} = f + \rho$, where $\rho$ is any function orthogonal to $K(\cdot, x_i)$, $i = 1, \ldots, n$. Using the properties of the representers, prove that $\tilde{f}(x_i) = f(x_i)$ for all $i = 1, \ldots, n$, and $\|\tilde{f}\|_{\mathcal{H}}^2 \ge \|f\|_{\mathcal{H}}^2$.

(b) (10 pts.) Conclude from part (a) that in the infinite-dimensional problem (1) we need only consider functions of the form $f = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$, and that this in turn reduces to (2).

Solution.

(a) Since $f, \tilde{f} \in \mathcal{H}_K$, for all $i = 1, \ldots, n$,
$$
\tilde{f}(x_i) = \langle \tilde{f}, K(\cdot, x_i) \rangle_{\mathcal{H}_K}
= \langle f, K(\cdot, x_i) \rangle_{\mathcal{H}_K} + \langle \rho, K(\cdot, x_i) \rangle_{\mathcal{H}_K}
= \langle f, K(\cdot, x_i) \rangle_{\mathcal{H}_K}
= f(x_i).
$$
Also, because
$$
\langle \rho, f \rangle_{\mathcal{H}_K}
= \left\langle \rho, \sum_{i=1}^n \alpha_i K(\cdot, x_i) \right\rangle_{\mathcal{H}_K}
= \sum_{i=1}^n \alpha_i \langle \rho, K(\cdot, x_i) \rangle_{\mathcal{H}_K}
= 0,
$$
we have
$$
\|\tilde{f}\|_{\mathcal{H}_K}^2
= \langle f, f \rangle_{\mathcal{H}_K} + \langle \rho, \rho \rangle_{\mathcal{H}_K} + 2\langle \rho, f \rangle_{\mathcal{H}_K}
= \|f\|_{\mathcal{H}_K}^2 + \|\rho\|_{\mathcal{H}_K}^2
\ge \|f\|_{\mathcal{H}_K}^2.
$$


(b) For any $\tilde{f} \in \mathcal{H}_K$, let $\tilde{y} = (\tilde{f}(x_1), \ldots, \tilde{f}(x_n))^T \in \mathbb{R}^n$, and let $f \in \mathcal{H}_K$ be $f = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$, where $\alpha = K^{-1}\tilde{y}$ (assuming the kernel matrix $K$ is invertible). Then
$$
\begin{aligned}
\langle \tilde{f} - f, K(\cdot, x_i) \rangle_{\mathcal{H}_K}
&= \langle \tilde{f}, K(\cdot, x_i) \rangle_{\mathcal{H}_K} - \sum_{j=1}^n \alpha_j \langle K(\cdot, x_j), K(\cdot, x_i) \rangle_{\mathcal{H}_K} \\
&= \tilde{f}(x_i) - \sum_{j=1}^n \alpha_j K(x_i, x_j) \\
&= \tilde{f}(x_i) - [K(K^{-1}\tilde{y})]_i \\
&= \tilde{f}(x_i) - \tilde{f}(x_i) \\
&= 0.
\end{aligned}
$$
Hence $\tilde{f} - f \perp K(\cdot, x_i)$ for all $i = 1, \ldots, n$, and by part (a) (applied with $\rho = \tilde{f} - f$) this implies $\tilde{f}(x_i) = f(x_i)$ for all $i = 1, \ldots, n$ and $\|\tilde{f}\|_{\mathcal{H}_K}^2 \ge \|f\|_{\mathcal{H}_K}^2$, with equality if and only if $\tilde{f} = f$. Therefore
$$
\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_K}^2
\le \sum_{i=1}^n (y_i - \tilde{f}(x_i))^2 + \lambda \|\tilde{f}\|_{\mathcal{H}_K}^2,
$$
where equality holds if and only if $\tilde{f} = f$. Hence if $\tilde{f} = \operatorname*{argmin}_{f \in \mathcal{H}_K} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_K}^2$, then $\tilde{f} = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$ with $\alpha = K^{-1}\tilde{y}$. So we only need to consider functions of the form $f = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$. Plugging this form in, we have
$$
\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_K}^2
= \sum_{i=1}^n \left( y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j) \right)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j)
= \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha.
$$
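A minimal numeric sketch of the finite-dimensional problem (2) with a Gaussian (RBF) kernel; the kernel, its `gamma` parameter, and the data are my own choices. When $K$ is invertible, the first-order condition $K\big((K + \lambda I)\alpha - y\big) = 0$ gives the familiar $\hat{\alpha} = (K + \lambda I)^{-1} y$, which is what the code solves.

```python
import numpy as np

def rbf_kernel_matrix(X, Z, gamma=1.0):
    """K_ij = exp(-gamma * ||X_i - Z_j||^2), a Mercer kernel."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

lam = 0.1
K = rbf_kernel_matrix(X, X)
alpha_hat = np.linalg.solve(K + lam * np.eye(n), y)   # solves (K + lam I) alpha = y

# Fitted function: f_hat(x) = sum_j alpha_j K(x, x_j)
x_new = np.array([[0.5]])
print(rbf_kernel_matrix(x_new, X) @ alpha_hat)        # roughly sin(1.5) ~ 1.0
```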


Problem 5 [15 pts.]


Let $X = (X(1), \ldots, X(d)) \in \mathbb{R}^d$ and $Y \in \mathbb{R}$. In the questions below, make any reasonable assumptions that you need, but state your assumptions.

(a) (5 pts.) Prove that $\mathbb{E}(Y - m(X))^2$ is minimized by choosing $m(x) = \mathbb{E}(Y \mid X = x)$.

(b) (5 pts.) Find the function $m(x)$ that minimizes $\mathbb{E}|Y - m(X)|$. (You can assume that the conditional cdf $F(y \mid X = x)$ is continuous and strictly increasing for every $x$.)

(c) (5 pts.) Prove that $\mathbb{E}(Y - \beta^T X)^2$ is minimized by choosing $\beta_* = B^{-1}\alpha$, where $B = \mathbb{E}(XX^T)$, $\alpha = (\alpha_1, \ldots, \alpha_d)$, and $\alpha_j = \mathbb{E}(Y X(j))$.

Solution.

(a) Let $g(x)$ be any function of $x$. Then
$$
\begin{aligned}
\mathbb{E}(Y - g(X))^2
&= \mathbb{E}(Y - m(X) + m(X) - g(X))^2 \\
&= \mathbb{E}(Y - m(X))^2 + \mathbb{E}(m(X) - g(X))^2 + 2\,\mathbb{E}\big[(Y - m(X))(m(X) - g(X))\big] \\
&\ge \mathbb{E}(Y - m(X))^2 + 2\,\mathbb{E}\big[(Y - m(X))(m(X) - g(X))\big] \\
&= \mathbb{E}(Y - m(X))^2 + 2\,\mathbb{E}\Big[\mathbb{E}\big[(Y - m(X))(m(X) - g(X)) \,\big|\, X\big]\Big] \\
&= \mathbb{E}(Y - m(X))^2 + 2\,\mathbb{E}\big[(\mathbb{E}(Y \mid X) - m(X))(m(X) - g(X))\big] \\
&= \mathbb{E}(Y - m(X))^2 + 2\,\mathbb{E}\big[(m(X) - m(X))(m(X) - g(X))\big] \\
&= \mathbb{E}(Y - m(X))^2.
\end{aligned}
$$

(b) Let $g(x)$ be any function of $x$. Recall that
$$
\mathbb{E}[\,|Y - g(X)|\,] = \mathbb{E}\big\{ \mathbb{E}[\,|Y - g(X)|\, \mid X\,] \big\}.
$$
The idea is to choose, for each $x$, the constant $c$ that minimizes $\mathbb{E}[\,|Y - c|\, \mid X = x\,]$. Define
$$
r(c) = \mathbb{E}[\,|Y - c|\, \mid X = x\,] = \int |y - c|\, p_{Y \mid X = x}(y)\, dy.
$$
The function $h_y(c) = |y - c|$ is differentiable everywhere except at $c = y$; for $c \ne y$,
$$
h_y'(c) = \begin{cases} 1 & c > y \\ -1 & c < y \end{cases}
\;=\; \mathbb{1}(c > y) - \mathbb{1}(c < y).
$$
Since $Y$ is continuous and has a density, $P(Y = c) = 0$. So to minimize $r(c)$ we can differentiate under the integral sign and set the derivative equal to zero:
$$
r'(c) = \int h_y'(c)\, p_{Y \mid X = x}(y)\, dy
= \int_{-\infty}^{c} p_{Y \mid X = x}(y)\, dy - \int_{c}^{\infty} p_{Y \mid X = x}(y)\, dy
= 2 \int_{-\infty}^{c} p_{Y \mid X = x}(y)\, dy - 1 = 0
\;\Longleftrightarrow\;
\int_{-\infty}^{c} p_{Y \mid X = x}(y)\, dy = \frac{1}{2},
$$
so $c = m(x)$, the median of $p_{Y \mid X = x}(y)$. It is a minimum since $r'(c) < 0$ for $c < m(x)$ and $r'(c) > 0$ for $c > m(x)$. Since $m(x)$ minimizes $\mathbb{E}[\,|Y - c|\, \mid X = x\,]$ at every $x$, for any $g$ we get
$$
\mathbb{E}[\,|Y - g(X)| - |Y - m(X)|\, \mid X = x\,] \ge 0,
$$
which implies
$$
R(g) - R(m) = \mathbb{E}[\,|Y - g(X)| - |Y - m(X)|\,] = \mathbb{E}\big\{ \mathbb{E}[\,|Y - g(X)| - |Y - m(X)|\, \mid X\,] \big\} \ge 0.
$$
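A quick numeric illustration (not part of the solution) that a median, rather than the mean, minimizes expected absolute error; the skewed distribution and grid of candidate constants are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.exponential(scale=2.0, size=200_000)   # skewed, so mean != median

cands = np.linspace(0.1, 4.0, 400)
mae = [np.mean(np.abs(Y - c)) for c in cands]  # E|Y - c| over candidate constants c
print("argmin c:", cands[int(np.argmin(mae))])
print("sample median:", np.median(Y))          # the two should agree closely
print("sample mean:", Y.mean())                # the mean minimizes E(Y - c)^2 instead
```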

(c) Setting the first derivative of the risk $R(\beta) = \mathbb{E}(Y - \beta^T X)^2$ equal to zero gives
$$
\frac{\partial R(\beta)}{\partial \beta} = 0
\;\Longrightarrow\;
\frac{\partial\, \mathbb{E}(Y - \beta^T X)^2}{\partial \beta} = 0
\;\Longrightarrow\;
\mathbb{E}\big[-2X(Y - \beta^T X)\big] = 0
\;\Longrightarrow\;
2B\beta - 2\alpha = 0
\;\Longrightarrow\;
\beta_* = B^{-1}\alpha,
$$
where we can exchange the derivative and the expectation by the dominated convergence theorem. Assuming $B = \mathbb{E}(XX^T)$ is positive definite (hence invertible), the risk $R(\beta)$ is strictly convex, so $\beta_*$ is its unique minimizer.
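As an illustration, $\beta_* = B^{-1}\alpha$ can be estimated by replacing $B$ and $\alpha$ with sample moments; the simulated design and coefficients below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)

B_hat = X.T @ X / n              # estimate of B = E[X X^T]
alpha_hat = X.T @ Y / n          # estimate of alpha_j = E[Y X(j)]
beta_star = np.linalg.solve(B_hat, alpha_hat)
print(beta_star)                 # approximately [1.0, -2.0, 0.5]
```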


Problem 6 [25 pts.]


Consider the many Normal means problem where we observe $Y_i \sim N(\theta_i, 1)$ for $i = 1, \ldots, d$. Let $\hat{\theta}$ minimize the penalized loss
$$
\sum_i (Y_i - \theta_i)^2 + \lambda J(\theta).
$$
Find an explicit form for $\hat{\theta}$ in three cases: (i) (10 pts.) $J(\theta) = \|\theta\|_0$, (ii) (10 pts.) $J(\theta) = \|\theta\|_1$, and (iii) (5 pts.) $J(\theta) = \|\theta\|_2^2$.

Solution.

(i) Note that
$$
\sum_i (Y_i - \theta_i)^2 + \lambda \|\theta\|_0 = \sum_{i=1}^d \Big( (Y_i - \theta_i)^2 + \lambda\, \mathbb{1}(\theta_i \ne 0) \Big),
$$
so the problem separates over coordinates. For each term $i$,
$$
(Y_i - \theta_i)^2 + \lambda\, \mathbb{1}(\theta_i \ne 0)
\ge Y_i^2\, \mathbb{1}(\theta_i = 0) + \lambda\, \mathbb{1}(\theta_i \ne 0)
\ge \min\{Y_i^2, \lambda\},
$$
and equality holds if and only if
$$
\hat{\theta}_i = \begin{cases}
0 & \text{if } Y_i^2 < \lambda \\
0 \text{ or } Y_i & \text{if } Y_i^2 = \lambda \\
Y_i & \text{if } Y_i^2 > \lambda.
\end{cases}
$$
Hence
$$
\sum_i (Y_i - \theta_i)^2 + \lambda \|\theta\|_0
= \sum_{i=1}^d \Big( (Y_i - \theta_i)^2 + \lambda\, \mathbb{1}(\theta_i \ne 0) \Big)
\ge \sum_{i=1}^d \min\{Y_i^2, \lambda\},
$$
and equality holds if and only if
$$
\hat{\theta}_i = \begin{cases}
0 & \text{if } |Y_i| < \sqrt{\lambda} \\
0 \text{ or } Y_i & \text{if } |Y_i| = \sqrt{\lambda} \\
Y_i & \text{if } |Y_i| > \sqrt{\lambda},
\end{cases}
$$
i.e., $\hat{\theta}$ is the hard-thresholding estimator with threshold $\sqrt{\lambda}$.


(ii) First write
$$
\min_{\theta} \sum_i (Y_i - \theta_i)^2 + \lambda \|\theta\|_1
= \min_{\theta} \sum_i \left( -2 Y_i \theta_i + \theta_i^2 + \lambda |\theta_i| \right) + \sum_i Y_i^2,
$$
and since the last term does not depend on $\theta$, the problem is equivalent to minimizing, separately for each $i = 1, \ldots, d$,
$$
-2 Y_i \theta_i + \theta_i^2 + \lambda |\theta_i|
\;\Longleftrightarrow\;
-2 \hat{\theta}_i^{\mathrm{OLS}} \theta_i + \theta_i^2 + \lambda |\theta_i|,
$$
where $\hat{\theta}_i^{\mathrm{OLS}} = Y_i$.

When $\hat{\theta}_i^{\mathrm{OLS}} \ge 0$, the minimizer satisfies $\hat{\theta}_i \ge 0$, so the objective becomes
$$
-2 \hat{\theta}_i^{\mathrm{OLS}} \theta_i + \theta_i^2 + \lambda \theta_i.
$$
Differentiating with respect to $\theta_i$, setting the derivative equal to zero, and solving (subject to $\theta_i \ge 0$) gives
$$
\hat{\theta}_i = \left( \hat{\theta}_i^{\mathrm{OLS}} - \frac{\lambda}{2} \right) \mathbb{1}\{\hat{\theta}_i^{\mathrm{OLS}} \ge \tfrac{\lambda}{2}\}.
$$
When $\hat{\theta}_i^{\mathrm{OLS}} \le 0$, the analogous steps give
$$
\hat{\theta}_i = \left( \hat{\theta}_i^{\mathrm{OLS}} + \frac{\lambda}{2} \right) \mathbb{1}\{\hat{\theta}_i^{\mathrm{OLS}} \le -\tfrac{\lambda}{2}\}.
$$
Putting these together gives the soft-thresholding estimator
$$
\hat{\theta}_i = \begin{cases}
\hat{\theta}_i^{\mathrm{OLS}} - \frac{\lambda}{2} & \hat{\theta}_i^{\mathrm{OLS}} \ge \frac{\lambda}{2} \\[2pt]
0 & \hat{\theta}_i^{\mathrm{OLS}} \in \left( -\frac{\lambda}{2}, \frac{\lambda}{2} \right) \\[2pt]
\hat{\theta}_i^{\mathrm{OLS}} + \frac{\lambda}{2} & \hat{\theta}_i^{\mathrm{OLS}} \le -\frac{\lambda}{2}.
\end{cases}
$$

(iii) Here the objective function is differentiable everywhere. Taking the gradient with respect to $\theta$, the $i$-th component is
$$
\frac{\partial}{\partial \theta_i} \left( \sum_j (Y_j - \theta_j)^2 + \lambda \|\theta\|_2^2 \right)
= -2(Y_i - \theta_i) + 2\lambda \theta_i.
$$
Setting this equal to zero and solving for $\theta_i$ gives
$$
\hat{\theta}_i = \frac{Y_i}{1 + \lambda}. \tag{3}
$$
Since the objective is strictly convex, (3) is the unique solution.
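For reference, a minimal sketch of the three coordinatewise rules derived above (hard thresholding at $\sqrt{\lambda}$, soft thresholding at $\lambda/2$, and ridge shrinkage by $1/(1+\lambda)$); the function names and the example vector are my own.

```python
import numpy as np

def hard_threshold(Y, lam):
    """(i)  L0 penalty: keep Y_i if Y_i^2 > lam, else set to 0."""
    return np.where(np.abs(Y) > np.sqrt(lam), Y, 0.0)

def soft_threshold(Y, lam):
    """(ii) L1 penalty: shrink each coordinate toward 0 by lam/2, truncating at 0."""
    return np.sign(Y) * np.maximum(np.abs(Y) - lam / 2, 0.0)

def ridge_shrink(Y, lam):
    """(iii) squared-L2 penalty: scale every coordinate by 1/(1 + lam)."""
    return Y / (1 + lam)

Y = np.array([-3.0, -0.4, 0.1, 0.9, 2.5])
lam = 1.0
print(hard_threshold(Y, lam))  # zeros the coordinates with Y_i^2 <= lam
print(soft_threshold(Y, lam))  # moves every coordinate toward 0 by lam/2
print(ridge_shrink(Y, lam))    # divides every coordinate by 1 + lam
```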
