Lecture BDS 5 23 24 Print
21 February 2024
LARS & friends
■ The LARS algorithm is very closely related to a variable selection technique called
least angle regression, sometimes abbreviated simply as LAR (if you are interested in
this technique, it was introduced in Efron et al. (2004); see the references).
■ Recall that for the linear model we had the following characterization of the
LASSO estimator: a necessary and sufficient condition for β̂ to be a solution of
((3), lecture 4, slide 20) is the subgradient (KKT) condition sketched below.
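Written out under the assumption that criterion (3) of lecture 4 is the standard form
(1/2)‖Y − Xβ‖² + λ‖β‖₁ (the normalization used there may differ by a constant factor), the condition reads
\[
X_j^{\top}(Y - X\hat\beta) = \lambda\,\mathrm{sign}(\hat\beta_j) \quad \text{if } \hat\beta_j \neq 0,
\qquad
\bigl|X_j^{\top}(Y - X\hat\beta)\bigr| \leq \lambda \quad \text{if } \hat\beta_j = 0,
\]
for every j = 1, . . . , d, where X_j denotes the j-th column of the design matrix.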
7
LARS & friends (cont’d)
Based on this characterization one can prove the following result.
Theorem (LASSO is piecewise linear for the linear model) As a function of λ we have for
the solution β̂(λ) of the minimization problem labeled (3) on slide 20 of lecture 4 the
following:
(i) There exists a real number λmax such that β̂(λ) = 0 for λ ≥ λmax (cf. Exercise 13).
(ii) There are real numbers 0 = λ0 < λ1 < . . . < λm = λmax and vectors γi ∈ Rd ,
i = 0, . . . , m − 1, such that β̂(λ) = β̂(λi) + (λ − λi)γi for λ ∈ [λi, λi+1]; in
particular, λ ↦ β̂(λ) is piecewise linear.
Some intuition: The characterization result on the previous slide gives conditions
on the gradient of the criterion function. For the squared loss the gradient is linear
in β for any λ. Thus, it is not entirely unexpected that the solution as a function of
λ is piecewise linear.
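As an illustration, here is a small sketch of the piecewise-linear path on simulated toy data, using scikit-learn's lars_path; the data, sizes and seed are invented, and scikit-learn's penalty parameter corresponds to the lecture's λ only up to a 1/n normalization.

# Illustration only: plot the piecewise-linear LASSO path on simulated data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.standard_normal((n, d))
beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
Y = X @ beta_true + rng.standard_normal(n)

# method="lasso" returns the LASSO path; "alphas" are the kink points
# lambda_max = lambda_m > ... > lambda_0 = 0 (up to the 1/n normalization).
alphas, active, coefs = lars_path(X, Y, method="lasso")

# Between consecutive kinks the coefficients move along straight lines,
# so connecting the columns of "coefs" linearly reproduces the whole path.
plt.plot(alphas, coefs.T, marker="o")
plt.xlabel("penalty level")
plt.ylabel("coefficients")
plt.show()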
8
Density estimation (cont’d)
■ One can argue that (??) is
\[
E \int_{-\infty}^{\infty} \bigl(\hat f_h(x)\bigr)^{2}\, dx \;-\; 2\, E \int_{-\infty}^{\infty} \hat f_h(x) f(x)\, dx. \tag{3}
\]
22
Density estimation (cont’d)
■ Because CVh is a proxy for the quantity we want to minimize as a function of h, we
simply choose h as the value for which our CVh is minimal; mathematically we
can denote this by
\[
\hat h_{\mathrm{optimal}} = \arg\min_{h} CV_h ,
\]
where h ranges over [c1 n^{−1/5}, c2 n^{−1/5}] with 0 < c1 < c2, and the exponent −1/5
comes from the fact that we already concluded above that the optimal h ∼ n^{−1/5}.
■ Important message of ĥoptimal = arg minh CVh : we turned the somewhat
arbitrary h (within the above range) into a data-dependent, i.e. data-based, choice
(see the sketch below).
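A minimal Python sketch of this data-based choice, assuming the standard least-squares cross-validation criterion CVh = ∫ f̂h(x)² dx − (2/n) Σi f̂h,−i(xi) with a Gaussian kernel (the exact form of CVh used earlier in the lecture may differ, and the sample and the constants c1, c2 below are only illustrative):

# Illustration only: least-squares cross-validation choice of the bandwidth h.
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cv_criterion(h, data):
    n = data.size
    diffs = data[:, None] - data[None, :]
    # closed form of the integral of f_hat_h^2 for the Gaussian kernel:
    # (1/n^2) sum_{i,j} phi_{sqrt(2) h}(x_i - x_j)
    int_fhat_sq = gauss(diffs / (np.sqrt(2.0) * h)).sum() / (n**2 * np.sqrt(2.0) * h)
    # leave-one-out term (2/n) sum_i f_hat_{h,-i}(x_i)
    K = gauss(diffs / h) / h
    loo = (K.sum(axis=1) - gauss(0.0) / h) / (n - 1)
    return int_fhat_sq - 2.0 * loo.mean()

rng = np.random.default_rng(1)
data = rng.standard_normal(200)              # toy sample, n = 200
n = data.size
c1, c2 = 0.2, 3.0                            # illustrative constants
grid = np.linspace(c1 * n**(-1/5), c2 * n**(-1/5), 50)
cv_values = np.array([cv_criterion(h, data) for h in grid])
h_optimal = grid[np.argmin(cv_values)]
print("data-based bandwidth:", h_optimal)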
23
Cross validation LASSO (cont’d)
■ From the last bullet on the previous slide it is clear that the prediction error is also
a function of β̂(λ). We write this as
\[
\lambda \;\mapsto\; Y^{\mathrm{future}} - \sum_{j=1}^{d} \hat\beta_j(\lambda)\, x_j^{\mathrm{future}},
\]
and we want to choose λ such that the expected squared value of this prediction error
is small.
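Spelled out (under the usual convention that (Y^future, x^future) is a new observation, independent of the data used to compute β̂(λ)), the quantity we would like to be small is
\[
E\Bigl( Y^{\mathrm{future}} - \sum_{j=1}^{d} \hat\beta_j(\lambda)\, x_j^{\mathrm{future}} \Bigr)^{2}.
\]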
■ Recalling what we did above, we need an estimator of this unknown expectation
for EVERY λ in order to find the one for which it is minimal.
26
Cross validation LASSO (cont’d)
■ How did we get an estimator for the criterion function above? We split the sample
into n groups, each obtained by leaving out one of the observations:
Group 1 : x2 , x3 , . . . , xn
Group 2 : x1 , x3 , . . . , xn
...
Group n : x1 , x2 , x3 , . . . , xn−1
27
Cross validation LASSO (cont’d)
■ Now to estimate the expectation we first leave out the first group and estimate β
based on group 2, . . . , group K. Denote this leave-group-one-out estimate by β̂^1(λ).
We then estimate the average squared prediction error by
\[
L^{1}(\lambda) = \frac{K}{n} \sum_{i=1}^{n/K} \Bigl( Y_i - \sum_{j=1}^{d} \hat\beta^{1}_j(\lambda)\, x_{ij} \Bigr)^{2}.
\]
■ Next we leave out group 2 when estimating β. The estimate based on group 1,
group 3, . . . , group K is denoted by β̂^2(λ). We then estimate the average squared
prediction error by
\[
L^{2}(\lambda) = \frac{K}{n} \sum_{i=n/K+1}^{2n/K} \Bigl( Y_i - \sum_{j=1}^{d} \hat\beta^{2}_j(\lambda)\, x_{ij} \Bigr)^{2}.
\]
■ This process is performed K times, giving us K estimates L^1(λ), . . . , L^K(λ) of the
expected squared prediction error.
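For concreteness (the numbers are chosen purely for illustration): with n = 100 observations and K = 5 groups of size n/K = 20, the estimate L^2(λ) averages the squared prediction errors of observations i = 21, . . . , 40, each predicted using β̂^2(λ), which is computed from the remaining 80 observations.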
28
Cross validation LASSO (cont’d)
■ Finally we average L^1(λ), . . . , L^K(λ) and get
\[
\bar L(\lambda) = \frac{1}{K} \sum_{i=1}^{K} L^{i}(\lambda).
\]
■ In words, our L̄(λ) gives us an estimate of the expected squared prediction error
if we use λ for the LASSO estimate of β (a computational sketch is given below).
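A minimal Python sketch of this K-fold computation, using scikit-learn's Lasso (its alpha matches the lecture's λ only up to a normalization, since scikit-learn divides the squared error by 2n; the grid, fold layout and data below are invented for the example):

# Illustration only: K-fold estimate of L_bar(lambda) over a grid of penalties.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, K = 120, 10, 5                         # toy sizes, n divisible by K
X = rng.standard_normal((n, d))
beta_true = np.concatenate([np.array([2.0, -1.0, 1.5]), np.zeros(d - 3)])
Y = X @ beta_true + rng.standard_normal(n)

lambdas = np.logspace(-3, 0, 30)             # candidate penalty levels
fold_size = n // K
L_bar = np.zeros(len(lambdas))

for m, lam in enumerate(lambdas):
    L_folds = []
    for k in range(K):
        test = np.arange(k * fold_size, (k + 1) * fold_size)   # group k+1
        train = np.setdiff1d(np.arange(n), test)               # the other groups
        # beta_hat^{k+1}(lambda), computed without group k+1 (no intercept,
        # matching the formulas on the slides)
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        model.fit(X[train], Y[train])
        resid = Y[test] - X[test] @ model.coef_
        L_folds.append(np.mean(resid**2))                      # L^{k+1}(lambda)
    L_bar[m] = np.mean(L_folds)                                # average over the K folds

print("lambda with minimal L_bar:", lambdas[np.argmin(L_bar)])

In practice the same computation is available pre-packaged, e.g. in scikit-learn's LassoCV.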
■ Given L̄(λ) for every λ, or at least for a grid of λ values, we choose λ (recalling
what we did above for density estimation) as the value for which L̄(λ) is minimal;
mathematically we write this as