
Big Data Statistics, meeting 5: When d is bigger than n, part 2
21 February 2024
LARS & friends
■ The LARS algorithm is very closely related to a variable selection technique called
  least angle regression, sometimes abbreviated simply as LAR; if you are interested in
  this technique, it was introduced in Efron et al. (2004), see the references.
■ Recall that for the linear model we had the following characterization of the
  LASSO estimator: A necessary and sufficient condition for β̂ to be a solution of
  ((3), lecture 4, slide 20) is

      G_j(β̂) = −sign(β̂_j) λ    if β̂_j ≠ 0;
      |G_j(β̂)| ≤ λ             if β̂_j = 0;

  where G_j(β̂) is the jth component of G(β̂). Furthermore, if for a solution β̂ we
  have |G_j(β̂)| < λ and hence β̂_j = 0, then for any other solution β̃ of (3) in
  lecture 4, we have β̃_j = 0.
■ Note that this characterizes the LASSO estimator for a fixed λ; a numerical check of
  the two conditions is sketched below.
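The following is a minimal sketch (not part of the lecture material) that verifies the two
conditions numerically, assuming Python with numpy and scikit-learn and assuming the
least-squares part of (3) is scaled as in scikit-learn's Lasso, i.e. (1/(2n))‖Y − Xβ‖²; if the
lecture's criterion uses a different scaling, λ has to be rescaled accordingly.

    # Numerical check of the LASSO characterization on simulated data.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, d, lam = 200, 8, 0.1
    X = rng.standard_normal((n, d))
    Y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)

    # LASSO solution for the fixed penalty level lam (scikit-learn's alpha).
    beta_hat = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, Y).coef_

    # Gradient of the least-squares part (1/(2n))||Y - X beta||^2 at beta_hat.
    G = -X.T @ (Y - X @ beta_hat) / n

    active = beta_hat != 0
    print(np.allclose(G[active], -np.sign(beta_hat[active]) * lam, atol=1e-6))  # G_j = -sign(beta_j) * lam
    print(np.all(np.abs(G[~active]) <= lam + 1e-8))                             # |G_j| <= lam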

7
LARS & friends (cont’d)
Based on this characterization one can prove the following result.
Theorem (LASSO is piecewise linear for the linear model) As a function of λ we have for
the solution β̂(λ) of the minimization problem labeled (3) on slide 20 of lecture 4 the
following:
(i) There exists a real number λ_max such that β̂(λ) = 0 for λ ≥ λ_max (cf. Exercise 13).
(ii) There are real numbers 0 = λ_0 < λ_1 < . . . < λ_m = λ_max and vectors γ^k ∈ R^d,
     k = 0, . . . , m − 1, such that

         β̂(λ) = β̂(λ_k) + (λ − λ_k) γ^k,   λ ∈ [λ_k, λ_{k+1}),   k = 0, . . . , m − 1.

Some intuition: The characterization result on the previous slide gives conditions
on the gradient of the criterion function. For the squared loss the gradient is linear
in β for any λ. Thus, it is not entirely unexpected that the solution as a function of
λ is piecewise linear.
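A minimal sketch (assuming Python with numpy and scikit-learn, which are not part of the
course material) that computes this piecewise-linear path on simulated data; the returned
alphas play the role of the kink points λ_m > . . . > λ_0 = 0 of the theorem, up to
scikit-learn's scaling of the penalty, and the coefficients move linearly in λ between
consecutive kinks.

    # Compute the LASSO solution path (kink points and coefficients at the kinks).
    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(0)
    n, d = 100, 10
    X = rng.standard_normal((n, d))
    beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(d - 3)])  # three active coefficients
    y = X @ beta_true + rng.standard_normal(n)

    # method="lasso" gives the LASSO path (method="lar" would give plain least angle regression).
    alphas, active, coefs = lars_path(X, y, method="lasso")
    print(alphas)        # decreasing sequence of kink points, ending at 0
    print(coefs.shape)   # (d, number of kinks): the solution at every kink point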

8
Density estimation (cont’d)
■ One can argue that (??) is

      E[ ∫_{−∞}^{∞} (f̂_h(x))² dx ] − 2 E[ ∫_{−∞}^{∞} f̂_h(x) f(x) dx ]            (3)

  plus a constant not depending on f̂_h.


■ Thanks to our leave-one-out procedure we have not just one density estimator but n
  of them.
■ Those can be used to estimate (??) by

      CV_h := (1/n) ∑_{j=1}^{n} ∫_{−∞}^{∞} (f̂_{h,j}(x))² dx − (2/n) ∑_{j=1}^{n} f̂_{h,j}(X_j),

  where f̂_{h,j} denotes the estimator computed without observation X_j.

■ Note that CV_h depends only on h and the data; a computational sketch follows below.
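A minimal sketch (assuming Python with numpy, a Gaussian kernel and simulated data, none
of which are prescribed by the lecture) of CV_h; for the Gaussian kernel the integral
∫ (f̂_{h,j}(x))² dx has a closed form, which the code uses.

    # Leave-one-out cross-validation criterion CV_h for a kernel density estimator.
    import numpy as np

    def cv_criterion(x, h):
        """CV_h = (1/n) sum_j int (f_hat_{h,j}(t))^2 dt - (2/n) sum_j f_hat_{h,j}(x_j)."""
        n = len(x)
        diff = (x[:, None] - x[None, :]) / h
        K = np.exp(-0.5 * diff**2) / np.sqrt(2 * np.pi)       # Gaussian kernel K((x_i - x_k)/h)
        # int K((t-x_i)/h) K((t-x_k)/h) dt = h * exp(-((x_i-x_k)/h)^2 / 4) / (2*sqrt(pi))
        conv = np.exp(-0.25 * diff**2) / np.sqrt(4 * np.pi)
        sq_term, fit_term = 0.0, 0.0
        for j in range(n):
            keep = np.arange(n) != j                          # leave observation j out
            sq_term += conv[np.ix_(keep, keep)].sum() / ((n - 1) ** 2 * h)
            fit_term += K[j, keep].sum() / ((n - 1) * h)      # f_hat_{h,j}(x_j)
        return sq_term / n - 2.0 * fit_term / n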

22
Density estimation (cont’d)
■ Because CV_h is a proxy for the quantity we want to minimize as a function of h, we
  simply choose h as the value for which CV_h is minimal; mathematically we can
  denote this by

      ĥ_optimal = arg min_h CV_h,

  where h ranges over [c_1 n^{−1/5}, c_2 n^{−1/5}] with 0 < c_1 < c_2, and the exponent
  comes from the fact that we already concluded above that the optimal h ∼ n^{−1/5}.
■ Important message of ĥ_optimal = arg min_h CV_h: we turned the somewhat arbitrary h
  (within the above range) into a data-dependent, or data-based, choice; see the sketch
  below.
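A minimal sketch of this grid search, reusing the cv_criterion helper from the sketch
above; the simulated data, the grid and the constants c_1 = 0.5, c_2 = 2.0 are arbitrary
illustrative choices.

    # Choose h by minimizing CV_h over a grid inside [c1 n^(-1/5), c2 n^(-1/5)].
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(200)                         # simulated sample, n = 200
    n = len(x)
    grid = np.linspace(0.5, 2.0, 30) * n ** (-1 / 5)     # candidate bandwidths
    h_optimal = grid[int(np.argmin([cv_criterion(x, h) for h in grid]))]
    print(h_optimal)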

23
Cross validation LASSO (cont’d)
■ From the last bullet on the previous slide it is clear that the prediction error is also
  a function of β̂(λ). We write this as

      λ ↦ Y^{future} − ∑_{j=1}^{d} β̂_j(λ) x_j^{future}.

■ Typically, we want the squared prediction error to be small on average, i.e. we
  would like to find λ such that

      E[ ( Y^{future} − ∑_{j=1}^{d} β̂_j(λ) x_j^{future} )² ]

  is small.
■ Recalling what we did above, we need an estimator of this unknown expectation for
  EVERY λ in order to find the one for which it is minimal.

26
Cross validation LASSO (cont’d)
■ How did we get an estimator for the criterion function above? We split the sample
into n groups

Group 1 : x2 , x3 , . . . , xn
Group 2 : x1 , x3 , . . . , xn
...
Group n : x1 , x2 , x3 , . . . , xn−1

and calculated an estimator for f based on group 1, group 2, . . ., group n.


■ Here we will follow a similar strategy. We divide the sample
  (Y_1, x_1), . . . , (Y_n, x_n) into K > 1 groups. One could take, for instance, K = 10.
  The groups are then (for ease of notation we assume that n is divisible by K)

      Group 1 : (Y_1, x_1), . . . , (Y_{n/K}, x_{n/K});
      Group 2 : (Y_{n/K+1}, x_{n/K+1}), . . . , (Y_{2n/K}, x_{2n/K});
      . . .

  A splitting sketch in code follows below.
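A minimal sketch (assuming Python/numpy, not part of the lecture) of this split of the
index set 1, . . . , n into K contiguous groups.

    # Split the indices 0, ..., n-1 into K groups; with n divisible by K every
    # group contains exactly n/K observations.
    import numpy as np

    n, K = 200, 10
    folds = np.array_split(np.arange(n), K)   # folds[0] = group 1, folds[1] = group 2, ...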

27
Cross validation LASSO (cont’d)
■ Now to estimate the expectation we first leave out the first group and estimate β
  based on group 2, . . . , group K. Denote this leave-group-one-out estimate by β̂^1(λ).
  We then estimate the average squared prediction error by

      L^1(λ) = (K/n) ∑_{i=1}^{n/K} ( Y_i − ∑_{j=1}^{d} β̂_j^1(λ) x_{ij} )².

■ Next we leave out group 2 when estimating β. The estimate based on group 1,
  group 3, . . . , group K is denoted by β̂^2(λ). We then estimate the average squared
  prediction error by

      L^2(λ) = (K/n) ∑_{i=n/K+1}^{2n/K} ( Y_i − ∑_{j=1}^{d} β̂_j^2(λ) x_{ij} )².

■ This process is performed K times, giving us K estimates L^1(λ), . . . , L^K(λ) of the
  squared prediction error loss; a sketch follows below.
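A minimal sketch (assuming Python with numpy and scikit-learn) of the K fold losses for
one fixed λ; note that scikit-learn's alpha corresponds to the lecture's λ only up to the
scaling of the least-squares part in (3) of lecture 4.

    # Compute L^1(lam), ..., L^K(lam) for one fixed penalty level lam.
    import numpy as np
    from sklearn.linear_model import Lasso

    def fold_losses(X, Y, lam, K=10):
        n = X.shape[0]
        folds = np.array_split(np.arange(n), K)         # group 1, ..., group K
        losses = []
        for idx in folds:
            train = np.setdiff1d(np.arange(n), idx)     # leave the current group out
            beta_i = Lasso(alpha=lam, fit_intercept=False).fit(X[train], Y[train]).coef_
            resid = Y[idx] - X[idx] @ beta_i            # Y_i - sum_j beta_j^i(lam) x_ij
            losses.append((K / n) * np.sum(resid**2))   # L^i(lam)
        return losses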

28
Cross validation LASSO (cont’d)
■ Finally we average L^1(λ), . . . , L^K(λ) and get

      L̄(λ) = (1/K) ∑_{i=1}^{K} L^i(λ).

■ In words, L̄(λ) gives us an estimate of the expected squared prediction error loss if
  we use λ for the LASSO estimate of β.
■ Given L̄(λ) for every λ, or at least for a grid of λ values, we choose λ (recalling
  what we did above for density estimation) as the value for which L̄(λ) is minimal;
  mathematically we write this as

      λ̂_optimal = arg min_λ L̄(λ).

■ As for the density estimator, the important message of λ̂_optimal = arg min_λ L̄(λ) is
  that we turned the somewhat arbitrary λ into a data-dependent, or data-based, choice.
■ !!!! Note that this choice gives us a β̂ with low prediction error; the complete
  procedure is sketched below.
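A minimal sketch of the full cross-validation choice of λ, reusing the fold_losses helper
from the sketch above on simulated data; the grid of candidate λ values is an arbitrary
illustrative choice. In practice one would often call scikit-learn's LassoCV, which runs
the same K-fold procedure internally (with its own λ grid and penalty scaling).

    # Average the fold losses over a grid of lambdas and pick the minimizer.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 200, 50
    X = rng.standard_normal((n, d))
    Y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)

    lambda_grid = np.logspace(-3, 1, 50)
    L_bar = [np.mean(fold_losses(X, Y, lam, K=10)) for lam in lambda_grid]  # L_bar(lambda)
    lambda_optimal = lambda_grid[int(np.argmin(L_bar))]                     # lambda_hat_optimal
    print(lambda_optimal)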
29
