
Big Data Statistics, meeting 7: When d is bigger than n, part 4

28 February 2024
Found and open
■ Found: With high probability the difference between the LASSO estimator β̂_n and the true β is small even if d > n, provided the number of non-zero entries of β remains fixed: cf. the Theorem (upper bound on error of LASSO, probabilistic version).
■ Remained open:
◆ What do we know about the averaged squared approximation error (β̂_n − β)^T X_n^T X_n (β̂_n − β) when β̂_n is the LASSO estimator?
◆ Does the LASSO estimator select those covariates for which β_j ≠ 0?

Criterion 2
■ Recall that criterion 2 of Lecture 6 (see slide 13 of that lecture) for a good estimator was that the average squared approximation error goes to zero, i.e.

(1/n) ||X_n (β̂_n − β)||_2^2 → 0.

■ Today, we will consider two results for this quantity:


◆ The first result uses the restricted eigenvalue condition over C(S, 3) and takes the number of non-zero elements of β to equal k;
◆ The second result uses another type of sparsity, so-called weak sparsity. It requires

Σ_{j=1}^d |β_j| ≤ R_1,

where R_1 is a strictly positive real number (both this quantity and the approximation error above are computed in the short sketch below).
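As a point of reference (not part of the original slides), this is how the two quantities just mentioned would be computed for a given design X_n, true vector β and estimate β̂_n; the arrays below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 300

X = rng.standard_normal((n, d))                    # design matrix X_n (placeholder)
beta = np.r_[np.ones(5), np.zeros(d - 5)]          # true coefficient vector beta
beta_hat = beta + 0.01 * rng.standard_normal(d)    # some estimate beta_hat (placeholder)

# Criterion 2: averaged squared approximation error (1/n)||X_n(beta_hat - beta)||_2^2
approx_err = np.mean((X @ (beta_hat - beta)) ** 2)

# Weak sparsity requires the l1-norm sum_j |beta_j| to be bounded by R_1
l1_norm = np.sum(np.abs(beta))

print(approx_err, l1_norm)
```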

Criterion 2 (cont’d)
■ The reason for considering two different types of sparsity is to get a feel for how our notion of sparsity affects the results we can expect (in the straightforward sense of ’a certain input yields a certain output’).
■ Which strategy worked nicely in Lecture 6 when we looked at ||β̂_n − β||_2?
■ Splitting the problem into
◆ a deterministic part; and
◆ a random part.
■ Here we will do the same. The deterministic result for the LASSO estimator,
i.e. the solution to
 2
n
X Xd Xd
minimize w.r.t. β : (1/n) yi − βj xij  + λ |βj |, (1)
i=1 j=1 j=1

is as follows

Criterion 2 (cont’d)
Theorem (upper bound on approximation error): For problem (1), suppose λ_n ≥ 2 ||X_n^T ε_n(ω)||_∞ / n > 0.
(i) Suppose that the design matrix X_n satisfies the restricted eigenvalue bound with parameter γ > 0 over C(S, 3) and let |S_A(β)| = k. Then for any solution β̂_n of (1) we have

(1/n) ||X_n (β̂_n − β)||_2^2 ≤ (c_1/γ) k λ_n^2,    (2)

where c_1 is a constant independent of d and n.
(ii) If Σ_{j=1}^d |β_j| ≤ R_1, we have

(1/n) ||X_n (β̂_n − β)||_2^2 ≤ c_2 R_1 λ_n,    (3)

where c_2 is a constant independent of d and n.

■ Remark: In part (i) we denote by |S_A(β)| the cardinality, i.e. the number of elements, of the active set of the true regression vector β.
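As a rough numerical illustration (not part of the original slides): in a simulation, where the noise vector ε_n is known, the lower bound 2 ||X_n^T ε_n(ω)||_∞ / n on λ_n can be evaluated directly. A minimal sketch with a synthetic Gaussian design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 100, 500, 1.0

X = rng.standard_normal((n, d))        # synthetic design matrix X_n
eps = sigma * rng.standard_normal(n)   # noise vector eps_n (only observable in a simulation)

# Lower bound on lambda_n required by the theorem: 2 * ||X_n^T eps_n||_inf / n
lam_lower = 2 * np.max(np.abs(X.T @ eps)) / n
print("lambda_n must be at least:", lam_lower)
```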

Criterion 2 (cont’d)
Remarks (on upper bound on approximation error):
■ We know from Lecture 6 that by choosing λ_n = τ σ √(log(d)/n), with τ chosen appropriately (see Lecture 6, where τ > √8 was needed), the statements above will hold with high probability if τ is not too close to √8 or if log(d) is not too small (or both).
■ With that choice of λ_n, statement (i) becomes

(1/n) ||X_n (β̂_n − β)||_2^2 ≤ (c̃_1/γ) k log(d)/n,    (4)

where c̃_1 is a constant independent of d and n; and statement (ii) becomes

(1/n) ||X_n (β̂_n − β)||_2^2 ≤ c̃_2 R_1 √(log(d)/n),    (5)

where c̃_2 is a constant independent of d and n.
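The rate in (4) can be eyeballed in a small simulation. The sketch below is an illustration under assumed settings, not part of the lecture: it uses scikit-learn's Lasso, whose objective is (1/(2n))||y − Xβ||_2^2 + α||β||_1, so α = λ_n/2 corresponds to problem (1):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d, k, sigma, tau = 200, 1000, 5, 1.0, 3.0   # tau chosen larger than sqrt(8)

beta = np.zeros(d)
beta[:k] = 1.0                                  # k non-zero entries (hard sparsity)

X = rng.standard_normal((n, d))
y = X @ beta + sigma * rng.standard_normal(n)

lam_n = tau * sigma * np.sqrt(np.log(d) / n)

# Objective (1) is (1/n)||y - Xb||_2^2 + lam_n ||b||_1; scikit-learn's Lasso uses
# (1/(2n))||y - Xb||_2^2 + alpha ||b||_1, hence alpha = lam_n / 2.
beta_hat = Lasso(alpha=lam_n / 2, fit_intercept=False).fit(X, y).coef_

approx_err = np.mean((X @ (beta_hat - beta)) ** 2)   # (1/n)||X_n(beta_hat - beta)||_2^2
print("approximation error      :", approx_err)
print("k log(d)/n (rate in (4)) :", k * np.log(d) / n)
```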

Selecting right variables (cont’d)
Here comes the result (for the exact assumptions see Wainwright (2009) or Section 11.4 of Hastie, T., Tibshirani, R. and Wainwright, M. (2016)).

Theorem (selecting right variables): Let λ_n = (c_3/γ) σ √(log(d)/n), where c_3 is a constant.
Then with probability at least of order (ignoring constants) 1 − exp(−log(d)) we have
(i) S_A(β̂_n) ⊆ S_A(β);
(ii) if additionally min_{j∈S_A(β)} |β_j| is bounded from below (with the lower bound depending on λ_n), we have S_A(β̂_n) = S_A(β).
Remarks (on Theorem selecting right variables)
■ Statement (i) means our estimator does not include variables that are irrelevant
(with high probability);
■ Yet, statement (i) does not exclude the possibility of missing important variables;
■ Statement (ii) implies that by imposing further assumptions we select all important
variables and only those (with high probability).
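Continuing the simulation style from above (again an illustration under assumed toy settings, not part of the slides), statements (i) and (ii) can be checked by comparing the estimated support with the true one:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d, k, sigma = 200, 1000, 5, 1.0

beta = np.zeros(d)
beta[:k] = 1.0                                   # non-zero coefficients well separated from 0

X = rng.standard_normal((n, d))
y = X @ beta + sigma * rng.standard_normal(n)

lam_n = 3.0 * sigma * np.sqrt(np.log(d) / n)
beta_hat = Lasso(alpha=lam_n / 2, fit_intercept=False).fit(X, y).coef_

support_true = set(np.flatnonzero(beta))
support_hat = set(np.flatnonzero(beta_hat))

print("no irrelevant variables selected (statement (i)):", support_hat <= support_true)
print("exact support recovery (statement (ii))         :", support_hat == support_true)
```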

Older penalties (cont’d)
■ Message of Quiz 7?
■ Even if X_n^T X_n does not have an inverse,

X_n^T X_n + I_{d×d}

will have an inverse: X_n^T X_n is positive semi-definite and I_{d×d} is positive definite, so their sum is positive definite, and every positive definite matrix has an inverse.
■ We can generalize this to λ I_{d×d} (cf. also Quiz 7), which is also positive definite if λ > 0, to find that the following matrix has an inverse too:

X_n^T X_n + λ I_{d×d}.

■ This opens up another way to solve the issues mentioned on the previous slide. If (6) is not possible for β̂ because X_n^T X_n does not have an inverse, we could simply replace it by

β̂_n^R = (X_n^T X_n + λ I_{d×d})^{-1} X_n^T Y_n.    (7)

■ β̂_n^R is known as the Ridge regression estimator (the name Ridge is the reason for the superscript R).
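A minimal numpy sketch of the closed form (7) (an illustration, not code from the course; it solves the linear system instead of forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Ridge regression estimator (7): (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example with d > n, where X^T X itself has no inverse:
rng = np.random.default_rng(4)
n, d = 50, 200
X = rng.standard_normal((n, d))
y = X @ np.r_[np.ones(5), np.zeros(d - 5)] + rng.standard_normal(n)

beta_ridge = ridge_estimator(X, y, lam=1.0)
print(beta_ridge[:8])
```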
Older penalties (cont’d)
Remarks (on Ridge regression)
■ Clearly, if λ is small, X_n^T X_n + λ I_{d×d} will be close to X_n^T X_n. Yet, adding the small λ I_{d×d} to X_n^T X_n has a huge impact, as X_n^T X_n + λ I_{d×d} has an inverse, in contrast to X_n^T X_n.
■ In practice, choosing λ too small may make it hard (or impossible) for your software to find the inverse of X_n^T X_n + λ I_{d×d}.
■ As alluded to in the introduction to Quiz 7, in practice finding the inverse of X_n^T X_n is difficult (or impossible) if X_n is close to collinear (that is, we have highly correlated regressors).
■ Also in such a case, inverting X_n^T X_n + λ I_{d×d} instead and using (7) as our regression estimator provides a way out.
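A quick numerical illustration of this point (an assumed toy setting, not from the slides): adding λ I_{d×d} dramatically reduces the condition number of a nearly collinear design:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 100, 5, 1e-2

# Nearly collinear design: the last column is almost a copy of the first one.
X = rng.standard_normal((n, d))
X[:, -1] = X[:, 0] + 1e-6 * rng.standard_normal(n)

print("cond(X^T X)          :", np.linalg.cond(X.T @ X))
print("cond(X^T X + lam * I):", np.linalg.cond(X.T @ X + lam * np.eye(d)))
```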

Older penalties (cont’d)
■ While (7) was how Ridge regression was introduced historically, nowadays another way of introducing it is more popular.
■ The more recent way of introducing the Ridge regression estimator is the following: the solution to

minimize w.r.t. β:   Σ_{i=1}^n ( y_i − Σ_{j=1}^d β_j x_{ij} )^2 + λ Σ_{j=1}^d β_j^2    (8)

is called the Ridge regression estimator.

■ On exercise sheet 4 you can convince yourself that (7) and (8) are indeed the same; a small numerical check also follows below.
■ If you think back to our first lecture on the LASSO, you will see that the penalty λ Σ_{j=1}^d β_j^2 includes β_1; this means that our data were already processed to fulfil the normalizing and centering conditions (recall that we do not want to penalize the intercept).
■ In line with most of the literature we do not multiply Σ_{i=1}^n ( y_i − Σ_{j=1}^d β_j x_{ij} )^2 in Equation (8) by (1/n).
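Here is the small numerical check mentioned above (a sketch under assumed conventions: scikit-learn's Ridge with fit_intercept=False minimizes ||y − Xβ||_2^2 + α||β||_2^2, which matches (8) with α = λ):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n, d, lam = 50, 200, 1.0

X = rng.standard_normal((n, d))
y = X @ np.r_[np.ones(5), np.zeros(d - 5)] + rng.standard_normal(n)

# (7): closed form
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# (8): penalized least squares via scikit-learn (alpha = lambda, no intercept)
beta_pen = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("max abs difference between (7) and (8):", np.max(np.abs(beta_closed - beta_pen)))
```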

Older penalties (cont’d)
■ Let’s try to get a feel for what Ridge regression does.
■ When we started studying the LASSO estimator, which special case did we look at to get a feel for what the LASSO does?
■ Right, we looked at orthogonal design. Let’s do the same here. Assume that X_n^T X_n = I_{d×d}; then Equation (7) becomes

β̂_n^R = (I_{d×d} + λ I_{d×d})^{-1} X_n^T Y_n = (1/(1 + λ)) X_n^T Y_n.

■ Let’s compare this to what our ordinary OLS estimator β̂^LS would be if we had not added the term λ I_{d×d}. Then β̂^LS would be

β̂^LS = X_n^T Y_n.

■ So, in this case we see very clearly that Ridge regression shrinks the ordinary OLS estimator towards zero.

Older penalties (cont’d)
To conclude this section and to shed light on the third statement, let us compare the Ridge regression estimator with the LASSO for orthogonal design.
■ We can rewrite the Ridge regression estimator as

β̂_i^R = (1/(1 + λ)) β̂_LS,i.

■ The LASSO estimator can also be written in terms of β̂^LS as

β̂_i = β̂_LS,i − λ/2   if β̂_LS,i ≥ λ/2;
β̂_i = 0              if −λ/2 < β̂_LS,i < λ/2;
β̂_i = β̂_LS,i + λ/2   if β̂_LS,i ≤ −λ/2.

■ Bottom line:
◆ The Ridge regression estimator does not select variables (β̂_i^R is unequal to zero whenever β̂_LS,i is), whereas the LASSO does select variables.
◆ Hence, the LASSO overcomes the problem of having to estimate many regression coefficients with a possibly small n; Ridge fails to do so.
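A compact way to see both formulas at once is to generate an exactly orthonormal design (so X_n^T X_n = I_{d×d}) and apply the two shrinkage rules directly; this sketch uses assumed toy numbers and the no-(1/n) convention of (8):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, lam = 100, 10, 2.0

# Orthonormal design: the Q factor of a QR decomposition satisfies Q^T Q = I.
X, _ = np.linalg.qr(rng.standard_normal((n, d)))

beta = np.r_[3.0, -2.0, 0.5, np.zeros(d - 3)]
y = X @ beta + 0.5 * rng.standard_normal(n)

beta_ls = X.T @ y                              # OLS, since X^T X = I
beta_ridge = beta_ls / (1 + lam)               # Ridge: uniform shrinkage towards zero
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2, 0.0)  # LASSO: soft thresholding

print("OLS  :", np.round(beta_ls, 2))
print("Ridge:", np.round(beta_ridge, 2))
print("LASSO:", np.round(beta_lasso, 2))
```

The two rules behave exactly as stated above: Ridge only rescales every OLS coefficient by 1/(1 + λ), while soft thresholding sets every coefficient with |β̂_LS,i| ≤ λ/2 exactly to zero.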
Correlations and penalties
We now turn to a generalization of the LASSO penalty known as the elastic net. To motivate it we collect a few more facts about the LASSO and the Ridge regression estimator:
■ If we have several strongly correlated regressors, the LASSO will tend to select only one (or maybe two) of them, simply because many regression coefficients are estimated as zero by the LASSO. Picking only one (or maybe two) from a group of strongly correlated regressors may seem a bit arbitrary.
■ When we have highly correlated variables, the estimates for the regression coefficients are somewhat arbitrary. For example, when X_1 and X_2 are strongly correlated and their ’joint effect’ is 2, say, it does not really make a difference whether you estimate β_1 and β_2 both as 1 or whether you estimate β_1 as 1.8 and β_2 as 0.2. Thus, we may view this too as arbitrary (on exercise sheet 4 you can convince yourself of this).
■ Ridge regression is known to overcome this arbitrariness by dividing the total effect equally among the two regressors (on exercise sheet 4 you can convince yourself of this property of Ridge regression).

Correlations and penalties
■ The elastic net estimator is the solution to

minimize w.r.t. β:   Σ_{i=1}^n ( y_i − Σ_{j=1}^d β_j x_{ij} )^2 + λ ( α Σ_{j=1}^d |β_j| + (1 − α) Σ_{j=1}^d β_j^2 ).

■ Here α ∈ [0, 1].
■ Clearly, for α = 0 we obtain the Ridge regression estimator, while for α = 1 we get the LASSO estimator.
■ For values of α between 0 and 1 we obtain a compromise between the LASSO and Ridge.
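A direct transcription of this objective (a sketch, assuming the convex-optimization package cvxpy is available; it is not software used in the course, and scikit-learn's ElasticNet would also work but rescales the penalty and calls the mixing parameter l1_ratio):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(8)
n, d, lam, alpha = 100, 20, 5.0, 0.5

X = rng.standard_normal((n, d))
y = X @ np.r_[np.ones(3), np.zeros(d - 3)] + rng.standard_normal(n)

b = cp.Variable(d)
# Elastic net objective exactly as on the slide: squared loss + lam * (alpha * l1 + (1 - alpha) * squared l2)
penalty = alpha * cp.norm1(b) + (1 - alpha) * cp.sum_squares(b)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b) + lam * penalty)).solve()

print(np.round(b.value, 3))
```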

