Lecture Notes On Ridge Regression
License
This document is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike license:
http://creativecommons.org/licenses/by-nc-sa/4.0/
Disclaimer
This document is a collection of many well-known results on ridge regression. The current status of the document
is ‘work-in-progress’ as it is incomplete (more results from literature will be included) and it may contain incon-
sistencies and errors. Hence, reading and believing at own risk. Finally, proper reference to the original source
may sometimes be lacking. This is regrettable and these references – if ever known to the author – will be included
in later versions.
Acknowledgements
Many people aided in various ways to the construction of these notes. Mark A. van de Wiel commented on various
parts of Chapter 2. Jelle J. Goeman clarified some matters behind the method described in Section 6.4.3. Paul H.C.
Eilers pointed to helpful references for Chapter 4 and provided parts of the code used in Section 4.3. Harry van
Zanten commented on Chapters 1 and ??, while Stéphanie van de Pas made suggestions for the improvement of
Chapters ?? and ??. Glenn Andrews thoroughly read Chapter 1 filtering out errors, unclarities, and inaccuracies.
Small typos or minor errors, that have – hopefully – been corrected in the latest version, were pointed out by
(among others): Rikkert Hindriks, Micah Blake McCurdy, José P. González-Brenes, Hassan Pazira, and numerous
students from the High-dimensional data analysis and Statistics for high-dimensional data courses taught at
Leiden University and the Vrije Universiteit Amsterdam, respectively.
Contents
1 Ridge regression 2
1.1 Linear regression 2
1.2 The ridge regression estimator 5
1.3 Eigenvalue shrinkage 9
1.3.1 Principal component regression 10
1.4 Moments 10
1.4.1 Expectation 11
1.4.2 Variance 12
1.4.3 Mean squared error 14
1.4.4 Debiasing 17
1.5 Constrained estimation 17
1.6 Degrees of freedom 21
1.7 Computationally efficient evaluation 21
1.8 Penalty parameter selection 22
1.8.1 Information criterion 23
1.8.2 Cross-validation 24
1.8.3 Generalized cross-validation 25
1.8.4 Randomness 25
1.9 Simulations 26
1.9.1 Role of the variance of the covariates 26
1.9.2 Ridge regression and collinearity 28
1.9.3 Variance inflation factor 30
1.10 Illustration 32
1.10.1 MCM7 expression regulation by microRNAs 33
1.11 Conclusion 37
1.12 Exercises 37
2 Bayesian regression 46
2.1 A minimum of prior knowledge on Bayesian statistics 46
2.2 Relation to ridge regression 47
2.3 Markov chain Monte Carlo 50
2.4 Empirical Bayes 55
2.5 Conclusion 56
2.6 Exercises 56
4 Mixed model 73
4.1 Link to ridge regression 78
4.2 REML consistency, high-dimensionally 79
4.3 Illustration: P-splines 81
High-throughput techniques measure many characteristics of a single sample simultaneously. The number of
characteristics p measured may easily exceed ten thousand. In most medical studies the number of samples n
involved often falls behind the number of characteristics measured, i.e. p > n. The resulting (n × p)-dimensional
data matrix X:
X = (X∗,1 | . . . | X∗,p) = \begin{pmatrix} X1,∗ \\ \vdots \\ Xn,∗ \end{pmatrix} = \begin{pmatrix} X1,1 & \cdots & X1,p \\ \vdots & \ddots & \vdots \\ Xn,1 & \cdots & Xn,p \end{pmatrix}
from such a study contains a larger number of covariates than samples. When p > n the data matrix X is said to
be high-dimensional, although no formal definition exists.
In this chapter we adopt the traditional statistical notation of the data matrix. An alternative notation would be
X⊤ (rather than X), which is employed in the field of (statistical) bioinformatics. In X⊤ the columns comprise
the samples rather than the covariates. The case for the bioinformatics notation stems from practical arguments. A
spreadsheet is designed to have more rows than columns. In case p > n the traditional notation yields a spreadsheet
with more columns than rows. When p > 10000 the conventional display is impractical. In this chapter we stick
to the conventional statistical notation of the data matrix as all mathematical expressions involving X are then in
line with those of standard textbooks on regression.
The information contained in X is often used to explain a particular property of the samples involved. In
applications in molecular biology X may contain microRNA expression data from which the expression levels of
a gene are to be described. When the gene’s expression levels are denoted by Y = (Y1 , . . . , Yn )⊤ , the aim is to
find the linear relation Yi = Xi,∗ β from the data at hand by means of regression analysis. Regression is however
frustrated by the high-dimensionality of X (illustrated in Section 1.2 and at the end of Section 1.5). These notes
discuss how regression may be modified to accommodate the high-dimensionality of X. First, linear regression is
recapitulated.
1.1 Linear regression
The linear regression model relates the response to the covariates through:
Yi = Xi,∗ β + εi = Σ_{j=1}^p Xi,j βj + εi, (1.1)
with regression parameter β = (β1, . . . , βp)⊤ and error εi ∼ N(0, σ²), for i = 1, . . . , n.
The randomness of εi implies that Yi is also a random variable. In particular, Yi is normally distributed, because
εi ∼ N (0, σ 2 ) and Xi,∗ β is a non-random scalar. To specify the parameters of the distribution of Yi we need to
calculate its first two moments. Its expectation equals:
E(Yi ) = E(Xi,∗ β) + E(εi ) = Xi,∗ β,
while its variance is:
Var(Yi) = E{[Yi − E(Yi)]²} = E(Yi²) − [E(Yi)]²
= E[(Xi,∗ β)² + 2 εi Xi,∗ β + εi²] − (Xi,∗ β)²
= E(εi²) = Var(εi) = σ².
Hence, Yi ∼ N (Xi,∗ β, σ 2 ). This formulation (in terms of the normal distribution) is equivalent to the formulation
of model (1.1), as both capture the assumptions involved: the linearity of the functional part and the normality of
the error.
Model (1.1) is often written in a more condensed matrix form:
Y = X β + ε, (1.2)
where ε = (ε1, ε2, . . . , εn)⊤ is distributed as ε ∼ N(0n, σ² Inn). As above, model (1.2) can be expressed as a
multivariate normal distribution: Y ∼ N(X β, σ² Inn).
Model (1.2) is a so-called hierarchical model (not to be confused with the Bayesian meaning of this term).
Here this terminology emphasizes that X and Y are not on a par, they play different roles in the model. The
former is used to explain the latter. In model (1.1) X is referred to as the explanatory or independent variable, while
the variable Y is generally referred to as the response or dependent variable.
The covariates, the columns of X, may themselves be random. To apply the linear model they are temporarily
assumed fixed. The linear regression model is then to be interpreted as Y | X ∼ N(X β, σ² Inn).
Example 1.1 (Methylation of a tumor-suppressor gene)
Consider a study which measures the gene expression levels of a tumor-suppressor gene (TSG) and two methy-
lation markers (MM1 and MM2) on 67 samples. A methylation marker is a gene that promotes methylation.
Methylation refers to attachment of a methyl group to a nucleotide of the DNA. In case this attachment takes
place in or close by the promotor region of a gene, this complicates the transcription of the gene. Methylation
may down-regulate a gene. This mechanism also works in the reverse direction: removal of methyl groups may
up-regulate a gene. A tumor-suppressor gene is a gene that halts the progression of the cell towards a cancerous
state.
The medical question associated with these data: do the expression levels of the methylation markers affect the
expression levels of the tumor-suppressor gene? To answer this question we may formulate the following linear
regression model:
Yi,tsg = β0 + βmm1 Xi,mm1 + βmm2 Xi,mm2 + εi ,
with i = 1, . . . , 67 and εi ∼ N (0, σ 2 ). The interest focusses on βmm1 and βmm2 . A non-zero value of at least one
of these two regression parameters indicates that there is a linear association between the expression levels of the
tumor-suppressor gene and that of the methylation markers.
Prior knowledge from biology suggests that βmm1 and βmm2 are both non-positive. High expression levels
of the methylation markers lead to hyper-methylation, in turn inhibiting the transcription of the tumor-suppressor
gene. Vice versa, low expression levels of MM1 and MM2 are (via hypo-methylation) associated with high
expression levels of TSG. Hence, a negative concordant effect between MM1 and MM2 (on one side) and TSG
(on the other) is expected. Of course, the methylation markers may affect expression levels of other genes that in
turn regulate the tumor-suppressor gene. The regression parameters βmm1 and βmm2 then reflect the indirect effect
of the methylation markers on the expression levels of the tumor suppressor gene. □
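The model of this example is readily fitted in R. The data of the study are not included in these notes; the sketch below therefore constructs a hypothetical data frame (methData, with columns tsg, mm1 and mm2, and an arbitrary data-generating mechanism) merely to illustrate the call.
# hypothetical data: 67 samples with expression of two methylation markers
methData <- data.frame(mm1 = rnorm(67), mm2 = rnorm(67))
# tumor-suppressor gene expression, generated with negative effects (assumption)
methData$tsg <- 1 - 0.5 * methData$mm1 - 0.3 * methData$mm2 + rnorm(67, sd = 0.5)
# fit the linear regression model and inspect the estimates of beta_mm1 and beta_mm2
summary(lm(tsg ~ mm1 + mm2, data = methData))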
The linear regression model (1.1) involves two unknown parameters, β and σ², which need to be learned from
the data. They are estimated by means of likelihood maximization.
Recall that Yi ∼ N(Xi,∗ β, σ²) with corresponding density fYi(yi) = (2 π σ²)−1/2 exp[−(yi − Xi,∗ β)²/(2σ²)].
The likelihood thus is:
L(Y, X; β, σ²) = Π_{i=1}^n (√(2π) σ)−1 exp[−(Yi − Xi,∗ β)²/(2σ²)],
in which the independence of the observations has been used. Because of the strict monotonicity of the logarithm,
the maximization of the likelihood coincides with the maximum of the logarithm of the likelihood (called the log-
likelihood). Hence, to obtain maximum likelihood estimates of the parameter it is equivalent to find the maximum
of the log-likelihood. The log-likelihood is:
L(Y, X; β, σ²) = log[L(Y, X; β, σ²)] = −n log(√(2π) σ) − ½ σ−2 Σ_{i=1}^n (Yi − Xi,∗ β)².
After noting that Σ_{i=1}^n (Yi − Xi,∗ β)² = ∥Y − X β∥₂² = (Y − X β)⊤(Y − X β), the log-likelihood can be
written as:
L(Y, X; β, σ²) = −n log(√(2π) σ) − ½ σ−2 ∥Y − X β∥₂².
In order to find the maximum of the log-likelihood, take its derivative with respect to β:
∂/∂β L(Y, X; β, σ²) = −½ σ−2 ∂/∂β ∥Y − X β∥₂² = σ−2 X⊤(Y − X β).
Equating this derivative to zero gives the estimating equation for β:
X⊤ X β = X⊤ Y. (1.3)
Equation (1.3) is called the normal equation. Pre-multiplication of both sides of the normal equation by
(X⊤ X)−1 now yields the maximum likelihood estimator of the regression parameter: β̂ = (X⊤ X)−1 X⊤ Y,
in which it is assumed that (X⊤ X)−1 is well-defined.
Along the same lines one obtains the maximum likelihood estimator of the residual variance. Take the partial
derivative of the log-likelihood with respect to σ:
∂/∂σ L(Y, X; β, σ²) = −n σ−1 + σ−3 ∥Y − X β∥₂².
Equating the right-hand side to zero and solving for σ² yields σ̂² = n−1 ∥Y − X β∥₂². In this expression β is unknown
and the maximum likelihood estimate of β is plugged in.
With explicit expressions of the maximum likelihood estimators at hand, we can study their properties. The
expectation of the maximum likelihood estimator of the regression parameter β is:
E(β̂) = E[(X⊤X)−1 X⊤ Y] = (X⊤X)−1 X⊤ E(Y) = (X⊤X)−1 X⊤X β = β.
Hence, the maximum likelihood estimator is unbiased. Its variance is:
Var(β̂) = E(β̂ β̂⊤) − E(β̂) [E(β̂)]⊤ = (X⊤X)−1 X⊤ E(YY⊤) X (X⊤X)−1 − β β⊤ = σ² (X⊤X)−1,
in which we have used that E(YY⊤) = X β β⊤ X⊤ + σ² Inn. From Var(β̂) = σ²(X⊤X)−1, one obtains an
estimate of the variance of the estimator of the j-th regression coefficient: σ̂ 2 (β̂j ) = σ̂ 2 [(X⊤ X)−1 ]jj . This may
be used to construct a confidence interval for the estimates or test the hypothesis H0 : βj = 0. In the latter display,
σ̂² should not be the maximum likelihood estimator, but is to be replaced by the residual sum-of-squares divided
by n − p rather than n. The residual sum-of-squares is defined as Σ_{i=1}^n (Yi − Xi,∗ β̂)².
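In R these estimators are evaluated in a few lines. The sketch below, on simulated data (all names and dimensions are illustrative), computes the maximum likelihood estimate of β, the residual-sum-of-squares-based estimate of σ² (with denominator n − p), and the resulting standard errors; its output may be compared to that of lm().
# simulate a low-dimensional (n > p) data set
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% c(1, -1, 0.5) + rnorm(n)
# maximum likelihood estimator of the regression parameter
betaHat <- solve(t(X) %*% X, t(X) %*% Y)
# residual sum-of-squares based estimate of the error variance
sigma2Hat <- sum((Y - X %*% betaHat)^2) / (n - p)
# estimated standard errors of the regression coefficient estimates
seBetaHat <- sqrt(sigma2Hat * diag(solve(t(X) %*% X)))
cbind(betaHat, seBetaHat)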
The prediction of Yi, denoted Ŷi, is the expected value of Yi according to the linear regression model (with its pa-
rameters replaced by their estimates). The prediction of Yi thus equals E(Yi; β̂, σ̂²) = Xi,∗ β̂. In matrix notation
the prediction is:
Ŷ = X β̂ = X (X⊤X)−1 X⊤ Y := H Y,
where H is the hat matrix, as it ‘puts the hat’ on Y. Note that the hat matrix is a projection matrix, i.e. H² = H, for
H² = X (X⊤X)−1 X⊤ X (X⊤X)−1 X⊤ = X (X⊤X)−1 X⊤ = H.
Moreover, H⊤ = H. Thus, the prediction Ŷ is an orthogonal projection of Y onto the space spanned by the
columns of X.
With β̂ available, an estimate of the errors ε̂i, dubbed the residuals, is obtained via:
ε̂ = Y − Ŷ = Y − X β̂ = Y − X (X⊤X)−1 X⊤ Y = [Inn − H] Y.
Thus, the residuals are a projection of Y onto the orthogonal complement of the space spanned by the columns of
X. The residuals are to be used in diagnostics, e.g. checking of the normality assumption by means of a normal
probability plot.
For more on the linear regression model confer the monograph of Draper and Smith (1998).
# id of ERBB2 among the features of the vdx data
idERBB2 <- which(fData(vdx)[,5] == 2064)
# regression analysis
summary(lm(formula = Y[,1] ~ X[,1] + X[,2] + X[,3] + X[,4]))
* If the (column) rank of X is smaller than p, there exists a non-trivial v ∈ Rp such that Xv = 0n. Multiplication of this equality by
X⊤ yields X⊤X v = 0p. As v ≠ 0p, this implies that X⊤X is not invertible.
Applied to our high-dimensional setting, the columns of a high-dimensional design matrix X are linearly
dependent and this super-collinearity causes X⊤ X to be singular. Let us now recall the maximum likelihood esti-
mator of the parameter of the linear regression model: β̂ = (X⊤ X)−1 X⊤ Y. This estimator is only well-defined
if (X⊤X)−1 exists. Hence, when X is high-dimensional the regression parameter β cannot be estimated using the
maximum likelihood procedure.
So far we only presented the practical consequence of high-dimensionality: the expression (X⊤ X)−1 X⊤ Y cannot
be evaluated numerically. But the problem arising from the high-dimensionality of the data is more fundamental.
To appreciate this, consider the normal equations: X⊤X β = X⊤Y. The matrix X⊤X is of rank (at most) n, while β is a
vector of length p. Hence, while there are p unknowns, the system of linear equations from which these are to be
solved effectively comprises n degrees of freedom. If p > n, the vector β cannot uniquely be determined from
this system of equations. To make this more specific let U be the n-dimensional space spanned by the rows
of X and let V be the (p − n)-dimensional orthogonal complement of U, i.e. V = U⊥. Then, Xv = 0n for all
v ∈ V. So, V is the non-trivial null space of X. Consequently, as X⊤X v = X⊤ 0n = 0p, the solution of the
normal equations is:
β = (X⊤X)⁺ X⊤ Y + v  for any v ∈ V,
where A+ denotes the Moore-Penrose inverse of the matrix A (adopting the notation of Harville, 2008). For a
square symmetric matrix, the generalized inverse is defined as:
A⁺ = Σ_{j=1}^p νj−1 1{νj ≠ 0} vj vj⊤,
where the νj and vj are the eigenvalues and eigenvectors of A, respectively (and the latter are not – necessarily –
elements of V). The solution of the normal
equations is thus only determined up to an element from a non-trivial space V , and there is no unique estimator of
the regression parameter.
To arrive at a unique regression estimator for studies with rank deficient design matrices, the minimum least
squares estimator may be employed.
Definition 1.1 (Ishwaran and Rao, 2014)
The minimum least squares estimator of the regression parameter minimizes the sum-of-squares criterion and is of
minimum length. Formally, β̂MLS = arg min_{β∈Rp} ∥Y − Xβ∥₂² such that ∥β̂MLS∥₂² ≤ ∥β∥₂² for all β that minimize
∥Y − Xβ∥₂².
If X is of full rank, the minimum least squares regression estimator coincides with the least squares/maximum
likelihood one as the latter is a unique minimizer of the sum-of-squares criterion and, thereby, automatically also
the minimizer of minimum length. When X is rank deficient, β̂MLS = (X⊤ X)+ X⊤ Y. To see this recall from
above that ∥Y − Xβ∥22 is minimized by (X⊤ X)+ X⊤ Y + v for all v ∈ V . The length of these minimizers is:
∥(X⊤ X)+ X⊤ Y + v∥22 = ∥(X⊤ X)+ X⊤ Y∥22 + 2Y⊤ X(X⊤ X)+ v + ∥v∥22 ,
which, by the orthogonality of V and the space spanned by the columns of X, equals ∥(X⊤ X)+ X⊤ Y∥22 + ∥v∥22 .
Clearly, any nontrivial v, i.e. v ̸= 0p , results in ∥(X⊤ X)+ X⊤ Y∥22 + ∥v∥22 > ∥(X⊤ X)+ X⊤ Y∥22 and, thus,
β̂MLS = (X⊤ X)+ X⊤ Y.
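Numerically, the minimum least squares estimator is obtained from the singular value decomposition of X, which yields the Moore-Penrose inverse. A small sketch (the data are simulated purely for illustration):
# simulate a high-dimensional (p > n) data set
set.seed(2)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
# minimum least squares estimator via the Moore-Penrose inverse: V diag(1/d) U^T Y
svdX <- svd(X)
betaMLS <- svdX$v %*% ((1 / svdX$d) * crossprod(svdX$u, Y))
# with p > n the fit is exact: X betaMLS reproduces Y up to machine precision
max(abs(X %*% betaMLS - Y))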
An alternative (and related) estimator of the regression parameter β that avoids the use of the Moore-Penrose
inverse and is able to deal with (super)-collinearity among the columns of the design matrix is the ridge regression
estimator proposed by Hoerl and Kennard (1970). It essentially comprises an ad hoc fix to resolve the (almost)
singularity of X⊤X. Hoerl and Kennard (1970) propose to simply replace X⊤X by X⊤X + λIpp with
λ ∈ [0, ∞). The scalar λ is a tuning parameter, henceforth called the penalty parameter for reasons that will be-
come clear later. The ad hoc fix solves the singularity as it adds a positive definite matrix, λIpp, to a positive
semi-definite one, X⊤X, making the total a positive definite matrix (Lemma 14.2.4 of Harville, 2008), which is invertible.
Example 1.3 (Super-collinearity, continued)
Recall the super-collinear design matrix X of Example 1.3. Then, for (say) λ = 1:
X⊤X + λIpp = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 7 & −4 \\ 2 & −4 & 7 \end{pmatrix}.
The eigenvalues of this matrix are 11, 7, and 1. Hence, X⊤ X + λIpp has no zero eigenvalue and its inverse is
well-defined. □
With the ad-hoc fix for the singularity of X⊤ X at hand, Hoerl and Kennard (1970) proceed to define the ridge
regression estimator.
Definition 1.2
The ridge regression estimator of the regression parameter of the linear regression model is:
β̂(λ) = (X⊤X + λIpp)−1 X⊤ Y,  with λ ∈ [0, ∞).
For λ strictly positive (Question 1.16 discusses the consequences of negative values of λ), the ridge regression
estimator is a well-defined estimator, even if X is high-dimensional. However, each choice of λ leads to a different
ridge regression estimator. The set of all ridge regression estimates {β̂(λ) : λ ∈ [0, ∞)} is called the solution
path or regularization path of the ridge regression estimator.
Example 1.3 (Super-collinearity, continued)
Recall the super-collinear design matrix X of Example 1.3. Suppose that the corresponding response vector
is Y = (1.3, −0.5, 2.6, 0.9)⊤ . The ridge regression estimates for, e.g. λ = 1, 2, and 10 are then: β̂(1) =
(0.614, 0.548, 0.066)⊤ , β̂(2) = (0.537, 0.490, 0.048)⊤ , and β̂(10) = (0.269, 0.267, 0.002)⊤ . The full solution
path of the ridge regression estimator is shown in the left-hand side panel of Figure 1.1.
Figure 1.1: Left panel: the regularization path of the ridge regression estimator for the data of Example 1.3. Right
panel: the ‘maximum likelihood fit’ Ŷ and the ‘ridge fit’ Ŷ(λ) (the dashed-dotted red line) to the observation Y in
the (hyper)plane spanned by the covariates.
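The regularization path is traced numerically by evaluating the ridge regression estimator over a grid of penalty parameter values. The design matrix of Example 1.3 is not reproduced in these notes; the sketch below therefore uses an arbitrary super-collinear stand-in of the same dimensions (together with the response vector quoted above), so the resulting path differs numerically from the one in Figure 1.1.
# arbitrary super-collinear stand-in design matrix (third column = sum of first two)
X <- matrix(c(1, -1, 2, 1,  0, 1, 1, 0,  1, 0, 3, 1), nrow = 4)
Y <- c(1.3, -0.5, 2.6, 0.9)
# evaluate the ridge regression estimator over a grid of penalty parameters
lambdas <- exp(seq(log(0.01), log(1000), length.out = 100))
betasRidge <- sapply(lambdas, function(lambda) {
    solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y) })
# plot the regularization path against log(lambda)
matplot(log(lambdas), t(betasRidge), type = "l", xlab = "log(lambda)",
        ylab = "ridge estimate")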
Low-dimensionally, when X⊤ X is of full rank, the ridge regression estimator is linearly related to its maximum
likelihood counterpart. To see this define the linear operator Wλ = (X⊤ X + λIpp )−1 X⊤ X. The ridge regression
estimator β̂(λ) can then be expressed as Wλ β̂ for:
β̂(λ) = (X⊤X + λIpp)−1 X⊤ Y = (X⊤X + λIpp)−1 X⊤X (X⊤X)−1 X⊤ Y = Wλ β̂.
The linear operator Wλ thus transforms the maximum likelihood estimator of the regression parameter into its
ridge regularized counterpart. High-dimensionally, no such linear relation between the ridge regression and the
minimum least squares estimator exists.
With an estimate of the regression parameter β available, we can define the fit. For the ridge regression esti-
mator the fit is defined analogous to the maximum likelihood case:
Ŷ(λ) = X β̂(λ) = X (X⊤X + λIpp)−1 X⊤ Y := H(λ) Y.
For the maximum likelihood regression estimator the fit could be understood as a projection of Y onto the subspace
spanned by the columns of X. This is depicted in the right panel of Figure 1.1, where Ŷ is the projection of the
observation Y onto the covariate space. The projected observation Ŷ is orthogonal to the residual ε̂ = Y − Ŷ.
This means the fit is the point in the covariate space closest to the observation. Put differently, the covariate space
does not contain a point that is better (in some sense) in explaining the observation. Compare this to the ‘ridge fit’,
which is plotted as a dashed-dotted red line in the right panel of Figure 1.1. The ‘ridge fit’ is a line, parameterized
by {λ : λ ∈ R≥0}, where each point on this line matches the corresponding intersection of the regularization
path β̂(λ) and the vertical line x = λ. The ‘ridge fit’ Ŷ(λ) runs from the maximum likelihood fit Ŷ = Ŷ(0) to an
intercept-only (if present) fit in which the covariates do not contribute to the explanation of the observation. From
the figure it is obvious that for any λ > 0 the ‘ridge fit’ Ŷ(λ) is not orthogonal to the observation Y. In other
words, the ‘ridge residuals’ Y − Ŷ(λ) are not orthogonal to the fit Ŷ(λ) (confer Exercise 1.8b). Hence, the ad hoc
fix of the ridge regression estimator resolves the non-evaluation of the estimator in the face of super-collinearity
but yields a ‘ridge fit’ that is not optimal in explaining the observation. Mathematically, this is due to the fact
that the fit Ŷ(λ) corresponding to the ridge regression estimator is not a projection of Y onto the covariate space
(confer Exercise 1.8a).
1.3 Eigenvalue shrinkage
The effect of the ridge penalty is best studied through the singular value decomposition (SVD) of the design matrix,
X = Ux Dx Vx⊤.
In the above, Dx is an (n × p)-dimensional block matrix. Its upper left block is a (rank(X) × rank(X))-dimensional
diagonal matrix with the singular values on the diagonal. The remaining blocks, zero if p = n and X is of full
rank, one if rank(X) = n or rank(X) = p, or three if rank(X) < min{n, p}, are of appropriate dimensions and
comprise zeros only. The matrix Ux is an (n × n)-dimensional matrix with columns containing the left singular vectors
(denoted ui), and Vx a (p × p)-dimensional matrix with columns containing the right singular vectors (denoted vi).
The columns of Ux and Vx are orthonormal: Ux⊤Ux = Inn = Ux Ux⊤ and Vx⊤Vx = Ipp = Vx Vx⊤.
The maximum likelihood estimator, which is well-defined if n > p and rank(X) = p, can then be rewritten in
terms of the SVD-matrices as:
β̂ = (X⊤X)−1 X⊤ Y = (Vx Dx⊤Dx Vx⊤)−1 Vx Dx⊤ Ux⊤ Y = Vx (Dx⊤Dx)−1 Dx⊤ Ux⊤ Y.
The block structure of Dx implies that the matrix (Dx⊤Dx)−1 Dx⊤ is a (p × n)-dimensional matrix with the
reciprocals of the nonzero singular values on the diagonal of its upper left block.
Similarly, the ridge regression estimator can be rewritten in terms of the SVD-matrices as:
β̂(λ) = (X⊤X + λIpp)−1 X⊤ Y = Vx (Dx⊤Dx + λIpp)−1 Dx⊤ Ux⊤ Y. (1.5)
Combining the two results and writing (Dx)jj = dx,jj for the p nonzero singular values on the diagonal of the
upper block of Dx, we have d−1x,jj ≥ dx,jj (d²x,jj + λ)−1 for all λ > 0. Thus, the ridge penalty shrinks the singular
values.
Return to the problem of the super-collinearity of X in the high-dimensional setting (p > n). The super-
collinearity implies the singularity of X⊤X and prevents the calculation of the maximum likelihood estimator
of the regression coefficients. However, X⊤X + λIpp is non-singular, with inverse (X⊤X + λIpp)−1 =
Σ_{j=1}^p (d²x,jj + λ)−1 vj vj⊤, where dx,jj = 0 for j > rank(X). The right-hand side is well-defined for λ > 0.
From the ‘spectral formulation’ of the ridge regression estimator (1.5) the λ-limits can be deduced. The lower
λ-limit of the ridge regression estimator β̂(0+ ) = limλ↓0 β̂(λ) coincides with the minimum least squares estima-
tor. This is immediate when X is of full rank. In the high-dimensional situation, if the dimension p exceeds the
sample size n, it follows from the limit:
lim_{λ↓0} dx,jj (d²x,jj + λ)−1 = { d−1x,jj if dx,jj ≠ 0;  0 if dx,jj = 0 }.
Then, limλ↓0 β̂(λ) = β̂MLS . Similarly, the upper λ-limit is evident from the fact that limλ→∞ dx,jj (d2x,jj +λ)−1 =
0, which implies limλ→∞ β̂(λ) = 0p . Hence, all regression coefficients are shrunken towards zero as the penalty
parameter increases. This also holds for X with p > n. Furthermore, this behaviour is not strictly monotone in λ:
λa > λb does not necessarily imply |β̂j (λa )| < |β̂j (λb )|. Upon close inspection this can be witnessed from the
ridge solution path of β3 in Figure 1.1.
= (Ikp Dx⊤Dx Ipk)−1 Ikp Dx⊤ Ux⊤ Y
= (Dx,k⊤ Dx,k)−1 Dx,k⊤ Ux⊤ Y,
where Dx,k is the submatrix of Dx formed by removal of its last p − k columns. Similarly, Ikp and Ipk
are obtained from Ipp by removal of the last p − k rows and columns, respectively. The principal component
regression estimator of β then is β̂pcr = Vx,k (Dx,k⊤ Dx,k)−1 Dx,k⊤ Ux⊤ Y. When k is set equal to the column rank
of X, and thus to the rank of X⊤X, the principal component regression estimator becomes β̂pcr = (X⊤X)− X⊤ Y,
where (X⊤X)− denotes the Moore-Penrose inverse of X⊤X, and it thus coincides with the minimum least squares
estimator β̂MLS.
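In R, the principal component regression estimator may be formed directly from the singular value decomposition; a brief sketch, with the data and the number of components k chosen purely for illustration:
# simulated data and a choice of the number of principal components k
set.seed(3)
n <- 20; p <- 6; k <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
# principal component regression estimator: V_k diag(1/d_k) U_k^T Y
svdX <- svd(X)
betaPCR <- svdX$v[, 1:k] %*% ((1 / svdX$d[1:k]) * crossprod(svdX$u[, 1:k], Y))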
1.4 Moments
The first two moments of the ridge regression estimator are derived. Next the performance of the ridge regression
estimator is studied in terms of the mean squared error, which combines the first two moments.
1.4.1 Expectation
The left panel of Figure 1.1 shows ridge estimates of the regression parameters converging to zero as the penalty
parameter tends to infinity. This behaviour of the ridge regression estimator does not depend on the specifics of
the data set. To see this, study the expectation of the ridge regression estimator:
E[β̂(λ)] = E[(X⊤X + λIpp)−1 X⊤ Y] = (X⊤X + λIpp)−1 X⊤ E(Y) = (X⊤X + λIpp)−1 X⊤X β.
The identity above does not hold if the rows of X harbor a linear dependency, for instance, if the study contains
a replicate corresponding to a duplicated row. This does not hamper the envisioned bias decomposition as the
definition of the projection matrix can be modified, e.g. without one of the instances of the duplicated row, such that
the identity in the display above holds and it still projects onto R(X). With the projection matrix at hand, we note
that
Px β̂(λ) = Vx Ipn Inp Vx⊤ Vx (Dx⊤Dx + λIpp)−1 Dx⊤ Ux⊤ Y
= Vx (Dx⊤Dx + λIpp)−1 Ipn Inp Dx⊤ Ux⊤ Y = β̂(λ).
The ridge regression estimator is thus unaffected by the projection, as Px β̂(λ) = β̂(λ), and it must therefore
already be an element of the projected subspace R(X). The bias can now be decomposed as:
1.4.2 Variance
The second moment of the ridge regression estimator is straightforwardly obtained by exploiting its linear
relation with the maximum likelihood regression estimator. Then,
Var[β̂(λ)] = Var(Wλ β̂) = Wλ Var(β̂) Wλ⊤ = σ² Wλ (X⊤X)−1 Wλ⊤,
in which we have used Var(AY) = A Var(Y) A⊤ for a non-random matrix A, the fact that Wλ is non-random,
and Var(β̂) = σ² (X⊤X)−1.
We characterize the behavior of the variance of the ridge regression estimator. We first observe that, similar to
the expectation, it vanishes as λ tends to infinity:
lim_{λ→∞} Var[β̂(λ)] = σ² lim_{λ→∞} Wλ (X⊤X)−1 Wλ⊤ = 0pp,
as Wλ = (X⊤X + λIpp)−1 X⊤X vanishes when λ → ∞.
Hence, the variance of the ridge regression coefficient estimates decreases towards zero as the penalty parameter
becomes large. This is illustrated in the right panel of Figure 1.1 for the data of Example 1.3. Secondly, the
variance of the maximum likelihood estimator is ‘larger’ than that of the ridge regression estimator (Proposition
1.2).
Proposition 1.2
The variance of the maximum likelihood regression estimator exceeds (in the positive definite ordering sense) that
of the ridge regression estimator, Var(β̂) ⪰ Var[β̂(λ)], with the inequality being strict if λ > 0.
Proof. Use the analytic expression of the variance of the ridge regression estimator to study its difference from that
of the maximum likelihood regression estimator:
Var(β̂) − Var[β̂(λ)] = σ² [(X⊤X)−1 − Wλ (X⊤X)−1 Wλ⊤] = σ² (X⊤X + λIpp)−1 [2 λ Ipp + λ² (X⊤X)−1] [(X⊤X + λIpp)−1]⊤.
This difference is non-negative definite as each component in the matrix product is non-negative definite, and even
positive definite if either λ > 0 or rank(X⊤ X) = p. ■
The variance inequality of Proposition 1.2 can be interpreted in terms of the stochastic behaviour of both involved
estimates. This is illustrated by the next example.
Example 1.5 (Variance comparison)
Consider the design matrix:
X⊤ = \begin{pmatrix} −1 & 0 & 2 & 1 \\ 2 & 1 & −1 & 0 \end{pmatrix}.
The variances of the maximum likelihood and ridge (with λ = 1) estimates of the regression coefficients then are:
Var(β̂) = σ² \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}  and  Var[β̂(λ)] = σ² \begin{pmatrix} 0.1524 & 0.0698 \\ 0.0698 & 0.1524 \end{pmatrix}.
These variances can be used to construct level sets of the distribution of the estimates. The level sets that contain
50%, 75% and 95% of the distribution of the maximum likelihood and ridge regression estimates are plotted in
Figure 1.2. In line with the inequality of Proposition 1.2 the level sets of the ridge regression estimate are smaller
than those of the maximum likelihood one. The former thus varies less. □
For an orthonormal design matrix, i.e. X⊤X = Ipp, this gives
Var[β̂(λ)] = σ² Wλ (X⊤X)−1 Wλ⊤ = σ² (Ipp + λIpp)−1 Ipp [(Ipp + λIpp)−1]⊤ = σ² (1 + λ)−2 Ipp.
As the penalty parameter λ is non-negative, this does not exceed Var(β̂) = σ² Ipp. In particular, the expression after the
utmost right equality sign vanishes as λ → ∞. □
Figure 1.2: Level sets of the distribution of the maximum likelihood (left panel) and ridge (right panel) regression
estimators.
The variance of the ridge regression estimator may be decomposed in the same way as its bias (cf. the end of
Section 1.4.1). There is, however, no contribution of the high-dimensionality of the study design as that is non-
random and, consequently, exhibits no variation. Hence, the variance only relates to the variation in the projected
subspace R(X) as is obvious from:
Perhaps this is seen more clearly when writing the variance of the ridge regression estimator in terms of the
matrices that constitute the singular value decomposition of X:
Var[β̂(λ)] = σ² Vx (Dx⊤Dx + λIpp)−1 Dx⊤Dx (Dx⊤Dx + λIpp)−1 Vx⊤.
High-dimensionally, (Dx⊤Dx)jj = 0 for j = n+1, . . . , p. And if (Dx⊤Dx)jj = 0, so is [(Dx⊤Dx + λIpp)−1 Dx⊤Dx
(Dx⊤Dx + λIpp)−1]jj = 0. Hence, the variance is determined by the first n columns of Vx. When n < p, the
variance is then to be interpreted as the spread of the ridge regression estimator (with the same choice of λ) when
the study is repeated with exactly the same design matrix such that the resulting estimator is confined to the same
subspace R(X). The following R-script illustrates this by an arbitrary data example (plot not shown):
Listing 1.2 R code
# set parameters
X <- matrix(rnorm(2), nrow=1)
betas <- matrix(c(2, -1), ncol=1)
lambda <- 1
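Listing 1.2 only sets the parameters; the remainder of the script is not reproduced in these notes. A possible completion, which repeatedly draws a response from this one-sample design and re-estimates the regression parameter, could look as follows (a sketch, not the original code):
# repeatedly draw the response for the same design matrix and re-estimate
betasHat <- sapply(1:1000, function(i) {
    Y <- X %*% betas + rnorm(nrow(X))
    solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y) })
# the ridge estimates all fall on a line: the (here one-dimensional) row space of X
plot(t(betasHat), xlab = "beta1", ylab = "beta2")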
The full distribution of the ridge regression estimator is now known. The estimator, β̂(λ) = (X⊤ X+λIpp )−1 X⊤ Y
is a linear estimator, linear in Y. As Y is normally distributed, so is β̂(λ). Moreover, the normal distribution is
fully characterized by its first two moments, which are available. Hence:
β̂(λ) ∼ N {(X⊤ X + λIpp )−1 X⊤ X β, σ 2 (X⊤ X + λIpp )−1 X⊤ X[(X⊤ X + λIpp )−1 ]⊤ }.
Given λ and X, the random behavior of the estimator is thus known. In particular, when n < p, the variance is
positive semi-definite (but not positive definite) and this p-variate normal distribution is degenerate, i.e. there is no
probability mass outside R(X), the subspace of Rp spanned by the rows of X.
In the last step we have used β̂ ∼ N [β, σ 2 (X⊤ X)−1 ] and the expectation of the quadratic form of a multivariate
random variable ε ∼ N (µε , Σε ) that for a nonrandom symmetric positive definite matrix Λ is (cf. Mathai and
Provost 1992) E(ε⊤ Λ ε) = tr(ΛΣε ) + µ⊤ ε Λµε , of course replacing ε by β̂ in this expectation. ■
The first summand in Lemma 1.1’s expression of MSE[β̂(λ)] represents the sum of the variances of the ridge
regression estimator, while the second summand can be thought of as the “squared bias” of the ridge regression
estimator. In particular, lim_{λ→∞} MSE[β̂(λ)] = β⊤β, which is the squared bias of an estimator that equals
zero (as does the ridge regression estimator in the limit).
Example 1.6 (Orthonormal design matrix, continued)
Assume the design matrix X is orthonormal. Then, MSE[β̂] = p σ² and
MSE[β̂(λ)] = (1 + λ)−2 (p σ² + λ² β⊤β),
which is minimized at λ = σ² (β⊤β/p)−1. □
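The behaviour of this mean squared error as a function of the penalty parameter is easily visualized; a sketch with arbitrary choices of p, σ² and β:
# arbitrary parameter choices for the orthonormal design case
p <- 10; sigma2 <- 1; beta <- rep(0.5, p)
lambdas <- seq(0, 20, length.out = 200)
# MSE of the ridge regression estimator under an orthonormal design
mse <- (p * sigma2 + lambdas^2 * sum(beta^2)) / (1 + lambdas)^2
plot(lambdas, mse, type = "l", xlab = "lambda", ylab = "MSE")
# the minimum is attained at lambda = sigma2 / (beta' beta / p)
abline(v = sigma2 / (sum(beta^2) / p), lty = 2)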
The following theorem and proposition are required for the proof of the main result.
Theorem 1.1 (Theorem 1 of Theobald, 1974)
Let θ̂1 and θ̂2 be (different) estimators of θ with second order moment matrices M1 = E[(θ̂1 − θ)(θ̂1 − θ)⊤]
and M2 = E[(θ̂2 − θ)(θ̂2 − θ)⊤]. Then M1 − M2 ⪰ 0 if and only if E[(θ̂1 − θ)⊤A(θ̂1 − θ)] ≥ E[(θ̂2 − θ)⊤A(θ̂2 − θ)]
for all positive semi-definite matrices A.
We are now ready to prove the main result, formalized as Theorem 1.2: for some λ the ridge regression
estimator yields a lower MSE than the ML regression estimator. Question 1.12 provides a simpler (?) but more
limited proof of this result.
Theorem 1.2 (Theorem 2 of Theobald, 1974)
There exists λ > 0 such that MSE[β̂(λ)] < MSE[β̂(0)] = MSE(β̂).
Proof. The second order moment matrix of the ridge regression estimator is:
M(λ) = E{[β̂(λ) − β][β̂(λ) − β]⊤} = Var[β̂(λ)] + {E[β̂(λ)] − β}{E[β̂(λ)] − β}⊤
= σ² Wλ (X⊤X)−1 Wλ⊤ + (Wλ − Ipp) β β⊤ (Wλ − Ipp)⊤.
Then:
M(0) − M(λ) = σ 2 (X⊤ X)−1 − σ 2 Wλ (X⊤ X)−1 Wλ⊤ − (Wλ − Ipp )ββ ⊤ (Wλ − Ipp )⊤
= σ 2 Wλ (X⊤ X)−1 (X⊤ X + λIpp )(X⊤ X)−1 (X⊤ X + λIpp )(X⊤ X)−1 Wλ⊤
−σ 2 Wλ (X⊤ X)−1 Wλ⊤ − (Wλ − Ipp )ββ ⊤ (Wλ − Ipp )⊤
= σ 2 Wλ [2 λ (X⊤ X)−2 + λ2 (X⊤ X)−3 ]Wλ⊤
−λ2 (X⊤ X + λIpp )−1 ββ ⊤ [(X⊤ X + λIpp )−1 ]⊤
= σ 2 (X⊤ X + λIpp )−1 [2 λ Ipp + λ2 (X⊤ X)−1 ][(X⊤ X + λIpp )−1 ]⊤
−λ2 (X⊤ X + λIpp )−1 ββ ⊤ [(X⊤ X + λIpp )−1 ]⊤
= λ(X⊤ X + λIpp )−1 [2 σ 2 Ipp + λσ 2 (X⊤ X)−1 − λββ ⊤ ][(X⊤ X + λIpp )−1 ]⊤ .
This is positive definite if and only if 2 σ 2 Ipp + λσ 2 (X⊤ X)−1 − λββ ⊤ ≻ 0. Hereto it suffices to show that
2 σ² Ipp − λββ⊤ ≻ 0. By Proposition 1.3 this holds for λ such that 2σ² (β⊤β)−1 > λ. For these λ, we thus have
M(0) − M(λ) ≻ 0. Application of Theorem 1.1 now concludes the proof. ■
This result of Theobald (1974) is generalized by Farebrother (1976) to the class of design matrices X with
rank(X) < p.
Theorem 1.2 can be used to illustrate that the ridge regression estimator strikes a balance between the bias and
variance. This is illustrated in the left panel of Figure 1.3. For small λ, the variance of the ridge regression esti-
mator dominates the MSE. This may be understood when realizing that in this domain of λ the ridge regression
estimator is close to the unbiased maximum likelihood regression estimator. For large λ, the variance vanishes
and the bias dominates the MSE. For small enough values of λ, the decrease in variance of the ridge regression
estimator exceeds the increase in its bias. As the MSE is the sum of these two, the MSE first decreases as λ moves
away from zero. In particular, as λ = 0 corresponds to the ML regression estimator, the ridge regression estimator
yields a lower MSE for these values of λ. In the right panel of Figure 1.3 MSE[β̂(λ)] < MSE[β̂(0)] for λ < 7
(roughly) and the ridge regression estimator outperforms its maximum likelihood counterpart.
Figure 1.3: Left panel: mean squared error, and its ‘bias’ and ‘variance’ parts, of the ridge regression estimator (for
artificial data). Right panel: mean squared error of the ridge and ML estimator of the regression coefficient vector
(for the same artificial data).
Beyond providing another motivation for the ridge regression estimator, the use of Theorem 1.2 is limited. The opti-
mal choice of λ depends on the quantities β and σ². These are unknown in practice. The penalty parameter
is then chosen in a data-driven fashion (see e.g. Section 1.8.2 and various other places).
Although Theorem 1.2 may be of limited practical use, it does give insight into when the ridge regression estimator may
be preferred over its ML counterpart. Ideally, the range of penalty parameters for which the ridge regression
estimator outperforms – in the MSE sense – the ML regression estimator is as large as possible. The factors
that influence the size of this range may be deduced from the optimal penalty λopt = σ² (β⊤β/p)−1 found un-
der the assumption of an orthonormal X (see Example 1.6), but also from the bound on the penalty parameter,
λmax = 2σ² (β⊤β)−1, such that MSE[β̂(λ)] < MSE[β̂(0)] for all λ ∈ (0, λmax), derived in the proof of Theorem
1.2. Firstly, an increase of the error variance σ² yields a larger λopt and λmax. Put differently, more noisy data bene-
fits the ridge regression estimator. Secondly, λopt and λmax also become larger when their denominators decrease.
The denominator β⊤β/p may be viewed as an estimator of the ‘signal’ variance ‘σβ²’. A quick conclusion would
be that ridge regression profits from less signal. But more can be learned from the denominator. Contrast the two
regression parameters βunif = 1p and βsparse, which comprises only zeros except for the first element, which equals p,
i.e. βsparse = (p, 0, . . . , 0)⊤. Then βunif and βsparse have comparable signal in the sense that Σ_{j=1}^p βj = p for both. The
denominator of λopt corresponding to these two parameters equals 1 and p, respectively. This suggests that ridge regres-
sion will perform better in the former case where the regression parameter is not dominated by a few elements
but rather all contribute comparably to the explanation of the variation in the response. Of course, more factors
contribute. For instance, collinearity among the columns of X, which gave rise to ridge regression in the first place.
The choice of the penalty parameter on the basis of the mean squared error strikes a balance between the bias
and variance, a so-called bias-variance trade-off. In the extremes, the model has either a high variance but low
bias and will overfit the data. Or, the model exhibits little variance but has a high bias and as a result underfits the
data. Ideally, the balance between bias and variance results in a model that neither over- nor underfits. Apart from
the regularization parameter, the bias-variance trade-off is affected by other means. For instance, the addition of
more covariates (or transformations thereof) to the model is likely to result in a lower bias but also in a high vari-
ance, while the employment of a simpler model has the opposite effect. Alternative, a larger sample size typically
reduces the variance.
Remark 1.1
Theorem 1.2 can also be used to conclude on the biasedness of the ridge regression estimator. The Gauss-Markov
theorem (Rao, 1973) states (under some assumptions) that the ML regression estimator is the best linear unbiased
estimator (BLUE) with the smallest MSE. As the ridge regression estimator is a linear estimator and outperforms
(in terms of MSE) this ML estimator, it must be biased (for it would otherwise refute the Gauss-Markov theorem).
1.4.4 Debiasing
Despite a potential superior performance in the MSE sense of the ridge regression estimator, it remains biased.
This bias hampers the direct application of most machinery presented in the statistical literature as that is geared
towards unbiased estimators. Unbiasedness facilitates proper inference and confidence interval construction. For
instance, an estimator with a large bias and small variance may yield a confidence interval that does not contain
the true parameter value. Hence, several proposals to de-bias the ridge regression estimator have been presented.
A straightforward approach would be to correct for the bias of the estimator (1.6), using the ‘non-debiased’
ridge regression estimator as an estimator for the regression parameter. This debiased ridge regression estimator
is exactly what Zhang and Politis (2022) propose:
β̂d(λ) = [Ipp + λ(X⊤X + λIpp)−1] β̂(λ).
Its estimation error equals:
β̂d(λ) − β = −λ²(X⊤X + λIpp)−2 β + (X⊤X + λIpp)−1 [Ipp + λ(X⊤X + λIpp)−1] X⊤ ε,
where we have i) substituted the analytic expression of the non-debiased ridge regression estimator, ii) substituted
Xβ + ε for Y by virtue of the linear model, and iii) manipulated the resulting expression using straightforward
linear algebra. From the expression of the preceding display, we directly obtain the bias, E[β̂d(λ)] − β =
−λ²(X⊤X + λIpp)−2 β, and variance,
Var[β̂d(λ)] = σ² (X⊤X + λIpp)−1 [Ipp + λ(X⊤X + λIpp)−1] X⊤X [Ipp + λ(X⊤X + λIpp)−1] [(X⊤X + λIpp)−1]⊤.
From these expressions, it is – as Zhang and Politis (2022) point out – clear that, if λ d−2x,n tends to zero as n → ∞,
the bias of the debiased estimator is much smaller than that of its non-debiased counterpart, while the difference in
their variances becomes negligible. It is not immediate how this condition on λ d−2x,n translates practically to finite
sample sizes in terms of a class of design matrices and a domain of the regularization parameter.
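As a concrete illustration, a debiased estimator of the form displayed above is evaluated in a few lines of R (a sketch on simulated data, following the bias-correction construction described here rather than any particular published code):
# simulated high-dimensional data (illustrative only)
set.seed(4)
n <- 25; p <- 50; lambda <- 1
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
# ridge regression estimator and its debiased counterpart
Sinv <- solve(t(X) %*% X + lambda * diag(p))
betaRidge <- Sinv %*% t(X) %*% Y
betaDebiased <- betaRidge + lambda * Sinv %*% betaRidge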
An alternative debiased ridge regression estimator is proposed in Bühlmann (2013). It starts from the bias
decomposition of Proposition 1.1 into a part attributable to the regularization and one to the high-dimensionality.
Bühlmann (2013) assumes a relatively small penalty parameter such that effectively the regularization bias,
Px{E[β̂(λ)] − β}, is smaller than the standard error and, thereby, not of primary concern. We are then left to
reduce the bias introduced by the high-dimensionality, (Px − Ipp)β, called projection bias in Bühlmann (2013).
Hereto Bühlmann (2013) assumes the existence of an alternative but accurate estimator β̂(init). Under (strong?) as-
sumptions, the lasso regression estimator (Chapter 6) may serve as such. With this alternative, accurate estimator
at hand, the projection bias is eliminated by replacing β by β̂(init) and subtracting the projection bias term from
the non-debiased ridge regression estimator.
1.5 Constrained estimation
The ridge regression estimator minimizes the ridge loss function:
Lridge(β; λ) = ∥Y − X β∥₂² + λ∥β∥₂² = Σ_{i=1}^n (Yi − Xi,∗ β)² + λ Σ_{j=1}^p βj².
This loss function is the traditional sum-of-squares augmented with a penalty. The particular form of the penalty,
λ∥β∥₂², is referred to as the ridge penalty or ridge regularization term and λ as the penalty parameter or regular-
ization parameter. For λ = 0, minimization of the ridge loss function yields the ML estimator (if it exists). For
any λ > 0, the ridge penalty contributes to the loss function, affecting its minimum and its location. The minimum
of the sum-of-squares is well-known. The minimum of the ridge penalty is attained at β = 0p whenever λ > 0.
The β that minimizes Lridge (β; λ) then balances the sum-of-squares and the penalty. The effect of the penalty in
this balancing act is to shrink the regression coefficients towards zero, its minimum. In particular, the larger λ, the
18 Ridge regression
larger the contribution of the penalty to the loss function, the stronger the tendency to shrink non-zero regression
coefficients to zero (and decrease the contribution of the penalty to the loss function). This motivates the name
‘penalty’ as non-zero elements of β increase (or penalize) the loss function.
To verify that the ridge regression estimator indeed minimizes the ridge loss function, proceed as usual. Take
the derivative with respect to β:
∂/∂β Lridge(β; λ) = −2 X⊤(Y − Xβ) + 2 λ Ipp β = −2 X⊤Y + 2 (X⊤X + λIpp) β.
Equate the derivative to zero and solve for β. This yields the ridge regression estimator.
The ridge regression estimator is thus a stationary point of the ridge loss function. A stationary point corre-
sponds to a minimum if the Hessian matrix with second order partial derivatives is positive definite. The Hessian
of the ridge loss function is
∂²/(∂β ∂β⊤) Lridge(β; λ) = 2 (X⊤X + λIpp).
This Hessian is the sum of the (semi-)positive definite matrix X⊤ X and the positive definite matrix λ Ipp . Lemma
14.2.4 of Harville (2008) then states that the sum of these matrices is itself a positive definite matrix. Hence,
the Hessian is positive definite and the ridge loss function has a stationary point at the ridge regression estimator,
which is a minimum.
The ridge regression estimator minimizes the ridge loss function. It remains to verify that it is a global mini-
mum. To this end we introduce the concept of a convex function. As a prerequisite, a set S ⊂ Rp is called convex
if for all β1 , β2 ∈ S their weighted average βθ = (1 − θ)β1 + θβ2 for all θ ∈ [0, 1] is itself an element of S , thus
βθ ∈ S . If for all θ ∈ (0, 1), the weighted average βθ is inside S and not on its boundary, the set is called strictly
convex. Examples of (strictly) convex and nonconvex sets are depicted in Figure 1.4. A function f (·) is (strictly)
convex if the set {(β, y) : y ≥ f(β)}, called the epigraph of f(·), is (strictly) convex.
Examples of (strictly) convex and nonconvex functions are depicted in Figure 1.4. The ridge loss function is the
sum of two parabolas: one is at least convex and the other strictly convex in β. The sum of a convex and a strictly
convex function is itself strictly convex (see Lemma 9.4.2 of Fletcher 2008). The ridge loss function is thus strictly
convex. Theorem 9.4.1 of Fletcher 2008 then warrants, by the strict convexity of the ridge loss function, that the
ridge regression estimator is a global minimum.
From the ridge loss function the limiting behavior of the variance of the ridge regression estimator can be un-
derstood. The ridge penalty with its minimum β = 0p does not involve data and, consequently, the variance of its
minimum equals zero. With the ridge regression estimator being a compromise between the maximum likelihood
estimator and the minimum of the penalty, so is its variance a compromise of their variances. As λ tends to infinity,
the ridge regression estimator and its variance converge to the minimizer of the loss function and the variance of
the minimizer, respectively. Hence, in the limit (large λ) the variance of the ridge regression estimator vanishes.
Understandably, as the penalty now fully dominates the loss function and, consequently, it does no longer involve
data (i.e. randomness).
The loss function of the ridge regression estimator facilitates another view on the estimator. Hereto now define the
ridge regression estimator as:
β̂(λ) = arg minβ∈Rp ∥Y − X β∥22 + λ∥β∥22 . (1.9)
This minimization problem can be reformulated into the following constrained optimization problem:
β̂(λ) = arg min{β∈Rp : ∥β∥22 ≤c} ∥Y − X β∥22 , (1.10)
for some suitable c > 0. The constrained optimization problem (1.10) can be solved by means of the Karush-Kuhn-
Tucker (KKT) multiplier method (Fletcher, 2008), which minimizes a function subject to inequality constraints.
The KKT multiplier method states that, under some regularity conditions (all met here), there exists a constant
ν ≥ 0, called the multiplier, such that the solution β̂(ν) of the constrained minimization problem (1.10) satisfies
the so-called KKT conditions. The first KKT condition (referred to as the stationarity condition) demands that
the gradient (with respect to β) of the Lagrangian associated with the minimization problem equals zero at the
solution β̂(ν). The Lagrangian for problem (1.10) is:
∥Y − X β∥22 + ν(∥β∥22 − c).
Figure 1.4: Top panels show examples of convex (left) and nonconvex (right) sets. Middle panels show examples of
convex (left) and nonconvex (right) functions. The left bottom panel illustrates the ridge estimation as a constrained
estimation problem. The ellipses represent the contours of the ML loss function, with the blue dot at the center the
ML estimate. The circle is the ridge parameter constraint. The red dot is the ridge estimate. It is at the intersection
of the ridge constraint and the smallest contour with a non-empty intersection with the constraint. The right bottom
panel shows the data corresponding to Example 1.7. The grey line represents the ‘true’ relationship, while the black
line the fitted one.
The second KKT condition (the complementarity condition) requires that ν(∥β̂(ν)∥22 − c) = 0. If ν = λ and
c = ∥β̂(λ)∥₂², the ridge regression estimator β̂(λ) satisfies both KKT conditions. Hence, both problems have the
same solution if c = ∥β̂(λ)∥₂².
The constrained estimation interpretation of the ridge regression estimator is illustrated in the left bottom panel
of Figure 1.4. It shows the level sets of the sum-of-squares criterion and centered around zero the circular ridge
parameter constraint, parametrized by {β : ∥β∥22 = β12 +β22 ≤ c} for some c > 0. The ridge regression estimator
is then the point where the sum-of-squares’ smallest level set hits the constraint. Exactly at that point the sum-
of-squares is minimized over those β’s that live on or inside the constraint. In the high-dimensional setting the
ellipsoidal level sets are degenerate. For instance, in the 2-dimensional case of the left bottom panel of Figure
1.4, the ellipsoids would then be lines, but the geometric interpretation is unaltered.
The ridge regression estimator is always to be found on the boundary of the ridge parameter constraint and is
never an interior point. To see this, assume, for simplicity, that the X⊤ X matrix is of full rank. The radius of the
ridge parameter constraint can then be bounded as follows:
c = ∥β̂(λ)∥₂² = Y⊤ X (X⊤X + λIpp)−2 X⊤ Y < Y⊤ X (X⊤X)−2 X⊤ Y = ∥β̂∥₂².
The inequality in the display above follows from i) X⊤X ≻ 0 and λIpp ≻ 0, ii) X⊤X + λIpp ≻ X⊤X (due to
Lemma 14.2.4 of Harville, 2008), and iii) (X⊤X)−2 ≻ (X⊤X + λIpp)−2 (inferring Corollary 7.7.4 of Horn and
Johnson, 2012). The ridge regression estimator is thus always on the boundary of or inside a spherical constraint centered
around the origin with a radius that is smaller than the size of the maximum likelihood estimator. Moreover, the
constrained estimation formulation of the ridge regression estimator then implies that the latter must be on the
boundary of the constraint.
The size of the spherical ridge parameter constraint shrinks monotonically as λ increases and, eventually, in
the λ → ∞-limit collapses to zero (as is formalized by Proposition 1.4).
Proposition 1.4
The squared norm of the ridge regression estimator satisfies:
i) d∥β̂(λ)∥22 /dλ < 0 for λ > 0,
ii) limλ→∞ ∥β̂(λ)∥22 = 0.
Proof. For part i) we need to verify that ∥β̂(λ)∥₂² > ∥β̂(λ + h)∥₂² for h > 0. Hereto substitute the singular value
decomposition of the design matrix, X = Ux Dx Vx⊤, into ∥β̂(λ)∥₂². Then, after a little algebra, take the
derivative with respect to λ, and obtain:
d/dλ ∥β̂(λ)∥₂² = −2 Σ_{i=1}^n d²x,i (d²x,i + λ)−3 (Y⊤ Ux,∗,i)².
This is negative for all λ > 0. Indeed, the parameter constraint thus becomes smaller and smaller as λ increases,
and so does the size of the estimator.
Part ii) follows from limλ→∞ β̂(λ) = 0p , which has been concluded previously by other means. ■
The relevance of viewing the ridge regression estimator as the solution to a constrained estimation problem be-
comes obvious when considering a typical threat to high-dimensional data analysis: overfitting. Overfitting refers
to the phenomenon of modelling the noise rather than the signal. In case the true model is parsimonious (few
covariates driving the response) and data on many covariates are available, it is likely that a linear combination of
all covariates yields a higher likelihood than a combination of the few that are actually related to the response. As
only the few covariates related to the response contain the signal, the model involving all covariates then cannot
but explain more than the signal alone: it also models the error. Hence, it overfits the data. In high-dimensional
settings overfitting is a real threat. The number of explanatory variables exceeds the number of observations. It
is thus possible to form a linear combination of the covariates that perfectly explains the response, including the
noise.
Large estimates of regression coefficients are often an indication of overfitting. Augmentation of the estimation
procedure with a constraint on the regression coefficients is a simple remedy to large parameter estimates. As a
consequence it decreases the probability of overfitting. Overfitting is illustrated in the next example.
Example 1.7 (Overfitting)
Consider an artificial data set comprising ten observations on a response Yi and nine covariates Xi,j. All
covariate data are sampled from the standard normal distribution: Xi,j ∼ N(0, 1). The response is generated by
Yi = Xi,1 + εi with εi ∼ N(0, ¼). Hence, only the first covariate contributes to the response.
The regression model Yi = Σ_{j=1}^9 Xi,j βj + εi is fitted to the artificial data using R. This yields the regression
parameter estimates:
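The estimates themselves are not reproduced here, but the construction of such an artificial data set and the corresponding fit can be sketched as follows (the seed, and therefore the resulting estimates, is arbitrary):
# artificial data: ten observations, nine standard-normal covariates
set.seed(5)
X <- matrix(rnorm(10 * 9), nrow = 10)
Y <- X[, 1] + rnorm(10, sd = 0.5)
# fit the full regression model without intercept: nine parameters for ten observations
coef(lm(Y ~ 0 + X))
With nine parameters estimated from ten observations, the fitted coefficients typically deviate substantially from the true parameter (1, 0, . . . , 0)⊤, illustrating the overfitting discussed above.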
1.6 Degrees of freedom
The degrees of freedom of a linear estimator may be defined as:
df = Σ_{i=1}^n [Var(Yi)]−1 Cov(Ŷi, Yi). (1.11)
It represents the effective number of parameters used by the estimator. It can also be viewed as the amount of self-
explanation by the observations of the fit. Recall from ordinary regression that Ŷ = X(X⊤X)−1 X⊤ Y = H Y,
where H is the hat matrix. Application of definition (1.11) then yields the degrees of freedom used by the
maximum likelihood regression estimator, which equals tr(H), the trace of H. In particular, if X is of full rank, i.e.
rank(X) = p, then tr(H) = p.
We adopt the degrees of freedom definition (1.11) for the ridge regression estimator. We then find
df(λ) = Σ_{i=1}^n [Var(Yi)]−1 Cov(Ŷi, Yi)
= σ−2 Σ_{i=1}^n Cov[Xi,∗ β̂(λ), Yi]
= σ−2 Σ_{i=1}^n Cov[Xi,∗ (X⊤X + λIpp)−1 X⊤ Y, Yi]
= σ−2 Σ_{i=1}^n Xi,∗ (X⊤X + λIpp)−1 X⊤ Cov(Y, Yi)
= tr[X (X⊤X + λIpp)−1 X⊤]
= Σ_{j=1}^p (Dx⊤Dx)jj [(Dx⊤Dx)jj + λ]−1,
where we have used the independence among the observations. High-dimensionally, the sum on the right-hand side
of the last line of the display above may be limited to n. The degrees of freedom consumed by the ridge regression
estimator is monotone decreasing in λ. In particular, limλ→∞ df(λ) = 0. That is, in the limit no information from
X is used. Indeed, β̂(λ) is forced to equal 0p which is not derived from data. Finally, from the derivation in the
display above we may deduce a definition of the ‘ridge hat matrix’: H(λ) := X(X⊤ X + λIpp )−1 X⊤ . We can
then write, in analogy to the ordinary regression case, df(λ) = tr[H(λ)].
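The degrees of freedom are cheaply evaluated from the singular values of X; a small sketch (the design matrix is simulated purely for illustration):
# simulated high-dimensional design matrix
set.seed(6)
X <- matrix(rnorm(20 * 50), nrow = 20)
dx <- svd(X)$d
# degrees of freedom of the ridge regression estimator as a function of lambda
dfLambda <- function(lambda) { sum(dx^2 / (dx^2 + lambda)) }
sapply(c(0.1, 1, 10, 100), dfLambda)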
1.7 Computationally efficient evaluation
Hence, the reformulated ridge regression estimator involves the inversion of an (n × n)-dimensional matrix. With
n = 100 this is feasible on most standard computers.
Hastie and Tibshirani (2004) point out that, with the SVD-trick above, the number of computation operations
reduces from O(p³) to O(pn²). In addition, they point out that this computational short-cut can be used in
combination with other loss functions, for instance that of standard generalized linear models (see Chapter 5). This
computation is illustrated in Figure 1.5, which shows the substantial gain in computation time of the evaluation of
the ridge regression estimator using the efficient over the naive implementation against the dimension p. Details
of this figure are provided in Question 1.24.
Figure 1.5: Computation time of the evaluation of the ridge regression estimator using the naive and efficient
implementation against the dimension p.
The inversion of the (p × p)-dimensional matrix can be avoided in another way. Hereto one needs the Woodbury
identity. Let A, U and V be p × p-, p × n- and n × p-dimensional matrices, respectively. The (simplified form of
the) Woodbury identity then is:
(A + UV)−1 = A−1 − A−1 U (Inn + V A−1 U)−1 V A−1.
Application of the Woodbury identity to the matrix inverse in the ridge estimator of the regression parameter gives:
β̂(λ) = (X⊤X + λIpp)−1 X⊤ Y = [λ−1 Ipp − λ−2 X⊤ (Inn + λ−1 XX⊤)−1 X] X⊤ Y
= λ−1 X⊤ (Inn + λ−1 XX⊤)−1 Y. (1.12)
The inversion of the p × p-dimensional matrix X⊤ X + λIpp is thus replaced by that of the n × n-dimensional
matrix Inn + λ−1 XX⊤ . In addition, this expression of the ridge regression estimator avoids the singular value
decomposition of X, which may in some cases introduce additional numerical errors (e.g. at the level of machine
precision).
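The equivalence of the naive and the Woodbury-based expression for the ridge regression estimator is easily checked numerically (a sketch with simulated data):
# simulated high-dimensional data
set.seed(7)
n <- 10; p <- 1000; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
# naive evaluation: inversion of a p x p matrix
betaNaive <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
# Woodbury-based evaluation, cf. (1.12): inversion of an n x n matrix
betaWoodbury <- t(X) %*% solve(diag(n) + X %*% t(X) / lambda, Y) / lambda
max(abs(betaNaive - betaWoodbury))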
1.8 Penalty parameter selection
The penalty parameter needs to be chosen. Before turning to data-driven choices, we first discuss
some sanity requirements one may wish to impose on the ridge regression estimator. Those requirements do not
yield a specific choice of the penalty parameter but they specify a domain of sensible penalty parameter values.
The evaluation of the ridge regression estimator may be subject to numerical inaccuracy, which ideally is
avoided. This numerical inaccuracy results from the ill-conditionedness of X⊤ X + λIpp . A matrix is ill-
conditioned if its condition number is high. The condition number of a square positive definite matrix A is
the ratio of its largest and smallest eigenvalue. If the smallest eigenvalue is zero, the condition number is unde-
fined and so is A−1. Furthermore, a high condition number is indicative of the loss (on a log-scale) in numerical
accuracy of the evaluation of A−1. To ensure the numerically accurate evaluation of the ridge regression estimator,
the choice of the penalty parameter is thus to be restricted to a subset of the positive real numbers such that it
yields a well-conditioned matrix X⊤ X + λIpp . Clearly, the penalty parameter λ should then not be too close to
zero. There is however no consensus on the criteria on the condition number for a matrix to be well-conditioned. This
depends among others on how much numerical inaccuracy is tolerated by the context. Of course, as pointed out
in Section 1.7, inversion of the X⊤ X + λIpp matrix can be circumvented. One then still needs to ensure the well-
conditionedness of λ−1 XX⊤ + Inn , which too results in a lower bound for the penalty parameter. Practically,
following Peeters et al. (2019) (who do so in a different context), we suggest to generate a condition number plot.
It plots (on some convenient scale) the condition number of the matrix X⊤X + λIpp against the penalty parameter
λ. From this plot, we identify the domain of penalty parameter values associated with well-conditionedness.
To guide this choice, Peeters et al. (2019) overlay this plot with a curve indicative of the numerical inaccuracy.
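Such a condition number plot is straightforwardly produced; a sketch in which the design matrix is simulated and kappa() (base R, with exact = TRUE) supplies the condition number:
# simulated high-dimensional design matrix
set.seed(8)
X <- matrix(rnorm(25 * 100), nrow = 25)
lambdas <- 10^seq(-5, 3, length.out = 50)
# condition number of X^T X + lambda I against the penalty parameter
condNumbers <- sapply(lambdas, function(lambda) {
    kappa(t(X) %*% X + lambda * diag(ncol(X)), exact = TRUE) })
plot(log10(lambdas), log10(condNumbers), type = "l",
     xlab = "log10(lambda)", ylab = "log10(condition number)")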
Traditional text books on regression analysis suggest, in order to prevent over-fitting, to limit the number
of covariates of the model. This ought to leave enough degrees of freedom to estimate the error and facilitate
proper inference. While ridge regression is commonly used for prediction purposes and inference need not be the
objective, over-fitting is certainly to be avoided. This can be achieved by limiting the degrees of freedom spent
on the estimation of the regression parameter (Harrell, 2001). We thus follow Saleh et al. (2019), who illustrate
this in the ridge regression context, and use the degrees of freedom to bound the search domain of the penalty
parameter. This requires the specification of a maximum degrees of freedom, denoted by ν with 0 ≤ ν ≤ n, one
wishes to spend on the construction of the ridge regression estimator. Now choose λ such that df[β̂(λ)] ≤ ν. To
find the bound on the penalty parameter, note that
df[β̂(λ)] = tr[H(λ)] = ∑_{j=1}^{min{n,p}} (D⊤x Dx)jj [(D⊤x Dx)jj + λ]−1.
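A small R sketch (with simulated data; ν and the grid are illustrative) that evaluates these degrees of freedom from the singular values of X and determines, by a simple grid search, the smallest penalty value for which df[β̂(λ)] ≤ ν:

# simulated high-dimensional data
set.seed(1)
X <- matrix(rnorm(25 * 100), nrow=25)
dSq <- svd(X)$d^2                       # the nonzero eigenvalues of X^T X

# degrees of freedom of the ridge regression estimator as a function of lambda
dofRidge <- function(lambda){ sum(dSq / (dSq + lambda)) }

# smallest lambda on the grid that spends at most nu degrees of freedom
nu      <- 10
lambdas <- 10^seq(-5, 5, length.out=1000)
min(lambdas[sapply(lambdas, dofRidge) <= nu])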
with σ̂²(λ) = n−1∥Y − Xβ̂(λ)∥₂². The value of λ which minimizes AIC(λ) corresponds to the 'optimal' balance between model fit and model complexity.
Although information criteria are widely used to guide the choice of the penalty parameter, some comments on their use within the context of ridge regression are warranted. Information criteria guide the decision process when having to choose among various models. Different models use different sets of explanatory variables to explain the behaviour of the response variable. In that sense, the use of information criteria for deciding on the ridge penalty parameter may be considered inappropriate: ridge regression uses the same set of explanatory variables irrespective of the value of the penalty parameter. Moreover, ridge regression is often employed to predict a response and not to provide an insightful explanatory model, and the latter need not yield the best predictions. Finally, empirically we observed that the AIC may attain its optimum not inside but at the boundary of the domain of the ridge penalty parameter.
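Despite these reservations, the AIC is easily evaluated over a grid of penalty parameters. A minimal R sketch with simulated data, assuming the common form AIC(λ) = n log[σ̂²(λ)] + 2 df[β̂(λ)]:

set.seed(1)
n <- 25; p <- 100
X <- matrix(rnorm(n * p), nrow=n)
Y <- rnorm(n)

# AIC of the ridge-fitted linear model (assumed form: n*log(sigma2hat) + 2*df)
aicRidge <- function(lambda, X, Y){
    betaHat   <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
    sigma2hat <- mean((Y - X %*% betaHat)^2)
    dSq       <- svd(X)$d^2
    nrow(X) * log(sigma2hat) + 2 * sum(dSq / (dSq + lambda))
}

# evaluate over a grid of penalty parameters and locate the minimizer
lambdas <- 10^seq(-2, 6, length.out=200)
lambdas[which.min(sapply(lambdas, aicRidge, X=X, Y=Y))]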
1.8.2 Cross-validation
Instead of choosing the penalty parameter to balance model fit with model complexity, cross-validation requires
it (i.e. the penalty parameter) to yield a model with good prediction performance. Commonly, this performance
is evaluated on novel data. Novel data need not be easy to come by and one has to make do with the data at
hand. The setting of ‘original’ and novel data is then mimicked by sample splitting: the data set is divided into
two (groups of) samples. One of these two data sets, called the training set, plays the role of ‘original’ data on
which the model is built. The second of these data sets, called the test set, plays the role of the ‘novel’ data and
is used to evaluate the prediction performance (often operationalized as the loglikelihood or the prediction error)
of the model built on the training data set. This procedure (model building and prediction evaluation on training
and test set, respectively) is done for a collection of possible penalty parameter choices. The penalty parameter
that yields the model with the best prediction performance is to be preferred. The thus obtained performance
evaluation depends on the actual split of the data set. To remove this dependence, the data set is split many times
into a training and test set. Per split the model parameters are estimated for all choices of λ using the training data
and estimated parameters are evaluated on the corresponding test set. The penalty parameter, that on average over
the test sets performs best (in some sense), is then selected.
When the repetitive splitting of the data set is done randomly, samples may accidentally end up in a vast majority of the splits in either the training or the test set. Such samples may have an unbalanced influence on either model building or prediction evaluation. To avoid this, k-fold cross-validation structures the data splitting. The samples are divided
into k more or less equally sized exhaustive and mutually exclusive subsets. In turn (at each split) one of these
subsets plays the role of the test set while the union of the remaining subsets constitutes the training set. Such
a splitting warrants a balanced representation of each sample in both training and test set over the splits. Still
the division into the k subsets involves a degree of randomness. This may be fully excluded when choosing
k = n. This particular case is referred to as leave-one-out cross-validation (LOOCV). For illustration purposes
the LOOCV procedure is detailed fully below:
0) Define a range of interest for the penalty parameter.
1) Divide the data set into training and test set comprising samples {1, . . . , n} \ i and {i}, respectively.
2) Fit the linear regression model by means of ridge estimation for each λ in the grid using the training set.
This yields:
β̂−i(λ) = (X⊤−i,∗ X−i,∗ + λIpp)−1 X⊤−i,∗ Y−i,
where X−i,∗ and Y−i are the design matrix and response vector with the i-th row and element, respectively, excluded. Additionally, obtain the corresponding estimate of the error variance, σ̂²−i(λ).
3) Evaluate the prediction performance of these models on the test set by log{L[Yi, Xi,∗; β̂−i(λ), σ̂²−i(λ)]}, or by the prediction error |Yi − Xi,∗ β̂−i(λ)|, possibly squared.
4) Repeat steps 1) to 3) such that each sample plays the role of the test set once.
5) Average the prediction performances of the test sets at each grid point of the penalty parameter:
n−1 ∑_{i=1}^{n} log{L[Yi, Xi,∗; β̂−i(λ), σ̂²−i(λ)]}.
The quantity above is called the cross-validated loglikelihood. It is an estimate of the prediction performance
of the model corresponding to this value of the penalty parameter on novel data.
6) The value of the penalty parameter that maximizes the cross-validated loglikelihood is the value of choice.
The procedure is straightforwardly adapted to k-fold cross-validation, a different performance criterion, and different estimators.
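A naive but transparent R sketch of the LOOCV procedure above, with simulated data and the squared prediction error as performance measure (the grid and all object names are illustrative):

set.seed(1)
n <- 25; p <- 50
X <- matrix(rnorm(n * p), nrow=n)
Y <- rnorm(n)

# ridge regression estimator
ridge <- function(X, Y, lambda){
    solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
}

# leave-one-out cross-validated squared prediction error for a single lambda
loocvError <- function(lambda, X, Y){
    mean(sapply(1:nrow(X), function(i){
        betaHat <- ridge(X[-i, , drop=FALSE], Y[-i], lambda)
        (Y[i] - X[i, , drop=FALSE] %*% betaHat)^2
    }))
}

# evaluate over a grid of penalty parameters and select the minimizer
lambdas <- 10^seq(-2, 4, length.out=50)
lambdas[which.min(sapply(lambdas, loocvError, X=X, Y=Y))]

The k-fold variant only changes the index sets that are left out per iteration.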
In the LOOCV procedure above resampling can be avoided when the predictive performance is measured by
Allen’s PRESS (Predicted Residual Error Sum of Squares) statistic (Allen, 1974). For then, the LOOCV predictive
performance can be expressed analytically in terms of the known quantities derived from the design matrix and
response (as pointed out but not detailed in Golub et al. 1979). Define the optimal penalty parameter to minimize
Allen’s PRESS statistic:
λopt = arg min_{λ>0} n−1 ∑_{i=1}^{n} [Yi − Xi,∗ β̂−i(λ)]² = arg min_{λ>0} n−1 ∥B(λ)[Inn − H(λ)]Y∥₂²,
where B(λ) is diagonal with [B(λ)]ii = [1−Hii (λ)]−1 . The second equality in the preceding display is elaborated
in Exercise 1.25. Its main takeaway is that the predictive performance for a given λ can be assessed directly from
the ridge hat matrix and the response vector without the recalculation of the n leave-one-out ridge regression
estimators. Computationally, this is a considerable gain.
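A minimal R sketch of this shortcut, evaluating Allen's PRESS statistic from the ridge hat matrix and the response only; for a given λ it should return the same value as the explicit leave-one-out loop sketched above:

# Allen's PRESS statistic evaluated without refitting the n leave-one-out estimators
pressRidge <- function(lambda, X, Y){
    H     <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))
    resid <- Y - H %*% Y
    mean((resid / (1 - diag(H)))^2)
}

set.seed(1)
X <- matrix(rnorm(25 * 50), nrow=25)
Y <- rnorm(25)
pressRidge(1, X, Y)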
No such analytic expression of the cross-validated loss as above exists for general K-fold cross-validation,
but considerable computational gain can nonetheless be achieved (van de Wiel et al., 2021). This exploits the fact
that the ridge regression estimator appears in Allen’s PRESS statistics – and the likelihood – only in combination
with the design matrix, together forming the linear predictor. There is thus no need to evaluate the estimator itself
when interest is only in its predictive performance. Then, if G1 , . . . , GK ⊂ {1, . . . , n} are the mutually exclusive
and exhaustive K-fold sample index sets, the linear predictor for the k-th fold can be expressed as:
where we have used the Woodbury identity again. Finally, for each fold the computationally most demanding
matrices of this expression, XGk,∗ X⊤−Gk,∗ and X−Gk,∗ X⊤−Gk,∗, are both submatrices of XX⊤. If the latter matrix is evaluated prior to the cross-validation loop, all calculations inside the loop involve only matrices of dimension at most n × n, obtained from XX⊤ by subsetting.
The need for this approximation is pointed out by Golub et al. (1979) by example through an ‘extreme’ case where
the minimization of Allen’s PRESS statistic fails to produce a well-defined choice of the penalty parameter λ.
This ‘extreme’ case requires a (unit) diagonal design matrix X. Straightforward (linear) algebraic manipulations
of Allen’s PRESS statistic then yield:
n−1 ∑_{i=1}^{n} [1 − Hii(λ)]−2 [Yi − X⊤i,∗ β̂(λ)]² = n−1 ∑_{i=1}^{n} Yi²,
which indeed has no unique minimizer in λ. Additionally, the GCV(λ) criterion may in some cases be preferred
computationally when it is easier to evaluate tr[H(λ)] (e.g. from the singular values) than the individual diagonal
elements of H(λ).
1.8.4 Randomness
The discussed procedures for penalty parameter selection all depend on the data at hand. As a result, so does the
selected penalty parameter. It should therefore be considered a statistic, a quantity calculated from data. In case
of K-fold cross-validation, the random formation of the splits adds another layer of randomness. Irrespectively, a
statistic exhibits randomness. This randomness propagates into the ridge regression estimator. The distributional
properties of the ridge regression estimator derived in Section 1.4 are thus conditional on the penalty parameter.
Those of the unconditional distribution of the ridge regression estimator, now denoted β̂(λ̂) with λ̂ indicating
Figure 1.6: Left panel: histogram of the estimates of an element of β̂(λ̂) obtained from a thousand bootstrap samples with re-tuned penalty parameters. Right panel: qq-plot of the estimates of an element of β̂(λ̂) obtained from a thousand bootstrap samples with re-tuned vs. fixed penalty parameters.
that the penalty depends on the data (and possibly the particulars of the splits), may be rather different. Analytic
finite sample results appear to be unavailable. An impression of the distribution of β̂(λ̂) can be obtained through
simulation. Hereto we have first drawn a data set from the linear regression model with dimension p = 100, sample size n = 25, standard normally distributed errors, the rows of the design matrix sampled from a zero-centered multivariate normal distribution with an equicorrelated covariance matrix with equal variances, and a regression parameter with elements equidistantly distributed over the interval [−2½, 2½]. From this data set, we then generate a thousand nonparametric bootstrapped data sets. For each bootstrapped data set, we select the penalty
parameter by means of K = 10-fold cross-validation and evaluate the ridge regression estimator using the selected
penalty parameter. The left panel of Figure 1.6 shows the histogram of the thus acquired estimates of an arbitrary
element of the regression parameter. The shape of the distribution clearly deviates from the normal one of the
conditional ridge regression estimator. In the right panel of Figure 1.6, we have plotted element-wise quantiles of
the conditional vs. unconditional ridge regression estimators. It reveals that not only the shape of the distribution,
but also its moments are affected by the randomness of the penalty parameter.
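A minimal R sketch of such a simulation (simplified relative to the text: the covariates are drawn independently, fewer bootstrap samples are used, and the penalty is re-tuned per bootstrap sample by minimizing Allen's PRESS statistic rather than by 10-fold cross-validation):

set.seed(1)
n <- 25; p <- 100
X    <- matrix(rnorm(n * p), nrow=n)
beta <- seq(-2.5, 2.5, length.out=p)
Y    <- X %*% beta + rnorm(n)

ridge <- function(X, Y, lambda){
    solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
}
pressRidge <- function(lambda, X, Y){
    H <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))
    mean(((Y - H %*% Y) / (1 - diag(H)))^2)
}

# nonparametric bootstrap with a re-tuned penalty parameter per bootstrap sample
lambdas   <- 10^seq(0, 5, length.out=25)
estimates <- replicate(200, {
    id      <- sample(1:n, replace=TRUE)
    Xb      <- X[id, ]; Yb <- Y[id]
    lambOpt <- lambdas[which.min(sapply(lambdas, pressRidge, X=Xb, Y=Yb))]
    ridge(Xb, Yb, lambOpt)[1]        # track a single element of the estimator
})
hist(estimates, n=25)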
1.9 Simulations
Simulations are presented that illustrate properties of the ridge regression estimator not discussed explicitly in the
previous sections of this chapter.
All regularization paths start close to one and vanish as λ → ∞. However, the ridge regularization paths of regression coefficients corresponding to covariates with a large variance dominate those with a low variance.
Figure 1.7: Top panel: Ridge regularization paths for the coefficients of the 50 uncorrelated covariates with distinct variances. Color and line type indicate the grouping of the covariates by their variance. Bottom panels: Graphical illustration of the effect of a covariate's variance on the ridge regression estimator. The grey circle depicts the ridge parameter constraint. The dashed black ellipsoids are the level sets of the least squares loss function. The red dot is the ridge regression estimate. Left and right panels represent the cases with equal and unequal variances of the covariates, respectively.
Ridge regression’s preference of covariates with a large variance can intuitively be understood as follows. First
note that the ridge regression estimator now can be written as:
Plug in the employed parametrization of Σ, which gives: [β(λ)]j = j(j + 50λ)−1 (β)j. Hence, the larger the covariate's variance (corresponding to a larger j), the larger its ridge regression coefficient estimate. Ridge regression thus prefers, among a set of covariates with comparable effect sizes, those with larger variances.
The reformulation of ridge penalized estimation as a constrained estimation problem offers a geometrical
interpretation of this phenomenon. Let p = 2 and the design matrix X be orthogonal, while both covariates
contribute equally to the response. Contrast the cases with Var(X1 ) ≈ Var(X2 ) and Var(X1 ) ≫ Var(X2 ). The
level sets of the least squares loss function associated with the former case are circular, while that of the latter
are strongly ellipsoidal (see Figure 1.7). The diameters along the principal axes (that – due to the orthogonality
of X – are parallel to that of the β1 - and β2 -axes) of both circle and ellipsoid are reciprocals of the variance of
the covariates. When the variances of both covariates are equal, the level sets of the loss function expand equally
fast along both axes. With the two covariates having the same regression coefficient, the point of these level sets
closest to the parameter constraint is to be found on the line β1 = β2 (Figure 1.7, left panel). Consequently, the
ridge regression estimator satisfies β̂1 (λ) ≈ β̂2 (λ). With unequal variances between the covariates, the ellipsoidal
level sets of the loss function have diameters of rather different sizes. In particular, along the β1 -axis it is narrow
(as Var(X1) is large), and – vice versa – wide along the β2-axis. Consequently, the point of these level sets closest to the circular parameter constraint will be closer to the β1- than to the β2-axis (Figure 1.7, right panel). For the ridge estimator of the regression parameter this implies 0 ≪ β̂1(λ) < 1 and 0 < β̂2(λ) ≪ 1. Hence, the covariate with the larger variance yields the larger ridge regression estimate.
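This preference is easily reproduced numerically. A minimal R sketch with two uncorrelated covariates that have equal regression coefficients but unequal variances (all numbers illustrative):

set.seed(1)
n  <- 1000
X1 <- rnorm(n, sd=3)                   # large variance
X2 <- rnorm(n, sd=0.5)                 # small variance
X  <- cbind(X1, X2)
Y  <- X1 + X2 + rnorm(n)               # equal effect sizes

# ridge estimates: the coefficient of the high-variance covariate is shrunken least
lambda <- 1000
solve(crossprod(X) + lambda * diag(2), crossprod(X, Y))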
Should one thus standardize the covariates prior to ridge regression analysis? When dealing with gene expres-
sion data from microarrays, the data have been subjected to a series of pre-processing steps (e.g. quality control,
background correction, within- and between-normalization). The purpose of these steps is to make the expression
levels of genes comparable both within and between hybridizations. The preprocessing should thus be considered
an inherent part of the measurement. As such, it is to be done independently of whatever downstream analysis is to follow, and further tinkering with the data is preferably avoided (as it may compromise the comparability of the expression levels achieved by the preprocessing). For other data types different considerations may apply.
Among the considerations to decide on standardization of the covariates, one should also include the fact that
ridge regression estimates prior and posterior to scaling do not simply differ by a factor. To see this assume that
the covariates have been centered. Scaling of the covariates amounts to post-multiplication of the design matrix
by a p × p-dimensional diagonal matrix A with the reciprocals of the covariates’ scale estimates on its diagonal
(Sardy, 2008). Hence, the ridge regression estimator (for the rescaled data) is then given by:
Effectively, the scaling is equivalent to covariate-wise penalization (see Chapter 3 for more on this). The ‘scaled’
ridge regression estimator may then be derived along the same lines as before in Section 1.5:
In general, this is unequal to the ridge regression estimator without the rescaling of the columns of the design
matrix. Moreover, it should be clear that β̂ (scaled) (λ) ̸= Aβ̂(λ).
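A small R sketch illustrating this point with simulated data (the scale estimates and the value of λ are illustrative choices): the ridge estimate obtained after rescaling the columns of the design matrix is not simply A times the estimate obtained without rescaling.

set.seed(1)
n <- 50
X <- cbind(rnorm(n, sd=1), rnorm(n, sd=4))
X <- sweep(X, 2, colMeans(X))                  # column-wise zero centering
Y <- X %*% c(1, 1) + rnorm(n)

lambda   <- 10
A        <- diag(1 / apply(X, 2, sd))          # rescaling matrix
betaHat  <- solve(crossprod(X) + lambda * diag(2), crossprod(X, Y))
betaScld <- solve(crossprod(X %*% A) + lambda * diag(2), crossprod(X %*% A, Y))

# compare the 'scaled' ridge estimate with A times the 'unscaled' one
cbind(betaScld, A %*% betaHat)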
with
Σkk = (1/5)(k − 1) 110×10 + (1/5)(6 − k) I10×10.
The data of the response variable Y are then obtained through Y = Xβ + ε, with ε ∼ N(0n, Inn) and β = 1_50, the 50-dimensional all-ones vector. Hence, all covariates contribute equally to the response. Were the columns of X orthogonal, little difference in the ridge estimates of the regression coefficients would be expected.
The results of this simulation study with sample size n = 1000 are presented in Figure 1.8. All 50 regular-
ization paths start close to one as λ is small and converge to zero as λ → ∞. But the paths of covariates of the
same block of the covariance matrix Σ quickly group, with those corresponding to a block with larger off-diagonal
elements above those with smaller ones. Thus, ridge regression prefers (i.e. shrinks less) coefficient estimates of
strongly positively correlated covariates.
Figure 1.8: Left panel: Ridge regularization paths for the coefficients of the 50 covariates, with various degrees of collinearity but equal variance. Color and line type correspond to the five blocks of the covariance matrix Σ. Right panel: Graphical illustration of the effect of the collinearity among covariates on the ridge regression estimator. The solid and dotted grey circles depict the ridge parameter constraint for the collinear and orthogonal cases, respectively. The dashed black ellipsoids are the level sets of the sum-of-squares loss function. The red dot and violet diamond are the ridge regression estimates for the positively collinear and orthogonal case, respectively.
Intuitive understanding of the observed behaviour may be obtained from the p = 2 case. Let U , V and ε be
independent random variables with zero mean. Define X1 = U + V , X2 = U − V , and Y = β1 X1 + β2 X2 + ε
with β1 and β2 constants. Hence, E(Y ) = 0. Then:
Y = (β1 + β2 )U + (β1 − β2 )V + ε = γu U + γv V + ε
and Cor(X1 , X2 ) = [Var(U ) − Var(V )]/[Var(U ) + Var(V )]. The random variables X1 and X2 are strongly
positively correlated if Var(U ) ≫ Var(V ). The ridge regression estimator associated with regression of Y on U
and V is:
γ(λ) = diag[Var(U) + λ, Var(V) + λ]−1 (Cov(U, Y), Cov(V, Y))⊤.
For large enough λ,
γ(λ) ≈ λ−1 diag[Var(U), Var(V)] (β1 + β2, β1 − β2)⊤.
If Var(U) ≫ Var(V) and β1 ≈ β2, the ridge estimate of γv vanishes (relative to that of γu) for large λ. Hence, ridge regression prefers positively correlated covariates with similar effect sizes.
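A numerical illustration (a sketch with illustrative numbers, not taken from the simulation of the text) comparing the ridge estimates of a strongly positively correlated pair of covariates with those of a (nearly) orthogonal pair with the same variances and effect sizes:

set.seed(1)
n <- 1000
U <- rnorm(n, sd=2); V <- rnorm(n, sd=2); W <- rnorm(n, sd=2)
Xcor  <- cbind(U + 0.1 * V, U - 0.1 * V)   # strongly positively correlated pair
Xorth <- cbind(U, W)                        # (nearly) uncorrelated pair
Ycor  <- Xcor  %*% c(1, 1) + rnorm(n)
Yorth <- Xorth %*% c(1, 1) + rnorm(n)

# ridge estimates: the correlated pair is shrunken less than the orthogonal one
lambda <- 2000
cbind(solve(crossprod(Xcor)  + lambda * diag(2), crossprod(Xcor,  Ycor)),
      solve(crossprod(Xorth) + lambda * diag(2), crossprod(Xorth, Yorth)))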
This phenomenon can also be explained geometrically. For the illustration consider ridge estimation with
λ = 1 of the linear model Y = Xβ + ε with β = (3, 3)⊤ , ε ∼ N (02 , I22 ) and the columns of X strongly and
positively collinear. The level sets of the sum-of-squares loss, ∥Y − Xβ∥22 , are plotted in the right panel of Figure
1.8. Recall that the ridge regression estimate is found by looking for the smallest loss level set that hits the ridge
constraint. The sought-for estimate is then the point of intersection between this level set and the constraint, and – for the case at hand – lies on the x = y-line. This is no different from the case with orthogonal columns of X. Yet their estimates differ, even though the same λ is applied. The difference is due to the fact that the radius of the ridge constraint depends on λ, X and Y. This is immediate from the fact that the radius of the constraint equals
∥β̂(λ)∥22 (see Section 1.5). To study the effect of X on the radius, we remove its dependence on Y by considering
its expectation, which is:
E[∥β̂(λ)∥22 ] = E{[(X⊤ X + λIpp )−1 (X⊤ X) β̂]⊤ (X⊤ X + λIpp )−1 (X⊤ X) β̂}
= E[Y⊤ X(X⊤ X + λIpp )−2 X⊤ Y]
= σ² tr[X(X⊤X + λIpp)−2 X⊤] + β⊤X⊤X(X⊤X + λIpp)−2 X⊤Xβ.
In the last step we have used Y ∼ N(Xβ, σ²Inn) and the fact that the expectation of a quadratic form of a multivariate random variable ε ∼ N(µε, Σε) is E(ε⊤Λε) = tr(ΛΣε) + µε⊤Λµε (cf. Mathai and Provost, 1992). The
expression for the expectation of the radius of the ridge constraint can now be evaluated for the orthogonal X and
the strongly, positively collinear X. It turns out that the latter is larger than the former. This results in a larger ridge
constraint. For the larger ridge constraint there is a smaller level set that hits it first. The point of intersection, still
on the x = y-line, is now thus closer to β and further from the origin (cf. right panel of Figure 1.8). The resulting
estimate is thus larger than that from the orthogonal case.
The above needs some qualification. Among others, it depends on: i) the number of covariates in each block, ii) the size of the effects, i.e. the regression coefficients of the covariates, and iii) the degree of collinearity. Possibly, there are more factors influencing the behaviour of the ridge regression estimator presented in this subsection.
This behaviour of ridge regression should be kept in mind when a certain (say) clinical outcome is to be predicted from (say) gene expression data. Genes work in concert to fulfil a certain function in the cell. Consequently,
one expects their expression levels to be correlated. Indeed, gene expression studies exhibit many co-expressed
genes, that is, genes with correlating transcript levels. But also in most other applications with many explanatory
variables collinearity will be omnipresent and similar issues are to be considered.
in which it is assumed that the Xi,j's are random and – using the column-wise zero centering of X – that n−1X⊤X is an estimator of their covariance matrix. Moreover, the identity used to arrive at the second line of the display, [(X⊤X)−1]jj = [Var(Xi,j | Xi,1, . . . , Xi,j−1, Xi,j+1, . . . , Xi,p)]−1, originates from Corollary 5.8.1 of Whittaker (1990). Thus, Var(β̂j) factorizes into σ²[Var(Xi,j)]−1 and the variance inflation factor VIF(β̂j).
When the j-th covariate is orthogonal to the other, i.e. there is no collinearity, then the VIF’s denominator,
Var(Xi,j | Xi,1 , . . . , Xi,j−1 , Xi,j+1 , . . . , Xi,p ), equals Var(Xi,j ). Consequently, VIF(β̂j ) = 1. When there is
collinearity among the covariates Var(Xi,j | Xi,1 , . . . , Xi,j−1 , Xi,j+1 , . . . , Xi,p ) < Var(Xi,j ) and VIF(β̂j ) > 1.
The VIF then inflates the variance of the estimator of βj under orthogonality – hence, the name – by a factor
attributable to the collinearity.
The definition of the VIF needs modification to apply to the ridge regression estimator. In Marquardt (1970)
the ‘ridge VIF’ is defined analogously to the above definition of the VIF of the maximum likelihood regression
estimator as:
Var[β̂j(λ)] = σ² [(X⊤X + λIpp)−1 X⊤X (X⊤X + λIpp)−1]jj
= n−1σ² [Var(Xi,j)]−1 · nVar(Xi,j) [(X⊤X + λIpp)−1 X⊤X (X⊤X + λIpp)−1]jj
:= n−1σ² [Var(Xi,j)]−1 VIF[β̂j(λ)],
where the factorization is forced in line with that of the ‘maximum likelihood VIF’ but lacks a similar interpre-
tation. When X is orthogonal, VIF[β̂j (λ)] = [Var(Xi,j )]2 [Var(Xi,j ) + λ]−2 < 1 for λ > 0. Penalization then
deflates the VIF.
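For completeness, a small R sketch of this 'ridge VIF', assuming a column-centered design matrix and using the sample variance as estimate of Var(Xi,j) (the data and penalty values are illustrative):

# 'Marquardt-style' ridge VIF of the j-th covariate
ridgeVIF <- function(X, lambda, j){
    W <- solve(crossprod(X) + lambda * diag(ncol(X)))
    M <- W %*% crossprod(X) %*% W
    nrow(X) * var(X[, j]) * M[j, j]
}

# illustration: the VIF deflates as the penalty parameter increases
set.seed(1)
X <- matrix(rnorm(200), ncol=2)
X[, 2] <- X[, 1] + 0.3 * X[, 2]            # introduce collinearity
X <- sweep(X, 2, colMeans(X))              # column-wise zero centering
sapply(c(0.01, 1, 10, 100), function(lambda){ ridgeVIF(X, lambda, j=1) })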
An alternative definition of the ‘ridge VIF’ presented by Garcı́a et al. (2015) for the ‘p = 2’-case, which they
motivate from counterintuitive behaviour observed in the ‘ridge VIF’ defined by Marquardt (1970), adopts the
Figure 1.9: Contour plots of ‘Marquardt VIFs’ and ‘Garcia VIFs’, left and right columns, respectively. The top
panels show these VIFs against the degree of penalization (x-axis) and collinearity (y-axis) for a fixed sample size
(n = 100) and dimension (p = 50). The bottom panels show these VIFs against the degree of penalization (x-axis)
and the sample size (y-axis) for a fixed dimension (p = 50).
‘maximum likelihood VIF’ definition but derives the ridge regression estimator from augmented data to comply
with the maximum likelihood approach. This requires augmenting the response vector with p zeros, i.e. Yaug = (Y⊤, 0⊤p)⊤, and the design matrix with p rows as Xaug = (X⊤, √λ Ipp)⊤. The ridge regression estimator can then be written as β̂(λ) = (X⊤aug Xaug)−1 X⊤aug Yaug (see Exercise 1.6). This reformulation of the ridge regression estimator in the form of its maximum likelihood counterpart suggests the adoption of the latter's VIF definition for the former.
However, in the ‘maximum likelihood VIF’ the design matrix is assumed to be zero centered column-wise. Within
the augmented data formulation this may be achieved by the inclusion of a column of ones in Xaug representing
the intercept. The inclusion of an intercept, however, requires a modification of the estimators of Var(Xi,j ) and
Var(Xi,j | Xi,1 , . . . , Xi,j−1 , Xi,j+1 , . . . , Xi,p ). The former is readily obtained, while the latter is given by the
reciprocals of the 2nd to the (p + 1)-th diagonal elements of the inverse of:
( 1n  X ; 1p  √λ Ipp )⊤ ( 1n  X ; 1p  √λ Ipp ) = ( n + p   √λ 1⊤p ; √λ 1p   X⊤X + λIpp ).
The lower right block of this inverse is obtained using two well-known linear algebra results: i) the analytic expression of the inverse of a 2 × 2-partitioned block matrix, and ii) the inverse of the sum of an invertible and a rank one matrix (given by the Sherman-Morrison formula, see Corollary 18.2.10, Harville, 2008). It equals:
1.10 Illustration
The application of ridge regression to actual data aims to illustrate its use in practice.
# load the package providing access to GEO
library(GEOquery)

# extract data
slh <- getGEO("GSE20161", GSEMatrix=TRUE)
GEdata <- slh[[1]]       # gene expression data
MIRdata <- slh[[2]]      # microRNA expression data
Figure 1.10: Left panel: Observed vs. (ridge) fitted MCM7 expression values. Right panel: Histogram of the ridge
regression coefficient estimates.
The overall aim of this illustration was to assess whether microRNA-regulation of MCM7 could also be observed in this prostate cancer data set. In this endeavour the dogma (stating this regulation should be negative)
has nowhere been used. A first simple assessment of the validity of this dogma studies the signs of the estimated
regression coefficients. The ridge regression estimate has 394 out of the 735 microRNA probes with a negative
coefficient. Hence, a small majority has a sign in line with the ‘microRNA ↓ mRNA’ dogma. When, in addition,
taking the size of these coefficients into account (Figure 1.10, right panel), the negative regression coefficient
estimates do not substantially differ from their positive counterparts (as can be witnessed from their almost sym-
metrical distribution around zero). Hence, the value of the ‘microRNA ↓ mRNA’ dogma is not confirmed by this
ridge regression analysis of the MCM7-regulation by microRNAs. Nor is it refuted.
The implementation of ridge regression in the penalized-package offers the possibility to fully obey the
dogma on negative regulation of mRNA expression by microRNAs. This requires all regression coefficients to be
negative. Incorporation of the requirement into the ridge estimation augments the constrained estimation problem
with an additional constraint:
With the additional non-positivity constraint on the parameters, there is no explicit solution for the estimator.
The ridge estimate of the regression parameters is then found by numerical optimization using e.g. the Newton-
Raphson algorithm or a gradient descent approach. The next listing gives R-code for ridge estimation of the linear regression model under the non-positivity constraint.
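A minimal sketch of how such a fit could be obtained with the penalized-package (the response Y, the microRNA expression matrix X, and the value of the ridge penalty are assumed to have been constructed beforehand; the sign-flip via the positive argument is our device for imposing non-positivity, not necessarily the original implementation):

# ridge estimation with all regression coefficients restricted to be non-positive
library(penalized)
fitNonpos <- penalized(Y, penalized = -X, lambda1 = 0, lambda2 = 1000,
                       positive = TRUE)

# coefficient estimates on the original (non-positive) scale
betaNonpos <- -coefficients(fitNonpos, "penalized")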
The linear regression model linking MCM7 expression to that of the microRNAs is fitted by ridge regression
while simultaneously obeying the ‘negative regulation of mRNA by microRNA’-dogma to the prostate cancer
data. In the resulting model 401 out of 735 microRNA probes have a nonzero (and negative) coefficient. There is
a large overlap in microRNAs with a negative coefficient between those from this and the previous fit. The models
are also compared in terms of their fit to the data. The Spearman rank correlation coefficient between response and
predictor for the model without positive regression coefficients equals 0.679 and its coefficient of determination 0.524 (confer the left panel of Figure 1.11 for a visualization). This is a slight improvement upon the unconstrained ridge
estimated model. The improvement may be small but it should be kept in mind that the number of parameters used
by both models is 401 (for the model without positive regression coefficients) vs. 735. Hence, with close to half
the number of parameters the dogma-obeying model gives a somewhat better description of the data. This may
suggest that there is some value in the dogma as inclusion of this prior information leads to a more parsimonious
model without any loss in fit.
The dogma-obeying model selects 401 microRNAs that aid in the explanation of the variation in the gene
expression levels of MCM7. There is an active field of research, called target prediction, trying to identify which
microRNAs target the mRNA of which genes. Within R there is a collection of packages that provide the target
prediction of known microRNAs. The packages differ on the method (e.g. experimental or sequence comparison)
that has been used to arrive at the prediction. These target predictions may be used to evaluate the value of the
found 401 microRNAs. Ideally, there would be a substantial amount of overlap. The R-script that loads the target
predictions and does the comparison is below.
Figure 1.11: Left panel: Observed vs. (ridge) fitted MCM7 expression values (with the non-positive constraint
on the parameters in place). Right panel: Histogram of the ridge regression coefficient estimates (from the non-
positivity constrained analysis).
With knowledge available on each microRNA whether it is predicted (by at least one target prediction pack-
age) to be a potential target of MCM7, it may be cross-tabulated against its corresponding regression coefficient
estimate in the dogma-obeying model being equal to zero or not. Table 1.1 contains the result. Somewhat super-
fluous considering the data, we may test whether the targets of MCM7 are overrepresented in the group of strictly
negatively estimated regression coefficients. The corresponding chi-squared test (with Yates’ continuity correc-
tion) yields the test statistic χ² = 0.0478 with a p-value equal to 0.827. Hence, there is no enrichment among the 401 microRNAs of those that have been predicted to target MCM7. This may seem worrisome. However, the
Table 1.1: Cross-tabulation of the microRNAs being a potential target of MCM7 vs. the value of its regression
coefficient in the dogma-obeying model.
microRNAs have been selected for their predictive power of the expression levels of MCM7. Variable selection
has not been a criterion (although the sign constraint implies selection). Moreover, criticism on the value of the
microRNA target prediction has been accumulating in recent years (REF).
1.11 Conclusion
We discussed ridge regression as a modification of linear regression to overcome the empirical non-identifiability
of the latter when confronted with high-dimensional data. The means to this end was the addition of a (ridge)
penalty to the sum-of-squares loss function of the linear regression model, which turned out to be equivalent to
constraining the parameter domain. This warranted the identification of the regression coefficients, but came at
the cost of introducing bias in the estimates. Several properties of ridge regression like moments and its MSE have
been reviewed. Finally, its behaviour and use have been illustrated in simulations and on omics data.
1.12 Exercises
Question 1.1
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0n , σε2 Inn ). This model (without intercept) is
fitted to data using the ridge regression estimator β̂(λ) = arg minβ ∥Y − Xβ∥22 + λ∥β∥22 with λ > 0. The data
are:
X⊤ = (−1, 1, 1, −1) and Y⊤ = (−1.5, 2.9, −3.5, 0.7).
a) Evaluate the maximum likelihood/ordinary least squares estimator of the regression parameter, i.e. β̂(λ) for
λ = 0.
b) Evaluate the ridge regression estimator for λ = 1.
c) Verify that the ridge regression estimator β̂(λ) shrinks as λ increases. Hereto combine your answer to parts a) and b) with the evaluation of the ridge regression estimator for λ = 10 and λ = 1000. Is the order of the employed choices of λ and the absolute value of the corresponding estimates concordant?
Question 1.2 †
Consider the simple linear regression model Yi = β0 + β1 Xi + εi with εi ∼ N (0, σ 2 ). The data on the covariate
and response are: X⊤ = (X1 , X2 , . . . , X8 )⊤ = (−2, −1, −1, −1, 0, 1, 2, 2)⊤ and Y⊤ = (Y1 , Y2 , . . . , Y8 )⊤ =
(35, 40, 36, 38, 40, 43, 45, 43)⊤ , with corresponding elements in the same order.
a) Find the ridge regression estimator for the data above for a general value of λ.
b) Evaluate the fit, i.e. Ŷi(λ), for λ = 10. Would you judge the fit as good? If not, what is the most striking feature that you find unsatisfactory?
c) Now zero center the covariate and response data, denote these by X̃i and Ỹi, and evaluate the ridge regression estimator of Ỹi = β1X̃i + εi at λ = 4. Verify that in terms of the original data the resulting predictor now is: Ŷi(λ) = 40 + 1.75Xi.
Note that the employed estimate in the predictor found in part c) is effectively a combination of a maximum
likelihood and ridge regression one for intercept and slope, respectively. Put differently, only the slope has been
shrunken.
Question 1.3
Consider the simple linear regression model Yi = β0 + Xi β + εi for i = 1, . . . , n and with εi ∼i.i.d. N (0, σ 2 ).
The model comprises a single covariate and an intercept. Response and covariate data are: {(yi , xi )}4i=1 =
{(1.4, 0.0), (1.4, −2.0), (0.8, 0.0), (0.4, 2.0)}. Find the value of λ that yields the ridge regression estimate (with
an unregularized/unpenalized intercept as is done in part c) of Question 1.2) equal to (1, −1/8)⊤.
Question 1.4
Plot the regularization path of the ridge regression estimator over the range λ ∈ (0, 20,000] using the data of Example 1.2.
Question 1.5
Consider the ridge regression estimator β̂(λ) = (X⊤ X + λIpp )−1 X⊤ Y. Show that, for λ large enough,
sign[β̂(λ)] = sign(X⊤Y). Hint: If A is a nonsingular matrix and the largest (in an absolute sense) singular value of BA−1 is smaller than one, then (A + B)−1 = A−1 + A−1 ∑_{k=1}^{∞} (−1)^k (BA−1)^k.
Question 1.6 ‡
Show that the ridge regression estimator can be obtained by ordinary least squares regression on an augmented data set. Hereto augment the matrix X with p additional rows √λ Ipp, and augment the response vector Y with p zeros.
Question 1.7
Recall the definitions of β̂MLS , Wλ and β̂(λ) from Section 1.2. Show that, unlike the linear relations between the
ridge and maximum likelihood estimators if X⊤ X is of full rank, β̂(λ) ̸= Wλ β̂MLS high-dimensionally.
Question 1.8
The coefficients β of a linear regression model, Y = Xβ + ε, are estimated by β̂ = (X⊤X)−1X⊤Y. The associated fitted values are then given by Ŷ = Xβ̂ = X(X⊤X)−1X⊤Y = HY, where H = X(X⊤X)−1X⊤ is referred to as the hat matrix. The hat matrix H is a projection matrix as it satisfies H = H². Hence, linear regression projects the response Y onto the vector space spanned by the columns of X. Consequently, the residuals
ε̂ and Ŷ are orthogonal. Now consider the ridge estimator of the regression coefficients: β̂(λ) = (X⊤ X +
λIpp )−1 X⊤ Y. Let Ŷ(λ) = Xβ̂(λ) be the vector of associated fitted values.
a) Show that the ridge hat matrix H(λ) = X(X⊤ X + λIpp )−1 X⊤ , associated with ridge regression, is not a
projection matrix (for any λ > 0), i.e. H(λ) ̸= [H(λ)]2 .
b) Show that for any λ > 0 the 'ridge fit' Ŷ(λ) is not orthogonal to the associated 'ridge residuals' ε̂(λ), defined as ε̂(λ) = Y − Xβ̂(λ).
Question 1.9
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi ∼i.i.d N (0, σ 2 ).
Suppose the parameter β is estimated by the ridge regression estimator β̂(λ) = (X⊤ X + λIpp )−1 X⊤ Y.
a) The vector of 'ridge residuals', defined as ε(λ) = Y − Xβ̂(λ), is normally distributed. Why?
b) Show that E[ε(λ)] = [Inn − X(X⊤ X + λIpp )−1 X⊤ ]Xβ.
c) Show that Var[ε(λ)] = σ 2 [Inn − X(X⊤ X + λIpp )−1 X⊤ ]2 .
d) Could the normal probability plot, i.e. a qq-plot with the quantiles of standard normal distribution plotted
against those of the ridge residuals, be used to assess the normality of the latter? Motivate.
Question 1.10
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0n , σ 2 Inn ). This model is fitted to data,
X1,∗ = (4, −2) and Y1 = (10), using the ridge regression estimator β̂(λ) = (X⊤1,∗ X1,∗ + λI22)−1 X⊤1,∗ Y1. Throughout use λ = 5.
a) Evaluate the ridge regression estimator.
b) Suppose β = (1, −1)⊤ . Evaluate the bias of the ridge regression estimator.
c) Decompose the bias into a component due to the regularization and one attributable to the high-dimensionality
of the study.
d) Had β equalled (2, −1)⊤, the component of the bias due to the high-dimensionality would have vanished. Explain why.
‡ This exercise is freely rendered from Hastie et al. (2009), but can be found in many other places. The original source is unknown to the
author.
c) Part b) reveals that, for small values of λ, the estimates fall outside the line found in part a). Using the
theory outlined in Section 1.4.1, the estimates can be decomposed into a part that falls on this line and a part
that is orthogonal to it. The latter is given by (I22 − Px )β̂(λ) where Px is the projection matrix onto the
space spanned by the columns of X. Evaluate the projection matrix Px .
d) Numerical inaccuracy, resulting from the ill-conditionedness of X⊤ X+λI22 , causes (I22 −Px )β̂(λ) ̸= 02 .
Verify that the observed non-null (I22 − Px )β̂(λ) are indeed due to numerical inaccuracy. Hereto generate
a log-log plot of the condition number of X⊤ X + λI22 vs. the ∥(I22 − Px )β̂(λ)∥2 for the provided set of
λ’s.
Question 1.12
Provide an alternative proof of Theorem 1.2 that states the existence of a positive value of the penalty parameter for
which the ridge regression estimator has a superior MSE compared to that of its maximum likelihood counterpart.
a) Show that the derivative of the MSE with respect to the penalty parameter is negative at zero. In this, use the following results from matrix calculus:
d/dλ tr[A(λ)] = tr[ d/dλ A(λ) ],    d/dλ (A + λB)−1 = −(A + λB)−1 B (A + λB)−1,
and the product rule
d/dλ [A(λ) B(λ)] = [ d/dλ A(λ) ] B(λ) + A(λ) [ d/dλ B(λ) ],
where A(λ) and B(λ) are square, symmetric matrices parameterized by the scalar λ.
b) Use Von Neumann's trace inequality to show that MSE[β̂(λ)] < MSE[β̂(0)] for 0 < λ < σ²∥β∥₂−². Von Neumann's trace inequality states that, for p × p-dimensional matrices A and B with singular values da,1 ≥ da,2 ≥ . . . ≥ da,p and db,1 ≥ db,2 ≥ . . . ≥ db,p, respectively, |tr(AB)| ≤ ∑_{j=1}^{p} da,j db,j.
Note: the proof in the lecture notes is a stronger one, as it provides a (larger) interval on the penalty parameter
where the MSE of the ridge regression estimator is superior to that of the maximum likelihood one.
Question 1.13
Recall that there exists λ > 0 such that MSE(β̂) > MSE[β̂(λ)]. Then, show that, for that λ, the mean squared error of the linear predictor satisfies MSE(Ŷ) = MSE(Xβ̂) ≥ MSE[Xβ̂(λ)]. Hint: If A and B are nonnegative definite square matrices, then tr(AB) ≥ 0.
Question 1.15
Consider the regularization paths of the elements of the ridge regression estimator of the linear regression model
with two covariates. Could it be that, for λ′ > λ > 0, we find the estimates
a) β̂(λ) = (2, 1)⊤ and β̂(λ′) = (1 1/2, 1 1/10)⊤?
b) β̂(λ′) = (2, 1)⊤ and β̂(λ) = (1 1/2, 1 1/10)⊤?
c) β̂(λ′) = (2, 0)⊤ and β̂(λ) = (1 1/2, 0)⊤?
Question 1.16 (Negative penalty parameter)
Consider fitting the linear regression model, Y = Xβ + ε with ε ∼ N (0n , σ 2 Inn ), to data by means of the
ridge regression estimator. This estimator involves the penalty parameter which is said to be positive. It has been
suggested, by among others Hua and Gunst (1983), to extend the range of the penalty parameter to the whole set
of real numbers. That is, also tolerating negative values. Let’s investigate the consequences of allowing negative
values of the penalty parameter. Hereto use in the remainder the following numerical values for the design matrix,
response, and corresponding summary statistics:
X = ( 5  4 ; −3  −4 ),   Y = (3, −4)⊤,   X⊤X = ( 34  32 ; 32  32 ),   and X⊤Y = (27, 28)⊤.
a) For which λ < 0 is the ridge regression estimator β̂(λ) = (X⊤ X + λI22 )−1 X⊤ Y well-defined?
b) Now consider the ridge regression estimator to be defined via the ridge loss function, i.e.
β̂(λ) = arg minβ∈R2 ∥Y − Xβ∥22 + λ∥β∥22 .
Let λ = −20. Plot the level sets of this loss function, and add a point with the corresponding ridge regression
estimate β̂(−20).
c) Verify that the ridge regression estimate β̂(−20) is a saddle point of the ridge loss function, as can also
be seen from the contour plot generated in part b). Hereto study the eigenvalues of its Hessian matrix.
Moreover, specify the range of negative penalty parameters for which the ridge loss function is convex (and
does have a unique well-defined minimum).
d) Find the minimum of the ridge loss function.
Question 1.17
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with εi ∼i.i.d. N (0, σ 2 )
Consider the following two ridge regression estimators of the regression parameter of this model, defined as:
arg min_{β∈Rp} ∑_{i=1}^{n} (Yi − Xi,∗β)² + λ∥β∥₂²    and    arg min_{β∈Rp} ∑_{i=1}^{n} (Yi − Xi,∗β)² + nλ∥β∥₂².
Question 1.18
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi i.i.d. normally
distributed with zero mean and a common variance. The rows of the design matrix X have two elements, and
neither column represents the intercept, but X∗,1 = X∗,2 .
a) Suppose an estimator of the regression parameter β of this model is obtained through the minimization of
the sum-of-squares augmented with a ridge penalty, ∥Y − Xβ∥22 + λ∥β∥22 , in which λ > 0 is the penalty
parameter. The minimizer is called the ridge regression estimator and is denoted by β̂(λ). Show that
[β̂(λ)]1 = [β̂(λ)]2 for all λ > 0.
b) The covariates are now related as X∗,1 = −2X∗,2 . Data on the response and the covariates are:
{(yi , xi,1 , xi,2 )}6i=1 = {(1.5, 1.0, −0.5), (1.9, −2.0, 1.0), (−1.6, 1.0, −0.5),
(0.8, 4.0, −2.0), (0.9, 2.0, −1.0), (−0.5, 4.0, −2.0)}.
Question 1.19
Consider the standard linear regression model Yi = Xi,∗β + εi for i = 1, . . . , n and with the εi i.i.d. normally distributed with zero mean and a common variance. Moreover, X∗,j = X∗,j′ for all j, j′ = 1, . . . , p and ∑_{i=1}^{n} X²i,j = 1. Show that the ridge regression estimator, defined as β̂(λ) = arg min_{β∈Rp} ∥Y − Xβ∥₂² + λ∥β∥₂², equals β̂(λ) = (λ + p)−1 b 1p,
where b = X⊤ ∗,1 Y. Hint: you may want to use the Sherman-Morrison formula. Let A and B be symmetric
matrices of the same dimension, with A invertible and B of rank one. Moreover, define g = tr(A−1 B). Then:
(A + B)−1 = A−1 − (1 + g)−1 A−1 BA−1 .
Question 1.20
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi i.i.d. normally
distributed with zero mean and a common but unknown variance. Information on the response, design matrix and
relevant summary statistics are:
X⊤ = (2, 1, −2),   Y⊤ = (−1, −1, 1),   X⊤X = 9,   and X⊤Y = −5,
from which the sample size and dimension of the covariate space are immediate.
a) Evaluate the ridge regression estimator β̂(λ) with λ = 1.
b) Evaluate the variance of the ridge regression estimator, i.e. Var[β̂(λ)], for λ = 1. In this the error variance σ² is estimated by n−1∥Y − Xβ̂(λ)∥₂².
c) Recall that the ridge regression estimator β̂(λ) is normally distributed. Consider the interval
C = ( β̂(λ) − 2{Var[β̂(λ)]}1/2 , β̂(λ) + 2{Var[β̂(λ)]}1/2 ),
with Var[β̂(λ)] evaluated as in part b).
Is this a genuine (approximate) 95% confidence interval for β? If so, motivate. If not, what is the interpre-
tation of this interval?
d) Suppose the design matrix is augmented with an extra column identical to the first one. Moreover, assume
λ to be fixed. Is the estimate of the error variance unaffected, or not? Motivate.
Question 1.21
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with εi ∼i.i.d. N (0, σ 2 ).
The ridge regression estimator of β is denoted by β̂(λ) for λ > 0.
a) Show:
tr{Var[Ŷ(λ)]} = σ² ∑_{j=1}^{p} (Dx)jj⁴ [(Dx)jj² + λ]⁻²,
where Ŷ(λ) = Xβ̂(λ) and Dx is the diagonal matrix containing the singular values of X on its diagonal.
b) The coefficient of determination is defined as:
R² = [Var(Y) − Var(Ŷ)]/[Var(Y)] = [Var(Y − Ŷ)]/[Var(Y)],
where Ŷ = Xβ̂ with β̂ = (X⊤X)−1X⊤Y. Show that the second equality does not hold when Ŷ is replaced by the ridge regression predictor defined as Ŷ(λ) = H(λ)Y where H(λ) = X(X⊤X + λIpp)−1X⊤. Hint: Use the fact that H(λ) is not a projection matrix, i.e. H(λ) ̸= [H(λ)]².
Question 1.22
Fit the linear regression model, Y = Xβ + ε with the usual assumptions, by means of the ridge regression
estimator β̂(λ). Let df (λ) denote the degrees of freedom consumed by the ridge regression estimator. Show that
limλ→∞ [n − df (λ)]−1 ∥Y − Xβ̂(λ)∥22 = n−1 ∥Y∥22 , the maximum likelihood estimator of the variance in the
response.
Question 1.23
The linear regression model, Y = Xβ + ε with ε ∼ N (0n , σ 2 Inn ), is fitted to the data with the following design
matrix, response and relevant summary statistics:
X = (2  1),   Y = 0.5,   X⊤X = ( 4  2 ; 2  1 ),   and X⊤Y = (1, 0.5)⊤.
Hence, p = 2 and n = 1.
a) Verify that the singular value decomposition of X is
X = Ux Dx Vx⊤ := (−1) · ( √5  0 ) · ( −2/√5  −1/√5 ; 1/√5  −2/√5 ).
respectively. In the remainder study the computational gain of the latter. Hereto carry out the following instruc-
tions:
a) Load the R-package microbenchmark (Mersmann, 2014).
b) Generate data. In this fix the sample size at n = 10, and let the dimension range from p = 10, 20, 30, . . . , 100.
Sample the elements of the response and the ten design matrices from the standard normal distribution.
c) Verify the superior computation time of the latter by means of the microbenchmark-function with default
settings. Throughout use the first design matrix and λ = 1. Write the output of the microbenchmark-
function to an R-object. It will be a data.frame with two slots expr and time that contain the function
calls and the corresponding computation times, respectively. Each call has by default been evaluated a
hundred times in random order. Summary statistics of these individual computation times are given when
printing the object on the screen.
1.12 Exercises 43
d) Use the crossprod- and tcrossprod-functions to improve the computation times of the evaluation of
both β̂reg (λ) and β̂eff (λ) as much as possible.
e) Use the microbenchmark-function to evaluate the (average) computation time both β̂reg (λ) and β̂eff (λ)
on all ten data sets, i.e. defined by the ten design matrices with different dimensions.
f) Plot, for both β̂reg (λ) and β̂eff (λ), the (average) computation time (on the y-axis) against the dimension of
the ten data sets. Conclude on the computation gain from the plot.
(X⊤−i,∗ X−i,∗ + λIpp)−1 = (X⊤X + λIpp)−1 + (X⊤X + λIpp)−1 X⊤i,∗ [1 − Hii(λ)]−1 Xi,∗ (X⊤X + λIpp)−1,
The penalty parameter is chosen as the minimizer of the leave-one-out cross-validated squared error of the predic-
tion (i.e. Allen’s PRESS statistic). Show that λ = ∞.
Question 1.29
Consider fitting the linear regression model, Y = Xβ +ε with design matrix X, β ∈ Rp , and ε ∼ N (0n , σ 2 Inn ),
by means of the ridge regression estimator, β̂(λ) = (X⊤ X + λIpp )−1 X⊤ Y, for the below provided choices of
X and Y:
X = ( 1  0 ; 0  1 ; −1  −1 ),    H(λ) = [(λ + 1)(λ + 3)]−1 ( 2 + λ   −1   −(1 + λ) ; −1   2 + λ   −(1 + λ) ; −(1 + λ)   −(1 + λ)   2(1 + λ) ),
and Y = (−2, 1, 2)⊤ . The above display also contains the ridge hat matrix, which comes in handy for the answer.
Prior to fitting, choose the penalty parameter through generalized cross-validation. Show that then λopt ∈ (0, 6/53), and that this interval indeed contains a single minimum, which is indeed the global minimum of GCV(λ) on the positive real line.
# load data
data(stockholm)
a) Find the ridge penalty parameter by means of AIC minimization. Hint: the likelihood can be obtained from
the penFit-object that is created by the penalized-function of the R-package penalized.
b) Find the ridge penalty parameter by means of leave-one-out cross-validation, as implemented by the optL2-
function provided by the R-package penalized.
c) Find the ridge penalty parameter by means of leave-one-out cross-validation using Allen’s PRESS statistic
as performance measure (see Section 1.8.2).
d) Discuss the reasons for the different values of the ridge penalty parameter obtained in parts a), b), and c).
Also investigate the consequences of these values on the corresponding regression estimates.
The ridge regression estimator is equivalent to a Bayesian regression estimator. On one hand this equivalence
provides another interpretation of the ridge regression estimator. But it also shows that within the Bayesian
framework the high-dimensionality of the data need not frustrate the numerical evaluation of the estimator. In
addition, the framework provides ways to quantify the consequences of high-dimensionality on the uncertainty of
the estimator. Within this chapter, focus is on the equivalence of Bayesian and ridge regression. In this particular
case, the connection is immediate from the analytic formulation of the Bayesian regression estimator. After this
has been presented, it is shown how this estimator may also be obtained by means of sampling. The relevance
of the sampling for the evaluation of the estimator and its uncertainty becomes apparent in subsequent chapters
where we discuss other penalized estimators for which the connection cannot be captured in analytic form.
π(θ | Y = y) = P(Y = y | θ) π(θ) / P(Y = y) = P(Y = y | θ) π(θ) / ∫_{[0,1]} P(Y = y | θ) π(θ) dθ.
The posterior distribution thus contains all knowledge on the parameter. As the denominator – referred to as the distribution's normalizing constant – in the expression of the posterior distribution does not involve the parameter θ, one often writes π(θ | Y = y) ∝ P(Y = y | θ) π(θ), thus specifying the (relative) density of the posterior distribution of θ as 'likelihood × prior'.
While the posterior distribution is all one wishes to know, often a point estimate is required. A point estimate
of θ can be obtained from the posterior by taking the mean or the mode. Formally, the Bayesian point estimator of a parameter θ is defined as the estimator that minimizes the Bayes risk over a prior distribution of the parameter θ. The Bayes risk is defined as ∫θ E[(θ̂ − θ)²] πθ(θ; α) dθ, where πθ(θ; α) is the prior distribution of θ with hyperparameter α. It is thus a weighted average of the mean squared error, with weights specified through the prior. The Bayes risk is minimized by the posterior mean:
Eθ(θ | data) = ∫θ θ P(Y = y | θ) π(θ) dθ / ∫θ P(Y = y | θ) π(θ) dθ
(cf., e.g., Bijma et al., 2017). The Bayesian point estimator of θ yields the smallest possible expected MSE, under
the assumption of the employed prior. This estimator thus depends on the likelihood and the prior, and a different
prior yields a different estimator.
A point estimator is preferably accompanied by a quantification of its uncertainty. Within the Bayesian context this is done by so-called credible intervals or regions, the Bayesian equivalent of confidence intervals or regions. A Bayesian credible interval encompasses a certain percentage of the probability mass of the posterior. For instance, a subset of the parameter space C forms a 100(1 − α)% credible interval if ∫C π(θ | data) dθ = 1 − α.
In the example above it is important to note the role of the prior in the updating of the knowledge on θ. First,
when the prior distribution is uniform on the unit interval, the mode of the posterior coincides with the maximum
likelihood estimator. Furthermore, when n is large the influence of the prior on the posterior is negligible. This is why, for large sample sizes, frequentist and Bayesian analyses tend to produce similar results. However,
when n is small, the prior’s contribution to the posterior distribution cannot be neglected. The choice of the prior
is then crucial as it determines the shape of the posterior distribution and, consequently, the posterior estimates.
It is this strong dependence of the posterior on the arbitrary (?) choice of the prior that causes unease with a
frequentist (leading some to accuse Bayesians of subjectivity). In the high-dimensional setting the sample size
is usually small, especially in relation to the parameter dimension, and the choice of the prior distribution then
matters. Interest here is not in resolving the frequentist’s unease, but in identifying or illustrating the effect of the
choice of the prior distribution on the parameter estimates.
The posterior distribution can be expressed as a multivariate normal distribution. Hereto group the terms from the
exponential functions that involve β and manipulate as follows:
(Y − Xβ)⊤ (Y − Xβ) + λβ ⊤ β
= Y⊤ Y − 2β ⊤ X⊤ Y + β ⊤ X⊤ Xβ + λβ ⊤ β
= Y⊤ Y − 2β ⊤ (X⊤ X + λIpp )(X⊤ X + λIpp )−1 X⊤ Y + β ⊤ (X⊤ X + λIpp )β
= Y⊤ Y − β ⊤ (X⊤ X + λIpp )β̂(λ) − [β̂(λ)]⊤ (X⊤ X + λIpp )β + β ⊤ (X⊤ X + λIpp )β
= Y⊤ Y − Y⊤ X(X⊤ X + λIpp )−1 X⊤ Y + [β − β̂(λ)]⊤ (X⊤ X + λIpp )[β − β̂(λ)].
Figure 2.1: Conjugate prior of the regression parameter β for various choices of λ, the penalty parameter or, equivalently, the prior precision.
with
gβ(β | σ², Y, X) ∝ exp{ −½ σ−2 [β − β̂(λ)]⊤ (X⊤X + λIpp) [β − β̂(λ)] }.
Clearly, after having recognized the form of a multivariate normal density in the expression of the preceding display, the conditional posterior mean of β is E(β | σ², Y, X) = β̂(λ). Hence, the ridge regression estimator can be viewed as the Bayesian posterior mean estimator of β when imposing a Gaussian prior on the regression parameter.
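This equivalence is readily verified numerically. A minimal R sketch (with σ² = 1 for simplicity and simulated data): the average of draws from the conditional posterior approximates the ridge regression estimator.

set.seed(1)
n <- 10; p <- 3; lambda <- 2
X <- matrix(rnorm(n * p), nrow=n)
Y <- rnorm(n)

# frequentist ridge regression estimator
betaRidge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))

# conditional posterior of beta (sigma^2 = 1): N(betaRidge, (X'X + lambda*I)^{-1})
Sigma     <- solve(crossprod(X) + lambda * diag(p))
betaDraws <- MASS::mvrnorm(100000, mu = drop(Sigma %*% crossprod(X, Y)), Sigma = Sigma)

# posterior mean (approximated by the sample average) vs. the ridge estimator
cbind(colMeans(betaDraws), betaRidge)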
Figure 2.2: Level sets of the (unnormalized) conditional posterior of the regression parameter β. The grey dashed
line depicts the support of the degenerated normal distribution of the frequentist’s ridge regression estimate.
In one, the left panel, λ is arbitrarily set equal to one. The choice of the employed λ, i.e. λ = 1, is irrelevant,
the resulting posterior distribution only serves as a reference. This choice results in a posterior distribution that is
concentrated around the Bayesian point estimate, which coincides with the ridge regression estimate. The almost
spherical level sets around the point estimate may be interpreted as credible intervals for β. The grey dashed
line, spanned by the row of the design matrix, represents the support of the degenerated normal distribution of the
frequentist’s ridge regression estimate (cf. end of Section 1.4.2). The contrast between the degenerated frequentist
and well-defined Bayesian normal distribution illustrates that – for a suitable choice of the prior – within the
Bayesian context high-dimensionality need not be an issue with respect to the evaluation of the posterior. In the
right panel of Figure 2.2 the penalty parameter λ is set equal to a very small value, i.e. 0 < λ ≪ 1. This
represents a very imprecise or uninformative prior. It results in a posterior distribution with level sets that are far from spherical and are strongly stretched in the direction orthogonal to the subspace spanned by the row of the design matrix, indicating the direction in which there is most uncertainty with respect to β. In particular, when λ ↓ 0, the level sets lose their ellipsoidal form and the posterior collapses to the degenerated normal distribution of the frequentist's ridge regression estimate. Finally, in combination with the left-hand plot this illustrates the large effect of the prior in the high-dimensional context.
With little extra work we may also obtain the conditional posterior of σ 2 from the joint posterior distribution:
in which one can recognize the shape of an inverse gamma distribution. The relevance of this conditional distri-
bution will be made clear in Section 2.3.
Sampling from the conditional posterior of β becomes prohibitively slow in high dimensions. This is overcome by the computationally efficient sampling scheme proposed by Bhattacharya et al. (2016). To draw a sample from the multivariate normal distribution with a structured mean vector and covariance matrix,
The next proposition warrants that the Algorithm 1 indeed samples from distribution (2.1).
β is normally distributed. Its mean is E(β) = λ−1 X⊤ (λ−1 XX⊤ + Inn )−1 Y, which is the expression of the
ridge regression estimator (1.12). The variance is
where we have used the rules of variance calculus, the independence between u and v, and the Woodbury matrix
identity. ■
50 Bayesian regression
Algorithm 1 is valid for any (n, p)-combination, but most computational speed is gained if p ≫ n.
In general, any penalized estimator has a Bayesian equivalent. That is generally not the posterior mean as for
the ridge regression estimator, which is more a coincide due to the normal form of its likelihood and conjugate
prior. A penalized estimator always coincides with the Bayesian MAP (Mode A Posteriori) estimator. Then, the
MAP estimates estimates a parameter θ by the mode, i.e. the maximum, of the posterior density. To see the
equivalence of both estimators we derive the MAP estimator. The latter is defined as:
P (Y = y | θ) π(θ)
θ̂map = arg max π(θ | Y = y) = arg max R .
θ θ P (Y = y | θ) π(θ) dθ
To find the maximum of the posterior density take the logarithm (which does not change the location of the
maximum) and drop terms that do not involve θ. The MAP estimator then is:
The first summand on the right-hand side is the loglikelihood, while the second is the logarithm of the prior density.
The latter, in case of a normal prior, is proportional to σ −2 λ∥β∥22 . This is – up to some factors – exactly the loss
function of the ridge regression estimator. If the quadratic ridge penalty is replaced by different one, it is easy to
see what form the prior density should have in order for both estimators – the penalized and MAP – to coincide.
given that it of σ 2 is known. By drawing a sample σ 2,(1) , . . . , σ 2,(T ) of gσ2 (σ 2 | Y, X) and, subsequently, a sample
{β (t) }Tt=1 from gβ (β | σ 2,(t) , Y, X) for t = 1, . . . , T , the parameter σ 2 has effectively been integrated out and
the resulting sample of β (1) , β (2) , . . . , β (T ) is from fβ (β | Y, X).
It is clear from the above that the ability to sample from the posterior distribution is key to the estimation
of the parameters. But this sampling is hampered by knowledge of the normalizing constant of the posterior.
2.3 Markov chain Monte Carlo 51
This is overcome by the acceptance-rejection sampler. This sampler generates independent draws from a target
density π(y) = f (y)/K, where f (y) is an unnormalized density and K its (unknown) normalizing constant.
The sampler relies on the existence of a density h(y) that dominates f (y), i.e., f (y) ≤ c h(y) for some known
c ∈ R>0 , from which it is possible to simulate. Then, in order to draw a Y from the posterior π(·) run Algorithm 2:
A proof that this indeed yields a random sample from π(·) is given in Flury (1990). For this method to be com-
putationally efficient, c is best set equal to supy {f (y)/h(y)}. The R-script below illustrates this sampler for the
unnormalized density f (y) = cos2 (2πy) on the unit interval. As maxy∈[0,1] cos2 (2πy) = 1, the density f (y) is
dominated by the density function of the uniform distribution U [0, 1] and, hence, the script uses h(y) = 1 for all
y ∈ [0, 1].
Listing 2.1 R code
# define unnormalized target density
cos2density <- function(x){ (cos(2*pi*x))ˆ2 }
# verify by eye-balling
hist(acSampler(100000), n=100)
In practice, it may not be possible to find a density h(x) that dominates the unnormalized target density f (y) over
the whole domain. Or, the constant c is too large, yielding a rather small acceptance probability, which would
make the acceptance-rejection sampler impractically slow. A different sampler is then required.
The Metropolis-Hastings sampler overcomes the problems of the acceptance-rejection sampler and generates
a sample from a target distribution that is known up to its normalizing constant. Hereto it constructs a Markov
chain that converges to the target distribution. A Markov chain is a sequence of random variables {Yt }∞ t=1 with
Yt ∈ Rp for all t that exihibit a simple dependence relationship among subsequent random variables in the
sequence. Here that simple relationship refers to the fact/assumption that the distribution of Yt+1 only depends
on Yt and not on the part of the sequence preceeding it. The Markov chain’s random walk is usually initiated by a
single draw from some distribution, resulting in Y1 . From then on the evolution of the Markov chain is described
by the transition kernel. The transition kernel is a the conditional density gYt+1 | Yt (Yt+1 = ya | Yt = yb ) that
specifies the distribution of the random variable at the next instance given the realization of the current one. Under
52 Bayesian regression
some conditions (akin to aperiodicity and irreducibility for Markov chains in discrete time and with a discrete
state space), the influence of the initiation washes out and the random walk converges. Not to a specific value, but
to a distribution. This is called the stationary distribution, denoted by φ(y), and Yt ∼ φ(y) for large enough t.
The stationary distribution satisfies:
Z
φ(ya ) = gYt+1 | Yt (Yt+1 = ya | Yt = yb ) φ(yb ) dyb . (2.2)
Rp
That is, the distribution of Yt+1 , obtained by marginalization over Yt , coincides with that of Yt . Put differently,
the mixing by the transition kernel does not affect the distribution of individual random variables of the chain.
To verify that a particular distribution φ(y) is the stationary one it is sufficient to verify it satisfies the detailed
balance equations:
for all choices of ya , yb ∈ Rp . If a Markov chain satisfies this detailed balance equations, it is said to be reversible.
Reversibility means so much as that, from the realizations, the direction of the chain cannot be discerned as:
fYt ,Yt+1 (ya , yb ) = fYt ,Yt+1 (yb , ya ), that is, the probability of starting in state ya and finishing in state yb
equals that of starting in yb and finishing in ya . The sufficiency of this condition is evident after its integration on
both sides with respect to yb , from which condition (2.2) follows.
MCMC assumes the stationary distribution of the Markov chain to be known up to a scalar – this is the target
density from which is to be sampled – but the transition kernel is unknown. This poses a problem as the transition
kernel is required to produce a sample from the target density. An arbitrary kernel is unlikely to satisfy the detailed
balance equations with the desired distribution as its stationary one. If indeed it does not, then:
for some ya and yb . This may (loosely) be interpreted as that the process moves from ya to yb too often and from
yb to ya too rarely. To correct this the probability of moving from yb to yb is introduced to reduce the number of
moves from ya to yb . As a consequence not always a new value is generated, the algorithm may decide to stay in
ya (as opposed to acceptance-rejection sampling). To this end a kernel is constructed and comprises compound
transitions that propose a possible next state which is simultaneously judged to be acceptable or not. Hereto take
an arbitrary candidate-generating density function gYt+1 | Yt (yt+1 | yt ). Given the current state Yt = yt , this
kernel produces a suggestion yt+1 for the next point in the Markov chain. This suggestion may take the random
walk too far afield from the desired and known stationary distribution. If so, the point is to be rejected in favour of
the current state, until a new suggestion is produced that is satisfactory close to or representative of the stationary
distribution. Let the probability of an acceptable suggestion yt+1 , given the current state Yt = yt , be denoted by
A(yt+1 | yt ). The thus constructed transition kernel then formalizes to:
hYt+1 | Yt (ya | yb ) = gYt+1 | Yt (ya | yb ) A(ya | yb ) + r(ya ) δyb (ya ),
where A(ya | yb ) is the acceptance probability of the suggestion ya given the current state yb and is defined as:
Yt+1 | Yt (yb | ya )
n φ(y ) g o
a
A = min 1, .
φ(yb ) gYt+1 | Yt (ya | yb )
R
Furthermore, δyb (·) is the Dirac delta function and r(ya ) = 1 − Rp gYt+1 | Yt (ya | yb )A(ya | yb ) dya , which is
associated with the rejection. The transition kernel hYt+1 | Yt (· | ·) equipped with the above specified acceptance
probability is referred to as the Metropolis-Hastings kernel. Note that, although the normalizing constant of the
stationary distribution may be unknown, the acceptance probability can be evaluated as this constant cancels out
in the φ(ya ) [φ(yb )]−1 term.
It rests now to verify that φ(·) is the stationary distribution of the thus defined Markov process. Hereto first
the reversibility of the proposed kernel is assessed. The contribution of the second summand of the kernel to the
detailed balance equations, r(ya ) δyb (ya )φ(yb ), exists only if ya = yb . Thus, if it contributes to the detailed
balance equations, it does so equally on both sides of the equation. It is now left to assess whether:
To verify this equality use the definition of the acceptance probability from which we note that if A(ya | yb ) ≤ 1
(and thus A(ya | yb ) = [φ(ya )gYt+1 | Yt (yb | ya )][φ(yb )gYt+1 | Yt (ya | yb )]−1 ), then A(yb | ya ) = 1 (and vice
2.3 Markov chain Monte Carlo 53
versa). In either case, the equality in the preceeding display holds, and thereby, together with the observation
regarding the second summand, so do the detailed balance equations for Metropolis-Hastings kernel. Then, φ(·)
is indeed the stationary distribution of the constructed transition kernel hYt+1 | Yt (ya | yb ) as can be verified from
Condition (2.2):
Z Z
hYt+1 | Yt (ya | yb ) φ(yb ) dyb = hYt+1 | Yt (yb | ya ) φ(ya ) dyb
Rp Rp
Z Z
= gYt+1 | Yt (yb | ya ) A(yb | ya ) φ(ya ) dyb + r(yb )δya (yb ) φ(ya ) dyb
Rp Rp
Z Z
= φ(ya ) gYt+1 | Yt (yb | ya ) A(yb | ya ) dyb + φ(ya ) r(yb )δya (yb ) dyb
Rp Rp
= φ(ya )[1 − r(ya )] + r(ya ) φ(ya ) = φ(ya ),
in which the reversibility of hYt+1 | Yt (·, ·), the definition of r(·), and the properties of the Dirac delta function
have been used.
The following R-script shows how the Metropolis-Hastings sampler can be used to from a mixture of two nor-
mals f (y) = θϕµ=2,σ2 =0.5 (y) + (1 − θ)ϕµ=−2,σ2 =2 (y) with θ = 14. The sampler assumes this distribution is
known up to its normalizing constant and uses its unnormalized density exp[−(y − 2)2 ] + 3 exp[−(y + 2)2 /4].
The sampler employs the Cauchy distribution centered at the current state as transition kernel from a candidate for
the next state is drawn. The chain is initiated arbitrarily with Y (1) = 1.
# accept/reject candidate
if (runif(1) < alpha){
draw <- c(draw, candidate)
} else {
draw <- c(draw, draw[i])
}
}
return(draw)
}
# verify by eye-balling
Y <- mcmcSampler(100000, 1)
hist(Y[-c(1:1000)], n=100)
The histogram (not shown) shows that the sampler, although using a unimodal kernel, yields a sample from the
bimodal mixture distribution.
The Gibbs sampler is a particular version of the MCMC algorithm. The Gibbs sampler enhances convergence
to the stationary distribution (i.e. the posterior distribution) of the Markov chain. It requires, however, the full
conditionals of all random variables (here: model parameters), i.e. the conditional distributions of one random
variable given all others, to be known analytically. Let the random vector Y for simplicity – more refined par-
titions possible – be partioned as (Ya , Yb ) with subscripts a and b now refering to index sets that partition the
random vector Y (instead of the previously employed meaning of referring to two – possibly different – elements
from the state space). The Gibbs sampler thus requires that both fYa | Yb (ya , yb ) and fYb | Ya (ya , yb ) are known.
Being a specific form of the MCMC algorithm the Gibbs sampler seeks to draw Yt+1 = (Ya,t+1 , Yb,t+1 ) given
the current state Yt = (Ya,t , Yb,t ). It draws, however, only a new instance for a single element of the partition
(e.g. Ya,t+1 ) keeping the remainder of the partition (e.g. Yb,t ) temporarily fixed. Hereto define the transition
kernel:
fYa,t+1 | Yb,t+1 (ya,t+1 , yb,t+1 ) if yb,t+1 = yb,t ,
gYt+1 | Yt (Yt+1 = yt+1 | Yt = yt+1 ) =
0 otherwise.
Using the definition of the conditional density the acceptance probability for the t + 1-th proposal of the subvector
Ya then is:
n φ(y ) g
Yt+1 | Yt (yt , yt+1 )
o
t+1
A(yt+1 | yt ) = min 1,
φ(yt ) gYt+1 | Yt (yt+1 , yt )
n φ(y ) f
Ya,t+1 | Yb,t+1 (ya,t , yb,t )
o
t+1
= min 1,
φ(yt ) fYa,t+1 | Yb,t+1 (ya,t+1 , yb,t+1 )
n φ(y )
t+1 fYa,t+1 ,Yb,t+1 (ya,t , yb,t )/fYb,t+1 (yb,t ) o
= min 1,
φ(yt ) fYa,t+1 ,Yb,t+1 (ya,t+1 , ya,t+1 )/fYb,t+1 (yb,t+1 )
n f (yb,t+1 ) o
Y
= min 1, b,t+1 = 1.
fYb,t+1 (yb,t )
The acceptance probability of each proposal is thus one (which contributes to the enhanced convergence of the
Gibbs sampler to the joint posterior). Having drawn an acceptable proposal for this element of the partition, the
Gibbs sampler then draws a new instance for the next element of the partition, i.e. now Yb , keeping Ya fixed.
This process of subsequently sampling each partition element is repeated until enough samples have been drawn.
To illustrate the Gibbs sampler revisit Bayesian regression. In Section 2.2 the full conditional distributions of
β and σ 2 were derived. The Gibbs sampler now draws in alternating fashion from these conditional distributions
(see Algorithm 4 for its pseudo-code).
2.4 Empirical Bayes 55
5 end
6 Remove the first Tburn-in draws (representing the burn-in phase).
7 Select every fthinning -th sample (thinning).
Algorithm 4: Pseudocode of the Gibbs sampler of the joint posterior of the Bayesian regression param-
eters.
Empirical Bayes (EB) is a branch of Bayesian statistics that meets the subjectivity criticism of frequentists. Instead
of fully specifying the prior distribution empirical Bayesians identify only its form. The hyper parameters of this
prior distribution are left unspecified and are to be found empirically. In practice, these hyper-parameters are
estimated from the data at hand. However, the thus estimated hyper parameters are used to obtain the Bayes
estimator of the model parameters. As such the data are then used multiple times. This is usually considered an
inappropriate practice but is deemed acceptable when the number of model parameters is large in comparison to
the number of hyper parameters. Then, the data are not used twice but ‘1 + ε’-times (i.e. once-and-a-little-bit)
and only little information from the data is spent on the estimation of the hyper parameters. Having obtained an
estimate of the hyper parameters, the prior is fully known, and the posterior distribution (and summary statistics
thereof) are readily obtained by Bayes’ formula.
The most commonly used procedure for the estimation of the hyper parameters is marginal likelihood maxi-
mization, which is a maximum likelihood-type procedure. But the likelihood cannot directly be maximized with
respect to the hyper parameters as it contains the model parameters that are assumed to be random within the
Bayesian framework. This may be circumvented by choosing a specific value for the model parameter but would
render the resulting hyper parameter estimates dependent on this choice. Instead of maximization with the model
parameter set to a particular value one would preferably maximize over all possible realizations. The latter is
achieved by marginalization with respect to the random model parameter, in which the (prior) distribution of the
model parameterRis taken into account. This amounts to integrating out the model parameter from the posterior
distribution, i.e. P (Y = y | θ) π(θ) dθ, resulting in the so-called marginal posterior. After marginalization the
specifics of the model parameter have been discarded and the marginal posterior is a function of the observed data
and the hyper parameters. The estimator of the hyper parameter is now defined as the maximizer of this marginal
posterior.
To illustrate the estimation of hyper parameters of the Bayesian linear regression model through marginal
likelihood maximization assume the regression parameter β and and the error variance σ 2 to be endowed with
conjugate priors: β | σ 2 ∼ N (0p , σ 2 λ−1 Ipp ) and σ 2 ∼ IG (α0 , β0 ). Three hyper parameters are thus to be esti-
mated: the shape and scale parameters, α0 and β0 , of the inverse gamma distribution and the λ parameter related
to the variance of the regression coefficients. Straightforward application of the outlined marginal likelihood prin-
ciple does, however, not work here. The joint prior, π(β | σ 2 )π(σ 2 ), is too flexible and does not yield sensible
estimates of the hyper parameters. As interest is primarily in λ, this is resolved by setting the hyper parameters
of π(σ 2 ) such that the resulting prior is uninformative, i.e. as objectively as possible. This is operationalized as a
very flat distribution. Then, with the particular choices of α0 and β0 that produce an uninformative prior for σ 2 ,
56 Bayesian regression
where the factors not involving λ have been dropped throughout and b1 = β0 + 12 [Y⊤ Y − Y⊤ X(X⊤ X +
λIpp )−1 X⊤ Y]. Prior to the maximization of the marginal likelihood the logarithm is taken. That changes the
maximum, but not its location, and yields an expression that is simpler to maximize. With the empirical Bayes
estimate λ̂eb at hand, the Bayesian estimate of the regression parameter β is β̂eb = (X⊤ X + λ̂eb Ipp )−1 X⊤ Y.
Finally, the particular choice of the hyper parameters of the prior on σ 2 is not too relevant. Most values of α0 and
β0 that correspond to a rather flat inverse gamma distribution yield resulting point estimates β̂eb that do not differ
too much numerically.
2.5 Conclusion
Bayesian regression was introduced and shown to be closely connected to ridge regression. Under a conjugate
Gaussian prior on the regression parameter the Bayesian regression estimator coincides with the ridge regression
estimator, which endows the ridge penalty with the interpretation of this prior. While an analytic expression of
these estimators is available, a substantial part of this chapter was dedicated to evaluation of the estimator through
resampling. The use of this resampling will be evident when other penalties and non-conjugate priors will be
studied (cf. Excercise ?? and Sections 5.7 and 6.7). Finally, another informative procedure, empirical Bayes, to
choose the penalty parameter was presented.
2.6 Exercises
Question 2.1
Consider the linear regression model Yi = Xi β + εi with the εi i.i.d. following a standard normal law N (0, 1).
Data on the response and covariate are available: {(yi , xi )}8i=1 = {(−5, −2), (0, −1),
(−4, −1), (−2, −1), (0, 0), (3, 1), (5, 2), (3, 2)}.
a) Assume a zero-centered normal prior on β. What variance, i.e. which σβ2 ∈ R>0 , of this prior yields a mean
posterior E(β | {(yi , xi )}8i=1 , σβ2 ) equal to 1.4?
b) Assume a non-zero centered normal prior. What (mean, variance)-combinations for the prior will yield a
mean posterior estimate β̂ = 2?
Question 2.2
Consider the Bayesian linear regression model Y = Xβ + ε with ε ∼ N (0n , σ 2 Inn ) and priors β | σ 2 ∼
N (0p , σβ2 Ipp ) and σ 2 ∼ IG (a0 , b0 ) where σβ2 = cσ 2 for some c > 0 and a0 and b0 are the shape and scale
2.6 Exercises 57
parameters, respectively, of the inverse Gamma distribution. This model is fitted to data from a study where
the response is explained by a single covariate, and henceforth β is replaced by β, with the following relevant
summary statistics: X⊤ X = 2 and X⊤ Y = 5.
a) Suppose E(β | σ 2 = 1, c, X, Y) = 2. What amount of regularization should be used such that the ridge
regression estimate β̂(λ2 ) coincides with the aforementioned posterior (conditional) mean?
b) Give the (posterior) distribution of β | {σ 2 = 2, c = 2, X, Y}.
c) Discuss how a different prior of σ 2 affects the correspondence between E(β | σ 2 , c, X, Y) and the ridge
regression estimator.
Question 2.3
Revisit the microRNA data example of Section 1.10. Use the empirical Bayes procedure outlined in Section 2.4
to estimate the penalty parameter. Compare this to the one obtained via cross-validation. In particular, compare
the resulting point estimates of the regression parameter.
Question 2.4
Revisit question 1.16. From a Bayesian perspective, is the suggestion of a negative ridge penalty parameter
sensible?
Question 2.5
Consider the linear regression model Y = Xβ + ε with X ∈ Mn,p , β ∈ Rp , and εi ∼ N (0, σ 2 Inn ). Assume the
βj are independently and identically distributed with a generalized normal prior distribution. The latter has density
function f (x; µ, α1 , α2 ) = α2 [2α1 Γ(α2−1 )]−1 exp[−(|x − µ|/α1 )α2 ] with location parameter µ, scale parameter
α1 and shape parameter α2 . For which choice of these hyperparameter does the MAP estimator coincide with the
bridge regression one? The bridge regression estimator of β is defined as:
Xp
β̂(λγ ) = arg minβ∈R ∥Y − Xβ∥22 + λγ |βj |γ ,
j=1
Question 2.6
Consider the linear regression model Y = Xβ + ε without intercept and ε ∼ N (0n , σ 2 Inn ), to explain the
variation in the response Y by a linear combination of the columns of the design matrix X. Execute the R-code
below to sample data from this model.
Listing 2.3 R code
# set dimension and sample size
n <- 50
p <- 100
# create covariates
X <- matrix(rnorm(n*p), ncol=p)
b) Use the Gibbs sampler to draw 10100 times from both conditional posteriors in alternating fashion. To
remove the dependency on the choice of the initiation σ02 throw away the first 100 (i.e. the burn-in period)
draws for both parameters. To reduce the dependency between subsequent draws, apply thinning: keep
each 10-th draw after the burn-in period. After both operations (removal of the burning-in period and
the thinning) only those draws corresponding to iterations 110, 120, . . . , 10100 are preserved. Those are
considered a representative sample from the joint posterior of β and σ 2 .
c) Calculate the lower bound of the 95% credible interval, containing the central (100 − α)% with α = 0.05
of the posterior probability mass, of the second element of the regression parameter. Use the quantile-
function for the credible interval construction.
d) Investigate the dependency among subsequent draws, i.e. without the thinning, of σ 2 from the Gibbs sam-
pler. In this contrast the case with λ = 1 to that with λ = 106 . Is there a difference in their 1st order
dependencies? If so, explain this difference. If not, explain the absence thereof.
3 Generalizing ridge regression
The exposé on ridge regression may be generalized in many ways. Among others different generalized linear
models may be considered (confer Section 5.3). In this section we stick to the linear regression model Y = Xβ+ε
with the usual assumptions. But we now fit it in weighted fashion – to accommodate the ridge estimation of the
logistic regression model (see Chapter 5) – and generalize the common, spherical penalty.
Our generalization of the ridge loss function, a weighted least squares criterion augmented with a generalized
ridge penalty, is:
In the above display W is a n × n-dimensional, diagonal matrix with (W)ii ∈ [0, 1] representing the weight of
the i-th observation. The minimizer of loss function (3.2) is our generalization of the ridge regression estimator.
The generalized ridge penalty in loss function (3.2) is now a quadratic form with penalty parameter ∆, a p×p-
dimensional, positive definite, symmetric matrix. When ∆ = λIpp , one regains the spherical penalty of ‘regular
ridge regression’. This penalty shrinks each element of the regression parameter β equally along the unit vectors
ej . Generalizing ∆ to the class of symmetric, positive definite matrices S++ allows for i) different penalization
per regression parameter, and ii) joint (or correlated) shrinkage among the elements of β. The penalty parameter
∆ determines the speed and direction of shrinkage. The p-dimensional column vector β0 is a user-specified,
non-random target towards which β is shrunken as the penalty parameter increases. When recasting generalized
ridge estimation as a constrained estimation problem, the implications of the penalty may be visualized (Figure
3.1, left panel). The generalized ridge penalty is a quadratic form centered around β0 . In Figure 3.1 the parameter
constraint clearly is ellipsoidal (and not spherical). Moreover, the center of this ellipsoid is not at zero.
β1(λ)
3.0
0.015
β2(λ)
β3(λ)
6
0.025 β4(λ)
2.5
0.035
0.045
0.055
4
2.0
0.06
βj(λ)
1.5
2
β2
0.05
0.04
0.03
1.0
0
0.02
0.01
0.5
0.005
−2
loss
0.0
−2 0 2 4 6 0 2 4 6 8 10 12
β1 log(λ)
Figure 3.1: Left panel: the contours of the likelihood (grey solid ellipsoids) and the parameter constraint implied
by the generalized penalty (black dashed ellipsoids) Right panel: generalized (fat coloured lines) and ‘regular’ (thin
coloured lines) regularization paths of four regression coefficients. The dotted grey (straight) lines indicated the
targets towards the generalized ridge penalty shrinks regression coefficient estimates.
The addition of the generalized ridge penalty to the sum-of-squares ensures the existence of a unique regression
estimator in the face of super-collinearity. The generalized penalty is a non-degenerated quadratic form in β due
60 Generalizing ridge regression
to the positive definiteness of the matrix ∆. As it is non-degenerate, it is strictly convex. Consequently, the
generalized ridge regression loss function (3.1), being the sum of a convex and strictly convex function, is also
strictly convex. This warrants the existence of a unique global minimum and, thereby, a unique estimator.
There is an analytic expression for the optimum of the generalized ridge loss function (3.1). To see this, obtain
the estimating equation of β through equating its derivative with respect to β to zero:
2X⊤ WY − 2X⊤ WXβ − 2∆β + 2∆β0 = 0p .
This is solved by:
β̂(∆) = (X⊤ WX + ∆)−1 (X⊤ WY + ∆β0 ). (3.2)
Clearly, this reduces to the ‘regular’ ridge regression estimator by setting W = Inn , β0 = 0p , and ∆ = λIpp .
The effects of the generalized ridge penalty on the coefficients of corresponding estimator can be seen in the reg-
ularization paths of the estimator’s coefficients. Figure 3.1 (right panel) contains an example of the regularization
paths for coefficients of a linear regression model with four explanatory variables. Most striking is the limiting
behaviour of the estimates of β3 and β4 for large values of the penalty parameter λ: they convergence to a non-
zero value (as was specified by a nonzero β0 ). More subtle is the (temporary) convergence of the regularization
paths of the estimates of β2 and β3 . That of β2 is pulled away from zero (its true value and approximately its
unpenalized estimate) towards the estimate of β3 . In the regularization path of β3 this can be observed in a de-
layed convergence to its nonzero target value (for comparison consider that of β4 ). For reference the corresponding
regularization paths of the ‘regular’ ridge estimates (as thinner lines of the same colour) are included in Figure 3.1.
One particular version of the generalized ridge penalty deserves special attention. It distinghuises between pe-
nalized and unpenalized covariates. The latter comprises explanatory variables that ought to be in the model, and
are not to be shrunken. This aims to keep the bias of their estimates to a minimum and make them comparable
to these reported in the literature. For instance, in the statistical analysis of clinical trials factors like ‘age’ and
‘gender’ are known to associated with the response, and any model without them would not be acceptable to the
field. Additionally, the patients of the clinical trial may have been molecularly characterized at baseline. The re-
sulting high-dimensional molecular information may then be included in the model, alongside the aforementioned
factors. The effect of the high-dimensional molecular covariates can then only be estimated in penalized fashion,
while that of ‘age’ and ‘gender’ are preferably estimated in an unpenalized manner.
To present our estimator with penalized and unpenalized covariates, we introduce separate notation for them.
This modifies the model to Y = Uγ + Xβ + ε, where U is a full rank n × q-dimensional design matrix of
the q unpenalized covariates and γ the associated regression parameter. The ridge regression estimator is then
generalized to:
γ̂(∆, β0 ), β̂(∆, β0 ) = arg min ∥Y − Uγ − Xβ∥22 + (β − β0 )⊤ ∆(β − β0 ).
γ∈Rq ,β∈Rp
Note that the estimator of regression parameter γ of the unpenalized covariates U too depends on the penalty
parameters. This is due to the fact that it is jointly estimated with the regression parameter β of their penalized
counterparts X. This bears consequence for the bias of the estimate of γ (see Exercise ??). Again an analytic
expression of the estimator exists:
⊤ −1 ⊤
U U U⊤ X
γ̂(∆, β0 ) U Y
= .
β̂(∆, β0 ) X⊤ U X ⊤ X + ∆ X⊤ Y + ∆β0
This expression can be ‘simplified’, as is done in (Lettink et al., 2023), to:
γ̂(∆, β0 ) = {U⊤ [X∆−1 X⊤ + Inn ]−1 U}−1 U⊤ [X∆−1 X⊤ + Inn ]−1 (Y − Xβ0 ),
β̂(∆, β0 ) = β0 + (X⊤ X + ∆)−1 X⊤ [Y − Uγ̂(∆, β0 ) − Xβ0 ],
which is obtained by means of the Woodbury matrix identity and the analytic expression of the inverse of a 2 × 2
block matrix.
Example 3.1 (Fused ridge estimation)
An example of a generalized ridge penalty is the fused ridge penalty (as introduced by Goeman, 2008). Consider
the standard linear model Y = Xβ + ε. The fused ridge regression estimator of β then minimizes:
Xp
∥Y − Xβ∥22 + λ (βj − βj−1 )2 . (3.3)
j=2
61
The penalty in the loss function above can be written as a generalized ridge penalty:
λ −λ 0 ... ... 0
.. ..
−λ 2λ −λ . .
.. .. ..
Xp
⊤ 0 −λ 2λ . . .
λ (βj − βj−1 )2 = β .. . . β.
.. .. ..
j=2 . . . . . 0
.. .. .. ..
. . . . −λ
0 ... ... 0 −λ λ
The matrix ∆ employed above is semi-positive definite and therefore the loss function (3.3) need not be strictly
convex. Hence, often a regular ridge penalty ∥β∥22 is added (with its own penalty parameter).
To illustrate the effect of the fused ridge penalty on the estimation of the linear regression model Y = Xβ + ε,
6
let βj = ϕ0,1 (zj ) with zj = −30 + 50 j for j = 1, . . . , 500. Sample the elements of the design matrix X and those
of the error vector ε from the standard normal distribution, then form the response Y from the linear model. The
regression parameter is estimated through fused ridge loss minimization with λ = 1000. The estimate is shown
in the left panel of Figure 3.2 (red line). For reference the figure includes the true β (black line) and the ‘regular
ridge’ estimate with λ = 1 (blue line). Clearly, the fused ridge regression estimate yields a nice smooth vector of
β estimates.
The fused ridge penalty employed in the fused ridge loss function (3.3) shrinks one-dimensionally. It shrinks,
when viewing the βj as equidistantly distributed on a line ordered by their position in the regression parameter,
contiguous elements towards one other. Should we expect the covariates’ effects to be similar, i.e. of equal
size and sign, depending on their spatial proximity on a two-dimensional grid, this may be incorporated in the
two-dimensional fused ridge regression estimator (Lettink et al., 2023). □
Straightforward application of this penalty within the context of the linear regression model may seem farfetched.
That is, why would a common effect of all covariates be desirable? Indeed, the motivation of Anatolyev (2020)
stems from a different context. Translated to the present standard linear regression model context, that motivation
could be thought of – loosely – as an application of the penalty (3.4) to subsets of the regression parameter. Sit-
uations arise where many covariates are only slightly different operationalizations of the same trait. For instance,
in brain image data the image itself is often summarized in many statistics virtually measuring the same thing. In
extremis, those could be the mean, median and trimmed mean of the intensity of a certain region of the image. It
would be ridiculous to assume these three summary statistics to have a wildly different effect, and shrinkage to a
common value seems sensible practice. Finally, note that the above employed penalty matrix, which we write as
λ∆ := λ(Ipp − p−1 1pp ), is nonnegative definite. Hence, in the face of (super-) collinearity, an additional penalty
term is required for a well-defined estimator.
The homogeneity ridge regression estimator can – but is left as an exercise – be reformulated using straight-
forward linear algebra as
betas
ridge (λ=1)
gen. ridge (λ=1000)
1.0
0.3
0.8
Ridge estimate
0.2
0.6
betas
0.1
0.4
0.2
0.0
λIpp
λ(Ipp+p−11pp)
0.0
−0.1
Index log(lambda)
0.6
0.4
0.2
λ
2λ
3λ
4λ
0.0
5λ
0 2 4 6 8
log(lambda)
Figure 3.2: Top left panel: illustration of the fused ridge estimator (in simulation). The true parameter β and its
ridge and fused ridge estimates against their spatial order. Top right panel: regularization path of the ridge regres-
sion estimator with a ‘homogeneity’ penalty matrix. Bottom left panel: regularization path of the ridge regression
estimator with a ‘co-data’ penalty matrix. Bottom right panel: Ridge vs. fused ridge estimates of the DNA copy
effect on KRAS expression levels. The dashed, grey vertical bar indicates the location of the KRAS gene.
covariate index set {1, . . . , p}, i.e. Jk ∩ Jk′ = ∅ for all k ̸= k ′ and ∪K
k=1 Jk = {1, . . . , p}. The estimator then is:
XK X
β̂(λ1 , . . . , λK ) = arg minp ∥Y − Xβ∥22 + λk βj2 , (3.5)
β∈R k=1 j∈Jk
To express to above in terms the estimator 3.2, it involves W = Inn , β0 = 0p , and a diagonal ∆. A group’s
importance is reflected in the size of the penalty parameter λk , relative to the other penalty parameters. The
larger a group’s penalty parameter, the smaller the ridge constraint of the corresponding elements of the regression
parameter. And a small constraint allows these elements little room to vary, and the corresponding covariates little
flexibility to accommodate the variation in the response. For instance, suppose the group index is concordant with
the hypothesized importance of the covariates for the linear model. Ideally, the λk are then reciprocally concordant
to the group index. Of course, the penalty parameters need not adhere to this concordance, as they are selected by
means of data. For more on their selection in see Van De Wiel et al. (2016). Clearly, the first regression coefficients
of the first ten covariates are shrunken towards a common value, while the others are all shrunken towards zero.
We illustrate the effect of the ‘co-data’ penalty on the same data as that used to illustrate the effect of the
‘homogeneity’ penalty (3.4). Now we employ a diagonal 5 × 5 block penalty matrix ∆, with blocks ∆11 =
λI10,10 , ∆11 = 2λI10,10 , ∆11 = 3λI10,10 , ∆11 = 4λI10,10 , and ∆11 = 5λI10,10 . The regularization paths of
the estimator (3.5) are shown in the bottom left panel of Figure 3.2. Clearly, the regression coefficient estimates
are shrunken less if they belong to a group with high important (i.e. lower penalty). □
β̂ (t) (λt ) = arg minβ∈Rp ∥Y(t) − X(t) β∥22 + λt ∥β − β̂ (t−1) (λt−1 )∥22 (3.6)
ridge estimates with zero−centered penalty ridge estimates with nonzero−centered penalty
β=2.5 β=2.5
4
β=1 β=1
β=0 β=0
β=−1 β=−1
3
β=−2.5 β=−2.5
2
2
1
1
β
β
0
0
−1
−1
−2
−2
−3
−3
0 5 10 15 20 0 5 10 15 20 25
update update
Figure 3.3: The top panels show the (5%, 95%-quantile intervals of the traditional (left) and updated (right) ridge
estimates of βj with j ∈ {0, 30, 50, 70, 100} plotted against t. The solid, colored line inside these intervals is the
corresponding 50% quantile. The dotted, grey lines are the true values of the βj ’s.
64 Generalizing ridge regression
for t = 1, . . .. Effectively, with the arrival of novel data we update our current estimator of the regression parameter
β̂ (t−1) (λt−1 ), resulting in the updated one β̂ (t) (λt ). We thus obtain a sequence of estimators {β̂ (t) (λt )}∞ t=1 ,
which is initiated by any nonrandom target, e.g. the null vector. Over time we expect this sequence of estimators
to incorporate the lessons from the past.
Let us investigate by simulation whether the updated ridge regression estimator (3.6) indeed accumulates
knowledge on the regression parameter. Hereto we draw a sequence of data sets from the linear regression model
Yt = Xt β + εt with t = 1, . . . , 25. The elements of the design matrices are sampled from the standard normal
distribution, the elements of β chosen as βj = (j − 50)/20 for j = 0, 1, 2, . . . , 100, while εt ∼ N (0n , 0.04Inn ).
The regression parameter is estimated from each data set using both the regular and updated ridge estimators, in
which the latter uses its previous estimate as target. The penalty parameter of both estimators are determined by
cross-validation. We repeat all this a hundred times. The results, by the 5%, 50% and 95% quantiles, of the regular
and updated ridge regression estimates of βj with j ∈ {0, 30, 50, 70, 100} are plotted against t (Figure 3.3). The
quantiles of the regular ridge regression estimates of the selected elements of β are virtually constant over times
(left panel of Figure 3.3). In contrast, the quantiles of the update ridge regression estimates (right panel of Figure
3.3) clearly improve as t increases. The improvement is two-fold: i) they become less biased, and ii) the distance
between the 5% and 95% quantiles vanishes. The quantiles’ behavior indicates that the updating does lead to an
accumulation of knowledge
In Figure 3.3 the 5%, 50% and 95% quantiles of the traditional and updated ridge estimates of βj with j ∈
{0, 30, 50, 70, 100} are plotted against t. These quantiles of the traditional ridge estimates of these elements of β
are constant over t (left panel of Figure 3.3). Those of the update ridge estimates (right panel of Figure 3.3) clearly
improve as t increases. The improvement is two-fold: i) they become less biased, and ii) the distance between the
5% and 95% quantiles vanishes.
The simulation results can be underpinned theoretically. Theorem 3.1 states (under conditions) the consistency
in t of the sequence for the estimation of β and the associated linear predictor.
Theorem 3.1 (Theorem 2 of van Wieringen and Binder, 2022)
Adopt assumption A1 and let {β̂t (λt )}∞ t=1 the corresponding sequence of update ridge regression estimators (3.6).
Furthermore,
i) initiate the estimator sequence by any nonrandom target.
ii) let ∩∞t=T null(Xt ) = 0p for sufficiently large T ∈ N.
−2
iii) choose the regularization scheme {λt }∞ 2 2
t=1 such that limt→∞ σε p d1 (Xt ) λt = 0 with d1 (Xt ) the largest
singular value of Xt .
Then, for every c > 0:
The updating procedure above is a frequentist analogue of Bayesian updating (Berger, 2013). From the
Bayesian perspective, the ridge penalty corresponds to a normal prior on the regression parameter (see Chapter 2).
With this normal prior, the posterior of β is also normal. In turn, this normal posterior serves as (normal) prior in
the next update of β. The prior for the t + 1-th update then becomes βt+1 | σ 2 ∼ N [βt (λt , βt−1 ), σ 2 λ−1t+1 Ipp ],
which yields a posterior mean E[βt+1 | Yt+1 , Xt+1 , σ 2 , βt (λt , βt−1 )] coinciding with the frequentist estimator
β̂t+1 (λt+1 , βt ). □
3.1 Moments
The expectation and variance of β̂(∆) are obtained through application of the same matrix algebra and expectation
and covariance rules used in the derivation of their counterparts of the ‘regular’ ridge regression estimator. This
leads to:
where σ 2 is the error variance. From these expressions similar limiting behaviour as for the ‘regular’ ridge re-
gression case can be deduced. To this end let Vδ Dδ Vδ⊤ be the eigendecomposition of ∆ and dδ,j = (Dδ )jj .
3.2 The Bayesian connection 65
Furthermore, define (with some abuse of notation) lim∆→∞ as the limit of all dδ,j simultaneously tending to
infinity. Then, lim∆→∞ E[β̂(∆, β0 )] = β0 and lim∆→∞ Var[β̂(∆, β0 )] = 0pp .
Example 3.5
Let X be an n × p-dimensional, orthonormal design matrix with p ≥ 2. Contrast the regular and generalized ridge
regression estimator, the latter with W = Ipp , β0 = 0p and ∆ = λR where R = (1 − ρ)Ipp + ρ1pp for ρ ∈
(−(p−1)−1 , 1). For ρ = 0 the two estimators coincide. The variance of the generalized ridge regression estimator
then is Var[β̂(∆, β0 = 0p )] = (Ipp +∆)−2 . The efficiency of this estimator, measured by its generalized variance,
is:
This efficiency attains its minimum at ρ = 0. In the present case, the regular ridge regression estimator is thus
more efficient than its generalized counterpart. □
Hence, irrespective of the choice of ∆, the generalized ridge is then unbiased. Thus:
When ∆ = λIpp , this MSE is smaller than that of the ML regression estimator, irrespective of the choice of λ. □
with
⊤
gβ|σ2 ,Y,X (β | σ 2 , Y, X) ∝ exp{− 21 σ −2 β − β̂(∆) (X⊤ X + ∆) β − β̂(∆) }.
This implies E(β | σ 2 , Y, X) = β̂(∆, β0 ). Hence, the generalized ridge regression estimator too can be viewed
as the Bayesian posterior mean estimator of β when imposing a multivariate Gaussian prior on the regression
parameter.
The Bayesian formulation of our generalization of ridge regression provides additional intuition of the penalty
parameters β0 and ∆. For instance, a better initial guess, i.e. β0 , yields a better posterior mean and mode. Or,
less uncertainty in the prior, i.e. a smaller (in the positive definite ordering sense) variance ∆, yields a more
concentrated posterior. These claims are underpinned in Question 3.6. The intuition they provide guides in the
choice of penalty parameters β0 and ∆.
3.3 Application
An illustration involving omics data can be found in the explanation of a gene’s expression levels in terms of
its DNA copy number. The latter is simply the number of gene copies encoded in the DNA. For instance, for
most genes on the autosomal chromosomes the DNA copy number is two, as there is a single gene copy on
66 Generalizing ridge regression
each chromosome and autosomal chromosomes come in pairs. Alternatively, in males the copy number is one
for genes that map to the X or Y chromosome, while in females it is zero for genes on the Y chromosome. In
cancer the DNA replication process has often been compromised leading to a (partially) reshuffled and aberrated
DNA. Consequently, the cancer cell may exhibit gene copy numbers well over a hundred for classic oncogenes.
A faulted replication process does – of course – not nicely follow the boundaries of gene encoding regions. This
causes contiguous genes to commonly share aberrated copy numbers. With genes being transcribed from the DNA
and a higher DNA copy number implying an enlarged availability of the gene’s template, the latter is expected
to lead to elevated expression levels. Intuitively, one expects this effect to be localized (a so-called cis-effect),
but some suggest that aberrations elsewhere in the DNA may directly affect the expression levels of distant genes
(referred to as a trans-effect). Figure 3.4 shows a cartoon of the cis- and trans-effect of DNA copy number on
transcription.
Figure 3.4: Illustration of the cis- and trans-effect of DNA copy number on gene expression levels.
The cis- and trans-effects of DNA copy aberrations on the expression levels of the KRAS oncogene in colorec-
tal cancer are investigated. Data of both molecular levels from the TCGA (The Cancer Genome Atlas) repository
are downloaded (Cancer Genome Atlas Network, 2012). The gene expression data are limited to that of KRAS,
while for the DNA copy number data only that of chromosome 12, which harbors KRAS, is retained. This leaves
genomic profiles of 195 samples comprising 927 aberrations. Both molecular data types are zero centered feature-
wise. Moreover, the data are limited to ten – conveniently chosen? – samples. The KRAS expression levels are
explained by the DNA copy number aberrations through the linear regression model. The model is fitted by means
of ridge regression, with λ∆ and ∆ = Ipp and a single-banded ∆ with unit diagonal and the elements of the first
off-diagonal equal to the arbitrary value of −0.4. The latter choice appeals to the spatial structure of the genome
and encourages similar regression estimates for contiguous DNA copy numbers. The penalty parameter is chosen
by means of leave-one-out cross-validation using the squared error loss.
Listing 3.1 R code
# load libraries
library(cgdsr)
library(biomaRt)
library(Matrix)
# get available case lists (collection of samples) for a given cancer study
mycancerstudy <- getCancerStudies(mycgds)[37,1]
mycaselist <- getCaseLists(mycgds,mycancerstudy)[1,1]
# get data slices for a specified list of genes, genetic profile and case list
cnData <- numeric()
geData <- numeric()
geneInfo <- numeric()
for (j in 1:nrow(geneList)){
geneName <- as.character(geneList[j,1])
geneData <- getProfileData(mycgds, geneName, c(cnProf, mrnaProf), mycaselist)
if (dim(geneData)[2] == 2 & dim(geneData)[1] > 0){
cnData <- cbind(cnData, geneData[,1])
geData <- cbind(geData, geneData[,2])
geneInfo <- rbind(geneInfo, geneList[j,])
}
}
colnames(cnData) <- rownames(geneData)
colnames(geData) <- rownames(geneData)
# preprocess data
Y <- geData[, match("KRAS", geneInfo[,1]), drop=FALSE]
Y <- Y - mean(Y)
X <- sweep(cnData, 2, apply(cnData, 2, mean))
# subset data
idSample <- c(50, 58, 61, 75, 66, 22, 67, 69, 44, 20)
Y <- Y[idSample]
X <- X[idSample,]
The right panel of Figure 3.2 shows the ridge regression estimate with both choices of ∆ and optimal penalty
parameters plotted against the chromosomal order. The location of KRAS is indicated by a vertical dashed bar.
The ordinary ridge regression estimates show a minor peak at the location of KRAS but is otherwise more or
less flat. In the generalized ridge estimates the peak at KRAS is emphasized. Moreover, the region close to
KRAS exhibits clearly elevated estimates, suggesting co-abberated DNA. For the remainder the generalized ridge
estimates portray a flat surface, with the exception of a single downward spike away from KRAS. Such negative
effects are biologically nonsensible (more gene templates leading to reduced expression levels?). On the whole the
generalized ridge estimates point towards the cis-effect as the dominant genomic regulation mechanism of KRAS
expression. The isolated spike may suggest the presence of a trans-effect, but its sign is biological nonsensible
and the spike is fully absent in the ordinary ridge estimates. This leads us to ignore the possibility of a genomic
trans-effect on KRAS expression levels in colorectal cancer.
The sample selection demands justification. It yields a clear illustrate-able difference between the ordinary
and fused ridge estimates. When all samples are left in, the cis-effect is clearly present, discernable from both
estimates that yield a virtually similar profile.
From this last expression it becomes clear how this estimator generalizes the ‘regular ridge estimator’. The latter
shrinks all eigenvalues, irrespectively of their size, in the same manner through a common penalty parameter. The
‘generalized ridge estimator’, through differing penalty parameters (i.e. the diagonal elements of Λ), shrinks them
individually.
The generalized ridge estimator coincides with the Bayesian linear regression estimator with the normal prior
N [0p , (Vx ΛVx⊤ )−1 ] on the regression parameter β (and preserving the inverse gamma prior on the error vari-
ance). Assume X to be of full column rank and choose Λ = g −1 D⊤ x Dx with g a positive scalar. The prior on β
then – assuming (X⊤ X)−1 exists – reduces to Zellner’s g-prior: β ∼ N [0p , g(X⊤ X)−1 ] (Zellner, 1986). The
corresponding estimator of the regression coefficient is: β̂(g) = g(1+g)−1 (X⊤ X)−1 X⊤ Y, which is proportional
to the unpenalized ordinary least squares estimator of β.
For convenience of notation in the analysis of the generalized ridge estimator the linear regression model is
usually rewritten as:
where dx,j = (Dx )jj and λj = (Λ)jj . Having α and σ 2 available, it is easily seen (equate the derivative w.r.t.
λj to zero and solve) that the MSE of α̂(Λ) is minimized by λj = σ 2 /αj2 for all j. With α and σ 2 unknown,
Hoerl and Kennard (1970) suggest an iterative procedure to estimate the λj ’s. Initiate the procedure with the
OLS estimates of α and σ 2 , followed by sequentially updating the λj ’s and the estimates of α and σ 2 . An
analytic expression of the limit of this procedure exists (Hemmerle, 1975). This limit, however, still depends on
the observed Y and as such it does not necessarily yield the minimal attainable value of the MSE. This limit may
nonetheless still yield a potential gain in MSE. This is investigated in Lawless (1981). Under a variety of cases it
seems to indeed outperform the OLS estimator, but there are exceptions.
A variation on this theme is presented by Guilkey and Murphy (1975) and dubbed “directed” ridge regression.
Directed ridge regression only applies the above ‘generalized shrinkage’ in those eigenvector directions that have
a corresponding small(er) – than some user-defined cut-off – eigenvalue. This intends to keep the bias low and
yield good (or supposedly better) performance.
3.5 Conclusion
A note of caution to conclude. The generalized ridge penalty is extremely flexible. It can incorporate any prior
knowledge on the parameter values (through specification of β0 ) and the relations among these parameters (via ∆).
While a pilot study or literature may provide a suggestion for β0 , it is less obvious how to choose an informative
∆, although a spatial structure is a nice exception. In general, exact knowledge on the parameters should not be
incorporated implicitly via the penalty (read: prior) but preferably be used explicitly in the model – the likelihood
– itself. Though this may be the viewpoint of a prudent frequentist and a subjective Bayesian might disagree.
3.6 Exercises
Question 3.1
Consider a pathway comprising of three genes called A, B, and C. Let random variables Yi,a , Yi,b , and Yi,c be
the random variable representing the expression of levels of genes A, B, and C in sample i. Hundred realizations,
i.e. i = 1, . . . , n, of Yi,a , Yi,b , and Yi,c are available from an observational study. In order to assess how the
expression levels of gene A are affect by that of genes B and C a medical researcher fits the
with εi ∼ N (0, σ 2 ). This model is fitted by means of ridge regression, but with a separate penalty parameter, λb
and λc , for the two regression coefficients, βb and βc , respectively.
a) Write down the ridge penalized loss function employed by the researcher.
b) Does a different choice of penalty parameter for the second regression coefficient affect the estimation of
the first regression coefficient? Motivate your answer.
c) The researcher decides that the second covariate Yi,c is irrelevant. Instead of removing the covariate from
model, the researcher decides to set λc = ∞. Show that this results in the same ridge estimate for βb as
when fitting (again by means of ridge regression) the model without the second covariate.
Question 3.2
Consider the linear regression model Yi = β1 Xi,1 + β2 Xi,2 + εi for i = 1, 2, 3. Information on the response,
design matrix and relevant summary statistics are:
⊤ −1 0 2 ⊤
⊤ 5 1 ⊤ 5
X = , Y = −1 −1 2 , X X = , and X Y = .
1 −2 1 1 6 3
Question 3.3
Consider the linear regression model Yi = β1 Xi,1 +β2 Xi,2 +εi for i = 1, . . . , n. Suppose estimates of the regres-
sion parameters (β1 , β2 ) of this model are obtained through the minimization of the sum-of-squares augmented
with a ridge-type penalty:
Xn
(Yi − β1 Xi,1 − β2 Xi,2 )2 + λ(β12 + β22 + 2νβ1 β2 ),
i=1
with penalty parameter δ ∈ R>0 . The data scientist is surprised to find that resulting estimate β̂(δ) does
not have the same limiting (in the penalty parameter) behavior as the β̂1 (λ, −1), i.e. limδ→∞ β̂(δ) ̸=
limλ→∞ β̂1 (λ, −1). Explain the misconception of the data scientist.
c) Assume that i) n ≫ 2, ii) the unpenalized estimates (β̂1 (0, 0), β̂2 (0, 0))⊤ equal (−2, 2), and iii) that the
two covariates X1 and X2 are zero-centered, have equal variance, and are strongly negatively correlated.
Consider (β̂1 (λ, ν), β̂2 (λ, ν))⊤ for both ν = −0.9 and ν = 0.9. For which value of ν do you expect the
sum of the absolute value of the estimates to be largest? Hint: Distinguish between small and large values
of λ and think geometrically!
Question 3.4
Show that the genalized ridge regression estimator, β̂(∆) = (X⊤ X + ∆)−1 X⊤ Y, too (as in Question 1.6)
can be obtained by ordinary least squares regression on an augmented data set. Hereto consider the Cholesky
decomposition of the penalty matrix: ∆ = L⊤ L. Now augment the matrix X with p additional rows comprising
the matrix L, and augment the response vector Y with p zeros.
Question 3.5
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0p , σ 2 Ipp ). Assume β ∼ N (β0 , σ 2 ∆−1 ) with
β0 ∈ Rp and ∆ ≻ 0 and a gamma prior on the error variance. Verify (i.e., work out the details of the derivation)
that the posterior mean coincides with the generalized ridge estimator defined as:
β̂ = (X⊤ X + ∆)−1 (X⊤ Y + ∆β0 ).
Question 3.6
Consider the Bayesian linear regression model Y = Xβ + ε with ε ∼ N (0n , σ 2 Inn ), a multivariate normal law
as conditional prior distribution on the regression parameter: β | σ 2 ∼ N (β0 , σ 2 ∆−1 ), and an inverse gamma
prior on the error variance σ 2 ∼ IG (γ, δ). The consequences of various choices for the hyper parameters of the
prior distribution on β are studied.
a) Consider the following conditional prior distributions on the regression parameters β | σ 2 ∼ N (β0 , σ 2 ∆−1
a )
p
and β | σ 2 ∼ N (β0 , σ 2 ∆−1
b ) with precision matrices ∆a , ∆b ∈ S++ such that ∆a ⪰ ∆b , i.e. ∆a =
∆b + D for some positive semi-definite symmetric matrix of appropriate dimensions. Verify:
Var(β | σ 2 , Y, X, β0 , ∆a ) ⪯ Var(β | σ 2 , Y, X, β0 , ∆b ),
i.e. the smaller (in the positive definite ordering) the variance of the prior the smaller that of the posterior.
b) In the remainder of this exercise assume ∆a = ∆ = ∆b . Let βt be the ‘true’ or ‘ideal’ value of the regres-
sion parameter, that has been used in the generation of the data, and show that a better initial guess yields a
better posterior probability at βt . That is, take two prior mean parameters β0 = β0(a) and β0 = β0(b) such that
the former is closer to βt than the latter. Here close is defined in terms of the Mahalabonis distance, which
for, e.g. βt and β0(a) is defined as dM (βt , β0(a) ; Σ) = [(βt − β0(a) )⊤ Σ−1 (βt − β0(a) )]1/2 with positive definite
covariance matrix Σ with Σ = σ 2 ∆−1 . Show that the posterior density πβ | σ2 (β | σ 2 , X, Y, β0(a) , ∆) is
larger at β = βt than with the other prior mean parameter.
3.6 Exercises 71
c) Adopt the assumptions of part b) and show that a better initial guess yields a better posterior mean. That is,
show
Question 3.7
The ridge penalty may be interpreted as a multivariate normal prior on the regression coefficients: β ∼ N (0, λ−1 Ipp ).
Different priors may be considered. In case the covariates are spatially related in some sense (e.g. genomically),
it may of interest to assume a first-order autoregressive prior: β ∼ N (0, λ−1 Σa ), in which Σa is a (p × p)-
dimensional correlation matrix with (Σa )j1 ,j2 = ρ|j1 −j2 | for some correlation coefficient ρ ∈ [0, 1). Hence,
. . . ρp−1
1 ρ
ρ 1 . . . ρp−2
Σa = . .. .
. ..
.. .. . .
ρp−1 ρp−2 ... 1
a) The penalized loss function associated with this AR(1) prior is:
Question 3.8
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n. Suppose estimates of the
regression parameters β of this model are obtained through the minimization of the sum-of-squares augmented
with a ridge-type penalty:
for known α ∈ [0, 1], nonrandom p-dimensional target vectors βt,a and βt,b with βt,a ̸= βt,b , and penalty
parameter λ > 0. Here Y = (Y1 , . . . , Yn )⊤ and X is n × p matrix with the n row-vectors Xi,∗ stacked.
a) When p > n the sum-of-squares part does not have a unique minimum. Does the above employed penalty
warrant a unique minimum for the loss function above (i.e., sum-of-squares plus penalty)? Motivate your
answer.
b) Could it be that for intermediate values of α, i.e. 0 < α < 1, the loss function assumes smaller values than
for the boundary values α = 0 and α = 1? Motivate your answer.
c) Draw the parameter constraint induced by this penalty for α = 0, 0.5 and 1 when p = 2
d) Derive the estimator of β, defined as the minimum of the loss function, explicitly.
e) Discuss the behaviour of the estimator α = 0, 0.5 and 1 for λ → ∞.
Question 3.9
Revisit Exercise 1.3. There the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with
εi ∼i.i.d. N (0, σ 2 ) is considered. The model comprises a single covariate and an intercept. Response and
covariate data are: {(yi , xi,1 )}4i=1 = {(1.4, 0.0), (1.4, −2.0), (0.8, 0.0), (0.4, 2.0)}.
a) Evaluate the generalized ridge regression estimator of β with target β0 = 0₂ and penalty matrix ∆ given by (∆)11 = λ = (∆)22 and (∆)12 = ½λ = (∆)21, in which λ = 8.
b) A data scientist wishes to leave the intercept unpenalized. Hereto s/he sets in part a) (∆)11 = 0. Why does
the resulting estimate not coincide with the answer to Exercise 1.3? Motivate.
Question 3.10
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0n , σε2 Inn ). This model (without intercept) is
fitted to data using the generalized ridge regression estimator β̂(λ) = arg minβ ∥Y − Xβ∥₂² + λβ⊤∆β with λ > 0.
The penalty matrix ∆ and the data are:
\begin{align*}
\Delta = \begin{pmatrix} 4 & c \\ c & 1 \end{pmatrix}, \qquad X = \begin{pmatrix} -2 & 1 \end{pmatrix}, \qquad \mbox{and} \qquad Y = -2.
\end{align*}
Question 3.11
Consider the linear regression model: Y = Xβ +ε with ε ∼ N (0n , σ 2 Inn ). Let β̂(λ) = (X⊤ X+λIpp )−1 X⊤ Y
be the ridge regression estimator with penalty parameter λ. The shrinkage of the ridge regression estimator
propagates to the scale of the ‘ridge prediction’ Xβ̂(λ). To correct (a bit) for the shrinkage, de Vlaming and
Groenen (2015) propose the alternative ridge regression estimator: β̂(α) = [(1 − α)X⊤ X + αIpp ]−1 X⊤ Y with
shrinkage parameter α ∈ [0, 1].
a) Let α = λ(1 + λ)−1 . Show that β̂(α) = (1 + λ)β̂(λ) with β̂(λ) as in the introduction above.
b) Use part a) and the parametrization of α provided there to show that some of the shrinkage has been undone.
That is, show: Var[Xβ̂(λ)] < Var[Xβ̂(α)] for any λ > 0.
c) Use the singular value decomposition of X to show that limα↓0 β̂(α) = (X⊤ X)−1 X⊤ Y (should it exist)
and limα↑1 β̂(α) = X⊤ Y.
d) Derive the expectation, variance and mean squared error of β̂(α).
e) Temporarily assume that p = 1 and let X⊤X = c for some c > 0. Then, MSE[β̂(α)] = [(1 − α)c + α]⁻²[α²(c − 1)²β² + σ²c]. Does there exist an α ∈ (0, 1) such that the mean squared error of β̂(α) is smaller
than that of its maximum likelihood counterpart? Motivate.
f) Now assume p > 1 and an orthonormal design matrix. Specify the regularization path of the alternative
ridge regression estimator β̂(α).
Question 3.12
Consider the linear regression model Y = Xβ + ε with ε ∼ N(0n, σ²Inn). Goldstein & Smith (1974) proposed
a novel generalized ridge estimator of its p-dimensional regression parameter:
4 Mixed model
Here the mixed model introduced by Henderson (1953), which generalizes the linear regression model, is studied
and estimated in unpenalized (!) fashion. Nonetheless, it will turn out to have an interesting connection to ridge
regression. This connection may be exploited to arrive at an informed choice of the ridge penalty parameter.
The linear regression model, Y = Xβ + ε, assumes the effect of each covariate to be fixed. In certain
situations it may be desirable to relax this assumption. For instance, a study may be replicated. Conditions
need not be exactly constant across replications, for instance due to batch effects. These may be accounted for by incorporating them in the linear regression model. But it is not the effects of the particular batches included in the study that are of interest. Had the study been carried out at a later date, other batches would have been involved. The included batches are thus a random draw from the population of all
batches. With each batch possibly having a different effect, these effects may also be viewed as random draws
from some hypothetical ‘effect’-distribution. From this point of view the effects estimated by the linear regression
model are realizations from the ‘effect’-distribution. But interest is not in the particular but the general. Hence, a
model that enables a generalization to the distribution of batch effects would be more suited here.
Like the linear regression model the mixed model, also called mixed effects model or random effects model,
explains the variation in the response by a linear combination of the covariates. The key difference lies in the fact
that the latter model distinguishes two sets of covariates, one with fixed effects and the other with random effects.
In matrix notation mirroring that of the linear regression model, the mixed model can be written as:
Y = Xβ + Zγ + ε,
where Y is the response vector of length n, X the (n × p)-dimensional design matrix with associated p-dimensional vector β of fixed effects, Z the (n × q)-dimensional design matrix with associated q-dimensional vector γ of random
effects, and distributional assumptions ε ∼ N (0n , σε2 Inn ), γ ∼ N (0q , Rθ ) and ε and γ independent. In this Rθ
is symmetric, positive definite and parametrized by a low-dimensional parameter θ.
The distribution of Y is fully defined by the mixed model and its accompanying assumptions. As Y is a linear
combination of normally distributed random variables, it is itself normally distributed. Its mean is E(Y) = Xβ, while its variance is:
Var(Y) = Var(Xβ) + Var(Zγ) + Var(ε) = ZVar(γ)Z⊤ + σε²Inn = ZRθZ⊤ + σε²Inn,
in which the independence between ε and γ and the standard algebra rules for the Var(·) and Cov(·) operators
have been used. Put together, this yields: Y ∼ N (Xβ, ZRθ Z⊤ + σε2 Inn ). Hence, the random effects term Zγ
of the mixed model does not contribute to the explanation of the mean of Y, but aids in the decomposition of its
variance around the mean. From this formulation of the model it is obvious that the random part of two distinct
observations of the response are – in general – not independent: their covariance is given by the corresponding
element of ZRθ Z⊤ . Put differently, due to the independence assumption on the error two observations can only
be (marginally) dependent through the random effect which is attenuated by the associated design matrix Z. To
illustrate this, temporarily set Rθ = σγ2 Iqq . Then, Var(Y) = σγ2 ZZ⊤ + σε2 Inn . From this it is obvious that
two variates of Y are now independent if and only if the corresponding rows of Z are orthogonal. Moreover,
two pairs of variates have the same covariance if they have the same covariate information in Z. Two distinct
observations of the same individual have the same covariance as one of these observations with that of another
individual with identical covariate information as the left-out observation on the former individual. In particular,
their ‘between-covariance’ equals their individual ‘within-covariance’.
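For concreteness, the covariance structure just described is easily inspected numerically. The following sketch (an arbitrary toy design matrix Z and made-up variance values, purely for illustration) forms Var(Y) = σγ²ZZ⊤ + σε²Inn and shows that its off-diagonal elements vanish exactly for pairs of observations whose rows of Z are orthogonal.

# toy illustration of Var(Y) = sigma_gamma^2 * Z Z' + sigma_e^2 * I (arbitrary values)
n <- 4; q <- 2
Z <- rbind(c(1, 0),      # observations 1 and 2 load on the first random effect
           c(1, 0),
           c(0, 1),      # observations 3 and 4 load on the second random effect
           c(0, 1))
sigma2gamma <- 2; sigma2e <- 1
varY <- sigma2gamma * Z %*% t(Z) + sigma2e * diag(n)
varY                     # off-diagonal zero <=> corresponding rows of Z orthogonal
tcrossprod(Z)            # rows 1-2 and rows 3-4 are non-orthogonal pairs, hence dependent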
The mixed model and the linear regression model are clearly closely related: they share a common mean, a
normally distributed error, and both explain the response by a linear combination of the explanatory variables.
Moreover, when γ is known, the mixed model reduces to a linear regression model. This is seen from the condi-
tional distribution of Y: Y | γ ∼ N (Xβ + Zγ, σε2 Inn ). Conditioning on the random effect γ thus pulls in the
term Zγ to the systematic, non-random explanatory part of the model. In principle, the conditioned mixed model
could now be rewritten as a linear regression model by forming a new design matrix and parameter from X and Z
and β and γ, respectively.
Example 4.1 (Mixed model for a longitudinal study)
A longitudinal study looks into the growth rate of cells. At the beginning of the study cells are placed in n petri
dishes, with the same growth medium but at different concentrations. The initial number of cells in each petri dish
is counted as is done at several subsequent time points. The change in cell count is believed to be – at the log-scale
– a linear function of the concentration of the growth medium. The linear regression model may suffice. However,
variation is omnipresent in biology. That is, apart from variation in the initial cell count, each cell – even if of common descent – will react (slightly?) differently to the stimulus of the growth medium. This intrinsic cell-to-cell variation in growth response may be accommodated in the linear mixed model by the introduction
of a random cell effect, both in off-set and slope. The (log) cell count of petri dish i at time point t, denoted Yit ,
is thus described by:
Yit = β0 + Xi β1 + Zi γ + εit ,
with intercept β0, growth medium concentration Xi in petri dish i, fixed growth medium effect β1, Zi = (1, Xi), and γ the 2-dimensional random-effect parameter, bivariate normally distributed with zero mean and
diagonal covariance matrix, and finally εit ∼ N (0, σε2 ) the error in the cell count of petri dish i at time t. In
matrix notation the matrix Z would comprise 2n columns: two columns for each petri dish, ei and Xi ei (with ei the n-dimensional unit vector with a one at the i-th location and zeros elsewhere), corresponding to the random intercept and slope effect, respectively. The fact that the number of columns of Z, i.e. the explanatory random effects, equals 2n does not pose identifiability problems as per column only a single parameter is estimated. Finally, to illustrate the difference between the linear regression and the linear mixed model their fits on artificial data are plotted (top left panel, Figure 4.1). Where the linear regression fit shows the ‘grand mean relationship’ between cell count and growth medium, the linear mixed model fit depicts the petri dish specific fits. □
Figure 4.1: Linear regression (thick black solid line) vs. mixed model fit (thin colored and patterned lines).
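To make the construction of Z in Example 4.1 concrete, a small sketch follows with hypothetical dimensions and concentration values (not taken from the example): two random-effect columns per petri dish, a random intercept column ei and a random slope column Xi ei, with rows ordered by dish and time.

# Sketch: random-effect design matrix for Example 4.1 (hypothetical values)
n <- 3                          # number of petri dishes
timepts <- 4                    # measurements per dish
Xconc <- c(0.5, 1.0, 2.0)       # growth medium concentration per dish (made up)
Z <- matrix(0, nrow = n * timepts, ncol = 2 * n)
for (i in 1:n){
  rows <- (i - 1) * timepts + 1:timepts
  Z[rows, 2*i - 1] <- 1           # random intercept column e_i
  Z[rows, 2*i]     <- Xconc[i]    # random slope column X_i * e_i
}
Z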
The mixed model was motivated by its ability to generalize to instances not included in the study. From the
examples above another advantage can be deduced. E.g., the cells’ effects are modelled by a single parameter
(rather than one per cell). More degrees of freedom are thus left to estimate the noise level. In particular, a test for
the presence of a cell effect will have more power.
The parameters of the mixed model are estimated either by means of likelihood maximization or a related pro-
cedure known as restricted maximum likelihood. Both are presented, with the exposé loosely based on Bates
and DebRoy (2004). First the maximum likelihood procedure is introduced, which requires the derivation of the
likelihood. Hereto the assumption on the random effects is usually transformed. Let R̃θ = σε⁻²Rθ, which is the covariance of the random effects parameter relative to the error variance, and let R̃θ = LθLθ⊤ be its Cholesky decomposition. Next define the change-of-variables γ = Lθγ̃. This transforms the model to: Y = Xβ + ZLθγ̃ + ε
but now with the assumption γ̃ ∼ N (0q , σε2 Iqq ). Under this assumption the conditional likelihood, conditional
on the random effects, is:
To evaluate the integral, the exponent needs rewriting. Hereto first note that:
where in the last step Sylvester’s determinant identity has been used.
The maximum likelihood estimators of the mixed model parameters β, σε2 and R̃θ are found through the
maximization of the logarithm of the likelihood (4.1). Find the roots of the partial derivatives of this log-likelihood
with respect to the mixed model parameters. For β and σε2 this yields:
The former estimate can be substituted into the latter to remove its dependency on β. However, both estimators
still depend on θ. An estimator of θ may be found by substitution of β̂ and σ̂ε2 into the log-likelihood followed by
its maximization. For general parametrizations of R̃θ by θ there are no explicit solutions. Then, resort to standard
nonlinear solvers such as the Newton-Raphson algorithm and the like. With a maximum likelihood estimate of
θ at hand, those of the other two mixed model parameters are readily obtained from the formulas above. As θ
is unknown at the onset, it needs to be initiated followed by sequential updating of the parameter estimates until
convergence.
Restricted maximum likelihood (REML) considers the fixed effect parameter β as a ‘nuisance’ parameter
and concentrates on the estimation of the variance components. The nuisance parameter is integrated out of the likelihood, ∫_{ℝ^p} L(Y) dβ, which is referred to as the restricted likelihood. Those values of θ (and thereby R̃θ) and
σε2 that maximize the restricted likelihood are the REML estimators. The restricted likelihood, by an argument
similar to that used in the derivation of the likelihood, simplifies to:
\begin{align*}
\int_{\mathbb{R}^p} L(Y) \, d\beta
&= (2\pi\sigma_\varepsilon^2)^{-n/2} |\tilde{Q}_\theta|^{-1/2} \exp\{-\tfrac{1}{2}\sigma_\varepsilon^{-2} Y^\top [\tilde{Q}_\theta^{-1} - \tilde{Q}_\theta^{-1} X (X^\top \tilde{Q}_\theta^{-1} X)^{-1} X^\top \tilde{Q}_\theta^{-1}] Y\} \\
& \quad \times \int_{\mathbb{R}^p} \exp\{-\tfrac{1}{2}\sigma_\varepsilon^{-2} [\beta - (X^\top \tilde{Q}_\theta^{-1} X)^{-1} X^\top \tilde{Q}_\theta^{-1} Y]^\top X^\top \tilde{Q}_\theta^{-1} X \, [\beta - (X^\top \tilde{Q}_\theta^{-1} X)^{-1} X^\top \tilde{Q}_\theta^{-1} Y]\} \, d\beta \\
&= (2\pi\sigma_\varepsilon^2)^{-(n-p)/2} |\tilde{Q}_\theta|^{-1/2} |X^\top \tilde{Q}_\theta^{-1} X|^{-1/2} \exp\{-\tfrac{1}{2}\sigma_\varepsilon^{-2} Y^\top [\tilde{Q}_\theta^{-1} - \tilde{Q}_\theta^{-1} X (X^\top \tilde{Q}_\theta^{-1} X)^{-1} X^\top \tilde{Q}_\theta^{-1}] Y\},
\end{align*}
where Q̃θ = Inn + ZR̃θ Z⊤ is the relative covariance (relative to the error variance) of Y. The REML estimators
are now found by equating the partial derivatives of this restricted loglikelihood to zero and solving for σε2 and θ.
The former, given the latter, is:
\begin{align*}
\hat{\sigma}_\varepsilon^2 &= \tfrac{1}{n-p} Y^\top [\tilde{Q}_\theta^{-1} - \tilde{Q}_\theta^{-1} X (X^\top \tilde{Q}_\theta^{-1} X)^{-1} X^\top \tilde{Q}_\theta^{-1}] Y \\
&= \tfrac{1}{n-p} Y^\top \tilde{Q}_\theta^{-1/2} [I_{nn} - \tilde{Q}_\theta^{-1/2} X (X^\top \tilde{Q}_\theta^{-1/2} \tilde{Q}_\theta^{-1/2} X)^{-1} X^\top \tilde{Q}_\theta^{-1/2}] \tilde{Q}_\theta^{-1/2} Y,
\end{align*}
where the rewritten form reveals a projection matrix and, consequently, a residual sum of squares. Like the
maximum likelihood estimator of θ, its REML counterpart is generally unknown analytically and to be found
numerically. Iterating between the estimation of both parameters until convergence yields the REML estimators.
Obviously, REML estimation of the mixed model parameters does not produce an estimate of the fixed parameter
β (as it has been integrated out). Should however a point estimate be desired, then in practice the ML estimate of
β with the REML estimates of the other parameters is used.
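For a given θ (and thus Q̃θ), the REML estimate of the error variance above amounts to a few matrix operations. A minimal sketch on simulated data follows (all values hypothetical; the isotropic special case R̃θ = θ Iqq, i.e. Q̃θ = Inn + θZZ⊤, is assumed):

# Sketch: REML estimate of the error variance for a given theta (simulated toy data)
set.seed(1)
n <- 50; p <- 2; q <- 10
X <- cbind(1, rnorm(n))                    # fixed-effect design (intercept + covariate)
Z <- matrix(rnorm(n * q), n, q)            # random-effect design
theta <- 0.5                               # assumed value of sigma_gamma^2 / sigma_e^2
Y <- X %*% c(1, 2) + Z %*% rnorm(q, sd = sqrt(0.5)) + rnorm(n)
QthetaInv <- solve(diag(n) + theta * Z %*% t(Z))     # inverse of Qtilde_theta
Ptheta    <- QthetaInv - QthetaInv %*% X %*%
             solve(t(X) %*% QthetaInv %*% X) %*% t(X) %*% QthetaInv
sigma2e   <- as.numeric(t(Y) %*% Ptheta %*% Y) / (n - p)
sigma2e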
An alternative way to proceed (and insightful for the present purpose) follows the original approach of Henderson,
who aimed to construct a linear predictor for Y.
Definition 4.1
A predictand is the function of the parameters that is to be predicted. A predictor is a function of the data that
predicts the predictand. When this latter function is linear in the observation it is said to be a linear predictor.
In case of the mixed model the predictand is Xnew β + Znew γ for (nnew × p)- and (nnew × q)-dimensional design
matrices Xnew and Znew , respectively. Similarly, the predictor is some function of the data Y. When it can be
expressed as AY for some matrix A it is a linear predictor.
The construction of the aforementioned linear predictor requires estimates of β and γ. To obtain these esti-
mates first derive the joint density of (γ, Y):
\begin{align*}
\begin{pmatrix} \gamma \\ Y \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0_q \\ X\beta \end{pmatrix}, \; \begin{pmatrix} R_\theta & R_\theta Z^\top \\ Z R_\theta & \sigma_\varepsilon^2 I_{nn} + Z R_\theta Z^\top \end{pmatrix} \right).
\end{align*}
From this the likelihood is obtained and after some manipulations the loglikelihood can be shown to be propor-
tional to:
in which – following Henderson – Rθ and σε2 are assumed known (for instance by virtue of maximum likelihood
or REML estimation). The estimators of β and γ are now the minimizers of loss criterion (4.2). Effectively, the
random effect parameter γ is now temporarily assumed to be ‘fixed’. That is, it is temporarily treated as fixed in
the derivations below that lead to the construction of the linear predictor. However, γ is a random variable and
one therefore speaks of a linear predictor rather than linear estimator.
To find the estimators of β and γ, defined as the minimizers of the loss function, equate the partial derivatives of
mixed model loss function (4.2) with respect to β and γ to zero. This yields the estimating equations (also referred
to as Henderson’s mixed model equations):
\begin{align*}
X^\top Y - X^\top X \beta - X^\top Z \gamma &= 0_p, \\
\sigma_\varepsilon^{-2} Z^\top Y - \sigma_\varepsilon^{-2} Z^\top Z \gamma - \sigma_\varepsilon^{-2} Z^\top X \beta - R_\theta^{-1} \gamma &= 0_q.
\end{align*}
Solve each estimating equation for the parameters individually and find:
Note, using the Cholesky decomposition of R̃θ and applying the Woodbury identity twice (in both directions),
that:
\begin{align*}
\hat{\gamma} &= (Z^\top Z + \tilde{R}_\theta^{-1})^{-1} Z^\top (Y - X\beta) \\
&= [\tilde{R}_\theta - \tilde{R}_\theta Z^\top (I_{nn} + Z \tilde{R}_\theta Z^\top)^{-1} Z \tilde{R}_\theta] Z^\top (Y - X\beta) \\
&= L_\theta [I_{qq} - L_\theta^\top Z^\top (I_{nn} + Z L_\theta L_\theta^\top Z^\top)^{-1} Z L_\theta] L_\theta^\top Z^\top (Y - X\beta) \\
&= L_\theta (L_\theta^\top Z^\top Z L_\theta + I_{qq})^{-1} L_\theta^\top Z^\top (Y - X\beta) \\
&= L_\theta \mu_{\tilde{\gamma} \mid Y}.
\end{align*}
It thus coincides with the conditional estimate of γ found in the derivation of the maximum likelihood estimator of the mixed model. This expression could also have been found from the multivariate normal distribution above by conditioning on Y, which would have given E(γ | Y).
The estimator of both β and γ can be expressed fully and explicitly in terms of X, Y, Z and Rθ . To obtain
that of β substitute the estimator of γ of equation (4.4) into that of β given by equation (4.3):
in which the Woodbury identity has been used. Now group terms and solve for β:
β̂ = [X⊤ (ZRθ Z⊤ + σε2 Inn )−1 X]−1 X⊤ (ZRθ Z⊤ + σε2 Inn )−1 Y. (4.5)
This coincides with the maximum likelihood estimator of β presented above (for known Rθ and σε2 ). Moreover, in
the preceding display one recognizes a generalized least squares (GLS) estimator. The GLS regression estimator is BLUE (Best Linear Unbiased Estimator) when Rθ and σε² are known. To find an explicit expression for γ use
Qθ as previously defined and substitute the explicit expression (4.5) for the estimator of β in the estimator of γ,
shown in display (4.4) above. This gives:
Theorem 4.1
The predictor Xβ̂ + Zγ̂ is the BLUP of Ỹ = Xβ + Zγ.
This is also the expectation of the predictand Xβ + Zγ. Hence, the predictor is unbiased.
To show the predictor BY has minimum prediction error variance within the class of unbiased linear predic-
tors, assume the existence of another unbiased linear predictor AY of Xβ + Zγ. The prediction error variance of
the latter predictor is:
where the last step uses AX = BX, which follows from the fact that
from which the minimum variance follows as the first summand on the right-hand side is nonnegative and zero if
and only if A = B. ■
Contrast this to a mixed model void of covariates with fixed effects and comprising only covariates with random effects: Y = Zγ + ε with distributional assumptions γ ∼ N(0q, σγ²Iqq) and ε ∼ N(0n, σε²Inn). This model,
when temporarily considering γ as fixed, is fitted by the minimization of loss function (4.2). The corresponding
estimator of γ is then defined, with the current mixed model assumptions in place, as:
The estimators are – up to a reparametrization of the penalty parameter – defined identically. This should not
come as a surprise after the discussion of Bayesian regression (cf. Chapter 2) and the alert reader would already
have recognized a generalized ridge loss function in Equation (4.2). The fact that we discarded the fixed effect
part of the mixed model is irrelevant for the analogy as those would correspond to unpenalized covariates in the
ridge regression problem.
The link with ridge regression is also evident from the linear predictor of the random effect. Recall: γ̂ = (Z⊤Z + Rθ⁻¹)⁻¹Z⊤(Y − Xβ). When we ignore Rθ⁻¹, the predictor reduces to a least squares estimator. But with a symmetric and positive definite matrix Rθ⁻¹, the predictor is of the shrinkage type, as is the ridge regression estimator. This shrinkage estimator also reveals, through the term (Z⊤Z + Rθ⁻¹)⁻¹, that a q larger than n does not cause identifiability problems as long as Rθ is parametrized low-dimensionally enough.
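To make the analogy concrete, the sketch below (simulated data, with Rθ = σγ²Iqq and both variance components assumed known) solves Henderson’s mixed model equations and shows that the resulting γ̂ is a ridge-type estimator with penalty parameter σε²/σγ² applied to the residuals Y − Xβ̂.

# Sketch: Henderson's mixed model equations as a ridge-type estimator (toy data)
set.seed(2)
n <- 40; p <- 2; q <- 60
X <- cbind(1, rnorm(n)); Z <- matrix(rnorm(n * q), n, q)
sigma2e <- 1; sigma2g <- 0.25                        # assumed known variance components
Y <- X %*% c(1, -1) + Z %*% rnorm(q, sd = sqrt(sigma2g)) + rnorm(n, sd = sqrt(sigma2e))
# GLS estimator of beta, cf. (4.5)
V    <- sigma2g * Z %*% t(Z) + sigma2e * diag(n)
beta <- solve(t(X) %*% solve(V, X), t(X) %*% solve(V, Y))
# random-effect predictor: a generalized ridge estimator with lambda = sigma2e / sigma2g
lambda <- sigma2e / sigma2g
gamma  <- solve(t(Z) %*% Z + lambda * diag(q), t(Z) %*% (Y - X %*% beta))
head(gamma)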
The following mixed model result provides an alternative approach to the choice of the penalty parameter in ridge regression. It assumes a mixed model comprising the random effects part only. Or, put differently, it assumes the linear regression model Y = Xβ + ε with β ∼ N(0p, σβ²Ipp) and ε ∼ N(0n, σε²Inn).
Theorem 4.2 (Theorem 2 of Golub et al.)
The expected generalized cross-validation error Eβ {Eε [GCV (λ)]} is minimized for λ = σε2 /σβ2 .
Proof. The proof first finds an analytic expression of the expected GCV (λ), then its minimum. Its expectation
can be re-expressed as follows:
\begin{align*}
E_\beta\{E_\varepsilon[\mbox{GCV}(\lambda)]\}
&= E_\beta\big(E_\varepsilon\big[\tfrac{1}{n} \{\mbox{tr}[I_{nn} - H(\lambda)]/n\}^{-2} \, \|[I_{nn} - H(\lambda)] Y\|_2^2\big]\big) \\
&= n \{\mbox{tr}[I_{nn} - H(\lambda)]\}^{-2} \, E_\beta\big(E_\varepsilon\{Y^\top [I_{nn} - H(\lambda)]^2 Y\}\big) \\
&= n \{\mbox{tr}[I_{nn} - H(\lambda)]\}^{-2} \, E_\beta\big(E_\varepsilon \, \mbox{tr}\{(X\beta + \varepsilon)^\top [I_{nn} - H(\lambda)]^2 (X\beta + \varepsilon)\}\big) \\
&= n \{\mbox{tr}[I_{nn} - H(\lambda)]\}^{-2} \big( \mbox{tr}\{[I_{nn} - H(\lambda)]^2 X \, E_\beta[E_\varepsilon(\beta\beta^\top)] \, X^\top\} + \mbox{tr}\{[I_{nn} - H(\lambda)]^2 E_\beta[E_\varepsilon(\varepsilon\varepsilon^\top)]\}\big).
\end{align*}
Equate the derivative of this expectation w.r.t. λ to zero, which can be seen to be proportional to:
Theorem 4.2 can be extended to include unpenalized covariates. This leaves the result unaltered: the optimal (in
the expected GCV sense) ridge penalty is the same signal-to-noise ratio.
We have encountered the result of Theorem 4.2 before. Revisit Example 1.6, which derived the mean squared error (MSE) of the ridge regression estimator when X is orthonormal. It was pointed out that this MSE is minimized for λ = pσε²/β⊤β. As β⊤β/p is an estimator of σβ², this implies the same optimal choice of the penalty parameter.
To point out the relevance of Theorem 4.2 for the choice of the ridge penalty parameter still assume the
regression parameter random. The theorem then says that the optimal penalty parameter (in the GCV sense)
equals the ratio of the error variance and that of the regression parameter. Both variances can be estimated by
means of the mixed model machinery (provided for instance by the lme4 package in R). These estimates may be plugged into the ratio to arrive at a choice of the ridge penalty parameter (see Section 4.3 for an illustration of this usage).
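Theorem 4.2 is also easily illustrated numerically. The following sketch (simulated data with hypothetical σβ² and σε²; not part of the original illustration) evaluates GCV(λ) over a grid and compares its minimizer with σε²/σβ².

# Sketch: the GCV-optimal ridge penalty is, on average, close to sigma_e^2 / sigma_beta^2
set.seed(3)
n <- 100; p <- 25
sigma2beta <- 0.5; sigma2e <- 2                      # hypothetical variance components
X    <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p, sd = sqrt(sigma2beta))
Y    <- X %*% beta + rnorm(n, sd = sqrt(sigma2e))
gcv  <- function(lambda){
    H <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X))   # ridge hat matrix H(lambda)
    sum(((diag(n) - H) %*% Y)^2) / n / (mean(1 - diag(H)))^2
}
lambdas <- exp(seq(log(0.1), log(100), length.out = 50))
lambdas[which.min(sapply(lambdas, gcv))]   # on average (over repetitions) near sigma2e / sigma2beta = 4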
In the REML estimating equations, Pθ = Q̃θ⁻¹ − Q̃θ⁻¹X(X⊤Q̃θ⁻¹X)⁻¹X⊤Q̃θ⁻¹ and Q̃θ = Inn + θZZ⊤. To arrive at the REML estimators, choose initial values for the parameters. Take one of the estimating equations, substitute the initial value of one of the parameters, and solve for the other. The found root is then substituted into the other estimating equation, which is subsequently solved for the remaining parameter. Iterate between these two steps until convergence. The discussion of the practical evaluation of a root for θ from these estimating equations in a high-dimensional context is postponed to the next section.
The employed linear mixed model assumes that each of the q covariates included as a column in Z contributes
to the variation of the response. However, it may be that only a fraction of these covariates exerts any influence on
the response. That is, the random effect parameter γ is sparse, which could be operationalized as γ having q0 zero
elements while the remaining qc = q − q0 elements are non-zero. Only for the latter qc elements of γ does the normal assumption make sense; it is invalid for the q0 zeros in γ. The posed mixed model is then misspecified.
The next theorem states that the REML estimators of θ = σγ2 /σε2 and σε2 are consistent (possibly after adjust-
ment, see the theorem), even under the above mentioned misspecification.
where τ, ω ∈ (0, 1]. Finally, suppose that σε2 and σγ2 are positive. Then:
i) The ‘adjusted’ REML estimator of the variance ratio σγ²/σε² is consistent:
\begin{align*}
\frac{q}{q_c} \, \widehat{(\sigma_\gamma^2/\sigma_\varepsilon^2)} \; \overset{P}{\longrightarrow} \; \sigma_\gamma^2/\sigma_\varepsilon^2.
\end{align*}
ii) The REML estimator of the error variance is consistent: σ̂ε² →P σε².
Before the interpretation and implication of Theorem 4.3 are discussed, its conditions for the consistency result
are reviewed:
◦ The standardization and distribution assumption on the design matrix of the random effects has no direct
practical interpretation. These conditions warrant the applicability of certain results from random matrix
theory upon which the proof of the theorem hinges.
◦ The positive variance assumption σε2 , σγ2 > 0, in particular that of the random effect parameter, effectively
states that some – possibly misspecified – form of the mixed model applies.
◦ Practically most relevant are the conditions on the sample size, random effect dimension, and sparsity.
The τ and ω in Theorem 4.3 are the limiting ratios of the sample size n and the non-zero random effects qc,
respectively, to the total number of random effects q. The number of random effects thus exceeds the sample
size, as long as the latter grows (in the limit) at some fixed rate with the former. Independently, the model
may be misspecified. The sparsity condition only requires that (in the limit) a fraction of the random effects
is nonzero.
Now discuss the interpretation and relevance of the theorem:
◦ Theorem 4.3 complements the classical low-dimensional consistency results on the REML estimator.
◦ Theorem 4.3 shows that not all (i.e. consistency) is lost when the model is misspecified.
◦ The practical relevance of the part i) of Theorem 4.3 is limited as the number of nonzero random effects qc ,
or ω for that matter, is usually unknown. Consequently, the REML estimator of the variance ratio σγ2 /σε2
cannot be adjusted correctly to achieve asymptotic unbiasedness and – thereby – consistency.
◦ Part ii) in its own right may not seem very useful. But it is surprising that high-dimensionally (i.e. when
the dimension of the random effect parameter exceeds the sample size) the standard (that is, derived for
low-dimensional data) REML estimator of σε2 is consistent. Beyond this surprise, a good estimator of σε2
indicates how much of the variation in the response cannot be attributed to the covariates represented by
the columns of Z. A good indication of the noise level in the data finds use at many places. In particular, it is
helpful in deciding on the order of the penalty parameter.
◦ Theorem 4.2 suggests to choose the ridge penalty parameter equal to the ratio of the error variance and that
of the random effects. Confronted with data the reciprocal of the REML estimator of θ = σγ2 /σε2 may be
used as value for the penalty parameter. Without the adjustment for the fraction of nonzero random effects,
this value is off. But in the worst case this value is an over-estimation of the optimal (in the GCV sense)
ridge penalty parameter. Consequently, too much penalization is applied and the ridge regression estimate
of the regression parameter is conservative as it shrinks the elements too much towards zero.
4.3 Illustration: P-splines
Figure 4.2: Top left and right panels: B-spline basis functions of degree 1 and 2, respectively. Bottom left and right panels: P-spline fit to transcript levels of the circadian clock experiment in mice.
Smoothing refers to the nonparametric – in the sense that parameters have no tangible interpretation – description of a curve. For instance, one may wish to learn some general functional relationship between two variables, X and
Y , from data. Statistically, the model Y = f (X) + ε, for unknown and general function f (·), is to be fitted to
paired observations {(yi , xi )}ni=1 . Here we use P-splines, penalized B-splines with B for Basis (Eilers and Marx,
1996).
A B-spline is formed through a linear combination of (pieces of) polynomial basis functions of degree r.
For their construction specify the interval [xstart , xend ] on which the function is to be learned/approximated. Let
{tj}_{j=0}^{m+2r} be a grid, overlapping the interval, of equidistantly placed points called knots, given by tj = xstart + (j − r)h for all j = 0, . . . , m + 2r with h = (xend − xstart)/m. The B-spline basis functions are then defined as:
where ∆r [fj (·)] is the r-th difference operator applied to fj (·). For r = 1: ∆[fj (·)] = fj (·) − fj−1 (·),
while for r = 2: ∆²[fj(·)] = ∆{∆[fj(·)]} = ∆[fj(·) − fj−1(·)] = fj(·) − 2fj−1(·) + fj−2(·), et cetera.
The top left and right panels of Figure 4.2 show 1st and 2nd degree B-spline basis functions. A P-spline is a curve of the form Σ_{j=0}^{m+2r} αj Bj(x; r) fitted to the data by means of penalized least squares minimization. The least squares are ∥Y − Bα∥₂², where B is an n × (m + 2r)-dimensional matrix with the j-th column equalling (Bj(x1; r), Bj(x2; r), . . . , Bj(xn; r))⊤. The employed penalty is of the ridge type: the sum of the squared differences among contiguous αj. Let D be the first order differencing matrix. The penalty can then be written as ∥Dα∥₂² = Σ_{j=2}^{m+2r} (αj − αj−1)². A second order difference matrix would amount to ∥Dα∥₂² = Σ_{j=3}^{m+2r} (αj − 2αj−1 + αj−2)². Eilers (1999) points out how P-splines may be interpreted as a mixed
model. Hereto choose X̃ such that its columns span the null space of D⊤ D, which comprises a single column
representing the intercept when D is a first order differencing matrix, and Z̃ = D⊤ (DD⊤ )−1 . Then, for any α:
where DX̃β has vanished by the construction of X̃. Hence, the penalty only affects the random effect parameter,
leaving the fixed effect parameter unshrunken. The resulting loss function, ∥Y − Xβ − Zγ∥22 + λ∥γ∥22 , coincides
for suitably chosen λ with that of the mixed model (as will become apparent later). The bottom panels of Figure 4.2 show the flexibility of this approach.
The following R-script fits a P-spline to a gene’s transcript levels of the circadian clock study in mice. It uses
a basis of m = 50 truncated polynomial functions of degree r = 3 (cubic), which is generated first alongside
several auxiliary matrices. This basis forms, after post-multiplication with a projection matrix onto the space
spanned by the columns of the difference matrix D, the design matrix for the random coefficient of the mixed
model Y = Xβ + Zγ + ε with γ ∼ N (0q , σγ2 Iqq ) and ε ∼ N (0n , σε2 Inn ). The variance parameters of this
model are then estimated by means of restricted maximum likelihood (REML). The final P-spline fit is obtained
from the linear predictor using, in line with Theorem 4.2, λ = σε2 /σγ2 in which the REML estimates of these
variance parameters are substituted. The resulting P-spline fit of two transcripts is shown in the bottom panels of
Figure 4.2.
#------------------------------------------------------------------------------
# intermezzo: declaration of functions used in the analysis
# (omitted here; a sketch of possible declarations follows after this listing)
#------------------------------------------------------------------------------
#------------------------------------------------------------------------------
# load data
library(MetaCycle)              # assumed source of the cycMouseLiverRNA data set
data(cycMouseLiverRNA)
id <- 14
Y <- as.numeric(cycMouseLiverRNA[id,-1])
X <- matrix(1:length(Y), ncol=1)   # time index, used as the fixed-effect design matrix
# initiate
theta <- 1
for (k in 1:100){
# for-loop, alternating between theta and error variance estimation
thetaPrev <- theta
QthetaInv <- solve(diag(length(Y)) + theta * Z %*% t(Z))
Ptheta <- QthetaInv -
QthetaInv %*% X %*%
solve(t(X) %*% QthetaInv %*% X) %*% t(X) %*% QthetaInv
sigma2e <- t(Y) %*% Ptheta %*% Y / (length(Y)-2)
theta <- uniroot(thetaEstEqREML, c(0, 100000),
Z=Z, Y=Y, X=X, sigma2e=sigma2e)$root
if (abs(theta - thetaPrev) < 10^(-5)){ break }
}
# P-spline fit
bgHat <- solve(t(cbind(X, Z)) %*% cbind(X, Z) +
diag(c(rep(0, ncol(X)), rep(1/theta, ncol(Z))))) %*%
t(cbind(X, Z)) %*% Y
# plot fit
plot(Y, pch=20, xlab="time", ylab="RNA concentration",
main=paste(strsplit(cycMouseLiverRNA[id,1], "_")[[1]][1],
"; # segments: ", m, sep=""))
lines(cbind(X, Z) %*% bgHat, col="blue", lwd=2)
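The intermezzo at the top of the script, in which the auxiliary objects are declared, is omitted above. Purely as an indication of what it could contain – the author’s exact construction is not shown – the following sketch builds a B-spline basis with splines::splineDesign, derives the random-effect design matrix Z = BZ̃ with Z̃ = D⊤(DD⊤)⁻¹, and encodes one common form of the REML estimating equation for θ = σγ²/σε², matching the signature expected by the uniroot call in the script.

# Hypothetical content of the omitted intermezzo (a sketch, not the author's code)
library(splines)                                   # for splineDesign()
m <- 50; r <- 3                                    # number of segments, spline degree

# B-spline basis and random-effect design matrix for time points x
makeZ <- function(x, m, r){
  h     <- (max(x) - min(x)) / m
  knots <- min(x) + ((0:(m + 2*r)) - r) * h        # equidistant knots t_j
  B     <- splineDesign(knots, x, ord = r + 1, outer.ok = TRUE)
  D     <- diff(diag(ncol(B)), differences = 1)    # first-order difference matrix
  B %*% t(D) %*% solve(D %*% t(D))                 # Z = B %*% Ztilde
}

# REML estimating equation for theta = sigma_gamma^2 / sigma_e^2 (one common form):
#   tr(P_theta Z Z^T) - sigma_e^{-2} Y^T P_theta Z Z^T P_theta Y = 0
thetaEstEqREML <- function(theta, Z, Y, X, sigma2e){
  QthetaInv <- solve(diag(length(Y)) + theta * Z %*% t(Z))
  Ptheta    <- QthetaInv - QthetaInv %*% X %*%
               solve(t(X) %*% QthetaInv %*% X) %*% t(X) %*% QthetaInv
  ZZt       <- Z %*% t(Z)
  sum(diag(Ptheta %*% ZZt)) -
    as.numeric(t(Y) %*% Ptheta %*% ZZt %*% Ptheta %*% Y) / as.numeric(sigma2e)
}

# e.g., Z <- makeZ(1:length(Y), m, r)   # evaluated once Y is available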
The fitted splines displayed in Figure 4.2 nicely match the data. From the circadian clock perspective it is especially the fit in the bottom right panel that displays the archetypical sinusoidal behaviour associated by the layman with the sought-for rhythm. Close inspection of the fits reveals some minor discontinuities in the derivative of the spline fit. These minor discontinuities are indicative of a little overfitting, due to too large an estimate of σγ². This appears to be due to numerical instability of the solution of the estimating equations for the REML estimators of the mixed model’s variance parameters when m is large compared to the sample size n.
5 Ridge logistic regression
Ridge penalized estimation is not limited to the standard linear regression model, but may be used to estimate
(virtually) any model. Here we illustrate how it may be used to fit the logistic regression model. To this end
we first recap this model and the (unpenalized) maximum likelihood estimation of its parameters. Thereafter the model is estimated by means of ridge penalized maximum likelihood, which will turn out to be a relatively
straightforward modification of unpenalized estimation.
The function g(·; ·) is called the link function. It links the response to the explanatory variables. The one above
is called the logistic link function. Or short, logit. The regression parameters have tangible interpretations. When
the first covariate represents the intercept, i.e. Xi,1 = 1 for all i, then β1 determines where the link function equals
a half when all other covariates fail to contribute to the linear predictor (i.e. where P (Yi = 1 | Xi,∗ ) = 0.5 when
Xi,∗ β = β1 ). This is illustrated in the top-left panel of Figure 5.1 for various choices of the intercept. On the other
hand, the regression parameters are directly related to the odds ratio: odds ratio = odds(Xi,j + 1)/odds(Xi,j ) =
exp(βj ). Hence, the effect of a unit change in the j-th covariate on the odds ratio is exp(βj ) (see Figure 5.1,
top-right panel). Other link functions (depicted in Figure 5.1, bottom-left panel) are common, e.g. the probit:
pi = Φ0,1(Xi,∗β); the Cauchit: pi = π⁻¹ arctan(Xi,∗β) + ½; and the cloglog: pi = 1 − exp[− exp(Xi,∗β)]. All these
link functions are invertible. Irrespective of the choice of the link function, the binary data are thus modelled
as Yi ∼ B [g −1 (Xi,∗ ; β), 1]. That is, as a single draw from the Binomial distribution with success probability
g −1 (Xi,∗ ; β).
Let us now estimate the parameter of the logistic regression model by means of the maximum likelihood
method. The likelihood of the experiment is then:
\begin{align*}
L(Y \mid X; \beta) = \prod_{i=1}^n P(Y_i = 1 \mid X_{i,*})^{Y_i} \, P(Y_i = 0 \mid X_{i,*})^{1 - Y_i}.
\end{align*}
After taking the logarithm and some ready algebra, the log-likelihood is found to be:
\begin{align*}
L(Y \mid X; \beta) = \sum_{i=1}^n \big\{ Y_i X_{i,*}\beta - \log[1 + \exp(X_{i,*}\beta)] \big\}.
\end{align*}
Figure 5.1: Top row, left panel: the response curve for various choices of the intercept β0. Top row, right panel: the response curve for various choices of the regression coefficient β1. Bottom row, left panel: the response curve for various choices of the link function. Bottom row, right panel: observations, fits and their deviations.
Differentiate the log-likelihood with respect to β, equate it to zero, and obtain the estimating equation for β:
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{i=1}^n \Big[ Y_i - \frac{\exp(X_{i,*}\beta)}{1 + \exp(X_{i,*}\beta)} \Big] X_{i,*}^\top = 0_p. \qquad\qquad (5.1)
\end{align*}
The ML estimate of β strikes a (weighted by the Xi,∗ ) balance between observation and model. Put differently
(and illustrated in the bottom-right panel of Figure 5.1), a curve is fit through data by minimizing the distance
between them: at the ML estimate of β a weighted average of their deviations is zero.
The maximum likelihood estimate of β is evaluated by solving Equation (5.1) with respect to β by means of
the Newton-Raphson algorithm. The Newton-Raphson algorithm iteratively finds the zeros of a smooth enough
function f (·). Let x0 denote an initial guess of the zero. Then, approximate f (·) around x0 by means of a
first order Taylor series: f (x) ≈ f (x0 ) + (x − x0 ) (df /dx)|x=x0 . Solve this for x and obtain: x = x0 −
[(df /dx)|x=x0 ]−1 f (x0 ). Let x1 be the solution for x, use this as the new guess and repeat the above until con-
vergence. When the function f(·) has multiple arguments and is vector-valued, denoted ⃗f, the first order Taylor series approximation is formed analogously with the derivative replaced by the Jacobi matrix. An update of x0 is now readily constructed by solving (the approximation for) ⃗f(x) = 0 for x.
When applied here to the maximum likelihood estimation of the regression parameter β of the logistic regres-
sion model, the Newton-Raphson update is:
\begin{align*}
\hat{\beta}^{\mathrm{new}} = \hat{\beta}^{\mathrm{old}} - \Big( \frac{\partial^2 \mathcal{L}}{\partial \beta \, \partial \beta^\top} \Big|_{\beta = \hat{\beta}^{\mathrm{old}}} \Big)^{-1} \frac{\partial \mathcal{L}}{\partial \beta} \Big|_{\beta = \hat{\beta}^{\mathrm{old}}},
\end{align*}
with Hessian
\begin{align*}
\frac{\partial^2 \mathcal{L}}{\partial \beta \, \partial \beta^\top} = - \sum_{i=1}^n \frac{\exp(X_{i,*}\beta)}{[1 + \exp(X_{i,*}\beta)]^2} X_{i,*}^\top X_{i,*}.
\end{align*}
In matrix notation the gradient and Hessian are:
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \beta} = X^\top [Y - \vec{g}^{-1}(X; \beta)] \qquad \mbox{and} \qquad \frac{\partial^2 \mathcal{L}}{\partial \beta \, \partial \beta^\top} = - X^\top W X,
\end{align*}
where ⃗g−1 (X; β) = [g −1 (X1,∗ ; β), . . . , g −1 (Xn,∗ ; β)]⊤ with g −1 (·; ·) = exp(·; ·)/[1 + exp(·; ·)] and W di-
agonal with (W)ii = exp(Xi β̂ old )[1 + exp(Xi β̂ old )]−2 . The notation W was already used in Chapter 3 and,
generally, refers to a (diagonal) weight matrix with the choice of the weights depending on the context. The
updating formula of the estimate then becomes:
where Z = Xβ̂ old + W−1 [Y − ⃗g−1 (X; β old )]. The Newton-Raphson update is thus the solution to the following
weighted least squares problem:
Effectively, at each iteration the adjusted response Z is regressed on the covariates that comprise X. For more on
logistic regression confer the monograph of Hosmer Jr et al. (2013).
for i = 1, . . . , n. Separable data mostly occur if either of the two response outcomes has a low prevalence.
The logistic regression parameter cannot be learned from separable data by means of the maximum likelihood
method. For the existence of a separating hyperplane implies that the optimal fit is perfect and all samples have –
according to the fitted logistic regression model – a probability of one of being assigned to the correct outcome,
i.e. P (Yi = 1 | Xi ) equals either zero or one. Consequently, the loglikelihood vanishes, cf.:
Xn
L(Y | X; β) = Yi log[P (Yi = 1 | Xi,∗ )] + (1 − Yi ) log[P (Yi = 0 | Xi,∗ ) = 0,
i=1
and does no longer involve the logistic regression parameter. The logistic regression parameter is then to be chosen
such that P (Yi = 1 | Xi,∗ ) = exp(Xi,∗ β)[1 + exp(Xi,∗ β)]−1 ∈ {0, 1} (depending on whether Yi is indeed of the
‘1’ class). This only occurs when (some) elements of β equal (minus) infinity. Hence, the maximum likelihood
estimator is not well-defined.
The common workaround to learn the logistic regression parameter from separable data is a technique called
Firth penalized estimation (Firth, 1993; Heinze and Schemper, 2002). It amounts to the maximization of the
loglikelihood augmented with the so-called Firth penalty:
where I(β) = X⊤W(β)X is the Fisher information matrix. The Firth penalty corrects for the first order bias of the maximum likelihood estimator due to the imbalance in the class prevalences. The larger this imbalance, the more likely the data are separable. Alternatively, the Firth penalty can be motivated as a (very) weakly informative Jeffreys prior.
High-dimensionally, a separating hyperplane can always be found, unless there is at least one pair of samples with a common covariate vector, i.e. Xi,∗ = Xi′,∗ with i ̸= i′, and different responses Yi ̸= Yi′. Moreover, the
maximum likelihood estimator of the logistic regression parameter is not well-defined high-dimensionally. To see
this, assume p > n and an estimate β̂ to be available. Due to the high-dimensionality, the null space of X is
non-trivial. Hence, let γ ∈ null(X). Then: Xβ̂ = Xβ̂ + Xγ = X(β̂ + γ). As the null space is a subspace of dimension at least p − n, γ need not equal zero. Hence, an infinite number of estimators of the logistic regression
parameter exists that yield the same loglikelihood.
Firth penalization may resolve the separable data issue low-dimensionally. It does not ensure the existence of
the estimator high-dimensionally. This is illustrated by the next example.
Example 5.1
Let the covariates be mutually exclusive indicators. Then, for all i = 1, . . . , n, there is a j ∈ {1, . . . , p} such
that Xi,∗ = ej⊤, where ej is the unit vector comprising all zeros except for a one at the j-th position. The Firth
penalty then is
\begin{align*}
\sum_{j=1}^p \sum_{i=1}^n 1_{\{X_{i,*} = e_j^\top\}} \big\{ \tfrac{1}{2}\beta_j - \log[1 + \exp(\beta_j)] \big\}.
\end{align*}
High-dimensionally, this Firth penalty is not strictly concave as its Hessian matrix is diagonal with diagonal elements Hjj = −exp(βj)[1 + exp(βj)]⁻² Σ_{i=1}^n 1{Xi,∗ = ej⊤} for j = 1, . . . , p. Hence, the maximizer of the Firth penalized loglikelihood is not well-defined high-dimensionally. □
where the second summand is the ridge penalty (the sum of the square of the elements of β) with λ the penalty
parameter. Penalization of the loglikelihood now amounts to the subtraction of the ridge penalty. This is due to the fact that the estimator is now defined as the maximizer (instead of a minimizer) of the loss function. Moreover, the ½ in front of the penalty is only there to simplify derivations later, and could in principle be absorbed into the penalty parameter. Finally, the augmentation of the loglikelihood with the ridge penalty ensures the existence of a unique estimator. The loglikelihood need not be strictly concave, but the ridge penalty, −½∥β∥₂², is strictly concave. Together
they form the ridge penalized loglikelihood above, which is thus also strictly concave. This warrants the existence
of a unique maximizer, and a well-defined ridge logistic regression estimator.
The optimization of the ridge penalized loglikelihood associated with the logistic regression model proceeds,
due to the differentiability of the penalty, fully analogous to the unpenalized case and uses the Newton-Raphson
Figure 5.2: Top row, left panel: contour plot of the penalized log-likelihood of a logistic regression model with the ridge constraint (red line). Top row, right panel: the regularization paths of the ridge estimator of the logistic regression parameter. Bottom row, left panel: variance of the ridge estimator of the logistic regression parameter against the logarithm of the penalty parameter. Bottom row, right panel: the predicted success probability versus the linear predictor for various choices of the penalty parameter.
algorithm for solving the (penalized) estimating equation. Hence, the unpenalized ML estimation procedure is
modified straightforwardly by replacing gradient and Hessian by their ‘penalized’ counterparts:
\begin{align*}
\frac{\partial \mathcal{L}^{\mathrm{pen}}}{\partial \beta} = \frac{\partial \mathcal{L}}{\partial \beta} - \lambda \beta \qquad \mbox{and} \qquad \frac{\partial^2 \mathcal{L}^{\mathrm{pen}}}{\partial \beta \, \partial \beta^\top} = \frac{\partial^2 \mathcal{L}}{\partial \beta \, \partial \beta^\top} - \lambda I_{pp}.
\end{align*}
With these at hand, the Newton-Raphson algorithm is (again) reformulated as an iteratively re-weighted least
squares algorithm with the updating step changing accordingly to:
\begin{align*}
\hat{\beta}^{\mathrm{new}}(\lambda) &= \hat{\beta}^{\mathrm{old}}(\lambda) + V^{-1} \big( X^\top \{ Y - \vec{g}^{-1}[X; \hat{\beta}^{\mathrm{old}}(\lambda)] \} - \lambda \hat{\beta}^{\mathrm{old}}(\lambda) \big) \\
&= V^{-1} V \hat{\beta}^{\mathrm{old}}(\lambda) - \lambda V^{-1} \hat{\beta}^{\mathrm{old}}(\lambda) + V^{-1} X^\top W W^{-1} \{ Y - \vec{g}^{-1}[X; \hat{\beta}^{\mathrm{old}}(\lambda)] \} \\
&= V^{-1} X^\top W \big( X \hat{\beta}^{\mathrm{old}}(\lambda) + W^{-1} \{ Y - \vec{g}^{-1}[X; \hat{\beta}^{\mathrm{old}}(\lambda)] \} \big),
\end{align*}
with V = X⊤WX + λIpp and W the weight matrix evaluated at β̂old(λ).
Obviously, the ridge estimate of the logistic regression parameter tends to zero as λ → ∞. Now consider a
linear predictor with an intercept that is left unpenalized. When λ tends to infinity, all regression coefficients but
the intercept vanish. The intercept is left to model the success probability. Hence, in this case limλ→∞ β̂0(λ) = log[ n⁻¹ Σ_{i=1}^n Yi / n⁻¹ Σ_{i=1}^n (1 − Yi) ].
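The IRLS scheme above is straightforward to implement. The sketch below is a minimal version under the stated update formula (not the author’s implementation; λ is fixed and the intercept is penalized along with the other coefficients), iterating until the estimate stabilizes.

# Sketch: ridge logistic regression via iteratively re-weighted least squares
ridgeLogistic <- function(X, Y, lambda, nIter = 100, tol = 1e-8){
  p    <- ncol(X)
  beta <- rep(0, p)                              # initial estimate
  for (iter in 1:nIter){
    eta  <- as.numeric(X %*% beta)               # linear predictor
    prob <- exp(eta) / (1 + exp(eta))            # g^{-1}(X; beta)
    W    <- prob * (1 - prob)                    # diagonal of the weight matrix
    Zadj <- eta + (Y - prob) / W                 # adjusted response
    betaNew <- solve(t(X) %*% (W * X) + lambda * diag(p),
                     t(X) %*% (W * Zadj))        # (X'WX + lambda*I)^{-1} X'W Zadj
    if (max(abs(betaNew - beta)) < tol){ beta <- betaNew; break }
    beta <- betaNew
  }
  as.numeric(beta)
}
# example on simulated data
set.seed(4)
X <- cbind(1, matrix(rnorm(200 * 2), 200, 2))
Y <- rbinom(200, 1, plogis(X %*% c(0, 2, -2)))
ridgeLogistic(X, Y, lambda = 5)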
Figure 5.3: The realized design as scatter plot (X1 vs X2), overlayed by the success (red) and failure (green) regions for various choices of the penalty parameter: λ = 0 (top row, left panel), λ = 10 (top row, right panel), λ = 40 (bottom row, left panel), λ = 100 (bottom row, right panel).
The effect of the ridge penalty on parameter estimates propagates to the predictor p̂i . The linear predictor of the
linear regression model involving the ridge estimator Xi β̂(λ) shrinks towards a common value for each i, leading
to a scale difference between observation and predictor (as seen before in Section 1.10). This behaviour transfers
to the ridge logistic regression predictor, as is illustrated on simulated data. The dimension and sample size of
these data are p = 2 and n = 200, respectively. The covariate data are drawn from the standard normal, while
that of the response is sampled from a Bernoulli distribution with success probability P (Yi = 1) = exp(2Xi,1 −
2Xi,2 )/[1 + exp(2Xi,1 − 2Xi,2 )]. The logistic regression model is estimated from these data by means of ridge
penalized likelihood maximization with various choices of the penalty parameter. The bottom right plot in Figure
5.2 shows the predicted success probability versus the linear predictor for various choices of the penalty parameter.
Larger values of the penalty parameter λ flatten the slope of this curve. Consequently, for larger λ more extreme
values of the covariates are needed to achieve the same predicted success probability as those obtained with smaller
λ at more moderate covariate values. The implications for the resulting classification may become clearer when
studying the effect of the penalty parameter on the ‘failure’ and ‘success regions’ respectively defined by:
where Wml is defined as in the Iteratively Weighted Least Squares algorithm but with β̂ ml substituted for the previ-
ous update β̂ old (λ). This alternative definition assumes the availability of the maximum likelihood estimator and
is thus not applicable to separable, and in particular high dimensional, data.
An alternative motivation of this approximate ridge logistic regression estimator follows from the following
asymptotic argument. Hereto we assume that for large enough n the maximum likelihood estimator of the logistic
regression parameter β̂ ml exists. We can then develop a second order Taylor approximation of the penalized
loglikelihood’s summand log[1 + exp(Xβ)] in β around the point β = β̂ ml :
where Wml defined as above. We substitute this Taylor approximation in the penalized loglikelihood to obtain an
approximation of the latter:
where an contains all remaining terms not involving β. The right-hand side is maximized at
5.4 Moments
The 1st and 2nd order moments of the ridge maximum likelihood estimator of the logistic regression parameter are unknown analytically but typically approximated by those of the final update of the Newton-Raphson algorithm.
This approximation assumes the one-to-last update β̂ old to be non-random and then proceeds as before with the
regular ridge regression estimator of the linear regression model parameter to arrive at:
with
where the identity Var(Y) = W follows from the variance of a Binomially distributed random variable. From these
expressions similar properties as for the ridge maximum likelihood estimator of the regression parameter of the
linear model may be deduced. For instance, the ridge maximum likelihood estimator of the logistic regression
parameter converges to zero as the penalty parameter tends to infinity (confer the top right panel of Figure 5.2).
Similarly, their variances vanish as λ → ∞ (illustrated in the bottom left panel of Figure 5.2).
The form of these moment approximations agrees asymptotically with that of the approximated ridge logistic regression estimator β̂(λ) introduced in Remark 5.1.
The maximization of the penalized loglikelihood of the logistic regression model can be reformulated as a con-
strained estimation problem, as was done in Section 1.5 for the linear regression model. The ridge logistic regres-
sion estimator is thus defined equivalently as:
This is illustrated by the top left panel of Figure 5.2 for the ‘p = 2’-case. It depicts the contours (black lines) of
the loglikelihood and the spherical domain of the parameter constraint {β ∈ ℝ² : β₁² + β₂² ≤ c(λ)} (red line).
The parameter constraint can again be interpreted as a means to harness against overfitting, as it prevents the ridge
logistic regression estimator from assuming very large values, thereby avoiding a perfect description of the data.
The parameter constraint of the logistic regression estimator exhibits similar behavior in λ as that of the regular ridge estimator. Hereto we note that the squared radius of the parameter constraint is – by identical argumentation as provided in Section 1.5 – equal to c(λ) = ∥β̂(λ)∥₂². However, no explicit expression for this squared radius exists, as none exists for the logistic ridge regression estimator, with which to verify the shrinkage behavior of the constraint. Proposition 5.1 characterizes the essential properties of this behavior.
Proposition 5.1
The squared norm of the ridge logistic regression estimator satisfies:
i) d∥β̂(λ)∥22 /dλ < 0 for λ > 0,
ii) limλ→∞ ∥β̂(λ)∥22 = 0.
Proof. For part i) take the derivative of the estimating equation with respect to λ, solve for dβ̂(λ)/dλ, and find that
\begin{align*}
\frac{d}{d\lambda} \hat{\beta}(\lambda) = - (X^\top W X + \lambda I_{pp})^{-1} \hat{\beta}(\lambda).
\end{align*}
Using this derivative and the chain rule, we obtain the derivative of the estimator’s (squared) Euclidean length:
\begin{align*}
\frac{d}{d\lambda} \|\hat{\beta}(\lambda)\|_2^2 = \frac{d}{d\lambda} \{ [\hat{\beta}(\lambda)]^\top \hat{\beta}(\lambda) \} = 2 [\hat{\beta}(\lambda)]^\top \frac{d}{d\lambda} \hat{\beta}(\lambda) = -2 [\hat{\beta}(\lambda)]^\top (X^\top W X + \lambda I_{pp})^{-1} \hat{\beta}(\lambda).
\end{align*}
As the right-hand side is a quadratic form multiplied by a negative scalar, we conclude that d∥β̂(λ)∥₂²/dλ < 0 for all λ > 0.
To verify the claimed λ-limit of the squared length of the ridge logistic regression estimator, note that it
satisfies: β̂(λ) = λ−1 X⊤ {Y − ⃗g−1 [X; β̂(λ)]}, which can be derived from the estimating equation. But as all
elements of the vector ⃗g⁻¹[X; β̂(λ)] are in the interval (0, 1) and Yi ∈ {0, 1}, we obtain the inequality:
\begin{align*}
\|\hat{\beta}(\lambda)\|_2^2 = \lambda^{-2} \{ Y - \vec{g}^{-1}[X; \hat{\beta}(\lambda)] \}^\top X X^\top \{ Y - \vec{g}^{-1}[X; \hat{\beta}(\lambda)] \} \leq \lambda^{-2} \, 1_n^\top X X^\top 1_n.
\end{align*}
This bound can be sharpened by using the worst possible fit, where either limλ→∞ ⃗g⁻¹[X; β̂(λ)] = ½ · 1n or limλ→∞ ⃗g⁻¹[X; β̂(λ)] = (n⁻¹ Σ_{i=1}^n Yi) · 1n, depending on the presence of an unpenalized intercept in the model. Irrespectively, the derived bound vanishes as λ → ∞. ■
Part i) of Proposition 5.1 tells us that the shrinkage behaviour of the ridge logistic regression estimator is monotone in λ: the squared Euclidean length of the estimator, being nonnegative, is strictly decreasing in λ. Furthermore, part ii) of Proposition 5.1 then ensures that ultimately the constraint
collapses onto zero.
The degrees of freedom consumed by the ridge logistic regression estimator can be derived from the definition
(1.11) previously introduced in Section 1.6. It uses β̂(λ1 ) as obtained from the last iteration of the IRLS algorithm,
and assumes the weight matrix employed in this last iteration to be nonrandom. Then, but see also Park and Hastie
(2008),
\begin{align*}
\mbox{df}(\lambda) &= \sum_{i=1}^n [\mbox{Var}(Y_i)]^{-1} \mbox{Cov}(\hat{Y}_i, Y_i) \\
&= \sum_{i=1}^n \{ \exp(X_{i,*}\beta) [1 + \exp(X_{i,*}\beta)]^{-2} \}^{-1} \mbox{Cov}[X_{i,*} \hat{\beta}(\lambda), Y_i] \\
&= \sum_{i=1}^n [(W)_{ii}]^{-1} \mbox{Cov}[X_{i,*} (X^\top W X + \lambda I_{pp})^{-1} X^\top W Z, Y_i] \\
&= \sum_{i=1}^n [(W)_{ii}]^{-1} X_{i,*} (X^\top W X + \lambda I_{pp})^{-1} X^\top W \, \mbox{Cov}\{X \hat{\beta}^{\mathrm{old}} + W^{-1} [Y - \vec{g}^{-1}(X; \hat{\beta}^{\mathrm{old}})], Y_i\} \\
&= \sum_{i=1}^n [(W)_{ii}]^{-1} X_{i,*} (X^\top W X + \lambda I_{pp})^{-1} X^\top W \, \mbox{Cov}(W^{-1} Y, Y_i) \\
&= \mbox{tr}[X (X^\top W X + \lambda I_{pp})^{-1} X^\top],
\end{align*}
where we have used the independence among the individual observations. The degrees of freedom of the ridge
logistic regression estimator too decrease monotonically to zero as λ increases.
This does not coincide with any standard distribution. But, under appropriate conditions, the posterior distribution
is asymptotically normal. This invites a (multivariate) normal approximation to the posterior distribution above.
Laplace’s method provides such an approximation (cf. Bishop, 2006).
Figure 5.4: Laplace approximation to the posterior density of the Bayesian logistic regression parameter.
Laplace’s method i) centers the normal approximation at the mode of the posterior, and ii) chooses the covari-
ance to match the curvature of the posterior at the mode. The posterior mode is the location of the maximum of
the posterior distribution. The location of this maximum coincides with that of the logarithm of the posterior. The
latter is the log-likelihood augmented with a ridge penalty. Hence, the posterior mode, which is taken as the mean
of the approximating Gaussian, coincides with the ridge logistic regression estimator. For the covariance of the
approximating Gaussian, the logarithm of the posterior is approximated by a second order Taylor series around its mode, in which the first order term cancels as the derivative of fβ(β | Y, X) with respect to β vanishes at the posterior
mode – its maximum. Take the exponential of this approximation and match its arguments to that of a multivariate
Gaussian exp[− 12 (β − µβ )⊤ Σ−1β (β − µβ )]. The covariance of the sought Gaussian approximation is thus the
inverse of the Hessian of the negative penalized log-likelihood. Put together the posterior is approximated by:
β | Y, X ∼ N [β̂MAP , (∆ + X⊤ WX)−1 ].
The Gaussian approximation is convenient but need not be good. Fortunately, the Bernstein-Von Mises Theorem
(van der Vaart, 2000) tells us that it is very accurate when the model is regular, the prior smooth, and the sample
size sufficiently large. The quality of the approximation for an artificial example data set is shown in Figure 5.4.
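A minimal sketch of this construction follows (not the example underlying Figure 5.4; simulated data and a standard normal prior, i.e. ∆ = Ipp, are assumed purely for illustration): the posterior mode is obtained as the ridge logistic regression estimate via Newton-Raphson and the covariance as the inverse Hessian of the negative penalized loglikelihood at that mode.

# Sketch: Laplace approximation to the posterior of a Bayesian logistic regression
set.seed(5)
n <- 100; p <- 3; lambda <- 1                      # prior precision Delta = lambda * I (assumed)
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))
beta <- rep(0, p)
for (iter in 1:50){                                # Newton-Raphson on the penalized loglikelihood
  prob <- as.numeric(plogis(X %*% beta))
  grad <- t(X) %*% (Y - prob) - lambda * beta      # penalized gradient
  hess <- -t(X) %*% (prob * (1 - prob) * X) - lambda * diag(p)
  beta <- beta - solve(hess, grad)
}
prob     <- as.numeric(plogis(X %*% beta))         # weights at the mode
postMean <- as.numeric(beta)                       # posterior mode = ridge estimate
postCov  <- solve(t(X) %*% (prob * (1 - prob) * X) + lambda * diag(p))
postMean; postCov                                  # N(postMean, postCov) approximates the posterior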
Both reformulations hinge upon the Woodbury matrix identity. On one hand, these reformulations avoid the
inversion of a p × p dimensional matrix (as been shown for the regular ridge regression estimator, see Section
1.7). On the other, only the term λ−1 XX⊤ in the above expressions involves a matrix multiplication over the p
dimension. Exactly this term is not updated at each iteration of the IRLS algorithm. It can thus be evaluated and
stored prior to the first iteration step, and at each iteration be called upon. Furthermore, the remaining quantities
updated at each iteration are the weights and the penalized loglikelihood, but those are obtained straightforwardly
from the linear predictor and the penalty. The pseudo-code of an efficient version of the IRLS algorithm of the
ridge logistic estimator, as has been implemented in the ridgeGLM-function of the porridge-package (van
Wieringen and Aflakparast, 2021), thus comprises the following steps:
1) Evaluate and store λ−1 XX⊤ .
2) Initiate the linear predictor.
3) Update the weights, the adjusted response, the penalized loglikelihood, and the linear predictor.
4) Repeat the previous step until convergence.
5) Evaluate the estimator using the weights from the last iteration.
In the above, convergence refers to no further or a negligible improvement of the penalized loglikelihood. In the
final step of this version of the IRLS algorithm, the ridge logistic regression estimator is obtained by:
In the above display the weights W and adjusted response Z are obtained from the last iteration of the IRLS algorithm’s third step. Alternatively, computational efficiency is obtained by the ‘SVD-trick’ (see Question 5.6).
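To illustrate the gain, the sketch below (assuming a plain ridge penalty and no unpenalized intercept; it is not the porridge implementation) rewrites the IRLS update with the Woodbury identity so that only the precomputed λ⁻¹XX⊤ and an n × n inverse are involved. It reproduces the p-dimensional update of the earlier ridgeLogistic sketch.

# Sketch: the same IRLS update evaluated via an n x n (rather than p x p) inverse
ridgeLogisticHighDim <- function(X, Y, lambda, nIter = 100, tol = 1e-8){
  XXt  <- X %*% t(X) / lambda                     # lambda^{-1} X X^T, computed once
  beta <- rep(0, ncol(X))
  for (iter in 1:nIter){
    eta  <- as.numeric(X %*% beta)
    prob <- plogis(eta)
    W    <- prob * (1 - prob)
    Zadj <- eta + (Y - prob) / W
    # (X'WX + lambda*I)^{-1} X'W Zadj = lambda^{-1} X' (lambda^{-1} X X' + W^{-1})^{-1} Zadj
    betaNew <- t(X) %*% solve(XXt + diag(1 / W), Zadj) / lambda
    if (max(abs(betaNew - beta)) < tol){ beta <- betaNew; break }
    beta <- betaNew
  }
  as.numeric(beta)
}
# e.g., for p > n data:
# set.seed(6); Xhd <- matrix(rnorm(50 * 200), 50, 200)
# Yhd <- rbinom(50, 1, plogis(Xhd[, 1] - Xhd[, 2]))
# ridgeLogisticHighDim(Xhd, Yhd, lambda = 10)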
The approach of Meijer and Goeman (2013) hinges upon a first-order Taylor expansion of the left-out penalized loglikelihood around β̂(λ), which yields an approximation of the left-out estimate β̂−i(λ):
\begin{align*}
\hat{\beta}_{-i}(\lambda) &\approx \hat{\beta}(\lambda) - \Big( \frac{\partial^2 \mathcal{L}^{\mathrm{pen}}_{-i}}{\partial \beta \, \partial \beta^\top} \Big|_{\beta = \hat{\beta}(\lambda)} \Big)^{-1} \frac{\partial \mathcal{L}^{\mathrm{pen}}_{-i}}{\partial \beta} \Big|_{\beta = \hat{\beta}(\lambda)} \\
&= \hat{\beta}(\lambda) + (X_{-i,*}^\top W_{-i,-i} X_{-i,*} + \lambda I_{pp})^{-1} \big\{ X_{-i,*}^\top [Y_{-i} - \vec{g}^{-1}(X_{-i,*}; \hat{\beta}(\lambda))] - \lambda \hat{\beta}(\lambda) \big\}.
\end{align*}
This approximation involves the inverse of a p × p dimensional matrix, which amounts to the evaluation of n such
inverses for the LOOCV loss. As in Section 1.8.2 this may be avoided. Rewrite both the gradient and the Hessian
of the left-out loglikelihood in the approximation of the preceding display:
and
\begin{align*}
(X_{-i,*}^\top W_{-i,-i} X_{-i,*} + \lambda I_{pp})^{-1} = (X^\top W X + \lambda I_{pp})^{-1} + W_{ii} [1 - H_{ii}(\lambda)]^{-1} (X^\top W X + \lambda I_{pp})^{-1} X_{i,*}^\top X_{i,*} (X^\top W X + \lambda I_{pp})^{-1},
\end{align*}
where the Woodbury identity has been used and now Hii(λ) = Wii Xi,∗(X⊤WX + λIpp)⁻¹Xi,∗⊤. Substitute
both in the approximation of the left-out ridge logistic regression estimator and manipulate as in Section 1.8.2 to
obtain:
which can be verified by means of the Woodbury matrix identity. Furthermore, all matrix products of submatrices
of X are themselves submatrices of XX⊤ . Further efficiency is gained if the latter is evaluated before the start of
the cross-validation loop. It then only remains to subset XX⊤ inside this loop.
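A sketch of this approximate leave-one-out scheme follows (reusing the ridgeLogistic sketch given earlier in this chapter, using the direct p-dimensional form of the approximation above rather than the n-dimensional reformulation, and taking the cross-validated loglikelihood as the performance measure; all modelling choices here are illustrative assumptions).

# Sketch: approximate LOOCV for ridge logistic regression via the Taylor expansion above
approxLOOCV <- function(X, Y, lambda){
  n <- nrow(X); p <- ncol(X)
  beta <- ridgeLogistic(X, Y, lambda)              # full-data estimate (sketch defined earlier)
  prob <- as.numeric(plogis(X %*% beta))
  W    <- prob * (1 - prob)
  cvLogLik <- 0
  for (i in 1:n){
    Xi <- X[-i, , drop = FALSE]; Yi <- Y[-i]; Wi <- W[-i]
    grad  <- t(Xi) %*% (Yi - prob[-i]) - lambda * beta     # left-out penalized gradient at beta
    hess  <- t(Xi) %*% (Wi * Xi) + lambda * diag(p)        # minus the left-out penalized Hessian
    betaI <- beta + solve(hess, grad)                      # approximate beta_{-i}(lambda)
    etaI  <- sum(X[i, ] * betaI)
    cvLogLik <- cvLogLik + Y[i] * etaI - log(1 + exp(etaI))  # left-out loglikelihood contribution
  }
  cvLogLik
}
# e.g., compare penalty parameters: sapply(c(1, 5, 25), function(l) approxLOOCV(X, Y, l))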
The derivation of the generalized ridge penalized loglikelihood of the logistic regression model is then analogous
to that of the regular ridge penalized case. Its maximizer is the sought estimator:
\begin{align*}
\big(\hat{\gamma}(\cdot), \hat{\beta}(\cdot)\big) = \arg\max_{\gamma \in \mathbb{R}^q, \, \beta \in \mathbb{R}^p} \sum_{i=1}^n \big\{ Y_i (U_{i,*}\gamma + X_{i,*}\beta) - \log[1 + \exp(U_{i,*}\gamma + X_{i,*}\beta)] \big\} - (\beta - \beta_0)^\top \Delta (\beta - \beta_0).
\end{align*}
No analytic expression for this estimator is available. It is again evaluated numerically by means of the Iteratively Reweighted Least Squares (IRLS) algorithm, which generates a sequence {γ̂(t)(·), β̂(t)(·)}_{t=1}^∞ that converges to the
generalized ridge logistic regression estimator. Given the t-th values of this sequence, the next are found by:
γ̂ (t+1) (·) = {U⊤ (W(t) )−1 [X∆−1 X⊤ + (W(t) )−1 ]−1 (W(t) )−1 U}−1
× U⊤ [X∆−1 X⊤ + (W(t) )−1 ]−1 (Y(adj,t) − Xβ0 ),
β̂ (t+1) (·) = β0 + (X⊤ W(t) X + ∆)−1 X⊤ W(t) [Y(adj,t) − Uγ̂ (t+1) (·) − Xβ0 ].
These expressions have been obtained by application of the analytic expression for the inverse of a 2×2 block matrix, the Woodbury matrix identity, and some linear algebraic manipulations (see Lettink et al., 2023 for details). The IRLS algorithm is terminated after a finite number of iterations, either when the loss no longer substantially improves or when subsequent iterations show little difference in their evaluation of the parameter estimates.
5.11 Application
The ridge logistic regression is used here to explain the status (dead or alive) of ovarian cancer samples at the
close of the study from gene expression data at baseline. Data stem from the TCGA study (Cancer Genome
Atlas Network, 2011), which measured gene expression by means of sequencing technology. Available are 295
samples with both status and transcriptomic profiles. These profiles are composed of 19990 transcript reads. The
sequencing data, being representative of the mRNA transcript count, is heavily skewed. Zwiener et al. (2014) show
that a simple transformation of the data prior to model building generally yields a better model than tailor-made
approaches. Motivated by this observation the data were – to accommodate the zero counts – asinh-transformed.
The logistic regression model is then fitted in ridge penalized fashion, leaving the intercept unpenalized. The ridge
penalty parameter is chosen through 10-fold cross-validation minimizing the cross-validated error. R-code, and
that for the sequel of this example, is to be found below.
Listing 5.1 R code
# load libraries
library(glmnet)
library(TCGA2STAT)
# load data
OVdata <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM", clinical=TRUE)
Y <- as.numeric(OVdata[[3]][,2])
X <- asinh(data.matrix(OVdata[[3]][,-c(1:3)]))
# start fit
# optimize penalty parameter
cv.fit <- cv.glmnet(X, Y, alpha=0, family=c("binomial"),
nfolds=10, standardize=FALSE)
optL2 <- cv.fit$lambda.min
# estimate model
glmFit <- glmnet(X, Y, alpha=0, family=c("binomial"),
lambda=optL2, standardize=FALSE)
# visualize fit
boxplot(linPred ˜ Y, pch=20, border="lightblue", col="blue",
5.11 Application 97
# build model
pred2obsL2 <- matrix(nrow=0, ncol=4)
colnames(pred2obsL2) <- c("optLambda", "linPred", "predProb", "obs")
for (f in 1:length(folds)){
print(f)
cv.fit <- cv.glmnet(X[-folds[[f]],], Y[-folds[[f]]], alpha=0,
family=c("binomial"), nfolds=10, standardize=FALSE)
optL2 <- cv.fit$lambda.min
glmFit <- glmnet(X[-folds[[f]],], Y[-folds[[f]]], alpha=0,
family=c("binomial"), lambda=optL2, standardize=FALSE)
linPred <- glmFit$a0 + X[folds[[f]],,drop=FALSE] %*% glmFit$beta
predProb <- exp(linPred) / (1+exp(linPred))
pred2obsL2 <- rbind(pred2obsL2, cbind(optL2, linPred, predProb, Y[folds[[f]]]))
}
# visualize fit
boxplot(pred2obsL2[,2] ~ pred2obsL2[,4], pch=20, border="lightblue", col="blue",
        ylab="linear predictor", xlab="response",
        main="prediction")
The fit of the resulting model is studied. Hereto the fitted linear predictor Xβ̂(λopt ) is plotted against the
status (Figure 5.5, left panel). The plot shows some overlap between the boxes, but also a clear separation. The latter suggests that gene expression at baseline enables us to distinguish the surviving from the to-be-deceased ovarian cancer patients. Ideally, a decision rule based on the linear predictor could be formulated to predict an individual's outcome.
The fit, however, is evaluated on the samples that have been used to build the model. This gives no insight into the model's predictive performance on novel samples. A replication of the study is generally costly and comparable data sets need not be at hand. A common workaround is to evaluate the predictive performance on the same data (Subramanian and Simon, 2010). This requires putting several samples aside for performance evaluation while the remainder is used for model building. The left-out samples may accidentally be chosen such that they yield an exaggerated (either dramatically poor or overly optimistic) performance. This is avoided through the repetition of this exercise, leaving (groups of) samples out one at a time. The left-out performance evaluations are then averaged and believed to be representative of the predictive performance of the model on novel samples. Note that, effectively,
as the model building involves cross-validation and so does the performance evaluation, a double cross-validation
loop is applied. This procedure is applied with a ten-fold split in both loops. Denote the outer folds by f =
1, . . . , 10. Then, Xf and X−f represent the design matrix of the samples comprising fold f and that of the
remaining samples, respectively. Define Yf and Y−f similarly. The linear prediction for the left-out fold f is
then Xf β̂−f (λopt, -f ). For reference to the fit, this is compared to Yf visually by means of a boxplot as used above
(see Figure 5.5, right panel). The boxes overlap almost perfectly. Hence, little to nothing remains of the predictive
Figure 5.5: Left panel: Box plot of the status vs. the fitted linear predictor using the full data set. Right panel: Box
plot of the status vs. the linear prediction in the left-out samples of the 10 folds.
power suggested by the boxplot of the fit. The fit may thus give a reasonable description of the data at hand, but it
extrapolates poorly to new samples.
Logistic regression separates – by virtue of the linear predictor – the response classes by a hyperplane in the design space (see Figure 5.3). This may appear a substantial limitation of logistic regression as indeed there is no ground why the phenomenon under study would exhibit such neat linear behavior. This limitation can be circumvented by the expansion of our design matrix by functions of the covariates, e.g. Xi,1, Xi,1Xi,2, X²i,1, sin(Xi,1), . . .. Such an expanded set of covariates facilitates nonlinear boundaries of the classification domains, as illustrated in Figure 5.6. This greatly expands the class of classifiers, with the potential of a substantially higher performance but with the risk of overfitting.
Figure 5.6: Classification domains of logistic regression with, from left to right, a design matrix expanded by the
interaction between the covariates, and the covariates squared and cubed, respectively. The red and green domains represent regions where the classification is reasonably certain, while in the white domain it is not.
The key question in order to facilitate nonlinear classification boundaries is what functions to choose, or, put
differently, what basis to form, from the original covariates that yield the best classifier. The easy (?) answer
would be to let the data decide, e.g. by means of a feedforward neural network. A feedforward neural network is a composition of affine linear combinations and nonlinear transformations of the input,
where:
◦ the {Wu}^m_{u=1} are parameter matrices that define linear combinations of the input which are propagated,
◦ the {αu}^m_{u=1} are parameter vectors – called bias terms – that offset the linear combination of the input,
◦ the fu : R^{Lu−1} → R^{Lu} are nonlinear activation functions that operate on the offset linear combination.
This compositional map of the feedforward neural network can be written as fm(αm + Wm fm−1(αm−1 + · · · f1(α1 + W1 x) · · · )) with input vector x ∈ Rp. The feedforward neural network thus alternates between an affine linear combination and a nonlinear transformation. Logistic regression is a single-layer feedforward neural network with α = β0, W = β, and f(α + Wx) = [1 + exp(−α − Wx)]^{-1}. A common alternative and nonlinear choice of the fu(·), due to its continuity and piece-wise linearity, is the so-called ReLU (Rectified Linear Unit) function defined as f(x) = max{0, x}. Refer to the monograph of Goodfellow et al. (2016) for more on feedforward neural networks.
The feedforward neural network effectively creates the basis functions from the covariates itself. It can approximate virtually any function if m and the {Lu}^m_{u=1} are large (Schmidt-Hieber, 2019). There is of course no free lunch. The feedforward neural network's parameters may be non-identifiable. Irrespectively, the estimation of the network's many parameters requires strong regularization. Additionally, it requires massive computational power to learn a deep neural network. Finally, the utilization of a learned feedforward neural network may be hampered by its opaqueness: the input-output relationship is hard to interpret, while interpretability is a desirable property for acceptance (Rudin, 2019).
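To make the alternation of affine linear combinations and nonlinear transformations concrete, a minimal R sketch of a forward pass is given below. All function and object names are illustrative; the single-layer call at the end merely illustrates the correspondence with logistic regression noted above:

# forward pass of a feedforward neural network: affine map, then nonlinearity, repeated
relu    <- function(x) pmax(0, x)                     # ReLU activation
sigmoid <- function(x) 1 / (1 + exp(-x))              # logistic activation
forwardPass <- function(x, Ws, alphas, fs) {
  # Ws: list of weight matrices, alphas: list of bias vectors, fs: list of activations
  for (u in seq_along(Ws)) {
    x <- fs[[u]](alphas[[u]] + Ws[[u]] %*% x)
  }
  as.numeric(x)
}
# a single 'layer' with the logistic activation reproduces logistic regression
p <- 3; x <- rnorm(p); beta <- rnorm(p); beta0 <- 0.5
forwardPass(x, Ws=list(matrix(beta, nrow=1)), alphas=list(beta0), fs=list(sigmoid))
# equals sigmoid(beta0 + sum(beta * x))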
5.13 Conclusion
To deal with response variables other than continuous ones, ridge logistic regression was discussed. High-
dimensionally, the empirical identifiability problem then persists. Again, penalization came to the rescue: the
ridge penalty may be combined with other link functions than the identity. Properties of ridge regression were
shown to carry over to its logistic equivalent.
5.14 Exercises
Question 5.1
The variation in a binary response Yi ∈ {0, 1} due to two covariates Xi,∗ = (Xi,1 , Xi,2 ) is described by the
logistic regression model: P(Yi = 1) = exp(Xi,∗β)[1 + exp(Xi,∗β)]^{-1}. The study design and the observed
response are given by:
X = (  1  −1        and    Y = ( 1
      −1   1 )                   0 ).
Question 5.2
Consider an experiment involving n cancer samples. For each sample i the transcriptome of its tumor has been profiled and is denoted Xi = (Xi1, . . . , Xip)⊤, where Xij represents the expression of gene j = 1, . . . , p in sample i. Additionally, the overall survival data, (Yi, ci) for i = 1, . . . , n, of these samples are available. In this Yi denotes the survival time of sample i and ci the event indicator with ci = 0 and ci = 1 representing non- and censoredness, respectively. You may ignore the possibility of ties in the remainder.
a) Write down the Cox proportional regression model that links overall survival times (as the response variable)
to the expression levels.
b) Specify its loss function for penalized maximum partial (!) likelihood estimation of the parameters. Penal-
ization is via the regular ridge penalty.
c) From this loss function, derive the estimating equation for the Cox regression coefficients.
d) Describe (in words) how you would evaluate the ridge ML estimator numerically.
Question 5.3
Load the leukemia data available via the multtest-package (downloadable from BioConductor) through the
following R-code:
Listing 5.2 R code
# activate library and load the data
library(multtest)
data(golub)
The objects golub and golub.cl are now available. The matrix-object golub contains the expression profiles
of 38 leukemia patients. Each profile comprises expression levels of 3051 genes. The numeric-object golub.cl
is an indicator variable for the leukemia type (AML or ALL) of the patient.
a) Relate the leukemia subtype and the gene expression levels by a logistic regression model. Fit this model by
means of penalized maximum likelihood, employing the ridge penalty with penalty parameter λ = 1. This
is implemented in the penalized-package available from CRAN. Note: center (gene-wise) the expression
levels around zero.
b) Obtain the fits from the regression model. The fit is almost perfect. Could this be due to overfitting the data?
Alternatively, could it be that the biological information in the gene expression levels indeed determines the
leukemia subtype almost perfectly?
c) To discern between the two explanations for the almost perfect fit, randomly shuffle the subtypes. Refit the
logistic regression model and obtain the fits. On the basis of this and the previous fit, which explanation is
more plausible?
d) Compare the fit of the logistic model with different penalty parameters, say λ = 1 and λ = 1000. How does
λ influence the possibility of overfitting the data?
e) Describe what you would do to prevent overfitting.
Question 5.4
Load the breast cancer data available via the breastCancerNKI-package (downloadable from BioConductor)
through the following R-code:
Listing 5.3 R code
# activate library and load the data
library(breastCancerNKI)
data(nki)
Question 5.5
Derive the equivalent of Equations (5.2) and (5.3) for the targeted ridge logistic regression estimator:
β̂(λ, β0) = arg max_{β∈Rp} { Y⊤Xβ − Σ^n_{i=1} log{1 + exp[Xi,∗β]} − ½λ∥β − β0∥²₂ },
with nonrandom β0 ∈ Rp .
Question 5.6
The iteratively reweighted least squares (IRLS) algorithm for the numerical evaluation of the ridge logistic regres-
sion estimator requires the inversion of a p × p-dimensional matrix at each iteration. In Section 1.7 the singular
value decomposition (SVD) of the design matrix is exploited to avoid the inversion of such a matrix in the numer-
ical evaluation of the ridge regression estimator. Use this trick to show that the computational burden of the IRLS
algorithm may be reduced to one SVD prior to the iterations and the inversion of an n × n dimensional matrix at
each iteration (as is done in Eilers et al., 2001).
Question 5.7
The ridge estimator of parameter β of the logistic regression model is the maximizer of the ridge penalized
loglikelihood:
Lpen(Y, X; β, λ) = Σ^n_{i=1} {Yi Xiβ − log[1 + exp(Xiβ)]} − ½λ∥β∥²₂.
The maximizer is found numerically by the iteratively reweighted least squares (IRLS) algorithm which is outlined
in Section 5.3. Modify the algorithm, as is done in van Wieringen and Binder (2022), to find the generalized ridge
logistic regression estimator of β defined as:
Lpen(Y, X; β, λ) = Σ^n_{i=1} {Yi Xiβ − log[1 + exp(Xiβ)]} − ½(β − β0)⊤∆(β − β0).
Question 5.8
Consider the logistic regression model to explain the variation of a binary random variable by a covariate. Let
Yi ∈ {0, 1}, i = 1, . . . , n, represent n binary random variables that follow a Bernoulli distribution with parameter P(Yi = 1 | Xi) = exp(Xiβ)[1 + exp(Xiβ)]^{-1} and corresponding covariate Xi ∈ {−1, 1}. Data {(yi, Xi)}^n_{i=1} are summarized as a contingency table (Table 5.1):

            yi = 0    yi = 1
 Xi = −1       301       196
 Xi =  1       206       297
a) Show that the estimating equation of the ridge logistic regression estimator of β can be written as
c − n exp(β)[1 + exp(β)]^{-1} − λβ = 0,
and specify c.
b) Show that β̂(λ) ∈ (λ^{-1}(c − n), λ^{-1}c) for all λ > 0. Ensure that this is a meaningful interval, i.e. that it is non-empty, and verify that c > 0. Moreover, conclude from the interval that lim_{λ→∞} β̂(λ) = 0.
c) The Taylor expansion of exp(β)[1 + exp(β)]^{-1} around β = 0 is:
exp(β)[1 + exp(β)]^{-1} = 1/2 + (1/4)β − (1/48)β³ + (1/480)β⁵ + O(β⁶).
Substitute the 3rd order Taylor approximation into the estimating equations and use the data of Table 5.1 to
evaluate its roots using the polyroot-function (of the base-package). Do the same for the 1st and 2nd
order Taylor approximations of exp(β)[1 + exp(β)]−1 . Compare these approximate estimates to the one
provided by the ridgeGLM-function (of the porridge-package, van Wieringen and Aflakparast, 2021).
In the remainder consider the 1st order Taylor approximated ridge logistic regression estimator: β̂1st(λ) = (λ + ¼n)^{-1}(c − ½n).
d) Find an expression for E[β̂1st (λ)].
e) Find an expression for Var[β̂1st (λ)].
f) Combine the answers of parts d) and e) to find an expression for MSE[β̂1st (λ)].
g) Plot this MSE against λ for the data provided in Table 5.1 and reflect on an analogue of Theorem 1.2 for the
ridge logistic regression estimator.
Question 5.9
Consider the logistic regression model Yi ∼ Bernoulli{exp(Xi,∗β)[1 + exp(Xi,∗β)]^{-1}} for i = 1, . . . , n with regression parameter β ∈ Rp, response variable Yi and the p-dimensional row vector Xi,∗ with the covariate information. Suppose the samples can be divided into two groups, one of size n0 and the other of size n1 such that n = n0 + n1, with Yi = 0 and Xi,∗ = X1,∗ for i = 1, . . . , n0 and Yi = 1 and Xi,∗ = −X1,∗ for i = n0 + 1, . . . , n0 + n1. From these data, the regression parameter is estimated with the ridge logistic regression estimator, which maximizes the loglikelihood penalized with the ridge penalty ½λ2∥β∥²₂, with penalty parameter λ2 > 0. Show that the ridge logistic regression estimator β̂(λ2) is of the form c(λ2)X⊤1,∗ with c(λ2) ∈ R.
6 Lasso regression
In this chapter we return to the linear regression model, which is still fitted in penalized fashion but this time with
a so-called lasso penalty. Yet another penalty? Yes, but the resulting estimator will turn out to have interesting
properties. The outline of this chapter loosely follows that of its counterpart on ridge regression (Chapter 1).
The chapter can – at least partially – be seen as an elaborated version of the original work on lasso regression,
i.e. Tibshirani (1996), with most topics covered and visualized more extensively and incorporating results and
examples published since.
Recall that ridge regression finds an estimator of the parameter of the linear regression model through the minimization of:
∥Y − Xβ∥²₂ + fpen(β, λ),        (6.1)
with fpen(β, λ) = λ∥β∥²₂. The particular choice of the penalty function originated in a post-hoc motivation of the ad-hoc fix to the singularity of the matrix X⊤X, stemming from the design matrix X not being of full rank (i.e. rank(X) < p). The ad-hoc nature of the fix suggests that the choice for the squared Euclidean norm of β as a penalty is arbitrary and other choices may be considered, some of which were already encountered in Chapter 3.
One such choice is the so-called lasso penalty, giving rise to lasso regression, as introduced by Tibshirani
(1996). Like ridge regression, lasso regression fits the linear regression model Y = Xβ + ε with the standard
assumption on the error ε. Like ridge regression, it does so by minimization of the sum of squares augmented
with a penalty. Hence, lasso regression too minimizes loss function (6.1). The difference with ridge regression is
in the penalty function. Instead of the squared Euclidean norm, lasso regression uses the ℓ1 -norm: fpen (β, λ1 ) =
λ1 ∥β∥1 , the sum of the absolute values of the regression parameters multiplied by the lasso penalty parameter λ1 .
To distinguish the ridge and lasso penalty parameters, they are henceforth denoted λ2 and λ1 , respectively, with
the subscript referring to the norm used in the penalty. The lasso regression loss function is thus:
Llasso(β; λ1) = ∥Y − Xβ∥²₂ + λ1∥β∥1 = Σ^n_{i=1}(Yi − Xi,∗β)² + λ1 Σ^p_{j=1}|βj|.        (6.2)
The lasso regression estimator is then defined as the minimizer of this loss function. As with the ridge regression
loss function, the maximum likelihood estimator (should it exist) of β minimizes the first part, while the second
part is minimized by setting β equal to the p dimensional zero vector. For λ1 close to zero, the lasso estimate
is close to the maximum likelihood estimate. Whereas for large λ1 , the penalty term overshadows the sum-of-
squares, and the lasso estimate is small (in some sense). Intermediate choices of λ1 mold a compromise between
those two extremes, with the penalty parameter determining the contribution of each part to this compromise. The
lasso regression estimator thus is not one but a whole sequence of estimators of β, one for every λ1 ∈ R>0 . This
sequence is the lasso regularization path, defined as {β̂(λ1 ) : λ1 ∈ R>0 }. To arrive at a final lasso estimator of
β, like its ridge counterpart, the lasso penalty parameter λ1 needs to be chosen (see Section 6.8).
The ℓ1 penalty of lasso regression is equally arbitrary as the ℓ22 -penalty of ridge regression. The latter ensured
the existence of a well-defined estimator of the regression parameter β in the presence of super-collinearity in the
design matrix X, in particular when the dimension p exceeds the sample size n. The augmentation of the sum-
of-squares with the lasso penalty achieves the same. This is illustrated in Figure 6.1. For the high-dimensional
setting with p = 2 and n = 1 and arbitrary data the level sets of the sum-of-squares ∥Y − X β∥22 and the lasso
regression loss ∥Y − X β∥22 + λ1 ∥β∥1 are plotted (left and right panel, respectively). In both panels the minimum
is indicated in red. For the sum-of-squares the minimum is a line. As pointed out before in Section 1.2 of Chapter
1 on ridge regression, this minimum is determined up to an element of the null set of the design matrix X, which
in this case is non-trivial. In contrast, the lasso regression loss exhibits a unique well-defined minimum. Hence,
the augmentation of the sum-of-squares with the lasso penalty yields a well-defined estimator of the regression
parameter. (This statement needs some qualification: in general the minimum of the lasso regression loss need not be unique, confer Section 6.1.)
Figure 6.1: Contour plots of the sum-of-squares and the lasso regression loss (left and right panel, respectively). The dotted grey lines represent level sets. The red line and dot represent the location of the minimum in the two panels.
The mathematics involved in the derivation in this chapter tends to be more intricate than for ridge regression.
This is due to the non-differentiability of the lasso penalty at zero. This has consequences on all aspects of the
lasso regression estimator as is already obvious in the right-hand panel of Figure 6.1: confer the non-differentiable
points of the lasso regression loss level sets.
6.1 Uniqueness
The lasso regression estimator need not be unique. This can loosely be concluded from the lasso regression loss
function, which is the sum of the sum-of-squares criterion and a sum of absolute value functions. Both are convex in β: the former is not strictly convex due to the high-dimensionality, while the absolute value function is convex but not strictly so due to its piece-wise linearity. Thereby the lasso loss function too is convex but not strictly so. Consequently, its minimum need not be uniquely defined. But the set of solutions of a convex minimization problem is convex
(Theorem 9.4.1, Fletcher, 1987). Hence, would there exist multiple minimizers of the lasso loss function, they
can be used to construct a convex set of minimizers. Thus, if β̂a (λ1 ) and β̂b (λ1 ) are lasso estimators, then so are
(1 − θ)β̂a (λ1 ) + θβ̂b (λ1 ) for θ ∈ (0, 1). This is illustrated in Example 6.1.
Example 6.1 (Super-collinear covariates)
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi i.i.d. normally
distributed with zero mean and a common variance. The rows of the design matrix X are of length two, neither
column represents the intercept, but X∗,1 = X∗,2 . Suppose an estimate of the regression parameter β of this model
is obtained through the minimization of the sum-of-squares augmented with a lasso penalty, ∥Y−Xβ∥22 +λ1 ∥β∥1
with penalty parameter λ1 > 0. To find the minimizer define u = β1 + β2 and v = β1 − β2 and rewrite the lasso
loss criterion to:
∥Y − X∗,1 u∥²₂ + ½λ1(|u + v| + |u − v|).
The function |u + v| + |u − v| is minimized with respect to v by any v such that |v| ≤ |u|, and the corresponding minimum equals 2|u|. The estimator of u thus minimizes:
∥Y − X∗,1 u∥²₂ + λ1|u|.
For sufficiently small values of λ1 the estimate of u will be unequal to zero. Then, any v such that |v| ≤ |u| will yield the same minimum of the lasso loss function. Consequently, β̂(λ1) is not uniquely defined as β̂1(λ1) = ½[û(λ1) + v̂(λ1)] need not equal β̂2(λ1) = ½[û(λ1) − v̂(λ1)] for any v̂(λ1) such that 0 < |v̂(λ1)| < |û(λ1)|. □
While the non-uniqueness of the lasso regression estimator may not be its most appealing property, it is unproblematic in many practical settings. The following lemma states that the estimator is, with high probability and under a mild
condition on the design matrix, unique.
Lemma 6.1 (Lemma 4, Tibshirani, 2013)
If the elements of X ∈ Mn,p are drawn from a continuous probability distribution on Rn×p , then for any Y and
λ > 0, the lasso solution is unique with probability one.
Moreover, while the lasso regression estimator β̂(λ1 ) need not be unique, its linear predictor Xβ̂(λ1 ) is. This can
be proven by contradiction (Tibshirani, 2013). Suppose there exist two lasso regression estimators of β, denoted
β̂a (λ1 ) and β̂b (λ1 ), such that Xβ̂a (λ1 ) ̸= Xβ̂b (λ1 ). Define c to be the minimum of the lasso loss function. Then,
by definition of the lasso estimators β̂a (λ1 ) and β̂b (λ1 ) satisfy:
∥Y − Xβ̂a(λ1)∥²₂ + λ1∥β̂a(λ1)∥1 = c = ∥Y − Xβ̂b(λ1)∥²₂ + λ1∥β̂b(λ1)∥1.
Now consider their convex combination (1 − θ)β̂a(λ1) + θβ̂b(λ1) for θ ∈ (0, 1). Its lasso loss satisfies:
∥Y − X[(1 − θ)β̂a (λ1 ) + θβ̂b (λ1 )]∥22 + λ1 ∥(1 − θ)β̂a (λ1 ) + θβ̂b (λ1 )∥1
= ∥(1 − θ)[Y − Xβ̂a (λ1 )] + θ[Y − Xβ̂b (λ1 )]∥22 + λ1 ∥(1 − θ)β̂a (λ1 ) + θβ̂b (λ1 )∥1
< (1 − θ)∥Y − Xβ̂a (λ1 )∥22 + θ∥Y − Xβ̂b (λ1 )∥22 + (1 − θ)λ1 ∥β̂a (λ1 )∥1 + θλ1 ∥β̂b (λ1 )∥1
= (1 − θ)c + θc = c,
where the strict inequality holds by the strict convexity of ∥Y − Xβ∥²₂ in Xβ and the convexity of ∥β∥1, for θ ∈ (0, 1). This implies that (1 − θ)β̂a(λ1) + θβ̂b(λ1) attains a value of the lasso loss function below its minimum c, which contradicts the assumption that β̂a(λ1) and β̂b(λ1) are lasso regression estimators (i.e. minimizers). Hence, the lasso linear predictor is unique.
Example 6.2 (Perfectly super-collinear covariates, revisited)
Revisit the setting of Example 6.1, where a linear regression model without intercept and only two but perfectly
correlated covariates is fitted to data. The example revealed that the lasso estimator need not be unique. The lasso
predictor, however, is unique:
Ŷ(λ1) = Xβ̂(λ1) = X∗,1β̂1(λ1) + X∗,2β̂2(λ1) = X∗,1[β̂1(λ1) + β̂2(λ1)] = X∗,1û(λ1),
with u defined and (uniquely) estimated as in Example 6.1 and v dropping from the predictor. □
Example 6.3
The issues, non- and uniqueness of the lasso-estimator and predictor, respectively, raised above are illustrated in a
numerical setting. Hereto data are generated in accordance with the linear regression model Y = Xβ + ε where
the n = 5 rows of X are sampled from N[0p, (1 − ρ)Ipp + ρ1pp] with p = 10, ρ = 0.99, β = (1⊤3, 0⊤p−3)⊤ and ε ∼ N(0p, (1/10)Inn). With these data the lasso estimator of the regression parameter β for λ1 = 1 is evaluated
using two different algorithms (see Section 6.4). Employed implementations of the algorithms are those available
through the R-packages penalized and glmnet. Both estimates, denoted β̂p (λ1 ) and β̂g (λ1 ) (the subscript
refers to the first letter of the package), are given in Table 6.1.

                   β1      β2      β3      β4      β5      β6      β7      β8      β9      β10
penalized β̂p(λ1)  0.267   0.000   1.649   0.093   0.000   0.000   0.000   0.571   0.000   0.269
glmnet    β̂g(λ1)  0.269   0.000   1.776   0.282   0.195   0.000   0.000   0.325   0.000   0.000

Table 6.1: Lasso estimates of the linear regression parameter β for both algorithms.

The table reveals that the estimates differ, in particular in their support (i.e. the set of nonzero values of the estimate of β). This is troublesome when it comes
to communication of the optimal model. From a different perspective the realized loss ∥Y − Xβ∥22 + λ1 ∥β∥1
for each estimate is approximately equal to 2.99, with the difference possibly due to convergence criteria of the
algorithms. On another note, their corresponding predictors, Xβ̂p (λ1 ) and Xβ̂g (λ1 ), correlate almost perfectly:
cor[Xβ̂p (λ1 ), Xβ̂g (λ1 )] = 0.999. These results thus corroborate the non-uniqueness of the estimator and the
uniqueness of the predictor.
# load libraries
library(penalized)
library(glmnet)
library(mvtnorm)
# set dimensions and sample the design matrix (as described above)
n <- 5; p <- 10; rho <- 0.99
betas <- c(rep(1, 3), rep(0, p-3))
X <- rmvnorm(n, mean=rep(0, p), sigma=(1-rho)*diag(p) + rho*matrix(1, p, p))
# sample response
Y <- as.numeric(X %*% betas + rnorm(n, sd=0.1))
# lasso estimates from both packages (lambda1 = 1; rescaled for glmnet, see below)
Bhat1 <- coef(penalized(Y, penalized=X, unpenalized=~0, lambda1=1,
                        standardize=FALSE), "all")
Bhat2 <- as.numeric(coef(glmnet(X, Y, alpha=1, lambda=1/(2*n),
                                standardize=FALSE, intercept=FALSE)))[-1]
# compare estimates
cbind(Bhat1, Bhat2)
# compare predictor
cor(X %*% Bhat1, X %*% Bhat2)
Note that in the code above the evaluation of the lasso estimator appears to employ a different lasso penalty parameter λ1 for each package. This is due to the fact that, internally (with the standardization of X and Y undone), the loss functions optimized are ∥Y − Xβ∥²₂ + λ1∥β∥1 vs. (2n)^{-1}∥Y − Xβ∥²₂ + λ1∥β∥1. Rescaling of λ1
resolves this issue. □
Figure 6.2: The left panel shows the regularization path of the lasso regression estimator for simulated data. The vertical grey dotted lines indicate the values of λ1 at which there is a discontinuity in the derivative (with respect to λ1) of the lasso regularization path of one of the regression estimates. The right panel displays the soft (solid, red) and hard (grey, dotted) threshold functions.
For particular cases, an orthonormal design (Example 6.4) and p = 2 (Example 6.6), an analytic expression for the lasso regression estimator exists. While the latter is of limited use, the former is exemplary and will prove useful later in the numerical evaluation of the lasso regression estimator in the general case (see Section 6.4).
Example 6.4 (Orthonormal design matrix)
Consider an orthonormal design matrix X, i.e. X⊤X = Ipp = (X⊤X)^{-1}. The lasso estimator then is:
β̂j(λ1) = sign(β̂j)(|β̂j| − ½λ1)+,    j = 1, . . . , p,
where β̂ = (X⊤X)^{-1}X⊤Y = X⊤Y is the maximum likelihood estimator of β, β̂j its j-th element, and f(x) = (x)+ = max{x, 0}. This expression for the lasso regression estimator can be obtained as follows.
Rewrite the lasso regression loss criterion:
min_β ∥Y − Xβ∥²₂ + λ1∥β∥1 = min_β Y⊤Y − Y⊤Xβ − β⊤X⊤Y + β⊤X⊤Xβ + λ1 Σ^p_{j=1}|βj|
                           ∝ min_β −β̂⊤β − β⊤β̂ + β⊤β + λ1 Σ^p_{j=1}|βj|
                           = min_{β1,...,βp} Σ^p_{j=1} (−2β̂jβj + βj² + λ1|βj|)
                           = Σ^p_{j=1} ( min_{βj} −2β̂jβj + βj² + λ1|βj| ).
The minimization problem can thus be solved per regression coefficient. This gives:
min_{βj} −2β̂jβj + βj² + λ1|βj| = { min_{βj} −2β̂jβj + βj² + λ1βj    if βj > 0,
                                  { min_{βj} −2β̂jβj + βj² − λ1βj    if βj < 0.
The minimization within the sum over the covariates is with respect to each element of the regression parameter
separately. Optimization with respect to the j-th one gives:
β̂j(λ1) = { β̂j − ½λ1    if β̂j(λ1) > 0,
          { β̂j + ½λ1    if β̂j(λ1) < 0,
          { 0            otherwise.
This case-wise solution can be compressed to the form of the lasso regression estimator above.
The analytic expression for the lasso regression estimator above provides insight into how it relates to the maximum likelihood estimator of β. The right-hand panel of Figure 6.2 depicts this relationship. Effectively, the
lasso regression estimator thresholds (after a translation) its maximum likelihood counterpart. The function is also
referred to as the soft-threshold function (for contrast the hard-threshold function is also plotted – dotted line – in
Figure 6.2). □
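The soft-threshold function is easily implemented. A small R sketch follows; the hard-threshold variant, included for comparison with Figure 6.2, is one possible definition with the threshold placed at ½λ1:

# soft-threshold function of Example 6.4: sign(bHat) * (|bHat| - lambda1/2)_+
softThreshold <- function(bHat, lambda1) {
  sign(bHat) * pmax(abs(bHat) - lambda1/2, 0)
}
# an illustrative hard-threshold counterpart: keep or kill the ML estimate
hardThreshold <- function(bHat, lambda1) {
  bHat * (abs(bHat) > lambda1/2)
}
# e.g. softThreshold(c(-2, -0.3, 0.1, 1.5), lambda1=1) yields -1.5, 0, 0, 1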
Example 6.5 (Orthogonal design matrix)
Consider the response vector and design matrix:
Y⊤ = (−4.9  −0.8  −8.9  4.9  1.1  −2.0),
X⊤ = (  1  −1   3  −3   1   1
       −3  −3  −1   0   3   0 ).
Note that the design matrix is orthogonal, i.e. its columns are orthogonal (but not normalized to one). The
orthogonality of X yields a diagonal X⊤X, as is its inverse (X⊤X)^{-1}. Here diag(X⊤X) = (22, 28).
Rescale X to an orthonormal design matrix, denoted X̃, and rewrite the lasso regression loss function to:
∥Y − Xβ∥²₂ + λ1∥β∥1 = ∥Y − X diag(√22, √28)^{-1} diag(√22, √28)β∥²₂ + λ1∥β∥1
                    = ∥Y − X̃γ∥²₂ + (λ1/√22)|γ1| + (λ1/√28)|γ2|,
where γ = (√22 β1, √28 β2)⊤ and X̃ = X diag(√22, √28)^{-1}. By the same argument this loss can be minimized with respect to each element of
γ separately. In particular, the soft-threshold function provides an analytic expression for the estimates of γ:
γ̂1(λ1/√22) = sign(γ̂1)[|γ̂1| − ½(λ1/√22)]+ = −[9.892513 − ½(10/√22)]+ = −8.826509,
γ̂2(λ1/√28) = sign(γ̂2)[|γ̂2| − ½(λ1/√28)]+ = [5.537180 − ½(10/√28)]+ = 4.592269,
where γ̂1 and γ̂2 are the ordinary least square estimates of γ1 and γ2 obtained from regressing Y on the correspond-
ing column of X̃. Rescale back and obtain the lasso regression estimate: β̂(10) = (−1.881818, 0.8678572)⊤ . □
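The computations of this example are readily verified numerically. A small R sketch, reproducing the rescaling, the coordinate-wise soft-thresholding, and the back-transformation to the β scale:

# reproduce Example 6.5: orthogonal (not orthonormal) design, lambda1 = 10
Y <- c(-4.9, -0.8, -8.9, 4.9, 1.1, -2.0)
X <- cbind(c(1, -1, 3, -3, 1, 1), c(-3, -3, -1, 0, 3, 0))
lambda1  <- 10
colNorms <- sqrt(colSums(X^2))                    # sqrt(22) and sqrt(28)
Xtilde   <- sweep(X, 2, colNorms, "/")            # orthonormal design
gammaML  <- as.numeric(t(Xtilde) %*% Y)           # ML estimates of gamma
# column-wise soft-thresholding with rescaled penalties lambda1 / colNorms
gammaHat <- sign(gammaML) * pmax(abs(gammaML) - (lambda1 / colNorms) / 2, 0)
betaHat  <- gammaHat / colNorms                   # back to the beta scale
betaHat                                           # approximately (-1.88, 0.87)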
Example 6.6 (p = 2)
Consider the linear regression model with p = 2 covariates whose (scaled) design matrix X is such that X⊤X has unit diagonal and off-diagonal elements equal to ρ, for some ρ ∈ (−1, 1). The lasso regression estimator is then of similar form as in the orthonormal case but the soft-threshold function now depends on λ1, ρ and the maximum likelihood estimate β̂ (see Exercise 6.8). □
Apart from the specific cases outlined in the two examples above no other explicit solution for the minimizer
of the lasso regression loss function appears to be known. Locally though, for large enough values of λ1 , an
analytic expression for the solution can also be derived. Hereto we point out that (details to be included later) the
lasso regression estimator satisfies the following estimating equation:
X⊤ Xβ̂(λ1 ) = X⊤ Y − 21 λ1 ẑ
for some ẑ ∈ Rp with (ẑ)j = sign{[β̂(λ1)]j} whenever [β̂(λ1)]j ≠ 0 and (ẑ)j ∈ [−1, 1] if [β̂(λ1)]j = 0. Then:
0 ≤ [β̂(λ1)]⊤X⊤Xβ̂(λ1) = [β̂(λ1)]⊤(X⊤Y − ½λ1ẑ) = Σ^p_{j=1} [β̂(λ1)]j(X⊤Y − ½λ1ẑ)j.
For λ1 > 2∥X⊤Y∥∞ the summands on the right-hand side satisfy, for every j with [β̂(λ1)]j ≠ 0,
[β̂(λ1)]j(X⊤Y − ½λ1ẑ)j = [β̂(λ1)]j(X⊤Y)j − ½λ1|[β̂(λ1)]j| ≤ |[β̂(λ1)]j|(∥X⊤Y∥∞ − ½λ1) < 0.
A nonnegative sum of nonpositive terms requires all terms, and thereby all elements of β̂(λ1), to equal zero. This implies that β̂(λ1) = 0p if λ1 > 2∥X⊤Y∥∞, where ∥a∥∞ is the supremum norm of vector a defined as ∥a∥∞ = max{|a1|, |a2|, . . . , |ap|}.
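For the loss parametrization ∥Y − Xβ∥²₂ + λ1∥β∥1 used here, this bound is readily evaluated. A small illustrative R snippet with simulated data:

# smallest lambda1 beyond which the lasso regression estimate is the zero vector
set.seed(1)
n <- 10; p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
lambda1Max <- 2 * max(abs(crossprod(X, Y)))       # 2 * ||X'Y||_infinity
lambda1Max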
6.3 Sparsity
The change from the ℓ22 -norm to the ℓ1 -norm in the penalty may seem only a detail. Indeed, both ridge and lasso
regression fit the same linear regression model. But the attractiveness of the lasso lies not in what it fits, but in a
consequence of how it fits the linear regression model. The lasso estimator of the vector of regression parameters
may contain some or many zero’s. In contrast, ridge regression yields an estimator of β with elements (possibly)
close to zero, but unlikely to be equal to zero. Hence, lasso penalization results in β̂j (λ1 ) = 0 for some j (in
particular for large values of λ1 , see Section 6.1), while ridge penalization yields an estimate of the j-th element
of the regression parameter β̂j (λ2 ) ̸= 0. A zero estimate of a regression coefficient means that the corresponding
covariate has no effect on the response and can be excluded from the model. Effectively, this amounts to variable
selection. Where traditionally the linear regression model is fitted by means of maximum likelihood followed by
a testing step to weed out the covariates with effects indistinguishable from zero, lasso regression is a one-step-go
procedure that simulatenously estimates and selects.
The in-built variable selection of the lasso regression estimator is a geometric accident. To understand how
it comes about, the lasso regression loss optimization problem (6.2) is reformulated as a constrained estimation problem (using the same argumentation as previously employed for ridge regression, see Section 1.5):
β̂(λ1) = arg min_{β∈Rp : ∥β∥1 ≤ c(λ1)} ∥Y − Xβ∥²₂,
where c(λ1) = ∥β̂(λ1)∥1. Again, this is the standard least squares problem, with the only difference that the sum
of the (absolute value of the) regression parameters β1 , β2 , . . . , βp is required to be smaller than c(λ1 ). The effect
of this requirement is that the lasso estimator of the regression parameter coefficients can no longer assume any
value (from −∞ to ∞, as is the case in standard linear regression), but are limited to a certain range of values.
With the lasso and ridge regression estimators minimizing the same sum-of-squares, the key difference with the
constrained estimation formulation of ridge regression is not in the explicit form of c(λ1 ) (and is set to some
arbitrary convenient value in the remainder of this section) but in what is bounded by c(λ1 ) and the domain of
acceptable values for β that it implies. For the lasso regression estimator the domain is specified by a bound on
the ℓ1 -norm of the regression parameter while for its ridge counterpart the bound is applied to the squared ℓ2 -norm
of β. The parameter constraints implied by the lasso and ridge norms result in balls in different norms:
{β ∈ Rp : ∥β∥1 ≤ c1(λ1)}   and   {β ∈ Rp : ∥β∥²₂ ≤ c2(λ2)},
respectively, where c(·) is now equipped with a subscript referring to the norm to stress that it is different
for lasso and ridge. The left-hand panel of Figure 6.3 visualizes these parameter constraints for p = 2 and
c1 (λ1 ) = 2 = c2 (λ2 ). In the Euclidean space ridge yields a spherical constraint for β, while a diamond-like shape
for the lasso. The lasso regression estimator is then that β inside this diamond domain that yields the smallest
sum-of-squares (as is visualized by the right-hand panel of Figure 6.3).
The right-hand panel of Figure 6.3 also illustrates Lemma 6.1. The lasso solution is non-unique if the level sets of the sum-of-squares are parallel to one of the sides of the diamond parameter constraint. This occurs with probability zero.
The selection property of the lasso is due to the fact that the diamond-shaped parameter constraint has its
corners falling on the axes. For a point to lie on an axis, one coordinate needs to equal zero. The lasso regression
Figure 6.3: Left panel: the lasso parameter constraint (|β1| + |β2| ≤ 2) and its ridge counterpart (β1² + β2² ≤ 2). Right panel: the lasso regression estimator as a constrained least squares estimator.
Figure 6.4: Shrinkage with the lasso. The range of possible lasso estimates is demarcated by the diamond around
the origin. The grey areas contain all points that are closer to one of the diamond's corners than to any other point inside the diamond. If the maximum likelihood estimate falls inside any of these grey areas, the lasso shrinks it to the closest diamond tip (which corresponds to a sparse solution). For example, let the red dot in the fourth quadrant be a maximum likelihood estimate. It is in a grey area. Hence, its lasso estimate is the red dot at the lowest tip of
the diamond.
estimator coincides with the point inside the diamond closest to the maximum likelihood estimator. This point
may correspond to a corner of the diamond, in which case one of the coordinates (regression parameters) equals
zero and, consequently, the lasso regression estimator does not select this element of β. Figure 6.4 illustrates the
selection property for the case with p = 2 and an orthonormal design matrix. An orthonormal design matrix yields level sets (orange dotted circles in Figure 6.4) of the sum-of-squares that are spherical and centered around the maximum likelihood estimate (red dot in Figure 6.4). For maximum likelihood estimates inside the grey areas the closest point in the diamond-shaped parameter domain will be one of its corners. Hence, for these maximum likelihood estimates the corresponding lasso regression estimate will include only a single covariate in the model. The geometrical explanation of the selection property of the lasso regression estimator also applies to non-orthonormal design matrices and to dimensions larger than two. In particular, high-dimensionally, the sum-of-squares may be a degenerate ellipsoid, which can and will still hit a corner of the diamond-shaped parameter domain. Finally, note that a zero value of the lasso regression estimate implies neither that the parameter is indeed zero nor that it is statistically indistinguishable from zero.
Larger values of the lasso penalty parameter λ1 induce tighter parameter constraints. Consequently, the number
of zero elements in the lasso regression estimator of β increases as λ1 increases. However, where ∥β̂(λ1 )∥1
decreases monotonically as λ1 increases (left panel of Figure 6.5 for an example and Exercise 6.13), the number
of non-zero coefficients does not. Locally, at some finite λ1 , the number of non-zero elements in β̂(λ1 ) may
temporarily increase with λ1 , to only go down again as λ1 is sufficiently increased (as in the λ1 → ∞ limit the
number of non-zero elements is zero, see the argumentation at the end of Section 6.2). The right panel of Figure
6.5 illustrates this behavior for an arbitrary data set.
Figure 6.5: Left panel: the ℓ1-norm ∥β̂(λ1)∥1 of the lasso regression estimator against the penalty parameter λ1. Right panel: the number of nonzero elements of β̂(λ1) against λ1.
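This non-monotonicity is easily inspected empirically. A hedged R sketch using glmnet on simulated data follows; recall from the note at the end of Example 6.3 that glmnet's penalty parameter differs from λ1 by a factor 2n:

# trace the number of nonzero elements of the lasso estimator along the penalty path
library(glmnet)
set.seed(2)
n <- 25; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X[,1:3] %*% rep(1, 3) + rnorm(n))
lassoFit <- glmnet(X, Y, alpha=1, standardize=FALSE, intercept=FALSE)
l1norms  <- colSums(abs(as.matrix(lassoFit$beta)))   # ||beta(lambda1)||_1 per lambda
nNonzero <- lassoFit$df                              # number of nonzero elements
plot(2 * n * lassoFit$lambda, nNonzero, type="s",
     xlab="lambda1", ylab="number of nonzero elements")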
The attractiveness of the lasso regression estimator is in its simultaneous estimation and selection of parame-
ters. For large enough values of the penalty parameter λ1 the estimated regression model comprises only a subset
of the supplied covariates. In high dimensions (demanding a large penalty parameter) the number of parameters selected by the lasso regression estimator is usually small (relative to the total number of parameters), thus
producing a so-called sparse model. Would one adhere to the parsimony principle, such a sparse and thus simpler
model is preferable over a full model. Simpler may be better, but too simple is worse. The phenomenon or system
that is to be described by the model need not be sparse. For instance, in molecular biology the regulatory network
of the cell is no longer believed to be sparse (Boyle et al., 2017). Similarly, when analyzing brain image data, the
connectivity of the brain is not believed to be sparse.
# load library and data
library(lars)
data(diabetes)
X <- diabetes$x
Y <- diabetes$y
Irrespective of the drawn sample size n the plotted regularization paths all terminate before the (n + 1)-th variate enters the model. This could of course be circumstantial evidence at best, or even be labelled a bug in the software. But even without the LARS algorithm the nontrivial part of the inequality, namely that the number of selected variates does not exceed the sample size n, can be proven (Osborne et al., 2000).
In the high-dimensional setting, when p is large compared to n, this implies a considerable dimension reduction. It is, however, somewhat unsatisfactory that it is the study design, i.e. the number of included samples, that determines the upper bound on the model size.
6.4 Estimation
In the absence of an analytic expression for the optimum of the lasso loss function (6.2), much attention is devoted
to numerical procedures to find it.
min_{β∈Rp : Rβ + c(λ1)1q ≥ 0q} ½(Y − Xβ)⊤(Y − Xβ),        (6.3)
where R is a suitably chosen q × p dimensional linear constraint matrix that specifies the linear constraints on the parameter β. For p = 2 the domain implied by the lasso parameter constraint {β ∈ R² : ∥β∥1 ≤ c(λ1)} is obtained by letting the q = 4 rows of R run over the four sign combinations (±1, ±1):
R⊤ = ( −1  −1   1   1
       −1   1  −1   1 ).
L(β, ν) = ½(Y − Xβ)⊤(Y − Xβ) − ν⊤[Rβ + c(λ1)1q],        (6.4)
where ν = (ν1, . . . , νq)⊤ is the vector of non-negative multipliers. The dual function is now defined as inf_β L(β, ν). This infimum is attained at:
β* = (X⊤X)^{-1}(X⊤Y + R⊤ν),        (6.5)
which can be verified by equating the first order partial derivative with respect to β of the Lagrangian to zero and solving for β. Substitution of β = β* into the dual function gives, after changing the minus sign:
½ν⊤R(X⊤X)^{-1}R⊤ν + ν⊤[R(X⊤X)^{-1}X⊤Y + c(λ1)1q] − ½Y⊤[Inn − X(X⊤X)^{-1}X⊤]Y.
The dual problem minimizes this expression (from which the last term is dropped as it does not involve ν) with respect to ν, subject to ν ≥ 0q. Although also a quadratic programming problem, the dual problem i) has a simpler formulation of the linear constraints and ii) is defined on a lower dimensional space than the primal problem (should the number of columns of R exceed its number of rows). If ν̃ is the solution of the dual problem, the solution of the primal problem is obtained from Equation (6.5). Refer to, e.g., Bertsekas (2014) for more on quadratic programming.
Example 6.5 (Orthogonal design matrix, continued)
The evaluation of the lasso regression estimator by means of quadratic programming is illustrated using the data
from the numerical Example 6.5. The R-script below solves, using the implementation offered by the quadprog-package, the quadratic program associated with the lasso regression problem of the aforementioned example.
Listing 6.3 R code
# load library
library(quadprog)
# data
Y <- matrix(c(-4.9, -0.8, -8.9, 4.9, 1.1, -2.0), ncol=1)
X <- t(matrix(c(1, -1, 3, -3, 1, 1, -3, -3, -1, 0, 3, 0), nrow=2, byrow=TRUE))
# constraint radius
L1norm <- 1.881818 + 0.8678572
# linear constraints: the four sign combinations of (beta1, beta2)
Amat <- matrix(c(-1,-1, -1,1, 1,-1, 1,1), nrow=2)
bvec <- rep(-L1norm, 4)
# solve the quadratic program: min 0.5 b'Db - d'b subject to t(Amat) b >= bvec
qpFit <- solve.QP(Dmat=t(X) %*% X, dvec=as.vector(t(X) %*% Y), Amat=Amat, bvec=bvec)
qpFit$solution                      # approximately the lasso estimate of Example 6.5
For relatively small p quadratic programming is a viable option to find the lasso regression estimator. For large p it is practically not feasible. Above, the linear constraint matrix R is 4 × 2 dimensional for p = 2. When p = 3, it requires a linear constraint matrix R with eight rows, corresponding to the eight facets of the 3-dimensional ℓ1-ball. In general, 2^p linear constraints are required to fully specify the parameter constraint of the lasso regression estimator. If p is large, the specification of the linear constraint matrix R alone will take forever, as it is a 2^p × p matrix, let alone solving the corresponding quadratic program.
Figure 6.6: Left panel: quadratic approximation (i.e. the ridge penalty) to the absolute value function (i.e. the lasso
penalty). Right panel: Illustration of the coordinate descent algorithm. The dashed grey lines are the level sets
of the lasso regression loss function. The red arrows depict the parameter updates. These arrows are parallel to
either the β1 or the β2 parameter axis, thus indicating that the regression parameter β is updated coordinate-wise.
The loss function now contains a weighted ridge penalty. In this one recognizes a generalized ridge regression loss
function (see Chapter 3). As its minimizer is known, the approximated lasso regression loss function is optimized
by:
β^(k+1)(λ1) = {X⊤X + λ1Ψ[β^(k)(λ1)]}^{-1}X⊤Y,
where
diag{Ψ[β^(k)(λ1)]} = (1/|β1^(k)|, 1/|β2^(k)|, . . . , 1/|βp^(k)|).
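A minimal R sketch of these iterative ridge updates follows; the ridge initialization and the small-value safeguard (which also thresholds near-zero elements to zero at the end) are practical choices of this sketch, not part of the display above:

# iterative ridge ('quadratic approximation') sketch for the lasso regression estimator
iterativeRidgeLasso <- function(Y, X, lambda1, nIter=100, tol=1e-5) {
  p     <- ncol(X)
  betas <- solve(t(X) %*% X + lambda1 * diag(p), t(X) %*% Y)   # ridge start
  for (k in 1:nIter) {
    psi      <- 1 / pmax(abs(betas), tol)                      # weights 1/|beta_j|
    betasNew <- solve(t(X) %*% X + lambda1 * diag(as.numeric(psi)), t(X) %*% Y)
    if (max(abs(betasNew - betas)) < tol) { betas <- betasNew; break }
    betas <- betasNew
  }
  as.numeric(ifelse(abs(betas) < sqrt(tol), 0, betas))
}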
(Figure: illustration of gradient ascent in the (β1, β2)-plane, showing an initial value, the optimum in the chosen initial direction, and (new) paths of steepest ascent towards the global optimum.)
The application of gradient ascent to find the lasso regression estimator is frustrated by the non-differentiability
(with respect to any of the regression parameters) of the lasso penalty function at zero. In Goeman (2010) this
is overcome by the use of a generalized derivative. Define the directional or Gâteaux derivative of the function
f : Rp → R at x ∈ Rp in the direction of v ∈ Rp as:
f′(x) = lim_{τ↓0} τ^{-1}[f(x + τv) − f(x)],
assuming this limit exists. The Gâteaux derivative thus gives the infinitesimal change in f at x in the direction of
v. As such f ′ (x) is a scalar (as is immediate from the definition after noting that f (·) ∈ R) and should not be
confused with the gradient (the vector of partial derivatives). Furthermore, at each point x there are infinitely many
Gâteaux differentials (as there are infinitely many choices for v ∈ Rp ). In the particular case when v = ej , ej
the unit vector along the axis of the j-th coordinate, the directional derivative coincides with the partial derivative
of f in the direction of xj . Relevant for the case at hand is the absolute value function f (x) = |x| with x ∈ R.
Evaluation of the limits in its Gâteaux derivative yields:
f′(x) = { v x/|x|   if x ≠ 0,
        { |v|       if x = 0,
for any v ∈ R \ {0}. Hence, the Gâteaux derivative of |x| does exist at x = 0. In general, the Gâteaux differential
may be uniquely defined by limiting the directional vectors v to i) those with unit length (i.e. ∥v∥ = 1) and ii) the
direction of steepest ascent. Using the Gâteaux derivative a gradient of f (·) at x ∈ Rp may then be defined as:
∇f(x) = { f′(x) ◦ vopt   if f′(x) ≥ 0,
         { 0p             if f′(x) < 0,        (6.6)
in which ◦ is the Hadamard (i.e. element-wise) product and vopt = arg max_{v : ∥v∥=1} f′(x). This is the direction
of steepest ascent, vopt , scaled by Gâteaux derivative, f ′ (x), in the direction of vopt .
Goeman (2010) applies the definition of the Gâteaux gradient to the lasso penalized likelihood (6.2) using the
direction of steepest ascent as vopt . The resulting partial Gâteaux derivative with respect to the j-th element of the
regression parameter β is:
∂/∂βj Llasso(Y, X; β) = { ∂/∂βj L(Y, X; β) − λ1 sign(βj)                    if βj ≠ 0,
                        { ∂/∂βj L(Y, X; β) − λ1 sign[∂/∂βj L(Y, X; β)]      if βj = 0 and |∂L/∂βj| > λ1,
                        { 0                                                  otherwise,
where ∂L/∂βj = Σ^p_{j′=1}(X⊤X)j′,j βj′ − (X⊤Y)j. This can be understood through a case-by-case study. The
partial derivative above is assumed to be clear for the βj ̸= 0 and the ‘otherwise’ cases. That leaves the clarification
of the middle case. When βj = 0, the direction of steepest ascent of the penalized loglikelihood points either into
{β ∈ Rp : βj > 0}, or {β ∈ Rp : βj < 0}, or stays in {β ∈ Rp : βj = 0}. When the direction of steepest
ascent points into the positive or negative half-hyperplanes, the contribution of λ1 |βj | to the partial Gâteaux
derivative is simply λ1 or −λ1 , respectively. Then, only when the partial derivative of the log-likelihood together
with this contribution is larger than zero, the penalized loglikelihood improves and the direction is of steepest
ascent. Similarly, the direction of steepest ascent may be restricted to {β ∈ Rp | βj = 0} and the contribution
of λ1 |βj | to the partial Gâteaux derivative vanishes. Then, only if the partial derivative of the loglikelihood is
positive, this direction is to be pursued for the improvement of the penalized loglikelihood.
Convergence of gradient ascent can be slow close to the optimum. This is due to its linear approximation
of the function. Close to the optimum the linear term of the Taylor expansion vanishes and is dominated by the
second-order quadratic term. To speed up convergence close to the optimum the gradient ascent implementation
offered by the penalized-package switches to a Newton-Raphson procedure.
arg min_{βj} ∥Y − Xβ∥²₂ + λ1∥β∥1 = arg min_{βj} ∥Y − X∗,\j β\j − X∗,j βj∥²₂ + λ1|βj| = arg min_{βj} ∥Ỹ − X∗,j βj∥²₂ + λ1|βj|,
where Ỹ = Y − X∗,\j β\j . After a simple rescaling of both X∗,j and βj , the minimization of the lasso regression
loss function with respect to βj is equivalent to one with an orthonormal design matrix. From Example 6.4 it is
known that the minimizer is obtained by application of the soft-threshold function to the corresponding maximum
likelihood estimator (now derived from Ỹ and X∗,j ). The coordinate descent algorithm iteratively runs over the
p elements until convergence. The right panel of Figure 6.6 provides an illustration of the coordinate descent
algorithm.
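A minimal R sketch of the coordinate descent algorithm just described: cycle over the coordinates, form the partial residual Ỹ, and apply the soft-threshold function to the rescaled per-coordinate problem:

# coordinate descent sketch for the lasso regression estimator
coordinateDescentLasso <- function(Y, X, lambda1, nIter=1000, tol=1e-8) {
  p     <- ncol(X)
  betas <- rep(0, p)
  for (k in 1:nIter) {
    betasOld <- betas
    for (j in 1:p) {
      Ytilde   <- Y - X[, -j, drop=FALSE] %*% betas[-j]        # partial residual
      bML      <- sum(X[, j] * Ytilde) / sum(X[, j]^2)         # per-coordinate LS estimate
      lambdaj  <- lambda1 / sum(X[, j]^2)                      # rescaled penalty
      betas[j] <- sign(bML) * max(abs(bML) - lambdaj / 2, 0)   # soft-threshold
    }
    if (max(abs(betas - betasOld)) < tol) break
  }
  betas
}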
Convergence of the coordinate descent algorithm to the minimum of the lasso regression loss function (6.2) is
warranted by the convexity of this function. At each minimization step the coordinate descent algorithm yields an update of the parameter estimate that corresponds to an equal or smaller value of the loss function. This, together with the compactness of the diamond-shaped parameter domain and the boundedness (from below) of the lasso regression loss function, implies that the coordinate descent algorithm converges to the minimum of this lasso regression loss function. Although convergence is assured, it may take many steps for it to be reached. In particular, when i) two covariates are strongly collinear, ii) one of the two covariates contributes only slightly more to the response, and iii) the algorithm is initiated with the weaker explanatory covariate. The coordinate descent algorithm will then take many iterations to replace the latter covariate by the preferred one. In such cases simultaneous updating, as is done by the gradient ascent algorithm (Section 6.4.3), may be preferable.
6.5 Moments
In general the moments of the lasso regression estimator appear to be unknown. In certain cases an approximation
can be given. This is pointed out here. Use the quadratic approximation to the absolute value function of Section
6.4.2 and approximate the lasso regression loss function around the lasso regression estimate:
∥Y − Xβ∥²₂ + λ1∥β∥1 ≈ ∥Y − Xβ∥²₂ + ½λ1 Σ^p_{j=1} |β̂j(λ1)|^{-1} βj².
Optimization of the right-hand side of the preceding display with respect to β gives a 'ridge approximation' to the lasso estimator:
β̂(λ1) ≈ {X⊤X + λ1Ψ[β̂(λ1)]}^{-1}X⊤Y,
with (Ψ[β̂(λ1)])jj = |β̂j(λ1)|^{-1} if β̂j(λ1) ≠ 0. Now use this 'ridge approximation' to obtain the approximation to the moments of the lasso regression estimator:
E[β̂(λ1)] ≈ E({X⊤X + λ1Ψ[β̂(λ1)]}^{-1}X⊤Y) = {X⊤X + λ1Ψ[β̂(λ1)]}^{-1}X⊤Xβ
and
Var[β̂(λ1)] ≈ Var({X⊤X + λ1Ψ[β̂(λ1)]}^{-1}X⊤Y) = σ²{X⊤X + λ1Ψ[β̂(λ1)]}^{-1}X⊤X{X⊤X + λ1Ψ[β̂(λ1)]}^{-1}
for λ1 > 0. The number of nonzero elements of β̂(λ1) is an unbiased estimator of the lasso regression estimator's degrees of freedom. If rank(X) = p, this number of nonzeros is even a consistent estimator of
the degrees of freedom.
Proof. The full proof is beyond the scope of these notes, and here we limit ourselves to show the unbiasedness of
the degrees of freedom estimator for orthonormal X. The proof makes use of Lemma 2 of Stein (1981), which
states that if Y ∼ N(µ, σ²) and g(·) is an absolutely continuous function with E[g′(Y)] < ∞, then E[g(Y)(Y −
µ)] = σ 2 E[g ′ (Y )]. Starting from the degrees of freedom definition introduced in Section 1.6, we then manipulate
(with minor abuse of notation) as follows:
df(λ1) = Σ^n_{i=1} [Var(Yi)]^{-1} Cov(Ŷi, Yi)
        = σ^{-2} Σ^n_{i=1} E{[Ŷi − E(Ŷi)][Yi − E(Yi)]}
        = σ^{-2} Σ^n_{i=1} E{Xi,∗β̂(λ1)[Yi − E(Yi)]}
        = σ^{-2} Σ^n_{i=1} Σ^p_{j=1} Xi,j E{[β̂(λ1)]j[Yi − E(Yi)]}
        = σ^{-2} Σ^n_{i=1} Σ^p_{j=1} Xi,j E{sign(β̂j^ml)(|β̂j^ml| − ½λ1)+[Yi − E(Yi)]}
        = Σ^n_{i=1} Σ^p_{j=1} Xi,j E{ (d/dYi)[ sign(Σ^n_{i′=1}Xi′,jYi′)(|Σ^n_{i′=1}Xi′,jYi′| − ½λ1)+ ] }
        = Σ^n_{i=1} Σ^p_{j=1} X²i,j E[ 1{|Σ^n_{i′=1}Xi′,jYi′| > ½λ1} ]
        = E[ Σ^p_{j=1} 1{|Σ^n_{i=1}Xi,jYi| > ½λ1} ],
where we have used the definition of the covariance, the analytic expression of the lasso regression estimator in
the orthonormal case, and the independence among the samples. ■
Would one select the lasso regression estimator's penalty parameter on the basis of an information criterion, this simple unbiased estimator of its degrees of freedom is rather convenient.
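As an illustration of this convenience, a hedged R sketch follows; the Gaussian AIC-type criterion and the simulated data are illustrative choices, with glmnet's df slot providing the nonzero count:

# number of nonzero elements as degrees of freedom inside an AIC-type criterion
library(glmnet)
set.seed(3)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X[,1:3] %*% rep(1.5, 3) + rnorm(n))
lassoFit <- glmnet(X, Y, alpha=1, standardize=FALSE, intercept=FALSE)
RSS  <- colSums((Y - predict(lassoFit, newx=X))^2)   # residual sum-of-squares per lambda
dfs  <- lassoFit$df                                  # estimated degrees of freedom
AICs <- n * log(RSS / n) + 2 * dfs                   # AIC-type criterion
lassoFit$lambda[which.min(AICs)]                     # selected penalty parameter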
Figure 6.8: Top panel: the Laplace prior associated with the Bayesian counterpart of the lasso regression estimator.
Bottom left panel: the posterior distribution of the regression parameter for various Laplace priors. Bottom right
panel: posterior mode vs. the penalty parameter λ1 .
The posterior is not a well-known and characterized distribution. This is not problematic, as interest here concentrates on its maximum. The location of the posterior mode coincides with the location of the maximum of the logarithm
of the posterior. The log-posterior is proportional to: −(2σ 2 )−1 ∥Y − Xβ∥22 − b−1 ∥β∥1 , with its maximizer
minimizing ∥Y − Xβ∥22 + (2σ 2 /b)∥β∥1 . In this one recognizes the form of the lasso regression loss function
(6.2). It is thus clear that the scale parameter of the Laplace distribution reciprocally relates to lasso penalty
parameter λ1 , similar to the relation of the ridge penalty parameter λ2 and the variance of the Gaussian prior of
the ridge regression estimator.
While the posterior may not be a standard distribution, in the univariate case (p = 1) it can be visualized. Specif-
ically, the behaviour of the MAP can then be illustrated, which – as the MAP estimator corresponds to the lasso
regression estimator – should also exhibit the selection property (see Exercise 6.15). The bottom left panel of
Figure 6.8 shows the posterior distribution for various choices of the Laplace scale parameter (i.e. lasso penalty
parameter). Clearly, the mode shifts towards zero as the scale parameter decreases / lasso penalty parameter in-
creases. In particular, the posterior obtained from the Laplace prior with the smallest scale parameter (i.e. largest
penalty parameter λ1 ), although skewed to the left, has a mode placed exactly at zero. The Laplace prior may thus
produce MAP estimators that select. However, for smaller values of the lasso penalty parameter the Laplace prior
is not concentrated enough around zero and the contribution of the likelihood in the posterior outweighs that of
the prior. The mode is then not located at zero and the parameter is ‘selected’ by the MAP estimator. The bottom
right panel of Figure 6.8 plots the mode of the normal-Laplace posterior vs. the Laplace scale parameter. In line
with Theorem 6.1 it is piece-wise linear.
Park and Casella (2008) go beyond the elementary correspondence of the frequentist lasso estimator and the
Bayesian posterior mode and formulate the Bayesian lasso regression model. To this end they exploit the fact that
the Laplace distribution can be written as a scale mixture of normal distributions with an exponential mixing density. This allows the construction of a Gibbs sampler for the Bayesian lasso estimator. Finally, they suggest imposing a gamma-type hyperprior on the (square of the) lasso penalty parameter. Such a full Bayesian formulation of the lasso problem enables the construction of credible sets (i.e. the Bayesian counterpart of confidence intervals) to express the uncertainty of the maximum a posteriori estimator. However, although the lasso regression estimator may be seen as a Bayesian estimator, in the sense that it coincides with the posterior mode, the 'lasso' posterior distribution cannot be blindly used for uncertainty quantification. In high-dimensional sparse settings the 'lasso' posterior
distribution of β need not concentrate around the true parameter, even though its mode is a good estimator of the
regression parameter (cf. Section 3 and Theorem 7 of Castillo et al., 2015).
A covariate is thus considered stable if at some point on its stability path it is selected in more than 100πthr %
of the subsamples. Meinshausen and Bühlmann (2010) claim that selection on the basis of stability is relatively
insensitive to either the choice of πthr or that of λ1 and Λ. Meinshausen and Bühlmann (2010) then prove that error
control of selection based on stability is possible.
E(|S ∩ ŜΛ|) / E(|N ∩ N̂Λ|) = |S| / |N|.
The number V = |N ∩ Ŝstable| of falsely selected but stable variables is then bounded in expectation, for πthr ∈ (½, 1), by E(V) ≤ (2πthr − 1)^{-1} p^{-1} q²Λ, where qΛ = E_I[|ŜΛ(I)|] is the expected number of selected covariates over all subsamples of equal size.
Exchangeability thus requires that for the covariates with a zero coefficient the probability of being selected is
invariant under subsampling over the full range of penalty parameters. The validity of the exchangeability assumption may
be hard to assess in practice. The random guessing assumption, however, seems unproblematic for any minimally
sophisticated method.
Theorem 6.4 can be put to practical use and guide the decision on the optimal value(s) of λ1 as follows. Specify the stability threshold πthr, i.e. the bound on the covariates' selection frequency beyond which they are considered stable. Then, to ensure (say) E(V) ≤ 1, choose the penalty parameter λ1 such that qΛ ≤ √[(2πthr − 1)p]. While qΛ is in principle unknown, it can be estimated by means of resampling.
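A minimal R sketch of the subsampling scheme behind the stability path and an estimate of qΛ; the data, the grid Λ, the number of subsamples, and the threshold are illustrative choices:

# stability selection sketch: selection frequencies over subsamples of size n/2
library(glmnet)
set.seed(4)
n <- 100; p <- 40
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X[,1:4] %*% rep(1, 4) + rnorm(n))
lambdas <- glmnet(X, Y, alpha=1)$lambda[1:20]         # grid Lambda
nSub    <- 100
selFreq <- matrix(0, p, length(lambdas))
qHat    <- 0
for (b in 1:nSub) {
  ids     <- sample(1:n, floor(n/2))
  subFit  <- glmnet(X[ids, ], Y[ids], alpha=1, lambda=lambdas)
  nonzero <- as.matrix(subFit$beta) != 0
  selFreq <- selFreq + nonzero / nSub                 # per-(covariate, lambda) frequency
  qHat    <- qHat + sum(rowSums(nonzero) > 0) / nSub  # covariates selected anywhere on Lambda
}
piThr  <- 0.9
stable <- which(apply(selFreq, 1, max) > piThr)       # stable covariates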
6.9.1 Linearity
The ridge regression estimator is a linear (in the observations) estimator, while the lasso regression estimator is not. This is immediate from the analytic expression of the ridge regression estimator, β̂(λ2) = (X⊤X + λ2 Ipp)⁻¹ X⊤Y, which is a linear combination of the observations Y. To show the non-linearity of the lasso regression estimator, it suffices to study the analytic expression of the j-th element of β̂(λ1) in the orthonormal case: β̂j(λ1) = sign(β̂j)(|β̂j| − ½λ1)+ = sign(X⊤∗,j Y)(|X⊤∗,j Y| − ½λ1)+. This clearly is not linear in Y. Consequently, when the response Y is scaled by some constant c, denoted Ỹ = cY, the corresponding ridge regression estimators are one-to-one related by this same factor: β̃(λ2) = c β̂(λ2). The lasso regression estimator based on the unscaled data is not so easily recovered from its counterpart obtained from the scaled data.
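A small numerical illustration of this difference, using base R and the analytic expressions for an orthonormal design (the scaling constant sc and the penalty parameters are arbitrary):
# orthonormal design (n = 20, p = 5) via the QR decomposition: X^T X = I_pp
set.seed(1)
n <- 20; p <- 5
X  <- qr.Q(qr(matrix(rnorm(n * p), n, p)))
Y  <- drop(X %*% c(3, -2, 1, 0, 0) + rnorm(n))
sc <- 10                                            # scaling constant for the response

lambda1 <- 2; lambda2 <- 2
ridge <- function(y){ drop(solve(crossprod(X) + lambda2 * diag(p), crossprod(X, y))) }
lasso <- function(y){ b <- drop(crossprod(X, y)); sign(b) * pmax(abs(b) - lambda1/2, 0) }

# ridge is linear in Y: scaling Y by sc scales the estimate by sc (difference is zero)
max(abs(ridge(sc * Y) - sc * ridge(Y)))

# lasso is not linear in Y: the two columns below differ
cbind(lasso(sc * Y), sc * lasso(Y))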
6.9.2 Shrinkage
Both lasso and ridge regression estimation minimize the sum-of-squares plus a penalty. The latter encourages the
estimator to be small, in particular closer to zero. This behavior is called shrinkage. The particular form of the
penalty yields different types of this shrinkage behavior. This is best grasped in the case of an orthonormal design
matrix. The j-th element of the ridge regression estimator then is β̂j(λ2) = (1 + λ2)⁻¹ β̂j, while that of the lasso regression estimator is β̂j(λ1) = sign(β̂j)(|β̂j| − ½λ1)+. In Figure 6.9 these two estimators β̂j(λ2) and β̂j(λ1) are plotted as a function of the maximum likelihood estimator β̂j. Figure 6.9 shows that the lasso and ridge regression estimators translate and scale, respectively, the maximum likelihood estimator, which could also have been concluded from the analytic expressions of both estimators. The scaling of the ridge regression estimator amounts to substantial and little shrinkage (in an absolute sense) for elements of the regression parameter β with a large and small maximum likelihood estimate, respectively. In contrast, the lasso regression estimator applies an equal amount of shrinkage to each element of β, irrespective of the coefficients’ sizes.
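The two threshold functions may be reproduced along the following lines (a sketch in the spirit of Figure 6.9, with arbitrary penalty parameters):
# ridge scaling and lasso soft-thresholding as functions of the ML estimate
bML     <- seq(-3, 3, length.out=301)
lambda1 <- 2; lambda2 <- 1
bRidge  <- bML / (1 + lambda2)                          # ridge: rescaling
bLasso  <- sign(bML) * pmax(abs(bML) - lambda1/2, 0)    # lasso: translation

plot(bML, bLasso, type="l", col="red", xlab="ML estimate", ylab="penalized estimate")
lines(bML, bRidge, col="blue")
abline(0, 1, lty=3)                                     # identity line for reference
legend("topleft", legend=c("lasso", "ridge"), col=c("red", "blue"), lty=1)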
Figure 6.9: The lasso and ridge regression estimators, left and right panel, respectively, as a function of the maximum likelihood estimator, for data with an orthonormal design matrix.
where γ1 = [Var(X1)]^{1/2} β1 and γ2 = [Var(X2)]^{1/2} β2. The rescaled design matrix X̃ is now orthonormal and analytic expressions of estimators of γ1 and γ2 are available. The former parameter is penalized substantially less than the latter as λ1 [Var(X1)]^{−1/2} ≪ λ1 [Var(X2)]^{−1/2}. As a result, if for large enough values of λ1 one variable is selected, it is more likely to be the one corresponding to γ1.
Figure 6.10: Regularization paths of the lasso regression estimator related to the simulations presented in Sections
6.9.3 and 6.9.4 in the left and right panel, respectively.
We choose the penalty parameter of both estimators by means of leave-one-out cross-validation. The estimators are then evaluated with the optimal penalty parameter and, subsequently, the predictions of the left-out samples are obtained. These predictions are then compared to the corresponding observations by means of Spearman’s rank correlation. This performance measure is hardly affected by the shrunken scale of the predictions due to the regularization. For the lasso regression estimate we also register which elements of the parameter estimate are nonzero. The aforementioned has been repeated a thousand times for each setting.
# activate library providing the vdx data (assumption: breastCancerVDX from Bioconductor)
library(breastCancerVDX)

# set parameters
structure <- c("dense", "sparse")[2]
p         <- c(50, 500, 5000)[1]
n         <- 100

# load data
data(vdx)

# results
cors <- NULL
for (i in 1:1000){
    # ... simulation body elided in the source: draw the (dense or sparse) regression
    # parameter, generate the response Y, and obtain the leave-one-out cross-validated
    # predictions YpredL1 (lasso) and YpredL2 (ridge) ...
    cors <- rbind(cors, c(n, p, cor(Y, YpredL1, method="spearman"),
                          cor(Y, YpredL2, method="spearman")))
}
The predictive performances as measured by Spearman’s rank correlation over all iterations of the simulation are depicted by violinplots in Figure 6.11. In the setting with a dense regression parameter, the ridge regression estimates show a better performance than their lasso counterparts. This becomes more pronounced for larger dimensions: the former’s performance is reasonably constant over the dimensions, while the latter’s performance deteriorates slightly and, additionally, becomes more unstable, i.e. the spread among the thousand correlations increases. In the sparse case, the roles are reversed and the lasso regression estimates exhibit a better performance. But with larger dimensions the spread among both estimates’ performance measures increases, although most for the ridge regression estimator. Overall, the ridge and lasso regression estimators are preferred if the true regression parameter is dense and sparse, respectively.
[Figure 6.11 panels: violinplots of cor(Y, Ŷ), the Spearman correlation of observations and cross-validated predictions, against the dimension p, for the lasso and ridge regression estimates; left panel titled ‘Spearman correlation of obs. and cv-prediction, dense β’, right panel idem for sparse β.]
Figure 6.11: Violinplots of the predictive performance, operationalized as Spearman’s rank correlation between observation and prediction, of the lasso and ridge regression estimates with fixed absolute sparsity at 10% for various increasing p and a dense (left) and sparse (right) regression parameter.
The selection frequency of the lasso regression estimator (not shown) reveals that larger (in an absolute sense) elements of the parameter are selected more often than smaller ones. In either case the selection frequency waters down as the dimension increases. In principle, we could do a similar exercise for the ridge estimator by studying the ranks of the estimate’s absolute values. This indicates similar behavior, i.e. the ranks are concordant with the true parameter values, but a formal selection procedure is absent, although we could formulate some thresholding-type procedure.
We have repeated the simulation, but now fix the sparsity in a relative rather than an absolute sense. That is, the relative sparsity is fixed at 10%, i.e. ⌈p/10⌉ elements of the regression parameter are non-zero, instead of the five non-zero elements irrespective of the dimension. The elements of the regression parameter are set to βj = 500 · j · p^{−7/4} for j = 1, . . . , ⌈p/10⌉ and βj = 0 for j = ⌈p/10⌉ + 1, . . . , p. The particular dependence of the nonzero elements of the regression parameter on the dimension is a clumsy attempt to fix the variance of the response over the employed range of dimensions. The latter now ranges over p ∈ {50, 250, 500, 750, . . . , 2500}.
Figure 6.12 shows the violinplots of the resulting thousand Spearman’s rank correlations between the predictions
and observations of the lasso and ridge estimates for the various dimensions. For small dimensions, p = 50 and
p = 250, the predictive performance of the ridge regression estimates falls behind that of its lasso counterpart.
But as the dimension grows beyond p = 500, the roles reverse and the ridge regression estimate performs better
than its lasso counterpart.
Let us close with some intuition for these results. The lasso regression estimator selects, and thereby performs dimension reduction. This may be unstable and thereby less reproducible in high-dimensional settings. The ridge regression estimator can be conceived as doing some form of averaging, which is relatively robust and reproducible.
In all, these simulations may suggest that overall the ‘ridge predictor’ tends to have a better predictive performance than its lasso counterpart. Should we thus prefer the ridge over the lasso regression estimator? No, that would be shortsighted. The simulations are far too limited in nature; they are only meant to illustrate the dependence of the estimators’ behavior on the choice of the regression parameter. One may interpret the
Figure 6.12: Violinplots of the predictive performance, operationalized as Spearman’s rank correlation between observation and prediction, of the lasso and ridge regression estimates with fixed relative sparsity at 10% and increasing p.
simulations as suggesting that a choice for either estimator comes with a belief about the structure of the regression parameter. E.g., the lasso regression estimator implicitly assumes that the system under study is sparse (and that the nonzero elements are clearly distinguishable – in some sense – from zero). The validity of this assumption is not directly obvious in, e.g., biological phenomena. Similarly, the ridge regression estimator implicitly assumes that all (or at least many) covariates contribute to the explanation of the variation of the response. Such an assumption too is difficult to verify.
6.10 Application
A seminal high-dimensional data set, one that has been re-analyzed in many articles to illustrate novel (regularization) methods, is presented in Van’t Veer et al. (2002); Van De Vijver et al. (2002). It is part of a study into breast cancer and comprises the overall survival information and expression levels of 24158 genes of 291 breast cancer samples. Alongside the high-throughput omics data, clinical information like age and sex is available but discarded here. The data are provided via the breastCancerNKI-package (Schroeder et al., 2022).
In the original work of Van’t Veer et al. (2002); Van De Vijver et al. (2002) the data are used to build a linear predictor of overall survival. This resulted in a so-called ‘signature’ of 70 genes – quite a dimension reduction! – that together are predictive of overall survival. Building on this work, a commercial enterprise was founded that introduced a diagnostic test called the MammaPrint (https://en.wikipedia.org/wiki/MammaPrint). In a nutshell, the test essentially amounts to the evaluation of the linear predictor, which is subsequently compared to a reference value to decide on the expected survival outcome. Within the
context of our data set, the reference value will be the median of all linear predictions:
poor prognosis if Xi,∗ β̂ > median{X1,∗ β̂, . . . , Xn,∗ β̂},
good prognosis if Xi,∗ β̂ < median{X1,∗ β̂, . . . , Xn,∗ β̂}.
The prognosis determines the individual’s follow-up treatment. For instance, should the test indicate a ‘good prognosis’, the individual may be spared chemotherapy without a substantial reduction in overall survival but with a considerable gain in quality of life.
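In R the dichotomization of the linear predictor could look as follows (a toy illustration; X and betaHat are hypothetical stand-ins for the expression data and the estimated signature):
# toy illustration of the median split of the linear predictor
set.seed(1)
X         <- matrix(rnorm(20 * 4), 20, 4)     # 20 samples, 4 hypothetical signature genes
betaHat   <- c(1, -0.5, 0, 0.25)              # hypothetical estimated regression parameter
linPred   <- drop(X %*% betaHat)
prognosis <- ifelse(linPred > median(linPred), "poor", "good")
table(prognosis)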
Figure 6.13: Diagnostics of the lasso-penalized Cox regression fit. Upper left panel: cross-validated loglikelihood versus the penalty parameter λ1. Upper right panel: Kaplan-Meier curves of the groups of individuals falling below and above the linear predictor’s median. Bottom left panel: observed non-censored survival times of these groups. Bottom right panel: observed survival times plotted against their expected values.
Here we re-analyse the data set of Van’t Veer et al. (2002); Van De Vijver et al. (2002) and predict the breast cancer overall survival from gene expression levels. To this end we adopt the Cox proportional hazards model that describes the instantaneous rate of death at time t conditional on survival up to that time. Mathematically, the hazard of individual i is then h0(t) exp(Xi,∗ β), where h0(t) is the baseline hazard function common to all. The covariates then affect the hazard in a multiplicative manner that is independent of time. The regression parameter is estimated by maximization of the (partial) likelihood. Due to the high-dimensionality, the latter is augmented with a penalty. To mimic the parameter selection used in the construction of the gene signature, we employ the lasso penalty. The penalty parameter λ1 is chosen through leave-one-out cross-validation. The following R code loads the data, trains the model, and performs some simple diagnostics of the model’s fit.
Listing 6.5 R code
# activate libraries
library(breastCancerNKI)
library(Biobase)
library(penalized)
library(impute)
library(beanplot)
library(survival)
library(survminer)
# data preparation elided in the source: construct the survival response Y (a Surv
# object), the gene expression matrix X, select the penalty parameter lambda1 by
# cross-validation, and compute the corresponding linear predictor linPredL1

# cross-validated loglikelihood vs. the lasso penalty parameter
profL1list <- profL1(Y ~ X, model="cox")
plot(profL1list[[3]] ~ profL1list[[1]], type="l", lwd=2,
     xlab=expression(lambda[1]), ylab="cross-validated loss", main="")

# fit the Cox model on the cross-validated linear predictor
coxFit <- coxph(Y ~ linPredL1)
The upper left panel of Figure 6.13 shows the cross-validated loglikelihood versus the penalty parameter. The plot reveals that it has multiple local maxima. This is frequently seen for lasso-type estimators when tuned through cross-validation. Hence, care should be exercised when adopting the outcome of a search algorithm for the optimal penalty parameter: it may correspond to a local maximum.
The estimated linear predictor with the optimal cross-validated penalty parameter comprises 20 genes, an even stronger dimension reduction than in the original analysis (which, however, used different statistical machinery to arrive at its signature). The resulting fit is depicted in various ways by three panels of Figure 6.13. The upper right panel shows the Kaplan-Meier curves of the groups of individuals falling below and above the linear predictor’s median. The bottom left panel compares the observed non-censored survival times of these groups, while the bottom right panel plots these survival times against their expected values. The diagnostic plots reveal some predictive value of the formed linear predictor.
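The Kaplan-Meier comparison of the two prognostic groups can be produced along the following lines. This is a sketch with simulated stand-ins for the survival response and the cross-validated linear predictor linPredL1 of Listing 6.5; it uses the survival and survminer packages already loaded there.
library(survival)
library(survminer)

# simulated stand-ins for the survival data and the cross-validated linear predictor
set.seed(1)
n  <- 100
df <- data.frame(time=rexp(n, rate=0.1),        # follow-up time
                 status=rbinom(n, 1, 0.7),      # 1 = event, 0 = censored
                 linPredL1=rnorm(n))            # stand-in for the linear predictor
df$group <- factor(ifelse(df$linPredL1 > median(df$linPredL1), "poor", "good"))

# Kaplan-Meier curves per prognostic group, with a log-rank test p-value
kmFit <- survfit(Surv(time, status) ~ group, data=df)
ggsurvplot(kmFit, data=df, pval=TRUE)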
One may now ask oneself whether the 20 genes selected by our lasso estimator are well-established (or novel) contributors to breast cancer. Let us first answer this anecdotally. A few years after the work of Van’t Veer et al. (2002); Van De Vijver et al. (2002), the results of a similar study were published. It too presented a gene signature for overall survival of breast cancer patients based on expression levels. This signature comprises a similar number of genes, but the two gene signatures, which became known as the ‘Amsterdam’ and ‘Rotterdam’ signatures, showed little overlap. Clinicians did not know which signature to prefer (football fans may see a potential rivalry along familiar lines here), as neither set of signature genes was clearly implicated in breast cancer. The signatures’ minimal overlap and their lack of a clear relation to breast cancer were a cause for concern. To investigate this, Ein-Dor et al. (2005) conducted a simple in silico experiment. They took the original Amsterdam signature of 70 genes. Subsequently, they removed these 70 genes from the set of covariates, and built another predictor comprising 70 covariates/genes. In turn the latter 70 covariates were removed too, and again a novel predictor of 70 covariates was built. And so on, until ten predictors of equal size were obtained. These predictors differed little in terms of their performance. Hence, predictors with non-overlapping gene sets may perform equally well. The line of thought behind this experiment was later pushed in extremis. The authors of that study skipped the training of a model and simply formed a signature from a randomly selected gene set of size p ≈ 100. The title of their article captures the essence of their conclusion: “Most random gene expression signatures are significantly associated with breast cancer outcome”. To conclude, Ein-Dor et al. (2006) showed by simulation that a training set of thousands of samples is needed to produce a predictor with a stable gene set. The lasso estimator selects, but it does so to explain the response best, not to facilitate a convenient contextual interpretation. Moreover, as supercollinearity abounds in high-dimensional settings, there typically is an alternative parameter selection with equal performance. Hence, show restraint when assigning (too?) much interpretational weight to the selected set of covariates.
6.11 Exercises
Question 6.1
Consider the linear regression model Y = Xβ + ε with ε ∼ N(0n, σε2 Inn). This model (without intercept) is fitted to data using the lasso regression estimator β̂(λ1) = arg minβ ∥Y − Xβ∥22 + λ1∥β∥1 with λ1 > 0. The data are:
Question 6.2
Find the lasso regression solution for the data below for a general value of λ and for the straight line model Y =
β0 +β1 X+ε (only apply the lasso penalty to the slope parameter, not to the intercept). Show that when λ1 is chosen
as 14, the lasso solution fit is Ŷ = 40+1.75X. Data: X⊤ = (X1 , X2 , . . . , X8 )⊤ = (−2, −1, −1, −1, 0, 1, 2, 2)⊤ ,
and Y⊤ = (Y1 , Y2 , . . . , Y8 )⊤ = (35, 40, 36, 38, 40, 43, 45, 43)⊤ .
Question 6.3
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with εi ∼i.i.d. N (0, σ 2 ). The
model comprises a single covariate and, depending on the subquestion, an intercept. Data on the response and the
covariate are: {(yi , xi,1 )}4i=1 = {(1.4, 0.0), (1.4, −2.0), (0.8, 0.0), (0.4, 2.0)}.
a) Evaluate the lasso regression estimator of the model without intercept for the data at hand with λ1 = 0.2.
b) Evaluate the lasso regression estimator of the model with intercept for the data at hand with λ1 = 0.2 that
does not apply to the intercept (which is left unpenalized).
Question 6.4
Plot the regularization path of the lasso regression estimator over the range λ1 ∈ (0, 160] using the data of Example
1.2.
Question 6.5
Consider the standard linear regression model Yi = Xi,1 β1 + Xi,2 β2 + εi for i = 1, . . . , n and with the εi
i.i.d. normally distributed with zero mean and some known common variance. In the estimation of the regression
parameter (β1 , β2 )⊤ a lasso penalty is used: λ1,1 |β1 | + λ1,2 |β2 | with penalty parameters λ1,1 , λ1,2 > 0.
a) Let λ1,1 = λ1,2 and assume the covariates are orthogonal with the spread of the first covariate being much larger than that of the second. Draw a plot with β1 and β2 on the x- and y-axis, respectively. Sketch the parameter constraint as implied by the lasso penalty. Add the level sets of the sum-of-squares loss criterion, ∥Y − Xβ∥22. Use the plot to explain why the lasso tends to select covariates with larger spread.
b) Assume the covariates to be orthonormal. Let λ1,2 ≫ λ1,1. Redraw the plot of part a of this exercise. Use the plot to explain the effect of differing λ1,1 and λ1,2 on the resulting lasso estimate.
c) Show that the two cases (i.e. the assumptions on the covariates and penalty parameters) of parts a and b of this exercise are equivalent, in the sense that their loss functions can be rewritten in terms of each other.
Question 6.6
Investigate the effect of the variance of the covariates on variable selection by the lasso. Hereto consider the toy
model: Yi = Xi,1 + Xi,2 + εi , where εi ∼ N (0, 1), Xi,1 ∼ N (0, 1), and Xi,2 = a Xi,1 with a ∈ [0, 2]. Draw a
hundred samples for both Xi,1 and εi and construct both Xi,2 and Yi for a grid of a’s. Fit the model by means of
the lasso regression estimator with λ1 = 1 for each choice of a. Plot e.g. in one figure a) the variance of Xi,1 , b)
the variance of Xi,2 , and c) the indicator of the selection of Xi,2 . Which covariate is selected for which values of
scale parameter a?
Question 6.7
Show the non-uniqueness of the lasso regression estimator for p > 2 when the design matrix X contains linearly
dependent columns.
Question 6.8
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0, σ 2 ) and an n × 2-dimensional design matrix
with zero-centered and standardized but collinear columns, i.e.:
X⊤X = (1, ρ; ρ, 1),
with ρ ∈ (−1, 1). Then, an analytic expression for the lasso regression estimator exists. Show that:
β̂j(λ1) =
    sgn(β̂j^(ml)) [ |β̂j^(ml)| − ½λ1(1 + ρ)^(−1) ]+    if sgn[β̂1(λ1)] = sgn[β̂2(λ1)] and β̂1(λ1) ≠ 0 ≠ β̂2(λ1),
    sgn(β̂j^(ml)) [ |β̂j^(ml)| − ½λ1(1 − ρ)^(−1) ]+    if sgn[β̂1(λ1)] ≠ sgn[β̂2(λ1)] and β̂1(λ1) ≠ 0 ≠ β̂2(λ1),
    sgn(β̃j^(ml)) ( |β̃j^(ml)| − ½λ1 )+                 if j = arg max_j′ {|β̂j′^(ml)|}, otherwise,
    0                                                    if j ≠ arg max_j′ {|β̂j′^(ml)|}, otherwise.
Question 6.9
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0n , σε2 Inn ). This model (without intercept)
is fitted to data using the lasso regression estimator β̂(λ1 ) = arg minβ ∥Y − Xβ∥22 + λ1 ∥β∥1 . The relevant
summary statistics of the data are:
X⊤X = (1, −1/5; −1/5, 1),   and   X⊤Y = (−7, 5)⊤.
Question 6.10
Consider the linear regression model Y = Xβ + ε with β ∈ Rp and ε ∼ N (0n , σε2 Inn ). This model is fitted to
data using the lasso regression estimator β̂(λ1 ) = arg minβ ∥Y − Xβ∥22 + λ1 ∥β∥1 .
a) Suppose n = 2 and p = 3. Could it be that β̂(λ1 ) = (−1.3, 2.7, 0.9)⊤ for some λ1 > 0? Motivate.
b) Suppose n = 3 and p = 2. Could it be that β̂(λ1 ) = (−1.3, 0.9)⊤ for λ1 = 1 and β̂(λ1 ) = (−1.5, 0.8)⊤
for λ1 = 4? Motivate.
c) Suppose n = 3 and p = 2. Could it be that β̂(λ1 ) = (−4, 2)⊤ for λ1 = 1, β̂(λ1 ) = (−2, 1.5)⊤ for λ1 = 2,
and β̂(λ1 ) = (−1, 1)⊤ for λ1 = 3? Motivate.
Question 6.11
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi i.i.d. normally distributed with zero mean and a common variance. Moreover, X∗,j = X∗,j′ for all j, j′ = 1, . . . , p and Σ_{i=1}^n X²_{i,j} = 1. Question 1.19 revealed that in this case all elements of the ridge regression estimator are equal,
irrespective of the choice of the penalty parameter λ2 . Does this hold for the lasso regression estimator? Motivate
your answer.
Question 6.12
Consider the linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi i.i.d. normally distributed
with zero mean and a common variance. Relevant information on the response and design matrix are summarized
as:
X⊤X = (3, −2; −2, 2),   X⊤Y = (3, −1)⊤.
a) Show that the lasso regression estimator of β can be written as
β̂(λ1) = arg min_{β∈R²} 3β1² + 2β2² − 4β1β2 − 6β1 + 2β2 + λ1|β1| + λ1|β2|.
b) For λ1 = 0.2 the lasso estimate of the second element of β is β̂2 (λ1 ) = 1.25. Determine the corresponding
value of β̂1 (λ1 ).
c) Determine the smallest λ1 for which it is guaranteed that β̂(λ1 ) = 02 .
Question 6.13
Show that ∥β̂(λ1)∥1 is monotone decreasing in λ1. In this, assume orthonormality of the design matrix X.
Question 6.14
Consider the linear regression model Y = Xβ + ε with ε ∼ N (0n , σ 2 Inn ). This model is fitted to data,
X1,∗ = (4, −2) and Y1 = 10, using the lasso regression estimator β̂(λ1 ) = arg minβ ∥Y1 − X1,∗ β∥22 + λ1 ∥β∥1 .
a) How many nonzero elements does the lasso regression estimator with an arbitrary λ1 > 0 have for these
data?
b) Ignore the second covariate and evaluate the lasso regression estimator for λ1 = 8.
c) Suppose that, when regressing the response on each covariate separately, the corresponding lasso regression estimates with λ1 = 8 are β̂1(λ1) = 2¼ and β̂2(λ1) = −4. Now consider the regression problem with both covariates in the model. Does the lasso regression estimate with λ1 = 8 then equal β̂(λ1) = (2¼, 0)⊤, β̂(λ1) = (0, −4)⊤, or some other value? Motivate!
Question 6.15
Consider a single draw, denoted Y, from the normal means model Y ∼ N(µ, σ² Ipp) with µ ∈ Rp and σ² = 1. Assume the µj to be i.i.d. distributed following the double exponential distribution with density f(x) = (2b)⁻¹ exp(−b⁻¹|x|). Show that, if b = λ1⁻¹, the MAP (maximum a posteriori) estimator µ̂j^map equals sign(Yj)(|Yj| − λ1)+ for j = 1, . . . , p.
Question 6.16
A researcher has measured gene expression measurements for 1000 genes in 40 subjects, half of them cases and
the other half controls.
a) Describe and explain what would happen if the researcher would fit an ordinary logistic regression to these
data, using case/control status as the response variable.
b) Instead, the researcher chooses to fit a lasso regression, choosing the tuning parameter lambda by cross-
validation. Out of 1000 genes, 37 get a non-zero regression coefficient in the lasso fit. In the ensuing
publication, the researcher writes that the 963 genes with zero regression coefficients were found to be
“irrelevant”. What is your opinion about this statement?
Question 6.17
Consider the standard linear regression model Yi = Xi,∗ β + εi for i = 1, . . . , n and with the εi i.i.d. normally distributed with zero mean and a common variance. Let the first covariate correspond to the intercept. The model is fitted to data by means of the minimization of the sum-of-squares augmented with a lasso penalty in which the intercept is left unpenalized: λ1 Σ_{j=2}^p |βj| with penalty parameter λ1 > 0. The penalty parameter is chosen through leave-one-out cross-validation (LOOCV). The predictive performance of the model is evaluated, again by means of LOOCV, thus creating a double cross-validation loop. At each inner loop the optimal λ1 yields an empty, intercept-only model, from which a prediction for the left-out sample is obtained. The vector of these predictions is compared to the corresponding observation vector through their Spearman correlation (which measures the monotonicity of a relationship and – as a correlation measure – assumes values in the [−1, 1] interval with an analogous interpretation to the ‘ordinary’ correlation). The latter equals −1. Why?
Question 6.18
Load the breast cancer data available via the breastCancerNKI-package (downloadable from BioConductor)
through the following R-code:
Listing 6.6 R code
# activate library and load the data
library(breastCancerNKI)
data(nki)
Many variants of penalized regression, in particular of lasso regression, have been presented in the literature. Here
we give an overview of some of the more current ones. Not a full account is given, but rather a brief introduction
with emphasis on their motivation and use.
The elastic net penalty – defined implicitly in the preceding display as λ1∥β∥1 + ½λ2∥β∥22 – is thus simply a linear combination of the lasso and ridge penalties. Consequently, the elastic net regression estimator encompasses its lasso and ridge counterparts: hereto just set λ2 = 0 or λ1 = 0, respectively. A novel estimator is defined if both penalties act simultaneously, i.e. if their corresponding penalty parameters are both nonzero.
Does this novel elastic net estimator indeed inherit the strengths of the lasso and ridge regression estimators? Let us turn to the aforementioned motivations behind the elastic net estimator. Starting with uniqueness: the strict convexity of the ridge penalty renders the elastic net loss function strictly convex, even though its lasso part is – notably – not strictly convex when the dimension p exceeds the sample size n. This warrants the existence of a unique minimizer of the elastic net loss function. To assess the preservation of the selection property, now without the bound on the maximum number of selectable variables, exploit the equivalent constrained estimation formulation of the elastic net estimator. Figure 7.1 shows the parameter constraint of the elastic net estimator for the ‘p = 2’-case, which is defined by the set:
{(β1, β2) ∈ R2 : λ1(|β1| + |β2|) + ½λ2(β1² + β2²) ≤ c(λ1, λ2)}.
Visually, the ‘elastic net parameter constraint’ is a compromise between the circle and the diamond shaped constraints of the ridge and lasso regression estimators. This compromise inherits exactly the right geometrical features: the strict convexity of the ‘ridge circle’ and the ‘corners’ (referring to points at which the constraint’s boundary is non-smooth/non-differentiable) falling on the axes of the ‘lasso constraint’. The latter feature, by the same argumentation as presented in Section 6.3, endows the elastic net estimator with the selection property. Moreover, it can – in principle – select p features, as the point in the parameter space where the smallest level set of the unpenalized loss hits the elastic net parameter constraint need not fall on any axis. For example, in the ‘p = 2, n = 1’-case the level sets of the sum-of-squares loss are straight lines that, if running almost parallel to the edges of the ‘lasso diamond’, are unlikely to first hit the elastic net parameter constraint at one of its corners. Finally, the overall size of the penalty parameters relates (reciprocally) to the volume of the elastic net parameter constraint, while the ratio between λ1 and λ2 determines whether it is closer to the ‘ridge circle’ or to the ‘lasso diamond’.
Whether the elastic net regression estimator also delivers on the joint shrinkage property is assessed by simu-
lation (not shown). The impression given by these simulations is that the elastic net has joint shrinkage potential.
This, however, usually requires a large ridge penalty, which then dominates the elastic net penalty.
Figure 7.1: The left panel depicts the parameter constraint induced by the elastic net penalty and, for reference,
those of the lasso and ridge are added. The right panel shows the contour plot of the cross-validated loglikelihood
vs. the two penalty parameters of the elastic net estimator.
The elastic net regression estimator can be found with procedures similar to those that evaluate the lasso regression estimator (see Section 6.4), as the elastic net loss can be reformulated as a lasso loss. Hereto the ridge part of the elastic net penalty is absorbed into the sum of squares using the data augmentation trick of Exercise 1.6, which showed that the ridge regression estimator is the ML regression estimator of a related regression model with p zeros and rows added to the response and design matrix, respectively. That is, write Ỹ = (Y⊤, 0p⊤)⊤ and X̃ = (X⊤, √λ2 Ipp)⊤. The elastic net loss function can then be rewritten as ∥Ỹ − X̃β∥22 + λ1∥β∥1. This is familiar territory and the lasso algorithms of Section 6.4 can be used. Zou and Hastie (2005) present a different algorithm for the evaluation of the elastic net estimator that is faster and numerically more stable (see also Exercise 7.5). Irrespective of the algorithm, the reformulation of the elastic net loss in terms of augmented data also reveals that the elastic net regression estimator can select p variables, even for p > n, which is immediate from the observation that rank(X̃) = p.
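A minimal sketch of this augmentation trick, assuming the glmnet package as a generic lasso solver and toy data. Note that glmnet parameterizes its penalty with a rescaling by the (augmented) sample size, so its lambda values do not numerically coincide with λ1 as defined here.
library(glmnet)

# toy data (illustration only)
set.seed(1)
n <- 50; p <- 100
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:3] %*% c(2, -2, 2) + rnorm(n))
lambda2 <- 1                                    # ridge part of the elastic net penalty

# augment the data: p pseudo-observations with zero response and design sqrt(lambda2) * I_pp
Ytilde <- c(Y, rep(0, p))
Xtilde <- rbind(X, sqrt(lambda2) * diag(p))

# a lasso fit on the augmented data traces elastic net estimates for this lambda2
fit <- glmnet(Xtilde, Ytilde, standardize=FALSE, intercept=FALSE)
dim(fit$beta)                                   # p coefficients along the lasso penalty path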
The penalty parameters need tuning, e.g. by cross-validation. They are, however, subject to empirical indeterminacy. That is, often a large range of (λ1, λ2)-combinations will yield a similar cross-validated performance, as can be witnessed from Figure 7.1. It shows the contourplot of the penalty parameters vs. this performance. There is a yellow ‘banana-shaped’ area that corresponds to the same, optimal performance. Hence, no single (λ1, λ2)-combination can be distinguished as all yield the best performance. This behaviour may be understood intuitively. Reasoning loosely, while the lasso penalty λ1∥β∥1 ensures the selection property and the ridge penalty ½λ2∥β∥22 warrants the uniqueness and joint shrinkage of coefficients of collinear covariates, they have a similar effect on the size of the estimator. Both shrink, although in different norms. But a reduction in the size of β̂(λ1, λ2) in one norm implies a reduction in the other. An increase in either the lasso or the ridge penalty parameter will thus have a similar effect on the elastic net estimator: it shrinks. The selection and ‘joint shrinkage’ properties are only consequences of the employed penalty and are not criteria in the optimization of the elastic net loss function. There, only size matters, where the size of β refers to λ1∥β∥1 + ½λ2∥β∥22. As the lasso and ridge penalties appear as a linear combination in the elastic net loss function and have a similar effect on the elastic net estimator, there are many positive (λ1, λ2)-combinations that constrain the size of the elastic net estimator equally. In contrast, for both the lasso and ridge regression estimators on their own, different penalty parameters yield estimators of different sizes (defined accordingly). Moreover, it is mainly the size that determines the cross-validated performance, as the size determines the shrinkage of the estimator and, consequently, the size of the errors. But a fixed size still leaves freedom in how it is distributed over the p elements of the regression parameter estimator β̂(λ1, λ2) and, due to the collinearity, many of these distributions yield a comparable performance. Hence, if a particularly sized elastic net estimator β̂(λ1, λ2) optimizes the cross-validated performance, then high-dimensionally there are likely many others with a different (λ1, λ2)-combination but of equal size and similar performance.
The empirical indeterminacy of the penalty parameters touches upon another issue. In principle, the elastic net regression estimator can decide whether a sparse or non-sparse solution is most appropriate. The indeterminacy indicates that for any sparse elastic net regression estimator a less sparse one can be found with comparable performance, and vice versa. Care should thus be exercised when concluding on the sparsity of the linear relation under study from the chosen elastic net regression estimator.
A solution to the indeterminacy of the optimal penalty parameter combination is to fix their ratio. For interpretation purposes this is done through the introduction of a ‘mixing parameter’ α ∈ [0, 1]. The elastic net penalty is then written as λ[α∥β∥1 + ½(1 − α)∥β∥22]. The mixing parameter is set by the user while λ > 0 is typically found through cross-validation (cf. the implementation in the glmnet-package, Friedman et al., 2009). Generally, no guidance on the choice of the mixing parameter α can be given. In fact, it is a tuning parameter and as such needs tuning rather than being set out of the blue.
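In R this amounts to fixing alpha and cross-validating lambda, e.g. via cv.glmnet of the glmnet-package (toy data for illustration; alpha = 0.5 is an arbitrary choice):
library(glmnet)

# toy data; alpha fixes the lasso/ridge mix, lambda is found by cross-validation
set.seed(1)
X <- matrix(rnorm(50 * 100), 50, 100)
Y <- drop(X[, 1:3] %*% c(2, -2, 2) + rnorm(50))

cvFit <- cv.glmnet(X, Y, alpha=0.5, nfolds=10)  # elastic net with mixing parameter 1/2
coef(cvFit, s="lambda.min")                     # estimate at the optimal lambda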
7.2 Fused lasso
The fused lasso regression estimator of Tibshirani et al. (2005) augments the lasso penalty with a fusion penalty on the differences of neighboring elements of β:
β̂(λ1, λ1,f) = arg min_{β∈Rp} ∥Y − Xβ∥22 + λ1 Σ_{j=1}^p |βj| + λ1,f Σ_{j=2}^p |βj − βj−1|,
which involves two penalty parameters, λ1 and λ1,f, for the lasso and fusion penalties, respectively. As a result of adding the fusion penalty, the fused lasso regression estimator not only shrinks elements of β towards zero but also the differences of neighboring elements of β. In particular, for large enough values of the penalty parameters the estimator selects elements of β and differences of neighboring elements of β. This corresponds to a sparse estimate of β of which the vector of first order differences too is dominated by zeros. That is, many elements of β̂(λ1, λ1,f) equal zero, with few changes in β̂(λ1, λ1,f) when running over the index j. Hence, the fused lasso regression penalty encourages large sets of covariates with neighboring indices to have a common (or at least a comparable) regression parameter estimate. This is visualized – using simulated data from a simple toy model with details considered irrelevant for the illustration – in Figure 7.2, where the elements of the fused lasso regression estimate β̂(λ1, λ1,f) are plotted against the index of the covariates. For reference the true β and the lasso regression estimate with the same λ1 are added to the plot. Ideally, for a large enough fusion penalty parameter, the elements of β̂(λ1, λ1,f) would form a step-wise function in the index j, with many equalling zero and exhibiting few changes, as the elements of β do. While this is not quite the case, it is close, especially in comparison to the elements of its lasso cousin β̂(λ1), thus showing the effect of the inclusion of the fusion penalty.
It is insightful to view the fused lasso regression problem as a constrained estimation problem. The fused lasso penalty induces a parameter constraint: {β ∈ Rp : λ1∥β∥1 + λ1,f Σ_{j=2}^p |βj − βj−1| ≤ c(λ1, λ1,f)}. This constraint is plotted for p = 2 in the right panel of Figure 7.2 (clearly, it is not the intersection of the constraints induced by the lasso and fusion penalties separately, as one might accidentally conclude from Figure 2 in Tibshirani et al., 2005). The constraint is convex, although not strictly so, which is convenient for optimization purposes. Moreover, the geometry of this fused lasso constraint reveals why the fused lasso regression estimator selects the elements of β as well as its first order differences. Its boundary, while continuous and generally smooth, has six points at which it is non-differentiable. These all fall on the grey dotted lines in the right panel of Figure 7.2 that correspond to the axes and the diagonal, put differently, on either the ‘β1 = 0’-, ‘β2 = 0’-, or ‘β1 = β2’-lines. The fused lasso regression estimate is the point where the smallest level set of the sum-of-squares criterion ∥Y − Xβ∥22, be it an ellipsoid or hyper-plane, hits the fused lasso constraint. For an element or a first order difference to be zero it must fall on one of the dotted grey lines of the right panel of Figure 7.2. Exactly this happens at one of the aforementioned six points of the constraint. Finally, the fused lasso regression estimator has, when – for reasonably comparable penalty parameters λ1 and λ1,f – it shrinks the first order difference to zero, a tendency to also estimate the corresponding individual elements as zero. In part, this is due to the fact that |β1| = 0 = |β2| implies that |β1 − β2| = 0, while the reverse does not necessarily hold. Moreover, if |β1| = 0, then |β1 − β2| = |β2|. The fusion penalty then converts to a lasso penalty on the remaining nonzero element of this first order difference, i.e. (λ1 + λ1,f)|β2|, thus furthering the shrinkage of this element to zero.
The evaluation of the fused lasso regression estimator is more complicated than that of the ‘ordinary’ lasso regression estimator. For ‘moderately sized’ problems Tibshirani et al. (2005) suggest using a variant of the
Figure 7.2: The left panel shows the lasso (red diamonds) and fused lasso (black circles) regression parameter estimates and the true parameter values (grey circles), plotted against their index j. The right panel shows the (β1, β2)-parameter constraint induced by the fused lasso penalty for various combinations of the lasso and fusion penalty parameters λ1 and λ1,f, respectively. The grey dotted lines correspond to the ‘β1 = 0’-, ‘β2 = 0’- and ‘β1 = β2’-lines where selection takes place.
quadratic programming method (see also Section 6.4.1) that is computationally efficient when many linear constraints are active, i.e. if many elements of β and of its first order differences are zero. Chaturvedi et al. (2014) extend the gradient ascent approach discussed in Section 6.4.3 to solve the minimization of the fused lasso loss function. For the limiting ‘λ1 = 0’-case the fused lasso loss function can be reformulated as a lasso loss function (see Exercise 7.1). Then, the algorithms of Section 6.4 may be applied to find the estimate β̂(0, λ1,f).
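For completeness, a sketch of a fused lasso fit in R. The genlasso package is not referenced in these notes; its fusedlasso1d function, and the interpretation of its gamma argument as the ratio of the sparsity (lasso) to the fusion penalty, are assumptions to be checked against the package documentation.
library(genlasso)

# toy data with a piecewise-constant regression parameter
set.seed(1)
n <- 100; p <- 50
X    <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0, 20), rep(2, 15), rep(0, 15))
Y    <- drop(X %*% beta + rnorm(n))

# fused lasso along the covariate index; gamma is (assumed to be) the ratio of the
# sparsity (lasso) penalty to the fusion penalty
fit  <- fusedlasso1d(y=Y, X=X, gamma=0.5)
bHat <- coef(fit, lambda=median(fit$lambda))$beta   # estimate at a penalty value on the path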
7.3 The (sparse) group lasso
The group lasso regression estimator (Yuan and Lin, 2006) replaces the lasso penalty by a sum of group-wise Euclidean norms,
β̂(λ1,G) = arg min_{β∈Rp} ∥Y − Xβ∥22 + λ1,G Σ_{g=1}^G √|Jg| ∥βg∥2,
where λ1,G is the group lasso penalty parameter (with subscript G for Group), G is the total number of groups, Jg ⊂ {1, . . . , p} is the covariate index set of the g-th group such that the Jg are mutually exclusive and exhaustive, i.e. J_{g1} ∩ J_{g2} = ∅ for all g1 ≠ g2 and ∪_{g=1}^G Jg = {1, . . . , p}, and |Jg| denotes the cardinality (the number of elements) of Jg.
The group lasso estimator performs covariate selection at the group level but does not result in a sparse within-
group estimate. This may be achieved through employment of the sparse group lasso regression estimator (Simon
et al., 2013):
β̂(λ1, λ1,G) = arg min_{β∈Rp} ∥Y − Xβ∥22 + λ1∥β∥1 + λ1,G Σ_{g=1}^G √|Jg| ∥βg∥2,
which combines the lasso with the group lasso penalty. The inclusion of the former encourages within-group
sparsity, while the latter performs selection at the group level. The sparse group lasso penalty resembles the
elastic net penalty with the ∥β∥2 -term replacing the ∥β∥22 -term of the latter.
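A sketch of fitting the sparse group lasso with the SGL package accompanying Simon et al. (2013); the toy data, the group sizes and the mixing parameter alpha are arbitrary, and the interface details are assumptions to be verified against the package documentation.
library(SGL)

# toy data: 20 covariates in 4 groups of 5, only the first group is active
set.seed(1)
n <- 100; p <- 20
X     <- matrix(rnorm(n * p), n, p)
Y     <- drop(X[, 1:5] %*% rep(1, 5) + rnorm(n))
group <- rep(1:4, each=5)                          # group membership index

# sparse group lasso; alpha mixes the lasso (alpha = 1) and group lasso (alpha = 0) penalties
fit <- SGL(data=list(x=X, y=Y), index=group, type="linear", alpha=0.5)
fit$beta[, 1:3]                                    # estimates for the first few penalty values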
Figure 7.3: Top left panel: the (β1, β2)-parameter constraint induced by the lasso, group lasso and sparse group lasso penalty. Top right panel: the adaptive threshold function associated with the adaptive lasso regression estimator for orthonormal design matrices. Bottom left panel: illustration of the lasso and adaptive lasso fit and the true curve. Bottom right panel: the parameter constraint induced by the ℓ0-penalty.
The (sparse) group lasso regression estimation problems can be reformulated as constrained estimation problems. Their parameter constraints are depicted in Figure 7.3. By now the reader will be familiar with the characteristic feature of the constraint that endows the estimator with the potential to select, i.e. the non-differentiability of the boundary at the axes. This feature is clearly present, covariate-wise, for the sparse group lasso regression estimator. Although both the group lasso and the sparse group lasso regression estimators select group-wise, illustration of the associated geometrical feature requires plotting in dimensions larger than two and is not attempted. However, if all groups are singletons, the (sparse) group lasso penalties are equivalent to the regular lasso penalty.
The sparse group lasso regression estimator is found through exploitation of the convexity of the loss function (Simon et al., 2013). It alternates between group-wise and within-group optimization. The resemblance of the sparse group lasso and elastic net penalties propagates to the optimality conditions of both estimators. In fact, the within-group optimization amounts to the evaluation of an elastic net regression estimator (Simon et al., 2013). When within each group the design matrix is orthonormal and λ1 = 0, the group lasso regression estimator can be found by a group-wise coordinate descent procedure (cf. Exercise ??).
Both the sparse group lasso and the elastic net regression estimators have two penalty parameters that need tuning. In both cases the corresponding penalties have a similar effect: shrinkage towards zero. If the g-th group’s contribution to the group lasso penalty has vanished, then so has the contribution of its covariates to the regular lasso penalty, and vice versa. This complicates the tuning of the penalty parameters, as it is hard to distinguish which shrinkage effect is most beneficial for the estimator. Simon et al. (2013) resolve this by setting the ratio of the two penalty parameters λ1 and λ1,G to some arbitrary but fixed value, thereby simplifying the tuning.
The adaptive lasso regression estimator (Zou, 2006) employs the penalty λ1 Σ_{j=1}^p |βj| / |β̂j^init|. Hence, it is a generalization of the lasso penalty with covariate-specific weighing. The weight of the j-th covariate is reciprocal to the j-th element of the initial regression parameter estimate β̂init. If the initial estimate of βj is large or small, the corresponding element in the adaptive lasso estimator is penalized less or more, respectively. The weights thereby determine the amount of shrinkage, which may now vary between the estimates of the elements of β. In particular, if β̂j^init = 0, the adaptive lasso penalty parameter corresponding to the j-th element is infinite and yields β̂j^adapt = 0.
The adaptive lasso regression estimator, given an initial regression estimate β̂ init , can be found numerically by
minor changes of the algorithms presented in Section 6.4. In case of an orthonormal design an analytic expression
of the adaptive lasso estimator exists (see Exercise 7.7):
β̂j^adapt(λ1) = sign(β̂j^ols)(|β̂j^ols| − ½λ1/|β̂j^ols|)+.
This adaptive lasso estimator can be viewed as a compromise between the soft thresholding function, associ-
ated with the lasso regression estimator for orthonormal design matrices (Section 6.2), and the hard thresholding
function, associated with truncation of the ML regression estimator (see the top right panel of Figure 7.3 for an
illustration of these thresholding functions).
How is the initial regression parameter estimate β̂init to be chosen? Low-dimensionally, one may use the maximum likelihood regression estimator. The resulting adaptive lasso regression estimator is sometimes referred to as the Gauss-Lasso regression estimator. High-dimensionally, the lasso or ridge regression estimator will do. Any other estimator may in principle be used, but not all yield the desirable asymptotic properties.
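A sketch of the adaptive lasso using the glmnet package: a ridge regression estimate serves as β̂init and enters glmnet’s penalty.factor argument as covariate-specific weights (toy data; lambda.min is one of several reasonable choices for the initial estimate’s penalty parameter).
library(glmnet)

# toy data (illustration only)
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:5] %*% rep(2, 5) + rnorm(n))

# initial estimate: ridge regression (the ML estimator is unavailable high-dimensionally)
ridgeFit <- cv.glmnet(X, Y, alpha=0)
bInit    <- as.numeric(coef(ridgeFit, s="lambda.min"))[-1]    # drop the intercept

# adaptive lasso: covariate-specific weights reciprocal to the initial estimate
w        <- 1 / abs(bInit)
adaptFit <- cv.glmnet(X, Y, alpha=1, penalty.factor=w)
bAdapt   <- as.numeric(coef(adaptFit, s="lambda.min"))[-1]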
A different motivation for the adaptive lasso is found in its ability to undo some or all of the shrinkage of the
lasso regression estimator due to penalization. This is illustrated in the left bottom panel of Figure 7.3. It shows
the lasso and adaptive lasso regression fits. The latter clearly undoes some of the bias of the former.
7.6 Conclusion
Finally, a note of caution in a similar spirit to that which concludes Chapter 3. It is a joy to play with the penalty and see how it encourages the regression parameter estimate to exhibit hypothesized behaviour. Nonetheless, for a deeper and more profound understanding of the data generating mechanism, such knowledge is ideally incorporated explicitly in the model itself.
7.7 Exercises
Question 7.1
Augment the lasso penalty with the sum of the absolute differences of all pairs of successive regression coefficients:
λ1 Σ_{j=1}^p |βj| + λ1,f Σ_{j=2}^p |βj − βj−1|.
−1 1
a) Evaluate for (λ1 , λ2 ) = (3, 2) the elastic net regression estimator of the linear regression model.
b) Now consider the evaluation of the elastic net regression estimator of the linear regression model for the same penalty parameters, (λ1, λ2) = (3, 2), but this time involving two covariates. The first covariate is as in part a), the second is orthogonal to the first. Do you expect the resulting elastic net estimate of the first regression coefficient, β̂1(λ1, λ2), to be larger, equal or smaller (in an absolute sense) than your answer to part a)? Motivate.
c) Now take in part b) the second covariate equal to the first one. Show that the first coefficient of the elastic net estimate, β̂1(λ1, 2λ2), is half that of part a). Note: there is no need to know the exact answer to part a).
Question 7.7 *
Consider the linear regression model Y = Xβ + ε. It is fitted to data from a study with an orthonormal design
matrix by means of the adaptive lasso regression estimator initiated by the OLS/ML regression estimator. Show
that the j-th element of the resulting adaptive lasso regression estimator equals:
* This question is freely copied from Bühlmann and Van De Geer (2011): Problem 2.5a, page 43.
Bibliography
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control,
19(6), 716–723.
Allen, D. M. (1974). The relationship between variable selection and data agumentation and a method for predic-
tion. Technometrics, 16(1), 125–127.
Ambs, S., Prueitt, R. L., Yi, M., Hudson, R. S., Howe, T. M., Petrocca, F., Wallace, T. A., Liu, C.-G., Volinia,
S., Calin, G. A., Yfantis, H. G., Stephens, R. M., and Croce, C. M. (2008). Genomic profiling of microrna
and messenger RNA reveals deregulated microrna expression in prostate cancer. Cancer Research, 68(15),
6162–6170.
Anatolyev, S. (2020). A ridge to homogeneity for linear models. Journal of Statistical Computation and Simula-
tion, 90(13), 2455–2472.
Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2), 281–297.
Bates, D. and DebRoy, S. (2004). Linear mixed models and penalized least squares. Journal of Multivariate
Analysis, 91(1), 1–17.
Berger, J. O. (2013). Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media.
Bertsekas, D. P. (2014). Constrained Optimization and Lagrange Multiplier Methods. Academic press.
Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). Fast sampling with Gaussian scale mixture priors
in high-dimensional regression. Biometrika, 103(4), 985–991.
Bijma, F., Jonker, M. A., and van der Vaart, A. W. (2017). An Introduction to Mathematical Statistics. Amsterdam
University Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Boyle, E. A., Li, Y. I., and Pritchard, J. K. (2017). An expanded view of complex traits: from polygenic to
omnigenic. Cell, 169(7), 1177–1186.
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli, 19(4), 1212–1242.
Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applica-
tions. Springer Science & Business Media.
Cancer Genome Atlas Network (2011). Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353),
609–615.
Cancer Genome Atlas Network (2012). Comprehensive molecular characterization of human colon and rectal
cancer. Nature, 487(7407), 330–337.
Castillo, I., Schmidt-Hieber, J., and Van der Vaart, A. W. (2015). Bayesian linear regression with sparse priors.
The Annals of Statistics, 43(5), 1986–2018.
Chaturvedi, N., de Menezes, R. X., and Goeman, J. J. (2014). Fused lasso algorithm for Cox’ proportional hazards
and binomial logit models with application to copy number profiles. Biometrical Journal, 56(3), 477–492.
Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician,
49(4), 327–335.
de Vlaming, R. and Groenen, P. J. F. (2015). The current and future use of ridge regression for prediction in
quantitative genetics. BioMed Research International, page Article ID 143712.
Draper, N. R. and Smith, H. (1998). Applied Regression Analysis (3rd edition). John Wiley & Sons.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American statistical
Association, 81(394), 461–470.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics,
32(2), 407–499.
Eilers, P. (1999). Discussion on: The analysis of designed experiments and longitudinal data by using smoothing
splines. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(3), 307–308.
Eilers, P. and Marx, B. (1996). Flexible smoothing with b-splines and penalties. Statistical Science, 11(2), 89–102.
Eilers, P. H. C., Boer, J. M., van Ommen, G.-J., and van Houwelingen, H. C. (2001). Classification of microarray
data with penalized logistic regression. In Microarrays: Optical technologies and informatics, volume 4266,
pages 187–198.
Ein-Dor, L., Kela, I., Getz, G., Givol, D., and Domany, E. (2005). Outcome signature genes in breast cancer: is
there a unique set? Bioinformatics, 21(2), 171–178.
Ein-Dor, L., Zuk, O., and Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for
predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15), 5923–5928.
Esquela-Kerscher, A. and Slack, F. J. (2006). Oncomirs: microRNAs with a role in cancer. Nature Reviews
Cancer, 6(4), 259–269.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal
of the American statistical Association, 96(456), 1348–1360.
Farebrother, R. W. (1976). Further results on the mean square error of ridge regression. Journal of the Royal
Statistical Society, Series B (Methodological), pages 248–250.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38.
Fletcher, R. (2008). Practical Methods of Optimization, 2nd Edition. John Wiley, New York.
Flury, B. D. (1990). Acceptance–rejection sampling made easy. SIAM Review, 32(3), 474–476.
Friedman, J., Hastie, T., and Tibshirani, R. (2009). glmnet: Lasso and elastic-net regularized generalized linear
models. R package version, 1(4).
García, C. B., García, J., López Martín, M., and Salmerón, R. (2015). Collinearity: Revisiting the variance
inflation factor in ridge regression. Journal of Applied Statistics, 42(3), 648–661.
Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. Handbook of Markov chain Monte Carlo,
20116022, 45.
Goeman, J. J. (2008). Autocorrelated logistic ridge regression for prediction based on proteomics spectra. Statis-
tical Applications in Genetics and Molecular Biology, 7(2).
Goeman, J. J. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52,
70–84.
Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good
ridge parameter. Technometrics, 21(2), 215–223.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.
Guilkey, D. K. and Murphy, J. L. (1975). Directed ridge regression techniques in cases of multicollinearity.
Journal of the American Statistical Association, 70(352), 769–775.
Hansen, B. E. (2015). The risk of James–Stein and lasso shrinkage. Econometric Reviews, 35(8-10), 1456–1470.
Harrell, F. E. (2001). Regression modeling strategies: with applications to linear models, logistic regression, and
survival analysis. Springer.
Harville, D. A. (2008). Matrix Algebra From a Statistician’s Perspective. Springer, New York.
Hastie, T. and Tibshirani, R. (2004). Efficient quadratic regularization for expression arrays. Biostatistics, 5(3),
329–340.
Hastie, T., Friedman, J., and Tibshirani, R. (2009). The Elements of Statistical Learning. Springer.
Heinze, G. and Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in
medicine, 21(16), 2409–2419.
Hemmerle, W. J. (1975). An explicit solution for generalized ridge regression. Technometrics, 17(3), 309–314.
Henderson, C. (1953). Estimation of variance and covariance components. Biometrics, 9(2), 226–252.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Tech-
nometrics, 12(1), 55–67.
Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.
Hosmer Jr, D. W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression, volume 398. John
Wiley & Sons.
Hua, T. A. and Gunst, R. F. (1983). Generalized ridge regression: a note on negative ridge parameters. Commu-
Schroeder, M., et al. (2022). breastCancerNKI: Gene expression dataset published by Schmidt et al. [2008] (NKI). R package version 1.34.0.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices.
The Annals of Statistics, 40(2), 812–831.
Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational
and Graphical Statistics, 22(2), 231–245.
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages
1135–1151.
Sterner, J. M., Dew-Knight, S., Musahl, C., Kornbluth, S., and Horowitz, J. M. (1998). Negative regulation of
DNA replication by the retinoblastoma protein is mediated by its association with MCM7. Molecular and
Cellular Biology, 18(5), 2748–2757.
Subramanian, J. and Simon, R. (2010). Gene expression–based prognostic signatures in lung cancer: ready for
clinical use? Journal of the National Cancer Institute, 102(7), 464–474.
Theobald, C. M. (1974). Generalizations of mean square error applied to ridge regression. Journal of the Royal
Statistical Society. Series B (Methodological), 36(1), 103–106.
Tibshirani, R. (1996). Regularized shrinkage and selection via the lasso. Journal of the Royal Statistical Society
B, 58(1), 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused
lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Tibshirani, R. J. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics, 7, 1456–1490.
Tibshirani, R. J. and Taylor, J. (2012). Degrees of freedom in lasso problems. The Annals of Statistics, 40(2),
1198–1232.
Tye, B. K. (1999). MCM proteins in DNA replication. Annual Review of Biochemistry, 68(1), 649–686.
Van De Vijver, M. J., He, Y. D., Van’t Veer, L. J., Dai, H., Hart, A. A. M., Voskuil, D. W., Schreiber, G. J., Peterse,
J. L., Roberts, C., Marton, M. J., et al. (2002). A gene-expression signature as a predictor of survival in breast
cancer. New England Journal of Medicine, 347(25), 1999–2009.
Van De Wiel, M. A., Lien, T. G., Verlaat, W., van Wieringen, W. N., and Wilting, S. M. (2016). Better prediction
by use of co-data: adaptive group-regularized ridge regression. Statistics in Medicine, 35(3), 368–381.
van de Wiel, M. A., van Nee, M. M., and Rauschenberger, A. (2021). Fast cross-validation for multi-penalty ridge
regression. Journal of Computational and Graphical Statistics, 30(4), 835–847.
van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.
van Wieringen, W. and Aflakparast, M. (2021). porridge: Ridge-Type Estimation of a Potpourri of Models. R
package version 0.2.1, https://CRAN.R-project.org/package=porridge.
van Wieringen, W. N. and Binder, H. (2022). Sequential learning of regression models by penalized estimation.
Journal of Computational and Graphical Statistics, 31(3), 877–886.
Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., Van Der Kooy,
K., Marton, M. J., Witteveen, A. T., et al. (2002). Gene expression profiling predicts clinical outcome of breast
cancer. Nature, 415(6871), 530–536.
Wang, L., Tang, H., Thayanithy, V., Subramanian, S., Oberg, L., Cunningham, J. M., Cerhan, J. R., Steer, C. J.,
and Thibodeau, S. N. (2009). Gene networks and microRNAs implicated in aggressive prostate cancer. Cancer
research, 69(24), 9490–9497.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, Chichester,
England.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Zellner, A. (1986). On assessing prior distributions and bayesian regression analysis with g-prior distributions.
Bayesian inference and decision techniques: essays in honor of Bruno De Finetti, 6, 233–243.
Zhang, Y. and Politis, D. N. (2022). Ridge regression revisited: Debiasing, thresholding and bootstrap. Annals of
Statistics, 50(3), 1401–1422.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association,
101(476), 1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Zou, H., Hastie, T., and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso. The Annals of Statistics,
35(5), 2173–2192.
Zwiener, I., Frisch, B., and Binder, H. (2014). Transforming RNA-seq data to improve the performance of prog-
nostic gene signatures. PloS one, 9(1), e85150.