Learning From Data
9: Regularization
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main
Motivation
Regularization
Bibliography
[Figure: learning curves of the expected errors E_in and E_out; the gap between them illustrates the variance / generalization error.]
Questions:
How can we optimize the bias-variance trade-off?
How can we optimize the generalization error?
How can we avoid or reduce over-fitting?
Recap: Linear Regression
Linear Regression Algorithm
The linear regression algorithm is defined by the solution w_opt to the
following optimization problem:

w_opt := arg min_{w ∈ R^{d+1}} E_in(w)
       = arg min_{w ∈ R^{d+1}} (1/N) Σ_{n=1}^{N} (w^t x_n − y_n)^2
       = arg min_{w ∈ R^{d+1}} (1/N) (Xw − y)^T (Xw − y)

Its solution is

w_opt = X† y = (X^t X)^{−1} X^t y
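A minimal R sketch of this closed-form solution (assuming the design matrix X, with a leading column of ones, and the target vector y are available as R objects):

# Least-squares solution via the normal equations: (X^t X)^{-1} X^t y
w_opt <- solve(t(X) %*% X, t(X) %*% y)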
Ideas:
The complexity of the model is related to the number of
coefficients.
We can compare models by a complexity hierarchy.
We can regard a simpler model as a constrained version of a more
complex model.
Thus, if H_n := {w_0 + w_1 φ_1(x) + ... + w_n φ_n(x)}, we can embed H_m in
H_n for m < n, i.e. H_m ⊆ H_n, by setting all w_i = 0 for m < i ≤ n.
We call Hm a constrained version of Hn .
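For illustration, take polynomial features φ_i(x) = x^i (an example choice, not fixed by the definition above): then H_2 = {w_0 + w_1 x + w_2 x^2} contains H_1 = {w_0 + w_1 x} as the constrained subset with w_2 = 0.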
Complexity Measure
Idea: The larger the values of the w_i, the more complex H_n is.
with λ_C := −(1/(2C)) w^t ∇E_in(w).
Sketch of Proof.
Obviously,
again. https://web.archive.org/web/20210506170321/http://www.onmyphd.com/?p=kkt.karush.kuhn.tucker
See also http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/KKT.pdf
Proof (cont.)
E_aug(h, λ, Ω) := E_in(h) + (λ/N) Ω(h).
Note: The name weight decay stems from the fact that it reduces large
weights. However, in general it does not reduce them to zero.
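A minimal R sketch of this augmented error for a linear model with the weight-decay penalty Ω(h_w) = w^t w; the name augerror2 and the argument l (the value of λ) are illustrative choices, mirroring the augerror1 used in the code at the end of this section:

# Augmented error E_aug(w) = E_in(w) + (l/N) * w^t w  (weight decay)
# X and y are assumed to be the design matrix and target vector in scope
augerror2 <- function(w, l) {
  N <- nrow(X)
  sum((X %*% w - y)^2) / N + (l / N) * sum(w^2)
}

This is the kind of function one would pass to optim() to obtain the regularized weights numerically, analogously to the L1 case shown at the end of this section.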
Solution of L2-regularized (Non-) Linear Regression
Recall that the solution of the (unregularized) (Non-) Linear Regression

w_lin = arg min_{w ∈ R^{d+1}} (1/N) (Xw − y)^T (Xw − y)

is given by

w_lin = (X^t X)^{−1} X^t y,

with X_{i,j} := φ(x_i)_j for the non-linear case and X_{i,j} := x_{i,j} for the linear
case, if (X^t X) is invertible.
In general, if it is not invertible, the solution is obtained using the
(Moore-)Penrose pseudo-inverse X†, i.e. w_lin = X† y, which satisfies the
normal equations X^t X w_lin = X^t y.
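As a sketch in R, the pseudo-inverse solution can be computed with ginv() from the MASS package (choosing MASS here is my assumption; any SVD-based pseudo-inverse works):

library(MASS)                 # provides ginv(), the Moore-Penrose pseudo-inverse
w_lin <- ginv(X) %*% y        # valid even when t(X) %*% X is singular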
Lemma 7
If λ > 0, then the matrix B_λ := X^t X + λI is invertible.
Proof.
For any vector x ∈ ker B_λ with x ≠ 0 we have 0 = x^t B_λ x = ‖Xx‖^2 + λ‖x‖^2 ≥ λ‖x‖^2 > 0, a contradiction. Hence ker B_λ = {0}, and B_λ is invertible.
Consequently, for λ > 0 the regularized solution w_reg = B_λ^{−1} X^t y = (X^t X + λI)^{−1} X^t y always exists, and

lim_{λ→∞} w_reg = 0.
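A minimal R sketch of this closed-form L2-regularized (ridge) solution; the function name ridge_solution is an assumption:

# w_reg = (X^t X + lambda * I)^{-1} X^t y  -- well-defined for lambda > 0
ridge_solution <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

For growing lambda the returned coefficients shrink towards zero, in line with the limit above.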
More generally, one can use the norms

Ω_p(h_w) := ‖w‖_{L^p} = ( Σ_{i=1}^{n} |w_i|^p )^{1/p}, ∀ p > 0

for regularization.
In particular, the L1 norm is used very often. Although this might seem like
only a small difference from L2, L1 regularizers behave quite differently from
L2 regularizers.
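A minimal R sketch of this family of penalties (the function name omega_p is an illustrative choice):

# Omega_p(w) = (sum_i |w_i|^p)^(1/p)
omega_p <- function(w, p) sum(abs(w)^p)^(1 / p)
omega_p(c(1, -2, 0.5), p = 1)   # L1 penalty
omega_p(c(1, -2, 0.5), p = 2)   # L2 (Euclidean) penalty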
[Figure: the L1 unit ball (a diamond) in the (x_1, x_2)-plane together with contour lines; axes from −2 to 2.]
L1 touches an arbitrary line (or any other contour) most likely at a corner
(left graphics). But at a corner, one of the coordinates (here x2 ) is zero.
[Figure: the L2 unit ball (a circle) in the (x_1, x_2)-plane touching a contour line; axes from −2 to 2.]
L2 touches an arbitrary line (or any other contour) at a generic point, where
both coordinates are almost always different from zero.
[Figures: Regularized Non-Linear Regression fits for several values of lambda (one panel with Lambda = 10), comparing the unregularized fit, L1 regularization, L2 regularization, and the target function; axes x and y.]
[Figure: weight coefficients w1, w2, w3, w4 plotted against lambda (0 to 30); y-axis "weight coefficient" from 0.0 to 1.4.]
# exact (unregularized) least-squares solution
we <- t(solve(t(X) %*% X) %*% t(X) %*% y)
# L1-regularized solution: minimize the L1-augmented error augerror1
# (defined elsewhere) numerically with optim(), for penalty weight l
wr1 <- t(optim(c(0, 0, 0, 0), augerror1, l = l)$par)
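As a hedged sketch of how a weight path like the one in the figure above could be produced, the closed-form L2-regularized solution can be evaluated over a grid of lambda values (the grid and the plotting choices are assumptions):

# L2-regularized weights for a grid of lambda values
lambdas <- seq(0, 30, by = 1)
path <- sapply(lambdas, function(l)
  solve(t(X) %*% X + l * diag(ncol(X)), t(X) %*% y))
# rows: coefficients w1, ..., w_{d+1}; columns: lambda values
matplot(lambdas, t(path), type = "l", xlab = "lambda", ylab = "weight coefficient")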