
Learning From Data

9: Regularization
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main

Knowledge through practice strengthens


Summer Semester 2022
Content

Motivation

Regularization

Lp Norms and Regularizers

General Regularization Definition

Bibliography



If I had an hour to solve a problem I’d spend 55 minutes thinking
about the problem and 5 minutes thinking about solutions.
–Albert Einstein



Trade-Off
Remember the bias-variance decomposition and the generalization error?

[Figure: two learning curves of expected error versus the number of data points N.
Left: E_out and E_in with E_out decomposed into bias and variance. Right: E_out and
E_in with the gap labelled generalization error and E_in labelled in-sample error.]

Questions:
How can we optimize variance and bias trade-offs?
How can we optimize the generalization error?
How can we avoid or reduce over-fitting?
Recap: Linear Regression
Linear Regression Algorithm
The linear regression algorithm is defined by the solution $w_{\mathrm{opt}}$ to the
following optimization problem:

$$w_{\mathrm{opt}} := \arg\min_{w \in \mathbb{R}^{d+1}} E_{\mathrm{in}}(w)
= \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{N} \sum_{n=1}^{N} \left( w^t x_n - y_n \right)^2
= \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{N} (Xw - y)^T (Xw - y)$$

The solution is (see lecture 6):

$$w_{\mathrm{opt}} = X^{\dagger} y = (X^t X)^{-1} X^t y,$$

where $X^{\dagger}$ denotes the Moore-Penrose pseudo-inverse.
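To make the recap concrete, here is a minimal R sketch (my own illustration, not part of the slides; data and variable names are made up) that solves the normal equations directly and checks the result against R's built-in lm():

set.seed(1)
N <- 50
x <- runif(N, -1, 1)
y <- 2 - 3 * x + rnorm(N, sd = 0.2)     # noisy linear target

X <- cbind(1, x)                        # design matrix with bias column x_0 = 1
w_opt <- solve(t(X) %*% X, t(X) %*% y)  # (X^t X)^{-1} X^t y via the normal equations

print(t(w_opt))
print(coef(lm(y ~ x)))                  # agrees up to numerical error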



Recap: Non-Linear Regression
Remember: Linearity in the weights is the only important thing, thus we can
define
Non-Linear Regression Algorithm
Let $\varphi : \mathbb{R}^{d'+1} \to \mathbb{R}^{d+1}$ be any function; then the non-linear regression
algorithm is defined by the solution $w_{\mathrm{opt}}$ to the following optimization problem:

$$w_{\mathrm{opt}} := \arg\min_{w \in \mathbb{R}^{d+1}} E_{\mathrm{in}}(w)
= \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{N} \sum_{n=1}^{N} \left( w^t \varphi(x_n) - y_n \right)^2$$

The solution is again (see lecture 6):

$$w_{\mathrm{opt}} = X^{\dagger} y = (X^t X)^{-1} X^t y,$$

where $X_{i,j} := \varphi(x_i)_j$.
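As an illustration (again a sketch of my own, not the lecture's code), the same closed-form solution fits a polynomial once the rows of X are built from a feature map; here $\varphi(x) = (1, x, x^2, x^3)$:

set.seed(2)
N <- 50
x <- runif(N, -1, 1)
y <- sin(pi * x) + rnorm(N, sd = 0.1)     # non-linear target

phi <- function(x) cbind(1, x, x^2, x^3)  # feature map: x -> (1, x, x^2, x^3)
X <- phi(x)                               # X[i, j] = phi(x_i)_j
w_opt <- solve(t(X) %*% X, t(X) %*% y)    # linear in the weights, same normal equations

y_hat <- phi(0.3) %*% w_opt               # prediction at a new point x = 0.3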



How to Measure Complexity?

Ideas:
The complexity of the model is related to the number of
coefficients.
We can compare models by a complexity hierarchy.
We can regard a simpler model as a constrained version of a more
complex model.
Thus, if $H_n := \{ w_0 + w_1 \varphi_1(x) + \ldots + w_n \varphi_n(x) \}$, we can embed $H_m$ in
$H_n$ for $m < n$, i.e. $H_m \subseteq H_n$, by setting all $w_i = 0$ for $m < i \leq n$.
We call $H_m$ a constrained version of $H_n$.
Complexity Measure
Idea: The bigger the values of the $w_i$, the more complex $H_n$ is.



How to Measure Complexity? (cont.)
Definition 1 (Soft Order Constraint)
Fix $C > 0$. Then the constraint

$$\sum_{i=0}^{n} w_i^2 \leq C$$

is called soft order constraint and

$$H_n^C := \{ h \in H_n \mid \sum_{i=0}^{n} w_i^2 \leq C \}$$

is the version of $H_n$ constrained by this soft order constraint. Clearly

$$H_n^C \subseteq H_n.$$

Intuitively, we expect $H_n^C$ to suffer less from overfitting the smaller $C$ is.



Regularized (Linear) Regression

Definition 2 (Regularized (Linear) Regression)


Let $\varphi : \mathbb{R}^{d'+1} \to \mathbb{R}^{d+1}$ be any function and let $C > 0$ be any fixed positive
number; then the regularized non-linear regression algorithm is defined by the solution
$w_{\mathrm{reg}}$ to the following optimization problem:

$$w_{\mathrm{reg}} := \arg\min_{w \in \mathbb{R}^{d+1}} E_{\mathrm{in}}(w) \quad \text{subject to } w^t w \leq C.$$

Note that $w_{\mathrm{reg}} \in H_n^C$.

Note further that unconstrained problems are easier to handle, thus we would like to
convert the problem into an unconstrained one.



Regularized (Linear) Regression – Lagrange (Dual) Form
Lemma 3
The solution of the Regularized (Linear) Regression is equivalent to the solution
of the following unconstrained optimization problem:

$$w_{\mathrm{reg}} := \arg\min_{w \in \mathbb{R}^{d+1}} E_{\mathrm{in}}(w) + \lambda_C\, w^t w,$$

with $\lambda_C := -\frac{1}{2C}\, w^t \nabla E_{\mathrm{in}}(w)$.

Sketch of Proof.
Obviously,

$$\arg\min_{w \in \mathbb{R}^{d+1}} E_{\mathrm{in}}(w) + \lambda_C\, w^t w
= \arg\min_{w \in \mathbb{R}^{d+1}} E_{\mathrm{in}}(w) + \lambda_C (w^t w - C).$$

We define $f(w) := E_{\mathrm{in}}(w)$ and $g(w) := w^t w - C$.


Then we can apply the theory of Lagrange multipliers for inequality constraints (see
also the Karush-Kuhn-Tucker conditions [KT51]), where we minimize $f$ subject to the
constraint $g \leq 0$.
Proof (cont.)
Sketch of Proof – cont.

Case 1 (left): The minimum lies inside the feasible region. Then the constraint is
ineffective.
Case 2 (right): The minimum lies outside the feasible region. Then the constraint is
effective and the solution has to lie at the boundary (otherwise we could improve the
minimum by moving along the gradient). At the boundary, however, the gradients of
$f$ and $g$ must be parallel, i.e. $\nabla f(x) = \lambda \nabla g(x)$, because otherwise we could
improve by moving inside the feasible region again.

[Figure: illustration of the two cases for the constrained minimum.]
Source: Onmyphd,
https://web.archive.org/web/20210506170321/http://www.onmyphd.com/?p=kkt.karush.kuhn.tucker
See also
http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/KKT.pdf
Proof (cont.)

Sketch of Proof – cont.


We compute $\nabla f(w_{\mathrm{reg}}) = \nabla E_{\mathrm{in}}(w_{\mathrm{reg}})$ and $\nabla g(w_{\mathrm{reg}}) = 2 w_{\mathrm{reg}}$; hence,
from the dual formulation, it follows that

$$\nabla E_{\mathrm{in}}(w_{\mathrm{reg}}) + 2 \lambda_C\, w_{\mathrm{reg}} = 0.$$

Therefore, multiplying with $w_{\mathrm{reg}}^t$ and using $w_{\mathrm{reg}}^t w_{\mathrm{reg}} = C$, we get

$$\lambda_C = -\frac{1}{2C}\, w_{\mathrm{reg}}^t \nabla E_{\mathrm{in}}(w_{\mathrm{reg}}).$$



Augmented Error

As the dual formulation comes in handy, one makes the following definition:

Definition 4 (Augmented Error)
The Augmented Error $E_{\mathrm{aug}}$ is defined as

$$E_{\mathrm{aug}}(w) := E_{\mathrm{in}}(w) + \lambda w^t w.$$

Note that the above considerations show that we can either
1. minimize the original error subject to the soft constraint, or
2. minimize the augmented error globally (see the sketch below).
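A minimal sketch of option 2 (my own illustration; the data and the value of $\lambda$ are made up), handing the augmented error directly to a numerical optimizer:

set.seed(3)
N <- 30
x <- runif(N, -1, 1)
y <- 1 + 2 * x + rnorm(N, sd = 0.3)
X <- cbind(1, x)

Ein  <- function(w) mean((X %*% w - y)^2)                 # in-sample squared error
Eaug <- function(w, lambda) Ein(w) + lambda * sum(w^2)    # E_in(w) + lambda * w^t w

lambda <- 0.5
w_reg <- optim(c(0, 0), Eaug, lambda = lambda)$par        # minimize the augmented error globally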



Augmented Error (cont.)
The above ideas can be generalized:
Definition 5 (Augmented Error)
Let $N$ be the sample size and $\Omega(h)$ be a complexity penalty term. Then
the generalized Augmented Error $E_{\mathrm{aug}}$ is defined as

$$E_{\mathrm{aug}}(h, \lambda, \Omega) := E_{\mathrm{in}}(h) + \frac{\lambda}{N}\, \Omega(h).$$

Definition 6 (Weight Decay)

Consider (linear) regularized regression. The penalty term $\Omega(h) := N\, w^t w$
is called weight decay.

Note: The name weight decay stems from the fact that it reduces large
weights. However, in general it does not reduce them to zero.
Solution of $L_2$-regularized (Non-) Linear Regression
Recall that the solution of the (unregularized) (non-) linear regression

$$w_{\mathrm{lin}} = \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{N} (Xw - y)^T (Xw - y)$$

is given by

$$w_{\mathrm{lin}} = (X^t X)^{-1} X^t y,$$

with $X_{i,j} := \varphi(x_i)_j$ for the non-linear case and $X_{i,j} := x_{i,j}$ for the linear
case, if $X^t X$ is invertible.
In general, if it is not invertible, the solution is obtained by using the Moore-Penrose
pseudo-inverse $X^{\dagger}$, i.e. $w_{\mathrm{lin}} = X^{\dagger} y$, which solves the normal equations

$$X^t X\, w_{\mathrm{lin}} = X^t y.$$



Solution of Regularized (Non-) Linear Regression (cont.)
Similarly, one can show that the solution of the regularized (non-) linear
regression

$$w_{\mathrm{reg}} = \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{N} (Xw - y)^T (Xw - y) + \frac{\lambda}{N}\, w^t w$$

is

$$w_{\mathrm{reg}} = (X^t X + \lambda I)^{-1} X^t y.$$

Lemma 7
If $\lambda > 0$, then the matrix $B_\lambda := X^t X + \lambda I$ is invertible.

Proof.
For any vector $x \in \ker B_\lambda$ with $x \neq 0$ we have

$$0 = \langle x, B_\lambda x \rangle = x^t X^t X x + \lambda \langle x, x \rangle = \|Xx\|^2 + \lambda \|x\|^2 > 0,$$

which is a contradiction. Thus, $\ker B_\lambda = \{0\}$, which is equivalent to $B_\lambda$ being invertible.
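To illustrate the closed form (a sketch of my own, not the lecture's demo; data and $\lambda$ are arbitrary):

set.seed(4)
N <- 40
x <- runif(N, -1, 1)
y <- sin(pi * x) + rnorm(N, sd = 0.1)
X <- cbind(1, x, x^2, x^3)                                       # polynomial features
lambda <- 1

w_reg <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)  # (X^t X + lambda I)^{-1} X^t y
w_lin <- solve(t(X) %*% X, t(X) %*% y)                           # unregularized solution

print(cbind(w_lin, w_reg))                                       # regularized weights are shrunk towards 0

Letting lambda tend to 0 or to a very large value in this sketch reproduces the two limits discussed on the following slide.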


Limits of Regularized Solutions

For the limits we have

$$\lim_{\lambda \to 0} w_{\mathrm{reg}} = (X^t X)^{-1} X^t y = w_{\mathrm{lin}}$$

and

$$\lim_{\lambda \to \infty} w_{\mathrm{reg}} = 0.$$

Note that both limits behave as intuitively expected.



Lp Norms and Regularizers

Note that we can write the weight decay as follows:

$$\Omega(h_w) = w^t w = \|w\|^2 = \|w\|_{L_2}^2$$

This inspires us to define

$$\Omega_p(h_w) := \|w\|_{L_p} = \left( \sum_{i=1}^{n} |w_i|^p \right)^{1/p}, \qquad \forall p > 0,$$

for regularization.

In particular, the $L_1$ norm is used very often. Although this might seem like only a
small difference to $L_2$, $L_1$ regularizers behave quite differently from $L_2$
regularizers.
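A one-line R sketch of the general $L_p$ regularizer (my own illustration):

lp_norm <- function(w, p) sum(abs(w)^p)^(1 / p)   # ||w||_{L_p}

w <- c(0.5, -2, 0, 1)
lp_norm(w, 1)   # L1 norm: 3.5
lp_norm(w, 2)   # L2 norm: sqrt(5.25), approximately 2.29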



Geometry of the L1 and L2 Norms

[Figure: the unit balls of the $L_1$ norm (left) and the $L_2$ norm (right), drawn on axes
from $-2$ to $2$.]

The $L_1$ ball is an angular square (a diamond); the $L_2$ ball is a circle.



Touching Points of the L1 Norm

[Figure: the $L_1$ unit ball together with a line and another contour touching it,
illustrating that the touching point is typically a corner; axes from $-2$ to $2$.]

The $L_1$ ball touches an arbitrary line (or any other contour) most likely at a corner
(left graphic). But at a corner, one of the coordinates (here $x_2$) is zero.



Touching Points of the L2 Norm

[Figure: the $L_2$ unit ball (a circle) together with a line touching it at a generic point;
axes from $-2$ to $2$.]

The $L_2$ ball touches an arbitrary line (or any other contour) at a generic point, where
both coordinates are almost always different from zero.



Intuition
Observe that
for the $L_1$ norm not to touch at just a corner, one of its faces has to be oriented
exactly parallel to the tangent plane of the manifold it touches;
this is extremely unlikely (in fact, it has probability zero).
Conversely, for the $L_2$ norm to touch such that one of the coordinates is zero, this
coordinate axis has to be orthogonal to the tangent plane of the manifold it touches;
this is again extremely unlikely (in fact, it has probability zero).
Hence, if we minimize using an $L_1$ or an $L_2$ norm, respectively,
the $L_1$ norm tends to yield some (or many) of the coordinates exactly zero, whereas
the $L_2$ norm tends to keep all coordinates.



Sparse Solutions and Feature Selection

Because the $L_1$ norm yields many coordinates that are exactly zero, we say that the
$L_1$ norm favors sparse models or solutions.
Feature Selection
As the $L_1$ norm yields sparse solutions, we can use $L_1$ norm regularizers to select
features: all features whose coefficients come out as zero are regarded as unimportant
and are ignored (see the sketch below).
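A tiny sketch of that selection step (my own illustration; the weight vector and the numerical threshold are made up):

w_l1 <- c(1.2, 0.0, 0.00001, -0.8)      # weights from some L1-regularized fit (illustrative values)

selected <- which(abs(w_l1) > 1e-4)     # treat (numerically) zero weights as dropped features
print(selected)                         # indices of the features that are kept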



Solution of $L_1$-regularized (Non-) Linear Regression

As opposed to regularization with the $L_2$ norm, which has a closed-form solution, no
closed-form solution exists for $L_1$ regularization.
Furthermore, note that the $L_1$ norm is not differentiable (at zero). This makes many
numerical optimization procedures such as (stochastic) gradient descent more problematic.
In practice, one uses quadratic programming (a convex problem) to obtain a solution to

$$\arg\min_{w} (y - Xw)^t (y - Xw) \quad \text{s.t.}\ \|w\|_1 \leq s.$$
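In practice one rarely codes this by hand. A sketch using the glmnet package (an assumption on my part, it is not used in the lecture; glmnet solves the penalized formulation by coordinate descent rather than generic quadratic programming, and alpha = 1 selects the lasso penalty):

library(glmnet)

set.seed(7)
N <- 100
X <- matrix(rnorm(N * 5), nrow = N)                # 5 candidate features
y <- 2 * X[, 1] - X[, 4] + rnorm(N, sd = 0.5)      # only features 1 and 4 matter

fit <- glmnet(X, y, alpha = 1)                     # lasso path over a grid of lambda values
coef(fit, s = 0.1)                                 # coefficients at lambda = 0.1; several are exactly 0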



Examples
[Figure: three panels titled "Regularized Non-Linear Regression" for Lambda = 0.1, Lambda = 1,
and Lambda = 10. Each panel plots y against x and shows the data points, the target function,
the unregularized fit, the L1-regularized fit, and the L2-regularized fit.]



Sparsity

[Figure: "Feature Selection with L1-Regularization": the weight coefficients w1, w2, w3, w4
plotted against lambda (0 to 30); the coefficient values range from 0.0 to 1.4 on the
vertical axis.]



Interpretation
Both $L_1$ and $L_2$ regularization reduce complexity by penalizing the weights.
Both make the fitting curve (the selected hypothesis) “smoother” and less “wiggly”.
This means less overfitting and will reduce $E_{\mathrm{out}}$.
In the case of $L_1$ regularization, some weights are reduced to zero (as opposed to
$L_2$ regularization), as theoretically expected.
Thus, the $L_1$-regularized solution is sparse.¹
However, the $L_1$-regularized solution does not automagically pick the right
coefficients: $w_3$ survives instead of $w_1$.
Neither $L_1$ nor $L_2$ is better by default; see, however, e.g. [Ng04].

¹ This will become much more pronounced if we look at higher-dimensional feature spaces.
R Demo

# exact solution (closed form via the normal equations)
# X, y, error() and the regularization weight l are defined earlier in the demo
we <- t(solve(t(X) %*% X) %*% t(X) %*% y)

# augmented error using the L1 norm
augerror1 <- function(w, l) {
  return(error(w) + l * sum(abs(w)))
}

# L1 solution: minimize the augmented error numerically
wr1 <- t(optim(c(0, 0, 0, 0), augerror1, l = l)$par)
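For comparison, a possible $L_2$ counterpart in the same style (my own addition, assuming the same X, y, error() and l as above):

# augmented error using the L2 norm (weight decay)
augerror2 <- function(w, l) {
  return(error(w) + l * sum(w^2))
}

# L2 solution, numerically ...
wr2 <- t(optim(c(0, 0, 0, 0), augerror2, l = l)$par)

# ... and via the closed form (X^t X + l I)^{-1} X^t y derived earlier; the two agree
# when error(w) is the plain sum of squared residuals (for a mean squared error the
# effective lambda scaling differs by a factor of N)
wr2.exact <- t(solve(t(X) %*% X + l * diag(ncol(X)), t(X) %*% y))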



Terminology

$L_1$ regularization is often called Lasso (Least Absolute Shrinkage and Selection
Operator) [SS86] in the ML community.
$L_2$ regularization is often called Ridge [HK70] in the ML community (the name refers
to the shape of a ridge, as in a mountain ridge).
One can combine Lasso and Ridge, the so-called “elastic net” (see below).
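The elastic net penalty is not spelled out on this slide; in one common parametrization (an added note, not from the lecture) it reads

$$\Omega_{\mathrm{EN}}(w) = \alpha \,\|w\|_1 + (1 - \alpha)\, \|w\|_2^2, \qquad \alpha \in [0, 1],$$

so that $\alpha = 1$ recovers the Lasso penalty and $\alpha = 0$ the Ridge penalty.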



Summary

Regularization is used to reduce the overfitting problem.
Regularization adds an extra term to the cost function.
The extra term is called the regularization term.
It penalizes the complexity (parameters) of the model while, if possible, not
impacting $E_{\mathrm{in}}$.
There are many different ways to regularize, depending on the model.
Other common regularizers are the Akaike information criterion (AIC), the minimum
description length (MDL), and the Bayesian information criterion (BIC) ($\|w\|_0$).



General Definition and Ill-Posed Problems
The theory of regularization has a very long history.
Hadamard [Had02] has defined so-called well- and ill-posed problems:
A problem is well-posed if its solution:
1. exists
2. is unique and
3. depends continuously on the data (i.e. it is stable).
A problem is ill-posed if it is not well-posed.
In Machine Learning we often try to infer $x$ from $y$, given an operator $L$ with
$$y = Lx.$$
This is a so-called “inverse” problem and has been proven to be unstable.
Following Tikhonov [TA77], one can regularize it
y = Lx + R(x ),
and turn it into a stable problem. This is further analyzed in the so-called
structural risk minimization theory (later).
How to Pick Regularization Parameters?

In practice, regularization depends on one or many parameters, such as $\lambda$.
How, then, do we pick the “best” regularization parameters?
Answer: Cross Validation! (See the sketch below.)
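A sketch of picking $\lambda$ by k-fold cross validation for the ridge closed form (my own illustration; the data, the grid of candidate values, and k are arbitrary):

set.seed(10)
N <- 60
x <- runif(N, -1, 1)
y <- sin(pi * x) + rnorm(N, sd = 0.2)
X <- cbind(1, x, x^2, x^3)

k <- 5
folds <- sample(rep(1:k, length.out = N))          # random fold assignment
lambdas <- 10^seq(-4, 2, length.out = 20)          # candidate regularization parameters

cv_error <- sapply(lambdas, function(lambda) {
  mean(sapply(1:k, function(i) {
    tr <- folds != i
    w  <- solve(t(X[tr, ]) %*% X[tr, ] + lambda * diag(ncol(X)), t(X[tr, ]) %*% y[tr])
    mean((X[!tr, ] %*% w - y[!tr])^2)              # validation error on the held-out fold
  }))
})

best_lambda <- lambdas[which.min(cv_error)]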





References I
[Had02] J. Hadamard, “Sur les problèmes aux dérivées partielles et leur signification
physique,” Princeton University Bulletin, pp. 49–52, 1902.
[HK70] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for
nonorthogonal problems,” Technometrics, vol. 12, pp. 55–67, 1970.
[KT51] H. Kuhn and A. Tucker, “Nonlinear programming,” in Proceedings of the 2nd
Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley: University
of California Press, 1951, pp. 481–492.
[Ng04] A. Y. Ng, “Feature selection, L1 vs. L2 regularization, and rotational
invariance,” in Proceedings of the Twenty-first International Conference on Machine
Learning, ser. ICML ’04. New York, NY, USA: ACM, 2004, p. 78. [Online]. Available:
http://doi.acm.org/10.1145/1015330.1015435
References II

[SS86] F. Santosa and W. Symes, “Linear inversion of band-limited reflection
seismograms,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 4,
pp. 1307–1330, 1986.
[TA77] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-posed Problems.
W. H. Winston, 1977.

