Lecture 4: Regularization
Boyu Wang
Department of Computer Science
University of Western Ontario
Motivation examples: polynomial regression
Overfitting
Larger data set
• Same model complexity: more data ⇒ less overfitting. With more data, more complex (i.e. more flexible) models can also be used.
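A minimal sketch (not from the lecture) of this effect: polynomials of increasing degree are fit to noisy samples of a sine curve, and the test error is compared for a small and a larger training set. The sine target, sample sizes, noise level, and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # noisy samples of an underlying sine curve (illustrative choice)
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_test, y_test = make_data(1000)

for n_train in (15, 100):
    x_tr, y_tr = make_data(n_train)
    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"n={n_train:3d}  degree={degree}: test MSE = {test_mse:.3f}")
```

Typically the high-degree fit overfits on the small training set (large test error), while with more data the same high-degree model behaves much better.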
Model selection
Model selection: cross-validation
1. Randomly split the training data into K groups, and repeat the following procedure K times:
   i. Leave out the k-th group from the training set as a validation set
   ii. Use the other K − 1 groups to find the best parameter vector wk
   iii. Measure the error of wk on the validation set; call this Jk
2. Compute the average error: $J = \frac{1}{K}\sum_{k=1}^{K} J_k$
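A compact NumPy sketch of this procedure (my own illustration, not code from the lecture); the `fit` and `error` callables are assumed placeholders for whatever model and loss are being selected.

```python
import numpy as np

def k_fold_cv(X, y, fit, error, K=5, seed=0):
    """Estimate generalization error by K-fold cross-validation.

    Assumes fit(X_train, y_train) -> w and error(w, X_val, y_val) -> float.
    """
    idx = np.random.default_rng(seed).permutation(len(y))  # random split into K groups
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]                                      # leave out the k-th group
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w_k = fit(X[train], y[train])                       # train on the other K-1 groups
        errors.append(error(w_k, X[val], y[val]))           # J_k on the validation set
    return np.mean(errors)                                  # J = (1/K) * sum_k J_k
```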
General learning procedure
The training, validation, and test sets must be disjoint – you should never touch the test data before you evaluate your model.
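A minimal sketch of such a disjoint split (the 60/20/20 proportions and function name are assumptions, not from the lecture):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then slice into three non-overlapping index sets.
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(len(y) * test_frac)
    n_val = int(len(y) * val_frac)
    test, val = idx[:n_test], idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```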
Summary of cross-validation
Regularization
ℓ2-norm regularization for linear regression
Objective function:
$$L(w) = \frac{1}{2}\sum_{i=1}^{m}\Big(w_0 + \sum_{j=1}^{n} w_j \cdot x_{i,j} - y_i\Big)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2$$
• No regularization on w0!
Equivalently, we have
$$L(w) = \frac{1}{2}\lVert Xw - y\rVert_2^2 + \frac{\lambda}{2}\, w^{\top} \hat{I} w$$
where $w = [w_0, w_1, \ldots, w_n]^{\top}$ and
$$\hat{I} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$
ℓ2-norm regularization for linear regression
Objective function:
$$L(w) = \frac{1}{2}\lVert Xw - y\rVert_2^2 + \frac{\lambda}{2}\, w^{\top} \hat{I} w = \frac{1}{2}\left( w^{\top}(X^{\top}X + \lambda \hat{I})w - w^{\top}X^{\top}y - y^{\top}Xw + y^{\top}y \right)$$
Optimal solution (by solving $\nabla L(w) = 0$): $w = (X^{\top}X + \lambda \hat{I})^{-1} X^{\top} y$
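A small NumPy sketch of this closed-form solution (illustrative only; the convention that X already contains a leading column of ones for w0 is an assumption):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """L2-regularized least squares: w = (X^T X + lam * I_hat)^{-1} X^T y.

    Assumes the first column of X is all ones (the bias feature), so the
    (0, 0) entry of I_hat is set to zero and w0 is not penalized.
    """
    I_hat = np.eye(X.shape[1])
    I_hat[0, 0] = 0.0                     # no regularization on w0
    return np.linalg.solve(X.T @ X + lam * I_hat, X.T @ y)
```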
More on ℓ2-norm regularization
$$\arg\min_{w}\; \frac{1}{2}\lVert Xw - y\rVert_2^2 + \frac{\lambda}{2}\, w^{\top} \hat{I} w \;=\; (X^{\top}X + \lambda \hat{I})^{-1} X^{\top} y$$
• ℓ2-norm regularization pushes the parameters towards 0.
• λ → ∞ ⇒ w → 0
Minimizing $J(w) + \frac{\lambda}{2}\lVert w\rVert_2^2$ is equivalent to the constrained problem
$$\min_{w} J(w) \quad \text{such that} \quad \lVert w\rVert_2^2 \le \eta$$
for some $\eta \ge 0$.¹

¹ e.g., Boyd and Vandenberghe. Convex Optimization. 2004.
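A quick illustration of the shrinkage behaviour (the synthetic data and λ grid are made-up assumptions): as λ grows, the norm of the penalized weights w1, ..., wn shrinks toward 0, while the unpenalized bias w0 stays roughly at its least-squares value.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # leading column of ones for w0
w_true = np.array([1.0, 2.0, -3.0, 0.5, 0.0, 1.5])
y = X @ w_true + rng.normal(0, 0.1, m)

I_hat = np.eye(n + 1)
I_hat[0, 0] = 0.0                                          # w0 is not penalized

for lam in (0.0, 1.0, 100.0, 1e6):
    w = np.linalg.solve(X.T @ X + lam * I_hat, X.T @ y)
    print(f"lambda={lam:>9}: w0 = {w[0]:+.3f}, ||w1..wn|| = {np.linalg.norm(w[1:]):.4f}")
```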
Visualizing ℓ2-norm regularization (2 features)
ℓ1-norm regularization
The ℓ1-regularized objective, which penalizes $\sum_{j=1}^{n} |w_j|$ instead of $\sum_{j=1}^{n} w_j^2$, is equivalent to the constrained problem
$$\min_{w}\; \frac{1}{2}\sum_{i=1}^{m}\Big(w_0 + \sum_{j=1}^{n} w_j \cdot x_{i,j} - y_i\Big)^2 \quad \text{such that} \quad \sum_{j=1}^{n} |w_j| \le \eta$$
Visualizing ℓ1-norm regularization (2 features)
• If λ is large enough, the circle is very likely to intersect the diamond at one of the corners.
• This makes ℓ1-norm regularization much more likely to make some weights exactly 0.
• In other words, we essentially perform feature selection!
Comparison of ℓ2 and ℓ1
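A hedged comparison sketch (not from the lecture) using scikit-learn's Ridge and Lasso on synthetic data with only a few informative features; the alpha values are arbitrary. It illustrates the point of the previous slides: the ℓ1 penalty drives several coefficients exactly to zero, while the ℓ2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
m, n = 100, 10
X = rng.normal(size=(m, n))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 1.5])   # only 3 informative features
y = X @ w_true + rng.normal(0, 0.1, m)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all weights a little
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sets some weights exactly to 0

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros (lasso):", int(np.sum(lasso.coef_ == 0)), "of", n)
```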