One Fourth Labs: L2 Regularization
L2 regularization
What is the intuition behind L2 regularization?
1. Consider the error curves for the training and test sets
2. In the case of squared error loss: Ltrain(θ) = ∑ᵢ (yᵢ − f̂(xᵢ))², where the sum runs over the N training examples
a. Where θ = [W111, W112, ..., WLnk], the collection of all the weights in the network
b. Our aim has been to minimise the loss function: min_θ L(θ)
3. Now, imagine we include a new term in the minimisation objective: min_θ L(θ), where L(θ) = Ltrain(θ) + Ω(θ)
a. Here, in addition to minimising the training loss, we are also minimising some other quantity
that is dependent on our parameters
b. In the case of L2 Regularisation, Ω(θ) = ||θ||₂² (the sum of the squares of the weights)
c. Ω(θ) = W111² + W112² + ... + WLnk²
d. Here, we aim to minimise both Ltrain(θ) and Ω(θ); it would not make sense for either of them to take a high value, as the numeric sketch below illustrates.
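To make this concrete, here is a minimal Python/NumPy sketch of the combined objective. The weights, predictions, and targets are made-up illustrative values, not part of the lecture's example.

    import numpy as np

    # Toy values: a few weights of a small network, flattened into theta.
    theta = np.array([0.5, -1.2, 3.0, 0.1])

    # Squared-error training loss on a made-up prediction/target pair.
    y_true = np.array([1.0, 0.0])
    y_pred = np.array([0.8, 0.3])
    L_train = np.sum((y_true - y_pred) ** 2)

    # L2 regularisation term: Omega(theta) = sum of the squared weights.
    omega = np.sum(theta ** 2)

    # The quantity that gradient descent now minimises.
    L_total = L_train + omega
    print(L_train, omega, L_total)

Note how the single large weight (3.0) contributes 9 to Ω(θ) and dominates the penalty; this is exactly the kind of growth the regulariser discourages.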
4. What if we set all the weights to 0? In this case, the model would not have learned anything useful, so Ltrain(θ) would be high.
5. What if we try to drive Ltrain(θ) all the way to 0? In this case, it is possible that some of the weights would take on large values, thereby driving Ω(θ) high.
6. To counter the previous point's shortcoming, we need to minimise Ltrain(θ) without allowing the weights to grow too large.
7. Thus, as shown in the figure, with L2 Regularisation we do not drive the training loss all the way to zero; instead we keep it slightly above zero so that Ω(θ) does not become too large.
8. This works in the Gradient Descent Algorithm as well
9. The algorithm
a. Initialise: w111, w112, … w313, b1, b2, b3 randomly
b. Iterate over data
i. Compute ŷ
ii. Compute L(w, b), the cross-entropy loss function
iii. w111 = w111 - η𝚫w111
iv. w112 = w112 - η𝚫w112
…
v. w313 = w313 - η𝚫w313
c. Till satisfied
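As a rough illustration of this loop (not the course's actual implementation), here is a Python sketch; the gradient function is a placeholder standing in for backpropagation, and all names and values are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal(6)        # stands in for w111, w112, ..., w313 and the biases
    eta = 0.1                         # learning rate

    def grad_loss(w):
        # Placeholder for the gradient of the loss obtained via backpropagation;
        # a real network would compute this from the data.
        return 2 * w - 1.0

    for step in range(100):           # "iterate over the data ... till satisfied"
        delta_w = grad_loss(w)
        w = w - eta * delta_w         # w_ijk <- w_ijk - eta * delta_w_ijk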
10. The derivative of the loss function w.r.t. any weight is ΔWijk = ∂L(θ)/∂Wijk
11. In the case of L2 Regularisation, that value becomes ΔWijk = ∂Ltrain(θ)/∂Wijk + ∂Ω(θ)/∂Wijk
12. When differentiating the regularisation term, every weight other than the one we are differentiating with respect to vanishes, and only the concerned weight contributes, i.e. ∂Ω(θ)/∂Wijk = 2Wijk
13. So the new gradient term will be ΔWijk = ∂Ltrain(θ)/∂Wijk + 2Wijk
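In code, the only change from the plain update is the extra 2·Wijk term. Below is a minimal NumPy sketch using made-up values for the weights and for the training-loss gradient (which would really come from backpropagation).

    import numpy as np

    w = np.array([0.5, -1.2, 3.0])            # current weights (toy values)
    grad_train = np.array([0.4, -0.1, 0.9])   # assumed dLtrain/dw from backpropagation
    eta = 0.1                                 # learning rate

    # L2-regularised gradient: dLtrain/dw + 2w, as in item 13.
    delta_w = grad_train + 2 * w

    # Gradient descent step with the regularised gradient.
    w = w - eta * delta_w

The weight with the largest magnitude (3.0) receives the largest extra push towards zero, which is how the penalty keeps weights from growing too large. In practice the penalty is usually scaled by a coefficient λ, making the extra term 2λWijk; the notes above implicitly use λ = 1.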
14. In PyTorch, this is handled automatically, for example through the weight_decay argument of the optimiser.
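As a rough sketch of what this looks like in practice (the model, data, and hyperparameter values below are placeholder assumptions, not from the lecture):

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)                   # placeholder model

    # weight_decay makes the optimiser add a term proportional to each weight
    # to that weight's gradient, which is the effect of an L2 penalty on the
    # weights (up to a constant factor in how the penalty is written).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

    x = torch.randn(8, 4)                     # dummy inputs
    y = torch.randint(0, 2, (8,))             # dummy class labels
    criterion = nn.CrossEntropyLoss()

    loss = criterion(model(x), y)             # only the training loss is computed here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # update uses gradient + weight_decay * w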