One Fourth Labs: L2 Regularization

L2 regularization adds a regularization term to the loss function to prevent weights from growing too large during training. This term is the sum of squares of all weights. By minimizing both the training loss and the regularization loss, L2 regularization finds a balanced solution: the training loss is kept low while the weights stay small, which helps reduce overfitting. The gradient descent update for each weight then factors in both its contribution to the training loss and its contribution to the regularization loss.


PadhAI: Regularization 

One Fourth Labs 

L2 regularization 
What is the intuition behind L-2 regularization? 
1. Consider the error curves for training and test set

 
2. In the case of squared error loss: Ltrain(θ) = ∑ᵢ₌₁ᴺ (yᵢ − f̂(xᵢ))²
a. Where θ = [W₁₁₁, W₁₁₂, ..., Wₗₙₖ] is the vector of all weights
b. Our aim has been to minimise the loss function: min_θ L(θ)
3. Now, imagine we include a new term in the minimisation condition: min_θ L(θ), where L(θ) = Ltrain(θ) + Ω(θ)
a. Here, in addition to minimising the training loss, we are also minimising some other quantity that depends on our parameters
b. In the case of L2 regularisation, Ω(θ) = ‖θ‖₂², the squared L2 norm of the weight vector (the sum of the squares of all the weights)
c. Ω(θ) = W₁₁₁² + W₁₁₂² + ... + Wₗₙₖ²
d. Here, we aim to minimise both Ltrain(θ) and Ω(θ); it wouldn't make sense for either of them to take a high value.
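The combined objective in point 3 can be sketched in plain Python (the weight values and training-loss value below are illustrative assumptions, not from the notes):

```python
# Sketch: combined objective L(theta) = L_train(theta) + Omega(theta),
# where Omega is the sum of squares of all weights (squared L2 norm).
# Toy weights and training loss are assumed for illustration.

def omega(weights):
    # Omega(theta) = sum over all weights of w^2
    return sum(w ** 2 for w in weights)

def total_loss(l_train, weights):
    # L(theta) = L_train(theta) + Omega(theta)
    return l_train + omega(weights)

weights = [0.5, -1.0, 2.0]
print(omega(weights))            # 0.25 + 1.0 + 4.0 = 5.25
print(total_loss(1.2, weights))  # 1.2 + 5.25 = 6.45
```

Note how a single large weight (here 2.0) dominates Ω(θ), which is exactly why minimising the combined objective discourages large weights.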
4. What if we set all weights to 0? In this case, the model would not have learned much, therefore 
Ltrain (θ) would be high. 
5. What if we try to minimise Ltrain (θ) to 0? In this case, it is possible that some of the weights would 
take on large values, thereby driving the value of Ω(θ) high. 
6. To counter the previous point’s shortcoming, we need to minimize Ltrain (θ) but shouldn’t allow 
the weights to grow too large 
7. Thus, as shown in the figure, in L2 regularisation we do not drive the training loss all the way to zero; instead we keep it slightly above zero so that Ω(θ) doesn't become too high
8. This works in the Gradient Descent Algorithm as well 
 

9. The algorithm
a. Initialise: w₁₁₁, w₁₁₂, ..., w₃₁₃, b₁, b₂, b₃ randomly
b. Iterate over data:
i. Compute ŷ
ii. Compute L(w, b), the cross-entropy loss function
iii. w₁₁₁ = w₁₁₁ − ηΔw₁₁₁
iv. w₁₁₂ = w₁₁₂ − ηΔw₁₁₂
...
v. w₃₁₃ = w₃₁₃ − ηΔw₃₁₃
c. Till satisfied
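The loop in step 9 can be sketched for a one-weight logistic model with cross-entropy loss (the toy data, learning rate, and iteration budget below are illustrative assumptions):

```python
# Sketch of the update loop in step 9: compute y_hat, compute the
# cross-entropy loss, then update each parameter by gradient descent.
# Toy data, learning rate, and model size are illustrative assumptions.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy binary data: x > 0 maps to class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0   # a. initialise (zeros here for reproducibility)
eta = 0.1         # learning rate

for _ in range(500):            # c. "till satisfied", here a fixed budget
    for x, y in data:           # b. iterate over data
        y_hat = sigmoid(w * x + b)   # i. compute y_hat
        # ii. cross-entropy loss: L = -[y log y_hat + (1-y) log(1-y_hat)];
        # its gradients are dL/dw = (y_hat - y) * x and dL/db = y_hat - y
        w -= eta * (y_hat - y) * x   # iii. w = w - eta * dL/dw
        b -= eta * (y_hat - y)       # iv. b = b - eta * dL/db

print(sigmoid(w * 2.0 + b) > 0.5)   # x = 2 classified as class 1
```

This is the unregularised loop; points 10–13 below show how the update changes once the Ω(θ) term is added.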
10. The derivative of the loss function w.r.t. any weight is ΔWᵢⱼₖ = ∂L(θ)/∂Wᵢⱼₖ
11. In the case of L2 regularisation, that value becomes ΔWᵢⱼₖ = ∂Ltrain(θ)/∂Wᵢⱼₖ + ∂Ω(θ)/∂Wᵢⱼₖ
12. In the derivative of the regularisation term, every weight other than the concerned one drops out, leaving only the derivative of that weight's own square: ∂Ω(θ)/∂Wᵢⱼₖ = 2Wᵢⱼₖ
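The derivative in point 12 can be checked numerically with a finite difference (the toy weight values below are assumptions):

```python
# Numerical check that d(Omega)/dW_ijk = 2 * W_ijk: perturbing one weight
# changes Omega at a rate of ~2 * W_ijk, because the squares of all other
# weights are constants w.r.t. that weight and drop out of the derivative.
def omega(weights):
    return sum(w ** 2 for w in weights)

weights = [0.3, -0.7, 1.5]   # toy weight values (assumed)
eps = 1e-6
k = 2  # index of the weight we differentiate with respect to

bumped = list(weights)
bumped[k] += eps
numeric = (omega(bumped) - omega(weights)) / eps

print(numeric)          # approximately 3.0
print(2 * weights[k])   # analytic value: 2 * 1.5 = 3.0
```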
13. So the new update term is ΔWᵢⱼₖ = ∂Ltrain(θ)/∂Wᵢⱼₖ + 2Wᵢⱼₖ
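Putting steps 10–13 together, here is a minimal gradient-descent loop with the L2 term for a single weight (squared error is used for simplicity, and the 1-D toy data and learning rate are assumptions):

```python
# Gradient descent with L2 regularization on a toy 1-D model y_hat = w * x.
# Each update uses Delta_w = dL_train/dw + dOmega/dw = dL_train/dw + 2w.
# Data, learning rate, and iteration count are illustrative assumptions.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by the "true" weight w = 2

w = 0.0
eta = 0.01             # learning rate

for _ in range(1000):
    # dL_train/dw for squared error sum_i (y_i - w * x_i)^2
    grad_train = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))
    grad_reg = 2 * w   # dOmega/dw = 2w (step 12)
    w -= eta * (grad_train + grad_reg)   # step 13 update

print(round(w, 3))     # 1.867: slightly below 2 because of the penalty
```

The converged weight is 28/15 ≈ 1.867 rather than the unregularised solution w = 2, which illustrates point 7: the penalty trades a small amount of training loss for smaller weights.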
14. In PyTorch, this is handled automatically: passing the weight_decay argument to an optimiser such as torch.optim.SGD adds this L2 penalty term to the weight updates.
