cst414-deep learning module 2
Where L_total is the total loss, L_original is the original loss, w_i are the
weights and λ is the regularization parameter that controls the penalty.
2) L1 Regularization (Lasso Regularization): It adds a penalty term to the
loss function based on the absolute values of the weights. It has a tendency
to drive some weights to exactly zero, effectively performing feature
selection.
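A minimal NumPy sketch of the two penalty terms, assuming a generic loss value and weight vector (the names and numbers here are illustrative, not from any particular library):

import numpy as np

def l2_penalty(w, lam):
    # lambda * sum of squared weights (one common form of the L2 term).
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    # lambda * sum of absolute weights; tends to push some weights to exactly zero.
    return lam * np.sum(np.abs(w))

w = np.array([0.5, -1.2, 0.0, 2.0])           # toy weight vector
L_original = 0.35                             # stand-in for the original loss
print(L_original + l2_penalty(w, lam=0.01))   # L_total with the L2 penalty
print(L_original + l1_penalty(w, lam=0.01))   # L_total with the L1 penalty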
3) Dropout: At each training step, a fraction of the neurons in the network is
randomly ignored (dropped out). The remaining neurons must learn to work
together and generalize better.
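A small sketch of dropout applied to one layer's activations, using the common "inverted dropout" formulation (the scaling by 1/(1-p) keeps the expected activation unchanged; the array here is just an example):

import numpy as np

def dropout(activations, p=0.5, training=True):
    # At training time, randomly zero out a fraction p of the activations.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) > p).astype(activations.dtype)
    # Inverted dropout: rescale the surviving activations to keep the expected value.
    return activations * mask / (1.0 - p)

a = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(a, p=0.5))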
4) Early Stopping: The training process is stopped when the validation loss
starts increasing, indicating that the model is overfitting to the training data.
This helps prevent the model from memorizing the training data.
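A sketch of the stopping rule only, assuming the validation loss after each epoch has been recorded in a list; the "patience" counter (how many non-improving epochs to tolerate) is an added assumption, not something stated above:

def early_stopping_epoch(val_losses, patience=3):
    # Return the epoch at which training would stop, or None if it never triggers.
    best = float("inf")
    bad_epochs = 0
    for epoch, val in enumerate(val_losses):
        if val < best:
            best = val
            bad_epochs = 0          # validation improved, reset the counter
        else:
            bad_epochs += 1         # validation did not improve
            if bad_epochs >= patience:
                return epoch        # stop: the model has started to overfit
    return None

# Validation loss falls, then rises as the model starts to overfit.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8], patience=3))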
5) Data Augmentation: It is the process of artificially increasing the size of
the training set by applying random transformations (e.g., rotation, scaling,
flipping) to the training data. For tasks like image classification, these
transformations (rotations, flips, crops) are applied to the original images
to create new training samples.
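A dependency-free NumPy sketch of two simple augmentations (a random horizontal flip and a random 90° rotation); real pipelines usually rely on library transforms, and the image here is just a stand-in array:

import numpy as np

def augment(image, rng):
    # Randomly flip the image left-right with probability 0.5.
    if rng.random() < 0.5:
        image = np.fliplr(image)
    # Randomly rotate by 0, 90, 180 or 270 degrees.
    return np.rot90(image, k=rng.integers(0, 4))

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)   # stand-in for a small grayscale image
print(augment(img, rng))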
Optimization in Deep Learning
●
Optimization in deep learning refers to the process of adjusting the parameters
(weights and biases) of a neural network in order to minimize a loss function.
1) Gradient Descent (GD):
• It is used to find a local minimum of a differentiable function. In this
method the weights are initialized using some initialization strategy and are
updated in each epoch according to the update equation.
• The core idea is to update the parameters in the direction opposite to the
gradient of the function, thus progressively reducing the value of the function.
The size of the step taken in each iteration is controlled by a parameter called
the learning rate (η).
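A minimal sketch of the update rule w ← w − η·∇f(w) on a toy function f(w) = (w − 3)², whose gradient is known in closed form; the function and values are illustrative only:

def gradient(w):
    # Gradient of f(w) = (w - 3)^2, so the minimum is at w = 3.
    return 2.0 * (w - 3.0)

w = 0.0            # some initialization strategy
eta = 0.1          # learning rate
for epoch in range(100):
    w = w - eta * gradient(w)   # step in the direction opposite to the gradient
print(w)           # converges towards 3.0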
2) Stochastic Gradient Descent (SGD):
●
It updates the model parameters using only one data point at a time, as
opposed to the entire dataset.
●
But this approach may lead to noisier updates because it processes one
observation at a time.
●
Mini-Batch SGD: It is a cross-over between GD and SGD. Here the entire
dataset is divided into small batches and the gradient is computed for each
batch.
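A sketch of mini-batch SGD on a toy linear-regression problem; setting batch_size = 1 gives plain SGD and batch_size = len(X) gives full-batch GD (the data and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # toy targets

w = np.zeros(3)
eta, batch_size = 0.1, 32                     # batch_size=1 -> SGD, 200 -> full GD
for epoch in range(50):
    idx = rng.permutation(len(X))             # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
        w -= eta * grad
print(w)                                      # approaches [1.0, -2.0, 0.5]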
3) GD with momentum:
●
Even with Mini-Batch SGD the updates are still noisy and convergence can take
a long time.
●
SGD with momentum works with the concept of the Exponentially
Weighted Moving Average (EWMA).
●
Here the most recent values are given more importance and earlier values are
given less importance.
●
Here βV_{t-1} is the momentum term; it reduces the noise while trying to
reach the global minimum.
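A sketch of the momentum update on the same toy quadratic as in the gradient-descent example, using one common formulation V_t = β·V_{t-1} + gradient and w ← w − η·V_t:

def gradient(w):
    # Gradient of f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9
for step in range(100):
    v = beta * v + gradient(w)   # beta*v is the momentum term (EWMA-style accumulation)
    w = w - eta * v              # move along the smoothed, less noisy direction
print(w)                         # approaches 3.0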
4) Nesterov Accelerated GD
●
In gradient descent with momentum, the actual movement is large due
to the added momentum.
●
This added momentum causes different types of problems. We may cross
the actual minimum point and then have to come back to reach it.
●
That means SGD with momentum oscillates around the minimum
point.
●
To reduce this problem we can use Nesterov Accelerated GD (NAG).
●
In NAG we calculate the gradient at a look-ahead point and then use
it to update the weights.
●
In momentum-based GD, the weight update was based on
ΔW = history of velocity + gradient at the current point
●
But in NAG the only difference is that
ΔW = history of velocity + gradient at a look-ahead point
●
Based on this it decides whether to move forward or backward to
reach the minimum.
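A sketch of NAG on the same toy quadratic; compared with the momentum version, the only change is that the gradient is evaluated at the look-ahead point w − η·β·V_{t-1} (one common formulation):

def gradient(w):
    # Gradient of f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9
for step in range(100):
    look_ahead = w - eta * beta * v       # where momentum alone would take us
    v = beta * v + gradient(look_ahead)   # gradient at the look-ahead point
    w = w - eta * v
print(w)                                  # approaches 3.0 with less overshoot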
5) Adagrad Optimizer
●
In SGD and Mini-Batch SGD the learning rate was the same for every
weight (parameter).
●
But in Adagrad Optimizer the learning rate gets modified based on
how frequently a parameter gets updated during training.
●
This method is adopted when the dataset has features of various
dimensions. For example, if it has both dense and sparse features,
applying the same learning rate to both types of features makes
optimization difficult.
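A sketch of the Adagrad update on a two-parameter toy problem: each parameter accumulates its own sum of squared gradients, so frequently updated parameters get a smaller effective learning rate than rarely updated ones (the form shown is one common variant):

import numpy as np

def gradient(w):
    # Gradient of f(w) = sum((w - target)^2) for a 2-parameter toy problem.
    return 2.0 * (w - np.array([3.0, -1.0]))

w = np.zeros(2)
G = np.zeros(2)                        # per-parameter sum of squared gradients
eta, eps = 0.5, 1e-8
for step in range(500):
    g = gradient(w)
    G += g ** 2                        # accumulate squared gradients per parameter
    w -= eta * g / (np.sqrt(G) + eps)  # each parameter gets its own effective learning rate
print(w)                               # approaches [3.0, -1.0]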
6) RMSProp
●
It is an extension of the popular Adaptive Gradient Algorithm (Adagrad).
Instead of Adagrad's running sum of squared gradients, it keeps an
exponentially weighted moving average of them, so the learning rate does
not decay too aggressively and training requires less effort to converge.
●
RMSProp is able to smoothly adjust the learning rate for each of the
parameters in the network, providing a better performance than
regular Gradient Descent.
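A sketch of the RMSProp update on the same two-parameter toy problem: the running sum of squared gradients from Adagrad is replaced by an exponentially weighted moving average, so the per-parameter learning rate adapts without shrinking towards zero (again one common formulation):

import numpy as np

def gradient(w):
    # Gradient of f(w) = sum((w - target)^2) for a 2-parameter toy problem.
    return 2.0 * (w - np.array([3.0, -1.0]))

w = np.zeros(2)
s = np.zeros(2)                        # EWMA of squared gradients
eta, beta, eps = 0.1, 0.9, 1e-8
for step in range(500):
    g = gradient(w)
    s = beta * s + (1 - beta) * g ** 2   # moving average instead of a running sum
    w -= eta * g / (np.sqrt(s) + eps)    # per-parameter adaptive learning rate
print(w)                               # ends up close to [3.0, -1.0]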