Deep Learning Unit 2
Neural networks can learn to represent complex relationships between network inputs and
outputs. This representational power helps them perform better than traditional machine
learning algorithms in computer vision and natural language processing tasks. However, one of
the challenges associated with training neural networks is overfitting.
When a neural network overfits the training dataset, it learns an overly complex representation
that models the training dataset too well. As a result, it performs exceptionally well on
the training dataset but generalizes poorly to unseen test data. Common regularization techniques for mitigating overfitting include:
● Early stopping
● L1 and L2 regularization
● Data augmentation
● Dropout
1. Early Stopping
Early stopping is one of the simplest and most intuitive regularization techniques. It involves
stopping the training of the neural network at an earlier epoch; hence the name early
stopping.
As you train the neural network over many epochs, the training error decreases.
If the training error becomes arbitrarily low, approaching zero, the network is very likely to overfit the training dataset. Such a neural network is a high-variance model that performs badly on test data it has never seen before, despite its near-perfect performance on the training samples.
Therefore, heuristically, if we can prevent the training loss from becoming arbitrarily low, the
model is less likely to overfit the training dataset and will generalize better.
A simple approach is to monitor metrics such as validation error and validation accuracy as the
neural network training proceeds and use them to decide when to stop.
If we find that the validation error is not decreasing significantly or is increasing over a window
of epochs, say p epochs, we can stop training. Alternatively, we can lower the learning rate and train for a few more epochs before stopping.
We can also frame the stopping criterion in terms of the neural network’s accuracy on the training and validation datasets: stopping early when the validation error starts increasing (or is no longer decreasing) is equivalent to stopping when the validation accuracy stops improving.
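As a rough illustration (not part of the original notes), the following Python sketch shows early stopping with a patience of p epochs; train_one_epoch and evaluate are assumed helper functions that perform one training pass and return the validation loss, respectively:

import copy

def train_with_early_stopping(model, train_data, val_data, max_epochs=100, patience=5):
    # Stop when the validation loss has not improved for `patience` consecutive epochs.
    best_val_loss = float("inf")
    best_model = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # assumed helper: one pass over the training data
        val_loss = evaluate(model, val_data)     # assumed helper: returns the validation loss
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model = copy.deepcopy(model)    # remember the best model seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # validation loss stopped improving
    return best_model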
Monitoring the Change in the Weight Vector
Another way to know when to stop is to monitor the change in the weights of the network. Let
\( w_t \) and \( w_{t-k} \) denote the weight vectors at epochs \( t \) and \( t-k \), respectively.
We can compute the L2 norm of the difference vector \( w_t - w_{t-k} \) and stop training if this quantity is sufficiently small, say, less than \( \epsilon \):
\( \|w_t - w_{t-k}\|_2 < \epsilon \)
Certain weights might have changed a lot in the last k epochs, while some weights may have
negligible changes. Therefore, the norm of the resultant difference vector can be small despite
the drastic change in certain components of the weight vector.
Therefore, a stricter criterion is to require that the maximum change across all components of the weight vector is small:
\( \max_i |w_t^{(i)} - w_{t-k}^{(i)}| < \epsilon \)
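A minimal NumPy sketch of both stopping criteria, assuming the weights at epochs t and t−k are available as flat vectors:

import numpy as np

def weights_converged(w_t, w_t_minus_k, eps=1e-3):
    # Check both the L2-norm change and the maximum per-component change.
    diff = w_t - w_t_minus_k
    small_norm = np.linalg.norm(diff) < eps      # overall change in the weight vector is small
    small_max = np.max(np.abs(diff)) < eps       # no single weight changed by much
    return small_norm and small_max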
2. Data Augmentation
Data augmentation is a regularization technique that helps a neural network generalize better by
exposing it to a more diverse set of training examples. As deep neural networks require a large
training dataset, data augmentation is also helpful when we have insufficient data to train a
neural network.
Let’s take the example of image data augmentation. Suppose we have a dataset with N training
examples across C classes. We can apply label-preserving transformations, such as flips, rotations, crops, and small amounts of noise, to these N images to construct a larger dataset.
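For illustration only (a sketch, not part of the original notes), a few such label-preserving transformations can be written directly in NumPy for an H x W x C image array; each transformed copy keeps the class label of the original image:

import numpy as np

def augment(image):
    # Return simple augmented variants of an image (assumed pixel range 0-255).
    flipped = image[:, ::-1, :]                              # horizontal flip
    rotated = np.rot90(image, k=1, axes=(0, 1))              # 90-degree rotation
    noisy = np.clip(image + np.random.normal(0, 5, image.shape), 0, 255)  # small additive noise
    return [flipped, rotated, noisy]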
3. L1/L2 Regularization
L1 (lasso) regularization penalizes the sum of the absolute values of the feature weights, while L2 (ridge) regularization penalizes the sum of the squared feature weights. Elastic net regularization combines the two by inserting both the L1 and L2 penalty terms into the loss function, here the sum of squared errors (SSE). In this way, elastic net addresses multicollinearity while also enabling feature selection.
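As a hedged sketch (names such as lam1 and lam2 are illustrative), the penalties can be added to an SSE loss as follows:

import numpy as np

def regularized_sse(y_true, y_pred, w, lam1=0.01, lam2=0.01):
    # SSE loss with an elastic net penalty: lam1 * ||w||_1 + lam2 * ||w||_2^2.
    sse = np.sum((y_true - y_pred) ** 2)     # data-fit term
    l1_penalty = lam1 * np.sum(np.abs(w))    # lasso term: sum of absolute weights
    l2_penalty = lam2 * np.sum(w ** 2)       # ridge term: sum of squared weights
    return sse + l1_penalty + l2_penalty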
4. Dropout Regularization
Read the original dropout paper [2].
- During training, dropout operates by randomly "dropping out" units (both hidden and
visible); the probability of retaining a unit is usually set to around 0.5 for hidden units and
closer to 1 for input units. This means that during each forward and backward pass through the
network, only a randomly chosen subset of neurons is active.
- Consequently, the network trains on a different architecture each time a training
example is processed. This process can be thought of as training an ensemble of
networks, where each sub-network shares weights with the others.
- At test time, all neurons are used, but their outgoing weights are scaled by the retention
probability to compensate for the effect of dropout during training.
- The primary benefit of dropout is its ability to reduce overfitting by preventing neurons
from becoming too reliant on specific patterns and co-adapting with each other.
- By randomly deactivating neurons, dropout ensures that each neuron contributes
meaningfully to the learning process, leading to the development of more robust and
independent features.
- This process encourages the network to generalize better to new, unseen data.
- Additionally, dropout naturally leads to sparse representations, where only a small
fraction of neurons are highly activated, further contributing to the network's ability to
generalize.
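The following NumPy sketch (an illustration, not code from the original paper) implements the common "inverted" dropout variant, which scales the surviving activations by 1/p at training time; this is equivalent in expectation to scaling the weights by the retention probability p at test time, as described above:

import numpy as np

def dropout_forward(activations, p_keep=0.5, training=True):
    # Randomly zero each unit with probability 1 - p_keep during training.
    if not training:
        return activations                                   # all units active at test time
    mask = (np.random.rand(*activations.shape) < p_keep) / p_keep
    return activations * mask                                # dropped units output zero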
Practical Considerations
Deep learning models are typically trained using first-order optimization methods that rely on
computing the gradient of the objective function with respect to the model parameters.
Some popular first-order optimization methods are SGD, Adagrad, and Adadelta.
A. Stochastic Gradient Descent (SGD)
SGD is a widely used optimization algorithm for training deep neural networks. It works by
computing the gradient of the objective function with respect to a mini-batch of training
examples and updating the model parameters in the direction of the negative gradient. The
learning rate determines the step size taken in the direction of the gradient. SGD has been
shown to be effective in practice, but it can be slow to converge and can get stuck in local
minima.
SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function J(θ) by iteratively updating each parameter θ by a small amount based on the negative gradient computed over a given dataset.
Under batch gradient descent, the gradient ∇θJ(θ) is calculated at every step against the full dataset. When the training data is large, computation may be slow or require large amounts of computer memory.
Stochastic Gradient Descent Algorithm
SGD modifies the batch gradient descent algorithm by calculating the gradient for only one training example at every iteration.[7] The steps for performing SGD are as follows:
1. Initialize the parameters θ and choose a learning rate η.
2. Randomly shuffle the training examples.
3. For each training example, compute the gradient of J(θ) on that example and update θ in the direction of the negative gradient, scaled by η.
4. Repeat steps 2 and 3 until the objective stops improving or a maximum number of epochs is reached.
By calculating the gradient for one training example per iteration, SGD takes a less direct route towards the local minimum. However, SGD has the advantage of being able to incrementally update the objective function J(θ) at minimal cost when new training data becomes available.
Learning Rate
The learning rate determines the step size at every iteration. If the learning rate is too large, the steps may overshoot the optimum; if it is too small, many iterations may be needed to reach a local minimum. A good starting point for the learning rate is 0.1, adjusted as necessary.
- SGD takes a step in the direction of steepest descent at each iteration. This greatly reduces the time it takes to search large datasets for local minima. SGD has many applications in machine learning, geophysics, least mean squares (LMS) filtering, and other areas.
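A minimal NumPy sketch of SGD for linear regression with a squared-error objective, using the learning rate of 0.1 suggested above (an illustration, not a reference implementation):

import numpy as np

def sgd(X, y, lr=0.1, epochs=10):
    # Update the parameters using one training example per iteration.
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):              # shuffle the training examples
            error = X[i] @ theta - y[i]
            grad = 2 * error * X[i]                          # gradient of the squared error on one example
            theta -= lr * grad                               # step in the direction of the negative gradient
    return theta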
B. Adagrad
Adagrad is an adaptive learning rate optimization algorithm that adapts the learning rate for
each model parameter based on the historical gradient information. This can be useful for
sparse datasets where some features are rarely observed. Adagrad has been shown to be effective in practice, but its accumulated gradient history can shrink the effective learning rate so much that it stops learning before reaching a good minimum.
AdaGrad was introduced by Duchi et al. in a highly cited paper published in the
Journal of Machine Learning Research in 2011. It is arguably one of the most
popular algorithms for machine learning (particularly for training deep neural
networks) and it influenced the development of the Adam algorithm.
The objective of AdaGrad is to minimize the expected value of a stochastic objective function,
with respect to a set of parameters, given a sequence of realizations of the function. As with
other sub-gradient-based methods, it does so by updating the parameters in the opposite
direction of the sub-gradients. While standard sub-gradient methods use update rules with
step-sizes that ignore the information from the past observations, AdaGrad adapts the learning
rate for each parameter individually using the sequence of gradient estimates.
Traditionally, gradient descent algorithms use a single learning rate for all parameters. This can
be problematic when applied to high-dimensional optimization problems, where some
dimensions require larger updates than others. Adagrad addresses this issue by adapting the
learning rate for each parameter individually.
● The key idea behind Adagrad is to accumulate the sum of squares of past gradients for each parameter and use this information to scale the learning rate for future updates of that parameter.
Mathematically speaking, the update at each iteration is given by:
θ = θ - (η / √G) * g
Here θ is the parameter being updated, η is the learning rate, G is the sum of squares of past gradients for that parameter, and g is the current gradient. In practice, a small constant ε is added to √G for numerical stability.
This update rule shrinks the learning rate for parameters with large accumulated gradients, while parameters with small accumulated gradients keep relatively larger effective learning rates. This helps improve convergence and prevents oscillations that disturb the optimization process.
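The per-parameter scaling can be sketched in a few lines of NumPy, assuming grad_fn returns the gradient of the objective at theta and eps is a small constant for numerical stability:

import numpy as np

def adagrad(grad_fn, theta, lr=0.01, steps=100, eps=1e-8):
    # Adagrad: divide the learning rate by the root of the accumulated squared gradients.
    G = np.zeros_like(theta)                                 # running sum of squared gradients, per parameter
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2                                          # accumulate gradient history
        theta = theta - lr * g / (np.sqrt(G) + eps)          # larger history => smaller effective step
    return theta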
Analysis of First-Order Optimization Methods: Pros and Cons
1) SGD
Pros:
• Simple and computationally efficient.
• Can incrementally incorporate new training data at minimal cost.
Cons:
• Can be slow to converge.
• Can get stuck in local minima.
2) Adagrad
Pros:
• Automatically reduces the learning rate for parameters with large gradients, preventing
divergence.
• Suitable for sparse datasets, as it allows each parameter to have its own learning rate.
Cons:
• Can accumulate too much historical gradient information, resulting in slower convergence in
later iterations.
3) Adadelta
Pros:
• Adapts the learning rate without the need for an initial learning rate or tuning.
• Requires less memory than Adagrad, as it only stores a window of past gradients.
Cons:
• Can still converge slowly on some problems.
Second-Order Optimization Methods
A. Newton’s Method
Newton’s method is a classic second-order optimization method that uses the Hessian matrix to
calculate the step size at each iteration. The basic idea of Newton’s method is to approximate the
objective function using a quadratic function, and then minimize this quadratic function to
obtain the next point.
In deep learning, it is primarily used for optimization, that is, finding the minimum of a loss function. Here is how it works as a second-order approximation technique:
Given a function \( f(x) \), the goal is to find the value of \( x \) that minimizes (or maximizes)
the function. The method uses both the first derivative (gradient) and the second derivative
(Hessian) to make a more informed update to the parameter \( x \).
Steps: starting from an initial point \( x_0 \), repeatedly apply the update
\( x_{n+1} = x_n - H(x_n)^{-1} \nabla f(x_n) \)
until convergence. Here, \( H(x_n)^{-1} \) is the inverse of the Hessian matrix, and \( \nabla f(x_n) \) is the gradient at the current point.
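A small NumPy sketch of this update, assuming grad_fn and hess_fn return the gradient vector and Hessian matrix at a point:

import numpy as np

def newton_minimize(grad_fn, hess_fn, x0, steps=20):
    # Repeatedly apply x <- x - H(x)^{-1} * grad f(x).
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        step = np.linalg.solve(hess_fn(x), grad_fn(x))       # solve H * step = grad rather than inverting H
        x = x - step
    return x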
While Newton’s method can converge to the optimal solution in fewer iterations compared to
first-order methods, it has several drawbacks. One of the main challenges of Newton’s method is
computing or approximating the Hessian matrix, which can be computationally expensive for
large-scale problems. Additionally, the Hessian matrix may not be positive definite, which can
lead to unstable updates and slow convergence.
B. Conjugate Gradient Method
The conjugate gradient method is another popular second-order optimization method that does
not require the Hessian matrix to be computed explicitly. Instead, it uses a sequence of
conjugate directions to iteratively approximate the Hessian matrix and find the optimal solution.
The update rule of the conjugate gradient method can be expressed as:
\( w_{t+1} = w_t + \alpha_t d_t \)
where \( \alpha_t \) is the step size and \( d_t \) is the conjugate direction. The conjugate direction \( d_t \) is calculated as a linear combination of the negative gradient and the previous conjugate direction:
\( d_t = -\nabla J(w_t) + \beta_t d_{t-1} \)
where \( \beta_t \) is a coefficient computed from the current and previous gradients (for example, by the Fletcher-Reeves formula).
The conjugate gradient method can converge faster than first-order methods and is
computationally more efficient than Newton’s method. However, the convergence of the
conjugate gradient method depends on the conditioning of the Hessian matrix, and it may
perform poorly on ill-conditioned problems.
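A hedged sketch of the nonlinear conjugate gradient update with the Fletcher-Reeves coefficient, assuming grad_fn returns the gradient and using a fixed step size alpha for simplicity (in practice a line search would choose it):

import numpy as np

def conjugate_gradient(grad_fn, w0, alpha=0.01, steps=50):
    # Build each search direction from the negative gradient and the previous direction.
    w = np.asarray(w0, dtype=float)
    g = grad_fn(w)
    d = -g                                                   # first direction: steepest descent
    for _ in range(steps):
        w = w + alpha * d                                    # w_{t+1} = w_t + alpha_t * d_t
        g_new = grad_fn(w)
        beta = (g_new @ g_new) / (g @ g + 1e-12)             # Fletcher-Reeves coefficient
        d = -g_new + beta * d                                # combine negative gradient and previous direction
        g = g_new
    return w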
CHALLENGES AND TECHNIQUES IN DEEP
LEARNING OPTIMIZATION
Optimization in deep learning is often challenging due to the high dimensionality of the
parameter space, complex nonlinear functions, and the presence of many local optima.
A. Vanishing and Exploding Gradients
One of the most significant challenges in deep learning optimization is the vanishing and
exploding gradient problem. When training deep neural networks, the gradients of the loss
function with respect to the parameters can become very small or very large as they propagate
through the network. This can make it difficult to optimize the network and can lead
to slow convergence or divergence. To address this problem, various techniques have been
proposed. Batch normalization and layer normalization, for example, help to address the
vanishing and exploding gradient problem by normalizing the inputs to each layer.
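For example, layer normalization can be sketched in NumPy as standardizing each example's activations before they are passed to the next layer (gamma and beta are learnable scale and shift parameters):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (one example's activations) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)                  # keeps activations in a stable range
    return gamma * x_hat + beta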
B. Optimization Algorithms
A variety of optimization algorithms have been proposed for deep learning, including first-order
methods, second-order methods, and adaptive methods.
- First-order methods, such as Stochastic Gradient Descent (SGD), Adagrad, Adadelta, and
RMSprop, are simple and computationally efficient.
- Second-order methods, such as Newton’s method and the conjugate gradient method,
can converge faster than first-order methods, but are more computationally expensive.
- Adaptive methods, such as Adam and AMSGrad, adjust the learning rate for each
parameter based on their past gradients (a sketch of the Adam update appears after this list).
- Momentum-based optimization methods, such as Nesterov accelerated gradient (NAG),
Adam, and Nadam, can help to accelerate convergence and overcome the saddle point
problem.
- Adaptive gradient methods, such as AdaMax and AMSGrad, can adaptively adjust the
learning rate for each parameter based on the moving average of the gradients.
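As a sketch of how such adaptive methods work, the standard Adam update can be written in NumPy as follows, assuming grad_fn returns the gradient of the objective at theta:

import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    # Adam: per-parameter step sizes from moving averages of the gradient and its square.
    m = np.zeros_like(theta)                                 # first-moment (mean) estimate
    v = np.zeros_like(theta)                                 # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                         # bias correction for the moving averages
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta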
C. Regularization Techniques
Regularization techniques are used to prevent overfitting and improve the generalization
performance of deep neural networks. Some of the most commonly used regularization
techniques include L1 and L2 regularization, dropout, and early stopping.
L1 and L2 regularization can help to prevent overfitting by adding a penalty term to the loss
function that encourages the parameters to be small. Dropout can help to prevent overfitting
by randomly dropping out some of the neurons during training. Early stopping can help to
prevent overfitting by stopping the training process when the validation error starts to increase.
Cross Validation
References
[1] http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
[2] https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
[3] https://cedar.buffalo.edu/~srihari/CSE676/8.5%20AdaptiveLearning.pdf
[4] https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
[5] https://builtin.com/machine-learning/adam-optimization
[6] https://optmlclass.github.io/notes/notes6_adaptive1.pdf
[7] https://www.geeksforgeeks.org/first-order-algorithms-in-machine-learning/#1-deterministic-firstorder-algorithms
[8] https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent