Deep Learning Unit 2

UNIT 2

Optimisation and Regularization: Cross Validation, Feature Selection, Regularization, Hyperparameters, Approximate Second-Order Methods, Algorithms with Adaptive Learning Rates, Dropout, Dimension Reduction.

Neural networks can learn to represent complex relationships between network inputs and
outputs. This representational power helps them perform better than traditional machine
learning algorithms in computer vision and natural language processing tasks. However, one of
the challenges associated with training neural networks is overfitting.

When a neural network overfits the training dataset, it learns an overly complex representation
that models the training dataset too well. As a result, it performs exceptionally well on
the training dataset but generalizes poorly to unseen test data.

Regularization techniques help improve a neural network’s generalization ability by reducing overfitting. They do this by minimizing needless complexity and exposing the network to more diverse data. This section covers the following common regularization techniques:

● Early stopping
● L1 and L2 regularization
● Data augmentation
● Dropout

1. Early Stopping

Early stopping is one of the simplest and most intuitive regularization techniques. It involves
stopping the training of the neural network at an earlier epoch; hence the name early
stopping.
As you train the neural network over many epochs, the training error decreases.

If the training error becomes arbitrarily low, approaching zero, the network is very likely to overfit the training dataset. Such a neural network is a high-variance model: despite its near-perfect performance on the training samples, it performs badly on test data it has never seen before.
Therefore, heuristically, if we can prevent the training loss from becoming arbitrarily low, the
model is less likely to overfit the training dataset and will generalize better.

Monitoring the Change in Validation Error

A simple approach is to monitor metrics such as validation error and validation accuracy as the
neural network training proceeds and use them to decide when to stop.

If we find that the validation error is not decreasing significantly, or is increasing, over a window of p epochs, we can stop training. We can also lower the learning rate and train for a few more epochs before stopping.

We can view this in terms of the neural network’s accuracy on the training and validation datasets: stopping early when the validation error starts increasing (or is no longer decreasing) is equivalent to stopping when the validation accuracy starts decreasing.
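The sketch below shows what this patience-based rule might look like in Python. It is a minimal sketch: train_one_epoch, evaluate, and the model's get_weights/set_weights methods are hypothetical stand-ins for whatever training loop and framework you use, and max_epochs, model, and validation_set are assumed to be defined.

```python
# Minimal patience-based early stopping (a sketch; train_one_epoch, evaluate,
# get_weights, and set_weights are hypothetical helpers).
best_val_error = float("inf")
best_weights = None
patience = 5                      # the window p of epochs from the text
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model)
    val_error = evaluate(model, validation_set)
    if val_error < best_val_error:
        best_val_error = val_error
        best_weights = model.get_weights()   # snapshot the best model so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                            # validation error stopped improving

model.set_weights(best_weights)              # restore the best checkpoint
```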
Monitoring the Change in the Weight Vector

Another way to know when to stop is to monitor the change in the weights of the network. Let w_t and w_{t−k} denote the weight vectors at epochs t and t−k, respectively.

We can compute the L2 norm of the difference vector w_t − w_{t−k}, and stop training if this quantity is sufficiently small, say, less than a threshold ε:

||w_t − w_{t−k}||_2 < ε

Certain weights might have changed a lot in the last k epochs, while some weights may have
negligible changes. Therefore, the norm of the resultant difference vector can be small despite
the drastic change in certain components of the weight vector.

A better approach is to compute the change in the individual components of the weight vector. If the maximum change across all components is less than ε, we can conclude that the weights are not changing significantly, so we can stop the training of the neural network:

max_i |w_t^(i) − w_{t−k}^(i)| < ε
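As a concrete sketch, both stopping tests can be computed in a few lines of NumPy, assuming the weights at epochs t and t−k have been flattened into vectors; the threshold value is illustrative:

```python
import numpy as np

def weights_converged(w_t, w_t_minus_k, eps=1e-4):
    """Return both stopping tests from the text for flattened weight vectors."""
    delta = w_t - w_t_minus_k
    norm_test = np.linalg.norm(delta) < eps      # ||w_t - w_{t-k}||_2 < eps
    max_test = np.max(np.abs(delta)) < eps       # max_i |w_t^(i) - w_{t-k}^(i)| < eps
    return norm_test, max_test
```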
2. Data Augmentation
Data augmentation is a regularization technique that helps a neural network generalize better by
exposing it to a more diverse set of training examples. As deep neural networks require a large
training dataset, data augmentation is also helpful when we have insufficient data to train a
neural network.

Let’s take the example of image data augmentation. Suppose we have a dataset with N training
examples across C classes. We can apply certain transformations to these N images to construct
a larger dataset.
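As an illustration, a typical image-augmentation pipeline might look like the following sketch, assuming torchvision is available; the specific transformations and their parameters are illustrative choices, not prescribed by the text:

```python
import torchvision.transforms as T

# Each epoch, every training image is seen as a randomly transformed variant,
# effectively enlarging the N-example dataset.
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                    # mirror half the images
    T.RandomRotation(degrees=15),                     # small random rotations
    T.ColorJitter(brightness=0.2, contrast=0.2),      # lighting variation
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # random crops
    T.ToTensor(),
])
```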

3. L1/L2 Regularization

Lasso regularization (or L1 regularization) is a regularization technique that penalizes high-value, correlated coefficients. It introduces a regularization term (also called a penalty term) into the model’s sum of squared errors (SSE) loss function. This penalty term is the sum of the absolute values of the coefficients, controlled by the hyperparameter lambda (λ), and it can reduce selected feature weights to zero. Lasso regression can thereby remove multicollinear features from the model altogether.

Ridge regularization (or L2 regularization) is a regularization technique that similarly penalizes high-value coefficients by introducing a penalty term into the SSE loss function. It differs from lasso regression, however. First, the penalty term in ridge regression is the sum of the squared coefficients rather than of their absolute values. Second, ridge regression does not perform feature selection. While lasso regression’s penalty term can remove features from the model by shrinking coefficient values to zero, ridge regression only shrinks feature weights towards zero, never exactly to zero.

Elastic net regularization combines ridge and lasso regression by inserting both the L1 and L2 penalty terms into the SSE loss function. The L1 and L2 terms are, respectively, the sum of the absolute values of the feature weights and the sum of their squares. In this way, elastic net addresses multicollinearity while also enabling feature selection.
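A minimal NumPy sketch of these penalties, written as an elastic-net loss that reduces to lasso (l1_ratio=1) or ridge (l1_ratio=0); the function name and the l1_ratio mixing parameter are illustrative conventions, not from the text:

```python
import numpy as np

def regularized_sse(y_true, y_pred, weights, lam, l1_ratio=0.5):
    """SSE loss plus an elastic-net penalty controlled by lambda (lam)."""
    sse = np.sum((y_true - y_pred) ** 2)
    l1 = np.sum(np.abs(weights))     # lasso term: sum of absolute weights
    l2 = np.sum(weights ** 2)        # ridge term: sum of squared weights
    return sse + lam * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
```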

4. Dropout Regularization

(See the original dropout paper [2].)

- Dropout regularization is a computationally cheap way to regularize a deep neural network.
- Dropout works by probabilistically removing, or “dropping out,” inputs to a layer, which
may be input variables in the data sample or activations from a previous layer. It has the
effect of simulating a large number of networks with very different network structure
and, in turn, making nodes in the network generally more robust to the inputs.
- Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
- As a neural network learns, neuron weights settle into their context within the network.
Weights of neurons are tuned for specific features, providing some specialization.
Neighboring neurons come to rely on this specialization, which, if taken too far, can
result in a fragile model too specialized for the training data. This reliance on context for
a neuron during training is referred to as complex co-adaptations.
- It is introduced to address the issue of overfitting in deep neural networks. Overfitting
occurs when a model learns to perform exceptionally well on the training data but fails to
generalize to unseen data. This problem is especially pronounced in deep neural
networks with a large number of parameters, where the model can easily memorize the
training data, leading to poor generalization. Dropout mitigates this by randomly
deactivating a portion of neurons during training, which forces the network to learn
more robust and general features.

Mechanism and Implementation

- During training, dropout operates by randomly "dropping out" units (both hidden and visible): each unit is retained with some probability p, usually around 0.5 for hidden units and closer to 1 for input units. This means that during each forward and backward pass through the network, only a randomly chosen subset of neurons is active.
- Consequently, the network trains on a different architecture each time a training
example is processed. This process can be thought of as training an ensemble of
networks, where each sub-network shares weights with the others.
- At test time, all neurons are used, but their weights are scaled by the retention probability to compensate for the effect of dropout during training.
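A NumPy sketch of this train/test behavior, following the original (non-inverted) formulation described above; p_keep is the retention probability and the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_keep=0.5, train=True):
    """Original dropout: zero out units at train time, scale at test time."""
    if train:
        mask = rng.random(activations.shape) < p_keep  # keep each unit w.p. p_keep
        return activations * mask
    return activations * p_keep                        # test-time scaling
```

Note that modern frameworks usually implement the equivalent "inverted" dropout, which instead divides by p_keep at training time so that no scaling is needed at test time.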

Why use Dropout?

- The primary benefit of dropout is its ability to reduce overfitting by preventing neurons
from becoming too reliant on specific patterns and co-adapting with each other.
- By randomly deactivating neurons, dropout ensures that each neuron contributes
meaningfully to the learning process, leading to the development of more robust and
independent features.
- This process encourages the network to generalize better to new, unseen data.
- Additionally, dropout naturally leads to sparse representations, where only a small
fraction of neurons are highly activated, further contributing to the network's ability to
generalize.

Practical Considerations

- Implementing dropout requires careful consideration of hyperparameters, particularly the dropout probability, learning rate, and network size.
- Dropping a larger fraction of units may require a larger network to maintain the model’s capacity, and it introduces more noise, necessitating adjustments to the learning rate and momentum.
First Order Optimization

Deep learning models are typically trained using first-order optimization methods that rely on
computing the gradient of the objective function with respect to the model parameters.
Some popular first-order optimization methods are:

A. Stochastic Gradient Descent (SGD)

SGD is a widely used optimization algorithm for training deep neural networks. It works by
computing the gradient of the objective function with respect to a mini-batch of training
examples and updating the model parameters in the direction of the negative gradient. The
learning rate determines the step size taken in the direction of the gradient. SGD has been
shown to be effective in practice, but it can be slow to converge and can get stuck in local
minima.

SGD is a variation on gradient descent (also called batch gradient descent). As a review, gradient descent seeks to minimize an objective function J(θ) by iteratively updating each parameter θ by a small amount based on the negative gradient computed over a given data set.

The steps for performing gradient descent are as follows:

Step 1: Select a learning rate α

Step 2: Select initial parameter values θ as the starting point

Step 3: Update all parameters from the gradient of the training data set, i.e. compute θ_{i+1} = θ_i − α × ∇_θ J(θ)

Step 4: Repeat Step 3 until a local minimum is reached

Under batch gradient descent, the gradient ∇_θ J(θ) is calculated at every step against the full data set. When the training data is large, computation may be slow or require large amounts of computer memory.
Stochastic Gradient Descent Algorithm

SGD modifies the batch gradient descent algorithm by calculating the gradient for only one training example at every iteration.[7] The steps for performing SGD are as follows:

Step 1: Randomly shuffle the data set of size m

Step 2: Select a learning rate α

Step 3: Select initial parameter values θ as the starting point

Step 4: Update all parameters from the gradient of a single training example (x_j, y_j), i.e. compute θ_{i+1} = θ_i − α × ∇_θ J(θ; x_j, y_j)

Step 5: Repeat Step 4 until a local minimum is reached

By calculating the gradient on a single training example per iteration, SGD takes a less direct route towards the local minimum. However, SGD has the advantage of being able to incrementally update the objective function J(θ) at minimal cost when new training data become available.
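These steps translate almost directly into code. A NumPy sketch follows, in which grad_fn is a hypothetical callable returning the per-example gradient of J:

```python
import numpy as np

def sgd(grad_fn, theta, X, y, alpha=0.1, epochs=10, seed=0):
    """Plain SGD following Steps 1-5; grad_fn(theta, x_j, y_j) is assumed
    to return the gradient of J for a single training example."""
    rng = np.random.default_rng(seed)
    m = len(y)
    for _ in range(epochs):
        for j in rng.permutation(m):                            # Step 1: shuffle
            theta = theta - alpha * grad_fn(theta, X[j], y[j])  # Step 4: update
    return theta
```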

Learning Rate

The learning rate is used to calculate the step size at every iteration. Too large a learning rate and the steps may overshoot the optimum value; too small a learning rate and many iterations may be required to reach a local minimum. A common starting point for the learning rate is 0.1, adjusted as necessary.

- SGD seeks the direction of steepest descent at each iteration, estimated from a single example. This greatly reduces the time needed to search large data sets and locate local minima. SGD has many applications in machine learning, geophysics, least mean squares (LMS) filtering, and other areas.
B. Adagrad

Adagrad is an adaptive learning rate optimization algorithm that adapts the learning rate for each model parameter based on historical gradient information. This can be useful for sparse datasets where some features are rarely observed. Adagrad has been shown to be effective in practice, but because its accumulated squared gradients only grow, its effective learning rate can decay too quickly, causing it to stop learning before reaching the global minimum.

AdaGrad was introduced by Duchi et al. in a highly cited paper published in the Journal of Machine Learning Research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks), and it influenced the development of the Adam algorithm.

The objective of AdaGrad is to minimize the expected value of a stochastic objective function with respect to a set of parameters, given a sequence of realizations of the function. As with other sub-gradient-based methods, it does so by updating the parameters in the direction opposite the sub-gradients. While standard sub-gradient methods use update rules with step sizes that ignore information from past observations, AdaGrad adapts the learning rate for each parameter individually using the sequence of gradient estimates.

How does Adagrad work?

Traditionally, gradient descent algorithms use a single learning rate for all parameters. This can
be problematic when applied to high-dimensional optimization problems, where some
dimensions require larger updates that others. Adagrad addresses this issue by adapting the
learning rate for each parameter individually.

● The key idea behind Adagrad is to accumulate the sum of squares of past gradients for each parameter and use this information to scale that parameter's learning rate. Mathematically speaking, the update at each iteration is given by:

θ = θ − (η / (√G + ε)) × g

Here θ is the parameter being updated at each iteration, η is the learning rate, G is the sum of squares of past gradients for that parameter, g is the current gradient, and ε is a small constant that prevents division by zero.

This update rule decreases the learning rates of parameters with large accumulated gradients, while parameters with small accumulated gradients keep relatively larger learning rates. This helps improve convergence and prevents oscillations that disturb the optimization process.
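A one-step NumPy sketch of this rule; the function name is illustrative, and G must be initialized to zeros with the same shape as the parameters:

```python
import numpy as np

def adagrad_update(theta, grad, G, eta=0.01, eps=1e-8):
    """One AdaGrad step: G accumulates squared gradients per parameter,
    giving each parameter its own step size eta / (sqrt(G) + eps)."""
    G = G + grad ** 2
    theta = theta - (eta / (np.sqrt(G) + eps)) * grad
    return theta, G
```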
Analysis of First-Order Optimization Methods: Pros and Cons

1) SGD

Pros:

• Fast convergence rate for large-scale datasets.

• Memory-efficient, as only a small batch of data is used in each iteration.

• Easy to implement and tune.

Cons:

• May get stuck in local minima or saddle points.

• Slow convergence rate near the minimum.

• Requires careful tuning of the learning rate and momentum parameters.

2) Adagrad

Pros:

• Adapts the learning rate to each parameter, improving convergence.

• Automatically reduces the learning rate for parameters with large gradients, preventing
divergence.

• Suitable for sparse datasets, as it allows each parameter to have its own learning rate.

Cons:

• Learning rate can decay too quickly, leading to premature convergence.

• Can accumulate too much historical gradient information, resulting in slower convergence in
later iterations.

• May not be suitable for non-convex optimization problems.

3) Adadelta

Pros:

• Adapts the learning rate without the need for an initial learning rate or tuning.

• Reduces the effect of noise and outliers in the gradients.

• Requires less memory than Adagrad, as it only stores a window of past gradients.
Cons:

• Slow convergence rate near the minimum.

• Computationally expensive due to the need to maintain a running average of the gradients.

• May not be suitable for non-convex optimization problems.

To learn more: https://westonjackson.github.io/pdfs/first_order_survey.pdf


Second Order Optimization
In addition to first-order optimization methods, second-order optimization methods are
another important class of optimization algorithms used in deep learning. Second-order
optimization methods involve computing or approximating the Hessian matrix of the objective
function to accelerate convergence and improve accuracy. In this section, we review two popular
second-order optimization methods: Newton’s method and the conjugate gradient method.

A. Newton’s Method

Newton’s method is a classic second-order optimization method that uses the Hessian matrix to
calculate the step size at each iteration. The basic idea of Newton’s method is to approximate the
objective function using a quadratic function, and then minimize this quadratic function to
obtain the next point.

Newton's method is primarily used for optimization: finding the minimum of a loss function. Here's how it works as a second-order approximation technique:

Given a function f(x), the goal is to find the value of x that minimizes (or maximizes) the function. The method uses both the first derivative (gradient) and the second derivative (Hessian) to make a more informed update to the parameter x.

Steps:

1. Start with an Initial Guess:
Choose an initial point x_0 close to where you think the minimum might be.

2. Compute the First-Order Gradient:
The gradient ∇f(x) gives the direction of steepest ascent (or descent). It is the first derivative of the function f(x).

3. Compute the Hessian:
The Hessian H(x) is the matrix of second-order partial derivatives, which provides information about the curvature of the function. For a function f(x), its entries are:

[H(x)]_ij = ∂²f(x) / (∂x_i ∂x_j)

4. Update Rule:
Newton's method uses the gradient and the Hessian to update the parameter x according to

x_{n+1} = x_n − H(x_n)^{-1} ∇f(x_n)

Here, H(x_n)^{-1} is the inverse of the Hessian matrix, and ∇f(x_n) is the gradient at the current point.
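A single Newton step is straightforward to sketch in NumPy; grad_fn and hess_fn are hypothetical callables returning the gradient vector and Hessian matrix of f:

```python
import numpy as np

def newton_step(x, grad_fn, hess_fn):
    """One update x_{n+1} = x_n - H(x_n)^{-1} grad f(x_n); solving the
    linear system avoids explicitly forming the inverse Hessian."""
    return x - np.linalg.solve(hess_fn(x), grad_fn(x))
```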

While Newton’s method can converge to the optimal solution in fewer iterations compared to
first-order methods, it has several drawbacks. One of the main challenges of Newton’s method is
computing or approximating the Hessian matrix, which can be computationally expensive for
large-scale problems. Additionally, the Hessian matrix may not be positive definite, which can
lead to unstable updates and slow convergence.

B. Conjugate Gradient Method

The conjugate gradient method is another popular second-order optimization method that does not require the Hessian matrix to be computed explicitly. Instead, it uses a sequence of conjugate search directions that implicitly account for the curvature of the objective while iterating towards the optimal solution. The update rule of the conjugate gradient method can be expressed as:

w_{t+1} = w_t + α_t d_t

where α_t is the step size, and d_t is the conjugate direction. The conjugate direction d_t is calculated as a linear combination of the negative gradient and the previous conjugate direction:

d_t = −∇f(w_t) + β_t d_{t−1}

where β_t is the conjugacy coefficient.
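As a sketch, the two update equations above can be combined into a simple nonlinear conjugate gradient loop. Here β_t is computed with the Fletcher-Reeves formula (one common choice of conjugacy coefficient), a fixed step size stands in for the line search used in practice, and grad_fn is a hypothetical gradient callable:

```python
import numpy as np

def nonlinear_cg(grad_fn, w, alpha=0.01, iters=100):
    g = grad_fn(w)
    d = -g                                  # first direction: steepest descent
    for _ in range(iters):
        w = w + alpha * d                   # w_{t+1} = w_t + alpha_t * d_t
        g_new = grad_fn(w)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves conjugacy coefficient
        d = -g_new + beta * d               # d_t = -grad f(w_t) + beta_t * d_{t-1}
        g = g_new
    return w
```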

The conjugate gradient method can converge faster than first-order methods and is
computationally more efficient than Newton’s method. However, the convergence of the
conjugate gradient method depends on the conditioning of the Hessian matrix, and it may
perform poorly on ill-conditioned problems.
CHALLENGES AND TECHNIQUES IN DEEP LEARNING OPTIMIZATION

Optimization in deep learning is often challenging due to the high dimensionality of the
parameter space, complex nonlinear functions, and the presence of many local optima.

A. Vanishing and Exploding Gradients

One of the most significant challenges in deep learning optimization is the vanishing and
exploding gradient problem. When training deep neural networks, the gradients of the loss
function with respect to the parameters can become very small or very large as they propagate
through the network. This can make it difficult to optimize the network and can lead
to slow convergence or divergence. To address this problem, various techniques have been proposed. Batch normalization and layer normalization help to address the vanishing and exploding gradient problem by normalizing the inputs to each layer.

B. Optimization Algorithms for Deep Learning

A variety of optimization algorithms have been proposed for deep learning, including first-order
methods, second-order methods, and adaptive methods.
- First-order methods, such as Stochastic Gradient Descent (SGD), Adagrad, Adadelta, and
RMSprop, are simple and computationally efficient.
- Second-order methods, such as Newton’s method and the conjugate gradient method,
can converge faster than first-order methods, but are more computationally expensive.
- Adaptive methods, such as Adam and AMSGrad, adjust the learning rate for each
parameter based on their past gradients.
- Momentum-based optimization methods, such as Nesterov accelerated gradient (NAG),
Adam, and Nadam, can help to accelerate convergence and overcome the saddle point
problem.
- Adaptive gradient methods, such as AdaMax and AMSGrad, can adaptively adjust the
learning rate for each parameter based on the moving average of the gradients.
C. Regularization Techniques

Regularization techniques are used to prevent overfitting and improve the generalization
performance of deep neural networks. Some of the most commonly used regularization
techniques include L1 and L2 regularization, dropout, and early stopping.
L1 and L2 regularization can help to prevent overfitting by adding a penalty term to the loss
function that encourages the parameters to be small. Dropout can help to prevent overfitting
by randomly dropping out some of the neurons during training. Early stopping can help to
prevent overfitting by stopping the training process when the validation error starts to increase.

Cross Validation
References

[1] http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
[2] https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
[3] https://cedar.buffalo.edu/~srihari/CSE676/8.5%20AdaptiveLearning.pdf
[4] https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
[5] https://builtin.com/machine-learning/adam-optimization
[6] https://optmlclass.github.io/notes/notes6_adaptive1.pdf
[7] https://www.geeksforgeeks.org/first-order-algorithms-in-machine-learning/#1-deterministic-firstorder-algorithms
[8] https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent
