
EC 9170
Deep Learning for Electrical & Computer Engineers

Optimization for training deep models

18th March 2024
Faculty of Engineering, University of Jaffna
What is optimization?
• The process of finding maxima or minima, possibly subject to constraints
• Optimization appears in many contexts of deep learning, but the hardest problem is neural network training
• Solving a single instance can take days to months
• Special techniques have been developed specifically for this case
• Goal: find the parameters θ of a network that reduce a cost function J(θ)


Gradient Descent
Gradient Descent variants
• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent

Difference: the amount of data used per update


• Pass the training set through the hidden layers of the neural network, then update the parameters of the layers by computing the gradients using samples from the training dataset.

• Algorithms that use the whole training set at once are called batch or deterministic.

• If only subsets of the training set are used, the method is called minibatch.

• If the algorithm uses only one example at a time, it is called stochastic or online.

• Most deep learning algorithms fall in between, using more than one but fewer than all of the training examples.


Batch Gradient Descent
• Batch gradient descent works well for convex or relatively smooth error manifolds. In this case, we move fairly directly towards an optimum solution.
• The graph of cost vs. epochs is also quite smooth, because we average over the gradients of all the training data for a single step. The cost keeps decreasing over the epochs.

In batch gradient descent, we consider all the examples for every step. But what if the dataset is huge?
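As an illustration, here is a minimal NumPy sketch of batch gradient descent for linear regression; the toy data, model, learning rate, and epoch count are placeholder choices for the example, not part of the lecture.

import numpy as np

# Toy data: y = 3x + 2 plus noise (placeholder example)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.standard_normal(200)

w, b = 0.0, 0.0   # parameters theta
lr = 0.1          # learning rate alpha

for epoch in range(100):
    y_hat = w * X[:, 0] + b
    error = y_hat - y
    # Gradients of the mean squared error, averaged over ALL examples
    dw = 2 * np.mean(error * X[:, 0])
    db = 2 * np.mean(error)
    w -= lr * dw
    b -= lr * db

Each update uses the entire dataset, which is why the cost curve is smooth but every single step becomes expensive for large datasets.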
Stochastic Gradient Descent
• In Stochastic Gradient Descent (SGD), we consider just one sample at a time to take a single step.
• We do the following steps in one epoch for SGD (see the sketch below):
1. Take a sample
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the samples in the training dataset

• Since we consider just one sample at a time, the cost fluctuates over the training samples and does not necessarily decrease at every step.
• In the long run, however, the cost decreases, with fluctuations.
• It never settles exactly at the minimum, but keeps oscillating around it.
• Can be used for larger datasets.
• Computation can be slower, because updates are made one sample at a time and cannot be vectorized across samples.
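A minimal sketch of one SGD epoch, continuing the toy linear-regression setup from the batch sketch above (X, y, w, b, lr, and rng as defined there):

indices = rng.permutation(len(X))   # shuffle the sample order each epoch
for i in indices:
    xi, yi = X[i, 0], y[i]
    error = w * xi + b - yi
    # Gradient computed from a single example
    dw = 2 * error * xi
    db = 2 * error
    w -= lr * dw
    b -= lr * db

One pass over the data now performs 200 noisy updates instead of one averaged update.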
Mini-Batch Gradient Descent
• We use a batch of a fixed number of training examples, smaller than the actual dataset, and call it a mini-batch.
• After creating the mini-batches of fixed size, we do the following steps in one epoch (see the sketch below):
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the mini-batches we created

• Faster computation: each update is vectorized over a mini-batch, yet far cheaper than a full pass over the dataset.
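A minimal sketch of one mini-batch epoch on the same toy problem (the batch size of 32 is an arbitrary illustrative choice):

batch_size = 32
indices = rng.permutation(len(X))
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    xb, yb = X[batch, 0], y[batch]
    error = w * xb + b - yb
    # Mean gradient over the mini-batch
    dw = 2 * np.mean(error * xb)
    db = 2 * np.mean(error)
    w -= lr * dw
    b -= lr * db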
Challenges
• We want a fast learning rate in the horizontal direction, to move aggressively toward the minimum.
• We want a slow learning rate in the vertical direction, to prevent oscillations.
• If we use a large learning rate, the oscillations can become large and prevent convergence. So this requires a small learning rate, which limits the speed of learning.
Basic Optimization Algorithms
Stochastic Gradient Descent with Momentum
• Solution: compute an exponentially weighted average of the derivatives
• In the vertical direction this damps the oscillations, because their average is close to 0
• In the horizontal direction (no oscillations), all derivatives point in the same direction, so the average stays large

V_dW = β · V_dW + (1 − β) · dW
V_db = β · V_db + (1 − β) · db
W = W − α · V_dW
b = b − α · V_db
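A minimal sketch of this update for a single parameter array; β = 0.9 is a common default, and W, dW are placeholders for a parameter vector and its gradient on the current mini-batch.

import numpy as np

beta, alpha = 0.9, 0.01          # momentum coefficient and learning rate
W = np.zeros(10)                 # placeholder parameter vector
v_dW = np.zeros_like(W)          # exponentially weighted average of gradients

def momentum_step(W, v_dW, dW):
    # One SGD-with-momentum update, following the equations above
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW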
Adaptive learning rate
• The learning rate is one of the hyperparameters that impacts training the most
• The gradient is highly sensitive to some directions and much less so to others
• If we assume that the sensitivity is axis-aligned, it makes sense to use a separate learning rate for each parameter
AdaGrad Algorithm
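For reference, a minimal sketch of the AdaGrad update, which accumulates squared gradients and scales the learning rate separately for each parameter (ε is a small constant for numerical stability; W and dW are placeholders as above):

alpha, eps = 0.01, 1e-8
r = np.zeros_like(W)            # running sum of squared gradients

def adagrad_step(W, r, dW):
    r = r + dW ** 2             # accumulate squared gradients per parameter
    W = W - alpha * dW / (np.sqrt(r) + eps)
    return W, r

Parameters with a history of large gradients get their effective learning rate shrunk, while rarely updated parameters keep a larger one.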
RMSProp Algorithm
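For reference, a minimal sketch of the RMSProp update, which replaces AdaGrad's growing sum with an exponentially decaying average of squared gradients (ρ = 0.9 is a common default; W, dW, and eps as in the sketches above):

alpha, rho = 0.001, 0.9
s = np.zeros_like(W)            # decaying average of squared gradients

def rmsprop_step(W, s, dW):
    s = rho * s + (1 - rho) * dW ** 2
    W = W - alpha * dW / (np.sqrt(s) + eps)
    return W, s

The decaying average prevents the effective learning rate from shrinking toward zero over long training runs, which is AdaGrad's main weakness on non-convex problems.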
Adam Algorithm (original paper: https://arxiv.org/abs/1412.6980)

Adam optimizer
• The name Adam is derived from Adaptive Moment Estimation.

According to the authors:
• Computationally efficient.
• Little memory requirement.
• Well suited for problems that are large in terms of data and/or parameters.
• Appropriate for problems with very noisy and/or sparse gradients.
• Hyper-parameters have intuitive interpretations and typically require little or no tuning.
• Gradient descent uses a constant learning rate (alpha), whereas Adam computes adaptive learning rates.
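A minimal sketch of the Adam update, combining a momentum-style first-moment estimate with an RMSProp-style second-moment estimate plus bias correction; β1 = 0.9, β2 = 0.999, ε = 1e-8 are the defaults suggested in the paper, and W, dW are placeholders as above.

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = np.zeros_like(W)            # first moment (mean of gradients)
v = np.zeros_like(W)            # second moment (mean of squared gradients)

def adam_step(W, m, v, dW, t):
    # t is the 1-based step counter used for bias correction
    m = beta1 * m + (1 - beta1) * dW
    v = beta2 * v + (1 - beta2) * dW ** 2
    m_hat = m / (1 - beta1 ** t)    # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)    # bias-corrected second moment
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v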
Hyperparameter Tuning
• Training NNs can involve setting many hyper-parameters
• The most common hyper-parameters include:
  • Number of layers, and number of neurons per layer
  • Initial learning rate
  • Learning rate decay schedule (e.g., decay constant)
  • Optimizer type
• Other hyper-parameters may include:
  • Regularization parameters (ℓ2 penalty, dropout rate)
  • Batch size
  • Activation functions
  • Loss function
• Hyper-parameter tuning can be time-consuming for larger NNs
Hyperparameter Optimization Techniques
• Grid search
  • Select values for each hyperparameter to test and try all combinations
  • Expensive to evaluate all combinations

• Random search
  • Sample values randomly for every hyperparameter
  • Often preferred to grid search (see the sketch below)

• Bayesian hyper-parameter optimization
  • An active area of research
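For illustration, a minimal sketch of random search over a learning rate and a batch size. The search ranges, trial count, and the train_and_validate function (which should train a model and return a validation score) are hypothetical placeholders.

import math
import random

def random_search(train_and_validate, n_trials=20, seed=0):
    # Randomly sample hyper-parameters and keep the best validation score
    rng = random.Random(seed)
    best_score, best_config = -math.inf, None
    for _ in range(n_trials):
        config = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([32, 64, 128, 256]),
        }
        score = train_and_validate(**config)   # hypothetical user-supplied function
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

Unlike grid search, every trial explores a new value of each hyperparameter, which is why random search tends to cover the important dimensions more efficiently.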
k-Fold Cross-Validation
• Using k-fold cross-validation for hyper-parameter tuning is common when the size of the training data is small.
• It also leads to a better and less noisy estimate of the model performance, by averaging the results across several folds.

• E.g., 5-fold cross-validation (see the figure on the next slide; a code sketch follows this list):
1. Split the training data into 5 equal folds
2. First use folds 2-5 for training and fold 1 for validation
3. Repeat by using fold 2 for validation, then fold 3, fold 4, and fold 5
4. Average the results over the 5 runs (for reporting purposes)
5. Once the best hyper-parameters are determined, evaluate the model on the test data
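A minimal NumPy sketch of the 5-fold loop above; train_model and evaluate are hypothetical placeholders for training a model and returning its validation score.

import numpy as np

def k_fold_score(X, y, train_model, evaluate, k=5, seed=0):
    # Average validation score over k folds (step 4 of the procedure above)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_model(X[train_idx], y[train_idx])          # hypothetical
        scores.append(evaluate(model, X[val_idx], y[val_idx]))   # hypothetical
    return float(np.mean(scores))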
k-Fold Cross-Validation
• Illustration of a 5-fold cross-validation
Batch Normalization
• Batch normalization is one of the most exciting innovations in neural network optimization.
• It is not an optimization algorithm but a method of adaptive reparameterization.
• Motivated by the difficulty of training very deep models.
• Training a deep model involves updating the parameters of each layer using the gradient, under the assumption that the other layers are not changing.
• In practice, all layers are updated simultaneously; the resulting shift in the distribution of each layer's inputs is called internal covariate shift.
• This can cause unexpected results in optimization.
Batch Normalization
• It is very hard to choose an appropriate learning rate, because the effect of an update to the parameters of one layer depends so strongly on all of the other layers.
• Second-order optimization methods try to remedy this by taking second-order effects into account. However, in very deep networks, even higher-order effects are prominent.
• Batch normalization provides an elegant way of reparametrizing almost any deep network.
• It can be applied to any layer, and the reparametrization significantly reduces the problem of coordinating updates across many layers.
Batch Normalization

• This means that the gradient will never propose an operation that acts simply to increase the standard deviation or mean of x_i; the normalization operations remove the effect of such an action and zero out its component in the gradient.

• At test time, μ and σ are replaced by moving averages of the mean and standard deviation collected during training.
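A minimal NumPy sketch of a batch-normalization layer's forward pass, with a learnable scale γ (gamma) and shift β (beta), and with moving averages used at test time; momentum = 0.9 and ε = 1e-5 are common default choices, not values from the lecture.

import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.9, eps=1e-5):
    # x has shape (batch_size, num_features); gamma and beta are learnable
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Update the moving averages that will be used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    out = gamma * x_hat + beta              # reintroduce a learnable scale and shift
    return out, running_mean, running_var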
Batch Normalization
• Improves gradient flow through the network.
• Allows higher learning rates.
• Reduces the strong dependence on initialization.
References
1. Goodfellow, I., Bengio, Y., and Courville, A., “Deep Learning,” MIT Press, 2016.

2. Goodfellow, I. J., Vinyals, O., and Saxe, A. M., “Qualitatively characterizing neural network optimization problems,” International Conference on Learning Representations, 2015.

3. Feurer, M. and Hutter, F., “Hyperparameter Optimization,” in Automated Machine Learning: Methods, Systems, Challenges, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Springer International Publishing, 2019, pp. 3-33. https://link.springer.com/chapter/10.1007/978-3-030-05318-5_1

4. Bergstra, J. and Bengio, Y., “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, vol. 13, pp. 281-305, Feb. 2012. https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.
Thank you!
