Lecture_2
• Optimization is involved in many contexts of deep learning, but the hardest case is
neural network training.
• It is common to spend days to months solving a single instance.
• Special techniques have been developed specifically for this case
• Algorithms that use the whole training set are called batch or
deterministic.
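As a point of comparison, here is a minimal NumPy sketch of batch (deterministic) gradient
descent on a toy linear-regression problem; the data, learning rate, and number of steps are
illustrative assumptions, not part of the lecture.

import numpy as np

# Toy data (assumed for illustration): y ≈ 3*x plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 0.1 * rng.normal(size=100)

w = 0.0    # single weight to learn
lr = 0.1   # learning rate (assumed)

for step in range(100):
    # Gradient of the mean squared error over the WHOLE training set
    grad = np.mean(2 * (w * X - y) * X)
    w -= lr * grad   # one deterministic update per full pass
print(f"learned w ≈ {w:.3f}")   # approaches 3.0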
Stochastic Gradient Descent
• Since we consider just one sample at a time, the cost will fluctuate over the
training samples and will not necessarily decrease at every step.
• In the long run, however, the cost will decrease, with fluctuations.
• Also, it will never reach the minimum exactly; it will keep oscillating around it.
• Can be used for larger datasets.
• Can slow down the computations.
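A minimal sketch of stochastic gradient descent on the same kind of toy problem, assuming
one-sample updates and an illustrative learning rate; the per-sample cost fluctuates, but the
weight still drifts toward the minimum and then keeps moving around it.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=1000)
y = 3.0 * X + 0.1 * rng.normal(size=1000)

w, lr = 0.0, 0.05
for epoch in range(3):
    for i in rng.permutation(len(X)):            # visit samples in random order
        grad = 2 * (w * X[i] - y[i]) * X[i]      # gradient from ONE sample: noisy
        w -= lr * grad                           # update fluctuates around the minimum
print(f"w after SGD ≈ {w:.3f}")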
Mini-Batch Gradient Descent
• We use a batch of a fixed number of training examples, smaller than the full
dataset, and call it a mini-batch.
• After creating the mini-batches of fixed size, we do the following steps in one
epoch:
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the mini-batches we created (see the sketch below)
• Faster computations
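A minimal sketch of steps 1–4 over a few passes through the data; the batch size, toy data,
and learning rate are assumed for illustration.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=1024)
y = 3.0 * X + 0.1 * rng.normal(size=1024)

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(5):
    idx = rng.permutation(len(X))                # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]        # 1. pick a mini-batch
        err = w * X[b] - y[b]                    # 2. feed it through the (toy) model
        grad = np.mean(2 * err * X[b])           # 3. mean gradient of the mini-batch
        w -= lr * grad                           # 4. update the weights
print(f"w after mini-batch GD ≈ {w:.3f}")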
Challenges
Adam Optimizer
• The name Adam is derived from Adaptive Moment Estimation.
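A minimal sketch of the Adam update rule (first- and second-moment estimates with bias
correction) on the same toy problem; the β and ε values are the commonly used defaults, while
the learning rate and data are chosen only for this illustration.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=256)
y = 3.0 * X + 0.1 * rng.normal(size=256)

w = 0.0
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8   # lr assumed; betas/eps are common defaults
m, v = 0.0, 0.0                                  # first- and second-moment estimates

for t in range(1, 3001):
    grad = np.mean(2 * (w * X - y) * X)          # full-batch gradient, for simplicity
    m = beta1 * m + (1 - beta1) * grad           # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)     # adaptive step size
print(f"w after Adam ≈ {w:.3f}")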
• Random search
• Select values randomly for every hyperparameter
• Often preferred to grid search
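A minimal sketch of random search over two hyperparameters; train_and_evaluate is a
hypothetical stand-in for actually training a model, and the sampling ranges are assumptions.

import math
import random

def train_and_evaluate(lr, batch_size):
    # Hypothetical stand-in: pretend the best settings are lr ≈ 1e-2 and batch_size = 64
    return -((math.log10(lr) + 2) ** 2) - ((batch_size - 64) / 64) ** 2

random.seed(0)
best_score, best_cfg = float("-inf"), None
for trial in range(20):
    # Sample every hyperparameter randomly (log-uniform for the learning rate)
    lr = 10 ** random.uniform(-5, -1)
    batch_size = random.choice([16, 32, 64, 128, 256])
    score = train_and_evaluate(lr, batch_size)
    if score > best_score:
        best_score, best_cfg = score, (lr, batch_size)
print("best config:", best_cfg)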
Batch Normalization
• Each activation is normalized with the mean μ and standard deviation σ of the
current mini-batch: x̂_i = (x_i − μ) / σ.
• This means that the gradient will never propose an operation that acts to
increase the standard deviation or mean of x_i; the normalization
operations remove the effect of such an action and zero out its
component in the gradient.
• At test time, μ and σ are replaced by a moving average of the mean and
standard deviation collected during training.
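A minimal NumPy sketch of a batch-normalization layer over one feature axis, keeping moving
averages of the batch statistics for use at test time; the momentum value and the learnable
scale/shift parameters (gamma for γ, beta for β) follow the standard formulation, but the
exact numbers here are assumptions.

import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)        # learnable scale
        self.beta = np.zeros(num_features)        # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu = x.mean(axis=0)                   # batch mean
            var = x.var(axis=0)                   # batch variance
            # Moving averages of the statistics, used later at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)  # normalize
        return self.gamma * x_hat + self.beta       # scale and shift

bn = BatchNorm1D(4)
batch = np.random.default_rng(4).normal(loc=5.0, scale=2.0, size=(32, 4))
print(bn(batch, training=True).mean(axis=0).round(3))   # ≈ 0 after normalization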