
Deep Learning

Experiments
Lecture 15

22 January 2025
Normalizing Data Sets

Why Normalization / Speeding Up Training
• Use the same normalizer on the test set, applied in exactly the same way as on the training set.
• If features are on very different scales (e.g., one ranging over 1 to 1000 and another over 0 to 1), the corresponding weights will end up taking very different values.
• More gradient-descent steps may then be needed to reach the optimum, so learning can be slow.
• After normalization the cost "bowl" is more spherical and symmetrical, making it easier and faster to optimize.

Normalizing Training Sets
Subtract the mean:
$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x := x - \mu$

Normalize the variance (element-wise square):
$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2, \qquad x := \frac{x}{\sigma}$
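A minimal NumPy sketch of the normalization above, reusing the training-set statistics on the test set exactly as the slides recommend (function and variable names are my own; the example data is illustrative):

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8   # small epsilon to avoid division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the *training-set* statistics to any split (train, dev, or test)."""
    return (X - mu) / sigma

# Usage: fit on the training set, reuse the same mu/sigma for the test set
X_train = np.random.rand(1000, 3) * [1.0, 1000.0, 0.5]   # features on very different scales
X_test = np.random.rand(200, 3) * [1.0, 1000.0, 0.5]
mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)
```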
Vanishing/Exploding Gradients
Assume a linear activation $g(z) = z$ and $b^{[l]} = 0$. Then
$\hat{y} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[3]} W^{[2]} W^{[1]} x$

If every weight matrix is slightly larger than the identity, e.g. $W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$, the activations grow like $1.5^{\,L-1}$; if it is slightly smaller, e.g. $W^{[l]} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$, they shrink like $0.5^{\,L-1}$. The matrix is effectively multiplied $L-1$ times (since $W^{[L]}$ may have a different dimension), leading to exploding or vanishing activations and gradients.
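A small numerical sketch of this effect (my own illustration, not from the slides): repeatedly multiplying by a 2×2 matrix slightly above or below the identity makes the activations blow up or collapse.

```python
import numpy as np

x = np.ones((2, 1))
L = 50  # number of layers (illustrative choice)

for scale in (1.5, 0.5):
    W = scale * np.eye(2)          # each layer's weight matrix
    a = x
    for _ in range(L - 1):         # the matrix is applied L-1 times
        a = W @ a
    print(f"scale={scale}: ||a|| = {np.linalg.norm(a):.3e}")
# scale=1.5 -> very large (exploding), scale=0.5 -> nearly zero (vanishing)
```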
Exploding/Vanishing Gradients
• Gradients (slopes) becoming too small or too large.
• So it is very important to pay attention to how we initialize the weights.
• If a unit has many inputs, or the feature values are large, the weights need to be correspondingly small.
• It has been proposed to set the variance of the weights to 2/n, where n is the number of inputs to the layer (for ReLU activations this is He initialization); see the sketch below.

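A minimal NumPy sketch of initializing weights with variance 2/n (the layer sizes and function name are my own assumptions):

```python
import numpy as np

def he_init(n_in, n_out, rng=np.random.default_rng(0)):
    """Draw weights with variance 2/n_in, as suggested for ReLU layers."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

W1 = he_init(n_in=784, n_out=128)   # example layer sizes (assumed, not from the slides)
print(W1.var())                      # close to 2/784
```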
Batch Norm
It is an extension of input normalization that applies to every layer of the neural network.
Given some intermediate values $z^{(i)}$ in the network (for one mini-batch):
• $\mu = \frac{1}{m}\sum_i z^{(i)}$
• $\sigma^2 = \frac{1}{m}\sum_i \left(z^{(i)} - \mu\right)^2$
• $z^{(i)}_{\text{norm}} = \dfrac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
• $\tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$, where $\gamma$ and $\beta$ are learnable parameters
• If $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$, then $\tilde{z}^{(i)} = z^{(i)}$, so batch norm can recover the identity mapping.

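A minimal NumPy sketch of the batch-norm transformation above for one mini-batch (illustrative only; the function name, shapes, and epsilon value are my own choices):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize a mini-batch of pre-activations Z (shape: units x batch_size),
    then scale and shift with the learnable gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, (Z_norm, mu, var)

# Usage (assumed shapes): 4 hidden units, mini-batch of 8 examples
Z = np.random.randn(4, 8)
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde, cache = batch_norm_forward(Z, gamma, beta)
```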
Applying Batch Norm
• $X \xrightarrow{\;w^{[1]},\, b^{[1]}\;} z^{[1]} \xrightarrow{\;\beta^{[1]},\, \gamma^{[1]}\;} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \rightarrow \ldots$
• In TensorFlow this is available as tf.nn.batch_normalization (a usage sketch follows below).
• Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
• This adds some noise to the values of z within that mini-batch. Similar to dropout, it has a slight regularization effect because of the noise it adds to the hidden-layer activations.
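A short sketch of how the per-layer computation above might look with tf.nn.batch_normalization (a hedged example; the layer sizes, variable names, and epsilon are my own assumptions):

```python
import tensorflow as tf

# Assumed shapes: mini-batch of 8 examples, 3 input features, 4 hidden units
X = tf.random.normal((8, 3))
W1 = tf.Variable(tf.random.normal((3, 4)) * (2.0 / 3.0) ** 0.5)  # variance 2/n init
beta1 = tf.Variable(tf.zeros((4,)))    # learnable shift
gamma1 = tf.Variable(tf.ones((4,)))    # learnable scale

Z1 = tf.matmul(X, W1)                  # bias b[1] can be dropped: beta1 plays its role
mu, var = tf.nn.moments(Z1, axes=[0])  # mean/variance over the mini-batch
Z1_tilde = tf.nn.batch_normalization(Z1, mu, var, beta1, gamma1, variance_epsilon=1e-8)
A1 = tf.nn.relu(Z1_tilde)              # a[1] = g[1](z~[1])
```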
Why Batch Norm
• Applying it to earlier layers helps decouple them from the later layers: each layer sees a more stable input distribution.
• That means it provides robustness to covariate shift, i.e. to changes in the input distribution.
• So if the input examples change frequently, batch norm provides a cushion so the effect tapers off by the time it reaches the later layers.
• For test data, prefer the exponentially weighted averages of $\mu$ and $\sigma^2$ collected for each layer across mini-batches during training (see the sketch below); these work better than the $\mu$ and $\sigma^2$ values of the training set itself.
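A minimal sketch of keeping exponentially weighted averages of $\mu$ and $\sigma^2$ during training for use at test time (the class name and the momentum value 0.9 are my own illustrative choices):

```python
import numpy as np

class RunningBatchNormStats:
    """Track exponentially weighted averages of a layer's mean and variance."""
    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros((n_units, 1))
        self.var = np.ones((n_units, 1))

    def update(self, Z):
        """Call once per mini-batch during training (Z: units x batch_size)."""
        mu_batch = Z.mean(axis=1, keepdims=True)
        var_batch = Z.var(axis=1, keepdims=True)
        self.mu = self.momentum * self.mu + (1 - self.momentum) * mu_batch
        self.var = self.momentum * self.var + (1 - self.momentum) * var_batch

    def normalize(self, Z, gamma, beta, eps=1e-8):
        """At test time, use the running averages instead of batch statistics."""
        Z_norm = (Z - self.mu) / np.sqrt(self.var + eps)
        return gamma * Z_norm + beta
```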
How to Try Hyperparameters
First focus on the most important ones, then the lesser ones, roughly in this sequence:
• Learning rate
• Beta (β)
• Number of hidden units
• Mini-batch size
• Number of layers
• Learning-rate decay
• Others

Random Is Better than a Grid
[Figure: candidate hyperparameter points laid out on a regular grid over the two axes α and β.]
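A brief sketch contrasting grid search with random search over two hyperparameters (the ranges and trial count are my own illustrative assumptions): with a grid, each hyperparameter is tried at only a handful of distinct values, whereas random sampling gives every trial a fresh value of each one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 25

# Grid search: only 5 distinct values per hyperparameter across 25 trials
grid_alpha = np.linspace(0.001, 0.1, 5)
grid_beta = np.linspace(0.8, 0.99, 5)
grid_points = [(a, b) for a in grid_alpha for b in grid_beta]

# Random search: every trial explores a new value of each hyperparameter
random_points = [(rng.uniform(0.001, 0.1), rng.uniform(0.8, 0.99))
                 for _ in range(n_trials)]
```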
Coarse to Fine
After a first random pass, zoom in on the region around the best-performing points and sample more densely there.
Picking Hyperparameters at Random and on the Right Scale
• Is it OK to sample on the actual (linear) scale for all hyperparameters, or do some of them require a log scale?
• For the learning rate over the range 0.0001 to 1, uniform sampling on a linear scale would spend about 90% of the samples between 0.1 and 1; sampling the exponent uniformly (log scale) covers each decade evenly. See the sketch below.

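A minimal sketch of sampling a learning rate on a log scale between 0.0001 and 1 (the NumPy approach here is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent uniformly in [-4, 0], then take 10**exponent,
# so learning rates between 0.0001 and 1 are covered decade by decade.
r = rng.uniform(-4, 0, size=5)
alphas = 10.0 ** r
print(alphas)   # values spread across 1e-4 .. 1e0
```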
Babysitting One Model vs. Parallel Model Training
The "Panda" vs. "Caviar" approach
• Which one to use depends on the application, the compute resources, and the time you have.
• In babysitting (Panda), we painstakingly watch a single model and introduce mid-course corrections to its hyperparameters.
• In parallel training (Caviar), we simultaneously train many models with different hyperparameter settings and pick the best.
Thank You
For more information, please visit the
following links:

[email protected]
[email protected]
https://www.linkedin.com/in/gauravsingal789/
http://www.gauravsingal.in

