
Lecture 7 Recap

I2DL: Prof. Dai 1


Regression Losses: L2 vs L1
• L2 Loss:
  – $L_2 = \sum_{i=1}^{n} (y_i - f(x_i))^2$
  – Sum of squared differences (SSD)
  – Prone to outliers
  – Compute-efficient (optimization)
  – Optimum is the mean

• L1 Loss:
  – $L_1 = \sum_{i=1}^{n} |y_i - f(x_i)|$
  – Sum of absolute differences
  – Robust to outliers
  – Costly to compute
  – Optimum is the median

I2DL: Prof. Dai 2


Binary Classification: Sigmoid
$\sigma(\mathbf{x}, \boldsymbol{\theta}) = \frac{1}{1 + e^{-\sum_i \theta_i x_i}}$,  i.e.  $\sigma(s) = \frac{1}{1 + e^{-s}}$ applied to the score $s = \sum_i \theta_i x_i$

(Diagram: inputs $x_0, x_1, x_2$ weighted by $\theta_0, \theta_1, \theta_2$ are summed to a score $s$ and squashed by the sigmoid into $(0, 1)$.)

Can be interpreted as a probability $p(y = 1 \mid x, \boldsymbol{\theta})$

I2DL: Prof. Dai 3


Softmax Formulation
• What if we have multiple classes?
Scores $s_1, s_2, s_3$ for each class → Softmax → probabilities for each class:

$p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{e^{s_1}}{e^{s_1} + e^{s_2} + e^{s_3}}$

$p(y = 2 \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{e^{s_2}}{e^{s_1} + e^{s_2} + e^{s_3}}$

$p(y = 3 \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{e^{s_3}}{e^{s_1} + e^{s_2} + e^{s_3}}$

I2DL: Prof. Dai 4
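A minimal NumPy sketch of the softmax above (the function name and the stability trick of subtracting the maximum score before exponentiating are my own choices; the subtraction does not change the result but avoids overflow):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities."""
    shifted = scores - np.max(scores)      # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

s = np.array([5.0, -3.0, 2.0])
print(softmax(s))  # ~[0.952, 0.0003, 0.047], sums to 1
```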


Example: Hinge vs Cross-Entropy
Hinge loss: $L_i = \sum_{k \neq y_i} \max(0, s_k - s_{y_i} + 1)$

Cross-entropy loss: $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}\right)$

Given the following scores for $\mathbf{x}_i$, with correct class $y_i = 0$:

Model 1: $s = [5, -3, 2]$
  Hinge loss: $\max(0, -3 - 5 + 1) + \max(0, 2 - 5 + 1) = 0$
  Cross-entropy loss: $-\ln\frac{e^5}{e^5 + e^{-3} + e^2} = 0.05$

Model 2: $s = [5, 10, 10]$
  Hinge loss: $\max(0, 10 - 5 + 1) + \max(0, 10 - 5 + 1) = 12$
  Cross-entropy loss: $-\ln\frac{e^5}{e^5 + e^{10} + e^{10}} = 5.70$

Model 3: $s = [5, -20, -20]$
  Hinge loss: $\max(0, -20 - 5 + 1) + \max(0, -20 - 5 + 1) = 0$
  Cross-entropy loss: $-\ln\frac{e^5}{e^5 + e^{-20} + e^{-20}} = 2 \cdot 10^{-11}$

– Cross-entropy *always* wants to improve (the loss is never 0).
– Hinge loss saturates.
I2DL: Prof. Dai 5
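A small NumPy sketch that reproduces the numbers in the example above, assuming the margin-1 hinge formulation and the softmax cross-entropy from the slide (function names are illustrative):

```python
import numpy as np

def hinge_loss(scores, correct_class):
    """Multi-class hinge (SVM) loss with margin 1 for a single example."""
    margins = np.maximum(0, scores - scores[correct_class] + 1)
    margins[correct_class] = 0       # do not count the correct class itself
    return margins.sum()

def cross_entropy_loss(scores, correct_class):
    """Softmax cross-entropy loss for a single example."""
    shifted = scores - scores.max()  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[correct_class]

for s in [np.array([5., -3., 2.]),
          np.array([5., 10., 10.]),
          np.array([5., -20., -20.])]:
    print(hinge_loss(s, 0), cross_entropy_loss(s, 0))
# hinge: 0, 12, 0   cross-entropy: ~0.05, ~5.70, ~2e-11
```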
Sigmoid Activation
Forward: $\sigma(s) = \frac{1}{1 + e^{-s}}$

Backward (chain rule): $\frac{\partial L}{\partial w} = \frac{\partial s}{\partial w} \frac{\partial L}{\partial s}$, with $\frac{\partial L}{\partial s} = \frac{\partial \sigma}{\partial s} \frac{\partial L}{\partial \sigma}$

Saturated neurons kill the gradient flow: for large $|s|$, $\frac{\partial \sigma}{\partial s} \approx 0$.
I2DL: Prof. Dai 6
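A minimal NumPy illustration of why saturation is a problem: the sigmoid derivative $\sigma'(s) = \sigma(s)(1 - \sigma(s))$ peaks at 0.25 and collapses towards zero for large $|s|$ (a sketch, not the lecture's code):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    """Derivative of the sigmoid: sigma'(s) = sigma(s) * (1 - sigma(s))."""
    sig = sigmoid(s)
    return sig * (1.0 - sig)

for s in [0.0, 2.0, 5.0, 10.0]:
    print(f"s={s:5.1f}  sigma={sigmoid(s):.5f}  dsigma/ds={sigmoid_grad(s):.2e}")
# The gradient is 0.25 at s=0 and shrinks towards 0 as |s| grows,
# which is why saturated sigmoid neurons block gradient flow.
```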
TanH Activation

Still saturates

Zero-centered

[LeCun et al. 1991] Improving Generalization Performance in Character Recognition


I2DL: Prof. Dai 7
Rectified Linear Units (ReLU)
– Does not saturate (in the positive regime), giving large and consistent gradients
– Fast convergence
– What happens if a ReLU outputs zero? → "Dead ReLU": the gradient through that unit is zero


[Krizhevsky et al. NeurIPS 2012] ImageNet Classification with Deep Convolutional Neural Networks
I2DL: Prof. Dai 8
Quick Guide
• Sigmoid is not really used.

• ReLU is the standard choice.

• Second choice: variants of ReLU or Maxout.

• Recurrent nets will require TanH or similar.

I2DL: Prof. Dai 9


Initialization is Extremely Important!
• Optimum: $x^* = \arg\min_x f(x)$

Initialization

Not guaranteed to
reach the optimum

I2DL: Prof. Dai 10


Xavier/Kaiming Initialization
• How to ensure the variance of the output is the same as the input?

$n \, \mathrm{Var}(w) = 1 \quad\Rightarrow\quad \mathrm{Var}(w) = \frac{1}{n}$

I2DL: Prof. Dai 11


ReLU Kills Half of the Data
$\mathrm{Var}(w) = \frac{2}{n}$

It makes a huge difference!

I2DL: Prof. Dai [He et al., ICCV’15] He Initialization 12
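A minimal NumPy sketch of the two initialization rules, assuming a fully connected layer with fan-in $n$ (function names are illustrative):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: Var(w) = 1/fan_in, suited for tanh/sigmoid activations."""
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

def kaiming_init(fan_in, fan_out):
    """Kaiming/He: Var(w) = 2/fan_in, compensates for ReLU zeroing half the activations."""
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

W = kaiming_init(512, 256)
print(W.var())  # close to 2/512 ~ 0.0039
```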


Lecture 8

I2DL: Prof. Dai 13


Data Augmentation

I2DL: Prof. Dai 14


Data Pre-Processing

For images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net)

I2DL: Prof. Dai 15
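A minimal NumPy sketch of per-channel mean subtraction, assuming images are stored as an (N, H, W, C) float array (the array shapes here are illustrative):

```python
import numpy as np

# Hypothetical training set: N images of size H x W with C=3 channels.
images = np.random.rand(1000, 32, 32, 3).astype(np.float32)

# Per-channel mean over the whole training set (VGG-style pre-processing).
channel_mean = images.mean(axis=(0, 1, 2))   # shape (3,)

# Subtract the same training-set mean at both train and test time.
normalized = images - channel_mean
print(channel_mean, normalized.mean(axis=(0, 1, 2)))  # second result is ~0
```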


Data Augmentation
• A classifier has to be invariant to a wide variety of
transformations

I2DL: Prof. Dai 16


Examples of variation: pose, appearance, illumination

I2DL: Prof. Dai 17
Data Augmentation
• A classifier has to be invariant to a wide variety of
transformations

• Helping the classifier: synthesize data simulating plausible transformations

I2DL: Prof. Dai 18


Data Augmentation

I2DL: Prof. Dai [Krizhevsky et al., NIPS’12] ImageNet 19


Data Augmentation: Brightness
• Random brightness and contrast changes

I2DL: Prof. Dai [Krizhevsky et al., NIPS’12] ImageNet 20


Data Augmentation: Random Crops
• Training: random crops
– Pick a random L in [256, 480]
– Resize the training image so its short side is L
– Randomly sample 224×224 crops

• Testing: fixed set of crops


– Resize image at N scales
– 10 fixed crops of 224x224: (4 corners + 1 center ) ×2 flips

I2DL: Prof. Dai [Krizhevsky et al., NIPS’12] ImageNet 21
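A torchvision sketch that loosely mimics this recipe (the exact transform choices are mine; for simplicity the test-time 10-crop evaluation is replaced by a single resize + center crop):

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

# Train time: resize the short side to a random L in [256, 480], then take a
# random 224x224 crop with a random horizontal flip and brightness/contrast jitter.
train_transform = transforms.Compose([
    transforms.Lambda(lambda img: TF.resize(img, random.randint(256, 480))),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Test time: a deterministic resize + center crop (a simpler stand-in for the
# fixed 10-crop evaluation described above).
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```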


Data Augmentation: Advanced

[Cubuk et al., CVPRW 2020] RandAugment; [Müller et al., ICCV 2021] TrivialAugment

I2DL: Prof. Dai 22


Data Augmentation
• When comparing two networks, make sure to use the same data augmentation!

• Consider data augmentation a part of your network design

I2DL: Prof. Dai 23


Advanced
Regularization

I2DL: Prof. Dai 24


L2 regularization, also (wrongly) called weight decay
• L2 regularization

$\Theta_{k+1} = \Theta_k - \epsilon \, \nabla_{\Theta} L(\Theta_k, x, y) - \lambda \Theta_k$

($\epsilon$: learning rate; $\nabla_{\Theta} L$: gradient of the loss; $\lambda \Theta_k$: gradient of the L2 regularization term)

• Penalizes large weights

• Improves generalization

I2DL: Prof. Dai 25


L2 regularization, also (wrongly) called weight decay
• Weight decay regularization

$\Theta_{k+1} = (1 - \lambda)\Theta_k - \alpha \nabla f_k(\Theta_k)$

($\lambda$: learning rate of the weight decay; $\alpha$: learning rate of the optimizer)

• Equivalent to L2 regularization in GD, but not in Adam.

[Loshchilov and Hutter, ICLR 2019] Decoupled Weight Decay Regularization
I2DL: Prof. Dai 26
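A short PyTorch sketch contrasting the two: SGD's weight_decay argument behaves like an L2 penalty, while AdamW applies the decoupled weight decay of Loshchilov and Hutter (the model and hyperparameters below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # any model

# Plain SGD: weight_decay adds lambda * theta to the gradient, which here is
# exactly equivalent to putting an L2 penalty on the weights in the loss.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# AdamW: the decay is decoupled from the adaptive gradient step,
# roughly theta <- (1 - lr * lambda) * theta - lr * adam_step, which is NOT
# the same as adding an L2 penalty to the loss when using Adam.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```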
Early Stopping

Overfitting

I2DL: Prof. Dai 27


Bagging and Ensemble Methods
• Train multiple models and average their results

• E.g., use a different algorithm for optimization or change the objective function / loss function.

• If errors are uncorrelated, the expected combined error will decrease linearly with the ensemble size

I2DL: Prof. Dai 29


Bagging and Ensemble Methods
• Bagging: uses k different datasets (or SGD/init noise)

Training Set 1 Training Set 2 Training Set 3

I2DL: Prof. Dai Image Source: [Srivastava et al., JMLR’14] Dropout 30


Dropout

I2DL: Prof. Dai 31


Dropout
• Disable a random set of neurons (typically 50%)

Forward
I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 32
Dropout: Intuition
• Using half the network = half capacity → redundant representations

(Figure: example features – furry, has two eyes, has a tail, has paws, has two ears.)
I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 33


Dropout: Intuition
• Using half the network = half capacity
– Redundant representations
– Base your scores on more features

• Consider it as a model ensemble

I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 34


Dropout: Intuition
• Two models in one

Model 1

Model 2

I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 35


Dropout: Intuition
• Using half the network = half capacity
– Redundant representations
– Base your scores on more features

• Consider it as two models in one


– Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters

Reducing co-adaptation between neurons

I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 36


Dropout: Test Time
• All neurons are “turned on” – no dropout

Conditions at train and test time are not the same

PyTorch: model.train() and model.eval()

I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 37


Dropout: Test Time
• Dropout probability $p = 0.5$

• Test: $z = (\theta_1 x_1 + \theta_2 x_2) \cdot p$

• Train: each of the four dropout patterns of the two inputs is equally likely, so

$E[z] = \frac{1}{4}\big[(\theta_1 \cdot 0 + \theta_2 \cdot 0) + (\theta_1 x_1 + \theta_2 \cdot 0) + (\theta_1 \cdot 0 + \theta_2 x_2) + (\theta_1 x_1 + \theta_2 x_2)\big] = \frac{1}{2}(\theta_1 x_1 + \theta_2 x_2)$

→ Weight scaling inference rule

I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 38
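A minimal NumPy sketch of "inverted" dropout, a common variant of the weight-scaling rule above that rescales at training time so the test-time forward pass needs no scaling (the function name is illustrative):

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True):
    """Inverted dropout: scale at train time so test time is a plain forward pass."""
    if not train:
        return x                                  # test: all neurons on, no scaling
    mask = (np.random.rand(*x.shape) >= p_drop)   # keep each unit with prob 1 - p_drop
    return x * mask / (1.0 - p_drop)              # rescale to keep E[output] unchanged

x = np.ones((4, 8))
print(dropout_forward(x, 0.5, train=True).mean())   # ~1.0 in expectation
print(dropout_forward(x, 0.5, train=False).mean())  # exactly 1.0
```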


Dropout: Before
• Efficient bagging method with parameter sharing

• Try it!

• Dropout reduces the effective capacity of a model, but needs more training time

• Efficient regularization method, can be used with L2

I2DL: Prof. Dai [Srivastava et al., JMLR’14] Dropout 39


Dropout: Nowadays
• Usually does not work well when combined with
batch-norm.
• Training takes a bit longer, usually 1.5x
• But, can be used for uncertainty estimation.
• Monte Carlo dropout (Yarin Gal and Zoubin
Ghahramani series of papers).

I2DL: Prof. Dai 40


Monte Carlo Dropout
• Neural networks are massively overconfident.
• We can use dropout to make the softmax
probabilities more calibrated.
• Training: use dropout with a low p (0.1 or 0.2).
• Inference: run the same image through the network multiple times (25–100 forward passes with dropout still active) and average the results.

Gal et al., Bayesian Convolutional Neural Networks with Bernoulli


Approximate Variational Inference, ICLRW 2015
Gal and Ghahramani, Dropout as a Bayesian approximation, ICML 2016
Gal et al., Deep Bayesian Active Learning with Image Data, ICML 2017
Gal, Uncertainty in Deep Learning, PhD thesis 2017
I2DL: Prof. Dai 41
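A possible PyTorch sketch of MC dropout, assuming a model that contains nn.Dropout modules: at inference the dropout layers are switched back to training mode and the softmax outputs of repeated stochastic forward passes are averaged (the toy architecture is a placeholder):

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=50):
    """Average softmax outputs over repeated stochastic forward passes."""
    model.eval()                       # freeze batch-norm statistics etc.
    for m in model.modules():
        if isinstance(m, nn.Dropout):  # ...but keep dropout stochastic
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)  # prediction + uncertainty

# Example with a toy classifier (hypothetical architecture):
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(64, 3))
mean_probs, uncertainty = mc_dropout_predict(model, torch.randn(5, 20))
```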
Batch Normalization:
Reducing Internal Covariate
Shift

I2DL: Prof. Dai 42


Batch Normalization:
Reducing Internal Covariate
Shift
What is internal covariate shift, by the way?

I2DL: Prof. Dai 43


Our Goal
• All we want is that our activations do not die out

I2DL: Prof. Dai 44


Batch Normalization
• Wish: unit Gaussian activations (in our example)
• Solution: let's do it

For a mini-batch of N examples with D features, normalize each feature k using the mean and variance of that feature over the mini-batch examples:

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$
I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 45


Batch Normalization
• In each dimension of the features, you have a unit Gaussian (in our example)

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

(D = number of features, N = mini-batch size; mean and variance are taken over the mini-batch examples for each feature k → unit Gaussian per feature)
I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 46


Batch Normalization
• In each dimension of the features, you have a unit Gaussian (in our example)

• For NNs in general, BN normalizes the mean and variance of the inputs to your activation functions
I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 47


BN Layer
• A layer to be applied after Fully Connected (or Convolutional) layers and before non-linear activation functions

I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 48


Batch Normalization
• 1. Normalize:

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

Differentiable function, so we can backprop through it.

• 2. Allow the network to change the range:

$\mathbf{y}^{(k)} = \gamma^{(k)} \hat{\mathbf{x}}^{(k)} + \beta^{(k)}$

These parameters will be optimized during backprop.
I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 49


Batch Normalization
• 1. Normalize:

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

• 2. Allow the network to change the range:

$\mathbf{y}^{(k)} = \gamma^{(k)} \hat{\mathbf{x}}^{(k)} + \beta^{(k)}$

The network can learn to undo the normalization via backprop: $\gamma^{(k)} = \sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}$ and $\beta^{(k)} = E[\mathbf{x}^{(k)}]$ recover the original activations.
I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 50
Batch Normalization
• Ok to treat dimensions separately? It has been shown empirically that convergence is still faster with this method, even though the features are not decorrelated.

I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 51


BN: Train vs Test
• Train time: mean and variance are taken over the mini-batch

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

• Test time: what happens if we process just one image at a time?
  – No chance to compute a meaningful mean and variance

I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 52


BN: Train vs Test
Training: compute mean and variance from the mini-batch.

Testing: compute mean and variance as an exponentially weighted average across training mini-batches and use them as $\sigma^2_{test}$ and $\mu_{test}$:

$\mathrm{Var}_{running} = \beta_m \cdot \mathrm{Var}_{running} + (1 - \beta_m) \cdot \mathrm{Var}_{minibatch}$
$\mu_{running} = \beta_m \cdot \mu_{running} + (1 - \beta_m) \cdot \mu_{minibatch}$

$\beta_m$: momentum (hyperparameter)

I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 53
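A minimal, forward-only NumPy sketch of a batch-norm layer tying these pieces together: per-feature normalization, the learnable $\gamma$/$\beta$, and the running statistics used at test time (class and variable names are mine; $\gamma$ and $\beta$ are not actually trained here):

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch normalization over an (N, D) mini-batch."""
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)    # learned scale
        self.beta = np.zeros(num_features)    # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponentially weighted averages, used later at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(4)
out = bn.forward(np.random.randn(32, 4) * 3 + 5, train=True)
print(out.mean(axis=0), out.var(axis=0))  # ~0 and ~1 per feature
```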


BN: What do you get?
• Very deep nets are much easier to train; more stable gradients

• A much larger range of hyperparameters works similarly when using BN

I2DL: Prof. Dai [Ioffe and Szegedy, PMLR’15] Batch Normalization 54


BN: A Milestone

I2DL: Prof. Dai [Wu and He, ECCV’18] Group Normalization 55


BN: Drawbacks

I2DL: Prof. Dai [Wu and He, ECCV’18] Group Normalization 56


Other Normalizations

I2DL: Prof. Dai [Wu and He, ECCV’18] Group Normalization 57


Other Normalizations

(Figure from [Wu and He, ECCV'18]: each normalization variant is drawn over a tensor whose axes are the number of elements in the batch, the number of channels, and the image size; the variants differ in which axes the mean and variance are computed over.)
I2DL: Prof. Dai [Wu and He, ECCV’18] Group Normalization 58


What We Know

I2DL: Prof. Dai 59


What do we know so far?
Width

Depth
I2DL: Prof. Dai 60
What do we know so far?
Concept of a 'Neuron'

$\sigma(s) = \frac{1}{1 + e^{-s}}$, applied to the weighted sum $s = \sum_i \theta_i x_i$ of the inputs $x_0, x_1, x_2$ (weights $\theta_0, \theta_1, \theta_2$)

I2DL: Prof. Dai 61


What do we know so far?
Activation Functions (non-linearities)

• Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$          • ReLU: $\max(0, x)$

• TanH: $\tanh(x)$                                     • Leaky ReLU: $\max(0.1x, x)$

I2DL: Prof. Dai 62


What do we know so far?
Backpropagation

(Diagram: a small computational graph with inputs $x_0, x_1$ and weights $w_0, w_1$ through which gradients flow backwards.)
I2DL: Prof. Dai 63


What do we know so far?
SGD Variations (Momentum, etc.)

I2DL: Prof. Dai 64


What do we know so far?
Data Augmentation

Batch-Norm: $\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

Dropout

Weight Initialization (e.g., Kaiming)

Weight Regularization, e.g. $L_2$-reg: $R_2(\mathbf{W}) = \sum_{i=1}^{N} w_i^2$

I2DL: Prof. Dai 65


Why not simply more layers?
• Neural nets with at least one hidden layer are universal function
approximators.
• But generalization is another issue.

• Why not just go deeper and get better?


– No structure!!
– It is just brute force!
– Optimization becomes hard
– Performance plateaus / drops!

• We need more! More means CNNs, RNNs and eventually Transformers.

I2DL: Prof. Dai 66


See you next week!

I2DL: Prof. Dai 67


References
• Goodfellow et al. “Deep Learning” (2016),
– Chapter 6: Deep Feedforward Networks

• Bishop “Pattern Recognition and Machine Learning” (2006),


– Chapter 5.5: Regularization in Neural Networks

• http://cs231n.github.io/neural-networks-1/

• http://cs231n.github.io/neural-networks-2/

• http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Dai 68
