8. Training Neural Networks - Part 3
Softmax output layer: a small network with inputs $x_1, x_2$ produces class scores $s_1, s_2, s_3$, which the softmax turns into probabilities
$$p(y = i \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{e^{s_i}}{e^{s_1} + e^{s_2} + e^{s_3}}, \qquad i \in \{1, 2, 3\}$$
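A minimal NumPy sketch of this softmax; subtracting the maximum score is a standard numerical-stability trick and not part of the formula above.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores s_1..s_C into probabilities p(y=i|x, theta)."""
    shifted = scores - np.max(scores)   # numerical stability; leaves the result unchanged
    exp_s = np.exp(shifted)
    return exp_s / np.sum(exp_s)

s = np.array([2.0, 1.0, 0.1])           # example scores s_1, s_2, s_3
print(softmax(s), softmax(s).sum())     # probabilities, summing to 1
```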
The gradient of the loss with respect to a weight factorizes via the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial s}{\partial w}\,\frac{\partial L}{\partial s}, \qquad \frac{\partial L}{\partial s} = \frac{\partial \sigma}{\partial s}\,\frac{\partial L}{\partial \sigma}$$
Saturated neurons kill the gradient flow: where the sigmoid is flat, $\frac{\partial \sigma}{\partial s} \approx 0$, so almost nothing reaches the weights.
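A small sketch (plain NumPy) of why saturation is harmful: the local factor $\partial \sigma / \partial s = \sigma(s)(1 - \sigma(s))$ vanishes for large $|s|$, so the chain-rule product above collapses regardless of the upstream gradient.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    # d(sigma)/ds = sigma(s) * (1 - sigma(s))
    sig = sigmoid(s)
    return sig * (1.0 - sig)

for s in [0.0, 2.0, 10.0]:
    # dL/dw = ds/dw * dsigma/ds * dL/dsigma; only the middle (local) factor is printed
    print(f"s = {s:5.1f}  ->  dsigma/ds = {sigmoid_grad(s):.6f}")
# at s = 10 the local gradient is ~4.5e-05: the neuron is saturated
```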
TanH Activation
• Zero-centered
• Still saturates

ReLU Activation
• Large and consistent gradients where the unit is active
• What happens if a ReLU outputs zero? No gradient flows back through it (a 'dead' ReLU), which makes initialization important
• Not guaranteed to reach the optimum
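A brief sketch of that point: for non-positive pre-activations the ReLU outputs zero and its local gradient is zero, so such units contribute nothing to the backward pass.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def relu_grad(s):
    # (sub)gradient of max(0, s): 1 where s > 0, otherwise 0
    return (s > 0).astype(float)

s = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(s))       # zero output for the negative pre-activations
print(relu_grad(s))  # zero gradient there as well -> nothing flows back
```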
• For images: subtract the mean image (AlexNet) or the per-channel mean (VGG-Net)
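A sketch of both variants, assuming the training images are stored in a NumPy array of shape (N, H, W, C):

```python
import numpy as np

images = np.random.rand(100, 32, 32, 3).astype(np.float32)   # stand-in for a training set

# AlexNet-style: subtract the mean image (one H x W x C mean over the training set)
mean_image = images.mean(axis=0)
centered_alexnet = images - mean_image

# VGG-style: subtract a single scalar mean per channel
per_channel_mean = images.mean(axis=(0, 1, 2))                # shape (3,)
centered_vgg = images - per_channel_mean
```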
Data augmentation: [Cubuk et al., RandAugment, CVPRW 2020], [Müller et al., TrivialAugment, ICCV 2021]
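RandAugment and TrivialAugment sample simple image transformations at random; the following is only a hand-rolled illustration of that idea (random horizontal flip plus random crop), not the policies from the papers.

```python
import numpy as np

def augment(img, crop=28):
    """Randomly flip and crop a single H x W x C image (illustrative only)."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]                      # horizontal flip
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]

img = np.random.rand(32, 32, 3)
print(augment(img).shape)                           # (28, 28, 3)
```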
Weight regularization:
• Penalizes large weights
• Improves generalization (combats overfitting)
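A sketch of adding an $L_2$ penalty $\lambda \sum_i w_i^2$ to the data loss; the weighting $\lambda$ here is an arbitrary example value.

```python
import numpy as np

def l2_regularized_loss(data_loss, W, lam=1e-4):
    reg = lam * np.sum(W ** 2)     # penalizes large weights
    return data_loss + reg

def l2_grad(W, lam=1e-4):
    return 2.0 * lam * W           # added to the gradient of the data loss

W = np.random.randn(10, 5)
print(l2_regularized_loss(1.3, W))
```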
Dropout [Srivastava et al., JMLR’14]
Dropout: Intuition
• Using half the network = half capacity → forces redundant representations
• Figure: two sub-models (Model 1, Model 2) rely on different, overlapping features such as 'furry', 'has two eyes', 'has a tail', 'has paws'
• Train: drop each of the two inputs independently with probability $\tfrac{1}{2}$, so the output $z$ of the unit with weights $\theta_1, \theta_2$ has expectation
$$E[z] = \tfrac{1}{4}\big[(\theta_1 \cdot 0 + \theta_2 \cdot 0) + (\theta_1 x_1 + \theta_2 \cdot 0) + (\theta_1 \cdot 0 + \theta_2 x_2) + (\theta_1 x_1 + \theta_2 x_2)\big] = \tfrac{1}{2}(\theta_1 x_1 + \theta_2 x_2)$$
• Weight scaling inference rule: at test time keep all units active but scale the weights by the keep probability (here $\tfrac{1}{2}$) so the expected output matches training
• Try it!
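A sketch of dropout with this weight scaling, written in the common 'inverted dropout' form where the division by the keep probability is done at training time so that test time needs no rescaling.

```python
import numpy as np

def dropout_forward(x, p_keep=0.5, train=True):
    """Inverted dropout: scale by 1/p_keep at train time, do nothing at test time."""
    if train:
        mask = (np.random.rand(*x.shape) < p_keep) / p_keep
        return x * mask
    return x

x = np.ones(10000)
print(dropout_forward(x, 0.5, train=True).mean())   # ~1.0 in expectation
print(dropout_forward(x, 0.5, train=False).mean())  # exactly 1.0
```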
Batch Normalization
• 1. Normalize each feature $k$ over the mini-batch so that it becomes unit Gaussian:
$$\hat{\boldsymbol{x}}^{(k)} = \frac{\boldsymbol{x}^{(k)} - E[\boldsymbol{x}^{(k)}]}{\sqrt{\mathrm{Var}[\boldsymbol{x}^{(k)}]}}$$
• 2. Allow the network to change the range with learnable parameters (trained by backprop):
$$\boldsymbol{y}^{(k)} = \gamma^{(k)} \hat{\boldsymbol{x}}^{(k)} + \beta^{(k)}$$
Choosing $\gamma^{(k)} = \sqrt{\mathrm{Var}[\boldsymbol{x}^{(k)}]}$ and $\beta^{(k)} = E[\boldsymbol{x}^{(k)}]$ would undo the normalization, so no representational power is lost.
[Ioffe and Szegedy, PMLR’15]
Batch Normalization
• Is it OK to treat the dimensions separately?
• It has been shown empirically that convergence is still faster with this method even if the features are not decorrelated first
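A minimal sketch of the training-time forward pass of the two steps above, per feature over the mini-batch; the small $\epsilon$ in the denominator is the usual numerical-stability term from the paper.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift."""
    mean = x.mean(axis=0)                     # E[x^(k)] per feature
    var = x.var(axis=0)                       # Var[x^(k)] per feature
    x_hat = (x - mean) / np.sqrt(var + eps)   # step 1: unit-Gaussian normalization
    return gamma * x_hat + beta               # step 2: y^(k) = gamma^(k) x_hat^(k) + beta^(k)

x = np.random.randn(64, 16) * 5 + 3
y = batchnorm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # ~0 mean, ~1 std per feature
```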
What do we know so far?
Concept of a ‘Neuron’
• A neuron computes a weighted sum of its inputs, $s = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2$, followed by a non-linearity such as $\sigma(s) = \frac{1}{1 + e^{-s}}$
• Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$ • ReLU: $\max(0, x)$
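A line-by-line sketch of that neuron's forward pass (the particular input and weight values are made up for illustration):

```python
import numpy as np

def neuron_forward(x, theta):
    """Weighted sum of the inputs followed by the sigmoid non-linearity."""
    s = np.dot(theta, x)                # s = theta_0*x_0 + theta_1*x_1 + theta_2*x_2
    return 1.0 / (1.0 + np.exp(-s))     # sigma(s)

x = np.array([1.0, 0.5, -1.2])          # x_0, x_1, x_2
theta = np.array([0.1, 0.8, -0.3])      # theta_0, theta_1, theta_2
print(neuron_forward(x, theta))
```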
• Dropout
• Weight Initialization (e.g., Kaiming)
• Weight Regularization, e.g., $L_2$-reg: $R_2(\boldsymbol{W}) = \sum_{i=1}^{N} w_i^2$
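A sketch of Kaiming (He) initialization for a fully connected layer: weights drawn from a zero-mean Gaussian with variance $2/\text{fan\_in}$, the choice recommended for ReLU networks.

```python
import numpy as np

def kaiming_init(fan_in, fan_out):
    """He initialization: std = sqrt(2 / fan_in), suited to ReLU activations."""
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

W = kaiming_init(fan_in=512, fan_out=256)
print(W.std())   # roughly sqrt(2 / 512) ~ 0.0625
```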
• http://cs231n.github.io/neural-networks-1/
• http://cs231n.github.io/neural-networks-2/
• http://cs231n.github.io/neural-networks-3/