Lesson 13
Activation functions should be monotonic, differentiable, and should allow training to converge quickly.

Linear

f(x) = ax + b
f(x) = a_1 x_1 + a_2 x_2 + a_3 x_3 + ... + b   (perceptron form)

df(x)/dx = a

Observations:
• Constant gradient
• The gradient does not depend on changes in the input

[Figure: perceptron with inputs i1, i2, i3 feeding a hidden unit h1]
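A minimal NumPy sketch (illustrative, not from the slides; the helper names are assumptions) of the linear activation and its constant gradient:

import numpy as np

def linear(x, a=1.0, b=0.0):
    # Linear activation f(x) = a*x + b
    return a * x + b

def linear_grad(x, a=1.0):
    # df/dx = a: constant, independent of the input x
    return np.full_like(np.asarray(x, dtype=float), a)

x = np.array([-2.0, 0.0, 3.0])
print(linear(x, a=0.5, b=1.0))   # [0.  1.  2.5]
print(linear_grad(x, a=0.5))     # [0.5 0.5 0.5]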
Non-Linear
• Sigmoid (Logistic)
• Hyperbolic Tangent (Tanh)
• Rectified Linear Unit (ReLU)
• Leaky ReLU
• Parametric ReLU
• Exponential Linear Unit (ELU)
Sigmoid Activation Function (Logistic)
f(x) = 1 / (1 + e^{-x})

df(x)/dx = f(x)(1 - f(x))
Observations:
• Output: 0 to 1
• Outputs are not zero-centered
• Can saturate and kill (vanish) gradients
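A small NumPy sketch (illustrative; the helper names are assumptions) showing the sigmoid, its gradient, and how the gradient vanishes once the input saturates:

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x}), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # df/dx = f(x) * (1 - f(x)); at most 0.25, near 0 for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # ~[0.     0.5   1.   ]
print(sigmoid_grad(x))  # ~[0.     0.25  0.   ]  -> saturation kills the gradient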
Tanh Activation Function
f(x) = (e^x - e^{-x}) / (e^x + e^{-x})

df(x)/dx = 1 - f(x)^2
Observations:
• Output: -1 to +1
• Outputs are zero-centered
• Can saturate and kill (vanish) gradients
• The gradient is steeper than the sigmoid's, resulting in faster convergence
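A corresponding sketch for tanh (same illustrative assumptions as above); note the peak gradient of 1 versus the sigmoid's 0.25:

import numpy as np

def tanh_act(x):
    # f(x) = (e^x - e^{-x}) / (e^x + e^{-x}), output in (-1, 1)
    return np.tanh(x)

def tanh_grad(x):
    # df/dx = 1 - f(x)^2; peaks at 1, so it is steeper than sigmoid
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, 0.0, 3.0])
print(tanh_act(x))   # ~[-0.995  0.     0.995]
print(tanh_grad(x))  # ~[ 0.01   1.     0.01 ]  -> still saturates for large |x|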
Rectified Linear Unit (ReLU)

f(x) = max(0, x)

df(x)/dx = 1 for x > 0; 0 for x < 0
Observations:
• Greatly increases training speed compared to tanh and sigmoid
• Reduces the likelihood of killing (vanishing) the gradient
• Activations can blow up (the output is unbounded above)
• Dead nodes: units that only receive negative inputs get zero gradient and stop learning
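A ReLU sketch (illustrative; the zero gradient at x = 0 is a convention, not specified on the slide):

import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # df/dx = 1 for x > 0, 0 otherwise (value at x = 0 chosen by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 5.0])
print(relu(x))       # [0. 0. 5.]
print(relu_grad(x))  # [0. 0. 1.]  -> units stuck with x <= 0 get no gradient ("dead nodes")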
Leaky ReLU

f(x) = max(0.01x, x)

Observations:
• Fixes the dying ReLU problem: the small slope (0.01) for x < 0 keeps the gradient non-zero
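A Leaky ReLU sketch (illustrative; the 0.01 slope matches the slide's definition):

import numpy as np

def leaky_relu(x, slope=0.01):
    # f(x) = max(slope * x, x): the small negative slope keeps the gradient alive for x < 0
    return np.maximum(slope * x, x)

x = np.array([-4.0, 0.0, 2.0])
print(leaky_relu(x))  # [-0.04  0.    2.  ]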
Parametric ReLU (PReLU)

f(x) = max(αx, x)

df(x)/dx = α for x < 0; 1 for x ≥ 0

Observations:
• Like Leaky ReLU, but the slope α is learned during training rather than fixed
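A PReLU sketch (illustrative; here α is passed in by hand, whereas in practice it is a trainable parameter and is assumed to be < 1):

import numpy as np

def prelu(x, alpha):
    # f(x) = alpha*x for x < 0, x for x >= 0 (equivalent to max(alpha*x, x) when alpha < 1)
    return np.where(x >= 0, x, alpha * x)

def prelu_grad(x, alpha):
    # df/dx = alpha for x < 0, 1 for x >= 0
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-3.0, 2.0])
print(prelu(x, alpha=0.2))       # [-0.6  2. ]
print(prelu_grad(x, alpha=0.2))  # [0.2  1. ]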
Exponential Linear Unit (ELU)

f(x) = α(e^x - 1) for x < 0; x for x ≥ 0

df(x)/dx = f(x) + α for x < 0; 1 for x ≥ 0

Observations:
• Can produce negative outputs
• The activation can still blow up for large positive inputs
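An ELU sketch (illustrative; α = 1 is an assumed default, not stated on the slide):

import numpy as np

def elu(x, alpha=1.0):
    # f(x) = alpha * (e^x - 1) for x < 0, x for x >= 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # df/dx = f(x) + alpha for x < 0, 1 for x >= 0
    return np.where(x >= 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-5.0, -1.0, 3.0])
print(elu(x))       # ~[-0.993 -0.632  3.   ]  -> negative outputs are possible
print(elu_grad(x))  # ~[ 0.007  0.368  1.   ]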
Complete Chain

[Figure: single-hidden-layer network with inputs X1, X2, X3, hidden-layer weights U, output-layer weights V, and outputs compared against targets y/y' (example value pairs 0.6/1 and 0.4/0)]
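A rough forward-pass sketch of the chain above (the layer sizes, random weights, sigmoid activations, and squared-error loss are all assumptions made for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])     # inputs X1, X2, X3
U = rng.normal(size=(4, 3))        # hidden-layer weights (4 hidden units assumed)
V = rng.normal(size=(1, 4))        # output-layer weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = sigmoid(U @ x)                 # hidden activations
y_hat = sigmoid(V @ h)             # prediction y'
y = 1.0                            # target y
loss = 0.5 * (y - y_hat) ** 2      # error fed back through the same chain
print(y_hat, loss)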
Deep Network

[Figure: deeper network with inputs X1, X2, X3 and successive weight matrices U, W, V]
Deep Network - Vanishing/Exploding Gradient

[Figure: deep network with inputs X1, X2, X3 and weight matrices U, W, V]

δE/δU_11 = (δb_1/δU_11) × (δg_1/δb_1) × (δa_1/δg_1) × (δh_1/δa_1) × (δz_1/δh_1) × (δy_1/δz_1) × (δE/δy_1)
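A toy numerical illustration (assumed values, not from the slides) of why the long product in this chain rule makes the gradient vanish or explode as the network gets deeper:

import numpy as np

depth = 30
small_factors = np.full(depth, 0.25)   # e.g. saturated sigmoid derivatives
large_factors = np.full(depth, 1.5)    # e.g. factors coming from large weights

print(np.prod(small_factors))   # ~8.7e-19  -> vanishing gradient
print(np.prod(large_factors))   # ~1.9e+05  -> exploding gradient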
Summary
• We learned the characteristics of different activation functions and their gradients
• The choice of activation function depends on the nature of the problem, the nature of the target output, and the depth of the network