
Lesson 13

Activation Functions and Their Derivatives


Activation Functions

An activation function is applied over the linear weighted summation of the incoming information to a node.

It converts the linear input signals from the perceptron into a linear or non-linear output signal.

It decides whether to activate a node or not.

(Figure: inputs i1, i2, i3 feeding node h1 through the activation function.)
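A minimal NumPy sketch of this idea (all names and values are illustrative, not from the slides): the node first forms the weighted summation of its inputs, then the activation function decides the output.

```python
import numpy as np

def node_output(inputs, weights, bias, activation):
    """Apply an activation function over the weighted sum of a node's inputs."""
    z = np.dot(weights, inputs) + bias   # linear weighted summation
    return activation(z)                 # activation decides the node's output

# Example with a sigmoid activation (defined later in this lesson)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
h1 = node_output(np.array([0.5, -1.2, 0.3]),   # i1, i2, i3
                 np.array([0.4, 0.1, -0.7]),   # weights
                 bias=0.2,
                 activation=sigmoid)
print(h1)
```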
Activation Functions

Activation functions must be monotonic, differentiable, and quickly converging.

Types of Activation Functions:
• Linear
• Non-Linear
Linear

f(x) = ax + b

df(x)/dx = a

Observations:
• Constant gradient
• The gradient does not depend on the change in the input
Linear

f(x) = ax + b

For a perceptron with several inputs (i1, i2, i3 feeding node h1):

f(x) = a1·x1 + a2·x2 + a3·x3 + ... + b
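A small NumPy sketch (coefficients chosen for illustration) showing the linear activation and its constant gradient, which does not depend on the input:

```python
import numpy as np

def linear(x, a=2.0, b=1.0):
    """Linear activation f(x) = a*x + b."""
    return a * x + b

def linear_grad(x, a=2.0, b=1.0):
    """Derivative df/dx = a, regardless of x."""
    return np.full_like(x, a)

x = np.array([-3.0, 0.0, 5.0])
print(linear(x))       # [-5.  1. 11.]
print(linear_grad(x))  # [2. 2. 2.]  -- constant gradient
```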
Non-Linear
• Sigmoid (Logistic)
• Hyperbolic Tangent (Tanh)
• Rectified Linear Unit (ReLU)
• Leaky ReLU
• Parametric ReLU
• Exponential Linear Unit (ELU)
Sigmoid Activation Function (Logistic)

f(x) = 1 / (1 + e^(-x))

df(x)/dx = f(x)(1 - f(x))

Observations:
• Output: 0 to 1
• Outputs are not zero-centered
• Can saturate and kill (vanish) gradients
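A minimal NumPy sketch (illustrative inputs) of the sigmoid and the derivative stated above; note how the gradient collapses towards zero for large positive or negative inputs:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid (logistic) activation: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative df/dx = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # [~0.00005  0.5  ~0.99995]
print(sigmoid_grad(x))  # gradient is ~0 at the extremes -- the unit saturates
```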
Tanh Activation Function

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

df(x)/dx = 1 - f(x)^2

Observations:
• Output: -1 to +1
• Outputs are zero-centered
• Can saturate and kill (vanish) gradients
• Gradient is steeper than the sigmoid's, resulting in faster convergence
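A matching NumPy sketch (illustrative inputs) of tanh and its derivative; the gradient near zero is larger than the sigmoid's, but it still saturates at the extremes:

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent activation: zero-centered output in (-1, 1)."""
    return np.tanh(x)

def tanh_grad(x):
    """Derivative df/dx = 1 - f(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))       # [-0.964  0.     0.964]
print(tanh_grad(x))  # [ 0.071  1.     0.071] -- steeper than sigmoid near 0
```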
Rectified Linear Unit (ReLU)

f(x) = max(0, x)

df(x)/dx = 1 for x > 0, 0 for x < 0

Observations:
• Greatly increases training speed compared to tanh and sigmoid
• Reduces the likelihood of killing (vanishing) gradients
• It can blow up the activation (output is unbounded for x > 0)
• Dead nodes (nodes stuck at zero output receive zero gradient and stop learning)
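A minimal NumPy sketch (illustrative inputs) of ReLU and its piecewise derivative; the zero gradient on the negative side is what produces dead nodes:

```python
import numpy as np

def relu(x):
    """ReLU activation: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative: 1 for x > 0, 0 otherwise (0 chosen at x = 0 by convention)."""
    return (x > 0).astype(float)

x = np.array([-3.0, 0.0, 4.0])
print(relu(x))       # [0. 0. 4.]
print(relu_grad(x))  # [0. 0. 1.] -- negative inputs get zero gradient (dead nodes)
```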
Leaky ReLU

f(x) = max(0.01x, x)

df(x)/dx = 0.01 for x < 0, 1 for x ≥ 0

Observations:
• Fixes the dying-ReLU problem (negative inputs keep a small, non-zero gradient)
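A minimal NumPy sketch (illustrative inputs, fixed slope 0.01) of Leaky ReLU and its derivative:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: f(x) = max(slope * x, x)."""
    return np.maximum(slope * x, x)

def leaky_relu_grad(x, slope=0.01):
    """Derivative: slope for x < 0, 1 for x >= 0."""
    return np.where(x < 0, slope, 1.0)

x = np.array([-3.0, 0.0, 4.0])
print(leaky_relu(x))       # [-0.03  0.    4.  ]
print(leaky_relu_grad(x))  # [ 0.01  1.    1.  ] -- no dead nodes
```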
Parameterized ReLU (PReLU)

f(x) = max(αx, x)

df(x)/dx = α for x < 0, 1 for x ≥ 0

Observations:
• Same form as Leaky ReLU, but the slope α for negative inputs is learned during training
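A minimal sketch of PReLU (the value of α and the inputs are illustrative; the training step that actually updates α is omitted). It also shows the gradient with respect to α, which is what makes the slope learnable:

```python
import numpy as np

def prelu(x, alpha):
    """Parametric ReLU: like Leaky ReLU, but alpha is learned during training."""
    return np.maximum(alpha * x, x)

def prelu_grad_x(x, alpha):
    """Derivative w.r.t. the input: alpha for x < 0, 1 for x >= 0."""
    return np.where(x < 0, alpha, 1.0)

def prelu_grad_alpha(x):
    """Derivative w.r.t. alpha (used to update alpha during training): x for x < 0, else 0."""
    return np.where(x < 0, x, 0.0)

x = np.array([-2.0, 3.0])
print(prelu(x, alpha=0.2))         # [-0.4  3. ]
print(prelu_grad_x(x, alpha=0.2))  # [ 0.2  1. ]
```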
Exponential Linear Unit (ELU)

f(x) = α(e^x - 1) for x < 0, x for x ≥ 0

df(x)/dx = f(x) + α for x < 0, 1 for x ≥ 0

Observations:
• It can produce negative outputs
• It can blow up the activation (output is unbounded for x > 0)
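A minimal NumPy sketch (illustrative inputs, α = 1) of ELU and the derivative given above:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: alpha * (e^x - 1) for x < 0, x for x >= 0."""
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

def elu_grad(x, alpha=1.0):
    """Derivative: f(x) + alpha for x < 0, 1 for x >= 0."""
    return np.where(x < 0, elu(x, alpha) + alpha, 1.0)

x = np.array([-2.0, 0.0, 3.0])
print(elu(x))       # [-0.865  0.     3.   ]
print(elu_grad(x))  # [ 0.135  1.     1.   ]
```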
Complete Chain

(Figure: network with inputs X1, X2, X3, hidden-layer weights U and output-layer weights V; predicted outputs y = 0.6, 0.4 against targets y' = 1, 0; E is the error computed from y and y'.)

z1 = Σj hj·Vj1        y1 = f(z1)

δE/δV11 = (δz1/δV11) × (δy1/δz1) × (δE/δy1)
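A minimal NumPy sketch of this chain for a single output weight V11. The hidden activations h, the weights V, the sigmoid output activation, and the squared-error loss are all assumptions for illustration, not values from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed example values: hidden activations h, output weights V (one output
# node), and the target y'.
h = np.array([0.5, 0.2, 0.9])          # h1, h2, h3
V = np.array([0.3, -0.6, 0.8])         # V11, V21, V31
y_target = 1.0

z1 = np.dot(h, V)                      # z1 = sum_j h_j * V_j1
y1 = sigmoid(z1)                       # y1 = f(z1)
E = 0.5 * (y1 - y_target) ** 2         # squared error (assumed loss)

# Chain rule: dE/dV11 = dz1/dV11 * dy1/dz1 * dE/dy1
dz1_dV11 = h[0]                        # h1
dy1_dz1 = y1 * (1.0 - y1)              # sigmoid derivative
dE_dy1 = y1 - y_target                 # derivative of the squared error
dE_dV11 = dz1_dV11 * dy1_dz1 * dE_dy1
print(dE_dV11)
```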
Deep Network

(Figure: network with inputs X1, X2, X3 and three weight layers U, W, V.)
Deep Network - Vanishing/Exploding Gradient

(Figure: network with inputs X1, X2, X3 and weight layers U, W, V.)

Forward pass:
b1 = Σi xi·Ui1        g1 = f(b1)
a1 = Σi gi·Wi1        h1 = f(a1)
z1 = Σj hj·Vj1        y1 = f(z1)

δE/δU11 = (δb1/δU11) × (δg1/δb1) × (δa1/δg1) × (δh1/δa1) × (δz1/δh1) × (δy1/δz1) × (δE/δy1)

Each extra layer multiplies in another activation derivative, so the product can shrink towards zero (vanishing gradient) or grow without bound (exploding gradient).
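A small sketch (illustrative values; the weight factors in the chain are taken as 1 for simplicity) of why this long product can vanish: with sigmoid activations each factor f'(z) is at most 0.25, so stacking layers shrinks the gradient geometrically:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
grad = 1.0                      # start from dE/dy at the output
for layer in range(10):         # assumed 10-layer chain of sigmoid units
    z = rng.normal()            # pre-activation at this layer (illustrative)
    grad *= sigmoid_grad(z)     # each layer multiplies in f'(z) <= 0.25
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
```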
Summary
• We learned the characteristics of different activation functions and their gradients.
• The choice of activation function depends on the nature of the problem, the nature of the target output, and the depth of the network.
