7 Types of Neural Network Activation Functions
This article is part of MissingLink’s Neural Network Guide, which focuses on practical explanations
of concepts and processes, skipping the theoretical background. We’ll provide you with an overview
of the subject, pro tips for choosing activation functions, and explain how to use the
MissingLink deep learning platform to speed up your experiments.
Activation functions are mathematical equations that determine the output of a neural
network. The function is attached to each neuron in the network, and determines
whether it should be activated (“fired”) or not, based on whether each neuron’s input is
relevant for the model's prediction. Activation functions also help normalize the output
of each neuron to a range between 0 and 1 or between -1 and 1.
The need for speed has led to the development of new functions such as ReLU and
Swish (see more about nonlinear activation functions below).
What are Artificial Neural Networks and Deep Neural Networks?
Artificial Neural Networks (ANNs) are composed of a large number of simple elements,
called neurons, each of which makes simple decisions. Together, the neurons can
provide accurate answers to some complex problems, such as natural language
processing, computer vision, and other AI tasks.
A neural network can be “shallow”, meaning it has an input layer of neurons, only one
“hidden layer” that processes the inputs, and an output layer that provides the final
output of the model. A Deep Neural Network (DNN) commonly has between 2 and 8
additional layers of neurons. Research from Goodfellow, Bengio, and Courville and
other experts suggests that neural networks increase in accuracy with the number of
hidden layers.
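To make the shallow structure concrete, here is a minimal sketch of a forward pass through a network with one hidden layer (the layer sizes, random weights, and the choice of a sigmoid activation are illustrative assumptions, not taken from the guide):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

x = rng.normal(size=3)           # input layer: 3 features
W1 = rng.normal(size=(4, 3))     # weights into the single hidden layer (4 neurons)
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))     # weights into the output layer (1 neuron)
b2 = np.zeros(1)

hidden = sigmoid(W1 @ x + b1)        # the hidden layer processes the inputs
output = sigmoid(W2 @ hidden + b2)   # the output layer provides the final output
print(output)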
"Non-deep" feedforward neural network
The activation function is a mathematical “gate” in between the input feeding the
current neuron and its output going to the next layer. It can be as simple as a step
function that turns the neuron output on and off, depending on a rule or threshold. Or it
can be a transformation that maps the input signals into output signals that are
needed for the neural network to function.
Increasingly, neural networks use non-linear activation functions, which can help the
network learn complex data, compute and learn almost any function representing a
question, and provide accurate predictions.
* The bias input is just the number 1, making it possible to represent activation functions
that do not cross the origin. Biases are also assigned a weight.
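To make this concrete, here is a minimal sketch of a single neuron: a weighted sum of its inputs plus a weighted bias, passed through a simple step "gate" that decides whether the neuron fires (the specific numbers and the threshold of 0 are illustrative assumptions):

import numpy as np

def step(z, threshold=0.0):
    # Turn the neuron output on or off depending on a threshold.
    return 1.0 if z >= threshold else 0.0

inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.2, -0.5])
bias = 1.0          # the bias input is just the number 1
bias_weight = 0.3   # the bias is also assigned a weight

z = inputs @ weights + bias * bias_weight   # weighted sum feeding the neuron
output = step(z)                            # the activation "gate" decides whether to fire
print(output)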
Binary Step Function
A binary step function outputs 1 if the input is above a threshold and 0 otherwise. The
problem with a step function is that it does not allow multi-value outputs: for example, it
cannot support classifying the inputs into one of several categories.
Linear Activation Function
A linear activation function takes the form:
A = cx
It takes the inputs, multiplied by the weights for each neuron, and creates an output
signal proportional to the input. In one sense, a linear function is better than a step
function because it allows multiple outputs, not just yes and no.
However, a linear activation function has two major problems:
1. It is not possible to use backpropagation effectively: the derivative of the function is
a constant and has no relation to the input, so the gradient gives no information about
which weights in the input neurons should be adjusted for a better prediction.
2. All layers of the neural network collapse into one: with linear activation
functions, no matter how many layers are in the neural network, the last layer will be a
linear function of the first layer (because a linear combination of linear functions is still
a linear function). So a linear activation function turns the neural network into just one
layer.
A neural network with a linear activation function is simply a linear regression model. It
has limited power and a limited ability to handle complexity or varying parameters in the
input data.
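A quick numerical sketch of the collapse argument above, using arbitrary illustrative matrices: two stacked layers with linear (identity) activations produce exactly the same output as a single layer whose weight matrix is the product of the two.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)

# Two layers, each applying only a linear transformation.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
two_layer_output = W2 @ (W1 @ x)

# One layer with the combined weight matrix gives the same result.
W_combined = W2 @ W1
one_layer_output = W_combined @ x

print(np.allclose(two_layer_output, one_layer_output))  # True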
ELU
Exponential Linear Unit, widely known as ELU, is a function that tends to converge the
cost to zero faster and produce more accurate results. Unlike other activation functions,
ELU has an extra alpha constant (α), which should be a positive number.
ELU is very similar to ReLU except for negative inputs: both are the identity function
for non-negative inputs. For negative inputs, ELU smoothly approaches −α, whereas
ReLU is sharply clipped to zero.
Function:
$$R(z) = \begin{cases} z & z > 0 \\ \alpha (e^{z} - 1) & z \le 0 \end{cases}$$
Derivative:
$$R'(z) = \begin{cases} 1 & z > 0 \\ \alpha e^{z} & z < 0 \end{cases}$$
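The guide includes no code for ELU, so here is a minimal sketch in the style of the other snippets (the function names are my own):

import numpy as np

def elu(z, alpha=1.0):
    # Identity for positive inputs; smoothly approaches -alpha for negative inputs.
    return z if z > 0 else alpha * (np.exp(z) - 1)

def elu_prime(z, alpha=1.0):
    # Gradient is 1 for positive inputs and alpha * e^z otherwise.
    return 1 if z > 0 else alpha * np.exp(z)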
Pros
ELU becomes smooth slowly until its output equals −α, whereas ReLU is sharply
clipped to zero.
ELU is a strong alternative to ReLU.
Unlike ReLU, ELU can produce negative outputs.
Cons
For x > 0, it can blow up the activation, since the output range is [0, ∞).
ReLU
A relatively recent invention, ReLU stands for Rectified Linear Unit. The formula is
deceptively simple: max(0, z). Despite its name and appearance, it's not linear and
provides the same benefits as sigmoid, but with better performance.
Function:
$$R(z) = \begin{cases} z & z > 0 \\ 0 & z \le 0 \end{cases}$$
Derivative:
$$R'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$
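For symmetry with the other activations, here is a minimal ReLU sketch (not part of the original guide; the function names are my own):

def relu(z):
    # Pass positive inputs through unchanged; clamp everything else to zero.
    return max(0, z)

def relu_prime(z):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return 1 if z > 0 else 0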
Pros
It is computationally efficient: evaluating max(0, z) involves only a simple comparison,
so networks using ReLU tend to converge quickly.
For positive inputs the gradient is 1, so it does not saturate in that region the way
sigmoid does.
Cons
One of its limitations is that it should only be used within the hidden layers of a neural
network model.
Some gradients can be fragile during training and can die: a weight update can leave a
neuron in a state where it never activates again on any data point. Put simply, ReLU
can result in dead neurons.
In other words, for activations in the region x < 0 the gradient is 0, so the weights will
not get adjusted during gradient descent. Neurons that go into that state stop
responding to variations in error or input (since the gradient is 0, nothing changes).
This is called the dying ReLU problem.
The range of ReLU is [0, ∞). This means it can blow up the activation.
LeakyReLU
Leaky ReLU is a variant of ReLU. Instead of being 0 when z < 0, a leaky ReLU allows
a small, non-zero, constant gradient α (normally, α = 0.01). However, the
consistency of the benefit across tasks is presently unclear. [1]
Function:
$$R(z) = \begin{cases} z & z > 0 \\ \alpha z & z \le 0 \end{cases}$$
Derivative:
$$R'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z < 0 \end{cases}$$
def leakyrelu(z, alpha):
    return max(alpha * z, z)

def leakyrelu_prime(z, alpha):
    return 1 if z > 0 else alpha
Pros
Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small
negative slope (of 0.01, or so).
Cons
As noted above, the consistency of the benefit of the small negative slope across tasks
is presently unclear.
Sigmoid
Sigmoid takes a real value as input and outputs another value between 0 and 1. It’s
easy to work with and has all the nice properties of activation functions: it’s non-linear,
continuously differentiable, monotonic, and has a fixed output range.
Function:
$$S(z) = \frac{1}{1 + e^{-z}}$$
Derivative:
$$S'(z) = S(z) \cdot (1 - S(z))$$
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))
Pros
As noted above, it is non-linear, continuously differentiable, monotonic, and has a fixed
output range of (0, 1), which makes it easy to work with.
Cons
Towards either end of the sigmoid function, the Y values respond very little to
changes in X.
This gives rise to the problem of "vanishing gradients".
Its output isn't zero-centered, which makes the gradient updates go too far in different
directions. Since 0 < output < 1, optimization becomes harder.
Sigmoids saturate and kill gradients.
The network refuses to learn further, or learning becomes drastically slow (depending
on the use case, and until the gradient/computation hits floating-point limits).
Tanh
Tanh squashes a real-valued number to the range [-1, 1]. It’s non-linear. But unlike
Sigmoid, its output is zero-centered. Therefore, in practice the tanh non-linearity is
always preferred to the sigmoid nonlinearity. [1]
Function:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Derivative:
$$\tanh'(z) = 1 - \tanh(z)^{2}$$
import numpy as np

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def tanh_prime(z):
    return 1 - np.power(tanh(z), 2)
Pros
The gradient is stronger for tanh than for sigmoid (derivatives are steeper).
Cons
Like the sigmoid, tanh saturates at its extremes, so it also suffers from the vanishing
gradient problem.
https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html