Activation function
An artificial neuron receives inputs and weights, computes the weighted sum of the inputs, and passes that sum to an activation function, which converts it into the neuron's output.
The input is fed to the input layer, and the neurons perform a linear transformation on it using their weights and biases:
x = Σ (weight * input) + bias
An activation function is then applied to this result:
Y = Activation ( Σ (weight * input) + bias )
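As a minimal sketch of this computation (the input, weight, and bias values below are illustrative, not from the text), the output of a single neuron could be computed like this in Python:

import math

def sigmoid(z):
    # Example activation: squashes the weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs, weights, and bias.
inputs  = [0.5, -1.2, 3.0]
weights = [0.4,  0.7, -0.2]
bias    = 0.1

# x = sum(weight * input) + bias
x = sum(w * i for w, i in zip(weights, inputs)) + bias

# Y = Activation(x)
y = sigmoid(x)
print(x, y)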
An activation function helps a neural network learn complex relationships and patterns in the data.
Now, what happens if we do not use any activation function and let a neuron output the weighted sum of its inputs directly? In that case the computation becomes difficult to control, because the weighted sum has no fixed range and can take any value depending on the input. One important use of the activation function is therefore to keep the output restricted to a particular range.
Another use of the activation function is to add non-linearity to the network.
In a neural network, the output of the activation function moves to the next hidden layer and the same process is repeated. This forward movement of information is known as forward propagation. Using the output of forward propagation, the error is calculated, and based on this error value the weights and biases of the neurons are updated. This process is known as back-propagation.
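As a rough sketch of one forward and one backward pass (the single-neuron network, squared-error loss, and learning rate here are assumptions chosen for illustration):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 2.0, 1.0        # one training input and its desired output (illustrative)
w, b, lr = 0.5, 0.0, 0.1    # weight, bias, learning rate (illustrative)

# Forward propagation
y = sigmoid(w * x + b)
error = 0.5 * (y - target) ** 2

# Back-propagation: the chain rule gives the gradients of the error
dz = (y - target) * y * (1 - y)   # d(error)/d(pre-activation)
w -= lr * dz * x                  # weight update
b -= lr * dz                      # bias update
print(error, w, b)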
Non-linearity is important in neural networks because
linear activation functions are not enough to form a
universal function approximator.
Most real-world problems are very complex, so we need non-linear activation functions in a neural network.
A neural network without non-linear activation functions is just a simple linear regression model, no matter how many layers it has.
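A quick numerical check of this claim (the layer sizes and random values are arbitrary): stacking two layers with a linear (identity) activation collapses into a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(3,))
W1 = rng.normal(size=(4, 3)); b1 = rng.normal(size=(4,))
W2 = rng.normal(size=(2, 4)); b2 = rng.normal(size=(2,))

# Two layers with a linear activation ...
two_layers = W2 @ (W1 @ x + b1) + b2
# ... are equivalent to one linear layer W x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True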
Vanishing gradient problem
In neural networks, during back-propagation each weight receives an update proportional to the partial derivative of the error function with respect to that weight. In some cases this derivative is so small that the update becomes tiny. In deep networks especially, the update for a weight in an early layer is obtained by multiplying many partial derivatives together (the chain rule).
If these partial derivatives are all very small, the overall update approaches zero. The weights are then barely able to update, and training converges slowly or not at all. This problem is known as the vanishing gradient problem.
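A small numerical illustration (the depth of 20 layers and the use of sigmoid are arbitrary choices): the sigmoid derivative is at most 0.25, so multiplying many such factors drives the gradient towards zero.

import math

grad = 1.0
for layer in range(20):                    # pretend the error flows back through 20 layers
    s = 1.0 / (1.0 + math.exp(-0.0))       # sigmoid(0) = 0.5, where its derivative is largest
    grad *= s * (1 - s)                    # sigmoid'(z) = s * (1 - s) <= 0.25
print(grad)                                # 0.25**20 ≈ 9e-13: effectively no update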
Exploding gradient problem
Similarly, if the derivative terms are very large, the updates will also be very large. In such a case the algorithm overshoots the minimum and fails to converge. This problem is known as the exploding gradient problem.
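Conversely, a toy illustration of the exploding case (the per-layer factor of 1.8 is arbitrary): if each backpropagated factor is larger than 1, the product blows up.

grad = 1.0
for layer in range(20):
    grad *= 1.8      # pretend each layer contributes a factor of 1.8
print(grad)          # ≈ 1.3e5: an update this large overshoots the minimum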
There are various methods to avoid these problems, and choosing an appropriate activation function is one of them.
Various Activation Functions
1. Binary Step Function :
This activation function is a threshold-based classifier: it decides whether or not the neuron should be activated based on the value of the linear transformation.
In other words, if the input to the activation function is greater than a threshold, the neuron is activated; otherwise it is deactivated, meaning its output is not passed to the next hidden layer. Let us look at it mathematically:
f(x) = 1, when x>=0
= 0, when x<0
Pros-
1. Easy to compute.
2. Can act as a simple binary / linear classifier.
Cons-
1. The gradient of the function is zero, so the weights and biases do not update.
2. It does not allow multi-valued outputs; for example, it cannot classify inputs into one of several categories.
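A minimal implementation of the step function (NumPy is used here for convenience); the derivative is zero everywhere except at x = 0, which is why gradient-based updates stall.

import numpy as np

def binary_step(x):
    # f(x) = 1 when x >= 0, else 0
    return np.where(x >= 0, 1.0, 0.0)

def binary_step_grad(x):
    # The slope is 0 everywhere (undefined exactly at 0), so no gradient signal flows back.
    return np.zeros_like(x)

print(binary_step(np.array([-2.0, 0.0, 3.5])))   # [0. 1. 1.]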
2. Linear Function
The problem with the binary step function is that its gradient is zero, because no component of x is used in the output of the step function.
Instead of a binary function, we can therefore use a linear function, which can be defined as:
f(x) = a * x, where a is a constant
Pros-
1. Easy to compute.
2. The gradient does not become zero; it is a constant, so the weights and biases are updated during back-propagation, but the update factor is always the same.
Cons-
The gradient is the same for every iteration, so the network cannot really reduce the error; it will not be able to train well or capture complex patterns in the data.
Uses: The linear activation function is used in just one place, i.e. the output layer.
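A tiny sketch of the constant-gradient point above (the slope a = 2 is arbitrary):

a = 2.0                     # arbitrary slope of the linear activation

def linear(x):
    return a * x

def linear_grad(x):
    return a                # constant: the gradient never depends on x

print(linear_grad(-5.0), linear_grad(10.0))   # 2.0 2.0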
3. Sigmoid Function-
Sigmoid is an 'S'-shaped non-linear mathematical function whose formula is:
f(x) = 1 / (1 + e^(-x))
Pros-
1. The sigmoid function is continuous and differentiable.
2. It limits the output to the range (0, 1).
3. It gives clear predictions for binary classification.
Cons-
4. It can cause the vanishing gradient problem.
5. It is not centered around zero.
6. Computationally expensive.
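A small NumPy version of the sigmoid and its derivative (the derivative never exceeds 0.25, which is what feeds the vanishing gradient problem mentioned above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # at most 0.25

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ≈ [0.018 0.5 0.982]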
4. Tanh Function-
Tanh (hyperbolic tangent) is an 'S'-shaped non-linear function similar to sigmoid, but its output lies between -1 and 1. Its formula is:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Cons-
1. It can cause the vanishing gradient problem.
2. Computationally expensive.
Uses :- Usually used in the hidden layers of a neural network, as its values lie between -1 and 1; the mean of a hidden layer's outputs therefore comes out to be 0 or very close to it, which helps center the data. This makes learning for the next layer much easier.
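Assuming the function described above is tanh (its output range of -1 to 1 matches the description), a minimal implementation is:

import numpy as np

def tanh(x):
    # Equivalent to np.tanh(x); output lies in (-1, 1) and is centered on 0.
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

print(tanh(np.array([-2.0, 0.0, 2.0])))   # ≈ [-0.964 0. 0.964]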
5. ReLU-
The Rectified Linear Unit, often called just a rectifier or ReLU, is defined as:
f(x) = max(0, x)
Pros-
1. Easy to compute.
2. Does not cause the vanishing gradient problem.
3. Since not all neurons are activated at once, it creates sparsity in the network, making it fast and efficient.
Cons-
4. Can cause the exploding gradient problem.
5. Not zero-centered.
6. Can kill some neurons permanently, since it always outputs 0 for negative values.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, which makes the network sparse and therefore efficient and easy to compute.
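A minimal NumPy ReLU and its derivative (the zero gradient for negative inputs is what can kill neurons, as noted above):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)   # 0 for negative inputs: no update flows back

print(relu(np.array([-3.0, 0.0, 2.5])))   # [0. 0. 2.5]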
6. Leaky ReLU-
Leaky ReLU is an improvement of the ReLU function. ReLU can kill some neurons in each iteration; this is known as the dying ReLU problem. Leaky ReLU overcomes this problem: instead of outputting 0 for negative values, it uses a relatively small multiple of the input to compute the output, so it never kills any neuron.
Pros-
1. Easy to compute.
2. Does not cause the vanishing gradient problem.
3. Does not cause the dying ReLU problem.
Cons-
4. Can cause the exploding gradient problem.
5. Not zero-centered.
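A minimal Leaky ReLU sketch (the slope of 0.01 for negative inputs is a commonly used default, assumed here for illustration):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # A small slope alpha keeps a non-zero gradient for negative inputs.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, 0.0, 2.5])))   # [-0.03 0. 2.5]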
7. Parameterized ReLU-
In Parameterized ReLU (PReLU), instead of fixing the slope for the negative axis, the slope is passed as a new trainable parameter (alpha) that the network learns on its own, which can lead to faster convergence.
Pros-
1. The network learns the most appropriate value of alpha on its own.
2. Does not cause the vanishing gradient problem.
Cons-
3. More expensive to compute than ReLU, because of the extra learnable parameter.
4. Performance depends on the problem.
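A sketch of PReLU (alpha is shown here as a plain variable; in practice a framework would treat it as a trainable parameter updated by back-propagation):

import numpy as np

alpha = 0.25                  # illustrative starting value; learned during training

def prelu(x, a):
    return np.where(x > 0, x, a * x)

def prelu_grad_alpha(x):
    # Gradient of the output w.r.t. alpha: x for negative inputs, 0 otherwise.
    # This is the signal that lets the network learn alpha itself.
    return np.where(x > 0, 0.0, x)

print(prelu(np.array([-2.0, 3.0]), alpha))   # [-0.5 3.]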
8. Swish
The Swish function was proposed by the Google Brain team. Their experiments show that Swish tends to work better than ReLU on deep models across several challenging datasets. The Swish function is obtained by multiplying x by the sigmoid of x:
f(x) = x * sigmoid(x)
Pros-
1. Does not cause the vanishing gradient problem.
2. Shown to perform slightly better than ReLU in several experiments.
Cons-
Computationally expensive.
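A minimal Swish, following the description above (x multiplied by its sigmoid):

import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # same as x * sigmoid(x)

print(swish(np.array([-2.0, 0.0, 2.0])))   # ≈ [-0.238 0. 1.762]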
9. Softmax Function
The softmax function is a generalization of the sigmoid function to a multi-class setting. It is popularly used in the final layer of multi-class classification. It takes a vector of k real numbers and normalizes it into a probability distribution of k probabilities proportional to the exponentials of the input numbers. Before applying softmax, some vector components could be negative or greater than one, and they might not sum to 1; after applying softmax, each component lies in the range 0 to 1 and the components sum to 1, so they can be interpreted as probabilities.
Pros-
It can be used for multi-class classification and is therefore used in the output layer of neural networks.
Cons-
It is computationally expensive, as many exponential terms have to be calculated.
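A small NumPy softmax (subtracting the maximum before exponentiating is a standard numerical-stability trick, added here as an aside):

import numpy as np

def softmax(z):
    z = z - np.max(z)          # stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # illustrative logits
probs = softmax(scores)
print(probs, probs.sum())            # each component in (0, 1), summing to 1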
10. Softplus and Softsign
The softplus function is:
f(x) = ln(1 + e^x)
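Assuming the standard definitions (softplus(x) = ln(1 + e^x) and softsign(x) = x / (1 + |x|)), minimal implementations are:

import numpy as np

def softplus(x):
    return np.log(1.0 + np.exp(x))     # a smooth approximation of ReLU

def softsign(x):
    return x / (1.0 + np.abs(x))       # output in (-1, 1), a gentler alternative to tanh

print(softplus(np.array([-1.0, 0.0, 1.0])))   # ≈ [0.313 0.693 1.313]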