Activation Functions
When constructing Artificial Neural Network (ANN) models, one of the primary
considerations is choosing activation functions for hidden and output layers that
are differentiable. This is because calculating the backpropagated error signal used to
determine ANN parameter updates requires the gradient of the activation function.
Three of the most commonly used activation functions in ANNs are the identity function,
the logistic sigmoid function, and the hyperbolic tangent function. Examples of these functions
and their associated gradients (derivatives in 1D) are plotted in Figure 1.
Figure 1: Common activation functions used in artificial neural networks, along with their derivatives.
In the remainder of this post, we derive the derivatives/gradients for each of these common
activation functions.
The Identity Activation Function

The first activation function we consider is the identity (linear) function,

$$g_{\text{linear}}(z) = z$$

(Figure 1, red curves). This activation function simply maps the pre-activation $z$ to itself and can
output values that range $(-\infty, \infty)$. Why would one want to use an identity activation
function? After all, a multi-layered network with linear activations at each layer can be equivalently
formulated as a single-layered linear network. It turns out that the identity activation function is
surprisingly useful. For example, a multi-layer network that has nonlinear activation functions
amongst the hidden units and an output layer that uses the identity activation function
implements a powerful form of nonlinear regression. Specifically, the network can predict
continuous target values using a linear combination of signals that arise from one or more layers
of nonlinear transformations of the input.
The derivative of $g_{\text{linear}}(z)$, $g'_{\text{linear}}(z)$, is simply 1, in the case of 1D inputs. For vector inputs of length $D$
the gradient is $\nabla g_{\text{linear}}(\mathbf{z}) = \mathbf{1}_D$, a vector of ones of length $D$.
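To make this concrete, here is a minimal NumPy sketch of the identity activation and its gradient (the function names are mine, introduced for illustration, not from the post):

```python
import numpy as np

def linear(z):
    """Identity activation: maps the pre-activation z to itself."""
    return z

def linear_grad(z):
    """Gradient of the identity activation: a vector of ones matching z's shape."""
    return np.ones_like(z)

z = np.array([-2.0, 0.0, 3.5])
print(linear(z))       # [-2.   0.   3.5]
print(linear_grad(z))  # [1. 1. 1.]
```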
The Logistic Sigmoid Activation Function

Another commonly used activation function is the logistic sigmoid,

$$g_{\text{logistic}}(z) = \frac{1}{1 + e^{-z}}$$

(Figure 1, blue curves), which outputs values that range $(0, 1)$. The logistic sigmoid is motivated
somewhat by biological neurons and can be interpreted as the probability of an artificial neuron
“firing” given its inputs. (It turns out that the logistic sigmoid can also be derived as
the maximum likelihood solution for logistic regression in statistics.) Calculating the
derivative of the logistic sigmoid function makes use of the quotient rule and a clever trick that
both adds and subtracts a one from the numerator:

$$
\begin{aligned}
g'_{\text{logistic}}(z) &= \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right) \\
&= \frac{e^{-z}}{(1 + e^{-z})^{2}} \quad \text{(quotient rule)} \\
&= \frac{1 + e^{-z} - 1}{(1 + e^{-z})^{2}} \quad \text{(add and subtract 1 in the numerator)} \\
&= \frac{1 + e^{-z}}{(1 + e^{-z})^{2}} - \left(\frac{1}{1 + e^{-z}}\right)^{2} \\
&= g_{\text{logistic}}(z) - g_{\text{logistic}}(z)^{2} \\
&= g_{\text{logistic}}(z)\bigl(1 - g_{\text{logistic}}(z)\bigr)
\end{aligned}
$$

Here we see a convenient property of the logistic sigmoid: its derivative can be expressed entirely in terms of the feed-forward activation $g_{\text{logistic}}(z)$ evaluated at $z$, so the activation values cached during the forward pass can be reused to compute gradients during backpropagation.
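As a rough illustration of how this caching can look in code, the following NumPy sketch (function names are my own, for illustration only) computes the sigmoid derivative directly from the cached feed-forward activation rather than from $z$ itself:

```python
import numpy as np

def logistic(z):
    """Logistic sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_from_activation(a):
    """Sigmoid derivative expressed via the cached activation a = logistic(z)."""
    return a * (1.0 - a)

z = np.array([-2.0, 0.0, 3.5])
a = logistic(z)                          # cached during the forward pass
print(logistic_grad_from_activation(a))  # equals logistic(z) * (1 - logistic(z))
```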
The Hyperbolic Tangent Activation Function

An alternative to the logistic sigmoid is the hyperbolic tangent, or tanh function (Figure 1, green
curves):

$$g_{\text{tanh}}(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

Like the logistic sigmoid, the tanh function is also sigmoidal (“s”-shaped), but instead outputs
values that range $(-1, 1)$. Thus, strongly negative inputs to the tanh will map to negative outputs.
Additionally, only zero-valued inputs are mapped to near-zero outputs. These properties make
the network less likely to get “stuck” during training. Calculating the gradient for the tanh
function also uses the quotient rule:

$$
\begin{aligned}
g'_{\text{tanh}}(z) &= \frac{\partial}{\partial z}\left(\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\right) \\
&= \frac{(e^{z} + e^{-z})(e^{z} + e^{-z}) - (e^{z} - e^{-z})(e^{z} - e^{-z})}{(e^{z} + e^{-z})^{2}} \quad \text{(quotient rule)} \\
&= 1 - \left(\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\right)^{2} \\
&= 1 - g_{\text{tanh}}(z)^{2}
\end{aligned}
$$
Similar to the derivative for the logistic sigmoid, the derivative of $g_{\text{tanh}}(z)$ is a function of the feed-
forward activation evaluated at $z$, namely $1 - g_{\text{tanh}}(z)^{2}$. Thus, the same caching trick can be
used for layers that implement tanh activation functions.
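As a sanity check on the result above, here is a small NumPy sketch (my own, not from the post) that compares the $1 - g_{\text{tanh}}(z)^{2}$ expression, computed from cached activations, against a central finite-difference estimate of the derivative:

```python
import numpy as np

def tanh_grad_from_activation(a):
    """tanh derivative expressed via the cached activation a = tanh(z): 1 - a**2."""
    return 1.0 - a ** 2

z = np.linspace(-3.0, 3.0, 7)
a = np.tanh(z)                                   # cached feed-forward activation
analytic = tanh_grad_from_activation(a)
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)  # central difference
print(np.allclose(analytic, numeric, atol=1e-6))             # True
```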
Wrapping Up
In this post we reviewed a few activation functions commonly used in the neural network literature
and their derivative calculations. These activation functions are motivated by biology and/or
provide some handy implementation tricks like calculating derivatives using cached feed-
forward activation values. Note that there are also many other options for activation functions
not covered here: e.g. rectification, soft rectification, polynomial kernels, etc. Indeed, finding and
evaluating novel activation functions is an active subfield of machine learning research.
However, the three basic activations covered here can be used to solve a majority of the machine
learning problems one will likely face.