Activation Functions

Introduction

When constructing Artificial Neural Network (ANN) models, one of the primary
considerations is choosing activation functions for the hidden and output layers that
are differentiable. This is because calculating the backpropagated error signal used to
determine ANN parameter updates requires the gradient of the activation function.
Three of the most commonly used activation functions in ANNs are the identity function,
the logistic sigmoid function, and the hyperbolic tangent function. Examples of these functions
and their associated gradients (derivatives in 1D) are plotted in Figure 1.

Figure 1: Common activation functions used in artificial neural networks, along with their derivatives

In the remainder of this post, we derive the derivatives/gradients for each of these common
activation functions.

The Identity Activation Function


The simplest activation function, one that is commonly used for the output layer activation
function in regression problems, is the identity/linear activation function:

$$g_{\text{linear}}(a) = a$$

(Figure 1, red curves). This activation function simply maps the pre-activation to itself and can
output values that range $(-\infty, \infty)$. Why would one want to use an identity activation
function? After all, a multi-layered network with linear activations at each layer can be equivalently
formulated as a single-layered linear network. It turns out that the identity activation function is
surprisingly useful. For example, a multi-layer network that has nonlinear activation functions
amongst the hidden units and an output layer that uses the identity activation function
implements a powerful form of nonlinear regression. Specifically, the network can predict
continuous target values using a linear combination of signals that arise from one or more layers
of nonlinear transformations of the input.

The derivative of $g_{\text{linear}}(a)$, $g'_{\text{linear}}(a)$, is simply 1 in the case of 1D inputs. For vector inputs of length $D$,
the gradient is $\vec{1}^{1 \times D}$, a vector of ones of length $D$.
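As a quick illustration, here is a minimal NumPy sketch of the identity activation and its gradient; the function names are my own, not taken from any particular library:

```python
import numpy as np

def identity(a):
    # Identity/linear activation: maps the pre-activation to itself
    return a

def identity_grad(a):
    # Gradient of the identity activation: ones matching the input shape
    return np.ones_like(a)

a = np.array([-2.0, 0.0, 3.5])
print(identity(a))       # [-2.   0.   3.5]
print(identity_grad(a))  # [1. 1. 1.]
```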

The Logistic Sigmoid Activation Function


Another function that is often used as the output activation function for binary classification
problems (i.e. outputs with values in the range $(0, 1)$) is the logistic sigmoid. The logistic sigmoid has
the following form:

$$g_{\text{logistic}}(a) = \frac{1}{1 + e^{-a}}$$

(Figure 1, blue curves) and outputs values that range $(0, 1)$. The logistic sigmoid is motivated
somewhat by biological neurons and can be interpreted as the probability of an artificial neuron
“firing” given its inputs. (It turns out that the logistic sigmoid can also be derived as
the maximum likelihood solution for logistic regression in statistics). Calculating the
derivative of the logistic sigmoid function makes use of the quotient rule and a clever trick that
both adds and subtracts a one from the numerator:

$$
\begin{aligned}
g'_{\text{logistic}}(a) &= \frac{\partial}{\partial a}\left(\frac{1}{1 + e^{-a}}\right) \\
&= \frac{e^{-a}}{(1 + e^{-a})^{2}} \\
&= \frac{(1 + e^{-a}) - 1}{(1 + e^{-a})^{2}} \\
&= \frac{1}{1 + e^{-a}} - \left(\frac{1}{1 + e^{-a}}\right)^{2} \\
&= g_{\text{logistic}}(a) - g_{\text{logistic}}(a)^{2} \\
&= g_{\text{logistic}}(a)\,\big(1 - g_{\text{logistic}}(a)\big)
\end{aligned}
$$

Here we see that $g'_{\text{logistic}}(a)$ evaluated at $a$ is simply $g_{\text{logistic}}(a)$ weighted by $(1 - g_{\text{logistic}}(a))$.


This turns out to be a convenient form for efficiently calculating gradients used in neural
networks: if one keeps in memory the feed-forward activations of the logistic function for a
given layer, the gradients for that layer can be evaluated using simple multiplication and
subtraction, rather than re-evaluating the sigmoid function, which requires
extra exponentiation.
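To make the caching trick concrete, here is a minimal NumPy sketch; the function names are illustrative rather than drawn from any particular library:

```python
import numpy as np

def logistic(a):
    # Logistic sigmoid: g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def logistic_grad_from_activation(g):
    # Gradient computed from the cached feed-forward activation g = logistic(a):
    # g'(a) = g(a) * (1 - g(a)), so no extra exponentiation is needed
    return g * (1.0 - g)

a = np.array([-2.0, 0.0, 2.0])
g = logistic(a)                          # cache this during the forward pass
print(logistic_grad_from_activation(g))  # approx. [0.105 0.25  0.105]
```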

The Hyperbolic Tangent Activation Function


Though the logistic sigmoid has a nice biological interpretation, it turns out that the logistic
sigmoid can cause a neural network to get “stuck” during training. This is due in part to the fact
that if a strongly-negative input is provided to the logistic sigmoid, it outputs values very near
zero. Since neural networks use the feed-forward activations to calculate parameter gradients
(again, see this previous post for details), this can result in model parameters that are updated
less regularly than we would like, and are thus “stuck” in their current state.

An alternative to the logistic sigmoid is the hyperbolic tangent, or tanh function (Figure 1, green
curves):

$$g_{\text{tanh}}(a) = \frac{\sinh(a)}{\cosh(a)} = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$

Like the logistic sigmoid, the tanh function is also sigmoidal (“s”-shaped), but instead outputs
values that range $(-1, 1)$. Thus, strongly negative inputs to the tanh will map to negative outputs.
Additionally, only zero-valued inputs are mapped to near-zero outputs. These properties make
the network less likely to get “stuck” during training. Calculating the gradient for the tanh
function also uses the quotient rule:

$$
\begin{aligned}
g'_{\text{tanh}}(a) &= \frac{\partial}{\partial a}\left(\frac{\sinh(a)}{\cosh(a)}\right) \\
&= \frac{\cosh(a)\cosh(a) - \sinh(a)\sinh(a)}{\cosh^{2}(a)} \\
&= 1 - \frac{\sinh^{2}(a)}{\cosh^{2}(a)} \\
&= 1 - g_{\text{tanh}}(a)^{2}
\end{aligned}
$$

Similar to the derivative for the logistic sigmoid, the derivative of $g_{\text{tanh}}(a)$ is a function of the feed-
forward activation evaluated at $a$, namely $1 - g_{\text{tanh}}(a)^{2}$. Thus, the same caching trick can be
used for layers that implement tanh activation functions.
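As with the logistic sigmoid, a minimal sketch (again with my own function names) shows the derivative computed directly from the cached activation:

```python
import numpy as np

def tanh_activation(a):
    # Hyperbolic tangent activation: g(a) = (e^a - e^-a) / (e^a + e^-a)
    return np.tanh(a)

def tanh_grad_from_activation(g):
    # Gradient from the cached activation g = tanh(a): g'(a) = 1 - g(a)^2
    return 1.0 - g ** 2

a = np.array([-2.0, 0.0, 2.0])
g = tanh_activation(a)               # cache this during the forward pass
print(tanh_grad_from_activation(g))  # approx. [0.071 1.    0.071]
```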

Wrapping Up
In this post we reviewed a few activation functions commonly used in the neural network literature
and their derivative calculations. These activation functions are motivated by biology and/or
provide some handy implementation tricks like calculating derivatives using cached feed-
forward activation values. Note that there are also many other options for activation functions
not covered here: e.g. rectification, soft rectification, polynomial kernels, etc. Indeed, finding and
evaluating novel activation functions is an active subfield of machine learning research.
However, the three basic activations covered here can be used to solve a majority of the machine
learning problems one will likely face.
