The vanishing gradients problem is one example of the unstable behavior you may encounter when training a deep neural network.
It describes the situation where a deep multilayer feed-forward network or a recurrent neural network
is unable to propagate useful gradient information from the output end of the model back to the
layers near the input end of the model.
The result is the general inability of models with many layers to learn on a given dataset, or their premature convergence to a poor solution.
Many fixes and workarounds have been proposed and investigated, such as alternate weight
initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient
descent. Perhaps the most common change is the use of the rectified linear activation function that
has become the new default, instead of the hyperbolic tangent activation function that was the default
through the late 1990s and 2000s.
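As a minimal sketch of what that change looks like in practice (the layer sizes, the two-input problem, and the use of Keras here are assumptions for illustration, not the tutorial's actual model), the old and new defaults differ only in the activation argument of the hidden layer:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Illustrative only: the sole difference between the "classical" and the
# "modern" configuration is the hidden-layer activation function.
def make_model(activation):
    model = Sequential([
        Dense(5, activation=activation, input_dim=2),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return model

tanh_model = make_model('tanh')  # the default through the late 1990s and 2000s
relu_model = make_model('relu')  # the current default
```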
In this tutorial, you will discover how to diagnose a vanishing gradient problem when training a
neural network model and how to fix it using an alternate activation function and weight initialization
scheme.
After completing this tutorial, you will know:

- The vanishing gradients problem limits the development of deep neural networks with classically popular activation functions such as the hyperbolic tangent.
- How to fix a deep Multilayer Perceptron for classification using the ReLU activation function and He weight initialization (a minimal sketch follows this list).
- How to use TensorBoard to diagnose a vanishing gradient problem and confirm the impact of ReLU on improving the flow of gradients through the model.
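As a preview of the second point above (a rough sketch only; the layer widths and the binary classification setup are assumptions, not the tutorial's final model), combining ReLU hidden layers with He weight initialization in Keras looks like this:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Sketch: ReLU hidden layers paired with He weight initialization.
model = Sequential([
    Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'),
    Dense(5, activation='relu', kernel_initializer='he_uniform'),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
```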
Kick-start your project with my new book Better Deep Learning, including step-by-step
tutorials and the Python source code files for all examples.
Let’s get started.
Tutorial Overview
This tutorial is divided into five parts.
Neural networks are trained using stochastic gradient descent. This involves first calculating the prediction error made by the model and using the error to estimate a gradient used to update each weight in the network so that less error is made next time. This error gradient is propagated backward through the network from the output layer to the input layer.
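To make the update step concrete, here is a minimal sketch of a single stochastic gradient descent step in TensorFlow (the data, model shape, and learning rate are made up for illustration):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Made-up data and a small model, only to show the three steps:
# predict, measure the error, and backpropagate a gradient to update weights.
X = tf.random.normal((32, 2))
y = tf.cast(tf.random.uniform((32, 1)) > 0.5, tf.float32)

model = Sequential([
    Dense(5, activation='tanh', input_dim=2),
    Dense(1, activation='sigmoid'),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

with tf.GradientTape() as tape:
    error = loss_fn(y, model(X, training=True))           # prediction error
grads = tape.gradient(error, model.trainable_variables)   # error gradient per weight
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update the weights
```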
It is desirable to train neural networks with many layers, as the addition of more layers increases the
capacity of the network, making it capable of learning a large training dataset and efficiently
representing more complex mapping functions from inputs to outputs.
A problem with training networks with many layers (e.g. deep neural networks) is that the gradient diminishes dramatically as it is propagated backward through the network. The error gradient may be so small by the time it reaches layers close to the input of the model that it has very little effect. As such, this problem is referred to as the “vanishing gradients” problem.
Vanishing gradients make it difficult to know which direction the parameters should move to
improve the cost function …
— Random Walk Initialization for Training Very Deep Feedforward Networks, 2014.
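A small numerical sketch makes this concrete (the network here is a made-up stack of tanh layers with small random weights, not a trained model): because the derivative of tanh is at most 1, the chain of per-layer products shrinks the gradient as it travels back toward the input.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, width = 20, 10

# Forward pass through a stack of tanh layers with small random weights.
x = rng.normal(size=width)
weights, activations = [], [x]
for _ in range(n_layers):
    W = rng.normal(scale=0.1, size=(width, width))
    weights.append(W)
    activations.append(np.tanh(W @ activations[-1]))

# Backpropagate a unit gradient and watch its magnitude decay layer by layer.
grad = np.ones(width)
for layer in reversed(range(n_layers)):
    # d tanh(z)/dz = 1 - tanh(z)^2 is never greater than 1
    grad = weights[layer].T @ (grad * (1 - activations[layer + 1] ** 2))
    print(f'layer {layer:2d}: gradient norm = {np.linalg.norm(grad):.2e}')
```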
Vanishing gradients are a particular problem with recurrent neural networks, as updating the network involves unrolling it for each input time step, in effect creating a very deep network whose weights all require updates. A modest recurrent neural network may have 200 to 400 input time steps, resulting conceptually in a very deep network.
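For example (the sequence length and layer sizes here are arbitrary choices for illustration), a single-layer Keras SimpleRNN over 400 time steps is, once unrolled for backpropagation through time, effectively a 400-layer tanh network:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Illustrative only: one recurrent layer over 400 time steps.
# Unrolled for backpropagation through time, gradients must pass through
# 400 repeated applications of the same recurrent weights and tanh activation.
model = Sequential([
    SimpleRNN(32, activation='tanh', input_shape=(400, 1)),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='sgd')
```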
The vanishing gradients problem may manifest in a Multilayer Perceptron as a slow rate of improvement during training and perhaps premature convergence, e.g. continued training does not result in any further improvement. Inspecting the changes to the weights during training, we would see more change (i.e. more learning) occurring in the layers closer to the output layer and less change occurring in the layers closer to the input layer.
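One hand-rolled way to see this (a sketch on made-up data with a deliberately deep tanh MLP; the tutorial itself uses TensorBoard for the same diagnosis) is to print the average gradient magnitude for each layer's weights:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Made-up data and a deliberately deep tanh MLP.
X = tf.random.normal((128, 2))
y = tf.cast(tf.random.uniform((128, 1)) > 0.5, tf.float32)

model = Sequential([Dense(5, activation='tanh', input_dim=2)] +
                   [Dense(5, activation='tanh') for _ in range(4)] +
                   [Dense(1, activation='sigmoid')])

with tf.GradientTape() as tape:
    loss = tf.keras.losses.BinaryCrossentropy()(y, model(X, training=True))
grads = tape.gradient(loss, model.trainable_variables)

# Layers near the input should show much smaller average gradients
# than layers near the output when gradients are vanishing.
for var, grad in zip(model.trainable_variables, grads):
    if 'kernel' in var.name:
        print(f'{var.name}: mean |gradient| = {float(tf.reduce_mean(tf.abs(grad))):.2e}')
```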
There are many techniques that can be used to reduce the impact of the vanishing gradients problem
for feed-forward neural networks, most notably alternate weight initialization schemes and use of
alternate activation functions.