Chapter 15: RNN
Tsz-Chiu Au
[email protected]
Recurrent Neurons
• At each time step t (also called a frame), a recurrent neuron receives
the inputs x(t) as well as its own output from the previous time step, y(t–1).
» Since there is no previous output at the first time step, it is generally set to 0.
A Layer of Recurrent Neurons
• At each time step t, every neuron receives both the input vector x(t) and
the output vector from the previous time step y(t–1).
• Each recurrent neuron has two sets of weights: Wx for the inputs x(t) and
Wy for the outputs of the previous time step, y(t–1).
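• For a single instance, the output of the whole recurrent layer at time step t can be
written as follows (φ is the activation function and b the bias vector):

$$
\mathbf{y}_{(t)} = \phi\big(\mathbf{W}_x^\top \mathbf{x}_{(t)} + \mathbf{W}_y^\top \mathbf{y}_{(t-1)} + \mathbf{b}\big)
$$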
Memory Cells
• A recurrent neuron has memory because its output is a function of all the
inputs from previous time steps.
• A part of a neural network that preserves some state across time steps is
called a memory cell (or simply a cell).
• A single recurrent neuron, or a layer of recurrent neurons, is a very basic
cell, capable of learning only short patterns.
» To learn longer patterns, a more powerful type of cell is needed.
• A cell’s state at time step t, denoted h(t) (the “h” stands for “hidden”), is a
function of some inputs at that time step and its state at the previous time
step: h(t) = f(h(t–1), x(t)).
• The output at time step t, denoted y(t), is also a function of the previous
state and the current inputs.
Input and Output Sequences
• Sequence-to-sequence network
» E.g., predicting time series such as
stock prices
• Sequence-to-vector network
» E.g., feed the network a sequence of
words corresponding to a movie
review and output a sentiment
score
• Vector-to-sequence network
» E.g., the input could be an image,
and the output could be a caption
for that image.
• Encoder–Decoder
» E.g., translating a sentence from one
language to another.
» Feed the network a sentence in one
language, the encoder would
convert this sentence into a single
vector representation, and then the
decoder would decode this vector
into a sentence in another language.
Training RNNs
• Backpropagation through time (BPTT)
» First, forward pass through the unrolled network
» Second, the output sequence is evaluated using a cost function C(Y(0), Y(1), ..., Y(T))
§ The cost function can ignore some outputs
» Third, the gradients of that cost function are then propagated backward through the unrolled
network
» Fourth, the model parameters are updated using the gradients computed during BPTT.
§ Since the same parameters W and b are used at each time step, backpropagation will do
the right thing and sum over all time steps.
Forecasting a Time Series
• A time series is a sequence of data, one per time step.
» A univariate time series has a single value per time step.
» A multivariate time series has multiple values per time step.
• Forecasting is the task of predicting future values.
» E.g., forecast the value at the next time step (represented by the X) in the following
graphs
• To generate time series for our experiments, we use a function that returns a NumPy
array of shape [batch size, time steps, 1], where each series is the sum of two sine
waves of fixed amplitudes but random frequencies and phases, plus a bit of noise
(a sketch of such a generator follows this list).
» In general, the input features are represented as 3D arrays of shape [batch size, time
steps, dimensionality], where dimensionality is 1 for univariate time series and more for
multivariate time series.
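• A minimal sketch of such a generator (the name generate_time_series and the exact
amplitudes and frequency ranges are illustrative assumptions):

    import numpy as np

    def generate_time_series(batch_size, n_steps):  # illustrative name
        # random frequencies and phase offsets for the two sine waves
        freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
        time = np.linspace(0, 1, n_steps)
        series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
        series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # + wave 2
        series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # + noise
        return series[..., np.newaxis].astype(np.float32)  # shape [batch, steps, 1]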
• Let’s create a training set, a validation set, and a test set.
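» For example, using the generator sketched above (each input window has 50 time steps
and the target is the value at the next step):

    # 10,000 series of 51 steps: the first 50 are the inputs, the last is the target
    n_steps = 50
    series = generate_time_series(10000, n_steps + 1)
    X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
    X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
    X_test,  y_test  = series[9000:, :n_steps],  series[9000:, -1]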
Baseline Metrics
• It is a good idea to compare our RNN models against some simple baselines, to make
sure they actually perform better than trivial approaches.
• Baseline 1: naive forecasting
» Use the last value in a series to predict the next value.
» It gives a mean squared error of about 0.020 in our previous example.
• Baseline 2: a simple linear model (flatten each input window and use a single Dense
neuron), sketched below.
» If we compile this model using the MSE loss and the default Adam optimizer, then fit it
on the training set for 20 epochs and evaluate it on the validation set, we get an MSE of
about 0.004.
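• A minimal sketch of this linear baseline, assuming the 50-step windows created earlier:

    from tensorflow import keras

    # Baseline 2 (sketch): flatten each 50-step window and fit one linear unit
    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[50, 1]),
        keras.layers.Dense(1)
    ])
    model.compile(loss="mse", optimizer="adam")
    history = model.fit(X_train, y_train, epochs=20,
                        validation_data=(X_valid, y_valid))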
Implementing a Simple RNN
• Let’s build a simple RNN, then stack several recurrent layers to form a deep RNN, and
compare them with the baselines.
• Note that you must set return_sequences=True for all recurrent layers
except the last one.
• If you compile, fit, and evaluate the deep RNN, you will find that it reaches an
MSE of about 0.003 (i.e., better than the linear model).
• The last layer is too simple to serve as the output layer, so we can replace it with a
Dense layer (see the sketch below).
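• A minimal sketch of a deep RNN with a Dense output layer (layer sizes are illustrative):

    from tensorflow import keras

    # All recurrent layers except the last return full sequences;
    # the output layer is a Dense layer predicting a single value.
    model = keras.models.Sequential([
        keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.SimpleRNN(20),
        keras.layers.Dense(1)
    ])
    model.compile(loss="mse", optimizer="adam")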
Forecasting Several Time Steps Ahead
• To predict not just the value at the next time step but also the next 10 values
» One simple way is to use the trained model to predict the next value, then add that value
to the inputs, and use the model again to predict the following value, and so on.
• We can instead train an RNN that predicts all 10 next values at once, and go further by
turning it into a sequence-to-sequence model that predicts the next 10 values at each
and every time step (sketched below).
» With this sequence-to-sequence model, we get a validation MSE of about 0.006, which is
25% better than the previous model.
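• A minimal sketch of the sequence-to-sequence model (an assumed setup in which the
targets Y_train have shape [batch size, time steps, 10]):

    from tensorflow import keras

    # TimeDistributed applies the Dense(10) layer at every time step,
    # so the model outputs the next 10 values at each step.
    model = keras.models.Sequential([
        keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.SimpleRNN(20, return_sequences=True),
        keras.layers.TimeDistributed(keras.layers.Dense(10))
    ])
    model.compile(loss="mse", optimizer="adam")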
Unstable Gradients Problem
• To deal with the unstable gradients problem, we can reuse the same tricks
for deep nets:
» good parameter initialization, faster optimizers, dropout, and so on.
• However, unlike deep nets such as CNNs, we should not use nonsaturating
activation functions (e.g., ReLU) for RNNs.
» They may actually make the RNN even more unstable during training.
§ Since the same weights are used at every time step, a small increase in the outputs at
one time step can keep growing and eventually cause the outputs to explode.
» Hence, use a saturating activation function like the hyperbolic tangent (the default).
• The gradients themselves can explode too.
» If you notice that training is unstable, you may want to monitor the size of the
gradients (e.g., using TensorBoard)
» Use Gradient Clipping when needed.
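• For example, a minimal sketch of gradient clipping with a Keras optimizer (the model
and clipping value are illustrative):

    from tensorflow import keras

    # Clip each gradient component to [-1.0, 1.0] during training;
    # clipnorm=1.0 would instead clip the whole gradient vector by its norm.
    model = keras.models.Sequential([keras.layers.SimpleRNN(1, input_shape=[None, 1])])
    optimizer = keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
    model.compile(loss="mse", optimizer=optimizer)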
Layer Normalization
• Batch Normalization (BN) cannot be used as efficiently with RNNs as with
deep feedforward nets.
» In fact, you cannot use BN between time steps, only between recurrent layers.
» Applying BN between recurrent layers (i.e., vertically in Figure 15-7) was found to be
only slightly better than nothing, and it did not help within recurrent layers (i.e.,
horizontally).
• Layer Normalization: instead of normalizing across the batch dimension, it
normalizes across the features dimension.
» Like BN, Layer Normalization learns a scale and an offset parameter for each input.
» In an RNN, it is typically used right after the linear combination of the inputs and the hidden
states.
• One advantage is that it can compute the required statistics on the fly, at each
time step, independently for each instance.
» This also means that it behaves the same way during training and testing (as opposed to BN).
» It does not need to use exponential moving averages to estimate the feature statistics across
all instances in the training set.
Implementation of Layer Normalization
• Use tf.keras to implement Layer Normalization within a simple memory cell.
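• A minimal sketch of such a custom cell (the class name LNSimpleRNNCell and layer
sizes are illustrative):

    from tensorflow import keras

    # A custom cell that applies Layer Normalization right after the linear
    # combination of the inputs and the hidden state, then the activation.
    class LNSimpleRNNCell(keras.layers.Layer):
        def __init__(self, units, activation="tanh", **kwargs):
            super().__init__(**kwargs)
            self.state_size = units
            self.output_size = units
            # no activation here: normalize first, then activate
            self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
            self.layer_norm = keras.layers.LayerNormalization()
            self.activation = keras.activations.get(activation)

        def call(self, inputs, states):
            outputs, new_states = self.simple_rnn_cell(inputs, states)
            norm_outputs = self.activation(self.layer_norm(outputs))
            return norm_outputs, [norm_outputs]

    # Wrap the custom cell in a generic keras.layers.RNN layer to use it in a model.
    model = keras.models.Sequential([
        keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True,
                         input_shape=[None, 1]),
        keras.layers.RNN(LNSimpleRNNCell(20)),
        keras.layers.Dense(1)
    ])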
• To add dropout, all recurrent layers (except for keras.layers.RNN) and all cells
provided by Keras have a dropout hyperparameter and a recurrent_dropout
hyperparameter.
» The former defines the dropout rate to apply to the inputs (at each time step), and the latter
defines the dropout rate for the hidden states (also at each time step).
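• For example (a sketch; the dropout rates and layer sizes are illustrative):

    from tensorflow import keras

    # dropout applies to the inputs, recurrent_dropout to the hidden states,
    # both at each time step.
    model = keras.models.Sequential([
        keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1],
                               dropout=0.2, recurrent_dropout=0.2),
        keras.layers.SimpleRNN(20, dropout=0.2, recurrent_dropout=0.2),
        keras.layers.Dense(1)
    ])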
Long Short-Term Memory
• In RNNs, some information is lost at each time step.
» After a while, the RNN’s state contains virtually no trace of the first inputs.
• To tackle this problem, various types of cells with long-term memory have
been introduced.
» They have proven so successful that the basic cells are not used much anymore.
• The most popular long-term memory cell is the Long Short-Term Memory
(LSTM) cell.
• Two ways to add LSTM layers to a model in Keras:
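» A sketch of both options (layer sizes are illustrative): use the keras.layers.LSTM
layer directly, or wrap a keras.layers.LSTMCell in a generic keras.layers.RNN layer.

    from tensorflow import keras

    # Option 1: use the LSTM layer directly
    # (it uses an optimized implementation when running on a GPU).
    model = keras.models.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20),
        keras.layers.Dense(1)
    ])

    # Option 2: wrap an LSTMCell in a generic keras.layers.RNN layer.
    model = keras.models.Sequential([
        keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True,
                         input_shape=[None, 1]),
        keras.layers.RNN(keras.layers.LSTMCell(20)),
        keras.layers.Dense(1)
    ])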
• Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for
their connection to the input vector x(t).
• Whi, Whf, Who, and Whg are the weight matrices of each of the four layers
for their connection to the previous short-term state h(t–1).
• bi, bf, bo, and bg are the bias terms for each of the four layers. Note that
TensorFlow initializes bf to a vector full of 1s instead of 0s. This prevents the cell
from forgetting everything at the beginning of training.
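• For reference, these parameters appear in the standard LSTM computations (σ is the
logistic function, ⊗ denotes element-wise multiplication):

$$
\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\big(\mathbf{W}_{xi}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hi}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_i\big) \\
\mathbf{f}_{(t)} &= \sigma\big(\mathbf{W}_{xf}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hf}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_f\big) \\
\mathbf{o}_{(t)} &= \sigma\big(\mathbf{W}_{xo}^\top \mathbf{x}_{(t)} + \mathbf{W}_{ho}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_o\big) \\
\mathbf{g}_{(t)} &= \tanh\big(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_g\big) \\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} \\
\mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\big(\mathbf{c}_{(t)}\big)
\end{aligned}
$$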
Peephole Connections
• In a regular LSTM cell, the gate controllers can look only at the input
x(t) and the previous short-term state h(t–1).
• An LSTM variant with extra connections called peephole connections
» The previous long-term state c(t–1) is added as an input to the controllers of the forget
gate and the input gate
» The current long-term state c(t) is added as input to the controller of the output gate.
• Peephole connections often improve performance, but not always.
• Keras offers an experimental implementation of LSTM cells with
peephole connections
» tf.keras.experimental.PeepholeLSTMCell
» You can create a keras.layers.RNN layer and pass a PeepholeLSTMCell to its
constructor.
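• For example (a sketch; layer sizes are illustrative):

    import tensorflow as tf
    from tensorflow import keras

    # Wrap the experimental PeepholeLSTMCell in a generic keras.layers.RNN layer.
    model = keras.models.Sequential([
        keras.layers.RNN(tf.keras.experimental.PeepholeLSTMCell(20),
                         return_sequences=True, input_shape=[None, 1]),
        keras.layers.RNN(tf.keras.experimental.PeepholeLSTMCell(20)),
        keras.layers.Dense(1)
    ])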
Gated Recurrent Unit (GRU) Cell
• The GRU cell is a simplified version of the LSTM cell, and it seems to
perform just as well
» Both state vectors are merged into a single vector h(t).
» A single gate controller z(t) controls both the forget gate and the input gate.
§ If the gate controller outputs a 1, the forget gate is open (= 1) and the input gate is
closed (1 – 1 = 0).
§ If it outputs a 0, the opposite happens.
» There is no output gate; the full state vector is output at every time step.
§ However, there is a new gate controller r(t) that controls which part of the previous
state will be shown to the main layer (g(t)).
Equations for GRU Cells
• The equations for GRU cells:
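» Standard GRU formulation (σ is the logistic function, ⊗ element-wise multiplication):

$$
\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\big(\mathbf{W}_{xz}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hz}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_z\big) \\
\mathbf{r}_{(t)} &= \sigma\big(\mathbf{W}_{xr}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hr}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_r\big) \\
\mathbf{g}_{(t)} &= \tanh\big(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top (\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}) + \mathbf{b}_g\big) \\
\mathbf{h}_{(t)} &= \mathbf{z}_{(t)} \otimes \mathbf{h}_{(t-1)} + (1 - \mathbf{z}_{(t)}) \otimes \mathbf{g}_{(t)}
\end{aligned}
$$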
Using 1D Convolutional Layers to Process Sequences
• A 1D convolutional layer (e.g., with a kernel size of 4 and a stride of 2) can be placed
before the GRU layers to downsample the input sequence.
» By shortening the sequences, the convolutional layer may help the GRU layers
detect longer patterns (a sketch of such a model follows).
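• A minimal sketch of such a model (filter counts and GRU sizes are illustrative; in a
sequence-to-sequence setting, the targets must be downsampled to match the
convolution's stride):

    from tensorflow import keras

    # A 1D conv layer downsamples the sequence (kernel_size=4, strides=2)
    # before two GRU layers that predict the next 10 values at each step.
    model = keras.models.Sequential([
        keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                            input_shape=[None, 1]),
        keras.layers.GRU(20, return_sequences=True),
        keras.layers.GRU(20, return_sequences=True),
        keras.layers.TimeDistributed(keras.layers.Dense(10))
    ])
    model.compile(loss="mse", optimizer="adam")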
WaveNet
• It is possible to use only 1D convolutional layers and drop the recurrent
layers entirely.
• WaveNet stacks 1D convolutional layers, doubling the dilation rate (how
spread apart each neuron’s inputs are) at every layer.
» The lower layers learn short-term patterns, while the higher layers learn long-term
patterns.
» Thanks to the doubling dilation rate, the network can process extremely large sequences
very efficiently.
WaveNet (cont.)
• Here is how to implement a simplified WaveNet in Keras:
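• A sketch of such a model (filter counts are illustrative):

    from tensorflow import keras

    # A simplified WaveNet: causal 1D convolutions with dilation rates doubling
    # at every layer (1, 2, 4, 8, then again 1, 2, 4, 8).
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=[None, 1]))
    for rate in (1, 2, 4, 8) * 2:
        model.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                      activation="relu", dilation_rate=rate))
    # output layer: 10 filters of size 1, no activation function
    model.add(keras.layers.Conv1D(filters=10, kernel_size=1))
    model.compile(loss="mse", optimizer="adam")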
• This Sequential model starts with an explicit input layer, then continues
with a 1D convolutional layer using "causal" padding.
» This ensures that the convolutional layer does not peek into the future when making
predictions.
• Add similar pairs of layers using growing dilation rates: 1, 2, 4, 8, and again
1, 2, 4, 8.
• Finally, we add the output layer: a convolutional layer with 10 filters of size
1 and without any activation function.
• The GRU model preceded by a 1D convolutional layer and the WaveNet model offer the
best performance so far in forecasting our time series.