
Chapter 15: Processing Sequences

Using RNNs and CNNs

Tsz-Chiu Au
[email protected]

Ulsan National Institute of Science and Technology (UNIST)


South Korea
Recurrent Neural Networks
• Recurrent neural networks (RNNs) are a class of nets that can predict the future.
» They can analyze time series data such as stock prices.
• In this chapter, we will study
» The fundamental concepts underlying RNNs.
» How to train them using backpropagation through time.
» How to use them to forecast a time series.
» How to cope with unstable gradients and a (very) limited
short-term memory.
» A CNN architecture called WaveNet that can process time series data as well as RNNs can.
Recurrent Neurons and Layers
• A recurrent neural network looks very much like a feedforward neural
network, except it also has connections pointing backward.
• A recurrent neuron unrolled through time:

• At each time step t (also called a frame), this recurrent neuron receives
the inputs x(t) as well as its own output from the previous time step, y(t–1).
» At the first time step there is no previous output, so it is typically set to 0.
A Layer of Recurrent Neurons
• At each time step t, every neuron receives both the input vector x(t) and
the output vector from the previous time step y(t–1).

• Each recurrent neuron has two sets of weights: Wx for the inputs x(t) and
Wy for the outputs of the previous time step, y(t–1).
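In matrix form, the outputs of a whole layer of recurrent neurons for a mini-batch at time step t can be written as follows (a standard formulation, with φ the activation function and b the bias vector):

$$\mathbf{Y}_{(t)} = \phi\left(\mathbf{X}_{(t)}\,\mathbf{W}_x + \mathbf{Y}_{(t-1)}\,\mathbf{W}_y + \mathbf{b}\right)$$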
Memory Cells
• A recurrent neuron has memory because its output is a function of all the
inputs from previous time steps.
• A part of a neural network that preserves some state across time steps is
called a memory cell (or simply a cell).
• A single recurrent neuron, or a layer of recurrent neurons, is a very basic
cell, capable of learning only short patterns.
» To learn longer patterns, a more powerful type of cell is needed.
• A cell’s state at time step t, denoted h(t) (the “h” stands for “hidden”), is a
function of some inputs at that time step and its state at the previous time
step: h(t) = f(h(t–1), x(t)).
• The output at time step t, denoted y(t), is also a function of the previous
state and the current inputs.
Input and Output Sequences
• Sequence-to-sequence network
» E.g., predicting time series such as
stock prices
• Sequence-to-vector network
» E.g., feed the network a sequence of
words corresponding to a movie
review and output a sentiment
score
• Vector-to-sequence network
» E.g., the input could be an image,
and the output could be a caption
for that image.
• Encoder–Decoder
» E.g., translating a sentence from one
language to another.
» Feed the network a sentence in one
language, the encoder would
convert this sentence into a single
vector representation, and then the
decoder would decode this vector
into a sentence in another language.
Training RNNs
• Backpropagation through time (BPTT)
» First, forward pass through the unrolled network
» Second, the output sequence is evaluated using a cost function C(Y(0), Y(1), ...Y(T))
§ The cost function may ignore some outputs (e.g., in a sequence-to-vector RNN, all outputs except the last one are ignored)
» Third, the gradients of that cost function are then propagated backward through the unrolled
network
» Fourth, the model parameters are updated using the gradients computed during BPTT.
§ Since the same parameters W and b are used at each time step, backpropagation will do
the right thing and sum over all time steps.
Forecasting a Time Series
• A time series is a sequence of data, one value (or one vector of values) per time step.
» A univariate time series has one value per time step.
» A multivariate time series has multiple values per time step.
• Forecasting is the task of predicting future values.
» E.g., forecast the value at the next time step (represented by the X) in the following
graphs

• Imputation is the task of predicting (filling in) missing values from the past.


Generating Time Series for Experiments
• In this chapter, instead of using real time series data in the real world, we
will consider the time series generated by this function:

• The function returns a NumPy array of shape [batch size, time steps, 1],
where each series is the sum of two sine waves of fixed amplitudes but
random frequencies and phases, plus a bit of noise.
» In general, the input features are represented as 3D arrays of shape [batch size, time
steps, dimensionality], where dimensionality is 1 for univariate time series and more for
multivariate time series.
• Let’s create a training set, a validation set, and a test set.
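The following is a sketch of such a generator and of the splits, consistent with the shapes described above (the function name generate_time_series and the 7,000/2,000/1,000 split are assumptions, not fixed by the slides):

import numpy as np

def generate_time_series(batch_size, n_steps):
    # Sum of two sine waves with random frequencies and phases, plus noise
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # noise
    return series[..., np.newaxis].astype(np.float32)              # [batch size, time steps, 1]

n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]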
Baseline Metrics
• We would like to compare our RNN models with some baseline methods to check whether they perform as well as we expect.
• Baseline 1: naive forecasting
» Use the last value in a series to predict the next value.
» It gives a mean squared error of about 0.020 in our previous example.
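A minimal sketch of this baseline, assuming the X_valid/y_valid split created above:

import numpy as np
from tensorflow import keras

y_pred = X_valid[:, -1]  # naive forecast: repeat the last observed value
print(np.mean(keras.losses.mean_squared_error(y_valid, y_pred)))  # roughly 0.020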

• Baseline 2: linear model (implemented as a fully connected network)


» Use a simple Linear Regression model so that each prediction will be a linear
combination of the values in the time series

» If we compile this model using the MSE loss and the default Adam optimizer, then fit it
on the training set for 20 epochs and evaluate it on the validation set, we get an MSE of
about 0.004.
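A sketch of this linear baseline (flatten the 50 input values, then a single Dense unit):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),  # 50 time steps, 1 feature
    keras.layers.Dense(1)                       # one linear output: the forecast
])
model.compile(loss="mse", optimizer="adam")
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))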
Implementing a Simple RNN
• Let’s build a very simple RNN and compare it with the baseline methods.

• It just contains a single layer with a single neuron.
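A sketch of this model; input_shape=[None, 1] lets the RNN process sequences of any length:

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])  # a single recurrent neuron
])
model.compile(loss="mse", optimizer="adam")
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))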


• By default, the SimpleRNN layer uses the hyperbolic tangent activation function.
• The initial state h(init) is set to 0.
• The neuron computes a weighted sum of the first input value x(0) and the initial state, applies the hyperbolic
tangent activation function to the result, and this gives the first output, y0. In a
simple RNN, this output is also the new state h0.
• This new state is passed to the same recurrent neuron along with the next input
value, x(1). The process is repeated until it returns y49.
• By default, recurrent layers in Keras only return the final output. To make them
return one output per time step, you must set return_sequences=True
• If you compile, fit, and evaluate this model (just like earlier, we train for 20
epochs using Adam), you will find that its MSE reaches only 0.014
» better than the naive approach but it does not beat the simple linear model.
» Reason: this simple RNN has just three parameters whereas the simple linear model has 51
parameters.
Trend and Seasonality
• There are other models for forecasting time series
» E.g., weighted moving average models and autoregressive integrated
moving average (ARIMA) models.
• Some of them require you to first remove the trend and
seasonality.
» I.e., the known, predictable patterns are removed from the series before training, so the model only has to learn the residual.
• After the model is trained and makes predictions, you would
have to add the trend and the seasonal pattern back to get
the final predictions.
• When using RNNs, it is generally not necessary to do all this,
but it may improve performance in some cases, since the
model will not have to learn the trend or the seasonality.
Deep RNNs
• To implement a deep RNN with multiple layers of cells:
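A sketch of such a stacked RNN (20 units per hidden recurrent layer is an assumption matching the scale of the earlier examples):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)  # as noted below, this last layer can be replaced by Dense(1)
])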

• Note that you must set return_sequences=True for all recurrent layers
except the last one.
• If you compile, fit, and evaluate this model, you will find that it reaches an
MSE of 0.003 (i.e., better than the linear model)
• The last recurrent layer is not ideal (it is forced to have a single unit), so we can replace it with a Dense layer; this runs slightly faster with about the same accuracy.
Forecasting Several Time Steps Ahead
• To predict not just the value at the next time step but also the next 10 values
» One simple way is to use the trained model to predict the next value, then add that value
to the inputs, and use the model again to predict the following value, and so on.
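A sketch of this iterative procedure, assuming model is the trained one-step forecaster and that generate_time_series and n_steps (= 50) come from the earlier setup:

import numpy as np

series = generate_time_series(1, n_steps + 10)
X_new, Y_new = series[:, :n_steps], series[:, n_steps:]
X = X_new
for step_ahead in range(10):
    y_pred_one = model.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)  # append the forecast to the inputs
Y_pred = X[:, n_steps:]                          # the 10 forecasted values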

• In this method, the errors might accumulate over time.
» We get an MSE of about 0.029, which is better than naive forecasting (MSE of about 0.223) but worse than the linear model (MSE of about 0.0188).
• Still, if you only want to forecast a few time steps ahead, this approach may work well on more complex tasks.
Forecasting Several Time Steps Ahead (cont.)
• The second option is to train an RNN to predict all 10 next values at once.
» We can still use a sequence-to-vector model, but it will output 10 values instead of 1.

» Now we just need the output layer to have 10 units instead of 1
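A sketch of the targets and the model for this 10-output, sequence-to-vector setup (names carried over from the earlier code; the 20-unit layers are assumptions):

import numpy as np
from tensorflow import keras

series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)  # one unit per forecasted time step
])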

• The MSE for the next 10 time steps is about 0.008.


» Much better than the linear model
• But we can still do better
» instead of training the model to forecast the next 10 values only at the very last time
step, we can train it to forecast the next 10 values at each and every time step.
§ i.e., turn this sequence-to-vector RNN into a sequence-to-sequence RNN.
» The advantage of this technique is that the loss will contain a term for the output of the
RNN at each and every time step, not just the output at the last time step.
Forecasting Several Time Steps Ahead (cont.)
• At time step 0 the model will output a vector containing the forecasts for time steps 1
to 10, then at time step 1 the model will forecast time steps 2 to 11, and so on.

• To turn the model into a sequence-to-sequence model:
» we must set return_sequences=True in all recurrent layers (even the last one)
» we must apply the output Dense layer at every time step.
• Keras offers a TimeDistributed layer, which reshapes the inputs for the wrapped layer
and then reshapes the outputs back into sequences.
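A sketch of the target preparation, the sequence-to-sequence model, and a last-time-step metric used for evaluation (the helper name last_time_step_mse and the 0.01 learning rate are assumptions consistent with the reported results):

import numpy as np
from tensorflow import keras

# series has shape [10000, n_steps + 10, 1] (generated above);
# at each time step, the target is the vector of the next 10 values
Y = np.empty((10000, n_steps, 10))
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
Y_train, Y_valid, Y_test = Y[:7000], Y[7000:9000], Y[9000:]

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

def last_time_step_mse(Y_true, Y_pred):
    # for evaluation we only care about the forecasts made at the last time step
    return keras.metrics.mean_squared_error(Y_true[:, -1], Y_pred[:, -1])

model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=0.01),
              metrics=[last_time_step_mse])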

• We get a validation MSE of about 0.006, which is 25% better than the previous model.
Unstable Gradients Problem
• To deal with the unstable gradients problem, we can reuse the same tricks
for deep nets:
» good parameter initialization, faster optimizers, dropout, and so on.
• However, unlike deep nets such as CNNs, we should not use nonsaturating
activation functions (e.g., ReLU) for RNNs.
» They may actually lead the RNN to be even more unstable during training.
§ Because the same weights are used at every time step, a small increase in the outputs can compound and eventually cause them to explode after many time steps.
» Hence, use a saturating activation function like the hyperbolic tangent.
• The gradients themselves can explode too.
» If you notice that training is unstable, you may want to monitor the size of the
gradients (e.g., using TensorBoard)
» Use Gradient Clipping when needed.
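For example, a minimal sketch of gradient clipping through the optimizer (the 1.0 threshold is purely illustrative, and model is whichever model you are training):

from tensorflow import keras

optimizer = keras.optimizers.SGD(clipvalue=1.0)  # clip every gradient component to [-1.0, 1.0]
model.compile(loss="mse", optimizer=optimizer)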
Layer Normalization
• Batch Normalization (BN) cannot be used as efficiently with RNNs as with
deep feedforward nets.
» In fact, you cannot use BN between time steps, only between recurrent layers.
» Using BN was found to be slightly better than nothing when applied between recurrent layers (i.e.,
vertically in Figure 15-7), but not within recurrent layers (i.e., horizontally).
• Layer Normalization: instead of normalizing across the batch dimension, it
normalizes across the features dimension.
» Like BN, Layer Normalization learns a scale and an offset parameter for each input.
» In an RNN, it is typically used right after the linear combination of the inputs and the hidden
states.
• One advantage is that it can compute the required statistics on the fly, at each
time step, independently for each instance.
» This also means that it behaves the same way during training and testing (as opposed to BN).
» It does not need to use exponential moving averages to estimate the feature statistics across
all instances in the training set.
Implementation of Layer Normalization
• Use tf.keras to implement Layer Normalization within a simple memory cell.
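A sketch of such a custom cell (the class name LNSimpleRNNCell is an assumption); it applies Layer Normalization right after the linear combination of the inputs and the hidden state, then the activation:

from tensorflow import keras

class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # linear combination of inputs and hidden state, without activation
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

# use the custom cell via the generic keras.layers.RNN layer
model = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])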

• To add dropout, all recurrent layers (except for keras.layers.RNN) and all cells
provided by Keras have a dropout hyperparameter and a recurrent_dropout
hyperparameter.
» The former defines the dropout rate to apply to the inputs (at each time step), and the latter
defines the dropout rate for the hidden states (also at each time step).
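For example (the 0.2 rates are illustrative):

from tensorflow import keras

layer = keras.layers.GRU(20, return_sequences=True,
                         dropout=0.2,            # dropout applied to the inputs at each time step
                         recurrent_dropout=0.2)  # dropout applied to the hidden state at each time step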
Long Short-Term Memory
• In RNNs, some information is lost at each time step.
» After a while, the RNN’s state contains virtually no trace of the first inputs.
• To tackle this problem, various types of cells with long-term memory have
been introduced.
» They have proven so successful that the basic cells are not used much anymore.
• The most popular long-term memory cells: the Long Short-Term Memory
(LSTM) cell.
• Two ways to add LSTM layers to a model in Keras:
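A sketch of both options, reusing the 20-unit sequence-to-sequence architecture from the earlier examples:

from tensorflow import keras

# Option 1: use the keras.layers.LSTM layer directly
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

# Option 2: use the general-purpose keras.layers.RNN layer with an LSTMCell
model = keras.models.Sequential([
    keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])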

• The LSTM layer uses an optimized implementation when running on a GPU, so in general it is preferable to use it.
The Architecture of LSTM Cells
• The state of an LSTM cell is split into two vectors: h(t) and c(t)
» You can think of h(t) as the short-term state and c(t) as the long-term state.
• The key idea is that the network can learn what to store in the long-term
state, what to throw away, and what to read from it.
» c(t–1) first goes through a forget gate, dropping some memories, and then some new
memories (selected by an input gate) are added via the addition operation, producing c(t).
» c(t) is also copied and passed through the tanh function, and then the result is filtered by
the output gate to produce the short-term state h(t), which is equal to the cell’s output
for this time step, y(t).
Gates in LSTM Cells
• The current input vector x(t) and the previous short-term state h(t–1) are fed to four
different fully connected layers.
• The main layer is the one that outputs g(t) given the current inputs x(t) and the
previous (short-term) state h(t–1).
» Depending on the input gate, g(t) may or may not be added to the long-term state c(t).
• The three other layers are gate
controllers.
» The forget gate (controlled by f(t)) controls
which parts of the long-term state should
be erased.
» The input gate (controlled by i(t)) controls
which parts of g(t) should be added to the
long-term state.
» The output gate (controlled by o(t))
controls which parts of the long-term
state should be read and output at this
time step, both to h(t) and to y(t).
• Gate controllers use the logistic activation
function, whose outputs range from 0 to 1.
» If they output 0s they close the gate, and
if they output 1s they open it.
The Equations of LSTM Cells
• An LSTM cell can be implemented by the following equations:
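Using σ for the logistic function and ⊗ for element-wise multiplication, the standard LSTM equations are:

\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\left(\mathbf{W}_{xi}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hi}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_i\right) \\
\mathbf{f}_{(t)} &= \sigma\left(\mathbf{W}_{xf}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hf}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_f\right) \\
\mathbf{o}_{(t)} &= \sigma\left(\mathbf{W}_{xo}^\top \mathbf{x}_{(t)} + \mathbf{W}_{ho}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_o\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_g\right) \\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} \\
\mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\left(\mathbf{c}_{(t)}\right)
\end{aligned}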

• Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for
their connection to the input vector x(t).
• Whi, Whf, Who, and Whg are the weight matrices of each of the four layers
for their connection to the previous short-term state h(t–1).
• bi, bf, bo, and bg are the bias terms for each of the four layers. Note that
TensorFlow initializes bf to a vector full of 1s instead of 0s. This prevents the cell from
forgetting everything at the beginning of training.
Peephole Connections
• In a regular LSTM cell, the gate controllers can look only at the input
x(t) and the previous short-term state h(t–1).
• An LSTM variant with extra connections called peephole connections
» The previous long-term state c(t–1) is added as an input to the controllers of the forget
gate and the input gate
» The current long-term state c(t) is added as input to the controller of the output gate.
• Peephole Connections often improve performance, but not always.
• Keras offers an experimental implementation of LSTM cells with
peephole connections
» tf.keras.experimental.PeepholeLSTMCell
» You can create a keras.layers.RNN layer and pass a PeepholeLSTMCell to its
constructor, as sketched below.
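A minimal sketch of that usage, assuming a TensorFlow version that still ships tf.keras.experimental.PeepholeLSTMCell:

import tensorflow as tf
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.RNN(tf.keras.experimental.PeepholeLSTMCell(20),
                     return_sequences=True, input_shape=[None, 1]),
    keras.layers.RNN(tf.keras.experimental.PeepholeLSTMCell(20),
                     return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])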
Gated Recurrent Unit (GRU) Cell
• The GRU cell is a simplified version of the LSTM cell, and it seems to
perform just as well
» Both state vectors are merged into a single vector h(t).
» A single gate controller z(t) controls both the forget gate and the input gate.
§ If the gate controller outputs a 1, the forget gate is open (= 1) and the input gate is
closed (1 – 1 = 0).
§ If it outputs a 0, the opposite happens.
» There is no output gate; the full state vector is output at every time step.
§ However, there is a new gate controller r(t) that controls which part of the previous
state will be shown to the main layer (g(t)).
Equations for GRU Cells
• The equations for GRU cells:
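In the same notation as the LSTM equations above, the standard GRU equations are:

\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\left(\mathbf{W}_{xz}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hz}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_z\right) \\
\mathbf{r}_{(t)} &= \sigma\left(\mathbf{W}_{xr}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hr}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_r\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \left(\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}\right) + \mathbf{b}_g\right) \\
\mathbf{h}_{(t)} &= \mathbf{z}_{(t)} \otimes \mathbf{h}_{(t-1)} + \left(1 - \mathbf{z}_{(t)}\right) \otimes \mathbf{g}_{(t)}
\end{aligned}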

• Keras provides a keras.layers.GRU layer.
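Swapping GRU for LSTM in the earlier sequence-to-sequence model is a drop-in change, e.g.:

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])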


Using 1D convolutional layers to process sequences
• LSTM and GRU cells are among the main reasons behind the success of RNNs.
» But they still have a fairly limited short-term memory.
§ They have a hard time learning long-term patterns in sequences of 100 time steps or more,
such as audio samples, long time series, or long sentences.
• One way to solve this is to shorten the input sequences, for example using 1D
convolutional layers.
» Build a neural network composed of a mix of recurrent layers and 1D convolutional layers (or
even 1D pooling layers).
» The 1D convolutional layers downsample the input sequence by a factor of 2, using a stride of 2.

• By shortening the sequences, the convolutional layer may help the GRU layers
detect longer patterns.
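A sketch of such a model (the kernel size of 4 and the target re-alignment Y_train[:, 3::2], which drops the first 3 target steps and keeps every 2nd, are assumptions tied to the stride-2 downsampling; last_time_step_mse is the metric defined earlier):

from tensorflow import keras

model = keras.models.Sequential([
    # downsample the sequence by 2x: kernel size 4, stride 2, "valid" padding
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                        input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
# the targets must be downsampled the same way as the inputs
history = model.fit(X_train, Y_train[:, 3::2], epochs=20,
                    validation_data=(X_valid, Y_valid[:, 3::2]))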
WaveNet
• It is possible to use only 1D convolutional layers and drop the recurrent
layers entirely.
• WaveNet stacks 1D convolutional layers, doubling the dilation rate (how
spread apart each neuron’s inputs are) at every layer.
» The lower layers learn short-term patterns, while the higher layers learn long-term
patterns.
» Thanks to the doubling dilation rate, the network can process extremely large sequences
very efficiently.
WaveNet (cont.)
• Here is how to implement a simplified WaveNet in Keras:
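A sketch that matches the description in the bullets below (last_time_step_mse is the metric defined earlier):

from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.layers.InputLayer(input_shape=[None, 1]))
for rate in (1, 2, 4, 8) * 2:   # dilation rates 1, 2, 4, 8, then again 1, 2, 4, 8
    model.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                  activation="relu", dilation_rate=rate))
model.add(keras.layers.Conv1D(filters=10, kernel_size=1))  # output layer: 10 filters of size 1
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history = model.fit(X_train, Y_train, epochs=20,
                    validation_data=(X_valid, Y_valid))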

• This Sequential model starts with an explicit input layer, then continues
with a 1D convolutional layer using "causal" padding.
» This ensures that the convolutional layer does not peek into the future when making
predictions.
• Add similar pairs of layers using growing dilation rates: 1, 2, 4, 8, and again
1, 2, 4, 8.
• Finally, we add the output layer: a convolutional layer with 10 filters of size
1 and without any activation function.
• GRU with 1D convolutional layers and WaveNet offer the best
performance so far in forecasting our time series.
