Recurrent Neural Networks cheatsheet
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
Overview
Architecture of a traditional RNN: Recurrent neural networks, also known as RNNs,
are a class of neural networks that allow previous outputs to be used as inputs while
having hidden states.
For each timestep $t$, the activation $a^{< t >}$ and the output $y^{< t >}$ are expressed
as follows:
\boxed{a^{< t >}=g_1(W_{aa}a^{< t-1 >}+W_{ax}x^{< t >}+b_a)}\quad\textrm{and}\quad\boxed{y^{< t >}=g_2(W_{ya}a^{< t >}+b_y)}
where $W_{ax}, W_{aa}, W_{ya}, b_a, b_y$ are coefficients that are shared temporally and
$g_1, g_2$ are activation functions.
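Example: the two update equations can be sketched in plain Python as below. This is only an illustration, assuming $g_1=\tanh$, $g_2$ = softmax and toy dimensions; the helper names (`matvec`, `rnn_step`, `rnn_forward`) are made up for this sketch and do not come from any library.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax, used here as g_2 (an assumption)."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One timestep: returns (a_t, y_t) with g_1 = tanh (an assumption)."""
    pre = [r + s + b for r, s, b in zip(matvec(W_aa, a_prev), matvec(W_ax, x_t), b_a)]
    a_t = [math.tanh(p) for p in pre]
    y_t = softmax([s + b for s, b in zip(matvec(W_ya, a_t), b_y)])
    return a_t, y_t

def rnn_forward(xs, a0, params):
    """Apply the same (temporally shared) coefficients at every timestep."""
    a, ys = a0, []
    for x_t in xs:
        a, y_t = rnn_step(a, x_t, *params)
        ys.append(y_t)
    return a, ys

# Toy sizes (assumed): hidden size 2, input size 3, output size 2.
W_aa = [[0.1, 0.0], [0.0, 0.1]]
W_ax = [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]]
W_ya = [[1.0, -1.0], [-1.0, 1.0]]
b_a, b_y = [0.0, 0.0], [0.0, 0.0]
params = (W_aa, W_ax, W_ya, b_a, b_y)
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a_final, ys = rnn_forward(xs, [0.0, 0.0], params)
```

Each element of `ys` is a probability distribution over the two toy outputs, and `a_final` is the hidden state carried out of the last timestep.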
The pros and cons of a typical RNN architecture are summed up in the table below:
Advantages Drawbacks
Applications of RNNs: RNN models are mostly used in the fields of natural language
processing and speech recognition. The different applications are summed up in the
table below:
Type of RNN     Sequence lengths     Example
One-to-one      $T_x=T_y=1$
One-to-many     $T_x=1, T_y>1$       Music generation
Many-to-one     $T_x>1, T_y=1$       Sentiment classification
Many-to-many    $T_x=T_y$
Many-to-many    $T_x\neq T_y$        Machine translation
Loss function: In the case of a recurrent neural network, the loss function $\mathcal{L}$
of all time steps is defined based on the loss at every time step as follows:
\boxed{\mathcal{L}(\widehat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(\widehat{y}^{< t >},y^{< t >})}
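Example: a small sketch of the summed loss. The per-timestep loss is not specified above, so cross-entropy against a one-hot target is assumed here; the function names are illustrative.

```python
import math

def cross_entropy(y_hat_t, y_t):
    """Per-timestep loss for a one-hot target y_t and a predicted distribution y_hat_t."""
    return -sum(y * math.log(p) for y, p in zip(y_t, y_hat_t) if y > 0)

def sequence_loss(y_hat, y):
    """Total loss: sum of the per-timestep losses over t = 1..T_y."""
    return sum(cross_entropy(p_t, y_t) for p_t, y_t in zip(y_hat, y))

y_hat = [[0.5, 0.5], [0.25, 0.75]]   # predictions at t = 1, 2
y = [[1, 0], [0, 1]]                 # one-hot targets at t = 1, 2
total = sequence_loss(y_hat, y)      # equals -log(0.5) - log(0.75)
```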
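Backpropagation through time: Backpropagation is done at each point in time. At timestep
$T$, the derivative of the loss $\mathcal{L}$ with respect to the weight matrix $W$ is
expressed as follows:
\boxed{\frac{\partial \mathcal{L}^{(T)}}{\partial W}=\sum_{t=1}^T\left.\frac{\partial\mathcal{L}^{(T)}}{\partial W}\right|_{(t)}}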
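Example: the sum over timesteps can be checked on a scalar RNN. The sketch below assumes $a_t=\tanh(w\,a_{t-1}+u\,x_t)$ and a squared-error loss on the last activation only; both choices, and all names, are assumptions made to keep the example short.

```python
import math

def forward(w, u, xs, a0=0.0):
    """Scalar RNN a_t = tanh(w * a_{t-1} + u * x_t); returns [a_0, ..., a_T]."""
    acts = [a0]
    for x in xs:
        acts.append(math.tanh(w * acts[-1] + u * x))
    return acts

def loss_at_T(w, u, xs, y):
    """Squared error on the final activation (assumed loss)."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt_grad_w(w, u, xs, y):
    """dL^(T)/dw as a sum of one contribution per timestep, walking back from T to 1."""
    acts = forward(w, u, xs)
    T = len(xs)
    delta = acts[T] - y                          # dL/da_T
    contributions = []
    for t in range(T, 0, -1):
        d_pre = delta * (1.0 - acts[t] ** 2)     # gradient w.r.t. the pre-activation at step t
        contributions.append(d_pre * acts[t - 1])  # contribution of w's use at step t
        delta = d_pre * w                        # propagate back to a_{t-1}
    return sum(contributions), contributions

w, u, xs, y = 0.5, 0.8, [1.0, -0.5, 0.3], 0.2
grad, contributions = bptt_grad_w(w, u, xs, y)
```

Comparing `grad` with a central finite difference of `loss_at_T` is a quick way to confirm the sum is correct.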
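Commonly used activation functions include the sigmoid $g(z)=\frac{1}{1+e^{-z}}$ and
tanh $g(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$.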
Vanishing/exploding gradient: The vanishing and exploding gradient phenomena are
often encountered in the context of RNNs. They make it difficult to capture long-term
dependencies because the gradient is a product of per-layer factors, so it can decrease
or increase exponentially with the number of layers.
Gradient clipping: It is a technique used to cope with the exploding gradient problem
sometimes encountered when performing backpropagation. Capping the maximum value of
the gradient keeps this phenomenon under control in practice.
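Example: two common ways to cap a gradient, sketched on a flat list of gradient values. Clipping by value caps each component; clipping by norm (a frequently used variant, not described above) rescales the whole vector. The function names are illustrative.

```python
import math

def clip_by_value(grads, max_value):
    """Cap every component to the interval [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

by_value = clip_by_value([3.0, -4.0, 0.5], 1.0)
by_norm = clip_by_norm([3.0, 4.0], 1.0)
```

Clipping by norm keeps the direction of the gradient, whereas clipping by value can change it.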
Types of gates: In order to remedy the vanishing gradient problem, specific gates are
used in some types of RNNs and usually have a well-defined purpose. They are usually
noted $\Gamma$ and are equal to:
\boxed{\Gamma=\sigma(Wx^{< t >}+Ua^{< t-1 >}+b)}
where $W, U, b$ are coefficients specific to the gate and $\sigma$ is the sigmoid
function. The main ones are summed up in the table below:
Type of gate      Notation
Update gate       $\Gamma_u$
Relevance gate    $\Gamma_r$
Forget gate       $\Gamma_f$
Output gate       $\Gamma_o$
GRU/LSTM: Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM)
deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM
being a generalization of GRU. The characterizing equations of each architecture are
summed up below:
GRU
• $\tilde{c}^{< t >}=\tanh(W_c[\Gamma_r\star a^{< t-1 >},x^{< t >}]+b_c)$
• $c^{< t >}=\Gamma_u\star\tilde{c}^{< t >}+(1-\Gamma_u)\star c^{< t-1 >}$
• $a^{< t >}=c^{< t >}$
LSTM
• $\tilde{c}^{< t >}=\tanh(W_c[\Gamma_r\star a^{< t-1 >},x^{< t >}]+b_c)$
• $c^{< t >}=\Gamma_u\star\tilde{c}^{< t >}+\Gamma_f\star c^{< t-1 >}$
• $a^{< t >}=\Gamma_o\star c^{< t >}$
Remark: the sign $\star$ denotes the element-wise multiplication between two vectors.
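Example: one GRU step following the equations of this section, written for a scalar hidden state so that element-wise products reduce to ordinary multiplication. The parameter dictionary, its key names and its values are hypothetical; $W_c$ acting on the concatenation is split into two scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(W, U, b, x_t, a_prev):
    """Gamma = sigma(W x_t + U a_prev + b), scalar case."""
    return sigmoid(W * x_t + U * a_prev + b)

def gru_step(x_t, c_prev, p):
    """One GRU step for a scalar hidden state; in a GRU, a_t = c_t."""
    a_prev = c_prev
    gamma_u = gate(p["W_u"], p["U_u"], p["b_u"], x_t, a_prev)   # update gate
    gamma_r = gate(p["W_r"], p["U_r"], p["b_r"], x_t, a_prev)   # relevance gate
    # W_c applied to [gamma_r * a_prev, x_t], split here into W_ca and W_cx.
    c_tilde = math.tanh(p["W_ca"] * (gamma_r * a_prev) + p["W_cx"] * x_t + p["b_c"])
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t

# Hypothetical parameter values, for illustration only.
params = {"W_u": 0.5, "U_u": 0.5, "b_u": 0.0,
          "W_r": 0.5, "U_r": 0.5, "b_r": 0.0,
          "W_ca": 1.0, "W_cx": 1.0, "b_c": 0.0}
c_next = gru_step(1.0, 0.3, params)
```

When the update gate is close to 0 the previous cell value is carried over almost unchanged, which is what lets the unit preserve information over many timesteps.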
Variants of RNNs: The table below sums up the other commonly used RNN
architectures:
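In this section, we note $V$ the vocabulary and $|V|$ its size.
Representation techniques: The two main ways of representing words are:
• 1-hot representation, noted $o_w$
• Word embedding, noted $e_w$
Embedding matrix: For a given word $w$, the embedding matrix $E$ is a matrix that maps
its 1-hot representation $o_w$ to its embedding $e_w$ as follows:
\boxed{e_w=Eo_w}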
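Example: with a 1-hot vector, the product $Eo_w$ simply selects one column of $E$. The sketch below assumes $E$ has one row per embedding dimension and one column per word; the toy matrix is made up.

```python
def one_hot(index, size):
    """1-hot representation o_w of the word at position `index` in a vocabulary of `size` words."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Toy embedding matrix (assumed): 2 dimensions, vocabulary of 3 words.
E = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
o_w = one_hot(1, 3)
e_w = matvec(E, o_w)               # same as reading column 1 of E
column = [row[1] for row in E]
```

In practice the column is read directly rather than computed with a full matrix product, since almost all terms of the product are zero.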
Remark: learning the embedding matrix can be done using target/context likelihood
models.
Word embeddings
Word2vec: Word2vec is a framework aimed at learning word embeddings by estimating
the likelihood that a given word is surrounded by other words. Popular models include
skip-gram, negative sampling and CBOW.
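Skip-gram: The skip-gram word2vec model learns word embeddings by assessing the
likelihood of any given target word $t$ appearing with a context word $c$. By noting
$\theta_t$ a parameter associated with $t$, the probability $P(t|c)$ is given by:
\boxed{P(t|c)=\frac{\exp(\theta_t^Te_c)}{\displaystyle\sum_{j=1}^{|V|}\exp(\theta_j^Te_c)}}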
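Example: a direct sketch of the softmax above, with `theta` holding one parameter vector per vocabulary word. The toy values and the function name are assumptions; subtracting the maximum score is only a numerical-stability detail and does not change the result.

```python
import math

def skipgram_prob(t, e_c, theta):
    """P(t | c): softmax of theta_j^T e_c over the whole vocabulary, read at index t."""
    scores = [sum(a * b for a, b in zip(theta_j, e_c)) for theta_j in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[t] / sum(exps)

theta = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy parameters, |V| = 3
e_c = [1.0, 2.0]                               # toy context embedding
probs = [skipgram_prob(t, e_c, theta) for t in range(len(theta))]
```

Every call loops over all $|V|$ words in the denominator, which is the cost the remark below refers to.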
Remark: summing over the whole vocabulary in the denominator of the softmax part
makes this model computationally expensive. CBOW is another word2vec model using
the surrounding words to predict a given word.
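Negative sampling: It is a set of binary classifiers using logistic regressions that aim at
assessing how likely a given context word and a given target word are to appear
simultaneously, with the models being trained on sets of $k$ negative examples and 1
positive example. Given a context word $c$ and a target word $t$, the prediction is
expressed by:
\boxed{P(y=1|c,t)=\sigma(\theta_t^Te_c)}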
Remark: this method is less computationally expensive than the skip-gram model.
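Example: a sketch of the binary prediction and of a loss over one positive pair and a few negative pairs. The binary cross-entropy loss is an assumption (the usual choice for logistic regression); names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta_t, e_c):
    """P(y = 1 | c, t) = sigma(theta_t^T e_c)."""
    return sigmoid(sum(a * b for a, b in zip(theta_t, e_c)))

def negative_sampling_loss(e_c, theta_pos, theta_negs):
    """Binary cross-entropy over 1 positive target and the sampled negative targets."""
    loss = -math.log(predict(theta_pos, e_c))
    for theta_neg in theta_negs:
        loss -= math.log(1.0 - predict(theta_neg, e_c))
    return loss

e_c = [1.0, 2.0]
p = predict([0.0, 0.0], e_c)
loss = negative_sampling_loss(e_c, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
```

Only the positive word and the sampled negatives are touched per training example, instead of the whole vocabulary.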
GloVe: The GloVe model, short for global vectors for word representation, is a word
embedding technique that uses a co-occurrence matrix $X$ where each $X_{i,j}$ denotes the
number of times that a target $i$ occurred with a context $j$. Its cost function $J$ is as
follows:
\boxed{J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta_i^Te_j+b_i+b_j'-\log(X_{ij}))^2}
where $f$ is a weighting function such that $X_{i,j}=0\Longrightarrow f(X_{i,j})=0$.
Given the symmetry that $e$ and $\theta$ play in this model, the final word embedding
$e_w^{(\textrm{final})}$ is given by:
\boxed{e_w^{(\textrm{final})}=\frac{e_w+\theta_w}{2}}
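Example: a sketch of the cost and of the final embedding. The weighting function is not specified above; the form $\min(1,(x/x_{max})^{0.75})$ with $x_{max}=100$ is assumed here as one common choice. Pairs with $X_{ij}=0$ are skipped, which matches $f(0)=0$ and avoids $\log(0)$.

```python
import math

def f(x, x_max=100.0, alpha=0.75):
    """Assumed weighting function; f(0) = 0 as required."""
    if x <= 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)

def glove_cost(X, theta, e, b, b_prime):
    """J = 1/2 * sum_ij f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2."""
    J = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            weight = f(x_ij)
            if weight == 0.0:
                continue  # zero co-occurrence contributes nothing
            dot = sum(a * c for a, c in zip(theta[i], e[j]))
            J += weight * (dot + b[i] + b_prime[j] - math.log(x_ij)) ** 2
    return 0.5 * J

def final_embedding(e_w, theta_w):
    """Average of the two symmetric vectors learned for the same word."""
    return [(a + c) / 2 for a, c in zip(e_w, theta_w)]

cost = glove_cost([[100.0]], [[0.0]], [[0.0]], [0.0], [0.0])
e_final = final_embedding([1.0, 3.0], [3.0, 5.0])
```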
Remark: the individual components of the learned word embeddings are not necessarily
interpretable.
Comparing words
Cosine similarity: The cosine similarity between words $w_1$ and $w_2$ is expressed as
follows:
\boxed{\textrm{similarity}=\frac{w_1\cdot w_2}{||w_1||\textrm{ }||w_2||}=\cos(\theta)}
Remark: $\theta$ is the angle between words $w_1$ and $w_2$.
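Example: the formula applied to plain lists of floats standing in for word vectors. The function name and the toy vectors are illustrative.

```python
import math

def cosine_similarity(w1, w2):
    """Dot product of the two vectors divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    return dot / (norm1 * norm2)

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The result depends only on the angle between the vectors, not on their lengths, so scaling a vector leaves the similarity unchanged.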
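t-SNE: t-SNE ($t$-distributed Stochastic Neighbor Embedding) is a technique aimed at
reducing high-dimensional embeddings into a lower dimensional space. In practice, it is
commonly used to visualize word vectors in the 2D space.
Language model
Overview: A language model aims at estimating the probability of a sentence $P(y)$.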
n-gram model: This model is a naive approach aiming at quantifying the probability that
an expression appears in a corpus by counting the number of times it appears in the
training data.
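Example: a counting sketch on a tokenized toy corpus. The probability of an expression is estimated here as its count divided by the number of n-grams of the same length, which is one simple reading of the description above; function names are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_probability(expression, tokens):
    """Relative frequency of `expression` among all n-grams of the same length."""
    counts = ngram_counts(tokens, len(expression))
    total = sum(counts.values())
    return counts[tuple(expression)] / total if total else 0.0

corpus = "the cat sat on the mat".split()
p_bigram = ngram_probability(["the", "cat"], corpus)    # 1 of 5 bigrams
p_unigram = ngram_probability(["the"], corpus)          # 2 of 6 unigrams
p_unseen = ngram_probability(["cat", "mat"], corpus)    # never observed
```

Unseen expressions get probability 0, which is one reason the approach is described as naive.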
Perplexity: Language models are commonly assessed using the perplexity metric, also
known as PP, which can be interpreted as the inverse probability of the dataset
normalized by the number of words $T$. The lower the perplexity, the better. It is defined
as follows:
\boxed{\textrm{PP}=\prod_{t=1}^T\left(\frac{1}{\sum_{j=1}^{|V|}y_j^{(t)}\cdot \widehat{y}_j^{(t)}}\right)^{\frac{1}{T}}}
Remark: PP is commonly used in t-SNE.
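Example: the perplexity formula evaluated in log space, which gives the same value as the product but avoids underflow on long sequences. One-hot targets `y` and predicted distributions `y_hat` are assumed; the function name is illustrative.

```python
import math

def perplexity(y_hat, y):
    """PP = prod_t (1 / sum_j y_j^(t) * y_hat_j^(t))^(1/T), computed via logs."""
    T = len(y)
    log_sum = 0.0
    for p_t, y_t in zip(y_hat, y):
        prob_of_truth = sum(a * b for a, b in zip(y_t, p_t))
        log_sum += -math.log(prob_of_truth)
    return math.exp(log_sum / T)

uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
targets = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
pp_uniform = perplexity(uniform, targets)   # uniform guess over 4 words
pp_perfect = perplexity([[1.0, 0.0], [0.0, 1.0]], [[1, 0], [0, 1]])
```

A model that guesses uniformly over $|V|$ words has a perplexity of $|V|$, and a model that always assigns probability 1 to the true word has a perplexity of 1.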
Machine translation
Overview: A machine translation model is similar to a language model except it has an
encoder network placed before. For this reason, it is sometimes referred to as a
conditional language model.
The goal is to find a sentence $y$ such that:
\boxed{y=\underset{y^{< 1 >}, ..., y^{< T_y >}}{\textrm{arg max}}P(y^{< 1 >},...,y^{< T_y >}|x)}
Beam search: It is a heuristic search algorithm used in machine translation and speech
recognition to find the likeliest sentence $y$ given an input $x$.
• Step 1: Find top $B$ likely words $y^{< 1 >}$
• Step 2: Compute conditional probabilities $y^{< k >}|x,y^{< 1 >},...,y^{< k-1 >}$
• Step 3: Keep top $B$ combinations $x,y^{< 1 >},...,y^{< k >}$
Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
Beam width: The beam width $B$ is a parameter for beam search. Large values of $B$ yield
better results but with slower performance and increased memory. Small values of $B$ lead
to worse results but are less computationally intensive. A standard value for $B$ is
around 10.
Length normalization: Beam search is usually applied on the following length-normalized
log-likelihood objective:
\boxed{\textrm{Objective } = \frac{1}{T_y^\alpha}\sum_{t=1}^{T_y}\log\Big[p(y^{< t >}|x,y^{< 1 >}, ..., y^{< t-1 >})\Big]}
Remark: the parameter $\alpha$ can be seen as a softener, and its value is usually between
0.5 and 1.
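Example: a compact beam search that ranks finished hypotheses with the normalized objective. The conditional model is a hypothetical lookup table (`TOY_MODEL`) standing in for a trained decoder, the end-of-sentence token `"EOS"` is an assumed convention, and finished hypotheses simply leave the beam, which is a simplification.

```python
import math

def normalized_score(log_probs, alpha=0.7):
    """(1 / T_y^alpha) * sum over t of log p(y_t | x, y_1..y_{t-1})."""
    return sum(log_probs) / (len(log_probs) ** alpha)

def beam_search(next_log_probs, B, max_len, eos="EOS", alpha=0.7):
    beams = [([], [])]            # each hypothesis: (tokens, per-token log-probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, lps in beams:
            # Conditional log-probabilities of every possible next word.
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], lps + [lp]))
        # Keep the top B combinations by total log-probability.
        candidates.sort(key=lambda c: sum(c[1]), reverse=True)
        beams = []
        for tokens, lps in candidates[:B]:
            (finished if tokens[-1] == eos else beams).append((tokens, lps))
        if not beams:
            break
    finished.extend(beams)        # hypotheses cut off at max_len
    return max(finished, key=lambda c: normalized_score(c[1], alpha))[0]

# Hypothetical conditional model: prefix -> next-word probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"EOS": 0.55, "c": 0.45},
    ("b",): {"EOS": 1.0},
    ("a", "c"): {"EOS": 1.0},
}

def toy_next_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}

greedy = beam_search(toy_next_log_probs, B=1, max_len=3)
wider = beam_search(toy_next_log_probs, B=2, max_len=3)
```

On this toy model the greedy search commits to the locally best first word and ends with a sentence of probability 0.33, while a beam width of 2 keeps the alternative alive and finds the sentence of probability 0.4.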
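Error analysis: When obtaining a predicted translation $\widehat{y}$ that is bad, one can
wonder why we did not get a good translation $y^*$ by comparing the probabilities that
the model assigns to each:
Case          $P(y^*|x)>P(\widehat{y}|x)$    $P(y^*|x)\leqslant P(\widehat{y}|x)$
Root cause    Beam search faulty             RNN faulty
Remedies      • Increase beam width          • Regularize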
Bleu score: The bilingual evaluation understudy (bleu) score quantifies how good a
machine translation is by computing a similarity score based on $n$-gram precision. It is
defined as follows:
\boxed{\textrm{bleu score}=\exp\left(\frac{1}{n}\sum_{k=1}^np_k\right)}
where $p_n$ is the bleu score on $n$-gram only defined as follows:
p_n=\frac{\displaystyle\sum_{\textrm{n-gram}\in\widehat{y}}\textrm{count}_{\textrm{clip}}(\textrm{n-gram})}{\displaystyle\sum_{\textrm{n-gram}\in\widehat{y}}\textrm{count}(\textrm{n-gram})}
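Example: the two formulas above computed against a single reference translation. Clipping is taken here to mean capping each n-gram count by its count in the reference, which is the usual definition but is an assumption with respect to this section; function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """p_n: clipped n-gram count of the candidate divided by its total n-gram count."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu_score(candidate, reference, n):
    """exp of the average of p_1, ..., p_n, as in the formula of this section."""
    precisions = [clipped_precision(candidate, reference, k) for k in range(1, n + 1)]
    return math.exp(sum(precisions) / n)

reference = "the cat is on the mat".split()
candidate = "the the the the".split()
p_1 = clipped_precision(candidate, reference, 1)   # "the" is clipped to 2 of 4
score = bleu_score(candidate, reference, 2)
```

Without clipping, the degenerate candidate above would get a unigram precision of 1, since every one of its words appears in the reference.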
Attention
Attention model: This model allows an RNN to pay attention to specific parts of the input
that are considered important, which improves the performance of the resulting model in
practice. By noting $\alpha^{< t, t' >}$ the amount of attention that the output
$y^{< t >}$ should pay to the activation $a^{< t' >}$ and $c^{< t >}$ the context at time
$t$, we have:
\boxed{c^{< t >}=\sum_{t'}\alpha^{< t, t' >}a^{< t' >}}\quad\textrm{with}\quad\sum_{t'}\alpha^{< t,t' >}=1
Remark: the attention scores are commonly used in image captioning and machine
translation.
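Attention weight: The amount of attention that the output $y^{< t >}$ should pay to the
activation $a^{< t' >}$ is given by $\alpha^{< t, t' >}$ computed as follows:
\boxed{\alpha^{< t, t' >}=\frac{\exp(e^{< t, t' >})}{\displaystyle\sum_{t''=1}^{T_x}\exp(e^{< t, t'' >})}}
Remark: computation complexity is quadratic with respect to $T_x$.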
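Example: attention weights and the resulting context vector for one output timestep. The scores $e^{< t, t' >}$ are taken as given inputs here, since how they are produced is not described in this section; function names and toy values are illustrative.

```python
import math

def attention_weights(scores):
    """Softmax over the scores of one output step t, across all input positions t'."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, activations):
    """Context for step t: attention-weighted sum of the input activations."""
    dim = len(activations[0])
    return [sum(w * a[i] for w, a in zip(weights, activations)) for i in range(dim)]

activations = [[1.0, 0.0], [0.0, 1.0]]        # one activation per input position
alphas = attention_weights([0.0, 0.0])        # equal scores give equal attention
context = context_vector(alphas, activations)
focused = attention_weights([10.0, 0.0])      # a much larger score dominates
```

The weights are recomputed for every output timestep, each time over all $T_x$ input positions.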