0% found this document useful (0 votes)

60 views14 pages

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Uploaded by

jwxuan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

60 views14 pages

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Uploaded by

jwxuan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

An Empirical Evaluation of Generic Convolutional and Recurrent Networks

for Sequence Modeling

Shaojie Bai 1 J. Zico Kolter 2 Vladlen Koltun 3

Abstract chine translation (van den Oord et al., 2016; Kalchbrenner

For most deep learning practitioners, sequence et al., 2016; Dauphin et al., 2017; Gehring et al., 2017a;b).
modeling is synonymous with recurrent networks. This raises the question of whether these successes of con-
Yet recent results indicate that convolutional ar- volutional sequence modeling are confined to specific ap-
chitectures can outperform recurrent networks on plication domains or whether a broader reconsideration of
tasks such as audio synthesis and machine trans- the association between sequence processing and recurrent
lation. Given a new sequence modeling task or networks is in order.
dataset, which architecture should one use? We We address this question by conducting a systematic empiri-
conduct a systematic evaluation of generic convo- cal evaluation of convolutional and recurrent architectures
lutional and recurrent architectures for sequence on a broad range of sequence modeling tasks. We specif-
modeling. The models are evaluated across a ically target a comprehensive set of tasks that have been
broad range of standard tasks that are commonly repeatedly used to compare the effectiveness of different
used to benchmark recurrent networks. Our re- recurrent network architectures. These tasks include poly-
sults indicate that a simple convolutional archi- phonic music modeling, word- and character-level language
tecture outperforms canonical recurrent networks modeling, as well as synthetic stress tests that had been de-
such as LSTMs across a diverse range of tasks liberately designed and frequently used to benchmark RNNs.
and datasets, while demonstrating longer effective Our evaluation is thus set up to compare convolutional and
memory. We conclude that the common associ- recurrent approaches to sequence modeling on the recurrent
ation between sequence modeling and recurrent networks’ “home turf”.
networks should be reconsidered, and convolu-
tional networks should be regarded as a natural To represent convolutional networks, we describe a generic
starting point for sequence modeling tasks. temporal convolutional network (TCN) architecture that is
applied across all tasks. This architecture is informed by
recent research, but is deliberately kept simple, combining
1. Introduction some of the best practices of modern convolutional archi-
tectures. It is compared to canonical recurrent architectures
Deep learning practitioners commonly regard recurrent ar- such as LSTMs and GRUs.
chitectures as the default starting point for sequence model-
The results suggest that TCNs convincingly outperform
ing tasks. The sequence modeling chapter in the canonical
baseline recurrent architectures across a broad range of se-
textbook on deep learning is titled “Sequence Modeling:
quence modeling tasks. This is particularly notable because
Recurrent and Recursive Nets” (Goodfellow et al., 2016),
the tasks include diverse benchmarks that have commonly
capturing the common association of sequence modeling
been used to evaluate recurrent network designs (Chung
and recurrent architectures. A well-regarded recent online
et al., 2014; Pascanu et al., 2014; Jozefowicz et al., 2015;
course on “Sequence Models” focuses exclusively on recur-
Zhang et al., 2016). This indicates that the recent successes
rent architectures (Ng, 2018).
of convolutional architectures in applications such as audio
On the other hand, recent research indicates that certain con- processing are not confined to these domains.
volutional architectures can reach state-of-the-art accuracy
To further understand these results, we analyze more deeply
in audio synthesis, word-level language modeling, and ma-
the memory retention characteristics of recurrent networks.
1
Machine Learning Department, Carnegie Mellon University, We show that despite the theoretical ability of recurrent
Pittsburgh, PA, USA 2 Computer Science Department, Carnegie architectures to capture infinitely long history, TCNs exhibit
Mellon University, Pittsburgh, PA, USA 3 Intel Labs, Santa Clara, substantially longer memory, and are thus more suitable for
CA, USA.
domains where a long history is required.
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

To our knowledge, the presented study is the most extensive the effectiveness of different recurrent architectures. These
systematic comparison of convolutional and recurrent archi- studies have been motivated in part by the many degrees of
tectures on sequence modeling tasks. The results suggest freedom in the design of such architectures. Chung et al.
that the common association between sequence modeling (2014) compared different types of recurrent units (LSTM
and recurrent networks should be reconsidered. The TCN vs. GRU) on the task of polyphonic music modeling. Pas-
architecture appears not only more accurate than canoni- canu et al. (2014) explored different ways to construct deep
cal recurrent networks such as LSTMs and GRUs, but also RNNs and evaluated the performance of different architec-
simpler and clearer. It may therefore be a more appropri- tures on polyphonic music modeling, character-level lan-
ate starting point in the application of deep networks to guage modeling, and word-level language modeling. Joze-
sequences. To assist related work, we have made code avail- fowicz et al. (2015) searched through more than ten thou-
able at http://github.com/locuslab/TCN. sand different RNN architectures and evaluated their perfor-
mance on various tasks. They concluded that if there were
2. Background “architectures much better than the LSTM”, then they were
“not trivial to find”. Greff et al. (2017) benchmarked the
Convolutional networks (LeCun et al., 1989) have been performance of eight LSTM variants on speech recognition,
applied to sequences for decades (Sejnowski & Rosen- handwriting recognition, and polyphonic music modeling.
berg, 1987; Hinton, 1989). They were used prominently They also found that “none of the variants can improve upon
for speech recognition in the 80s and 90s (Waibel et al., the standard LSTM architecture significantly”. Zhang et al.
1989; Bottou et al., 1990). ConvNets were subsequently (2016) systematically analyzed the connecting architectures
applied to NLP tasks such as part-of-speech tagging and of RNNs and evaluated different architectures on character-
semantic role labelling (Collobert & Weston, 2008; Col- level language modeling and on synthetic stress tests. Melis
lobert et al., 2011; dos Santos & Zadrozny, 2014). More et al. (2018) benchmarked LSTM-based architectures on
recently, convolutional networks were applied to sentence word-level and character-level language modeling, and con-
classification (Kalchbrenner et al., 2014; Kim, 2014) and cluded that “LSTMs outperform the more recent models”.
document classification (Zhang et al., 2015; Conneau et al.,
Other recent works have aimed to combine aspects of RNN
2017; Johnson & Zhang, 2015; 2017). Particularly inspiring
and CNN architectures. This includes the Convolutional
for our work are the recent applications of convolutional
LSTM (Shi et al., 2015), which replaces the fully-connected
architectures to machine translation (Kalchbrenner et al.,
layers in an LSTM with convolutional layers to allow for
2016; Gehring et al., 2017a;b), audio synthesis (van den
additional structure in the recurrent layers; the Quasi-RNN
Oord et al., 2016), and language modeling (Dauphin et al.,
model (Bradbury et al., 2017) that interleaves convolutional
2017).
layers with simple recurrent layers; and the dilated RNN
Recurrent networks are dedicated sequence models that (Chang et al., 2017), which adds dilations to recurrent ar-
maintain a vector of hidden activations that are propagated chitectures. While these combinations show promise in
through time (Elman, 1990; Werbos, 1990; Graves, 2012). combining the desirable aspects of both types of architec-
This family of architectures has gained tremendous pop- tures, our study here focuses on a comparison of generic
ularity due to prominent applications to language mod- convolutional and recurrent architectures.
eling (Sutskever et al., 2011; Graves, 2013; Hermans &
While there have been multiple thorough evaluations of
Schrauwen, 2013) and machine translation (Sutskever et al.,
RNN architectures on representative sequence modeling
2014; Bahdanau et al., 2015). The intuitive appeal of re-
tasks, we are not aware of a similarly thorough compari-
current modeling is that the hidden state can act as a rep-
son of convolutional and recurrent approaches to sequence
resentation of everything that has been seen so far in the
modeling. (Yin et al. (2017) have reported a comparison
sequence. Basic RNN architectures are notoriously difficult
of convolutional and recurrent networks for sentence-level
to train (Bengio et al., 1994; Pascanu et al., 2013) and more
and document-level classification tasks. In contrast, se-
elaborate architectures are commonly used instead, such
quence modeling calls for architectures that can synthesize
as the LSTM (Hochreiter & Schmidhuber, 1997) and the
whole sequences, element by element.) Such comparison
GRU (Cho et al., 2014). Many other architectural innova-
is particularly intriguing in light of the aforementioned re-
tions and training techniques for recurrent networks have
cent success of convolutional architectures in this domain.
been introduced and continue to be actively explored (El
Our work aims to compare generic convolutional and re-
Hihi & Bengio, 1995; Schuster & Paliwal, 1997; Gers et al.,
current architectures on typical sequence modeling tasks
2002; Koutnik et al., 2014; Le et al., 2015; Ba et al., 2016;
that are commonly used to benchmark RNN variants them-
Wu et al., 2016; Krueger et al., 2017; Merity et al., 2017;
selves (Hermans & Schrauwen, 2013; Le et al., 2015; Joze-
Campos et al., 2018).
fowicz et al., 2015; Zhang et al., 2016).
Multiple empirical studies have been conducted to evaluate
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

3. Temporal Convolutional Networks regressive prediction (where we try to predict some signal
given its past) by setting the target output to be simply the
We begin by describing a generic architecture for convo- input shifted by one time step. It does not, however, directly
lutional sequence prediction. Our aim is to distill the best capture domains such as machine translation, or sequence-
practices in convolutional network design into a simple to-sequence prediction in general, since in these cases the
architecture that can serve as a convenient but powerful entire input sequence (including “future” states) can be used
starting point. We refer to the presented architecture as a to predict each output (though the techniques can naturally
temporal convolutional network (TCN), emphasizing that be extended to work in such settings).
we adopt this term not as a label for a truly new architecture,
but as a simple descriptive term for a family of architec-
3.2. Causal Convolutions
tures. The distinguishing characteristics of TCNs are: 1)
the convolutions in the architecture are causal, meaning that As mentioned above, the TCN is based upon two principles:
there is no information “leakage” from future to past; 2) the the fact that the network produces an output of the same
architecture can take a sequence of any length and map it to length as the input, and the fact that there can be no leakage
an output sequence of the same length, just as with an RNN. from the future into the past. To accomplish the first point,
Beyond this, we emphasize how to build very long effective the TCN uses a 1D fully-convolutional network (FCN) ar-
history sizes (i.e., the ability for the networks to look very chitecture (Long et al., 2015), where each hidden layer is the
far into the past to make a prediction) using a combination same length as the input layer, and zero padding of length
of very deep networks (augmented with residual layers) and (kernel size − 1) is added to keep subsequent layers the
dilated convolutions. same length as previous ones. To achieve the second point,
the TCN uses causal convolutions, convolutions where an
Our architecture is informed by recent convolutional ar-
output at time t is convolved only with elements from time
chitectures for sequential data (van den Oord et al., 2016;
t and earlier in the previous layer.
Kalchbrenner et al., 2016; Dauphin et al., 2017; Gehring
et al., 2017a;b), but is distinct from all of them and was To put it simply: TCN = 1D FCN + causal convolutions.
designed from first principles to combine simplicity, autore-
Note that this is essentially the same architecture as the
gressive prediction, and very long memory. For example,
time delay neural network proposed nearly 30 years ago by
the TCN is much simpler than WaveNet (van den Oord et al.,
Waibel et al. (1989), with the sole tweak of zero padding to
2016) (no skip connections across layers, conditioning, con-
ensure equal sizes of all layers.
text stacking, or gated activations).
A major disadvantage of this basic design is that in order to
Compared to the language modeling architecture of Dauphin
achieve a long effective history size, we need an extremely
et al. (2017), TCNs do not use gating mechanisms and have
deep network or very large filters, neither of which were
much longer memory.
particularly feasible when the methods were first introduced.
Thus, in the following sections, we describe how techniques
3.1. Sequence Modeling
from modern convolutional architectures can be integrated
Before defining the network structure, we highlight the na- into a TCN to allow for both very deep networks and very
ture of the sequence modeling task. Suppose that we are long effective history.
given an input sequence x0 , . . . , xT , and wish to predict
some corresponding outputs y0 , . . . , yT at each time. The 3.3. Dilated Convolutions
key constraint is that to predict the output yt for some time
A simple causal convolution is only able to look back at a
t, we are constrained to only use those inputs that have
history with size linear in the depth of the network. This
been previously observed: x0 , . . . , xt . Formally, a sequence
makes it challenging to apply the aforementioned causal con-
modeling network is any function f : X T +1 → Y T +1 that
volution on sequence tasks, especially those requiring longer
produces the mapping
history. Our solution here, following the work of van den
ŷ0 , . . . , ŷT = f (x0 , . . . , xT ) (1) Oord et al. (2016), is to employ dilated convolutions that
enable an exponentially large receptive field (Yu & Koltun,
if it satisfies the causal constraint that yt depends only on
2016). More formally, for a 1-D sequence input x ∈ Rn
x0 , . . . , xt and not on any “future” inputs xt+1 , . . . , xT .
and a filter f : {0, . . . , k − 1} → R, the dilated convolution
The goal of learning in the sequence modeling setting
operation F on element s of the sequence is defined as
is to find a network f that minimizes some expected k−1
loss between the actual outputs and the predictions,
X
F (s) = (x ∗d f )(s) = f (i) · xs−d·i (2)
L(y0 , . . . , yT , f (x0 , . . . , xT )), where the sequences and i=0
outputs are drawn according to some distribution.
where d is the dilation factor, k is the filter size, and s − d · i
This formalism encompasses many settings such as auto- accounts for the direction of the past. Dilation is thus equiv-
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

ŷ0 ŷ1 ŷ2 ... ŷT 2 ŷT 1 ŷT

(i)
ẑ(i) = (ẑ1 , . . . , ẑT )
(i)

Output Residual block (k, d) Residual block (k=3, d=1) (1) (1)
ẑT 1 ẑT
Dropout +
d=4 Convolutional Filter + +
ReLU
Identity Map (or 1x1 Conv)
Hidden WeightNorm

Dilated Causal Conv

d=2 1x1 Conv
Dropout (optional)

ReLU
Hidden
WeightNorm

d=1 Dilated Causal Conv

x0 x1 . . . xT 1 xT
Input (i 1) (i 1)
x0 x1 x2 ... xT 2 xT 1 xT ẑ(i 1)
= (ẑ1 , . . . , ẑT )

(a) (b) (c)

Figure 1. Architectural elements in a TCN. (a) A dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3. The
receptive field is able to cover all values from the input sequence. (b) TCN residual block. An 1x1 convolution is added when residual
input and output have different dimensions. (c) An example of residual connection in a TCN. The blue lines are filters in the residual
function, and the green lines are identity mappings.

alent to introducing a fixed step between every two adjacent ure 1(b). Within a residual block, the TCN has two layers
filter taps. When d = 1, a dilated convolution reduces to a of dilated causal convolution and non-linearity, for which
regular convolution. Using larger dilation enables an output we used the rectified linear unit (ReLU) (Nair & Hinton,
at the top level to represent a wider range of inputs, thus 2010). For normalization, we applied weight normaliza-
effectively expanding the receptive field of a ConvNet. tion (Salimans & Kingma, 2016) to the convolutional filters.
In addition, a spatial dropout (Srivastava et al., 2014) was
This gives us two ways to increase the receptive field of the
added after each dilated convolution for regularization: at
TCN: choosing larger filter sizes k and increasing the dila-
each training step, a whole channel is zeroed out.
tion factor d, where the effective history of one such layer is
(k − 1)d. As is common when using dilated convolutions, However, whereas in standard ResNet the input is added
we increase d exponentially with the depth of the network directly to the output of the residual function, in TCN (and
(i.e., d = O(2i ) at level i of the network). This ensures that ConvNets in general) the input and output could have differ-
there is some filter that hits each input within the effective ent widths. To account for discrepant input-output widths,
history, while also allowing for an extremely large effective we use an additional 1x1 convolution to ensure that element-
history using deep networks. We provide an illustration in wise addition ⊕ receives tensors of the same shape (see
Figure 1(a). Figure 1(b,c)).

3.4. Residual Connections 3.5. Discussion

A residual block (He et al., 2016) contains a branch leading We conclude this section by listing several advantages and
out to a series of transformations F, whose outputs are disadvantages of using TCNs for sequence modeling.
added to the input x of the block:
• Parallelism. Unlike in RNNs where the predictions for
o = Activation(x + F(x)) (3) later timesteps must wait for their predecessors to com-
plete, convolutions can be done in parallel since the same
This effectively allows layers to learn modifications to
filter is used in each layer. Therefore, in both training and
the identity mapping rather than the entire transformation,
evaluation, a long input sequence can be processed as a
which has repeatedly been shown to benefit very deep net-
whole in TCN, instead of sequentially as in RNN.
works.
• Flexible receptive field size. A TCN can change its re-
Since a TCN’s receptive field depends on the network depth ceptive field size in multiple ways. For instance, stacking
n as well as filter size k and dilation factor d, stabilization of more dilated (causal) convolutional layers, using larger
deeper and larger TCNs becomes important. For example, in dilation factors, or increasing the filter size are all viable
a case where the prediction could depend on a history of size options (with possibly different interpretations). TCNs
212 and a high-dimensional input sequence, a network of up thus afford better control of the model’s memory size,
to 12 layers could be needed. Each layer, more specifically, and are easy to adapt to different domains.
consists of multiple filters for feature extraction. In our
• Stable gradients. Unlike recurrent architectures, TCN
design of the generic TCN model, we therefore employ a
has a backpropagation path different from the temporal
generic residual module in place of a convolutional layer.
direction of the sequence. TCN thus avoids the problem
The residual block for our baseline TCN is shown in Fig- of exploding/vanishing gradients, which is a major issue
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

for RNNs (and which led to the development of LSTM, give an MSE of about 0.1767. First introduced by Hochreiter
GRU, HF-RNN (Martens & Sutskever, 2011), etc.). & Schmidhuber (1997), the adding problem has been used
• Low memory requirement for training. Especially in repeatedly as a stress test for sequence models (Martens
the case of a long input sequence, LSTMs and GRUs can & Sutskever, 2011; Pascanu et al., 2013; Le et al., 2015;
easily use up a lot of memory to store the partial results Arjovsky et al., 2016; Zhang et al., 2016).
for their multiple cell gates. However, in a TCN the filters Sequential MNIST and P-MNIST. Sequential MNIST is
are shared across a layer, with the backpropagation path frequently used to test a recurrent network’s ability to retain
depending only on network depth. Therefore in practice, information from the distant past (Le et al., 2015; Zhang
we found gated RNNs likely to use up to a multiplicative et al., 2016; Wisdom et al., 2016; Cooijmans et al., 2016;
factor more memory than TCNs. Krueger et al., 2017; Jing et al., 2017). In this task, MNIST
• Variable length inputs. Just like RNNs, which model images (LeCun et al., 1998) are presented to the model
inputs with variable lengths in a recurrent way, TCNs as a 784×1 sequence for digit classification. In the more
can also take in inputs of arbitrary lengths by sliding the challenging P-MNIST setting, the order of the sequence is
1D convolutional kernels. This means that TCNs can be permuted at random (Le et al., 2015; Arjovsky et al., 2016;
adopted as drop-in replacements for RNNs for sequential Wisdom et al., 2016; Krueger et al., 2017).
data of arbitrary length.
Copy memory. In this task, each input sequence has length
T + 20. The first 10 values are chosen randomly among the
There are also two notable disadvantages to using TCNs. digits 1, . . . , 8, with the rest being all zeros, except for the
last 11 entries that are filled with the digit ‘9’ (the first ‘9’ is
• Data storage during evaluation. In evaluation/testing, a delimiter). The goal is to generate an output of the same
RNNs only need to maintain a hidden state and take in a length that is zero everywhere except the last 10 values after
current input xt in order to generate a prediction. In other the delimiter, where the model is expected to repeat the 10
words, a “summary” of the entire history is provided by values it encountered at the start of the input. This task was
the fixed-length set of vectors ht , and the actual observed used in prior works such as Zhang et al. (2016); Arjovsky
sequence can be discarded. In contrast, TCNs need to et al. (2016); Wisdom et al. (2016); Jing et al. (2017).
take in the raw sequence up to the effective history length, JSB Chorales and Nottingham. JSB Chorales (Allan &
thus possibly requiring more memory during evaluation. Williams, 2005) is a polyphonic music dataset consisting
• Potential parameter change for a transfer of domain. of the entire corpus of 382 four-part harmonized chorales
Different domains can have different requirements on the by J. S. Bach. Each input is a sequence of elements. Each
amount of history the model needs in order to predict. element is an 88-bit binary code that corresponds to the 88
Therefore, when transferring a model from a domain keys on a piano, with 1 indicating a key that is pressed at
where only little memory is needed (i.e., small k and d) a given time. Nottingham is a polyphonic music dataset
to a domain where much longer memory is required (i.e., based on a collection of 1,200 British and American folk
much larger k and d), TCN may perform poorly for not tunes, and is much larger than JSB Chorales. JSB Chorales
having a sufficiently large receptive field. and Nottingham have been used in numerous empirical
investigations of recurrent sequence modeling (Chung et al.,
4. Sequence Modeling Tasks 2014; Pascanu et al., 2014; Jozefowicz et al., 2015; Greff
et al., 2017). The performance on both tasks is measured in
We evaluate TCNs and RNNs on tasks that have been com- terms of negative log-likelihood (NLL).
monly used to benchmark the performance of different RNN
sequence modeling architectures (Hermans & Schrauwen, PennTreebank. We used the PennTreebank (PTB) (Mar-
2013; Chung et al., 2014; Pascanu et al., 2014; Le et al., cus et al., 1993) for both character-level and word-level
2015; Jozefowicz et al., 2015; Zhang et al., 2016). The language modeling. When used as a character-level lan-
intention is to conduct the evaluation on the “home turf” guage corpus, PTB contains 5,059K characters for training,
of RNN sequence models. We use a comprehensive set of 396K for validation, and 446K for testing, with an alphabet
synthetic stress tests along with real-world datasets from size of 50. When used as a word-level language corpus,
multiple domains. PTB contains 888K words for training, 70K for validation,
and 79K for testing, with a vocabulary size of 10K. This
The adding problem. In this task, each input consists of is a highly studied but relatively small language modeling
a length-n sequence of depth 2, with all values randomly dataset (Miyamoto & Cho, 2016; Krueger et al., 2017; Mer-
chosen in [0, 1], and the second dimension being all zeros ity et al., 2017).
except for two elements that are marked by 1. The objective
is to sum the two random values whose second dimensions Wikitext-103. Wikitext-103 (Merity et al., 2016) is almost
are marked by 1. Simply predicting the sum to be 1 should 110 times as large as PTB, featuring a vocabulary size of
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Table 1. Evaluation of TCNs and recurrent architectures on synthetic stress tests, polyphonic music modeling, character-level language
modeling, and word-level language modeling. The generic TCN architecture outperforms canonical recurrent networks across a
comprehensive suite of tasks and datasets. Current state-of-the-art results are listed in the supplement. h means that higher is better.
`
means that lower is better.

Models
Sequence Modeling Task Model Size (≈)
LSTM GRU RNN TCN
h
Seq. MNIST (accuracy ) 70K 87.2 96.2 21.5 99.0
Permuted MNIST (accuracy) 70K 85.7 87.3 25.3 97.2
Adding problem T =600 (loss` ) 70K 0.164 5.3e-5 0.177 5.8e-5
Copy memory T =1000 (loss) 16K 0.0204 0.0197 0.0202 3.5e-5
Music JSB Chorales (loss) 300K 8.45 8.43 8.91 8.10
Music Nottingham (loss) 1M 3.29 3.46 4.05 3.07
Word-level PTB (perplexity` ) 13M 78.93 92.48 114.50 89.21
Word-level Wiki-103 (perplexity) - 48.4 - - 45.19
Word-level LAMBADA (perplexity) - 4186 - 14725 1279
Char-level PTB (bpc` ) 3M 1.41 1.42 1.52 1.35
Char-level text8 (bpc) 5M 1.52 1.56 1.69 1.45

about 268K. The dataset contains 28K Wikipedia articles TCN architecture, just varying the depth of the network n
(about 103 million words) for training, 60 articles (about and occasionally the kernel size k so that the receptive field
218K words) for validation, and 60 articles (246K words) covers enough context for predictions. We use an expo-
for testing. This is a more representative and realistic dataset nential dilation d = 2i for layer i in the network, and the
than PTB, with a much larger vocabulary that includes many Adam optimizer (Kingma & Ba, 2015) with learning rate
rare words, and has been used in Merity et al. (2016); Grave 0.002 for TCN, unless otherwise noted. We also empiri-
et al. (2017); Dauphin et al. (2017). cally find that gradient clipping helped convergence, and we
pick the maximum norm for clipping from [0.3, 1]. When
LAMBADA. Introduced by Paperno et al. (2016), LAM-
training recurrent models, we use grid search to find a good
BADA is a dataset comprising 10K passages extracted from
set of hyperparameters (in particular, optimizer, recurrent
novels, with an average of 4.6 sentences as context, and 1 tar-
drop p ∈ [0.05, 0.5], learning rate, gradient clipping, and
get sentence the last word of which is to be predicted. This
initial forget-gate bias), while keeping the network around
dataset was built so that a person can easily guess the miss-
the same size as TCN. No other architectural elaborations,
ing word when given the context sentences, but not when
such as gating mechanisms or skip connections, were added
given only the target sentence without the context sentences.
to either TCNs or RNNs. Additional details and controlled
Most of the existing models fail on LAMBADA (Paperno
experiments are provided in the supplementary material.
et al., 2016; Grave et al., 2017). In general, better results
on LAMBADA indicate that a model is better at capturing
information from longer and broader context. The training 5.1. Synopsis of Results
data for LAMBADA is the full text of 2,662 novels with A synopsis of the results is shown in Table 1. Note that
more than 200M words. The vocabulary size is about 93K. on several of these tasks, the generic, canonical recurrent
text8. We also used the text8 dataset for character-level architectures we study (e.g., LSTM, GRU) are not the state-
language modeling (Mikolov et al., 2012). text8 is about of-the-art. (See the supplement for more details.) With this
20 times larger than PTB, with about 100M characters from caveat, the results strongly suggest that the generic TCN
Wikipedia (90M for training, 5M for validation, and 5M for architecture with minimal tuning outperforms canonical re-
testing). The corpus contains 27 unique alphabets. current architectures across a broad variety of sequence
modeling tasks that are commonly used to benchmark the
performance of recurrent architectures themselves. We now
5. Experiments analyze these results in more detail.
We compare the generic TCN architecture described in Sec-
tion 3 to canonical recurrent architectures, namely LSTM, 5.2. Synthetic Stress Tests
GRU, and vanilla RNN, with standard regularizations. All The adding problem. Convergence results for the adding
experiments reported in this section used exactly the same problem, for problem sizes T = 200 and 600, are shown
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

0.25 TCN 7x27, k=6 (70K) 0.25 TCN 8x24, k=8 (70K) 0.08 0.08
TCN 9x10, k=6 (10K) TCN 9x10, k=8 (14K)
LSTM, (70K) LSTM (70K) 0.07 GRU (16K) 0.07 GRU (16K)
0.20 GRU (70K) 0.20 GRU (70K) LSTM (16K) LSTM (16K)
AWD-LSTM (70K) AWD-LSTM (70K) 0.06 0.06
EURNN (16K) EURNN (16K)
Testing loss

Testing loss

Testing Loss

Testing Loss
0.05 Guess 0 for all 0.05 Guess 0 for all
0.15 0.15
0.04 0.04
0.10 0.10 0.03 0.03
0.02 0.02
0.05 0.05
0.01 0.01
0.000 1000 2000 3000 4000 5000 6000 7000 0.000 1000 2000 3000 4000 5000 6000 7000 0.00 0.00
0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000
Iteration Iteration Iteration Iteration

(a) T = 200 (b) T = 600 (a) T = 500 (b) T = 1000

Figure 2. Results on the adding problem for different sequence Figure 4. Result on the copy memory task for different sequence
lengths T . TCNs outperform recurrent architectures. lengths T . TCNs outperform recurrent architectures.
1.0 1.0

0.8 0.8 tasks (Zhang et al., 2016; Ha et al., 2017; Krueger et al.,
2017; Grave et al., 2017; Greff et al., 2017; Merity et al.,
Testing accuracy

Testing accuracy

0.6 0.6
2017). We mention some of these specialized architectures
0.4 0.4
TCN 8x25, k=7 (66K) TCN 8x25, k=7 (66K)
when useful, but our primary goal is to compare the generic
TCN 8x20, k=6 (41K) TCN 8x20, k=6 (41K)
0.2
LSTM (68K)
0.2
LSTM (68K) TCN model to similarly generic recurrent architectures, be-
GRU (68K) GRU (68K)
0.00 1000 2000 3000 4000 5000 6000 7000 8000 0.00 1000 2000 3000 4000 5000 6000 7000 8000
fore domain-specific tuning. The results are summarized in
Iteration Iteration
Table 1.
(a) Sequential MNIST (b) P-MNIST
Polyphonic music. On Nottingham and JSB Chorales, the
Figure 3. Results on Sequential MNIST and P-MNIST. TCNs out-
TCN with virtually no tuning outperforms the recurrent
perform recurrent architectures.
models by a considerable margin, and even outperforms
some enhanced recurrent architectures for this task such as
in Figure 2. All models were chosen to have roughly 70K HF-RNN (Boulanger-Lewandowski et al., 2012) and Diago-
parameters. TCNs quickly converged to a virtually perfect nal RNN (Subakan & Smaragdis, 2017). Note however that
solution (i.e., MSE near 0). GRUs also performed quite other models such as the Deep Belief Net LSTM perform
well, albeit slower to converge than TCNs. LSTMs and better still (Vohra et al., 2015); we believe this is likely due
vanilla RNNs performed significantly worse. to the fact that the datasets are relatively small, and thus the
Sequential MNIST and P-MNIST. Convergence results right regularization method or generative modeling proce-
on sequential and permuted MNIST, run over 10 epochs, dure can improve performance significantly. This is largely
are shown in Figure 3. All models were configured to have orthogonal to the RNN/TCN distinction, as a similar variant
roughly 70K parameters. For both problems, TCNs sub- of TCN may well be possible.
stantially outperform the recurrent architectures, both in Word-level language modeling. Language modeling re-
terms of convergence and in final accuracy on the task. For mains one of the primary applications of recurrent networks
P-MNIST, TCNs outperform state-of-the-art results (95.9%) and many recent works have focused on optimizing LSTMs
based on recurrent networks with Zoneout and Recurrent for this task (Krueger et al., 2017; Merity et al., 2017).
BatchNorm (Cooijmans et al., 2016; Krueger et al., 2017). Our implementation follows standard practice that ties the
Copy memory. Convergence results on the copy mem- weights of encoder and decoder layers for both TCN and
ory task are shown in Figure 4. TCNs quickly converge RNNs (Press & Wolf, 2016), which significantly reduces
to correct answers, while LSTMs and GRUs simply con- the number of parameters in the model. For training, we use
verge to the same loss as predicting all zeros. In this case SGD and anneal the learning rate by a factor of 0.5 for both
we also compare to the recently-proposed EURNN (Jing TCN and RNNs when validation accuracy plateaus.
et al., 2017), which was highlighted to perform well on On the smaller PTB corpus, an optimized LSTM architec-
this task. While both TCN and EURNN perform well for ture (with recurrent and embedding dropout, etc.) outper-
sequence length T = 500, the TCN has a clear advantage forms the TCN, while the TCN outperforms both GRU and
for T = 1000 and longer (in terms of both loss and rate of vanilla RNN. However, on the much larger Wikitext-103
convergence). corpus and the LAMBADA dataset (Paperno et al., 2016),
without any hyperparameter search, the TCN outperforms
5.3. Polyphonic Music and Language Modeling the LSTM results of Grave et al. (2017), achieving much
lower perplexities.
We now discuss the results on polyphonic music modeling,
character-level language modeling, and word-level language Character-level language modeling. On character-level
modeling. These domains are dominated by recurrent archi- language modeling (PTB and text8, accuracy measured in
tectures, with many specialized designs developed for these bits per character), the generic TCN outperforms regular-
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

ized LSTMs and GRUs as well as methods such as Norm-

stabilized LSTMs (Krueger & Memisevic, 2015). (Special-
ized architectures exist that outperform all of these, see the
supplement.)

5.4. Memory Size of TCN and RNNs

One of the theoretical advantages of recurrent architectures
is their unlimited memory: the theoretical ability to retain
information through sequences of unlimited length. We now
examine specifically how long the different architectures
can retain information in practice. We focus on 1) the copy
memory task, which is a stress test designed to evaluate long-
term, distant information propagation in recurrent networks, Figure 5. Accuracy on the copy memory task for sequences of
and 2) the LAMBADA task, which tests both local and different lengths T . While TCN exhibits 100% accuracy for all
sequence lengths, the LSTM and GRU degenerate to random guess-
non-local textual understanding.
ing as T grows.
The copy memory task is perfectly set up to examine a
model’s ability to retain information for different lengths long-range information propagation in convolutional and
of time. The requisite retention time can be controlled by recurrent networks, and showed that the “infinite memory”
varying the sequence length T . In contrast to Section 5.2, advantage of RNNs is largely absent in practice. TCNs
we now focus on the accuracy on the last 10 elements of exhibit longer memory than recurrent architectures with the
the output sequence (which are the nontrivial elements that same capacity.
must be recalled). We used models of size 10K for both
Numerous advanced schemes for regularizing and opti-
TCN and RNNs.
mizing LSTMs have been proposed (Press & Wolf, 2016;
The results of this focused study are shown in Figure 5. Krueger et al., 2017; Merity et al., 2017; Campos et al.,
TCNs consistently converge to 100% accuracy for all se- 2018). These schemes have significantly advanced the ac-
quence lengths, whereas LSTMs and GRUs of the same curacy achieved by LSTM-based architectures on some
size quickly degenerate to random guessing as the sequence datasets. The TCN has not yet benefitted from this con-
length T grows. The accuracy of the LSTM falls below 20% certed community-wide investment into architectural and
for T < 50, while the GRU falls below 20% for T < 200. algorithmic elaborations. We see such investment as desir-
These results indicate that TCNs are able to maintain a much able and expect it to yield advances in TCN performance
longer effective history than their recurrent counterparts. that are commensurate with the advances seen in recent
years in LSTM performance. We will release the code for
This observation is backed up on real data by experiments
our project to encourage this exploration.
on the large-scale LAMBADA dataset, which is specifically
designed to test a model’s ability to utilize broad context (Pa- The preeminence enjoyed by recurrent networks in sequence
perno et al., 2016). As shown in Table 1, TCN outperforms modeling may be largely a vestige of history. Until recently,
LSTMs and vanilla RNNs by a significant margin in perplex- before the introduction of architectural elements such as
ity on LAMBADA, with a substantially smaller network and dilated convolutions and residual connections, convolutional
virtually no tuning. (State-of-the-art results on this dataset architectures were indeed weaker. Our results indicate that
are even better, but only with the help of additional memory with these elements, a simple convolutional architecture
mechanisms (Grave et al., 2017).) is more effective across diverse sequence modeling tasks
than recurrent architectures such as LSTMs. Due to the
6. Conclusion comparable clarity and simplicity of TCNs, we conclude
that convolutional networks should be regarded as a natural
We have presented an empirical evaluation of generic convo- starting point and a powerful toolkit for sequence modeling.
lutional and recurrent architectures across a comprehensive
suite of sequence modeling tasks. To this end, we have
References
described a simple temporal convolutional network (TCN)
that combines best practices such as dilations and residual Allan, Moray and Williams, Christopher. Harmonising chorales
by probabilistic inference. In NIPS, 2005.
connections with the causal convolutions needed for autore-
gressive prediction. The experimental results indicate that Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary
TCN models substantially outperform generic recurrent ar- evolution recurrent neural networks. In ICML, 2016.
chitectures such as LSTMs and GRUs. We further studied Ba, Lei Jimmy, Kiros, Ryan, and Hinton, Geoffrey E. Layer
normalization. arXiv:1607.06450, 2016.
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural Gehring, Jonas, Auli, Michael, Grangier, David, Yarats, Denis, and
machine translation by jointly learning to align and translate. In Dauphin, Yann N. Convolutional sequence to sequence learning.
ICLR, 2015. In ICML, 2017b.
Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning Gers, Felix A, Schraudolph, Nicol N, and Schmidhuber, Jürgen.
long-term dependencies with gradient descent is difficult. IEEE Learning precise timing with lstm recurrent networks. JMLR, 3,
Transactions on Neural Networks, 5(2), 1994. 2002.
Bottou, Léon, Soulie, F Fogelman, Blanchet, Pascal, and Liénard, Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep
Jean-Sylvain. Speaker-independent isolated digit recognition: Learning. MIT Press, 2016.
Multilayer perceptrons vs. dynamic time warping. Neural Net- Grave, Edouard, Joulin, Armand, and Usunier, Nicolas. Improving
works, 3(4), 1990. neural language models with a continuous cache. In ICLR,
Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, 2017.
Pascal. Modeling temporal dependencies in high-dimensional Graves, Alex. Supervised Sequence Labelling with Recurrent
sequences: Application to polyphonic music generation and Neural Networks. Springer, 2012.
transcription. arXiv:1206.6392, 2012.
Graves, Alex. Generating sequences with recurrent neural net-
Bradbury, James, Merity, Stephen, Xiong, Caiming, and Socher, works. arXiv:1308.0850, 2013.
Richard. Quasi-recurrent neural networks. In ICLR, 2017.
Greff, Klaus, Srivastava, Rupesh Kumar, Koutnı́k, Jan, Steune-
Campos, Victor, Jou, Brendan, Giró i Nieto, Xavier, Torres, Jordi, brink, Bas R., and Schmidhuber, Jürgen. LSTM: A search space
and Chang, Shih-Fu. Skip RNN: Learning to skip state updates odyssey. IEEE Transactions on Neural Networks and Learning
in recurrent neural networks. In ICLR, 2018. Systems, 28(10), 2017.
Chang, Shiyu, Zhang, Yang, Han, Wei, Yu, Mo, Guo, Xiaoxiao, Ha, David, Dai, Andrew, and Le, Quoc V. HyperNetworks. In
Tan, Wei, Cui, Xiaodong, Witbrock, Michael J., Hasegawa- ICLR, 2017.
Johnson, Mark A., and Huang, Thomas S. Dilated recurrent
neural networks. In NIPS, 2017. He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian.
Deep residual learning for image recognition. In CVPR, 2016.
Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, and
Bengio, Yoshua. On the properties of neural machine translation: Hermans, Michiel and Schrauwen, Benjamin. Training and
Encoder-decoder approaches. arXiv:1409.1259, 2014. analysing deep recurrent neural networks. In NIPS, 2013.
Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Ben- Hinton, Geoffrey E. Connectionist learning procedures. Artificial
gio, Yoshua. Empirical evaluation of gated recurrent neural Intelligence, 40(1-3), 1989.
networks on sequence modeling. arXiv:1412.3555, 2014. Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term mem-
Chung, Junyoung, Ahn, Sungjin, and Bengio, Yoshua. Hierarchical ory. Neural Computation, 9(8), 1997.
multiscale recurrent neural networks. arXiv:1609.01704, 2016. Jing, Li, Shen, Yichen, Dubcek, Tena, Peurifoy, John, Skirlo, Scott,
Collobert, Ronan and Weston, Jason. A unified architecture for nat- LeCun, Yann, Tegmark, Max, and Soljačić, Marin. Tunable
ural language processing: Deep neural networks with multitask efficient unitary neural networks (EUNN) and their application
learning. In ICML, 2008. to RNNs. In ICML, 2017.
Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Johnson, Rie and Zhang, Tong. Effective use of word order for
Kavukcuoglu, Koray, and Kuksa, Pavel P. Natural language text categorization with convolutional neural networks. In HLT-
processing (almost) from scratch. JMLR, 12, 2011. NAACL, 2015.
Conneau, Alexis, Schwenk, Holger, LeCun, Yann, and Barrault, Johnson, Rie and Zhang, Tong. Deep pyramid convolutional neural
Loı̈c. Very deep convolutional networks for text classification. networks for text categorization. In ACL, 2017.
In European Chapter of the Association for Computational Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An
Linguistics (EACL), 2017. empirical exploration of recurrent network architectures. In
Cooijmans, Tim, Ballas, Nicolas, Laurent, César, Gülçehre, Çağlar, ICML, 2015.
and Courville, Aaron. Recurrent batch normalization. In ICLR, Kalchbrenner, Nal, Grefenstette, Edward, and Blunsom, Phil. A
2016. convolutional neural network for modelling sentences. In ACL,
Dauphin, Yann N., Fan, Angela, Auli, Michael, and Grangier, 2014.
David. Language modeling with gated convolutional networks. Kalchbrenner, Nal, Espeholt, Lasse, Simonyan, Karen, van den
In ICML, 2017. Oord, Aäron, Graves, Alex, and Kavukcuoglu, Koray. Neural
dos Santos, Cı́cero Nogueira and Zadrozny, Bianca. Learning machine translation in linear time. arXiv:1610.10099, 2016.
character-level representations for part-of-speech tagging. In Kim, Yoon. Convolutional neural networks for sentence classifica-
ICML, 2014. tion. In EMNLP, 2014.
El Hihi, Salah and Bengio, Yoshua. Hierarchical recurrent neural Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic
networks for long-term dependencies. In NIPS, 1995. optimization. In ICLR, 2015.
Elman, Jeffrey L. Finding structure in time. Cognitive Science, 14 Koutnik, Jan, Greff, Klaus, Gomez, Faustino, and Schmidhuber,
(2), 1990. Juergen. A clockwork RNN. In ICML, 2014.
Gehring, Jonas, Auli, Michael, Grangier, David, and Dauphin, Krueger, David and Memisevic, Roland. Regularizing RNNs by
Yann. A convolutional encoder model for neural machine trans- stabilizing activations. arXiv:1511.08400, 2015.
lation. In ACL, 2017a.
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Krueger, David, Maharaj, Tegan, Kramár, János, Pezeshki, Mo- (11), 1997.
hammad, Ballas, Nicolas, Ke, Nan Rosemary, Goyal, Anirudh, Sejnowski, Terrence J. and Rosenberg, Charles R. Parallel net-
Bengio, Yoshua, Larochelle, Hugo, Courville, Aaron C., and works that learn to pronounce English text. Complex Systems,
Pal, Chris. Zoneout: Regularizing RNNs by randomly preserv- 1, 1987.
ing hidden activations. In ICLR, 2017.
Shi, Xingjian, Chen, Zhourong, Wang, Hao, Yeung, Dit-Yan,
Le, Quoc V, Jaitly, Navdeep, and Hinton, Geoffrey E. A simple Wong, Wai-Kin, and Woo, Wang-chun. Convolutional LSTM
way to initialize recurrent networks of rectified linear units. network: A machine learning approach for precipitation now-
arXiv:1504.00941, 2015. casting. In NIPS, 2015.
LeCun, Yann, Boser, Bernhard, Denker, John S., Henderson, Srivastava, Nitish, Hinton, Geoffrey E, Krizhevsky, Alex,
Donnie, Howard, Richard E., Hubbard, Wayne, and Jackel, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple
Lawrence D. Backpropagation applied to handwritten zip code way to prevent neural networks from overfitting. JMLR, 15(1),
recognition. Neural Computation, 1(4), 1989. 2014.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Subakan, Y Cem and Smaragdis, Paris. Diagonal RNNs in sym-
Gradient-based learning applied to document recognition. Pro- bolic music modeling. arXiv:1704.05420, 2017.
ceedings of the IEEE, 86(11), 1998.
Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generat-
Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully ing text with recurrent neural networks. In ICML, 2011.
convolutional networks for semantic segmentation. In CVPR,
2015. Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to
sequence learning with neural networks. In NIPS, 2014.
Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini,
Beatrice. Building a large annotated corpus of English: The van den Oord, Aäron, Dieleman, Sander, Zen, Heiga, Simonyan,
Penn treebank. Computational Linguistics, 19(2), 1993. Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior,
Andrew W., and Kavukcuoglu, Koray. WaveNet: A generative
Martens, James and Sutskever, Ilya. Learning recurrent neural model for raw audio. arXiv:1609.03499, 2016.
networks with Hessian-free optimization. In ICML, 2011.
Vohra, Raunaq, Goel, Kratarth, and Sahoo, JK. Modeling temporal
Melis, Gábor, Dyer, Chris, and Blunsom, Phil. On the state of the dependencies in data using a DBN-LSTM. In Data Science and
art of evaluation in neural language models. In ICLR, 2018. Advanced Analytics (DSAA), 2015.
Merity, Stephen, Xiong, Caiming, Bradbury, James, and Socher, Waibel, Alex, Hanazawa, Toshiyuki, Hinton, Geoffrey, Shikano,
Richard. Pointer sentinel mixture models. arXiv:1609.07843, Kiyohiro, and Lang, Kevin J. Phoneme recognition using time-
2016. delay neural networks. IEEE Transactions on Acoustics, Speech,
Merity, Stephen, Keskar, Nitish Shirish, and Socher, Richard. and Signal Processing, 37(3), 1989.
Regularizing and optimizing LSTM language models. Werbos, Paul J. Backpropagation through time: What it does and
arXiv:1708.02182, 2017. how to do it. Proceedings of the IEEE, 78(10), 1990.
Mikolov, Tomáš, Sutskever, Ilya, Deoras, Anoop, Le, Hai-Son, Wisdom, Scott, Powers, Thomas, Hershey, John, Le Roux,
Kombrink, Stefan, and Cernocky, Jan. Subword language mod- Jonathan, and Atlas, Les. Full-capacity unitary recurrent neural
eling with neural networks. Preprint, 2012. networks. In NIPS, 2016.
Miyamoto, Yasumasa and Cho, Kyunghyun. Gated word-character Wu, Yuhuai, Zhang, Saizheng, Zhang, Ying, Bengio, Yoshua, and
recurrent language model. arXiv:1606.01700, 2016. Salakhutdinov, Ruslan R. On multiplicative integration with
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve recurrent neural networks. In NIPS, 2016.
restricted Boltzmann machines. In ICML, 2010. Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen,
Ng, Andrew. Sequence Models (Course 5 of Deep Learning Spe- William W. Breaking the softmax bottleneck: A high-rank
cialization). Coursera, 2018. RNN language model. ICLR, 2018.
Paperno, Denis, Kruszewski, Germán, Lazaridou, Angeliki, Pham, Yin, Wenpeng, Kann, Katharina, Yu, Mo, and Schütze, Hinrich.
Quan Ngoc, Bernardi, Raffaella, Pezzelle, Sandro, Baroni, Comparative study of CNN and RNN for natural language pro-
Marco, Boleda, Gemma, and Fernández, Raquel. The LAM- cessing. arXiv:1702.01923, 2017.
BADA dataset: Word prediction requiring a broad discourse Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation
context. arXiv:1606.06031, 2016. by dilated convolutions. In ICLR, 2016.
Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memi-
difficulty of training recurrent neural networks. In ICML, 2013. sevic, Roland, Salakhutdinov, Ruslan R, and Bengio, Yoshua.
Pascanu, Razvan, Gülçehre, Çaglar, Cho, Kyunghyun, and Bengio, Architectural complexity measures of recurrent neural networks.
Yoshua. How to construct deep recurrent neural networks. In In NIPS, 2016.
ICLR, 2014. Zhang, Xiang, Zhao, Junbo Jake, and LeCun, Yann. Character-
Press, Ofir and Wolf, Lior. Using the output embedding to improve level convolutional networks for text classification. In NIPS,
language models. arXiv:1608.05859, 2016. 2015.
Salimans, Tim and Kingma, Diederik P. Weight normalization: A
simple reparameterization to accelerate training of deep neural
networks. In NIPS, 2016.
Schuster, Mike and Paliwal, Kuldip K. Bidirectional recurrent
neural networks. IEEE Transactions on Signal Processing, 45
An Empirical Evaluation of Generic Convolutional and Recurrent Networks
for Sequence Modeling
Supplementary Material

A. Hyperparameters Settings C. Effect of Filter Size and Residual Block

A.1. Hyperparameters for TCN In this section we briefly study the effects of different com-
ponents of a TCN layer. Overall, we believe dilation is
Table 2 lists the hyperparameters we used when applying the
required for modeling long-term dependencies, and so we
generic TCN model on various tasks and datasets. The most
mainly focus on two other factors here: the filter size k used
important factor for picking parameters is to make sure that
by each layer, and the effect of residual blocks.
the TCN has a sufficiently large receptive field by choosing
k and d that can cover the amount of context needed for the We perform a series of controlled experiments, with the
task. results of the ablative analysis shown in Figure 6. As be-
fore, we kept the model size and depth exactly the same for
As discussed in Section 5, the number of hidden units was
different models, so that the dilation factor is strictly con-
chosen so that the model size is approximately at the same
trolled. The experiments were conducted on three different
level as the recurrent models with which we are comparing.
tasks: copy memory, permuted MNIST (P-MNIST), and
In Table 2, a gradient clip of N/A means no gradient clip-
Penn Treebank word-level language modeling. These ex-
ping was applied. In larger tasks (e.g., language modeling),
periments confirm that both factors (filter size and residual
we empirically found that gradient clipping (we randomly
connections) contribute to sequence modeling performance.
picked a threshold from [0.3, 1]) helps with regularizing
TCN and accelerating convergence. Filter size k. In both the copy memory and the P-MNIST
tasks, we observed faster convergence and better accuracy
All weights were initialized from a Gaussian disitribution
for larger filter sizes. In particular, looking at Figure 6a, a
N (0, 0.01). In general, we found TCN to be relatively in-
TCN with filter size ≤ 3 only converges to the same level as
sensitive to hyperparameter changes, as long as the effective
random guessing. In contrast, on word-level language mod-
history (i.e., receptive field) size is sufficient.
eling, a smaller kernel with filter size of k = 3 works best.
We believe this is because a smaller kernel (along with fixed
A.2. Hyperparameters for LSTM/GRU dilation) tends to focus more on the local context, which is
Table 3 reports hyperparameter settings that were used for especially important for PTB language modeling (in fact,
the LSTM. These values are picked from hyperparameter the very success of n-gram models suggests that only a
search for LSTMs that have up to 3 layers, and the optimiz- relatively short memory is needed for modeling language).
ers are chosen from {SGD, Adam, RMSprop, Adagrad}. Residual block. In all three scenarios that we compared
For certain larger datasets, we adopted the settings used here, we observed that the residual function stabilized train-
in prior work (e.g., Grave et al. (2017) on Wikitext-103). ing and brought faster convergence with better final results.
GRU hyperparameters were chosen in a similar fashion, but Especially in language modeling, we found that residual
typically with more hidden units than in LSTM to keep the connections contribute substantially to performance (See
total network size approximately the same (since a GRU Figure 6f).
cell is more compact).
D. Gating Mechanisms
B. State-of-the-Art Results
One component that had been used in prior work on con-
As previously noted, the generic TCN and LSTM/GRU volutional architectures for language modeling is the gated
models we used can be outperformed by more specialized activation (van den Oord et al., 2016; Dauphin et al., 2017).
architectures on some tasks. State-of-the-art results are We have chosen not to use gating in the generic TCN model.
summarized in Table 4. The same TCN architecture is used We now examine this choice more closely.
across all tasks. Note that the size of the state-of-the-art
model may be different from the size of the TCN.
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Table 2. TCN parameter settings for experiments in Section 5.

TCN S ETTINGS
Dataset/Task Subtask k n Hidden Dropout Grad Clip Note
T = 200 6 7 27
The Adding Problem T = 400 7 7 27 0.0 N/A
T = 600 8 8 24
7 8 25
Seq. MNIST - 0.0 N/A
6 8 20
7 8 25
Permuted MNIST - 0.0 N/A
6 8 20
T = 500 6 9 10
Copy Memory Task T = 1000 8 8 10 0.05 1.0 RMSprop 5e-4
T = 2000 8 9 10
Music JSB Chorales - 3 2 150 0.5 0.4
Music Nottingham - 6 4 150 0.2 0.4
PTB 3 4 600 0.5 Embed. size 600
Word-level LM Wiki-103 3 5 1000 0.4 Embed. size 400
0.4
LAMBADA 4 5 500 Embed. size 500
PTB 3 3 450
Char-level LM 0.1 0.15 Embed. size 100
text8 2 5 520

Dauphin et al. (2017) compared the effects of gated linear

units (GLU) and gated tanh units (GTU), and adopted GLU
in their non-dilated gated ConvNet. Following the same
choice, we now compare TCNs using ReLU and TCNs
with gating (GLU), represented by an elementwise prod-
uct between two convolutional layers, with one of them
also passing through a sigmoid function σ(x). Note that
the gates architecture uses approximately twice as many
convolutional layers as the ReLU-TCN.
The results are shown in Table 5, where we kept the number
of model parameters at about the same size. The GLU
does further improve TCN accuracy on certain language
modeling datasets like PTB, which agrees with prior work.
However, we do not observe comparable benefits on other
tasks, such as polyphonic music modeling or synthetic stress
tests that require longer information retention. On the copy
memory task with T = 1000, we found that TCN with
gating converged to a worse result than TCN with ReLU
(though still better than recurrent models).
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Table 3. LSTM parameter settings for experiments in Section 5.

LSTM S ETTINGS (K EY PARAMETERS )
Dataset/Task Subtask n Hidden Dropout Grad Clip Bias Note
T = 200 2 77 50 5.0 SGD 1e-3
The Adding Problem T = 400 2 77 0.0 50 10.0 Adam 2e-3
T = 600 1 130 5 1.0 -
Seq. MNIST - 1 130 0.0 1 1.0 RMSprop 1e-3
Permuted MNIST - 1 130 0.0 1 10.0 RMSprop 1e-3
T = 500 1 50 0.25
Copy Memory Task T = 1000 1 50 0.05 1 - RMSprop/Adam
T = 2000 3 28 1
Music JSB Chorales - 2 200 0.2 1 10.0 SGD/Adam
3 280 0.5 -
Music Nottingham - 0.1 Adam 4e-3
1 500 1 -
PTB 3 700 0.4 0.3 1.0 SGD 30, Emb. 700, etc.
Word-level LM Wiki-103 - - - - - Grave et al. (2017)
LAMBADA - - - - - Grave et al. (2017)
PTB 2 600 0.1 0.5 - Emb. size 120
Char-level LM
text8 1 1024 0.15 0.5 - Adam 1e-2

Table 4. State-of-the-art (SoTA) results for tasks in Section 5.

TCN VS . S OTA R ESULTS
Task TCN Result Size SoTA Size Model
Seq. MNIST (acc.) 99.0 21K 99.0 21K Dilated GRU (Chang et al., 2017)
P-MNIST (acc.) 97.2 42K 95.9 42K Zoneout (Krueger et al., 2017)
Adding Prob. 600 (loss) 5.8e-5 70K 5.3e-5 70K Regularized GRU
Copy Memory 1000 (loss) 3.5e-5 70K 0.011 70K EURNN (Jing et al., 2017)
JSB Chorales (loss) 8.10 300K 3.47 - DBN+LSTM (Vohra et al., 2015)
Nottingham (loss) 3.07 1M 1.32 - DBN+LSTM (Vohra et al., 2015)
AWD-LSTM-MoS + Dynamic
Word PTB (ppl) 89.21 13M 47.7 22M
Eval. (Yang et al., 2018)
Neural Cache Model (Large)
Word Wiki-103 (ppl) 45.19 148M 40.4 >300M
(Grave et al., 2017)
Neural Cache Model (Large)
Word LAMBADA (ppl) 1279 56M 138 >100M
(Grave et al., 2017)
2-LayerNorm HyperLSTM
Char PTB (bpc) 1.35 3M 1.22 14M
(Ha et al., 2017)
Char text8 (bpc) 1.45 4.6M 1.29 >12M HM-LSTM (Chung et al., 2016)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

0.06 1.00 140

TCN with k=2 TCN k=2 (13M)
TCN with k=3 TCN k=3 (13M)
0.05 0.95 130 TCN k=4 (13M)
TCN with k=4
TCN with k=6

Testing perplexity
Testing accuracy
0.04 0.90 120
TCN with k=8
Testing Loss

Guess 0 for all

0.03 0.85
110
TCN k=2 (67K)
0.02 0.80
TCN k=3 (67K) 100
TCN k=4 (67K)
0.01 0.75 TCN k=6 (67K)
TCN k=8 (67K) 90
0.00 0.70
0 500 1000 1500 2000 2500 3000 0 1000 2000 3000 4000 5000 6000 20 30 40 50 60 70 80 90
Iteration Iteration Iteration
(a) Different k on Copy Memory Task (b) Different k on P-MNIST (c) Different k on PTB (word)
0.030 1.00 140
TCN w/ residual (13K) TCN w/ residual k=2 (13M)
TCN w/o residual (13K) TCN w/o residual k=2 (13M)
0.025 0.95 130
Guess 0 for all TCN w/ residual k=3 (13M)
TCN w/o residual k=3 (13M)

Testing perplexity
Testing accuracy

0.020 0.90
120
Testing Loss

0.015 0.85
110
0.010 0.80
TCN w/ residual (67K) 100
TCN w/o residual (67K)
0.005 0.75 TCN w/ residual (10K)
TCN w/o residual (10K) 90
0.000 0.70
0 500 1000 1500 2000 2500 3000 0 1000 2000 3000 4000 5000 6000 20 30 40 50 60 70 80 90
Iteration Iteration Iteration
(d) Residual on Copy Memory Task (e) Residual on P-MNIST (f) Residual on PTB (word)

Figure 6. Controlled experiments that study the effect of different components of the TCN model.

Table 5. An evaluation of gating in TCN. A plain TCN is compared to a TCN that uses gated activations.

Task TCN TCN + Gating

Sequential MNIST (acc.) 99.0 99.0
Permuted MNIST (acc.) 97.2 96.9
Adding Problem T = 600 (loss) 5.8e-5 5.6e-5
Copy Memory T = 1000 (loss) 3.5e-5 0.00508
JSB Chorales (loss) 8.10 8.13
Nottingham (loss) 3.07 3.12

Word-level PTB (ppl) 89.21 87.94

Char-level PTB (bpc) 1.35 1.343

Char text8 (bpc) 1.45 1.485

5a. Recurrent Neural Networks
No ratings yet
5a. Recurrent Neural Networks
45 pages
Tworh2xgwk4j63v0hqj9prgf1.Bachelorthesis Simon Zocholl
No ratings yet
Tworh2xgwk4j63v0hqj9prgf1.Bachelorthesis Simon Zocholl
92 pages
11-rnn
No ratings yet
11-rnn
32 pages
31-Architectures, Deep Recurrent Networks, Auto Encoders-26!09!2024
No ratings yet
31-Architectures, Deep Recurrent Networks, Auto Encoders-26!09!2024
34 pages
A Hybrid Deep Neural Network Model For Time Series Forecasting
No ratings yet
A Hybrid Deep Neural Network Model For Time Series Forecasting
6 pages
RNN Vanishing Gradients LSTM Compressed
No ratings yet
RNN Vanishing Gradients LSTM Compressed
53 pages
6159 Resurrecting Recurrent Ne
No ratings yet
6159 Resurrecting Recurrent Ne
29 pages
Java script developer
No ratings yet
Java script developer
12 pages
Unit-5-updated
No ratings yet
Unit-5-updated
125 pages
Recurrent Neural Networks: An Embedded Computing Perspective
No ratings yet
Recurrent Neural Networks: An Embedded Computing Perspective
33 pages
1
No ratings yet
1
15 pages
2203.05095v1
No ratings yet
2203.05095v1
10 pages
Recurrent Neural Networks
No ratings yet
Recurrent Neural Networks
111 pages
Hierarchically Gated Recurrent
No ratings yet
Hierarchically Gated Recurrent
20 pages
ch10 Sequence Modelling - Recurrent and Recursive Nets
No ratings yet
ch10 Sequence Modelling - Recurrent and Recursive Nets
45 pages
RNN
No ratings yet
RNN
79 pages
8 Sequence Models - The Mathematical Engineering of Deep Learning (2021)
No ratings yet
8 Sequence Models - The Mathematical Engineering of Deep Learning (2021)
22 pages
Algorithms 17 00104
No ratings yet
Algorithms 17 00104
28 pages
LSTM: A Search Space Odyssey: Klaus Greff, Rupesh K. Srivastava, Jan Koutn Ik, Bas R. Steunebrink, J Urgen Schmidhuber
No ratings yet
LSTM: A Search Space Odyssey: Klaus Greff, Rupesh K. Srivastava, Jan Koutn Ik, Bas R. Steunebrink, J Urgen Schmidhuber
12 pages
DL-unit-4-part-2
No ratings yet
DL-unit-4-part-2
8 pages
Recurrent & Recursive Nets
No ratings yet
Recurrent & Recursive Nets
10 pages
Unit V
No ratings yet
Unit V
32 pages
A Practical Survey On Faster and Lighter Transformers - 2023 - Fournier Et Al
No ratings yet
A Practical Survey On Faster and Lighter Transformers - 2023 - Fournier Et Al
40 pages
UNIT-3 part2
No ratings yet
UNIT-3 part2
14 pages
Unit 5
No ratings yet
Unit 5
76 pages
EHaCON - 2019 Paper 8
No ratings yet
EHaCON - 2019 Paper 8
20 pages
2410[1]
No ratings yet
2410[1]
27 pages
Unit 4b - Recurrent Neural Networks
No ratings yet
Unit 4b - Recurrent Neural Networks
60 pages
A Neural Attention Model For Speech Command Recognition: A B C C
No ratings yet
A Neural Attention Model For Speech Command Recognition: A B C C
18 pages
An Empirical Evaluation of Generic Convolutional and Recurrent Networks For Sequence Modeling
No ratings yet
An Empirical Evaluation of Generic Convolutional and Recurrent Networks For Sequence Modeling
14 pages
10.2478 - Jaiscr 2019 0006
No ratings yet
10.2478 - Jaiscr 2019 0006
11 pages
30 Encoder, Decoder, Sequence To Sequence 25-09-2024
No ratings yet
30 Encoder, Decoder, Sequence To Sequence 25-09-2024
5 pages
Unit III- Recurrent Neural Networks
No ratings yet
Unit III- Recurrent Neural Networks
44 pages
Module 4-1
No ratings yet
Module 4-1
44 pages
question bank_3
No ratings yet
question bank_3
5 pages
DNN U2 Notes
No ratings yet
DNN U2 Notes
32 pages
RNN
No ratings yet
RNN
9 pages
Aquino Dominic Bien FA2.2
No ratings yet
Aquino Dominic Bien FA2.2
3 pages
Sequence Modeling RNN-LSTM-APPL-Anand Kumar JUNE2021
No ratings yet
Sequence Modeling RNN-LSTM-APPL-Anand Kumar JUNE2021
71 pages
Week5 CNN and RNN
No ratings yet
Week5 CNN and RNN
2 pages
UNIT-IV DL
No ratings yet
UNIT-IV DL
23 pages
Were Rnns All We Needed?: Leo - Feng@Mila - Quebec
No ratings yet
Were Rnns All We Needed?: Leo - Feng@Mila - Quebec
20 pages
Deep Learning - Unit-V Two marks
No ratings yet
Deep Learning - Unit-V Two marks
5 pages
28-Recurrent Neural Networks - Bidirectional RNNs-19!09!2024
No ratings yet
28-Recurrent Neural Networks - Bidirectional RNNs-19!09!2024
12 pages
Article On Recurrent Neural Networks
No ratings yet
Article On Recurrent Neural Networks
3 pages
Unit 4
No ratings yet
Unit 4
27 pages
ASP-DAC2017-1352-11
No ratings yet
ASP-DAC2017-1352-11
6 pages
unit 4_merged
No ratings yet
unit 4_merged
13 pages
Transactions On Neural Networks and Learning Systems 11
No ratings yet
Transactions On Neural Networks and Learning Systems 11
1 page
CH4_AA1.1-Sequence Models (1)
No ratings yet
CH4_AA1.1-Sequence Models (1)
26 pages
1308 0850 PDF
No ratings yet
1308 0850 PDF
43 pages
LSTM: A Search Space Odyssey: Klaus Greff, Rupesh K. Srivastava, Jan Koutn Ik, Bas R. Steunebrink, J Urgen Schmidhuber
No ratings yet
LSTM: A Search Space Odyssey: Klaus Greff, Rupesh K. Srivastava, Jan Koutn Ik, Bas R. Steunebrink, J Urgen Schmidhuber
1 page
Artificial Intelligence in Higher Education: The Power and Damage of AI-assisted Tools on Academic English Reading Skills
No ratings yet
Artificial Intelligence in Higher Education: The Power and Damage of AI-assisted Tools on Academic English Reading Skills
19 pages
Survey of Prediction Using Recurrent Neural Network
No ratings yet
Survey of Prediction Using Recurrent Neural Network
3 pages
Survey On Recurrent Neural Network in Natural Lang
No ratings yet
Survey On Recurrent Neural Network in Natural Lang
5 pages
MPS-Automated Template 2022-2023 Item Analysis
No ratings yet
MPS-Automated Template 2022-2023 Item Analysis
1 page
Lecture Notes_RRN
No ratings yet
Lecture Notes_RRN
8 pages
Module 06
No ratings yet
Module 06
5 pages
Homework Should Not Be Banned
89% (9)
Homework Should Not Be Banned
2 pages
Wharton Executive Development Program Brochure
No ratings yet
Wharton Executive Development Program Brochure
16 pages
Core Music Standards EUs EQs Definitions
No ratings yet
Core Music Standards EUs EQs Definitions
2 pages
Steps For Training A Recurrent Neural Network: Advantages
No ratings yet
Steps For Training A Recurrent Neural Network: Advantages
13 pages
Manage Conflict Within A Team ILM - Assessment - Guidance
No ratings yet
Manage Conflict Within A Team ILM - Assessment - Guidance
6 pages
CSTP Continuum of Practice Levels of Teacher Development
100% (1)
CSTP Continuum of Practice Levels of Teacher Development
2 pages
Multiple Vocabulary: An Alternative Choice
No ratings yet
Multiple Vocabulary: An Alternative Choice
13 pages
Module8. Why Does The Future Not Need Us.v3
100% (2)
Module8. Why Does The Future Not Need Us.v3
7 pages
Super_Safari_1_Lesson_Plan
No ratings yet
Super_Safari_1_Lesson_Plan
9 pages
Sample Documentation
No ratings yet
Sample Documentation
6 pages
Cipp Model For School Evaluation
100% (1)
Cipp Model For School Evaluation
5 pages
MAPEH OBE Articulation Matrix GRADE 1
No ratings yet
MAPEH OBE Articulation Matrix GRADE 1
4 pages
Title of The Session Session 1: Session Guide Writing Duration, Date & Venue Target Participants and Profile Objectives Terminal
No ratings yet
Title of The Session Session 1: Session Guide Writing Duration, Date & Venue Target Participants and Profile Objectives Terminal
19 pages
Orange and White Creative Mind Map Brainstorm.pdf
No ratings yet
Orange and White Creative Mind Map Brainstorm.pdf
1 page
Ramsey 2011 Provocative Theory 1
No ratings yet
Ramsey 2011 Provocative Theory 1
16 pages
DLP 012
No ratings yet
DLP 012
4 pages
Nicky. H - Sharing The Planet - Summative Assessment Rubric 2011
No ratings yet
Nicky. H - Sharing The Planet - Summative Assessment Rubric 2011
2 pages
Pat Conway Curriculum Vitae
No ratings yet
Pat Conway Curriculum Vitae
2 pages
LS2 DLL (Mixture)
No ratings yet
LS2 DLL (Mixture)
8 pages
Interview Questions
No ratings yet
Interview Questions
2 pages
Sample Argumentative Essay
No ratings yet
Sample Argumentative Essay
3 pages
Kinesiology, Human Movement Science (Pre - Medical/Dental Option), B.S
No ratings yet
Kinesiology, Human Movement Science (Pre - Medical/Dental Option), B.S
1 page
Classroom Observation Reflective Summary
No ratings yet
Classroom Observation Reflective Summary
2 pages
Nalina Resume
No ratings yet
Nalina Resume
3 pages
Ban Homework
No ratings yet
Ban Homework
7 pages
Lesson-Exemplar - Research
No ratings yet
Lesson-Exemplar - Research
9 pages
Course Description Grade 10
No ratings yet
Course Description Grade 10
5 pages
Instructional Design and Materials Evaluation Form
No ratings yet
Instructional Design and Materials Evaluation Form
3 pages
Lesson8 Activity Network Security Applications Countermeasures
No ratings yet
Lesson8 Activity Network Security Applications Countermeasures
1 page
RWKV Architecture and Applications: The Complete Guide for Developers and Engineers
From Everand
RWKV Architecture and Applications: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Gensim for Natural Language Processing: Definitive Reference for Developers and Engineers
From Everand
Gensim for Natural Language Processing: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet