An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
To our knowledge, the presented study is the most extensive systematic comparison of convolutional and recurrent architectures on sequence modeling tasks. The results suggest that the common association between sequence modeling and recurrent networks should be reconsidered. The TCN architecture appears not only more accurate than canonical recurrent networks such as LSTMs and GRUs, but also simpler and clearer. It may therefore be a more appropriate starting point in the application of deep networks to sequences. To assist related work, we have made code available at http://github.com/locuslab/TCN.

2. Background

Convolutional networks (LeCun et al., 1989) have been applied to sequences for decades (Sejnowski & Rosenberg, 1987; Hinton, 1989). They were used prominently for speech recognition in the 80s and 90s (Waibel et al., 1989; Bottou et al., 1990). ConvNets were subsequently applied to NLP tasks such as part-of-speech tagging and semantic role labelling (Collobert & Weston, 2008; Collobert et al., 2011; dos Santos & Zadrozny, 2014). More recently, convolutional networks were applied to sentence classification (Kalchbrenner et al., 2014; Kim, 2014) and document classification (Zhang et al., 2015; Conneau et al., 2017; Johnson & Zhang, 2015; 2017). Particularly inspiring for our work are the recent applications of convolutional architectures to machine translation (Kalchbrenner et al., 2016; Gehring et al., 2017a;b), audio synthesis (van den Oord et al., 2016), and language modeling (Dauphin et al., 2017).

Recurrent networks are dedicated sequence models that maintain a vector of hidden activations that are propagated through time (Elman, 1990; Werbos, 1990; Graves, 2012). This family of architectures has gained tremendous popularity due to prominent applications to language modeling (Sutskever et al., 2011; Graves, 2013; Hermans & Schrauwen, 2013) and machine translation (Sutskever et al., 2014; Bahdanau et al., 2015). The intuitive appeal of recurrent modeling is that the hidden state can act as a representation of everything that has been seen so far in the sequence. Basic RNN architectures are notoriously difficult to train (Bengio et al., 1994; Pascanu et al., 2013) and more elaborate architectures are commonly used instead, such as the LSTM (Hochreiter & Schmidhuber, 1997) and the GRU (Cho et al., 2014). Many other architectural innovations and training techniques for recurrent networks have been introduced and continue to be actively explored (El Hihi & Bengio, 1995; Schuster & Paliwal, 1997; Gers et al., 2002; Koutnik et al., 2014; Le et al., 2015; Ba et al., 2016; Wu et al., 2016; Krueger et al., 2017; Merity et al., 2017; Campos et al., 2018).

Multiple empirical studies have been conducted to evaluate the effectiveness of different recurrent architectures. These studies have been motivated in part by the many degrees of freedom in the design of such architectures. Chung et al. (2014) compared different types of recurrent units (LSTM vs. GRU) on the task of polyphonic music modeling. Pascanu et al. (2014) explored different ways to construct deep RNNs and evaluated the performance of different architectures on polyphonic music modeling, character-level language modeling, and word-level language modeling. Jozefowicz et al. (2015) searched through more than ten thousand different RNN architectures and evaluated their performance on various tasks. They concluded that if there were “architectures much better than the LSTM”, then they were “not trivial to find”. Greff et al. (2017) benchmarked the performance of eight LSTM variants on speech recognition, handwriting recognition, and polyphonic music modeling. They also found that “none of the variants can improve upon the standard LSTM architecture significantly”. Zhang et al. (2016) systematically analyzed the connecting architectures of RNNs and evaluated different architectures on character-level language modeling and on synthetic stress tests. Melis et al. (2018) benchmarked LSTM-based architectures on word-level and character-level language modeling, and concluded that “LSTMs outperform the more recent models”.

Other recent works have aimed to combine aspects of RNN and CNN architectures. This includes the Convolutional LSTM (Shi et al., 2015), which replaces the fully-connected layers in an LSTM with convolutional layers to allow for additional structure in the recurrent layers; the Quasi-RNN model (Bradbury et al., 2017) that interleaves convolutional layers with simple recurrent layers; and the dilated RNN (Chang et al., 2017), which adds dilations to recurrent architectures. While these combinations show promise in combining the desirable aspects of both types of architectures, our study here focuses on a comparison of generic convolutional and recurrent architectures.

While there have been multiple thorough evaluations of RNN architectures on representative sequence modeling tasks, we are not aware of a similarly thorough comparison of convolutional and recurrent approaches to sequence modeling. (Yin et al. (2017) have reported a comparison of convolutional and recurrent networks for sentence-level and document-level classification tasks. In contrast, sequence modeling calls for architectures that can synthesize whole sequences, element by element.) Such comparison is particularly intriguing in light of the aforementioned recent success of convolutional architectures in this domain. Our work aims to compare generic convolutional and recurrent architectures on typical sequence modeling tasks that are commonly used to benchmark RNN variants themselves (Hermans & Schrauwen, 2013; Le et al., 2015; Jozefowicz et al., 2015; Zhang et al., 2016).
3. Temporal Convolutional Networks

We begin by describing a generic architecture for convolutional sequence prediction. Our aim is to distill the best practices in convolutional network design into a simple architecture that can serve as a convenient but powerful starting point. We refer to the presented architecture as a temporal convolutional network (TCN), emphasizing that we adopt this term not as a label for a truly new architecture, but as a simple descriptive term for a family of architectures. The distinguishing characteristics of TCNs are: 1) the convolutions in the architecture are causal, meaning that there is no information “leakage” from future to past; 2) the architecture can take a sequence of any length and map it to an output sequence of the same length, just as with an RNN. Beyond this, we emphasize how to build very long effective history sizes (i.e., the ability for the networks to look very far into the past to make a prediction) using a combination of very deep networks (augmented with residual layers) and dilated convolutions.

Our architecture is informed by recent convolutional architectures for sequential data (van den Oord et al., 2016; Kalchbrenner et al., 2016; Dauphin et al., 2017; Gehring et al., 2017a;b), but is distinct from all of them and was designed from first principles to combine simplicity, autoregressive prediction, and very long memory. For example, the TCN is much simpler than WaveNet (van den Oord et al., 2016) (no skip connections across layers, conditioning, context stacking, or gated activations). Compared to the language modeling architecture of Dauphin et al. (2017), TCNs do not use gating mechanisms and have much longer memory.

3.1. Sequence Modeling

Before defining the network structure, we highlight the nature of the sequence modeling task. Suppose that we are given an input sequence x_0, ..., x_T, and wish to predict some corresponding outputs y_0, ..., y_T at each time. The key constraint is that to predict the output y_t for some time t, we are constrained to only use those inputs that have been previously observed: x_0, ..., x_t. Formally, a sequence modeling network is any function f : X^(T+1) → Y^(T+1) that produces the mapping

    ŷ_0, ..., ŷ_T = f(x_0, ..., x_T)    (1)

if it satisfies the causal constraint that y_t depends only on x_0, ..., x_t and not on any “future” inputs x_(t+1), ..., x_T. The goal of learning in the sequence modeling setting is to find a network f that minimizes some expected loss between the actual outputs and the predictions, L(y_0, ..., y_T, f(x_0, ..., x_T)), where the sequences and outputs are drawn according to some distribution.

This formalism encompasses many settings such as autoregressive prediction (where we try to predict some signal given its past) by setting the target output to be simply the input shifted by one time step. It does not, however, directly capture domains such as machine translation, or sequence-to-sequence prediction in general, since in these cases the entire input sequence (including “future” states) can be used to predict each output (though the techniques can naturally be extended to work in such settings).
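As a concrete illustration of the autoregressive setting, the short sketch below (our own illustrative code, not part of the paper; it assumes NumPy and a univariate sequence) constructs the shifted-by-one targets used for next-step prediction.

```python
import numpy as np

def next_step_targets(x):
    """Build an autoregressive training pair from a sequence x_0..x_T:
    the model sees x_0..x_{T-1} and is trained to predict x_1..x_T,
    i.e. the target is simply the input shifted by one time step."""
    x = np.asarray(x)
    inputs = x[:-1]   # x_0, ..., x_{T-1}
    targets = x[1:]   # x_1, ..., x_T  (target at time t is x_{t+1})
    return inputs, targets

# Example with a toy signal:
inputs, targets = next_step_targets(np.sin(np.linspace(0.0, 10.0, 100)))
```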
3.2. Causal Convolutions

As mentioned above, the TCN is based upon two principles: the fact that the network produces an output of the same length as the input, and the fact that there can be no leakage from the future into the past. To accomplish the first point, the TCN uses a 1D fully-convolutional network (FCN) architecture (Long et al., 2015), where each hidden layer is the same length as the input layer, and zero padding of length (kernel size − 1) is added to keep subsequent layers the same length as previous ones. To achieve the second point, the TCN uses causal convolutions, convolutions where an output at time t is convolved only with elements from time t and earlier in the previous layer.

To put it simply: TCN = 1D FCN + causal convolutions.

Note that this is essentially the same architecture as the time delay neural network proposed nearly 30 years ago by Waibel et al. (1989), with the sole tweak of zero padding to ensure equal sizes of all layers.

A major disadvantage of this basic design is that in order to achieve a long effective history size, we need an extremely deep network or very large filters, neither of which were particularly feasible when the methods were first introduced. Thus, in the following sections, we describe how techniques from modern convolutional architectures can be integrated into a TCN to allow for both very deep networks and very long effective history.

3.3. Dilated Convolutions

A simple causal convolution is only able to look back at a history with size linear in the depth of the network. This makes it challenging to apply the aforementioned causal convolution on sequence tasks, especially those requiring longer history. Our solution here, following the work of van den Oord et al. (2016), is to employ dilated convolutions that enable an exponentially large receptive field (Yu & Koltun, 2016). More formally, for a 1-D sequence input x ∈ R^n and a filter f : {0, ..., k−1} → R, the dilated convolution operation F on element s of the sequence is defined as

    F(s) = (x ∗_d f)(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i}    (2)

where d is the dilation factor, k is the filter size, and s − d·i accounts for the direction of the past.
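To make the dilated convolution of Eq. (2) concrete, the following sketch (illustrative NumPy code, not the authors' implementation) evaluates F(s) at every position, treating indices before the start of the sequence as zero, which corresponds to the causal left padding discussed in Section 3.2.

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Eq. (2): F(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i].
    x: 1-D input of length n; f: filter of size k; d: dilation factor.
    Positions s - d*i < 0 are treated as zero (causal left padding)."""
    x, f = np.asarray(x, dtype=float), np.asarray(f, dtype=float)
    n, k = len(x), len(f)
    out = np.zeros(n)
    for s in range(n):
        for i in range(k):
            j = s - d * i
            if j >= 0:
                out[s] += f[i] * x[j]
    return out

# With d = 1 this reduces to a regular (causal) convolution; with k = 3 and
# d = 4 the output at time s depends on x[s], x[s-4], and x[s-8].
y = dilated_causal_conv(np.arange(10.0), f=[0.5, 0.3, 0.2], d=4)
```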
[Figure 1. Architectural elements in a TCN. (a) A stack of dilated causal convolutions: Input x_0, x_1, ..., x_T at the bottom, Hidden layers, and Output at the top, with the dilation of the convolutional filter growing with depth (e.g., d = 4 in the top layer). (b) The residual block (k, d): dilated causal convolutional filters with weight normalization, ReLU, and dropout, added to an identity map (or 1x1 convolution) on the skip path. (c) An example of a residual connection for a block with k = 3, d = 1, operating on ẑ^(i−1) = (ẑ_1^(i−1), ..., ẑ_T^(i−1)).]
Dilation is thus equivalent to introducing a fixed step between every two adjacent filter taps. When d = 1, a dilated convolution reduces to a regular convolution. Using larger dilation enables an output at the top level to represent a wider range of inputs, thus effectively expanding the receptive field of a ConvNet.

This gives us two ways to increase the receptive field of the TCN: choosing larger filter sizes k and increasing the dilation factor d, where the effective history of one such layer is (k − 1)d. As is common when using dilated convolutions, we increase d exponentially with the depth of the network (i.e., d = O(2^i) at level i of the network). This ensures that there is some filter that hits each input within the effective history, while also allowing for an extremely large effective history using deep networks. We provide an illustration in Figure 1(a).

3.4. Residual Connections

The residual block used in our baseline TCN is shown in Figure 1(b). Within a residual block, the TCN has two layers of dilated causal convolution and non-linearity, for which we used the rectified linear unit (ReLU) (Nair & Hinton, 2010). For normalization, we applied weight normalization (Salimans & Kingma, 2016) to the convolutional filters. In addition, a spatial dropout (Srivastava et al., 2014) was added after each dilated convolution for regularization: at each training step, a whole channel is zeroed out.

However, whereas in standard ResNet the input is added directly to the output of the residual function, in TCN (and ConvNets in general) the input and output could have different widths. To account for discrepant input-output widths, we use an additional 1x1 convolution to ensure that element-wise addition ⊕ receives tensors of the same shape (see Figure 1(b,c)).
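The sketch below is a minimal PyTorch rendering of these ideas: a causal convolution obtained by left padding with (k − 1)d zeros, a residual block in the spirit of Figure 1(b), and a stack of blocks with dilation d = 2^i at level i. It is our own illustrative code rather than the released implementation; the channel sizes, dropout rate, and the use of plain dropout in place of channel-wise (spatial) dropout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm

class CausalConv1d(nn.Module):
    """Dilated convolution made causal by left-padding with (k-1)*d zeros,
    so the output has the same length as the input."""
    def __init__(self, c_in, c_out, k, d):
        super().__init__()
        self.pad = (k - 1) * d
        self.conv = weight_norm(nn.Conv1d(c_in, c_out, k, dilation=d))

    def forward(self, x):                           # x: (batch, channels, length)
        return self.conv(F.pad(x, (self.pad, 0)))   # pad only on the left (past)

class ResidualBlock(nn.Module):
    """Two (causal conv -> weight norm -> ReLU -> dropout) layers plus a skip
    connection; a 1x1 convolution matches widths when c_in != c_out."""
    def __init__(self, c_in, c_out, k, d, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(c_in, c_out, k, d), nn.ReLU(), nn.Dropout(dropout),
            CausalConv1d(c_out, c_out, k, d), nn.ReLU(), nn.Dropout(dropout),
        )
        self.downsample = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return F.relu(self.net(x) + self.downsample(x))

class TCN(nn.Module):
    """Stack of residual blocks with dilation d = 2^i at level i."""
    def __init__(self, c_in, channels, k=3, dropout=0.2):
        super().__init__()
        layers, prev = [], c_in
        for i, c in enumerate(channels):
            layers.append(ResidualBlock(prev, c, k, d=2 ** i, dropout=dropout))
            prev = c
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# The receptive field of the stack is 1 + sum over levels of 2*(k-1)*2^i
# (two convolutions per block), so depth and kernel size together control
# how far back the model can look.
tcn = TCN(c_in=1, channels=[25] * 8, k=7)
y = tcn(torch.randn(4, 1, 200))   # output length equals input length
```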
• Stable gradients. Unlike recurrent architectures, a TCN has a backpropagation path that differs from the temporal direction of the sequence, and thus avoids the exploding/vanishing gradient problem, which is a major issue for RNNs (and which led to the development of LSTM, GRU, HF-RNN (Martens & Sutskever, 2011), etc.).

• Low memory requirement for training. Especially in the case of a long input sequence, LSTMs and GRUs can easily use up a lot of memory to store the partial results for their multiple cell gates. However, in a TCN the filters are shared across a layer, with the backpropagation path depending only on network depth. Therefore, in practice, we found gated RNNs likely to use up to a multiplicative factor more memory than TCNs.

• Variable length inputs. Just like RNNs, which model inputs with variable lengths in a recurrent way, TCNs can also take in inputs of arbitrary lengths by sliding the 1D convolutional kernels. This means that TCNs can be adopted as drop-in replacements for RNNs for sequential data of arbitrary length.

There are also two notable disadvantages to using TCNs.

• Data storage during evaluation. In evaluation/testing, RNNs only need to maintain a hidden state and take in a current input x_t in order to generate a prediction. In other words, a “summary” of the entire history is provided by the fixed-length set of vectors h_t, and the actual observed sequence can be discarded. In contrast, TCNs need to take in the raw sequence up to the effective history length, thus possibly requiring more memory during evaluation.

• Potential parameter change for a transfer of domain. Different domains can have different requirements on the amount of history the model needs in order to predict. Therefore, when transferring a model from a domain where only little memory is needed (i.e., small k and d) to a domain where much longer memory is required (i.e., much larger k and d), a TCN may perform poorly for not having a sufficiently large receptive field.

4. Sequence Modeling Tasks

We evaluate TCNs and RNNs on tasks that have been commonly used to benchmark the performance of different RNN sequence modeling architectures (Hermans & Schrauwen, 2013; Chung et al., 2014; Pascanu et al., 2014; Le et al., 2015; Jozefowicz et al., 2015; Zhang et al., 2016). The intention is to conduct the evaluation on the “home turf” of RNN sequence models. We use a comprehensive set of synthetic stress tests along with real-world datasets from multiple domains.

The adding problem. In this task, each input consists of a length-T sequence of depth 2, with all values randomly chosen in [0, 1], and the second dimension being all zeros except for two elements that are marked by 1. The objective is to sum the two random values whose second dimensions are marked by 1. Simply predicting the sum to be 1 should give an MSE of about 0.1767. First introduced by Hochreiter & Schmidhuber (1997), the adding problem has been used repeatedly as a stress test for sequence models (Martens & Sutskever, 2011; Pascanu et al., 2013; Le et al., 2015; Arjovsky et al., 2016; Zhang et al., 2016).

Sequential MNIST and P-MNIST. Sequential MNIST is frequently used to test a recurrent network’s ability to retain information from the distant past (Le et al., 2015; Zhang et al., 2016; Wisdom et al., 2016; Cooijmans et al., 2016; Krueger et al., 2017; Jing et al., 2017). In this task, MNIST images (LeCun et al., 1998) are presented to the model as a 784×1 sequence for digit classification. In the more challenging P-MNIST setting, the order of the sequence is permuted at random (Le et al., 2015; Arjovsky et al., 2016; Wisdom et al., 2016; Krueger et al., 2017).

Copy memory. In this task, each input sequence has length T + 20. The first 10 values are chosen randomly among the digits 1, ..., 8, with the rest being all zeros, except for the last 11 entries, which are filled with the digit ‘9’ (the first ‘9’ is a delimiter). The goal is to generate an output of the same length that is zero everywhere except the last 10 values after the delimiter, where the model is expected to repeat the 10 values it encountered at the start of the input. This task was used in prior works such as Zhang et al. (2016); Arjovsky et al. (2016); Wisdom et al. (2016); Jing et al. (2017).
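To make the synthetic tasks concrete, the sketch below (our own illustrative NumPy code; the array shapes and batching conventions are assumptions, not the authors' data pipeline) generates batches for the adding problem and the copy memory task as described above.

```python
import numpy as np

def adding_problem_batch(batch_size, T):
    """Each input is a (T, 2) sequence: dimension 0 holds values in [0, 1],
    dimension 1 marks exactly two positions with 1.
    The target is the sum of the two marked values."""
    values = np.random.uniform(0.0, 1.0, size=(batch_size, T))
    marks = np.zeros((batch_size, T))
    targets = np.zeros(batch_size)
    for b in range(batch_size):
        i, j = np.random.choice(T, size=2, replace=False)
        marks[b, [i, j]] = 1.0
        targets[b] = values[b, i] + values[b, j]
    inputs = np.stack([values, marks], axis=-1)   # (batch, T, 2)
    return inputs, targets

def copy_memory_batch(batch_size, T):
    """Inputs have length T + 20: 10 random digits from {1..8}, then zeros,
    then eleven 9s (the first 9 is the delimiter). The target is zero
    everywhere except the last 10 positions, which repeat the 10 digits."""
    seq_len = T + 20
    inputs = np.zeros((batch_size, seq_len), dtype=np.int64)
    targets = np.zeros((batch_size, seq_len), dtype=np.int64)
    digits = np.random.randint(1, 9, size=(batch_size, 10))
    inputs[:, :10] = digits
    inputs[:, -11:] = 9
    targets[:, -10:] = digits
    return inputs, targets

x_add, y_add = adding_problem_batch(32, T=600)
x_copy, y_copy = copy_memory_batch(32, T=1000)
```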
Therefore, when transferring a model from a domain keys on a piano, with 1 indicating a key that is pressed at
where only little memory is needed (i.e., small k and d) a given time. Nottingham is a polyphonic music dataset
to a domain where much longer memory is required (i.e., based on a collection of 1,200 British and American folk
much larger k and d), TCN may perform poorly for not tunes, and is much larger than JSB Chorales. JSB Chorales
having a sufficiently large receptive field. and Nottingham have been used in numerous empirical
investigations of recurrent sequence modeling (Chung et al.,
4. Sequence Modeling Tasks 2014; Pascanu et al., 2014; Jozefowicz et al., 2015; Greff
et al., 2017). The performance on both tasks is measured in
We evaluate TCNs and RNNs on tasks that have been com- terms of negative log-likelihood (NLL).
monly used to benchmark the performance of different RNN
sequence modeling architectures (Hermans & Schrauwen, PennTreebank. We used the PennTreebank (PTB) (Mar-
2013; Chung et al., 2014; Pascanu et al., 2014; Le et al., cus et al., 1993) for both character-level and word-level
2015; Jozefowicz et al., 2015; Zhang et al., 2016). The language modeling. When used as a character-level lan-
intention is to conduct the evaluation on the “home turf” guage corpus, PTB contains 5,059K characters for training,
of RNN sequence models. We use a comprehensive set of 396K for validation, and 446K for testing, with an alphabet
synthetic stress tests along with real-world datasets from size of 50. When used as a word-level language corpus,
multiple domains. PTB contains 888K words for training, 70K for validation,
and 79K for testing, with a vocabulary size of 10K. This
The adding problem. In this task, each input consists of is a highly studied but relatively small language modeling
a length-n sequence of depth 2, with all values randomly dataset (Miyamoto & Cho, 2016; Krueger et al., 2017; Mer-
chosen in [0, 1], and the second dimension being all zeros ity et al., 2017).
except for two elements that are marked by 1. The objective
is to sum the two random values whose second dimensions Wikitext-103. Wikitext-103 (Merity et al., 2016) is almost
are marked by 1. Simply predicting the sum to be 1 should 110 times as large as PTB, featuring a vocabulary size of
Table 1. Evaluation of TCNs and recurrent architectures on synthetic stress tests, polyphonic music modeling, character-level language modeling, and word-level language modeling. The generic TCN architecture outperforms canonical recurrent networks across a comprehensive suite of tasks and datasets. Current state-of-the-art results are listed in the supplement. ↑ means that higher is better; ↓ means that lower is better.

Sequence Modeling Task               Model Size (≈)   LSTM     GRU      RNN      TCN
Seq. MNIST (accuracy ↑)              70K              87.2     96.2     21.5     99.0
Permuted MNIST (accuracy ↑)          70K              85.7     87.3     25.3     97.2
Adding problem T=600 (loss ↓)        70K              0.164    5.3e-5   0.177    5.8e-5
Copy memory T=1000 (loss ↓)          16K              0.0204   0.0197   0.0202   3.5e-5
Music JSB Chorales (loss ↓)          300K             8.45     8.43     8.91     8.10
Music Nottingham (loss ↓)            1M               3.29     3.46     4.05     3.07
Word-level PTB (perplexity ↓)        13M              78.93    92.48    114.50   89.21
Word-level Wiki-103 (perplexity ↓)   -                48.4     -        -        45.19
Word-level LAMBADA (perplexity ↓)    -                4186     -        14725    1279
Char-level PTB (bpc ↓)               3M               1.41     1.42     1.52     1.35
Char-level text8 (bpc ↓)             5M               1.52     1.56     1.69     1.45
The dataset contains 28K Wikipedia articles (about 103 million words) for training, 60 articles (about 218K words) for validation, and 60 articles (246K words) for testing. This is a more representative and realistic dataset than PTB, with a much larger vocabulary that includes many rare words, and has been used in Merity et al. (2016); Grave et al. (2017); Dauphin et al. (2017).

LAMBADA. Introduced by Paperno et al. (2016), LAMBADA is a dataset comprising 10K passages extracted from novels, with an average of 4.6 sentences as context, and 1 target sentence the last word of which is to be predicted. This dataset was built so that a person can easily guess the missing word when given the context sentences, but not when given only the target sentence without the context sentences. Most of the existing models fail on LAMBADA (Paperno et al., 2016; Grave et al., 2017). In general, better results on LAMBADA indicate that a model is better at capturing information from longer and broader context. The training data for LAMBADA is the full text of 2,662 novels with more than 200M words. The vocabulary size is about 93K.

text8. We also used the text8 dataset for character-level language modeling (Mikolov et al., 2012). text8 is about 20 times larger than PTB, with about 100M characters from Wikipedia (90M for training, 5M for validation, and 5M for testing). The alphabet contains 27 unique characters.

5. Experiments

We compare the generic TCN architecture described in Section 3 to canonical recurrent architectures, namely LSTM, GRU, and vanilla RNN, with standard regularizations. All experiments reported in this section used exactly the same TCN architecture, just varying the depth of the network n and occasionally the kernel size k so that the receptive field covers enough context for predictions. We use an exponential dilation d = 2^i for layer i in the network, and the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.002 for the TCN, unless otherwise noted. We also empirically found that gradient clipping helped convergence, and we pick the maximum norm for clipping from [0.3, 1]. When training recurrent models, we use grid search to find a good set of hyperparameters (in particular, optimizer, recurrent dropout p ∈ [0.05, 0.5], learning rate, gradient clipping, and initial forget-gate bias), while keeping the network around the same size as the TCN. No other architectural elaborations, such as gating mechanisms or skip connections, were added to either TCNs or RNNs. Additional details and controlled experiments are provided in the supplementary material.
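A minimal sketch of this optimization setup, assuming PyTorch; the model, data loader, and loss function are placeholders, and the clipping norm of 0.5 is just one value from the stated [0.3, 1] range.

```python
import torch
from torch.nn.utils import clip_grad_norm_

def train_tcn(model, data_loader, loss_fn, epochs=10, lr=2e-3, clip=0.5):
    """Adam with learning rate 0.002 and gradient norm clipping, matching
    the TCN training configuration described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=clip)
            optimizer.step()
```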
5.1. Synopsis of Results

A synopsis of the results is shown in Table 1. Note that on several of these tasks, the generic, canonical recurrent architectures we study (e.g., LSTM, GRU) are not the state-of-the-art. (See the supplement for more details.) With this caveat, the results strongly suggest that the generic TCN architecture with minimal tuning outperforms canonical recurrent architectures across a broad variety of sequence modeling tasks that are commonly used to benchmark the performance of recurrent architectures themselves. We now analyze these results in more detail.

5.2. Synthetic Stress Tests

The adding problem. Convergence results for the adding problem, for problem sizes T = 200 and 600, are shown in Figure 2.
[Figure 2: testing loss vs. iteration on the adding problem for T = 200 and T = 600, comparing TCN (70K parameters), LSTM (70K), GRU (70K), and AWD-LSTM (70K).]

[Figure 4: testing loss vs. iteration on the copy memory task, comparing TCN (10K and 14K parameters), LSTM (16K), GRU (16K), and EURNN (16K), with a “guess 0 for all” baseline.]
[Figure 3. Results on Sequential MNIST and P-MNIST: testing accuracy vs. iteration for (a) Sequential MNIST and (b) P-MNIST, comparing TCN (66K and 41K parameters), LSTM (68K), and GRU (68K). TCNs outperform recurrent architectures.]

All models were chosen to have roughly 70K parameters. TCNs quickly converged to a virtually perfect solution (i.e., MSE near 0). GRUs also performed quite well, albeit slower to converge than TCNs. LSTMs and vanilla RNNs performed significantly worse.

Sequential MNIST and P-MNIST. Convergence results on sequential and permuted MNIST, run over 10 epochs, are shown in Figure 3. All models were configured to have roughly 70K parameters. For both problems, TCNs substantially outperform the recurrent architectures, both in terms of convergence and in final accuracy on the task. For P-MNIST, TCNs outperform state-of-the-art results (95.9%) based on recurrent networks with Zoneout and Recurrent BatchNorm (Cooijmans et al., 2016; Krueger et al., 2017).

Copy memory. Convergence results on the copy memory task are shown in Figure 4. TCNs quickly converge to correct answers, while LSTMs and GRUs simply converge to the same loss as predicting all zeros. In this case we also compare to the recently-proposed EURNN (Jing et al., 2017), which was highlighted to perform well on this task. While both TCN and EURNN perform well for sequence length T = 500, the TCN has a clear advantage for T = 1000 and longer (in terms of both loss and rate of convergence).

5.3. Polyphonic Music and Language Modeling

We now discuss the results on polyphonic music modeling, character-level language modeling, and word-level language modeling. These domains are dominated by recurrent architectures, with many specialized designs developed for these tasks (Zhang et al., 2016; Ha et al., 2017; Krueger et al., 2017; Grave et al., 2017; Greff et al., 2017; Merity et al., 2017). We mention some of these specialized architectures when useful, but our primary goal is to compare the generic TCN model to similarly generic recurrent architectures, before domain-specific tuning. The results are summarized in Table 1.

Polyphonic music. On Nottingham and JSB Chorales, the TCN with virtually no tuning outperforms the recurrent models by a considerable margin, and even outperforms some enhanced recurrent architectures for this task such as HF-RNN (Boulanger-Lewandowski et al., 2012) and Diagonal RNN (Subakan & Smaragdis, 2017). Note however that other models such as the Deep Belief Net LSTM perform better still (Vohra et al., 2015); we believe this is likely due to the fact that the datasets are relatively small, and thus the right regularization method or generative modeling procedure can improve performance significantly. This is largely orthogonal to the RNN/TCN distinction, as a similar variant of TCN may well be possible.

Word-level language modeling. Language modeling remains one of the primary applications of recurrent networks, and many recent works have focused on optimizing LSTMs for this task (Krueger et al., 2017; Merity et al., 2017). Our implementation follows standard practice that ties the weights of encoder and decoder layers for both TCN and RNNs (Press & Wolf, 2016), which significantly reduces the number of parameters in the model. For training, we use SGD and anneal the learning rate by a factor of 0.5 for both TCN and RNNs when validation accuracy plateaus.

On the smaller PTB corpus, an optimized LSTM architecture (with recurrent and embedding dropout, etc.) outperforms the TCN, while the TCN outperforms both GRU and vanilla RNN. However, on the much larger Wikitext-103 corpus and the LAMBADA dataset (Paperno et al., 2016), without any hyperparameter search, the TCN outperforms the LSTM results of Grave et al. (2017), achieving much lower perplexities.
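For the word-level language models, the sketch below (our own illustrative PyTorch code; the module names and the commented-out optimizer settings are assumptions) shows the weight tying of Press & Wolf (2016) and a scheduler that anneals the learning rate by a factor of 0.5 when the validation metric plateaus.

```python
import torch
import torch.nn as nn

class TiedLanguageModel(nn.Module):
    """Word-level LM wrapper: the decoder shares its weight matrix with the
    input embedding (encoder), which requires matching dimensions and
    significantly reduces the parameter count."""
    def __init__(self, vocab_size, emb_size, backbone):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_size)
        # backbone: any module mapping (batch, length, emb_size) to the same
        # shape; a TCN would additionally need transposes to and from
        # (batch, channels, length).
        self.backbone = backbone
        self.decoder = nn.Linear(emb_size, vocab_size)
        self.decoder.weight = self.encoder.weight   # weight tying

    def forward(self, tokens):                      # tokens: (batch, length)
        h = self.backbone(self.encoder(tokens))
        return self.decoder(h)                      # (batch, length, vocab)

# Anneal the learning rate by 0.5 when validation stops improving, e.g.:
# optimizer = torch.optim.SGD(model.parameters(), lr=...)  # initial LR not specified here
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)
# scheduler.step(val_loss)   # call once per epoch with the validation loss
```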
Character-level language modeling. On character-level language modeling (PTB and text8, accuracy measured in bits per character), the generic TCN outperforms the regularized recurrent models (see Table 1).
References

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 1994.
Bottou, Léon, Soulie, F. Fogelman, Blanchet, Pascal, and Liénard, Jean-Sylvain. Speaker-independent isolated digit recognition: Multilayer perceptrons vs. dynamic time warping. Neural Networks, 3(4), 1990.
Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv:1206.6392, 2012.
Bradbury, James, Merity, Stephen, Xiong, Caiming, and Socher, Richard. Quasi-recurrent neural networks. In ICLR, 2017.
Campos, Victor, Jou, Brendan, Giró i Nieto, Xavier, Torres, Jordi, and Chang, Shih-Fu. Skip RNN: Learning to skip state updates in recurrent neural networks. In ICLR, 2018.
Chang, Shiyu, Zhang, Yang, Han, Wei, Yu, Mo, Guo, Xiaoxiao, Tan, Wei, Cui, Xiaodong, Witbrock, Michael J., Hasegawa-Johnson, Mark A., and Huang, Thomas S. Dilated recurrent neural networks. In NIPS, 2017.
Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, and Bengio, Yoshua. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
Chung, Junyoung, Ahn, Sungjin, and Bengio, Yoshua. Hierarchical multiscale recurrent neural networks. arXiv:1609.01704, 2016.
Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel P. Natural language processing (almost) from scratch. JMLR, 12, 2011.
Conneau, Alexis, Schwenk, Holger, LeCun, Yann, and Barrault, Loïc. Very deep convolutional networks for text classification. In European Chapter of the Association for Computational Linguistics (EACL), 2017.
Cooijmans, Tim, Ballas, Nicolas, Laurent, César, Gülçehre, Çağlar, and Courville, Aaron. Recurrent batch normalization. In ICLR, 2016.
Dauphin, Yann N., Fan, Angela, Auli, Michael, and Grangier, David. Language modeling with gated convolutional networks. In ICML, 2017.
dos Santos, Cícero Nogueira and Zadrozny, Bianca. Learning character-level representations for part-of-speech tagging. In ICML, 2014.
El Hihi, Salah and Bengio, Yoshua. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, 1995.
Elman, Jeffrey L. Finding structure in time. Cognitive Science, 14(2), 1990.
Gehring, Jonas, Auli, Michael, Grangier, David, and Dauphin, Yann. A convolutional encoder model for neural machine translation. In ACL, 2017a.
Gehring, Jonas, Auli, Michael, Grangier, David, Yarats, Denis, and Dauphin, Yann N. Convolutional sequence to sequence learning. In ICML, 2017b.
Gers, Felix A., Schraudolph, Nicol N., and Schmidhuber, Jürgen. Learning precise timing with LSTM recurrent networks. JMLR, 3, 2002.
Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016.
Grave, Edouard, Joulin, Armand, and Usunier, Nicolas. Improving neural language models with a continuous cache. In ICLR, 2017.
Graves, Alex. Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
Graves, Alex. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
Greff, Klaus, Srivastava, Rupesh Kumar, Koutník, Jan, Steunebrink, Bas R., and Schmidhuber, Jürgen. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2017.
Ha, David, Dai, Andrew, and Le, Quoc V. HyperNetworks. In ICLR, 2017.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, 2016.
Hermans, Michiel and Schrauwen, Benjamin. Training and analysing deep recurrent neural networks. In NIPS, 2013.
Hinton, Geoffrey E. Connectionist learning procedures. Artificial Intelligence, 40(1-3), 1989.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8), 1997.
Jing, Li, Shen, Yichen, Dubcek, Tena, Peurifoy, John, Skirlo, Scott, LeCun, Yann, Tegmark, Max, and Soljačić, Marin. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In ICML, 2017.
Johnson, Rie and Zhang, Tong. Effective use of word order for text categorization with convolutional neural networks. In HLT-NAACL, 2015.
Johnson, Rie and Zhang, Tong. Deep pyramid convolutional neural networks for text categorization. In ACL, 2017.
Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An empirical exploration of recurrent network architectures. In ICML, 2015.
Kalchbrenner, Nal, Grefenstette, Edward, and Blunsom, Phil. A convolutional neural network for modelling sentences. In ACL, 2014.
Kalchbrenner, Nal, Espeholt, Lasse, Simonyan, Karen, van den Oord, Aäron, Graves, Alex, and Kavukcuoglu, Koray. Neural machine translation in linear time. arXiv:1610.10099, 2016.
Kim, Yoon. Convolutional neural networks for sentence classification. In EMNLP, 2014.
Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.
Koutnik, Jan, Greff, Klaus, Gomez, Faustino, and Schmidhuber, Juergen. A clockwork RNN. In ICML, 2014.
Krueger, David and Memisevic, Roland. Regularizing RNNs by stabilizing activations. arXiv:1511.08400, 2015.
Krueger, David, Maharaj, Tegan, Kramár, János, Pezeshki, Mohammad, Ballas, Nicolas, Ke, Nan Rosemary, Goyal, Anirudh, Bengio, Yoshua, Larochelle, Hugo, Courville, Aaron C., and Pal, Chris. Zoneout: Regularizing RNNs by randomly preserving hidden activations. In ICLR, 2017.
Le, Quoc V., Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941, 2015.
LeCun, Yann, Boser, Bernhard, Denker, John S., Henderson, Donnie, Howard, Richard E., Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 1989.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
Marcus, Mitchell P., Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2), 1993.
Martens, James and Sutskever, Ilya. Learning recurrent neural networks with Hessian-free optimization. In ICML, 2011.
Melis, Gábor, Dyer, Chris, and Blunsom, Phil. On the state of the art of evaluation in neural language models. In ICLR, 2018.
Merity, Stephen, Xiong, Caiming, Bradbury, James, and Socher, Richard. Pointer sentinel mixture models. arXiv:1609.07843, 2016.
Merity, Stephen, Keskar, Nitish Shirish, and Socher, Richard. Regularizing and optimizing LSTM language models. arXiv:1708.02182, 2017.
Mikolov, Tomáš, Sutskever, Ilya, Deoras, Anoop, Le, Hai-Son, Kombrink, Stefan, and Cernocky, Jan. Subword language modeling with neural networks. Preprint, 2012.
Miyamoto, Yasumasa and Cho, Kyunghyun. Gated word-character recurrent language model. arXiv:1606.01700, 2016.
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Ng, Andrew. Sequence Models (Course 5 of Deep Learning Specialization). Coursera, 2018.
Paperno, Denis, Kruszewski, Germán, Lazaridou, Angeliki, Pham, Quan Ngoc, Bernardi, Raffaella, Pezzelle, Sandro, Baroni, Marco, Boleda, Gemma, and Fernández, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv:1606.06031, 2016.
Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013.
Pascanu, Razvan, Gülçehre, Çaglar, Cho, Kyunghyun, and Bengio, Yoshua. How to construct deep recurrent neural networks. In ICLR, 2014.
Press, Ofir and Wolf, Lior. Using the output embedding to improve language models. arXiv:1608.05859, 2016.
Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
Schuster, Mike and Paliwal, Kuldip K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 1997.
Sejnowski, Terrence J. and Rosenberg, Charles R. Parallel networks that learn to pronounce English text. Complex Systems, 1, 1987.
Shi, Xingjian, Chen, Zhourong, Wang, Hao, Yeung, Dit-Yan, Wong, Wai-Kin, and Woo, Wang-chun. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
Srivastava, Nitish, Hinton, Geoffrey E., Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1), 2014.
Subakan, Y. Cem and Smaragdis, Paris. Diagonal RNNs in symbolic music modeling. arXiv:1704.05420, 2017.
Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generating text with recurrent neural networks. In ICML, 2011.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, 2014.
van den Oord, Aäron, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew W., and Kavukcuoglu, Koray. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
Vohra, Raunaq, Goel, Kratarth, and Sahoo, J. K. Modeling temporal dependencies in data using a DBN-LSTM. In Data Science and Advanced Analytics (DSAA), 2015.
Waibel, Alex, Hanazawa, Toshiyuki, Hinton, Geoffrey, Shikano, Kiyohiro, and Lang, Kevin J. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 1989.
Werbos, Paul J. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1990.
Wisdom, Scott, Powers, Thomas, Hershey, John, Le Roux, Jonathan, and Atlas, Les. Full-capacity unitary recurrent neural networks. In NIPS, 2016.
Wu, Yuhuai, Zhang, Saizheng, Zhang, Ying, Bengio, Yoshua, and Salakhutdinov, Ruslan R. On multiplicative integration with recurrent neural networks. In NIPS, 2016.
Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen, William W. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.
Yin, Wenpeng, Kann, Katharina, Yu, Mo, and Schütze, Hinrich. Comparative study of CNN and RNN for natural language processing. arXiv:1702.01923, 2017.
Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memisevic, Roland, Salakhutdinov, Ruslan R., and Bengio, Yoshua. Architectural complexity measures of recurrent neural networks. In NIPS, 2016.
Zhang, Xiang, Zhao, Junbo Jake, and LeCun, Yann. Character-level convolutional networks for text classification. In NIPS, 2015.
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling: Supplementary Material
[Figure 6. Controlled experiments that study the effect of different components of the TCN model: (d) residual connections on the copy memory task (testing loss vs. iteration), (e) residual connections on P-MNIST (testing accuracy vs. iteration), and (f) residual connections on word-level PTB (testing perplexity vs. iteration). The curves compare TCNs with and without residual connections (67K and 10K parameter models) and a TCN with k = 8.]
Table 5. An evaluation of gating in TCN. A plain TCN is compared to a TCN that uses gated activations.