DeepAR: Probabilistic forecasting with autoregressive recurrent networks
Abstract

Probabilistic forecasting, i.e., estimating a time series' future probability distribution given its past, is a key enabler for optimizing business processes. In retail businesses, for example, probabilistic demand forecasts are crucial for having the right inventory available at the right time and in the right place. This paper proposes DeepAR, a methodology for producing accurate probabilistic forecasts, based on training an autoregressive recurrent neural network model on a large number of related time series. We demonstrate how the application of deep learning techniques to forecasting can overcome many of the challenges that are faced by widely-used classical approaches to the problem. By means of extensive empirical evaluations on several real-world forecasting datasets, we show that our methodology produces more accurate forecasts than other state-of-the-art methods, while requiring minimal manual work.

Keywords: Probabilistic forecasting; Neural networks; Deep learning; Big data; Demand forecasting
© 2020 The Authors. Published by Elsevier B.V. on behalf of International Institute of
Forecasters. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.1016/j.ijforecast.2019.07.001
Intermittent demand can be treated directly using techniques that range from classical methods (Croston, 1972) to neural networks (Gutierrez, Solis, & Mukhopadhyay, 2008). Data-preprocessing methods such as Box–Cox transformations (Box & Cox, 1964) or differencing (Hyndman & Athanasopoulos, 2012, provides an overview) have been proposed in order to alleviate unfavorable characteristics of time series, but with mixed results. Other approaches incorporate more suitable likelihood functions, such as the zero-inflated Poisson distribution, the negative binomial distribution (Snyder, Ord, & Beaumont, 2012), a combination of both (Chapados, 2014), or a tailored multi-stage likelihood (Seeger, Salinas, & Flunkert, 2016). We approach the (demand) forecasting problem by incorporating appropriate likelihoods and combining them with non-linear data transformation techniques, as learned by a (deep) neural network. In particular, we use a negative binomial likelihood in the case of demand forecasting, which improves the accuracy but precludes us from applying standard data normalization techniques directly. By using deeper networks than have been proposed previously in the forecasting literature, we allow the neural network to represent more complex data transformations. Goodfellow, Bengio, and Courville (2016) provides a comprehensive overview of modern deep neural networks, including justifications of why deep neural networks are preferable to shallow and wide neural networks.

When faced with time series as they occur in industrial applications, sharing information across time series is key to improving the forecast accuracy. However, this can be difficult to accomplish in practice, due to the often heterogeneous nature of the data. A prominent approach to sharing information across time series is to use clustering techniques such as k-means clustering to compute seasonality indices, which are then combined with classic forecasting models (see for example the Foresight 2007 Spring issue on seasonality for a number of examples, as well as the papers by Chen & Boylan, 2008, and Mohammadipour, Boylan, & Syntetos, 2012). Other examples include the explicit handling of promotional effects (see e.g. Trapero, Kourentzes, & Fildes, 2015, and references therein) via pooled principal component analysis regression. The latter is another instance of using an unsupervised learning technique as a pre-processing step. These effects need to be handled in practical applications, which leads to complex pipelines that are difficult both to tune and to maintain. The complexity of such pipelines is likely to increase when one needs to address specific sub-problems such as forecasts for new products. Effectively, one decomposes the overall forecasting problem into a number of distinct forecasting sub-problems and applies a dedicated model or even a chain of models to each one. These models can range from classical statistical models (e.g. Hyndman et al., 2008) to machine learning models (e.g. Laptev, Yosinsk, Li Erran, & Smyl, 2017; Wen, Torkkola, & Narayanaswamy, 2017) and judgmental approaches (e.g. Davydenko & Fildes, 2013). However, the forecasting problem might not lend itself to a decomposition into a sequence of distinct procedures, in which case one would have to think about model consolidation, blending, or ensembling (e.g. Oliveira & Torgo, 2014; Timmermann, 2006). These techniques increase the complexity still further. Deep neural networks offer an alternative to such pipelines. Such models require a limited amount of standardized data pre-processing, after which the forecasting problem is solved by learning an end-to-end model. In particular, data-processing is included in the model and optimized jointly towards the goal of producing the best possible forecast. In practice, deep learning forecasting pipelines rely almost exclusively on what the model can learn from the data, unlike traditional pipelines, which rely heavily on heuristics such as expert-designed components and manual covariate design.

Other approaches to the sharing of information across time series are via matrix factorization methods (e.g. the recent work of Yu, Rao, and Dhillon (2016)). We compare our approach to this method directly in Section 5, and show how we empirically outperform it. Further methods that share information include Bayesian methods that share information via hierarchical priors (Chapados, 2014) and by making use of any hierarchical structure that may be present in the data (Ben Taieb, Taylor, & Hyndman, 2017; Hyndman, Ahmed, Athanasopoulos, & Shang, 2011).

Finally, we note that neural networks have been investigated in the context of forecasting for a long time by both the machine learning and forecasting communities (for more recent work considering LSTM cells, see for example the numerous references in the surveys by Zhang, Eddy Patuwo, & Hu, 1998, Fildes, Nikolopoulos, Crone, & Syntetos, 2008, and Gers, Eck, & Schmidhuber, 2001). Outside of the forecasting community, time series models based on RNNs have been applied very successfully to various other applications, such as natural language processing (NLP) (Graves, 2013; Sutskever et al., 2014), audio modeling (van den Oord et al., 2016) or image generation (Gregor, Danihelka, Graves, Rezende, & Wierstra, 2015). Direct applications of RNNs to forecasting include the recent papers by Wen et al. (2017) and Laptev et al. (2017). Our work differs from these in that it provides a comprehensive benchmark including publicly available datasets and a fully probabilistic forecast.¹

Within the forecasting community, neural networks in forecasting have been applied typically to individual time series, i.e., a different model is fitted to each time series independently (Díaz-Robles et al., 2008; Ghiassi, Saidane, & Zimbra, 2005; Hyndman & Athanasopoulos, 2018; Kaastra & Boyd, 1996). Kourentzes (2013) applies neural networks specifically to intermittent data. The author uses a feed-forward neural network (which, by design, ignores

¹ Since the initial pre-print of the present work became available, neural networks have received an increasing amount of attention, see for example (Bandara, Bergmeir, & Smyl, 2017; Gasthaus, Benidis, Wang, Rangapuram, Salinas, Flunkert, et al., 2019; Oreshkin, Carpov, Chapados, & Bengio, 2019; Rangapuram, Seeger, Gasthaus, Stella, Wang, & Januschowski, 2018; Smyl, Ranganathan, & Pasqua, 2018; Toubeau, Bottieau, Vallée, & De Grève, 2018). The winning solution to the M4 competition was based on a neural network (Makridakis, Spiliotis, & Assimakopoulos, 2018; Smyl et al., 2018). Future work will address a systematic review of these methods.
3. Background: RNNs

Fig. 2. Left: an RNN without an output layer. Right: a partially unrolled RNN with an output layer and multiple hidden units.

We assume that the reader is familiar with basic neural network nomenclature around multi-layer perceptrons (MLPs) or feed-forward neural networks, which have been applied successfully to forecasting problems by the forecasting community (e.g., Gutierrez et al., 2008; Kourentzes, 2013); in particular, modern forecasting textbooks include MLPs, e.g., (Hyndman & Athanasopoulos, 2018). We refer the interested reader to Goodfellow et al. (2016) for a comprehensive introduction to modern deep learning approaches and Faloutsos, Flunkert, Gasthaus, Januschowski, and Wang (2019) for a tutorial that focuses on forecasting. In what follows, we provide a brief introduction to recurrent neural networks (RNNs) and key techniques for handling them, as they have not been dealt with extensively in the forecasting literature (though the machine learning community has applied them to forecasting problems with some success; e.g. Laptev et al., 2017; Wen et al., 2017). We follow Goodfellow et al. (2016) in our exposition.

A classic dynamic system driven by an external signal x^{(t)} is given by

    h^{(t)} = f(h^{(t−1)}, x^{(t)}; θ),    (1)

where h^{(t)} is the state of the system at step t and θ is a parameter of the transition function f. RNNs use Eq. (1) to model the values of their hidden units (recall that a hidden unit of a neural network is one that is in neither the input layer nor the output layer). This means that RNNs are deterministic, non-linear dynamic systems, in contrast to additive exponential smoothing in state space form, which can be represented as linear non-deterministic dynamic systems with a single source of error/innovation (Hyndman et al., 2008).

Fig. 2 contains a depiction of a simple RNN. The recursive structure of the RNN means that fewer parameters need to be learned than in the case of MLPs. However, a technical difficulty arises in the training of RNNs via a gradient-based optimization procedure. The recursive nature of RNNs often results in ill-conditioned optimization problems which are commonly referred to in the machine learning community as vanishing or exploding gradients. The long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997) alleviates this problem (among other favorable properties), and it is the approach that we adopt in this paper. We do not present the full functional form of LSTMs, as this is unnecessary for our arguments, but again refer to the paper by Goodfellow et al. (2016) for an overview and a comprehensive exposition. All modern neural learning packages, such as that of Chen et al. (2015), include an implementation of LSTM-based RNNs.

In addition to LSTMs, another concept from RNNs will be useful: the encoder–decoder framework, which allows RNNs to be used to map an input sequence x = (x_1, ..., x_{n_x}) to an output sequence y = (y_1, ..., y_{n_y}) of differing lengths. This idea is used frequently in NLP and machine translation, and works as follows. Given an input sequence, a first RNN processes this sequence and emits a so-called context, a vector or a sequence of vectors. In practice, this is often the last state h_{n_x} of the encoder RNN. A second RNN, the decoder RNN, is conditioned on the context in order to generate the output sequence. The two RNNs are trained jointly to maximize the average of log P(y|x) over all pairs x, y in the training set. Section 4 discusses the application of this concept to forecasting.
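To make the recursion of Eq. (1) and the encoder–decoder idea concrete, the following minimal sketch uses a plain tanh RNN cell in place of an LSTM and NumPy instead of a deep learning framework; all names (RNNCell, encode, decode) are illustrative and are not part of any library or of the implementation used in this paper.

import numpy as np

class RNNCell:
    """Plain tanh RNN cell implementing Eq. (1): h_t = f(h_{t-1}, x_t; theta)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        self.b = np.zeros(hidden_dim)

    def step(self, h_prev, x_t):
        return np.tanh(self.W_h @ h_prev + self.W_x @ x_t + self.b)

def encode(cell, xs, hidden_dim):
    """Run the encoder over the input sequence; the last state is the context."""
    h = np.zeros(hidden_dim)
    for x_t in xs:
        h = cell.step(h, x_t)
    return h  # context vector h_{n_x}

def decode(cell, context, inputs):
    """Run the decoder conditioned on the context (used as its initial state)."""
    h = context
    states = []
    for x_t in inputs:
        h = cell.step(h, x_t)
        states.append(h)
    return states  # an output layer would map each state to y_t

# toy usage: encode a length-5 input sequence, decode a length-3 output sequence
hidden_dim, input_dim = 4, 2
encoder = RNNCell(input_dim, hidden_dim, seed=0)
decoder = RNNCell(input_dim, hidden_dim, seed=1)
xs = [np.ones(input_dim) * t for t in range(5)]
ys_in = [np.zeros(input_dim) for _ in range(3)]
decoder_states = decode(decoder, encode(encoder, xs, hidden_dim), ys_in)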
Fig. 3. Summary of the model. Training (left): at each time step t, the inputs to the network are the covariates x_{i,t}, the target value at the previous time step z_{i,t−1}, and the previous network output h_{i,t−1}. The network output h_{i,t} = h(h_{i,t−1}, z_{i,t−1}, x_{i,t}, Θ) is then used to compute the parameters θ_{i,t} = θ(h_{i,t}, Θ) of the likelihood p(z|θ), which is used for training the model parameters. For prediction, the history of the time series z_{i,t} is fed in for t < t_0, then in the prediction range (right) for t ≥ t_0 a sample ẑ_{i,t} ∼ p(·|θ_{i,t}) is drawn and fed back for the next point until the end of the prediction range t = t_0 + T, generating one sample trace. Repeating this prediction process yields many traces that represent the joint predicted distribution.
4. Model

Denoting the value of time series (for an item) i at time t by z_{i,t}, our goal is to model the conditional distribution

    P(z_{i,t_0:T} | z_{i,1:t_0−1}, x_{i,1:T})

of the future of each time series [z_{i,t_0}, z_{i,t_0+1}, ..., z_{i,T}] := z_{i,t_0:T} given its past [z_{i,1}, ..., z_{i,t_0−2}, z_{i,t_0−1}] := z_{i,1:t_0−1}, where t_0 denotes the time point from which we assume z_{i,t} to be unknown at prediction time, and x_{i,1:T} are covariates that are assumed to be known for all time points. To limit ambiguity, we avoid the terms "past" and "future" and will refer to the time ranges [1, t_0 − 1] and [t_0, T] as the conditioning range and prediction range, respectively. The conditioning range corresponds to the encoder range introduced in Section 3 and the prediction range to the decoder range. During training, both ranges have to lie in the past so that the z_{i,t} are observed, but during prediction, z_{i,t} is only available in the conditioning range. Note that the time index t is relative, i.e. t = 1 can correspond to a different actual/absolute time period for each i.

Our model, summarized in Fig. 3, is based on an autoregressive recurrent network architecture (Graves, 2013; Sutskever et al., 2014). We assume that our model distribution Q_Θ(z_{i,t_0:T} | z_{i,1:t_0−1}, x_{i,1:T}) consists of a product of likelihood factors

    Q_Θ(z_{i,t_0:T} | z_{i,1:t_0−1}, x_{i,1:T}) = ∏_{t=t_0}^{T} Q_Θ(z_{i,t} | z_{i,1:t−1}, x_{i,1:T}) = ∏_{t=t_0}^{T} p(z_{i,t} | θ(h_{i,t}, Θ)),

where the network output

    h_{i,t} = h(h_{i,t−1}, z_{i,t−1}, x_{i,t}, Θ)    (2)

is computed by an autoregressive recurrent network (see Fig. 3), and the parameters of the likelihood p(z_{i,t} | θ(h_{i,t})) are given by a function θ(h_{i,t}, Θ) of the network output h_{i,t} (see below).

Information about the observations in the conditioning range z_{i,1:t_0−1} is transferred to the prediction range through the initial state h_{i,t_0−1}. In the sequence-to-sequence setup, this initial state is the output of an encoder network. While in general this encoder network can have a different architecture, in our experiments we opt to use the same architecture for the model in both the conditioning range and the prediction range (corresponding to the encoder and decoder in a sequence-to-sequence model). Further, we share weights between them, so that the initial state for the decoder h_{i,t_0−1} is obtained by computing Eq. (2) for t = 1, ..., t_0 − 1, where all required quantities are observed. The initial states of both the encoder h_{i,0} and z_{i,0} are initialized to zero.

Given the model parameters Θ, we can obtain joint samples z̃_{i,t_0:T} ∼ Q_Θ(z_{i,t_0:T} | z_{i,1:t_0−1}, x_{i,1:T}) directly through ancestral sampling. First, we obtain h_{i,t_0−1} by computing Eq. (2) for t = 1, ..., t_0. For t = t_0, t_0 + 1, ..., T, we sample z̃_{i,t} ∼ p(·|θ(h̃_{i,t}, Θ)), where h̃_{i,t} = h(h̃_{i,t−1}, z̃_{i,t−1}, x_{i,t}, Θ), initialized with h̃_{i,t_0−1} = h_{i,t_0−1} and z̃_{i,t_0−1} = z_{i,t_0−1}. Samples from the model obtained in this way can then be used to compute quantities of interest, e.g. quantiles of the distribution of the sum of values for some future time period.
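The ancestral-sampling procedure above can be illustrated with the following self-contained toy sketch (NumPy, Gaussian likelihood, covariates omitted, randomly initialized weights standing in for a trained network); it is meant only to show the unrolling and the feedback of drawn samples, not the paper's actual MXNet implementation.

import numpy as np

rng = np.random.default_rng(0)
H = 8                                            # hidden size of the toy network
W_h = rng.normal(0, 0.3, (H, H))
W_z = rng.normal(0, 0.3, H)
w_mu, w_sigma = rng.normal(0, 0.3, H), rng.normal(0, 0.3, H)

def step(h, z_prev):
    """Toy stand-in for Eq. (2): new state plus Gaussian likelihood parameters."""
    h_new = np.tanh(W_h @ h + W_z * z_prev)
    mu = w_mu @ h_new
    sigma = np.log1p(np.exp(w_sigma @ h_new))    # softplus keeps sigma > 0
    return h_new, (mu, sigma)

def sample_trace(z_obs, horizon):
    """One ancestral-sampling trace: condition on z_obs, then roll forward."""
    h, z_prev = np.zeros(H), 0.0                 # h_0 and z_0 start at zero
    for z_t in z_obs:                            # conditioning range
        h, _ = step(h, z_prev)
        z_prev = z_t
    trace = []
    for _ in range(horizon):                     # prediction range
        h, (mu, sigma) = step(h, z_prev)
        z_prev = rng.normal(mu, sigma)           # draw z~_t and feed it back
        trace.append(z_prev)
    return trace

history = np.sin(np.arange(24))                  # toy conditioning range
traces = np.array([sample_trace(history, horizon=12) for _ in range(200)])
p10, p50, p90 = np.quantile(traces, [0.1, 0.5, 0.9], axis=0)   # forecast quantiles

Repeating the trace generation (here 200 times) and taking empirical quantiles per time step is exactly how the probabilistic forecast is summarized later in the experiments.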
The Gaussian likelihood is parameterized by its mean and standard deviation, θ = (µ, σ), where the mean is given by an affine function of the network output, and the standard deviation is obtained by applying an affine transformation followed by a softplus activation in order to ensure σ > 0:

    p_G(z | µ, σ) = (2πσ^2)^{−1/2} exp(−(z − µ)^2 / (2σ^2)),
    µ(h_{i,t}) = w_µ^T h_{i,t} + b_µ,
    σ(h_{i,t}) = log(1 + exp(w_σ^T h_{i,t} + b_σ)).

The negative binomial distribution is a common choice for modeling time series of positive count data (Chapados, 2014; Snyder et al., 2012). We parameterize the negative binomial distribution by its mean µ ∈ R_+ and a shape parameter α ∈ R_+,

    p_NB(z | µ, α) = [Γ(z + 1/α) / (Γ(z + 1) Γ(1/α))] · (1 / (1 + αµ))^{1/α} · (αµ / (1 + αµ))^{z},
    µ(h_{i,t}) = log(1 + exp(w_µ^T h_{i,t} + b_µ)),
    α(h_{i,t}) = log(1 + exp(w_α^T h_{i,t} + b_α)),

where both parameters are obtained from the network output by a fully-connected layer with softplus activation so as to ensure positivity. In this parameterization of the negative binomial distribution, the shape parameter α scales the variance relative to the mean, i.e. Var[z] = µ + µ^2 α. While other parameterizations are possible, preliminary experiments showed this particular one to be especially conducive to fast convergence.
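The following sketch spells out the two parameterizations in code: affine/softplus projections of a network output h, and the log-pmf of the negative binomial in the mean–shape form used above, together with a Monte Carlo check of Var[z] = µ + αµ². The helper names are illustrative; a real implementation would express these projections as layers of the network.

import numpy as np
from math import lgamma

def softplus(x):
    return np.log1p(np.exp(x))

def gaussian_params(h, w_mu, b_mu, w_sigma, b_sigma):
    # mu is an affine function of the network output; sigma goes through a softplus
    return w_mu @ h + b_mu, softplus(w_sigma @ h + b_sigma)

def negbin_params(h, w_mu, b_mu, w_alpha, b_alpha):
    # both mu and alpha are kept positive via softplus projections
    return softplus(w_mu @ h + b_mu), softplus(w_alpha @ h + b_alpha)

def negbin_log_pmf(z, mu, alpha):
    # log p_NB(z | mu, alpha) in the mean/shape parameterization above
    r = 1.0 / alpha                       # "number of failures" parameter
    p = alpha * mu / (1.0 + alpha * mu)   # success probability
    return (lgamma(z + r) - lgamma(z + 1.0) - lgamma(r)
            + r * np.log(1.0 - p) + z * np.log(p))

rng = np.random.default_rng(0)
h = rng.normal(size=16)                   # stand-in for a network output h_{i,t}
w, b = rng.normal(size=16), 0.1
print(gaussian_params(h, w, b, w, b))
print(negbin_params(h, w, b, w, b))

# Monte Carlo check that Var[z] = mu + alpha * mu^2 in this parameterization
mu, alpha = 5.0, 0.4
samples = rng.negative_binomial(n=1.0 / alpha, p=1.0 / (1.0 + alpha * mu), size=200_000)
print(samples.mean(), samples.var(), mu + alpha * mu ** 2, negbin_log_pmf(3, mu, alpha))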
Given a dataset of time series {z_{i,1:T}}_{i=1,...,N} and associated covariates x_{i,1:T}, obtained by choosing a time range such that z_{i,t} in the prediction range is known, the parameters Θ of the model, consisting of the parameters of both the RNN h(·) and θ(·), can be learned by maximizing the log-likelihood

    L = ∑_{i=1}^{N} ∑_{t=t_0}^{T} log p(z_{i,t} | θ(h_{i,t})).    (3)

As h_{i,t} is a deterministic function of the input, all quantities required for computing Eq. (3) are observed, so that, in contrast to state space models with latent variables, no inference is required, and Eq. (3) can be optimized directly via stochastic gradient descent by computing gradients with respect to Θ. In our experiments, where the encoder model is the same as the decoder, the distinction between the encoder and the decoder is somewhat artificial during training, so that we also include the likelihood terms for t = 0, ..., t_0 − 1 in Eq. (3) (or, equivalently, set t_0 = 0).
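As a small illustration of Eq. (3), the sketch below sums the Gaussian log-likelihood over items and time steps to form the (negative) training objective; the arrays stand in for network outputs, and in practice the gradient with respect to Θ would be obtained by automatic differentiation (the paper uses MXNet) rather than written by hand. Names are illustrative.

import numpy as np

def gaussian_log_likelihood(z, mu, sigma):
    """log p_G(z | mu, sigma) for arrays of targets and likelihood parameters."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (z - mu) ** 2 / (2.0 * sigma ** 2)

def training_loss(z, mu, sigma, t0=0):
    """Negative of Eq. (3), summed over items i and time steps t >= t0.

    z, mu, sigma are (N, T) arrays; mu and sigma would come from the network
    outputs h_{i,t}. Setting t0 = 0 also includes the conditioning range, as
    is done in the experiments of this paper.
    """
    return -np.sum(gaussian_log_likelihood(z[:, t0:], mu[:, t0:], sigma[:, t0:]))

# toy usage with random stand-ins for targets and network outputs
rng = np.random.default_rng(0)
N, T = 4, 20
z = rng.normal(size=(N, T))
mu = rng.normal(size=(N, T))
sigma = np.log1p(np.exp(rng.normal(size=(N, T))))   # softplus keeps sigma > 0
print(training_loss(z, mu, sigma, t0=0))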
For each time series in the dataset, we generate multiple training instances by selecting windows with different starting points from the original time series. In practice, we keep both the total length T and the relative lengths of the conditioning and prediction ranges fixed for all training examples. For example, if the total available period for a given time series ranges from 2013-01-01 to 2017-01-01, we can create training examples where t = 1 corresponds to 2013-01-01, 2013-01-02, 2013-01-03, and so on. When choosing these windows, we ensure that the entire prediction range is always covered by the available ground truth data, but we may choose to have t = 1 lie before the start of the time series, e.g. 2012-12-01 in the example above, padding the unobserved target with zeros. This allows the model to learn the behavior of "new" time series by taking into account all other available covariates. Augmenting the data using this windowing procedure ensures that information about absolute time is available to the model only through covariates, not through the relative position of z_{i,t} in the time series. Fig. 4 contains a depiction of this data augmentation technique.
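A minimal sketch of this windowing/augmentation step, assuming a single univariate series and ignoring covariates: windows of fixed total length are cut at random start points, and start points that fall before the first observation are padded with zeros, as described above. Function and variable names are illustrative.

import numpy as np

def training_windows(series, total_len, num_windows, rng):
    """Cut fixed-length windows with random start points from one series.

    Start points may lie before the first observation; the unobserved part of
    the target is then padded with zeros, mimicking "new" items. The window end
    always stays within the observed data, so the prediction range is covered.
    """
    windows = []
    for _ in range(num_windows):
        # allow up to total_len - 1 zero-padded steps before the series starts
        start = rng.integers(-(total_len - 1), len(series) - total_len + 1)
        pad = max(0, -start)
        obs = series[max(0, start): start + total_len]
        windows.append(np.concatenate([np.zeros(pad), obs]))
    return np.stack(windows)

rng = np.random.default_rng(0)
daily_sales = rng.poisson(3.0, size=400).astype(float)   # toy series
batch = training_windows(daily_sales, total_len=72, num_windows=8, rng=rng)
print(batch.shape)   # (8, 72): each row is one conditioning + prediction window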
Bengio, Vinyals, Jaitly, and Shazeer (2015) noted that the autoregressive nature of such models means that optimizing Eq. (3) directly causes a discrepancy between the ways in which the model is used during training and when obtaining predictions from the model: during training, the values of z_{i,t} are known in the prediction range and can be used to compute h_{i,t}; however, during prediction, z_{i,t} is unknown for t ≥ t_0, and a single sample z̃_{i,t} ∼ p(·|θ(h_{i,t})) from the model distribution is used in the computation of h_{i,t} according to Eq. (2) instead. While this disconnect has been shown to pose a severe problem for NLP tasks, for example, we have not observed adverse effects from it in a forecasting setting. Preliminary experiments with variants of scheduled sampling (Bengio et al., 2015) did not show any noteworthy improvements in accuracy (but did slow convergence).

Applying the model to data that exhibit a power-law of scales, as depicted in Fig. 1, presents two challenges. Firstly, the autoregressive nature of the model means that both the autoregressive input z_{i,t−1} and the output of the network (e.g. µ) scale with the observations z_{i,t} directly, but the non-linearities of the network in between have a limited operating range. Thus, without further modifications, the network has to learn, first, to scale the input to an appropriate range in the input layer, then to invert this scaling at the output. We address this issue by dividing the autoregressive inputs z_{i,t} (or z̃_{i,t}) by an item-dependent scale factor ν_i, and conversely multiplying the scale-dependent likelihood parameters by the same factor. For instance, for the negative binomial likelihood, we use µ = ν_i log(1 + exp(o_µ)) and α = log(1 + exp(o_α))/√ν_i, where o_µ and o_α are the outputs of the network for these parameters. Note that while one could alternatively scale the input in a preprocessing step for real-valued data, this is not possible for count distributions. The selection of an appropriate scale factor might be challenging in itself (especially in the presence of missing data or large within-item variances). However, scaling by the average value ν_i = 1 + (1/t_0) ∑_{t=1}^{t_0} z_{i,t}, as we do in our experiments, is a heuristic that works well in practice.
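The scale handling can be sketched as follows, assuming the negative binomial likelihood: the conditioning-range average defines ν_i, the autoregressive inputs are divided by it, and the raw network outputs o_µ, o_α are mapped back to the likelihood parameters exactly as in the formulas above. The helper names are illustrative.

import numpy as np

def item_scale(z_conditioning):
    """nu_i = 1 + mean of the conditioning-range targets (the heuristic above)."""
    return 1.0 + np.mean(z_conditioning)

def scaled_negbin_params(o_mu, o_alpha, nu):
    """Map raw network outputs to likelihood parameters, undoing the input
    scaling: mu is multiplied by nu_i and alpha divided by sqrt(nu_i)."""
    softplus = lambda x: np.log1p(np.exp(x))
    return nu * softplus(o_mu), softplus(o_alpha) / np.sqrt(nu)

rng = np.random.default_rng(0)
z_hist = rng.poisson(40.0, size=52).astype(float)   # a "fast-moving" toy item
nu = item_scale(z_hist)
z_scaled_inputs = z_hist / nu                       # what is actually fed to the RNN
mu, alpha = scaled_negbin_params(o_mu=0.3, o_alpha=-1.2, nu=nu)
print(nu, mu, alpha)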
Fig. 4. Depiction of the data setup for training and forecasting for a single input time series z with covariates x. The green vertical line separates the training data from the testing data, so we compute the out-of-sample accuracy for forecasts to the right of the green line; in particular, no data to the right of the green line is used in training. Left: the data setup during the training phase. The red lines mark the slices of x that are presented to the model during training, where the left part marks the conditioning range and the right part the prediction range. Note that all windows are to the left of the green line. Right: during forecasting, when the model is fully trained, only the conditioning range is to the left of the green line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Secondly, the imbalance in the data means that a stochastic optimization procedure that picks training instances uniformly at random will visit the small number of time series with large scales very infrequently, resulting in those time series being underfitted. This could be especially problematic in the demand forecasting setting, where high-velocity items can exhibit qualitatively different behaviors from low-velocity items, and having accurate forecasts for high-velocity items might be more important for meeting certain business objectives. We counteract this effect by sampling the examples non-uniformly during training. In particular, our weighted sampling scheme sets the probability of selecting a window from an example with scale ν_i proportional to ν_i. This sampling scheme is simple yet effective in compensating for the skew in Fig. 1.
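A minimal sketch of the weighted sampling scheme: items (or, equivalently, candidate windows) are drawn with probability proportional to their scale ν_i rather than uniformly. The function name is illustrative.

import numpy as np

def pick_items(scales, num_picks, rng):
    """Choose which items to cut training windows from, with probability
    proportional to the item scale nu_i (instead of uniformly)."""
    p = np.asarray(scales, dtype=float)
    p = p / p.sum()
    return rng.choice(len(scales), size=num_picks, replace=True, p=p)

rng = np.random.default_rng(0)
nu = np.array([1.2, 1.5, 2.0, 350.0])    # three slow items and one fast item
picks = pick_items(nu, num_picks=10_000, rng=rng)
print(np.bincount(picks) / len(picks))   # the fast item dominates, as intended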
4.4. Covariates

The covariates x_{i,t} can be item-dependent, time-dependent, or both. This distinction is mainly of practical importance. In theory, any covariate x_{i,t} that does not vary with time can be generalized trivially to be time-dependent by repeating it along the time dimension. Examples of such time-independent covariates include the categorization of item i, for example to denote membership to a certain group of products (e.g., product i is a shoe, so x_{i,t} = s, where s is an identifier for the shoe category). Such covariates allow for the fact that time series with the same time-independent covariate may be similar. Examples of time-dependent covariates include information about the time point (e.g. week of year). They can also be used to include covariates that one expects to influence the outcome (e.g. price or promotion status in the demand forecasting setting), as long as the covariates' values are available in the prediction range as well. If these values are not available (e.g., future price changes), one option is to set them manually (e.g., assume that there are no price changes), which allows for what-if analysis. The principled solution is to predict these time series jointly, e.g., forecast demand and price jointly in a multivariate forecast. What-if analyses are possible when future prices become known via conditioning. We leave multivariate forecasting as a valuable direction for future work.

All of our experiments use an "age" covariate, i.e., the distance to the first observation in that time series. We also add day-of-the-week and hour-of-the-day covariates for hourly data, week-of-the-year for weekly data and month-of-the-year for monthly data. We encode these simply as increasing numeric values, instead of using multi-dimensional binary variables to encode them. Furthermore, we include a single categorical item covariate, for which an embedding is learned by the model. In the retail demand forecasting datasets, the item covariate corresponds to a (coarse) product category (e.g. "clothing"), while in the smaller datasets it corresponds to the item's identity, allowing the model to learn item-specific behaviors. By appropriate normalization, we standardize all covariates to have a zero mean and unit variance.
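The covariate construction for an hourly series might look as follows in a minimal NumPy sketch: an "age" feature, numeric day-of-week and hour-of-day features standardized to zero mean and unit variance, and a learned item embedding realized as a lookup into a (trainable) matrix. All names and dimensions here are illustrative, not those of the paper's implementation.

import datetime as dt
import numpy as np

def time_covariates(timestamps, train_mean=None, train_std=None):
    """Build 'age', day-of-week and hour-of-day covariates for hourly data and
    standardize them (the statistics would be estimated on the training range)."""
    age = np.arange(len(timestamps), dtype=float)   # distance to first observation
    dow = np.array([ts.weekday() for ts in timestamps], dtype=float)
    hod = np.array([ts.hour for ts in timestamps], dtype=float)
    feats = np.stack([age, dow, hod], axis=1)
    mean = feats.mean(axis=0) if train_mean is None else train_mean
    std = feats.std(axis=0) + 1e-8 if train_std is None else train_std
    return (feats - mean) / std

# a learned item embedding is just a lookup into a trainable matrix
num_categories, embed_dim = 5, 4
embedding = np.random.default_rng(0).normal(size=(num_categories, embed_dim))
item_category = 2
item_covariate = embedding[item_category]           # repeated at every time step

hours = [dt.datetime(2013, 1, 1) + dt.timedelta(hours=k) for k in range(168)]
x = np.hstack([time_covariates(hours), np.tile(item_covariate, (168, 1))])
print(x.shape)    # (168, 3 + 4): one covariate vector x_{i,t} per time step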
5. Applications and experiments

We implement our model using MXNet (Chen et al., 2015), and use a single p2.xlarge AWS EC2 compute instance containing 4 CPUs and 1 GPU to run all experiments.³ With this set-up, training and prediction on the large ec dataset containing 500K time series can be completed in less than 10 h. Note that prediction with a trained model is fast (in the order of tens of minutes for a single compute instance), and can be sped up if necessary by executing prediction in parallel.

³ Implementations of DeepAR are available on Amazon SageMaker (closed-source) and as part of GluonTS (Alexandrov et al., 2019).

We use the ADAM optimizer (Kingma & Ba, 2014) with early stopping and standard LSTM cells with a forget bias set to 1.0 in all experiments, and 200 samples are drawn from our decoder for generating predictions.

5.1. Datasets

We use five datasets for our evaluations. The first three, namely parts, electricity, and traffic, are public datasets; parts consists of 1046 aligned time series of 50 time steps each, representing the monthly sales of different items by a US automobile company (Seeger et al., 2016); electricity contains hourly time series of the electricity consumption of 370 customers (Yu et al., 2016); and traffic, also used by Yu et al. (2016), contains the hourly occupancy rates, between zero and one, of 963 car lanes of San Francisco bay area freeways. For the parts dataset, we use the first 42 months as training data and report the error on the remaining 8 months. For the other datasets, electricity, traffic, ec-sub and ec, the set of possible training instances is sub-sampled to the number indicated in Table 1.
Fig. 5. Example time series of ec. The vertical line separates the conditioning period from the prediction period. The black line shows the true target.
In the prediction range, we plot the p50 as a blue line (mostly zero for the three slow items), along with 80% confidence intervals (shaded). The
model learns accurate seasonality patterns and uncertainty estimates for items of different velocities and ages. (For interpretation of the references
to colour in this figure legend, the reader is referred to the web version of this article.)
Table 1
Dataset statistics and RNN parameters.
parts electricity traffic ec-sub ec
# of time series 1046 370 963 39700 534884
time granularity month hourly hourly week week
domain N R+ [0, 1] N N
encoder length 8 168 168 52 52
decoder length 8 24 24 52 52
# of training examples 35K 500K 500K 2M 2M
item input embedding dimension 1046 370 963 5 5
item output embedding dimension 1 20 20 20 20
batch size 64 64 64 512 512
learning rate 1e−3 1e−3 1e−3 5e−3 5e−3
# of LSTM layers 3 3 3 3 3
# of LSTM nodes 40 40 40 120 120
running time 5 min 7h 3h 3h 10h
The results for electricity and traffic are computed using a rolling window of predictions, as described by Yu et al. (2016). We do not retrain our model for each window, but use a single model trained on the data before the first prediction window. The remaining two datasets, ec and ec-sub, are the weekly item sales from Amazon that were used by Seeger et al. (2016), and we predict the 52 weeks following 2014-09-07. The time series in these two datasets are very diverse and lumpy, ranging from very fast-moving to very slow-moving items, and include "new" products that were introduced in the weeks before the forecast time 2014-09-07; see Fig. 5. Further, the item velocities in these datasets have a power-law distribution, as shown in Fig. 1.
Table 1 also lists running times as measured by an end-to-end evaluation, e.g. processing covariates, training the neural network, drawing samples and evaluating the distributions produced.

For each dataset, a grid-search is used to find the best value for the hyper-parameters item output embedding dimension and # of LSTM nodes (i.e. the number of hidden units). To do so, the data before the forecast start time are used as the training set and split into two partitions. For each hyper-parameter candidate, we fit our model on the first partition of the training set, containing 90% of the data, and pick the one that has the minimal negative log-likelihood on the remaining 10%. Once the best set of hyper-parameters has been found, the evaluation metrics (0.5-risk, 0.9-risk, ND and RMSE) are evaluated on the test set, that is, the data coming after the forecast start time. Note that this procedure could lead to the hyper-parameters being over-fitted to the training set, but this would also degrade the metric that we report. A better procedure would be to fit the parameters and evaluate the negative log-likelihood not only on different windows, but also on non-overlapping time intervals. We tune the learning rate manually for every dataset and keep it fixed in hyper-parameter tuning. Other parameters such as the encoder length, decoder length and item input embedding are considered to be domain-dependent, and are not fitted. The batch size is increased on larger datasets in order to benefit more from GPU parallelization. Finally, the running time measures an end-to-end evaluation, e.g. processing covariates, training the neural network, drawing samples for the production of probabilistic forecasts, and evaluating the forecasts.
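The hyper-parameter selection described above can be summarized by the following schematic sketch, in which fit and negative_log_likelihood are hypothetical callables standing in for model training and evaluation on the 90%/10% split of the training range.

import numpy as np

def select_hyperparameters(candidates, fit, negative_log_likelihood, train_data):
    """Grid search: split the training range 90%/10%, fit each candidate on the
    first part and keep the one with the lowest held-out negative log-likelihood."""
    split = int(0.9 * len(train_data))
    fit_part, val_part = train_data[:split], train_data[split:]
    scores = []
    for embedding_dim, lstm_nodes in candidates:
        model = fit(fit_part, embedding_dim=embedding_dim, lstm_nodes=lstm_nodes)
        scores.append(negative_log_likelihood(model, val_part))
    return candidates[int(np.argmin(scores))]

# e.g. candidates = [(e, n) for e in (10, 20, 40) for n in (40, 80, 120)]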
5.2. Accuracy comparison

For the parts and ec/ec-sub datasets, we provide comparisons with the following baselines, which represent the state-of-the-art on these datasets to the best of our knowledge:

• Croston: the Croston method developed for intermittent demand forecasting, from the R package of Hyndman and Khandakar (2008).
• ETS: the ETS model (Hyndman et al., 2008) from the R package with automatic model selection. Only additive models are used, as multiplicative models show numerical issues on some time series.
• Snyder: the negative-binomial autoregressive method of Snyder et al. (2012).
• ISSM: the method of Seeger et al. (2016) using an innovative state space model with covariate features.

In addition, we compare our results to two baseline RNN models:
Table 2
Accuracy metrics relative to the strongest previously published method (baseline).
0.5-risk 0.9-risk Average
parts
(L, S) (0, 1) (2, 1) (0, 8) all(8) (0, 1) (2, 1) (0, 8) all(8) average
Snyder (baseline) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Croston 1.47 1.70 2.86 1.83 – – – – 1.97
ISSM 1.04 1.07 1.24 1.06 1.01 1.03 1.14 1.06 1.08
ETS 1.28 1.33 1.42 1.38 1.01 1.03 1.35 1.04 1.23
rnn-gaussian 1.17 1.49 1.15 1.56 1.02 0.98 1.12 1.04 1.19
rnn-negbin 0.95 0.91 0.95 1.00 1.10 0.95 1.06 0.99 0.99
DeepAR 0.98 0.91 0.91 1.01 0.90 0.95 0.96 0.94 0.94
ec-sub
(L, S) (0, 2) (0, 8) (3, 12) all(33) (0, 2) (0, 8) (3, 12) all(33) average
Snyder 1.04 1.18 1.18 1.07 1.0 1.25 1.37 1.17 1.16
Croston 1.29 1.36 1.26 0.88 – – – – 1.20
ISSM (baseline) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ETS 0.83 1.06 1.15 0.84 1.09 1.38 1.45 0.74 1.07
rnn-gaussian 1.03 1.19 1.24 0.85 0.91 1.74 2.09 0.67 1.21
rnn-negbin 0.90 0.98 1.11 0.85 1.23 1.67 1.83 0.78 1.17
DeepAR 0.64 0.74 0.93 0.73 0.71 0.81 1.03 0.57 0.77
ec
(L, S) (0, 2) (0, 8) (3, 12) all(33) (0, 2) (0, 8) (3, 12) all(33) average
Snyder 0.87 1.06 1.16 1.12 0.94 1.09 1.13 1.01 1.05
Croston 1.30 1.38 1.28 1.39 – – – – 1.34
ISSM (baseline) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ETS 0.77 0.97 1.07 1.23 1.05 1.33 1.37 1.11 1.11
rnn-gaussian 0.89 0.91 0.94 1.14 0.90 1.15 1.23 0.90 1.01
rnn-negbin 0.66 0.71 0.86 0.92 0.85 1.12 1.33 0.98 0.93
DeepAR 0.59 0.68 0.99 0.98 0.76 0.88 1.00 0.91 0.85
6. Conclusion
References

Chapados, N. (2014). Effective Bayesian modeling of groups of related count time series. In Proceedings of the 31st international conference on machine learning (pp. 1395–1403).
Chen, H., & Boylan, J. E. (2008). Empirical evidence on individual, group and shrinkage seasonal indices. International Journal of Forecasting, 24(3), 525–534.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., et al. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
Croston, J. (1972). Forecasting and stock control for intermittent demands. Operational Research Quarterly, 23, 289–304.
Davydenko, A., & Fildes, R. (2013). Measuring forecasting accuracy: the case of judgmental adjustments to SKU-level demand forecasts. International Journal of Forecasting, 29(3), 510–522.
Díaz-Robles, L. A., Ortega, J. C., Fu, J. S., Reed, G. D., Chow, J. C., Watson, J. G., et al. (2008). A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile. Atmospheric Environment, 42(35), 8331–8340.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods: Vol. 38. OUP Oxford.
Faloutsos, C., Flunkert, V., Gasthaus, J., Januschowski, T., & Wang, Y. (2019). Forecasting big time series: theory and practice. In KDD '19, Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 3209–3210). New York, NY, USA: ACM.
Fildes, R., Nikolopoulos, K., Crone, S., & Syntetos, A. (2008). Forecasting and operational research: a review. The Journal of the Operational Research Society, 59(9), 1150–1172.
Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., et al. (2019). Probabilistic forecasting with spline quantile function RNNs. In The 22nd international conference on artificial intelligence and statistics (pp. 1901–1910).
Gers, F. A., Eck, D., & Schmidhuber, J. (2001). Applying LSTM to time series predictable through time-window approaches. In G. Dorffner (Ed.), Artificial neural networks – ICANN 2001 (Proceedings) (pp. 669–676). Springer.
Ghiassi, M., Saidane, H., & Zimbra, D. (2005). A dynamic artificial neural network model for forecasting time series events. International Journal of Forecasting, 21(2), 341–362.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning: Adaptive computation and machine learning. MIT Press.
Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Gutierrez, R. S., Solis, A. O., & Mukhopadhyay, S. (2008). Lumpy demand forecasting using neural networks. International Journal of Production Economics, 111(2), 409–420.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9), 2579–2589.
Hyndman, R. J., & Athanasopoulos, G. (2012). Forecasting: principles and practice. OTexts.
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3), 1–22.
Hyndman, R., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Springer series in statistics, Forecasting with exponential smoothing: the state space approach. Springer.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456).
Kaastra, I., & Boyd, M. (1996). Designing a neural network for forecasting financial and economic time series. Neurocomputing, 10(3), 215–236.
Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kourentzes, N. (2013). Intermittent demand forecasts with neural networks. International Journal of Production Economics, 143(1), 198–206.
Laptev, N., Yosinsk, J., Li Erran, L., & Smyl, S. (2017). Time-series extreme event forecasting with neural networks at Uber. ICML time series workshop.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: results, findings, conclusion and way forward. International Journal of Forecasting, 34(4), 802–808.
Mohammadipour, M., Boylan, J., & Syntetos, A. (2012). The application of product-group seasonal indexes to individual products. Foresight: The International Journal of Applied Forecasting, 26, 20–26.
Oliveira, M. R., & Torgo, L. (2014). Ensembles for time series forecasting. Journal of Machine Learning Research (JMLR).
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.
Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, Y., & Januschowski, T. (2018). Deep state space models for time series forecasting. In Advances in neural information processing systems (pp. 7785–7794).
Seeger, M. W., Salinas, D., & Flunkert, V. (2016). Bayesian intermittent demand forecasting for large inventories. In Advances in neural information processing systems (pp. 4646–4654).
Smyl, S., Ranganathan, J., & Pasqua, A. (2018). M4 Forecasting competition: introducing a new hybrid ES-RNN model. https://eng.uber.com/m4-forecasting-competition.
Snyder, R. D., Ord, J., & Beaumont, A. (2012). Forecasting the intermittent demand for slow-moving inventories: a modelling approach. International Journal of Forecasting, 28(2), 485–496.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).
Timmermann, A. (2006). Forecast combinations. In G. Elliott, C. Granger, & A. Timmermann (Eds.), Handbook of economic forecasting: Vol. 1 (1st ed.) (pp. 135–196). Elsevier.
Toubeau, J.-F., Bottieau, J., Vallée, F., & De Grève, Z. (2018). Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets. IEEE Transactions on Power Systems, 34(2), 1203–1215.
Trapero, J. R., Kourentzes, N., & Fildes, R. (2015). On the identification of sales forecasting models in the presence of promotions. The Journal of the Operational Research Society, 66(2), 299–307.
Wen, R. W., Torkkola, K., & Narayanaswamy, B. (2017). A multi-horizon quantile recurrent forecaster. NIPS time series workshop.
Yu, H.-F., Rao, N., & Dhillon, I. S. (2016). Temporal regularized matrix factorization for high-dimensional time series prediction. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 847–855). Curran Associates, Inc.
Zhang, G., Eddy Patuwo, B., & Hu, Y. M. (1998). Forecasting with artificial neural networks: the state of the art. International Journal of Forecasting, 14(1), 35–62.