
International Journal of Forecasting 36 (2020) 1181–1191
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/ijforecast

DeepAR: Probabilistic forecasting with autoregressive recurrent networks

David Salinas, Valentin Flunkert∗, Jan Gasthaus, Tim Januschowski
Amazon Research, Germany

Abstract

Probabilistic forecasting, i.e., estimating a time series' future probability distribution given its past, is a key enabler for optimizing business processes. In retail businesses, for example, probabilistic demand forecasts are crucial for having the right inventory available at the right time and in the right place. This paper proposes DeepAR, a methodology for producing accurate probabilistic forecasts, based on training an autoregressive recurrent neural network model on a large number of related time series. We demonstrate how the application of deep learning techniques to forecasting can overcome many of the challenges that are faced by widely-used classical approaches to the problem. By means of extensive empirical evaluations on several real-world forecasting datasets, we show that our methodology produces more accurate forecasts than other state-of-the-art methods, while requiring minimal manual work.

Keywords: Probabilistic forecasting; Neural networks; Deep learning; Big data; Demand forecasting

© 2020 The Authors. Published by Elsevier B.V. on behalf of International Institute of Forecasters. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The majority of the forecasting methods that are in use today have been developed in the setting of forecasting individual or small groups of time series. In this approach, model parameters for each given time series are estimated independently from past observations. Although good and freely available automatic forecasting packages exist (such as that of Hyndman & Khandakar, 2008), typically models are selected manually, both in practice and in research, in order to account for different factors, such as autocorrelation structure, trend and seasonality, and in particular other explanatory variables. The fitted model is then used to forecast the time series into the future according to the model dynamics, with probabilistic forecasts possibly being admitted through simulation or closed-form expressions for the predictive distributions. Many methods in this class are based on the classical Box-Jenkins methodology, exponential smoothing techniques, or state space models (Box, Jenkins, Reinsel, & Ljung, 2015; Durbin & Koopman, 2012; Hyndman, Koehler, Ord, & Snyder, 2008).

Over the last few years, a different type of forecasting problem has gained increasing importance in many applications. Rather than predicting individual or small numbers of time series, one is faced with the need to forecast thousands or millions of related time series. Examples of such problems include forecasting the energy consumption at the level of the individual household, forecasting the load for servers in a data center, or forecasting the demand for each of the products offered by a large retailer. In each of these scenarios, a substantial amount of data on the past behaviors of similar, related time series can be used to produce a forecast for an individual time series. Using data from related time series (the energy consumption of other households, the demand for other products) not only allows more complex (and hence potentially more accurate) models to be fitted without over-fitting, but can also alleviate the human time- and labor-intensive steps of selecting and preparing covariates and selecting models that classical techniques require.

∗ Corresponding author.
E-mail addresses: [email protected] (D. Salinas), [email protected] (V. Flunkert), [email protected] (J. Gasthaus), [email protected] (T. Januschowski).

https://doi.org/10.1016/j.ijforecast.2019.07.001
0169-2070/© 2020 The Authors. Published by Elsevier B.V. on behalf of International Institute of Forecasters. This is an open access article under
the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Fig. 1. Histogram of the average number of sales per item (in log–log scale) for 500K items of ec, showing the scale-free nature (approximately straight line) of the ec dataset (axis labels omitted due to the non-public nature of the data). This figure was created by grouping the items in the dataset into buckets according to their average weekly sales and counting the number of items in each bucket. The number of items per bucket is then plotted against the number of sales (both axes in log scale).

This work presents DeepAR, a forecasting method based on autoregressive recurrent neural networks, which learns a global model from historical data of all time series in the dataset. Our method builds upon previous work on deep learning for time series data (Graves, 2013; van den Oord et al., 2016; Sutskever, Vinyals, & Le, 2014), and tailors a similar long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) based recurrent neural network architecture to the probabilistic forecasting problem.

One challenge that is encountered often when attempting to learn from multiple time series jointly in real-world forecasting problems is that the magnitudes of the time series differ widely, and the distribution of the magnitudes is skewed strongly in practical applications. This issue is illustrated in Fig. 1, which shows a histogram (in log–log scale) of the number of sales per item for millions of items sold by Amazon. We refer to this as the velocity of an item. Segmenting time series according to their velocity is crude yet intuitive, is easy to convey to non-experts, and suffices for our arguments and the main application further on.

The distribution of the velocity over a few orders of magnitude is an approximate power-law. This observation has fundamental implications for forecasting methods that attempt to learn a single model from such datasets. The scale-free nature of the distribution makes it difficult to divide the dataset into sub-groups of time series in a certain velocity band and learn separate models for them, as each such velocity sub-group will have a similar skew. Furthermore, group-based regularization schemes, such as that proposed by Chapados (2014), may fail, as the velocities will differ vastly within each group. Finally, such skewed distributions make the use of certain commonly-employed normalization techniques, such as input standardization or batch normalization (Ioffe & Szegedy, 2015), less effective.

The main contributions of this paper are twofold: (1) we propose a recurrent neural network (RNN) architecture for probabilistic forecasting, which incorporates a negative binomial likelihood for count data as well as special treatment for the case where the magnitudes of the time series vary widely; and (2) we demonstrate empirically, on several real-world datasets, that this model produces accurate probabilistic forecasts across a range of input characteristics, thus showing that modern deep learning based approaches can be effective at addressing the probabilistic forecasting problem. This provides further evidence that neural networks are a useful, general-purpose forecasting technique (see e.g. Kourentzes, 2013, for a further successful application of a less deep neural network).

In addition to providing a better forecast accuracy than previous methods, our approach has a number of key advantages over classical approaches:

(i) As the model learns seasonal behaviors and dependencies on given covariates across time series, minimal manual intervention in providing covariates is needed in order to capture complex, group-dependent behavior.
(ii) DeepAR makes probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all sub-ranges in the prediction horizon.
(iii) By learning from similar items, our method is able to provide forecasts for items that have little or no history available, a case where traditional single-item forecasting methods fail.
(iv) Our approach does not assume Gaussian noise, but can incorporate a wide range of likelihood functions, allowing the user to choose one that is appropriate for the statistical properties of the data.

Points (i) and (iii) are what sets DeepAR apart from classical forecasting approaches, while (ii) and (iv) are instrumental in the production of accurate, calibrated forecast distributions that are learned from the historical behavior of all of the time series jointly, which has not been addressed by previous related methods (see Section 2). Such probabilistic forecasts are of crucial importance in many applications, as, in contrast to point forecasts, they enable optimal decision making under uncertainty by minimizing risk functions, i.e., expectations of some loss function under the forecast distribution.

This paper is structured as follows. We begin by discussing related work in Section 2. Section 3 provides a brief overview of the key deep learning techniques on which we build in this paper. These techniques are well-established in the machine learning community, but we provide them here for convenience. Section 4 discusses the architecture of the DeepAR model in detail, and also details how the training of the model works. Section 5 provides empirical evidence for the practical usability of our method, and we conclude in Section 6.

2. Related work

In practical forecasting problems, especially in the demand forecasting domain, one is often faced with highly lumpy or intermittent data which violate the core assumptions of many classical techniques, such as Gaussianity, stationarity, or homoscedasticity of the time series. This has long been recognized as an important issue.
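The Gaussianity point can be made concrete with a small numerical illustration of ours (not from the paper): on simulated intermittent counts, a moment-matched negative binomial likelihood, using the dispersion convention Var[z] = µ + µ²α that the paper adopts later in Section 4.1, assigns far higher probability to the data than a moment-matched Gaussian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulate lumpy/intermittent demand: many zeros, occasional bursts.
z = rng.negative_binomial(n=0.5, p=0.2, size=1000)

# Moment-matched Gaussian fit.
mu, sigma = z.mean(), z.std()
ll_gauss = stats.norm.logpdf(z, mu, sigma).sum()

# Moment-matched negative binomial fit: Var[z] = mu + mu^2 * alpha,
# so alpha is recovered from the empirical mean and variance.
alpha = max((z.var() - z.mean()) / z.mean() ** 2, 1e-8)
n = 1.0 / alpha            # scipy's shape parameter
p = n / (n + z.mean())     # scipy's success probability
ll_nb = stats.nbinom.logpmf(z, n, p).sum()

# The count likelihood dominates the Gaussian on this data.
print(ll_nb > ll_gauss)  # True
```

The gap between the two log-likelihoods is one simple way to quantify how badly the Gaussianity assumption is violated by intermittent data.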

Intermittent demand can be treated using techniques that range from classical methods (Croston, 1972) to neural networks (Gutierrez, Solis, & Mukhopadhyay, 2008) directly. Data-preprocessing methods such as Box–Cox transformations (Box & Cox, 1964) or differencing (Hyndman & Athanasopoulos, 2012, provides an overview) have been proposed in order to alleviate unfavorable characteristics of time series, but with mixed results. Other approaches incorporate more suitable likelihood functions, such as the zero-inflated Poisson distribution, the negative binomial distribution (Snyder, Ord, & Beaumont, 2012), a combination of both (Chapados, 2014), or a tailored multi-stage likelihood (Seeger, Salinas, & Flunkert, 2016). We approach the (demand) forecasting problem by incorporating appropriate likelihoods and combining them with non-linear data transformation techniques, as learned by a (deep) neural network. In particular, we use a negative binomial likelihood in the case of demand forecasting, which improves the accuracy but precludes us from applying standard data normalization techniques directly. By using deeper networks than have been proposed previously in the forecasting literature, we allow the neural network to represent more complex data transformations. Goodfellow, Bengio, and Courville (2016) provide a comprehensive overview of modern deep neural networks, including justifications of why deep neural networks are preferable to shallow and wide neural networks.

When faced with time series as they occur in industrial applications, sharing information across time series is key to improving the forecast accuracy. However, this can be difficult to accomplish in practice, due to the often heterogeneous nature of the data. A prominent approach to sharing information across time series is to use clustering techniques such as k-means clustering to compute seasonality indices, which are then combined with classic forecasting models (see for example the Foresight 2007 Spring issue on seasonality for a number of examples, as well as the papers by Chen & Boylan, 2008 and Mohammadipour, Boylan, & Syntetos, 2012). Other examples include the explicit handling of promotional effects (see e.g. Trapero, Kourentzes, & Fildes, 2015, and references therein) via pooled principal component analysis regression. The latter is another instance of using an unsupervised learning technique as a pre-processing step. These effects need to be handled in practical applications, which leads to complex pipelines that are difficult both to tune and to maintain. The complexity of such pipelines is likely to increase when one needs to address specific sub-problems such as forecasts for new products. Effectively, one decomposes the overall forecasting problem into a number of distinct forecasting sub-problems and applies a dedicated model or even a chain of models to each one. These models can range from classical statistical models (e.g. Hyndman et al., 2008) to machine learning models (e.g. Laptev, Yosinsk, Li Erran, & Smyl, 2017; Wen, Torkkola, & Narayanaswamy, 2017) and judgmental approaches (e.g. Davydenko & Fildes, 2013). However, the forecasting problem might not lend itself to a decomposition into a sequence of distinct procedures, in which case one would have to think about model consolidation, blending, or ensembling (e.g. Oliveira & Torgo, 2014; Timmermann, 2006). These techniques increase the complexity still further. Deep neural networks offer an alternative to such pipelines. Such models require a limited amount of standardized data pre-processing, after which the forecasting problem is solved by learning an end-to-end model. In particular, data-processing is included in the model and optimized jointly towards the goal of producing the best possible forecast. In practice, deep learning forecasting pipelines rely almost exclusively on what the model can learn from the data, unlike traditional pipelines, which rely heavily on heuristics such as expert-designed components and manual covariate design.

Other approaches to the sharing of information across time series are via matrix factorization methods (e.g. the recent work of Yu, Rao, and Dhillon (2016)). We compare our approach to this method directly in Section 5, and show how we empirically outperform it. Further methods that share information include Bayesian methods that share information via hierarchical priors (Chapados, 2014) and by making use of any hierarchical structure that may be present in the data (Ben Taieb, Taylor, & Hyndman, 2017; Hyndman, Ahmed, Athanasopoulos, & Shang, 2011).

Finally, we note that neural networks have been investigated in the context of forecasting for a long time by both the machine learning and forecasting communities (for more recent work considering LSTM cells, see for example the numerous references in the surveys by Zhang, Eddy Patuwo, & Hu, 1998, Fildes, Nikolopoulos, Crone, & Syntetos, 2008, and Gers, Eck, & Schmidhuber, 2001). Outside of the forecasting community, time series models based on RNNs have been applied very successfully to various other applications, such as natural language processing (NLP) (Graves, 2013; Sutskever et al., 2014), audio modeling (van den Oord et al., 2016) or image generation (Gregor, Danihelka, Graves, Rezende, & Wierstra, 2015). Direct applications of RNNs to forecasting include the recent papers by Wen et al. (2017) and Laptev et al. (2017). Our work differs from these in that it provides a comprehensive benchmark including publicly available datasets and a fully probabilistic forecast.¹

¹ Since the initial pre-print of the present work became available, neural networks have received an increasing amount of attention, see for example (Bandara, Bergmeir, & Smyl, 2017; Gasthaus, Benidis, Wang, Rangapuram, Salinas, Flunkert, et al., 2019; Oreshkin, Carpov, Chapados, & Bengio, 2019; Rangapuram, Seeger, Gasthaus, Stella, Wang, & Januschowski, 2018; Smyl, Ranganathan, & Pasqua, 2018; Toubeau, Bottieau, Vallée, & De Grève, 2018). The winning solution to the M4 competition was based on a neural network (Makridakis, Spiliotis, & Assimakopoulos, 2018; Smyl et al., 2018). Future work will address a systematic review of these methods.

Within the forecasting community, neural networks have been applied typically to individual time series, i.e., a different model is fitted to each time series independently (Díaz-Robles et al., 2008; Ghiassi, Saidane, & Zimbra, 2005; Hyndman & Athanasopoulos, 2018; Kaastra & Boyd, 1996). Kourentzes (2013) applies neural networks specifically to intermittent data. The author uses a feed-forward neural network (which, by design, ignores

the sequential nature of the data) and provides a shallow


neural network that contains one hidden layer. Given the
limited training data, this shallowness allows for effec-
tive training, and the author obtains promising results.
However, the use of larger datasets and data augmenta-
tion techniques such as we discuss further on allows us
to train deeper neural networks. Furthermore, we pro-
vide more details regarding the set-up for the training
of the RNN and a methodology for obtaining probabilistic
forecasts. Gutierrez et al. (2008) also use a feed-forward
neural network for lumpy demand forecasting data. RNNs
have the advantage of modeling the sequential nature
of time series explicitly, and thus have a smaller number
of parameters that need fitting. Our work differs from
these papers in that it provides more details on the neural
network architecture and utilizes recent advances from
the machine learning community on the training of RNNs.
In addition, in contrast to the aforementioned papers, our
work also provides the full probability distribution of the
forecasts, which is crucial for optimal decision making in
downstream applications.
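The sequential modeling and parameter sharing referred to above can be sketched with a minimal Elman-style recurrent cell (our simplification; the paper itself uses LSTM cells, introduced in Section 3): the same weights are reused at every time step, so the parameter count is independent of the length of the history.

```python
import numpy as np

# A minimal Elman-style recurrent cell: the weights (W, U, b) are shared
# across all time steps, unlike a feed-forward net whose input layer
# grows with the length of the lag window.
rng = np.random.default_rng(0)
hidden, inputs = 8, 1
W = rng.normal(scale=0.1, size=(hidden, hidden))  # recurrent weights
U = rng.normal(scale=0.1, size=(hidden, inputs))  # input weights
b = np.zeros(hidden)

def step(h_prev, x_t):
    # One recurrence: the new state mixes the previous state and the input.
    return np.tanh(W @ h_prev + U @ x_t + b)

h = np.zeros(hidden)
series = rng.normal(size=(20, inputs))  # a toy univariate time series
for x_t in series:                      # unroll over the sequence
    h = step(h, x_t)
print(h.shape)  # (8,)
```

The final state h is a fixed-size summary of the entire history, which is what lets a recurrent model consume sequences of arbitrary length with a constant number of parameters.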

3. Background: RNNs

We assume that the reader is familiar with basic neural network nomenclature around multi-layer perceptrons (MLPs) or feed-forward neural networks, which have been applied successfully to forecasting problems by the forecasting community (e.g., Gutierrez et al., 2008; Kourentzes, 2013); in particular, modern forecasting textbooks include MLPs, e.g., Hyndman and Athanasopoulos (2018). We refer the interested reader to Goodfellow et al. (2016) for a comprehensive introduction to modern deep learning approaches and Faloutsos, Flunkert, Gasthaus, Januschowski, and Wang (2019) for a tutorial that focuses on forecasting. In what follows, we provide a brief introduction to recurrent neural networks (RNNs) and key techniques for handling them, as they have not been dealt with extensively in the forecasting literature (though the machine learning community has applied them to forecasting problems with some success; e.g. Laptev et al., 2017; Wen et al., 2017). We follow Goodfellow et al. (2016) in our exposition.

A classic dynamic system driven by an external signal x(t) is given by

h(t) = f(h(t−1), x(t); θ), (1)

where h(t) is the state of the system at step t and θ is a parameter of a transition function f. RNNs use Eq. (1) to model the values of their hidden units (recall that a hidden unit of a neural network is one that is in neither the input layer nor the output layer).

This means that RNNs are deterministic, non-linear dynamic systems, in contrast to additive exponential smoothing in state space form, which can be represented as linear non-deterministic dynamic systems with a single source of error/innovation (Hyndman et al., 2008).

Fig. 2. Left: an RNN without an output layer. Right: a partially unrolled RNN with an output layer and multiple hidden units.

Fig. 2 contains a depiction of a simple RNN. The recursive structure of the RNN means that fewer parameters need to be learned than in the case of MLPs. However, a technical difficulty arises in the training of RNNs via a gradient-based optimization procedure. The recursive nature of RNNs often results in ill-conditioned optimization problems which are referred to commonly in the machine learning community as vanishing or exploding gradients. The long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997) alleviates this problem (among other favorable properties), and it is the approach that we adopt in this paper. We do not present the full functional form of LSTMs, as this is unnecessary for our arguments, but again refer to the book by Goodfellow et al. (2016) for an overview and a comprehensive exposition. All modern neural learning packages, such as that of Chen et al. (2015), include an implementation of LSTM-based RNNs.

In addition to LSTMs, another concept from RNNs will be useful: the encoder–decoder framework, which allows RNNs to be used to map an input sequence x = (x1, ..., xnx) to an output sequence y = (y1, ..., yny) of differing lengths. This idea is used frequently in NLP and machine translation, and works as follows. Given an input sequence, a first RNN processes this sequence and emits a so-called context, a vector or a sequence of vectors. In practice, this is often the last state hnx of the encoder RNN. A second RNN, the decoder RNN, is conditioned on the context in order to generate the output sequence. The two RNNs are trained jointly to maximize the average of log P(y|x) over all pairs x, y in the training set. Section 4 discusses the application of this concept to forecasting.

4. Model

Denoting the value of time series (for an item) i at time t by zi,t, our goal is to model the conditional distribution

P(zi,t0:T | zi,1:t0−1, xi,1:T)
D. Salinas, V. Flunkert, J. Gasthaus et al. / International Journal of Forecasting 36 (2020) 1181–1191 1185

Fig. 3. Summary of the model. Training (left): at each time step t, the inputs to the network are the covariates xi,t, the target value at the previous time step zi,t−1, and the previous network output hi,t−1. The network output hi,t = h(hi,t−1, zi,t−1, xi,t, Θ) is then used to compute the parameters θi,t = θ(hi,t, Θ) of the likelihood p(z|θ), which is used for training the model parameters. Prediction (right): the history of the time series zi,t is fed in for t < t0; then, in the prediction range, for t ≥ t0 a sample ẑi,t ∼ p(·|θi,t) is drawn and fed back for the next point until the end of the prediction range t = t0 + T, generating one sample trace. Repeating this prediction process yields many traces that represent the joint predicted distribution.

of the future of each time series [zi,t0, zi,t0+1, ..., zi,T] := zi,t0:T given its past [zi,1, ..., zi,t0−2, zi,t0−1] := zi,1:t0−1, where t0 denotes the time point from which we assume zi,t to be unknown at prediction time, and xi,1:T are covariates that are assumed to be known for all time points. To limit ambiguity, we avoid the terms "past" and "future" and will refer to the time ranges [1, t0 − 1] and [t0, T] as the conditioning range and prediction range, respectively. The conditioning range corresponds to the encoder range introduced in Section 3 and the prediction range to the decoder range. During training, both ranges have to lie in the past so that the zi,t are observed, but during prediction, zi,t is only available in the conditioning range. Note that the time index t is relative, i.e. t = 1 can correspond to a different actual/absolute time period for each i.

Our model, summarized in Fig. 3, is based on an autoregressive recurrent network architecture (Graves, 2013; Sutskever et al., 2014). We assume that our model distribution QΘ(zi,t0:T | zi,1:t0−1, xi,1:T) consists of a product of likelihood factors

QΘ(zi,t0:T | zi,1:t0−1, xi,1:T) = ∏_{t=t0}^{T} QΘ(zi,t | zi,1:t−1, xi,1:T) = ∏_{t=t0}^{T} p(zi,t | θ(hi,t, Θ)),

parametrized by the output hi,t of an autoregressive recurrent network

hi,t = h(hi,t−1, zi,t−1, xi,t, Θ), (2)

where h is a function that is implemented by a multi-layer recurrent neural network with LSTM cells parametrized by Θ. We provide further details of the architecture and hyper-parameters in Section 5. The model is autoregressive, in the sense that it consumes the observation at the last time step zi,t−1 as an input, as well as recurrent, i.e., the previous output of the network hi,t−1 is fed back as an input at the next time step. The likelihood² p(zi,t | θ(hi,t)) is a fixed distribution with parameters that are given by a function θ(hi,t, Θ) of the network output hi,t (see below).

² We refer to p(z|θ) as a likelihood when we think of it as a function of θ for a fixed z.

Information about the observations in the conditioning range zi,1:t0−1 is transferred to the prediction range through the initial state hi,t0−1. In the sequence-to-sequence setup, this initial state is the output of an encoder network. While in general this encoder network can have a different architecture, in our experiments we opt to use the same architecture for the model in both the conditioning range and the prediction range (corresponding to the encoder and decoder in a sequence-to-sequence model). Further, we share weights between them, so that the initial state for the decoder hi,t0−1 is obtained by computing Eq. (2) for t = 1, ..., t0 − 1, where all required quantities are observed. The initial states of both the encoder hi,0 and zi,0 are initialized to zero.

Given the model parameters Θ, we can obtain joint samples z̃i,t0:T ∼ QΘ(zi,t0:T | zi,1:t0−1, xi,1:T) directly through ancestral sampling. First, we obtain hi,t0−1 by computing Eq. (2) for t = 1, ..., t0 − 1. For t = t0, t0 + 1, ..., T, we sample z̃i,t ∼ p(·|θ(h̃i,t, Θ)), where h̃i,t = h(h̃i,t−1, z̃i,t−1, xi,t, Θ), initialized with h̃i,t0−1 = hi,t0−1 and z̃i,t0−1 = zi,t0−1. Samples from the model obtained in this way can then be used to compute quantities of interest, e.g. quantiles of the distribution of the sum of values for some future time period.

4.1. Likelihood model

The likelihood p(z|θ) determines the "noise model", and should be chosen to match the statistical properties of the data. In our approach, the network directly predicts all parameters θ (e.g. mean and variance) of the probability distribution for the next time point.

For the experiments in this paper, we consider two choices: Gaussian likelihood for real-valued data, and negative-binomial likelihood for positive count data. Other likelihood models can also be used readily, e.g. a beta likelihood for data in the unit interval, a Bernoulli likelihood for binary data, or mixtures in order to handle complex marginal distributions, as long as samples from the distribution can be obtained cheaply, and the log-likelihood and its gradients with respect to the parameters can be evaluated. We parametrize the Gaussian likelihood using

its mean and standard deviation, θ = (µ, σ ), where the t = 1 corresponds to 2013-01-01, 2013-01-02, 2013-
mean is given by an affine function of the network output, 01-03, and so on. When choosing these windows, we
and the standard deviation is obtained by applying an ensure that the entire prediction range is always cov-
affine transformation followed by a softplus activation in ered by the available ground truth data, but we may
order to ensure σ > 0: chose to have t = 1 lie before the start of the time
1 series, e.g. 2012-12-01 in the example above, padding
pG (z |µ, σ ) = (2π σ 2 )− 2 exp(−(z − µ)2 /(2σ 2 )) , the unobserved target with zeros. This allows the model
µ(hi,t ) = wTµ hi,t + bµ , to learn the behavior of ‘‘new’’ time series by taking
into account all other available covariates. Augmenting
σ (hi,t ) = log(1 + exp(wTσ hi,t + bσ )) .
the data using this windowing procedure ensures that
The negative binomial distribution is a common choice information about absolute time is available to the model
for modeling time series of positive count data (Chapados, only through covariates, not through the relative position
2014; Snyder et al., 2012). We parameterize the negative of zi,t in the time series. Fig. 4 contains a depiction of this
binomial distribution by its mean µ ∈ R+ and a shape data augmentation technique.
parameter α ∈ R+ , Bengio, Vinyals, Jaitly, and Shazeer (2015) noted that
) α1 ( the autoregressive nature of such models means that op-
Γ (z + α1 )
)z
αµ
(
1 timizing Eq. (3) directly causes a discrepancy between
pNB (z |µ, α ) = ,
Γ (z + 1)Γ ( α1 ) 1 + αµ 1 + αµ the ways in which the model is used during training
and when obtaining predictions from the model: during
µ(hi,t ) = log(1 + exp(wTµ hi,t + bµ )) , training, the values of zi,t are known in the prediction
α (hi,t ) = log(1 + exp(wTα hi,t + bα )) , range and can be used to compute hi,t ; however, during
prediction, zi,t is unknown for t ≥ t0 , and a single sample
where both parameters are obtained from the network output by a fully-connected layer with softplus activation so as to ensure positivity. In this parameterization of the negative binomial distribution, the shape parameter α scales the variance relative to the mean, i.e. Var[z] = µ + µ²α. While other parameterizations are possible, preliminary experiments showed this particular one to be especially conducive to fast convergence.

4.2. Training

Given a dataset of time series {zi,1:T}i=1,...,N and associated covariates xi,1:T, obtained by choosing a time range such that zi,t in the prediction range is known, the parameters Θ of the model, consisting of the parameters of both the RNN h(·) and θ(·), can be learned by maximizing the log-likelihood

L = ∑_{i=1}^{N} ∑_{t=t0}^{T} log p(zi,t | θ(hi,t)).  (3)

As hi,t is a deterministic function of the input, all quantities required for computing Eq. (3) are observed, so that, in contrast to state space models with latent variables, no inference is required, and Eq. (3) can be optimized directly via stochastic gradient descent by computing gradients with respect to Θ. In our experiments, where the encoder model is the same as the decoder, the distinction between the encoder and the decoder is somewhat artificial during training, so that we also include the likelihood terms for t = 0, ..., t0 − 1 in Eq. (3) (or, equivalently, set t0 = 0).

For each time series in the dataset, we generate multiple training instances by selecting windows with different starting points from the original time series. In practice, we keep both the total length T and the relative lengths of the conditioning and prediction ranges fixed for all training examples. For example, if the total available period for a given time series ranges from 2013-01-01 to 2017-01-01, we can create training examples whose windows start at 2013-01-01, 2013-02-01, and so on.

During prediction, a sample z̃i,t ∼ p(·|θ(hi,t)) from the model distribution is used in the computation of hi,t according to Eq. (2) instead. While this disconnect has been shown to pose a severe problem for NLP tasks, for example, we have not observed adverse effects from it in a forecasting setting. Preliminary experiments with variants of scheduled sampling (Bengio et al., 2015) did not show any noteworthy improvements in accuracy (but did slow convergence).

4.3. Scale handling

Applying the model to data that exhibit a power-law of scales, as depicted in Fig. 1, presents two challenges.

Firstly, the autoregressive nature of the model means that both the autoregressive input zi,t−1 and the output of the network (e.g. µ) scale with the observations zi,t directly, but the non-linearities of the network in between have a limited operating range. Thus, without further modifications, the network has to learn, first, to scale the input to an appropriate range in the input layer, then to invert this scaling at the output. We address this issue by dividing the autoregressive inputs zi,t (or z̃i,t) by an item-dependent scale factor νi, and conversely multiplying the scale-dependent likelihood parameters by the same factor. For instance, for the negative binomial likelihood, we use µ = νi log(1 + exp(oµ)) and α = log(1 + exp(oα))/√νi, where oµ and oα are the outputs of the network for these parameters. Note that while one could alternatively scale the input in a preprocessing step for real-valued data, this is not possible for count distributions. The selection of an appropriate scale factor might be challenging in itself (especially in the presence of missing data or large within-item variances). However, scaling by the average value νi = 1 + (1/t0) ∑_{t=1}^{t0} zi,t, as we do in our experiments, is a heuristic that works well in practice.

Secondly, the imbalance in the data means that a stochastic optimization procedure that picks training instances uniformly at random will visit the small number of time series with large scales very infrequently, resulting in those time series being underfitted.
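The scale handling of Section 4.3 can be sketched numerically as follows; here `o_mu` and `o_alpha` are stand-ins for hypothetical raw network outputs, and `softplus` is the positivity transform log(1 + exp(·)) used in the parameterization:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable form
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def item_scale(z_cond):
    # nu_i = 1 + average of the conditioning range (the paper's heuristic)
    return 1.0 + np.mean(z_cond)

def rescale_nb_params(o_mu, o_alpha, nu):
    # Map raw network outputs to negative binomial parameters:
    # the scale-dependent mean is multiplied by nu, the shape divided by sqrt(nu).
    mu = nu * softplus(o_mu)
    alpha = softplus(o_alpha) / np.sqrt(nu)
    return mu, alpha

z_cond = np.array([120.0, 80.0, 100.0])  # hypothetical conditioning range
nu = item_scale(z_cond)                  # 1 + 100 = 101
z_scaled = z_cond / nu                   # autoregressive inputs fed to the RNN
mu, alpha = rescale_nb_params(0.3, -1.2, nu)
var = mu + mu**2 * alpha                 # Var[z] = mu + mu^2 * alpha
```

The division of the inputs and the multiplication of the likelihood parameters by νi keep the non-linearities of the network operating in a bounded range, regardless of the item's absolute scale.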

Fig. 4. Depiction of the data setup for training and forecasting for a single input time series z with covariates x. The green vertical line separates the training data from the testing data, so we compute the out-of-sample accuracy for forecasts to the right of the green line; in particular, no data to the right of the green line is used in training. Left. The data setup during the training phase. The red lines mark the slices of x that are presented to the model during training, where the left part marks the conditioning range and the right part the prediction range. Note that all windows are to the left of the green line. Right. During forecasting, when the model is fully trained, only the conditioning range is to the left of the green line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
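The windowing that Fig. 4 depicts (a fixed total length, split into conditioning and prediction ranges, with varying start points) can be sketched as a simple slicing routine; the lengths used here are illustrative, not the paper's settings:

```python
import numpy as np

def training_windows(z, cond_len, pred_len, stride=1):
    """Slice one series into training instances: each window has a fixed
    total length (conditioning range + prediction range) and a different
    starting point within the training period."""
    total = cond_len + pred_len
    windows = []
    for start in range(0, len(z) - total + 1, stride):
        w = z[start:start + total]
        windows.append((w[:cond_len], w[cond_len:]))  # (conditioning, prediction)
    return windows

z = np.arange(30.0)  # a toy series of 30 observations
wins = training_windows(z, cond_len=8, pred_len=8)
# 30 - 16 + 1 = 15 overlapping windows
```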

This could be especially problematic in the demand forecasting setting, where high-velocity items can exhibit qualitatively different behaviors from low-velocity items, and having accurate forecasts for high-velocity items might be more important for meeting certain business objectives. We counteract this effect by sampling the examples non-uniformly during training. In particular, our weighted sampling scheme sets the probability of selecting a window from an example with scale νi proportional to νi. This sampling scheme is simple yet effective in compensating for the skew in Fig. 1.

4.4. Covariates

The covariates xi,t can be item-dependent, time-dependent, or both. This distinction is mainly of practical importance. In theory, any covariate xi,t that does not vary with time can be generalized trivially to be time-dependent by repeating it along the time dimension. Examples of such time-independent covariates include the categorization of item i, for example to denote membership to a certain group of products (e.g., product i is a shoe, so xi,t = s, where s is an identifier for the shoe category). Such covariates allow for the fact that time series with the same time-independent covariate may be similar. Examples of time-dependent covariates include information about the time point (e.g. week of year) of the time series. They can also be used to include covariates that one expects to influence the outcome (e.g. price or promotion status in the demand forecasting setting), as long as the covariates' values are available in the prediction range as well. If these values are not available (e.g., future price changes), one option is to set them manually (e.g., assume that there are no price changes), which allows for what-if analysis. The principled solution is to predict these time series jointly, e.g., forecast demand and price jointly in a multivariate forecast; what-if analyses are then possible by conditioning on known future prices. We leave multivariate forecasting as a valuable direction for future work.

All of our experiments use an "age" covariate, i.e., the distance to the first observation in that time series. We also add day-of-the-week and hour-of-the-day covariates for hourly data, week-of-the-year for weekly data and month-of-the-year for monthly data. We encode these simply as increasing numeric values, instead of using multi-dimensional binary variables to encode them. Furthermore, we include a single categorical item covariate, for which an embedding is learned by the model. In the retail demand forecasting datasets, the item covariate corresponds to a (coarse) product category (e.g. "clothing"), while in the smaller datasets it corresponds to the item's identity, allowing the model to learn item-specific behaviors. By appropriate normalization, we standardize all covariates to have a zero mean and unit variance.

5. Applications and experiments

We implement our model using MXNet (Chen et al., 2015), and use a single p2.xlarge AWS EC2 compute instance containing 4 CPUs and 1 GPU to run all experiments.³ With this set-up, training and prediction on the large ec dataset containing 500K time series can be completed in less than 10 h. Note that prediction with a trained model is fast (in the order of tens of minutes for a single compute instance), and can be sped up if necessary by executing prediction in parallel.

We use the ADAM optimizer (Kingma & Ba, 2014) with early stopping and standard LSTM cells with a forget bias set to 1.0 in all experiments, and 200 samples are drawn from our decoder for generating predictions.

5.1. Datasets

We use five datasets for our evaluations. The first three, namely parts, electricity, and traffic, are public datasets; parts consists of 1046 aligned time series of 50 time steps each, representing the monthly sales of different items by a US automobile company (Seeger et al., 2016); electricity contains hourly time series of the electricity consumption of 370 customers (Yu et al., 2016); and traffic, also used by Yu et al. (2016), contains the hourly occupancy rates, between zero and one, of 963 car lanes of San Francisco bay area freeways. For the parts dataset, we use the first 42 months as training data and report the error on the remaining 8 months. For the other datasets, electricity, traffic, ec-sub and ec, the set of possible training instances is sub-sampled to the number indicated in Table 1. The results for electricity and traffic are computed using a rolling window of predictions, as described by Yu et al. (2016).

³ Implementations of DeepAR are available on Amazon SageMaker (closed-source) and as part of GluonTS (Alexandrov et al., 2019).

Fig. 5. Example time series of ec. The vertical line separates the conditioning period from the prediction period. The black line shows the true target.
In the prediction range, we plot the p50 as a blue line (mostly zero for the three slow items), along with 80% confidence intervals (shaded). The
model learns accurate seasonality patterns and uncertainty estimates for items of different velocities and ages. (For interpretation of the references
to colour in this figure legend, the reader is referred to the web version of this article.)
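Forecast quantiles like the p50 line and the 80% interval in Fig. 5 are read off the empirical distribution of decoder sample paths; a minimal sketch, with stand-in Poisson samples in place of actual model output:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 hypothetical sample paths from the decoder over a horizon of 8 steps
samples = rng.poisson(lam=5.0, size=(200, 8))

# per-time-step empirical quantiles across the sample paths
p50 = np.quantile(samples, 0.5, axis=0)            # median forecast
lo, hi = np.quantile(samples, [0.1, 0.9], axis=0)  # 80% interval
```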

Table 1
Dataset statistics and RNN parameters.
parts electricity traffic ec-sub ec
# of time series 1046 370 963 39700 534884
time granularity month hourly hourly week week
domain N R+ [0, 1] N N
encoder length 8 168 168 52 52
decoder length 8 24 24 52 52
# of training examples 35K 500K 500K 2M 2M
item input embedding dimension 1046 370 963 5 5
item output embedding dimension 1 20 20 20 20
batch size 64 64 64 512 512
learning rate 1e−3 1e−3 1e−3 5e−3 5e−3
# of LSTM layers 3 3 3 3 3
# of LSTM nodes 40 40 40 120 120
running time 5 min 7h 3h 3h 10h
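For the count-valued (domain N) datasets in Table 1, the paper uses the negative binomial likelihood with mean µ and shape α, so that Var[z] = µ + µ²α. A sketch checking that relationship empirically via the standard Gamma–Poisson mixture representation of the negative binomial (the parameter values are illustrative):

```python
import numpy as np

def sample_negbin(mu, alpha, size, rng):
    """Draw from the negative binomial with mean mu and shape alpha
    (so Var[z] = mu + mu^2 * alpha) via its Gamma-Poisson mixture form:
    lambda ~ Gamma(1/alpha, alpha*mu), z | lambda ~ Poisson(lambda)."""
    lam = rng.gamma(shape=1.0 / alpha, scale=alpha * mu, size=size)
    return rng.poisson(lam)

rng = np.random.default_rng(42)
mu, alpha = 10.0, 0.5
z = sample_negbin(mu, alpha, size=200_000, rng=rng)
# empirical moments should be close to mu = 10 and mu + mu^2 * alpha = 60
```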

We do not retrain our model for each window, but use a single model trained on the data before the first prediction window. The remaining two datasets, ec and ec-sub, are the weekly item sales from Amazon that were used by Seeger et al. (2016), and we predict the 52 weeks following 2014-09-07. The time series in these two datasets are very diverse and lumpy, ranging from very fast-moving to very slow-moving items, and include "new" products that were introduced in the weeks before the forecast time 2014-09-07; see Fig. 5. Further, the item velocities in these datasets have a power-law distribution, as shown in Fig. 1.

Table 1 also lists running times as measured by an end-to-end evaluation, e.g. processing covariates, training the neural network, drawing samples and evaluating the distributions produced.

For each dataset, a grid-search is used to find the best value for the hyper-parameters item output embedding dimension and # of LSTM nodes (i.e. the number of hidden units). To do so, the data before the forecast start time are used as the training set and split into two partitions. For each hyper-parameter candidate, we fit our model on the first partition of the training set, containing 90% of the data, and pick the one that has the minimal negative log-likelihood on the remaining 10%. Once the best set of hyper-parameters has been found, the evaluation metrics (0.5-risk, 0.9-risk, ND and RMSE) are evaluated on the test set, that is, the data coming after the forecast start time. Note that this procedure could lead to the hyper-parameters being over-fitted to the training set, but this would also degrade the metric that we report. A better procedure would be to fit the parameters and evaluate the negative log-likelihood not only on different windows, but also on non-overlapping time intervals. We tune the learning rate manually for every dataset and keep it fixed in hyper-parameter tuning. Other parameters such as the encoder length, decoder length and item input embedding are considered to be domain-dependent, and are not fitted. The batch size is increased on larger datasets in order to benefit more from the GPU's parallelization. Finally, the running time measures an end-to-end evaluation, e.g. processing covariates, training the neural network, drawing samples for the production of probabilistic forecasts, and evaluating the forecasts.

5.2. Accuracy comparison

For the parts and ec/ec-sub datasets, we provide comparisons with the following baselines, which represent the state-of-the-art on these datasets to the best of our knowledge:

• Croston: the Croston method developed for intermittent demand forecasting, from the R package of Hyndman and Khandakar (2008).
• ETS: the ETS model (Hyndman et al., 2008) from the R package with automatic model selection. Only additive models are used, as multiplicative models show numerical issues on some time series.
• Snyder: the negative-binomial autoregressive method of Snyder et al. (2012).
• ISSM: the method of Seeger et al. (2016) using an innovative state space model with covariate features.

In addition, we compare our results to two baseline RNN models:

Table 2
Accuracy metrics relative to the strongest previously published method (baseline).
0.5-risk 0.9-risk Average
parts
(L, S) (0, 1) (2, 1) (0, 8) all(8) (0, 1) (2, 1) (0, 8) all(8) average
Snyder (baseline) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Croston 1.47 1.70 2.86 1.83 – – – – 1.97
ISSM 1.04 1.07 1.24 1.06 1.01 1.03 1.14 1.06 1.08
ETS 1.28 1.33 1.42 1.38 1.01 1.03 1.35 1.04 1.23
rnn-gaussian 1.17 1.49 1.15 1.56 1.02 0.98 1.12 1.04 1.19
rnn-negbin 0.95 0.91 0.95 1.00 1.10 0.95 1.06 0.99 0.99
DeepAR 0.98 0.91 0.91 1.01 0.90 0.95 0.96 0.94 0.94
ec-sub
(L, S) (0, 2) (0, 8) (3, 12) all(33) (0, 2) (0, 8) (3, 12) all(33) average
Snyder 1.04 1.18 1.18 1.07 1.0 1.25 1.37 1.17 1.16
Croston 1.29 1.36 1.26 0.88 – – – – 1.20
ISSM (baseline) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ETS 0.83 1.06 1.15 0.84 1.09 1.38 1.45 0.74 1.07
rnn-gaussian 1.03 1.19 1.24 0.85 0.91 1.74 2.09 0.67 1.21
rnn-negbin 0.90 0.98 1.11 0.85 1.23 1.67 1.83 0.78 1.17
DeepAR 0.64 0.74 0.93 0.73 0.71 0.81 1.03 0.57 0.77
ec
(L, S) (0, 2) (0, 8) (3, 12) all(33) (0, 2) (0, 8) (3, 12) all(33) average
Snyder 0.87 1.06 1.16 1.12 0.94 1.09 1.13 1.01 1.05
Croston 1.30 1.38 1.28 1.39 – – – – 1.34
ISSM (baseline) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ETS 0.77 0.97 1.07 1.23 1.05 1.33 1.37 1.11 1.11
rnn-gaussian 0.89 0.91 0.94 1.14 0.90 1.15 1.23 0.90 1.01
rnn-negbin 0.66 0.71 0.86 0.92 0.85 1.12 1.33 0.98 0.93
DeepAR 0.59 0.68 0.99 0.98 0.76 0.88 1.00 0.91 0.85

Note: The best results are marked in bold (lower is better).

• rnn-gaussian uses the same architecture as DeepAR with a Gaussian likelihood; however, it uses uniform sampling and a simpler scaling mechanism, where the time series zi are divided by νi and the outputs are multiplied by νi.
• rnn-negbin uses a negative binomial distribution, but does not scale the inputs and outputs of the RNN, and the training instances are drawn uniformly rather than using weighted sampling.

We define the error metrics used in our comparisons formally below. The metrics are evaluated for certain spans [L, L + S) in the prediction range, where L is a lead time after the forecast start point.

5.2.1. ρ-risk metric

Following Seeger et al. (2016), we use ρ-risk metrics (quantile loss) that quantify the accuracy of a quantile ρ of the predictive distribution.

The aggregated target value of an item i in a span is denoted by Zi(L, S) = ∑_{t=t0+L}^{t0+L+S} zi,t. For a given quantile ρ ∈ (0, 1) we denote the predicted ρ-quantile for Zi(L, S) by Ẑi^ρ(L, S). We obtain such a quantile prediction from a set of sample paths by first summing each realization in the given span. The samples of these sums then represent the estimated distribution for Zi(L, S), and we can take the ρ-quantile from the empirical distribution.

The ρ-quantile loss is then defined as

Lρ(Z, Ẑ^ρ) = 2(Ẑ^ρ − Z)(ρ I_{Ẑ^ρ > Z} − (1 − ρ) I_{Ẑ^ρ ≤ Z}).

We summarize the quantile losses for a given span across all items by considering a normalized sum of quantile losses (∑i Lρ(Zi, Ẑi^ρ)) / (∑i Zi), which we call the ρ-risk.

5.2.2. ND and RMSE metrics

The ND and RMSE metrics are defined as

ND = (∑_{i,t} |zi,t − ẑi,t|) / (∑_{i,t} |zi,t|) and

RMSE = √( (1/(N(T − t0))) ∑_{i,t} (zi,t − ẑi,t)² ) / ( (1/(N(T − t0))) ∑_{i,t} |zi,t| ),

where ẑi,t is the predicted median value for item i at time t and the sums are over all items and all time points in the prediction period.

5.2.3. Results

Table 2 shows the 0.5-risk and 0.9-risk for different lead times and spans. Here, all(K) denotes the average risk of the marginals [L, L + 1) for L < K. We normalize all reported metrics with respect to the strongest previously-published method (baseline). DeepAR outperforms all other methods on these datasets. The results also show the importance of modeling these datasets using a count distribution, as rnn-gaussian leads to worse accuracies. The ec and ec-sub datasets exhibit the power-law behavior discussed above, and overall forecast accuracy is affected negatively by the absence of scaling and weighted sampling (rnn-negbin). On the parts dataset, which does not exhibit the power-law behavior, the performance of rnn-negbin is similar to that of DeepAR.

Table 3 compares the point forecast accuracies on the electricity and traffic datasets against that of the matrix factorization technique (MatFact) proposed by Yu et al. (2016). We consider the same metrics, namely the normalized deviation (ND) and normalized RMSE (NRMSE).

The results show that DeepAR outperforms MatFact on both datasets.

Table 3
Comparison with MatFact.
          electricity       traffic
          ND      RMSE      ND      RMSE
MatFact   0.16    1.15      0.20    0.43
DeepAR    0.07    1.00      0.17    0.42

5.3. Qualitative analysis

Fig. 5 shows example predictions from the ec dataset. Fig. 6 shows aggregate sums of different quantiles of the marginal predictive distribution for DeepAR and ISSM on the ec dataset. In contrast to ISSM models such as that of Seeger et al. (2016), where a linear growth of uncertainty is part of the modeling assumptions, the uncertainty growth pattern is learned from the data. In this case, the model does learn an overall growth of uncertainty over time. However, this is not simply linear growth: uncertainty (correctly) increases during Q4, and decreases again shortly afterwards.

Fig. 6. Uncertainty growth over time for the ISSM and DeepAR models. Unlike the ISSM, which postulates a linear growth of uncertainty, the behavior of the uncertainty is learned from the data, resulting in a non-linear growth with a (plausibly) higher uncertainty around Q4. The aggregate is calculated over the entire ec dataset.

The calibration of the forecast distribution is depicted in Fig. 7. Here, we show the Coverage(p) for each percentile p, which is defined as the fraction of time series in the dataset for which the p-percentile of the predictive distribution is larger than the true target. For a perfectly calibrated prediction, it holds that Coverage(p) = p, which corresponds to the diagonal. Overall, the calibration is improved compared to the ISSM model.

We assess the effect of modeling correlations in the output, i.e., how much they differ from independent distributions for each time point, by plotting the calibration curves for a shuffled forecast, where the realizations of the original forecast have been shuffled for each time point, destroying any correlation between time steps. For the short lead-time span (left), which consists of just one time point, this has no impact, because it is just the marginal distribution. However, for the longer lead-time span (right), destroying the correlation leads to a worse calibration, showing that important temporal correlations are captured between the time steps.

Fig. 7. Coverages for two spans of the ec-sub dataset. The left panel shows the coverage for a single time-step interval, while the right panel shows these metrics for a larger time interval with nine time-steps. When the samples for each time step are shuffled, the correlation in the prediction sample paths is destroyed and the forecast becomes less calibrated. This shuffled prediction also has a 10% higher 0.9-risk.

6. Conclusion

We have shown that forecasting approaches based on modern deep learning techniques can improve the forecast accuracy drastically relative to state-of-the-art forecasting methods on a wide variety of datasets. Our proposed DeepAR model is effective at learning a global model from related time series, can handle widely-varying scales through rescaling and velocity-based sampling, generates calibrated probabilistic forecasts with high accuracy, and is able to learn complex patterns such as seasonality and uncertainty growth over time from the data.

Interestingly, the method works on a wide variety of datasets with little or no hyperparameter tuning, and is applicable to medium-sized datasets that contain only a few hundred time series.

References

Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J., Januschowski, T., et al. (2019). GluonTS: probabilistic time series models in Python. ICML time series workshop, arXiv:1906.05264.

Bandara, K., Bergmeir, C., & Smyl, S. (2017). Forecasting across time series databases using long short-term memory networks on groups of similar series. arXiv preprint arXiv:1710.03222.

Ben Taieb, S., Taylor, J. W., & Hyndman, R. J. (2017). Coherent probabilistic forecasts for hierarchical time series. In Proceedings of the 34th international conference on machine learning.

Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in neural information processing systems (pp. 1171–1179).

Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 26(2), 211–252.

Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: forecasting and control. John Wiley & Sons.

Chapados, N. (2014). Effective Bayesian modeling of groups of related count time series. In Proceedings of the 31st international conference on machine learning (pp. 1395–1403).

Chen, H., & Boylan, J. E. (2008). Empirical evidence on individual, group and shrinkage seasonal indices. International Journal of Forecasting, 24(3), 525–534.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., et al. (2015). MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.

Croston, J. (1972). Forecasting and stock control for intermittent demands. Operational Research Quarterly, 23, 289–304.

Davydenko, A., & Fildes, R. (2013). Measuring forecasting accuracy: the case of judgmental adjustments to SKU-level demand forecasts. International Journal of Forecasting, 29(3), 510–522.

Díaz-Robles, L. A., Ortega, J. C., Fu, J. S., Reed, G. D., Chow, J. C., Watson, J. G., et al. (2008). A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile. Atmospheric Environment, 42(35), 8331–8340.

Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods: Vol. 38. OUP Oxford.

Faloutsos, C., Flunkert, V., Gasthaus, J., Januschowski, T., & Wang, Y. (2019). Forecasting big time series: theory and practice. In KDD '19, Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 3209–3210). New York, NY, USA: ACM.

Fildes, R., Nikolopoulos, K., Crone, S., & Syntetos, A. (2008). Forecasting and operational research: a review. The Journal of the Operational Research Society, 59(9), 1150–1172.

Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., et al. (2019). Probabilistic forecasting with spline quantile function RNNs. In The 22nd international conference on artificial intelligence and statistics (pp. 1901–1910).

Gers, F. A., Eck, D., & Schmidhuber, J. (2001). Applying LSTM to time series predictable through time-window approaches. In G. Dorffner (Ed.), Artificial neural networks – ICANN 2001 (Proceedings) (pp. 669–676). Springer.

Ghiassi, M., Saidane, H., & Zimbra, D. (2005). A dynamic artificial neural network model for forecasting time series events. International Journal of Forecasting, 21(2), 341–362.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning: Adaptive computation and machine learning. MIT Press.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Gutierrez, R. S., Solis, A. O., & Mukhopadhyay, S. (2008). Lumpy demand forecasting using neural networks. International Journal of Production Economics, 111(2), 409–420.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9), 2579–2589.

Hyndman, R. J., & Athanasopoulos, G. (2012). Forecasting: principles and practice. OTexts.

Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3), 1–22.

Hyndman, R., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Springer series in statistics, Forecasting with exponential smoothing: the state space approach. Springer.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456).

Kaastra, I., & Boyd, M. (1996). Designing a neural network for forecasting financial and economic time series. Neurocomputing, 10(3), 215–236.

Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kourentzes, N. (2013). Intermittent demand forecasts with neural networks. International Journal of Production Economics, 143(1), 198–206.

Laptev, N., Yosinski, J., Li Erran, L., & Smyl, S. (2017). Time-series extreme event forecasting with neural networks at Uber. ICML time series workshop.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: results, findings, conclusion and way forward. International Journal of Forecasting, 34(4), 802–808.

Mohammadipour, M., Boylan, J., & Syntetos, A. (2012). The application of product-group seasonal indexes to individual products. Foresight: The International Journal of Applied Forecasting, 26, 20–26.

Oliveira, M. R., & Torgo, L. (2014). Ensembles for time series forecasting. Journal of Machine Learning Research (JMLR).

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.

Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.

Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, Y., & Januschowski, T. (2018). Deep state space models for time series forecasting. In Advances in neural information processing systems (pp. 7785–7794).

Seeger, M. W., Salinas, D., & Flunkert, V. (2016). Bayesian intermittent demand forecasting for large inventories. In Advances in neural information processing systems (pp. 4646–4654).

Smyl, S., Ranganathan, J., & Pasqua, A. (2018). M4 forecasting competition: introducing a new hybrid ES-RNN model. https://eng.uber.com/m4-forecasting-competition.

Snyder, R. D., Ord, J., & Beaumont, A. (2012). Forecasting the intermittent demand for slow-moving inventories: a modelling approach. International Journal of Forecasting, 28(2), 485–496.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).

Timmermann, A. (2006). Forecast combinations. In G. Elliott, C. Granger, & A. Timmermann (Eds.), Handbook of economic forecasting: Vol. 1 (1st ed.) (pp. 135–196). Elsevier.

Toubeau, J.-F., Bottieau, J., Vallée, F., & De Grève, Z. (2018). Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets. IEEE Transactions on Power Systems, 34(2), 1203–1215.

Trapero, J. R., Kourentzes, N., & Fildes, R. (2015). On the identification of sales forecasting models in the presence of promotions. The Journal of the Operational Research Society, 66(2), 299–307.

Wen, R. W., Torkkola, K., & Narayanaswamy, B. (2017). A multi-horizon quantile recurrent forecaster. NIPS time series workshop.

Yu, H.-F., Rao, N., & Dhillon, I. S. (2016). Temporal regularized matrix factorization for high-dimensional time series prediction. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 847–855). Curran Associates, Inc.

Zhang, G., Eddy Patuwo, B., & Hu, Y. M. (1998). Forecasting with artificial neural networks: the state of the art. International Journal of Forecasting, 14(1), 35–62.
