
Knowledge-Based Systems 281 (2023) 111079


Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Hybrid variational autoencoder for time series forecasting


Borui Cai a,∗, Shuiqiao Yang b, Longxiang Gao c, Yong Xiang a

a School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
b School of Computer Science and Engineering, University of New South Wales, Sydney, 2032, NSW, Australia
c Qilu University of Technology (Shandong Academy of Sciences), Jinan, China

ARTICLE INFO

Keywords: Time series forecasting; Variational autoencoder; Deep learning

ABSTRACT

Variational autoencoders (VAE) are powerful generative models that learn the latent representations of input data as random variables. Recent studies show that VAE can flexibly learn the complex temporal dynamics of time series and achieve more promising forecasting results than deterministic models. However, a major limitation of existing works is that they fail to jointly learn the local patterns (e.g., seasonality and trend) and temporal dynamics of time series for forecasting. Accordingly, we propose a novel hybrid variational autoencoder (HyVAE) to integrate the learning of local patterns and temporal dynamics by variational inference for time series forecasting. Experimental results on four real-world datasets show that the proposed HyVAE achieves better forecasting results than various counterpart methods, as well as two HyVAE variants that only learn the local patterns or temporal dynamics of time series, respectively.

1. Introduction

Time series forecasting aims at learning the generation process of time series and uses previously observed samples to predict future values [1]. Accurate forecasting is essential and can help with the success of many applications/businesses. For example, an electricity company can design effective energy policies in advance by predicting future energy consumption [2]; a corporation can minimize its investment risk if future stock prices are accurately predicted [3]. Time series forecasting has been studied in the literature for decades, but to date, it remains a challenging and active research problem due to the complexity of time series. Classical time series forecasting methods, including autoregressive models (AR), moving average models (MA), and autoregressive integrated moving average models (ARIMA) [4], predict future values by assuming they have linear relationships with observed values; however, this simplification normally leads to unsatisfactory results for complex real-world time series. With the booming of deep learning techniques, deep neural networks (DNN) are widely used to tackle time series forecasting problems. Unlike classical models, DNNs are flexible non-linear models that can capture the temporal information of time series for forecasting [5].

Convolutional neural networks (CNN) [6] and recurrent neural networks (RNN) [7] are two types of DNN widely adopted for time series forecasting. CNN captures salient local patterns of short time series subsequences/segments (e.g., seasonality [8] and trend [9]), while RNN learns long-term or mid-term temporal dynamics/dependencies of the entire time series [10]. In fact, many works capture both types of temporal information by proposing hybrid DNN models and obtain more accurate forecasting results [9,11]. For example, the researchers in [12] adopt a hybrid neural network, which stacks CNN with RNN, for DNA sequence prediction. Specifically, the CNN captures short and recurring sequence motifs, which represent biological function units in a DNA sequence. An RNN, i.e., a long short-term memory (LSTM) network [10], is stacked on the output of the CNN to learn the spatial arrangement of these motifs.

However, these DNN-based models cannot capture temporal information from time series with high accuracy since they are sensitive to small perturbations on time series [13]. Recent works refer to the variational autoencoder (VAE) [14], a type of deep generative model, to learn representations of time series as latent random variables and obtain improved results [15]. Compared with directly fitting the exact values of time series, the latent random variables learned by VAE represent the generation process of time series and thus can more accurately capture essential temporal information [16]. Based on this, existing methods learn either local seasonal-trend patterns [17] or temporal dynamics [15]; but to date, there is no VAE model that jointly captures both types of information for time series forecasting.

∗ Corresponding author.
E-mail addresses: [email protected] (B. Cai), [email protected] (S. Yang), [email protected] (L. Gao), [email protected] (Y. Xiang).

https://doi.org/10.1016/j.knosys.2023.111079
Received 13 March 2023; Received in revised form 4 September 2023; Accepted 13 October 2023
Available online 20 October 2023
0950-7051/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

In this paper, we bridge this gap by proposing a novel hybrid variational autoencoder (HyVAE) method for time series forecasting. HyVAE follows variational inference [14] to jointly learn local patterns and temporal dynamics of time series. To achieve this goal, HyVAE is designed based on two objectives: (1) capturing local patterns by encoding time series subsequences into latent representations; and (2) learning temporal dynamics through the temporal dependencies among latent representations of different time series subsequences. HyVAE integrates the two objectives following variational inference. Extensive experiments conducted on four real-world time series datasets show that HyVAE can improve the time series forecasting accuracy over strong counterpart methods. The contributions of this paper are summarized as follows:

– We propose a novel hybrid variational autoencoder (HyVAE) for time series forecasting. HyVAE derives an objective following variational inference to integrate the learning of local patterns and temporal dynamics of time series, thereby improving the accuracy of forecasting.
– We conduct comprehensive experiments on four real-world datasets to demonstrate the effectiveness of the proposed HyVAE method, and the results show that HyVAE achieves better forecasting accuracy than strong counterpart methods.

The rest of this paper is organized as follows. The related works are reviewed in Section 2. The preliminary knowledge is introduced in Section 3. The proposed method is detailed in Section 4, and is evaluated in Section 5. The paper is summarized in Section 6.

2. Related work

In this section, we briefly review time series forecasting methods and VAE-related forecasting approaches.

2.1. Time series forecasting

The classical autoregressive model (AR) predicts by the linear aggregation of past time series values and a stochastic term (e.g., white noise). ARIMA extends AR to non-stationary time series by incorporating moving average (MA) and differencing. Other statistical models, such as linear regression [18] and support vector regression [19], enhance the model capacity but still have limited expressiveness. DNNs are flexible non-linear models and have been widely used for time series forecasting in recent years. Specifically, RNNs memorize historical information with feedback loops and can conveniently learn the temporal dynamics of time series. The long short-term memory network (LSTM) [10] is a typical RNN that alleviates gradient vanishing with forget gates and thereby enables the learning of long-term temporal dynamics of time series. Other types of RNN, e.g., GRU [20], and Informer [1], which uses the attention mechanism [21], are also used to improve the effectiveness of different forecasting scenarios. In addition, CNNs [22] are further used to capture local patterns of time series (such as seasonality [8] and trends [9]). Many works stack CNN and RNN to learn both the local patterns and the temporal dynamics for challenging forecasting problems; for example, combining multi-layer one-dimensional CNNs with bi-directional LSTM for air quality forecasting [9] and DNA sequence forecasting [12], or integrating a Savitzky–Golay filter (to reduce noise) and a stacked TCN-LSTM for traffic forecasting [11]. In addition, Transformers [23] and GNNs are also adopted for forecasting. The Transformer [23] better captures complicated periodic dynamics [24] or resolves over-stationarization [25] with its flexible multi-head attention module; however, recent work [26] also suggests that its position encoding incurs losses of temporal information. Meanwhile, GCNs are applied on specific graph representations, e.g., the time-conditioned graph structures in Z-GCNETs [27] (by introducing time-aware zigzag persistence), for robust time series forecasting.

2.2. Variational autoencoder-based forecasting

The variational autoencoder (VAE) [14] is a powerful deep generative model that encodes the input data as latent random variables instead of deterministic values. To enhance the flexibility of VAE (which learns independent latent random variables), follow-up methods introduce extra dependencies among the latent random variables. For example, the ladder variational autoencoder [28] specifies a top-down hierarchical dependency among the latent random variables, the fully-connected variational autoencoder [29] includes all possible dependencies among variables, and the graph variational autoencoder [30] automatically learns an acyclic dependency graph. Due to this high flexibility, VAE has been introduced to time series forecasting [31]. To improve the performance of the vanilla VAE, VRNN [15] introduces an RNN as the backbone to capture the long-term temporal dynamics of time series. LaST [17] develops a disentangled VAE to learn dissociated seasonality and trend patterns of time series for forecasting. The proposed HyVAE differs from existing methods as it integrates the learning of both local patterns and temporal dynamics for time series forecasting.

3. Preliminaries

In this section, we first define the problem and then introduce the preliminary knowledge of VAE.

3.1. Notation and problem statement

A scalar is denoted as a lowercase character, a vector as a bold lowercase character, and a matrix as an uppercase character. A time series is denoted as 𝒔 = {s_1, s_2, …, s_m, s_{m+1}, …, s_{m+n}}; the time series forecasting problem is defined as determining {s_{m+1}, …, s_{m+n}} with known {s_1, s_2, …, s_m}, where n is the step of forecasting. For convenience, we denote 𝒚 = {s_{m+1}, …, s_{m+n}}, and the forecasting problem can be formulated as 𝒚̂ = f(s_1, s_2, …, s_m), where 𝒚̂ is the predicted values for 𝒚. The error of forecasting is measured as follows:

Err(𝒚, 𝒚̂) = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²,  (1)

A time series subsequence is denoted as 𝒙^t = {s_t, …, s_{t+l−1}}, where l is its length. A subsequence contains contextual information that expresses local patterns [9], and thus we use subsequences in the forecasting task. Following [32], we obtain a series of length-l subsequences from the time series using a sliding window. The time series represented by subsequences is denoted as {𝒙^1, …, 𝒙^T}, where T = m − l + 1 is the number of its subsequences. Thus, the forecasting problem becomes 𝒚̂ = f(𝒙^{≤T}) = f(𝒙^1, …, 𝒙^T) (see Table 1).

3.2. Variational autoencoder

The variational autoencoder (VAE) [14] is an unsupervised generative learning model that learns the latent representation of the input data as random variables. Similar to the conventional autoencoder [33], VAE has an encoding process that encodes the input into latent representations, and a decoding process that reconstructs the original input from the learned representations. We show the process of VAE in Fig. 1.

VAE learns the generative model as p(𝒙, 𝒛) = p(𝒙|𝒛)p(𝒛), where 𝒙 is the input data and 𝒛 is its latent representation. The prior of 𝒛, p(𝒛), is normally defined as a multivariate Gaussian distribution, i.e., 𝒛 ∼ 𝒩(𝟎, I); we denote that as p(𝒛) = 𝒩(𝒛|𝟎, I) for convenience. The posterior p(𝒛|𝒙) can normally be an arbitrary non-linear, non-Gaussian distribution and thus is intractable. To resolve that, VAE approximates the posterior with q(𝒛|𝒙) = 𝒩(𝒛|𝝁(𝒙), 𝝈(𝒙)), where the mean and variance are determined by 𝒙. Then, VAE defines the learning problem as the maximum likelihood estimation of log p(𝒙), which can be formulated as:

log p(𝒙) = KL(q(𝒛|𝒙)||p(𝒛|𝒙)) + 𝓁,  (2)


Table 1
Summary of notations.

Notation — Description
𝒔 — time series
𝒚 — ground truth future values, {s_{m+1}, …, s_{m+n}}
𝒚̂ — predicted future values
𝒙^t — time series subsequence, {s_t, …, s_{t+l−1}}
𝒛 — latent representations learnt with VAE
𝒉 — hidden states learned with RNN (i.e., GRU)
L — ladder size for subsequence encoding
𝒩(𝝁, 𝝈) — Gaussian distribution
KL(q(x)||p(x)) — KL divergence from q(x) to p(x)

Fig. 1. The framework of VAE. VAE encodes the input (𝒙) into latent random variables (as Gaussian distributions). Then, 𝒛 is sampled from the distribution of the latent random variables to reconstruct the input (𝒙̂).

Fig. 2. The encoder (inference process) and the decoder (generative process) of subsequence encoding.

where the first term is the KL divergence between the approximated posterior and the true posterior. Specifically, the KL divergence of two distributions q(𝒙) and p(𝒙) measures the difference between them and is defined as:

KL(q(𝒙)||p(𝒙)) = ∑_𝒙 q(𝒙) log(q(𝒙)/p(𝒙)) = E_{q(𝒙)} log(q(𝒙)/p(𝒙)).  (3)

In Eq. (2), since p(𝒛|𝒙) is intractable and the KL divergence is non-negative, maximizing log p(𝒙) is achieved by maximizing 𝓁, which is the evidence lower bound (ELBO) of log p(𝒙), defined as follows:

𝓁 = E_{q(𝒛|𝒙)} log p(𝒙|𝒛) − KL(q(𝒛|𝒙)||p(𝒛)),  (4)

The first term in 𝓁 maximizes the conditional probability of 𝒙 given the latent representation 𝒛 and can be seen as the reconstruction loss, while the second term minimizes the difference between the prior and the approximated posterior.

4. The proposed method

In this section, we first provide an overview of the proposed hybrid variational autoencoder (HyVAE) method and then elaborate on its details.

4.1. Overview of HyVAE

We propose a novel generative hybrid variational autoencoder (HyVAE) model for time series forecasting, inspired by existing hybrid deterministic deep neural models. HyVAE jointly learns the local patterns from time series subsequences and the temporal dynamics among time series subsequences. To achieve that, HyVAE is derived based on variational inference to integrate two processes: (1) the encoding of time series subsequences, which captures local patterns; and (2) the encoding of the entire time series, which learns temporal dynamics among time series subsequences. In the following, we separately detail the encoding of time series subsequences and the encoding of the entire time series, and then explain the integration of these two processes for time series forecasting.

4.2. Encoding of time series subsequence

As discussed in Section 1, many existing models have shown that learning the local patterns can effectively improve time series forecasting [9]. To capture the flexible local patterns, we encode time series subsequences as latent random variables, rather than deterministic values.

An intuitive choice of the encoder is the conventional VAE, which maps a time series subsequence (𝒙^t) into latent random variables (𝒛^t) to learn the local patterns. In VAE, 𝒛^t are assumed to be independent variables as p(𝒛^t) = 𝒩(𝒛^t|𝝁^t, 𝝈^t), where 𝝈^t is the diagonal covariance matrix. However, such a simplified assumption may lead to inaccurate local pattern learning since essential causal information is neglected. In particular, a sample in a time series is always highly affected by its previous samples (e.g., autoregressive). Therefore, we further enforce causal dependency among the variables in 𝒛^t to capture causal information within local patterns.

We separate the latent random variables in 𝒛^t into L ladders (groups) (L is the ladder size) [28], i.e., {𝒛^t_1, …, 𝒛^t_L}, and the groups have a sequential causal dependency (from 1 to L). We illustrate the encoding and decoding process of subsequence encoding in Fig. 2, in which the top row is the encoding process and the bottom row is the decoding process. For the convenience of implementation, we adopt the same causal dependency among the latent random variables (𝒛^t_1 → ⋯ → 𝒛^t_L) in the encoding and decoding processes. Based on this, the prior distribution of 𝒛^t can be factorized as:

p(𝒛^t) = p(𝒛^t_L) ∏_{i=1}^{L−1} p(𝒛^t_i|𝒛^t_{i+1}),
p(𝒛^t_i|𝒛^t_{i+1}) = 𝒩(𝒛^t_i|𝝁^t_i(𝒛^t_{i+1}), 𝝈^t_i(𝒛^t_{i+1})),  (5)

where {𝝁(⋆), 𝝈(⋆)} = 𝜑(⋆) and we implement 𝜑(⋆) as a multilayer perceptron (MLP). By changing the size of the dependency (L, the ladder size), we can regulate how well causal information is preserved; no causal information is preserved when L = 1 (i.e., all latent random variables are independent). Based on this, the generative model of subsequence encoding can further be factorized as follows:

p(𝒙^t, 𝒛^t) = p(𝒙^t|𝒛^t_1) p(𝒛^t_L) ∏_{i=1}^{L−1} p(𝒛^t_i|𝒛^t_{i+1}),  (6)

where p(𝒙^t|𝒛^t_1) = 𝒩(𝒙^t|𝝁^t(𝒛^t_1), 𝝈^t(𝒛^t_1)). This causal dependency ensures the latent random variables have sufficient flexibility to model the complex local patterns of subsequences. Since the posterior p(𝒛^t|𝒙^t) is intractable, q(𝒛^t|𝒙^t) is used as an approximation. Meanwhile, to avoid {𝒛^t_L, …, 𝒛^t_1} converging to arbitrary variables, they all depend on 𝒙^t in the inference model, similar to [28], as follows:

q(𝒛^t|𝒙^t) = q(𝒛^t_L|𝒙^t) ∏_{i=1}^{L−1} q(𝒛^t_i|𝒛^t_{i+1}, 𝒙^t),
q(𝒛^t_L|𝒙^t) = 𝒩(𝒛^t_L|𝝁^t_L(𝒙^t), 𝝈^t_L(𝒙^t)),
q(𝒛^t_i|𝒛^t_{i+1}, 𝒙^t) = 𝒩(𝒛^t_i|𝝁^t_i(𝒛^t_{i+1}, 𝒙^t), 𝝈^t_i(𝒛^t_{i+1}, 𝒙^t)),  (7)

where {𝝁(⋆, ⋆), 𝝈(⋆, ⋆)} = 𝜑([⋆; ⋆]) and [;] is the concatenation operation.

4.3. Encoding of entire time series

From the global perspective, we encode all time series subsequences {𝒙^1, …, 𝒙^T} as {𝒛^1, …, 𝒛^T} to learn the temporal dynamics of the entire time series. Since time series subsequences are normally not independent across different time stamps, we first impose a temporal dependency for consecutive subsequences (e.g., p(𝒛^t, 𝒛^{t−1}) = p(𝒛^t|𝒛^{t−1})p(𝒛^{t−1})). In addition, we capture long-term temporal dependency with other


subsequences via the hidden states of a recurrent neural network, i.e., a gated recurrent unit (GRU) [20]. Therefore, p(𝒛^t|𝒛^{<t}) can be derived as follows:

p(𝒛^t|𝒛^{<t}) = p(𝒛^t|𝒛^{t−1}, 𝒉^{t−1}),
p(𝒛^t|𝒛^{t−1}, 𝒉^{t−1}) = 𝒩(𝒛^t|𝝁^t(𝒛^{t−1}, 𝒉^{t−1}), 𝝈^t(𝒛^{t−1}, 𝒉^{t−1})),  (8)

where 𝒉 is the hidden state and is obtained by:

𝒉^t = GRU(𝒉^{t−1}, 𝒙^t).  (9)

GRU(∗) is the calculation of hidden states in a GRU unit. GRU adopts gates and memory cells and alleviates the gradient vanishing problem, while being easier to train than LSTM due to the fewer gates used. The structure of GRU is formulated as follows:

𝒓^t = 𝜎(W_r [𝒉^{t−1}; 𝒙^t]),
𝜻^t = 𝜎(W_𝜁 [𝒉^{t−1}; 𝒙^t]),
𝒉̃^t = tanh(W_h̃ [𝒓^t ∘ 𝒉^{t−1}; 𝒙^t]),
𝒉^t = (𝟏 − 𝜻^t) ∘ 𝒉^{t−1} + 𝜻^t ∘ 𝒉̃^t,  (10)

where ∘ is the element-wise product. Specifically, 𝒓^t and 𝜻^t are the reset gate vector and the update gate vector, which decide how much past information needs to be forgotten/preserved, respectively. Meanwhile, 𝒉̃^t is the candidate activation vector that memorizes the past information, and 𝒉^t is obtained as the balanced sum of the short-term (𝒉^{t−1}) memory and the long-term (𝒉̃^t) memory.

For the generative process p(𝒙^{≤T}, 𝒛^{≤T}), we explicitly simplify p(𝒙^t|𝒛^{≤t}) as p(𝒙^t|𝒛^t) to ensure the local pattern of 𝒙^t is mainly preserved in 𝒛^t; this simplification also largely reduces the complexity of the reconstruction/decoding process. Based on Eq. (8), the generation model can be factorized as follows:

p(𝒙^{≤T}, 𝒛^{≤T}) = ∏_{t=1}^{T} p(𝒙^t|𝒛^t, 𝒙^{<t}) p(𝒛^t|𝒙^{<t}, 𝒛^{t−1}),  (11)

p(𝒙^t|𝒛^t, 𝒙^{<t}) can also be denoted as p(𝒙^t|𝒛^t, 𝒉^{t−1}), due to the recursive nature of GRU, which requires 𝒉^{t−1} to be obtained by the recursive calculation with 𝒙^{<t}.

Similarly, we derive the inference model as:

q(𝒛^t|𝒙^{≤t}, 𝒛^{t−1}) = 𝒩(𝒛^t|𝝁^t(𝒙^{≤t}, 𝒛^{t−1}), 𝝈^t(𝒙^{≤t}, 𝒛^{t−1}))
= 𝒩(𝒛^t|𝝁^t(𝒉^{t−1}, 𝒙^t, 𝒛^{t−1}), 𝝈^t(𝒉^{t−1}, 𝒙^t, 𝒛^{t−1})).  (12)

The above approximated posterior of 𝒛^t captures the long-term dynamics carried by 𝒙^{<t} (𝒉^{t−1}), the neighboring dependency with 𝒛^{t−1}, and the corresponding subsequence 𝒙^t.

4.4. Integration and joint learning

Based on the encoding of a subsequence and the encoding of the entire time series (represented as subsequences) discussed above, we now integrate them into one HyVAE model, which can jointly learn the local patterns and temporal dynamics for time series forecasting. The jointly learned latent random variables for both the time series subsequences and the entire time series are denoted as {(𝒛^1_L, …, 𝒛^1_1), …, (𝒛^T_L, …, 𝒛^T_1)}, with respect to time series {𝒙^1, …, 𝒙^T}, and the encoding process is illustrated in Fig. 3.

Fig. 3. The illustration of HyVAE; (a) shows the prior defined by Eq. (13); (b) is the recurrent updating of GRU hidden states in Eq. (9); (c) shows the inference operation in Eq. (14); and (d) represents the generation operation in Eq. (15).

By combining the prior of subsequence encoding in Eq. (5) and the prior of entire time series encoding in Eq. (8), we obtain the prior of HyVAE, which is factorized as follows:

p(𝒛^t|𝒛^{t−1}, 𝒉^{t−1}) = p(𝒛^t_L|𝒛^{t−1}_1, 𝒉^{t−1}) ∏_{i=1}^{L−1} p(𝒛^t_i|𝒛^t_{i+1}),
p(𝒛^t_L|𝒛^{t−1}_1, 𝒉^{t−1}) = 𝒩(𝒛^t_L|𝝁^t(𝒛^{t−1}_1, 𝒉^{t−1}), 𝝈^t(𝒛^{t−1}_1, 𝒉^{t−1})),
p(𝒛^t_i|𝒛^t_{i+1}) = 𝒩(𝒛^t_i|𝝁^t_i(𝒛^t_{i+1}), 𝝈^t_i(𝒛^t_{i+1})).  (13)

As shown in Fig. 3(a), the prior of HyVAE integrates the long-term temporal dynamics by affecting the first latent random variable of each subsequence (e.g., 𝒛^t_L) with the hidden states (e.g., 𝒉^{t−1}, generated by GRU) of its precedent subsequence. Meanwhile, 𝒉^t is obtained by the recurrence process with GRU, as shown in Fig. 3(b). We then obtain the inference model of HyVAE by integrating Eqs. (7) and (12) as follows (Fig. 3(c)):

q(𝒛^t|𝒙^{≤t}, 𝒛^{t−1}) = q(𝒛^t_L|𝒙^{≤t}, 𝒛^{t−1}_1) ∏_{i=1}^{L−1} q(𝒛^t_i|𝒛^t_{i+1}, 𝒙^t),
q(𝒛^t_L|𝒙^{≤t}, 𝒛^{t−1}_1) = 𝒩(𝒛^t_L|𝝁^t(𝒛^{t−1}_1, 𝒙^{≤t}), 𝝈^t(𝒛^{t−1}_1, 𝒙^{≤t})),
q(𝒛^t_i|𝒛^t_{i+1}, 𝒙^t) = 𝒩(𝒛^t_i|𝝁^t_i(𝒛^t_{i+1}, 𝒙^t), 𝝈^t_i(𝒛^t_{i+1}, 𝒙^t)).  (14)

Similarly, for 𝒙^t, the above encoding process also includes the temporal dynamics carried by 𝒉^{t−1} during the encoding of 𝒛^t_L, while the rest of the latent random variables of 𝒛^t only learn from 𝒙^t. Then, as shown in Fig. 3(d), the generation model of HyVAE is obtained by combining Eqs. (6) and (11) as follows:

p(𝒙^t|𝒛^t, 𝒙^{<t}) = p(𝒙^t|𝒛^t_1, 𝒉^{t−1})
= 𝒩(𝒙^t|𝝁^t(𝒛^t_1, 𝒉^{t−1}), 𝝈^t(𝒛^t_1, 𝒉^{t−1})).  (15)

Following the derivation process of variational inference, with Eqs. (13) to (15), HyVAE learns the latent representations by maximizing its ELBO, defined as follows:

𝓁_enc = ∑_{t=1}^{T} { E_{q(𝒛^t|𝒙^{≤t}, 𝒛^{t−1})} log p(𝒙^t|𝒉^{t−1}, 𝒛^t_1)
− ∑_{i=1}^{L−1} KL(q(𝒛^t_i|𝒛^t_{i+1}, 𝒙^t)||p(𝒛^t_i|𝒛^t_{i+1}))
− KL(q(𝒛^t_L|𝒙^{≤t}, 𝒛^{t−1}_1)||p(𝒛^t_L|𝒙^{<t}, 𝒛^{<t}_1)) }.  (16)

The first term in 𝓁_enc implies the reconstruction loss of HyVAE for each time series subsequence, i.e., between the input 𝒙^t and the 𝒙̂^t reconstructed with (𝒛^t_1, 𝒉^{t−1}) (see Fig. 3(d)). The second and third terms are regularization terms that enforce the encoded latent random variables to jointly capture the local patterns of individual subsequences and learn the temporal dynamics of the entire time series. The expectation in 𝓁_enc is approximated by Monte Carlo estimation [34] and is estimated with the average of the 𝓁_enc of each sample time series.

We use 𝒉^t and 𝒛^t for the final time series forecasting, i.e., 𝒚̂ = 𝜓(𝒉^t, 𝒛^t), where 𝜓(∗) is a single-layer fully-connected neural network. The forecasting loss is measured by Eq. (1):

𝓁_pred = Err(𝒚, 𝒚̂).  (17)
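To make the recurrence in Eqs. (9)–(10), the reparameterized sampling of 𝒛^t, and the forecasting head 𝒚̂ = 𝜓(𝒉^t, 𝒛^t) concrete, the following is a minimal NumPy sketch. The dimensions, random initializations, and the single-matrix stand-ins for the encoder 𝜑 and head 𝜓 are illustrative assumptions for exposition, not the authors' implementation (which trains these networks end-to-end).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_cell(h_prev, x, W_r, W_z, W_h):
    # One GRU step following Eq. (10); biases omitted for brevity.
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                        # reset gate r^t
    z = sigmoid(W_z @ hx)                        # update gate zeta^t
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    return (1.0 - z) * h_prev + z * h_tilde      # balanced sum of memories

def reparameterize(mu, log_var):
    # z = mu + sigma * eps: the reparameterization trick [14].
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

# Toy sizes (assumptions): subsequence length l, hidden/latent size d,
# T subsequences, forecasting step n.
l, d, n, T = 4, 8, 2, 5
W_r, W_z, W_h = [0.1 * rng.standard_normal((d, d + l)) for _ in range(3)]
W_enc = 0.1 * rng.standard_normal((2 * d, d + l))  # stand-in encoder phi
W_psi = 0.1 * rng.standard_normal((n, 2 * d))      # prediction head psi

subsequences = [rng.standard_normal(l) for _ in range(T)]
h = np.zeros(d)
for x in subsequences:
    # Encode z^t conditioned on (h^{t-1}, x^t), then update h by Eq. (9).
    stats = W_enc @ np.concatenate([h, x])         # [mu; log_var]
    z = reparameterize(stats[:d], stats[d:])
    h = gru_cell(h, x, W_r, W_z, W_h)

y_hat = W_psi @ np.concatenate([h, z])             # y_hat = psi(h^T, z^T)
y = rng.standard_normal(n)
loss_pred = np.mean((y - y_hat) ** 2)              # Eq. (1)/(17)
```

Because each hidden state is a convex combination of the previous state and a tanh activation, every coordinate of 𝒉^t stays in (−1, 1), which is one reason the recurrence trains stably.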


Table 2
Statistics of the datasets.

Dataset      Train  Valid  Test  Description
Parking      2856   357    358   Car park occupancy
Stock        1081   135    136   NASDAQ stock index
Electricity  1120   140    140   Electricity load values
Sealevel     1120   140    140   Sea level pressure

Then, the overall loss minimizes the negative ELBO of HyVAE and the forecasting loss as follows:

𝓁 = −𝓁_enc + 𝓁_pred.  (18)

In 𝓁, 𝓁_enc aims at learning representations that capture the latent distribution of the time series, while 𝓁_pred can be regarded as a regularization term that ensures the latent representations provide insights for accurate forecasting. We use Adam [35] for the optimization and the reparameterization trick [14] for model training. For 𝓁_enc, we adopt the warm-up scheme [28] during the implementation to avoid inactive latent random variables caused by the variational regularization.

5. Evaluation

In this section, we first introduce the real-world datasets used to evaluate the proposed method. Then, we explain the accuracy metrics for time series forecasting and briefly describe the counterpart methods. Finally, we analyze the results and compare HyVAE with the counterpart methods regarding the effectiveness of time series forecasting. All the experiments are implemented with Python 3.7 and run on a Linux platform with a 2.6 GHz CPU and 132 GB RAM.

5.1. Datasets

We select four datasets widely used for time series forecasting. The Parking Birmingham dataset [36] is collected from car parks in Birmingham; it regularly records the total occupancy of all available parking spaces between October 4, 2016, and December 19, 2016. We down-sample the recording frequency to every 5 h, which results in 3571 records. The NASDAQ stock dataset [37] consists of the stock prices of 104 corporations together with the overall NASDAQ100 index, collected from July 26, 2016, to December 22, 2016. We use the NASDAQ100 index for forecasting and down-sample the records to every 30 min, which results in 1352 records. The other two datasets¹ record the electricity load values of Poland from the 1990s and the monthly Darwin sea level pressures from 1882 to 1998, respectively; both datasets contain 1400 records. We preprocess each dataset with Min-Max normalization:

s′_i = (s_i − min(𝒔)) / (max(𝒔) − min(𝒔)).  (19)

Then, each dataset is split into a training set, a validation set, and a test set by {80%, 10%, 10%}. The number of known time series values used for forecasting is fixed at 50 for all datasets. The statistics of the datasets are shown in Table 2.

5.2. Performance metrics

We use three metrics widely used for time series forecasting [9,15] in the evaluation: mean square error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). MSE and MAE respectively measure the variance and average of the residuals of the forecasting results to the ground truth and are respectively defined as follows:

MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²,  MAE = (1/n) ∑_{i=1}^{n} |y_i − ŷ_i|.  (20)

MAPE measures the proportion of the forecasting deviation to the ground truth as follows:

MAPE = (1/n) ∑_{i=1}^{n} |(y_i − ŷ_i) / max(𝜖, y_i)|,  (21)

where 𝜖 is an arbitrarily small positive value that keeps the division well-defined.

5.3. Counterpart methods

We select three types of counterpart methods to compare with the proposed method: classical statistical models, deterministic DNN-based methods, and VAE-based methods. The classical models include the widely used AR, ARIMA, and SVR. For deterministic DNN-based methods, we choose LSTM and Informer and implement a stacked CNN and LSTM model (CNN+LSTM) following [9] for time series forecasting. For the VAE-based methods, other than the vanilla VAE, we adopt VRNN [15] and LaST [17]. We brief these methods as follows:

– AR forecasts with the weighted sum of past values. ARIMA incorporates moving average and differencing into AR for non-stationary time series.
– SVR [19] is based on the support vector machine (SVM) and the principle of structural risk minimization.
– LSTM [10] is an RNN model that can learn long-term dynamics with its forget gates.
– Informer [23] uses multi-head attention with position encoding to learn the latent structure of time series for forecasting.
– CNN+LSTM [9] stacks CNN and LSTM for accurate air quality forecasting. CNN+LSTM includes three TCN layers and two bi-LSTM layers.
– Vanilla VAE [14] is the basic variational autoencoder that learns latent representations as independent Gaussian random variables.
– VRNN [15] extends VAE to be capable of learning temporal dynamics by introducing temporal dependency among the latent representations.
– LaST [17] adopts a disentangled variational autoencoder to capture seasonality and trend, with auxiliary objectives to ensure dissociated representations.
– GBT [38] decouples Transformer-based forecasting into an auto-regressive stage and a self-regression stage to overcome the severe over-fitting problem.
– N-HiTS [39] incorporates hierarchical interpolation and multi-rate data sampling to address the volatility of predictions.

5.4. Experiment setup

In all the experiments, we use the validation sets to tune optimal parameters and use the test sets for forecasting accuracy measurement. For AR, we search the optimal number of lags (past time series values) from 1 to 10, and use the same strategy to search the optimal p (the number of past observations) and q (the size of the moving average window) for ARIMA, with the optimal differencing degree searched from 0 to 3. For SVR, we adopt the radial basis function (RBF) kernel, with its parameter C (regularization parameter) searched from {1, 10, 100, 1000} and 𝛾 (kernel coefficient) searched from {0.00005, 0.0005, 0.005, 0.05}.

For LSTM, Informer, CNN+LSTM, GBT, N-HiTS, vanilla VAE, VRNN, LaST, and HyVAE, we search the optimal batch size from {32, 64, 128} and set the maximum iteration to 100 epochs. The learning rate

¹ https://research.cs.aalto.fi/aml/datasets.shtml


Table 3
Time series forecasting results on the datasets, with the best displayed in bold.
Methods      Parking                 Stock                   Electricity             Sealevel
             MSE      MAE    MAPE   MSE      MAE    MAPE    MSE      MAE    MAPE    MSE      MAE    MAPE
             (×10−2)                (×10−2)                 (×10−2)                 (×10−2)

AR 1.043 0.085 0.474 1.876 0.114 0.146 1.093 0.086 0.336 1.481 0.095 0.187
ARIMA 0.637 0.066 0.289 0.880 0.080 0.100 0.743 0.056 0.249 1.613 0.107 0.204
SVR 1.077 0.082 0.288 0.606 0.075 0.092 0.563 0.057 0.254 1.003 0.079 0.161
LSTM 0.571 0.057 0.249 0.557 0.068 0.078 0.321 0.036 0.211 0.801 0.068 0.139
Informer 0.425 0.051 0.224 0.728 0.078 0.090 0.305 0.037 0.219 1.083 0.083 0.173
CNN+LSTM 0.397 0.046 0.200 0.254 0.043 0.049 0.149 0.023 0.210 0.667 0.062 0.127
GBT 0.400 0.046 0.200 0.420 0.056 0.061 0.117 0.017 0.178 0.760 0.072 0.138
N-HiTS 0.401 0.469 0.197 0.127 0.027 0.031 0.336 0.042 0.229 0.687 0.064 0.129
Vanilla VAE 0.713 0.067 0.312 15.142 0.373 0.754 5.445 0.200 0.457 4.386 0.178 0.332
VRNN 0.454 0.056 0.226 0.190 0.039 0.042 0.199 0.029 0.195 0.665 0.066 0.125
LaST 0.366 0.043 0.191 0.119 0.026 0.029 0.116 0.018 0.164 0.674 0.064 0.128
HyVAE 0.133 0.028 0.144 0.087 0.021 0.023 0.097 0.015 0.143 0.623 0.060 0.123
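The MSE, MAE, and MAPE values reported in Table 3 can be computed as in the following sketch; this is a generic illustration of the three metrics, not the authors' code, and the example arrays are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean square error: average squared deviation.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: average absolute deviation.
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean absolute percentage error: deviation relative to the truth
    # (assumes no true value is zero).
    return np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([1.0, 2.0, 4.0])  # hypothetical ground truth
y_pred = np.array([1.5, 2.0, 3.0])  # hypothetical forecasts
print(mae(y_true, y_pred))   # 0.5
print(mape(y_true, y_pred))  # 0.25
```

Since the series are Min-Max normalized to [0, 1] beforehand, the three metrics are comparable across datasets of different scales.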

Table 4
Multi-step forecasting results (MSE×10−2 ) on the datasets, with the best displayed in bold.
Methods      Parking              Stock                Electricity          Sealevel
Steps        3      4      5      3      4      5      3      4      5      3      4      5
LSTM 0.712 0.696 0.743 0.669 0.778 1.388 0.340 0.396 0.386 0.938 1.052 1.122
Informer 0.662 0.638 0.649 0.933 1.163 1.297 0.307 0.409 0.450 1.105 1.177 1.056
CNN+LSTM 0.635 0.643 0.636 0.524 0.710 0.822 0.163 0.156 0.197 0.890 0.959 1.032
GBT 0.640 0.661 0.682 0.722 0.848 0.951 0.140 0.153 0.201 0.958 1.023 1.247
N-HiTS 0.640 0.653 0.655 0.436 0.493 0.670 0.362 0.403 0.492 0.847 0.936 1.096
VRNN 0.614 0.700 0.760 0.557 0.653 0.895 0.232 0.437 0.458 0.870 1.257 1.322
LaST 0.442 0.533 0.574 0.425 0.463 0.617 0.136 0.138 0.189 0.819 0.892 1.021
HyVAE 0.446 0.466 0.502 0.367 0.365 0.431 0.123 0.137 0.151 0.787 0.858 0.931
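As a quick consistency check on the multi-step results, the 3-step to 5-step MSE increases on the Electricity dataset can be recomputed directly from the Table 4 rows (values ×10−2):

```python
# Electricity-column MSE (×10−2) at 3 and 5 forecasting steps, copied from Table 4.
electricity_mse = {
    "LSTM": (0.340, 0.386),
    "Informer": (0.307, 0.450),
    "CNN+LSTM": (0.163, 0.197),
    "GBT": (0.140, 0.201),
    "N-HiTS": (0.362, 0.492),
    "VRNN": (0.232, 0.458),
    "LaST": (0.136, 0.189),
    "HyVAE": (0.123, 0.151),
}

# MSE increase when going from 3-step to 5-step forecasting.
increase = {m: round(five - three, 3) for m, (three, five) in electricity_mse.items()}
print(increase["HyVAE"])  # 0.028, the smallest degradation among all methods
assert increase["HyVAE"] == min(increase.values())
```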

is searched from {0.001, 0.01, 0.1}. The dimensions of the LSTM/GRU
hidden states and latent representations are searched from {8, 16, 32,
64, 128}, and the number of layers is no more than 3. For HyVAE, the
ladder size and the subsequence length are searched from {2, 4, 6, 8,
10} and {10, 20, 30, 40}, respectively. We run each method 50 times and
report the average accuracy as the final results.

5.5. Main results

In this experiment, we compare the accuracy of HyVAE with the
counterpart methods on the four datasets, with respect to single-step
forecasting and multi-step (3, 4, and 5 steps) forecasting, respectively.

As shown in Table 3, HyVAE generally achieves the best performance among
all methods on the four datasets. Notably, on the Parking dataset, the
MSE achieved by HyVAE (0.133×10−2) is nearly three times smaller than
that of the second-best LaST (0.366×10−2). The smallest improvement over
the counterpart methods appears on the Sealevel dataset, where HyVAE
reduces the MSE, MAE, and MAPE of VRNN (the second best) by 6.3%, 9.1%,
and 1.6%, respectively. When further considering the type of the
counterpart methods, first, HyVAE achieves significant improvement over
the classical AR, ARIMA, and SVR methods, with nearly one order of
magnitude smaller MSE on the Parking and Stock datasets. Second,
compared with the deterministic DNN-based LSTM, Informer, CNN+LSTM, GBT,
and N-HiTS, HyVAE also shows significant improvement; on the Parking
dataset in particular, HyVAE achieves nearly two times smaller MSE, MAE,
and MAPE than the best-performing deterministic neural model (CNN+LSTM).
Although CNN+LSTM also considers both local patterns and temporal
dynamics of time series and outperforms LSTM and Informer on all the
datasets, HyVAE is consistently more effective and thus better at
capturing the complex structure of time series for forecasting. Third,
HyVAE achieves more accurate forecasting results than the other
VAE-based methods, which only learn part of the information of time
series: the vanilla VAE, which misses the temporal dynamics; VRNN, which
only learns the temporal dynamics; and LaST, which only learns
seasonality/trend patterns of time series. This observation shows the
effectiveness of HyVAE in learning both the local patterns and temporal
dynamics for time series forecasting.

For multi-step forecasting, Table 4 shows the MSE of LSTM, Informer,
CNN+LSTM, GBT, N-HiTS, VRNN, LaST, and HyVAE; AR, ARIMA, SVR, and
vanilla VAE are excluded due to low performance. The results show that
HyVAE achieves more accurate forecasts than the counterpart methods.
Although forecasting accuracy generally degrades with larger forecasting
steps (except on the Parking dataset), the accuracy of HyVAE degrades
much more slowly than that of the counterpart methods, since it captures
more informative patterns of time series. For example, from 3-step to
5-step forecasting on the Electricity dataset, the MSE of HyVAE only
increases by 0.028, while those of LSTM, Informer, CNN+LSTM, GBT,
N-HiTS, VRNN, and LaST increase by 0.046, 0.143, 0.034, 0.061, 0.130,
0.226, and 0.053, respectively. Meanwhile, CNN+LSTM and HyVAE in most
cases produce more accurate forecasts than the other deterministic
DNN-based methods and VAE-based methods, respectively, which again
supports the effectiveness of learning both local patterns and temporal
dynamics for time series forecasting.

In addition, we compare the variational inference-based methods (VRNN,
LaST, and HyVAE) on probabilistic forecasting, with the results measured
by the continuous ranked probability score (CRPS). With F denoting the
cumulative distribution function of the forecast distribution, CRPS is
defined as

    CRPS(F, x) = ∫_{−∞}^{∞} (F(y) − 1(y − x))² dy,

where 1 is the Heaviside step function. CRPS measures the similarity of
the forecast distribution to the true prediction and is minimized when
they are identical. The results in Table 5 show that on all four
datasets, HyVAE achieves more accurate probabilistic forecasts than
VRNN, which only learns the global temporal dynamics, and LaST, which
aims to capture specific temporal patterns (trend and seasonality).

5.6. Ablation analysis

We conduct an ablation analysis to further understand the effectiveness
of learning both the local patterns and the temporal dynamics in HyVAE.
To do that, we implement two variants of HyVAE by removing


Fig. 4. Forecasting results of HyVAE, w/o Entire and w/o Subseq on the datasets.
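The CRPS reported in Table 5 can be estimated from forecast samples via the standard identity CRPS(F, x) = E|X − x| − ½E|X − X′|, where X and X′ are independent draws from F. Below is a minimal sketch of this estimator (not the authors' implementation); the sample values are hypothetical.

```python
import numpy as np

def crps_from_samples(samples, x):
    """Sample-based CRPS estimate: E|X - x| - 0.5 * E|X - X'|.

    `samples` are draws from the forecast distribution F; `x` is the
    observed value. Lower is better; 0 means the forecast puts all
    probability mass exactly on the observation.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - x))
    # Pairwise absolute differences between samples via broadcasting.
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

print(crps_from_samples([0.5, 0.5, 0.5], 0.5))  # 0.0 (degenerate, perfect forecast)
print(crps_from_samples([0.0, 1.0], 0.5))       # 0.25
```

Unlike MSE on a point forecast, CRPS rewards a forecast distribution that is both sharp and well-calibrated, which is why it is used here to compare the variational methods.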

the learning of one type of information, respectively; that is, w/o
Subseq, which excludes the learning of local patterns from subsequences,
and w/o Entire, which does not learn the temporal dynamics of the entire
time series. The parameters of w/o Subseq and w/o Entire are tuned on
the validation set in the same way as HyVAE, and we report the results
of time series forecasting measured by MSE in Table 6.

Table 5
Probabilistic forecasting results (measured by CRPS) on the datasets, with the best
displayed in bold.

Methods      Parking    Stock    Electricity    Sealevel
VRNN         1.220      0.231    0.129          0.574
LaST         0.838      0.172    0.060          0.389
HyVAE        0.649      0.138    0.052          0.306

Table 6
Ablation analysis of HyVAE (MSE×10−2), with the best displayed in bold.

Methods      Parking    Stock    Electricity    Sealevel
w/o Entire   0.410      0.980    0.503          1.742
w/o Subseq   0.389      0.513    0.160          1.088
HyVAE        0.133      0.087    0.116          0.623

In Table 6, HyVAE, which learns both types of information, achieves
higher forecasting accuracy than the two variants. Specifically, the
largest improvement of HyVAE over the variants appears on the Stock
dataset, where its MSE (0.087) is around six times smaller than that of
the second-best w/o Subseq (0.513). The smallest improvement appears on
the Sealevel dataset, but the MSE of HyVAE is still around two times
smaller than that of the second-best w/o Subseq (0.623 vs. 1.088).
Meanwhile, it is interesting to observe that w/o Entire, which misses
the temporal dynamics, consistently performs worse than w/o Subseq,
which misses local patterns, on all four datasets.

We further show the forecasting results of HyVAE and the variants
against the ground truth in Fig. 4. For w/o Entire, since no temporal
dynamics are learned, it cannot properly capture the global trend of
time series; especially for the Stock dataset (Fig. 4(b)), it
misinterprets the steady curves between the 55 and 75 timestamps as
sharp spikes. As for the results on the Electricity and Sealevel
datasets shown in Fig. 4(c–d), w/o Entire only emphasizes recurring
local patterns but misses their differences at different timestamps.
Meanwhile, w/o Subseq expresses the temporal dynamics better than w/o
Entire, as clearly shown in Fig. 4(b); however, it fails to properly
capture local details. By combining the strengths of w/o Entire and w/o
Subseq, HyVAE achieves the best forecasting results, which are quite
close to the ground truth.

Based on the above analysis, we summarize that the improved forecasting
accuracy of HyVAE is due to its joint learning of local patterns and
global dynamics of time series. Compared with only learning the global
dynamics (w/o Subseq), HyVAE more accurately captures local details,
e.g., the flat peaks in Fig. 4(c), thus significantly reducing the
forecasting error in detail-rich regions of time series. Compared with
only capturing the local patterns (w/o Entire), HyVAE improves the
forecasting of global trends and dynamics, leading to an overall
reduction of forecasting errors on most time series samples.

5.7. Parameter analysis

In this experiment, we analyze the impact of three parameters of HyVAE,
i.e., the ladder size, the subsequence length, and the embedding size,
on its performance. Specifically, the ladder size determines the causal
information during subsequence encoding, and we vary the ladder size
from 0 to 10, where 0 means HyVAE learns no causal information of
subsequences (see Fig. 2). The subsequence length balances the local
patterns and the temporal dynamics, i.e., HyVAE is degraded

7
B. Cai et al. Knowledge-Based Systems 281 (2023) 111079

Fig. 5. Parameter analysis of HyVAE with respect to 𝑙𝑎𝑑𝑑𝑒𝑟 𝑠𝑖𝑧𝑒 and 𝑠𝑢𝑏𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒 𝑙𝑒𝑛𝑔𝑡ℎ.

Fig. 6. Parameter analysis of HyVAE with respect to 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔 𝑠𝑖𝑧𝑒.
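The parameter sweeps behind Figs. 5 and 6 amount to a grid search over ladder size, subsequence length, and embedding size, scored on the validation set. The sketch below is schematic only: train_and_eval is a hypothetical stand-in that would fit HyVAE and return its validation MSE, faked here with a toy score so the snippet runs.

```python
from itertools import product

def train_and_eval(ladder_size, subseq_len, embed_size):
    # Hypothetical stand-in: a real run would train HyVAE with these
    # parameters and return the validation-set MSE. The fake score below
    # merely makes (4, 10, 32) the best configuration for illustration.
    return abs(ladder_size - 4) + abs(subseq_len - 10) + abs(embed_size - 32)

ladder_sizes = [0, 2, 4, 6, 8, 10]   # 0 disables causal information of subsequences
subseq_lens = [10, 20, 30, 40]
embed_sizes = [8, 16, 32, 64, 128]

# Pick the configuration with the lowest validation MSE.
best = min(product(ladder_sizes, subseq_lens, embed_sizes),
           key=lambda cfg: train_and_eval(*cfg))
print(best)  # (4, 10, 32) under the fake score
```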

to w/o Subseq or w/o Entire if the subsequence length is 0 (no
subsequences) or the maximum (50, where the subsequence becomes the
entire time series), respectively. The results measured by MSE are shown
in Fig. 5. For the ladder size, shown in Fig. 5(a–d), the forecasting
accuracy decreases significantly when the ladder size is too small or
too large. The optimal ladder sizes are relatively small (2 for the
Sealevel dataset, 4 for the Parking and Stock datasets, and 6 for the
Electricity dataset). Meanwhile, we see that even when the ladder size
equals 0, HyVAE still outperforms w/o Subseq on all the datasets. The
results for the subsequence length are shown in Fig. 5(c–d), in which we
see that HyVAE prefers short subsequences for optimal forecasting
results, i.e., a subsequence length of 10 (Parking, Electricity, and
Sealevel datasets) or 20 (Stock dataset). The reason is that if the
subsequences are too long, temporal dynamics can hardly be preserved.
Not surprisingly, HyVAE, which learns temporal dynamics with any of the
subsequence lengths, achieves better forecasting accuracy than w/o
Entire, which does not learn temporal dynamics at all.

We then run HyVAE with the embedding size varied over {8, 16, 32, 64,
128}; this parameter determines the dimension of the latent
representations and of the hidden states in the neural networks. The
results are shown in Fig. 6. On all the datasets, accuracy measured by
MSE, MAE, and MAPE shows similar trends. First, the forecasting accuracy
is low with a small embedding size, mainly because small latent random
variables cannot properly capture the complex non-linear processes of
time series. When the embedding size becomes too large (128), the
accuracy decreases due to over-fitting. The results show that HyVAE
obtains optimal forecasting results with a relatively small embedding
size.

6. Conclusion

This paper proposes a novel hybrid variational autoencoder (HyVAE)
model for time series forecasting. HyVAE integrates the learning of
local patterns and temporal dynamics into a variational autoencoder.
Through comprehensive evaluation on four real-world datasets, we show
that HyVAE achieves better time series forecasting accuracy than various
counterpart methods, including a deterministic DNN-based method
(CNN+LSTM) that also learns both types of information from time series.
Moreover, the ablation analyses demonstrate that by jointly learning
local patterns and temporal dynamics, HyVAE outperforms its two
variants, which only learn local patterns and temporal dynamics from
time series, respectively.

CRediT authorship contribution statement

Borui Cai: Conceptualization, Methodology, Writing – original draft,
Writing – review & editing. Shuiqiao Yang: Conceptualization,
Methodology, Writing – original draft, Writing – review & editing.
Longxiang Gao: Conceptualization, Writing – original draft, Writing –
review & editing. Yong Xiang: Conceptualization, Writing – original
draft, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
influence the work reported in this paper.

8
B. Cai et al. Knowledge-Based Systems 281 (2023) 111079

Data availability

Data will be made available on request.

Acknowledgments

This work was supported in part by the Australian Research Council
under Grant LP190100594.

References

[1] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer:
Beyond efficient transformer for long sequence time-series forecasting, in: AAAI
Conference on Artificial Intelligence, Vol. 35, No. 12, 2021, pp. 11106–11115.
[2] K. Wang, C. Xu, Y. Zhang, S. Guo, A.Y. Zomaya, Robust big data analytics for
electricity price forecasting in the smart grid, IEEE Trans. Big Data 5 (1) (2017)
34–45.
[3] A.W. Li, G.S. Bastos, Stock market forecasting using deep learning and technical
analysis: A systematic review, IEEE Access 8 (2020) 185232–185242.
[4] P.R. Junior, F.L.R. Salomon, E. de Oliveira Pamplona, et al., ARIMA: An applied
time series forecasting model for the Bovespa stock index, Appl. Math. 5 (21)
(2014) 3383.
[5] B. Lim, S. Zohren, Time-series forecasting with deep learning: a survey, Phil.
Trans. R. Soc. A 379 (2194) (2021) 20200209.
[6] Z.D. Akşehir, E. Kiliç, How to handle data imbalance and feature selection
problems in CNN-based stock price forecasting, IEEE Access 10 (2022) 31297–31305.
[7] R. He, Y. Liu, Y. Xiao, X. Lu, S. Zhang, Deep spatio-temporal 3D densenet
with multiscale ConvLSTM-Resnet network for citywide traffic flow forecasting,
Knowl.-Based Syst. 250 (2022) 109054.
[8] S. Liu, H. Ji, M.C. Wang, Nonpooling convolutional neural network forecasting
for seasonal time series with trends, IEEE Trans. Neural Netw. Learn. Syst. 31 (8)
(2019) 2879–2888.
[9] S. Du, T. Li, Y. Yang, S.-J. Horng, Deep air quality forecasting using hybrid deep
learning framework, IEEE Trans. Knowl. Data Eng. 33 (6) (2019) 2412–2424.
[10] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8)
(1997) 1735–1780.
[11] J. Bi, X. Zhang, H. Yuan, J. Zhang, M. Zhou, A hybrid prediction method for
realistic network traffic with temporal convolutional network and LSTM, IEEE
Trans. Autom. Sci. Eng. 19 (3) (2021) 1869–1879.
[12] D. Quang, X. Xie, DanQ: a hybrid convolutional and recurrent deep neural
network for quantifying the function of DNA sequences, Nucleic Acids Res. 44
(11) (2016) e107.
[13] J.E. Van Engelen, H.H. Hoos, A survey on semi-supervised learning, Mach. Learn.
109 (2) (2020) 373–440.
[14] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International
Conference on Learning Representations, 2014.
[15] U. Ullah, Z. Xu, H. Wang, S. Menzel, B. Sendhoff, T. Bäck, Exploring clinical
time series forecasting with meta-features in variational recurrent models, in:
International Joint Conference on Neural Networks, IEEE, 2020, pp. 1–9.
[16] W. Chen, L. Tian, B. Chen, L. Dai, Z. Duan, M. Zhou, Deep variational graph
convolutional recurrent network for multivariate time series anomaly detection,
in: International Conference on Machine Learning, PMLR, 2022, pp. 3621–3633.
[17] Z. Wang, X. Xu, W. Zhang, G. Trajcevski, T. Zhong, F. Zhou, Learning latent
seasonal-trend representations for time series forecasting, in: Advances in Neural
Information Processing Systems, 2022.
[18] J.F. de Oliveira, E.G. Silva, P.S. de Mattos Neto, A hybrid system based on
dynamic selection for time series forecasting, IEEE Trans. Neural Netw. Learn.
Syst. 33 (8) (2021) 3251–3263.
[19] U. Thissen, R. Van Brakel, A. De Weijer, W. Melssen, L. Buydens, Using support
vector machines for time series prediction, Chemometr. Intell. Lab. Syst. 69 (1–2)
(2003) 35–49.
[20] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,
Y. Bengio, Learning phrase representations using RNN encoder–decoder for
statistical machine translation, in: Conference on Empirical Methods in Natural
Language Processing, 2014, pp. 1724–1734.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, in: Advances in Neural Information
Processing Systems, 2017.
[22] R. Sen, H.-F. Yu, I.S. Dhillon, Think globally, act locally: A deep neural network
approach to high-dimensional time series forecasting, in: Neural Information
Processing Systems, 2019.
[23] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer:
Beyond efficient transformer for long sequence time-series forecasting, in: AAAI
Conference on Artificial Intelligence, 2021.
[24] W. Chen, W. Wang, B. Peng, Q. Wen, T. Zhou, L. Sun, Learning to rotate:
Quaternion transformer for complicated periodical time series forecasting, in:
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp.
146–156.
[25] Y. Liu, H. Wu, J. Wang, M. Long, Non-stationary transformers: Exploring
the stationarity in time series forecasting, in: Advances in Neural Information
Processing Systems, Vol. 35, 2022, pp. 9881–9893.
[26] A. Zeng, M. Chen, L. Zhang, Q. Xu, Are transformers effective for time series
forecasting? in: AAAI Conference on Artificial Intelligence, Vol. 37, No. 9, 2023,
pp. 11121–11128.
[27] Y. Chen, I. Segovia, Y.R. Gel, Z-GCNETs: Time zigzags at graph convolutional
networks for time series forecasting, in: International Conference on Machine
Learning, PMLR, 2021, pp. 1684–1694.
[28] C.K. Sønderby, T. Raiko, L. Maaløe, S.K. Sønderby, O. Winther, Ladder variational
autoencoders, in: Advances in Neural Information Processing Systems, Vol. 29,
2016, pp. 3745–3753.
[29] D.P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, M. Welling,
Improved variational inference with inverse autoregressive flow, in: Advances in
Neural Information Processing Systems, 2016.
[30] J. He, Y. Gong, J. Marino, G. Mori, A. Lehrmann, Variational autoencoders with
jointly optimized latent dependency structure, in: International Conference on
Learning Representations, 2018.
[31] A. Zeroual, F. Harrou, A. Dairi, Y. Sun, Deep learning methods for forecasting
COVID-19 time-series data: A comparative study, Chaos Solitons Fractals 140
(2020) 110121.
[32] D. Hallac, S. Vare, S. Boyd, J. Leskovec, Toeplitz inverse covariance-based
clustering of multivariate time series data, in: ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2017, pp. 215–223.
[33] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural
networks, Science 313 (5786) (2006) 504–507.
[34] L. Li, J. Yan, X. Yang, Y. Jin, Learning interpretable deep state space model
for probabilistic time series forecasting, in: International Joint Conference on
Artificial Intelligence, 2019.
[35] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International
Conference on Learning Representations, 2015.
[36] D.H. Stolfi, E. Alba, X. Yao, Predicting car park occupancy rates in smart cities,
in: International Conference on Smart Cities, Springer, 2017, pp. 107–117.
[37] Y. Qin, D. Song, H. Cheng, W. Cheng, G. Jiang, G.W. Cottrell, A dual-stage
attention-based recurrent neural network for time series prediction, in:
International Joint Conference on Artificial Intelligence, 2017, pp. 2627–2633.
[38] L. Shen, Y. Wei, Y. Wang, GBT: Two-stage transformer framework for
non-stationary time series forecasting, Neural Netw. 165 (2023) 953–970.
[39] C. Challu, K.G. Olivares, B.N. Oreshkin, F.G. Ramirez, M.M. Canseco, A.
Dubrawski, NHITS: Neural hierarchical interpolation for time series forecasting,
in: AAAI Conference on Artificial Intelligence, Vol. 37, No. 6, 2023, pp.
6989–6997.
