Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting
Shizhan Liu¹,²∗, Hang Yu¹∗, Cong Liao¹, Jianguo Li¹†, Weiyao Lin², Alex X. Liu¹, and Schahram Dustdar³
¹Ant Group, ²Shanghai Jiaotong University, ³TU Wien, Austria
ABSTRACT
Accurate prediction of the future given the past based on time series data is of
paramount importance, since it opens the door for decision making and risk man-
agement ahead of time. In practice, the challenge is to build a flexible but parsi-
monious model that can capture a wide range of temporal dependencies. In this
paper, we propose Pyraformer by exploring the multi-resolution representation of
the time series. Specifically, we introduce the pyramidal attention module (PAM)
in which the inter-scale tree structure summarizes features at different resolutions
and the intra-scale neighboring connections model the temporal dependencies of
different ranges. Under mild conditions, the maximum length of the signal travers-
ing path in Pyraformer is a constant (i.e., O(1)) with regard to the sequence length
L, while its time and space complexity scale linearly with L. Extensive experi-
mental results show that Pyraformer typically achieves the highest prediction ac-
curacy in both single-step and long-range multi-step forecasting tasks with the
least amount of time and memory consumption, especially when the sequence is
long.¹
1 INTRODUCTION
Time series forecasting is the cornerstone for downstream tasks such as decision making and risk
management. As an example, reliable prediction of the online traffic for micro-services can yield
early warnings of the potential risk in cloud systems. Furthermore, it also provides guidance for
dynamic resource allocation, in order to minimize the cost without degrading the performance. In
addition to online traffic, time series forecasting has also found vast applications in other fields,
including disease propagation, energy management, and economics and finance.
The major challenge of time series forecasting lies in constructing a powerful but parsimonious
model that can compactly capture temporal dependencies of different ranges. Time series often
exhibit both short-term and long-term repeating patterns (Lai et al., 2018), and taking them into
account is the key to accurate prediction. Of particular note is the more difficult task of handling
long-range dependencies, which is characterized by the length of the longest signal traversing path
(see Proposition 2 for the definition) between any two positions in the time series (Vaswani et al.,
2017). The shorter the path, the better the dependencies are captured. Additionally, to allow the
models to learn these long-term patterns, the historical input to the models should also be long. To
this end, low time and space complexity is a priority.
Unfortunately, the present state-of-the-art methods fail to accomplish these two objectives simul-
taneously. On one end, RNN (Salinas et al., 2020) and CNN (Munir et al., 2018) achieve a low
time complexity that is linear in terms of the time series length L, yet their maximum length of the
signal traversing path is O(L), thus rendering them difficult to learn dependencies between distant
positions. On the other extreme, Transformer dramatically shortens the maximum path to be O(1)
∗ Equal contribution. This work was done when Shizhan Liu was a research intern at Ant Group.
† Corresponding author.
¹ Code is available at: https://github.com/alipay/Pyraformer
Figure 1: Graphs of commonly used neural network models for sequence data, showing how the input embeddings and the embeddings at successive layers are connected; panel (a) corresponds to the full-attention Transformer, panel (d) to the pyramidal graph of Pyraformer, and panel (f) to LogTrans.
Table 1: Comparison of the complexity and the maximum signal traveling path for different models, where G is the number of global tokens in ETC. In practice, G increases with L, and so the complexity of ETC is super-linear.

Method                                   Complexity per layer   Maximum path length
CNN (Munir et al., 2018)                 O(L)                   O(L)
RNN (Salinas et al., 2020)               O(L)                   O(L)
Full Attention (Vaswani et al., 2017)    O(L^2)                 O(1)
ETC (Ainslie et al., 2020)               O(GL)                  O(1)
Longformer (Beltagy et al., 2020)        O(L)                   O(L)
LogTrans (Li et al., 2019)               O(L log L)             O(log L)
Pyraformer                               O(L)                   O(1)
at the sacrifice of increasing the time complexity to O(L2 ). As a consequence, it cannot tackle
very long sequences. To find a compromise between the model capacity and complexity, variants
of Transformer are proposed, such as Longformer (Beltagy et al., 2020), Reformer (Kitaev et al.,
2019), and Informer (Zhou et al., 2021). However, few of them can achieve a maximum path length
less than O(L) while greatly reducing the time and space complexity.
In this paper, we propose a novel pyramidal attention based Transformer (Pyraformer) to bridge
the gap between capturing the long-range dependencies and achieving a low time and space com-
plexity. Specifically, we develop the pyramidal attention mechanism by passing messages based on
attention in the pyramidal graph as shown in Figure 1(d). The edges in this graph can be divided
into two groups: the inter-scale and the intra-scale connections. The inter-scale connections build
a multiresolution representation of the original sequence: nodes at the finest scale correspond to
the time points in the original time series (e.g., hourly observations), while nodes in the coarser
scales represent features with lower resolutions (e.g., daily, weekly, and monthly patterns). Such
latent coarser-scale nodes are initially introduced via a coarser-scale construction module. On the
other hand, the intra-scale edges capture the temporal dependencies at each resolution by connecting
neighboring nodes together. As a result, this model provides a compact representation for long-range
temporal dependencies among far-apart positions by capturing such behavior at coarser resolutions,
leading to a smaller length of the signal traversing path. Moreover, modeling temporal dependencies
of different ranges at different scales with sparse neighboring intra-scale connections significantly
reduces the computational cost. In short, our key contributions comprise:
• To highlight the appeal of the proposed model, we further compare different models in terms of the maximum path length and the complexity in Table 1.
• Experimentally, we show that the proposed Pyraformer yields more accurate predictions
than the original Transformer and its variants on various real-world datasets under the sce-
nario of both single-step and long-range multi-step forecasting, but with lower time and
memory cost.
2 RELATED WORKS
Time series forecasting methods can be roughly divided into statistical methods and neural network
based methods. The first group involves ARIMA (Box & Jenkins, 1968) and Prophet (Taylor &
Letham, 2018). However, both of them need to fit each time series separately, and their performance
pales when it comes to long-range forecasting.
More recently, the development of deep learning has spawned a tremendous increase in neural net-
work based time series forecasting methods, including CNN (Munir et al., 2018), RNN (Salinas
et al., 2020) and Transformer (Li et al., 2019). As mentioned in the previous section, CNN and RNN
enjoy a low time and space complexity (i.e., O(L)), but entail a path of O(L) to describe long-range
dependence. We refer the readers to Appendix A for a more detailed review on related RNN-based
models. By contrast, Transformer (Vaswani et al., 2017) can effectively capture the long-range de-
pendence with a path of O(1) steps, whereas the complexity increases vastly from O(L) to O(L2 ).
To alleviate this computational burden, LogTrans (Li et al., 2019) and Informer (Zhou et al., 2021) are proposed: the former constrains each point in the sequence to only attend to points that are $2^n$ steps before it, where $n = 1, 2, \cdots$, and the latter utilizes the sparsity of the attention scores, resulting in a substantial decrease in the complexity (i.e., $O(L \log L)$), at the expense of introducing a longer maximum path length.
In addition to the literature on time series forecasting, a plethora of methods have been proposed for
enhancing the efficiency of Transformer in the field of natural language processing (NLP). Similar
to CNN, Longformer (Beltagy et al., 2020) computes attention within a local sliding window or a
dilated sliding window. Although the complexity is reduced to O(AL), where A is the local window
size, the limited window size makes it difficult to exchange information globally. The consequent
maximum path length is O(L/A). As an alternative, Reformer (Kitaev et al., 2019) exploits locality
sensitive hashing (LSH) to divide the sequence into several buckets, and then performs attention
within each bucket. It also employs reversible Transformer to further reduce memory consumption,
and so an extremely long sequence can be processed. Its maximum path length is proportional
to the number of buckets though, and worse still, a large bucket number is required to reduce the
complexity. On the other hand, ETC (Ainslie et al., 2020) introduces an extra set of global tokens
for the sake of global information exchange, leading to an O(GL) time and space complexity and an
O(1) maximum path length, where G is the number of global tokens. However, G typically increases
with L, and the consequent complexity is still super-linear. Akin to ETC, the proposed Pyraformer
also introduces global tokens, but in a multiscale manner, successfully reducing the complexity to
O(L) without increasing the order of the maximum path length as in the original Transformer.
Finally, we provide a brief review on methods that improve Transformer’s ability to capture the
hierarchical structure of natural language, although they have never been used for time series fore-
casting. HIBERT (Miculicich et al., 2018) first uses a Sent Encoder to extract the features of each sentence, then forms the EOS tokens of the sentences in the document into a new sequence and inputs it into the Doc Encoder. However, it is specialized for natural language and cannot be generalized
to other sequence data. Multi-scale Transformer (Subramanian et al., 2020) learns the multi-scale
representations of sequence data using both the top-down and bottom-up network structures. Such
multi-scale representations help reduce the time and memory cost of the original Transformer, but
Figure 2: The architecture of Pyraformer: The CSCM summarizes the embedded sequence at differ-
ent scales and builds a multi-resolution tree structure. Then the PAM is used to exchange information
between nodes efficiently.
it still suffers from the pitfall of the quadratic complexity. Alternatively, BP-Transformer (Ye et al.,
2019) recursively partitions the entire input sequence into two until a partition only contains a single
token. The partitioned sequences then form a binary tree. In the attention layer, each upper-scale
node can attend to its own children, while the nodes at the bottom scale can attend to the adjacent A
nodes at the same scale and all coarser-scale nodes. Note that BP-Transformer initializes the nodes
at coarser scale with zeros, whereas Pyraformer introduces the coarser-scale nodes using a construc-
tion module in a more flexible manner. Moreover, BP-Transformer is associated with a denser graph
than Pyraformer, thus giving rise to a higher complexity of O(L log L).
3 METHOD
The time series forecasting problem can be formulated as predicting the future M steps $z_{t+1:t+M}$ given the previous L steps of observations $z_{t-L+1:t}$ and the associated covariates $x_{t-L+1:t+M}$ (e.g., hour of the day). Toward this goal, we propose Pyraformer in this paper, whose overall architecture is summarized in Figure 2. As shown in the figure, we first embed the observed data, the covariates, and the positions separately and then add them together, in the same vein as Informer (Zhou et al., 2021). Next, we construct a multi-resolution C-ary tree using the coarser-
scale construction module (CSCM), where nodes at a coarser scale summarize the information of
C nodes at the corresponding finer scale. To further capture the temporal dependencies of different
ranges, we introduce the pyramidal attention module (PAM) by passing messages using the attention
mechanism in the pyramidal graph. Finally, depending on the downstream task, we employ different
network structures to output the final predictions. In the sequel, we elaborate on each part of the
proposed model. For ease of exposition, all notations in this paper are summarized in Table 4.
We begin with the introduction of the PAM, since it lies at the heart of Pyraformer. As demonstrated
in Figure 1(d), we leverage a pyramidal graph to describe the temporal dependencies of the observed
time series in a multiresolution fashion. Such a multiresolution structure has proved itself an effec-
tive and efficient tool for long-range interaction modeling in the field of computer vision (Sun et al.,
2019; Wang et al., 2021) and statistical signal processing (Choi et al., 2008; Yu et al., 2019). We
can decompose the pyramidal graph into two parts: the inter-scale and the intra-scale connections.
The inter-scale connections form a C-ary tree, in which each parent has C children. For example,
if we associate the finest scale of the pyramidal graph with hourly observations of the original time
series, the nodes at coarser scales can be regarded as the daily, weekly, and even monthly features of
the time series. As a consequence, the pyramidal graph offers a multi-resolution representation of
the original time series. Furthermore, it is easier to capture long-range dependencies (e.g., monthly
dependence) in the coarser scales by simply connecting the neighboring nodes via the intra-scale
connections. In other words, the coarser scales are instrumental in describing long-range correlations in a manner that is graphically far more parsimonious than what could be captured with a single model at the finest scale alone. Indeed, the original single-scale Transformer (see Figure 1(a)) adopts a full
graph that connects every two nodes at the finest scale so as to model the long-range dependencies,
leading to a computationally burdensome model with O(L2 ) time and space complexity (Vaswani
et al., 2017). In stark contrast, as illustrated below, the pyramidal graph in the proposed Pyraformer
reduces the computational cost to O(L) without increasing the order of the maximum length of the
signal traversing path.
Before delving into the PAM, we first introduce the original attention mechanism. Let X and Y denote the input and output of a single attention head, respectively. Note that multiple heads can be introduced to describe the temporal pattern from different perspectives. X is first linearly transformed into three distinct matrices, namely, the query $Q = XW_Q$, the key $K = XW_K$, and the value $V = XW_V$, where $W_Q, W_K, W_V \in \mathbb{R}^{L \times D_K}$. For the $i$-th row $q_i$ in $Q$, it can attend to any rows (i.e., keys) in $K$. In other words, the corresponding output $y_i$ can be expressed as:
$$
y_i = \sum_{\ell=1}^{L} \frac{\exp\!\left(q_i k_\ell^{\top}/\sqrt{D_K}\right) v_\ell}{\sum_{\ell'=1}^{L} \exp\!\left(q_i k_{\ell'}^{\top}/\sqrt{D_K}\right)}, \qquad (1)
$$
where $k_\ell^{\top}$ denotes the transpose of row $\ell$ in $K$. We emphasize that the number of query-key dot products (Q-K pairs) that need to be calculated and stored dictates the time and space complexity of the attention mechanism. Viewed another way, this number is proportional to the number of edges in the graph (see Figure 1(a)). Since all Q-K pairs are computed and stored in the full attention mechanism (1), the resulting time and space complexity is $O(L^2)$.
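As a concrete reference point, here is a minimal NumPy sketch of the single-head full attention in Equation (1); the function name and shapes are illustrative assumptions, not code from the released implementation.

```python
import numpy as np

def full_attention(X, W_Q, W_K, W_V):
    """Single-head full attention following Equation (1).

    X: (L, D) input sequence; W_Q, W_K, W_V project to the key/value dimension.
    Every query attends to all L keys, so L x L query-key dot products are formed.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    D_K = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_K)                   # all L x L Q-K pairs
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # output Y, one row y_i per query
```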
As opposed to the above full attention mechanism, every node only pays attention to a limited set of keys in the PAM, corresponding to the pyramidal graph in Figure 1(d). Concretely, suppose that $n_\ell^{(s)}$ denotes the $\ell$-th node at scale $s$, where $s = 1, \cdots, S$ represents the bottom scale to the top scale sequentially. In general, each node in the graph can attend to a set of neighboring nodes $\mathbb{N}_\ell^{(s)}$ at three scales: the adjacent $A$ nodes at the same scale including the node itself (denoted as $\mathbb{A}_\ell^{(s)}$), the $C$ children it has in the C-ary tree (denoted as $\mathbb{C}_\ell^{(s)}$), and its parent in the C-ary tree (denoted as $\mathbb{P}_\ell^{(s)}$), that is,
$$
\mathbb{N}_\ell^{(s)} = \mathbb{A}_\ell^{(s)} \cup \mathbb{C}_\ell^{(s)} \cup \mathbb{P}_\ell^{(s)},
$$
$$
\mathbb{A}_\ell^{(s)} = \Big\{ n_j^{(s)} : |j - \ell| \le \tfrac{A-1}{2},\ 1 \le j \le \tfrac{L}{C^{s-1}} \Big\},
$$
$$
\mathbb{C}_\ell^{(s)} = \big\{ n_j^{(s-1)} : (\ell - 1)C < j \le \ell C \big\} \ \text{if } s \ge 2 \ \text{else } \emptyset,
$$
$$
\mathbb{P}_\ell^{(s)} = \big\{ n_j^{(s+1)} : j = \lceil \ell / C \rceil \big\} \ \text{if } s \le S - 1 \ \text{else } \emptyset. \qquad (2)
$$
It follows that the attention at node $n_\ell^{(s)}$ can be simplified as:
$$
y_i = \sum_{\ell \in \mathbb{N}_\ell^{(s)}} \frac{\exp\!\left(q_i k_\ell^{\top}/\sqrt{D_K}\right) v_\ell}{\sum_{\ell' \in \mathbb{N}_\ell^{(s)}} \exp\!\left(q_i k_{\ell'}^{\top}/\sqrt{D_K}\right)}. \qquad (3)
$$
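To make Equation (2) concrete, the following sketch enumerates the nodes that $n_\ell^{(s)}$ attends to; the function name, the 1-based indexing, and the simple clipping of the adjacent window at the sequence boundary are our own choices.

```python
import math

def pam_neighbors(ell, s, L, S, A, C):
    """Nodes attended by n_ell^(s) under Equation (2), returned as (scale, index) pairs.

    Scales run from 1 (finest) to S (coarsest); ell is 1-based within scale s.
    """
    L_s = L // C ** (s - 1)                      # number of nodes at scale s
    half = (A - 1) // 2
    # intra-scale: the A adjacent nodes at the same scale, including the node itself
    adjacent = {(s, j) for j in range(max(1, ell - half), min(L_s, ell + half) + 1)}
    # inter-scale: C children at scale s-1 (empty at the finest scale)
    children = {(s - 1, j) for j in range((ell - 1) * C + 1, ell * C + 1)} if s >= 2 else set()
    # inter-scale: one parent at scale s+1 (empty at the coarsest scale)
    parent = {(s + 1, math.ceil(ell / C))} if s <= S - 1 else set()
    return adjacent | children | parent
```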
We further denote the number of attention layers as N. Without loss of generality, we assume that L is divisible by $C^{S-1}$. We can then have the following lemma (cf. Appendix B for the proof and Table 4 for the meanings of the notations).
Lemma 1. Given A, C, L, N , and S that satisfy Equation (4), after N stacked attention layers,
nodes at the coarsest scale can obtain a global receptive field.
$$
\frac{L}{C^{S-1}} - 1 \le \frac{(A-1)N}{2}. \qquad (4)
$$
In addition, when the number of scales S is fixed, the following two propositions summarize the
time and space complexity and the order of the maximum path length for the proposed pyramidal
attention mechanism. We refer the readers to Appendix C and D for proof.
Proposition 1. The time and space complexity for the pyramidal attention mechanism is O(AL) for
given A and L and amounts to O(L) when A is a constant w.r.t. L.
Proposition 2. Let the signal traversing path between two nodes in a graph denote the shortest path connecting them. Then the maximum length of the signal traversing path between two arbitrary nodes in the pyramidal graph is $O(S + L/(C^{S-1}A))$ for given A, C, L, and S. If A and S are fixed and C satisfies Equation (5), the maximum path length is $O(1)$ for time series with length L.
$$
\sqrt[S-1]{L} \ \ge\ C \ \ge\ \sqrt[S-1]{\frac{L}{(A-1)N/2 + 1}}. \qquad (5)
$$
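As a quick illustration of how Equations (4) and (5) constrain the hyper-parameters, the snippet below searches for the smallest integer C that gives the coarsest-scale nodes a global receptive field; the helper name and the search range are our own assumptions.

```python
import math

def smallest_valid_C(L, S, A, N):
    """Smallest integer C with L / C**(S-1) - 1 <= (A - 1) * N / 2 (Equation (4)),
    searching up to roughly the L**(1/(S-1)) upper bound of Equation (5)."""
    upper = math.ceil(L ** (1.0 / (S - 1)))
    for C in range(2, upper + 1):
        if L / C ** (S - 1) - 1 <= (A - 1) * N / 2:
            return C
    return None                                   # no valid C within the bound

# Example: smallest_valid_C(L=336, S=4, A=3, N=4) returns 5.
```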
Figure 3: Coarser-scale construction module: B is the batch size and D is the dimension of a node. Convolutions with stride C are applied successively, yielding feature maps of size B×(L/C)×D and B×(L/C^2)×D at the coarser scales.
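Below is a hedged PyTorch sketch of such a coarser-scale construction module: stride-C convolutions applied in a bottlenecked dimension, roughly matching the "Conv. w/bottleneck" variant studied in Appendix J. Layer sizes, names, and the concatenation of all scales are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class CSCMSketch(nn.Module):
    """Builds coarser-scale nodes with stride-C convolutions applied in a bottleneck dimension."""

    def __init__(self, d_model, C, num_coarser_scales, d_bottleneck=None):
        super().__init__()
        d_b = d_bottleneck or d_model // 4
        self.down = nn.Linear(d_model, d_b)          # bottleneck down-projection
        self.convs = nn.ModuleList(
            nn.Conv1d(d_b, d_b, kernel_size=C, stride=C) for _ in range(num_coarser_scales)
        )
        self.up = nn.Linear(d_b, d_model)            # bottleneck up-projection

    def forward(self, x):                            # x: (B, L, d_model), L divisible by C**num_coarser_scales
        h = self.down(x).transpose(1, 2)             # (B, d_b, L) for Conv1d
        scales = [x]
        for conv in self.convs:
            h = conv(h)                              # length shrinks by a factor of C per scale
            scales.append(self.up(h.transpose(1, 2)))
        return torch.cat(scales, dim=1)              # all scales stacked along time for the PAM
```

For instance, with three coarser scales and C = 4, an input of length L yields L + L/4 + L/16 + L/64 nodes in total, which is the sequence the PAM then operates on.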
In our experiments, we fix S and N, and A can only take 3 or 5, regardless of the sequence length L. Therefore, the proposed PAM achieves a complexity of O(L) with a maximum path length of O(1). Note that in the PAM, a node can attend to at most A + C + 1 nodes. Unfortunately, such a sparse attention mechanism is not directly supported in existing deep learning libraries, such as PyTorch and TensorFlow. A naive implementation of the PAM that can fully exploit the tensor operation framework is to first compute the product between all Q-K pairs, i.e., $q_i k_\ell^{\top}$ for $\ell = 1, \cdots, L$, and then mask out $\ell \notin \mathbb{N}_\ell^{(s)}$. However, the resulting time and space complexity of this implementation is still $O(L^2)$. Instead, we build a customized CUDA kernel specialized for the PAM using TVM (Chen et al., 2018), practically reducing the computational time and memory cost and making the proposed model amenable to long time series. Longer historical input is typically helpful for improving the prediction accuracy, as more information is provided, especially when long-range dependencies are considered.
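The naive masked realization described above can be sketched as follows; the mask is a boolean adjacency matrix of the pyramidal graph (e.g., assembled from the neighborhoods of Equation (2)), and the function name is our own. It reproduces the PAM's output but still forms all Q-K pairs, which is exactly why the TVM kernel is used in practice.

```python
import torch

def masked_pyramidal_attention(Q, K, V, mask):
    """Compute every Q-K dot product, then mask out pairs that are not edges of the pyramidal graph.

    Q, K, V: (B, L_total, D_K) over the nodes of all scales; mask: (L_total, L_total) boolean,
    True where node i may attend to node j. Memory and time are still O(L_total^2).
    """
    D_K = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / D_K ** 0.5        # all Q-K pairs
    scores = scores.masked_fill(~mask, float('-inf'))    # drop non-neighboring pairs
    return torch.softmax(scores, dim=-1) @ V
```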
Table 2: Single-step forecasting results on three datasets. “Q-K pairs” refer to the number of query-
key dot products performed by all attention layers in the network, which encodes the time and space
complexity. We write the number of attention layers by N , the number of attention heads by H, the
number of scales by S, the dimension of a node by D, the dimension of a key by DK , the maximum
dimension of feed-forward layer by DF , and the convolution stride by C.
4 EXPERIMENTS
4.1 DATASETS AND EXPERIMENT SETUP
We demonstrated the advantages of the proposed Pyraformer on four real-world datasets, including Wind, App Flow, Electricity, and ETT. The first three datasets were used for single-step forecasting, while the last two were used for long-range multi-step forecasting. We refer the readers to Appendix E and F for more details regarding the data description and the experiment setup.
Concretely, there are three major trends that can be gleaned from Table 2: (1) The proposed Pyraformer
yields the most accurate prediction results, suggesting that the pyramidal graph can better explain
the temporal interactions in the time series by considering dependencies of different ranges. Inter-
estingly, for the Wind dataset, sparse attention mechanisms, namely, LogTrans, ETC, Longformer
and Pyraformer, outperform the original full attention Transformer, probably because the data con-
tains a large number of zeros and the promotion of adequate sparsity can help avoid over-fitting.
(2) The number of Q-K pairs in Pyraformer is the smallest. Recall that this number character-
izes the time and space complexity. Remarkably enough, it is 65.4% fewer than that of LogTrans
and 96.6% fewer than that of the full attention. It is worth emphasizing that this computational gain
will continue to increase for longer time series. (3) The number of parameters for Pyraformer is
slightly larger than that of the other models, resulting from the CSCM. However, this module is very
lightweight, which incurs merely 5% overhead in terms of model size compared to other models.
Moreover, in practice, we can fix the hyper-parameters A, S, and N, and ensure that C satisfies $C > \sqrt[S-1]{L/((A-1)N/2 + 1)}$. Consequently, the extra number of parameters introduced by the CSCM is only $O((S-1)CD_K^2) \approx O(\sqrt[S-1]{L})$.
Figure 4: Comparison of the time and memory consumption between the full, the prob-sparse, and
the TVM implementation of the pyramidal attention: (a) computation time; (b) memory occupation.
ACKNOWLEDGEMENT
In this work, Prof. Weiyao Lin was supported by Ant Group through the Ant Research Program and in part by the National Natural Science Foundation of China under grant U21B2013.
REFERENCES
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and structured
inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 268–284, 2020.
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150, 2020.
George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal
of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael
Witbrock, Mark Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks.
Advances in Neural Information Processing Systems, 2017:77–87, 2017.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan
Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing
compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 18), pp. 578–594, 2018.
M. J. Choi, V. Chandrasekaran, D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Multiscale
stochastic modeling for tractable inference and data assimilation. Computer Methods in Applied
Mechanics and Engineering, 197(43-44):3492–3515, 2008.
Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural net-
works. In 5th International Conference on Learning Representations, ICLR 2017, 2019.
Marta R Costa-jussà and José AR Fonollosa. Character-based neural machine translation. In Pro-
ceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2:
Short Papers), pp. 357–361, 2016.
Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee. Residual lstm: Design of a deep recurrent
architecture for distant speech recognition. arXiv preprint arXiv:1701.03360, 2017.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
International Conference on Learning Representations, 2019.
Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term
temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference
on Research & Development in Information Retrieval, pp. 95–104, 2018.
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng
Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series
forecasting. Advances in Neural Information Processing Systems, 32:5243–5253, 2019.
Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document level neural
machine translation with hierarchical attention networks. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP), 2018.
Mohsin Munir, Shoaib Ahmed Siddiqui, Andreas Dengel, and Sheraz Ahmed. DeepAnT: A deep
learning approach for unsupervised anomaly detection in time series. IEEE Access, 7:1991–2005,
2018.
David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic fore-
casting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–
1191, 2020.
M. Schuster. Bi-directional recurrent neural networks for speech recognition. In Proceedings of the IEEE
Canadian Conference on Electrical and Computer Engineering, pp. 7–12, 1996.
Sandeep Subramanian, Ronan Collobert, Marc’Aurelio Ranzato, and Y-Lan Boureau. Multi-scale
transformer language models. arXiv preprint arXiv:2005.00581, 2020.
Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for
human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 5693–5703, 2019.
Sean J Taylor and Benjamin Letham. Forecasting at scale. The American Statistician, 72(1):37–45,
2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pp. 5998–6008, 2017.
W. Wang, E. Xie, X. Li, D. P. Fan, and L. Shao. Pyramid vision transformer: A versatile backbone
for dense prediction without convolutions. 2021.
Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. Bp-transformer: Modelling
long-range context via binary partitioning. arXiv preprint arXiv:1911.04070, 2019.
Hang Yu, Luyin Xin, and Justin Dauwels. Variational wishart approximation for graphical model
selection: Monoscale and multiscale models. IEEE Transactions on Signal Processing, 67(24):
6468–6482, 2019. doi: 10.1109/TSP.2019.2953651.
Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for
high-dimensional time series prediction. Advances in neural information processing systems, 29:
847–855, 2016.
Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings
of AAAI, 2021.
Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer hawkes
process. In International Conference on Machine Learning, pp. 11692–11702. PMLR, 2020.
A RELATED WORK ON RNN-BASED MODELS

In this section, we provide a brief review on the related RNN-based models. Multiscale tempo-
ral dependencies are successfully captured in HRNN (Costa-jussà & Fonollosa, 2016) and HM-
RNN (Chung et al., 2019). The former requires expert knowledge to partition the sequence into
different resolutions, while the latter learns the partition automatically from the data. Note that the
theoretical maximum length of the signal traversing path in both models is still O(L). Another line
of works aim to shorten the signal traversing path by adding residual connections (Kim et al., 2017)
or dilated connections to LSTMs (Chang et al., 2017). However, they do not consider the multires-
olution temporal dependencies explicitly. Furthermore, all aforementioned RNNs only propagate
information in one direction from the past to the future. An appealing approach that allows bidirec-
tional information exchange is Bi-LSTM (Schuster, 1996). The forward and backward propagation
is realized through two different LSTMs though, and so still incurs a long signal traversing path.
As opposed to the abovementioned RNN-based models, the proposed Pyraformer enables bidirec-
tional information exchange that can better describe the temporal dependencies, while providing a
multiresolution representation of the observed sequence at the same time. We also notice that due
to the unidirectional property of RNNs, it is difficult to realize the pyramidal graph in Figure 1(d)
based on RNNs.
B PROOF OF LEMMA 1
Proof. Let S denote the number of scales in the pyramidal graph, C the number of children nodes
in the finer scale s − 1 that a node in the coarser scale s can summarize for s = 2, · · · , S, A
the number of adjacent nodes that a node can attend to within each scale, N the number of attention
layers, and L the length of the input time series. We define the term “receptive field” of an arbitrary
node na in a graph as the set of nodes that na can receive messages from. We further define the
distance between two arbitrary nodes in a graph as the length of the shortest path between them
(i.e., the number of steps to travel from one node to another). Note that in each attention layer, the
messages can only travel by one step in the graph.
Without sacrificing generality, we assume that L is divisible by $C^{S-1}$, and then the number of nodes at the coarsest scale $S$ is $L/C^{S-1}$. Since every node is connected to the $A$ closest nodes at the same scale, the distance between the leftmost and the rightmost node at the coarsest scale is $2(L/C^{S-1} - 1)/(A-1)$. Hence, the leftmost and the rightmost node at the coarsest scale are in the receptive field of each other after the stack of $N \ge 2(L/C^{S-1} - 1)/(A-1)$ layers of the
pyramidal attention. In addition, owing to the CSCM, nodes at the coarsest scale can be regarded as
the summary of the nodes in the finer scales. As a result, when Equation (4) is satisfied, all nodes at
the coarsest scale have a global receptive field, which closes the proof.
C PROOF OF PROPOSITION 1
Proof. Suppose that $L^{(s)}$ denotes the number of nodes at scale $s$, that is,
$$
L^{(s)} = \frac{L}{C^{s-1}}, \quad 1 \le s \le S. \qquad (6)
$$
For a node $n_\ell^{(s)}$ in the pyramidal graph, the number of dot products $P_\ell^{(s)}$ in which it acts as the query can be decomposed into two parts:
$$
P_\ell^{(s)} = P_{\ell,\mathrm{intra}}^{(s)} + P_{\ell,\mathrm{inter}}^{(s)}, \qquad (7)
$$
where $P_{\ell,\mathrm{intra}}^{(s)}$ and $P_{\ell,\mathrm{inter}}^{(s)}$ denote the intra-scale and the inter-scale part, respectively. According to the structure of the pyramidal graph, we have the following inequalities:
$$
P_{\ell,\mathrm{intra}}^{(s)} \le A, \qquad (8)
$$
$$
P_{\ell,\mathrm{inter}}^{(s)} \le C + 1. \qquad (9)
$$
The first inequality (8) holds since a node typically attends to the A most adjacent nodes at the same scale, but for the leftmost and the rightmost nodes, the number of in-scale nodes they can attend to is smaller than A. On the other hand, the second inequality (9) holds because a node typically has C children and 1 parent in the pyramidal graph, but nodes at the top and the bottom scale can only attend to fewer than C + 1 nodes at adjacent scales.
In summary, the number of dot products that need to be calculated for scale $s$ is:
$$
P^{(s)} = \sum_{\ell=1}^{L^{(s)}} \left( P_{\ell,\mathrm{intra}}^{(s)} + P_{\ell,\mathrm{inter}}^{(s)} \right) \le L^{(s)}(A + C + 1). \qquad (10)
$$
Note that $P^{(1)} \le L(A + 1)$ for the finest scale (i.e., $s = 1$) since nodes at this scale do not have any children. It follows that the number of dot products that need to be calculated for the entire pyramidal attention layer is:
$$
P = \sum_{s=1}^{S} P^{(s)}.
$$
In order to guarantee that the nodes at the coarsest scale have a global receptive field, we choose $C$ such that $C \propto \sqrt[S-1]{L}$. Consequently, the complexity of the proposed pyramidal attention is:
$$
\begin{aligned}
O(P) &\le O\!\left(L\Big((A+2)\sum_{s=1}^{S} C^{-(s-1)} + 1\Big)\right) \\
&= O\!\left(L(A+2)\sum_{s=1}^{S} C^{-(s-1)}\right) \\
&= O\!\left(\frac{(A+2)\big(L^{\frac{S}{S-1}} - 1\big)}{L^{\frac{1}{S-1}} - 1}\right) \\
&= O\!\left(\frac{A\big(L^{\frac{S}{S-1}} - 1\big)}{L^{\frac{1}{S-1}} - 1}\right). \qquad (12)
\end{aligned}
$$
When L approaches infinity, the above expression amounts to O(AL). Since A can be fixed when
L changes, the complexity can be further reduced to O(L).
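A brute-force count of the Q-K pairs in one pyramidal attention layer, following the neighborhoods of Equation (2), makes the linear scaling of Proposition 1 tangible; the function below is an illustrative check of our own, not part of the paper's code.

```python
def count_qk_pairs(L, S, A, C):
    """Count query-key dot products in one PAM layer by enumerating each node's neighbors."""
    half = (A - 1) // 2
    total = 0
    for s in range(1, S + 1):
        L_s = L // C ** (s - 1)
        for ell in range(1, L_s + 1):
            intra = min(L_s, ell + half) - max(1, ell - half) + 1   # adjacent nodes, incl. the node itself
            children = C if s >= 2 else 0
            parent = 1 if s <= S - 1 else 0
            total += intra + children + parent
    return total

# With A = 3, C = 4, S = 4 fixed, doubling L roughly doubles the count, i.e., O(A*L) growth.
```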
D PROOF OF PROPOSITION 2
Proof. Let $n_\ell^{(s)}$ represent the $\ell$-th node of the $s$-th scale. It is evident that the distance between $n_1^{(1)}$ and $n_L^{(1)}$ is the largest among all pairs of nodes in the pyramidal graph. The shortest path to travel from $n_1^{(1)}$ to $n_L^{(1)}$ is:
$$
n_1^{(1)} \to n_1^{(2)} \to \cdots \to n_1^{(S)} \to \cdots \to n_{L^{(S)}}^{(S)} \to n_{L^{(S-1)}}^{(S-1)} \to \cdots \to n_L^{(1)}. \qquad (13)
$$
Correspondingly, the length of the maximum path between two arbitrary nodes in the graph is:
$$
L_{\max} = 2(S-1) + \frac{2(L^{(S)} - 1)}{A - 1}. \qquad (14)
$$
When $C$ satisfies Equation (5), that is, $L^{(S)} - 1 \le (A-1)N/2$, we can obtain:
$$
\begin{aligned}
O(L_{\max}) &= O\!\left(2(S-1) + \frac{2(L^{(S)} - 1)}{A - 1}\right) \\
&= O\!\left(2(S-1) + \frac{2\big(\frac{L}{C^{S-1}} - 1\big)}{A - 1}\right) \\
&= O(2(S-1) + N) \\
&= O(S + N). \qquad (15)
\end{aligned}
$$
Since $A$, $S$ and $N$ are invariant with $L$, the order of the maximum path length $L_{\max}$ can be further simplified as $O(1)$.
E DATASETS
We demonstrated the advantages of the proposed Pyraformer on the following four datasets. The first three datasets were used for single-step forecasting, while the last two were used for long-range multi-step forecasting.
Wind2 : This dataset contains hourly estimation of the energy potential in 28 countries between
1986 and 2015 as a percentage of a power plant’s maximum output. Compared with the remaining
datasets, it is more sparse and periodically exhibits a large number of zeros. Due to the large size of
this dataset, the ratio between training and testing set was roughly 32:1.
App Flow: This dataset was collected at Ant Group3 . It consists of hourly maximum traffic flow for
128 systems deployed on 16 logic data centers, resulting in 1083 different time series in total. The
length of each series is more than 4 months. Each time series was divided into two segments for
training and testing respectively, with a ratio of 32:1.
Electricity4 (Yu et al., 2016): This dataset contains time series of electricity consumption recorded
every 15 minutes from 370 users. Following DeepAR (Salinas et al., 2020), we aggregated every
4 records to get the hourly observations. This dataset was employed for both single-step and long-
range forecasting. We trained with data from 2011-01-01 to 2014-09-01 for single-step forecasting,
and from 2011-04-01 to 2014-04-01 for long-range forecasting.
ETT5 (Zhou et al., 2021): This dataset comprises 2 years of data from 2 electricity transformers collected at 2 stations, including the oil temperature and 6 power load features. Observations every hour
(i.e., ETTh1) and every 15 minutes (i.e., ETTm1) are provided. This dataset is typically exploited
for model assessment on long-range forecasting. Here, we followed Informer (Zhou et al., 2021)
and partitioned the data into 12 and 4 months for training and testing respectively.
F EXPERIMENT SETUP
We set S = 4 and N = 4 for Pyraformer in all experiments. When the historical length L is not
divisible by C, we only introduced ⌊L/C⌋ nodes in the upper scale, where ⌊·⌋ denotes the round
down operation. The last L − (⌊L/C⌋ − 1)C nodes at the bottom scale were all connected to the
last node at the upper scale. For single-step forecasting, we set C = 4, A = 3, and H = 4 in
all experiments. Both training and testing used a fixed-size historical sequence to predict the mean
and variance of the Gaussian distribution of a single future value. We chose the MSE loss and the
log-likelihood (Zuo et al., 2020) as our loss functions. The ratio between them was set to 100. For
optimization, we used Adam with the learning rate starting from $10^{-5}$ and halved every epoch. We trained Pyraformer for 10 epochs. A weighted sampler based on each window's average value and hard sample mining were used to improve the generalization ability of the network. On the
other hand, for long-range forecasting, we tested four combinations of A and C in each experiment,
and the best results were presented. Specifically, when the prediction length is smaller than 600, we
tested A = 3, 5 and C = 4, 5. When the prediction length is larger than 600, we tested A = 3, 5
and C = 5, 6. The resulting choice of hyper-parameters for each experiment is listed in Table 5.
In addition, the loss function was the MSE loss only. We still used Adam as our optimizer, but the learning rate started from $10^{-4}$ and was reduced to one-tenth every epoch. We set the number of epochs to 5.
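A hedged sketch of the single-step training objective described above is given below: an MSE term on the predicted mean combined with a Gaussian log-likelihood term at a 100:1 ratio. The log-variance parameterization, the function name, and which term carries the factor of 100 are our assumptions.

```python
import torch

def single_step_loss(pred_mean, pred_logvar, target, mse_weight=100.0):
    """MSE on the predicted mean plus a Gaussian negative log-likelihood (up to a constant)."""
    mse = torch.mean((pred_mean - target) ** 2)
    nll = 0.5 * torch.mean(pred_logvar + (target - pred_mean) ** 2 / pred_logvar.exp())
    return mse_weight * mse + nll
```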
G PRETRAINING
For single-step forecasting, the value to be predicted is usually close to the last value of history.
Since we only use the last nodes of all scales to predict, the network tends to focus only on short-
term dependencies. To force the network to capture long-range dependencies, we add additional
supervision in the first few epochs of training. Specifically, in the first epoch, we form our network
as an auto-encoder, as shown in Figure 5.
2 The Wind dataset can be downloaded at https://www.kaggle.com/sohier/30-years-of-european-wind-generation
3 The App Flow dataset does not contain any Personal Identifiable Information and is desensitized and encrypted. Adequate data protection was carried out during the experiment to prevent the risk of data copy leakage, and the dataset was destroyed after the experiment. It is only used for academic research and does not represent any real business situation. The download link is https://github.com/alipay/Pyraformer/tree/master/data/app_zone_rpc_hour_encrypted.csv
4 The Electricity dataset can be downloaded at https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
5 The ETT dataset can be downloaded at https://github.com/zhouhaoyi/ETDataset
Figure 5: The pretraining strategy for one-step prediction. Features of nodes surrounded by the
dashed ellipses are concatenated to recover the corresponding input value.
Apart from predicting future values, the PAM is also trained to recover the input values. Note that we test all methods with and without this pretraining
strategy and the better results are displayed in Table 2.
H METRICS
Denote the target value as $z_{j,t}$ and the predicted value as $\hat{z}_{j,t}$, where $j$ is the sample index and $t$ is the time index. Then NRMSE and ND are calculated as follows:
$$
\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{NT}\sum_{j=1}^{N}\sum_{t=1}^{T}(z_{j,t} - \hat{z}_{j,t})^2}}{\frac{1}{NT}\sum_{j=1}^{N}\sum_{t=1}^{T}|z_{j,t}|}, \qquad (16)
$$
$$
\mathrm{ND} = \frac{\sum_{j=1}^{N}\sum_{t=1}^{T}|z_{j,t} - \hat{z}_{j,t}|}{\sum_{j=1}^{N}\sum_{t=1}^{T}|z_{j,t}|}. \qquad (17)
$$
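For reference, Equations (16) and (17) translate directly into the following NumPy functions (the names are ours); z and z_hat are arrays of shape (N, T).

```python
import numpy as np

def nrmse(z, z_hat):
    """Normalized RMSE as in Equation (16)."""
    return np.sqrt(np.mean((z - z_hat) ** 2)) / np.mean(np.abs(z))

def nd(z, z_hat):
    """Normalized deviation as in Equation (17)."""
    return np.sum(np.abs(z - z_hat)) / np.sum(np.abs(z))
```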
I EXPERIMENTS ON SYNTHETIC DATA

Specifically, each time series in the synthetic dataset is a linear combination of three sine functions
of different periods: 24, 168, and 720, that is,
$$
f(t) = \beta_0 + \beta_1 \sin\!\left(\frac{2\pi}{24}t\right) + \beta_2 \sin\!\left(\frac{2\pi}{168}t\right) + \beta_3 \sin\!\left(\frac{2\pi}{720}t\right). \qquad (18)
$$
In the above equation, the coefficients of the three sine functions, $\beta_1$, $\beta_2$, and $\beta_3$, are uniformly sampled from [5, 10] for each time series. $\beta_0$ is a Gaussian process with covariance function $\Sigma_{t_1,t_2} = |t_1 - t_2|^{-1}$ for $t_1 \neq t_2$ and $\Sigma_{t_1,t_1} = \Sigma_{t_2,t_2} = 1$, where $t_1$ and $t_2$ denote two arbitrary time stamps. Such polynomially decaying covariance functions are known to have long-range dependence, as opposed to exponentially decaying covariance functions (Yu et al., 2019). The start time of each time series, $t_0$, is uniformly sampled from [0, 719]. We first generate 60 time series of length 14400, and then
split each time series into sliding windows of width 1440 with a stride of 24. In our experiments, we
use the historical 720 time points to predict the future 720 points. Since both the deterministic and
stochastic parts of the synthetic time series have long-range correlations, such dependencies should
be well captured in the model in order to yield accurate predictions of the next 720 points. The
results are summarized in Table 6. Here, we consider two different configurations of Pyraformer: 1) C = 6 for all scales in the pyramidal graph (denoted as Pyraformer_{6,6,6}); 2) C = 12, 7, and 4 for the three layers sequentially from bottom to top (denoted as Pyraformer_{12,7,4}).
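A sketch of the synthetic data generation in Equation (18) is given below; generating a shorter window instead of the full 14400-step series and ignoring possible slight non-positive-definiteness of the polynomially decaying covariance are our own simplifications.

```python
import numpy as np

def make_synthetic_series(length=1440, rng=None):
    """One synthetic window: three sinusoids (periods 24, 168, 720) plus the Gaussian process beta_0."""
    rng = rng if rng is not None else np.random.default_rng()
    t = rng.uniform(0, 720) + np.arange(length)                # random start time t0
    b1, b2, b3 = rng.uniform(5, 10, size=3)                    # coefficients beta_1..beta_3
    seasonal = (b1 * np.sin(2 * np.pi * t / 24)
                + b2 * np.sin(2 * np.pi * t / 168)
                + b3 * np.sin(2 * np.pi * t / 720))
    # covariance with unit diagonal and polynomial decay |t1 - t2|^{-1} off the diagonal
    lags = np.abs(np.subtract.outer(np.arange(length), np.arange(length))).astype(float)
    cov = 1.0 / np.maximum(lags, 1.0)
    beta0 = rng.multivariate_normal(np.zeros(length), cov, check_valid='ignore')
    return seasonal + beta0
```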
It can be observed that Pyraformer_{6,6,6} with the same C for all scales already outperforms the benchmark methods by a large margin. In particular, the MSE given by Pyraformer is decreased by 18.2% compared with Reformer, which produces the smallest MSE among the existing variants of Transformer. On the other hand, by exploiting the information of the known periods, Pyraformer_{12,7,4} performs even better than Pyraformer_{6,6,6}. Note that in Pyraformer_{12,7,4}, nodes at scales 2, 3, and 4 characterize coarser temporal resolutions respectively corresponding to half a day, half a week, and half a month. We also tested Pyraformer_{24,7,4}, but setting C = 24 in the second scale degrades the performance, probably because a convolution layer with a kernel size of 24 is difficult to train.
We further visualized the forecasting results produced by Pyraformer12,7,4 in Figure 6. The blue
solid curve and red dashed curve denote the true and predicted time series respectively. By capturing
the temporal dependencies with different ranges, the prediction resulting from Pyraformer closely
follows the ground truth.
On the other hand, to check whether Pyraformer can extract features with different temporal resolu-
tions, we depicted the extracted features in a randomly selected channel across time at each scale in
the pyramidal graph in Figure 7. It is apparent that the features at the coarser scales can be regarded
as a lower resolution version of the features at the finer scales.
J ABLATION STUDY
J.1 IMPACT OF A AND C
We studied the impact of A and C on the performance of Pyraformer for long-range time series
forecasting, and showed the results in Table 7. Here, we focus on the dataset ETTh1.
Figure 6: Forecasting results of Pyraformer_{12,7,4} on the synthetic dataset: the ground truth and the predicted values plotted over time.
Figure 7: Visualization of the extracted features across time in second channel at different scales:
(a) scale 1; (b) scale 2; (c) scale 3.
The history length is 336 and the prediction length is 720. From Table 7, we can conclude that the receptive fields
of the nodes at the coarsest scale in the PAM play an indispensable role in reducing the prediction
error of Pyraformer. For instance, there are 42 nodes at the coarsest scale when C = 2. Without the
intra-scale connections, each node can only receive messages from 16 nodes at the finest scale. As
the number of adjacent connections A in each scale increases, the receptive fields of the coarsest-
scale nodes also extend, and therefore, the prediction error decreases accordingly. However, as long
as the nodes at the top scale have a global receptive field, further increasing A will not bring large
gains. For C = 5, the performance does not improve even though A increases. Such observations
indicate that it is better to set A to be small once the uppermost nodes in the PAM have a global
receptive field. In practice, we only increase C with the increase of L, but keep A small.
In addition to convolution, there exist other mechanisms for constructing the C-ary tree, such as
max pooling and average pooling. We studied the impact of different CSCM architectures on the
performance for long-range forecasting on dataset ETTh1. The history and the prediction length
are both 168 and C = 4 for all mechanisms. The results are listed in Table 8. From Table 8, we
can tell that: (1) Using pooling layers instead of convolution typically degrades the performance.
However, the performance of Pyraformer based on max pooling is still superior to that of Informer,
demonstrating the advantages of the PAM over the prob-sparse attention in Informer. (2) The MSE
of convolution with the bottleneck is only 1.51% larger than that without bottleneck, but the number
Table 8: Impact of the CSCM architecture on long-range forecasting. Parameters introduced by the normalization layers are relatively few, and thus, are ignored.

CSCM                 MSE    MAE    Parameters
Max-pooling          0.842  0.700  0
Average-pooling      0.833  0.693  0
Conv.                0.796  0.679  3147264
Conv. w/bottleneck   0.808  0.683  328704

Table 9: Impact of history length. The prediction length is 1344.

History Length  MSE    MAE
84              1.234  0.856
168             1.226  0.868
336             1.108  0.835
672             1.057  0.806
1344            1.062  0.806
of parameters is reduced by almost 90%. Thus, we adopt the more compact module of convolution
with bottleneck as our CSCM.
We also checked the influence of the history length on the prediction accuracy. The dataset is
ETTm1, since its granularity is minute and contains more long-range dependencies. We fixed the
prediction length to 1344 and changed the history length from 84 to 1344 in Table 9. As expected,
a longer history typically improves prediction accuracy. On the other hand, this performance gain
starts to level off when introducing more history stops providing new information. As shown in
Figure 8, the time series with length 672 contains almost all periodicity information that is essential
for prediction, while length 1344 introduces more noise.
Finally, we investigated the importance of the PAM. We compared the performance of Pyraformer
with and without the PAM on the dataset ETTm1. For a fair comparison, the number of parameters
of the two methods were controlled to be within the same order of magnitude. More precisely, we
increased the bottleneck dimension of "Conv. w/bottleneck" for the model with only the CSCM.
The results are shown in Table 10. Obviously, the PAM is vital to yield accurate predictions.
Figure 8: Time series with different lengths in the ETTm1 dataset. The sequence length in (a) and
(b) is 672, and that in (c) and (d) is 1344. The time series in (a) and (b) corresponds to the latter half
of those in (c) and (d) respectively.
select a C that satisfies Equation (5). In practice, we can use a validation set to choose C from the candidates that satisfy Equation (5). It is also worthwhile to check whether choosing a different C for different scales based on the granularity of the time series can further improve the performance, as we did in Appendix I.