A Survey on Time-Series Pre-Trained Models
Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, Zhongzhong Yu, and James T. Kwok
Abstract—Time-Series Mining (TSM) is an important research area since it shows great potential in practical applications. Deep learning
models that rely on massive labeled data have been utilized for TSM successfully. However, constructing a large-scale well-labeled
dataset is difficult due to data annotation costs. Recently, Pre-Trained Models have gradually attracted attention in the time series domain
due to their remarkable performance in computer vision and natural language processing. In this survey, we provide a comprehensive
review of Time-Series Pre-Trained Models (TS-PTMs), aiming to guide the understanding, application, and study of TS-PTMs. Specifically,
we first briefly introduce the typical deep learning models employed in TSM. Then, we give an overview of TS-PTMs according to the
pre-training techniques. The main categories we explore include supervised, unsupervised, and self-supervised TS-PTMs. Further,
extensive experiments are conducted to analyze the advantages and disadvantages of transfer learning strategies, Transformer-based
models, and representative TS-PTMs. Finally, we point out some potential directions of TS-PTMs for future work. The source code is
available at https://github.com/qianlima-lab/time-series-ptms.
Index Terms—Time-Series Mining, Pre-Trained Models, Deep Learning, Transfer Learning, Transformer
Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, and Zhongzhong Yu are with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]. (Corresponding author: Qianli Ma.)
James T. Kwok is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. E-mail: [email protected].

1 INTRODUCTION
As an important research direction in the field of data mining, Time-Series Mining (TSM) has been widely utilized in real-world applications, such as finance [1], speech analysis [2], action recognition [3], [4], and traffic flow forecasting [5], [6]. The fundamental problem of TSM is how to represent the time-series data [7], [8]. Various mining tasks can then be performed based on the given representations. Traditional time-series representations (e.g., shapelets [9]) are time-consuming to obtain due to their heavy reliance on domain or expert knowledge. Therefore, it remains challenging to learn appropriate time-series representations automatically.
In recent years, deep learning models [10], [11], [12], [13], [14] have achieved great success in a variety of TSM tasks. Unlike traditional machine learning methods, deep learning models do not require time-consuming feature engineering. Instead, they automatically learn time-series representations in a data-driven manner. However, the success of deep learning models relies on the availability of massive labeled data. In many real-world situations, it can be difficult to construct a large well-labeled dataset due to data acquisition and annotation costs.
To alleviate the reliance of deep learning models on large datasets, approaches based on data augmentation [15], [16] and semi-supervised learning [17] have been commonly used. Data augmentation can effectively enhance the size and quality of the training data, and has been used as an important component in many computer vision tasks [18]. However, unlike image data augmentation, time-series data augmentation also needs to consider properties such as temporal dependencies and multi-scale dependencies in the time series. Moreover, the design of time-series data augmentation techniques generally relies on expert knowledge. On the other hand, semi-supervised methods employ a large amount of unlabeled data to improve model performance. However, in many cases, even unlabeled time-series samples can be difficult to collect (e.g., electrocardiogram time series in healthcare [19], [20]).
Another effective solution for alleviating the problem of insufficient training data is transfer learning [21], [22], which relaxes the assumption that the training and test data must be independently and identically distributed. Transfer learning usually has two stages: pre-training and fine-tuning. During pre-training, the model is trained on source domains that contain a large amount of data and are separate from, but relevant to, the target domain. During fine-tuning, the pre-trained model (PTM) is fine-tuned on the often limited data from the target domain.
Recently, PTMs, particularly Transformer-based PTMs, have achieved remarkable performance in various Computer Vision (CV) [23], [24] and Natural Language Processing (NLP) [25] applications. Inspired by these, recent studies consider the design of Time-Series Pre-Trained Models (TS-PTMs) for time-series data. First, a time-series model is pre-trained by supervised learning [26], [27], unsupervised learning [28], [29], or self-supervised learning [30], [31], [32] so as to obtain appropriate representations. The TS-PTM is then fine-tuned on the target domain for improved performance on the downstream TSM tasks (e.g., time-series classification and anomaly detection).
Supervised TS-PTMs [26], [33] are typically pre-trained by classification or forecasting tasks. However, the difficulty of obtaining massive labeled time-series datasets for pre-training often limits the performance of supervised TS-PTMs. In contrast, unsupervised TS-PTMs utilize unlabeled data for pre-training, which further tackles the limitation of insufficient labeled data. For example, reconstruction-based TS-PTMs [28] employ auto-encoders and a reconstruction loss to pre-train time-series models. Recently,
(ii) transformer, (iii) inherent properties, (iv) adversarial attacks, and (v) noisy labels.
The remainder of this paper is organized as follows. Section 2 provides background on TS-PTMs. A comprehensive review of TS-PTMs is then given in Section 3. Section 4 presents experiments on the various TS-PTMs. Section 5 suggests some future directions. Finally, we summarize our findings in Section 6.

[Fig. 1. (a) Supervised pre-training models: a base network is pre-trained on a labeled source dataset via classification or forecasting. (b) Unsupervised pre-training models: a base network is pre-trained on an unlabeled source dataset via reconstruction.]
Fig. 2. Deep learning models used for time-series mining: (a) RNN, (b) CNN, (c) TCN, (d) Transformer.

2.1.5 Time Series Imputation
Time-Series Imputation (TSI) [45] aims to replace missing values in a time series with realistic values so as to facilitate downstream TSM tasks. Given a time series X = [x_1, ..., x_t, ..., x_T] and a binary mask M = [m_1, ..., m_t, ..., m_T], where x_t is missing if m_t = 0 and observed otherwise, TSI imputes the missing values as:

X_imputed = X ⊙ M + X̂ ⊙ (1 − M),

where X̂ denotes the values predicted by the TSI technique and ⊙ is element-wise multiplication. As conditional generation models, TSI techniques have been studied [46] and applied to areas such as gene expression [47] and healthcare [48], [49].
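To make the imputation step concrete, the following is a minimal NumPy sketch of the masking-and-filling operation above; the model producing X̂ is left as a placeholder (here a simple mean imputer), so the function and variable names are illustrative rather than part of any specific TSI method.

```python
import numpy as np

def impute(X: np.ndarray, M: np.ndarray, X_hat: np.ndarray) -> np.ndarray:
    """Combine observed values X and predictions X_hat using the binary mask M.

    X, M, X_hat share the same shape; M[t] == 1 means x_t is observed.
    Implements X_imputed = X * M + X_hat * (1 - M).
    """
    return X * M + X_hat * (1 - M)

# Toy example: a univariate series with two missing points.
X = np.array([1.0, 2.0, 0.0, 4.0, 0.0])   # missing entries stored as 0
M = np.array([1,   1,   0,   1,   0  ])   # mask: 0 marks a missing value

# A trivial "model": predict every position with the mean of the observed values.
# A real TSI technique would replace this with a learned conditional generator.
X_hat = np.full_like(X, X[M == 1].mean())

print(impute(X, M, X_hat))   # -> [1.   2.   2.33  4.   2.33] (approximately)
```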
2.2 Deep Learning Models for Time Series

2.2.1 Recurrent Neural Networks
A Recurrent Neural Network (RNN) [50] usually consists of an input layer, one or more recurrent hidden layers, and an output layer (Fig. 2(a)). In the past decade, RNNs and their variants (such as the Long Short-Term Memory (LSTM) network [51] and Gated Recurrent Units (GRUs) [50]) have achieved remarkable success in TSM. For example, Muralidhar et al. [52] combined dynamic attention and an RNN-based sequence-to-sequence model for time-series forecasting. Ma et al. [53] employed a multi-layer Dilated RNN to extract multi-scale temporal dependencies for time-series clustering.

2.2.2 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) [54] were originally designed for computer vision tasks. A typical CNN is shown in Fig. 2(b). To use CNNs for TSM, the data first need to be encoded in an image-like format. The CNN receives an embedding of the value at each time step and then aggregates local information from nearby time steps using convolution. CNNs have been shown to be very effective for TSM. For example, the multi-scale convolutional neural network [55] can automatically extract features at different scales with a multi-branch layer and convolutional layers. Kashiparekh et al. [56] incorporated filters of multiple lengths in all convolutional layers to capture multi-scale temporal features for time-series classification.
Unlike vanilla CNNs, Temporal Convolutional Networks (TCNs) [57] use a fully convolutional network [58] so that all the layers are of the same length, and employ causal convolutions with no information "leakage" from future to past. A typical TCN is shown in Fig. 2(c). Compared to recurrent networks, TCNs have recently been shown to be more accurate, simpler, and more efficient across a diverse range of sequence-modeling tasks [59]. For example, Sen et al. [13] combined a local temporal network and a global matrix factorization model regularized by a TCN for time-series forecasting.
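As an illustration of the causal convolutions mentioned above, the sketch below builds a small dilated causal 1-D convolution stack in PyTorch; it is a generic example of the idea rather than the exact TCN architecture of [57] (the layer sizes, dilation schedule, and the absence of residual blocks are simplifying assumptions).

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past time steps (no future leakage)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        # Left-pad so that the output at time t depends only on inputs up to time t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, length)
        x = nn.functional.pad(x, (self.pad, 0))  # pad only on the left
        return self.conv(x)

# A tiny causal stack with exponentially increasing dilation (1, 2, 4),
# so the receptive field grows while the sequence length stays fixed.
tcn = nn.Sequential(
    CausalConv1d(1, 16, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(16, 16, kernel_size=3, dilation=2), nn.ReLU(),
    CausalConv1d(16, 1, kernel_size=3, dilation=4),
)

x = torch.randn(8, 1, 128)      # batch of 8 univariate series of length 128
print(tcn(x).shape)             # -> torch.Size([8, 1, 128])
```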
2.2.3 Transformers
Transformers [60], [61] integrate information from data points in the time series by dynamically computing the associations between representations with self-attention. A typical Transformer is shown in Fig. 2(d). Transformers have shown great power in TSM due to their strong capacity to model long-range dependencies. For example, Zhou et al. [14] combined a self-attention mechanism (with O(L log L) time and space complexity) and a generative decoder for long time-series forecasting.

2.3 Why Pre-Trained Models?
With the rapid development of deep learning, deep learning-based TSM has received more and more attention. In recent years, deep learning models have been widely used in TSM and have achieved great success. However, as data acquisition and annotation can be expensive, the limited labeled time-series data often hinders sufficient training of deep learning models. For example, in bioinformatics, time-series classification involves assigning each sample to a specific category, such as a clinical diagnosis result. Nonetheless, obtaining accurate labels for all samples is a challenge, as classifying the samples requires expert knowledge. Therefore, pre-training strategies have been proposed to alleviate this data sparseness problem. The advantages of Pre-Trained Models (PTMs) for TSM can be summarized as follows:
• PTMs provide better model initialization for the downstream TSM tasks, which generally results in better generalization performance.
• PTMs can automatically obtain appropriate time-series representations by pre-training on source datasets, thus avoiding over-reliance on expert knowledge.

3 OVERVIEW OF TS-PTMS
In this section, we propose a new taxonomy of TS-PTMs, which systematically classifies existing TS-PTMs based on pre-training techniques. The taxonomy of TS-PTMs is shown in Fig. 3; please refer to Appendix A.1 for a literature summary of TS-PTMs.

[Fig. 3. Taxonomy of TS-PTMs by pre-training technique: supervised PTMs comprise classification-based PTMs (universal encoder, aligned encoder, model reprogramming) and forecasting-based PTMs (autoregression, adaptive encoder); unsupervised PTMs comprise reconstruction-based PTMs (autoencoder, transformer encoder); self-supervised PTMs comprise consistency-based PTMs (subseries, temporal, transformation, and contextual consistency) and pseudo-labeling PTMs.]

3.1 Supervised PTMs
The early TS-PTMs are inspired by transfer learning applications in CV. Many vision-based PTMs are trained on large labeled datasets such as ImageNet [62]. The corresponding weights are then fine-tuned on the target dataset, which is usually small. This strategy has been shown to improve
dataset. Due to privacy and annotation issues, it may be difficult to obtain source datasets that are very similar to the target dataset. To alleviate this problem, Meiseles et al. [71] utilized the clustering property of the latent encoding space categories as an indicator to select the best source dataset. The above studies mainly focus on univariate time series. Li et al. [67] proposed a general architecture that can be used for transfer learning on multivariate time series.
The aforementioned works employ CNNs as the backbone for time-series transfer learning. However, vanilla CNNs have difficulty in capturing multi-scale information and long-term dependencies in the time series. Studies [55], [66], [72] have shown that using different time scales or combining an LSTM with vanilla CNNs can further improve classification performance. For example, Kashiparekh et al. [56] proposed a novel pre-trained deep CNN in which 1-D convolutional filters of multiple lengths are used to capture features at different time scales. Mutegeki et al. [73] used a CNN-LSTM as the base network to explore how transfer learning can improve the performance of time-series classification with few labeled time-series samples.

Fig. 5. The aligned encoder aims to learn domain-invariant representations: (a) alignment between source and target time series via Maximum Mean Discrepancy (MMD); (b) adversarial learning with a feature extractor, Gradient Reversal Layer (GRL), task classifier, and domain classifier.

Aligned Encoder A universal encoder first pre-trains the model with the source dataset, and then fine-tunes the model using the target dataset. However, the difference between the source and target data distributions is often not considered. To address this issue, some recent works [74], [75] first map the source and target datasets to a shared feature representation space, and then prompt the model to learn domain-invariant representations during pre-training. The pre-training strategy of the aligned encoder is at the core of domain adaptation, and has been extensively studied on image data [76]. For time-series data, extracting domain-invariant representations is difficult due to distribution shifts among timestamps and the associative structure among variables. To this end, existing pre-training techniques for the time-series aligned encoder are based on either Maximum Mean Discrepancy (MMD) [77], [78] or adversarial learning [79], [80].
MMD [81] is a standard metric on distributions, and has been employed to measure the dissimilarity of two distributions in domain adaptation [82]. Given a representation f(·) on source data X_s ∈ D_s and target data X_t ∈ D_t, the empirical approximation of MMD is:

MMD(D_s, D_t) = ‖ (1/|D_s|) Σ_{X_s ∈ D_s} f(X_s) − (1/|D_t|) Σ_{X_t ∈ D_t} f(X_t) ‖.
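As a concrete illustration of this empirical approximation, the sketch below computes the (linear-kernel) MMD between batches of source and target features in PyTorch, which could be added to a classification loss as an alignment term; the encoder, weighting factor, and variable names are assumptions for illustration, not the formulation of any particular method cited here.

```python
import torch

def mmd_linear(f_src: torch.Tensor, f_tgt: torch.Tensor) -> torch.Tensor:
    """Empirical MMD with a linear kernel: distance between the two feature means.

    f_src: (n_s, d) features f(X_s) of source samples.
    f_tgt: (n_t, d) features f(X_t) of target samples.
    """
    return (f_src.mean(dim=0) - f_tgt.mean(dim=0)).norm(p=2)

# Illustrative use inside a pre-training step:
#   feats_s = encoder(x_source)          # (n_s, d)
#   feats_t = encoder(x_target)          # (n_t, d)
#   loss = cls_loss + lambda_mmd * mmd_linear(feats_s, feats_t)
f_src, f_tgt = torch.randn(32, 64), torch.randn(48, 64)
print(mmd_linear(f_src, f_tgt))
```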
MMD-based methods [83], [84] learn domain-invariant representations by minimizing the MMD between the source and target domains during classification training (Fig. 5(a)). Khan et al. [85] used a CNN to extract features from the source and target domain data separately. The divergence of the source and target domains is reduced by minimizing the Kullback–Leibler divergence in each layer of the network. Wang et al. [83] proposed stratified transfer learning to improve the accuracy of cross-domain activity recognition. Moreover, considering that time lags or offsets can influence the extraction of domain-invariant features, Cai et al. [74] designed a sparse associative structure alignment model which assumes that the causal structures are stable across domains. Li et al. [84] considered the compact causal mechanisms among variables and the varying strength of association, and used the Granger causality alignment model [86] to discover the data's causal structure. Ragab et al. [87] proposed an autoregressive domain discriminator to explicitly address the temporal dependencies during both representation learning and domain alignment. Liu et al. [88] minimized the domain divergence with a reinforced MMD metric embedded in a hybrid spectral kernel network for time-series distributions. Despite all these advances, the use of MMD-based methods on time-series data is still challenging due to the underlying complex dynamics.
Another common approach is to learn a domain-invariant representation between the source and target domains through adversarial learning [89], [90], [91]. For example, Wilson et al. [75] proposed the Convolutional deep Domain Adaptation model for Time Series data (CoDATS), which consists of a feature extractor, Gradient Reversal Layer (GRL), task classifier, and domain classifier (Fig. 5(b)). The adversarial step is performed by the GRL placed between the feature extractor and domain classifier in the network. CoDATS first updates the feature extractor and task classifier to classify the labeled source data. The domain classifier is then updated to distinguish which domain each sample comes from. At the same time, the feature extractor is adversarially updated to make it more difficult for the domain classifier to distinguish which domain each sample comes from. Li et al. [92] argued that temporal causal mechanisms should be considered and proposed a time-series causal mechanism transfer network to obtain a domain-invariant representation. However, the exploitation of inherent properties of the time series (such as multi-scale and frequency properties) in adversarial training still needs further exploration.
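The gradient reversal trick used in such adversarial alignment schemes is small enough to show in full. The sketch below is a generic PyTorch implementation of a GRL, not the CoDATS code itself; the surrounding training loop and module names in the comments are hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Quick check: the gradient that reaches z is reversed.
z = torch.randn(4, 8, requires_grad=True)
grad_reverse(z).sum().backward()
print(z.grad[0, 0])   # tensor(-1.)

# Illustrative use: the domain classifier sees features through the GRL, so
# minimizing the domain loss pushes the feature extractor to *confuse* domains.
#   feats = feature_extractor(x)                      # shared encoder
#   task_logits = task_classifier(feats)              # trained on labeled source data
#   domain_logits = domain_classifier(grad_reverse(feats, lam=1.0))
#   loss = task_loss(task_logits, y_src) + domain_loss(domain_logits, domain_labels)
```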
Model Reprogramming The great success of PTMs in CV [34] and NLP [25] shows that using a large-scale labeled dataset can significantly benefit the downstream tasks. However, most time-series datasets are not large. Recently,
Fig. 8. The adaptive encoder aims to obtain better initialization parameters θ so that the model can quickly generalize to new tasks using only a small number of samples (e.g., θ1*), where θi* (i ∈ {1, 2, 3}) denote task-adaptive parameters obtained by gradient descent on θ (e.g., via ∇L1) using the task-adaptive data.

Fig. 9. The RNN-based encoder-decoder architecture: the encoder maps the input series x1, ..., x4 to hidden states h1, ..., h4, and the decoder reconstructs the series in reverse order as x̂4, ..., x̂1.
datasets. Also, Autoformer [60] and FEDformer [61] show
that frequency domain information and Transformer archi-
learning [35] to obtain time series representations favorable tecture can improve time-series forecasting performance.
for downstream tasks. For example, Oord et al. [100] pro- Hence, it would be an interesting direction to explore task
posed contrastive predictive coding by employing model- adaptive Transformer-based models for TS-PTMs.
predicted timesteps as positive samples and randomly-
sampled timesteps as negative samples. Eldele et al. [38] Summary TSF-based PTMs can exploit the complex dy-
used the autoregressive model’s predicted values as positive namics in the time series and use that to guide the model
sample pairs for contrastive learning, thus enabling the in capturing temporal dependencies. Autoregression-based
model to capture temporal dependencies of the time series. models use the dependencies between subseries and the
In particular, they fed both strong and weak augmentation consistency of future predicted values of the same time
samples to the encoder, therefore using a cross-view TSF series, thus pre-training the time series data using TSF.
task as the PTM objective. Unlike classification-based PTMs that use manual labels
for pre-training, avoiding sampling bias among subseries
Adaptive Encoder Unlike transfer learning which fo- (e.g., outliers) [68] for pre-training based on the TSF task
cuses on the current learning ability of the model, meta- remain challenging. Meanwhile, the adaptive encoder based
learning [102], [103], [104] focuses on the future learning on meta-learning allows for scenarios with small time series
potential of the model to obtain an adaptive encoder via task- samples in the target dataset. In addition, regression-based
adaptive pre-training paradigm (as shown in Fig. 8). Espe- one-step forecasting models (e.g., RNNs) potentially lead
cially, transfer learning-based PTMs are prone to overfitting to bad performance due to accumulated error [10], [49].
on downstream tasks when the number of samples in the Instead, some studies [14], [60] employ Transformer-based
target dataset is small. Inspired by the fact that humans are models to generate all predictions in one forward operation.
good at learning a priori knowledge from a few new sam- Therefore, designing efficient TSF encoders would be a basis
ples, task-adaptive pre-training, or meta-learning , has also for studying TSF-based PTMs.
been used. Fig. 8 shows an adaptive encoder via a classic
task-adaptive pre-training algorithm called model agnostic
meta-learning [105], [106], [107]. Recently, task-adaptive pre- 3.2 Unsupervised PTMs
training strategies have been receiving increasing attention This section introduces unsupervised TS-PTMs, which are
in the field of time series. Existing studies focus on how to often pre-trained by reconstruction techniques. Compared
perform cross-task learning using properties such as long- with supervised TS-PTMs, unsupervised TS-PTMs are more
and short-term trends and multivariate dependencies in the widely applicable since they do not require labeled time-
time series. series samples.
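To make the task-adaptive idea concrete, the following is a minimal sketch of a first-order model-agnostic meta-learning (MAML)-style update for a time-series forecaster in PyTorch; the single inner step, first-order gradient approximation, model architecture, and task sampler are simplifying assumptions, not the exact procedure of the methods cited here.

```python
import copy
import torch
import torch.nn as nn

def fomaml_step(meta_model, tasks, inner_lr=0.01, meta_lr=0.001, loss_fn=nn.MSELoss()):
    """One first-order MAML meta-update over a batch of tasks.

    Each task is a tuple (x_support, y_support, x_query, y_query); the model maps
    a window of past values to future values (a simple forecasting setup).
    """
    meta_grads = [torch.zeros_like(p) for p in meta_model.parameters()]

    for x_s, y_s, x_q, y_q in tasks:
        learner = copy.deepcopy(meta_model)                 # start from θ
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)

        inner_opt.zero_grad()                               # inner step: θ -> θ_i*
        loss_fn(learner(x_s), y_s).backward()
        inner_opt.step()

        query_loss = loss_fn(learner(x_q), y_q)             # evaluate θ_i* on the query set
        grads = torch.autograd.grad(query_loss, learner.parameters())
        for g_acc, g in zip(meta_grads, grads):
            g_acc += g / len(tasks)                         # first-order approximation

    with torch.no_grad():                                   # meta-update of θ
        for p, g in zip(meta_model.parameters(), meta_grads):
            p -= meta_lr * g

# Toy usage: forecast the next value from a window of 24 past values.
model = nn.Sequential(nn.Linear(24, 32), nn.ReLU(), nn.Linear(32, 1))
tasks = [(torch.randn(16, 24), torch.randn(16, 1),
          torch.randn(16, 24), torch.randn(16, 1)) for _ in range(4)]
fomaml_step(model, tasks)
```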
A task-adaptive pre-training paradigm based on time-series forecasting has been applied to TS-PTMs [108], [109]. For example, Oreshkin et al. [109] proposed a meta-learning framework based on deep stacks of fully-connected networks for zero-shot time-series forecasting. The meta-learning procedure consists of a meta-initialization function, an update function, and a meta-learner. The meta-initialization function defines the initial predictor parameters for a given task. The update function then iteratively updates the predictor parameters based on the previous values of the predictor parameters and the given task. The final predictor parameters are obtained by performing a Bayesian ensemble (or weighted sum) over the whole sequence of parameters. The meta-learner learns shared knowledge across tasks by training on different tasks, thus providing suitable initial predictor parameters for new tasks. Brinkmeyer and Rego [110] developed a few-shot multivariate time-series forecasting model working across tasks with heterogeneous channels. Experimental results showed that it generalizes well to the target datasets. Also, Autoformer [60] and FEDformer [61] show that frequency-domain information and the Transformer architecture can improve time-series forecasting performance. Hence, it would be an interesting direction to explore task-adaptive Transformer-based models for TS-PTMs.
Summary TSF-based PTMs can exploit the complex dynamics in the time series and use them to guide the model in capturing temporal dependencies. Autoregression-based models use the dependencies between subseries and the consistency of future predicted values of the same time series, thus pre-training on time-series data via TSF. Unlike classification-based PTMs, which use manual labels for pre-training, avoiding sampling bias among subseries (e.g., outliers) [68] when pre-training with the TSF task remains challenging. Meanwhile, the adaptive encoder based on meta-learning caters to scenarios with few time-series samples in the target dataset. In addition, regression-based one-step forecasting models (e.g., RNNs) can potentially lead to poor performance due to accumulated errors [10], [49]. Instead, some studies [14], [60] employ Transformer-based models to generate all predictions in one forward operation. Therefore, designing efficient TSF encoders would be a basis for studying TSF-based PTMs.

3.2 Unsupervised PTMs
This section introduces unsupervised TS-PTMs, which are often pre-trained by reconstruction techniques. Compared with supervised TS-PTMs, unsupervised TS-PTMs are more widely applicable since they do not require labeled time-series samples.

3.2.1 Reconstruction-based PTMs
Reconstruction is a common unsupervised task, and is usually implemented by an encoder-decoder architecture [53], [111]. The encoder maps the original time series to a latent space of representations, which is then used by the decoder to reconstruct the input time series. The mean square error is often used as the reconstruction loss. For example, Castellani et al. [111] used reconstruction to learn robust representations for detecting noisy labels in the time series. Naturally, reconstruction is also utilized for TS-PTMs. In the following, we introduce TS-PTMs based on the (i) autoencoder, (ii) denoising autoencoder, and (iii) transformer encoder.
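The reconstruction objective can be summarized in a few lines of code. The sketch below pre-trains a GRU encoder-decoder with an MSE reconstruction loss, in the spirit of the RNN-based encoder-decoder in Fig. 9; the architecture sizes, the choice of GRU cells, and reconstructing the sequence in reversed order are illustrative assumptions rather than the design of any specific cited method.

```python
import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    """GRU encoder-decoder that reconstructs its input series (in reversed order)."""
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                         # x: (batch, T, n_features)
        _, h = self.encoder(x)                    # h: latent summary of the series
        # Feed zeros to the decoder and let the latent state drive the reconstruction.
        dec_in = torch.zeros_like(x)
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)

model = RecurrentAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

x = torch.randn(32, 100, 1)                       # unlabeled pre-training batch
target = torch.flip(x, dims=[1])                  # reconstruct the reversed series
loss = mse(model(x), target)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))                                # after pre-training, keep the encoder
```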
AutoEncoder For variable-length NLP sequences such as sentences, paragraphs, and documents, it has been very useful to first convert them to fixed-dimensional vectors
Transformation Consistency Two augmented sequences obtained from the same time series by different transformations are selected as a positive pair, which is called transformation consistency, as shown in Fig. 11(c). Inspired by the successful application of transformation consistency in CV, Eldele et al. [38] proposed a time-series representation learning framework via Temporal and Contextual Contrasting (TS-TCC). TS-TCC first transforms the original time-series data in two different ways to obtain weak-augmentation (jitter-and-scale) and strong-augmentation (permutation-and-jitter) variants, respectively. Then, temporal and contextual contrastive learning modules are designed in TS-TCC for learning discriminative representations.
In addition, Hyvärinen et al. [140] analyzed the non-stationarity of time series by performing a nonlinear mixture transformation of subseries from different time windows to ensure that the same subseries are consistent. Unlike [140], Hyvärinen and Morioka [141] developed a method based on a logistic regression estimation model to learn the dependencies along the time dimension by distinguishing subseries of the original time series from those with randomly transformed time points. Lately, Zhang et al. [31] utilized a time-based augmentation bank (jittering, scaling, time shifts, and neighborhood segments) and frequency-based augmentation methods (adding or removing frequency components) for time-series self-supervised contrastive pre-training via time-frequency consistency.
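The weak and strong augmentations mentioned above are simple array operations. The sketch below gives generic NumPy versions of jitter-and-scale and permutation-and-jitter; the noise levels, segment count, and function names are illustrative assumptions and not the exact hyperparameters used by TS-TCC.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_and_scale(x, sigma=0.1, scale_sigma=0.1):
    """Weak augmentation: add Gaussian noise and rescale each channel."""
    scale = rng.normal(1.0, scale_sigma, size=(1, x.shape[1]))
    return (x + rng.normal(0.0, sigma, size=x.shape)) * scale

def permutation_and_jitter(x, n_segments=5, sigma=0.1):
    """Strong augmentation: split the series into segments, shuffle them, then jitter."""
    segments = np.array_split(x, n_segments, axis=0)
    rng.shuffle(segments)
    permuted = np.concatenate(segments, axis=0)
    return permuted + rng.normal(0.0, sigma, size=permuted.shape)

x = np.sin(np.linspace(0, 10, 128))[:, None]     # a (T, C) univariate series
weak, strong = jitter_and_scale(x), permutation_and_jitter(x)
print(weak.shape, strong.shape)                  # both views feed the same encoder
```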
Contextual Consistency Two augmented contexts with the same timestamp inside the same time series are selected as a positive pair, which is called contextual consistency, as shown in Fig. 11(d). By designing a timestamp-masking and random-cropping strategy to obtain augmented contexts, Yue et al. [68] proposed a universal framework for learning time-series representations named TS2Vec. Specifically, TS2Vec distinguishes positive and negative samples hierarchically along the temporal dimension and at the instance level. In terms of temporal consistency, the augmented contexts with the same timestamp within the same time series are treated as positive pairs, while the augmented contexts at different timestamps are treated as negative pairs. Unlike temporal consistency, instance-level consistency treats the augmented contexts of other instances within a batch as negative pairs. Experiments on various time-series datasets indicate that TS2Vec achieves excellent performance on downstream classification, forecasting, and anomaly detection tasks.
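To illustrate how contextual consistency turns one series into contrastive pairs, the sketch below crops two overlapping windows from the same series, applies random timestamp masking, and computes a temporal contrastive loss over the timestamps shared by both crops. It is a simplified, generic rendering of the idea rather than TS2Vec's hierarchical loss, and the encoder, mask rate, and crop lengths are assumptions.

```python
import torch
import torch.nn.functional as F

def two_crops(x, crop_len=80, mask_rate=0.15):
    """x: (T, C). Return two masked crops, their overlap range, and their start offsets.

    (Resample a and b if the crops happen not to overlap, i.e., hi <= lo.)
    """
    T = x.shape[0]
    a = torch.randint(0, T - crop_len, (1,)).item()
    b = torch.randint(0, T - crop_len, (1,)).item()
    lo, hi = max(a, b), min(a, b) + crop_len          # overlapping timestamps
    crops = []
    for start in (a, b):
        c = x[start:start + crop_len].clone()
        mask = torch.rand(crop_len) < mask_rate        # timestamp masking
        c[mask] = 0.0
        crops.append(c)
    return crops[0], crops[1], (lo, hi), (a, b)

def temporal_contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (L, d) per-timestamp embeddings of the overlap from the two views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                    # (L, L) similarity matrix
    labels = torch.arange(z1.shape[0])                 # same timestamp = positive pair
    return F.cross_entropy(sim, labels)

# Illustrative usage with a per-timestamp encoder `enc`: (T, C) -> (T, d)
#   c1, c2, (lo, hi), (a, b) = two_crops(series)
#   z1 = enc(c1)[lo - a : hi - a]; z2 = enc(c2)[lo - b : hi - b]
#   loss = temporal_contrastive_loss(z1, z2)
```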
Recently, Yang et al. [142] proposed a novel representation learning framework for time series, namely Bilinear Temporal-Spectral Fusion (BTSF). Unlike TS2Vec [68], BTSF performs instance-level augmentation by simply applying a dropout strategy to obtain positive/negative samples for unsupervised contrastive learning, thus preserving the global context and capturing long-term dependencies of the time series. Further, an iterative bilinear temporal-spectral fusion module is designed in BTSF to iteratively refine the representation of the time series by encoding the affinities of abundant time-frequency pairs. Extensive experiments on downstream classification, forecasting, and anomaly detection tasks demonstrate the superiority of BTSF.
Summary The aforementioned studies demonstrate that consistency-based PTMs can obtain robust representations beneficial for TSM tasks. Although subseries and temporal consistency strategies have achieved good performance on the classification task, sampling biases between subseries (e.g., outliers and pattern shifts) [68] tend to introduce false positive pairs. Meanwhile, the transformation consistency strategy relies on effective instance-level data augmentation techniques. However, designing uniform time-series data augmentation techniques for datasets from different domains remains a challenge [16], [143]. One alternative is to utilize expert features of time series [144] instead of the commonly used data transformations for contrastive learning. The contextual consistency strategy utilizes a mask (or dropout) mechanism [68], [142] to obtain contrastive pairs by capturing temporal dependencies, which can alleviate the problem of sampling bias among subsequences and achieves excellent performance on forecasting, classification, and anomaly detection tasks. Nevertheless, designing consistency-based strategies that use the multi-scale property [66] of time series has not been fully explored.

3.3.2 Pseudo-labeling PTMs
In addition to the consistency-based PTMs mentioned above, some other self-supervised TS-PTMs have been explored to improve the performance of TSM. We briefly review these methods in this section.
Predict Pseudo Labels A large amount of correctly labeled data is the basis for the success of deep neural network models. However, labeling time-series data generally requires manual expert knowledge, resulting in high annotation costs. Meanwhile, representation learning using pseudo-label-guided models has yielded rich results in CV [145], [146]. In addition, some studies [147] employ self-supervised learning as an auxiliary task that is trained alongside the primary task to help the primary task learn better. In the auxiliary task, given a transformed sample, the model predicts which transformation was applied to the input (i.e., predicts pseudo labels). This idea has been applied in a few TS-PTMs [135]. For example, Fan et al. [135] randomly selected two length-L subsequences from the same time series and assigned a pseudo-label based on their temporal distance; the proposed model is then pre-trained by predicting the pseudo-label of subsequence pairs. Furthermore, Zhang et al. [36] incorporated expert features to create pseudo-labels for time-series self-supervised contrastive representation learning. Despite this progress, predicted pseudo labels inevitably contain incorrect labels. How to mitigate the negative impact of incorrect labels will be a focus of future work on PTMs.
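A minimal sketch of this pseudo-labeling scheme is shown below: two subsequences are sampled from one series and their temporal distance is bucketed into a pseudo-class that a small classifier must predict. The bucket boundaries, subsequence length, and head architecture are illustrative assumptions, not the exact design of [135].

```python
import torch
import torch.nn as nn

def sample_pair_with_pseudo_label(x, sub_len=32, bins=(32, 96)):
    """x: (T, C). Return two subsequences and a pseudo-label from their distance.

    Pseudo-label 0: the subsequences are close; 1: medium distance; 2: far apart.
    """
    T = x.shape[0]
    i, j = torch.randint(0, T - sub_len, (2,)).tolist()
    dist = abs(i - j)
    label = sum(dist > b for b in bins)              # bucket the temporal distance
    return x[i:i + sub_len], x[j:j + sub_len], label

# Pre-training head: encode both subsequences, concatenate, predict the pseudo-label.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 1, 64), nn.ReLU())
head = nn.Linear(2 * 64, 3)
ce = nn.CrossEntropyLoss()

x = torch.randn(512, 1)                              # one long unlabeled series
a, b, y = sample_pair_with_pseudo_label(x)
logits = head(torch.cat([encoder(a.t().unsqueeze(0)),
                         encoder(b.t().unsqueeze(0))], dim=-1))
loss = ce(logits, torch.tensor([y]))
```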
4 EXPERIMENTAL RESULTS AND ANALYSIS
In this section, following [68], [142], we evaluate TS-PTMs on three TSM tasks: classification, forecasting, and anomaly detection. Like [68], we select a range of time-series benchmark datasets employed in the corresponding TSM tasks for evaluation. We first analyze the performance of TS-PTMs on the classification task using datasets from the UCR [148] and UEA [149] archives. Also, following [31], we select four time-series scenario datasets for transfer
learning PTM analysis. Second, the performance of TS-PTMs and related baselines on the forecasting task is compared using the ETT [14] and Electricity [150] datasets. Finally, we analyze the performance of TS-PTMs and related baselines on the anomaly detection task using the Yahoo [151] and KPI [152] datasets. For information about datasets, baselines, and implementation details, please refer to Appendix A.
PTMs are usually trained in two stages. First, supervised, unsupervised, or self-supervised techniques are used to pre-train the base model on the source dataset. Then, the base model is fine-tuned using the training set of the target dataset. Finally, the base model is evaluated on the target test set to obtain the test results. However, there is no available benchmark large-scale well-labeled time-series dataset. In other words, it is difficult to select a suitable source dataset for pre-training the encoder that yields positive transfer performance on the target dataset. To address this issue, existing studies (e.g., TS2Vec [68], TST [29], TS-TCC [38], and CoST [137]) utilize self-supervised or unsupervised learning strategies to pre-train the base model using the target dataset and then fine-tune it on the target dataset. We conducted extensive experiments on two fronts to evaluate the effectiveness of existing TS-PTMs. On the one hand, we selected the UCR archive and four scenario time-series datasets for transfer learning using the selected source and target datasets, thus analyzing the performance of different pre-training techniques on the downstream classification task. On the other hand, following the strategy of TS2Vec [68] and CoST [137], we compare and analyze the strategy of using the target dataset to pre-train the base model on the downstream time-series classification, forecasting, and anomaly detection tasks. Through these extensive experiments, we aim to provide guidance for the study of pre-training paradigms and techniques for TS-PTMs on different downstream tasks.
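The two-stage protocol described above can be summarized in a short training skeleton. The sketch below pre-trains an encoder on a source set and then fine-tunes it with a new classification head on the target training set; the loss choices, optimizers, and function names are placeholders standing in for whichever pre-training technique (supervised, unsupervised, or self-supervised) is being evaluated.

```python
import torch
import torch.nn as nn

def pretrain(encoder, pretext_head, source_loader, pretext_loss, epochs=10, lr=1e-3):
    """Stage 1: optimize the encoder and a pretext head on the source dataset."""
    params = list(encoder.parameters()) + list(pretext_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, target in source_loader:       # `target` may be class labels, the
            loss = pretext_loss(pretext_head(encoder(x)), target)  # series itself, etc.
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder

def finetune(encoder, feat_dim, n_classes, target_loader, epochs=20, lr=1e-3):
    """Stage 2: attach a classifier and fine-tune on the (often small) target training set."""
    clf = nn.Linear(feat_dim, n_classes)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(clf.parameters()), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in target_loader:
            loss = ce(clf(encoder(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder, clf
```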
4.1 Performance of PTMs on Time-Series Classification
The datasets in the UCR and UEA archives do not specify a validation set for hyperparameter selection, while some datasets have far more test samples than training samples. As suggested by [148], we merge the training and test sets of each dataset in the UCR and UEA archives. Then, we adopt a five-fold cross-validation strategy to divide the data into training, validation, and test sets in a 60%-20%-20% ratio. For the four independent time-series scenario datasets, we utilize the datasets processed by [31] for experimental analysis. Finally, we use the average accuracy on the test set as the evaluation metric.
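One way to realize this 60%-20%-20% protocol is to merge the original splits and rotate the folds, as in the scikit-learn-based sketch below; treating one fold as validation, the next as test, and the remaining three as training data is our reading of the protocol, so the exact fold assignment should be taken as an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_splits(X_train, y_train, X_test, y_test, seed=42):
    """Merge the original train/test sets and yield five 60/20/20 train-val-test splits."""
    X = np.concatenate([X_train, X_test]); y = np.concatenate([y_train, y_test])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    folds = [idx for _, idx in skf.split(X, y)]        # five disjoint folds (20% each)
    for k in range(5):
        val_idx, test_idx = folds[k], folds[(k + 1) % 5]
        train_idx = np.concatenate([folds[i] for i in range(5) if i not in (k, (k + 1) % 5)])
        yield (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```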
4.1.1 Comparison of Transfer Learning PTMs based on Supervised Classification and Unsupervised Reconstruction
Advanced time-series classification methods (e.g., TS-CHIEF [153], miniRocket [154], and OS-CNN [66]) either integrate multiple techniques or require searching for optimal hyperparameters during training, and are therefore not suitable as the backbone of PTMs. In recent years, researchers have employed FCN [26], TCN [68], and Transformer [29] architectures as the backbone for studying TS-PTMs. Considering the training time and classification performance of these models on the 128 UCR time-series datasets (please refer to Table 12 in the Appendix), we choose FCN for transfer learning. From the 128 UCR datasets, the 15 datasets with the largest numbers of samples are employed as source datasets. The FCN encoder is pre-trained using either the supervised classification task or the unsupervised reconstruction task combined with a decoder (a symmetric FCN decoder or an asymmetric RNN decoder). From the remaining 113 UCR datasets, we select 45 time-series datasets as target datasets for downstream classification fine-tuning: 15 with the smallest numbers of samples, 15 with medium numbers of samples, and 15 with the largest numbers of samples. Please refer to Tables 9 and 10 in the Appendix for details on the source and target datasets.
For each source dataset, we employ all samples to pre-train the FCN encoder. Using the five-fold cross-validation strategy, each target dataset contains five different training sets, and so we performed five fine-tunings for analysis. For each transfer strategy, 15 × 15 = 225 sets of transfer results (15 source datasets, 15 target datasets) are obtained for each target dataset sample size (minimum, medium, and maximum). The detailed transfer learning results are shown in Fig. 14 in the Appendix.
Due to space constraints, we report the transfer learning classification results in Table 1. The p-value is calculated for the transfer classification results between the supervised classification and unsupervised reconstruction transfer strategies. As shown in Table 1, the supervised classification transfer (Sup CLS) strategy has the best average accuracy and the largest number of positive transfer results on the minimum, medium, and maximum target datasets. These results indicate that the Sup CLS strategy is more suitable for time-series pre-training than the unsupervised transfer strategies. On the smallest target datasets, the difference in overall performance between the Sup CLS strategy and the unsupervised reconstruction strategy with the symmetric FCN decoder is insignificant (p-value greater than 0.05), and the unsupervised reconstruction strategy with the symmetric FCN decoder even outperforms the supervised classification transfer strategy. These results also indicate that a symmetric FCN decoder is more suitable for transfer learning on the time-series classification task than the asymmetric RNN decoder. Overall, the number of positive transfer classification results obtained on the target datasets is unsatisfactory, which may be related to the small number of samples in the UCR source datasets (most source datasets contain fewer than 8,000 samples; please refer to Table 9 in the Appendix).
Further, we select four independent time-series datasets for transfer learning PTM analysis. The number of samples in each source dataset is large, and the test classification results are shown in Table 2. Compared with the supervised approach (FCN) without pre-training, the transfer learning strategy based on supervised classification (Sup CLS) achieves significant positive transfer performance in the neurological stage detection and mechanical device diagnosis scenarios. In addition, the transfer learning strategy based on the unsupervised RNN decoder achieves obvious positive transfer performance in the activity recognition scenario. However, in the physical status monitoring scenario, all three transfer learning strategies result in negative transfer, which may be related to the small number of EMG
samples. Compared with using UCR time-series datasets, transfer learning has a better pre-training effect on independent time-series datasets with large source datasets.

4.1.2 Comparison of PTMs based on Transformer and Consistency
We select five TS-PTMs for performance comparison on the downstream classification task. TST [29] is a Transformer-based PTM; T-Loss [134], SelfTime [135], TS-TCC [38], and TS2Vec [68] are consistency-based PTMs. Specifically, the encoder of each of these methods is first pre-trained on the target dataset using an unsupervised/self-supervised pre-training strategy. Then, the label information of the target dataset is utilized to fine-tune the pre-trained encoder, or to perform the downstream classification task using the low-dimensional representations obtained from the pre-trained encoder. Related work in CV [35] and NLP [119] generally uses a selected backbone combined with a linear classifier as a supervised baseline against which the classification performance of PTMs is analyzed. Therefore, FCN combined with a linear classifier is selected as a baseline without pre-training. The averaged results on the UCR and UEA archives are reported in Table 3, and detailed results are in Appendix B.2. Since training SelfTime on the multivariate UEA datasets is too time-consuming, we only report its performance on the univariate UCR datasets. Also, we spent about a month on four 1080Ti GPUs to obtain the classification results of SelfTime on the UCR archive.
As shown in Table 3, the classification performance of TS2Vec is the best on the 128 UCR datasets. Also, the p-value is calculated for the classification results between TS2Vec and the other methods. These results show that consistency-based TS2Vec can effectively learn robust representations beneficial for the time-series classification task. However, on the 30 UEA datasets, the model using FCN for direct classification is the best in terms of classification performance and training time, while TS2Vec is inferior in terms of average accuracy and rank. Further, in terms of the p-values of the significance tests, TS2Vec is not significantly different from TS-TCC and TST (p-value greater than 0.05). The UEA archive is the benchmark for multivariate time-series classification, while TS2Vec does not have a pre-training strategy designed specifically for time-series variables. In addition, the TST model based on the vanilla Transformer has poorer classification results than Supervised (FCN) on both the UCR and UEA archives, indicating that the vanilla Transformer architecture remains a challenge for pre-training on the time-series classification task.

4.1.3 Visualization
In this section, we use the Class Activation Map (CAM) [155] and heatmaps [68] to visualize the time-series classification models. CAM is a way to visualize CNNs and analyze which regions of the input data a CNN-based model focuses on. The heatmap is used in [68] to analyze the heat distribution of the 16 feature dimensions with the largest variance, thus assisting in analyzing the trend of the time series. Hence, the ability of the model to capture changes in the time-series distribution can be measured based on the heat distribution. We employ the validation sets from the five-fold cross-validation strategy to select models for all comparison settings. Among the five models available for analysis, we choose the one with the most significant visualization difference to highlight the visualization.
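For reference, the sketch below shows the standard way a CAM is obtained from a fully convolutional classifier with global average pooling: the class weights of the final linear layer are used to weight the last convolutional feature maps along time. The specific FCN layout is an assumption; only the CAM computation itself follows [155].

```python
import torch
import torch.nn as nn

class FCN(nn.Module):
    """A small fully convolutional classifier with global average pooling (GAP)."""
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, 64, 8, padding="same"), nn.ReLU(),
            nn.Conv1d(64, 128, 5, padding="same"), nn.ReLU(),
            nn.Conv1d(128, 128, 3, padding="same"), nn.ReLU(),
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                 # x: (batch, in_ch, T)
        fmap = self.features(x)           # (batch, 128, T)
        logits = self.fc(fmap.mean(dim=-1))
        return logits, fmap

def class_activation_map(model, x, target_class):
    """Weight the final feature maps by the classifier weights of one class."""
    _, fmap = model(x)                                   # (1, C, T)
    w = model.fc.weight[target_class]                    # (C,)
    cam = torch.einsum("c,bct->bt", w, fmap)             # (1, T) contribution per step
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam

model = FCN()
x = torch.randn(1, 1, 150)                               # e.g., a Gunpoint-length series
print(class_activation_map(model, x, target_class=0).shape)   # -> torch.Size([1, 150])
```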
Fig. 12. Visualization of the Gunpoint dataset using CAM. For the discriminative features learned by the CNN-based model, red represents high contribution, blue indicates almost no contribution, and yellow indicates medium contribution (each subplot title gives the accuracy). The four subplots are: direct classification via FCN (100%), direct classification via TCN (50%), supervised transfer via FCN (100%), and unsupervised transfer via FCN decoder (98.5%).

The Gunpoint dataset in the UCR archive contains two types of series, representing a male or female performing two actions (using a replica gun or a finger to point at the target) [156]. Fawaz et al. [39] employed the Gunpoint dataset for CAM visualization due to its 100% classification accuracy with FCN, low noise, and containing only two classes. We therefore perform a CAM visualization on the Gunpoint dataset, as shown in Fig. 12.
We select the sample with the smallest variance in each class of the Gunpoint dataset for the CAM visualization. As shown in Fig. 12, the two classes of series are discriminative in the two fragments at [45, 60] and [90, 105]. Direct classification using FCN learns the discriminative subseries information between the two classes well. However, when the TCN is used for direct classification, only one class of series is marked in red (on the fragment [90, 105]), while the other class is marked in blue, resulting in low classification performance. Meanwhile, we select UWaveGestureLibraryX as the source dataset for transfer learning on the Gunpoint dataset. With the supervised classification-based transfer strategy, the FCN model marks both classes of series in red on the [90, 105] fragment. With the unsupervised transfer strategy, the FCN model marks one class of series in dark red on the [45, 60] and [90, 105] fragments, and the other class in yellow on the [45, 60] and [90, 105] fragments.
TABLE 1
Test classification results of transfer learning on 675 sets (15 source datasets × 45 target datasets) of UCR time-series datasets. "Positive" indicates that the classification performance is better than direct classification without a transfer strategy.
TABLE 2
Test classification results of transfer learning on four independent time-series scenarios. Supervised (FCN) means that the FCN encoder is directly used for supervised classification training on the target dataset.

| Scenario | Source Dataset | Target Dataset | Supervised (FCN) | Sup CLS | Unsup FCN Decoder | Unsup RNN Decoder |
| --- | --- | --- | --- | --- | --- | --- |
| Neurological Stage Detection | SleepEEG | Epilepsy | 0.7766 (0.0552) | 0.8021 (0.0010) | 0.6813 (0.2483) | 0.6813 (0.2486) |
| Mechanical Device Diagnosis | FD-A | FD-B | 0.5781 (0.0292) | 0.6718 (0.0783) | 0.6541 (0.0547) | 0.6541 (0.0543) |
| Activity Recognition | HAR | Gesture | 0.3904 (0.0281) | 0.3796 (0.0257) | 0.3517 (0.0363) | 0.4933 (0.0196) |
| Physical Status Monitoring | ECG | EMG | 0.9756 (0.0322) | 0.4634 (0.0010) | 0.8439 (0.0892) | 0.8731 (0.0103) |
TABLE 3
Comparisons of classification test accuracy using Transformer and consistency-based methods (standard deviations are in parentheses).
[Fig. 13. Comparison of positive and negative transfer based on supervised classification using the MixedShapesSmallTrain and Wine datasets. Panels: direct classification, positive transfer, and negative transfer for (a) MixedShapesSmallTrain and (b) Wine.]

The MixedShapesSmallTrain and Wine datasets from the UCR archive are selected as the target datasets for analyzing positive and negative transfer in time-series classification. Based on the classification results of transfer learning in Fig. 14 in the Appendix, we select the source datasets with the best accuracy for positive transfer and the worst accuracy for negative transfer. From Fig. 13, the heat distributions of the 16 feature dimensions with the largest variance obtained by direct classification and positive transfer are more similar to the trend of the original time series, while negative transfer struggles to produce a heat distribution that matches the trend of the original time series. The heatmap visualization indicates that negative transfer may make it difficult for the model to capture the dynamic change information of the original time series, leading to degraded classification performance.

4.2 Performance of PTMs on Time-Series Forecasting
We further evaluate the performance of PTMs, including TS2Vec [68] and CoST [137], on time-series forecasting. Experiments on state-of-the-art direct forecasting methods are also conducted for comparison. These approaches include three Transformer-based models, LogTrans [157], Informer [14], and Autoformer [60], as well as the Temporal Convolutional Network (TCN) [57]. TS2Vec and CoST perform pre-training and the downstream forecasting task on the same dataset, while the other baselines are trained directly on the dataset without pre-training. We employ the mean square error (MSE) and mean absolute error (MAE) as evaluation metrics, following [137]. Also, we follow [68] to preprocess the datasets. Table 4 presents the test results on four public datasets for multivariate time-series forecasting. As can be seen, TS2Vec and CoST generally outperform TCN, LogTrans, and Informer. Autoformer achieves remarkable improvements and has better long-term robustness. Note that CoST achieves performance comparable to Autoformer in some settings on the ETTm1 and Electricity datasets. We credit this to its modeling of trend and seasonal features, which is also verified to be effective in the decomposition architecture of Autoformer. Most existing studies are end-to-end supervised methods, while few utilize pre-training and downstream fine-tuning paradigms for time-series forecasting. The empirical results show that this is a promising paradigm. Although recent time-series forecasting techniques such as FiLM [11], Scaleformer [158], and MICN [159] have demonstrated impressive performance, it is important to note that these methods incorporate various time-series characteristics (such as frequency-domain and multi-scale properties), which differ from the single-property-based approaches of TS2Vec and CoST. Therefore, we do not adopt the aforementioned methods as baselines. Moreover, mining the seasonal and trend characteristics of time series [61], [160], and utilizing Transformer-based TS-PTMs to discover temporal dependencies, have the potential to improve forecasting performance.
TABLE 4
Comparisons of forecasting test results on ETT and Electricity datasets using TCN, Transformer-based and pre-training methods.
5.2 Inherent Properties of Time Series
Time-series representation learning has attracted much attention in recent years. Existing studies have explored the inherent properties of time series for representation learning, such as using CNNs to capture multi-scale dependencies, RNNs to model temporal dependencies, and Transformers to model long-term temporal dependencies. Also, the context dependencies and frequency-domain (or seasonal-trend [137]) information of time series have been explored in recent contrastive learning studies. Although mining the inherent properties of time series can yield representations beneficial for downstream TSM tasks, the transferability of time series across different domains remains weakly interpretable.
Due to the inherent properties (e.g., frequency-domain information) of time series, the pre-training techniques applied to image data are difficult to transfer directly to time series. Compared with NLP, it is challenging for TS-PTMs to learn universal time-series representations due to the absence of a large-scale unified semantic sequence dataset. For example, each word in a text-sequence dataset has, with high probability, similar semantics in different sentences. Therefore, the word embeddings learned by the model can transfer knowledge across different text-sequence scenarios. However, it is difficult to obtain subsequences of time series (corresponding to words in text sequences) that have consistent semantics across scenarios, making it difficult to transfer the knowledge learned by the model. Hence, exploiting the inherent properties of time series to mine transferable segments in time-series data is a challenge for future research on TS-PTMs.

5.3 Transformer in Time Series
Transformer-based models have achieved excellent performance in various fields, such as NLP and CV. Naturally, Transformers have also been explored in the study of time series. The Transformer employs a multi-head attention mechanism to capture long-term dependencies in the input data. With this advantage, Informer [14], Autoformer [60], and FEDformer [61] for time-series forecasting, and the Anomaly Transformer [164] for time-series anomaly detection, are well suited for analyzing time series. However, there are relatively few Transformer models with a competitive advantage on the time-series classification task, which may be due to the fact that time-series classification focuses more on capturing discriminative subseries (e.g., shapelets) or multi-scale features of the time series, where existing CNN-based models [66] may be more advantageous.
Although there has been some work applying Transformers to TSM tasks [165], there is little research on Transformer-based PTMs for time series. We believe that the current challenges in pre-training Transformer models for time series are two-fold. First, pre-training Transformers usually requires massive data to learn universal representations. However, there is currently a lack of large-scale datasets in the time-series field. Second, how to design an effective pre-trained Transformer model that incorporates the inherent properties of time series still needs further exploration. Therefore, exploring pre-trained Transformer models for time series is an exciting research direction.

5.4 Adversarial Attacks on Time Series
Adversarial example attacks have recently received extensive attention in various fields because they pose significant security risks. Naturally, scholars in the time-series domain have also begun to consider the impact of adversarial example attacks on time-series models [166], [167], [168], [169]. For example, Karim et al. [170] utilized a distilled model as a surrogate that simulates the behavior of the attacked time-series classification model. Then, an adversarial transformation network is applied to the distilled model to generate time-series adversarial examples. Experiments on 42 UCR datasets show that the attacked models are vulnerable to adversarial examples.
Adversarial examples are generated by adding perturbations to the original examples so that they cross the classification boundaries of the classifier. In general, it is difficult to generate adversarial examples by adding random perturbations. Also, adversarial examples are not easy to generate when each cluster is far from the classification boundary. Recently, Hendrycks et al. [171] found that self-supervised learning can effectively improve the robustness of deep learning models to adversarial examples. Therefore, improving the robustness of time-series models to adversarial examples by utilizing TS-PTMs is a direction worth exploring.

5.5 Pre-Training Models for Time-Series Noisy Labels
The acquisition cost of large-scale labeled datasets is very high. Therefore, various low-cost surrogate strategies have been proposed to collect labels automatically. For example, many weakly labeled images can be collected with the help of search engines and crawling algorithms. In the time-series domain, Castellani et al. [111] employed sensor readings to generate labels for time series. Although these strategies enable obtaining large-scale labeled data, they also inevitably lead to label noise. Training deep learning models efficiently with noisy labels is challenging, since deep learning models have a high capacity for fitting noisy labels [172]. To this end, many studies on learning with label noise [173] have emerged. As an effective representation learning method, PTMs can be employed to address the problem of noisy labels [174], [175], which has been studied in CV. However, only a few studies currently investigate noisy labels in time series [111], and PTMs for time-series noisy labels have not yet been studied.

6 CONCLUSION
In this survey, we provide a systematic review and analysis of the development of TS-PTMs. Early research on TS-PTMs was mainly based on CNN and RNN models for transfer learning. In recent years, Transformer-based and consistency-based models have achieved remarkable performance in downstream time-series tasks and have been utilized for time-series pre-training. Hence, we conducted a large-scale experimental analysis of existing TS-PTMs, transfer learning strategies, Transformer-based time-series methods, and related representative methods on the three main tasks of time-series classification, forecasting, and anomaly detection. The experimental results indicate that Transformer-based PTMs have significant potential for time-series forecasting and anomaly detection tasks, while designing suitable
Transformer-based models for the time-series classification [16] B. K. Iwana and S. Uchida, “An empirical survey of data aug-
task remains challenging. Meanwhile, the pre-training strat- mentation for time series classification with neural networks,”
Plos one, vol. 16, no. 7, p. e0254841, 2021.
egy based on contrastive learning is a potential focus for the [17] J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised
development of future TS-PTMs. learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
ACKNOWLEDGMENTS
The work described in this paper was partially funded by the National Natural Science Foundation of China (Grant Nos. 62272173, 61872148) and the Natural Science Foundation of Guangdong Province (Grant Nos. 2022A1515010179, 2019A1515010768). The authors would like to thank Professor Garrison W. Cottrell from UCSD, and Yu Chen and Peitian Ma from SCUT, for their review and helpful suggestions.

REFERENCES
[1] Y. Wu, J. M. Hernández-Lobato, and Z. Ghahramani, "Dynamic covariance models for multivariate financial time series," in International Conference on Machine Learning. PMLR, 2013, pp. 558–566.
[2] N. Moritz, T. Hori, and J. Le, "Streaming automatic speech recognition with the transformer model," in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020, pp. 6074–6078.
[3] J. Yang, M. N. Nguyen, P. P. San, X. L. Li, and S. Krishnaswamy, "Deep convolutional neural networks on multichannel time series for human activity recognition," in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[4] J. Martinez, M. J. Black, and J. Romero, "On human motion prediction using recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2891–2900.
[5] S. Wang, J. Cao, and P. Yu, "Deep learning for spatio-temporal data mining: A survey," IEEE Transactions on Knowledge and Data Engineering, pp. 1–20, 2020.
[6] D. A. Tedjopurnomo, Z. Bao, B. Zheng, F. Choudhury, and A. K. Qin, "A survey on modern deep neural network for traffic prediction: Trends, methods and challenges," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 4, pp. 1544–1561, 2022.
[7] T.-C. Fu, "A review on time series data mining," Engineering Applications of Artificial Intelligence, vol. 24, no. 1, pp. 164–181, 2011.
[8] E. Eldele, M. Ragab, Z. Chen, M. Wu, C.-K. Kwoh, and X. Li, "Label-efficient time series representation learning: A review," arXiv preprint arXiv:2302.06433, 2023.
[9] G. Li, B. K. K. Choi, J. Xu, S. S. Bhowmick, K.-P. Chun, and G. L. Wong, "Efficient shapelet discovery for time series classification," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 3, pp. 1149–1163, 2022.
[10] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu, "Recurrent neural networks for multivariate time series with missing values," Scientific Reports, vol. 8, no. 1, pp. 1–12, 2018.
[11] T. Zhou, Z. Ma, Q. Wen, L. Sun, T. Yao, R. Jin et al., "FiLM: Frequency improved Legendre memory model for long-term time series forecasting," arXiv preprint arXiv:2205.08897, 2022.
[12] M. H. Tahan, M. Ghasemzadeh, and S. Asadi, "Development of fully convolutional neural networks based on discretization in time series classification," IEEE Transactions on Knowledge and Data Engineering, pp. 1–12, 2022.
[13] R. Sen, H.-F. Yu, and I. S. Dhillon, "Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting," in Advances in Neural Information Processing Systems, 2019.
[14] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106–11115.
[15] Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu, "Time series data augmentation for deep learning: A survey," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 4653–4660.
[16] B. K. Iwana and S. Uchida, "An empirical survey of data augmentation for time series classification with neural networks," PLoS One, vol. 16, no. 7, p. e0254841, 2021.
[17] J. E. Van Engelen and H. H. Hoos, "A survey on semi-supervised learning," Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
[18] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[19] Q. Ma, Z. Zheng, J. Zheng, S. Li, W. Zhuang, and G. W. Cottrell, "Joint-label learning by dual augmentation for time series classification," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 8847–8855.
[20] L. Yang, S. Hong, and L. Zhang, "Spectral propagation graph network for few-shot time series classification," arXiv preprint arXiv:2202.04769, 2022.
[21] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[22] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, "A comprehensive survey on transfer learning," Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[24] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," arXiv preprint arXiv:2111.06377, 2021.
[25] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," Science China Technological Sciences, pp. 1–26, 2020.
[26] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller, "Transfer learning for time series classification," in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 1367–1376.
[27] C.-H. H. Yang, Y.-Y. Tsai, and P.-Y. Chen, "Voice2Series: Reprogramming acoustic models for time series classification," International Conference on Machine Learning, 2021.
[28] P. Malhotra, V. TV, L. Vig, P. Agarwal, and G. Shroff, "TimeNet: Pre-trained deep recurrent neural network for time series classification," arXiv preprint arXiv:1706.08838, 2017.
[29] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, "A transformer-based framework for multivariate time series representation learning," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2114–2124.
[30] S. Deldari, H. Xue, A. Saeed, J. He, D. V. Smith, and F. D. Salim, "Beyond just vision: A review on self-supervised representation learning on multimodal and temporal data," arXiv preprint arXiv:2206.02353, 2022.
[31] X. Zhang, Z. Zhao, T. Tsiligkaridis, and M. Zitnik, "Self-supervised contrastive pre-training for time series via time-frequency consistency," in Advances in Neural Information Processing Systems, 2022.
[32] W. Zhang, L. Yang, S. Geng, and S. Hong, "Cross reconstruction transformer for self-supervised time series representation learning," arXiv preprint arXiv:2205.09928, 2022.
[33] R. Ye and Q. Dai, "A novel transfer learning framework for time series forecasting," Knowledge-Based Systems, vol. 156, pp. 74–99, 2018.
[34] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[35] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
[36] H. Zhang, J. Wang, Q. Xiao, J. Deng, and Y. Lin, "SleepPriorCL: Contrastive representation learning with prior knowledge-based positive mining and adaptive temperature for sleep staging," arXiv preprint arXiv:2110.09966, 2021.
[37] N. Laptev, J. Yu, and R. Rajagopal, "Reconstruction and regression loss for time-series transfer learning," in Proceedings of the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) and the 4th Workshop on the Mining and Learning from Time Series (MiLeTS), 2018.
[38] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan, "Time-series representation learning via temporal and contextual contrasting," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 2352–2359.
[39] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller, "Deep learning for time series classification: a review," Data Mining and Knowledge Discovery, vol. 33, no. 4, pp. 917–963, 2019.
[40] B. Lim and S. Zohren, "Time-series forecasting with deep learning: A survey," Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200209, 2021.
[41] B. Lafabregue, J. Weber, P. Gançarski, and G. Forestier, "End-to-end deep representation learning for time series clustering: A comparative study," Data Mining and Knowledge Discovery, pp. 1–53, 2021.
[42] F. Liu, X. Zhou, J. Cao, Z. Wang, T. Wang, H. Wang, and Y. Zhang, "Anomaly detection in quasi-periodic time series based on automatic data segmentation and attentional LSTM-CNN," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 6, pp. 2626–2640, 2022.
[43] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, "Outlier detection for temporal data: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2250–2267, 2014.
[44] A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano, "A review on outlier/anomaly detection in time series data," ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021.
[45] W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li, "BRITS: Bidirectional recurrent imputation for time series," arXiv preprint arXiv:1805.10572, 2018.
[46] Y. Luo, X. Cai, Y. Zhang, J. Xu, and X. Yuan, "Multivariate time series imputation with generative adversarial networks," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 1603–1614.
[47] M. C. De Souto, I. G. Costa, D. S. De Araujo, T. B. Ludermir, and A. Schliep, "Clustering cancer gene expression data: A comparative study," BMC Bioinformatics, vol. 9, no. 1, pp. 1–14, 2008.
[48] J. de Jong, M. A. Emon, P. Wu, R. Karki, M. Sood, P. Godard, A. Ahmad, H. Vrooman, M. Hofmann-Apitius, and H. Fröhlich, "Deep learning for clustering of multivariate clinical patient trajectories with missing values," GigaScience, vol. 8, no. 11, p. giz134, 2019.
[49] Q. Ma, S. Li, and G. Cottrell, "Adversarial joint-learning recurrent neural network for incomplete time series classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1765–1776, 2022.
[50] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[51] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2016.
[52] N. Muralidhar, S. Muthiah, and N. Ramakrishnan, "DyAt nets: Dynamic attention networks for state forecasting in cyber-physical systems," in International Joint Conference on Artificial Intelligence, 2019, pp. 3180–3186.
[53] Q. Ma, J. Zheng, S. Li, and G. W. Cottrell, "Learning representations for time series clustering," Advances in Neural Information Processing Systems, vol. 32, pp. 3781–3791, 2019.
[54] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai et al., "Recent advances in convolutional neural networks," Pattern Recognition, vol. 77, pp. 354–377, 2018.
[55] Z. Cui, W. Chen, and Y. Chen, "Multi-scale convolutional neural networks for time series classification," arXiv preprint arXiv:1603.06995, 2016.
[56] K. Kashiparekh, J. Narwariya, P. Malhotra, L. Vig, and G. Shroff, "ConvTimeNet: A pre-trained deep convolutional neural network for time series classification," in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[57] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[58] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[59] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, "Probabilistic forecasting with temporal convolutional neural network," Neurocomputing, vol. 399, pp. 491–501, 2020.
[60] H. Wu, J. Xu, J. Wang, and M. Long, "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting," Advances in Neural Information Processing Systems, vol. 34, 2021.
[61] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, "FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting," arXiv preprint arXiv:2201.12740, 2022.
[62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[63] J. Serrà, S. Pascual, and A. Karatzoglou, "Towards a universal neural network encoder for time series," in Artificial Intelligence Research and Development, 2018, pp. 120–129.
[64] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista, "The UCR time series classification archive," July 2015, www.cs.ucr.edu/~eamonn/time_series_data/.
[65] Z. Wang, W. Yan, and T. Oates, "Time series classification from scratch with deep neural networks: A strong baseline," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1578–1585.
[66] W. Tang, G. Long, L. Liu, T. Zhou, M. Blumenstein, and J. Jiang, "Omni-scale CNNs: a simple and effective kernel size configuration for time series classification," in International Conference on Learning Representations, 2022.
[67] F. Li, K. Shirahama, M. A. Nisar, X. Huang, and M. Grzegorzek, "Deep transfer learning for time series data based on sensor modality classification," Sensors, vol. 20, no. 15, p. 4271, 2020.
[68] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu, "TS2Vec: Towards universal representation of time series," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[69] L. Ye and E. Keogh, "Time series shapelets: a new primitive for data mining," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 947–956.
[70] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-Thieme, "Learning time-series shapelets," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 392–401.
[71] A. Meiseles and L. Rokach, "Source model selection for deep learning in the time series domain," IEEE Access, vol. 8, pp. 6190–6200, 2020.
[72] F. J. O. Morales and D. Roggen, "Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations," in Proceedings of the 2016 ACM International Symposium on Wearable Computers, 2016, pp. 92–99.
[73] R. Mutegeki and D. S. Han, "Feature-representation transfer learning for human activity recognition," in 2019 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 2019, pp. 18–20.
[74] R. Cai, J. Chen, Z. Li, W. Chen, K. Zhang, J. Ye, Z. Li, X. Yang, and Z. Zhang, "Time series domain adaptation via sparse associative structure alignment," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[75] G. Wilson, J. R. Doppa, and D. J. Cook, "Multi-source deep domain adaptation with weak supervision for time-series sensor data," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1768–1778.
[76] G. Csurka, "Domain adaptation for visual applications: A comprehensive survey," arXiv preprint arXiv:1702.05374, 2017.
[77] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, "Deep domain confusion: Maximizing for domain invariance," arXiv preprint arXiv:1412.3474, 2014.
[78] C. Chen, Z. Fu, Z. Chen, S. Jin, Z. Cheng, X. Jin, and X.-S. Hua, "HoMM: Higher-order moment matching for unsupervised domain adaptation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 4, 2020, pp. 3422–3429.
[79] S. Xie, Z. Zheng, L. Chen, and C. Chen, "Learning semantic representations for unsupervised domain adaptation," in International Conference on Machine Learning. PMLR, 2018, pp. 5423–5432.
[80] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[81] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, "Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2272–2281.
[82] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan, "Universal domain adaptation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2720–2729.
[83] J. Wang, Y. Chen, L. Hu, X. Peng, and S. Y. Philip, "Stratified transfer learning for cross-domain activity recognition," in 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2018, pp. 1–10.
[84] Z. Li, R. Cai, T. Z. Fu, and K. Zhang, "Transferable time-series forecasting under causal conditional shift," arXiv preprint arXiv:2111.03422, 2021.
[85] M. A. A. H. Khan, N. Roy, and A. Misra, "Scaling human activity recognition via deep learning-based domain adaptation," in 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2018, pp. 1–9.
[86] A. Tank, I. Covert, N. Foti, A. Shojaie, and E. B. Fox, "Neural Granger causality," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4267–4279, 2021.
[87] M. Ragab, E. Eldele, Z. Chen, M. Wu, C.-K. Kwoh, and X. Li, "Self-supervised autoregressive domain adaptation for time series data," arXiv preprint arXiv:2111.14834, 2021.
[88] Q. Liu and H. Xue, "Adversarial spectral kernel matching for unsupervised time series domain adaptation," in International Joint Conference on Artificial Intelligence, 2021, pp. 2744–2750.
[89] S. Purushotham, W. Carvalho, T. Nilanon, and Y. Liu, "Variational recurrent adversarial deep domain adaptation," in International Conference on Learning Representations, 2017.
[90] P. R. d. O. da Costa, A. Akçay, Y. Zhang, and U. Kaymak, "Remaining useful lifetime prediction via deep domain adaptation," Reliability Engineering & System Safety, vol. 195, p. 106682, 2020.
[91] G. Wilson, J. R. Doppa, and D. J. Cook, "CALDA: Improving multi-source time series domain adaptation with contrastive adversarial learning," arXiv preprint arXiv:2109.14778, 2021.
[92] Z. Li, R. Cai, H. W. Ng, M. Winslett, T. Z. Fu, B. Xu, X. Yang, and Z. Zhang, "Causal mechanism transfer network for time series domain adaptation in mechanical systems," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 12, no. 2, pp. 1–21, 2021.
[93] G. F. Elsayed, I. Goodfellow, and J. Sohl-Dickstein, "Adversarial reprogramming of neural networks," in International Conference on Learning Representations, 2019.
[94] D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, "A neural attention model for speech command recognition," arXiv preprint arXiv:1808.08929, 2018.
[95] C.-H. H. Yang, J. Qi, S. Y.-C. Chen, P.-Y. Chen, S. M. Siniscalchi, X. Ma, and C.-H. Lee, "Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2021, pp. 6523–6527.
[96] P. Xiong, Y. Zhu, Z. Sun, Z. Cao, M. Wang, Y. Zheng, J. Hou, T. Huang, and Z. Que, "Application of transfer learning in continuous time series for anomaly detection in commercial aircraft flight data," in 2018 IEEE International Conference on Smart Cloud (SmartCloud). IEEE, 2018, pp. 13–18.
[97] Y. Du, J. Wang, W. Feng, S. Pan, T. Qin, R. Xu, and C. Wang, "AdaRNN: Adaptive learning and forecasting of time series," in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 402–411.
[98] S. Baireddy, S. R. Desai, J. L. Mathieson, R. H. Foster, M. W. Chan, M. L. Comer, and E. J. Delp, "Spacecraft time-series anomaly detection using transfer learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1951–1960.
[99] Z. Shao, Z. Zhang, F. Wang, and Y. Xu, "Pre-training enhanced spatial-temporal graph neural network for multivariate time series forecasting," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 1567–1577.
[100] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[101] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," Proc. Interspeech 2019, pp. 3465–3469, 2019.
[102] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., "Matching networks for one shot learning," Advances in Neural Information Processing Systems, vol. 29, 2016.
[103] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
[104] X. Jiang, R. Missel, Z. Li, and L. Wang, "Sequential latent variable models for few-shot high-dimensional time-series forecasting," in The Eleventh International Conference on Learning Representations, 2023.
[105] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in International Conference on Machine Learning. PMLR, 2017, pp. 1126–1135.
[106] B. Lu, X. Gan, W. Zhang, H. Yao, L. Fu, and X. Wang, "Spatio-temporal graph few-shot learning with cross-city knowledge transfer," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 1162–1172.
[107] R. Wang, R. Walters, and R. Yu, "Meta-learning dynamics forecasting using task inference," Advances in Neural Information Processing Systems, vol. 35, pp. 21640–21653, 2022.
[108] T. Iwata and A. Kumagai, "Few-shot learning for time-series forecasting," arXiv preprint arXiv:2009.14379, 2020.
[109] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, "Meta-learning framework with applications to zero-shot time-series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 9242–9250.
[110] L. Brinkmeyer, R. R. Drumond, J. Burchert, and L. Schmidt-Thieme, "Few-shot forecasting of time-series with heterogeneous channels," arXiv preprint arXiv:2204.03456, 2022.
[111] A. Castellani, S. Schmitt, and B. Hammer, "Estimating the electrical power output of industrial devices with end-to-end time-series classification in the presence of label noise," in European Conference on Machine Learning, 2021.
[112] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in Neural Information Processing Systems, vol. 27, 2014.
[113] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. 12, 2010.
[114] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," arXiv preprint arXiv:1910.05453, 2019.
[115] Q. Ma, C. Chen, S. Li, and G. W. Cottrell, "Learning representations for incomplete time series clustering," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 8837–8846.
[116] P. Shi, W. Ye, and Z. Qin, "Self-supervised pre-training for time series classification," in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
[117] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," arXiv preprint arXiv:1603.00982, 2016.
[118] Q. Hu, R. Zhang, and Y. Zhou, "Transfer learning for short-term wind speed prediction with deep neural networks," Renewable Energy, vol. 85, pp. 83–95, 2016.
[119] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[120] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichtenhofer, "Masked feature prediction for self-supervised visual pre-training," arXiv preprint arXiv:2112.09133, 2021.
[121] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, "SimMIM: A simple framework for masked image modeling," arXiv preprint arXiv:2111.09886, 2021.
[122] R. R. Chowdhury, X. Zhang, J. Shang, R. K. Gupta, and D. Hong, "TARNet: Task-aware reconstruction for time-series transformer," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2022, pp. 14–18.
[123] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "MASS: Masked sequence to sequence pre-training for language generation," arXiv preprint arXiv:1905.02450, 2019.
[124] K. S. Kalyan, A. Rajasekharan, and S. Sangeetha, "AMMUS: A survey of transformer-based pretrained models in natural language processing," arXiv preprint arXiv:2108.05542, 2021.
[125] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[126] L. Hou, Y. Geng, L. Han, H. Yang, K. Zheng, and X. Wang, "Masked token enabled pre-training: A task-agnostic approach for understanding complex traffic flow," TechRxiv, 2022.
[127] L. Zhao, M. Gao, and Z. Wang, "ST-GSP: Spatial-temporal global semantic representation learning for urban flow prediction," in Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022, pp. 1443–1451.
[128] I. Padhi, Y. Schiff, I. Melnyk, M. Rigotti, Y. Mroueh, P. Dognin, J. Ross, R. Nair, and E. Altman, "Tabular transformers for modeling multivariate time series," in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2021, pp. 3565–3569.
[129] S. M. Shankaranarayana and D. Runje, "Attention augmented convolutional transformer for tabular time-series," in 2021 International Conference on Data Mining Workshops (ICDMW). IEEE, 2021, pp. 537–541.
[130] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020, pp. 6419–6423.
[131] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021.
[132] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, "Dense contrastive learning for self-supervised visual pre-training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3024–3033.
[133] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, vol. 26, 2013.
[134] J.-Y. Franceschi, A. Dieuleveut, and M. Jaggi, "Unsupervised scalable representation learning for multivariate time series," Advances in Neural Information Processing Systems, vol. 32, 2019.
[135] H. Fan, F. Zhang, and Y. Gao, "Self-supervised time series representation learning by inter-intra relational reasoning," arXiv preprint arXiv:2011.13548, 2020.
[136] S. Tonekaboni, D. Eytan, and A. Goldenberg, "Unsupervised representation learning for time series with temporal neighborhood coding," in International Conference on Learning Representations, 2021.
[137] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. Hoi, "CoST: Contrastive learning of disentangled seasonal-trend representations for time series forecasting," arXiv preprint arXiv:2202.01575, 2022.
[138] S. Deldari, D. V. Smith, H. Xue, and F. D. Salim, "Time series change point detection with self-supervised contrastive predictive coding," in Proceedings of the Web Conference 2021, 2021, pp. 3124–3135.
[139] D. Luo, W. Cheng, Y. Wang, D. Xu, J. Ni, W. Yu, X. Zhang, Y. Liu, H. Chen, and X. Zhang, "Information-aware time series meta-contrastive learning," submitted to International Conference on Learning Representations, 2022.
[140] A. Hyvarinen and H. Morioka, "Unsupervised feature extraction by time-contrastive learning and nonlinear ICA," Advances in Neural Information Processing Systems, vol. 29, 2016.
[141] A. Hyvarinen and H. Morioka, "Nonlinear ICA of temporally dependent stationary sources," in Artificial Intelligence and Statistics. PMLR, 2017, pp. 460–469.
[142] L. Yang, S. Hong, and L. Zhang, "Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion," arXiv preprint arXiv:2202.04770, 2022.
[143] Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu, "Time series data augmentation for deep learning: A survey," arXiv preprint arXiv:2002.12478, 2020.
[144] M. T. Nonnenmacher, L. Oldenburg, I. Steinwart, and D. Reeb, "Utilizing expert features for contrastive learning of time-series representations," in International Conference on Machine Learning. PMLR, 2022, pp. 16969–16989.
[145] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, "Deep clustering for unsupervised learning of visual features," in Proceedings of the European Conference on Computer Vision, 2018, pp. 132–149.
[146] Y. Asano, C. Rupprecht, and A. Vedaldi, "Self-labelling via simultaneous clustering and representation learning," in International Conference on Learning Representations, 2020.
[147] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in Proceedings of the International Conference on Learning Representations, 2018.
[148] H. A. Dau, A. Bagnall, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, and E. Keogh, "The UCR time series archive," IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 6, pp. 1293–1305, 2019.
[149] A. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh, "The UEA multivariate time series classification archive, 2018," arXiv preprint arXiv:1811.00075, 2018.
[150] D. Dua and C. Graff, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[151] N. Laptev, S. Amizadeh, and Y. Billawala, "A benchmark dataset for time series anomaly detection," 2015, https://yahooresearch.tumblr.com/post/114590420346/a-benchmark-dataset-for-time-series-anomaly.
[152] H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang, "Time-series anomaly detection service at Microsoft," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 3009–3017.
[153] A. Shifaz, C. Pelletier, F. Petitjean, and G. I. Webb, "TS-CHIEF: a scalable and accurate forest algorithm for time series classification," Data Mining and Knowledge Discovery, vol. 34, no. 3, pp. 742–775, 2020.
[154] A. Dempster, D. F. Schmidt, and G. I. Webb, "MiniRocket: A very fast (almost) deterministic transform for time series classification," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 248–257.
[155] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[156] J. Lines, L. M. Davis, J. Hills, and A. Bagnall, "A shapelet transform for time series classification," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 289–297.
[157] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, "Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting," Advances in Neural Information Processing Systems, vol. 32, 2019.
[158] A. Shabani, A. Abdi, L. Meng, and T. Sylvain, "Scaleformer: Iterative multi-scale refining transformers for time series forecasting," in The Eleventh International Conference on Learning Representations, 2023.
[159] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, and Y. Xiao, "MICN: Multi-scale local and global context modeling for long-term series forecasting," in The Eleventh International Conference on Learning Representations, 2023.
[160] Z. Wang, X. Xu, W. Zhang, G. Trajcevski, T. Zhong, and F. Zhou, "Learning latent seasonal-trend representations for time series forecasting," in Advances in Neural Information Processing Systems, 2022.
[161] A. Siffer, P.-A. Fouque, A. Termier, and C. Largouet, "Anomaly detection in streams with extreme value theory," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1067–1075.
[162] D. Park, Y. Hoshi, and C. C. Kemp, "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1544–1551, 2018.
[163] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng et al., "Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications," in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 187–196.
[164] J. Xu, H. Wu, J. Wang, and M. Long, "Anomaly Transformer: Time series anomaly detection with association discrepancy," in International Conference on Learning Representations, 2022.
TABLE 6
Benchmark Datasets in Time-Series Domain
TABLE 7
The Open-Source Implementations of TS-PTMs.
learning feature representations of unlabeled time series. We use the open source code from https://github.com/emadeldeen24/TS-TCC for experimental analysis.
TST: Zerveas et al. [29] first proposed a multivariate time series unsupervised representation learning framework called the Time Series Transformer (TST). We use the open source code from https://github.com/gzerveas/mvts_transformer for experimental analysis.
TS2Vec: Yue et al. [68] proposed TS2Vec, a general framework for time series representation learning at an arbitrary semantic level. For the time series classification task, the authors employed a TCN as the backbone and fed the low-dimensional feature representation obtained from the TCN into an SVM classifier with an RBF kernel. We use the open source code from https://github.com/yuezhihan/ts2vec for experimental analysis.
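The following is a minimal sketch of this "frozen encoder + RBF-kernel SVM" evaluation protocol. The encoder object and its encode method are illustrative assumptions and do not reproduce the API of any specific repository.

# Sketch: classify time series with a frozen pre-trained encoder and an SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def evaluate_with_svm(encoder, X_train, y_train, X_test, y_test):
    # Obtain fixed-length representations from the frozen pre-trained encoder.
    Z_train = encoder.encode(X_train)   # assumed shape: (n_train, repr_dim)
    Z_test = encoder.encode(X_test)     # assumed shape: (n_test, repr_dim)

    # RBF-kernel SVM on top of the representations; C is chosen by grid search.
    svm = GridSearchCV(SVC(kernel="rbf", gamma="scale"),
                       {"C": [0.1, 1, 10, 100]}, cv=5)
    svm.fit(Z_train, y_train)
    return svm.score(Z_test, y_test)

Keeping the encoder frozen and training only a shallow classifier on top isolates the quality of the pre-trained representations, which is the point of this protocol.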
A.3.2 Time-Series Forecasting
TS2Vec: For the time series forecasting task, Yue et al. [68] employed a TCN as the backbone and used the data from the last T observations to predict the observations for the next H timestamps. We use the open source code from https://github.com/yuezhihan/ts2vec for experimental analysis.
CoST: Woo et al. [137] proposed CoST, a representation learning framework for long time series forecasting that utilizes contrastive learning to learn disentangled seasonal-trend representations of time series. We use the open source code from https://github.com/salesforce/CoST for experimental analysis.
LongTrans: Li et al. [157] proposed a Transformer-based model for long time series forecasting (LongTrans) that uses convolutional self-attention and a LogSparse Transformer to capture the local context and long-term dependencies of the time series. We use the open source code from https://github.com/mlpotter/Transformer_Time_Series for experimental analysis.
TCN: Bai et al. [57] showed, through extensive experiments on sequence modeling, that a simple convolutional architecture, the Temporal Convolutional Network (TCN), outperforms canonical recurrent networks such as LSTMs for time series forecasting on different datasets. We use the open source code from https://github.com/locuslab/TCN for experimental analysis.
Informer: Zhou et al. [14] designed Informer, an efficient Transformer-based model for long time series forecasting. We use the open source code from https://github.com/zhouhaoyi/Informer2020 for experimental analysis.
Autoformer: Wu et al. [60] designed Autoformer, a novel decomposition architecture with an Auto-Correlation mechanism for long time series forecasting. We use the open source code from https://github.com/thuml/autoformer for experimental analysis.
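The "last T observations predict the next H timestamps" protocol used for representation-based forecasting can be sketched as follows; the encoder interface, the window construction, and the ridge-regression head are illustrative assumptions rather than the exact code of any of the above repositories.

# Sketch: forecast with a frozen encoder and a linear (ridge) head.
import numpy as np
from sklearn.linear_model import Ridge

def make_windows(series, T, H):
    # series: 1-D array; build (input window, future horizon) pairs.
    X, Y = [], []
    for t in range(len(series) - T - H + 1):
        X.append(series[t:t + T])
        Y.append(series[t + T:t + T + H])
    return np.array(X), np.array(Y)

def fit_forecaster(encoder, train_series, T=200, H=24, alpha=1.0):
    X, Y = make_windows(train_series, T, H)
    Z = encoder.encode(X)              # (n_windows, repr_dim); encoder is frozen
    head = Ridge(alpha=alpha).fit(Z, Y)  # multi-output regression onto H steps
    return head

def forecast(encoder, head, recent_window):
    z = encoder.encode(recent_window[None, :])  # add a batch dimension
    return head.predict(z)[0]                   # predicted values for the next H steps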
A.3.3 Time-Series Anomaly Detection
TS2Vec: For the time-series anomaly detection task, Yue et al. [68] followed a streaming evaluation protocol [152] to perform point anomaly detection of time series. We use the open source code from https://github.com/yuezhihan/ts2vec for experimental analysis.
SPOT and DSPOT: Siffer et al. [161] proposed a time series anomaly detection method based on extreme value theory. The authors divided the proposed approach into two algorithms: SPOT for streaming data having any stationary distribution, and DSPOT for streaming data that can be subject to concept drift. We use the open source code from https://github.com/Amossys-team/SPOT for experimental analysis.
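As a rough illustration of streaming point-anomaly detection, the sketch below flags points whose anomaly scores exceed a threshold calibrated on an initial segment. A high empirical quantile stands in for the extreme-value-theory (peaks-over-threshold) fit that SPOT actually uses, and the periodic re-estimation only loosely mimics DSPOT's handling of drift; none of this reproduces the original algorithms.

# Sketch: quantile-threshold streaming point-anomaly detection.
import numpy as np

def streaming_detect(scores, n_init=500, q=0.99):
    """scores: 1-D array of per-timestamp anomaly scores (e.g., from a PTM)."""
    init, stream = scores[:n_init], scores[n_init:]
    threshold = np.quantile(init, q)         # calibrate on the initial window
    flags = []
    history = list(init)
    for s in stream:
        flags.append(s > threshold)          # flag points above the threshold
        history.append(s)
        # Periodically re-estimate the threshold so it can follow the data.
        if len(history) % 200 == 0:
            threshold = np.quantile(history[-n_init:], q)
    return np.array(flags)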
LSTM-VAE: Park et al. [162] proposed a Long Short-Term Memory-based Variational AutoEncoder (LSTM-VAE) model for outlier detection of multimodal sensory signals. We use the open source code from https://github.com/SchindlerLiang/VAE-for-Anomaly-Detection for experimental analysis.
DONUT: Xu et al. [163] proposed DONUT, an unsupervised anomaly detection algorithm based on a variational autoencoder. We use the open source code from https://github.com/NetManAIOps/donut for experimental analysis.
SR: Ren et al. [152] proposed an algorithm based on the Spectral Residual (SR) and a CNN for anomaly detection of time series. Since the authors do not provide open source code, we use the experimental results on the Yahoo and KPI datasets reported in the original paper for comparative analysis.
AT: Xu et al. [164] proposed the Anomaly Transformer (AT) with a new anomaly-attention mechanism for unsupervised detection of anomaly points in time series. We use the open source code from https://github.com/spencerbraun/anomaly_transformer_pytorch for experimental analysis.

For the above baselines in the three major time-series mining tasks (classification, forecasting, and anomaly detection), we use uniform random seeds for model training. In addition, the dataset partitioning used by all baselines is kept consistent. Further, the hyperparameter settings of all baselines follow the parameters provided by the original authors, and the details can be found in our open-source code https://github.com/qianlima-lab/time-series-ptms.

TABLE 8
The summary of Time-Series Pre-Trained Models (TS-PTMs).

TABLE 9
Details of the source UCR time series datasets used for transfer learning.

ID  Dataset Name                Total Sample Size  Length  Class
1   Crop                        24000              46      24
2   ElectricDevices             16637              96      7
3   StarLightCurves             9236               1024    3
4   Wafer                       7164               152     2
5   ECG5000                     5000               140     5
6   TwoPatterns                 5000               128     4
7   FordA                       4921               500     2
8   UWaveGestureLibraryAll      4478               945     8
9   UWaveGestureLibraryX        4478               315     8
10  UWaveGestureLibraryY        4478               315     8
11  UWaveGestureLibraryZ        4478               315     8
12  FordB                       4446               500     2
13  ChlorineConcentration       4307               166     3
14  NonInvasiveFetalECGThorax1  3765               750     42
15  NonInvasiveFetalECGThorax2  3765               750     42

A.4 Implementation Details
For the time-series classification task, we normalize each series of the UCR and UEA datasets via z-score [39], and build the FCN and TCN models using PyTorch according to the settings of [39], [134]. Adam is adopted as the optimizer with a learning rate of 0.001 and a maximum batch size of 128. The maximum number of epochs on the source UCR time series dataset for transfer learning is 2000, while the maximum number of epochs on the target UCR time series dataset is 1000. Also, a uniform early stopping rule is used for all baselines during training. The maximum number of epochs on the source dataset of the four independent time series scenarios (refer to Table 11) is 40, while the maximum number of epochs on the corresponding target dataset is 100. For the classification accuracy of the target UCR time series, we report the average test accuracy over five-fold test sets obtained with one seed. For the classification accuracy of the target datasets in the four independent time series scenarios, we report the average test accuracy on the test set over ten different seeds. For the time-series forecasting and anomaly detection tasks, we follow [68] to preprocess the datasets and set the hyperparameters. In addition, the comparative benchmark methods for the time-series classification, forecasting, and anomaly detection tasks are reproduced and analyzed using the open-source code provided by the original authors. To ensure the reproducibility of the experimental results, we use a uniform random seed for all baselines. Finally, all experiments are run on eight 1080Ti GPUs and two 3090 GPUs under Ubuntu 18.04, and the training times in the main experimental results are all measured on the 3090 GPUs.
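A minimal sketch of this preprocessing and reproducibility setup is given below; only the per-series z-score normalization and the Adam settings (learning rate 0.001) come from the text, while the function names and the seed value are illustrative.

# Sketch: per-series z-score normalization and a single uniform random seed.
import random
import numpy as np
import torch

def zscore_per_series(X, eps=1e-8):
    # X: (n_series, length); normalize each series to zero mean, unit variance.
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / (std + eps)

def set_uniform_seed(seed=42):
    # One seed shared by all libraries so that every baseline sees the same randomness.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Optimizer settings stated in the text (model construction omitted):
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Normalizing each series independently removes scale differences between UCR/UEA datasets, so the encoders compare shapes of series rather than absolute magnitudes.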
APPENDIX B
FULL RESULTS
In this section, we give the detailed classification results of time-series PTMs on the 128 UCR and 30 UEA datasets.

B.1 Comparison of Transfer Learning PTMs based on Supervised Classification and Unsupervised Reconstruction
The test classification accuracy of FCN, TCN, and Transformer under direct supervised training on the 128 UCR time series datasets is given in Table 12. To facilitate the layout and reading of the test classification results, the standard deviation of the classification accuracy for each dataset is not given in Table 12. For the datasets used by the transfer learning PTMs, please refer to Tables 9, 10, and 11.

B.2 Comparison of PTMs based on Transformer and Consistency
The pre-training test classification accuracies of the Transformer-based and contrastive-learning-based PTMs on the 128 UCR and 30 UEA time series datasets are shown in Tables 13 and 14. Also, for the convenience of layout and reading of the test classification results, the standard deviations of the classification performance for each dataset are not given in Tables 13 and 14.
[Fig. 14: heatmap of accuracy differences over source datasets (rows) and target UCR datasets (columns); numeric values omitted.]
Fig. 14. Classification comparison results based on transfer learning. The values indicate the difference in classification accuracy of the FCN encoder with and without transfer learning. A value greater than zero indicates positive transfer, while a value less than zero indicates negative transfer. Red background is for positive values and blue for negative, with larger-magnitude values more saturated.
TABLE 10
Details of the target UCR time series datasets used for transfer learning.

TABLE 11
The detailed information of the four independent time series scenario datasets. For the Samples column, Train is the training set, Val is the validation set, and Test is the test set.

TABLE 12
Test classification accuracy using FCN, TCN, and Transformer for direct supervised classification training on the 128 UCR datasets.

TABLE 13
Test classification accuracy using Transformer-based and consistency-based PTMs on the 128 UCR datasets.

TABLE 14
Test classification accuracy using Transformer-based and consistency-based PTMs on the 30 UEA datasets.