Self-Supervised Pretraining of Transformers For Satellite Image Time Series Classification

This document discusses self-supervised pretraining of transformers for satellite image time series (SITS) classification. The proposed method pretrains a transformer-based network on large unlabeled SITS data by having it predict randomly contaminated observations in time series. This leverages the temporal structure to learn general spectral-temporal representations related to land cover semantics. The pretrained network can then be fine-tuned on small labeled datasets for specific classification tasks, improving generalization performance and reducing overfitting compared to training from scratch. Experiments on three benchmark datasets demonstrated the effectiveness of this pretraining approach.

474 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021

Self-Supervised Pretraining of Transformers for Satellite Image Time Series Classification

Yuan Yuan, Member, IEEE, and Lei Lin

Abstract—Satellite image time series (SITS) classification is a major research topic in remote sensing and is relevant for a wide range of applications. Deep learning approaches have been commonly employed for SITS classification and have provided state-of-the-art performance. However, deep learning methods suffer from overfitting when labeled data are scarce. To address this problem, we propose a novel self-supervised pretraining scheme to initialize a transformer-based network by utilizing large-scale unlabeled data. In detail, the model is asked to predict randomly contaminated observations given an entire time series of a pixel. The main idea of our proposal is to leverage the inherent temporal structure of satellite time series to learn general-purpose spectral-temporal representations related to land cover semantics. Once pretraining is completed, the pretrained network can be further adapted to various SITS classification tasks by fine-tuning all the model parameters on small-scale task-related labeled data. In this way, the general knowledge and representations about SITS can be transferred to a label-scarce task, thereby improving the generalization performance of the model as well as reducing the risk of overfitting. Comprehensive experiments have been carried out on three benchmark datasets over large study areas. Experimental results demonstrate the effectiveness of the proposed pretraining scheme, leading to substantial improvements in classification accuracy using transformer, 1-D convolutional neural network, and bidirectional long short-term memory network. The code and the pretrained model will be available at https://github.com/linlei1214/SITS-BERT upon publication.

Index Terms—Bidirectional encoder representations from Transformers (BERT), classification, satellite image time series (SITS), self-supervised learning, transfer learning, unsupervised pretraining.

Manuscript received September 15, 2020; revised October 22, 2020; accepted November 3, 2020. Date of publication November 9, 2020; date of current version January 6, 2021. This work was supported in part by the Research Project of Surveying Mapping and Geoinformation of Jiangsu Province under Project JSCHKY201905, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20170897, in part by the National Natural Science Foundation of China under Grant 41901356, and in part by the support of Environmental Protection Research Project of Jiangsu Province under Grant 2019010. (Yuan Yuan and Lei Lin are co-first authors.) (Corresponding author: Yuan Yuan.)
Yuan Yuan is with the School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China (e-mail: [email protected]).
Lei Lin is with the Beijing Qihoo Technology Company Ltd., Beijing 100015, China (e-mail: [email protected]).
This article has supplementary downloadable material available at https://ieeexplore.ieee.org, provided by the authors.
Digital Object Identifier 10.1109/JSTARS.2020.3036602
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

NOWADAYS, a huge volume of Earth observation (EO) data are being accumulated thanks to remarkable breakthroughs in the latest generation of satellites. After the successive launch of the twin satellites (Sentinel-2A/B) of the Sentinel-2 mission, it is now possible to access large-area, high-quality, medium spatial resolution EO data with much higher frequency. Due to the short revisit time of the Sentinel-2A/B satellites (five days from the two-satellite constellation), consecutively acquired images covering the same geographical area can be properly organized into satellite image time series (SITS) [1]. Such medium-resolution SITS data provide valuable information about the status and dynamics of the Earth surface, supporting analyses of the functional and structural characteristics of land covers as well as identifying change events [2], [3]. For this reason, SITS have been widely used in various application domains, such as ecology [4], agriculture [5], forestry [6], land management [7], disaster monitoring [8], risk assessment [9], etc. In the meantime, new challenges are also being introduced by the question of how to extract valuable knowledge and meaningful information to exploit such abundant data.

SITS classification is one of the central problems in SITS analysis, and it is closely associated with many land applications, such as land cover mapping and change detection [10], vegetation species classification [11], and crop yield estimation [12]. SITS classification involves assigning every pixel in an image to a categorical label, primarily based on the pixel's spectral profile (trajectories of spectral variations over time) [3]. Machine learning algorithms provide effective tools to achieve automated SITS classification. Traditional algorithms such as support vector machine (SVM) and random forest (RF) classify SITS via handcrafted features, such as raw reflectances, spectral statistics, and phenological metrics [13], [14]. However, characterizing these features is rather difficult due to the strong interannual variations in seasonal patterns of the land surface reflectance, which can be caused by shifts in land cover or environmental conditions, management practices, and disturbance [15].

Lately, deep learning is gaining widespread popularity in the remote sensing community. Various deep neural network architectures have been introduced to advance the state of the art for many remote sensing classification problems [16]–[18]. The major advantage of deep learning methods is that they are capable of learning features from the input data optimized for a specific task without the need for manual feature engineering [1], [19]. Deep learning-based approaches are increasingly being applied to SITS classification. Among these methods, convolutional neural networks (CNNs) [20], [21] and recurrent neural networks [RNNs, including long short-term memory (LSTM) or gated recurrent units] [22], [23] have been most widely used to capture
YUAN AND LIN: SELF-SUPERVISED PRETRAINING OF TRANSFORMERS FOR SATELLITE IMAGE TIME SERIES CLASSIFICATION 475

temporal characteristics from spectral profiles. To exploit both spatial and temporal information of high-resolution SITS, hybrid architectures combining convolutional and recurrent layers [24], [25] and convolutional-recurrent neural networks [26] have been introduced and comprehensively compared [27]. Recently, a promising alternative to RNNs for sequence encoding, namely the transformer, has been proposed in the field of natural language processing (NLP) [28]. Transformers have been reported to have advantages over RNNs in terms of feature extraction and training efficiency. Garnot et al. first introduced transformers into SITS classification. They employed a transformer encoder to learn temporal correlations from a sequence of pixel-set embeddings [29]. A detailed comparison between transformers and other prevailing deep learning architectures for SITS classification can be found in [30], showing that transformers and RNNs outperform CNNs on processing raw satellite time series in terms of their architectures.

Nonetheless, the recent success of deep learning approaches has been primarily driven by deeper architectures and the availability of large amounts of labeled data. For instance, a deep CNN trained on millions of human-annotated photographs, such as ImageNet, can learn powerful visual features that are reusable in various image understanding tasks. When training samples are scarce, a common drawback of most deep learning architectures is that they are very prone to overfitting, since such models often have millions of parameters. However, there are rarely enough labeled data in remote sensing applications, since labeling is very time/labor-consuming and requires expertise. As a result, even though a huge supply of SITS data is available, the performance of deep learning approaches is restricted by the lack of labels.

To fully tap the potential of deep learning methods for SITS classification, some studies have begun to explore how to mitigate the demand for labeled data. For example, Bazzi et al. proposed a transfer learning framework, which adapts a network trained on a label-rich dataset to a label-scarce task by means of knowledge distillation [31]. Ienco et al. developed a weakly supervised layerwise pretraining strategy to initialize parameters of RNNs by utilizing coarse granularity labels to provide supervision signals [32]. Although these approaches can partially alleviate the problem, their effectiveness is still limited by the quality and quantity of labeled samples. To the best of our knowledge, none of the existing work takes advantage of unlabeled data for SITS classification.

Self-supervised learning is a recently emerged unsupervised learning paradigm for tackling the challenge of insufficient labels [33]. In this paradigm, models are trained on unlabeled data by leveraging the structure present in the data itself to create supervised tasks (such tasks are often referred to as "pretext tasks") [34]. The general pipeline of self-supervised learning is as follows: in the pretraining stage, models learn an initial set of representations by solving a predefined pretext task; in the fine-tuning stage, these representations are further adapted to a downstream task by continued training on task-related data in a supervised manner. In this way, deep learning models can transfer general knowledge learned from large-scale unlabeled data to a label-scarce downstream task, thereby enhancing the generalization capacity of deep learning models and preventing overfitting. Although a variety of self-supervised learning methods have emerged in recent years in computer vision [35]–[38] and NLP [39]–[41], we are not aware of any existing work that introduces the perspective of self-supervised learning into SITS analysis to cope with the label scarcity problem.

In this article, we propose a novel self-supervised pretraining scheme for parameter initialization of deep neural networks on SITS classification tasks, which follows a standard unsupervised-pretraining/supervised-fine-tuning pipeline shown in Fig. 1. We design the following pretext task to train a transformer network: the model is forced to predict corrupted observations with randomly added noise, given an entire annual time series of a pixel. The central idea behind our pretext task is to leverage the inherent temporal structure of satellite time series to capture, from a large volume of SITS data, meaningful spectral-temporal characteristics that are closely related to the natural changes at the Earth's surface. In this way, enormous amounts of background knowledge are accumulated in the network through pretraining, making the model "understand" what satellite time series should look like. Hence, it is more robust for neural networks to learn a mapping between spectral profiles and corresponding types given a small amount of labeled data, as the connection between temporal dynamics of satellite time series and land cover semantics is strong.

Fig. 1. Proposed pipeline for satellite image time series (SITS) classification: in the pretraining stage, a neural network (i.e., SITS-BERT, where BERT stands for bidirectional encoder representations from Transformers) is pretrained on massive unlabeled data to solve a pretext task; in the fine-tuning stage, the pretrained network serves as a representation model that is used in a specific classification task by fine-tuning all the parameters on task-related labeled samples.

The key contributions of this article are threefold.
1) For the first time, we propose a self-supervised pretraining scheme to cope with the problem of insufficient labeled samples for SITS analysis.
2) We introduce an end-to-end deep learning architecture as an effective alternative to convolutional and recurrent neural networks for SITS classification.
3) We conduct comprehensive experiments on three large-scale datasets to validate the effectiveness of the proposed method.

The remainder of this article is organized as follows. Section II summarizes related work on self-supervised learning in remote sensing. Section III explains the motivation of the proposed method. Sections IV and V describe the proposed network architecture and pretraining scheme, respectively. Section VI provides the data information. Section VII reports the experimental results and discussions. Finally, Section VIII concludes this article.
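The pretext task outlined in this section can be made concrete with a small NumPy sketch. The numbers used (15% of observations selected, noise drawn from a uniform distribution on [0, 0.5], a 50/50 choice between adding and subtracting) are those stated in Section V-A, and the loss follows (6). Everything else is an assumption made for illustration: the function names `contaminate` and `contaminated_mse`, the series length and band count, drawing an independent noise value per band, and always selecting at least one observation. It is not the authors' released implementation.

```python
import numpy as np

def contaminate(obs, rng, ratio=0.15, max_noise=0.5):
    """Corrupt a random subset of time steps; return (noisy copy, boolean mask)."""
    n_steps = obs.shape[0]
    # Assumption: round to a whole number of steps and keep at least one target.
    n_pick = max(1, int(round(ratio * n_steps)))
    mask = np.zeros(n_steps, dtype=bool)
    mask[rng.choice(n_steps, size=n_pick, replace=False)] = True

    # One sign per observation: +1 mimics clouds and snow/ice, -1 mimics shadows.
    sign = np.where(rng.random(n_steps) < 0.5, 1.0, -1.0)[:, None]
    # Assumption: an independent magnitude per band, drawn from U[0, max_noise].
    noise = rng.uniform(0.0, max_noise, size=obs.shape)

    noisy = obs.copy()
    noisy[mask] += (sign * noise)[mask]
    return noisy, mask

def contaminated_mse(pred, target, mask):
    """Summed squared error over contaminated steps, divided by their count, as in (6)."""
    diff = pred[mask] - target[mask]
    return float(np.sum(diff ** 2) / mask.sum())

rng = np.random.default_rng(0)
series = rng.uniform(0.0, 1.0, size=(24, 10))  # hypothetical: 24 dates, 10 bands
noisy, mask = contaminate(series, rng)
# A trained model would map `noisy` (plus the dates) to predictions;
# the corrupted input itself stands in for the prediction here.
loss = contaminated_mse(noisy, series, mask)
print(int(mask.sum()), loss)
```

Only the contaminated positions contribute to the loss, so the model is never rewarded for simply copying clean inputs to the output.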

II. RELATED WORK

An effective pretext task is the key to successful self-supervised learning and should guarantee that the model learns meaningful representations instead of trivial solutions [34]. According to the strategies used to design pretext tasks, most of the available self-supervised learning approaches in remote sensing can be broadly classified into the following three categories.

Reconstruction-based methods include denoising autoencoders (DAEs) [42], [43] and deCNNs [44], [45]. These methods project data into a low-dimensional latent space and then reconstruct the input from the compressed features. However, exact input reconstruction is often not conducive to learning discriminant features.

Generation-based methods include variational autoencoders [46], [47] and generative adversarial networks [48], [49]. These methods aim to approximate the real data distribution and simultaneously learn a mapping from a latent space to the input space. However, the main goal of these methods is to generate more realistic samples rather than extracting meaningful features that facilitate downstream tasks.

Context-based methods leverage spatial, spectral, and temporal context similarity or correlation in the data to create supervision signals. These types of methods are generally heuristic and have no uniform theoretical patterns. For example, Wang et al. learned mapping functions between images and their transformed copies for the purpose of image registration [50]. Dong et al. sampled image patches from bitemporal images and then used the temporal source of these patches as pseudolabels for change detection [51]. Vincenzi et al. attempted to predict visible bands of remote sensing images from other spectral bands [52]. At present, these kinds of methods are rarely applied to remote sensing data.

In summary, the existing work mainly focuses on learning spatial and spectral features from single or bitemporal images, while very little effort has been devoted to satellite time-series analysis.

III. MOTIVATION

BERT, which stands for bidirectional encoder representations from transformers [39], is undoubtedly the most remarkable self-supervised learning-based framework in the current NLP field. Following an unsupervised-pretraining/supervised-fine-tuning pipeline, BERT can learn general language representations that are reusable in various downstream tasks. In detail, BERT utilizes a "Masked Language Model" as its pretext task, in which some tokens in a sentence are randomly masked out and the model is forced to predict the missing tokens according to the context. This enables BERT to learn word correspondences from a plain text corpus. Since BERT was proposed, the concept of self-supervised pretraining has started dominating the state-of-the-art in NLP.

Similar to text, satellite time series also have strong temporal correlations. Likewise, the temporal dynamics of spectral profiles contain rich semantic information, which is closely associated with seasonal variations and plant phenology. However, while there are clear similarities between text and satellite time series, the insight of BERT has yet to be explored in the canon of existing work in the remote sensing community.

Inspired by BERT, we develop a context-based pretext task to capture meaningful spectral-temporal features from massive unlabeled SITS data. Specifically, the network is asked to recover contaminated observations by means of corresponding acquisition dates and clear observations. Our hypothesis is that noisy observations can be distinguished and reconstructed from dense satellite time series. The design of this pretext task is based on the fact that noise caused by clouds and shadows is commonly found in optical satellite images. Human remote sensing experts can easily eliminate noise interference and distinguish different land cover types from a limited number of images. The absence of images or contaminated images does not necessarily result in a decrease in the interpretation accuracy because the missing information can be inferred from the remaining images. Intuitively, a good representation of SITS should be able to capture stable spectral-temporal patterns that are robust to noise.

IV. MODEL ARCHITECTURE

A. Overall Network Architecture

In this article, we introduce a transformer-based network for SITS classification. Since the model architecture is modified from BERT, we name it SITS-BERT.

We first introduce the symbols used in our description. Let $X = \{\langle O_1, t_1\rangle, \ldots, \langle O_L, t_L\rangle\}$ be an annual time series of a pixel, where $O_i \in \mathbb{R}^D$ denotes a $D$-dimensional satellite observation vector whose elements correspond to each of the input spectral reflectances; $t_i$ corresponds to the date $O_i$ was captured, which is specified using day of year (DOY).

SITS-BERT's model architecture comprises two parts: an observation embedding layer and a standard transformer encoder (see Fig. 2). Specifically, all observation tuples $\langle O_1, t_1\rangle, \ldots, \langle O_L, t_L\rangle$ of a time series are first encoded into a sequence of observation embeddings, which are then fed into a multilayer bidirectional transformer network to produce their corresponding representations. The final observation representations can be aggregated into a single feature vector, which contains the global information of the entire time series and is used for classification. Here, SITS-BERT plays the role of a representation model. By adding an additional output layer, the whole network architecture of SITS-BERT can be further fine-tuned for a specific classification task.

B. Observation Embedding

The observation embedding layer projects an input tuple $\langle O_i, t_i\rangle$ into a higher-dimensional feature space. The reason for this is that the input dimension is tied with the hidden layer size of the transformer, and using larger embedding sizes gives better performance [41].

In this article, an observation embedding is a concatenation of two parts. Given an observation $O_i$, it is projected into a high-dimensional vector using a linear dense layer. In addition, the corresponding date $t_i$ is also encoded into a vector of the same size using the positional encoding (PE) technique [28]. PE encodes the order information of a sequence with sine/cosine

functions of different frequencies

$$\mathrm{Embed}(O_i, t_i) = \mathrm{Concat}\left(O_i W_e,\ \mathrm{PE}(t_i)\right) \tag{1}$$

$$\mathrm{PE}(t_i)_p = \begin{cases} \sin\left(t_i / 10000^{2k/d_m}\right), & \text{if } p = 2k \\ \cos\left(t_i / 10000^{2k/d_m}\right), & \text{if } p = 2k + 1 \end{cases} \tag{2}$$

where $\mathrm{Embed}(O_i, t_i)$ represents the observation embedding of $\langle O_i, t_i\rangle$; $d_m = 256$ is the embedding size, which is the same as the hidden layer size of the transformer; Concat denotes a concatenation between two vectors; $W_e \in \mathbb{R}^{D \times \frac{d_m}{2}}$ is the weight matrix of the dense layer; $\mathrm{PE}(t_i)_p$ represents the $p$th element of the PE vector. It should be noted that in BERT, the PE vector is directly added to a token embedding. However, we suggest concatenating the two parts in case the model cannot distinguish between observations and time. Experiments also indicate that the model converges faster using concatenation than summation.

The main purpose of utilizing time information (i.e., DOY rather than the order of an image in a sequence) is to make the model learn meaningful temporal variation patterns that are closely related to seasonal cycles and vegetation phenology. The benefits of this are twofold. On one hand, the model can overcome the limitations associated with different sampling dates and sequence lengths, making the model trainable and transferable across different years. On the other hand, the model is able to deal with irregular sampling problems caused by the presence of noise in the time series. Accordingly, no smoothing, gap filling, or compositing method is needed to reconstruct evenly spaced time series.

C. Transformer Encoder

The transformer was first introduced in NLP as an efficient alternative to RNNs [28] and has since been applied to several remote sensing tasks, such as hyperspectral image classification [53], image captioning [54], and SITS classification [29].

A transformer encoder is a stack of multiple transformer blocks. The first transformer block takes a collection of observation embeddings of a time series as its input and generates corresponding hidden representations. Then, these representations are passed to the next transformer block as its input to iteratively generate higher level representations. A single transformer block (depicted in Fig. 2) contains two major sublayers: a multihead attention layer and a positionwise fully connected feed-forward network (FFN). In addition, a residual connection [55] and layer normalization [56] are utilized for both sublayers. The two sublayers are described in detail below.

Fig. 2. Structure of the SITS-BERT architecture shown unfolded in time. Besides the output layer (not depicted in the figure), the same architecture is used in both the pretraining and fine-tuning stages. The input tuple $\langle O_i, t_i\rangle$ at each timestep is a pair of observation and the corresponding acquisition date (day of year, DOY). The final hidden representation is denoted as $T_i$. $T_i$ encodes the information about the entire time series.

Fig. 3. Illustration of the scaled dot-product attention (left) and multihead attention (right). Q stands for queries' matrix; K stands for keys' matrix; V stands for values' matrix.

A multihead attention layer consists of $H$ parallel scaled dot-product attention layers, each called a head (see Fig. 3) [28]. A scaled dot-product attention is a function that maps a query vector and a set of key-value pairs into an output vector

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{3}$$

where $Q$, $K$, $V$ represent the matrices stacked by multiple query, key, and value vectors as rows, respectively; $d_k$ is the dimensionality of the query/key vectors. Multihead attention takes it a step further by first mapping $Q$, $K$, and $V$ into different lower-dimensional feature subspaces via different linear dense layers, and then using the results to calculate attention. Finally, the outputs produced by $H$ heads are concatenated and projected into a final hidden representation using another dense layer

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \tag{4}$$
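As an illustration of how (1)–(4) fit together, the following NumPy sketch embeds one time series and passes it through a single multihead self-attention layer. The sizes $d_m = 256$, $H = 8$, and $d_v = d_m/H = 32$ are the ones used in this article. The sequence length, the number of bands, the random weights, and all function names are assumptions made for the example. The PE vector is given length $d_m/2$ so that the concatenation in (1) has size $d_m$, while the exponent uses $d_m$ exactly as (2) is written; this is one reading of the formula, not a confirmed detail of the released code. Residual connections, layer normalization, and dropout are omitted.

```python
import numpy as np

def positional_encoding(doy, d_pe, d_m):
    """Sine/cosine encoding of acquisition dates (day of year), following (2)."""
    t = np.asarray(doy, dtype=np.float64)[:, None]   # (L, 1)
    k = np.arange(d_pe // 2)[None, :]                # (1, d_pe / 2)
    angle = t / (10000.0 ** (2.0 * k / d_m))
    pe = np.empty((t.shape[0], d_pe))
    pe[:, 0::2] = np.sin(angle)                      # elements p = 2k
    pe[:, 1::2] = np.cos(angle)                      # elements p = 2k + 1
    return pe

def embed(obs, doy, w_e, d_m):
    """Concatenate the projected observation with its date encoding, as in (1)."""
    return np.concatenate([obs @ w_e, positional_encoding(doy, d_m // 2, d_m)], axis=-1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))    # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, as in (3)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def multihead_self_attention(x, w_q, w_k, w_v, w_o):
    """Multihead attention with Q = K = V = x, as in (4); w_q, w_k, w_v: (H, d_m, d_v)."""
    heads = attention(x @ w_q, x @ w_k, x @ w_v)     # (H, L, d_v) via broadcasting
    return np.concatenate(list(heads), axis=-1) @ w_o  # (L, H * d_v) @ (H * d_v, d_m)

rng = np.random.default_rng(0)
L, D, d_m, H = 24, 10, 256, 8                        # L and D are hypothetical
d_v = d_m // H
obs = rng.uniform(0.0, 1.0, size=(L, D))
doy = np.sort(rng.choice(np.arange(1, 366), size=L, replace=False))
w_e = rng.normal(0.0, 0.02, size=(D, d_m // 2))
w_q, w_k, w_v = (rng.normal(0.0, 0.02, size=(H, d_m, d_v)) for _ in range(3))
w_o = rng.normal(0.0, 0.02, size=(H * d_v, d_m))

x = embed(obs, doy, w_e, d_m)
y = multihead_self_attention(x, w_q, w_k, w_v, w_o)
print(x.shape, y.shape)
```

Because the dates enter through `positional_encoding` rather than through the position of an observation in the sequence, the same code accepts irregularly sampled series of any length.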

In (4), $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are the weight matrices of the inner dense layers of each head, and $W^{O}$ is the weight matrix of the top dense layer. In this article, $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d_m \times d_v}$ and $W^{O} \in \mathbb{R}^{H d_v \times d_m}$, where we use $H = 8$ as the number of heads and $d_v = d_m / H = 32$ as the output dimensionality of each head.

In a transformer, a self-attention mechanism is used to learn the internal relationship of an input sequence [28]. In the self-attention mechanism, the query, key, and value vectors are identical. That is, attention is calculated between each position in a sequence and every other position (including itself). As a result, the hidden representation of each observation is a weighted sum of all positions in a time series, thereby capturing the global sequence information and highlighting the part around the $i$th observation.

A positionwise FFN is then applied to the hidden state of each position independently and identically. FFN is made up of two linear transformations with a ReLU activation function in between

$$\mathrm{FFN}(x_i) = \max(0,\ x_i W_1 + b_1)\, W_2 + b_2 \tag{5}$$

where $x_i$ is the $i$th hidden state produced by the multihead attention layer, $W_1$, $W_2$ are weight matrices, and $b_1$, $b_2$ are bias terms of the inner and output dense layers, respectively. In this article, we use $d_{ff} = 4 d_m = 1024$ as the dimensionality of the inner dense layer, i.e., $W_1 \in \mathbb{R}^{d_m \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d_m}$, $b_2 \in \mathbb{R}^{d_m}$.

V. PROPOSED SELF-SUPERVISED PRETRAINING SCHEME

A. Pretraining SITS-BERT

In analogy to BERT, we train our model on a predesigned pretext task: some of the input observations are randomly chosen and contaminated with noise, and then the model is forced to predict the original values of these contaminated observations (see Fig. 4). By solving this pretext task, the model can learn temporal correspondences between observations.

Fig. 4. Illustration of the proposed pretraining strategy. Some observations in a satellite time series are randomly selected and added with positive or negative noise, and then the model is trained to predict these contaminated observations using the rest of the observations. In the figure, the input observations $O_3$ and $O_7$ are added with noise $\pm\varepsilon$, where $\varepsilon \geq 0$, and the prediction loss at these two positions is used to optimize the model.

In the experiments, 15% of the observations in a time series are selected at random and contaminated with noise. When an observation $O_i$ is selected, there is a 50% chance that a positive noise is added to all elements of the vector to simulate abnormal reflectance increases caused by clouds and snow/ice, and a 50% chance that a positive noise is subtracted from all elements of the vector to simulate abnormal reflectance decreases caused by shadows. The noise is generated from a uniform distribution in the interval [0, 0.5]. In the meantime, the corresponding PE vector remains unchanged.

The model is then forced to predict the original values of these contaminated observations according to their acquisition dates and contextual information. In detail, the final hidden representations corresponding to the contaminated observations are fed into a linear dense layer for prediction. The mean-square error (MSE) between original observations and predictions is used as the optimization function for model training

$$\mathrm{MSELoss} = \frac{\sum_{O_j \in \Psi} \left\| O_j - \hat{O}_j \right\|^2}{N} \tag{6}$$

where $\Psi$ denotes the set of contaminated observations; $\hat{O}_j$ is the predicted value of $O_j$; and $N$ is the number of contaminated observations. In this article, we pretrain SITS-BERT for 100 epochs with a batch size of 256 sequences. We utilize the Adam optimizer with a learning rate of 1e−4, learning rate warmup over the first 30 epochs, and exponential decay of the learning rate. Dropout is implemented on all transformer layers with a dropout rate of 0.1.

As a side benefit, the proposed pretext task overcomes the pretrain-fine-tune discrepancy suffered by the original BERT [15]. Specifically, BERT uses a special token [MASK] to tell the model which inputs are missing; this token does not exist in the downstream tasks. In contrast, our model does not introduce artificial tokens. Our assumption is that the model can learn how to distinguish noise from normal observations from the context without explicit annotations.

It is noteworthy that the proposed pretext task is different from DAEs [57], because our aim is to predict contaminated observations instead of reconstructing the entire time series.

B. Fine-Tuning SITS-BERT

The pretrained SITS-BERT can be easily adjusted to a specific SITS classification task by adding an additional output layer. In this article, we fine-tune SITS-BERT on the following two representative tasks, i.e., crop classification and land cover mapping. For both tasks, we simply feed the input–output pairs to the model and fine-tune all the parameters end-to-end on task-related data. Specifically, the input of the model is an annual time series, and the output is the corresponding category label.

A common method to address time series classification is to map time series into a feature space, such that the values of features are close for pixels belonging to the same class. In general, there are two strategies for aggregating individual observation representations produced by SITS-BERT into a single sequence-level representation. These two approaches are 1) the use of a [CLS] token, and 2) the pooling method (see Fig. 5).

1) SITS-BERT with a [CLS] token: The first method is to insert a special token [CLS] (an abbreviation for "classification") at the front of every input time series and then use the output of [CLS] as a global representation of the whole time series [39]. In our implementation, we set [CLS] to be an all-ones vector with the same dimension as the input

observations, and set the corresponding time to be zero. The [CLS] token also needs to be added to the input time series at the pretraining stage.
2) SITS-BERT with pooling: The second method is to apply a pooling operation to the output vectors of BERT, either averaging the outputs (average pooling) or computing the max-over-time of the outputs (max pooling), to aggregate them into a single fixed-size vector.

Fig. 5. Two strategies to generate a sequence-level representation (denoted as C) from individual observation representations of a time series: (1) SITS-BERT with a [CLS] token, (2) SITS-BERT with pooling.

Once the global representation of a time series is obtained, it is fed into a softmax layer to be classified. The cross-entropy loss is used for model optimization in the fine-tuning stage:

$$\mathrm{CrossEntropyLoss} = -\sum_{j=1}^{P} y_j \log(\hat{y}_j) \qquad (7)$$

where ŷj is the probability score inferred by the softmax function for class j, yj is the ground truth, and P is the number of classes.
In this article, we fine-tune SITS-BERT on task-specific data in a supervised manner for 100 epochs. The Adam optimizer is used with a learning rate of 2e-4 and a batch size of 128.

VI. STUDY AREAS AND DATASETS
In this article, we applied the proposed method to Sentinel-2 image time series. The data were organized into four datasets: one was used for model pretraining, and the other three (two crop classification datasets and a land cover mapping dataset) were used for model evaluation.
We investigated the model performance on two kinds of SITS data. In the first type of data, the pretraining and fine-tuning data belonged to the same area, serving the purpose of verifying whether the pretrained model can extract informative and discriminant features from homologous unlabeled data. In the second type of data, the pretraining and fine-tuning data came from different areas, with the purpose of exploring whether the pretrained model can be transferred to unseen data.

A. Pretraining Dataset
We chose three Sentinel-2 tiles (T10SEJ, T10SFH, and T11SKA, each covering a region of about 100 × 100 km) located in the Central Valley, CA, USA to create a pretraining dataset. The location of the study area is depicted in Fig. 6. For each tile, all the available Sentinel-2 A/B images with <10% cloud coverage, taken between January 2018 and December 2019, were used. A total of 219 images were used to construct the pretraining dataset. Samples were collected from each tile at a regular sampling interval of 10 rows/columns. Sequences shorter than 10 were discarded. Finally, the pretraining dataset was composed of about 6.5 million sequences.
We chose California as the pretraining area for the following reasons. First, California has extremely diverse natural landscapes, representing most of the major biomes in North America, including grasslands, shrublands, deciduous forests, coniferous forests, tundra, mountains, deserts, rainforest, marine, estuarine, and freshwater habitats. The state’s varied topography and climate have given rise to a remarkable plant diversity that cannot be found in any other state of the United States [58]. Moreover, the Central Valley, where the study area is located, is California’s most productive agricultural region as well as one of the most productive worldwide. The agricultural systems in this study area are rather complicated, characterized by small parcels and a vast variety of crop types [59]. For all those reasons, this area can provide abundant and diverse training samples for learning general-purpose SITS representations. Finally, this area has a Mediterranean climate with no significant precipitation during the summer season. Therefore, the spectral trajectory of major crops (such as corn, cotton, rice, and soybeans) throughout the entire growth cycle can be characterized, making it feasible to infer the corrupted part of a time series from the remainder.

B. Two Crop Classification Datasets
We applied the proposed method to two crop classification datasets.
The first dataset came from the pretraining area. A total of 45 images captured in 2019 from Sentinel-2 tile T11SKA in California (see Fig. 6, Tile 3) were used. Ten major crop types covering most of the study area were targeted in this study: corn, cotton, winter wheat, alfalfa, tomatoes, grapes, citrus, almonds, walnuts, and pistachios. Three nonagricultural categories were also included: developed, evergreen forest, and grass/pasture. It should be noted that the fine-tuning data were not used for pretraining.
The second dataset came from the area at the border between Missouri and Arkansas, USA (see Fig. 7), and was composed of 26 images acquired in 2019 from Sentinel-2 tile 15SYA. Ten agricultural and nonagricultural categories were considered in this study: corn, cotton, rice, soybeans, fallow/idle cropland, open water, developed, forest, grass/pasture, and woody wetlands.
For both datasets, the 2019 Cropland Data Layer (CDL) [60] and the corresponding CDL confidence layer [61] provided by the United States Department of Agriculture National Agricultural Statistics Service (USDA NASS) were used as reference data to collect samples. The original spatial resolution of the CDL and confidence layer products was 30 m. They were upsampled to 10 m to be consistent with the Sentinel-2 images. Then, we extracted samples from the whole scene at a regular interval of 10 rows/columns. To increase the accuracy of the samples collected
480 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021

Fig. 6. 2019 cropland data layer (CDL) product of California, USA, which is colored in the same way as the CDL website (https://nassgeodata.gmu.edu/CropScape/). Three Sentinel-2 tiles were used for model pretraining: (1) T10SEJ, (2) T10SFH, (3) T11SKA. The coverage of each Sentinel-2 tile is marked with a red rectangle.

from the CDL, we employed the following two criteria. First, all the neighboring pixels along with the center pixel in a 7×7 window should belong to the same category, to avoid sampling from the boundary of land parcels. Second, the classification confidence of all neighboring pixels in the window should be higher than 80%.

Fig. 7. Location of the second study area for crop classification, showing the 2019 CDL product.

Fig. 8. Location of the study area for land cover mapping, showing the FROM_GLC10 product.

TABLE I
NUMBERS OF SAMPLES IN THE THREE FINE-TUNING DATASETS

C. Land Cover Mapping Dataset
We also applied the proposed method to a land cover mapping dataset, which was built using 20 images taken in Beijing, China, in 2017 from Sentinel-2 tile 50TMK (see Fig. 8). Beijing is characterized by a monsoonal humid continental climate with hot humid summers and cold dry winters. The natural vegetation is deciduous broadleaf forest, and the dominant crop types are
winter wheat, corn, and soybean. There is a large divergence
between the study area and the pretraining area in terms of both
geographical conditions and vegetation species. Therefore, this
dataset provides a good example to explore whether the proposed
pretraining scheme is effective when the target domain and the
pretraining domain have distinct data distributions.
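The pretraining scheme referred to here is the noise-contamination pretext task of Section V-A. As a minimal sketch of that contamination step (not the authors' implementation), the snippet below visits each observation of a series, selects it with probability 0.15, and then either adds or subtracts positive noise across all bands with equal probability. The function name `contaminate`, the per-observation Bernoulli selection, the uniform noise distribution, and the `max_noise` amplitude are all assumptions made for illustration; the text does not state how the noise is drawn.

```python
import random


def contaminate(series, select_rate=0.15, max_noise=0.5, rng=None):
    """Return (noisy_series, mask) for one time series.

    series: list of observations, each a list of band reflectances.
    mask[i] is True where observation i was contaminated, so that a
    reconstruction loss could be restricted to those positions.
    Assumption: noise is uniform in [1e-6, max_noise], drawn per band.
    """
    rng = rng or random.Random()
    noisy, mask = [], []
    for obs in series:
        if rng.random() < select_rate:
            noise = [rng.uniform(1e-6, max_noise) for _ in obs]
            # 50%: add noise (cloud/snow-like increase);
            # 50%: subtract noise (shadow-like decrease).
            sign = 1.0 if rng.random() < 0.5 else -1.0
            noisy.append([v + sign * n for v, n in zip(obs, noise)])
            mask.append(True)
        else:
            noisy.append(list(obs))
            mask.append(False)
    return noisy, mask
```

During pretraining, a model would receive `noisy` as input and be scored only at the positions where `mask` is true; how that loss is defined is outside the scope of this sketch.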
The reference data we used was a public global land cover
product named FROM_GLC10 (v0.1.3), which was built on 10
m resolution Sentinel-2 data acquired in 2017 [62]. We adopted
the same classification system as FROM_GLC10. Five major
land cover types were considered: cropland, forest, grassland,
water, and impervious area.
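For the two crop classification datasets of Section VI-B, samples are taken on a regular grid and kept only if the surrounding 7×7 window is homogeneous in class and every pixel in it has a CDL confidence above 80%. The following is a minimal sketch of that filter, assuming the label and confidence rasters are given as nested lists already resampled to the 10 m grid. The name `select_samples` and its parameters are hypothetical, and skipping grid points whose window extends beyond the raster edge is an assumption.

```python
def select_samples(labels, confidence, step=10, window=7, min_conf=80):
    """Return (row, col, label) for grid pixels passing both criteria.

    labels, confidence: 2-D lists (rows x cols) holding class ids and
    confidence values in percent.
    """
    half = window // 2
    n_rows, n_cols = len(labels), len(labels[0])
    kept = []
    for r in range(0, n_rows, step):
        for c in range(0, n_cols, step):
            # Assumption: skip grid points whose window leaves the raster.
            if r < half or c < half or r + half >= n_rows or c + half >= n_cols:
                continue
            center = labels[r][c]
            ok = all(
                labels[i][j] == center and confidence[i][j] > min_conf
                for i in range(r - half, r + half + 1)
                for j in range(c - half, c + half + 1)
            )
            if ok:
                kept.append((r, c, center))
    return kept
```

The strict comparison `> min_conf` mirrors the wording "higher than 80%"; a pixel with confidence exactly 80 rejects its window.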

D. Data Collection and Preprocessing


All the Sentinel-2 images (Level-1C) we used were down-
loaded from the United States Geological Survey (USGS) Earth-
Explorer website and preprocessed to Bottom-Of-Atmosphere
(BOA) reflectance Level-2A using the Sen2Cor plugin v2.8 and
the Sentinel Application Platform (SNAP 7.0). The Multispec-
tral Instrument (MSI) sensor provides 13 spectral bands, i.e.,
four bands at 10 m (Blue, Green, Red, NIR), six bands at 20
m (Vegetation Red Edge 1–3, Narrow NIR, SWIR 1–2), and
three atmospheric bands at 60 m spatial resolution. With the
exception of the atmospheric bands, all bands were used in
this article. Bands at 20 m resolution were resampled to 10 m
via nearest sampling. A scene classification map was generated
for each image along with the Level-2A processing, which assigned pixels to clouds, cloud shadows, vegetation, soils/deserts, water, snow, etc. According to the scene classification map, low-quality observations belonging to clouds (including cirrus), cloud shadows, and snow were discarded when extracting the annual time series of each pixel. Each selected sample should include at least three clear observations. The total number of samples in each evaluation dataset is displayed in Table I.

TABLE II
CLASSIFICATION ACCURACY COMPARISON OF DIFFERENT SITS-BERT VARIANTS ON THE THREE DATASETS

VII. EXPERIMENTAL RESULTS
A. Evaluation Criteria and Methods
We compared the performance of the evaluated algorithms with the following metrics derived from the confusion matrix [18]:
1) Overall Accuracy (OA): This metric represents the proportion of correctly classified samples in all tested samples, and is computed by dividing the number of correctly classified samples by the total number of test samples.
2) Kappa Coefficient: This metric is considered to be a consistency measure between the classification result and the ground truth. The Kappa coefficient is generally considered to be more robust than OA as it takes into account the possibility of agreement occurring solely by chance.
3) Average Accuracy (AA): This metric is an objective indicator of classification accuracy on unbalanced datasets. AA is calculated by dividing the sum of the accuracies for all classes by the number of classes, where the accuracy of a class (i.e., the producer’s accuracy) is the number of correctly classified samples divided by the total number of samples of that class.
To assess the effectiveness of the proposed network, we compared it with five methods that are widely employed in SITS classification. For traditional machine learning algorithms, we selected SVM and RF classifiers. SVM is widely considered a powerful technique for classification tasks, while RF has some advantages such as short training time, easy parameterization, and high robustness to high-dimensional input features [13], [63]. Concerning deep learning models, we selected three approaches that performed well in previous SITS classification studies: CNN-1D (1-D CNN) [21], LSTM [22], [23], and bidirectional LSTM (Bi-LSTM) [64], [65]. These architectures have been proven to have advantages in capturing temporal dependencies in sequential data.
1) SVM: We used an SVM classifier with a radial basis function (RBF) kernel. The hyperparameters C and γ were optimized on a logarithmic grid from 10^-2 to 10^2.
2) RF: We optimized an RF classifier using a varied number of trees in the forest. The optimal number of trees was selected from the set {100, 200, 300, 400, 500}.
3) CNN-1D: A three-layer CNN-1D network was used in the experiment [21], which was formed by three convolutional layers (128 units), one dense layer (256 units), and one softmax layer. The kernel size of all convolutional layers was set to 5. Dropout was used after each convolutional layer with a dropout rate of 0.5.
4) LSTM: A three-layer stacked LSTM network was used, which was formed by three LSTM layers (256 units) and one softmax layer. Dropout was used after each LSTM layer with a dropout rate of 0.5.
5) Bi-LSTM: A three-layer stacked Bi-LSTM network was used [66], which was formed by three Bi-LSTM layers (128 units) and one softmax layer. The outputs at each timestep from both forward and backward directions were concatenated and used as the input of the next layer. Dropout was used after each layer with a dropout rate of 0.5.
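The three evaluation metrics defined earlier in this section (OA, Kappa coefficient, and AA) can all be computed from a confusion matrix. The following is a minimal sketch, assuming rows index the reference class and columns the predicted class, and that every class has at least one reference sample. The function name `accuracy_metrics` is hypothetical, and the Kappa formula used is the standard Cohen's kappa, which the text does not spell out.

```python
def accuracy_metrics(cm):
    """Compute (OA, Kappa, AA) from a square confusion matrix.

    cm[i][j] = number of samples of reference class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    diag = sum(cm[i][i] for i in range(n))
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]

    # Overall accuracy: correctly classified samples over all samples.
    oa = diag / total
    # Agreement expected by chance, from the marginal totals.
    pe = sum(row_sums[k] * col_sums[k] for k in range(n)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    # Producer's accuracy per class, averaged over classes.
    aa = sum(cm[i][i] / row_sums[i] for i in range(n)) / n
    return oa, kappa, aa
```

Because AA weights every class equally regardless of its size, it can diverge from OA when the test set is imbalanced, which is why both are reported.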
SVM and RF handle a multivariate time series as a flat feature
vector, so the dimensionality of the input features must be
identical for all samples. Hence, we padded the time series to the
maximum length by filling in the missing observations with
zeros. For deep learning models, we used the same observation
embedding layer to encode the input observations. SVM and RF
were implemented using the Python Scikit-Learn library, while
all deep learning methods were implemented using the Python
PyTorch library. Experiments were carried out on a workstation
with an Intel Xeon Silver CPU 4210R 20-Core (2.40GHz) with
128 GB of RAM and a NVIDIA TITAN RTX 24G GPU.
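As an illustration of the zero-padding step used for SVM and RF, the sketch below pads each variable-length series to a common length and flattens it into a single feature vector. The helper name `pad_and_flatten` is hypothetical, and appending the zero observations at the end of the series is an assumption, since the text only states that missing observations are filled with zeros.

```python
def pad_and_flatten(series, max_len):
    """Pad a series (list of per-date band vectors) with all-zero
    observations up to max_len, then flatten it into one flat list."""
    if len(series) > max_len:
        raise ValueError("series is longer than max_len")
    n_bands = len(series[0])
    padded = [list(obs) for obs in series]
    padded += [[0.0] * n_bands for _ in range(max_len - len(series))]
    return [value for obs in padded for value in obs]


# Usage: two series of different lengths become equal-length vectors.
batch = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
max_len = max(len(s) for s in batch)
features = [pad_and_flatten(s, max_len) for s in batch]
```

Each row of `features` then has length `max_len × n_bands`, which is the fixed-size input that classifiers operating on flat vectors expect.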
In order to simulate real-world scenarios, we randomly selected 100 samples per category from each dataset to create the training and validation sets, respectively, and used all the remaining samples for testing. The training set was used for training or fine-tuning pretrained models, and the validation set was used for choosing hyperparameters (e.g., training epochs). It is worth mentioning that we did not follow the standard paradigm for method evaluation, which splits all data into training, validation, and test sets at certain ratios. In our setting, the data distributions of training samples and test samples may be different. The main purpose of this is to verify the effectiveness of the proposed pretraining scheme under small-sample conditions.

B. Model Configuration
In this section, we evaluate the performance of three SITS-BERT variants to derive sequence-level representations. To verify whether the pretraining scheme is effective, we also trained each SITS-BERT variant from scratch using only labeled data (we refer to such models as “non-pre-trained” models). The three variants are described as follows.
1) SITS-BERT using the [CLS] token: In this case, the fine-tuned hidden representation of the [CLS] token was regarded as a representation of the entire input time series. We refer to this competitor as “SITS-BERTCLS.”
2) SITS-BERT using average pooling: In this case, the hidden representations of all the observations in a time series were averaged and used for classification. We refer to this competitor as “SITS-BERTAVG.”
3) SITS-BERT using max pooling: In this case, the maximum value of the hidden representations along the time dimension was computed and used for classification. We refer to this competitor as “SITS-BERTMAX.”
The classification results are shown in Table II, and the confusion matrices are given in the Supplementary Materials. We observe that for all variants, the pretrained models remarkably outperform their randomly initialized versions (except for SITS-BERTAVG in the third dataset). Specifically, the proposed pretraining scheme leads to an averaged accuracy increment of 2.97%, 1.88%, 2.58% for OA, 0.034, 0.023, 0.045 for Kappa coefficient, and 0.030, 0.029, 0.068 for AA on each dataset, respectively. The accuracy improvement on the latter two datasets indicates that unsupervised pretraining does produce positive transfer, that is, the representations learned from large-scale unlabeled data can be transferred to new contexts, thereby improving the performance of the model on label-scarce downstream tasks. We also find that SITS-BERTMAX exceeds the other variants in terms of OA and Kappa coefficient, while SITS-BERTCLS yields higher AA than SITS-BERTMAX in the latter two datasets, showing that SITS-BERTCLS is more advantageous for handling imbalanced classification problems.

C. Method Comparison
In this section, we compare the performance of SITS-BERT with other algorithms for SITS classification. We used SITS-BERTMAX as the default model configuration. The term “SITS-BERT” hereinafter refers to the pretrained SITS-BERTMAX. All other deep learning models were trained from scratch for 300 epochs with a learning rate of 2e-4 and a batch size of 128. The classification accuracy assessments for the three datasets are displayed in Table III, and the confusion matrices are given in the Supplementary Materials.
A major finding is that both CNN and RNN networks yield comparable or even worse results than traditional methods (SVM and RF), while the non-pre-trained SITS-BERT performs slightly better than traditional methods on the latter two datasets. Apart from the proposed model, RF performs best overall, ranking first on the latter two datasets. This suggests that for small datasets, sophisticated deep learning models have no substantial advantages over traditional machine learning methods in SITS classification tasks. This is likely due to the serious overfitting problem that occurs when training deep neural networks on insufficient labeled samples, even though transformer and

CNN-1D seem to be more robust to overfitting compared to LSTM-based networks.

TABLE III
CLASSIFICATION ACCURACY COMPARISON OF DIFFERENT METHODS ON THE THREE DATASETS

In general, SITS-BERT outperforms the other competitors remarkably in terms of all indicators. Specifically, compared with the RF baseline, SITS-BERT increases OA by 5.27%, 2.38%, 3.35%, Kappa coefficient by 0.061, 0.029, 0.053, and AA by 0.056, 0.018, 0.026 on each dataset, respectively. The results highlight that the risk of overfitting when using deep neural networks can be largely reduced by starting from a pretrained model. Moreover, according to the confusion matrices, SITS-BERT exhibits a notably lower rate of misclassifications for complicated and heterogeneous categories compared to other methods. For instance, in the first dataset, pistachios is easily confused with evergreen forest; the proposed method achieves a 20-point gain in the producer’s accuracy with respect to the second best method (SVM). In the third dataset, while cropland has extremely high intraclass variability, the proposed method also achieves a nine-point gain with respect to the second best method (RF).
To better visualize the classification results of different methods, we selected a 3000 × 3000 region of interest (ROI) in each study area for testing. The classification results are shown in Figs. 9–11. It can be seen that for crop classification tasks, the results obtained by SITS-BERT are more consistent with the reference data and the salt-and-pepper effect is negligible. For the land cover mapping task, apart from SITS-BERT, all deep learning methods exhibit a high confusion between cropland and grassland. The large misclassifications of the two categories (i.e., cropland and grassland) may come from errors in the reference data, which underlines the fact that the pretrained model exhibits relatively higher robustness to noisy labels.

D. Effect of the Pretraining Scheme on Other Models
The proposed pretraining scheme may be widely effective for a number of deep learning models. In this section, we validate its ability on CNN-1D and Bi-LSTM. For CNN-1D, we utilized zero-padding to keep the sequence length unchanged. Both models were pretrained on the same pretraining dataset and then fine-tuned on each target dataset. The same pretraining/fine-tuning configurations used for SITS-BERT were adopted. The accuracy assessment is shown in Table IV, and the confusion matrices are given in the Supplementary Materials.

TABLE IV
CLASSIFICATION ACCURACY OF PRETRAINED AND NON-PRE-TRAINED BI-LSTM NETWORKS ON THE THREE DATASETS

We observe that for all datasets, the classification accuracy of the pretrained models is improved over their non-pre-trained versions. In particular, the pretraining scheme brings an averaged increment of 2.58% OA and 5.42% OA to CNN-1D and Bi-LSTM, respectively. The results demonstrate that the proposed pretraining scheme also works for CNNs and RNNs for SITS classification.
In addition, we observe that the performance gain of Bi-LSTM is larger than that of CNN-1D on all datasets, indicating that the proposed pretraining scheme seems to be more effective for transformers and LSTM-based networks than for CNNs. This may be attributed to the ability of transformers and LSTMs to capture long-term dependencies between observations. In detail, LSTM encodes the whole time series by receiving one observation per timestep, and utilizes internal gates to control the update of the memory content. The transformer reads the entire time series at once and calculates dependencies between each observation and all other observations through the self-attention mechanism. In contrast, a CNN can only capture observation dependencies within the width of its filters. In other words, it can only model contextual information within a local neighborhood. This characteristic makes it inferior to transformers and LSTM-based networks for temporal feature extraction.

E. Influence of the Number of Labeled Samples
It is well known that the number of training samples can affect classification performance, but its influence on pretrained models is unknown. In this section, we fine-tuned the pretrained CNN-1D, Bi-LSTM, and SITS-BERT with different numbers of labeled samples, varied from 50 to 500 per category, and took RF (300 trees) as a baseline for comparison. The results are depicted in Fig. 12.
We observe that all the pretrained deep learning models significantly outperform RF and the non-pre-trained model, among

Fig. 9. Crop classification results of different methods in the first study area (California). (1) Ground truth. (2) SVM. (3) Random forest (RF). (4) Long short-term memory (LSTM). (5) Bidirectional LSTM (Bi-LSTM). (6) Non-pre-trained SITS-BERT. (7) SITS-BERT.

Fig. 10. Crop classification results of different methods in the second study area (Missouri). (1) Ground truth. (2) SVM. (3) RF. (4) LSTM. (5) Bi-LSTM. (6)
Non-pre-trained SITS-BERT. (7) SITS-BERT.

Fig. 11. Land cover classification results of different methods in the third study area (Beijing). (1) Ground truth. (2) SVM. (3) RF. (4) LSTM. (5) Bi-LSTM. (6)
Non-pre-trained SITS-BERT. (7) SITS-BERT.

Fig. 12. Classification accuracy of SITS-BERT and the RF baseline using varied training samples. Evaluation metrics in each row from left to right: overall accuracy (OA), Kappa coefficient, and average accuracy (AA).

which SITS-BERT surpasses all other competitors in all datasets. It is worth noting that the accuracy of SITS-BERT fine-tuned on only 50 labeled samples per category is comparable with the accuracy of RF trained on ten times more samples for the first and second datasets. Specifically, the best OA values yielded by RF are 93.33% and 97.58% on the first and second datasets, respectively, while the worst OA values yielded by SITS-BERT are 93.26% and 98.24%. Two main conclusions can be drawn: first, the experiments confirm the effectiveness of self-supervised pretraining across varied numbers of labeled samples; and second, the transformer architecture has advantages over CNNs and RNNs for satellite time series classification.
In contrast, although the non-pre-trained SITS-BERT performs slightly better than RF on the first two datasets, it performs much worse on the third dataset. In addition, it is not guaranteed that the accuracy of the non-pre-trained SITS-BERT will improve as the number of training samples increases, indicating that randomly initialized deep neural networks give much more unstable results.

F. Computational Efficiency
In practice, SITS classification methods need to be applied to millions of satellite time series. Therefore, the computational efficiency of deep learning models should be taken into consideration. In this section, we compare the computation speeds of several deep learning models. The results are given in Table V.
According to the results, CNN-1D is the fastest, followed by LSTM-based models, and SITS-BERT is the slowest on a single GPU. Specifically, CNN-1D is about 19% faster than SITS-BERT. Bi-LSTM, which has shown competitive performance in our previous experiments, is also 5% faster than SITS-BERT. Fortunately, an advantage of the transformer is that the computation can be parallelized

on multiple GPUs. Thanks to the self-attention mechanism in transformers, the attention weights for all positions in a sequence can be calculated simultaneously, whereas a recurrent layer must be computed successively, after the output of the previous layer is obtained [28]. Therefore, SITS-BERT is better suited for massively parallel computations on modern machine learning acceleration hardware.

TABLE V
COMPUTATION SPEED (TIME SERIES PER SECOND) OF DIFFERENT METHODS

VIII. CONCLUSION
In this article, we proposed a simple but powerful self-supervised pretraining scheme for SITS classification. The aim was to leverage large volumes of unlabeled SITS data to learn robust and transferable spectral-temporal features to facilitate label-scarce downstream tasks. We also introduced a transformer-based neural network architecture, named SITS-BERT, for SITS classification.
The evaluation on three benchmark datasets has revealed that the proposed pretraining scheme is generally effective for several deep learning models, i.e., CNN, Bi-LSTM, and transformer. The experiments support our hypothesis that the representations learned from large-scale unlabeled data can produce positive transfer, thereby improving the performance and generalization ability of deep learning models on a specific SITS classification task. In addition, the results also demonstrate that the transformer network exceeds CNNs and RNNs for SITS classification.
We would like to emphasize that the pretraining data we used in this article are far from enough to fully exploit the potential of transformers. Nevertheless, the proposed approach is easily scaled to far larger datasets at very low cost, since no human-annotated labels are required during the pretraining stage.
Since the main purpose of this article was to validate the effectiveness of self-supervised learning for SITS, we put emphasis on the characteristics of spectral profiles of individual pixels while ignoring the spatial correlation information between pixels. In the future, it will be of interest to incorporate spatial information into our framework to learn more discriminant features.

REFERENCES
[1] D. Ienco, R. Interdonato, R. Gaetano, and M. D. H. Tong, “Combining Sentinel-1 and Sentinel-2 satellite image time series for land cover mapping via a multi-source deep learning architecture,” ISPRS J. Photogramm. Remote Sens., vol. 158, pp. 11–22, Dec. 2019.
[2] P. Jonsson and L. Eklundh, “Seasonality extraction by function fitting to time-series of satellite sensor data,” IEEE Trans. Geosci. Remote Sens., vol. 40, no. 8, pp. 1824–1832, Aug. 2002.
[3] C. Gómez, J. C. White, and M. A. Wulder, “Optical remotely sensed time series data for land cover classification: A review,” ISPRS J. Photogramm. Remote Sens., vol. 116, pp. 55–72, 2016.
[4] S. Rapinel, C. Mony, L. Lecoq, B. Clement, A. Thomas, and L. Hubert-Moy, “Evaluation of Sentinel-2 time-series for mapping floodplain grassland plant communities,” Remote Sens. Environ., vol. 223, pp. 115–129, Mar. 15, 2019.
[5] M. J. Lambert, P. C. S. Traore, X. Blaes, P. Baret, and P. Defourny, “Estimating smallholder crops production at village level from Sentinel-2 time series in Mali’s cotton belt,” Remote Sens. Environ., vol. 216, pp. 647–657, Oct. 2018.
[6] E. Grabska, P. Hostert, D. Pflugmacher, and K. Ostapowicz, “Forest stand species mapping using the Sentinel-2 time series,” Remote Sens., vol. 11, no. 10, May 2019, Art. no. 1197.
[7] J. Inglada, A. Vincent, M. Arias, B. Tardy, D. Morin, and I. Rodes, “Operational high resolution land cover map production at the country scale using satellite image time series,” Remote Sens., vol. 9, no. 1, Jan. 2017, Art. no. 95.
[8] W. T. Yang, Y. Q. Wang, S. Sun, Y. J. Wang, and C. Ma, “Using Sentinel-2 time series to detect slope movement before the Jinsha river landslide,” Landslides, vol. 16, no. 7, pp. 1313–1324, Jul. 2019.
[9] M. Sudmanns, D. Tiede, L. Wendt, and A. Baraldi, “Automatic ex-post flood assessment using long time series of optical earth observation images,” GI_Forum, vol. 1, pp. 217–227, 2017.
[10] Y. Yuan et al., “A new framework for modelling and monitoring the conversion of cultivated land to built-up land based on a hierarchical hidden semi-Markov model using satellite image time series,” Remote Sens., vol. 11, no. 2, Jan. 2019, Art. no. 210.
[11] M. Immitzer, F. Vuolo, and C. Atzberger, “First experience with Sentinel-2 data for crop and tree species classifications in central Europe,” Remote Sens., vol. 8, no. 3, Mar. 2016, Art. no. 166.
[12] E. Kamir, F. Waldner, and Z. Hochman, “Estimating wheat yields in Australia using climate records, satellite image time series and machine learning methods,” ISPRS J. Photogramm. Remote Sens., vol. 160, pp. 124–135, Feb. 2020.
[13] C. Pelletier, S. Valero, J. Inglada, N. Champion, and G. Dedieu, “Assessing the robustness of random forests to map land cover with high resolution satellite image time series over large areas,” Remote Sens. Environ., vol. 187, pp. 156–168, 2016.
[14] Q. Hu et al., “How do temporal and spectral features matter in crop classification in Heilongjiang province, China?,” J. Integrative Agriculture, vol. 16, no. 2, pp. 324–336, 2017.
[15] L. H. Nguyen, D. R. Joshi, D. E. Clay, and G. M. Henebry, “Characterizing land cover/land use from multiple years of Landsat and MODIS time series: A novel approach using land surface phenology modeling and random forest classifier,” Remote Sens. Environ., vol. 238, Mar. 2020, Art. no. 111017.
[16] R. Hang, Q. Liu, D. Hong, and P. Ghamisi, “Cascaded recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5384–5394, 2019.
[17] D. Hong, L. Gao, J. Yao, B. Zhang, A. Plaza, and J. Chanussot, “Graph convolutional networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., to be published, doi: 10.1109/TGRS.2020.3015157.
[18] D. Hong et al., “More diverse means better: Multimodal deep learning meets remote-sensing imagery classification,” IEEE Trans. Geosci. Remote Sens., to be published, doi: 10.1109/TGRS.2020.3016820.
[19] L. H. Zhong, L. N. Hu, and H. Zhou, “Deep learning based multi-temporal crop classification,” Remote Sens. Environ., vol. 221, pp. 430–443, Feb. 2019.
[20] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov, “Deep learning classification of land cover and crop types using remote sensing data,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 778–782, May 2017.
[21] C. Pelletier, G. I. Webb, and F. Petitjean, “Temporal convolutional neural network for the classification of satellite image time series,” Remote Sens., vol. 11, no. 5, Mar. 2019, Art. no. 523.
[22] D. Ienco, R. Gaetano, C. Dupaquier, and P. Maurel, “Land cover classification via multitemporal spatial data by deep recurrent neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 10, pp. 1685–1689, Oct. 2017.
[23] M. Rußwurm and M. Körner, “Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multi-spectral satellite images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 11–19.
[24] P. Benedetti, D. Ienco, R. Gaetano, K. Ose, R. G. Pensa, and S. Dupuy, “M3Fusion: A deep learning architecture for multiscale multimodal multitemporal satellite data fusion,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 12, pp. 4939–4949, 2018.
[25] R. Interdonato, D. Ienco, R. Gaetano, and K. Ose, “DuPLO: A DUal view point deep learning architecture for time series classification,” ISPRS J. Photogramm. Remote Sens., vol. 149, pp. 91–104, Mar. 2019.

[26] M. Rußwurm and M. Körner, “Multi-temporal land cover classification with sequential recurrent encoders,” ISPRS Int. J. Geoinf., vol. 7, no. 4, 2018, Art. no. 129.
[27] V. S. Garnot, L. Landrieu, S. Giordano, and N. Chehata, “Time-space tradeoff in deep learning models for crop classification on satellite multi-spectral image time series,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2019, pp. 6247–6250.
[28] A. Vaswani et al., “Attention is all you need,” in Proc. Neural Inf. Process. Syst., Dec. 2017, pp. 5998–6008.
[29] V. S. F. Garnot, L. Landrieu, S. Giordano, and N. Chehata, “Satellite image time series classification with pixel-set encoders and temporal self-attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 12325–12334.
[30] M. Rußwurm and M. Körner, “Self-attention for raw optical satellite time series classification,” 2019. [Online]. Available: https://arxiv.org/abs/1910.10536
[31] H. Bazzi, D. Ienco, N. Baghdadi, M. Zribi, and V. Demarez, “Distilling before refine: Spatio-temporal transfer learning for mapping irrigated areas using Sentinel-1 time series,” IEEE Geosci. Remote Sens. Lett., vol. 17, no. 11, pp. 1909–1913, Nov. 2020.
[32] D. Ienco, Y. J. Eudes Gbodjo, R. Interdonato, and R. Gaetano, “Attentive weakly supervised land cover mapping for object-based satellite image time series data with spatial interpretation,” 2020. [Online]. Available: https://arxiv.org/abs/2004.14672
[33] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. IEEE Int. Conf. Mach. Learn., Jun. 2007, pp. 759–766.
[34] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2020.2992393.
[35] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 649–666.
[36] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 69–84.
[37] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 132–149.
[38] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020. [Online]. Available: https://arxiv.org/abs/2002.05709
[39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018. [Online]. Available: https://arxiv.org/abs/1810.04805
[40] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Proc. Neural Inf. Process. Syst., Dec. 2019, pp. 5753–5763.
[41] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” 2019. [Online]. Available: https://arxiv.org/abs/1909.11942
[42] A. Elshamli, G. W. Taylor, A. Berg, and S. Areibi, “Domain adaptation using representation learning for the classification of remote sensing images,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol. 10, no. 9, pp. 4198–4209, Sep. 2017.
[43] X. Tang, X. Zhang, F. Liu, and L. Jiao, “Unsupervised deep feature learning for remote sensing image retrieval,” Remote Sens., vol. 10, no. 8, Aug. 2018, Art. no. 1243.
[44] Y. T. Tao, M. Z. Xu, F. Zhang, B. Du, and L. P. Zhang, “Unsupervised-restricted deconvolutional neural network for very high resolution remote-sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 12, pp. 6805–6823, Dec. 2017.
[45] X. Q. Lu, X. T. Zheng, and Y. Yuan, “Remote sensing scene classification by unsupervised representation learning,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5148–5157, Sep. 2017.
[46] X. Wang, K. Tan, Q. Du, Y. Chen, and P. J. Du, “CVA2E: A conditional variational autoencoder with an adversarial training process for hyperspectral imagery classification,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 8, pp. 5676–5692, Aug. 2020.
[47] X. Ma, Y. Lin, Z. Nie, and H. Ma, “Structural damage identification based on unsupervised feature-extraction via variational Auto-encoder,” Measurement, vol. 160, Aug. 2020, Art. no. 107811.
[48] D. Y. Lin, K. Fu, Y. Wang, G. L. Xu, and X. Sun, “MARTA GANs: Unsupervised representation learning for remote sensing image classification,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 11, pp. 2092–2096, Nov. 2017.
[49] R. Hang, F. Zhou, Q. Liu, and P. Ghamisi, “Classification of hyperspectral images via multitask generative adversarial networks,” IEEE Trans. Geosci. Remote Sens., to be published.
[50] S. Wang, D. Quan, X. Liang, M. Ning, Y. Guo, and L. Jiao, “A deep learning framework for remote sensing image registration,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 148–164, 2018.
[51] H. Dong, W. Ma, Y. Wu, J. Zhang, and L. Jiao, “Self-supervised representation learning for remote sensing image change detection based on temporal prediction,” Remote Sens., vol. 12, no. 11, Jun. 2020, Art. no. 1868.
[52] S. Vincenzi et al., “The color out of space: Learning self-supervised representations for earth observation imagery,” 2020. [Online]. Available: https://arxiv.org/abs/2006.12119
[53] J. He, L. Zhao, H. Yang, M. Zhang, and W. Li, “HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 1, pp. 165–178, Jan. 2020.
[54] X. Shen, B. Liu, Y. Zhou, J. Zhao, and M. Liu, “Remote sensing image captioning via variational autoencoder and reinforcement learning,” Knowl. Based Syst., vol. 203, 2020, Art. no. 105920.
[55] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” 2016. [Online]. Available: https://arxiv.org/abs/1603.05027
[56] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016. [Online]. Available: https://arxiv.org/abs/1607.06450
[57] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. IEEE Int. Conf. Mach. Learn., Jul. 2008, pp. 1096–1103.
[58] B. G. Baldwin, “Origins of plant diversity in the California floristic province,” Annu. Rev. Ecol. Evol. Syst., vol. 45, no. 1, pp. 347–369, 2014.
[59] L. Zhong, T. Hawkins, G. Biging, and P. Gong, “A phenology-based approach to map crop types in the San Joaquin Valley, California,” Int. J. Remote Sens., vol. 32, no. 22, pp. 7777–7804, Nov. 2011.
[60] USDA National Agricultural Statistics Service Cropland Data Layer. 2019. Published crop-specific data layer. USDA-NASS, Washington, DC, USA. Accessed May 2020. [Online]. Available: https://nassgeodata.gmu.edu/CropScape/
[61] W. Liu, S. Gopal, and E. F. Wood, “Uncertainty and confidence in land cover classification using a hybrid classifier approach,” Photogramm. Eng. Remote Sens., vol. 70, no. 8, pp. 963–971, Aug. 2004.
[62] P. Gong et al., “Stable classification with limited sample: Transferring a 30-m resolution sample set collected in 2015 to mapping 10-m resolution global land cover in 2017,” Sci. Bull., vol. 64, no. 6, pp. 370–373, 2019.
[63] A. Chakhar, D. Ortega-Terol, D. Hernandez-Lopez, R. Ballesteros, J. E. Ortega, and M. A. Moreno, “Assessing the accuracy of multiple classification algorithms for crop classification using Landsat-8 and Sentinel-2 data,” Remote Sens., vol. 12, no. 11, Jun. 2020, Art. no. 1735.
[64] L. Lin, “Satellite image time series classification and change detection based on recurrent neural network model,” Doctor Eng. China, Univ. Chinese Acad. Sci., 2018.
[65] H. Wang, X. Zhao, X. Zhang, D. Wu, and X. Du, “Long time series land cover classification in China from 1982 to 2015 based on Bi-LSTM deep learning,” Remote Sens., vol. 11, no. 14, 2019, Art. no. 1639.
[66] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.

Yuan Yuan (Member, IEEE) received the B.S. degree in geography from Nanjing University, Nanjing, China, in 2011 and the Ph.D. degree in signal and information processing from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China, in 2016.
She is currently working at Nanjing University of Posts and Telecommunications, Department of Surveying and Geoinformatics. Her research interests include remote sensing time series analysis, change detection, and deep learning.

Lei Lin received the B.S. degree in geographic information system from Shandong Agricultural University, Taian, China, in 2013, and the Ph.D. degree in signal and information processing from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China, in 2018.
He is currently working at Qihoo Technology Corporation. His research interests include time series analysis and natural language processing.
