Self-Supervised Pretraining of Transformers For Satellite Image Time Series Classification
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, 2021
Abstract—Satellite image time series (SITS) classification is a major research topic in remote sensing and is relevant for a wide range of applications. Deep learning approaches have been commonly employed for SITS classification and have provided state-of-the-art performance. However, deep learning methods suffer from overfitting when labeled data are scarce. To address this problem, we propose a novel self-supervised pretraining scheme to initialize a transformer-based network by utilizing large-scale unlabeled data. In detail, the model is asked to predict randomly contaminated observations given an entire time series of a pixel. The main idea of our proposal is to leverage the inherent temporal structure of satellite time series to learn general-purpose spectral-temporal representations related to land cover semantics. Once pretraining is completed, the pretrained network can be further adapted to various SITS classification tasks by fine-tuning all the model parameters on small-scale task-related labeled data. In this way, the general knowledge and representations about SITS can be transferred to a label-scarce task, thereby improving the generalization performance of the model as well as reducing the risk of overfitting. Comprehensive experiments have been carried out on three benchmark datasets over large study areas. Experimental results demonstrate the effectiveness of the proposed pretraining scheme, leading to substantial improvements in classification accuracy using transformer, 1-D convolutional neural network, and bidirectional long short-term memory network. The code and the pretrained model will be available at https://github.com/linlei1214/SITS-BERT upon publication.

Index Terms—Bidirectional encoder representations from Transformers (BERT), classification, satellite image time series (SITS), self-supervised learning, transfer learning, unsupervised pretraining.

Manuscript received September 15, 2020; revised October 22, 2020; accepted November 3, 2020. Date of publication November 9, 2020; date of current version January 6, 2021. This work was supported in part by the Research Project of Surveying Mapping and Geoinformation of Jiangsu Province under Project JSCHKY201905, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20170897, in part by the National Natural Science Foundation of China under Grant 41901356, and in part by the support of Environmental Protection Research Project of Jiangsu Province under Grant 2019010. (Yuan Yuan and Lei Lin are co-first authors.) (Corresponding author: Yuan Yuan.)
Yuan Yuan is with the School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China (e-mail: [email protected]).
Lei Lin is with the Beijing Qihoo Technology Company Ltd., Beijing 100015, China (e-mail: [email protected]).
This article has supplementary downloadable material available at https://ieeexplore.ieee.org, provided by the authors.
Digital Object Identifier 10.1109/JSTARS.2020.3036602
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
Nowadays, a huge volume of Earth observation (EO) data are being accumulated thanks to remarkable breakthroughs in the latest generation of satellites. After the successive launch of the twin satellites (Sentinel-2A/B) of the Sentinel-2 mission, it is now possible to access large-area, high-quality, medium spatial resolution EO data with much higher frequency. Owing to the short revisit time of the Sentinel-2A/B satellites (five days from the two-satellite constellation), consecutively acquired images covering the same geographical area can be properly organized into satellite image time series (SITS) [1]. Such medium-resolution SITS data provide valuable information about the status and dynamics of the Earth's surface, supporting analyses of the functional and structural characteristics of land covers as well as identifying change events [2], [3]. For this reason, SITS have been widely used in various application domains, such as ecology [4], agriculture [5], forest [6], land management [7], disaster monitoring [8], risk assessment [9], etc. In the meantime, new challenges are also being introduced by the question of how to extract valuable knowledge and meaningful information to exploit such abundant data.
SITS classification is one of the central problems in SITS analysis and is closely associated with many land applications, such as land cover mapping and change detection [10], vegetation species classification [11], and crop yield estimation [12]. SITS classification involves assigning every pixel in an image to a categorical label, primarily based on the pixel's spectral profile (trajectories of spectral variations over time) [3]. Machine learning algorithms provide effective tools to achieve automated SITS classification. Traditional algorithms such as support vector machine (SVM) and random forest (RF) classify SITS via handcrafted features, such as raw reflectances, spectral statistics, and phenological metrics [13], [14]. However, characterizing these features is rather difficult due to the strong interannual variations in seasonal patterns of the land surface reflectance, which can be caused by shifts in land cover or environmental conditions, management practices, and disturbance [15].
Lately, deep learning is gaining widespread popularity in the remote sensing community. Various deep neural network architectures have been introduced to advance the state of the art for many remote sensing classification problems [16]–[18]. The major advantage of deep learning methods is that they are capable of learning features from the input data optimized for a specific task without the need for manual feature engineering [1], [19]. Deep learning-based approaches are increasingly being applied to SITS classification. Among these methods, convolutional neural networks (CNNs) [20], [21] and recurrent neural networks [RNNs, including long short-term memory (LSTM) or gated recurrent units] [22], [23] have been most widely used to capture
YUAN AND LIN: SELF-SUPERVISED PRETRAINING OF TRANSFORMERS FOR SATELLITE IMAGE TIME SERIES CLASSIFICATION 475
II. RELATED WORK
An effective pretext task is the key to successful self-supervised learning and should guarantee that the model learns meaningful representations instead of trivial solutions [34]. According to the strategies used to design pretext tasks, most of the available self-supervised learning approaches in remote sensing can be broadly classified into the following three categories.
Reconstruction-based methods include denoising autoencoders (DAEs) [42], [43] and deCNNs [44], [45]. These methods project data into a low-dimensional latent space and then reconstruct the input from the compressed features. However, exact input reconstruction is often not conducive to learning discriminant features.
Generation-based methods include variational autoencoders [46], [47] and generative adversarial networks [48], [49]. These methods aim to approximate the real data distribution and simultaneously learn a mapping from a latent space to the input space. However, the main goal of these methods is to generate more realistic samples rather than extracting meaningful features that facilitate downstream tasks.
Context-based methods leverage spatial, spectral, and temporal context similarity or correlation in the data to create supervision signals. These types of methods are generally heuristic and have no uniform theoretical patterns. For example, Wang et al. learned mapping functions between images and their transformed copies for the purpose of image registration [50]. Dong et al. sampled image patches from bitemporal images and then used the temporal source of these patches as pseudolabels for change detection [51]. Vincenzi et al. attempted to predict visible bands of remote sensing images from other spectral bands [52]. At present, these kinds of methods are rarely applied to remote sensing data.
In summary, the existing work mainly focuses on learning spatial and spectral features from single or bitemporal images, while very little effort has been devoted to satellite time-series analysis.

III. MOTIVATION
BERT, which stands for bidirectional encoder representations from transformers [39], is undoubtedly the most remarkable self-supervised learning-based framework in the current NLP field. Following an unsupervised-pretraining/supervised-fine-tuning pipeline, BERT can learn general language representations that are reusable in various downstream tasks. In detail, BERT utilizes a "Masked Language Model" as its pretext task, in which some tokens in a sentence are randomly masked out and the model is forced to predict the missing tokens according to the context. This enables BERT to learn word correspondences from a plain text corpus. Since BERT was proposed, the concept of self-supervised pretraining has started dominating the state of the art in NLP.
Similar to text, satellite time series also have strong temporal correlations. Likewise, the temporal dynamics of spectral profiles contain rich semantic information, which is closely associated with seasonal variations and plant phenology. However, while there are clear similarities between text and satellite time series, the insight of BERT has yet to be explored in the canon of existing work in the remote sensing community.
Inspired by BERT, we develop a context-based pretext task to capture meaningful spectral-temporal features from massive unlabeled SITS data. Specifically, the network is asked to recover contaminated observations by means of corresponding acquisition dates and clear observations. Our hypothesis is that noisy observations can be distinguished and reconstructed from dense satellite time series. The main idea behind this pretext task is based on the fact that noise caused by clouds and shadows is commonly found in optical satellite images. Human remote sensing experts can easily eliminate noise interference and distinguish different land cover types from a limited number of images. The absence of images or contaminated images does not necessarily result in a decrease in the interpretation accuracy because the missing information can be inferred from the remaining images. Intuitively, a good representation of SITS should be able to capture stable spectral-temporal patterns that are robust to noise.

IV. MODEL ARCHITECTURE
A. Overall Network Architecture
In this article, we introduce a transformer-based network for SITS classification. Since the model architecture is modified from BERT, we name it SITS-BERT.
We first introduce the symbols used in our description. Let $X = \{\langle O_1, t_1\rangle, \ldots, \langle O_L, t_L\rangle\}$ be an annual time series of a pixel, where $O_i \in \mathbb{R}^D$ denotes a D-dimensional satellite observation vector whose elements correspond to each of the input spectral reflectances; $t_i$ corresponds to the date $O_i$ was captured, which is specified using day of year (DOY).
SITS-BERT's model architecture is composed of two parts: an observation embedding layer and a standard transformer encoder (see Fig. 2). Specifically, all observation tuples $\langle O_1, t_1\rangle, \ldots, \langle O_L, t_L\rangle$ of a time series are first encoded into a sequence of observation embeddings, which are then fed into a multilayer bidirectional transformer network to produce their corresponding representations. The final observation representations can be aggregated into a single feature vector, which contains the global information of the entire time series and is used for classification. Here, SITS-BERT plays the role of a representation model. By adding an additional output layer, the whole network architecture of SITS-BERT can be further fine-tuned for a specific classification task.

B. Observation Embedding
The observation embedding layer projects an input tuple $\langle O_i, t_i\rangle$ into a higher-dimensional feature space. The reason for this is that the input dimension is tied to the hidden layer size of the transformer, and using larger embedding sizes gives better performance [41].
In this article, an observation embedding is a concatenation of two parts. Given an observation $O_i$, it is projected into a high-dimensional vector using a linear dense layer. In addition, the corresponding date $t_i$ is also encoded into a vector of the same size using the positional encoding (PE) technique [28]. PE encodes the order information of a sequence with sine/cosine
Fig. 2. Structure of the SITS-BERT architecture shown unfolded in time. Except for the output layer (not depicted in the figure), the same architecture is used in both the pretraining and fine-tuning stages. The input tuple $\langle O_i, t_i\rangle$ at each timestep pairs an observation with its acquisition date (day of year, DOY). The final hidden representation is denoted as $T_i$, which encodes information about the entire time series.
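The observation embedding described in Section IV-B (a linear projection of the observation concatenated with a sine/cosine encoding of its acquisition date) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the base of 10000 (taken from the standard transformer positional encoding of [28]), and the toy weights are all assumptions.

```python
import math


def doy_positional_encoding(doy, dim, base=10000.0):
    """Sine/cosine encoding of an acquisition date given as day of year.

    Follows the standard transformer formulation: even indices hold sines,
    odd indices hold cosines. The base value is an assumption.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for k in range(dim // 2):
        angle = doy / (base ** (2 * k / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


def linear_projection(observation, weights, bias):
    """Dense layer without activation: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, observation)) + b
        for row, b in zip(weights, bias)
    ]


def observation_embedding(observation, doy, weights, bias):
    """Concatenate the projected observation with a date encoding of equal size."""
    projected = linear_projection(observation, weights, bias)
    return projected + doy_positional_encoding(doy, len(projected))


# Toy example: a 2-band observation projected to 4 dimensions (hypothetical weights).
toy_weights = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_bias = [0, 0, 0, 0.5]
embedding = observation_embedding([1.0, 2.0], 0, toy_weights, toy_bias)
print(len(embedding))
```

Because the two halves are concatenated rather than summed, the embedding is twice as wide as the projection, so the transformer hidden size would have to match that doubled width.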
Fig. 5. Two strategies to generate a sequence-level representation (denoted as C) from individual observation representations of a time series: (1) SITS-BERT with a [CLS] token, (2) SITS-BERT with pooling.
observations, and set the corresponding time to be zero. The [CLS] token also needs to be added to the input time series at the pretraining stage.
2) SITS-BERT with pooling: The second method is to apply a pooling operation to the output vectors of BERT by either averaging the output (average pooling) or computing max-over-time of the output (max pooling) to aggregate them into a single fixed-size vector.
Once the global representation of a time series is obtained, it is then fed into a softmax layer to be classified. The cross-entropy loss is used for the model optimization in the fine-tuning stage

$$\text{CrossEntropyLoss} = -\sum_{j=1}^{P} y_j \log(\hat{y}_j) \tag{7}$$

where $\hat{y}_j$ is the probability score inferred by the softmax function for class j, $y_j$ is the ground truth, and P is the number of classes.
In this article, we fine-tune SITS-BERT on task-specific data in a supervised manner for 100 epochs. The Adam optimizer is used with a learning rate of 2e-4 and a batch size of 128.

VI. STUDY AREAS AND DATASETS
In this article, we applied the proposed method to Sentinel-2 image time series. The data were organized into four datasets: one was used for model pretraining, and the other three (including two crop classification datasets and a land cover mapping dataset) were used for model evaluation.
We investigated the model performance on two kinds of SITS data. In the first type of data, the pretraining and fine-tuning data belonged to the same area, serving the purpose of verifying whether the pretrained model can extract informative and discriminant features from homologous unlabeled data. In the second type of data, the pretraining and fine-tuning data came from different areas, with the purpose of exploring whether the pretrained model can be transferred to unseen data.

A. Pretraining Dataset
We chose three Sentinel-2 tiles (T10SEJ, T10SFH, and T11SKA, each covering a region of about 100 × 100 km) located in the Central Valley, CA, USA to create a pretraining dataset. The location of the study area is depicted in Fig. 6. For each tile, all the available Sentinel-2 A/B images with <10% cloud coverage, taken between January 2018 and December 2019, were used. There were a total of 219 images to construct the pretraining dataset. Samples were collected from each tile at a regular sampling interval of 10 rows/columns. Sequences shorter than 10 were discarded. Finally, the pretraining dataset was composed of about 6.5 million sequences.
We chose California as the pretraining area for the following reasons. First, California has extremely diverse natural landscapes, representing most of the major biomes in North America, including grasslands, shrublands, deciduous forests, coniferous forests, tundra, mountains, deserts, rainforest, marine, estuarine, and freshwater habitats. The state's varied topography and climate have given rise to a remarkable plant diversity that cannot be found in any other state of the United States [58]. Moreover, the Central Valley, where the study area is located, is California's most productive agricultural region as well as one of the most productive worldwide. The agricultural systems in this study area are rather complicated, characterized by small parcels and a vast variety of crop types [59]. For all those reasons, this area can provide abundant and diverse training samples for learning general-purpose SITS representations. Finally, this area has a Mediterranean climate with no significant precipitation during the summer season. Therefore, the spectral trajectory of major crops (such as corn, cotton, rice, and soybeans) throughout the entire growth cycle can be characterized, making it feasible to infer the corrupted part of a time series from the remainder.

B. Two Crop Classification Datasets
We applied the proposed method to two crop classification datasets.
The first dataset came from the pretraining area. A total of 45 images captured in 2019 from Sentinel-2 tile T11SKA in California (see Fig. 6, Tile 3) were used. Ten major crop types covering most of the study area were targeted in this study: corn, cotton, winter wheat, alfalfa, tomatoes, grapes, citrus, almonds, walnuts, and pistachios. Three nonagricultural categories were also included: developed, evergreen forest, and grass/pasture. It should be noted that the fine-tuning data were not used for pretraining.
The second dataset came from the area at the border between Missouri and Arkansas, USA (see Fig. 7), and was composed of 26 images acquired in 2019 from Sentinel-2 tile 15SYA. Ten agricultural and nonagricultural categories were considered in this study: corn, cotton, rice, soybeans, fallow/idle cropland, open water, developed, forest, grass/pasture, and woody wetlands.
For both datasets, the 2019 Cropland Data Layer (CDL) [60] and the corresponding CDL confidence layer [61] provided by the United States Department of Agriculture National Agricultural Statistics Service (USDA NASS) were used as reference data to collect samples. The original spatial resolution of the CDL and confidence layer products was 30 m. They were upsampled to 10 m to be consistent with Sentinel-2 images. Then, we extracted samples from the whole scene at a regular interval of 10 rows/columns. To increase the accuracy of the samples collected
480 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021
Fig. 6. 2019 cropland data layer (CDL) product of California, USA, which is colored in the same way as the CDL website (https://nassgeodata.gmu.edu/CropScape/). Three Sentinel tiles were used for model pretraining: (1) T10SEJ, (2) T10SFH, (3) T11SKA. The coverage of each Sentinel-2 tile is marked with a red rectangle.
Fig. 7. Location of the second study area for crop classification, showing the 2019 CDL product.
Fig. 8. Location of the study area for land cover mapping, showing the FROM_GLC10 product.
from the CDL, we employed the following two criteria. First, all the neighboring pixels along with the center pixel in a 7×7 window should belong to the same category, to avoid sampling from the boundary of land parcels. Second, the classification confidence of all neighboring pixels in the window should be higher than 80%.

C. Land Cover Mapping Dataset
We also applied the proposed method to a land cover mapping dataset, which was built using 20 images taken in Beijing, China, in 2017 from Sentinel-2 tile 50TMK (see Fig. 8). Beijing is characterized by a monsoonal humid continental climate with hot humid summers and cold dry winters. The natural vegetation
is deciduous broadleaf forest, and the dominant crop types are winter wheat, corn, and soybean. There is a large divergence between the study area and the pretraining area in terms of both geographical conditions and vegetation species. Therefore, this dataset provides a good example to explore whether the proposed pretraining scheme is effective when the target domain and the pretraining domain have distinct data distributions.
The reference data we used was a public global land cover product named FROM_GLC10 (v0.1.3), which was built on 10 m resolution Sentinel-2 data acquired in 2017 [62]. We adopted the same classification system as FROM_GLC10. Five major land cover types were considered: cropland, forest, grassland, water, and impervious area.

TABLE I
NUMBERS OF SAMPLES IN THE THREE FINE-TUNING DATASETS
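The two criteria used to collect crop samples from the CDL (a homogeneous 7×7 window and a confidence above 80% for every pixel in the window, sampled every 10 rows/columns) can be sketched as below. This is an illustrative reading of the text, not the authors' code: the function name, the nested-list raster layout, and the choice to start the sampling grid at the first pixel with a full window are assumptions.

```python
def sample_cdl_pixels(labels, confidence, stride=10, window=7, min_confidence=80):
    """Return (row, col, label) for grid pixels that pass both sampling criteria.

    labels and confidence are 2-D lists of equal shape (class code and
    confidence in percent). Border pixels without a full window are skipped,
    which is an assumption; the text does not say how borders are handled.
    """
    half = window // 2
    rows, cols = len(labels), len(labels[0])
    samples = []
    for r in range(half, rows - half, stride):
        for c in range(half, cols - half, stride):
            center = labels[r][c]
            offsets = range(-half, half + 1)
            # Criterion 1: every pixel in the window has the center's class.
            homogeneous = all(
                labels[r + dr][c + dc] == center for dr in offsets for dc in offsets
            )
            # Criterion 2: every pixel in the window exceeds the confidence threshold.
            confident = all(
                confidence[r + dr][c + dc] > min_confidence
                for dr in offsets
                for dc in offsets
            )
            if homogeneous and confident:
                samples.append((r, c, center))
    return samples


# Toy 7x7 raster: one class everywhere, confidence 90% everywhere.
toy_labels = [[1] * 7 for _ in range(7)]
toy_confidence = [[90] * 7 for _ in range(7)]
print(sample_cdl_pixels(toy_labels, toy_confidence))
```

A single mislabeled or low-confidence pixel anywhere in the window rejects the center pixel, which is what keeps samples away from parcel boundaries.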
one softmax layer. Dropout was used after each LSTM layer with a dropout rate of 0.5.
5) Bi-LSTM: A three-layer stacked Bi-LSTM network was used [66], which was formed by three Bi-LSTM layers (128 units) and one softmax layer. The outputs at each timestep from both forward and backward directions were concatenated and used as the input of the next layer. Dropout was used after each layer with a dropout rate of 0.5.
SVM and RF handle a multivariate time series as a flat feature vector, so the dimensionality of the input features must be identical across samples. Hence, we padded the time series to the maximum length by filling in the missing observations with zeros. For deep learning models, we used the same observation embedding layer to encode the input observations. SVM and RF were implemented using the Python Scikit-Learn library, while all deep learning methods were implemented using the Python PyTorch library. Experiments were carried out on a workstation with an Intel Xeon Silver CPU 4210R 20-Core (2.40GHz) with 128 GB of RAM and an NVIDIA TITAN RTX 24G GPU.
In order to simulate real-world scenarios, we randomly selected 100 samples per category from each dataset to create the training and validation sets, respectively, and used all the remaining samples for testing. The training set was used for training or fine-tuning pretrained models, and the validation set was used for choosing hyper-parameters (e.g., training epochs). It is worth mentioning that we did not follow the standard paradigm for method evaluation, which splits all data into training, validation, and test sets at fixed ratios. In our setting, the data distributions of training samples and test samples may be different. The main purpose of this is to verify the effectiveness of the proposed pretraining scheme under small-sample conditions.

B. Model Configuration
In this section, we evaluate the performance of three SITS-BERT variants to derive sequence-level representations. To verify whether the pretraining scheme is effective, we also trained each SITS-BERT variant from scratch using only labeled data (we refer to such models as "non-pre-trained" models). The three variants are described as follows.
1) SITS-BERT using the [CLS] token: In this case, the fine-tuned hidden representation of the [CLS] token was regarded as a representation of the entire input time series. We refer to this variant as "SITS-BERT_CLS."
2) SITS-BERT using average pooling: In this case, the hidden representations of all the observations in a time series were averaged and used for classification. We refer to this variant as "SITS-BERT_AVG."
3) SITS-BERT using max pooling: In this case, the maximum value for the hidden representations along the time dimension was computed and used for classification. We refer to this variant as "SITS-BERT_MAX."
The classification results are shown in Table II, and the confusion matrices are given in the Supplementary Materials. We observe that for all variants, the pretrained models remarkably outperform their randomly initialized versions (except for SITS-BERT_AVG in the third dataset). Specifically, the proposed pretraining scheme leads to an average accuracy gain of 2.97%, 1.88%, 2.58% for OA, 0.034, 0.023, 0.045 for Kappa coefficient, and 0.030, 0.029, 0.068 for AA on each dataset, respectively. The accuracy improvement on the latter two datasets indicates that unsupervised pretraining does produce positive transfer, that is, the representations learned from large-scale unlabeled data can be transferred to a new context, thereby improving the performance of the model on label-scarce downstream tasks. We also find that SITS-BERT_MAX exceeds other variants in terms of OA and Kappa coefficient, while SITS-BERT_CLS yields higher AA than SITS-BERT_MAX in the latter two datasets, showing that SITS-BERT_CLS is more advantageous for handling imbalanced classification problems.

TABLE II
CLASSIFICATION ACCURACY COMPARISON OF DIFFERENT SITS-BERT VARIANTS ON THE THREE DATASETS

C. Method Comparison
In this section, we compare the performance of SITS-BERT with other algorithms for SITS classification. We used SITS-BERT_MAX as the default model configuration. The term "SITS-BERT" hereinafter refers to the pretrained SITS-BERT_MAX. All other deep learning models were trained from scratch for 300 epochs with a learning rate of 2e−4 and a batch size of 128. The classification accuracy results for the three datasets are displayed in Table III, and the confusion matrices are given in the Supplementary Materials.
A major finding is that both CNN and RNN networks yield comparable or even worse results than traditional methods (SVM and RF), while the non-pre-trained SITS-BERT performs slightly better than traditional methods on the latter two datasets. Apart from the proposed model, RF performs best overall, ranking first on the latter two datasets. This suggests that for small datasets, sophisticated deep learning models have no substantial advantages over traditional machine learning methods in SITS classification tasks. This is likely due to the serious overfitting problem that occurs when training deep neural networks on insufficient labeled samples, even though transformer and
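The three ways of deriving a sequence-level representation compared in Section B, followed by the softmax and cross-entropy of Eq. (7), can be sketched in a few lines. This is a toy illustration with hypothetical names, not the released implementation. It assumes the [CLS] token occupies the first position of the sequence (as in BERT), and it applies softmax directly to given logits, omitting the linear part of the output layer.

```python
import math


def cls_pool(hidden):
    """Use the representation of the [CLS] token (assumed to be at index 0)."""
    return hidden[0]


def average_pool(hidden):
    """Average the hidden representations over the time dimension."""
    steps = len(hidden)
    return [sum(column) / steps for column in zip(*hidden)]


def max_pool(hidden):
    """Take the maximum over the time dimension for each feature."""
    return [max(column) for column in zip(*hidden)]


def softmax(logits):
    """Numerically stable softmax over class logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(one_hot, probabilities):
    """Eq. (7): negative sum of y_j * log(y_hat_j) over the P classes."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probabilities) if y > 0)


# Two timesteps with two hidden features each (toy values).
toy_hidden = [[1.0, 4.0], [3.0, 2.0]]
print(average_pool(toy_hidden), max_pool(toy_hidden), cls_pool(toy_hidden))
```

Max pooling keeps, for each feature, the strongest response anywhere in the series, while average pooling weights all dates equally. Which one works better is an empirical question, answered here by Table II.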
Fig. 9. Crop classification results of different methods in the first study area (California). (1) Ground truth. (2) SVM. (3) Random forest (RF). (4) Long short-term memory (LSTM). (5) Bidirectional LSTM (Bi-LSTM). (6) Non-pre-trained SITS-BERT. (7) SITS-BERT.
Fig. 10. Crop classification results of different methods in the second study area (Missouri). (1) Ground truth. (2) SVM. (3) RF. (4) LSTM. (5) Bi-LSTM. (6)
Non-pre-trained SITS-BERT. (7) SITS-BERT.
Fig. 11. Land cover classification results of different methods in the third study area (Beijing). (1) Ground truth. (2) SVM. (3) RF. (4) LSTM. (5) Bi-LSTM. (6)
Non-pre-trained SITS-BERT. (7) SITS-BERT.
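The accuracy figures quoted in this article are overall accuracy (OA), the Kappa coefficient, and average accuracy (AA). A minimal sketch of how these are commonly computed from a confusion matrix is given below. The definitions used (OA as the fraction of correct samples, Cohen's Kappa from row and column marginals, AA as the mean per-class recall) are the usual remote sensing conventions and are assumed here, since the text does not spell them out. The function name is hypothetical.

```python
def accuracy_metrics(confusion):
    """Compute (OA, Kappa, AA) from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    Every class is assumed to have at least one true sample, and the
    expected agreement is assumed to be below 1.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    overall_accuracy = correct / total

    row_totals = [sum(row) for row in confusion]
    col_totals = [
        sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    # Chance agreement expected from the marginal distributions.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total ** 2)
    kappa = (overall_accuracy - expected) / (1 - expected)

    # Mean of per-class recall, so every class counts equally.
    average_accuracy = (
        sum(confusion[i][i] / row_totals[i] for i in range(n_classes)) / n_classes
    )
    return overall_accuracy, kappa, average_accuracy


oa, kappa, aa = accuracy_metrics([[8, 2], [1, 9]])
print(oa, kappa, aa)
```

Because AA weights each class equally regardless of its size, it is more sensitive than OA to errors on rare classes, which is why it is informative for imbalanced datasets.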
Fig. 12. Classification accuracy of SITS-BERT and the RF baseline using varied training samples. Evaluation metrics in each row from left to right: Overall accuracy (OA), Kappa coefficient, and average accuracy (AA).
which SITS-BERT surpasses all other competitors in all datasets. It is worth noting that the accuracy of SITS-BERT fine-tuned on only 50 labeled samples per category is comparable with the accuracy of RF trained on ten times more samples for the first and second datasets. Specifically, the best OA yielded by

F. Computational Efficiency
In practice, SITS classification methods need to be applied to millions of satellite time series. Therefore, the computational efficiency of deep learning models should be taken into consideration. In this section, we compare the computation speeds of several deep learning models. The results are given in Table V.
According to the results, CNN-1D is the fastest, followed by LSTM-based models, and SITS-BERT is the slowest on a single GPU. Specifically, the computation speed of CNN-1D is about 19% faster than SITS-BERT. For Bi-LSTM, which has shown competitive performance in our previous experiments, it is also 5% faster than SITS-BERT. Fortunately, an advantage of transformer is that the computation can be parallelized
[26] M. Rußwurm and M. Körner, “Multi-temporal land cover classification with sequential recurrent encoders,” ISPRS Int. J. Geoinf., vol. 7, no. 4, 2018, Art. no. 129.
[27] V. S. Garnot, L. Landrieu, S. Giordano, and N. Chehata, “Time-space tradeoff in deep learning models for crop classification on satellite multi-spectral image time series,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2019, pp. 6247–6250.
[28] A. Vaswani et al., “Attention is all you need,” in Proc. Neural Inf. Process. Syst., Dec. 2017, pp. 5998–6008.
[29] V. S. F. Garnot, L. Landrieu, S. Giordano, and N. Chehata, “Satellite image time series classification with pixel-set encoders and temporal self-attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 12325–12334.
[30] M. Rußwurm and M. Körner, “Self-attention for raw optical satellite time series classification,” 2019. [Online]. Available: https://arxiv.org/abs/1910.10536
[31] H. Bazzi, D. Ienco, N. Baghdadi, M. Zribi, and V. Demarez, “Distilling before refine: Spatio-temporal transfer learning for mapping irrigated areas using Sentinel-1 time series,” IEEE Geosci. Remote Sens. Lett., vol. 17, no. 11, pp. 1909–1913, Nov. 2020.
[32] D. Ienco, Y. J. Eudes Gbodjo, R. Interdonato, and R. Gaetano, “Attentive weakly supervised land cover mapping for object-based satellite image time series data with spatial interpretation,” 2020. [Online]. Available: https://arxiv.org/abs/2004.14672
[33] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. IEEE Int. Conf. Mach. Learn., Jun. 2007, pp. 759–766.
[34] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2020.2992393.
[35] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 649–666.
[36] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 69–84.
[37] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 132–149.
[38] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020. [Online]. Available: https://arxiv.org/abs/2002.05709
[49] R. Hang, F. Zhou, Q. Liu, and P. Ghamisi, “Classification of hyperspectral images via multitask generative adversarial networks,” IEEE Trans. Geosci. Remote Sens., to be published.
[50] S. Wang, D. Quan, X. Liang, M. Ning, Y. Guo, and L. Jiao, “A deep learning framework for remote sensing image registration,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 148–164, 2018.
[51] H. Dong, W. Ma, Y. Wu, J. Zhang, and L. Jiao, “Self-supervised representation learning for remote sensing image change detection based on temporal prediction,” Remote Sens., vol. 12, no. 11, Jun. 2020, Art. no. 1868.
[52] S. Vincenzi et al., “The color out of space: Learning self-supervised representations for earth observation imagery,” 2020. [Online]. Available: https://arxiv.org/abs/2006.12119
[53] J. He, L. Zhao, H. Yang, M. Zhang, and W. Li, “HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 1, pp. 165–178, Jan. 2020.
[54] X. Shen, B. Liu, Y. Zhou, J. Zhao, and M. Liu, “Remote sensing image captioning via variational autoencoder and reinforcement learning,” Knowl. Based Syst., vol. 203, 2020, Art. no. 105920.
[55] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” 2016. [Online]. Available: https://arxiv.org/abs/1603.05027
[56] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016. [Online]. Available: https://arxiv.org/abs/1607.06450
[57] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. IEEE Int. Conf. Mach. Learn., Jul. 2008, pp. 1096–1103.
[58] B. G. Baldwin, “Origins of plant diversity in the California floristic province,” Annu. Rev. Ecol. Evol. Syst., vol. 45, no. 1, pp. 347–369, 2014.
[59] L. Zhong, T. Hawkins, G. Biging, and P. Gong, “A phenology-based approach to map crop types in the San Joaquin Valley, California,” Int. J. Remote Sens., vol. 32, no. 22, pp. 7777–7804, Nov. 2011.
[60] USDA National Agricultural Statistics Service Cropland Data Layer. 2019. Published crop-specific data layer. USDA-NASS, Washington, DC, USA. Accessed May 2020. [Online]. Available: https://nassgeodata.gmu.edu/CropScape/
[61] W. Liu, S. Gopal, and E. F. Wood, “Uncertainty and confidence in land cover classification using a hybrid classifier approach,” Photogramm. Eng. Remote Sens., vol. 70, no. 8, pp. 963–971, Aug. 2004.
[62] P. Gong et al., “Stable classification with limited sample: Transferring a 30-m resolution sample set collected in 2015 to mapping 10-m resolution
[39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training global land cover in 2017,” Sci. Bull., vol. 64, no. 6, pp. 370–373, 2019.
of deep bidirectional transformers for language understanding,” 2018. [63] A. Chakhar, D. Ortega-Terol, D. Hernandez-Lopez, R. Ballesteros, J. E.
[Online]. Available: https://arxiv.org/abs/1810.04805 Ortega, and M. A. Moreno, “Assessing the accuracy of multiple classifi-
[40] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and cation algorithms for crop classification using Landsat-8 and Sentinel-2
Q. V. Le, “XLNet: Generalized autoregressive pretraining for lan- data,” Remote Sens., vol. 12, no. 11, Jun. 2020, Art. no. 1735.
guage understanding,” in Proc. Neural Inf. Process. Syst., Dec. 2019, [64] L. Lin, “Satellite image time series classification and change detection
pp. 5753–5763. based on recurrent neural network model,” Doctor Eng. China, Univ.
[41] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, Chinese Acad. Sci., 2018.
“ALBERT: A lite BERT for self-supervised learning of language repre- [65] H. Wang, X. Zhao, X. Zhang, D. Wu, and X. Du, “Long time series land
sentations,” 2019. [Online]. Available: https://arxiv.org/abs/1909.11942 cover classification in China from 1982 to 2015 based on Bi-LSTM deep
[42] A. Elshamli, G. W. Taylor, A. Berg, and S. Areibi, “Domain adaptation learning,” Remote Sens., vol. 11, no. 14, 2019, Art. no. 1639.
using representation learning for the classification of remote sensing [66] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”
images,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol. 10, IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov 1997.
no. 9, pp. 4198–4209, Sep. 2017.
[43] X. Tang, X. Zhang, F. Liu, and L. Jiao, “Unsupervised deep feature Yuan Yuan (Member, IEEE) received the B.S. de-
learning for remote sensing image retrieval,” Remote Sens., vol. 10, no. 8, gree in geography from Nanjing University, Nanjing,
Aug. 2018, Art. no. 1243. China, in 2011 and the Ph.D. degree in signal and in-
[44] Y. T. Tao, M. Z. Xu, F. Zhang, B. Du, and L. P. Zhang, “Unsupervised- formation processing from Institute of Remote Sens-
restricted deconvolutional neural network for very high resolution remote- ing and Digital Earth, Chinese Academy of Sciences,
sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, Beijing, China, in 2016.
no. 12, pp. 6805–6823, Dec. 2017. She is currently working at Nanjing University of
[45] X. Q. Lu, X. T. Zheng, and Y. Yuan, “Remote sensing scene classification Posts and Telecommunications, Department of Sur-
by unsupervised representation learning,” IEEE Trans. Geosci. Remote veying and Geoinformatics. Her research interests
Sens., vol. 55, no. 9, pp. 5148–5157, Sep. 2017. include remote sensing time series analysis, change
[46] X. Wang, K. Tan, Q. Du, Y. Chen, and P. J. Du, “CVA2E: A conditional detection, and deep learning.
variational autoencoder with an adversarial training process for hyperspec-
Lei Lin received the B.S. degree in geographic in-
tral imagery classification,” IEEE Trans. Geosci. Remote Sens., vol. 58, formation system from Shandong Agricultural Uni-
no. 8, pp. 5676–5692, Aug. 2020.
versity, Taian, China, in 2013, the Ph.D. degree in
[47] X. Ma, Y. Lin, Z. Nie, and H. Ma, “Structural damage identification
signal and information processing from Institute of
based on unsupervised feature-extraction via variational Auto-encoder,”
Remote Sensing and Digital Earth, Chinese Academy
Measurement, vol. 160, Aug 2020, Art. no. 107811.
of Sciences, Beijing, China, in 2018.
[48] D. Y. Lin, K. Fu, Y. Wang, G. L. Xu, and X. Sun, “MARTA GANs:
He is currently working at Qihoo Technology Cor-
Unsupervised representation learning for remote sensing image classifi-
poration. His research interests include time series
cation,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 11, pp. 2092–2096, analysis and natural language processing.
Nov. 2017.