Self-Supervised Learning: Generative or Contrastive
Self-Supervised Learning: Generative or Contrastive
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2021.3090866, IEEE
Transactions on Knowledge and Data Engineering
1
Abstract—Deep supervised learning has achieved great success in the last decade. However, its heavy dependence on manual labels and its vulnerability to attacks have driven people to explore other paradigms. As an alternative, self-supervised learning (SSL) has attracted many researchers with its soaring performance on representation learning over the last several years. Self-supervised representation learning leverages the input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we take a look into recent self-supervised learning methods for representation in computer vision, natural language processing, and graph learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). We further collect related theoretical analyses of self-supervised learning to provide deeper insight into why self-supervised learning works. Finally, we briefly discuss open problems and future directions for self-supervised learning. An outline slide for the survey is provided.¹
1 INTRODUCTION
| Model | FOS | Type | Generator | Self-supervision | Pretext Task | Hard NS | Hard PS | NS strategy |
|---|---|---|---|---|---|---|---|---|
| GPT/GPT-2 [98], [99] | NLP | G | AR | Following words | Next word prediction | - | - | - |
| PixelCNN [122], [124] | CV | G | AR | Following pixels | Next pixel prediction | - | - | - |
| NICE [30] | CV | G | Flow-based | Whole image | Image reconstruction | - | - | - |
| RealNVP [31] | CV | G | Flow-based | Whole image | Image reconstruction | - | - | - |
| Glow [62] | CV | G | Flow-based | Whole image | Image reconstruction | - | - | - |
| word2vec [79], [80] | NLP | G | AE | Context words | CBOW & Skip-Gram | × | × | End-to-end |
| FastText [10] | NLP | G | AE | Context words | CBOW | × | × | End-to-end |
| DeepWalk-based [43], [92], [115] | Graph | G | AE | Graph edges | Link prediction | × | × | End-to-end |
| VGAE [65] | Graph | G | AE | Graph edges | Link prediction | × | × | End-to-end |
| BERT [27] | NLP | G | AE | Masked words, Sentence topic | Masked language model, Next sentence prediction | - | - | - |
| SpanBERT [58] | NLP | G | AE | Masked words | Masked language model | - | - | - |
| ALBERT [68] | NLP | G | AE | Masked words, Sentence order | Masked language model, Sentence order prediction | - | - | - |
| ERNIE [114], [149] | NLP | G | AE | Masked words, Sentence topic | Masked language model, Next sentence prediction | - | - | - |
| GPT-GNN [52] | Graph | G | AE | Attribute & Edge | Masked graph generation | - | - | - |
| VQ-VAE 2 [101] | CV | G | AE | Whole image | Image reconstruction | - | - | - |
| XLNet [138] | NLP | G | AE+AR | Masked words | Permutation language model | - | - | - |
| GraphAF [107] | Graph | G | Flow+AR | Attribute & Edge | Sequential graph generation | - | - | - |
| RelativePosition [32] | CV | C | - | Spatial relations (Context-Instance) | Relative position prediction | - | - | - |
| CDJP [61] | CV | C | - | Spatial relations (Context-Instance) | Jigsaw + Inpainting + Colorization | × | × | End-to-end |
| PIRL [81] | CV | C | - | Spatial relations (Context-Instance) | Jigsaw | × | ✓ | Memory bank |
| RotNet [38] | CV | C | - | Spatial relations (Context-Instance) | Rotation prediction | - | - | - |
| Deep InfoMax [49] | CV | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end |
| AMDIM [6] | CV | C | - | Belonging (Context-Instance) | MI maximization | × | ✓ | End-to-end |
| CPC [88] | CV | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end |
| InfoWord [66] | NLP | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end |
| DGI [127] | Graph | C | - | Belonging (Context-Instance) | MI maximization | ✓ | × | End-to-end |
| InfoGraph [110] | Graph | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end (batch-wise) |
| CMC-Graph [46] | Graph | C | - | Belonging (Context-Instance) | MI maximization | × | ✓ | End-to-end |
| S2GRL [90] | Graph | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end |
| Pre-trained GNN [51] | Graph | C | - | Belonging, Node attributes | MI maximization, Masked attribute prediction | × | × | End-to-end |
| DeepCluster [13] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | - |
| Local Aggregation [152] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | - |
| ClusterFit [136] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | - |
| SwAV [14] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | ✓ | End-to-end |
| SEER [41] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | ✓ | End-to-end |
| M3S [113] | Graph | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | - |
| InstDisc [132] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | × | Memory bank |
| CMC [118] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end |
| MoCo [47] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | × | Momentum |
| MoCo v2 [18] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | Momentum |
| SimCLR [15] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end |
| InfoMin [119] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end |
| BYOL [42] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | no NS | ✓ | End-to-end |
| ReLIC [82] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end |
| SimSiam [19] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | no NS | ✓ | End-to-end |
| SimCLR v2 (semi) [16] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end |
| GCC [94] | Graph | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | Momentum |
| GraphCL [142] | Graph | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end |
| GAN [40] | CV | G-C | AE | Whole image | Image reconstruction | - | - | - |
| Adversarial AE [77] | CV | G-C | AE | Whole image | Image reconstruction | - | - | - |
| BiGAN/ALI [33], [36] | CV | G-C | AE | Whole image | Image reconstruction | - | - | - |
| BigBiGAN [34] | CV | G-C | AE | Whole image | Image reconstruction | - | - | - |
| Colorization [69] | CV | G-C | AE | Image color | Colorization | - | - | - |
| Inpainting [89] | CV | G-C | AE | Parts of images | Inpainting | - | - | - |
| Super-resolution [72] | CV | G-C | AE | Details of images | Super-resolution | - | - | - |
| ELECTRA [21] | NLP | G-C | AE | Masked words | Replaced token detection | ✓ | × | End-to-end |
| WKLM [134] | NLP | G-C | AE | Masked entities | Replaced entity detection | ✓ | × | End-to-end |
| ANE [23] | Graph | G-C | AE | Graph edges | Link prediction | - | - | - |
| GraphGAN [128] | Graph | G-C | AE | Graph edges | Link prediction | - | - | - |
| GraphSGAN [28] | Graph | G-C | AE | Graph nodes | Node classification | - | - | - |
TABLE 1: An overview of recent self-supervised representation learning methods. For the acronyms used, "FOS" refers to field of study; "NS" refers to negative samples; "PS" refers to positive samples; "MI" refers to mutual information. For the letters in "Type": G = Generative; C = Contrastive; G-C = Generative-Contrastive (Adversarial). For the symbols in "Hard NS" and "Hard PS", "-" means not applicable, "×" means not adopted, and "✓" means adopted; "no NS" specifically means not using negative samples in instance-instance contrast.
Formalizing the densities is difficult. To obtain a complicated density, we hope to build it "step by step" by stacking a series of transformation functions that each describe different characteristics of the data. Generally, flow-based models first define a latent variable z which follows a known distribution p_Z(z), and then define z = f_θ(x), where f_θ is an invertible and differentiable function. The goal is to learn the transformation between x and z so that the density of x can be depicted. According to the change-of-variables rule, p_θ(x)dx = p(z)dz. Therefore, the densities of x and z satisfy:

$$ p_\theta(x) = p(f_\theta(x)) \left| \frac{\partial f_\theta(x)}{\partial x} \right| \qquad (2) $$

and the objective is to maximize the likelihood:

$$ \max_\theta \sum_i \log p_\theta(x^{(i)}) = \max_\theta \sum_i \left[ \log p_Z(f_\theta(x^{(i)})) + \log \left| \frac{\partial f_\theta}{\partial x}(x^{(i)}) \right| \right] \qquad (3) $$

The advantage of flow-based models is that the mapping between x and z is invertible. However, it also requires that x and z have the same dimension. f_θ needs to be carefully designed, since it should be invertible and the Jacobian determinant in Eq. (2) should be easy to compute. NICE [30] and RealNVP [31] design an affine coupling layer to parameterize f_θ. The core idea is to split x into two blocks (x_1, x_2) and apply a transformation from (x_1, x_2) to (z_1, z_2) in an auto-regressive manner, that is, z_1 = x_1 and z_2 = x_2 + m(x_1). More recently, Glow [62] was proposed; it introduces invertible 1×1 convolutions and simplifies RealNVP.

3.3 Auto-encoding (AE) Model

The auto-encoding model's goal is to reconstruct (part of) its input from a (corrupted) input. Due to its flexibility, the AE model is probably the most popular generative model, with many variants.

3.3.1 Basic AE Model

The autoencoder (AE) was first introduced in [8] for pre-training artificial neural networks. Before the autoencoder, the Restricted Boltzmann Machine (RBM) [109] could also be viewed as a special "autoencoder". The RBM is an undirected graphical model that contains only two layers: the visible layer and the hidden layer. The objective of the RBM is to minimize the difference between the model's marginal distribution and the data distribution. In contrast, an autoencoder can be regarded as a directed graphical model, and it can be trained more efficiently. The autoencoder is typically used for dimensionality reduction. Generally, the autoencoder is a feed-forward neural network trained to reproduce its input at the output layer. The AE is comprised of an encoder network h = f_enc(x) and a decoder network x' = f_dec(h). The objective of the AE is to make x and x' as similar as possible (for example, through a mean-square error). It can be proved that the linear autoencoder corresponds to the PCA method. Sometimes the number of hidden units is greater than the number of input units, and interesting structures can still be discovered by imposing sparsity constraints on the hidden units [85].

3.3.2 Context Prediction Model (CPM)

The idea of the Context Prediction Model (CPM) is to predict contextual information based on inputs.

In NLP, when it comes to self-supervised learning of word embeddings, CBOW and Skip-Gram [80] are pioneering works. CBOW aims to predict the input tokens based on context tokens; in contrast, Skip-Gram aims to predict context tokens based on input tokens. Usually, negative sampling is employed to ensure computational efficiency and scalability. Following the CBOW architecture, FastText [10] was proposed to utilize subword information.

Inspired by the progress of word embedding models in NLP, many network embedding models have been proposed based on a similar context prediction objective. DeepWalk [92] samples truncated random walks to learn latent node embeddings based on the Skip-Gram model, treating random walks as the equivalent of sentences. However, another network embedding approach, LINE [115], aims to generate neighbors rather than nodes on a path, based on the current nodes:

$$ O = - \sum_{(i,j) \in E} w_{ij} \log p(v_j \mid v_i) \qquad (4) $$

where E denotes the edge set, v denotes a node, and w_{ij} represents the weight of edge (v_i, v_j). LINE also uses negative sampling to draw multiple negative edges to approximate the objective.

3.3.3 Denoising AE Model

The intuition of denoising autoencoder models is that representations should be robust to the introduction of noise. The masked language model (MLM), one of the most successful architectures in natural language processing, can be regarded as a denoising AE model. To model text sequences, the MLM randomly masks some of the tokens in the input and then predicts them based on their context information, which is similar to the Cloze task [117]. BERT [27] is the most representative work in this field. Specifically, in BERT, a special token [MASK] is introduced in the training process to mask some tokens. However, one shortcoming of this method is that there are no input [MASK] tokens in downstream tasks. To mitigate this, the authors do not always replace the predicted tokens with [MASK] in training; instead, they replace them with the original words or with random words with a small probability.

Following BERT, many extensions of the MLM emerge. SpanBERT [58] chooses to mask contiguous random spans rather than the random tokens adopted by BERT. Moreover, it trains the span boundary representations to predict the masked spans, inspired by ideas in coreference resolution. ERNIE (Baidu) [114] masks entities or phrases to learn entity-level and phrase-level knowledge, which obtains good results on Chinese natural language processing tasks. ERNIE (Tsinghua) [149] further integrates knowledge (entities and relations) from knowledge graphs into language models.

Compared with the AR model, in denoising AE for language modeling the predicted tokens have access to contextual information from both sides. However, the fact that the MLM assumes the predicted tokens to be independent of each other given the unmasked tokens (which does not hold in reality) has long been considered its inherent drawback.
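To make BERT's masking recipe concrete, the sketch below applies the commonly reported strategy (select about 15% of tokens as prediction targets; replace 80% of them with [MASK], 10% with a random token, and leave 10% unchanged). It is a minimal illustration in PyTorch; the vocabulary size, the special-token id, and the exact ratios are assumptions for this example rather than values taken from the survey.

```python
import torch

def mask_tokens(input_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style dynamic masking: returns corrupted inputs and MLM labels."""
    labels = input_ids.clone()
    # Choose ~15% of positions as prediction targets.
    target = torch.rand(input_ids.shape) < mask_prob
    labels[~target] = -100                      # ignored by the MLM loss
    corrupted = input_ids.clone()
    # 80% of targets -> [MASK]
    mask_slot = target & (torch.rand(input_ids.shape) < 0.8)
    corrupted[mask_slot] = mask_id
    # 10% of targets -> random token (half of the remaining 20%)
    rand_slot = target & ~mask_slot & (torch.rand(input_ids.shape) < 0.5)
    corrupted[rand_slot] = torch.randint(vocab_size, input_ids.shape)[rand_slot]
    # the final 10% keep the original token
    return corrupted, labels

# toy usage: a batch of 2 sequences of length 8 over a vocabulary of 100 tokens
ids = torch.randint(5, 100, (2, 8))
corrupted, labels = mask_tokens(ids, vocab_size=100, mask_id=0)
print(corrupted)
print(labels)
```

The -100 labels follow the common convention of computing the MLM cross-entropy only on the selected positions.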
Fig. 6: Illustration of the permutation language modeling [138] objective for predicting x_3 given the same input sequence x but with different factorization orders. Adapted from [138].

Let Z_T denote the set of all possible permutations of the length-T index sequence [1, 2, ..., T]; the objective of PLM can then be expressed as follows:

$$ \max_\theta \; \mathbb{E}_{z \sim Z_T} \left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}}) \right] \qquad (7) $$

In practice, different factorization orders are sampled for each text sequence, so each token can see its contextual information from both sides. Based on the permuted order, XLNet also conducts a reparameterization with positions to let the model know which position needs to be predicted. A special two-stream self-attention is then introduced for target-aware prediction.

Furthermore, different from BERT and inspired by the latest advancements in AR models, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL [24] into pre-training, which can model long-range dependencies better than the Transformer [125].

3.4.2 Combining AE and Flow-based Models

In the graph domain, GraphAF [107] is a flow-based auto-regressive model for molecular graph generation. It can generate molecules in an iterative process and also calculate the exact likelihood in parallel. GraphAF formalizes molecule generation as a sequential decision process and incorporates detailed domain knowledge, such as a valency check, into the reward design. Inspired by the recent progress of flow-based models, it defines an invertible transformation from a base distribution (e.g., a multivariate Gaussian) to a molecular graph structure. Additionally, a dequantization technique [50] is utilized to convert discrete data (including node types and edge types) into continuous data.

3.5 Pros and Cons

A reason for generative self-supervised learning's success is its ability to recover the original data distribution without assumptions about downstream tasks, which enables generative models' wide application in both classification and generation. Notably, all the existing generation tasks (including text, image, and audio) rely on generative models.

Contrastive methods including MoCo [47], SimCLR [15], BYOL [42] and SwAV [14] have presented overwhelming performances on various CV benchmarks. Nevertheless, in the NLP domain, researchers still depend on generative language models to conduct text classification.

Second, the point-wise nature of the generative objective has some inherent defects. This objective is usually formulated as a maximum likelihood function L_MLE = −Σ_x log p(x|c), where x ranges over the samples we hope to model and c is a conditional constraint such as context information. Considering its form, MLE has two fatal problems:

1) Sensitive and conservative distribution. When p(x|c) → 0, L_MLE becomes extremely large, making the generative model very sensitive to rare samples. This directly leads to a conservative distribution, which yields low performance.
2) Low-level abstraction objective. In MLE, the representation distribution is modeled at x's level (i.e., the point-wise level), such as pixels in images, words in texts, and nodes in graphs. However, most classification tasks target high-level abstraction, such as object detection, long paragraph understanding, and molecule classification.

As an opposite approach, generative-contrastive self-supervised learning abandons the point-wise objective. It turns to distributional matching objectives that are more robust and better handle the high-level abstraction challenge in the data manifold.

4 CONTRASTIVE SELF-SUPERVISED LEARNING

From a statistical perspective, machine learning models are categorized into generative and discriminative models. Given the joint distribution P(X, Y) of the input X and target Y, the generative model calculates p(X|Y = y), while the discriminative model tries to model P(Y|X = x). Because most representation learning tasks hope to model relationships between x, for a long time people believed that the generative model was the only choice for representation learning.

However, recent breakthroughs in contrastive learning, such as Deep InfoMax, MoCo and SimCLR, shed light on the potential of discriminative models for representation. Contrastive learning aims to "learn to compare" through a Noise Contrastive Estimation (NCE) [44] objective formatted as:

$$ \mathcal{L} = \mathbb{E}_{x, x^+, x^-} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + e^{f(x)^\top f(x^-)}} \right] \qquad (8) $$

where x^+ is similar to x, x^- is dissimilar to x, and f is an encoder (representation function). The similarity measure and encoder may vary from task to task, but the framework remains the same. With more dissimilar pairs involved, we have the InfoNCE [88] objective, formulated as:

$$ \mathcal{L} = \mathbb{E}_{x, x^+, x^-_k} \left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{k=1}^{K} e^{f(x)^\top f(x^-_k)}} \right] \qquad (9) $$
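As a concrete reference for Eqs. (8)-(9), the following is a minimal sketch of an InfoNCE-style loss for a batch of queries, each with one positive and K negatives. The plain dot-product scores mirror the equations above; the random vectors stand in for encoder outputs f(x), f(x+), f(x-) and are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives):
    """InfoNCE as in Eq. (9): query/positive are (B, d), negatives is (B, K, d)."""
    pos_logit = (query * positive).sum(dim=-1, keepdim=True)        # (B, 1)
    neg_logits = torch.einsum('bd,bkd->bk', query, negatives)       # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)              # (B, 1+K)
    # The positive pair sits at index 0, so the loss is a cross-entropy
    # with target 0 -- exactly -log softmax of the positive score.
    targets = torch.zeros(query.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# toy usage with random stand-ins for f(x), f(x+), f(x-)
B, K, d = 4, 16, 128
loss = info_nce(torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d))
print(loss.item())
```

Casting the loss as a cross-entropy with the positive at index 0 is numerically equivalent to the -log softmax form of Eq. (9).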
• PRP focuses on learning relative positions between local components. The global context serves as an implicit requirement for predicting these relations (for example, understanding what an elephant looks like is critical for predicting the relative position between its head and tail).
• MI focuses on learning the direct belonging relationships between local parts and the global context. The relative positions between local parts are ignored.

4.1.1 Predict Relative Position

Many data contain rich spatial or sequential relations between their parts. For example, in image data such as Fig. 8, the elephant's head is on the right of its tail. In text data, a sentence like "Nice to meet you." would probably be ahead of "Nice to meet you, too.". Various models regard recognizing relative positions between parts of the input as the pretext task [57]. It could be to predict the relative positions of two patches from a sample [32], to recover the positions of shuffled segments of an image (solving a jigsaw puzzle) [61], [86], [131], or to infer the rotation angle of an image [38]. PRP may also serve as a tool to create hard positive samples. For instance, the jigsaw technique is applied in PIRL [81] to augment the positive sample, but PIRL does not regard solving the jigsaw and recovering the spatial relation as its objective.

4.1.2 Maximize Mutual Information

$$ \max_{g_1 \in \mathcal{G}_1, \, g_2 \in \mathcal{G}_2} I(g_1(x_1), g_2(x_2)) \qquad (10) $$

where g_i is the representation encoder, G_i is a class of encoders with some constraints, and I(·, ·) is a sample-based estimator of the accurate mutual information. In applications, MI is notorious for its complex computation. A common practice is to alternatively maximize I's lower bound with an NCE objective.

Deep InfoMax [49] is the first to explicitly model mutual information through a contrastive learning task, which maximizes the MI between a local patch and its global context. In practice, taking image classification as an example, we can encode a cat image x into f(x) ∈ R^{M×M×d} and take out a local feature vector v ∈ R^d. To conduct contrast between instance and context, we need two other things:
• a summary function g: R^{M×M×d} → R^d to generate the context vector s = g(f(x)) ∈ R^d;
• another cat image x^- and its context vector s^- = g(f(x^-)).
The contrastive objective is then formulated as

$$ \mathcal{L} = \mathbb{E}_{v, x} \left[ -\log \frac{e^{v^\top \cdot s}}{e^{v^\top \cdot s} + e^{v^\top \cdot s^-}} \right] \qquad (11) $$

Deep InfoMax provides us with a new paradigm and boosts the development of self-supervised learning. The first influential follower is Contrastive Predictive Coding
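Eq. (11) can be made concrete with a small sketch: a feature map f(x) ∈ R^{M×M×d} is summarized into a context vector s, and a local vector v from the same image is scored against s and against the context vector s^- of another image. The mean-pooling summary function and the random feature maps below are simplifying assumptions for illustration, not Deep InfoMax's actual encoder or scoring network.

```python
import torch
import torch.nn.functional as F

def local_global_contrast(feat, feat_neg):
    """feat, feat_neg: (M, M, d) feature maps of a positive and a negative image."""
    # summary function g: mean-pool the map into a context vector s in R^d
    s = feat.mean(dim=(0, 1))
    s_neg = feat_neg.mean(dim=(0, 1))
    # take one local feature vector v in R^d from the positive image
    v = feat[0, 0]
    # contrast v.s against v.s_neg, as in Eq. (11)
    logits = torch.stack([v @ s, v @ s_neg])
    return -F.log_softmax(logits, dim=0)[0]

M, d = 7, 64
loss = local_global_contrast(torch.randn(M, M, d), torch.randn(M, M, d))
print(loss.item())
```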
For example, given an input image x, our intuition is to learn an intrinsic representation q = f_q(x) by a query encoder f_q(·) that can distinguish x from any other image. Therefore, for a set of other images x_i, we employ an asynchronously updated key encoder f_k(·) to yield k_+ = f_k(x) and k_i = f_k(x_i), and optimize the following objective:

$$ \mathcal{L} = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)} \qquad (12) $$

where K is the number of negative samples. This formula is in the form of InfoNCE.

Besides, MoCo presents two other critical ideas for dealing with negative sampling efficiency.
• First, it abandons the traditional end-to-end training framework and designs momentum contrast learning with two encoders (query and key), which prevents fluctuations in loss convergence at the beginning of training.
• Second, to enlarge the negative samples' capacity, MoCo employs a queue (with K as large as 65536) to save the recently encoded batches as negative samples. This significantly improves the negative sampling efficiency.

Fig. 12: Conceptual comparison of three contrastive loss mechanisms. Taken from MoCo [47].

There are some other auxiliary techniques to ensure training convergence, such as batch shuffling to avoid trivial solutions and a temperature hyper-parameter τ to adjust the scale.

However, MoCo adopts an overly simple positive sample strategy: a pair of positive representations comes from the same sample without any transformation or augmentation, making the positive pair far too easy to distinguish. PIRL [81] adds jigsaw augmentation as described in Section 4.1.1. PIRL asks the encoder to regard an image and its jigsawed version as a similar pair in order to produce a pretext-invariant representation.

SimCLR [15] is similar to CMC [118], which leverages several different views to augment the positive pairs. SimCLR follows the end-to-end training framework instead of the momentum contrast from MoCo, and to handle the large-scale negative samples problem, SimCLR chooses a batch size N as large as 8196.

The details are as follows. A minibatch of N samples is augmented into 2N samples x̂_j (j = 1, 2, ..., 2N). For a pair of positive samples x̂_i and x̂_j (derived from one original sample), the other 2(N − 1) samples are treated as negatives. A pairwise contrastive loss, the NT-Xent loss [17], is defined as

$$ l_{i,j} = -\log \frac{\exp(\mathrm{sim}(\hat{x}_i, \hat{x}_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(\hat{x}_i, \hat{x}_k)/\tau)} \qquad (13) $$

Note that l_{i,j} is asymmetric, and the sim(·, ·) function here is a cosine similarity that normalizes the representations. The summed-up loss is

$$ \mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[ l_{2k-1,\,2k} + l_{2k,\,2k-1} \right] \qquad (14) $$

SimCLR also provides some other practical techniques, including a learnable nonlinear transformation between the representation and the contrastive loss, more training steps, and deeper neural networks. [18] conducts ablation studies showing that the techniques in SimCLR can also further improve MoCo's performance.

More investigation into augmenting positive samples is made in InfoMin [119]. The authors claim that we should select views with less mutual information for better-augmented views in contrastive learning. In the optimal situation, the views should share only the label information. To produce such optimal views, the authors first propose an unsupervised method to minimize the mutual information between views. However, this may result in a loss of information for predicting labels (such as a pure blank view). Therefore, a semi-supervised method is then proposed to find views sharing only label information. This technique leads to an improvement of about 2% over MoCo v2.

A more radical step is taken by BYOL [42], which discards negative sampling in self-supervised learning yet achieves an even better result than InfoMin. The contrastive learning methods we mentioned above learn representations by predicting different views of the same image and cast the prediction problem directly in representation space. However, predicting directly in representation space can lead to collapsed representations, because multiple views are generally too predictive of each other. Without negative samples, it would be too easy for the neural networks to distinguish those positive views.
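As a concrete reference for Eqs. (13)-(14), the sketch below computes the NT-Xent loss over a batch of 2N augmented views. The cosine similarity, temperature, and pairing convention (consecutive rows come from the same original sample) follow the equations above; the random inputs are placeholders for encoder outputs.

```python
import torch
import torch.nn.functional as F

def nt_xent(z, temperature=0.5):
    """z: (2N, d) representations; rows 2k and 2k+1 come from the same sample."""
    z = F.normalize(z, dim=1)                        # cosine similarity via dot product
    sim = z @ z.t() / temperature                    # (2N, 2N)
    sim.fill_diagonal_(float('-inf'))                # exclude k == i from the denominator
    n2 = z.size(0)
    # index of each view's positive partner: pairs (0,1), (2,3), ...
    pos = torch.arange(n2) ^ 1
    # per-row l_{i,j}, averaged over all 2N rows as in Eq. (14)
    return F.cross_entropy(sim, pos)

z = torch.randn(8, 128)   # N = 4 original samples, 2 views each
print(nt_xent(z).item())
```

Averaging the per-row cross-entropy over all 2N rows reproduces the 1/(2N) Σ [l_{2k-1,2k} + l_{2k,2k-1}] form of Eq. (14).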
In BYOL, researchers argue that negative samples may not be necessary in this process. They show that, if we use a fixed randomly initialized network (which would not collapse because it is not trained) to serve as the key encoder, the representation produced by the query encoder would still be improved during training. If we then set the target encoder to be the trained query encoder and iterate this procedure, we would progressively achieve better performance. Therefore, BYOL proposes an architecture (Figure 14) with an exponential moving average strategy to update the target encoder, just as MoCo does. Additionally, instead of using a cross-entropy loss, they follow a regression paradigm in which the mean square error is used:

$$ \mathcal{L}^{\mathrm{BYOL}}_{\theta} \triangleq \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\|_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\left\| q_\theta(z_\theta) \right\|_2 \cdot \left\| z'_\xi \right\|_2} \qquad (15) $$

This not only makes the model perform better in downstream tasks but also makes it more robust to smaller batch sizes. In MoCo and SimCLR, a drop in batch size results in a significant decline in performance. However, in BYOL, although batch size still matters, it is far less critical. The ablation study shows that a batch size of 512 only causes a drop of 0.3% compared to the standard batch size of 4096, while SimCLR shows a drop of 1.4%.

In SimSiam [19], researchers further study how necessary negative sampling, and even batch normalization, are in contrastive representation learning. They show that the most critical component in BYOL is the stop-gradient operation, which makes the target representation stable. SimSiam is shown to converge faster than MoCo, SimCLR, and BYOL with even smaller batch sizes, while the performance only slightly decreases.

Some other works are inspired by theoretical analysis of the contrastive objective. ReLIC [82] argues that contrastive pre-training teaches the encoder to causally disentangle the invariant content (i.e., main objects) and style (i.e., environments) in an image. To better enforce this observation in the data augmentation, they propose to add an extra KL-divergence regularizer between the prediction logits of an image's different views. The results show that this can enhance the models' generalization ability and robustness and improve performance.

In graph learning, Graph Contrastive Coding (GCC) [94] is a pioneer in leveraging instance discrimination as the pretext task for structural information pre-training. For each node, we sample two subgraphs independently by random walks with restart and use the top eigenvectors of their normalized graph Laplacian matrices as the nodes' initial representations. Then we use a GNN to encode them and calculate the InfoNCE loss as MoCo and SimCLR do, where the node embeddings from the same node (in different subgraphs) are viewed as similar. Results show that GCC learns better transferable structural knowledge than previous work such as struc2vec [103], GraphWave [35] and ProNE [145]. GraphCL [142] studies data augmentation strategies in graph learning. They propose four different augmentation methods based on edge perturbation and node dropping, and further demonstrate that an appropriate combination of these strategies can yield even better performance.

4.3 Self-supervised Contrastive Pre-training for Semi-supervised Self-training

While contrastive learning-based self-supervised learning continues to push the boundaries on various benchmarks, labels are still important because there is a gap between the training objectives of self-supervised learning and supervised learning. In other words, no matter how much self-supervised learning models improve, they are still only powerful feature extractors, and to transfer to downstream tasks, we still need labels more or less. As a result, to bridge the gap between self-supervised pre-training and downstream tasks, semi-supervised learning is what we are looking for.

Recall that MoCo [47] has topped the ImageNet leaderboard. Although it is proved beneficial for many other downstream vision tasks, it fails to improve the COCO object detection task. Some following work [84], [153] investigates this problem and attributes it to the gap between instance discrimination and object detection. In such a situation, while pure self-supervised pre-training fails to help, semi-supervised-based self-training can contribute a lot.

First, we will clarify the definitions of semi-supervised learning and self-training. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with many unlabeled data during training. Various methods derive from different assumptions made on the data distribution, with self-training (or self-labeling) being the oldest. In self-training, a model is trained on the small amount of labeled data and then yields labels on unlabeled data. Only those data with highly confident labels are combined with the original labeled data to train a new model. We iterate this procedure to find the best model.

The current state-of-the-art supervised model [133] on ImageNet follows the self-training paradigm, where we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. We then train a larger EfficientNet as a student model based on labeled and pseudo-labeled images. We iterate this process by putting back the student as the teacher. During pseudo-label generation, the teacher is not noised so that the pseudo labels are as accurate as possible. However, during the student's learning, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment into the student so that it generalizes better than the teacher.

In light of semi-supervised self-training's success, it is natural to rethink its relationship with self-supervised methods, especially with the successful contrastive pre-training methods. In Section 4.2.1, we have introduced M3S [112], which attempts to combine cluster-based contrastive pre-training and downstream semi-supervised learning. For computer vision tasks, Zoph et al. [153] study MoCo pre-training and a self-training method in which a teacher is first trained on a downstream dataset (e.g., COCO) and then yields pseudo labels on unlabeled data (e.g., ImageNet), and finally a student learns jointly over real labels on the downstream dataset and pseudo labels on unlabeled data. They surprisingly find that pre-training hurts performance while self-training still benefits from strong data augmentation. Besides, more labeled data diminishes the value of pre-training, while semi-supervised self-training
always improves. They also discover that the improvements from pre-training and self-training are orthogonal to each other, i.e., they contribute to performance from different perspectives. The model with joint pre-training and self-training is the best.

Chen et al. [16]'s SimCLR v2 supports the conclusion mentioned above by showing that, with only 10% of the original ImageNet labels, a ResNet-50 can surpass the supervised one with joint pre-training and self-training. They propose a 3-step framework:

1) Do self-supervised pre-training as in SimCLR v1, with some minor architecture modifications and a deeper ResNet.
2) Fine-tune the last few layers with only 1% or 10% of the original ImageNet labels.
3) Use the fine-tuned network as a teacher to yield labels on unlabeled data to train a smaller student ResNet-50.

The success in combining self-supervised contrastive pre-training and semi-supervised self-training opens up our eyes to a future data-efficient deep learning paradigm. More work is expected to investigate their latent mechanisms.

4.4 Pros and Cons

Because contrastive learning has assumed the downstream applications to be classification, it only employs the encoder and discards the decoder in the architecture compared to generative models. Therefore, contrastive models are usually light-weight and perform better in discriminative downstream applications.

Contrastive learning is closely related to metric learning, a discipline that has long been studied. However, self-supervised contrastive learning is still an emerging field, and many problems remain to be solved, including:

1) Scale to natural language pre-training. Despite its success in computer vision, contrastive pre-training does not present convincing results on NLP benchmarks. Most contrastive learning in NLP now lies in BERT's supervised fine-tuning, such as improving BERT's sentence-level representation [102] and information retrieval [59]. Few algorithms have been proposed to apply contrastive learning in the pre-training stage. As most language understanding tasks are classifications, a contrastive language pre-training approach should be better than the current generative language models.
2) Sampling efficiency. Negative sampling is a must for most contrastive learning, but this process is often tricky, biased, and time-consuming. BYOL [42] and SimSiam [19] are the pioneers in freeing contrastive learning from negative samples, but this can be improved. It is also not clear enough what role negative sampling plays in contrastive learning.
3) Data augmentation. Researchers have proved that data augmentation can boost contrastive learning's performance, but the theory for why and how it helps is still quite ambiguous. This hinders its application to other domains, such as NLP and graph learning, where the data is discrete and abstract.

5 GENERATIVE-CONTRASTIVE (ADVERSARIAL) SELF-SUPERVISED LEARNING

Generative-contrastive representation learning, known under the more familiar name adversarial representation learning, leverages a discriminative loss function as the objective. Yann LeCun describes adversarial learning as "the most interesting idea in the last ten years in machine learning." Its application to representation learning is also booming.

The idea of adversarial learning derives from generative learning, where researchers have observed some inherent shortcomings of point-wise generative reconstruction (see Section 3.5). As an alternative, adversarial learning learns to reconstruct the original data distribution rather than the samples, by minimizing the distributional divergence.

In terms of contrastive learning, adversarial methods still preserve the generator structure consisting of an encoder and a decoder, whereas contrastive methods abandon the decoder component (as shown in Fig. 4). This is critical because, on the one hand, the generator endows adversarial learning with the strong expressiveness that is peculiar to generative models; on the other hand, it also makes the objective of adversarial methods far more challenging to learn than that of contrastive methods, leading to unstable convergence. In the adversarial setting, the decoder's existence asks the representation to be "reconstructive", in other words, to contain all the information necessary for reconstructing the inputs. However, in the contrastive setting, we only need to learn "distinguishable" information to discriminate different samples.

To sum up, adversarial methods absorb merits from both generative and contrastive methods, together with some drawbacks. In situations where we need to fit an implicit distribution, they are a better choice. In the following subsections, we will discuss their various applications in representation learning.

5.1 Generate with Complete Input

This section introduces GAN and its variants for representation learning, focusing on capturing the sample's complete information.

The inception of adversarial representation learning should be attributed to Generative Adversarial Networks (GAN) [97], which propose the adversarial training framework. Following GAN, many variants [11], [55], [56], [60], [72], [89] emerge and reshape people's understanding of deep learning's potential. GAN's training process can be viewed as two players playing a game: one generates fake samples while the other tries to distinguish them from real ones. To formulate this problem, we define G as the generator, D as the discriminator, p_data(x) as the real sample distribution, and p_z(z) as the learned latent sample distribution, and we optimize the following min-max game:

$$ \min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (16) $$

Before VQ-VAE 2, GAN maintained dominant performance on image generation tasks over purely generative models, such as the autoregressive PixelCNN and the autoencoder-based VAE. It is natural to think about how this framework could benefit representation learning.
As we mentioned before, compared to the l2 loss of the autoencoder, the discriminative loss in GAN better models the high-level abstraction. To alleviate the problem, AAE substitutes a discriminative loss for the KL divergence term:

$$ \mathcal{L}_{Disc} = \mathrm{CrossEntropy}(q(z), p(z)) \qquad (18) $$

which asks the discriminator to distinguish representations coming from the encoder from samples of a prior distribution.

However, AAE still preserves the reconstruction error, which contradicts GAN's core idea. Based on AAE, BiGAN [33] and ALI [36] argue to embrace adversarial learning without reservation and put forward a new framework. Given an actual sample x:
• Generator G: the generator here virtually acts as the decoder, generating fake samples x' = G(z) from z drawn from a prior latent distribution (e.g., [uniform(−1, 1)]^d, where d refers to the dimension).
• Encoder E: a newly added component, mapping a real sample x to a representation z' = E(x). This is also exactly what we want to train.
• Discriminator D: given two inputs [z, G(z)] and [E(x), x], decide which one is from the real sample distribution.

It is easy to see that their training goal is E = G^{-1}; in other words, encoder E should learn to "invert" generator G. This goal could be rewritten as an l0 loss for an autoencoder [33], but it is not the same as a traditional autoencoder because the distribution does not make any assumption about the data itself. The distribution is shaped by the discriminator, which captures the semantic-level difference.

Based on BiGAN and ALI, later studies [20], [34] discover that GANs with deeper and larger networks and modified architectures can produce even better results on downstream tasks.

5.2 Recover with Partial Input

As we mentioned above, GAN's architecture is not born for representation learning, and modification is needed to apply its framework. While BiGAN and ALI choose to extract the implicit distribution directly, some other methods such as colorization [69], [70], [147], [148], inpainting [55], [89] and super-resolution [72] apply adversarial learning in a different way. Instead of asking models to reconstruct the whole input, they provide models with partial input and ask them to recover the rest. This is similar to denoising autoencoders (DAE) such as BERT's family in natural language processing, but conducted in an adversarial manner.

Colorization was first proposed by [147]. The problem can be described as: given one color channel L of an image, predict the values of the two other channels A and B. The encoder and decoder networks can be set to any form of convolutional neural network. Interestingly, to avoid the uncertainty brought by traditional generative methods such as VAE, the authors transform the generation task into a classification one. They first figure out the common locating area of (A, B) values and then split it into 313 categories. The classification is performed through a softmax layer with a hyper-parameter T as an adjustment. Based on [147], a range of colorization-based representation methods [69], [70], [148] have been proposed to benefit downstream tasks.

Inpainting [55], [89] is more straightforward: we ask the model to predict an arbitrary part of an image given the rest of it. A discriminator is then employed to distinguish the inpainted image from the original one. The super-resolution method SRGAN [72] follows the same idea to recover high-resolution images from blurred low-resolution ones in the adversarial setting.

5.3 Pre-trained Language Model

For a long time, pre-trained language models (PTMs) focused on maximum likelihood estimation based pretext tasks, because discriminative objectives were thought to be helpless given languages' vibrant patterns. However, recently some work shows excellent performance and sheds light on contrastive objectives' potential in PTMs.

Fig. 16: The architecture of ELECTRA [21]. It follows GAN's framework but uses a two-stage training paradigm to avoid using policy gradients. MLM stands for Masked Language Model.

The pioneering work is ELECTRA [21], which surpasses BERT given the same computation budget. ELECTRA proposes Replaced Token Detection (RTD) and leverages GAN's structure to pre-train a language model. In this setting, the generator G is a small Masked Language Model (MLM), which replaces masked tokens in a sentence with words. The discriminator D is asked to predict which words are replaced. Note that "replaced" means not the same as the original unmasked inputs. The training is conducted in two stages:
ways to leverage GAN's learned latent representation and achieve good performance, contrastive learning has soon outperformed them with fewer parameters.

Despite the challenges, however, it is still promising because it overcomes some inherent deficits of the point-wise generative objective. Maybe we still need to wait for a better future implementation of this idea.

6 DISCUSSIONS AND FUTURE DIRECTIONS

In this section, we discuss several open problems and future directions in self-supervised learning for representation.

Theoretical Foundation. Though self-supervised learning has achieved great success, few works investigate the mechanisms behind it. In this survey, we have listed several recent works on this topic and shown that theoretical analysis is important for avoiding misleading empirical conclusions.

In [4], researchers present a conceptual framework to analyze the contrastive objective's role in generalization ability. [120] empirically proves that mutual information is only loosely related to the success of several MI-based methods, in which the sampling strategies and architecture design may count for more. This type of work is crucial for self-supervised learning to form a solid foundation, and more work related to theoretical analysis is urgently needed.

Transferring to downstream tasks. There is an essential gap between pre-training and downstream tasks. Researchers design elaborate pretext tasks to help models learn critical features of the dataset that can transfer to other jobs, but sometimes this fails to be realized. Besides, the process of selecting pretext tasks seems too heuristic and tricky, without patterns to follow.

A typical example is the selection of pre-training tasks in BERT and ALBERT. BERT uses Next Sentence Prediction (NSP) to enhance its ability for sentence-level understanding. However, ALBERT shows that NSP amounts to a naive topic model, which is far too easy for language model pre-training and even decreases BERT's performance.

For the pre-training task selection problem, an exciting direction would be to design pre-training tasks for a specific downstream task automatically, just as Neural Architecture Search [154] does for neural network architectures.

Transferring across datasets. This problem is also known as how to learn inductive biases or inductive learning. Traditionally, we split a dataset into a training part used for learning the model parameters and a testing part for evaluation. An essential prerequisite of this learning paradigm is that data in the real world conform to our dataset's distribution. Nevertheless, this assumption frequently fails in experiments.

Self-supervised representation learning solves part of this problem, especially in the field of natural language processing. The vast amounts of corpora used in language model pre-training help cover most language patterns and, therefore, contribute to the success of PTMs on various language tasks. However, this is based on the fact that text in the same language shares the same embedding space. For other tasks like machine translation and fields like graph learning, where embedding spaces are different for different datasets, learning the transferable inductive biases efficiently is still an open problem.

Exploring the potential of sampling strategies. In [120], the authors attribute one of the reasons for the success of mutual information-based methods to better sampling strategies. MoCo [47], SimCLR [15], and a series of other contrastive methods may also support this conclusion. They propose to leverage super large amounts of negative samples and augmented positive samples, whose effects are studied in deep metric learning. How to further release the power of sampling is still an unsolved and attractive problem.

Early Degeneration for Contrastive Learning. Contrastive learning methods such as MoCo [47] and SimCLR [15] are rapidly approaching the performance of supervised learning for computer vision. However, their incredible performance is generally limited to classification problems. Meanwhile, the generative-contrastive method ELECTRA [21] for language model pre-training is also outperforming other generative methods on several standard NLP benchmarks with fewer model parameters. However, some remarks indicate that ELECTRA's performance on language generation and neural entity extraction is not up to expectations.

The problems above are probably because the contrastive objectives often get trapped in the embedding space's early degeneration problem, which means that the model over-fits to the discriminative pretext task too early and therefore loses the ability to generalize. We expect that there will be techniques or new paradigms to solve the early degeneration problem while preserving contrastive learning's advantages.

7 CONCLUSION

This survey comprehensively reviews the existing self-supervised representation learning approaches in natural language processing (NLP), computer vision (CV), graph learning, and beyond. Self-supervised learning is the present and future of deep learning due to its supreme ability to utilize Web-scale unlabeled data to train feature extractors and context generators efficiently. Despite the diversity of algorithms, we categorize all self-supervised methods into three classes: generative, contrastive, and generative-contrastive, according to their essential training objectives. We introduce typical and representative methods in each category and its sub-categories. Moreover, we discuss the pros and cons of each category and their unique application scenarios. Finally, fundamental problems and future directions of self-supervised learning are listed.

ACKNOWLEDGMENTS

The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).

REFERENCES

[1] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
[2] F. Alam, S. Joty, and M. Imran. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151, 2018.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
[4] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
[5] A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. arXiv preprint arXiv:1911.10470, 2019.
[6] P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In NIPS, pages 15509–15519, 2019.
[7] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang. Simgnn: A neural network approach to fast graph similarity computation. In WSDM, pages 384–392, 2019.
[8] D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.
[9] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[11] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[12] L. Cai and W. Y. Wang. Kbgan: Adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071, 2017.
[13] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.
[14] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[15] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[16] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[17] T. Chen, Y. Sun, Y. Shi, and L. Hong. On sampling strategies for neural network-based collaborative filtering. In SIGKDD, pages 767–776, 2017.
[18] X. Chen, H. Fan, R. Girshick, and K. He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[19] X. Chen and K. He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
[20] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. In NIPS, pages 4088–4098, 2017.
[21] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
[22] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
[23] Q. Dai, Q. Li, J. Tang, and D. Wang. Adversarial network embedding. In AAAI, 2018.
[24] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019.
[25] V. R. de Sa. Learning classification with unlabeled data. In NIPS, pages 112–119, 1994.
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[28] M. Ding, J. Tang, and J. Zhang. Semi-supervised learning on graphs with generative adversarial nets. In Proceedings of the 27th ACM CIKM, pages 913–922, 2018.
[29] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang. Cognitive graph for multi-hop reading comprehension at scale. arXiv preprint arXiv:1905.05460, 2019.
[30] L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[31] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
[32] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE ICCV, pages 1422–1430, 2015.
[33] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[34] J. Donahue and K. Simonyan. Large scale adversarial representation learning. In NIPS, pages 10541–10551, 2019.
[35] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec. Learning structural node embeddings via diffusion wavelets. In SIGKDD, pages 1320–1329, 2018.
[36] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[37] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[38] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[39] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[40] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[41] P. Goyal, M. Caron, B. Lefaudeux, M. Xu, P. Wang, V. Pai, M. Singh, V. Liptchinsky, I. Misra, A. Joulin, and P. Bojanowski. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021.
[42] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[43] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855–864, 2016.
[44] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
[45] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
[46] K. Hassani and A. H. Khasahmadi. Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582, 2020.
[47] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[48] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[49] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
[50] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In ICML, pages 2722–2730, 2019.
[51] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. In ICLR, 2019.
[52] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun. Gpt-gnn: Generative pre-training of graph neural networks. arXiv preprint arXiv:2006.15437, 2020.
[53] Z. Hu, Y. Dong, K. Wang, and Y. Sun. Heterogeneous graph transformer. arXiv preprint arXiv:2003.01332, 2020.
[54] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
[55] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):1–14, 2017.
[56] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
[57] L. Jing and Y. Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019.
[58] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
[59] V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
[60] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, pages 4401–4410, 2019.
[61] D. Kim, D. Cho, D. Yoo, and I. S. Kweon. Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 793–802. IEEE, 2018.
[62] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NIPS, pages 10215–10224, 2018.
[63] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[64] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[65] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[66] L. Kong, C. d. M. d’Autume, W. Ling, L. Yu, Z. Dai, and D. Yogatama. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350, 2019.
[67] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[68] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[69] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, pages 577–593. Springer, 2016.
[70] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, pages 6874–6883, 2017.
[71] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[72] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
[73] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H. Yang. Unsupervised visual representation learning by graph-based consistent constraints. In ECCV, pages 678–694. Springer, 2016.
[74] B. Liu. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167, 2012.
[75] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[76] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[77] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[78] M. Mathieu. Masked autoencoder for distribution estimation. 2015.
[79] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[80] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS’13, pages 3111–3119, 2013.
[81] I. Misra and L. van der Maaten. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.
[82] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell. Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922, 2020.
[83] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[84] A. Newell and J. Deng. How useful is self-supervised pretraining for visual tasks? In CVPR, pages 7345–7354, 2020.
[85] A. Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
[86] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84. Springer, 2016.
[87] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, pages 9359–9367, 2018.
[88] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[89] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[90] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng. Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604, 2020.
[91] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng. Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604, 2020.
[92] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In SIGKDD, pages 701–710, 2014.
[93] M. Popova, M. Shvets, J. Oliva, and O. Isayev. Molecularrnn: Generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372, 2019.
[94] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang. Gcc: Graph contrastive coding for graph neural network pre-training. arXiv preprint arXiv:2006.09963, 2020.
[95] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang. Deepinf: Social influence prediction with deep learning. In KDD’18, pages 2110–2119. ACM, 2018.
[96] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang. Pre-trained models for natural language processing: A survey. arXiv preprint arXiv:2003.08271, 2020.
[97] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[98] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training.
[99] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[100] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[101] A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse high-fidelity images with vq-vae-2. In NIPS, pages 14837–14847, 2019.
[102] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
[103] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo. struc2vec: Learning node representations from structural identity. In SIGKDD, pages 385–394, 2017.
[104] N. Sarafianos, X. Xu, and I. A. Kakadiaris. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE ICCV, pages 5814–5824, 2019.
[105] J. Shen, Y. Qu, W. Zhang, and Y. Yu. Adversarial representation learning for domain adaptation. stat, 1050:5, 2017.
[106] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, pages 6830–6841, 2017.
[107] C. Shi, M. Xu, Z. Zhu, W. Zhang, M. Zhang, and J. Tang. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
[108] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang. An overview of microsoft academic service (mas) and applications. In WWW’15, pages 243–246, 2015.
[109] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
[110] F.-Y. Sun, J. Hoffmann, and J. Tang. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.
[111] F.-Y. Sun, M. Qu, J. Hoffmann, C.-W. Huang, and J. Tang. vgraph: A generative model for joint community detection and node representation learning. In NIPS, pages 512–522, 2019.
[112] K. Sun, Z. Lin, and Z. Zhu. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5892–5899, 2020.
[113] K. Sun, Z. Zhu, and Z. Lin. Multi-stage self-supervised learning for graph convolutional networks. arXiv preprint arXiv:1902.11038, 2019.
[114] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
[115] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In WWW’15, pages 1067–1077, 2015.
[116] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. Arnetminer: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 990–998, 2008.
[117] W. L. Taylor. “Cloze procedure”: A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[118] Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[119] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.
[120] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
[121] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, pages 125–125.
[122] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In NIPS, pages 4790–4798, 2016.
[123] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In NIPS, pages 6306–6315, 2017.
[124] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, pages 1747–1756, 2016.
[125] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[126] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[127] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[128] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo. Graphgan: Graph representation learning with generative adversarial nets. In AAAI, 2018.
[129] P. Wang, S. Li, and R. Pan. Incorporating gan for negative sampling in knowledge representation learning. In AAAI, 2018.
[130] Z. Wang, Q. She, and T. E. Ward. Generative adversarial networks: A survey and taxonomy. arXiv preprint arXiv:1906.01529, 2019.
[131] C. Wei, L. Xie, X. Ren, Y. Xia, C. Su, J. Liu, Q. Tian, and A. L. Yuille. Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning. In CVPR, pages 1910–1919, 2019.
[132] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
[133] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In CVPR, pages 10687–10698, 2020.
[134] W. Xiong, J. Du, W. Y. Wang, and V. Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637, 2019.
[135] K. Xu, J. Li, M. Zhang, S. S. Du, K.-i. Kawarabayashi, and S. Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. arXiv preprint arXiv:2009.11848, 2020.
[136] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan. Clusterfit: Improving generalization of visual representations. arXiv preprint arXiv:1912.03330, 2019.
[137] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, pages 5147–5156, 2016.
[138] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. In NIPS, pages 5754–5764, 2019.
[139] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
[140] J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In NIPS, pages 6410–6421, 2018.
[141] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In ICML, pages 5708–5717, 2018.
[142] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with augmentations. arXiv preprint arXiv:2010.13902, 2020.
[143] Y. You, T. Chen, Z. Wang, and Y. Shen. When does self-supervision help graph convolutional networks? arXiv preprint arXiv:2006.09136, 2020.
[144] F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, B. Shao, R. Li, and K. Wang. Oag: Toward linking large-scale heterogeneous entity graphs. In KDD’19, pages 2585–2595, 2019.
[145] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding. Prone: Fast and scalable network representation learning. In IJCAI, pages 4278–4284, 2019.
[146] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI, 2018.
[147] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, pages 649–666. Springer, 2016.
[148] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, pages 1058–1067, 2017.
[149] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu. Ernie: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.
[150] D. Zhu, P. Cui, D. Wang, and W. Zhu. Deep variational network embedding in wasserstein space. In SIGKDD, pages 2827–2836, 2018.
[151] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NIPS, pages 465–476, 2017.
[152] C. Zhuang, A. L. Zhai, and D. Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE ICCV, pages 6002–6012, 2019.
[153] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.
[154] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Xiao Liu is a senior undergraduate student in the Department of Computer Science and Technology, Tsinghua University. His main research interests include data mining, machine learning, and knowledge graphs. He has published a paper at KDD.