
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2021.3090866, IEEE Transactions on Knowledge and Data Engineering.

Self-supervised Learning: Generative or Contrastive


Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, Jie Tang*, IEEE Fellow

Abstract—Deep supervised learning has achieved great success in the last decade. However, its heavy dependence on manual labels and its vulnerability to attacks have driven people to explore other paradigms. As an alternative, self-supervised learning (SSL) has attracted many researchers for its soaring performance on representation learning over the last several years. Self-supervised representation learning leverages the input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we take a look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). We further collect related theoretical analyses of self-supervised learning to provide deeper insight into why self-supervised learning works. Finally, we briefly discuss open problems and future directions for self-supervised learning. An outline slide for the survey is provided.1

Index Terms—Self-supervised Learning, Generative Model, Contrastive Learning, Deep Learning

1 INTRODUCTION

Deep neural networks [71] have shown outstanding performance on various machine learning tasks, especially on supervised learning in computer vision (image classification [26], [48], [54], semantic segmentation [39], [76]), natural language processing (pre-trained language models [27], [68], [75], [138], sentiment analysis [74], question answering [5], [29], [100], [139], etc.) and graph learning (node classification [53], [64], [95], [126], graph classification [7], [110], [146], etc.). Generally, supervised learning is trained on a specific task with a large labeled dataset, which is randomly divided into training, validation and test sets.

However, supervised learning is meeting its bottleneck. It relies heavily on expensive manual labeling and suffers from generalization error, spurious correlations, and adversarial attacks. We expect neural networks to learn more with fewer labels, fewer samples, and fewer trials. As a promising alternative, self-supervised learning has drawn massive attention for its data efficiency and generalization ability, and many state-of-the-art models have been following this paradigm. This survey takes a comprehensive look at recently developed self-supervised learning models and discusses their theoretical soundness, including frameworks such as Pre-trained Language Models (PTM), Generative Adversarial Networks (GAN), autoencoders and their extensions, Deep InfoMax, and Contrastive Coding. An outline slide is also provided.1

Fig. 1: An illustration to distinguish the supervised, unsupervised and self-supervised learning frameworks. In self-supervised learning, the "related information" could be another modality, parts of inputs, or another form of the inputs. Repainted from [25].

The term "self-supervised learning" was first introduced in robotics, where training data is automatically labeled by leveraging the relations between different input sensor signals. Afterwards, the machine learning community further developed the idea. In an invited speech at AAAI 2020, the Turing Award winner Yann LeCun described self-supervised learning as "the machine predicts any parts of its input for any observed part".2 Combining self-supervised learning's traditional definition and LeCun's definition, we can further summarize its features as:

• Obtain "labels" from the data itself by using a "semi-automatic" process.
• Predict part of the data from other parts.

Specifically, the "other part" could be incomplete, transformed, distorted, or corrupted (i.e., produced by data augmentation techniques). In other words, the machine learns to "recover" the whole, or parts of, or merely some features of its original input.

1. Slides at https://www.aminer.cn/pub/5ee8986f91e011e66831c59b/
2. https://aaai.org/Conferences/AAAI-20/invited-speakers/

• Xiao Liu, Fanjin Zhang, and Zhenyu Hou are with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. E-mail: [email protected], [email protected], [email protected]
• Jie Tang is with the Department of Computer Science and Technology, Tsinghua University, and Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China, 100084. E-mail: [email protected], corresponding author
• Li Mian is with the Beijing Institute of Technology, Beijing, China. Email: [email protected]
• Zhaoyu Wang is with the Anhui University, Anhui, China. Email: [email protected]
• Jing Zhang is with the Renmin University of China, Beijing, China. Email: [email protected]

Fig. 2: Number of publications and citations on self-supervised learning during 2012-2020, from Microsoft Academic [108], [144]. Self-supervised learning has been drawing tremendous attention in recent years.

Fig. 3: Categorization of self-supervised learning (SSL): Generative, Contrastive and Generative-Contrastive (Adversarial).

People are often confused by the concepts of unsupervised learning and self-supervised learning. Self-supervised learning can be viewed as a branch of unsupervised learning since no manual label is involved. However, narrowly speaking, unsupervised learning concentrates on detecting specific data patterns, such as clustering, community discovery, or anomaly detection, while self-supervised learning aims at recovering, which is still in the paradigm of supervised settings. Figure 1 provides a vivid explanation of the differences between them.

There exist several comprehensive reviews related to Pre-trained Language Models [96], Generative Adversarial Networks [130], autoencoders, and contrastive learning for visual representation [57]. However, none of them concentrates on the inspiring idea of self-supervised learning itself. In this work, we collect studies from natural language processing, computer vision, and graph learning in recent years to present an up-to-date and comprehensive retrospective on the frontier of self-supervised learning. To sum up, our contributions are:

• We provide a detailed and up-to-date review of self-supervised learning for representation. We introduce the background knowledge, models with variants, and important frameworks. One can easily grasp the frontier ideas of self-supervised learning.
• We categorize self-supervised learning models into generative, contrastive, and generative-contrastive (adversarial), with particular genres within each one. We demonstrate the pros and cons of each category.
• We identify several open problems in this field, analyze the limitations and boundaries, and discuss the future directions for self-supervised representation learning.

We organize the survey as follows. In Section 2, we introduce the motivation of self-supervised learning. We also present our categorization of self-supervised learning and a conceptual comparison between the categories. From Section 3 to Section 5, we introduce the empirical self-supervised learning methods utilizing generative, contrastive, and generative-contrastive objectives. Finally, in Sections 6 and 7, we discuss the open problems, future directions and our conclusions.

2 MOTIVATION OF SELF-SUPERVISED LEARNING

It is universally acknowledged that deep learning algorithms are data-hungry. Compared to traditional feature-based methods, deep learning usually follows the so-called "end-to-end" fashion (raw data in, prediction out). It makes very few prior assumptions, which leads to over-fitting and biases in scenarios with little supervised data. Literature has shown that simple multi-layer perceptrons have very poor generalization ability (they always assume a linear relationship for out-of-distribution (OOD) samples) [135], which results in over-confident (and wrong) predictions.

To conquer the fundamental OOD and generalization problem, while numerous works focus on designing new architectures for neural networks, another simple yet effective solution is to enlarge the training dataset to make as many samples "in-distribution" as possible. However, despite the massive unlabeled web data available in this big data era, high-quality data with human labeling can be costly. For example, Scale.ai3, a data labeling company, charges $6.4 per image for image segmentation labeling. An image segmentation dataset containing 10k+ high-quality samples could therefore cost up to a million dollars.

3. https://scale.com/pricing

The most crucial point for self-supervised learning's success is that it figures out a way to leverage the tremendous amounts of unlabeled data that become available in the big data era. It is time for deep learning algorithms to get rid of human supervision and turn back to the data's self-supervision. The intuition of self-supervised learning is to leverage the data's inherent co-occurrence relationships as the self-supervision, which could be versatile. For example, in the incomplete sentence "I like ____ apples", a well-trained language model would predict "eating" for the blank (i.e., the famous Cloze Test [117]) because it frequently co-occurs with the context in the corpora. We can summarize the mainstream self-supervision into three general categories (see Fig. 3) with detailed subsidiaries:

• Generative: train an encoder to encode input x into an explicit vector z and a decoder to reconstruct x from z (e.g., the cloze test, graph generation)

• Contrastive: train an encoder to encode input x into an explicit vector z to measure similarity (e.g., mutual information maximization, instance discrimination)
• Generative-Contrastive (Adversarial): train an encoder-decoder to generate fake samples and a discriminator to distinguish them from real samples (e.g., GAN)

Fig. 4: Conceptual comparison between Generative, Contrastive, and Generative-Contrastive methods.

Their main differences lie in model architectures and objectives. A detailed conceptual comparison is shown in Fig. 4. Their architectures can be unified into two general components: the generator and the discriminator, where the generator can be further decomposed into an encoder and a decoder. The differences are:

1) For the latent distribution z: in generative and contrastive methods, z is explicit and is often leveraged by downstream tasks, while in GAN, z is implicitly modeled.
2) For the discriminator: the generative method does not have a discriminator, while GAN and contrastive methods do. The contrastive discriminator has comparatively fewer parameters (e.g., a multi-layer perceptron with 2-3 layers) than GAN's (e.g., a standard ResNet [48]).
3) For objectives: the generative methods use a reconstruction loss, the contrastive ones use a contrastive similarity metric (e.g., InfoNCE), and the generative-contrastive ones leverage distributional divergence as the loss (e.g., JS-divergence, Wasserstein distance).

A properly designed training objective related to downstream tasks can turn our randomly initialized models into excellent pre-trained feature extractors. For example, contrastive learning is found to be useful for almost all visual classification tasks. This is probably because the contrastive objective models the class-invariance between different image instances. The contrastive loss makes images containing the same object class more similar and those containing different classes less similar, which essentially accords with downstream image classification, object detection, and other classification-based tasks. The art of self-supervised learning primarily lies in defining proper objectives for unlabeled data.

3 GENERATIVE SELF-SUPERVISED LEARNING

This section introduces important self-supervised learning methods based on generative models, including auto-regressive (AR) models, flow-based models, auto-encoding (AE) models, and hybrid generative models.

3.1 Auto-regressive (AR) Model

Auto-regressive (AR) models can be viewed as "Bayes net structures" (directed graphical models). The joint distribution can be factorized as a product of conditionals:

\max_\theta \log p_\theta(x) = \max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1})   (1)

where the probability of each variable depends on the previous variables.

In NLP, the objective of auto-regressive language modeling is usually maximizing the likelihood under the forward autoregressive factorization [138]. GPT [98] and GPT-2 [99] use the Transformer decoder architecture [125] for language modeling. Different from GPT, GPT-2 removes the fine-tuning processes for different tasks. To learn unified representations that generalize across different tasks, GPT-2 models p(output|input, task), which means that given different tasks, the same inputs can have different outputs.

Auto-regressive models have also been employed in computer vision, such as PixelRNN [124] and PixelCNN [122]. The general idea is to use auto-regressive methods to model images pixel by pixel. For example, the lower (right) pixels are generated by conditioning on the upper (left) pixels. The pixel distributions of PixelRNN and PixelCNN are modeled by an RNN and a CNN, respectively. For 2D images, auto-regressive models can only factorize probabilities according to specific directions (such as right and down). Therefore, masked filters are employed in the CNN architecture. Furthermore, two convolutional networks are combined to remove the blind spot in images. Based on PixelCNN, WaveNet [121] – a generative model for raw audio – was proposed. To deal with long-range temporal dependencies, the authors develop dilated causal convolutions to improve the receptive field. Moreover, gated residual blocks and skip connections are employed to empower better expressivity.

Auto-regressive models can also be applied to graph-domain problems, such as graph generation. You et al. [141] propose GraphRNN to generate realistic graphs with deep auto-regressive models. They decompose the graph generation process into a sequence generation of nodes and edges conditioned on the graph generated so far. The objective of GraphRNN is defined as the likelihood of the observed graph generation sequences. GraphRNN can be viewed as a hierarchical model, where a graph-level RNN maintains the state of the graph and generates new nodes, while an edge-level RNN generates new edges based on the current graph state. After that, MRNN [93] and GCPN [140] were proposed as auto-regressive approaches. MRNN and GCPN both use a reinforcement learning framework to generate molecule graphs through optimizing domain-specific rewards. However, MRNN mainly uses RNN-based networks for state representations, while GCPN employs GCN-based encoder networks.

The advantage of auto-regressive models is that they can model the context dependency well. However, one shortcoming of the AR model is that the token at each position can only access its context from one direction.
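To make the objective in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of next-word prediction with teacher forcing; the toy model `TinyARLM`, its sizes, and the random batch are illustrative assumptions rather than any specific model discussed above.

```python
import torch
import torch.nn as nn

# Sketch of the AR objective in Eq. (1): maximize sum_t log p(x_t | x_{<t}).
# TinyARLM is a hypothetical toy decoder; any causal model (e.g., a Transformer
# decoder as in GPT) could be substituted.
class TinyARLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # left-to-right, hence causal
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                     # logits: (batch, seq_len, vocab)

def ar_loss(model, tokens):
    # Predict token t+1 from tokens <= t (teacher forcing); this is the
    # negative log-likelihood of the forward factorization.
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )

model = TinyARLM()
batch = torch.randint(0, 1000, (8, 32))         # fake token ids for illustration
loss = ar_loss(model, batch)
loss.backward()
```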

Model | FOS | Type | Generator | Self-supervision | Pretext Task | Hard NS | Hard PS | NS strategy
GPT/GPT-2 [98], [99] | NLP | G | AR | Following words | Next word prediction | - | - | -
PixelCNN [122], [124] | CV | G | AR | Following pixels | Next pixel prediction | - | - | -
NICE [30] | CV | G | Flow-based | Whole image | Image reconstruction | - | - | -
RealNVP [31] | CV | G | Flow-based | Whole image | Image reconstruction | - | - | -
Glow [62] | CV | G | Flow-based | Whole image | Image reconstruction | - | - | -
word2vec [79], [80] | NLP | G | AE | Context words | CBOW & SkipGram | × | × | End-to-end
FastText [10] | NLP | G | AE | Context words | CBOW | × | × | End-to-end
DeepWalk-based [43], [92], [115] | Graph | G | AE | Graph edges | Link prediction | × | × | End-to-end
VGAE [65] | Graph | G | AE | Graph edges | Link prediction | × | × | End-to-end
BERT [27] | NLP | G | AE | Masked words, Sentence topic | Masked language model, Next sentence prediction | - | - | -
SpanBERT [58] | NLP | G | AE | Masked words | Masked language model | - | - | -
ALBERT [68] | NLP | G | AE | Masked words, Sentence order | Masked language model, Sentence order prediction | - | - | -
ERNIE [114], [149] | NLP | G | AE | Masked words, Sentence topic | Masked language model, Next sentence prediction | - | - | -
GPT-GNN [52] | Graph | G | AE | Attribute & Edge | Masked graph generation | - | - | -
VQ-VAE 2 [101] | CV | G | AE | Whole image | Image reconstruction | - | - | -
XLNet [138] | NLP | G | AE+AR | Masked words | Permutation language model | - | - | -
GraphAF [107] | Graph | G | Flow+AR | Attribute & Edge | Sequential graph generation | - | - | -
RelativePosition [32] | CV | C | - | Spatial relations (Context-Instance) | Relative position prediction | - | - | -
CDJP [61] | CV | C | - | Spatial relations (Context-Instance) | Jigsaw + Inpainting + Colorization | × | × | End-to-end
PIRL [81] | CV | C | - | Spatial relations (Context-Instance) | Jigsaw | × | ✓ | Memory bank
RotNet [38] | CV | C | - | Spatial relations (Context-Instance) | Rotation prediction | - | - | -
Deep InfoMax [49] | CV | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end
AMDIM [6] | CV | C | - | Belonging (Context-Instance) | MI maximization | × | ✓ | End-to-end
CPC [88] | CV | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end
InfoWord [66] | NLP | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end
DGI [127] | Graph | C | - | Belonging (Context-Instance) | MI maximization | ✓ | × | End-to-end
InfoGraph [110] | Graph | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end (batch-wise)
CMC-Graph [46] | Graph | C | - | Belonging (Context-Instance) | MI maximization | × | ✓ | End-to-end
S2GRL [90] | Graph | C | - | Belonging (Context-Instance) | MI maximization | × | × | End-to-end
Pre-trained GNN [51] | Graph | C | - | Belonging, Node attributes | MI maximization, Masked attribute prediction | × | × | End-to-end
DeepCluster [13] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | -
Local Aggregation [152] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | -
ClusterFit [136] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | -
SwAV [14] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | ✓ | End-to-end
SEER [41] | CV | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | ✓ | End-to-end
M3S [113] | Graph | C | - | Similarity (Instance-Instance) | Cluster discrimination | - | - | -
InstDisc [132] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | × | Memory bank
CMC [118] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end
MoCo [47] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | × | Momentum
MoCo v2 [18] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | Momentum
SimCLR [15] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end
InfoMin [119] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end
BYOL [42] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | no NS | ✓ | End-to-end
ReLIC [82] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end
SimSiam [19] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | no NS | ✓ | End-to-end
SimCLR v2 (semi) [16] | CV | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end
GCC [94] | Graph | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | Momentum
GraphCL [142] | Graph | C | - | Identity (Instance-Instance) | Instance discrimination | × | ✓ | End-to-end
GAN [40] | CV | G-C | AE | Whole image | Image reconstruction | - | - | -
Adversarial AE [77] | CV | G-C | AE | Whole image | Image reconstruction | - | - | -
BiGAN/ALI [33], [36] | CV | G-C | AE | Whole image | Image reconstruction | - | - | -
BigBiGAN [34] | CV | G-C | AE | Whole image | Image reconstruction | - | - | -
Colorization [69] | CV | G-C | AE | Image color | Colorization | - | - | -
Inpainting [89] | CV | G-C | AE | Parts of images | Inpainting | - | - | -
Super-resolution [72] | CV | G-C | AE | Details of images | Super-resolution | - | - | -
ELECTRA [21] | NLP | G-C | AE | Masked words | Replaced token detection | ✓ | × | End-to-end
WKLM [134] | NLP | G-C | AE | Masked entities | Replaced entity detection | ✓ | × | End-to-end
ANE [23] | Graph | G-C | AE | Graph edges | Link prediction | - | - | -
GraphGAN [128] | Graph | G-C | AE | Graph edges | Link prediction | - | - | -
GraphSGAN [28] | Graph | G-C | AE | Graph nodes | Node classification | - | - | -

TABLE 1: An overview of recent self-supervised representation learning. For the acronyms used, "FOS" refers to fields of study; "NS" refers to negative samples; "PS" refers to positive samples; "MI" refers to mutual information. For the letters in "Type": G = Generative; C = Contrastive; G-C = Generative-Contrastive (Adversarial). For the symbols in "Hard NS" and "Hard PS", "-" means not applicable, "×" means not adopted, "✓" means adopted; "no NS" specifically means not using negative samples in instance-instance contrast.

3.2 Flow-based Model

The goal of flow-based models is to estimate complex high-dimensional densities p(x) from data. Intuitively, directly formalizing the densities is difficult. To obtain a complicated density, we hope to build it "step by step" by stacking a series of transformation functions that describe different data characteristics respectively. Generally, flow-based models first define a latent variable z which follows a known distribution p_Z(z). Then they define z = f_θ(x), where f_θ is an invertible and differentiable function. The goal is to learn the transformation between x and z so that the density of x can be depicted. According to the change-of-variables rule, p_θ(x)dx = p(z)dz. Therefore, the densities of x and z satisfy:

p_\theta(x) = p(f_\theta(x)) \left| \frac{\partial f_\theta(x)}{\partial x} \right|   (2)

and the objective is to maximize the likelihood:

\max_\theta \sum_i \log p_\theta(x^{(i)}) = \max_\theta \sum_i \left[ \log p_Z(f_\theta(x^{(i)})) + \log \left| \frac{\partial f_\theta}{\partial x}(x^{(i)}) \right| \right]   (3)

The advantage of flow-based models is that the mapping between x and z is invertible. However, it also requires that x and z have the same dimension. f_θ needs to be carefully designed, since it should be invertible and the Jacobian determinant in Eq. (2) should be easy to calculate. NICE [30] and RealNVP [31] design an affine coupling layer to parameterize f_θ. The core idea is to split x into two blocks (x1, x2) and apply a transformation from (x1, x2) to (z1, z2) in an auto-regressive manner, that is, z1 = x1 and z2 = x2 + m(x1). More recently, Glow [62] was proposed; it introduces invertible 1 × 1 convolutions and simplifies RealNVP.

3.3 Auto-encoding (AE) Model

The auto-encoding model's goal is to reconstruct (part of) the inputs from (corrupted) inputs. Due to its flexibility, the AE model is probably the most popular generative model, with many variants.

3.3.1 Basic AE Model

The autoencoder (AE) was first introduced in [8] for pre-training artificial neural networks. Before the autoencoder, the Restricted Boltzmann Machine (RBM) [109] could also be viewed as a special "autoencoder". RBM is an undirected graphical model containing only two layers: the visible layer and the hidden layer. The objective of RBM is to minimize the difference between the marginal distributions of the model and the data. In contrast, an autoencoder can be regarded as a directed graphical model, and it can be trained more efficiently. The autoencoder is typically used for dimensionality reduction. Generally, the autoencoder is a feed-forward neural network trained to reproduce its input at the output layer. The AE is comprised of an encoder network h = f_enc(x) and a decoder network x' = f_dec(h). The objective of the AE is to make x and x' as similar as possible (for example, through a mean-square error). It can be proved that a linear autoencoder corresponds to the PCA method. Sometimes the number of hidden units is greater than the number of input units, and some interesting structures can be discovered by imposing sparsity constraints on the hidden units [85].

3.3.2 Context Prediction Model (CPM)

The idea of the Context Prediction Model (CPM) is to predict contextual information based on inputs.

In NLP, when it comes to self-supervised learning on word embeddings, CBOW and Skip-Gram [80] are pioneering works. CBOW aims to predict the input tokens based on context tokens, whereas Skip-Gram aims to predict context tokens based on input tokens. Usually, negative sampling is employed to ensure computational efficiency and scalability. Following the CBOW architecture, FastText [10] is proposed, utilizing subword information.

Inspired by the progress of word embedding models in NLP, many network embedding models are proposed based on a similar context prediction objective. DeepWalk [92] samples truncated random walks to learn latent node embeddings based on the Skip-Gram model. It treats random walks as the equivalent of sentences. However, another network embedding approach, LINE [115], aims to generate neighbors rather than nodes on a path based on the current nodes:

O = -\sum_{(i,j)\in E} w_{ij} \log p(v_j \mid v_i)   (4)

where E denotes the edge set, v denotes a node, and w_ij represents the weight of edge (v_i, v_j). LINE also uses negative sampling to sample multiple negative edges to approximate the objective.

3.3.3 Denoising AE Model

The intuition of denoising autoencoder models is that the representation should be robust to the introduction of noise. The masked language model (MLM), one of the most successful architectures in natural language processing, can be regarded as a denoising AE model. To model a text sequence, the masked language model randomly masks some of the tokens from the input and then predicts them based on their context information, which is similar to the Cloze task [117]. BERT [27] is the most representative work in this field. Specifically, in BERT, a unique token [MASK] is introduced in the training process to mask some tokens. However, one shortcoming of this method is that there are no input [MASK] tokens for downstream tasks. To mitigate this, the authors do not always replace the predicted tokens with [MASK] in training. Instead, they replace them with the original words or with random words with a small probability.

Following BERT, many extensions of MLM emerge. SpanBERT [58] chooses to mask continuous random spans rather than the random tokens adopted by BERT. Moreover, it trains the span boundary representations to predict the masked spans, inspired by ideas in coreference resolution. ERNIE (Baidu) [114] masks entities or phrases to learn entity-level and phrase-level knowledge, which obtains good results in Chinese natural language processing tasks. ERNIE (Tsinghua) [149] further integrates knowledge (entities and relations) from knowledge graphs into language models.

Compared with the AR model, in denoising AE for language modeling, the predicted tokens have access to contextual information from both sides. However, the fact that MLM assumes the predicted tokens are independent given the unmasked tokens (which does not hold in reality) has long been considered its inherent drawback.
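To illustrate the corruption step described above, here is a minimal sketch of BERT-style token masking with the 80%/10%/10% replacement heuristic; the toy vocabulary, the sample sentence, and the helper name `mask_tokens` are hypothetical, not taken from any implementation cited in this survey.

```python
import random

MASK, VOCAB = "[MASK]", ["apple", "banana", "orange", "pear", "grape"]

def mask_tokens(tokens, mask_prob=0.15):
    """Pick roughly 15% of positions; replace 80% of them with [MASK],
    10% with a random word, and keep 10% unchanged. The labels record the
    original tokens only at the selected positions."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(VOCAB)  # 10%: replace with a random word
            # remaining 10%: keep the original token
    return corrupted, labels

print(mask_tokens("i like eating fresh apples every day".split()))
```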

In graph learning, Hu et al. [52] propose GPT-GNN, a generative pre-training method for graph neural networks. It also leverages graph masking techniques and then asks the graph neural network to generate the masked edges and attributes. GPT-GNN's wide range of experiments on OAG [108], [116], [144], the largest public academic graph with 100 million nodes and 2 billion edges, shows impressive improvements on various graph learning tasks.

3.3.4 Variational AE Model

The variational auto-encoding model assumes that data are generated from an underlying latent (unobserved) representation. The posterior distribution over a set of unobserved variables Z = {z1, z2, ..., zn} given some data X is approximated by a variational distribution q(z|x) ≈ p(z|x). In variational inference, the evidence lower bound (ELBO) on the log-likelihood of the data is maximized during training:

\log p(x) \ge -D_{KL}(q(z|x)\,\|\,p(z)) + \mathbb{E}_{q(z|x)}[\log p(x|z)]   (5)

where p(x) is the evidence probability, p(z) is the prior and p(x|z) is the likelihood. The right-hand side of the above equation is called the ELBO. From the auto-encoding perspective, the first term of the ELBO is a regularizer forcing the posterior to approximate the prior. The second term is the likelihood of reconstructing the original input data based on the latent variables.

Variational Autoencoders (VAE) [63] are one important example where variational inference is utilized. VAE assumes that the prior p(z) and the approximate posterior q(z|x) both follow Gaussian distributions. Specifically, let p(z) ∼ N(0, 1). Furthermore, the reparameterization trick is utilized for modeling the approximate posterior q(z|x): assume z ∼ N(µ, σ²); then z = µ + σε, where ε ∼ N(0, 1). Both µ and σ are parameterized by neural networks. Based on the calculated latent variable z, a decoder network is utilized to reconstruct the input data.

Fig. 5: Architecture of VQ-VAE [123]. Compared to VAE, the original hidden distribution is replaced with a quantized vector dictionary. In addition, the prior distribution is replaced with a pre-trained PixelCNN that models the hierarchical features of images. Taken from [123].

Recently, a novel and powerful variational AE model called VQ-VAE [123] was proposed. VQ-VAE aims to learn discrete latent variables, motivated by the fact that many modalities are inherently discrete, such as language, speech, and images. VQ-VAE relies on vector quantization (VQ) to learn the posterior distribution of discrete latent variables. The discrete latent variables are calculated by nearest-neighbor lookup in a shared, learnable embedding table. In training, the gradients are approximated through the straight-through estimator [9] as

L(x, D(e)) = \|x - D(e)\|_2^2 + \|\mathrm{sg}[E(x)] - e\|_2^2 + \beta\|\mathrm{sg}[e] - E(x)\|_2^2   (6)

where e refers to the codebook, the operator sg refers to a stop-gradient operation that blocks gradients from flowing into its argument, and β is a hyperparameter which controls the reluctance to change the code corresponding to the encoder output.

More recently, researchers proposed VQ-VAE-2 [101], which can generate versatile high-fidelity images that rival BigGAN [11], the state-of-the-art GAN model, on ImageNet [26]. First, the authors enlarge the scale and enhance the autoregressive priors with a powerful PixelCNN [122] prior. Additionally, they adopt a multi-scale hierarchical organization of VQ-VAE, which enables learning the local and global information of images separately. Nowadays, VAE and its variants have been widely used in the computer vision area, for tasks such as image representation learning, image generation, and video generation.

Variational auto-encoding models have also been employed in node representation learning on graphs. For example, the variational graph auto-encoder (VGAE) [65] uses the same variational inference technique as VAE with graph convolutional networks (GCN) [64] as the encoder. Due to the uniqueness of graph-structured data, the objective of VGAE is to reconstruct the adjacency matrix of the graph by measuring node proximity. Zhu et al. [150] propose DVNE, a deep variational network embedding model in Wasserstein space. It learns Gaussian node embeddings to model the uncertainty of nodes. The 2-Wasserstein distance is used to measure the similarity between the distributions for its effectiveness in preserving network transitivity. vGraph [111] can perform node representation learning and community detection collaboratively through a generative variational inference framework. It assumes that each node can be generated from a mixture of communities, and each community is defined as a multinomial distribution over nodes.

3.4 Hybrid Generative Models

3.4.1 Combining AR and AE Models

Some researchers propose to combine the advantages of both AR and AE. MADE [78] makes a simple modification to the autoencoder: it masks the autoencoder's parameters to respect auto-regressive constraints. Specifically, in the original autoencoder, neurons between two adjacent layers are fully connected through MLPs. In MADE, however, some connections between adjacent layers are masked to ensure that each input dimension is reconstructed solely from the dimensions preceding it. MADE can be easily parallelized on conditional computations, and it can get direct and cheap estimates of high-dimensional joint probabilities by combining the AE and AR models.

In NLP, the Permutation Language Model (PLM) [138] is a representative model that combines the advantages of the auto-regressive model and the auto-encoding model. XLNet [138], which introduces PLM, is a generalized auto-regressive pretraining method. XLNet enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. To formalize the idea,
let Z_T denote the set of all possible permutations of the length-T index sequence [1, 2, ..., T]; the objective of PLM can be expressed as follows:

\max_\theta \; \mathbb{E}_{z\sim Z_T}\left[\sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}})\right]   (7)

In practice, for each text sequence, different factorization orders are sampled. Therefore, each token can see its contextual information from both sides. Based on the permuted order, XLNet also conducts a reparameterization with positions to let the model know which position needs to be predicted. Then a special two-stream self-attention is introduced for target-aware prediction.

Fig. 6: Illustration of the permutation language modeling [138] objective for predicting x3 given the same input sequence x but with different factorization orders. Adapted from [138].

Furthermore, different from BERT and inspired by the latest advancements in the AR model, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL [24] into pre-training, which can model long-range dependency better than the Transformer [125].

3.4.2 Combining AE and Flow-based Models

In the graph domain, GraphAF [107] is a flow-based auto-regressive model for molecule graph generation. It can generate molecules in an iterative process and also calculate the exact likelihood in parallel. GraphAF formalizes molecule generation as a sequential decision process. It incorporates detailed domain knowledge into the reward design, such as valency checks. Inspired by the recent progress of flow-based models, it defines an invertible transformation from a base distribution (e.g., a multivariate Gaussian) to a molecular graph structure. Additionally, a dequantization technique [50] is utilized to convert discrete data (including node types and edge types) into continuous data.

3.5 Pros and Cons

A reason for generative self-supervised learning's success is its ability to recover the original data distribution without assumptions about downstream tasks, which enables generative models' wide application in both classification and generation. Notably, all the existing generation tasks (including text, image, and audio) rely heavily on generative self-supervised learning. Nevertheless, two shortcomings restrict its performance.

First, despite its central status in generation tasks, generative self-supervised learning has recently been found far less competitive than contrastive self-supervised learning in some classification scenarios, because contrastive learning's goal naturally conforms to the classification objective. Works including MoCo [47], SimCLR [15], BYOL [42] and SwAV [14] have presented overwhelming performance on various CV benchmarks. Nevertheless, in the NLP domain, researchers still depend on generative language models to conduct text classification.

Second, the point-wise nature of the generative objective has some inherent defects. This objective is usually formulated as a maximum likelihood function L_MLE = −∑_x log p(x|c), where x ranges over the samples we hope to model and c is a conditional constraint such as context information. Considering its form, MLE has two fatal problems:

1) Sensitive and conservative distribution. When p(x|c) → 0, L_MLE becomes extremely large, making the generative model overly sensitive to rare samples. This directly leads to a conservative distribution, which has low performance.
2) Low-level abstraction objective. In MLE, the representation distribution is modeled at x's level (i.e., the point-wise level), such as pixels in images, words in texts, and nodes in graphs. However, most classification tasks target high-level abstraction, such as object detection, long paragraph understanding, and molecule classification.

As an opposite approach, generative-contrastive self-supervised learning abandons the point-wise objective. It turns to distributional matching objectives that are more robust and better handle the high-level abstraction challenge in the data manifold.

4 CONTRASTIVE SELF-SUPERVISED LEARNING

From a statistical perspective, machine learning models are categorized into generative and discriminative models. Given the joint distribution P(X, Y) of the input X and target Y, the generative model calculates p(X|Y = y), while the discriminative model tries to model P(Y|X = x). Because most representation learning tasks hope to model relationships between x, for a long time people believed that the generative model was the only choice for representation learning.

However, recent breakthroughs in contrastive learning, such as Deep InfoMax, MoCo and SimCLR, shed light on the potential of discriminative models for representation. Contrastive learning aims to "learn to compare" through a Noise Contrastive Estimation (NCE) [44] objective, formatted as:

\mathcal{L} = \mathbb{E}_{x,x^+,x^-}\left[-\log\frac{e^{f(x)^{\top} f(x^+)}}{e^{f(x)^{\top} f(x^+)} + e^{f(x)^{\top} f(x^-)}}\right]   (8)

where x^+ is similar to x, x^- is dissimilar to x, and f is an encoder (representation function). The similarity measure and encoder may vary from task to task, but the framework remains the same. With more dissimilar pairs involved, we have the InfoNCE [88] loss, formulated as:

\mathcal{L} = \mathbb{E}_{x,x^+,x^k}\left[-\log\frac{e^{f(x)^{\top} f(x^+)}}{e^{f(x)^{\top} f(x^+)} + \sum_{k=1}^{K} e^{f(x)^{\top} f(x^k)}}\right]   (9)

Here we divide recent contrastive learning frameworks into two types: context-instance contrast and instance-instance contrast. Both of them achieve amazing performance in downstream tasks, especially on classification problems under the linear protocol.
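As an illustration of Eq. (9), the following is a minimal PyTorch-style sketch of an InfoNCE loss; the L2 normalization and temperature are common practical choices rather than part of the formula, and all tensor shapes, names, and the random example inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.1):
    """Sketch of the InfoNCE objective in Eq. (9).
    query:     (batch, dim) encoded anchors f(x)
    positive:  (batch, dim) encoded positives f(x+)
    negatives: (batch, K, dim) encoded negatives f(x_k)
    """
    query = F.normalize(query, dim=-1)        # cosine similarity is a common choice,
    positive = F.normalize(positive, dim=-1)  # though Eq. (9) is written with raw dot products
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (query * positive).sum(-1, keepdim=True)        # (batch, 1)
    neg_logits = torch.einsum("bd,bkd->bk", query, negatives)   # (batch, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(len(query), dtype=torch.long)          # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# toy usage with 8 anchors, 16 negatives each, 128-dimensional embeddings
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 16, 128))
```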
Fig. 7: Self-supervised representation learning performance on ImageNet top-1 accuracy in March 2021, under the linear classification protocol. Self-supervised learning's ability at feature extraction is rapidly approaching that of the supervised method (ResNet50). Except for BigBiGAN, all the models above are contrastive self-supervised learning methods.

4.1 Context-Instance Contrast

The context-instance contrast, or so-called global-local contrast, focuses on modeling the belonging relationship between the local feature of a sample and its global context representation. When we learn the representation for a local feature, we hope it is associated with the representation of the global content, such as stripes to tigers, sentences to their paragraph, and nodes to their neighborhoods.

There are two main types of context-instance contrast: Predict Relative Position (PRP) and Maximize Mutual Information (MI). The differences between them are:

• PRP focuses on learning relative positions between local components. The global context serves as an implicit requirement for predicting these relations (for example, understanding what an elephant looks like is critical for predicting the relative position between its head and tail).
• MI focuses on learning the direct belonging relationships between local parts and the global context. The relative positions between local parts are ignored.

4.1.1 Predict Relative Position

Many kinds of data contain rich spatial or sequential relations between their parts. For example, in image data such as Fig. 8, the elephant's head is on the right of its tail. In text data, a sentence like "Nice to meet you." would probably be ahead of "Nice to meet you, too.". Various models regard recognizing relative positions between parts of a sample as the pretext task [57]. It could be predicting the relative positions of two patches from a sample [32], recovering the positions of shuffled segments of an image (solving a jigsaw) [61], [86], [131], or inferring the rotation angle of an image [38]. PRP may also serve as a tool to create hard positive samples. For instance, the jigsaw technique is applied in PIRL [81] to augment the positive sample, but PIRL does not regard solving the jigsaw and recovering the spatial relation as its objective.

Fig. 8: Three typical methods for spatial relation contrast: predict relative position [32], rotation [38] and solve jigsaw [61], [81], [86], [131].

In pre-trained language models, similar ideas such as Next Sentence Prediction (NSP) are also adopted. The NSP loss was initially introduced by BERT [27], where, for a sentence, the model is asked to distinguish the sentence that follows it from a randomly sampled one. However, some later works empirically prove that NSP helps little and can even harm performance, so in RoBERTa [75] the NSP loss is removed.

To replace NSP, ALBERT [68] proposes the Sentence Order Prediction (SOP) task. That is because, in NSP, the negative next sentence is sampled from other passages that may have different topics from the current one, turning NSP into a far easier topic-model problem. In SOP, two sentences that exchange their positions are regarded as a negative sample, making the model concentrate on the coherence of the semantic meaning.

4.1.2 Maximize Mutual Information

This kind of method derives from mutual information (MI) – a fundamental concept in statistics. Mutual information models the association between two variables, and our objective is to maximize it. Generally, this kind of model optimizes

\max_{g_1\in\mathcal{G}_1,\, g_2\in\mathcal{G}_2} I(g_1(x_1), g_2(x_2))   (10)

where g_i is the representation encoder, \mathcal{G}_i is a class of encoders with some constraints, and I(·,·) is a sample-based estimator of the true mutual information. In applications, MI is notorious for its complex computation. A common practice is to instead maximize a lower bound on I with an NCE objective.

Deep InfoMax [49] is the first to explicitly model mutual information through a contrastive learning task, maximizing the MI between a local patch and its global context. In practice, take image classification as an example: we can encode a cat image x into f(x) ∈ R^{M×M×d} and take out a local feature vector v ∈ R^d. To conduct the contrast between instance and context, we need two other things:

• a summary function g: R^{M×M×d} → R^d to generate the context vector s = g(f(x)) ∈ R^d
• another cat image x^- and its context vector s^- = g(f(x^-)).

The contrastive objective is then formulated as

\mathcal{L} = \mathbb{E}_{v,x^-}\left[-\log\frac{e^{v^{\top} s}}{e^{v^{\top} s} + e^{v^{\top} s^-}}\right]   (11)

Deep InfoMax provides us with a new paradigm and boosts the development of self-supervised learning. The first influential follower is Contrastive Predictive Coding
(CPC) [88] for speech recognition. CPC maximizes the association between a segment of audio and its context audio. To improve data efficiency, it takes several negative context vectors at the same time. Later on, CPC has also been applied to image classification.

AMDIM [6] enhances the positive association between a local feature and its context. It randomly samples two different views of an image (truncated, discolored, and so forth) to generate the local feature vector and the context vector, respectively. CMC [118] extends this to several different views of one image and samples another irrelevant image as the negative. However, CMC is fundamentally different from Deep InfoMax and AMDIM because it proposes to measure the instance-instance similarity rather than the context-instance similarity. We will discuss it in the following subsection.

Fig. 9: Two representatives of mutual information's application in contrastive learning. Deep InfoMax (DIM) [49] first encodes an image into feature maps and leverages a read-out function (or so-called summary function) to produce a summary vector. AMDIM [6] enhances DIM by randomly choosing another view of the image to produce the summary vector.

In language pre-training, InfoWord [66] proposes to maximize the mutual information between a global representation of a sentence and the n-grams in it. The context is induced from the sentence with the selected n-grams masked, and the negative contexts are randomly picked out from the corpus.

Fig. 10: Deep Graph InfoMax [127] uses a readout function to generate the summary vector s1 and puts it into a discriminator with node 1's embedding x1 and corrupted embedding x̃1, respectively, to identify which embedding is the real one. The corruption is to shuffle the positions of nodes.

In graph learning, Deep Graph InfoMax (DGI) [127] regards a node's representation as the local feature and the average of randomly sampled 2-hop neighbors as the context. However, it is hard to generate negative contexts on a single graph. To solve this problem, DGI proposes to corrupt the original context by keeping the sub-graph structure and permuting the node features. DGI is followed by many works, such as InfoGraph [110], which targets learning graph-level representations rather than node-level ones, maximizing the mutual information between the graph-level representation and substructures at different levels. As CMC has done to improve Deep InfoMax, the authors of [46] propose a contrastive multi-view representation learning method for graphs. They also discover that graph diffusion is the most effective way to yield augmented positive sample pairs in graph learning.

As an attempt to unify graph pre-training, the authors of [51] systematically analyze the pre-training strategies for graph neural networks along two dimensions: attribute/structural and node-level/graph-level. For structural prediction at the node level, they propose Context Prediction to maximize the MI between the k-hop neighborhood's representations and its context graph. For attributes in the chemical domain, they propose Attribute Mask to predict a node's attribute according to its neighborhood, which is a generative objective similar to token masking in BERT.

S2GRL [91] further separates nodes in the context graph into k-hop context subgraphs and maximizes their MI with the target node, respectively. However, a fundamental problem of graph pre-training is learning inductive biases across graphs, and existing graph pre-training work is only applicable to a specific domain.

4.2 Instance-Instance Contrast

Though MI-based contrastive learning achieves great success, some recent studies [15], [18], [47], [120] cast doubt on the actual improvement brought by MI.

The authors of [120] provide empirical evidence that the success of the models mentioned above is only loosely connected to MI, by showing that an upper-bound MI estimator leads to ill-conditioned representations with lower performance. Instead, more should be attributed to the encoder architecture and a negative sampling strategy related to metric learning. A significant focus in metric learning is to perform hard positive sampling while increasing the negative sampling efficiency; these probably play a more critical role in MI-based models' success.

As an alternative, instance-instance contrastive learning discards MI and directly studies the relationships between different samples' instance-level local representations, as metric learning does. Instance-level representation, rather than context-level representation, is more crucial for a wide range of classification tasks. For example, in an image classified as "dog", while there must be dog instances, some other irrelevant context objects such as grass might appear; but what matters for the image classification is the dog rather than the context. Another example would be sentence emotion classification, which primarily relies on a few important keywords.
deficiency for VAE to generate high-fidelity images, VQ-VAE


proposes quantizing vectors. For the feature matrix encoded
from an image, VQ-VAE substitutes each 1-dimensional
vector in the matrix to the nearest one in an embedding
dictionary. This process is somehow the same as what LA is
doing.
Clustering-based discrimination may also help in the
generalization of other pre-trained models, transferring
models from pretext objectives to downstream tasks better.
Traditional representation learning models have only two
stages: one for pre-training and the other for evaluation. Clus-
Fig. 11: Cluster-based instance-instance contrastive methods: DeepCluster [13] and Local Aggregation [152]. In the embedding space, DeepCluster uses clustering to yield pseudo labels for discrimination and draws similar samples near each other. However, Local Aggregation shows that an egocentric soft-clustering objective is more effective.

supervised learning to produce pseudo labels via cluster-based discrimination and achieve rather good performance on representations. More recently, CMC [118], MoCo [47], SimCLR [15], and BYOL [42] further support the above conclusion by outperforming the context-instance contrastive methods and achieving results competitive with supervised methods under the linear classification protocol. We will start with cluster-based discrimination, which was proposed earlier, and then turn to instance-based discrimination.

4.2.1 Cluster Discrimination

Instance-instance contrast is first studied in clustering-based methods [13], [73], [87], [137], especially DeepCluster [13], which is the first to achieve performance competitive with the supervised model AlexNet [67].

Image classification asks the model to categorize images correctly, and the representations of images in the same category should be similar. Therefore, the motivation is to pull similar images close to each other in the embedding space. In supervised learning, this pulling-near process is accomplished via label supervision; in self-supervised learning, however, we do not have such labels. To solve the label problem, DeepCluster [13] proposes to leverage clustering to yield pseudo labels and asks a discriminator to predict the images' labels. The training can be formulated in two steps. In the first step, DeepCluster uses K-means to cluster the encoded representations and produces a pseudo label for each sample. In the second step, the discriminator predicts whether two samples are from the same cluster and back-propagates to the encoder. These two steps are performed iteratively.
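To make the two-step procedure concrete, the following is a minimal PyTorch-style sketch of one DeepCluster-style iteration. The encoder interface, the number of clusters, the use of scikit-learn's KMeans, and the simplification of the "discriminator" to a linear classifier over cluster assignments are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_iteration(encoder, classifier, loader, k=100, device="cpu"):
    """One cluster-then-predict iteration (assumes a NON-shuffled loader
    over a TensorDataset, so batch i maps to rows [i*B, i*B + B))."""
    # Step 1: encode the whole dataset and run K-means to get pseudo labels.
    encoder.eval()
    with torch.no_grad():
        feats = torch.cat([encoder(x.to(device)).cpu() for (x,) in loader])
    pseudo = torch.tensor(KMeans(n_clusters=k, n_init=10).fit_predict(feats.numpy()))

    # Step 2: train encoder + classifier to predict the pseudo labels.
    encoder.train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        list(encoder.parameters()) + list(classifier.parameters()), lr=0.05)
    bsz = loader.batch_size
    for i, (x,) in enumerate(loader):
        logits = classifier(encoder(x.to(device)))
        target = pseudo[i * bsz: i * bsz + x.size(0)].to(device)
        loss = criterion(logits, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Alternating the two steps for several epochs yields the iterative procedure described above.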
Recently, Local Aggregation (LA) [152] has pushed forward the boundary of cluster-based methods. It points out several drawbacks of DeepCluster and makes the corresponding optimizations. First, in DeepCluster, samples are assigned to mutually exclusive clusters, but LA identifies neighbors separately for each example. Second, DeepCluster optimizes a cross-entropy discriminative loss, while LA employs an objective function that directly optimizes a local soft-clustering metric. These two changes substantially boost the performance of LA representations on downstream tasks.

A similar work to LA would be VQ-VAE [101], [123], which we introduce in Section 3. To overcome the drawbacks of the traditional pipeline, ClusterFit [136] introduces a cluster prediction fine-tuning stage, similar to DeepCluster, between the pre-training and downstream evaluation stages, which improves the representation's performance on downstream classification evaluation.

Despite the previous success of cluster discrimination-based contrastive learning, the two-stage training paradigm is time-consuming and performs poorly compared with later instance discrimination-based methods, including CMC [118], MoCo [47] and SimCLR [15]. These instance discrimination-based methods get rid of the slow clustering stage and introduce efficient data augmentation (i.e., multi-view) strategies to boost performance. In light of these problems, the authors of SwAV [14] bring online clustering ideas and multi-view data augmentation strategies into the cluster discrimination approach. SwAV proposes a swapped prediction contrastive objective to deal with multi-view augmentation. The intuition is that, given some (clustered) prototypes, different views of the same image should be assigned to the same prototypes. SwAV calls these assignments "codes". To accelerate code computation, the authors of SwAV design an online computing strategy. SwAV outperforms instance discrimination-based methods when the model size is small and is more computationally efficient. Based on SwAV, a 1.3-billion-parameter SEER [41] is trained on 1 billion web images collected from Instagram.
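The sketch below illustrates the swapped prediction idea only. It is a simplification under stated assumptions: the targets ("codes") are plain softmax assignments to the prototypes, whereas SwAV actually computes them online with the Sinkhorn-Knopp algorithm under an equal-partition constraint.

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z1, z2, prototypes, temperature=0.1):
    """Simplified SwAV-style swapped prediction for two views.

    z1, z2: L2-normalized embeddings of two views, shape (B, D).
    prototypes: learnable prototype matrix, shape (K, D).
    NOTE: real SwAV derives q1, q2 via Sinkhorn-Knopp; softmax codes here
    are only an illustrative stand-in.
    """
    scores1 = z1 @ prototypes.t()   # (B, K) similarity to prototypes
    scores2 = z2 @ prototypes.t()
    q1 = F.softmax(scores1 / temperature, dim=1).detach()  # codes of view 1
    q2 = F.softmax(scores2 / temperature, dim=1).detach()  # codes of view 2
    p1 = F.log_softmax(scores1 / temperature, dim=1)
    p2 = F.log_softmax(scores2 / temperature, dim=1)
    # Swapped prediction: view 1 predicts view 2's codes and vice versa.
    return -0.5 * ((q2 * p1).sum(dim=1).mean() + (q1 * p2).sum(dim=1).mean())
```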
In graph learning, M3S [113] adopts a similar idea and performs DeepCluster-style self-supervised pre-training for better semi-supervised prediction. Given a little labeled data and a large amount of unlabeled data, at every stage M3S first pre-trains itself to produce pseudo labels on the unlabeled data, as DeepCluster does, and then compares these pseudo labels with those predicted by the model trained with supervision on the labeled data. Only the top-k most confident labels are added to the labeled set for the next stage of semi-supervised training. In [143], this idea is further developed into three pre-training tasks: topology partitioning (similar to spectral clustering), node feature clustering, and graph completion.

4.2.2 Instance Discrimination

The prototype of leveraging instance discrimination as a pretext task is InstDisc [132]. Based on InstDisc, CMC [118] proposes to adopt multiple different views of an image as positive samples and take another image as the negative. CMC draws the multiple views of an image near each other in the embedding space and pushes other samples away. However, it is somewhat constrained by the idea of Deep InfoMax, which only samples one negative sample for each positive one.

In MoCo [47], researchers further develop the idea of leveraging instance discrimination via momentum contrast, which substantially increases the number of negative samples.

For example, given an input image x, our intuition is to learn an instinct representation q = f_q(x) by a query encoder f_q(·) that can distinguish x from any other image. Therefore, for a set of other images x_i, we employ an asynchronously updated key encoder f_k(·) to yield k_+ = f_k(x) and k_i = f_k(x_i), and optimize the following objective

$$\mathcal{L} = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i}/\tau)} \qquad (12)$$

where K is the number of negative samples. This formula is in the form of InfoNCE.

Besides, MoCo presents two other critical ideas for dealing with negative sampling efficiency.

• First, it abandons the traditional end-to-end training framework. It designs momentum contrast learning with two encoders (query and key), which prevents the fluctuation of loss convergence in the beginning period.

• Second, to enlarge the capacity for negative samples, MoCo employs a queue (with K as large as 65536) to save the recently encoded batches as negative samples. This significantly improves the negative sampling efficiency.
significantly improves the negative sampling efficiency. SimCLR also provides some other practical techniques,
including a learnable nonlinear transformation between the
representation and the contrastive loss, more training steps,
and deeper neural networks. [18] conducts ablation studies
to show that techniques in SimCLR can also further improve
MoCo’s performance.
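Eqs. (13)-(14) can be written compactly as a cross-entropy over in-batch similarities. The sketch below assumes the 2N embeddings are arranged so that adjacent rows (2k-1, 2k) are the two views of sample k; the temperature is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss of Eqs. (13)-(14) for a batch of 2N augmented samples.

    z: embeddings of shape (2N, D), with rows (0, 1), (2, 3), ... forming
       the positive pairs.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                        # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude the k = i term
    # Positive index for each row: XOR with 1 swaps within each (even, odd) pair.
    pos = torch.arange(z.size(0), device=z.device) ^ 1
    # Cross-entropy over all 2N rows equals the average in Eq. (14).
    return F.cross_entropy(sim, pos)
```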
More investigation into augmenting positive samples is made in InfoMin [119]. The authors claim that we should select views with less mutual information to obtain better-augmented views in contrastive learning. In the optimal situation, the views should share only the label information. To produce such optimal views, the authors first propose an unsupervised method to minimize the mutual information between views. However, this may result in a loss of the information needed for predicting labels (such as a pure blank view). Therefore, a semi-supervised method is then proposed to find views sharing only label information. This technique leads to an improvement of about 2% over MoCo v2.

A more radical step is made by BYOL [42], which discards negative sampling in self-supervised learning but achieves an even better result than InfoMin. The contrastive learning methods mentioned above learn representations by predicting different views of the same image and cast the prediction problem directly in representation space. However, predicting directly in representation space can lead to collapsed representations, because the multiple views are generally too predictive of each other. Without negative samples, it would be too easy for the neural networks to distinguish those positive views.

Fig. 14: The architecture of BYOL [42]. Note that the online encoder has an additional layer q_θ compared to the target one, which gives the representations some flexibility to be improved during training. Taken from [42].

In BYOL, researchers argue that negative samples may not be necessary in this process. They show that, if we use a fixed, randomly initialized network (which would not collapse because it is not trained) to serve as the key encoder, the representation produced by the query encoder is still improved during training. If we then set the target encoder to be the trained query encoder and iterate this procedure, we progressively achieve better performance. Therefore, BYOL proposes an architecture (Figure 14) with an exponential moving average strategy to update the target encoder, just as MoCo does. Additionally, instead of using a cross-entropy loss, they follow the regression paradigm, in which a mean square error is used:

$$\mathcal{L}^{\mathrm{BYOL}}_{\theta} \triangleq \left\lVert \bar{q}_{\theta}(z_{\theta}) - \bar{z}'_{\xi} \right\rVert_{2}^{2} = 2 - 2 \cdot \frac{\langle q_{\theta}(z_{\theta}),\, z'_{\xi} \rangle}{\lVert q_{\theta}(z_{\theta}) \rVert_{2} \cdot \lVert z'_{\xi} \rVert_{2}} \qquad (15)$$

This not only makes the model perform better in downstream tasks, but also makes it more robust to smaller batch sizes. In MoCo and SimCLR, a drop in batch size results in a significant decline in performance. In BYOL, however, although batch size still matters, it is far less critical. The ablation study shows that a batch size of 512 only causes a drop of 0.3% compared to the standard batch size of 4096, while SimCLR shows a drop of 1.4%.
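A minimal sketch of Eq. (15) and the exponential moving average update is given below; the symmetrization over both view orderings used in practice is omitted, and the decay rate is illustrative.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """Normalized MSE of Eq. (15), i.e., 2 - 2 * cosine similarity.

    online_pred: q_theta(z_theta), output of the online predictor.
    target_proj: z'_xi, projection from the target (key) network.
    """
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj, dim=1)
    return (2 - 2 * (p * z).sum(dim=1)).mean()

@torch.no_grad()
def ema_update(target_net, online_net, decay=0.99):
    """Exponential moving average update of the target encoder."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.data.mul_(decay).add_(o.data, alpha=1 - decay)
```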
In SimSiam [19], researchers further study how necessary negative sampling, and even batch normalization, are in contrastive representation learning. They show that the most critical component in BYOL is the stop-gradient operation, which makes the target representation stable. SimSiam is shown to converge faster than MoCo, SimCLR, and BYOL with even smaller batch sizes, while the performance only slightly decreases.
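The stop-gradient idea can be sketched in a few lines; the symmetric negative-cosine form below follows the common formulation, with the predictor and projector interfaces assumed.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric negative cosine similarity with stop-gradient on the targets.

    p1, p2: predictor outputs for the two views.
    z1, z2: encoder/projector outputs for the two views.
    The .detach() calls implement the stop-gradient operation that SimSiam
    identifies as the critical ingredient against collapse.
    """
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```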
Some other works are inspired by theoretical analysis of the contrastive objective. ReLIC [82] argues that contrastive pre-training teaches the encoder to causally disentangle the invariant content (i.e., main objects) and the style (i.e., environments) in an image. To better enforce this observation in the data augmentation, they propose to add an extra KL-divergence regularizer between the prediction logits of an image's different views. The results show that this can enhance the models' generalization ability and robustness and improve the performance.
In graph learning, Graph Contrastive Coding (GCC) [94] is a pioneer in leveraging instance discrimination as the pretext task for structural information pre-training. For each node, we sample two subgraphs independently by random walks with restart and use the top eigenvectors of their normalized graph Laplacian matrices as the nodes' initial representations. Then we use a GNN to encode them and calculate the InfoNCE loss as MoCo and SimCLR do, where the node embeddings derived from the same node (in different subgraphs) are viewed as similar. Results show that GCC learns better transferable structural knowledge than previous work such as struc2vec [103], GraphWave [35] and ProNE [145]. GraphCL [142] studies data augmentation strategies in graph learning. It proposes four different augmentation methods based on edge perturbation and node dropping, and further demonstrates that an appropriate combination of these strategies can yield even better performance.

4.3 Self-supervised Contrastive Pre-training for Semi-supervised Self-training

While contrastive learning-based self-supervised learning continues to push the boundaries on various benchmarks, labels are still important because there is a gap between the training objectives of self-supervised and supervised learning. In other words, no matter how much self-supervised learning models improve, they are still only powerful feature extractors, and to transfer to downstream tasks we still need labels to a greater or lesser extent. As a result, to bridge the gap between self-supervised pre-training and downstream tasks, semi-supervised learning is what we are looking for.

Recall MoCo [47], which has topped the ImageNet leaderboard. Although it proves beneficial for many other downstream vision tasks, it fails to improve the COCO object detection task. Some follow-up work [84], [153] investigates this problem and attributes it to the gap between instance discrimination and object detection. In such a situation, while pure self-supervised pre-training fails to help, semi-supervised self-training can contribute a lot.

First, we clarify the definitions of semi-supervised learning and self-training. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Various methods derive from different assumptions made about the data distribution, with self-training (or self-labeling) being the oldest. In self-training, a model is trained on the small amount of labeled data and then yields labels on the unlabeled data. Only those data with highly confident labels are combined with the original labeled data to train a new model. We iterate this procedure to find the best model.

The current state-of-the-art supervised model [133] on ImageNet follows the self-training paradigm: we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. We then train a larger EfficientNet as a student model on both the labeled and the pseudo-labeled images. We iterate this process by putting the student back as the teacher. During pseudo-label generation, the teacher is not noised, so that the pseudo labels are as accurate as possible. However, during the student's learning, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment into the student so that it generalizes better than the teacher.
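A minimal sketch of one such teacher-student round is shown below. The data containers, the confidence threshold, and the `train` routine (where the noise injection would live) are hypothetical placeholders introduced only for illustration.

```python
import torch

def self_training_round(teacher, student, labeled, unlabeled, train, threshold=0.9):
    """One self-training round: the teacher pseudo-labels unlabeled batches,
    and only confident predictions are added to the student's training set.

    labeled: list of (x, y) batches; unlabeled: iterable of x batches;
    train: assumed training routine that also applies noise to the student.
    """
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for x in unlabeled:
            probs = torch.softmax(teacher(x), dim=1)
            conf, label = probs.max(dim=1)
            keep = conf > threshold            # keep only confident pseudo labels
            if keep.any():
                pseudo.append((x[keep], label[keep]))
    train(student, labeled + pseudo)           # student trains on real + pseudo labels
    return student                             # the student becomes the next teacher
```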
In light of semi-supervised self-training's success, it is natural to rethink its relationship with self-supervised methods, especially with the successful contrastive pre-training methods. In Section 4.2.1, we have introduced M3S [112], which attempts to combine cluster-based contrastive pre-training and downstream semi-supervised learning. For computer vision tasks, Zoph et al. [153] study MoCo pre-training and a self-training method in which a teacher is first trained on a downstream dataset (e.g., COCO) and then yields pseudo labels on unlabeled data (e.g., ImageNet); finally, a student learns jointly over the real labels on the downstream dataset and the pseudo labels on the unlabeled data. They surprisingly find that pre-training hurts performance while self-training still benefits from strong data augmentation. Besides, more labeled data diminishes the value of pre-training, while semi-supervised self-training
always improves. They also discover that the improvements from pre-training and self-training are orthogonal to each other, i.e., they contribute to the performance from different perspectives. The model with joint pre-training and self-training is the best.

Chen et al. [16]'s SimCLR v2 supports the conclusion mentioned above by showing that, with only 10% of the original ImageNet labels, a ResNet-50 can surpass the supervised one through joint pre-training and self-training. They propose a 3-step framework:

1) Do self-supervised pre-training as in SimCLR v1, with some minor architecture modifications and a deeper ResNet.
2) Fine-tune the last few layers with only 1% or 10% of the original ImageNet labels.
3) Use the fine-tuned network as a teacher to yield labels on unlabeled data in order to train a smaller student ResNet-50.

The success in combining self-supervised contrastive pre-training and semi-supervised self-training opens our eyes to a future data-efficient deep learning paradigm. More work is expected to investigate their latent mechanisms.

4.4 Pros and Cons

Because contrastive learning assumes the downstream applications to be classification, it only employs the encoder and discards the decoder in the architecture compared with generative models. Therefore, contrastive models are usually light-weight and perform better in discriminative downstream applications.

Contrastive learning is closely related to metric learning, a discipline that has long been studied. However, self-supervised contrastive learning is still an emerging field, and many problems remain to be solved, including:

1) Scale to natural language pre-training. Despite its success in computer vision, contrastive pre-training does not present convincing results on NLP benchmarks. Most contrastive learning in NLP now lies in BERT's supervised fine-tuning, such as improving BERT's sentence-level representation [102] and information retrieval [59]. Few algorithms have been proposed to apply contrastive learning in the pre-training stage. As most language understanding tasks are classification, a contrastive language pre-training approach should be better than the current generative language models.
2) Sampling efficiency. Negative sampling is a must for most contrastive learning, but this process is often tricky, biased, and time-consuming. BYOL [42] and SimSiam [19] are the pioneers in freeing contrastive learning from negative samples, but this can be improved further. It is also not clear enough what role negative sampling plays in contrastive learning.
3) Data augmentation. Researchers have proved that data augmentation can boost contrastive learning's performance, but the theory of why and how it helps is still quite ambiguous. This hinders its application in other domains, such as NLP and graph learning, where the data is discrete and abstract.

5 GENERATIVE-CONTRASTIVE (ADVERSARIAL) SELF-SUPERVISED LEARNING

Generative-contrastive representation learning, better known as adversarial representation learning, leverages a discriminative loss function as the objective. Yann LeCun comments on adversarial learning as "the most interesting idea in the last ten years in machine learning." Its application to learning representations is also booming.

The idea of adversarial learning derives from generative learning, where researchers have observed some inherent shortcomings of point-wise generative reconstruction (see Section 3.5). As an alternative, adversarial learning learns to reconstruct the original data distribution rather than the samples, by minimizing the distributional divergence.

In terms of contrastive learning, adversarial methods still preserve the generator structure consisting of an encoder and a decoder, whereas contrastive methods abandon the decoder component (as shown in Fig. 4). This is critical because, on the one hand, the generator endows adversarial learning with the strong expressiveness that is peculiar to generative models; on the other hand, it also makes the objective of adversarial methods far more challenging to learn than that of contrastive methods, leading to unstable convergence. In the adversarial setting, the decoder's existence requires the representation to be "reconstructive", in other words, to contain all the information necessary for constructing the inputs. In the contrastive setting, however, we only need to learn "distinguishable" information to discriminate different samples.

To sum up, the adversarial methods absorb merits from both generative and contrastive methods, together with some of their drawbacks. In a situation where we need to fit an implicit distribution, they are a better choice. In the following subsections, we discuss their various applications to representation learning.

5.1 Generate with Complete Input

This section introduces GAN and its variants for representation learning, which focus on capturing the complete information of the sample.

The inception of adversarial representation learning should be attributed to Generative Adversarial Networks (GAN) [97], which propose the adversarial training framework. Following GAN, many variants [11], [55], [56], [60], [72], [89] emerge and reshape people's understanding of deep learning's potential. GAN's training process can be viewed as two players playing a game: one generates fake samples while the other tries to distinguish them from real ones. To formulate this problem, we define G as the generator, D as the discriminator, p_data(x) as the real sample distribution, and p_z(z) as the learned latent sample distribution, and we want to optimize the min-max game

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \qquad (16)$$

Before VQ-VAE2, GAN maintained dominant performance on image generation tasks over purely generative models such as the autoregressive PixelCNN and the autoencoder VAE. It is natural to think about how this framework could benefit representation learning.

However, there is a gap between generation and representation. Compared with the autoencoder's explicit latent sample distribution p_z(z), GAN's latent distribution p_z(z) is modeled implicitly, and we need to extract this implicit distribution. To bridge this gap, AAE [77] first proposes a solution that follows the autoencoder's natural idea. The generator in GAN can be viewed as an implicit autoencoder, so we can replace the generator with an explicit variational autoencoder (VAE) to extract the representation. Recall the objective of VAE

$$\mathcal{L}_{\mathrm{VAE}} = -\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] + \mathrm{KL}\left(q(z|x)\,\|\,p(z)\right) \qquad (17)$$

As we mentioned before, compared with the autoencoder's l2 loss, the discriminative loss in GAN better models high-level abstraction. To alleviate the problem, AAE replaces the KL divergence term with a discriminative loss

$$\mathcal{L}_{\mathrm{Disc}} = \mathrm{CrossEntropy}\left(q(z), p(z)\right) \qquad (18)$$

that asks the discriminator to distinguish between representations produced by the encoder and samples drawn from a prior distribution.
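The sketch below illustrates how the adversarial term of Eq. (18) can replace the KL term of Eq. (17) in an AAE-style model. The Gaussian prior, the MSE reconstruction loss, and the module interfaces are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def aae_losses(encoder, decoder, latent_disc, x):
    """AAE-style objectives: reconstruction plus an adversarial term that
    pushes q(z|x) towards the prior p(z)."""
    z = encoder(x)                                  # sample from q(z|x)
    recon_loss = F.mse_loss(decoder(z), x)          # reconstruction term

    z_prior = torch.randn_like(z)                   # sample from p(z), here Gaussian
    ones = torch.ones(z.size(0), 1, device=z.device)
    zeros = torch.zeros(z.size(0), 1, device=z.device)
    # Discriminator: tell prior samples (real) from encoded samples (fake).
    d_loss = (F.binary_cross_entropy_with_logits(latent_disc(z_prior), ones)
              + F.binary_cross_entropy_with_logits(latent_disc(z.detach()), zeros))
    # Encoder acts as the "generator" in latent space: fool the discriminator.
    g_loss = F.binary_cross_entropy_with_logits(latent_disc(z), ones)
    return recon_loss, d_loss, g_loss
```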
However, AAE still preserves the reconstruction error,
classification is performed through a softmax layer with
which contradicts GAN’s core idea. Based on AAE, Bi-
hyper-parameter T as an adjustment. Based on [147], a
GAN [33] and ALI [36] argue to embrace adversarial learning
range of colorization-based representation methods [69], [70],
without reservation and put forward a new framework.
[148] are proposed to benefit downstream tasks.
Given an actual sample x
Inpainting [55], [89] is more straight forward. We will ask
• Generator G: the generator here virtually acts as the the model to predict an arbitrary part of an image given the
decoder, generates fake samples x0 = G(z) by z from a rest of it. Then a discriminator is employed to distinguish
prior latent distribution (e.g. [uniform(-1,1)]d , d refers to the inpainted image from the original one. Super-resolution
dimension). method SRGAN [72] follows the same idea to recover high-
• Encoder E : a newly added component, mapping real resolution images from blurred low-resolution ones in the
sample x to representation z 0 = E(x). This is also exactly adversarial setting.
what we want to train.
• Discriminator D: given two inputs [z , G(z)] and [E(x),
5.3 Pre-trained Language Model
x], decide which one is from the real sample distribution.
For a long time, the pre-trained language model (PTM)
It is easy to see that their training goal is E = G−1 . In
focuses on maximum likelihood estimation based pretext task
other words, encoder E should learn to ”convert” generator
because discriminative objectives are thought to be helpless
G. This goal could be rewritten as a l0 loss for autoen-
due to languages’ vibrant patterns. However, recently some
coder [33], but it is not the same as a traditional autoencoder
work shows excellent performance and sheds light on
because the distribution does not make any assumption
contrastive objectives’ potential in PTM.
about the data itself. The distribution is shaped by the
discriminator, which captures the semantic-level difference.
Based on BiGAN and ALI, later studies [20], [34] discover
that GAN with deeper and larger networks and modified
architectures can produce even better results on downstream
task.

5.2 Recover with Partial Input Fig. 16: The architecture of ELECTRA [21]. It follows GAN’s
framework but uses a two-stage training paradigm to avoid
As we mentioned above, GAN’s architecture is not born
using policy gradient. The MLM is Masked Language Model.
for representation learning, and modification is needed to
apply its framework. While BiGAN and ALI choose to extract The pioneering work is ELECTRA [21], surpassing BERT
the implicit distribution directly, some other methods such given at the same computation budget. ELECTRA proposes
as colorization [69], [70], [147], [148], inpainting [55], [89] Replaced Token Detection (RTD) and leverages GAN’s
and super-resolution [72] apply the adversarial learning via structure to pre-train a language model. In this setting, the
in a different way. Instead of asking models to reconstruct generator G is a small Masked Language Model (MLM),
the whole input, they provide models with partial input which replaces masked tokens in a sentence to words.
and ask them to recover the rest parts. This is similar to The discriminator D is asked to predict which words are
denoising autoencoder (DAE) such as BERT’s family in replaced. Notice that replaced means not the same with
natural language processing but conducted in an adversarial original unmasked inputs. The training is conducted in two
manner. stages:

1) Warm up the generator: train G with the MLM pretext task L_MLM(x, θ_G) for some steps to warm up the parameters.
2) Train with the discriminator: D's parameters are initialized with G's and then trained with the discriminative objective L_Disc(x, θ_D) (a cross-entropy loss). During this period, G's parameters are frozen.

The final objective can be written as

$$\min_{\theta_{G}, \theta_{D}} \sum_{x \in \mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(x, \theta_{G}) + \lambda \mathcal{L}_{\mathrm{Disc}}(x, \theta_{D}) \qquad (19)$$

Though ELECTRA is structured as a GAN, it is not trained in the GAN setting. Compared with image data, which is continuous, word tokens are discrete, which stops the gradient backpropagation. A possible substitute is to leverage policy gradients, but ELECTRA's experiments show that the performance is slightly lower. Theoretically speaking, L_Disc(x, θ_D) actually turns the conventional k-class softmax classification into a binary classification. This substantially saves computation but may somewhat harm the representation quality due to early degeneration of the embedding space. In summary, ELECTRA is still an inspiring pioneering work in leveraging discriminative objectives.
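A minimal sketch of the two terms in Eq. (19) is given below. The model interfaces are illustrative, the generator's predictions are sampled greedily for brevity (ELECTRA samples from the generator distribution), and the joint objective is shown directly rather than the two-stage schedule described above.

```python
import torch
import torch.nn.functional as F

def electra_losses(generator, discriminator, tokens, mask, lam=50.0):
    """Sketch of the ELECTRA objective in Eq. (19).

    tokens: (B, L) token ids; mask: (B, L) boolean mask of MLM positions.
    lam is the weighting hyper-parameter for the RTD term.
    """
    # Generator (small MLM): predict the masked tokens.
    gen_logits = generator(tokens, mask)                   # (B, L, V)
    mlm_loss = F.cross_entropy(gen_logits[mask], tokens[mask])

    # Build the corrupted input from the generator's predictions
    # (no gradient flows from D back to G, matching ELECTRA).
    with torch.no_grad():
        sampled = gen_logits.argmax(dim=-1)                # greedy for brevity
        corrupted = torch.where(mask, sampled, tokens)
        replaced = (corrupted != tokens).float()           # RTD targets

    # Discriminator: per-token binary "replaced or not" classification.
    disc_logits = discriminator(corrupted)                 # (B, L)
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced)
    return mlm_loss + lam * rtd_loss
```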
At the same time, WKLM [134] proposes to perform RTD at the entity level. For entities in Wikipedia paragraphs, WKLM replaces them with similar entities and trains the language model to distinguish them with a discriminative objective similar to ELECTRA's, performing exceptionally well on downstream tasks like question answering. Similar work is REALM [45], which conducts higher, article-level retrieval augmentation of the language model. However, REALM does not use the discriminative objective.

5.4 Graph Learning

There are also attempts to utilize adversarial learning in graph learning ([23], [28], [128]). Interestingly, their ideas are quite different from each other.

The most natural idea is to follow the practice of BiGAN [33] and ALI [36], which asks the discriminator to distinguish representations from the generated and the prior distribution. Adversarial Network Embedding (ANE) [23] designs a generator G that is updated in two stages: 1) G encodes a sampled graph into the target embedding and computes a traditional NCE loss with a context encoder F, as Skip-gram does; 2) the discriminator D is asked to distinguish the embedding from G and one sampled from a prior distribution. The optimized objective is the sum of the above two objectives, and the generator G can yield better node representations for the classification task.

GraphGAN [128] considers modeling the link prediction task and follows the original GAN-style discriminative objective to discriminate directly at the node level rather than the representation level. The model first selects nodes from the target node v_c's subgraph according to the embeddings encoded by the generator G. Then some neighbor nodes of v_c selected from the subgraph, together with those selected by G, are put into a binary classifier D to decide whether they are linked to v_c. Because this framework involves a discrete selection procedure, while gradient descent can update the discriminator D, the generator G has to be updated via policy gradients.

Fig. 17: The architecture of GraphSGAN [28], which investigates density gaps in the embedding space for classification problems. Taken from [28].

GraphSGAN [28] applies the adversarial method to semi-supervised graph learning with the motivation that marginal nodes cause most classification errors in the graph. Consider samples in the same category: they are usually clustered in the embedding space, and between clusters there are density gaps where few samples exist. The authors provide a rigorous mathematical proof that we can theoretically perform complete classification if we generate enough fake samples in the density gaps. GraphSGAN leverages a generator G to generate fake nodes in the density gaps during training and asks the discriminator D to classify nodes into their original categories plus an extra category for the fake ones. At test time, the fake samples are removed, and the classification results on the original categories can be improved substantially.

5.5 Domain Adaptation and Multi-modality Representation

Essentially, the discriminator in adversarial learning serves to match the discrepancy between the latent representation distribution and the data distribution. This function naturally relates to domain adaptation and multi-modality representation problems, which aim to align different representation distributions. [1], [2], [37], [105] study how GAN can help with domain adaptation. [12], [129] leverage adversarial sampling to improve the quality of negative samples. For multi-modality representation, [151]'s image-to-image translation, [106]'s text style transfer, [22]'s word-to-word translation and [104]'s image-to-text translation show the great power of adversarial representation learning.

5.6 Pros and Cons

Generative-contrastive (adversarial) self-supervised learning is particularly successful in image generation, transformation and manipulation, but there are also some challenges for its future development:

• Limited applications in NLP and graph learning. Due to the discrete nature of languages and graphs, the adversarial methods do not perform as well as they do in computer vision. Furthermore, GAN-based language generation has been found to be much worse than unidirectional language models such as GPTs.
• Easy to collapse. It is also notorious that adversarial models are prone to collapse during training, and numerous techniques have been developed to stabilize the training, such as spectral normalization [83], W-GAN [3] and so on.
• Not for feature extraction. Although works such as BiGAN [33] and BigBiGAN [34] have explored some
ways to leverage GAN's learned latent representation and achieve good performance, contrastive learning has soon outperformed them with fewer parameters.

Despite the challenges, however, it is still promising because it overcomes some inherent deficits of the point-wise generative objective. Maybe we still need to wait for a better future implementation of this idea.

6 DISCUSSIONS AND FUTURE DIRECTIONS

In this section, we discuss several open problems and future directions in self-supervised learning for representation.

Theoretical Foundation. Though self-supervised learning has achieved great success, few works investigate the mechanisms behind it. In this survey, we have listed several recent works on this topic and shown that theoretical analysis is significant for avoiding misleading empirical conclusions.

In [4], researchers present a conceptual framework to analyze the contrastive objective's role in generalization ability. [120] empirically proves that mutual information is only loosely related to the success of several MI-based methods, in which the sampling strategies and architecture design may count more. This type of work is crucial for self-supervised learning to form a solid foundation, and more work related to theoretical analysis is urgently needed.

Transferring to downstream tasks. There is an essential gap between pre-training and downstream tasks. Researchers design elaborate pretext tasks to help models learn some critical features of the dataset that can transfer to other jobs, but sometimes this fails to be realized. Besides, the process of selecting pretext tasks seems too heuristic and tricky, without patterns to follow.

A typical example is the selection of pre-training tasks in BERT and ALBERT. BERT uses Next Sentence Prediction (NSP) to enhance its ability for sentence-level understanding. However, ALBERT shows that NSP equals a naive topic model, which is far too easy for language model pre-training and even decreases BERT's performance.

For the pre-training task selection problem, a probably exciting direction would be to design pre-training tasks for a specific downstream task automatically, just as Neural Architecture Search [154] does for neural network architectures.

Transferring across datasets. This problem is also known as how to learn inductive biases, or inductive learning. Traditionally, we split a dataset into a training part used for learning the model parameters and a testing part for evaluation. An essential prerequisite of this learning paradigm is that data in the real world conform to our dataset's distribution. Nevertheless, this assumption frequently fails in experiments.

Self-supervised representation learning solves part of this problem, especially in the field of natural language processing. The vast amounts of corpora used in language model pre-training help cover most language patterns and, therefore, contribute to the success of PTMs in various language tasks. However, this is based on the fact that text in the same language shares the same embedding space. For other tasks like machine translation, and fields like graph learning where embedding spaces are different for different datasets, learning the transferable inductive biases efficiently is still an open problem.

Exploring potential of sampling strategies. In [120], the authors attribute one of the reasons for the success of mutual information-based methods to better sampling strategies. MoCo [47], SimCLR [15], and a series of other contrastive methods may also support this conclusion. They propose to leverage super large amounts of negative samples and augmented positive samples, whose effects are studied in deep metric learning. How to further release the power of sampling is still an unsolved and attractive problem.

Early Degeneration for Contrastive Learning. Contrastive learning methods such as MoCo [47] and SimCLR [15] are rapidly approaching the performance of supervised learning for computer vision. However, their incredible performances are generally limited to the classification problem. Meanwhile, the generative-contrastive method ELECTRA [21] for language model pre-training also outperforms other generative methods on several standard NLP benchmarks with fewer model parameters. However, some remarks indicate that ELECTRA's performance on language generation and neural entity extraction is not up to expectations.

The problems above are probably because the contrastive objectives often get trapped in the embedding space's early degeneration problem, which means that the model over-fits to the discriminative pretext task too early and therefore loses the ability to generalize. We expect that there will be techniques or new paradigms to solve the early degeneration problem while preserving contrastive learning's advantages.

7 CONCLUSION

This survey comprehensively reviews the existing self-supervised representation learning approaches in natural language processing (NLP), computer vision (CV), graph learning, and beyond. Self-supervised learning is the present and future of deep learning due to its supreme ability to utilize Web-scale unlabeled data to train feature extractors and context generators efficiently. Despite the diversity of algorithms, we categorize all self-supervised methods into three classes: generative, contrastive, and generative-contrastive, according to their essential training objectives. We introduce typical and representative methods in each category and its sub-categories. Moreover, we discuss the pros and cons of each category and their unique application scenarios. Finally, fundamental problems and future directions of self-supervised learning are listed.

ACKNOWLEDGMENTS

The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).

REFERENCES

[1] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
[2] F. Alam, S. Joty, and M. Imran. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151, 2018.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative [29] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang. Cognitive graph
adversarial networks. In International conference on machine learning, for multi-hop reading comprehension at scale. arXiv preprint
pages 214–223. PMLR, 2017. arXiv:1905.05460, 2019.
[4] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and [30] L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent
N. Saunshi. A theoretical analysis of contrastive unsupervised components estimation. arXiv preprint arXiv:1410.8516, 2014.
representation learning. arXiv preprint arXiv:1902.09229, 2019. [31] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using
[5] A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong. real nvp. arXiv preprint arXiv:1605.08803, 2016.
Learning to retrieve reasoning paths over wikipedia graph for [32] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual
question answering. arXiv preprint arXiv:1911.10470, 2019. representation learning by context prediction. In Proceedings of the
[6] P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning represen- IEEE ICCV, pages 1422–1430, 2015.
tations by maximizing mutual information across views. In NIPS, [33] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature
pages 15509–15519, 2019. learning. arXiv preprint arXiv:1605.09782, 2016.
[7] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang. Simgnn: A [34] J. Donahue and K. Simonyan. Large scale adversarial representa-
neural network approach to fast graph similarity computation. In tion learning. In NIPS, pages 10541–10551, 2019.
WSDM, pages 384–392, 2019. [35] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec. Learning
[8] D. H. Ballard. Modular learning in neural networks. In AAAI, structural node embeddings via diffusion wavelets. In SIGKDD,
pages 279–284, 1987. pages 1320–1329, 2018.
[9] Y. Bengio, N. Léonard, and A. Courville. Estimating or prop- [36] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb,
agating gradients through stochastic neurons for conditional M. Arjovsky, and A. Courville. Adversarially learned inference.
computation. arXiv preprint arXiv:1308.3432, 2013. arXiv preprint arXiv:1606.00704, 2016.
[10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word [37] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle,
vectors with subword information. Transactions of the Association F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial
for Computational Linguistics, 5:135–146, 2017. training of neural networks. The Journal of Machine Learning
[11] A. Brock, J. Donahue, and K. Simonyan. Large scale gan Research, 17(1):2096–2030, 2016.
training for high fidelity natural image synthesis. arXiv preprint [38] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised repre-
arXiv:1809.11096, 2018. sentation learning by predicting image rotations. arXiv preprint
[12] L. Cai and W. Y. Wang. Kbgan: Adversarial learning for knowledge arXiv:1803.07728, 2018.
graph embeddings. arXiv preprint arXiv:1711.04071, 2017. [39] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hier-
[13] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering archies for accurate object detection and semantic segmentation.
for unsupervised learning of visual features. In Proceedings of the In CVPR, pages 580–587, 2014.
ECCV (ECCV), pages 132–149, 2018. [40] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
[14] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.
Unsupervised learning of visual features by contrasting cluster In NIPS, pages 2672–2680, 2014.
assignments. arXiv preprint arXiv:2006.09882, 2020. [41] P. Goyal, M. Caron, B. Lefaudeux, M. Xu, P. Wang, V. Pai, M. Singh,
[15] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple V. Liptchinsky, I. Misra, A. Joulin, and P. Bojanowski. Self-
framework for contrastive learning of visual representations. arXiv supervised pretraining of visual features in the wild. arXiv preprint
preprint arXiv:2002.05709, 2020. arXiv:2103.01988, 2021.
[16] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. Big [42] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond,
self-supervised models are strong semi-supervised learners. arXiv E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al.
preprint arXiv:2006.10029, 2020. Bootstrap your own latent: A new approach to self-supervised
[17] T. Chen, Y. Sun, Y. Shi, and L. Hong. On sampling strategies for learning. arXiv preprint arXiv:2006.07733, 2020.
neural network-based collaborative filtering. In SIGKDD, pages [43] A. Grover and J. Leskovec. node2vec: Scalable feature learning
767–776, 2017. for networks. In SIGKDD, pages 855–864, 2016.
[18] X. Chen, H. Fan, R. Girshick, and K. He. Improved baselines with [44] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation:
momentum contrastive learning. arXiv preprint arXiv:2003.04297, A new estimation principle for unnormalized statistical models.
2020. In Proceedings of the Thirteenth International Conference on Artificial
[19] X. Chen and K. He. Exploring simple siamese representation Intelligence and Statistics, pages 297–304, 2010.
learning. arXiv preprint arXiv:2011.10566, 2020. [45] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. Realm:
[20] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang. Triple generative Retrieval-augmented language model pre-training. arXiv preprint
adversarial nets. In NIPS, pages 4088–4098, 2017. arXiv:2002.08909, 2020.
[21] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre- [46] K. Hassani and A. H. Khasahmadi. Contrastive multi-view
training text encoders as discriminators rather than generators. representation learning on graphs. arXiv preprint arXiv:2006.05582,
arXiv preprint arXiv:2003.10555, 2020. 2020.
[22] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and [47] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast
H. Jégou. Word translation without parallel data. arXiv preprint for unsupervised visual representation learning. arXiv preprint
arXiv:1710.04087, 2017. arXiv:1911.05722, 2019.
[23] Q. Dai, Q. Li, J. Tang, and D. Wang. Adversarial network [48] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for
embedding. In AAAI, 2018. image recognition. In CVPR, pages 770–778, 2016.
[24] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdi- [49] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bach-
nov. Transformer-xl: Attentive language models beyond a fixed- man, A. Trischler, and Y. Bengio. Learning deep representations by
length context. In Proceedings of the 57th Annual Meeting of the mutual information estimation and maximization. arXiv preprint
Association for Computational Linguistics, pages 2978–2988, 2019. arXiv:1808.06670, 2018.
[25] V. R. de Sa. Learning classification with unlabeled data. In NIPS, [50] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++:
pages 112–119, 1994. Improving flow-based generative models with variational de-
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: quantization and architecture design. In ICML, pages 2722–2730,
A large-scale hierarchical image database. In CVPR, pages 248–255. 2019.
Ieee, 2009. [51] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre- J. Leskovec. Strategies for pre-training graph neural networks. In
training of deep bidirectional transformers for language under- ICLR, 2019.
standing. In Proceedings of the 2019 Conference of the North American [52] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun. Gpt-gnn:
Chapter of the Association for Computational Linguistics: Human Generative pre-training of graph neural networks. arXiv preprint
Language Technologies, Volume 1 (Long and Short Papers), pages arXiv:2006.15437, 2020.
4171–4186, 2019. [53] Z. Hu, Y. Dong, K. Wang, and Y. Sun. Heterogeneous graph
[28] M. Ding, J. Tang, and J. Zhang. Semi-supervised learning on transformer. arXiv preprint arXiv:2003.01332, 2020.
graphs with generative adversarial nets. In Proceedings of the 27th [54] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected
ACM CIKM, pages 913–922, 2018. convolutional networks. 2017 IEEE CVPR, pages 2261–2269, 2017.

[55] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally [81] I. Misra and L. van der Maaten. Self-supervised learning of
consistent image completion. ACM Transactions on Graphics (ToG), pretext-invariant representations. arXiv preprint arXiv:1912.01991,
36(4):1–14, 2017. 2019.
[56] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image [82] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell.
translation with conditional adversarial networks. In CVPR, pages Representation learning via invariant causal mechanisms. arXiv
1125–1134, 2017. preprint arXiv:2010.07922, 2020.
[57] L. Jing and Y. Tian. Self-supervised visual feature learning with [83] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral
deep neural networks: A survey. arXiv preprint arXiv:1902.06162, normalization for generative adversarial networks. arXiv preprint
2019. arXiv:1802.05957, 2018.
[58] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. [84] A. Newell and J. Deng. How useful is self-supervised pretraining
Spanbert: Improving pre-training by representing and predicting for visual tasks? In CVPR, pages 7345–7354, 2020.
spans. Transactions of the Association for Computational Linguistics, [85] A. Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–
8:64–77, 2020. 19, 2011.
[59] V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and [86] M. Noroozi and P. Favaro. Unsupervised learning of visual
W.-t. Yih. Dense passage retrieval for open-domain question representations by solving jigsaw puzzles. In ECCV, pages 69–84.
answering. arXiv preprint arXiv:2004.04906, 2020. Springer, 2016.
[60] T. Karras, S. Laine, and T. Aila. A style-based generator archi- [87] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting
tecture for generative adversarial networks. In CVPR, pages self-supervised learning via knowledge transfer. In CVPR, pages
4401–4410, 2019. 9359–9367, 2018.
[61] D. Kim, D. Cho, D. Yoo, and I. S. Kweon. Learning image [88] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning
representations by completing damaged jigsaw puzzles. In 2018 with contrastive predictive coding. arXiv preprint arXiv:1807.03748,
IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
pages 793–802. IEEE, 2018. [89] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros.
[62] D. P. Kingma and P. Dhariwal. Glow: Generative flow with Context encoders: Feature learning by inpainting. In CVPR, pages
invertible 1x1 convolutions. In NIPS, pages 10215–10224, 2018. 2536–2544, 2016.
[63] D. P. Kingma and M. Welling. Auto-encoding variational bayes. [90] Z. Peng, Y. Dong, M. Luo, X. ming Wu, and Q. Zheng. Self-
arXiv preprint arXiv:1312.6114, 2013. supervised graph representation learning via global context
[64] T. N. Kipf and M. Welling. Semi-supervised classification with prediction. ArXiv, abs/2003.01604, 2020.
graph convolutional networks. arXiv preprint arXiv:1609.02907, [91] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng. Self-
2016. supervised graph representation learning via global context
[65] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv prediction. arXiv preprint arXiv:2003.01604, 2020.
preprint arXiv:1611.07308, 2016.
[92] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning
[66] L. Kong, C. d. M. d’Autume, W. Ling, L. Yu, Z. Dai, and of social representations. In SIGKDD, pages 701–710, 2014.
D. Yogatama. A mutual information maximization perspective of
[93] M. Popova, M. Shvets, J. Oliva, and O. Isayev. Molecularrnn:
language representation learning. arXiv preprint arXiv:1910.08350,
Generating realistic molecular graphs with optimized properties.
2019.
arXiv preprint arXiv:1905.13372, 2019.
[67] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classifi-
[94] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and
cation with deep convolutional neural networks. In NIPS, pages
J. Tang. Gcc: Graph contrastive coding for graph neural network
1097–1105, 2012.
pre-training. arXiv preprint arXiv:2006.09963, 2020.
[68] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and
R. Soricut. Albert: A lite bert for self-supervised learning of [95] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang. Deepinf:
language representations. arXiv preprint arXiv:1909.11942, 2019. Social influence prediction with deep learning. In KDD’18, pages
2110–2119. ACM, 2018.
[69] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, pages 577–593. Springer, 2016.
[70] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, pages 6874–6883, 2017.
[71] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[72] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
[73] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H. Yang. Unsupervised visual representation learning by graph-based consistent constraints. In ECCV, pages 678–694. Springer, 2016.
[74] B. Liu. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167, 2012.
[75] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[76] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[77] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[78] M. Mathieu. Masked autoencoder for distribution estimation. 2015.
[79] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[80] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[96] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang. Pre-trained models for natural language processing: A survey. arXiv preprint arXiv:2003.08271, 2020.
[97] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[98] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training.
[99] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[100] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[101] A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NIPS, pages 14837–14847, 2019.
[102] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[103] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo. struc2vec: Learning node representations from structural identity. In SIGKDD, pages 385–394, 2017.
[104] N. Sarafianos, X. Xu, and I. A. Kakadiaris. Adversarial representation learning for text-to-image matching. In ICCV, pages 5814–5824, 2019.
[105] J. Shen, Y. Qu, W. Zhang, and Y. Yu. Adversarial representation learning for domain adaptation. stat, 1050:5, 2017.
[106] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, pages 6830–6841, 2017.
[107] C. Shi, M. Xu, Z. Zhu, W. Zhang, M. Zhang, and J. Tang. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
[108] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang. An overview of Microsoft Academic Service (MAS) and applications. In WWW'15, pages 243–246, 2015.
[109] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
[110] F.-Y. Sun, J. Hoffmann, and J. Tang. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.
[111] F.-Y. Sun, M. Qu, J. Hoffmann, C.-W. Huang, and J. Tang. vGraph: A generative model for joint community detection and node representation learning. In NIPS, pages 512–522, 2019.
[112] K. Sun, Z. Lin, and Z. Zhu. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In AAAI, volume 34, pages 5892–5899, 2020.
[113] K. Sun, Z. Zhu, and Z. Lin. Multi-stage self-supervised learning for graph convolutional networks. arXiv preprint arXiv:1902.11038, 2019.
[114] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
[115] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW'15, pages 1067–1077, 2015.
[116] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. In SIGKDD, pages 990–998, 2008.
[117] W. L. Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[118] Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[119] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.
[120] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
[121] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, pages 125–125.
[122] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, pages 4790–4798, 2016.
[123] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In NIPS, pages 6306–6315, 2017.
[124] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, pages 1747–1756, 2016.
[125] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[126] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[127] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[128] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo. GraphGAN: Graph representation learning with generative adversarial nets. In AAAI, 2018.
[129] P. Wang, S. Li, and R. Pan. Incorporating GAN for negative sampling in knowledge representation learning. In AAAI, 2018.
[130] Z. Wang, Q. She, and T. E. Ward. Generative adversarial networks: A survey and taxonomy. arXiv preprint arXiv:1906.01529, 2019.
[131] C. Wei, L. Xie, X. Ren, Y. Xia, C. Su, J. Liu, Q. Tian, and A. L. Yuille. Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning. In CVPR, pages 1910–1919, 2019.
[132] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
[133] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves ImageNet classification. In CVPR, pages 10687–10698, 2020.
[134] W. Xiong, J. Du, W. Y. Wang, and V. Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637, 2019.
[135] K. Xu, J. Li, M. Zhang, S. S. Du, K.-i. Kawarabayashi, and S. Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. arXiv preprint arXiv:2009.11848, 2020.
[136] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan. ClusterFit: Improving generalization of visual representations. arXiv preprint arXiv:1912.03330, 2019.
[137] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, pages 5147–5156, 2016.
[138] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NIPS, pages 5754–5764, 2019.
[139] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
[140] J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In NIPS, pages 6410–6421, 2018.
[141] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In ICML, pages 5708–5717, 2018.
[142] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with augmentations. arXiv preprint arXiv:2010.13902, 2020.
[143] Y. You, T. Chen, Z. Wang, and Y. Shen. When does self-supervision help graph convolutional networks? arXiv preprint arXiv:2006.09136, 2020.
[144] F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, B. Shao, R. Li, and K. Wang. OAG: Toward linking large-scale heterogeneous entity graphs. In KDD'19, pages 2585–2595, 2019.
[145] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding. ProNE: Fast and scalable network representation learning. In IJCAI, pages 4278–4284, 2019.
[146] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI, 2018.
[147] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, pages 649–666. Springer, 2016.
[148] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, pages 1058–1067, 2017.
[149] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.
[150] D. Zhu, P. Cui, D. Wang, and W. Zhu. Deep variational network embedding in Wasserstein space. In SIGKDD, pages 2827–2836, 2018.
[151] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NIPS, pages 465–476, 2017.
[152] C. Zhuang, A. L. Zhai, and D. Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, pages 6002–6012, 2019.
[153] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.
[154] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Xiao Liu is a senior undergraduate student with the Department of Computer Science and Technology, Tsinghua University. His main research interests include data mining, machine learning, and knowledge graphs. He has published a paper at KDD.

Fanjin Zhang is a PhD candidate in the Department of Computer Science and Technology, Tsinghua University. She received her bachelor's degree from the Department of Computer Science and Technology, Nanjing University. Her research interests include data mining and social networks.

Zhenyu Hou is an undergraduate with the Department of Computer Science and Technology, Tsinghua University. His main research interests include graph representation learning and reasoning.

Li Mian received her bachelor's degree (2020) from the Department of Computer Science, Beijing Institute of Technology. She is now admitted to a graduate program at the Georgia Institute of Technology. Her research interests focus on data mining, natural language processing, and machine learning.

Zhaoyu Wang is a graduate student with the Department of Computer Science and Technology, Anhui University. His research interests include data mining, natural language processing, and their applications in recommender systems.

Jing Zhang received the master's and PhD degrees from the Department of Computer Science and Technology, Tsinghua University. She is an assistant professor in the Information School, Renmin University of China. Her research interests include social network mining and deep learning.

Jie Tang received the PhD degree from Tsinghua University. He is a full professor in the Department of Computer Science and Technology, Tsinghua University. His main research interests include data mining, social networks, and machine learning. He has published over 200 research papers in top international journals and conferences.