Review On Self-Supervised Image Recognition Using Deep Neural Networks

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Article history: Received 19 October 2020; Received in revised form 14 April 2021; Accepted 26 April 2021; Available online 29 April 2021

Keywords: Self-supervised learning; Unsupervised learning; Semi-supervised learning; Transfer learning; Deep learning; Pretext tasks; Convolutional neural network; Contrastive learning; Online clustering

Abstract: Deep learning has brought significant developments in image understanding tasks such as object detection, image classification, and image segmentation. But the success of image recognition largely relies on supervised learning, which requires a huge number of human-annotated labels. To avoid the costly collection of labeled data, and for domains where very few standard pre-trained models exist, self-supervised learning comes to the rescue. Self-supervised learning is a form of unsupervised learning that allows the network to learn rich visual features that help in performing downstream computer vision tasks such as image classification, object detection, and image segmentation. This paper provides a thorough review of self-supervised learning, which has the potential to revolutionize the computer vision field using unlabeled data. First, the motivation for self-supervised learning and other annotation-efficient learning schemes is discussed. Then, the general pipeline for supervised learning and self-supervised learning is illustrated. Next, various handcrafted pretext tasks are explained that enable learning of visual features from unlabeled image datasets. The paper also highlights the recent breakthroughs in self-supervised learning using contrastive learning and clustering methods that are outperforming supervised learning. Finally, we present performance comparisons of self-supervised techniques on evaluation tasks such as image classification and detection. In the end, the paper concludes with practical considerations and open challenges of image recognition tasks in the self-supervised learning regime. From the onset of the review paper, the core focus is on visual feature learning from images using self-supervised approaches.

© 2021 Elsevier B.V. All rights reserved.
1. Introduction

The essence of learning is not always direct supervision but the ability to predict with limited external guidance. Self-supervised learning is a form of unsupervised learning where the data itself provides a strong supervisory signal that enables a Convolutional Neural Network (ConvNet) to capture intricate dependencies in data without the need for external labels. Essentially, a self-supervised learning task is formulated from a large unlabeled corpus of images and the ConvNet is trained to learn some task (pretext task) designed by the user. For most pretext tasks, the ConvNet predicts a masked area of the image [1] or predicts the correct angle by which the image is rotated [2], etc. The representations learned on the pretext task from the encoder part of the ConvNet are subsequently used for downstream tasks where limited annotated data is available. Self-supervised learning approaches are successful in developing standard pre-trained models for natural language processing like BERT [3], ULMFiT [4], Word2Vec [5], GloVe [6], fastText [7], RoBERTa [8], XLM-R [9], and T5 [10], but less so in computer vision tasks because of the high-dimensional continuous space that images occupy. Motivated by BERT's success in NLP as a self-supervised learning model, ActBERT, a method to learn video-text pairs in a self-supervised way, is proposed [11]. It enables learning of joint video-text representations from large unlabeled video datasets. Earlier models used linguistic features for video-text joint modeling, whereas ActBERT leverages three sources of information for cross-modal pre-training: action-related features, region features, and language embeddings. The model is pre-trained with HowTo100M, a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos. The pre-trained model is evaluated on five downstream tasks, namely text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. Results show that ActBERT outperforms its supervised counterparts in learning video-text representations.
Recently various self-supervised approaches using pretext tasks have surged in the field of computer vision. These pretext tasks designed by the user help in learning rich representations from the input images.
Learning representations of the input signal makes it easier to extract useful characteristics of the data that help in building classifiers or other predictors on top of ConvNets [12]. The cognitive motivation behind self-supervised learning is how infants learn; they are not always presented with the correct answers. Within 3 to 4 months of birth, infants have meaningful expectations about the world around them [13]. Observation, common sense, and minimal interaction make infants capable of self-learning. The environment around the infants becomes a source of supervision that helps in developing a general understanding of how things work without constant supervision. A similar concept is mimicked through self-supervised learning in machines, where the data itself contains inherent features that provide supervision for training the model, rather than annotated labels instructing the network about what is right and what is wrong. Hence, the goal of self-supervised learning is to learn representations of the input, without labels, that transfer well to downstream tasks where few labels are available.

There are many popular ConvNet architectures like AlexNet [14], VGG [15], GoogLeNet [16], ResNet [17], DenseNet [18], InceptionV3 [19], etc. which have achieved state-of-the-art performance on various image recognition tasks when trained on a large labeled dataset. These models are widely used as pre-trained models and are fine-tuned for the target tasks which the user expects the network to perform. These pre-trained models have yielded state-of-the-art performance on standard labeled datasets such as ImageNet (1.4 million images with 1000 categories) [20] and OpenImage (59.9 million images with 19,957 categories) [21]. The success of these sophisticated architectures relies on the large labeled datasets on which they are trained for several GPU hours per day. But in reality, building huge labeled datasets is a challenging task and requires intense manual effort. The ImageNet dataset contains around 14M labeled images and took around 22 human-years to develop. Though publicly available image labeling tools exist, like Amazon SageMaker, which guides a human labeler step-by-step to label images, audio, text, etc., this comes with an additional expense [22]. The situation is even more complex and cumbersome for video datasets, which are more expensive to label due to the addition of spatial and temporal information. The Kinetics dataset [23], which is used to train ConvNets for human motion or action recognition, contains 500,000 videos belonging to 600 categories, and each video lasts for 10 s. In addition to labeling challenges, learning good representations of the input is an additional reason to look for an alternative solution to supervised learning [24]. Moreover, in a domain like medicine, it is hard to obtain annotations because of privacy concerns and because it is often unclear what exactly to annotate. Hence, a pre-trained model trained on a large unlabeled dataset can be used in such a domain. However, the benefit of pre-training and fine-tuning on downstream tasks is reduced when the downstream task images belong to a completely different domain than the images that were used for pre-training the network [25].

This paper provides an extensive compilation of different ideas brought forth by researchers that help in building a pre-trained model using intrinsic properties of the data rather than a labeled dataset. The paper highlights the breakthroughs in self-supervised learning methods that will give a head start to budding researchers to explore self-supervised learning, which will serve as a better alternative in the long run. Self-supervised learning is a step forward for building background knowledge and instilling common sense in machines, which remains an open challenge in AI since its inception. The paper attempts to summarize and evaluate various self-supervised methods that target learning better representations from the input without labels. The learned representations aid the downstream task (actual task) where a small labeled dataset is available. The ongoing research in self-supervised learning indicates that we can bring a self-supervised learning paradigm shift to computer vision without relying much on labeled data.

1.1. Different learning schemes

The researchers are always in a quest to formulate different learning schemes that rely on minimal labeled data. Following are the various learning methods that require minimal fine-grained labeling:

Semi-supervised learning: Semi-supervised learning methods use a combination of supervised and unsupervised learning. The model is first trained on a fraction of the dataset that is labeled manually [26]. Once the model is trained, it is used to predict labels for the remaining, unlabeled portion of the dataset. At last, the network is trained on the full dataset comprising the manually labeled data and the pseudo-labeled data. Another setup of semi-supervised learning is to train a large-capacity model called the ''teacher'' model with a large labeled dataset. This model is then used to predict labels for an unlabeled dataset. The predicted examples are ranked against each concept class and the top-scoring examples are used for pre-training the target model called the ''student'' model. The final step is to fine-tune the student model with all the available labeled data. It is found that such models have higher accuracy compared with the target model trained only on labeled data [27].
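As an illustration of the first setup, the following minimal sketch generates pseudo labels for the unlabeled portion of the dataset by keeping only confident predictions of the initially trained model. It assumes PyTorch; the model, the unlabeled data loader, and the confidence threshold are placeholders rather than values prescribed by the cited works.

import torch
import torch.nn.functional as F

def generate_pseudo_labels(model, unlabeled_loader, device, threshold=0.9):
    # Predict labels for unlabeled images and keep only confident predictions.
    model.eval()
    kept_images, kept_labels = [], []
    with torch.no_grad():
        for images in unlabeled_loader:               # batches of unlabeled images
            probs = F.softmax(model(images.to(device)), dim=1)
            confidence, prediction = probs.max(dim=1)
            keep = confidence > threshold             # discard uncertain samples
            kept_images.append(images[keep.cpu()])
            kept_labels.append(prediction[keep].cpu())
    return torch.cat(kept_images), torch.cat(kept_labels)

The resulting (image, pseudo label) pairs would then be mixed with the manually labeled fraction for the final training pass.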
Weakly-supervised learning: Weakly supervised learning refers to learning from coarse-grained labels or noisy labels. In an effort to minimize manual labeling, researchers exploited Instagram images that are posted by users with hashtags. These hashtags have associated images and form a good source of abundant data (3.5 billion images and 17,000 hashtags). The ConvNet for image recognition is pre-trained with a billion-image version of this large hashtag dataset and fine-tuned on the labeled ImageNet dataset. To refine the noisy hashtag labels, different label spaces are tested, such as hashtags mapped to ImageNet synsets, hashtags mapped to WordNet synsets, etc. [28]. The cost of obtaining weak supervision labels is much cheaper than the human labeling process.

Semi-weak supervised learning: Semi-weak supervised learning combines semi-supervised learning and weakly-supervised learning [27]. It uses a framework called the ''student–teacher'' network that combines both weak and semi-supervised learning. The ''teacher'' model is first pre-trained with hashtag images that have weak and noisy labels, also called the weakly supervised dataset. Further, the model is fine-tuned with the labeled ImageNet dataset and is then used to predict the softmax distribution over the weakly supervised dataset (hashtag images) that was used to pre-train the ''teacher'' model. Next, the target ''student'' model is pre-trained with the weakly supervised data using the refined labels from the teacher model. Finally, the labeled data is used to fine-tune the ''student'' model.

The goal of machine learning has always been to bridge the gap between human and machine level learning. To make machines even more intelligent, researchers proposed different learning strategies that mimic how humans learn visual concepts of rare instances or objects by just seeing them once. Recently, there is increasing interest in learning methods that learn novel concepts from limited data and also in learning methods that continually learn without forgetting prior knowledge. Following are the learning methods that are a step forward towards annotation-efficient learning.

Incremental Learning: The goal of incremental learning is to continuously learn and solve new tasks without forgetting the tasks learned in the past by leveraging the data that gets added with time. Incremental learning comes with different variants such as task incremental learning, class incremental learning, domain incremental learning, etc. In task incremental learning, the model can perform varied tasks right from image classification to object detection to image segmentation and then to
instance segmentation as data gets continuously updated with time. In class incremental learning, the model fixes a learning task, e.g., image classification, and then it goes from classifying samples of class A to class B and to class C and so on over all sets of classes. In domain incremental learning, the transition happens from a task in one domain (e.g., natural images) to another domain, let us say the medical domain. Some methods that achieve incremental learning are zero-shot learning [29], continuous updating of the training set, or using a fixed data representation.

Few-shot learning: The goal of few-shot learning is to effectively learn visual concepts from a training dataset that contains very few instances related to the novel classes. Given N novel classes and K samples in each class (K can be as small as one), the learner tries to learn visual concepts for instances that are exotic or rare. In this scenario, where we have limited training samples for each class, training a deep learning network from scratch results in poor model performance due to overfitting. A standard two-stage approach is adopted for few-shot learning [30]. In the first stage, a model is pre-trained on a dataset containing a large number of instances in every class (also referred to as base or train classes) to solve a classification task. In the second stage, the model is adapted by transferring the learned parameters from the pre-trained model, removing the last output layer, and adding a classifier that can classify an instance into one of the novel classes. Finally, the model is fine-tuned on the limited dataset containing novel classes to perform the actual downstream task of classifying the query into one of the novel classes. Though transfer learning achieves a significant improvement in performance over a model trained from scratch, the scarcity of data at the second stage leads to an overfitted model. Hence few-shot learning requires a transfer learning approach called meta-learning (a learn-to-learn approach) [31]. The idea of meta learning is not to supervise towards the right result but rather towards how the answer should behave. Few-shot learning with meta learning allows for the training of the learning algorithm itself (Stochastic Gradient Descent) instead of the classifier. To train the learning algorithm, a meta learner module fθ is implemented which gets optimized for solving the few-shot classification task by backpropagating through the classifier and the meta learner fθ to find gradients with respect to the parameters of the meta learner. The meta learner fθ is trained using training episodes (S, Q) from base class data (a large labeled dataset) by sampling a few base classes N, with S support examples per class and K query/test samples. The meta learner fθ then generates the classifier model that predicts the classification scores for the test samples. The classification loss (cross-entropy loss) is then minimized by optimizing the parameters of the meta learner by backpropagation. The meta learner fθ is then evaluated by keeping it fixed; it generates a model for the novel classes by using the train or support examples of the novel classes and predicting the novel class for the test/query samples. Recent progress in few-shot classification using meta learning has led to the use of unlabeled examples along with the labeled data within each training episode [32]. The authors adopted two approaches: one where all unlabeled examples are assumed to belong to the same set of classes as the labeled examples of the episode, as well as the more challenging situation where examples from other distractor classes are also provided. The experimental results show that this scheme learns to improve the prediction of novel classes due to the incorporation of unlabeled examples. In another recent work, the authors leveraged the vast amount of freely available unlabeled video data to perform the task of few-shot video classification. In this semi-supervised few-shot video classification task, millions of unlabeled examples are available for each episode during training [33].
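To make the episodic setup concrete, the short sketch below builds one N-way, K-shot episode. It is only an illustration in Python; features_by_class, a mapping from each base class to an array of per-image feature vectors, is a hypothetical input and not part of the cited methods.

import random
import numpy as np

def sample_episode(features_by_class, n_way=5, k_shot=1, q_queries=5):
    # Build one training episode: a support set S and a query set Q.
    classes = random.sample(list(features_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        feats = features_by_class[cls]
        idx = np.random.permutation(len(feats))[: k_shot + q_queries]
        for i in idx[:k_shot]:
            support.append((feats[i], episode_label))
        for i in idx[k_shot:]:
            query.append((feats[i], episode_label))
    return support, query

The meta learner is then trained, episode after episode, to classify the query set Q given the support set S.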
Though many new learning techniques have reduced the requirement of fine-grained labeling, a follow-up question is still asked: are these labels really required at all, given that scaling the manual labeling process to all the internet images is completely infeasible? Hence, a potential solution proposed by researchers is to learn visual image representations from a large unlabeled dataset by proposing various pretext tasks that are given to the network to solve [34]. The learned weights or the learned representations from the pre-trained model are then used as initialization for downstream computer vision tasks where only some annotations are available.

Unsupervised and self-supervised learning: Self-supervised models use supervisory signals from the partial input that is available to learn a better representation of the input. They leverage the underlying structure in the data to predict the unobserved or hidden part of the input. Whereas, in unsupervised learning, we have samples with no external signals or labels that guide the learning process. Hence calling self-supervised learning ''unsupervised'' is not accurate, as it uses far more feedback signals. Details about the structural semantics of the images can be learned better through self-supervised learning than through any other form of learning. The unlabeled data is humongous, and the amount of feedback provided by each sample is huge, which aids in learning better representations of the input. According to Misha Denil, unsupervised learning is thinking hard about the model and using whatever data fits the model, whereas in self-supervised learning, you think hard about the data and use whatever model fits.

On the other hand, supervised learning not only depends on annotated data but also suffers from issues such as generalization error, adversarial attacks, model brittleness, shortcuts, and spurious correlations. Moreover, the networks trained using supervised learning may not strive hard to learn generalized feature representations and can get away with memorizing the mapping between input and output, as the ground truth is always available. Sometimes the ConvNet starts classifying objects by mere texture without learning a rich representation of the object [35,36]. For example, if the texture of the object is scattered randomly around the image, the ConvNet still predicts the right object without extracting the object from its background. Hence, the supervisory signal can bias the network and lead it to work in an unexpected way, indicating that supervised models can be invariant to useful features required for structural understanding [37]. In a real sense, our model then becomes a texture detector rather than an object detector. Also, ConvNets trained using supervised learning are brittle when it comes to dealing with dynamic data. Once the supervised model is trained, it becomes difficult for the model to adapt to new data without forgetting the previous knowledge. Hence the supervised model needs to be re-trained with all the data again. With the infeasibility of using supervised learning in all deep learning tasks, researchers proclaim that the next AI revolution will not be supervised but self-supervised. Fig. 1 shows the supervised learning pipeline: the model learns visual features through the process of training the ConvNet with a large amount of labeled data (e.g. the ImageNet dataset). After the model is trained, the learned parameters serve as a pre-trained model and are transferred to the target task, like object detection or image segmentation, by fine-tuning the model. With pre-training, we can use less data and can take advantage of models that are already trained with millions of images. During transfer learning, only general features from the first few layers are transferred to target tasks. However, getting labels for a dataset of the size of ImageNet is quite expensive, and curating datasets of such size for different domains is a laborious task.
Fig. 1. Supervised learning pipeline: the learned features from the pre-trained model trained on a large labeled dataset (e.g. ImageNet) are transferred to the target task where a limited labeled dataset (e.g. Pascal) is available.
Fig. 2. ConvNet trained to generate automated labels using self-supervised learning, without humans annotating the input images.
1.2. Self-supervised learning

The key to unleashing unlabeled data is self-supervised learning, which has gained importance in recent years. The use of self-supervised learning is not new; it can be traced back to 1989 and Jürgen Schmidhuber, whose paper ''Making the World Differentiable'' explains how two self-supervised recurrent networks can interact to attack the fundamental credit assignment problem [38]. However, self-supervised learning gained momentum in recent years due to the explosion of a huge amount of unlabeled data. The primary objectives of self-supervised learning are: (1) to deploy state-of-the-art deep learning models with performance matching their supervised counterparts without relying on huge labeled datasets; (2) to learn generalized and semantically meaningful representations from unlabeled data that help during downstream tasks like image classification, image segmentation, object detection, etc.; (3) to harness the huge amount of data that is available for free by replacing supervised pre-training with self-supervised pre-training; and (4) to have a more practical approach to learning, as possessed by humans. Fig. 2 shows self-supervised learning where the ConvNet generates pseudo labels (e.g. the degree by which the input image is rotated) from the rotated input image by exposing the relationship between parts of the input data.

Concepts in self-supervised learning: We will now discuss various concepts related to self-supervised learning and its vocabulary.

Pretext or auxiliary task: To learn visual features from unlabeled data, certain tasks are pre-designed for the ConvNet to solve. The term ''pretext'' means that the task is done before the actual target task is undertaken, and its mere purpose is to learn generalizable representations of the input both at a low level and a high level, as shown in Fig. 3 [39]. For most of the pretext tasks, a part of the data is withheld or some transformation is applied, for which the network predicts the missing part or the correct transformation applied. The label generated by the network is a property of the data itself.

Pseudo labels: Pseudo labels are generated automatically based on the type of pretext task the network solves [39]. For example, if an input image is taken, a transform such as rotation is applied to it, and it is passed to the ConvNet, then the ConvNet predicts the property of the transform, i.e., the angle by which the image is rotated.
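To make the pseudo-label mechanism concrete, the following minimal sketch derives the labels directly from the applied transform. It assumes PyTorch tensors of shape [batch, channels, height, width]; the four-way rotation set is an illustrative choice rather than a prescription from the cited works.

import torch

def rotate_batch(images):
    # Return rotated copies of a batch together with automatically generated labels.
    rotated, labels = [], []
    for k in range(4):                               # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)     # the labels come from the data itself

Training the ConvNet with a standard cross-entropy loss on these automatically generated labels is what makes the task self-supervised.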
Pre-trained model: It is a ConvNet that contains the learned representations/useful behavior of the data, trained on a large unlabeled dataset using a pretext task. Mostly the model is pre-trained on ImageNet without labels on a pretext task and subsequently fine-tuned on a smaller labeled dataset [37].

Transfer learning: Transfer learning is the operation of transferring the pre-trained features from the pre-trained model to solve downstream tasks like image classification, object detection, image segmentation, etc. Through transfer learning, the model achieves higher accuracy with much less labeled data and requires less computation time than models that do not use transfer learning, whereas building the model from scratch, initialized with random weights, results in an inefficient solution
as the network starts from a point where it does not know anything [40]. Two major transfer learning scenarios that evaluate the learned representation are (1) a linear classifier (ConvNet as a fixed feature extractor) and (2) fine-tuning the ConvNet for downstream tasks.

Fig. 3. Self-supervised learning using a pretext task to learn a good representation of the input.

Linear classification: To evaluate the learned representation, a linear classifier is trained on top of the ConvNet trained on the large unlabeled dataset (ImageNet). The last fully connected layer of the pre-trained model is removed, and the rest of the ConvNet is frozen, on top of which a classifier is trained. Evaluation is often performed on the same dataset that was used to train the network on the pretext task. Typical datasets for linear classification include ImageNet, Places205, Pascal VOC07, COCO14, etc.

Fine-tuning the model: The second scheme to evaluate the learned representation is not only to replace and retrain the classifier on top of the ConvNet but also to fine-tune the weights of the pre-trained network through backpropagation. It is possible to fine-tune all the layers of the ConvNet, or we can keep some of the earlier layers fixed and only fine-tune the higher-level layers of the network.
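The two evaluation protocols can be summarized with the short sketch below. It assumes PyTorch and a torchvision ResNet-50 backbone whose weights would come from the self-supervised pre-training step; the layer names reflect the torchvision implementation rather than any specific paper, and both setups are shown on the same backbone only for brevity.

import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                       # pre-trained (pretext-task) weights would be loaded here
num_classes = 1000

# (1) Linear classification: freeze the ConvNet and train only a new linear head.
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # the new head stays trainable

# (2) Fine-tuning: keep the early layers fixed, update the deeper layers and the head.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith(("layer3", "layer4", "fc"))

In the first scenario only the new linear head is updated; in the second, the deeper layers adapt to the downstream dataset while the early, more general layers stay frozen.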
Type of architecture: The quality of the visual representations learned using a pretext task depends significantly on the type of ConvNet architecture used and the ability of the network to scale with increased data. The impact of the architecture is seen in the quality of the representations learned and in the accuracy of the downstream tasks. The popular convolutional architectures used for pre-training are ResNet-50 and its variants ResNet-50 v1 and ResNet-50 v2, as well as their scaled-up versions [34]. It has also been observed that a low-capacity model like AlexNet [14] does not show much improvement with more data as compared to ResNet.

Downstream tasks: These tasks are specific to the problem that defines what the model actually does (the primary task), whereas a pretext task is a secondary task undertaken to achieve the primary task. Many computer vision downstream tasks exist, such as image classification, object detection, image segmentation, etc. Table 1 shows the image datasets used for downstream tasks.

1.3. Self-supervised learning pipeline

The general pipeline of self-supervised learning is shown in Fig. 4. In the first stage, as shown in Fig. 4(a), the ConvNet is trained on a pretext task (e.g. image rotation) using a large corpus of unlabeled data. The network learns useful representations and predicts the degree by which the image is rotated. In the second stage, as shown in Fig. 4(b), the rotation prediction head from the pre-trained model is removed, the remaining network is kept fixed, and a linear classifier is trained on top of it with a new dataset where fewer labels are available. Complex downstream tasks such as image segmentation and object detection can also be performed by fine-tuning the entire network or a partial network (deeper layers) using backpropagation, as shown in Fig. 4(c).

2. Image representation or feature learning techniques using self-supervised learning

This section summarizes the early works on traditional unsupervised learning, and also the recent handcrafted pretext tasks and contrastive instance learning schemes designed to learn rich representations of the input. Broadly, the representations learned using self-supervised learning are categorized into six categories: reconstruction from a corrupted or partial image, reconstruction of the image from an altered view of the image, image generation, spatial context prediction, transformation prediction, and instance discrimination or contrastive learning and clustering based schemes.

2.1. Reconstruction from a corrupted or partial image

One of the early works in unsupervised learning is the classical autoencoder that learns representations by compressing the input image at a low-dimensional bottleneck layer. The encoder–decoder assumes that there exists a high degree of correlation/structure in the data. The encoder compresses the data into an intermediate representation and the decoder takes the intermediate representation and reconstructs the input image. Traditional autoencoders tend to underperform as they only compress the input without learning rich representations of the input [41]. The denoising autoencoder is an enhancement to the traditional autoencoder that prevents the network from learning an identity function by introducing pixel-level noise in the image; the network is forced to reconstruct the original image, as shown in Fig. 5. Another version of the denoising autoencoder is the stacked autoencoder [42], in which the layers of the autoencoders are stacked to initialize a deep architecture that encodes and decodes the data across various layers. As the layers are stacked, the model learns a better representation of the input. The downside of autoencoders is that the noise added to the images is random and unsystematic, which does not contribute to semantic learning of the input. Some improvements were brought by denoising autoencoders by corrupting the input, but the corruption was localized and random, which did not account for greater learning at the semantic level. The scheme also results in a train and evaluation gap, as training is done on noisy images and evaluation is done on clean images.
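A minimal denoising autoencoder in the spirit of this description could look as follows; this is a sketch assuming PyTorch and small 32 x 32 RGB inputs, and the architecture and noise level are illustrative choices rather than settings from the cited works.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                      # compress to a bottleneck
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                      # reconstruct the clean image
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def denoising_step(model, clean, noise_std=0.1):
    noisy = clean + noise_std * torch.randn_like(clean)    # pixel-level corruption
    return F.mse_loss(model(noisy), clean)                 # reconstruct the original image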
Table 1
Summary of commonly used image datasets used in downstream image recognition tasks.
Category Dataset Train size Classes References
Natural Caltech 101 3060 102 (L. Fei-Fei et al. 2004)
Natural CIFAR-10 50000 10 (Krizhevsky, 2009)
Natural CIFAR-100 50,000 100 (Krizhevsky & Hinton, 2009)
Natural DTD 3760 47 (Cimpoi et al., 2014)
Natural Flowers 102 2040 102 (Nilsback & Zisserman, 2008)
Natural Oxford-IIIT Pets 3680 37 (Parkhi et al., 2012)
Natural Sun 397 87,003 397 (Xiao et al., 2010)
Natural Caltech-UCSD 6033 200 (Lin et al. 2015)
Natural Stanford cars 8144 196 (Jonathan Krause et al. 2013)
Natural Food 101 120,216 251 (Bossard et al. 2014)
Natural FGVC Aircraft 3334 100 (Maji et al. 2013)
Natural Stanford Dogs 12,000 120 (Aditya Khosla et al. 2011)
Natural SVHN 73,257 10 (Netzer et al. 2011)
Fig. 4. The general two stage pipeline of self-supervised learning. First, the model is pre-trained on a pretext task (e.g., rotation) without the labels, and then the
linear classifier is trained on top of the ConvNet removing the prediction head and finally fine-tuned for complex downstream tasks with a small labeled dataset.
Fig. 5. Denoising autoencoder learns the compressed representation of the input at the encoder by exploiting the redundant information of the input image. The
decoder reconstructs the input image which is close to the original image by minimizing reconstruction loss.
Instead of formulating the problem as a compression task, many researchers formulated it as a context-based prediction task. An ad-hoc approach is implemented wherein the user designs a task for the ConvNet to solve that aims at learning a rich representation of the input. Image inpainting is one such pretext task that forces the network to predict a masked area of pixels from the other parts of the input [1]. The model is trained on a large number of unlabeled images with corresponding masked-out regions. Fig. 6 shows the masked input image passed to the encoder, which captures the semantic context of the image in a latent feature representation, and the decoder reconstructs the actual image by filling realistic image content into the masked region. To solve this task, the network must acquire a general understanding of the structure of different objects and their colors, and gain a deeper semantic understanding of the entire scene. This task will only be easy for the ConvNet to solve if it can recognize the object in the image. The loss function used in the scheme is the joint loss of a reconstruction objective (L2 loss) and an adversarial loss.
The reconstruction loss captures the overall structure of the missing region, and the adversarial loss is incorporated to get a sharp prediction of the masked region by picking a particular mode from the distribution. Viewed another way, the network can be thought of as a generative adversarial network, where the generator tries to generate images by filling the masked region and the discriminator tries to find the discrepancies between the predicted and actual image patches. Once the pre-training is complete, the decoder is removed and the learned representation is used for the downstream tasks. Though the scheme preserves the fine details of the image, reconstruction is tough and ambiguous as there are multiple ways to fill the missing region in the image. Moreover, there exists a train-evaluation gap as training is done on masked images and evaluation is done on non-masked images.

Fig. 6. Image inpainting pretext task of reconstructing the masked region of the input image. The ConvNet is trained on pairs of masked-original images by reducing the reconstruction and adversarial loss.
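The reconstruction part of the objective can be sketched as below, assuming PyTorch; the model stands for any encoder-decoder network, the central square mask is a simplification of the masking strategy, and the adversarial term would be supplied by a separate discriminator.

import torch.nn.functional as F

def inpainting_loss(model, images, mask_size=8):
    b, c, h, w = images.shape
    top, left = h // 2 - mask_size // 2, w // 2 - mask_size // 2
    masked = images.clone()
    masked[:, :, top:top + mask_size, left:left + mask_size] = 0.0   # hide the central region
    recon = model(masked)
    # L2 loss computed only on the masked region the network has to fill in
    return F.mse_loss(
        recon[:, :, top:top + mask_size, left:left + mask_size],
        images[:, :, top:top + mask_size, left:left + mask_size],
    )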
Another self-supervised learning approach proposed is predicting bag-of-visual-words (BoW), as shown in Fig. 7, inspired by the natural language processing method of predicting a bag of words. Instead of words, the scheme uses visual words that encode discrete visual concepts useful for downstream tasks like image classification and object detection [43]. Fig. 7(a) shows the unlabeled images fed to a pre-trained self-supervised feature extractor Φ̂(.) that extracts features across the entire dataset and clusters the similar features to form a vocabulary of visual words. Further, a histogram for each image x is created representing the count of the individual high-level and mid-level features present. Then, as a self-supervised task, the pre-trained ConvNet Φ̂(.) predicts the bag-of-words representation for the original image x. Correspondingly, another network Φ(.) is trained to predict the bag of visual words of a perturbed version x̃ of the image x, as shown in Fig. 7(b). Further, a cross-entropy loss is calculated between the predicted bag of words and the original bag of words y(x), which is backpropagated to refine the model. To solve this task, the ConvNet must learn to detect visual cues that are invariant to perturbations and also contextual features. The advantage of the scheme is that the representations learned are invariant to transformations, and it learns contextual reasoning skills by inferring the missing visual words of missing image regions. On the other hand, it loses low-level details and spatial information of the input image.

2.2. Reconstruction of the image from an altered view of the image

Another set of pretext tasks the ConvNet solves is to predict the correct view of the image from an altered view. Image colorization is one such pretext task that forces the ConvNet to predict the probable color (ab channels) of an image from its grayscale (L channel) version [44]. By solving this pretext task, the ConvNet learns an image representation by predicting the color values of an input 'grayscale' image. The network is trained on millions of pairs of colored and grayscale images, obtained at negligible cost as they are freely available. To solve the colorizing task, the network has to recognize the different objects present in the image and group related parts together to tint them with the same color. Therefore, a visual representation of the input image is learned in the process of performing the pretext task that is useful in performing downstream tasks. An illustration of a grayscale image with its predicted color image, using an encoder–decoder architecture based on a ConvNet, is shown in Fig. 8. Once the network is trained on the pretext task, the decoder part of the network is removed for performing downstream tasks. The downside of the scheme is that the reconstruction of the image is hard and ambiguous, as several possible solutions exist for coloring the image. As the color mapping is not deterministic, the network colors the object with a combination of all colors, which leads to a grayish colored image. Also, the network is forced to evaluate on grayscale images, resulting in a loss of information. Still, the scheme is useful when we want to color old grayscale films.
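Constructing the training pairs for colorization is straightforward; the sketch below assumes scikit-image for the RGB to Lab conversion and an H x W x 3 input array scaled to [0, 1], separating the L channel used as input from the ab channels used as the regression target.

import numpy as np
from skimage.color import rgb2lab

def make_colorization_pair(rgb):
    lab = rgb2lab(rgb)                    # L in [0, 100], ab roughly in [-128, 127]
    L = lab[..., :1] / 100.0              # network input: lightness channel
    ab = lab[..., 1:] / 128.0             # regression target: the two color channels
    return L.astype(np.float32), ab.astype(np.float32)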
Further to the above work, an enhancement was brought by a scheme called the split-brain autoencoder, trained on pairs of grayscale and colored images. Fig. 9 shows a given color image X that is split into grayscale channels X1 and color channels X2. The ConvNet is split into two distinct subnetworks F1 and F2. F1 predicts the color channel from the grayscale channel and F2 predicts the grayscale channel from the colored channel. The two complementary channels X̂1 and X̂2 are then aggregated to predict the reconstructed image of the original image, on which the cross-entropy loss is calculated [45]. The goal of performing such a pretext task is to induce representations that transfer well to the downstream tasks. The same setup can also be applied to images with depth and color, in which the pretext task forces the network to predict one from the other. This method ensures backward consistency by doing two-way prediction, from grayscale to color image and from color to grayscale image, and together the reconstructed image should be close to the original image. It is a challenging task, as the reconstruction is hard and ambiguous because there are multiple possible ways to color an image. However, recently authors have devised new schemes using variational autoencoders and latent variables for incorporating diverse colorization into the reconstructed images [47].

2.3. Image generation

Generation-based self-supervised methods are used for learning image representations and involve the process of generating images or high-resolution images using Generative Adversarial Networks. Most of the methods for image generation do not need any human-annotated labels. GANs learn to create realistic images that are similar to the input images but not present in the input dataset. The intuition behind this is that if we can get a model to generate a realistic image of a person's face, for example, then the model must have also learned a lot about human faces in general. In GANs, two networks compete with each other. A generator samples a z vector from a latent space to generate a reconstructed or a realistic fake image. On the other hand, the discriminator network competes with the generator and tries to distinguish between the fake samples coming from the generator and the real samples. Following this game-theoretic approach, the discriminator forces the generator to generate realistic images, while the generator forces the discriminator to improve its differentiation ability. During training, the two networks compete against each other and make each other stronger [48]. GAN has served as a foundational work that has helped in the creation of various successful architectures such as DCGAN [49], WGAN [50], WGAN-GP [51], Progressive GAN [52], SN-GAN [53], SAGAN [54], BiGAN [55], BigGAN [56], StyleGAN [57], LOGAN [58], etc. After initial success in using GANs for unsupervised learning, GANs have been surpassed by self-supervised approaches. One such unsupervised method is Large Scale Adversarial Representation Learning (BigBiGAN) for image generation and representation learning [46]. This method allows for the extraction of features in an unsupervised way from a Generative Adversarial Network and scales up previously existing algorithms such as BiGAN [55] and BigGAN, leading to an improved GAN model [56]. Traditional GANs take the latent variable and convert it into an actual image, but from the representation learning perspective, we have to go from the image to its latent space. BigBiGAN builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. Fig. 10 shows the generator part of the network G that samples z from a latent space to generate reconstructed images x̂, for which the discriminator tells if they are fake or real. The encoder E maps the images back to a latent space ẑ. The latent spaces ẑ ∼ E(x) and z ∼ Pz corresponding to different images form a feature space that can be used for downstream recognition tasks by training a logistic regression classifier using a dataset where fewer labels are available. On the other hand, the discriminator part of the network not only takes x and z pairs as input but also the joint distribution of x and z. The loss includes the aggregation of the unary data terms sx and sz, as well as the joint term sxz which ties the data and latent distributions together.
Fig. 8. Colorization as a pretext task, the ConvNet predicts the color image from a grayscale image.
Fig. 9. Split-Brain Autoencoder composed of two disjoint sub-networks F 1 and F 2, each trained to predict one channel from another. Network F 1 performs
automatic colorization, whereas network F 2 performs grayscale predictions. Combining the two channels give the predicted reconstructed image [45].
Fig. 10. Representation learning using BigBiGAN framework. The joint discriminator D calculates the loss ℓ. The inputs to the discriminator D are data-latent pairs,
either (x ∼ Px , ẑ ∼ E (x)), sampled from the data distribution Px and encoder E outputs, or (x̂ ∼ G (z), z ∼ Pz ), sampled from the generator G outputs and the latent
distribution Pz . The loss ℓ consists of the unary data term sx and the unary latent term sz , as well as the joint term sxz which ties the data and latent distributions [46].
The discriminator loss ℓ trains the network to distinguish between the two joint data-latent distributions coming from the encoder and the generator, pushing it to predict positive values for encoder input pairs (x, E(x)) and negative values for generator input pairs (G(z), z). Generating data in a particular domain necessitates that the model understands the semantics of the said domain. Despite the success of GANs, they face a few challenges: (a) they are harder to train because the parameters oscillate and rarely converge, and (b) the learning process is inhibited because the discriminator overpowers the generator and it fails to create real-like fakes. Another proposed work is to use a pretext task, e.g., rotation, for better GAN discriminators. The rotation pretext task encourages the discriminator to learn meaningful representations that are not forgotten during training [59].

2.4. Spatial context predictions

Spatial context predictions exploit the spatial relationships among image parts. Context prediction is one such pretext task that forces the network to predict the correct spatial orientation of two randomly chosen patches of an image [60]. The network is trained on pairs of an image patch and a neighboring patch that are chosen randomly to form a large, unlabeled corpus of data. The goal of training is to assign similar representations to semantically similar patches; for example, the representations of the ears of a cat coming from different images of cats should be semantically similar to each other. The network learns to associate semantically similar patches using the nearest-neighbor matching principle. The network is forced to predict the spatial arrangement of patches, for which the input image is divided into a 3 × 3 grid of non-overlapping patches. Fig. 11 shows the ConvNet that predicts the spatial arrangement of the two input patches. The inputs to the shared ConvNet are the two random image patches: one is the anchor patch and the other is the query patch. Given the two patches, the network predicts the relative position
Fig. 12. Jigsaw image puzzle task. An image containing non-overlapping patches in the original image on the left, the randomly permuted patches in the middle,
and the ConvNet predicts correct spatial arrangement of patches on the right.
Fig. 13. Visualization of Contrastive Predictive Coding for images. A PixelCNN autoregressive model is used to make the predictions of bottom patches from top
image patches [62].
Fig. 14. Illustration of self-supervised task of predicting the angle by which the image is rotated.
dataset will be negative to the anchor-positive pair. The objective is to maximize the similarity between two views of the same image and repel them from the views coming from different images [65]. Hence, contrastive learning can reason about multiple images at once, whereas jigsaw or rotation pretext tasks always reason about a single image independently. Moreover, the patch prediction tasks are not fine-grained due to the non-availability of negatives from other images. Recently, many contrastive learning methods at the instance level have been proposed which have shown promising results in learning good feature representations that help perform the downstream tasks. Despite being unsupervised, these schemes have outperformed supervised pre-training in learning image representations.

Concepts in Contrastive Learning: We will now discuss various concepts related to contrastive learning that aid in learning rich representations of the input at the instance level.

Objective of Contrastive Learning: To learn a representation or a feature space that pulls together representations that come from the same image and repels representations that come from different images.

Data Augmentation: Handcrafted pretext tasks for representation learning, such as dividing the images into patches, rotation, colorization or masking the images, etc., have been taken over by a bunch of automated augmentations in contrastive learning. The purpose of augmentation in contrastive learning is very different than in supervised learning, as the task is very different. We would not like the ConvNet to find an easy way of solving the contrastive task just by learning one feature; the network should be able to learn many different kinds of features before it does instance discrimination. A good augmentation strategy is the most important ingredient of contrastive learning that will force the network to learn rather than cheat. Augmentation of the images allows the ConvNet to learn rich and generalizable features in a self-supervised learning environment. The goal of data augmentation is to generate anchor, positive, and negative (APN) images that are used in contrastive learning. Augmenting the images makes the task harder for the ConvNet, as it cannot get away without learning a rich representation of the input. Also, contrastive learning needs more data augmentation than supervised learning due to the non-availability of the image labels. A well-tuned composition of augmentations stands out and leads to substantial gains in the performance of the downstream tasks [66]. Fig. 16 shows the typical data augmentation methods used for visual feature learning, such as color transformation, scaling, random cropping, flipping (horizontally, vertically), etc.
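A typical composition of such augmentations, producing the two views of an image that form an anchor-positive pair, can be sketched with torchvision as follows; the particular operations and probabilities loosely follow the recipe described above and are not taken from any single method.

from torchvision import transforms

contrastive_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # Each call samples independent transformations, yielding an anchor-positive pair.
    return contrastive_augment(pil_image), contrastive_augment(pil_image)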
Encoder: The encoder part of the network extracts the feature representations of the images. It takes two different augmented
Fig. 16. Typical data augmentation methods used for visual feature learning in contrastive learning using self-supervised setup [66].
Fig. 17. Nonlinear transformation from h to z and contrastive loss computed on the representation z [66].
Fig. 18. Discriminative unsupervised feature learning with Exemplar Convolutional Neural Networks. Exemplary patches sampled from the unlabeled dataset (top left).
Several random augmentations applied to the exemplary patch (bottom left). The original (''seed'') patch is marked in red. Given a distorted crop from an exemplary
patch, ConvNet classifies the distorted crop to one of the K surrogate classes [67].
Eq. (2) depicts the loss function for a positive pair of examples xi and xj, and the goal is to identify the positive pair of each zi and repel the others. Here N is the number of samples, 2N is the number of transformed views, 2(N − 1) is the number of negative pairs, and sim(zi, zj) represents the cosine similarity as discussed in Eq. (1). The term in the numerator corresponds to the positive pair and the terms in the denominator to the negative pairs. τ is a hyperparameter called the adjustable temperature parameter. 1 denotes the indicator function; the term 1[k≠i] evaluates to 1 if the condition k ≠ i holds and 0 otherwise.
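A compact implementation of this loss, written to match the description above, is given below as a sketch; it assumes PyTorch, with z1 and z2 being the projections of the two augmented views of a batch of N images.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # 2N unit-length vectors
    sim = z @ z.t() / tau                                     # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                         # the 1[k != i] indicator
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n),              # positive of i is i + N ...
                         torch.arange(0, n)])                 # ... and vice versa
    return F.cross_entropy(sim, targets.to(sim.device))

Minimizing this quantity pulls each zi towards its transformed counterpart and pushes it away from the remaining 2(N − 1) samples in the batch.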
2.6.1. Contrastive learning prediction schemes

Exemplar ConvNets [67] is one of the earliest methods to work at the instance discrimination level. Fig. 18 shows patches of size 32 × 32 sampled from the unlabeled dataset of images at varying positions and scales to form the initial training set. One of the input patches is selected and several random transformations are applied to give rise to many patches that vary in the degree of perturbation but not in terms of content. Similarly, this process is applied to all the cropped images of the dataset. All the distorted crops from a randomly sampled 'seed' image patch form a surrogate class. Now, as a self-supervised prediction task, given a distorted crop extracted from an image, the ConvNet classifies it into one of the K surrogate classes, as shown in Fig. 18. For the network to predict the right surrogate class, it needs to be invariant to any transformations related to geometry and color. One of the downsides of this scheme is that the number of surrogate classes is equal to the number of samples in the dataset, which results in a large number of classes and inhibits scalability. A revised scheme is to treat the task as a metric learning task where, given a cropped image, the representation learned should be similar to crops coming from the same image source and different from others.

Recent work on instance discrimination has raised the performance of self-supervised learning to be on par with supervised learning. The objective of contrastive learning is to make the embeddings of the query (anchor image) similar to the positive key and dissimilar to the negative key embeddings. Each input image is split into a query and a key formed by performing two different sets of augmentations on the image. The authors have suggested various methods for handling negatives, such as an end-to-end mechanism, a memory bank, and a momentum encoder. In the end-to-end mechanism, as shown in Fig. 19, the queries (original samples) are passed through the query encoder and the keys (augmented versions of positive and negative samples) are passed through the key encoder (the same shared encoder for both queries and keys), which produces the embeddings for both the queries and the keys [70,71]. The loss is calculated over the different query-key pairs, and the shared encoder is updated by backpropagating through all the samples during training, maintaining consistency between the queries and keys. One of the downsides of the approach is that the number of negatives is limited by the GPU memory, as the batch size cannot be larger than the GPU memory.

The extension to the above work is to use a memory bank, which is much more memory efficient than using large batch
Fig. 23. Pretext-Invariant Representation Learning (PIRL). Given an input image I, a pretext task t of rotation and jigsaw is applied to give the transformed image
I t . I and I t are sent to the shared ConvNet resulting in the feature embeddings. The network learns representations that are invariant to the transformation t and
retains semantic information by keeping the representations of the image I and its transformed counterpart I t close together and distancing from others [75].
Fig. 24. Contrastive instance learning (left) vs. SwAV (right). Swapping Assignments between Views (SwAV), first the codes are obtained by assigning features to
prototype vectors. Then, to solve a ''swapped'' prediction problem, the codes from one data-augmented view are predicted using the other view. Thus, SwAV does not
directly compare image features unlike contrastive instance learning. Prototype vectors are learned along with the ConvNet parameters by backpropagation [77].
clustering mechanism with contrastive learning. The researchers have proposed a self-supervised approach to learn features by Swapping Assignments between multiple Views of the same image (SwAV) [77,80]. It uses an online clustering mechanism to learn better representations by grouping similar features together and comparing representations with cluster centroids. The objective is not only to make the positive pairs of samples close to each other, but also to make sure that all other features that are similar to each other club together. As a result, negative comparisons with all the images lying in the large mini-batch are avoided, thereby reducing the computational overhead. For example, in a feature space, the features of sheep should be closer to the features of goats (as both are animals) but should be far from the features of cars. Fig. 24 shows the SwAV framework: the ResNet-50 network receives different augmented views of the same image (there can be more than two views), generating the embeddings. Each embedding vector then goes to a shallow non-linear network fθ that produces a projection vector denoted by Z. The representations or features generated are not directly compared to each other, as in contrastive learning. Rather, the features are mapped to their nearest neighbors in a set of K trainable prototype vectors (C = [c1, . . . , cK]). This method maps the feature encodings of the augmented views of images onto a discrete codebook C containing the set of prototype vectors or clusters [c1, . . . , cK] and uses it to look up the codes that are most similar to the features. Then, as a swapped prediction problem, the code Q of one view of an image is predicted from the representation Z of another view of the same image. The online clustering problem is treated as an optimal transport problem using the Sinkhorn–Knopp algorithm, which enforces an equipartition constraint when assigning samples to clusters [81].
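The swapped prediction step, combined with a simplified Sinkhorn–Knopp normalization, can be sketched as follows; it assumes PyTorch, with z_s and z_t being L2-normalized projections of two views and prototypes the trainable K x d matrix of prototype vectors, and it is an illustration of the idea rather than the exact SwAV implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    # Turn prototype scores into soft codes under an equipartition constraint.
    q = torch.exp(scores / eps).t()                           # [K, B]
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= q.size(0)       # normalize over prototypes
        q /= q.sum(dim=0, keepdim=True); q /= q.size(1)       # normalize over samples
    return (q * q.size(1)).t()                                # [B, K] codes

def swav_loss(z_s, z_t, prototypes, temperature=0.1):
    p_s = z_s @ prototypes.t()                                # scores of view s against prototypes
    p_t = z_t @ prototypes.t()
    q_s, q_t = sinkhorn(p_s), sinkhorn(p_t)                   # codes for each view
    loss_ts = -(q_s * F.log_softmax(p_t / temperature, dim=1)).sum(dim=1).mean()
    loss_st = -(q_t * F.log_softmax(p_s / temperature, dim=1)).sum(dim=1).mean()
    return loss_ts + loss_st                                  # predict each view's code from the other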
network receives different augmented views of the same image
Now as a swapped prediction problem, the code of the image is
(views can be more than two) generating the embedding. This
predicted from another view of the same image. The goal is to
embedding vector then goes to a shallow non-linear network fθ
that produces a projection vector denoted by Z . The represen- minimize the cross-entropy loss between the two views of the
tation or features generated are not directly compared to each same image. If two different views of the same image contain
other, like in contrastive learning. Rather mapping of features is similar information, then it should be possible to predict its code
done to their nearest neighbor in a set of K trainable prototype from one or the other feature. Recently, a self-supervised method
vectors (C = [c1, . . . , cK ]). This method then maps feature en- based on SwAV is proposed called SEER that works with high
coding of the augmented views of images into discrete codebook dimensional complex data. The model is pre-trained on billions of
C containing the set of prototype vectors or clusters [c1, . . . , cK ] random, unlabeled and uncurated public Instagram images, and
and use it to look up to the codes that are most similar to the is fine-tuned on ImageNet in a supervised fashion. SEER outper-
features. Then as a swapped prediction problem, the code Q of forms the state-of-the-art self-supervised models, attaining 84.2
the view of an image is predicted from the representation Z of percent top-1 accuracy on ImageNet [82].
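To make the code assignment and the swapped prediction loss concrete, the following is a minimal PyTorch-style sketch of the Sinkhorn–Knopp code computation and the swapped cross-entropy objective. The function names, number of Sinkhorn iterations, temperature, and toy dimensions are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch of SwAV-style code assignment and swapped-prediction loss (illustrative only).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Sinkhorn-Knopp iterations: turn prototype scores (B x K) into soft codes Q
    whose prototypes are (approximately) equally used - the equipartition constraint."""
    Q = torch.exp(scores / epsilon).t()          # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)          # normalize rows (prototypes)
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)          # normalize columns (samples)
        Q /= B
    return (Q * B).t()                           # B x K, each row sums to 1

def swav_loss(z_s, z_t, prototypes, temperature=0.1):
    """Swapped prediction: the code of one view is predicted from the other view."""
    z_s, z_t = F.normalize(z_s, dim=1), F.normalize(z_t, dim=1)
    c = F.normalize(prototypes, dim=1)           # K x D prototype vectors
    p_s, p_t = z_s @ c.t(), z_t @ c.t()          # similarity scores, B x K
    q_s, q_t = sinkhorn(p_s), sinkhorn(p_t)      # codes (targets, no gradient)
    loss_s = -(q_t * F.log_softmax(p_s / temperature, dim=1)).sum(dim=1).mean()
    loss_t = -(q_s * F.log_softmax(p_t / temperature, dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_s + loss_t)

# Toy usage: 8 samples, 128-d projections, 32 trainable prototypes.
z_s, z_t = torch.randn(8, 128), torch.randn(8, 128)
prototypes = torch.randn(32, 128, requires_grad=True)
print(swav_loss(z_s, z_t, prototypes))
```

In this sketch the prototypes play the role of the trainable codebook C: they receive gradients through the softmax scores, while the codes produced by the Sinkhorn–Knopp step are treated as fixed targets.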
Fig. 25. Setting up the swapped prediction problem between two separate views of the same image [77].
Table 3
Linear classification on ImageNet. Top-1 accuracy of linear classifiers trained on frozen features from different self-supervised methods, using a standard ResNet-50 with 24M parameters.
Method Arch. Param. (M) Top-1 (%)
Supervised R50 24 76.5
Colorization [44] R50 24 39.6
Jigsaw [61] R50 24 45.7
NPID [72] R50 24 54
BigBiGAN [46] R50 24 56.6
LA [89] R50 24 58.8
NPID++ [75] R50 24 59
MoCo [73] R50 24 60.6
SeLa [91] R50 24 61.5
PIRL [75] R50 24 63.6
CPC [62] R50 24 63.8
PCL [62] R50 24 65.9
SimCLR [66] R50 24 70
MoCov2 [73] R50 24 71.1
SwAV [77] R50 24 75.3
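As a rough illustration of the linear evaluation protocol behind Tables 3 and 4, the sketch below freezes a ResNet-50 backbone and trains only a linear classifier on its features. The checkpoint name, learning rate, and training-step helper are hypothetical placeholders, not the exact setup of the cited works.

```python
# Sketch of linear evaluation on frozen features (illustrative only).
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(pretrained=False)
# state = torch.load("ssl_pretrained_resnet50.pth")   # hypothetical SSL checkpoint
# backbone.load_state_dict(state, strict=False)
backbone.fc = nn.Identity()                  # expose the 2048-d features
for p in backbone.parameters():              # backbone stays frozen
    p.requires_grad = False
backbone.eval()

linear_head = nn.Linear(2048, 1000)          # 1000 ImageNet classes
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step of the linear head on top of frozen features."""
    with torch.no_grad():
        feats = backbone(images)
    logits = linear_head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the linear head is optimized, so the reported Top-1 accuracy reflects how linearly separable the frozen self-supervised features are.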
Table 4
Linear classification performance on four datasets (ImageNet, VOC07, Places205, iNaturalist) using the setup of [92]. The linear classifiers are trained on image representations generated by a ConvNet trained on ImageNet (without labels) using a self-supervised approach. Numbers with † are measured using 10-crop evaluation.
Method Parameters Transfer dataset
ImageNet VOC07 Places205 iNat.
ResNet-50 using evaluation setup of [92]
Supervised 25.6M 75.9 87.5 51.5 45.4
Colorization [92] 25.6M 39.6 55.6 37.5 –
Rotation [2] 25.6M 48.9 63.9 41.4 23.0
NPID++ [72] 25.6M 59.0 76.6 46.4 32.4
MoCo [71] 25.6M 60.6 – – –
Jigsaw [61] 25.6M 45.7 64.5 41.2 21.3
PIRL [75] 25.6M 63.6 81.1 49.8 34.1
SwAV [77] 25.6M 75.3 88.9 56.7 48.6
Different architecture or evaluation setup
NPID [72] 25.6M 54.0 – 45.5 –
BigBiGAN [46] 25.6M 56.6 – – –
AET [64] 61M 40.6 – 37.1 –
DeepCluster [80] 61M 39.8 – 37.5 –
Rot. [34] 61M 54.0 – 45.5 –
LA [89] 25.6M 60.2† – 50.2† –
CMC [93] 51M 64.1 – – –
CPC [62] 44.5M 48.7 – – –
CPC-Huge [94] 305M 61.0 – – –
BigBiGAN-Big [46] 86M 61.3 – – –
AMDIM [95] 670M 68.1 – 55.1 –
Fig. 26. ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised schemes. The pre-trained model is trained on the ImageNet dataset without labels. Results show further gains in accuracy as the number of parameters is increased beyond 24M by widening the ResNet by factors of ×2, ×4, and ×5 for both SimCLR and SwAV; however, SwAV outperforms SimCLR as the number of parameters increases [77].
4. Practical considerations
Table 5
Top-1 linear classifier accuracy on ImageNet on top of frozen features from a ResNet-50 pre-trained using a self-supervised learning approach. The performance of the self-supervised models is studied as a function of the MLP head, augmentation (multi-crop), cosine learning-rate decay (cos), number of epochs, and batch size. 2 × 160 + 4 × 96 indicates 2 crops of size 160 × 160 and 4 crops of size 96 × 96.
Method Unsupervised pre-training
MLP Multi-crop cos epochs batch ImageNet
top-1 accuracy
MoCo v2 2 × 224 200 256 67.5
SimCLR 2 × 224 200 256 61.9
SimCLR 2 × 224 200 8192 66.6
SwAV 2 × 160 + 4 × 96 200 256 72.0
SwAV 2 × 224 + 6 × 96 200 256 72.7
Results of longer unsupervised pre-training
MoCo v2 2 × 224 800 256 71.1
SimCLR 2 × 224 1000 4096 69.3
PIRL 2 × 224 800 1024 63.6
SwAV 2 × 224 + 6 × 96 400 256 74.3
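For intuition about the crop specifications in Table 5 (e.g., 2 × 224 + 6 × 96), here is a small torchvision sketch that builds two global and six local crops of one image. The scale ranges and flip augmentation are illustrative assumptions, not the exact multi-crop recipe of the cited papers.

```python
# Sketch of multi-crop augmentation: a few high-resolution "global" crops plus
# several low-resolution "local" crops of the same image (illustrative settings).
from PIL import Image
import torchvision.transforms as T

def build_multi_crop(n_global=2, n_local=6):
    global_t = T.Compose([
        T.RandomResizedCrop(224, scale=(0.4, 1.0)),   # global view, 224 x 224
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    local_t = T.Compose([
        T.RandomResizedCrop(96, scale=(0.05, 0.4)),   # local view, 96 x 96
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    def transform(img):
        crops = [global_t(img) for _ in range(n_global)]
        crops += [local_t(img) for _ in range(n_local)]
        return crops                                   # list of 2 + 6 tensors
    return transform

# Toy usage on a synthetic image.
multi_crop = build_multi_crop()
views = multi_crop(Image.new("RGB", (256, 256)))
print([v.shape for v in views])   # two 3x224x224 and six 3x96x96 tensors
```

Because the local crops are much smaller, adding them increases the number of views per image at only a modest extra compute cost.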
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, Kilian Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[20] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[21] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al., OpenImages: A public dataset for large-scale multi-label and multi-class image classification, 2017, Dataset available from https://github.com/openimages, 2(3), 2-3.
[22] Ameet V. Joshi, Amazon's machine learning toolkit: Sagemaker, in: Machine Learning and Artificial Intelligence, Springer, 2020, pp. 233–243.
[23] Joao Carreira, Andrew Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[24] Yanming Guo, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, Michael S. Lew, Deep learning for visual understanding: A review, Neurocomputing 187 (2016) 27–48.
[25] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio, Transfusion: Understanding transfer learning for medical imaging, in: Advances in Neural Information Processing Systems, 2019, pp. 3347–3357.
[26] Olivier Chapelle, Bernhard Scholkopf, Alexander Zien, Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews], IEEE Trans. Neural Netw. 20 (3) (2009) 542–542.
[27] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan, Billion-scale semi-supervised learning for image classification, 2019, arXiv preprint arXiv:1905.00546.
[28] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten, Exploring the limits of weakly supervised pretraining, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 181–196.
[29] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, Zeynep Akata, Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly, IEEE Trans. Pattern Anal. Mach. Intell. 41 (9) (2018) 2251–2265.
[30] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, Jia-Bin Huang, A closer look at few-shot classification, 2019, arXiv preprint arXiv:1904.04232.
[31] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele, Meta-transfer learning for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 403–412.
[32] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, Richard S. Zemel, Meta-learning for semi-supervised few-shot classification, 2018, arXiv preprint arXiv:1803.00676.
[33] Linchao Zhu, Yi Yang, Label independent memory for semi-supervised few-shot video classification, IEEE Ann. Hist. Comput. (01) (2020) 1–1.
[34] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer, Revisiting self-supervised visual representation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1920–1929.
[35] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, Texture and art with deep neural networks, Curr. Opin. Neurobiol. 46 (2017) 178–186.
[36] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel, ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 2018, arXiv preprint arXiv:1811.12231.
[37] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al., A large-scale study of representation learning with the visual task adaptation benchmark, 2019, arXiv preprint arXiv:1910.04867.
[38] Jürgen Schmidhuber, Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments, 1990.
[39] Longlong Jing, Yingli Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Trans. Pattern Anal. Mach. Intell. (2020).
[40] Dumitru Erhan, Aaron Courville, Yoshua Bengio, Pascal Vincent, Why does unsupervised pre-training help deep learning? in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 201–208.
[41] Yoshua Bengio, Learning deep architectures for AI, Now Publishers Inc, 2009.
[42] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
[43] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, Matthieu Cord, Learning representations by predicting bags of visual words, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6928–6938.
[44] Richard Zhang, Phillip Isola, Alexei A. Efros, Colorful image colorization, in: European Conference on Computer Vision, Springer, 2016, pp. 649–666.
[45] Richard Zhang, Phillip Isola, Alexei A. Efros, Split-brain autoencoders: Unsupervised learning by cross-channel prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1058–1067.
[46] Jeff Donahue, Karen Simonyan, Large scale adversarial representation learning, 2019, arXiv preprint arXiv:1907.02544.
[47] Carl Doersch, Tutorial on variational autoencoders, 2016, arXiv preprint arXiv:1606.05908.
[48] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative adversarial networks, 2014, arXiv preprint arXiv:1406.2661.
[49] Alec Radford, Luke Metz, Soumith Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2015, arXiv preprint arXiv:1511.06434.
[50] Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223.
[51] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved training of Wasserstein GANs, 2017, arXiv preprint arXiv:1704.00028.
[52] Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, 2017, arXiv preprint arXiv:1710.10196.
[53] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, Spectral normalization for generative adversarial networks, 2018, arXiv preprint arXiv:1802.05957.
[54] Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 7354–7363.
[55] Jeff Donahue, Philipp Krähenbühl, Trevor Darrell, Adversarial feature learning, 2016, arXiv preprint arXiv:1605.09782.
[56] Andrew Brock, Jeff Donahue, Karen Simonyan, Large scale GAN training for high fidelity natural image synthesis, 2018, arXiv preprint arXiv:1809.11096.
[57] Tero Karras, Samuli Laine, Timo Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[58] Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, Timothy Lillicrap, LOGAN: Latent optimisation for generative adversarial networks, 2019, arXiv preprint arXiv:1912.00953.
[59] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby, Self-supervised GANs via auxiliary rotation loss, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12154–12163.
[60] Carl Doersch, Abhinav Gupta, Alexei A. Efros, Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
[61] Mehdi Noroozi, Paolo Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: European Conference on Computer Vision, Springer, 2016, pp. 69–84.
[62] Aaron van den Oord, Yazhe Li, Oriol Vinyals, Representation learning with contrastive predictive coding, 2018, arXiv preprint arXiv:1807.03748.
[63] Michael Gutmann, Aapo Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[64] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo, AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2547–2555.
[65] William Falcon, Kyunghyun Cho, A framework for contrastive self-supervised learning and designing a new approach, 2020, arXiv preprint arXiv:2009.00104.
[66] Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton, A simple framework for contrastive learning of visual representations, 2020, arXiv preprint arXiv:2002.05709.
[67] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, Thomas Brox, Discriminative unsupervised feature learning with exemplar convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (9) (2015) 1734–1747.
[68] Matthew Schultz, Thorsten Joachims, Learning a distance metric from relative comparisons, Adv. Neural Inform. Process. Syst. 16 (2004) 41–48.
[69] Kihyuk Sohn, Improved deep metric learning with multi-class n-pair loss objective, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1857–1865.
[70] Raia Hadsell, Sumit Chopra, Yann LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'06, 2, IEEE, 2006, pp. 1735–1742.
[71] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[72] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, Dahua Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[73] Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He, Improved baselines with momentum contrastive learning, 2020, arXiv preprint arXiv:2003.04297.
[74] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton, Big self-supervised models are strong semi-supervised learners, 2020, arXiv preprint arXiv:2006.10029.
[75] Ishan Misra, Laurens van der Maaten, Self-supervised learning of pretext-invariant representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
[76] Ilya Loshchilov, Frank Hutter, SGDR: Stochastic gradient descent with warm restarts, 2016, arXiv preprint arXiv:1608.03983.
[77] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin, Unsupervised learning of visual features by contrasting cluster assignments, 2020, arXiv preprint arXiv:2006.09882.
[78] Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze, Deep clustering for unsupervised learning of visual features, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 132–149.
[79] Yuki Markus Asano, Christian Rupprecht, Andrea Vedaldi, Self-labelling via simultaneous clustering and representation learning, 2019, arXiv preprint arXiv:1911.05371.
[80] Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze, Deep clustering for unsupervised learning of visual features, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 132–149.
[81] Marco Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, in: Advances in Neural Information Processing Systems, 2013.
[82] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al., Self-supervised pretraining of visual features in the wild, 2021, arXiv preprint arXiv:2103.01988.
[83] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, Fillia Makedon, A survey on contrastive self-supervised learning, Technologies 9 (1) (2021) 2.
[84] Piotr Bojanowski, Armand Joulin, Unsupervised learning by predicting noise, in: International Conference on Machine Learning, PMLR, 2017, pp. 517–526.
[85] Diederik P. Kingma, Jimmy Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[86] Aravind Srinivas, Michael Laskin, Pieter Abbeel, CURL: Contrastive unsupervised representations for reinforcement learning, 2020, arXiv preprint arXiv:2004.04136.
[87] Hakim Hafidi, Mounir Ghogho, Philippe Ciblat, Ananthram Swami, GraphCL: Contrastive self-supervised learning of graph representations, 2020, arXiv preprint arXiv:2007.08025.
[88] Yang You, Igor Gitman, Boris Ginsburg, Large batch training of convolutional networks, 2017, arXiv preprint arXiv:1708.03888.
[89] Chengxu Zhuang, Alex Lin Zhai, Daniel Yamins, Local aggregation for unsupervised learning of visual embeddings, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6002–6012.
[90] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár, Designing network design spaces, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10428–10436.
[91] Yuki Markus Asano, Christian Rupprecht, Andrea Vedaldi, Self-labelling via simultaneous clustering and representation learning, 2019, arXiv preprint arXiv:1911.05371.
[92] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra, Scaling and benchmarking self-supervised visual representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6391–6400.
[93] Yonglong Tian, Dilip Krishnan, Phillip Isola, Contrastive multiview coding, 2019, arXiv preprint arXiv:1906.05849.
[94] Olivier Henaff, Data-efficient image recognition with contrastive predictive coding, in: International Conference on Machine Learning, PMLR, 2020, pp. 4182–4192.
[95] Philip Bachman, R. Devon Hjelm, William Buchwalter, Learning representations by maximizing mutual information across views, 2019, arXiv preprint arXiv:1906.00910.
[96] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola, What makes for good views for contrastive learning, 2020, arXiv preprint arXiv:2005.10243.
[97] Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, Philipp Krahenbuhl, Sampling matters in deep embedding learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
[98] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al., Bootstrap your own latent: A new approach to self-supervised learning, 2020, arXiv preprint arXiv:2006.07733.
[99] Yuki M. Asano, Christian Rupprecht, Andrea Vedaldi, A critical analysis of self-supervision, or what we can learn from a single image, 2019, arXiv preprint arXiv:1904.13132.
[100] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom, nuScenes: A multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[101] Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A.W.M. Van Der Laak, Meyke Hermsen, Quirine F. Manson, Maschenka Balkenhol, et al., Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer, JAMA 318 (22) (2017) 2199–2210.