
Knowledge-Based Systems 224 (2021) 107090


Review on self-supervised image recognition using deep neural networks

Kriti Ohri ∗, Mukesh Kumar
Department of CSE, National Institute of Technology Patna, Bihar 800005, India
∗ Corresponding author. E-mail addresses: [email protected] (K. Ohri), [email protected] (M. Kumar).
https://doi.org/10.1016/j.knosys.2021.107090

ARTICLE INFO

Article history:
Received 19 October 2020
Received in revised form 14 April 2021
Accepted 26 April 2021
Available online 29 April 2021

Keywords:
Self-supervised learning
Unsupervised learning
Semi-supervised learning
Transfer learning
Deep learning
Pretext tasks
Convolutional neural network
Contrastive learning
Online clustering

ABSTRACT

Deep learning has brought significant developments in image understanding tasks such as object detection, image classification, and image segmentation. But the success of image recognition largely relies on supervised learning, which requires a huge number of human-annotated labels. To avoid the costly collection of labeled data, and in domains where very few standard pre-trained models exist, self-supervised learning comes to our rescue. Self-supervised learning is a form of unsupervised learning that allows the network to learn rich visual features that help in performing downstream computer vision tasks such as image classification, object detection, and image segmentation. This paper provides a thorough review of self-supervised learning, which has the potential to revolutionize the computer vision field using unlabeled data. First, the motivation of self-supervised learning is discussed, along with other annotation-efficient learning schemes. Then, the general pipeline for supervised learning and self-supervised learning is illustrated. Next, various handcrafted pretext tasks are explained that enable learning of visual features using unlabeled image datasets. The paper also highlights the recent breakthroughs in self-supervised learning using contrastive learning and clustering methods that are outperforming supervised learning. Finally, we present performance comparisons of self-supervised techniques on evaluation tasks such as image classification and detection. In the end, the paper is concluded with practical considerations and open challenges of image recognition tasks in the self-supervised learning regime. From the onset of the review paper, the core focus is on visual feature learning from images using self-supervised approaches.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

The essence of learning is not always direct supervision but the ability to predict with limited external guidance. Self-supervised learning is a form of unsupervised learning where the data itself provides a strong supervisory signal that enables a Convolutional Neural Network (ConvNet) to capture intricate dependencies in data without the need for external labels. Essentially, a self-supervised learning task is formulated from a large unlabeled corpus of images and the ConvNet is trained to learn some task (pretext task) designed by the user. For most pretext tasks, the ConvNet predicts a masked area of the image [1] or predicts the correct angle by which the image is rotated [2], etc. The representations learned on the pretext task from the encoder part of the ConvNet are subsequently used for downstream tasks where limited annotated data is available. Self-supervised learning approaches are successful in developing standard pre-trained models for natural language processing like BERT [3], ULMFiT [4], Word2Vec [5], GloVe [6], fastText [7], RoBERTa [8], XLM-R [9], and T5 [10], but less so in computer vision tasks because of the high-dimensional continuous space that images occupy. Motivated by BERT's success in NLP as a self-supervised learning model, ActBERT, a method to learn video-text pairs in a self-supervised way, is proposed [11]. It enables learning of joint video-text representations from a large unlabeled video dataset. The earlier models used linguistic features for video-text joint modeling, whereas ActBERT leverages three sources of information for cross-modal pre-training: action-related features, region features, and language embeddings. The model is pre-trained with HowTo100M, a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos. The pre-trained model is evaluated on five downstream tasks, namely text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. Results show that ActBERT outperforms its counterpart supervised models in learning video-text representations.

Recently, various self-supervised approaches using pretext tasks have surged in the field of computer vision. These pretext tasks designed by the user help in learning rich representations from the input images.


Learning representations of the input signal makes it easier to extract useful characteristics of the data that help in building classifiers or other predictors on top of ConvNets [12]. The cognitive motivation behind self-supervised learning is how infants learn; they are not always presented with the correct answers. Within 3 to 4 months of birth, infants have meaningful expectations about the world around them [13]. Through observation, common sense, and minimal interaction, infants become capable of self-learning. The environment around the infants becomes a source of supervision that helps in developing a general understanding of how things work without constant supervision. A similar concept is mimicked through self-supervised learning in machines, where the data itself contains inherent features that provide supervision for training the model rather than annotated labels instructing the network of what is right and what is wrong. Hence, the goal of self-supervised learning is to learn representations of the input without the labels that transfer well to the downstream tasks where few labels are available.

There are many popular ConvNet architectures like AlexNet [14], VGG [15], GoogLeNet [16], ResNet [17], DenseNet [18], InceptionV3 [19], etc. which have achieved state-of-the-art performance on various image recognition tasks and are trained on a large labeled dataset. These models are widely used as pre-trained models and are fine-tuned for target tasks which the user expects the network to perform. These pre-trained models have yielded state-of-the-art performance on standard labeled datasets such as ImageNet (1.4 million images with 1000 categories) [20] and OpenImage (59.9 million images with 19,957 categories) [21]. The success of these sophisticated architectures relies on large labeled datasets on which they are trained for several GPU hours per day. But in reality, building huge labeled datasets is a challenging task and requires intense manual effort. The ImageNet dataset contains around 14M labeled images and it took around 22 human years to develop. Though we have publicly available image labeling tools like Amazon SageMaker that guide a human labeler step-by-step to label images, audio, text, etc., this comes with an additional expense [22]. The situation is even more complex and cumbersome for video datasets, which are more expensive to label due to the addition of spatial and temporal information. The Kinetics dataset [23], which is used to train ConvNets for human motion or action recognition, contains 500,000 videos belonging to 600 categories, and each video lasts for 10 s. In addition to labeling challenges, learning a good representation of the input is an additional reason to look for an alternative solution to supervised learning [24]. Moreover, in a domain like medical imaging, it is hard to obtain annotations because of privacy concerns and also not knowing what exactly to annotate. Hence, a pre-trained model trained on a large unlabeled dataset can be used in such a domain. However, the benefit of pre-training and fine-tuning on downstream tasks is reduced when the downstream task images belong to a completely different domain than the images that were used for pre-training the network [25].

This paper provides an extensive compilation of different ideas brought forth by researchers that help in building a pre-trained model using intrinsic properties of the data rather than a labeled dataset. The paper highlights the breakthroughs in self-supervised learning methods that will give a head start to budding researchers to explore self-supervised learning, which will serve as a better alternative in the long run. Self-supervised learning is a step forward for building background knowledge and instilling common sense in machines, which remains an open challenge in AI since its inception. The paper attempts to summarize and evaluate various self-supervised methods that target learning better representations from the input without labels. The learned representations aid the downstream task (actual task) where a small labeled dataset is available. The ongoing research in self-supervised learning indicates that we can bring a self-supervised learning paradigm shift to computer vision without relying much on labeled data.

1.1. Different learning schemes

The researchers are always in a quest to formulate different learning schemes that rely on minimal labeled data. Following are the various learning methods that require minimal fine-grained labeling:

Semi-supervised learning: Semi-supervised learning methods use a combination of supervised and unsupervised learning. The model is first trained on a fraction of the dataset that is labeled manually [26]. Once the model is trained, it is used to predict the remaining portion of the unlabeled dataset. At last, the network is trained on the full dataset comprising manually labeled data and pseudo-labeled data. Another setup of semi-supervised learning is to train a large-capacity model called the "teacher" model with a large labeled dataset. This model is then used to predict labels of an unlabeled dataset. The predicted examples are ranked against each concept class and the top-scoring examples are used for pretraining the target model called the "student" model. The final step is to fine-tune the student model with all the available labeled data. It is found that such models have higher accuracy compared with the target model trained only on labeled data [27].
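The basic pseudo-labeling loop described above can be sketched in a few lines of PyTorch. The sketch below assumes a generic image classifier and pre-built dataset objects; the confidence threshold, optimizer settings, and data handles are illustrative placeholders rather than the exact recipes of [26,27].

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, ConcatDataset, TensorDataset

def pseudo_label(model, unlabeled_loader, threshold=0.9, device="cpu"):
    """Predict labels for unlabeled images and keep only the confident predictions."""
    model.eval()
    images, labels = [], []
    with torch.no_grad():
        for x in unlabeled_loader:              # loader is assumed to yield image batches only
            x = x.to(device)
            probs = F.softmax(model(x), dim=1)
            conf, pred = probs.max(dim=1)
            keep = conf > threshold             # discard low-confidence pseudo labels
            images.append(x[keep].cpu())
            labels.append(pred[keep].cpu())
    return TensorDataset(torch.cat(images), torch.cat(labels))

def train(model, loader, epochs, device="cpu"):
    """Standard supervised training loop used for both stages."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Usage sketch (model, labeled_set and unlabeled_loader are assumed to exist):
# 1) train on the small manually labeled split, 2) pseudo-label the rest,
# 3) retrain on the union of real and pseudo labels.
# train(model, DataLoader(labeled_set, batch_size=64, shuffle=True), epochs=10)
# pseudo_set = pseudo_label(model, unlabeled_loader)
# full_set = ConcatDataset([labeled_set, pseudo_set])
# train(model, DataLoader(full_set, batch_size=64, shuffle=True), epochs=10)
```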
Weakly-supervised learning: Weakly supervised learning refers to learning from coarse-grained labels or noisy labels. In the effort to minimize manual labeling, researchers exploited Instagram images that are posted by users with hashtags. These hashtags have associated images and form a good source of abundant data (3.5 billion images and 17,000 hashtags). The ConvNet for image recognition is pre-trained with a billion-image version of this large hashtag dataset and fine-tuned on the labeled ImageNet dataset. To refine the noisy hashtag labels, different label space tests are used, such as hashtags mapped to ImageNet synsets, hashtags mapped to WordNet synsets, etc. [28]. The cost of obtaining weak supervision labels is much cheaper than the human labeling process.

Semi-weak supervised learning: Semi-weak supervised learning combines semi-supervised learning and weakly-supervised learning [27]. It uses a framework called the "student–teacher" network that combines both weak and semi-supervised learning. The "teacher" model is first pre-trained with hashtag images that have weak and noisy labels, also called the weakly supervised dataset. Further, the model is fine-tuned with the ImageNet labeled dataset and then it is used to predict the softmax distribution over the weakly supervised dataset (hashtag images) that was used to pre-train the "teacher" model. Next, the target "student" model is pre-trained with the weakly supervised data using the refined labels from the teacher model. Finally, the labeled data is used to fine-tune the "student" model.

The goal of machine learning has always been to bridge the gap between human and machine level learning. To make machines even more intelligent, researchers proposed different learning strategies that mimic how humans learn visual concepts of rare instances or objects by just seeing them once. Recently, there is increasing interest in learning methods that learn novel concepts from limited data and also in learning methods that continually learn without forgetting prior knowledge. Following are the learning methods that are a step forward towards annotation-efficient learning.

Incremental Learning: The goal of incremental learning is to continuously learn and solve new tasks without forgetting the tasks learned in the past by leveraging the data that gets added with time. Incremental learning comes with different variants such as task incremental learning, class incremental learning, domain incremental learning, etc.

In task incremental learning, the model can perform varied tasks, right from image classification to object detection to image segmentation and then to instance segmentation, as data gets continuously updated with time. In class incremental learning, the model fixes a learning task, e.g., image classification, and then it goes from classifying samples of class A to class B and to class C and so on with all sets of classes. In domain incremental learning, the transition takes place from a task in one domain (e.g., natural images) to another domain, let us say the medical domain. Some methods that achieve incremental learning are zero-shot learning [29], continuous updating of the training set, or using a fixed data representation.

Few-shot learning: The goal of few-shot learning is to effectively learn visual concepts from a training dataset that contains very few instances related to the novel classes. Given N novel classes and K samples in each class (K can be as small as one), the learner tries to learn visual concepts for instances that are exotic or rare. In this scenario, where we have limited training samples for each class, training a deep learning network from scratch results in poor model performance due to overfitting. A standard two-stage approach is adopted for few-shot learning [30]. In the first stage, a model is pre-trained on a dataset containing a large number of instances in every class (also referred to as a base or train class) to solve a classification task. In the second stage, the model is adapted by transferring the learned parameters from the pre-trained model, removing the last output layer, and adding a classifier that can classify an instance to one of the novel classes. Finally, the model is fine-tuned on the limited dataset containing novel classes to perform the actual downstream task of classifying the query to one of the novel classes. Though transfer learning achieves a significant improvement in performance over a model made from scratch, the scarcity of data at the second stage leads to an overfitted model. Hence, few-shot learning requires a transfer learning approach called meta-learning (a learn-to-learn approach) [31]. The idea of meta-learning is not to supervise for the right result but rather for how the answer should behave. Few-shot learning with meta-learning allows for the training of the learning algorithm itself (Stochastic Gradient Descent) instead of the classifier. To train the learning algorithm, a meta learner module fθ is implemented, which gets optimized for solving the few-shot classification task by backpropagating through the classifier and the meta learner fθ to find gradients with respect to the parameters of the meta learner. The meta learner fθ is trained using training episodes (S, Q) from base class data (a large labeled dataset) by sampling a few base classes N, and sampling S support examples per class and K query/test samples. The meta learner fθ then generates the classifier model that predicts the classification scores for test samples. The classification loss (cross-entropy loss) is then minimized by optimizing the parameters of the meta learner by backpropagation. The meta learner fθ is then evaluated or tested by keeping it fixed, which generates a model for the novel classes by using the train or support examples of the novel classes and predicting the novel class for the test/query samples. Recent progress in few-shot classification using meta-learning has led to the use of unlabeled examples along with the labeled data within each training episode [32]. The authors adopted two approaches: one where all unlabeled examples are assumed to belong to the same set of classes as the labeled examples of the episode, as well as the more challenging situation where examples from other distractor classes are also provided. The experimental results show that this scheme learns to improve the prediction of novel classes due to the incorporation of unlabeled examples. In another recent work, the authors leveraged the vast amount of freely available unlabeled video data to perform the task of few-shot video classification. In this semi-supervised few-shot video classification task, millions of unlabeled data are available for each episode during training [33].
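The episodic (S, Q) training described above can be instantiated in several ways. The sketch below uses a prototypical-network-style episode (class prototypes computed from the support set, nearest-prototype classification of the queries) as one common, simple baseline; it is not the exact meta-learner fθ of [31], and the encoder and episode sampler are assumed to be supplied by the user.

```python
import torch
import torch.nn.functional as F

def episode_loss(encoder, support_x, support_y, query_x, query_y):
    """One few-shot episode: build class prototypes from the support set S
    and classify the query set Q by distance to those prototypes."""
    z_s = encoder(support_x)                        # (N*S, d) support embeddings
    z_q = encoder(query_x)                          # (N*K, d) query embeddings
    classes = support_y.unique()
    prototypes = torch.stack([z_s[support_y == c].mean(dim=0) for c in classes])
    # negative squared Euclidean distance acts as the classification score
    logits = -torch.cdist(z_q, prototypes) ** 2     # (N*K, N)
    targets = torch.stack([(classes == y).nonzero(as_tuple=True)[0][0] for y in query_y])
    return F.cross_entropy(logits, targets)

# Meta-training loop (sketch): sample an episode from the base classes,
# compute the episode loss, and backpropagate into the encoder.
# for support_x, support_y, query_x, query_y in episode_sampler:
#     loss = episode_loss(encoder, support_x, support_y, query_x, query_y)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```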
Though many new learning techniques have reduced the requirement of fine-grained labeling, a follow-up question is still asked: are these labels really required, given that scaling the manual labeling process to all the internet images is completely infeasible? Hence, a potential solution proposed by researchers is to learn visual image representations from a large unlabeled dataset by proposing various pretext tasks that are given to the network to solve [34]. The learned weights or the learned representations from the pre-trained model are then used as initialization for downstream computer vision tasks where only some annotations are available.

Unsupervised and self-supervised learning: Self-supervised models use supervisory signals from the partial input that is available to learn a better representation of the input. They leverage the underlying structure in the data to predict the unobserved or hidden part of the input. Whereas, in unsupervised learning, we have samples with no external signals or labels that guide the learning process. Hence, calling self-supervised learning "unsupervised" is not accurate, as it uses far more feedback signals. Details about the structural semantics of the images can be better learned through self-supervised learning than through any other form of learning. The unlabeled data is humongous, and the amount of feedback provided by each sample is huge, which aids in learning better representations of the input. According to Misha Denil, unsupervised learning is thinking hard about the model and using whatever data fits the model; whereas, in self-supervised learning, you think hard about the data and use whatever model fits.

On the other hand, supervised learning not only depends on annotated data but also suffers from issues such as generalization error, adversarial attacks, model brittleness, shortcuts, and spurious correlations. Moreover, the networks trained using supervised learning may not strive hard to learn generalized feature representations and can get away by memorizing the mapping between input and output, as the ground truth is always available. Sometimes the ConvNet starts classifying the objects by mere texture without learning a rich representation of the object [35,36]. For example, if the texture of the object is scattered randomly around the image, the ConvNet still predicts the right object without extracting the object from its background. Hence, the supervisory signal can bias the network and lead it to work in an unexpected way, indicating that the supervised models are invariant to useful features required for structural understanding [37]. In a real sense, the model then becomes a texture detector rather than an object detector. Also, the ConvNets trained using supervised learning are brittle when it comes to dealing with dynamic data. Once the supervised model is trained, it becomes difficult for the model to adapt to new data without forgetting the previous knowledge. Hence, the supervised model needs to be re-trained with all the data again. With the infeasibility of using supervised learning in all deep learning tasks, researchers proclaim that the next AI revolution will not be supervised but self-supervised. Fig. 1 shows the supervised learning pipeline: the model learns visual features through the process of training the ConvNet with a large amount of labeled data (e.g. the ImageNet dataset). After the model is trained, the learned parameters serve as a pre-trained model and are transferred to the target task, like object detection or image segmentation, by fine-tuning the model. With pre-training, we can use less data and can take advantage of models that are already trained with millions of images. During transfer learning, only general features from the first few layers are transferred to target tasks. However, getting labels for a dataset of the size of ImageNet is quite expensive, and curating datasets of such size for different domains is a laborious task.

Fig. 1. Supervised learning pipeline: the learned features from the pre-trained model trained on a large labeled dataset (e.g. ImageNet dataset) are transferred to the target task where a limited labeled dataset (e.g. Pascal dataset) is available.

Fig. 2. ConvNet trained to generate automated labels using self-supervised learning, without a human annotating the input images.

1.2. Self-supervised learning

The key to unleashing unlabeled data is self-supervised learning, which has been gaining importance in recent years. The use of self-supervised learning is not new; it can be traced back to 1989 and Jürgen Schmidhuber's paper "Making the World Differentiable", which explains how two self-supervised recurrent networks can interact to attack the fundamental credit assignment problem [38]. However, self-supervised learning gained momentum in recent years due to the explosion of huge amounts of unlabeled data. The primary objectives of self-supervised learning are (1) to deploy state-of-the-art deep learning models with performance matching the supervised counterpart without relying on a huge labeled dataset; (2) to learn generalized and semantically meaningful representations from unlabeled data that help during downstream tasks like image classification, image segmentation, object detection, etc.; (3) to harness the huge amount of data that is available for free by replacing supervised pre-training with self-supervised pre-training; and (4) to have a more practical approach to learning as possessed by humans. Fig. 2 shows self-supervised learning where the ConvNet generates pseudo labels (e.g. the degree to which the input image is rotated) from the rotated input image by exposing the relationship between parts of the input data.

Concepts in self-supervised learning: We will now discuss various concepts related to self-supervised learning and its vocabulary.

Pretext or auxiliary task: To learn visual features from unlabeled data, certain tasks are pre-designed for the ConvNet to solve. The term "pretext" means that the task is done before the actual target task is undertaken, and its mere purpose is to learn generalizable representations of the input both at a low level and a high level, as shown in Fig. 3 [39]. For most of the pretext tasks, a part of the data is withheld or some transformation is applied, for which the network predicts the missing part or the correct transformation applied. The label generated by the network is a property of the data itself.

Pseudo labels: Pseudo labels are generated automatically based on the type of pretext task the network solves [39]. For example, if an input image is taken, a transform (let us say rotation) is applied to it, and it is passed to the ConvNet, then the ConvNet predicts the property of the transform, i.e., the angle by which the image is rotated.

Pre-trained model: It is a ConvNet that contains the learned representations/useful behavior of data and that is trained on a large unlabeled dataset using a pretext task. Mostly, the model is pre-trained on ImageNet without labels on a pretext task and subsequently fine-tuned on a smaller labeled dataset [37].

Transfer learning: Transfer learning is the operation of transferring the pre-training features from the pre-trained model to solve downstream tasks like image classification, object detection, image segmentation, etc. Through transfer learning, the model achieves higher accuracy with much less labeled data and requires less computation time than models that do not use transfer learning.

Fig. 3. Self-supervised learning using pretext task to learn good representation of the input.

Whereas, building the model from scratch initialized with random weights results in an inefficient solution, as the network starts from a point where it does not know anything [40]. Two major transfer learning scenarios that evaluate the learned representation are (1) a linear classifier (ConvNet as a fixed feature extractor) and (2) fine-tuning the ConvNet for downstream tasks.

Linear classification: To evaluate the learned representation, a linear classifier is trained on top of the ConvNet trained on the large unlabeled dataset (ImageNet). The last fully connected layer of the pre-trained model is removed, and the rest of the ConvNet is frozen, on top of which a classifier is trained. Evaluation is often performed on the same dataset that was used to train the network on the pretext task. Typical datasets for linear classification include ImageNet, Places205, Pascal VOC07, COCO14, etc.

Fine-tuning the model: The second scheme to evaluate the learned representation is not only to replace and retrain the classifier on top of the ConvNet but also to fine-tune the weights of the pre-trained network through backpropagation. It is possible to fine-tune all the layers of the ConvNet, or we can keep some of the earlier layers fixed and only fine-tune some higher-level layers of the network.
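The two evaluation scenarios can be contrasted with a minimal PyTorch/torchvision sketch. The choice of ResNet-50, the number of target classes, and the decision of which layers to freeze are illustrative placeholders; the self-supervised weights are assumed to be loaded separately.

```python
import torch.nn as nn
from torchvision import models

def linear_probe(backbone, num_classes):
    """Scenario 1: freeze the ConvNet and train only a linear classifier on top."""
    for p in backbone.parameters():
        p.requires_grad = False                      # ConvNet acts as a fixed feature extractor
    in_dim = backbone.fc.in_features
    backbone.fc = nn.Linear(in_dim, num_classes)     # the new head stays trainable
    return backbone

def fine_tune(backbone, num_classes):
    """Scenario 2: replace the head and also update (all or part of) the backbone weights."""
    in_dim = backbone.fc.in_features
    backbone.fc = nn.Linear(in_dim, num_classes)
    # optionally keep the earliest block fixed and fine-tune only the deeper layers
    for p in backbone.layer1.parameters():
        p.requires_grad = False
    return backbone

# Example: a ResNet-50 whose weights would come from self-supervised pre-training
# (torchvision >= 0.13; load the pre-trained state_dict before building the head).
backbone = models.resnet50(weights=None)
probe_model = linear_probe(backbone, num_classes=1000)
```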
Type of architecture: The quality of the visual representations learned using a pretext task depends significantly on the type of ConvNet architecture used and the ability of the network to scale with increased data. The impact of the architecture is found on the quality of the representation learned and on the accuracy of the downstream tasks. The popular convolutional neural architectures used for pre-training are ResNet50, ResNet50 v1, ResNet50 v2, and also their scaled-up versions [34]. It has also been observed that a low-capacity model like AlexNet [14] has not shown much improvement with more data as compared to ResNet.

Downstream tasks: These tasks are specific to the problem that defines what the model actually does (primary task). Whereas, a pretext task is a secondary task undertaken to achieve the primary task. Many computer vision downstream tasks exist, such as image classification, object detection, image segmentation, etc. Table 1 shows the image datasets used for downstream tasks.

1.3. Self-supervised learning pipeline

The general pipeline of self-supervised learning is shown in Fig. 4. In the first stage, as shown in Fig. 4(a), the ConvNet is trained on a pretext task (e.g. image rotation) using a large corpus of unlabeled data. The network learns useful representations and predicts the degree by which the image is rotated. In the second stage, as shown in Fig. 4(b), the rotation prediction head from the pre-trained model is removed, keeping the remaining network fixed, and a linear classifier is trained on top of it with a new dataset where fewer labels are available. Complex downstream tasks can also be performed, such as image segmentation and object detection, by fine-tuning the entire network or a partial network (deeper layers) using backpropagation, as shown in Fig. 4(c).

2. Image representation or feature learning techniques using self-supervised learning

This section summarizes the early works on traditional unsupervised learning, and also the recent handcrafted pretext tasks and contrastive instance learning schemes designed to learn rich representations of the input. Broadly, the representations learned using self-supervised learning are categorized into six categories: reconstruction from a corrupted or partial image, reconstruction of the image from an altered view of the image, image generation, spatial context prediction, transformation prediction, and instance discrimination or contrastive learning and clustering based schemes.

2.1. Reconstruction from a corrupted or partial image

One of the early works in unsupervised learning is the classical autoencoder that learns representations by compressing the input image at a low-dimensional bottleneck layer. The encoder–decoder assumes that there exists a high degree of correlation/structure in the data. The encoder compresses the data into an intermediate representation and the decoder takes the intermediate representation and reconstructs the input image. Traditional autoencoders happen to underperform as they only compress the input without learning rich representations of the input [41]. The denoising autoencoder is an enhancement to the traditional autoencoder that prevents the network from learning an identity function by introducing pixel-level noise in the image, and the network is forced to reconstruct the original image, as shown in Fig. 5. Another version of the denoising autoencoder is the stacked autoencoder [42], in which the layers of the autoencoders are stacked to initialize a deep architecture that encodes and decodes the data across various layers. As the layers are stacked, the model learns a better representation of the input. The downside of the autoencoders is that the noise added to the images is random and unsystematic, which does not contribute to semantic learning of the input. Some improvements were brought by denoising autoencoders by corrupting the input, but the corruption was localized and random, which did not account for greater learning at the semantic level. The scheme also results in a train and evaluation gap, as training is done on noisy images and evaluation is done on clean images.
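A compact PyTorch sketch of the denoising autoencoder idea in Fig. 5 follows: pixel-level Gaussian noise is added to the input and the network is penalized for failing to reconstruct the clean image. The architecture, image size (3 × 32 × 32), and noise level are illustrative assumptions, not a specific published configuration.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Small convolutional encoder-decoder, e.g. for 3x32x32 images."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                    # compress to a low-dimensional bottleneck
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                    # reconstruct the clean image
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def denoising_step(model, clean_images, noise_std=0.1):
    """Corrupt the input with pixel noise and compute the reconstruction loss vs. the clean image."""
    noisy = (clean_images + noise_std * torch.randn_like(clean_images)).clamp(0, 1)
    recon = model(noisy)
    return nn.functional.mse_loss(recon, clean_images)
```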
Instead of formulating the problem as a compression task, many researchers formulated the problem as a context-based prediction task. An ad-hoc approach is implemented wherein the user designs a task for the ConvNet to solve that aims at learning a rich representation of the input. Image inpainting is one such pretext task that forces the network to predict the masked area of pixels from other parts of the input [1]. The model is trained on a large number of unlabeled images with corresponding masked-out regions. Fig. 6 shows the masked input image passed to the encoder that captures the semantic context of the image in a latent feature representation, and the decoder reconstructs the actual image by filling realistic image content in the masked region.

Table 1
Summary of image datasets commonly used in downstream image recognition tasks.
Category Dataset Train size Classes References
Natural Caltech 101 3060 102 (L. Fei-Fei et al. 2004)
Natural CIFAR-10 50,000 10 (Krizhevsky, 2009)
Natural CIFAR-100 50,000 100 (Krizhevsky & Hinton, 2009)
Natural DTD 3760 47 (Cimpoi et al., 2014)
Natural Flowers 102 2040 102 (Nilsback & Zisserman, 2008)
Natural Oxford-IIIT Pets 3680 37 (Parkhi et al., 2012)
Natural Sun 397 87,003 397 (Xiao et al., 2010)
Natural Caltech-UCSD 6033 200 (Lin et al. 2015)
Natural Stanford Cars 8144 196 (Jonathan Krause et al. 2013)
Natural Food 101 120,216 251 (Bossard et al. 2014)
Natural FGVC Aircraft 3334 100 (Maji et al. 2013)
Natural Stanford Dogs 12,000 120 (Aditya Khosla et al. 2011)
Natural SVHN 73,257 10 (Netzer et al. 2011)

Fig. 4. The general two-stage pipeline of self-supervised learning. First, the model is pre-trained on a pretext task (e.g., rotation) without labels; then a linear classifier is trained on top of the ConvNet after removing the prediction head; finally, the network is fine-tuned for complex downstream tasks with a small labeled dataset.

Fig. 5. Denoising autoencoder learns the compressed representation of the input at the encoder by exploiting the redundant information of the input image. The
decoder reconstructs the input image which is close to the original image by minimizing reconstruction loss.

To solve this task the network must acquire a general understanding of the structure of different objects, their colors, and gain a deeper semantic understanding of the entire scene. This task will only be easy for the ConvNet to solve if it can recognize the object in the image. The loss function used in the scheme is the joint loss of a reconstruction objective (L2 loss) and an adversarial loss.

Fig. 6. Image inpainting pretext task of reconstructing the masked region of the input image. The ConvNet is trained on pairs of masked-original images by reducing
the reconstruction and adversarial loss.

Fig. 7. A two stage process of predicting bag of visual words.

The reconstruction loss captures the overall structure of the missing region, and the adversarial loss is incorporated to get a sharp prediction of the masked region by picking up a particular mode from the distribution. In another way, the network can be thought of as an adversarial generative network, where the generator tries to generate the images by filling the masked region and the discriminator tries to find the discrepancies between the predicted and actual image patch. Once the pre-training is complete, the decoder is removed and the learned representation is used for the downstream tasks. Though the scheme preserves the fine details of the image, reconstruction is tough and ambiguous as there are multiple ways to fill the missing region in the image. Moreover, there exists a train-evaluation gap, as training is done on masked images and evaluation is done on non-masked images.
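The joint reconstruction-plus-adversarial objective of the inpainting scheme can be sketched as follows. The sketch assumes a user-supplied encoder-decoder generator, a discriminator that outputs a single real/fake logit per image, and a binary mask marking the blanked-out region; the relative weighting of the two terms is illustrative rather than the exact value used in [1].

```python
import torch
import torch.nn.functional as F

def inpainting_losses(generator, discriminator, images, mask, adv_weight=0.001):
    """Joint objective: L2 reconstruction of the masked region plus an adversarial term
    pushing the filled-in content to look realistic.  `mask` is 1 inside the hidden region."""
    masked_input = images * (1 - mask)                  # hide the region from the encoder
    filled = generator(masked_input)

    # reconstruction loss: only the missing region has to match the original pixels
    rec_loss = F.mse_loss(filled * mask, images * mask)

    # adversarial loss: the generator tries to make the discriminator output "real" (1)
    real = torch.ones(images.size(0), 1, device=images.device)
    fake = torch.zeros(images.size(0), 1, device=images.device)
    adv_loss = F.binary_cross_entropy_with_logits(discriminator(filled), real)
    gen_loss = rec_loss + adv_weight * adv_loss

    # discriminator loss: real images vs. detached generator outputs
    disc_loss = 0.5 * (
        F.binary_cross_entropy_with_logits(discriminator(images), real)
        + F.binary_cross_entropy_with_logits(discriminator(filled.detach()), fake))
    return gen_loss, disc_loss
```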

Another self-supervised learning approach proposed is predicting bag-of-visual-words (BoW), as shown in Fig. 7, which is inspired by the natural language processing method of predicting a bag of words. Instead of words, the scheme uses visual words that encode discrete visual concepts useful for downstream tasks like image classification and object detection [43]. Fig. 7(a) shows the unlabeled images fed to a pre-trained self-supervised feature extractor Φ̂(.) that extracts features across the entire dataset and clusters the similar features to form a vocabulary of visual words. Further, a histogram for each image x is created representing the count of the individual high-level and mid-level features present. Then, as a self-supervised task, the pre-trained ConvNet Φ̂(.) predicts the bag-of-words representation for the original image x. Correspondingly, another network Φ(.) is trained to predict the bag of visual words of a perturbed version x̃ of the image x, as shown in Fig. 7(b). Further, a cross-entropy loss is calculated between the predicted bag of words and the original bag of words y(x), which is backpropagated to refine the model. To solve this task, the ConvNet must learn to detect visual clues that are invariant to perturbations and also contextual features. The advantage of the scheme is that the representations learned are invariant to transformations, and it learns contextual reasoning skills by inferring the missing visual words of missing image regions. On the other hand, it loses low-level details and spatial information of the input image.

2.2. Reconstruction of the image from an altered view of the image

Another set of pretext tasks the ConvNet solves is to predict the correct view of the image from its altered view. Image colorization is one such pretext task that forces the ConvNet to predict the probable color (ab channel) of an image from a grayscale (L channel) image [44]. By solving this pretext task, the ConvNet learns an image representation by predicting the color values of an input 'grayscale' image. The network is trained on millions of pairs of colored and grayscale images with negligible cost, as they are freely available. To solve the colorizing task, the network has to recognize different objects present in the image and group related parts together to tint them with the same color. Therefore, a visual representation of the input image is learned in the process of performing the pretext task that is useful in performing downstream tasks. An illustration of a grayscale image with its predicted color image using an encoder–decoder architecture based on a ConvNet is shown in Fig. 8. Once the network is trained on the pretext task, the decoder part of the network is removed for performing downstream tasks. The downside of the scheme is that the reconstruction of the image is hard and ambiguous, as several possible solutions exist for coloring the image. As the color mapping is not deterministic, the network colors the object with a combination of all colors, which leads to a greyish colored image. Also, the network is forced to evaluate on grayscale images, resulting in a loss of information. Nevertheless, the scheme is useful when we want to color old grayscale films.
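A toy sketch of the colorization pretext task is given below: an encoder-decoder maps the L (grayscale) channel to the two ab color channels. For simplicity the sketch uses a plain regression loss, whereas the referenced work [44] actually poses colorization as classification over quantized color bins; the architecture and channel convention are illustrative assumptions.

```python
import torch.nn as nn

class ColorizationNet(nn.Module):
    """Toy encoder-decoder that maps the L (grayscale) channel to the two ab color channels."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1), nn.Tanh(),  # predicted ab channels
        )

    def forward(self, l_channel):
        return self.decoder(self.encoder(l_channel))

def colorization_loss(model, l_channel, ab_channels):
    # simple regression objective; [44] instead classifies quantized ab bins
    return nn.functional.smooth_l1_loss(model(l_channel), ab_channels)
```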
Further to the above work, an enhancement was brought by a scheme called the split-brain autoencoder, trained on pairs of grayscale and colored images. Fig. 9 shows a given color image X that is split into grayscale channels X1 and color channels X2. The ConvNet is split into two distinct subnetworks F1 and F2: F1 predicts the color channel from the grayscale channel and F2 predicts the grayscale channel from the colored channel. The two complementary channels X̂1 and X̂2 are then aggregated to predict the reconstructed image of the original image, on which the cross-entropy loss is calculated [45]. The goal of performing such a pretext task is to induce representations that transfer well to the downstream tasks. The same setup can also be applied for images with depth and color, in which the pretext task forces the network to predict one from the other. This method ensures backward consistency by doing two-way prediction, from grayscale to color image and from color to grayscale image, and together the reconstructed image should be close to the original image. It is a challenging task, as the reconstruction is hard and ambiguous because there are multiple possible ways to color an image. However, recently authors have devised new schemes using variational autoencoders and latent variables for incorporating diverse colorization in the reconstructed images [47].

2.3. Image generation

Generation-based self-supervised methods are used for learning image representations that involve the process of generating images or high-resolution images using Generative Adversarial Networks. Most of the methods for image generation do not need any human-annotated labels. GANs learn to create realistic images that are similar to the input images but not present in the input dataset. The intuition behind this is that if we can get a model to generate a realistic image of a person's face, for example, then the model must have also learned a lot about human faces in general. In GANs, two networks compete with each other. A generator samples a z vector from a latent space to generate a reconstructed or a realistic fake image. On the other hand, the discriminator network competes with the generator and tries to distinguish between the fake samples coming from the generator and the real samples. Following the game-theoretic approach, the discriminator forces the generator to generate realistic images, while the generator forces the discriminator to improve its differentiation ability. During training, the two networks compete against each other and make each other stronger [48]. GAN has served as a foundational work that has helped in the creation of various successful architectures such as DC GAN [49], WGAN [50], WGAN-GP [51], Progressive GAN [52], SN-GAN [53], SAGAN [54], BIGAN [55], BigGAN [56], StyleGAN [57], LOGAN [58], etc. After initial success in using GANs for unsupervised learning, GANs have been surpassed by self-supervised based approaches. One such unsupervised method is Large Scale Adversarial Representation Learning (BigBiGAN) for image generation and representation learning [46]. This method allows for the extraction of features in an unsupervised way from a Generative Adversarial Network and scales up the previously existing algorithms such as BIGAN [55] and BigGAN, leading to an improved GAN model [56]. The traditional GANs take the latent variable and convert it into an actual image, but from the representation learning perspective, we have to go from the image to its latent space. BigBiGAN builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. Fig. 10 shows the generator part of the network G that samples z from a latent space to generate reconstructed images x̂, which the discriminator tells apart as fake or real. The encoder part of the generator E maps the images back to a latent space ẑ. The latent spaces ẑ ∼ E(x) and z ∼ Pz corresponding to different images form a feature space that can be used for downstream recognition tasks by training a logistic regression classifier using a dataset where fewer labels are available. On the other hand, the discriminator part of the network not only takes x and z pairs as input but also the joint distribution of x and z. The loss includes the aggregation of the unary data terms sx and sz, as well as the joint term sxz, which ties the data and latent distributions together.

Fig. 8. Colorization as a pretext task, the ConvNet predicts the color image from a grayscale image.

Fig. 9. Split-Brain Autoencoder composed of two disjoint sub-networks F1 and F2, each trained to predict one channel from another. Network F1 performs automatic colorization, whereas network F2 performs grayscale prediction. Combining the two channels gives the predicted reconstructed image [45].

Fig. 10. Representation learning using BigBiGAN framework. The joint discriminator D calculates the loss ℓ. The inputs to the discriminator D are data-latent pairs,
either (x ∼ Px , ẑ ∼ E (x)), sampled from the data distribution Px and encoder E outputs, or (x̂ ∼ G (z), z ∼ Pz ), sampled from the generator G outputs and the latent
distribution Pz . The loss ℓ consists of the unary data term sx and the unary latent term sz , as well as the joint term sxz which ties the data and latent distributions [46].

The discriminator loss ℓ trains the network to distinguish between the two joint data-latent distributions from the encoder and the generator, pushing it to predict positive values for encoder input pairs (x, E(x)) and negative values for generator input pairs (G(z), z). Generating data in a particular domain necessitates that the model understand the semantics of the said domain. Despite the success of GANs, they are faced with a few challenges, such as (a) they are harder to train because the parameters oscillate and rarely converge, and (b) the learning process is inhibited because the discriminator overpowers the generator and it fails to create real-like fakes. Another work proposed is to use a pretext task, e.g., rotation, for better GAN discriminators. The rotation pretext task encourages the discriminator to learn meaningful representations that are not forgotten during training [59].

2.4. Spatial context predictions

Spatial context predictions exploit the spatial relationships among image parts. Context prediction is one such pretext task that forces the network to predict the correct spatial orientation of two randomly chosen patches of an image [60]. The network is trained on pairs of (image patch, neighboring patch) that are chosen randomly to form a large, unlabeled corpus of data. The goal of training is to assign similar representations to semantically similar patches; for example, the representation of the ears of a cat coming from different images of cats should be semantically similar. The network learns to associate semantically similar patches using the nearest-neighbor matching principle. The network is forced to predict the spatial arrangement of patches, for which the input image is divided into a 3 × 3 grid of non-overlapping patches. Fig. 11 shows the ConvNet that predicts the spatial arrangement of the two input patches. The inputs to the shared ConvNet are two random image patches: one is the anchor patch and the other is the query patch.

Given the two patches, the network predicts the relative position of the query patch with respect to the anchor patch by choosing one location among 8 possibilities using a cross-entropy loss. To solve this task, the model has to learn some semantics in terms of recognizing object parts and their spatial relationship with each other. To prevent the network from cheating, the authors jittered the patches by dropping some color channels to avoid chromatic aberration and also created the grid of non-overlapping patches to prevent the network from getting clues from boundary pixels. The downside of the scheme is that the model is trained on patches and evaluation is done on complete images.

Fig. 11. Visualization of the contextual patch prediction task: the shared ConvNet receives the anchor image patch and the query image patch, and it predicts the location of the query patch corresponding to the anchor patch by choosing one position among the eight positions.
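A minimal sketch of the relative-position task of [60] follows: an anchor patch and one of its eight neighbours are cut from a 3 × 3 grid, both are passed through a shared encoder, and a small head classifies the relative location. Fixing the anchor at the centre cell, the 64-pixel patch size, and the encoder/head shapes are simplifying assumptions for illustration.

```python
import random
import torch
import torch.nn as nn

# offsets of the eight neighbouring grid cells around the anchor, as (row, col)
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch=64):
    """`image` is a CxHxW tensor covering a 3x3 grid of patches.
    Returns (anchor, neighbour, position label in 0..7)."""
    row, col = 1, 1                                    # anchor fixed at the centre cell
    label = random.randrange(8)
    dr, dc = NEIGHBOUR_OFFSETS[label]
    def crop(r, c):
        return image[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
    return crop(row, col), crop(row + dr, col + dc), label

class RelativePositionNet(nn.Module):
    """Shared encoder applied to both patches; an MLP predicts one of the eight positions."""
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 8))

    def forward(self, anchor, neighbour):
        z = torch.cat([self.encoder(anchor), self.encoder(neighbour)], dim=1)
        return self.head(z)                            # logits over the eight relative positions
```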
Follow-up work to context prediction is the jigsaw puzzle, which is harder to solve than context prediction [61]. The pretext task forces the ConvNet to learn an input representation by solving a jigsaw puzzle created from an input image. The network is trained on pairs of shuffled and ordered puzzle patches of the images, which are available for free and form a large corpus of unlabeled images. In this method, a 3 × 3 grid of patches is shuffled through a random permutation and passed through the ConvNet that predicts the correct order of the shuffled patches, as shown in Fig. 12. To accomplish this pretext task, the ConvNet needs to identify objects, their shape, and their associations with their sub-parts. For a 3 × 3 shuffled puzzle, we have 9! (362,880) possible permutations for arranging the patches of the image, which is quite a large sample space. To prevent searching across a large sample space, only a subset of possible permutations is used, such as 64 permutations with the highest Hamming distance. The representations learned encapsulate a geometrical understanding of the input, which helps perform downstream tasks.
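The permutation-subset idea can be sketched in plain Python: pick a small set of permutations of the nine tiles that are far apart in Hamming distance, shuffle the patches with one of them, and use its index as the classification label. Greedy selection from a random candidate pool is one simple way to approximate the maximal-Hamming-distance subset; the candidate pool size is an illustrative choice.

```python
import random

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def select_permutations(num=64, tiles=9, candidates=5000):
    """Greedily pick `num` permutations of the 9 tiles that are mutually far apart."""
    pool = [tuple(random.sample(range(tiles), tiles)) for _ in range(candidates)]
    chosen = [pool.pop()]
    while len(chosen) < num:
        # keep the candidate whose minimum distance to the chosen set is largest
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen

PERMUTATIONS = select_permutations()

def make_jigsaw_example(patches):
    """`patches` is a list of 9 tiles in original order; returns shuffled tiles and the label."""
    label = random.randrange(len(PERMUTATIONS))
    order = PERMUTATIONS[label]
    return [patches[i] for i in order], label   # the ConvNet is trained to predict `label`
```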
Another beautiful piece of work is Contrastive Predictive Coding (CPC), which is a self-supervised technique that is generic and multimodal in approach. The scheme provides a common framework for handling different modalities like speech, image, text, and reinforcement learning in 3D environments [62]. In the context of computer vision, the model is trained with input images of size 256 × 256 from an unlabeled ImageNet dataset with the corresponding overlapping grid of image patches. Fig. 13 shows the input image divided into a grid of overlapping patches of size 64 × 64 with 50% overlap, which results in a 7 × 7 grid of patches. Each patch is then encoded by a ResNet encoder, resulting in a grid of patch embeddings. The pretext task forces the network to predict the representation of a particular patch from the patch that lies above it. To do so, the network needs to reason about the object and its associated parts. It uses a special type of contrastive loss function for self-supervised learning called the InfoNCE loss [63]. The loss contrasts the representation of the predicted patch, the correct patch, and all the negatives coming from the same image and other images. The intuition is that the predictions should be more similar to the true patch embedding than to the negative patches that come from the rest of the images and the same image. The scheme achieves hard goals and learns representations spread across different modalities. The downside of the approach is that the training is slow, as the images have to be divided into several patches. Additionally, the scheme also encounters a train and evaluation gap, as training is done on patches and evaluation is done on images.

2.5. Transformation prediction

Image rotation is a pretext task intended for the network to solve to predict the correct angle by which the image is rotated. The network is pre-trained on pairs of rotated image and rotation angle obtained by randomly rotating images of an unlabeled dataset by 0°, 90°, 180°, and 270° [2]. Fig. 14 shows the input image rotated by a multiple of 90° and passed through a ConvNet that predicts the angle by which the input image is rotated, resulting in a four-class classification task. To accomplish the task, the network has to understand the location, type, and pose of objects in an image. The scheme is simple to implement and at the same time semantically meaningful. Whereas, the downside of rotation prediction is that it assumes the training images are captured with canonical orientations. Moreover, it results in a train-evaluation gap because the training is done on rotated images and evaluation is done on upright images. On similar lines, researchers have also worked on relative geometric transformation prediction, where the network predicts the correct transformation by which the image is transformed [64].
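The four-way rotation-classification task can be written in a few lines of PyTorch. The sketch below assumes the model is any backbone ending in a 4-logit head and that images come as (B, C, H, W) tensors; optimizer and batching are left to the caller.

```python
import torch
import torch.nn.functional as F

def rotate_batch(images):
    """Rotate each image by 0, 90, 180 or 270 degrees; the rotation index is the pseudo label."""
    labels = torch.randint(0, 4, (images.size(0),), device=images.device)
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))   # rotate in the H, W plane
                           for img, k in zip(images, labels)])
    return rotated, labels

def rotation_step(model, images, optimizer):
    """One training step of the rotation-prediction pretext task."""
    rotated, labels = rotate_batch(images)
    logits = model(rotated)                  # model outputs 4 logits, one per rotation angle
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```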
2.6. Instance discrimination or contrastive learning

Based on the way visual features are learned, some schemes rely on instance-level discrimination rather than patch-level predictions or image generation. Many of the visual common-sense schemes discussed, such as colorization, jigsaw, relative patch prediction, and rotation, are ad-hoc hand-crafted pretext tasks, and many of them rely on patches. Dividing the image into several patches increases the batch size manyfold and also increases the training time. Moreover, it results in a train-evaluation gap because the training is done on patches and evaluation is done on complete images. Hence, researchers proposed the concept of contrastive learning that works at the instance level, where instances of the same image form a positive pair, and any other image acts as a negative to the pair. In other words, contrastive learning provides a framework that tries to learn a feature space that pools together representations that are related and pushes apart representations that are not related.

In contrastive learning, the ConvNet is trained on a large corpus of millions of unlabeled images containing examples of similar and dissimilar images. Contrastive learning at the instance level is an approach to learn useful features by solving the pretext task which compares the anchor, positive, and negative (APN) representations from a large unlabeled dataset. Fig. 15 shows a batch of two images where each image forms its own class. To generate a positive pair, we take the image and augment it in two different ways. We refer to one of the augmented views of the image as the anchor image and the other view of the image as the positive.

Fig. 12. Jigsaw image puzzle task. The original image with non-overlapping patches is on the left, the randomly permuted patches are in the middle, and the ConvNet predicts the correct spatial arrangement of patches on the right.

Fig. 13. Visualization of Contrastive Predictive Coding for images. A PixelCNN autoregressive model is used to make the predictions of bottom patches from top
image patches [62].

Fig. 14. Illustration of self-supervised task of predicting the angle by which the image is rotated.

Any different image to the anchor image in the dataset will be negative to the anchor-positive pair. The objective is to maximize the similarity between the two views of the same image and repel them from the views coming from different images [65]. Hence, contrastive learning can reason about multiple images at once, whereas jigsaw or rotation pretext tasks always reason about a single image independently. Moreover, the patch prediction tasks are not fine-grained due to the non-availability of negatives from other images. Recently, many contrastive learning methods at the instance level have been proposed which have shown promising results in learning good feature representations that help perform the downstream tasks. Despite being unsupervised, these schemes have outperformed supervised pre-training in learning image representations.

Concepts in Contrastive Learning: We will now discuss various concepts related to contrastive learning that aid in learning rich representations of the input at the instance level.

Objective of Contrastive Learning: To learn a representation or a feature space that pulls together representations that come from the same image and repels representations that come from different images.

Data Augmentation: Handcrafted pretext tasks for representation learning, such as dividing the images into patches, rotation, colorization, or masking the images, have been taken over by a bunch of automated augmentations in contrastive learning. The purpose of augmentation in contrastive learning is very different than in supervised learning, as the task is very different. We would not like the ConvNet to find an easy way of doing a contrastive task just by learning one feature. The network should be able to learn many different kinds of features before it does instance discrimination. A good augmentation strategy is the most important ingredient for contrastive learning that will force the network to learn rather than cheat. Augmentation of the images allows the ConvNet to learn rich and generalizable features in a self-supervised learning environment. The goal of data augmentation is to generate the anchor, positive, and negative (APN) images that are used in contrastive learning. Augmenting the images makes the task harder for the ConvNet, as it cannot get away without learning a rich representation of the input. Also, contrastive learning needs more data augmentation than supervised learning due to the non-availability of the image labels. A well-tuned composition of augmentations stands out and leads to substantial improvement gains in the performance of the downstream tasks [66]. Fig. 16 shows the typical data augmentation methods used for visual feature learning, such as color transformation, scaling, random cropping, flipping (horizontally, vertically), etc.
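A torchvision sketch of a typical two-view augmentation composition (random resized crop, flip, colour jitter, grayscale, blur) follows; applied twice to every image, it produces the anchor-positive pair. The specific parameter values follow common practice and are illustrative, not a prescription from the text.

```python
from torchvision import transforms

def contrastive_augmentations(size=224):
    """Composition of augmentations applied twice to every image to create two views."""
    return transforms.Compose([
        transforms.RandomResizedCrop(size, scale=(0.2, 1.0)),        # random cropping + scaling
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),   # blur as a colour-level corruption
        transforms.ToTensor(),
    ])

class TwoViews:
    """Wraps the augmentation so a single input image yields an (anchor, positive) pair."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, image):
        return self.transform(image), self.transform(image)

two_views = TwoViews(contrastive_augmentations())
# e.g. datasets.ImageFolder(root, transform=two_views) makes each sample a pair of views
```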
tation learning such as dividing the images into patches, rotation, Encoder: The encoder part of the network extracts the feature
colorization or masking the images, etc. have been taken over by representations of the images. It takes two different augmented

Fig. 15. Contrastive learning helps distinguish between similar and dissimilar objects using self-supervised learning. For each image in the batch, a random transform is applied to get a pair of two images that represent different instances of the same image. The contrastive self-supervised pretext task pulls the representations of the anchor and positive image close together and pushes negative representations away.

Encoder: The encoder part of the network extracts the feature representations of the images. It takes two different augmented images xi and xj of the same image x and extracts representations hi and hj in the form of vectors as shown in Fig. 17. The encoder is generic and replaceable with many architectures; in many works, the authors have used ResNet-50 and its variants as the ConvNet encoder.

Nonlinear Projection Head: The representations hi and hj are passed through a nonlinear projection head to produce embeddings zi and zj on which the contrastive loss is computed, as shown in Fig. 17. While some methods calculate the loss on the representation h from the encoder part of the network, it is found beneficial to define the contrastive loss on z rather than h [66].

Similarity Measure: Some mechanism is required to compute the similarity between representations or embeddings of anchor-positive and anchor-negative pairs. To compute the similarity between representations zi and zj we have various similarity measures such as the dot product, cosine similarity, or bi-linear transformations. One of the most common similarity metrics is cosine similarity, which computes the amount of similarity between two representations by outputting a scalar score indicating the degree of similarity. For example, consider an image on which two random transformations are applied to get a pair of two augmented images xi and xj. Each image in that pair is passed through an encoder to get representations, and then a non-linear fully connected layer is applied to get representations z. The task is to maximize the similarity between these two representations zi and zj. The cosine similarity is calculated on the projected representations zi and zj as defined in Eq. (1).

\[
\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\tau \,\lVert z_i \rVert \lVert z_j \rVert} \tag{1}
\]

∥zi∥ and ∥zj∥ denote the normalized feature vectors (normalized embeddings) on which the loss function is computed. τ is an adjustable temperature parameter that scales the similarity scores entering the softmax, controlling the entropy of the distribution and stabilizing the loss.

Loss Function: Cosine similarity and the temperature parameter act as the basis for the contrastive loss function. A loss function is defined on top of the self-supervised training that penalizes the network for producing different representations for different versions of the same image. An image rotated by two different angles should have the same consistent representation even though the angles differ. Augmentations such as color distortion or Gaussian blur corrupt the input data and force the network to learn from a diverse set of features. In each case, the original image and the transformed image should give the same predictions and create the same features in the intermediate representations. Hence, the loss is low when the similarity between the query image and the positive embedding is high, and high when the two are dissimilar. Based on the loss, the encoder and projection head representations improve over time, and the representations obtained place similar instances of images closer in the space and negatives far apart. Widely used loss functions in the self-supervised setup are the noise contrastive estimation (NCE) loss [63], triplet loss [68], N-pair loss [69], and InfoNCE [62]. Generally, the contrastive schemes focus on comparing the embeddings with a contrastive loss called the "NT-Xent loss" (Normalized Temperature-Scaled Cross-Entropy Loss), defined in Eq. (2).

\[
\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \tag{2}
\]
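As a concrete illustration, the following is a minimal PyTorch sketch of an encoder f(.), a nonlinear projection head g(.), and the NT-Xent loss of Eqs. (1) and (2). The projection size, hidden width and temperature are illustrative assumptions rather than values prescribed by any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class ContrastiveModel(nn.Module):
    """ResNet-50 encoder f(.) followed by an MLP projection head g(.)."""

    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet50()
        backbone.fc = nn.Identity()               # keep the 2048-d representation h
        self.encoder = backbone                   # f(.): x -> h
        self.projector = nn.Sequential(           # g(.): h -> z
            nn.Linear(2048, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        z = self.projector(h)
        return h, z


def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss of Eq. (2) for N positive pairs, i.e. 2N augmented samples."""
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # 2N x d, unit-norm embeddings
    sim = z @ z.t() / temperature                          # pairwise cosine similarity / tau
    n = z_i.size(0)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))        # enforce k != i in the denominator
    # The positive of sample i is its other view: i <-> i + N.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # -log softmax over the 2N - 1 others
```

Here F.cross_entropy applied to the masked similarity matrix reproduces the numerator and denominator structure of Eq. (2), averaged over all 2N samples.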

Fig. 16. Typical data augmentation methods used for visual feature learning in contrastive learning using self-supervised setup [66].


Fig. 17. Nonlinear transformation from h to z and contrastive loss computed on the representation z [66].

Fig. 18. Discriminating unsupervised feature learning with Exemplar Convolutional Neural Networks. Exemplary patches sampled from the unlabeled dataset (left up).
Several random augmentations applied to the exemplary patch (left down). The original (‘‘seed’’) patch is marked in red. Given a distorted crop from an exemplary
patch, ConvNet classifies the distorted crop to one of the K surrogate classes [67].

Eq. (2) depicts the loss function for a positive pair of examples xi and xj, and the goal is to identify the positive pair of each zi and repel the others. Here N is the number of samples, 2N the number of transformed samples, and 2(N − 1) the number of negative pairs; sim(zi, zj) is the cosine similarity defined in Eq. (1). The term in the numerator corresponds to the positive pair and the terms in the denominator to the negative pairs. τ is a hyperparameter called the adjustable temperature parameter. 1[k≠i] is an indicator function that equals 1 when the condition holds and 0 otherwise.

2.6.1. Contrastive learning prediction schemes

Exemplar ConvNets [67] is one of the earliest methods to work at the instance discrimination level. Fig. 18 shows patches of size 32 × 32 sampled from the unlabeled dataset of images at varying positions, which are scaled to form the initial training set. One of the input patches is selected and several random transformations are applied to give rise to many patches that vary in the degree of perturbation but not in terms of content. This process is applied to all the cropped images of the dataset. All the distorted crops from a randomly sampled 'seed' image patch form a surrogate class. As a self-supervised prediction task, given a distorted crop extracted from an image, the ConvNet classifies it to one of the K surrogate classes as shown in Fig. 18. For the network to predict the right surrogate class, it needs to be invariant to any transformations related to geometry and color. One downside of this scheme is that the number of surrogate classes is equal to the number of samples in the dataset, which results in a very large number of classes and inhibits scalability. A revised scheme is to treat the task as a metric learning task where, given a cropped image, the representation learned should be similar to crops coming from the same image source and different from others.

Recent work on instance discrimination has raised the performance of self-supervised learning to be on par with supervised learning. The objective of contrastive learning is to make the embeddings of the query (anchor image) similar to the positive key and dissimilar to the negative key embeddings. Each input image is split into a query and a key formed by performing two different sets of augmentations on the image. The authors have suggested various methods for handling negatives: the end-to-end mechanism, the memory bank, and the momentum encoder. In the end-to-end mechanism, as shown in Fig. 19, the query (original samples) is passed through the query encoder and the keys (augmented versions of positive and negative samples) are passed through the key encoder (the same shared encoder for both query and keys), which produces the embeddings for both the query and the keys [70,71]. The loss is calculated over the different pairs (query-keys) and the shared encoder is updated by backpropagating through all the samples during training, maintaining consistency between the queries and keys. One of the downsides of the approach is that the number of negatives is limited by the GPU memory, as the batch size cannot be larger than the GPU memory.

The extension to the above work is to use a memory bank, which is much more memory efficient than using large batch sizes.
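A toy sketch of the Exemplar ConvNet idea described earlier in this subsection is given below: every seed patch defines its own surrogate class, and heavily distorted crops of that patch are classified back to the index of their seed. The patch size and the set of distortions are illustrative assumptions.

```python
from torch.utils.data import Dataset
import torchvision.transforms as T


class ExemplarDataset(Dataset):
    """Each seed patch is a surrogate class; samples are distorted versions of it."""

    def __init__(self, seed_patches, n_views=8):
        self.seed_patches = seed_patches          # list of PIL patches, e.g. 32 x 32 crops
        self.n_views = n_views
        self.distort = T.Compose([
            T.RandomResizedCrop(32, scale=(0.7, 1.0)),
            T.RandomHorizontalFlip(),
            T.ColorJitter(0.4, 0.4, 0.4, 0.1),
            T.ToTensor(),
        ])

    def __len__(self):
        return len(self.seed_patches) * self.n_views

    def __getitem__(self, idx):
        label = idx % len(self.seed_patches)      # surrogate class = index of the seed patch
        view = self.distort(self.seed_patches[label])
        return view, label
```

A standard classifier trained on such a dataset must become invariant to the applied geometric and color distortions, which is exactly the property exploited by the scheme; the obvious drawback, as noted above, is that the number of classes grows with the dataset.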

Fig. 19. End-to-end approach for contrastive learning.


Fig. 21. Momentum Encoder for contrastive learning.

Fig. 20. Memory bank approach for contrastive learning.

The memory bank or the dictionary contains the embeddings of all the negative and positive samples as shown in Fig. 20. The query is passed through the encoder to get the embedding, which is compared to a subset of keys that are randomly sampled from the memory bank. The contrastive loss is then calculated and backpropagated through the query encoder and not through the memory bank. The memory bank is updated once in a while via an exponential moving average to make sure it slowly updates itself and stays in sync with the query representation. The advantage of the scheme is that we can have many negatives but, on the downside, the keys become obsolete with respect to the corresponding queries because the memory bank is not updated frequently [71,72].

To address the issues faced by the memory bank, another scheme called Momentum Contrast (MoCo v1) is proposed that relies on a memory bank but uses a different approach for updating it [71]. Each input image is split into a query and a key formed by performing two different sets of augmentations. Fig. 21 shows the momentum encoder: the query is passed through the encoder and all the keys are passed through the momentum encoder to produce the embeddings. A similarity measure takes these embeddings and measures the similarity between the pairs (query-key). The contrastive loss is then calculated and backpropagated through the query encoder, and the parameters of the momentum encoder are updated using a momentum update with the new weights of the query encoder at every iteration. The momentum encoder relies on the memory bank but with a different update scheme that slowly pursues the query encoder via an exponential moving average (momentum update). This prevents outdated keys and queries from being collected and makes the scheme memory efficient. Another important feature of MoCo is the queue, which follows a first-in-first-out scheme: the oldest key representations are discarded in favor of new key embeddings as the training proceeds. MoCo outperforms the end-to-end and memory bank approaches by having more consistent and up-to-date key representations, and it also decouples the batch size from the number of negatives. Further, MoCo v1 was enhanced to MoCo v2 [73] by adding an MLP head, adding more augmentations similar to SimCLR [66], and using a cosine learning rate. The results also show that MoCo v2 largely closes the gap between unsupervised and supervised representation learning in many image recognition tasks and can serve as an alternative to ImageNet supervised pre-training in several applications.

A very popular scheme based on the end-to-end approach is the simple framework for contrastive learning of visual representations (SimCLR) [66], which has served as a base for many recent contrastive learning schemes. SimCLR adopts contrastive learning that attempts to attract different augmented views of the same image and repel augmented views coming from other images. Fig. 22 shows an input image x on which two separate sets of augmentations are applied, resulting in two correlated views of the same image, x̃i and x̃j (positive pair). The transformation applied can be a combination of random cropping, random color distortion, and Gaussian blur. All other pairs in the batch are treated as dissimilar images (negatives) to the positive pair. Each correlated view of the same image is then passed through an individual ResNet-50 encoder f(.) to get representations hi and hj. The two representations are then passed through an MLP based nonlinear projection head g(.), resulting in lower dimension representations zi and zj. Next, the similarity between the two correlated versions of an image is calculated using cosine similarity on representations zi and zj. Ideally, the similarities between augmented images of the same object will be high while the similarity between different objects will be lower. Finally, the loss function for the contrastive learning objective is calculated, which helps to identify the invariant features of each input image and maximize the ability of the network to identify different transformations of the same image.
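Before continuing with SimCLR's training details, the momentum (exponential moving average) update and the first-in-first-out queue described above for MoCo can be sketched as follows; the momentum coefficient and queue size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """Move the key (momentum) encoder slowly towards the query encoder."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)


class KeyQueue:
    """FIFO queue of key embeddings that serve as negatives."""

    def __init__(self, dim=128, size=65536):
        self.keys = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, new_keys):
        """Insert the newest keys, overwriting (discarding) the oldest ones."""
        n = new_keys.size(0)
        idx = (self.ptr + torch.arange(n)) % self.keys.size(0)
        self.keys[idx] = new_keys
        self.ptr = (self.ptr + n) % self.keys.size(0)
```

After every training step, momentum_update is called once and the current batch of key embeddings is pushed into the queue, so the negatives stay consistent without backpropagating through the key encoder.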

Fig. 22. A simple framework for contrastive learning of visual representations. Two separate data augmentations are applied to the input image x to obtain two correlated views x̃i and x̃j. A base encoder network f(.) and a projection head g(.) are trained to maximize agreement using a contrastive loss. After training is completed, the projection head g(.) is thrown away and the encoder f(.) and representation h are used for downstream tasks [66].

SimCLR uses a contrastive loss called the "NT-Xent loss" (Normalized Temperature-Scaled Cross-Entropy Loss): the augmented pairs in the batch are taken one by one and the softmax function is applied to get the probability of the two images being similar. SimCLR improves upon the previous state-of-the-art self-supervised learning methods and even beats the supervised learning method on ImageNet classification by incorporating a nonlinear projection head over representations hi and hj and implementing strong data augmentation techniques. The representation before the nonlinear projection head is used for downstream tasks. The model is trained on varying batch sizes from 256 to 8192; a batch size of 8192 is trained for a maximum of 100 epochs. The optimizer used is LARS with a learning rate of 4.8 (= 0.3 × BatchSize/256) and weight decay of 10−6. Linear warmup is used for the first 10 epochs, and then the learning rate is decayed with a cosine decay schedule. One of the limitations of SimCLR is that the number of negatives is limited by the batch size; however, the performance becomes better with large batch sizes and a higher number of training epochs. The authors of SimCLR extended the work to SimCLR v2, targeting the setting where the dataset has a large amount of unlabeled data and very little labeled data. The unlabeled data is used to pre-train a large model (teacher model) in an unsupervised way. Next, the model is fine-tuned on the small labeled subset of data in a supervised fashion. Lastly, distillation or self-training is done with unlabeled examples for refining and transferring the task-specific knowledge to a smaller network, also called the student network [74]. Using distillation, the large network can be distilled back to a smaller ResNet-50 network while retaining almost the same accuracy as the larger model. The distillation is not done just on the labeled dataset but also uses the labels produced by the teacher model over the entire unlabeled dataset to train the smaller (student) network.

Another recent work on contrastive learning is Pretext-Invariant Representation Learning (PIRL), which focuses on learning representations that are invariant to the pretext tasks using a memory bank of negatives [75]. In conventional pretext tasks, given an image I on which a transformation t is applied (e.g. rotation or jigsaw puzzle), the ConvNet predicts the property of the transform applied to the image. Under this setup, it is observed that the last layers of the ConvNet include low-level information about the transform, so if the ConvNet receives a rotated image, the last layer features change dramatically according to the pretext task applied. As a result, the network ends up learning less semantic features at the higher layers, which do not transfer well to downstream tasks. What is expected instead is that the representations learned are invariant to these transformations so that they transfer well to the downstream tasks. Intuitively this makes sense because even if an image is divided into patches and shuffled, the visual semantics of the image do not change. Motivated by the idea of generating representations that are invariant to transformations, PIRL is proposed as shown in Fig. 23. The input to the ConvNet (ResNet-50) is a pair of the original image I and the transformed image I t. Each pair is passed through the shared ConvNet to produce feature embeddings, which are finally sent to the linear projection to produce the representations of the original image and its corresponding transformed image. The image I and any pretext transformed version of this image I t are related samples, and any other sample is an unrelated sample. Hence, by training the network like this, the representations contain very little information about the transform t. Finally, a loss function (Noise Contrastive Estimator) is added that penalizes the network for getting different representations for the positive pair. The loss function puts the embeddings of related images close and pushes away the embeddings of unrelated or random images. Once the model is trained, the linear projection heads are removed and the encoder is used for downstream tasks. The network is trained using mini-batch SGD with an initial learning rate of 1.2 × 10−1 and a final learning rate of 1.2 × 10−4 with a cosine learning rate decay [76]. The network is trained for 800 epochs using a batch size of 1024 images. The important thing that has made contrastive learning work so well is the availability of a large number of negatives. PIRL uses a memory bank containing a moving average of the learned representations of all the original images, which enables a large number of negatives to be used during training. PIRL is one of the recent works that learns good visual representations of images irrespective of the pretext task used.

Contrastive schemes are computationally expensive as they push away the representations that come from different images while pulling together the representations that come from different views of the same image. Computing all the pairwise comparisons on a large dataset is intractable and requires a lot of computation. Hence, most contrastive schemes rely on approximation and take only a subset of examples for comparison. Moreover, in instance-based learning, every sample is treated as its own class. This makes it unreliable in conditions where an input sample is compared against other samples coming from the same class.

2.7. Clustering based schemes

Clustering is another scheme for learning representations using self-supervised learning. Clustering schemes extract representations from the input images using a feature extractor and club semantically similar image features together. The model is then trained on these cluster assignments, which serve as pseudo labels. Some of the clustering based approaches are DeepCluster [78] and SeLA [79]. Most of the clustering approaches are offline, which means they require at least one forward pass over the entire dataset to calculate the cluster assignment; hence this becomes computationally expensive for large datasets. On the other hand, contrastive schemes based on noise contrastive estimation generally operate by comparing different pairs of images and then calculating a contrastive loss, which again becomes computationally expensive. Addressing these challenges, a new clustering-based self-supervised approach for image representation learning has recently attracted attention that combines online clustering with contrastive learning.
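A minimal sketch of the offline clustering idea behind approaches such as DeepCluster is shown below, assuming scikit-learn's k-means and a data loader that yields batches of unlabeled images; the number of clusters is an illustrative assumption. The resulting cluster indices act as pseudo labels for a classification head, and the clustering is recomputed periodically.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans


@torch.no_grad()
def compute_pseudo_labels(encoder, data_loader, n_clusters=1000):
    """One forward pass over the whole dataset, then k-means on the features."""
    encoder.eval()
    features = []
    for images in data_loader:                    # unlabeled images only
        features.append(encoder(images).cpu().numpy())
    features = np.concatenate(features, axis=0)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```

The need for this full pass over the dataset before every (or every few) epochs is exactly the offline cost that online methods such as SwAV try to avoid.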

Fig. 23. Pretext-Invariant Representation Learning (PIRL). Given an input image I, a pretext task t of rotation and jigsaw is applied to give the transformed image
I t . I and I t are sent to the shared ConvNet resulting in the feature embeddings. The network learns representations that are invariant to the transformation t and
retains semantic information by keeping the representations of the image I and its transformed counterpart I t close together and distancing from others [75].

Fig. 24. Contrastive instance learning (left) vs. SwAV (right). Swapping Assignments between Views (SwAV), first the codes are obtained by assigning features to
prototype vectors. Then to solve a ‘‘swapped’’ prediction problem, the codes from one data augmented view is predicted using the other view. Thus, SwAV does not
directly compare image features unlike contrastive instance learning. Prototype vectors are learned along with the ConvNet parameters by backpropagation [77].

The researchers have proposed a self-supervised approach to learn features by Swapping Assignments between multiple Views of the same image (SwAV) [77,80]. It uses an online clustering mechanism to learn better representations by grouping similar features together, comparing representations with cluster centroids. The objective is not only to make the positive pairs of samples close to each other, but also to make sure that all features that are similar to each other club together. As a result, negative comparisons against all the images lying in the large mini-batch are avoided, leading to reduced computational overhead. For example, in a feature space, the features of sheep should be closer to the features of goats (as both are animals) but should be far from the features of cars. Fig. 24 shows the SwAV framework: the ResNet-50 network receives different augmented views of the same image (the views can be more than two) and generates the embedding. This embedding vector then goes to a shallow non-linear network fθ that produces a projection vector denoted by Z. The representations or features generated are not directly compared to each other, as in contrastive learning. Rather, the features are mapped to their nearest neighbor in a set of K trainable prototype vectors (C = [c1, . . . , cK]). This method thus maps the feature encodings of the augmented views of images into a discrete codebook C containing the set of prototype vectors or clusters [c1, . . . , cK] and uses it to look up the codes that are most similar to the features. Then, as a swapped prediction problem, the code Q of one view of an image is predicted from the representation Z of another view of the same image. The online clustering problem is treated as an optimal transport problem using the Sinkhorn–Knopp algorithm, which enforces an equipartition constraint when assigning samples to clusters [81].

SwAV uses a different augmentation strategy as opposed to SimCLR and MoCo. It uses multi-scale cropping and creates multiple views of a single image. In multi-crop augmentation, full high-resolution images (e.g. 224 × 224) from the dataset are taken to generate standard or high-resolution cropped images (representing global views of the image), and then additional low-resolution images (e.g. 96 × 96) are sampled along with the global views. Fig. 25 shows the setup of the swapped prediction problem, wherein zs and zt are the representations of the two views of the image and qs and qt are the respective codes generated. As a swapped prediction problem, the code of the image is predicted from another view of the same image, and the goal is to minimize the cross-entropy loss between the two views of the same image. If two different views of the same image contain similar information, then it should be possible to predict the code of one from the feature of the other. Recently, a self-supervised method based on SwAV has been proposed, called SEER, that works with high dimensional complex data. The model is pre-trained on billions of random, unlabeled and uncurated public Instagram images, and is fine-tuned on ImageNet in a supervised fashion. SEER outperforms the state-of-the-art self-supervised models, attaining 84.2 percent top-1 accuracy on ImageNet [82].

Fig. 25. Setting up the swapped prediction problem between two separate views of the same image [77].
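The Sinkhorn–Knopp step and the swapped prediction loss described above can be sketched compactly as follows; the epsilon, temperature and iteration count are illustrative assumptions, and the distributed details of the published implementation are omitted.

```python
import torch


@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn B x K prototype scores into codes Q under an equipartition constraint."""
    Q = torch.exp(scores / eps).t()               # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)           # normalize prototypes (rows)
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)           # normalize samples (columns)
        Q /= B
    return (Q * B).t()                            # B x K, each row is a soft code


def swapped_prediction_loss(scores_s, scores_t, codes_s, codes_t, temperature=0.1):
    """Predict the code of one view from the prototype scores of the other view."""
    log_p_s = torch.log_softmax(scores_s / temperature, dim=1)
    log_p_t = torch.log_softmax(scores_t / temperature, dim=1)
    return -0.5 * ((codes_t * log_p_s).sum(dim=1).mean()
                   + (codes_s * log_p_t).sum(dim=1).mean())
```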

2.7.1. Training in contrastive schemes

Contrastive learning employs a variety of optimization algorithms for effective training of the model [83]. Training the ConvNet involves learning the parameters in a way that minimizes the contrastive loss, which serves as an unsupervised objective function. Many contrastive schemes such as [72,73,75,84] have been trained using mini-batch Stochastic Gradient Descent (SGD). One of the most important hyperparameters for SGD is the learning rate, which in practice should gradually be decreased over time to prevent overshooting of the objective; however, a small rate takes a long time to navigate a gentle slope. Hence, we need to add momentum to gradient descent, which is used in most deep learning approaches.

Another popular optimization method, gradient descent with adaptive learning rate (Adam) [85], has been used in a few methods [62,86,87]. Adam is a combination of RMSProp [85] and momentum, incorporating the first-order momentum of the gradient term. Both the first and second moments are bias-corrected to account for their initialization at zero.

For large batch size training, as in end-to-end contrastive learning schemes [66,73,77], a standard SGD based optimizer with large learning rates becomes highly unstable; it results in lower model performance and training may diverge. To stabilize the training, the Layer-wise Adaptive Rate Scaling (LARS) [88] optimizer is used along with a cosine decay schedule [76]. LARS uses a different learning rate for each layer rather than for each weight, and the magnitude of the update is controlled with respect to the weight norm to stabilize the training speed. The LARS optimizer is initialized with the learning rate, and the LARS coefficient η defines how much we trust the layer to change its weights during one update. Once the LARS optimizer is defined, a scheduler is initialized with an initial warm-up learning rate for a few warm-up epochs, during which the learning rate is gradually increased to the target learning rate. After the warm-up period, the learning rate is decayed with the cosine decay schedule without restarts.

3. Performance comparisons

Recently, there has been a rapid surge in self-supervised learning methods for computer vision tasks that have started to outperform supervised learning methods. In this section, we evaluate self-supervised methods on various standardized datasets and a variety of downstream tasks. Most of the time the model is pre-trained on a large unlabeled dataset such as ImageNet on a pretext task and fine-tuned on a smaller labeled dataset. Table 2 shows the linear classification top-1 accuracy on top of the features learned by the network pre-trained on VOC7 without labels.

Table 2
Linear classification top-1 accuracy on top of features learned using the self-supervised approach pre-trained on VOC7 without labels, and object detection with fine-tuned features on VOC7+12 using Faster-RCNN. The backbone architectures used are AlexNet with 61M parameters and ResNet50 with 25.6M parameters.
Method  Architecture  Parameters  Classification  Detection
Supervised  AlexNet  61M  79.9  56.8
Supervised  ResNet50  25.6M  87.5  81.3
Inpaint [1]  AlexNet  61M  56.5  44.5
Color [44]  AlexNet  61M  65.6  46.9
BiGAN [55]  AlexNet  61M  60.1  46.9
Context [60]  AlexNet  61M  65.3  51.1
DeepCluster [80]  AlexNet  61M  72  55.4
Rotation [2]  ResNet50  25.6M  63.9  72.5
Jigsaw [61]  ResNet50  25.6M  64.5  75.1
LA [89]  ResNet50  25.6M  69.1  –
NPID [72]  ResNet50  25.6M  76.6  79.1
PIRL [75]  ResNet50  25.6M  81.1  80.7
MoCo [71]  ResNet50  25.6M  –  81.4
SwAV [77]  ResNet50  25.6M  88.9  82.6

Similarly, for object detection, the pre-trained model is fine-tuned on VOC7+12 using Faster-RCNN. We see that contrastive schemes such as MoCo, PIRL, and SwAV outperform the other self-supervised models.

Fig. 26 depicts the ImageNet top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pre-trained on ImageNet). Self-supervised schemes have shown exceptional results and are heading towards supervised learning. SimCLR achieves the same classification accuracy as ResNet-50 trained with supervised learning, but at the cost of increased width of ResNet-50, i.e., increased parameters. SimCLR performs well due to the large batch of negative examples, the output projection head, stronger data augmentation, and longer training time. The performance gains are even larger for SwAV than for SimCLR, which is attributed to factors such as multi-scale cropping, generating multiple views of a single image, and adopting an online clustering mechanism to group similar features. Networks with more parameters do give higher linear evaluation accuracy than training on ResNet50 and shrink the gap with supervised training. However, large models are hard to handle and computationally expensive; hence researchers are coming up with ways in which a large trained model, pre-trained on a large set of unlabeled data and fine-tuned on a small labeled subset, is distilled back to a standard ResNet50 [74]. On the architectural design front, a recent work called RegNets proposes a new family of ConvNets that is capable of scaling to billions of parameters and can be optimized to fit different runtime and memory constraints [90].
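The warm-up plus cosine decay schedule described in Section 2.7.1 can be sketched as a simple function of the epoch index; the base rate, warm-up length and epoch budget below are placeholders rather than recommended values.

```python
import math


def scheduled_lr(epoch, base_lr=4.8, warmup_epochs=10, total_epochs=100):
    """Linear warm-up to base_lr, then cosine decay (without restarts) towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The returned value is assigned to the optimizer's parameter groups at the start of every epoch, or every step when a fractional epoch index is used.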

Table 3
Linear classification on ImageNet. Top-1 accuracy on the ImageNet dataset trained on frozen features from different self-supervised methods with a standard ResNet-50 containing 24M parameters.
Method  Arch.  Parameters  Top1
Supervised  R50  24  76.5
Colorization [44]  R50  24  39.6
Jigsaw [61]  R50  24  45.7
NPID [72]  R50  24  54
BigBiGAN [46]  R50  24  56.6
LA [89]  R50  24  58.8
NPID++ [75]  R50  24  59
MoCo [73]  R50  24  60.6
SeLa [91]  R50  24  61.5
PIRL [75]  R50  24  63.6
CPC [62]  R50  24  63.8
PCL [62]  R50  24  65.9
SimCLR [66]  R50  24  70
MoCov2 [73]  R50  24  71.1
SwAV [77]  R50  24  75.3

Table 4
Linear classification performance on four datasets (ImageNet, VOC07, Places205, iNaturalist) using the setup of [92]. The linear classifiers are trained on image representations generated by a ConvNet trained on ImageNet (without labels) using the self-supervised approach. Numbers with † are measured using 10-crop evaluation.
Method  Parameters  ImageNet  VOC07  Places205  iNat.
ResNet-50 using evaluation setup of [92]
Supervised  25.6M  75.9  87.5  51.5  45.4
Colorization [92]  25.6M  39.6  55.6  37.5  –
Rotation [2]  25.6M  48.9  63.9  41.4  23
NPID++ [72]  25.6M  59  76.6  46.4  32.4
MoCo [71]  25.6M  60.6  –  –  –
Jigsaw [61]  25.6M  45.7  64.5  41.2  21.3
PIRL [75]  25.6M  63.6  81.1  49.8  34.1
SwAV [77]  25.6M  75.3  88.9  56.7  48.6
Different architecture or evaluation setup
NPID [72]  25.6M  54  –  45.5  –
BigBiGAN [46]  25.6M  56.6  –  –  –
AET [64]  61M  40.6  –  37.1  –
DeepCluster [80]  61M  39.8  –  37.5  –
Rot. [34]  61M  54  –  45.5  –
LA [89]  25.6M  60.2†  –  50.2†  –
CMC [93]  51M  64.1  –  –  –
CPC [62]  44.5M  48.7  –  –  –
CPC-Huge [94]  305M  61  –  –  –
BigBiGAN-Big [46]  86M  61.3  –  –  –
AMDIM [95]  670M  68.1  –  55.1  –

Fig. 26. ImageNet top-1 accuracy of linear classifiers trained on representations learned with different self-supervised schemes. The pre-trained model is trained on the ImageNet dataset without labels. Results show further gains in accuracy as the number of parameters is increased beyond 24M by increasing the width of ResNet by a factor of ×2, ×4, and ×5 for both SimCLR and SwAV. However, SwAV outperforms SimCLR with an increased number of parameters [77].

Fig. 27. ImageNet top-1 accuracy of linear classifiers trained on representations learned with SwAV and SimCLR. The results show SwAV converges in a smaller number of epochs than SimCLR and reaches a performance of 75% when the number of epochs is increased to 800.

Fig. 27 depicts ImageNet classification accuracy for self-supervised learning schemes. ResNet50 is pre-trained on ImageNet without labels; the network is then frozen and a linear classifier is trained on top of it. The gray bar depicts the performance of supervised pre-training on ImageNet and the blue bar indicates the performance of SimCLR after 1000 epochs of training, which lasts for 40 h on 64 GPUs. The orange bar indicates the performance of SwAV, which converges fast and reaches better performance with 100 epochs, lasting 6 h. Longer training of SwAV for 800 epochs, lasting 50 h, results in 76% top-1 accuracy on ImageNet, as shown in the green bar.

Table 3 shows the progression of top-1 accuracy (classification) on the ImageNet dataset using contrastive self-supervised learning. The ResNet-50 ConvNet learns a linear classifier on top of a frozen representation trained on ImageNet without labels. SwAV outperforms all other self-supervised approaches and is only 1% away from supervised learning on the ImageNet classification task. The self-supervised methods show a further increase in performance as the number of parameters increases.

Table 4 shows the results of transfer learning from the learned representations on the image classification task on four datasets (ImageNet, VOC07, Places205, and iNaturalist) using the setup of [92]. Linear classifiers are trained on the representations obtained by self-supervised models pre-trained on ImageNet without labels. SwAV substantially outperforms its covariant counterparts and produces accuracy comparable to the state-of-the-art supervised model on linear classification on various datasets.

Table 5 shows the ablation of SwAV, MoCo v2, SimCLR, and PIRL, which are all trained under the same setup. MoCo v2 included certain details that were a part of SimCLR, such as the MLP head, color distortion and Gaussian blur augmentation (aug+), and cosine learning decay (cos). It was found that the performance of MoCo v2 increased to 67.5% from 60.6%. Further, an improvement of 3.5% was seen by increasing the training time from 200 to 800 epochs. However, SwAV achieves state-of-the-art performance when trained in the small-batch setting, with fewer epochs, and by using multi-crop augmentation. The authors of SwAV suggest that the multi-crop augmentation strategy is generic and can be implemented in various contrastive learning schemes to further enhance the performance on downstream tasks.
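The linear evaluation protocol used throughout this section can be sketched as follows: the pre-trained backbone is frozen and only a linear classifier is trained on top of the fixed features. The optimizer settings, feature dimensionality and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_linear_eval(backbone, feature_dim=2048, num_classes=1000):
    """Freeze the self-supervised backbone and attach a trainable linear classifier."""
    for param in backbone.parameters():
        param.requires_grad = False
    backbone.eval()
    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    return classifier, optimizer


def linear_eval_step(backbone, classifier, optimizer, images, labels):
    with torch.no_grad():
        features = backbone(images)               # frozen representation h
    loss = F.cross_entropy(classifier(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Top-1 accuracy of this classifier on the held-out split is the number reported in the tables above.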

4. Practical considerations

To implement self-supervised techniques, we need to consider some practical points that enable the learning of good representations. In self-supervision, the pretext task holds a very important part: it is the task the network is intended to solve even though it is not the primary task. Nevertheless, it is done to learn rich image representations that benefit the downstream tasks and enable the network to show high performance with a limited labeled dataset. However, the pretext task should be chosen wisely and in conjunction with the downstream task; although numerous pretext tasks have been proposed in contrastive learning, research is still ongoing to identify the right pretext task for a given problem. Another consideration is to prevent the network from cheating, as networks often "cheat" and find an easy way to solve the pretext task, hence shortcut prevention is essential. Many schemes like jigsaw and context prediction form a grid of non-overlapping patches to prevent the network from learning via boundary pixels, and the color channels are jittered to prevent easy learning [1,61]. Also, well-designed, strong data augmentation and the type of network architecture used for self-supervised learning make a big difference in the kind of results in which we are interested. The impact of strong data augmentation on self-supervised schemes has been thoroughly studied recently in SimCLR [66] and SwAV [77]. Fig. 28 shows linear evaluation (ImageNet top-1 accuracy) under single augmentations as well as under compositions of data augmentations. Applying a single augmentation results in lower performance than applying a composition of augmentations; e.g., applying crop and then color augmentation leads to 56.3% accuracy, higher than just applying color augmentation. This insight was laid out in the SimCLR paper, where the authors stated that color distortion is very important for contrastive learning. If you have two crops of the same image, the intensity histograms may be similar for both crops; as a result, color distortion becomes the second most important transformation that makes the task harder for the ConvNet and improves the quality of the representation. A well-tuned augmentation strategy can lead to substantial gains, as seen in SimCLR and SwAV.

Fig. 28. Linear evaluation (ImageNet top-1 accuracy) on a network pre-trained on ImageNet without labels under single and composed data augmentations [66].

Scaling of batch size and training steps has also been found to impact the performance of self-supervised models. As the number of training epochs increases, the accuracy of self-supervised learning methods tends to increase, as shown in Fig. 29. Self-supervised learning benefits more from scaling up the training and enhancing augmentations than supervised training does.

Fig. 29. Top-1 ImageNet accuracy of ResNet-50 trained with different batch sizes and training steps. Each bar is a single run from scratch [66].

4.1. Open challenges

Self-supervised learning methods have achieved great success and are obtaining performance that is close to supervised models on image recognition tasks. However, there are certain challenges faced by self-supervision, which are as follows:

Resource intensive and requires careful attention to detailing: Contrastive learning requires long batch training time as well as complex resource setups like many TPUs [96]. For example, SimCLR uses 128 TPUs for large batch training during self-supervised learning. Also, careful attention has to be paid to details like data augmentation and pretext tasks. As the focus of contrastive learning is instance discrimination, a large number of negative and positive pairs have to be generated, and hence an appropriate sample selection strategy should be chosen [97]. A recent work called "Bootstrap Your Own Latent (BYOL)" proposed a new approach to self-supervised learning where the authors considered the reconstruction of features instead of the inputs; it claims to perform better than state-of-the-art contrastive methods while avoiding a large number of negative pairs [98].

Beyond contrastive learning: In another work, the authors claim that the starting layers of the ConvNet learn useful representations from only a single high-resolution image, provided sufficient data augmentation is used. However, at deeper layers the gap with manual supervision cannot be closed even if millions of unlabeled images are used for training. The proposed scheme uses a high-resolution image and generates 1 million augmented crops on which the model is trained. It is observed that the representations learned by the first few layers are of the same quality as the representations learned when a linear classifier is trained with supervised and unsupervised learning on millions of images [99].

Learning beyond a single object in an image: Most of the self-supervised pre-trained models are trained using images that have a single dominant object, as in the case of the ImageNet dataset, whereas in applications like self-driving cars the scene contains multiple objects and distinguishing between two similar scenes is quite a challenging task [100].

Table 5
Top-1 linear classifier accuracy on ImageNet on top of frozen features from a ResNet-50 pre-trained using the self-supervised learning approach. The performance of the self-supervised models is studied as a function of the MLP head, augmentation (Multi-crop), cosine learning decay (cos), number of epochs, and batch size. 2 × 160 + 4 × 96 indicates 2 crops of size 160 × 160 and 4 crops of size 96 × 96.
Method  MLP  Multi-crop  cos  epochs  batch  ImageNet top-1 accuracy
MoCo v2  ✓  2 × 224  ✓  200  256  67.5
SimCLR  ✓  2 × 224  ✓  200  256  61.9
SimCLR  ✓  2 × 224  ✓  200  8192  66.6
SwAV  ✓  2 × 160 + 4 × 96  ✓  200  256  72.0
SwAV  ✓  2 × 224 + 6 × 96  ✓  200  256  72.7
Results of longer unsupervised pre-training
MoCo v2  ✓  2 × 224  ✓  800  256  71.1
SimCLR  ✓  2 × 224  ✓  1000  4096  69.3
PIRL  ✓  2 × 224  ✓  800  1024  63.6
SwAV  ✓  2 × 224 + 6 × 96  ✓  400  256  74.3

Learning beyond structured images: Satellite images and medical images (e.g. microscopic images) have little or no structure to exploit; as a result it becomes difficult to find context in them. Hence schemes like relative patch prediction or the jigsaw puzzle are inefficient for dealing with such images [101].

Dataset biases: In a self-supervised learning task, the data itself provides the supervision to solve the pretext tasks. As a result, the feature representations learned using self-supervised objectives are influenced by the data on which the model is trained. Such biases are hard to minimize with the increase in the size of the datasets.

Augmentation strategy for new domains: Constructing a pre-trained model with datasets containing medical and satellite images may demand a different augmentation strategy compared to what we use with natural image datasets (ImageNet).

5. Conclusions

Self-supervised methods have shown results on par with supervised learning by leveraging the huge amount of unlabeled data that is available for free. The effectiveness of self-supervised learning techniques has been found in complex downstream tasks such as image classification, object detection, and image segmentation, where limited labeled data is available. The unlabeled data, which is available for free and present in abundance, can be utilized for building effective pre-trained models. The greatest benefits of pre-training are currently in low data regimes where limited annotation data is available. The paper has done an extensive review of various handcrafted pretext tasks as well as various self-supervised methods that follow the contrastive approach or instance discrimination. The paper also highlights the state-of-the-art self-supervised methods that are showing significant results in comparison to supervised learning. Finally, this work concludes by discussing some practical considerations and open challenges of image recognition tasks using self-supervised learning.

CRediT authorship contribution statement

Kriti Ohri: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Mukesh Kumar: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

All authors approved the version of the manuscript to be published.

References

[1] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros, Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[2] Spyros Gidaris, Praveer Singh, Nikos Komodakis, Unsupervised representation learning by predicting image rotations, 2018, arXiv preprint arXiv:1803.07728.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[4] Jeremy Howard, Sebastian Ruder, Universal language model fine-tuning for text classification, 2018, arXiv preprint arXiv:1801.06146.
[5] Yoav Goldberg, Omer Levy, Word2vec explained: deriving mikolov et al.'s negative-sampling word-embedding method, 2014, arXiv preprint arXiv:1402.3722.
[6] Jeffrey Pennington, Richard Socher, Christopher D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[7] Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Bag of tricks for efficient text classification, 2016, arXiv preprint arXiv:1607.01759.
[8] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019, arXiv preprint arXiv:1907.11692.
[9] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Unsupervised cross-lingual representation learning at scale, 2019, arXiv preprint arXiv:1911.02116.
[10] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2019, arXiv preprint arXiv:1910.10683.
[11] Linchao Zhu, Yi Yang, Actbert: Learning global-local video-text representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8746–8755.
[12] Yoshua Bengio, Aaron Courville, Pascal Vincent, Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[13] A. Emin Orhan, Vaibhav V. Gupta, Brenden M. Lake, Self-supervised learning through the eyes of a child, 2020, arXiv e-prints, arXiv–2007.
[14] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[15] Karen Simonyan, Andrew Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual [41] Yoshua Bengio, Learning deep architectures for AI, Now Publishers Inc,
learning for image recognition, in: Proceedings of the IEEE Conference 2009.
on Computer Vision and Pattern Recognition, 2016, pp. 770–778. [42] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol,
[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, Kilian Q Weinberger, Extracting and composing robust features with denoising autoencoders,
Densely connected convolutional networks, in: Proceedings of the IEEE in: Proceedings of the 25th International Conference on Machine
Conference on Computer Vision and Pattern Recognition, 2017, pp. Learning, 2008, pp. 1096–1103.
4700–4708. [43] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, Matthieu
[19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Cord, Learning representations by predicting bags of visual words, in:
Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6928–6938.
Recognition, 2016, pp. 2818–2826. [44] Richard Zhang, Phillip Isola, Alexei A. Efros, Colorful image coloriza-
[20] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, Imagenet: tion, in: European Conference on Computer Vision, Springer, 2016, pp.
A large-scale hierarchical image database, in: 2009 IEEE Conference on 649–666.
Computer Vision and Pattern Recognition, Ieee, 2009, pp. 248–255. [45] Richard Zhang, Phillip Isola, Alexei A. Efros, Split-brain autoencoders:
[21] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Unsupervised learning by cross-channel prediction, in: Proceedings of the
Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.
Veit, et al., Openimages: A public dataset for large-scale multi-label 1058–1067.
and multi-class image classification, 2017, Dataset available from https: [46] Jeff Donahue, Karen Simonyan, Large scale adversarial representation
//github.com/openimages, 2(3), 2-3. learning, 2019, arXiv preprint arXiv:1907.02544.
[22] Ameet V. Joshi, Amazon’s machine learning toolkit: Sagemaker, in: [47] Carl Doersch, Tutorial on variational autoencoders, 2016, arXiv preprint
Machine Learning and Artificial Intelligence, Springer, 2020, pp. 233–243. arXiv:1606.05908.
[23] Joao Carreira, Andrew Zisserman, Quo vadis, action recognition? a new [48] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
model and the kinetics dataset, in: Proceedings of the IEEE Conference Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative
on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308. adversarial networks, 2014, arXiv preprint arXiv:1406.2661.
[24] Yanming Guo, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, Michael S. [49] Alec Radford, Luke Metz, Soumith Chintala, Unsupervised representation
Lew, Deep learning for visual understanding: A review, Neurocomputing learning with deep convolutional generative adversarial networks, 2015,
187 (2016) 27–48. arXiv preprint arXiv:1511.06434.
[25] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio, Transfusion: [50] Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein generative
Understanding transfer learning for medical imaging, in: Advances in adversarial networks, in: International Conference on Machine Learning,
Neural Information Processing Systems, 2019, pp. 3347–3357. PMLR, 2017, pp. 214–223.
[26] Olivier Chapelle, Bernhard Scholkopf, Alexander Zien, Semi-supervised [51] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin,
learning (chapelle, o. et al., eds.; 2006)[book reviews], IEEE Trans. Neural Aaron Courville, Improved training of wasserstein gans, 2017, arXiv
Netw. 20 (3) (2009) 542–542. preprint arXiv:1704.00028.
[27] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan, [52] Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive growing
Billion-scale semi-supervised learning for image classification, 2019, arXiv of gans for improved quality, stability, and variation, 2017, arXiv preprint
preprint arXiv:1905.00546. arXiv:1710.10196.
[28] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, [53] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida,
Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten, Spectral normalization for generative adversarial networks, 2018, arXiv
Exploring the limits of weakly supervised pretraining, in: Proceedings of preprint arXiv:1802.05957.
the European Conference on Computer Vision, ECCV, 2018, 181–196. [54] Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena, Self-
[29] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, Zeynep Akata, Zero- attention generative adversarial networks, in: International Conference
shot learning—a comprehensive evaluation of the good, the bad and the on Machine Learning, PMLR, 2019, pp. 7354–7363.
ugly, IEEE Trans. Pattern Anal. Mach. Intell. 41 (9) (2018) 2251–2265. [55] Jeff Donahue, Philipp Krähenbühl, Trevor Darrell, Adversarial feature
[30] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, Jia- learning, 2016, arXiv preprint arXiv:1605.09782.
Bin Huang, A closer look at few-shot classification, 2019, arXiv preprint [56] Andrew Brock, Jeff Donahue, Karen Simonyan, Large scale GAN training
arXiv:1904.04232. for high fidelity natural image synthesis, 2018, arXiv preprint arXiv:
[31] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele, Meta-transfer 1809.11096.
learning for few-shot learning, in: Proceedings of the IEEE/CVF Conference [57] Tero Karras, Samuli Laine, Timo Aila, A style-based generator architecture
on Computer Vision and Pattern Recognition, 2019, 403–412. for generative adversarial networks, in: Proceedings of the IEEE/CVF
[32] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Conference on Computer Vision and Pattern Recognition, 2019, pp.
Joshua B. Tenenbaum, Hugo Larochelle, Richard S. Zemel, Meta-learning 4401–4410.
for semi-supervised few-shot classification, 2018, arXiv preprint arXiv: [58] Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, Timothy Lillicrap,
1803.00676. Logan: Latent optimisation for generative adversarial networks, 2019,
[33] Linchao Zhu, Yi Yang, Label independent memory for semi-supervised arXiv preprint arXiv:1912.00953.
few-shot video classification, IEEE Ann. Hist. Comput. (01) (2020) 1–1. [59] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby,
[34] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer, Revisiting self- Self-supervised gans via auxiliary rotation loss, in: Proceedings of the
supervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019,
IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12154–12163.
1920–1929. [60] Carl Doersch, Abhinav Gupta, Alexei A. Efros, Unsupervised visual rep-
[35] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, Texture and art with resentation learning by context prediction, in: Proceedings of the IEEE
deep neural networks, Curr. Opin. Neurobiol. 46 (2017) 178–186. International Conference on Computer Vision, 2015, pp. 1422–1430.
[36] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, [61] Mehdi Noroozi, Paolo Favaro, Unsupervised learning of visual represen-
Felix A Wichmann, Wieland Brendel, Imagenet-trained CNNs are biased tations by solving jigsaw puzzles, in: European Conference on Computer
towards texture; increasing shape bias improves accuracy and robustness, Vision, Springer, 2016, pp. 69–84.
2018, arXiv preprint arXiv:1811.12231. [62] Aaron van den Oord, Yazhe Li, Oriol Vinyals, Representation learning with
[37] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, contrastive predictive coding, 2018, arXiv preprint arXiv:1807.03748.
Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim [63] Michael Gutmann, Aapo Hyvärinen, Noise-contrastive estimation: A new
Neumann, Alexey Dosovitskiy, et al., A large-scale study of representation estimation principle for unnormalized statistical models, in: Proceedings
learning with the visual task adaptation benchmark, 2019, arXiv preprint of the Thirteenth International Conference on Artificial Intelligence and
arXiv:1910.04867. Statistics, 2010, pp. 297–304.
[38] Jiirgen Schmidhuber, Making the world differentiable: On using self- [64] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo, Aet vs. aed: Unsu-
supervised fully recurrent neural networks for dynamic reinforcement pervised representation learning by auto-encoding transformations rather
learning and planning in non-stationary environm nts, 1990. than data, in: Proceedings of the IEEE Conference on Computer Vision and
[39] Longlong Jing, Yingli Tian, Self-supervised visual feature learning with Pattern Recognition, 2019, pp. 2547–2555.
deep neural networks: A survey, IEEE Trans. Pattern Anal. Mach. Intell. [65] William Falcon, Kyunghyun Cho, A framework for contrastive self-
(2020). supervised learning and designing a new approach, 2020, arXiv preprint
[40] Dumitru Erhan, Aaron Courville, Yoshua Bengio, Pascal Vincent, Why arXiv:2009.00104.
does unsupervised pre-training help deep learning? in: Proceedings of [66] Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton, A
the Thirteenth International Conference on Artificial Intelligence and simple framework for contrastive learning of visual representations, 2020,
Statistics, 2010, 201–208. arXiv preprint arXiv:2002.05709.

