
A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen 1 Simon Kornblith 1 Mohammad Norouzi 1 Geoffrey Hinton 1

arXiv:2002.05709v3 [cs.LG] 1 Jul 2020

Abstract

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels. 1

1 Google Research, Brain Team. Correspondence to: Ting Chen <[email protected]>.
Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
1 Code available at https://github.com/google-research/simclr.

[Figure 1: plot of ImageNet Top-1 Accuracy (%) versus Number of Parameters (Millions) for Supervised, SimCLR (4x), SimCLR (2x), SimCLR, CPCv2-L, CPCv2, MoCo (4x), MoCo (2x), MoCo, CMC, PIRL-c2x, PIRL-ens., PIRL, AMDIM, BigBiGAN, LA, Rotation, and InstDisc.]
Figure 1. ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet). Gray cross indicates supervised ResNet-50. Our method, SimCLR, is shown in bold.

1. Introduction

Learning effective visual representations without human supervision is a long-standing problem. Most mainstream approaches fall into one of two classes: generative or discriminative. Generative approaches learn to generate or otherwise model pixels in the input space (Hinton et al., 2006; Kingma & Welling, 2013; Goodfellow et al., 2014). However, pixel-level generation is computationally expensive and may not be necessary for representation learning. Discriminative approaches learn representations using objective functions similar to those used for supervised learning, but train networks to perform pretext tasks where both the inputs and labels are derived from an unlabeled dataset. Many such approaches have relied on heuristics to design pretext tasks (Doersch et al., 2015; Zhang et al., 2016; Noroozi & Favaro, 2016; Gidaris et al., 2018), which could limit the generality of the learned representations. Discriminative approaches based on contrastive learning in the latent space have recently shown great promise, achieving state-of-the-art results (Hadsell et al., 2006; Dosovitskiy et al., 2014; Oord et al., 2018; Bachman et al., 2019).

In this work, we introduce a simple framework for contrastive learning of visual representations, which we call SimCLR. Not only does SimCLR outperform previous work (Figure 1), but it is also simpler, requiring neither specialized architectures (Bachman et al., 2019; Hénaff et al., 2019) nor a memory bank (Wu et al., 2018; Tian et al., 2019; He et al., 2019; Misra & van der Maaten, 2019).

In order to understand what enables good contrastive representation learning, we systematically study the major components of our framework and show that:

• Composition of multiple data augmentation operations is crucial in defining the contrastive prediction tasks that yield effective representations. In addition, unsupervised contrastive learning benefits from stronger data augmentation than supervised learning.

• Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.

• Representation learning with contrastive cross entropy loss benefits from normalized embeddings and an appropriately adjusted temperature parameter.

• Contrastive learning benefits from larger batch sizes and longer training compared to its supervised counterpart. Like supervised learning, contrastive learning benefits from deeper and wider networks.

We combine these findings to achieve a new state-of-the-art in self-supervised and semi-supervised learning on ImageNet ILSVRC-2012 (Russakovsky et al., 2015). Under the linear evaluation protocol, SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art (Hénaff et al., 2019). When fine-tuned with only 1% of the ImageNet labels, SimCLR achieves 85.8% top-5 accuracy, a relative improvement of 10% (Hénaff et al., 2019). When fine-tuned on other natural image classification datasets, SimCLR performs on par with or better than a strong supervised baseline (Kornblith et al., 2019) on 10 out of 12 datasets.

2. Method

2.1. The Contrastive Learning Framework

Inspired by recent contrastive learning algorithms (see Section 7 for an overview), SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. As illustrated in Figure 2, this framework comprises the following four major components.

[Figure 2: diagram of the framework. An example x is transformed by two sampled augmentations t ∼ T and t′ ∼ T into views x̃_i and x̃_j, encoded by f(·) into representations h_i and h_j, and mapped by g(·) to z_i and z_j, between which agreement is maximized.]
Figure 2. A simple framework for contrastive learning of visual representations. Two separate data augmentation operators are sampled from the same family of augmentations (t ∼ T and t′ ∼ T) and applied to each data example to obtain two correlated views. A base encoder network f(·) and a projection head g(·) are trained to maximize agreement using a contrastive loss. After training is completed, we throw away the projection head g(·) and use encoder f(·) and representation h for downstream tasks.

• A stochastic data augmentation module that transforms any given data example randomly, resulting in two correlated views of the same example, denoted x̃_i and x̃_j, which we consider a positive pair. In this work, we sequentially apply three simple augmentations: random cropping followed by resizing back to the original size, random color distortions, and random Gaussian blur. As shown in Section 3, the combination of random crop and color distortion is crucial to achieving good performance.

• A neural network base encoder f(·) that extracts representation vectors from augmented data examples. Our framework allows various choices of the network architecture without any constraints. We opt for simplicity and adopt the commonly used ResNet (He et al., 2016) to obtain h_i = f(x̃_i) = ResNet(x̃_i), where h_i ∈ R^d is the output after the average pooling layer.

• A small neural network projection head g(·) that maps representations to the space where the contrastive loss is applied. We use an MLP with one hidden layer to obtain z_i = g(h_i) = W^(2) σ(W^(1) h_i), where σ is a ReLU nonlinearity. As shown in Section 4, we find it beneficial to define the contrastive loss on the z_i's rather than the h_i's.

• A contrastive loss function defined for a contrastive prediction task. Given a set {x̃_k} including a positive pair of examples x̃_i and x̃_j, the contrastive prediction task aims to identify x̃_j in {x̃_k}_{k≠i} for a given x̃_i.

We randomly sample a minibatch of N examples and define the contrastive prediction task on pairs of augmented examples derived from the minibatch, resulting in 2N data points. We do not sample negative examples explicitly. Instead, given a positive pair, similar to (Chen et al., 2017), we treat the other 2(N − 1) augmented examples within a minibatch as negative examples. Let sim(u, v) = u^T v / (‖u‖ ‖v‖) denote the dot product between ℓ2-normalized u and v (i.e., cosine similarity). Then the loss function for a positive pair of examples (i, j) is defined as

    ℓ_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(sim(z_i, z_k)/τ) ],    (1)

where 1_{[k≠i]} ∈ {0, 1} is an indicator function evaluating to 1 iff k ≠ i and τ denotes a temperature parameter. The final loss is computed across all positive pairs, both (i, j) and (j, i), in a mini-batch. This loss has been used in previous work (Sohn, 2016; Wu et al., 2018; Oord et al., 2018); for convenience, we term it NT-Xent (the normalized temperature-scaled cross entropy loss).
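As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of the NT-Xent loss over a batch of 2N projections. It assumes rows 2k and 2k+1 of the input are the two augmented views of example k (an arrangement chosen for this sketch, matching Algorithm 1 up to zero-based indexing); it is an illustration, not the released implementation.

import numpy as np

def nt_xent_loss(z, tau=0.5):
    # z: array of shape (2N, d) of projections; rows 2k and 2k+1 are assumed
    # to be the two augmented views of example k.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2-normalize -> cosine similarity
    sim = z @ z.T / tau                                # pairwise s_{i,j} / tau
    n = sim.shape[0]
    mask = ~np.eye(n, dtype=bool)                      # exclude k == i from the denominator
    pos = np.arange(n) ^ 1                             # index of the positive: 2k <-> 2k+1
    log_prob = sim[np.arange(n), pos] - np.log((np.exp(sim) * mask).sum(axis=1))
    return -log_prob.mean()                            # average over all (i, j) and (j, i) pairs

As a sanity check, with random high-dimensional projections (similarities near zero) the loss is roughly log(2N − 1) before training.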
Algorithm 1 SimCLR's main learning algorithm.

input: batch size N, constant τ, structure of f, g, T.
for sampled minibatch {x_k}_{k=1}^N do
   for all k ∈ {1, . . . , N} do
      draw two augmentation functions t ∼ T, t′ ∼ T
      # the first augmentation
      x̃_{2k−1} = t(x_k)
      h_{2k−1} = f(x̃_{2k−1})   # representation
      z_{2k−1} = g(h_{2k−1})   # projection
      # the second augmentation
      x̃_{2k} = t′(x_k)
      h_{2k} = f(x̃_{2k})   # representation
      z_{2k} = g(h_{2k})   # projection
   end for
   for all i ∈ {1, . . . , 2N} and j ∈ {1, . . . , 2N} do
      s_{i,j} = z_i^T z_j / (‖z_i‖ ‖z_j‖)   # pairwise similarity
   end for
   define ℓ(i, j) = −log [ exp(s_{i,j}/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(s_{i,k}/τ) ]
   L = (1/2N) Σ_{k=1}^N [ ℓ(2k−1, 2k) + ℓ(2k, 2k−1) ]
   update networks f and g to minimize L
end for
return encoder network f(·), and throw away g(·)

Algorithm 1 summarizes the proposed method.

2.2. Training with Large Batch Size

To keep it simple, we do not train the model with a memory bank (Wu et al., 2018; He et al., 2019). Instead, we vary the training batch size N from 256 to 8192. A batch size of 8192 gives us 16382 negative examples per positive pair from both augmentation views. Training with large batch size may be unstable when using standard SGD/Momentum with linear learning rate scaling (Goyal et al., 2017). To stabilize the training, we use the LARS optimizer (You et al., 2017) for all batch sizes. We train our model with Cloud TPUs, using 32 to 128 cores depending on the batch size. 2

2 With 128 TPU v3 cores, it takes ∼1.5 hours to train our ResNet-50 with a batch size of 4096 for 100 epochs.

Global BN. Standard ResNets use batch normalization (Ioffe & Szegedy, 2015). In distributed training with data parallelism, the BN mean and variance are typically aggregated locally per device. In our contrastive learning, as positive pairs are computed in the same device, the model can exploit the local information leakage to improve prediction accuracy without improving representations. We address this issue by aggregating BN mean and variance over all devices during the training. Other approaches include shuffling data examples across devices (He et al., 2019), or replacing BN with layer norm (Hénaff et al., 2019).

[Figure 3: (a) Global and local views. (b) Adjacent views.]
Figure 3. Solid rectangles are images, dashed rectangles are random crops. By randomly cropping images, we sample contrastive prediction tasks that include global to local view (B → A) or adjacent view (D → C) prediction.

2.3. Evaluation Protocol

Here we lay out the protocol for our empirical studies, which aim to understand different design choices in our framework.

Dataset and Metrics. Most of our study for unsupervised pretraining (learning encoder network f without labels) is done using the ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015). Some additional pretraining experiments on CIFAR-10 (Krizhevsky & Hinton, 2009) can be found in Appendix B.9. We also test the pretrained results on a wide range of datasets for transfer learning. To evaluate the learned representations, we follow the widely used linear evaluation protocol (Zhang et al., 2016; Oord et al., 2018; Bachman et al., 2019; Kolesnikov et al., 2019), where a linear classifier is trained on top of the frozen base network, and test accuracy is used as a proxy for representation quality. Beyond linear evaluation, we also compare against state-of-the-art on semi-supervised and transfer learning.

Default setting. Unless otherwise specified, for data augmentation we use random crop and resize (with random flip), color distortions, and Gaussian blur (for details, see Appendix A). We use ResNet-50 as the base encoder network, and a 2-layer MLP projection head to project the representation to a 128-dimensional latent space. As the loss, we use NT-Xent, optimized using LARS with a learning rate of 4.8 (= 0.3 × BatchSize/256) and weight decay of 10^−6. We train at batch size 4096 for 100 epochs. 3 Furthermore, we use linear warmup for the first 10 epochs, and decay the learning rate with the cosine decay schedule without restarts (Loshchilov & Hutter, 2016).

3 Although max performance is not reached in 100 epochs, reasonable results are achieved, allowing fair and efficient ablations.
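To make the default learning rate schedule concrete, here is a small Python sketch of the learning rate as a function of training step under the setting above: linear warmup over the first 10 of 100 epochs up to the base rate of 4.8, followed by cosine decay without restarts. The helper name and step-based parameterization are illustrative and not taken from the released code.

import math

def default_learning_rate(step, total_steps, base_lr=4.8,
                          warmup_epochs=10, total_epochs=100):
    # Linear warmup for the first 10 epochs, then cosine decay to zero
    # (no restarts), as in the default setting described above.
    warmup_steps = int(total_steps * warmup_epochs / total_epochs)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))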
3. Data Augmentation for Contrastive Representation Learning

Data augmentation defines predictive tasks. While data augmentation has been widely used in both supervised and unsupervised representation learning (Krizhevsky et al., 2012; Hénaff et al., 2019; Bachman et al., 2019), it has not been considered as a systematic way to define the contrastive prediction task. Many existing approaches define contrastive prediction tasks by changing the architecture. For example, Hjelm et al. (2018); Bachman et al. (2019) achieve global-to-local view prediction via constraining the receptive field in the network architecture, whereas Oord et al. (2018); Hénaff et al. (2019) achieve neighboring view prediction via a fixed image splitting procedure and a context aggregation network. We show that this complexity can be avoided by performing simple random cropping (with resizing) of target images, which creates a family of predictive tasks subsuming the above mentioned two, as shown in Figure 3. This simple design choice conveniently decouples the predictive task from other components such as the neural network architecture. Broader contrastive prediction tasks can be defined by extending the family of augmentations and composing them stochastically.

[Figure 4: example images for (a) Original, (b) Crop and resize, (c) Crop, resize (and flip), (d) Color distort. (drop), (e) Color distort. (jitter), (f) Rotate {90°, 180°, 270°}, (g) Cutout, (h) Gaussian noise, (i) Gaussian blur, (j) Sobel filtering.]
Figure 4. Illustrations of the studied data augmentation operators. Each augmentation can transform data stochastically with some internal parameters (e.g. rotation degree, noise level). Note that we only test these operators in ablation; the augmentation policy used to train our models only includes random crop (with flip and resize), color distortion, and Gaussian blur. (Original image cc-by: Von.grzanka)

3.1. Composition of data augmentation operations is crucial for learning good representations

To systematically study the impact of data augmentation, we consider several common augmentations here. One type of augmentation involves spatial/geometric transformation of data, such as cropping and resizing (with horizontal flipping), rotation (Gidaris et al., 2018) and cutout (DeVries & Taylor, 2017). The other type of augmentation involves appearance transformation, such as color distortion (including color dropping, brightness, contrast, saturation, hue) (Howard, 2013; Szegedy et al., 2015), Gaussian blur, and Sobel filtering. Figure 4 visualizes the augmentations that we study in this work.

[Figure 5: heatmap of linear-evaluation top-1 accuracy; the values are reproduced in the table below.]

1st \ 2nd   Crop   Cutout  Color  Sobel  Noise  Blur   Rotate  Average
Crop        33.1   33.9    56.3   46.0   39.9   35.0   30.2    39.2
Cutout      32.2   25.6    33.9   40.0   26.5   25.2   22.4    29.4
Color       55.8   35.5    18.8   21.0   11.4   16.5   20.8    25.7
Sobel       46.2   40.6    20.9    4.0    9.3    6.2    4.2    18.8
Noise       38.8   25.8     7.5    7.6    9.8    9.8    9.6    15.5
Blur        35.1   25.2    16.6    5.8    9.7    2.6    6.7    14.5
Rotate      30.0   22.5    20.7    4.3    9.7    6.5    2.6    13.8

Figure 5. Linear evaluation (ImageNet top-1 accuracy) under individual or composition of data augmentations, applied only to one branch. For all columns but the last, diagonal entries correspond to single transformation, and off-diagonals correspond to composition of two transformations (applied sequentially). The last column reflects the average over the row.

To understand the effects of individual data augmentations and the importance of augmentation composition, we investigate the performance of our framework when applying augmentations individually or in pairs. Since ImageNet images are of different sizes, we always apply crop and resize images (Krizhevsky et al., 2012; Szegedy et al., 2015), which makes it difficult to study other augmentations in the absence of cropping. To eliminate this confound, we consider an asymmetric data transformation setting for this ablation. Specifically, we always first randomly crop images and resize them to the same resolution, and we then apply the targeted transformation(s) only to one branch of the framework in Figure 2, while leaving the other branch as the identity (i.e. t(x_i) = x_i). Note that this asymmetric data augmentation hurts the performance. Nonetheless, this setup should not substantively change the impact of individual data augmentations or their compositions.
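For illustration only, a minimal torchvision sketch of this asymmetric setting might look as follows; the choice of color dropping as the studied transformation and the variable names are ours, not taken from the paper's pipeline.

from torchvision import transforms

# Both branches are randomly cropped and resized; the transformation under
# study (here color dropping, as an example) is applied to one branch only,
# while the other branch is left as the identity after cropping.
crop = transforms.RandomResizedCrop(224)
studied = transforms.RandomGrayscale(p=1.0)  # stand-in for the studied operator

branch_with_transform = transforms.Compose([crop, studied, transforms.ToTensor()])
branch_identity = transforms.Compose([crop, transforms.ToTensor()])

# Given a PIL image `img`, the two views would be:
# view_1, view_2 = branch_with_transform(img), branch_identity(img)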
Figure 5 shows linear evaluation results under individual and composition of transformations. We observe that no single transformation suffices to learn good representations, even though the model can almost perfectly identify the positive pairs in the contrastive task. When composing augmentations, the contrastive prediction task becomes harder, but the quality of representation improves dramatically. Appendix B.2 provides a further study on composing a broader set of augmentations.

One composition of augmentations stands out: random cropping and random color distortion. We conjecture that one serious issue when using only random cropping as data augmentation is that most patches from an image share a similar color distribution. Figure 6 shows that color histograms alone suffice to distinguish images. Neural nets may exploit this shortcut to solve the predictive task. Therefore, it is critical to compose cropping with color distortion in order to learn generalizable features.

[Figure 6: pixel intensity histograms, (a) Without color distortion, (b) With color distortion.]
Figure 6. Histograms of pixel intensities (over all channels) for different crops of two different images (i.e. two rows). The image for the first row is from Figure 4. All axes have the same range.

3.2. Contrastive learning needs stronger data augmentation than supervised learning

To further demonstrate the importance of the color augmentation, we adjust the strength of color augmentation as shown in Table 1. Stronger color augmentation substantially improves the linear evaluation of the learned unsupervised models. In this context, AutoAugment (Cubuk et al., 2019), a sophisticated augmentation policy found using supervised learning, does not work better than simple cropping + (stronger) color distortion. When training supervised models with the same set of augmentations, we observe that stronger color augmentation does not improve or even hurts their performance. Thus, our experiments show that unsupervised contrastive learning benefits from stronger (color) data augmentation than supervised learning. Although previous work has reported that data augmentation is useful for self-supervised learning (Doersch et al., 2015; Bachman et al., 2019; Hénaff et al., 2019; Asano et al., 2019), we show that data augmentation that does not yield accuracy benefits for supervised learning can still help considerably with contrastive learning.

                     Color distortion strength
Methods       1/8    1/4    1/2    1      1 (+Blur)   AutoAug
SimCLR        59.6   61.0   62.6   63.2   64.5        61.1
Supervised    77.0   76.7   76.5   75.7   75.4        77.1

Table 1. Top-1 accuracy of unsupervised ResNet-50 using linear evaluation and supervised ResNet-50 5, under varied color distortion strength (see Appendix A) and other data transformations. Strength 1 (+Blur) is our default data augmentation policy.

5 Supervised models are trained for 90 epochs; longer training improves performance of stronger augmentation by ∼0.5%.

4. Architectures for Encoder and Head

4.1. Unsupervised contrastive learning benefits (more) from bigger models

[Figure 7: plot of linear-evaluation top-1 accuracy versus number of parameters (millions) for ResNet variants R18 through R152 at widths 1×/2×/4×, with supervised ResNet-50 variants for reference.]
Figure 7. Linear evaluation of models with varied depth and width. Models in blue dots are ours trained for 100 epochs, models in red stars are ours trained for 1000 epochs, and models in green crosses are supervised ResNets trained for 90 epochs 7 (He et al., 2016).

Figure 7 shows, perhaps unsurprisingly, that increasing depth and width both improve performance. While similar findings hold for supervised learning (He et al., 2016), we find the gap between supervised models and linear classifiers trained on unsupervised models shrinks as the model size increases, suggesting that unsupervised learning benefits more from bigger models than its supervised counterpart.

7 Training longer does not improve supervised ResNets (see Appendix B.3).
Name             Negative loss function                                      Gradient w.r.t. u

NT-Xent          u^T v^+ / τ − log Σ_{v ∈ {v^+, v^−}} exp(u^T v / τ)         (1 − exp(u^T v^+/τ)/Z(u)) / τ · v^+  −  Σ_{v^−} exp(u^T v^−/τ)/Z(u) / τ · v^−

NT-Logistic      log σ(u^T v^+ / τ) + log σ(−u^T v^− / τ)                    σ(−u^T v^+/τ) / τ · v^+  −  σ(u^T v^−/τ) / τ · v^−

Margin Triplet   −max(u^T v^− − u^T v^+ + m, 0)                              v^+ − v^−  if  u^T v^+ − u^T v^− < m  else  0

Table 2. Negative loss functions and their gradients. All input vectors, i.e. u, v^+, v^−, are ℓ2 normalized. NT-Xent is an abbreviation for "Normalized Temperature-scaled Cross Entropy". Different loss functions impose different weightings of positive and negative examples. Here Z(u) = Σ_{v ∈ {v^+, v^−}} exp(u^T v / τ) denotes the normalization.

[Figure 8: bar plot of top-1 accuracy for projection output dimensionalities 32 to 2048 under non-linear, linear, and no projection heads.]
Figure 8. Linear evaluation of representations with different projection heads g(·) and various dimensions of z = g(h). The representation h (before projection) is 2048-dimensional here.

4.2. A nonlinear projection head improves the representation quality of the layer before it

We then study the importance of including a projection head, i.e. g(h). Figure 8 shows linear evaluation results using three different architectures for the head: (1) identity mapping; (2) linear projection, as used by several previous approaches (Wu et al., 2018); and (3) the default nonlinear projection with one additional hidden layer (and ReLU activation), similar to Bachman et al. (2019). We observe that a nonlinear projection is better than a linear projection (+3%), and much better than no projection (>10%). When a projection head is used, similar results are observed regardless of output dimension. Furthermore, even when nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h), which shows that the hidden layer before the projection head is a better representation than the layer after.

We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h. To verify this hypothesis, we conduct experiments that use either h or g(h) to learn to predict the transformation applied during the pretraining. Here we set g(h) = W^(2) σ(W^(1) h), with the same input and output dimensionality (i.e. 2048). Table 3 shows h contains much more information about the transformation applied, while g(h) loses information. Further analysis can be found in Appendix B.4.

                                          Representation
What to predict?           Random guess      h       g(h)
Color vs grayscale             80           99.3     97.4
Rotation                       25           67.6     25.6
Orig. vs corrupted             50           99.5     59.6
Orig. vs Sobel filtered        50           96.6     56.3

Table 3. Accuracy of training additional MLPs on different representations to predict the transformation applied. Other than crop and color augmentation, we additionally and independently add rotation (one of {0°, 90°, 180°, 270°}), Gaussian noise, and Sobel filtering transformation during the pretraining for the last three rows. Both h and g(h) are of the same dimensionality, i.e. 2048.

5. Loss Functions and Batch Size

5.1. Normalized cross entropy loss with adjustable temperature works better than alternatives

We compare the NT-Xent loss against other commonly used contrastive loss functions, such as logistic loss (Mikolov et al., 2013), and margin loss (Schroff et al., 2015). Table 2 shows the objective function as well as the gradient to the input of the loss function. Looking at the gradient, we observe 1) ℓ2 normalization (i.e. cosine similarity) along with temperature effectively weights different examples, and an appropriate temperature can help the model learn from hard negatives; and 2) unlike cross-entropy, other objective functions do not weigh the negatives by their relative hardness. As a result, one must apply semi-hard negative mining (Schroff et al., 2015) for these loss functions: instead of computing the gradient over all loss terms, one can compute the gradient using semi-hard negative terms (i.e., those that are within the loss margin and closest in distance, but farther than positive examples).

To make the comparisons fair, we use the same ℓ2 normalization for all loss functions, and we tune the hyperparameters and report their best results. 8 Table 4 shows that, while (semi-hard) negative mining helps, the best result is still much worse than our default NT-Xent loss.

8 Details can be found in Appendix B.10. For simplicity, we only consider the negatives from one augmentation view.
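As an illustration of the semi-hard mining rule described above, the sketch below selects, for one anchor, the negative that is farther than the positive yet still within the margin and closest among such candidates; the function name and the fallback behaviour are our own choices, not taken from Schroff et al. (2015) or the SimCLR code.

import numpy as np

def semi_hard_negative_index(u, v_pos, negatives, m=1.0):
    # u, v_pos: l2-normalized vectors; negatives: array of l2-normalized rows.
    # A negative is "semi-hard" if it is farther from u than the positive
    # (lower similarity) but still within the margin m.
    pos_sim = float(u @ v_pos)
    neg_sims = negatives @ u
    semi_hard = (neg_sims < pos_sim) & (neg_sims > pos_sim - m)
    if not semi_hard.any():
        return None  # no semi-hard negative; a real pipeline would fall back to another rule
    candidates = np.where(semi_hard)[0]
    return int(candidates[np.argmax(neg_sims[candidates])])  # closest semi-hard negative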
Margin   NT-Logi.   Margin (sh)   NT-Logi. (sh)   NT-Xent
50.9     51.6       57.5          57.9            63.9

Table 4. Linear evaluation (top-1) for models trained with different loss functions. "sh" means using semi-hard negative mining.

We next test the importance of the ℓ2 normalization (i.e. cosine similarity vs dot product) and temperature τ in our default NT-Xent loss. Table 5 shows that without normalization and proper temperature scaling, performance is significantly worse. Without ℓ2 normalization, the contrastive task accuracy is higher, but the resulting representation is worse under linear evaluation.

ℓ2 norm?   τ       Entropy   Contrastive acc.   Top 1
Yes        0.05    1.0       90.5               59.7
           0.1     4.5       87.8               64.4
           0.5     8.2       68.2               60.7
           1       8.3       59.1               58.0
No         10      0.5       91.7               57.2
           100     0.5       92.1               57.0

Table 5. Linear evaluation for models trained with different choices of ℓ2 norm and temperature τ for NT-Xent loss. The contrastive distribution is over 4096 examples.

5.2. Contrastive learning benefits (more) from larger batch sizes and longer training

[Figure 9: bar plot of linear-evaluation top-1 accuracy for batch sizes 256 to 8192 at 100 to 1000 training epochs.]
Figure 9. Linear evaluation models (ResNet-50) trained with different batch size and epochs. Each bar is a single run from scratch. 10

10 A linear learning rate scaling is used here. Figure B.1 shows using a square root learning rate scaling can improve performance of ones with small batch sizes.

Figure 9 shows the impact of batch size when models are trained for different numbers of epochs. We find that, when the number of training epochs is small (e.g. 100 epochs), larger batch sizes have a significant advantage over the smaller ones. With more training steps/epochs, the gaps between different batch sizes decrease or disappear, provided the batches are randomly resampled. In contrast to supervised learning (Goyal et al., 2017), in contrastive learning, larger batch sizes provide more negative examples, facilitating convergence (i.e. taking fewer epochs and steps for a given accuracy). Training longer also provides more negative examples, improving the results. In Appendix B.1, results with even longer training steps are provided.

6. Comparison with State-of-the-art

In this section, similar to Kolesnikov et al. (2019); He et al. (2019), we use ResNet-50 in 3 different hidden layer widths (width multipliers of 1×, 2×, and 4×). For better convergence, our models here are trained for 1000 epochs.
Linear evaluation. Table 6 compares our results with previous approaches (Zhuang et al., 2019; He et al., 2019; Misra & van der Maaten, 2019; Hénaff et al., 2019; Kolesnikov et al., 2019; Donahue & Simonyan, 2019; Bachman et al., 2019; Tian et al., 2019) in the linear evaluation setting (see Appendix B.6). Figure 1 shows more numerical comparisons among different methods. We are able to use standard networks to obtain substantially better results compared to previous methods that require specifically designed architectures. The best result obtained with our ResNet-50 (4×) can match the supervised pretrained ResNet-50.

Method              Architecture        Param (M)   Top 1   Top 5
Methods using ResNet-50:
Local Agg.          ResNet-50           24          60.2    -
MoCo                ResNet-50           24          60.6    -
PIRL                ResNet-50           24          63.6    -
CPC v2              ResNet-50           24          63.8    85.3
SimCLR (ours)       ResNet-50           24          69.3    89.0
Methods using other architectures:
Rotation            RevNet-50 (4×)      86          55.4    -
BigBiGAN            RevNet-50 (4×)      86          61.3    81.9
AMDIM               Custom-ResNet       626         68.1    -
CMC                 ResNet-50 (2×)      188         68.4    88.2
MoCo                ResNet-50 (4×)      375         68.6    -
CPC v2              ResNet-161 (∗)      305         71.5    90.1
SimCLR (ours)       ResNet-50 (2×)      94          74.2    92.0
SimCLR (ours)       ResNet-50 (4×)      375         76.5    93.2

Table 6. ImageNet accuracies of linear classifiers trained on representations learned with different self-supervised methods.

Semi-supervised learning. We follow Zhai et al. (2019) and sample 1% or 10% of the labeled ILSVRC-12 training datasets in a class-balanced way (∼12.8 and ∼128 images per class respectively). 11 We simply fine-tune the whole base network on the labeled data without regularization (see Appendix B.5). Table 7 shows the comparisons of our results against recent methods (Zhai et al., 2019; Xie et al., 2019; Sohn et al., 2020; Wu et al., 2018; Donahue & Simonyan, 2019; Misra & van der Maaten, 2019; Hénaff et al., 2019). The supervised baseline from (Zhai et al., 2019) is strong due to intensive search of hyper-parameters (including augmentation). Again, our approach significantly improves over state-of-the-art with both 1% and 10% of the labels. Interestingly, fine-tuning our pretrained ResNet-50 (2×, 4×) on full ImageNet is also significantly better than training from scratch (up to 2%, see Appendix B.2).

11 The details of sampling and exact subsets can be found in https://www.tensorflow.org/datasets/catalog/imagenet2012_subset.

                                              Label fraction (Top 5)
Method                      Architecture        1%      10%
Supervised baseline         ResNet-50           48.4    80.4
Methods using other label-propagation:
Pseudo-label                ResNet-50           51.6    82.4
VAT+Entropy Min.            ResNet-50           47.0    83.4
UDA (w. RandAug)            ResNet-50           -       88.5
FixMatch (w. RandAug)       ResNet-50           -       89.1
S4L (Rot+VAT+En. M.)        ResNet-50 (4×)      -       91.2
Methods using representation learning only:
InstDisc                    ResNet-50           39.2    77.4
BigBiGAN                    RevNet-50 (4×)      55.2    78.8
PIRL                        ResNet-50           57.2    83.8
CPC v2                      ResNet-161 (∗)      77.9    91.2
SimCLR (ours)               ResNet-50           75.5    87.8
SimCLR (ours)               ResNet-50 (2×)      83.0    91.2
SimCLR (ours)               ResNet-50 (4×)      85.8    92.6

Table 7. ImageNet accuracy of models trained with few labels.

Transfer learning. We evaluate transfer learning performance across 12 natural image datasets in both linear evaluation (fixed feature extractor) and fine-tuning settings. Following Kornblith et al. (2019), we perform hyperparameter tuning for each model-dataset combination and select the best hyperparameters on a validation set. Table 8 shows results with the ResNet-50 (4×) model. When fine-tuned, our self-supervised model significantly outperforms the supervised baseline on 5 datasets, whereas the supervised baseline is superior on only 2 (i.e. Pets and Flowers). On the remaining 5 datasets, the models are statistically tied. Full experimental details as well as results with the standard ResNet-50 architecture are provided in Appendix B.8.

                    Food   CIFAR10  CIFAR100  Birdsnap  SUN397  Cars   Aircraft  VOC2007  DTD    Pets   Caltech-101  Flowers
Linear evaluation:
SimCLR (ours)       76.9   95.3     80.2      48.4      65.9    60.0   61.2      84.2     78.9   89.2   93.9         95.0
Supervised          75.2   95.7     81.2      56.4      64.9    68.8   63.8      83.8     78.7   92.3   94.1         94.2
Fine-tuned:
SimCLR (ours)       89.4   98.6     89.0      78.2      68.1    92.1   87.0      86.6     77.8   92.1   94.1         97.6
Supervised          88.7   98.3     88.7      77.8      67.0    91.4   88.0      86.5     78.8   93.2   94.2         98.0
Random init         88.3   96.0     81.9      77.0      53.7    91.3   84.8      69.4     64.1   82.7   72.5         92.5

Table 8. Comparison of transfer learning performance of our self-supervised approach with supervised baselines across 12 natural image classification datasets, for ResNet-50 (4×) models pretrained on ImageNet. Results not significantly worse than the best (p > 0.05, permutation test) are shown in bold. See Appendix B.8 for experimental details and results with standard ResNet-50.

7. Related Work

The idea of making representations of an image agree with each other under small transformations dates back to Becker & Hinton (1992). We extend it by leveraging recent advances in data augmentation, network architecture and contrastive loss. A similar consistency idea, but for class label prediction, has been explored in other contexts such as semi-supervised learning (Xie et al., 2019; Berthelot et al., 2019).

Handcrafted pretext tasks. The recent renaissance of self-supervised learning began with artificially designed pretext tasks, such as relative patch prediction (Doersch et al., 2015), solving jigsaw puzzles (Noroozi & Favaro, 2016), colorization (Zhang et al., 2016) and rotation prediction (Gidaris et al., 2018; Chen et al., 2019). Although good results can be obtained with bigger networks and longer training (Kolesnikov et al., 2019), these pretext tasks rely on somewhat ad-hoc heuristics, which limits the generality of learned representations.

Contrastive visual representation learning. Dating back to Hadsell et al. (2006), these approaches learn representations by contrasting positive pairs against negative pairs. Along these lines, Dosovitskiy et al. (2014) proposes to treat each instance as a class represented by a feature vector (in a parametric form). Wu et al. (2018) proposes to use a memory bank to store the instance class representation vector, an approach adopted and extended in several recent papers (Zhuang et al., 2019; Tian et al., 2019; He et al., 2019; Misra & van der Maaten, 2019). Other work explores the use of in-batch samples for negative sampling instead of a memory bank (Doersch & Zisserman, 2017; Ye et al., 2019; Ji et al., 2019).

Recent literature has attempted to relate the success of their methods to maximization of mutual information between latent representations (Oord et al., 2018; Hénaff et al., 2019; Hjelm et al., 2018; Bachman et al., 2019). However, it is not clear if the success of contrastive approaches is determined by the mutual information, or by the specific form of the contrastive loss (Tschannen et al., 2019).
We note that almost all individual components of our framework have appeared in previous work, although the specific instantiations may be different. The superiority of our framework relative to previous work is not explained by any single design choice, but by their composition. We provide a comprehensive comparison of our design choices with those of previous work in Appendix C.

8. Conclusion

In this work, we present a simple framework and its instantiation for contrastive visual representation learning. We carefully study its components, and show the effects of different design choices. By combining our findings, we improve considerably over previous methods for self-supervised, semi-supervised, and transfer learning.

Our approach differs from standard supervised learning on ImageNet only in the choice of data augmentation, the use of a nonlinear head at the end of the network, and the loss function. The strength of this simple framework suggests that, despite a recent surge in interest, self-supervised learning remains undervalued.

Acknowledgements

We would like to thank Xiaohua Zhai, Rafael Müller and Yani Ioannou for their feedback on the draft. We are also grateful for general support from Google Research teams in Toronto and elsewhere.

References

Asano, Y. M., Rupprecht, C., and Vedaldi, A. A critical analysis of self-supervision, or what we can learn from a single image. arXiv preprint arXiv:1904.13132, 2019.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519, 2019.

Becker, S. and Hinton, G. E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.

Berg, T., Liu, J., Lee, S. W., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2019–2026. IEEE, 2014.

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. A. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050–5060, 2019.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In European conference on computer vision, pp. 446–461. Springer, 2014.

Chen, T., Sun, Y., Shi, Y., and Hong, L. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 767–776, 2017.

Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12154–12163, 2019.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3606–3613. IEEE, 2014.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123, 2019.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Doersch, C. and Zisserman, A. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060, 2017.

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.

Donahue, J. and Simonyan, K. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551, 2019.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655, 2014.

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pp. 766–774, 2014.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on Generative-Model Based Vision, 2004.

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pp. 1735–1742. IEEE, 2006.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Howard, A. G. Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402, 2013.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Ji, X., Henriques, J. F., and Vedaldi, A. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874, 2019.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1920–1929, 2019.

Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2661–2671, 2019.

Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. In Second Workshop on Fine-Grained Visual Categorization, 2013.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, 2013.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP'08. Sixth Indian Conference on, pp. 722–729. IEEE, 2008.

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3498–3505. IEEE, 2012.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems, pp. 1857–1865, 2016.

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. Sun database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492. IEEE, 2010.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.

Ye, M., Zhang, X., Yuen, P. C., and Chang, S.-F. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219, 2019.

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. S4l: Self-supervised semi-supervised learning. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European conference on computer vision, pp. 649–666. Springer, 2016.

Zhuang, C., Zhai, A. L., and Yamins, D. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6002–6012, 2019.

A. Data Augmentation Details


In our default pretraining setting (which is used to train our best models), we utilize random crop (with resize and random
flip), random color distortion, and random Gaussian blur as the data augmentations. The details of these three augmentations
are provided below.

Random crop and resize to 224x224 We use standard Inception-style random cropping (Szegedy et al., 2015). A crop of random size (uniformly sampled from 0.08 to 1.0 of the original area) and random aspect ratio (default: from 3/4 to 4/3 of the original aspect ratio) is made, and this crop is then resized to the original size. This has been implemented in Tensorflow as "slim.preprocessing.inception_preprocessing.distorted_bounding_box_crop", or in Pytorch as "torchvision.transforms.RandomResizedCrop". Additionally, the random crop (with resize) is always followed by a random horizontal/left-to-right flip with 50% probability. This is helpful but not essential. By removing this from our default augmentation policy, the top-1 linear evaluation drops from 64.5% to 63.4% for our ResNet-50 model trained in 100 epochs.
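A minimal torchvision sketch of this crop-and-flip step is shown below; the scale and ratio arguments are the values just described, and the variable name is ours.

from torchvision import transforms

# Random crop of 0.08-1.0 of the area with aspect ratio 3/4-4/3, resized to
# 224x224, followed by a horizontal flip with 50% probability.
random_crop_and_flip = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(p=0.5),
])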

Color distortion Color distortion is composed of color jittering and color dropping. We find that stronger color jittering usually helps, so we set a strength parameter.
A pseudo-code for color distortion using TensorFlow is as follows.

import tensorflow as tf

def random_apply(func, x, p):
    # helper (not in the original pseudo-code): apply func to x with probability p.
    return tf.cond(
        tf.less(tf.random.uniform([], 0.0, 1.0), p),
        lambda: func(x),
        lambda: x)

def color_distortion(image, s=1.0):
    # image is a tensor with value range in [0, 1].
    # s is the strength of color distortion.

    def color_jitter(x):
        # one can also shuffle the order of following augmentations
        # each time they are applied.
        x = tf.image.random_brightness(x, max_delta=0.8*s)
        x = tf.image.random_contrast(x, lower=1-0.8*s, upper=1+0.8*s)
        x = tf.image.random_saturation(x, lower=1-0.8*s, upper=1+0.8*s)
        x = tf.image.random_hue(x, max_delta=0.2*s)
        x = tf.clip_by_value(x, 0, 1)
        return x

    def color_drop(x):
        # convert to grayscale and tile back to 3 channels.
        x = tf.image.rgb_to_grayscale(x)
        x = tf.tile(x, [1, 1, 3])
        return x

    # randomly apply transformation with probability p.
    image = random_apply(color_jitter, image, p=0.8)
    image = random_apply(color_drop, image, p=0.2)
    return image

A pseudo-code for color distortion using Pytorch is as follows. 12

from torchvision import transforms

def get_color_distortion(s=1.0):
    # s is the strength of color distortion.
    color_jitter = transforms.ColorJitter(0.8*s, 0.8*s, 0.8*s, 0.2*s)
    rnd_color_jitter = transforms.RandomApply([color_jitter], p=0.8)
    rnd_gray = transforms.RandomGrayscale(p=0.2)
    color_distort = transforms.Compose([
        rnd_color_jitter,
        rnd_gray])
    return color_distort

12 Our code and results are based on Tensorflow; the Pytorch code here is a reference.

Gaussian blur This augmentation is in our default policy. We find it helpful, as it improves our ResNet-50 trained for
100 epochs from 63.2% to 64.5%. We blur the image 50% of the time using a Gaussian kernel. We randomly sample
σ ∈ [0.1, 2.0], and the kernel size is set to be 10% of the image height/width.
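As an illustration only (not the TensorFlow implementation used for our results), the blur step could be sketched as follows with SciPy; note that gaussian_filter uses a truncated Gaussian internally rather than the explicit 10%-of-image-size kernel described above.

import numpy as np
from scipy.ndimage import gaussian_filter

def random_gaussian_blur(image, p=0.5):
    # image: float array of shape (H, W, 3) with values in [0, 1].
    # Blur with probability p, with sigma sampled uniformly from [0.1, 2.0];
    # the channel axis is left unblurred (sigma 0).
    if np.random.uniform() < p:
        sigma = np.random.uniform(0.1, 2.0)
        image = gaussian_filter(image, sigma=(sigma, sigma, 0))
    return image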

B. Additional Experimental Results


B.1. Batch Size and Training Steps
Figure B.1 shows the top-5 accuracy on linear evaluation when trained with different batch sizes and training epochs. The conclusion is very similar to the top-1 accuracy shown before, except that the differences between different batch sizes and training steps seem slightly smaller here.
In both Figure 9 and Figure B.1, we use a linear scaling of learning rate similar to (Goyal et al., 2017) when training with different batch sizes. Although linear learning rate scaling is popular with the SGD/Momentum optimizer, we find a square root learning rate scaling is more desirable with the LARS optimizer. With square root learning rate scaling, we have LearningRate = 0.075 × √BatchSize, instead of LearningRate = 0.3 × BatchSize/256 in the linear scaling case, but the learning rate is the same under both scaling methods at a batch size of 4096 (our default batch size). A comparison is presented in Table B.1, where we observe that square root learning rate scaling improves the performance for models trained with small batch sizes and in a smaller number of epochs.
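The two scaling rules can be written down directly; this is a trivial helper for illustration, and the function name is ours.

import math

def pretraining_learning_rate(batch_size, scaling="linear"):
    # Linear scaling: 0.3 * BatchSize / 256; square root scaling: 0.075 * sqrt(BatchSize).
    # Both rules give 4.8 at the default batch size of 4096.
    if scaling == "linear":
        return 0.3 * batch_size / 256
    return 0.075 * math.sqrt(batch_size)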

Batch size \ Epochs 100 200 400 800


256 57.5 / 62.8 61.9 / 64.3 64.7 / 65.7 66.6 / 66.5
512 60.7 / 63.8 64.0 / 65.6 66.2 / 66.7 67.8 / 67.4
1024 62.8 / 64.3 65.3 / 66.1 67.2 / 67.2 68.5 / 68.3
2048 64.0 / 64.7 66.1 / 66.8 68.1 / 67.9 68.9 / 68.8
4096 64.6 / 64.5 66.5 / 66.8 68.2 / 68.0 68.9 / 69.1
8192 64.8 / 64.8 66.6 / 67.0 67.8 / 68.3 69.0 / 69.1

Table B.1. Linear evaluation (top-1) under different batch sizes and training epochs. On the left side of the slash are models trained with linear LR scaling, and on the right are models trained with square root LR scaling. The result is bolded if it is more than 0.5% better. Square root LR scaling works better for smaller batch sizes trained in fewer epochs (with the LARS optimizer).

We also train with larger batch sizes (up to 32K) and longer (up to 3200 epochs), with the square root learning rate scaling. As shown in Figure B.2, the performance seems to saturate with a batch size of 8192, while training longer can still significantly improve the performance.

[Figure B.1: top-5 accuracy versus training epochs (100–1000) for batch sizes 256–8192.]
Figure B.1. Linear evaluation (top-5) of ResNet-50 trained with different batch sizes and epochs. Each bar is a single run from scratch. See Figure 9 for top-1 accuracy.

[Figure B.2: top-1 accuracy versus training epochs (50–3200) for batch sizes 256–32768.]
Figure B.2. Linear evaluation (top-1) of ResNet-50 trained with different batch sizes and longer epochs. Here a square root learning rate, instead of a linear one, is utilized.

B.2. Broader composition of data augmentations further improves performance


Our best results in the main text (Tables 6 and 7) can be further improved when expanding the default augmentation policy to include the following: (1) Sobel filtering, (2) additional color distortion (equalize, solarize), and (3) motion blur. Under the linear evaluation protocol, the ResNet-50 models (1×, 2×, 4×) trained with broader data augmentations achieve 70.0 (+0.7), 74.4 (+0.2), 76.8 (+0.3), respectively.
Table B.2 shows ImageNet accuracy obtained by fine-tuning the SimCLR model (see Appendix B.5 for the details of the fine-tuning procedure). Interestingly, when fine-tuned on the full (100%) ImageNet training set, our ResNet-50 (4×) model achieves 80.4% top-1 / 95.4% top-5 13, which is significantly better than that (78.4% top-1 / 94.2% top-5) of training from scratch using the same set of augmentations (i.e. random crop and horizontal flip). For ResNet-50 (2×), fine-tuning our pre-trained ResNet-50 (2×) is also better than training from scratch (77.8% top-1 / 93.9% top-5). There is no improvement from fine-tuning for ResNet-50.

Label fraction             1%              10%             100%
Architecture           Top 1  Top 5    Top 1  Top 5    Top 1  Top 5
ResNet-50               49.4   76.6     66.1   88.1     76.0   93.1
ResNet-50 (2×)          59.4   83.7     71.8   91.2     79.1   94.8
ResNet-50 (4×)          64.1   86.6     74.8   92.8     80.4   95.4

Table B.2. Classification accuracy obtained by fine-tuning SimCLR (pretrained with broader data augmentations) on 1%, 10%, and 100% of ImageNet. As a reference, our ResNet-50 (4×) trained from scratch on 100% of the labels achieves 78.4% top-1 / 94.2% top-5.

B.3. Effects of Longer Training for Supervised Models


Here we perform experiments to see how the number of training steps and stronger data augmentation affect supervised training. We test ResNet-50 and ResNet-50 (4×) under the same set of data augmentations (random crops, color distortion, 50% Gaussian blur) as used in our unsupervised models. Table B.3 shows the top-1 accuracy. We observe no significant benefit from training supervised models longer on ImageNet. Stronger data augmentation slightly improves the accuracy of ResNet-50 (4×) but does not help ResNet-50. When stronger data augmentation is applied, ResNet-50 generally requires longer training (e.g. 500 epochs14) to obtain its best result, while ResNet-50 (4×) does not benefit from longer training.

                                        Top 1
Model             Training epochs   Crop   +Color   +Color+Blur
ResNet-50                     90    76.5     75.6          75.3
                             500    76.2     76.5          76.7
                            1000    75.8     75.2          76.4
ResNet-50 (4×)                90    78.4     78.9          78.7
                             500    78.3     78.4          78.5
                            1000    77.9     78.2          78.3

Table B.3. Top-1 accuracy of supervised models trained longer under various data augmentation procedures (the same set of data augmentations used for contrastive learning).

B.4. Understanding The Non-Linear Projection Head


Figure B.3 shows the eigenvalue distribution of the linear projection matrix W ∈ R2048×2048 used to compute z = W h. This matrix has relatively few large eigenvalues, indicating that it is approximately low-rank.
Figure B.4 shows t-SNE (Maaten & Hinton, 2008) visualizations of h and z = g(h) for 10 randomly selected classes, using our best ResNet-50 (top-1 linear evaluation 69.3%). Classes represented by h are better separated than those represented by z.
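The low-rank observation can be checked with a few lines of linear algebra (a sketch; W below is a placeholder for the learned projection weights):

```python
import numpy as np

# Placeholder: in practice W would be the learned 2048x2048 projection weights.
W = np.random.randn(2048, 2048).astype(np.float32)

eigvals = np.linalg.eigvals(W)                    # complex in general
sq_real = np.sort(np.real(eigvals) ** 2)[::-1]    # squared real parts, descending

# Fraction of the total "energy" captured by the 100 largest values; a value
# close to 1 would indicate that W is approximately low-rank.
print(sq_real[:100].sum() / sq_real.sum())
```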
13 It is 80.1% top-1 / 95.2% top-5 without broader augmentations for pretraining SimCLR.
14 With AutoAugment (Cubuk et al., 2019), optimal test accuracy can be achieved between 500 and 900 epochs.

Figure B.3. Squared real eigenvalue distribution of the linear projection matrix W ∈ R2048×2048 used to compute g(h) = W h. (a) Y-axis in uniform scale. (b) Y-axis in log scale.

Figure B.4. t-SNE visualizations of hidden vectors of images from 10 randomly selected classes in the validation set. (a) h. (b) z = g(h).

B.5. Semi-supervised Learning via Fine-Tuning


Fine-tuning Procedure We fine-tune using the Nesterov momentum optimizer with a batch size of 4096, momentum of 0.9, and a learning rate of 0.8 (following LearningRate = 0.05 × BatchSize/256) without warmup. Only random cropping (with random left-to-right flipping and resizing to 224×224) is used for preprocessing. We do not use any regularization (including weight decay). For 1% labeled data we fine-tune for 60 epochs, and for 10% labeled data we fine-tune for 30 epochs. At inference time, we resize the given image to 256×256 and take a single center crop of 224×224.
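A minimal PyTorch-style sketch of this setup (the model and data pipeline are elided; the helper names are ours):

```python
import torch

def make_finetune_optimizer(model, batch_size=4096):
    # LearningRate = 0.05 * BatchSize / 256, no warmup, no weight decay.
    lr = 0.05 * batch_size / 256  # 0.8 at the default batch size of 4096
    return torch.optim.SGD(model.parameters(), lr=lr,
                           momentum=0.9, nesterov=True, weight_decay=0.0)

# 60 epochs for the 1% label fraction, 30 epochs for the 10% fraction.
FINETUNE_EPOCHS = {"1%": 60, "10%": 30}
```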
Table B.4 compares the top-1 accuracy of different methods for semi-supervised learning. Our models significantly improve over the previous state of the art.

                                                       Label fraction (Top 1)
Method                          Architecture              1%        10%
Supervised baseline             ResNet-50                25.4       56.4
Methods using label-propagation:
UDA (w. RandAug)                ResNet-50                 -         68.8
FixMatch (w. RandAug)           ResNet-50                 -         71.5
S4L (Rot+VAT+Ent. Min.)         ResNet-50 (4×)            -         73.2
Methods using self-supervised representation learning only:
CPC v2                          ResNet-161(∗)            52.7       73.1
SimCLR (ours)                   ResNet-50                48.3       65.6
SimCLR (ours)                   ResNet-50 (2×)           58.5       71.7
SimCLR (ours)                   ResNet-50 (4×)           63.0       74.4

Table B.4. ImageNet top-1 accuracy of models trained with few labels. See Table 7 for top-5 accuracy.

B.6. Linear Evaluation


For linear evaluation, we follow a similar procedure to fine-tuning (described in Appendix B.5), except that we use a larger learning rate of 1.6 (following LearningRate = 0.1 × BatchSize/256) and train longer, for 90 epochs. Alternatively, using the LARS optimizer with the pretraining hyper-parameters also yields similar results. Furthermore, we find that attaching the linear classifier on top of the base encoder (with a stop_gradient on the input to the linear classifier to prevent the label information from influencing the encoder) and training them simultaneously during pretraining achieves similar performance.
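A sketch of this stop-gradient variant in PyTorch (detach plays the role of stop_gradient; the class and attribute names are ours):

```python
import torch.nn as nn

class EncoderWithLinearProbe(nn.Module):
    """Trains a linear classifier on detached features during pretraining."""

    def __init__(self, encoder, feature_dim=2048, num_classes=1000):
        super().__init__()
        self.encoder = encoder          # base encoder f(.)
        self.probe = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        h = self.encoder(x)
        # stop_gradient: label information cannot flow back into the encoder.
        logits = self.probe(h.detach())
        return h, logits
```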

B.7. Correlation Between Linear Evaluation and Fine-Tuning


Here we study the correlation between linear evaluation and fine-tuning under different settings of training steps and network architecture.

Figure B.5 shows linear evaluation versus fine-tuning when the training epochs of a ResNet-50 (using a batch size of 4096) are varied from 50 to 3200, as in Figure B.2. While the two are almost linearly correlated, fine-tuning on a small fraction of labels appears to benefit more from longer training.


Figure B.5. Top-1 accuracy of models trained for different numbers of epochs (from Figure B.2), under linear evaluation and fine-tuning on 1% and 10% of labels.

Figure B.6 shows linear evaluation versus fine-tuning for different architecture choices.


Figure B.6. Top-1 accuracy of different architectures (ResNet depths 18, 34, 50, 101, 152 at widths 1×, 2×, 4×) under linear evaluation and fine-tuning on 1% and 10% of labels.

B.8. Transfer Learning


We evaluated the performance of our self-supervised representation for transfer learning in two settings: linear evaluation,
where a logistic regression classifier is trained to classify a new dataset based on the self-supervised representation learned
on ImageNet, and fine-tuning, where we allow all weights to vary during training. In both cases, we follow the approach
described by Kornblith et al. (2019), although our preprocessing differs slightly.

B.8.1. Methods
Datasets We investigated transfer learning performance on the Food-101 dataset (Bossard et al., 2014), CIFAR-10
and CIFAR-100 (Krizhevsky & Hinton, 2009), Birdsnap (Berg et al., 2014), the SUN397 scene dataset (Xiao et al.,
2010), Stanford Cars (Krause et al., 2013), FGVC Aircraft (Maji et al., 2013), the PASCAL VOC 2007 classification
task (Everingham et al., 2010), the Describable Textures Dataset (DTD) (Cimpoi et al., 2014), Oxford-IIIT Pets (Parkhi et al.,
2012), Caltech-101 (Fei-Fei et al., 2004), and Oxford 102 Flowers (Nilsback & Zisserman, 2008). We follow the evaluation
protocols in the papers introducing these datasets, i.e., we report top-1 accuracy for Food-101, CIFAR-10, CIFAR-100,
Birdsnap, SUN397, Stanford Cars, and DTD; mean per-class accuracy for FGVC Aircraft, Oxford-IIIT Pets, Caltech-101,
and Oxford 102 Flowers; and the 11-point mAP metric as defined in Everingham et al. (2010) for PASCAL VOC 2007. For
DTD and SUN397, the dataset creators defined multiple train/test splits; we report results only for the first split. Caltech-101
defines no train/test split, so we randomly chose 30 images per class and test on the remainder, for fair comparison with
previous work (Donahue et al., 2014; Simonyan & Zisserman, 2014).
We used the validation sets specified by the dataset creators to select hyperparameters for FGVC Aircraft, PASCAL VOC
2007, DTD, and Oxford 102 Flowers. For the other datasets, we held out a subset of the training set for validation while performing hyperparameter tuning. After selecting the optimal hyperparameters on the validation set, we retrained the model with the selected parameters on the combined training and validation images. We report accuracy on the test set.

Transfer Learning via a Linear Classifier We trained an ℓ2-regularized multinomial logistic regression classifier on features extracted from the frozen pretrained network. We used L-BFGS to optimize the softmax cross-entropy objective, and we did not apply data augmentation. As preprocessing, all images were resized to 224 pixels along the shorter side using bicubic resampling, after which we took a 224 × 224 center crop. We selected the ℓ2 regularization parameter from a range of 45 logarithmically spaced values between 10−6 and 105.
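A sketch of this classifier selection loop; we use scikit-learn here purely for illustration, and the mapping from the regularization parameter above to scikit-learn's inverse-regularization C is our assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 45 logarithmically spaced regularization strengths between 1e-6 and 1e5.
LAMBDAS = np.logspace(-6, 5, 45)

def fit_linear_classifier(train_x, train_y, val_x, val_y):
    """train_x/val_x: frozen features; train_y/val_y: integer class labels."""
    best_acc, best_clf = -1.0, None
    for lam in LAMBDAS:
        clf = LogisticRegression(C=1.0 / lam, solver="lbfgs", max_iter=1000)
        clf.fit(train_x, train_y)
        acc = clf.score(val_x, val_y)
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf
```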

Transfer Learning via Fine-Tuning We fine-tuned the entire network using the weights of the pretrained network as
initialization. We trained for 20,000 steps at a batch size of 256 using SGD with Nesterov momentum with a momentum
parameter of 0.9. We set the momentum parameter for the batch normalization statistics to max(1 − 10/s, 0.9) where s is
the number of steps per epoch. As data augmentation during fine-tuning, we performed only random crops with resize and
flips; in contrast to pretraining, we did not perform color augmentation or blurring. At test time, we resized images to 256
pixels along the shorter side and took a 224 × 224 center crop. (Additional accuracy improvements may be possible with
further optimization of data augmentation, particularly on the CIFAR-10 and CIFAR-100 datasets.) We selected the learning rate and weight decay from a grid of 7 logarithmically spaced learning rates between 0.0001 and 0.1 and 7 logarithmically spaced values of weight decay between 10−6 and 10−3, as well as no weight decay. These weight decay values are divided by the learning rate.
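The resulting schedule and search grid can be written compactly as below (a sketch based on the description above):

```python
import numpy as np

def bn_momentum(steps_per_epoch):
    # Batch-norm statistics momentum: max(1 - 10/s, 0.9), s = steps per epoch.
    return max(1 - 10 / steps_per_epoch, 0.9)

learning_rates = np.logspace(-4, -1, 7)                # 0.0001 ... 0.1
weight_decays = list(np.logspace(-6, -3, 7)) + [0.0]   # plus no weight decay

# Weight decay values are divided by the learning rate, as described above.
grid = [(lr, wd / lr) for lr in learning_rates for wd in weight_decays]
print(len(grid))  # 7 * 8 = 56 configurations
```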

Training from Random Initialization We trained the network from random initialization using the same procedure
as for fine-tuning, but for longer, and with an altered hyperparameter grid. We chose hyperparameters from a grid of 7
logarithmically spaced learning rates between 0.001 and 1.0 and 8 logarithmically spaced values of weight decay between
10−5 and 10−1.5 . Importantly, our random initialization baselines are trained for 40,000 steps, which is sufficiently long to
achieve near-maximal accuracy, as demonstrated in Figure 8 of Kornblith et al. (2019).
On Birdsnap, there are no statistically significant differences among methods, and on Food-101, Stanford Cars, and FGVC
Aircraft datasets, fine-tuning provides only a small advantage over training from random initialization. However, on the
remaining 8 datasets, pretraining has clear advantages.

Supervised Baselines We compare against architecturally identical ResNet models trained on ImageNet with standard
cross-entropy loss. These models are trained with the same data augmentation as our self-supervised models (crops, strong
color augmentation, and blur) and are also trained for 1000 epochs. We found that, although stronger data augmentation and longer training do not improve accuracy on ImageNet, these models performed significantly better than a supervised baseline trained for 90 epochs with ordinary data augmentation when used for linear evaluation on a subset of the transfer datasets. The
supervised ResNet-50 baseline achieves 76.3% top-1 accuracy on ImageNet, vs. 69.3% for the self-supervised counterpart,
while the ResNet-50 (4×) baseline achieves 78.3%, vs. 76.5% for the self-supervised model.

Statistical Significance Testing We test for the significance of differences between models with a permutation test. Given
predictions of two models, we generate 100,000 samples from the null distribution by randomly exchanging predictions
for each example and computing the difference in accuracy after performing this randomization. We then compute the
percentage of samples from the null distribution that are more extreme than the observed difference in predictions. For top-1
accuracy, this procedure yields the same result as the exact McNemar test. The assumption of exchangeability under the null
hypothesis is also valid for mean per-class accuracy, but not when computing average precision curves. Thus, we perform
significance testing for a difference in accuracy on VOC 2007 rather than a difference in mAP. A caveat of this procedure is
that it does not consider run-to-run variability when training the models, only variability arising from using a finite sample
of images for evaluation.
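A sketch of this permutation test (for top-1 accuracy it coincides with the exact McNemar test):

```python
import numpy as np

def permutation_test(correct_a, correct_b, n_samples=100_000, seed=0):
    """correct_a, correct_b: boolean per-example correctness of the two models."""
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    observed = abs(correct_a.mean() - correct_b.mean())

    count = 0
    for _ in range(n_samples):
        # Randomly exchange the two models' predictions for each example.
        swap = rng.random(correct_a.size) < 0.5
        perm_a = np.where(swap, correct_b, correct_a)
        perm_b = np.where(swap, correct_a, correct_b)
        if abs(perm_a.mean() - perm_b.mean()) >= observed:
            count += 1
    return count / n_samples  # fraction of null samples at least as extreme
```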

B.8.2. Results with Standard ResNet


The ResNet-50 (4×) results shown in Table 8 of the main text show no clear advantage for either the supervised or the self-supervised model. With the narrower ResNet-50 architecture, however, supervised learning maintains a clear advantage over self-supervised learning. The supervised ResNet-50 model outperforms the self-supervised model on all datasets with linear evaluation, and on most (10 of 12) datasets with fine-tuning.

Food CIFAR10 CIFAR100 Birdsnap SUN397 Cars Aircraft VOC2007 DTD Pets Caltech-101 Flowers
Linear evaluation:
SimCLR (ours) 68.4 90.6 71.6 37.4 58.8 50.3 50.3 80.5 74.5 83.6 90.3 91.2
Supervised 72.3 93.6 78.3 53.7 61.9 66.7 61.0 82.8 74.9 91.5 94.5 94.7
Fine-tuned:
SimCLR (ours) 88.2 97.7 85.9 75.9 63.5 91.3 88.1 84.1 73.2 89.2 92.1 97.0
Supervised 88.3 97.5 86.4 75.8 64.3 92.1 86.0 85.0 74.6 92.1 93.3 97.6
Random init 86.9 95.9 80.2 76.1 53.6 91.4 85.9 67.3 64.8 81.5 72.6 92.0

Table B.5. Comparison of transfer learning performance of our self-supervised approach with supervised baselines across 12 natural
image datasets, using ImageNet-pretrained ResNet models. See also Figure 8 for results with the ResNet (4×) architecture.

The weaker performance of the ResNet-50 model compared to the ResNet-50 (4×) model may relate to the accuracy gap between the supervised and self-supervised models on ImageNet. The self-supervised ResNet-50 gets 69.3% top-1 accuracy, 6.8% worse than the supervised model in absolute terms, whereas the self-supervised ResNet-50 (4×) model gets 76.5%, which is only 1.8% worse than its supervised counterpart.

B.9. CIFAR-10
While we focus on ImageNet as the main dataset for pretraining our unsupervised model, our method also works with other datasets. We demonstrate this by testing on CIFAR-10 as follows.

Setup As our goal is not to optimize CIFAR-10 performance, but rather to provide further confirmation of our observations on ImageNet, we use the same architecture (ResNet-50) for the CIFAR-10 experiments. Because CIFAR-10 images are much smaller than ImageNet images, we replace the first 7×7 Conv of stride 2 with a 3×3 Conv of stride 1, and also remove the first max pooling operation. For data augmentation, we use the same Inception crop (flip and resize to 32×32) as for ImageNet,15 and color distortion (strength=0.5), leaving out Gaussian blur. We pretrain with learning rate in {0.5, 1.0, 1.5}, temperature in {0.1, 0.5, 1.0}, and batch size in {256, 512, 1024, 2048, 4096}. The rest of the settings (including optimizer, weight decay, etc.) are the same as for our ImageNet training.
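The stem modification can be sketched as follows; we use torchvision's ResNet-50 purely for illustration (our own implementation differs):

```python
import torch.nn as nn
from torchvision.models import resnet50

def cifar_resnet50():
    model = resnet50()
    # Replace the 7x7 stride-2 stem convolution with a 3x3 stride-1 convolution,
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    # and remove the first max-pooling operation.
    model.maxpool = nn.Identity()
    return model
```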
Our best model, trained with batch size 1024, achieves a linear evaluation accuracy of 94.0%, compared to 95.1% for the supervised baseline using the same architecture and batch size. The best previously reported self-supervised linear evaluation result on CIFAR-10 is from AMDIM (Bachman et al., 2019), which achieves 91.2% with a model 25× larger than ours. We note that our model could be further improved by incorporating extra data augmentations and by using a more suitable base network.

Performance under different batch sizes and training steps Figure B.7 shows the linear evaluation performance under
different batch sizes and training steps. The results are consistent with our observations on ImageNet, although the largest
batch size of 4096 seems to cause a small degradation in performance on CIFAR-10.

Figure B.7. Linear evaluation of ResNet-50 (with adjusted stem) trained with different batch sizes and epochs on the CIFAR-10 dataset. Each bar is averaged over 3 runs with different learning rates (0.5, 1.0, 1.5) and temperature τ = 0.5. Error bar denotes standard deviation.

15 It is worth noting that, although CIFAR-10 images are much smaller than ImageNet images and image size does not differ among examples, cropping with resizing is still a very effective augmentation for contrastive learning.

Optimal temperature under different batch sizes Figure B.8 shows the linear evaluation of models trained with three different temperatures under various batch sizes. We find that when training to convergence (e.g. more than 300 training epochs), the optimal temperature in {0.1, 0.5, 1.0} is 0.5, and this appears consistent regardless of batch size. However, the performance with τ = 0.1 improves as the batch size increases, which may suggest a small shift of the optimal temperature towards 0.1.
(a) Training epochs ≤ 300. (b) Training epochs > 300.

Figure B.8. Linear evaluation of models (ResNet-50) trained with three temperatures at different batch sizes on CIFAR-10. Each bar is averaged over multiple runs with different learning rates and total training epochs. Error bar denotes standard deviation.

B.10. Tuning For Other Loss Functions


The learning rate that works best for the NT-Xent loss may not be a good learning rate for other loss functions. To ensure a fair comparison, we also tune hyperparameters for both the margin loss and the logistic loss. Specifically, we tune the learning rate in {0.01, 0.1, 0.3, 0.5, 1.0} for both loss functions. We further tune the margin in {0, 0.4, 0.8, 1.6} for the margin loss, and the temperature in {0.1, 0.2, 0.5, 1.0} for the logistic loss. For simplicity, we only consider negatives from one augmentation view (instead of both views), which slightly impairs performance but ensures a fair comparison.
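For reference, a minimal sketch of the NT-Xent loss that these alternative losses are compared against (the implementation details here are ours):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: [N, d] projections of two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, l2-normalized
    sim = z @ z.t() / temperature                        # 2N x 2N similarity scores
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    # The positive for example i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```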

C. Further Comparison to Related Methods


As we have noted in the main text, most individual components of SimCLR have appeared in previous work, and the
improved performance is a result of a combination of these design choices. Table C.1 provides a high-level comparison of
the design choices of our method with those of previous methods. Compared with previous work, our design choices are
generally simpler.

Model    Data Augmentation   Base Encoder            Projection Head   Loss                 Batch Size   Train Epochs
CPC v2   Custom              ResNet-161 (modified)   PixelCNN          Xent                 512#         ∼200
AMDIM    Fast AutoAug.       Custom ResNet           Non-linear MLP    Xent w/ clip, reg    1008#        150
CMC      Fast AutoAug.       ResNet-50 (2×, L+ab)    Linear layer      Xent w/ ℓ2, τ        156∗         280
MoCo     Crop+color          ResNet-50 (4×)          Linear layer      Xent w/ ℓ2, τ        256∗         200
PIRL     Crop+color          ResNet-50 (2×)          Linear layer      Xent w/ ℓ2, τ        1024∗        800
SimCLR   Crop+color+blur     ResNet-50 (4×)          Non-linear MLP    Xent w/ ℓ2, τ        4096         1000

Table C.1. A high-level comparison of design choices and training setup (for best result on ImageNet) for each method. Note that
descriptions provided here are general; even when they match for two methods, formulations and implementations may differ (e.g. for
color augmentation). Refer to the original papers for more details. # Examples are split into multiple patches, which enlarges the effective
batch size. ∗ A memory bank is employed.

Below, we provide an in-depth comparison of our method to recently proposed contrastive representation learning methods:

• DIM/AMDIM (Hjelm et al., 2018; Bachman et al., 2019) achieve global-to-local/local-to-neighbor prediction by predicting the middle layer of a ConvNet. The ConvNet is a ResNet that has been modified to place significant constraints on the receptive fields of the network (e.g. replacing many 3x3 Convs with 1x1 Convs). In our framework, we decouple the prediction task and encoder architecture, by random cropping (with resizing) and using the final
representations of two augmented views for prediction, so we can use standard and more powerful ResNets. Our
NT-Xent loss function leverages normalization and temperature to restrict the range of similarity scores, whereas they
use a tanh function with regularization. We use a simpler data augmentation policy, while they use FastAutoAugment
for their best result.
• CPC v1 and v2 (Oord et al., 2018; Hénaff et al., 2019) define the context prediction task using a deterministic strategy
to split examples into patches, and a context aggregation network (a PixelCNN) to aggregate these patches. The base
encoder network sees only patches, which are considerably smaller than the original image. We decouple the prediction task and the encoder architecture, so we do not require a context aggregation network, and our encoder can look at images across a wider spectrum of resolutions. In addition, we use the NT-Xent loss function, which leverages normalization and temperature, whereas they use an unnormalized cross-entropy-based objective. We use simpler data augmentation.
• InstDisc, MoCo, PIRL (Wu et al., 2018; He et al., 2019; Misra & van der Maaten, 2019) generalize the Exemplar
approach originally proposed by Dosovitskiy et al. (2014) and leverage an explicit memory bank. We do not use a
memory bank; we find that, with a larger batch size, in-batch negative example sampling suffices. We also utilize a
nonlinear projection head, and use the representation before the projection head. Although we use similar types of
augmentations (e.g., random crop and color distortion), we expect specific parameters may be different.
• CMC (Tian et al., 2019) uses a separate network for each view, while we simply use a single network shared across all randomly augmented views. The data augmentation, projection head, and loss function are also different. We use a larger batch size instead of a memory bank.
• Whereas Ye et al. (2019) maximize similarity between augmented and unaugmented copies of the same image, we apply data augmentation symmetrically to both branches of our framework (Figure 2). We also apply a nonlinear projection on the output of the base feature network, and use the representation before the projection network, whereas Ye et al. (2019) use the linearly projected final hidden vector as the representation. When training with large batch sizes using multiple accelerators, we use global BN to avoid shortcuts that can greatly decrease representation quality.
