representation that captures apparent similarity among instances, just like how class-wise supervised learning still retains apparent similarity among classes. This formulation of unsupervised learning as an instance-level discrimination is also technically appealing, as it could benefit from latest advances in discriminative supervised learning, e.g. on new network architectures.

However, we also face a major challenge, now that the number of "classes" is the size of the entire training set. For ImageNet, it would be 1.2-million instead of 1,000 classes. Simply extending softmax to many more classes becomes infeasible. We tackle this challenge by approximating the full softmax distribution with noise-contrastive estimation (NCE) [9], and by resorting to a proximal regularization method [29] to stabilize the learning process.

To evaluate the effectiveness of unsupervised learning, past works such as [2, 31] have relied on a linear classifier, e.g. a Support Vector Machine (SVM), to connect the learned feature to categories for classification at the test time. However, it is unclear why features learned via a training task could be linearly separable for an unknown testing task. We advocate a non-parametric approach for both training and testing. We formulate instance-level discrimination as a metric learning problem, where distances (similarity) between instances are calculated directly from the features in a non-parametric way. That is, the features for each instance are stored in a discrete memory bank, rather than in weights in a network. At the test time, we perform classification using k-nearest neighbors (kNN) based on the learned metric. Our training and testing are thus consistent, since both learning and evaluation of our model are concerned with the same metric space between images. We report and compare experimental results with both SVM and kNN accuracies.

Our experimental results demonstrate that, under unsupervised learning settings, our method surpasses the state-of-the-art on image classification by a large margin, with top-1 accuracy 42.5% on ImageNet 1K [1] and 38.7% for Places 205 [49]. Our method is also remarkable for consistently improving test performance with more training data and better network architectures. By fine-tuning the learned feature, we further obtain competitive results for semi-supervised learning and object detection tasks. Finally, our non-parametric model is highly compact: with 128 features per image, our method requires only 600MB storage for a million images, enabling fast nearest neighbour retrieval at the run time.

2. Related Works

There has been growing interest in unsupervised learning without human-provided labels. Previous works mainly fall into two categories: 1) generative models and 2) self-supervised approaches.

Generative Models. The primary objective of generative models is to reconstruct the distribution of data as faithfully as possible. Classical generative models include Restricted Boltzmann Machines (RBMs) [12, 39, 21] and auto-encoders [40, 20]. The latent features produced by generative models could also help object recognition. Recent approaches such as generative adversarial networks [8, 4] and the variational auto-encoder [14] improve both generative qualities and feature learning.

Self-supervised Learning. Self-supervised learning exploits internal structures of data and formulates predictive tasks to train a model. Specifically, the model needs to predict either an omitted aspect or component of an instance given the rest. To learn a representation of images, the tasks could be: predicting the context [2], counting the objects [28], filling in missing parts of an image [31], recovering colors from grayscale images [47], or even solving a jigsaw puzzle [27]. For videos, self-supervision strategies include: leveraging temporal continuity via tracking [44, 45], predicting the future [42], or preserving the equivariance of egomotion [13, 50, 30]. Recent work [3] attempts to combine several self-supervised tasks to obtain better visual representations. Whereas self-supervised learning may capture relations among parts or aspects of an instance, it is unclear why a particular self-supervision task should help semantic recognition and which task would be optimal.

Metric Learning. Every feature representation F induces a metric between instances x and y: d_F(x, y) = ‖F(x) − F(y)‖. Feature learning can thus also be viewed as a certain form of metric learning. There have been extensive studies on metric learning [15, 33]. Successful application of metric learning can often result in competitive performance, e.g. on face recognition [35] and person re-identification [46]. In these tasks, the classes at the test time are disjoint from those at the training time. Once a network is trained, one can only infer from its feature representation, not from the subsequent linear classifier. Metric learning has been shown to be effective for few-shot learning [38, 41, 37]. An important technical point on metric learning for face recognition is normalization [35, 22, 43], which we also utilize in this work. Note that all the methods mentioned here require supervision in certain ways. Our work is drastically different: it learns the feature and thus the induced metric in an unsupervised fashion, without any human annotations.

Exemplar CNN. Exemplar CNN [5] appears similar to our work. The fundamental difference is that it adopts a parametric paradigm during both training and testing, while our method is non-parametric in nature. We study this essential difference experimentally in Sec. 4.1. Exemplar CNN is computationally demanding for large-scale datasets such as ImageNet.
Figure 2: The pipeline of our unsupervised feature learning approach. We use a backbone CNN to encode each image as a feature
vector, which is projected to a 128-dimensional space and L2 normalized. The optimal feature embedding is learned via instance-level
discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.
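To make the pipeline concrete, here is a minimal PyTorch sketch of such an encoder, assuming a ResNet-18 backbone; the class name, layer slicing, and backbone choice are our illustration, not the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class InstanceEmbedder(nn.Module):
    """Backbone CNN followed by a 128-D projection and L2 normalization."""
    def __init__(self, feat_dim=128):
        super().__init__()
        resnet = models.resnet18()                                     # backbone choice is illustrative
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the classification layer
        self.project = nn.Linear(resnet.fc.in_features, feat_dim)      # e.g. 512 -> 128

    def forward(self, x):
        h = self.backbone(x).flatten(1)       # pooled backbone features
        v = self.project(h)                   # 128-D embedding
        return F.normalize(v, dim=1)          # unit-sphere features used by the memory bank
```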
We adapt NCE to our problem, in order to tackle the difficulty of computing the similarity to all the instances in the training set. The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where the binary classification task is to discriminate between data samples and noise samples. Specifically, the probability that feature representation v in the memory bank corresponds to the i-th example under our model is

P(i|v) = exp(v^T f_i / τ) / Z_i,    (4)

Z_i = Σ_{j=1}^{n} exp(v_j^T f_i / τ),    (5)

where Z_i is the normalizing constant. We formalize the noise distribution as a uniform distribution: P_n = 1/n. Following prior work, we assume that noise samples are m times more frequent than data samples. Then the posterior probability of sample i with feature v being from the data distribution (denoted by D = 1) is:

h(i, v) := P(D = 1 | i, v) = P(i|v) / (P(i|v) + m P_n(i)).    (6)

Our approximated training objective is to minimize the negative log-posterior distribution of data and noise samples,

J_NCE(θ) = −E_{P_d}[log h(i, v)] − m · E_{P_n}[log(1 − h(i, v′))].    (7)

Here, P_d denotes the actual data distribution. For P_d, v is the feature corresponding to x_i; whereas for P_n, v′ is the feature from another image, randomly sampled according to noise distribution P_n. In our model, both v and v′ are sampled from the non-parametric memory bank V.

Computing normalizing constant Z_i according to Eq. (4) is expensive. We follow [25], treating it as a constant and estimating its value via Monte Carlo approximation:

Z ≃ Z_i ≃ n E_j[exp(v_j^T f_i / τ)] = (n/m) Σ_{k=1}^{m} exp(v_{j_k}^T f_i / τ),    (8)

where {j_k} is a random subset of indices. Empirically, we find the approximation derived from initial batches sufficient to work well in practice.
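The following is a minimal PyTorch sketch of Eqs. (4)-(8) against a memory bank, with m noise indices drawn uniformly and Z passed in as a fixed constant; the function and variable names are ours, not the authors' code.

```python
import torch

def nce_loss(f, idx, memory, Z, tau=0.07, m=4096):
    """f: (B, D) features f_i from the network; idx: (B,) instance ids;
    memory: (N, D) L2-normalized bank V; Z: scalar estimate from Eq. (8)."""
    B, N = f.size(0), memory.size(0)
    Pn = 1.0 / N                                                # uniform noise distribution P_n = 1/n

    # Data term: P(i|v) and h(i, v) for the stored feature v of the same instance (Eqs. 4 and 6).
    p_pos = torch.exp((memory[idx] * f).sum(1) / tau) / Z
    h_pos = p_pos / (p_pos + m * Pn)

    # Noise term: h(i, v') for m instance features v' drawn uniformly from the bank.
    noise_idx = torch.randint(0, N, (B, m), device=f.device)
    p_neg = torch.exp(torch.einsum('bmd,bd->bm', memory[noise_idx], f) / tau) / Z
    h_neg = p_neg / (p_neg + m * Pn)

    # Eq. (7): negative log-posterior over data and noise samples.
    return -(torch.log(h_pos).mean() + torch.log(1.0 - h_neg).sum(1).mean())
```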
3.3. Proximal Regularization

Unlike typical classification settings where each class has many instances, we only have one instance per class. During each training epoch, each class is only visited once. Therefore, the learning process oscillates a lot from random sampling fluctuation. We employ the proximal optimization method [29] and introduce an additional term to encourage the smoothness of the training dynamics. At current iteration t, the feature representation for data x_i is computed from the network: v_i^(t) = f_θ(x_i). The memory bank of all the representations is stored at the previous iteration, V = {v^(t−1)}. The loss function for a positive sample from P_d is:

−log h(i, v_i^(t−1)) + λ ‖v_i^(t) − v_i^(t−1)‖_2^2.    (9)

As learning converges, the difference between iterations, i.e. v_i^(t) − v_i^(t−1), gradually vanishes, and the augmented loss is reduced to the original one. With proximal regularization, our final objective becomes:

J_NCE(θ) = −E_{P_d}[log h(i, v_i^(t−1)) − λ ‖v_i^(t) − v_i^(t−1)‖_2^2] − m · E_{P_n}[log(1 − h(i, v′^(t−1)))].    (10)

Fig. 3 shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.

Figure 3: The effect of our proximal regularization. The original objective value oscillates a lot and converges very slowly, whereas the regularized objective has smoother learning dynamics.
k=1 3.4. Weighted k-Nearest Neighbor Classifier
(8)
where {jk } is a random subset of indices. Empirically, we To classify test image x̂, we first compute its feature f̂ =
find the approximation derived from initial batches suffi- fθ (x̂), and then compare it against the embeddings of all
cient to work well in practice. the images in the memory bank, using the cosine similarity
s_i = cos(v_i, f̂). The top k nearest neighbors, denoted by N_k, would then be used to make the prediction via weighted voting. Specifically, the class c would get a total weight w_c = Σ_{i∈N_k} α_i · 1(c_i = c). Here, α_i is the contributing weight of neighbor x_i, which depends on the similarity as α_i = exp(s_i / τ). We choose τ = 0.07 as in training and we set k = 200.
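A compact sketch of this weighted vote over the memory bank (class labels are needed only at test time); the function and argument names are ours.

```python
import torch

def weighted_knn_predict(f_hat, memory, labels, n_classes, k=200, tau=0.07):
    """f_hat: (B, D) L2-normalized test features; memory: (N, D) bank; labels: (N,) class ids."""
    sim = f_hat @ memory.t()                           # cosine similarities s_i to all N instances
    top_sim, top_idx = sim.topk(k, dim=1)              # the k nearest neighbors N_k
    alpha = torch.exp(top_sim / tau)                   # contribution weight alpha_i = exp(s_i / tau)
    votes = torch.zeros(f_hat.size(0), n_classes, device=f_hat.device)
    votes.scatter_add_(1, labels[top_idx], alpha)      # w_c = sum of alpha_i over neighbors with c_i = c
    return votes.argmax(dim=1)
```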
4. Experiments

We conduct four sets of experiments to evaluate our approach. The first set is on CIFAR-10 to compare our non-parametric softmax with parametric softmax. The second set is on ImageNet to compare our method with other unsupervised learning methods. The last two sets of experiments investigate two different tasks, semi-supervised learning and object detection, to show the generalization ability of our learned feature representation.

4.1. Parametric vs. Non-parametric Softmax

A key novelty of our approach is the non-parametric softmax function. Compared to the conventional parametric softmax, our softmax allows a non-parametric metric to transfer to supervised tasks.

We compare both parametric and non-parametric formulations on CIFAR-10 [17], a dataset with 50,000 training instances in 10 classes. This size allows us to compute the non-parametric softmax in Eq. (2) without any approximation. We use ResNet-18 as the backbone network, with its output features mapped into 128-dimensional vectors.

We evaluate the classification effectiveness based on the learned feature representation. A common practice [48, 2, 31] is to train an SVM on the learned feature over the training set, and to then classify test instances based on the feature extracted from the trained network. In addition, we also use nearest neighbor classifiers to assess the learned feature. The latter directly relies on the feature metric and may better reflect the quality of the representation.

Training / Testing    Linear SVM    Nearest Neighbor
Param Softmax         60.3          63.0
Non-Param Softmax     75.4          80.8
NCE m = 1             44.3          42.5
NCE m = 10            60.2          63.4
NCE m = 512           64.3          78.4
NCE m = 4096          70.2          80.4

Table 1: Top-1 accuracies on CIFAR-10, by applying linear SVM or kNN classifiers on the learned features. Our non-parametric softmax outperforms parametric softmax, and NCE provides a close approximation as m increases.

Table 1 shows top-1 classification accuracies on CIFAR-10. On the features learned with parametric softmax, we obtain accuracies of 60.3% and 63.0% with linear SVM and kNN classifiers respectively. On the features learned with non-parametric softmax, the accuracy rises to 75.4% and 80.8% for the linear and nearest neighbour classifiers, a remarkable 18% boost for the latter.

We also study the quality of NCE approximating the non-parametric softmax (Sec. 3.2). The approximation is controlled by m, the number of negatives drawn for each instance. With m = 1, the accuracy with kNN drops significantly to 42.5%. As m increases, the performance improves steadily. When m = 4,096, the accuracy approaches that at m = 49,999, the full form evaluation without any approximation. This result provides assurance that NCE is an efficient approximation.

4.2. Image Classification

We learn a feature representation on ImageNet ILSVRC [34], and compare our method with representative unsupervised learning methods.

Experimental Settings. We choose design parameters via empirical validation. In particular, we set temperature τ = 0.07 and use NCE with m = 4,096 to balance performance and computing cost. The model is trained for 200 epochs using SGD with momentum. The batch size is 256. The learning rate is initialized to 0.03, scaled down with coefficient 0.1 every 40 epochs after the first 120 epochs. Our code is available at: http://github.com/zhirongw/lemniscate.pytorch.
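As a rough sketch, the schedule above corresponds to a configuration like the following; momentum and weight-decay values are not given in this excerpt, so the ones below are placeholders.

```python
import torch

def make_optimizer(model):
    # Batch size 256, SGD with momentum, initial learning rate 0.03 (values from the text);
    # momentum and weight decay below are illustrative placeholders.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)
    # Learning rate decays by 0.1 every 40 epochs after the first 120 epochs, over 200 epochs total.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[120, 160], gamma=0.1)
    return optimizer, scheduler
```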
Comparisons. We compare our method with a randomly initialized network (as a lower bound) and various unsupervised learning methods, including self-supervised learning [2, 47, 27, 48], adversarial learning [4], and Exemplar CNN [3]. The split-brain autoencoder [48] serves as a strong baseline that represents the state of the art. The results of these methods are reported with the AlexNet architecture [18] in their original papers, except for Exemplar CNN [5], whose results are reported with ResNet-101 [3]. As the network architecture has a big impact on the performance, we consider a few typical architectures: AlexNet [18], VGG16 [36], ResNet-18, and ResNet-50 [10].

We evaluate the performance with two different protocols: (1) Perform linear SVM on the intermediate features from conv1 to conv5. Note that there are also corresponding layers in VGG16 and ResNet [36, 10]. (2) Perform kNN on the output features. Table 2 shows that:

1. With AlexNet and linear classification on intermediate features, our method achieves an accuracy of 35.6%, outperforming all baselines, including the state-of-the-art. Our method can readily scale up to deeper networks. As we move from AlexNet to ResNet-50, our accuracy is raised to 42.5%, whereas the accuracy with Exemplar CNN [3] is only 31.5% even with ResNet-101.
2. Using nearest neighbor classification on the final 128-dimensional features, our method achieves 31.3%, 33.9%, 40.5% and 42.5% accuracies with AlexNet, VGG16, ResNet-18 and ResNet-50, not much lower than the linear classification results, demonstrating that our learned feature induces a reasonably good metric. As a comparison, for Split-brain, the accuracy drops to 8.9% with nearest neighbor classification on conv3 features, and to 11.8% after projecting the features to 128 dimensions.

3. With our method, the performance gradually increases as we examine the learned feature representation from earlier to later layers, which is generally desirable. With all other methods, the performance decreases beyond conv3 or conv4.

4. It is important to note that the features from intermediate convolutional layers can be over 10,000 dimensions. Hence, for other methods, using the features from the best-performing layers can incur significant storage and computation costs. Our method produces a 128-dimensional representation at the last layer, which is very efficient to work with. The encoded features of all 1.28M images in ImageNet only take about 600 MB of storage. Exhaustive nearest neighbor search over this dataset only takes 20 ms per image on a Titan X GPU (a back-of-the-envelope check follows after this list).
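A quick back-of-the-envelope check of the storage figure, together with a sketch of the exhaustive cosine search it enables; the code is our illustration and the timing naturally depends on hardware.

```python
import torch

# 1.28M images x 128 dims x 4 bytes (float32) ~ 655 MB, in line with the ~600 MB quoted above.
n_images, dim = 1_280_000, 128
print(n_images * dim * 4 / 1e6, "MB")

def nearest_neighbors(query, bank, k=10):
    """query: (B, 128) and bank: (N, 128), both L2-normalized, so one matmul gives cosine similarity."""
    return (query @ bank.t()).topk(k, dim=1).indices
```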
Image Classification Accuracy on ImageNet

method            conv1  conv2  conv3  conv4  conv5  kNN   #dim
Random            11.6   17.1   16.9   16.3   14.1   3.5   10K
Data-Init [16]    17.5   23.0   24.5   23.2   20.6   -     10K
Context [2]       16.2   23.3   30.2   31.7   29.6   -     10K
Adversarial [4]   17.7   24.5   31.0   29.9   28.0   -     10K
Color [47]        13.1   24.8   31.0   32.6   31.8   -     10K
Jigsaw [27]       19.2   30.1   34.7   33.9   28.3   -     10K
Count [28]        18.0   30.6   34.3   32.5   25.7   -     10K
SplitBrain [48]   17.7   29.3   35.4   35.2   32.8   11.8  10K
Exemplar [3]      -      -      -      -      31.5   -     4.5K
Ours AlexNet      16.8   26.5   31.8   34.1   35.6   31.3  128
Ours VGG16        16.5   21.4   27.6   33.1   37.2   33.9  128
Ours ResNet-18    16.0   19.9   26.3   35.7   42.1   40.5  128
Ours ResNet-50    15.3   18.8   24.4   35.3   43.9   42.5  128

Table 2: Top-1 classification accuracies on ImageNet.

Feature generalization. We also study how the learned feature representations can generalize to other datasets. With the same settings, we conduct another large-scale experiment on Places [49], a large dataset for scene classification, which contains 2.45M training images in 205 categories. In this experiment, we directly use the feature extraction networks trained on ImageNet without fine-tuning. Table 3 compares the results obtained with different methods and under different evaluation policies. Again, with a linear classifier on conv5 features, our method achieves competitive performance of top-1 accuracy 34.5% with AlexNet, and 42.1% with ResNet-50. With nearest neighbors on the last layer, which is much smaller than intermediate layers, we achieve an accuracy of 38.7% with ResNet-50. These results show the remarkable generalization ability of the representations learned using our method.

Image Classification Accuracy on Places

method            conv1  conv2  conv3  conv4  conv5  kNN   #dim
Random            15.7   20.3   19.8   19.1   17.5   3.9   10K
Data-Init [16]    21.4   26.2   27.1   26.1   24.0   -     10K
Context [2]       19.7   26.7   31.9   32.7   30.9   -     10K
Adversarial [4]   17.7   24.5   31.0   29.9   28.0   -     10K
Video [44]        20.1   28.5   29.9   29.7   27.9   -     10K
Color [47]        22.0   28.7   31.8   31.3   29.7   -     10K
Jigsaw [27]       23.0   32.1   35.5   34.8   31.3   -     10K
SplitBrain [48]   21.3   30.7   34.0   34.1   32.5   10.8  10K
Ours AlexNet      18.8   24.3   31.9   34.5   33.6   30.1  128
Ours VGG16        17.6   23.1   29.5   33.8   36.3   32.8  128
Ours ResNet-18    17.8   23.0   30.3   34.2   41.3   36.7  128
Ours ResNet-50    18.1   22.3   29.7   34.1   42.1   38.7  128

Table 3: Top-1 classification accuracies on Places, based directly on features learned on ImageNet, without any fine-tuning.

Consistency of training and testing objectives. Unsupervised feature learning is difficult because the training objective is agnostic about the testing objective. A good training objective should be reflected in consistent improvement in the testing performance. We investigate the relation between the training loss and the testing accuracy across iterations. Fig. 4 shows that our testing accuracy continues to improve as training proceeds, with no sign of overfitting. It also suggests that better optimization of the training objective may further improve our testing accuracy.

Figure 4: Our kNN testing accuracy on ImageNet continues to improve as the training loss decreases, demonstrating that our unsupervised learning objective captures apparent similarity which aligns well with the semantic annotation of the data.
Figure 5: Retrieval results for example queries. The left column shows queries from the validation set, while the right columns show the 10 closest instances from the training set. The upper half shows the best cases. The lower half shows the worst cases.
embedding size     32     64     128    256
top-1 accuracy     34.0   38.8   40.5   40.1

Table 4: Classification performance on ImageNet with ResNet-18 for different embedding feature sizes.

training set size  0.1%   1%     10%    30%    100%
accuracy           3.9    10.7   23.1   31.7   40.5

Table 5: Classification performance on ImageNet with ResNet-18, trained on different amounts of training data.

The embedding feature size. We study how the performance changes as we vary the embedding size from 32 to 256. Table 4 shows that the performance increases from 32, plateaus at 128, and appears to saturate towards 256.

Training set size. To study how our method scales with the data size, we train different representations with various proportions of ImageNet data, and evaluate the classification performance on the full labeled set using nearest neighbors. Table 5 shows that our feature learning method benefits from larger training sets, and the testing accuracy improves as the training set grows. This property is crucial for successful unsupervised learning, as there is no shortage of unlabeled data in the wild.

Qualitative case study. To illustrate the learned features, Figure 5 shows the results of image retrieval using the learned features. The upper four rows show the best cases, where all top 10 results are in the same categories as the queries. The lower four rows show the worst cases, where none of the top 10 are in the same categories. However, even for the failure cases, the retrieved images are still visually similar to the queries, a testament to the power of our unsupervised learning objective.

4.3. Semi-supervised Learning

We now study how the learned feature extraction network can benefit other tasks, and whether it can provide a good basis for transfer learning to other tasks. A common scenario that can benefit from unsupervised learning is when we have a large amount of data of which only a small fraction are labeled. A natural semi-supervised learning approach is to first learn from the big unlabeled data and then fine-tune the model on the small labeled data.
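A minimal sketch of that recipe, reusing the hypothetical InstanceEmbedder from the earlier sketch; the head size and training details here are illustrative assumptions.

```python
import torch.nn as nn

def build_finetune_model(pretrained_embedder, n_classes=1000):
    """Reuse the unsupervised backbone and attach a fresh classifier for the labeled subset."""
    return nn.Sequential(
        pretrained_embedder.backbone,    # weights learned without any labels
        nn.Flatten(1),
        nn.Linear(512, n_classes),       # new head, trained on the small labeled fraction
    )

# Fine-tuning then proceeds as ordinary supervised training, typically with a smaller
# learning rate on the pretrained backbone than on the new head.
```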
We randomly choose a subset of ImageNet as labeled
Method            mAP    Method          mAP
AlexNet Labels†   56.8   VGG Labels†     67.3
Gaussian          43.4   Gaussian        39.7
Data-Init [16]    45.6   Video [44]      60.2
Context [2]       51.1   Context [2]     61.5