
Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Zhirong Wu*†    Yuanjun Xiong†‡    Stella X. Yu*    Dahua Lin†

*UC Berkeley / ICSI    †Chinese University of Hong Kong    ‡Amazon Rekognition

Abstract

Neural net classifiers trained on data with annotated class labels can also capture apparent visual similarity among categories without being directed to do so. We study whether this observation can be extended beyond the conventional domain of supervised learning: Can we learn a good feature representation that captures apparent similarity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances? We formulate this intuition as a non-parametric classification problem at the instance level, and use noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes. Our experimental results demonstrate that, under unsupervised learning settings, our method surpasses the state-of-the-art on ImageNet classification by a large margin. Our method is also remarkable for consistently improving test performance with more training data and better network architectures. By fine-tuning the learned feature, we further obtain competitive results for semi-supervised learning and object detection tasks. Our non-parametric model is highly compact: with 128 features per image, our method requires only 600MB storage for a million images, enabling fast nearest neighbour retrieval at run time.

[Figure 1: Supervised learning results that motivate our unsupervised approach. For an image from class leopard, the classes that get highest responses from a trained neural net classifier are all visually correlated, e.g., jaguar and cheetah (the figure plots classifier responses for the classes leopard, jaguar, cheetah, lifeboat, shopcart, and bookcase). It is not the semantic labeling, but the apparent similarity in the data themselves that brings some classes closer than others. Our unsupervised approach takes the class-wise supervision to the extreme and learns a feature representation that discriminates among individual instances.]

1. Introduction

The rise of deep neural networks, especially convolutional neural networks (CNN), has led to several breakthroughs in computer vision benchmarks. Most successful models are trained via supervised learning, which requires large datasets that are completely annotated for a specific task. However, obtaining annotated data is often very costly or even infeasible in certain cases. In recent years, unsupervised learning has received increasing attention from the community [5, 2].

Our novel approach to unsupervised learning stems from a few observations on the results of supervised learning for object recognition. On ImageNet, the top-5 classification error is significantly lower than the top-1 error [18], and the second highest responding class in the softmax output to an image is more likely to be visually correlated. Fig. 1 shows that an image from class leopard is rated much higher by class jaguar than by class bookcase [11]. Such observations reveal that a typical discriminative learning method can automatically discover apparent similarity among semantic categories, without being explicitly guided to do so. In other words, apparent similarity is learned not from semantic annotations, but from the visual data themselves.

We take the class-wise supervision to the extreme of instance-wise supervision, and ask: Can we learn a meaningful metric that reflects apparent similarity among instances via pure discriminative learning? An image is distinctive in its own right, and each could differ significantly from other images in the same semantic category [23]. If we learn to discriminate between individual instances, without any notion of semantic categories, we may end up with a representation that captures apparent similarity among instances, just like how class-wise supervised learning still retains apparent similarity among classes.
This formulation of unsupervised learning as instance-level discrimination is also technically appealing, as it could benefit from the latest advances in discriminative supervised learning, e.g. on new network architectures.

However, we also face a major challenge, now that the number of "classes" is the size of the entire training set. For ImageNet, it would be 1.2 million instead of 1,000 classes. Simply extending softmax to many more classes becomes infeasible. We tackle this challenge by approximating the full softmax distribution with noise-contrastive estimation (NCE) [9], and by resorting to a proximal regularization method [29] to stabilize the learning process.

To evaluate the effectiveness of unsupervised learning, past works such as [2, 31] have relied on a linear classifier, e.g. a Support Vector Machine (SVM), to connect the learned feature to categories for classification at test time. However, it is unclear why features learned via a training task could be linearly separable for an unknown testing task.

We advocate a non-parametric approach for both training and testing. We formulate instance-level discrimination as a metric learning problem, where distances (similarity) between instances are calculated directly from the features in a non-parametric way. That is, the features for each instance are stored in a discrete memory bank, rather than in the weights of a network. At test time, we perform classification using k-nearest neighbors (kNN) based on the learned metric. Our training and testing are thus consistent, since both learning and evaluation of our model are concerned with the same metric space between images. We report and compare experimental results with both SVM and kNN accuracies.

Our experimental results demonstrate that, under unsupervised learning settings, our method surpasses the state-of-the-art on image classification by a large margin, with top-1 accuracy 42.5% on ImageNet 1K [1] and 38.7% on Places 205 [49]. Our method is also remarkable for consistently improving test performance with more training data and better network architectures. By fine-tuning the learned feature, we further obtain competitive results for semi-supervised learning and object detection tasks. Finally, our non-parametric model is highly compact: with 128 features per image, our method requires only 600MB storage for a million images, enabling fast nearest neighbour retrieval at run time.

2. Related Works

There has been growing interest in unsupervised learning without human-provided labels. Previous works mainly fall into two categories: 1) generative models and 2) self-supervised approaches.

Generative Models. The primary objective of generative models is to reconstruct the distribution of data as faithfully as possible. Classical generative models include Restricted Boltzmann Machines (RBMs) [12, 39, 21] and auto-encoders [40, 20]. The latent features produced by generative models could also help object recognition. Recent approaches such as generative adversarial networks [8, 4] and the variational auto-encoder [14] improve both generative qualities and feature learning.

Self-supervised Learning. Self-supervised learning exploits internal structures of data and formulates predictive tasks to train a model. Specifically, the model needs to predict either an omitted aspect or component of an instance given the rest. To learn a representation of images, the tasks could be: predicting the context [2], counting the objects [28], filling in missing parts of an image [31], recovering colors from grayscale images [47], or even solving a jigsaw puzzle [27]. For videos, self-supervision strategies include: leveraging temporal continuity via tracking [44, 45], predicting the future [42], or preserving the equivariance of egomotion [13, 50, 30]. Recent work [3] attempts to combine several self-supervised tasks to obtain better visual representations. Whereas self-supervised learning may capture relations among parts or aspects of an instance, it is unclear why a particular self-supervision task should help semantic recognition and which task would be optimal.

Metric Learning. Every feature representation F induces a metric between instances x and y: d_F(x, y) = ||F(x) - F(y)||. Feature learning can thus also be viewed as a certain form of metric learning. There have been extensive studies on metric learning [15, 33]. Successful application of metric learning can often result in competitive performance, e.g. on face recognition [35] and person re-identification [46]. In these tasks, the classes at test time are disjoint from those at training time. Once a network is trained, one can only infer from its feature representation, not from the subsequent linear classifier. Metric learning has been shown to be effective for few-shot learning [38, 41, 37]. An important technical point on metric learning for face recognition is normalization [35, 22, 43], which we also utilize in this work. Note that all the methods mentioned here require supervision in certain ways. Our work is drastically different: it learns the feature, and thus the induced metric, in an unsupervised fashion, without any human annotations.

Exemplar CNN. Exemplar CNN [5] appears similar to our work. The fundamental difference is that it adopts a parametric paradigm during both training and testing, while our method is non-parametric in nature. We study this essential difference experimentally in Sec. 4.1. Exemplar CNN is computationally demanding for large-scale datasets such as ImageNet.
[Figure 2: The pipeline of our unsupervised feature learning approach. A backbone CNN encodes each training image (the 1st, 2nd, ..., i-th, ..., n-th image) as a high-dimensional feature vector (2048-D in the figure), which is projected to a 128-dimensional space and L2-normalized onto the unit sphere. The 128-D features are stored in a non-parametric memory bank and scored by a non-parametric softmax. The optimal feature embedding is learned via instance-level discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.]
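To make the pipeline in Figure 2 concrete, the following is a minimal PyTorch sketch of the embedding network; the backbone choice (a torchvision ResNet-18), the projection-head name `projection`, and the wrapper class name `InstanceEmbedding` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class InstanceEmbedding(nn.Module):
    """Backbone CNN -> low-dimensional projection -> L2 normalization (Fig. 2)."""
    def __init__(self, out_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any backbone works; ResNet-18 assumed here
        feat_dim = backbone.fc.in_features         # 512 for ResNet-18, 2048 for ResNet-50
        backbone.fc = nn.Identity()                # keep the pooled convolutional features
        self.backbone = backbone
        self.projection = nn.Linear(feat_dim, out_dim)

    def forward(self, x):
        v = self.projection(self.backbone(x))      # project to 128-D
        return nn.functional.normalize(v, dim=1)   # enforce ||v|| = 1 (unit sphere)

# Example: embed a batch of 8 images of size 224x224.
net = InstanceEmbedding(out_dim=128)
features = net(torch.randn(8, 3, 224, 224))        # shape (8, 128), unit-norm rows
```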

3. Approach

Our goal is to learn an embedding function v = f_θ(x) without supervision. f_θ is a deep neural network with parameters θ, mapping image x to feature v. This embedding induces a metric over the image space, as d_θ(x, y) = ||f_θ(x) - f_θ(y)|| for instances x and y. A good embedding should map visually similar images closer to each other.

Our novel unsupervised feature learning approach is instance-level discrimination. We treat each image instance as a distinct class of its own and train a classifier to distinguish between individual instance classes (Fig. 2).

3.1. Non-Parametric Softmax Classifier

Parametric Classifier. We formulate the instance-level classification objective using the softmax criterion. Suppose we have n images x_1, ..., x_n in n classes and their features v_1, ..., v_n with v_i = f_θ(x_i). Under the conventional parametric softmax formulation, for image x with feature v = f_θ(x), the probability of it being recognized as the i-th example is

    P(i|v) = \frac{\exp(w_i^\top v)}{\sum_{j=1}^{n} \exp(w_j^\top v)},    (1)

where w_j is a weight vector for class j, and w_j^\top v measures how well v matches the j-th class, i.e., instance.

Non-Parametric Classifier. The problem with the parametric softmax formulation in Eq. (1) is that the weight vector w serves as a class prototype, preventing explicit comparisons between instances.

We propose a non-parametric variant of Eq. (1) that replaces w_j^\top v with v_j^\top v, and we enforce ||v|| = 1 via an L2-normalization layer. Then the probability P(i|v) becomes

    P(i|v) = \frac{\exp(v_i^\top v / \tau)}{\sum_{j=1}^{n} \exp(v_j^\top v / \tau)},    (2)

where τ is a temperature parameter that controls the concentration level of the distribution [11]. τ is important for supervised feature learning [43], and also necessary for tuning the concentration of v on our unit sphere.

The learning objective is then to maximize the joint probability \prod_{i=1}^{n} P_θ(i|f_θ(x_i)), or equivalently to minimize the negative log-likelihood over the training set, as

    J(\theta) = -\sum_{i=1}^{n} \log P(i|f_\theta(x_i)).    (3)

Learning with A Memory Bank. To compute the probability P(i|v) in Eq. (2), {v_j} for all the images are needed. Instead of exhaustively computing these representations every time, we maintain a feature memory bank V for storing them [46]. In the following, we introduce separate notations for the memory bank and the features forwarded from the network. Let V = {v_j} be the memory bank and f_i = f_θ(x_i) be the feature of x_i. During each learning iteration, the representation f_i as well as the network parameters θ are optimized via stochastic gradient descent. Then f_i is updated to V at the corresponding instance entry f_i → v_i. We initialize all the representations in the memory bank V as unit random vectors.

Discussions. The conceptual change from a class weight vector w_j to a feature representation v_j directly is significant. The weight vectors {w_j} in the original softmax formulation are only valid for training classes. Consequently, they do not generalize to new classes, or in our setting, new instances. When we get rid of these weight vectors, our learning objective focuses entirely on the feature representation and its induced metric, which can be applied everywhere in the space and to any new instances at test time.

Computationally, our non-parametric formulation eliminates the need for computing and storing the gradients for {w_j}, making it more scalable for big data applications.
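As a concrete illustration of Eq. (2) and the memory bank update, here is a minimal PyTorch sketch (without the NCE approximation of Sec. 3.2, so it is only practical for small n); the class name `NonParamSoftmax` and the direct bank overwrite are our own simplifications, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

class NonParamSoftmax:
    """Instance-level non-parametric softmax (Eq. 2) over a memory bank V."""
    def __init__(self, n_instances, dim=128, temperature=0.07):
        self.tau = temperature
        # Memory bank V: one unit-norm random vector per training instance.
        self.bank = F.normalize(torch.randn(n_instances, dim), dim=1)

    def loss(self, features, indices):
        # features: (B, dim) unit-norm outputs f_i of the network
        # indices:  (B,)   instance ids i of the images in the batch
        logits = features @ self.bank.t() / self.tau   # v_j^T f_i / tau for all j
        return F.cross_entropy(logits, indices)        # -log P(i | f_i), Eq. (3)

    @torch.no_grad()
    def update_bank(self, features, indices):
        # After the gradient step, write f_i back to its entry v_i (f_i -> v_i).
        self.bank[indices] = F.normalize(features, dim=1)

# Usage inside a training step (net and optimizer defined elsewhere):
#   feats = net(images)                     # unit-norm 128-D features
#   loss = criterion.loss(feats, indices)   # criterion = NonParamSoftmax(n_instances)
#   loss.backward(); optimizer.step()
#   criterion.update_bank(feats.detach(), indices)
```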
3.2. Noise-Contrastive Estimation

Computing the non-parametric softmax in Eq. (2) is cost prohibitive when the number of classes n is very large, e.g. at the scale of millions. Similar problems have been well addressed in the literature on learning word embeddings [25, 24], where the number of words can also scale to millions. Popular techniques to reduce computation include hierarchical softmax [26], noise-contrastive estimation (NCE) [9], and negative sampling [24]. We use NCE [9] to approximate the full softmax.

We adapt NCE to our problem, in order to tackle the difficulty of computing the similarity to all the instances in the training set. The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where the binary classification task is to discriminate between data samples and noise samples. Specifically, the probability that feature representation v in the memory bank corresponds to the i-th example under our model is

    P(i|v) = \frac{\exp(v^\top f_i / \tau)}{Z_i},    (4)

    Z_i = \sum_{j=1}^{n} \exp(v_j^\top f_i / \tau),    (5)

where Z_i is the normalizing constant. We formalize the noise distribution as a uniform distribution: P_n = 1/n. Following prior work, we assume that noise samples are m times more frequent than data samples. Then the posterior probability of sample i with feature v being from the data distribution (denoted by D = 1) is

    h(i, v) := P(D = 1 | i, v) = \frac{P(i|v)}{P(i|v) + m P_n(i)}.    (6)

Our approximated training objective is to minimize the negative log-posterior distribution of data and noise samples,

    J_{NCE}(\theta) = -E_{P_d}[\log h(i, v)] - m \cdot E_{P_n}[\log(1 - h(i, v'))].    (7)

Here, P_d denotes the actual data distribution. For P_d, v is the feature corresponding to x_i; whereas for P_n, v' is the feature of another image, randomly sampled according to the noise distribution P_n. In our model, both v and v' are sampled from the non-parametric memory bank V.

Computing the normalizing constant Z_i according to Eq. (4) is expensive. We follow [25], treating it as a constant and estimating its value via Monte Carlo approximation:

    Z \simeq Z_i \simeq n E_j\big[\exp(v_j^\top f_i / \tau)\big] = \frac{n}{m} \sum_{k=1}^{m} \exp(v_{j_k}^\top f_i / \tau),    (8)

where {j_k} is a random subset of indices. Empirically, we find the approximation derived from the initial batches sufficient to work well in practice.

NCE reduces the computational complexity from O(n) to O(1) per sample. With such drastic reduction, our experiments still yield competitive performance.
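The following is a minimal PyTorch sketch of the NCE-approximated loss of Eqs. (6)-(8), assuming a uniform noise distribution P_n = 1/n and a pre-estimated constant Z; the function name `nce_loss` and the way negatives are drawn are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def nce_loss(f, idx, bank, Z, m=4096, tau=0.07, eps=1e-7):
    """NCE approximation of the instance-level softmax (Eqs. 4-7).

    f:    (B, D) unit-norm features from the network
    idx:  (B,)   instance ids of the batch images
    bank: (N, D) memory bank of unit-norm instance features
    Z:    scalar estimate of the normalizing constant (Eq. 8)
    m:    number of noise samples per data sample
    """
    N = bank.size(0)
    Pn = 1.0 / N                                              # uniform noise distribution

    # Positive term: v is the memory-bank entry of the true instance i.
    pos = torch.exp((bank[idx] * f).sum(dim=1) / tau) / Z     # P(i|v), Eq. (4)
    h_pos = pos / (pos + m * Pn)                              # Eq. (6)

    # Noise term: m randomly drawn memory-bank entries per sample.
    noise_idx = torch.randint(0, N, (f.size(0), m))
    neg = torch.exp(torch.einsum('bmd,bd->bm', bank[noise_idx], f) / tau) / Z
    h_neg = neg / (neg + m * Pn)

    # Eq. (7): negative log-posterior over data and noise samples.
    return -(torch.log(h_pos + eps).mean()
             + torch.log(1 - h_neg + eps).sum(dim=1).mean())
```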
3.3. Proximal Regularization

[Figure 3: The effect of our proximal regularization (training loss vs. training iterations for λ = 0, 10, 30, 50). The original objective value oscillates a lot and converges very slowly, whereas the regularized objective has smoother learning dynamics.]

Unlike typical classification settings where each class has many instances, we only have one instance per class. During each training epoch, each class is only visited once. Therefore, the learning process oscillates a lot from random sampling fluctuation. We employ the proximal optimization method [29] and introduce an additional term to encourage the smoothness of the training dynamics. At current iteration t, the feature representation for data x_i is computed from the network as v_i^{(t)} = f_θ(x_i). The memory bank of all the representations is stored at the previous iteration, V = {v^{(t-1)}}. The loss function for a positive sample from P_d is

    -\log h(i, v_i^{(t-1)}) + \lambda \| v_i^{(t)} - v_i^{(t-1)} \|_2^2.    (9)

As learning converges, the difference between iterations, i.e. v_i^{(t)} - v_i^{(t-1)}, gradually vanishes, and the augmented loss is reduced to the original one. With proximal regularization, our final objective becomes

    J_{NCE}(\theta) = -E_{P_d}\big[\log h(i, v_i^{(t-1)}) - \lambda \| v_i^{(t)} - v_i^{(t-1)} \|_2^2\big] - m \cdot E_{P_n}\big[\log(1 - h(i, v'^{(t-1)}))\big].    (10)

Fig. 3 shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.
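A minimal sketch of how the proximal term in Eq. (9) can be added on top of the NCE loss; the helper name `proximal_nce_loss` reuses the hypothetical `nce_loss` sketch from Sec. 3.2 above and is our own illustration, with λ as a tunable hyperparameter.

```python
def proximal_nce_loss(f, idx, bank, Z, lam=30.0, m=4096, tau=0.07):
    """NCE loss (Eq. 7) plus the proximal term lambda * ||v_i^(t) - v_i^(t-1)||^2 (Eq. 9)."""
    # nce_loss is the sketch from Sec. 3.2 above; bank holds v^(t-1).
    base = nce_loss(f, idx, bank, Z, m=m, tau=tau)
    # lam is a tunable hyperparameter; Fig. 3 sweeps lambda = 0, 10, 30, 50.
    proximal = lam * ((f - bank[idx]) ** 2).sum(dim=1).mean()
    return base + proximal

# After each optimizer step, the bank entries are overwritten with the new
# features (f_i -> v_i), so the penalty compares consecutive iterations.
```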
3.4. Weighted k-Nearest Neighbor Classifier

To classify a test image x̂, we first compute its feature f̂ = f_θ(x̂), and then compare it against the embeddings of all the images in the memory bank, using the cosine similarity s_i = cos(v_i, f̂). The top k nearest neighbors, denoted by N_k, are then used to make the prediction via weighted voting. Specifically, the class c gets a total weight w_c = \sum_{i \in N_k} \alpha_i \cdot 1(c_i = c). Here, α_i is the contributing weight of neighbor x_i, which depends on the similarity as α_i = exp(s_i / τ). We choose τ = 0.07 as in training, and we set k = 200.
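A minimal sketch of the weighted kNN voting rule of Sec. 3.4 in PyTorch; the function name `weighted_knn_predict` is our own, and it assumes unit-norm features so that the inner product equals the cosine similarity.

```python
import torch

def weighted_knn_predict(f_hat, bank, bank_labels, n_classes, k=200, tau=0.07):
    """Weighted kNN prediction (Sec. 3.4).

    f_hat:       (B, D) unit-norm features of test images
    bank:        (N, D) unit-norm memory-bank features of training images
    bank_labels: (N,)   integer class labels of training images (used only at test time)
    """
    sims = f_hat @ bank.t()                       # cosine similarity s_i
    topk_sims, topk_idx = sims.topk(k, dim=1)     # top-k nearest neighbors N_k
    alphas = torch.exp(topk_sims / tau)           # contribution weights alpha_i
    neighbor_labels = bank_labels[topk_idx]       # (B, k)

    # Accumulate the total weight w_c of each class via weighted voting.
    votes = torch.zeros(f_hat.size(0), n_classes, device=f_hat.device)
    votes.scatter_add_(1, neighbor_labels, alphas)
    return votes.argmax(dim=1)                    # predicted class per test image
```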
4. Experiments

We conduct four sets of experiments to evaluate our approach. The first set is on CIFAR-10, to compare our non-parametric softmax with the parametric softmax. The second set is on ImageNet, to compare our method with other unsupervised learning methods. The last two sets of experiments investigate two different tasks, semi-supervised learning and object detection, to show the generalization ability of our learned feature representation.

4.1. Parametric vs. Non-parametric Softmax

A key novelty of our approach is the non-parametric softmax function. Compared to the conventional parametric softmax, our softmax allows a non-parametric metric to transfer to supervised tasks.

We compare both the parametric and the non-parametric formulations on CIFAR-10 [17], a dataset with 50,000 training instances in 10 classes. This size allows us to compute the non-parametric softmax in Eq. (2) without any approximation. We use ResNet-18 as the backbone network, with its output features mapped into 128-dimensional vectors.

We evaluate the classification effectiveness based on the learned feature representation. A common practice [48, 2, 31] is to train an SVM on the learned feature over the training set, and to then classify test instances based on the feature extracted from the trained network. In addition, we also use nearest neighbor classifiers to assess the learned feature. The latter directly relies on the feature metric and may better reflect the quality of the representation.

    Training / Testing    Linear SVM    Nearest Neighbor
    Param Softmax         60.3          63.0
    Non-Param Softmax     75.4          80.8
    NCE m = 1             44.3          42.5
    NCE m = 10            60.2          63.4
    NCE m = 512           64.3          78.4
    NCE m = 4096          70.2          80.4

Table 1: Top-1 accuracies on CIFAR-10, by applying linear SVM or kNN classifiers on the learned features. Our non-parametric softmax outperforms the parametric softmax, and NCE provides a close approximation as m increases.

Table 1 shows top-1 classification accuracies on CIFAR-10. On the features learned with the parametric softmax, we obtain accuracies of 60.3% and 63.0% with linear SVM and kNN classifiers, respectively. On the features learned with the non-parametric softmax, the accuracy rises to 75.4% and 80.8% for the linear and nearest neighbour classifiers, a remarkable 18% boost for the latter.

We also study the quality of NCE in approximating the non-parametric softmax (Sec. 3.2). The approximation is controlled by m, the number of negatives drawn for each instance. With m = 1, the accuracy with kNN drops significantly to 42.5%. As m increases, the performance improves steadily. When m = 4,096, the accuracy approaches that at m = 49,999, i.e. full-form evaluation without any approximation. This result provides assurance that NCE is an efficient approximation.

4.2. Image Classification

We learn a feature representation on ImageNet ILSVRC [34], and compare our method with representative unsupervised learning methods.

Experimental Settings. We choose design parameters via empirical validation. In particular, we set temperature τ = 0.07 and use NCE with m = 4,096 to balance performance and computing cost. The model is trained for 200 epochs using SGD with momentum. The batch size is 256. The learning rate is initialized to 0.03 and scaled down with coefficient 0.1 every 40 epochs after the first 120 epochs. Our code is available at: http://github.com/zhirongw/lemniscate.pytorch.

Comparisons. We compare our method with a randomly initialized network (as a lower bound) and various unsupervised learning methods, including self-supervised learning [2, 47, 27, 48], adversarial learning [4], and Exemplar CNN [3]. The split-brain autoencoder [48] serves as a strong baseline that represents the state of the art. The results of these methods are reported with the AlexNet architecture [18] in their original papers, except for Exemplar CNN [5], whose results are reported with ResNet-101 [3]. As the network architecture has a big impact on the performance, we consider a few typical architectures: AlexNet [18], VGG16 [36], ResNet-18, and ResNet-50 [10].

We evaluate the performance with two different protocols: (1) perform linear SVM on the intermediate features from conv1 to conv5 (note that there are also corresponding layers in VGG16 and ResNet [36, 10]); (2) perform kNN on the output features.
    Image Classification Accuracy on ImageNet
    method            conv1  conv2  conv3  conv4  conv5  kNN   #dim
    Random            11.6   17.1   16.9   16.3   14.1   3.5   10K
    Data-Init [16]    17.5   23.0   24.5   23.2   20.6   -     10K
    Context [2]       16.2   23.3   30.2   31.7   29.6   -     10K
    Adversarial [4]   17.7   24.5   31.0   29.9   28.0   -     10K
    Color [47]        13.1   24.8   31.0   32.6   31.8   -     10K
    Jigsaw [27]       19.2   30.1   34.7   33.9   28.3   -     10K
    Count [28]        18.0   30.6   34.3   32.5   25.7   -     10K
    SplitBrain [48]   17.7   29.3   35.4   35.2   32.8   11.8  10K
    Exemplar [3]      -      -      -      -      31.5   -     4.5K
    Ours AlexNet      16.8   26.5   31.8   34.1   35.6   31.3  128
    Ours VGG16        16.5   21.4   27.6   33.1   37.2   33.9  128
    Ours ResNet-18    16.0   19.9   26.3   35.7   42.1   40.5  128
    Ours ResNet-50    15.3   18.8   24.4   35.3   43.9   42.5  128

Table 2: Top-1 classification accuracies on ImageNet.

Table 2 shows that:

1. With AlexNet and linear classification on intermediate features, our method achieves an accuracy of 35.6%, outperforming all baselines, including the state-of-the-art. Our method can readily scale up to deeper networks. As we move from AlexNet to ResNet-50, our accuracy is raised to 42.5%, whereas the accuracy with Exemplar CNN [3] is only 31.5% even with ResNet-101.

2. Using nearest neighbor classification on the final 128-dimensional features, our method achieves 31.3%, 33.9%, 40.5% and 42.5% accuracies with AlexNet, VGG16, ResNet-18 and ResNet-50, not much lower than the linear classification results, demonstrating that our learned feature induces a reasonably good metric. As a comparison, for Split-brain, the accuracy drops to 8.9% with nearest neighbor classification on conv3 features, and to 11.8% after projecting the features to 128 dimensions.

3. With our method, the performance gradually increases as we examine the learned feature representation from earlier to later layers, which is generally desirable. With all other methods, the performance decreases beyond conv3 or conv4.

4. It is important to note that the features from intermediate convolutional layers can be over 10,000-dimensional. Hence, for other methods, using the features from the best-performing layers can incur significant storage and computation costs. Our method produces a 128-dimensional representation at the last layer, which is very efficient to work with. The encoded features of all 1.28M images in ImageNet only take about 600 MB of storage. Exhaustive nearest neighbor search over this dataset only takes 20 ms per image on a Titan X GPU.
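As a rough sanity check on the storage figure (our own back-of-the-envelope estimate, assuming 4-byte floating-point entries):

    1.28 \times 10^{6} \text{ images} \times 128 \text{ dims} \times 4 \text{ bytes} \approx 6.6 \times 10^{8} \text{ bytes} \approx 650 \text{ MB},

which is consistent with the roughly 600 MB quoted above; the exact figure depends on the numeric precision used for storage.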
    Image Classification Accuracy on Places
    method            conv1  conv2  conv3  conv4  conv5  kNN   #dim
    Random            15.7   20.3   19.8   19.1   17.5   3.9   10K
    Data-Init [16]    21.4   26.2   27.1   26.1   24.0   -     10K
    Context [2]       19.7   26.7   31.9   32.7   30.9   -     10K
    Adversarial [4]   17.7   24.5   31.0   29.9   28.0   -     10K
    Video [44]        20.1   28.5   29.9   29.7   27.9   -     10K
    Color [47]        22.0   28.7   31.8   31.3   29.7   -     10K
    Jigsaw [27]       23.0   32.1   35.5   34.8   31.3   -     10K
    SplitBrain [48]   21.3   30.7   34.0   34.1   32.5   10.8  10K
    Ours AlexNet      18.8   24.3   31.9   34.5   33.6   30.1  128
    Ours VGG16        17.6   23.1   29.5   33.8   36.3   32.8  128
    Ours ResNet-18    17.8   23.0   30.3   34.2   41.3   36.7  128
    Ours ResNet-50    18.1   22.3   29.7   34.1   42.1   38.7  128

Table 3: Top-1 classification accuracies on Places, based directly on features learned on ImageNet, without any fine-tuning.

Feature generalization. We also study how the learned feature representations can generalize to other datasets. With the same settings, we conduct another large-scale experiment on Places [49], a large dataset for scene classification, which contains 2.45M training images in 205 categories. In this experiment, we directly use the feature extraction networks trained on ImageNet without fine-tuning. Table 3 compares the results obtained with different methods and under different evaluation policies. Again, with a linear classifier on conv5 features, our method achieves competitive performance of top-1 accuracy 34.5% with AlexNet, and 42.1% with ResNet-50. With nearest neighbors on the last layer, which is much smaller than the intermediate layers, we achieve an accuracy of 38.7% with ResNet-50. These results show remarkable generalization ability of the representations learned using our method.

Consistency of training and testing objectives. Unsupervised feature learning is difficult because the training objective is agnostic about the testing objective. A good training objective should be reflected in consistent improvement in the testing performance. We investigate the relation between the training loss and the testing accuracy across iterations. Fig. 4 shows that our testing accuracy continues to improve as training proceeds, with no sign of overfitting. It also suggests that better optimization of the training objective may further improve our testing accuracy.

[Figure 4: kNN testing accuracy and training loss on ImageNet over 200 training epochs. Our kNN testing accuracy continues to improve as the training loss decreases, demonstrating that our unsupervised learning objective captures apparent similarity which aligns well with the semantic annotation of the data.]
[Figure 5: Retrieval results for example queries. The left column shows queries from the validation set, while the right columns show the 10 closest instances from the training set. The upper half shows the best cases (successful retrievals); the lower half shows the worst cases (failures).]

The embedding feature size. We study how the performance changes as we vary the embedding size from 32 to 256. Table 4 shows that the performance increases from 32, plateaus at 128, and appears to saturate towards 256.

    embedding size    32     64     128    256
    top-1 accuracy    34.0   38.8   40.5   40.1

Table 4: Classification performance on ImageNet with ResNet-18 for different embedding feature sizes.

Training set size. To study how our method scales with the data size, we train different representations with various proportions of the ImageNet data, and evaluate the classification performance on the full labeled set using nearest neighbors. Table 5 shows that our feature learning method benefits from larger training sets, and the testing accuracy improves as the training set grows. This property is crucial for successful unsupervised learning, as there is no shortage of unlabeled data in the wild.

    training set size    0.1%   1%     10%    30%    100%
    accuracy             3.9    10.7   23.1   31.7   40.5

Table 5: Classification performance for models trained on different amounts of training data with ResNet-18.

Qualitative case study. To illustrate the learned features, Figure 5 shows the results of image retrieval using the learned features. The upper four rows show the best cases, where all top 10 results are in the same categories as the queries. The lower four rows show the worst cases, where none of the top 10 are in the same categories. However, even for the failure cases, the retrieved images are still visually similar to the queries, a testament to the power of our unsupervised learning objective.

4.3. Semi-supervised Learning

We now study how the learned feature extraction network can benefit other tasks, and whether it can provide a good basis for transfer learning. A common scenario that can benefit from unsupervised learning is when we have a large amount of data of which only a small fraction is labeled. A natural semi-supervised learning approach is to first learn from the big unlabeled data and then fine-tune the model on the small labeled data.
We randomly choose a subset of ImageNet as labeled and treat the others as unlabeled. We perform the above semi-supervised learning and measure the classification accuracy on the validation set. In order to compare with [19], we report top-5 accuracy here.

We compare our method with three baselines: (1) Scratch, i.e. fully supervised training on the small labeled subsets, (2) Split-brain [48] for pre-training, and (3) Colorization [19] for pre-training. Fine-tuning on the labeled subset takes 70 epochs with initial learning rate 0.01 and a decay rate of 10 every 30 epochs. We vary the proportion of the labeled subset from 1% to 20% of the entire dataset.

[Figure 6: Semi-supervised learning results on ImageNet, plotting top-5 accuracy against an increasing fraction of labeled data (1%, 2%, 4%, 10%, 20%) for Ours-ResNet, Scratch-ResNet, Color-ResNet-152, Ours-AlexNet, Scratch-AlexNet, and SplitBrain-AlexNet. Ours are consistently and significantly better. Note that the results for colorization-based pretraining are from a deeper ResNet-152 network [19].]

Fig. 6 shows that our method significantly outperforms all other approaches, and ours is the only one outperforming supervised learning from limited labeled data. When only 1% of the data is labeled, we outperform by a large 10% margin, demonstrating that our feature learned from unlabeled data is effective for task adaptation.

4.4. Object Detection

To further assess the generalization capacity of the learned features, we transfer the learned networks to the new task of object detection on PASCAL VOC 2007 [6]. Training an object detection model from scratch is often difficult, and a prevalent practice is to pretrain the underlying CNN on ImageNet and fine-tune it for the detection task.

We experiment with Fast R-CNN [7] with the AlexNet and VGG16 architectures, and Faster R-CNN [32] with ResNet-50. When fine-tuning Fast R-CNN, the learning rate is initialized to 0.001 and scaled down by 10 times after every 50K iterations. When fine-tuning AlexNet and VGG16, we follow the standard practice, fixing the conv1 model weights. When fine-tuning Faster R-CNN, we fix the model weights below the 3rd type of residual blocks, only updating the layers above and freezing all batch normalization layers. We follow the standard pipeline for fine-tuning and do not use the rescaling method proposed in [2]. We use the standard trainval set in VOC 2007 for training and testing.

We compare three settings: 1) directly training from scratch (lower bound), 2) pretraining on ImageNet in a supervised way (upper bound), and 3) pretraining on ImageNet or other data using various unsupervised methods.

    Method              mAP     Method              mAP
    AlexNet Labels†     56.8    VGG Labels†         67.3
    Gaussian            43.4    Gaussian            39.7
    Data-Init [16]      45.6    Video [44]          60.2
    Context [2]         51.1    Context [2]         61.5
    Adversarial [4]     46.9    Transitivity [45]   63.2
    Color [47]          46.9    Ours VGG            60.5
    Video [44]          47.4    ResNet Labels†      76.2
    Ours AlexNet        48.1    Ours ResNet         65.4

Table 6: Object detection performance on PASCAL VOC 2007 test, in terms of mean average precision (mAP), for supervised pretraining methods (marked by †), existing unsupervised methods, and our method.

Table 6 lists detection performance in terms of mean average precision (mAP). With AlexNet and VGG16, our method achieves an mAP of 48.1% and 60.5%, on par with the state-of-the-art unsupervised methods. With ResNet-50, our method achieves an mAP of 65.4%, surpassing all existing unsupervised learning approaches. It also shows that our method scales well as the network gets deeper. There remains a significant gap of 11% to be narrowed towards the mAP of 76.2% from supervised pretraining.

5. Summary

We present an unsupervised feature learning approach that maximizes distinction between instances via a novel non-parametric softmax formulation. It is motivated by the observation that supervised learning results in apparent image similarity. Our experimental results show that our method outperforms the state-of-the-art on image classification on ImageNet and Places, with a compact 128-dimensional representation that scales well with more data and deeper networks. It also delivers competitive generalization results on semi-supervised learning and object detection tasks.

Acknowledgements. This work was supported in part by Berkeley Deep Drive, a Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), and the General Research Fund (GRF) of Hong Kong (No. 14236516).
References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
[2] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[3] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. arXiv preprint arXiv:1708.07860, 2017.
[4] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[5] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[7] R. Girshick. Fast R-CNN. In ICCV, 2015.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[9] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[13] D. Jayaraman and K. Grauman. Learning image representations tied to egomotion from unlabeled video. IJCV, 2017.
[14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[15] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR. IEEE, 2012.
[16] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[19] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. CVPR, 2017.
[20] Q. V. Le. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[21] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[22] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[23] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV. IEEE, 2011.
[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[25] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.
[26] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS, volume 5. Citeseer, 2005.
[27] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV. Springer, 2016.
[28] M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. arXiv preprint arXiv:1708.06734, 2017.
[29] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 2014.
[30] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.
[31] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[33] S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. Adv. Neural Inf. Process. Syst. (NIPS), 17, 2004.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[35] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
[38] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
[39] Y. Tang, R. Salakhutdinov, and G. Hinton. Robust Boltzmann machines for recognition and denoising. In CVPR. IEEE, 2012.
[40] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[41] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[42] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV. Springer, 2016.
[43] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
[44] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[45] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. arXiv preprint arXiv:1708.02901, 2017.
[46] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. CVPR, 2017.
[47] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
[48] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. CVPR, 2017.
[49] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.
[50] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. arXiv preprint arXiv:1704.07813, 2017.
