
Unsupervised Domain Adaptation by Backpropagation

Yaroslav Ganin  GANIN@SKOLTECH.RU
Victor Lempitsky  LEMPITSKY@SKOLTECH.RU
Skolkovo Institute of Science and Technology (Skoltech), Moscow Region, Russia

Abstract

Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on a large amount of labeled data from the source domain and a large amount of unlabeled data from the target domain (no labeled target-domain data is necessary).

As the training progresses, the approach promotes the emergence of "deep" features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation.

Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving an adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Introduction

Deep feed-forward architectures have brought impressive advances to the state-of-the-art across a wide variety of machine-learning tasks and applications. At the moment, however, these leaps in performance come only when a large amount of labeled training data is available. At the same time, for problems lacking labeled data, it may still be possible to obtain training sets that are big enough for training large-scale deep models, but that suffer from a shift in data distribution from the actual data encountered at "test time". One particularly important example is synthetic or semi-synthetic training data, which may come in abundance and be fully labeled, but which inevitably have a distribution that is different from real data (Liebelt & Schmid, 2010; Stark et al., 2010; Vázquez et al., 2014; Sun & Saenko, 2014).

Learning a discriminative classifier or other predictor in the presence of a shift between training and test distributions is known as domain adaptation (DA). A number of approaches to domain adaptation have been suggested in the context of shallow learning, e.g. in the situation when data representations/features are given and fixed. These approaches build mappings between the source (training-time) and the target (test-time) domains, so that the classifier learned for the source domain can also be applied to the target domain when composed with the learned mapping between domains. The appeal of the domain adaptation approaches is the ability to learn such a mapping in the situation when the target-domain data are either fully unlabeled (unsupervised domain adaptation) or have few labeled samples (semi-supervised domain adaptation). Below, we focus on the harder unsupervised case, although the proposed approach can be generalized to the semi-supervised case rather straightforwardly.

Unlike most previous papers on domain adaptation that worked with fixed feature representations, we focus on combining domain adaptation and deep feature learning within one training process (deep domain adaptation). Our goal is to embed domain adaptation into the process of learning the representation, so that the final classification decisions are made based on features that are both discriminative and invariant to the change of domains, i.e. have the same or very similar distributions in the source and the target domains. In this way, the obtained feed-forward network can be applied to the target domain without being hindered by the shift between the two domains.

We thus focus on learning features that combine (i) discriminativeness and (ii) domain-invariance. This is achieved by jointly optimizing the underlying features as well as two discriminative classifiers operating on these features: (i) the label predictor that predicts class labels and is used both during training and at test time, and (ii) the domain classifier that discriminates between the source and the target domains during training. While the parameters of the classifiers are optimized in order to minimize their error on the training set, the parameters of the underlying deep feature mapping are optimized in order to minimize the loss of the label classifier and to maximize the loss of the domain classifier. The latter encourages domain-invariant features to emerge in the course of the optimization.

Crucially, we show that all three training processes can be embedded into an appropriately composed deep feed-forward network (Figure 1) that uses standard layers and loss functions, and can be trained using standard backpropagation algorithms based on stochastic gradient descent or its modifications (e.g. SGD with momentum). Our approach is generic as it can be used to add domain adaptation to any existing feed-forward architecture that is trainable by backpropagation. In practice, the only non-standard component of the proposed architecture is a rather trivial gradient reversal layer that leaves the input unchanged during forward propagation and reverses the gradient by multiplying it by a negative scalar during the backpropagation.

Below, we detail the proposed approach to domain adaptation in deep architectures, and present results on traditional deep learning image datasets (such as MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011)) as well as on the OFFICE benchmark (Saenko et al., 2010), where the proposed method considerably improves over previous state-of-the-art accuracy.

2. Related work

A large number of domain adaptation methods have been proposed over recent years, and here we focus on the most related ones. Multiple methods perform unsupervised domain adaptation by matching the feature distributions in the source and the target domains. Some approaches perform this by reweighing or selecting samples from the source domain (Borgwardt et al., 2006; Huang et al., 2006; Gong et al., 2013), while others seek an explicit feature space transformation that would map the source distribution into the target one (Pan et al., 2011; Gopalan et al., 2011; Baktashmotlagh et al., 2013). An important aspect of the distribution matching approach is the way the (dis)similarity between distributions is measured. Here, one popular choice is matching the distribution means in the kernel-reproducing Hilbert space (Borgwardt et al., 2006; Huang et al., 2006), whereas (Gong et al., 2012; Fernando et al., 2013) map the principal axes associated with each of the distributions. Our approach also attempts to match feature space distributions, however this is accomplished by modifying the feature representation itself rather than by reweighing or geometric transformation. Also, our method uses (implicitly) a rather different way to measure the disparity between distributions, based on their separability by a deep discriminatively-trained classifier.

Several approaches perform a gradual transition from the source to the target domain (Gopalan et al., 2011; Gong et al., 2012) by a gradual change of the training distribution. Among these methods, (S. Chopra & Gopalan, 2013) does this in a "deep" way by the layerwise training of a sequence of deep autoencoders, while gradually replacing source-domain samples with target-domain samples. This improves over a similar approach of (Glorot et al., 2011) that simply trains a single deep autoencoder for both domains. In both approaches, the actual classifier/predictor is learned in a separate step using the feature representation learned by the autoencoder(s). In contrast to (Glorot et al., 2011; S. Chopra & Gopalan, 2013), our approach performs feature learning, domain adaptation and classifier learning jointly, in a unified architecture, and using a single learning algorithm (backpropagation). We therefore argue that our approach is simpler (both conceptually and in terms of its implementation). Our method also achieves considerably better results on the popular OFFICE benchmark.

While the above approaches perform unsupervised domain adaptation, there are approaches that perform supervised domain adaptation by exploiting labeled data from the target domain. In the context of deep feed-forward architectures, such data can be used to "fine-tune" the network trained on the source domain (Zeiler & Fergus, 2013; Oquab et al., 2014; Babenko et al., 2014). Our approach does not require labeled target-domain data. At the same time, it can easily incorporate such data when they are available.

An idea related to ours is described in (Goodfellow et al., 2014). While their goal is quite different (building generative deep networks that can synthesize samples), the way they measure and minimize the discrepancy between the distribution of the training data and the distribution of the synthesized data is very similar to the way our architecture measures and minimizes the discrepancy between feature distributions for the two domains.

In recent years, domain adaptation for feed-forward neural networks has attracted a lot of interest. Thus, a very similar idea to ours has been developed in parallel and independently for a shallow architecture (with a single hidden layer) in (Ajakan et al., 2014). Their system is evaluated on a natural language task (sentiment analysis). Furthermore, recent and concurrent reports by (Tzeng et al., 2014; Long & Wang, 2015) also focus on domain adaptation in feed-forward networks. Their set of techniques measures and minimizes the distance between the data distribution means across domains (potentially, after embedding distributions into RKHS). The approach of (Tzeng et al., 2014)

Figure 1. The proposed architecture includes a deep feature extractor (green) and a deep label predictor (blue), which together form
a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the
feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during the backpropagation-
based training. Otherwise, the training proceeds in a standard way and minimizes the label prediction loss (for source examples) and
the domain classification loss (for all samples). Gradient reversal ensures that the feature distributions over the two domains are made
similar (as indistinguishable as possible for the domain classifier), thus resulting in the domain-invariant features.
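The behaviour of the gradient reversal layer described in the caption of Figure 1 can be sketched in a few lines of code. This is an illustrative sketch, not the authors' implementation: the tiny linear maps standing in for the deep G_f, G_y and G_d, and all shapes and names, are hypothetical.

```python
import numpy as np

# Illustrative sketch of Figure 1 (not the authors' code): hypothetical linear
# maps stand in for the deep feature extractor G_f, label predictor G_y and
# domain classifier G_d. The gradient reversal layer (GRL) is the identity on
# the forward pass and multiplies the incoming gradient by -lambda on the
# backward pass.

def grl_forward(features):
    # Forward propagation: identity transform.
    return features

def grl_backward(grad_from_domain_branch, lam):
    # Backpropagation: reverse (and scale) the gradient before it reaches
    # the feature extractor.
    return -lam * grad_from_domain_branch

def forward(x, theta_f, theta_y, theta_d):
    f = theta_f @ x                      # G_f(x; theta_f): feature vector
    y_scores = theta_y @ f               # G_y branch: label prediction
    d_scores = theta_d @ grl_forward(f)  # G_d branch, fed through the GRL
    return f, y_scores, d_scores

rng = np.random.default_rng(0)
x = rng.normal(size=3)
theta_f = rng.normal(size=(4, 3))   # hypothetical layer shapes
theta_y = rng.normal(size=(10, 4))
theta_d = rng.normal(size=(2, 4))

f, y_scores, d_scores = forward(x, theta_f, theta_y, theta_d)

# The forward pass is unaffected by the GRL...
assert np.allclose(grl_forward(f), f)
# ...while the gradient flowing back from the domain branch is flipped, so SGD
# ascends the domain loss w.r.t. theta_f while descending it w.r.t. theta_d.
toy_grad = theta_d.T @ np.ones(2)
assert np.allclose(grl_backward(toy_grad, 0.5), -0.5 * toy_grad)
```

Deep-learning frameworks express the same idea as a layer whose backward pass multiplies the gradient by a negative constant; the two assertions above are the entire contract of the layer.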

and (Long & Wang, 2015) is thus different from our idea of matching distributions by making them indistinguishable for a discriminative classifier. Below, we compare our approach to (Tzeng et al., 2014; Long & Wang, 2015) on the Office benchmark. Another approach to deep domain adaptation, which is arguably more different from ours, has been developed in parallel in (Chen et al., 2015).

3. Deep Domain Adaptation

3.1. The model

We now detail the proposed model for domain adaptation. We assume that the model works with input samples x ∈ X, where X is some input space, and certain labels (output) y from the label space Y. Below, we assume classification problems where Y is a finite set (Y = {1, 2, ..., L}), however our approach is generic and can handle any output label space that other deep feed-forward models can handle. We further assume that there exist two distributions S(x, y) and T(x, y) on X ⊗ Y, which will be referred to as the source distribution and the target distribution (or the source domain and the target domain). Both distributions are assumed complex and unknown, and furthermore similar but different (in other words, S is "shifted" from T by some domain shift).

Our ultimate goal is to be able to predict labels y given the input x for the target distribution. At training time, we have access to a large set of training samples {x_1, x_2, ..., x_N} from both the source and the target domains, distributed according to the marginal distributions S(x) and T(x). We denote with d_i the binary variable (domain label) for the i-th example, which indicates whether x_i comes from the source distribution (x_i ∼ S(x) if d_i = 0) or from the target distribution (x_i ∼ T(x) if d_i = 1). For the examples from the source distribution (d_i = 0) the corresponding labels y_i ∈ Y are known at training time. For the examples from the target domain, we do not know the labels at training time, and we want to predict such labels at test time.

We now define a deep feed-forward architecture that for each input x predicts its label y ∈ Y and its domain label d ∈ {0, 1}. We decompose such a mapping into three parts. We assume that the input x is first mapped by a mapping G_f (a feature extractor) to a D-dimensional feature vector f ∈ R^D. The feature mapping may also include several feed-forward layers, and we denote the vector of parameters of all layers in this mapping as θ_f, i.e. f = G_f(x; θ_f). Then, the feature vector f is mapped by a mapping G_y (label predictor) to the label y, and we denote the parameters of this mapping with θ_y. Finally, the same feature vector f is mapped to the domain label d by a mapping G_d (domain classifier) with the parameters θ_d (Figure 1).

During the learning stage, we aim to minimize the label prediction loss on the annotated part (i.e. the source part) of the training set, and the parameters of both the feature extractor and the label predictor are thus optimized in order to minimize the empirical loss for the source domain samples. This ensures the discriminativeness of the features f and the overall good prediction performance of the combination of the feature extractor and the label predictor on the source domain.

At the same time, we want to make the features f domain-invariant. That is, we want to make the distributions S(f) = {G_f(x; θ_f) | x ∼ S(x)} and T(f) = {G_f(x; θ_f) | x ∼ T(x)} similar. Under the covariate shift assumption, this would make the label prediction accuracy on the target domain the same as on the source domain (Shimodaira, 2000). Measuring the dissimilarity of the distributions S(f) and T(f) is however non-trivial, given that f is high-dimensional, and that the distributions

themselves are constantly changing as learning progresses. One way to estimate the dissimilarity is to look at the loss of the domain classifier G_d, provided that the parameters θ_d of the domain classifier have been trained to discriminate between the two feature distributions in an optimal way.

This observation leads to our idea. At training time, in order to obtain domain-invariant features, we seek the parameters θ_f of the feature mapping that maximize the loss of the domain classifier (by making the two feature distributions as similar as possible), while simultaneously seeking the parameters θ_d of the domain classifier that minimize the loss of the domain classifier. In addition, we seek to minimize the loss of the label predictor.

More formally, we consider the functional:

  E(θ_f, θ_y, θ_d) = Σ_{i=1..N, d_i=0} L_y(G_y(G_f(x_i; θ_f); θ_y), y_i) − λ Σ_{i=1..N} L_d(G_d(G_f(x_i; θ_f); θ_d), d_i)
                   = Σ_{i=1..N, d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1..N} L_d^i(θ_f, θ_d)   (1)

Here, L_y(·,·) is the loss for label prediction (e.g. multinomial), L_d(·,·) is the loss for the domain classification (e.g. logistic), while L_y^i and L_d^i denote the corresponding loss functions evaluated at the i-th training example.

Based on our idea, we are seeking the parameters θ̂_f, θ̂_y, θ̂_d that deliver a saddle point of the functional (1):

  (θ̂_f, θ̂_y) = argmin_{θ_f, θ_y} E(θ_f, θ_y, θ̂_d)   (2)

  θ̂_d = argmax_{θ_d} E(θ̂_f, θ̂_y, θ_d)   (3)

At the saddle point, the parameters θ_d of the domain classifier minimize the domain classification loss (since it enters into (1) with the minus sign) while the parameters θ_y of the label predictor minimize the label prediction loss. The feature mapping parameters θ_f minimize the label prediction loss (i.e. the features are discriminative), while maximizing the domain classification loss (i.e. the features are domain-invariant). The parameter λ controls the trade-off between the two objectives that shape the features during learning.

Below, we demonstrate that standard stochastic gradient solvers (SGD) can be adapted for the search of the saddle point (2)-(3).

3.2. Optimization with backpropagation

A saddle point (2)-(3) can be found as a stationary point of the following stochastic updates:

  θ_f ← θ_f − μ (∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)   (4)

  θ_y ← θ_y − μ ∂L_y^i/∂θ_y   (5)

  θ_d ← θ_d − μ ∂L_d^i/∂θ_d   (6)

where μ is the learning rate (which can vary over time). The updates (4)-(6) are very similar to stochastic gradient descent (SGD) updates for a feed-forward deep model that comprises the feature extractor fed into the label predictor and into the domain classifier. The difference is the −λ factor in (4) (the difference is important, as without such a factor, stochastic gradient descent would try to make features dissimilar across domains in order to minimize the domain classification loss). Although a direct implementation of (4)-(6) as SGD is not possible, it is highly desirable to reduce the updates (4)-(6) to some form of SGD, since SGD (and its variants) is the main learning algorithm implemented in most packages for deep learning.

Fortunately, such a reduction can be accomplished by introducing a special gradient reversal layer (GRL) defined as follows. The gradient reversal layer has no parameters associated with it (apart from the meta-parameter λ, which is not updated by backpropagation). During the forward propagation, the GRL acts as an identity transform. During the backpropagation though, the GRL takes the gradient from the subsequent level, multiplies it by −λ and passes it to the preceding layer. Implementing such a layer using existing object-oriented packages for deep learning is simple, as defining procedures for forwardprop (identity transform), backprop (multiplying by a constant), and parameter update (nothing) is trivial.

The GRL as defined above is inserted between the feature extractor and the domain classifier, resulting in the architecture depicted in Figure 1. As the backpropagation process passes through the GRL, the partial derivatives of the loss that is downstream of the GRL (i.e. L_d) w.r.t. the layer parameters that are upstream of the GRL (i.e. θ_f) get multiplied by −λ, i.e. ∂L_d/∂θ_f is effectively replaced with −λ ∂L_d/∂θ_f. Therefore, running SGD in the resulting model implements the updates (4)-(6) and converges to a saddle point of (1).

Mathematically, we can formally treat the gradient reversal layer as a "pseudo-function" R_λ(x) defined by two (incompatible) equations describing its forward- and backpropagation behaviour:

  R_λ(x) = x   (7)

  dR_λ/dx = −λ I   (8)

where I is an identity matrix. We can then define the objective "pseudo-function" of (θ_f, θ_y, θ_d) that is being

optimized by the stochastic gradient descent within our method:

  Ẽ(θ_f, θ_y, θ_d) = Σ_{i=1..N, d_i=0} L_y(G_y(G_f(x_i; θ_f); θ_y), y_i) + Σ_{i=1..N} L_d(G_d(R_λ(G_f(x_i; θ_f)); θ_d), d_i)   (9)

Running updates (4)-(6) can then be implemented as doing SGD for (9) and leads to the emergence of features that are domain-invariant and discriminative at the same time. After the learning, the label predictor y(x) = G_y(G_f(x; θ_f); θ_y) can be used to predict labels for samples from the target domain (as well as from the source domain).

The simple learning procedure outlined above can be re-derived/generalized along the lines suggested in (Goodfellow et al., 2014) (see the supplementary material (Ganin & Lempitsky, 2015)).

3.3. Relation to H∆H-distance

In this section we give a brief analysis of our method in terms of the H∆H-distance (Ben-David et al., 2010; Cortes & Mohri, 2011), which is widely used in the theory of non-conservative domain adaptation. Formally,

  d_{H∆H}(S, T) = 2 sup_{h_1, h_2 ∈ H} |P_{f∼S}[h_1(f) ≠ h_2(f)] − P_{f∼T}[h_1(f) ≠ h_2(f)]|   (10)

defines a discrepancy distance between two distributions S and T w.r.t. a hypothesis set H. Using this notion one can obtain a probabilistic bound (Ben-David et al., 2010) on the performance ε_T(h) of some classifier h from H evaluated on the target domain given its performance ε_S(h) on the source domain:

  ε_T(h) ≤ ε_S(h) + (1/2) d_{H∆H}(S, T) + C,   (11)

where S and T are source and target distributions respectively, and C does not depend on the particular h.

Consider fixed S and T over the representation space produced by the feature extractor G_f and a family of label predictors H_p. We assume that the family of domain classifiers H_d is rich enough to contain the symmetric difference hypothesis set of H_p:

  H_p∆H_p = {h | h = h_1 ⊕ h_2, h_1, h_2 ∈ H_p} .   (12)

This is not an unrealistic assumption, as we have the freedom to pick H_d however we want. For example, we can set the architecture of the domain discriminator to be the layer-by-layer concatenation of two replicas of the label predictor followed by a two-layer non-linear perceptron aimed to learn the XOR-function. Given that the assumption holds, one can easily show that training G_d is closely related to the estimation of d_{H_p∆H_p}(S, T). Indeed,

  d_{H_p∆H_p}(S, T) = 2 sup_{h ∈ H_p∆H_p} |P_{f∼S}[h(f) = 1] − P_{f∼T}[h(f) = 1]|
                    ≤ 2 sup_{h ∈ H_d} |P_{f∼S}[h(f) = 1] − P_{f∼T}[h(f) = 1]|
                    = 2 sup_{h ∈ H_d} |1 − α(h)| = 2 sup_{h ∈ H_d} [α(h) − 1]   (13)

where α(h) = P_{f∼S}[h(f) = 0] + P_{f∼T}[h(f) = 1] is maximized by the optimal G_d.

Thus, the optimal discriminator gives an upper bound for d_{H_p∆H_p}(S, T). At the same time, backpropagation of the reversed gradient changes the representation space so that α(G_d) becomes smaller, effectively reducing d_{H_p∆H_p}(S, T) and leading to a better approximation of ε_T(G_y) by ε_S(G_y).

Figure 2. Examples of domain pairs (top – source domain, bottom – target domain) used in the small image experiments: MNIST → MNIST-M, Syn Numbers → SVHN, SVHN → MNIST, Syn Signs → GTSRB. See Section 4.1 for details.

Method                                Amazon→Webcam   DSLR→Webcam   Webcam→DSLR
GFK (PLS, PCA) (Gong et al., 2012)        .214            .691          .650
SA* (Fernando et al., 2013)               .450            .648          .699
DLID (S. Chopra & Gopalan, 2013)          .519            .782          .899
DDC (Tzeng et al., 2014)                  .605            .948          .985
DAN (Long & Wang, 2015)                   .645            .952          .986
Source only                               .642            .961          .978
Proposed approach                         .730            .964          .992

Table 2. Accuracy evaluation of different DA approaches on the standard OFFICE (Saenko et al., 2010) dataset. All methods (except SA) are evaluated in the "fully-transductive" protocol (some results are reproduced from (Long & Wang, 2015)). Our method (last row) outperforms competitors, setting the new state-of-the-art.

4. Experiments

We perform an extensive evaluation of the proposed approach on a number of popular image datasets and their modifications. These include large-scale datasets of small im-

Source → Target              MNIST→MNIST-M   Syn Numbers→SVHN   SVHN→MNIST     Syn Signs→GTSRB
Source only                  .5225           .8674              .5490          .7900
SA (Fernando et al., 2013)   .5690 (4.1%)    .8644 (−5.5%)      .5932 (9.9%)   .8165 (12.7%)
Proposed approach            .7666 (52.9%)   .9109 (79.7%)      .7385 (42.6%)  .8865 (46.4%)
Train on target              .9596           .9220              .9942          .9980

Table 1. Classification accuracies for digit image classification for different source and target domains. MNIST-M corresponds to difference-blended digits over non-uniform background. The first row corresponds to the lower performance bound (i.e. if no adaptation is performed). The last row corresponds to training on the target domain data with known class labels (upper bound on the DA performance). For each of the two DA methods (ours and (Fernando et al., 2013)) we show how much of the gap between the lower and the upper bounds was covered (in brackets). For all four cases, our approach outperforms (Fernando et al., 2013) considerably, and covers a big portion of the gap.
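The effect of the −λ factor in update (4) of Section 3.2 can be checked numerically on a toy problem. The scalar parameters and quadratic losses below are hypothetical stand-ins chosen only so the gradients are easy to verify by hand; this is a sketch, not the paper's training code.

```python
# Toy sketch of the stochastic updates (4)-(6): theta_y and theta_d descend
# their own losses, while theta_f descends the label loss but *ascends* the
# domain loss through the -lambda factor. The quadratic "losses"
#   L_y = (theta_f - theta_y)^2  and  L_d = (theta_f - theta_d)^2
# are hypothetical stand-ins, not the losses used in the paper.

def saddle_point_step(theta_f, theta_y, theta_d, mu=0.1, lam=1.0):
    dLy_dtf = 2.0 * (theta_f - theta_y)
    dLy_dty = -2.0 * (theta_f - theta_y)
    dLd_dtf = 2.0 * (theta_f - theta_d)
    dLd_dtd = -2.0 * (theta_f - theta_d)

    theta_f = theta_f - mu * (dLy_dtf - lam * dLd_dtf)  # update (4): reversed sign on L_d
    theta_y = theta_y - mu * dLy_dty                    # update (5): plain SGD
    theta_d = theta_d - mu * dLd_dtd                    # update (6): plain SGD
    return theta_f, theta_y, theta_d

tf_, ty_, td_ = saddle_point_step(1.0, 0.0, 2.0, mu=0.1, lam=1.0)
# Here dL_y/dtheta_f = 2 and dL_d/dtheta_f = -2, so theta_f moves by
# -0.1 * (2 - 1.0 * (-2)) = -0.4, i.e. from 1.0 to 0.6.
assert abs(tf_ - 0.6) < 1e-9
assert abs(ty_ - 0.2) < 1e-9
assert abs(td_ - 1.8) < 1e-9
```

With lam = 0 the domain term disappears and θ_f performs ordinary SGD on the label loss alone; the sign reversal on the L_d gradient is exactly what the gradient reversal layer implements inside a standard backpropagation pass.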

ages popular with deep learning methods, and the OFFICE datasets (Saenko et al., 2010), which are a de facto standard for domain adaptation in computer vision, but have much fewer images.

Baselines. For the bulk of the experiments the following baselines are evaluated. The source-only model is trained without consideration for target-domain data (no domain classifier branch included into the network). The train-on-target model is trained on the target domain with class labels revealed. This model serves as an upper bound on DA methods, assuming that target data are abundant and the shift between the domains is considerable.

In addition, we compare our approach against the recently proposed unsupervised DA method based on subspace alignment (SA) (Fernando et al., 2013), which is simple to set up and test on new datasets, but has also been shown to perform very well in experimental comparisons with other "shallow" DA methods. To boost the performance of this baseline, we pick its most important free parameter (the number of principal components) from the range {2, ..., 60}, so that the test performance on the target domain is maximized. To apply SA in our setting, we train a source-only model and then consider the activations of the last hidden layer in the label predictor (before the final linear classifier) as descriptors/features, and learn the mapping between the source and the target domains (Fernando et al., 2013).

Since the SA baseline requires training a new classifier after adapting the features, and in order to put all the compared settings on an equal footing, we retrain the last layer of the label predictor using a standard linear SVM (Fan et al., 2008) for all four considered methods (including ours; the performance on the target domain remains approximately the same after the retraining).

For the OFFICE dataset (Saenko et al., 2010), we directly compare the performance of our full network (feature extractor and label predictor) against recent DA approaches using previously published results.

CNN architectures. In general, we compose the feature extractor from two or three convolutional layers, picking their exact configurations from previous works. We give the exact architectures in the supplementary material (Ganin & Lempitsky, 2015). For the domain adaptation component, we stick to three fully connected layers (x → 1024 → 1024 → 2), except for MNIST where we used a simpler (x → 100 → 2) architecture to speed up the experiments. For the loss functions, we set L_y and L_d to be the logistic regression loss and the binomial cross-entropy respectively.

CNN training procedure. The model is trained on 128-sized batches. Images are preprocessed by mean subtraction. A half of each batch is populated by samples from the source domain (with known labels), the rest is comprised of samples from the target domain (with unknown labels). In order to suppress noisy signal from the domain classifier at the early stages of the training procedure, instead of fixing the adaptation factor λ we gradually change it from 0 to 1 using the following schedule:

  λ_p = 2 / (1 + exp(−γ · p)) − 1,   (14)

where p is the training progress linearly changing from 0 to 1, and γ was set to 10 in all experiments (the schedule was not optimized/tweaked). Further details on the CNN training can be found in the supplementary material (Ganin & Lempitsky, 2015).

Visualizations. We use t-SNE (van der Maaten, 2013) projection to visualize feature distributions at different points of the network, while color-coding the domains (Figure 3). We observe a strong correspondence between the success of the adaptation in terms of the classification accuracy for the target domain, and the overlap between the domain distributions in such visualizations.

Choosing meta-parameters. Good unsupervised DA

methods should provide ways to set meta-parameters (such as λ, the learning rate, the momentum rate, the network architecture for our method) in an unsupervised way, i.e. without referring to labeled data in the target domain. In our method, one can assess the performance of the whole system (and the effect of changing hyper-parameters) by observing the test error on the source domain and the domain classifier error. In general, we observed a good correspondence between the success of adaptation and these errors (adaptation is more successful when the source domain test error is low, while the domain classifier error is high). In addition, the layer where the domain discriminator is attached can be picked by computing the difference between means as suggested in (Tzeng et al., 2014).

4.1. Results

We now discuss the experimental settings and the results. In each case, we train on the source dataset and test on a different target domain dataset, with considerable shifts between domains (see Figure 2). When the MNIST and SVHN datasets are used as targets, standard training-test splits are considered, and all training images are used for unsupervised adaptation. The results are summarized in Table 1 and Table 2.

MNIST → MNIST-M. Our first experiment deals with the MNIST dataset (LeCun et al., 1998) (source). In order to obtain the target domain (MNIST-M) we blend digits from the original set over patches randomly extracted from color photos from BSDS500 (Arbelaez et al., 2011). This operation is formally defined for two images I^1, I^2 as I^out_{ijk} = |I^1_{ijk} − I^2_{ijk}|, where i, j are the coordinates of a pixel and k is a channel index. In other words, an output sample is produced by taking a patch from a photo and inverting its pixels at positions corresponding to the pixels of

three-digit numbers), positioning, orientation, background and stroke colors, and the amount of blur. The degrees of variation were chosen manually to simulate SVHN, however the two datasets are still rather distinct, the biggest difference being the structured clutter in the background of SVHN images.

The proposed backpropagation-based technique works well, covering almost 80% of the gap between training with source data only and training on target domain data with known target labels. In contrast, SA (Fernando et al., 2013) results in a slight classification accuracy drop (probably due to the information loss during the dimensionality reduction), thus indicating that the adaptation task is even more challenging than in the case of the MNIST experiment.

MNIST ↔ SVHN. In this experiment, we further increase the gap between distributions, and test on MNIST and SVHN, which are significantly different in appearance. Training on SVHN even without adaptation is challenging: classification error stays high during the first 150 epochs. In order to avoid ending up in a poor local minimum we, therefore, do not use learning rate annealing here. Obviously, the two directions (MNIST → SVHN and SVHN → MNIST) are not equally difficult. As SVHN is more diverse, a model trained on SVHN is expected to be more generic and to perform reasonably on the MNIST dataset. This, indeed, turns out to be the case and is supported by the appearance of the feature distributions. We observe a quite strong separation between the domains when we feed them into the CNN trained solely on MNIST, whereas for the SVHN-trained network the features are much more intermixed. This difference probably explains why our method succeeded in improving the performance by adaptation in the SVHN → MNIST sce-
a digit. For a human the classification task becomes only nario (see Table 1) but not in the opposite direction (SA is
slightly harder compared to the original dataset (the digits not able to perform adaptation in this case either). Unsu-
are still clearly distinguishable) whereas for a CNN trained pervised adaptation from MNIST to SVHN gives a failure
on MNIST this domain is quite distinct, as the background example for our approach (we are unaware of any unsuper-
and the strokes are no longer constant. Consequently, the vised DA methods capable of performing such adaptation).
source-only model performs poorly. Our approach suc-
ceeded at aligning feature distributions (Figure 3), which Synthetic Signs → GTSRB. Overall, this setting is similar
led to successful adaptation results (considering that the to the S YN N UMBERS → SVHN experiment, except the
adaptation is unsupervised). At the same time, the im- distribution of the features is more complex due to the sig-
provement over source-only model achieved by subspace nificantly larger number of classes (43 instead of 10). For
alignment (SA) (Fernando et al., 2013) is quite modest, the source domain we obtained 100,000 synthetic images
thus highlighting the difficulty of the adaptation task. (which we call S YN S IGNS) simulating various imaging
conditions. In the target domain, we use 31,367 random
Synthetic numbers → SVHN. To address a common sce- training samples for unsupervised adaptation and the rest
nario of training on synthetic data and testing on real data, for evaluation. Once again, our method achieves a sensi-
we use Street-View House Number dataset SVHN (Netzer ble increase in performance proving its suitability for the
et al., 2011) as the target domain and synthetic digits as the synthetic-to-real data adaptation.
source. The latter (S YN N UMBERS) consists of ≈ 500,000 As an additional experiment, we also evaluate the proposed
images generated by ourselves from WindowsTM fonts by algorithm for semi-supervised domain adaptation. Here,
varying the text (that includes different one-, two-, and we reveal 430 labeled examples (10 samples per class) and
Unsupervised Domain Adaptation by Backpropagation

MNIST → MNIST-M: top feature extractor layer S YN N UMBERS → SVHN: last hidden layer of the label predictor

(a) Non-adapted (b) Adapted (a) Non-adapted (b) Adapted

Figure 3. The effect of adaptation on the distribution of the extracted features (best viewed in color). The figure shows t-SNE (van der
Maaten, 2013) visualizations of the CNN’s activations (a) in case when no adaptation was performed and (b) in case when our adaptation
procedure was incorporated into training. Blue points correspond to the source domain examples, while red ones correspond to the target
domain. In all cases, the adaptation in our method makes the two distributions of features much closer.

0.2 Real (Gong et al., 2013; S. Chopra & Gopalan, 2013; Long &
Syn Wang, 2015) as during adaptation we use all available la-
Validation error

0.15
Syn Adapted beled source examples and unlabeled target examples (the
Syn + Real premise of our method is the abundance of unlabeled data
Syn + Real Adapted
0.1
in the target domain). Also, all source domain is used
for training. Under this “fully-transductive” setting, our
method is able to improve previously-reported state-of-the-
0 1 2 3 4 5
art accuracy for unsupervised adaptation very considerably
Batches seen ·105 (Table 2), especially in the most challenging A MAZON →
Figure 4. Results for the traffic signs classification in the semi- W EBCAM scenario (the two domains with the largest do-
supervised setting. Syn and Real denote available labeled data main shift).
(100,000 synthetic and 430 real images respectively); Adapted
Interestingly, in all three experiments we observe a slight
means that ≈ 31,000 unlabeled target domain images were used
over-fitting as training progresses, however, it doesn’t ruin
for adaptation. The best performance is achieved by employing
both the labeled samples and the large unlabeled corpus in the the validation accuracy. Moreover, switching off the do-
target domain. main classifier branch makes this effect far more apparent,
from which we conclude that our technique serves as a reg-
add them to the training set for the label predictor. Fig- ularizer.
ure 4 shows the change of the validation error through-
out the training. While the graph clearly suggests that our 5. Discussion
method can be beneficial in the semi-supervised setting, We have proposed a new approach to unsupervised do-
thorough verification of semi-supervised setting is left for main adaptation of deep feed-forward architectures, which
future work. allows large-scale training based on large amount of an-
notated data in the source domain and large amount of
Office dataset. We finally evaluate our method on O F -
unannotated data in the target domain. Similarly to many
FICE dataset, which is a collection of three distinct do-
previous shallow and deep DA techniques, the adaptation
mains: A MAZON, DSLR, and W EBCAM. Unlike previ-
is achieved through aligning the distributions of features
ously discussed datasets, O FFICE is rather small-scale with
across the two domains. However, unlike previous ap-
only 2817 labeled images spread across 31 different cate-
proaches, the alignment is accomplished through standard
gories in the largest domain. The amount of available data
backpropagation training. The approach is therefore rather
is crucial for a successful training of a deep model, hence
scalable, and can be implemented using any deep learn-
we opted for the fine-tuning of the CNN pre-trained on the
ing package. To this end we release the source code for
ImageNet (AlexNet from the Caffe package (Jia et al.,
the Gradient Reversal layer along with the usage examples
2014)) as it is done in some recent DA works (Donahue
as an extension to Caffe (Jia et al., 2014) (see (Ganin &
et al., 2014; Tzeng et al., 2014; Hoffman et al., 2013; Long
Lempitsky, 2015)). Further evaluation on larger-scale tasks
& Wang, 2015). We make our approach more compara-
and in semi-supervised settings constitutes future work.
ble with (Tzeng et al., 2014) by using exactly the same
Acknowledgements. This research is supported by
network architecture replacing domain mean-based regu-
the Russian Ministry of Science and Education grant
larization with the domain classifier.
RFMEFI57914X0071. We also thank the Graphics & Me-
Following previous works, we assess the performance of
dia Lab, Moscow State University for providing the syn-
our method across three transfer tasks most commonly
thetic road signs data.
used for evaluation. Our training protocol is adopted from
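The Discussion above describes the alignment as being driven by a gradient reversal layer: an identity mapping in the forward pass whose backward pass multiplies the incoming gradient by a negative constant, so that the feature extractor is trained to confuse the domain classifier. A minimal pure-Python sketch of these semantics (the released implementation is a Caffe extension; the class and parameter names below are our own illustration):

```python
class GradientReversalLayer:
    """Identity in the forward pass; scales the incoming gradient by
    -lam in the backward pass, so the preceding feature extractor is
    pushed to *maximize* the domain classifier's loss.
    Illustrative sketch only, not the released Caffe code."""

    def __init__(self, lam=1.0):
        self.lam = lam  # adaptation weight, lambda in the paper

    def forward(self, x):
        # No-op during the forward pass: features pass through unchanged.
        return x

    def backward(self, grad_output):
        # Reverse (and scale) the gradient flowing back from the domain
        # classifier into the shared feature extractor.
        return [-self.lam * g for g in grad_output]


grl = GradientReversalLayer(lam=0.5)
features = [1.0, -2.0, 3.0]
assert grl.forward(features) == features                     # identity forward
assert grl.backward([1.0, 1.0, 1.0]) == [-0.5, -0.5, -0.5]   # reversed gradient
```

Since the layer has no parameters, inserting it between the feature extractor and the domain classifier is all that is needed to turn standard backpropagation into adversarial feature alignment.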
References

Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, François, and Marchand, Mario. Domain-adversarial neural networks. CoRR, abs/1412.4446, 2014.

Arbelaez, Pablo, Maire, Michael, Fowlkes, Charless, and Malik, Jitendra. Contour detection and hierarchical image segmentation. PAMI, 33, 2011.

Babenko, Artem, Slesarev, Anton, Chigorin, Alexander, and Lempitsky, Victor S. Neural codes for image retrieval. In ECCV, pp. 584–599, 2014.

Baktashmotlagh, Mahsa, Harandi, Mehrtash Tafazzoli, Lovell, Brian C., and Salzmann, Mathieu. Unsupervised domain adaptation by domain invariant projection. In ICCV, pp. 769–776, 2013.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. JMLR, 79, 2010.

Borgwardt, Karsten M., Gretton, Arthur, Rasch, Malte J., Kriegel, Hans-Peter, Schölkopf, Bernhard, and Smola, Alexander J. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, pp. 49–57, 2006.

Chen, Qiang, Huang, Junshi, Feris, Rogerio, Brown, Lisa M., Dong, Jian, and Yan, Shuicheng. Deep domain adaptation for describing people based on fine-grained clothing attributes. June 2015.

Cortes, Corinna and Mohri, Mehryar. Domain adaptation in regression. In Algorithmic Learning Theory, 2011.

Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. DeCAF: A deep convolutional activation feature for generic visual recognition, 2014.

Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Fernando, Basura, Habrard, Amaury, Sebban, Marc, and Tuytelaars, Tinne. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.

Ganin, Yaroslav and Lempitsky, Victor. Project website. [Link] projects/grl/, 2015. [Online; accessed 25-May-2015].

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.

Gong, Boqing, Shi, Yuan, Sha, Fei, and Grauman, Kristen. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073, 2012.

Gong, Boqing, Grauman, Kristen, and Sha, Fei. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, pp. 222–230, 2013.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Gopalan, Raghuraman, Li, Ruonan, and Chellappa, Rama. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pp. 999–1006, 2011.

Hoffman, Judy, Tzeng, Eric, Donahue, Jeff, Jia, Yangqing, Saenko, Kate, and Darrell, Trevor. One-shot adaptation of supervised deep convolutional models. CoRR, abs/1312.6204, 2013.

Huang, Jiayuan, Smola, Alexander J., Gretton, Arthur, Borgwardt, Karsten M., and Schölkopf, Bernhard. Correcting sample selection bias by unlabeled data. In NIPS, pp. 601–608, 2006.

Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Liebelt, Joerg and Schmid, Cordelia. Multi-view object class detection with a 3d geometric model. In CVPR, 2010.

Long, Mingsheng and Wang, Jianmin. Learning transferable features with deep adaptation networks. CoRR, abs/1502.02791, 2015.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
Pan, Sinno Jialin, Tsang, Ivor W., Kwok, James T., and
Yang, Qiang. Domain adaptation via transfer component
analysis. IEEE Transactions on Neural Networks, 22(2):
199–210, 2011.
Chopra, S., Balakrishnan, S., and Gopalan, R. DLID: Deep learning for domain adaptation by interpolating between domains. In ICML Workshop on Challenges in Representation Learning, 2013.

Saenko, Kate, Kulis, Brian, Fritz, Mario, and Darrell, Trevor. Adapting visual category models to new domains. In ECCV, pp. 213–226, 2010.
Shimodaira, Hidetoshi. Improving predictive inference un-
der covariate shift by weighting the log-likelihood func-
tion. Journal of Statistical Planning and Inference, 90
(2):227–244, October 2000.
Stark, Michael, Goesele, Michael, and Schiele, Bernt. Back
to the future: Learning shape models from 3d CAD data.
In BMVC, pp. 1–11, 2010.

Sun, Baochen and Saenko, Kate. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, 2014.
Tzeng, Eric, Hoffman, Judy, Zhang, Ning, Saenko, Kate,
and Darrell, Trevor. Deep domain confusion: Maximiz-
ing for domain invariance. CoRR, abs/1412.3474, 2014.
van der Maaten, Laurens. Barnes-Hut-SNE. CoRR, abs/1301.3342, 2013.
Vázquez, David, López, Antonio Manuel, Marín, Javier, Ponsa, Daniel, and Gomez, David Gerónimo. Virtual and real world adaptation for pedestrian detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):797–809, 2014.
Zeiler, Matthew D. and Fergus, Rob. Visualizing
and understanding convolutional networks. CoRR,
abs/1311.2901, 2013.
