Unsupervised Domain Adaptation by Backpropagation
discriminativeness and (ii) domain-invariance. This is achieved by jointly optimizing the underlying features as well as two discriminative classifiers operating on these features: (i) the label predictor, which predicts class labels and is used both during training and at test time, and (ii) the domain classifier, which discriminates between the source and the target domains during training. While the parameters of the classifiers are optimized in order to minimize their error on the training set, the parameters of the underlying deep feature mapping are optimized in order to minimize the loss of the label classifier and to maximize the loss of the domain classifier. The latter encourages domain-invariant features to emerge in the course of the optimization.
Crucially, we show that all three training processes can be embedded into an appropriately composed deep feed-forward network (Figure 1) that uses standard layers and loss functions, and can be trained using standard backpropagation algorithms based on stochastic gradient descent or its modifications (e.g. SGD with momentum). Our approach is generic, as it can be used to add domain adaptation to any existing feed-forward architecture that is trainable by backpropagation. In practice, the only non-standard component of the proposed architecture is a rather trivial gradient reversal layer that leaves the input unchanged during forward propagation and reverses the gradient by multiplying it by a negative scalar during backpropagation.
Below, we detail the proposed approach to domain adaptation in deep architectures, and present results on traditional deep learning image datasets (such as MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011)) as well as on the OFFICE benchmarks (Saenko et al., 2010), where the proposed method considerably improves over previous state-of-the-art accuracy.

2. Related work

A large number of domain adaptation methods have been proposed over the recent years, and here we focus on the most related ones. Multiple methods perform unsupervised domain adaptation by matching the feature distributions in the source and the target domains. Some approaches perform this by reweighing or selecting samples from the source domain (Borgwardt et al., 2006; Huang et al., 2006; Gong et al., 2013), while others seek an explicit feature space transformation that would map the source distribution onto the target one (Pan et al., 2011; Gopalan et al., 2011; Baktashmotlagh et al., 2013). An important aspect of the distribution matching approach is the way the (dis)similarity between distributions is measured. Here, one popular choice is matching the distribution means in a reproducing kernel Hilbert space (Borgwardt et al., 2006; Huang et al., 2006), whereas (Gong et al., 2012; Fernando et al., 2013) map the principal axes associated with each of the distributions. Our approach also attempts to match feature space distributions; however, this is accomplished by modifying the feature representation itself rather than by reweighing or geometric transformation. Also, our method uses (implicitly) a rather different way to measure the disparity between distributions, based on their separability by a deep discriminatively-trained classifier.
Several approaches perform a gradual transition from the source to the target domain (Gopalan et al., 2011; Gong et al., 2012) by a gradual change of the training distribution. Among these methods, (S. Chopra & Gopalan, 2013) does this in a "deep" way by the layerwise training of a sequence of deep autoencoders, while gradually replacing source-domain samples with target-domain samples. This improves over a similar approach of (Glorot et al., 2011) that simply trains a single deep autoencoder for both domains. In both approaches, the actual classifier/predictor is learned in a separate step using the feature representation learned by the autoencoder(s). In contrast to (Glorot et al., 2011; S. Chopra & Gopalan, 2013), our approach performs feature learning, domain adaptation and classifier learning jointly, in a unified architecture, and using a single learning algorithm (backpropagation). We therefore argue that our approach is simpler (both conceptually and in terms of its implementation). Our method also achieves considerably better results on the popular OFFICE benchmark.
While the above approaches perform unsupervised domain adaptation, there are approaches that perform supervised domain adaptation by exploiting labeled data from the target domain. In the context of deep feed-forward architectures, such data can be used to "fine-tune" the network trained on the source domain (Zeiler & Fergus, 2013; Oquab et al., 2014; Babenko et al., 2014). Our approach does not require labeled target-domain data. At the same time, it can easily incorporate such data when they are available.
An idea related to ours is described in (Goodfellow et al., 2014). While their goal is quite different (building generative deep networks that can synthesize samples), the way they measure and minimize the discrepancy between the distribution of the training data and the distribution of the synthesized data is very similar to the way our architecture measures and minimizes the discrepancy between the feature distributions of the two domains.
In the last year, domain adaptation for feed-forward neural networks has attracted a lot of interest. Thus, an idea very similar to ours has been developed in parallel and independently for a shallow architecture (with a single hidden layer) in (Ajakan et al., 2014). Their system is evaluated on a natural language task (sentiment analysis). Furthermore, recent and concurrent reports by (Tzeng et al., 2014; Long & Wang, 2015) also focus on domain adaptation in feed-forward networks. Their set of techniques measures and minimizes the distance between the data distribution means across domains (potentially, after embedding the distributions into an RKHS). The approach of (Tzeng et al., 2014)
Figure 1. The proposed architecture includes a deep feature extractor (green) and a deep label predictor (blue), which together form
a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the
feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during the backpropagation-
based training. Otherwise, the training proceeds in a standard way and minimizes the label prediction loss (for source examples) and
the domain classification loss (for all samples). Gradient reversal ensures that the feature distributions over the two domains are made
similar (as indistinguishable as possible for the domain classifier), thus resulting in the domain-invariant features.
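The gradient-reversal behaviour described above can be illustrated with a minimal NumPy sketch. This is not the authors' released Caffe layer; the class name and interface are purely illustrative, but the semantics match the description: identity on the forward pass, multiplication by a negative scalar (here −λ) on the backward pass.

```python
import numpy as np

class GradientReversal:
    """Illustrative gradient reversal layer (GRL).

    Forward pass: identity (leaves the input unchanged).
    Backward pass: the incoming gradient is multiplied by -lam,
    so the layers below receive a reversed (adversarial)
    domain-classifier gradient.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # adaptation factor lambda

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output


grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)                    # identity forward
assert np.allclose(grl.backward(np.ones(3)), -0.5)       # reversed, scaled backward
```

Because the layer is parameter-free and touches only the gradient, it can be dropped between any feature extractor and domain classifier without changing the forward computation.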
and (Long & Wang, 2015) is thus different from our idea of matching the distributions by making them indistinguishable for a discriminative classifier. Below, we compare our approach to (Tzeng et al., 2014; Long & Wang, 2015) on the Office benchmark. Another approach to deep domain adaptation, which is arguably more different from ours, has been developed in parallel in (Chen et al., 2015).

3. Deep Domain Adaptation

3.1. The model

We now detail the proposed model for domain adaptation. We assume that the model works with input samples x ∈ X, where X is some input space, and certain labels (output) y from the label space Y. Below, we assume classification problems where Y is a finite set (Y = {1, 2, . . . , L}); however, our approach is generic and can handle any output label space that other deep feed-forward models can handle. We further assume that there exist two distributions S(x, y) and T(x, y) on X ⊗ Y, which will be referred to as the source distribution and the target distribution (or the source domain and the target domain). Both distributions are assumed complex and unknown, and furthermore similar but different (in other words, S is "shifted" from T by some domain shift).
Our ultimate goal is to be able to predict labels y given the input x for the target distribution. At training time, we have access to a large set of training samples {x_1, x_2, . . . , x_N} from both the source and the target domains, distributed according to the marginal distributions S(x) and T(x). We denote with d_i the binary variable (domain label) for the i-th example, which indicates whether x_i comes from the source distribution (x_i ∼ S(x) if d_i = 0) or from the target distribution (x_i ∼ T(x) if d_i = 1). For the examples from the source distribution (d_i = 0) the corresponding labels y_i ∈ Y are known at training time. For the examples from the target domain, we do not know the labels at training time, and we want to predict such labels at test time.
We now define a deep feed-forward architecture that for each input x predicts its label y ∈ Y and its domain label d ∈ {0, 1}. We decompose such a mapping into three parts. We assume that the input x is first mapped by a mapping G_f (a feature extractor) to a D-dimensional feature vector f ∈ R^D. The feature mapping may include several feed-forward layers, and we denote the vector of parameters of all layers in this mapping as θ_f, i.e. f = G_f(x; θ_f). Then, the feature vector f is mapped by a mapping G_y (label predictor) to the label y, and we denote the parameters of this mapping with θ_y. Finally, the same feature vector f is mapped to the domain label d by a mapping G_d (domain classifier) with the parameters θ_d (Figure 1).
During the learning stage, we aim to minimize the label prediction loss on the annotated part (i.e. the source part) of the training set, and the parameters of both the feature extractor and the label predictor are thus optimized in order to minimize the empirical loss for the source domain samples. This ensures the discriminativeness of the features f and the overall good prediction performance of the combination of the feature extractor and the label predictor on the source domain.
At the same time, we want to make the features f domain-invariant. That is, we want to make the distributions S(f) = {G_f(x; θ_f) | x ∼ S(x)} and T(f) = {G_f(x; θ_f) | x ∼ T(x)} similar. Under the covariate shift assumption, this would make the label prediction accuracy on the target domain the same as on the source domain (Shimodaira, 2000). Measuring the dissimilarity of the distributions S(f) and T(f) is, however, non-trivial, given that f is high-dimensional and that the distributions themselves change as learning progresses.
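The three-part decomposition above can be sketched in NumPy. The sizes, the single linear (or tanh) layers standing in for the deep mappings, and the variable names are all hypothetical; the point is only the data flow f = G_f(x; θ_f), y = G_y(f; θ_y), d = G_d(f; θ_d).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input dim 8, feature dim D = 4, L = 3 classes.
D, L = 4, 3
theta_f = rng.standard_normal((8, D))  # parameters of G_f (one tanh layer here)
theta_y = rng.standard_normal((D, L))  # parameters of G_y (label predictor)
theta_d = rng.standard_normal((D, 1))  # parameters of G_d (domain classifier)

def G_f(x, theta):
    """Feature extractor: f = G_f(x; theta_f), f in R^D."""
    return np.tanh(x @ theta)

def G_y(f, theta):
    """Label predictor: class scores over Y = {1, ..., L}."""
    return f @ theta

def G_d(f, theta):
    """Domain classifier: a single source-vs-target logit."""
    return f @ theta

x = rng.standard_normal(8)
f = G_f(x, theta_f)
assert f.shape == (D,)
assert G_y(f, theta_y).shape == (L,)
assert G_d(f, theta_d).shape == (1,)
```

In the actual architecture the gradient reversal layer sits between G_f and G_d, so that minimizing the domain loss w.r.t. θ_d coexists with maximizing it w.r.t. θ_f.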
optimized by stochastic gradient descent within our method:

Ẽ(θ_f, θ_y, θ_d) = Σ_{i=1..N, d_i=0} L_y(G_y(G_f(x_i; θ_f); θ_y), y_i) + Σ_{i=1..N} L_d(G_d(R_λ(G_f(x_i; θ_f)); θ_d), d_i)    (9)

Running the updates (4)–(6) can then be implemented as doing SGD for (9), and leads to the emergence of features that are domain-invariant and discriminative at the same time. After the learning, the label predictor y(x) = G_y(G_f(x; θ_f); θ_y) can be used to predict labels for samples from the target domain (as well as from the source domain).
The simple learning procedure outlined above can be re-derived/generalized along the lines suggested in (Goodfellow et al., 2014) (see the supplementary material (Ganin & Lempitsky, 2015)).

3.3. Relation to H∆H-distance

In this section, we give a brief analysis of our method in terms of the H∆H-distance (Ben-David et al., 2010; Cortes & Mohri, 2011), which is widely used in the theory of non-conservative domain adaptation. Formally,

d_{H∆H}(S, T) = 2 sup_{h_1,h_2 ∈ H} |P_{f∼S}[h_1(f) ≠ h_2(f)] − P_{f∼T}[h_1(f) ≠ h_2(f)]|    (10)

defines a discrepancy distance between two distributions S and T w.r.t. a hypothesis set H. Using this notion, one can obtain a probabilistic bound (Ben-David et al., 2010) on the performance ε_T(h) of some classifier h from T evaluated on the target domain given its performance ε_S(h) on the source domain.
One can easily show that training G_d is closely related to the estimation of d_{H_p∆H_p}(S, T). Indeed,

d_{H_p∆H_p}(S, T) = 2 sup_{h ∈ H_p∆H_p} |P_{f∼S}[h(f) = 1] − P_{f∼T}[h(f) = 1]| ≤

Figure 2. Examples of domain pairs (top: source domain; bottom: target domain) used in the small image experiments: MNIST → MNIST-M, SYN NUMBERS → SVHN, SVHN → MNIST, and SYN SIGNS → GTSRB. See Section 4.1 for details.

Table 2. Accuracy evaluation of different DA approaches on the standard OFFICE (Saenko et al., 2010) dataset. All methods (except SA) are evaluated in the "fully-transductive" protocol (some results are reproduced from (Long & Wang, 2015)). Our method (last row) outperforms the competitors, setting the new state of the art.

METHOD                               | AMAZON → WEBCAM | DSLR → WEBCAM | WEBCAM → DSLR
GFK(PLS, PCA) (Gong et al., 2012)    | .214            | .691          | .650
SA* (Fernando et al., 2013)          | .450            | .648          | .699
DLID (S. Chopra & Gopalan, 2013)     | .519            | .782          | .899
DDC (Tzeng et al., 2014)             | .605            | .948          | .985
DAN (Long & Wang, 2015)              | .645            | .952          | .986
SOURCE ONLY                          | .642            | .961          | .978
PROPOSED APPROACH                    | .730            | .964          | .992
Table 1. Classification accuracies for digit image classifications for different source and target domains. MNIST-M corresponds to
difference-blended digits over non-uniform background. The first row corresponds to the lower performance bound (i.e. if no adaptation
is performed). The last row corresponds to training on the target domain data with known class labels (upper bound on the DA perfor-
mance). For each of the two DA methods (ours and (Fernando et al., 2013)) we show how much of the gap between the lower and the
upper bounds was covered (in brackets). For all five cases, our approach outperforms (Fernando et al., 2013) considerably, and covers a
big portion of the gap.
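The stochastic updates (4)–(6) referenced above belong to a part of Section 3 not included in this excerpt. Reconstructed to be consistent with the objective (9), and assuming per-example losses L_y^i and L_d^i with learning rate μ, they take the form:

```latex
\theta_f \;\leftarrow\; \theta_f - \mu\Big(\frac{\partial L_y^i}{\partial \theta_f} - \lambda\,\frac{\partial L_d^i}{\partial \theta_f}\Big),
\qquad
\theta_y \;\leftarrow\; \theta_y - \mu\,\frac{\partial L_y^i}{\partial \theta_y},
\qquad
\theta_d \;\leftarrow\; \theta_d - \mu\,\frac{\partial L_d^i}{\partial \theta_d}
```

The −λ factor in the θ_f update is exactly what the gradient reversal layer R_λ contributes when plain SGD is run on (9).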
ages popular with deep learning methods, and the OFFICE datasets (Saenko et al., 2010), which are a de facto standard for domain adaptation in computer vision, but have much fewer images.

Baselines. For the bulk of the experiments, the following baselines are evaluated. The source-only model is trained without consideration for target-domain data (no domain classifier branch included in the network). The train-on-target model is trained on the target domain with class labels revealed. This model serves as an upper bound on DA methods, assuming that target data are abundant and the shift between the domains is considerable.
In addition, we compare our approach against the recently proposed unsupervised DA method based on subspace alignment (SA) (Fernando et al., 2013), which is simple to set up and test on new datasets, but has also been shown to perform very well in experimental comparisons with other "shallow" DA methods. To boost the performance of this baseline, we pick its most important free parameter (the number of principal components) from the range {2, . . . , 60}, so that the test performance on the target domain is maximized. To apply SA in our setting, we train a source-only model and then consider the activations of the last hidden layer in the label predictor (before the final linear classifier) as descriptors/features, and learn the mapping between the source and the target domains (Fernando et al., 2013).
Since the SA baseline requires training a new classifier after adapting the features, and in order to put all the compared settings on an equal footing, we retrain the last layer of the label predictor using a standard linear SVM (Fan et al., 2008) for all four considered methods (including ours; the performance on the target domain remains approximately the same after the retraining).
For the OFFICE dataset (Saenko et al., 2010), we directly compare the performance of our full network (feature extractor and label predictor) against recent DA approaches using previously published results.

CNN architectures. In general, we compose the feature extractor from two or three convolutional layers, picking their exact configurations from previous works. We give the exact architectures in the supplementary material (Ganin & Lempitsky, 2015). For the domain classifier, we stick to three fully connected layers (x → 1024 → 1024 → 2), except for MNIST, where we used a simpler (x → 100 → 2) architecture to speed up the experiments. For the loss functions, we set L_y and L_d to be the logistic regression loss and the binomial cross-entropy, respectively.

CNN training procedure. The model is trained on 128-sized batches. Images are preprocessed by mean subtraction. Half of each batch is populated by samples from the source domain (with known labels), the rest is comprised of samples from the target domain (with unknown labels). In order to suppress the noisy signal from the domain classifier at the early stages of the training procedure, instead of fixing the adaptation factor λ we gradually change it from 0 to 1 using the following schedule:

λ_p = 2 / (1 + exp(−γ · p)) − 1,    (14)

where γ was set to 10 in all experiments (the schedule was not optimized/tweaked). Further details on the CNN training can be found in the supplementary material (Ganin & Lempitsky, 2015).

Visualizations. We use the t-SNE (van der Maaten, 2013) projection to visualize feature distributions at different points of the network, while color-coding the domains (Figure 3). We observe a strong correspondence between the success of the adaptation in terms of the classification accuracy for the target domain and the overlap between the domain distributions in such visualizations.

Choosing meta-parameters. Good unsupervised DA
methods should provide ways to set meta-parameters (such as λ, the learning rate, the momentum rate, and the network architecture for our method) in an unsupervised way, i.e. without referring to labeled data in the target domain. In our method, one can assess the performance of the whole system (and the effect of changing hyper-parameters) by observing the test error on the source domain and the domain classifier error. In general, we observed a good correspondence between the success of adaptation and these errors (adaptation is more successful when the source domain test error is low, while the domain classifier error is high). In addition, the layer where the domain discriminator is attached can be picked by computing the difference between means, as suggested in (Tzeng et al., 2014).

4.1. Results

We now discuss the experimental settings and the results. In each case, we train on the source dataset and test on a different target domain dataset, with considerable shifts between domains (see Figure 2). When the MNIST and SVHN datasets are used as targets, standard training-test splits are considered, and all training images are used for unsupervised adaptation. The results are summarized in Table 1 and Table 2.

MNIST → MNIST-M. Our first experiment deals with the MNIST dataset (LeCun et al., 1998) (source). In order to obtain the target domain (MNIST-M), we blend digits from the original set over patches randomly extracted from color photos from BSDS500 (Arbelaez et al., 2011). This operation is formally defined for two images I^1, I^2 as I^out_{ijk} = |I^1_{ijk} − I^2_{ijk}|, where i, j are the coordinates of a pixel and k is a channel index. In other words, an output sample is produced by taking a patch from a photo and inverting its pixels at positions corresponding to the pixels of a digit. For a human, the classification task becomes only slightly harder compared to the original dataset (the digits are still clearly distinguishable), whereas for a CNN trained on MNIST this domain is quite distinct, as the background and the strokes are no longer constant. Consequently, the source-only model performs poorly. Our approach succeeded at aligning the feature distributions (Figure 3), which led to successful adaptation results (considering that the adaptation is unsupervised). At the same time, the improvement over the source-only model achieved by subspace alignment (SA) (Fernando et al., 2013) is quite modest, thus highlighting the difficulty of the adaptation task.

Synthetic numbers → SVHN. To address a common scenario of training on synthetic data and testing on real data, we use the Street-View House Numbers dataset SVHN (Netzer et al., 2011) as the target domain and synthetic digits as the source. The latter (SYN NUMBERS) consists of ≈ 500,000 images generated by ourselves from Windows™ fonts by varying the text (that includes different one-, two-, and three-digit numbers), positioning, orientation, background and stroke colors, and the amount of blur. The degrees of variation were chosen manually to simulate SVHN; however, the two datasets are still rather distinct, the biggest difference being the structured clutter in the background of SVHN images.
The proposed backpropagation-based technique works well, covering almost 80% of the gap between training with source data only and training on target domain data with known target labels. In contrast, SA (Fernando et al., 2013) results in a slight classification accuracy drop (probably due to the information loss during the dimensionality reduction), thus indicating that the adaptation task is even more challenging than in the case of the MNIST experiment.

MNIST ↔ SVHN. In this experiment, we further increase the gap between the distributions, and test on MNIST and SVHN, which are significantly different in appearance. Training on SVHN even without adaptation is challenging: classification error stays high during the first 150 epochs. In order to avoid ending up in a poor local minimum, we therefore do not use learning rate annealing here. Obviously, the two directions (MNIST → SVHN and SVHN → MNIST) are not equally difficult. As SVHN is more diverse, a model trained on SVHN is expected to be more generic and to perform reasonably on the MNIST dataset. This, indeed, turns out to be the case and is supported by the appearance of the feature distributions. We observe quite a strong separation between the domains when we feed them into the CNN trained solely on MNIST, whereas for the SVHN-trained network the features are much more intermixed. This difference probably explains why our method succeeded in improving the performance by adaptation in the SVHN → MNIST scenario (see Table 1) but not in the opposite direction (SA is not able to perform adaptation in this case either). Unsupervised adaptation from MNIST to SVHN gives a failure example for our approach (we are unaware of any unsupervised DA methods capable of performing such adaptation).

Synthetic Signs → GTSRB. Overall, this setting is similar to the SYN NUMBERS → SVHN experiment, except that the distribution of the features is more complex due to the significantly larger number of classes (43 instead of 10). For the source domain we obtained 100,000 synthetic images (which we call SYN SIGNS) simulating various imaging conditions. In the target domain, we use 31,367 random training samples for unsupervised adaptation and the rest for evaluation. Once again, our method achieves a sensible increase in performance, proving its suitability for synthetic-to-real data adaptation.
As an additional experiment, we also evaluate the proposed algorithm for semi-supervised domain adaptation. Here, we reveal 430 labeled examples (10 samples per class) and
Figure 3. The effect of adaptation on the distribution of the extracted features (best viewed in color). The figure shows t-SNE (van der Maaten, 2013) visualizations of the CNN's activations (a) in the case when no adaptation was performed and (b) in the case when our adaptation procedure was incorporated into training. Blue points correspond to the source domain examples, while red ones correspond to the target domain. In all cases, the adaptation in our method makes the two distributions of features much closer. (Panels: MNIST → MNIST-M, top feature extractor layer; SYN NUMBERS → SVHN, last hidden layer of the label predictor.)
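The MNIST-M blending operation defined in Section 4.1, I^out_{ijk} = |I^1_{ijk} − I^2_{ijk}|, can be sketched in NumPy as follows. The toy arrays below are illustrative; the actual MNIST-M construction uses MNIST digit images and patches cropped from BSDS500 photos.

```python
import numpy as np

def blend(digit, patch):
    """MNIST-M style blending: per-pixel, per-channel absolute difference.

    digit, patch: float arrays of identical shape (H, W, C), values in [0, 1].
    Where the digit is black (0) the patch is kept unchanged; where the
    digit is white (1) the patch colors are inverted.
    """
    return np.abs(digit - patch)

# Toy 1x2 single-channel "images": one black and one white digit pixel
digit = np.array([[[0.0], [1.0]]])
patch = np.array([[[0.3], [0.3]]])
out = blend(digit, patch)
assert np.allclose(out.ravel(), [0.3, 0.7])  # background kept, stroke inverted
```

This matches the paper's description: the photo patch is visible in the background, while pixels under the digit strokes are inverted.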
[Figure 4 plot: validation error (y-axis, 0.1 to 0.2) vs. batches seen ·10^5 (x-axis, 0 to 5), with curves Real, Syn, Syn Adapted, Syn + Real, and Syn + Real Adapted.]

Figure 4. Results for the traffic signs classification in the semi-supervised setting. Syn and Real denote available labeled data (100,000 synthetic and 430 real images, respectively); Adapted means that ≈ 31,000 unlabeled target domain images were used for adaptation. The best performance is achieved by employing both the labeled samples and the large unlabeled corpus in the target domain.

add them to the training set for the label predictor. Figure 4 shows the change of the validation error throughout the training. While the graph clearly suggests that our method can be beneficial in the semi-supervised setting, a thorough verification of the semi-supervised setting is left for future work.

Office dataset. We finally evaluate our method on the OFFICE dataset, which is a collection of three distinct domains: AMAZON, DSLR, and WEBCAM. Unlike the previously discussed datasets, OFFICE is rather small-scale, with only 2817 labeled images spread across 31 different categories in the largest domain. The amount of available data is crucial for the successful training of a deep model, hence we opted for fine-tuning the CNN pre-trained on ImageNet (AlexNet from the Caffe package (Jia et al., 2014)), as is done in some recent DA works (Donahue et al., 2014; Tzeng et al., 2014; Hoffman et al., 2013; Long & Wang, 2015). We make our approach more comparable with (Tzeng et al., 2014) by using exactly the same network architecture, replacing the domain mean-based regularization with the domain classifier.
Following previous works, we assess the performance of our method across the three transfer tasks most commonly used for evaluation. Our training protocol is adopted from (Gong et al., 2013; S. Chopra & Gopalan, 2013; Long & Wang, 2015), as during adaptation we use all available labeled source examples and unlabeled target examples (the premise of our method is the abundance of unlabeled data in the target domain). Also, the whole source domain is used for training. Under this "fully-transductive" setting, our method is able to improve the previously-reported state-of-the-art accuracy for unsupervised adaptation very considerably (Table 2), especially in the most challenging AMAZON → WEBCAM scenario (the two domains with the largest domain shift).
Interestingly, in all three experiments we observe a slight over-fitting as training progresses; however, it does not ruin the validation accuracy. Moreover, switching off the domain classifier branch makes this effect far more apparent, from which we conclude that our technique serves as a regularizer.

5. Discussion

We have proposed a new approach to unsupervised domain adaptation of deep feed-forward architectures, which allows large-scale training based on large amounts of annotated data in the source domain and large amounts of unannotated data in the target domain. Similarly to many previous shallow and deep DA techniques, the adaptation is achieved through aligning the distributions of features across the two domains. However, unlike previous approaches, the alignment is accomplished through standard backpropagation training. The approach is therefore rather scalable, and can be implemented using any deep learning package. To this end, we release the source code for the gradient reversal layer along with usage examples as an extension to Caffe (Jia et al., 2014) (see (Ganin & Lempitsky, 2015)). Further evaluation on larger-scale tasks and in semi-supervised settings constitutes future work.

Acknowledgements. This research is supported by the Russian Ministry of Science and Education grant RFMEFI57914X0071. We also thank the Graphics & Media Lab, Moscow State University for providing the synthetic road signs data.
Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition, 2014.

Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Fernando, Basura, Habrard, Amaury, Sebban, Marc, and Tuytelaars, Tinne. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.

Ganin, Yaroslav and Lempitsky, Victor. Project website. [Link] projects/grl/, 2015. [Online; accessed 25-May-2015].

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Liebelt, Joerg and Schmid, Cordelia. Multi-view object class detection with a 3d geometric model. In CVPR, 2010.

Long, Mingsheng and Wang, Jianmin. Learning transferable features with deep adaptation networks. CoRR, abs/1502.02791, 2015.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.