2020 17th Conference on Computer and Robot Vision (CRV)

Domain Generalization via Universal Non-volume Preserving Approach

Dat T. Truong1,3,4 , Chi Nhan Duong2 , Khoa Luu1 , Minh-Triet Tran3,4 , Ngan Le1
1 University of Arkansas, USA
2 Concordia University, Canada
3 University of Science, Ho Chi Minh city, Vietnam
4 Vietnam National University, Ho Chi Minh city, Vietnam
{tt032, khoaluu, thile}@[Link], dcnhan@[Link], tmtriet@[Link]

Abstract—Recognition across domains has recently become an active topic in the research community. However, the problem of recognition in new unseen domains has been largely overlooked. Under this condition, the delivered deep network models are unable to be updated, adapted, or fine-tuned. Therefore, recent deep learning techniques, such as domain adaptation, feature transferring, and fine-tuning, cannot be applied. This paper presents a novel approach to the problem of domain generalization in the context of deep learning. The proposed method¹ is evaluated on different datasets in various problems, i.e. (i) digit recognition on MNIST, SVHN, and MNIST-M, (ii) face recognition on Extended Yale-B, CMU-PIE and CMU-MPIE, and (iii) pedestrian recognition on RGB and Thermal image datasets. The experimental results show that our proposed method consistently improves performance accuracy. It can also be easily incorporated with any other CNN framework within an end-to-end deep network design for object detection and recognition problems to improve their performance.

¹Source code will be publicly available.

Figure 1. Comparison between domain adaptation (A) and our proposed domain generalization (B) problems.

I. INTRODUCTION

Deep learning-based detection and recognition studies have recently been achieving very accurate performance in visual applications. However, many such methods assume that the testing images come from the same distribution as the training ones and often fail when performing in new unseen domains. Indeed, detection and classification across domains have recently become active topics in the research communities. In particular, domain adaptation [1] [2] has received significant attention in computer vision. In domain adaptation (Fig. 1(A)), we usually have a large-scale training set with labels, i.e., the source domain A, and a small training set with or without labels, i.e., the target domain B. The knowledge from the source domain A is learned and adapted to the target domain B. During testing time, the trained model is deployed only in the target domain B. Recent results in domain adaptation have shown significant improvement in many computer vision applications. However, in real-world applications, the trained models are potentially deployed not only in the target domain B but also in many other new unseen domains, e.g., C, D, etc. (Fig. 1(B)). In these scenarios, the released deep network models are usually unable to be retrained or fine-tuned with the inputs in new unseen domains or environments, as illustrated in Fig. 2. Thus, domain adaptation cannot be applied in these problems since the new unseen target domains are unavailable.

Besides, there are some prior works that achieve high recognition accuracy by presenting new loss functions [3] [4] or enlarging deep network structures [5] via mining hard samples in training sets. These loss functions are deployed to deal with hard samples considered as unseen domains. However, these methods are limited in their ability to generalize to new unseen domains in real-world applications. Some real-world problems are unable to observe training samples from new unseen domains during the training process. Therefore, in the scope of this work, there is no assumption about the new unseen domains. Our proposed method can be incorporated with Convolutional Neural Network (CNN)-based detection and classification methods and trained within an end-to-end deep learning framework to improve their performance.

A. Contributions of this Work

This work presents a novel domain generalization approach to learn to better generalize to new unseen domains. A restrictive setting is considered in this work where only a single source domain is available for training. Table I summarizes the differences between our approach and the prior works. Our contributions can be summarized as follows.

A novel approach named Universal Non-volume Preserving (UNVP) and its extension named Extended Universal

978-1-7281-9891-0/20/$31.00 ©2020 IEEE


DOI 10.1109/CRV50864.2020.00021

Authorized licensed use limited to: UNIVERSITY OF ROCHESTER. Downloaded on July 27,2020 at [Link] UTC from IEEE Xplore. Restrictions apply.

Figure 2. The ideas of domain generalization. The deep model is trained only in a single domain (A), i.e. RGB images. It is deployed in other unseen domains, i.e. thermal images (B) and infrared images (C).

Figure 3. Illustration of the proposed UNVP method. The traditional classifier fails to model new samples in unseen domains (top). Meanwhile, UNVP consistently maintains the feature distribution in each class while searching for a new shifting domain (bottom).


Non-volume Preserving (E-UNVP) frameworks are firstly introduced to generalize environments of new unseen domains from a given single-source training domain. Secondly, the environmental features extracted from the environment modeling via Deep Generative Flows (DGF) and the discriminative features extracted from the deep network classifiers are then unified together to provide final generalized deep features that are robustly discriminative in new unseen domains. Our approach is designed within an end-to-end deep learning framework and inherits the power of CNNs. It can be quickly integrated end-to-end with a CNN-based deep network design for object detection or recognition to improve performance. Finally, the proposed method is evaluated on various visual modalities and applications with consistently improved performance.

II. RELATED WORK

Domain Adaptation has recently become one of the most popular research topics in the field [1] [6] [7] [8] [2]. Ganin et al. [1] proposed to incorporate both classification and domain adaptation in a unified network so that both tasks can be learned together. Similarly, Tzeng et al. [2] later introduced a unified framework for Unsupervised Domain Adaptation based on adversarial learning objectives (ADDA). It uses a loss function in a discriminator that is solely dependent on its target distribution. Liu et al. [9] presented the Coupled Generative Adversarial Network (CoGAN) for learning a joint distribution of multi-domain images, which is then applied to domain adaptation.

Domain Generalization aims to learn a classification model from a single-source domain and generalize that knowledge to robustly achieve high performance in unseen target domains. To learn a domain-invariant feature representation, M. Ghifary et al. [17] used multi-view autoencoders to perform cross-domain reconstructions. Later, [18] introduced MMD-AAE to learn a feature representation by jointly optimizing a multi-domain autoencoder regularized via the Maximum Mean Discrepancy (MMD) distance. Recently, K. Muandet et al. [19] presented a kernel-based algorithm for minimizing the differences in the marginal distributions of multiple domains, whereas Y. Li [20] proposed an end-to-end conditional invariant deep domain generalization approach by leveraging deep neural networks for domain-invariant representation learning. To address the problem of unseen domains, R. Volpi et al. presented Adversarial Data Augmentation (ADA) [16] to generalize to unseen domains.

III. THE PROPOSED METHOD

Unlike previous augmentation methods that try to generate new samples in image space using prior knowledge, with the hope that these samples can cover unseen domains, our approach focuses on modeling the environment density as multiple Gaussian distributions in a deep feature space and uses this knowledge for the generalization process. In this way, new samples are automatically synthesized with more semantic meaning while consistently maintaining the feature structures (see Fig. 3). Thus, without the need to see samples in target domains, the method is still able to handle the domain shift effectively and robustly achieves high performance in these unseen domains.

In particular, the proposed UNVP and E-UNVP approaches present a new tractable CNN deep network to extract the deep features of samples in the source environment and formulate their probability densities as multiple Gaussian distributions (Fig. 3). From these learned distributions, a density-based augmentation approach is employed to expand the data distributions of the source environment to generalize to different unseen domains. This architecture design allows unifying deep feature modeling and distribution modeling within an end-to-end framework.

The proposed framework consists of two main streams: (1) discriminative feature modeling with a deep network classifier; and (2) Deep Generative Flows to model the domain variations in the form of distributions. They together go through an end-to-end learning process that alternately minimizes the within-class distributions and
Table I: Comparison of properties between our proposed approaches (UNVP and E-UNVP) and other recent methods, where ✗ represents a not-applicable property and ✓ an applicable one. Gaussian Mixture Model (GMM), Probabilistic Graphical Model (PGM), Convolutional Neural Network (CNN), Adversarial Loss (ℓ_adv), Log-Likelihood Loss (ℓ_LL), Cycle-Consistency Loss (ℓ_cyc), Discrepancy Loss (ℓ_dis), and Cross-Entropy Loss (ℓ_CE).

Method          Domain Modality    Architecture   Loss Function   End-to-End   Target-domain sample-free   Target-domain label-free   Deployable Domains
FT [10]         Transfer Learning  CNN            ℓ_2             ✓            ✗                           ✗                          Two
UBM [11]        Adaptation         GMM            ℓ_LL            ✗            ✗                           ✓                          Any
DANN [1]        Adaptation         CNN            ℓ_adv           ✓            ✗                           ✓                          Two
CoGAN [9]       Adaptation         CNN+GAN        ℓ_adv           ✓            ✗                           ✓                          Two
I2IAdapt [12]   Adaptation         CNN+GAN        ℓ_adv + ℓ_cyc   ✓            ✗                           ✓                          Two
ADDA [13]       Adaptation         CNN+GAN        ℓ_adv           ✓            ✗                           ✓                          Two
MCD [14]        Adaptation         CNN+GAN        ℓ_adv + ℓ_dis   ✓            ✗                           ✓                          Two
CrossGrad [15]  Generalization     Bayesian Net   ℓ_CE            ✓            ✓                           ✓                          Any
ADA [16]        Generalization     CNN            ℓ_CE            ✓            ✓                           ✓                          Any
Our UNVP        Generalization     PGM+CNN        ℓ_LL + ℓ_CE     ✓            ✓                           ✓                          Any
Our E-UNVP      Generalization     PGM+CNN        ℓ_LL + ℓ_CE     ✓            ✓                           ✓                          Any

synthesizing new useful samples to generalize to new unseen domains. Notice that our proposed framework does not require the presence of samples in the target domains during the training process.

A. Domain Variation Modeling as Distributions

This section aims at learning a Deep Generative Flow model, i.e. a function F, that maps an image x in image space I to its latent representation z in latent domain Z such that the density function p_X(x) can be estimated via the probability density function p_Z(z). Then, via F, rather than representing the environment variation, i.e. p_X(x), directly in the image space, it can be easily modeled via variables in latent space, i.e. p_Z(z), in a more semantic manner. When p_Z(z) follows prior distributions, all samples in the given domain can be effectively modeled in the form of latent distributions.

Structure and Variable Relationship. Let x ∈ I be a data sample in image domain I, y be its corresponding class label, and z = F(x, y; θ), where θ denotes the parameters of F. The probability density function of x can be formulated via the change-of-variable formula as follows:

p_X(x, y; θ) = p_Z(z, y; θ) |∂F(z, y; θ)/∂x|    (1)

where p_X(x, y; θ) and p_Z(z, y; θ) define the distributions of samples of class y in the image and latent domains, respectively, and |∂F(z, y; θ)/∂x| denotes the determinant of the Jacobian matrix with respect to x. Then the log-likelihood is computed by:

log p_X(x, y; θ) = log p_Z(z, y; θ) + log |∂F(z, y; θ)/∂x|    (2)

Eqns. (1) and (2) provide two facts: (1) learning the density function of samples in class y is equivalent to estimating the density of its latent representation z and the determinant of the associated Jacobian matrix ∂F/∂x; and (2) if the latent distribution p_Z is defined as a Gaussian distribution, the learned function F explicitly becomes the mapping function from a real data distribution to a Gaussian distribution in latent space. Then, we can model the environment variation via deviations from the Gaussian distributions of all classes in a latent domain. When F is well-defined with tractable computation of its Jacobian determinant, the two-way connection, i.e., inference and generation, exists for x and z.
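The change-of-variables relationship in Eqns. (1) and (2) can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the flow is a toy elementwise affine map, the latent prior is a standard Gaussian, and all names (`flow_forward`, `log_likelihood`) are ours. The log-likelihood then splits into a prior term plus the log-determinant of the (diagonal) Jacobian.

```python
import numpy as np

def gaussian_logpdf(z):
    """Standard-Gaussian log-density, summed over dimensions."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * z ** 2))

def flow_forward(x, scale, shift):
    """Toy invertible map z = scale * x + shift (elementwise, scale > 0)."""
    return scale * x + shift

def log_likelihood(x, scale, shift):
    """log p_X(x) = log p_Z(F(x)) + log|det dF/dx|, as in Eqn. (2)."""
    z = flow_forward(x, scale, shift)
    log_det = float(np.sum(np.log(scale)))  # elementwise map -> diagonal Jacobian
    return gaussian_logpdf(z) + log_det

x = np.array([0.5, -1.0, 2.0])
print(log_likelihood(x, np.array([2.0, 0.5, 1.0]), np.array([0.1, 0.0, -0.3])))
```

In the paper, F is a deep flow rather than an affine map, but the same two-term decomposition of the log-likelihood is what makes the density tractable.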
The prior class distributions. Motivated by these properties, given C classes, we choose C Gaussian distributions with different means {μ_1, μ_2, .., μ_C} and covariances {Σ_1, Σ_2, ..., Σ_C} as prior distributions for these classes, i.e. z_c ∼ N(μ_c, Σ_c). It is worth noting that even though we choose Gaussian distributions, our framework is not limited to this choice and can adopt other distribution types.

Mapping function structure. To enforce the information flow from the image domain to the latent space with different abstraction levels, we formulate the mapping function F as a composition of several sub-functions f_i as follows:

F = f_1 ∘ f_2 ∘ ... ∘ f_N    (3)

where N is the number of sub-functions. The Jacobian ∂F/∂x can be derived by ∂F/∂x = ∂f_1/∂x · ∂f_2/∂f_1 ··· ∂f_N/∂f_{N−1}. With this structure, the properties of each f_i define the properties of the whole mapping function F. For example, if the Jacobian ∂f_i/∂x is tractable, then F is also tractable. Furthermore, if f_i is a non-linear function built from a composition of CNN layers, then F becomes a deep convolutional neural network. There are several ways to construct the sub-functions, i.e. different CNN structures with the non-linearity property:

f(x) = b ⊙ x + (1 − b) ⊙ [x ⊙ exp(S(b ⊙ x)) + T(b ⊙ x)]    (4)

where b = [1, ..., 1, 0, ..., 0] is a binary mask and ⊙ is the Hadamard product. S and T define the scale and translation functions during the mapping process.
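Eqn. (4) is the affine coupling transform used by non-volume-preserving flows. The sketch below uses tiny random linear maps as stand-ins for the S and T networks (the paper uses residual networks) to show the two properties the text relies on: the masked half passes through unchanged, which makes the map exactly invertible, and the Jacobian is triangular with log-determinant equal to the sum of S over the unmasked dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
b = np.array([1.0, 1.0, 0.0, 0.0])            # binary mask from Eqn. (4)
Ws = 0.1 * rng.standard_normal((D, D))        # stand-in parameters for S
Wt = 0.1 * rng.standard_normal((D, D))        # stand-in parameters for T

def S(v):
    return np.tanh(v @ Ws)                    # scale function (bounded for stability)

def T(v):
    return v @ Wt                             # translation function

def coupling_forward(x):
    xm = b * x                                # masked half passes through unchanged
    return xm + (1.0 - b) * (x * np.exp(S(xm)) + T(xm))

def coupling_inverse(y):
    ym = b * y                                # masked half of y equals masked half of x
    return ym + (1.0 - b) * ((y - T(ym)) * np.exp(-S(ym)))

x = rng.standard_normal(D)
y = coupling_forward(x)
print(np.max(np.abs(x - coupling_inverse(y))))  # ~0: the map is exactly invertible
```

Because S and T only ever see the masked half, the inverse can recompute them from y without inverting any network, which is what keeps both directions of the flow cheap.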
Learning the mapping function and Environment Modeling. To learn the parameters θ of the mapping function F, the
log-likelihood in Eqn. (2) is maximized as follows:

θ* = arg max_θ Σ_c Σ_i log p_X(x_i, c; θ)    (5)

Notice that after learning the mapping function, all images of all classes are mapped into the corresponding distributions of their classes. Then the environment density can be considered as the composition of these distributions. Figure 4(A) illustrates an example of the learned environment distributions of MNIST with 10 digit classes. When only samples in MNIST are used for training, the density distributions of MNIST-M, i.e., unseen during training, using Pure-CNN, our UNVP, and our E-UNVP are shown in Fig. 4 (B, C, D), respectively. In the next section, a generalization approach is proposed so that, using only samples in a source environment, the learned model can expand the density distributions of the source environment to cover as much as possible the distributions of unseen environments.

Figure 4. The distributions: (A) MNIST. (B) MNIST-M using a Pure-CNN trained on MNIST. (C) MNIST-M using our UNVP trained on MNIST. (D) MNIST-M using our E-UNVP trained on MNIST.

B. Unseen Domain Generalization

After modeling the source environment variation as the composition of its class distributions, this section introduces the generalization process of these distributions with respect to a classification model M such that the expansion of these distributions can help M generalize to unseen environments with high accuracy. Notice that M can be any type of deep CNN, such as LeNet [21], AlexNet [22], VGG [23], ResNet [5], or DenseNet [24].

Let ℓ(X, Y; M, F, θ, θ_1) be the training loss function of M, and θ_1 be the parameters of M. The generalization process of M can be formulated as updating the parameters θ_1 such that it can robustly classify samples whose latent distributions are a distance ρ away from the samples in the source environment. Then, the objective function of M is formulated as:

arg min_{θ_1} sup_{P: d(P_X, P_X^src) ≤ ρ} E[ℓ(X, Y; M, F, θ, θ_1)]    (6)

where {X, Y} denotes the images and their labels; d(·, ·) is the distance between probability distributions; and P_X^src(X, Y) and P_X(X, Y) are the density distributions of the source and current expanded environments, respectively.

Since both P_X^src and P_X are density distributions, the Wasserstein distance with respect to P_X^src and P_X can be adopted. Notice that from the previous section, we have learned a mapping function F that maps the density functions from image space, i.e. P_X, to prior distributions in latent space, i.e. P_Z. Moreover, since F is invertible with the specific formula of its sub-functions, computing d(P_X, P_X^src) is equivalent to computing d(P_Z, P_Z^src). From this, we can efficiently estimate the cost as the transformation cost between Gaussian distributions. Then d(P_X, P_X^src) is reformulated by:

d(P_X, P_X^src) = d(P_Z, P_Z^src)
  = inf Σ_c E_{x_c, x_c^src}[cost(F(x_c), F(x_c^src))]
  = inf Σ_c E_{z_c, z_c^src}[cost(z_c, z_c^src)]    (7)

where cost(·, ·) denotes the transformation cost between Gaussian distributions:

Σ_c cost(z_c, z_c^src) = Σ_c ||μ_c^src − μ_c||_2^2 + Tr(Σ_c^src + Σ_c − 2((Σ_c^src)^{1/2} Σ_c (Σ_c^src)^{1/2})^{1/2})    (8)

where {μ_c^src, Σ_c^src} and {μ_c, Σ_c} are the means and covariances of the distributions of class c in the source and the expanded environments, respectively.
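The per-class cost in Eqn. (8) is the closed-form squared Wasserstein-2 distance between two Gaussians. The following sketch makes the simplifying assumption of diagonal covariances, so the matrix square roots reduce to elementwise square roots; the paper's formula is the general full-covariance case, and the function names are ours.

```python
import numpy as np

def w2_gaussian_diag(mu_src, var_src, mu, var):
    """Squared W2 distance between N(mu_src, diag(var_src)) and N(mu, diag(var))."""
    mu_src, var_src = np.asarray(mu_src, float), np.asarray(var_src, float)
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    mean_term = float(np.sum((mu_src - mu) ** 2))
    # Tr(S_src + S - 2 (S_src^{1/2} S S_src^{1/2})^{1/2}) with diagonal matrices
    cov_term = float(np.sum(var_src + var - 2.0 * np.sqrt(var_src * var)))
    return mean_term + cov_term

def environment_distance(params_src, params):
    """Sum of per-class costs, as in Eqns. (7)-(8); params: {class: (mu, var)}."""
    return sum(w2_gaussian_diag(*params_src[c], *params[c]) for c in params_src)

# identical class distributions incur zero transformation cost
print(w2_gaussian_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
```

Working in latent space is what makes this cheap: each class is summarized by a mean and covariance, so the distance between whole environments reduces to a handful of closed-form terms rather than a comparison of image densities.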
Plugging this distance in and applying the Lagrangian relaxation to Eqn. (6), we have:

arg min_{θ_1} sup_P E[ℓ(X, Y; M, F, θ, θ_1)] − α · d(P_X, P_X^src)
  = arg min_{θ_1} sup_x Σ_c {ℓ(x, c; M, F, θ, θ_1) − α · cost(F(x), F(x_c^src))}

To solve this objective function, the optimization process can be divided into two alternating steps: (1) generate the sample x for each class such that

x = arg max_x {ℓ(x, c; M, F, θ, θ_1) − α · cost(F(x), F(x_c^src))}    (9)

and consider x as a new "hard" example for class c; and (2) add x to the training data and optimize the model M. In other words, this two-step optimization process aims at finding new samples belonging to distributions that are a distance ρ away from the distributions of the source environment, and at making M become more robust when classifying these examples. In this way, after a certain number of iterations, the distributions learned from M can be generalized so that they cover as much as possible the distributions of new unseen environments.
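The maximization step of Eqn. (9) can be sketched as gradient ascent from a source sample on (classification loss) − α · (latent cost back to the source). Everything below is a toy stand-in of ours, not the paper's models: the "loss" is a quadratic pushing away from the class mean, the flow is the identity, and the gradients are written in closed form.

```python
import numpy as np

def perturb_to_hard_example(x_src, mu_c, alpha=2.0, lr=0.1, steps=200):
    """Gradient ascent on  loss(x) - alpha * cost(x, x_src), a toy Eqn. (9)."""
    x = np.asarray(x_src, dtype=float).copy()
    mu_c = np.asarray(mu_c, dtype=float)
    for _ in range(steps):
        grad_loss = 2.0 * (x - mu_c)          # ascent on ||x - mu_c||^2: push away from the class mean
        grad_cost = 2.0 * (x - x_src)         # the cost term anchors x near the source sample
        x = x + lr * (grad_loss - alpha * grad_cost)
    return x

x_src = np.array([1.0, 1.0])
mu_c = np.array([0.0, 0.0])
x_hard = perturb_to_hard_example(x_src, mu_c)
print(x_hard)  # pushed away from the class mean while anchored near the source
```

With alpha > 1 the anchoring term dominates, so the ascent converges to a sample that is harder for the classifier yet still tied to the source distribution, which is the balance the ρ-ball constraint in Eqn. (6) is meant to enforce.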
C. Universal Non-volume Preserving (UNVP) Models

The proposed UNVP consists of two main branches: (1) Discriminative Feature Modeling and (2) Generative

Distribution Modeling. While the discriminative part focuses on constructing a classifier that minimizes within-class distributions, the generative one aims at embedding samples of all classes into their corresponding latent distributions and then expanding these distributions for generalization.

Figure 5. The training process of our proposed UNVP. It consists of one pre-training step and a two-stage optimization that alternately minimizes the within-class distributions and synthesizes new samples for generalization.

Fig. 5 illustrates the whole end-to-end joint training process for UNVP, where the generative part, i.e., the Deep Generative Flow F, is first employed to learn the mapping from image space to Gaussian distributions in latent space. Then a two-stage training process is adopted to learn the deep classifier M and adjust the Deep Generative Flow F for generalization.

In the first stage of this process, given a training dataset, both the parameters {θ, θ_1} of the mapping function F and the classifier M are updated according to the loss function:

ℓ(X, Y; M, F, θ, θ_1) = ℓ_CE(M(X; θ_1), Y) − log p_X(X, Y; θ)

where the first term is the cross-entropy loss for M and the second term is the log-likelihood of F.
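The first-stage loss can be sketched as one scalar combining the classifier's cross-entropy with the flow's negative log-likelihood. The pieces below are toy stand-ins of ours (a plain softmax cross-entropy and a Gaussian latent log-density with the log-determinant omitted), only to show how the two objectives sum into a single quantity that one optimizer can minimize.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy of a softmax classifier (the CE term)."""
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def flow_nll(z):
    """Negative Gaussian log-likelihood of latent codes (log-det omitted in this toy)."""
    return float(np.mean(0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=1)))

def stage1_loss(logits, labels, z):
    """CE(M(X), Y) - log p_X(X, Y), as in the first-stage loss above."""
    return cross_entropy(logits, labels) + flow_nll(z)

logits = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 1])
z = np.zeros((2, 4))
print(stage1_loss(logits, labels, z))
```

In the full model, the same backpropagation pass updates θ through the likelihood term and θ_1 through the cross-entropy term, which is what makes the design end-to-end.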
In the second stage, we adopt the generalization process presented in Sec. III-B and Eqn. (9) to synthesize new "hard" samples. Notice that, to further constrain the perturbation in latent space, we incorporate another regularization term into Eqn. (7):

Σ_c cost(z_c, z_c^src) = ||μ_c^src − μ_c||_2^2 + Tr(Σ_c^src + Σ_c − 2((Σ_c^src)^{1/2} Σ_c (Σ_c^src)^{1/2})^{1/2}) + ||M(X_c) − M(X_c^src)||_2^2

The newly generated samples are then added to the training set and used for updating both branches of UNVP.
Notice that in the structure of F, the choice of Gaussian distributions for all classes plays an important role and directly affects the performance of the generative model. By varying the choices of these distributions, different variants of UNVP can be introduced.

Universal Non-volume Preserving Models (UNVP): The means and covariances of the Gaussian distributions are pre-defined for all C classes, where μ_c = 1c, Σ_c = I, and z_c ∼ N(μ_c, I), with 1 being the all-one vector.

Extended Universal Non-volume Preserving Models (E-UNVP): Rather than fixing the means and covariances of the Gaussian distributions of the C classes, we consider them as variables that are flexibly learned during the environment modeling and adjusted during domain generalization. Particularly, given the class label c, F maps each sample x_c to a Gaussian distribution with mean and covariance:

μ_c = γ G_m(c) + λ H_m(n),   Σ_c = G_std(c)    (10)

where G_m(c) and G_std(c) denote the learnable functions that map label c to the mean and covariance values of its Gaussian distribution. n is a noise signal generated following the normal distribution. H_m(n) defines the allowable shifting range of the Gaussian given the noise signal n. γ and λ are user-defined parameters that control the separation of the Gaussian distributions between different classes and the contribution of H_m(n) to μ_c. We choose a fully-connected structure for G_m(c) and G_std(c), which take the input c in the form of a one-hot vector, while a convolutional layer is adopted for H_m(n).
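The E-UNVP prior of Eqn. (10) can be sketched as follows. The weights below are random linear maps standing in for the fully-connected G_m, G_std and the convolutional H_m described above; the diagonal covariance and all variable names are our simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 10, 8                                   # number of classes, latent dimension
Wm = rng.standard_normal((C, D))               # stand-in weights for G_m
Wstd = rng.random((C, D)) + 0.5                # stand-in weights for G_std (positive)
Wn = rng.standard_normal((D, D))               # stand-in weights for H_m

def one_hot(c):
    v = np.zeros(C)
    v[c] = 1.0
    return v

def class_prior(c, gamma=3.0, lam=0.1):
    """mu_c = gamma * G_m(c) + lam * H_m(n), Sigma_c = G_std(c), as in Eqn. (10)."""
    n = rng.standard_normal(D)                 # noise signal n ~ N(0, I)
    mu = gamma * (one_hot(c) @ Wm) + lam * np.tanh(n @ Wn)  # bounded shift H_m(n)
    var = one_hot(c) @ Wstd                    # diagonal covariance, kept positive
    return mu, var

mu0, var0 = class_prior(0)
print(mu0.shape, var0.shape)
```

A large γ relative to λ keeps the class means well separated while the bounded noise term only jitters each mean inside a small neighborhood, matching the role of the two hyper-parameters in the text.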
with the resolution higher than 128×128, six residual blocks
New generated samples are then added to the training set
are set for S and T .
and used for updating both branches of UNVP.
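The two Glow-style operators [25] mentioned above can be sketched in a few lines; this is an illustrative toy of ours, not the paper's implementation. Actnorm is a per-channel affine map with a data-dependent initialization, and the invertible 1 × 1 convolution is a learned channel-mixing matrix W applied at every spatial position. Both are invertible with cheap log-determinants, which is what lets them sit inside each sub-function f_i of the flow.

```python
import numpy as np

rng = np.random.default_rng(0)
Cch = 3                                                 # channels
W = np.linalg.qr(rng.standard_normal((Cch, Cch)))[0]    # orthogonal -> invertible

def actnorm_init(x):
    """Data-dependent init: per-channel scale/bias normalizing the first batch."""
    mean = x.mean(axis=(0, 2, 3))
    std = x.std(axis=(0, 2, 3)) + 1e-6
    return 1.0 / std, -mean / std

def actnorm(x, s, b):
    """Per-channel affine map on an NCHW tensor."""
    return x * np.reshape(s, (1, -1, 1, 1)) + np.reshape(b, (1, -1, 1, 1))

def inv_conv_1x1(x, W):
    """Apply W across the channel axis at every pixel (NCHW layout)."""
    return np.einsum('ij,njhw->nihw', W, x)

x = rng.standard_normal((2, Cch, 4, 4))
s, b = actnorm_init(x)
y = inv_conv_1x1(actnorm(x, s, b), W)                    # forward pass of one f_i prefix
x_rec = actnorm(inv_conv_1x1(y, np.linalg.inv(W)), 1.0 / s, -b / s)
print(np.max(np.abs(x - x_rec)))                         # exact inverse up to float error
```

Initializing W as an orthogonal matrix guarantees invertibility at the start of training; the log-determinants (sum of log|s| per pixel for actnorm, H·W·log|det W| for the convolution) would feed directly into the likelihood term of Eqn. (2).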

Table II: Ablative experiment results (%) on the effectiveness of the parameters λ, α, and β, which control the distribution separation and shifting range. MNIST is used as the only training set; MNIST-M is used as the unseen testing set.

                           λ                        α                        β (%)
Dataset    Methods    0.01   0.1    1.0    |  0.01   0.1    1.0    |  0%     1%     10%    20%    30%
MNIST      Pure-CNN   99.28 (baseline)
           UNVP       −      −      −      |  99.33  99.18  99.30  |  99.28  99.28  99.35  99.30  99.36
           E-UNVP     99.22  99.42  99.40  |  99.13  99.31  99.42  |  99.28  99.36  99.34  99.42  99.43
MNIST-M    Pure-CNN   55.90 (baseline)
           UNVP       −      −      −      |  58.18  60.76  59.44  |  55.90  59.99  57.24  59.44  55.11
           E-UNVP     59.83  60.49  59.47  |  56.92  61.70  60.49  |  55.90  57.10  60.49  61.70  60.49

V. EXPERIMENTS

This section first shows the effectiveness of our proposed methods with comprehensive ablative experiments. In these experiments, we use MNIST as the only training set and MNIST-M as the unseen testing set. The proposed approaches are also benchmarked on various deep network structures, i.e. LeNet [21], AlexNet [22], VGG [23], ResNet [5], and DenseNet [24]. Using the final optimal model, we show in the next subsection that our approaches consistently achieve state-of-the-art results in digit recognition on three digit datasets, i.e., MNIST, SVHN [26], and MNIST-M. Then, we show the results of our proposed approaches in face recognition on three databases, i.e. Extended Yale-B [27], CMU-PIE [28], and CMU-MPIE [29]. We use facial images with normal illumination as the training domain and the ones in dark illumination conditions as the testing set in the new unseen domains. Finally, we show the advantages of UNVP and E-UNVP in cross-domain pedestrian recognition on the Thermal Database.

A. Ablation Study

This experiment aims to measure the effectiveness of the domain generalization and perturbation processes. It uses MNIST as the only training set and MNIST-M as the testing one. To simplify the experiment, LeNet [21] is used as the classifier, i.e., Pure-CNN. For the network hyper-parameters, we set the learning rate and the batch size to 0.0001 and 128, respectively.

Hyper-parameter Settings. In the GLOW learning process, the multiple Gaussian distributions are handled via the set of scale parameters, i.e., γ and λ, that control the distribution separation and shifting range as in Eqn. (10). The contributions of the generalization process are also evaluated with various percentages of "hard" generated samples (β), i.e., from 0% to 30%. When β = 0%, no new samples are generated. There are two phases alternately updated in the training process: (1) a minimization phase to optimize the networks and (2) a maximization (perturb) phase to generate new hard examples. We run the maximization phase K times; each time, we randomly select β percent of the training images to generate new hard samples via the deep generative models. Specifically, our maximization phase generates new images based on both the semantic features from the CNN classifier and the semantic space via the estimation of the environment density. The experimental results in Table II show that the proposed approaches consistently help to improve the classifiers.
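The alternating schedule described above can be sketched as a short loop: K rounds, each running a minimization phase and then a maximization phase that perturbs a randomly chosen β fraction of the training set. `minimize_step` and `perturb` are placeholders of ours for the real classifier/flow update and the Eqn. (9) ascent.

```python
import numpy as np

def perturb(x):
    """Placeholder for the Eqn. (9) maximization on one sample."""
    return x + 0.1

def minimize_step(data):
    """Placeholder for one training pass over the current (expanded) set."""
    return len(data)

def train(data, K=3, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    data = list(data)
    for _ in range(K):
        minimize_step(data)                        # (1) minimization phase
        n_hard = int(beta * len(data))             # (2) maximization (perturb) phase
        idx = rng.choice(len(data), size=n_hard, replace=False)
        data.extend(perturb(data[i]) for i in idx)
    return data

out = train([float(i) for i in range(100)], K=3, beta=0.1)
print(len(out))
```

Because the hard samples are appended to the pool, later rounds draw their β fraction from an already-expanded set, so the effective coverage of the source distribution grows with K.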
Sample Distributions in Unseen Domains. The sample class distributions with the optimal parameter set are visualized in Fig. 4. While Pure-CNN obviously fails to model the unseen MNIST-M domain, our UNVP successfully shifts the domain and covers the unseen dataset. These sample distributions become completely class-separated when using our E-UNVP.

Table III: Experimental results (%) when using UNVP and E-UNVP with various common CNNs.

Networks    Methods    MNIST   MNIST-M
LeNet       Pure-CNN   99.06   55.90
            UNVP       99.30   59.44
            E-UNVP     99.42   61.70
AlexNet     Pure-CNN   99.17   40.12
            UNVP       98.81   39.94
            E-UNVP     98.89   40.60
VGG         Pure-CNN   99.43   50.67
            UNVP       99.42   54.41
            E-UNVP     99.40   51.37
ResNet      Pure-CNN   98.01   35.35
            UNVP       98.82   37.15
            E-UNVP     98.97   40.60
DenseNet    Pure-CNN   99.23   41.16
            UNVP       99.42   41.98
            E-UNVP     99.14   43.72

Figure 6. Examples in the (A) MNIST, (B) MNIST-M, and (C) SVHN databases.

Backbone Deep Networks. This section evaluates the robustness and the consistent improvements of UNVP and E-UNVP with common deep networks, including LeNet, AlexNet, VGG, ResNet, and DenseNet, as shown in Table III. The proposed UNVP and E-UNVP consistently outperform the

stand-alone classifier (Pure-CNN) using the same network configuration in all experiments. Particularly, they help to improve accuracy by 6%, 0.5%, 4%, 5%, and 2% on MNIST-M using LeNet, AlexNet, VGG, ResNet, and DenseNet, respectively. The proposed methods can be easily integrated with standard CNN deep networks. Therefore, they can potentially be applied to improve the performance of many existing CNN-based applications, e.g., detection and recognition, as experimented in the next sections.

B. Digit Recognition on Unseen Domains

The proposed approaches are evaluated on digit recognition in new unseen domains with two other digit databases, i.e., MNIST-M and SVHN (Fig. 6). In this experiment, MNIST is the only database used to train the classifier. Then, the two other datasets, i.e., MNIST-M and SVHN, are used as the new unseen domains to benchmark the performance. The classifier is trained using 50,000 images of MNIST. For the generalization phase, we use 10,000 images from this set to perturb and generate new samples. All digit images are resized to 32 × 32. We benchmark the learned classifiers on MNIST and the two unseen digit datasets, i.e., SVHN and MNIST-M. The results using our approach are compared against the LeNet classifier (Pure-CNN) and Adversarial Data Augmentation (ADA). We also show the recognition results on these datasets using the Domain Adaptation methods, including Adversarial Discriminative Domain Adaptation (ADDA) and Domain-Adversarial Training of Neural Networks (DANN) [1]. It is noticed that Pure-CNN, ADA, and our approaches do not require the target domain data during training, while ADDA and DANN require the target domain data in the training steps.

Table IV: Results (%) on three digit datasets. ADA and ours do not require target data in training; ADDA and DANN require training data from target domains.

Methods     MNIST   SVHN    MNIST-M
ADDA        99.29   32.20   63.39
DANN        −       −       76.66
Pure-CNN    99.06   31.96   55.90
ADA         99.17   37.87   60.02
UNVP        99.30   41.23   59.45
E-UNVP      99.42   42.87   61.70

Our generalization phase synthesizes images based on the semantic space via the estimation of the environment density. This helps our generated images to be more diverse than the images synthesized by the ADA method. The experimental results are shown in Table IV. The proposed methods consistently achieve state-of-the-art performance on these datasets. Notably, they help to improve accuracy by approximately 11% and 6% on SVHN and MNIST-M, respectively.

C. Face Recognition on Unseen Domains

In this experiment, the proposed approaches are applied to face recognition in unseen environments and compared against the other baseline methods, i.e., Pure-CNN, ADA, and ADDA, on three face recognition databases, including Extended Yale-B, CMU-PIE, and CMU-MPIE. In each database, we select the face images with normal lighting as the source domain, i.e., Normal illumination (N), and the face images with dark lighting as the target domain, i.e., Dark illumination (D). Each database is randomly split into two sets: a training set (80%) and a testing set (20%). The experimental framework structures are similar to the one in digit recognition. All cropped face images are resized to 64 × 64 pixels.

Table V: Results (%) on the Extended Yale-B [27], CMU-PIE [28], and CMU-MPIE [29] databases. ADA and ours do not require target-domain data during training, while ADDA does. N: normal illumination; D: dark illumination.

            E-Yale-B        CMU-PIE         CMU-MPIE
Method      N       D       N       D       N       D
ADDA        99.17   75.28   96.09   70.33   99.93   97.71
Pure-CNN    98.50   51.39   95.59   62.18   99.93   94.74
ADA         99.00   53.08   96.49   62.69   99.92   96.08
UNVP        99.17   58.24   96.32   64.88   99.83   98.25
E-UNVP      99.54   62.95   97.55   66.89   99.93   98.03

The experimental results in Table V show that our proposed methods help to improve the recognition performance in new unseen domains where the lighting conditions are unknown. Particularly, they help to improve accuracy by approximately 11%, 4%, and 3% in dark lighting conditions on the Extended Yale-B, CMU-PIE, and CMU-MPIE databases, respectively.

D. Pedestrian Recognition on Unseen Domains

This experiment aims to improve RGB-based pedestrian recognition on thermal images using the Thermal Dataset². There are two datasets organized in this experiment: (1) RGB pedestrian and (2) Thermal pedestrian. The methods are trained only on the RGB pedestrian dataset and tested on the Thermal pedestrian dataset. In the training phase, we use 2,000 images to generate new images, and all images of the two datasets are resized to 128 × 128 pixels. The experimental results in Table VI show that our proposed methods consistently help to improve the performance of the Pure-CNN in various common deep network structures, including LeNet, AlexNet, VGG, ResNet, and DenseNet.

²[Link]

VI. CONCLUSIONS

This paper has introduced a novel deep learning-based domain generalization approach that generalizes well to different unseen domains. Using only training data from a source domain, we propose an iterative procedure that augments the dataset with samples from a fictitious target domain that is "hard" under the current model. It can be easily integrated with any other CNN-based framework within an end-to-end network to improve the performance. On digit recognition, the proposed method has been benchmarked

99

Authorized licensed use limited to: UNIVERSITY OF ROCHESTER. Downloaded on July 27,2020 at [Link] UTC from IEEE Xplore. Restrictions apply.
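The training pipeline used in all three experiments above (train on the source domain, synthesize hard samples from a fictitious target domain, retrain on the augmented set) can be sketched as a generic loop. The NumPy sketch below is illustrative only: a toy logistic-regression classifier stands in for the CNN, and the hard samples are produced by plain gradient ascent on the loss in input space (in the spirit of ADA [16]), not by the semantic-space density estimation of the proposed UNVP method; all function names, data, and constants are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, steps=500):
    """Gradient-descent logistic regression; stands in for the CNN classifier."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        g = p - y                          # dL/dz for the cross-entropy loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def adversarial_augment(X, y, w, b, eps=0.5, ascent_steps=10, lr=0.2):
    """Perturb inputs by gradient ASCENT on the loss so they become hard
    under the current model -- a stand-in 'fictitious target domain'."""
    Xa = X.copy()
    for _ in range(ascent_steps):
        p = sigmoid(Xa @ w + b)
        grad_x = np.outer(p - y, w)        # dL/dx for each sample
        Xa += lr * grad_x
    # keep the synthesized samples within an eps-ball of the source samples
    return X + np.clip(Xa - X, -eps, eps)

# Toy source domain: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100, dtype=float)

# Iterative procedure: train, synthesize hard samples, retrain on the union.
w, b = train_logreg(X, y)
for _ in range(3):
    X_hard = adversarial_augment(X, y, w, b)
    X = np.vstack([X, X_hard])
    y = np.concatenate([y, y])
    w, b = train_logreg(X, y)

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"training accuracy after augmentation: {acc:.2f}")
```

The loop structure is what matters here: each round, the current model defines which perturbations are "hard," so the augmented set adapts as the model improves, which is the iterative behavior the paper builds on.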
Table VI: Results (%) on the RGB and Thermal pedestrian databases with various common deep network structures.

Networks   Methods    RGB    Thermal
LeNet      Pure-CNN   95.45  79.72
           E-UNVP     97.25  90.29
AlexNet    Pure-CNN   96.64  81.38
           E-UNVP     97.04  82.98
VGG        Pure-CNN   97.54  95.60
           E-UNVP     98.64  98.38
ResNet     Pure-CNN   98.52  96.07
           E-UNVP     98.56  98.35
DenseNet   Pure-CNN   98.39  95.87
           E-UNVP     98.60  96.14

VI. CONCLUSIONS

This paper has introduced a novel deep learning based domain generalization approach that generalizes well to different unseen domains. Using training data from only a source domain, we propose an iterative procedure that augments the dataset with samples from a fictitious target domain that is hard under the current model. It can be easily integrated with any other CNN-based framework within an end-to-end network to improve the performance. On digit recognition, the proposed method has been benchmarked on three popular digit recognition datasets and consistently showed improvements. The method has also been evaluated on face recognition with three standard databases and outperforms the other state-of-the-art methods. In the problem of pedestrian recognition, we empirically observe that the proposed method learns models that improve performance across a priori unknown data distributions.

VII. ACKNOWLEDGEMENT

In this project, Dat T. Truong and Minh-Triet Tran are partially supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA19.

REFERENCES

[1] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015.

[2] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in CVPR, 2017.

[3] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR, 2015.

[4] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for deep face recognition with long-tailed training data,” in ICCV, 2017.

[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.

[6] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” CoRR, 2015.

[7] O. Sener, H. O. Song, A. Saxena, and S. Savarese, “Learning transferrable representations for unsupervised domain adaptation,” in NIPS, 2016.

[8] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” CoRR, 2014.

[9] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in NIPS, 2016.

[10] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for deep face recognition with long-tail data,” CoRR, 2018.

[11] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” in Digital Signal Processing, 2000.

[12] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, “Image to image translation for domain adaptation,” in CVPR, 2018.

[13] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in CVPR, 2017.

[14] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in CVPR, 2018.

[15] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi, “Generalizing across domains via cross-gradient training,” in ICLR, 2018.

[16] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese, “Generalizing to unseen domains via adversarial data augmentation,” in NIPS, 2018.

[17] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, “Domain generalization for object recognition with multi-task autoencoders,” in ICCV, 2015.

[18] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot, “Domain generalization with adversarial feature learning,” in CVPR, 2018.

[19] K. Muandet, D. Balduzzi, and B. Schölkopf, “Domain generalization via invariant feature representation,” in ICML, 2013.

[20] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao, “Deep domain generalization via conditional invariant adversarial networks,” in ECCV, 2018.

[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.

[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.

[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.

[24] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.

[25] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in NIPS, 2018.

[26] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPSW, 2011.

[27] A. Georghiades, P. Belhumeur, and D. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” TPAMI, 2001.

[28] T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression (PIE) database,” in FG, 2002.

[29] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-PIE,” IVC, 2010.
