Domain Generalization with UNVP Method
Dat T. Truong1,3,4 , Chi Nhan Duong2 , Khoa Luu1 , Minh-Triet Tran3,4 , Ngan Le1
1 University of Arkansas, USA
2 Concordia University, Canada
3 University of Science, Ho Chi Minh city, Vietnam
4 Vietnam National University, Ho Chi Minh city, Vietnam
{tt032, khoaluu, thile}@[Link], dcnhan@[Link], tmtriet@[Link]
Authorized licensed use limited to: UNIVERSITY OF ROCHESTER. Downloaded on July 27,2020 at [Link] UTC from IEEE Xplore. Restrictions apply.
Figure 2. The idea of domain generalization. The deep model is trained only in a single domain (A), i.e. RGB images. It is deployed in other unseen domains, i.e. thermal images (B) and infrared images (C).

Figure 3. Illustration of the proposed UNVP method. The traditional classifier fails to model new samples in unseen domains (top). Meanwhile, UNVP consistently maintains the feature distribution in each class while searching for a new shifting domain (bottom).
Table I: Comparison of the properties between our proposed approaches (UNVP and E-UNVP) and other recent methods, where ✗ represents a not-applicable property and ✓ an applicable one. Gaussian Mixture Model (GMM), Probabilistic Graphical Model (PGM), Convolutional Neural Network (CNN), Adversarial Loss (ℓ_adv), Log-Likelihood Loss (ℓ_LL), Cycle-Consistency Loss (ℓ_cyc), Discrepancy Loss (ℓ_dis), and Cross-Entropy Loss (ℓ_CE).

Method         | Modality          | Architecture | Loss Function  | End-to-End | Target-domain sample-free | Target-domain label-free | Deployable Domains
FT [10]        | Transfer Learning | CNN          | ℓ2             | ✓          | ✗                         | ✗                        | Two
UBM [11]       | Adaptation        | GMM          | ℓ_LL           | ✗          | ✗                         | ✓                        | Any
DANN [1]       | Adaptation        | CNN          | ℓ_adv          | ✓          | ✗                         | ✓                        | Two
CoGAN [9]      | Adaptation        | CNN+GAN      | ℓ_adv          | ✓          | ✗                         | ✓                        | Two
I2IAdapt [12]  | Adaptation        | CNN+GAN      | ℓ_adv + ℓ_cyc  | ✓          | ✗                         | ✓                        | Two
ADDA [13]      | Adaptation        | CNN+GAN      | ℓ_adv          | ✓          | ✗                         | ✓                        | Two
MCD [14]       | Adaptation        | CNN+GAN      | ℓ_adv + ℓ_dis  | ✓          | ✗                         | ✓                        | Two
CrossGrad [15] | Generalization    | Bayesian Net | ℓ_CE           | ✓          | ✓                         | ✓                        | Any
ADA [16]       | Generalization    | CNN          | ℓ_CE           | ✓          | ✓                         | ✓                        | Any
Our UNVP       | Generalization    | PGM+CNN      | ℓ_LL + ℓ_CE    | ✓          | ✓                         | ✓                        | Any
Our E-UNVP     | Generalization    | PGM+CNN      | ℓ_LL + ℓ_CE    | ✓          | ✓                         | ✓                        | Any
synthesizing new useful samples to generalize to new unseen domains. Notice that our proposed framework does not require the presence of samples in the target domains during the training process.

A. Domain Variation Modeling as Distributions

This section aims at learning a Deep Generative Flow model, i.e. a function F, that maps an image x in image space I to its latent representation z in latent domain Z such that the density function pX(x) can be estimated via the probability density function pZ(z). Then, via F, rather than representing the environment variation, i.e. pX(x), directly in the image space, it can be easily modeled via variables in latent space, i.e. pZ(z), in a more semantic manner. When pZ(z) follows prior distributions, all samples in the given domain can be effectively modeled in the form of latent distributions.

Structure and Variable Relationship. Let x ∈ I be a data sample in image domain I, y be its corresponding class label, and z = F(x, y; θ), where θ denotes the parameters of F. The probability density function of x can be formulated via the change-of-variable formula as follows:

    pX(x, y; θ) = pZ(z, y; θ) |det(∂F(x, y; θ)/∂x)|    (1)

where pX(x, y; θ) and pZ(z, y; θ) define the distributions of samples of class y in the image and latent domains, respectively. In this way, F maps samples from a real data distribution to a Gaussian distribution in latent space. Then, we can model the environment variation via deviations from the Gaussian distributions of all classes in a latent domain. When F is well-defined with tractable computation of its Jacobian determinant, the two-way connection, i.e., inference and generation, exists for x and z.

The prior class distributions. Motivated by these properties, given C classes, we choose C Gaussian distributions with different means {μ1, μ2, .., μC} and covariances {Σ1, Σ2, ..., ΣC} as the prior distributions for these classes, i.e. zc ∼ N(μc, Σc). It is worth noting that although we choose Gaussian distributions, our framework is not limited to them; other distribution types can also be adopted.

Mapping function structure. To enforce the information flow from the image domain to a latent space with different abstraction levels, we formulate the mapping function F as a composition of several sub-functions fi as follows:

    F = f1 ∘ f2 ∘ ... ∘ fN    (3)

where N is the number of sub-functions. The Jacobian ∂F/∂x can be derived by ∂F/∂x = ∂f1/∂x · ∂f2/∂f1 ⋯ ∂fN/∂fN−1. With this structure, the properties of each fi define the properties of the whole mapping function F. For example, if the Jacobian ∂fi/∂x is tractable, then F is also tractable. Furthermore, if each fi is invertible, then F is also invertible.
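To make this concrete, the sketch below (our own illustration, not the authors' released code) implements the change-of-variable computation of Eqn. (1) together with the composition of Eqn. (3), using simple element-wise affine layers as stand-ins for the paper's coupling layers; the log-determinants of the sub-functions simply accumulate along the composition.

```python
import numpy as np

class AffineFlowLayer:
    """One invertible sub-function f_i(x) = s * x + t (element-wise, s != 0)."""
    def __init__(self, s, t):
        self.s, self.t = np.asarray(s, float), np.asarray(t, float)

    def forward(self, x):
        # Returns f_i(x) and log|det(Jacobian)| = sum(log|s|) for this layer.
        return self.s * x + self.t, np.sum(np.log(np.abs(self.s)))

    def inverse(self, z):
        return (z - self.t) / self.s

def log_density(x, layers, mu, var):
    """log p_X(x) = log p_Z(z) + sum_i log|det df_i/dx|,
    with a diagonal Gaussian class prior N(mu, diag(var)) as in Sec. III-A."""
    z, log_det = np.asarray(x, float), 0.0
    for layer in layers:                     # F = f_N ∘ ... ∘ f_1, applied in order
        z, ld = layer.forward(z)
        log_det += ld
    log_pz = -0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var))
    return log_pz + log_det, z
```

Applying the layer inverses in reverse order recovers the input exactly, which is the two-way inference/generation connection noted above.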
Figure 4. The distributions: (A) MNIST. (B) MNIST-M using a Pure-CNN trained on MNIST. (C) MNIST-M using our UNVP trained on MNIST. (D) MNIST-M using our E-UNVP trained on MNIST.

To learn the parameters θ, the log-likelihood in Eqn. (2) is maximized as follows:

    θ* = arg max_θ Σ_c Σ_i log pX(x_i, c; θ)    (5)

Notice that after learning the mapping function, all images of all classes are mapped into the corresponding distributions of their classes. Then the environment density can be considered as the composition of these distributions. Figure 4(A) illustrates an example of the learned environment distributions of MNIST with 10 digit classes. When only samples in MNIST are used for training, the density distributions of MNIST-M, i.e., unseen during training, using Pure-CNN, our UNVP, and our E-UNVP are shown in Fig. 4 (B, C, D), respectively. In the next section, a generalization approach is proposed so that, using only samples in a source environment, the learned model can expand the density distributions of the source environment so that they cover as much as possible the distributions of unseen environments.

B. Unseen Domain Generalization

After modeling the source environment variation as the composition of its class distributions, this section introduces the generalization process of these distributions with respect to a classification model M such that the expansion of these distributions can help M generalize to unseen environments with high accuracy. Notice that M can be any type of deep CNN such as LeNet [21], AlexNet [22], VGG [23], ResNet [5], or DenseNet [24].

Let ℓ(X, Y; M, F, θ, θ1) be the training loss function of M, and θ1 be the parameters of M. The generalization process of M can be formulated as updating the parameters θ1 such that it can robustly classify samples having latent distributions that are a distance ρ away from the samples in the source environment. Then, the objective function of M is formulated as:

    arg min_{θ1} sup_{P : d(PX, PX^src) ≤ ρ} E[ℓ(X, Y; M, F, θ, θ1)]    (6)

where {X, Y} denotes the images and their labels; d(·, ·) is the distance between probability distributions; and PX^src(X, Y) and PX(X, Y) are the density distributions of the source and current expanded environments, respectively.

Since both PX^src and PX are density distributions, the Wasserstein distance with respect to PX^src and PX can be adopted. Notice that, from the previous section, we have learned a mapping function F that maps the density functions from image space, i.e. PX, to prior distributions in latent space, i.e. PZ. Moreover, since F is invertible with the specific formula of its sub-functions, computing d(PX, PX^src) is equivalent to computing d(PZ, PZ^src). From this, we can efficiently estimate the cost as the transformation cost between Gaussian distributions. Then d(PX, PX^src) is reformulated as:

    d(PX, PX^src) = d(PZ, PZ^src)
                  = inf E_{xc, xc^src}[cost(F(xc), F(xc^src))]
                  = inf E_{zc, zc^src}[cost(zc, zc^src)]    (7)

where cost(·, ·) denotes the transformation cost between Gaussian distributions:

    cost(zc, zc^src) = Σ_c ||μc^src − μc||²₂ + Tr(Σc^src + Σc − 2((Σc^src)^{1/2} Σc (Σc^src)^{1/2})^{1/2})    (8)

{μc^src, Σc^src} and {μc, Σc} are the means and covariances of the distributions of class c in the source and the expanded environment, respectively. Plugging this distance in and applying the Lagrangian relaxation to Eqn. (6), we have

    arg min_{θ1} sup_P E[ℓ(X, Y; M, F, θ, θ1)] − α · d(PX, PX^src)
    = arg min_{θ1} sup_x Σ_c {ℓ(x, c; M, F, θ, θ1) − α · cost(F(x), F(xc^src))}

To solve this objective function, the optimization process can be divided into two alternating steps: (1) generate the sample x for each class such that

    x = arg max_x {ℓ(x, c; M, F, θ, θ1) − α · cost(F(x), F(xc^src))}    (9)

and consider x as a new "hard" example for class c; and (2) add x to the training data and optimize the model M. In other words, this two-step optimization process aims at finding new samples belonging to distributions that are a distance ρ away from the distributions of the source environment, and making M more robust when classifying these examples. In this way, after a certain number of iterations, the distributions learned from M can be generalized so that they cover as much as possible the distributions of new unseen environments.

C. Universal Non-volume Preserving (UNVP) Models

The proposed UNVP consists of two main branches: (1) Discriminative Feature Modeling and (2) Generative Distribution Modeling.
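As an illustrative sketch (ours, not the paper's code), the transformation cost of Eqn. (8) between the source and expanded Gaussians can be computed as follows, assuming diagonal covariances so that the matrix square roots reduce to element-wise operations.

```python
import numpy as np

def gaussian_w2_cost(mu_src, var_src, mu, var):
    """Squared 2-Wasserstein distance between N(mu_src, diag(var_src)) and
    N(mu, diag(var)) -- the per-class term of Eqn. (8) for diagonal covariances:
    ||mu_src - mu||^2 + Tr(S_src + S - 2(S_src^{1/2} S S_src^{1/2})^{1/2})
    reduces to the two element-wise terms below."""
    mu_src, mu = np.asarray(mu_src, float), np.asarray(mu, float)
    std_src = np.sqrt(np.asarray(var_src, float))
    std = np.sqrt(np.asarray(var, float))
    return np.sum((mu_src - mu) ** 2) + np.sum((std_src - std) ** 2)

def environment_cost(params_src, params):
    """Sum the per-class costs over all C classes, as in Eqn. (8).
    Each argument is a list of (mean, variance) pairs, one per class."""
    return sum(gaussian_w2_cost(ms, vs, m, v)
               for (ms, vs), (m, v) in zip(params_src, params))
```

For full covariance matrices, the middle trace term would instead require a matrix square root (e.g., `scipy.linalg.sqrtm`).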
Figure 5. The training process of our proposed UNVP. It consists of one pre-training step and a two-stage optimization that alternately minimizes the within-class distributions and synthesizes new samples for generalization.

While the discriminative part focuses on constructing a classifier that minimizes within-class distributions, the generative one aims at embedding samples of all classes into their corresponding latent distributions and then expanding these distributions for generalization. Fig. 5 illustrates the whole end-to-end joint training process of UNVP, where the generative part, i.e., the Deep Generative Flow F, is first employed to learn the mapping from image space to Gaussian distributions in latent space. Then a two-stage training process is adopted to learn the Deep Classifier M and adjust the Deep Generative Flow F for generalization.

In the first stage of this process, given a training dataset, both the parameters {θ, θ1} of the mapping function F and the classifier M are updated according to the loss function:

    ℓ(X, Y; M, F, θ, θ1) = ℓ_CE(M(X; θ1), Y) − log pX(X, Y; θ)

where the first term is the cross-entropy loss for M and the second term is the log-likelihood of F.

In the second stage, we adapt the generalization process presented in Sec. III-B and Eqn. (9) to synthesize new "hard" samples. Notice that, to further constrain the perturbation in latent space, we incorporate another regularization term into Eqn. (7):

    cost(zc, zc^src) = Σ_c ||μc^src − μc||²₂ + Tr(Σc^src + Σc − 2((Σc^src)^{1/2} Σc (Σc^src)^{1/2})^{1/2}) + ||M(Xc) − M(Xc^src)||²₂

Newly generated samples are then added to the training set and used to update both branches of UNVP.

Notice that in the structure of F, the choice of Gaussian distributions for all classes plays an important role and directly affects the performance of the generative model. By varying the choices for these distributions, different variants of UNVP can be introduced.

Universal Non-volume Preserving Models (UNVP): The means and covariances of the Gaussian distributions are pre-defined for all C classes, where μc = 1c; Σc = I; zc ∼ N(μc, I), and 1 is the all-one vector.

Extended Universal Non-volume Preserving Models (E-UNVP): Rather than fixing the means and covariances of the Gaussian distributions of the C classes, we consider them as variables that are flexibly learned during the environment modeling as well as adjusted during domain generalization. Particularly, given the class label c, F maps each sample xc to a Gaussian distribution with the mean and covariance:

    μc = γ Gm(c) + λ Hm(n)
    Σc = Gstd(c)    (10)

where Gm(c) and Gstd(c) denote the learnable functions that map the label c to the mean and covariance values of its Gaussian distribution; n is a noise signal generated following the normal distribution; and Hm(n) defines the allowable shifting range of the Gaussian given the noise signal n. γ and λ are user-defined parameters that control the separation of the Gaussian distributions between different classes and the contribution of Hm(n) to μc. We choose a fully connected structure for Gm(c) and Gstd(c), which take the input c in the form of a one-hot vector, while a convolutional layer is adopted for Hm(n).

IV. DISCUSSION

As shown in Fig. 3, by exploiting Generative Flows that model the samples of each class as a Gaussian in the semantic feature space, the proposed UNVP can robustly maintain the feature structure of each class while expanding and shifting the domain distributions. In this way, we can generate more useful "hard" samples for the generalization process.

By introducing the noise signal n, we allow the Gaussian distribution of each class to shift around within a limited range, i.e., Hm(n). This further enhances the robustness of E-UNVP against noise during the environment modeling.

To further enhance the capability of modeling input signals at high resolution, we incorporate the activation normalization (actnorm) and invertible 1 × 1 convolution operators [25] into the structure of each sub-function fi in Eqn. (3). Particularly, the input to each fi is passed through an actnorm layer followed by an invertible 1 × 1 convolution before being transformed by S and T as in Eqn. (4). The two transformations S and T are defined by two Residual Networks with rectifier non-linearities and skip connections. Each of them contains three residual blocks. For input images with resolution higher than 128 × 128, six residual blocks are used for S and T.
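The second-stage synthesis step can be sketched as follows (our illustration; `fit_model` and `grad_objective_for` are hypothetical placeholders for the classifier update and for the gradient of the objective in Eqn. (9), which in practice would come from backpropagation through M and F).

```python
import numpy as np

def synthesize_hard_sample(x_src, grad_objective, step=0.1, n_steps=20):
    """Gradient-ascent sketch of Eqn. (9): start from a source sample and
    maximize  loss(x, c) - alpha * cost(F(x), F(x_src))  with respect to x.
    `grad_objective(x)` must return the gradient of that objective at x."""
    x = np.array(x_src, dtype=float)
    for _ in range(n_steps):
        x += step * grad_objective(x)   # ascend: x becomes a "hard" example
    return x

def domain_generalization_round(train_x, train_y, fit_model, grad_objective_for):
    """One alternation of the two-step process: (1) synthesize one hard sample
    per class, (2) retrain the classifier on the augmented set."""
    new_x, new_y = [], []
    for c in np.unique(train_y):
        x_src = train_x[train_y == c][0]   # a source sample of class c
        new_x.append(synthesize_hard_sample(x_src, grad_objective_for(c)))
        new_y.append(c)
    aug_x = np.vstack([train_x, np.array(new_x)])
    aug_y = np.concatenate([train_y, np.array(new_y)])
    return fit_model(aug_x, aug_y), aug_x, aug_y
```

Iterating this round gradually pushes the training distribution outward, which is the expansion behavior described above.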
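The E-UNVP prior parametrization of Eqn. (10), whose shift weight λ appears in the ablation of Table II, can be sketched as follows (ours; plain linear maps stand in for the fully connected and convolutional heads Gm, Gstd, and Hm).

```python
import numpy as np

rng = np.random.default_rng(0)

class EUNVPPrior:
    """Sketch of the E-UNVP class priors (Eqn. (10)):
    mu_c = gamma * G_m(c) + lambda * H_m(n),  Sigma_c = G_std(c),
    with learnable heads over one-hot labels and a noise-driven shift."""
    def __init__(self, n_classes, dim, gamma=1.0, lam=0.1):
        self.gamma, self.lam = gamma, lam
        self.W_m = rng.standard_normal((n_classes, dim))                  # G_m weights
        self.W_std = np.abs(rng.standard_normal((n_classes, dim))) + 0.5  # G_std > 0
        self.W_h = rng.standard_normal((dim, dim))                        # H_m weights

    def params_for(self, c, n=None):
        """Return (mu_c, sigma_c) for class c; n is the optional noise signal."""
        if n is None:
            n = rng.standard_normal(self.W_h.shape[1])
        mu = self.gamma * self.W_m[c] + self.lam * (self.W_h @ n)
        sigma = self.W_std[c]   # diagonal standard deviations
        return mu, sigma

    def sample(self, c, size=1):
        mu, sigma = self.params_for(c)
        return mu + sigma * rng.standard_normal((size, mu.shape[0]))
```

With `lam = 0` this degenerates to fixed per-class means, i.e., toward the plain UNVP choice of priors.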
Table II: Ablative experiment results (%) on the effectiveness of the parameters λ, α and β, which control the distribution separation and shifting range. MNIST is used as the only training set; MNIST-M is used as the unseen testing set.

Dataset | Method   | λ=0.01 | λ=0.1 | λ=1.0 | α=0.01 | α=0.1 | α=1.0 | β=0%  | β=1%  | β=10% | β=20% | β=30%
MNIST   | Pure-CNN | 99.28 (baseline)
MNIST   | UNVP     | −      | −     | −     | 99.33  | 99.18 | 99.30 | 99.28 | 99.28 | 99.35 | 99.30 | 99.36
MNIST   | E-UNVP   | 99.22  | 99.42 | 99.40 | 99.13  | 99.31 | 99.42 | 99.28 | 99.36 | 99.34 | 99.42 | 99.43
MNIST-M | Pure-CNN | 55.90 (baseline)
MNIST-M | UNVP     | −      | −     | −     | 58.18  | 60.76 | 59.44 | 55.90 | 59.99 | 57.24 | 59.44 | 55.11
MNIST-M | E-UNVP   | 59.83  | 60.49 | 59.47 | 56.92  | 61.70 | 60.49 | 55.90 | 57.10 | 60.49 | 61.70 | 60.49
Table IV: Results (%) on three digit datasets. ADA and ours do not require target data in training; ADDA and DANN do.

Methods  | MNIST | SVHN  | MNIST-M
ADDA     | 99.29 | 32.20 | 63.39
DANN     | −     | −     | 76.66
Pure-CNN | 99.06 | 31.96 | 55.90
ADA      | 99.17 | 37.87 | 60.02
UNVP     | 99.30 | 41.23 | 59.45
E-UNVP   | 99.42 | 42.87 | 61.70

Table V: Results (%) on the Extended Yale-B [27], CMU-PIE [28] and CMU-MPIE [29] databases under Normal (N) and Dark (D) illumination. ADA and ours do not require target-domain data during training, while ADDA does.

Method   | E-Yale-B (N / D) | CMU-PIE (N / D) | CMU-MPIE (N / D)
ADDA     | 99.17 / 75.28    | 96.09 / 70.33   | 99.93 / 97.71
Pure-CNN | 98.50 / 51.39    | 95.59 / 62.18   | 99.93 / 94.74
ADA      | 99.00 / 53.08    | 96.49 / 62.69   | 99.92 / 96.08
UNVP     | 99.17 / 58.24    | 96.32 / 64.88   | 99.83 / 98.25
E-UNVP   | 99.54 / 62.95    | 97.55 / 66.89   | 99.93 / 98.03

Our proposed methods consistently outperform the stand-alone classifier (Pure-CNN) using the same network configuration in all experiments. Particularly, they help to improve performance by 6%, 0.5%, 4%, 5%, and 2% on MNIST-M using LeNet, AlexNet, VGG, ResNet and DenseNet, respectively. The proposed methods can be easily integrated with standard deep CNN networks. Therefore, they can potentially be applied to improve the performance of many existing CNN-based applications, e.g., detection and recognition, as experimented in the next sections.

B. Digit Recognition on Unseen Domains

The proposed approaches are evaluated on digit recognition in new unseen domains with two other digit databases, i.e., MNIST-M and SVHN (Fig. 6). In this experiment, MNIST is the only database used to train the classifier. Then, the two other datasets, i.e., MNIST-M and SVHN, are used as the new unseen domains to benchmark the performance. The classifier is trained using 50,000 images of MNIST. In the image generalization phase, we use 10,000 images from this set to perturb and generate new samples. All digit images are resized to 32 × 32. We benchmark the learned classifiers on MNIST and the two other unseen digit datasets, i.e., SVHN and MNIST-M. The results using our approach are compared against the LeNet classifier (Pure-CNN) and Adversarial Data Augmentation (ADA). We also show the recognition results on these datasets using Domain Adaptation methods, including Adversarial Discriminative Domain Adaptation (ADDA) and Domain-Adversarial Training of Neural Networks (DANN) [1]. Notice that Pure-CNN, ADA, and our approaches do not require the target domain data during training, while ADDA and DANN require the target domain data in the training steps.

Our generalization phase synthesizes images in the semantic space via the estimation of environment density. This helps our generated images to be more diverse than the images synthesized by the ADA method. The experimental results are shown in Table IV. The proposed methods consistently achieve state-of-the-art performance on these datasets. Notably, they help to improve performance by approximately 11% and 6% on SVHN and MNIST-M, respectively.

C. Face Recognition on Unseen Domains

In this experiment, the proposed approaches are applied in unseen-environment face recognition and compared against the other baseline methods, i.e., Pure-CNN, ADA, and ADDA, on three face recognition databases, including Extended Yale-B, CMU-PIE, and CMU-MPIE. In each database, we select the face images with normal lighting as the source domain, i.e., Normal illumination (N), and the face images with dark lighting as the target domain, i.e., Dark illumination (D). Each database is randomly split into two sets: a training set (80%) and a testing set (20%). The experimental framework structures are similar to the one in digit recognition. All cropped face images are resized to 64 × 64 pixels. The experimental results in Table V show that our proposed methods help to improve the recognition performance on new unseen domains where the lighting conditions are unknown. Particularly, they help to improve performance by approximately 11%, 4% and 3% in dark lighting conditions on the Extended Yale-B, CMU-PIE and CMU-MPIE databases, respectively.

D. Pedestrian Recognition on Unseen Domains

This experiment aims to improve RGB-based pedestrian recognition on thermal images on the Thermal Dataset2. There are two datasets organized in this experiment: (1) RGB pedestrian and (2) Thermal pedestrian. The methods are trained only on the RGB pedestrian dataset and tested on the Thermal pedestrian dataset. In the training phase, we use 2,000 images to generalize new images, and all images of the two datasets are resized to 128 × 128 pixels. The experimental results in Table VI show that our proposed methods consistently help to improve the performance of the Pure-CNN in various common deep network structures, including LeNet, AlexNet, VGG, ResNet, and DenseNet.

VI. CONCLUSIONS

This paper has introduced a novel deep-learning-based domain generalization approach that generalizes well to different unseen domains. Using only training data from a source domain, we propose an iterative procedure that augments the dataset with samples from a fictitious target domain that is "hard" under the current model. It can be easily integrated with any other CNN-based framework within an end-to-end network to improve the performance. On digit recognition, the proposed method has been benchmarked

2 [Link]
Table VI: Results (%) on the RGB and Thermal pedestrian databases with various common deep network structures.

Networks | Methods  | RGB   | Thermal
LeNet    | Pure-CNN | 95.45 | 79.72
         | E-UNVP   | 97.25 | 90.29
AlexNet  | Pure-CNN | 96.64 | 81.38
         | E-UNVP   | 97.04 | 82.98
VGG      | Pure-CNN | 97.54 | 95.60
         | E-UNVP   | 98.64 | 98.38
ResNet   | Pure-CNN | 98.52 | 96.07
         | E-UNVP   | 98.56 | 98.35
DenseNet | Pure-CNN | 98.39 | 95.87
         | E-UNVP   | 98.60 | 96.14

on three popular digit recognition datasets and consistently showed improvement. The method is also experimented in face recognition on three standard databases and outperforms the other state-of-the-art methods. In the problem of pedestrian recognition, we empirically observe that the proposed method learns models that improve performance across a priori unknown data distributions.

VII. ACKNOWLEDGEMENT

In this project, Dat T. Truong and Minh-Triet Tran are partially supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES

[1] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in ICML, 2015.
[2] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in CVPR, July 2017.
[11] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," in Digital Signal Processing, 2000.
[12] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, "Image to image translation for domain adaptation," in CVPR, June 2018.
[13] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in CVPR, 2017.
[14] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, "Maximum classifier discrepancy for unsupervised domain adaptation," in CVPR, 2018.
[15] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi, "Generalizing across domains via cross-gradient training," 2018.
[16] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese, "Generalizing to unseen domains via adversarial data augmentation," in NIPS, 2018.
[17] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, "Domain generalization for object recognition with multi-task autoencoders," in ICCV, 2015.
[18] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot, "Domain generalization with adversarial feature learning," in CVPR, 2018.
[19] K. Muandet, D. Balduzzi, and B. Schölkopf, "Domain generalization via invariant feature representation," in ICML, 2013.
[20] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao, "Deep domain generalization via conditional invariant adversarial networks," in ECCV, 2018.