Review
Deep Generative Adversarial Networks for
Image-to-Image Translation: A Review
Aziz Alotaibi
College of Computers and Information Technology, Taif University, Taif 21974, Saudi Arabia; [email protected]
Received: 3 August 2020; Accepted: 9 October 2020; Published: 16 October 2020
Abstract: Many image processing, computer graphics, and computer vision problems can be treated
as image-to-image translation tasks. Such translation entails learning to map one visual representation
of a given input to another representation. Image-to-image translation with generative adversarial
networks (GANs) has been intensively studied and applied to various tasks, such as multimodal
image-to-image translation, super-resolution translation, object transfiguration-related translation, etc.
However, image-to-image translation techniques suffer from some problems, such as mode collapse,
instability, and a lack of diversity. This article provides a comprehensive overview of image-to-image
translation based on GAN algorithms and its variants. It also discusses and analyzes current
state-of-the-art image-to-image translation techniques that are based on multimodal and multidomain
representations. Finally, open issues and future research directions utilizing reinforcement learning
and three-dimensional (3D) modal translation are summarized and discussed.
1. Introduction
With the rapid advancement in deep learning algorithms, the tasks of analyzing and understanding
digital images for many computer vision applications have drawn increasing attention in recent
years due to such algorithms’ extraordinary performance and the availability of large amounts of data.
Such algorithms directly process raw data (e.g., an RGB image) and obviate the need for domain
experts or handcrafted features [1–3]. The powerful ability of deep feature learning to automatically
utilize complex and high-level feature representations has significantly advanced the performance of
state-of-the-art methods across computer vision applications, such as object detection [4], medical imaging [5,6],
image segmentation [7], image classification [8], and face detection [9]. The underlying structure
and distinctive (complex) features are both discovered via deep learning-based methods that can
be classified further into discriminative feature-learning algorithms and generative feature-learning
algorithms. Discriminative models focus on the classification-learning process by learning the
conditional probability p (x|y) to map input x to class label y. One of the most popular methods used
for image feature learning utilizes convolutional neural networks (CNN) for feature extraction and
image classification. Examples include LeNet [8], AlexNet [10], VGGNet [11], and ResNet [12,13],
all of which are supervised learning algorithms. On the other hand, generative models focus on the
data distribution to discover the underlying features from large amounts of data in an unsupervised
setting. Such models are able to generate new samples by learning the estimation of the joint
probability distribution p (x,y) and predicting y [14] in contexts, such as image super-resolution [15,16],
text-to-image generation [17,18], and image-to-image translation [19,20].
Figure 1. Generative Adversarial Network Architecture.
2.3. Definitions
In this section, notation, abbreviations, and concepts used throughout this survey are explained in
order to facilitate the understanding of the topic. Lists of commonly used abbreviations and notations
are shown in Tables 1 and A1, respectively. Concepts that are related to both generative adversarial
networks and image-to-image translation are explained in what follows [42,44,45].
Notation        Explanation
Pdata           Real sample
Pg              Fake sample
z               Random noise vector
p(x|y)          Conditional probability
p(x,y)          Joint probability
G(z, θ(G))      Generator network
D(x, θ(D))      Discriminator network
D(y)            Discriminator output
L_D             Discriminator loss
L_G             Generator loss
Figure 2. (a,c) are paired images and (b,d) are unpaired images [41,46].
2.4. Motivation and Contribution
With development in deep learning, numerous approaches have been proposed in order to improve the quality of image synthesis. Most recently, image synthesis with deep generative adversarial networks has attracted many researchers’ attention due to their ability to capture the probability distribution as compared to traditional generative models. Several research papers have proposed performing image synthesis while using adversarial networks. Several articles [21,22,34,47–49] have recently surveyed generative adversarial networks and GAN variants, including image synthesis as an application. Other surveys [50,51] covered image synthesis with GANs and partially discussed image-to-image translation. The effort closest to this survey is that of [50], where the authors discussed image synthesis, including a few image-to-image translation methods.

To the best of our knowledge, image-to-image translation with GANs has never been reviewed previously. Thus, this article provides a comprehensive overview of image-to-image translation using GAN algorithms and variants. A general introduction to generative adversarial networks is given, and GAN variants, structures, and objective functions are demonstrated. The image-to-image translation approaches are discussed in detail, including the state-of-the-art algorithms, theory, applications, and open challenges. The image-to-image translation approaches are classified into supervised and unsupervised types. The contributions of this article can be summarized as follows:

1. This review article provides a comprehensive review including general generative adversarial network algorithms, objective functions, and structures.
2. Image-to-image translation approaches are classified into supervised and unsupervised types with in-depth explanations.
3. This review article also summarizes the benchmark datasets, evaluation metrics, and image-to-image translation applications.
4. Limitations, open challenges, and directions for future research are among the topics discussed, illustrated, and investigated in depth.

This paper’s structure can be summarized as follows. In Section 2, GANs and variants of GAN architectures are demonstrated. GAN objective functions and GAN structures are discussed in Sections 3 and 4, respectively. Section 5 introduces and discusses both supervised and unsupervised image-to-image translation techniques. In Section 6, image-to-image translation applications, including the topics of datasets, practical applications, and evaluation metrics, are illustrated and summarized. Discussion and directions for future research utilizing reinforcement learning and three-dimensional (3D) models are summarized in Section 7. The last section concludes this review paper.
Conditional variable y could be text or a number that turns the GAN model into a supervised
GAN model. CGAN can be used with images, sequence models, and other models. CGAN is used to
model complex and large-scale datasets that have different labels by adding conditional information y
to both the generator and discriminator.
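As an illustration of this conditioning, the following minimal PyTorch sketch (our own illustrative code, not taken from the surveyed papers; layer sizes and names are arbitrary assumptions) embeds the label y and concatenates it to the inputs of both the generator and the discriminator.

import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, z_dim=100, n_classes=10, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)   # conditional information y
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z, y):
        # condition the generator by concatenating the label embedding to the noise
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self, n_classes=10, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x, y):
        # the discriminator also receives the label, so it judges (sample, label) pairs
        return self.net(torch.cat([x, self.embed(y)], dim=1))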
through training. InfoGAN is implemented by adding a regularization term to the original GAN’s
objective function.
min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c)) (2)

where V_I(D, G) is the InfoGAN objective, V(D, G) is the original GAN loss, I(c; G(z, c)) is the mutual information between the latent code c and the generator output, and λ is a constant weighting term. InfoGAN maximizes the mutual information between the generator’s output G(z,c) and latent
code c to discover the meaningful features of the real data distribution. However, the mutual information
I(c; G(z, c)) requires access to the posterior probability p(c|x), which makes it difficult to optimize
directly [22,47,49,55]. Later, other InfoGAN variants were proposed, such as the semi-supervised
InfoGAN (ss-InfoGAN) [56] and the causal InfoGAN [57].
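In practice, the intractable posterior is replaced by an auxiliary distribution Q(c|x) that shares most layers with the discriminator. The following hedged sketch (our own simplification for a categorical code, not the authors’ implementation) shows how the resulting variational lower bound reduces to a cross-entropy term that is added to the GAN objective with weight λ.

import torch
import torch.nn.functional as F

def info_regularizer(q_logits, c_idx, lam=1.0):
    # q_logits: output of the auxiliary Q head evaluated on G(z, c), shape (batch, n_categories)
    # c_idx:    indices of the categorical codes that were sampled when generating
    # The cross-entropy is the negative variational lower bound on I(c; G(z, c)),
    # so minimizing this term (for both G and Q) maximizes the mutual-information bound.
    return lam * F.cross_entropy(q_logits, c_idx)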
3.4. BigGAN
Brock et al. [58] propose BigGAN, a class-conditional GAN that is based on the self-attention GAN
(SAGAN) model [43]. BigGAN is trained on ImageNet at the 128 × 128 resolution to generate natural
images with high fidelity and variety. The BigGAN model is based on scaling up the GAN models to
improve the quality of generated samples by (1) adding orthogonal regularization to the generator,
(2) increasing the batch size and the number of parameters, (3) normalizing the generator’s weight
using spectral normalization, and (4) introducing a truncation trick in order to control the variety of
the generated samples.
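The truncation trick can be illustrated with a short sampling sketch (illustrative only; the threshold value and function names are our assumptions, not BigGAN’s released code): latent entries are drawn from a normal distribution truncated to [−t, t], and a smaller t trades sample variety for fidelity.

import numpy as np
from scipy.stats import truncnorm

def truncated_z(batch, z_dim, t=0.5, seed=None):
    # Sample latent vectors from a normal distribution truncated to [-t, t].
    rng = np.random.RandomState(seed)
    return truncnorm.rvs(-t, t, size=(batch, z_dim), random_state=rng).astype(np.float32)

z = truncated_z(8, 128, t=0.5)   # fed to a trained class-conditional generator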
Adversarial Loss. The original GAN consists of a generator G(z, θ(G)) and a discriminator D(x, θ(D))
competing against each other. GANs utilize the sigmoid cross entropy as a loss function for discriminator
D, and use the minimax loss and a non-saturated loss with generator G. D is a differentiable function whose
input and parameters are x and θ(D), respectively, and it outputs a single scalar D(y). D(y) represents
the probability that input x belongs to the real data Pdata rather than the generated data Pg. Generator G
is a differentiable function whose input and parameters are z and θ(G), respectively. The discriminator
and the generator both have separate loss functions, as shown in Table 3. Both update their parameters
to achieve the Nash equilibrium, whereby the discriminator D(x, θ(D)) aims to maximize the probability
of assigning the correct label to both training samples and the samples generated by G [32], while the
generator G(z, θ(G)) aims to minimize log(1 − D(G(z))) to deceive D. They are both trained simultaneously,
inspired by the min–max game.
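A minimal training-step sketch of these alternating objectives is given below (assumed PyTorch with a sigmoid-output discriminator; names, shapes, and hyperparameters are illustrative, not from the original paper). D is updated to tell real from generated samples, while G uses the non-saturating heuristic of maximizing log D(G(z)).

import torch
import torch.nn.functional as F

def gan_train_step(G, D, opt_G, opt_D, real, z_dim=100):
    bce = F.binary_cross_entropy
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    # discriminator update: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(real.size(0), z_dim)
    fake = G(z).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # generator update with the non-saturating loss: maximize log D(G(z))
    z = torch.randn(real.size(0), z_dim)
    g_loss = bce(D(G(z)), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()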
Wasserstein GAN. Arjovsky et al. [62] propose WGAN, using what is sometimes called the Earth
Mover’s (EM) distance, to overcome GAN model instability and mode collapse. WGAN uses the
Wasserstein distance instead of the Jensen–Shannon divergence to measure the similarity between the real
data distribution and the generated data distribution. The Wasserstein distance can be used to measure
the distance between probability distributions Pdata(x) and Pg(x) even if there is no overlap, where
Π(Pdata, Pg) denotes the set of all joint distributions between the real distribution and the generated
distribution [34,47,52]. WGAN applies weight clipping to enforce the Lipschitz constraint on the
discriminator. However, WGAN may suffer from vanishing or exploding gradients due to the use
of weight clipping. The discriminator in WGAN is used for a regression task to approximate the
Wasserstein distance instead of acting as a binary classifier [49]. It should be noted that WGAN does not
change the GAN structure, but instead enhances parameter learning and model optimization [22].
WGAN-GP [69] is a later proposal for stabilizing the training of a GAN by utilizing gradient penalty
regularization [70].
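The critic update can be sketched as follows (illustrative PyTorch only, assuming a critic with a linear, unbounded output; the clipping value is the one suggested in the WGAN paper). The critic approximates the Wasserstein distance as E[f(x_real)] − E[f(x_fake)], and weight clipping enforces the Lipschitz constraint, which WGAN-GP replaces with a gradient penalty.

import torch

def wgan_critic_step(critic, G, opt_C, real, z_dim=100, clip=0.01):
    z = torch.randn(real.size(0), z_dim)
    fake = G(z).detach()
    # maximize E[f(real)] - E[f(fake)]  <=>  minimize its negative
    loss = -(critic(real).mean() - critic(fake).mean())
    opt_C.zero_grad(); loss.backward(); opt_C.step()
    for p in critic.parameters():          # weight clipping for the Lipschitz constraint
        p.data.clamp_(-clip, clip)
    return -loss.item()                    # current Wasserstein estimate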
Least Squares GAN. LSGAN [59] has been proposed to overcome the vanishing gradient problem
that is caused by the minimax loss and the non-saturated loss in the original GAN model. LSGAN
adopts the least squares or L2 loss function instead of the sigmoid cross-entropy loss function used in
the original GAN. As shown in Table 3, a and b are the labels for generated and real samples,
respectively, while c represents the value that G wishes D to believe for generated samples. There are
two benefits of implementing LSGAN over the original GAN: first, LSGAN can generate high-quality
samples; second, LSGAN allows for the learning process to be more stable.
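The two least-squares losses can be written as a short sketch with the a, b, and c labels above (our own illustrative code; the common choice a = 0, b = 1, c = 1 is an assumption, not the only one discussed in the paper).

import torch

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    # push discriminator outputs toward b for real samples and a for generated samples
    return 0.5 * ((d_real - b) ** 2).mean() + 0.5 * ((d_fake - a) ** 2).mean()

def lsgan_g_loss(d_fake, c=1.0):
    # push discriminator outputs on generated samples toward the value c that G wants D to believe
    return 0.5 * ((d_fake - c) ** 2).mean()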
Energy-based GAN. EBGAN [66] has been proposed to model the discriminator as an energy function.
This model also uses the autoencoder architecture to first estimate the reconstruction error and, second,
to assign a lower energy to the real samples and a higher energy to the generated samples. The output
of the discriminator in EBGAN goes through a loss function to shape the energy function. Table 3 shows
the loss function, where m is the positive margin, and [.]+ = max(0, .). The EBGAN framework exhibits
better convergence and scalability, which result in generating high-resolution images [66].
Margin adaption GAN. MAGAN [68] is an extension of EBGAN that uses the hinge loss function, in
which margin m is adapted automatically while using the expected energy of the real data distribution.
Hence, margin m is monotonically reduced over time. Unlike EBGAN, MAGAN converges to its
global optima, where both real and generated samples’ distributions match exactly.
Boundary Equilibrium GAN. BEGAN [67] is an extension of EBGAN that uses an autoencoder as
the discriminator. BEGAN’s objective function computes the loss based on the Wasserstein distance.
Using the proportional control theory, the authors of BEGAN propose an equilibrium method for
balancing the generator and the discriminator during training without directly matching the data
distributions [47].
5. GAN Structure
The typical GAN is based on a multilayer perceptron (MLP), as mentioned above. Subsequently,
structures of various types have been proposed to either solve GANs’ issues or address a specific
application, as explained in what follows.
Deep convolutional GAN. DCGAN [71] is one of the recent major improvements in the field of computer
vision and generative modeling. It combines a GAN with a convolutional neural network (CNN).
DCGAN has been proposed to stabilize GANs in order to train deep architectures to generate high-quality
images. DCGANs have set some architectural constraints to train the generator and discriminator
networks for unsupervised representation learning in order to resolve the issues of training instability
and the collapse of the GAN architecture. First, DCGANs replace spatial pooling layers with strided
convolutions and fractionally-strided convolutions, allowing the generator and the discriminator to
learn their own spatial upsampling and downsampling. Second, batch normalization (BN) is used
to stabilize learning by normalizing the input and to mitigate the vanishing gradient problem; BN is mainly
applied to prevent the deep generator from collapsing all samples to the same points. Third, eliminating
the fully connected hidden layers that would otherwise sit on top of the CNN increases model stability.
Finally, DCGANs use both ReLU and LeakyReLU to allow the model to learn quickly and perform
well: the ReLU activation function is used in all generator layers except the last layer, which uses the tanh
activation function, while LeakyReLU activation functions are used in all discriminator layers.
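These architectural constraints can be summarized in a short generator sketch (an illustrative 64 × 64 configuration; channel widths and layer counts are our assumptions, not the exact published architecture). The input noise is expected with shape (N, z_dim, 1, 1).

import torch.nn as nn

def dcgan_generator(z_dim=100, ch=64, out_ch=3):
    # transposed (fractionally strided) convolutions upsample 1x1 -> 64x64,
    # with batch normalization and ReLU everywhere except the tanh output layer
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ch * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ch * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ch * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(True),
        nn.ConvTranspose2d(ch, out_ch, 4, 2, 1, bias=False), nn.Tanh())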
Self-Attention GAN. SAGAN [72] has been proposed to incorporate a self-attention mechanism into
a convolutional GAN framework to improve the quality of generated images. A traditional convolution-
based GAN has difficulty in modeling some image classes when trained on large multi-class image
datasets due to the local receptive field. SAGAN adapts a self-attention mechanism to different stages
of both the generator and the discriminator to model long-range and multi-level dependencies across
image regions. SAGAN uses three techniques to stabilize the training of a GAN. First, it applies
spectral normalization to both the generator and discriminator to improve performance and reduce the amount
of computations that are performed during training. Second, it uses the Two Time-scale Update
Rule (TTUR) for both the generator and discriminator to speed up the training of the regularized
discriminator. Third, it integrates the conditional batch normalization layers into the generator and a
projection into the discriminator. SAGAN utilizes the hinge loss as the adversarial loss and uses the
Adam optimizer to train the model, which achieves state-of-the-art performance on class-conditional
image synthesis.
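A self-attention block of this kind can be sketched as follows (a common PyTorch formulation, not SAGAN’s released code; the channel reduction factor of 8 is an assumption). Query, key, and value maps come from 1 × 1 convolutions, a softmax over spatial positions gives the attention map, and a learned scale gamma (initialized to zero) blends the attended features with the input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.q = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.k = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.v = nn.Conv2d(in_ch, in_ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.size()
        q = self.q(x).view(b, -1, h * w).permute(0, 2, 1)   # (b, hw, c')
        k = self.k(x).view(b, -1, h * w)                    # (b, c', hw)
        attn = F.softmax(torch.bmm(q, k), dim=-1)           # (b, hw, hw) attention over positions
        v = self.v(x).view(b, -1, h * w)                    # (b, c, hw)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x                         # residual blend with learned scale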
Progressive Growing GAN. Karras et al. [73] proposed PGGAN to generate large high-quality images
and to stabilize the training process. The PGGAN architecture is based on growing both the generator
and the discriminator progressively, starting from small-size images, and then adding new blocks of
layers to both the generator and the discriminator. These new blocks of layers are incrementally enlarged
in order to discover the large-scale structure and achieve high resolution. Progressive training has three
benefits: stabilizing the learning process, increasing the resolution, and reducing the training time.
Laplacian Pyramid GAN. Denton et al. [74] proposed LAPGAN to generate high-quality images by
combining a conditional GAN model with a Laplacian pyramid representation. A Laplacian pyramid is
a linear invertible image representation that consists of a low-frequency residual based on a Gaussian
pyramid. LAPGAN is a cascade of convolutional GANs with k levels of the Laplacian pyramid.
The approach of LAPGAN uses multiple generators and discriminator networks and proceeds, as follows:
in the beginning, the image is downsampled by a factor of two at each k level of the pyramid. Subsequently,
the image is upsampled in a backward pass by a factor of two to reconstruct the image and then return it
to its original size while the image acquires noise generated by a conditional GAN at each layer. LAPGAN
is trained through unsupervised learning, and each level of a Laplacian pyramid is trained independently
and evaluated while using both log-likelihood and human evaluation.
VAE-GAN. Larsen et al. [75] combine a variational autoencoder with a generative adversarial
network (VAE-GAN) into an unsupervised generative model and train them jointly to produce
high-quality generated images. VAE and GAN are implemented by assigning the VAE decoder to the
GAN generator and combining the VAE’s objective function with an adversarial loss [47], where the
element-wise reconstruction metric is replaced with a feature-wise reconstruction metric in order to
produce sharp images.
Figure 3. Image-to-image translation taxonomy based on technique.
6.1. Supervised Translation

A supervised method requires a set of pairs of corresponding images (x,y) in different domains (X,Y); for each image x ∈ X and a corresponding image from another domain y ∈ Y, the method learns a probability distribution while using the translator G: X→Y. In some cases, supervised translation utilizes domain images that are conditioned on class labels or source images to generate a high-quality image, as shown in Figure 4a. Supervised translation is further divided into directional and bidirectional translation, as explained in the following sections.
6.1.1. Directional Supervision
Pix2Pix [76] is a supervised image-to-image translation approach that is based on a conditional generative adversarial network. Pix2Pix requires paired images to learn one-to-one mapping and uses two datasets; one dataset is used as input, and the other is used as condition input. An example is translating a semantic label x to a realistic-looking image, and Figure 5 shows an edge-to-photo translation. The generator uses a “U-Net”-based architecture that relies on skip connections to each layer. In contrast, the discriminator uses a convolution-based “PatchGAN” as a classifier. The objective function of Pix2Pix uses cGAN with the L1 norm instead of L2, which leads to less blurring. Although image-to-image translation methods that are based on cGAN such as Pix2Pix enable a variety of translation applications, the generated images are still limited to being of low resolution and blurry. Wang et al. [77] proposed Pix2pixHD to increase the resolution of the output images to 2048 × 1024 by utilizing a coarse-to-fine generator, an architecture based on three multiscale discriminators, and a robust adversarial learning objective function. It is worth noting that previous studies have been limited to two domains. Thus, StarGAN [44] has been proposed as a unified GAN for multi-domain image-to-image translation using only a single generator and a discriminator. The generator is trained to translate an input image (x) to an output image (y), conditioning on domain label information c, G(x,c)→y. To learn the k(k−1) mappings among k domains across multiple databases, a mask vector m is utilized to control domain labels, ignore unknown labels, and focus on a particular label that belongs to a specific dataset. StarGAN requires a single generator and a discriminator to train on multiple databases simultaneously by adding an auxiliary classifier on top of the discriminator to control multiple domains and by applying cycle consistency to the generator [51].

Figure 4. Comparison between supervised and unsupervised image-to-image translation methods. (a) Supervised methods, such as Pix2Pix and BicycleGAN. (b) Unsupervised methods, such as CycleGAN, DualGAN, and DiscoGAN.

Figure 5. Example of supervised image-to-image translation, edge → photos [76].
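The Pix2Pix objective can be sketched as a conditional adversarial loss plus the L1 reconstruction term, L = L_cGAN(G, D) + λ L1(y, G(x)), with λ = 100 in the original paper. The fragment below is illustrative PyTorch only (the concatenation-based conditioning and logit-output PatchGAN discriminator are assumptions consistent with the paper, not its released code).

import torch
import torch.nn.functional as F

def pix2pix_g_loss(D, x, y, fake, lam=100.0):
    # D scores the concatenation of the conditioning image x and a candidate output
    pred_fake = D(torch.cat([x, fake], dim=1))
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    return adv + lam * F.l1_loss(fake, y)     # adversarial term + L1 toward the paired target y

def pix2pix_d_loss(D, x, y, fake):
    pred_real = D(torch.cat([x, y], dim=1))
    pred_fake = D(torch.cat([x, fake.detach()], dim=1))
    return 0.5 * (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) +
                  F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))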
LSGAN, which uses a U-Net for the generator and two PatchGANs for the discriminator. BicycleGAN
uses the least-squares loss function instead of the cross-entropy loss to stabilize the training process.
In addition, CEGAN [80] has been proposed as a novel image-to-image translation approach to learning
multi-modal mapping that is based on conditional generative models for generating realistic and
diverse images. CEGAN captures the distribution of possible multiple modes of results by enforcing
tight connections between the latent space and the real image space. The model consists of generator G,
discriminator D, and encoder E. In this model, unlike other proposed GAN methods, the discriminator
is used to distinguish between real and fake samples in the latent space instead of the real image space,
in order to reduce the impact of redundancy and noise and produce realistic-looking images.
CycleGAN [41] learns two mappings, G: X→Y and F: Y→X, with a separate adversarial loss for each direction:

L_GAN(G, D_Y, X, Y) = E_y∼Pdata(y)[log D_Y(y)] + E_x∼Pdata(x)[log(1 − D_Y(G(x)))] (3)

L_GAN(F, D_X, Y, X) = E_x∼Pdata(x)[log D_X(x)] + E_y∼Pdata(y)[log(1 − D_X(F(y)))] (4)

The cyclic consistency losses consist of forward and backward cyclic consistency terms, aiming to prevent the learned mappings G and F from contradicting each other:

L_cyc(G, F) = E_x∼Pdata(x)[||F(G(x)) − x||_1] + E_y∼Pdata(y)[||G(F(y)) − y||_1] (5)

The full objective combines both adversarial losses with the weighted cycle-consistency term:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F) (6)
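The generator-side part of Equations (3)–(6) can be sketched as a single loss (illustrative PyTorch only; function names and logit-output discriminators are assumptions, and λ = 10 follows the original CycleGAN paper, which in practice also swaps the log loss for a least-squares variant).

import torch
import torch.nn.functional as nnf

def cyclegan_generator_loss(G, F, D_X, D_Y, x, y, lam=10.0):
    fake_y, fake_x = G(x), F(y)
    pred_fy, pred_fx = D_Y(fake_y), D_X(fake_x)
    # adversarial terms pushing each translated image to look real, Eqs. (3)-(4)
    adv = nnf.binary_cross_entropy_with_logits(pred_fy, torch.ones_like(pred_fy)) + \
          nnf.binary_cross_entropy_with_logits(pred_fx, torch.ones_like(pred_fx))
    # forward and backward cycle consistency, Eq. (5)
    cyc = nnf.l1_loss(F(fake_y), x) + nnf.l1_loss(G(fake_x), y)
    return adv + lam * cyc   # weighted combination, Eq. (6)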
CycleGAN has been successfully applied to several image-to-image translation tasks, including
object transfiguration, collection style transfer, season transfer, photo enhancement, etc. However,
CycleGAN is only able to learn a one-to-one mapping. A recently proposed augmented cycleGAN [82]
introduces an idea that is similar to cycleGAN, but it learns a many-to-many mapping between two
domains in an unsupervised setting by adding auxiliary latent codes to each domain.
Similarly, DiscoGAN [83] and DualGAN [84] have been proposed at the same time to tackle the
unpaired image-to-image translation problem based on cyclic consistency. However, DiscoGAN learns
to discover relations among different domains using the reconstruction losses, while DualGAN is
based on dual learning using the Wasserstein GAN. More recently, AsymGAN [85] has been proposed;
it uses an asymmetric framework to model unpaired image-to-image translation between asymmetric
domains by adding an auxiliary variable (aux). The aux is used to learn the extra information between
the poor and rich domains. The difficulties involving complexity between different domains and
imbalanced information are resolved and better balanced using this variable. By utilizing different aux
variables, AsymGAN is able to generate diverse, higher-quality outputs, unlike CycleGAN and related
approaches that focus on
unsupervised pixel-level image-to-image translation. Chen et al. [86] propose a unified quality-aware
GAN (QGAN) framework that is based on a quality loss for unpaired image-to-image translation.
The QGAN design involves two detailed implementations of the quality-aware loss—a classical quality
assessment loss and an adaptive perceptual quality-aware loss—in addition to the adversarial loss.
XGAN [43] is a proposed dual adversarial autoencoder for unsupervised image-to-image translation
based on a semantic consistency loss. XGAN captures the shared feature representation of both domains
to learn the joint feature-level information rather than pixel-level information. A semantic consistency
loss is used in both domains’ translation to preserve the semantic content of the image across domains.
SPA-GAN [87] introduces an effective spatial attention mechanism for unsupervised image-to-image
translation tasks. SPA-GAN computes the spatial attention maps in the discriminator and directly
feeds them back to the generator, which forces the generator to focus on the most discriminative areas in
image-to-image translation. SPA-GAN also introduces a feature map loss term, defined as the difference
of feature maps of real and fake images, to encourage the generators to preserve domain-specific
features during image translation, which leads to more realistic output images.
architecture for two domains. BranchGAN transfers one image from one domain to another domain by
exploiting the shared distribution of the two domains with the same encoder. It uses a reconstruction
loss, an encoding loss, and an adversarial loss to train the model to learn the joint distribution of two
image domains.
7.1. Datasets
There are several benchmark datasets that are available for image synthesis tasks that can be
utilized to perform image-to-image translation tasks. Such datasets differ in image counts, quality,
resolution, complexity, and diversity, and they allow researchers to investigate a variety of practical
applications such as facial attributes, cartoon faces, semantic applications, and urban scene analysis.
Table 4 summarizes the selected benchmark datasets.
• The inception score (IS) [104] is an automated metric for evaluating the visual quality of generated
images by computing the KL divergence between the conditional class distribution and the marginal
class distribution via inception networks. IS aims to measure the image quality and diversity.
However, the IS metric has two limitations: (1) a high sensitivity to small changes and (2) a large
variance of scores [105].
• The Amazon Mechanical Turk (AMT) is used to measure the realism and faithfulness of the
translated images that are based on human perception. Workers (“turkers”) are given an input
image and translated images and are instructed to choose or score the best image based on quality
and perceptual realism. The number of validated turkers varies by experiment.
• The Frechet inception distance (FID) [106] is used to evaluate the quality of the generated images
and measure the similarity between two different datasets [80]. It measures the distance between the
generated images’ distribution and the real image distribution by computing the Frechet distance
between Inception-network feature statistics. FID captures the distributions accurately and is considered
to be more robust to noise than IS. Lower FID values indicate better quality of the generated
samples [107] (a minimal computation sketch is given after this list).
• The kernel inception distance (KID) [108] is an improved measure of GAN convergence that has a
simple unbiased estimator with no unnecessary assumptions regarding the form of the activations’
distribution. KID involves a computation of the squared maximum mean discrepancy between
representations of reference and generated distributions [87]. A lower KID score signifies better
visual quality of generated images.
• The learned perceptual image patch similarity (LPIPS) distance [109] measures the image
translation diversity by computing the average feature distance between the generated images.
LPIPS is defined as a weighted L2 distance between deep features of two images. A higher LPIPS
value indicates greater diversity among the generated images.
• Fully Convolutional Networks (FCN) [110] can be used to compute the FCN-score that uses the
FCN model as a performance metric in order to evaluate the image quality by segmenting the
generated image and comparing it with the ground truth label using a well-trained segmentation
FCN model. A smaller value of the FCN-score between the generated image and ground truth
means better performance. The FCN-score is calculated based on three parts: per-pixel accuracy,
per-class accuracy, and class intersection-over-union (IOU).
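As referenced in the FID bullet above, the score can be computed from the means and covariances of Inception features as FID = ||mu_r − mu_g||^2 + Tr(S_r + S_g − 2(S_r S_g)^(1/2)). The sketch below assumes the feature matrices have already been extracted (feature extraction and any library wrappers are omitted; function and variable names are ours).

import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    # feats_real, feats_gen: arrays of shape (n_samples, feature_dim)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):       # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(s_r + s_g - 2.0 * covmean))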
7.3.1. Super-Resolution
Super-resolution (SR) refers to the process of translating a low-resolution source image to a
high-resolution target image. GANs have recently been used to solve super-resolution problems in an
end-to-end manner [111–114] by treating the generator as an SR model to output a high-resolution
image and using the discriminator as a binary classifier. SRGAN [115] adds a new distributional loss
term in order to generate an upsampled image with the resolution increased fourfold; it is based
on the DCGAN architecture with a residual block. ESRGAN [15] has been proposed to improve the
overall perceptual quality of SR results by introducing the residual-in-residual dense block (RRDB)
without batch normalization in order to further enhance the quality of generated images.

7.3.2. Style Transfer

Style transfer is the process of rendering the content of an image with a specific style while preserving the content, as shown in Figure 6. The earlier style transfer models could only generate one image and transfer it according to one style. However, recent studies attempt to transfer image content according to multiple styles based on a perceptual loss. In addition, with the advancement in deep generative models, adversarial losses can also be used to train the style model to make the output image indistinguishable from images in the targeted style domain. Style transfer is a practical application of image-to-image translation. Chen et al. [116] propose an adversarial gated network, called Gated-GAN, to transfer multiple styles while using a single model based on three modalities: an encoder, a gated transformer, and a decoder. GANILLA [117] is a proposed novel framework with the ability to better balance between content and style.

Figure 6. Style transfer applications with (a) inter-domain attribute transfer and (b) intra-domain attribute transfer [95].
7.3.3. Object Transfiguration

Object transfiguration aims to detect the object of interest in an image and then transform it into another object in the target domain while preserving the background regions, e.g., transforming an apple into an orange, or a horse into a zebra. GANs have been explored in object transfiguration to perform two tasks: (1) to detect the object of interest in an image and (2) to transform the object into a target domain. Attention GANs [118,119] are mostly used for object transfiguration; such a model consists of an attention network, a transformation network, and a discriminative network.
Image-to-image translation with GANs and GAN variants usually suffers from the mode collapse
issue that occurs when the generator only generates the same output, regardless of whether it uses a
single input or operates on multiple modes, e.g., when two images I1 = G(c, z1) and I2 = G(c, z2) are
likely to be mapped to the same mode [107]. There are two types of mode collapse: inter-mode and
intra-mode collapse. Inter-mode collapse occurs when the expected output is known, e.g., if digits (0–9)
are used and the generator keeps generating the same number to fool the discriminator. In contrast,
intra-mode collapse usually happens if the generator only learns one style of the expected output to
fool the discriminator. Many proposals have recently been made to alleviate and avoid mode collapse;
example approaches include LSGAN [59], the use of a mode-seeking regularization term [107], and cycle
consistency [84,95]. However, the mode collapse problem still has not been completely solved and it is
considered to be one of the open issues of image-to-image translation tasks.
As mentioned above, several evaluation methods have been proposed [104,106,108,109] to measure
and assess the quality of the translated images and investigate the strengths and limitations of the used
models. These evaluation measures can be categorized into quantitative and qualitative. They have
been further explored in [122], where the difference between the two categories has been investigated
in depth. Metrics of success of image-to-image translation usually evaluate the quality of generated
images while using a limited number of test images or user studies. The evaluation of a limited number
of test images must consider both style and content simultaneously, which is difficult to do [117].
In addition, user studies are based on human judgment, which is a subjective metric [1]. However,
there is no well-defined evaluation metric, and it is still difficult to accurately assess the quality of
generated images, since there is no strict one-to-one correspondence between the translated image and
the input image [1].
• Lack of diversity
Image-to-image translation diversity is related to the quality of diverse generated outputs utilizing
multi-modal and multi-domain mapping, as mentioned above. Several approaches [42,78,79] injected
a random noise vector into the generator in order to model a diverse distribution in the target domain.
One of the existing limitations of image-to-image translation is the lack of diversity of generated
images due to the lack of regularization between the random noise and the target domain [79].
The DRIT [79] and DRIT++ [95] methods have been proposed to improve the diversity of generated
images; however, generating diverse outputs for producing high quality and diverse images has not
yet been fully explored.
and volumetric convolutional networks can play a very important role in the future in generating 3D
Images. Furthermore, cybersecurity applications should utilize image-to-image translation with GAN
to design reliable and efficient systems. The image steganography that is based on GAN should be
further investigated and developed in order to overcome critical cybersecurity issues and challenges.
9. Conclusions
Image-to-image translation with GANs has achieved great success in computer vision applications.
This article presents a comprehensive overview of GAN variants that are related to image-to-image
translation based on their algorithms, objective functions, and structures. Recent state-of-the-art
image-to-image translation techniques, both supervised and unsupervised, are surveyed and classified.
In addition, benchmark datasets, evaluation metrics and practical applications are summarized.
This review paper covers open issues that are related to mode collapse, evaluation metrics, and lack
of diversity. Finally, reinforcement learning and 3D models have not been fully explored and are
suggested as future directions towards better performance on image-to-image translation tasks. In the
future, quantum generative adversarial networks for image-to-image translation will be further explored
and implemented in order to overcome complex problems related to image generation.
Appendix A
References
1. Huang, H.; Yu, P.S.; Wang, C. An introduction to image synthesis with generative adversarial nets. arXiv 2018,
arXiv:1803.04469.
2. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief
review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [CrossRef] [PubMed]
3. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
27–30 June 2016; pp. 779–788.
5. Suzuki, K. Overview of deep learning in medical imaging. Radiol. Phys. Technol. 2017, 10, 257–273. [CrossRef]
6. Zhao, D.; Zhu, D.; Lu, J.; Luo, Y.; Zhang, G. Synthetic medical images using F&BGAN for improved lung
nodules classification by multi-scale VGG16. Symmetry 2018, 10, 519.
7. Ma, B.; Ban, X.; Huang, H.; Chen, Y.; Liu, W.; Zhi, Y. Deep learning-based image segmentation for Al-La alloy
microscopic images. Symmetry 2018, 10, 107. [CrossRef]
8. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition.
Proc. IEEE 1998, 86, 2278–2324. [CrossRef]
9. Alotaibi, A.; Mahmood, A. Deep face liveness detection based on nonlinear diffusion using convolution
neural network. Signal Image Video Process. 2017, 11, 713–720. [CrossRef]
10. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks.
In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December
2012; pp. 1097–1105.
11. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014,
arXiv:1409.1556.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7,
63373–63394. [CrossRef]
14. Ng, A.Y.; Jordan, M.I. On discriminative vs. generative classifiers: A comparison of logistic regression and
naive bayes. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC,
Canada, 3–8 December 2001; pp. 841–848.
15. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution
generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV),
Munich, Germany, 8–14 September 2018; pp. 63–79.
16. Li, C.; Wang, L.; Cheng, S.; Ao, N. Generative Adversarial Network-Based Super-Resolution Considering
Quantitative and Perceptual Quality. Symmetry 2020, 12, 449. [CrossRef]
17. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis.
arXiv 2016, arXiv:1605.05396.
18. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. Stackgan: Text to photo-realistic
image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5907–5915.
19. Bhattacharjee, D.; Kim, S.; Vizier, G.; Salzmann, M. DUNIT: Detection-Based Unsupervised Image-to-Image
Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Seattle, WA, USA, 13–19 June 2020; pp. 4787–4796.
20. Venkateswara, H.; Chakraborty, S.; Panchanathan, S. Deep-learning systems for domain adaptation in
computer vision: Learning transferable feature representations. IEEE Signal Process. Mag. 2017, 34, 117–129.
[CrossRef]
21. Cao, Y.-J.; Jia, L.-L.; Chen, Y.-X.; Lin, N.; Yang, C.; Zhang, B.; Liu, Z.; Li, X.-X.; Dai, H.-H. Recent advances of
generative adversarial networks in computer vision. IEEE Access 2018, 7, 14985–15006. [CrossRef]
22. Wang, K.; Gou, C.; Duan, Y.; Lin, Y.; Zheng, X.; Wang, F.-Y. Generative adversarial networks: Introduction
and outlook. IEEE/CAA J. Autom. Sin. 2017, 4, 588–598. [CrossRef]
23. Rasmussen, C.E. The infinite Gaussian mixture model. In Proceedings of the Advances in Neural Information
Processing Systems, Denver, CO, USA, 6 December 2020; pp. 554–560.
24. Jiang, L.; Zhang, H.; Cai, Z. A novel Bayes model: Hidden naive Bayes. IEEE Trans. Knowl. Data Eng. 2008,
21, 1361–1371. [CrossRef]
25. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE
1989, 77, 257–286. [CrossRef]
26. Maaløe, L.; Sønderby, C.K.; Sønderby, S.K.; Winther, O. Auxiliary deep generative models. arXiv 2016,
arXiv:1602.05473.
27. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.-L.; Chen, S.-C.; Iyengar, S. A survey
on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–36.
[CrossRef]
28. Oussidi, A.; Elhassouny, A. Deep generative models: Survey. In Proceedings of the 2018 International
Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 2–4 April 2018; pp. 1–8.
29. Salakhutdinov, R.; Hinton, G. Deep boltzmann machines. In Proceedings of the Artificial Intelligence and
Statistics, Clearwater, FL, USA, 16–19 April 2009; pp. 448–455.
30. Hinton, G.E. Deep belief networks. Scholarpedia 2009, 4, 5947. [CrossRef]
31. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
32. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y.
Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems,
Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
33. Abbasnejad, M.E.; Shi, Q.; van den Hengel, A.; Liu, L. A generative adversarial density estimator. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
15–21 June 2019; pp. 10782–10791.
34. Wang, Z.; She, Q.; Ward, T.E. Generative adversarial networks in computer vision: A survey and taxonomy.
arXiv 2019, arXiv:1906.01529.
35. Tang, H.; Xu, D.; Liu, H.; Sebe, N. Asymmetric Generative Adversarial Networks for Image-to-Image
Translation. arXiv 2019, arXiv:1912.06931.
36. Regmi, K.; Borji, A. Cross-view image synthesis using conditional gans. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 3501–3510.
37. Lin, C.-H.; Yumer, E.; Wang, O.; Shechtman, E.; Lucey, S. St-gan: Spatial transformer generative adversarial
networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 9455–9464.
38. Mo, S.; Cho, M.; Shin, J. Instagan: Instance-aware image-to-image translation. arXiv 2018, arXiv:1812.10889.
39. Kahng, M.; Thorat, N.; Chau, D.H.P.; Viégas, F.B.; Wattenberg, M. Gan lab: Understanding complex deep
generative models using interactive visual experimentation. IEEE Trans. Vis. Comput. Graph. 2018, 25, 1–11.
[CrossRef]
40. Hertzmann, A.; Jacobs, C.E.; Oliver, N.; Curless, B.; Salesin, D.H. Image analogies. In Proceedings of the 28th
Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 12–17 August
2001; pp. 327–340.
41. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent
adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice,
Italy, 22–29 October 2017; pp. 2223–2232.
42. Huang, X.; Liu, M.-Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation.
In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September
2018; pp. 172–189.
43. Royer, A.; Bousmalis, K.; Gouws, S.; Bertsch, F.; Mosseri, I.; Cole, F.; Murphy, K. Xgan: Unsupervised
image-to-image translation for many-to-many mappings. In Domain Adaptation for Visual Understanding;
Springer: Berlin/Heidelberg, Germany, 2020; pp. 33–49.
44. Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for
multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 8789–8797.
45. Zhao, B.; Chang, B.; Jie, Z.; Sigal, L. Modular generative adversarial networks. In Proceedings of the European
Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 150–165.
46. Tao, R.; Li, Z.; Tao, R.; Li, B. ResAttr-GAN: Unpaired Deep Residual Attributes Learning for Multi-Domain
Face Image Translation. IEEE Access 2019, 7, 132594–132608. [CrossRef]
47. Hong, Y.; Hwang, U.; Yoo, J.; Yoon, S. How generative adversarial networks and their variants work:
An overview. ACM Comput. Surv. (CSUR) 2019, 52, 1–43. [CrossRef]
48. Pan, Z.; Yu, W.; Yi, X.; Khan, A.; Yuan, F.; Zheng, Y. Recent progress on generative adversarial networks
(GANs): A survey. IEEE Access 2019, 7, 36322–36333. [CrossRef]
49. Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A review on generative adversarial networks: Algorithms, theory,
and applications. arXiv 2020, arXiv:2001.06937.
50. Wang, L.; Chen, W.; Yang, W.; Bi, F.; Yu, F.R. A State-of-the-Art Review on Image Synthesis with Generative
Adversarial Networks. IEEE Access 2020, 8, 63514–63537. [CrossRef]
51. Wu, X.; Xu, K.; Hall, P. A survey of image synthesis and editing with generative adversarial networks.
Tsinghua Sci. Technol. 2017, 22, 660–674. [CrossRef]
52. Gonog, L.; Zhou, Y. A review: Generative adversarial networks. In Proceedings of the 2019 14th IEEE
Conference on Industrial Electronics and Applications (ICIEA), Xi’an, China, 19–21 June 2019; pp. 505–510.
53. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
54. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation
learning by information maximizing generative adversarial nets. In Proceedings of the Advances in Neural
Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2172–2180.
55. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial
networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [CrossRef]
56. Spurr, A.; Aksan, E.; Hilliges, O. Guiding infogan with semi-supervision. In Proceedings of the Joint
European Conference on Machine Learning and Knowledge Discovery in Databases, Skopje, Macedonia,
18–22 September 2017; pp. 119–134.
57. Kurutach, T.; Tamar, A.; Yang, G.; Russell, S.J.; Abbeel, P. Learning plannable representations with causal
infogan. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada,
3–8 December 2018; pp. 8733–8744.
58. Brock, A.; Donahue, J.; Simonyan, K. Large scale gan training for high fidelity natural image synthesis.
arXiv 2018, arXiv:1809.11096.
59. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks.
In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017;
pp. 2794–2802.
60. Nowozin, S.; Cseke, B.; Tomioka, R. f-gan: Training generative neural samplers using variational divergence
minimization. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain,
5–10 December 2016; pp. 271–279.
61. Mroueh, Y.; Sercu, T. Fisher gan. In Proceedings of the Advances in Neural Information Processing Systems,
Long Beach, CA, USA, 4–9 December 2017; pp. 2513–2523.
62. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein gan. arXiv 2017, arXiv:1701.07875.
63. Mroueh, Y.; Sercu, T.; Goel, V. Mcgan: Mean and covariance feature matching gan. arXiv 2017, arXiv:1702.08398.
64. Li, Y.; Swersky, K.; Zemel, R. Generative moment matching networks. In Proceedings of the International
Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1718–1727.
65. Li, C.-L.; Chang, W.-C.; Cheng, Y.; Yang, Y.; Póczos, B. Mmd gan: Towards deeper understanding of moment
matching network. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach,
CA, USA, 4–9 December 2017; pp. 2203–2213.
66. Zhao, J.; Mathieu, M.; LeCun, Y. Energy-based generative adversarial network. arXiv 2016, arXiv:1609.03126.
67. Berthelot, D.; Schumm, T.; Metz, L. Began: Boundary equilibrium generative adversarial networks. arXiv 2017,
arXiv:1703.10717.
68. Wang, R.; Cully, A.; Chang, H.J.; Demiris, Y. Magan: Margin adaptation for generative adversarial networks.
arXiv 2017, arXiv:1704.03817.
69. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of wasserstein
gans. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA,
4–9 December 2017; pp. 5767–5777.
70. Pan, Z.; Yu, W.; Wang, B.; Xie, H.; Sheng, V.S.; Lei, J.; Kwong, S. Loss Functions of Generative Adversarial
Networks (GANs): Opportunities and Challenges. IEEE Trans. Emerg. Top. Comput. Intell. 2020, 4, 500–522.
[CrossRef]
71. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative
adversarial networks. arXiv 2015, arXiv:1511.06434.
72. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks.
In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019;
pp. 7354–7363.
73. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability,
and variation. arXiv 2017, arXiv:1710.10196.
74. Denton, E.L.; Chintala, S.; Fergus, R. Deep generative image models using a laplacian pyramid of adversarial
networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada,
7–12 December 2015; pp. 1486–1494.
75. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned
similarity metric. In Proceedings of the International Conference on Machine Learning, New York, NY, USA,
20–22 June 2016; pp. 1558–1566.
76. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 1125–1134.
77. Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and
semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 8798–8807.
78. Zhu, J.-Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal
image-to-image translation. In Proceedings of the Advances in Neural Information Processing Systems,
Long Beach, CA, USA, 4–9 December 2017; pp. 465–476.
79. Lee, H.-Y.; Tseng, H.-Y.; Huang, J.-B.; Singh, M.; Yang, M.-H. Diverse image-to-image translation via
disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV),
Munich, Germany, 8–14 September 2018; pp. 35–51.
80. Xiong, F.; Wang, Q.; Gao, Q. Consistent Embedded GAN for Image-to-Image Translation. IEEE Access 2019,
7, 126651–126661. [CrossRef]
81. Tripathy, S.; Kannala, J.; Rahtu, E. Learning image-to-image translation using paired and unpaired training
samples. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018;
pp. 51–66.
82. Almahairi, A.; Rajeswar, S.; Sordoni, A.; Bachman, P.; Courville, A. Augmented cyclegan: Learning
many-to-many mappings from unpaired data. arXiv 2018, arXiv:1802.10151.
83. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative
adversarial networks. arXiv 2017, arXiv:1703.05192.
84. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation.
In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017;
pp. 2849–2857.
85. Li, Y.; Tang, S.; Zhang, R.; Zhang, Y.; Li, J.; Yan, S. Asymmetric GAN for unpaired image-to-image translation.
IEEE Trans. Image Process. 2019, 28, 5881–5896. [CrossRef]
86. Chen, L.; Wu, L.; Hu, Z.; Wang, M. Quality-aware unpaired image-to-image translation. IEEE Trans. Multimed.
2019, 21, 2664–2674. [CrossRef]
87. Emami, H.; Aliabadi, M.M.; Dong, M.; Chinnam, R. Spa-gan: Spatial attention gan for image-to-image
translation. IEEE Trans. Multimed. 2020. [CrossRef]
88. Liu, M.-Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Proceedings of the
Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 700–708.
89. Liu, M.-Y.; Tuzel, O. Coupled generative adversarial networks. In Proceedings of the Advances in Neural
Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 469–477.
90. Liu, Y.; De Nadai, M.; Yao, J.; Sebe, N.; Lepri, B.; Alameda-Pineda, X. GMM-UNIT: Unsupervised
Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling.
arXiv 2020, arXiv:2003.06788.
91. Zhou, Y.-F.; Jiang, R.-H.; Wu, X.; He, J.-Y.; Weng, S.; Peng, Q. Branchgan: Unsupervised mutual image-to-image
transfer with a single encoder and dual decoders. IEEE Trans. Multimed. 2019, 21, 3136–3149. [CrossRef]
92. Liu, A.H.; Liu, Y.-C.; Yeh, Y.-Y.; Wang, Y.-C.F. A unified feature disentangler for multi-domain image
translation and manipulation. In Proceedings of the Advances in Neural Information Processing Systems,
Montréal, QC, Canada, 3–8 December 2018; pp. 2590–2599.
93. Lin, J.; Chen, Z.; Xia, Y.; Liu, S.; Qin, T.; Luo, J. Exploring explicit domain supervision for latent space
disentanglement in unpaired image-to-image translation. IEEE Trans. Pattern Anal. Mach. Intell. 2019. [CrossRef]
94. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE
International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738.
95. Lee, H.-Y.; Tseng, H.-Y.; Mao, Q.; Huang, J.-B.; Lu, Y.-D.; Singh, M.; Yang, M.-H. Drit++: Diverse
image-to-image translation via disentangled representations. Int. J. Comput. Vis. 2020, 1–16. [CrossRef]
96. Wu, P.-W.; Lin, Y.-J.; Chang, C.-H.; Chang, E.Y.; Liao, S.-W. Relgan: Multi-domain image-to-image translation
via relative attributes. In Proceedings of the IEEE International Conference on Computer Vision, Seoul,
Korea, 27 October–2 November 2019; pp. 5914–5922.
97. Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D.H.; Hawk, S.T.; Van Knippenberg, A. Presentation and
validation of the Radboud Faces Database. Cogn. Emot. 2010, 24, 1377–1388. [CrossRef]
98. Tyleček, R.; Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Proceedings
of the German Conference on Pattern Recognition, Saarbrücken, Germany, 3–6 September 2013; pp. 364–374.
99. Shen, Z.; Huang, M.; Shi, J.; Xue, X.; Huang, T.S. Towards instance-level image-to-image translation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
15–21 June 2019; pp. 3683–3692.
100. Ng, H.-W.; Winkler, S. A data-driven approach to cleaning large face datasets. In Proceedings of the 2014
IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 343–347.
101. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B.
The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
102. Le, V.; Brandt, J.; Lin, Z.; Bourdev, L.; Huang, T.S. Interactive facial feature localization. In Proceedings of the
European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 679–692.
103. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database.
In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA,
20–25 June 2009; pp. 248–255.
104. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for
training gans. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain,
5–10 December 2016; pp. 2234–2242.
105. Shmelkov, K.; Schmid, C.; Alahari, K. How good is my GAN? In Proceedings of the European Conference on
Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 213–229.
106. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update
rule converge to a local nash equilibrium. In Proceedings of the Advances in Neural Information Processing
Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6626–6637.
107. Mao, Q.; Lee, H.-Y.; Tseng, H.-Y.; Ma, S.; Yang, M.-H. Mode seeking generative adversarial networks for
diverse image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Long Beach, CA, USA, 15–21 June 2019; pp. 1429–1437.
108. Bińkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying mmd gans. arXiv 2018, arXiv:1801.01401.
109. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a
perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Salt Lake City, UT, USA, 19–21 June 2018; pp. 586–595.
110. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015;
pp. 3431–3440.
111. Huang, H.; He, R.; Sun, Z.; Tan, T. Wavelet domain generative adversarial network for multi-scale face
hallucination. Int. J. Comput. Vis. 2019, 127, 763–784. [CrossRef]
112. Zhang, W.; Liu, Y.; Dong, C.; Qiao, Y. Ranksrgan: Generative adversarial networks with ranker for image
super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea,
27 October–2 November 2019; pp. 3096–3105.
113. Wang, Y.; Perazzi, F.; McWilliams, B.; Sorkine-Hornung, A.; Sorkine-Hornung, O.; Schroers, C. A fully
progressive approach to single-image super-resolution. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 864–873.
114. Yuan, Y.; Liu, S.; Zhang, J.; Zhang, Y.; Dong, C.; Lin, L. Unsupervised image super-resolution using
cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 701–710.
115. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.;
Wang, Z. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 4681–4690.
116. Chen, X.; Xu, C.; Yang, X.; Song, L.; Tao, D. Gated-gan: Adversarial gated networks for multi-collection style
transfer. IEEE Trans. Image Process. 2018, 28, 546–560. [CrossRef] [PubMed]
117. Hicsonmez, S.; Samet, N.; Akbas, E.; Duygulu, P. GANILLA: Generative adversarial networks for image to
illustration translation. Image Vis. Comput. 2020, 95, 103886. [CrossRef]
118. Ma, S.; Fu, J.; Wen Chen, C.; Mei, T. Da-gan: Instance-level image translation by deep attention generative
adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Salt Lake City, UT, USA, 19–21 June 2018; pp. 5657–5666.
119. Chen, X.; Xu, C.; Yang, X.; Tao, D. Attention-gan for object transfiguration in wild images. In Proceedings of
the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 164–180.
120. Yi, X.; Walia, E.; Babyn, P. Generative adversarial network in medical imaging: A review. Med. Image Anal.
2019, 58, 101552. [CrossRef]
121. Armanious, K.; Jiang, C.; Fischer, M.; Küstner, T.; Hepp, T.; Nikolaou, K.; Gatidis, S.; Yang, B. MedGAN:
Medical image translation using GANs. Comput. Med. Imaging Graph. 2020, 79, 101684. [CrossRef]
122. Borji, A. Pros and cons of gan evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65. [CrossRef]
123. Furuta, R.; Inoue, N.; Yamasaki, T. Fully convolutional network with multi-step reinforcement learning for
image processing. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA,
27 January–1 February 2019; pp. 3598–3605.
124. Kosugi, S.; Yamasaki, T. Unpaired image enhancement featuring reinforcement-learning-controlled image
editing software. arXiv 2019, arXiv:1912.07833.
125. Ganin, Y.; Kulkarni, T.; Babuschkin, I.; Eslami, S.; Vinyals, O. Synthesizing programs for images using
reinforced adversarial learning. arXiv 2018, arXiv:1804.01118.
126. Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of
the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2642–2651.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional
affiliations.
© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).