0% found this document useful (0 votes)
177 views

A Survey of Image Synthesis and Editing With Generative Adversarial Networks PDF

Generative Adversarial Networks (GANs) consist of two neural networks, a generator and a discriminator, that are trained in competition. The generator produces synthetic images to fool the discriminator, while the discriminator tries to distinguish real from generated images. Due to their ability to model complex, high-dimensional distributions, GANs have shown great potential for image synthesis and editing applications such as texture synthesis, image inpainting, and image-to-image translation. This document surveys recent papers applying GANs for these and other image generation tasks.

Uploaded by

zhenhua wu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
177 views

A Survey of Image Synthesis and Editing With Generative Adversarial Networks PDF

Generative Adversarial Networks (GANs) consist of two neural networks, a generator and a discriminator, that are trained in competition. The generator produces synthetic images to fool the discriminator, while the discriminator tries to distinguish real from generated images. Due to their ability to model complex, high-dimensional distributions, GANs have shown great potential for image synthesis and editing applications such as texture synthesis, image inpainting, and image-to-image translation. This document surveys recent papers applying GANs for these and other image generation tasks.

Uploaded by

zhenhua wu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

TSINGHUA SCIENCE AND TECHNOLOGY

ISSNll1007-0214ll09/15llpp660-674
Volume 22, Number 6, December 2017

A Survey of Image Synthesis and Editing with Generative Adversarial


Networks

Xian Wu, Kun Xu , and Peter Hall

Abstract: This paper presents a survey of image synthesis and editing with Generative Adversarial Networks
(GANs). GANs consist of two deep networks, a generator and a discriminator, which are trained in a competitive
way. Due to the power of deep networks and the competitive training manner, GANs are capable of producing
reasonable and realistic images, and have shown great capability in many image synthesis and editing applications.
This paper surveys recent GAN papers regarding topics including, but not limited to, texture synthesis, image
inpainting, image-to-image translation, and image editing.

Key words: image synthesis; image editing; constrained image synthesis; generative adversarial networks; image-
to-image translation

is because these traditional methods are mostly


based on pixels[1, 4, 11] , patches[8, 10, 12] , and low-level
1 Introduction image features[3, 13] , and lacking high-level semantic
information.
With the rapid development of Internet and digital
In recent years, deep learning techniques have made
capturing devices, huge volumes of images have
a breakthrough in computer vision. Trained using
become readily available. There are now widespread
large-scale data, deep neural networks substantially
demands for tasks requiring synthesizing and editing
outperform previous techniques with regard to the
images, such as removing unwanted objects in
semantic understanding of images. They claim
wedding photographs, adjusting the colors of
state-of-the-art in various tasks, including image
landscape images, and turning photographs into
classification[14–16] , object detection[17, 18] , image
artwork or vice-versa. These and other problems have
segmentation[19, 20] , etc.
attracted significant attention within both the computer
Deep learning has also shown great ability in content
graphics and computer vision communities. A variety
generation. In 2014 Goodfellow et al.[21] proposed
of methods have been proposed for image/video
a generative model, called Generative Adversarial
editing and synthesis, including texture synthesis[1–3] ,
Networks (GANs). GANs contain two networks,
image inpainting[4–6] , image stylization[7, 8] , image
a generator and a discriminator. The discriminator
deformation[9, 10] , and so on. Although many methods
tries to distinguish fake images from real ones; the
have been proposed, intelligent image synthesis
generator produces fake images but it tries to fool the
and editing remains a challenging problem. This
discriminator. Both networks are jointly trained in
a competitive way. The resulting generator is able
 Xian Wu and Kun Xu are with TNList and the Department
to synthesize plausible images. GAN variants have
of Computer Science and Technology, Tsinghua University,
Beijing 100084, China. E-mail: [email protected]. now achieved impressive results in a variety of image
 Peter Hall is with the Department of Computer Science, synthesis and editing applications.
University of Bath, Bath, UK. In this survey, we cover recent papers that leverage
 To whom correspondence should be addressed. GANs for image synthesis and editing applications.
Manuscript received: 2017-11-15; accepted: 2017-11-20 This survey discusses the ideas, contributions, and
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 661

drawbacks of these networks. This survey is structured the discriminator always produces 1/2[21] . However,
as follows. Section 2 provides a brief introduction in practice, training GANs is difficult because of
to GANs and related variants. Section 3 discusses several reasons: firstly, networks converge is difficult
applications in image synthesis, including texture to achieve[23] ; secondly, GANs often get into “mode
synthesis, image impainting, and face and human collapse”, in which the generator produces the same or
image synthesis. Section 4 discusses applications in similar images for different noise vectors z. Various
constrained image synthesis, including general image- extensions of GANs have been proposed to improve
to-image translation, text-to-image, and sketch-to- training stability[23–27] .
image. Section 5 discusses applications in image cGANs. Mirza and Osindero[28] introduced
editing and video generation. Finally, Section 6 conditional Generative Adversarial Networks (cGANs),
provides a summary discussion and current challenges which extends GANs into a conditional model. In
and limitations of GAN based methods. cGANs, the generator G and the discriminator D
are conditioned on some extra information c. This is
2 Generative Adversarial Networks done by putting c as additional inputs to both G and
GANs were proposed by Goodfellow et al.[21] in 2014. D. The extra information could be class labels, text,
They contain two networks, a generator G and a or sketches. cGANs provide additional controls on
discriminator D. The generator tries to create fake which kind of data are being generated, while the
but plausible images, while the discriminator tries to original GANs do not have such controls. It makes
distinguish fake images (produced by the generator) cGANs popular for image synthesis and image editing
from real images. Formally, the generator G maps a applications. The structure of cGANs is illustrated as
noise vector z in the latent space to an image: G.z/ ! x, Fig. 2.
and the discriminator is defined as D.x/ ! Œ0; 1, DCGANs. Radford et al.[23] presented Deep
which classifies an image as a real image (i.e., close to Convolutional Generative Adversarial Networks
1) or as a fake image (i.e., close to 1). (DCGANs). They proposed a class of architecturally
To train the networks, the loss function is formulated constrained convolution networks for both generator
as and discriminator. The architectural constraints
include: (1) replacing all pooling layers with strided
min max Ex2X Œlog D.x/ C Ez2Z Œlog.1 D.G.z///
G D convolutions and fractional-strided convolutions; (2)
(1) using batchnorm layers; (3) removing fully connected
where X denotes the set of real images, Z denotes hidden layers; (4) in the generator, using tan h as
the latent space. The above loss function (Eq. (1)) is the activation function and using Rectified Linear
referred to as the adversarial loss. The two networks are Units (ReLU) activation for other layers; and (5) in
trained in a competitive fashion with back propagation. the discriminator, using LeakyReLU activation in
The structure of GANs is illustrated as Fig. 1. discriminator for all layers. DCGANs have shown to be
Compared with other generative models such as more stable in training and are able to produce higher
Variational AutoEncoders (VAEs)[22] , images generated quality images, hence they have been widely used in
by GANs are usually less blurred and more realistic. many applications.
It is also theoretically proven that optimal GANs LAPGAN. Laplacian Generative Adversarial
exist, that is the generator perfectly produces images Networks (LAPGAN)[29] are composed of a cascade of
which match the distributions of real images well, and convolutional GANs with the framework of a Laplacian

Fig. 1 Structure of GANs. Fig. 2 Structure of cGANs.


662 Tsinghua Science and Technology, December 2017, 22(6): 660–674

pyramid with K levels. At the coarsest level, K, a GAN Gatys et al.[31] introduced the first CNN-based
is trained which maps a noise vector to an image with method for texture synthesis. To characterize a texture,
the coarsest resolution. At each level of the pyramid they defined a Gram-matrix representation. By feeding
except the coarsest one (i.e., level k, 0 6 k < K), a the texture into a pre-trained VGG19[15] , the Gram
separate cGAN is trained, which takes the output image matrices are computed by the correlations of feature
in the coarser level (i.e., level k C 1) as a conditional responses in some layers. The target texture is obtained
variable to generate the residual image at this level. by minimizing the distance between the Gram-matrix
Due to such a coarse-to-fine manner, LAPGANs are representation of the target texture and that of the input
able to produce images with higher resolutions. texture. The target texture starts from random noise,
Other extensions. Zhao et al.[24] proposed and is iteratively optimized through back propagation,
an Energy-Based Generative Adversarial Network hence, its computational cost is expensive.
(EBGAN), which views the discriminator as an energy MGANs. Li and Wand[32] proposed a real-time
function instead of a probability function. They texture synthesis method. They first introduced
showed that EBGANs are more stable in training. To Markovian Deconvolutional Adversarial Networks
overcome the vanishing gradient problem, Mao et (MDANs). Given a content image xc (e.g., a face
al.[25] proposed Least Squares Generative Adversarial image) and a texture image xt (i.e., a texture image of
Networks (LSGAN), which replace the log function by leaves), MDANs synthesize a target image xs (e.g., a
least square function in the adversarial loss. Arjovsky face image textured by leaves). Feature maps of an
et al.[26] proposed Wasserstein Generative Adversarial image are defined as feature maps extracted from a
Networks (WGANs). They first theoretically showed pre-trained VGG19 by feeding the image into it[33] ,
that the Earth-Mover (EM) distance produces better and neural patches of an image are defined as patch
gradient behaviors in distribution learning compared samples on the feature maps[15] . A discriminator is
to other distance metrics. Accordingly, they made trained to distinguish neural patches from real and fake
several changes to regular GANs: (1) removing images. The objective function includes a texture loss
the sigmoid layer and adding weight clipping in and a feature loss. The texture loss is computed from
the discriminator; (2) removing the log function the classification scores of neural patches of xs from the
in the adversarial loss. They demonstrated that discriminator. The feature loss considers the distance
WGANs generate images with comparable quality between the feature maps of xs and xc . The target image
compared to well designed DCGANs. Berthelot et is initialized with random noise, and is iteratively
al.[27] proposed Boundary Equilibrium Generative updated through back propagation by minimizing the
Adversarial Networks (BEGAN), trying to maintain an objective function. They further introduced Markovian
equilibrium which can be adjusted for the trade-off Generative Adversarial Networks (MGANs), which
between diversity and quality. take feature maps of a content image xc as input
Creswell et al.[30] provided an overview of GANs. to generate a texture image. MGANs are trained
They mainly focused on GANs themselves, including using content and target image pairs synthesized by
architectures and training strategies of GANs. Our MDANs. The objective function of MGANs is defined
survey differs because it focuses on image synthesis and similar to MDANs. MGANs are able to achieve real-
editing applications with GANs. time performance for neural texture synthesis, which is
about 500 times faster than previous methods.
3 Image Synthesis SGAN and PSGAN. Regular GANs map a random
This section discusses applications including texture vector to an image. Instead, Jetchev and Bergmann[34]
synthesis, image super-resolution, image inpainting, proposed Spatial GANs (SGAN), which extend to
face image synthesis, and human image synthesis. map a spatial tensor to an image. The network
architecture follows DCGANs[23] . The architectural
3.1 Texture synthesis properties of SGAN make it suitable for the task of
Texture synthesis is a classic problem in both computer texture synthesis. Bergmann et al.[35] further extended
graphics and computer vision. Given a sample texture, SGAN to Periodic Spatial GAN (PSGAN). In PSGAN,
the goal is to generate a new texture with identical the input spatial tensor contains three parts: a local
second order statistics. independent part, a spatially global part, and a periodic
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 663

part. PSGAN is able to synthesize diverse, periodic, and inpainting method based on GANs. The network is
high-resolution textures. based on an encoder-decoder architecture. The input
is a 128  128 image with holes. The output is a
3.2 Image super-resolution
64  64 image content in the hole (when the hole is
Given a low-resolution image, the goal of super- central) or the full 128  128 inpainted image (when the
resolution is to upsample it to a high-resolution one. hole is arbitrary). The objective function includes an
Essentially, this problem is ill-posed because high adversarial loss and a content loss measuring L2 pixel-
frequency information is lacking, especially for large wise difference between the generated inpainted image
upscaling factors. Recently, some deep learning based and the ground truth image. Experiments used the Paris
methods[36–38] were proposed to tackle this problem, StreetView dataset[42] and the ImageNet dataset[43] . It
results are good for low upsampling factors, but less achieves satisfactory inpainting results for central holes
satisfactory for larger scales. Below, we discuss GAN- but less satisfactory results for arbitrary holes.
based super-resolution methods. Multiscale method. Yang et al.[44] presented a
SRGAN. Ledig et al.[39] proposed Super-Resolution multiscale synthesis method for high-resolution image
Generative Adversarial Network (SRGAN), which impainting. They first trained a context network which
takes a low-resolution image as input, and generates is similar to context encoder[41] with minor changes on
an upsampled image with 4 resolution. The network some layers. To complete a 512  512 image with
architecture follows the guidelines of DCGAN[23] , and a 256  256 hole, they first downsampled the image
the generator uses very deep convolutional network by a factor of 4, then obtained an initial reconstructed
with residual blocks. The objective function includes hole image x0 at resolution 64  64 through the trained
an adversarial loss and a feature loss. The feature loss context network. The final reconstructed hole image (at
is computed as the distance between the feature maps resolution 256256) is then obtained in a coarse-to-fine
of the generated upsampled image and the ground truth manner. At each scale, the reconstructed hole image
image, where the feature maps are extracted from a is iteratively updated by optimizing a joint objective
pre-trained VGG19 network. Experiments show that function, including a content loss, a texture loss, and a
SRGAN outperforms the state-of-art approaches on Total Variation (TV) loss. The content loss measures
public datasets. the L2 difference between the currently optimized
FCGAN. Based on Boundary Equilibrium image and the resulting image from the coarser level.
Generative Adversarial Networks (BEGANs)[27] , The texture loss is computed by comparing the neural
Huanget al.[40] proposed Face Conditional Generative patches inside the hole and the neural patches outside
Adversarial Network (FCGAN), which specializes the hole. The texture loss enforces the image content
on facial image super resolution. Within the network inside and outside the hole to have similar texture
architecture, both the generator and discriminator use details. It shows nice image inpainting results with
an encoder-decoder along with skip connections. For 512  512 resolution, but is slow due to the iterative
training, the objective function includes a content loss, approach. Some results of this method and Context
which is computed by the L1 pixelwise difference Encoder are shown in Fig. 3.
between the generated upsampled image and the Consistent completion. Iizuka et al.[45] proposed a
ground truth. FCGAN generates satisfactory results
with 4 scaling factor.
3.3 Image inpainting
The goal of image inpaiting is to fill holes in images.
It has been always a hot topic in computer graphics
and computer vision. Traditional approaches replicate
pixels or patches from the original image[4, 5, 10] or from
image library[6, 8] to fill the holes. GANs offer a new
way for image inpainting. Input Context Encoder Yang et al. Ground truth

Context encoder. Pathak et al.[41] presented a Fig. 3 Comparison between Context Encoder[41] and Yang
network called context encoder, which is the first image et al.[44]
664 Tsinghua Science and Technology, December 2017, 22(6): 660–674

GAN-based approach for global and local consistent z, together with a conditional vector c indicating a new
image inpainting. The input is an image with an age, are fed into the generator G to generate a new face
additional binary mask to indicate the missing hole. The image. Dz takes the vector z as input, and enforces z
output is an inpainted image with the same resolution. to be uniformly distributed. Dimg forces the face image
The generator follows the encoder-decoder architecture generated by G to be realistic and to conform with the
and uses dilated convolution layers[46] instead of given age. Besides the two adversarial losses of the
standard convolution layers for larger spatial support. two discriminators, the objective function also includes
There are two discriminators, a global discriminator that an L2 content loss and a Total Variation (TV) loss.
takes the entire image as input and a local discriminator The content loss enforces the input face image and the
that takes a small region covering the hole as input. The generated face image to be similar: x  G.E.x/; c/.
two discriminators ensure that the resulting image is The TV loss is introduced to remove ghosting effects.
consistent at both global and local scale. This work All the networks are jointly trained. CAAE is able to
produces natural image inpainting results for high- generate map input face images to plausibly appear as a
resolution images with arbitrary holes. different age.
Other methods. Yeh et al.[47] proposed a GAN- Antipov et al.[50] proposed an Age-conditional
based iterative method for semantic image inpainting. Generative Adversarial Network (Age-cGAN) for face
It first pre-trained a GAN, whose generator G maps aging. Age-cGAN consists of an encoder and a cGAN.
a latent vector z to an image. Given an image with Like CAAE, the encoder E maps a face image x to a
missing contents x0 , they recover the latent vector latent vector z, and the conditional generator G maps
z by minimizing an objective function including a latent vector z with an age condition c to a new face
an adversarial loss and a content loss. The content image. The cGAN is first trained, and the encoder is
loss is computed by a weighted L1 pixel-wise trained using pairs of latent vectors and generated face
distance between the generated image G.z / and x0 images of the cGAN. After training, given an input face
on uncorrupted regions, where pixels near the hole image x0 with age c0 , face aging is achieved by: (1)
are given higher weights. The objective function is feeding x0 into the encoder to obtain an initial latent
iteratively optimized through back propagation. Li et vector z0 ; (2) iteratively updating z0 to a new latent
al.[48] proposed a GAN-based specialized approach vector z through an identity preserving optimization,
for face image impainting. Following the network which enforces the reconstructed face image with the
architecture of Iizuka et al.[45] , it incorporates a global same age to be close to the input image: G.z ; c0 / 
discriminator and a local discriminator to enforce x0 ; and (3) feeding the optimized latent vector z and
image consistency in both global and local scales. It the target age into the generator to obtain the new face
additionally includes a pre-trained parsing network to image.
enforce the harmony of the inpainted face image. Face frontalization. Face frontalization aims to
transform a face image from rotated or perspective
3.4 Face image synthesis
views to frontal views. Tran et al.[51] proposed
Face image synthesis is a specialized but important Disentangled Representation learning-GAN (DR-
topic. Because human vision is sensitive to facial GAN) for face synthesis with new poses. The generator
irregularities and deformations, it is not an easy task to G uses an encoder-decoder architecture. It learns a
generate realistic synthesized face images. GANs have disentangled representation for face images which are
shown a good ability in creating face images of high the output of the encoder and also the input of the
perceptual quality and with detailed textures. decoder. Specifically, the encoder maps a face image
Face aging. Face aging methods transform a facial x to an identity feature f , and the decoder synthesizes
image to another age, while still keeping identity. a new face image, given an identity feature f , a target
Zhang et al.[49] presented a Conditional Adversarial pose, and a noise vector. The discriminator D has two
AutoEncoder (CAAE) for this problem, which consists parts, one for identity classification (i.e., also contains
of an encoder E, a generator G, and two discriminators an additional identity class for fake images), and the
Dz and Dimg . The encoder E maps a face image x to a other for pose classification. The goal of D is to predict
vector z indicating personal features. The output vector both identity and pose correctly for real images and
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 665

also to predict identity as fake for fake images. The the high-resolution image xtHR conditioned on the input
goal of G is to fool D to classify fake images to its image x0 . Gf and D are jointly trained with an objective
input identity and pose. The objective function for function consisting of an adversarial loss and a content
training only contains the newly introduced adversarial loss measuring L1 difference between xtHR and ground
loss. Experiments show that DR-GAN is superior to truth.
existing methods on pose invariant face recognition. Pose guided generation PG2 . Ma et al.[58] proposed
Yin et al.[52] proposed a Face Frontalization a pose guided human image generation method
Generative Adversarial Network (FF-GAN), which (P G 2 ). Given an input human image and a target pose,
incorporates 3D Morphable Model (3DMM)[53] into it generates a new image with the target pose. They
the GAN structure. Since 3DMM provides geometry also used a coarse-to-fine two-stage approach. In the
and appearance priors for face images and the first stage, a generator G1 produces a coarse image xLR
representation of 3DMM is also compact, FF-GAN has from the input image x0 and the target pose, capturing
the advantage of fast convergence, and produces high- the global structure. In the second stage, a generator
quality frontal face images. G2 generates a high-resolution difference image xHR
Huang et al.[54] proposed Two-Pathway Generative from the coarse image xLR and the input image x0 .
Adversarial Network (TP-GAN) for frontal face image The final image is obtained by summing up xLR and
synthesis. The network has a two-pathway architecture, xHR . A conditioned discriminator is also involved.
a global generator for generating global structures, and Both G1 and G2 use the U-Net[20] architecture. It is
a local generator for generating details around facial able to produce new 256  256 images.
landmarks. The objective function for training consists
of an L1 pixel-wise content loss which measures the 4 Constrained Image Synthesis
difference between the generated face image b x and the This section discusses a constrained image synthesis,
ground truth, a symmetry loss which enforces b x to be which is synthesizing a new image with respect to
horizontally symmetric, an adversarial loss, an identity some specified constraints from users, such as another
preserving loss, and a TV loss. image, text description, or sketches. We will discuss
3.5 Human image synthesis applications on image-to-image translation, text-to-
image, and sketch-to-image.
Human image processing is important in computer
4.1 Image-to-image translation
vision. Most existing works focused on problems
such as pose estimation, detection, and re-identification, Image-to-image translation refers to a constrained
while generating novel human images attracted few synthesis process which maps an input image to an
attention until GANs were presented. For the purpose output image.
of improving person re-identification precision, Zheng Pix2pix. Isola et al.[59] proposed a general image-
et al.[55] utilized GANs to generate human images as to-image translation framework pix2pix using cGANs.
extra training data. Other GANs are designed for human Their network architecture follows the guidelines
image synthesis per se. of DCGANs[23] with some additional changes: (1)
VariGAN. Variational GAN (VariGAN)[56] aims to applying modules of the form convolution-BatchNorm-
generate multi-view human images from a single- ReLU; and (2) adding skip connections between the
view. It follows a coarse-to-fine manner and consists deep layers and the shallow layers for the generator.
of three networks: a coarse image generator, a fine The discriminator uses PatchGAN[32] , which runs faster
image generator, and a conditional discriminator. The and penalizes unreal structure at the patch scale. Since
coarse image generator Gc uses a conditional VAE the goal is not only to produce realistic images (which
architecture[57] . Given an input image x0 and a target fool the discriminator), but also require the generated
view t , it is separately trained to generate a low- image to be close to ground truth; hence, besides the
resolution image with the target view xtLR . The adversarial loss, they additionally include a content loss
fine image generator Gf uses a dual-path U-Net[20] in the objective function. The content loss measures
architecture. It maps xtLR to a high-resolution image the L1 distance between the output image and the
xtHR conditioned on x0 . The discriminator D examines ground truth image. Pix2pix was demonstrated to be
666 Tsinghua Science and Technology, December 2017, 22(6): 660–674

effective for a variety of image-to-image translation


tasks, including labels to cityscape, labels to façade,
edges to photo, day to night, etc. It produces convincing
results at the 256  256 resolution, as shown in Fig. 4.
CycleGAN. Pix2pix[59] requires paired images
Apple → Orange Horse → Zebra
(an image before translation and the corresponding
image after translation) as training data, however,
in many cases, such image pairs do not exist. To
address this issue, Zhu et al.[60] proposed an unpaired
image-to-image translation framework, named
cycle-consistent Generative Adversarial Networks Orange → Apple Zebra → Horse
(cycleGAN). CycleGAN consists of two separate
Fig. 5 Results produced by cycleGAN[60] .
GANs, one translates an image from one domain to
another (e.g., horse to zebra): xtrans D G.x/, the other loss and a reconstruction loss enforcing x  P .G.x//.
does the inverse translation (e.g., zebra to horse): Beside image-to-image translation, AIGN could be
x D Ginv .xtrans /. Their network architecture follows also used for 3D human pose estimation, face super-
Johnson et al.[61] which has shown to be effective in resolution, image inpainting, etc.
style transfer. Similar to pix2pix, the discriminators 4.2 Text-to-image
use PatchGAN[32] . The two GANs are jointly trained.
Following LSGAN[25] , the adversarial loss uses least Text-to-image refers to the process of generating an
square function instead of a log function for more image which corresponds to a given text description.
stable training. Beside the two adversarial losses of For example, we could imagine an image describing
the two GANs, the objective function additionally “a red bird with a black tail” or “a white flower with a
includes an L1 cycle consistency loss, which enforces yellow anther”. This is a difficult problem, but recently,
that an image translates to itself after the translation it is able to generate images depicting simple scenes
cycle: x  Ginv .G.x//; xtrans  G.Ginv .xtrans //. with the help of GANs.
Their method has been successfully applied to several GAN-INT-CLS. Reed et al.[63] proposed a text-to-
translation tasks, including collection style transfer, image synthesis method using GANs. The input text
is encoded into a text embedding vector ˚ .t / using a
season transfer, etc., as shown in Fig. 5.
recurrent network. Conditioned on the text embedding
AIGN. The Adversarial Inverse Graphics Network
vector ˚ .t /, the generator maps a noise vector z
(AIGN)[62] also utilizes unpaired training. AIGN
to a synthesized image. The discriminator is also
consists of a generator G, a discriminator D, and
conditioned on ˚ .t /, and is designed to judge whether
a task specific renderer P . Consider image-to-image
the input image is real or fake, and that it matches the
translation as example: the generator G maps an input
texture description. The network architecture follows
image x to an output image G.x/, and the renderer
the guidelines of DCGAN[23] . The objective function
maps the output of the generator back to its input. The
only includes an adversarial loss. Note that the noise
objective function for training includes an adversarial vector z could be used to control styles of generated
Labels to Cityscape Labels to Facade Edges to Photo images.
GAWWN. Reed et al.[64] introduced a Generative
Adversarial What-Where Network (GAWWN), which
Input
considers location constraints in addition to text
descriptions. A location constraint could be given by
a bounding box, or by keypoints. Specifically, for
bounding box constraints, a bounding-box-conditional
Output
GAN is proposed. The networks (both the generator
and the discriminator) are conditioned on the bounding
box and the text embedding vector which represents text
Fig. 4 Results produced by pix2pix[59] . description. The networks have two pathways: a global
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 667

pathway that operates on the full image, and a local extracted from a pre-trained VGG19 network. The
pathway that operates on the region inside the bounding TV loss is included to improve the smoothness of the
box. For keypoint constraints, a keypoint-conditional generated images[61] . It is able to generate realistic,
GAN is also proposed. The keypoint constraints are diverse, and controllable images.
represented using binary mask maps. TextureGAN. Xian et al.[70] proposed TextureGAN,
Other methods. Stacked Generative Adversarial which converts sketch images to realistic images with
Networks (StackGAN)[65] are able to generate the additional control of object textures. The generator
high-resolution images conditioned by given text takes a sketch image, a color image, and a texture
descriptions. This method has two stages of GANs. image xt as input to generate a new image b x. The
[69]
Stage-1 GAN generates a low-resolution (64  64) network structure follows Scribbler . The objective
image from a noise vector conditioned to some text function consists of a content loss, a feature loss, an
description. The output 64  64 image from Stage-1 adversarial loss, and a texture loss. The content loss,
and the text descriptions are both fed into Stage-2 GAN feature loss, and adversarial loss are defined similar
to generate a high-resolution (256  256) image. It to Scribbler[69] . Following the CNN based texture
is the first work to generate images with 256  256 synthesis method[31] , the texture loss is computed as
resolution from texts. A Text conditioned Auxiliary the distance between the Gram-matrix representation of
Classifier GAN (TAC-GAN)[66] is another text-to- patches in b x and xt , enforcing texture appearance of b
x
image synthesis method. It is built upon Auxiliary close to xt . Segmentation masks are also introduced to
Classifier GAN (AC-GAN)[67] , but replaces the class enforce computing texture loss and content loss only in
label condition by a text description condition. the foreground region.
Current text-to-image synthesis approaches are Other methods. Magic Pencil[71] is another GAN-
capable of generating plausible images of single object, based sketch-to-image synthesis method. Beside a
such as a bird or a flower. But they are still not well generator and a discriminator, it additionally includes
adapted to complex scenes with multiple objects, which a classifier to enforce the generated image and the input
is an important direction for future works. sketch in the same category. With the help of the newly
included classifier, it is able to achieve multi-category
4.3 Sketch-to-image image synthesis. Auto-painter[72] converts sketches to
cartoon images based on GANs. It adopts the network
Sketches are a convenient way for users to draw
architecture of pix2pix[59] and additionally includes
what they want, but they lack detail, e.g., color etc.
texture loss and TV loss into the objective function.
Therefore, automatically mapping the input sketches
Sketch dataset is rare, so researchers often utilize
to the user desired images is an attractive problem for
edge detection algorithm to extract sketches for
researchers. Sketch2Photo[68] provides a fantastic way
training. However, extracted sketches often contain
to synthesize images with sketch and text labels by
lots of details and their low level statistics are very
the composition of Internet images, but text labels are
different from hand drawings. That is not the only
necessary in their work. Recently, GAN-based methods
difference, there can be changes in geometry and
are able to generate images from sketches without text
connectedness too. Additionally, while these methods
labels, showing better flexibility.
work on single sketched object well, GAN-based
Scribbler. Sangkloyet al.[69] proposed a GAN-based
generation of complex scenes from sketches is still a
synthesis method named Scribbler, which converts
challenging problem.
sketch images with color strokes to realistic images.
The generator employs an encoder-decoder architecture
5 Image Editing and Videos
with residual blocks[16] , and generates a new image with
the same resolution as the input sketch image. The 5.1 Image editing
objective function consists of a content loss, a feature
loss, an adversarial loss, and a TV loss. The content Image editing is an important topic in computer
loss measures L2 pixel-wise difference between the graphics, in which users manipulate an image through
generated image b x and ground truth xG . Similar to color and (or) geometry interactions. A lot of works
MDANs[32] , the feature loss is defined as the feature have investigated tasks such as image warping[73, 74] ,
distance between b x and xG , where the features are colorization[75–77] , and blending[78, 79] . These works
668 Tsinghua Science and Technology, December 2017, 22(6): 660–674

mainly work on pixels or patches, and do not respect to an objective function consisting of a content
necessarily keep semantic consistency in editing. loss, a feature loss, the ternary adversarial loss, and the
iGAN. Zhu et al.[80] proposed iGAN, which uses KL divergence of VAE[22] . The content loss measures
GAN as a manifold approximation and constrains the L1 pixel-wise difference between reconstructed and
edited images on the manifold. They pre-trained a original images. The feature loss measures L2 feature
GAN from a large image collection, which maps a difference between reconstructed and original images.
latent vector z to a natural image. Given an original Such designs enforce high-quality reconstruction.
image, the method works as follows: (1) The original Other methods. Cao et al.[82] applied GAN for
image x0 is projected to the latent space and a latent imaged colorization. Its high-level network architecture
representation vector z0 is obtained. The projection is follows pix2pix[59] . It generates diverse colorization
done through a hybrid method combining optimization results by feeding different input noise into the
and a pre-trained encoder network. (2) After specifying generator. Gaussian-Poisson GAN (GP-GAN)[83]
shape and color edits, they optimize for a new vector applies GAN for high-resolution image blending.
z minimizing an objective function containing a data It combines GAN with traditional gradient based
loss, a manifold smoothness loss, and an adversarial blending techniques. GAN is used to generate an initial
loss. The data loss measures differences with user low-resolution blended image, and the final result is
edit constraints so as to enforce satisfying user edits. obtained by optimizing the Gaussian-Poisson equation.
The manifold smoothness loss is defined as the L2
difference between z and z0 , so that the image is not 5.2 Video generation
changed too much. (3) By interpolating between z0
Inspired by the success of GANs in image synthesis
and z , a sequence of continuous edited images are
applications, researchers have also applied GANs to
generated. (4) Finally, the same amount of edits are
video generation. Compared to image synthesis, video
transferred to the original image x0 using optical flow to
generation is more difficult since video has an extra
obtain the final results. iGAN achieves realistic image
temporal dimension requiring much larger computation
editing, on various edit operators such as coloring,
and memory cost. It is also not trivial to keep temporal
sketching, and geometric warping. Figure 6 shows the
coherence. We will discuss some important works for
interpolation between generated images of a bag and an
such attempts.
outdoor scene.
VGAN. Vondrick et al.[84] proposed a Generative
IAN. Brock et al.[81] proposed an Introspective
Adversarial Network for Video (VGAN). They assumed
Adversarial Network (IAN) for image editing. IAN
the whole video is combined by a static background
consists of a generator G, a discriminator D, and
image and a moving foreground video. Hence, the
an encoder E. The network architecture follows
generator has two streams. The input to both streams
DCGAN[23] . The generator G maps a noise vector z to a
is a noise vector. The background stream generates
generated image. The encoder E uses the discriminator
the background image with 2D convolutional layers,
D as a feature vector, and is built on top of the final
and the foreground stream generates the 3D foreground
convolutional layer of D. The discriminator D inputs
video cube and the corresponding 3D foreground mask,
an image and determines whether it is real, fake, or
with spatial-temporal 3D convolutional layers. The
reconstructed. The networks are jointly trained with
discriminator takes the whole generated video as input,
and tries to distinguish it from real videos. Since VGAN
treats videos as 3D cubes, it requires large memory
space; it can generate tiny videos of about one second
duration.
TGAN. Saito et al.[85] proposed Temporal Generative
Adversarial Network (TGAN) for video generation.
TGAN consists of a temporal generator, an image
generator, and a discriminator. The temporal generator
produces a sequence of latent frame vectors Œz11 ; :::; zK
1 
from a random variable z0 , where K is the number of
Fig. 6 Results produced by iGAN[80] . video frames. The image generator takes z0 and a frame
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 669

vector zt1 (1 6 t 6 K) as input, and produces the t-th artifacts usually occur. Although some approaches use
video frame. The discriminator takes the whole video coarse-to-fine iterative approaches to generate high-
as input and tries to distinguish it from real ones. For resolution images, but they are not end-to-end and are
stable training, they follow WGAN[26] , but further apply usually slow. Recently, Chen and Koltun[91] introduced
singular value clipping instead of the weight clipping to cascaded refinement networks for photographic image
the discriminator. synthesis at 2-megapixel resolution, which gives us a
MocoGAN. Tulyakov et al.[86] proposed Motion and novel perspective for high-resolution image generation.
Content decomposed GAN (MoCoGAN) for video Secondly, the resolutions of input and output images
generation. The basic idea is to use a motion and are usually required to be fixed. In comparison,
content decomposed representation. Given a sequence traditional image synthesis approaches are more flexible
of random variables Œ 1 ; :::;  K , a recurrent network and could be adapted to arbitrary resolution. Recently
maps them to a sequence of motion vectors z1M ; :::; proposed PixelRNN[92] draws images pixel to pixel and
zKM , where K is the number of video frames. A
allows arbitrary resolution, which gives a good insight.
content vector zC , together with a motion vector ztM Thirdly, as a common issue in deep learning, ground
(1 6 t 6 K) are fed into the generator to produce the truth data (for training) are crucial but hard to get. This
t -th video frame. There are two discriminators, one for is more important in GAN-based image synthesis and
distinguishing real from fake single frames, while the editing applications, because usually it is not easy to
other for distinguishing real from fake videos. find ground truth of synthesized or edited images (or
Video prediction. Video prediction refers to the they simply do not exist). CycleGAN[60] and AIGN[62]
process of predicting one (or a few) future frames proposed to use unpaired data for training, which might
conditioned by a few existing video frames. Various be a feasible solution for similar problems but this needs
GAN-based approaches[87–90] have been proposed for more attention and exploration.
this goal. With a multiscale architecture, Mathieu et Finally, although GANs have been applied to video
al.[87] generated future frames by minimizing an MSE generation and synthesis of 3D models[93–96] , the results
loss, an adversarial loss, and a gradient difference loss. are far from perfect. It is still hard to extract temporal
Zhou and Berg[88] learned temporal transformations information from videos or decrease memory costs.
of specific phenomenon from videos, such as flower Acknowledgment
blooming, ice melting, etc. Vondrick and Torralba[89]
learned pixel transformation and generated future This work was supported by the National Key Technology
frames by transforming pixels from existing frames. R&D Program (No. 2016YFB1001402), the National
Liang et al.[90] proposed dual motion GAN, which Natural Science Foundation of China (No. 61521002), the
enforces predicted future frames to be consistent with Joint NSFC-ISF Research Program (No. 61561146393),
predicted optical flows. and Research Grant of Beijing Higher Institution
Engineering Research Center and Tsinghua-Tencent Joint
6 Discussion and Conclusion Laboratory for Internet Innovation Technology. This
work was also supported by the EPSRC CDE (No.
There have been great advances in image synthesis EP/L016540/1).
and editing applications using GANs in recent years.
By exploring large amounts of images, GANs are References
able to generate more reasonable, more semantically
[1] A. A. Efros and T. K. Leung, Texture synthesis by
consistent results than classical methods. Besides
non-parametric sampling, in Proc. 7th IEEE Int. Conf.
that, GANs can produce texture details and realistic
Computer Vision, Kerkyra, Greece, 1999, pp. 1033–1038.
content, which is beneficial to many applications,
[2] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick,
such as texture synthesis, super-resolution, image Graphcut textures: Image and video synthesis using graph
inpainting, etc. Comparisons between different GAN- cuts, in Proc. ACM SIGGRAPH 2003 Papers, San Diego,
based methods are given in Table 1. CA, USA, 2003, pp. 277–286.
However, GANs are still facing many challenges. [3] Q. Wu and Y. Z. Yu, Feature matching and deformation for
Firstly, it is difficult to generate high-resolution images. texture synthesis, ACM Trans. Graph., vol. 23, no. 3, pp.
At present, most GAN-based applications are limited 364–367, 2004.
to handle images with resolution not larger than 256  [4] A. Criminisi, P. Perez, and K. Toyama, Object removal by
256. When applied to high-resolution images, blurry exemplar-based inpainting, in Proc. 2003 IEEE Computer
670 Tsinghua Science and Technology, December 2017, 22(6): 660–674

Table 1 Comparison between different GAN-based methods. Each row gives the information of a specific method. From left
to right, we give the method name, its input format, its output format, its characteristic, the composition of its loss function,
its maximal allowed image/video resolution, and the framework of its provided code. In the column of loss function, Ladv , L1 ,
L2 , Lf , Lt , LTV , Lseg , Lip , Lsym , Lcyc , Lcls , DKL denote adversarial loss, L1 distance, L2 distance, feature loss, texture loss, TV
loss, segmentation loss, identity preserving loss, symmetry loss, cycle consistency loss, classification loss, and KL divergence,
respectively. In the column of code, T, Th, TF, C, PT, Ch denote torch, theano, tensorflow, caffe, pytorch, chainer, respectively.

Application Input Output Characteristics Loss function Resolution Code


Texture Synthesis
MGAN[32] image texture real-time Ladv C Lf arbitrary T
PSGAN[35] noise tensor texture periodical texture Ladv arbitrary Th
Image Super-Resolution
SRGAN[39] image image high upscaling factor Ladv C Lf arbitrary TF
FCGAN[40] face face Ladv C L1 128  128
Image Inpainting
Context Encoder[41] image+holes image Ladv C L2 128  128 T+TF
Yang et al.[44] image+holes image high-quality but slow L2 C Lt C LTV (update) 512  512 T
Iizuka et al.[45] image+holes image arbitrary holes Ladv C L2 256  256
Yeh et al.[47] image+holes image high missing rate Ladv C L1 (update) 64  64 TF
Li et al.[48] face+holes face semantic regularization Ladv C L2 C Lseg 128  128 C
Face Aging
CAAE[49] face+age face smooth interpolation Ladv C L2 C LTV 128  128 TF
Age-cGAN[50] face+age face identity preserved Lip (update)
Face Frontalization
DR-GAN[51] face+pose face arbitrary rotation Ladv 96  96
FF-GAN[52] face face 3DMM coefficients Ladv C L1 C LTV C Lip C Lsym 100  100
TP-GAN[54] face face two pathway Ladv C L1 C LTV C Lip C Lsym 128  128
Human Image Synthesis
VariGAN[56] human+view human coarse-to-fine Ladv C L1 128  128
P G 2[58] human+pose human coarse-to-fine Ladv C L1 256  256
Image-to-Image Translation
pix2pix[59] image image general framework Ladv C L1 256  256 T+PT
cycleGAN[60] image image unpaired data Ladv C Lcyc 256  256 T+PT+TF
AIGN[62] image image unpaired data Ladv C Lcyc 128  128
Text-to-Image
GAN-INT-CLS[63] text image Ladv 64  64 T
GAWWM[64] text+location image location-controllable Ladv 128  128 T
StackGAN[65] text image high-quality Ladv 256  256 TF+PT
TAC-GAN[66] text image diversity Ladv 128  128 TF
Sketch-to-Image
Scribbler[69] sketch(+color) image guided colorization Ladv C L2 C Lf C LTV 128  128
TextureGAN[70] sketch+texture+color image texture-controllable Ladv C L2 C Lf C Lt 128  128
Magic Pencil[71] sketch image multi-class Ladv C Lcls 64  64
Auto-painter[72] sketch cartoon Ladv C L1 C Lf C LTV 512  512
Image Editing
iGAN[80] image+manipulation image interpolation sequence Ldata C Lsmooth C Ladv (update) 64  64 Th
IAN[81] image+manipulation image fine reconstruction Ladv C L1 C Lf C DKL 64  64 Th
Cao et al.[82] grayscale image image diversity Ladv C L1 64  64 TF
GP-GAN[83] composited image image coarse-to-fine Ladv C L2 Ch
Video Generation
VGAN[84] noise vector video two streams Ladv 64  64 T
TGAN[85] noise vector video temporal generator Ladv 64  64 Ch
MoCoGAN[86] noise vector video unfixed-length Ladv 64  64 PT
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 671

Society Conf. Computer Vision and Pattern Recognition [19] J. Long, E. Shelhamer, and T. Darrell, Fully convolutional
(CVPR), Madison, WI, USA, 2003. networks for semantic segmentation, in Proc. 2015 IEEE
[5] N. Komodakis and G. Tziritas, Image completion using Conf. Computer Vision and Pattern Recognition (CVPR),
efficient belief propagation via priority scheduling and Boston, MA, USA, 2015, pp. 3431–3440.
dynamic pruning, IEEE Trans. Image Process., vol. 16, no. [20] O. Ronneberger, P. Fischer, and T. Brox, U-Net:
11, pp. 2649–2661, 2007. Convolutional networks for biomedical image
[6] J. Hays and A. A. Efros, Scene completion using millions segmentation, in Proc. 18th Int. Conf. Medical
of photographs, ACM Trans. Graph., vol. 26, no. 3, p. 4, Image Computing and Computer-Assisted Intervention,
2007. Munich, Germany, 2015, pp. 234–241.
[21] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B.
[7] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and
Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y.
D. H. Salesin, Image analogies, in Proc. 28t h Annu.
Bengio, Generative adversarial nets, in Proc. 27th Int.
Conf. Computer Graphics and Interactive Techniques, Los
Conf. Neural Information Processing Systems, Montreal,
Angeles, CA, USA, 2001, pp. 327–340.
Canada, 2014, pp. 2672–2680.
[8] C. Barnes, F. L. Zhang, L. M. Lou, X. Wu, and S. M. Hu, [22] D. P. Kingma and M. Welling, Auto-encoding variational
PatchTable: Efficient patch queries for large datasets and Bayes, in Proc. 2nd Int. Conf. Learning Representations
applications, ACM Trans. Graph., vol. 34, no. 4, p. 97, (ICLR), Ithaca, NY, USA, 2014.
2015. [23] A. Radford, L. Metz, and S. Chintala, Unsupervised
[9] H. Fang and J. C. Hart, Detail preserving shape representation learning with deep convolutional
deformation in image editing, in Proc. ACM SIGGRAPH generative adversarial networks, in Int. Conf. Learning
2007 Papers, San Diego, CA, USA, 2007. Representations (ICLR), San Juan, Puerto Rico, 2016.
[10] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. [24] J. B. Zhao, M. Mathieu, and Y. LeCun, Energy-based
Goldman, PatchMatch: A randomized correspondence generative adversarial network, in Proc. 5th Int. Conf.
algorithm for structural image editing, ACM Trans. Learning Representations (ICLR), Palais des Congrès
Graph., vol. 28, no. 3, p. 24, 2009. Neptune, Toulon, France, 2017.
[11] T. Welsh, M. Ashikhmin, and K. Mueller, Transferring [25] X. D. Mao, Q. Li, H. R. Xie, R. Y. K. Lau, Z. Wang, and S.
color to greyscale images, ACM Trans. Graph., vol. 21, P. Smolley, Least squares generative adversarial networks,
no. 3, pp. 277–280, 2002. The IEEE International Conf. Computer Vision (ICCV),
Venice, Italy, 2017.
[12] Z. Zhu, R. R. Martin, and S. M. Hu, Panorama completion
[26] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein
for street views, Comput. Vis. Media, vol. 1, no. 1, pp. 49–
generative adversarial networks, in Proc. 34th Int. Conf.
57, 2015.
Machine Learning, Sydney, Australia, 2017.
[13] P. Y. Laffont, Z. L. Ren, X. F. Tao, C. Qian, and J. [27] D. Berthelot, T. Schumm, and L. Metz, BEGAN:
Hays, Transient attributes for high-level understanding and Boundary equilibrium generative adversarial networks,
editing of outdoor scenes, ACM Trans. Graph., vol. 33, no. arXiv preprint arXiv: 1703.10717, 2017.
4, p. 149, 2014. [28] M. Mirza and S. Osindero, Conditional generative
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet adversarial nets, arXiv preprint arXiv: 1411.1784, 2014.
classification with deep convolutional neural networks, [29] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, Deep
in Proc. 25t h Int. Conf. Neural Information Processing generative image models using a laplacian pyramid of
Systems, Lake Tahoe, NV, USA, 2012, pp. 1097–1105. adversarial networks, in Proc. 28th Int. Conf. Neural
[15] K. Simonyan and A. Zisserman, Very deep convolutional Information Processing Systems, Montreal, Canada, 2015,
networks for large-scale image recognition, in Int. Conf. pp. 1486–1494.
Learning Representations (ICLR), San Diego, CA, USA, [30] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran,
2015. B. Sengupta, and A. A. Bharath, Generative adversarial
networks: An overview, arXiv preprint arXiv: 1710.07035,
[16] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, Deep
2017.
residual learning for image recognition, in Proc. 2016
[31] L. A. Gatys, A. S. Ecker, and M. Bethge, Texture synthesis
IEEE Conf. Computer Vision and Pattern Recognition
using convolutional neural networks, in Proc. 28th Int.
(CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.
Conf. Neural Information Processing Systems, Montreal,
[17] S. Q. Ren, K. M. He, R. Girshick, and J. Sun, Faster Canada, 2015, pp. 262–270.
R-CNN: Towards real-time object detection with region [32] C. Li and M. Wand, Precomputed real-time texture
proposal networks, in Proc. 28t h Int. Conf. Neural synthesis with Markovian generative adversarial networks,
Information Processing Systems 28, Montreal, Canada, in Proc. 14th European Conf. Computer Vision (ECCV),
2015, pp. 91–99. Amsterdam, The Netherlands, 2016, pp.702–716.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, [33] L. A. Gatys, A. S. Ecker, and M. Bethge, Image
You only look once: Unified, real-time object detection, style transfer using convolutional neural networks, in
in Proc. 2016 IEEE Conf. Computer Vision and Pattern Proc. 2016 IEEE Conf. Computer Vision and Pattern
Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 779– Recognition (CVPR), Las Vegas, NV, USA, 2016, pp.
788. 2414–2423.
672 Tsinghua Science and Technology, December 2017, 22(6): 660–674

[34] N. Jetchev, U. Bergmann, and R. Vollgraf, Texture [48] Y. J. Li, S. F. Liu, J. M. Yang, and M. H. Yang, Generative
synthesis with spatial generative adversarial networks, face completion, in Proc. IEEE Conf. Computer Vision and
arXiv preprint arXiv: 1611.08207, 2016. Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
[35] U. Bergmann, N. Jetchev, and R. Vollgraf, Learning texture [49] Z. F. Zhang, Y. Song, and H. R. Qi, Age
manifolds with the periodic spatial GAN, in Proc. 34th Int progression/regression by conditional adversarial
Conf. Machine Learning, Sydney, Australia, 2017. autoencoder, in Proc. 2017 IEEE Conf. Computer Vision
[36] C. Dong, C. C. Loy, K. M. He, and X. O. Tang, Learning and Pattern Recognition (CVPR), Honolulu, HI, USA,
a deep convolutional network for image super-resolution, 2017.
in Proc 13t h European Conf. Computer Vision (ECCV), [50] G. Antipov, M. Baccouche, and J. L. Dugelay, Face aging
Zurich, Switzerland, 2014. with conditional generative adversarial networks, in IEEE
[37] J. Kim, J. K. Lee, and K. M. Lee, Deeply-recursive Int. Conf. Image Processing, Beijing, China, 2017.
convolutional network for image super-resolution, in [51] L. Tran, X. Yin, and X. M. Liu, Disentangled
Proc. 2016 IEEE Conf. Computer Vision and Pattern representation learning GAN for pose-invariant face
Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. recognition, in Proc. IEEE Conf. Computer Vision and
1637–1645. Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
[38] W. Z. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, [52] X. Yin, X. Yu, K. Sohn, X. M. Liu, and M. Chandraker,
R. Bishop, D. Rueckert, and Z. H. Wang, Real-time single Towards large-pose face frontalization in the wild, in Proc.
image and video super-resolution using an efficient sub- IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy,
pixel convolutional neural network, in Proc. 2016 IEEE 2017.
[53] V. Blanz and T. Vetter, A morphable model for the
Conf. Computer Vision and Pattern Recognition (CVPR),
synthesis of 3D faces, in Proc. 26th Annu. Conf.
Las Vegas, NV, USA, 2016, pp. 1874–1883.
[39] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Computer Graphics and Interactive Techniques, Los
Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Angeles, CA, USA, 1999, pp. 187–194.
[54] R. Huang, S. Zhang, T. Y. Li, and R. He, Beyond
Z. H. Wang, and W. Z. Shi, Photo-realistic single image
face rotation: Global and local perception GAN
super-resolution using a generative adversarial network,
for photorealistic and identity preserving frontal view
in Proc. 2017 IEEE Conf. Computer Vision and Pattern
synthesis, in Proc. IEEE Int. Conf. Computer Vision
Recognition (CVPR), Honolulu, HI, USA, 2017.
(ICCV), Honolulu, HI, USA, 2017.
[40] B. Huang, W. H. Chen, X. M. Wu, and C. L. Lin,
[55] Z. D. Zheng, L. Zheng, and Y. Yang, Unlabeled samples
High-quality face image SR using conditional generative
generated by GAN improve the person re-identification
adversarial networks, arXiv preprint arXiv: 1707.00737,
baseline in vitro, in Proc. IEEE Int. Conf. Computer Vision
2017.
(ICCV), Honolulu, HI, USA, 2017.
[41] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and
[56] B. Zhao, X. Wu, Z. Q. Cheng, H. Liu, and J. S. Feng,
A. A. Efros, Context encoders: Feature learning by
Multi-view image generation from a single-view, arXiv
inpainting, in Proc. IEEE Conf. Computer Vision and
preprint arXiv: 1704.04886, 2017.
Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016. [57] K. Sohn, X. C. Yan, and H. Lee, Learning structured output
[42] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, representation using deep conditional generative models,
What makes Paris look like Paris? ACM Trans. Graph., in Proc. 28th Int. Conf. Neural Information Processing
vol. 31, no. 4, p. 101, 2012. Systems, Montreal, Canada, 2015, pp. 3483–3491.
[43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, [58] L. Q. Ma, Q. R. Sun, X. Jia, B. Schiele, T. Tuytelaars, and
S. Ma, Z. H. Huang, A. Karpathy, A. Khosla, and M. L. Van Gool, Pose guided person image generation, arXiv
Bernstein, et al., ImageNet large scale visual recognition preprint arXiv: 1705.09368, 2017.
challenge, Int. J . Comput. Vis., vol. 115, no. 3, pp. 211– [59] P. Isola, J. Y. Zhu, T. H. Zhou, and A. A. Efros,
252, 2015. Image-to-image translation with conditional adversarial
[44] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, networks, in Proc. IEEE Conf. Computer Vision and
High-resolution image inpainting using multi-scale neural Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
patch synthesis, in Proc. IEEE Conf. Computer Vision and [60] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired
Pattern Recognition (CVPR), Honolulu, HI, USA, 2017. image-to-image translation using cycle-consistent
[45] S. Iizuka, E. Simo-Serra, and H. Ishikawa, Globally and adversarial networks, in The IEEE Int. Conf. Computer
locally consistent image completion, ACM Trans. Graph., Vision (ICCV), Venice, Italy, 2017.
vol. 36, no. 4, p. 107, 2017. [61] J. Johnson, A. Alahi, and F. F. Li, Perceptual losses for
[46] F. Yu and V. Koltun, Multi-scale context aggregation real-time style transfer and super-resolution, in Proc. 14th
by dilated convolutions, in Int. Conf. Learning European Conf. Computer Vision (ECCV), Amsterdam,
Representations (ICLR), San Juan, Puerto Rico, 2016. The Netherlands, 2016.
[47] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. [62] H. Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki,
Hasegawa-Johnson, and M. N. Do, Semantic image Adversarial inverse graphics networks: Learning 2D-to-
inpainting with deep generative models, in Proc. IEEE 3D lifting and image-to-image translation from unpaired
Conf. Computer Vision and Pattern Recognition (CVPR), supervision, in The IEEE Int. Conf. Computer Vision
Honolulu, HI, USA, 2017. (ICCV), Venice, Italy, 2017.
Xian Wu et al.: A Survey of Image Synthesis and Editing with Generative Adversarial Networks 673

[63] S. Reed, Z. Akata, X. C. Yan, L. Logeswaran, B. Schiele, [79] Z. Farbman, G. Hoffer, Y. Lipman, D. Cohen-Or, and D.
and H. Lee, Generative adversarial text to image synthesis, Lischinski, Coordinates for instant image cloning, in Proc.
in Proc. 33rd Int. Conf. Machine Learning, New York, ACM SIGGRAPH 2009 Papers, New Orleans, LA, USA,
NY, USA, 2016, pp. 1060–1069. 2009.
[64] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and [80] J. Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A.
H. Lee, Learning what and where to draw, in Proc. 29th Efros, Generative visual manipulation on the natural image
Conf. Neural Information Processing Systems, Barcelona, manifold, in Proc. 14th European Conf. Computer Vision
Spain, 2016, pp. 217–225. (ECCV), Amsterdam, The Netherlands, 2016.
[65] H. Zhang, T. Xu, H. S. Li, S. T. Zhang, X. G. Wang, [81] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, Neural
X. L. Huang, and D. Metaxas, StackGAN: Text to photo editing with introspective adversarial networks, in
photo-realistic image synthesis with stacked generative Int. Conf. Learning Representations (ICLR), Palais des
adversarial networks, in The IEEE Int. Conf. Computer Congrès Neptune, Toulon, France, 2017.
Vision (ICCV), Venice, Italy, 2017. [82] Y. Cao, Z. M. Zhou, W. N. Zhang, and Y. Yu, Unsupervised
[66] A. Dash, J. C. B. Gamboa, S. Ahmed, M. Liwicki, diverse colorization via generative adversarial networks,
and M. Z. Afzal, TAC-GAN-Text conditioned auxiliary arXiv preprint arXiv: 1702.06674, 2017.
classifier generative adversarial network, arXiv preprint [83] H. K. Wu, S. Zheng, J. G. Zhang, and K. Q. Huang, GP-
arXiv: 1703.06412, 2017. GAN: Towards realistic high-resolution image blending,
[67] A. Odena, C. Olah, and J. Shlens, Conditional image arXiv preprint arXiv: 1703.07195, 2017.
[84] C. Vondrick, H. Pirsiavash, and A. Torralba, Generating
synthesis with auxiliary classifier GANs, in Proc. 34th Int.
videos with scene dynamics, in Proc. 29th Conf. Neural
Conf. Machine Learning, Sydney, Australia, 2017.
[68] T. Chen, M. M. Cheng, P. Tan, A. Shamir, and S. M. Information Processing Systems, Barcelona, Spain, 2016,
Hu, Sketch2Photo: Internet image montage, ACM Trans. pp. 613–621.
[85] M. Saito, E. Matsumoto, and S. Saito, Temporal generative
Graph., vol. 28, no. 5, p. 124, 2009.
adversarial nets with singular value clipping, in The IEEE
[69] P. Sangkloy, J. W. Lu, C. Fang, F. Yu, and J. Hays,
Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017.
Scribbler: Controlling deep image synthesis with sketch
[86] S. Tulyakov, M. Y. Liu, X. D. Yang, and J. Kautz,
and color, in Proc. 2017 IEEE Conf. Computer Vision and
MoCoGAN: Decomposing motion and content for video
Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
generation, arXiv preprint arXiv: 1707.04993, 2017.
[70] W. Q. Xian, P. Sangkloy, J. W. Lu, C. Fang, F. Yu, and J.
[87] M. Mathieu, C. Couprie, and Y. LeCun, Deep multi-scale
Hays, TextureGAN: Controlling deep image synthesis with
video prediction beyond mean square error, in Int. Conf.
texture patches, arXiv preprint arXiv: 1706.02823, 2017.
Learning Representations (ICLR), San Juan, Puerto Rico,
[71] H. Zhang and X. C. Cao, Magic pencil: Generalized
2016.
sketch inversion via generative adversarial nets, in Proc. [88] Y. P. Zhou and T. L. Berg, Learning temporal
SIGGRAPH ASIA 2016 Posters, Macau, China, 2016. transformations from time-lapse videos, in Proc. 14th
[72] Y. F. Liu, Z. C. Qin, Z. B. Luo, and H. Wang, Auto-
European Conf. Computer Vision (ECCV), Amsterdam,
painter: Cartoon image generation from sketch by using
The Netherlands, 2016.
conditional generative adversarial networks, arXiv preprint [89] C. Vondrick and A. Torralba, Generating the future with
arXiv: 1705.01908, 2017. adversarial transformers, in The IEEE Conf. Computer
[73] M. Alexa, D. Cohen-Or, and D. Levin, As-rigid-as- Vision and Pattern Recognition (CVPR), Honolulu, HI,
possible shape interpolation, in Proc. 27t h Annu. Conf. USA, 2017, pp. 1020–1028.
Computer Graphics and Interactive Techniques, New [90] X. D. Liang, L. Lee, W. Dai, and E. P. Xing, Dual motion
Orleans, LA, USA, 2000, pp. 157–164. GAN for future-flow embedded video prediction, in The
[74] S. Avidan and A. Shamir, Seam carving for content-aware IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy,
image resizing, ACM Trans. Graph., vol. 26, no. 3, p. 10, 2017.
2007. [91] Q. F. Chen and V. Koltun, Photographic image synthesis
[75] A. Levin, D. Lischinski, and Y. Weiss, Colorization using with cascaded refinement networks, in The Int. Conf.
optimization, in Proc. ACM SIGGRAPH 2004 Papers, Los Computer Vision (ICCV), Venice, Italy, 2017.
Angeles, CA, USA, 2004, pp. 689–694. [92] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu,
[76] X. J. Li, H. L. Zhao, G. Z. Nie, and H. Huang, Pixel recurrent neural networks, in Proc. 33rd Int. Conf.
Image recoloring using geodesic distance based color Machine Learning, New York, NY, USA, 2016.
harmonization, Comput. Vis. Media, vol. 1, no. 2, pp. 143– [93] J. J. Wu, C. K. Zhang, T. F. Xue, W. T. Freeman, and
155, 2015. J. B. Tenenbaum, Learning a probabilistic latent space of
[77] S. P. Lu, G. Dauphin, G. Lafruit, and A. Munteanu, object shapes via 3D generative-adversarial modeling, in
Color retargeting: Interactive time-varying color image Proc. 30th Conf. Neural Information Processing Systems,
composition from time-lapse sequences, Comput. Vis. Barcelona, Spain, 2016, pp. 82–90.
Media, vol. 1, no. 4, pp. 321–330, 2015. [94] E. J. Smith and D. Meger, Improved adversarial systems
[78] P. Pérez, M. Gangnet, and A. Blake, Poisson image editing, for 3D object generation and reconstruction, in Proc. 1st
in Proc. ACM SIGGRAPH 2003 Papers, San Diego, CA, Conf. Robot Learning, Mountain View, CA, USA, 2017,
USA, 2003, pp. 313–318. pp. 87–96.
674 Tsinghua Science and Technology, December 2017, 22(6): 660–674

[95] W. Y. Wang, Q. G. Huang, S. Y. You, C. Yang, and [96] B. Yang, H. K. Wen, S. Wang, R. Clark, A. Markham, and
U. Neumann, Shape inpainting using 3D generative N. Trigoni, 3D object reconstruction from a single depth
adversarial network and recurrent convolutional networks, view with adversarial learning, in Int. Conf. Computer
in The IEEE Int. Conf. Computer Vision (ICCV), Venice, Vision Workshops (ICCVW), 2017, pp. 679–688.
Italy, 2017, pp. 2298–2306.

Xian Wu is a PhD student in the Peter Hall is a professor in the Department


Department of Computer Science and of Computer Science at the University
Technology, Tsinghua University. Before of Bath. He is also the director of the
that, he received his bachelor degree in Media Technology Research Centre, Bath.
the same university in 2015. His research He founded vision, video, and graphics
interests include image/video editing and network of excellence in the United
computer vision. Kingdom, and has served on the executive
committee of the British Machine Vision
Conference since 2003. He has published works extensively in
computer vision, especially where it interfaces with computer
Kun Xu is an associate professor in
graphics. More recently he is developing an interest in robotics.
the Department of Computer Science and
Technology, Tsinghua University. Before
that, he received his bachelor and PhD
degrees from the same university in
2005 and 2009, respectively. His research
interests include realistic rendering and
image/video editing.

You might also like