StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
Abstract—Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still face
challenges in generating high-quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGANs) aimed
at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture,
StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of a scene based on a given text
description, yielding low-resolution images. The Stage-II GAN takes Stage-I results and the text description as inputs, and generates
high-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture,
StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generators
and multiple discriminators arranged in a tree-like structure; images at multiple scales corresponding to the same scene are generated
from different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximating
multiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantly
outperform other state-of-the-art methods in generating photo-realistic images.
Index Terms—Generative models, generative adversarial networks (GANs), multi-stage GANs, multi-distribution approximation,
photo-realistic image generation, text-to-image synthesis
1 INTRODUCTION
Fig. 1. The architecture of the proposed StackGAN-v1. The Stage-I generator draws a low-resolution image by sketching the rough shape and basic colors of the object from the given text and painting the background from a random noise vector. Conditioned on Stage-I results, the Stage-II generator corrects defects and adds compelling details to the Stage-I results, yielding a more realistic high-resolution image.
residual image conditioned on the image of the previous stage is generated and then added back to the input image to produce the input for the next stage. Instead of producing a residual image, our StackGANs directly generate high-resolution images that are conditioned on their low-resolution inputs. Concurrent to our work, Karras et al. [19] incrementally add more layers in the generator and discriminator for high-resolution image generation. The main difference in terms of experimental setting is that they used a more restrained upsampling rule: starting from 4 × 4 pixels, their image resolution is increased by a factor of 2 between consecutive image generation stages. Furthermore, although StackGANs, LAPGANs [8], and Progressive GANs [19] all put emphasis on adding finer details in higher-resolution images, our StackGANs can also correct incoherent artifacts or defects in low-resolution results by utilizing an encoder-decoder network before the upsampling layers.

3 PRELIMINARIES

Generative Adversarial Networks [12] are composed of two models that are alternately trained to compete with each other. The generator G is optimized to reproduce the true data distribution p_data by generating images that are difficult for the discriminator D to differentiate from real images. Meanwhile, D is optimized to distinguish real images and synthetic images generated by G. Overall, the training procedure is a minimax two-player game with the following objective function,

\min_G \max_D V(D, G) = E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))],   (1)

where x is a real image from the true data distribution p_data, and z is a noise vector sampled from the prior distribution p_z (e.g., a uniform or Gaussian distribution). In practice, the generator G is modified to maximize log(D(G(z))) instead of minimizing log(1 - D(G(z))) to mitigate the problem of gradient vanishing [12]. We use this modified non-saturating objective in all our experiments.

Conditional GANs [11], [28] are extensions of GANs where both the generator and the discriminator receive additional conditioning variables c, yielding G(z, c) and D(x, c). This formulation allows G to generate images conditioned on variables c.
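The objective in Eq. (1) and the non-saturating generator update mentioned above can be written concretely as follows. This is a minimal PyTorch-style sketch, not the released StackGAN code; `D`, `G`, and `noise_dim` are placeholder names, and `D` is assumed to output probabilities.

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G, real_images, noise_dim):
    """Binary cross-entropy form of Eq. (1) with the non-saturating G loss."""
    z = torch.randn(real_images.size(0), noise_dim, device=real_images.device)
    fake_images = G(z)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    d_real = D(real_images)
    d_fake = D(fake_images.detach())
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

    # Generator (non-saturating): maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))
    g_out = D(fake_images)
    g_loss = F.binary_cross_entropy(g_out, torch.ones_like(g_out))
    return d_loss, g_loss
```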
4 STACKGAN-V1: TWO-STAGE GENERATIVE ADVERSARIAL NETWORK

To generate high-resolution images with photo-realistic details, we propose a simple yet effective two-stage generative adversarial network, StackGAN-v1. As shown in Fig. 1, it decomposes the text-to-image generative process into two stages. Stage-I GAN sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image. Stage-II GAN corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high-resolution photo-realistic image.

4.1 Conditioning Augmentation

As shown in Fig. 1, the text description t is first encoded by an encoder, yielding a text embedding φt. In previous works [33], [35], the text embedding is nonlinearly transformed to generate conditioning latent variables as the input of the generator. However, the latent space for the text embedding is usually high-dimensional (> 100 dimensions). With a limited amount of data, this usually causes discontinuity in the latent data manifold, which is not desirable for learning the generator. To mitigate this problem, we introduce a Conditioning Augmentation technique to produce additional conditioning variables ĉ. In contrast to the fixed conditioning text variable c in [33], [35], we randomly sample the latent variables ĉ from an independent Gaussian distribution N(μ(φt), Σ(φt)), where the mean
μ(φt) and diagonal covariance matrix Σ(φt) are functions of the text embedding φt. The proposed Conditioning Augmentation yields more training pairs given a small number of image-text pairs, and thus encourages robustness to small perturbations along the conditioning manifold. To further enforce smoothness over the conditioning manifold and avoid overfitting [9], [22], we add the following regularization term to the objective of the generator during training,

D_{KL}(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \| \mathcal{N}(0, I)),   (2)

which is the Kullback-Leibler divergence (KL divergence) between the standard Gaussian distribution and the conditioning Gaussian distribution. The randomness introduced by the Conditioning Augmentation is beneficial for modeling text-to-image translation, as the same sentence usually corresponds to objects with various poses and appearances.
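A minimal sketch of the Conditioning Augmentation just described, under the assumption that a single fully-connected layer predicts the mean and a log-variance from φt (layer sizes and names are illustrative, not the authors' released implementation). The ĉ sample uses the reparameterization trick, and the returned KL value corresponds to the regularizer of Eq. (2).

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, embed_dim=1024, c_dim=128):
        super().__init__()
        # One fully-connected layer predicts both mu(phi_t) and a log-variance for Sigma(phi_t)
        self.fc = nn.Linear(embed_dim, c_dim * 2)

    def forward(self, phi_t):
        mu, logvar = self.fc(phi_t).chunk(2, dim=1)
        # Reparameterization: c_hat ~ N(mu, Sigma) with diagonal Sigma
        c_hat = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Eq. (2): KL(N(mu, Sigma) || N(0, I)) for a diagonal Gaussian
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=1).mean()
        return c_hat, kl
```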
4.2 Stage-I GAN

Instead of directly generating a high-resolution image conditioned on the text description, we simplify the task to first generate a low-resolution image with our Stage-I GAN, which focuses on drawing only the rough shape and correct colors of the object.

Let φt be the text embedding of the given description. The Gaussian conditioning variables ĉ0 for the text embedding are sampled from N(μ0(φt), Σ0(φt)) to capture the meaning of φt with variations. Conditioned on ĉ0 and a random variable z, Stage-I GAN trains the discriminator D0 and the generator G0 by alternately maximizing L_{D_0} in Eq. (3) and minimizing L_{G_0} in Eq. (4),

L_{D_0} = E_{(I_0, t) \sim p_{data}}[\log D_0(I_0, \varphi_t)] + E_{z \sim p_z, t \sim p_{data}}[\log(1 - D_0(G_0(z, \hat{c}_0), \varphi_t))].   (3)

For the discriminator D0, the text embedding φt is first compressed to Nd dimensions using a fully-connected layer and then spatially replicated to form an Md × Md × Nd tensor. Meanwhile, the image is fed through a series of down-sampling blocks until it has Md × Md spatial dimensions. Then, the image filter map is concatenated along the channel dimension with the text tensor. The resulting tensor is further fed to a 1 × 1 convolutional layer to jointly learn features across the image and the text. Finally, a fully-connected layer with one node is used to produce the decision score.
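The text-image fusion inside the Stage-I discriminator described above can be sketched roughly as below (illustrative PyTorch-style code with the default Md = 4; the module name, channel counts, and the LeakyReLU after the 1 × 1 convolution are our assumptions).

```python
import torch
import torch.nn as nn

class TextImageFusion(nn.Module):
    """Fuses the compressed text embedding with down-sampled image features."""
    def __init__(self, embed_dim=1024, nd=128, img_channels=512, md=4):
        super().__init__()
        self.md = md
        self.compress = nn.Linear(embed_dim, nd)              # phi_t -> Nd dimensions
        self.joint = nn.Sequential(
            nn.Conv2d(img_channels + nd, img_channels, kernel_size=1),  # 1x1 conv over the concat
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.score = nn.Linear(img_channels * md * md, 1)     # one-node decision score

    def forward(self, img_feat, phi_t):
        # img_feat: (B, img_channels, Md, Md) produced by the down-sampling blocks
        text = self.compress(phi_t)                           # (B, Nd)
        text = text.view(-1, text.size(1), 1, 1).expand(-1, -1, self.md, self.md)
        fused = self.joint(torch.cat([img_feat, text], dim=1))
        return torch.sigmoid(self.score(fused.flatten(1)))    # probability that the pair is real
```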
4.3 Stage-II GAN

Low-resolution images generated by Stage-I GAN usually lack vivid object parts and might contain shape distortions. Some details in the text might also be omitted in the first stage, even though they are vital for generating photo-realistic images. Our Stage-II GAN is built upon Stage-I GAN results to generate high-resolution images. It is conditioned on the low-resolution images and also on the text embedding again to correct defects in Stage-I results. The Stage-II GAN completes previously ignored text information to generate more photo-realistic details.

Conditioning on the low-resolution result s0 = G0(z, ĉ0) and the Gaussian latent variables ĉ, the discriminator D and generator G in Stage-II GAN are trained by alternately maximizing L_D in Eq. (5) and minimizing L_G in Eq. (6),

L_D = E_{(I, t) \sim p_{data}}[\log D(I, \varphi_t)] + E_{s_0 \sim p_{G_0}, t \sim p_{data}}[\log(1 - D(G(s_0, \hat{c}), \varphi_t))],   (5)

L_G = E_{s_0 \sim p_{G_0}, t \sim p_{data}}[-\log D(G(s_0, \hat{c}), \varphi_t)] + D_{KL}(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \| \mathcal{N}(0, I)).   (6)
Fig. 2. The overall framework of our proposed StackGAN-v2 for the conditional image synthesis task. c is the vector of conditioning variables, which can be computed from the class label, the text description, etc. Ng and Nd are the numbers of channels of a tensor.
For the discriminator, its structure is similar to that of the Stage-I discriminator with only extra down-sampling blocks, since the image size is larger in this stage. To explicitly enforce GANs to learn better alignment between the image and the conditioning text, rather than using the vanilla discriminator, we adopt the matching-aware discriminator proposed by Reed et al. [35] for both stages. During training, the discriminator takes real images and their corresponding text descriptions as positive sample pairs, whereas negative sample pairs consist of two groups. The first is real images with mismatched text embeddings, while the second is synthetic images with their corresponding text embeddings.
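One way to assemble the matching-aware positive and negative pairs in a training step is sketched below; the function and tensor names are placeholders, the two-argument discriminator call is an assumption, and the 0.5 weighting of the two negative groups is only one plausible choice.

```python
import torch

def matching_aware_d_loss(discriminator, real_images, text_emb, fake_images):
    """Positive pairs: real image + matching text.
    Negative pairs: (1) real image + mismatched text, (2) fake image + matching text."""
    eps = 1e-6
    # Mismatched text: roll the text embeddings by one position within the batch
    wrong_text = torch.roll(text_emb, shifts=1, dims=0)

    p_real = discriminator(real_images, text_emb)            # real image, matching description
    p_wrong = discriminator(real_images, wrong_text)         # real image, wrong description
    p_fake = discriminator(fake_images.detach(), text_emb)   # synthetic image, right description

    loss = -(torch.log(p_real + eps).mean()
             + 0.5 * (torch.log(1 - p_wrong + eps).mean()
                      + torch.log(1 - p_fake + eps).mean()))
    return loss
```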
4.4 Implementation Details

The up-sampling blocks consist of nearest-neighbor upsampling followed by a 3 × 3 stride-1 convolution. Batch normalization [17] and ReLU activation are applied after every convolution except the last one. The residual blocks consist of 3 × 3 stride-1 convolutions, Batch normalization and ReLU. Two residual blocks are used in 128 × 128 StackGAN-v1 models while four are used in 256 × 256 models. The down-sampling blocks consist of 4 × 4 stride-2 convolutions, Batch normalization and LeakyReLU, except that the first one does not have Batch normalization.
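The up-sampling, residual, and down-sampling blocks of this section can be written roughly as follows (a PyTorch-style sketch that follows the stated kernel sizes, strides, and activations; channel arguments are illustrative, and the final output convolution that produces the image would omit BN and ReLU as stated above).

```python
import torch.nn as nn
import torch.nn.functional as F

def up_block(in_ch, out_ch):
    # Nearest-neighbor upsampling followed by a 3x3 stride-1 convolution, BN and ReLU
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ResBlock(nn.Module):
    # Two 3x3 stride-1 convolutions with BN/ReLU plus a skip connection
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.BatchNorm2d(ch),
        )
    def forward(self, x):
        return F.relu(x + self.body(x))

def down_block(in_ch, out_ch, use_bn=True):
    # 4x4 stride-2 convolution, optional BN (the first block omits it), LeakyReLU
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)
```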
By default, Ng = 128, Nz = 100, Mg = 16, Md = 4, Nd = 128, W0 = H0 = 64, and W = H = 256. For training, we first iteratively train D0 and G0 of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train D and G of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using the ADAM [20] solver with beta1 = 0.5. The batch size is 64. The learning rate is initialized to 0.0002 and decayed to 1/2 of its previous value every 100 epochs. The source code for StackGAN-v1 is available at https://github.com/hanzhanggit/StackGAN for more implementation details.

… resolution image distributions. To make the framework more general, in this paper, we propose a new end-to-end network, StackGAN-v2, to model a series of multi-scale image distributions. As shown in Fig. 2, StackGAN-v2 consists of multiple generators (Gs) and discriminators (Ds) in a tree-like structure. Images from low resolution to high resolution are generated from different branches of the tree. At each branch, the generator captures the image distribution at that scale and the discriminator estimates the probability that a sample came from training images of that scale rather than from the generator. The generators are jointly trained to approximate the multiple distributions, and the generators and discriminators are trained in an alternating fashion. In this section, we explore two types of multi-distributions: (1) multi-scale image distributions; and (2) joint conditional and unconditional image distributions.

5.1 Multi-Scale Image Distributions Approximation

Our StackGAN-v2 framework has a tree-like structure, which takes a noise vector z ∼ p_noise as the input and has multiple generators to produce images of different scales. The p_noise is a prior distribution, which is usually chosen as the standard normal distribution. The latent variables z are transformed into hidden features layer by layer. We compute the hidden features h_i for each generator G_i by a non-linear transformation,

h_0 = F_0(z); \quad h_i = F_i(h_{i-1}, z), \quad i = 1, 2, \ldots, m-1,   (7)

where h_i represents the hidden features for the ith branch, m is the total number of branches, and F_i are modeled as neural networks (see Fig. 2 for illustration). In order to capture information omitted in preceding branches, the noise vector z is concatenated to the hidden features h_{i-1} as the input of F_i for calculating h_i. Based on the hidden features at different layers (h_0, h_1, ..., h_{m-1}), the generators produce samples of small-to-large scales (s_0, s_1, ..., s_{m-1}), where s_i = G_i(h_i) for i = 0, 1, ..., m-1.
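The tree-structured computation of Eq. (7), together with per-branch image heads s_i = G_i(h_i), can be sketched as follows; `F0`, `F_blocks`, and `to_image` stand in for the non-linear transformations F_i and the image-generation heads, and are assumptions rather than the authors' exact modules.

```python
import torch.nn as nn

class TreeGenerator(nn.Module):
    def __init__(self, F0, F_blocks, to_image):
        super().__init__()
        self.F0 = F0                                 # h0 = F0(z)
        self.F_blocks = nn.ModuleList(F_blocks)      # hi = Fi(h_{i-1}, z)
        self.to_image = nn.ModuleList(to_image)      # si = Gi(hi): image heads at each scale

    def forward(self, z):
        hidden = [self.F0(z)]
        for F_i in self.F_blocks:
            # z is re-injected at every branch to recover information omitted earlier
            hidden.append(F_i(hidden[-1], z))
        # One sample per branch, from small to large scales (s0, ..., s_{m-1})
        return [head(h) for head, h in zip(self.to_image, hidden)]
```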
Following each generator G_i, a discriminator D_i is trained to classify its inputs into two classes (real or fake) by minimizing the following cross-entropy loss,

L_{D_i} = -E_{x_i \sim p_{data_i}}[\log D_i(x_i)] - E_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i))],   (9)

where x_i is from the true image distribution p_{data_i} at the ith scale, and s_i is from the model distribution p_{G_i} at the same scale. The multiple discriminators are trained in parallel, and each of them focuses on a single image scale.

Guided by the trained discriminators, the generators are optimized to jointly approximate the multi-scale image distributions (p_{data_0}, p_{data_1}, ..., p_{data_{m-1}}) by minimizing the following loss function,

L_G = \sum_{i=0}^{m-1} L_{G_i}, \quad L_{G_i} = -E_{s_i \sim p_{G_i}}[\log D_i(s_i)],   (10)

where L_{G_i} is the loss function for approximating the image distribution at the ith scale (i.e., p_{data_i}). During the training process, the discriminators D_i and the generators G_i are alternately optimized till convergence.

The motivation of the proposed StackGAN-v2 is that, by modeling data distributions at multiple scales, if any one of those model distributions shares support with the real data distribution at that scale, the overlap can provide a good gradient signal to expedite or stabilize training of the whole network at multiple scales. For instance, approximating the low-resolution image distribution at the first branch results in images with the basic colors and structures. Then the generators at the subsequent branches can focus on completing details for generating higher-resolution images.
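Eqs. (9) and (10) amount to standard cross-entropy terms computed independently at every scale. A minimal sketch (placeholder names; each D_i is assumed to output a probability):

```python
import torch

def multi_scale_losses(discriminators, real_images, fake_images):
    """real_images[i], fake_images[i]: batches at the ith scale; discriminators[i]: D_i."""
    eps = 1e-6
    d_losses, g_losses = [], []
    for D_i, x_i, s_i in zip(discriminators, real_images, fake_images):
        # Eq. (9): cross-entropy loss for D_i at the ith scale
        d_losses.append(-(torch.log(D_i(x_i) + eps).mean()
                          + torch.log(1 - D_i(s_i.detach()) + eps).mean()))
        # Eq. (10): generator loss at the ith scale; L_G sums the per-branch terms
        g_losses.append(-torch.log(D_i(s_i) + eps).mean())
    return d_losses, sum(g_losses)
```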
5.2 Joint Conditional and Unconditional Distribution Approximation

For unconditional image generation, the discriminators in StackGAN-v2 are trained to distinguish real images from fake ones. To handle conditional image generation, conventionally, images and their corresponding conditioning variables are input into the discriminator to determine whether an image-condition pair matches or not, which guides the generator to approximate the conditional image distribution. We propose conditional StackGAN-v2, which jointly approximates conditional and unconditional image distributions.

For the generator of our conditional StackGAN-v2, F_0 and F_i are converted to take the conditioning vector c as input, such that h_0 = F_0(c, z) and h_i = F_i(h_{i-1}, c). For F_i, the conditioning vector c replaces the noise vector z to encourage the generators to draw images with more details according to the conditioning variables. Consequently, multi-scale samples are now generated by s_i = G_i(h_i). The objective function for training the discriminator D_i of conditional StackGAN-v2 now consists of two terms, the unconditional loss and the conditional loss,

L_{D_i} = \underbrace{-E_{x_i \sim p_{data_i}}[\log D_i(x_i)] - E_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i))]}_{\text{unconditional loss}} + \underbrace{-E_{x_i \sim p_{data_i}}[\log D_i(x_i, c)] - E_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i, c))]}_{\text{conditional loss}}.   (11)

The unconditional loss determines whether the image is real or fake, while the conditional one determines whether the image and the condition match or not. Accordingly, the loss function for each generator G_i is converted to

L_{G_i} = \underbrace{-E_{s_i \sim p_{G_i}}[\log D_i(s_i)]}_{\text{unconditional loss}} + \underbrace{-E_{s_i \sim p_{G_i}}[\log D_i(s_i, c)]}_{\text{conditional loss}}.   (12)

The generator G_i at each scale therefore jointly approximates the unconditional and conditional image distributions. The final loss for jointly training the generators of conditional StackGAN-v2 is computed by substituting Eq. (12) into Eq. (10).
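A sketch of the joint unconditional and conditional terms of Eqs. (11) and (12) for a single branch; the convention that `D_i(image)` returns the unconditional probability and `D_i(image, c)` the conditional one is an assumption, as are the names.

```python
import torch

def joint_d_loss(D_i, x_i, s_i, c):
    """Eq. (11): unconditional term + conditional term for discriminator D_i."""
    eps = 1e-6
    uncond = -(torch.log(D_i(x_i) + eps).mean()
               + torch.log(1 - D_i(s_i.detach()) + eps).mean())
    cond = -(torch.log(D_i(x_i, c) + eps).mean()
             + torch.log(1 - D_i(s_i.detach(), c) + eps).mean())
    return uncond + cond

def joint_g_loss(D_i, s_i, c):
    """Eq. (12): the generator at scale i fits both distributions."""
    eps = 1e-6
    return -(torch.log(D_i(s_i) + eps).mean() + torch.log(D_i(s_i, c) + eps).mean())
```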
5.3 Color-Consistency Regularization

As we increase the image resolution at different generators, the generated images at different scales should share a similar basic structure and colors. A color-consistency regularization term is introduced to keep samples generated from the same input at different generators more consistent in color, and thus to improve the quality of the generated images.

Let x_k = (R, G, B)^T represent a pixel in a generated image. The mean and covariance of the pixels of the given image can then be defined as m = \frac{1}{N}\sum_k x_k and \Sigma = \frac{1}{N}\sum_k (x_k - m)(x_k - m)^T, where N is the number of pixels in the image. The color-consistency regularization term aims at minimizing the differences of m and Σ between different scales to encourage consistency, and is defined as

L_{C_i} = \frac{1}{n} \sum_{j=1}^{n} \left( \lambda_1 \| m_{s_i^j} - m_{s_{i-1}^j} \|_2^2 + \lambda_2 \| \Sigma_{s_i^j} - \Sigma_{s_{i-1}^j} \|_F^2 \right),   (13)

where n is the batch size, and m_{s_i^j} and \Sigma_{s_i^j} are the mean and covariance of the jth sample generated by the ith generator. Empirically, we set λ1 = 1 and λ2 = 5 by default. For the jth input vector, multi-scale samples s_0^j, s_1^j, ..., s_{m-1}^j are generated from the m generators of StackGAN-v2. L_{C_i} can be added to the loss function of the ith generator defined in Eq. (10) or Eq. (12), where i = 1, 2, ..., m-1. Therefore, the final loss for training the ith generator is defined as L'_{G_i} = L_{G_i} + α L_{C_i}. Experimental results indicate that the color-consistency regularization is very important for the unconditional task (we use α = 50.0 in this paper), while it is not needed (α = 0.0) for the text-to-image synthesis task, which has a stronger constraint, i.e., the instance-wise correspondence between images and text descriptions.
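Eq. (13) can be computed directly from the batches generated at two adjacent scales; a sketch assuming image tensors of shape (batch, 3, H, W) and the default λ1 = 1, λ2 = 5:

```python
import torch

def image_mean_cov(images):
    """Per-image pixel mean m and covariance Sigma over the N = H*W pixels."""
    b, c, h, w = images.shape
    pixels = images.view(b, c, h * w)                      # each pixel is an (R, G, B) vector
    mean = pixels.mean(dim=2, keepdim=True)                # (b, 3, 1)
    centered = pixels - mean
    cov = centered @ centered.transpose(1, 2) / (h * w)    # (b, 3, 3)
    return mean.squeeze(2), cov

def color_consistency_loss(samples_i, samples_prev, lambda1=1.0, lambda2=5.0):
    """Eq. (13): match mean/covariance of samples generated at scales i and i-1."""
    m_i, cov_i = image_mean_cov(samples_i)
    m_p, cov_p = image_mean_cov(samples_prev)
    return (lambda1 * (m_i - m_p).pow(2).sum(dim=1)
            + lambda2 * (cov_i - cov_p).pow(2).sum(dim=(1, 2))).mean()
```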
5.4 Implementation Details

As shown in Fig. 2, our StackGAN-v2 models are designed to generate 256 × 256 images. The input vector (i.e., z for unconditional StackGAN-v2, or the concatenated z and c¹ for conditional StackGAN-v2) is first transformed to a 4 × 4 × 64Ng feature tensor, where Ng is the number of channels in the tensor. Then, this 4 × 4 × 64Ng tensor is gradually transformed to 64 × 64 × 4Ng, 128 × 128 × 2Ng, and eventually 256 × 256 × 1Ng tensors at different layers of the network by six up-sampling blocks. The intermediate 64 × 64 × 4Ng, 128 × 128 × 2Ng, and 256 × 256 × 1Ng features are used to generate images of the corresponding scales with 3 × 3 convolutions.

1. The conditioning variable c for StackGAN-v2 is also generated by Conditioning Augmentation.
TABLE 1
Statistics of Datasets

Dataset     CUB [49]         Oxford-102 [30]   COCO [24]          LSUN [54]                ImageNet [39]
            train    test    train    test     train     test     bedroom      church      dog        cat
#Samples    8,855    2,933   7,034    1,155    80,000    40,000   3,033,042    126,227     147,873    6,500

We do not split LSUN or ImageNet because they are utilized for the unconditional tasks.
Conditioning variables c or unconditional variables z are directly fed into intermediate layers of the network to ensure that the information encoded in c or z is not omitted. All the discriminators D_i have down-sampling blocks and 3 × 3 convolutions to transform the input image into a 4 × 4 × 8Nd tensor, and eventually the sigmoid function is used for outputting probabilities. For all datasets, we set Ng = 32 and Nd = 64, and use two residual blocks between every two generators. The ADAM [20] solver with beta1 = 0.5 and a learning rate of 0.0002 is used for all models. The source code for StackGAN-v2 is available at https://github.com/hanzhanggit/StackGAN-v2 for more implementation details.
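For the defaults above (Ng = 32, Nd = 64), the stated tensor-shape progression can be checked with a small bookkeeping sketch (this is bookkeeping only, not model code; the channel-halving rule per up-sampling block is inferred from the shapes listed in Section 5.4):

```python
def stackgan_v2_shapes(ng=32, nd=64):
    """Spatial size and channel progression stated in Section 5.4 (per sample)."""
    gen = [(4, 64 * ng)]                 # input vector reshaped to a 4 x 4 x 64Ng tensor
    for _ in range(6):                   # six up-sampling blocks: double size, halve channels
        size, ch = gen[-1]
        gen.append((size * 2, ch // 2))
    branches = [g for g in gen if g[0] in (64, 128, 256)]   # 3x3 convs emit images at these scales
    disc_out = (4, 8 * nd)               # each D_i down-samples its input to 4 x 4 x 8Nd
    return gen, branches, disc_out

if __name__ == "__main__":
    print(stackgan_v2_shapes())          # branches should be 64x64x4Ng, 128x128x2Ng, 256x256x1Ng
```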
6 EXPERIMENTS

We conduct extensive experiments to evaluate the proposed methods. In Section 6.1, several state-of-the-art methods on text-to-image synthesis and on unconditional image synthesis are compared with the proposed methods. We first evaluate the effectiveness of our StackGAN-v1 for text-to-image synthesis by comparing it with GAWWN [33] and GAN-INT-CLS [35]. Then, StackGAN-v2 is compared with StackGAN-v1 on different datasets to show its advantages and limitations. Moreover, StackGAN-v2, as a more general framework, also works well on unconditional image synthesis tasks, and on such tasks it is compared with several state-of-the-art methods [3], [13], [26], [32], [56]. In Section 6.2, several baseline models are designed to investigate the overall design and important components of our StackGAN-v1. For the first baseline, we directly train Stage-I GAN to generate 64 × 64 and 256 × 256 images, to investigate whether the proposed two-stage stacked structure and the Conditioning Augmentation are beneficial. Then we modify our StackGAN-v1 to generate 128 × 128 and 256 × 256 images to investigate whether larger images generated by our method result in higher image quality. We also investigate whether inputting text at both stages of StackGAN-v1 is useful. In Section 6.3, experiments are designed to validate important components of our StackGAN-v2, including designs with fewer multi-scale image distributions, the effect of jointly approximating conditional and unconditional distributions, and the effectiveness of the proposed color-consistency regularization.

Datasets. We evaluate our conditional StackGAN for text-to-image synthesis on the CUB [49], Oxford-102 [30] and COCO [24] datasets. CUB [49] contains 200 bird species with 11,788 images. Since 80 percent of the birds in this dataset have object-image size ratios of less than 0.5 [49], as a pre-processing step we crop all images to ensure that the bounding boxes of birds have greater-than-0.75 object-image size ratios. Oxford-102 [30] contains 8,189 images of flowers from 102 different categories. To show the generalization capability of our approach, a more challenging dataset, COCO [24], is also utilized for evaluation. Different from CUB and Oxford-102, the COCO dataset contains images with multiple objects and various backgrounds. Each image in COCO has 5 descriptions, while 10 descriptions are provided by [34] for every image in the CUB and Oxford-102 datasets. Following the experimental setup in [35], we directly use the training and validation sets provided by COCO; meanwhile, we split CUB and Oxford-102 into class-disjoint training and test sets. Our unconditional StackGAN utilizes the bedroom and church subsets of LSUN [54], and a dog-breed² and a cat-breed³ subset of ImageNet [39], to synthesize different types of images. The statistics of the datasets are presented in Table 1.

Evaluation Metrics. It is difficult to evaluate the performance of generative models (e.g., GANs). In this paper, we choose the inception score (IS) [40] and the Fréchet inception distance (FID) [15] for quantitative evaluation. The inception score (IS) [40] is the first well-known metric for evaluating GANs: IS = \exp(E_x[D_{KL}(p(y|x) \| p(y))]), where x denotes one generated sample and y is the label predicted by the Inception model [41].⁴ The intuition behind this metric is that good models should generate diverse but meaningful images. Therefore, the KL divergence between the marginal distribution p(y) and the conditional distribution p(y|x) should be large. As suggested in [40], we compute the inception score on a large number of samples (i.e., 30k samples randomly generated for the test set) for each model.⁵

The Fréchet inception distance [15] was recently proposed as a metric that considers not only the synthetic data distribution but also how it compares to the real data distribution. It directly measures the distance between the synthetic data distribution p(·) and the real data distribution p_r(·). In practice, images are encoded with visual features by the Inception model. Assuming the feature embeddings follow a multidimensional Gaussian distribution, the synthetic data's Gaussian with mean and covariance (m, C) is obtained from p(·), and the real data's Gaussian with mean and covariance (m_r, C_r) is obtained from p_r(·). The difference between the synthetic and real Gaussians is measured by the Fréchet distance, i.e., FID = \|m - m_r\|_2^2 + \mathrm{Tr}(C + C_r - 2(C C_r)^{1/2}). Lower FID values mean closer distances between the synthetic and real data distributions. To compute the FID score for an unconditional model, 30k samples are randomly generated.

2. Using the WordNet IDs provided by Vinyals et al. [47].
3. Using the WordNet IDs provided in our supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2018.2856256.
4. In our experiments, for the fine-grained datasets CUB and Oxford-102, we fine-tune an Inception model for each of them. For the other datasets, we directly use the pre-trained Inception model.
5. The mean and standard deviation of the inception scores over ten splits are reported.
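The two metrics can be computed from Inception outputs as sketched below (numpy/scipy; splitting into ten folds and the fine-tuned Inception models for CUB and Oxford-102 are omitted, and the function names are ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def inception_score(probs, eps=1e-12):
    """probs: (num_samples, num_classes) array of p(y|x) from the Inception model."""
    p_y = probs.mean(axis=0, keepdims=True)                      # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                              # IS = exp(E_x KL(p(y|x) || p(y)))

def fid(mu_fake, cov_fake, mu_real, cov_real):
    """FID = ||m - m_r||^2 + Tr(C + C_r - 2 (C C_r)^{1/2}) on Inception features."""
    covmean = sqrtm(cov_fake @ cov_real).real
    return float(((mu_fake - mu_real) ** 2).sum()
                 + np.trace(cov_fake + cov_real - 2 * covmean))
```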
TABLE 2
Inception Scores (IS), Fréchet Inception Distance (FID) and Average Human Ranks (HR) of GAN-INT-CLS [35], GAWWN [33] and Our StackGAN-v1 on CUB, Oxford-102, and COCO
(* means that images are re-sized to 64 × 64 before computing IS* and FID*)
To compute the FID score for a text-to-image model, all sentences in the corresponding test set are utilized to generate samples.

To better evaluate the proposed methods, especially to see whether the generated images are well conditioned on the given text descriptions, we also conduct user studies. We randomly select 50 text descriptions for each class of the CUB and Oxford-102 test sets. For the COCO dataset, 4k text descriptions are randomly selected from its validation set. For each sentence, 5 images are generated by each model. Given the same text descriptions, 30 users (not including any of the authors) are asked to rank the results of the different methods. The average ranks given by the human users are calculated to evaluate all compared methods.

In addition, we use the t-SNE [46] embedding method to visualize a large number (e.g., 30k on the CUB test set) of high-dimensional images in a two-dimensional map. We observe that t-SNE is a good tool to examine the distribution of synthesized images and identify collapsed modes.

6.1 Comparison with State-of-the-Art GAN Models

To demonstrate the effectiveness of the proposed method, we compare it with state-of-the-art GAN models on text-to-image synthesis [33], [35] and unconditional image synthesis [3], [13], [26], [32], [56].

Text-to-image synthesis. We compare our StackGAN models with several state-of-the-art text-to-image methods [33], [35] on the CUB, Oxford-102 and COCO datasets. The inception scores, Fréchet inception distances and average human ranks for the proposed StackGAN models and the compared methods are reported in Table 2. Representative examples are compared in Figs. 3 and 4. For meaningful and fair comparisons with previous methods, the inception scores (IS/IS*) and Fréchet inception distances (FID/FID*) are computed in two settings. In the first setting, the 256 × 256 images produced by StackGAN, the 128 × 128 images generated by GAWWN [33] and the 64 × 64 images yielded by GAN-INT-CLS [35] are used directly to compute IS and FID.
Fig. 3. Example results by our StackGANs, GAWWN [33], and GAN-INT-CLS [35] conditioned on text descriptions from CUB test set.
Fig. 4. Example results by our StackGANs and GAN-INT-CLS [35] conditioned on text descriptions from Oxford-102 test set (leftmost four columns)
and COCO validation set (rightmost four columns).
Thus, in this setting, the different models are compared directly using their generated images, which have different resolutions. In the second setting, before computing IS* and FID*, all generated images are re-sized to the same resolution of 64 × 64 for a fair comparison.

Compared with previous GAN models [33], [35], on the text-to-image synthesis task our StackGAN-v1 model achieves the best FID*, IS and average human rank on all three datasets. As shown in Table 2, compared with GAN-INT-CLS [35], StackGAN-v1 achieves a 28.47 percent improvement in terms of inception score (IS) on the CUB dataset (from 2.88 to 3.70), and a 20.30 percent improvement on Oxford-102 (from 2.66 to 3.20). When we compare images of different models at the same resolution of 64 × 64, our StackGAN-v1 still achieves higher inception scores (IS*) than GAN-INT-CLS, but produces a slightly worse inception score (IS*) than GAWWN [33], because GAWWN uses additional supervision. Meanwhile, the FID* of StackGAN-v1 is nearly one half of the FID* of GAN-INT-CLS on each dataset. This means that StackGAN-v1 can better model and estimate the 64 × 64 image distribution. In comparison, the FID of StackGAN-v1 is higher than that of GAN-INT-CLS [35] on COCO. The reason is that the FID of GAN-INT-CLS is the distance between two 64 × 64 image distributions, while the FID of StackGAN-v1 is the distance between two 256 × 256 image distributions. It is clear that estimating the 64 × 64 image distribution is much easier than estimating the 256 × 256 image distribution. This is also the reason why the FID is higher than the FID* for the same model. Finally, the better average human rank of our StackGAN-v1 also indicates that our proposed method is able to generate more realistic samples conditioned on text descriptions.

On the other hand, representative examples are shown in Figs. 3 and 4 for visual comparison. As shown in Fig. 3, the 64 × 64 samples generated by GAN-INT-CLS [35] can only reflect the general shape and color of the birds. Their results lack vivid parts (e.g., beaks and legs) and convincing details in most cases, which makes them neither realistic enough nor of sufficiently high resolution. By using additional conditioning variables on location constraints, GAWWN [33] obtains a better inception score on the CUB dataset, which is still slightly lower than ours. It generates higher-resolution images with more details than GAN-INT-CLS, as shown in Fig. 3. However, as mentioned by its authors, GAWWN fails to generate any plausible images when it is only conditioned on text descriptions [33]. In comparison, our StackGAN-v1 for the first time generates images of 256 × 256 resolution with photo-realistic details from only text descriptions.

Comparison between StackGAN-v1 and StackGAN-v2. The comparison between StackGAN-v1 and StackGAN-v2 by different quantitative metrics as well as human evaluations is reported in Table 3. For unconditional generation, the samples generated by StackGAN-v2 are consistently better than those by StackGAN-v1 (last four columns in Table 3) from a human perspective. The end-to-end training scheme together with the color-consistency regularization enables StackGAN-v2 to produce more feedback and regularization for each branch, so that consistency is better maintained during the multi-step generation process. This is especially useful for unconditional generation, as no extra conditions (e.g., text) are applied. On the text-to-image datasets, the scores are mixed for StackGAN-v1 and StackGAN-v2. The reason is partially due to the fact that the text information, which is a strong constraint, is added in all the stages to keep coherence. The comparison results of the FIDs are consistent with the comparison results of the human ranks on all datasets. On the other hand, the inception score draws different conclusions on LSUN-bedroom, LSUN-church, and ImageNet-cat. We think the reason is that the Inception model is pre-trained on ImageNet with 1000 classes, which makes it less suitable for class-specific datasets. Compared with ImageNet-cat, which has 17 classes, the inception score for ImageNet-dog correlates better with human ranks because ImageNet-dog covers more (i.e., 118) classes from ImageNet. Hence we believe that, for class-specific datasets, it is more reasonable to use FID to directly compare feature distances between generated samples and real-world samples [15].

For visual comparison of the results of the two models, we utilize the t-SNE [46] algorithm.
TABLE 3
Comparison of StackGAN-v1 and StackGAN-v2 on Different Datasets by Inception Scores (IS), Fréchet Inception Distance (FID) and Average Human Ranks (HR)
For each model, a large number of images are generated and embedded into the 2D plane. We first extract a 2048-d CNN feature from each generated image using a pre-trained Inception model. Then, the t-SNE algorithm is applied to embed the CNN features into a 2D plane, resulting in a location for each image in the 2D plane. Due to page limits, Fig. 5 only shows a 50 × 50 grid with compressed images for each dataset, where each generated image is mapped to its nearest grid location. By visualizing a large number of images, t-SNE is a good tool to examine the synthesized distribution and evaluate its diversity.

We also follow [31] to use the multi-scale structural similarity (MS-SSIM) [51] as a metric to measure the variability of samples. We observe that MS-SSIM is useful for finding large-scale mode collapses, but it often fails to detect small-scale mode collapses or to measure the loss of variation in the generated samples' color or texture. This observation is consistent with the one found in [19]. For example, in Fig. 5, StackGAN-v1 has two small collapsed modes (nonsensical images) while StackGAN-v2 does not have any collapsed nonsensical mode. However, the MS-SSIM score of StackGAN-v1 (0.0945) is better than that of StackGAN-v2 (0.1311) and even better than that of the real data (0.1007). Thus, we argue that MS-SSIM is not a good metric for capturing small-scale mode collapses. On the contrary, the t-SNE visualization of the generated samples can easily help us identify any collapsed modes in the samples as well as evaluate sample variability in texture, color and viewpoint.
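The visualization pipeline described above, 2048-d Inception features embedded into a 2-D plane, essentially reduces to the following sketch; extracting the features and snapping points onto the 50 × 50 grid are omitted, and scikit-learn's t-SNE is only one possible implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_images_2d(features, seed=0):
    """features: (num_images, 2048) Inception features of generated images."""
    return TSNE(n_components=2, random_state=seed).fit_transform(np.asarray(features))
```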
More visual comparisons of StackGAN-v1 and StackGAN-v2 on different datasets can be found in Figs. 3, 4, 6, 7, 8, and 9. In particular, Fig. 9 illustrates failure cases of StackGAN-v1 and StackGAN-v2. We categorize the failures in these cases into three groups: mild, moderate, and severe. The "mild" group means that the generated images have a smooth and coherent appearance but lack vivid objects. The "moderate" group means that the generated images have obvious artifacts, which usually are signs of mode collapse. The "severe" group indicates that the generated images fall into collapsed modes. Based on this criterion, on the simple dataset, Oxford-102, all failure cases of StackGAN-v1 belong to the "mild" group, while on the other datasets all three groups of failure cases are observed. In comparison, we observe that all failure cases of StackGAN-v2 belong to the "mild" group, meaning that StackGAN-v2-generated images have no collapsed nonsensical modes (see Fig. 5). By jointly optimizing multiple distributions (objectives), StackGAN-v2 shows more stable training behavior and results in better FID and inception scores on most datasets (see Table 3).
Fig. 5. Utilizing t-SNE to embed images generated by our StackGAN-v1 and StackGAN-v2 on the CUB test set.
Fig. 6. Comparison of samples generated by models trained on LSUN bedroom dataset (Zoom in for better comparison).
Fig. 8. 256 × 256 samples generated by our StackGAN-v1 (top) and StackGAN-v2 (bottom) on ImageNet cat (left) and LSUN church (right).
However, because of the same reason, compared with StackGAN-v1, it is harder for StackGAN-v2 to converge on more complex datasets, such as COCO. In contrast, StackGAN-v1 optimizes the sub-tasks separately by training stage by stage. It produces slightly more appealing images on COCO than StackGAN-v2 based on the human rank results, but also generates more images that are moderate or severe failure cases. Consequently, while StackGAN-v2 is more advanced than StackGAN-v1 in many aspects (such as end-to-end training and more stable training behavior), StackGAN-v1 has the advantage of stage-by-stage training, which converges faster and requires less GPU memory.

Unconditional Image Synthesis. We evaluate the effectiveness of StackGAN-v2 for the unconditional image generation task by comparing it with DCGAN [32], WGAN [3], EBGAN-PT [56], LSGAN [26], and WGAN-GP [13] on the LSUN bedroom dataset. As shown in Fig. 6, our StackGAN-v2 is able to generate 256 × 256 images with more photo-realistic details. In Fig. 7, we also compare the 256 × 256 samples generated by StackGAN-v2 and EBGAN-PT.
Fig. 9. Examples of failure cases of StackGAN-v1 (top) and StackGAN-v2 (bottom) on different datasets.
Fig. 10. Samples generated by our StackGAN-v1 from unseen texts in CUB test set. Each column lists the text description, images generated from
the text by Stage-I and Stage-II of StackGAN-v1.
As shown in the figure, the samples generated by the two methods have the same resolution, but StackGAN-v2 generates more realistic ones (e.g., more recognizable dog faces with eyes and noses). While on the LSUN bedroom dataset only qualitative results are reported in [3], [13], [26], [32], [56], a DCGAN model [32] is trained for quantitative comparison using the publicly available source code⁶ on the ImageNet Dog dataset. The inception score of DCGAN is 8.19 ± 0.11, which is much lower than the inception score achieved by our StackGAN-v2 (9.55 ± 0.11). These experiments demonstrate that our StackGAN-v2 outperforms the state-of-the-art methods for unconditional image generation. Example images generated by StackGAN on the LSUN church and ImageNet cat datasets are presented in Fig. 8.

6.2 The Component Analysis of StackGAN-v1

In this section, we analyze different components of StackGAN-v1 on the CUB dataset with baseline models.

The Design of StackGAN-v1. As shown in the first four rows of Table 4, if Stage-I GAN is directly used to generate images, the inception scores decrease significantly. This performance drop is well illustrated by the results in Fig. 11. As shown in the first row of Fig. 11, Stage-I GAN fails to generate any plausible 256 × 256 samples without using Conditioning Augmentation (CA). Although Stage-I GAN with CA is able to generate more diverse 256 × 256 samples, those samples are not as realistic as the samples generated by StackGAN-v1. This demonstrates the necessity of the proposed stacked structure. In addition, by decreasing the output resolution from 256 × 256 to 128 × 128, the inception score decreases from 3.70 to 3.35. Note that all images are scaled to 299 × 299 before calculating the inception score. Thus, if our StackGAN-v1 just increased the image size without adding more information, the inception score would remain the same for samples of different resolutions. Therefore, the decrease in inception score for the 128 × 128 StackGAN-v1 demonstrates that our 256 × 256 StackGAN-v1 does add more details into the larger images. For the 256 × 256 StackGAN-v1, if the text is only input to Stage-I (denoted as "no Text twice"), the inception score decreases from 3.70 to 3.45. This indicates that processing text descriptions again at Stage-II helps refine Stage-I results. The same conclusion can be drawn from the results of the 128 × 128 StackGAN-v1 models.

Fig. 10 illustrates some examples of the Stage-I and Stage-II images generated by our StackGAN-v1. As shown in the first row of Fig. 10, in most cases, Stage-I GAN is able to draw rough shapes and colors of objects given text descriptions.

TABLE 4
Inception Scores Calculated with 30,000 Samples Generated on CUB by Different Baseline Models of Our StackGAN-v1

Method                   CA     Text twice    Inception score
64 × 64 Stage-I GAN      no     /             2.66 ± .03
64 × 64 Stage-I GAN      yes    /             2.95 ± .02
256 × 256 Stage-I GAN    no     /             2.48 ± .00
256 × 256 Stage-I GAN    yes    /             3.02 ± .01
128 × 128 StackGAN-v1    yes    no            3.13 ± .03
128 × 128 StackGAN-v1    no     yes           3.20 ± .03
128 × 128 StackGAN-v1    yes    yes           3.35 ± .02
256 × 256 StackGAN-v1    yes    no            3.45 ± .02
256 × 256 StackGAN-v1    no     yes           3.31 ± .03
256 × 256 StackGAN-v1    yes    yes           3.70 ± .04

6. https://github.com/carpedm20/DCGAN-tensorflow
TABLE 5
Inception Scores by Our StackGAN-v2 and Its Baseline Models on CUB test set
“JCU” means using the proposed discriminator that jointly approximates conditional and unconditional distributions.
Fig. 14. Example images generated by the StackGAN-v2 and its baseline models on LSUN bedroom (top) and CUB (bottom) datasets.
Fig. 15. Example images generated without (top) and with (bottom) the proposed color-consistency regularization for our StackGAN-v2 on ImageNet dog, ImageNet cat and LSUN church datasets. (Left to right) 64 × 64, 128 × 128, and 256 × 256 images by G0, G1, G2, respectively.
… dramatically decrease from 4.04 to 3.49 for "StackGAN-v2-G2" and to 2.89 for "StackGAN-v2-all256" (see Table 5 and Figs. 14e, 14f). This demonstrates the importance of the multi-scale, multi-stage architecture in StackGAN-v2. Inspired by [10], we also build a baseline model with multiple discriminators at the 256 × 256 scale, namely "StackGAN-v2-3×G2". Those discriminators have the same structure but different initializations. However, the results do not show improvement over "StackGAN-v2-G2". Similar comparisons have also been done for the unconditional task on the LSUN bedroom dataset. As shown in Figs. 14a, 14b, and 14c, those baseline models fail to generate realistic images because they suffer from severe mode collapses.

To further demonstrate the effectiveness of jointly approximating conditional and unconditional distributions, "StackGAN-v2-no-JCU" replaces the jointly conditional and unconditional discriminators with conventional ones, resulting in a much lower inception score than that of "StackGAN-v2". Another baseline model does not use the color-consistency regularization term. Results on various datasets (see Fig. 15) show that the color-consistency regularization has significant positive effects for the unconditional image synthesis task. Quantitatively, removing the color-consistency regularization decreases the inception score from 9.55 ± 0.11 to 9.02 ± 0.14 on the ImageNet dog dataset. This demonstrates that the additional constraint provided by the color-consistency regularization is able to facilitate multi-distribution approximation and help the generators at different branches produce more coherent samples. It is worth mentioning that there is no need to utilize the color-consistency regularization for the text-to-image synthesis task, because the text conditioning appears to provide sufficient constraints. Experimentally, adding the color-consistency regularization did not improve the inception score on the CUB dataset.

7 CONCLUSIONS

In this paper, Stacked Generative Adversarial Networks, StackGAN-v1 and StackGAN-v2, are proposed to decompose the difficult problem of generating realistic
high-resolution images into more manageable sub-problems. The StackGAN-v1 with Conditioning Augmentation is first proposed for text-to-image synthesis through a novel sketch-refinement process. It succeeds in generating images of 256 × 256 resolution with photo-realistic details from text descriptions. To further improve the quality of generated samples and stabilize GANs' training, the StackGAN-v2 jointly approximates multiple related distributions, including (1) multi-scale image distributions and (2) joint conditional and unconditional image distributions. In addition, a color-consistency regularization is proposed to facilitate multi-distribution approximation. Extensive quantitative and qualitative results demonstrate that our proposed methods significantly improve the state of the art in both conditional and unconditional image generation tasks.

ACKNOWLEDGMENTS

This work is partially supported by the Air Force Office of Scientific Research (AFOSR) under the Dynamic Data-Driven Application Systems Program and National Science Foundation (NSF) 1763523, 1747778, 1733843 and 1703883 Awards; partially supported by the NSF under grant ABI-1661280 and the CNS-1629913; and partially supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Nos. CUHK14213616, CUHK14206114, CUHK14205615, CUHK14203015, CUHK14239816, CUHK419412, CUHK14207814, CUHK14208417, CUHK14202217). Han Zhang and Tao Xu contributed equally to this work.

REFERENCES

[1] T. Salimans, H. Zhang, A. Radford, and D. N. Metaxas, "Improving GANs using optimal transport," in Proc. ICLR, 2018.
[2] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," in Proc. ICLR, 2017.
[3] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," CoRR, vol. abs/1701.0787, 2017.
[4] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, "Neural photo editing with introspective adversarial networks," in Proc. ICLR, 2017.
[5] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, "Mode regularized generative adversarial networks," in Proc. ICLR, 2017.
[6] T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio, "Maximum-likelihood augmented discrete generative adversarial networks," CoRR, vol. abs/1702.07983, 2017.
[7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. Conf. Neural Inf. Process. Syst., 2016, pp. 2172–2180.
[8] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, "Deep generative image models using a Laplacian pyramid of adversarial networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 1486–1494.
[9] C. Doersch, "Tutorial on variational autoencoders," CoRR, vol. abs/1606.05908, 2016.
[10] I. P. Durugkar, I. Gemp, and S. Mahadevan, "Generative multi-adversarial networks," in Proc. ICLR, 2017.
[11] J. Gauthier, "Conditional generative adversarial networks for convolutional face generation," Tech. Rep., 2015.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 5769–5779.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 6629–6640.
[16] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, "Stacked generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1866–1875.
[17] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.
[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5967–5976.
[19] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," in Proc. ICLR, 2018.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.
[21] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. ICLR, 2014.
[22] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1558–1566.
[23] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 105–114.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[25] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov, "Generating images from captions with attention," in Proc. ICLR, 2016.
[26] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in Proc. IEEE Conf. Comput. Vis., 2017, pp. 2813–2821.
[27] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, "Unrolled generative adversarial networks," in Proc. ICLR, 2017.
[28] M. Mirza and S. Osindero, "Conditional generative adversarial nets," CoRR, vol. abs/1411.1784, 2014.
[29] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune, "Plug & play generative networks: Conditional iterative generation of images in latent space," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3510–3520.
[30] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proc. 6th Indian Conf. Comput. Vis. Graph. Image Process., 2008, pp. 722–729.
[31] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 2642–2651.
[32] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. ICLR, 2016.
[33] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 217–225.
[34] S. Reed, Z. Akata, B. Schiele, and H. Lee, "Learning deep representations of fine-grained visual descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 49–58.
[35] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text-to-image synthesis," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1060–1069.
[36] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas, "Generating interpretable images with controllable structure," Tech. Rep., 2016.
[37] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1278–1286.
[38] D. L. Ruderman, "The statistics of natural images," Netw.: Comput. Neural Syst., vol. 5, no. 4, pp. 517–548, 1994.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[40] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 2234–2242.
[41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[42] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszar, "Amortised MAP inference for image super-resolution," in Proc. ICLR, 2017.
[43] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," in Proc. ICLR, 2017.
[44] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1747–1756.
[45] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional image generation with PixelCNN decoders," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4797–4805.
[46] L. van der Maaten and G. Hinton, "Visualizing high-dimensional data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[47] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, "Matching networks for one shot learning," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 3630–3638.
[48] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 613–621.
[49] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[50] X. Wang and A. Gupta, "Generative image modeling using style and structure adversarial networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 318–335.
[51] Z. Wang, E. Simoncelli, and A. C. Bovik, "Multi-scale structural similarity for image quality assessment," in Proc. 37th Signals Syst. Comput., 2003, pp. 1398–1402.
[52] X. Yan, J. Yang, K. Sohn, and H. Lee, "Attribute2Image: Conditional image generation from visual attributes," in Proc. 14th Eur. Conf. Comput. Vis., 2016, pp. 776–791.
[53] J. Yang, A. Kannan, D. Batra, and D. Parikh, "LR-GAN: Layered recursive generative adversarial networks for image generation," in Proc. ICLR, 2017.
[54] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop," CoRR, vol. abs/1506.03365, 2015.
[55] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Conf. Comput. Vis., 2017, pp. 5908–5916.
[56] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," in Proc. ICLR, 2017.
[57] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, "Generative visual manipulation on the natural image manifold," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 597–613.

Han Zhang received the BS degree in information science from China Agricultural University, Beijing, China, in 2009, and the ME degree in communication and information systems from Beijing University of Posts and Telecommunications, Beijing, China, in 2012. He is currently working toward the PhD degree in the Department of Computer Science, Rutgers University, Piscataway, New Jersey. His current research interests include computer vision, deep learning, and medical image processing.

Tao Xu received the BE degree in agricultural mechanization and automatization from China Agricultural University, Beijing, China, in 2010, and the MS degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2013. She is currently working toward the PhD degree in the Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA. Her current research interests include deep learning, computer vision, and medical image processing.

Hongsheng Li received the bachelor's degree in automation from the East China University of Science and Technology, and the master's and doctorate degrees in computer science from Lehigh University, Pennsylvania, in 2006, 2010, and 2012, respectively. He is currently with the Department of Electronic Engineering, Chinese University of Hong Kong. His research interests include computer vision, medical image analysis, and machine learning.

Shaoting Zhang received the BE degree from Zhejiang University, in 2005, the MS degree from Shanghai Jiao Tong University, in 2007, and the PhD degree in computer science from Rutgers, in January 2012. His research interests include the interface of medical imaging informatics, large-scale visual understanding, and machine learning. He is a senior member of the IEEE.

Xiaogang Wang received the BS degree from the University of Science and Technology of China, in 2001, the MS degree from the Chinese University of Hong Kong, in 2003, and the PhD degree from the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, in 2009. He is currently an associate professor with the Department of Electronic Engineering, The Chinese University of Hong Kong. His research interests include computer vision and machine learning. He is a member of the IEEE.

Xiaolei Huang received the bachelor's degree in computer science from Tsinghua University, China, and the doctorate degree in computer science from Rutgers University–New Brunswick. She is currently an associate professor with the Computer Science and Engineering Department, Lehigh University, Bethlehem, Pennsylvania. Her research interests include computer vision, biomedical image analysis, computer graphics, and machine learning. In these areas she has published articles in journals such as the IEEE Transactions on Pattern Analysis and Machine Intelligence, MedIA, the IEEE Transactions on Medical Imaging, the ACM Transactions on Graphics, and Scientific Reports. She also regularly contributes research papers to conferences such as CVPR, MICCAI, and ICCV. She serves as an associate editor for the Computer Vision and Image Understanding journal. She is a member of the IEEE.

Dimitris N. Metaxas received the BE degree from the National Technical University of Athens, Greece, in 1986, the MS degree from the University of Maryland, in 1988, and the PhD degree from the University of Toronto, in 1992. He is a professor with the Computer Science Department, Rutgers University. He is directing the Computational Biomedicine Imaging and Modeling Center (CBIM). He has been conducting research toward the development of formal methods upon which computer vision, computer graphics, and medical imaging can advance synergistically. He is a fellow of the IEEE.