Cross-Caption Cycle-Consistent Text-to-Image Synthesis
Abstract
(a) Legend: {ti}: True Captions; {t̂i}: Generated Captions; {Îi}: Generated Images; I: True Image; CCCL: Cross-Caption Cycle Consistency Loss; DL: Discriminator Loss.
Figure 2: The figure shows how Cross-Caption Cycle Consistency is maintained across four captions (t1, · · · , t4). A generator G converts ti to an image Îi. The discriminator at each step forces Îi to be realistic. A Cross-Caption Cycle Consistency Network (CCCN) converts Îi back to a caption (t̂i+1). The Cross-Caption Cycle Consistency Loss (CCCL) forces it to be close to ti+1. In the last step, t̂5 is ensured to be consistent with the initial caption t1, hence completing a cycle.
…accomplish real-world applications such as sketch-to-image generation, real image-to-anime character generation, etc. Our proposed approach takes the idea one step further and imposes a transitive consistency across multiple captions. We call this Cross-Caption Cycle Consistency, which is explained in Section 4.1.

Recurrent GAN Architectures: Recurrent GANs were proposed to model data with a temporal component. In particular, Vondrick et al. [18] use this idea to generate small realistic video clips, and Ghosh et al. [3] use a recurrent GAN architecture to make predictions on abstract reasoning tasks by conditioning on the previous context or history. More relevant examples come from the efforts in [5, 2], which display the potential of recurrent GAN architectures in generating better quality images. There, the generative process spans across time, building the image step by step; [2] utilizes this time lapse to enhance one attribute of the object at a time. We exploit this recurrent nature to continually improve upon the history while generating an image. Unlike previous efforts, we differ in how we model the recurrent aspect and how we update the hidden state of the recurrent model. To the best of our knowledge, the proposed architecture is the first to use a recurrent formulation for the text-to-image synthesis problem.

3. Preliminaries

3.1. Generative Adversarial Networks

GANs [4] are generative models that sidestep the difficulty of approximating intractable probabilistic computations associated with maximum likelihood estimation and related strategies, by pitting the generative model (G) against an adversary (D) that learns to discriminate whether a sample comes from the data distribution (p_data(x)) or from the generator, whose input noise z is drawn from a prior p_z(z). G and D play the following min-max game with the value function V(D, G):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

In the proposed architecture, we make use of a conditional GAN [10] in which both the generator and the discriminator are conditioned on a variable φ(t), yielding G(z|φ(t)) and D(x|φ(t)), where φ(t) is a vector representation of the caption.
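To make the conditional objective concrete, the following is a minimal PyTorch-style sketch of the two losses that are alternately optimized; G, D and the caption embedding phi_t are placeholders, not the specific networks used in this paper, and the non-saturating generator form is used as is common in practice.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, real_images, phi_t, z):
    """Conditional GAN: D is pushed towards D(x|phi(t)) = 1 for real and 0 for generated samples."""
    fake_images = G(z, phi_t).detach()          # stop gradients from flowing into G
    real_logits = D(real_images, phi_t)
    fake_logits = D(fake_images, phi_t)
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake                # minimizing this maximizes V(D, G) w.r.t. D

def generator_loss(D, G, phi_t, z):
    """G tries to make D classify generated samples as real (non-saturating surrogate for log(1 - D))."""
    fake_logits = D(G(z, phi_t), phi_t)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```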
3.2. Text embedding

The text embedding of the caption that we use to condition the GAN yields the best results when it bears a semantic correspondence with the actual image it represents. One method for obtaining such a text encoding is Structured Joint Embedding (SJE), initially proposed by Akata et al. [1] and further improved by Reed et al. [13]. They learn a text encoder, ϕ(t), which transforms the caption t in such a way that its inner product with the corresponding image embedding, θ(v), is higher if they belong to the same class and lower otherwise. For a training set {(v_n, t_n, y_n) : n = 1, · · · , N}, where v_n, t_n and y_n correspond to the image, the text and the class label, ϕ(t) is learned by optimizing the following structured loss:

(1/N) Σ_{n=1}^{N} [Δ(y_n, f_v(v_n)) + Δ(y_n, f_t(t_n))], where

f_v(v) = arg max_{y∈Y} E_{t∼T(y)}[θ(v)^T ϕ(t)]  and  f_t(t) = arg max_{y∈Y} E_{v∼V(y)}[θ(v)^T ϕ(t)]
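The compatibility score and the classifier f_v above can be sketched as follows (f_t is symmetric). The image embeddings and per-class caption embeddings are assumed to be given; this only illustrates the scoring rule, not the training procedure of [1, 13].

```python
import torch

def compatibility(theta_v, phi_t):
    """Inner-product compatibility theta(v)^T phi(t) for batches of image and caption embeddings."""
    return theta_v @ phi_t.t()                      # (num_images, num_captions)

def f_v(theta_v, phi_by_class):
    """f_v(v): pick the class whose captions give the highest average compatibility with the image."""
    # phi_by_class: list with one (n_y, d) tensor of caption embeddings per class y
    scores = torch.stack([(theta_v @ phi_y.t()).mean(dim=1) for phi_y in phi_by_class], dim=1)
    return scores.argmax(dim=1)                     # predicted class label per image
```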
Figure 3: Architecture of Cascaded-C4Synth. A series of generators is conditioned, one by one, on N captions and on the previously generated image through a non-linear mapping (convolutional block Bi). N is set to 3 in our experiments.
After the network is trained [1], we use ϕ(t) to encode the captions. A similar method has been used in previous text-to-image generation approaches [14, 25, 24, 16, 17]. ϕ(t) is a high-dimensional vector. To transform it into a lower-dimensional conditioning latent variable, Zhang et al. [25] proposed the 'Conditional Augmentation' technique: the latent vector is randomly sampled from an independent Gaussian distribution whose mean vector and covariance matrix are parameterized by ϕ(t). We refer the reader to [25] for more details.
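A minimal sketch of this Conditional Augmentation step, assuming a PyTorch setting: a linear layer predicts a mean and log-variance from ϕ(t), and the conditioning vector is drawn with the reparameterization trick. The input dimension of 1024 is an assumption; only the 128-dimensional output used later in Section 4.2.1 is stated in the paper.

```python
import torch
import torch.nn as nn

class ConditionalAugmentation(nn.Module):
    """Samples a low-dimensional conditioning vector c ~ N(mu(phi_t), Sigma(phi_t))."""
    def __init__(self, text_dim=1024, cond_dim=128):   # text_dim is an assumed embedding size
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)    # predicts mean and log-variance

    def forward(self, phi_t):
        mu, logvar = self.fc(phi_t).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)           # reparameterization trick
        # KL(N(mu, Sigma) || N(0, I)); used later as the regularizer in the generator loss
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return c, kl
```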
4. Methodology

The main contribution of our work is a framework that generates images by utilizing information from multiple captions. This is achieved by ensuring Cross-Caption Cycle Consistency. The generic idea of Cross-Caption Cycle Consistency is explained in Section 4.1. We devise two network architectures that maintain this consistency. The first is a straightforward instantiation of the idea, in which multiple generators progressively generate images by consuming captions one by one; this method is explained in Section 4.2. A serious limitation of this approach is that the network architecture restricts the number of captions that can be used to generate an image. This leads us to formulate a recurrent version of the method, in which a single generator recursively consumes any number of captions; this method is explained in Section 4.3.

4.1. Cross-Caption Cycle Consistency

Cross-Caption Cycle Consistency is achieved by ensuring that the generated image is consistent with a set of captions describing the same image. Figure 2 gives a simplified overview of the process. Let us take the example of synthesizing an image by distilling information from four captions. In the first iteration, a generator network (G) takes noise and the first caption, t1, as its input to generate an image, Î1, which is passed to a discriminator network (D) that verifies whether it is real or not. As in a usual GAN setup, the generator tries to create better-looking images so that it can fool the discriminator. The generated image features are also passed to a 'Cross-Caption Cycle Consistency Network' (CCCN), which learns to generate a caption for the image. During training, the Cross-Caption Cycle Consistency Loss ensures that the generated caption is similar to the second caption, t2.

In the next iteration, Î1 and t2 are fed to the generator to generate Î2. While D urges G to make Î2 similar to the real image I, the CCCN ensures that the learned image representation is consistent with the next caption in the sequence. This repeats until Î4 is generated, at which point the CCCN ensures that the generated caption is similar to the first caption, t1. Hence we complete a cycle, t1 → t2 → t3 → t4 → t1, while progressively generating Î1 · · · Î4. Î4 contains concepts from all the captions and is hence much richer in quality.
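The caption ordering of the cycle can be summarized with the following skeleton, where G, phi, d_loss and cccn_loss are placeholder callables standing in for the networks and losses detailed in Sections 4.2 and 4.3; the essential point is that the caption decoded from Îi is matched against t_{i+1}, wrapping around to t1 at the last step.

```python
def cycle_pass(G, phi, d_loss, cccn_loss, captions, real_image, z):
    """One pass over K captions of the same image (Figure 2).

    captions: list [t_1, ..., t_K]; phi: text encoder; G, d_loss, cccn_loss: placeholder callables.
    """
    K = len(captions)
    image = None
    gan_total, cycle_total = 0.0, 0.0
    for i in range(K):
        cond = phi(captions[i])
        image = G(z, cond) if image is None else G(image, cond)   # build on the previous image
        gan_total = gan_total + d_loss(image, real_image, cond)   # realism enforced at every step
        nxt = captions[(i + 1) % K]                               # t_{i+1}, wrapping to t_1 at the end
        cycle_total = cycle_total + cccn_loss(image, nxt)         # cross-caption cycle consistency
    return gan_total, cycle_total
```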
4.2. Cascaded-C4Synth

In our first approach, we consider Cross-Caption Cycle Consistent image generation as a cascaded process in which a series of generators consumes multiple captions one by one to generate images. The image generated at each step is a function of the previous image and the caption supplied at the current stage. This allows each stage to build upon the intermediate image generated in the previous stage by utilizing the additional concepts from the new caption seen in the current stage. At each stage, a separate discriminator and CCCN are used. The discriminator is tasked with identifying whether the generated image is fake or real, while the CCCN translates the image to its corresponding caption and checks how close it is to the next caption in succession.
Figure 4: Architecture of Recurrent-C4Synth. The figure shows the network unrolled in time. hi refers to the hidden state at time step i, ti is the caption consumed at time step i, and t̃i is its vector representation.
The architecture is presented in Figure 3. A set of convolutional blocks (denoted by Bi in the figure) builds up the backbone of the network. The first layer of each Bi consumes a caption. Each generator (Gi) and CCCN (CCCNi) branches off from the last layer of the corresponding Bi, while a new Bi attaches itself to grow the backbone. The number of Bi's is fixed while designing the architecture and restricts the number of captions that can be used to generate an image. The main components of the architecture are explained below.

4.2.1 Backbone Network

A vector representation (t̃i) for each caption (ti) is generated using Structured Joint Embedding (φ(ti)) followed by the Conditional Augmentation module; a brief description of the text encoding is presented in Section 3.2. t̃i is a 128-dimensional vector. In the first convolutional block, B1, t̃1 is combined with a 100-dimensional noise vector (z) sampled from a standard normal distribution. The combined vector is passed through fully connected layers and reshaped into a 4 × 4 × 16Ng tensor. Four up-sampling layers up-sample it to a 64 × 64 × 8Ng tensor, which is passed on to the first generator (G1) and the first CCCN (CCCN1).

Further convolutional blocks, Bi, are added to B1 as follows. The new caption encoding, t̃i, is spatially replicated at each location of the backbone features (bi) coming from the previous convolutional block and merged by a 3 × 3 convolution. These features are passed through a residual block and an up-sampling layer, which increases the spatial resolution of the feature maps in each Bi. Hence, the output of B2 is a tensor of size 128 × 128 × 4Ng. The generator and CCCN branch off from this feature map as before, and a new convolutional block (B3) gets added. In our experiments we use three Bi's, due to GPU memory limitations. Ng is set to 32.
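A sketch of one backbone block Bi under the description above: the 128-dimensional caption code is spatially replicated over the incoming features, merged by a 3 × 3 convolution, refined by a residual block and up-sampled to double the spatial resolution. The channel widths and normalization choices are assumptions; only the 128-d caption code, the 100-d noise vector and Ng = 32 are fixed by the text.

```python
import torch
import torch.nn as nn

class BackboneBlock(nn.Module):
    """One B_i: fuse a caption code with backbone features, then refine and up-sample."""
    def __init__(self, in_ch, out_ch, cond_dim=128):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch + cond_dim, out_ch, kernel_size=3, padding=1)
        self.residual = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, features, t_i):
        b, _, h, w = features.shape
        cond = t_i.view(b, -1, 1, 1).expand(-1, -1, h, w)   # spatially replicate the caption code
        x = torch.relu(self.fuse(torch.cat([features, cond], dim=1)))
        x = torch.relu(x + self.residual(x))                # residual block
        return self.upsample(x)                             # doubles the spatial resolution
```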
4.2.2 Generator

Each generator (Gi) takes the features from the backbone network and passes them through a single 3 × 3 convolutional layer followed by a tanh activation to generate an RGB image. As the spatial resolution of the features from each Bi increases (as explained in Section 4.2.1), the size of the image generated by each generator also increases.

The multiple generators are trained together by minimizing the following loss function:

L_G = Σ_{i=1}^{N} L_{G_i}, where
L_{G_i} = E_{s_i∼p_{G_i}}[log(1 − D_i(s_i))] + λ D_KL(N(μ(φ(t_i)), Σ(φ(t_i))) || N(0, I))

The first term in L_{G_i} is the standard generator term of the GAN framework, which pushes the generator to produce better quality images; p_{G_i} is the distribution of the i-th generator network. The D_KL term is used to learn the parameters μ(φ(t_i)) and Σ(φ(t_i)) of the Conditional Augmentation framework [25], and is trained very similarly to the reparameterization trick in VAEs [7]. λ is a regularization parameter, whose value we set to 1 in our experiments.
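The combined generator objective can be written down directly from the equation above; the sketch below assumes each discriminator returns raw logits and that the KL terms come from a Conditional Augmentation module such as the one sketched earlier.

```python
import torch

def generator_objective(fake_logits_per_stage, kl_terms, lam=1.0):
    """L_G = sum_i E[log(1 - D_i(s_i))] + lam * D_KL(N(mu(phi(t_i)), Sigma(phi(t_i))) || N(0, I)).

    fake_logits_per_stage: list of raw logits D_i(s_i) for generated samples s_i ~ p_{G_i}.
    kl_terms: list of KL divergences from the Conditional Augmentation module, one per stage.
    """
    eps = 1e-7
    total = 0.0
    for logits, kl in zip(fake_logits_per_stage, kl_terms):
        adv = torch.log(1.0 - torch.sigmoid(logits) + eps).mean()   # saturating form, as in the text
        total = total + adv + lam * kl
    return total
```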
4.2.3 Discriminator

Each discriminator (Di) contains a set of down-sampling layers which convert the input tensor to a 4 × 4 × 18Nd tensor. Following the spirit of the conditional GAN [10], the encoded caption t̃i is spatially replicated and joined to the incoming image by a 3 × 3 convolution.
The final logit is obtained by convolving with a 4 × 4 × 1 kernel followed by a sigmoid activation.

Each discriminator is trained by maximizing the following objective:

L_{D_i} = E_{I_i∼p_data}[log D_i(I_i)] + E_{s_i∼p_{G_i}}[log(1 − D_i(s_i))]

where p_data is the original data distribution and p_{G_i} is the distribution of the corresponding generator network. The multiple discriminators are trained in parallel.
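A sketch of one stage discriminator consistent with this description, for a 64 × 64 input: stride-2 convolutions reduce the image to a 4 × 4 map, the replicated caption code is joined by a 3 × 3 convolution, and a 4 × 4 convolution yields the final logit. The channel widths and number of down-sampling steps are assumptions that would change with the input resolution.

```python
import torch
import torch.nn as nn

class StageDiscriminator(nn.Module):
    """Conditional discriminator: image features at 4x4 joined with the caption code."""
    def __init__(self, in_ch=3, base_ch=64, cond_dim=128, n_down=4):   # 64x64 input -> 4x4 features
        super().__init__()
        layers, ch = [], in_ch
        for i in range(n_down):
            out = base_ch * (2 ** i)
            layers += [nn.Conv2d(ch, out, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        self.down = nn.Sequential(*layers)
        self.join = nn.Conv2d(ch + cond_dim, ch, 3, padding=1)
        self.logit = nn.Conv2d(ch, 1, kernel_size=4)                   # 4x4 kernel -> single logit

    def forward(self, image, t_i):
        feat = self.down(image)                                        # (B, ch, 4, 4)
        cond = t_i.view(t_i.size(0), -1, 1, 1).expand(-1, -1, 4, 4)    # replicate the caption code
        joined = torch.relu(self.join(torch.cat([feat, cond], dim=1)))
        return self.logit(joined).view(-1)                             # raw logit; sigmoid applied outside
```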
4.2.4 Cross-Caption Cycle Consistency Network

The CCCN is modeled as an LSTM which generates one word at each time-step, conditioned on a context vector (derived by attending to specific regions of the image), the hidden state and the previously generated word. The CCCN takes as input the same backbone features that the generator consumes. These are pooled to reduce the spatial dimension, and regions of the resulting feature maps are aggregated into a single context vector by learning to attend to them, similar to the method proposed in [21]. Each word is encoded as its one-hot representation.

There is one CCCN block per generator. The CCCN is trained by minimizing the cross-entropy loss between each generated word and the corresponding word in the true caption. The true caption for stage i is the (i + 1)-th caption, and finally the first caption, as explained in Section 4.1. The losses of the CCCN blocks are aggregated and back-propagated together.
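One decoding step of such a captioner, in the spirit of [21], can be sketched as follows: the hidden state attends over the flattened spatial features, the resulting context vector and the previous word embedding drive an LSTM cell, and a linear layer scores the vocabulary. Layer sizes are illustrative, not those of the paper; training applies a cross-entropy loss between these logits and the next word of the target caption.

```python
import torch
import torch.nn as nn

class CCCNStep(nn.Module):
    """One word-generation step: attend to image features, update the LSTM, predict the next word."""
    def __init__(self, feat_dim, hidden_dim, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.word = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, prev_word, h, c):
        # feats: (B, L, feat_dim) -- flattened spatial regions of the backbone features
        att = self.att_out(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = torch.softmax(att, dim=1)                    # (B, L, 1) attention over regions
        context = (alpha * feats).sum(dim=1)                 # (B, feat_dim) context vector
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.word(h), h, c                            # logits for the next word
```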
4.3. Recurrent-C4Synth

The architecture of Cascaded-C4Synth limits the number of captions that can be consumed, because the number of generator-discriminator pairs has to be decided and fixed during training. We overcome this problem by formulating a recurrent approach to text-to-image synthesis. At its core, Recurrent-C4Synth maintains a hidden state which, along with a new caption, guides the image generation at each time step. The hidden state itself is modeled as a function of the previous hidden state and the image generated in the previous time step. This allows the hidden state to act like a shared memory that captures the essential features from the different captions to generate good-looking, semantically rich images. The exact way in which the hidden state is updated is explained in Section 4.3.1.

Figure 4 presents the simplified architecture of Recurrent-C4Synth; we explain it in detail here. The hidden state is realized as an 8 × 8 × 8 tensor. The values of the initial hidden state are learned by the Initializer Module, which takes as input a noise vector (z) of length 100, sampled randomly from a Gaussian distribution. It is passed through a fully connected layer followed by a non-linearity and finally reshaped into an 8 × 8 × 8 tensor. Our experiments reveal that initializing the hidden state with the Initializer Module helps the model learn better than initializing it randomly.

The hidden state, along with the text embedding of the caption, is passed to the generator to generate an image. A discriminator guides the generator to generate realistic images, while a Cross-Caption Cycle Consistency Network (CCCN) ensures that the captions generated from the image features are consistent with the second caption. As we unroll the network in time, a different caption is fed to the generator at each time step. When the final caption is fed in, the CCCN makes sure that the generated caption is consistent with the first caption. Hence the network ensures that the cycle consistency between captions is maintained.

The network architecture of the CCCN is the same as in Cascaded-C4Synth, while the architectures of the generator and the discriminator are slightly different; we explain them in Section 4.3.2. While Cascaded-C4Synth has a separate generator, with a corresponding discriminator and CCCN, at each stage, Recurrent-C4Synth has only one generator, discriminator and CCCN. The weights of the generator are shared across time steps and are updated via Back-Propagation Through Time (BPTT) [20].

4.3.1 Updating the Hidden State

In the first time step of the unrolled network, the hidden state is initialized by the Initializer Module. In successive time steps, the hidden state and the image generated in the previous time step are used to generate the new hidden state, as shown in Figure 4. The 64 × 64 images are down-sampled by a set of down-sampling convolutional layers to generate feature maps of spatial dimension 8 × 8. These feature maps are fused with the hidden state (also of spatial dimension 8 × 8) by eight 3 × 3 filters, resulting in a new hidden state of dimension 8 × 8 × 8. If we denote this operation by a function UpdateHiddenState(·), the recurrence relation at each time step i can be expressed as:

Î_i = Generator(h_i, φ(t_i))
h_i = UpdateHiddenState(h_{i−1}, Î_{i−1})

Î_i is the image generated by the generator by consuming the hidden state (h_i) and the vector representation of the caption (φ(t_i)) provided at time step i.
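A sketch of UpdateHiddenState(·) under the description above: the previously generated 64 × 64 image is reduced to 8 × 8 feature maps by down-sampling convolutions and fused with the previous hidden state by eight 3 × 3 filters. The intermediate channel counts and the tanh non-linearity are assumptions.

```python
import torch
import torch.nn as nn

class UpdateHiddenState(nn.Module):
    """h_i = UpdateHiddenState(h_{i-1}, I_{i-1}): fuse the previous image into the hidden state."""
    def __init__(self, img_ch=3, hid_ch=8):
        super().__init__()
        # three stride-2 convolutions: 64x64 -> 32x32 -> 16x16 -> 8x8
        self.down = nn.Sequential(
            nn.Conv2d(img_ch, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        )
        # eight 3x3 filters fuse the image features with the previous hidden state
        self.fuse = nn.Conv2d(32 + hid_ch, hid_ch, kernel_size=3, padding=1)

    def forward(self, h_prev, img_prev):
        feats = self.down(img_prev)                          # (B, 32, 8, 8)
        h_new = torch.tanh(self.fuse(torch.cat([feats, h_prev], dim=1)))
        return h_new                                         # (B, 8, 8, 8) new hidden state
```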
4.3.2 Generator and Discriminator

Recurrent-C4Synth uses a single generator to generate images of size 256 × 256. It consumes the hidden state hi and a vector representation φ(ti) of the caption provided at the current time step. φ(ti) is spatially replicated to each location of the hidden state and then fused by a 3 × 3 convolution layer. This results in a feature map of spatial resolution 8 × 8.
Figure 5: Generations from Cascaded-C4Synth. The first row shows the generated images and the corresponding captions consumed in the process. The first two images belong to the Indigo Bunting and Tree Sparrow classes of the CUB dataset [19], and the last image belongs to the Peruvian Lily class of the Flowers dataset [11]. The bottom row showcases some random samples of generated images. (Kindly zoom in to see the detail in the images.)

Figure 6: Generations from Recurrent-C4Synth. The first two images are generated from captions belonging to the Black Footed Albatross and Great Crested Flycatcher classes of the CUB dataset [19], while the last one is from the Moon Orchid class of the Flowers dataset [11]. The last two rows contain random generations from both datasets. (Kindly zoom in to see the detail in the images.)
One easy way to generate 256 × 256 images from these feature maps would be to stack five up-convolution layers (each doubling the spatial resolution) back to back. Our experiments showed that such a method does not work well in practice. Hence, we choose to also generate intermediate images of spatial resolution 64 × 64 and 128 × 128. This is achieved by attaching 3 × 3 × 3 kernels after the third and fourth up-sampling layers. The extra gradients (obtained by discriminating the intermediate images) that flow through the network help it learn better.

In order to discriminate the two intermediate images and the final image, we make use of three separate discriminators. The architecture of each discriminator is similar to that of Cascaded-C4Synth.
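The multi-resolution generator can be sketched as follows: the hidden state and the replicated caption code are fused at 8 × 8, five up-sampling stages reach 256 × 256, and RGB heads are attached after the third and fourth stages and at the end so that three discriminators can supply gradients at 64 × 64, 128 × 128 and 256 × 256. Channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    return nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                         nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def to_rgb(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 3, 3, padding=1), nn.Tanh())

class RecurrentGenerator(nn.Module):
    """Generator(h_i, phi(t_i)): fuse caption with hidden state, up-sample, emit 64/128/256 images."""
    def __init__(self, hid_ch=8, cond_dim=128, base_ch=512):
        super().__init__()
        self.fuse = nn.Conv2d(hid_ch + cond_dim, base_ch, 3, padding=1)       # 8x8 fused features
        chs = [base_ch, 256, 128, 64, 32, 16]
        self.ups = nn.ModuleList(up_block(chs[i], chs[i + 1]) for i in range(5))
        self.head64, self.head128, self.head256 = to_rgb(64), to_rgb(32), to_rgb(16)

    def forward(self, h, phi_t):
        cond = phi_t.view(phi_t.size(0), -1, 1, 1).expand(-1, -1, 8, 8)
        x = torch.relu(self.fuse(torch.cat([h, cond], dim=1)))
        outs = []
        for i, up in enumerate(self.ups):
            x = up(x)                                         # 16, 32, 64, 128, 256
            if i == 2: outs.append(self.head64(x))            # 64x64 intermediate image
            if i == 3: outs.append(self.head128(x))           # 128x128 intermediate image
        outs.append(self.head256(x))                          # final 256x256 image
        return outs
```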
5. Experiments and Results

5.1. Datasets and Evaluation Criteria

We evaluate Cascaded-C4Synth and Recurrent-C4Synth on the Oxford-102 flowers dataset [11] and the Caltech-UCSD Birds (CUB) dataset [19]. Oxford-102 contains 102 categories of flowers with 8,189 images in total, while CUB contains 200 bird species with 11,788 images. Following previous methods [14, 25, 24], we pre-process the datasets to improve the object-to-image ratio.

We gauge the performance of the generated images by the 'Inception Score' [15], which has emerged as the dominant way of measuring the quality of generative models. The Inception model has been fine-tuned on both datasets so that we can have a fair comparison with previous methods such as [14, 25, 24, 26].
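For reference, the Inception Score is exp(E_x[KL(p(y|x) || p(y))]), computed from the class posteriors of the (fine-tuned) Inception model; a minimal sketch over a matrix of softmax outputs, omitting the split-averaging of [15]:

```python
import torch

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs p(y|x) of the Inception model on generated images."""
    p_y = probs.mean(dim=0, keepdim=True)                       # marginal p(y)
    kl = (probs * (torch.log(probs + eps) - torch.log(p_y + eps))).sum(dim=1)
    return torch.exp(kl.mean()).item()                          # IS = exp(E_x[KL(p(y|x) || p(y))])
```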
5.2. Results

We validate the efficacy of Cascaded-C4Synth and Recurrent-C4Synth by comparing them with GAN-INT-CLS [14], GAWWN [13], StackGAN [25], StackGAN++ [24] and HDGAN [26]. (Our future work will include integrating attention into our framework and comparing against attention-based frameworks such as [17].)

5.2.1 Quantitative Results

Table 1 summarizes the Inception Scores of competing methods on the Oxford-102 flowers dataset [11] and the Caltech-UCSD Birds (CUB) dataset [19], along with the results of the C4Synth models. On the Oxford-102 dataset, Recurrent-C4Synth gives a state-of-the-art result, improving on the previous baseline. On the CUB dataset, the results are comparable with HDGAN [26].

Method               Oxford-102 [11]   CUB [19]
GAN-INT-CLS [14]     2.66 ± .03        2.88 ± .04
GAWWN [13]           -                 3.62 ± .07
StackGAN [25]        3.20 ± .01        3.70 ± .04
StackGAN++ [24]      -                 3.82 ± .06
HDGAN [26]           3.45 ± .07        4.15 ± .05
Cascaded C4Synth     3.41 ± .17        3.92 ± .04
Recurrent C4Synth    3.52 ± .15        4.07 ± .13

Table 1: Comparison of C4Synth methods with other text-to-image synthesis methods. The numbers reported are Inception Scores (higher is better).

The results indicate that Recurrent-C4Synth has an edge over Cascaded-C4Synth. It is worth noting that both methods perform better than four out of the five other baseline methods.

Figures 5 and 6 show generations from the Cascaded-C4Synth and Recurrent-C4Synth methods, respectively. The Cascaded-C4Synth generations consume three captions, as restricted by the architecture, while the Recurrent-C4Synth method consumes five captions. The quality of the images generated by the two methods is comparable, as is evident from the Inception Scores. All generated images have a resolution of 256 × 256 pixels.

Figure 7: The top row shows the images generated by Recurrent-C4Synth at each time step; the corresponding captions that were consumed are also shown. The bottom row shows generated birds of the same class, but with varying pose and background. These are generated by keeping the captions the same and varying the noise vector used to condition the GAN.

The images generated at each time step by the Recurrent-C4Synth method are captured in the top row of Figure 7, along with the captions consumed at each step. This figure validates our assertion that the recurrent formulation progressively generates better images by consuming one caption at a time.

The bottom row of Figure 7 shows the effect of interpolating the noise vector used to generate the initial hidden state of Recurrent-C4Synth, while keeping the captions fixed. This results in generating the same bird in different orientations and backgrounds.

5.2.3 Zero Shot Generations

We note that while training both C4Synth architectures on the Oxford-102 flowers dataset [11] and the Caltech-UCSD Birds (CUB) dataset [19], the classes used for training and testing are disjoint. We use the official train-test split for both datasets: CUB has 150 train+val classes and 50 test classes, while Oxford-102 has 82 train+val classes and 20 test classes. Hence all the results shown in the paper are zero-shot generations, where none of the classes of the captions used to generate images in the test phase has been seen during training.

6. Conclusion

We formulate two generative models for text-to-image synthesis, Cascaded-C4Synth and Recurrent-C4Synth, which make use of multiple captions to generate an image. The method is able to generate plausible images on the Oxford-102 flowers dataset [11] and the Caltech-UCSD Birds (CUB) dataset [19]. We believe that attending to specific parts of the captions at each stage would improve the results of our method; we will explore this in future work. The code is open-sourced at http://josephkj.in/projects/C4Synth.
References

[1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 2927–2936. IEEE, 2015.
[2] R. Y. Benmalek, C. Cardie, S. Belongie, X. He, and J. Gao. The neural painter: Multi-turn image generation. arXiv preprint arXiv:1806.06183, 2018.
[3] A. Ghosh, V. Kulharia, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. arXiv preprint arXiv:1609.09444, 2016.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[5] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. CoRR, abs/1602.05110, 2016.
[6] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[7] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[8] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[9] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[10] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[11] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
[12] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147. Association for Computational Linguistics, 2010.
[13] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.
[14] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[16] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio. ChatPainter: Improving text to image generation using dialogue. ICLR Workshops, 2018.
[17] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. 2018.
[18] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. CoRR, abs/1609.02612, 2016.
[19] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[20] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications, 1:433–486, 1995.
[21] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[22] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint arXiv:1711.10485, 2017.
[23] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint, 2017.
[24] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv:1710.10916, 2017.
[25] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[26] Z. Zhang, Y. Xie, and L. Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] Z. Zhang, L. Yang, and Y. Zheng. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. arXiv preprint arXiv:1802.09655, 2018.
[28] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.