A Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks
ABSTRACT Text-to-face generation is a sub-domain of text-to-image synthesis. It opens up new research areas and has a wide range of applications in the public safety domain. Due to the lack of datasets, research work focused on text-to-face generation is very limited. Most of the work on text-to-face generation until now has been based on partially trained generative adversarial networks, in which a pre-trained text encoder is used to extract the semantic features of the input sentence, and these semantic features are then utilized to train the image decoder. In this research work, we propose a fully trained generative adversarial network to generate realistic and natural images. The proposed work trains the text encoder as well as the image decoder at the same time to generate more accurate and efficient results. In addition to the proposed methodology, another contribution is the generation of a dataset by the amalgamation of LFW, CelebA, and a locally prepared dataset. The dataset has also been labeled according to our defined classes. Through different kinds of experiments, it has been shown that our proposed fully trained GAN outperforms existing approaches by generating good-quality images from the input sentence. Moreover, the visual results further strengthen our experiments by showing face images generated according to the given query.
INDEX TERMS GAN, CNN, text to face, image generation, face synthesis, data augmentation, legal identity
for all.
text encoder and image decoder separately. Most of the generative adversarial networks focus on the generation of synthesized images using sentence-level information. Generating images using only sentence-level information carries a risk of information loss at the word level; as a result, accurate images cannot be generated [1], [2]. Most of the work done on the ''text to image'' generation problem is based on simple datasets such as birds [3] and flowers [4]. However, the work that mapped objects along with scenes was very limited. To overcome this problem, [2] utilized AttnGAN, but failed to achieve good results, as the output images were semantically not meaningful. They explored the COCO dataset and mapped the object along with the scene using sentence-level information; however, the object- and word-level information was still missing.

Text-to-face image generation is a subdomain of text-to-image generation, where the ultimate goal is to generate an image using a user-specified description of the face. So, there are two major tasks in generating face images from text. Fig. 1 shows the input and output of a text-to-image synthesis system. It can be observed that text-to-face synthesis involves generating high-quality images and generating appropriate images related to the given description. The task of generating face images from a text description is highly relevant to public safety tasks. For example, consider the scenario of a crime scene. In most cases, a witness of the crime scene appears before the law enforcement agencies to help in drawing the portrait of the suspected criminal. The witness gives the description of the criminal to the portrait maker, who then draws the portrait of the criminal on the drawing board. The proposed work will help to automate this whole task by negating the role of the portrait maker. The manual work is tedious and time-consuming and requires professional knowledge and experience. Thus, this work will be helpful for law-enforcement agencies.

There are different datasets available for text-to-image synthesis, like CUB [5], Oxford102 [6], and COCO [7], but there is no standard dataset available for text-to-face generation, and the work done in the domain of text-to-face generation is very limited. In this paper, a self-generated dataset is also presented, built with the help of Google image search and two publicly available datasets for face-to-text generation.

The main motivation behind this research work is to generate synthesized images of faces based on text descriptions. The proposed algorithm ensures the generation of high-quality images while preserving the face identity. Moreover, it is also capable of generating exact images based on the given descriptions. This research work can also be utilized in many industrial applications, like automatic sketch making of a suspected face in crime investigation departments.

We have made the following contributions in our paper:
1. Generating a dataset for text-to-face image generation.
2. Proposing a fully trained generative adversarial network, which has a trainable text encoder as well as a trainable image decoder.
3. Proposing two discriminators to utilize the strength of joint learning.
4. Generating photo-realistic images of faces from the description while preserving details.

The rest of the paper is organized as follows. The literature survey is discussed in Section II, and the proposed methodology and framework design are briefly discussed in Section III. The dataset description and experimental analysis are described in Sections IV and V, respectively. The paper is concluded in Section VI.

FIGURE 1. Two images for a text to image synthesis system that is referenced with the same input sentence.

II. LITERATURE SURVEY
Two domains are related to this work. The first one is text-to-image synthesis and the second one is text-attributes-to-face generation. Both domains are discussed one by one as follows.

A. TEXT TO IMAGE GENERATION
There are a lot of frameworks available for text-to-image generation. These frameworks are based on encoding through an encoder and decoding through a decoder, also by using the conditional GAN. The text is encoded through an encoder that processes the sequential information of the text, and the image is decoded using a spatial decoder. The text encoder encodes the input description into semantic vectors, whereas the image decoder uses these semantic vectors to generate natural and realistic images. There are two basic purposes of text-to-image synthesis: the first one is to generate natural and realistic images, and the second one is to make sure that the generated images are related to the given description. All basic algorithms and procedures for text-to-image generation are based on this rule of thumb.

Over the past few years, work on generative networks for image synthesis has been boosted. Kingma et al. [8] utilized stochastic backpropagation to train the variational auto-encoder for data generation purposes.
Since the birth of the generative adversarial network, which was proposed by Goodfellow et al. [1], researchers have studied and researched it widely. The very first work focused on text-to-image generation was done by Reed et al. [9]. They utilized the conditional GAN to build an end-to-end network for text-to-image generation. They obtained the semantic vectors from the text by using the pre-trained Char-CNN-RNN and used these vectors to decode natural images using a decoder very similar to DCGAN [10].

After that, researchers started to make further progress in this particular domain [11]. Zhang et al. [12] proposed StackGAN, which is based on two stages and generates high-quality images with an improved inception score. By then, researchers were able to generate high-quality images, and the focus shifted to improving the similarity between the text and the images. Reed et al. [13] proposed a network that generates images based on a first generated box, which produced more efficient and accurate results in the output images. Sharma et al. [14] introduced a dialogue mechanism to enhance the understanding of the text. They claimed that this method helped them achieve good results for image synthesis relevant to the input text. Dong et al. [11] proposed and introduced a new approach for image-to-image and text-to-image generation. Moreover, they also introduced the training mechanism of image-text-image: they first generated text from the images, and then this text was used to generate the images.

The attention-based mechanism has gained a lot of success in image- and text-related tasks. Researchers have also utilized the attention mechanism in the text-to-image generation task. Xu et al. [15] first utilized the attention mechanism to generate images from text. They introduced AttnGAN to generate high-quality images from text by applying natural language processing techniques and algorithms. Qiao et al. [16] proposed an approach based on a global-local collaborative attention model. Zhang et al. [17] proposed an approach based on visual-semantic similarity. As a conclusion, we can say that these researchers have currently focused on boosting the consistency between the generated images and the input text.

B. TEXT TO FACE GENERATION
Since the invention of the GAN, which was proposed by Goodfellow [1] in 2014, image synthesis using deep learning techniques has become a hot topic of research [18]. There are two large-scale datasets which are publicly available for the face synthesis task: CelebA [19] and LFW [20]. Face synthesis is very popular among the research community, and most of the state-of-the-art work has tested its model capabilities for face synthesis using the GAN and conditional GAN. DCGAN [21], CycleGAN [22], Pro-GAN [6], BigGAN [16], StyleGAN [10], and StarGAN [9] are examples of this line of work. The quality of the generated face images is improving day by day with the development of generative adversarial networks.

Some of these networks can generate good-quality face images with a size of 1024 × 1024. These face images are much larger than the original images present in the face datasets. The described models first learn a mapping from a noise vector following a normal distribution to generate natural images of faces. However, they are not able to generate an accurate and precise face based on an input description.

To overcome and tackle this problem, many researchers have worked on different directions of face synthesis. These directions include converting face edges into natural face images [23], swapping the facial attributes of two different face images [24], generating the face with the help of the side face [25], generating the face with the help of the human eye region [26], drawing sketches from the human face [27], face make-up [28], and many more. But to the best of our knowledge, no one has combined the different face-related information in a single methodology to generate natural and realistic face images.

Some researchers have also worked on face generation through attribute descriptions. Li et al. [32] proposed a work in which they generated the face with the help of an attribute description while making sure to preserve the identity of the face. The drawback of their proposed methodology is that it is only applicable to those faces which can be generated using simple attributes. Another work named TP-GAN [33] has been proposed, in which the authors presented a generative adversarial network based on two pathways and synthesized frontal face images using the proposed network. Although they succeeded in generating good results, they required a large amount of labeled data of frontal faces. Some researchers have also explored disentangled representation learning for face synthesis using defined face attributes. DC-IGN [34] proposed a variational auto-encoder using the patterns and techniques of disentangled representation learning. However, the major drawback of this work is that it only tackles one attribute in one particular batch, which makes it computationally weak and also requires a large amount of explicitly annotated data for training. Luan et al. [35] proposed an algorithm named DR-GAN, which is used for learning a generative and discriminative representation for face synthesis. Their proposed work was based on the poses of the face and did not focus on specified face attributes. However, our proposed framework makes sure to preserve the identity of the generated image by incorporating all the attribute information related to the face.

To the best of our knowledge and based on the literature survey, work on face generation through attribute descriptions using generative adversarial networks is very limited. Most of the work on this problem has a limited scope and fails to generate impressive results because it does not preserve the face identity. Moreover, most of the relevant proposed networks have trained the image decoder and used a pre-trained text encoder.
The states of the Bidirectional LSTM are concatenated into the global sentence vector given by equation 2:

C ∈ R^D    (2)

Here in equation 2, C represents the semantic vector of dimension D, which is concatenated with a noise vector to generate the new input vector. These generated vectors are fed as input to the image decoder network.
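As a rough illustration of this step, the following sketch (assuming a PyTorch implementation; the vocabulary size, embedding width, hidden width, and noise dimension are hypothetical) builds the global sentence vector from the final BiLSTM states and concatenates it with noise before it enters the image decoder.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=128, noise_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.noise_dim = noise_dim

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded face description
        emb = self.embed(token_ids)
        _, (h_n, _) = self.bilstm(emb)              # final states of both directions
        c = torch.cat([h_n[0], h_n[1]], dim=1)      # global sentence vector C (Eq. 2)
        z = torch.randn(c.size(0), self.noise_dim, device=c.device)
        return torch.cat([c, z], dim=1)             # noise-augmented input for the decoder
```

Because the encoder is part of the same computation graph as the decoder, its parameters receive gradients from the adversarial loss, which is the essence of the fully trained setup.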
B. IMAGE DECODER
Our proposed convolutional neural network is based on three blocks, as depicted in Fig. 4. Each block contains 3 deconvolution layers, so we have a total of 3 blocks and 9 deconvolution layers, each of which upsamples the feature map to twice its size. The layers in the blocks take as input the encoded text features in the form of semantic vectors and generate realistic images. In the first stage, the semantic vectors extracted from the text, together with the concatenated noise, are passed as input, and this input vector is reduced to a 4 × 4 feature map. In all blocks, deconvolution is performed on the feature maps. The up-sampling in each of the three layers of a block doubles the size of the feature maps, so the feature map is up-sampled to 8 × 8, 16 × 16, and 32 × 32. After performing all the operations, the feature maps are passed to the fully connected layers. In the second and third blocks, a similar up-sampling task is performed. There is a fine-tune block between the first, second, and third blocks; the fine-tuning block contains a 3 × 3 kernel and helps in fine-tuning the training parameters. The input to the second block is a feature map with a size of 64 × 64. The second block contains the same deconvolution layers as the first block and outputs a 128 × 128 feature map. In the third block, we get the 256 × 256 feature map using the same layer architecture that was used in the first two blocks. After the three blocks of up-sampling layers, we have generated the 256 × 256 image, which is later used to calculate the generator loss.

TABLE 2. Generator architecture.

Table 2 shows the architectural detail of our proposed generator network. It specifies the details of the input features, the deconvolution layers, the filter sizes, the outputs of the deconvolution layers, as well as the output features from the defined three blocks of the architecture. This output is further passed to the discriminator network to assess the effectiveness of the generated face features.
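To make the block structure concrete, here is a minimal sketch of such a decoder (assumed PyTorch; the channel widths and the exact number of transposed-convolution layers are assumptions, since Table 2 is not reproduced here). It projects the sentence-plus-noise vector to a 4 × 4 map and doubles the spatial resolution at each up-sampling step until a 256 × 256 image is produced.

```python
import torch
import torch.nn as nn

def up_layer(in_ch, out_ch):
    # one up-sampling step: a transposed convolution that doubles height and width
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ImageDecoder(nn.Module):
    def __init__(self, input_dim=356, channels=(512, 256, 128, 64, 32, 16, 8)):
        # input_dim = 2 * hidden_dim + noise_dim from the encoder sketch (an assumption)
        super().__init__()
        self.channels = channels
        self.project = nn.Linear(input_dim, channels[0] * 4 * 4)  # vector -> 4x4 feature map
        self.blocks = nn.Sequential(
            *[up_layer(channels[i], channels[i + 1]) for i in range(len(channels) - 1)]
        )
        self.to_rgb = nn.Conv2d(channels[-1], 3, kernel_size=3, padding=1)

    def forward(self, v):
        x = self.project(v).view(-1, self.channels[0], 4, 4)
        x = self.blocks(x)                 # 4 -> 8 -> 16 -> 32 -> 64 -> 128 -> 256
        return torch.tanh(self.to_rgb(x))  # 256 x 256 RGB face image
```

In this sketch six doubling steps take the 4 × 4 map to 256 × 256; the paper's grouping into three blocks with an intermediate 3 × 3 fine-tuning kernel is omitted for brevity.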
C. DISCRIMINATOR
We propose a discriminator that measures the realness of the human face region as well as of the face features. A generated image, along with its sentence encoding, is passed to a CNN network that extracts the low-level region features using an attention mechanism, to be compared with the ground-truth image. An attention layer is added that allows the convolution layers of the discriminator to attend to the region-based features of the eyes, nose, and lips as well as to the entire facial features. A two-stream discriminator network is designed as shown in Fig. 5.

The red box shows the attention-based discriminator network D0 and the yellow box shows the other discriminator network D1. In discriminator D0, an attention vector is initialized to focus on the eye, lip, and nose regions. The semantic vector representation from the original sentence corresponding to these features is concatenated with the attention features and then passed to the succeeding convolutional layers. Finally, the unconditional probabilities are computed to determine the correctness of the local facial features. There are three convolutional layers combined with a batch normalization layer and a max-pooling layer. These layers reduce the feature representation of the facial image of size 64 × 64 × 3. Each convolution layer applies a convolution with a filter size of 4 × 4, and the max-pooling layer applies a 2 × 2 filter to pool the strong weights. The semantic sentence vector is passed to the second convolutional layer set, since the vector representation of the entire face features is also computed to measure the consonance of the local features. Three sets of convolutional layers are adopted here, with a convolution filter of size 4 × 4 and a max-pool filter of size 2 × 2. Discriminator D1 has the same architecture as D0, but without the attention layer.
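A simplified sketch of this two-stream design is given below (assumed PyTorch; the channel widths, the way the attention map is derived from the sentence vector, and the 64 × 64 × 3 input resolution follow the description above, but everything else is an assumption rather than the authors' exact architecture).

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # one stage of the described stack: 4x4 convolution + batch norm + 2x2 max pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=1, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
        nn.MaxPool2d(2),
    )

class AttentionDiscriminatorD0(nn.Module):
    # attention stream: sentence-conditioned spatial attention over region features
    def __init__(self, sent_dim=256):
        super().__init__()
        self.features = conv_block(3, 32)                      # low-level region features
        self.attention = nn.Conv2d(32 + sent_dim, 1, kernel_size=1)
        self.rest = nn.Sequential(conv_block(32, 64), conv_block(64, 128))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, image, sent_vec):
        f = self.features(image)                               # (B, 32, H, W)
        s = sent_vec[:, :, None, None].expand(-1, -1, f.size(2), f.size(3))
        attn = torch.sigmoid(self.attention(torch.cat([f, s], dim=1)))
        return self.head(self.rest(f * attn))                  # probability of being real

class DiscriminatorD1(nn.Module):
    # plain stream: same convolutional stack, without the attention layer
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 32), conv_block(32, 64), conv_block(64, 128),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, image):
        return self.net(image)
```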
The final loss can be computed as the sum of two losses that are measured in an adversarial manner:

loss_total = loss_D0 + loss_D1    (3)

where in equation 3, loss_D0 is the cross-entropy loss for D0 and loss_D1 is the cross-entropy loss for D1. These losses are computed based on the unconditional probabilities p_t0 and p_t1 at the output neurons of D0 and D1, respectively:

p_ti(ẑ_i) = e^{ẑ_i} / Σ_{n=1}^{N} e^{ẑ_n}    (4)

Eq. 4 shows the computation of the probability score, where ẑ_i denotes the output of the last dense layer and N is the number of output classes. Training is performed in an adversarial manner using both of these losses; hence, the generator loss can be described as follows:

loss_GAN(z, x̂) = Σ_{n=1}^{N} −log D0(G(z, x̂)) + Σ_{n=1}^{N} −log D1(G(z, x̂))    (5)

In Eq. 5, z denotes the input noise vector, x̂ denotes the sentence encoding, and N is the number of data samples. D0 and D1 are the two discriminators, the one with the attention layer and the one without the attention layer, respectively.
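A minimal sketch of these training signals (assumed PyTorch; D0 and D1 are taken to emit probabilities through a sigmoid output, as in the discriminator sketch above) is:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d0_real, d0_fake, d1_real, d1_fake):
    # Eq. 3: loss_total = loss_D0 + loss_D1, each a binary cross-entropy term
    loss_d0 = F.binary_cross_entropy(d0_real, torch.ones_like(d0_real)) + \
              F.binary_cross_entropy(d0_fake, torch.zeros_like(d0_fake))
    loss_d1 = F.binary_cross_entropy(d1_real, torch.ones_like(d1_real)) + \
              F.binary_cross_entropy(d1_fake, torch.zeros_like(d1_fake))
    return loss_d0 + loss_d1

def generator_loss(d0_fake, d1_fake, eps=1e-8):
    # Eq. 5: sum of -log D0(G(z, x)) and -log D1(G(z, x)), averaged over the batch
    return (-torch.log(d0_fake + eps) - torch.log(d1_fake + eps)).mean()
```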
IV. DATASET
In every deep learning-based technique, the dataset is meant to be the backbone: if there is no standard and meaningful data, then we cannot generate accurate and precise results. For text-to-face synthesis, there is currently no standard dataset available. In this paper, we have therefore also contributed to the generation of a dataset. Multiple publicly available datasets that contain face images, like CelebA [19] and LFW [20], are explored. Moreover, we have also generated and gathered images of Asian people to enhance the dataset. We have defined the categories in our proposed research work based on gender, age, hair, eyes, and ethnicity attributes. The dataset has the following categories for gender:
1) Male
2) Female
Whereas for the age information, the dataset includes the following:
1) Young
2) Adult
3) Old
For hair and eyes, the following colors are considered:
1) Black
2) Brown
The dataset has been annotated with the following emotions:
1) Happy
2) Sad
Lastly, for ethnicity, the following attributes are selected:
1) Black
2) White
3) Brown
The dataset has been prepared by manually extracting the images from the LFW [20] and CelebA [19] datasets and then carefully annotating them using the above predefined categories. A team was established for the data generation purpose. This team included five interns and two full-time domain experts, and the process took approximately five weeks. We have gathered and annotated 11,000 images against the defined classes. The images are pre-processed before feeding them to the proposed network. Pre-processing involves the removal of bad-quality images, resizing each image to 256 × 256, and image enhancement. Table 3 represents the statistics of the self-generated dataset corresponding to each class, with the gender classes taken as reference. Some frames from our dataset are shown in Fig. 6. We will soon make this dataset available for the research community.
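As an illustration of the pre-processing step described above, the following sketch (Python with Pillow; the folder names and the minimum-size filter used to discard bad-quality images are assumptions) resizes each accepted image to 256 × 256.

```python
from pathlib import Path
from PIL import Image

SRC, DST = Path("raw_faces"), Path("processed_faces")   # hypothetical folders
DST.mkdir(exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    if min(img.size) < 128:          # assumed threshold for discarding low-quality images
        continue
    img.resize((256, 256), Image.BICUBIC).save(DST / img_path.name)
```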
TABLE 4. Comparison with other face generation models using the FSD and FID criteria.

V. EXPERIMENTAL ANALYSIS
This section discusses the extensive experimental analysis that has been carried out to evaluate the performance of the proposed GAN network. We have illustrated the comparison of our proposed GAN with state-of-the-art text-to-face generation models and demonstrated the efficiency of our proposed model. Later, a qualitative assessment has been performed on the synthesized images by human evaluators. The proposed network has been trained on a single Nvidia 1080Ti GPU with 11 GB of memory. The model was trained for 500 epochs with an initial learning rate of 0.0001, and the Adam optimizer is used for the generator and both discriminators to optimize the weights. Fig. 7 shows that our proposed model can generate photo-realistic facial images that are very close to the quality of the ground-truth images. Based on the text, the facial features of the ground truth and the synthesized images are compared. A few other criteria are also followed to evaluate text-to-face generation, as the ultimate goal of text-to-face synthesis is to synthesize facial images that are correlated with the ground-truth images. The comparison is made by calculating the distance between the features of both images. This distance between facial features is called the face semantic distance (FSD). It is computed using the pre-trained FaceNet [29] (FNET) model and can be described as follows:

FSD = (1/N) Σ_{i=1}^{N} |FNET(y_i) − FNET(ŷ_i)|    (6)

TABLE 5. Five generated images of the text-to-face generation model along with the ground truth images. The left column represents the input sentences.

In equation 6, y_i is the generated output for each input i = 1, 2, ..., N (N is the total number of samples) and ŷ_i is the ground-truth image. Along with the face semantic distance, we have also compared the Frechet Inception Distance (FID) [30] of the synthesized images to the ground-truth images. The purpose of the Frechet Inception Distance is not to anticipate the
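A minimal sketch of the FSD computation of Eq. 6 (Python/NumPy; `facenet_embed` is a hypothetical wrapper around a pre-trained FaceNet model, and the absolute difference is summed over the embedding dimensions as an L1 distance, which is one reading of the equation):

```python
import numpy as np

def face_semantic_distance(generated, ground_truth, facenet_embed):
    """Eq. 6: mean absolute difference between FaceNet embeddings of
    generated images y_i and their ground-truth counterparts y_hat_i."""
    distances = [
        np.abs(facenet_embed(y) - facenet_embed(y_hat)).sum()
        for y, y_hat in zip(generated, ground_truth)
    ]
    return float(np.mean(distances))
```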
Our proposed network achieved an FID score of 42.62, which is comparatively lower than the other benchmark algorithms. Additionally, the human ratings for our generated images are also plausible.

In the future, to further improve the quality of the images and to increase the similarity between the description and the generated faces, we will focus on denser and more precise face-related information for the proposed architecture. This proposed work has a huge impact on security-related domains such as forensic analysis and the public safety domain.

ACKNOWLEDGMENT
The authors would like to express their earnest gratitude to the National Center of Artificial Intelligence Pakistan fund and organization (KICS) for fully supporting this research work. They also extend their gratitude to the AIDA Lab, CCIS, Prince Sultan University, Riyadh, Saudi Arabia, for their support of this research. The authors acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) for this publication.

REFERENCES
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[2] S. Hong, D. Yang, J. Choi, and H. Lee, "Inferring semantic layout for hierarchical text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7986–7994.
[3] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, "Conditional image generation with PixelCNN decoders," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4790–4798.
[4] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947–1962, Aug. 2019.
[5] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep., 2011.
[6] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proc. 6th Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2008, pp. 722–729.
[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755.
[8] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[9] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," 2016, arXiv:1605.05396. [Online]. Available: http://arxiv.org/abs/1605.05396
[10] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," 2015, arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511.06434
[11] H. Dong, S. Yu, C. Wu, and Y. Guo, "Semantic image synthesis via adversarial learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5706–5714.
[12] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5907–5915.
[13] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 217–225.
[14] S. Sharma, D. Suhubdy, V. Michalski, S. Ebrahimi Kahou, and Y. Bengio, "ChatPainter: Improving text to image generation using dialogue," 2018, arXiv:1802.08216. [Online]. Available: http://arxiv.org/abs/1802.08216
[15] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1316–1324.
[16] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505–1514.
[17] Z. Zhang, Y. Xie, and L. Yang, "Photographic text-to-image synthesis with a hierarchically-nested adversarial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6199–6208.
[18] A. Gatt, M. Tanti, A. Muscat, P. Paggio, R. A. Farrugia, C. Borg, K. P. Camilleri, M. Rosner, and L. van der Plas, "Face2Text: Collecting an annotated image description corpus for the generation of rich face descriptions," 2018, arXiv:1803.03827. [Online]. Available: http://arxiv.org/abs/1803.03827
[19] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 87–102.
[20] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Tech. Rep., 2008.
[21] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.
[22] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.1784
[23] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8798–8807.
[24] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, "Towards open-set identity preserving face synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6713–6722.
[25] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2439–2448.
[26] X. Chen, L. Qing, X. He, J. Su, and Y. Peng, "From eyes to face synthesis: A new approach for human-centered smart surveillance," IEEE Access, vol. 6, pp. 14567–14575, 2018.
[27] X. Di and V. M. Patel, "Face synthesis from visual attributes via sketch using conditional VAEs and GANs," 2017, arXiv:1801.00077. [Online]. Available: http://arxiv.org/abs/1801.00077
[28] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5505–5514.
[29] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.
[30] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6626–6637.
[31] X. Chen, L. Qing, X. He, X. Luo, and Y. Xu, "FTGAN: A fully-trained generative adversarial networks for text to face generation," 2019, arXiv:1904.05729. [Online]. Available: http://arxiv.org/abs/1904.05729
[32] M. Li, W. Zuo, and D. Zhang, "Convolutional network for attribute-driven and identity-preserving human face generation," 2016, arXiv:1608.06434. [Online]. Available: http://arxiv.org/abs/1608.06434
[33] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," 2017, arXiv:1704.04086. [Online]. Available: http://arxiv.org/abs/1704.04086
[34] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, "Deep convolutional inverse graphics network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2539–2547.
[35] L. Tran, X. Yin, and X. Liu, "Disentangled representation learning GAN for pose-invariant face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 4, Jul. 2017, pp. 1415–1424.
[36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.

MUHAMMAD ZEESHAN KHAN received the M.S. degree in computer science from UET Lahore, Pakistan. He is currently a Team Lead with the Intelligent Criminology Lab, National Center of Artificial Intelligence, Al Khawarizmi Institute of Computer Science, UET Lahore. His areas of specialization are computer vision, machine learning, deep learning, and blockchain.

ASIM REHMAT received the Ph.D. degree from the University of Engineering and Technology, Lahore, Pakistan. He is currently an Assistant Professor with the Department of Computer Science, University of Engineering and Technology, Lahore. His areas of specialization include robotics and embedded system development using artificial intelligence.