0% found this document useful (0 votes)
12 views

A_Realistic_Image_Generation_of_Face_From_Text_Description_Using_the_Fully_Trained_Generative_Adversarial_Networks

Uploaded by

timothimabey
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views

A_Realistic_Image_Generation_of_Face_From_Text_Description_Using_the_Fully_Trained_Generative_Adversarial_Networks

Uploaded by

timothimabey
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Received July 23, 2020, accepted July 28, 2020, date of publication August 10, 2020, date of current

version January 5, 2021.


Digital Object Identifier 10.1109/ACCESS.2020.3015656

A Realistic Image Generation of Face From Text


Description Using the Fully Trained Generative
Adversarial Networks
MUHAMMAD ZEESHAN KHAN1 , SAIRA JABEEN1 , MUHAMMAD USMAN GHANI KHAN2 ,
TANZILA SABA 3 , (Senior Member, IEEE), ASIM REHMAT2 ,
AMJAD REHMAN 3 , (Senior Member, IEEE),
AND USMAN TARIQ 4
1 Alkhawarizmi Institute of Computer Sciences, UET Lahore, Lahore 54000, Pakistan
2 Department of Computer Science and Engineering, UET Lahore, Lahore 54000, Pakistan
3 Artificial Intelligence and Data Analytics Lab, CCIS Prince Sultan University, Riyadh 11586, Saudi Arabia
4 College of Computer Engineering and Science, Prince Sattam bin Abdulaziz University, Alkharj 16278, Saudi Arabia

Corresponding author: Amjad Rehman ([email protected]) and Usman Tariq ([email protected])


This work was supported in part by the National Center of Artificial Intelligence, Pakistan. and in part by the Artificial Intelligence and
Data Analytics Lab, Prince Sultan University, Riyadh, Saudi Arabia.

ABSTRACT Text to face generation is a sub-domain of text to image synthesis. It has a huge impact on
new research areas along with the wide range of applications in the public safety domain. Due to the lack of
dataset, the research work focused on the text to face generation is very limited. Most of the work for text
to face generation until now is based on the partially trained generative adversarial networks, in which the
pre-trained text encoder has been used to extract the semantic features of the input sentence. Later, these
semantic features have been utilized to train the image decoder. In this research work, we propose a fully
trained generative adversarial network to generate realistic and natural images. The proposed work trained
the text encoder as well as the image decoder at the same time to generate more accurate and efficient results.
In addition to the proposed methodology, another contribution is to generate the dataset by the amalgamation
of LFW, CelebA and locally prepared dataset. The dataset has also been labeled according to our defined
classes. Through performing different kinds of experiments, it has been proved that our proposed fully trained
GAN outperformed by generating good quality images by the input sentence. Moreover, the visual results
have also strengthened our experiments by generating the face images according to the given query.

INDEX TERMS GAN, CNN, text to face, image generation, face synthesis, data augmentation, legal identity
for all.

I. INTRODUCTION Generating images from text is the opposite process of


Generating images using the text description is one of most image captioning and image classification, where text and
challenging and important tasks in machine learning. This caption are generated from images. Just like the image cap-
task involves handling the language modalities problems tions, text to image generation helps to find context and rela-
which include the control and management of incomplete and tionship between the image and the text along with exploring
ambiguous information using the natural language processing human visual semantics. Moreover, it has a large number of
techniques and algorithms. After that, this information is applications in art, designs, image retrieval and searching.
used to learn by computer vision approaches and algorithms Currently, most of the methods for generating images from
Currently, it is one of the latest research domains in computer the text are based on the traditional method in which the
vision. pre-trained text encoder has been utilized to get the semantic
vector from input descriptions. Based on the semantic vec-
tors, conditional GAN is trained to generate realistic face
The associate editor coordinating the review of this manuscript and images. Although this method generates high-quality face
approving it for publication was Pengcheng Liu . images, they split the training method into two steps; train the

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
1250 VOLUME 9, 2021
M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

text encoder and image decoder separately. Most of the face generation is very limited. In this paper, a self-generated
generative adversarial networks focus on the generation of dataset is also presented with the help of the google image
the synthesized images using the sentence level information. search and two publically available datasets for face to text
Generating the images using the sentence level information generation.
probably has chances of information loss at word level. As a The main motivation behind this research work is to
result, the accurate images cannot be generated [1], [2]. Most generate the synthesized images of the face based on the
of the work which was done for the problem of ‘‘ Text text description. The proposed algorithm in this paper has
to Image’’ generation is based on simple dataset problems ensured to generate high-quality images by preserving the
such as birds [3] and flowers [4]. However, the work that face identity. Moreover, it is also capable of generating the
mapped the objects along with scenes was very limited. exact images based on the given descriptions. This research
To overcome this problem, [2] utilized the AttnGAN, they work has also been utilized in many industrial applications
were failed to achieve good results as their output image like automatic sketch making of the suspected face in crime
was semantically not meaningful. They tried to explore the investigation departments.
COCO dataset and mapped the object along with the scene We have made the following contributions in our paper.
with the sentence-level information. However, the object and 1. Generating the dataset related to the text to face images.
word-level information was still missing. 2. Proposed a Fully Trained Generative Adversarial Net-
work, which has a trainable text encoder as well as a
trainable image decoder.
3. Two discriminators are proposed to utilize the strength
of joint learning.
4. Generating the photo-realistic images of the faces from
the description by preserving details.
The rest of the paper is organized as follows; Literature
survey has been discussed in section II, proposed method-
ology and framework design have been briefly discussed in
section III. Whereas, dataset description and experimental
analysis described in IV and V, respectively. Paper has been
concluded in Section VI.
FIGURE 1. Two images for text to image synthesis system that is
referenced with same input sentence.
II. LITERATURE SURVEY
Text to face image generation is the subdomain of the Two domains are related to the work. The first one is the text
text to image generation, where the ultimate goal is to gen- to image synthesis and the second one is the text attributes to
erate the image using the user-specified description about face generation. Both domains are discussed one by one as
the face. So, there are two major tasks of generating face follow;
images from text. Fig. 1 shows the input and output for a
text to image synthesis system. It can be observed that text A. TEXT TO IMAGE GENERATION
to face synthesis involves generating high-quality images There are a lot of frameworks available for the text to image
and generating the appropriate images related to the given generation. These frameworks are based on the encoding
description. This task of generating the face images from the through the encoder and decoding through decoder also by
text description is more relevant to the public safety tasks. using the conditional GAN. The text is encoded through an
For example, we consider the scenario of the crime scene. encoder that processes sequential information of text and
In most of the cases, the witness of the crime scene has the image is decoded using the spatial decoder. The text
appeared before the law enforcement agencies to help in encoder encodes the input description into the semantic vec-
drawing the portrait of the suspected criminal. The witness tors, whereas the image decoder uses these semantic vectors
tells the description of the criminal to the portrait maker, then to generate the natural and realistic images. There are two
he/she draws the portrait of the criminal on the drawing board. basic purposes of the text to image synthesis, the first one is
The proposed work will help to automate the whole task by to generate the natural and realistic images and the second one
negating the role of the portrait maker. The manual work is is to make sure that generated images are related to the given
tedious and time-consuming and requires professional knowl- description. All basic algorithms and procedures for text to
edge and experience. Thus, this work will be helpful for image generation are based on this rule of thumb.
law-enforcement agencies. From the past few years, the work on the generative
There are different datasets available for text to image network has been boasted up for image synthesis. Kingma
syntheses, like CUB [5], Oxford102 [6], and COCO [7]. But et.al [8] utilized the stochastic backpropagation to train the
there is no standard dataset, which is available for text to auto-variational encoder for data generation purposes. Since
face generation. The work is done in the domain of text to the birth of the generative adversarial network, which was

VOLUME 9, 2021 1251


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

proposed by Goodfellow et al. [1] researchers have studied Some of the networks can generate good quality face
and researched it widely. The very first task which focused images with a size of 1024 × 1024. These face images are
the text to image generation has been done by Reed et. al. [9]. much larger than the original images present in the face
They have utilized the conditional GAN to build the two dataset. These described models first learned through the
end to end network for text to image generation. They have noise vector with the help of mapping and followed the
obtained the semantic vectors from the text by using the pre- normal distribution to generate the natural images of the face.
trained Char-CNN-RNN and used these vectors to decode However, they are not able to generate an accurate and precise
the natural images using the decoder which is very similar face based on the input description.
to DCGAN [10]. To overcome and tackle this problem, many researchers
After that, researchers have started to make further have worked on different directions of face synthesis. These
progress in this particular domain [11]. Zhang et al [12] pro- directions include converting the face edges into the natural
posed the StackGAN, which is based on two stages and gen- face images [23], swapping the facial attributes of two dif-
erates high-quality images with the improved inception score. ferent face images [24], generating the face with the help
Till then, researchers were able to generate high-quality of the side face [25], generating the face with the help of
images. The focus at that time shifted to improve the simi- the human eye’s region [26], draw sketches from the human
larity between the text and images. Reed et.al [13] proposed face [27], face make-up [28] and many more. But as per our
a network that generates images based on the first generated best knowledge, no one combined the different face-related
box. This produced more efficient and accurate results on information in a single methodology to generate the natural
the output images. Sharma et.al. [14] introduced the mech- and realistic face images.
anism of dialogue to enhance the understanding of the text. Some of the researchers have also worked on the face
They claimed that the method helped them to achieve good generation through the attributes description. Li et. al [32]
results for the image synthesis relevant to the input text. proposed the work, in which they generated the face with
Dong et al. [11] proposed and introduced a new approach for the help of the attribute description by making sure that
the image to image and text to image generation. Moreover, they preserve the identity of the face. The drawback of
they also introduced the training mechanism of image-text- their proposed methodology is that it is only applicable
image. They first generated the text from the images, and then to those faces which can be generated using the simple
this text was used to generate the images. attributes. Another work named TP-GAN [33] has been pro-
The attention-based mechanism has gained a lot of suc- posed by the researchers. In this work, they have proposed
cess in the image and text-related tasks. Researchers have the generative adversarial network based on the two path-
also utilized the attention mechanism in generating text to ways. They synthesized the frontal face images using the
image task. Xu et al. [15] first utilized the attention mech- proposed network. Although they succeeded to generate the
anism to generate the images from the text. They have intro- good results but required a large amount of labeled data of
duced the AttnGAN to generate high-quality images from the frontal faces. Some of the researchers have also explored
text by applying natural language processing techniques and the disentangled representation learning for face synthesis
algorithms. Qiao et al. [16] proposed the approach, which using the defined attributes of the face. DC-IGN [34] has
was based on the global-local collaborative attention model. proposed the variational auto-encoder using the patterns and
Zhang et.al [17] proposed an approach that was based on techniques of disentangled representation learning. However,
visual semantic similarity. So, as a conclusion, we can say the major drawback of this work is that it only tackles one
that these researchers currently have focused to boost up the attribute in particularly one batch. It makes it computation-
consistency between the generated images and input text. ally weak as well as it also requires the large explicitly
annotated data for training. Luan et. al [35] proposed the
algorithm, which they named as the DR-GAN. It is used
B. TEXT TO FACE GENERATION for the learning purpose of generative and discriminative
Since the invention of the GAN, which was proposed by representation of face synthesis. Their proposed work was
Goodfellow [1] in 2014, image synthesis using deep learning based on the poses of the face and did not focused on
techniques become the hot topic of research [18]. There are specified face attributes. However, our proposed framework
two large scale datasets which are publically available for makes sure to preserve the identity of the generated image
face synthesis task. These datasets are the CelebA [19] and by incorporating all the attributes information related to
LFW [20]. Face synthesis is very popular among the research the face.
community. Most of the state of the artwork has tested their As per our best knowledge and based on the literature
model capabilities and abilities for face synthesis using the survey, the work on the face generation through the attribute
GAN and conditional GAN. DCGAN [21], CycleGAN [22], description using the generative adversarial network is very
Pro-GAN [6], BigGAN [16], StyleGAN [10], StarGAN [9] less. Most of the work on this problem is done on the limited
are the examples of this problem. The quality of the generated scope and failed to generate impressive results by not pre-
face images is improving day by day with the development in serving the face identity. Moreover, most of the relevant pro-
the generative adversarial networks. posed networks have trained the image decoder and used the

1252 VOLUME 9, 2021


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

pre-trained text encoder. So, in this work we have proposed


the fully trainable generative adversarial network.
Research gap matrix has also been shown in the Table 1.

TABLE 1. Research matrix of text to face.

FIGURE 2. Design of our fully trained generative adversarial network.

generate realistic images to some extent. However, we cannot


confirm that generated images are according to the input text
without human evaluation or realistic. Because the images are
generated based on the semantic vectors, which have been
extracted using the pre-trained text encoder. So, to generate
good results, training of both text decoder and image encoder
is performed concurrently.
The Architecture of Our Proposed Fully Trained Gener-
III. PROPOSED METHODOLOGY AND ative Adversarial Network: In this section, we describe the
FRAMEWORK DESIGN details of our proposed fully trained generative adversarial
In this methodology section, we will describe our proposed network. Fig. 2 depicts the pictorial representation of our
work in detail. Our proposed work is presented in two sec- network architecture. From Fig. 4 and Fig. 5, it clearly states
tions. In the first section, we discussed, how to encode the that our proposed architecture of conditional GAN is based on
text into the semantic vectors, whereas in the second section one generator and two discriminators. The generator contains
we describe the mechanism of decoding the semantic features the trainable encoder and decoder, and it is the backbone of
of text into the realistic natural images. A comprehensive our proposed architecture for text to image task. It has been
overview of the whole network architecture is also described divided into two parts; first one is the text encoder and the
in detail. The framework design of our proposed architecture second one is the image decoder. The details of these sections
is based on two streams. In the first portion, the text is are as follows;
encoded, whereas in the second part image is decoded using
the encoded text embeddings. A text encoder converts the tex- A. TEXT ENCODER
tual data into a semantic vector. After that, the image decoder
In our proposed work, text encodings have been extracted
generates realistic images using the semantic features of the
using the bidirectional LSTM as shown in Fig. 3. By using the
text, which is encoded by the text encoder. Currently, most
bidirectional LSTM, we have extracted the semantic features
of the text to image generation techniques of the generative
from the given input text sentence.
adversarial network is based on the training of the separate
In the proposed bidirectional LSTM, each word present in
modules. They trained the text encoder and image decoder
the sentence is connected with the two hidden states. Each
separately. They have utilized the pre-trained text encoder or
hidden state corresponds to the one direction. The outputs of
fully trained the text encoder for the training purpose of the
these two hidden states have been concatenated to get the
image encoder. Whereas, in our proposed work, the entire
semantic meaning against each word of the sentence. The
framework is trained with text encoder and image encoder
input sentence is encoded in the form of the matrix using
at the same time. The design of our fully trained generative
equation 1.
adversarial network has been shown in Fig. 2. The training
mechanism has also been incorporated on the text encoder to e ∈ RDx T (1)
generate natural and realistic images. As shown in Fig. 2, our
proposed architecture based on the auto-encoder and decoder Here in equation 1, T represents the total number of words
framework. The encoder firstly encodes the sequences of along with the D dimensions. Whereas, e represents the fea-
input sentences to the semantic vector and using these vec- ture vector for the ith word present in the sentence. Firstly,
tors, natural images are generated. Both of these tasks have each word embedding has been extracted to get the semantic
equal weights and importance in the text to image synthesis. features, which later become the input of the image genera-
The reason behind not utilizing the pre-trained text encoder tion process. The input to the image decoder for the image
because it has a direct link with the upper limits of image generation process is not in the form of a single word embed-
decoder and it affects the accuracy to generate the quality ding of the sentence. So, to feed the image decoder network,
images. The main task of text to image synthesis is to generate the outputs of the last hidden states of the bidirectional LSTM
high-quality images which is relevant and according to the have been concatenated and passed to the image decoder
input sentence. Utilizing the pre-trained text-encoder can network.

VOLUME 9, 2021 1253


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

FIGURE 3. Text encoder.

FIGURE 4. Image decoder.

The states of Bidirectional LSTM have been concatenated The second block contains the same deconvolution layers the
on the global sentence vector using the following equation 2. same as the first block and outputs the 128×128 feature map.
Whereas, in the 3rd block, we get the 256×256 feature map
C ∈ RD (2)
using the same layer architecture which was previously used
Here in equation 2, C represents the semantic vector, which in the first two blocks. After the three blocks of up-sampling
is concatenated with some noise having some dimensions D layers, we have generated the 256 × 256 image, which later
to generate the new vector. These generated vectors are fed is used to calculate the generator loss.
as input to the image decoder network.
TABLE 2. Generator architecture.
B. IMAGE DECODER
Our proposed convolution neural network is based on the
three blocks as depicted in Fig. 4. Each block contains
the 3 deconvolution layers. So, we have total 3 blocks and
9 deconvolution layers, which upsample feature map twice to
its original size. The layers present into the blocks take the
input from the encoded features of text as semantic vectors
and generates realistic images. In the first stage, the semantic
vectors are extracted from the text along with the noise con-
catenation and are passed as an input, and then these input
vector is reduced to the 4 × 4 feature map. In all blocks,
deconvolution has been performed on the feature maps. The
up-sampling on the feature map increases the size twice the
feature maps in all three layers of each block. So, the size of
the feature map is up-sampled to 8 × 8, 16 × 16 and 32 × 32.
After performing all the operations, feature maps are passed
to the fully connected layers. In the second and third block,
a similar task of up-sampling has been performed. There is
a fine-tune block between the first, second and third blocks. Table 2 shows the architectural detail of our proposed
Fine-tuning block contains the 3 × 3 kernel. The up-sampling generator network. It specifies the details of input features,
block helped in fine-tuning the training parameters. The input deconvolution layers, filter size, the output from deconvolu-
to the second block is the feature map with a size of 64 × 64. tion layers as well as the output features from the defined

1254 VOLUME 9, 2021


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

FIGURE 5. Discriminator networks.

three blocks of the architecture. This output is further passed unconditional probabilities pt0 and pt1 at the output neuron of
to the discriminator network, to find the effectiveness of the D0 and D1 , respectively.
generated face features.
eźi
pti (źi ) = PN (4)
źi
C. DISCRIMINATOR n=1 e
We propose a discriminator that measures the realness of Eq. 4 shows the computation of probability score where z∧ i
human face region as well as the face features. A gener- depicts the output of last dense layer and N is number of out-
ated image along with its sentence encoding is passed to put classes. Training is performed in adversarial manner using
the CNN network that extracts the low-level region features both of these losses hence generator loss can be described as
using attention mechanism to be compared with ground truth follows.
image. The attention layer is added that allows convolution N
layers of discriminator to attend the region based features of lossGAN (z, x̂)y =
X
− log D0 G z, x̂

eyes, nose, lips as well as entire facial features. Two stream
n=1
discriminator network is designed as shown in Fig. 5. N
The red box explains the attention-based discriminator
X 
+ − log D1 G z, x̂ (5)
D0 network and the yellow box explains another discrimi- n=1
nator network D1 . In discriminator D0 , an attention vector
is initialized to focus on eye regions, lips region and nose In this Eq. 5, z denotes the input noise vector, x̂ denotes sen-
region. Semantic vector representation from original sentence tence encoding and N is number of data samples. Where D0
corresponding to these features is concatenated to attention and D1 are the two discriminators, one with the attention layer
features and then passed to succeeding convolutional layers. and the other one is without the attention layer, respectively.
Finally, the unconditional probabilities are computed to
determine the correctness of local facial features. There are IV. DATASET
three convolutional layers combined with batch normaliza- In each deep learning-based technique, dataset is meant to be
tion layer and max-pooling layer. These layers minimize the the backbone. If there is no standard and meaningful data,
feature representation of facial image with size 64 × 64 × 3. then we cannot generate accurate and precise results. For text
Each convolution layer applies a convolution of filter size to face synthesis, currently there is no standard dataset avail-
4 × 4. Also, the max-pooling layer applies 2 × 2 sized filter able. In this paper, we have also contributed to the generation
to pool the strong weights. The semantic sentence vector is of dataset. Multiple publically available datasets are explored
passed to second convolutional layer set. Since the vector rep- that contain face images like Celeb [19], LFW [20] etc.
resentation of entire face features is also computed to measure Moreover, we have also generated and gathered the images
the consonance of local features. Three sets of convolutional of Asian people to enhance the dataset. We have defined the
layers are adopted here with a convolution filter of size 4 × 4 categories in our proposed research work based on gender,
and max pool filter of size 2 × 2. Discriminator D1 has the age, hair, eyes, ethnicity attributes. Dataset has following
same architecture as the D0 without the attention layer. Final categories in gender;
loss can be computed as a sum of two losses that are measured 1) Male
in an adversarial manner. 2) Female
Whereas for the age information, dataset has included follow-
losstotal = lossD0 + lossD1 (3) ing information;
1) Young
where in equation 3, lossD0 is the cross-entropy loss for D0 2) Adult
and lossD1 is for D1 . These losses are computed based on the 3) Old

VOLUME 9, 2021 1255


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

TABLE 3. Statistics of self-generated and collected dataset.

FIGURE 7. Three generated images are shown in correspondence with


same input text.

TABLE 4. Comparison with other face generation model using FSD and
FID Criterion.

FIGURE 6. Samples frames from dataset.

For hair and eyes, following colors are considered; V. EXPERIMENTAL ANALYSIS
1) Black This section discusses the extensive experimental analysis
2) Brown that has been carried out to evaluate the performance of
Dataset has been annotated on to the following emotions; proposed GAN network. We have illustrated the compari-
1) Happy son of our proposed GAN with state of the art text to face
2) Sad generation models and proved the efficiency of our proposed
In the last for ethnicity, following attributes are selected; model. Later, qualitative assessment has been performed on
1) Black synthesized images by human resources. The proposed net-
2) White work has been trained on a single Nvidia 1080Ti GPU with
3) Brown 11 GB memory. The model was trained for 500 epochs with
Dataset has been prepared by manually extracting the images initial learning rate of 0.0001. Adam optimizer is used for
from the LFW [20] and Celeb [19] dataset and then carefully generator and both discriminators to optimize the weights.
annotating them using above predefined categories. A team In the Fig.7, it is shown that our proposed model can generate
was established for data generation purpose. This team photo-realistic facial images that are very near to the quality
included the five interns and two full time-domain experts. of ground truth images. Based on text, facial features of the
This process took approximately five weeks. We have gath- ground truth and synthesized images are compared. Other few
ered and annotated 11,000 images against defined classes. criterions are followed to evaluate the text to face generation.
Images are pre-processed before feeding them to the pro- As the ultimate goal of text to face synthesis is to synthesize
posed network. facial images that are correlated to ground truth images. The
Pre-processing involves the removal of bad quality images, comparison is made by calculating the distance between the
resizing each image to (256,256) and image enhancement. features of both images. This distance of facial features is
Table 3 represents the statistics of the self-generated dataset called face semantic distance (FSD).
corresponding to each class with gender classes taken as This distance is computed by using pre-trained FaceNet
reference. Some frames from our dataset have been shown [29] FNET model. FSD can be described as follows
in the Fig. 6. We will soon make this dataset available for the 1 XN
|F NET (yi ) − FNET ŷi |

research community. FSD = (6)
N i=1

1256 VOLUME 9, 2021


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

TABLE 5. Five generated images of text-to-face generation model along with the ground truth images. Left column represents the input sentences.

In equation 6, yi is the generated output at each input we have also compared the Frechet Inception Distance (FID)
i = 1, 2..N (N is the total number of samples) and ŷi is [30] of synthesized images to the ground truth images. The
the ground-truth image. Along with face semantic distance, purpose of Frechet Inception Distance is not to anticipate the

VOLUME 9, 2021 1257


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

similarity of synthetic images as compared to real images.


The purpose of FID score is to assess the generated images
based on the statistics of group of generated images in com-
parison to the statistics of group of real images. FID is
measured by first computing 2048 inception features from
pre-trained inception v3 [36] for real and synthetic images.
FID is described as
2
d t mg , Cg , (mt , Ct ) = mg − mt 2
 
 
1
+ Tr Cg + Ct − Cg Ct 2 (7)

where in equation 7, mg and Cg are the mean and covariance


from the features of generated distribution of GAN and mt
and Ct are the mean and covariance of 2048 features of
FIGURE 8. Peak signal to noise ratio over 700 epochs of the generator
ground truth sample distribution. Tr represents the trace of training.
square matrix. Table 4 shows the comparison of proposed
methodology with other GAN models. From the table 4, TABLE 6. Average rating Score of Human Assessment between 1 to 5.
we interpret that the proposed model has FSD value lesser
than FTGAN [31] and AttnGAN [15]. This tells us that
the faces generated by our model are more similar to the
ground truth face images than any of other two techniques.
Moreover, the less FID score for proposed methodology tells
us that mean and covariance of synthetic images by proposed
model has little variation from the mean and covariance
of ground truth real images. Moreover, two-staged Stack-
GAN [4] method was also opted to show the effectiveness
of proposed end-to-end trainable method. We trained Stack-
GAN on our dataset and evaluated it based on FSD and A total of 100 synthetic facial images were shown to
FID measures. Other than achieving relatively higher FSD five human volunteers with and without textual description.
and FID values, two-staged StackGAN took longer time to Following table 6 shows the average score of the ratings they
converge than the proposed network. Additionally, using two did on 100 videos. They were asked to mark the score on the
discriminators proved the effect of joint adversarial learning. scale of 1-5 based on the quality of the generated image and
Facial features and entire face aesthetics are compared with provided a description. It is depicted from the table 6 that
real images to produce photo-realistic synthetic faces. the proposed architecture achieved good results based on the
The synthesized images are also relevant to the textual human assessment.
description that is given as input. It is clear from Table 5 that
images are accurately related to the description of hair, skin VI. CONCLUSION
and eye color. To prove the efficiency of the proposed model, In this paper, we have proposed the fully trained generative
Table 5 is provided. The results show that the faces gener- adversarial network for text to face image synthesis. The work
ated by our model are aesthetically appealing and correct presents a network, that trained both text encoder and image
regarding input sentences. Also proposed model generates decoder for generating good quality images relative to the
images that are of higher resolution i.e. 256 × 256. There input sentences. By performing extensive experiments on the
were other generated images from similar text descriptions. publicly available dataset, the superiority of our proposed
These samples are shown in Fig. 7. methodology is proved. Moreover, in this novel task, we have
However, Table 5 shows only those generated images that also contributed towards the text to face generation dataset.
are similar to ground truth. The top results are shown in the Different publically available dataset along with the locally
table 5. For female face, model fuses the makeup specifica- generated images have been combined. After that manual
tions such as lip color and accessories in ground truth image labeling of each image with defined categories has been
that cannot be preserved. To measure the realistic high quality performed. The proposed work also presents the details of the
synthetization of proposed model, we have also monitored the similarity between the generated faces and the ground-truth
Peak Noise to Signal Ratio (PNSR). Fig. 8 shows the input description sentences. Experiments have shown that our
increase in the value of PNSR. This ratio gives the quantifi- proposed generative adversarial network generates natural
cation of the quality of the generated image with comparison images with good quality along with a similar face compared
to the ground truth images. Fig. 8 shows the increase in the to the ground truth labels and faces. We compared proposed
quality of generated images such that lower PSNR value on method with state of the art methods using FID and FSD
increasing epochs. scores. Proposed model achieved FSD score of 1.118 and

1258 VOLUME 9, 2021


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

FID score of 42.62 that is comparatively less than other [14] S. Sharma, D. Suhubdy, V. Michalski, S. Ebrahimi Kahou, and Y. Bengio,
benchmark algorithms. Additionally, human ratings for our ‘‘ChatPainter: Improving text to image generation using dialogue,’’
2018, arXiv:1802.08216. [Online]. Available: http://arxiv.org/abs/1802.
generated images are also plausible. 08216
In future, to further improve the quality of images and to [15] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He,
increase to similarity between the description and the gener- ‘‘AttnGAN: Fine-grained text to image generation with attentional genera-
tive adversarial networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
ated faces, we will focus on denser and precise information Recognit., Jun. 2018, pp. 1316–1324.
related to face for the proposed architecture. This proposed [16] T. Qiao, J. Zhang, D. Xu, and D. Tao, ‘‘MirrorGAN: Learning text-to-
work has a huge impact on security related domains like image generation by redescription,’’ in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505–1514.
forensic analysis and public safety domain etc. [17] Z. Zhang, Y. Xie, and L. Yang, ‘‘Photographic text-to-image synthesis with
a hierarchically-nested adversarial network,’’ in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6199–6208.
ACKNOWLEDGEMENT [18] A. Gatt, M. Tanti, A. Muscat, P. Paggio, R. A. Farrugia, C. Borg,
The authors would also like to express our earnest gratitude K. P. Camilleri, M. Rosner, and L. van der Plas, ‘‘Face2Text: Col-
to National Center of Artificial Intelligence Pakistan Fund lecting an annotated image description corpus for the generation of
rich face descriptions,’’ 2018, arXiv:1803.03827. [Online]. Available:
and organization (KICS) for full supporting our research http://arxiv.org/abs/1803.03827
work. They also extend their gratitude to AIDA Lab CCIS [19] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, ‘‘Ms-celeb-1m: A dataset and
Prince Sultan University Riyadh Saudi Arabia for their sup- benchmark for large-scale face recognition,’’ in Proc. Eur. Conf. Comput.
Vis. Cham, Switzerland: Springer, 2016, pp. 87–102.
port to this research. The authors acknowledge support of [20] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, ‘‘Labeled faces
Prince Sultan University for paying the Article Processing in the wild: A database forstudying face recognition in unconstrained
Charges (APC) for this publication. environments,’’ Tech. Rep., 2008.
[21] Z. Liu, P. Luo, X. Wang, and X. Tang, ‘‘Deep learning face attributes
in the wild,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015,
REFERENCES pp. 3730–3738.
[22] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, 2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/
S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adver- 1411.1784
sarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014, [23] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro,
pp. 2672–2680. ‘‘High-resolution image synthesis and semantic manipulation with condi-
[2] S. Hong, D. Yang, J. Choi, and H. Lee, ‘‘Inferring semantic layout for tional GANs,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
hierarchical text-to-image synthesis,’’ in Proc. IEEE/CVF Conf. Comput. Jun. 2018, pp. 8798–8807.
Vis. Pattern Recognit., Jun. 2018, pp. 7986–7994. [24] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘Towards open-set identity
[3] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, preserving face synthesis,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
‘‘Conditional image generation with pixelcnn decoders,’’ in Proc. Adv. Recognit., Jun. 2018, pp. 6713–6722.
Neural Inf. Process. Syst., 2016, pp. 4790–4798. [25] R. Huang, S. Zhang, T. Li, and R. He, ‘‘Beyond face rotation: Global and
[4] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, local perception GAN for photorealistic and identity preserving frontal
‘‘StackGAN++: Realistic image synthesis with stacked generative adver- view synthesis,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017,
sarial networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 2439–2448.
pp. 1947–1962, Aug. 2019. [26] X. Chen, L. Qing, X. He, J. Su, and Y. Peng, ‘‘From eyes to face synthesis:
[5] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, ‘‘The caltech- A new approach for human-centered smart surveillance,’’ IEEE Access,
ucsd birds-200-2011 dataset,’’ California Inst. Technol., Pasadena, CA, vol. 6, pp. 14567–14575, 2018.
USA, Tech. Rep., 2011. [27] X. Di and V. M. Patel, ‘‘Face synthesis from visual attributes via sketch
[6] M.-E. Nilsback and A. Zisserman, ‘‘Automated flower classification over using conditional VAEs and GANs,’’ 2017, arXiv:1801.00077. [Online].
a large number of classes,’’ in Proc. 6th Indian Conf. Comput. Vis., Graph. Available: http://arxiv.org/abs/1801.00077
Image Process., Dec. 2008, pp. 722–729. [28] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, ‘‘Generative image
[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, inpainting with contextual attention,’’ in Proc. IEEE/CVF Conf. Comput.
P. Dolláar, and C. L. Zitnick, ‘‘Microsoft coco: Common objects in con- Vis. Pattern Recognit., Jun. 2018, pp. 5505–5514.
text,’’ in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, [29] F. Schroff, D. Kalenichenko, and J. Philbin, ‘‘FaceNet: A unified embed-
pp. 740–755. ding for face recognition and clustering,’’ in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.
[8] D. P. Kingma and M. Welling, ‘‘Auto-encoding variational Bayes,’’
2013, arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312. [30] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter,
6114 ‘‘GANs trained by a two time-scale update rule converge to a local
Nash equilibrium,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017,
[9] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
pp. 6626–6637.
‘‘Generative adversarial text to image synthesis,’’ 2016, arXiv:1605.05396.
[31] X. Chen, L. Qing, X. He, X. Luo, and Y. Xu, ‘‘FTGAN: A fully-
[Online]. Available: http://arxiv.org/abs/1605.05396
trained generative adversarial networks for text to face generation,’’ 2019,
[10] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised representa- arXiv:1904.05729. [Online]. Available: http://arxiv.org/abs/1904.05729
tion learning with deep convolutional generative adversarial networks,’’ [32] M. Li, W. Zuo, and D. Zhang, ‘‘Convolutional network for attribute-driven
2015, arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511. and identity-preserving human face generation,’’ 2016, arXiv:1608.06434.
06434 [Online]. Available: http://arxiv.org/abs/1608.06434
[11] H. Dong, S. Yu, C. Wu, and Y. Guo, ‘‘Semantic image synthesis via adver- [33] R. Huang, S. Zhang, T. Li, and R. He, ‘‘Beyond face rotation: Global
sarial learning,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, and local perception GAN for photorealistic and identity preserving
pp. 5706–5714. frontal view synthesis,’’ 2017, arXiv:1704.04086. [Online]. Available:
[12] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, http://arxiv.org/abs/1704.04086
‘‘StackGAN: Text to photo-realistic image synthesis with stacked genera- [34] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, ‘‘Deep convo-
tive adversarial networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), lutional inverse graphics network,’’ in Proc. Adv. Neural Inf. Process. Syst.,
Oct. 2017, pp. 5907–5915. 2015, pp. 2539–2547.
[13] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, [35] L. Tran, X. Yin, and X. Liu, ‘‘Disentangled representation learning GAN
‘‘Learning what and where to draw,’’ in Proc. Adv. Neural Inf. Process. for pose-invariant face recognition,’’ in Proc. IEEE Conf. Comput. Vis.
Syst., 2016, pp. 217–225. Pattern Recognit. (CVPR), vol. 4, Jul. 2017, pp. 1415–1424.

VOLUME 9, 2021 1259


M. Z. Khan et al.: Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks

[36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ASIM REHMAT received the Ph.D. degree from
‘‘Rethinking the inception architecture for computer vision,’’ in Proc. the University of Engineering and Technology,
IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, Lahore, Pakistan. He is currently an Assistant
pp. 2818–2826. Professor with the Department of Computer Sci-
ence, University of Engineering and Technol-
ogy, Lahore. His areas of specialization include
MUHAMMAD ZEESHAN KHAN received the
robotics and embedded system development using
M.S. degree in computer science from UET
artificial intelligence.
Lahore, Pakistan. He is currently a Team Lead with
the Intelligent Criminology Lab, National Center
of Artificial Intelligence, Al Khawarizmi Institute
of Computer Science, UET Lahore. His areas of
specialization are computer vision, machine learn-
ing, deep learning, and blockchain.

SAIRA JABEEN received the M.S. degree in com-


puter science from UET Lahore, Pakistan. She is
AMJAD REHMAN (Senior Member, IEEE)
currently a Team Lead with the Computer Vision
received the Ph.D. degree from the Faculty of
and Machine Learning Lab, National Center of
Computing, Universiti Teknologi Malaysia, with a
Artificial Intelligence, Al Khawarizmi Institute
specialization in forensic documents analysis and
of Computer Science, UET Lahore. Her area of
classification, in 2010, with honor and awarded
specialization involves computer vision, machine
rector award for best student in the university,
learning, and neural networks.
in 2010. He is currently a Senior Researcher with
the AIDA Lab, Prince Sultan University, Riyadh,
Saudi Arabia. His keen interests are in data mining,
MUHAMMAD USMAN GHANI KHAN received health informatics, and pattern recognition. He is
the Ph.D. degree from Sheffield University, U.K. author of more than 200 indexed journal articles.
He is currently an Associate Professor with the
Department of Computer Science, University of
Engineering and Technology, Lahore. He is also a
Principle Investigator of the Intelligent Criminol-
ogy Lab, National Center of Artificial Intelligence,
Kics UET Lahore, Pakistan. His Ph.D. study was
concerned with statistical modeling for machine
vision signals, specifically language descriptions
of video streams.
USMAN TARIQ received the Ph.D. degree in
information and communication technology in
TANZILA SABA (Senior Member, IEEE) received the Ph.D. degree in docu- computer science from Ajou University, South
ment information security and management from the Faculty of Computing, Korea. He is a skilled Research Engineer. He has a
Universiti Teknologi Malaysia (UTM), Malaysia, in 2012. She currently strong background in ad hoc networks and network
serves as a Research Professor with the College of Computer and Information communications. He experienced in managing and
Sciences, Prince Sultan University (PSU), Riyadh, Saudi Arabia. Her pri- developing projects from conception to comple-
mary research focus in the recent years is bioinformatics, pattern recognition, tion. He have worked in large international scale
machine learning, and applied soft computing. She has above two hundred and long-term projects with multinational organi-
publications that have around 4000 citations with H-index 40. She won best zations. He is currently with Prince Sattam bin
student award at the Faculty of Computing, UTM, in 2012. She received best Abdul-Aziz University as an Associate Professor with the College of Com-
researcher awards in PSU in 2014, 2015, 2016, and 2018. She has supervised puter Engineering and Science. His research interests span networking and
Ph.D. and M.S. degree students. Due to her excellent research achievement, security fields. His current research is focused on several network security
she is included in Marquis Who’s Who (S & T) 2012. She is currently problems: botnets, denial-of-service attacks, and IP spoofing. Additionally,
an editor of several reputed journals and on panel of TPC of international he is interested in methodologies for conducting security.
conferences.

1260 VOLUME 9, 2021

You might also like