
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 9, 2019

Generating and Analyzing Chatbot Responses using Natural Language Processing

Moneerh Aleedy1
Information Technology Department, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

Hadil Shaiba2
Computer Sciences Department, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

Marija Bezbradica3
School of Computing, Dublin City University, Dublin, Ireland

Abstract—Customer support has become one of the most important communication tools used by companies to provide before- and after-sale services to customers. This includes communicating through websites, phones, and social media platforms such as Twitter. The connection becomes much faster and easier with the support of today's technologies. In the field of customer service, companies use virtual agents (chatbots) to provide customer assistance through desktop interfaces. In this research, the main focus is on the automatic generation of conversation ("chat") between a computer and a human by developing an interactive artificial intelligent agent through the use of natural language processing and deep learning techniques such as Long Short-Term Memory, Gated Recurrent Units and Convolutional Neural Networks to predict a suitable and automatic response to customers' queries. Based on the nature of this project, we need to apply sequence-to-sequence learning, which means mapping a sequence of words representing the query to another sequence of words representing the response. Moreover, computational techniques for learning, understanding, and producing human language content are needed. To achieve this goal, this paper discusses efforts towards data preparation, then explains the model design, generates responses, and applies evaluation metrics such as the Bilingual Evaluation Understudy (BLEU) and cosine similarity. The experimental results on the three models are very promising, especially with Long Short-Term Memory and Gated Recurrent Units. They are useful in responses to emotional queries and can provide general, meaningful responses suitable for customer queries. LSTM has been chosen as the final model because it achieves the best results in all evaluation metrics.

Keywords—Chatbot; deep learning; natural language processing; similarity

I. INTRODUCTION

With the arrival of the information age, customer support has become one of the most influential tools companies use to communicate with customers. Modern companies have opened up communication lines (conversations) with clients to support them regarding products before and after sales through websites, telephones, and social media platforms such as Twitter. This communication becomes faster and much easier with the support of the technologies that are being used today.

Artificial intelligence (AI) improves digital marketing in a number of different areas, from banking, retail, and travel to healthcare and education. While the idea of using human language to communicate with computers holds merit, AI scientists underestimate the complexity of human language, in both comprehension and generation. The challenge for computers is not just understanding the meanings of words, but understanding expression in how those words are collocated.

Moreover, a chatbot is an example of a virtual conversational service robot that can provide human-computer interaction. Companies use robotic virtual agents (chatbots) to assist customers through desktop interfaces [1, 2].

Natural language processing (NLP) is a subfield of computer science that employs computational techniques for learning, understanding and producing human language content. NLP can have multiple goals; it can aid human-human communication, such as in machine translation, and aid human-machine communication, such as with conversational agents. Text mining and natural language processing are widely used in customer care applications to predict a suitable response to customers, which significantly reduces reliance on call center operations [3].

AI and NLP have emerged as a new front in IT customer service chatbots. The importance of these applications appears when no technicians manage the customer service office, due to the end of working hours or their presence outside the office [4].

In this project, the main focus is on the automatic generation of conversation ("chat") between a computer and a human by developing an interactive artificial intelligent agent using deep learning. This will provide customers with the right information and response from a trusted source at the right time, as fast as possible.

This project aims to build an automated response system (chatbot) that responds to customer queries on social networking platforms (Twitter) to accelerate the performance of the service, while keeping the design of the system simple to enhance its efficiency.


This project centers around the study of deep learning models, natural language generation, and the evaluation of the generated results.

We believe that this contribution can add improvement by applying the right preprocessing steps, which may organize sentences in a better way and help in generating proper responses. On the other hand, we start with the existing text generative models CNN and LSTM and then try to improve them, as well as develop a new model such as GRU, to compare results. We focus on evaluating the generated responses from two aspects: the number of word matches between the reference response and the generated response, and their semantic similarity.

The rest of this paper is organized as follows. Section II provides a review of the related works. The methodological approach is described in Section III. Moreover, dataset collection and analysis are provided in detail in Section IV. The implementation strategy and results of this project are discussed in Section V. Finally, the conclusion of the project and its future work are provided in Sections VI and VII respectively.

II. LITERATURE REVIEW

Developing computational conversational models (chatbots) has attracted the attention of AI scientists for a number of years. Modern intelligent conversational and dialogue systems draw principles from many disciplines, including philosophy, linguistics, computer science, and sociology [5]. This section explores previous work on chatbots and their implementations.

A. Chatbots Applications and Uses

Artificial dialogue systems are interactive talking machines called chatbots. Chatbot applications have been around for a long time; the first well-known chatbot is Joseph Weizenbaum's Eliza program, developed in the early 1960s. Eliza facilitated the interaction between human and machine through simple pattern matching and a template-based response mechanism to emulate conversation [6, 7].

Chatbots have become important in many areas of life; one of the primary uses of chatbots is in education, as a question answering system for a specific knowledge domain. In [8], the authors proposed a system that has been implemented as a personal agent to assist students in learning the Java programming language. The developed prototype has been evaluated to analyze how users perceive the interaction with the system. Also, students can get help in registering for and dropping courses by using a chatbot specialized in student administrative problems, as mentioned in [9]. The administrative student chatbot helps colleges to provide 24/7 automated query resolution and helps students get the right information from a trusted source.

On the other hand, information technology (IT) service management is an important application area for enterprise chatbots. In many organizations and companies, the IT services desk is one of the essential departments that helps to ensure the continuity of work and to solve the technical problems that employees and clients are facing. This variability demands manual intervention and supervision, which affects the speed and quality of process execution. IT service providers are under competitive pressure to continually improve their service quality and reduce operating costs through automation. Hence, they need to adopt chatbots in order to speed up the work and ensure its quality [10].

On the medical side, the field of healthcare has developed a lot lately. This development appears with the use of information technology and AI in the field. In [11], the authors proposed a mobile healthcare application as a chatbot to give fast treatment in response to accidents that may occur in everyday life, and also in response to sudden health changes that can affect patients and threaten their lives.

The customer service agent is an application of chatbot technologies in businesses to solve customer problems and help the sales process. As companies become globalized in the new era of digital marketing and artificial intelligence, brands are moving to the online world to enhance the customer experience in purchasing and to provide new technical support channels for solving after-sales problems. Moreover, fashion brands such as Burberry, Louis Vuitton, Tommy Hilfiger, Levi's, H&M, and eBay are increasing the popularity of e-service agents [1].

B. Natural Language Processing

NLP allows users to communicate with computers in a natural way. The process of understanding natural language can be decomposed into syntactic and semantic analysis. Syntax refers to the arrangement of words in a sentence such that they make grammatical sense. Moreover, syntactic analysis transforms sequences of words into structures that show how these words are related to each other. On the other hand, semantics refers to the meaning of each word and sentence. The semantic analysis of natural language content captures the real meaning; it processes the logical structure of sentences to find the similarities between words and understand the topic discussed in the sentences [12].

As part of the text mining process, the text needs much modification and cleaning before it is used in prediction models. As mentioned in [13], the text needs many preprocessing steps, which include removing URLs, punctuation marks and stop words such as "a", "most", "and" and "is", because those words do not contain any useful information. In addition, tokenizing is the process of breaking the text into single words. Moreover, the text needs stemming, which means changing a word into its root, such as "happiness" to "happy". For feature extraction, the authors use Bag of Words (BoW) to convert the text into a set of feature vectors in numerical format. BoW is the process of transforming all texts into a dictionary that consists of all words in the text paired with their word counts. Vectors are then formed based on the frequency of each word appearing in the text.
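To make the Bag-of-Words idea described in [13] concrete, the following minimal Python sketch builds the word-count dictionary and the frequency vectors; the two toy sentences are invented for illustration and are not data from the paper.

```python
from collections import Counter

# Toy texts (invented); every text becomes a vector of word counts
# over one shared vocabulary.
texts = [
    "thanks for the quick response",
    "thanks for the help",
]

# Build the shared vocabulary: every distinct word paired with an index.
vocabulary = sorted({word for text in texts for word in text.split()})

# Turn each text into a frequency vector over that vocabulary.
vectors = []
for text in texts:
    counts = Counter(text.split())
    vectors.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)  # ['for', 'help', 'quick', 'response', 'thanks', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 1], [1, 1, 0, 0, 1, 1]]
```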


Before entering the data into a model or a classifier, it is necessary to make sure that the data are suitable, convenient, and free of outliers. In [14], the authors explain how to preprocess text data. The main idea was to simplify the text so that the classifier can learn the features quickly. For example, names can be replaced with one feature {{Name}} in the feature set, instead of having the classifier learn 100 names from the text as features. This helps to group similar features together to build a better predicting classifier. On the other hand, emoticons and punctuation marks are converted to indicators (tags). Moreover, a list of emoticons is compiled from online sources and grouped into categories. Other punctuation marks that were not relevant to the coding scheme are removed.

Chat language contains many abbreviations and contractions in the form of short forms and acronyms that have to be expanded. Short forms are shorter representations of a word, obtained by omitting or replacing a few characters, e.g., grp → group and can't → cannot. The authors created a dictionary of these words from the Urban Dictionary to replace abbreviations with expansions. Spell checking is performed as the next step of the pre-processing pipeline on all word tokens, excluding the ones tagged in the previous steps [14].

Minimizing the words during the text pre-processing phase as much as possible is very important to group similar features and obtain a better prediction. As mentioned in [15], the authors suggest processing the text through stemming and lower-casing of words to reduce inflectional forms and derivational affixes in the text. The Porter Stemming algorithm is used to map variations of words (e.g., run, running and runner) into a common root term (e.g., run).

Words cannot be used directly as inputs in machine learning models; each word needs to be converted into a vector feature. In [4], the authors adopt the Word2vec word embedding method to learn word representations of customer service conversations. Word2vec's idea is that each dimension of the embedding is a possible feature of the word, which can capture useful grammatical and semantic properties. Moreover, they tokenize the data by building a vocabulary of the 100K most frequent words in the conversations.

C. Machine Learning Algorithm and Evaluation

A large number of researchers use the idea of artificial intelligence and deep learning techniques to develop chatbots with different algorithms and methods. As mentioned in [16], the authors use a repository of predefined responses and a model that ranks these responses to pick an appropriate response for a user's input. Besides, they proposed a topic-aware convolutional neural tensor network (TACNTN) model to classify whether or not a response is proper for a message. The matching model is used to select a response for a user message. Specifically, it has three stages: pre-processing the message, retrieving response candidates from the pre-defined message-response pair index, then ranking the response candidates with a pre-trained matching model.

In [17], the authors train two word-based machine learning models, a convolutional neural network (CNN) and a bag-of-words SVM classifier. Resulting scores are measured by the Explanatory Power Index (EPI). EPI is used to determine how much words contribute to the classification decision and to filter relevant information without an explicit semantic information extraction step.

The customer service agent is an important chatbot that is used to map conversations from request to response using the sequence-to-sequence model. Moreover, a sequence-to-sequence model has two networks: one works as an encoder that maps a variable-length input sequence to a fixed-length vector, and the other works as a decoder that maps the vector to a variable-length output sequence. In [4], the authors generate word-embedding features and train word2vec models. They trained LSTMs jointly with five layers and 640 memory cells, using stochastic gradient descent for optimization and gradient clipping. In order to evaluate the model, the system was compared with actual human agent responses, and the similarity was measured by human judgments and the automatic evaluation metric BLEU.

As a conclusion of reviewing works concerned with conversational systems, text generation in the English language, and the role of social media in customer support service, this paper proposes a work that aims to fill the gap of limited works on conversational systems for the customer support field, especially in the Twitter environment. The hypothesis of this project is to improve the automated responses generated by different deep learning algorithms such as LSTM, CNN, and GRU, to compare the results, and then to evaluate them using the BLEU and cosine similarity techniques. As a result, this project will help to improve the text generation process in general, and the customer support field in particular.

III. METHODOLOGICAL APPROACH

This section discusses the background of the implemented methods, explains why these methods are appropriate, and gives an overview of the project methodology.

A. Text Generative Model

Based on the nature of this project, which is generating a proper response to every customer query on social media, applying sequence-to-sequence learning is needed. Moreover, sequence-to-sequence means mapping a sequence of words representing the query to another sequence of words representing the response; the lengths of queries and responses can be different. This can be applied by the use of NLP and deep learning techniques.

Sequence-to-sequence models are used in many fields, including chat generation, text translation, speech recognition, and video captioning. As shown in Fig. 1, a sequence-to-sequence model consists of two networks, an encoder and a decoder. The input text enters the encoder network in reverse order, where it is converted into a fixed-length context vector, which is then used by the decoder to generate the output sequence [18].

Fig. 1. Sequence to Sequence Model.
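As a concrete illustration of the word2vec embedding approach reviewed in [4] above, the following hedged Python sketch trains a small gensim Word2Vec model on two invented tokenized turns; the parameter values and the vocabulary cap are assumptions made for illustration, not the authors' settings.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized customer-service turns; the real input would be
# the tokenized conversation corpus.
tokenized_turns = [
    ["my", "order", "has", "not", "arrived", "yet"],
    ["sorry", "to", "hear", "that", "please", "dm", "us", "your", "order", "number"],
]

# Learn dense word vectors (gensim >= 4.0 argument names assumed).
model = Word2Vec(
    sentences=tokenized_turns,
    vector_size=100,         # dimensionality of each word vector
    window=5,                # context window size
    min_count=1,             # keep rare words in this toy corpus
    max_final_vocab=100000,  # cap the vocabulary at the most frequent words
    workers=2,
)

# Each word is now a dense vector capturing distributional similarity.
print(model.wv["order"].shape)  # (100,)
```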


Before inserting the sequence of words into the encoder model, it needs to be converted into a numerical format; this can be done using NLP techniques. This project focuses on Bag of Words (BoW) vector representations, which is the most commonly used traditional vector representation for text generating models. BoW is used to transform all texts into a dictionary that consists of all words that appear in the document [13]. It then creates a set of real-numbered features inside a vector for each text.

B. Deep Learning Models

1) Convolutional Neural Network (CNN) Model: In this project, CNN is chosen mainly for its efficiency, since CNN is faster compared to other text representation and extraction methods [19]. The CNN consists of convolution and pooling layers and provides a standard architecture that takes a variable-length sequence of words as an input and passes it to a word embedding layer. The embedding layer maps each word into a fixed-dimensional real-valued vector and then passes it to the 1D convolutional layer. The output is then further down-sampled by a 1D max-pooling layer. Outputs from the pooling layers are then fed into the final output layer to produce a fixed-length feature vector [20]. CNNs have been widely used in image and video recognition systems and have lately shown promising results in NLP applications [21]. Fig. 2 shows the standard architecture of the CNN model.

Fig. 2. The Architecture of CNN.

2) Recurrent Neural Network (RNN) Model: In a traditional neural network, all inputs and outputs are independent of each other, which is not useful when working with sequential information. Predicting the next word in a sentence requires knowing the sequence of words in the sentence that come before the predicted word. Among all models for learning sentence representations, recurrent neural network (RNN) models, especially the Long Short-Term Memory (LSTM) model, are the most appropriate for processing sentences, as they have achieved substantial success in text categorization and machine translation [22]. Therefore, this project applies LSTM and Gated Recurrent Units (GRU), a newer generation of recurrent neural networks. Fig. 3 illustrates the basic architecture of RNN.

Fig. 3. The Architecture of RNN.

Hochreiter & Schmidhuber introduced Long Short-Term Memory networks in 1997. They solve the vanishing and exploding gradient problem that is prevalent in a simple recurrent structure, as they allow some states to pass without activation. In 2014, Cho et al. developed GRU networks in an effort to design a recurrent encoder-decoder architecture [23]. They are relatively more straightforward than LSTM and retain a majority of its advantages.

C. Project Methodology

In order to implement this project, several preprocessing and modeling steps are performed. First, the original dataset is split into train and test sets. Then, the dataset is prepared for modeling. The preparation process includes preprocessing steps and feature extraction. After that, the models are trained on the training set with LSTM, GRU, and CNN. Finally, the test set is prepared and used for evaluating the models. Fig. 4 illustrates the methodology steps.

Fig. 4. The General Implementation Steps.

IV. DATASET COLLECTION AND ANALYSIS

The dataset "Customer Support on Twitter" from Kaggle is used to develop and evaluate the models. The original dataset includes information such as: tweet_id, author_id, inbound, created_at, text, response_tweet_id and in_response_to_tweet_id. The description of the original dataset is shown in Table I.

The original dataset contains 2,811,774 tweets and replies from the biggest brands on Twitter acting as customer support (tweets and replies are in different rows). Moreover, the number of brands in the dataset is 108, and they responded to queries from 597,075 users. Fig. 5 shows the top 10 customer support responses per brand.

While performing exploratory analysis on the dataset, it has been noticed, for instance, that Amazon customer support handles a lot of questions (around 84,600 in seven months), which is a huge number to deal with if we consider the working hours and working days per week. Also, some of the questions had a delayed response or no response at all. Fig. 6 shows the average delay in response to customers in hours per brand.
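As a sketch of the CNN text encoder described in Section III.B(1) (embedding, 1D convolution, 1D max-pooling, fixed-length output), the following minimal Keras model uses an assumed filter count and kernel size, since these are not specified in the paper; the vocabulary, embedding and message-length sizes mirror the values reported later in Table IV.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed sizes for illustration (vocabulary 10,000, embedding 100, max length 30).
vocab_size, embed_dim, max_len = 10000, 100, 30

cnn_encoder = keras.Sequential([
    layers.Input(shape=(max_len,)),                        # padded sequence of word indexes
    layers.Embedding(vocab_size, embed_dim),               # map each index to a dense vector
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # 1D convolution over word windows
    layers.GlobalMaxPooling1D(),                           # down-sample to a fixed-length vector
    layers.Dense(100, activation="tanh"),                  # final fixed-length feature vector
])
cnn_encoder.summary()
```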


TABLE I. DATASET FEATURES DESCRIPTION

tweet_id (int64): A unique, anonymized ID for the tweet. Referenced by response_tweet_id and in_response_to_tweet_id.
author_id (object): A unique, anonymized user ID. The real user_id in the dataset has been replaced with its associated anonymized user ID.
inbound (bool): Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when reorganizing data for training conversational models.
created_at (object): Date and time when the tweet was sent.
text (object): Tweet content. Sensitive information like phone numbers and email addresses is replaced with mask values like __email__.
response_tweet_id (object): IDs of tweets that are responses to this tweet, comma-separated.
in_response_to_tweet_id (float64): ID of the tweet this tweet is in response to, if any.

Fig. 5. Top 10 Customer Support Responses per Brand.

Fig. 6. The Average Delay in Response to Customers in Hours per Brand.

As shown in Fig. 6, around ten brands take more than two days (60 hours) to respond to customers' queries, which may cause problems for customers, affect the companies' reputation, and lead customers to start looking for other service providers.

A filtering process is used to convert the dataset records into a conversational dataset suitable for the experiments. The filtering is done as follows:

1) Pick only inbound tweets that are not in reply to any other tweet.
2) Organize each tweet with the corresponding reply by matching the in_response_to_tweet_id and tweet_id features.
3) Filter out cases where reply tweets are not from a company, based on the inbound feature (if the inbound feature is False, the tweet is from a company; otherwise it is from a user).

However, when revising the dataset, it has been found that some of the tweets have no replies at all, that they are in multiple languages, and that some of them are just symbols and emojis. For this type of tweet, a further preprocessing step is performed to remove non-English tweets using the langdetect library, which detects any non-English text [24]. Then, the English queries without responses are studied, as shown in the word cloud in Fig. 7 (a graph that illustrates the most frequent words appearing in the text).

It can be observed that the words give no hint of a specific problem being discussed, and most of the queries are thanking the customer support services, for example:

• @AmazonHelp Thanks for the quick response
• @AppleSupport Awesome, thanks

Others are asking for help in general:

• @Uber_Support Sent a DM Hope you could help soon.
• @O2 DM sent. Still no further forward!

The modified dataset contains 794,299 rows and 6 columns, which are: author_id_x, created_at_x, text_x, author_id_y, created_at_y and text_y. X refers to the queries, and Y refers to the responses from the customer support teams.

Fig. 7. Most Words used in the Queries without Responses Data.
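A minimal pandas sketch of the three filtering steps above is shown next; the CSV file name and the boolean parsing of the inbound column are assumptions about the Kaggle export, not details given in the paper.

```python
import pandas as pd

# Load the Kaggle "Customer Support on Twitter" dataset (file name assumed).
tweets = pd.read_csv("twcs.csv")

# Normalize the inbound flag to real booleans (the CSV may store it as text).
tweets["inbound"] = tweets["inbound"].astype(str).eq("True")

# 1) Inbound customer tweets that are not themselves replies to another tweet.
first_inbound = tweets[tweets["inbound"] & tweets["in_response_to_tweet_id"].isna()]

# 2) Pair each such tweet with the tweets that answer it
#    (reply.in_response_to_tweet_id == query.tweet_id).
pairs = pd.merge(
    first_inbound,
    tweets,
    left_on="tweet_id",
    right_on="in_response_to_tweet_id",
    suffixes=("_x", "_y"),   # _x = customer query, _y = candidate reply
)

# 3) Keep only replies sent by a company account (inbound == False).
pairs = pairs[~pairs["inbound_y"]]

# Columns matching the modified dataset described above.
conversations = pairs[
    ["author_id_x", "created_at_x", "text_x", "author_id_y", "created_at_y", "text_y"]
]
print(len(conversations))
```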


V. IMPLEMENTATION STRATEGY

In this section, we explain the methodology followed for this project. At first, the dataset is prepared for modeling. The preparation process includes a preprocessing step and feature extraction; the models are then trained using a training set and evaluated with a test set.

A. Data Preprocessing

A data analyst cannot handle raw text directly with machine learning or deep learning methods. Therefore, it is necessary to clean the texts of all existing impurities, for example, punctuation, expression codes, and non-English words (Chinese, Spanish, French, and others). In order to do this, a number of Python NLP libraries such as regular expressions (re), unicodedata, langdetect, and contractions are used.

In this project, the performed preprocessing steps include: removing links, images, Twitter IDs, numbers, punctuation, emoji and non-English words, and replacing abbreviations with their long forms. Table II illustrates the changes in the dataset before and after applying all the previous preprocessing steps.

TABLE II. THE CHANGES IN TEXT BEFORE AND AFTER APPLYING PREPROCESSING STEPS

Before: @115743 C91. Feel free to keep an eye on the PS Blog for news and updates: https://t.co/aLtfBAztyC
After: feel free to keep an eye on the ps blog for news and updates

Before: @133100 We do our best to clear as many upgrades as we can, send us a DM with the reservation you're referring to and we'll take a look.
After: we do our best to clear as many upgrades as we can send us a dm with the reservation you are referring to and we will take a look

Before: @129388 We'd like to look into this with you. To confirm, did you update to iOS 11.1? Please DM us here: https://t.co/GDrqU22YpT
After: we would like to look into this with you to confirm did you update to ios please dm us here

The preprocessing steps are chosen carefully; not all preprocessing techniques are suitable for this kind of project. For example, removing stopwords and stemming the text cannot be applied because they would affect the sentence structure as well as the text generation process.

B. Feature Extraction

Before doing any complex modeling, the dataset needs to be transformed into a numerical format suitable for training. The Bag of Words (BoW) concept is applied to extract features from the text dataset. First, all of the texts in the dataset are split into arrays of tokens (words). Then, a vocabulary dictionary is built with all of the words in the dataset and their corresponding index values. The array of words is then converted to an array of indexes. This process can be applied by using sklearn's predefined method called CountVectorizer.

In order to handle variable lengths, the maximum sentence length needs to be decided. Moreover, all remaining vector positions are filled with a value ('1' in this case) to make all sequences have the same length. On the other hand, words not in the vocabulary dictionary are represented with UNK as a shortcut for unknown words. Moreover, each output text in the dataset starts with a start flag ('2' in this case) to help in training. Now the dataset is ready for training.

C. Modeling

The infrastructure used for experimentation involves Google Colaboratory and Crestle cloud services, which are GPU-enabled Jupyter environments with powerful computing resources. All popular scientific computing and deep learning packages are pre-installed and configured to run on a GPU. The experiments are applied using three different models: LSTM, GRU, and CNN. The models use a training dataset of around 700k pairs of queries and responses and a testing dataset of 30k of unseen data. Training time is between 5 and 12 hours, depending on the model (see Table III).

TABLE III. TRAINING TIME IN HOURS

LSTM: 12
GRU: 8
CNN: 5

In the experiments, multiple parameters are tested and their effects are addressed. All models are tested with varying dimensionality of the word embeddings (100, 300 and 640); it was observed that the models perform better and faster with a 100-dimensional word embedding.

The dataset is large; the vocabulary contains 388,950 unique words, and our computers cannot handle it. So, only the frequent words appearing in the dataset should be used. The most frequent words are selected by the max_features parameter of the CountVectorizer function, which sorts words by frequency and then chooses the most frequent ones. The first vocabulary size in the experiments is 8000, and it is then increased, taking memory limitations into consideration. A slight improvement was recognized in all models, and because of the memory limitation, only 10,000 of the vocabulary words are used. Moreover, the GRU model was trained for eight epochs but without significant improvement. The three models are all trained under the same conditions. Table IV shows the common parameters used in all models.

TABLE IV. THE COMMON PARAMETERS USED IN LSTM, GRU AND CNN MODELS

Word embedding dimension size: 100
Vocabulary size: 10,000
Context dimension size: 100
Learning rate: 0.001
Optimization function: Adam
Batch size: 1000 (the max that our computer can handle)
Max message length: 30
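As an illustration of the cleaning steps in Section V.A, the following is a minimal Python sketch; the exact regular expressions and helper names are assumptions, and the paper's full pipeline additionally handles images and a wider set of abbreviations.

```python
import re

import contractions            # expands short forms, e.g. "we'll" -> "we will"
from langdetect import detect  # language detection used to drop non-English tweets


def clean_tweet(text: str) -> str:
    """Sketch of the preprocessing described in Section V.A."""
    text = contractions.fix(text)              # replace abbreviations with long forms
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove Twitter IDs (@mentions)
    text = re.sub(r"[^a-z\s]", " ", text)      # drop numbers, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip()


def is_english(text: str) -> bool:
    """Keep only English tweets, as done with the langdetect library [24]."""
    try:
        return detect(text) == "en"
    except Exception:                          # langdetect fails on empty or odd input
        return False


sample = ("@115743 C91. Feel free to keep an eye on the PS Blog for news "
          "and updates: https://t.co/aLtfBAztyC")
print(clean_tweet(sample))
# -> "c feel free to keep an eye on the ps blog for news and updates"
# (the Table II example additionally drops the leftover "c" from "C91")
```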
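The feature extraction in Section V.B can be sketched as follows. The padding value '1' and the start flag '2' follow the conventions quoted above, while the UNK index and the offset applied to CountVectorizer's vocabulary are assumptions made so that the special values do not collide with word indexes.

```python
from sklearn.feature_extraction.text import CountVectorizer

UNK, PAD, START = 0, 1, 2   # unknown word, padding value '1', start flag '2'
MAX_LEN = 30                # max message length (Table IV)
VOCAB_SIZE = 10000          # most frequent words kept (Table IV)


def build_vocabulary(texts):
    # CountVectorizer keeps only the VOCAB_SIZE most frequent words.
    vectorizer = CountVectorizer(max_features=VOCAB_SIZE)
    vectorizer.fit(texts)
    # Shift word indexes so 0-2 stay reserved for UNK/PAD/START.
    return {word: i + 3 for i, word in enumerate(vectorizer.get_feature_names_out())}


def encode(text, word_index, is_response=False):
    indexes = [word_index.get(word, UNK) for word in text.split()]
    if is_response:
        indexes = [START] + indexes          # responses start with the start flag
    indexes = indexes[:MAX_LEN]              # cut to the maximum sentence length
    return indexes + [PAD] * (MAX_LEN - len(indexes))   # pad the remaining positions


queries = ["my order is late", "please help with my upgrade"]
word_index = build_vocabulary(queries)
print(encode("my order is very late", word_index))   # "very" maps to UNK
```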


The following are the common layers used in the models, starting from inserting the sequence of words into the model to generating the responses:

• Last Word Input Layer: inputs the last word of the sequence.

• Encoder Input Layer: inputs sequence data and passes it to the embedding layer.

• Embedding Layer: used to create word vectors for incoming words.

• Encoder Layer (LSTM, GRU, CNN): creates a temporary output vector from the input sequence.

• Repeated Vector Layer: used like an adapter to fit the encoder and decoder parts of the network together. It can be configured to repeat the fixed-length vector one time for each time step in the output sequence.

• Concatenate Layer: takes inputs and concatenates them along a specified dimension.

• Decoder Layer (LSTM, GRU, CNN) (Dense): used as the output for the network.

• Next Word Dense Layer: takes inputs from the previous layer and outputs one vector representing the target word.

• Next Word Softmax Layer: applies a softmax function that turns the dense layer output into a probability distribution from which to pick the most likely next word.

D. Generating Responses

After training the models, the response generation process is started using the 30k test set. The following are samples of the generated responses from all models (see Fig. 8 and 9).

Fig. 8. Good Result Example.

Fig. 9. Bad Result Example.

E. Evaluation

The Bilingual Evaluation Understudy and cosine similarity evaluation metrics are used to compute the similarity between the generated response and the reference response.

1) Bilingual Evaluation Understudy (BLEU): BLEU was originally created to measure the quality of machine translation with respect to human translation. It calculates an n-gram precision (an n-gram is a sequence of n words that appear consecutively in the text) between the two sequences and also imposes a commensurate penalty for the machine sequence being shorter than the human one. A perfect match scores 1.0, whereas a perfect mismatch scores 0.0.

The computation of BLEU involves various components: the n-gram precisions (Pn) and BLEU's brevity penalty. These measures are calculated as shown in the following steps:

• Calculate the n-gram precision (Pn): it measures the frequency of the n-gram according to the number of times it appears in the generated response and the reference response. Pn must be calculated for each value of n, which usually ranges from 1 to 4. Then the geometric average of Pn is computed with a weighted sum of the logarithms of Pn.

• Calculate the brevity penalty (equation 1): a penalization is applied to short answers, which might be incomplete.

BP = 1 if c > r, or exp(1 − r/c) if c ≤ r   (1)

where c is the length of the generated response and r is the length of the reference response.

• Then, calculate the BLEU score (equation 2) [23]:

BLEU = BP · exp( Σ_{n=1..N} Wn · log Pn )   (2)

where Wn = 1/N.

2) Cosine Similarity: On the other hand, cosine similarity is also used to compute the similarity between the generated response and the reference response in vector representation. If there is more similarity between the two vectors, the cosine similarity value is near one; otherwise, it is near zero.

3) In order to implement the cosine similarity, the pre-trained word2vec model is used. The word2vec model is in the gensim package, and it has been trained on part of the Google News dataset (about 100 billion words) [25]. The model contains 300-dimensional vectors for 3 million words and phrases.

The word2vec model is used to represent words in a vector space [26]. Words are represented in the form of vectors, and placement is done in such a way that words with similar meaning appear together and dissimilar words are located far away.
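A minimal Keras sketch of how the layer stack listed above could be wired for the LSTM variant is shown below. The sizes come from Table IV, but the exact wiring, layer names, and the single-step decoding are assumptions rather than the authors' published code.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed sizes from Table IV.
vocab_size, embed_dim, context_dim, max_len = 10000, 100, 100, 30

# Encoder Input Layer: the (padded) query as a sequence of word indexes.
encoder_input = keras.Input(shape=(max_len,), name="encoder_input")
# Last Word Input Layer: the previously generated word of the response.
last_word_input = keras.Input(shape=(1,), name="last_word_input")

# Embedding Layer shared by both inputs.
embedding = layers.Embedding(vocab_size, embed_dim, name="embedding")

# Encoder Layer: compress the query into a fixed-length context vector.
context = layers.LSTM(context_dim, name="encoder_lstm")(embedding(encoder_input))

# Repeated Vector Layer: repeat the context for the decoding step.
repeated_context = layers.RepeatVector(1, name="repeat_context")(context)

# Concatenate Layer: join the context with the embedded last word.
decoder_in = layers.Concatenate(name="concat")(
    [repeated_context, embedding(last_word_input)]
)

# Decoder Layer followed by the Next Word Dense + softmax layers.
decoder_out = layers.LSTM(context_dim, name="decoder_lstm")(decoder_in)
next_word = layers.Dense(vocab_size, activation="softmax",
                         name="next_word_softmax")(decoder_out)

model = keras.Model([encoder_input, last_word_input], next_word)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy")
model.summary()
```

At inference time such a model would be called repeatedly, feeding the predicted word back in as the next "last word" input until an end token or the maximum message length of 30 is reached.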


Gensim is a topic modeling toolkit implemented in Python. Topic modeling is the discovery of the hidden structure in a text body. The word2vec model is imported from the Gensim toolkit, and a built-in function is used to calculate the similarity between the generated response and the reference response.

F. Result and Discussion

Before discussing and reviewing the results, the most important features of the baseline model are presented in order to allow a rich discussion with clear comparisons. Table V shows the baseline model implementation.

TABLE V. BASELINE MODEL IMPLEMENTATION

Preprocessing: Remove non-English queries, queries with images and @mentions.
Feature extraction: Word2vec
Model: LSTM with five layers.
Embedding size: 640
Optimization function: Stochastic gradient descent and gradient clipping.
Evaluation: BLEU with the best score achieved 0.36.

In this project, the process of generating responses takes around 6 hours for each model to be accomplished. Moreover, calculating the BLEU and cosine similarity scores takes around 4 hours.

The models are evaluated automatically based on the words using the BLEU score. BLEU is applied for 1-gram, 2-gram, 3-gram, and 4-gram in order to explore the strength of the models. It can be seen that the LSTM and GRU models outperform the official baseline LSTM model [4] with respect to the 4-gram BLEU score. Fig. 10 shows in detail the performance of the models for each n-gram.

Fig. 10. The BLEU Scores for 1, 2, 3 and 4 Grams.

Hence, it can be seen that LSTM achieves the highest evaluation scores for all grams, but it takes a long time in training. Moreover, the GRU model has evaluation scores very close to LSTM. On the other hand, the CNN model has the lowest evaluation scores compared with the RNN models but achieves high-speed performance, which can be useful in applications trained on large datasets.

Furthermore, another evaluation metric, cosine similarity, is applied to capture the semantics of the responses beyond exact word matches and to give similarity scores. It has been found that the RNN models capture the semantics in the responses and are more effective in improving the reply quality than the CNN model. Fig. 11 shows the similarity scores for each model.

Fig. 11. The Cosine Similarity Scores.

After exploring the generated responses and examining the good and bad results in depth, it has been found that the RNN models, in general, are better at responding to emotional queries than to informative ones. The models can provide general, meaningful responses suitable for the customer query. Table VI shows an example of an emotional query.

TABLE VI. EXAMPLE OF EMOTIONAL QUERY AND RESPONSES FROM ALL MODELS

Customer Query: my package is days late and i am leaving tomorrow on holidays could you please help it is extremely
Customer Support Response: sorry to hear this please dm us your tracking and phone number
LSTM Generated Response: i am sorry for the trouble with your order please report this to our support team here and we will check this
GRU Generated Response: i am sorry for the trouble with your order please reach out to us here and we will look into this for you please do not provide your order details
CNN Generated Response: hi there is not provide your order number and we can you please dm us a dm us a dm us a dm us a dm us a dm us

On the other hand, the queries that are more informative and ask about specific information are hard to generate responses for, and the generated responses become less efficient. Table VII shows an example of an informative query.

By looking at the different responses from the different models, it has been noticed that LSTM generates better sentences that make sense, and it is hard to say whether the response is from a human or a machine, whereas the GRU responses are not as good as LSTM.
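The two evaluation metrics of Section V.E can be computed with off-the-shelf tools, as in the hedged sketch below. The NLTK smoothing choice and the gensim downloader shortcut for the pretrained Google News word2vec model are assumptions (the pretrained vectors are a large download), and the example sentences are taken from Table VI.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import gensim.downloader as api

reference = "sorry to hear this please dm us your tracking and phone number".split()
generated = "i am sorry for the trouble with your order please dm us".split()

# BLEU with uniform weights Wn = 1/N for n = 1..4, as in equation (2);
# smoothing avoids zero scores when a higher-order n-gram never matches.
bleu = sentence_bleu([reference], generated,
                     weights=(0.25, 0.25, 0.25, 0.25),
                     smoothing_function=SmoothingFunction().method1)

# Cosine similarity in the pretrained Google News word2vec space
# (300-dimensional vectors, trained on about 100 billion words [25]).
word_vectors = api.load("word2vec-google-news-300")
in_vocab_ref = [w for w in reference if w in word_vectors]
in_vocab_gen = [w for w in generated if w in word_vectors]
cosine = word_vectors.n_similarity(in_vocab_ref, in_vocab_gen)

print(f"BLEU: {bleu:.3f}  cosine similarity: {cosine:.3f}")
```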


TABLE VII. EXAMPLE OF INFORMATIVE QUERY AND RESPONSES FROM ALL MODELS

Customer Query: guys when are you going to open your services in middle east
Customer Support Response: hulu store is only available in the us at this time but we will share the interest in bringing our service to the middle east
LSTM Generated Response: hi there we are sorry to hear about this please dm us with your email address so we can connect
GRU Generated Response: hi there i am sorry to hear about this please dm me the details of the issue you are having with your services
CNN Generated Response: hi there is not have you are you

VI. CONCLUSION

In this project, we built a customer support chatbot that helps companies provide 24-hour automated responses. After analyzing the dataset and understanding the importance of automated responses for customers and companies, we started by exploring existing techniques used for generating responses in the customer service field. Then, we tried three different models: LSTM, GRU, and CNN. The experimental results show that the LSTM and GRU models (with modified parameters) tend to generate more informative and valuable responses compared to the CNN model and the baseline LSTM model. Besides, we used the BLEU score and cosine similarity as evaluation measures to support the final decision.

VII. FUTURE WORK

In future work, we plan to incorporate other similarity measures such as soft cosine similarity. Also, we plan to improve the experiments by increasing the vocabulary size and by trying to increase the number of epochs to 100 after providing the proper infrastructure. We can further add more data for training by taking advantage of the queries without responses and translating the non-English queries.

ACKNOWLEDGMENT

This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.

REFERENCES
[1] M. Chung, E. Ko, H. Joung, and S. J. Kim, "Chatbot e-service and customer satisfaction regarding luxury brands," J. Bus. Res., Nov. 2018.
[2] J. Hill, W. Ford, and I. Farreras, "Real conversations with artificial intelligence: A comparison between human–human online conversations and human–chatbot conversations," Comput. Human Behav., 2015.
[3] J. Hirschberg and C. D. Manning, "Advances in natural language processing," Science, vol. 349, no. 6245, pp. 261–266, Jul. 2015.
[4] A. Xu, Z. Liu, Y. Guo, V. Sinha, and R. Akkiraju, "A New Chatbot for Customer Service on Social Media," in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems - CHI '17, 2017, pp. 3506–3510.
[5] S. Oraby, P. Gundecha, J. Mahmud, M. Bhuiyan, and R. Akkiraju, "Modeling Twitter Customer Service Conversations Using Fine-Grained Dialogue Acts," in Proceedings of the 22nd International Conference on Intelligent User Interfaces - IUI '17, 2017, pp. 343–355.
[6] H. Shah, K. Warwick, J. Vallverdú, and D. Wu, "Can machines talk? Comparison of Eliza with modern dialogue systems," Comput. Human Behav., vol. 58, pp. 278–295, May 2016.
[7] R. Dale, "The return of the chatbots," Nat. Lang. Eng., vol. 22, no. 5, pp. 811–817, Sep. 2016.
[8] M. Coronado, C. A. Iglesias, Á. Carrera, and A. Mardomingo, "A cognitive assistant for learning java featuring social dialogue," Int. J. Hum. Comput. Stud., vol. 117, pp. 55–67, Sep. 2018.
[9] S. Jha, S. Bagaria, L. Karthikey, U. Satsangi, and S. Thota, "Student Information AI Chatbot," Int. J. Adv. Res. Comput. Sci., vol. 9, no. 3, 2018.
[10] P. R. Telang, A. K. Kalia, M. Vukovic, R. Pandita, and M. P. Singh, "A Conceptual Framework for Engineering Chatbots," IEEE Internet Comput., vol. 22, no. 6, pp. 54–59, Nov. 2018.
[11] K. Chung and R. C. Park, "Chatbot-based healthcare service with a knowledge base for cloud computing," Cluster Comput., pp. 1–13, Mar. 2018.
[12] J. Savage et al., "Semantic reasoning in service robots using expert systems," Rob. Auton. Syst., vol. 114, pp. 77–92, Apr. 2019.
[13] S. T. Indra, L. Wikarsa, and R. Turang, "Using logistic regression method to classify tweets into the selected topics," in 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2016, pp. 385–390.
[14] A. Shibani, E. Koh, V. Lai, and K. J. Shim, "Assessing the Language of Chat for Teamwork Dialogue," 2017.
[15] A. Singh and C. S. Tucker, "A machine learning approach to product review disambiguation based on function, form and behavior classification," Decis. Support Syst., vol. 97, pp. 81–91, May 2017.
[16] Y. Wu, Z. Li, W. Wu, and M. Zhou, "Response selection with topic clues for retrieval-based chatbots," Neurocomputing, vol. 316, pp. 251–261, Nov. 2018.
[17] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek, "'What is relevant in a text document?': An interpretable machine learning approach," PLoS One, vol. 12, no. 8, p. e0181142, Aug. 2017.
[18] S. Sen and A. Raghunathan, "Approximate Computing for Long Short Term Memory (LSTM) Neural Networks," IEEE Trans. Comput. Des. Integr. Circuits Syst., vol. 37, no. 11, pp. 2266–2276, Nov. 2018.
[19] Z. Wang, Z. Wang, Y. Long, J. Wang, Z. Xu, and B. Wang, "Enhancing generative conversational service agents with dialog history and external knowledge," 2019.
[20] J. Zhang and C. Zong, "Deep Neural Networks in Machine Translation: An Overview," IEEE Intell. Syst., vol. 30, no. 5, pp. 16–25, Sep. 2015.
[21] R. C. Gunasekara, D. Nahamoo, L. C. Polymenakos, D. E. Ciaurri, J. Ganhotra, and K. P. Fadnis, "Quantized Dialog – A general approach for conversational systems," Comput. Speech Lang., vol. 54, pp. 17–30, Mar. 2019.
[22] G. Aalipour, P. Kumar, S. Aditham, T. Nguyen, and A. Sood, "Applications of Sequence to Sequence Models for Technical Support Automation," in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 4861–4869.
[23] J. Singh and Y. Sharma, "Encoder-Decoder Architectures for Generating Questions," Procedia Comput. Sci., vol. 132, pp. 1041–1048, 2018.
[24] N. Shuyo, "Language Detection Library for Java," 2010.
[25] R. Rehurek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[26] Y. Zhu, E. Yan, and F. Wang, "Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec," BMC Med. Inform. Decis. Mak., vol. 17, no. 1, p. 95, Jul. 2017.
