
ISSN 2278-3091
Volume 10, No. 5, September - October 2021
International Journal of Advanced Trends in Computer Science and Engineering
Available Online at http://www.warse.org/IJATCSE/static/pdf/file/ijatcse121052021.pdf
https://doi.org/10.30534/ijatcse/2021/121052021

A Comprehensive Analytical Study of Traditional and Recent Development in Natural Language Processing
Aditya Datta1, Biswajit Jena2*, Amiya Kumar Dash3, Roshni Pradhan4
1 International Institute of Information Technology, Bhubaneshwar, Odisha, India, Email: [email protected]
2 International Institute of Information Technology, Bhubaneshwar, Odisha, India, Email: [email protected]
3 KIIT University, Bhubaneshwar, Odisha, India, Email: [email protected]
4 KIIT University, Bhubaneshwar, Odisha, India, Email: [email protected]
*Corresponding Author

Received Date: August 04, 2021    Accepted Date: September 13, 2021    Published Date: October 06, 2021

ABSTRACT

This paper is a comprehensive analytical study of natural language processing (NLP) and provides a briefing on the most prominent reforms the field has seen over time, together with future research insights and the most relevant features that follow from the concepts and research discussed, up to the time of writing. The paper starts with the most basic concepts of text cleaning, such as tokenization and the importance of stop words, and moves on to concepts such as sequence modeling, speech recognition, and the effect of quantum computing on natural language processing. The current development of deep neural networks, the dominant trend in artificial intelligence, continually supplies NLP with cutting-edge technology and is also covered in this paper. The paper deliberately covers a broad range of explanations so that learners and researchers can build an excellent overall understanding of the field.

Key words: Natural Language Processing (NLP), Tokenization, Deep Learning, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Bidirectional Recurrent Neural Network (BiRNN).

1. INTRODUCTION

Natural language processing [1] is an essential and profound concept in the field of artificial intelligence and mostly deals with areas related to human-computer interaction. The field has important applications such as text processing, text summarization, document analysis, and sentiment classification, which are discussed further in this paper, in the applications section, and under future research. Before discussing the most prominent features, it is worth naming the person who started the research in this field: Alan Turing, in his 1950 article "Computing Machinery and Intelligence" [2].

Firstly, we would like to discuss concepts related to text processing and their importance for the chain of topics that follows. In recent times, many outstanding reforms have occurred in this field, from whichever point of view it is analyzed: from traditional approaches such as TF-IDF, parsing, regular expression matching, and string operations, to word vectors and similarity scores based on parameters such as Euclidean distance and cosine similarity. The field has also witnessed ongoing breakthroughs such as neural networks, machine translation, sentiment classification, emotional analysis of documents, and, building on these, weighting the importance of a particular statement in a specified context. The importance of word embedding in the applications mentioned above cannot be neglected, nor can the new application ideas and details it has given scientists to excel in this area. Its importance has grown so much that it would not be exaggerating to say that it currently holds a massive share of ongoing research in artificial intelligence. The world has also witnessed the results of this research in successful tools such as Alexa, Google Assistant, and Apple's Siri, which have gained profound importance and popularity in many households and which help bring the future into the present. Speech-to-text converters have helped in education, and machine translation has helped people connect globally, even across varied cultures, traditions, and languages; it has brought the world significantly closer in its own way. This breakthrough has also led to new and exciting research areas such as the study of animal language, speech, and emotion, which has prompted many countries to think about how technology can transform our understanding of nature and its functioning, and has helped many animal researchers improve their research on animal behavior. This is just the beginning of what the vast subject of NLP can do in people's lives and in building a clearer picture of nature and its interactions. It has removed the partitions between people and people, people and wildlife, and people and technology, as it has become a part of our lives, helping us develop day by day and deepening our understanding of it day by day.


It is, if deeply understood, a special type of relationship between living beings and technology and their interdependencies, and it can finally be said to be one of the most beautiful enhancements made to a growing technology.

The remaining sections of the paper are organized as follows. Sections 2 and 3 discuss the text cleaning and tokenization approaches of NLP. Section 4 is devoted to text processing concepts. POS tagging and noun chunks are covered in Section 5. Word embedding is covered in Section 6. Word vectors and similarity scores are discussed in Section 7. The main focus, the deep learning approach, is discussed in Section 8. Recurrent neural networks (RNN), long short-term memory (LSTM), and bidirectional recurrent neural networks (BRNN) are the focus of Sections 9, 10, and 11, respectively. Applications of NLP and further concepts associated with NLP are discussed in Sections 12 and 13. Finally, the conclusion is given in Section 14.

2. TEXT CLEANING

Text cleaning [3] is a crucial text preprocessing step on which many modern NLP tools, such as machine translation, chatbots, and speech recognition tools, are highly dependent. In a word, it is the root of the efficiency of applications and research in the area. The most basic text cleaning steps are tokenization, lemmatization, stemming, removal of stop words, etc.; more profoundly, functions like lemmatization and stemming have a prominent impact on the normalization of text. Text cleaning essentially produces what becomes the feature matrix for machine learning or deep learning, and it is the dominant factor in producing accurate predictions. It is therefore essential to explain the function of each of these subfields of text cleaning.

3. TOKENIZATION

In simple words, tokenization [4] means breaking something complex into pieces that are simpler and more comfortable to work with. In this process, practitioners sometimes also remove mistakes or grammatically incorrect framing, which would otherwise harm efficiency by producing what is, in machine learning terms, an improper feature vector. This process mainly consists of two widely used functions, word tokenization and sentence tokenization.

3.1 Sentence Tokenization

This tokenizer divides or partitions a chunk of text, a paragraph, or a document into individual sentences. The partitioning is done based on a particular regular expression. Before going further, a simpler sense of regular expressions: a regular expression can be understood as a pattern matcher over documents; it matches a specific pattern. For example, an expression designed for a sequence of four digits matches any four-digit number in the entire document.

3.2 Word Tokenizer

The word tokenizer is perhaps even more significant in text preprocessing: sentence tokenization is common in many documents, and in most cases we receive sentences directly as input from the tool, so it becomes prominently important to divide them properly into word tokens. Here a tricky, or rather fundamental, question arises: why do we not consider a sentence as a whole as the feature vector, and why are we fixated on words when tackling various applications? The answer is rather simple: the same composition of words can give a different meaning depending on the phrasing.

Example: 1) This dish is not bad but good. (Positive sense)
2) This dish is not good but bad. (Negative sense)

The above relates to sentiment classification, which this paper discusses later. The example shows the variation that arises from word arrangement, which is why words, rather than whole sentences, are chosen as features. Now, to dive further into word tokenization: words are divided or partitioned according to individual needs, and different tools follow different tokenization rules. For example, some partition on the white space between words, some on commas, and some give the user the choice of devising a regular expression to suit their need, since the applications of NLP are unique to each individual's approach and hence demand different variations of text processing. This can be understood intuitively in the upcoming section on chunking of words and in the explanation of how chunking can affect efficiency, which strengthens the idea of the scenario discussed above. Finally, the words formed after tokenization can effectively take the role of an input feature vector used to train on data and form effective correlations in the data to get valid output [5, 43].

4. TEXT PROCESSING

Text processing is the second major topic that controls the accuracy of the output. It is the step performed once one is satisfied with the text preprocessing steps. Text processing includes algorithms, tuning of hyperparameters, checking the algorithm's efficiency, and customizing traditional algorithms to an individual's needs. We explain this section by making frequent connections to previous sections and to how they are interconnected via various methods and applications [6].

Before we jump into discussing algorithms and applications, we need to understand the common words that are essential along many of these paths. There are long-standing standard machine learning approaches, but researchers have also designed deep learning approaches for the same problems and applications, which are explained briefly in the following content together with modern deep learning enhancements. The common concept covered in both approaches is stop words.
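As a concrete illustration of Sections 3 and 4, the following is a minimal sketch (not taken from the paper) of sentence tokenization, word tokenization, and stop-word removal. NLTK is an assumed choice of library here, with its "punkt" and "stopwords" resources downloaded; the example sentences are illustrative only.

```python
# A minimal sketch of the tokenization pipeline described in Sections 3-4,
# assuming NLTK is installed and nltk.download("punkt") /
# nltk.download("stopwords") have been run beforehand.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

text = ("This dish is not bad but good. "
        "This dish is not good but bad.")

# Sentence tokenization: split the raw text into sentences (Section 3.1).
sentences = sent_tokenize(text)

# Word tokenization: split each sentence into word tokens (Section 3.2).
tokens = [word_tokenize(sentence) for sentence in sentences]

# Stop-word removal using NLTK's predefined English list (Section 4.1).
# Note that this list includes negations like "not", which for the
# sentiment example above may actually need to be kept.
stop_words = set(stopwords.words("english"))
filtered = [[tok for tok in sent if tok.lower() not in stop_words]
            for sent in tokens]

print(sentences)
print(filtered)
```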


4.1 Stop Words

The stop-word removal step is sometimes included in text preprocessing, but not always; depending on the efficiency of the algorithm without their removal, stop words are removed only if necessary. In a simpler sense, stop words obstruct the correct formation of the embedding vectors; or, we can say, the words whose removal can increase the accuracy of the algorithm are stop words. Standard machine learning libraries like NLTK and spaCy provide a set of predefined words that are traditionally removed in applications such as text summarization and word tagging. To understand their effect intuitively: in the routine sentences we speak, we often add a lot of junk information on which the meaning of the sentence does not depend, and on removal of such words both text processing and the algorithms responsible for forming correlations between words show improvement; many word tagging and word embedding algorithms may then find better correlations and increase accuracy by a notable amount. Beyond a predefined set, "stop words" can also mean different things depending on the format or type of the document, as in the case of the following sentence.

I am a senior analyst, and my email id is [email protected].

If this sentence appears in a document where it has no unique importance, such as a movie review written by the person above, we are interested only in the rating and not in the email id. However, when we are trying to recommend a movie, the email id cannot be removed, as it acts as the individual's identity. In a similar sense, for some documents the stop words may not be the same as for other documents and may even carry useful information. For further reading, a good treatment is provided in [1].

5. POS TAGGING AND NOUN CHUNKS

Part-of-speech tagging is a well-known tool for assigning each individual word its part of speech, and various organizations provide various versions of it. Research is actively ongoing to assign more precise parts of speech without error, which is vital because parts of speech play an essential role in forming noun chunks, and these are useful if and only if we can effectively identify the nouns, adjectives, articles, etc. With POS tagging we can also formulate our own chunks of words with the help of regular expressions, and it is instrumental to design the regular expression based on the type of document for better results. For example, a general chunking rule is:

{<DT>?<JJ>*<NN>+}

The above is a noun-chunk expression that matches any chunk made of article + adjective + noun, which is useful for descriptive documents or texts. In contrast, a regular expression of the form {<NN>+} is much more helpful for building a better correlation matrix in the case of scientific or experimental texts, where little weight is given to description. Hence, forming the right noun chunks is very important for text analysis, word embedding, named entity recognition, etc.; further information on parsing and noun chunks can be found in [46], and on how to extract noun chunks from large-scale texts in [45].
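The chunking rule above can be applied directly with a regular-expression chunker. The following is a small sketch under the assumption that NLTK (with its "punkt" and "averaged_perceptron_tagger" resources) is used; the example sentence and the chunk label "NP" are illustrative choices, not prescribed by the paper.

```python
# A small sketch of POS tagging followed by noun chunking with the rule
# from Section 5, assuming NLTK with the "punkt" and
# "averaged_perceptron_tagger" resources downloaded.
import nltk

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)          # e.g. [('The', 'DT'), ('quick', 'JJ'), ...]

# Noun-chunk grammar: optional article, any number of adjectives,
# one or more nouns, exactly as in the rule {<DT>?<JJ>*<NN>+}.
grammar = "NP: {<DT>?<JJ>*<NN>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print only the matched noun chunks.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```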

6. WORD EMBEDDING

To brief, for now, the function of word embedding: in a straightforward sense, word embedding does a job similar to one-hot encoding of sequences, but it is not the same. In the early days, one-hot encoding was used to represent sentences and sequences as vectors with the value 1 wherever a word is present and 0 elsewhere. Later, due to the increasing sparsity of the feature matrix, growing vocabularies, and the variation in how languages are spoken, word embeddings were designed around specific features, such as the number of adjectives, gender-dictating words, phrasing, etc., which act as different parameters for different tasks. The result is an embedding vector that functions as a feature vector with a finite number of dimensions even for huge datasets; importantly, the parameters are chosen so that training converges in the least time. Figure 1 depicts a word embedding matrix with an example.

Figure 1: Image depicting an embedding matrix of a certain text [7].

7. WORD VECTOR AND SIMILARITY SCORE

The idea of word vectors is well established in the paper by Mikolov et al., 2013 [5], which beautifully describes how the words in sentences can be converted into word vectors using two models.

7.1 Continuous Bag of Words Model (CBOW)

Given the context of a sentence, i.e., a certain number of neighborhood words (the window-size parameter), we need to predict the subject word of that context. A detailed representation is given in the diagram in Figure 2. As shown in the figure, an input vector (embedding vector) representing the context is given; a hidden projection matrix containing parameters, or in deep learning language a hidden layer, learns the subject according to the context; and finally we predict the word and optimize the model using appropriate training and optimization methods [23].

Figure 2: Depicting the CBOW and Skip Gram model [22].

7.2 Skip Gram Model

Unlike the CBOW model, here we are given the word and try to predict the context. This model is very favorable when working with smaller amounts of data; when we have more data, we generally use the CBOW model, since with large data there is enough information to train on and an efficient, optimized model can be produced. However, in some applications the skip-gram model is used even when more data is available: it has a clear edge in applications involving prediction of the context, word tagging tasks, and some machine translation tasks. Finally, we conclude the discussion of word vectors by explaining similarity scores.

7.3 Similarity Scores

This is the easiest to understand: we use the basic linear algebra concept of the dot product. To compute the similarity between two word vectors, we take the dot product of the vectors divided by the product of their magnitudes, as given by the formula

Similarity Score = (first word vector · second word vector) / (||first word vector|| × ||second word vector||)   (1)

where "·" denotes the dot product and "×" denotes ordinary multiplication.

8. DEEP LEARNING APPROACHES OF NLP

8.1 Neural Networks

Neural networks are a revolution in the field of artificial intelligence, and research such as the work on artificial neural networks by Geoffrey Hinton, together with many pioneers in the field like Andrew Ng and Yann LeCun, has produced unmatched outcomes.

Feature input: the input given to the neural network.
Hidden layer: it may consist of many neurons identifying the underlying correlations between different dimensions of the features; its basic function is given below.
Output layer: the final prediction of the neural network.
Labels: the true classifications from the training data, used to compute the loss and minimize the error.

Consider a basic neural network as in Figure 3.

Figure 3: Showing the function of the neuron [21].

In Figure 3, x_i^(1), x_i^(2), and the other input neurons together can be grouped into an input (feature) vector [x_i^(1), x_i^(2), x_i^(3), ..., x_i^(n)]^T, which we call x_i; it is the i-th training example and has n dimensions (or, in the case of Figure 4, four features, n = 4). The circle between the input neuron layer and the output layer is called the hidden unit, and every circular unit in the neural network is known as a neuron. The value of the hidden unit is W^T · x_i, where "·" denotes the dot product. After this, an activation function is applied to the hidden unit value, giving a_ij, the activation of the j-th hidden unit in the i-th layer of the network; for this case the hidden unit activation is a_11. There are many widely used activation functions, such as tanh, ReLU, linear activation, and sigmoid.

Now, generalizing the above, we get the following formulae of deep neural networks.

Forward propagation:
X: the feature vector of a particular training example
Z_L: the pre-activation of layer L, computed from the previous layer's activation and the weight matrix W_L of the L-th layer
A_L: the activation of layer L (with A_0 = X)

Z_L = W_L · A_{L-1}   (2)
A_L = activation(Z_L)   (3)

(In any equation in this paper, M.T denotes the transpose of the matrix M.)

The above are the two essential equations used in feed-forward propagation. Figure 4 represents a feed-forward neural network. A loss function is defined next, together with an intuitive discussion of backpropagation; X is the input feature matrix, and the deltas represent the gradients obtained through the backpropagation algorithm.

Figure 4: A deep neural network [47].
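As a minimal illustration of equations (2) and (3), the sketch below runs forward propagation through a small fully connected network in NumPy. The layer sizes, random weights, and the sigmoid activation are arbitrary placeholders, not values from the paper.

```python
# A minimal NumPy sketch of forward propagation, equations (2)-(3):
# Z_L = W_L . A_{L-1}, A_L = activation(Z_L). Sizes and the sigmoid
# activation are illustrative choices only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 1]               # input features, hidden units, output
weights = [rng.standard_normal((layer_sizes[i + 1], layer_sizes[i]))
           for i in range(len(layer_sizes) - 1)]

x = rng.standard_normal((4, 1))       # one training example, n = 4 features

A = x                                 # A_0 = X
for W in weights:
    Z = W @ A                         # equation (2)
    A = sigmoid(Z)                    # equation (3)

print(A)                              # final prediction of the output layer
```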


8.2 Back Propagation

Now we compute the gradients with respect to the W's of every layer in order to perform the optimization that reduces the error. To understand this, first consider the equations:

error_output_layer = |predictions − Y|   (4)
error_L = (W_{L+1}.T · error_{L+1}) * g′(Z_L)   (5)

For the intuition, we just need to think of it as forward propagation run in reverse: instead of propagating the input feature to the output layer, we take the error of the output layer and propagate that error back according to the weights, taking the derivative of the activation function and following the simple multivariate chain rule of calculus. We considered the following cost function for the above calculation of gradients:

J = 0.5 · (predictions − Y)^2   (6)

where J is the cost function and Y the labels.

The above cost function is known as the mean squared error; many other functions define the cost in their own way. Further and more in-depth information on backpropagation can be found in the paper [6]. There will not be any discussion of CNNs in this paper; however, it is fascinating to note that many CNN algorithms have been applied to NLP, for example in speech recognition and machine translation, to identify hidden patterns in data.

9. RECURRENT NEURAL NETWORK (RNN)

Recurrent neural networks [23] are designed to handle input at each step of the sequence rather than taking the entire feature vector at once, forward propagating it, and producing predictions. To understand this, it helps to ask first why simple neural networks fail at tasks involving sequences, such as word prediction or machine translation. The primary disadvantage of simple neural networks is that they cannot handle input feature vectors of varying length, and they cannot use one partition of the input feature to predict the remaining partition, as required in tasks like word prediction; hence it becomes necessary to design a neural network that can take input at each stage of forward propagation.

There are many types of RNN, such as the self RNN, which takes only one single input and propagates it to generate a sequence that is random at the initial time steps but becomes a specific sequence after multiple time steps, whether music, letters, words, or speech. There is also an architecture by the name LSTM, which is very important in the field, and information on deep recurrent neural networks is given when we try to understand the activations in an RNN.

9.1 Basic RNN

The concept of an RNN is easily visualized in Figure 5 below.

Figure 5: Basic Recurrent Neural Network [23].

x_t represents the t-th time step of the feature vector; in other words, if there is a sentence

Input feature (x) = "This is a flower."

then x_1 = "This", which is converted to a word embedding during input.

U denotes the input weight matrix.
W denotes the recurrent weight matrix.
S denotes the activation (hidden state) of the neurons.
V is the output weight matrix.

Hence it can be seen that simple neural networks cannot handle such tasks the way an RNN can, and the RNN can be understood as having been designed for these specific tasks; for much deeper insight into RNNs, the research article [8] will be helpful. This paper briefly discusses deep recurrent neural networks here and discusses LSTM, with its equations, in the next section.

9.2 Deep RNN

As said previously, this is only a brief explanation of the deep RNN, one more type of RNN. It can be visualized as a stacking of multiple layers of RNN on top of each other, as shown in Figure 6; the output of one layer is passed as input to the next layer through a certain activation function. To summarize the parameters quickly, there are individual weight matrices for the input feature x, for the activation obtained during the previous time step in each layer, and for each vertical (upward) transition between layers and the output [18].

NOTE: The weight matrix is shared across the entire horizontal (time) direction of a layer; similarly, the weight matrix for a given layer-to-layer transition is the same wherever that transition occurs.

Figure 6: A deep recurrent neural network [18].
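To make the recurrence of Section 9.1 concrete, here is a small NumPy sketch of a single-layer RNN unrolled over a sentence, using the U (input), W (recurrent), and V (output) weight matrices named above. The dimensions, random weights, and random embeddings are illustrative assumptions only.

```python
# A minimal sketch of the basic RNN of Section 9.1: at each time step t,
# s_t = tanh(U . x_t + W . s_{t-1}) and y_t = V . s_t, using the U/W/V
# notation of Figure 5. Embedding and hidden sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, vocab_size = 8, 16, 100

U = rng.standard_normal((hidden_dim, embed_dim)) * 0.1   # input weights
W = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # recurrent weights
V = rng.standard_normal((vocab_size, hidden_dim)) * 0.1  # output weights

# "This is a flower." -> a sequence of (here random) word embeddings x_t.
sentence = ["This", "is", "a", "flower"]
embeddings = {w: rng.standard_normal(embed_dim) for w in sentence}

s = np.zeros(hidden_dim)                  # initial hidden state
for word in sentence:
    x_t = embeddings[word]
    s = np.tanh(U @ x_t + W @ s)          # hidden state carries the past
    y_t = V @ s                           # unnormalized scores over the vocabulary

print(y_t.shape)                          # scores at the final time step
```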


Now, as discussed above, every application has its own architecture for training the data. For example, sentiment classification uses a many-to-one architecture, and machine translation uses a many-to-many architecture, which will be covered in the coming sections. However, the material covered here is only a basic version of the explanation, and in many cases a deep RNN consisting of many layers is not used unless the task is complex, as handling the weight (parameter) matrices would become tough and time-consuming.

10. LONG SHORT-TERM MEMORY (LSTM)

When handling sequences of greater length, an RNN may not effectively capture the impact of a preceding word on a future estimate or prediction of a word, because the state is continuously updated at each individual time step; it would therefore be very beneficial if the system could remember past dependencies over a greater number of time steps to establish good predictions. This section uses many equations from previous concepts, and first discusses the update equations of the RNN:

a^<t> = g(W_a[a^<t-1>, x^<t>] + b_a)   (7)
y^<t> = W_y · a^<t> + b_y   (8)

In every upcoming equation, the superscript on a symbol denotes the time step of the sequence, and the subscript denotes the value to which the parameter belongs. But the above equations are not efficient at capturing dependencies over a longer period of time.

NOTE: g denotes the activation function, b_a denotes the bias, x^<t> denotes the input at time step t, and one more representation is

W_a[a^<t-1>, x^<t>] = W_aa · a^<t-1> + W_ax · x^<t>   (9)

where W_aa is the weight matrix applied to the previous activation, W_ax is the weight matrix corresponding to the input feature vector, and "·" denotes the dot product.

Now, to tackle the above issue, there is a need to construct an object that handles when to update the parameter so as to preserve a dependency and when to forget the previous dependency so as to delete unwanted relational dependencies, together with a final tweaking parameter for updating the output at each time step so that the result is not copied directly into the next time step. So, finally, the following equations act as an add-on to the fundamental equations above (each gate with its own weight matrix and bias):

S^<t> = tanh(W_s[a^<t-1>, x^<t>] + b_s)   (10)
gamma_u = sigmoid(W_u[a^<t-1>, x^<t>] + b_u)   (11)
gamma_f = sigmoid(W_f[a^<t-1>, x^<t>] + b_f)   (12)
gamma_o = sigmoid(W_o[a^<t-1>, x^<t>] + b_o)   (13)
C^<t> = gamma_u * S^<t> + gamma_f * C^<t-1>   (14)
a^<t> = gamma_o * C^<t>   (15)

In the formulae listed above:
* denotes element-wise multiplication;
update gate gamma_u: responsible for when to update the current C;
forget gate gamma_f: responsible for when to forget the update from the previous time step;
output gate gamma_o: the amount of information that needs to be passed on to the next time step.

To briefly summarize the above: every gate uses the sigmoid function, because it saturates to either 0 or 1 over a large range of values, and exactly such a function is needed to either update or forget the previous value at the current time step. As far as the output gate is concerned, it decides how much of the current information is carried forward, leaving room for the input feature of the next time step to have a significant influence on the prediction. Observed closely, the forget gate and the update gate have opposite effects: when the forget gate is nearly 1, the previous cell value is carried forward and the current feature has little impact on the activation passed to the next LSTM cell, and when the update gate is nearly 1, the current candidate value updates C. A simple depiction of the LSTM cell is given below in Figure 7, explaining the features and mechanisms inside it. This idea can be explored further in the research article [9], which clearly explains the effects of different activation functions.

Figure 7: Block diagram of LSTM [48].

The gated recurrent unit (GRU) cell is also depicted in Figure 8, as it is closely related to the LSTM; in some applications gated recurrent units work better than the LSTM, and in-depth insights into the GRU can be found in [9].

Figure 8: Block diagram of Gated Recurrent Unit (GRU) [49].
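The gate equations (10)-(15) can be written out directly. The following NumPy sketch implements one step of the cell in the notation used above (candidate value S, cell value C, gates gamma_u/gamma_f/gamma_o); the dimensions, random parameters, and zero initial states are illustrative assumptions, not part of the paper.

```python
# A compact NumPy sketch of one LSTM step following equations (10)-(15).
# Sizes and random parameters are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_a, n_x = 16, 8                       # hidden and input dimensions
concat_dim = n_a + n_x

# One weight matrix per equation (candidate plus three gates), shared biases of zero.
W_s, W_u, W_f, W_o = (rng.standard_normal((n_a, concat_dim)) * 0.1
                      for _ in range(4))
b_s = b_u = b_f = b_o = np.zeros(n_a)

def lstm_step(a_prev, c_prev, x_t):
    z = np.concatenate([a_prev, x_t])          # [a<t-1>, x<t>]
    s = np.tanh(W_s @ z + b_s)                 # (10) candidate value
    gamma_u = sigmoid(W_u @ z + b_u)           # (11) update gate
    gamma_f = sigmoid(W_f @ z + b_f)           # (12) forget gate
    gamma_o = sigmoid(W_o @ z + b_o)           # (13) output gate
    c_t = gamma_u * s + gamma_f * c_prev       # (14) new cell value
    a_t = gamma_o * c_t                        # (15) activation passed on
    return a_t, c_t

a, c = np.zeros(n_a), np.zeros(n_a)
for _ in range(5):                              # run a few time steps
    a, c = lstm_step(a, c, rng.standard_normal(n_x))
print(a.shape)
```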


11. BIDIRECTIONAL RECURRENT NEURAL NETWORK (BRNN)

Bidirectional recurrent neural networks are among the important neural network architectures. The architectures above are unidirectional, so it becomes difficult to predict the output at a time step with respect to a future word; in such scenarios the BRNN is helpful, as it uses both the forward-direction activations and the activations propagated in the reverse direction when predicting each individual output. It performs better for sequences of greater length and for sequences with high interdependency [10]. The basic block diagram showing the functionality of the BRNN is depicted in Figure 9.

Figure 9: Showing a basic BRNN [50].

12. APPLICATIONS

The theory and insights given above are very helpful in the applications discussed next, which can be seen as the transformation of the NLP field.

12.1 Word Tagging

Whenever we search online for a question related to words, whether technical, business-related, or anything else, suggestions are shown as blocks or in some other format. The word tagging model thoroughly explains this type of correlation between the word searches and the suggestions of related articles that pop up. This section builds on the previously discussed sections and their importance.

To start, this model first takes a raw document related to many subjects or topics, which is a common step in many text processing applications, and then performs the basic and necessary text cleaning and text processing operations discussed in the previous sections. The processed text is then converted into a feature matrix in the form of word embeddings, and each individual sentence is given <POS> and <EOS> tags determining its start and end. Now, the similarity scores between the sentences are found following either the CBOW model or the skip-gram model, using a traditional machine learning library like word2vec, which vectorizes the word embeddings by capturing the dependencies of all contexts of every word present in the vocabulary of the data matrix; alternatively, the above can be determined using the latest deep learning models like BERT, which performs the named entity recognition task efficiently. These methods have been used until now, and the choice of method is determined by the scale of the task.

12.2 Text Summarization

Text summarization is another important application of natural language processing. It requires proper parameters so that the summary of the text is effective, with as few errors as possible, because sometimes unimportant content crops up due to a less rigorous extraction of the text, which can be corrected by reversing the corresponding steps. The most important and fundamental tool in the extraction of the text is the TF-IDF model, which works on basic word frequency counts across various documents and computes similarity scores between sentences, just as we do for words with word-embedding vectorizer functions, so as to vectorize not only words but also sentences as a whole. The important part here is to understand how the extraction takes place and, more precisely, how the TF-IDF algorithm works. It follows two steps. First, the frequency of each word is calculated and represented as a value; the more frequent a word, the more important it is in the document. However, words such as "and", "the", "an", etc. are used extensively in any type of document, and to reduce their impact there is a second step that penalizes such scores by multiplying them by a factor computed as the ratio of the total number of documents to the number of documents in which the word appears, finally applying a base-2 logarithm for controlled penalization. The same is given by the following equations:

Term Frequency = (number of times the word appears in the text) / (total number of words in the text)   (16)

Inverse Document Frequency = log2((total number of documents) / (number of documents in which the word appears))   (17)

After obtaining the feature matrix, the individual conditional probabilities are considered following Bayesian inference between the sentences of the given document, the sentence vectors are given as input for the calculation of similarity scores, and the most relevant chunk of sentences with the highest relative similarity scores is returned as the output. Further insight into the weights used in calculating embeddings for text summarization can be found in [2]; an interesting paper that covers the topic in much more depth is [4], and another good reading is [3].
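The TF-IDF weighting of equations (16) and (17) can be computed directly. Below is a small self-contained sketch in pure Python using the base-2 logarithm stated above; the toy documents are illustrative only and not drawn from the paper.

```python
# A small sketch of the TF-IDF scoring of equations (16)-(17):
# tf = count(word in doc) / len(doc), idf = log2(N / df(word)).
import math
from collections import Counter

documents = [
    "the movie was good and the story was good".split(),
    "the plot was bad and the acting was bad".split(),
    "the story and the acting of the movie were good".split(),
]

n_docs = len(documents)
doc_freq = Counter()
for doc in documents:
    doc_freq.update(set(doc))          # in how many documents each word appears

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)                    # equation (16)
    idf = math.log2(n_docs / doc_freq[word])           # equation (17)
    return tf * idf

# Words appearing in every document ("the") get an IDF of zero, while
# rarer, more informative words keep a higher score.
for word in ("the", "good", "bad"):
    print(word, [round(tf_idf(word, doc), 3) for doc in documents])
```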


12.3 Machine Translation and Speech Recognition

Machine translation is a sequence-to-sequence RNN model: instead of outputting the sequence at each individual time step, the architecture is designed to first take in the entire input feature vector and only then output each word vector of the output. This model works by maximizing the Bayesian probability of the output conditional on the input. Before covering the short formula used here, it is important to see the general formula below.

If two events A and B occur with probabilities P(A) and P(B) respectively, then the probability of A given B, written P(A|B), is

P(A|B) = (P(A) / P(B)) · P(B|A)   (18)

The above is extended to involve more variables (events) conditional on multiple events, which is the crux of the task:

P(A, B|C) = P(A|C) · P(B|A, C)   (19)

Assuming B here to be the output at the time step after the first time step's output A, and C to be the input feature as a whole, it is clear that the above can be generalized and used for getting the output at the n-th time step as follows:

P(y^<1>, y^<2>, y^<3>, ..., y^<n-1>, y^<n> | x) = P(y^<1>|x) · P(y^<2>|x, y^<1>) · ... · P(y^<n>|x, y^<1>, y^<2>, ..., y^<n-1>)   (20)

This gives a formula for the probability of the n-th output in terms of the conditional probabilities of the previous individual time steps' outputs and the input feature vector x. If the conditional probability at each step is maximized, we get a translation of the input sentence; this approach is known as the greedy approach.

But the greedy approach may fail to produce the most sensible translation. Suppose the correct translation of the input is "The plan is going to get executed tomorrow," and suppose another translation, "The plan will be executed tomorrow," is the one produced by greedy, word-by-word choices. Rather than making a greedy choice at each individual time step, we want the sentence whose overall probability over multiple options is highest; for this, an algorithm known as beam search is used.

12.4 Beam Search

Instead of always choosing the word with the highest probability, we keep a fixed number of different words, namely the top n by conditional probability, arranged in descending order of magnitude [10]. For example:

For y^<1>, the top n outputs are chosen, which implies that n copies of the network are made, in each of which y^<1> takes a different value, i.e., a different word is produced as output. The same can be seen in Figure 10.

Now, for y^<2>, we again calculate the top n values of y^<2> that maximize the conditional probability P(y^<1>, y^<2>|x) for each of the n networks present. This gives n-squared probabilities, out of which the top n conditional probabilities are selected, and again n individual copies of the architecture are made, taking the corresponding values of y^<1> and y^<2> in order. And the process is continued. Here, n is known as the beam width.

So the final objective can be defined as

y = argmax_y { prod_{t=1}^{Ly} P(y^<t> | x, y^<1>, ..., y^<t-1>) }   (21)

Rather than following the above objective for selecting the best y, it is better to take the logarithm of the objective (maximizing either objective maximizes the same y), and we can further normalize the function by the length of the output, Ly. Hence the objective becomes

y = argmax_y (1/Ly) · { sum_{t=1}^{Ly} log P(y^<t> | x, y^<1>, ..., y^<t-1>) }   (22)

In the way defined above, we can perform machine translation; for further reading on this concept, the paper [11] is beneficial, and a summary is given in [12].

Figure 10: Depicting beam search with beam width = 5 [20].
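A bare-bones sketch of the beam-search procedure and the length-normalized log objective of equation (22) is given below. The toy scoring function standing in for the decoder, the vocabulary, and the beam width are purely illustrative assumptions, not the paper's method.

```python
# A bare-bones sketch of beam search with the length-normalized
# log-probability objective of equation (22). The toy "model" returning a
# fixed next-word distribution stands in for a real decoder.
import math

VOCAB = ["the", "plan", "will", "be", "executed", "tomorrow", "<eos>"]

def next_word_probs(prefix):
    # Placeholder for P(y<t> | x, y<1>, ..., y<t-1>): a real system would
    # run the decoder here; we simply favour the next word of one target.
    target = ["the", "plan", "will", "be", "executed", "tomorrow", "<eos>"]
    favoured = target[len(prefix)] if len(prefix) < len(target) else "<eos>"
    return {w: (0.7 if w == favoured else 0.05) for w in VOCAB}

def beam_search(beam_width=3, max_len=8):
    beams = [([], 0.0)]                          # (prefix, sum of log probs)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))
                continue
            for word, p in next_word_probs(prefix).items():
                candidates.append((prefix + [word], score + math.log(p)))
        # Keep the top "n" partial translations (n = beam width).
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Length-normalized objective of equation (22).
    best = max(beams, key=lambda c: c[1] / len(c[0]))
    return best[0]

print(" ".join(beam_search()))
```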


12.5 Attention Model

Now it is important to mention a concept known as the attention model, which belongs to machine translation and is also used in speech recognition. This concept comes into play when the sequence is of greater length: entirely memorizing the input during the first part of the architecture (the LSTM encoder) and only afterwards outputting the whole sequence (the decoder) becomes tough, and the results would not be appropriate. To tackle this, the model divides the entire sequence into parts and performs beam search with respect to a new parameter known as the "context"; there is a small change in the architecture compared to the previous one, as shown in Figure 11.

Figure 11: Depicting the attention mechanism [19].

The encoder above is a BRNN (bidirectional recurrent neural network) creating a "context", which is a weighted sum of the outputs of the BRNN; the weights are known as "attention weights". The following equations help in understanding the above:

sum_{t'=1}^{T} alpha^<t, t'> = 1   (23)

Now, to find the context C, the following is carried out:

C^<t> = sum_{t'=1}^{T} alpha^<t, t'> · O^<t'>   (24)

where O^<t'> denotes the BRNN output at time step t'. Now, to calculate y^<t>, a neural network can be used that takes the previous output y^<t-1> and C^<t> as inputs and produces y^<t> as output. This is suggested because the exact relation between input and output is unknown, and it is easier for a neural network to identify the mathematical relation between them. There is also a concept known as the BLEU score, which is explained in [13] and is used for evaluating models.

Speech recognition is just an application of the above concepts, except that we deal with audio data; there is an algorithm known as the Transformer and many further concepts. A good paper that gives good intuition on this topic is [14]. The attention model is also used to adapt to the context and to summarize a series; an excellent paper on video summarization using the model is [42].

12.6 Quantum Encoding and Decoding

Current research in artificial intelligence is becoming rigorous with respect to quantum computing. There are already many applications of quantum computing, such as the QAOA algorithm [15], which is extensively used in optimization; recently an article [17] depicted a quantum computing implementation of the encoding and decoding parts of a neural network architecture, and there is increasing use of Grover's search algorithm [16] in this field. Further understanding of this very much requires a thorough grounding in quantum computing and quantum information theory.

13. SOME MORE CONCEPTS

Some of the important concepts have been covered in this paper. A few other topics are similar to the above concepts or serve as good reading to better understand them and to build deep intuition about both applications and concepts. They are the following.

Text clustering is an important algorithm whose meaning is in its name itself: it clusters the data in a general text composed of descriptions of many kinds [31]. Text clustering knowledge can also be helpful in other subdivisions of the field, such as dialogue systems [26, 30]. It is worth completely understanding the practical implementation tricks of neural machine translation [24], especially how it deals with embedding layers and how to handle large batches of input to the model. A sequence-to-sequence model is used in neural machine translation with the sequence handled at the level of words, but recently this sequence-to-sequence model has also been operated at the character level, which is beneficial in its own strong territory; this operation at the character level is well explained in the article [25].

Sentiment classification [37, 38] is another important application that is very helpful in analyzing a text's polarity. The machine learning implementation of sentiment classification in [39] explains how gated recurrent units can be used for sentiment classification. Convolutional neural networks are among the major influences on new approaches for tasks such as clustering and speech recognition, and there is a good number of papers that are very beneficial for grasping the architectures for various NLP tasks, namely [27, 28, 29].

Information retrieval is an important major division of text processing, as it extracts the important crux of an entire document: in a document of, say, 100 pages, there may be only one page of useful information that is meaningful for the purpose. To address this, there is a book [32] that can be treated as a complete introduction to information retrieval. A survey paper [33] describes the different methods for information retrieval and filtering. An understanding of neural network implementations in information retrieval can be found in [34], and information retrieval via the tokenization technique is explained in [44]. Self-organizing maps [36] are often considered a clustering algorithm, or rather one for identifying different chunks in the given data, which generally suits text processing; an implementation of SOM in information retrieval is described in [35].

Optimization is a very important part of understanding neural networks and many of their applications, as it decides the efficiency and accuracy of algorithms and parameters; it is thus very useful to learn about fundamental optimization algorithms like gradient descent, RMSProp, Adam, AdaBoost, and momentum [40, 41]. Feature extraction is very important in machine learning and deep learning; feature extraction for text categorization and a good understanding of text categorization are two useful aspects in that direction. BERT, which became successful in document classification, is the main pre-trained model used extensively in handling complex tasks, and hence relevant knowledge of it is necessary.

14. CONCLUSION

It can be said that natural language processing is a massive field with a lot of cross-platform knowledge implementation, and the growing demand for its applications in current products makes it even more special. NLP is a complex concept with innumerable dimensions, and one can always obtain more based on one's efforts and creativity. Currently, computationally efficient quantum computing ideas are being applied to various fields, and one can be sure that they are going to create a greater impact on all applications that cannot yet perform well, or are not being used, because they are computationally expensive; with growing technology such as quantum optimization it becomes possible to realize such applications, and one major chunk of them certainly belongs to NLP. Thus, it is sure that in the coming days text-, speech-, and sequence-based applications will come up with much more ease, and there will be an increase in the growth of practical applications of the theory proposed. Finally, natural language processing depicts the human-computer relationship beautifully, and certainly the positive growth of technology and human-machine interaction.

ACKNOWLEDGEMENT

Compliance with Ethical Standards:
Funding: This study was not funded by any person or organization.
Conflict of Interest: The authors declare that they have no conflict of interest.


REFERENCES

1. Rasmita Rautray, Rakesh Chandra Balabantaray, Anisha Bhardwaj, Document summarization using sentence features, International Journal of Information Retrieval Research, 2015.
2. Rakesh Chandra Balabantaray, D. K. Sahoo, B. Sahoo, M. Swain, Text summarization using term weights, International Journal of Computer Applications, 2012.
3. Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang, "Tapping into the Power of Text Mining", Journal of ACM, Blacksburg, 2005.
4. Gupta, V., Lehal, G. S., A Survey of Text Summarization Extractive Techniques, Journal of Emerging Technologies in Web Intelligence, Vol. 2, No. 3, 2010.
5. Mikolov, T., Chen, K., Corrado, G., and Dean, J., Efficient estimation of word representations in vector space, in Proceedings of Workshop at ICLR, 2013.
6. Rumelhart, D. E., Hinton, G. E., and Williams, R. J., Learning representations by back-propagating errors, Nature, 323, 533-536, 1986.
7. Jurafsky, Daniel & Martin, James (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
8. Alex Graves, Generating sequences with recurrent neural networks, CoRR, abs/1308.0850, 2013.
9. Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory, Neural Computation, 9(8):1735-1780, 1997.
10. A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM networks, in Proc. Int. Joint Conf. on Neural Networks (IJCNN 2005), 2005.
11. Freitag, M., Al-Onaizan, Y., Beam search strategies for neural machine translation, arXiv preprint arXiv:1702.01806, 2017.
12. Garg, A., Agarwal, M., Machine translation: a literature review, arXiv preprint arXiv:1901.01122, 2018.
13. Papineni, Kishore & Roukos, Salim & Ward, Todd & Zhu, Wei Jing (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. 10.3115/1073083.1073135.
14. Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng, Deep Speech: scaling up end-to-end speech recognition.
15. Farhi, E., Goldstone, J., Gutmann, S., A Quantum Approximate Optimization Algorithm, arXiv preprint arXiv:1411.4028, 2014.
16. L. Grover, A fast quantum mechanical algorithm for database search, in Proc. 28th STOC, pages 212-219, Philadelphia, Pennsylvania, 1996. ACM Press.
17. Bausch, J., Subramanian, S., Piddock, S., A quantum search decoder for natural language processing, arXiv preprint arXiv:1909.05023, 2019.
18. Kong, Huifang & Fang, Yao & Fan, Lei & Wang, Hai & Zhang, Xiaoxue & Hu, Jie (2019). A novel torque distribution strategy based on deep recurrent neural network for parallel hybrid electric vehicle. IEEE Access. PP. 1-1. 10.1109/ACCESS.2019.2917545.
19. Bahdanau, D., Cho, K. H., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Paper presented at 3rd International Conference on Learning Representations, ICLR 2015, San Diego, United States.
20. Liu, Bowen & Ramsundar, Bharath & Kawthekar, Prasad & Shi, Jade & Gomes, Joseph & Nguyen, Quang & Ho, Stephen & Sloane, Jack & Wender, Paul & Pande, Vijay (2017). Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Central Science. 3. 10.1021/acscentsci.7b00303.
21. Long, Dan & Wuest, S. & Williams, John & Rauwendaal, Randall & Bailey, M. (2010). Contour Planting: A Strategy to Reduce Soil Erosion on Steep Slopes.
22. Landthaler, Joerg & Waltl, Bernhard & Huth, Dominik & Braun, Daniel & Matthes, Florian & Stocker, Christoph & Geiger, Thomas (2017). Extending Thesauri Using Word Embeddings and the Intersection Method.
23. Zhu, Juncheng & Yang, Zhile & Mourshed, Monjur & Guo, Yuanjun & Zhou, Yimin & Chang, Yan & Wei, Yanjie & Feng, Shengzhong (2019). Electric Vehicle Charging Load Forecasting: A Comparative Study of Deep Learning Approaches. Energies. 12. 2692. 10.3390/en12142692.
24. Neishi, M., Sakuma, J., Tohda, S., Ishiwatari, S., Yoshinaga, N., & Toyoda, M. (2017). A Bag of Useful Tricks for Practical Neural Machine Translation: Embedding Layer Initialization and Large Batch Size. WAT@IJCNLP.
25. Zhang, H., Li, J., Ji, Y., & Yue, H. (2016). A character-level sequence-to-sequence method for subtitle learning. 2016 IEEE 14th International Conference on Industrial Informatics (INDIN), 780-783.
26. Ren, D., Cai, Y., Chan, W. H., & Li, Z. (2018). A Clustering Based Adaptive Sequence-to-Sequence Model for Dialogue Systems. 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), 775-781.
27. Allamanis, M., Peng, H., & Sutton, C. A. (2016). A Convolutional Attention Network for Extreme Summarization of Source Code. ICML.
28. Gehring, J., Auli, M., Grangier, D., & Dauphin, Y. (2017). A Convolutional Encoder Model for Neural Machine Translation. ArXiv, abs/1611.02344.
29. Xing, Y., Xiao, C., Wu, Y., & Ding, Z. (2018). A Convolutional Neural Network for Aspect Sentiment Classification. IJPRAI, 33, 1959046:1-1959046:13.
30. Aggarwal, C. C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining Text Data (pp. 77-128). Springer, Boston, MA.
31. Jing, L., Ng, M. K., & Huang, J. Z. (2010). Knowledge-based vector space model for text clustering. Knowledge and Information Systems, 25(1), 35-55.
32. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
33. Faloutsos, C., & Oard, D. W. (1998). A survey of information retrieval and filtering methods.
34. Mitra, B., & Craswell, N. (2017). Neural models for information retrieval. arXiv preprint arXiv:1705.01509.


35. Lin, X., Soergel, D., & Marchionini, G. (1991, September). A self-organizing semantic map for information retrieval. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 262-269).
36. Kohonen, T. (1997, June). Exploration of very large databases by self-organizing maps. In Proceedings of the International Conference on Neural Networks (ICNN'97) (Vol. 1, pp. PL1-PL6). IEEE.
37. Xia, R., Zong, C., & Li, S. (2011). Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences, 181(6), 1138-1152.
38. Pang, B., Lee, L., & Vaithyanathan, S. (2002, July). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10 (pp. 79-86). Association for Computational Linguistics.
39. Tang, D., Qin, B., & Liu, T. (2015, September). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1422-1432).
40. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
41. Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., & Ng, A. Y. (2011). On optimization methods for deep learning.
42. Ma, Y. F., Lu, L., Zhang, H. J., & Li, M. (2002, December). A user attention model for video summarization. In Proceedings of the Tenth ACM International Conference on Multimedia (pp. 533-542).
43. Nayak, A. S., Kanive, A. P., Chandavekar, N., & Balasubramani, R. (2016). Survey on preprocessing techniques for text mining. International Journal of Engineering and Computer Science, ISSN 2319-7242.
44. Singh, V., & Saini, B. (2014). An effective tokenization algorithm for information retrieval systems. Department of Computer Engineering, National Institute of Technology Kurukshetra, Haryana, India.
45. Chen, K. H., & Chen, H. H. (1994, June). Extracting noun phrases from large-scale texts: A hybrid approach and its automatic evaluation. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (pp. 234-241). Association for Computational Linguistics.
46. Abney, S. P. (1991). Parsing by chunks. In Principle-Based Parsing (pp. 257-278). Springer, Dordrecht.
47. Deep neural network tutorial [Online]. Available: https://miro.medium.com/max/958/1*QVIyc5HnGDWTNX3m-nIm9w.png
48. Long Short-Term Memory tutorial [Online]. Available: https://i.stack.imgur.com/RHNrZ.jpg
49. Gated Recurrent Unit tutorial [Online]. Available: https://cdnimages1.medium.com/freeze/max/1000/1*OBCui-SbIRUtlBgWkgQUlw.png?q=2
50. Bidirectional Recurrent neural network tutorial [Online]. Available: http://www.easy-tensorflow.com/tftutorials/recurrent-neural-networks/bidirectional-rnn-for-classification
