A Comprehensive Analytical Study of Traditional and Recent Development in Natural Language Processing
Aditya Datta et al., International Journal of Advanced Trends in Computer Science and Engineering, 10(5), September - October 2021, 3009 – 3019
Volume 10, No. 5, September - October 2021
International Journal of Advanced Trends in Computer Science and Engineering
Available Online at http://www.warse.org/IJATCSE/static/pdf/file/ijatcse121052021.pdf
https://doi.org/10.30534/ijatcse/2021/121052021
Received Date: August 04, 2021; Accepted Date: September 13, 2021; Published Date: October 06, 2021
4.1 Stop words

The stop-word removal step is sometimes included in text preprocessing, but not always; depending on the efficiency of the algorithm, processing may also be performed without removing stop words, and they are removed only if necessary. In a simpler sense, stop words obstruct the correct formation of the embedding vectors; in other words, the words whose removal can increase the accuracy of the algorithm are stop words. Standard machine learning libraries like NLTK and spaCy provide a set of predefined words that are traditionally removed in applications such as text summarization, word tagging, etc. To understand their effect intuitively, the routine sentences we speak often carry a lot of junk information on which the sentence meaning does not depend; on removal of such words, text processing as well as the algorithms responsible for forming correlations between words improve, since many word-tagging and word-embedding algorithms may then find better correlations and increase accuracy by a notable amount. Beyond a predefined set, the meaning of "stop words" can also differ depending on the format or type of the document, for example, in the case of the following sentences.

... named entity recognition, etc.; further information on parsing and noun chunks is here [46], and information on how to extract noun chunks from large-scale texts can be found here [45].

6. WORD EMBEDDING

To briefly state, for now, the function of word embedding: in a straightforward sense, word embedding serves the same purpose as one-hot encoding of sequences, but it is not the same. In the early days, one-hot encoding was used to denote sentences and sequences as vectors having the value 1 wherever the word is present and 0 elsewhere. Later, due to the increasing sparsity of the feature matrix, the growing vocabulary, and the variation in the speech of languages, word embedding was designed based on specific features, such as adjectives, gender-dictating words, phrasing, etc., which act as different parameters for different tasks. This resulted in an embedding vector functioning as a feature vector with a finite number of dimensions for huge datasets, and, importantly, the parameters are chosen in such a way that training converges in the least time. Figure 1 depicts the word embedding matrix with an example.
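To make the contrast above concrete, the following is a minimal Python sketch, not taken from the paper, comparing a sparse one-hot vector with a dense embedding lookup; the toy vocabulary, the embedding dimension, and the random matrix standing in for trained parameters are illustrative assumptions.

import numpy as np

# Toy vocabulary; the index positions are arbitrary and only for illustration.
vocab = ["the", "king", "queen", "walked", "quickly"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Sparse representation: a |V|-dimensional vector with a single 1.
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

# Dense embedding: each word maps to a small feature vector. Random values
# stand in here for trained parameters (the gender- or adjective-related
# features mentioned in the text above).
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    # Dense representation: a low-dimensional row of the embedding matrix.
    return embedding_matrix[word_to_idx[word]]

print(one_hot("king"))   # length |V| = 5, mostly zeros
print(embed("king"))     # length embedding_dim = 3, dense values

The dense lookup keeps the dimensionality fixed even as the vocabulary grows, which is the sparsity problem one-hot encoding runs into.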
Figure 2: Depicting the CBOW and Skip Gram models [22].

7.2 Skip Gram Model

Unlike the CBOW model, given the word, we try to predict the context, and this model is very favorable when working with smaller amounts of data. When we have more data, we use the CBOW model, for the reason that when handling large data we have enough information to train on and can produce an efficient and optimized model. However, in some applications, even though we have more data, we use the skip-gram model. This model has a clear edge for applications involving prediction of the context, word-tagging tasks, and some machine translation tasks. Finally, we conclude the discussion of word vectors by explaining similarity scores.

7.3 Similarity Scores

This is easiest to understand using the basic concept of linear algebra known as the dot product. To compute the similarity between two word vectors, we take the dot product of the vectors divided by the product of the magnitudes of the vectors, given by the formula:

Similarity Score = (first word vector * second word vector) / (||first word vector|| × ||second word vector||) (1)

"*" - operation specifies the dot product.
"×" - operation denotes the ordinary multiplication operator.

8. DEEP LEARNING APPROACHES OF NLP

8.1 Neural Networks

Neural networks are a revolution in the field of artificial intelligence, and many research papers, such as the work on artificial neural networks by Geoffrey Hinton, and many pioneers in the field, such as Andrew Ng and Yann LeCun, have produced unmatched outcomes.

In Figure 3, xi(1), xi(2), and the other input neurons together can be grouped as an input (feature) vector [xi(1), xi(2), xi(3), ..., xi(n)]^T; this vector, called xi, is the ith training example and has 'n' dimensions, or in the case of Figure 4, four features (n = 4). The circle between the input neuron layer and the output layer is called the hidden unit, and every circular unit in the neural network is known as a neuron. For the hidden unit, the value is (W.T * xi), where the operator '*' depicts the dot product between the two. After this, an activation function is applied to the hidden unit value; "aij" represents the activation of the jth hidden unit in the ith layer of the network, so here the hidden unit activation is written a11. There are many widely used activation functions, such as tanh, relu, linear activation, sigmoid, etc.

Now, generalizing the above, we get the following formulae of deep neural networks:

Forward Propagation:
X - feature vector of a particular training example
ZL - the result of forward propagation of ZL-1 and WL of the Lth layer
AL - activation of layer L

ZL = (WL * XL) (2)
AL = activation(ZL) (3)

MATRIX.T => transpose of the matrix in any of the given equations in this paper.

The above are the two essential equations used in feed-forward propagation. Figure 4 represents a feed-forward neural network. A definition of the loss function and an intuitive discussion of backpropagation follow. X is the input feature matrix, and the deltas represent the gradients obtained through the backpropagation algorithm.
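As a concrete illustration of equation (1) and of the feed-forward equations (2) and (3), the following Python sketch (not from the paper; the layer sizes, random weights, and example vectors are invented for illustration) computes a similarity score and a tiny forward pass with numpy.

import numpy as np

def similarity_score(u, v):
    # Equation (1): dot product divided by the product of the magnitudes.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def forward_layer(W, X):
    # Equations (2) and (3): ZL = WL * XL, AL = activation(ZL).
    Z = W @ X          # dot product of the weight matrix and the layer input
    A = np.tanh(Z)     # any activation (tanh, relu, sigmoid, ...) could be used
    return A

# Illustrative word vectors (not trained embeddings).
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
print(similarity_score(king, queen))

# A tiny forward pass: 4 input features -> 3 hidden units -> 1 output.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 1))                       # feature vector xi with n = 4
A1 = forward_layer(rng.normal(size=(3, 4)), x)    # hidden layer
A2 = forward_layer(rng.normal(size=(1, 3)), A1)   # output layer
print(A2.shape)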
Now, as discussed above, every application has its own architecture for training the data. For example, sentiment classification uses a many-to-one architecture; machine translation uses a many-to-many architecture, which will be covered in the coming sections. However, the material covered here is only a basic version of the explanation, and in many cases a deep RNN consisting of many layers is not used unless the task is complex, as handling the weight matrix (parameter matrix) would become tough and time-consuming.

10. LONG SHORT-TERM MEMORY (LSTM)

When handling sequences of greater length, the RNN may not effectively capture the impact of a preceding word on a future estimate or prediction of a word, as the parameters are continuously updated at each individual timestep. It would therefore be very beneficial if the system could remember past dependencies over a greater number of timesteps to establish a good prediction of results. In this section, many equations from previous concepts are discussed, starting with the update equation of the RNN:

a<t> = g(Wa[a<t-1>, x<t>] + ba) (7)
y<t> = (Wy * a<t>) + by (8)

In every upcoming equation, the superscript on each symbol denotes the time step of the sequence, and the subscript denotes the value to which the parameter belongs. But the above equation is not efficient at capturing dependencies over a longer period of time.

NOTE: 'g' denotes the activation function, ba denotes the bias of the output, and x<t> denotes the input at timestep 't'. One more representation is

Wa[a<t-1>, x<t>] = Waa * a<t-1> + Wax * x<t> (9)

Waa = weight matrix corresponding to the activation a<t-1>
Wax = weight matrix corresponding to the input feature vector x<t>
* - denotes the dot product.

Now, to tackle the above issue, there is a need to construct an object that handles when to update the parameter to preserve a dependency, when to forget a previous dependency so as to delete unwanted relational dependencies, and a final tweaking parameter for updating the output at each timestep so as not to copy the result directly to the next time step. So, finally, the following equations act as an add-on to the above fundamental equation.

To briefly summarize the above, every gate uses the sigmoid function, as it becomes either 0 or 1 over a large range of values, and we need such a function either to update the previous value or to forget it for the current time step. As far as the output gate is concerned, it decides how much of the current information needs to be passed on, leaving room for the input feature at the next time step to have a significant representation in the prediction. If observed clearly, it can be sensed that the forget gate and the update gate have opposite effects: when the forget gate is nearly 1, the current feature is not considered and has the least impact on the change of the activation value carried on to the next LSTM cell, and if the update gate is nearly 1, then the current value of "C" is updated. A simple depiction of an LSTM cell is given below in Figure 7, explaining the features and mechanisms going on inside it. This idea can be further explored in the research article [9], which clearly explains the effects of different activation functions.

Figure 7: Block diagram of LSTM [48]

The gated recurrent unit cell is also depicted in Figure 8, as it is closely related to the LSTM; in some applications gated recurrent units work better than LSTMs, and in-depth insights into GRUs can be found here [9].
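The gate equations themselves do not appear in the extracted text; as a hedged illustration of the mechanism just described (sigmoid forget, update, and output gates acting on the cell value "C"), here is a minimal numpy sketch of a single LSTM time step. The variable names, dimensions, and random parameters are assumptions for illustration, not the paper's exact notation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    # One LSTM time step: sigmoid forget, update, and output gates decide
    # how the cell state c and the activation a change.
    concat = np.concatenate([a_prev, x_t])                    # [a<t-1>, x<t>]
    f = sigmoid(params["Wf"] @ concat + params["bf"])         # forget gate
    u = sigmoid(params["Wu"] @ concat + params["bu"])         # update gate
    o = sigmoid(params["Wo"] @ concat + params["bo"])         # output gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])   # candidate cell value
    c_t = f * c_prev + u * c_tilde    # keep old memory vs. take in new information
    a_t = o * np.tanh(c_t)            # output gate scales what is passed on
    return a_t, c_t

# Tiny illustrative dimensions: 3 input features, 2 hidden units.
rng = np.random.default_rng(2)
n_x, n_a = 3, 2
params = {k: rng.normal(size=(n_a, n_a + n_x)) for k in ("Wf", "Wu", "Wo", "Wc")}
params.update({b: np.zeros(n_a) for b in ("bf", "bu", "bo", "bc")})
a, c = np.zeros(n_a), np.zeros(n_a)
for x in rng.normal(size=(5, n_x)):   # a short sequence of 5 timesteps
    a, c = lstm_step(x, a, c, params)
print(a, c)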
... better for sequences of greater length and sequences of high interdependency [10]. The basic block diagram showing the functionality of the BRNN is depicted in Figure 9.

Figure 9: Showing a basic BRNN [50].

12. APPLICATIONS

The theory and insights given above are very helpful in the applications discussed below, as they can be seen as the transformation of the NLP field.

12.1 Word Tagging

Whenever a question related to words is searched online, whether technical, business-related, or anything else, suggestions are shown as blocks or in some other format. The word tagging model explains this type of correlation between the word searches and the suggested related articles, and this section highlights the previously discussed sections and their importance.

To start, this model first takes a raw document related to many subjects or topics, which is a common step in many text processing applications, and then it performs the basic and needed text cleaning and text processing operations, as discussed in the previous sections. Then the processed text is converted into a feature matrix in the form of word embeddings, and each individual sentence is given <POS> and <EOS> tags determining the start and the end of the sentence. Now, the similarity scores between the sentences are found following either the CBOW model or the Skip Gram model by using a traditional machine learning library like WORD2VEC, which vectorizes the word embedding by capturing the dependencies of all contexts of every word present in the "VOCABULARY" of the data matrix; alternatively, the above can also be determined by using the latest deep learning models like BERT, which performs the named entity recognition task efficiently. The above methods have been used until now, and the choice of method is determined by the scale of the task.

12.2 Text Summarization

Text summarization is another important application of natural language processing. It requires proper parameters so that the summary of the text is effective with as few errors as possible, because sometimes unimportant things crop up due to less rigorous extraction of the text, which can be corrected by following the opposite steps. The most important and fundamental tool in the extraction of the text is the TF-IDF model, which works on basic word frequency counts across various documents and computes the similarity score between sentences just as we do for words with word embedding vectorizer functions, so as to vectorize not only words but also sentences as a whole. The important part here is to understand how the extraction takes place and, more precisely, how the TF-IDF algorithm works. It follows two steps: first, calculate the frequency of each word individually and represent the words as values; the more frequent a word, the more important it is in the document. However, words such as "and", "the", "an", etc. are used extensively in any type of document, and to reduce their impact there is a second step that penalizes such scores by multiplying them by a factor given by the ratio of the total number of documents to the number of documents in which the word appears, finally applying a base-2 logarithm to the result for controlled penalization. The same is given by the following equations:

Term Frequency = (number of times the word appeared in the text) / (total number of words in the text) (16)

Inverse Document Frequency = log2{(total number of documents) / (number of documents in which the word appeared)} (17)

After obtaining the feature matrix, the individual conditional probabilities are considered following Bayesian inference between the sentences of the given document, and the sentence vectors are given as input for the calculation of similarity scores; the most relevant chunk of sentences with a high relative similarity score is taken as the output. Further insights into the weights used for calculating embeddings, helpful for text summarization, can be found here [2]; one interesting paper covers much more depth here [4], and another good reading is here [3].

12.3 Machine Translation and Speech Recognition

Machine translation is a sequence-to-sequence RNN model; instead of outputting the sequence at each individual time step, the architecture is designed to first memorize the entire input feature vector and then output each output word vector after memorizing the input. This model works on maximizing the Bayesian probability of the output conditional on the input. Before covering the short formula used here, it is important to see the general formula below.

If two events A and B occur with probabilities P(A) and P(B) respectively, then the probability of A occurring given B is P(A|B), which is

P(A|B) = (P(A) / P(B)) * P(B|A) (18)

The above is extended to involve more variables (events) conditional on multiple events, which is the crux of the task:

P(A, B|C) = P(A|C) * P(B|A, C) (19)

Assuming B here to be the output at the timestep after the first timestep's output A, and C to be the input feature as a whole, it is clear that the above can be generalized and used for getting the output at the nth time step as follows:

P(y<1>, y<2>, y<3>, ..., y<n-1>, y<n> | x) = P(y<1>|x) * P(y<2>|x, y<1>) * ... * P(y<n>|x, y<1>, y<2>, ..., y<n-1>) (20)
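The following is a minimal Python sketch of the two-step computation in equations (16) and (17); the toy documents are invented for illustration, and the base-2 logarithm is used as described above.

import math

# Toy corpus, purely illustrative.
documents = [
    "the plan is going to get executed tomorrow".split(),
    "the plan will be executed the day after tomorrow".split(),
    "word embedding turns words into dense vectors".split(),
]

def term_frequency(word, doc):
    # Equation (16): occurrences of the word divided by total words in the text.
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # Equation (17): log2 of (total documents / documents containing the word).
    containing = sum(1 for d in docs if word in d)
    return math.log2(len(docs) / containing) if containing else 0.0

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

# Frequent function words like "the" are penalized; rarer content words score higher.
for w in ("the", "plan", "executed"):
    print(w, round(tf_idf(w, documents[0], documents), 4))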
But the above approach may fail to produce the most sensible translation. Suppose the correct translation for an input is "The plan is going to get executed tomorrow," and suppose another translation whose overall average probability is highest is "The plan will be executed tomorrow." Compared with the first sentence, where a greedy approach is followed at each individual time step, we want the overall probability over multiple options to be highest; for this, an algorithm known as Beam Search is used.

12.4 Beam Search

Instead of always choosing the word with the highest probability, we choose a fixed number of different words representing the top "n" conditional probabilities, arranged in descending order of magnitude [10]. For example:

For y<1>, the top "n" outputs are chosen, which implies that "n" copies of the network are made, in each of which y<1> takes a different value, i.e., a different word is formed as the output. The same can be viewed in Figure 10. Then, for y<2>, we again calculate the conditional probabilities in the same manner.

Figure 10: Depicting beam search with beam width = 5 [20].

12.5 Attention Model

Now, it is important to mention a concept known as the attention model, which comes under machine translation and is also used in speech recognition. This concept pitches in when there is a sequence of greater length: entirely memorizing the input during the first part of the LSTM architecture (encoding) and after that completely outputting the sequence (decoding) becomes tough, and the results would not be appropriate. To tackle this, the model divides the entire sequence into parts and performs beam search with respect to a new parameter known as the 'context', and there is a small change in architecture compared to the previous one, as shown in Figure 11.
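Returning briefly to Section 12.4, here is a hedged sketch of the beam-search idea: instead of a single greedy choice, the top "n" partial sequences by cumulative conditional probability are kept at each step. The toy next-word distribution and the beam width are assumptions for illustration, not the paper's model.

import math

def next_word_probs(prefix):
    # Invented toy distribution; a real model would condition on the prefix
    # and on the input sequence x.
    return {"the": 0.4, "plan": 0.3, "will": 0.2, "<eos>": 0.1}

def beam_search(beam_width=3, max_len=4):
    # Each beam entry is (cumulative log probability, sequence of words so far).
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((logp, seq))   # finished sequence stays as is
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [word]))
        # Keep only the top beam_width candidates by cumulative log probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

for logp, seq in beam_search():
    print(round(math.exp(logp), 4), " ".join(seq))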
Speech recognition is just an application of the above concepts, except that we deal with audio data; there is an algorithm known as the Transformer and many further concepts. There is a good paper on this concept which gives good intuition here [14]. The attention model is also used to adapt to the context and to summarize a series; an excellent paper on video summarization using the model is here [42].

12.6 Quantum Encoding and Decoding

Current research in artificial intelligence is becoming rigorous with respect to quantum computing; there are already many applications of quantum computing, such as the QAOA algorithm [15], which is extensively used in the optimization of algorithms, and recently there is an article [17] depicting a quantum computing implementation of the encoding and decoding part of a neural network architecture; there is also increasing use of Grover's search algorithm [16] in this field. Further understanding of this very much requires a thorough grounding in quantum computing and quantum information theory.

13. SOME MORE CONCEPTS

Some of the important concepts are covered in this paper. A few other mentions are similar to the above concepts or can serve as good reading to better understand and build deep intuition for both application and concept. They are the following:

... which can be meaningful for the purpose; hence, to solve this, there is a book [32] which can be treated as a complete introduction to information retrieval. A survey paper [33] describes the different methods for information retrieval and filtering. An understanding of neural network implementations in information retrieval can be found here [34]. Information retrieval done by the tokenization technique is explained here [44]. Self-organizing maps [36] are often considered a clustering algorithm, and perhaps more a way of identifying different chunks in the given data, which generally suits text processing; an implementation of SOM in information retrieval describing this is in the paper here [35].

Optimization is a very important part of understanding neural networks and many of their applications; it decides the efficiency and accuracy of algorithms and parameters, and thus it is very useful to learn about fundamental optimization algorithms like gradient descent, RMSprop, Adam, AdaBoost, momentum, etc. [40, 41]. Feature extraction is very important in machine learning and deep learning. Feature extraction for text categorization and getting a good understanding of text categorization are two useful aspects in that direction. BERT, which became successful in document classification, is the main pre-trained model used extensively in handling complex tasks, and hence relevant knowledge of it is necessary.