
Comparative study of machine learning and deep learning algorithms along with BERT on the movie review dataset

Souptik Nag
Satyajit Debnath
Agenda
• Introduction
• Previous work analysis
• Our implementation procedure
• Result and conclusion

Sentiment analysis with machine learning classifiers 2
Introduction
In today's world, very large amounts of information are available in online documents. As part of the effort to better organize this information for users, researchers have been actively investigating the problem of automatic text categorization. Applications of sentiment analysis span areas such as social event planning, election campaigning, healthcare monitoring, consumer products and awareness services. This computational power is fueled by the burgeoning of machine learning techniques. In this work we first use the machine learning classifiers Multinomial Naïve Bayes, Bernoulli Naïve Bayes, J48 and Random Forest and observe their accuracy. We then use the GloVe embedding model followed by an LSTM for training. Finally, we use the BERT model for sentiment analysis and compare the accuracy with previous research.

Sentiment analysis 3
Primary goal:
To find the highest-accuracy model through the experiment.
General process:

Sentiment analysis 5
Preprocessing:
Preprocessing consists of several steps, depending on the dataset; essentially, it is used for cleaning the text. Here we build a word-cloud graph that shows the most frequently used words in a large font and the least used words in a small font, separately for positive and negative reviews. In our preprocessing we convert the text to lowercase, then remove URL links, special characters, punctuation and stop words. We also remove the most frequent words. Finally, we apply lemmatization (a minimal sketch of these steps is shown below).
Lemmatization: converts a word to its base form, or lemma, by removing affixes from inflected words. It helps us create better features for machine learning and NLP models, so it is an important preprocessing step.
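A minimal sketch of this preprocessing pipeline, assuming NLTK is available (the stop-word list, the regexes and the choice to drop the 10 most frequent words are illustrative, not the exact setup used in our experiments):

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    text = text.lower()                                  # case conversion
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URL links
    text = re.sub(r"[^a-z\s]", " ", text)                # special characters and punctuation
    tokens = [t for t in text.split() if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

reviews = ["Loved this movie!! See https://example.com",
           "The plot was boring and the acting was worse..."]
cleaned = [clean_review(r) for r in reviews]

# Also drop the most frequent words across the whole corpus (10 is an arbitrary cutoff here).
freq = Counter(tok for doc in cleaned for tok in doc)
too_common = {w for w, _ in freq.most_common(10)}
cleaned = [[t for t in doc if t not in too_common] for doc in cleaned]
print(cleaned)
```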

Sentiment analysis 6
What Are Word Embeddings?
In NLP models we deal with text, which is human-readable and understandable, but a machine does not understand text; it only understands numbers. Word embedding is therefore the technique of converting each word into an equivalent float vector. Various techniques exist depending on the use case of the model and the dataset; some of them are One-Hot Encoding, TF-IDF, Word2Vec and FastText.

Algorithms used here:

1: CountVectorizer

2: GloVe (Global Vectors)

Sentiment analysis 7
Using CountVectorizer to extract features from text:
CountVectorizer is a tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts and wish to convert the words in each text into vectors (for use in further text analysis). Let us consider a few sample texts from a document (each as a list element):
document = [ “One Geek helps Two Geeks”, “Two Geeks help Four Geeks”, “Each Geek helps many other Geeks at
GeeksforGeeks.”]

CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample
from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text
sample. This can be visualized as follows:

              at  each  four  geek  geeks  geeksforgeeks  help  helps  many  one  other  two
document[0]    0     0     0     1      1              0     0      1     0    1      0    1
document[1]    0     0     1     0      2              0     1      0     0    0      0    1
document[2]    1     1     0     1      1              1     0      1     1    0      1    0
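This matrix can be reproduced with scikit-learn's CountVectorizer (a short sketch; the column order follows the fitted vocabulary, which is sorted alphabetically):

```python
from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(document)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # ['at' 'each' 'four' 'geek' 'geeks' ...]
print(X.toarray())                          # one row per text, one column per unique word
```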

Sentiment analysis 8
GloVe — Global Vectors for Word
Representation
“GloVe is a count-based, unsupervised learning model that uses co-occurrence (how frequently two words appear together)
statistics at a Global level to model the vector representations of words.”
GloVe is based on ratios of probabilities from the word-word co-occurrence matrix, combining the intuitions of count-based
models while also capturing the linear structures used by methods like word2vec.
Let us consider an example of how the co-occurrence probability ratios work in GloVe.
P(k | ice) → probability of a context word k given "ice"
P(k | steam) → probability of a context word k given "steam"

Since ice is more similar to solid than steam is, we can see that P(solid | ice) is higher than P(solid | steam). Similarly, both ice and steam are unrelated to fashion, but the standalone probability values P(fashion | ice) and P(fashion | steam) do not, by themselves, let us infer this lack of relationship; the ratio of the two probabilities does.

Sentiment analysis 9
Co-occurrence matrix used in
GloVe

Sentiment analysis 10
GloVe — Global Vectors for Word
Representation
Compared to the raw probabilities, the ratio of probabilities is better able to distinguish
relevant words (solid and gas) from irrelevant words (water and fashion) and it is also better
able to discriminate between the two relevant words.
Semantic Relational Representation of Words using GloVe:

Sentiment analysis 11
Algorithm for word embedding using GloVe
•Preprocess the text data.
•Create the dictionary (vocabulary).
•Traverse the GloVe file of a specific dimension and compare each word with all words in the dictionary.
•If a match occurs, copy the equivalent vector from GloVe into embedding_matrix at the corresponding index (see the sketch after the example).

Example:
The vocabulary is the collection of all unique words present in the training dataset. First the dataset is tokenized into words, then the frequency of each word is counted. The words are then sorted in decreasing order of frequency, so words with high frequency are placed at the beginning of the dictionary.

Dataset = {The peon is ringing the bell}   Vocabulary = {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
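A hedged sketch of this lookup, assuming a pre-trained GloVe text file (the file name glove.6B.100d.txt and the small word_index below are placeholders for illustration):

```python
import numpy as np

EMBEDDING_DIM = 100
GLOVE_PATH = "glove.6B.100d.txt"   # assumed path to the pre-trained GloVe file

# word -> index, built from the vocabulary sorted by decreasing frequency (index 0 is reserved).
word_index = {"the": 1, "peon": 2, "is": 3, "ringing": 4, "bell": 5}

# Load every GloVe vector into a dictionary keyed by word.
embeddings_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# If a vocabulary word has a GloVe vector, copy it into embedding_matrix at that word's index;
# words without a match keep an all-zero row.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```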

Sentiment analysis 12
Work Analysis:
Research papers:
1: Optimization of sentiment analysis using machine learning classifiers
Author:- Jaspreet Singh, Gurvinder Singh and Rajinder Singh

2: Sentiment classification of online political discussions: a comparison of a word-based and dependency-based method
Author:- Hugo Lewi Hammer, Per Erik Solberg and Lilja Øvrelid

3: Thumbs up? Sentiment Classification using Machine Learning Techniques
Author:- Bo Pang, Lillian Lee and Shivakumar Vaithyanathan

4: Sentiment Analysis of Movie Reviews using Machine Learning Techniques
Author:- Palak Baid, Apoorva Gupta and Neelam Chaplot

5: Application of Machine Learning Techniques to Sentiment Analysis
Author:- Anuja P Jain and Padma Dandannavar

Sentiment analysis 13
Work Analysis:
Research papers:
6: Sentiment analysis from product reviews using SentiWordNet as lexical resource
Author:- Alexandra Cernian, Valentin Sgarciu and Bogdan Martin

7: A Review of Natural Language Processing Techniques for Sentiment Analysis using Pre-trained Models
Author:- Leeja Mathew and Bindu V R

8: Comparative Analysis of Different Transformer Based Architectures Used in Sentiment Analysis
Author:- Keval Pipalia, Rahul Bhadja and Madhu Shukla

9: A Survey of Sentiment Analysis Based on Transfer Learning
Author:- Ruijun Liu, Yuqian Shi, Changjiang Ji and Ming Jia

Sentiment analysis 14
1: Optimization of sentiment analysis using machine learning classifiers:

Methods applied:
1: Naïve Bayes used for sentiment classification:
The preprocessing produces word-category pairs for the training set. Consider a word 'y' from the test set (the unlabeled word set) and a window of n words (x1, x2, ..., xn) from a document. The conditional probability of the data point 'y' belonging to the category of the n words from the training set is given by:

P(y | x1, x2, ..., xn) = P(y) · ∏_{i=1}^{n} P(xi | y) / P(x1, x2, ..., xn)    (1)

Consider an example of a movie review for the movie "Exposed".

2: J48 algorithm used for sentiment prediction:
The assignment of labels to the word features of the test set gradually generates two different branches of the decision tree. The J48 algorithm uses an entropy function for testing the classification of terms from the test set, where (Term) can be a unigram, bigram or trigram. In this study unigrams and bigrams are considered.

Sentiment analysis 15
1: Optimization of sentiment analysis using machine learning classifiers:

Methods applied:
3: BFTree algorithm used for sentiment prediction:
BFTree is another classification approach; it outperforms J48, C4.5 and CART by expanding only the best node in depth-first order. The BFTree algorithm mines the training file to locate the best supporting matches of positive and negative terms in the test file, and keeps heuristic information gain to identify the best node by probing all collected word features.

4: OneR algorithm used for sentiment prediction:
OneR is a classification approach that restricts the decision tree to one level, thereby generating a single rule. The rule makes predictions on word-feature terms with a minimal error rate thanks to repeated assessment of word occurrences. The classification of the most frequent terms of a particular sentence is made on the basis of the class of featured terms from the training set. The demonstration of OneR for sentiment prediction with the smallest classification error is given below (a toy sketch follows the steps):
Step 1: Select a featured term (predictor) from the training set.
Step 2: Train a model using steps 3 and 4.
Step 3: For each predictor, and for each value of that predictor, count the frequency of each value of the target term, find the most frequent class, and make a rule assigning that class to that predictor value.
Step 4: Calculate the total error of the rules for each predictor.
Step 5: Choose the predictor with the smallest error.
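A toy sketch of these steps (the binary word features and labels are made up purely for illustration):

```python
from collections import Counter, defaultdict

# Each row: whether the review contains a word, plus its sentiment label.
rows = [
    {"has_good": 1, "has_boring": 0, "label": "pos"},
    {"has_good": 1, "has_boring": 1, "label": "pos"},
    {"has_good": 0, "has_boring": 1, "label": "neg"},
    {"has_good": 0, "has_boring": 0, "label": "neg"},
]
predictors = ["has_good", "has_boring"]

best_predictor, best_rule, best_error = None, None, len(rows) + 1
for feature in predictors:
    # For each value of the predictor, find the most frequent class (steps 2-3).
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row["label"]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Total error of this predictor's rule (step 4).
    error = sum(rule[row[feature]] != row["label"] for row in rows)
    if error < best_error:        # keep the predictor with the smallest error (step 5)
        best_predictor, best_rule, best_error = feature, rule, error

print(best_predictor, best_rule, best_error)   # has_good {1: 'pos', 0: 'neg'} 0
```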
Sentiment analysis 16
1: Optimization of sentiment analysis using machine learning classifiers:

Sentiment analysis 17
3: Thumbs up? Sentiment Classification using Machine Learning Techniques :
Method applied:
1: Maximum Entropy: Maximum entropy classification (MaxEnt, or ME for short) is an alternative technique which has proven effective in a number of natural language processing applications (Berger et al., 1996). Nigam et al. (1999) show that it sometimes, but not always, outperforms Naive Bayes at standard text classification. Its estimate of P(c | d) takes an exponential form in weighted feature functions.

2: Support Vector Machines: Support vector machines (SVMs) have been shown to be highly effective at traditional text categorization, generally outperforming Naive Bayes (Joachims, 1998). The solution can be written as a weighted combination of a small set of training documents, the support vectors.
Sentiment analysis 18
3: Thumbs up? Sentiment Classification using Machine Learning Techniques :
Method applied:
3: Naive Bayes: One approach to text classification is to assign to a given document d the class c* = arg max_c P(c | d). We derive the Naive Bayes (NB) classifier by first observing that, by Bayes' rule,

P(c | d) = P(c) · P(d | c) / P(d),

where P(d) plays no role in selecting c*. To estimate the term P(d | c), Naive Bayes decomposes it by assuming the f_i's (the features of d) are conditionally independent given d's class.

Sentiment analysis 19
3: Thumbs up? Sentiment Classification using Machine Learning Techniques :

Sentiment analysis 20
4: Sentiment Analysis of Movie Reviews using Machine Learning Techniques :
Method used here:
1: Naïve Bayes
2: K-Nearest Neighbours: K-NN is the simplest of all machine learning algorithms. The principle behind this method is to find a predefined number of training samples closest in distance to the new point and predict its label from these. The number of samples can be a user-defined constant or can vary based on the local density of points. The distance can be any metric measure; standard Euclidean distance is the most common choice for calculating the distance between two points.

3: Random Forest: Random Forests are an ensemble learning method for classification and regression. They construct a number of decision trees at training time. To classify a new case, the case is sent to each of the trees; each tree performs a classification and outputs a class. The output class is chosen by majority voting, that is, the class produced by the largest number of trees is taken as the output of the Random Forest (a brief sketch of both classifiers follows).
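A brief scikit-learn sketch of these two classifiers (the tiny bag-of-words matrix is illustrative only):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Rows are reviews, columns are counts of three hypothetical words.
X_train = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]]
y_train = ["pos", "neg", "pos", "neg"]
X_test = [[1, 1, 1]]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)    # label from the 3 nearest samples
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)  # majority vote over trees

print(knn.predict(X_test), forest.predict(X_test))
```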

Sentiment analysis 21
4: Sentiment Analysis of Movie Reviews using Machine Learning Techniques :

Sentiment analysis 22
5: Application of Machine Learning Techniques to Sentiment Analysis :
Methods applied here:
1:Naïve bayes
2: Support vector machine
3: Decision tree

23
6: Attention-Based CNN and Bi-LSTM Model Based on TF-IDF and GloVe Word Embedding for Sentiment Analysis:
Dataset used:                Accuracy table:

24
Sentiment analysis
Summary of the research papers

25
Sentiment analysis
Summary of the research papers

26
Sentiment analysis
The implementation we have
done …….

27
Sentiment analysis
Our implementation process:
Dataset: we use the IMDB dataset with 50k reviews.
Data Preprocessing : The steps that are carried out in preprocessing of data are as follows –

a) Case Conversion: All words are converted either into lower case or upper case in order to
remove the difference between “Text” and “text” for further processing.

b) Stop-words Removal: Commonly used words like a, an, the, has, have, etc., which carry no meaning, i.e. do not help in determining the sentiment of the text, are removed from the input text.

c) Punctuation Removal: Punctuation marks such as comma or colon often carry no meaning for
the textual analysis hence they can be removed from input text.

d) Lemmatization: Deals with removal of inflectional endings only and to return the base or
dictionary form of a word, which is known as the lemma.

e) Stemming: Stemming usually refers to a simple process that chops off the ends of words
to remove derivational affixes.

f) White space Removal: Deals with removal of extra white space for better-cleaned words.

g) Frequent-word Removal: The most frequent words in the dataset are also removed.
28
Our implementation process:
Word embedding algorithms used here are CountVectorizer and GloVe.

CountVectorizer → Multinomial Naïve Bayes, Complement NB, Bernoulli NB, J48, Random Forest
GloVe → LSTM (Long Short-Term Memory)

A minimal sketch of the CountVectorizer branch follows.
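The sketch below assumes scikit-learn; the two training reviews are placeholders for the cleaned IMDB data, not actual samples from it:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB   # ComplementNB / BernoulliNB can be swapped in
from sklearn.pipeline import make_pipeline

train_texts = ["great movie loved every minute", "terrible plot complete waste of time"]
train_labels = ["positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["loved the plot"]))   # -> ['positive'] with these toy examples
```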

29
LSTM
LSTM stands for Long Short-Term Memory. An LSTM is a type of recurrent neural network, but it is better than traditional recurrent neural networks in terms of memory. Having a good hold over memorizing certain patterns, LSTMs perform fairly better. As with every other neural network, an LSTM can have multiple hidden layers, and as data passes through every layer, the relevant information is kept and all the irrelevant information gets discarded in every single cell. How does it do the keeping and discarding? This is shown in the following structure. An LSTM has 3 main gates:

1. FORGET gate
2. INPUT gate
3. OUTPUT gate

Sentiment analysis 30
Why LSTM?
Traditional neural networks suffer from short-term memory. Another big drawback is the vanishing gradient problem (during backpropagation the gradient becomes so small that it tends to 0, and such a neuron is of no use in further processing). LSTMs efficiently improve performance by memorizing the relevant information and finding the patterns.

In an LSTM we can use a multi-word string to find out the class to which it belongs. This is very helpful when working with natural language processing. If we use appropriate embedding and encoding layers in an LSTM, the model will be able to find the actual meaning of the input string and give the most accurate output class. The following sketch elaborates the idea of how text classification can be done using an LSTM.
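A minimal sketch of such a model in Keras, reusing the GloVe embedding_matrix idea from earlier (the vocabulary size, sequence length and layer sizes below are illustrative assumptions, not our exact configuration):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

VOCAB_SIZE = 20000       # number of rows in the embedding matrix
EMBEDDING_DIM = 100
MAX_LEN = 200            # reviews padded / truncated to this many tokens

embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))  # placeholder; use the GloVe matrix built earlier

model = Sequential([
    Embedding(VOCAB_SIZE, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),     # frozen GloVe word embeddings
    LSTM(128),                                            # forget / input / output gates inside each cell
    Dropout(0.3),
    Dense(1, activation="sigmoid"),                       # positive vs. negative review
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
# model.fit(X_train_padded, y_train, validation_split=0.2, epochs=12, batch_size=64)
```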

31
Sentiment analysis
BERT (Bidirectional Encoder Representations from Transformers):
BERT is a pre-trained NLP model developed by Google AI Language researchers [16] in October 2018. It has caused a hype in the community by showing state-of-the-art (SOTA) performance on a comprehensive range of NLP tasks, for instance Question Answering (SQuAD), sentiment analysis, language translation and many more. BERT is a Transformer that uses an attention mechanism to learn the correlations between sequential words.

32
Sentiment analysis
BERT (Bidirectional Encoder Representations from Transformers):
A Transformer involves two distinct components: an encoder that processes the data given as input and a decoder that produces an output for the task. As BERT's aim is to create a language model, only the encoder mechanism is needed. Compared to directional models, which take the input and interpret it sequentially, the Transformer takes the entire string of words at once. This particular feature allows it to determine the meaning of a word based on all of its surroundings (a brief sketch follows).
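One way to run a BERT-family model on the reviews is through the Hugging Face transformers library; the sketch below uses an off-the-shelf checkpoint fine-tuned for sentiment as an illustrative stand-in, not the exact model we trained:

```python
from transformers import pipeline

# A distilled BERT variant fine-tuned on SST-2; note the 512-token limit discussed in the conclusion.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

reviews = ["An absolute masterpiece with brilliant performances.",
           "Two hours of my life I will never get back."]
print(classifier(reviews, truncation=True))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}, {'label': 'NEGATIVE', 'score': 0.99...}]
```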

33
Sentiment analysis
Accuracy:
J48 model accuracy = 73.03%

              Precision  Recall  F1-score  Support
Negative        0.73      0.73     0.73      4961
Positive        0.74      0.73     0.73      5039
Accuracy                           0.73     10000
Macro avg       0.73      0.73     0.73     10000
Weighted avg    0.73      0.73     0.73     10000

Random forest model accuracy = 85.24%

              Precision  Recall  F1-score  Support
Negative        0.85      0.85     0.85      4961
Positive        0.85      0.85     0.85      5039
Accuracy                           0.85     10000
Macro avg       0.85      0.85     0.85     10000
Weighted avg    0.85      0.85     0.85     10000

34
Accuracy:
Multinomial Naïve Bayes model accuracy = 85.60%

              Precision  Recall  F1-score  Support
Negative        0.84      0.87     0.86      4977
Positive        0.87      0.84     0.85      5023
Accuracy                           0.86     10000
Macro avg       0.86      0.86     0.86     10000
Weighted avg    0.86      0.86     0.86     10000

Complement Naïve Bayes model accuracy = 85.60%

              Precision  Recall  F1-score  Support
Negative        0.84      0.87     0.86      4977
Positive        0.87      0.84     0.85      5023
Accuracy                           0.86     10000
Macro avg       0.86      0.86     0.86     10000
Weighted avg    0.86      0.86     0.86     10000

35
Accuracy :
GloVe + LSTM model accuracy = 84.11%

36
Accuracy :
BERT model accuracy = 82.59%

37
Comparison

Paper: Optimization of sentiment analysis using machine learning classifiers
Dataset: 7465 digital camera reviews of Sony
Their method + accuracy: Naïve Bayes – 45%; J48 – 96%
Our method + accuracy: Multinomial Naïve Bayes – 85.60%; Complement Naïve Bayes – 85.60%; J48 – 73.03%

Paper: Thumbs up? Sentiment Classification using Machine Learning Techniques
Dataset: Movie review dataset with 20k reviews
Their method + accuracy: Naïve Bayes – 78%
Our method + accuracy: Multinomial Naïve Bayes – 85.60%; Complement Naïve Bayes – 85.60%

Paper: Sentiment Analysis of Movie Reviews using Machine Learning Techniques
Dataset: 2000 user-created movie reviews archived on IMDb
Their method + accuracy: Naïve Bayes – 81.4%; Random Forest – 78.65%
Our method + accuracy: Multinomial Naïve Bayes – 85.60%; Complement Naïve Bayes – 85.60%; Random Forest – 85.24%

38
Sentiment analysis
Comparison

Paper: Application of Machine Learning Techniques to Sentiment Analysis
Dataset: Datasets with 200 tweets, 2000 tweets and 4000 tweets
Their method + accuracy: Multinomial Naïve Bayes – 84.577%
Our method + accuracy: Multinomial Naïve Bayes – 85.60%

Paper: A Review of Natural Language Processing Techniques for Sentiment Analysis using Pre-trained Models
Dataset: IMDB movie review dataset with 25k samples
Their method + accuracy: BERT – 94%
Our method + accuracy: BERT – 82.59%

39
Sentiment analysis
Conclusion:
Our first four models are machine learning models that perform well on the IMDB dataset with 50k reviews, with accuracies of 73.03%, 85.24%, 85.60% and 85.60% for J48, Random Forest, Multinomial Naïve Bayes and Complement Naïve Bayes respectively; Multinomial Naïve Bayes is the better of these. We have used six types of pre-processing, so the data fed to the embeddings is well prepared. Word embedding using GloVe is computationally efficient, and the subsequent LSTM model gives consistent results in our experiment, with an accuracy of 84.11% after 12 epochs. We then experiment with the most recently developed model, BERT, and this time the accuracy is 82.59%, which is not as good as the LSTM because of some limitations. The limitation of the BERT model is that it takes a maximum sequence length of 512, which means only the first 512 tokens of each review are used; handling full reviews requires several applications of the BERT model, which increases training and classification time. In future we will enhance the model to overcome this limitation by splitting long review texts into smaller subtexts and applying the BERT model to each of them; the outputs for the subtexts will then be combined to classify the review. In this case the time requirement should also be measured to justify applying the model in a real-time context.

40
Sentiment analysis
Thank you
