Sentiment Analysis of IMDb Movie Reviews A Comparative Study On Performance of Hyperparameter-Tuned Classification Algorithms
Abstract-Sentiment Analysis (SA) is a sub-domain of Natural Language Processing in which useful insights on the sentiment and opinion of people can be obtained and analyzed from textual data in structured, unstructured and semi-structured formats. In this work, I have analyzed the sentiment of viewers from the IMDb Movie Reviews dataset. For this, I have taken three different supervised learning methods, namely Linear Support Vector Machine, Logistic Regression and Multinomial Naive Bayes Classifier, each with different settings of hyperparameters. Moreover, to capture the notion of informal jargon, the N-gram approach has been followed. Furthermore, a comparative study has been performed to find the best model from each of the above-mentioned supervised learning techniques based on their Accuracy Score, F1-Score and AUC Score. In this approach, I have obtained a best accuracy score of around 0.910 and a mean F1-score after 10-fold cross-validation of around 0.894.

Index Terms-Text mining, Information retrieval, Sentiment analysis, Binary Classification, Supervised learning, IMDb dataset, AUC score.

I. INTRODUCTION

In the present era of social media and online content distribution, a huge amount of data is generated on a daily basis by the users of such platforms, and thus the amount of data is growing exponentially day by day. One of the major parts of such data is the reviews given by the users on various things such as their purchases on e-commerce platforms, the shows they watch, the news they read and so on. Such opinions expressed by the people help the stakeholders to understand their sentiment. From a business point of view, the reviews given by a group of consumers on a particular product help to gain insights regarding customers' needs or requirements, which help a business to grow and improve.

As the reviews or opinions come from various sources in unstructured or semi-structured format, it becomes quite difficult for a human being to bring them into a structured format, analyse them and find valuable insights from them. Hence the need for machines which have the capability to accomplish the aforementioned objectives in a short span of time.

The ability of a machine to learn and process things at a higher rate gives rise to a sub-domain of text mining called Sentiment Analysis or Opinion Mining. Text processing and sentiment analysis is a challenging task as it includes processing of streams of textual data in unstructured or semi-structured format depending upon the source. Pre-processing of texts includes several computational steps such as removing punctuation, stemming and/or lemmatization, removal of stopwords and building a vocabulary with the filtered tokens. It is an important step as it eventually filters out various important tag words. One can easily understand the implication of a review by looking at the tokens. Thus, wrong or bad pre-processing may sometimes lead to wrong conclusions, which is not at all intended. After this, the analysis can be done by the machines by capturing the tag words that may actually describe the meaning of a sentence.

The next step after pre-processing involves building the classifier which can use the knowledge base and extract meaningful insights. There are several algorithms available, such as Linear Support Vector Machines, Logistic Regression, K-Nearest Neighbours, Random Forest and so on. There are two types of learners, namely Eager Learners and Lazy Learners. Eager learners build a model from the existing knowledge base, i.e. the training data, and use the model to predict the unknown data which is given to it. On the other hand, Lazy learners are those which store the training data and perform classification at the inference stage when unknown data is available.

In this work, I have used the IMDb Movie Reviews dataset to perform the task of sentiment analysis using three different supervised learning algorithms, namely Linear Support Vector Machine, Logistic Regression and Multinomial Naive Bayes Classifier. Various combinations of the hyperparameters have been taken and fine-tuned so as to find the best performing model on this dataset. Furthermore, the performance of the models has been evaluated using three different evaluation metrics, namely Accuracy Score, F1-Score and AUC Score.

In Section II, I have given a brief survey of previous works done in this area. Section III describes the methodology of the experiments carried out in this work. In the second last Section IV, experimental results have been shown and explained with proper diagrams and plots.

II. PREVIOUS WORKS

The authors of [1] have applied four different classification algorithms on the IMDb movie reviews dataset and evaluated their performance using Accuracy Score, F1 Score and AUC Score. The authors have not mentioned
any particular settings of the hyperparameters for the classification algorithms.

The authors of [6] have used 3 different classification algorithms for performing sentiment analysis. The drawback is that they have not considered different hyperparameters and language models. Moreover, the authors have only used Term Frequency for quantifying the reviews. They have obtained accuracies of 85.69%, 86.23% and 83.54% for Linear SVM, Logistic Regressor and Multinomial Naive Bayes classifiers respectively.

The authors of [3] have examined the sentiment from the IMDb Movie Reviews dataset to find the polarity of the movie reviews on a scale of 0 (highly disliked) to 4 (highly liked). Then they have performed feature extraction, followed by training a multilabel classifier to classify the reviews into their correct labels. They have obtained an accuracy of around 88.95%.

The authors of [4] have applied sentiment analysis on the IMDb Movie Reviews Dataset, in which they have applied various steps of text processing and feature selection, followed by classifying the reviews as positive or negative. Furthermore, they have evaluated the model with eight different classifiers using five different evaluation metrics.

III. METHODOLOGY

The steps which have been followed are described in this section.

A. The Dataset

I have chosen the IMDb Movie Reviews Dataset from Kaggle. This dataset is originally derived from the Large Movie Review Dataset, which consists of around 50,000 highly polar movie reviews, among which 25,000 are given as Training data and the remaining 25,000 are for Test. The dataset in CSV format I have used here is made from the original dataset. It has two columns, namely review and sentiment. The review column consists of all 50,000 reviews from the Training and Test sets. The sentiment column consists of the corresponding sentiment as either positive or negative. The entire dataset has been divided into 3 subsets (Train, Validation and Test) in a ratio of 70:15:15 by randomly picking instances.

B. Preprocessing

After loading the dataset, it has to be pre-processed in a proper step-by-step manner. The pre-processing steps that I have followed are described below,

1) Removal of HTML and CSS tags: The reviews contain certain HTML tags such as <br> ... </br> and CSS stylings, which need to be removed during pre-processing. For this purpose I have used the built-in HTML parser of the BeautifulSoup library.

2) Casefolding: Casefolding is done in order to normalize tokens so that the same words written in different cases, or with the first letter capitalized, can be treated as the same token.

3) Removal of Punctuations: The punctuations have been removed from each of the reviews.

4) Removal of Stopwords: Each review is split into tokens and, from the generated token streams, stopwords are removed. For this step, the English language stopwords set from nltk.corpora has been used.

5) Stemming: Stemming is a heuristic process of chopping off the tail of each word in order to achieve a base form of the token. In this step, stemming has been performed on the remaining tokens using Porter's Stemming Algorithm. In this work, I have used the PorterStemmer() method of nltk.

An alternative to stemming is the process called Lemmatization, which involves converting the tokens into their actual dictionary form. It is a computationally expensive process as compared to Stemming, and using it to pre-process a large dataset will take much more time.

C. Bag-of-Words & TF-IDF Vectorization

In a text dataset, tokens are the features of a particular document. Unique tokens are extracted from the entire dataset or corpus and a dictionary is created. Relevant information such as Count and/or TF-IDF weightage etc. is stored corresponding to each of the features and a Bag-of-Words representation is created. As machines cannot understand the tokens in text format, these numerical valued features are fed to the learner so as to make the text analyzable by the machine.

Here, I have used the TF-IDF scoring of the tokens to map them into numerical values. TF-IDF stands for Term Frequency - Inverse Document Frequency. Term Frequency is defined as the total number of times a token appears in a document. As the number of words in a document may vary greatly, Document Length Normalization has been performed while finding TF. The expression for finding TF is as follows,

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{d.\mathrm{length}}$$

where $f_{t,d}$ is the number of times term $t$ occurs in document $d$; $d.\mathrm{length}$ denotes the total number of words in document $d$.

Inverse Document Frequency of a term is defined as the logarithmically scaled ratio of the corpus size to the number of documents which contain that term. It captures how much information a particular term may contain. Moreover, the logarithm rewards the terms appearing a smaller number of times and penalizes the influence of terms occurring a higher number of times. The expression for finding IDF is as follows,

$$\mathrm{idf}(t, D) = \log \frac{N}{\mathrm{df}(t)}$$

where $\mathrm{df}(t) = |\{d \in D : t \in d\}|$ is the Document Frequency of $t$; $D$ is the corpus; $N = |D|$ is the size of the corpus. It may also happen that a term is not present in the corpus, therefore making the denominator $\mathrm{df}(t) = 0$. To avoid such a problem, we add 1, hence smoothing the IDF score.

$$\mathrm{idf}(t, D) = \log \frac{N}{1 + \mathrm{df}(t)}$$

Therefore, the TF-IDF score is calculated by multiplying the TF and IDF of a term obtained by following the above steps.

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
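The three expressions above can be sketched in a few lines of plain Python. This is only an illustrative sketch and not necessarily the implementation used in this work: the toy corpus and the function names `tf`, `idf` and `tfidf` are assumptions introduced here, and the documents are assumed to be already pre-processed into token lists.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Length-normalized term frequency: f_{t,d} / d.length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Smoothed inverse document frequency: log(N / (1 + df(t)))."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (1 + df))


def tfidf(term, doc_tokens, corpus):
    """TF-IDF score as the product of the two quantities above."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus of already pre-processed (tokenized, stemmed) reviews.
corpus = [
    ["great", "movi", "great", "act"],
    ["bad", "movi"],
    ["great", "plot"],
    ["bore", "plot", "bad", "act"],
]

score = tfidf("great", corpus[0], corpus)
unseen = idf("unseen", corpus)  # df = 0, but the +1 keeps the ratio defined
```

Note that with this form of smoothing a term occurring in every document receives a slightly negative IDF, since the ratio N / (1 + N) is below 1.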
Authorized licensed use limited to: Rajshahi University Of Engineering and Technology. Downloaded on January 28,2024 at 11:41:06 UTC from IEEE Xplore. Restrictions apply.
2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS)
But in this approach also, tf(t, d) of term t gets rewarded when it appears a significantly higher number of times. So, a common modification is to take the logarithm of the term frequency, as shown,

$$\mathrm{tfidf}(t, d, D) = (1 + \log(\mathrm{tf}(t, d))) \times \mathrm{idf}(t, D)$$

All the reviews of the dataset are pre-processed through the steps explained above. Text pre-processing is a lengthy and time-consuming process. For this reason, to ease the implementation, I have saved a copy of the pre-processed dataset in a separate CSV file and used it in the further steps.

D. Model Selection and Hyperparameter tuning

As the problem being addressed is a Binary Classification Problem, I have chosen the Linear Support Vector Machine, which uses a linear hyperplane to separate the two classes. The other classification methods I have chosen are Logistic Regression and Multinomial Naive Bayes. Linear SVM has been widely regarded as one of the best classification algorithms for classifying text data [5].

1) Setting up different hyperparameters: At first, different hyperparameter values for each of the algorithms have been taken. In case of Linear SVM and Logistic Regression, the following hyperparameter settings have been taken,
• C: It is a regularization parameter which is inversely proportional to the regularization strength. Lower values of C allow the classifier to misclassify more points. The following values of C have been taken while finding the best model: 0.001, 0.01, 0.1, 1, 10, 100.
• Class Weight: In case of an imbalanced training dataset, we can provide more weightage to the class

At the end, the model is saved into secondary memory. For each of the candidate supervised techniques, the best model is found and saved into secondary storage.

3) Loading and Prediction: At the inference stage, the saved models are listed and, by parsing the filenames, the particular type of classifier is loaded into main memory. The model predicts the unseen Test data and gives its prediction. Then, based on the predictions for the Test data points, the Accuracy Score, F1 Score and AUC Score are computed for the Test dataset.

The steps have been depicted in Fig. 1.
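The search over hyperparameter settings, followed by saving and reloading the best model, can be sketched as below. This is a minimal sketch under stated assumptions rather than the code used in this work: the C values are the ones listed above, but the n-gram ranges, the file name `linear_svm_best.pkl`, and the function `toy_score` (a stand-in that trains nothing and returns a meaningless number) are hypothetical and only make the example runnable.

```python
import itertools
import math
import os
import pickle
import tempfile

C_VALUES = [0.001, 0.01, 0.1, 1, 10, 100]        # values listed in the text
NGRAM_RANGES = [(1, 1), (1, 2), (1, 3), (1, 4)]  # assumed for illustration


def grid_search(train_and_score, c_values, ngram_ranges):
    """Try every (C, n-gram range) pair and keep the best validation score."""
    best = None
    history = []  # one record per setting, usable later for plotting
    for c, ngram_range in itertools.product(c_values, ngram_ranges):
        score = train_and_score(c, ngram_range)
        history.append((c, ngram_range, score))
        if best is None or score > best["score"]:
            best = {"C": c, "ngram_range": ngram_range, "score": score}
    return best, history


def toy_score(c, ngram_range):
    """Hypothetical stand-in for 'train a model, return its validation score'.

    The returned numbers are arbitrary and are NOT experimental results.
    """
    return 1.0 / (1.0 + abs(math.log10(c))) + 0.01 * ngram_range[1]


best, history = grid_search(toy_score, C_VALUES, NGRAM_RANGES)

# Save the best setting, then list the directory and recover the classifier
# type from the file name before loading, as described in the text.
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "linear_svm_best.pkl"), "wb") as fh:
        pickle.dump(best, fh)
    for filename in os.listdir(model_dir):
        classifier_type = filename.rsplit("_best.pkl", 1)[0]
        with open(os.path.join(model_dir, filename), "rb") as fh:
            restored = pickle.load(fh)
```

In a real run, `toy_score` would be replaced by a function that fits the chosen classifier on the Train set and evaluates it on the Validation set, and the pickled object would be the fitted model itself rather than a dictionary of settings.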
among all the data points with that label present in the dataset (i.e. fraction of Retrieved among the Relevant Data).

$$TPR = \frac{TruePositive}{TruePositive + FalseNegative}$$

$$FPR = \frac{FalsePositive}{FalsePositive + TrueNegative}$$

The higher the AUC score, the better the classifier can distinguish between the two classes.

During the training, at each iteration a record of the accuracy score and F1 score along with the C and n-gram range values has been kept and plotted for each model in a 2D plot.

Fig. 2. C-Accuracy Score plot of Linear SVM with different n-gram range

[Figure: ROC Curve of Best LinearSVM Classifier; x-axis: False Positive Rate; legend: ROC curve (area = 0.97)]
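The TPR and FPR expressions, together with the accuracy, F1 and AUC scores used for evaluation, can be computed from labels and classifier scores as sketched below. This is an illustrative sketch only: the labels, scores and the 0.5 threshold are hypothetical, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 1 / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def true_positive_rate(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def false_positive_rate(fp, tn):
    return fp / (fp + tn) if fp + tn else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive review) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
acc = accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)
auc = auc_score(y_true, scores)
```

Unlike accuracy and F1, the AUC does not depend on the chosen threshold, because it is computed from the ranking of the scores alone.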
A. Linear SVM

The C vs Accuracy Score plot is shown in Fig. 2 and the C vs F1 Score plot for the same is shown in Fig. 3. Clearly, for 1-gram, the accuracy and F1 score increase at first but reduce drastically for higher values of C. Moreover, 1-gram never performed better than the other n-gram ranges. Among all, the best performing n-gram range is (1, 2) when the values of C are higher.

The Test Accuracy score and F1-score of the best model are 0.910 and 0.894 respectively. The AUC score for the best performing Linear SVM model is 0.97 [Fig. 4].

B. Logistic Regression

The C vs Accuracy Score plot is shown in Fig. 5 and the C vs F1 Score plot for the same is shown in Fig. 6. Clearly, for 1-gram, the F1 score was higher than for the other n-gram ranges at first but it reduced for higher values of C. On the other hand, the Accuracy can be seen varying somewhat similarly to what has been observed in the case of Linear SVM. Among all, the best performing n-gram range is (1, 2) when the value of C is higher.

The Test Accuracy score and F1-score of the best model
are 0.907 and 0.893 respectively. The AUC score for the best performing Logistic Regression model is 0.97 [Fig. 7].

Fig. 5. C-Accuracy Score plot of Logistic Regression with different n-gram range

Fig. 6. C-F1 Score plot of Logistic Regression with different n-gram range

[Figure: ROC Curve of Best Logistic Regressor]

Multinomial Naive Bayes is giving very poor performance as compared to other n-gram language models considered here. The Accuracy score as well as F1 Scores (for different α values) of Trigram and 4-gram language models are close to each other and decrease with higher values of α. However, in this experiment, the 4-gram model has yielded the highest Accuracy score for α = 0.1 and the highest F1 Score for α = 1.

The Test Accuracy score and F1-score of the best model are 0.896 and 0.872 respectively. The AUC score of the

Fig. 8. α value - Accuracy Score plot of Multinomial NB with different n-gram range

[Figure: MNB Performance for different ngram and alpha; x-axis: No. of Alpha; series: 1-gram, 2-gram, 3-gram, 4-gram]
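The n-gram ranges compared in these results can be illustrated with a short sketch. It assumes that a range such as (1, 2) means "all unigrams together with all bigrams", which is the convention used by common vectorizers. The function names and the sample tokens are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, ngram_range):
    """Features for an inclusive (min_n, max_n) n-gram range."""
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


# Hypothetical pre-processed review.
tokens = ["not", "good", "movi"]
unigram_features = ngram_features(tokens, (1, 1))
uni_bigram_features = ngram_features(tokens, (1, 2))
```

With the range (1, 2), the bigram "not good" becomes a feature of its own, so a negation is no longer split into two unrelated unigrams. This is one plausible reason why ranges that include bigrams can help on review text.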