
Analyzing The Performance of Sentiment Analysis Using BERT DistilBERT and RoBERTa

This document analyzes the performance of three transformer-based deep learning models (BERT, DistilBERT, RoBERTa) for sentiment analysis using two datasets (Sentiment 140 tweets and Coronavirus tweets). It finds that the BERT model outperforms the other two models based on accuracy.


Analyzing the Performance of Sentiment Analysis using BERT, DistilBERT, and RoBERTa

Archa Joshy
PG Scholar
Centre for Artificial Intelligence
TKM College of Engineering, Kollam, Kerala
[email protected]

Sumod Sundar
Asst. Professor
Centre for Artificial Intelligence
TKM College of Engineering, Kollam, Kerala
[email protected]

2022 IEEE International Power and Renewable Energy Conference (IPRECON) | 978-1-6654-9175-4/22/$31.00 ©2022 IEEE | DOI: 10.1109/IPRECON55716.2022.10059542

Abstract—Sentiment analysis, or opinion mining, is a natural language processing (NLP) technique to identify, extract, and quantify the emotional tone behind a body of text. It helps to capture public opinion and user interests on various topics based on comments on social events, product reviews, film reviews, etc. Linear Regression, Support Vector Machines, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), LSTM (Long Short-Term Memory), and other machine learning and deep learning algorithms can be used to analyze the sentiment behind a text. This work analyses the sentiments behind movie reviews and tweets using the Coronavirus tweets NLP dataset and the Sentiment140 dataset. Three advanced transformer-based deep learning models, BERT, DistilBERT, and RoBERTa, are used to perform the sentiment analysis. Finally, the performance obtained using these models on the two datasets is compared using accuracy as the performance evaluation metric. On analyzing the performance, it can be seen that the BERT model outperforms the other two models.
Index Terms—Sentiment Analysis, Natural Language Processing, BERT, DistilBERT, RoBERTa

I. INTRODUCTION
As a result of the rapid development of information technology, social media platforms such as Twitter, Instagram, and Facebook have emerged to play an important role in modern life. Online platforms have replaced traditional sources of textual data over the past few years. Extracting and utilizing useful information from user-generated data is crucial for various organizations and governments [1]. One of the biggest challenges in performing efficient data analysis is extracting sentiment from textual data to determine the author’s attitude or opinion. Sentiment analysis, also called opinion mining, is the technique of identifying, obtaining, and categorizing subjective data from unstructured text using various text analysis and linguistic techniques [2].
Although statistical machine learning algorithms such as linear regression and Support Vector Machines perform well in simpler sentiment analysis applications, they cannot be generalized to more complex sentiment analysis problems. Deep learning models like CNN, RNN, and LSTM, on the other hand, produce significant results in sentiment analysis [2]. BERT (Bidirectional Encoder Representations from Transformers) has become an emerging technology for various NLP tasks like text classification, sentiment analysis, etc. This work aims to experimentally evaluate BERT for sentiment analysis on the Sentiment140 and Coronavirus tweets NLP datasets and compares the results with the DistilBERT and RoBERTa models.

II. RELATED WORKS
Researchers used Information Gain and K-Nearest Neighbour (KNN) to conduct research on sentiment analysis of movie reviews in 2020 [3]. In order to improve the performance of the system, feature selection was made with Information Gain. They made use of the Polarity v2.0 dataset from the Cornell movie review dataset, and on comparing with Naïve Bayes, SVM, and Random Forest, the KNN model with information gain achieved the best performance with 96.8% accuracy. Although KNN gave a comparatively good accuracy, the results can be further improved by automating the process of selecting the optimal threshold.
Ghorbani et al. [4] developed a ConvLSTMConv hybrid model in 2020, which used the architecture of the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network to determine the polarity of words on the Google cloud. It made use of the Movie Reviews (MR) dataset, which is a collection of negative and positive movie reviews with one sentence in each. A CNN is used to extract the features, the contextual information is learned by a BiLSTM, and, before the final dense layer, the results are passed through a CNN again to provide an abstract feature. This model achieved a result of 89.02% accuracy. Since two CNNs were utilized for the model, the complexity was higher compared to other classifiers.
Jnoub et al. [5], in the year 2020, introduced a neural model-based domain-independent classification model for sentiment analysis. In this work, the sentiment classification task was performed using a CNN and an SNN (Shallow Neural Network). The model was trained on the IMDb dataset and then tested on three different datasets: the IMDb dataset, the Movie Reviews (MR) dataset, and a custom dataset collected from Amazon reviews that rate users’ opinions about Apple products. The model can automatically and adaptively extract spatial hierarchies of features from written reviews that may capture different writing styles of users. SNN outperformed CNN in generalization performance using the MR dataset, with an accuracy of 0.82. These models



Authorized licensed use limited to: Rajshahi University Of Engineering and Technology. Downloaded on November 21,2023 at 08:04:55 UTC from IEEE Xplore. Restrictions apply.
were able to classify domain-independent binary reviews with greater accuracy.
Priyadarshini et al. [6] developed a novel LSTM-CNN-grid search-based deep neural network model for sentiment analysis in 2021. The model made use of two Kaggle datasets: Amazon reviews for sentiment analysis and the IMDB Dataset of 50K Movie Reviews. On multiple datasets, baseline algorithms such as convolutional neural networks, K-nearest neighbor, LSTM, neural networks, LSTM-CNN, and CNN-LSTM were evaluated using accuracy, sensitivity, precision, specificity, and F1-score. The use of hyperparameter optimization via grid search increased the model’s efficiency, allowing it to outperform other baseline models with an overall accuracy greater than 96%. The model misclassified some words due to grammatical variations, misspellings, slang, and complicated structures in the language.
Kausar et al. [7], in 2019, proposed a technique for categorizing online product reviews based on sentiment polarity. The dataset used for sentiment polarity categorization was crawled using a Python crawler. The dataset crawler retrieved reviews for two products: office products such as Microsoft Word, Microsoft PowerPoint, Microsoft Excel, and Microsoft Access Database, and musical DVDs with two main albums consisting of pop and slow tracks. Various approaches like Naive Bayes, Decision Tree, Random Forest, SVM, Gradient Boosting, and sequence-to-sequence RNN were employed for the task. Three polarity features, Verb, Adverb, and Adjective, and their combinations with their various senses, were also considered in review-level categorization. The F1-score of SDT, RF, SVM, and gradient boosting classifier with Boosting was found to be 0.96. However, this model has a drawback when processing different styles of textual information.
A Hybrid CNN-LSTM Model was developed in 2019 by Rehman et al. for improving the accuracy of the Movie Reviews Sentiment Analysis models [8]. In the Hybrid CNN-LSTM model, the CNN helped to extract features from the data, the LSTM model captured long-term dependencies between word sequences, and the dropout technique was also used, which improved the execution time. This model outperformed single CNN and LSTM models on two benchmark movie review datasets, IMDB and Amazon, with an accuracy of 91% when compared to traditional machine learning and deep learning models.

III. METHODOLOGY
A. Techniques Used
Bidirectional Encoder Representations from Transformers (BERT) is an NLP model developed by Google Research. It refers to a trained Transformer Encoder stack. During the pre-training phase, it learns information from both the left and right sides of a token’s context. BERT is initially trained on a large corpus of data. The two steps in pre-training are MLM (Masked Language Modeling) and NSP (Next Sentence Prediction). These pre-trained general-purpose models can then be fine-tuned on smaller task-specific datasets [9]. The architecture of the BERT model is shown in figure 1. The tokenized input is given to the encoder stack that creates the word embeddings using the attention mechanism. Finally, a feed-forward neural network classifier with an activation function (softmax) is used to classify the word vectors into the corresponding output classes.

Fig. 1. Architecture of BERT model

DistilBERT, which represents the functionality of BERT in a smaller, faster, and lighter form, is created by a technique called distillation. This model has 60% of the size of BERT while retaining 97% of the language understanding capacities and obtaining a greater speed [10]. RoBERTa (Robustly Optimized BERT Pre-training Approach) has an almost identical architecture to BERT. The key differences are that BERT employs static masking, whereas RoBERTa employs dynamic masking, and that RoBERTa is trained only with the MLM task and is trained on more data than BERT for a longer amount of time [11].

B. Datasets Used
The following datasets were used for the experimentation.
1) Sentiment140 dataset: The Sentiment140 dataset contains 1,600,000 tweets, which were extracted with the help of the Twitter API. The tweets have been labelled into two classes: “4” denoting positive tweets and “0” denoting negative tweets. Out of 1.6 million tweets, 0.8 million represent positive tweets, and 0.8 million represent negative tweets. The dataset contains six columns in total, namely “target”, “id”, “date”, “flag”, “user”, and “text”, out of which the “target” and “text” attributes are utilized here. A sample of the dataset is shown in figure 2, and the distribution of the class labels in the training and testing dataset is given in figure 3. The testing dataset is used for validating the model.
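
As a minimal sketch of the label handling described for the Sentiment140 dataset, the snippet below reads rows in the six-column layout, keeps only the “target” and “text” fields, and maps “4” to 1 and “0” to 0. The helper name `load_sentiment140`, the headerless CSV layout, the 0/1 encoding, and the two sample rows are assumptions made for illustration; the paper does not show its loading code.

```python
import csv
import io

# Column order as listed in the dataset description.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def load_sentiment140(handle):
    """Return (text, label) pairs: label 1 for target "4", label 0 for target "0"."""
    pairs = []
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        target = record.get("target")
        if target not in ("0", "4"):
            continue  # skip rows whose label is not one of the two classes
        pairs.append((record["text"], 1 if target == "4" else 0))
    return pairs


# Two made-up rows in the six-column layout, purely for illustration.
SAMPLE = (
    '"4","1","Mon Apr 06 2009","NO_QUERY","user_a","loving the sunshine today"\n'
    '"0","2","Mon Apr 06 2009","NO_QUERY","user_b","missed my bus again"\n'
)

pairs = load_sentiment140(io.StringIO(SAMPLE))
print(pairs)
```

The same function could be called on an open file handle instead of the in-memory sample; a 0/1 encoding of this kind is one common choice when a binary cross-entropy loss is used.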

Fig. 2. Sample data from Sentiment140 dataset

Fig. 3. Plot of the distribution of class in training and testing data of Sentiment140 dataset

2) Coronavirus tweets NLP dataset: The dataset contains 44,955 tweets related to COVID-19, which are manually labelled into five classes: “positive”, “extremely positive”, “negative”, “extremely negative”, and “neutral”. The classes labelled as positive and extremely positive are merged into a single class with a “positive” class label, and the negative and extremely negative classes are merged into a single class with a “negative” class label. There are a total of 4 attributes in the dataset, namely “Location”, “Tweet At”, “Original Tweet”, and “Sentiment”, out of which the “Original Tweet” and “Sentiment” attributes are utilized here. A representation of the dataset is shown in figure 4, and the class label distribution in the training and testing dataset is shown in figure 5. The testing dataset is used for validating the model.

Fig. 4. Sample data from Coronavirus tweets NLP dataset

Fig. 5. Plot of the distribution of class in training and testing data of Coronavirus tweets NLP dataset

C. Proposed Framework
Figure 6 shows the proposed framework. The initial step is pre-processing of the data, which is a crucial step in any NLP task. It is mainly done to extract from the raw data only the meaningful information that is relevant to the particular application. The main methods by which data pre-processing can be done include:
• Converting the sequence of words to lowercase
• Removal of stop words. Stop words are a group of words that are commonly used in a language. Some of the stop words used in the English language are “a”, “the”, “are”, “is”, etc. Since these words carry very little information, removing them would not affect the overall context of the sentence
• Removing punctuation and other special characters
• Stemming/Lemmatization: It refers to the process of converting the words to their root form or base form
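
The first three pre-processing steps listed above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical, and stemming/lemmatization is left out because it would normally rely on an external NLP library.

```python
import re

# A tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "are", "is", "to", "of", "and"}


def preprocess(text):
    """Lowercase the text, strip punctuation/special characters, drop stop words."""
    text = text.lower()
    # Replace every character that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("The shelves ARE empty... #panic-buying is real!")
print(cleaned)
```

Note that the character filter also discards non-ASCII letters and emoji, which may or may not be desirable for tweets; that choice is part of the sketch rather than something the paper specifies.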

Fig. 6. Proposed Framework

After cleaning/pre-processing, the data is made ready for tokenizing. Tokenization is the process of converting or breaking down the pre-processed text into words or sentences known as tokens. These tokens help in the comprehension of the context and the development of models for NLP tasks.
BERT and DistilBERT use WordPiece tokenizers, which are designed such that the tokenized sentence starts with the [CLS] (Classifier) token and ends with the [SEP] (Separator) token. The tokenizer splits the string into successive substrings. If a substring is already in the vocabulary, it remains as such; if it is not in the vocabulary, it is split repeatedly until every piece is represented in the vocabulary.

D. Environmental Setup
Experiments using the Sentiment140 dataset on the BERT, DistilBERT, and RoBERTa models were done on HPC machine hardware running Ubuntu 78.04.05 LTS with an NVIDIA Tesla V100 16G Passive GPU. Experiments using the Coronavirus tweets NLP dataset on the same three models were done on Google Colab, a cloud-based Python IDE, with 1x Tesla K80 GPU (compute 3.7, 2496 CUDA cores, 12GB GDDR5 VRAM). The framework used is Keras with TensorFlow as the backend in a Jupyter notebook. The optimizer used for the experimentation was Adam, with sparse categorical cross-entropy as the loss function in the case of the Coronavirus tweets NLP dataset and binary cross-entropy as the loss function in the case of the Sentiment140 dataset. To identify the best model, accuracy was used as the performance evaluation metric.

IV. RESULTS AND DISCUSSION
From the accuracy results shown in Table I, it can be seen that the BERT model performs best on the Sentiment140 dataset when compared to the DistilBERT and RoBERTa models. The BERT model achieved a training accuracy of 95.3%, a validation accuracy of 93.13%, and a testing accuracy of 92.76%. The loss graphs of the three models (BERT, DistilBERT, and RoBERTa) are shown in figures 7, 8, and 9, respectively. The line indicated in blue corresponds to validation loss, and the line indicated in red corresponds to training loss. After each epoch, both training and validation loss are seen to decrease for all three models.

TABLE I
ACCURACY SCORES OF THE THREE MODELS IN SENTIMENT 140 DATASET

Model        Training Accuracy   Validation Accuracy   Testing Accuracy
BERT         95.3%               93.13%                92.76%
DistilBERT   97%                 75.7%                 77.10%
RoBERTa      80.3%               78.13%                78.20%

Fig. 7. Loss graph of BERT model on Sentiment140 dataset

Fig. 8. Loss graph of DistilBERT model on Sentiment140 dataset

Fig. 9. Loss graph of RoBERTa model on Sentiment140 dataset

From the accuracy results given in Table II, it is seen that the BERT model performs best on the Coronavirus tweets NLP dataset when compared to the DistilBERT and RoBERTa models. The BERT model gives 94.1% training accuracy and 90.43% testing accuracy. The loss graphs of the three models (BERT, DistilBERT, and RoBERTa) are shown in figures 10, 11, and 12, respectively. The line indicated in blue corresponds to validation loss, and the line indicated in red corresponds to training loss. After each epoch, both training and validation loss are seen to decrease for all three models.
For testing, 5000 data samples from the Sentiment140 dataset and 100 data samples from the Coronavirus tweets NLP dataset are given to the BERT model, which obtained an accuracy of 90.43% and 92.76% for the Coronavirus tweets NLP dataset and the Sentiment140 dataset, respectively. From the results, it can be seen that the BERT model obtains better performance. The prediction results for a sample of test data from the Coronavirus tweets NLP dataset and the Sentiment140 dataset with the BERT model are shown in figures 13 and 14.
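
Accuracy, the evaluation metric used throughout these results, is the fraction of samples whose predicted label matches the true label. The sketch below shows that computation on made-up labels; the function name `accuracy` and the eight sample labels are illustrative assumptions and are not taken from the paper's experiments.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels for eight test tweets (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

score = accuracy(y_true, y_pred)
print(f"accuracy = {score:.2%}")
```

Because both datasets are described as roughly balanced in the binary case, accuracy is a reasonable single number here; on a skewed label distribution it would need to be read alongside per-class measures.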
TABLE II
ACCURACY SCORES OF THE THREE MODELS IN CORONAVIRUS TWEETS NLP DATASET

Model        Training Accuracy   Validation Accuracy   Testing Accuracy
BERT         94.1%               81.3%                 90.43%
DistilBERT   75.1%               74.4%                 74.48%
RoBERTa      76.32%              75.06%                76.03%

Fig. 10. Loss graph of BERT model on Coronavirus tweets NLP dataset

Fig. 13. Predicted labels for a sample of test data from Coronavirus tweets NLP dataset in BERT model

Fig. 14. Predicted labels for a sample of test data from Sentiment140 dataset in BERT model

Fig. 11. Loss graph of DistilBERT model on Coronavirus tweets NLP dataset

Fig. 12. Loss graph of RoBERTa model on Coronavirus tweets NLP dataset

V. CONCLUSION
Deep learning models are currently popular in the field of sentiment analysis, but existing traditional models can be improved in accuracy [12]. This work experimented with a proposed BERT model for sentiment analysis on two different datasets and compared the results with the DistilBERT and RoBERTa models. On analysing the results, it is seen that the BERT model achieved better accuracy than the other two models on both the Sentiment140 and Coronavirus tweets NLP datasets, with a training accuracy of 95.3%, validation accuracy of 93.13%, and testing accuracy of 92.76% on the Sentiment140 dataset, and a training accuracy of 94.1%, validation accuracy of 81.3%, and testing accuracy of 90.43% on the Coronavirus tweets NLP dataset. Since DistilBERT has fewer layers than BERT due to the distillation process, its accuracy may vary slightly. The NSP task in pre-training seems to be useful on these datasets, as the BERT model achieves higher accuracy than the RoBERTa model, which is pre-trained without NSP. Pre-training the model on more task-specific data may further improve the accuracy.

REFERENCES
[1] Q. A. Xu, V. Chang, and C. Jayne, “A systematic review of social media-based sentiment analysis: Emerging trends and challenges,” Decision Analytics Journal, p. 100073, 2022.
[2] S. Tam, R. B. Said, and Ö. Ö. Tanriöver, “A convbilstm deep learning model-based approach for twitter sentiment classification,” IEEE Access, vol. 9, pp. 41283–41293, 2021.
[3] N. O. F. Daeli and A. Adiwijaya, “Sentiment analysis on movie reviews using information gain and k-nearest neighbor,” Journal of Data Science and Its Applications, vol. 3, no. 1, pp. 1–7, 2020.
[4] M. Ghorbani, M. Bahaghighat, Q. Xin, and F. Özen, “Convlstmconv network: a deep learning approach for sentiment analysis in cloud computing,” Journal of Cloud Computing, vol. 9, no. 1, pp. 1–12, 2020.
[5] N. Jnoub, F. Al Machot, and W. Klas, “A domain-independent classification model for sentiment analysis using neural models,” Applied Sciences, vol. 10, no. 18, p. 6221, 2020.
[6] I. Priyadarshini and C. Cotton, “A novel lstm–cnn–grid search-based deep neural network for sentiment analysis,” The Journal of Supercomputing, vol. 77, no. 12, pp. 13911–13932, 2021.
[7] S. Kausar, X. Huahu, W. Ahmad, and M. Y. Shabir, “A sentiment polarity categorization technique for online product reviews,” IEEE Access, vol. 8, pp. 3594–3605, 2019.
[8] A. U. Rehman, A. K. Malik, B. Raza, and W. Ali, “A hybrid cnn-lstm model for improving accuracy of movie reviews sentiment analysis,” Multimedia Tools and Applications, vol. 78, no. 18, pp. 26597–26613, 2019.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[10] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
[11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
