
Received April 19, 2021, accepted May 20, 2021, date of publication May 25, 2021, date of current version June 4, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3083638

Impact of SMOTE on Imbalanced Text Features for Toxic Comments Classification Using RVVC Model
VAIBHAV RUPAPARA 1, FURQAN RUSTAM 2, HINA FATIMA SHAHZAD 2, ARIF MEHMOOD 3, IMRAN ASHRAF 4, AND GYU SANG CHOI 4

1 School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
2 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan
3 Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
4 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38544, South Korea

Corresponding authors: Imran Ashraf ([email protected]) and Gyu Sang Choi ([email protected])
This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF)
funded by the Ministry of Education under Grant NRF-2019R1A2C1006159, and in part by the Ministry of Science and ICT (MSIT),
South Korea, through the Information Technology Research Center (ITRC) support program supervised by the Institute for Information
and communications Technology Promotion (IITP) under Grant IITP-2021-2016-0-00313.

ABSTRACT Social media platforms and microblogging websites have gained accelerated popularity during the past few years. These platforms are used for expressing views and opinions about products, personalities, and events. Often during discussions and debates, fights take place on social media platforms that involve rude, disrespectful, and hateful comments called toxic comments. The identification of toxic comments has been regarded as an essential element for social media platforms. This study introduces an ensemble approach, called regression vector voting classifier (RVVC), to identify toxic comments on social media platforms. The ensemble merges the logistic regression and support vector classifier under soft voting criteria. Several experiments are performed on imbalanced and balanced datasets to analyze the performance of the proposed approach. For data balancing, the synthetic minority oversampling technique (SMOTE) is used on the imbalanced dataset. Furthermore, two feature extraction approaches, term frequency-inverse document frequency (TF-IDF) and bag-of-words (BoW), are utilized to investigate their suitability. The performance of the proposed approach is compared with several machine learning classifiers using accuracy, precision, recall, and F1 score. Results suggest that RVVC outperforms all other individual models when TF-IDF features are used with the SMOTE-balanced dataset, achieving an accuracy of 0.97.

INDEX TERMS Toxic comments classification, ensemble classifier, synthetic minority oversampling
technique, TF-IDF, BoW, text classification, data re-sampling.

I. INTRODUCTION
Social media platforms and microblogging websites have gained accelerated popularity for social communication between individuals and groups. Through these platforms, people share their thoughts, ideas, and opinions and express their feelings using comments and feedback [1]. The number of internet users has been increasing each year, from 2.4 billion in 2014 to 3.4 billion, 4 billion, and 4.4 billion in 2016, 2017, and June 2019, respectively [2]. As of May 2020, the number of internet users had increased to 4.648 billion [3]. Social media platforms provide a common ground for these users to share opinions and discuss ideas. However, problems arise when debates take a dirty turn and fights take place on social media platforms, involving rude, disrespectful, and hateful comments called toxic comments. Text in online comments carries many hazards such as fake news, cyberbullying, online harassment, and toxicity [4]. Unfortunately, these toxic comments have become a serious issue that affects the reputation of social platforms and causes different psychological problems for users, such as depression, frustration, and even suicidal thoughts [1]. Toxic comment classification is very important to overcome the above-mentioned issues and maintain stability in online debates [5]. Toxic comments can be considered personal attacks, online harassment, and bullying behaviors.

Over the past few years, several cases of police arrests happened where police arrested many individuals due to abusive or negative content on personal pages [6], [7]. So a framework that can detect toxic comments and prevent their publication is of significant importance. As a result, several approaches have been introduced for the automatic detection of toxic comments using machine learning algorithms. For example, the study [8] combines machine learning and crowd-sourcing to classify the comments that are considered a personal attack. Support vector machines were also used by [9] for cyberbully detection. Cyberbullies are also detected in [10] using deep learning models. Despite the proposed approaches, there is a need for models that provide higher accuracy for toxic comments. This study introduces an ensemble approach for toxic comment detection in imbalanced datasets and makes the following contributions:

• This study proposes a novel approach, called regression vector voting classifier (RVVC), for toxic comment classification. RVVC is an ensemble classifier that combines the logistic regression and support vector classifier through soft voting criteria.
• For evaluation, term frequency-inverse document frequency (TF-IDF) and bag-of-words (BoW) are utilized for feature extraction with imbalanced and balanced datasets. The synthetic minority oversampling technique (SMOTE) and random under-sampling are used for balancing the datasets.
• Several state-of-the-art machine learning models including support vector machine (SVM), random forest (RF), gradient boosting machine (GBM), logistic regression (LR), and k-nearest neighbor (KNN) are used for performance appraisal. Additionally, a recurrent neural network is implemented for toxic sentiment classification.

The rest of the paper is organized as follows. Section II discusses research papers from the literature which are closely related to the current study. Section III gives an overview of the machine learning algorithms adopted for the current research, as well as the description of the dataset used for the experiments; the proposed approach is also presented in the same section. Results are discussed in Section IV while the conclusion is given in Section V.

II. LITERATURE REVIEW
Toxic comments on social media platforms have been a source of great stir between individuals and groups. A toxic comment is not only verbal violence but includes comments that are rude, disrespectful, display negative online behavior, or show other similar attitudes that make someone leave a discussion. Therefore, toxic comment identification on social platforms is an important task that can help to maintain their interruption- and hatred-free operation. Consequently, a large variety of toxic comment approaches have been proposed. Three characteristics concerning toxic classification are evaluated: classification, feature dimension reduction, and feature importance.

The authors use a deep learning-based toxic comments classification approach in [11] for the imbalanced toxic dataset. The performance evaluation is carried out on the Kaggle Wikipedia talk page edits dataset which contains 159,571 records of toxic comments. The proposed approach performs multi-class classification including toxic, threat, severe toxic, obscene, insult, and identity hate. A convolutional neural network (CNN), bidirectional long short-term memory (LSTM), bidirectional gated recurrent unit (GRU), and an ensemble of the three models are used for classification. Results indicate that the ensemble approach gives the highest classification with an F1 score of 0.828 for toxic/non-toxic and 0.872 for toxicity types. The study [12] proposed a method to classify online toxic comments using logistic regression and neural network models. The online toxic comments classification dataset is taken from Kaggle and logistic regression (LR), CNN, LSTM, and CNN+LSTM (2 layers of LSTM and 4 layers of CNN) are used. All models perform well but CNN+LSTM achieves 0.982 accuracy which is the highest among all the classifiers. In the same vein, the study [13] performs classification of online toxic comments using support vector machine (SVM), naive Bayes (NB), K-nearest neighbor (KNN), linear discriminant analysis (LDA), and CNN. The classification is conducted on Kaggle Wikipedia comments for toxic and non-toxic comments. The CNN model achieves accuracy higher than 90% while the machine learning classifiers obtain accuracy between 65% and 85%.

Due to the reported high accuracy of deep learning approaches, several researchers focus on using deep CNN and LSTM architectures for classification. For example, deep neural network architectures are used for toxic comments classification in [14]. The study uses NB, LSTM, and RNN to identify toxic comments. For this purpose, a toxic comment classification challenge dataset comprising 159,000 comments is used. LSTM performs best with a 67% true positive rate, which is 20% higher than the NB model. Furthermore, LSTM achieved a 73% F1 score, 81% precision, and 66% recall. Similarly, hybrid deep learning approaches are adopted in [15] for the same task. For this purpose, the Jigsaw toxic comments classification dataset is used. The hybrid deep learning achieved 98% accuracy and an 80% F1 score. Another study [16] created its dataset taking comments from Facebook pages and labeled them with six categories: toxic, severe toxic, obscene, threat, insult, and identity hate. Different machine learning and deep learning algorithms are applied for Bangla toxic comments classification. SVM, Gaussian NB, Multinomial NB, Multi-Label k Nearest Neighbor (MLKNN), and Backpropagation for Multi-Label Learning (BP-MLL) are used to classify comments. BP-MLL outperforms both the machine learning and deep learning algorithms used for the experiments.

The study [17] proposed a methodology for the classification of toxic comments and an in-depth error analysis.

The study uses two datasets, the Wikipedia talk pages dataset and a Twitter dataset, containing six classes of toxic comments. The study uses CNN, LSTM, bidirectional LSTM, bidirectional GRU, bidirectional GRU with attention, and LR for the classification. For feature extraction, GloVe is applied for word embedding and fastText for sub-word embedding. Deep learning models are trained and tested with both GloVe and fastText, while the LR is used with char-n-grams and word-n-grams. The ensemble classifier achieves a 79.1% F1 score on the Wikipedia dataset and 79.3% on the Twitter dataset. In the study [18], deep neural network architectures are used to perform toxic comment classification. CNN, bidirectional LSTM, and bidirectional GRU are used for classification, where the bidirectional GRU performs the best. Another study [19] proposed a methodology to establish lexical baselines for classification by applying supervised classification methods. A 78% accuracy is achieved for the three-class task of classifying hate, offensive, and OK using n-grams and a linear SVM. DNN-based Twitter hate speech detection is proposed in [20]. This study created its dataset from 6 publicly available datasets. For the classification of hate speech on Twitter, a DNN and a combined model of CNN and GRU are used. The combined GRU+CNN model is optimized with dropout and pooling layers (1D max pooling, softmax pooling, and global max pooling). The proposed model achieves accuracies of 91.4% and 92.1% on two different datasets.

Another study [21] developed a model for automatically identifying comments from social media as toxic or non-toxic. TF-IDF is used for feature selection and a multi-headed model using logistic regression is used for classification. Toxic comments are further categorized into severe toxic, obscene, threat, insult, and identity hate. Results are pretty good; however, the research provides results for training accuracy only. In the same way, the study [22] classifies toxic comments by using multi-class and multi-label word embedding techniques. For feature selection, BoW and TF-IDF are used, and the GloVe, Google News, SNAP, FastText, and Dranziera word embedding techniques are used for multi-label and multi-class words. The highest per-label accuracy and AUC are achieved as 0.83 and 0.89 using TF-IDF features.

Feature analysis is an important part of toxic comment classification, and several studies analyze the influence of various feature selection methods on classification accuracy. For example, Twitter hate-speech text classification is done using CNN in [23]. The study uses a Twitter dataset that contains four categories: racism, sexism, both (racism and sexism), and non-hate-speech. In this study, CNN is used with four techniques: random vectors, word2vec, character n-grams, and word2vec+character n-grams. Test results with 10-fold cross-validation show a 78.3% F score with word2vec embeddings. The role of feature selection in text classification is analyzed in [24] using two machine learning algorithms, Naive Bayes (NB) and KNN. Experiments are performed with several feature extraction approaches such as information gain and mutual information, and results indicate the better performance of NB with most of the used features. Dimensionality reduction plays an important role in enhancing the classification accuracy of machine learning classifiers. The impact of feature dimensions is analyzed through Chi-square, GSS coefficient, odds ratio, NGL coefficient, information gain, relevancy score, and multi-sets of features with NB and KNN in [25]. Similarly, multi-viewpoint cosine-based similarity visual assessment of tendency is used in [26] to handle scalability issues in data clustering from social media.

Abusive language is detected by [27] using machine learning approaches. This study is based on the classification of online comments collected from Yahoo! Finance and News. Three datasets, a primary dataset, a temporal dataset, and the WWW2015 dataset, are used for this purpose. Features are divided into four classes for the experiments: n-grams, linguistic, syntactic, and distributional semantics. Experiments indicate that when the features are combined the classification accuracy is high. Emotional states are used to classify hate speech from social media comments in the approach proposed in [28]. The emotional states of joy, anger, sadness, surprise, fear, trust, disgust, and anticipation are used for this purpose. An annotated hate speech dataset is used to detect hate speech with the lexicon-based approach. The proposed approach achieves an accuracy of 80.56%.

A project called 'Perspective' was launched by Google and Jigsaw which uses machine learning techniques to automatically detect abusive, insulting, and harassing comments online. Perspective is a toxicity-detection API by Google that filters the comments on news websites to identify abuse and harassment. An attack strategy is proposed in [29] to deceive Perspective by modifying a toxic phrase to significantly lower the toxicity score assigned by Perspective. Research indicates that the use of white spaces, redundant characters, and full stops can substantially cheat Perspective into lowering the toxicity score.

Despite the high accuracy for toxic sentiment classification, the above-cited research works have several limitations. For example, the studies often use imbalanced datasets, and consequently, the reported accuracy is higher than the F1 score. The F1 score is preferred on imbalanced datasets and is very low in the discussed research works. Machine learning algorithms can overfit the majority class on highly imbalanced datasets. Previous studies do not focus on balancing the datasets, and their results may be biased due to model overfitting. Predominantly, the proposed approaches follow deep learning models that are data-intensive, and their accuracy is affected when used with smaller datasets. Hence, this study leverages machine learning algorithms to perform toxic comment detection and overcome the mentioned issues.

III. MATERIALS AND METHODS
This study uses different techniques, methods, and tools for the classification of toxic and non-toxic comments. Various preprocessing steps, data re-sampling methods, feature extraction techniques, and supervised machine learning models are adopted for the said task.

A. DATA DESCRIPTION
This study aims at the automatic classification of toxic and non-toxic comments from social media platforms. Various machine learning models are utilized for this purpose to evaluate their strength for the said task. For evaluation, the selected models are trained and tested with binary class datasets. Traditionally, toxic comments are grouped under several classes such as hate, toxic, threat, severe toxic, obscene, insult, non-toxic, etc. We follow a different approach by grouping the comments under two classes, toxic and non-toxic. The original dataset, which is taken from Kaggle [30], is a multi-label dataset and contains labels such as toxic, severe_toxic, obscene, threat, insult, and identity_hate. The non-toxic comments belong to one class, while from the remaining comments only those that carry the toxic label alone are selected. It means that the comments labeled severe_toxic, obscene, threat, insult, and identity_hate are not selected. For example, Table 1 shows that 'comment 2' is only toxic and 'comment 3' is non-toxic. For our experiment, 'comment 2' and 'comment 3' are selected under the toxic and non-toxic classes, while 'comment 1' and 'comment 4' are not selected.

TABLE 1. Example of various classes in the original dataset.

We extract only toxic and non-toxic comments from the dataset, and Table 2 shows the ratio of toxic and non-toxic comments in the dataset used for experiments. The ratio of toxic to non-toxic comments in the dataset is not equal, which shows the imbalanced data problem. The performance of the classifiers could be affected by an imbalanced dataset.

TABLE 2. Number of records for toxic and non-toxic comments.

The dataset contains 158,640 comments in total, with toxic comments having the lowest ratio in the dataset, i.e., 15,294, while non-toxic comments number 143,346. This makes a huge difference and makes the dataset highly imbalanced. Due to the large size of the dataset, only 70,000 non-toxic comments are randomly selected for the experiments.

B. PREPROCESSING STEPS
Pre-processing techniques are applied to clean the data, which helps to improve the learning efficiency of machine learning models [31]. For this purpose, the following steps are executed in the given sequence.

Tokenization: is the process of dividing a text into smaller units called 'tokens'. A token can be a number, a word, or any type of symbol that contains all the important information about the data without conceding its security.

Punctuation removal: involves removing the punctuation from comments using natural language processing techniques. Punctuation marks are symbols utilized in sentences/comments to make the sentence clear and readable for humans. However, they create problems in the learning process of machine learning algorithms and need to be removed to improve the learning process. Common punctuation marks include the colon, question mark, comma, semicolon, full stop/period, brackets, and parentheses (? : , ; . [ ] ( )) [32].

Number removal: is also a part of preprocessing which helps to improve the performance of the machine learning algorithms. Numbers are unnecessary and do not contribute to the learning of text analysis approaches. Removing the numbers increases the efficiency of the models and decreases the complexity of the data.

Stemming: is an important part of preprocessing because it increases performance by clearing affixes from sentences/comments and converting the words into their original form. Stemming is the process of transforming a word into its root form. For example, different words can share the same base meaning: 'plays', 'playing', and 'played' are modified forms of 'play'. Stemming is implemented using the Porter stemmer algorithm [33].

Spelling correction: is the process of correcting misspelled words. In this phase, a spelling checker is used to find misspelled words and replace them with the correct word. The Python library 'pyspellchecker' provides the necessary features to check misspelled words and is used for the experiments [34].

Stopwords removal: Stopwords are English words that do not add any meaning to a sentence, so they can be removed without affecting the meaning of the sentence. The removal of stopwords increases the models' performance and decreases the complexity of the input features [35].
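The preprocessing chain above can be reproduced with standard Python tooling. The following is a minimal sketch rather than the authors' exact code: it assumes NLTK for tokenization, stemming, and stopwords alongside the 'pyspellchecker' package named in the text, and the function name and sample comment are illustrative only.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from spellchecker import SpellChecker

# One-time NLTK downloads: nltk.download('punkt'); nltk.download('stopwords')
stemmer = PorterStemmer()
spell = SpellChecker()
stop_words = set(stopwords.words('english'))

def preprocess(comment):
    """Apply the cleaning steps in the sequence listed above."""
    tokens = word_tokenize(comment.lower())              # tokenization
    tokens = [t for t in tokens if t.isalpha()]          # punctuation and number removal
    tokens = [stemmer.stem(t) for t in tokens]           # Porter stemming
    tokens = [spell.correction(t) or t for t in tokens]  # spelling correction
    tokens = [t for t in tokens if t not in stop_words]  # stopword removal
    return ' '.join(tokens)

print(preprocess("He playz 22 games !!!"))  # prints a cleaned, stemmed version of the comment
```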

C. FEATURE ENGINEERING
Feature engineering aims at discovering useful data features or constructing features from the original features to train machine learning algorithms effectively [36]. The study [37] concludes that feature engineering can improve the efficiency of machine learning algorithms. 'Garbage in, garbage out' is a proverb used in machine learning which implies that senseless data used as the input yields meaningless output. In contrast, more information-driven data will yield favorable results. Hence, feature engineering can derive useful features from raw data, which helps to improve the reliability and accurateness of learning algorithms. In the proposed methodology, two feature engineering methods are used: bag-of-words and term frequency-inverse document frequency.

D. BAG-OF-WORDS
The bag-of-words (BoW) technique is used to extract features from the text data. BoW is easy to implement and understand, besides being the simplest method to extract features from text data. BoW is very suitable and useful for language modeling and text classification. The 'CountVectorizer' library is used to implement BoW. CountVectorizer calculates the occurrence of words and constructs a sparse matrix of word counts [38]. BoW is a pool of words or features, where every feature is a label whose value signifies the occurrences of that feature.
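As a concrete illustration of the BoW features, the short sketch below uses scikit-learn's CountVectorizer, which the text refers to; the two-comment corpus is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "you are a nice person",      # hypothetical cleaned comments
    "you are a horrible person",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # the vocabulary: one feature per word
print(X.toarray())                          # per-comment word counts
```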
E. TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY
TF stands for term frequency and IDF stands for the inverse document frequency of a word. TF-IDF is a statistical measure used to determine how relevant the words in a list or corpus are. The value increases with the number of times a word appears in the text but is normalized by the word's occurrence across the documents [39].

• Term Frequency (TF): is the frequency of a term in the text of a document. Because documents differ in size, a word is likely to occur more often in long documents than in short ones. To normalize, the term frequency is divided by the length of the text:

TF(t) = (No. of times term t appears in a document) / (Total no. of terms in the document)  (1)

• Inverse Document Frequency (IDF): is a rating of how infrequent the term is across the documents. IDF indicates the importance of a word on account of its rareness; rare words have a higher IDF score:

IDF(t) = log_e(Total no. of documents / No. of documents containing term t)  (2)

TF-IDF is then calculated using both TF and IDF:

TF-IDF = TF_(t,d) * log(N / DF_t),  (3)

where TF_(t,d) is the frequency of term t in document d, N is the total number of documents, and DF_t is the number of documents containing t.
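To make Equations (1)-(3) concrete, the snippet below computes the quantities by hand for a hypothetical two-document corpus; library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing, so this illustrates the formulas rather than the exact feature values used in the experiments.

```python
import math

docs = [
    "toxic comment toxic".split(),  # hypothetical tokenized comments
    "nice comment".split(),
]

def tf(term, doc):
    # Eq. (1): term count normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Eq. (2): log of total documents over documents containing the term
    return math.log(len(docs) / sum(term in d for d in docs))

def tf_idf(term, doc, docs):
    # Eq. (3): TF weighted by IDF
    return tf(term, doc) * idf(term, docs)

print(tf_idf("toxic", docs[0], docs))    # rare word -> non-zero weight (~0.46)
print(tf_idf("comment", docs[0], docs))  # occurs in every document -> IDF = 0
```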
F. DATA RE-SAMPLING TECHNIQUES
Data re-sampling techniques are used to solve imbalanced dataset problems. An imbalanced dataset contains an unequal ratio of the target classes and can cause problems in classification tasks because models can over-fit on the majority class [40]. To solve this problem, different data re-sampling techniques have been presented. In this study, two types of re-sampling techniques are used: under-sampling and over-sampling.

1) RANDOM UNDER-SAMPLING
Under-sampling reduces the size of the dataset by deleting examples of the majority class. For under-sampling, a random under-sampling approach is used in the current study. In random under-sampling, majority-class examples are rejected at random and deleted to balance the distribution of the target classes. Simply put, under-sampling aims to balance the class distribution by randomly deleting majority-class examples. Random under-sampling is one of the widely used re-sampling approaches and is selected due to its reported performance [41]-[44].

2) SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE
Over-sampling is a technique in which the number of samples of the minority class is increased towards the ratio of the majority class. Over-sampling increases the size of the data, which generates more features for model training and could be helpful to increase the accuracy of the model. In this study, the synthetic minority over-sampling technique (SMOTE) is used for over-sampling. SMOTE is a state-of-the-art technique that was proposed in [45] to solve the overfitting problem for imbalanced datasets. SMOTE randomly picks samples of the smaller class and finds the K nearest neighbors of each picked sample; new minority-class samples are then constructed from the picked sample and its neighbors. SMOTE is adopted on account of the results reported in [46]-[49].
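Both re-sampling strategies are implemented in the imbalanced-learn package; the text does not name a library, so the following is one possible sketch rather than the authors' code, with a synthetic feature matrix standing in for the TF-IDF/BoW vectors.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the comment features; 9:1 imbalance mimics the dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

rus = RandomUnderSampler(random_state=42)  # randomly deletes majority-class examples
X_under, y_under = rus.fit_resample(X, y)

smote = SMOTE(random_state=42)             # synthesizes minority examples from k nearest neighbors
X_over, y_over = smote.fit_resample(X, y)

print(Counter(y))        # imbalanced class counts
print(Counter(y_under))  # balanced by deletion
print(Counter(y_over))   # balanced by synthesis
```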

G. PROPOSED METHODOLOGY
Ensemble learning is widely used to attain high accuracy in classification tasks. A combination of various models can perform better than the individual models. Owing to the high accuracy of ensemble models, this study leverages an ensemble model to perform toxic comment classification. Our experiments indicate good performance from LR and SVC, so to further improve the performance, this study combines these models. The proposed approach is called regression vector voting classifier (RVVC) and combines these models using soft voting criteria, as shown in Figure 2. The soft voting criteria ensure that the class with the highest predicted probability across the two classifiers is taken as the final prediction.

Algorithm 1 shows the working of the proposed RVVC model and explains how it combines LR and SVC for toxic comment classification. Let LR and SVC be the two models and 'toxic' and 'non-toxic' be the two classes; then the prediction is made using

RVVC = argmax{Toxic_prob, NonToxic_prob}  (4)

Algorithm 1 Toxic Comments Classification
Input: Corpus (text comments)
Output: Class (Toxic or Non-Toxic)
1: TLR → trained LR
2: TSVC → trained SVC
3: for i in Corpus do
4:   ToxicProbLR → TLR(i)
5:   NonToxicProbLR → TLR(i)
6:   ToxicProbSVC → TSVC(i)
7:   NonToxicProbSVC → TSVC(i)
8:   RVVCPred → argmax((ToxicProbLR + ToxicProbSVC)/2, (NonToxicProbLR + NonToxicProbSVC)/2)
9: end for
10: Toxic|Non-Toxic → RVVC prediction

where argmax finds the class with the largest predicted probability. Toxic_prob and NonToxic_prob indicate the joint probabilities of the toxic and non-toxic classes from the LR and SVC models and are calculated as follows:

Toxic_prob = (ToxicProbLR + ToxicProbSVC) / 2  (5)
NonToxic_prob = (NonToxicProbLR + NonToxicProbSVC) / 2  (6)

where ToxicProbLR and ToxicProbSVC are the probabilities for the toxic class from LR and SVC, respectively, while NonToxicProbLR and NonToxicProbSVC are the probability scores for the non-toxic class from LR and SVC, respectively.

To illustrate the working of the proposed RVVC, the values for one sample are taken from the dataset used for the experiments. The probabilities given by LR and SVC for the sample are:
• ToxicProbLR = 0.6
• NonToxicProbLR = 0.4
• ToxicProbSVC = 0.5
• NonToxicProbSVC = 0.5

The combined Toxic_prob and NonToxic_prob are calculated as follows:

Toxic_prob = (0.6 + 0.5) / 2 = 0.55  (7)
NonToxic_prob = (0.4 + 0.5) / 2 = 0.45  (8)

Then the argmax function is applied to select the class with the higher probability. Here the largest prediction probability is for the toxic class, so the final prediction by RVVC is the toxic class:

RVVC = argmax{0.55, 0.45}  (9)
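The probability averaging of Equations (4)-(9) is exactly what scikit-learn's soft-voting ensemble computes, so RVVC can be sketched as below. This is an assumed reconstruction, not the authors' released code; note that SVC needs probability=True to expose the class probabilities that soft voting averages, and the synthetic data stands in for the extracted comment features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Stand-in features; in the paper these are TF-IDF or BoW vectors of comments.
X, y = make_classification(n_samples=500, random_state=0)

lr = LogisticRegression(max_iter=100, penalty='l2', C=1.0)
svc = SVC(kernel='linear', C=2.0, probability=True)

# Soft voting: average the per-class probabilities of LR and SVC, then take the
# argmax, which is exactly Eqs. (5), (6), and (4).
rvvc = VotingClassifier(estimators=[('lr', lr), ('svc', svc)], voting='soft')
rvvc.fit(X, y)

print(rvvc.predict_proba(X[:1]))  # averaged class probabilities for one sample
print(rvvc.predict(X[:1]))        # argmax of the averaged probabilities
```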
The flow of the proposed methodology is shown in Figure 1. In the proposed methodology, the toxic comment classification problem is solved using LR and SVC. For classification, the dataset containing toxic comments is obtained from Kaggle [30]. Several preprocessing steps are carried out on the dataset to clean the data. After data cleaning, two feature extraction approaches, TF-IDF and BoW, are applied.

FIGURE 1. The flow of the proposed methodology.

Owing to the large difference in the number of samples for the toxic and non-toxic classes, various re-sampling approaches are applied. Random under-sampling and SMOTE over-sampling approaches are leveraged to balance the dataset and improve the performance of the proposed methodology. The number of samples after re-sampling is given in Table 3.

TABLE 3. Number of samples after applying re-sampling.

In under-sampling, random samples of the majority class are removed, while in over-sampling, samples of the minority class are generated using SMOTE. After re-sampling, the data is split into training and testing sets with a 75:25 ratio. The number of training and testing samples after the data split is given in Table 4.

TABLE 4. Number of samples for train and test data.

We use the training set to train the machine learning models and the proposed ensemble classifier on the extracted features and evaluate their performance on the test data. For performance evaluation, various metrics including accuracy, precision, recall, and F1 score are used.
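The 75:25 split maps directly onto scikit-learn's train_test_split; in this small sketch, X_res and y_res are hypothetical names for the re-sampled feature matrix and labels produced in the previous step.

```python
from sklearn.model_selection import train_test_split

# X_res, y_res: re-sampled features and labels from the previous step (assumed names).
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.25, random_state=42)
```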

H. SUPERVISED MACHINE LEARNING MODELS
Various machine learning models are adopted to perform toxic comment classification. The machine learning algorithms are implemented using the Scikit-learn library. We use two tree-based models, RF and GBM, two linear models, LR and SVM, and one non-parametric model, KNN. The hyper-parameters of all the machine learning models are given in Table 5.

TABLE 5. Machine learning models parameters.

1) SUPPORT VECTOR MACHINE
SVM is a supervised machine learning model used for both classification and regression problems. The straightforward approach to classifying the data starts by constructing a function that divides the data points into the respective labels with (a) the least possible number of errors or (b) the largest possible margin. This is because larger empty areas next to the splitting function lead to fewer errors. After the function is constructed, the labels are better separated from each other. The hyperparameters of SVM are listed in Table 5, in which kernel='linear' specifies the kernel type used for SVM. The linear kernel is used to ensure high accuracy and reduced time complexity. The term C = 2.0 is the regularization parameter; the strength of the regularization is inversely proportional to C. The parameter random_state = 500 seeds the pseudo-random number generator used for probability estimation when shuffling the data.

2) RANDOM FOREST
RF is a tree-based ensemble classifier which generates highly accurate predictions by combining several weak learners. RF uses bootstrap bagging to train a variety of decision trees on various bootstrap samples. In RF, a bootstrap sample is produced by subsampling the training dataset, where the sample and the training dataset are of the same size. RF and other ensemble classifiers utilize decision trees for the prediction. The identification of the attribute for the root node at each stage is a major challenge in constructing the decision trees.

p = mode{T1(y), T2(y), ..., Tm(y)}  (10)
p = mode{ Σ_{m=1}^{m} [Tm(y)] }  (11)

where p is the final decision of the decision trees by majority vote, while T1(y), T2(y), T3(y), ..., Tm(y) are the decision trees involved in the prediction process.

To improve the accuracy, RF is implemented with n set to 100, which indicates the number of trees that contribute to the prediction in the RF. The 'max_depth' is set to 60, which means that every decision tree can grow to a maximum depth of 60 levels. By bounding the depth, the 'max_depth' parameter decreases uncertainty in the decision tree and decreases the probability of the decision tree over-fitting. The parameter 'random_state' is used for the randomness of the samples during training. For our experiments, we attain good results with RF by using only two hyperparameters.

3) GRADIENT BOOSTING MACHINE
Gradient boosting classifiers are a family of machine learning algorithms that combine several weak learners to construct a strong prediction model. The GBM relies on a loss function, and a customized loss function can also be used. The GBM supports several generic loss functions, but the loss function has to be differentiable. Classification algorithms commonly use logarithmic loss, while squared error can be used in regression algorithms. Each time the boosting algorithm is run, the gradient boosting system does not need to derive a new loss function; rather, any differentiable loss function can be applied. Several hyperparameters are tuned to get good accuracy from the GBM. For example, n is set to 100, indicating the number of trees which contribute to the prediction. Equipped with 100 decision trees, the final prediction is made by voting over all predictions of the decision trees. The value of 'max_depth' is 60, allowing a decision tree a maximum depth of 60 levels.

4) LOGISTIC REGRESSION
Logistic regression is one of the most widely used approaches for binary classification problems. LR is known for the method that it uses, i.e., the logistic equation, also called the sigmoid function. The sigmoid function is an S-shaped curve that can take any real number and map it to a value between 0 and 1 [50]:

f(value) = 1 / (1 + e^(-value))  (12)

where e is the base of the natural logarithm and value is the real number to be transformed. The logistic function maps numbers, for example between -5 and 5, into the range 0 to 1. The prediction is computed as

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))  (13)

where b0 is the bias or intercept, y is the expected output, and b1 is the coefficient for the single input value x. Every column of the input data has an associated coefficient b (a constant real value) that is learned from the training data.

To attain high accuracy, LR is used with 'max_iter' of 100 for the solver to converge. The parameter 'penalty' is set to 'l2', which specifies the norm used in the penalization. The parameter C = 1.0 specifies the inverse of the regularization strength.

FIGURE 2. Architecture of the proposed ensemble model RVVC.

5) K-NEAREST NEIGHBOR
KNN is one of the simplest supervised classification methods in machine learning. KNN identifies the similarities between the new data and existing cases and puts the new data in the group with the highest similarity. The similarity is calculated using the distance between the new data point and the existing classes. For distance measurement, various distance metrics are used, such as Euclidean, Manhattan, and Cityblock. The KNN algorithm can be used for both regression and classification, but it is mainly used for classification problems. KNN is a non-parametric algorithm, implying that it makes no inference about the underlying data. KNN has multiple parameters that can be tuned to achieve high accuracy. For the current study, the leaf size is set to 30, which is passed to the BallTree or KDTree; the optimal value depends on the nature of the problem. Minkowski is used as the distance metric, and the number of neighbors is set to 3.
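Collecting the hyperparameters quoted in this section, the five models could be instantiated in scikit-learn as below. This sketch is inferred from the text and Table 5, whose full contents are not reproduced here, so any unstated parameter (such as the RF random state) is an assumption or a scikit-learn default.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

models = {
    'SVM': SVC(kernel='linear', C=2.0, random_state=500),          # as stated above
    'RF': RandomForestClassifier(n_estimators=100, max_depth=60),  # 100 trees, depth 60
    'GBM': GradientBoostingClassifier(n_estimators=100, max_depth=60),
    'LR': LogisticRegression(max_iter=100, penalty='l2', C=1.0),
    'KNN': KNeighborsClassifier(n_neighbors=3, leaf_size=30, metric='minkowski'),
}
```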
I. EVALUATION METRICS
We evaluate the performance of the machine learning models in terms of accuracy, precision, recall, and F1 score.

1) ACCURACY
Accuracy indicates the ratio of correct predictions to the total predictions from a classifier on the test data. The maximum accuracy score is 1, indicating that all predictions from the classifier are correct, while the minimum accuracy score can be 0. Accuracy can be calculated as

Accuracy = (Number of correct predictions) / (Total number of predictions)  (14)

or, equivalently, as

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (15)

where TP is a true positive, TN is a true negative, FP is a false positive, and FN is a false negative.

2) PRECISION
Precision, also known as positive predictive value, represents the relative number of correctly classified positive instances among all instances classified as positive. A precision value of 1 means that every instance categorized as positive is indeed positive. It is important to note, however, that this says nothing about the number of positive instances that are wrongly predicted as negative.

Precision = TP / (TP + FP)  (16)

3) RECALL
Recall, often called sensitivity, represents the relative number of correctly classified positive instances out of all positive instances. Recall is defined as

Recall = TP / (TP + FN)  (17)

4) F1 SCORE
Precision and recall individually are not regarded as true representations of the performance of a classifier. The F1 score is deemed more informative as it combines both precision and recall and gives a score between 0 and 1. It is the harmonic mean of precision and recall, calculated as

F1 = 2 × (Precision × Recall) / (Precision + Recall)  (18)
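Equations (14)-(18) correspond one-to-one to scikit-learn's metric functions; the small check below uses hypothetical labels to show the mapping.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical test labels (1 = toxic) and model predictions: TP=2, TN=4, FP=1, FN=1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))   # Eq. (15): (TP+TN)/(TP+TN+FP+FN) = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # Eq. (16): TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # Eq. (17): TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))         # Eq. (18): harmonic mean of the two = 2/3
```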
IV. RESULTS AND DISCUSSION
Several experiments are performed to evaluate the performance of both the selected machine learning classifiers and the proposed RVVC ensemble classifier. The experiments are divided into three categories: experiments without re-sampling, experiments with under-sampling, and experiments with over-sampling.

A. PERFORMANCE OF MACHINE LEARNING MODELS ON IMBALANCED DATASET
Initial experiments are performed using the original imbalanced dataset with TF-IDF and BoW separately. Table 6 shows the values of the performance evaluation metrics on the imbalanced dataset using TF-IDF features. There is a lot of fluctuation in the values of the evaluation parameters. For example, the accuracy of RF using TF-IDF is 0.92 but the F1 score is 0.83. The difference between the values of accuracy and F1 score is similar for the other machine learning models.

TABLE 6. Performance results of all models on imbalanced dataset using TF-IDF.

Table 7 shows the results for the machine learning models when trained and tested using the BoW features on the imbalanced dataset. Results indicate that the models get over-fitted on the majority class because the models see more data from the majority class than from the minority class. Consequently, the number of wrong predictions for the minority class is higher than for the majority class.

TABLE 7. Performance results of all models on imbalanced dataset using BoW.

Owing to the high difference between the values of accuracy and F1 score, correct predictions (CP) and wrong predictions (WP) are important evaluation parameters to analyze. Table 8 shows the TP, TN, FP, CP, and WP for both TF-IDF and BoW for all the classifiers. Results show that RVVC gives the highest number of correct predictions, i.e., 20,007 out of 21,324 total predictions, and gives only 1,317 wrong predictions with TF-IDF. RVVC performs somewhat better than the other classifiers on the imbalanced dataset because of its ensemble architecture; an ensemble model can perform better than individual models. Using the soft voting criteria on the predictions from two well-performing models increases the probability of a correct prediction.

TABLE 8. Correct and wrong predictions from all classifiers on the imbalanced dataset.

B. PERFORMANCE OF MODELS ON BALANCED DATASET USING UNDER-SAMPLING
Further experiments are performed using a balanced dataset with the random under-sampling technique. Results using TF-IDF features on the under-sampled data are shown in Table 9. Results suggest that the performance of the selected models has degraded on the under-sampled dataset. As under-sampling reduces the size of the dataset, the number of features to train the models is also reduced, which affects the accuracy of the machine learning models. It is observed that the difference between the values of accuracy and the other evaluation parameters has been reduced, and the values for accuracy and F1 are now similar. It indicates a good fit of the machine learning models.

TABLE 9. Performance results of all models using TF-IDF features from under-sampled dataset.

The RVVC model outperforms all other models in terms of accuracy, precision, recall, and F1 score when used with TF-IDF features from the under-sampled dataset. It achieves the highest accuracy and F1, with a value of 0.91 each, and performs better than all other classifiers.

Table 10 shows the performance of the machine learning models after under-sampling with the BoW features. This performance shows that the ensemble model can also perform well on the small dataset resulting from under-sampling. The proposed RVVC model is a combination of LR and SVC, which is a good combination for small and large datasets. RVVC performs better with BoW features in the under-sampling case with 0.91 accuracy, which is the highest of all the classifiers.

TABLE 10. Performance results of all models using BoW features from under-sampled data.

Table 11 shows the performance of the machine learning models in terms of correct and wrong predictions. In both the TF-IDF and BoW cases, the RVVC model outperforms all other models in terms of correct predictions. RVVC gives 6,927 correct predictions with TF-IDF features and only 720 wrong predictions, compared to LR which is the second-highest performer and gives 748 wrong predictions. Similarly, with BoW features, RVVC gives 6,940 correct predictions and 707 wrong predictions, which is also the lowest number for any model. In light of these results, RVVC shows the highest performance among all the machine learning classifiers.

TABLE 11. Number of correct and wrong predictions using the TF-IDF and BoW features from the under-sampled dataset.

C. EXPERIMENTAL RESULTS OF MODELS USING SMOTE OVER-SAMPLED DATA
Experiments are performed using the SMOTE-balanced dataset with both TF-IDF and BoW features. Table 12 shows the performance of all models with TF-IDF features. The performance of the machine learning models improves significantly when trained on TF-IDF features from the SMOTE over-sampled dataset. Over-sampling increases the dataset size, which increases the number of features for training the models. Consequently, it helps the models fit well and increases their performance. However, at the same time, the performance of the KNN model, which does not perform well on large feature sets, degrades. Among all models, RVVC achieves the highest accuracy of 0.97 and outperforms all other models when trained on TF-IDF features.

TABLE 12. Performance results of all models using TF-IDF features from over-sampled dataset.

Table 13 shows the performance of the models with BoW features from the over-sampled dataset. RVVC performs well on BoW features as well and achieves a joint-highest accuracy of 0.93 with LR. However, its recall score is higher than LR's, which shows its superior performance. KNN shows the poorest performance among all the classifiers with an accuracy of 0.64, while RF and SVC perform well with 0.90 and 0.91 accuracy, respectively. As a whole, the performance of all the models is lower with BoW features than with TF-IDF features.

TABLE 13. Performance results of all models on over-sampled data using BoW features.

TF-IDF shows superior performance over the BoW features for toxic comment classification. It is important to point out that BoW contains only the frequency (count) of word occurrences for a given comment and does not record any information regarding the importance of a word. For BoW, no word is a rare or common word; it just counts how many times it has appeared in a given comment. On the other hand, TF-IDF captures both the occurrence of a word and its importance. As a result, it performs better than BoW. Besides, with the increase in the size of comments, the size of the vocabulary also increases, which leads to sparsity in BoW. The increased size of the training vector affects the performance of the classifiers and degrades the accuracy.

The performance of the machine learning models using the SMOTE technique is also evaluated in terms of correct and wrong predictions, as shown in Table 14.
TABLE 14. Number of correct and wrong predictions using SMOTE over-sampled dataset.

Results suggest that RVVC gives the highest number of correct predictions when used with TF-IDF features from the SMOTE over-sampled dataset. RVVC gives 33,857 correct predictions out of 35,000 predictions, and only 1,143 predictions are wrong. RF and SVC are behind the RVVC model with 33,552 and 33,076 correct predictions, respectively.

Tables 9 and 10 contain the results using the random under-sampling technique with TF-IDF and BoW features, respectively, and Tables 12 and 13 contain the results with SMOTE. The performance of the models with the over-sampling technique is significantly better than with under-sampling. In random under-sampling, random data are deleted from the majority class to balance the samples of the majority and minority classes. As a result, the size of the data, as well as the size of the feature set, is reduced. When trained on a small feature vector, the performance of the machine learning models degrades. Additionally, in the deletion procedure of random under-sampling, many important records that could be influential in the models' training may be deleted, which affects the performance of the models. On the other hand, over-sampling increases the size of the data by generating new records, which helps to generate a large feature set and improves the performance of the learning models.

D. EXPERIMENTAL RESULTS USING RNN WITH OVER-SAMPLED, UNDER-SAMPLED, AND IMBALANCED DATASETS
Along with the machine learning models and the proposed RVVC, a recurrent neural network is also tested for toxic comment classification. Deep learning approaches tend to show high performance in text classification tasks [51]. The RNN is used with the architecture given in Figure 3. Experimental results for classifying the toxic and non-toxic comments using the RNN are given in Table 15. The RNN model produces its output based on previous computations by using sequential information, and gives better results for this reason. The results using the SMOTE over-sampled dataset show higher accuracy due to the large feature set. However, with the under-sampled dataset, the size of the dataset is decreased, which shrinks the feature vector for training, and the performance of the RNN is reduced. The RNN achieves its highest accuracy of 0.95 when trained on the over-sampled dataset and its lowest accuracy of 0.887 with the imbalanced dataset. However, the RNN's highest accuracy of 0.95 is lower than the accuracy of the proposed RVVC, which is 0.97 with TF-IDF features from the SMOTE over-sampled dataset.

FIGURE 3. The architecture of the used recurrent neural network for toxic comments.

TABLE 15. Accuracy of RNN with under-sampled, over-sampled and imbalanced dataset.
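The exact layer configuration of Figure 3 is not reproduced in this text, so the Keras sketch below only illustrates the general shape of such a network: an embedding layer feeding a recurrent layer and a sigmoid output for the binary toxic/non-toxic decision. Every size here is an assumption, not the paper's configuration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

VOCAB_SIZE = 20000  # assumed vocabulary size
MAX_LEN = 100       # assumed padded comment length

model = Sequential([
    Embedding(VOCAB_SIZE, 64, input_length=MAX_LEN),  # token ids -> dense vectors
    SimpleRNN(64),                                    # sequential processing of the comment
    Dense(1, activation='sigmoid'),                   # outputs P(toxic)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```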
E. PERFORMANCE ANALYSIS WITH STATE-OF-THE-ART APPROACHES
Performance comparison of the proposed RVVC is done with five state-of-the-art approaches, including both machine learning and deep learning approaches for toxic comment classification. Table 16 shows the performance appraisal results for RVVC and the other models. Results show that the proposed RVVC performs better than the other approaches in correctly classifying the toxic and non-toxic comments.

TABLE 16. Performance comparison results for state-of-the-art approaches and proposed RVVC.

F. STATISTICAL T-TEST
To show the significance of the proposed RVVC model, a statistical significance test, the T-test, has been performed. To support the T-test, we suppose two hypotheses as follows:
• Null hypothesis: The proposed model RVVC is statistically significant.
• Alternative hypothesis: The proposed model is not statistically significant.
The statistical T-test results show that RVVC is statistically significant for all re-sampling cases: RVVC accepts the null hypothesis without re-sampling, with under-sampling, and with over-sampling. The T-test also shows that TF-IDF is more significant for the machine learning models; the model rejects the null hypothesis when trained on BoW features in comparison to TF-IDF.
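The test setup is likewise not spelled out; one common reading, comparing per-run evaluation scores of RVVC against a baseline with an independent two-sample T-test from SciPy, would look like the sketch below, where both score lists are purely illustrative.

```python
from scipy import stats

# Illustrative per-fold accuracy scores for RVVC and one baseline model.
rvvc_scores = [0.97, 0.96, 0.97, 0.96, 0.97]
baseline_scores = [0.93, 0.92, 0.93, 0.94, 0.93]

t_stat, p_value = stats.ttest_ind(rvvc_scores, baseline_scores)
print(t_stat, p_value)  # a small p-value indicates a statistically significant difference
```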

G. DISCUSSION
Experiments are carried out to analyze the impact of the SMOTE over-sampling and random under-sampling approaches on the performance of the selected machine learning models and the proposed RVVC model. Experimental results signify the superior performance of the machine learning models with SMOTE over-sampling. The results of applying SMOTE over-sampling and random under-sampling are shown in Figure 4 using a t-distributed stochastic neighbor embedding (TSNE) plot. Figure 4a indicates clearer and more impactful feature generation using SMOTE. The SMOTE technique provides an equal number of features for training the models, which leads to a significant increase in the performance of the models.

FIGURE 4. Impact of SMOTE as compared to under-sampling and without-sampling.

Conversely, for the under-sampling case, although the classes are equal, the reduced size of the feature set degrades the performance of the models. Moreover, the features are too scattered in the under-sampling case, as shown in Figure 4b. As a result, the distinctiveness of comments is reduced, which lowers the classification accuracy. Figure 4c shows that the data without sampling contain more data for the non-toxic class than for the toxic class. Additionally, the toxic data are scattered, and it is difficult for machine learning models to learn on a scattered dataset. So, the models give higher accuracy for the non-toxic class than for the toxic class. SMOTE helps to overcome these limitations and shows higher performance than both the imbalanced dataset and the under-sampled dataset for toxic comment classification.

The proposed model RVVC outperforms both the RNN and the machine learning models due to its structure. RVVC is an ensemble model which is a combination of LR and SVC and uses soft voting criteria to make the final prediction. RVVC performs better than the individual models because it combines the predictions from two well-performing models, LR and SVC. Computing the probability for each class using its individual models and then finding the target class with the maximum probability to make the final prediction elevates its performance compared to the individual models. This architecture makes RVVC more accurate.

V. CONCLUSION
This study analyzes the performance of various machine learning models for toxic comment classification and proposes an ensemble approach called RVVC. The influence of an imbalanced dataset and of datasets balanced using random under-sampling and SMOTE over-sampling on the performance of the models is analyzed through extensive experiments. Two feature extraction approaches, TF-IDF and BoW, are used to get the feature vectors for the models' training. Results indicate that models perform poorly on the imbalanced dataset while the balanced dataset tends to increase the classification accuracy. Besides the machine learning classifiers like SVM, RF, GBM, and LR, the proposed RVVC and the RNN deep learning model perform well with the balanced dataset. The performance with an over-sampled dataset is better than with the under-sampled dataset, as the feature set is larger when the data is over-sampled, which elevates the performance of the models. Balancing the data also reduces the chance of the models over-fitting, which happens when the imbalanced dataset is used for training. Moreover, TF-IDF shows better classification accuracy for toxic comments than BoW, as TF-IDF records the importance

of a word contrary to BoW which simply counts the occur- [15] R. Beniwal and A. Maurya, ‘‘Toxic comment classification using hybrid
rence of a word. The proposed ensemble approach RVVC deep learning model,’’ in Sustainable Communication Networks and Appli-
cation. Cham, Switzerland: Springer, 2021, pp. 461–473.
demonstrates its efficiency for toxic and non-toxic comments [16] A. N. M. Jubaer, A. Sayem, and M. A. Rahman, ‘‘Bangla toxic comment
classification. The performance of RVVC is superior both classification (machine learning and deep learning approach),’’ in Proc. 8th
with the imbalanced and balanced dataset, yet, it achieves Int. Conf. Syst. Modeling Adv. Res. Trends (SMART), Nov. 2019, pp. 62–66.
[17] B. van Aken, J. Risch, R. Krestel, and A. Löser, ‘‘Challenges for toxic com-
the highest accuracy of 0.97 when used with TF-IDF features ment classification: An in-depth error analysis,’’ 2018, arXiv:1809.07572.
from SMOTE over-sampled dataset. The performance com- [Online]. Available: http://arxiv.org/abs/1809.07572
parison with state-of-the-art approaches also indicates that [18] H. H. Saeed, K. Shahzad, and F. Kamiran, ‘‘Overlapping toxic sentiment
classification using deep neural architectures,’’ in Proc. IEEE Int. Conf.
RVVC shows better performance and proves good on small Data Mining Workshops (ICDMW), Nov. 2018, pp. 1361–1366.
and large feature vectors. Despite the better performance of [19] S. Malmasi and M. Zampieri, ‘‘Detecting hate speech in social
the proposed ensemble approach, its computational complex- media,’’ 2017, arXiv:1712.06427. [Online]. Available: http://arxiv.org/abs/
1712.06427
ity is higher than the individual models which is an important
[20] Z. Zhang, D. Robinson, and J. Tepper, ‘‘Detecting hate speech on Twitter
topic for our future research. Similarly, dataset imbalance can using a convolution-GRU based deep neural network,’’ in Proc. Eur.
overstate the results because data balancing using SMOTE Semantic Web Conf. Cham, Switzerland: Springer, 2018, pp. 745–760.
or random under-sampling approach may have a certain [21] P. A. Ozoh, M. O. Olayiwola, and A. A. Adigun, ‘‘Identification and
classification of toxic comments on social media using machine learning
influence on the reported accuracy. Moreover, we intend techniques,’’ Int. J. Res. Innov. Appl. Sci., vol. 4, no. 9, pp. 1–6, Nov. 2019.
to perform further experiments on multi-domain datasets [22] S. Carta, A. Corriga, R. Mulas, D. Recupero, and R. Saia, ‘‘A supervised
and run experiments on more datasets for toxic comment multi-class multi-label word embeddings approach for toxic comment
classification,’’ in Proc. KDIR, 2019, pp. 105–112.
classification. [23] B. Gambäck and U. K. Sikdar, ‘‘Using convolutional neural networks to
classify hate-speech,’’ in Proc. 1st Workshop Abusive Lang. Online, 2017,
ACKNOWLEDGMENT pp. 85–90.
(Vaibhav Rupapara and Furqan Rustam are co-first authors.) [24] S. R. Basha, J. K. Rani, J. P. Yadav, and G. R. Kumar, ‘‘Impact of
feature selection techniques in text classification: An experimental study,’’
J. Mech. Continua Math. Sci., no. 3, pp. 39–51, 2019.
REFERENCES [25] S. R. Basha and J. K. Rani, ‘‘A comparative approach of dimensionality
VAIBHAV RUPAPARA received the Master of Science degree in computer science from Florida International University, Miami, FL, USA. He has worked in different domains, including finance and healthcare. His expertise contributed towards achieving high-quality and scalable deliverability with security. His recent research interests include machine learning, AI, and deep learning.

FURQAN RUSTAM received the M.C.S. degree from the Department of Computer Science, The Islamia University of Bahawalpur, Pakistan, in October 2017. He is currently working as a Visiting Lecturer with the Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan. He is also serving as a Research Assistant with the Fareed Computing Research Center, KFUEIT. His recent research interests include data mining, machine learning, and artificial intelligence, mainly working on creative computing and supervised machine learning.

HINA FATIMA SHAHZAD received the B.S. degree in computer science from The Government Sadiq College Women University (GSCWU), Bahawalpur, in 2018. She is currently pursuing the M.S. degree in computer science with the Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan. Her research interests include text mining, data mining, machine learning, and deep learning-based IoT.

ARIF MEHMOOD received the Ph.D. degree from the Department of Information and Communication Engineering, Yeungnam University, South Korea, in 2017. From December 2017 to November 2019, he served as an Assistant Professor with the Information Technology Department, Khawaja Fareed University of Engineering and Information Technology. He is currently serving as an Assistant Professor with The Islamia University of Bahawalpur, Pakistan. He works mainly on deep learning-based text mining and data science management techniques.

IMRAN ASHRAF received the Ph.D. degree in information and communication engineering from Yeungnam University, South Korea, in 2018, and the M.S. degree in computer science from the Blekinge Institute of Technology, Karlskrona, Sweden, in 2010. He worked as a Postdoctoral Fellow with Yeungnam University, Gyeongsan, South Korea. He is currently working as an Assistant Professor with the Information and Communication Engineering Department, Yeungnam University. His research interests include indoor positioning and localization, advanced location-based services in wireless communication, smart sensors (LIDAR) for smart cars, and data mining.

GYU SANG CHOI received the Ph.D. degree from the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA, in 2005. From 2006 to 2009, he was a Research Staff Member with the Samsung Advanced Institute of Technology (SAIT), Samsung Electronics. Since 2009, he has been a Faculty Member with the Department of Information and Communication, Yeungnam University, South Korea. His research interests include non-volatile memory and storage systems.
