Proceedings of the Second International Conference on Applied Artificial Intelligence and Computing (ICAAIC 2023)

IEEE Xplore Part Number: CFP23BC3-ART; ISBN: 978-1-6654-5630-2

Offensive Social Network Posts Classification using Apache Spark Platform

Sandhiya.B
Dept. of Computer Science and Engineering,
Amrita School of Computing, Bengaluru
Amrita Vishwa Vidyapeetham, India
[email protected]

Saravanan.S
Dept. of Computer Science and Engineering,
Amrita School of Computing, Chennai
Amrita Vishwa Vidyapeetham, India
[email protected]
2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC) | 978-1-6654-5630-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICAAIC56838.2023.10140980

Abstract: Social networking has become one of the most significant aspects of everyday life because it allows us to communicate with many people. Social networking is used for many purposes besides communication. Although social networking platforms provide abundant benefits, there is no proper mechanism to check that only legitimate users are using them. As a result, many users use these platforms to make offensive and vulgar comments about others. This research study classifies social media messages as offensive or not offensive using the machine learning models available in the Apache Spark framework, implemented in the Scala programming language. The dataset used for the classification is 43 GB in size, and the work was carried out on a multi-node cluster built with the Apache Spark platform.

Index Terms: Social Network, Offensive Language Detection, Apache Spark Platform, Clustering, Classification, Scala Programming Language.

1. INTRODUCTION

According to statistics, "As of April 2022, there were more than five billion internet users globally, accounting for 63.1 percent of the global population. Of this total, 4.7 billion people, or 59% of the world's population, used social media." Social networking is used for many purposes besides communication, including marketing, education, and supporting charitable causes. Even though the platforms claim that they do not support illegitimate users and that such accounts will be deleted, many users still use social media platforms to make offensive and vulgar comments about others. There are many methods to detect offensive language. One such method is text classification, the process of assigning text to a particular label or category [1], [2]. Text classification can be applied to any kind of text, such as documents, news articles, social media posts, and comments. However, the classifier needs to be trained with more data, as this leads to better accuracy [3], [4]. The trained model must also handle millions of online posts while classifying them in real time.

Hence, this study proposes an offensive language detection tool based on Apache Spark, a big data platform, trained with 40 GB of training data. The proposed system is designed to handle millions of real-time streaming posts and detect the offensive language in those posts.

2. RELATED WORK

An Apache Spark-based machine learning framework has been built for the Swahili language, one of the most widely spoken languages in Africa. The system uses Twitter's API and Apache Spark's PySpark. A sentiment tag (positive, normal, or offensive) is assigned to each message. Based on the assigned sentiment tag, the offensive messages were then classified using different machine learning algorithms with the help of PySpark: Logistic Regression, Random Forest, Decision Tree, and Naive Bayes. The system classified offensive language both with and without PySpark, and a better result was found for classification using PySpark [5].

A novel approach for offensive language detection in social media has been implemented using an ensemble of classifiers. Three sub-tasks were addressed: Task A, offensive language identification; Task B, offense type identification; and Task C, offense target identification. The datasets for all the sub-tasks were taken from Twitter. Different classification models were used to classify offensive messages, such as DistilBERT, XLNet, CNN-LSTM, and other ensembles of deep learning classifiers. Promising results were observed for sub-task A, although sub-tasks B and C require some improvement [6].

An approach to detect hate speech text in Afan Oromo social media with the help of machine learning has been implemented. Afan Oromo is spoken predominantly in Ethiopia, mainly in the Oromia region, and in parts of Kenya. Comments from social media platforms such as Twitter and Facebook were used as the data. After the required pre-processing and labeling of the data, various classifiers such as Support Vector Classifier, Multinomial Naïve Bayes, Linear Support Vector Classifier, Logistic Regression, Decision Tree, and Random Forest Classifier were used for classifying hate speech text. The novel idea proposed in the system is a way of combining n-grams and TF-IDF for feature extraction [7].


A systematic review has summarised various approaches for detecting fake news and hate speech in Ethiopian languages. It covers machine learning approaches, including SVM, Multinomial Naïve Bayes, Linear Support Vector Classifier, Logistic Regression, Decision Tree, and Random Forest Classifier, as well as deep learning approaches such as Bi-GRU, CNN, attention-based models, and Bi-LSTM. In addition, text-to-feature conversion techniques such as Word2Vec, TF-IDF, and n-grams are summarized along with the novelty of each approach [8].

A machine learning framework for detecting online aggression on Twitter has been built using Apache Spark's streaming component. The work is claimed to be the first practical, real-time framework for aggression detection on Twitter. State-of-the-art streaming ML methods such as Hoeffding Trees, Adaptive Random Forests, and Streaming Logistic Regression were implemented to detect online aggression. Because the framework is built on Apache Spark's streaming component, the authors claim that, with only three standard machines, it can readily scale to support the entire Twitter Firehose and increase throughput to over 14k tweets/sec [9].

A one-vs-rest-based model has been implemented to classify comments into three categories: Hate Speech, Normal, and Offensive. Furthermore, the work analyzes how these three categories are spread across 17 target groups, which are drawn from gender, race, religion, sexual orientation, and miscellaneous categories. Different algorithms such as Binary Relevance, Label Powerset, Multi KNN, and one-vs-rest were used for classification. The paper's novelty lies in its detailed analysis of how well the classifier handles the 17 target groups rather than focusing on global accuracy [10].

An open-source content management system has been implemented for content creation and management. The system has an inbuilt hate speech detection component built using the state-of-the-art mono BERT model. It is claimed to raise an alert and block any published content that is found to be hateful [11].

A system to identify offensive language with noisy labels has been implemented. For the classification, a hybrid system is built around the BERT classifier. It consists of three components: a statistical sampling algorithm, the BERT classifier, and post-processing. The system is reported to have an F1 score of 90% and to have produced no false negatives [12].

A hate speech detection system for the Vietnamese language has been implemented. A combination of the PhoBERT and CNN models is used to detect and classify hate speech. The detection and classification system is integrated with a streaming processing system, and the PhoBERT-CNN model consumes the data stream from it to detect and classify hate speech [13].

Figure 1. Flow diagram of the proposed system architecture

The proposed work here is to classify social media messages as offensive/not offensive. The uniqueness of the proposed work lies in the fact that each message's labels are individually made based on a list of offensive/profane


words. Additionally, since offensive, slang, and colloquial words vary by region, the proposed system allows any number of words to be added to the list that is used to determine each message's label. Furthermore, the cluster computing provided by the Apache Spark platform gives the advantage of parallel processing, which allows a massive volume of data to be trained on quickly.

The remaining sections of the paper are organized as follows. Section 3 describes the proposed methodology and the techniques used in detail. Section 4 describes the results observed from the proposed system and discusses the insights obtained from them. Section 5 concludes the study.

3. SYSTEM DESIGN AND IMPLEMENTATION

The proposed method is an offensive language detection tool based on Apache Spark, a big data platform, trained with 40 GB of training observations. An offensive language detection tool, in simple terms, is a machine learning-based classification model that detects and classifies offensive language in social network posts. The proposed methodology classifies social media posts as offensive or non-offensive and analyzes the results.

The flow diagram of the proposed system architecture is shown in Figure 1. The overall system was implemented on a cluster provided by the Apache Spark platform, using the Scala programming language. The following subsections provide a step-by-step breakdown of every action shown in the flow diagram.

3.1 Apache Spark and Scala Programming Language

Apache Spark extends batch processing to support various workloads, such as interactive queries, streaming, machine learning, and graph processing. Spark is a "computational engine" in charge of scheduling, distributing, and monitoring applications comprising many computational tasks distributed across a cluster of worker machines. The same cluster concept has been used to implement the proposed system, using the Apache Spark cluster computing tool [14]. Before the cluster is created, the machines that will form it should have the Apache Spark engine downloaded. To create the cluster, the Spark Master page is opened on the master node, and the IP address of the master node is then copied to the slave nodes.

After the successful completion of the setup, a cluster is created. For the proposed system, a cluster was made using two worker (slave) nodes and one master node. Figure 2 shows the two worker nodes connected to the master node.

Figure 2. Clustering slave nodes with the master node

The programming language used to implement the classification algorithm is Scala. The name Scala stands for Scalable Language. It blends object-oriented and functional programming. Scala is concise, meaning that complex concepts can be implemented in a few lines of code. The implementation uses a few Spark libraries/packages, such as StringIndexer, VectorAssembler, LogisticRegression, and MulticlassMetrics.

3.2 Dataset Description

Nearby is a location-based social networking service primarily used by users from the United States, the United Kingdom, and India. Nearby was launched in June 2010 and was in everyday use until September 2021, after which it was discontinued. The dataset used for this work includes ten years of data from the "Nearby" social network. The features in the dataset are briefly explained in Table 1. The dataset was downloaded from Kaggle [15], and its size is 43 GB. The main feature considered for the proposed system is the "Post Text" feature, as it carries the information most relevant to deciding whether a post is offensive or non-offensive. Based on this "Post Text" feature, the respective target "Labels" were created and then given to the machine learning model for classification.

TABLE 1. DATASET DESCRIPTION

Feature Name      Feature Representation
Post ID           Unique identifier given to each post
Parent Post ID    Unique identifier given to each parent post
Profile ID        Unique identifier given to each profile
Posted On         Date on which a particular post was posted
Latitude          Geographical location indicating where the post was posted
Longitude         Geographical location indicating where the post was posted
Image ID          Unique identification number assigned to each image
Post Text         Text, caption, or message contained in a post

3.3 Feature Extraction

The first step of the proposed system is importing the dataset into the Apache Spark environment.

Once the dataset is imported, the features required for classification can be extracted. Before extracting the features, however, a vital step must be completed: pre-processing. It is well known in the machine learning community that whenever text data comes into

the picture, some pre-processing is necessary before the data can be used for classification. The pre-processing steps performed here include changing the data type of the features to int or double, creating vectors, and creating labels for the classes. As noted earlier, only the "Post Text" feature is considered as input in the proposed system. The labels for each observation therefore need to be created from the information available in the "Post Text" feature.
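A minimal sketch of how such labels might be derived from the post text is given below. It is written in plain Python rather than the Scala/Spark code used in this work, and it assumes the rule used in this section: a post is labelled "Offense" if it contains at least one word from the offensive word list, and "No Offense" otherwise. The function name `label_post`, the placeholder word list, and the case-insensitive whole-word tokenization are illustrative assumptions, not details taken from the paper.

```python
import re

def label_post(post_text, offensive_words):
    """Return "Offense" if the post contains at least one listed word,
    otherwise "No Offense". Missing or empty text gets "No Offense"."""
    if not post_text:
        return "No Offense"
    # Assumption: lower-case the text and compare whole tokens only.
    tokens = set(re.findall(r"[a-z0-9']+", post_text.lower()))
    return "Offense" if tokens & offensive_words else "No Offense"

# Placeholder entries standing in for a (filtered) offensive word list.
OFFENSIVE_WORDS = {"badword", "slur"}

posts = ["Have a great day!", "You are such a BADWORD", None]
labels = [label_post(p, OFFENSIVE_WORDS) for p in posts]
print(labels)
```

Because the comparison is per token, multi-word phrases in a word list would need separate handling; extending the list for a particular region only requires adding entries to the set.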

To decide whether an observation is offensive or non-offensive, each observation is checked for words from a list of offensive words. If the observation contains even one word from the list, it is labelled as offensive; otherwise, it is labelled as non-offensive. This step, in which each observation in the dataset is checked for offensive words and assigned the label "Offense" or "No Offense," is called feature extraction. The list of offensive words was taken from Luis von Ahn's "Offensive/Profane Word List" resource [16]. The list contains words that a person may or may not find offensive, so it can be filtered according to the requirement before being used to block offensive or profane terms on a site.

Figure 3. A snippet of the data before labeling

Figure 3 shows the "Post Text" feature, which is the input to the proposed system. The "Target" column is null, meaning that labels have not yet been assigned to the observations. Figure 4 shows the same feature with the label column, where each observation is labelled as either "Offense" or "No Offense." From Figure 4, it can also be inferred that the total number of observations


labelled as "No Offense" is greater than the number labelled as "Offense." This implies that the data used here is imbalanced.

Figure 4. A snippet of the data after labeling

After this, the remaining pre-processing steps are performed: removing NAs, converting the raw text features to vectors using string indexing, converting the labels to numeric form using label encoding, and normalizing the data to scale it. Figure 5 shows how the data looks once all the pre-processing steps are done. Finally, the data is split into training and testing data in an 80:20 ratio, with 80% used for training and 20% for testing.

Figure 5. A snippet of the data after performing pre-processing

3.4 Classifier

Once the data is split into training and testing data, the packages required for training, testing, and classification are imported. The classifier used in the proposed system is Logistic Regression. Logistic Regression is used to assess the relationship between a dependent variable and one or more independent variables when the dependent variable is binary. In this paper, the dependent variable is the class of the post (Offensive/Not Offensive), and the independent variable is the Post Text feature. Once training is done, the model predicts the labels for the test data, and the model is evaluated.

4. RESULTS

An Apache Spark-based offensive social media post classification system has been implemented here. Since the dataset used here is imbalanced, the F1 score is chosen as the evaluation metric, and the accuracy of the classifier is also reported. The F1 score combines precision and recall using their harmonic mean. As a result, the F1 score is commonly used for evaluating models in conjunction with accuracy.
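To make the metric concrete, the following plain-Python sketch computes precision, recall, F1 score, and accuracy from confusion-matrix counts, where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives. The function name `classification_metrics` is hypothetical and the counts are invented for illustration; they are not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Illustrative counts only: 100 positive and 900 negative observations.
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(metrics)
```

With these made-up counts, accuracy is 0.96 while precision, recall, and F1 are each about 0.80, which illustrates why accuracy alone can look optimistic when one class dominates.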


The proposed model detects offensive social network posts. Hence, Positive refers to the label "Offense," and Negative refers to the label "No Offense." TP stands for True Positive: observations whose actual label is Positive and whose predicted label is also Positive. FP stands for False Positive: observations whose actual label is Negative but which the model predicts as Positive. TN stands for True Negative: observations whose actual label is Negative and whose predicted label is also Negative. FN stands for False Negative: observations whose actual label is Positive but which the model predicts as Negative.

Precision is defined as the ratio of correctly classified positive samples (True Positives) to the total number of samples classified as positive, whether correctly or incorrectly (Equation 1).

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

Recall is calculated as the proportion of Positive samples correctly classified as Positive to the total number of Positive samples (Equation 2).

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

The F1 score is the harmonic mean of precision and recall (Equation 3).

\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]

Accuracy is the proportion of correctly predicted observations among all observations (Equation 4).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]

A Logistic Regression classifier was implemented to classify offensive social network posts. The dataset used was imbalanced. The classifier gives an accuracy of 93.25% and an F1 score of 96.51%, as shown in Table 2.

TABLE 2. THE RESULT OBSERVED FOR THE PROPOSED WORK

Classification Algorithm    Accuracy    F1-Score
Logistic Regression         0.9325      0.9651

Because the master node and slave nodes are clustered together, all the cores of the slave nodes are utilized when training is done. This is what makes the training and testing process fast even though the dataset is enormous: all the slave nodes process any given task in parallel. Figure 6 shows that all the cores of the slave nodes are being utilized during the training process.

Figure 6. A snippet showing all CPU cores of the slave nodes being utilized

5. CONCLUSION

From the results observed, it can be concluded that the proposed system gives reliable results for distinguishing offensive social network posts from non-offensive ones. Since all of the training, testing, and classification was done on a cluster of multiple nodes connected to the master node, training took less than 5 minutes even though the data is 43 GB in size. Without Apache Spark, on a normal 8-core system, the same task would be expected to take hours of training. The Apache Spark platform allows multiple nodes to be clustered together so that the same task is performed in minutes. The system implemented here is only a starting point, and many other tasks can be accomplished using the Apache Spark platform.

The future scope of the research includes finding which locations have a higher number of offensive posts. This can be achieved using the Latitude and Longitude features available in the dataset, together with a geocoding package such as "Geocoder." Also, for each account/profile, the total count of offensive posts can be tracked. If an account/profile crosses a threshold, then that particular


account/profile can be restricted or blocked from further use of the social network.

REFERENCES

[1] Prasanna Kumar Kumaresan, Premjith, Ratnasingam Sakuntharaj, Sajeetha Thavareesan, Subalalitha Navaneethakrishnan, Anand Kumar Madasamy, Bharathi Raja Chakravarthi, and John P. McCrae, "Findings of Shared Task on Offensive Language Identification in Tamil and Malayalam", Proceedings of the 13th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE '21), pp. 16-18, January 2022.
[2] Madhusmita Khuntia and Deepa Gupta, "Indian News Headlines Classification using Word Embedding Techniques and LSTM Model", Procedia Computer Science, vol. 218, pp. 899-907, 2023.
[3] Priyanga, V.T et al., "Exploring Fake News Identification Using Word and Sentence Embeddings", Journal of Intelligent & Fuzzy Systems, vol. 41, no. 5, pp. 5441-5448, November 2021.
[4] R. S. V. Mukhesh and A. Soni, "Identifying similar sentences when processed by Apache Spark", 2022 IEEE 4th International Conference on Cybernetics, Cognition and Machine Learning Applications (ICCCMLA), Goa, pp. 504-510, December 2022.
[5] F. Jonathan, D. Yang, G. Gowing, and S. Wei, "Machine Learning Framework for Detecting Offensive Swahili Messages in Social Networks with Apache Spark Implementation", 2021 IEEE International Conference on Progress in Informatics and Computing (PIC), pp. 293-297, January 2022.
[6] Mahen Herath, Thushari Atapattu, Hoang Anh Dung, Christoph Treude, and Katrina Falkner, "Ensemble of Classifiers for Offensive Language Detection in Social Media", Proceedings of the 14th International Workshop on Semantic Evaluation, pp. 1516-1523, December 2020.
[7] Naol Bakala Defersha and Kula Kekeba Tune, "Detection of Hate Speech Text in Afan Oromo social media using Machine Learning Approach", Indian Journal of Science and Technology, pp. 2567-2578, September 2021.
[8] Wubetu Barud Demilie and Ayodeji Olalekan Salau, "Detection of fake news and hate speech for Ethiopian languages: a systematic review of the approaches", Journal of Big Data, vol. 9, article no. 66, May 2022.
[9] Herodotos Herodotou, Despoina Chatzakou, and Nicolas Kourtellis, "A Streaming Machine Learning Framework for Online Aggression Detection on Twitter", arXiv:2006.10104, November 2020.
[10] A. Aditya, R. Vinod, A. Kumar, I. Bhowmik, and J. Swaminathan, "Classifying Speech into Offensive and Hate Categories along with Targeted Communities using Machine Learning", 2022 International Conference on Inventive Computation Technologies (ICICT), pp. 291-295, August 2022.
[11] S. Veerasamy, Y. Khare, A. Ramesh, A. S, P. Singh, and A. T, "Hate Speech Detection using mono BERT model in custom Content-Management-System", 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 1681-1686, February 2022.
[12] Manikandan Ravikiran, Amin Ekant Muljibhai, Toshinori Miyoshi, Hiroaki Ozaki, Yuta Koreeda, and Masayuki Sakata, "Offensive Language Identification with Noisy Labels using Statistical Sampling and Post-Processing", arXiv preprint arXiv:2005.00295, May 2020.
[13] Khanh Q. Tran, An T. Nguyen, Phu Gia Hoang, Canh Duc Luu, Trong-Hop Do, and Kiet Van Nguyen, "Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data", doi:10.48550/arXiv.2206.00524, June 2022.
[14] Apache Spark - https://spark.apache.org/.
[15] Brian Hamachek. (2021). Nearby Social Network - All Posts [Dataset]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/2054752. Dataset - https://www.kaggle.com/datasets/brianhamachek/nearbysocial-network-all-posts.
[16] Offensive Word List - https://www.cs.cmu.edu/~biglou/resources/.