Offensive Social Network Posts Classification Using Apache Spark Platform
An open-source content management system has been implemented to support content creation and management. The system has a built-in hate speech detection component based on the state-of-the-art mono BERT model. It is claimed to raise an alert and block any published content that is found to be hateful [11].

A system to identify offensive language in data with noisy labels has been implemented. For the classification, a hybrid system is built around the BERT classifier. The hybrid system consists of three components: a statistical sampling algorithm, the BERT classifier, and post-processing. The system is reported to reach an F1 score of 90% without producing any false negatives [12].

A hate speech detection system for the Vietnamese language has been implemented. A combination of the PhoBERT and CNN models is used to detect and classify hate speech. The detection and classification system is integrated with a stream processing system, and the PhoBERT-CNN model consumes the data stream from that system to detect and classify hate speech [13].

The proposed work here is to classify social media messages as offensive/not offensive. The uniqueness of the proposed work lies in the fact that each message’s labels are individually made based on a list of offensive/profane
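As a rough illustration of list-based labeling, the sketch below marks a message as offensive when any of its tokens appears in a word list. It is a minimal sketch under stated assumptions, not the labeling procedure of the proposed work: the word list `OFFENSIVE_WORDS` is a hypothetical placeholder, and the regex tokenization, the case-insensitive exact-token match, and the function name `label_message` are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder entries; the actual word list is not reproduced here.
OFFENSIVE_WORDS = {"badword", "slur"}

# Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def label_message(text, offensive_words=OFFENSIVE_WORDS):
    """Return 1 (offensive) if any token of `text` is in the word list, else 0."""
    tokens = TOKEN_RE.findall(text.lower())
    return int(any(tok in offensive_words for tok in tokens))


messages = [
    "Have a nice day",
    "you are a BADWORD",
    "badwording is not a token match",
]
# Each message is labeled on its own, independently of the others.
labels = [label_message(m) for m in messages]
```

Because each message is labeled independently of the others, a function of this shape could in principle be applied row by row in a distributed setting; exact-token matching is only one possible design, and substring or stemmed matching would label some messages differently.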
\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (1) \]

The recall is calculated as the proportion of Positive samples correctly classified as Positive to the total number of Positive samples, Equation 2.

\[ \text{Recall} = \frac{TP}{TP + FN} \qquad (2) \]

Here TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

F1 score is the harmonic mean of the precision and recall, Equation 3.

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3) \]

Accuracy is the proportion of correctly predicted observations among all observations, which is given in Equation 4.

account/profile can be restricted or blocked further from using the social network.

5. CONCLUSION
From the results observed, it can be concluded that the proposed system gives reliable results for distinguishing offensive social network posts from non-offensive ones. Since all of the testing, training, and classification was carried out on a cluster of multiple nodes attached to a master node, the training took less than 5 minutes, even though the size of the data is 43 GB. If performed without the Apache Spark technology on a normal 8-core system, the same task would have taken hours of training. The Apache Spark platform allows us to cluster multiple nodes together and perform the same task in minutes. The system implemented here is just the tip of the iceberg. Many other tasks can be accomplished by using the Apache Spark platform.

REFERENCES
[15] Brian Hamachek. (2021). Nearby Social Network - All Posts [Dataset]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/2054752. Dataset: https://www.kaggle.com/datasets/brianhamachek/nearbysocial-network-all-posts.
[16] Offensive Word List. https://www.cs.cmu.edu/~biglou/resources/.