Optimizing YouTube Spam Detection With Ensemble Deep Learning Techniques
2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence)
In [14], Roy et al. used deep learning methods to separate spam from real SMS messages by combining LSTM and CNN. They contrasted their strategy with well-established machine learning techniques such as LR, Stochastic Gradient Descent, Gradient Boosting, and Random Forest. According to the results, the CNN-LSTM models outperformed the other approaches. Additionally, they suggested two methods for SMS spam identification: one that combined an MLP network with a pre-trained "BERT" language model for 98.37% accuracy, and a hybrid CNN-LSTM classifier for 94.37% accuracy.

Moving on to the next phase, an overview of the multi-faceted landscape of spam detection across different platforms shows that accuracies of approximately 90-94% have been achieved; however, for spam detection on YouTube, only 89-90% has been recorded. The objective is to improve the accuracy rate and decrease false positives by combining deep learning approaches with ensembling techniques. This process is explained in detail in Section III of this paper.

III. METHODOLOGY

This section discusses the overall approach and method used to classify YouTube comments by individually analyzing DL methods such as LSTM, CNN, the CNN-LSTM combination, and Attention models, and comparing them with the results obtained from combining them.

Fig. 1. Architectural diagram for processes involved in the classification.

The dataset has been taken from the comment sections of 5 different music artists. It has been directly extracted from Kaggle, where it had already been classified as spam or ham, and provides 1.6k+ comments to work on. First, the raw dataset is taken and preprocessed using NLP techniques to make it ready for implementation.

In the implementation phase, all DL techniques introduced above and discussed below are coded on the dataset, and resultant values such as F1 score, accuracy, and precision are recorded to make a comparative analysis of these with an ensembled and stacked model. In this way, we can say whether the approach and objective of this paper have succeeded or failed in addressing this issue. The processes followed are elaborated below, starting with Data Selection in part (A).

A. Data Selection

A dataset from Kaggle has been used in this paper. It consists of a spam/ham collection from 5 popular global artists including Katy Perry, LMFAO, Psy, and others. The dataset is reasonably balanced, with 1005 spam and 750 ham comments. Since it is pre-labelled, supervised learning is employed here. The dataset is a CSV file with comment id, author, date, content and label as columns. For future work, procuring the comments using the YouTube API would provide a richer but unbalanced dataset.

low-dimensional dense vector that retains this collected information and also identifies nuisance texts that are also spam.
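The CSV layout described under Data Selection (comment id, author, date, content, label) can be inspected before any model is trained. The following is a minimal sketch, not the authors' code: the column names `COMMENT_ID`, `AUTHOR`, `DATE`, `CONTENT` and `CLASS`, the 1 = spam / 0 = ham encoding, and the inline sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the Kaggle CSV. The column names
# and the 1 = spam / 0 = ham encoding are assumptions, not taken from the paper.
SAMPLE_CSV = """COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
c1,user_a,2015-01-01,Check out my channel for free gifts,1
c2,user_b,2015-01-02,This song never gets old,0
c3,user_c,2015-01-03,Subscribe to me and win a phone,1
c4,user_d,2015-01-04,Love the choreography in this video,0
c5,user_e,2015-01-05,Visit my site for cheap followers,1
"""


def load_comments(handle):
    """Read (comment text, label) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["CONTENT"], int(row["CLASS"])) for row in reader]


def class_balance(pairs):
    """Count how many comments carry each label."""
    counts = Counter(label for _, label in pairs)
    return {"spam": counts[1], "ham": counts[0]}


comments = load_comments(io.StringIO(SAMPLE_CSV))
balance = class_balance(comments)
print(balance)
```

For a real file, the same functions could be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` sample; checking the class counts first makes it clear whether accuracy alone is a fair summary of model quality.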
Moving on to the final part of the third section, we look at stacking classification and what it is used for.

B. Stacking Classifiers and Result

IV. RESULT

$$\text{Accuracy} = \frac{TP + TN}{\text{Total Predictions}} \tag{1}$$

• Recall (true positive rate or sensitivity): By evaluating the ratio of true positives to all actual positive cases, recall assesses the rate of catching all genuine positive cases.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
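The accuracy and recall ratios used in the results can be computed directly from confusion-matrix counts. The sketch below is illustrative only: the counts are hypothetical, "total predictions" is taken to mean TP + TN + FP + FN, and the precision and F1 functions use the standard textbook definitions, which are assumed here because their equations do not appear in this excerpt.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct: (TP + TN) / total."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    """Share of actual positives that were caught: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted positives that are correct: TP / (TP + FP).

    Standard definition, assumed here rather than quoted from the paper.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall (standard definition)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion-matrix counts for the spam class (not from the paper).
tp, tn, fp, fn = 85, 102, 4, 15

p = precision(tp, fp)
r = recall(tp, fn)
print(f"accuracy={accuracy(tp, tn, fp, fn):.3f} "
      f"precision={p:.3f} recall={r:.3f} f1={f1_score(p, r):.3f}")
```

The F1 function also gives a quick consistency check on reported figures: a precision of 0.96 with a recall of 0.85 yields an F1 score that rounds to 0.90.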
Fig. 3. (b) Comparative analysis of the models for identification of Ham

TABLE II. COMPARISON OF RESULTS FOR ALL MODELS

Algorithm   Class   Precision   Recall   F1-Score   Accuracy
Attention   Spam    0.82        0.82     0.82       0.86
            Ham     0.88        0.88     0.88
CNN         Spam    0.81        0.96     0.88       0.91
            Ham     0.97        0.86     0.91
LSTM        Spam    0.82        0.97     0.89       0.90
            Ham     0.98        0.86     0.91
CNN-LSTM    Spam    0.85        0.91     0.88       0.90
            Ham     0.94        0.90     0.92
Stacking    Spam    0.96        0.85     0.90       0.935
            Ham     0.90        0.97     0.94

The results of the individual techniques and their analysis have been tabulated above. The stacking ensemble reaches the highest accuracy (0.935, against 0.86 to 0.91 for the individual models). This research can be extended in future studies with further quantitative analysis; however, taking into account other factors such as running time and result segregation, we conclude that the ensemble method proves more effective than the others.

V. CONCLUSION

REFERENCES

[1] Sjarif, Nilam & Mohd Azmi, Nurulhuda & Chuprat, Suriayati & Sarkan, Haslina & Yahya, Yazriwati & Sam, Suriani. (2019). SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm. Procedia Computer Science. 161. 509-515.
[2] Portio Research. Worldwide A2P SMS Markets 2014–2017: Understanding and Analysis of Application-to-Person Text Messaging Markets Worldwide; Portio Research Limited: Chippenham, UK, 2014.
[3] Yerima, Suleiman & Bashar, Abul. (2022). Semi-supervised novelty detection with one class SVM for SMS spam detection. 10.1109/IWSSIP55020.2022.9854496.
[4] Liu, Xiaoxu & Lu, Haoye & Nayak, Amiya. (2021). A Spam Transformer Model for SMS Spam Detection. IEEE Access. PP. 1-1. 10.1109/ACCESS.2021.3081479.
[5] P. K. Roy, J. P. Singh, and S. Banerjee, "Deep learning to filter SMS spam," Future Gener. Comput. Syst., vol. 102, pp. 524–533, Jan. 2020.
[6] T. B. Brown et al., "Language models are few-shot learners," 2020, arXiv:2005.14165. [Online].
[7] Prof. R.S and Ms. Rachana Mishra, 2013, "Thakur: Analysis of Random Forest and Naïve Bayes for Spam Mail using Feature Selection Categorization".
[8] H. Oh, "A YouTube Spam Comments Detection Scheme Using Cascaded Ensemble Machine Learning Model," in IEEE Access, vol. 9, pp. 144121-144128, 2021, doi: 10.1109/ACCESS.2021.3121508.
[9] Choi, Seungwoo & Segev, Aviv. (2016). Finding informative comments for video viewing. 2457-2465. 10.1109/BigData.2016.7840882.
[10] S. Sharmin and Z. Zaman, "Spam detection in social media employing machine learning tools for text mining," in Proc. 13th Int. Conf. Signal-Image Technol. Internet-Based Syst. (SITIS), Dec. 2017, pp. 137–142, doi: 10.1109/SITIS.2017.32.
[11] A. O. Abdullah, M. A. Ali, M. Karabatak, and A. Sengur, "A comparative analysis of common Youtube comment spam filtering techniques," in Proc. 6th Int. Symp. Digit. Forensic Secur. (ISDFS), Mar. 2018, pp. 1–5, doi: 10.1109/ISDFS.2018.8355315.
[12] E. Poche, N. Jha, G. Williams, J. Staten, M. Vesper, and A. Mahmoud, "Analyzing user comments on Youtube coding tutorial videos," in Proc. IEEE/ACM 25th Int. Conf. Program Comprehension (ICPC), May 2017, pp. 196–206, doi: 10.1109/ICPC.2017.26.
[13] J. Gracia Betty, R. Harivarthini, O. Deepthi, R. Pari, and P. Maharajan, "YouTube Video Spam Comment Detection Using Light Gradient Boosting Machine," 2023 5th International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2023, pp. 1650-1656, doi: 10.1109/ICIRCA57980.2023.10220615.
[14] Roy, P.K.; Singh, J.P.; Banerjee, S. Deep learning to filter SMS Spam. Future Gener. Comput. Syst. 2020, 102, 524–533.