An Approach For Spam Detection in Youtube Comments Based On Supervised Learning

The document discusses spam detection in YouTube comments. It provides background on the growth of YouTube and the emergence of spam as a problem. The authors evaluate several machine learning techniques for automatically filtering spam comments, including multilayer perceptrons and support vector machines. They aim to address challenges in detecting spam on YouTube due to the short, informal nature of comments.


2nd National Conference on Emerging Technologies, December 6-7, 2016, University of South Asia, Lahore, Pakistan

An Approach for Spam Detection in YouTube Comments Based on Supervised Learning

Amir Ali, Muhammad Zain Amin

Department of Computer Science and Engineering
University of Engineering and Technology
Lahore, Pakistan

Abstract

In today's society, online social media sites such as YouTube, Twitter, Facebook, and LinkedIn are very popular. People turn to social media to interact with other people, gain knowledge, share ideas, be entertained, and stay informed about events happening in the rest of the world. Among these sites, YouTube has emerged as the most popular website for sharing and viewing video content. However, such success has also attracted malicious users who aim to self-promote their videos or disseminate viruses and malware. These spam videos may be unrelated to their title or may contain pornographic content. Therefore, it is very important to find a way to detect these videos and report them. In this work, we have evaluated several top-performing classification techniques for this purpose. The statistical analysis of the results indicates that the Multilayer Perceptron and Support Vector Machine show good accuracy.

Keywords: Text Classification, TubeSpam Detection, MLPs, SVM.

1. Introduction

YouTube is a very successful video-sharing company. It has more than one billion users, almost 33% of all internet users. The success of YouTube can be expressed through recent statistics reported by Google [1]: the platform has more than 1 billion users, 300 hours of video are uploaded every minute, and it generates billions of views every day. Around 60% of a creator's views come from outside their home country, and half of YouTube views are on mobile devices.
As of 2016, the company had paid 2 billion US dollars, since 2007, to the producers who chose to monetize claims. After the launch of the monetization system, the YouTube site was plagued by very low-quality content that can be considered spam videos and spam comments. A spam comment can be defined as a comment that is not relevant to the specific content of the web page. Comment spam has been used to publish unwanted content, advertise sales, promote pornographic content, degrade a website's standing, or make a website appear reliable by inflating its view count. The spam found on YouTube is directly related to the attractive profit offered by the monetization system.
According to a press release by Google, more than a million advertisers are using Google ad platforms, mobile revenue on YouTube is up 100% year over year, and the number of hours people watch on YouTube each month is up 50% year over year. At the same time, according to Nexgate, a computer security company, the volume of social spam increased by 55% in the first half of 2013 alone [2]. For each spam message found on any other social network, about 200 are found on Facebook and YouTube.
The problem became so serious that it motivated users to create a request in 2012, in


which they ask YouTube to provide tools to deal with undesired content [3]. In 2013, the official YouTube blog reported efforts to deal with undesired comments through recognition of malicious links, ASCII art detection, and display changes to long comments [4]. However, many users are still not satisfied with such solutions. In fact, in 2014, the user “PewDiePie”, owner of the most subscribed channel on YouTube (nearly 40 million subscribers), disabled comments on his videos, claiming that most of the comments were spam and that there was no tool to deal with them [5].
The problem caused by social spam began to be critically discussed in 2010, although an earlier work dates from 2005 [6]. Undesired comments on YouTube nevertheless still harm the platform’s community, which shows that the problem requires attention and research.
Established techniques for automatic spam filtering perform worse when dealing with YouTube comments. This is mainly because such messages are usually very short and rife with idioms, slang, symbols, emoticons, and abbreviations, which make even tokenization a challenging task.
Given this situation, this paper presents a comprehensive performance evaluation of several well-known machine learning techniques that can be applied to automatically filter such undesired messages.
The paper is organized as follows: In Section 2, we briefly describe the related work. Section 3 explains the System Architecture. In Section 4, we present the achieved results. Finally, Section 5 describes the main conclusion and future work.

2. Literature Review

Spam is usually associated with undesired content that carries little valuable information. It is commonly found as images, text, or videos, hindering the visualization of interesting content. There is much research on spam in the literature, covering web spam [7], blog spam [8], [9], e-mail spam [10], [11], and SMS spam [12], [13] filtering. In social networks, undesired messages are known as social spam.

Alex Kantchelian et al. developed a Spam Detection (SD) technique that identifies useless and superfluous content in blogs, making significant stories more accessible to regular readers. They suggested extending their work by widening the definition of spam (for example to URLs and short messages), adding awareness of adversaries, and deploying the method online so that future comments can be classified as they arrive [14].

In an online social network, spam originates from our friends and thus reduces the pleasure of communication. Spam is normally detected in text form. The system collects large amounts of data from an online social network and uses that data to identify spam. The identified spam is then used to generate templates. Whenever a new stream of messages arrives, it is matched against the generated templates to decide whether each message is spam, which reduces the execution time of spam identification. This framework is called Tangram [15].

Enhua Tan et al. designed a runtime SD scheme called BARS (Blacklist-Assisted Runtime SD), which constructs a database of spam URLs against which the URLs of every new post are checked to determine whether the post is spam. Clustering user IDs based on shared URLs further increased the effectiveness of detection. However, the efficiency of this approach depends on how well the URL blacklist is maintained [16].

Blog comment spam is the situation most similar to the one investigated in this paper. However, the best-known strategy for detecting blog comment spam is to find the best representation of the language model of the post publication and to use that representation to filter comments that are less related to the original subject. Such a strategy cannot be applied to YouTube, since comments relate to video content with little or no textual description, so language models cannot be properly derived from the original publication [17], [18].

Hongyu Gao et al. note that numerous online social networks exist on the internet. To identify spam in online social networks, their method uses Facebook wall posts. Crawlers collect the wall posts of specific Facebook users; the posts are then filtered so that only those containing URLs are kept. The method separates the text of a wall post from the links mentioned in it, and groups posts that have similar text content or the same destination URLs. A post similarity graph clustering algorithm is used to identify the similarity between posts and URLs. Based on this, malicious users and posts are identified [19].

Seungwoo Choi et al. experimented with their algorithm on TED Talks videos to find comments offering exposure and information about the video contents. However, the proposed method was found inadequate for analyzing the feelings and opinions expressed on platforms such as YouTube [20].

YouTube also faces malicious users who publish low-quality videos, which is known as video spam. Some studies in the literature look for efficient ways to tackle this activity through classification methods and feature extraction from metadata, such as popularity numbers, title, and description [21], [22].

S. Lee and J. Kim describe three modules: Data Collection, Feature Extraction, and Classification. Under Data Collection, the system collects tweets containing URLs using the publicly available Twitter Streaming API. In Feature Extraction, features are extracted from the collected data. The system collects features such as the length of the URL redirect chain, because attackers use long URL redirect chains to make analysis difficult. Suspicious URLs on Twitter are classified based on these features [23].

Igor Santos et al. applied the concept of anomaly detection, wherein the divergence from authentic emails was used as a metric to classify emails as spam or ham. Better accuracy was achieved owing to the limited training sets, as seen in labeling-based systems [24].

Another common alternative is to automatically block spammers, that is, users who disseminate spam [25], [26]. However, unlike spam disseminated in other social networks and email [27], [28], the spam posted on YouTube is not usually created by bots, but posted by real users aiming at self-promotion on popular videos. Therefore, such messages are more difficult to identify due to their similarity to legitimate messages.

Gao et al. present an online spam filtering system that can be used in real time to inspect messages generated by users. The system can be deployed as a component of the OSN platform. The authors propose to group spam messages into campaigns for classification instead of examining them individually. Although campaign identification has been used for offline spam analysis, they apply this technique to the online spam detection problem with sufficiently low overhead. Accordingly, the system adopts a set of novel features that effectively distinguish spam campaigns. It drops messages classified as “spam” before they reach the recipients, thus protecting them from various kinds of fraud. The system is evaluated using 187 million wall posts collected from Facebook and 17 million tweets collected from Twitter [29].


As noted by Bratko et al., the spam filtering task differs slightly from similar text categorization problems. They claim that undesired messages have a chronological order and that their characteristics may change over time. They also explain that cross-validation is not recommended, because earlier samples should be used to train the methods, while newer ones should be used to test them. Furthermore, in spam filtering, the errors associated with each class should be weighted differently, because a blocked legitimate message is worse than an unblocked spam message [30].

3. System Architecture

Figure 3.1 shows the proposed architecture of the entire YouTube spam detection process. The YouTube Spam Collection Data Set is labeled as spam or ham. This forms a dataset that is fed into a term frequency-inverse document frequency (TF-IDF) vectorizer, which transforms words into numerical features (NumPy arrays) for training and testing. The transformed dataset is split into training and testing subsets and fed into the Multilayer Perceptron (MLP), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), Logistic Regression (LG), and k-Nearest Neighbor (kNN) pipelines. All implementation was performed in Python 3 using two libraries: Keras for neural network models and Scikit-learn for machine learning classifiers.

Figure 3.1: Architecture for YouTube Comment Spam Detection

Below is the description of every phase in the YouTube spam detection framework:

3.1 Data Collection

The YouTube Spam Collection Data Set was collected from the UCI repository [31]. It has five datasets composed of 1,956 real and non-encoded messages that were labeled as legitimate (ham) or spam.
Each sample represents a text comment posted in the comments section of each selected video. No preprocessing technique was performed. Subsequently, each sample was manually labeled as spam or legitimate (ham), using a collaborative tagging tool developed for this purpose, called Labeling [32]. Each sample has associated metadata, such as the author’s name and publication date, which has been preserved.

3.1.1 Dataset Information

The samples were extracted from the comments section of five videos that were among the 10 most viewed on YouTube during the collection period. The table below lists the datasets, the YouTube video ID, the number of samples in each class, and the total number of samples per dataset.

Table 3.1: Dataset Information

Dataset     YouTube ID     Spam   Ham   Total
Psy         9bZkp7q19f0    175    175   350
KatyPerry   CevxZvSJLk8    175    175   350
LMFAO       KQ6zr6kCPj8    236    202   438
Eminem      uelHwf8o7_U    245    203   448
Shakira     pRpeEdMmmQ0    174    196   370

3.1.2 Attribute Information

The collection is composed of one CSV file per dataset, where each line has the following attributes:

Table 3.2: Attribute Information

Attributes    Example
COMMENT_ID    LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU
AUTHOR        Julius NM
DATE          2013-11-07T06:20:48
CONTENT       Huh, anyway check out this YouTube channel: kobyoshi02
Class         1 (Spam)

3.2 Data Preprocessing

The dataset used here is split into 80% for the training set and the remaining 20% for the test set. In any text mining problem, text cleaning is the first step, in which we remove from the document those words that may not contribute to the information we want to extract. YouTube comments may contain many undesirable elements, such as punctuation marks, stop words, and digits, which may not be helpful in SD. After cleaning the text, we fed our dataset into a term frequency-inverse document frequency (TF-IDF) vectorizer, which transforms words into numerical features (NumPy arrays) for training and testing.

3.2.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for term frequency-inverse document frequency, and the TF-IDF weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

3.3 Feature Extraction

Once the dictionary is ready, we can extract the word count vector (our feature here) of 4454 dimensions for each YouTube comment of the whole dataset. Each word count vector contains the frequency of the 4454 words in the whole dataset file. Of course, you might have guessed by now that most of them will be zero. Let us take an example. Suppose we have 500 words in our dictionary. Each word count vector then contains the frequency of the 500 dictionary words in the training file. Suppose the text in the training file was “Get the work done, work done”; then it will be encoded as [0,0,0,0,0,…,0,0,2,0,0,0,…,0,0,1,0,0,…,0,0,1,0,0,…,2,0,0,0,0,0]. Here, all the word counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count vector, and the rest are zero.

3.4 Classifier Techniques

After feature extraction, the transformed dataset is fed into the classifier techniques: the Multilayer Perceptron (MLP), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), Logistic Regression (LG), and k-Nearest Neighbor (kNN) pipelines.

3.4.1 Multilayer Perceptrons (MLPs)

An MLP is a feed-forward network with one input layer, one output layer, and at least one hidden layer [33]. To classify data that is not linearly separable, it uses non-linear activation functions, mainly the hyperbolic tangent or the logistic function [34]. The network is fully connected, which means that every node in the current layer is connected to each node in the next layer. This architecture with hidden layers forms the basis of deep learning architectures, which have at least three hidden layers [35]. Multilayer perceptrons are used for speech recognition and translation tasks in NLP.

3.4.2 Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a discriminative classifier that can be used for both classification and regression problems.


The goal of SVM is to identify an optimal separating hyperplane that maximizes the margin between the different classes of the training data. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples; the largest possible margin is sought in order to reduce an upper bound on the generalization error. Support vectors are simply the data points nearest to the optimal separating hyperplane; they provide the most useful information for SVM classification. In addition, an appropriate kernel function is used to map the data into a high-dimensional space in which linear discriminant functions can be used.

3.4.3 K-Nearest Neighbors (kNN)

The k-nearest-neighbors algorithm is a supervised classification algorithm: it takes a set of labeled points and uses them to learn how to label other points. To label a new point, it looks at the labeled points closest to that new point (its nearest neighbors) and has those neighbors vote, so whichever label most of the neighbors have becomes the label of the new point (the “k” is the number of neighbors it checks).

3.4.4 Logistic Regression

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest (the dependent, response, or outcome variable) and a set of independent (predictor or explanatory) variables.

3.4.5 Naïve Bayes (NB)

Naïve Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or on the existence of other features, all of them are treated as contributing independently to the probability. NB classifies using an estimated probability, called the posterior probability.

3.4.6 Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.

3.4.7 Decision Tree

A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and whose leaf nodes can be taken as categories (of these patterns). An input pattern is filtered down through these tests to reach the right output. Decision tree algorithms can be applied in many different fields. They can be used as a replacement for statistical procedures to find data, to extract text, to find missing data in a class, and to improve search engines, and they also find various applications in medical fields. Many decision tree algorithms have been formulated, with different accuracy and cost-effectiveness, so it is important to know which algorithm is best to use. ID3 is one of the oldest decision tree algorithms. It is very useful for building simple decision trees, but as the complexity increases, its ability to build good decision trees decreases. Hence the IDA (intelligent decision tree algorithm) and C4.5 algorithms have been formulated.

4. Result Evaluation


In this section, we evaluate our results and define the evaluation criteria used to calculate the performance of our classification models.

4.1 Evaluation Criteria

During the training process, the confusion matrix was used to evaluate the classification models. The confusion matrix is a matrix that maps the predicted outputs against the actual outputs. It is often used to describe the performance of a classification model on a set of test data.

Table 4.1: Confusion Matrix

Class             Predicted Negative   Predicted Positive
Actual Negative   TN                   FP
Actual Positive   FN                   TP

Important metrics were computed from the confusion matrix in order to evaluate the classification models. In addition to the correct classification rate, or accuracy, the other metrics computed for evaluation were True Positive Rate (TPR), False Positive Rate (FPR), Precision, F1 score, and Misclassification rate.

4.1.1 Accuracy or Correct Classification Rate (CCR)

AC is the proportion of the total number of predictions that were correct. AC is calculated as the number of all correct predictions divided by the total number of all predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

4.1.2 True Positive Rate (TPR)

The true positive rate (TPR), also known as sensitivity or recall, is the proportion of positive cases that were correctly identified. TPR or Recall is calculated as the number of correct positive predictions divided by the total number of true positives and false negatives.

$$\text{TPR} = \frac{TP}{TP + FN}$$

4.1.3 False Positive Rate (FPR)

The false positive rate (FPR) is the proportion of negative cases that were incorrectly classified as positive. The false positive rate is calculated as the number of incorrect positive predictions divided by the total number of false positives and true negatives.

$$\text{FPR} = \frac{FP}{FP + TN}$$

4.1.4 Precision

PR is the proportion of the predicted positive cases that were correct. PR is calculated as the number of correct positive predictions divided by the total number of positive predictions.

$$\text{Precision} = \frac{TP}{TP + FP}$$

4.1.5 Error Rate (ER)

The error rate, also known as the ‘Misclassification rate’, is calculated as the number of all false predictions divided by the total number of all predictions.

$$\text{Error Rate} = \frac{FP + FN}{TP + FP + TN + FN}$$

4.1.6 F1 Score

The F1 score is the harmonic mean of precision and recall.

$$\text{F1 Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

4.2 Final Result

In classification, a total of nine algorithms implemented in Scikit-Learn were set as classifiers for detecting YouTube spam comments. The purpose of implementing the nine algorithms is to compare accuracy.
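The evaluation criteria of Section 4.1 can be written down compactly. The following is a minimal sketch rather than the authors' evaluation code: it assumes spam is the positive class (label 1) and ham the negative class (label 0), and the names `ConfusionMatrix` and `confusion_from_labels` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier; spam (1) is assumed to be the positive class."""
    tp: int  # spam correctly predicted as spam
    tn: int  # ham correctly predicted as ham
    fp: int  # ham wrongly predicted as spam
    fn: int  # spam wrongly predicted as ham

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def tpr(self) -> float:
        """True positive rate, also called recall or sensitivity."""
        return self.tp / (self.tp + self.fn)

    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def error_rate(self) -> float:
        return (self.fp + self.fn) / self.total

    def f1(self) -> float:
        """Harmonic mean of precision and recall."""
        p, r = self.precision(), self.tpr()
        return 2 * r * p / (r + p)


def confusion_from_labels(actual, predicted) -> ConfusionMatrix:
    """Tally the four confusion-matrix cells from parallel lists of 0/1 labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return ConfusionMatrix(tp=tp, tn=tn, fp=fp, fn=fn)


# Invented labels, for illustration only (not results from the paper).
cm = confusion_from_labels(
    actual=[1, 1, 1, 0, 0, 0, 0, 1],
    predicted=[1, 1, 0, 0, 0, 1, 0, 1],
)
print(cm, cm.accuracy(), cm.tpr(), cm.fpr(), cm.precision(), cm.error_rate(), cm.f1())
```

Accuracy and error rate always sum to one, which makes a convenient sanity check. The sketch does not guard against empty classes: a test set with no positive predictions, for instance, would make the precision denominator zero.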

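To make the experimental flow concrete (TF-IDF features, an 80:20 split, then a classifier), here is a dependency-free sketch in miniature. It is not the authors' code: the experiments used Scikit-learn and Keras, whereas this example uses only the Python standard library, picks kNN (Section 3.4.3) as the simplest of the listed classifiers, uses a smoothed IDF variant chosen as an assumption, skips stop-word removal, and runs on ten invented comments rather than the YouTube Spam Collection.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and keep alphabetic tokens (drops digits and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for each term seen in the training comments."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse, L2-normalised TF-IDF vector (term -> weight); unseen terms are ignored."""
    counts = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    """Label the query by majority vote among its k most similar training comments."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query, train_vecs[i]), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy comments: 1 = spam, 0 = ham.
comments = [
    ("Check out my channel and subscribe", 1),
    ("Subscribe to my channel for free gift cards", 1),
    ("Please check out my new video and subscribe", 1),
    ("Free gift cards, click the link below", 1),
    ("I love this song so much", 0),
    ("This song is the best song ever", 0),
    ("Love the beat and the dance", 0),
    ("Best video ever, love it", 0),
    ("Subscribe to my channel please", 1),
    ("I love this song", 0),
]

# 80:20 split by position: the first 80% train the model, the rest test it.
split = len(comments) * 8 // 10
train, test = comments[:split], comments[split:]

train_docs = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_docs)  # IDF is fitted on the training portion only
train_vecs = [tfidf(doc, idf) for doc in train_docs]

predictions = [knn_predict(tfidf(tokenize(text), idf), train_vecs, train_labels)
               for text, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Fitting the IDF weights on the training portion only keeps information from the test comments out of the features. Splitting by position rather than at random also fits the chronological evaluation recommended by Bratko et al. [30], provided the comments are sorted by date first.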

The classification of accuracy across nine [1] YouTube – Statistics (2015). Available at
different classification algorithms such as https://goo.gl/ozUXMB, accessed in July 23,
Multilayer Perceptrons, Support Vector 2015.
Machine, Random Forest, Logistic Regression, [2] State of Social Media Spam (2013).
Decision Tree, k-Nearest Neighbor and Naïve Available at http://goo.gl/bnkUhh, accessed in
Bayes using data proportion of 80:20 Ratio July 23, 2015.
means 80% for training and 20% for testing.
[3] Give YouTube Users Tools to Combat
Table 5.2 shows the experiments with a data
Video Spam (2012). Available at
proportion of the percentage split training and https://goo.gl/dWUXgC, accessed in July 23,
testing. The result shows that the MLPs and 2015.
SVM classifier gives the highest accuracy
when testing in Scikit-Learn Library among [4] An update on YouTube comments (2013).
other classifiers. Available at http://goo.gl/ GDZuWw, accessed
in July 23, 2015.

[5] The Guardian – PewDiePie switches off

Table 4.2: Results YouTube comments: ‘It’s mainly spam’
(2014). Available at http://goo.gl/AixgI1,
accessed in July 23, 2015.

[6] G. Mishne, D. Carmel, and R. Lempel,

“Blocking blog spam with language model
disagreement,” in Proc. of the 1st AIRWeb,
Chiba, Japan, 2005, pp. 1–6.

[7] R. M. Silva, T. A. de Almeida, and A.

Yamakami, “Artificial eural etworks for
The goal of this work is to find which Content-based Web Spam Detection,” in Proc.
algorithms provide high and best accuracy and of the 2012 ICAI, Las Vegas, NV, EUA, 2012,
precision to help in detecting Spam comments pp. 1–7.
on YouTube. [8] C. Romero, M. Valdez, and A. Alanis, “A
comparative study of machine learning
5. Conclusion and Future Work techniques in blog comments spam filtering,”
in Proc. of the 6th WCCI, Barcelona, Spain,
5.1 Conclusion 2010, pp. 63–69.
The goal of this research was to find capable
[9] T. C. Alberto and T. A. Almeida,
methods and settings that could be used to
“Aprendizado de máquina aplicado na detecc
help the detection of spam comments on ¸ão automática de comentários indesejados,”
YouTube. in Anais do X Encontro Nacional de
Inteligência Artificial e Computacional
5.2 Future Work (E IAC’13), ortaleza, Brazil, 2013.
For Future work, with the Deep neural
[10] Z. Li and H. Shen, “Soap: A social
network-based implementations such as network aided personalized and effective spam
convolutional recurrent neural networks, may filter to clean your e-mail box,” in I OCOM,
obtain better accuracy results for detecting 2011 Proceedings IEEE, April 2011, pp.
unwanted Youtube Comments. 1835–1843.
Reference
[11] T. Almeida, J. Almeida, and A.
Yamakami, “Spam filtering: How the
dimensionality reduction affects the accuracy
of naive bayes classifiers,” Journal of Internet

8
2nd National Conference on Emerging Technologies, December 6-7, 2016, University of South Asia, Lahore, Pakistan

Services and Applications, JISA’11, vol. 1, no. International Conference on Big Data (Big
3, pp. 183–200, 2011. Data), 2016, pp. 2457–2465.
doi:10.1109/BigData.2016.7840882.
[12] J. M. Gómez Hidalgo, T. Almeida, and
A. Yamakami, “On the Validity of a ew [21] V. Chaudhary and A. Sureka, “Contextual
SMS Spam Collection,” in Proc. of the 11st feature based one-class classifier approach for
ICMLA, vol. 2, Miami, FL, EUA, 2012, pp. detecting video response spam on youtube,” in
240–245. Privacy, Security and Trust (PST), 2013
Eleventh Annual International Conference on,
[13] T. P. Silva, I. Santos, T. A. Almeida, and July 2013, pp. 195–204.
J. M. Gómez Hidalgo, “ ormalizac ¸ão
Textual e Indexac ¸˜ ao Semântica Aplicadas [22] R. Chowdury, M. Monsur Adnan, G.
na iltragem de SMS Spam,” in Proc. of the Mahmud, and . ahman, “A data mining
11st E IAC,S ˜ ao Carlos, Brazil, 2014, pp. 1– based spam detection system for youtube,” in
6. Digital Information Management (ICDIM),
2013 Eighth International Conference on, Sept
[14] A. Kantchelian, J. Ma, L. Huang, S. 2013, pp. 373–378.
Afroz, A. Joseph, J. D. Tygar, Robust
detection of comment spam using entropy rate, [23] S. Lee and J. Kim, WarningBird:
in: Proceedings of the 5th ACM Workshop on Detecting suspicious URLs in Twitter stream,
Security and Artificial Intelligence, AISec ’12, in Proc. NDSS, 2012, pp. 113.
ACM, New York, NY, USA, 2012, pp. 59–70.
doi:10.1145/2381896.2381907. URL [24] I. Santos, C. Laorden, X. Ugarte-Pedrero,
http://doi.acm.org/10.1145/2381896.2381907 B. Sanz, P. G. Bringas, Spam Filtering through
Anomaly Detection, Springer Berlin
[15] H. Gao et al., Spam aint as diverse as it Heidelberg, Berlin, Heidelberg, 2012, pp.
seems: Throttling OSN spam with templates 203–216. doi:10.1007/978-3-642-35755-8 15.
underneath, in Proc. ACSAC, 2014, pp. 7685. URL https://doi.org/10.1007/978-3-642-
Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B. 35755-8_15

[16] E. Tan, L. Guo, S. Chen, X. Zhang, Y. [25] F. Benevenuto, T. Rodrigues, V.

Zhao, Spammer behavior analysis and Almeida, J. Almeida, and M. Gonçalves,
detection in user generated content on social “Detecting spammers and content promoters in
networks, in: 2012 IEEE 32nd International online video social networks,” in Proceedings
Conference on Distributed Computing - 32nd Annual International ACM SIGIR
Systems, 2012, pp. 305–314. Conference on Research and Development in
doi:10.1109/ICDCS.2012.40. Information Retrieval, SIGIR 2009, 2009, pp.
620–627.
[17] G. Mishne, D. Carmel, and R. Lempel,
“Blocking blog spam with language model [26] J. M. Campanha, J. V. Lochter, and T. A.
disagreement,” in Proc. of the 1st AI Web, Almeida, “Detecc ¸˜ ao automática de
Chiba, Japan, 2005, pp. 1–6. spammers em redes sociais,” in Anais do XI
Encontro acional de Inteligência Artificial e
[18] G. Mishne and . Glance, “Leave a reply: Computacional (E IAC’14),S ˜ ao Carlos,
An analysis of weblog comments,” in Proc. of Brazil, 2014.
3rd WWE, Edinburgh, UK, 2006, pp. 1–8.
[27] T. Almeida and A. Yamakami, “Occam´s
[19] HongyuGao NorthwesternUniversity razor-based spam filter,” Journal of Internet
Evanston,IL,USA [email protected], Services and Applications, vol. 3, no. 3, pp.
JunHu HuazhongUniv. ofSci. &Tech 245– 253, 2012.
andNorthwesternUniversity
[email protected] Detecting and [28] D. Wang, D. Irani, and C. Pu, “A social-
Characterizing Social Spam Campaigns spam detection framework,” in The 8th
Annual Collaboration, Electronic Messaging,
[20] S. Choi, A. Segev, Finding informative
comments for video viewing, in: 2016 IEEE

9
2nd National Conference on Emerging Technologies, December 6-7, 2016, University of South Asia, Lahore, Pakistan

Anti-Abuse and Spam Conference (CEAS’11),

Perth, Australia, 2011, pp. 46–54.

[29] Hongyu GaoNorthwestern

University,Evanston,
ILUSA,[email protected],Yan Chen
Northwestern University Evanston, IL, USA
[email protected], Towards Online
Spam Filtering in Social Networks.

[30] A. Bratko, B. Filipič, G. V. Cormack, T. R. Lynam, and B. Zupan, “Spam filtering using statistical data compression models,” J. Mach. Learn. Res., vol. 7, pp. 2673–2698, 2006.

[31] YouTube Spam Collection Data Set, collected from UCI. Available at: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

[32] Labeling. Available at: http://lasid.sor.ufscar.br/ml-tools/

[33] I. N. d. Silva, D. H. Spatti, R. A. Flauzino, L. H. B. Liboni, and S. F. d. R. Alves, Artificial Neural Networks: A Practical Course, Springer International Publishing, 2017.

[34] O. Davydova, “7 Types of Artificial Neural Networks for Natural Language Processing,” [Online]. Available: https://www.kdnuggets.com/2017/10/7-types-artificial-neural-networks-natural-language-processing.html.

[35] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[36] M. Z. Amin and A. Ali, “Performance Evaluation of Supervised Machine Learning Classifiers for Predicting Healthcare Operational Decisions,” C-Section Classification Database Report, UCI Machine Learning Repository, University of California, Irvine, USA, 2017.

[37] M. Z. Amin and N. Nadeem, “Convolutional Neural Network: Text Classification Model for Open Domain Question Answering System,” arXiv preprint arXiv:1809.02479.

