An Approach For Spam Detection in Youtube Comments Based On Supervised Learning
2nd National Conference on Emerging Technologies, December 6-7, 2016, University of South Asia, Lahore, Pakistan
which they ask YouTube to provide tools to deal with undesired content [3]. In 2013, the official YouTube blog reported efforts to deal with undesired comments through recognition of malicious links, ASCII art detection, and display changes to long comments [4]. However, many users are still not satisfied with such solutions. In fact, in 2014, the user “PewDiePie”, owner of the most subscribed channel on YouTube (nearly 40 million subscribers), disabled comments on his videos, claiming that most of the comments are spam and that there is no tool to deal with them [5].

The problem caused by social spam began to be critically discussed from 2010, but an earlier work dates from 2005 [6]. However, undesired comments on YouTube still harm the platform’s community, which shows that the problem requires attention and research.

Established techniques for automatic spam filtering have their performance degraded when dealing with YouTube comments. This is mainly because such messages are usually very short and rife with idioms, slang, symbols, emoticons, and abbreviations, which make even tokenization a challenging task.

Given this situation, this paper presents a comprehensive performance evaluation of several well-known machine learning techniques that can be applied to automatically filter such undesired messages.

The paper is organized as follows: In Section 2, we briefly describe the related work. Section 3 explains the system architecture. In Section 4, we present the achieved results. Finally, Section 5 describes the main conclusions and future work.

2. Literature Review

Spam is usually associated with undesired content of low informational value. It is commonly found as images, text, or videos, hindering the visualization of interesting content. There is much research related to spam in the literature, such as web spam [7], blog spam [8], [9], e-mail spam [10], [11] and SMS spam [12], [13] filtering. In social networks, undesired messages are known as social spam.

Alex Kantchelian et al. developed a Spam Detection (SD) technique that identifies uninformative and redundant content in blogs, making significant stories more accessible to regular readers. They suggested extending their work to widen the definition of spam (for example, URLs and short-message removal), to incorporate adversarial awareness, and to deploy the detector online so that it can classify future comments [14].

In an online social network, spam originates from our friends and thus reduces the pleasure of communication. Spam is normally detected in text form. The system collects large amounts of data from an online social network and uses that data to identify spam. The identified spam is then used to generate templates. Whenever a new stream of messages arrives to be classified as spam or not spam, it is matched against the generated templates, which reduces the execution time of spam identification. The implemented framework is called Tangram [15].

Enhua Tan et al. designed a runtime SD scheme called BARS (Blacklist-Assisted Runtime SD), which constructs a database of spam URLs against which the URLs of every new post are checked to determine whether the post is spam. Clustering user IDs based on shared URLs also increased the effectiveness of detection. However, the efficiency of this approach depends on how well the blacklisted URL list is maintained [16].

Blog comment spam is the most similar situation to that investigated in this paper. However, the best-known strategy to detect a blog spam comment is usually to find the best representation of the language model of the published post and to use that representation to filter comments that are less related to its original subject. Such a strategy cannot be applied to YouTube, since comments relate to video content with little or no textual description; therefore, language models cannot be properly derived from the original publication [17], [18].

Hongyu Gao et al. note that numerous online social networks exist on the internet. To identify spam in online social networks, their method uses Facebook wall posts. Crawlers collect the wall posts of specific Facebook users. The wall posts are then filtered so that only those containing URLs are kept. The method separates each wall post into its text and the links it mentions, and groups posts that have similar textual content or share the same destination URLs. A post-similarity graph clustering algorithm is used to identify the similarity between posts and URLs. Based on this, malicious users and posts are identified [19].

Seungwoo Choi et al. experimented with their algorithm on TED Talks videos to find the comments offering exposure and information about the video contents. However, the proposed method was found inadequate for analyzing the feelings and opinions expressed on platforms such as YouTube [20].

YouTube also faces malicious users who publish low-quality videos, which is known as video spam. There are some studies in the literature that seek efficient ways to handle this activity through classification methods and feature extraction from metadata, such as popularity numbers, title, and description [21], [22].

S. Lee and J. Kim describe three modules: Data Collection, Feature Extraction, and Classification. Under Data Collection, the system collects tweets containing URLs by using the Twitter Streaming API, which is publicly available for getting data from Twitter. In Feature Extraction, features are extracted from the collected data. The system collects features such as the URL redirect chain length, because attackers use long URL redirect chains to make analysis difficult. Suspicious URLs on Twitter are classified based on these features [23].

Igor Santos et al. applied the concept of anomaly detection, wherein the divergence from authentic emails was used as a metric to classify emails as spam or ham. Better accuracy was achieved owing to the limited training sets, as seen in labeling-based systems [24].

Another common alternative is automatically blocking spammers, that is, users who disseminate spam [25], [26]. However, unlike spam disseminated in other social networks and email [27], [28], the spam posted on YouTube is not usually created by bots, but posted by real users aiming at self-promotion on popular videos. Therefore, such messages are more difficult to identify due to their similarity to legitimate messages.

In [29], the authors present an online spam filtering system that can be used in real time to inspect messages generated by users. The system can be deployed as a component of the OSN platform. The authors propose to reconstruct spam messages into campaigns for classification instead of examining them individually. Although campaign identification has been used for offline spam analysis, the authors apply this technique to support online spam detection with sufficiently low overhead. Accordingly, the system adopts a set of novel features that effectively distinguish spam campaigns. It drops messages classified as “spam” before they reach the recipients, thus protecting them from various kinds of fraud. The system is evaluated using 187 million wall posts collected from Facebook and 17 million tweets collected from Twitter [29].
The goal of SVM is to identify an optimal separating hyperplane that maximizes the margin between the different classes of the training data. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples, chosen to create the largest possible margin and thereby reduce an upper bound on the expected generalization error. Support vectors are simply the data points nearest to the optimal separating hyperplane; they provide the most useful information for SVM classification. In addition, an appropriate kernel function is used to transform the data into a high-dimensional space in which linear discriminant functions can be used.

3.4.3 K-Nearest Neighbors (kNN)
The k-nearest-neighbors algorithm is a supervised classification algorithm: it takes a set of labeled points and uses them to learn how to label other points. To label a new point, it looks at the labeled points closest to that new point (its nearest neighbors) and has those neighbors vote, so whichever label most of the neighbors have becomes the label for the new point (“k” is the number of neighbors it checks).

3.4.4 Logistic Regression
Logistic regression is a statistical method for analyzing a data set in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest (the dependent, response, or outcome variable) and a set of independent (predictor or explanatory) variables.

3.4.5 Naïve Bayes (NB)
Naïve Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties are treated as contributing independently to the probability. NB is based on probability estimates, called posterior probabilities.

3.4.6 Random Forest
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.

3.4.7 Decision Tree
A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and whose leaf nodes can be taken as categories (of these patterns). An input pattern is passed down through these tests until it reaches the right output for that pattern. Decision tree algorithms are applied in many different fields. They can be used as a replacement for statistical procedures to find data, to extract text, to find missing data in a class, and to improve search engines, and they also find various applications in medical fields. Many decision tree algorithms have been formulated, with differing accuracy and cost-effectiveness, so it is important to know which algorithm is best to use. ID3 is one of the oldest decision tree algorithms. It is very useful for building simple decision trees, but as complexity increases, its accuracy decreases. Hence IDA (intelligent decision tree algorithm) and C4.5 have been formulated.

4. Result Evaluation
In this section, we evaluate our results and also define the evaluation criteria used to calculate the performance of our classification models.

4.1 Evaluation Criteria

correct positive predictions divided by the total number of true positives and false negatives.

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
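As a minimal sketch of the true positive rate described above (correct positive predictions divided by the sum of true positives and false negatives), the following Python function counts TP and FN directly from label lists. The function name `true_positive_rate`, the label strings, and the sample data are assumptions made for illustration only; they are not taken from the paper's experiments.

```python
def true_positive_rate(y_true, y_pred, positive="spam"):
    """Return TP / (TP + FN) for the chosen positive label."""
    # True positives: actual positive, predicted positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    # False negatives: actual positive, predicted as anything else.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true; TPR is undefined")
    return tp / (tp + fn)


# Hypothetical labels for illustration only (not data from the paper).
y_true = ["spam", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

print(f"TPR = {true_positive_rate(y_true, y_pred):.3f}")
```

In this toy example two of the three actual spam comments are caught, so the rate is 2/3. The false positive in the last position does not affect TPR, since the measure only considers comments that are actually spam.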
The classification accuracy was measured across nine different classification algorithms, such as Multilayer Perceptrons, Support Vector Machine, Random Forest, Logistic Regression, Decision Tree, k-Nearest Neighbor, and Naïve Bayes, using an 80:20 data split, meaning 80% for training and 20% for testing. Table 5.2 shows the experiments with this percentage split for training and testing. The results show that the MLP and SVM classifiers give the highest accuracy among the classifiers when tested with the Scikit-Learn library.

[1] YouTube – Statistics (2015). Available at https://goo.gl/ozUXMB, accessed on July 23, 2015.

[2] State of Social Media Spam (2013). Available at http://goo.gl/bnkUhh, accessed on July 23, 2015.

[3] Give YouTube Users Tools to Combat Video Spam (2012). Available at https://goo.gl/dWUXgC, accessed on July 23, 2015.

[4] An update on YouTube comments (2013). Available at http://goo.gl/GDZuWw, accessed on July 23, 2015.
Services and Applications, JISA’11, vol. 1, no. 3, pp. 183–200, 2011.

[12] J. M. Gómez Hidalgo, T. Almeida, and A. Yamakami, “On the Validity of a New SMS Spam Collection,” in Proc. of the 11th ICMLA, vol. 2, Miami, FL, USA, 2012, pp. 240–245.

[13] T. P. Silva, I. Santos, T. A. Almeida, and J. M. Gómez Hidalgo, “Normalização Textual e Indexação Semântica Aplicadas na Filtragem de SMS Spam,” in Proc. of the 11th ENIAC, São Carlos, Brazil, 2014, pp. 1–6.

[14] A. Kantchelian, J. Ma, L. Huang, S. Afroz, A. Joseph, J. D. Tygar, Robust detection of comment spam using entropy rate, in: Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, AISec ’12, ACM, New York, NY, USA, 2012, pp. 59–70. doi:10.1145/2381896.2381907. URL http://doi.acm.org/10.1145/2381896.2381907

[15] H. Gao et al., Spam ain't as diverse as it seems: Throttling OSN spam with templates underneath, in Proc. ACSAC, 2014, pp. 76–85.

Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B.

International Conference on Big Data (Big Data), 2016, pp. 2457–2465. doi:10.1109/BigData.2016.7840882.

[21] V. Chaudhary and A. Sureka, “Contextual feature based one-class classifier approach for detecting video response spam on youtube,” in Privacy, Security and Trust (PST), 2013 Eleventh Annual International Conference on, July 2013, pp. 195–204.

[22] R. Chowdury, M. Monsur Adnan, G. Mahmud, and R. Rahman, “A data mining based spam detection system for youtube,” in Digital Information Management (ICDIM), 2013 Eighth International Conference on, Sept 2013, pp. 373–378.

[23] S. Lee and J. Kim, WarningBird: Detecting suspicious URLs in Twitter stream, in Proc. NDSS, 2012, pp. 1–13.

[24] I. Santos, C. Laorden, X. Ugarte-Pedrero, B. Sanz, P. G. Bringas, Spam Filtering through Anomaly Detection, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 203–216. doi:10.1007/978-3-642-35755-8_15. URL https://doi.org/10.1007/978-3-642-35755-8_15