A Systematic Literature Review On SMS Spam Detection Techniques


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/318298908

A Systematic Literature Review on SMS Spam Detection Techniques

Article  in  International Journal of Information Technology and Computer Science · July 2017


DOI: 10.5815/ijitcs.2017.07.05


2 authors:

Lutfun Nahar Lota, Islamic University of Technology
B M Mainul Hossain, University of Dhaka



I.J. Information Technology and Computer Science, 2017, 7, 42-50
Published Online July 2017 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijitcs.2017.07.05

A Systematic Literature Review on SMS Spam Detection Techniques

Lutfun Nahar Lota
Institute of Information Technology, University of Dhaka, Dhaka, 1000, Bangladesh
E-mail: [email protected]

B M Mainul Hossain
Institute of Information Technology, University of Dhaka, Dhaka, 1000, Bangladesh
E-mail: [email protected]

Abstract—Spam SMSes are unsolicited messages to users, which are disturbing and sometimes harmful. Many survey papers are available on email spam detection techniques, but SMS spam detection is a comparatively new area and systematic literature reviews of it are scarce. In this paper, we perform a systematic literature review on SMS spam detection techniques. For that purpose, we consider the published research works available from 2006 to 2016. We chose 17 papers for our study and reviewed the techniques, approaches and algorithms they used, their advantages and disadvantages, their evaluation measures, their datasets and, finally, a comparison of their results. Although SMS spam detection is more challenging than email spam detection because of regional content and the use of abbreviated words, unfortunately none of the existing research addresses these challenges. There is considerable scope for future research in this area, and this survey can act as a reference point for the future direction of research.

Index Terms—SMS Spam Filtering, SMS Spam Detection, Systematic Literature Review, Machine Learning.

I. INTRODUCTION

Short Message Service (SMS) is the most frequently and widely used communication medium. The term “SMS” is used for both the user activity and all types of short text messaging in many parts of the world. It has become a medium for the advertisement and promotion of products, banking updates, agricultural information, flight updates and internet offers. SMS is also employed in direct marketing, known as SMS marketing. Sometimes SMS marketing disturbs users. Such SMSes are called spam SMS. Spam is one or more unsolicited messages, unwanted by the users, sent or posted as part of a larger collection of messages, all having substantially identical content. The purposes of SMS spam include the advertisement and marketing of various products, sending political messages, and spreading inappropriate adult content and internet offers. That is why spam SMS flooding has become a serious problem all over the world. SMS spamming has gained popularity over other spamming approaches, such as email and Twitter, due to the increasing popularity of SMS communication. Moreover, the opening rate of SMS is higher than 90%, with messages opened within 15 minutes of receipt, whereas the opening rate of email is only 20-25% within 24 hours of receipt [28]. Thus, a proper SMS spam detection technique is much needed. There is considerable research on email, Twitter, web and social tagging spam detection techniques. However, very little research has been conducted on SMS spam detection. SMS spam detection is more challenging than email spam detection because of the restricted length of SMS, the use of regional content and shortcut words, and because an SMS contains less header information than an email.

We cannot use email spam detection techniques as-is in SMS spam detection. A proper SMS spam detection technique needs to be identified. This is an open and comparatively new research field, and there is considerable scope for research work in it. A Systematic Literature Review (SLR) is necessary for starting any kind of research in any research field. There is no SLR on this topic. For this reason, we set out to write an SLR on the field of spam SMS detection. The purpose of this study is to review the current status of SMS spam detection, finding the approaches and techniques of SMS spam detection, their advantages and disadvantages, their performance and their performance measurement process, using available resources to conduct a systematic literature review for the period 2006-2016. Through this research we can summarize the research in the SMS spam detection field. This will establish a baseline for future research and give researchers an overview of this research area at a glance.

II. BACKGROUND AND RELATED WORK

SMS spam detection is a comparatively new research area compared with email, social tag, Twitter and web spam detection. Some of the research on spam detection includes [1], [2], [3]. These studies were mostly conducted after 2011. There are several established email spam detection techniques. SMS spam detection has some challenges over email spam detection

Copyright © 2017 MECS I.J. Information Technology and Computer Science, 2017, 7, 42-50

such as restricted message size, the use of regional and shortcut words, and limited header information. These challenges need to be solved. There is scope for research in this field, and some research works have been conducted on it. There are different categories of SMS spam filtering, such as white listing and black listing, content-based, non-content based, collaborative approaches and the challenge-response technique [4], [5], [12], [29]. The techniques are used on the client side, on the server side or on both [4]. Several machine learning algorithms, such as Naïve Bayes, Support Vector Machine (SVM), Logistic Regression, Decision Trees and K-Nearest Neighbor, are used to classify between spam and legitimate SMSes, the latter called ham. The machine learning algorithms, process and techniques of spam filtering are discussed in the following subsections.

A. Machine learning Algorithm

Bayesian is a probabilistic approach that starts with a prior belief, observes some data and then updates that belief. The probability of a word being spam or not spam can be calculated from the frequency of that word in ham and spam messages using the Bayesian algorithm [30]. A prior probability also needs to be assumed in this algorithm, which is a shortcoming of this approach.

Support Vector Machines are supervised learning models with associated learning algorithms that analyse data used for classification and regression analysis. If a set of training examples containing spam and legitimate SMS is given, then an SVM training algorithm builds a model that can assign new examples to the spam or legitimate category. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on [31].

The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). Logistic regression can be used in SMS spam detection on the basis of different feature variables [32].

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes. A decision tree can be used to decide whether a new message is spam or ham [33].

The k-nearest neighbors algorithm (k-NN) is a nonparametric method used for classification and regression. The input consists of the k closest training examples in the feature space. The output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors [34].

Random Forests grows many classification trees. To classify a new SMS from an input vector, the algorithm puts the input vector down each of the trees in the forest. Each tree gives a classification, which counts as a "vote" for that class. The forest chooses the classification having the most votes [35].

B. Spam Filtering Process

A set of manually classified spam and ham messages is the input, or training set, for a spam filtering algorithm. The algorithm consists of the following steps [12].

Preprocessing: Removing irrelevant content, such as stop words, is part of data preprocessing.

Tokenization: Segmenting the message into words, characters or symbols, called tokens. There are different tokenization approaches, such as word tokenization, sentence tokenization, word or character N-grams and orthogonal sparse bigrams.

Representation: Conversion to attribute-value pairs.

Selection: Selecting the important attribute values that have an impact on classification rather than choosing all attribute-value pairs.

Training: Train the algorithm with the selected attribute values.

Testing: Test the newly arrived data with the trained model.

C. Content Based Filtering

Most of the works on SMS spam detection are content based [1], [3], [11], [12]. Content based filtering is based on the contents of the SMS, such as spam words, unusual distribution of punctuation and message length. Yadav et al. [1] proposed a user centric approach that used content based filtering with a Bayesian machine learning algorithm and user generated features, such as blacklisting, white listing and preferred keywords, to filter unwanted SMSes and reduce the burden of notifications for a mobile user.

Narayan et al. [3] developed a two level stacked classifier to classify between spam and legitimate SMS. The first level of the classifier records a subset of words whose individual probability is higher than a threshold. The second level of the classifier is then invoked, taking the words chosen by the first level as input. They tried different combinations of machine learning classification algorithms in the two levels: Bayesian and SVM, SVM and Bayesian, Bayesian and Bayesian, and SVM and SVM.

Ishtiaq et al. [11] proposed an SMS spam classification algorithm using the combination of a Naive Bayes classifier and the Apriori algorithm. They integrated association rule mining using the Apriori algorithm with the Bayesian algorithm. Apriori retrieves the words that most frequently occur together, and then Bayesian calculates the probability of a word occurring independently and together with other words, in spam or ham messages.

Gomez et al. [12] analysed to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile


spam. They pre-processed the messages with different tokenization approaches, selected features and tested them with different machine learning algorithms in terms of effectiveness. They demonstrated that Bayesian filtering techniques can be effectively transferred from email to SMS spam with appropriate feature extraction.

D. Non-Content based filtering

Many proposed techniques used non-content based filtering [2], [7]. Warade et al. [2] detected spam messages by checking the mutual relation between the sender and receiver and the content of the messages. If no mutual relation is found between sender and receiver and the message contains spam content, then the system tags the message as spam and sends it to the spam box. If a mutual relation exists and there is no spamming content, the message is sent directly to the inbox of the receiver's mobile. It solved the problem of balance deduction and wastage of SMS memory. But calculating only the mutual relation is not a proper solution. A spam detection algorithm needs both a classification algorithm and this kind of feature extraction from contents.

Qian Xu et al. [7] investigated ways to detect spam message senders based on non-content features that include temporal and graph-topology information but exclude contents because of user-privacy issues. They focused on the problem of identifying professional spammers based on the overall message sending patterns. Furthermore, they concentrated only on finding SMS spam on the server side, as client-side detection is mostly content based.

E. Feature Engineering

The success of machine learning depends mostly on appropriate feature selection [6]. Features can be both content based and non-content based. Refs. [2], [8] focused only on non-content based features such as the mutual relation of sender and receiver, user black-listing and white listing, and user preferred keywords, whereas some researchers considered only content based features [6]. A proper spam detection algorithm needs both content and non-content based features. Non-content based features include static, temporal and network features [7]. Content based features include word frequencies [11], keyword based features such as the presence of spam words, and stylistic features such as the count of exclamation marks, the count of alphanumeric words, average word length and many others [15].

III. METHODOLOGY

A systematic review collects and critically analyses multiple research studies, papers or journals and provides a summary of the existing literature on a specific research domain [9]. A review of existing studies is often quicker and cheaper than embarking on a new study. To conduct an SLR, some steps need to be followed, as mentioned in [9]. The steps include formulating research questions, finding and analysing studies that relate to the questions, answering those questions and presenting a summarized result of the literature survey. Details of these steps are discussed in the following sub-sections.

A. The Need for a Systematic Literature Survey

Email spam detection is an established research field. Many studies and literature surveys have been done on email spam detection as well as on Twitter, web and social tag spam detection. Few systematic literature surveys are available on SMS spam detection because it is a comparatively new research area. Although SMS communication mostly started in 2000, it gained popularity in 2006 and became even more popular with the rise of Android phones [19]. As the number of people using SMS as a communication medium increased, SMS spamming also became more popular among spammers. As a result, research on SMS spam detection emerged out of necessity, starting mainly after 2007. Our goals with this SLR are collecting proper background knowledge on the SMS spam detection field, gaining knowledge about the algorithms currently used for SMS spam detection and their advantages and disadvantages, identifying the evaluation measures for the spam detection algorithms, comparing the accuracy of the algorithms, and identifying any gaps in current research in order to suggest areas for further investigation. The motivation for this work is to establish a basis for any research on SMS spam detection. Any kind of research starts on the basis of a systematic literature review. This is the main rationale of this SLR.

B. Research Questions

Identifying research questions is one of the important steps in an SLR. We have identified three research questions for this SLR. The questions and their motivation are presented in table 1.

Table 1. Research Question

Research question | Motivation
RQ1. What are the current approaches of SMS spam detection? | To identify the algorithms used for SMS spam detection.
RQ2. What are the advantages and disadvantages of the algorithms? | To understand the convenience and drawbacks of the algorithms.
RQ3. What are the measurement policies of SMS Spam detection algorithms? | To identify existing measurement policies and metrics to evaluate the algorithms.

C. Searches for Studies

At first we searched with the term 'SMS spam detection' on Google Scholar. Then we identified keywords noted in the relevant papers. After that we identified alternative spellings and synonyms for the search terms. Some examples of the resulting search strings are given below: “SMS Spam”, “SMS Spam Filtering”, “Machine Learning”, “Security and Protection”, “Text Analysis”, “Security in Mobile Communication”, “Short Message Service”, “Naive Bayesian Algorithm”, and “Anti-Spam Filtering”.


D. Study Selection Procedure

To select relevant studies, we primarily searched Google Scholar and collected some papers from it. Other conferences and journals, such as IEEE Xplore, ACM, IJCSI and ITJ, were found through the Google Scholar tool. The list of journals and conferences where we found our relevant papers is presented in table 2. We also performed a manual Google search. The selected papers contain many references; we also searched for the referenced papers and took some of them as relevant papers. We used Google Scholar's 'related articles' and 'cited by' features in our search procedure.

Table 2. Sources Searched

No. | Source | Full name
1 | IEEE | IEEE Xplore
2 | ACM | Association for Computing Machinery
3 | IJCSI | International Journal of Computer Science Issues
4 | IJISS | International Journal of Information Security Science
5 | ITJ | Information Technology Journal
6 | IJRAT | International Journal of Research in Advent Technology
7 | IJITCS | International Journal of Information Technology and Computer Science
7 | CAE | International CAE Conference
8 | Google Scholar | -
9 | Google | -
10 | Computers and Security | -
11 | Expert Systems With Applications | -

E. Study Selection Criteria

A systematic literature review has inclusion and exclusion criteria. SMS spam detection is a new research area and there are not many relevant studies in this field. That is why we chose most of the available articles.

F. Data Extraction

Table 3 contains the extraction form used to gather information from the studies. This table shows the information recorded about our chosen papers, such as paper type, publication venue, publication year, motivation and methodology.

Table 3. Data Extraction Table

Data item | Value
Study identifier | S#
Paper type | Conference/Journal
Name of e-library | e.g. ACM
Year of publication | 2006-2016
Name of journal | e.g. ITJ
Which RQ was answered | RQ1/RQ2/RQ3
Outcomes of the paper | Summarized literature survey on SMS spam detection
Motivation of paper | Create a baseline for SMS spam detection
Method of paper | Techniques/Approaches/Algorithms
Validation of paper | Analysis model

IV. VALIDATION OF THE STUDY

Our SLR was conducted to investigate all the approaches and techniques used in SMS spam detection. The threats to the validity of our review are possible selection bias and a lack of sufficient resources. We tried to reach all possible and relevant information resources. Some resources might not have been published directly. Another threat is that some resources are not available for public use.

V. RESULT ANALYSIS

At first we manually searched Google using the topic "Spam Detection" to gain an overview of the spam detection field. It resulted in many papers related to email, Twitter, web and SMS spam detection. Then we narrowed our search to SMS spam detection only. It resulted in a few papers. Although there are SLRs for other spam detection techniques, none of the search strings produced an SLR for SMS spam detection. Through our study selection procedure we chose 17 papers (S1-S17), published in different conferences and journals, relating only to SMS spam detection. Among the 17 studies, S1 and S11 are from the same authors, and S11 is an extension of S1. Ref. [20] is a journal article that extends the conference paper S10. S12 is an extension of [8]. As a result, we studied 19 studies in total. Table 4 summarizes the reviewed papers: study ID with the reference number given in the reference section, publication year, the names of the conferences and journals where the papers were published, and the research questions they answered.


Table 4. Summary of the Reviewed Literature

Study ID | Year | Conference/Journal | Research Questions Answered
S1 [1] | 2012 | IEEE | RQ1, RQ2, RQ3
S2 [2] | 2014 | IJRAT | RQ1
S3 [3] | 2013 | ACM | RQ1, RQ2, RQ3
S4 [5] | 2010 | Computers and Security | RQ1, RQ2, RQ3
S5 [10] | 2012 | IJCSI | RQ1, RQ2, RQ3
S6 [11] | 2014 | IJMLC | RQ1, RQ2, RQ3
S7 [12] | 2006 | ACM | RQ1, RQ2, RQ3
S8 [7] | 2012 | IEEE | RQ1, RQ3
S9 [13] | 2015 | CAE | RQ1, RQ2, RQ3
S10 [14] | 2011 | ACM | RQ1, RQ3
S11 [15] | 2011 | ACM | RQ1, RQ3
S12 [16] | 2007 | ACM | RQ1, RQ3
S13 [17] | 2008 | ITJ | RQ1, RQ3
S14 [18] | 2014 | ASTL | RQ1
S15 [4] | 2015 | Information Security Journal | RQ1, RQ2, RQ3
S16 [21] | 2014 | JBASR | RQ1, RQ2, RQ3
S17 [22] | 2013 | - | RQ1, RQ3

Table 5. Summary of the Techniques Used by the Literature

Study ID | Techniques/Algorithms/Approaches | Description
S1 [1] | Content Based (Bayesian) | SMSAssassin: Android application that uses content based filtering with user generated features to automatically filter spam SMSes, resulting in different tabs.
S2 [2] | Mutual Relation | Based on the previous relation of sender and receiver.
S3 [3] | Two level stacked classifier | The first level records words whose probability exceeds a threshold and then sends them to the next level, using Bayesian in both levels or Bayesian in the first level and SVM in the second level.
S4 [5] | Hybrid Approach (Content Based and Challenge-Response) | Used upper and lower threshold bounds, introducing an uncertain region for Bayesian filtering; messages that fall into the uncertain region are sent to the challenge-response technique, which is user query based.
S5 [10] | Artificial Immune System | The phases of the algorithm are Building dataset, Message Matching and Affinity Calculation.
S6 [11] | Bayesian and Apriori Algorithm | Apriori retrieves the words that most frequently occur together, and then Bayesian calculates the probability of a word occurring independently and together with other words, in spam or ham messages.
S7 [12] | Bayesian | Message pre-processing and encoding, feature selection and then applying the classification algorithm.
S8 [7] | Non-Content Based | Non content-based features such as static, temporal and network features, then classification with SVM and KNN.
S9 [13] | Bayesian with modified formula | Total number of spam SMSes is divided by the total occurrences of a word in Spam/Ham messages instead of the formula of occurrences of words divided by the total number of Spam and Ham messages. They also combined the two formulas.
S10 [14] | Tokenization with various classifiers | Two kinds of tokenization, separated by blanks and separated by special characters, are used for classification in various machine learning algorithms.
S11 [15] | Bayesian and SVM | Tested the feasibility of applying both algorithms in the mobile application domain.
S12 [16] | Content Based filtering | Machine learning algorithms with lexical feature expansion such as words, orthogonal sparse word bigrams, character bigrams and trigrams.
S13 [17] | Feature Updating Protocol | At regular intervals, on the basis of newly arrived SMSes, features are updated using methods like document frequency, term frequency, information gain and mutual information.
S14 [18] | Virtual Ratio on Naïve Bayes, J-48 and logistic regression | VR is the relative ratio of the average frequency of a keyword in spam and ham messages.
S15 [4] | Artificial Immune System | Consists of five modules: Innate mechanism, User feedback, Quarantine, Tokenizer, Immune Engine.
S16 [21] | Bayesian, Multilayer Perceptron Algorithm, Decision Tree | Selected four features and ran classification algorithms on them, with Bayesian giving the better performance.
S17 [22] | Content based Filtering | Feature extraction and classification algorithms like Bayesian, SVM, K-Nearest neighbour, Random forest and Adaboost. Their results conclude that SVM outperforms the other algorithms.
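
The two-threshold idea in the S4 row of Table 5 can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation from [5]: the function name `route_message`, the returned labels and the bounds 0.3 and 0.7 are hypothetical. The only elements taken from the table are the lower and upper bounds on the Bayesian spam score and the hand-off of the uncertain region to a challenge-response step.

```python
def route_message(p_spam, lower=0.3, upper=0.7):
    """Route a message given its Bayesian spam probability.

    `lower` and `upper` are hypothetical bounds chosen for this sketch.
    Scores strictly between them fall in the uncertain region and are
    deferred to a challenge-response step instead of being classified.
    """
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability in [0, 1]")
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if p_spam >= upper:
        return "spam"       # confidently spam
    if p_spam <= lower:
        return "ham"        # confidently legitimate
    return "challenge"      # uncertain region: ask the sender or user


# Example routing of three scores produced by some Bayesian filter.
decisions = [route_message(p) for p in (0.05, 0.5, 0.95)]
```

Widening the gap between the two bounds sends more messages to the challenge-response step, which trades classification errors for the extra traffic and user interaction that the reviewed study reports as a drawback.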


A. RQ1: What are the current approaches of SMS spam detection?

The techniques, approaches and algorithms used in spam detection are described briefly, with their study IDs, in table 5. From the table we can see that most of the approaches use content based filtering, and for classification they used several machine learning algorithms, mostly Bayesian and SVM. Studies S1, S3, S6-S7, S9-S12, S14 and S16-S17 used content based filtering. S4 is a hybrid approach, S8 is non content based, and S5 and S15 are based on an artificial immune system. Most of the content based filtering approaches used Bayesian as the classification algorithm.

B. RQ2: What are the advantages and disadvantages of the algorithms?

Table 6 presents the result of RQ2. The advantages and drawbacks of the approaches are listed in the table. From the table we can say that content based filtering is more convenient than non-content based and server side algorithms. Server side algorithms suffer from implementation complexity. Feature selection is also an important task for machine learning algorithms to work correctly. One important drawback is that some approaches do not use a classification algorithm and focus only on user generated features. A classification algorithm is necessary for gaining better accuracy.
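
As a concrete reference for the content based Bayesian classification that most of the reviewed approaches rely on, the following is a minimal sketch of a multinomial Naive Bayes filter over word tokens. It is not taken from any of the reviewed studies: the whitespace tokenizer, the add-one (Laplace) smoothing, the class and method names and the four toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy multinomial Naive Bayes over whitespace-separated word tokens."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase and split on blanks; real SMS needs more care.
        return text.lower().split()

    def train(self, text, label):
        tokens = self.tokenize(text)
        self.word_counts[label].update(tokens)
        self.message_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum of log P(token | label), with add-one smoothing.
        # Requires at least one training message for `label`.
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for token in self.tokenize(text):
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


# Hypothetical training messages, invented for this sketch.
training = [
    ("win a free prize now", "spam"),
    ("free entry claim your prize", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("call me when you get home", "ham"),
]
spam_filter = NaiveBayesSpamFilter()
for text, label in training:
    spam_filter.train(text, label)

print(spam_filter.classify("claim your free prize"))
print(spam_filter.classify("are we meeting today"))
```

The class prior here is estimated from the training counts, which reflects the dataset dependency of Bayesian filtering; a deployed filter would also need the preprocessing and feature selection steps that the reviewed studies emphasize.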

Table 6. Advantages and Disadvantages of Used Techniques

Study ID | Advantages | Disadvantages
S1 [1] | Combination of machine learning algorithms with user generated features | Users need to select features manually
S2 [2] | - | No classification algorithm is used
S3 [3] | Classification based on two algorithms | Threshold selection
S4 [5] | Combination of client and server side algorithms | Challenge-response technique suffers from server side traffic and user interaction problems
S5 [10] | Accurate as Naïve Bayesian with necessary feature extraction | Complex implementation
S6 [11] | Incorporating Apriori Algorithm | -
S7 [12] | Used a weighting mechanism to reduce false negatives | -
S8 [7] | - | Suffers from implementation complexity
S9 [13] | Combination of two formulas gives better result in terms of false positives | -
S10 [14] | Concludes SVM outperforms other algorithms and created a baseline for further comparison | -
S11 [15] | Although SVM gives better results in spam identification, Bayesian is more feasible for mobile applications | Extensive feature engineering is needed for better accuracy
S12 [16] | Demonstrates the need of spam filtering in spite of having established email spam filtering | -
S14 [18] | Lightweight and focuses on runtime | -
S15 [4] | - | Server side, complex and suffers from updating issues
S16 [21] | - | Implementation complexity

C. RQ3: What are the measurement policies of SMS Spam detection algorithms?

The accuracy of SMS spam detection needs to be measured. In table 7, we list the method or metric used to measure the algorithms in each study. Calculating accuracy from the confusion matrix is one of the most commonly used measurement methods for classification algorithms. S3-S9, S11, S15 and S17 used accuracy to measure their algorithms. Receiver operating characteristics (ROC) and area under the curve (AUC) were also used to demonstrate algorithm accuracy. S7, S8, S11 and S12 used ROC and AUC methods. True positive rate, false positive rate, F-measure, precision and recall are also measurement methods for classification algorithms, and they can be calculated from the confusion matrix. Some of the studies also used these measures. S2 and S14 do not use any evaluation measure.

D. Dataset Description

A training dataset is needed for any kind of machine learning classification algorithm. The results of machine learning algorithms depend on the dataset. As a result, spam detection algorithms cannot run without a dataset. In table 8, we list the publicly available datasets used in different studies. The link to each dataset and some statistics, such as the total number of SMSes and the number of spam and ham messages, are shown in table 8.

E. Performance Comparison

Most of the results of our studies demonstrated that Bayesian filtering is more suitable for spam detection. S3 showed that a two level stacked classifier, using the dataset referenced in [25], gives a better accuracy of 99% with thresholds 0.4 and 0.6 than the single classifiers. The hybrid approach of S4 demonstrates an accuracy of 95%. S15 and S5, based on an artificial immune system, show accuracies of 99% and 98% respectively. S6 gives 98%-100% on the


dataset [26], but they did not consider all the data; instead, they chose small portions of the dataset, and this accuracy is achievable only for some specific parts of the dataset. S9 shows 89% accuracy on a Farsi SMS dataset that is not publicly available, using a modified Bayesian formula. 97% accuracy is achieved by SVM, with a Spam Caught Rate of 83.10% and a Blocked Ham Rate of 0.18%, on the dataset [26], whereas 98% accuracy is achieved by SVM on the same dataset [26], with a Spam Caught Rate of 92% and a Blocked Ham Rate of 0.31%, in S17. This observation shows that results depend not only on classification algorithms and datasets but also on the data preprocessing and feature selection process. S17 also demonstrates an accuracy of 98% with Bayesian, with a Spam Caught Rate of 94% and a Blocked Ham Rate of 0.51%. S11 shows 97% ham accuracy and 72.5% spam accuracy with Bayesian, and 93% ham accuracy and 86% spam accuracy with SVM, on a dataset that is not publicly available. S16 showed 92% correctly classified instances and 8% incorrectly classified instances with Bayesian, which is better than Multilayer Perceptron and Decision Tree.

Table 7. Evaluation Measures of the Algorithms

Study ID | Evaluation Measure
S1 [1] | No evaluation measure; only demonstrate their application
S2 [2] | -
S3 [3] | Precision, Recall, F-measure and Accuracy
S4 [5] | Traffic amount, Accuracy
S5 [10] | Accuracy, False Positive Rate
S6 [11] | Accuracy
S7 [12] | ROCCH
S8 [7] | False Positive Rate, AUC
S9 [13] | Confusion matrix, Precision, Accuracy, F-measure
S10 [14] | Spam Caught/True Positive, Blocked Ham/False Positive, Accuracy and Matthews Correlation Coefficient (MCC)
S11 [15] | Ham and Spam identification Accuracy and Area Under the Curve (AUC)
S12 [16] | ROC, AUC
S15 [4] | Confusion Matrix, Accuracy, AUC
S16 [21] | Correctly and Incorrectly classified Instances
S17 [22] | Spam Caught (SC), Blocked Ham (BH), Accuracy (ACC)

VI. DISCUSSION

In light of the above discussion, we can say that most of the research studies answered RQ1. They mostly used content based filtering with various machine learning algorithms. Eleven research studies used content based filtering, two studies used an artificial immune system, one used a hybrid approach, two focused on feature engineering and two focused on real world datasets. Content based filtering suffers from challenges like short content, abbreviated words and the safety of user content. All of the studies tried to solve some challenges of SMS spam detection. For example, some studies solved the real world data extraction process, some proposed a hybrid approach to give better accuracy, and some tried to overcome the challenges over email spam detection. None of the techniques solved the challenge of the use of regional content and shortcut words. These challenges call for further investigation by future researchers into the approaches and techniques used. Also, most of the studies used Bayesian filtering as the classification algorithm. The Bayesian algorithm suffers from the traditional threshold selection problem, dataset dependency and the need to assume a prior probability. Despite these shortcomings, Bayesian is declared the most suitable algorithm for spam filtering. Solving these problems of Bayesian can also be a research direction and could result in better performance of the Bayesian algorithm. SVM also gives better accuracy but suffers from implementation complexity. Other algorithms are less suitable for SMS spam filtering.

VII. CONCLUSION

This paper presents the results of a systematic literature review on SMS spam detection techniques. We chose a total of 17 research papers in this field and reviewed their proposed techniques, their advantages and disadvantages, and the challenges they addressed. We also examined their evaluation procedures. We presented information on the publicly available datasets, which are a prerequisite for a spam filtering algorithm. We also discussed the background of this topic. In our systematic literature review, we have discussed the search and selection procedure, their publication years and the journals and
Table 8. Dataset Description
conferences where those studies were published. Our
Available
Total results show the summary of the used techniques and
Study ID No. of Hams Spams advantages and disadvantages of the approaches. We
At
Messages
have performed a performance comparison on the studied
S1 [1] [24] 2000 1000 1000
literature. In addition, we have found that none of the
S3 [3] [25] 1450 730 721
studies solve the challenges of use of regional contents
S4 [5] 85.32% 14.75%
and shortcut words. We have also discussed the problems
S6 [11] [26] 5574 4827 747
of traditional machine learning algorithms. There is scope
S7 [12] [27]
of further research in this filed and our systematic
S10 [14] [26] 5574 4827 747
literature review can serve as a reference point for future
S11 [15] 4318 2195 2123
researches.
S12 [16]
S14[18] [26]/ 5574 4827 747 REFERENCES
S15 [4] 5240 2890 2350
S17 [22] [26] 5574 4827 747 [1] K. Yadav, S. K. Saha, P. Kumaraguru, and R. Kumra,
“Take control of your smses: Designing an usable spam

sms filtering system,” in 2012 IEEE 13th International Conference on Mobile Data Management. IEEE, 2012, pp. 352–355.
[2] S. J. Warade, P. A. Tijare, and S. N. Sawalkar, “An approach for sms spam detection.”
[3] A. Narayan and P. Saxena, “The curse of 140 characters: evaluating the efficacy of sms spam detection on android,” in Proceedings of the Third ACM workshop on Security and privacy in smartphones & mobile devices. ACM, 2013, pp. 33–42.
[4] A. S. Onashoga, O. O. Abayomi-Alli, A. S. Sodiya, and D. A. Ojo, “An adaptive and collaborative server side sms spam filtering scheme using artificial immune system,” Information Security Journal: A Global Perspective, vol. 24, no. 4-6, pp. 133–145, 2015.
[5] J. W. Yoon, H. Kim, and J. H. Huh, “Hybrid spam filtering for mobile communication,” Computers & Security, vol. 29, no. 4, pp. 446–459, 2010.
[6] S. J. Delany, M. Buckley, and D. Greene, “Sms spam filtering: methods and data,” Expert Systems with Applications, vol. 39, no. 10, pp. 9899–9908, 2012.
[7] Q. Xu, E. W. Xiang, Q. Yang, J. Du, and J. Zhong, “Sms spam detection using noncontent features,” IEEE Intelligent Systems, vol. 27, no. 6, pp. 44–51, 2012.
[8] G. V. Cormack, J. M. G. Hidalgo, and E. P. Sánz, “Feature engineering for mobile (sms) spam filtering,” in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2007, pp. 871–872.
[9] S. Keele, “Guidelines for performing systematic literature reviews in software engineering,” in Technical report, Ver. 2.3 EBSE Technical Report. EBSE, 2007.
[10] T. M. Mahmoud and A. M. Mahfouz, “Sms spam filtering technique based on artificial immune system,” IJCSI International Journal of Computer Science Issues, vol. 9, no. 1, pp. 589–597, 2012.
[11] I. Ahmed, D. Guan, and T. C. Chung, “Sms classification based on naïve bayes classifier and apriori algorithm frequent itemset,” International Journal of Machine Learning and Computing, vol. 4, no. 2, p. 183, 2014.
[12] J. M. Gómez Hidalgo, G. C. Bringas, E. P. Sánz, and F. C. García, “Content based sms spam filtering,” in Proceedings of the 2006 ACM symposium on Document engineering. ACM, 2006, pp. 107–114.
[13] M. Poorshahsavari and O. Pourgalehdari, “Enhancing the rate of accuracy and precision in spam filtering in farsi sms.”
[14] T. A. Almeida, J. M. G. Hidalgo, and A. Yamakami, “Contributions to the study of sms spam filtering: new collection and results,” in Proceedings of the 11th ACM symposium on Document engineering. ACM, 2011, pp. 259–262.
[15] K. Yadav, P. Kumaraguru, A. Goyal, A. Gupta, and V. Naik, “Smsassassin: crowdsourcing driven mobile-based system for sms spam filtering,” in Proceedings of the 12th Workshop on Mobile Computing Systems and Applications. ACM, 2011, pp. 1–6.
[16] G. V. Cormack, J. M. Gómez Hidalgo, and E. P. Sánz, “Spam filtering for short messages,” in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 2007, pp. 313–320.
[17] Q. Sun, H. Qiao, and Z. Luo, “The feature updating algorithm for short message content filtering,” Information Technology Journal, vol. 7, no. 5, pp. 790–795, 2008.
[18] S.-E. Kim, J.-T. Jo, and S.-H. Choi, “A spam message filtering method: focus on run time,” 2014.
[19] A Brief History of Text Messaging, http://mashable.com/2012/09/21/text-messaging-history/#F4V9_15QGkqx. [Last Accessed: 05-11-2016]
[20] T. Almeida, J. M. Gómez Hidalgo, and T. P. Silva, “Towards sms spam filtering: Results under a new dataset,” 2013, pp. 1–18.
[21] G. Mujtaba and M. Yasin, “SMS spam detection using simple message content features,” J. Basic Appl. Sci. Res., vol. 4, pp. 275–279, 2014.
[22] H. Shirani-Mehr, “SMS spam detection using machine learning approach,” 2013, pp. 1–4.
[23] I. Ahmed et al., “Semi-supervised learning using frequent itemset and ensemble learning for SMS classification,” Expert Systems with Applications, vol. 42, no. 3, pp. 1065–1073, 2015.
[24] http://precog.iiitd.edu.in/resources.html [Last Accessed: 05-11-2016]
[25] https://github.com/okkhoy/SpamSMSData. [Last Accessed: 05-11-2016]
[26] http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ [Last Accessed: 05-11-2016]
[27] http://www.esp.uem.es/jmgomez/smsspamcorpus/ [Last Accessed: 05-11-2016]
[28] https://www.cloudmark.com/en/s/resources/whitepapers/sms-spam-overview [Last Accessed: 05-11-2016]
[29] M. Iqbal et al., “Study on the Effectiveness of Spam Detection Technologies,” 2016.
[30] http://fastml.com/bayesian-machine-learning/ [Last Accessed: 05-11-2016]
[31] https://en.wikipedia.org/wiki/Support_vector_machine [Last Accessed: 05-11-2016]
[32] https://en.wikipedia.org/wiki/Logistic_regression [Last Accessed: 05-11-2016]
[33] https://en.wikipedia.org/wiki/Decision_tree [Last Accessed: 05-11-2016]
[34] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm [Last Accessed: 05-11-2016]
[35] https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm [Last Accessed: 05-11-2016]

Authors’ Profiles

Lutfun Nahar Lota is a graduate student at the Institute of Information Technology (IIT), University of Dhaka, Bangladesh. Currently, she is pursuing her Master of Science in Software Engineering (MSSE). She earned her Bachelor of Science in Software Engineering (BSSE) from the same institution. Her core areas of interest are Software Engineering, Security and Machine Learning.

Dr. B. M. Mainul Hossain is an Assistant Professor at the Institute of Information Technology (IIT), University of Dhaka, Bangladesh. He received his Ph.D. degree in computer science from University of Illinois at Chicago, USA. Before that, he earned his Bachelor of Science and Master degrees from the department of Computer Science & Engineering, University of Dhaka, Bangladesh. He has experience working in both industry and academia. He worked as a Software Engineer in Microsoft Corporation (Redmond, USA) & Accenture Technology Lab (Chicago & California). His core areas of interest are Software Engineering, Security, Data Mining and Machine Learning.

How to cite this paper: Lutfun Nahar Lota, B M Mainul Hossain, "A Systematic Literature Review on SMS Spam Detection Techniques", International Journal of Information Technology and Computer Science (IJITCS), Vol.9, No.7, pp.42-50, 2017. DOI: 10.5815/ijitcs.2017.07.05
