SMS Spam Detection using Deep Neural Network
A. Kodieswari, Bannari Amman Institute of Technology
P. Thangaraj, KPR Institute of Engineering and Technology
1 Assistant Professor, Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India
2 Assistant Professor, Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India
3 Assistant Professor, Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India
4 Assistant Professor, Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India
Abstract - In recent years, the popularity of mobile phones has grown and the Short Message Service (SMS) has developed into a
multi-billion dollar business. At the same time, falling messaging costs have led to growth in unsolicited commercial
advertisements (spam) sent to mobile phones. SMS spam is a concern, and many users do not want to receive it because it is
annoying. Many SMS spam detection techniques already exist, using classifiers such as Support Vector Machine, Naïve Bayes and
several other machine learning algorithms. The existing SMS spam detection methods use all the features extracted from the
SMS messages, which causes a high false positive rate. To deal with this issue, an automatic SMS spam detection system is
proposed that identifies an efficient set of features using the Restricted Boltzmann Machine (RBM), a deep learning approach.
The proposed framework classifies SMS messages as spam or ham by means of a Deep Neural Network (DNN) classifier, using the
SMS Spam database from the UCI Machine Learning Repository. Finally, the results of the proposed deep learning technique are
compared with existing machine learning algorithms in terms of measures such as accuracy, True Positives and False Positives;
the proposed technique achieves a higher SMS spam detection rate with a low false positive rate.
International Journal of Pure and Applied Mathematics Special Issue
I. INTRODUCTION
In recent years, the mobile phone market has experienced substantial growth. A total of 432.1 million mobile
devices were shipped in the second quarter of 2013, an increase of 6.0% year over year [1]. As the use of mobile
phones has become commonplace, the Short Message Service (SMS) has developed into a multi-billion dollar
business [2]. The number of unwanted commercial messages sent to mobile phones by text has also surged with
the growing popularity of mobile platforms.
SMS is one of the most prominent means of communication between large numbers of people, and messages
must be transmitted in accordance with standard communication protocols. Consequently, there is a need for
text categorization algorithms that can classify messages as either ham or spam. Spam messages are unwanted,
whereas ham messages are those written by genuine users. Spam messages, for example those created by
promotional companies, must therefore be identified and removed once they reach the mobile station. SMS spam
is not only frustrating; it also consumes time, resources, money and network bandwidth, yet the availability of
spam filtering software for identifying SMS spam is limited. Moreover, there is an additional misclassification
problem that arises when ham messages are blocked as spam [2].
Users are annoyed by email and SMS spam, which also degrades the performance of the service [3]. SMS
spam usually targets a group of people and is disseminated through mobile networks, whereas email spam is
transmitted over the World Wide Web. Still, various solutions for SMS spam detection have been adapted from
email spam filtering and classification methodologies [4]. Researchers working on SMS spam detection also face
other issues, such as the scarcity of publicly available datasets. Moreover, not all email spam filtering methods
perform well on SMS spam detection [2].
Different techniques are used for SMS spam detection, such as naïve Bayes (NB), support vector
machine (SVM), artificial neural network, decision tree, k-nearest neighbor (KNN) and random forest, in
addition to hybrid methods [5]. Comparisons of different techniques on different datasets show that the
highest accuracy is achieved using SVM and NB classifiers [6], whereas methods such as Bayesian
classification, logistic regression and decision tree are still time-consuming [6]. Most existing studies used
Weka for their experimental analysis [7, 8].
Recently, deep learning neural networks, an advanced class of machine learning algorithms, have gained
significance. They can learn features dynamically and act as a feature extractor, and they can also serve as a
classifier that classifies the data based on the features learned autonomously from the data [9, 10, 11].
In this paper, a new system is proposed for SMS spam detection using the dataset available in the
UCI Machine Learning Repository. Feature selection is done using the stacked RBM, which is a greedy
multilayer unsupervised deep learning technique. A Deep Neural Network (DNN) is used as a binary classifier
to identify whether an SMS is spam or ham. Comparisons are made with different machine learning
algorithms: NB, SVM and Random Forest. Metrics such as accuracy, True Positives and False Positives are
used to measure effectiveness.
The remaining sections of the paper are ordered as follows: Section 2 describes the related work. Section
3 explains the proposed system, which includes the collection of the dataset, feature extraction and selection,
binary classification of spam and ham messages, and evaluation of performance metrics. Experimental results
are discussed and analyzed in Section 4. Finally, the conclusion and future work are presented in Section 5.
II. RELATED WORK
The problem of mobile SMS spam is addressed using filtering systems; however, the performance of spam
detection is critical and must reach a sufficient level, which requires certain adjustments to the filtering
strategies. Gordon V. Cormack et al. tried to improve the performance of SMS spam detection by applying the
same filters that had achieved the highest performance as email spam filters [12]. Two datasets were used for
testing, one in English and another in Spanish. The English dataset comprises 1002 ham and 82 spam messages,
while the Spanish dataset holds 1157 ham and 199 spam messages. Most of the machine learning algorithms
were applied to a vector representation of the messages built from features such as words, character bigrams
and trigrams, word bigrams and lowercased words.
Email filtering algorithms fail to meet expectations when applied to SMS spam for several reasons: the
messages offer a limited set of features, there was no real database of SMS spam, and the messages are short
and written in informal language. Consequently, a real database of SMS spam was created in 2011 and made
publicly available in the UCI Machine Learning Repository [2]. Many studies have since been carried out on
this dataset, and comparisons between various machine learning algorithms appear in [2, 13]. In [1] two
tokenizers were used: the first splits the text into words and keeps patterns that contain colons, dots or commas
in their interior, such as email addresses, and the second splits the text into sequences of characters separated
by commas, blanks, dots and other delimiters. No pre-processing such as stop word removal or stemming was
applied, since many researchers found that such pre-processing may affect accuracy. Comparisons between
numerous classifiers were made, and all results showed that the linear Support Vector Machine is the best.
Similarly, in [13], as in [2], no stemming was used, while special characters were removed and the dataset was
split into tokens. An initial analysis of the data was then made using the NB algorithm, and the results showed
that three features were important: the number of numeric strings, the number of dollar signs ($) and the length
of the message. Special characters such as "://" and "@" are also important for classifying messages. The
analysis further showed that message length is a good measure for identifying spam. Numerous classifiers such
as SVM, Random Forests, KNN and AdaBoost were used. The results showed that the overall error was reduced
by more than half relative to the error in [1].
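As an illustration of the three cues reported for [13] (numeric strings, dollar signs and message length), the following minimal Python sketch counts them for a single message. The function name, the regular expression and the example messages are assumptions made here for illustration; they are not taken from [13].

```python
import re

def structural_features(message: str) -> dict:
    """Count three simple cues for one SMS message."""
    return {
        # maximal runs of digits, e.g. "08000" or "1000"
        "numeric_strings": len(re.findall(r"\d+", message)),
        "dollar_signs": message.count("$"),
        "length": len(message),
    }

# Made-up example messages, not drawn from any dataset.
spam = "WIN $1000 now! Call 08000 123456"
ham = "See you at lunch?"
spam_features = structural_features(spam)
ham_features = structural_features(ham)
print(spam_features)
print(ham_features)
```

Counts of this kind would normally be combined with other features before being passed to a classifier.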
T. Almeida introduced a new dataset comprising 747 spam and 4827 ham messages, and different machine
learning algorithms were applied to it [14]. The work followed the same approach discussed in [2]. The
comparisons between various classifiers showed that SVM gives better results.
Feature extraction and selection are essential for SMS spam detection, as they influence the accuracy and
performance of classifiers. The effects of feature selection were therefore examined in [15] for two languages,
English and Turkish. A combination of features derived from bag-of-words (BoW) and structural features (SF)
was used. A collection of Turkish SMS messages was gathered and made publicly available; this dataset
comprises 430 ham and 420 spam messages. For comparison, an English dataset comprising 425 spam and 450
ham messages was used. Term Frequency - Inverse Document Frequency (TF-IDF) was applied to weight the
frequency of terms, and the vector space model was used to represent each document as a collection of words
and their frequencies. Some of the important features used to identify spam are: the length of the message, the
number of important terms, upper or lower case characters, numeric and non-alphanumeric characters such as
"!" and "$", and URL links. SVM was used as the classification technique since it achieves better results. In
[16], the main features used for distinguishing spam were, in addition to the message itself, the occurrence of
digrams and monograms, and message size. The dataset, gathered by volunteers, contains 6600 messages. Three
classifiers were compared: decision tree, artificial neural network and NB, and the results showed that NB was
the best.
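To make the TF-IDF weighting mentioned above concrete, here is a minimal Python sketch. It assumes whitespace tokenization, term frequency normalized by document length and idf = log(N / df); the exact variant used in [15] is not stated in the text, so this is one common formulation rather than a reproduction of that work. The example documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Invented example messages.
docs = ["free prize call now", "call me later", "free entry win prize now"]
vectors = tf_idf(docs)
for vector in vectors:
    print(vector)
```

A term that occurs in every document receives weight zero under this formulation, which is why rare, discriminative terms dominate the resulting vectors.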
Karami and Zhou proposed new features, largely focused on content, in order to improve the performance
of SMS spam detection [17]. Most previous research concentrated on tokens without considering deeper
semantics, and dealing with semantics is not simple because spamming approaches keep evolving. In [17], by
contrast, the semantic categories of words were used, which improved efficiency. Many features were used, for
example word semantics, SMS frequency, SMS fragments, capitalized and spam words, URLs, novel words, use
of the word "Call" and the rate of URLs. Linguistic Inquiry features were also used, for example semantic
processes, psychological processes, spoken features and many others. Accuracy, F-measure, precision, recall,
ROC curve and other metrics were used for evaluation after applying boosting, SVM and random forest as
classifiers; the accuracy ranges from 92% to 98%.
Moreover, many text classification algorithms can be used as classifiers for SMS spam, such as neural
network, tree structure and Increment Component Analysis (ICA) algorithms [18]. Features such as the number
of characters, the ratio of digit characters, the occurrence of every letter and special character, whitespace and
alphabetic characters, the ratio of short words, the total number of words, the average sentence length in words
and in characters, the ratio of the number of characters in words, the average word length and others can be
extracted and then supplied to the algorithm. Extracting such features does not affect the working of the
algorithm.
Akbari et al. used the GentleBoost algorithm for the first time to classify SMS spam [19]. The reason
for using a boosting algorithm is the presence of imbalanced data. The proposed technique reduced storage use
while keeping accuracy high. Storage was reduced because unused features are removed and updated. Stop
words and symbols, for example "the", are removed, and then tokenization is applied. To avoid removing word
attributes that may be important for classification, the probability of each word being ham or spam is
calculated, and these probabilities are then evaluated. If the probability that a word attribute is ham is greater
than its probability of being spam, the word is removed; otherwise it is kept. The experiments were carried out
on Tiago's dataset, which comprises 5572 messages, and the accuracy was 98%.
Sable used Apriori and NB classifiers for SMS spam detection [20], where each word in the dataset is
treated as an independent word. In the pre-processing stage, however, words that are not important are
discarded. Many surveys and evaluations of SMS spam detection studies have been published [3, 4, 7, 8]. Two
different tools, Weka and RapidMiner, were used to classify and cluster SMS spam; all the experiments in the
survey [7] used the UCI dataset, and the comparisons showed that the two tools gave similar results.
Stacked RBM, a deep learning technique, is used in the proposed work to select important features
progressively. The features selected and learned by the stacked RBM are given to the DNN classifier for the
classification of spam and ham messages. A deep learning algorithm can find all the samples with similar
features and, at the same time, distinguish the different classes. The advantage of deep learning over earlier
neural networks and other machine learning methods is its ability to extrapolate new features that correlate with
those already known, which leads to better ways of recognizing SMS spam.
III. PROPOSED METHODOLOGY
The overall system design of the proposed method is shown in Fig. 1. Dataset collection phase includes
the collection of spam and ham messages. Feature extraction phase includes pre-processing and normalization.
Feature selection and pre-training of the selected features are performed using Stacked RBM. Finally, DNN
classifier is used in the binary classification of SMS data samples.
The ham and spam SMS messages are gathered from the UCI Machine Learning Repository dataset, which was
assembled in 2012 [21]. The dataset comprises 5574 text messages labeled as ham or spam: 747 spam messages
and 4,827 ham messages. The dataset is saved in a text file where each line represents one message; the line
consists of the label of the message and the text string. Table 1 displays examples of messages in the dataset.
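The file layout described above (one message per line, a label followed by the text) can be read with a few lines of Python. The sketch below assumes that the label and the text are separated by a tab character and that the labels are the strings "ham" and "spam"; neither detail is stated in the text, and the two example lines are invented rather than taken from the dataset.

```python
def parse_sms_lines(lines):
    """Split each 'label<TAB>text' line into a (label, text) pair."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, text = line.split("\t", 1)
        samples.append((label, text))
    return samples

# Two invented lines in the assumed format.
example = [
    "ham\tAre we still meeting at six?\n",
    "spam\tYou have won a prize, call now\n",
]
samples = parse_sms_lines(example)
spam_count = sum(1 for label, _ in samples if label == "spam")
print(samples, spam_count)
```

The same function can be applied to an open file object, since iterating over a text file yields one line at a time.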
3.2 Feature Extraction:
The collected SMS samples are parsed and tokenized into different lexical patterns. Each SMS sample
has different lexical patterns. These strings of lexical patterns are converted to numerical values using
conversion methods such as string-to-numeric and nominal-to-numeric. When the collected data are integrated,
some of the lexical patterns contain missing, noisy and duplicate data values. To remove these values and to
extract efficient features, a pre-processing step is performed using unsupervised filters that, for example,
replace missing values and remove duplicates. Features are extracted from the numerical patterns once
pre-processing is complete. The extracted feature values lie in different numerical ranges. Before the data are
pre-trained for feature selection, min-max normalization as shown in equ (1) is carried out on the pre-processed
numerical values to bring the range of every feature to between 0 and 1. As a result, 10 features are extracted
from the collected samples, which are listed in Table 2.
$$Z_{norm} = \frac{Z - Z_{MIN}}{Z_{MAX} - Z_{MIN}} \qquad (1)$$
where $Z$ is the numeric value of the feature for each sample, $Z_{MIN}$ is the minimum value of feature Z and
$Z_{MAX}$ is the maximum value of feature Z.
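The min-max rescaling described above can be sketched in Python as follows. The function name and the sample values are illustrative assumptions, and the handling of a constant feature (where the maximum equals the minimum) is a choice made for this sketch; the text does not say how that case is treated.

```python
def min_max_normalize(values):
    """Rescale one feature column to the range [0, 1]."""
    z_min, z_max = min(values), max(values)
    if z_max == z_min:
        # constant feature: mapping to 0.0 is an assumption of this sketch
        return [0.0 for _ in values]
    return [(z - z_min) / (z_max - z_min) for z in values]

# Hypothetical values of one feature (e.g. message length) for four samples.
message_lengths = [20, 160, 90, 55]
normalized = min_max_normalize(message_lengths)
print(normalized)
```

Each feature is normalized independently, so that features measured on large scales do not dominate features measured on small scales during pre-training.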
Feature selection is a process that aims to simplify model construction by selecting a subset of
relevant features. Here dimensionality reduction is performed by unsupervised pre-training using a Deep Belief
Network (DBN) [22], which is a stack of RBMs. The DBN is trained on the normalized features extracted
after pre-processing, and the weights of the DBN are used to initialize a neural network.
This pre-training phase consists of learning a stack of RBMs with the unsupervised, greedy contrastive
divergence (CD) algorithm [23], which converts nonlinear, high-dimensional unlabeled input data into an
optimal, low-dimensional representation of features. The features learned by one RBM are used as the "data"
for training the next RBM in the stack. Finally, a DBN is constructed by unrolling the RBMs, and the DBN is
then fine-tuned using back-propagation of error derivatives [24]. The stacking of three layers of RBMs in the
DBN is illustrated in Fig. 3, with one input layer x that includes all the extracted features and three hidden
layers h1, h2 and h3 from left to right. After the greedy layer-wise unsupervised learning, h3(x) is the final
representation of x, which includes the best set of 7 features.
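The greedy layer-wise procedure can be illustrated with a small pure-Python sketch of a Bernoulli RBM trained with one-step contrastive divergence (CD-1). This is a toy illustration under stated assumptions, not the authors' implementation (which was developed in RStudio): the hidden-layer sizes 9 and 8, the learning rate, the number of epochs and the random toy data are all invented here, and the back-propagation fine-tuning step is omitted.

```python
import math
import random

def sigmoid(x):
    # numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

class RBM:
    """Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def visible_probs(self, h):
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def cd1_update(self, v0, lr):
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        v1 = self.visible_probs(h0_sample)   # reconstruction of the input
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                # positive phase minus negative phase
                self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        for i in range(len(v0)):
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_hid[j] += lr * (h0[j] - h1[j])

    def fit(self, data, epochs=20, lr=0.1):
        for _ in range(epochs):
            for v in data:
                self.cd1_update(v, lr)

def train_stack(data, layer_sizes, seed=0):
    """Greedy training: each RBM is trained on the output of the previous one."""
    rng = random.Random(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        rbm.fit(layer_input)
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
        rbms.append(rbm)
    return rbms, layer_input

# Toy data: 30 samples with 10 features already scaled to [0, 1].
toy_rng = random.Random(1)
toy = [[toy_rng.random() for _ in range(10)] for _ in range(30)]
# Sizes 9 and 8 are assumed; only the final size of 7 comes from the text.
rbms, features = train_stack(toy, layer_sizes=[9, 8, 7])
print(len(features), len(features[0]))
```

The output of the last RBM plays the role of h3(x): a 7-dimensional representation of each 10-dimensional input.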
[Fig. 4. Detection accuracy (percentage) versus number of features. Data labels recovered from the plot: 98%, 79%, 73%, 67%, 61%, 53%.]
By varying the number of hidden layers in the RBM, different sets of features are selected, and the best
set of 7 features is chosen based on the minimum value of the misclassification error calculated as in
equ (2). The selected feature sets are given to the detection phase and the accuracy is calculated. The feature
set with the best accuracy is considered the most relevant feature set, as shown in Fig. 4. Thus the RBM
provides the most relevant features.
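The selection rule described above, choosing the feature set with the smallest misclassification error, can be sketched as follows. Because equ (2) is not reproduced in the text, the error is assumed here to be the fraction of samples whose predicted label differs from the true label. The labels, predictions and feature-set names are hypothetical.

```python
def misclassification_error(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def best_feature_set(candidates, y_true):
    """candidates maps a feature-set name to the predictions obtained with it."""
    errors = {name: misclassification_error(y_true, preds)
              for name, preds in candidates.items()}
    return min(errors, key=errors.get), errors

# Hypothetical labels and predictions (1 = spam, 0 = ham).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
candidates = {
    "5 features": [1, 0, 1, 0, 0, 1, 0, 1],
    "7 features": [1, 0, 0, 1, 0, 1, 0, 1],
    "9 features": [0, 0, 0, 1, 0, 1, 1, 1],
}
best, errors = best_feature_set(candidates, y_true)
print(best, errors)
```

Under this assumed definition, minimizing the error is equivalent to maximizing accuracy, which matches the statement that the feature set with the best accuracy is retained.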
The pre-trained data is divided into training data (80%) and testing data (20%) before it is provided to
the Deep Neural Network (DNN) classifier. The training data is fed into the DNN classifier, and the network is
trained to detect spam and ham SMS messages. The trained DNN classifier is then used to predict the class of
each SMS in the testing data. The DNN performs a binary classification to identify whether an SMS is spam or ham.
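The 80%/20% division can be sketched in Python as below. The text does not say whether the split is random or stratified, so a seeded random shuffle is assumed here; the function name, the seed and the toy samples are illustrative only.

```python
import random

def train_test_split(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (feature_vector, label) pairs; 1 = spam, 0 = ham.
data = [([i / 10.0], i % 2) for i in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

With an imbalanced dataset such as this one, a stratified split that preserves the spam-to-ham ratio in both parts may be preferable; that variant is not shown here.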
IV. RESULT ANALYSIS
The proposed work is developed in RStudio on a machine with 4 GB of RAM and an Intel(R) Core(TM) i5
1.80 GHz processor. The accuracy, True Positive (TP) rate and False Positive (FP) rate of the DNN, Random
Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) classifiers
are compared. The figure and Table 3 show that the DNN binary classifier performs well, with a higher True
Positive (TP) rate of 98.73% and a lower False Positive (FP) rate of 2.27% compared with the machine
learning algorithms used in existing systems [25-27] for SMS spam detection.
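The three measures used in the comparison can be computed from true and predicted labels as in the following Python sketch. It uses the standard definitions (TP rate = TP / (TP + FN), FP rate = FP / (FP + TN)) with spam as the positive class; the function name and the example labels are assumptions for illustration and do not reproduce the reported results.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, true positive rate and false positive rate for two classes."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # guard against classes that do not occur in y_true
        "tp_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

# Hypothetical labels for five messages.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

A low FP rate matters in this setting because each false positive is a genuine (ham) message blocked as spam.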
[Figure: Accuracy, TP rate and FP rate (percentage) for the different detection models: DNN, RF, NB, SVM and KNN.]
The proposed work demonstrates that a deep learning method using RBM and DNN can be successfully
applied in the SMS spam detection and classification domain. This deep learning model not only learns
high-dimensional representations but also performs classification tasks efficiently. The test results on the SMS
spam dataset demonstrate that the DNN can learn a better generative model and perform well on the SMS spam
recognition task. In fact, it is found that a three-hidden-layer RBM can produce the best published result. The
deep learning RBM provides new design ideas and strategies for future research on SMS spam detection.
Further, other deep learning algorithms such as Stacked Autoencoder or Recurrent Neural Networks can be used
in the detection of SMS spam messages.
REFERENCES
[1] Press Release, Growth Accelerates in the Worldwide Mobile Phone and Smartphone Markets in the Second Quarter, According
to IDC, http://www.idc.com/getdoc.jsp?containerId=prUS24239313.
[2] T. A. Almeida, J. M. G. Hidalgo, and A. Yamakami, “Contributions to the study of SMS spam filtering,” in Proceedings of the
11th ACM symposium on Document engineering - DocEng '11, 2011.
[3] G. V. Cormack, “Email Spam Filtering: A Systematic Review,” Foundations and Trends in Information Retrieval, vol. 1, no. 4,
pp. 335–455, 2008.
[4] H. Ji and H. Zhang, “Analysis on the content features and their correlation of web pages for spam detection,” Communications
China, 2015.
[5] Hedieh Sajedi, Golazin Zarghami Parast, Fatemeh Akbari, “SMS Spam Filtering Using Machine Learning Techniques: A
Survey,” Machine Learning Research, Vol. 1, No. 1, pp. 1-14, 2016.
[6] Chaudhari N., Jayvala, Vinitashah, “Survey on Spam SMS filtering using Data Mining Techniques,” International Journal of
Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 11, November 2016.
[7] Zainal, K., Sulaiman, N. F., Jali, M. Z., “An Analysis of Various Algorithms for Text Spam Classification and Clustering Using
RapidMiner and Weka,” International Journal of Computer Science and Information Security, Vol. 13, No. 3, March 2015.
[8] Kawade D., Oza K., “SMS Spam Classification using WEKA,” International Journal of Electronics Communication and
Computer Technology, Vol. 5, May 2015.
[9] Deng Li, “A tutorial survey of architectures, algorithms, and applications for deep learning,” in APSIPA Transactions on Signal
and Information Processing, vol. 3, 2014.
[10] Deng Li and Yu Dong, “Deep Learning: Methods and Applications,” in Foundations and Trends in Signal Processing, vol. 7,
2014.
[11] Wang Yao, Cai Wan-Dong and Wei Peng-Cheng, “A deep learning approach for detecting malicious JavaScript code,” in
Security and Communication Networks, Wiley, 2016.
[12] G. V. Cormack, J. M. G. Hidalgo, and E. P. Sánz, “Feature engineering for mobile (SMS) spam filtering,” in Proceedings of the
30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '07, 2007.
[13] Shirani-Mehr, H., “SMS spam detection using machine learning approach,” Stanford University, USA, pp. 1–4, 2013.
[14] T. Almeida, J. M. G. Hidalgo, and T. P. Silva, “Towards SMS spam filtering: Results under a new dataset,” International Journal
of Information Security Science, 2013.
[15] A. K. Uysal, S. Gunal, S. Ergin, and E. Sora Gunal, “The Impact of Feature Extraction and Selection on SMS Spam Filtering,”
Electronics and Electrical Engineering, vol. 19, no. 5, May 2013.
[16] Mujtaba, G., Yasin, M., “SMS spam detection using simple message content features,” J. Basic Appl. Sci. Res. 4, 275–279,
2014.
[17] Karami and L. Zhou, “Improving static SMS spam detection by using new content-based features,” 2014.
[18] Anchal, Sharma A., “SMS Spam Detection Using Neural Network Classifier,” International Journal of Advanced Research in
Computer Science and Software Engineering, Volume 4, Issue 6, June 2014.
[19] Akbari, F., Sajedi, H., “SMS spam detection using selected text features and boosting classifiers,” Information and Knowledge
Technology, 2015.
[20] Sable S, “SMS classification based on naive bayes classifier and semi-supervised learning,” International Journal of Innovations
in Engineering Research and Technology, Vol. 3, Issue 7, July 2016.
[21] SMS Spam Collection Data Set from UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
[22] A. K. Noulas and B. J. A. Krose, “Deep Belief Network for Dimensionality Reduction,” Belgian-Dutch Conference on Artificial
Intelligence, Netherland, 2008.
[23] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation, vol. 18, no.
7, pp. 1527–1554, Jul. 2006.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no.
6088, pp. 533–536, Oct. 1986.
[25] Dima Suleiman, Ghazi Al-Naymat, “SMS Spam Detection using H2O Framework,” 8th International Conference on Emerging
Ubiquitous Systems and Pervasive Networks, 2017.
[26] Houshmand Shirani-Mehr, “SMS Spam Detection using Machine Learning Approach,” Stanford University, pp. 1-4, 2013.
[27] Balamurugan E., Jagadeesan A., “Geographic Routing Resilient To Location Errors,” International Journal Of Innovations In
Scientific And Engineering Research, Vol. 5, No. 3, pp. 21-26, Mar. 2018.