E-Mail Spam Classification Via Machine Learning and Natural Language Processing

The document discusses classifying email spam using machine learning and natural language processing. It analyzes text classification algorithms and URL filtering, then integrates them to build an add-on for Gmail spam classification with high accuracy and performance.


Proceedings of the Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV 2021).

IEEE Xplore Part Number: CFP21ONG-ART; 978-0-7381-1183-4

E-Mail Spam Classification via Machine Learning and Natural Language Processing

2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV) | 978-1-6654-1960-4/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICICV50876.2021.9388530


Akash Junnarkar, Siddhant Adhikari, Jainam Fagania, Priya Chimurkar, Deepak Karia
Sardar Patel Institute of Technology, Mumbai, India

Abstract: In the current scenario, the majority of correspondence and exchange in all business sectors takes place through emails. As the rate of exchange of information via emails increases exponentially, so does the amount of unsolicited bulk mail, or spam. These emails are sent for a number of reasons: extracting confidential information from individuals, promotion of adult content, and marketing/advertising of products and services. Keeping this in mind, it is of paramount importance to build a comprehensive system for spam classification based on semantics-based text classification using NLP and URL-based filtering. Various machine learning algorithms have been surveyed, and the objective is to create a model with high performance and efficiency.

Keywords: Email Classification, Spam, Phishing, Machine Learning, Random Forest, Cyber Security, Text Classification, URL Classification, Add-on, Naive Bayes.

I. INTRODUCTION

In this era of globalisation, in the 21st century, the majority of correspondence and exchange in all business sectors takes place via emails. In the year 2019, 246 billion emails were exchanged per day [1], and this figure is expected to grow to 320 billion emails by the year 2021. Out of these, 128.8 billion emails are business emails while 117.7 billion emails were consumer emails. From individuals to businesses and firms to governments, every entity finds email communication to be an efficient, professional and effective way of transferring information. It is highly possible that the content and body of these mails contain confidential information of high value to the entity. For a business, it could be the details of a prospective high-profile deal which cannot be leaked to the general public; for an individual, it may contain his or her banking and account details; for the government, information whose disclosure is not in the nation's interest may get leaked. Thus, in this scenario it is of utmost importance that all the information exchanged is end-to-end encrypted. There are four main types of spam emails which the paper has targeted, namely: aggressive marketing, adult emails, phishing and email spoofing. The structure of the paper has been divided into four parts. The first part consists of the literature survey; the second part is the explanation of the various machine learning algorithms used, a comparative analysis of the algorithms, and the implementation of the machine learning model for text classification. In the third part, we performed URL filtering; in the fourth part, we integrated both text classification and URL filtering to create an add-on for Gmail.

II. LITERATURE SURVEY

In [2], an integrated approach involving all three processes (URL analysis, NLP, ML) has more accuracy than any of the processes used as a standalone approach. [3] further highlights a system of URL classification in which the Decision Tree algorithm is used and the model is trained using a data-set from Phishtank. In [4], the URL is analysed based on parameters like the number of special characters and dots. Random tree and KNN have the same numerical value for the parameters, but KNN requires more time to build the model than random tree; hence random tree is the most efficient machine learning algorithm used to classify the emails. In [5], Random Forest, K-nearest neighbour, decision tree and support vector machine algorithms were implemented; the Enron Spam Project data-set was used to train the model. Furthermore, an add-on to Outlook was created in C in Visual Studio. Random Forest was the most accurate machine learning algorithm. Initially, the model is trained for text classification from the data-sets available on Enron Email Corpus and CMU Corpus. In [6], a total of 32 parameters are taken into consideration for file classification. The Support Vector Machine algorithm is the most accurate for both stages, i.e. text classification and file classification. In [7], the drawbacks of models based on term frequency were considered, which lead to a huge computational load and slow training speed due to the size of the feature vector space. A model based on semantic similarity is implemented, which extracts semantic meanings layer by layer using effective information retrieval techniques. In [8], the performance of different ML algorithms using features of keyword


extraction, unigrams, bigrams and n-grams is discussed for the classification of spam and fake online reviews. In [9], the cosine similarity measure was applied only to specific parts of speech (POS) after employing lemmatization algorithms, and the effects of different pre-processing algorithms on classification accuracy were studied. In [10], a semantic feature space was created from training data using statistical methods, solving the problems of conventional neural networks using the BP algorithm. A keyword-based spam filtering model with an efficient feature selection method using an adaptive learning rate is used. In [11], modern-day spam statistics and different types of spam attacks (email phishing, spear phishing and spoofing) have been studied; existing software and the scope of techniques for spam classification have been analysed; performance metrics of various supervised learning algorithms are compared; and extraction of email routing information from a spam source is covered. In [12], the focus is on spam management for SMS services using text mining applications such as RapidMiner for classification and clustering algorithms; the authors justify the use of SVMs and Naïve Bayesian models, as they have high performance even in the absence of large data and feature modelling and engineering. In [13], the main focus is on comparing four different machine learning algorithms for content-based spam filtering. Experimental results show that the Neural Network classifier is more sensitive to the training set size and unsuitable for use alone as a spam rejection tool; generally, the performances of the SVM and RVM classifiers are less influenced by data sets and feature sizes, and are clearly superior to the Naive Bayesian classifier. In [14], features for spam classification and the spamming behavior of mails are discussed; the paper presents a rule-based method for instantiating behavior-based features into discrete values, and the design and implementation of back-propagation neural networks for spam classification using behavior-based features. In [15], a spam classifier based on the RF algorithm is used with parameter optimization and feature selection; two parameters of Random Forest are optimized to maximize the spam detection rate, and the importance of individual feature selection is also presented. It can detect spam with high accuracy using low processing resources. [16] reviews several state-of-the-art approaches used for various spam filtering methods. The different spam filtering formulas implemented by Gmail, Yahoo and Outlook are covered, and various performance evaluation measures are discussed on the basis of which the performances are measured. All the machine learning algorithms used for filtering are discussed with their comparative analysis; the paper also includes various open research problems faced by existing techniques. [24] highlights an algorithm where SVM and KNN were both implemented for Chinese web page classification; it improved the classifying predictability and gave better feedback.

III. METHODOLOGY

Figure 1: Functional Block Diagram

Figure 1 depicts the block diagram of the methodology and the steps involved in text classification deploying the various machine learning algorithms.

A. Machine Learning Algorithms

There are various parameters involved in the machine learning algorithms which we have implemented, which are as follows:
• Naive Bayes: Maximum likelihood estimate
• KNN: Euclidean distance
• Decision Tree: Number of nodes, class weight
• Random Forest: Maximum features
• Support Vector Machine: Decision function shape

1) Naïve Bayes: Naïve Bayes is a classification algorithm which uses probabilistic classifiers that are based on applying Bayes' theorem for conditional probability with two assumptions: all features are equally important and all features are independent of each other. For text classification, the features may be modelled on a multinomial distribution or, if the features are continuous, they may follow a Gaussian distribution. A Naïve Bayes classifier deals efficiently with high-dimensional data points [8] in the feature vector space and trains quickly. It requires few training examples, is widely used for text classification, and with appropriate pre-processing of data its performance is comparable to that of advanced methods.

2) K-Nearest Neighbours (KNN): KNN is a widely used supervised learning algorithm mainly used for classification. It assumes that similar things are in close proximity to each other. The degree of similarity is calculated using similarity measures, usually the Euclidean distance. KNN is easy to implement, since there is no need to build a model and tune parameters. KNN is a non-parametric algorithm, so it makes no underlying assumptions about the distribution of the data. However, as the dimensionality and size of the dataset increase, the algorithm becomes slower.

3) Decision Tree: Decision tree is a popular machine learning tool used for both classification and regression, which has a flowchart-like tree structure. It uses a recursive partitioning


algorithm to test the presence of features or attributes based on certain purity indexes [8]. Gini index and entropy are the most popular indexes. The Gini index measures the probability that a randomly chosen sample would be incorrectly classified, and entropy measures the amount of uncertainty in a node; the reduction in entropy produced by a split is the information gain. These are used to decide which features are to be placed at the root or at internal nodes. Decision trees work for both continuous and categorical variables. Figure 2 elucidates how the decision tree algorithm may work, customised to our use case.

Figure 2: Decision Tree Flowchart

4) Random Forest: Random Forest is an ensemble learning algorithm which uses a large number of uncorrelated decision trees to create a model which performs with high accuracy on new datasets because of better generalization to new, unseen data. It reduces the variance and helps prevent over-fitting of the model using the technique of Bootstrap Aggregation, or Bagging, where several subsets of the data are created from the training set, chosen randomly with replacement [19]. These subsets are then used to train each decision tree. Predictions by a multitude of uncorrelated models lead to robust performance on new datasets. Figure 3 elaborates on the working of the random forest algorithm as a collection of decision trees.

Figure 3: Random Forest Flowchart

5) Support Vector Machines (SVM): Support Vector Machine is one of the most widely used supervised learning algorithms for classification. In an n-dimensional space, the model constructs a hyperplane that segregates the classes from each other by analysing the features in the training data set [19]. SVM is a large-margin classifier: it creates an optimum decision surface that has the maximum separation from the training points of each of the classes. SVM can also handle data that is not linearly separable by the use of kernel methods. Kernel methods transform the training data such that the non-linear decision surface becomes a linear decision boundary in a higher-dimensional space. They make it possible to operate in higher dimensions without actually calculating the coordinates, computing only the inner products between the vectors. This gives Support Vector Machines computational efficiency and the ability to scale to larger training sets.

B. Text classification

Text classification, also known as text tagging or text categorization, is a process in which text, which can be unstructured or structured, is classified into organised groups according to the requirements. Various machine learning models use Natural Language Processing to analyze text, perform other operations and then assign groups based on content. Figure 4 depicts the text classification flowchart and the steps involved.

Figure 4: Text Classification flowchart
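As a rough illustration of assigning text to groups based on its content, the sketch below runs a tiny bag-of-words multinomial Naive Bayes classifier in plain Python. The toy training sentences, the stopword set and the function names (`tokenize`, `train`, `predict`) are assumptions made purely for illustration; the experiments reported in this paper rely on scikit-learn and Gensim with the Kaggle and Enron data-sets, not on this code.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
# Small illustrative stopword set (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "if", "for", "your"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def train(samples):
    """Collect per-class document counts and bag-of-words counts."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in samples:
        doc_counts[label] += 1
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log-posterior."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("claim your free prize", model))
print(predict("project meeting on monday", model))
```

With this toy data the first message is labelled spam and the second ham. A realistic system would replace the raw word counts with TF-IDF or word-embedding features and train on a full corpus.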


The following steps are covered:
• Feature Engineering
• Gensim Package
• Pre-processing the data
• Vectorization

1) Feature Engineering: Feature engineering has always been the key to creating superior, efficient machine learning models. It is very difficult to work on unstructured text data, especially when trying to build a machine learning model. To convert the data into structured text data we need to process and transform noisy, unstructured data into a vectorized format, so that we can perform complex operations with the vectors and so that the data can easily be understood by machine learning models. ML algorithms are based on three basic principles: statistics, optimization and mathematics. Machine learning models are therefore not intelligent enough to start processing text in its unstructured, native form. That is why, to obtain logical and understandable data, we perform feature engineering on the unstructured or raw data, which helps us create feature vectors that are easily understood by machines as well as humans [19].

2) Gensim Package: Traditional methods for feature engineering were mainly based on counting the frequency or occurrence of a word and then predicting the output; examples of such methods are term frequency (TF), term frequency-inverse document frequency (TF-IDF) etc. These methods are very effective, but due to the nature of the models we are not able to consider important information like semantic similarity, the context of the data and the sequence of nearby words. That is why we use more sophisticated models, and for that we require the Gensim package. Gensim stands for Generate Similar; it is a popular open-source NLP library which is used for topic modelling. It is used to perform various complex tasks, for example, the semantic similarity approach, handling large text collections, and generation of word embeddings using Word2Vec to create a feature vector space of 100 dimensions [19].

3) Pre-processing the data: There are several ways in which data can be cleaned and pre-processed in natural language processing pipelines. They are mentioned below:
• Removing tags: The text data set may contain unnecessary tags like HTML which are of no use and add zero value. So tags like these are removed.
• Removing accented characters: The text data set may contain accented characters, for example, é, á, ñ etc. This matters mainly for English-language data sets, where these characters are converted into standard ASCII characters; for example, ‘á’ is converted to ‘a’.
• Expanding contractions: The text data set might contain shortened words or shortened syllables, and they are converted to their expanded, original and standard form; for example, ‘I’d’ is converted to ‘I would’, ‘don’t’ is converted to ‘do not’ etc.
• Removing special characters: The text data set may contain special characters which are non-alphanumeric and are of no use, for example, $, @, etc. These are removed using regular expressions (regex).
• Stemming and lemmatization: Word stems are the base forms to which suffixes or prefixes are added to create words; for example, dances, dancer and dancing all share the root word dance. The process of obtaining the base form is known as stemming. Lemmatization is similar to stemming, except that affixes are removed to obtain a root word and not a root stem.
• Removing stopwords: There are many words in the data set which have little or no significance when creating features. Such words have the highest frequency; examples are a, an, the, is, if etc. Stopword lists are available in natural language processing toolkits.
There are various other pre-processing steps like tokenization, removing extra spaces, removing digits, text case lowering, spelling correctors etc. [19].

4) Machine Learning algorithms: After comparing all the machine learning algorithms, two of them were used, and they are listed below:
• Naive Bayes: Naive Bayes is one of the ML models used. It is a statistical technique for email filtering which uses BOW (bag of words) features to identify whether the mail is spam or ham. It works by correlating the tokens with spam and ham mails and then, as its name suggests, uses Bayes' theorem to calculate whether the mail is spam or ham. Its main advantages are that it can be trained on the user's own data set and that it has a very low false-positive rate in spam detection.
• Support Vector Machine: Support vector machine delivers very high accuracy as it uses an optimization process. It classifies by identifying the optimal hyperplane and maximizing the margin that separates spam and ham. SVM is robust in nature and is very effective when the number of dimensions is greater than the number of samples [16].

5) Vectorization: To overcome the above-mentioned shortcomings we use vector space models (VSM), in which word vectors are plotted in an n-dimensional vector space on the basis of contextual and semantic similarity. Words which occur and are used in the same context are semantically similar to one another. To achieve this we use word2vec models.
In simple words, word2vec is a natural language processing technique that uses a two-layer neural network to learn word associations from a large data set. It can then predict words with similar meaning or suggest words to complete a partial sentence. A word2vec model takes in a large amount of text data and produces a vector space of a hundred or more dimensions; every word in the data set is assigned a unique corresponding vector, and these vectors are positioned such that words with similar meaning or context are placed close to one another [19]. For the above process, word2vec uses 2 model architectures and


they are:
• Continuous bag of words: Continuous bag of words is a model architecture that predicts the current word using the surrounding or nearby words.
• Continuous skip-gram: Continuous skip-gram is a model architecture that predicts the surrounding or nearby words of the sentence using the current word.
Diagram [20]

C. URL Filtering

As aforementioned, the first step of spam classification is text classification, in which the machine learning algorithms having the greatest accuracy and precision, i.e. Naive Bayes and Support Vector Machine, are implemented in order to classify the text containing trigger words. The next step is URL classification and filtering. In order to protect users from mails containing malicious URLs in the email body, it is of paramount importance that a proper methodology is followed to ensure data privacy, safety, integrity and security. This step has been further sub-categorized into four steps as shown in Figure 5:

Figure 5: URL Filtering flowchart
• URL Un-Shortening
• Similarity to Blacklisted URL
• Presence of Trigger Words
• Identification of Special Characters

1) URL un-shortening: In some cases a spammer may deliberately try to hide the actual harmful or malicious URL underneath a safe-looking URL in order to trick the user into giving up data or information. The attacker may do this by shortening the original link to form a separate link which looks safe to visit. This problem is tackled by un-shortening the URL, i.e. following the redirection to obtain the original link. This link is then analysed in the further steps.

2) Blacklisted URL: The un-shortened URL, along with the original shortened URL, is compared to a list of URLs present in the blacklisted URL data-set; if a match is found, the URL is automatically treated as malicious and the email is classified as spam.

3) Presence of trigger words: The URL is also analysed for specific words known as trigger words. Such words are listed in the spam trigger data-set [16]; if any phrase or word matches a word in the list, the URL is classified as malicious, and the mail as spam. These words represent a variety of mails such as phishing, marketing and advertising, banking scams and adult content.

4) Identification of special characters: The last sub-step involves detecting special characters present in the URL. These characters are usually used by attackers to form a URL similar to an already established, safe-looking URL, so that users mistake it for a verified site and enter their confidential personal details. Such URLs mainly redirect users to a website which has a user interface similar to an existing verified website in order to gain the confidence of the users. This phenomenon is common on social networking sites and in the finance and banking sector. Thus, special characters like ‘@’ are identified in order to detect whether a URL is malicious.
Thus in this step, if any of the four sub-steps has a positive output, i.e. any of these aspects is present, the URL present in the mail is considered to be malicious.

D. Integration

After the models for the first step, text classification, and the second step, URL analysis and filtering, were individually created, the next and penultimate step was to integrate the two and create a model which takes into consideration the output of both the first and the second step. An OR operation is followed here: the model has been created such that if either output says that the text contains spam words or that the URL is malicious, the mail is considered spam and classified accordingly. Thus, for an email to be classified as ham, it has to pass a two-step process of text classification and URL classification. This model is then hosted as an API on the Heroku platform, on a cloud server.
Google Apps Script is an application development platform used to create customized user interfaces that can integrate with Google Workspace. Apps Script is based on the JavaScript language and allows developers to create various types of add-ons.
Google has provided a platform for developers to test, deploy and publish add-ons and integrate them with other Google products such as Google Docs, Google Sheets and Gmail. For the purpose of a demonstration, a Google Workspace add-on has been created which is integrated with Gmail. The Apps Script, when executed, classifies the mails in real time as spam or ham by calling the API through a request.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

Initially, text classification was performed, as aforementioned, using the Term Frequency-Inverse Document Frequency approach. 5 different machine learning algorithms have been implemented in order to perform a comparative analysis of


their performance, taking into consideration key performance indicators such as Accuracy, Precision, Recall and F1 score. A confusion matrix has been created to validate the same. Following are the algorithms implemented:

• K Nearest Neighbour
• Naive Bayes
• Decision Tree
• Random Forest
• Support Vector Machine

The algorithms are implemented on 2 different data-sets, namely spam.csv available on Kaggle [23] and the Enron [22] spam data-set. Kaggle data set: The spam collection in this data set is a set of tagged SMS messages which we have used to perform the experiment. This data set contains 5574 messages in English which are tagged as ham, i.e. legitimate, or spam [23]. Enron data set: The Enron data set was used for performing the experiment using different ML algorithms to get the required results. The Enron data set consists of 30207 examples in total, out of which 16545 examples were classified as legitimate or ham and the remaining 13662 examples were classified as spam [22].

A machine learning model was created, and the 5 algorithms were imported from the scikit-learn library and used for classification.

The comparative tabular analysis is given in the tables below:

Table 1: Comparative Tabular Analysis 1

Table 2: Comparative Tabular Analysis 2

Table 3: Comparative Tabular Analysis 3

Furthermore, there are certain shortcomings in the TF-IDF approach which are overcome by using the Gensim library. In this paper, the TF-IDF approach was discarded while the Gensim library, which supports semantic similarity, was used, yielding better results. In addition to text classification, the URL present was also classified via a four-step filtering and analysis. This was the novelty presented in this paper. The word2vec model used through the Gensim library was introduced by Google in 2013; it utilizes the text8 data-set, which has been iterated over 13 million documents, improving the accuracy. The Gensim library takes into consideration the semantic similarity approach. As the accuracy for Naive Bayes and Support Vector Machine was highest, they were implemented along with Gensim, which resulted in better accuracy compared to the TF-IDF approach. The accuracy for Naive Bayes and Support Vector Machine is 95.48 percent and 97.83 percent respectively.
In order to obtain real-time results, 110 mails were sent to a particular Gmail account to check if the emails were classified correctly. The results obtained were as follows:
Total Number of Emails sent: 110
Spam Emails sent: 57
Ham Emails sent: 53
Spam Emails classified as spam (True Positive): 53


Spam Emails classified as ham (False Negative): 4 [4] S. Nandhini and D. J. Marseline.K.S, “Performance Evaluation of
Ham Emails classified as spam (False Positive): 3 Machine Learning Algorithms for Email Spam Detection”, 2020 Inter-
national Conference on Emerging Trends in Information Technology and
Ham Emails classified as ham (True Negative): 50 Engineering (ic-ETITE), Vellore, India, 2020, pp. 1-4, doi: 10.1109/ic-
Accuracy: 91.11 Percent ETITE47903.2020.312.
[5] K. Kandasamy and P. Koroth, “An integrated approach to spam clas-
Precision: 94.64 Percent sification on Twitter using URL analysis, natural language processing
Recall: 92.98 Percent and machine learning techniques”, 2014 IEEE Students’ Conference on
F1 score: 0.9137 Electrical, Electronics and Computer Science, Bhopal, 2014, pp. 1-5,
doi: 10.1109/SCEECS.2014.6804508.

V. FUTURE SCOPE

Further research on this topic can proceed across several sub-domains.
Initially, the focus can be on improving accuracy by using more
computationally expensive but more accurate machine learning classifiers
such as XGBoost. Furthermore, word embedding algorithms other than Gensim
Word2Vec can be explored. Research in deep learning could include
transformer-based models, which were introduced in 2017. They enable
training on very large data sets and include pre-trained systems that are
used for text summarization and translation. Lastly, real-time learning of
email classifiers is something that the current data sets do not focus on.
It is important because real-time factors play a major role in determining
classification accuracy.

VI. CONCLUSION

A comprehensive and efficient spam classification system has been created
that follows a two-step methodology to determine whether a received mail is
spam. First, text classification takes place, followed by URL analysis and
filtering to determine whether any link present in the mail is malicious.
For text classification, five machine learning algorithms were studied and
analyzed, of which Naive Bayes and Support Vector Machine, having the
highest accuracy, were included in the final model. Various data sets were
referred to for a list of spam trigger words and a list of blacklisted URLs.
This model was hosted as an API, which was then called by the JavaScript
code in Google Apps Script in order to classify mails in real time in Gmail.

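The two-step flow summarized in the conclusion can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The names `extract_hosts` and `classify_email` are hypothetical. The text classifier is passed in as a plain callable that stands in for the trained Naive Bayes / Support Vector Machine model. Marking a mail as spam when either step flags it is an assumed combination rule. Matching blacklist entries by host name is also an assumption.

```python
import re
from typing import Callable, Iterable, Set
from urllib.parse import urlparse

# Simple pattern for http(s) links; real mail bodies may need a stricter parser.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_hosts(text: str) -> Set[str]:
    """Return the lower-cased host names of all http(s) links found in text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            # Malformed URLs (e.g. a broken IPv6 literal) are skipped.
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def classify_email(
    body: str,
    text_is_spam: Callable[[str], bool],
    blacklisted_hosts: Iterable[str],
) -> str:
    """Label a mail 'spam' or 'ham' using text classification, then URL filtering."""
    # Step 1: text classification (stand-in for the trained model).
    if text_is_spam(body):
        return "spam"
    # Step 2: URL filtering against a blacklist of known-bad hosts.
    if extract_hosts(body) & {h.lower() for h in blacklisted_hosts}:
        return "spam"
    return "ham"


# Toy stand-in classifier and blacklist, for illustration only.
toy_classifier = lambda text: "free money" in text.lower()
toy_blacklist = {"bad.example"}
label = classify_email("Minutes attached: http://bad.example/login", toy_classifier, toy_blacklist)
```

Keeping the classifier behind a callable means the same flow works whether the model runs in-process or behind a hosted API, as in the deployed add-on.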

ACKNOWLEDGMENT

We take this opportunity to express our gratitude to the Department of
Electronics Engineering at the Sardar Patel Institute of Technology, Mumbai,
for their valuable guidance and help at various stages.
[21] Two Simple Adaptations of Word2Vec for Syntax Problems -
R EFERENCES Scientific Figure on ResearchGate, accessed 3 November 2020,
[1] Statista, accessed 3 November2020, https://www.researchgate.net/figure/Illustration-of-the-Skip-gram-and-
https://www.statista.com/statistics/255080/number-of-e-mail-users- Continuous-Bag-of-Word-CBOW-modelsfig1281812760
worldwide/ [22] Enron Spam data set accessed on 3 November 2020,
[2] E. Marková, T. Bajtoš, P. Sokol and T. Mézešová, “Classification of http://nlp.cs.aueb.gr/softwareanddatasets/Enron-Spam/index.html
malicious emails”, 2019 IEEE 15th International Scientific Confer- [23] Kaggle data set accessed on 3 November 2020,
ence on Informatics, Poprad, Slovakia, 2019, pp. 000279-000284, doi: https://www.kaggle.com/uciml/sms-spam-collection-dataset
10.1109/Informatics47936.2019.9119329. [24] Y. Lin and J. Wang, ”Research on text classification based on SVM-
[3] M. S. Swetha and G. Sarraf, “Spam Email and Malware Elimination KNN,” 2014 IEEE 5th International Conference on Software Engineer-
employing various Classification Techniques”, 2019 4th International ing and Service Science, Beijing, 2014, pp. 842-844, doi: 10.1109/IC-
Conference on Recent Trends on Electronics, Information, Communica- SESS.2014.6933697.
tion Technology (RTEICT), Bangalore, India, 2019, pp. 140-145, doi:
10.1109/RTEICT46194.2019.9016964.
