E-Mail Spam Classification Via Machine Learning and Natural Language Processing
Abstract—In the current scenario, the majority of correspondence and exchange in all business sectors takes place through emails. As the rate of exchange of information via emails is increasing exponentially, so is the amount of unsolicited bulk mail, or spam. These emails are sent for a number of reasons: extracting confidential information from individuals, promotion of adult content, and marketing/advertising of products and services. Keeping this in mind, it is of paramount importance to build a comprehensive system for spam classification based on semantics-based text classification using NLP and URL-based filtering. Various machine learning algorithms have been surveyed, and the objective is to create a model with high performance and efficiency.

Keywords— Email Classification, Spam, Phishing, Machine Learning, Random Forest, Cyber Security, Text Classification, URL Classification, Add-on, Naive Bayes.

I. INTRODUCTION

In this era of globalisation, the majority of correspondence and exchange in all business sectors takes place via emails. In the year 2019, 246 billion emails were exchanged per day [1], and this figure is expected to grow to 320 billion emails by the year 2021. Of the 2019 figure, 128.8 billion were business emails and 117.7 billion were consumer emails. From individuals to businesses and firms to governments, every entity finds email communication to be an efficient, professional and effective way of transferring information. It is highly possible that the content and body of these mails contain confidential information of high value to the entity. For a business, it could be the details of a prospective high-profile deal which cannot be leaked to the general public; for an individual, it may contain his or her banking and account details; for a government, information whose disclosure is not in the nation’s interest may get leaked. Thus, in this scenario it is of utmost importance that all the information exchanged is end-to-end encrypted. There are four main types of spam emails which the paper targets, namely aggressive marketing, adult emails, phishing and email spoofing. The paper is divided into four parts. The first part consists of the literature survey. The second part explains the various machine learning algorithms used, gives a comparative analysis of these algorithms, and describes the implementation of the machine learning model for text classification. In the third part, we performed URL filtering, and in the fourth part, we integrated both text classification and URL filtering to create an add-on for Gmail.

II. LITERATURE SURVEY

In [2], an integrated approach involving all three processes (URL analysis, NLP and ML) achieves higher accuracy than any of the processes used standalone. [3] further highlights a system of URL classification in which the Decision Tree algorithm is used and the model is trained on a data-set from PhishTank. In [4], the URL is analysed based on parameters like the number of special characters and dots. Random tree and KNN obtain the same numerical values for the parameters, but KNN requires more time to build the model than random tree; hence random tree is reported as the most efficient machine learning algorithm for classifying the emails. In [5], random forest, K nearest neighbour, decision tree and support vector machine algorithms were implemented, and the Enron Spam Project data-set was used to train the model. Furthermore, an add-on to Outlook was created in C in Visual Studio. Random Forest was the most accurate machine learning algorithm. Initially, the model is trained for text classification on the data-sets available in the Enron Email Corpus and the CMU Corpus. In [6], a total of 32 parameters are taken into consideration for file classification. The Support Vector Machine algorithm is the most accurate for both stages, i.e. text classification and file classification. In [7], the drawbacks of models based on term frequency were considered, namely a huge computational load and slow training speed due to the size of the feature vector space. A model based on semantic similarity is implemented, which extracts semantic meanings layer by layer using effective information retrieval techniques. In [8], performance of different ML algorithms by using features of keyword
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on March 24,2024 at 16:43:10 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV 2021).
IEEE Xplore Part Number: CFP21ONG-ART; 978-0-7381-1183-4
they are,
• Continuous bag of words: Continuous bag of words is a model architecture that predicts the current word using the surrounding (nearby) words.
• Continuous skip-gram: Continuous skip-gram is a model architecture that predicts the surrounding (nearby) words of the sentence using the current word.
Diagram [20]

C. URL Filtering

As aforementioned, the first step of spam classification is text classification, in which the machine learning algorithms with the greatest accuracy and precision, i.e. Naive Bayes and Support Vector Machine, are implemented in order to classify text containing trigger words. The next step is URL classification and filtering. In order to protect users from mails containing malicious URLs in the email body, it is of paramount importance that a proper methodology is followed to ensure data privacy, safety, integrity and security. This step has been further sub-categorized into four steps, as shown in the figure:

Figure 5: URL Filtering flowchart

• URL Un-Shortening
• Similarity to Blacklisted URL
• Presence of Trigger Words
• Identification of Special Characters

1) URL un-shortening: In some cases a spammer may deliberately hide the actual harmful or malicious URL underneath a safe-looking URL in order to trick the user into giving up data or information. The attacker may do this by shortening the original link to form a separate link which, from the outside, looks safe to visit. This problem is tackled by un-shortening the URL, i.e. following the one-way redirection to obtain the original link. This link is then analysed in the further steps.

2) Blacklisted URL: The un-shortened URL, along with the original shortened URL, is compared to a list of URLs present in the blacklisted URL data-set. If either matches an entry in the list, the URL is automatically treated as malicious and the email is classified as spam.

3) Presence of trigger words: The URL is also analysed for some specific words known as trigger words. Such words are listed in the spam trigger data-set [16]; if any phrase or word matches a word in the list, the URL is classified as malicious and the mail as spam. These words cover a variety of mails such as phishing, marketing and advertising, banking scams and adult content.

4) Identification of special characters: The last sub-step involves detecting special characters present in the URL. These characters are usually used by attackers to form a URL similar to an already established, safe-looking URL, so that users mistake it for a verified site and enter their personal confidential details. Such URLs mainly redirect users to a website whose user interface is similar to that of an existing verified website in order to gain the users’ confidence. This phenomenon is common on social networking sites and in the finance and banking sector. Thus, special characters like ‘@’ are identified in order to detect whether a URL is malicious or not.

Thus, in this step, if any of the four sub-steps has a positive output, i.e. any of these aspects is present, the URL present in the mail is considered to be malicious.

D. Integration

After the models for the first step, text classification, and the second step, URL analysis and filtering, were individually created, the next and penultimate step was to integrate the two and create a model which takes into consideration the output of both steps. An OR operation is followed here: if either step reports that the text contains spam words or that the URL is malicious, the mail is considered spam and classified accordingly. Thus, for an email to be classified as ham, it has to pass a two-step process of text classification and URL classification. This model is then hosted as an API on the Heroku platform, on the cloud server.

Google Apps Script is an application development platform used to create customized user interfaces that can integrate with Google Workspace. Apps Script is based on the JavaScript language and allows developers to create various types of add-ons.

Google has provided a platform for developers to test, deploy and publish add-ons and integrate them with other Google products such as Google Docs, Google Sheets and Gmail. For the purpose of a demonstration, a Google Workspace add-on has been created which is integrated with Gmail. The Apps Script, when executed, classifies the mails in real time as spam or ham by calling the API through a request.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

Initially, text classification was performed, as aforementioned, using the Term Frequency Inverse Document Frequency approach. Five different machine learning algorithms have been implemented in order to perform a comparative analysis of
• K Nearest Neighbour
• Naive Bayes
• Decision Tree
• Random Forest
• Support Vector Machine
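As a rough illustration of the Term Frequency Inverse Document Frequency weighting that feeds these classifiers, the sketch below computes TF-IDF vectors for a few toy messages in plain Python. The tokenizer, the unsmoothed idf = log(N/df) variant and the sample messages are assumptions made for illustration; the paper does not state which TF-IDF variant or library was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    (a plain, unsmoothed variant chosen for illustration).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty messages
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy messages (hypothetical, not taken from the data-sets used in the paper).
messages = [
    "Win a free prize now",
    "Meeting moved to Monday",
    "Claim your free prize today",
]
vectors = tfidf_vectors(messages)
```

Under this variant, a term that occurs in only one message (such as "win") receives a larger weight than a term shared by several messages (such as "free"), and a term present in every message would receive weight zero.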
The algorithms are implemented on two different data-sets, namely spam.csv available on Kaggle [23] and the Enron [22] spam data-set.

Kaggle data set: The spam collection in this data set is a set of tagged SMS messages which we used to perform the experiment. It contains 5574 messages in English, each tagged as ham (i.e. legitimate) or spam [23].

Enron data set: The Enron data set was used to run the experiment with the different ML algorithms. It consists of 30207 examples in total, of which 16545 are classified as legitimate (ham) and the remaining 13662 as spam [22].

Table 2: Comparative Tabular Analysis 2
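To make the classification step more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier, one of the five algorithms listed above, trained on ham/spam-tagged messages of the kind found in these data-sets. It works on raw word counts with add-one (Laplace) smoothing rather than on TF-IDF features, and the class name `NaiveBayesSpamFilter` and the toy training messages are hypothetical; it is not the authors' implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of messages per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum over tokens of log P(token | label)
        total_msgs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_msgs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + len(self.vocabulary)
        for tok in tokens:
            if tok in self.vocabulary:  # ignore words never seen in training
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return max(self.class_counts,
                   key=lambda label: self._log_score(tokens, label))


# Toy training data (hypothetical, for illustration only).
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "are we still meeting for lunch",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesSpamFilter().fit(train_texts, train_labels)
print(clf.predict("free prize offer"))
print(clf.predict("project meeting for lunch"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the add-one smoothing keeps a word that was seen in only one class from forcing the other class's probability to zero.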
Spam Emails classified as ham (False Negative): 4
Ham Emails classified as spam (False Positive): 3
Ham Emails classified as ham (True Negative): 50
Accuracy: 91.11%
Precision: 94.64%
Recall: 92.98%
F1 score: 0.9137

V. FUTURE SCOPE

Further research on this topic can be done across various sub-domains. Initially, the focus can be on improving accuracy by using more computationally expensive but more accurate machine learning classifiers like XGBoost. Furthermore, word embedding algorithms other than Gensim Word2Vec can be explored. Research in the field of deep learning could include transformer-based deep learning models, which were introduced in 2017. They enable training on very large data sets and include pre-trained systems which are used for text summarization and translation. Lastly, real-time learning of email classifiers is something which the current data-sets do not focus on. It is important because real-time factors play a huge role in determining the classification accuracy.

VI. CONCLUSION

A comprehensive and efficient spam classification system has been created which follows a two-step methodology to determine whether a received mail is spam or not. Initially, text classification takes place, followed by URL analysis and filtering to determine whether any link present in the mail is malicious. For text classification, five machine learning algorithms were studied and analyzed, out of which Naive Bayes and Support Vector Machine, having the highest accuracy, were included in the final model. Various data-sets have been referred to for a list of spam trigger words and a list of blacklisted URLs. This model was hosted as an API, which was then called by the JavaScript code in Google Apps Script in order to classify mails in real time in Gmail.

ACKNOWLEDGMENT

We avail this opportunity to express our indebtedness to the Department of Electronics Engineering at the Sardar Patel Institute of Technology, Mumbai for their valuable guidance and help at various stages.

REFERENCES

[1] Statista, accessed 3 November 2020, https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/
[2] E. Marková, T. Bajtoš, P. Sokol and T. Mézešová, “Classification of malicious emails”, 2019 IEEE 15th International Scientific Conference on Informatics, Poprad, Slovakia, 2019, pp. 000279-000284, doi: 10.1109/Informatics47936.2019.9119329.
[3] M. S. Swetha and G. Sarraf, “Spam Email and Malware Elimination employing various Classification Techniques”, 2019 4th International Conference on Recent Trends on Electronics, Information, Communication Technology (RTEICT), Bangalore, India, 2019, pp. 140-145, doi: 10.1109/RTEICT46194.2019.9016964.
[4] S. Nandhini and D. J. Marseline.K.S, “Performance Evaluation of Machine Learning Algorithms for Email Spam Detection”, 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), Vellore, India, 2020, pp. 1-4, doi: 10.1109/ic-ETITE47903.2020.312.
[5] K. Kandasamy and P. Koroth, “An integrated approach to spam classification on Twitter using URL analysis, natural language processing and machine learning techniques”, 2014 IEEE Students’ Conference on Electrical, Electronics and Computer Science, Bhopal, 2014, pp. 1-5, doi: 10.1109/SCEECS.2014.6804508.
[6] S. B. Rathod and T. M. Pattewar, “A comparative performance evaluation of content based spam and malicious URL detection in E-mail”, 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), Bhubaneswar, 2015, pp. 49-54, doi: 10.1109/CGVIS.2015.7449891.
[7] Wei Hu, Jinglong Du, and Yongkang Xing, “Spam Filtering by Semantics-based Text Classification”, 8th International Conference on Advanced Computational Intelligence, Chiang Mai, Thailand, February 14-16, 2016.
[8] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa et al., “Survey of review spam detection using machine learning techniques”, Journal of Big Data 2, 23 (2015). https://doi.org/10.1186/s40537-015-0029-9
[9] Vlad Sandulescu, Martin Ester, “Detecting Singleton Review Spammers Using Semantic Similarity”, WWW ’15 Companion Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 971-976, doi: 10.1145/2740908.2742570
[10] Cheng Hua Li, Jimmy Xiangji Huang, “Spam filtering using semantic similarity approach and adaptive BPNN”, Neurocomputing Journal, Elsevier, https://doi.org/10.1016/j.neucom.2011.09.036
[11] Krishnan Kannoorpatti, Asif Karim, Sami Azam, Bharanidharan Sanmugam, “A Comprehensive Survey for Intelligent Spam Email Detection,” IEEE Journal of Computational Intelligence, 2015.
[12] Zainal K, Sulaiman NF, Jali MZ, “An Analysis of Various Algorithms For Text Spam Classification and Clustering Using RapidMiner and Weka”, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 13, No. 3, March 2015.
[13] B. Yu, Z. Xu, “A comparative study for content-based dynamic spam classification”, Knowl. Based Syst., China, 2008, doi: 10.1016/j.knosys.2008.01.001
[14] C.H.Wu, “Behavior based spam detection using a hybrid method of rule based techniques and neural networks”, Expert Systems with Applications, Kaohsiung, Taiwan, 2009, doi: 10.1016/j.eswa.2008.03.002
[15] S.M.Lee, D.S.Kim, J.H.Kim, J.S.Park, “Spam Detection Using Feature Selection and Parameter Optimization”, 2010 International Conference on Complex, Intelligent and Software Intensive Systems, doi: 10.1109/CISIS.2010.116
[16] E.G.Dada, J.S.Bassi, H.Chiroma, S.M.Abdulhamid, A.O.Adetunmbi, O.E.Ajibuwa, “Machine learning for email spam filtering: review, approaches and open research problems”, Heliyon (2019), doi: 10.1016/j.heliyon.2019.e01802
[17] 455 Spam Trigger Words to Avoid in 2019, accessed 3 November 2020, https://prospect.io/blog/455-email-spam-trigger-words-avoid-2018/
[18] PhishTank, accessed 3 November 2020, https://www.phishtank.com/
[19] Word2vec skip gram and cbow, accessed 3 November 2020, https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314
[20] Asif Karim, Sami Azam, Bharanidharan Shanmugam, Krishnan Kannoorpatti, and Mamoun Alazab, “A Comprehensive Survey for Intelligent Spam Email Detection”, College of Engineering, IT and Environment, Charles Darwin University, Casuarina, NT 0810, Australia.
[21] Two Simple Adaptations of Word2Vec for Syntax Problems - Scientific Figure on ResearchGate, accessed 3 November 2020, https://www.researchgate.net/figure/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-modelsfig1281812760
[22] Enron Spam data set, accessed 3 November 2020, http://nlp.cs.aueb.gr/softwareanddatasets/Enron-Spam/index.html
[23] Kaggle data set, accessed 3 November 2020, https://www.kaggle.com/uciml/sms-spam-collection-dataset
[24] Y. Lin and J. Wang, “Research on text classification based on SVM-KNN,” 2014 IEEE 5th International Conference on Software Engineering and Service Science, Beijing, 2014, pp. 842-844, doi: 10.1109/ICSESS.2014.6933697.