International Journal of Computer Science Trends and Technology (IJCST) – Volume 11 Issue 3, May-Jun 2023
RESEARCH ARTICLE OPEN ACCESS
Precise Email Classification using Deep Learning
Ms. Deepali Bhimrao Chavan
Computer Science & Engineering Department, AMGOI, Vathar, Maharashtra
Prof. Suraj Shivaji Redekar
HOD, Computer Science & Engineering Department, AMGOI, Vathar, Maharashtra
ABSTRACT
Email is used in practically every industry today, from business to education. Emails fall into two
categories: ham and spam. Email spam, also known as junk or unwelcome email, is a kind of email that can
harm users by wasting their time and computing resources and by stealing important data. The volume of spam
email is rising quickly day by day. Today's email and IoT service providers face major challenges with spam
identification and filtering. Email filtering is one of the most important and well-known methods for
identifying and preventing spam. SVM, decision trees, CNN, and other machine learning and deep learning
approaches have all been applied to this problem.
Together with the explosive growth in internet users, email spam has increased substantially in recent years.
Spam emails are used for illegal and dishonest purposes, such as fraud, phishing, and distributing malicious
links through unsolicited email that can harm our systems and gain access to them. By quickly creating fake
profiles and email accounts, spammers prey on those who are unaware of these scams. They use real names in
their spam emails. As a result, it is critical to identify spam emails that involve fraud. This project does
so using machine learning methods: this article examines machine learning algorithms, applies them to our
data sets, and selects the approach that detects email spam with the highest precision and accuracy.
I. INTRODUCTION
Email spam, also known as electronic mail spam, is the practice of sending unwanted or commercial emails to a list of recipients. "Unsolicited" means that the recipient has not given consent to receive them. Over the last decade, the use of spam email has grown steadily, and spam has become a significant online problem. Spam wastes storage space and time and slows message delivery. Although automatic email filtering may be the best way to stop spam, modern spammers can quickly get around such filters. Until a few years ago, most spam that came from particular email addresses could be blocked manually. In this work, a machine learning approach is used for spam detection.

Email is the most popular form of official communication for business purposes. Despite the existence of other communication tools, email usage keeps growing. The daily increase in email volume makes automated email management necessary. More than 55% of all emails are classified as spam, which shows how much of email users' time and resources spam wastes while producing nothing useful. Because spammers use sophisticated and inventive tactics to carry out criminal activity through spam emails, it is crucial to understand the various spam email classification approaches and how they work.

Emails are a popular form of communication for both personal and business purposes. An intelligent and effective approach for detecting phishing websites is built on classification or association data mining methods. The proposed system can be thought of as a classification problem with two categories, ham and phished, whose purpose is to detect phished email. Machine learning is the area of artificial intelligence in which a system is given the capacity to learn without being explicitly programmed. Supervised machine learning algorithms are used for classification in our model.

1.1 PROBLEM STATEMENT:
Several packet features were extracted for classification from each email in a self-created dataset that included n phished emails and m ham emails. These features are passed to the classifiers, and the results are recorded. The goal is to categorize the data using a variety of machine learning techniques while employing the fewest possible features, in order to create a more accurate system.

1.2 OBJECTIVE:
1. To create and implement a machine learning strategy for email phishing detection using large synthetic as well as real-time data.

2. To create a strategy that uses different machine learning algorithms and to investigate the accuracy using majority routing.

3. To create an algorithm that extracts various features from emails in order to improve classification accuracy.
ISSN: 2347-8578 www.ijcstjournal.org Page 148
4. To examine and verify the system's classification findings using current detection methods.

II. RELATED WORK

Nikhil Kumar, Sanket Sonowal, Nishant et al. [1] Email spam has grown significantly in recent years along with the rapid expansion of internet users. Spam emails are being used for fraud, phishing, and other unethical and criminal activities, such as sending dangerous links through unsolicited emails that can damage our systems and gain access to them.

Naeem Ahmed, Rashid Amin, Hamza Aldabbas et al. [2] In practically every industry today, from business to education, emails are used. Ham and spam are the two categories of emails. Email spam, also known as junk or unwelcome email, is a kind of email that can hurt any user by wasting their time and computing resources and by stealing important data.

Luo GuangJun, Shah Nazir, Habib Ullah Khan, and Amin Ul Haq et al. [3] The detection of spam is a significant problem in mobile SMS communication, which makes it insecure. A precise and accurate mechanism for detecting spam in mobile SMS communication is required to address this issue. For accurate identification, the authors proposed machine learning-based spam detection techniques.

Sridevi Gadde, A. Lakshmanarao, S. Satyanarayana [4] The number of mobile device users is growing every day. Both smartphones and basic phones support SMS (short message service), a text messaging service. As a result, SMS traffic has risen dramatically, and so has the number of spam messages. Spammers send spam messages for financial or commercial benefit, such as market expansion or the collection of credit card information. Hence, spam classification receives considerable attention. In this study, the authors used a variety of machine learning and deep learning techniques to identify SMS spam and built a spam detection model using data from UCI.

Mehul Gupta, Aditya Bakliwal, Shubhangi Agarwal, Pulkit Mehndiratta [5] Short Messaging Service (SMS) usage on phones has expanded to such a degree, owing to technological developments and an increase in content-based advertising, that devices are sometimes inundated with a large number of spam SMS. Private data loss is another risk posed by these spam messages. Numerous content-based machine learning methods have been successfully used to filter spam emails. Contemporary studies have classified text messages as spam or ham using certain stylistic characteristics. The ability to detect SMS spam can be significantly affected by the use of well-known terms, phrases, abbreviations, and idioms.

Asma Bibi, Rasia Latif, Samina Khalid, Waqas Ahmed, Raja Ahtsham Shabir, Tehmina Shahryar et al. [6] Emails are a common form of communication at both a personal and a business level. Over time, emails are increasingly being used for spamming, distributing viruses, and defrauding internet users. Acceptable emails are classified as ham, whereas unsolicited emails are classified as spam. Many machine learning techniques have been used over the years to predict the category of emails. The authors consider a classifier that is effective at classifying texts.

Neelam Choudhary, Ankit Kumar Jain et al. [7] Mobile devices are becoming more and more popular because they offer a wide range of services at lower prices. SMS, or short message service, is one of the more popular types of communication. However, this has increased attacks on mobile devices, such as SMS spam. The authors describe a novel strategy that uses machine learning classification techniques to identify and filter spam messages. They examined the characteristics of spam messages and identified ten features that can effectively separate SMS spam from ham messages.

Emmanuel Gbenga Dada, Joseph Stephen Bassi, Haruna Chiroma, Shafi'i Muhammad Abdulhamid, Adebayo Olusola Adetunmbi, Opeyemi Emmanuel Ajibuwa et al. [8] The need for more dependable and powerful antispam filters has increased dramatically due to the rise in the number of unsolicited emails, or spam. Recently, spam emails have been successfully detected and filtered using machine learning techniques. The authors give a thorough analysis of several popular machine learning-based email spam filtering techniques. Their analysis includes a summary of the key ideas, efforts, successes, and current research directions in spam filtering. The preliminary background discussion looks at how machine learning techniques are applied to the email spam filtering systems of the top
internet service providers (ISPs), including Google, Yahoo, and Outlook.

Paras Sethi, Vaibhav Bhandari, Bhavna Kohli et al. [9] Spam emails and messages have increased during the last few years. Today, spam text messages can be combated with legal, economic, and technical methods. Bayesian filters play a significant part in preventing this issue. To identify spam messages delivered to mobile devices, the authors examined and compared the relative merits of various machine learning techniques. For testing and validation, they built two datasets using data from a publicly available dataset.

S. Nandhini, Jeen Marseline K.S. et al. [10] Sending large amounts of unwanted email puts consumers' security at risk. Despite numerous security measures, spammers significantly increase internet vulnerability. This work covers the effective use of various well-known algorithms for creating a machine learning model that can distinguish between spam and legitimate mail. The experiment uses the UCI Machine Learning Repository Spambase Data Set. To train and develop a powerful machine learning model for email spam detection, the performance of five significant machine learning classification algorithms (Logistic Regression, Decision Tree, Naive Bayes, KNN, and SVM) is assessed. The data set is trained and tested using the Weka tool.

III. PROPOSED SYSTEM:

Email analysis is one of the most frequent tasks categorized under text analysis.
Algorithms used in email analysis include CNN, SVM, and decision trees. Classifying emails as spam or non-spam is a key topic in email categorization.
Various publications on email spam classification and analysis attempted to categorize emails. Some classified emails by the sender's gender, given the many characteristics that distinguish emails written by women from those written by men.
Besides spam and non-spam, emails can be divided into two categories: attention-grabbing emails and dull emails.
Email grouping also included the idea of grouping emails into different folders or subjects.
Also, several research studies use the temporal information contained in emails (such as when they were sent or received) to analyse emails. Several papers attempt to group emails that share similar topics or subjects. Certain email systems, like Google, group emails that are connected to one another.

IV. ALGORITHM
Support Vector Machine (SVM):
Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms and is used for both classification and regression problems, although it is mainly applied to classification. The objective of the SVM algorithm is to find the best line or decision boundary that divides n-dimensional space into classes, so that new data points can be quickly assigned to the correct class in the future. This optimal decision boundary is called a hyperplane. SVM selects the extreme points (vectors) that help define the hyperplane. These extreme cases are called support vectors, and they give the algorithm its name.
SVM comes in two varieties:
1. Linear SVM: Linear SVM is used for data that can be divided into two classes by a single straight line. Such data is called linearly separable, and the classifier used is known as a linear SVM classifier.

2. Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a dataset cannot be classified using a straight line, it is considered non-linear data, and the classifier used is referred to as a non-linear SVM classifier.

Convolutional Neural Network (CNN):
Convolutional Neural Networks were designed primarily for image and video recognition applications. CNNs are mainly used for image analysis tasks such as segmentation, object detection, and image recognition.
Convolutional Neural Networks have four different kinds of layers:
1) Convolutional Layer: In a conventional neural network, each input neuron is connected to every neuron in the following hidden layer. In a CNN, only a small region of the input layer neurons is connected to each neuron in the hidden layer.
2) Pooling Layer: The pooling layer is used to reduce the dimensionality of the feature map. Inside the CNN's hidden layers, there will be numerous activation and pooling layers.
3) Flatten: Flattening is the process of converting data into a 1-dimensional array so that it can be fed into the following layer. We flatten the
convolutional layer output to produce a single long feature vector.

4) Fully Connected Layer: Fully connected layers make up the last few layers of the network. The output of the final pooling or convolutional layer is flattened and then fed into the fully connected layer.

Decision Tree:
A decision tree is a supervised learning method that can be used for both classification and regression problems, but it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the classification result. A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the results of those decisions and have no further branches.

The tests or decisions are performed on the basis of the features of the given dataset. A decision tree is a graphical representation for obtaining all possible solutions to a decision or problem based on given conditions. It is called a decision tree because, like a tree, it starts at the root node and expands along further branches to form a tree-like structure. The tree is built with the CART algorithm, which stands for Classification and Regression Tree algorithm. A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees.

V. ADVANTAGES

1. Extraction of link-based features from emails.

2. Extraction of tag-based features for the complete dataset.

3. Extraction of word-based features.

4. All test data were classified as either phishing or normal, accordingly.

VI. CONCLUSION
Spam detection is crucial for protecting email and message communication. Accurate identification of spam is a significant problem, and numerous detection techniques have been put forth by various researchers. Nevertheless, these techniques fall short in their ability to correctly and effectively detect spam. To address this problem, we have proposed a technique for spam identification using machine learning predictive models. The findings imply that the proposed approach is more reliable for precise and prompt identification of spam and will help secure messaging and email systems.

REFERENCES

1. “Email Spam Detection Using Machine Learning Algorithms” https://ieeexplore.ieee.org/
2. “Machine Learning Techniques for Spam Detection in Email and IoT Platforms” https://www.hindawi.com/
3. “Spam Detection Approach for Secure Mobile Message Communication Using Machine Learning Algorithms” https://www.hindawi.com/
4. “SMS Spam Detection using Machine Learning and Deep Learning Techniques” https://ieeexplore.ieee.org/
5. “A Comparative Study of Spam SMS Detection Using Machine Learning Classifiers” https://ieeexplore.ieee.org/
6. “Spam Mail Scanning Using Machine Learning Algorithm” http://www.jcomputers.us
7. “Towards Filtering of SMS Spam Messages Using Machine Learning Based Technique” https://link.springer.com
8. “Machine learning for email spam filtering: review, approaches and open research problems” https
9. “SMS spam detection and comparison of various machine learning algorithms” https://ieeexplore.ieee.org
10. “Performance Evaluation of Machine Learning Algorithms for Email Spam Detection”
11. K. Krombholz, H. Hobel, M. Huber, and E. Weippl, “Advanced Social Engineering Attacks”, Journal of Information Security and Applications 22 (2015) 113-122
12. U. H. Rao and U. Nayak, “The InfoSec Handbook”, Chapter 15 Social Engineering, 01 September 2014, pages 307-323
13. T. Ayodele, C. Shoniregun, and G. Akmayeva, “Anti-Phishing Prevention Measure for Email Systems”, World Congress on Internet Security (WorldCIS-2012)
14. C. Olivo, A. Santin, and L. Oliveira, “Obtaining the threat model for email phishing”, Applied Soft Computing 13 (2013) 4841–484
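As a supplementary illustration of the Decision Tree description under IV. ALGORITHM, the sketch below shows how a CART-style tree could choose a single Yes/No question by minimizing the weighted Gini impurity of the two resulting groups. It is a minimal sketch and not the implementation used in this work: the feature names (`num_links`, `num_html_tags`), the toy rows, and the helper functions `gini` and `best_split` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature, threshold, impurity) for the best 'feature <= threshold?' question."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            yes = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            no = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not yes or not no:
                continue  # a question that sends every row one way splits nothing
            weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / n
            if best is None or weighted < best[2]:
                best = (feature, threshold, weighted)
    return best


# Hypothetical toy data: two numeric features per email (assumed, not from the paper).
rows = [
    {"num_links": 0, "num_html_tags": 2},
    {"num_links": 1, "num_html_tags": 9},
    {"num_links": 5, "num_html_tags": 3},
    {"num_links": 7, "num_html_tags": 8},
]
labels = ["ham", "ham", "phishing", "phishing"]

feature, threshold, impurity = best_split(rows, labels)
print(f"Root question: is {feature} <= {threshold}? (weighted Gini = {impurity:.2f})")
```

On this toy data the sketch selects the question "is num_links <= 1?", which separates the two classes completely. A full tree would apply the same search recursively to each resulting subset until the leaves are pure or a depth limit is reached.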
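The link-based, tag-based, and word-based features listed under V. ADVANTAGES could be extracted along the following lines. This is a hedged sketch that uses only the Python standard library. The function name `extract_features`, the individual features, and the keyword list `SUSPICIOUS_WORDS` are assumptions made for illustration, since the exact feature set is not specified in this article.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword list; the vocabulary actually used is not given in the article.
SUSPICIOUS_WORDS = {"verify", "account", "password", "urgent", "click"}


class _TagAndLinkCollector(HTMLParser):
    """Collects HTML tag names, href targets of <a> tags, and visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_features(email_body):
    """Return a dict of illustrative link-, tag-, and word-based features."""
    parser = _TagAndLinkCollector()
    parser.feed(email_body)
    parser.close()
    words = re.findall(r"[a-z']+", " ".join(parser.text_parts).lower())
    return {
        # Link-based features
        "num_links": len(parser.links),
        "num_ip_links": sum(
            bool(re.match(r"https?://\d{1,3}(\.\d{1,3}){3}", link))
            for link in parser.links
        ),
        # Tag-based features
        "num_tags": len(parser.tags),
        "has_form": int("form" in parser.tags),
        # Word-based features
        "num_words": len(words),
        "num_suspicious_words": sum(word in SUSPICIOUS_WORDS for word in words),
    }


sample = (
    "<html><body><p>Urgent: verify your account now.</p>"
    '<a href="http://192.168.0.1/login">Click here</a></body></html>'
)
print(extract_features(sample))
```

The resulting dictionary can be turned into a numeric feature vector and passed to any of the classifiers discussed under IV. ALGORITHM. Keeping the feature set small, as in this sketch, matches the stated goal of using the fewest possible features.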