Saurabh

The document describes developing a spam mail detection system using Python. It discusses using machine learning algorithms like KNN, SVM, random forest and Naive Bayes to classify SMS messages as spam or ham. It presents results of applying these algorithms on SMS spam datasets, with KNN and linear SVM performing best with error rates of 3.11% and 1.19% respectively. Feature extraction and preprocessing are done before applying the algorithms. The aim is to accurately filter spam SMS messages.

Uploaded by

saurabh
Copyright
© © All Rights Reserved

MICRO PROJECT

REPORT

ON

Develop a Spam mail detection using Python

BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION

THDC INSTITUTE OF HYDROPOWER ENGINEERING AND


TECHNOLOGY, TEHRI, UTTARAKHAND, INDIA
(UTTARAKHAND TECHNICAL UNIVERSITY,
DEHRADUN) 2022-2023

Ms. Nidhi Lakhera          Divyansh Kataria (700970102004)
                           Saurabh Sharma (700970102007)
SIGNATURE ………              ECE (4th Year, 8th Semester)
Acknowledgement

While working on this project, we found a great deal of information that helped us complete it. We are glad that we finished the project successfully and were able to understand many things along the way. Thank you, Nidhi Lakhera ma'am, for giving us this opportunity and for enabling us to learn so much. We have no more valuable words to express our gratitude, but our hearts are full of gratitude for all the kindness shown to us.

Saurabh Sharma,
Divyansh Kataria
B. Tech, 8th Sem ECE

I endorse the above declaration of the Student.

(Name and Signature of the Supervisor)


CONTENT

Chapter Name

1. Abstract
2. Introduction
3. Requirements
4. Flow Chart
5. Introduction to the Machine Learning Algorithms
   5.1 K-Nearest Neighbours
   5.2 Support Vector Machines (SVM)
   5.3 Random Forest
   5.4 Naïve Bayes
       Step 1: About Bayes Theorem
       Step 2: Understand data
       Step 3: Bag of Words
       Step 4: Training and testing
       Step 5: Implementing the NB ML algorithm
       Step 6: Evaluate model
   5.5 AdaBoost
   5.6 NLTK
6. Python Code ScreenShot
7. Result ScreenShot
8. Conclusion
1. ABSTRACT

Short Message Service (SMS), which allows users to send and receive messages, has become a multi-billion dollar industry as mobile phone usage has soared. The cost of messaging services has also decreased, which has led to an increase in the amount of spam delivered to mobile devices. Up to 40% of SMS messages in some regions of Asia were spam in 2012. Because of the short message length, the informal language, and the lack of reliable databases of SMS spam, the established email filtering algorithms may not perform well on text messages. In this project, real SMS spam databases from the ML repository are used. Following feature extraction and preprocessing, numerous machine learning methods are applied to the databases. After comparing the results, the best algorithm for text message spam filtering is presented. The best model in this study lowers the overall error rate compared with the best model in the original research that used this dataset. The following algorithms are used to categorise spam in mobile device communication: decision trees, K-Nearest Neighbour, and logistic regression. The SMS Spam Collection set is used to test the approach.
2. INTRODUCTION

Fig 1.1

Chat technology is simply one aspect of SMS. SMS technology was made possible by an accepted international standard. Spam is the term for the abuse of electronic messaging services to send large numbers of unwanted messages indiscriminately. Even though email spam is the most well-known example, similar abuses in other media are also frequently referred to as "spam."

SMS spam, in this sense, is unsolicited bulk communication that usually carries some commercial interest and is quite similar to email spam. SMS spam is used to spread phishing URLs and business promotions. Since most countries outlaw the practice, commercial spammers use malware to transmit SMS. Because it is challenging to pinpoint the origin of spam sent from a hacked computer, spammers take less of a risk by doing so. Only letters, numbers, and a few symbols are permitted in SMS messages. A brief glance at the messages reveals a clear pattern: almost all spam messages direct users to call a phone number or go to a website. A simple SQL query on the spam yields results that confirm this trend. Because of the low cost and large bandwidth of the SMS network, SMS spam is widely used.
Fig 1.2

Every time a user receives an SMS spam message, their mobile phone notifies them of its arrival. The user is annoyed when they realise the message is unwanted, and SMS spam also uses up some of the storage space on their mobile device.

There are several notable differences between emails and text messages. Unlike emails, for which a range of sizable datasets is available, real databases of SMS spam are quite scarce. Because text messages are short, the number of features that can be used to classify them is also considerably smaller than for emails. There is also no header in this case. In addition, text messages use far less formal language than emails and are full of abbreviations. All of these factors could lead to a significant decline in the effectiveness of the major email spam filtering algorithms when they are applied to short text messages.

The goal of this project is to apply ML algorithms to the problem of classifying SMS spam, compare their results to learn more and further research the problem, and create a programme based on one of these approaches that can filter SMS spam precisely. Data is first analysed in MATLAB, where feature extraction and basic analysis are performed, and then several machine learning algorithms are applied using the scikit-learn module in Python.
Fig 1.3

In the third installment of a three-part series, we examine the spam or ham classifier from the standpoint of AI concepts and experiment with several classification algorithms in a web-based Python environment, comparing them on performance criteria.

Spam detection is one of the major applications of machine learning in modern internet technology. Service providers have integrated spam detection algorithms that label such content as "Junk Mail" when it is received.

In this project, the Naive Bayes approach is used to create a model that, depending on the training data we provide, can classify a dataset of messages as spam or not spam. Words such as "free," "win," "winner," "cash," and "prize" are frequently used in these messages because they are meant to catch your attention and, in a sense, persuade you to open them. Exclamation marks and writing in all capitals are other characteristics of spam communications. Since spam texts are often pretty evident to the receiver, we want to train a model to identify them for us.

Finding spam messages is a binary classification problem, since a message can only be categorised as spam or not spam and nothing else. It is also a supervised learning problem, because we give the model a labelled dataset to learn from.
3. Requirements

Hardware requirements

Processor: 1.5 GHz or more
RAM: 4 GB or more
HDD: at least 100 GB

Software requirements

Python 3 IDLE or the Anaconda Jupyter Notebook


4. Flow Chart
5. Introduction to the Machine Learning Algorithms

5.1 K-Nearest Neighbours
K-Nearest Neighbours (KNN) is a straightforward instance-based learning technique that can be used to solve classification problems. In this method, the label of a test sample is predicted by a majority vote among its k nearest neighbours in the training set.

K      Overall error (%)   Spam caught (%)   Blocked ham (%)
3      3.11                82.5              0.35
12     3.25                86.2              0.42
22     2.91                79.4              0.41
52     3.25                78.6              0.34
90     4.12                69.5              0.17
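
The voting rule described in this section can be sketched in a few lines of plain Python. This is only an illustration: the three toy features (counts of "free" and "call", and a scaled message length), the sample values, and the function name knn_predict are assumptions made for the example, not the features or code used to produce the table above.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    # Euclidean distance from the query to every training vector
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Keep the k closest samples and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy features: [count of "free", count of "call", length / 100]
train_vectors = [[2, 1, 1.5], [1, 2, 1.4], [0, 0, 0.3], [0, 1, 0.4], [3, 1, 1.6]]
train_labels = ["spam", "spam", "ham", "ham", "spam"]

prediction = knn_predict(train_vectors, train_labels, [2, 2, 1.5], k=3)
print(prediction)
```

Larger values of k smooth the decision but can drown out the minority class, which is one plausible reading of why the spam caught rate drops at k = 90 in the table.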

5.2 Support Vector Machine

A support vector machine (SVM) is applied to the dataset. The results with various kernels are shown in Table II for 10-fold cross-validation. The table demonstrates that the linear kernel outperforms the other mappings. The error rate decreases as the degree of the polynomial kernel is raised from two to three, but it does not decrease further as the degree is raised higher. Another kernel, the radial basis function (RBF), is also applied to the dataset. The following equation represents the RBF kernel for two samples.
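
The equation itself did not survive in this copy of the report. The standard form of the RBF kernel for two samples x and x' is given here as a reference; the symbol γ (a positive width parameter, equal to 1/(2σ²) in the Gaussian parameterisation) is an assumption, since the report does not state which parameterisation or value was used.

```latex
K(x, x') = \exp\left( -\gamma \, \lVert x - x' \rVert^{2} \right), \qquad \gamma > 0
```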

Table II: SVM results with different kernels (10-fold cross-validation)

Kernel function          Overall error (%)   Spam caught (%)   Blocked ham (%)
Linear                   1.19                94.1              0.45
Degree 2 polynomial      2.04                86.3              0.23
Degree 3 polynomial      1.67                90.4              0.47
Degree 4 polynomial      2.01                92.45             0.65
Radial basis function    23.16               79.6              0.35
Sigmoid                  22.4                0                 0

We observe that the character count of a text message is a very helpful feature for classifying spam. When features are ranked by the mutual information criterion, this feature has the highest mutual information with the target labels. In addition, although text messages with lengths below a specific threshold are normally ham, they can be mistakenly labelled as spam because of tokens that correlate with spam. This becomes apparent when looking at the misclassified samples.

The results show no accuracy advantage over the naive Bayes algorithm when using SVM with different kernels, despite the model being more complex and taking longer to train.
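
To make the mutual information criterion mentioned in this section concrete, the sketch below computes the mutual information (in bits) between one discrete feature and the class labels. The message lengths, the 100-character threshold, and the function name are hypothetical values chosen for illustration; they are not taken from the project's data.

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """Mutual information (in bits) between two discrete variables."""
    n = len(labels)
    joint_counts = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    mi = 0.0
    for (value, label), count in joint_counts.items():
        p_joint = count / n
        p_value = feature_counts[value] / n
        p_label = label_counts[label] / n
        # Each observed (value, label) pair contributes p(x, y) * log2(p(x, y) / (p(x) p(y)))
        mi += p_joint * math.log2(p_joint / (p_value * p_label))
    return mi

# Hypothetical toy data: message lengths in characters and their labels
lengths = [150, 140, 30, 25, 160, 40, 35, 155]
labels = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]

# Turn the length into a binary feature with an assumed threshold of 100
is_long = [length > 100 for length in lengths]
mi_length = mutual_information(is_long, labels)
print(mi_length)
```

Ranking features by this score and keeping the highest ones is one common way to apply the criterion; a feature that is independent of the labels scores zero.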

5.3 Random Forest

Random forests is a classification technique that uses ensemble averaging. The ensemble is a group of decision trees, each built from a bootstrap sample of the training set. When a node is split during the construction of a decision tree, the split that is chosen is the best one among a random subset of the features. This increases the bias of a single model; however, averaging lowers the variance and can make up for the increase in bias, so a better model is obtained overall. The random forest implementation of the scikit-learn Python library, which averages the probabilistic predictions, is used in this study. Two numbers of estimators are simulated for this method. With 12 estimators, the overall error is 1.91%, the spam caught (SC) is 86.6%, and the blocked ham (BH) is 0.71%. With 90 estimators, the overall error is 1.41%, the SC is 92.2%, and the BH is 0.52%. We notice that, compared with the naïve Bayes algorithm, performance does not improve despite the model's increased complexity.
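
The three ingredients named in this section (bootstrap samples, a random subset of features at each split, and averaging of probabilistic predictions) can be sketched with the standard library alone. Training the individual decision trees is deliberately left out, so the per-tree probabilities below are made-up placeholders, and all function names are hypothetical; the study itself used scikit-learn's implementation.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, subset_size, rng):
    """Pick the feature indices that one split is allowed to consider."""
    return sorted(rng.sample(range(n_features), subset_size))

def average_probabilities(per_tree_probabilities):
    """Average the spam probability predicted by each tree."""
    return sum(per_tree_probabilities) / len(per_tree_probabilities)

rng = random.Random(0)
rows = list(range(10))  # stand-in for 10 training messages

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=9, subset_size=3, rng=rng)

# Placeholder outputs of three already-trained trees for one message
forest_probability = average_probabilities([0.9, 0.6, 0.8])
label = "spam" if forest_probability >= 0.5 else "ham"
print(sample, features, forest_probability, label)
```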
5.4 Naïve Bayes Algorithm

Step 1: About Bayes Theorem

Bayes' theorem is one of the earliest probabilistic inference algorithms, developed by Reverend Bayes (who used it, no less, to try to infer the existence of God), and it still works incredibly well in some situations. The theorem is best understood through an example. Consider yourself a Secret Service agent tasked with protecting the Democratic presidential candidate during a campaign speech. Because it is a public event that is open to everyone, your task is challenging and you must always be on guard for threats. A reasonable place to start is to assign each person a threat level. Based on a person's characteristics, such as their age, sex, and smaller details like whether they are carrying a bag or seem tense, you can judge whether they pose a threat.

If a person ticks enough boxes that your threshold of doubt is crossed, you can take action and have them removed from the area. Bayes' theorem works similarly: we compute the probability of an event (a person posing a threat) based on the probabilities of numerous related events (age, sex, carrying a bag, nervousness, and so on).

One thing to take into account is the independence of these features from one another. For instance, if a child exhibits signs of anxiety during the event, the likelihood that they pose a threat is lower than if a grown man does. To clarify, age and nervousness are the two features we are considering here. If we examine each of these features separately, we might create a model that marks everyone who exhibits nervousness as a possible threat. But since any children present at the event are likely to be nervous, we would probably get a lot of false positives. By taking a person's age into account in addition to the "nervousness" feature, we would certainly obtain a more accurate conclusion about who poses a threat and who does not.

The "naive" part of the algorithm is the assumption that each feature is independent of the others, which may not always be the case and may therefore affect the final verdict.

In essence, Bayes' theorem determines the likelihood that an event will occur based on the probability distributions of a number of other related events; in this case, the likelihood that a message is spam. Later in the project we will go into how Bayes' theorem operates, but first let's examine the data we will be using.
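
As a small worked illustration of the theorem, P(A|B) = P(B|A) P(A) / P(B), the sketch below computes the probability that a message is spam given that it contains the word "free". All three input probabilities are hypothetical numbers picked for the example; they are not measured from the project's dataset.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) from P(A), P(B|A) and P(B|not A), using Bayes' theorem."""
    # Total probability of the evidence B
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs
p_spam = 0.13             # P(spam): share of messages that are spam
p_free_given_spam = 0.30  # P("free" appears | spam)
p_free_given_ham = 0.01   # P("free" appears | ham)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free_given_ham)
print(round(p_spam_given_free, 3))
```

The naive assumption enters when several words are combined: their likelihoods are simply multiplied as if the words occurred independently of each other.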
Step 2: Understand data
Fig 2

Step 3: Bag of Words

We have a substantial set of text data (5580 rows of data). Email and other messages usually contain a lot of text, yet the majority of machine learning algorithms require numerical data as input.

In this part, we introduce the Bag of Words (BoW) concept, a term for problems that involve processing a single text or a collection of text data. The fundamental idea of BoW is to count the occurrences of each word in a given body of text. BoW treats each word separately, so the order in which the words appear is irrelevant.

Using a technique we'll cover later, we can turn a collection of documents into a matrix in which each document is a row, each word or token is a column, and each cell holds the frequency with which that word or token appears in that document.
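
The document-by-token matrix described above can be built by hand with the standard library, which may help show what a tool such as CountVectorizer() produces. The three example messages, the simple regular-expression tokenizer, and the function names are assumptions made for this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(documents):
    """Collect every distinct token, sorted so that columns have a fixed order."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def to_count_matrix(documents, vocabulary):
    """One row per document, one column per token, cells hold frequencies."""
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        matrix.append([counts[word] for word in vocabulary])
    return matrix

documents = ["Win a free prize now", "Are you free now", "Free free cash"]
vocabulary = build_vocabulary(documents)
matrix = to_count_matrix(documents, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

Note that word order is lost: "free cash" and "cash free" would produce the same row.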

Step 4: Training and testing

Now that we know how to handle the Bag of Words problem, we can return to our dataset and continue our analysis. To be able to test our model later, we first divide our dataset into a training set and a testing set.

After dividing the data, our next goal is to carry out the procedure from Step 3: convert our data to the desired bag-of-words matrix format. As before, we will use CountVectorizer() to accomplish this. There are two steps to consider here:

First, we fit CountVectorizer() on the training data and return the training matrix.
Second, we transform the testing data with the fitted vectorizer to return the testing matrix.

The X test data, once transformed into a matrix, is what we will use to make predictions for the "sms message" column. Then, in a subsequent step, we will compare those predictions with the true labels of the test set.
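
A minimal standard-library sketch of this step follows. It splits the data, learns the vocabulary from the training messages only, and then applies that same vocabulary to the test messages. The eight example messages, the 25% test fraction, the seed, and the helper names are assumptions for illustration; in the project itself these roles are played by scikit-learn utilities such as CountVectorizer().

```python
import random

def train_test_split(messages, labels, test_fraction=0.25, seed=1):
    """Shuffle the indices and hold out a fraction of the samples for testing."""
    indices = list(range(len(messages)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([messages[i] for i in train_idx], [messages[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

def fit_vocabulary(train_messages):
    """Learn the column order from the training messages only."""
    return sorted({word for m in train_messages for word in m.lower().split()})

def transform(messages, vocabulary):
    """Count vocabulary words per message; unseen words are ignored."""
    return [[m.lower().split().count(word) for word in vocabulary] for m in messages]

messages = ["win free cash", "free prize now", "call to claim prize", "urgent free offer",
            "see you at lunch", "are you home", "call me later", "meeting moved to noon"]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

X_train, X_test, y_train, y_test = train_test_split(messages, labels)
vocabulary = fit_vocabulary(X_train)
train_matrix = transform(X_train, vocabulary)
test_matrix = transform(X_test, vocabulary)
print(len(train_matrix), len(test_matrix), len(vocabulary))
```

Fitting the vocabulary on the training set alone keeps information from the test messages out of the model, which is the reason for treating "fit" and "transform" as two separate steps.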
Step 5: Implementing the NB ML algorithm

We will use the Naive Bayes technique to produce predictions on our dataset for SMS spam detection.

In particular, we'll apply the multinomial Naive Bayes implementation. This classifier is appropriate for categorising data with discrete features, and it accepts integer word counts as input.
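
To show what a multinomial Naive Bayes classifier does with word counts, here is a compact from-scratch sketch with Laplace smoothing. It is not the library implementation used in the project; the four training messages, the smoothing value alpha = 1.0, and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(messages, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for message, label in zip(messages, labels):
        word_counts[label].update(message.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    model = {"log_prior": {}, "log_likelihood": {}, "vocabulary": vocabulary}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(labels))
        # Smoothed denominator: all tokens in the class plus alpha per vocabulary word
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocabulary
        }
    return model

def predict(model, message):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for word in message.lower().split():
            if word in model["vocabulary"]:  # words never seen in training are skipped
                score += model["log_likelihood"][label][word]
        scores[label] = score
    return max(scores, key=scores.get)

train_messages = ["win free cash prize", "free prize winner call now",
                  "are we meeting for lunch", "call me when you get home"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(train_messages, train_labels)
print(predict(model, "free cash prize"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.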

Step 6: Evaluate model

Now that we have made predictions on our test set, the next goal is to evaluate how well our model is performing. There are several ways to do this, so let's first quickly go over them.

Accuracy measures how frequently the classifier makes correct predictions. It is the ratio of correct predictions to all predictions.

Precision tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that are actually spam) to all positives (all messages classified as spam, regardless of whether that classification was correct).

Recall (sensitivity) tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam that are actually spam) to all messages that were actually spam.

Accuracy alone can be misleading when the class distribution is skewed. For example, if we had 100 text messages and only two of them were spam while the other 98 were not, a classifier could reach a high accuracy while missing the spam entirely. In such cases precision and recall are more informative.
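
The skewed example above can be checked numerically. The sketch below computes accuracy, precision, and recall from scratch for 100 messages (2 spam, 98 ham) and a hypothetical classifier that labels everything as ham; the function names and the convention of returning 0.0 when a ratio is undefined are assumptions of this example.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Undefined when nothing was predicted positive; 0.0 is used by convention here
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Undefined when there are no actual positives; 0.0 is used by convention here
    return tp / (tp + fn) if (tp + fn) else 0.0

# 100 messages: 2 spam and 98 ham, with a classifier that never predicts spam
y_true = ["spam"] * 2 + ["ham"] * 98
y_pred = ["ham"] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))
```

The classifier is 98% accurate yet catches no spam at all, which is exactly what its zero recall reveals.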
5.5 AdaBoost

AdaBoost (adaptive boosting) is an ensemble method that trains classifiers one at a time, refining each one to account for examples that were misclassified by prior classifiers. Even if the individual classifiers are only moderately better than random guessing, the final combined model improves on them. This ensemble strategy can be used in combination with other learning algorithms.

Weights are assigned to the training samples at each AdaBoost iteration. Before the first iteration, these weights are distributed uniformly. After each iteration, the weights of samples that the current model classified wrongly are increased and the weights of samples that it classified correctly are decreased. This means that each new predictor concentrates on the weaknesses of the previous one.
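
One round of the reweighting described above can be sketched as follows, using the textbook discrete AdaBoost update for labels coded as -1 and +1. The five-sample example and the function name are hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost weight update for labels in {-1, +1}.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak classifier
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote of this classifier in the final ensemble: larger when the error is smaller
    alpha = 0.5 * math.log((1 - error) / error)
    # t * p is -1 for a mistake (weight grows) and +1 for a hit (weight shrinks)
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical round: +1 = spam, -1 = ham, the second sample is misclassified
y_true = [+1, +1, -1, -1, -1]
y_pred = [+1, -1, -1, -1, -1]
weights = [1 / len(y_true)] * len(y_true)  # uniform before the first iteration

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.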

Model                          SC (%)   BH (%)   Accuracy (%)
NB                             95.35    0.52     97.73
SVM                            93.36    0.63     89.63
KNN                            83.87    0.32     97.56
Random forest                  91.62    0.63     97.52
AdaBoost with decision tree    93.21    0.45     98.56

We investigated implementing AdaBoost with decision trees using the scikit-learn module. In the simulation with 12 estimators, the total error rate is 3.1%, the SC is 86.7%, and the BH is 0.64%. When the number of estimators is increased to 90, these figures change to 3.41%, 93.6%, and 0.71%. As with random forests, AdaBoost with decision trees does not perform better than the naïve Bayes algorithm despite being significantly more complex.

5.6 NLTK

NLTK is a leading platform for building Python programs that work with human language data. It offers straightforward interfaces to more than 50 corpora and lexical resources, including WordNet, as well as a collection of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for powerful NLP libraries, and a lively discussion forum.

Thanks to a hands-on guide covering programming foundations along with topics in computational linguistics, plus full API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, the project is free, open source, and community-driven.

Fig 3
6. Python code ScreenShot
7. Result ScreenShot
8. Conclusion

The outcomes of various classification models run on the SMS Spam dataset have been presented. Based on the simulation results, the best classifiers for SMS spam detection include SVM with a linear kernel and multinomial naive Bayes with Laplace smoothing. In the original research that used this dataset, the SVM-based classifier was the best one, with the highest overall accuracy (92.64%). With an overall accuracy of 92.60%, enhanced naive Bayes is the next best classifier in that research. Compared with the outcome of the earlier research, our classifier cuts the overall error in half. The factors that led to this improvement include the addition of significant features such as the number of characters in a message, the addition of specific thresholds for the length, and the analysis of learning curves and misclassified data.

One of the key benefits of Naive Bayes over other classification methods is its ability to handle an exceptionally high number of features. In our case, there are hundreds of distinct words, and each of them is treated as a feature. It also functions effectively when irrelevant features are present and is mostly unaffected by them. Its relative simplicity is another key benefit: tuning its parameters is rarely necessary, except in situations where the data distribution is known. It rarely overfits the data. Another key benefit is how quickly the model trains and predicts given the volume of data it can manage. Overall, the Naive Bayes algorithm really is a treasure.
