The document is a lab report for a project titled 'SMS/Email Spam Detection' developed by N. Roma under the guidance of Mrs. B. Varalakshmi at Sree Dattha Group of Institutions for the academic year 2024-2025. It outlines the objectives, methodologies, and algorithms used in developing a machine learning model to accurately classify SMS messages as spam or legitimate. The report includes sections on problem identification, requirements, design, implementation, and results, emphasizing the need for an effective spam detection system due to the increasing prevalence of unsolicited messages.

APPLICATION DEVELOPMENT LAB REPORT

Department of Computer Science & Engineering


SREE DATTHA GROUP OF INSTITUTIONS, HYDERABAD

2024-2025
SMS/EMAIL SPAM DETECTION

Designed and Developed by

N.ROMA 219P1A0510

Guided by

B.VARALAKSHMI

Department of Computer Science & Engineering


SREE DATTHA GROUP OF INSTITUTIONS

2024-2025
CERTIFICATE

This is to certify that this is the application development lab record entitled
“SMS/EMAIL SPAM DETECTION” submitted by N.ROMA (219P1A0510),
B. Tech 4th year I semester, Department of CSE, during the year 2024-2025. The
results embodied in this report have not been submitted to any other university or
institute for the award of any degree or diploma.

INTERNAL GUIDE HOD-CSE

Mrs. B.Varalakshmi Dr. SHAIK MEERAVALI

External Examiner

DECLARATION

I declare that this project report titled SMS/EMAIL SPAM DETECTION submitted
in partial fulfillment of the degree of B. Tech in CSE is a record of original work
carried out by me under the supervision of Mrs. B.VARALAKSHMI and has not
formed the basis for the award of any other degree or diploma, in this or any other
Institution or University. In keeping with the ethical practice in reporting scientific
information, due acknowledgements have been made wherever the findings of others
have been cited.

N.ROMA 219P1A0510

ACKNOWLEDGEMENT

We would like to express our gratitude to all those who extended their support
and suggestions in developing this software. Special thanks to our mentor,
Mrs. B.Varalakshmi, whose help, stimulating suggestions, and
encouragement supported us throughout the course of project development.

We sincerely thank our Head of the Department, Dr. Shaik Meeravali, for his
constant support and motivation. A special acknowledgement goes
to a friend who encouraged us from backstage. Last but not least, our
sincere appreciation goes to our family, who have been tolerant and understanding
of our moods and extended timely support.

ABSTRACT
Over recent years, as the popularity of mobile phones has increased, Short
Message Service (SMS) has grown into a multi-billion dollar industry. At the same time,
the reduction in the cost of messaging services has resulted in a growth in unsolicited
commercial advertisements being sent to mobile phones. In parts of Asia, up to 30%
of text messages were spam in 2012. The lack of real databases of SMS spam, the short
length of messages with their limited features, and their informal language are factors
that may cause established email filtering algorithms to underperform in
classification. In this project, a database of real SMS spam from the UCI Machine
Learning Repository is used, and after preprocessing and feature extraction, different
machine learning techniques are applied to the database. Finally, the results are
compared and the best algorithm for spam filtering of text messages is identified.

TABLE OF CONTENTS

CERTIFICATE
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES

Chapter 1 Introduction
1.1 Overview
1.2 Problem Statement
1.3 Objective of Project
1.4 Goal of Project

Chapter 2 Problem Identification
2.1 Existing System
2.2 Proposed System

Chapter 3 Requirements
3.1 Software Requirements
3.2 Hardware Requirements

Chapter 4 Design and Implementation
4.1 Design
4.2 Implementation

Chapter 5 Code
5.1 Source Code
5.2 Screenshot of the Application

Chapter 6 Results & Conclusion
6.1 Results
6.2 Conclusion

REFERENCES

CHAPTER – 1
INTRODUCTION

1.1 OVERVIEW
Detecting SMS spam using machine learning involves training a model to
differentiate between spam and legitimate messages based on various features.
Common approaches include utilizing natural language processing techniques,
analyzing message content, and considering sender information. This process aims to
create a predictive model that can accurately classify incoming messages, helping
users filter out unwanted spam and enhance their messaging experience.
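
As a rough illustration of what "features" can mean for a short message, the sketch below derives a few simple content-based signals from one SMS. The function name `basic_features` and the particular signals chosen are assumptions made for this example; they are not the feature set used later in the report.

```python
import re

def basic_features(message: str) -> dict:
    """Return a few hand-crafted, content-based features for one SMS."""
    words = message.split()
    return {
        "num_characters": len(message),
        "num_words": len(words),
        "num_digits": sum(ch.isdigit() for ch in message),
        # crude URL check; a real system would need something more careful
        "has_url": bool(re.search(r"https?://|www\.", message, re.IGNORECASE)),
        "num_uppercase_words": sum(w.isupper() for w in words),
    }

# Made-up message used only to show the output structure
example = basic_features("WIN a FREE prize! Call 08001234567 now")
print(example)
```

Signals like these would normally be combined with text-based features such as TF-IDF rather than used alone.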

1.2 PROBLEM STATEMENT

A number of major differences exist between spam filtering in text messages and
emails. Unlike emails, for which a variety of large datasets are available, real databases
of SMS spam are very limited. Additionally, due to the short length of text
messages, the number of features that can be used for their classification is far
smaller than the corresponding number in emails. Text messages also have no header.
Furthermore, text messages are full of abbreviations and use much less formal
language than what one would expect from emails. All of these factors may result in
serious degradation in the performance of major email spam filtering algorithms when
applied to short text messages.

1.3 OBJECTIVE OF PROJECT

1. Accuracy: Develop a model that can accurately distinguish between spam and
non-spam SMS messages to minimize false positives and negatives.
2. Efficiency: Create an efficient spam detection system that can process SMS messages
in real-time or near real-time to ensure timely filtering of spam.
3. Scalability: Design the system to handle large volumes of SMS messages without
sacrificing performance, allowing it to scale as the user base grows.

4. Robustness: Build a robust spam detection model that can generalize well across
different languages, dialects, and SMS formats, ensuring effectiveness across diverse
user demographics.
5. Adaptability: Implement mechanisms to continuously update and improve the spam
detection model to adapt to evolving spamming techniques and linguistic variations.
6. User Experience: Enhance the user experience by minimizing the impact of false
positives on legitimate SMS messages while effectively filtering out spam to improve
overall satisfaction.
7. Compliance: Ensure compliance with relevant privacy and data protection
regulations by handling SMS data securely and responsibly.
8. Integration: Integrate the spam detection system seamlessly into existing SMS
platforms or messaging applications to provide users with a hassle-free experience.
9. Feedback Mechanism: Incorporate mechanisms for users to report false positives
and negatives, enabling the system to learn and improve over time.
10. Monitoring and Reporting: Implement monitoring tools to track system
performance, detect anomalies, and generate reports on spam detection efficacy for
stakeholders.

1.4 GOAL OF PROJECT:

Identify Spam: Develop a system that accurately distinguishes between spam and
legitimate SMS messages.

Improve User Experience: Enhance user satisfaction by reducing the annoyance
and potential harm caused by unwanted spam messages.

Dataset Collection: Gather a diverse dataset of labeled SMS messages, including
both spam and legitimate messages.

Data Preprocessing: Clean and preprocess the SMS messages to prepare them
for analysis.

Feature Engineering: Extract relevant features from the text data to facilitate
classification.

Model Selection: Choose appropriate machine learning or natural language
processing models for classification.

Model Training: Train the selected models on the labeled SMS dataset using
appropriate techniques.

Evaluation Metrics: Assess the performance of the trained models using metrics
such as accuracy, precision, recall, and F1-score.

Model Deployment: Deploy the trained model in a production environment for
real-time classification of incoming SMS messages.

Monitoring and Updating: Continuously monitor the model's performance and
update it as needed to maintain effectiveness.

Incorporate User Feedback: Integrate user feedback to further refine and
improve the model over time.
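
The evaluation metrics named in the goals above can be computed directly from confusion-matrix counts. The sketch below shows the arithmetic, treating spam as the positive class. The function name `classification_metrics` and the counts passed to it are hypothetical and do not come from the report's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the spam (positive) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, chosen only to illustrate the arithmetic
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(metrics)
```

For a spam filter, precision deserves particular attention, because every false positive is a legitimate message hidden from the user.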

CHAPTER-2

PROBLEM IDENTIFICATION
2.1 EXISTING SYSTEM

The existing SMS spam detection system utilizes machine learning algorithms to
classify incoming messages as either spam or legitimate. It follows a structured
pipeline involving data collection, preprocessing, feature extraction, model
training, evaluation, deployment, and maintenance.

Components

Data Collection: Collects a labeled dataset of SMS messages, including spam and
legitimate ones.

Preprocessing: Cleans and normalizes text data, removing noise and standardizing
formats.

Feature Extraction: Converts preprocessed text into numerical features using
techniques like Bag-of-Words or TF-IDF.

Model Training: Trains machine learning algorithms (e.g., Naive Bayes, SVM) on
the extracted features.

Model Evaluation: Evaluates model performance using metrics like accuracy,
precision, recall, and F1-score.

Deployment: Deploys the trained model into production for real-time classification
of incoming messages.

Monitoring and Maintenance: Monitors deployed model performance and
conducts regular maintenance to ensure effectiveness.
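
The feature extraction and model training components described above can be sketched in a few lines with scikit-learn. This is a minimal illustration only: the six training messages are made up, and chaining `TfidfVectorizer` with `MultinomialNB` is one possible combination of the techniques named, not necessarily the configuration of any deployed system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = spam, 0 = legitimate); not the report's data
messages = [
    "win a free prize now",
    "free cash prize claim now",
    "urgent claim your free reward",
    "are we meeting for lunch today",
    "call me when you get home",
    "see you at the meeting tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each message into a weighted term vector; Naive Bayes classifies it
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)

predictions = model.predict(["claim your free prize", "lunch at home today"])
print(predictions)
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on whatever data is passed to `fit`, which helps keep test messages out of the vocabulary.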

2.2 PROPOSED SYSTEM

The proposed system aims to enhance the existing SMS spam detection by
incorporating advanced techniques and addressing limitations. It introduces
improvements in data collection, preprocessing, feature extraction, and model
training.

Enhanced Data Collection: Augments the dataset with more diverse and
representative samples, including recent spam trends.

Advanced Preprocessing: Implements state-of-the-art text preprocessing
techniques to handle noisy and multilingual SMS data effectively.

Feature Engineering: Explores advanced feature engineering methods, including
deep learning-based embeddings, to capture semantic meaning.

Advanced Model Selection: Explores ensemble learning techniques and deep
learning architectures for improved classification performance.

Feedback Mechanism: Integrates a feedback loop to continuously update the
model based on user feedback and emerging spam patterns.
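
One way the feedback mechanism above could work is incremental learning, where user-reported messages are folded into the model without retraining from scratch. The sketch below assumes scikit-learn's `HashingVectorizer` and `MultinomialNB.partial_fit`. The helper `update_model` and all messages are hypothetical, and the report itself does not specify how the feedback loop is implemented.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: there is no vocabulary to refit when new messages arrive.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False, norm=None)
model = MultinomialNB()

def update_model(texts, labels):
    """Incrementally update the classifier with newly labeled messages."""
    model.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

# Initial training batch (made-up examples; 1 = spam, 0 = legitimate)
update_model(
    ["free prize claim now", "win cash reward",
     "see you at lunch", "call me at home"],
    [1, 1, 0, 0],
)

# A user reports a missed spam message; fold it into the existing model
update_model(["exclusive loan offer reply now"], [1])

print(model.class_count_)  # number of messages seen so far, per class
```

A hashing vectorizer is used here because a fitted TF-IDF vocabulary cannot absorb new words without being refitted on the whole corpus.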

CHAPTER-3

REQUIREMENTS
3.1 SOFTWARE REQUIREMENTS
The software requirements for implementing an SMS spam detection system typically
involve a combination of programming languages, libraries, frameworks, and tools.
The essential software requirements for building such a system are:

Programming Language:

• Python

Libraries and Frameworks:

• Natural Language Toolkit (NLTK)
• Scikit-learn
• TensorFlow or PyTorch
• Pandas
• NumPy
• Matplotlib or Seaborn

Text Processing Tools:

• Regular Expressions (regex)
• Stopword Lists

Development Environment:

• Integrated Development Environment (IDE) like Jupyter Notebook, PyCharm, or VS
Code for coding, debugging, and testing.
• Version Control System (e.g., Git) for tracking changes to code and collaborating
with team members.

Deployment Tools (for deploying the model into production):

• Web Application Frameworks (e.g., Flask or Django) for building RESTful APIs to
serve the model predictions.
• Cloud Platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure) for hosting
and scaling the deployed application.
• Containerization Tools (e.g., Docker) for packaging the application and its
dependencies into lightweight, portable containers.

3.2 HARDWARE REQUIREMENTS


• Processor (CPU)
• Memory (RAM)
• Storage
• Graphics Processing Unit (GPU)
• Networking

CHAPTER- 4

DESIGN AND IMPLEMENTATION


4.1 DESIGN

Figure 4.1: Architecture

Figure 4.2: Flow Chart

4.2 IMPLEMENTATION

Methodology

• Data Collection

• Data Preprocessing

• Model Selection

• Training

• Validation

• Testing

• Deployment

• Continuous Improvement
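
The training, validation, and testing steps in the methodology above imply splitting the data into three parts. The sketch below shows one way to do that with scikit-learn while preserving the spam/ham ratio in every part. The synthetic data, the 60/20/20 proportions, and the use of `stratify` are assumptions for illustration; the source code in Chapter 5 uses a single 80/20 split.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 messages, 20% spam (label 1), to mimic imbalance
texts = [f"message {i}" for i in range(100)]
labels = [1] * 20 + [0] * 80

# Hold out 20% for testing, then 25% of the remainder (20% overall) for validation
X_temp, X_test, y_temp, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=2)

print(len(X_train), len(X_val), len(X_test))   # sizes of the three parts
print(sum(y_train), sum(y_val), sum(y_test))   # spam messages in each part
```

Stratifying matters on imbalanced data, because an unlucky random split could leave too few spam messages in the validation or test part to measure precision reliably.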

4.3 Algorithms Used

• Naive Bayes
• Support Vector Machines (SVM)
• Logistic Regression
• Random Forest
• Neural Networks
• Ensemble Methods

CHAPTER – 5

CODE

import numpy as np
import pandas as pd

# load the dataset (the path points to the author's Google Drive)
df = pd.read_csv('/content/drive/MyDrive/SMS_SPAM/spam.csv', encoding="Windows-1252")
df.sample(5)
df.shape
df.info()

# drop last 3 cols
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

# renaming the cols
df.rename(columns={'v1': 'target', 'v2': 'text'}, inplace=True)
df.sample(5)

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# labels are encoded in sorted order: ham -> 0, spam -> 1
df['target'] = encoder.fit_transform(df['target'])
df.head()

# missing values
df.isnull().sum()

# check for duplicate values
df.duplicated().sum()

# remove duplicates
df = df.drop_duplicates(keep='first')
df.duplicated().sum()
df.shape

"""EDA"""

df.head()
df['target'].value_counts()

import matplotlib.pyplot as plt
plt.pie(df['target'].value_counts(), labels=['ham', 'spam'], autopct="%0.2f")
plt.show()

"""DATA IS IMBALANCED"""

# install NLTK before importing it
!pip install nltk
import nltk
# newer NLTK releases may also require nltk.download('punkt_tab')
nltk.download('punkt')

# num of characters
df['num_characters'] = df['text'].apply(len)
df.head()

# num of words
df['num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))
df.head()

# num of sentences
df['num_sentences'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))
df.head()

df[['num_characters', 'num_words', 'num_sentences']].describe()

# ham
df[df['target'] == 0][['num_characters', 'num_words', 'num_sentences']].describe()

# spam
df[df['target'] == 1][['num_characters', 'num_words', 'num_sentences']].describe()

import seaborn as sns

plt.figure(figsize=(12, 6))
sns.histplot(df[df['target'] == 0]['num_characters'])
sns.histplot(df[df['target'] == 1]['num_characters'], color='red')

plt.figure(figsize=(12, 6))
sns.histplot(df[df['target'] == 0]['num_words'])
sns.histplot(df[df['target'] == 1]['num_words'], color='red')

sns.pairplot(df, hue='target')

# numeric_only=True skips the text column, which recent pandas versions reject in corr()
sns.heatmap(df.corr(numeric_only=True), annot=True)

"""DATA PREPROCESSING

1. Lower case

2. Tokenization

3. Removing special characters

4. Removing stop words and punctuation

5. Stemming"""

import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

# create the stemmer before transform_text, which uses it
ps = PorterStemmer()

def transform_text(text):
    # 1-2. lower case and tokenize
    text = text.lower()
    text = nltk.word_tokenize(text)

    # 3. keep alphanumeric tokens only
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    text = y[:]
    y.clear()

    # 4. remove stop words and punctuation
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)

    text = y[:]
    y.clear()

    # 5. stem each remaining token
    for i in text:
        y.append(ps.stem(i))

    return " ".join(y)

ps.stem('loving')

transform_text("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")

df['text'][10]

df['transformed_text'] = df['text'].apply(transform_text)
df.head()

from wordcloud import WordCloud
wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')

spam_wc = wc.generate(df[df['target'] == 1]['transformed_text'].str.cat(sep=" "))
plt.figure(figsize=(15, 6))
plt.imshow(spam_wc)

ham_wc = wc.generate(df[df['target'] == 0]['transformed_text'].str.cat(sep=" "))
plt.figure(figsize=(15, 6))
plt.imshow(ham_wc)

df.head()

# collect every word that appears in spam messages
spam_corpus = []
for msg in df[df['target'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)

len(spam_corpus)

from collections import Counter

# 30 most common spam words; recent seaborn versions require x= and y= keywords
spam_top30 = pd.DataFrame(Counter(spam_corpus).most_common(30))
sns.barplot(x=spam_top30[0], y=spam_top30[1])
plt.xticks(rotation='vertical')
plt.show()

# collect every word that appears in ham messages
ham_corpus = []
for msg in df[df['target'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)

len(ham_corpus)

# 30 most common ham words
ham_top30 = pd.DataFrame(Counter(ham_corpus).most_common(30))
sns.barplot(x=ham_top30[0], y=ham_top30[1])
plt.xticks(rotation='vertical')
plt.show()

# Text Vectorization
# using Bag of Words
df.head()

"""MODEL BUILDING"""

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

X = tfidf.fit_transform(df['transformed_text']).toarray()
X.shape

y = df['target'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

# Gaussian Naive Bayes
gnb.fit(X_train, y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test, y_pred1))
print(confusion_matrix(y_test, y_pred1))
print(precision_score(y_test, y_pred1))

# Multinomial Naive Bayes
mnb.fit(X_train, y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))
print(precision_score(y_test, y_pred2))

# Bernoulli Naive Bayes
bnb.fit(X_train, y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(precision_score(y_test, y_pred3))

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier()
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
bc = BaggingClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=2)
xgb = XGBClassifier(n_estimators=50, random_state=2)

clfs = {
    'SVC': svc,
    'KN': knc,
    'NB': mnb,
    'DT': dtc,
    'LR': lrc,
    'RF': rfc,
    'AdaBoost': abc,
    'BgC': bc,
    'ETC': etc,
    'GBDT': gbdt,
    'xgb': xgb
}

def train_classifier(clf, X_train, y_train, X_test, y_test):
    # fit on the training split, then score accuracy and precision on the test split
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    return accuracy, precision

train_classifier(svc, X_train, y_train, X_test, y_test)

accuracy_scores = []
precision_scores = []

for name, clf in clfs.items():
    current_accuracy, current_precision = train_classifier(clf, X_train, y_train, X_test, y_test)
    print("For ", name)
    print("Accuracy - ", current_accuracy)
    print("Precision - ", current_precision)
    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)

performance_df = pd.DataFrame({'Algorithm': list(clfs.keys()), 'Accuracy': accuracy_scores, 'Precision': precision_scores}).sort_values('Precision', ascending=False)
performance_df

performance_df1 = pd.melt(performance_df, id_vars="Algorithm")
performance_df1

sns.catplot(x='Algorithm', y='value', hue='variable', data=performance_df1, kind='bar', height=5)
plt.ylim(0.5, 1.0)
plt.xticks(rotation='vertical')
plt.show()

# model improve
# 1. Change the max_features parameter of TfIdf
# Each temp_df below reads the current accuracy_scores and precision_scores, so the
# training loop must be re-run after each change for the merged columns to differ.
temp_df = pd.DataFrame({'Algorithm': list(clfs.keys()), 'Accuracy_max_ft_3000': accuracy_scores, 'Precision_max_ft_3000': precision_scores}).sort_values('Precision_max_ft_3000', ascending=False)
new_df = performance_df.merge(temp_df, on='Algorithm')

# 2. scores after scaling the features
temp_df = pd.DataFrame({'Algorithm': list(clfs.keys()), 'Accuracy_scaling': accuracy_scores, 'Precision_scaling': precision_scores}).sort_values('Precision_scaling', ascending=False)
new_df_scaled = new_df.merge(temp_df, on='Algorithm')

# 3. scores after adding num_characters as a feature
temp_df = pd.DataFrame({'Algorithm': list(clfs.keys()), 'Accuracy_num_chars': accuracy_scores, 'Precision_num_chars': precision_scores}).sort_values('Precision_num_chars', ascending=False)
new_df_scaled.merge(temp_df, on='Algorithm')

# Voting Classifier
svc = SVC(kernel='sigmoid', gamma=1.0, probability=True)
mnb = MultinomialNB()
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)

from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(estimators=[('svm', svc), ('nb', mnb), ('et', etc)], voting='soft')
voting.fit(X_train, y_train)

y_pred = voting.predict(X_test)
print("Accuracy", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred))

# Applying stacking
estimators = [('svm', svc), ('nb', mnb), ('et', etc)]
final_estimator = RandomForestClassifier()

from sklearn.ensemble import StackingClassifier
clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred))

import pickle

# VotingClassifier and StackingClassifier fit clones of their estimators, so mnb
# itself is still unfitted here; fit it before saving so the pickled model can predict
mnb.fit(X_train, y_train)

with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
with open('model.pkl', 'wb') as f:
    pickle.dump(mnb, f)
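
The serialized vectorizer and model are meant to be reloaded by the application. The sketch below shows that round trip in memory on made-up data. The helper `predict_message` is hypothetical, and in the real application the message would presumably pass through `transform_text` before vectorization, a step omitted here to keep the example self-contained.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for the real dataset (1 = spam, 0 = ham)
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "urgent claim your free reward",
    "are we meeting for lunch today",
    "call me when you get home",
    "see you at the meeting tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

tfidf = TfidfVectorizer()
mnb = MultinomialNB().fit(tfidf.fit_transform(texts), labels)

# Serialize and restore in memory; reading vectorizer.pkl / model.pkl works the same way
loaded_vectorizer = pickle.loads(pickle.dumps(tfidf))
loaded_model = pickle.loads(pickle.dumps(mnb))

def predict_message(message: str) -> str:
    """Classify one message with the restored vectorizer and model."""
    features = loaded_vectorizer.transform([message])
    return "spam" if loaded_model.predict(features)[0] == 1 else "ham"

print(predict_message("claim your free prize"))
```

The vectorizer must be saved alongside the model, because the classifier only understands feature columns produced by the exact vocabulary it was trained with.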

5.2 Screenshot of the Application

Fig-5.2.1

CHAPTER-6
RESULTS AND CONCLUSION
6.1 RESULTS

Fig-6.1.1 Entering the message

Fig-6.1.2 Detecting the message

6.2 CONCLUSION

In conclusion, the SMS spam detection project successfully developed and
evaluated machine learning models for effectively distinguishing between spam and
ham messages. Leveraging diverse classifiers and thorough preprocessing, the
project achieved robust performance. Key insights from exploratory data analysis
informed feature engineering decisions. The selected model, serialized for
deployment, exhibits promising results in real-time spam identification, showcasing
the project's effectiveness in addressing the challenge of SMS spam detection.
Further refinements and continuous model evaluation can enhance the system's
performance and contribute to a more reliable solution for identifying and
mitigating SMS spam.

6.3 Future work:

The future scope of this project involves adding more feature parameters;
taking more relevant parameters into account may improve accuracy. The
algorithms can also be applied to analyzing the contents of public comments
and thus determining patterns and relationships between the customer and the
company. The use of traditional algorithms and data mining techniques can
also help predict a corporation's performance as a whole. In the
future, we plan to integrate neural networks with other techniques such
as genetic algorithms or fuzzy logic. A genetic algorithm can be used to identify
the optimal network architecture and training parameters. Fuzzy logic provides
the ability to account for some of the uncertainty in neural network
predictions. Used in conjunction with a neural network, these techniques could
improve SMS spam prediction.

REFERENCES
1. Kaggle for dataset

2. https://www.analyticsvidhya.com/blog/2022/07/end-to-end-project-on-sms-email-spam-detection-using-naive-bayes/
