DOCUMENT
2024-2025
SMS/EMAIL SPAM DETECTION
N.ROMA 219P1A0510
Guided by
B.VARALAKSHMI
CERTIFICATE
This is to certify that this application development lab record entitled
“SMS/EMAIL SPAM DETECTION” is submitted by N.ROMA (219P1A0510),
B. Tech 4th year, I semester, Department of CSE, during the year 2024-2025. The
results embodied in this report have not been submitted to any other university or
institute for the award of any degree or diploma.
External Examiner
DECLARATION
I declare that this project report titled SMS/EMAIL SPAM DETECTION submitted
in partial fulfillment of the degree of B. Tech in CSE is a record of original work
carried out by me under the supervision of Mrs. B.VARALAKSHMI and has not
formed the basis for the award of any other degree or diploma, in this or any other
Institution or University. In keeping with the ethical practice in reporting scientific
information, due acknowledgements have been made wherever the findings of others
have been cited.
N.ROMA 219P1A0510
ACKNOWLEDGEMENT
We would like to express our gratitude to all those who extended their support
and suggestions in developing this software. Special thanks go to our mentor,
Mrs. B. Varalakshmi, whose help, stimulating suggestions, and encouragement
supported us throughout the development of the project.
We sincerely thank our Head of the Department, Dr. Shaik Meeravali, for his
constant support and motivation. A special acknowledgement goes to a friend
who encouraged us from behind the scenes. Last but not least, our sincere
appreciation goes to our families, who have been tolerant and understanding
of our moods and have extended timely support.
ABSTRACT
Over recent years, as the popularity of mobile phones has increased, Short
Message Service (SMS) has grown into a multi-billion dollar industry. At the same time,
the falling cost of messaging services has led to growth in unsolicited
commercial advertisements being sent to mobile phones. In parts of Asia, up to 30%
of text messages were spam in 2012. The lack of real SMS spam datasets, the short
length of messages and their limited features, and their informal language are factors
that may cause established email filtering algorithms to underperform on SMS
classification. In this project, a database of real SMS spam from the UCI Machine
Learning Repository is used, and after preprocessing and feature extraction, different
machine learning techniques are applied to the database. Finally, the results are
compared and the best-performing algorithm for filtering spam text messages is identified.
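TABLE OF CONTENTS
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
Chapter 1 Introduction
1.1 Overview
Chapter 2 Problem Identification
2.1 Existing System
2.2 Proposed System
Chapter 3 Requirements
3.1 Software Requirements
Chapter 4
4.1 Design
4.2 Implementation
Chapter 5 Code
Chapter 6 Results and Conclusion
6.1 Results
6.2 Conclusion
REFERENCES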
CHAPTER – 1
INTRODUCTION
1.1 OVERVIEW
Detecting SMS spam using machine learning involves training a model to
differentiate between spam and legitimate messages based on various features.
Common approaches include utilizing natural language processing techniques,
analyzing message content, and considering sender information. This process aims to
create a predictive model that can accurately classify incoming messages, helping
users filter out unwanted spam and enhance their messaging experience.
1. Accuracy: Develop a model that can accurately distinguish between spam and
non-spam SMS messages to minimize false positives and false negatives.
2. Efficiency: Create an efficient spam detection system that can process SMS messages
in real-time or near real-time to ensure timely filtering of spam.
3. Scalability: Design the system to handle large volumes of SMS messages without
sacrificing performance, allowing it to scale as the user base grows.
4. Robustness: Build a robust spam detection model that can generalize well across
different languages, dialects, and SMS formats, ensuring effectiveness across diverse
user demographics.
5. Adaptability: Implement mechanisms to continuously update and improve the spam
detection model to adapt to evolving spamming techniques and linguistic variations.
6. User Experience: Enhance the user experience by minimizing the impact of false
positives on legitimate SMS messages while effectively filtering out spam to improve
overall satisfaction.
7. Compliance: Ensure compliance with relevant privacy and data protection
regulations by handling SMS data securely and responsibly.
8. Integration: Integrate the spam detection system seamlessly into existing SMS
platforms or messaging applications to provide users with a hassle-free experience.
9. Feedback Mechanism: Incorporate mechanisms for users to report false positives
and negatives, enabling the system to learn and improve over time.
10. Monitoring and Reporting: Implement monitoring tools to track system
performance, detect anomalies, and generate reports on spam detection efficacy for
stakeholders.
Identify Spam: Develop a system that accurately distinguishes between spam and
legitimate SMS messages.
Data Preprocessing: Clean and preprocess the SMS messages to prepare them
for analysis.
Feature Engineering: Extract relevant features from the text data to facilitate
classification.
Model Selection: Choose appropriate machine learning or natural language
processing models for classification.
Model Training: Train the selected models on the labeled SMS dataset using
appropriate techniques.
Evaluation Metrics: Assess the performance of the trained models using metrics
such as accuracy, precision, recall, and F1-score.
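The four metrics named above all follow from the cells of a binary confusion matrix. The sketch below is a minimal plain-Python illustration, not the code used in this project; the function name `classification_metrics` and the toy labels are assumptions made for the example, with spam encoded as 1 and ham as 0.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration: 1 = spam, 0 = ham
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In spam filtering, precision often deserves extra attention, because a false positive hides a legitimate message from the user.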
CHAPTER-2
PROBLEM IDENTIFICATION
2.1 EXISTING SYSTEM
The existing SMS spam detection system utilizes machine learning algorithms to
classify incoming messages as either spam or legitimate. It follows a structured
pipeline involving data collection, preprocessing, feature extraction, model
training, evaluation, deployment, and maintenance.
Components
Data Collection: Collects a labeled dataset of SMS messages, including spam and
legitimate ones.
Preprocessing: Cleans and normalizes text data, removing noise and standardizing
formats.
Model Training: Trains machine learning algorithms (e.g., Naive Bayes, SVM) on
the extracted features.
Deployment: Deploys the trained model into production for real-time classification
of incoming messages.
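To make the training step concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is written in plain Python for illustration and is not the scikit-learn implementation used later in this report; the class name `TinyNaiveBayes`, the simplified tokenizer and the toy messages are all assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphanumeric tokens only (simplified preprocessing)."""
    return [tok for tok in text.lower().split() if tok.isalnum()]


class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data invented for illustration
train_messages = [
    "win a free prize now",
    "free entry claim your cash prize",
    "are we meeting for lunch today",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyNaiveBayes().fit(train_messages, train_labels)
print(model.predict("claim your free prize"))
print(model.predict("lunch at home today"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.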
2.2 PROPOSED SYSTEM
The proposed system aims to enhance the existing SMS spam detection by
incorporating advanced techniques and addressing limitations. It introduces
improvements in data collection, preprocessing, feature extraction, and model
training.
Enhanced Data Collection: Augments the dataset with more diverse and
representative samples, including recent spam trends.
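Feature extraction, one of the areas the proposed system targets, can be illustrated with TF-IDF weighting, which Chapter 5 applies through scikit-learn's `TfidfVectorizer`. The sketch below is plain Python and uses the textbook formula tf × log(N / df) rather than scikit-learn's smoothed and normalized variant, so the numbers will differ from the library's output; the function name `tfidf_vectors` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy documents invented for illustration
docs = ["free prize now", "free lunch today", "see you today"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in many messages, such as "free" here, receive a lower weight than terms that occur in only one.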
CHAPTER-3
REQUIREMENTS
3.1 SOFTWARE REQUIREMENTS
The software requirements for implementing an SMS spam detection system typically
involve a combination of programming languages, libraries, frameworks, and tools.
Here's a list of essential software requirements for building such a system:
Programming Language:
Python
Libraries and Frameworks:
Scikit-learn
TensorFlow or PyTorch
Pandas
NumPy
Matplotlib or Seaborn
Stopword Lists
Development Environment:
Version Control System (e.g., Git) for tracking changes to code and collaborating
with team members.
Web Application Frameworks (e.g., Flask or Django) for building RESTful APIs to
serve the model predictions.
Cloud Platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure) for hosting
and scaling the deployed application.
Containerization Tools (e.g., Docker) for packaging the application and its
dependencies into lightweight, portable containers.
CHAPTER- 4
4.2 IMPLEMENTATION
Methodology
• Data Collection
• Data Preprocessing
• Model Selection
• Training
• Validation
• Testing
• Deployment
• Continuous Improvement
Algorithms
• Naive Bayes
• Support Vector Machines (SVM)
• Logistic Regression
• Random Forest
• Neural Networks
• Ensemble Methods
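The training, validation and testing steps listed above rely on splitting the labeled data. Because spam is the minority class, a stratified split that preserves the class ratio in every subset is a common choice. The following is a minimal plain-Python sketch; the function name `stratified_split`, the 70/15/15 fractions and the seed are assumptions, and the code in Chapter 5 uses a single 80/20 split through scikit-learn's `train_test_split` instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=2):
    """Split sample indices into train/validation/test, preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Imbalanced toy labels invented for illustration: 80 ham (0), 20 spam (1)
labels = [0] * 80 + [1] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which keeps comparisons between algorithms fair.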
CHAPTER – 5
CODE
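import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset (the path points to the author's Google Drive)
df = pd.read_csv('/content/drive/MyDrive/SMS_SPAM/spam.csv', encoding='Windows-1252')
df.sample(5)
df.shape
df.info()

# Keep only the label (v1) and message (v2) columns and rename them;
# discarding any other columns is an assumption about the CSV layout
df = df[['v1', 'v2']].rename(columns={'v1': 'target', 'v2': 'text'})
df.sample(5)

# Encode the labels as integers: ham -> 0, spam -> 1
encoder = LabelEncoder()
df['target'] = encoder.fit_transform(df['target'])
df.head()

# Check for missing values and duplicate rows
df.isnull().sum()
df.duplicated().sum()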
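# Remove duplicate messages, keeping the first occurrence
df = df.drop_duplicates(keep='first')
df.duplicated().sum()
df.shape

"""EDA"""
import matplotlib.pyplot as plt

df.head()
df['target'].value_counts()
plt.pie(df['target'].value_counts(), labels=['ham', 'spam'], autopct='%0.2f')
plt.show()

"""DATA IS IMBALANCED"""
import nltk
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead

# Number of characters per message
df['num_characters'] = df['text'].apply(len)
df.head()

# Number of words per message
df['num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))
df.head()

# Number of sentences per message
df['num_sentences'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))
df.head()

df[['num_characters', 'num_words', 'num_sentences']].describe()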
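import seaborn as sns

# Summary statistics for ham messages
df[df['target'] == 0][['num_characters', 'num_words', 'num_sentences']].describe()
# Summary statistics for spam messages
df[df['target'] == 1][['num_characters', 'num_words', 'num_sentences']].describe()

# Character-count distribution: ham (default color) vs spam (red)
plt.figure(figsize=(12, 6))
sns.histplot(df[df['target'] == 0]['num_characters'])
sns.histplot(df[df['target'] == 1]['num_characters'], color='red')

# Word-count distribution: ham (default color) vs spam (red)
plt.figure(figsize=(12, 6))
sns.histplot(df[df['target'] == 0]['num_words'])
sns.histplot(df[df['target'] == 1]['num_words'], color='red')

sns.pairplot(df, hue='target')
# numeric_only=True skips the text column, which recent pandas versions would reject
sns.heatmap(df.corr(numeric_only=True), annot=True)

"""DATA PREPROCESSING
1. Lower case
2. Tokenization
3. Removing special characters
4. Removing stop words and punctuation
5. Stemming"""
import string
nltk.download('stopwords')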
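from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # built once for fast lookups

def transform_text(text):
    # 1-2. Lowercase and tokenize
    text = text.lower()
    tokens = nltk.word_tokenize(text)

    # 3. Keep only alphanumeric tokens
    tokens = [t for t in tokens if t.isalnum()]

    # 4. Drop stop words and punctuation
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]

    # 5. Stem each remaining token
    tokens = [ps.stem(t) for t in tokens]

    # Return a single string so the result can be passed to a vectorizer
    return " ".join(tokens)

ps.stem('loving')
transform_text(
    "I'm gonna be home soon and i don't want to talk about this stuff "
    "anymore tonight, k? I've cried enough today."
)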
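df['text'][10]
df['transformed_text'] = df['text'].apply(transform_text)
df.head()

from wordcloud import WordCloud
wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')

# Word cloud of spam messages
spam_wc = wc.generate(df[df['target'] == 1]['transformed_text'].str.cat(sep=" "))
plt.figure(figsize=(15, 6))
plt.imshow(spam_wc)

# Word cloud of ham messages (generate() reuses wc, so draw each cloud right away)
ham_wc = wc.generate(df[df['target'] == 0]['transformed_text'].str.cat(sep=" "))
plt.figure(figsize=(15, 6))
plt.imshow(ham_wc)
df.head()

# Collect every word that appears in spam messages
from collections import Counter
spam_corpus = []
for msg in df[df['target'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)
len(spam_corpus)

# Plot the 30 most common spam words
top_spam = pd.DataFrame(Counter(spam_corpus).most_common(30))
sns.barplot(x=top_spam[0], y=top_spam[1])
plt.xticks(rotation='vertical')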
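plt.show()

# Collect every word that appears in ham messages
ham_corpus = []
for msg in df[df['target'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)
len(ham_corpus)

# Plot the 30 most common ham words
top_ham = pd.DataFrame(Counter(ham_corpus).most_common(30))
sns.barplot(x=top_ham[0], y=top_ham[1])
plt.xticks(rotation='vertical')
plt.show()

# Text Vectorization
df.head()

"""MODEL BUILDING"""
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()  # bag-of-words alternative; the features below use TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['transformed_text']).toarray()
X.shape
y = df['target'].values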
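from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

# Gaussian Naive Bayes
gnb.fit(X_train, y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test, y_pred1))
print(confusion_matrix(y_test, y_pred1))
print(precision_score(y_test, y_pred1))

# Multinomial Naive Bayes
mnb.fit(X_train, y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))
print(precision_score(y_test, y_pred2))

# Bernoulli Naive Bayes
bnb.fit(X_train, y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))
print(precision_score(y_test, y_pred3))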
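from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from xgboost import XGBClassifier

# The definitions of svc, lrc, rfc, abc and etc were lost in extraction;
# the constructor arguments on those five lines are assumptions.
svc = SVC()
knc = KNeighborsClassifier()
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression()
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
bc = BaggingClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=2)
xgb = XGBClassifier(n_estimators=50, random_state=2)

clfs = {
    'SVC': svc,
    'KN': knc,
    'NB': mnb,
    'DT': dtc,
    'LR': lrc,
    'RF': rfc,
    'AdaBoost': abc,
    'BgC': bc,
    'ETC': etc,
    'GBDT': gbdt,
    'xgb': xgb,
}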
def train_classifier(clf, X_train, y_train, X_test, y_test):
    # Fit on the training split, then score on the held-out test split
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    return accuracy, precision

train_classifier(svc, X_train, y_train, X_test, y_test)

# Scores collected for each classifier in clfs
accuracy_scores = []
precision_scores = []
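# Train every classifier and record its accuracy and precision
for name, clf in clfs.items():
    current_accuracy, current_precision = train_classifier(clf, X_train, y_train, X_test, y_test)
    print("For ", name)
    print("Accuracy - ", current_accuracy)
    print("Precision - ", current_precision)
    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)

performance_df = pd.DataFrame({
    'Algorithm': list(clfs.keys()),
    'Accuracy': accuracy_scores,
    'Precision': precision_scores,
}).sort_values('Precision', ascending=False)
performance_df

# Reshape to long format for plotting. This step was lost in extraction;
# the melt and catplot calls are an assumed reconstruction.
performance_df1 = pd.melt(performance_df, id_vars='Algorithm')
performance_df1
sns.catplot(x='Algorithm', y='value', hue='variable', data=performance_df1, kind='bar', height=5)
plt.ylim(0.5, 1.0)
plt.xticks(rotation='vertical')
plt.show()

# model improve
# Scores obtained with max_features=3000 in the TF-IDF vectorizer
temp_df = pd.DataFrame({
    'Algorithm': list(clfs.keys()),
    'Accuracy_max_ft_3000': accuracy_scores,
    'Precision_max_ft_3000': precision_scores,
}).sort_values('Precision_max_ft_3000', ascending=False)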
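# Merge the max_features=3000 scores (temp_df from the previous step) into the baseline
new_df = performance_df.merge(temp_df, on='Algorithm')

# Scores after feature scaling (reset the score lists and rerun the training loop first)
temp_df = pd.DataFrame({
    'Algorithm': list(clfs.keys()),
    'Accuracy_scaling': accuracy_scores,
    'Precision_scaling': precision_scores,
}).sort_values('Precision_scaling', ascending=False)
new_df_scaled = new_df.merge(temp_df, on='Algorithm')

# Scores after adding num_characters as a feature (again after a rerun)
temp_df = pd.DataFrame({
    'Algorithm': list(clfs.keys()),
    'Accuracy_num_chars': accuracy_scores,
    'Precision_num_chars': precision_scores,
}).sort_values('Precision_num_chars', ascending=False)
new_df_scaled.merge(temp_df, on='Algorithm')

# Voting Classifier
# The estimator list was lost in extraction; the three models below are an assumption.
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
svc = SVC(probability=True)  # soft voting needs class probabilities
mnb = MultinomialNB()
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
estimators = [('svm', svc), ('nb', mnb), ('et', etc)]
voting = VotingClassifier(estimators=estimators, voting='soft')
voting.fit(X_train, y_train)
y_pred = voting.predict(X_test)
print("Accuracy", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred))

# Applying stacking
final_estimator = RandomForestClassifier()
clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred))

# Save the fitted vectorizer and model. The ensembles above fit clones of their
# estimators, so mnb itself must be fitted before it is pickled.
import pickle
mnb.fit(X_train, y_train)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
with open('model.pkl', 'wb') as f:
    pickle.dump(mnb, f)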
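The saved files are meant to be loaded later for prediction. A minimal sketch of that round trip follows, assuming scikit-learn is installed. The toy messages, the temporary directory and the helper name `classify` are invented for illustration, and the text is passed to the vectorizer without the stemming step used above, so this is not the project's deployment code.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the real training set (1 = spam, 0 = ham)
messages = [
    "win a free prize now",
    "free cash claim your prize",
    "see you at lunch today",
    "call me when you get home",
]
targets = [1, 1, 0, 0]

tfidf = TfidfVectorizer()
mnb = MultinomialNB()
mnb.fit(tfidf.fit_transform(messages), targets)

# Save both fitted objects, then load them back as a deployed app would
workdir = tempfile.mkdtemp()
for name, obj in (("vectorizer.pkl", tfidf), ("model.pkl", mnb)):
    with open(os.path.join(workdir, name), "wb") as f:
        pickle.dump(obj, f)

with open(os.path.join(workdir, "vectorizer.pkl"), "rb") as f:
    loaded_tfidf = pickle.load(f)
with open(os.path.join(workdir, "model.pkl"), "rb") as f:
    loaded_model = pickle.load(f)


def classify(text):
    """Return 1 for spam and 0 for ham using the reloaded objects."""
    features = loaded_tfidf.transform([text])
    return int(loaded_model.predict(features)[0])


print(classify("claim your free prize"))
```

The vectorizer must be saved together with the model, because a new message has to be mapped onto the same vocabulary the model was trained on. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.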
Fig-5.2.1
CHAPTER-6
RESULTS AND CONCLUSION
6.1 RESULTS
6.2 CONCLUSION
Future work on this project will involve adding more features. Taking more
relevant features into account may improve accuracy. The algorithms can also
be applied to analyzing the contents of public comments and thus determining
patterns and relationships between customers and a company. Traditional
algorithms and data mining techniques can also help predict a corporation's
overall performance. In the future, we plan to integrate a neural network
with other techniques such as genetic algorithms or fuzzy logic. A genetic
algorithm can be used to identify the optimal network architecture and
training parameters, while fuzzy logic can account for some of the
uncertainty in the neural network's predictions. Used in conjunction with a
neural network, these techniques could improve SMS spam prediction.
REFERENCES
1. Kaggle for dataset
2. https://www.analyticsvidhya.com/blog/2022/07/end-to-end-project-on-sms-email-spam-detection-using-naive-bayes/