
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

MACHINE LEARNING PROJECT

Email Spam Classifier


DƯƠNG MẠNH KIÊN ([email protected])
HẠ NHẬT DUY ([email protected])
LÊ TUẤN NAM ([email protected])
NGÔ ANH QUÂN ([email protected])
BÙI VIỆT HUY ([email protected])

Supervisor: Dr. Thân Quang Khoát

Hanoi, 29/05/2024
ABSTRACT

Email spam detection is a critical task aimed at filtering out unwanted or malicious
emails from users’ inboxes. In this Email Spam Classifier project, we leverage
machine learning methodologies to enhance the accuracy of our classifier model.
By applying various algorithms and techniques, we aim to develop a robust classifier
capable of accurately identifying spam emails and minimizing false positives.

Student
(Signature and full name)

Le Tuan Nam
TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION

CHAPTER 2. DATA PREPROCESSING
  2.1 Clean and tokenize
  2.2 Balance data of spam and ham emails
  2.3 Machine Learning approach: TF-IDF

CHAPTER 3. MACHINE LEARNING
  3.1 Soft Margin Support Vector Machine
  3.2 Logistic Regression
    3.2.1 Logistic sigmoid function
    3.2.2 The objective function with L2 regularization
  3.3 XGBoost
  3.4 Random Forest
    3.4.1 Random Forest Algorithm Steps
    3.4.2 Predict and evaluate model
    3.4.3 Optimization and Performance
    3.4.4 Conclusion

CHAPTER 4. EVALUATION
  4.1 Test Accuracy
  4.2 Confusion Matrix

CHAPTER 5. CONCLUSION


CHAPTER 1. INTRODUCTION

The Email Spam Classifier project uses machine learning methodologies to develop an effective model for classifying emails as either spam or non-spam (ham). Much like binary sentiment classification on the IMDb Reviews dataset, the goal is a binary text classifier, here aimed at filtering unwanted or malicious emails from users' inboxes.
The dataset for this project consists of nearly 6,000 labeled emails, split into roughly equal portions for training and testing.
After preprocessing the text data and applying word vectorization techniques, the project explores several machine learning algorithms for classification, including traditional supervised learning algorithms such as Support Vector Machine (SVM), Logistic Regression, XGBoost, and Random Forest.
An example of an email classified as "Spam": "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv".
An example of a "Ham" email: "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."

CHAPTER 2. DATA PREPROCESSING

2.1 Clean and tokenize


We apply the following basic techniques to clean the data:

• remove “(<.*?>)” markup,


• remove punctuation marks,
• remove all strings that contain a non-letter,
• convert to lowercase,
• remove stop words,
• Porter stemming: reduce words to their root form by removing common suffixes such as "ing," "ed," "s," etc., without regard to the meaning of the word,
• remove empty emails.

Function "fit_transform" from library sklearn change "spam" and "ham"


in "Label" row to 1 and 0, respectively. Also, function "drop_duplicates"
in library pandas helps to remove duplicate emails.
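As a rough sketch of this step (the file name, the "Label"/"Text" column names, and the use of sklearn's LabelEncoder are our assumptions; the project's actual code may differ):

```python
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.preprocessing import LabelEncoder

# Assumed file and column names; adjust to the real dataset layout.
df = pd.read_csv("spam.csv")
df = df.drop_duplicates()                      # remove duplicate emails

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))   # requires nltk.download("stopwords")

def clean(text: str) -> str:
    text = re.sub(r"<.*?>", " ", text)                  # remove "(<.*?>)" markup
    text = re.sub(r"\S*[^A-Za-z\s]\S*", " ", text)      # drop tokens containing a non-letter (covers punctuation)
    words = text.lower().split()                        # convert to lowercase and tokenize
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    return " ".join(words)

df["Text"] = df["Text"].apply(clean)
df = df[df["Text"].str.len() > 0]              # remove empty emails

# fit_transform maps the labels alphabetically: "ham" -> 0, "spam" -> 1
df["Label"] = LabelEncoder().fit_transform(df["Label"])
```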

Figure 2.1: Emails before and after cleaning

2.2 Balance data of spam and ham emails


Oversampling is a technique used to balance the number of samples between
classes (in this case, between spam and ham emails) by generating new samples for
the minority class (the class with fewer samples) to match the number of samples
in the majority class. The process typically works as follows:

• Identify Minority and Majority Classes,


• Choose an Oversampling Method: Random Oversampling,
• Apply Oversampling: Implement the chosen oversampling method to create new samples for the minority class. This may involve randomly replicating existing samples or generating synthetic samples based on existing ones (a minimal sketch follows this list).
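A pandas-only sketch of random oversampling (imbalanced-learn's RandomOverSampler would achieve the same result; the DataFrame and column names are carried over from the previous sketch and remain assumptions):

```python
import pandas as pd

# Split by class; labels follow the earlier encoding (spam = 1, ham = 0).
spam = df[df["Label"] == 1]
ham = df[df["Label"] == 0]
minority, majority = (spam, ham) if len(spam) < len(ham) else (ham, spam)

# Randomly replicate minority samples with replacement until the counts match,
# then shuffle the combined, balanced dataset.
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced_df = pd.concat([majority, oversampled]).sample(frac=1, random_state=42)
```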

(a) Before over-sampling

(b) After over-sampling

Figure 2.2: Comparison of the class distribution before and after over-sampling

2.3 Machine Learning approach: TF-IDF


TF-IDF (term frequency-inverse document frequency) is a statistical measure
that evaluates how relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics: how many times a word appears in
a document, and the inverse document frequency of the word across a set of
documents.

$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_k n_{k,d}}$$

$$\mathrm{idf}(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}| + 1} = \log\frac{|D|}{\mathrm{df}(d, t) + 1}$$

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

Where:

• $|D|$ is the number of documents in the collection.

• $\mathrm{df}(d, t) = |\{d \in D : t \in d\}|$ is the number of documents $d \in D$ in which the word t appears.

• $\mathrm{tf}(t, d)$ is the frequency of the word t in the document d.

• $n_{t,d}$ is the number of times term t appears in document d.

• $\sum_k n_{k,d}$ is the total number of times all terms appear in document d.

The vocabulary of TF-IDF Vectorizer for our dataset would be roughly as below:

(a) A glance at TF-IDF vocabulary
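As a sketch, the vectorization step with scikit-learn's TfidfVectorizer might look like the following (the library's exact idf formula uses slightly different smoothing than the equations above, and the even train/test split is our assumption based on the introduction):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Build the TF-IDF document-term matrix from the cleaned, balanced emails.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(balanced_df["Text"])
y = balanced_df["Label"].values

# Roughly equal training and test portions, as described in the introduction.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)
```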

CHAPTER 3. MACHINE LEARNING

3.1 Soft Margin Support Vector Machine


Like the Perceptron Learning Algorithm (PLA), the basic Support Vector Machine (SVM) works effectively only when the data of the two classes is linearly separable. Naturally, we would like SVM to also handle data that is almost linearly separable, just as Logistic Regression can.

Figure 3.1: Soft margin SVM. When a) there is noise or b) the data is close to being
linearly separable, pure SVM will not work effectively.

There are two cases where it is easy to see that Hard Margin SVM does not work effectively or even fails:

• The data is still linearly separable as in Figure 3.1.a, but a noisy point from the red circle class lies too close to the green square class. In this case, Hard Margin SVM produces a very small margin, and the separating hyperplane lies too close to the green square class and far from the red circle class. If this noisy point were skipped, a much better margin, shown by the dashed lines, would be obtained. Hard Margin SVM is therefore sensitive to noise.
• The data is not linearly separable but is close to being linearly separable, as in Figure 3.1.b. In this case the Hard Margin SVM optimization problem is infeasible: the feasible set is empty. However, if a few points near the boundary between the two classes are skipped, a fairly good separating hyperplane, like the dashed line, can still be created, and the support vectors still give this classifier a large margin. Each point lying on the wrong side of its support line (also called margin line or boundary line) falls into the unsafe region. Note that the safe regions of the two classes are different and intersect in the area between the two support lines.

The standard form of the optimization problem for Soft Margin SVM is:

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n \qquad (3.1)$$

subject to:

$$y_n\,(w^T x_n + b) \ge 1 - \xi_n, \qquad \xi_n \ge 0, \qquad \forall n = 1, \dots, N$$
Evaluation:
If C is small, the penalty on the slack terms does not significantly affect the objective function's value, and the algorithm will concentrate on minimizing $\|w\|^2$, which means maximizing the margin; this leads to a large $\sum_{n=1}^{N}\xi_n$. Conversely, if C is too large, the algorithm will focus on reducing $\sum_{n=1}^{N}\xi_n$ in order to minimize the objective. In the case where C is very large and the two classes are linearly separable, we obtain $\sum_{n=1}^{N}\xi_n = 0$; note that this value cannot be negative. This means no points need to be sacrificed, i.e., we recover the solution of Hard Margin SVM. In other words, Hard Margin SVM is a special case of Soft Margin SVM.
The optimization problem (3.1) introduces the slack variables $\xi_n$. Points with $\xi_n = 0$ lie in the safe region; points with $0 < \xi_n \le 1$ lie in the unsafe region but are still correctly classified, i.e., they lie on the correct side of the decision boundary; points with $\xi_n > 1$ are misclassified.
The objective function in (3.1) is convex because it is the sum of two convex functions: the norm term and the linear term. The constraints are also linear functions of $(w, b, \xi)$. Therefore, (3.1) is a convex problem and can be expressed as a Quadratic Programming (QP) problem.

Figure 3.2: Example on the effect of C

The default model uses C = 1.0 and yields accuracies of 99.996% on the training set and 97.75% on the test set.

Figure 3.3: C from 0.1 to 100

We used a 10-fold cross-validation scheme to find the best value of C. The model appears to start overfitting from C = 1.1, so a value of 1.0 suffices to balance running time and slightly reduce overfitting.
The linear SVM model with C = 1.0 achieves 99.96% accuracy on the training set and 97.75% on the test set.
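A sketch of the C sweep with 10-fold cross-validation; the estimator (LinearSVC) and the candidate values are assumptions, and the feature matrices are carried over from the earlier sketches:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Evaluate several values of C with 10-fold cross-validation on the training set.
for C in [0.1, 0.5, 1.0, 1.1, 2.0, 10.0, 100.0]:
    scores = cross_val_score(LinearSVC(C=C, max_iter=10000), X_train, y_train, cv=10)
    print(f"C={C}: mean CV accuracy = {scores.mean():.4f}")

# Fit the final soft-margin linear SVM with the chosen C and report test accuracy.
final_svm = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
print("Test accuracy:", final_svm.score(X_test, y_test))
```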

Figure 3.4: SVM model with C = 0.1

3.2 Logistic Regression
Logistic regression is a statistical method for analyzing datasets in which one or more independent variables determine an outcome. The outcome is typically a binary variable (i.e., it has two possible values). Logistic regression is widely used for binary classification problems in fields such as medicine and finance, and, in our case, email classification.

3.2.1 Logistic sigmoid function


The logistic sigmoid function is used as the output of this model:

$$\sigma(w^T x) = \frac{1}{1 + e^{-w^T x}} \qquad (3.2)$$

where:

• $w^T$ is the transpose of the weight vector w.

• x is the feature vector.

• $\sigma(w^T x)$ is the logistic sigmoid function applied to $w^T x$.

The logistic function outputs values between 0 and 1, which can be interpreted
as probabilities.

3.2.2 The objective function with L2 regularization


The loss function in logistic regression with L2 regularization (ridge regularization) can be written as follows:

$$J(w, b) = \frac{1}{N}\sum_{i=1}^{N}\left[-y_i \log(h_w(x_i)) - (1 - y_i)\log(1 - h_w(x_i))\right] + \frac{\lambda}{2}\|w\|^2$$

$$\phantom{J(w, b)} = \frac{1}{N}\sum_{i=1}^{N}\left[-y_i \log\sigma(w^T x_i) - (1 - y_i)\log\left(1 - \sigma(w^T x_i)\right)\right] + \frac{\lambda}{2}\|w\|^2$$

Where:

• w is the weight vector.

• N is the number of samples.

• yi is the true label of sample i.

• xi is the feature vector of sample i.

• $\|w\|^2$ is the squared L2 norm of the weight vector.

• λ is the regularization parameter.

The parameter λ in the above formula is related to C as follows:

$$\lambda = \frac{1}{C}$$
When C is small (meaning λ is large), regularization is stronger, and the model
will try to make the weights smaller to prevent overfitting, even if this leads to
some errors in predictions.
When C is large (meaning λ is small), regularization is weaker, and the model
will try to fit the training data better, but there is a higher risk of overfitting.

Figure 3.5: C from 1 to 100

Using the 10-fold cross-validation scheme to find the best value of C, the model appears to start overfitting from C = 20, so a value of 10 suffices to balance running time and slightly reduce overfitting.
The Logistic Regression model with C = 10 achieves 99.91% accuracy on the training set and 97.29% on the test set.
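A comparable sketch for logistic regression, where C = 1/λ controls the strength of the L2 penalty (the solver choice and grid values are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Grid-search C with 10-fold cross-validation; larger C means weaker L2 regularization.
grid = GridSearchCV(
    LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000),
    param_grid={"C": [1, 5, 10, 20, 50, 100]},
    cv=10,
)
grid.fit(X_train, y_train)
print("Best C:", grid.best_params_["C"])
print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))
```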

3.3 XGBoost
XGBoost (Extreme Gradient Boosting) is an algorithm based on gradient boosting, but with substantial improvements in algorithmic optimization and in how it exploits software and hardware capabilities, which yields outstanding results in both training time and memory usage.
The first step is to identify a training set with any number of features and y as the target variable, together with a differentiable loss function L(y, F(x)). The loss function simply compares the actual value of the target variable with its predicted value.

Next, initialize the XGBoost model with a constant value:

$$\hat{f}_0(x) = \arg\min_{\theta} \sum_{i=1}^{N} L(y_i, \theta) \qquad (3.3)$$

Here, θ is an arbitrary initial estimate (its exact value does not matter) that serves as the starting point of the regression algorithm. The estimation error with θ will initially be very large but becomes smaller and smaller with each additive iteration of the algorithm.
Then, for all m ∈ {1, 2, 3, . . . , M}, compute the gradients and Hessians for the gradient boosting of the trees:

$$g_i = \left.\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}\right|_{\hat{y}_i = \hat{f}^{(m-1)}(x_i)} \qquad (3.4)$$

$$h_i = \left.\frac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}\right|_{\hat{y}_i = \hat{f}^{(m-1)}(x_i)} \qquad (3.5)$$

The gradients, often referred to as the "pseudo-residuals," show the change in the loss function for a one-unit change in the predicted value. The Hessian is the derivative of the gradient, i.e., the rate of change of the loss with respect to the prediction. The Hessian helps determine how much the gradient is changing, and therefore how much the model will change. Each of these quantities is essential to the gradient descent process.
Using these quantities, another tree is added by solving the following optimization problem at each iteration of the algorithm:

$$\hat{\phi}_m = \arg\min_{\phi \in \Phi} \sum_{i=1}^{N} \frac{1}{2}\,\hat{h}_m(x_i)\left[-\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} - \phi(x_i)\right]^2 \qquad (3.6)$$

$$\hat{p}_m(x) = \alpha\,\hat{\phi}_m(x) \qquad (3.7)$$

where α is the learning rate.


This optimization problem uses a Taylor approximation, which is necessary to be
able to use traditional optimization techniques. What this is doing is estimating
the point that the algorithm is at in the gradient boosting process. If the rate
of change of the gradient is steep, meaning that the residuals are large, then the
algorithm still needs significant change. On the other hand, if the rate of change
of the gradient is flat, the algorithm is close to completion.
The model is then updated by adding the new trees to the previous model:

$$\hat{f}^{(m)}(x) = \hat{f}^{(m-1)}(x) + \hat{p}_m(x) \qquad (3.8)$$

This process is repeated for every weak learner, for each m ∈ {1, 2, . . . , M}. Weak learners are used iteratively to improve the model's accuracy by minimizing the loss function. The final output of the XGBoost algorithm can then be expressed as the sum of the individual weak learners' predictions:

$$\hat{f}(x) = \sum_{m=0}^{M} \hat{f}_m(x) \qquad (3.9)$$

The initial accuracy of XGBoost on the test set (with default parameter values) is 97.09%. After experimenting with tuning parameters such as max_depth, learning_rate, n_estimators, gamma, and subsample, we observed that learning_rate, max_depth, and subsample have the most significant impact on the accuracy of the XGBoost classifier. Narrowing down the range of these parameters, we employed grid search with 10-fold cross-validation to find the best combination, as sketched below.
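A sketch of that grid search (the parameter ranges below are assumptions narrowed around the reported best combination):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Search the three most influential parameters with 10-fold cross-validation.
param_grid = {
    "learning_rate": [0.1, 0.3, 0.5, 0.7],
    "max_depth": [3, 5, 7, 10],
    "subsample": [0.5, 0.8, 1.0],
}
search = GridSearchCV(
    XGBClassifier(n_estimators=100, eval_metric="logloss"),
    param_grid, cv=10, scoring="accuracy",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```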

Figure 3.6: max_depth from 3 to 10

Figure 3.7: learning_rate from 0.01 to 1

Figure 3.8: subsample from 0.1 to 1

The best combination (learning_rate = 0.5, max_depth = 5, and subsample = 1.0) yields an accuracy of 99.78% on the training set and 96.43% on the test set.

3.4 Random Forest


Random forest is an ensemble learning method that operates by constructing
a multitude of decision trees during training and outputting the class that is the
mode of the classes (classification) of the individual trees. Each tree in the random
forest is trained on a random subset of the training data and a random subset of
features. The randomness introduced during both the bootstrap sampling of the
training data and the feature selection helps to decorrelate the trees and reduce
overfitting.

3.4.1 Random Forest Algorithm Steps


1. Bootstrap Sampling:

• Randomly select a subset of the training data with replacement (bootstrap sampling). Let N be the size of the training set; then for each tree:

Sampled_data_k = {(x_i, y_i)}, where x_i is a feature vector and y_i is its label.

2. Feature Subset Selection:

• Randomly select a subset of features (columns) from the training data. Let M be the total number of features; then for each tree:

Selected_features_k = {f_j}, where j ∈ {1, 2, ..., M}.

3. Grow Decision Trees:

• Construct a decision tree using the bootstrap sample and the selected features. Let Tree_k denote the k-th decision tree:

Tree_k = BuildTree(Sampled_data_k, Selected_features_k)

4. Ensemble Decision Trees:

• Repeat steps 1-3 to create multiple decision trees. Let K be the number of trees in the forest. Then the output for a given input x is:

Output(x) = mode(Predictions(x; Tree_1), Predictions(x; Tree_2), ..., Predictions(x; Tree_K)),

• where Predictions(x; Tree_k) is the class predicted for input x by the k-th decision tree (see the sketch after this list).
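To make the four steps concrete, here is a minimal from-scratch sketch of the bagging-and-voting idea built on scikit-learn decision trees. It is illustrative only, not the project's implementation; scikit-learn's RandomForestClassifier (which the grid-search parameters in Section 3.4.3 suggest was used) additionally re-samples features at every split rather than once per tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, K=100, seed=42):
    """Steps 1-3: bootstrap-sample rows, pick a feature subset, grow a tree.

    X may be a dense array or a CSR sparse matrix; y is a NumPy array of 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    forest = []
    for _ in range(K):
        rows = rng.integers(0, N, size=N)                                  # bootstrap sampling with replacement
        cols = rng.choice(M, size=max(1, int(np.sqrt(M))), replace=False)  # random feature subset
        tree = DecisionTreeClassifier(random_state=seed).fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Step 4: majority vote (mode) over the individual trees' predictions."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])  # shape (K, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)                         # mode for binary labels {0, 1}
```

For binary labels, averaging the votes and thresholding at 0.5 is equivalent to taking the mode described in step 4.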

3.4.2 Predict and evaluate model


Precision: 0.9923664122137404
Recall: 0.87248322147651
Accuracy Score: 0.9844961240310077
F1 Score: 0.9285714285714286
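These figures would be obtained roughly as follows (a sketch; the default-parameter RandomForestClassifier and the variable names from the earlier sketches are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Fit a Random Forest on the TF-IDF features and evaluate it on the held-out test set.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
```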

3.4.3 Optimization and Performance


The Random Forest model was optimized using GridSearchCV with 10-fold
cross-validation. The best parameters found were:

• max_leaf_nodes: 10000

• n_estimators: 200

The n_estimators parameter in Random Forest determines the number of trees in the ensemble. It plays a crucial role in balancing model complexity and overfitting.

• Small n_estimators: a smaller ensemble is simpler and cheaper to train and acts as a mild form of regularization, which encourages better generalization to unseen data, even if this costs some accuracy.
• Large n_estimators: a larger ensemble fits the training data more closely. While this may improve performance on the training data, it increases computational cost and, in our cross-validation experiments, showed signs of poorer generalization to new data.

Choice of n_estimators with Cross-Validation

To determine the optimal value of n_estimators, a 10-fold cross-validation scheme was employed, as sketched below. This approach assesses model performance across different subsets of the data, helping to identify the value of n_estimators that best balances model complexity and performance.
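A sketch of that cross-validation sweep, along the lines of the validation curve in Figure 3.10 (the candidate values of n_estimators are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# 10-fold cross-validation over n_estimators with the other tuned parameter fixed.
n_values = [10, 50, 100, 200, 400]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(max_leaf_nodes=10000, random_state=42),
    X_train, y_train,
    param_name="n_estimators", param_range=n_values, cv=10,
)
for n, tr, va in zip(n_values, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_estimators={n}: train accuracy={tr:.4f}, CV accuracy={va:.4f}")
```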

Figure 3.9: Classification report

Observation and Recommendation


Through the cross-validation process, it was observed that the model starts to exhibit signs of overfitting when n_estimators reaches 200. Consequently, a value of 100 was deemed sufficient to balance computational efficiency and reduce overfitting tendencies. This choice aims to maintain a reasonable model complexity while achieving satisfactory performance on both training and validation data.

The Random Forest model with n_estimators = 200 achieves 99.98% accuracy on the training set and 97.67% on the test set.

3.4.4 Conclusion
The Random Forest model demonstrates high performance on the given dataset,
achieving excellent accuracy and precision-recall balance. The optimized model
with tuned hyperparameters ensures robustness and generalization to unseen data.

Figure 3.10: Validation curve

CHAPTER 4. EVALUATION

4.1 Test Accuracy


We chose accuracy as the metric to measure the performance of our supervised machine learning algorithms. Below is the plot of training and test accuracy of the machine learning models. It can be seen that all models achieved quite good accuracy, about 97 percent on average.

Figure 4.1: Accuracy comparison of the models

4.2 Confusion Matrix

Figure 4.2: Confusion Matrix of LG

Figure 4.3: Confusion Matrix of SVM

Figure 4.4: Confusion Matrix of Random Forest

Figure 4.5: Confusion Matrix of XGBoost

When evaluating the confusion matrices of the four models (SVM, Logistic Regression, Random Forest, and XGBoost) for the email spam classifier, we can see that all four have a certain number of misclassified instances, represented by the false positive and false negative values in the confusion matrices. However, the degree of these misclassifications varies across the models. The Random Forest model has a higher false positive rate, meaning it incorrectly predicted many negative instances as positive. On the other hand, the XGBoost model has a higher false negative rate, missing many actual positive instances.
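A sketch of how each confusion matrix is computed (shown here for the Random Forest predictions from Section 3.4.2; the same call applies to each model):

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes (ham = 0, spam = 1), columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```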

CHAPTER 5. CONCLUSION

In conclusion, the developed spam classifier shows significant promise in reducing the volume of spam emails received by users, thereby improving their overall email experience and productivity. Starting from the data in the CSV file, we evaluated several different algorithms to find the best classifier for our problem. Among the machine learning models, the Linear Support Vector Machine outperformed the other algorithms with an accuracy of 97.75%.
Although the Linear SVM model has achieved remarkable performance, it is
essential to acknowledge that no single model can be considered a universal solution.
The choice of the most appropriate algorithm depends on various factors, including
the characteristics of the dataset, the specific requirements of the problem, and the
trade-offs between model complexity, interpretability, and computational resources.
Moving forward, it would be beneficial to explore ensemble techniques that combine
the strengths of multiple models, potentially leading to even better performance
and robustness. Additionally, continuous monitoring and updating of the spam
classifier are crucial as spam patterns evolve over time, and new strategies emerge
to circumvent existing filters.
Furthermore, it is essential to address the ethical considerations and potential
biases that may arise from the use of automated spam filtering systems. Transparency,
user control, and the ability to appeal misclassifications should be integrated
into the system to ensure fairness and accountability. Overall, the successful
development of an effective spam classifier represents a significant step towards
enhancing user experience and productivity in the realm of email communication.
However, ongoing research, adaptation, and a commitment to ethical practices are
necessary to maintain the reliability and trustworthiness of such systems in the
long run.
