Machine Learning Report
Hanoi, 29/05/2024
ABSTRACT
Email spam detection is a critical task aimed at filtering out unwanted or malicious
emails from users’ inboxes. In this Email Spam Classifier project, we leverage
machine learning methodologies to enhance the accuracy of our classifier model.
By applying various algorithms and techniques, we aim to develop a robust classifier
capable of accurately identifying spam emails and minimizing false positives.
Student
(Signature and full name)
Le Tuan Nam
CHAPTER 2. DATA PREPROCESSING
[Figure: class distribution, (a) before over-sampling]
$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}$$

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} = \log \frac{|D|}{\mathrm{df}(d, t) + 1}$$
Where:
• $|D|$ is the number of documents in the collection.
• $\mathrm{df}(d, t) = |\{d \in D : t \in d\}|$ is the number of documents in $D$ in which the term $t$ appears.
• $n_{t,d}$ is the number of times the term $t$ appears in document $d$.
• $\sum_{k} n_{k,d}$ is the total number of occurrences of all terms in document $d$.
The vocabulary of TF-IDF Vectorizer for our dataset would be roughly as below:
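As a minimal sketch of how such a vocabulary and the TF-IDF matrix are built with scikit-learn (the sample emails and variable names below are illustrative assumptions, not the project's actual data):

```python
# Illustrative sketch only: the corpus below is made up, not the project's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Congratulations, you won a free prize, click now",   # spam-like example
    "Meeting rescheduled to Monday, please confirm",      # ham-like example
    "Free offer, limited time, claim your prize",         # spam-like example
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)   # sparse matrix of TF-IDF weights

# The learned vocabulary maps each term to a column index of X.
print(vectorizer.vocabulary_)
print(X.shape)                         # (n_documents, n_terms)
```

Note that scikit-learn's default IDF uses a slightly different smoothing, $\log\frac{1 + |D|}{1 + \mathrm{df}(d, t)} + 1$, than the formula above.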
CHAPTER 3. MACHINE LEARNING
Figure 3.1: Soft margin SVM. When a) there is noise or b) the data is close to being
linearly separable, pure SVM will not work effectively.
There are two cases where it is easy to see that SVM does not work effectively
or even fails:
• The data is still linearly separable as in Figure 3.1a, but a noisy point from the red circle class lies too close to the green square class. In this case, Hard Margin SVM produces a very small margin, and the separating hyperplane ends up too close to the green square class and far from the red circle class. If this noisy point were skipped, we would obtain the much better margin described by the dashed lines. Hard Margin SVM is therefore considered sensitive to noise.
• The data is not linearly separable but is close to being linearly separable, as in Figure 3.1b. In this case, the feasible set of the Hard Margin SVM problem is empty, so the optimization problem is infeasible. However, if we skip some points near the boundary between the two classes, we can still obtain a fairly good separating hyperplane, like the solid line. The support vectors still help create a large margin for this classifier. Each point lying on the wrong side of its support line (also called margin line or boundary line) falls into the unsafe region. Note that the safe regions of the two classes are different; they intersect in the area between the two support lines.
The standard form of the optimization problem for Soft Margin SVM is:
$$\underset{\mathbf{w},\, b,\, \xi}{\arg\min}\ \left( \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{n=1}^{N}\xi_n \right) \qquad (3.1)$$

Subject to:

$$y_n(\mathbf{w} \cdot \mathbf{x}_n + b) \ge 1 - \xi_n, \quad \forall n = 1, \dots, N$$

$$\xi_n \ge 0, \quad \forall n = 1, \dots, N$$
Remarks:
If C is small, the amount of sacrifice does not significantly affect the objective function's value, and the algorithm will adjust to minimize $\|\mathbf{w}\|_2^2$, which means maximizing the margin. This will lead to a large $\sum_{n=1}^{N}\xi_n$. Conversely, if C is too large, to minimize the objective function's value the algorithm will focus on reducing $\sum_{n=1}^{N}\xi_n$. In the case where C is very large and the two classes are linearly separable, we will obtain $\sum_{n=1}^{N}\xi_n = 0$; note that this value cannot be less than 0. This means that no points need to be sacrificed, i.e., we obtain a solution for Hard Margin SVM. In other words, Hard Margin SVM is a special case of Soft Margin SVM.
The optimization problem (3.1) introduces the slack variables $\xi_n$. Points with $\xi_n = 0$ lie in the safe region. Points with $0 < \xi_n \le 1$ lie in the unsafe region but are still correctly classified, i.e., they are on the correct side of the decision boundary. Points with $\xi_n > 1$ are misclassified.
The objective function in problem (3.1) is convex because it is the sum of two convex functions: a norm and a linear function. The constraints are also linear functions of $(\mathbf{w}, b, \xi)$. Therefore, (3.1) is a convex optimization problem and can be expressed as a Quadratic Programming (QP) problem.
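To make the QP claim concrete, below is a minimal sketch (an illustration under assumed variable names, not the project's training code) that feeds problem (3.1) directly to a generic QP solver (cvxopt):

```python
# Illustrative sketch only: solving the Soft Margin SVM primal (3.1) as a QP.
import numpy as np
from cvxopt import matrix, solvers


def soft_margin_svm_qp(X, y, C=1.0):
    """X: (N, d) array of features, y: (N,) array of labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, d = X.shape
    n_vars = d + 1 + N                      # variables z = [w (d), b (1), xi (N)]

    # Objective 1/2 z^T P z + q^T z  encodes  1/2 ||w||^2 + C * sum(xi).
    P = np.zeros((n_vars, n_vars))
    P[:d, :d] = np.eye(d)
    P[d:, d:] += 1e-8 * np.eye(N + 1)       # tiny ridge for solver stability
    q = np.hstack([np.zeros(d + 1), C * np.ones(N)])

    # Constraints G z <= h:
    #   y_n (w . x_n + b) >= 1 - xi_n  ->  -y_n x_n^T w - y_n b - xi_n <= -1
    #   xi_n >= 0                      ->  -xi_n <= 0
    G = np.vstack([
        np.hstack([-y[:, None] * X, -y[:, None], -np.eye(N)]),
        np.hstack([np.zeros((N, d + 1)), -np.eye(N)]),
    ])
    h = np.hstack([-np.ones(N), np.zeros(N)])

    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d], z[d + 1:]           # w, b, xi
```

This direct formulation is only practical for small datasets; dedicated SVM solvers handle the same problem far more efficiently.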
The default model uses C = 1.0 and yields 99.996% accuracy on the training set and 97.75% on the test set.
We used a 10-fold cross-validation scheme to find the best value of C. The model starts to overfit from about C = 1.1, so a value of 1.0 is sufficient to balance running time and slightly reduce overfitting.
The linear SVM model with C = 1.0 achieves 99.96% accuracy on the train set,
and 97.75% on the test set.
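A minimal sketch of the kind of cross-validated search over C described above, assuming a scikit-learn linear SVM and TF-IDF features X_train, y_train (the names and candidate values are assumptions, not the project's exact code):

```python
# Hypothetical reproduction of the C search; X_train, y_train are assumed to be
# the TF-IDF features and labels from Chapter 2.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {"C": [0.5, 0.8, 0.9, 1.0, 1.1, 1.5, 2.0]}   # assumed candidate values

search = GridSearchCV(
    LinearSVC(max_iter=10000),
    param_grid,
    cv=10,                     # 10-fold cross-validation, as in the report
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```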
3.2 Logistic Regression
Logistic regression is a statistical method for analyzing datasets in which one or more independent variables determine an outcome. The outcome is typically a binary variable (i.e., it has two possible values). Logistic regression is widely used for binary classification problems in fields such as medicine and finance, and, in our case, email classification.
The logistic function outputs values between 0 and 1, which can be interpreted as probabilities.
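For reference, the standard logistic (sigmoid) function, applied to the linear combination $z = \mathbf{w} \cdot \mathbf{x} + b$ of the input features, is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(z) \in (0, 1)$$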
The regularization parameter $\lambda$ is related to C as follows:

$$\lambda = \frac{1}{C}$$
When C is small (meaning λ is large), regularization is stronger, and the model
will try to make the weights smaller to prevent overfitting, even if this leads to
some errors in predictions.
When C is large (meaning λ is small), regularization is weaker, and the model
will try to fit the training data better, but there is a higher risk of overfitting.
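A minimal sketch of how C controls the regularization strength in scikit-learn's LogisticRegression (X_train, y_train are the assumed TF-IDF features and labels from the earlier sketches):

```python
# Hypothetical illustration: smaller C means stronger L2 regularization (larger lambda).
from sklearn.linear_model import LogisticRegression

for C in (0.01, 1.0, 10.0):
    clf = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    clf.fit(X_train, y_train)
    # With small C the learned weights shrink toward zero; with large C they grow
    # to fit the training data more closely.
    print(C, float(abs(clf.coef_).mean()))
```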
Using the same 10-fold cross-validation scheme to find the best value of C, the model starts to overfit from about C = 20, so a value of 10 is sufficient to balance running time and slightly reduce overfitting.
The logistic regression model with C = 10 achieves 99.91% accuracy on the train set, and 97.29% on the test set.
3.3 XGBoost
XGBoost (Extreme Gradient Boosting) is an algorithm based on gradient boosting; however, it adds major improvements in algorithmic optimization and exploits software and hardware efficiently, achieving outstanding results in both training time and memory usage.
The first step is to identify a training set with any number of input features and y as the target variable, as well as a differentiable loss function L(y, F(x)). A loss function simply compares the actual value of the target variable with the predicted value.
Next, initialize the XGBoost model with a constant value:
$$\hat{f}_0(x) = \underset{\theta}{\arg\min} \sum_{i=1}^{N} L(y_i, \theta) \qquad (3.3)$$
Here, θ is an arbitrary value (its exact value does not matter) that serves as the first estimate for the algorithm. The estimation error with θ will initially be very large, but it becomes smaller and smaller with each additive iteration of the algorithm.
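As a concrete illustration (the report does not fix a particular loss function), if the squared-error loss $L(y, \theta) = (y - \theta)^2$ were used, the constant in (3.3) would simply be the mean of the targets, since setting the derivative $-2\sum_{i=1}^{N}(y_i - \theta)$ to zero gives $\theta = \bar{y}$:

$$\hat{f}_0(x) = \underset{\theta}{\arg\min} \sum_{i=1}^{N} (y_i - \theta)^2 = \frac{1}{N}\sum_{i=1}^{N} y_i$$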
Then, for all m ∈ {1, 2, . . . , M}, compute the gradients and Hessians (second derivatives) of the loss with respect to the current predictions:
$$g_i = \frac{\partial L\big(y_i, \hat{y}_i^{(m-1)}\big)}{\partial \hat{y}_i^{(m-1)}} \qquad (3.4)$$

$$h_i = \frac{\partial^2 L\big(y_i, \hat{y}_i^{(m-1)}\big)}{\partial \big(\hat{y}_i^{(m-1)}\big)^2} \qquad (3.5)$$

where $\hat{y}_i^{(m-1)} = \hat{f}^{(m-1)}(x_i)$ is the prediction of the model from the previous iteration.
This process is repeated for every single weak learner, where each m ∈ {1, 2, . . . , M }.
Weak learners are utilized iteratively to improve the model’s accuracy by minimizing
the loss function. The final output of the XGBoost algorithm can then be expressed
as the sum of each individual weak learner’s predictions:
$$\hat{f}(x) = \sum_{m=0}^{M} \hat{f}_m(x) \qquad (3.9)$$
The initial accuracy on the test set of XGBoost (with default values for parameters)
is 97.09%. After experimenting with tuning parameters such as max_depth, learning_rate,
n_estimators, gamma, and subsample, we observed that learning_rate, max_depth,
and subsample have the most significant impacts on the accuracy of the XGBoost
classifier. Narrowing down the scope and range of these parameters, we employed
GridSearch with 10-fold cross-validation to find the best combination.
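A minimal sketch of the kind of grid search described above, using the xgboost library's scikit-learn wrapper (the grid values shown are assumptions, not the exact ranges used in the report; X_train, y_train are assumed as before):

```python
# Hypothetical parameter grid for the tuning experiment described above.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [3, 6, 9],
    "subsample": [0.6, 0.8, 1.0],
}

search = GridSearchCV(
    XGBClassifier(n_estimators=200),
    param_grid,
    cv=10,                 # 10-fold cross-validation, as in the report
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```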
Figure 3.8: subsample from 0.1 to 1
4. Ensemble Decision Trees:
• Repeat steps 1-3 to create multiple decision trees. Let K be the number of trees in the forest. For classification, the output for a given input x is the majority vote of the K trees' predictions (see the sketch after this list).
The tuned model uses the following hyperparameters:
• max_leaf_nodes: 10000
• n_estimators: 200
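A minimal sketch of fitting a forest with these settings in scikit-learn (X_train, y_train, X_test, y_test are the assumed TF-IDF features and labels, as in the earlier sketches):

```python
# Hypothetical illustration of the configuration listed above.
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=200,        # K = 200 trees in the forest
    max_leaf_nodes=10000,    # cap on leaves per tree, as listed above
    n_jobs=-1,
    random_state=42,
)
rf_clf.fit(X_train, y_train)
print(rf_clf.score(X_test, y_test))   # accuracy on the assumed held-out split
```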
Figure 3.9: classification report
The Random Forest model with n_estimators = 200 achieves 99.98% accuracy on the train set, and 97.67% on the test set.
3.4.4 Conclusion
The Random Forest model demonstrates high performance on the given dataset,
achieving excellent accuracy and precision-recall balance. The optimized model
with tuned hyperparameters ensures robustness and generalization to unseen data.
Figure 3.10: validation curve
CHAPTER 4. EVALUATION
Figure 4.3: Confusion Matrix of SVM
When evaluating the confusion matrices of the four models (SVM, Logistic Regression, Random Forest, and XGBoost) for the email spam classifier, we can see that:
All four models have a certain number of misclassified instances, represented by
the false positive and false negative values in the confusion matrices. However, the
degree of these misclassifications varies across the models. The Random Forest
model has a higher false positive rate, meaning it incorrectly predicted many
negative instances as positive. On the other hand, the XGBoost model has a
higher false negative rate, missing many actual positive instances.
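A minimal sketch of how such confusion matrices can be computed and compared with scikit-learn (the fitted model objects here are assumptions standing in for the four trained classifiers):

```python
# Hypothetical comparison; each *_clf is assumed to be an already-fitted model
# and X_test, y_test the held-out test split.
from sklearn.metrics import confusion_matrix

models = {
    "SVM": svm_clf,
    "Logistic Regression": lr_clf,
    "Random Forest": rf_clf,
    "XGBoost": xgb_clf,
}

for name, clf in models.items():
    # Rows are true classes, columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]] with the positive (spam) class in the second row/column.
    cm = confusion_matrix(y_test, clf.predict(X_test))
    print(name)
    print(cm)
```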
CHAPTER 5. CONCLUSION