A Computational Study On Classification of Malignant
October 2024
Abstract
This study applies linear algebra and optimization techniques to the prediction of
breast cancer using the UCI breast cancer dataset. K-means clustering is used to
explore data patterns, with the cluster assignments compared to the actual benign
and malignant labels. Misclassified points are examined to understand their feature
ranges and implications for diagnostic accuracy. These findings are then compared
with traditional rule-based medical approaches.
The next stage involves developing classification models, starting with linear regres-
sion, followed by logistic regression using matrix operations and a sigmoid function
for binary classification. A support vector machine (SVM) is then formulated as a
convex optimization problem, with support vectors identified through eigenvalue
decomposition and solved using MATLAB’s CVX solvers. Model performance is eval-
uated and compared with the clustering results and medical standards.
The study highlights how linear algebra and optimization can be used to improve
classification models in medical diagnostics, offering insights into the alignment of
data-driven methods with established diagnostic rules.
Contents
1 Introduction 1
2 Methodology 3
2.1 Dataset Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Basic Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Class Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.4 Multicollinearity of feature . . . . . . . . . . . . . . . . . . . 8
2.2.5 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.6 IQR-Based Approach for Outlier Detection . . . . . . . . . . . 10
2.2.7 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Terminologies Used in Machine Learning Model Development . . . . 17
2.7 Logistic Regression Classifier . . . . . . . . . . . . . . . . . . . . . . . 18
2.8 Support Vector Machine (SVM) Classifier . . . . . . . . . . . . . . . . 24
2.8.1 Advantages of Convex Optimization in SVM . . . . . . . . . . 26
2.9 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9.1 Lagrangian Formulation of Decision Trees . . . . . . . . . . . 33
2.10 Built-in Machine Learning Functions in MATLAB . . . . . . . . . . . . 34
2.10.1 Logistic Regression using fitglm . . . . . . . . . . . . . . . . 34
2.10.2 Support Vector Machines using fitcsvm . . . . . . . . . . . . 35
2.10.3 Decision Trees using fitctree . . . . . . . . . . . . . . . . . 35
2.10.4 Comparative Evaluation and Tuning of Models . . . . . . . . 36
2.11 Developing an Explainable Tumor Classification Model . . . . . . . . 36
2.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Conclusion 51
Chapter 1
Introduction
Breast cancer remains a significant public health challenge, highlighting the need for
effective diagnostic tools to enhance early detection and treatment. The advent of
machine learning has provided innovative methods for analyzing complex datasets,
offering promising avenues for improving classification tasks in the medical field.
This project focuses on the development of a classification model utilizing the UCI
Breast Cancer Wisconsin dataset, which contains a diverse array of features pertinent
to breast cancer diagnosis.
The primary aim of this research is to explore the inherent patterns within the
dataset through a systematic analysis rooted in the principles of linear algebra and
optimization. Initially, the dataset will be structured as a matrix, enabling various
preprocessing steps such as data cleaning, outlier detection, and feature scaling.
Outliers will be identified using the interquartile range (IQR) method, while visu-
alizations will be employed to present a comprehensive summary of the numerical
features.
Following this preprocessing phase, K-means clustering will be implemented to gain
insights into the data’s structure. This approach will facilitate a comparative analysis
between the actual target labels and the clusters generated, revealing any misclas-
sified data points within the clusters and examining their distribution across the
feature space. Such analyses aim to provide valuable inferences that connect tradi-
tional rule-based models used in medical diagnostics with the findings from cluster-
ing techniques.
Subsequent to the exploratory analysis, the focus will shift to the development of
classification models. The initial step will involve constructing a linear regression
model, transitioning to logistic regression through matrix operations. The applica-
tion of the sigmoid function will be critical in implementing binary classification
techniques. Building upon this foundation, support vector machines (SVM) will
be utilized, employing a linear programming approach to identify support vectors
through the eigenvalue decomposition of the dual Lagrangian function. The opti-
mization of the SVM will be executed using CVX solvers, ensuring a robust solution
to the linear programming problem.
This project aspires to construct a reliable classification model while fostering a
deeper understanding of the relationships among various features in the dataset. By
adopting a methodical approach grounded in linear algebra and optimization, this
research seeks to explore the potential for improved diagnostic accuracy in breast
cancer detection, ultimately contributing to advancements in medical analytics and
patient care.
Chapter 2
Methodology
This chapter delineates the methodological framework employed in this project, fo-
cusing on the UCI Breast Cancer Wisconsin dataset as the basis for model develop-
ment. Initially, an exploration of the dataset was conducted to understand its struc-
ture and key characteristics. Essential statistical analyses were performed to sum-
marize the data, including identifying outliers and assessing feature relationships.
Following this, data preprocessing techniques were applied to ensure the dataset’s
quality, including normalization and outlier detection. Subsequently, a series of clas-
sification models—namely linear regression, logistic regression, and support vector
machines—were implemented using matrix operations grounded in linear algebra
and optimization principles. This systematic approach not only facilitated the devel-
opment of robust predictive models but also enhanced the overall understanding of
the data’s patterns and distributions.
2.1 Dataset Summary
The UCI Breast Cancer Wisconsin (Diagnostic) dataset contains 569 instances, each described by 30 numerical features computed from digitized fine needle aspirate (FNA) images of breast masses.
• Diagnosis (Target Variable): Each instance is labeled as either
– M: Malignant (cancerous)
– B: Benign (non-cancerous)
• Features: Measurements of the cell nuclei, including
– radius mean, texture mean, perimeter mean, area mean, smoothness mean, and others.
Each feature is calculated based on the mean, standard error, or worst (largest)
value, offering a comprehensive view of the tumor’s characteristics.
• Data Format: The dataset is provided in CSV format, making it easily accessi-
ble for data analysis and modeling tasks.
The primary objective of this dataset is to aid in the development of machine learn-
ing models capable of accurately predicting the diagnosis of breast cancer based on
the provided features. This dataset has become a benchmark for evaluating vari-
ous classification algorithms and serves as an essential resource for researchers and
practitioners in the field of medical diagnostics.
Class Count
Benign (B) 357
Malignant (M) 212
Table 2.1: Class distribution of the UCI Breast Cancer Wisconsin dataset.
The class distribution table shows 357 instances classified as benign (B) and 212
as malignant (M) in the UCI Breast Cancer Wisconsin dataset. This results in an
approximate distribution of 62.7% benign and 37.3% malignant instances.
The observed class imbalance may influence the performance of classification al-
gorithms, potentially leading to a bias towards the benign class. Therefore, it is
essential to implement strategies during model development to ensure reliable iden-
tification of malignant cases. The lower representation of malignant instances neces-
sitates the use of effective feature extraction and validation techniques to enhance
predictive accuracy.
Fine needle aspiration (FNA) collects cell samples from suspicious breast lesions using a thin needle under imaging guidance. This procedure provides rapid diagnoses and high-quality samples, facilitating the extraction of critical features for differentiating between benign and malignant cells, thereby enhancing cancer prediction models.
The summary statistics table provides an overview of the numerical features ex-
tracted from the dataset.
- Mean values indicate the average size and characteristics of the breast tumors,
with the highest average observed for the area feature at 654.89. The perimeter and
radius also show considerable averages of 91.97 and 14.13, respectively, reflecting
the size and dimensions of the tumors.
- Median values closely follow the means, confirming the general distribution with-
out extreme outliers. For instance, the median radius is 13.37, suggesting that half
the tumors have a radius smaller than this value.
- Interquartile range (IQR) highlights the variability, with the area feature exhibit-
ing the highest IQR of 363.98, indicating substantial differences in tumor sizes.
- The minimum and maximum values indicate the range of each feature. For in-
stance, the area varies from 143.50 to 2501.00, demonstrating significant size di-
versity among the tumors.
- The standard deviation (SD) values reflect the spread of the data points around
the mean, with features such as perimeter (SD = 24.30) and area (SD = 351.91) ex-
hibiting higher variability compared to features like smoothness mean (SD = 0.01).
The Pearson correlation coefficient was computed for pairs of numerical variables
to identify strong correlations. A correlation matrix was created to visualize rela-
tionships between features, guiding feature selection for classification. Figure 2.1
illustrates the correlation between the 30 features in the dataset.
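A minimal MATLAB sketch of this step is given below; the variable names X (569-by-30 numeric feature matrix), y (numeric 0/1 diagnosis vector, 1 = malignant), and names (cell array of 30 feature names) are assumptions, not the project's original variables.

% Sketch: feature correlation matrix and feature-versus-target correlation.
R = corr(X);                          % 30-by-30 Pearson correlation matrix
figure; heatmap(names, names, R);     % heat map as in Figure 2.1
[rho, pval] = corr(X, y);             % per-feature correlation with the diagnosis
[~, order] = sort(abs(rho), 'descend');
disp(names(order(1:5)));              % five features most correlated with the target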
Figure 2.1: Correlation matrix (heat map) of the 30 features in the dataset; the color scale ranges from −0.2 to 1.
In order to identify the most significant features contributing to breast cancer di-
agnosis, a correlation analysis was conducted between each feature and the target
variable (diagnosis). This statistical method allows us to quantify the strength of
the linear relationship between the input features and the target class, which in
this case helps to identify the features most strongly associated with distinguishing
benign from malignant tumors.
The table below presents the correlation coefficients for each feature:
Feature Correlation
radius mean 0.73
texture mean 0.42
perimeter mean 0.74
area mean 0.71
smoothness mean 0.36
compactness mean 0.60
concavity mean 0.70
concave points mean 0.78
symmetry mean 0.33
fractal dimension mean -0.01
radius se 0.57
texture se -0.01
perimeter se 0.56
area se 0.55
smoothness se -0.07
compactness se 0.29
concavity se 0.25
concave points se 0.41
symmetry se -0.01
fractal dimension se 0.08
radius worst 0.78
texture worst 0.46
perimeter worst 0.78
area worst 0.73
smoothness worst 0.42
compactness worst 0.59
concavity worst 0.66
concave points worst 0.79
symmetry worst 0.42
fractal dimension worst 0.32
Based on the correlation analysis, the three features most strongly correlated with
the target variable (diagnosis) are ’concave points worst’, ’perimeter worst’, and
’concave points mean’. These features have correlation values of 0.79, 0.78, and
0.78, respectively. The high correlation values indicate a strong positive relation-
ship with the diagnosis, suggesting that larger values of these features are typically
associated with malignant tumors.
Statistical interpretations of these results highlight that shape-related features, par-
ticularly those associated with concave points and perimeter, are critical in differen-
tiating between benign and malignant tumors. These findings provide strong justifi-
cation for focusing on these features in predictive modeling efforts. Visualizing the
distribution of these features across benign and malignant classes further demon-
strates their significance in separating the two groups. Distribution of these three
features is shown in Figure 2.2.
Figure 2.2: Distribution of the top three highly correlated features (concave points worst, perimeter worst, and concave points mean) across the malignant and benign classes.
IQR = Q3 − Q1
The first quartile (Q1) represents the 25th percentile of the data, while the third
quartile (Q3) represents the 75th percentile. The IQR is particularly useful for iden-
tifying outliers in datasets, as it is resistant to extreme values (unlike the standard
deviation). Outliers are typically defined as data points that fall below Q1−1.5×IQR
or above Q3 + 1.5×IQR. These thresholds, often referred to as "fences," capture most
of the central data, and any points outside this range are considered outliers.
The IQR-based approach is widely used in datasets where the distribution is skewed
or does not follow a normal distribution, making it more robust compared to the
z-score method. The use of 1.5 times the IQR to determine outliers is a common rule
of thumb, although this factor can be adjusted based on the specific characteristics of
the data. By identifying and potentially removing or investigating these outliers, it is
possible to improve the accuracy and performance of statistical models and reduce
bias introduced by extreme values.
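This rule translates directly into a few lines of MATLAB; the sketch below assumes a single feature column x (n-by-1) and is illustrative rather than the project's original script.

% Sketch: IQR-based outlier flagging for one feature column x.
Q1 = quantile(x, 0.25);                 % first quartile
Q3 = quantile(x, 0.75);                 % third quartile
iqrVal = Q3 - Q1;
lowerFence = Q1 - 1.5 * iqrVal;
upperFence = Q3 + 1.5 * iqrVal;
isOutlier = (x < lowerFence) | (x > upperFence);
fprintf('%d outliers detected\n', sum(isOutlier));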
Table 2.5: Outlier Status Before and After Square Root Transformation
Figure 2.3: Box plot showing the distribution of area se, radius se, area mean, area worst, and perimeter se post-square root transformation, highlighting potential outliers.
Incorporating these features into logistic regression models can potentially improve classification
outcomes, aiding in timely and accurate patient treatment decisions.
Feature ρ P-Value
Concave Points Worst 0.79 1.97 × 10−124
Area Worst 0.78 1.23 × 10−119
Perimeter Worst 0.78 5.77 × 10−119
Concave Points Mean 0.78 7.10 × 10−116
Radius Worst 0.78 8.48 × 10−116
Perimeter Mean 0.74 8.44 × 10−101
Area Mean 0.73 3.45 × 10−97
Radius Mean 0.73 8.47 × 10−96
Area SE 0.71 9.52 × 10−89
Concavity Mean 0.70 9.97 × 10−84
Concavity Worst 0.66 2.46 × 10−72
Perimeter SE 0.63 3.20 × 10−64
Radius SE 0.63 1.77 × 10−63
Compactness Mean 0.60 3.94 × 10−56
Compactness Worst 0.59 7.07 × 10−55
Texture Worst 0.46 1.08 × 10−30
Smoothness Worst 0.42 6.58 × 10−26
Symmetry Worst 0.42 2.95 × 10−25
Texture Mean 0.42 4.06 × 10−25
Concave Points SE 0.41 3.07 × 10−24
Smoothness Mean 0.36 1.05 × 10−18
Symmetry Mean 0.33 5.73 × 10−16
Fractal Dimension Worst 0.32 2.47 × 10−15
Compactness SE 0.29 9.98 × 10−13
Concavity SE 0.25 8.26 × 10−10
Fractal Dimension SE 0.08 0.063
Smoothness SE -0.07 0.110
Fractal Dimension Mean -0.01 0.760
Texture SE -0.01 0.843
Symmetry SE -0.01 0.877
Chi-square tests are commonly used for evaluating the independence between cat-
egorical variables or for assessing goodness of fit between observed and expected
distributions. In particular, testing independence can help determine whether two
or more variables are dependent across populations, allowing one to estimate the
other. However, when applying Chi-square tests to this dataset, the results were in-
conclusive, likely due to the continuous nature of the transformed variables and the
limitations of the Chi-square method in handling such data. These experiments con-
sistently produced unreliable test statistics and p-values, highlighting the inadequacy
of Chi-square for this feature selection task.
Table 2.7: Selected features with correlation coefficients and their target association
Random Forests, which combine many decision trees through ensemble learning, improve generalization and typically yield better accuracy than single decision trees.
In this project, these algorithms can be used to model the relationship between se-
lected features and the target variable, offering robust performance across a range
of classification problems.
A general structure of a Machine Learning Classification process is shown in Figure
2.4.
The selection of these algorithms for the current project is based on their ability to
handle both linear and non-linear relationships within the dataset. Logistic Regres-
sion offers a simple yet powerful baseline, while SVMs can efficiently manage more
complex patterns. Decision Trees provide interpretability, allowing for easier under-
standing of feature importance, and Random Forests enhance model performance
through ensemble learning, reducing the risk of overfitting. These algorithms col-
lectively offer a robust toolkit for accurately classifying the data and handling the
nuances of feature interaction in the project.
The image analysis of each digitized FNA sample, which typically takes two to five minutes, allows for the extraction of ten differ-
ent features from the segmented nuclei. These features, including radius, perime-
ter, area, compactness, smoothness, concavity, symmetry, and fractal dimension, are
then used to train a classifier. The classifier employs a variation of the Multi-surface
Method (MSM) to separate data points into benign and malignant sets. This method
involves constructing separating planes in the feature space to minimize misclassifi-
cations. Testing with a set of 569 images demonstrated a high level of accuracy. The
system achieved an accuracy of 97% in distinguishing between benign and malig-
nant tumors when using a set of three features: worst area, worst smoothness, and
mean texture. Moreover, the system achieved an accuracy of 80% in predicting the
distant recurrence of malignancy in patients. This study demonstrates the potential
of using nuclear features extracted from FNAs and machine learning techniques to
accurately diagnose breast cancer.
Khairunnahar et al. mention the use of Decision Tree methods, specifically the C4.5
algorithm, which attained an accuracy of 94.74% [1]. Decision trees are intuitive
models that use a tree-like structure to represent decisions and their possible con-
sequences. They work by recursively partitioning the data based on feature values
until a classification can be made. Another approach discussed is the Rule Induction
Algorithm based on approximate classification, achieving an accuracy of 94.99%.
This method generates a set of rules from the training data that can be used to clas-
sify new instances. The rules are typically expressed in the form of “if-then” state-
ments. Combining Linear Discriminant Analysis (LDA) with Neural Networks (NN)
is yet another method explored in the source. This combined approach reached an
impressive accuracy of 96.8%. LDA seeks to find a linear combination of features
that maximizes the separation between classes, while neural networks are powerful
models inspired by the structure of the human brain that can learn complex non-
linear relationships in the data. Support Vector Machines (SVM) also stand out as a
successful method for breast cancer classification, achieving an accuracy of 97.2%.
SVMs work by finding the optimal hyperplane that maximizes the margin between
different classes in the feature space. Moving towards more advanced techniques,
the sources discuss feed-forward neural networks with rule extraction, yielding an
accuracy of 98.10%. These models combine the power of neural networks with the
interpretability of rule-based systems. The extracted rules provide insights into the
decision-making process of the neural network. Neuro-fuzzy techniques blend fuzzy
logic with neural networks, offering a way to handle uncertainty and imprecision in
the data. This approach achieved an accuracy of 95.06%. Another hybrid method
combines autoregressive models (AR) with neural networks (NN), attaining a clas-
sification accuracy of 97.4% for breast cancer diagnosis. Autoregressive models are
used to model time series data, where the current value depends on previous val-
ues. The sources also discuss various Learning Vector Quantization (LVQ) methods
(including LVQ, Big LVQ, and AIRS) applied to breast cancer detection, achieving
correction classification rates ranging from 96.7% to 97.2%. LVQ algorithms are a
type of competitive learning neural network where the network learns to classify
input vectors by adjusting the positions of prototype vectors in the feature space.
Further techniques include Supervised Fuzzy Clustering, which achieved an accuracy of 95.57% for breast cancer detection, and the Mixture Experts (ME) network
structure, which achieved a correct classification rate of 98.85% for breast cancer
diagnosis. The sources emphasize that the field of machine learning continues to
evolve rapidly, and new techniques are constantly being developed to improve breast
cancer detection and diagnosis. They also highlight the importance of carefully se-
lecting and extracting relevant features from the data to achieve optimal perfor-
mance. The wide range of approaches discussed underscores the ongoing research
and development in this critical area.
2.6 Terminologies Used in Machine Learning Model Development
Train-Test Split
Train-test split is a technique used to evaluate the performance of a machine learning
model. The dataset is divided into two subsets: the training set, which is used to
train the model, and the test set, which is used to assess the model’s performance
on unseen data. A common split ratio is 80:20, where 80% of the data is used for
training and 20% for testing.
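As an illustration (with assumed variable names X and y, not the report's own code), an 80:20 split can be produced with cvpartition, which stratifies by class and therefore respects the benign/malignant imbalance noted earlier.

% Sketch: stratified 80:20 train-test split.
rng(1);                                % fix the random seed for reproducibility
c = cvpartition(y, 'HoldOut', 0.20);   % stratified by the class labels in y
Xtrain = X(training(c), :);  ytrain = y(training(c));
Xtest  = X(test(c), :);      ytest  = y(test(c));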
Iterations (T )
Iterations refer to the number of times the gradient descent algorithm updates the
weights and bias. More iterations can improve model performance, but excessively
high values may lead to overfitting or unnecessary computation.
Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function by
iteratively adjusting the model parameters (weights and bias) in the direction of the
negative gradient of the cost function. This process continues until convergence is
achieved or a predetermined number of iterations is reached.
K-Fold Cross-Validation
K-fold cross-validation is a technique used to assess the performance of a model by
splitting the training data into k subsets (folds). The model is trained on k − 1 folds
and validated on the remaining fold. This process is repeated k times, with each fold
used as the validation set once. The results are then averaged to provide a more
reliable estimate of model performance.
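A sketch of the corresponding loop is shown below; trainAndScore is a hypothetical helper (not defined in the report) standing in for whichever classifier is being validated.

% Sketch: k-fold cross-validation (k = 5) over the training set.
k  = 5;
cv = cvpartition(ytrain, 'KFold', k);
acc = zeros(k, 1);
for i = 1:k
    trIdx = training(cv, i);           % indices of the k-1 training folds
    vaIdx = test(cv, i);               % indices of the validation fold
    acc(i) = trainAndScore(Xtrain(trIdx, :), ytrain(trIdx), ...
                           Xtrain(vaIdx, :), ytrain(vaIdx));
end
fprintf('Mean cross-validation accuracy: %.3f\n', mean(acc));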
Validation Error
Validation error is the measure of how well a machine learning model performs
on unseen data during the validation phase. It provides insight into the model’s
generalization capability and is crucial for detecting overfitting.
Performance Metrics
Performance metrics are quantitative measures used to evaluate the effectiveness of
a machine learning model. Common metrics for classification tasks include accuracy,
precision, recall, and F1-score, which provide insights into the model’s predictive
capabilities.
In assessing the skill of a logistic regression classifier, several performance measures
are crucial for a comprehensive evaluation. Accuracy reflects the overall correctness
of the model’s predictions, but it can be misleading in imbalanced datasets. Sensi-
tivity (or Recall) measures the model’s ability to correctly identify positive instances,
making it essential in situations where missing positive cases is costly (e.g., detect-
ing diseases). Specificity assesses the ability to correctly classify negative instances,
which is important in avoiding false positives. The Area Under the ROC Curve
(AUC-ROC) provides a more holistic measure by summarizing the trade-off between
sensitivity and specificity across different thresholds. A higher AUC indicates that
the model performs well in distinguishing between positive and negative classes.
Together, these metrics provide insights into the model’s strengths and weaknesses,
helping assess how well it generalizes and handles different types of classification
errors.
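These metrics can all be computed from a confusion matrix; the sketch below assumes true labels ytest, predicted labels ypred (both 0/1), and predicted probabilities score for the positive (malignant) class.

% Sketch: accuracy, sensitivity, specificity, F1-score, and AUC.
C  = confusionmat(ytest, ypred);       % rows = true class (0 then 1), columns = predicted
TN = C(1,1); FP = C(1,2); FN = C(2,1); TP = C(2,2);
accuracy    = (TP + TN) / sum(C(:));
sensitivity = TP / (TP + FN);          % recall for the malignant class
specificity = TN / (TN + FP);
precision   = TP / (TP + FP);
F1          = 2 * precision * sensitivity / (precision + sensitivity);
[~, ~, ~, AUC] = perfcurve(ytest, score, 1);   % area under the ROC curve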
The linear regression model takes the form:

y = Xβ + ϵ
Here, X ∈ Rn×p is the matrix of feature vectors (with n samples and p features),
β ∈ Rp is the vector of model parameters (coefficients), and ϵ denotes the error
term. For a new observation xi , the predicted output is:
ŷᵢ = xᵢᵀβ

To map this unbounded output to a probability, logistic regression applies the sigmoid (logistic) function:

σ(z) = 1 / (1 + e^(−z))
This function ensures that the output is constrained between 0 and 1, reflecting the
probability of the outcome being 1.
The odds of an event occurring are defined as the ratio of the probability of the event to the probability of the event not occurring:

odds = P(yᵢ = 1|xᵢ) / (1 − P(yᵢ = 1|xᵢ))
Taking the natural logarithm of the odds yields the log-odds or logit:
log [ P(yᵢ = 1|xᵢ) / (1 − P(yᵢ = 1|xᵢ)) ] = xᵢᵀβ
Thus, logistic regression models the log-odds of the probability of a binary outcome
as a linear function of the input features.
ŷ = σ(Xβ)
where X ∈ Rn×p is the matrix of feature vectors, β ∈ Rp is the parameter vector, and
ŷ ∈ [0, 1]n represents the predicted probabilities.
Optimization Problem
Matrix Formulation
In matrix form, if y is the vector of outcomes and X is the design matrix of features, the negative log-likelihood can be expressed as:

ℓ(β) = −[ yᵀ log σ(Xβ) + (1 − y)ᵀ log(1 − σ(Xβ)) ]

and its gradient with respect to β is:

∇β ℓ(β) = Xᵀ (σ(Xβ) − y)
Unlike linear regression, logistic regression lacks a closed-form solution due to the
non-linearity introduced by the sigmoid function. Therefore, iterative methods such
as gradient descent, stochastic gradient descent, or the Newton-Raphson method
(known as Iteratively Reweighted Least Squares (IRLS) in logistic regression) are
employed for parameter estimation.
1. Gradient Descent updates the parameters using β ← β − α ∇β ℓ(β), where α is the learning rate.
Mathematical Representation
Logistic regression models the probability that the dependent variable y equals 1 (the
positive class) given a set of independent variables x. The model can be expressed
as:
P (yi = 1 | xi ) = σ(wT xi + b)
where:
σ(z) = 1 / (1 + e^(−z))
is the logistic (sigmoid) function.
Separating Hyperplane
The decision boundary, or separating plane, is where the probability is exactly 0.5.
Therefore, we set the probability equal to 0.5:
σ(wT x + b) = 0.5
To find this boundary, we can simplify this equation:
The logistic function equals 0.5 when its argument is zero:
wT x + b = 0
Rearranging gives us the equation of the hyperplane:
wT x = −b
Algorithm 1 Logistic Regression with Train-Test Split and K-Fold Cross Validation
1: Input: Dataset D = {(x^(i), y^(i))}_{i=1}^m, learning rate α, number of iterations T, number of folds k
2: Output: Trained model parameters w, b
3: Step 1: Train-Test Split
4: Split the dataset D into training set Dtrain and test set Dtest with ratio 80:20.
5: Let Xtrain , ytrain be the training features and labels.
6: Let Xtest , ytest be the testing features and labels.
7: Step 2: Initialize weights w = 0 and bias b = 0
8: Step 3: Gradient Descent on Logistic Regression
9: for each iteration t = 1, 2, . . . , T do
10: Compute the linear combination: z^(i) = wᵀx^(i) + b
11: Apply the sigmoid function: h_θ(x^(i)) = 1 / (1 + e^(−z^(i)))
12: Calculate gradients:
    ∂J(w, b)/∂w_j = (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x_j^(i)
    ∂J(w, b)/∂b = (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
13: Update the parameters:
    w_j = w_j − α · ∂J(w, b)/∂w_j,    b = b − α · ∂J(w, b)/∂b
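A vectorized MATLAB sketch of Algorithm 1's gradient descent loop follows; variable names are assumptions rather than the report's code, and the k-fold validation step is omitted for brevity.

% Sketch: logistic regression trained by batch gradient descent.
[m, p] = size(Xtrain);
w = zeros(p, 1);  b = 0;               % Step 2: initialize parameters
alpha = 0.01;  T = 5000;               % learning rate and iteration count
for t = 1:T
    z  = Xtrain * w + b;               % linear combination
    h  = 1 ./ (1 + exp(-z));           % sigmoid activation
    dw = (Xtrain' * (h - ytrain)) / m; % gradient with respect to the weights
    db = sum(h - ytrain) / m;          % gradient with respect to the bias
    w  = w - alpha * dw;
    b  = b - alpha * db;
end
probTest = 1 ./ (1 + exp(-(Xtest * w + b)));
ypred    = double(probTest >= 0.5);    % classify using a 0.5 threshold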
2.8 Support Vector Machine (SVM) Classifier

The hard margin SVM finds the maximum-margin separating hyperplane by solving the convex optimization problem:

min_{w,b} (1/2) wᵀw

subject to:

yᵢ(wᵀxᵢ + b) ≥ 1, ∀i = 1, 2, . . . , n

where xᵢ ∈ Rᵈ are the input vectors, yᵢ ∈ {−1, 1} are the class labels, w ∈ Rᵈ is the weight vector, and b ∈ R is the bias.
This optimization problem can also be written in matrix form:
min_{w,b} (1/2) wᵀw

subject to:

Y(Xw + b1) ≥ 1
where X ∈ Rn×d is the matrix of input vectors, Y ∈ Rn×n is the diagonal matrix of
labels, and 1 ∈ Rn is the vector of ones.
Setting the derivatives of the Lagrangian L with respect to w and b to zero gives:
• ∂L/∂w = w − Σ_{i=1}^n λᵢ yᵢ xᵢ = 0
• ∂L/∂b = − Σ_{i=1}^n λᵢ yᵢ = 0
• λᵢ [yᵢ(wᵀxᵢ + b) − 1] = 0, λᵢ ≥ 0 (complementary slackness)
From the first condition, we derive:
w = Σ_{i=1}^n λᵢ yᵢ xᵢ
Decision Function
The decision function for a new input x is given by:
f(x) = wᵀx + b = Σ_{i=1}^n λᵢ yᵢ xᵢᵀx + b
• Global Optimum: Convex problems guarantee that any local minimum is the
global minimum. This is crucial in SVM, ensuring that the optimal separating
hyperplane is found without getting trapped in local minima.
• Kernel Methods: The dual formulation of SVM allows the use of kernel func-
tions, enabling classification in high-dimensional spaces without explicitly com-
puting the coordinates. Convex optimization helps solve these non-linear prob-
lems efficiently.
• Efficiency with Kernels: The dual formulation allows the use of kernel func-
tions, which enable SVM to handle non-linearly separable data by implicitly
mapping data points to higher-dimensional spaces.
• Regularization: In the dual form of soft margin SVM, the regularization pa-
rameter C is naturally incorporated as an upper bound on the Lagrange mul-
tipliers. This helps balance the trade-off between maximizing the margin and
minimizing the classification error.
The following code solves the dual problem for a hard margin SVM using CVX:
% Inputs:
% K: Kernel matrix (n x n) where K(i,j) = K(x_i, x_j)
% y: Labels vector (n x 1), y_i in {-1, +1}
% n: Number of data points
cvx_begin
variable lambda(n)
minimize( 0.5 * quad_form(lambda .* y, K) - sum(lambda) )
subject to
sum(lambda .* y) == 0
lambda >= 0
cvx_end
The following code solves the dual problem for a soft margin SVM using CVX:
% Inputs:
% K: Kernel matrix (n x n)
% y: Labels vector (n x 1)
% C: Regularization parameter
cvx_begin
variable lambda(n)
minimize( 0.5 * quad_form(lambda .* y, K) - sum(lambda) )
subject to
sum(lambda .* y) == 0
0 <= lambda <= C
cvx_end
After solving the dual problem, the optimal Lagrange multipliers λ can be interpreted
as follows:
w = Σ_{i=1}^n λᵢ yᵢ xᵢ
The bias term b can be computed using any support vector with 0 < λi < C as:
b = yᵢ − Σ_{j=1}^n λⱼ yⱼ K(xⱼ, xᵢ)
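After cvx_end, these quantities can be recovered in a few lines; this sketch assumes a linear kernel (K = X*X') and the variable names used in the CVX programs above.

% Sketch: recovering w, b, and the decision function from the dual solution.
tol = 1e-5;
sv  = find(lambda > tol & lambda < C - tol);   % margin support vectors
w   = X' * (lambda .* y);                      % w = sum_i lambda_i y_i x_i
b   = mean(y(sv) - X(sv, :) * w);              % average the bias over support vectors
f   = sign(X * w + b);                         % predicted labels in {-1, +1}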
9: Subject to constraints:
    Σ_{i=1}^n λᵢ yᵢ = 0,    0 ≤ λᵢ ≤ C
w ← Σ_{i=1}^n λᵢ yᵢ xᵢ
(For a kernel SVM, use the implicit kernel representation for w.)
14: Step 5: Compute the Bias Term
15: Choose any support vector xk where 0 < λk < C and compute the bias:
b ← y_k − Σ_{i=1}^n λᵢ yᵢ K(xᵢ, x_k)
2.9 Decision Tree Classifier
Splitting Criteria
The core of building a decision tree lies in selecting the best feature to split the data
at each node. The goal is to maximize the information gain or minimize the impurity
after the split.
Information Gain
Information Gain (IG) measures the reduction in entropy after a dataset is split on
an attribute. The entropy H(D) of a dataset D is defined as:
H(D) = − Σ_c P(c|D) log₂ P(c|D)

For a split on feature A, the Information Gain is IG(D, A) = H(D) − Σ_v P(v|D) H(D_v), where D_v is the subset of D for which A takes value v. The attribute that yields the highest Information Gain is selected for the split.
Gini Impurity
Alternatively, Gini Impurity can be used as a splitting criterion. The Gini Impurity
G(D) of a dataset D is defined as:
G(D) = 1 − Σ_c P(c|D)²
For a split on feature A, the Gini Impurity after the split is given by:
G(D|A) = Σ_v P(v|D) G(D_v)
The feature that minimizes the Gini Impurity is chosen for the split.
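Both impurity measures are straightforward to compute for the labels reaching a node; the two helper functions below are illustrative sketches, not part of the report's implementation.

% Sketch: entropy and Gini impurity of a vector of class labels.
function H = nodeEntropy(labels)
    p = histcounts(categorical(labels)) / numel(labels);  % class proportions
    p = p(p > 0);                                          % drop empty classes to avoid log2(0)
    H = -sum(p .* log2(p));
end

function G = nodeGini(labels)
    p = histcounts(categorical(labels)) / numel(labels);
    G = 1 - sum(p.^2);
end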
Recursive Partitioning
The process of constructing a decision tree involves recursively partitioning the data based on the selected features until a stopping criterion is met. Common stopping criteria include reaching a maximum tree depth, having fewer than a minimum number of instances at a node, or obtaining a node whose instances all belong to a single class.
At each leaf node, a prediction is made based on the majority class (for classification) or the average value (for regression) of the instances in that node.
To limit tree size and reduce overfitting, pruning can be applied:
• Pre-pruning: Stop growing the tree when further splits do not significantly
improve the model (e.g., based on a threshold for Information Gain).
• Post-pruning: Grow the full tree and then remove nodes that do not improve
model performance on a validation set.
Decision Rule
Once the decision tree is constructed, the decision rule for predicting a new instance
x can be formulated as follows:
1. Start at the root node and evaluate the feature xⱼ corresponding to the decision.
2. Traverse the tree by following the branches based on the values of xⱼ until a leaf node is reached.
3. The predicted class label ŷ is the label associated with the leaf node.
Mathematically, the prediction can be represented as ŷ = f(x; θ), where θ represents the parameters defining the splits of the tree. While individual
node splits may not yield a convex loss function, the overall objective can be viewed
through an optimization lens.
The convex nature arises in ensemble methods built upon decision trees, such as Gra-
dient Boosting, where loss functions can be designed to be convex. The optimization
framework often involves minimizing Σᵢ L(yᵢ, f(xᵢ; θ)) over θ, where L is a convex loss function, y is the target variable, and f(x; θ) represents the
prediction from the ensemble of trees.
Key properties that make decision trees attractive include:
• Non-parametric: They do not assume any underlying distribution for the data.
• Handling Mixed Data Types: Decision trees can handle both numerical and
categorical data.
• Feature Importance: Decision trees naturally provide insights into the impor-
tance of different features.
Overall, decision trees are a versatile and powerful tool in machine learning, provid-
ing a foundation for more complex ensemble methods such as Random Forests and
Gradient Boosting Machines.
Objective Function
The objective is to minimize the total loss, which can be defined as the sum of the
loss at each node of the tree. A common choice for the loss function is the mean
squared error (MSE) for regression tasks or cross-entropy for classification tasks.
The overall optimization problem can be formulated as:
min_T Σ_{i=1}^n L(yᵢ, ŷᵢ) + λ · R(T)        (2.1)
Where:
• L(yᵢ, ŷᵢ) is the loss incurred on instance i,
• R(T) is a regularization term penalizing the complexity of the tree T, and
• λ is a hyperparameter controlling the trade-off between the loss and the regularization.
Lagrangian Function
In the Lagrangian formulation, the tree-fitting objective is augmented with its constraints (for example, limits on tree complexity), each paired with a Lagrange multiplier λⱼ.
The Lagrange multipliers λj play a critical role in balancing the trade-off between
minimizing the loss function and satisfying the constraints. A positive λj indicates
that the corresponding constraint is active, suggesting that the optimization process
will prioritize satisfying this constraint. As the optimization progresses, the values of
λj adjust, reflecting the importance of each constraint relative to the loss function.
2.10 Built-in Machine Learning Functions in MATLAB
2.10.1 Logistic Regression using fitglm
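A typical invocation of fitglm (a sketch with assumed variable names X, y, and Xtest) is:

% Sketch: logistic regression with fitglm.
mdl   = fitglm(X, y, 'Distribution', 'binomial');   % logistic (logit) link by default
prob  = predict(mdl, Xtest);                         % predicted probabilities for new data
ypred = double(prob >= 0.5);                         % class labels at a 0.5 threshold
ci    = coefCI(mdl);                                 % confidence intervals for the coefficients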
Here, X is the matrix containing the predictor variables, and y is the response vari-
able, which must be a binary outcome. The function fits the model by employing
maximum likelihood estimation under the assumption of a binomial distribution,
thus creating a probability-based classification. The trained model can be used for
predicting new outcomes using the predict method, and the confidence intervals
of the coefficients can be obtained via the coefCI method. This function provides
flexibility for logistic regression problems and supports customization in terms of
link functions and interaction terms, making it robust for exploring the relationship
between predictor variables and binary outcomes.
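2.10.2 Support Vector Machines using fitcsvm

MATLAB's fitcsvm function trains a binary SVM classifier. A representative invocation (a sketch, with assumed variable names rather than the project's own) is:

% Sketch: SVM classification with fitcsvm and cross-validated error.
svmModel = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                   'BoxConstraint', 1, 'Standardize', true);
ypred    = predict(svmModel, Xtest);     % class predictions on new data
cvModel  = crossval(svmModel);           % 10-fold cross-validation by default
genError = kfoldLoss(cvModel);           % estimated generalization error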
In this case, X represents the set of features, and y contains the class labels. The
KernelFunction argument specifies the type of kernel to use, with common options
including ’linear’, ’polynomial’, and ’rbf’ (radial basis function). SVMs are designed
to find the hyperplane that maximally separates the two classes in the feature space.
The choice of kernel function allows the SVM to handle non-linearly separable data
by transforming it into a higher-dimensional space where a linear separation is pos-
sible.
Once the model is trained, predictions can be made using the predict method.
Additionally, the model’s performance can be evaluated using cross-validation by
applying the crossval method, which divides the dataset into training and testing
sets to estimate the generalization error. This method is crucial for assessing the
model’s reliability on unseen data, ensuring that the chosen hyperparameters (such
as the regularization parameter or kernel type) are appropriate for the problem.
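2.10.3 Decision Trees using fitctree

Decision tree classifiers are built with fitctree. A minimal sketch of a typical call (assumed variable names, not the project's original script):

% Sketch: decision tree classification with fitctree.
treeModel = fitctree(X, y);              % split criterion defaults to the Gini diversity index
ypred     = predict(treeModel, Xtest);   % classify new observations
view(treeModel, 'Mode', 'graph');        % inspect the fitted tree structure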
The matrix X contains the predictor variables, while y holds the corresponding class
labels. Decision trees work by recursively partitioning the feature space into regions
that maximize the separation between different classes. The split criterion is often
based on measures such as Gini impurity or information gain. The resulting tree can then be used to classify new observations and inspected to understand which features drive the splits.
2.12 Conclusion
In this study, we revisited the mathematical models of classical machine learning
algorithms, specifically focusing on Logistic Regression, Support Vector Machines
(both hard and soft margin), and Decision Trees. A significant aspect of our approach
involved employing convex optimization techniques to formulate and solve these
models. The convex optimization framework allowed for effective minimization of
the loss functions associated with each algorithm, ensuring global optima in the
training process.
These custom models were utilized to predict the likelihood of breast cancer using
the Breast Cancer dataset from the University of Wisconsin at Madison. By apply-
ing these models, we gained insights into the underlying relationships between the
features and the target variable, which is crucial for understanding the predictive
capabilities of each algorithm in the context of breast cancer diagnosis.
Subsequently, we leveraged built-in MATLAB functions such as fitglm, fitcsvm, and
fitctree to perform the same predictive task.
Chapter 3
Results and Discussions
3.1 Introduction
This chapter presents the results obtained from implementing various classification
algorithms to predict breast cancer using the Breast Cancer dataset from the Uni-
versity of Wisconsin at Madison. The algorithms evaluated in this study include
custom implementations of Logistic Regression, Support Vector Machines (SVM)
with both hard and soft margins, and Decision Trees, all formulated through convex
optimization techniques. Additionally, the performance of built-in MATLAB func-
tions—fitglm, fitcsvm, and fitctree—is also assessed for comparison.
The primary objective of this chapter is to provide a detailed analysis of the classifi-
cation performance of each algorithm based on several metrics, including accuracy,
sensitivity, specificity, F1 score, Receiver Operating Characteristic (ROC) curve, and
Area Under the Curve (AUC). By systematically evaluating the results, we aim to
uncover insights into the strengths and weaknesses of each approach in predicting
breast cancer.
Furthermore, the discussions will address the implications of the findings, emphasiz-
ing the significance of mathematical modeling and optimization in machine learn-
ing applications. The interplay between the custom models and MATLAB’s built-in
functions will be explored to highlight how both approaches contribute to achieving
robust predictive performance in the medical domain. Through this examination,
we aim to provide a comprehensive understanding of the effectiveness of different
classification techniques in the context of breast cancer diagnosis.
3.2 Skill of Logistic Regression Classifier
The correlation analysis from Chapter 2 identifies the features that most clearly separate the two classes, and it serves as a foundational step in feature selection for the logistic regression model. Our aim is to find the optimal number of features that produces the best performance metrics in logistic regression. The distribution of the top six dominant features for classification into benign or malignant is shown in Figure 3.1.
Figure 3.1: Distribution of dominant features across the target variable (malignant and benign) before preprocessing. (a) Top three dominant features: concave points worst, perimeter worst, and concave points mean. (b) Top four to six dominant features: radius worst, perimeter mean, and area worst.
In logistic regression, outlier removal and feature scaling are critical preprocessing
steps that significantly improve model performance. Outliers can distort the deci-
sion boundary by disproportionately affecting the estimated coefficients, leading to
poor generalization and inaccurate predictions. Logistic regression assumes a lin-
ear relationship between the features and the log-odds of the target class. Outliers,
especially in features with large values, can skew this relationship, resulting in an
overfitted model with reduced interpretability.
Figure 3.2: Distribution of dominant features after outlier removal and scaling. (a) Top three dominant features: concave points worst, area worst, and perimeter worst. (b) Top four to six dominant features: concave points mean, radius worst, and perimeter mean.
Incorporating these steps ensures a robust and reliable logistic regression model,
capable of accurately predicting the likelihood of breast cancer based on the most
informative features.
It is observed that after scaling and outlier removal, the top six highly correlated fea-
tures with the target variable remained the same, though their order of correlation
changed slightly.
Here, we explored a mathematical approach to develop a binary classifier from a
linear regressor. Using a linear model, the output is predicted as W T x, where W
represents the weight vector and x the input features. For binary classification, the
separating plane is defined by the equation W T x = 0, which divides the input space
into two regions corresponding to the two classes.
The key objective is to optimize the weight vector W using a kernel regressor de-
signed for linear regression. By employing kernel regression, we aimed to capture
nonlinear relationships between features and improve the performance of the model.
The linear regressor’s predictions are then transformed into binary outputs based on
the separating plane, where values greater than zero were classified into one class,
and those less than zero into the other.
This approach bridges the gap between regression and classification by transferring
the continuous output of a regressor to define a decision boundary for classifica-
tion. The results of this methodology demonstrated that the binary classifier derived
from the optimized linear regressor performs effectively in separating the two target
classes. This method is particularly useful when there is a linear or nearly linear
relationship between the input features and the output classes.
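A simplified sketch of this idea, using an ordinary least-squares fit in place of the kernel regressor described above (variable names are assumptions), is:

% Sketch: turning a linear regressor into a binary classifier.
% Assumes Xtrain/Xtest include a column of ones and labels yt are coded as +1/-1.
W     = (Xtrain' * Xtrain) \ (Xtrain' * yt);  % least-squares weight vector
score = Xtest * W;                            % continuous regression output
ypred = sign(score);                          % side of the separating plane W'x = 0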
Models built on all 30 cleaned and scaled features show broadly similar performance metric scores in logistic regression.
Figure 3.3 shows the separating planes and data points plotted over the top two
correlated features in both linear separating boundary (Figure 3.3a) and sigmoid
boundary (Figure 3.3b).
Figure 3.3: Separating planes and data distribution in linear and logistic regression
Table 3.1: Performance Comparison of Linear and Logistic Regressors for Binary Classi-
fication
Table 3.2: Performance metrics of logistic regression classifier across different feature
subsets
The correlation values (ρ) indicate a decline from 0.75 with the 5-feature subset
to 0.20 with the 25-feature subset. This suggests that while fewer features exhibit
stronger individual relationships with the target variable, the inclusion of more fea-
tures does not necessarily correlate with a proportional improvement in model per-
formance.
Accuracy (Acc)
The accuracy of the model shows minimal variation as the number of features in-
creases, peaking at 0.98 with both the 20 and 25 feature subsets. This indicates
that while the model maintains a high level of accuracy, the incremental benefit of
adding more features is marginal. The results suggest a point of diminishing returns
in terms of accuracy, emphasizing the importance of feature selection for model sim-
plicity and interpretability.
The K-fold cross-validation accuracy remains relatively consistent across all feature
subsets, ranging from 0.93 to 0.96. This stability suggests that the model’s perfor-
mance is robust and not overly dependent on the number of features included. It
reinforces the notion that a more parsimonious model could be equally effective, if
not more so, in terms of generalization and interpretability.
Sensitivity
Sensitivity values, ranging from 0.92 to 0.97, indicate the model’s strong ability to
identify malignant cases consistently. However, the slight variations in sensitivity
across different feature subsets suggest that increasing the feature count does not
significantly enhance this critical aspect of performance.
Specificity
Specificity remains high, ranging from 0.96 to 0.98 across all subsets, indicating that
the model effectively identifies non-malignant cases. This consistency in specificity
further supports the potential for a simpler model without compromising the ability
to accurately classify both malignant and non-malignant cases.
Error Rate
The error rate decreases from 0.045 (with 5 features) to 0.017 (with 25 features),
suggesting a slight improvement in reliability. However, the reduction is modest
compared to the increase in complexity associated with additional features, high-
lighting the necessity of considering feature selection carefully.
AUC
The AUC values consistently remain high (0.990 to 0.998), indicating the model's
strong discriminatory power across all feature subsets. This stability in AUC suggests
that a robust model can be developed without necessitating an excessive number of
features.
As a final step, regularization techniques (L1, L2, and Elastic Net) are applied to the logistic regression model using gradient descent to minimize the mean squared error in prediction. The analysis is conducted on two feature sets: the Full Feature Set (30 scaled features) and the Dominant Subset (top 5 correlated features).
Performance metrics, including accuracy, sensitivity, specificity, F1-score, and AUC,
are computed for each model. Summary of these results along with the K-fold cross
validation of L1 regularization model is shown in Table 3.3.
Table 3.3: Comparison of performance metrics of various regularized models and K-fold
cross validation
In summary, while the performance metrics indicate that the model maintains high
accuracy, sensitivity, and specificity across different feature subsets, the improvement
in these metrics with the addition of more features is not substantial. The K-fold
cross-validation accuracy remains stable across all subsets, further emphasizing that
a simpler model with fewer features could perform nearly as well as a more complex
model.
This suggests that, for the goal of developing a better and more consistent model,
it is essential to focus on identifying an optimal number of features that strike a
balance between performance and model simplicity. The results advocate for feature
selection based on correlation strength with the target variable while considering
the overall performance stability to enhance interpretability and ensure effective
medical diagnosis. Thus, an emphasis on parsimony may yield a model that is not
only efficient but also easier to implement in practical scenarios.
3.3 Skill of Support Vector Machines
Given the residual outliers and the need for a more robust classification framework,
SVMs are a natural next step to explore. The aim is to leverage their ability to man-
age outliers while improving the overall performance metrics of the model, particu-
larly in terms of classification accuracy and consistency across folds. The following
sections will discuss the application of SVMs to this dataset and provide a compara-
tive analysis with logistic regression.
Figure 3.4: Distribution of data points over the first two features under hard margin (a) and soft margin (b) linear SVM.
This result underscores the need for a more flexible approach, such as soft margin
SVMs, to allow some misclassification and better accommodate the complexity of
the data. Using a soft margin linear SVM on the 30-feature set did not produce even a single support vector. After normalizing the feature set, a soft margin SVM with a linear kernel and regularization parameter C = 1 was applied. The model produced 22 support vectors and achieved an improved accuracy of 79.79%. The sensitivity and specificity were 55.66% and 94.12%, respectively. The distribution of data points over the first two features and the separation boundary is shown in Figure 3.5b.
The introduction of a soft margin allowed the model to tolerate some misclassifica-
tions, enhancing its flexibility compared to the hard margin SVM. This improved the
model’s ability to correctly classify malignant cases, as reflected in the increased sen-
sitivity. The high specificity suggests that the model maintained strong performance
in identifying benign cases.
Using the top 5 highly correlated features, a soft margin Support Vector Machine
(SVM) with a linear kernel and regularization parameter c = 10 is applied. This
model identified 5 support vectors and showed a significant improvement in per-
formance, achieving an accuracy of 94.4%. The sensitivity (95.8%) and specificity
(92.82%) indicate that the model effectively balances both positive (malignant) and
negative (benign) class predictions.
The introduction of the soft margin allows for some misclassification, enabling the
model to better handle the presence of noise and outliers in the dataset. This flexibil-
ity leads to improved sensitivity, which is critical in medical diagnosis, as it ensures
that malignant cases are correctly identified. The higher specificity further confirms
the model’s capability to accurately classify benign cases, resulting in a more robust
classification.
The trade-off between sensitivity and specificity indicates that while the model ef-
fectively reduces misclassifications, it still struggles with identifying all malignant
cases, suggesting the need for further optimization or more sophisticated kernels.
To further improve classification performance, the soft margin SVM was applied us-
ing the Radial Basis Function (RBF) kernel. This kernel allows the model to capture
more complex decision boundaries, which are particularly useful in cases where the
data is not linearly separable.
Experiments were conducted on both the full feature set and the top five correlated
feature subset, with the regularization parameter C varied to control the trade-off
between margin maximization and classification error. The accuracy of the model
ranged from 47.98% to 96.15%, indicating that the choice of C plays a significant
role in optimizing the classification skill.
The RBF kernel produced a more flexible and acceptable decision boundary com-
pared to the linear kernel, especially in handling the non-linearity present in the
data.
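For these experiments the only change to the soft margin dual of Section 2.8 is the kernel matrix; a sketch with an assumed bandwidth parameter sigma is shown below.

% Sketch: RBF kernel matrix fed into the soft margin dual from Section 2.8.
n = size(Xtrain, 1);
D = pdist2(Xtrain, Xtrain).^2;         % squared Euclidean distances
K = exp(-D / (2 * sigma^2));           % Gaussian (RBF) kernel matrix
C = 2.5;                               % regularization parameter, varied in the experiments
cvx_begin
    variable lambda(n)
    minimize( 0.5 * quad_form(lambda .* y, K) - sum(lambda) )
    subject to
        sum(lambda .* y) == 0
        0 <= lambda <= C
cvx_end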
Distribution of data points and separation boundaries are shown in Figure 3.5.
Figure 3.5: Distribution of data points over the first two features under soft margin SVM with RBF kernel. (a) Full feature set with C = 1. (b) Top-five-feature subset with C = 2.5.
The higher accuracy in the upper range of C values suggests that the model was able
to find an optimal balance between underfitting and overfitting, leading to improved
classification performance across both feature sets.
Table 3.4 presents the performance metrics of the soft margin Support Vector Ma-
chine (SVM) with an RBF kernel applied to two different versions of the dataset:
the full 30-feature set and a reduced subset comprising the top five highly corre-
lated features. The regularization parameter C is varied to explore the impact of
the margin’s flexibility on classification performance. The primary objective is to in-
vestigate whether the reduced feature subset offers competitive accuracy and other
performance metrics while lowering computational costs.
Table 3.4: Comparison of performance metrics of full feature set and the five top corre-
lated subset over the regularization parameter C.
From the results, it is clear that the reduced feature subset consistently provides high performance across the range of C values considered.
Performance metrics (mean and standard deviation across the cross-validation folds):

        Accuracy        Sensitivity     Specificity     F1-score
C       Mean    SD      Mean    SD      Mean    SD      Mean    SD
1.0     0.933   0.015   0.924   0.035   0.938   0.021   0.915   0.022
1.5     0.933   0.015   0.938   0.036   0.930   0.017   0.913   0.022
2.0     0.933   0.009   0.943   0.039   0.927   0.018   0.913   0.014
2.5     0.928   0.007   0.948   0.030   0.916   0.010   0.907   0.010
3.0     0.927   0.007   0.948   0.030   0.915   0.010   0.907   0.010
3.5     0.926   0.007   0.948   0.030   0.913   0.018   0.905   0.010
4.0     0.923   0.009   0.948   0.030   0.907   0.021   0.901   0.011
4.5     0.923   0.007   0.948   0.030   0.907   0.021   0.901   0.008
5.0     0.926   0.008   0.953   0.023   0.916   0.016   0.905   0.009
The table summarizes the performance of the Support Vector Machine (SVM) model
with a Radial Basis Function (RBF) kernel across various values of the regulariza-
tion parameter C. Key metrics include accuracy, sensitivity, specificity, and F1-score,
along with their corresponding standard deviations.
Key Observations
• F1-Score: The F1-score, which balances precision and recall, stays within a
close range (90.1% to 91.5%), indicating a well-balanced model performance
across all values of C.
• Standard Deviations: Standard deviations across all metrics are relatively low,
implying stable and reliable performance across the cross-validation folds.
Recommendation
Given that the differences in performance across different values of C are minimal,
the value C = 5.0 seems optimal due to its high sensitivity and balanced performance
across other metrics.
• Using All 30 Features: While using all 30 features may marginally improve
the model’s ability to capture more complex patterns, it risks overfitting, espe-
cially if many of the features are redundant or weakly correlated. Additionally,
training with a larger feature set increases computational complexity.
Considering the consistently strong results with the top 5 correlated features and the
risks associated with high-dimensional feature spaces, it is recommended to proceed
with the top 5 correlated features. This simplifies the model without sacrificing
performance, ensuring better generalization on unseen data.
3.4 Model Comparison and Conclusion
Recommendation: Given the critical nature of medical data, where accurately de-
tecting true positives (sensitivity) is crucial, the SVM with the RBF kernel using 5
features is recommended due to its superior balance of sensitivity (98%), accuracy
(98%), and simplicity (using fewer features). However, Logistic Regression could be
considered in scenarios where model interpretability is essential for clinical decision-
making.
Chapter 4
Conclusion
Bibliography
[1] Laila Khairunnahar, Mohammad Abdul Hasib, Razib Hasan Bin Rezanur, Mohammad Rakibul Islam, and Md Kamal Hosain. Classification of malignant and benign tissue with logistic regression. Informatics in Medicine Unlocked, 16:100189, 2019.