
AMRITA VISHWA VIDYAPEETHAM
AMRITA SCHOOL OF ARTIFICIAL INTELLIGENCE

A Computational Study on Classification of Malignant and Benign Tissue

Submitted by: Siju K S (CB.AI.R4CEN24003), ASAI
Submitted to: Prof. (Dr.) Soman K.P., Professor & Dean, ASAI

In partial fulfillment of the requirements for the coursework of the Ph.D. programme under the Amrita School of Artificial Intelligence

October 2024
Abstract

This study applies linear algebra and optimization techniques to the prediction of
breast cancer using the UCI breast cancer dataset. K-means clustering is used to
explore data patterns, with the cluster assignments compared to the actual benign
and malignant labels. Misclassified points are examined to understand their feature
ranges and implications for diagnostic accuracy. These findings are then compared
with traditional rule-based medical approaches.
The next stage involves developing classification models, starting with linear regres-
sion, followed by logistic regression using matrix operations and a sigmoid function
for binary classification. A support vector machine (SVM) is then formulated as a
convex optimization problem, with support vectors identified through eigenvalue
decomposition and solved using the CVX toolbox in MATLAB. Model performance is eval-
uated and compared with the clustering results and medical standards.
The study highlights how linear algebra and optimization can be used to improve
classification models in medical diagnostics, offering insights into the alignment of
data-driven methods with established diagnostic rules.
List of Tables

2.1 Class distribution of the UCI Breast Cancer Wisconsin dataset
2.2 Summary Statistics of Numerical Features
2.3 Correlation Coefficients of Features with Diagnosis
2.4 Variance Inflation Factor (VIF) Scores of Features
2.5 Outlier Status Before and After Square Root Transformation
2.6 Correlation Analysis of Features with Tumor Malignancy
2.7 Selected features with correlation coefficients and their target association
3.1 Performance Comparison of Linear and Logistic Regressors for Binary Classification
3.2 Performance metrics of logistic regression classifier across different feature subsets
3.3 Comparison of performance metrics of various regularized models and K-fold cross validation
3.4 Comparison of performance metrics of soft margin SVM with RBF kernel
3.5 Statistical summary of 5-fold cross validation results
3.6 Performance Metrics of Models
List of Figures

2.1 Correlation matrix of various features in the dataset
2.2 Distribution of top three highly correlated features
2.3 Box plot showing the distribution of features post-square root transformation, highlighting potential outliers
2.4 Machine learning classification process
3.1 Distribution of dominant features across the target variable (malignant and benign) before preprocessing
3.2 Distribution of dominant features after outlier removal and scaling
3.3 Separating planes and data distribution in linear and logistic regression
3.4 Skill of hard and soft margin linear SVM
3.5 Skill of soft margin linear SVM with RBF kernel
Contents

1 Introduction

2 Methodology
  2.1 Dataset Summary
  2.2 Basic Data Analysis
    2.2.1 Class Distribution
    2.2.2 Summary Statistics
    2.2.3 Correlation Analysis
    2.2.4 Multicollinearity of Features
    2.2.5 Data Preprocessing
    2.2.6 IQR-Based Approach for Outlier Detection
    2.2.7 Data Visualization
  2.3 Feature Selection
  2.4 Machine Learning Algorithms
  2.5 Related works
  2.6 Terminologies Used in Machine Learning Model Development
  2.7 Logistic Regression Classifier
  2.8 Support Vector Machine (SVM) Classifier
    2.8.1 Advantages of Convex Optimization in SVM
  2.9 Decision Tree Classifier
    2.9.1 Lagrangian Formulation of Decision Trees
  2.10 Built-in Machine Learning Functions in MATLAB
    2.10.1 Logistic Regression using fitglm
    2.10.2 Support Vector Machines using fitcsvm
    2.10.3 Decision Trees using fitctree
    2.10.4 Comparative Evaluation and Tuning of Models
  2.11 Developing an Explainable Tumor Classification Model
  2.12 Conclusion

3 Results and Discussions
  3.1 Introduction
  3.2 Skill of Logistic Regression Classifier
    3.2.1 Performance Metrics Analysis
    3.2.2 Overview of Performance Metrics
    3.2.3 Key Observations
  3.3 Skill of Support Vector Machines
    3.3.1 Hard Margin Support Vector Machine with Linear Kernel
    3.3.2 Soft Margin Support Vector Machine with Linear Kernel
    3.3.3 Soft Margin Support Vector Machine with Non-linear Kernels
  3.4 Model Comparison and Conclusion

4 Conclusion
Chapter 1

Introduction

Breast cancer remains a significant public health challenge, highlighting the need for
effective diagnostic tools to enhance early detection and treatment. The advent of
machine learning has provided innovative methods for analyzing complex datasets,
offering promising avenues for improving classification tasks in the medical field.
This project focuses on the development of a classification model utilizing the UCI
Breast Cancer Wisconsin dataset, which contains a diverse array of features pertinent
to breast cancer diagnosis.
The primary aim of this research is to explore the inherent patterns within the
dataset through a systematic analysis rooted in the principles of linear algebra and
optimization. Initially, the dataset will be structured as a matrix, enabling various
preprocessing steps such as data cleaning, outlier detection, and feature scaling.
Outliers will be identified using the interquartile range (IQR) method, while visu-
alizations will be employed to present a comprehensive summary of the numerical
features.
Following this preprocessing phase, K-means clustering will be implemented to gain
insights into the data’s structure. This approach will facilitate a comparative analysis
between the actual target labels and the clusters generated, revealing any misclas-
sified data points within the clusters and examining their distribution across the
feature space. Such analyses aim to provide valuable inferences that connect tradi-
tional rule-based models used in medical diagnostics with the findings from cluster-
ing techniques.
Subsequent to the exploratory analysis, the focus will shift to the development of
classification models. The initial step will involve constructing a linear regression
model, transitioning to logistic regression through matrix operations. The applica-
tion of the sigmoid function will be critical in implementing binary classification
techniques. Building upon this foundation, support vector machines (SVM) will
be utilized, employing a linear programming approach to identify support vectors
through the eigenvalue decomposition of the dual Lagrangian function. The opti-
mization of the SVM will be executed using CVX solvers, ensuring a robust solution
to the linear programming problem.
This project aspires to construct a reliable classification model while fostering a
deeper understanding of the relationships among various features in the dataset. By
adopting a methodical approach grounded in linear algebra and optimization, this


research seeks to explore the potential for improved diagnostic accuracy in breast
cancer detection, ultimately contributing to advancements in medical analytics and
patient care.

Chapter 2

Methodology

This chapter delineates the methodological framework employed in this project, fo-
cusing on the UCI Breast Cancer Wisconsin dataset as the basis for model develop-
ment. Initially, an exploration of the dataset was conducted to understand its struc-
ture and key characteristics. Essential statistical analyses were performed to sum-
marize the data, including identifying outliers and assessing feature relationships.
Following this, data preprocessing techniques were applied to ensure the dataset’s
quality, including normalization and outlier detection. Subsequently, a series of clas-
sification models—namely linear regression, logistic regression, and support vector
machines—were implemented using matrix operations grounded in linear algebra
and optimization principles. This systematic approach not only facilitated the devel-
opment of robust predictive models but also enhanced the overall understanding of
the data’s patterns and distributions.

2.1 Dataset Summary


The UCI Breast Cancer Wisconsin dataset, sourced from the University of Califor-
nia, Irvine (UCI) Machine Learning Repository, is a widely used dataset for breast
cancer diagnosis research. It comprises 569 instances, each representing a unique
patient, along with 32 attributes that provide critical information regarding tumor
characteristics. The dataset is structured as follows:
• ID: A unique identifier for each patient.

• Diagnosis: A categorical variable indicating the tumor classification, with two


possible values:

– M: Malignant (cancerous)
– B: Benign (non-cancerous)

• Features: The dataset contains 30 continuous numerical attributes, derived


from digitized images of fine needle aspirate (FNA)¹ of breast mass. These
features represent various measurements related to the tumors, including:

– radius mean, texture mean, perimeter mean, area mean, smoothness mean,
and others.

Each feature is calculated based on the mean, standard error, or worst (largest)
value, offering a comprehensive view of the tumor’s characteristics.

• Data Format: The dataset is provided in CSV format, making it easily accessi-
ble for data analysis and modeling tasks.

The primary objective of this dataset is to aid in the development of machine learn-
ing models capable of accurately predicting the diagnosis of breast cancer based on
the provided features. This dataset has become a benchmark for evaluating vari-
ous classification algorithms and serves as an essential resource for researchers and
practitioners in the field of medical diagnostics.
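
As a minimal illustration (not part of the report's original workflow), the dataset can be loaded into MATLAB as a table; the file name and column names below are assumptions about how the CSV was exported:

% Hedged sketch: load the UCI Breast Cancer Wisconsin data from a CSV export.
% Assumed file name 'wdbc.csv' with columns: id, diagnosis ('M'/'B'), 30 features.
T = readtable('wdbc.csv');                 % read the CSV into a table
y = double(strcmp(T.diagnosis, 'M'));      % encode malignant = 1, benign = 0
X = T{:, 3:end};                           % 569 x 30 numeric feature matrix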

2.2 Basic Data Analysis


This section outlines the primary analyses conducted on the UCI Breast Cancer Wis-
consin dataset to understand its characteristics and prepare for model development.

2.2.1 Class Distribution


The class distribution of the diagnosis variable was examined, categorizing in-
stances as either malignant (M) or benign (B).

Class Count
Benign (B) 357
Malignant (M) 212

Table 2.1: Class distribution of the UCI Breast Cancer Wisconsin dataset.

The class distribution table shows 357 instances classified as benign (B) and 212
as malignant (M) in the UCI Breast Cancer Wisconsin dataset. This results in an
approximate distribution of 62.7% benign and 37.3% malignant instances.
The observed class imbalance may influence the performance of classification al-
gorithms, potentially leading to a bias towards the benign class. Therefore, it is
essential to implement strategies during model development to ensure reliable iden-
tification of malignant cases. The lower representation of malignant instances neces-
sitates the use of effective feature extraction and validation techniques to enhance
predictive accuracy.

¹ Fine Needle Aspiration (FNA) is a minimally invasive technique for collecting cytological samples from suspicious breast lesions using a thin needle under imaging guidance. This procedure provides rapid diagnoses and high-quality samples, facilitating the extraction of critical features for differentiating between benign and malignant cells, thereby enhancing cancer prediction models.

2.2.2 Summary Statistics


Summary statistics, including mean, median, standard deviation, minimum, max-
imum, and interquartile range (IQR), were calculated for each numerical feature.
These statistics provide insights into the dataset’s central tendencies and variations.
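
A brief MATLAB sketch of how such per-feature statistics can be computed (assuming X is the 569 x 30 numeric feature matrix introduced earlier):

% Per-feature summary statistics of the kind reported in Table 2.2.
mu   = mean(X);                        % means
med  = median(X);                      % medians (Q2)
q    = quantile(X, [0.25 0.75]);       % first and third quartiles (Q1, Q3)
iqrX = q(2, :) - q(1, :);              % interquartile ranges
sd   = std(X);                         % standard deviations
mn   = min(X);   mx = max(X);          % minima and maxima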

Table 2.2: Summary Statistics of Numerical Features


Feature Mean Median Q1 Q2 Q3 IQR Min Max SD
radius mean 14.13 13.37 11.70 13.37 15.80 4.10 6.98 28.11 3.52
texture mean 19.29 18.84 16.17 18.84 21.80 5.63 9.71 39.28 4.30
perimeter mean 91.97 86.24 75.13 86.24 104.15 29.02 43.79 188.50 24.30
area mean 654.89 551.10 420.18 551.10 784.15 363.98 143.50 2501.00 351.91
smoothness mean 0.10 0.10 0.09 0.10 0.11 0.02 0.05 0.16 0.01
compactness mean 0.10 0.09 0.06 0.09 0.13 0.07 0.02 0.35 0.05
concavity mean 0.09 0.06 0.03 0.06 0.13 0.10 0.00 0.43 0.08
concave points mean 0.05 0.03 0.02 0.03 0.07 0.05 0.00 0.20 0.04
symmetry mean 0.18 0.18 0.16 0.18 0.20 0.03 0.11 0.30 0.03
fractal dimension mean 0.06 0.06 0.06 0.06 0.07 0.01 0.05 0.10 0.01
radius se 0.41 0.32 0.23 0.32 0.48 0.25 0.11 2.87 0.28
texture se 1.22 1.11 0.83 1.11 1.47 0.64 0.36 4.88 0.55
perimeter se 2.87 2.29 1.61 2.29 3.36 1.76 0.76 21.98 2.02
area se 40.34 24.53 17.85 24.53 45.24 27.39 6.80 542.20 45.49
smoothness se 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.03 0.00
compactness se 0.03 0.02 0.01 0.02 0.03 0.02 0.00 0.14 0.02
concavity se 0.03 0.03 0.02 0.03 0.04 0.03 0.00 0.40 0.03
concave points se 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.05 0.01
symmetry se 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.08 0.01
fractal dimension se 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.00
radius worst 16.27 14.97 13.01 14.97 18.79 5.78 7.93 36.04 4.83
texture worst 25.68 25.41 21.07 25.41 29.76 8.68 12.02 49.54 6.15
perimeter worst 107.26 97.66 84.10 97.66 125.53 41.42 50.41 251.20 33.60
area worst 880.58 686.50 514.97 686.50 1085.00 570.03 185.20 4254.00 569.36
smoothness worst 0.13 0.13 0.12 0.13 0.15 0.03 0.07 0.22 0.02
compactness worst 0.25 0.21 0.15 0.21 0.34 0.19 0.03 1.06 0.16
concavity worst 0.27 0.23 0.11 0.23 0.38 0.27 0.00 1.25 0.21
concave points worst 0.11 0.10 0.06 0.10 0.16 0.10 0.00 0.29 0.07
symmetry worst 0.29 0.28 0.25 0.28 0.32 0.07 0.16 0.66 0.06
fractal dimension worst 0.08 0.08 0.07 0.08 0.09 0.02 0.06 0.21 0.02

The summary statistics table provides an overview of the numerical features ex-
tracted from the dataset.
• Mean values indicate the average size and characteristics of the breast tumors, with the highest average observed for the area feature at 654.89. The perimeter and radius also show considerable averages of 91.97 and 14.13, respectively, reflecting the size and dimensions of the tumors.

• Median values closely follow the means, confirming the general distribution without extreme outliers. For instance, the median radius is 13.37, suggesting that half the tumors have a radius smaller than this value.

• The interquartile range (IQR) highlights the variability, with the area feature exhibiting the highest IQR of 363.98, indicating substantial differences in tumor sizes.

• The minimum and maximum values indicate the range of each feature. For instance, the area varies from 143.50 to 2501.00, demonstrating significant size diversity among the tumors.

• The standard deviation (SD) values reflect the spread of the data points around the mean, with features such as perimeter (SD = 24.30) and area (SD = 351.91) exhibiting higher variability compared to features like smoothness mean (SD = 0.01), which is more consistent across observations.


These statistics are crucial for understanding the dataset’s characteristics and inform
the subsequent steps in feature selection and model training.

2.2.3 Correlation Analysis

The Pearson correlation coefficient was computed for pairs of numerical variables
to identify strong correlations. A correlation matrix was created to visualize rela-
tionships between features, guiding feature selection for classification. Figure 2.1
illustrates the correlations among the 30 features in the dataset.

[Figure 2.1 is a 30 x 30 correlation heatmap: both axes list the 30 features and the colorbar ranges from −0.2 to 1.]
Figure 2.1: Correlation matrix of various features in the dataset

In order to identify the most significant features contributing to breast cancer di-
agnosis, a correlation analysis was conducted between each feature and the target
variable (diagnosis). This statistical method allows us to quantify the strength of
the linear relationship between the input features and the target class, which in
this case helps to identify the features most strongly associated with distinguishing
benign from malignant tumors.
The table below presents the correlation coefficients for each feature:


Table 2.3: Correlation Coefficients of Features with Diagnosis

Feature Correlation
radius mean 0.73
texture mean 0.42
perimeter mean 0.74
area mean 0.71
smoothness mean 0.36
compactness mean 0.60
concavity mean 0.70
concave points mean 0.78
symmetry mean 0.33
fractal dimension mean -0.01
radius se 0.57
texture se -0.01
perimeter se 0.56
area se 0.55
smoothness se -0.07
compactness se 0.29
concavity se 0.25
concave points se 0.41
symmetry se -0.01
fractal dimension se 0.08
radius worst 0.78
texture worst 0.46
perimeter worst 0.78
area worst 0.73
smoothness worst 0.42
compactness worst 0.59
concavity worst 0.66
concave points worst 0.79
symmetry worst 0.42
fractal dimension worst 0.32

Based on the correlation analysis, the three features most strongly correlated with
the target variable (diagnosis) are ’concave points worst’, ’perimeter worst’, and
’concave points mean’. These features have correlation values of 0.79, 0.78, and
0.78, respectively. The high correlation values indicate a strong positive relation-
ship with the diagnosis, suggesting that larger values of these features are typically
associated with malignant tumors.
Statistical interpretations of these results highlight that shape-related features, par-
ticularly those associated with concave points and perimeter, are critical in differen-
tiating between benign and malignant tumors. These findings provide strong justifi-
cation for focusing on these features in predictive modeling efforts. Visualizing the
distribution of these features across benign and malignant classes further demon-
strates their significance in separating the two groups. The distributions of these three
top features are shown in Figure 2.2.

[Figure 2.2 shows density plots of concave points worst, perimeter worst, and concave points mean for the malignant and benign classes.]

Figure 2.2: Distribution of top three highly correlated features


Variance Inflation Factor (VIF) scores for the feature set are shown in Table 2.4.

2.2.4 Multicollinearity of Features


The Variance Inflation Factor (VIF) is a critical metric for evaluating multicollinearity
among predictor variables, particularly in the context of linear regression and its
extensions to classification models. A VIF score quantifies how much the variance of
a regression coefficient is inflated due to linear dependence among the predictors. In
general, a VIF score above 5 or 10 indicates problematic multicollinearity, warranting
attention and potential remedial action.
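
The report does not show its VIF implementation; as an illustrative sketch, each feature can be regressed on the remaining features and the score computed as VIF_j = 1 / (1 − R_j²):

% Hedged sketch of a VIF computation over the feature matrix X (n x p).
p   = size(X, 2);
vif = zeros(p, 1);
for j = 1:p
    xj     = X(:, j);                                    % feature under test
    others = [ones(size(X,1),1), X(:, [1:j-1, j+1:p])];  % intercept + remaining features
    beta   = others \ xj;                                % least-squares regression
    res    = xj - others * beta;                         % residuals
    R2     = 1 - sum(res.^2) / sum((xj - mean(xj)).^2);
    vif(j) = 1 / (1 - R2);                               % variance inflation factor
end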
Table 2.4 of VIF scores for various features reveals significant disparities, with several
features exhibiting alarmingly high values. For instance, the VIF score for radius
mean (3806.12) and perimeter mean (3786.40) suggests extreme multicollinearity,
indicating that these variables are heavily interrelated. Such high VIF values raise
concerns about the stability of coefficient estimates and their interpretability in pre-
dictive modeling.
Conversely, features like symmetry mean (4.22) and fractal dimension mean (15.76)
present a mixed picture. While symmetry mean indicates acceptable multicollinearity
levels, the higher score of fractal dimension mean underscores a need for careful
consideration in model design. Features with high VIF scores can obscure the unique
contributions of individual predictors, potentially leading to overfitting and unreli-
able model performance.
Given these observations, reliance on linear models— especially logistic regression—


may be inappropriate due to the instability caused by multicollinearity. To enhance


model robustness, practitioners should consider techniques such as feature selection
to mitigate multicollinearity’s adverse effects or explore regularization methods to
achieve more reliable predictive outcomes. Thus, addressing the issues highlighted
by the VIF scores will be crucial in constructing an effective and interpretable classi-
fication model.

Table 2.4: Variance Inflation Factor (VIF) Scores of Features

Sl.No. Feature VIF Score


1 radius mean 3806.12
2 texture mean 11.88
3 perimeter mean 3786.40
4 area mean 347.88
5 smoothness mean 8.19
6 compactness mean 50.51
7 concavity mean 70.77
8 concave points mean 60.04
9 symmetry mean 4.22
10 fractal dimension mean 15.76
11 radius se 75.46
12 texture se 4.21
13 perimeter se 70.36
14 area se 41.16
15 smoothness se 4.03
16 compactness se 15.37
17 concavity se 15.69
18 concave points se 11.52
19 symmetry se 5.18
20 fractal dimension se 9.72
21 radius worst 799.11
22 texture worst 18.57
23 perimeter worst 405.02
24 area worst 337.22
25 smoothness worst 10.92
26 compactness worst 36.98
27 concavity worst 31.97
28 concave points worst 36.76
29 symmetry worst 9.52
30 fractal dimension worst 18.86

2.2.5 Data Preprocessing


Based on the analysis findings, preprocessing steps were taken to prepare the dataset
for modeling. This included addressing missing values, normalizing features, and
handling outliers to enhance data quality.


2.2.6 IQR-Based Approach for Outlier Detection

The Interquartile Range (IQR) is a measure of statistical dispersion, which is the


spread of the middle 50% of a dataset. It is calculated as the difference between the
third quartile (Q3) and the first quartile (Q1), where:

IQR = Q3 − Q1

The first quartile (Q1) represents the 25th percentile of the data, while the third
quartile (Q3) represents the 75th percentile. The IQR is particularly useful for iden-
tifying outliers in datasets, as it is resistant to extreme values (unlike the standard
deviation). Outliers are typically defined as data points that fall below Q1−1.5×IQR
or above Q3+1.5×IQR. These thresholds, often referred to as ”fences,” capture most
of the central data, and any points outside this range are considered outliers.
The IQR-based approach is widely used in datasets where the distribution is skewed
or does not follow a normal distribution, making it more robust compared to the
z-score method. The use of 1.5 times the IQR to determine outliers is a common rule
of thumb, although this factor can be adjusted based on the specific characteristics of
the data. By identifying and potentially removing or investigating these outliers, it is
possible to improve the accuracy and performance of statistical models and reduce
bias introduced by extreme values.

• Lower Bound: Q1 − 1.5 × IQR

• Upper Bound: Q3 + 1.5 × IQR

Data points outside this range are flagged as potential outliers.


Almost all features in the dataset contain outliers. A square root transformation is applied to the features with the largest numbers of outliers (area se, perimeter se, radius se, area mean, area worst, fractal dimension worst). The effect of the transformation on the outlier counts is shown in Table 2.5.
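
A minimal sketch of the IQR rule combined with the square-root transformation is given below; the column index is illustrative only and X is the feature matrix from earlier:

% Count IQR outliers before and after a square-root transformation.
countOutliers = @(v) sum(v < quantile(v, 0.25) - 1.5*iqr(v) | ...
                         v > quantile(v, 0.75) + 1.5*iqr(v));
x = X(:, 4);                        % an assumed heavily skewed column, e.g. area mean
nBefore = countOutliers(x);         % outliers flagged on the raw feature
nAfter  = countOutliers(sqrt(x));   % outliers flagged after the transformation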


Table 2.5: Outlier Status Before and After Square Root Transformation

Feature Outliers Before Outliers After


Radius Mean 13 0
Texture Mean 7 0
Perimeter Mean 13 0
Area Mean 23 12
Smoothness Mean 6 0
Compactness Mean 15 0
Concavity Mean 17 0
Concave Points Mean 9 0
Symmetry Mean 15 0
Fractal Dimension Mean 15 0
Radius SE 36 19
Texture SE 18 0
Perimeter SE 36 21
Area SE 61 30
Smoothness SE 28 0
Compactness SE 26 0
Concavity SE 22 0
Concave Points SE 18 0
Symmetry SE 27 0
Fractal Dimension SE 26 0
Radius Worst 16 0
Texture Worst 4 0
Perimeter Worst 13 0
Area Worst 31 15
Smoothness Worst 4 0
Compactness Worst 16 0
Concavity Worst 11 0
Concave Points Worst 0 0
Symmetry Worst 21 0
Fractal Dimension Worst 21 14

2.2.7 Data Visualization


To analyze the impact of the square root transformation and identify potential out-
liers, a box plot was created for the selected features. The box plot (Figure 2.3)
illustrates the distribution of features, including area se, perimeter se, radius se,
area mean, area worst, and fractal dimension worst.
In the box plot, the central box represents the interquartile range (IQR), with the line
indicating the median. Whiskers extend to the smallest and largest values within 1.5
times the IQR, while points outside this range are marked as potential outliers.
The analysis reveals notable outliers in features such as area se, perimeter se, and
radius se, necessitating further investigation. This visualization aids in understand-
ing the effects of the square root transformation and guides subsequent data preprocessing steps.
Figure 2.3: Box plot showing the distribution of features post-square root transforma-
tion, highlighting potential outliers.

2.3 Feature Selection


The correlation analysis reveals several features with significant positive correla-
tions to tumor malignancy, particularly those exceeding a correlation coefficient of
0.75. Notable features include Concave Points Worst (0.79), Area Worst (0.78), and
Perimeter Worst (0.78), all of which exhibit strong relationships with malignancy.
These features, primarily related to tumor size and shape, are crucial in distinguish-
ing between malignant and benign tumors. The low p-values (e.g., 1.97 × 10−124 for
Concave Points Worst) further affirm their statistical significance.
Selecting these high-correlation features for model development enhances predictive
accuracy in classifying tumors. Their clinical relevance, derived from strong statis-
tical associations, supports their role in diagnostic processes. Thus, incorporating


these features into logistic regression models can potentially improve classification
outcomes, aiding in timely and accurate patient treatment decisions.

Table 2.6: Correlation Analysis of Features with Tumor Malignancy

Feature ρ P-Value
Concave Points Worst 0.79 1.97 × 10−124
Area Worst 0.78 1.23 × 10−119
Perimeter Worst 0.78 5.77 × 10−119
Concave Points Mean 0.78 7.10 × 10−116
Radius Worst 0.78 8.48 × 10−116
Perimeter Mean 0.74 8.44 × 10−101
Area Mean 0.73 3.45 × 10−97
Radius Mean 0.73 8.47 × 10−96
Area SE 0.71 9.52 × 10−89
Concavity Mean 0.70 9.97 × 10−84
Concavity Worst 0.66 2.46 × 10−72
Perimeter SE 0.63 3.20 × 10−64
Radius SE 0.63 1.77 × 10−63
Compactness Mean 0.60 3.94 × 10−56
Compactness Worst 0.59 7.07 × 10−55
Texture Worst 0.46 1.08 × 10−30
Smoothness Worst 0.42 6.58 × 10−26
Symmetry Worst 0.42 2.95 × 10−25
Texture Mean 0.42 4.06 × 10−25
Concave Points SE 0.41 3.07 × 10−24
Smoothness Mean 0.36 1.05 × 10−18
Symmetry Mean 0.33 5.73 × 10−16
Fractal Dimension Worst 0.32 2.47 × 10−15
Compactness SE 0.29 9.98 × 10−13
Concavity SE 0.25 8.26 × 10−10
Fractal Dimension SE 0.08 0.063
Smoothness SE -0.07 0.110
Fractal Dimension Mean -0.01 0.760
Texture SE -0.01 0.843
Symmetry SE -0.01 0.877

Chi-square tests are commonly used for evaluating the independence between cat-
egorical variables or for assessing goodness of fit between observed and expected
distributions. In particular, testing independence can help determine whether two
or more variables are dependent across populations, allowing one to estimate the
other. However, when applying Chi-square tests to this dataset, the results were in-
conclusive, likely due to the continuous nature of the transformed variables and the
limitations of the Chi-square method in handling such data. These experiments con-
sistently produced unreliable test statistics and p-values, highlighting the inadequacy
of Chi-square for this feature selection task.


As a more suitable alternative, Pearson correlation analysis was employed to mea-


sure the linear relationship between the numerical features and the binary target
variable. This method identified features with the strongest associations, where top
features such as ”Concave Points Worst,” ”Area Worst,” ”Perimeter Worst”, ”Con-
cave Points Mean” and ”Radius Worst” showed high correlation values (above 0.75).
These features are derived from basic measurements like radius, area, and perimeter
but encapsulate more complex geometric properties of the tumors. Thus, the cor-
relation analysis not only simplifies the feature selection process but also highlights
the significance of derived features, making it an effective choice for this dataset.
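
A hedged sketch of this correlation screen, assuming X is the preprocessed feature matrix and y the 0/1 diagnosis vector, is shown below; the 0.75 threshold is the one quoted above:

% Pearson correlation of each feature with the binary diagnosis, with p-values.
[rho, pval] = corr(X, y);                  % rho, pval are 30 x 1 vectors
[~, order]  = sort(abs(rho), 'descend');   % features ranked by |correlation|
selected    = find(abs(rho) > 0.75);       % indices of the features kept for modelling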
The features selected for developing machine learning models for the breast cancer
dataset are shown in Table 2.7.

Table 2.7: Selected features with correlation coefficients and their target association

Feature Correlation P-value


Concave Points Worst 0.79 0
Area Worst 0.78 0
Perimeter Worst 0.78 0
Concave Points Mean 0.78 0
Radius Worst 0.78 0

2.4 Machine Learning Algorithms


Machine Learning (ML) algorithms, particularly supervised learning methods, are
widely applied in predictive modeling. Among them, Logistic Regression, Support
Vector Machines (SVMs), Decision Trees, and Random Forests are commonly
used for classification tasks.
Logistic Regression is a statistical method that models the probability of a binary
outcome based on one or more input features. It is effective for problems where
the relationship between features and the target variable is approximately linear.
Its simplicity, interpretability, and ability to provide probabilistic outputs make it a
popular choice for binary classification.
Support Vector Machines (SVMs) work by finding a hyperplane that best separates
the data into different classes. SVMs can handle both linear and non-linear clas-
sification tasks, using kernel functions to transform the data when needed. They
are particularly useful when the data is not linearly separable and perform well in
high-dimensional spaces.
Decision Trees are flowchart-like structures where decisions are made at each node,
based on feature values. They are easy to interpret and can handle both categorical
and numerical data. However, decision trees can be prone to overfitting, especially
when deep trees are built, which capture noise rather than underlying patterns.
Random Forests improve on decision trees by using an ensemble approach. Multiple
decision trees are built on random subsets of the data, and the final prediction is
made based on the majority vote of these trees. This method reduces overfitting,


improves generalization, and typically yields better accuracy than single decision
trees.
In this project, these algorithms can be used to model the relationship between se-
lected features and the target variable, offering robust performance across a range
of classification problems.
A general structure of a Machine Learning Classification process is shown in Figure
2.4.

Figure 2.4: Machine learning classification process

The selection of these algorithms for the current project is based on their ability to
handle both linear and non-linear relationships within the dataset. Logistic Regres-
sion offers a simple yet powerful baseline, while SVMs can efficiently manage more
complex patterns. Decision Trees provide interpretability, allowing for easier under-
standing of feature importance, and Random Forests enhance model performance
through ensemble learning, reducing the risk of overfitting. These algorithms col-
lectively offer a robust toolkit for accurately classifying the data and handling the
nuances of feature interaction in the project.

2.5 Related works


The field of machine learning has produced a plethora of techniques for classifying
breast cancer patterns, achieving impressive classification accuracies. The sources
highlight several key approaches:
Street et al. developed a system for breast cancer diagnosis that utilizes interac-
tive image processing and machine learning. The system uses a technique called
”snakes,” which are deformable splines that converge to the boundaries of cell nu-
clei in digitized images of fine needle aspirates (FNAs) [2]. This interactive process,


which typically takes two to five minutes, allows for the extraction of ten differ-
ent features from the segmented nuclei. These features, including radius, perime-
ter, area, compactness, smoothness, concavity, symmetry, and fractal dimension, are
then used to train a classifier. The classifier employs a variation of the Multi-surface
Method (MSM) to separate data points into benign and malignant sets. This method
involves constructing separating planes in the feature space to minimize misclassifi-
cations. Testing with a set of 569 images demonstrated a high level of accuracy. The
system achieved an accuracy of 97% in distinguishing between benign and malig-
nant tumors when using a set of three features: worst area, worst smoothness, and
mean texture. Moreover, the system achieved an accuracy of 80% in predicting the
distant recurrence of malignancy in patients. This study demonstrates the potential
of using nuclear features extracted from FNAs and machine learning techniques to
accurately diagnose breast cancer.

Khairunnahar et al. mention the use of Decision Tree methods, specifically the C4.5
algorithm, which attained an accuracy of 94.74% [1]. Decision trees are intuitive
models that use a tree-like structure to represent decisions and their possible con-
sequences. They work by recursively partitioning the data based on feature values
until a classification can be made. Another approach discussed is the Rule Induction
Algorithm based on approximate classification, achieving an accuracy of 94.99%.
This method generates a set of rules from the training data that can be used to clas-
sify new instances. The rules are typically expressed in the form of “if-then” state-
ments. Combining Linear Discriminant Analysis (LDA) with Neural Networks (NN)
is yet another method explored in the source. This combined approach reached an
impressive accuracy of 96.8%. LDA seeks to find a linear combination of features
that maximizes the separation between classes, while neural networks are powerful
models inspired by the structure of the human brain that can learn complex non-
linear relationships in the data. Support Vector Machines (SVM) also stand out as a
successful method for breast cancer classification, achieving an accuracy of 97.2%.
SVMs work by finding the optimal hyperplane that maximizes the margin between
different classes in the feature space. Moving towards more advanced techniques,
the sources discuss feed-forward neural networks with rule extraction, yielding an
accuracy of 98.10%. These models combine the power of neural networks with the
interpretability of rule-based systems. The extracted rules provide insights into the
decision-making process of the neural network. Neuro-fuzzy techniques blend fuzzy
logic with neural networks, offering a way to handle uncertainty and imprecision in
the data. This approach achieved an accuracy of 95.06%. Another hybrid method
combines autoregressive models (AR) with neural networks (NN), attaining a clas-
sification accuracy of 97.4% for breast cancer diagnosis. Autoregressive models are
used to model time series data, where the current value depends on previous val-
ues. The sources also discuss various Learning Vector Quantization (LVQ) methods
(including LVQ, Big LVQ, and AIRS) applied to breast cancer detection, achieving
correct classification rates ranging from 96.7% to 97.2%. LVQ algorithms are a
type of competitive learning neural network where the network learns to classify
input vectors by adjusting the positions of prototype vectors in the feature space.
Further techniques include Supervised Fuzzy Clustering, which achieved an accu-


racy of 95.57% for breast cancer detection, and the Mixture Experts (ME) network
structure, which achieved a correct classification rate of 98.85% for breast cancer
diagnosis. The sources emphasize that the field of machine learning continues to
evolve rapidly, and new techniques are constantly being developed to improve breast
cancer detection and diagnosis. They also highlight the importance of carefully se-
lecting and extracting relevant features from the data to achieve optimal perfor-
mance. The wide range of approaches discussed underscores the ongoing research
and development in this critical area.

2.6 Terminologies Used in Machine Learning Model Development
Dataset
A dataset is a collection of data that contains features (input variables) and labels
(target variable). In supervised learning, the dataset is used to train the model, with
features representing the input data and labels indicating the desired output.

Train-Test Split
Train-test split is a technique used to evaluate the performance of a machine learning
model. The dataset is divided into two subsets: the training set, which is used to
train the model, and the test set, which is used to assess the model’s performance
on unseen data. A common split ratio is 80:20, where 80% of the data is used for
training and 20% for testing.
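
As an illustrative sketch (not prescribed in the text), an 80:20 split can be produced in MATLAB with cvpartition:

% Stratified 80:20 hold-out split of the dataset (X, y).
c      = cvpartition(y, 'HoldOut', 0.2);
Xtrain = X(training(c), :);   ytrain = y(training(c));
Xtest  = X(test(c), :);       ytest  = y(test(c));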

Weights (w) and Bias (b)


Weights are coefficients assigned to each feature in the model, determining the in-
fluence of each feature on the prediction. The bias term is a constant added to the
output of the model to adjust the prediction independently of the input features.
Together, weights and bias form the parameters of the logistic regression model.

Learning Rate (α)


The learning rate is a hyperparameter that controls how much to change the model
parameters during each iteration of gradient descent. A smaller learning rate may
lead to slower convergence, while a larger learning rate can result in overshooting
the optimal solution.

Iterations (T )
Iterations refer to the number of times the gradient descent algorithm updates the
weights and bias. More iterations can improve model performance, but excessively
high values may lead to overfitting or unnecessary computation.


Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function by
iteratively adjusting the model parameters (weights and bias) in the direction of the
negative gradient of the cost function. This process continues until convergence is
achieved or a predetermined number of iterations is reached.

K-Fold Cross-Validation
K-fold cross-validation is a technique used to assess the performance of a model by
splitting the training data into k subsets (folds). The model is trained on k − 1 folds
and validated on the remaining fold. This process is repeated k times, with each fold
used as the validation set once. The results are then averaged to provide a more
reliable estimate of model performance.
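
A hedged k-fold sketch with cvpartition follows; the choice k = 5 and the placeholder training step are assumptions for illustration:

% 5-fold cross-validation loop over the training data.
k      = 5;
c      = cvpartition(ytrain, 'KFold', k);
valErr = zeros(k, 1);
for i = 1:k
    trIdx = training(c, i);   vaIdx = test(c, i);
    % ... train the classifier on Xtrain(trIdx,:), ytrain(trIdx) ...
    % ... evaluate it on Xtrain(vaIdx,:), ytrain(vaIdx) ...
    valErr(i) = 0;            % placeholder for the fold's validation error
end
meanValErr = mean(valErr);    % averaged estimate of model performance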

Validation Error
Validation error is the measure of how well a machine learning model performs
on unseen data during the validation phase. It provides insight into the model’s
generalization capability and is crucial for detecting overfitting.

Performance Metrics
Performance metrics are quantitative measures used to evaluate the effectiveness of
a machine learning model. Common metrics for classification tasks include accuracy,
precision, recall, and F1-score, which provide insights into the model’s predictive
capabilities.
In assessing the skill of a logistic regression classifier, several performance measures
are crucial for a comprehensive evaluation. Accuracy reflects the overall correctness
of the model’s predictions, but it can be misleading in imbalanced datasets. Sensi-
tivity (or Recall) measures the model’s ability to correctly identify positive instances,
making it essential in situations where missing positive cases is costly (e.g., detect-
ing diseases). Specificity assesses the ability to correctly classify negative instances,
which is important in avoiding false positives. The Area Under the ROC Curve
(AUC-ROC) provides a more holistic measure by summarizing the trade-off between
sensitivity and specificity across different thresholds. A higher AUC indicates that
the model performs well in distinguishing between positive and negative classes.
Together, these metrics provide insights into the model’s strengths and weaknesses,
helping assess how well it generalizes and handles different types of classification
errors.
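
These metrics can be computed from a confusion matrix and an ROC analysis; the following is a hedged sketch assuming predicted labels yhat, predicted probabilities p, and true test labels ytest:

% Performance metrics for a binary classifier with labels in {0, 1}.
C  = confusionmat(ytest, yhat);            % rows: true class, columns: predicted class
TN = C(1,1);  FP = C(1,2);  FN = C(2,1);  TP = C(2,2);
accuracy    = (TP + TN) / sum(C(:));
sensitivity = TP / (TP + FN);              % recall
specificity = TN / (TN + FP);
[~, ~, ~, auc] = perfcurve(ytest, p, 1);   % area under the ROC curve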

2.7 Logistic Regression Classifier


The binary logistic regression model is employed to predict a binary response
based on one or more predictor variables (features). Logistic regression assesses the


relationship between a categorical dependent variable and one or more independent


variables by estimating probabilities through a logistic function, which represents the
cumulative logistic distribution. The term ”regression” signifies that we are fitting a
linear model to the feature space, which can consist of both categorical and contin-
uous variables. Logistic regression adopts a probabilistic approach to classification,
providing a means to model the likelihood of the outcome being one of the two
categories.

Linear Regression Model as a Starting Point


Logistic regression extends the principles of linear regression, where the objective is
to predict a continuous outcome y as a linear combination of input features X:

y = Xβ + ϵ

Here, X ∈ Rn×p is the matrix of feature vectors (with n samples and p features),
β ∈ Rp is the vector of model parameters (coefficients), and ϵ denotes the error
term. For a new observation xi , the predicted output is:

ŷᵢ = xᵢ⊤ β

However, in binary classification, predicting a continuous value is not suitable. In-


stead, we need to transform the output into a range between 0 and 1 to represent
probabilities.

From Probability to Odds and Log-Odds


In logistic regression, we model the probability that the output yi equals 1 as follows:

P(yᵢ = 1 | xᵢ) = σ(xᵢ⊤ β)

where σ(z) is defined as the sigmoid function:

σ(z) = 1 / (1 + e^(−z))
This function ensures that the output is constrained between 0 and 1, reflecting the
probability of the outcome being 1.
The odds of an event occurring is defined as the ratio of the probability of the event
to the probability of the event not occurring:

Odds(yᵢ = 1 | xᵢ) = P(yᵢ = 1 | xᵢ) / (1 − P(yᵢ = 1 | xᵢ)) = σ(xᵢ⊤ β) / (1 − σ(xᵢ⊤ β))

To derive this, we first express 1 − P(yᵢ = 1 | xᵢ):

1 − P(yᵢ = 1 | xᵢ) = 1 − σ(xᵢ⊤ β) = 1 − 1 / (1 + e^(−xᵢ⊤ β)) = e^(−xᵢ⊤ β) / (1 + e^(−xᵢ⊤ β))

Thus, the odds become:

Odds(yᵢ = 1 | xᵢ) = σ(xᵢ⊤ β) / (1 − σ(xᵢ⊤ β)) = [1 / (1 + e^(−xᵢ⊤ β))] / [e^(−xᵢ⊤ β) / (1 + e^(−xᵢ⊤ β))] = e^(xᵢ⊤ β)

Taking the natural logarithm of the odds yields the log-odds or logit:
 
log[ P(yᵢ = 1 | xᵢ) / (1 − P(yᵢ = 1 | xᵢ)) ] = xᵢ⊤ β

Thus, logistic regression models the log-odds of the probability of a binary outcome
as a linear function of the input features.

Logistic Regression Model


For all n observations, the model can be expressed in matrix form:

ŷ = σ(Xβ)

where X ∈ Rn×p is the matrix of feature vectors, β ∈ Rp is the parameter vector, and
ŷ ∈ [0, 1]n represents the predicted probabilities.

Loss Function: Maximum Likelihood Estimation (MLE)


To estimate the parameters β, logistic regression utilizes Maximum Likelihood Es-
timation (MLE). The likelihood function, derived from the probabilities, is defined
as:
L(β) = ∏ᵢ₌₁ⁿ P(yᵢ | xᵢ; β)

The log-likelihood function is easier to optimize and is given by:


ℓ(β) = Σᵢ₌₁ⁿ [ yᵢ log(σ(xᵢ⊤ β)) + (1 − yᵢ) log(1 − σ(xᵢ⊤ β)) ]

In matrix form, the log-likelihood function can be represented as:

ℓ(β) = y ⊤ log(σ(Xβ)) + (1 − y)⊤ log(1 − σ(Xβ))

Negative Log-Likelihood (Loss Function)


To convert the maximization problem of the log-likelihood into a minimization prob-
lem, we consider the negative log-likelihood:
L(β) = −ℓ(β) = − Σᵢ₌₁ⁿ [ yᵢ log(σ(xᵢ⊤ β)) + (1 − yᵢ) log(1 − σ(xᵢ⊤ β)) ]


Optimization Problem

The optimization problem can be stated as:

β* = arg min_β L(β)

Matrix Formulation

In matrix form, if y is the vector of outcomes and X is the design matrix of features,
the negative log-likelihood can be expressed as:

L(β) = −[ y⊤ log(σ(Xβ)) + (1 − y)⊤ log(1 − σ(Xβ)) ]




Gradient and Optimization

To maximize the log-likelihood function, optimization techniques such as gradient


descent are applied, given the non-linearity of the function. The gradient of the
log-likelihood with respect to β is computed as follows:

∇β ℓ(β) = X ⊤ (y − σ(Xβ))

This gradient is utilized in iterative algorithms to update the parameter vector β.

Closed-form Solution and Iterative Methods

Unlike linear regression, logistic regression lacks a closed-form solution due to the
non-linearity introduced by the sigmoid function. Therefore, iterative methods such
as gradient descent, stochastic gradient descent, or the Newton-Raphson method
(known as Iteratively Reweighted Least Squares (IRLS) in logistic regression) are
employed for parameter estimation.
1. Gradient Descent updates the parameters using:

βt+1 = βt + α∇β ℓ(β)

where α is the learning rate.


2. Newton-Raphson employs the Hessian matrix of second derivatives for parame-
ter updates:
βt+1 = βt − H⁻¹ ∇β ℓ(β)
where H represents the Hessian matrix, reflecting the curvature of the log-likelihood
function. The sigmoid function σ(z) is vital in logistic regression. It converts the
output of the linear model, xᵢ⊤ β, into a probability within the range of [0, 1]. This
transformation allows logistic regression to effectively predict binary outcomes. Ad-
ditionally, the derivative of the sigmoid function, σ(z)(1 − σ(z)), guarantees that
the log-likelihood is a concave function, facilitating efficient optimization through
gradient-based methods.


Definition of the Separating Plane


The separating plane in logistic regression is a hyperplane that distinguishes between
two classes in a feature space. In a binary classification problem, this hyperplane is
determined based on the estimated probabilities of the logistic function, which maps
linear combinations of the input features to values between 0 and 1.

Mathematical Representation
Logistic regression models the probability that the dependent variable y equals 1 (the
positive class) given a set of independent variables x. The model can be expressed
as:

P (yi = 1 | xi ) = σ(wT xi + b)
where:
σ(z) = 1 / (1 + e^(−z))
is the logistic (sigmoid) function.

• w is the vector of weights (coefficients) for the features.

• b is the bias (intercept) term.

• xi is the feature vector for the i-th observation.

Separating Hyperplane
The decision boundary, or separating plane, is where the probability is exactly 0.5.
Therefore, we set the probability equal to 0.5:

σ(wT x + b) = 0.5
To find this boundary, we can simplify this equation:
The logistic function equals 0.5 when its argument is zero:

wT x + b = 0
Rearranging gives us the equation of the hyperplane:

wT x = −b

Interpretation of the separating plane


Classification: For any observation x:

• If wT x + b > 0, the predicted class is 1 (positive class).

• If wT x + b < 0, the predicted class is 0 (negative class).


In a two-dimensional feature space, this separating plane is simply a line, and in


three dimensions, it becomes a plane. In higher dimensions, it remains a hyperplane.
Algorithm for Logistic Regression is given in Algorithm 1.

Algorithm 1 Logistic Regression with Train-Test Split and K-Fold Cross-Validation
1: Input: Dataset D = {(x^(i), y^(i))}, i = 1, ..., m; learning rate α; number of iterations T; number of folds k
2: Output: Trained model parameters w, b
3: Step 1: Train-Test Split
4: Split the dataset D into training set Dtrain and test set Dtest with ratio 80:20.
5: Let Xtrain, ytrain be the training features and labels.
6: Let Xtest, ytest be the testing features and labels.
7: Step 2: Initialize weights w = 0 and bias b = 0.
8: Step 3: Gradient Descent on Logistic Regression
9: for each iteration t = 1, 2, ..., T do
10:   Compute the linear combination: z^(i) = w⊤x^(i) + b
11:   Apply the sigmoid function: hθ(x^(i)) = 1 / (1 + e^(−z^(i)))
12:   Calculate gradients:
        ∂J(w, b)/∂wj = (1/m) Σᵢ₌₁ᵐ (hθ(x^(i)) − y^(i)) xj^(i)
        ∂J(w, b)/∂b = (1/m) Σᵢ₌₁ᵐ (hθ(x^(i)) − y^(i))
13:   Update the parameters:
        wj = wj − α · ∂J(w, b)/∂wj,   b = b − α · ∂J(w, b)/∂b
14: end for
15: Step 4: K-Fold Cross-Validation
16: Split the training data Dtrain into k folds.
17: for each fold i = 1, 2, ..., k do
18:   Use the i-th fold as the validation set and the rest as the training set.
19:   Train the logistic regression model using gradient descent on the training set.
20:   Compute the validation error and store it.
21: end for
22: Average the validation errors across the k folds to estimate the model's performance.
23: Step 5: Evaluate on Test Set
24: Compute predictions on the test set Xtest using the final model parameters w and b.
25: Calculate the accuracy or other performance metrics on ytest.
26: Return: Trained model parameters w and b.
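
A hedged MATLAB sketch of the gradient-descent core of Algorithm 1 is given below; the learning rate, iteration count, and threshold are illustrative choices, not values taken from the report:

% Batch gradient descent for logistic regression (ytrain in {0, 1}).
[m, p]  = size(Xtrain);
w = zeros(p, 1);   b = 0;
alpha = 0.01;      T = 5000;
sigmoid = @(z) 1 ./ (1 + exp(-z));
for t = 1:T
    h  = sigmoid(Xtrain * w + b);          % predicted probabilities
    gw = (Xtrain' * (h - ytrain)) / m;     % gradient w.r.t. the weights
    gb = sum(h - ytrain) / m;              % gradient w.r.t. the bias
    w  = w - alpha * gw;
    b  = b - alpha * gb;
end
yhat = double(sigmoid(Xtest * w + b) >= 0.5);   % predicted labels on the test set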


Detailed discussion on the implementation, findings and results of Logistic regression


classifier on the selected dataset will be given in the results and discussion chapter
(Chapter 3).

2.8 Support Vector Machine (SVM) Classifier


Support Vector Machines (SVM) are widely used for classification tasks in machine
learning. The SVM classifier aims to find the optimal separating hyperplane between
two classes, which maximizes the margin between them. This problem can be for-
mulated as a convex optimization problem using Lagrange multipliers and solved
using both primal and dual methods.

Primal Problem Formulation


The primal optimization problem for an SVM is defined as follows:
min_{w,b} (1/2) ∥w∥²

subject to:
yi (wT xi + b) ≥ 1, ∀i = 1, 2, . . . , n
where xi ∈ Rd are the input vectors, yi ∈ {−1, 1} are the class labels, w ∈ Rd is the
weight vector, and b ∈ R is the bias.
This optimization problem can also be written in matrix form:
min_{w,b} (1/2) w⊤w

subject to:
Y (Xw + b) ≥ 1
where X ∈ Rn×d is the matrix of input vectors, Y ∈ Rn×n is the diagonal matrix of
labels, and 1 ∈ Rn is the vector of ones.

Lagrangian for the Primal Problem


The primal problem can be solved using the Lagrangian method. Define the La-
grangian function:
L(w, b, λ) = (1/2) ∥w∥² − Σᵢ₌₁ⁿ λᵢ [ yᵢ(w⊤xᵢ + b) − 1 ]

where λi ≥ 0 are the Lagrange multipliers.


The optimal solution satisfies the following KKT conditions:
• ∂L/∂w = w − Σᵢ₌₁ⁿ λᵢ yᵢ xᵢ = 0

• ∂L/∂b = −Σᵢ₌₁ⁿ λᵢ yᵢ = 0

• λᵢ [ yᵢ(w⊤xᵢ + b) − 1 ] = 0,  λᵢ ≥ 0

From the first condition, we derive:

w = Σᵢ₌₁ⁿ λᵢ yᵢ xᵢ

In matrix form, this becomes:

w = X⊤ Λ y
where Λ = diag(λ1 , . . . , λn ) is the diagonal matrix of Lagrange multipliers.

Dual Problem Formulation


Substituting the expression for w into the Lagrangian, we eliminate w and b, leading
to the dual problem:
min_λ (1/2) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ λᵢ λⱼ yᵢ yⱼ K(xᵢ, xⱼ) − Σᵢ₌₁ⁿ λᵢ

subject to the constraints:

Σᵢ₌₁ⁿ λᵢ yᵢ = 0,  λᵢ ≥ 0

In matrix form, the dual problem is written as:

min_λ (1/2) λ⊤(Y X X⊤ Y)λ − 1⊤λ

where Y ∈ Rn×n is the diagonal matrix of labels and λ ∈ Rn is the vector of Lagrange
multipliers.

Solving the SVM Dual Problem


The dual problem is a quadratic programming (QP) problem, which can be solved
using numerical methods. Once the optimal λ is obtained, the weight vector w is
computed as:
w = Σᵢ₌₁ⁿ λᵢ yᵢ xᵢ
The bias term b is computed from the support vectors (instances for which λi > 0).

Decision Function
The decision function for a new input x is given by:
f(x) = w⊤x + b = Σᵢ₌₁ⁿ λᵢ yᵢ xᵢ⊤x + b

The predicted class label ŷ is:


ŷ = sign(f (x))
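
A hedged sketch of recovering the primal solution and predictions from the dual variables is shown below; lambda is the solver output, the labels y are in {−1, +1}, and tol and Xnew are assumed names:

% Recover w, support vectors, bias, and predictions from the dual solution.
tol = 1e-5;                                % numerical tolerance for lambda > 0
w   = X' * (lambda .* y);                  % w = sum_i lambda_i * y_i * x_i
sv  = find(lambda > tol);                  % indices of the support vectors
b   = mean(y(sv) - X(sv, :) * w);          % average of y_s - w'*x_s over support vectors
yhat = sign(Xnew * w + b);                 % class predictions for new inputs Xnew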


Kernel Trick for Nonlinear SVM


In the case of non-linear decision boundaries, the kernel trick can be used to map
the input data into a higher-dimensional space. The dual problem becomes:
min_λ (1/2) λ⊤(Y K Y)λ − 1⊤λ

where K is the kernel (Gram) matrix with entries K(xᵢ, xⱼ) = ϕ(xᵢ)⊤ ϕ(xⱼ).
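
For reference, a Gaussian (RBF) kernel matrix of this form can be constructed as follows; the width parameter sigma is an assumed value to be tuned:

% RBF kernel matrix: K(i,j) = exp(-||x_i - x_j||^2 / (2*sigma^2)).
sigma = 1;                                 % kernel width (assumed; tune by validation)
D = pdist2(X, X);                          % pairwise Euclidean distances
K = exp(-D.^2 / (2 * sigma^2));            % n x n kernel (Gram) matrix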
The SVM classifier can be formulated as a convex optimization problem, which can
be solved using the Lagrange multipliers method. Both the primal and dual problems
have well-defined solutions, and the dual formulation provides insights into the role
of support vectors and the kernel trick for handling non-linear classification tasks.

2.8.1 Advantages of Convex Optimization in SVM


Convex optimization offers several significant advantages when applied to SVM clas-
sification:

• Global Optimum: Convex problems guarantee that any local minimum is the
global minimum. This is crucial in SVM, ensuring that the optimal separating
hyperplane is found without getting trapped in local minima.

• Efficiency with Quadratic Programs: The optimization problems in SVM (both primal and dual) are quadratic programs, which are convex by nature. Efficient algorithms such as interior-point methods and active-set methods can be used to solve these quadratic convex optimization problems.

• Handling of Constraints: Convex optimization frameworks like CVX can easily handle constraints (both equality and inequality), which are naturally imposed in SVM formulations. These include the non-negativity of the Lagrange multipliers and margin constraints.

• Kernel Methods: The dual formulation of SVM allows the use of kernel func-
tions, enabling classification in high-dimensional spaces without explicitly com-
puting the coordinates. Convex optimization helps solve these non-linear prob-
lems efficiently.

Convex Optimization and the Lagrange Function in SVM


The SVM optimization problem naturally leads to a convex optimization framework.
In the primal form, the goal is to maximize the margin between two classes, which
results in a convex optimization problem due to the quadratic nature of the margin
constraint.
The Lagrange function is used to incorporate these constraints into the objective
function. By transforming the problem into its dual form, we solve for the Lagrange
multipliers, which are constrained to be non-negative. The convexity of the dual
problem ensures it can be solved efficiently using convex optimization techniques.


Benefits of Solving the Dual Formulation of SVM


The dual formulation of SVM provides several advantages over the primal formula-
tion:

• Identification of Support Vectors: The solution to the dual problem provides the Lagrange multipliers $\lambda_i$ associated with each data point. Only the data points with non-zero values of $\lambda_i$ lie on the margin, and these are called support vectors.

• Efficiency with Kernels: The dual formulation allows the use of kernel func-
tions, which enable SVM to handle non-linearly separable data by implicitly
mapping data points to higher-dimensional spaces.

• Regularization: In the dual form of soft margin SVM, the regularization pa-
rameter C is naturally incorporated as an upper bound on the Lagrange mul-
tipliers. This helps balance the trade-off between maximizing the margin and
minimizing the classification error.

CVX Syntax for Solving Dual SVM Problems in MATLAB


The CVX toolbox in MATLAB can be used to solve both the hard margin and soft
margin SVM dual problems using convex optimization. Below is the CVX syntax for
each case.

Hard Margin SVM in MATLAB

The following code solves the dual problem for a hard margin SVM using CVX:

% Inputs:
% K: Kernel matrix (n x n) where K(i,j) = K(x_i, x_j)
% y: Labels vector (n x 1), y_i in {-1, +1}
% n: Number of data points

cvx_begin
variable lambda(n)
minimize( 0.5 * quad_form(lambda .* y, K) - sum(lambda) )
subject to
sum(lambda .* y) == 0
lambda >= 0
cvx_end

Soft Margin SVM in MATLAB

The following code solves the dual problem for a soft margin SVM using CVX:


% Inputs:
% K: Kernel matrix (n x n)
% y: Labels vector (n x 1)
% C: Regularization parameter

cvx_begin
variable lambda(n)
minimize( 0.5 * quad_form(lambda .* y, K) - sum(lambda) )
subject to
sum(lambda .* y) == 0
0 <= lambda <= C
cvx_end

After solving the dual problem, the optimal Lagrange multipliers λ can be interpreted
as follows:

• Support Vectors: Data points corresponding to non-zero $\lambda_i$ values are the support vectors.

The weight vector $w$ can be computed as:
\[
w = \sum_{i=1}^{n} \lambda_i y_i x_i
\]
The bias term $b$ can be computed using any support vector $x_i$ with $0 < \lambda_i < C$ as:
\[
b = y_i - \sum_{j=1}^{n} \lambda_j y_j K(x_j, x_i)
\]
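For illustration, once cvx_end returns the optimal multipliers, the support vectors, the weight vector, and the bias can be recovered with a few lines of MATLAB. This is a minimal sketch under the assumption of a linear kernel (K = X*X') and the soft margin formulation; the tolerance tol and the variable names are illustrative. In the hard margin case, the upper-bound check against C is simply dropped.

% Minimal sketch: post-processing of the CVX solution (linear kernel, soft margin assumed).
tol = 1e-5;                                % assumed numerical tolerance
sv  = find(lambda > tol);                  % indices of the support vectors
w   = X' * (lambda .* y);                  % weight vector, w = sum_i lambda_i y_i x_i
k   = sv(find(lambda(sv) < C - tol, 1));   % a margin support vector with 0 < lambda_k < C
b   = y(k) - X(k,:) * w;                   % bias from the margin condition y_k (w' x_k + b) = 1
yhat = sign(X*w + b);                      % predicted labels on the training data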

Convex optimization plays a crucial role in solving SVM classification problems. By transforming the primal problem into its dual form, we can solve for the Lagrange multipliers, identify support vectors, and efficiently handle non-linear classification problems using kernel methods. The CVX package in MATLAB provides a straightforward way to solve both hard and soft margin SVM problems. The algorithm to implement the SVM classifier using the CVX solver is given in Algorithm 2.


Algorithm 2 Solving SVM Classification via Convex Optimization


1: Input: Training data {(xi , yi )}ni=1 , regularization parameter C, kernel function
K(xi , xj ) (optional)
2: Output: Optimal weight vector w, bias term b, decision function f (x)
3: Step 1: Initialize Parameters
4: Initialize Lagrange multipliers λi ← 0 for i = 1, . . . , n
5: Set convergence criteria ϵ
6: Initialize weight vector w ← 0, bias b ← 0
7: Step 2: Formulate the Dual Problem
8: Formulate the dual objective:
\[
\min_{\lambda} \ \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \lambda_i
\]

9: Subject to constraints:
\[
\sum_{i=1}^{n} \lambda_i y_i = 0, \qquad 0 \leq \lambda_i \leq C
\]

10: Step 3: Solve the Dual Problem


11: Solve the quadratic optimization problem using a suitable QP solver such as SMO
or any other algorithm.
12: Step 4: Compute the Weight Vector
13: Once λ is obtained, compute:

\[
w \leftarrow \sum_{i=1}^{n} \lambda_i y_i x_i
\]

(For a kernel SVM, use the implicit kernel representation for w.)
14: Step 5: Compute the Bias Term
15: Choose any support vector xk where 0 < λk < C and compute the bias:

\[
b \leftarrow y_k - \sum_{i=1}^{n} \lambda_i y_i K(x_i, x_k)
\]

16: Step 6: Construct the Decision Function


17: Define the decision function as: $f(x) \leftarrow \sum_{i=1}^{n} \lambda_i y_i K(x_i, x) + b$
18: Predict the class label for a new input $x$ as: $\hat{y} \leftarrow \operatorname{sign}(f(x))$
19: Step 7: Convergence Check
20: Check the convergence of the optimization process. If the solution has not con-
verged within the specified tolerance ϵ, repeat the optimization process.
21: Step 8: Output the Classifier
22: Return the weight vector w, bias b, and decision function f (x).


2.9 Decision Tree Classifier


Decision trees are a popular and interpretable model used for classification and re-
gression tasks in machine learning. The algorithm builds a tree-like model of deci-
sions and their possible consequences, effectively partitioning the feature space into
distinct regions. The main objective is to create a model that predicts the target
variable by learning simple decision rules inferred from the data features.

Tree Structure Representation


A decision tree is represented as a hierarchical structure composed of nodes and
branches. Each internal node represents a decision based on a feature, each branch
represents the outcome of the decision, and each leaf node represents a class label
(for classification tasks) or a continuous value (for regression tasks).
Let D be the dataset with n instances, where each instance is represented as (xi , yi )
for i = 1, 2, . . . , n, with xi ∈ Rd as the feature vector and yi as the target variable.

Splitting Criteria
The core of building a decision tree lies in selecting the best feature to split the data
at each node. The goal is to maximize the information gain or minimize the impurity
after the split.

Information Gain

Information Gain (IG) measures the reduction in entropy after a dataset is split on
an attribute. The entropy H(D) of a dataset D is defined as:
\[
H(D) = -\sum_{c} P(c|D)\, \log_2 P(c|D)
\]

where P (c|D) is the proportion of instances in class c.


When splitting the dataset D on feature A, the entropy of the resulting subsets Dv
for each value v of A is computed as follows:
\[
H(D|A) = \sum_{v} P(v|D)\, H(D_v)
\]

The Information Gain for the attribute A is given by:

IG(D, A) = H(D) − H(D|A)

The attribute that yields the highest Information Gain is selected for the split.


Gini Impurity

Alternatively, Gini Impurity can be used as a splitting criterion. The Gini Impurity
G(D) of a dataset D is defined as:
\[
G(D) = 1 - \sum_{c} P(c|D)^2
\]

For a split on feature A, the Gini Impurity after the split is given by:
\[
G(D|A) = \sum_{v} P(v|D)\, G(D_v)
\]

The feature that minimizes the Gini Impurity is chosen for the split.
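To make the two splitting criteria concrete, the following minimal MATLAB sketch scores a single candidate categorical split; the file name splitScores.m, the helper functions, and the variable names y (labels) and a (feature values) are illustrative assumptions rather than part of the implementation used in this study.

function [ig, giniSplit] = splitScores(y, a)
% Score one candidate split of labels y by the categorical feature a.
% Returns the information gain and the weighted Gini impurity of the split.
    Hparent = entropyOf(y);
    vals = unique(a);
    Hcond = 0; giniSplit = 0;
    for k = 1:numel(vals)
        idx = (a == vals(k));                    % instances falling in branch v
        p = mean(idx);                           % P(v|D)
        Hcond = Hcond + p * entropyOf(y(idx));   % weighted entropy H(D|A)
        giniSplit = giniSplit + p * giniOf(y(idx));
    end
    ig = Hparent - Hcond;                        % IG(D, A) = H(D) - H(D|A)
end

function H = entropyOf(y)
    p = histcounts(categorical(y)) / numel(y);   % class proportions P(c|D)
    p = p(p > 0);
    H = -sum(p .* log2(p));
end

function g = giniOf(y)
    p = histcounts(categorical(y)) / numel(y);
    g = 1 - sum(p.^2);
end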

Recursive Partitioning
The process of constructing a decision tree involves recursively partitioning the data
based on the selected features until a stopping criterion is met. Common stopping
criteria include:

• Maximum tree depth

• Minimum number of samples in a node

• No further information gain from splits

At each leaf node, a prediction is made based on the majority class (for classification)
or the average value (for regression) of the instances in that node.

Overfitting and Pruning


One challenge in building decision trees is overfitting, where the model becomes too
complex and captures noise in the data. Pruning is a technique used to address this
issue by removing branches that provide little predictive power.
Two common pruning strategies are:

• Pre-pruning: Stop growing the tree when further splits do not significantly
improve the model (e.g., based on a threshold for Information Gain).

• Post-pruning: Grow the full tree and then remove nodes that do not improve
model performance on a validation set.


Decision Rule
Once the decision tree is constructed, the decision rule for predicting a new instance
x can be formulated as follows:
1. Start at the root node and evaluate the feature $x_j$ corresponding to the decision.
2. Traverse the tree by following the branches based on the values of $x_j$ until a leaf node is reached.
3. The predicted class label $\hat{y}$ is the label associated with the leaf node.
Mathematically, the prediction can be represented as:

ŷ = f (x) = label of the leaf node reached by x

Convex Optimization Model


Although decision trees are not typically framed as convex optimization problems,
we can discuss the optimization perspective in terms of minimizing the overall im-
purity across the tree. The goal is to minimize the weighted impurity of the nodes:
\[
\min_{\theta} \ \sum_{v} P(v|D)\, G(D_v)
\]

where θ represents the parameters defining the splits of the tree. While individual
node splits may not yield a convex loss function, the overall objective can be viewed
through an optimization lens.
The convex nature arises in ensemble methods built upon decision trees, such as Gra-
dient Boosting, where loss functions can be designed to be convex. The optimization
framework often involves:

\[
\min_{\theta} \ L(y, f(x; \theta))
\]

where L is a convex loss function, y is the target variable, and f (x; θ) represents the
prediction from the ensemble of trees.

Advantages of Decision Trees


Decision trees offer several advantages as a machine learning model:

• Interpretability: Decision trees provide a clear and interpretable model that can be visualized easily.

• Non-parametric: They do not assume any underlying distribution for the data.

• Handling Mixed Data Types: Decision trees can handle both numerical and
categorical data.

• Feature Importance: Decision trees naturally provide insights into the impor-
tance of different features.


Overall, decision trees are a versatile and powerful tool in machine learning, provid-
ing a foundation for more complex ensemble methods such as Random Forests and
Gradient Boosting Machines.

Algorithm 3 Decision Tree Algorithm


1: Input: Dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, stopping criteria
2: Output: Decision tree model
3: Step 1: If all instances in D belong to the same class, return a leaf node with
that class label
4: Step 2: If stopping criteria are met, return a leaf node with the majority class
label in D
5: Step 3: For each feature Aj , calculate Information Gain or Gini Impurity
6: Step 4: Select the feature Ak with the highest Information Gain or lowest Gini
Impurity for the split
7: Step 5: Split the dataset D into subsets Dv based on the values of feature Ak
8: for each subset Dv do
9: Step 6: Recursively call the decision tree algorithm on subset Dv
10: end for
11: Step 7: Combine the results to form the decision tree

2.9.1 Lagrangian Formulation of Decision Trees


Problem Statement
Given a dataset (X, y), where X ∈ Rn×m is the feature matrix with n samples and
m features, and y ∈ Rn is the target vector, we aim to construct a decision tree that
minimizes a loss function while adhering to certain constraints.

Objective Function
The objective is to minimize the total loss, which can be defined as the sum of the
loss at each node of the tree. A common choice for the loss function is the mean
squared error (MSE) for regression tasks or cross-entropy for classification tasks.
The overall optimization problem can be formulated as:
\[
\min_{T} \ \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \lambda \cdot R(T) \tag{2.1}
\]

Where:

• L(yi , ŷi ) is the loss at node i.

• R(T ) is a regularization term (e.g., tree depth, number of leaves).

• λ is a hyperparameter controlling the trade-off between the loss and the regu-
larization.


Lagrangian Function

To include constraints, we define a Lagrangian function. For a decision tree, one might include constraints on the maximum depth of the tree and the minimum number of samples at each leaf node.
The Lagrangian $\mathcal{L}$ can be defined as:
\[
\mathcal{L}(T, \alpha, \beta) = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \lambda \cdot R(T) + \alpha (D - D_{\max}) + \beta (N_{\min} - N) \tag{2.2}
\]

Where:

• D is the depth of the tree.

• Dmax is the maximum allowed depth.

• Nmin is the minimum required samples at a leaf node.

• N is the number of samples in the node.

Interpretation of Lagrange Multipliers

The Lagrange multipliers $\alpha$ and $\beta$ play a critical role in balancing the trade-off between minimizing the loss function and satisfying the constraints. A positive multiplier indicates that the corresponding constraint is active, suggesting that the optimization process will prioritize satisfying this constraint. As the optimization progresses, the multiplier values adjust, reflecting the importance of each constraint relative to the loss function.

2.10 Built-in Machine Learning Functions in MATLAB


To supplement custom-built models with standardized methods, several built-in ma-
chine learning functions are available in MATLAB. These functions, such as fitglm,
fitcsvm, and fitctree, allow for efficient model fitting, fine-tuning, and evaluation,
making them useful for both preliminary analysis and performance benchmarking
against hand-crafted models.

2.10.1 Logistic Regression using fitglm


Logistic regression is a widely used technique for binary classification tasks, and
MATLAB provides the fitglm function for this purpose. The fitglm function fits a
generalized linear model to the data, which in the context of binary classification
utilizes a binomial distribution to model the probability of a given class. The basic
syntax for fitting a logistic regression model is:

model = fitglm(X, y, ’Distribution’, ’binomial’);


Here, X is the matrix containing the predictor variables, and y is the response vari-
able, which must be a binary outcome. The function fits the model by employing
maximum likelihood estimation under the assumption of a binomial distribution,
thus creating a probability-based classification. The trained model can be used for
predicting new outcomes using the predict method, and the confidence intervals
of the coefficients can be obtained via the coefCI method. This function provides
flexibility for logistic regression problems and supports customization in terms of
link functions and interaction terms, making it robust for exploring the relationship
between predictor variables and binary outcomes.
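As an illustration of this workflow, the following minimal sketch fits such a model on a random training split and queries it on the remaining data; the 70/30 split, the 0.5 threshold, and the variable names are assumptions of the example.

% Minimal sketch: logistic regression with fitglm and prediction on held-out data.
n = size(X, 1);
idx = randperm(n);
nTrain = round(0.7 * n);                 % assumed 70/30 train-test split
tr = idx(1:nTrain);  te = idx(nTrain+1:end);

model = fitglm(X(tr,:), y(tr), 'Distribution', 'binomial');
ci    = coefCI(model);                   % confidence intervals of the coefficients
probs = predict(model, X(te,:));         % predicted probabilities of the positive class
yhat  = probs >= 0.5;                    % thresholded class labels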

2.10.2 Support Vector Machines using fitcsvm


Support Vector Machines (SVMs) are a powerful classification technique, particu-
larly useful for high-dimensional spaces and cases where the classes are separable
by a hyperplane. In MATLAB, the fitcsvm function implements SVMs for binary
classification. The syntax to train a basic linear SVM is as follows:

model = fitcsvm(X, y, ’KernelFunction’, ’linear’);

In this case, X represents the set of features, and y contains the class labels. The
KernelFunction argument specifies the type of kernel to use, with common options
including ’linear’, ’polynomial’, and ’rbf’ (radial basis function). SVMs are designed
to find the hyperplane that maximally separates the two classes in the feature space.
The choice of kernel function allows the SVM to handle non-linearly separable data
by transforming it into a higher-dimensional space where a linear separation is pos-
sible.
Once the model is trained, predictions can be made using the predict method.
Additionally, the model’s performance can be evaluated using cross-validation by
applying the crossval method, which divides the dataset into training and testing
sets to estimate the generalization error. This method is crucial for assessing the
model’s reliability on unseen data, ensuring that the chosen hyperparameters (such
as the regularization parameter or kernel type) are appropriate for the problem.
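A minimal sketch of this usage, with an RBF kernel and a 5-fold cross-validation estimate of the generalization error, is shown below; the kernel choice, standardization flag, and fold count are assumptions of the example.

% Minimal sketch: RBF-kernel SVM with fitcsvm and 5-fold cross-validated loss.
model  = fitcsvm(X, y, 'KernelFunction', 'rbf', 'Standardize', true);
yhat   = predict(model, X);              % in-sample predictions
cvmdl  = crossval(model, 'KFold', 5);    % 5-fold cross-validated model
cvloss = kfoldLoss(cvmdl);               % estimated generalization error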

2.10.3 Decision Trees using fitctree


Decision trees are a non-parametric, interpretable method for classification, which
recursively splits the data based on feature values to create a tree-like structure.
MATLAB’s fitctree function provides a straightforward way to build decision trees
for classification tasks. The basic syntax is:

model = fitctree(X, y);

The matrix X contains the predictor variables, while y holds the corresponding class
labels. Decision trees work by recursively partitioning the feature space into regions
that maximize the separation between different classes. The split criterion is often
based on measures such as Gini impurity or information gain. The resulting tree can


be visualized using the view function, which produces a graphical representation of the splits and class assignments at each node.
This function allows for further customization, such as setting the minimum leaf
size to control overfitting, and the depth of the tree can be constrained to ensure
that the model generalizes well to new data. Predictions can be made using the
predict method, and the model can also undergo cross-validation to estimate its
performance using the crossval method.
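A corresponding minimal sketch for classification trees, with a minimum leaf size to limit overfitting and a graphical view of the fitted tree, is given below; the parameter value is an assumption of the example.

% Minimal sketch: classification tree with a leaf-size constraint, visualization, and CV loss.
model  = fitctree(X, y, 'MinLeafSize', 10);        % assumed minimum leaf size
view(model, 'Mode', 'graph');                      % graphical view of the splits
yhat   = predict(model, X);
cvloss = kfoldLoss(crossval(model, 'KFold', 5));   % cross-validated error estimate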

2.10.4 Comparative Evaluation and Tuning of Models


Once models are trained using the built-in functions, their performance can be com-
pared against the custom-developed models by evaluating standard classification
metrics such as accuracy, precision, recall, and the F1 score. These metrics can be
computed using MATLAB’s confusionmat function, which creates a confusion matrix
by comparing true class labels with predicted labels. This allows for the calculation
of sensitivity (true positive rate), specificity (true negative rate), and overall accu-
racy.
For a more detailed evaluation of the model’s discriminative ability, the receiver
operating characteristic (ROC) curve can be plotted, and the area under the curve
(AUC) can be computed to measure the trade-off between sensitivity and specificity
at different classification thresholds.
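The metric computations described above can be sketched as follows; the class coding, the positive-class label passed to perfcurve, and the variable names yTrue, yPred, and scores are assumptions of the example.

% Minimal sketch: confusion-matrix metrics and ROC/AUC for a binary classifier.
C  = confusionmat(yTrue, yPred);                   % rows: true classes, columns: predicted classes
TP = C(2,2); TN = C(1,1); FP = C(1,2); FN = C(2,1);% assumes class order {negative, positive}
accuracy    = (TP + TN) / sum(C(:));
sensitivity = TP / (TP + FN);                      % true positive rate
specificity = TN / (TN + FP);                      % true negative rate
precision   = TP / (TP + FP);
f1 = 2 * precision * sensitivity / (precision + sensitivity);

[rocX, rocY, ~, auc] = perfcurve(yTrue, scores, 1); % scores: positive-class scores; 1: assumed positive label
plot(rocX, rocY); xlabel('False positive rate'); ylabel('True positive rate');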
Additionally, the built-in models offer hyperparameter tuning options. For exam-
ple, the regularization parameter in SVMs can be adjusted to control the trade-off
between margin size and classification error on the training data. This fine-tuning
process helps to improve the model’s generalization performance, ensuring that it
performs well not only on the training data but also on unseen data.

2.11 Developing an Explainable Tumor Classification Model
While traditional supervised learning models have demonstrated high accuracy, the
complexity of medical data, particularly in tumor classification, necessitates a more
nuanced approach. The next stage of research focuses on designing a machine learn-
ing model that can identify patterns in medical data, much like an experienced doc-
tor, without predefined labels. This approach aims to handle the inherent variability
in tumor data, where outliers may represent different stages or types of tumors that
are crucial for diagnosis and treatment planning.
Using unsupervised methods such as clustering, the model can uncover hidden
patterns in the data that align with medical science. This approach allows for the
creation of explainable tumor classes based on the identified patterns, offering in-
sights into potential new subtypes or stages of disease progression. Such a model
is adaptive, evolving as new data—including emerging tumor types—becomes avail-
able, thus providing a flexible tool for medical practitioners.


However, a critical challenge in this path is maintaining the explainability of the model’s findings. Techniques like dimensionality reduction compress data into a
latent space that may obscure clinically relevant features. Ensuring that the model’s
output is interpretable by doctors remains a key objective, as the use of original
medical data is vital for transparency and trust in real-world applications.
This direction opens the door for future research aimed at building a robust ML sys-
tem capable of autonomously identifying and classifying tumors, ultimately aiding
in personalized treatments and more accurate diagnoses.

2.12 Conclusion
In this study, we revisited the mathematical models of classical machine learning
algorithms, specifically focusing on Logistic Regression, Support Vector Machines
(both hard and soft margin), and Decision Trees. A significant aspect of our approach
involved employing convex optimization techniques to formulate and solve these
models. The convex optimization framework allowed for effective minimization of
the loss functions associated with each algorithm, ensuring global optima in the
training process.
These custom models were utilized to predict the likelihood of breast cancer using
the Breast Cancer dataset from the University of Wisconsin at Madison. By apply-
ing these models, we gained insights into the underlying relationships between the
features and the target variable, which is crucial for understanding the predictive
capabilities of each algorithm in the context of breast cancer diagnosis.
Subsequently, we leveraged built-in MATLAB functions such as fitglm, fitcsvm, and
fitctree to perform the same predictive task.

Chapter 3

Results and Discussions

3.1 Introduction
This chapter presents the results obtained from implementing various classification
algorithms to predict breast cancer using the Breast Cancer dataset from the Uni-
versity of Wisconsin at Madison. The algorithms evaluated in this study include
custom implementations of Logistic Regression, Support Vector Machines (SVM)
with both hard and soft margins, and Decision Trees, all formulated through convex
optimization techniques. Additionally, the performance of built-in MATLAB func-
tions—fitglm, fitcsvm, and fitctree—is also assessed for comparison.
The primary objective of this chapter is to provide a detailed analysis of the classifi-
cation performance of each algorithm based on several metrics, including accuracy,
sensitivity, specificity, F1 score, Receiver Operating Characteristic (ROC) curve, and
Area Under the Curve (AUC). By systematically evaluating the results, we aim to
uncover insights into the strengths and weaknesses of each approach in predicting
breast cancer.
Furthermore, the discussions will address the implications of the findings, emphasiz-
ing the significance of mathematical modeling and optimization in machine learn-
ing applications. The interplay between the custom models and MATLAB’s built-in
functions will be explored to highlight how both approaches contribute to achieving
robust predictive performance in the medical domain. Through this examination,
we aim to provide a comprehensive understanding of the effectiveness of different
classification techniques in the context of breast cancer diagnosis.

3.2 Skill of Logistic Regression Classifier


In preparation for logistic regression to predict malignant cells, we conducted a cor-
relation analysis to identify the most dominant features associated with the target
variable (malignancy). The analysis revealed six features with a correlation coeffi-
cient greater than 0.75, indicating a strong relationship with the diagnosis outcome.
The distribution of these six features across the target classes (malignant and be-
nign) is visualized in the figure, providing insights into how these features differ


between the two classes. This analysis serves as a foundational step in feature se-
lection for the logistic regression model. Our aim is to find the optimal number of
features which produce better performance metrics in logistic regression. Distribu-
tion of top six dominant feature in classification into benign or malignant is shown
in Figure 3.1.

(a) Top three dominant features    (b) Top four to six dominant features
(Panels show density plots of concave points worst, perimeter worst, concave points mean, radius worst, perimeter mean, and area worst for the malignant and benign classes.)
Figure 3.1: Distribution of dominant features across the target variable (malignant and
benign) before preprocessing.

In logistic regression, outlier removal and feature scaling are critical preprocessing
steps that significantly improve model performance. Outliers can distort the deci-
sion boundary by disproportionately affecting the estimated coefficients, leading to
poor generalization and inaccurate predictions. Logistic regression assumes a lin-
ear relationship between the features and the log-odds of the target class. Outliers,
especially in features with large values, can skew this relationship, resulting in an
overfitted model with reduced interpretability.

Feature scaling is equally important because logistic regression is sensitive to the relative magnitudes of feature values. Features with larger scales may dominate the learning process, leading to biased coefficient estimates. Scaling ensures that all features contribute equally to the optimization of the cost function during gradient descent. This improves convergence speed and stability of the optimization process.

After applying these preprocessing steps, a correlation analysis was performed to identify the six most correlated features with the target variable (malignant or benign). The top six features were selected based on their correlation coefficient (greater than 0.74), and their distribution across the target variable is shown in Figure 3.2. These features provide the most predictive power in distinguishing between malignant and benign cells, enhancing the logistic regression classifier’s ability to make accurate predictions.


(a) Top three dominant features    (b) Top four to six dominant features
(Panels show density plots of concave points worst, area worst, perimeter worst, concave points mean, radius worst, and perimeter mean for the malignant and benign classes.)

Figure 3.2: Distribution of dominant features after outlier removal and scaling.

Incorporating these steps ensures a robust and reliable logistic regression model,
capable of accurately predicting the likelihood of breast cancer based on the most
informative features.
It is observed that after scaling and outlier removal, the top six highly correlated fea-
tures with the target variable remained the same, though their order of correlation
changed slightly.
Here, we explored a mathematical approach to develop a binary classifier from a linear regressor. Using a linear model, the output is predicted as $W^T x$, where $W$ represents the weight vector and $x$ the input features. For binary classification, the separating plane is defined by the equation $W^T x = 0$, which divides the input space into two regions corresponding to the two classes.
The key objective is to optimize the weight vector W using a kernel regressor de-
signed for linear regression. By employing kernel regression, we aimed to capture
nonlinear relationships between features and improve the performance of the model.
The linear regressor’s predictions are then transformed into binary outputs based on
the separating plane, where values greater than zero were classified into one class,
and those less than zero into the other.
This approach bridges the gap between regression and classification by transferring
the continuous output of a regressor to define a decision boundary for classifica-
tion. The results of this methodology demonstrated that the binary classifier derived
from the optimized linear regressor performs effectively in separating the two target
classes. This method is particularly useful when there is a linear or nearly linear
relationship between the input features and the output classes.
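A minimal MATLAB sketch of this thresholding step is given below; the least-squares fit via the pseudo-inverse is only illustrative, and the class labels are assumed to be coded as -1/+1 so that $W^T x = 0$ is the natural boundary.

% Minimal sketch: thresholding a least-squares linear regressor to obtain class labels.
W = pinv(X) * y;                          % least-squares weight vector (illustrative fit)
scores = X * W;                           % continuous outputs W'x
yhatLinear  = sign(scores);               % linear boundary  W'x = 0
yhatSigmoid = 2*((1 ./ (1 + exp(-scores))) >= 0.5) - 1;   % identical partition via sigma(W'x) = 0.5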
Models trained on all 30 cleaned and scaled features show similar performance metric scores in logistic regression.
Figure 3.3 shows the separating planes and data points plotted over the top two
correlated features in both linear separating boundary (Figure 3.3a) and sigmoid
boundary (Figure 3.3b).


(a) Separating boundary $W^T X = 0$ of linear regression    (b) Separating boundary $\sigma(W^T X) = 0.5$ of sigmoid regression

Figure 3.3: Separating planes and data distribution in linear and logistic regression

A comparison of the performance of the generated linear regressor and logistic regressor is shown in Table 3.1.

Table 3.1: Performance Comparison of Linear and Logistic Regressors for Binary Classi-
fication

Metric Linear Regressor Logistic Regressor


Accuracy 0.56 0.97
Sensitivity 1.00 0.96
Specificity 0.30 0.97
F1 Score 0.471 0.96
AUC 0.20 0.99

From Table 3.1, it is evident that the logistic regressor substantially outperforms the linear regressor.


In this study, we aimed to determine the optimal number of features that enhance the classification performance of our model. To achieve this, we employed 5-fold cross-validation, which helps stabilize performance metrics by mitigating the variability that may arise from different training and testing splits. We applied a logistic regression classifier across various feature subsets, using the correlation with the target variable as a benchmark for feature selection. This approach enabled us to systematically evaluate the impact of feature count on classification skill and identify the most effective subset for improved predictive accuracy. Table 3.2 presents the performance metrics associated with different numbers of features corresponding to the selected correlation level.


Table 3.2: Performance metrics of logistic regression classifier across different feature
subsets

#Features ρ≥ Acc-K-fold Acc Sen Spe Er.rate AUC F1-Score


5 0.75 0.95 0.95 0.92 0.96 0.045 0.990 0.9396
9 0.70 0.95 0.96 0.94 0.97 0.030 0.993 0.9548
13 0.60 0.93 0.97 0.95 0.97 0.029 0.994 0.9599
15 0.50 0.95 0.96 0.93 0.98 0.030 0.990 0.9543
20 0.40 0.96 0.98 0.96 0.98 0.019 0.990 0.9699
23 0.30 0.95 0.97 0.96 0.98 0.020 0.990 0.9699
25 0.20 0.95 0.98 0.97 0.98 0.017 0.998 0.9750

3.2.1 Performance Metrics Analysis


The analysis of the performance metrics presented in Table 3.2 emphasizes the goal
of developing a robust and consistent model with an optimal number of features for
binary classification in medical diagnosis.

3.2.2 Overview of Performance Metrics


Number of Features and Correlation (ρ)

The correlation values (ρ) indicate a decline from 0.75 with the 5-feature subset
to 0.20 with the 25-feature subset. This suggests that while fewer features exhibit
stronger individual relationships with the target variable, the inclusion of more fea-
tures does not necessarily correlate with a proportional improvement in model per-
formance.

Accuracy (Acc)

The accuracy of the model shows minimal variation as the number of features in-
creases, peaking at 0.98 with both the 20 and 25 feature subsets. This indicates
that while the model maintains a high level of accuracy, the incremental benefit of
adding more features is marginal. The results suggest a point of diminishing returns
in terms of accuracy, emphasizing the importance of feature selection for model sim-
plicity and interpretability.

K-fold Cross-Validation Accuracy (Acc-K-fold)

The K-fold cross-validation accuracy remains relatively consistent across all feature
subsets, ranging from 0.93 to 0.96. This stability suggests that the model’s perfor-
mance is robust and not overly dependent on the number of features included. It
reinforces the notion that a more parsimonious model could be equally effective, if
not more so, in terms of generalization and interpretability.


Sensitivity

Sensitivity values, ranging from 0.92 to 0.97, indicate the model’s strong ability to
identify malignant cases consistently. However, the slight variations in sensitivity
across different feature subsets suggest that increasing the feature count does not
significantly enhance this critical aspect of performance.

Specificity

Specificity remains high, ranging from 0.96 to 0.98 across all subsets, indicating that
the model effectively identifies non-malignant cases. This consistency in specificity
further supports the potential for a simpler model without compromising the ability
to accurately classify both malignant and non-malignant cases.

Error Rate

The error rate decreases from 0.045 (with 5 features) to 0.017 (with 25 features),
suggesting a slight improvement in reliability. However, the reduction is modest
compared to the increase in complexity associated with additional features, high-
lighting the necessity of considering feature selection carefully.

Area Under the Curve (AUC)

The AUC values consistently remain high (0.990 to 0.998), indicating the model’s
strong discriminatory power across all feature subsets. This stability in AUC suggests
that a robust model can be developed without necessitating an excessive number of
features.
As a final step, regularization techniques—L1, L2, and Elastic Net—are applied to
the logistic regression model using gradient descent to minimize the mean square
error in prediction. The analysis is conducted on two feature sets: the full feature set (30 scaled features) and the dominant subset (top 5 correlated features).
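For reference, a minimal sketch of the regularized gradient-descent update described here is shown below; the learning rate, iteration count, penalty strength, and the elastic-net parameterization (alpha = 1 for L1, alpha = 0 for L2) are assumptions of the example, and y is assumed to be coded as 0/1.

% Minimal sketch: gradient descent on a squared-error objective with sigmoid outputs
% and an elastic-net penalty (alpha = 1 gives L1, alpha = 0 gives L2).
eta = 0.01; nIter = 5000; lambdaReg = 0.1; alpha = 1;   % assumed hyperparameters
[n, d] = size(X);
w = zeros(d, 1); b = 0;
for t = 1:nIter
    p = 1 ./ (1 + exp(-(X*w + b)));                     % sigmoid predictions
    err = p - y;                                        % y assumed in {0,1}
    gradW = (X' * (err .* p .* (1 - p))) / n ...        % gradient of (1/2n) * sum((p - y).^2)
          + lambdaReg * (alpha * sign(w) + (1 - alpha) * 2 * w);   % elastic-net (sub)gradient
    gradB = mean(err .* p .* (1 - p));
    w = w - eta * gradW;
    b = b - eta * gradB;
end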
Performance metrics, including accuracy, sensitivity, specificity, F1-score, and AUC,
are computed for each model. Summary of these results along with the K-fold cross
validation of L1 regularization model is shown in Table 3.3.

Table 3.3: Comparison of performance metrics of various regularized models and K-fold
cross validation

Regularized       Performance Metrics
Model             30 feature set                 Top 5 correlated subset
Metric −→         Acc    Sen    Spec   F1        Acc    Sen    Spec   F1
L2 0.968 0.961 0.971 0.960 0.947 0.930 0.960 0.930
L1 0.975 0.970 0.980 0.970 0.944 0.920 0.960 0.920
Elastic Net 0.975 0.970 0.980 0.970 0.944 0.920 0.960 0.920
L1 with K-fold CV 0.973 0.960 0.980 0.960 0.942 0.910 0.960 0.920


3.2.3 Key Observations


• L1 Regularization on the Full Feature Set produced the highest values across
all performance metrics, outperforming both L2 and Elastic Net. This result
suggests that L1 regularization, which promotes sparsity by shrinking less im-
portant feature weights to zero, is effective in this context.
• The model with L1 Regularization achieved an AUC of 0.99, demonstrating
exceptional discriminatory ability between malignant and benign cases.
• To validate the stability of the model, K-fold cross-validation was performed.
The results confirmed that L1 regularization provides consistent performance
across all folds, with negligible variation in metrics such as accuracy and AUC.

These findings highlight the effectiveness of L1 regularization for feature selection, model accuracy, and consistency, making it the most suitable technique for this classification task.

In summary, while the performance metrics indicate that the model maintains high
accuracy, sensitivity, and specificity across different feature subsets, the improvement
in these metrics with the addition of more features is not substantial. The K-fold
cross-validation accuracy remains stable across all subsets, further emphasizing that
a simpler model with fewer features could perform nearly as well as a more complex
model.

This suggests that, for the goal of developing a better and more consistent model,
it is essential to focus on identifying an optimal number of features that strike a
balance between performance and model simplicity. The results advocate for feature
selection based on correlation strength with the target variable while considering
the overall performance stability to enhance interpretability and ensure effective
medical diagnosis. Thus, an emphasis on parsimony may yield a model that is not
only efficient but also easier to implement in practical scenarios.

3.3 Skill of Support Vector Machines


Despite the extensive preprocessing applied to the dataset—including feature scal-
ing and outlier removal—five features still exhibited significant outliers during the
logistic regression analysis (chapter 2, section 2.2.7, Figure 2.3). While logistic re-
gression performed well on the data, theoretical insights suggest that in the presence
of outliers, especially in high-dimensional spaces, Support Vector Machines (SVMs)
offer superior performance due to their robustness in handling such irregularities.
SVMs are designed to find an optimal hyperplane that maximally separates classes
by focusing on the most critical data points—referred to as support vectors—thus
minimizing the influence of outliers. This property makes SVMs particularly suitable
for scenarios where the data may not be perfectly linearly separable, as was observed
in the earlier experiments.


Given the residual outliers and the need for a more robust classification framework,
SVMs are a natural next step to explore. The aim is to leverage their ability to man-
age outliers while improving the overall performance metrics of the model, particu-
larly in terms of classification accuracy and consistency across folds. The following
sections will discuss the application of SVMs to this dataset and provide a compara-
tive analysis with logistic regression.

3.3.1 Hard Margin Support Vector Machine with Linear Kernel


A hard margin Support Vector Machine (SVM) with a linear kernel is applied to the
dataset. The dual Lagrange multiplier formulation is solved using the CVX solver in
MATLAB. The model achieved an accuracy of 62.7%. However, it was observed that
the sensitivity (true positive rate) is 0, indicating that the model failed to correctly
classify any malignant cases. Even with the top 5 correlated features, the same
result is obtained. This outcome suggests that the hard margin SVM, which does not
allow for any misclassification or margin violation, is overly rigid for the given data.
The lack of flexibility in hard margin SVMs means that it prioritizes separating the
majority class perfectly, leading to poor sensitivity, especially when the classes are
not perfectly linearly separable. Figure 3.4a demonstrates this issue.

(a) Distribution of data points over the first two features under hard margin SVM    (b) Distribution of data points over the first two features under soft margin SVM

Figure 3.4: Distribution of data points over first two features under hard and soft margin
linear SVM

This result underscores the need for a more flexible approach, such as soft margin
SVMs, to allow some misclassification and better accommodate the complexity of
the data. Applying a soft margin linear SVM on the 30 feature set did not yield any support vectors. After normalizing the feature set, a soft margin SVM with a linear kernel and regularization parameter C = 1 was applied. The model produced 22 support vectors and achieved an improved accuracy of 79.79%. The sensitivity and specificity were 55.66% and 94.12%, respectively. The distribution of data points over the first two features and the separating boundary is shown in Figure 3.4b.


The introduction of a soft margin allowed the model to tolerate some misclassifica-
tions, enhancing its flexibility compared to the hard margin SVM. This improved the
model’s ability to correctly classify malignant cases, as reflected in the increased sen-
sitivity. The high specificity suggests that the model maintained strong performance
in identifying benign cases.

3.3.2 Soft Margin Support Vector Machine with Linear Kernel

Using the top 5 highly correlated features, a soft margin Support Vector Machine
(SVM) with a linear kernel and regularization parameter C = 10 is applied. This
model identified 5 support vectors and showed a significant improvement in per-
formance, achieving an accuracy of 94.4%. The sensitivity (95.8%) and specificity
(92.82%) indicate that the model effectively balances both positive (malignant) and
negative (benign) class predictions.
The introduction of the soft margin allows for some misclassification, enabling the
model to better handle the presence of noise and outliers in the dataset. This flexibil-
ity leads to improved sensitivity, which is critical in medical diagnosis, as it ensures
that malignant cases are correctly identified. The higher specificity further confirms
the model’s capability to accurately classify benign cases, resulting in a more robust
classification.
The trade-off between sensitivity and specificity indicates that while the model ef-
fectively reduces misclassifications, it still struggles with identifying all malignant
cases, suggesting the need for further optimization or more sophisticated kernels.

3.3.3 Soft Margin Support Vector Machine with Non-linear Kernels

To further improve classification performance, the soft margin SVM was applied us-
ing the Radial Basis Function (RBF) kernel. This kernel allows the model to capture
more complex decision boundaries, which are particularly useful in cases where the
data is not linearly separable.
Experiments were conducted on both the full feature set and the top five correlated
feature subset, with the regularization parameter C varied to control the trade-off
between margin maximization and classification error. The accuracy of the model
ranged from 47.98% to 96.15%, indicating that the choice of C plays a significant
role in optimizing the classification skill.
The RBF kernel produced a more flexible and acceptable decision boundary com-
pared to the linear kernel, especially in handling the non-linearity present in the
data.
Distribution of data points and separation boundaries are shown in Figure 3.5.


(a) Distribution of data points of the full dataset with C = 1    (b) Distribution of data points of the subset with C = 2.5

Figure 3.5: Distribution of data points over first two features under soft margin SVM
with RBF kernel

The higher accuracy in the upper range of C values suggests that the model was able
to find an optimal balance between underfitting and overfitting, leading to improved
classification performance across both feature sets.
Table 3.4 presents the performance metrics of the soft margin Support Vector Ma-
chine (SVM) with an RBF kernel applied to two different versions of the dataset:
the full 30-feature set and a reduced subset comprising the top five highly corre-
lated features. The regularization parameter C is varied to explore the impact of
the margin’s flexibility on classification performance. The primary objective is to in-
vestigate whether the reduced feature subset offers competitive accuracy and other
performance metrics while lowering computational costs.

Table 3.4: Comparison of performance metrics of full feature set and the five top corre-
lated subset over the regularization parameter C.

Regularization    Performance Metrics
parameter         30 feature set                 Top 5 correlated subset
C                 Acc    Sen    Spec   F1        Acc    Sen    Spec   F1
1.0 0.956 0.975 0.975 0.950 0.974 0.962 0.980 0.960
1.5 0.959 0.929 0.977 0.945 0.979 0.967 0.986 0.971
2.0 0.958 0.943 0.966 0.943 0.981 0.967 0.988 0.977
2.5 0.961 0.943 0.972 0.947 0.981 0.967 0.988 0.977
3.0 0.950 0.948 0.952 0.934 0.982 0.972 0.988 0.976
3.5 0.608 0.986 0.384 0.652 0.944 0.976 0.988 0.978
4.0 0.479 1.000 0.170 0.589 0.984 0.976 0.988 0.978
4.5 0.441 1.000 0.109 0.574 0.984 0.985 0.983 0.978
5.0 0.954 0.967 0.946 0.940 0.985 0.986 0.986 0.981

From the results, it is clear that the reduced feature subset consistently provides high


accuracy, sensitivity, and specificity across various values of C, comparable to or even outperforming the full feature set in most cases. This suggests that the five most
correlated features capture sufficient information for effective classification without
requiring the entire feature set, significantly reducing model complexity.
As the regularization parameter C increases, the performance of the full feature set
tends to degrade, especially for extreme values, as indicated by the sharp decline
in accuracy and specificity. However, the reduced feature subset maintains stability
even at higher values of C, with only minor fluctuations in performance metrics.
This highlights the robustness of the reduced subset, particularly under varying reg-
ularization conditions.
A cross-validation is the right method to validate the model’s performance and en-
sure that the reduced feature subset does not lead to performance inconsistencies
across different data splits. It helps confirm that the reduced set is a viable and
stable representation for classification. A 5-fold cross validation of the five top cor-
related subset is done for each value of the regularization parameter C. Result of
this experiment is shown in Table 3.5.

Table 3.5: Statistical summary of 5-fold cross validation results

Performance Metrics
C Accuracy Sensitivity Specificity F1-score
Mean SD Mean SD Mean SD Mean SD
1.0 0.933 0.015 0.924 0.035 0.938 0.021 0.915 0.022
1.5 0.933 0.015 0.938 0.036 0.930 0.017 0.913 0.022
2.0 0.933 0.009 0.943 0.039 0.927 0.018 0.913 0.014
2.5 0.928 0.007 0.948 0.030 0.916 0.010 0.907 0.010
3.0 0.927 0.007 0.948 0.030 0.915 0.010 0.907 0.010
3.5 0.926 0.007 0.948 0.030 0.913 0.018 0.905 0.010
4.0 0.923 0.009 0.948 0.030 0.907 0.021 0.901 0.011
4.5 0.923 0.007 0.948 0.030 0.907 0.021 0.901 0.008
5.0 0.926 0.008 0.953 0.023 0.916 0.016 0.905 0.009

The table summarizes the performance of the Support Vector Machine (SVM) model
with a Radial Basis Function (RBF) kernel across various values of the regulariza-
tion parameter C. Key metrics include accuracy, sensitivity, specificity, and F1-score,
along with their corresponding standard deviations.

Key Observations

• Accuracy: The accuracy is quite stable across different values of C, ranging from 92.3% to 93.3%, with only slight variations.

• Sensitivity: Sensitivity, which measures the model’s ability to correctly identify positive cases, peaks at C = 5.0 with a mean of 95.3%. Higher sensitivity is ideal in medical diagnoses, where false negatives are critical.


• Specificity: Specificity, which measures the model’s ability to correctly identify negative cases, remains consistently around 91.5% to 93.8%. A slight dip is observed at C = 4.0, but it is not drastic.

• F1-Score: The F1-score, which balances precision and recall, stays within a
close range (90.1% to 91.5%), indicating a well-balanced model performance
across all values of C.

• Standard Deviations: Standard deviations across all metrics are relatively low,
implying stable and reliable performance across the cross-validation folds.

Recommendation
Given that the differences in performance across different values of C are minimal,
the value C = 5.0 seems optimal due to its high sensitivity and balanced performance
across other metrics.

Feature Selection: Top 5 Correlated Features vs. All 30 Features


• Top 5 Correlated Features: The performance with the top 5 correlated features is strong, as seen from the cross-validation results. The model achieves good accuracy, sensitivity, and specificity, while being computationally less expensive compared to using all 30 features.

• Using All 30 Features: While using all 30 features may marginally improve
the model’s ability to capture more complex patterns, it risks overfitting, espe-
cially if many of the features are redundant or weakly correlated. Additionally,
training with a larger feature set increases computational complexity.

Considering the consistently strong results with the top 5 correlated features and the
risks associated with high-dimensional feature spaces, it is recommended to proceed
with the top 5 correlated features. This simplifies the model without sacrificing
performance, ensuring better generalization on unseen data.

3.4 Model Comparison and Conclusion


In this medical dataset, where diagnostic accuracy and sensitivity are paramount,
several models were tested for their predictive performance. Table 3.6 summarizes
the results across different models, number of features used, and key performance
metrics (accuracy, sensitivity, specificity, F1-score).
As seen in the table, several models provide high accuracy (96%-98%) with balanced
sensitivity and specificity, which is critical in medical diagnosis.

• Logistic Regression: Achieved high accuracy (96%-97%) with good sensitivity and specificity. This makes it a reliable option, though it slightly underperforms compared to SVM when sensitivity is a priority.


Table 3.6: Performance Metrics of Models

Algorithm # Features Accuracy Sensitivity Specificity F1-score


Logistic 20 0.96 0.93 0.98 0.96
Logistic+L1 30 0.97 0.96 0.98 0.96
SVM (RBF) 5 0.98 0.98 0.98 0.98
SVM (RBF, 5-fold) 5 0.93 0.92 0.94 0.91
fitglm() + L1 30 0.97 0.96 0.98 0.96
fitcsvm() 5 0.97 0.96 0.99 0.97

• SVM (RBF Kernel): The best-performing model, especially with 5 features, achieving the highest sensitivity (98%) and maintaining strong performance across all metrics. This makes it highly suitable for medical applications where detecting true positives is essential.

• SVM (5-fold Cross-validation): Despite performing slightly lower in accuracy (93%), it maintains balanced sensitivity and specificity. However, the non-cross-validated SVM variant is preferable due to better performance.

• Generalized Model (fitglm() + L1): Performed similarly to Logistic Regression with L1 regularization, delivering high accuracy (97%) and sensitivity (96%).

Recommendation: Given the critical nature of medical data, where accurately de-
tecting true positives (sensitivity) is crucial, the SVM with the RBF kernel using 5
features is recommended due to its superior balance of sensitivity (98%), accuracy
(98%), and simplicity (using fewer features). However, Logistic Regression could be
considered in scenarios where model interpretability is essential for clinical decision-
making.

Chapter 4

Conclusion

