Computer Science Extended Essay First Draft (Second Version)

This document provides an outline for a computer science extended essay that analyzes classification algorithms for medical diagnosis. Specifically, it examines how a hybrid approach combining support vector machine learning and clustering methods can improve accuracy for predicting breast cancer diagnosis. The outline includes sections on usage of methods, training data for breast cancer diagnosis, a two-step clustering technique, the hybrid approach, datasets, experimental design, results, and conclusions.

Uploaded by

Tanav

Computer science extended essay draft 1

Research Topic - A critical analysis of classification algorithms for medical diagnosis.

Research question – How does a hybrid of the support vector machine learning algorithm
and clustering methods improve the accuracy of breast cancer diagnosis prediction?

Written By - Tanav Rawal


Index

1. Introduction
2. Usage of methods
3. Training of data for breast cancer diagnosis
4. Two-step clustering technique
5. Hybrid approach
6. Dataset
7. Experimental design
8. Result of the experimentation
9. Conclusion

Introduction
Making decisions about complicated systems is usually very difficult and sometimes
impossible. Decision support systems are now among the most powerful tools for helping
doctors predict the diagnosis of various diseases.

Chest diseases such as TB, COPD, pneumonia, asthma and lung cancer are now widespread.
TB is a dangerous, contagious and deadly disease that mainly affects the lungs. According
to the World Health Organisation, 1.8 million people died of TB in 2015. Its symptoms
include a cough with sputum and blood.

Breast cancer is also a very dangerous disease and a serious problem for radiologists. The
number of new cases in 2012 was estimated at more than 1,600,000, and the number of deaths
at more than 570,000 [1]. Breast cancer represented 29% of estimated new cancer cases among
women (790,740 patients), making it the most frequently diagnosed malignancy among women [1].
Diagnosing cancerous cells in the breast is one of the biggest real-world medical problems.
Diagnosis, which relies on various tests conducted on patients, has always been a major
challenge in the medical field. Tests are meant to help the physician make a proper and
accurate diagnosis. However, misdiagnosis sometimes occurs, especially with tumours and
cancerous cells, because an accurate diagnosis can be difficult even for a cancer
specialist. Tumour diagnosis is therefore one of the current topics in the medical field.
Large amounts of descriptive tumour information and feature data from cancer studies can now
be obtained with the aid of information technology. Mammography, interpreted by radiologists
and physicians, has long been the means of predicting breast cancer. In 1994, ten
radiologists analysed and interpreted 150 mammograms to classify the tumour categories in
the breasts [2]. Although the value of mammograms was proven, the variation in the
radiologists' interpretations led to a low diagnostic accuracy. More than 89.5% of the
radiologists identified fewer than 3% of the tumours in the study.

In this research we consider which algorithm is best suited to medical diagnosis and,
using those considerations, build a basic algorithmic model for predicting medical
diagnoses with good statistical accuracy.
This paper is organised as follows. First we discuss the SVM and its medical applications
and train the SVM for medical diagnosis. We then take a similar approach with the
clustering technique for breast cancer diagnosis. Next we discuss the hybrid approach,
which combines clustering and SVM for more accurate results. Finally we conclude whether
the hybrid approach is appropriate for medical diagnosis.

Usage of Methods

Support vector machine algorithm


The SVM is a machine learning technique for solving classification (discrimination) and
regression problems. It is now used in many areas of research, such as face recognition,
speech recognition and medical diagnosis.

As shown in Figure 1, the SVM constructs a hyperplane that separates the two classes of a
dataset $D = \{(x_i, y_i)\}_{i=1}^{k}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$
is the class label of $x_i$. However, many hyperplanes can separate the two classes, so the
SVM uses a training phase to find the optimal one, the Optimal Separating Hyperplane (OSH),
$w \cdot x + b = 0$, where $w$ is a multidimensional vector and $b$ is a "bias" term found
by the SVM during training.

Margin
As the figure shows, the hyperplanes H1 and H2 are parallel to the optimal hyperplane H and
pass through the points closest to H, which are called Support Vectors (SVs). The SVM
therefore chooses the hyperplane that maximizes the margin between the two classes, which is
the distance between H1 and H2.
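
The margin can also be written explicitly. The block below is a sketch of the standard hard-margin formulation, not a formula taken from this essay's sources: it assumes the two classes are linearly separable, that the labels are $y_i \in \{-1, +1\}$, and that $w$ and $b$ are scaled so the support vectors satisfy $|w \cdot x + b| = 1$.

```latex
% Margin hyperplanes under the canonical scaling (assumption)
H_1:\; w \cdot x + b = +1, \qquad H_2:\; w \cdot x + b = -1

% Distance between H_1 and H_2
\text{margin} = \frac{2}{\lVert w \rVert}

% Maximizing the margin is equivalent to
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \ldots, k
```

Under these assumptions a smaller $\lVert w \rVert$ means a wider margin, which is why training is usually posed as a minimization problem.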

Training of data for breast cancer disease diagnosis


SVM predictors use a hyperplane to separate data points. Each hyperplane is defined by its
direction, denoted by w, and by its exact location in space, the threshold b.
The training cases are given by [1] and the decision function by [2].

$(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$ [1]

Here k is the number of training cases.
The decision function is written as:
$F(x, w, b) = \operatorname{sgn}(w \cdot x + b)$ [2]
The margin is the region around the hyperplane that separates the two classes; the SVM
classifies a breast cancer case according to the side of the hyperplane on which it falls.
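
As a small illustration of the decision function [2], the sketch below evaluates the sign of w · x + b for one case. The weights, bias and feature values are hypothetical numbers chosen for the example, not values learned from the breast cancer data, and treating a score of exactly zero as +1 is an assumed convention.

```python
def decision(x, w, b):
    """Return +1 or -1 according to sgn(w . x + b), as in equation [2]."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    # A score of exactly zero lies on the hyperplane; it is assigned to +1 here by convention.
    return 1 if score >= 0 else -1


# Hypothetical weights and bias for two features (illustration only).
w = [0.5, -1.0]
b = -0.25

print(decision([2.0, 0.5], w, b))  # score = 1.0 - 0.5 - 0.25 = 0.25
print(decision([0.0, 1.0], w, b))  # score = 0.0 - 1.0 - 0.25 = -1.25
```

The two calls fall on opposite sides of the hyperplane, so they receive opposite class labels.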

The two step clustering algorithm

The Two-Step algorithm is designed to reveal natural clusters within a dataset that might
not otherwise be obvious. It has several features that distinguish it from traditional
clustering methods:

• It can build clusters from both continuous and categorical variables.
• It determines the number of clusters automatically.
• It analyses large datasets efficiently.

Hybrid approach
The introduced technique is a hybrid method for breast cancer prediction that consists of
two sub-methods: Two-Step clustering of the data based on feature similarity, using the
likelihood distance measure, and classification of the breast cancer dataset with the SVM
algorithm. The purpose of this research is to introduce a cancer diagnostic classification
approach that uses the hybrid of Two-Step clustering and SVM prediction to improve
classification accuracy (effectiveness) and reduce the misclassification rate. This work
pioneers an approach that combines supervised and unsupervised learning: the SVM and the
Two-Step clustering algorithm. Both SVM classification and Two-Step clustering were first
studied on the breast cancer features. The resulting clusters are then used as inputs to the
SVM, which acts as the classifier for cancer cases. The hybrid Twostep-SVM technique is then
used to evaluate the trained method. Because the cancer data contains a large number of
cases, the dataset was split into ten parts for 10-fold cross-validation when training and
testing the Twostep-SVM method. Figure 2 shows the stages of the introduced technique
(Twostep-SVM stages). [3]

[Figure 2: Twostep-SVM stages. Stages shown: dataset for the hybrid approach; data
clustering; label feature extraction; training data and testing data; SVM classifier;
diagnosis of cancer.]

Dataset
This research is based on the Wisconsin Breast Cancer (WBC) dataset, which is widely used
to discriminate cancerous from non-cancerous samples. The table below describes the WBC
dataset. It contains 699 cases and 11 features, which are classified into two classes.

No   Feature
1    Sample code number
2    Clump thickness
3    Uniformity of cell size
4    Marginal adhesion
5    Single epithelial cell size
6    Bare nuclei
7    Bland chromatin
8    Normal nucleoli
9    Mitoses
10   Class (cancerous or non-cancerous)

These parameters are important for deciding whether cells are cancerous or non-cancerous.
Cancerous cells usually differ from normal cells in size and shape. Healthy cells tend to
stay together in groups, whereas cancerous cells lose this ability, so a loss of cohesion
between cells is a sign of cancer. Along with this adhesion property, enlarged epithelial
cells are also a sign of cancer. All of the included features are therefore important for
the classification of cancerous and non-cancerous cells.

Feature name           Number of instances in each cluster
                       1      2      3      4      5
ID                     387    0      9      2      4
Normal nucleoli        1      85     0      46     0
Clump thickness        21     0      4      2      3
Cell size              13     0      6      0      9
Cell shape             5      0      10     0      4
Adhesion               8      0      10     2      10
Epithelial cell size   0      0      3      0      1
Bare nuclei            0      0      3      0      6
Bland chromatin        0      0      15     0      7
Mitoses                0      0      3      3      3
Class                  12     0      4      0      1

Experimental Design
To compare the accuracy of the cancer predictors, I ran the Two-Step SVM algorithm, which
combines the support vector machine algorithm with the clustering technique.
I divided the data into ten sets, each representing 10% of the original dataset, so that
all of the data is taken into consideration. For each round of experimentation, I used nine
sets for training and the remaining one for testing. The Two-Step algorithm extracted 5
clusters, each with a different number of instances, with features distributed from feature
1 to feature 11. The algorithm automatically determines the optimal number of groups with
the help of the specified clustering criterion. Table 1 describes the resulting clusters,
while Table 2 illustrates the distribution of these clusters.
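
The ten-set procedure can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `ten_fold_splits` and the fixed random seed are assumptions made for the example, not part of the original experiment. Because 699 is not divisible by ten, the sets contain 70 or 69 cases rather than exactly 10% each.

```python
import random


def ten_fold_splits(n_cases, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    # Deal the shuffled indices into n_folds sets of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Training data is every set except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 699 is the number of cases in the WBC dataset.
for round_no, (train, test) in enumerate(ten_fold_splits(699), start=1):
    print(round_no, len(train), len(test))
```

Each case appears in the testing set exactly once across the ten rounds, so every case contributes to the reported accuracy.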

In Table 2, “TwoStep clustering algorithm results extracted 5 clusters or groups; these are all
important clusters. The distributed numbers of instances members are 447, 55, 85, 64,” [4]
and 48, from cluster 1 to cluster 5 respectively. Cluster 1 has the highest number of
instances because of the similarity of its members' features. Most members of cluster 1 are
similar in the Bare Nuclei feature, and the table shows how the remaining members are shared
among the other features in cluster 1. In clusters 3 and 5 the feature with the highest
number of members is Mitoses, with 85 and 46 members respectively. In clusters 4 and 5, the
table shows that the Bland Chromatin feature has ten members, which ranks as a high score
among these cluster members. Cluster 5, on the other hand, has the fewest instances (48)
because of the variation among its members' features. Through these clusters, the Two-Step
clustering algorithm analysed and described the breast cancer dataset; data description is
the main task of clustering techniques.

The clustering algorithm was selected to be hybridized with the SVM to enhance the
classification and prediction process. The clustering was combined with the SVM classifier
as follows. First, the TwoStep method was used to cluster the data into different groups.
The output of the clustering is represented in a new feature named label. The values of the
label feature are cluster names such as cluster1, cluster2, etc. Each record in the dataset
was labelled with its cluster name. Then the SVM classifier was applied with the label
feature in order to generate a diagnosis with high prediction accuracy.
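
The labelling step can be sketched as below. The essay does not say how the label feature is encoded for the SVM, so the one-hot encoding here is an assumption, as are the function name `append_cluster_feature` and the example feature values; the cluster assignments would come from the Two-Step clustering output.

```python
def append_cluster_feature(features, cluster_ids, n_clusters=5):
    """Return new rows: the original features followed by a one-hot cluster label.

    features    -- list of numeric feature rows, one row per record
    cluster_ids -- cluster number (1..n_clusters) assigned to each record
    """
    if len(features) != len(cluster_ids):
        raise ValueError("one cluster id is needed per record")
    rows = []
    for row, cid in zip(features, cluster_ids):
        # Position c-1 is 1 when the record belongs to cluster c, otherwise 0.
        one_hot = [1 if cid == c else 0 for c in range(1, n_clusters + 1)]
        rows.append(list(row) + one_hot)
    return rows


# Two hypothetical records with three features each, assigned to cluster1 and cluster3.
records = [[5, 1, 1], [8, 10, 10]]
print(append_cluster_feature(records, [1, 3]))
```

The extended rows can then be passed to any SVM implementation in place of the original feature rows.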

The equation for the algorithm of the SVM of the particular data is given below:

Results of the experimentation:

The SVM was first run on the data on its own to obtain baseline results. In the experiments
the WBC dataset was used to classify breast cancer cases. Each instance in the dataset is
reported as either a benign or a malignant case. The hybrid technique was applied by
training and testing on the dataset using the hybrid Two-Step and SVM method. Using the
Two-Step algorithm, the dataset was divided into clusters, each containing different
instances. The main objective of clustering in this study is to extract patterns and
structures by grouping breast cancer samples with similar patterns, so that complexity is
reduced and the diagnosis is more accurate. In the combination step, the output of the
Two-Step method is added as a new feature. This feature increases the correlation between
instances by grouping the dataset into clusters, each with similar characteristics. The SVM
classifier is then applied again with the output of the Two-Step method to obtain more
accurate results.
The graph shows the results with and without clustering.

The conclusion from our data is that the Two-Step clustering algorithm brings an
improvement: the result of using the SVM with clustering is better than the result of using
the SVM without clustering.
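
For reference, the two quantities being compared can be computed as in the sketch below. The labels are made-up placeholder values for illustration, not results from the experiment.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def misclassification_rate(y_true, y_pred):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Placeholder labels for five cases (illustration only).
y_true = ["benign", "malignant", "benign", "benign", "malignant"]
y_pred = ["benign", "malignant", "malignant", "benign", "malignant"]

print(accuracy(y_true, y_pred))                # 4 of 5 correct -> 0.8
print(misclassification_rate(y_true, y_pred))  # 1 of 5 wrong -> 0.2
```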

A t-test was performed as a statistical significance test between the results of the first
experiment, which used the SVM algorithm alone, and the results of the Two-Step SVM method.
It showed the enhancement obtained by using the Two-Step method.
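
The essay does not state which form of t-test was used. The sketch below assumes a paired t-test over the accuracies of the same cross-validation folds, which is one common choice; the fold accuracies are placeholder numbers for illustration, not the experiment's results.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference / (std of differences / sqrt(n))."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation, needs n >= 2
    # Note: sd_diff is zero when all differences are identical, which makes t undefined.
    return mean_diff / (sd_diff / math.sqrt(n))


# Placeholder per-fold accuracies (illustration only).
with_clustering = [0.98, 0.97, 0.99]
without_clustering = [0.95, 0.96, 0.96]

print(paired_t_statistic(with_clustering, without_clustering))
```

The statistic is then compared with the t distribution with n - 1 degrees of freedom to decide whether the difference is significant.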

Compared with other types of algorithm as well, and not only with the SVM without
clustering, the Two-Step SVM achieves the highest accuracy, at 99.1%.
Conclusion
This study used two types of algorithm, Two-Step clustering and SVM, and combined them to
see whether the results improve.
The results show that the SVM combined with the Two-Step algorithm can significantly
improve the prediction accuracy and decrease the misclassification error in cancer
diagnosis. More importantly, the hybrid method improved the prediction accuracy following
the methodology explained in section 6. In future work, an optimization method will be
combined with the SVM-Two-Step clustering algorithm to further enhance diagnostic accuracy.
