Computer Science Extended Essay First Draft (Second Version)

This document provides an outline for a computer science extended essay that analyzes classification algorithms for medical diagnosis. Specifically, it examines how a hybrid approach combining support vector machine learning and clustering methods can improve accuracy for predicting breast cancer diagnosis. The outline includes sections on usage of methods, training data for breast cancer diagnosis, a two-step clustering technique, the hybrid approach, datasets, experimental design, results, and conclusions.

Computer science extended essay draft 1

Research Topic - A critical analysis of classification algorithms for medical diagnosis.

Research question – How does the hybrid process of support vector machine
learning algorithm and clustering methods create better accuracy to predict
breast cancer diagnosis?

Written By - Tanav Rawal

Index

1. Introduction
2. Usage of methods
3. Training of data for breast cancer diagnosis
4. Two-step clustering technique
5. Hybrid approach
6. Dataset
7. Experimental design
8. Result of the experimentation
9. Conclusion

Introduction
Making decisions is usually difficult, and sometimes impossible, for complicated systems.
Decision support systems are now among the most powerful tools for helping doctors
predict the diagnosis of various diseases.

Chest diseases such as TB, COPD, pneumonia, asthma and lung cancer are now widespread.
TB is a dangerous, contagious and deadly disease that severely affects the lungs.
According to the World Health Organization, 1.8 million people died of TB in 2015.
Its symptoms include a cough with sputum and blood.

In addition, breast cancer is a very dangerous disease and a serious problem facing many
radiologists. It was estimated that there were more than 1,600,000 new cases in 2012 and that
the number of deaths would exceed 570,000 [1]. Breast cancer represented 29% of estimated new
cancer cases among women (790,740 patients), making it the most frequently diagnosed cancer
among women [1]. Diagnosing cancerous cells in the breast is one of the biggest real-world
medical problems. Diagnosis is based on various tests conducted on patients, and these tests are
meant to help the physician make a proper and accurate diagnosis. However, misdiagnosis
sometimes occurs, especially with tumours and cancerous cells, since an accurate diagnosis can
be difficult even for a cancer specialist. The diagnosis of tumours is therefore one of the
current issues in the medical field. Large amounts of descriptive tumour information and feature
data from cancer studies can now be obtained with the aid of information technology.
Mammography, interpreted by radiologists and physicians, has long been the means of predicting
breast cancer. In 1994, ten radiologists analysed and interpreted 150 mammograms to classify the
tumour categories in the breasts [2]. Although the value of mammograms was proven, the variation
in the radiologists' interpretations led to low diagnostic accuracy. More than 89.5% of the
radiologists identified fewer than 3% of the tumours in the study.

In this research we consider which algorithm is best suited to medical diagnosis. Using
those considerations, we build a basic algorithmic model of a new algorithm for the
statistical analysis and prediction of medical diagnoses.
The paper is organised as follows. First we discuss the SVM and its medical applications,
and then we train the SVM for medical diagnosis. We take a similar approach with the
clustering technique for the diagnosis of breast cancer. Next we discuss the hybrid
approach, which combines clustering and the SVM to obtain more accurate results. Lastly,
we conclude whether the hybrid approach is appropriate for medical diagnosis.

Usage of Methods

Support vector machine algorithm

The SVM is a machine learning technique for solving classification and regression
problems. It is now used in many areas of research, such as face recognition, speech
recognition and medical diagnosis.

As shown in Figure 1, the SVM constructs a hyperplane that separates the two classes of
a dataset D = {(xi, yi)}, where xi is a data point and yi is its class label. However,
many hyperplanes can separate the two classes, so the SVM uses a training phase to find
the best one, the Optimal Separating Hyperplane (OSH), w · x + b = 0, where w is a
multidimensional vector and b is a "bias" term found by the SVM during training.

[Figure 1: the optimal separating hyperplane H and the margin between H1 and H2]

As the figure shows, the hyperplanes H1 and H2 are parallel to the optimal hyperplane H
and pass through the points closest to H, which are called Support Vectors (SVs). The SVM
therefore chooses the optimal hyperplane that maximizes the margin between the two
classes, which is the distance between H1 and H2.
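
The margin can be made concrete with a short sketch. It assumes the usual scaling in which H1 and H2 satisfy w · x + b = +1 and w · x + b = -1, so that the distance between them is 2 / ||w||; that scaling, the function names and the example numbers are illustrative assumptions, not values from this essay.

```python
import math

def norm(w):
    """Euclidean length of the weight vector w."""
    return math.sqrt(sum(wi * wi for wi in w))

def margin_width(w):
    """Distance between H1 and H2, assuming they are w.x + b = +1 and -1."""
    return 2.0 / norm(w)

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from the point x to the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)

# Hypothetical hyperplane, chosen only so the arithmetic is easy to follow.
w = [3.0, 4.0]
b = -5.0
print(margin_width(w))                             # 0.4
print(distance_to_hyperplane([3.0, 4.0], w, b))    # 4.0
```

A smaller ||w|| gives a wider margin, which is why maximizing the margin is usually expressed as minimizing the length of w.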

Training of data for breast cancer disease diagnosis

SVM predictors use a hyperplane to separate data points. Each hyperplane is defined by
its direction, denoted by w, and by its exact location in space, or threshold, denoted by b.
A group of training cases is given by equation [1], and the decision function by equation [2].

(x1, y1), (x2, y2), ..., (xk, yk) -------------------- [1]

Here k is the number of training cases.
The decision function is written as:
F(x, w, b) = sgn((w · x) + b) ------------------------ [2]
The margin is the region around the hyperplane that separates the two classes; it
demonstrates the classification of breast cancer by the SVM.
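
As a minimal sketch, the decision function in equation [2] can be written in a few lines of Python. The weight vector, bias and sample points below are made up for illustration, and treating a score of exactly zero as the positive class is an assumption, since the essay does not say how ties are handled.

```python
def decision(x, w, b):
    """Return +1 or -1 according to sgn((w . x) + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumption: a point lying exactly on the hyperplane is assigned to +1.
    return 1 if score >= 0 else -1

# Hypothetical trained parameters (not taken from the essay's experiments).
w = [1.0, -1.0]
b = -0.5

print(decision([2.0, 0.5], w, b))   # 1
print(decision([0.0, 1.0], w, b))   # -1
```

In the breast cancer setting the two outputs would stand for the two diagnosis classes, and x would hold the feature values of one case.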

The two step clustering algorithm

The Two-Step algorithm is designed to reveal natural clusters within a data set that might
not otherwise be obvious. It has several features that distinguish it from traditional
clustering methods:

• The ability to form clusters from both continuous and categorical variables.
• Automatic determination of the number of clusters.
• Efficient analysis of a large corpus.

Hybrid approach
The introduced technique is a hybrid method for breast cancer prediction that uses Two-Step
clustering and the SVM. It consists of two sub-methods: Two-Step clustering of the data based on
feature similarity, using a likelihood distance measure, and classification of the breast cancer
dataset with the SVM algorithm. The purpose of this research is to introduce a cancer diagnostic
classification approach that uses the hybrid of the Two-Step clustering algorithm and the SVM
prediction method to improve classification accuracy (effectiveness) and to reduce the rate of
misclassification. This work pioneers a new approach which combines unsupervised and supervised
learning methods: the Two-Step clustering algorithm and the SVM. A study was conducted on SVM
classification and on the Two-Step clustering structure of the breast cancer features. The
resulting clusters are then used as inputs to the prediction method, with the SVM as the
classifier for cancer cases. The hybrid Twostep-SVM technique is used to investigate the result
of the trained method. Because of the large number of cases in the cancer data, the dataset was
split into ten parts for 10-fold cross-validation when training and testing the Twostep-SVM
method. Figure 2 shows the stages of the introduced technique (Twostep-SVM stages). [3]

[Figure 2: stages of the Twostep-SVM technique. Dataset for the hybrid approach; data clustering; label feature extraction; training data and testing data; SVM classifier; diagnosis of cancer]

Dataset
This research was conducted using the Wisconsin Breast Cancer (WBC) dataset. This data is
widely used to discriminate cancerous from non-cancerous samples. The table below describes
the WBC dataset. It contains 699 cases, each with 11 features, classified into two classes.

No   Feature
1    Sample code number
2    Clump thickness
3    Uniformity of cell size
4    Marginal adhesion
5    Single epithelial cell size
6    Bare nuclei
7    Bland chromatin
8    Normal nucleoli
9    Mitosis
10   Class (cancerous or non-cancerous)

These parameters are important for determining whether cells are cancerous or
non-cancerous. Cancerous cells usually differ from normal cells in size and shape. Healthy
cells tend to stay together in groups, whereas cancerous cells lose this ability, so a loss of
cell adhesion is a sign of cancer. Enlarged epithelial cells are also a sign of cancerous cells.
All of the included features are therefore important for classifying cells as cancerous or
non-cancerous.

Feature name           No of instances for each cluster
                       1     2     3     4     5
ID                     387   0     9     2     4
Normal nuclei          1     85    0     46    0
Clump thickness        21    0     4     2     3
Cell size              13    0     6     0     9
Cell shape             5     0     10    0     4
Adhesion               8     0     10    2     10
Epithelial cell size   0     0     3     0     1
Bare nuclei            0     0     3     0     6
Bland chromatin        0     0     15    0     7
Mitosis                0     0     3     3     3
Class                  12    0     4     0     1

Experimental Design
To compare the accuracy of the cancer predictors, I ran the Two-Step algorithm, which
combines the support vector machine algorithm with the clustering technique.
To make use of all the data, I divided the dataset into ten sets, each containing 10% of
the original data. In each round of experimentation I used nine sets for training and the
remaining one for testing. The Two-Step algorithm extracted 5 clusters, with different numbers
of instances and features distributed from feature 1 to feature 11. The algorithm automatically
determines the optimal number of groups using its clustering criterion. Table 1 describes the
resulting clusters, while Table 2 illustrates the distribution of these clusters.
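
The ten-set procedure described above can be sketched as follows. This is an illustrative outline only: the function name is hypothetical, the sets are taken in order without shuffling, and the essay does not say how its ten sets were actually drawn.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that each set is the test set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first sets so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 699 is the number of cases reported for the WBC dataset.
folds = list(k_fold_indices(699, k=10))
for round_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(round_no, len(train_idx), len(test_idx))
```

Because 699 is not divisible by ten, nine of the sets hold 70 cases and one holds 69, so each set is only approximately 10% of the data. In practice the cases would normally be shuffled before splitting.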

In Table 2, “TwoStep clustering algorithm results extracted 5 clusters or groups; these are all
important clusters. The distributed numbers of instances members are 447, 55, 85, 64,” [4]
and 48 from cluster 1 to cluster 5 respectively. Cluster 1 has the highest number of instances
because of the similarity of its members' features. Most members of cluster 1 are similar in the
Bare Nuclei feature, which also reflects the members shared with the other features in cluster 1.
Among clusters 3 and 5, the feature with the highest number of members is Mitoses, with 85 and
46 respectively. In clusters 4 and 5, the table further shows that the Bland Chromatin feature
has ten members, the highest score among these cluster members. Cluster 5, by contrast, has the
fewest instances, 48, because of the variation among its members' features. Through these
clusters, the Two-Step clustering algorithm analysed and described the breast cancer dataset;
data description is the main task of clustering techniques. The clustering algorithm was selected
for hybridization with the SVM to enhance the classification and prediction process. Clustering
was combined with the SVM classifier as follows. First, the TwoStep method was used to cluster
the data into different groups. The output of the clustering is represented by a new feature
named label, whose values are the cluster names, such as cluster1, cluster2, etc. Each record in
the dataset was labelled with its cluster name. Then the SVM classifier was applied with the
label feature included, with the aim of producing a diagnosis with high prediction accuracy.
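
The labelling step can be illustrated with a small sketch. Note that it does not implement the Two-Step algorithm: as a stand-in, each record is simply assigned to the nearest of a few given cluster centres, and the records, centres and function names are all made up for illustration.

```python
def nearest_centroid(record, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(record, centroids[i]))

def add_label_feature(records, centroids):
    """Return copies of the records with a cluster-name feature appended."""
    labelled = []
    for record in records:
        cluster = nearest_centroid(record, centroids)
        labelled.append(list(record) + ["cluster%d" % (cluster + 1)])
    return labelled

records = [[1, 1, 1], [10, 9, 8], [2, 1, 1]]   # made-up feature rows
centroids = [[1, 1, 1], [9, 9, 9]]             # made-up cluster centres

for row in add_label_feature(records, centroids):
    print(row)
# [1, 1, 1, 'cluster1']
# [10, 9, 8, 'cluster2']
# [2, 1, 1, 'cluster1']
```

Before an SVM could use the new feature, the cluster names would have to be converted to numbers, for example with one indicator column per cluster.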

The equation for the algorithm of the SVM of the particular data is given below:

Results of the experimentation:

The data was first processed with the SVM alone to obtain baseline results. In the
experiments, the WBC dataset was used to determine the breast cancer stage. Each instance
in the dataset is reported as either a benign or a malignant case. The hybrid technique was
applied by training and testing on the dataset using the hybrid Two-Step and SVM method.
Using the Two-Step algorithm, the dataset was divided into clusters, each containing
different instances. The main objective of clustering in this study is to extract patterns
and structure by grouping breast cancer samples with similar patterns, so that complexity
is reduced and the diagnosis is more accurate. In the combination step, the output of the
Two-Step method is added as a new feature. This feature can increase the correlation
between instances by grouping the dataset into clusters with similar characteristics. The
SVM classifier is then applied again with the output of the Two-Step method to obtain
more accurate results.
The graph shows the results with and without clustering.

The main finding from our data is that the Two-Step clustering algorithm brings an
improvement: the results of the SVM with clustering are better than those of the SVM
without clustering.

A t-test was performed as a test of statistical significance between the results of the
first experiment, which used the SVM algorithm alone, and those of the Two-Step SVM
method. It showed the improvement obtained by using the Two-Step method.
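
As a rough sketch of how such a comparison can be computed, the function below calculates a paired t statistic from two lists of matched scores, such as per-fold accuracies. The essay does not state which form of t-test was used, so the paired form is an assumption, and the numbers are placeholder values, not results from this study.

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b of equal length."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Placeholder per-fold accuracies, invented purely to show the call.
with_clustering = [0.99, 0.98, 0.97, 0.99, 0.98]
without_clustering = [0.96, 0.97, 0.95, 0.96, 0.97]

t = paired_t_statistic(with_clustering, without_clustering)
print(round(t, 3))
```

The statistic is then compared against the t distribution with n - 1 degrees of freedom to obtain a p-value; a larger absolute value indicates stronger evidence that the two methods differ.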

Comparing not only the SVM with and without clustering but also other types of
algorithms, the Two-Step SVM achieves the highest accuracy, at 99.1%.

Conclusion
This study used two types of algorithm, Two-Step clustering and the SVM, and combined
them to see whether the results improve.
The results show that the SVM combined with the Two-Step algorithm can significantly improve
the prediction accuracy and decrease the misclassification error in cancer diagnosis. More
importantly, the hybrid method improved the prediction accuracy following the methodology
explained in section 6. In future work, an optimization method will be combined with the
SVM-Two-Step clustering algorithm to further enhance the diagnosis accuracy.
