International Journal of Scientific & Engineering Research, Volume 6, Issue 11, November-2015 1124

ISSN 2229-5518

Detection of Breast Cancer using Data Mining Tool (WEKA)

Jyotismita Talukdar
Centre of Information Technology
University of Technology and Management
Shillong, India
[email protected]

Dr. Sanjib Kr. Kalita
Dept. of Computer Science
Gauhati University
Assam, India
[email protected]

Abstract - Breast cancer has become the primary cause of death in women in developed countries, and
it is the second most common cause of cancer death in women worldwide. The incidence of breast
cancer in women has increased significantly during the last few decades. In this paper we discuss
various data mining approaches that have been utilized for early detection of breast cancer. Breast
cancer diagnosis consists of distinguishing benign from malignant breast lumps. We have approached
the diagnosis of this disease using data mining techniques. Data mining is an essential step in the
process of knowledge discovery in databases, in which intelligent methods are applied in order to
extract patterns. The most effective way to reduce breast cancer deaths is to detect the disease
earlier. This paper discusses the early detection of breast cancer in three major steps:
(i) collection of the data set, (ii) preprocessing of the data set and (iii) classification. Data
mining and machine learning depend on classification, which is the most essential and important
task. Many experiments have been performed on medical datasets using multiple classifiers and
feature selection techniques. A good amount of research on breast cancer datasets is found in the
literature, and much of it shows good classification accuracy. For classification we have chosen
J48. All experiments are conducted in the WEKA data mining tool. The data sets are collected from
online repositories and come from actual cancer patients.

Key Words- Breast Cancer, Data Mining, WEKA, J48 Decision Tree, ZeroR


INTRODUCTION

In this research paper we have proposed the diagnosis of breast cancer using data mining
techniques. Breast cancer is the most common cancer among women. Of the two types of breast cancer,
i.e. malignant and benign, the malignant tumor develops when cells in the breast tissue divide and
grow without the normal controls on cell death and cell division. Industrialized nations such as
the United States, Australia, and countries in Western Europe have witnessed the highest incidence
rates of breast cancer. Although breast cancer is the second leading cause of cancer death in
women, the survival rate is high once it is detected early. With early diagnosis, 97% of women
survive for 5 years or more. In the healthcare industry, it is vital to understand the gradual
development of such tumors. Precise and accurate data must be available so that an accurate model
can help doctors predict and diagnose, at an early stage, whether the cancer is benign or
malignant. This will save time for physicians and improve their efficiency. This paper primarily
discusses the possibility of identifying whether the breast cancer condition is benign or malignant
even at a very early stage. The prediction is based on the attributes related to breast cancer.
There are 10 attributes in the data set used in this paper. These data will help
the physicians to decide which attributes are more important for early prediction. Three major
steps have been used in this paper, i.e. collection of datasets, data preprocessing and
classification. This paper explains the various phases of data mining that are performed on the
dataset. We have used WEKA as the data mining tool.

PROBLEM STATEMENT
Breast cancer is one of the leading cancers in many countries, including India. The survival rate
is high: with early diagnosis, 97% of women can survive for more than 5 years. Statistically,
however, the death toll due to this disease has increased significantly in the last few decades.
The main issue pertaining to its cure is early detection. Hence, apart from medicinal solutions,
some data science solution needs to be incorporated to address this cause of death.

Theoretical Background

ZeroR
ZeroR is the simplest classification method; it relies on the target and ignores all predictors.
The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no
predictive power, it is useful for determining a baseline performance as a benchmark for other
classification methods. It constructs a frequency table for the target and selects its most
frequent value.

J48
A decision tree is a predictive machine-learning model that decides the target value (dependent
variable) of a new sample based on various attribute values of the available data. The internal
nodes of a decision tree denote the different attributes; the branches between the nodes tell us
the possible values that these attributes can have in the observed samples, while the terminal
nodes tell us the final value (classification) of the dependent variable. The attribute that is to
be predicted is known as the dependent variable, since its value depends upon, or is decided by,
the values of all the other attributes. The other attributes, which help in predicting the value of
the dependent variable, are known as the independent variables in the dataset.

Medical Data Mining
In 2011, in the case of Sorrell v. IMS Health, Inc., the Supreme Court of the United States ruled
that pharmacies may share information with outside companies. The practice was authorized under the
1st Amendment of the Constitution, protecting the "freedom of speech." However, the passage of the
Health Information Technology for Economic and Clinical Health Act (HITECH Act) helped to initiate
the adoption of the electronic health record (EHR) and supporting technology in the United States.
The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and
Reinvestment Act (ARRA) and helped to open the door to medical data mining. Prior to the signing of
this law, it was estimated that only 20% of United States based physicians were utilizing
electronic patient records. Soren Brunak notes that “the patient record becomes as information-rich
as possible” and thereby “maximizes the data mining opportunities.” Hence, electronic patient
records further expand the possibilities regarding medical data mining, thereby opening the door to
a vast source of medical data analysis.

Data Mining Tasks
i. Classification
ii. Clustering
iii. Association Rule Discovery
iv. Sequential Pattern Discovery
v. Regression
vi. Deviation Detection

Figure 1: WEKA data mining tool

Results of Analysis
The analysis of breast cancer has been carried out on 10 attributes, namely ClumpThickness, cell
size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, size of
bare nuclei, BlandChromatin, NormalNucleoli, Mitoses and class. Clump thickness indicates that
radius was computed by averaging the length of radial line segments from the center of the nuclear
mass to each of the points of the nuclear border. For cell size, perimeter was measured as the
distance around the nuclear border, which is considered to be uniform. For measuring the cell
shape, area is measured by counting the number of pixels in the interior of the nuclear border and
adding one-half of the pixels on the perimeter. Marginal adhesion is measured by combining the
perimeter and area to give a measure of the compactness of the cell nuclei using the formula:
perimeter^2 / area.
The analysis has been carried out using two algorithms, namely J48 and ZeroR. The total number of
instances for the ZeroR analysis is 699. Following is the detailed analysis of both the ZeroR and
J48 algorithms.

ZeroR Result Analysis
Percentage Split = 66 %
Total Instances = 699
Attributes = 10
Test mode: split 66.0% training set, remainder test
ZeroR predicts class value: benign

Evaluation on test split for ZeroR
Table 1: Summary for ZeroR
Correctly Classified Instances      152 (63.8655 %)
Incorrectly Classified Instances    86 (36.1345 %)
Kappa statistic                     0.013
Mean absolute error                 0.4548
Root mean squared error             0.481
Relative absolute error             94.07 %
Root relative squared error         97.41 %
Total Number of Instances           238

Detailed Accuracy by Class
Table 2: Accuracy measures for ZeroR
TP Rate    FP Rate    Precision    Recall    Class
1          1          0.639        1         benign
0          0          0            0         malignant

Confusion Matrix
Table 3: Confusion matrix for ZeroR
Classifier    Benign    Malignant
Session 1     152       0
Session 2     86        0

J48 Result Analysis
Test mode: split 66.0% train, remainder test
Table 4: Instances for J48
Correctly Classified Instances      227 (95.3782 %)
Incorrectly Classified Instances    11 (4.6218 %)

Evaluation on test split for J48

Table 5: Summary for J48
Kappa statistic                     0.9006
Mean absolute error                 0.0671
Root mean squared error             0.2124
Relative absolute error             14.7632 %
Root relative squared error         44.1621 %
Total Number of Instances           238
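
As a cross-check, the accuracy and kappa figures above, and the per-class precision and recall reported for J48, can be recomputed from the J48 confusion matrix (Table 7). The following Python sketch is illustrative only and is not part of the WEKA output. It assumes that the rows of the matrix are the actual classes and the columns the predicted classes, in the order benign, malignant; the function name `confusion_metrics` is ours.

```python
def confusion_metrics(matrix):
    """Accuracy, Cohen's kappa and per-class precision/recall.

    Assumption: matrix[i][j] counts instances whose actual class is i
    and whose predicted class is j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]                        # actual counts
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # predicted counts

    accuracy = sum(matrix[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)

    # Guard against classes that are never predicted (or never occur).
    precision = [matrix[i][i] / col_totals[i] if col_totals[i] else 0.0 for i in range(k)]
    recall = [matrix[i][i] / row_totals[i] if row_totals[i] else 0.0 for i in range(k)]
    return {"accuracy": accuracy, "kappa": kappa,
            "precision": precision, "recall": recall}


# J48 confusion matrix as reported in Table 7 (order: benign, malignant).
j48 = confusion_metrics([[145, 7], [4, 82]])
print(round(j48["accuracy"] * 100, 4))           # 95.3782
print(round(j48["kappa"], 4))                    # 0.9006
print([round(p, 3) for p in j48["precision"]])   # [0.973, 0.921]
print([round(r, 3) for r in j48["recall"]])      # [0.954, 0.953]
```

The recomputed values agree with the figures reported for J48 on the 238-instance test split.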

Detailed Accuracy by Class

Table 6: Accuracy measures for J48
TP Rate    FP Rate    Precision    Recall    Class
0.954      0.047      0.973        0.954     benign
0.953      0.046      0.921        0.953     malignant

Confusion Matrix
Table 7: Confusion matrix for J48
Classifier    Benign    Malignant
Session 1     145       7
Session 2     4         82

Figure 2: J48 decision tree

Conclusion and Future Work
By using data mining we can predict the occurrence of breast cancer efficiently. For early
detection, we must know the attributes that are present in the pathology report. We have developed
patterns through which we can select the important attributes of breast cancer for its early,
efficient and accurate detection, so that it can be properly treated in time. The present study
could be extended to a larger number of patients by integrating more similar institutions or
organizations. With the help of cloud computing facilities, the results so obtained could be shared
among the institutions, thus helping the diagnosis process most effectively.
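
For reference, the ZeroR baseline described under Theoretical Background and the 66% percentage split used in the experiments can be sketched in Python as follows. This is a minimal illustration, not WEKA's implementation: the function names `zero_r` and `split_sizes` are ours, the training labels are a toy example rather than the paper's training split, and rounding the training size to the nearest integer is an assumption chosen because it reproduces the 238 test instances reported above.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the most frequent class label; all predictors are ignored."""
    return Counter(train_labels).most_common(1)[0][0]


def split_sizes(n_instances, train_fraction):
    """Sizes of the training and test parts of a percentage split.

    Assumption: the training size is rounded to the nearest integer.
    """
    n_train = round(n_instances * train_fraction)
    return n_train, n_instances - n_train


# Toy training labels (hypothetical, not the paper's training split).
toy_train = ["benign", "malignant", "benign", "benign", "malignant"]
prediction = zero_r(toy_train)

n_train, n_test = split_sizes(699, 0.66)

# Class counts of the test split as reported for ZeroR: 152 benign, 86 malignant.
test_labels = ["benign"] * 152 + ["malignant"] * 86
accuracy = sum(label == prediction for label in test_labels) / len(test_labels)

print(prediction, n_train, n_test, round(accuracy * 100, 4))
# benign 461 238 63.8655
```

Because ZeroR always predicts the majority class, its accuracy equals the share of that class in the test split, which is why it serves as the baseline against which J48 is compared.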

REFERENCES
[1] G. Holmes; A. Donkin and I.H. Witten (1994). "Weka: A machine learning workbench". Proc Second
    Australia and New Zealand Conference on Intelligent Information Systems, Brisbane, Australia.
    Retrieved 2007-06-25.
[2] S.R. Garner; S.J. Cunningham, G. Holmes, C.G. Nevill-Manning, and I.H. Witten (1995). "Applying
    a machine learning workbench: Experience with agricultural databases". Proc Machine Learning in
    Practice Workshop, Machine Learning Conference, Tahoe City, CA, USA. pp. 14–21. Retrieved
    2007-06-25.
[3] P. Reutemann; B. Pfahringer and E. Frank (2004). "Proper: A Toolbox for Learning from
    Relational Data with Propositional and Multi-Instance Learners". 17th Australian Joint
    Conference on Artificial Intelligence (AI2004). Springer-Verlag. Retrieved 2007-06-25.
[4] Breast Cancer Wisconsin (Original) Data Set [Online]. Available:
    https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
[5] Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools
    and techniques, 3rd Edition". Morgan Kaufmann, San Francisco. Retrieved 2011-01-19.
[6] http://134.208.26.59/INA/Cancer_Diagnosis.pdf

IJSER © 2015
http://www.ijser.org
