Detection of Breast Cancer Using Data Mining Tool WEKA
ISSN 2229-5518
Abstract — Breast cancer has become a leading cause of death among women in developed countries and is
the second most common cause of cancer death in women worldwide. Its incidence in women has increased
significantly during the last few decades. In this paper we discuss various data mining approaches that
have been used for the early detection of breast cancer. Breast cancer diagnosis consists of
distinguishing benign from malignant breast lumps, and we approach it using data mining techniques.
Data mining is an essential step in the process of knowledge discovery in databases, in which
intelligent methods are applied in order to extract patterns. The most effective way to reduce breast
cancer deaths is to detect the disease earlier. This paper discusses early detection in three major
steps: (i) collection of the data set, (ii) preprocessing of the data set and (iii) classification.
Classification is among the most essential tasks in data mining and machine learning. Many experiments
have been performed on medical datasets using multiple classifiers and feature selection techniques,
and a good amount of research on breast cancer datasets is found in the literature, much of it showing
good classification accuracy. For classification we have chosen J48. All experiments are conducted in
the WEKA data mining tool. The data sets are collected from online repositories and come from actual
cancer patients.
Key Words- Breast Cancer, Data Mining, WEKA, J48 Decision Tree, ZeroR
—————————— ——————————
IJSER © 2015
http://www.ijser.org
International Journal of Scientific & Engineering Research, Volume 6, Issue 11, November-2015 1125
the physicians to decide which attributes are more important for early prediction. There are three
major steps that have been used in this paper, i.e. collection of datasets, data preprocessing and
classification. This paper explains the various phases of data mining that are performed on the
dataset. We have used WEKA as a data mining tool.

PROBLEM STATEMENT
Breast cancer is one of the leading cancers in many countries, including India. The survival rate is
high: with early diagnosis, 97% of women survive for more than 5 years. Even so, the death toll due to
this disease has increased significantly in the last few decades. The main issue pertaining to its cure
is early detection. Hence, apart from medicinal solutions, a data science solution needs to be
incorporated to address this cause of death.

Theoretical Background

ZeroR
ZeroR is the simplest classification method, which relies on the target and ignores all predictors.
The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no predictive
power, it is useful for determining a baseline performance as a benchmark for other classification
methods. It constructs a frequency table for the target and selects its most frequent value.

J48
J48 is WEKA's implementation of the C4.5 decision tree algorithm. A decision tree is a predictive
machine-learning model that decides the target value (dependent variable) of a new sample based on
various attribute values of the available data. The internal nodes of a decision tree denote the
different attributes; the branches between the nodes tell us the possible values that these attributes
can have in the observed samples, while the terminal nodes tell us the final value (classification) of
the dependent variable. The attribute that is to be predicted is known as the dependent variable, since
its value depends upon, or is decided by, the values of all the other attributes. The other attributes,
which help in predicting the value of the dependent variable, are known as the independent variables in
the dataset.

Medical Data Mining
In 2011, the Supreme Court of the United States ruled in Sorrell v. IMS Health, Inc. that pharmacies
may share information with outside companies. The practice was authorized under the First Amendment of
the Constitution, protecting the "freedom of speech." However, the passage of the Health Information
Technology for Economic and Clinical Health Act (HITECH Act) helped to initiate the adoption of the
electronic health record (EHR) and supporting technology in the United States. The HITECH Act was
signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and
helped to open the door to medical data mining. Prior to the signing of this law, only an estimated 20%
of United States based physicians were using electronic patient records. Soren Brunak notes that “the
patient record becomes as information-rich as possible” and thereby “maximizes the data mining
opportunities.” Hence, electronic patient records further expand the possibilities of medical data
mining, opening the door to a vast source of medical data analysis.

Data Mining Tasks
i. Classification
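The two classifiers described in the Theoretical Background can be illustrated with a short Python sketch. It is an illustration only: the experiments in this paper use WEKA's own ZeroR and J48 implementations, and everything named below (the toy label list, the nested-dictionary tree, its attribute names and the "low"/"high" branch values) is hypothetical and is not the tree learned by J48.

```python
from collections import Counter


def zero_r(labels):
    """ZeroR: build a frequency table of the target and return its most frequent value."""
    frequency_table = Counter(labels)
    majority_class, _count = frequency_table.most_common(1)[0]
    return majority_class


# Hypothetical hand-built tree (NOT learned from data): internal nodes name an
# attribute, branches are the attribute's possible values, leaves are class labels.
TOY_TREE = {
    "attribute": "cell_size_uniformity",
    "branches": {
        "low": "benign",
        "high": {
            "attribute": "bare_nuclei",
            "branches": {"low": "benign", "high": "malignant"},
        },
    },
}


def classify(tree, sample):
    """Follow the branch matching each tested attribute until a leaf (class label) is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node


# Toy target column, assumed for illustration only.
toy_labels = ["benign"] * 5 + ["malignant"] * 3
baseline_prediction = zero_r(toy_labels)

toy_sample = {"cell_size_uniformity": "high", "bare_nuclei": "high"}
tree_prediction = classify(TOY_TREE, toy_sample)

print(baseline_prediction, tree_prediction)
```

ZeroR never looks at `toy_sample`, which is why it only serves as a baseline, whereas the tree reads one independent variable per internal node on the way to a leaf.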
The analysis of breast cancer has been carried out on 10 attributes, namely ClumpThickness, cell size
uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, size of bare nuclei,
BlandChromatin, NormalNucleoli, Mitoses and class. Clump thickness indicates that the radius was
computed by averaging the length of radial line segments from the center of the nuclear mass to each of
the points of the nuclear border. For cell size, the perimeter was measured as the distance around the
nuclear border, which is considered to be uniform. For measuring the cell shape, the area is measured
by counting the number of pixels in the interior of the nuclear border and adding one-half of the
pixels on the perimeter. Marginal adhesion is measured by combining the perimeter and area to give a
measure of the compactness of the cell nuclei using the formula perimeter² / area.

The analysis has been carried out using two algorithms, namely J48 and ZeroR. The total number of
instances for the ZeroR analysis is 699. The following is the detailed analysis of both the ZeroR and
J48 algorithms.

Table 2: Accuracy measures for ZeroR classifier
TP Rate   FP Rate   Precision   Recall   Class
1         1         0.639       1        benign
0         0         0           0        malignant

Confusion Matrix
Table 3: Confusion matrix for ZeroR classifier
Classifier   Benign   Malignant
Session 1    152      0
Session 2    86       0

J48 Result Analysis

Test mode: split 66.0% train, remainder test
Table 4: Instances for J48
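To make the 66% split test mode concrete, the following Python sketch shows how a percentage split divides the 699 instances into a training part and a test part. It is a minimal sketch under stated assumptions: the function name `percentage_split`, the seed and the shuffling are hypothetical, and rounding the training size to the nearest integer is an assumption about how the split is computed, not a statement of WEKA's internals. Under that assumption the test part holds 238 instances, which agrees with the row totals of the ZeroR confusion matrix (152 + 86).

```python
import random


def percentage_split(instances, train_percent=66.0, seed=1):
    """Shuffle a copy of the instances and cut it into train and test parts.

    Assumption: the training size is the instance count times the percentage,
    rounded to the nearest integer; the remainder is used for testing.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    train_size = round(len(shuffled) * train_percent / 100.0)
    return shuffled[:train_size], shuffled[train_size:]


# Stand-in for the 699 instances of the data set (row indices only).
instances = list(range(699))
train, test = percentage_split(instances)

print(len(train), len(test))
```

Because the test part is never seen during training, the accuracy measures reported for it estimate how the classifier behaves on new patients rather than on the data it was built from.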
TP Rate   FP Rate   Precision   Recall   Class
0.954     0.047     0.973       0.954    benign
0.953     0.046     0.921       0.953    malignant

Confusion Matrix
Table 7: Confusion matrix for J48
Classifier   Benign   Malignant
Session 1    145      7
Session 2    4        82

occurrence of breast cancer most efficiently. For early detection, we must know the attributes that are
present in the pathology report. We have developed patterns by which we can select the important
attributes of breast cancer for early, efficient and accurate detection, so that it can be properly
treated in time. The present study could be extended to a larger number of patients by integrating more
similar institutions or organizations. With the help of cloud computing facilities, the results so
obtained could be shared among the institutions, thus helping the diagnosis process most effectively.
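The per-class accuracy measures reported for ZeroR and J48 can be cross-checked against the two confusion matrices with a short Python sketch. It assumes that each matrix row is an actual class (first row benign, second row malignant) and each column a predicted class; that reading is an assumption, although under it the computed values reproduce the reported TP rate, FP rate and precision figures. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, classes):
    """Compute TP rate (recall), FP rate and precision for every class.

    Assumption: matrix[i][j] counts instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp
        fp = sum(matrix[i][k] for i in range(len(classes))) - tp
        tn = total - tp - fn - fp
        results[name] = {
            "tp_rate": tp / (tp + fn) if tp + fn else 0.0,
            "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
            # Precision is reported as 0 when a class is never predicted.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
        }
    accuracy = sum(matrix[k][k] for k in range(len(classes))) / total
    return results, accuracy


classes = ["benign", "malignant"]
zero_r_matrix = [[152, 0], [86, 0]]   # ZeroR confusion matrix (Table 3)
j48_matrix = [[145, 7], [4, 82]]      # J48 confusion matrix (Table 7)

zero_r_metrics, zero_r_accuracy = per_class_metrics(zero_r_matrix, classes)
j48_metrics, j48_accuracy = per_class_metrics(j48_matrix, classes)

for label, metrics in (("ZeroR", zero_r_metrics), ("J48", j48_metrics)):
    for name in classes:
        rounded = {key: round(value, 3) for key, value in metrics[name].items()}
        print(label, name, rounded)
```

On the same 238 test instances, ZeroR labels every case benign and therefore misses all 86 malignant cases, while J48 misclassifies 11 cases in total (7 benign and 4 malignant), which is the gap the baseline comparison is meant to expose.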
[Figure: J48 Decision Tree]