Classification of Breast Cancer Detection by Using Machine Learning Technique
ISSN No:-2456-2165
Abstract:- Breast cancer causes many deaths in women, yet it is curable if it is diagnosed early. Hence, early detection of cancer in women helps in taking the necessary actions. This paper discusses supervised machine learning techniques for detecting the disease. Sequential Forward Selection (SFS) is used to select the best features for a support vector machine (SVM) model, and the Wisconsin Breast Cancer Dataset (WBCD) is used for the diagnosis of breast cancer. The SVM result shows 96% precision because of random permutation on the dataset.

Keywords:- Sequential Forward Selection SFS; Support Vector Machine; Breast Cancer; Classification; Machine Learning; Wisconsin Breast Cancer Dataset.

I. INTRODUCTION

Cancer cells are cells that have lost the ability to follow the normal controls that the body exerts on all cells. Cancer can occur anywhere in the body because there are cells everywhere in the body. One of the most common cancers in women is breast cancer, in men prostate cancer, and in both men and women lung and colorectal cancers. Generally, cancer has a number of types, which are carcinomas, sarcomas, leukemias and lymphomas. A carcinoma is a type of cancer that starts in the skin or in the tissues that cover the outer layers of internal organs; breast cancer, prostate cancer and lung cancer are examples of carcinomas. Breast cancer begins when there is irregular development or unusual change in healthy cells, forming a sheet of cells known as a tumour. Tumours can be either non-cancerous (benign) or cancerous (malignant). Healthy body tissues are destroyed by cancerous tumours when they break in.

Women 40 to 50 years of age die of breast cancer, and this rate of death is ranked second among the causes of death in women. There are almost 145000 cases in India according to the World Health Organization. Huge innovation in medical science has decreased the cases of breast cancer, as there are effective treatment methods now. Early detection and accurate diagnosis are key factors in the decrease in breast cancer.

Advances in medicine in the past few decades have improved health care immensely, allowing doctors to diagnose and treat diseases more efficiently. The biggest difference between doctors is not their level of intelligence; it is how they approach patient problems and the types of health system that support them. This combination is what causes such wide variations in clinical outcomes, and it is the reason why machine learning is the best solution out there to improve doctors' capabilities. There is much potential here: studies show that over half of all women in the U.S.A. who get regular mammograms will receive at least one false positive, which is a test that wrongly indicates the possibility of cancer. Radiologists regularly disagree on their respective interpretations of medical images. Artificial intelligence can do what no radiologist can: it can learn from hundreds and thousands of medical images, and it is estimated to be up to 10 percent more accurate than the average radiologist. That accuracy gap will increase as computing power gets cheaper, and the approach can be applied to any of the countless subfields of medicine, not just radiology. Doctors also have to interpret patient medical records, which can be a very complex task. NLP, a branch of artificial intelligence that helps computers understand and interpret human language, can review thousands of medical records and output the optimal steps for evaluating and managing patients with illness. Doctors have natural biases; artificial intelligence is more likely to produce an objective diagnosis for patients, without preconceived socio-economic notions that can produce disparities in care. Machine learning will become an essential tool for doctors. It helps in minimizing the error in a short time, and the results can be examined in a more detailed way.

In this study, SFS and SVM are used to diagnose breast cancer. The WBCD from the University of California at Irvine (UCI) machine learning repository was used for the training and testing experiments. The data were shuffled using random permutation and SFS was then applied. Applying SFS to the dataset gave 96.4% accuracy using the best ten features of the dataset: texture mean, perimeter mean, smoothness mean, texture, area, fractal dimension, texture worst, smoothness worst, concave points worst, and diagnosis. With the help of these features a new dataset is created, on which SVM is applied.

II. LITERATURE REVIEW

In 2002, Vinterbo, Ohno-Machado, Wong, Lappas and Albrecht recorded 98.8% accuracy when logarithmic simulated annealing learning and the perceptron algorithm were combined [1]. In 1999, Sipper and Pena-Reyes reached 97.36% accuracy with a fuzzy-GA method [2]. In 2000, Setiono reported 98.10% accuracy with a feed-forward neural network rule extraction algorithm [3]. Using 10-fold cross-validation with the C4.5 decision tree method, Quinlan reported 94.74% accuracy in 1996. The RIAC method was used by Cercone, Shan and Hamiton in 1996, and they obtained 94.99% accuracy [6]. In 1996, Dobnikar and Ster used the linear discriminant analysis method to obtain 96.8% accuracy [5]. Neuro-fuzzy techniques were used by Kruse and Nauck in 1999 to obtain 95.06% accuracy [6]. In Goodmen, Bogess, and Watkeens (2002), three different

The architecture of the proposed method is illustrated in Fig. 1. In Jupyter, a classification learner application is implemented on the proposed algorithm.
Step 1: The dataset is taken as input for random permutation, so that random features of the dataset are selected.
Step 2: The motivation behind feature selection algorithms is to automatically select a subset of features that is most relevant to the problem. The goal of feature selection is two-fold: we want to improve the computational efficiency and reduce the generalization error of the model by removing irrelevant features or noise.
Step 3: The feature selection algorithm gives the most relevant features of the dataset, on which accuracy is tested.
Step 4: On the basis of accuracy a new dataset is created. The features that have the maximum accuracy are selected for the new dataset.
Step 5: Classification is then conducted on the features selected for the new dataset.
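
The steps above can be sketched in a few lines of Python. This is a minimal illustration of random permutation followed by greedy sequential forward selection, not the implementation used in the study. The function names and the toy scoring function are hypothetical; in the actual pipeline the score would presumably be the accuracy of an SVM trained on the candidate feature subset.

```python
import random
from typing import Callable, List, Sequence


def random_permutation(rows: Sequence, seed: int = 0) -> list:
    """Return a shuffled copy of the rows (Step 1). The seed is an assumption."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def sequential_forward_selection(
    n_features: int,
    score: Callable[[Sequence[int]], float],
    k: int,
) -> List[int]:
    """Greedy SFS (Steps 2 to 4): start from an empty subset and, in each
    round, add the single feature whose inclusion gives the highest score."""
    selected: List[int] = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_score(subset: Sequence[int]) -> float:
    """Hypothetical stand-in for classifier accuracy: features 1, 3 and 4
    are useful, and every added feature carries a small penalty."""
    useful = {1: 0.5, 3: 0.3, 4: 0.1}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


rows = list(range(10))                  # placeholder records
shuffled = random_permutation(rows)     # Step 1
chosen = sequential_forward_selection(6, toy_score, k=3)  # Steps 2 to 4
print(chosen)
```

The selected indices would then define the reduced dataset on which the classifier is trained (Step 5). Because the search is greedy, it evaluates far fewer subsets than an exhaustive search but is not guaranteed to find the best subset.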
IV. RESULTS
The F-score measures the importance of each feature. Grid search optimizes the SVM parameters. The F1 score can be taken as a weighted average of the recall and precision, where an F1 score reaches its worst value at 0 and its best value at 1. The relative contributions of precision and recall to the F1 score are equal. Tables 1 to 3 show the classification accuracies.

The confusion matrix shows the true negative rate and true positive rate of each class. The precision of the classification models is based on the features that have been selected. The matrix y-axis defines the true class and the matrix x-axis depicts the predicted class.

Fig. 4: Confusion matrix for 30% testing, 70% training.

The possible classes are "1", meaning malignant, and "0", meaning benign. Here a test is done on 171 patients for the presence of breast cancer. According to the dataset, 107 patients are not suffering from breast cancer and 64 patients are suffering from breast cancer. The classifier predicted "yes" 56 times and "no" 115 times.

Accuracy: (true positive + true negative)/total = (55+106)/171 = 0.94

               Precision  Recall  F1-score  Support
1                 0.98     0.86     0.92       64
0                 0.92     0.99     0.95      107
Micro avg         0.94     0.94     0.94      171
Macro avg         0.95     0.93     0.94      171
Weighted avg      0.94     0.94     0.94      171
Table 1- The accuracy achieved when the data was divided 70-30% into training and test sets respectively; the accuracy is 94%.

Accuracy: (true positive + true negative)/total = (37+72)/114 = 0.96

               Precision  Recall  F1-score  Support
1                 1.00     0.88     0.94       42
0                 0.94     1.00     0.97       72
Micro avg         0.96     0.96     0.96      114
Macro avg         0.97     0.94     0.95      114
Weighted avg      0.96     0.96     0.96      114
Table 2- The accuracy achieved when the data was divided 80-20% into training and test sets respectively; the accuracy is 96%.

Accuracy: (true positive + true negative)/total = (94+178)/285 = 0.95

               Precision  Recall  F1-score  Support
1                 0.99     0.89     0.94      106
0                 0.94     0.99     0.96      179
Micro avg         0.95     0.95     0.95      285
Macro avg         0.96     0.94     0.95      285
Weighted avg      0.96     0.95     0.95      285
Table 3- The accuracy achieved when the data was divided 50-50% into training and test sets respectively; the accuracy is 95%.

The Receiver Operating Characteristic (ROC) of the linear SVM is also considered. To be accurate, a model should be able to differentiate between benign and malignant patients. The performance is visualized through the ROC graph, whereas the area under the curve (AUC) summarizes the overall performance in a single value.
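
As a worked check, the scores reported for the 70-30% split can be recomputed from confusion-matrix counts. The sketch below is illustrative only. The true positive and true negative counts (55 and 106) come from the accuracy formula for that split; the false positive and false negative counts (1 and 9) are not stated in the text and are derived here as an assumption from the class supports of 64 malignant and 107 benign patients. The function name is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the positive class.
    Assumes at least one predicted positive and one actual positive."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 70-30% split: tp and tn from the text; fp and fn derived (assumption).
m = classification_metrics(tp=55, tn=106, fp=1, fn=9)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values agree with the class "1" row of Table 1 and with the reported accuracy of 0.94.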
REFERENCES