
Volume 8, Issue 8, August 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

How can Machine Learning be used to Classify Breast Cancer?
Krish Kapoor

Abstract - Breast cancer is a prevalent form of cancer that affects a significant number of individuals worldwide and can have severe consequences if not detected and treated early. The World Health Organization (WHO) estimates that breast cancer is the most common cancer among women globally, with an estimated 2.3 million new cases in 2020 alone. Early detection is crucial in improving survival rates and treatment outcomes. This paper explores the application of Machine Learning (ML) techniques for predicting breast cancer diagnosis in individuals. We utilize a publicly available dataset from the Kaggle machine learning repository, which contains data from breast cancer patients collected from various medical institutions. Several machine learning models, including Naive Bayes Algorithm, Decision Trees, Logistic Regression, Neural Networks, Random Forest, Stochastic Gradient, and Support Vector Machines, are employed to analyze the dataset. The performance of these models is assessed using 10-fold cross-validation. Furthermore, we propose the most suitable machine learning algorithm for breast cancer diagnosis based on specified input parameters and discuss the potential deployment of a breast cancer diagnostic tool.

Keywords: Breast Cancer Detection, Supervised and Unsupervised Machine Learning, Artificial Intelligence.

I. INTRODUCTION

Breast cancer is a prevalent and life-threatening disease that requires accurate classification for effective diagnosis and treatment planning. [8] The ability to classify breast tumors into malignant (cancerous) or benign (non-cancerous) categories is essential for determining the appropriate course of action. [8] Artificial intelligence (AI) and Machine Learning (ML) algorithms have shown great promise in assisting with the classification of breast cancer, providing accurate and efficient tools for healthcare professionals.

Traditionally, the classification of breast tumors relied on histological examination by pathologists, which is time-consuming and subject to inter-observer variability. [9] However, with the advancement of AI and machine learning, the development of automated systems for breast cancer classification has become possible. These systems leverage large datasets containing tumor features and corresponding diagnoses to learn patterns and make accurate predictions.

Machine learning algorithms, such as Naive Bayes, Decision Trees, Logistic Regression, Neural Networks, Random Forest, Stochastic Gradient, and Support Vector Machines, have been applied to breast cancer classification tasks. These algorithms analyze features extracted from breast tumor images, including size, shape, texture, and other characteristics, to differentiate between malignant and benign tumors. By learning from historical data, these algorithms can make predictions on new, unseen cases, aiding in the early detection and management of breast cancer.

In this paper, we focus on the classification of breast cancer using AI and machine learning algorithms. We utilize a publicly available dataset, such as the one provided by the Kaggle machine learning repository, which contains information on various tumor features and corresponding diagnoses. This dataset serves as the basis for training and evaluating the performance of different machine-learning models.

The dataset is pre-processed to handle missing values, normalize features, and split into training and testing sets. We then apply a range of machine learning algorithms to the training data, allowing them to learn from the patterns and relationships within the dataset. The performance of each algorithm is evaluated using appropriate metrics such as accuracy, precision, recall, and F1 score.

The goal of this research is to identify the most accurate machine-learning algorithm for breast cancer classification. The selected algorithm can then be used as a reliable tool for assisting healthcare professionals in diagnosing breast cancer cases. Early and accurate classification enables timely intervention, personalized treatment plans, and improved patient outcomes.

II. LITERATURE REVIEW

• Breast cancer type classification using machine learning - The study evaluated four machine learning algorithms for classifying breast cancer into triple negative and non-triple negative types. Among these algorithms, the Support Vector Machine (SVM) demonstrated higher accuracy and fewer misclassification errors compared to the other three algorithms. The findings suggest that machine learning algorithms, particularly SVM, are effective for accurately classifying breast cancer into triple negative and non-triple negative types.
• Breast cancer classification using machine learning - This paper discusses the significance of breast cancer classification due to its prevalence and high mortality rates. The study focuses on the application of machine learning techniques for breast cancer classification, specifically comparing two classifiers: Naive Bayes (NB) and k-nearest neighbor (KNN). The authors evaluate the accuracy of these classifiers using cross-validation and find that KNN achieves the highest accuracy (97.51%) with the lowest error rate, followed by the NB classifier (96.19%). This comparison underscores the effectiveness

IJISRT23AUG1167 www.ijisrt.com 1060


of machine learning in accurately classifying breast cancer, supporting its potential for diagnostic applications.
• Evaluating the Performance of Machine Learning Techniques in the Classification of Wisconsin Breast Cancer - This paper addresses the importance of accurate diagnosis in distinguishing between malignant and benign breast tumors due to the significant impact of breast cancer on women's health and mortality rates. The study focuses on the application of three machine learning algorithms (Support Vector Machine, K-nearest neighbors, and Decision tree) for breast cancer classification. Using the Wisconsin Breast Cancer (Diagnostic) dataset, the study compares the performance of these classifiers to determine the most effective one in terms of accuracy. The findings reveal that the quadratic support vector machine achieves the highest accuracy (98.1%) with the lowest false discovery rates. This research contributes to the literature by highlighting the superior performance of the quadratic support vector machine in breast cancer classification, demonstrating the potential of machine learning algorithms in this domain.
• Machine learning techniques to diagnose breast cancer - This paper explores the application of machine learning techniques in cancer diagnosis, specifically for distinguishing between benign and malignant breast tumors. The study combines support vector machines, K-nearest neighbors, and probabilistic neural network classifiers with various feature ranking, selection, and extraction methods. The research achieves high accuracy in breast cancer diagnosis, with support vector machine classifiers attaining an overall accuracy of 98.80% and 96.33% on two commonly used breast cancer benchmark datasets. This study demonstrates the effectiveness of machine learning algorithms in accurately differentiating between benign and malignant breast tumors, offering valuable insights into the field of cancer diagnosis.
• The above research primarily investigates the accuracy of machine learning models in classifying malignant and benign breast cancer, comparing each model's performance to determine the most effective one. This study not only contributes to the existing body of knowledge on the application of machine learning in healthcare but also holds the potential to advance the medical industry. The observations and analyses from this research could significantly impact future medical investigations, facilitating quicker and more accurate diagnoses of breast cancer.

III. METHODOLOGY

Fig. 1: Proposed System Architecture

A. Proposed System Architecture
The proposed system architecture is shown in Fig. 1. The dataset containing the information about the symptoms of the patients will be fed to the prediction algorithms: Naive Bayes, Decision Trees, Logistic Regression, Support Vector Machines, Neural Networks, Stochastic Gradient, and Random Forest. The performance of the algorithms will then be tested with an appropriate evaluation model, in particular 10-fold cross-validation. We will then choose the best algorithm to build the system for end users, using the dataset as the database. The tool will take the symptoms from the user as input and will classify and display whether the user has breast cancer or not.

B. Dataset Details
This dataset contains reports of breast-cancer symptoms of 570 persons. We have taken this dataset from the Kaggle machine learning repository.

Link: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset?resource=download
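The evaluation procedure described in Section A can be sketched as follows. This is a minimal illustration, not the software used in the study, which the paper does not name. It assumes scikit-learn is installed, uses scikit-learn's bundled Wisconsin Diagnostic data as a stand-in for the Kaggle CSV, and picks estimator classes and settings that are illustrative guesses at the seven model families, so its scores should not be expected to match Table 4.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Wisconsin Diagnostic data stands in for the Kaggle
# CSV. scikit-learn encodes malignant as 0, so flip it to match Table 3.
data = load_breast_cancer()
X, y = data.data, 1 - data.target  # 1 = malignant, 0 = benign

# Assumption: estimator choices and hyperparameters are illustrative only.
models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "NN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SG": SGDClassifier(random_state=0),
    "SVM": SVC(),
}

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so each fold is normalized using
    # statistics from its own training split only.
    pipeline = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipeline, X, y, cv=folds, scoring="accuracy")
    scores[name] = fold_scores.mean()

# Refit the best-scoring model on all data and classify one record,
# mimicking the proposed end-user tool.
best_name = max(scores, key=scores.get)
best_model = make_pipeline(StandardScaler(), models[best_name]).fit(X, y)
prediction = int(best_model.predict(X[:1])[0])  # 1 = malignant, 0 = benign
```

Keeping the scaler inside the pipeline avoids leaking test-fold statistics into training, which would otherwise inflate the cross-validated accuracy.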

Table 1: Description of Dataset


Dataset                          Number of Attributes   Number of Instances
Breast Cancer Symptom Dataset    30                     570

Table 2: Description of Attributes


Attributes
Radius_Mean                Smoothness_Se
Texture_Mean               Compactness_Se
Perimeter_Mean             Concavity_Se
Area_Mean                  Concave Points_Se
Smoothness_Mean            Symmetry_Se
Compactness_Mean           Fractal_Dimension_Se
Concavity_Mean             Radius_Worst
Concave Points_Mean        Texture_Worst
Symmetry_Mean              Perimeter_Worst
Fractal_Dimension_Mean     Area_Worst
Radius_Se                  Smoothness_Worst
Texture_Se                 Compactness_Worst
Perimeter_Se               Concavity_Worst
Area_Se                    Concave Points_Worst
Fractal_Dimension_Worst    Symmetry_Worst

Table 3: Dataset Details

             Benign          Malignant
Diagnosis    0 (Negative)    1 (Positive)
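The label encoding in Table 3 can be applied while reading the data, as in the sketch below. It assumes the CSV stores the diagnosis as the letters "M" and "B" in a column named "diagnosis" alongside an "id" column; these names, the helper `load_rows`, and the two sample rows are illustrative assumptions, not details given in the paper.

```python
import csv
import io

# Table 3 encoding: benign = 0 (negative), malignant = 1 (positive).
LABELS = {"B": 0, "M": 1}

def load_rows(handle, label_column="diagnosis", skip=("id",)):
    """Return (features, labels) read from an open CSV file handle."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(LABELS[row[label_column].strip()])
        features.append([
            float(value)
            for column, value in row.items()
            if column != label_column and column not in skip
        ])
    return features, labels

# Tiny in-memory sample with assumed column names and illustrative values.
sample = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "1,M,17.99,10.38\n"
    "2,B,13.54,14.36\n"
)
X, y = load_rows(sample)
print(y)  # [1, 0]
```

Passing an open file object instead of the in-memory sample would read the full dataset the same way.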

IV. RESULTS

The performance of the different data mining techniques on our dataset, with detailed accuracy information, is presented in the following tables.

Table 4: Comparison of Evaluation Metrics using 10-Fold Cross Validation


Evaluation Metrics (Cross Validation)    NB       LR       DT       NN       RF       SG       SVM
Total Number of Instances                570      570      570      570      570      570      570
Correctly Classified Instances           537      540      542      540      544      454      521
Correctly Classified (%)                 94.2%    94.7%    95.0%    94.7%    95.4%    79.6%    91.4%
Incorrectly Classified Instances         33       30       28       30       26       116      49
Incorrectly Classified (%)               5.8%     5.3%     5.0%     5.3%     4.6%     20.4%    8.6%
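The percentage rows in Table 4 follow from the instance counts as accuracy = correctly classified / total. A minimal sketch of that arithmetic, using the counts copied from the table (the helper name `accuracy_percent` is ours, and recomputed values may differ from the printed ones in the last digit because of rounding):

```python
# Correctly classified instances per model, copied from Table 4.
TOTAL = 570
correct = {"NB": 537, "LR": 540, "DT": 542, "NN": 540,
           "RF": 544, "SG": 454, "SVM": 521}

def accuracy_percent(n_correct, n_total=TOTAL):
    """Accuracy as a percentage: correct predictions over all predictions."""
    return 100.0 * n_correct / n_total

accuracies = {name: round(accuracy_percent(n), 1) for name, n in correct.items()}
best = max(accuracies, key=accuracies.get)
print(best, accuracies[best])  # RF 95.4
```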


Fig. 2: Performance of Classification Algorithms Using Cross-Validation Technique

Table 5: Comparison of Performance Parameters using 10-Fold Cross Validation


Performance Parameter    Class                   NB      LR      DT      NN      RF      SG      SVM
Precision                Positive (Malignant)    0.92    0.90    0.92    0.93    0.93    1.00    0.99
                         Negative (Benign)       0.95    0.90    0.91    0.94    0.91    0.66    0.85
                         Weighted Average        0.94    0.90    0.91    0.94    0.92    0.79    0.90
Recall                   Positive (Malignant)    0.92    0.83    0.87    0.90    0.85    0.13    0.72
                         Negative (Benign)       0.95    0.95    0.94    0.96    0.96    1.00    0.99
                         Weighted Average        0.94    0.90    0.91    0.94    0.92    0.67    0.89
F-measure                Positive (Malignant)    0.92    0.86    0.90    0.91    0.89    0.24    0.83
                         Negative (Benign)       0.95    0.92    0.93    0.95    0.94    0.79    0.92
                         Weighted Average        0.94    0.90    0.91    0.94    0.92    0.58    0.89
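The F-measure rows in Table 5 can be checked against the precision and recall rows, assuming the F-measure is the usual F1 score, that is, the harmonic mean of precision and recall. The sketch below recomputes the value for the positive (malignant) class of the SVM model; because the table's inputs are already rounded to two decimals, other recomputed entries may differ from the printed ones in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Positive (malignant) class precision and recall for SVM, from Table 5.
svm_f = f_measure(0.99, 0.72)
print(round(svm_f, 2))  # 0.83
```

The harmonic mean is pulled toward the smaller of the two inputs, which is why a model with high precision but low recall still receives a low F-measure.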


[Bar chart: precision, recall, and F-measure (positive, negative, and weighted average) for NB, LR, DT, NN, RF, SG, and SVM]

Fig. 3: Performance of Classification Algorithms Using Cross-Validation Technique
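As a minimal sketch of how precision and recall are computed for the positive (malignant) class, and of how they trade off against each other, the example below counts true positives, false positives, and false negatives at two decision thresholds. The scores and labels are hypothetical values invented for illustration; they are not taken from the study's data.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the positive (malignant = 1) class."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(predicted, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores (estimated probability of malignancy) and labels.
scores = [0.95, 0.85, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(precision_recall(scores, labels, 0.50))  # (0.75, 0.75)
print(precision_recall(scores, labels, 0.25))  # recall rises to 1.0, precision drops
```

Lowering the threshold flags more tumors as malignant, so fewer malignant cases are missed (higher recall) at the cost of more benign cases being flagged (lower precision).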

Table 4 shows the accuracy of each model under 10-fold cross-validation. The Random Forest model classified the greatest number of instances correctly, with 544 out of 570 (95.4% accuracy). It is followed closely by Decision Trees, which classified 542 instances correctly (95.0% accuracy). The least accurate models were Support Vector Machines and Stochastic Gradient, which classified 521 and 454 instances correctly, respectively.

Table 5 shows the precision, recall, and F-measure of each model. The models with the highest weighted-average precision (the proportion of positive predictions that are actually correct) are Naive Bayes, Neural Networks, and Random Forest. These three models also have the highest weighted-average recall (the proportion of actual positives that are correctly predicted). The same three models have the highest weighted-average F-measure (the harmonic mean of precision and recall).

In statistics, precision, recall, and F-measure are common metrics used to evaluate the performance of a classification model. Precision measures the proportion of true positives (TP) among the instances that are predicted as positive (TP + false positives, FP), and thus reflects the accuracy of the positive predictions. Recall, on the other hand, measures the proportion of true positives among the instances that are actually positive (TP + false negatives, FN), and thus reflects the completeness of the positive predictions.

In classifying malignant and benign breast cancer using machine learning, false positives and false negatives have different consequences. A false positive occurs when the model predicts malignancy when the tumor is benign, leading to unnecessary procedures and anxiety. A false negative occurs when the model predicts benignity when the tumor is malignant, resulting in delayed diagnosis and treatment. Balancing precision and recall is crucial, prioritizing recall if the cost of false negatives is higher and precision if the cost of false positives is higher. High recall is often favored in medical settings to minimize missed malignant cases, but the specific application and potential consequences should be considered to determine the optimal trade-off between precision and recall.

V. DISCUSSION

A. Result Analysis
The best accuracy was achieved by the Random Forest algorithm, which classified 95.4% of instances correctly under 10-fold cross-validation. Its weighted-average precision, recall, and F-measure (0.92 each) were also among the three highest in Table 5, behind Naive Bayes and Neural Networks (0.94 each). The figures above depict the performance of the algorithms under cross-validation.

B. Proposed Tool
Based on the study's findings, we propose a user-friendly tool that uses a machine learning algorithm, specifically Random Forest, to classify malignant and benign breast cancer. This tool would allow individuals to input relevant medical information and receive a prediction regarding the nature of their breast tumor. By harnessing the power of machine learning, this tool aims to provide an accurate and convenient solution for predicting breast cancer risk. Given the increasing prevalence of breast cancer globally, this tool would empower individuals to monitor their health proactively. The intuitive design and user-friendly interface would ensure that users can easily interpret the results and take appropriate actions.

By leveraging this technology, individuals can self-assess their breast tumor's nature and subsequently seek medical advice from a healthcare professional. This approach saves time and resources, enabling healthcare providers to focus on cases requiring immediate attention. Moreover, in regions where breast cancer poses a significant health challenge, this tool can alleviate the strain on healthcare systems by enabling individuals to self-diagnose and manage their condition proactively. This not only benefits the individual but also helps ensure timely and adequate medical care while relieving the burden on healthcare systems.

VI. CONCLUSION

In this paper, we utilized open-source machine learning algorithms on a public dataset to classify breast tumors as malignant or benign. Through evaluation using metrics such as accuracy, precision, recall, and F1 score, we found that machine learning models, specifically Random Forest, demonstrated high accuracy in classifying breast cancer. This has significant implications for early detection, treatment planning, and improved patient outcomes. We proposed a classification tool that utilizes these models to aid healthcare professionals in making informed decisions. While limitations exist, our study highlights the potential of machine learning in accurately classifying breast cancer, paving the way for enhanced diagnostic accuracy and more effective treatment strategies.

REFERENCES

[1] Wu, J., & Hicks, C. (2021). Breast cancer type classification using machine learning. Journal of Personalized Medicine, 11(2), 61.
[2] Amrane, M., et al. (2018). Breast cancer classification using machine learning. In 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT). IEEE.
[3] Obaid, O. I., et al. (2018). Evaluating the performance of machine learning techniques in the classification of Wisconsin Breast Cancer. International Journal of Engineering & Technology, 7(4.36), 160-166.
[4] Osareh, A., & Shadgar, B. (2010). Machine learning techniques to diagnose breast cancer. In 2010 5th International Symposium on Health Informatics and Bioinformatics. IEEE.
[5] World Health Organization (WHO). (2023, July 12). Breast Cancer. Who.int. URL: www.who.int/news-room/fact-sheets/detail/breast-cancer.
[6] Early Detection of Breast Cancer: Importance of Regular Self-Exams and Mammograms. (2023). Medanta.org. URL: www.medanta.org/patient-education-blog/early-detection-of-breast-cancer-importance-of-regular-self-exams-and-mammograms/#:~:text=Breast%20cancer%20affects%20millions%20of,when%20it%20is%20most%20treatable.
[7] Yasser, M. (2015). Breast Cancer Dataset. Kaggle.com. URL: www.kaggle.com/datasets/yasserh/breast-cancer-dataset?resource=download.
[8] Łukasiewicz, S., et al. (2021, August 25). Breast Cancer-Epidemiology, Risk Factors, Classification, Prognostic Markers, and Current Treatment Strategies-an Updated Review. Cancers. URL: www.ncbi.nlm.nih.gov/pmc/articles/PMC8428369/.
[9] Ginter, P. S., et al. (2021, April). Histologic Grading of Breast Carcinoma: A Multi-Institution Study of Interobserver Variation Using Virtual Microscopy. Modern Pathology: An Official Journal of the United States and Canadian Academy of Pathology, Inc. URL: www.ncbi.nlm.nih.gov/pmc/articles/PMC7987728/.
