How Can Machine Learning Be Used To Classify Breast Cancer?
ISSN No:-2456-2165
Abstract - Breast cancer is a prevalent form of cancer that affects a significant number of individuals worldwide and can have severe consequences if not detected and treated early. The World Health Organization (WHO) estimates that breast cancer is the most common cancer among women globally, with an estimated 2.3 million new cases in 2020 alone. Early detection is crucial in improving survival rates and treatment outcomes. This paper explores the application of Machine Learning (ML) techniques for predicting breast cancer diagnosis in individuals. We utilize a publicly available dataset from the Kaggle machine learning repository, which contains data from breast cancer patients collected from various medical institutions. Several machine learning models, including Naive Bayes Algorithm, Decision Trees, Logistic Regression, Neural Networks, Random Forest, Stochastic Gradient, and Support Vector Machines, are employed to analyze the dataset. The performance of these models is assessed using 10-fold cross-validation. Furthermore, we propose the most suitable machine learning algorithm for breast cancer diagnosis based on specified input parameters and discuss the potential deployment of a breast cancer diagnostic tool.

Keywords:- Breast Cancer Detection, Supervised and Unsupervised Machine Learning, Artificial Intelligence.

I. INTRODUCTION

Breast cancer is a prevalent and life-threatening disease that requires accurate classification for effective diagnosis and treatment planning. [8] The ability to classify breast tumors into malignant (cancerous) or benign (non-cancerous) categories is essential for determining the appropriate course of action. [8] Artificial intelligence (AI) and Machine Learning (ML) algorithms have shown great promise in assisting with the classification of breast cancer, providing accurate and efficient tools for healthcare professionals.

Traditionally, the classification of breast tumors relied on histological examination by pathologists, which is time-consuming and subject to inter-observer variability. [9] However, with the advancement of AI and machine learning, the development of automated systems for breast cancer classification has become possible. These systems leverage large datasets containing tumor features and corresponding diagnoses to learn patterns and make accurate predictions.

Machine learning algorithms, such as Naive Bayes, Decision Trees, Logistic Regression, Neural Networks, Random Forest, Stochastic Gradient, and Support Vector Machines, have been applied to breast cancer classification tasks. These algorithms analyze features extracted from breast tumor images, including size, shape, texture, and other characteristics, to differentiate between malignant and benign tumors. By learning from historical data, these algorithms can make predictions on new, unseen cases, aiding in the early detection and management of breast cancer.

In this paper, we focus on the classification of breast cancer using AI and machine learning algorithms. We utilize a publicly available dataset, such as the one provided by the Kaggle machine learning repository, which contains information on various tumor features and corresponding diagnoses. This dataset serves as the basis for training and evaluating the performance of different machine-learning models.

The dataset is pre-processed to handle missing values, normalize features, and split into training and testing sets. We then apply a range of machine learning algorithms to the training data, allowing them to learn from the patterns and relationships within the dataset. The performance of each algorithm is evaluated using appropriate metrics such as accuracy, precision, recall, and F1 score.

The goal of this research is to identify the most accurate machine-learning algorithm for breast cancer classification. The selected algorithm can then be used as a reliable tool for assisting healthcare professionals in diagnosing breast cancer cases. Early and accurate classification enables timely intervention, personalized treatment plans, and improved patient outcomes.

II. LITERATURE REVIEW

Breast cancer type classification using machine learning - The study evaluated four machine learning algorithms for classifying breast cancer into triple negative and non-triple negative types. Among these algorithms, the Support Vector Machine (SVM) demonstrated higher accuracy and fewer misclassification errors compared to the other three algorithms. The findings suggest that machine learning algorithms, particularly SVM, are effective for accurately classifying breast cancer into triple negative and non-triple negative types.

Breast cancer classification using machine learning - This paper discusses the significance of breast cancer classification due to its prevalence and high mortality rates. The study focuses on the application of machine learning techniques for breast cancer classification, specifically comparing two classifiers: Naive Bayes (NB) and k-nearest neighbor (KNN). The authors evaluate the accuracy of these classifiers using cross-validation and find that KNN achieves the highest accuracy (97.51%) with the lowest error rate, followed by the NB classifier (96.19%). This comparison underscores the effectiveness
III. METHODOLOGY
             Benign          Malignant
Diagnosis    0 (Negative)    1 (Positive)
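
As a minimal sketch of the encoding shown above, the raw diagnosis labels could be mapped to 0 and 1 as follows. The source does not state how the labels are stored, so the strings "B" (benign) and "M" (malignant) are an assumption, and the name `encode_diagnosis` is hypothetical.

```python
# Hypothetical helper illustrating the 0/1 diagnosis encoding.
# Assumption: the dataset stores diagnoses as "B" (benign) and "M" (malignant).
DIAGNOSIS_CODES = {"B": 0, "M": 1}  # 0 = negative (benign), 1 = positive (malignant)


def encode_diagnosis(labels):
    """Return a list of 0/1 codes; raise on any label outside the mapping."""
    encoded = []
    for label in labels:
        key = label.strip().upper()  # tolerate stray whitespace and lowercase
        if key not in DIAGNOSIS_CODES:
            raise ValueError(f"unexpected diagnosis label: {label!r}")
        encoded.append(DIAGNOSIS_CODES[key])
    return encoded


print(encode_diagnosis(["M", "B", "b"]))  # prints [1, 0, 0]
```

Raising on unknown labels, rather than silently assigning a default class, keeps data-entry errors from being counted as benign or malignant cases.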
IV. RESULTS
The performance of the different data mining techniques on our dataset, with detailed accuracy information, is presented in the following tables.
Metric      Class               NB     LR     DT     NN     RF     SG     SVM
Recall      Negative (Benign)   0.95   0.95   0.94   0.96   0.96   1.00   0.99
F-measure   Negative (Benign)   0.95   0.92   0.93   0.95   0.94   0.79   0.92
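
The results in this section are reported under 10-fold cross-validation. The sketch below illustrates one way such an evaluation could be organised: every instance is held out exactly once, and accuracy is the number of correct held-out predictions divided by the total number of instances. The function names, the toy data, and the `majority_class` placeholder model are illustrative assumptions. They are not the paper's implementation or any of its seven models.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled folds; return (train, test) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validated_accuracy(fit_predict, X, y, k=10, seed=0):
    """Pooled accuracy: correct held-out predictions over all folds / total instances."""
    correct = 0
    for train, test in k_fold_indices(len(y), k, seed):
        predictions = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
        )
        correct += sum(p == y[i] for p, i in zip(predictions, test))
    return correct / len(y)


def majority_class(train_X, train_y, test_X):
    """Placeholder model (an assumption): always predicts the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)


# Toy data (not from the paper): 20 instances, 12 benign (0) and 8 malignant (1).
X = [[float(i)] for i in range(20)]
y = [0] * 12 + [1] * 8
print(cross_validated_accuracy(majority_class, X, y, k=10))  # prints 0.6
```

This simple split is not stratified. With imbalanced classes, a stratified split that preserves the benign-to-malignant ratio in each fold is usually preferable.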
[Figure: Precision, recall, and F-measure (positive, negative, and weighted average) for each model. Legend: NB, LR, DT, NN, RF, SG, SVM.]
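
The three quantities named in the figure can be computed from confusion-matrix counts. The sketch below uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as the harmonic mean of the two. The function name and the counts in the usage line are illustrative assumptions, not values from the paper.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f_measure) for one class from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    A ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts only: 90 malignant tumors found, 10 false alarms, 30 missed.
precision, recall, f_measure = classification_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f_measure={f_measure:.2f}")
# prints precision=0.90 recall=0.75 f_measure=0.82
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F-measure by excelling at precision alone or recall alone.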
Table 4 shows the raw accuracy of each model under 10-fold cross-validation. The Random Forest model classified the greatest number of instances correctly, with 544 of 570 (95.4% accuracy). It is followed closely by Decision Trees, which classified 542 instances correctly (95.0% accuracy). The least accurate models were Support Vector Machines and Stochastic Gradient, which classified 521 and 454 instances correctly, respectively.

Table 5 shows the precision, recall, and F-scores of each model. The models with the highest average precision scores (the proportion of positive predictions that are actually correct) are Naive Bayes, Neural Networks, and Random Forest. These three models also have the highest average recall scores (the proportion of actual positives that are correctly predicted). For F-scores (the harmonic mean of a system's precision and recall values), the same three models lead again.

In statistics, precision, recall, and F-measure are common metrics used to evaluate the performance of a classification model. Precision measures the proportion of true positives (TP) among the instances that are predicted as positive (TP plus false positives, FP), and thus reflects the accuracy of the positive predictions. Recall, on the other hand, measures the proportion of true positives among the instances that are actually positive (TP plus false negatives, FN), and thus reflects the completeness of the positive predictions.

In classifying malignant and benign breast cancer using machine learning, false positives and false negatives have different consequences. A false positive occurs when the model predicts malignancy when the tumor is benign, leading to unnecessary procedures and anxiety. A false negative occurs when the model predicts benignity when the tumor is malignant, resulting in delayed diagnosis and treatment. Balancing precision and recall is crucial, prioritizing recall if the cost of false negatives is higher and