Project Report
Bachelor of Technology
Computer Science Engineering
By
NAME OF STUDENTS ENROLL
Khalid Jan 190328
Syed Owais Bashir 190330
Mehrun Nissa 190350
Zahir Ahmed 190352
CERTIFICATE
This is to certify that the project titled “Breast Cancer Detection Using
Machine Learning & XAI” was submitted by Khalid Jan (190328), Syed Owais
Bashir (190330), Mehrun Nissa (190350) and Zahir Ahmed (190352) to
Government College of Engineering and Technology Safapora Ganderbal in
partial fulfilment of the requirements for the award of the degree of Bachelor
of Technology in Computer Science and Engineering during the year 2024.
CANDIDATE’S DECLARATION
We hereby certify that the project titled “Breast Cancer Detection Using Machine
Learning & XAI” submitted to the Department of Computer Science Engineering of
GOVERNMENT COLLEGE OF ENGINEERING AND TECHNOLOGY-SAFAPORA
GANDERBAL, is an authentic record of our work carried out during the period of
September 2023 to March 2024 under the guidance of Ms. Asiya Quyoum.
The matter presented in this Major project report has not been submitted by us to any other
Institute/ University for the award of any Degree/ Diploma.
This is to certify that the above statement made by the candidates is correct to the
best of my knowledge.
Place:-
Date:-
ACKNOWLEDGEMENT
First and foremost, we thank Almighty Allah for all the blessings in the
entirety of our undertakings.
We take this opportunity to express our profound gratitude and deep regards to our
Principal Prof. (Dr.) Rauf Ahmad Khan, for his exemplary guidance, monitoring and
constant encouragement throughout the course of engineering. We also take this
opportunity to express a deep sense of gratitude to Ms. Asiya Quyoum, Head (Department
of Computer Science Engineering), for her cordial support, valuable information and
guidance, which helped us in completing this task through its various stages. Her guidance
shall carry us a long way in the journey of life on which we are about to embark.
Khalid Jan
Mehrun Nissa
Zahir Ahmad
Date:
Place:
ABSTRACT
Breast cancer is one of the most prevalent cancers among women globally,
representing a significant public health concern. Early diagnosis can improve
prognosis and chances of survival by enabling timely clinical treatment. Accurate
classification of benign tumors is crucial to prevent unnecessary treatments.
Consequently, correct diagnosis and classification of breast cancer as malignant or
benign are subjects of extensive research. Machine learning is widely recognized as
the methodology of choice for breast cancer pattern classification and forecast
modeling due to its unique advantages in detecting critical features from complex
datasets.
TABLE OF CONTENTS
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION
1.2 PROBLEM STATEMENT
1.3 RELEVANT CONTEMPORARY ISSUES
1.4 MOTIVATION
1.5 OBJECTIVE
1.6 REQUIREMENT AND SPECIFICATIONS
1.6.1 SOFTWARE REQUIREMENTS
1.6.2 HARDWARE REQUIREMENTS
1.7 SCOPE
1.8 FEASIBILITY STUDY
CHAPTER 2 LITERATURE REVIEW
2.1 TIMELINE OF THE REPORTED PROBLEM
CHAPTER 3 METHODOLOGY
3.1 CONCEPT GENERATION
3.2 EVALUATION AND SELECTION OF FEATURES
3.2.1 STAGE 1: DATA PRE-PROCESSING
3.2.2 STAGE 2: DATA EXPLORATION
3.2.2.a BIVARIATE DATA ANALYSIS
3.2.2.b MULTIVARIATE DATA ANALYSIS
3.2.3 STAGE 3: FEATURE SELECTION
3.2.4 STAGE 4: FEATURE SCALING
3.2.5 STAGE 5: MODEL SELECTION
3.2.5.1 CODE SNIPPET FOR SVM
3.2.5.2 CODE SNIPPET FOR NAÏVE BAYES
3.2.5.3 CODE SNIPPET FOR LOGISTIC REGRESSION
3.2.5.4 CODE SNIPPET FOR DECISION TREES
3.2.5.5 CODE SNIPPET FOR RANDOM FORESTS
3.2.5.6 CODE SNIPPET FOR ADABOOST
3.2.5.7 CODE SNIPPET FOR XGBOOST
3.2.5.8 CODE SNIPPET FOR CONFUSION MATRIX
3.2.5.9 CODE SNIPPET FOR CLASSIFICATION REPORT
CHAPTER 4 IMPLEMENTATION OF XAI
4.1 XAI TECHNIQUES
4.1.1 SHAP
4.1.2 LIME
CHAPTER 5 SYSTEM IMPLEMENTATION
5.1 FRONT-END WEBSITE DEVELOPMENT
5.1.1 USER INTERFACE DESIGN
5.1.2 HTML STRUCTURE
5.1.3 CSS STYLING
5.1.4 JAVASCRIPT INTERACTIVITY
5.2 BACK-END INTEGRATION WITH FLASK
5.2.1 FLASK STRUCTURE
5.2.2 ROUTE HANDLING
5.2.3 DATA PROCESSING
5.2.4 MODEL LOADING AND PREDICTION
5.2.5 XAI EXPLANATION GENERATION
5.2.6 RESPONSE RENDERING
5.3 INTEGRATION DEPLOYMENT
5.4 CONCLUSION
CHAPTER 6 RESULTS AND FINDINGS
CHAPTER 7 LOCAL DATA ANALYSIS
CHAPTER 8 FUTURE SCOPE
8.1 DATA EXPANSION
8.2 ADVANCED XAI TECHNIQUES
CHAPTER 9 CONCLUSION
REFERENCES
List of Tables
LIST OF FIGURES
ABBREVIATIONS
AI Artificial Intelligence
ML Machine Learning
CHAPTER-1
INTRODUCTION
1.1 Introduction
Breast cancer is one of the most prominent types of cancer among women all
around the world, according to research conducted by the World Health Organization
(WHO), and it is a leading cause of death among women worldwide. Breast cancer
also has an exceedingly high fatality rate in India, around 14%, and is the most
common cancer among women there. Breast cancer affects about 5% of Indian women,
compared with about 12.5% of women in Europe and the United States. Among all
types of cancer, breast cancer is the fifth leading cause of death in women.
Breast cancer is a malignant tumor that develops inside breast cells. A tumor is
a group of dividing cells that forms a lump or mass of extra tissue, and it can be
either cancerous (malignant) or non-cancerous (benign). Because prognosis is so
critical for long-term survival, early detection of breast cancer enables early
diagnosis and treatment, which reduces the chance of death and plays a vital role
in the patient's survival. Delay in diagnosing cancer, or detecting it at a later
stage, may lead to the spread of the disease and complications in treatment. Past
research on the effects of a late cancer diagnosis has found that it is very
closely linked to the disease progressing to advanced stages, lowering the
likelihood of saving the patient's life. An analysis of 87 studies found that
female breast cancer patients who begin treatment within 90 days of the onset of
symptoms have a considerably higher likelihood of surviving than those who wait
more than 90 days. Many earlier studies have found that detecting breast cancer in
its early stages and starting treatment on time increases the chances of survival
by preventing malignant (cancerous) cells from spreading throughout the body. The
main contribution of this report is an evaluation and study of the role of various
machine learning approaches in the early detection of breast cancer.
Merging AI with machine learning (ML) approaches helps achieve accurate
prediction and decision-making, for example when deciding whether or not a patient
needs surgery based on the biopsy results for detecting breast cancer. Mammograms
are currently the most widely used test, but they can give false-positive
(high-risk) results, which can lead to unnecessary biopsies and procedures. When
surgery is performed to remove malignant cells, it is sometimes discovered that
the cells are in fact benign, that is, non-cancerous. This means that the patient
has been subjected to unnecessary, unpleasant, and costly surgery. ML algorithms
have a number of benefits, including their ability to perform well on
healthcare-related data such as images, X-rays, and blood samples. Some methods
are better suited to small datasets, while others are best suited to large
datasets, and noise can be an issue with some methods.
Machine learning (ML) and artificial intelligence (AI) have been used to develop
software capable of aiding radiologists in clinical practice. Currently, many of
these AI-based tools designed for aiding radiologists in interpreting mammograms
are developed with machine learning. Machine learning is a specific domain of AI
concerned with constructing algorithms that computers use to perform certain tasks
without explicit instructions, relying instead on inference and patterns, and that
are able to improve their performance with experience.
Breast cancer is the second major cause of death in women. Cancer starts when
cells begin to grow out of control. Breast cancer cells usually form a tumor that
can often be seen on an X-ray or felt as a lump. Breast cancer can spread when the
cancer cells get into the blood or lymph system and are carried to other parts of
the body. The main causes of breast cancer include changes and mutations in DNA.
There are many types of breast cancer. A malignant tumor can grow and spread to
other parts of the body, whereas a benign tumor can grow but does not spread.
Breast cancer mostly spreads to the nearby lymph nodes, in which case it is still
treated as a local disease, but it can also spread from one part of the body to
another through the blood vessels or the lymphatic system. It is important to
understand that most breast lumps are benign and not cancer (malignant).
Non-cancerous breast tumors are abnormal growths, but they do not spread outside
of the breast. They are not life threatening, but some types of benign breast
lumps can increase a woman’s risk of getting breast cancer. Any breast lump or
change needs to be checked by a health care professional to determine whether it
is benign or malignant (cancer) and whether it might affect future cancer risk.
The breast is the tissue overlying the chest (pectoral) muscles. Women's breasts
are made of specialized tissue that produces milk (glandular tissue) as well as
fatty tissue. The amount of fat determines the size of the breast. The
milk-producing part of the breast is organized into 15 to 20 sections, called
lobes. Within each lobe are smaller structures, called lobules, where milk is
produced. The milk travels through a network of tiny tubes called ducts. The ducts
connect and come together into larger ducts, which eventually exit the skin in the
nipple. The dark area of skin surrounding the nipple is called the areola. In
breast cancer, malignant cells multiply abnormally in the breast, eventually
spreading to the rest of the body if untreated. Breast cancer occurs almost
exclusively in women, although men can be affected. Signs of breast cancer include
a lump, bloody nipple discharge, or skin changes.
The number and the size of databases recording medical data are increasing
rapidly. Medical data, produced from measurements, examinations, prescriptions,
etc., are stored in different databases on a continuous basis. This enormous
amount of data exceeds the ability of traditional
methods to analyze and search for interesting patterns and information that is e (e.g.,
machine learning) and business intelligence. The book Data mining: Practical
machine learning tools and techniques with given the class variable. Based on the
maximum probability. It detects the class membership for the given tuple to a
particular class. The term breast cancer refers to a disease of the breast. A number
of factors can affect the breast and lead to breast cancer:
1. Getting Older.
2. Genetic Mutations.
4. Physical Activity.
5. Obesity.
6. Food.
7. Having Dense Breasts.
Factors like these are used to analyze breast cancer.
In many cases, diagnosis is generally based on the patient’s current test results
and the doctor’s experience. Thus, diagnosis is a complex task that requires much
experience and high skill.
1.4 MOTIVATION
Breast cancer is a global problem, and 1.7 million new cases are diagnosed per
year. Approximately 60% of deaths due to breast cancer occur in developing
countries, whereas in the United States (US), an estimated 249,260 new cases of breast
cancer are diagnosed each year, and mortality due to this disease is decreasing. In
contrast, breast cancer in developing countries represents one-half of all breast cancer
cases and 62% of the deaths. Developing countries have limited healthcare resources
and use different strategies to diagnose breast cancer. Most of the population depends
on the public healthcare system, which affects the diagnosis of the tumor. Thus, the
indicators observed in developed countries cannot be directly compared with those
observed in developing countries because the healthcare infrastructures in developing
countries are deficient.
Figure-1.1: WHO analysis of data on causes of death in 2018. The result
clearly shows that deaths from breast cancer are higher than deaths from
other causes among women.
The motivation behind using machine learning for breast cancer prediction is
driven by the desire to improve early detection and diagnosis of breast cancer, which
is crucial for successful treatment and improved patient outcomes. Machine learning
algorithms have the potential to analyze large amounts of data, identify patterns, and
make accurate predictions based on the learned patterns. In the case of breast cancer,
these algorithms can analyze various factors and characteristics of breast tissue to
predict the likelihood of developing the disease.
Here are some key motivations for using machine learning in breast cancer prediction:
4. Handling big data: The field of healthcare generates vast amounts of data,
including medical records, imaging data, genomic data, and clinical trial results.
Machine learning algorithms are well-suited to handle and analyze such big data,
extracting meaningful insights and patterns that may not be apparent to human
analysts. By leveraging these large datasets, machine learning models can
potentially uncover new risk factors, identify novel biomarkers, and improve our
understanding of breast cancer.
1.5 OBJECTIVE
1. Early detection: One of the main objectives is to detect breast cancer at an early
stage when it is more treatable and the chances of survival are higher. Machine
learning models can analyze various data sources, such as mammograms, patient
demographics, genetic information, and medical records, to identify patterns and
indicators of early-stage breast cancer that may not be easily detectable by human
observers.
2. Risk assessment: Machine learning algorithms can assess the risk of developing
breast cancer by considering multiple risk factors and their interactions. By
incorporating personal factors, such as genetic predisposition, family history,
lifestyle choices, and medical history, these models can provide a personalized risk
assessment for individuals. This helps in identifying individuals who may benefit
from more intensive screening or preventive measures.
4. Feature identification: Machine learning algorithms can automatically identify
relevant features or biomarkers associated with breast cancer. By analyzing a large
number of data points, these models can uncover new risk factors or biomarkers
that may not have been previously recognized. This can contribute to a better
understanding of breast cancer and lead to the discovery of new diagnostic or
therapeutic targets.
1.6.2 Hardware Requirements
1.7 SCOPE
The scope of breast cancer prediction using machine learning is broad and
encompasses various aspects of detection, diagnosis, risk assessment, and treatment
planning. Here are some key areas where machine learning can make a significant
impact in breast cancer prediction:
breast cancer. These biomarkers can provide insights into disease progression,
treatment response, and potential therapeutic targets.
It’s important to note that while machine learning shows promise in breast cancer
prediction, these models should always be used as decision support tools and not
as a substitute for medical professionals. The ultimate goal is to augment human
expertise and improve patient care in the field of breast cancer.
The feasibility of breast cancer prediction using machine learning has been
widely demonstrated and holds significant potential in improving early detection and
patient outcomes. Here are several factors that contribute to the feasibility of breast
cancer prediction using machine learning:
1. Abundance of Data: There is a substantial amount of available data related to
breast cancer, including mammograms, patient demographics, genetic
information, and histopathological data. Machine learning models thrive on large
and diverse datasets, allowing them to learn patterns and make accurate
predictions. The availability of such data makes breast cancer prediction using
machine learning feasible.
7. Research and Collaborations: There is extensive ongoing research and
collaboration in the field of breast cancer prediction using machine learning.
Researchers, clinicians, and industry experts collaborate to develop and validate
machine learning models, ensuring that the feasibility of these models is
continuously improved.
CHAPTER 2
LITERATURE REVIEW
1985: Researchers discover that women with early-stage breast cancer who were
treated with a lumpectomy and radiation have similar survival rates to women
treated with only a mastectomy.
1986: Scientists determine how to clone the HER2 gene.
1995: Scientists clone the tumor suppressor genes BRCA1 and BRCA2. Inherited
mutations in these genes can predict an increased risk of breast cancer.
1996: The FDA approves anastrozole (Arimidex) as a treatment for breast cancer.
This drug blocks the production of estrogen.
1998: Tamoxifen is found to decrease the risk of developing breast cancer
in at-risk women by 50 percent. It is now approved by the FDA for use as a
preventive therapy.
1998: Trastuzumab (Herceptin), a drug targeting cancer cells that overproduce
HER2, is also approved by the FDA.
2006: The SERM drug raloxifene (Evista) is found to reduce breast cancer
risk for postmenopausal women who have a higher risk. It has a lower risk of
serious side effects than tamoxifen.
2010: "A hybrid intelligent system for breast cancer diagnosis" by Abirami et al. This
paper proposed a hybrid intelligent system that combines fuzzy logic and artificial
neural networks to improve breast cancer diagnosis accuracy.
2011: A large meta-analysis finds that radiation therapy significantly
reduces the risk of breast cancer recurrence and mortality.
2012: "A novel approach for automated detection of breast cancer using SVM
classifier" by Kourou et al. The authors presented a novel approach using a
support vector machine (SVM) for automated breast cancer detection, achieving
promising results.
2013: The four main subtypes of breast cancer are defined as
HR+/HER2- (“luminal A”), HR-/HER2- (“triple negative”), HR+/HER2+ (“luminal B”),
and HR-/HER2+ (“HER2-enriched”).
2014: "Deep learning for detecting breast cancer metastases on whole slide images"
by Liu et al. This study explored the application of deep learning techniques,
specifically convolutional neural networks (CNNs), for detecting breast cancer
metastases in whole slide images.
2016: "Breast cancer diagnosis using a hybrid intelligent system" by Arun Kumar et
al. The authors proposed a hybrid intelligent system that combines rough set theory,
fuzzy logic, and genetic algorithm for breast cancer diagnosis, achieving high
accuracy.
2018: A clinical trial suggests that chemotherapy after surgery does not
benefit 70 percent of women with early-stage breast cancer.
2019: Enhertu is approved by the FDA, and this drug proves to be very
effective in treating HER2-positive breast cancer that has metastasized or cannot
be removed with surgery.
2020: The drug Trodelvy is approved by the FDA for treating metastatic
triple-negative breast cancer in people who have not responded to at least two
other treatments.
2020: "Efficient breast cancer classification using a machine learning approach with
genetic algorithm-based feature selection" by Elakkiya et al. This study employed
genetic algorithm- based feature selection and machine learning techniques for
efficient breast cancer classification, demonstrating improved accuracy.
CHAPTER 3
PROPOSED WORK
Among all breast diseases, breast cancer has become a significant concern
because it can act as a silent killer without any obvious symptoms. Early
prediction and prevention play a crucial role in reducing the mortality rate
associated with this deadly disease. ML techniques offer promising solutions for
the analysis of breast cancer by testing various risk factors. This proposed work
aims to collect and analyze relevant data from diverse sources, classify the data
under suitable headings, and apply machine learning algorithms to predict the
possibility of breast disease. The objective is to empower healthcare
professionals and individuals with effective tools for early detection and
prevention, ultimately reducing the mortality rates caused by breast disease. The
work involves identifying and gathering relevant data from various resources,
including medical records, patient histories, genetic data, and lifestyle factors,
and performing data preprocessing tasks such as data cleaning, handling missing
values, standardizing the data, and selecting the features relevant for
prediction. In this project we have used breast disease data from the UCI
repository. The features of this data are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. We have a total of 569 instances, of
which 357 belong to benign tumors and 212 belong to malignant tumors. 30 clinical
features have been recorded for each instance. In this project, we use Python to
implement breast disease classification and prediction via various machine
learning algorithms: SVM, logistic regression, decision tree, random forest, naïve
Bayes, AdaBoost, and XGBoost. After comparison of all the algorithms, we use
XGBoost for further processing in this project.
The system starts with the collection of data and the selection of important
attributes. The data is then pre-processed into the required format and divided
into two parts, training data and testing data. The models are trained using the
training data, and the accuracy of the models is obtained by testing the system
using the testing data. The following modules are used to implement the system:
We use the breast cancer dataset from the UCI Machine Learning Repository. The
dataset is publicly available and was created by Dr. William H. Wolberg,
physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA. To
create the dataset, Dr. Wolberg used fluid samples taken by fine needle aspiration
(FNA) from patients with solid breast masses, together with an easy-to-use
graphical computer program called Xcyt, which is capable of analyzing cytological
features based on a digital scan. The program uses a curve-fitting algorithm to
compute ten features from each of the cells in the sample; it then calculates the
mean value, extreme value and standard error of each feature for the image,
returning a 30-dimensional real-valued vector.
Attribute Information:
Mean values: mean radius, mean texture, mean perimeter, mean area, mean
smoothness, mean compactness, mean concavity, mean concave points, mean symmetry,
mean fractal dimension
Standard errors: radius error, texture error, perimeter error, area error,
smoothness error, compactness error, concavity error, concave points error,
symmetry error, fractal dimension error
Worst values: worst radius, worst texture, worst perimeter, worst area, worst
smoothness, worst compactness, worst concavity, worst concave points, worst
symmetry, worst fractal dimension
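As a small illustration of how the 30 attributes arise from ten base measurements and three statistics each, the sketch below rebuilds the column names listed above. The helper name feature_names is hypothetical and is used only for this example.

```python
# Ten base measurements computed per cell nucleus (names as in the attribute list).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry", "fractal dimension",
]

def feature_names(base=BASE_FEATURES):
    """Return the 30 column names: mean, standard error and worst value per base feature."""
    means = [f"mean {name}" for name in base]
    errors = [f"{name} error" for name in base]
    worsts = [f"worst {name}" for name in base]
    return means + errors + worsts

names = feature_names()
print(len(names))                                 # 30
print(names[0], "|", names[10], "|", names[20])   # mean radius | radius error | worst radius
```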
Objective: The objective of this analysis is to observe which features are most helpful
in predicting malignant and benign cancer and to see general trends that would help
us in model selection. The goal is to classify whether the breast cancer is benign or
malignant. To achieve this, we have used machine learning classification methods to
fit a function that can predict the discrete class of new inputs.
For this we use VS Code to work on the dataset. We first import all the
necessary libraries and load our dataset in VS Code. We can find the dimensions of
the dataset using the pandas attribute data.shape, which returns (569, 31). We
now know that the dataset consists of 569 rows and 31 columns. ‘target’ is the
column which we are going to predict; it indicates whether the cancer is
malignant (0) or benign (1). Using the code line ‘data['target'].value_counts()’
we find that out of 569 persons, 212 are labelled as 0 (malignant) and 357 are
labelled as 1 (benign).
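The inspection steps described above can be sketched as follows. This is a minimal example that assumes scikit-learn's bundled copy of the Wisconsin Diagnostic dataset stands in for the file used in the project; the variable names are illustrative. The label names are read from the dataset itself rather than assumed.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the dataset replaces the project's own file.
bunch = load_breast_cancer(as_frame=True)
data = bunch.frame                      # 30 feature columns plus the 'target' column

print(data.shape)                       # (569, 31)

counts = data["target"].value_counts()  # number of rows per class label
print(counts)

# Show which class name each integer label stands for in this copy of the data.
label_names = dict(enumerate(bunch.target_names))
print(label_names)
```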
3.2.2.b Multivariate data analysis
Multivariate data analysis involves examining the relationships and patterns among
three or more variables simultaneously. In the context of breast cancer classification,
multivariate data analysis can help uncover complex relationships between multiple
features and their combined impact on the classification task.
3.2.3 Stage 3: Feature Selection
Feature selection is the process of reducing the number of input variables to the
model by keeping only the relevant features and getting rid of noise in the data.
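The report does not state which selection method was applied, so the following is only one possible sketch: a simple correlation filter that keeps features correlated with the target and drops near-duplicates. The function name, both thresholds and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def select_by_correlation(X, y, min_target_corr=0.3, max_pair_corr=0.9):
    """Keep features correlated with the target, dropping near-duplicate features.

    X: (n_samples, n_features) array, y: (n_samples,) array of 0/1 labels.
    Both thresholds are illustrative values, not taken from the report.
    """
    n_features = X.shape[1]
    target_corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    pair_corr = np.abs(np.corrcoef(X, rowvar=False))

    selected = []
    # Visit features from most to least correlated with the target.
    for j in np.argsort(-target_corr):
        if target_corr[j] < min_target_corr:
            break
        if all(pair_corr[j, k] < max_pair_corr for k in selected):
            selected.append(int(j))
    return sorted(selected)

# Tiny synthetic demonstration (not project data).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
informative = y + 0.3 * rng.normal(size=200)             # tracks the label
duplicate = informative + 0.01 * rng.normal(size=200)    # near copy of column 0
noise = rng.normal(size=200)                             # unrelated to the label
X = np.column_stack([informative, duplicate, noise])
selected = select_by_correlation(X, y)
print(selected)
```

In this toy run only one of the two nearly identical columns should survive, and the pure-noise column should be discarded.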
The dataset we used is split into training and testing data. The training set
contains known outputs, and the model learns from this data so that it can
generalize to other data later on. The test set is held out in order to evaluate
the model's predictions on unseen data.
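A minimal sketch of the split, combined with feature scaling, is given below. The 80/20 ratio, the random seed, the use of StandardScaler and the use of scikit-learn's bundled copy of the dataset are all assumptions, since the report does not state the exact settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows,
# so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```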
3.2.5 Stage 5: Model Selection
Model selection, also known as algorithm selection, is the phase in which the
algorithm that predicts the best results is chosen. It involves evaluating and
comparing different models to determine which one is likely to perform best on
unseen data. Machine learning algorithms are broadly classified into two groups:
supervised learning algorithms and unsupervised learning algorithms. The following
classification algorithms were evaluated in this project:
1. Logistic regression
2. Support Vector Machine
3. Naïve Bayes
4. Decision Tree Algorithm
5. Random Forest Classification
6. AdaBoost
7. XgBoost
After applying the different classification models, we can see that XGBoost
gives the best results for our dataset. This will not necessarily hold for every
dataset: to choose a model, we always need to analyze the dataset first and then
apply candidate machine learning models to it.
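One way to compare candidate classifiers is cross-validated accuracy. The sketch below is not the project's own code: it assumes scikit-learn's bundled copy of the dataset, default hyperparameters and 5-fold cross-validation, and it omits XGBoost because that model comes from the separate xgboost package (its XGBClassifier could be added to the dictionary where that package is installed).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()

# Print the models from highest to lowest mean accuracy.
for name, acc in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {acc:.3f}")
```

The ranking produced this way depends on the folds and hyperparameters, so it should be read as a rough comparison rather than a final result.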
3.2.5.3 Code Snippet for Logistic Regression
3.2.5.6 Code Snippet for AdaBoost
Figure-3.14: Confusion Matrix of Model
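A confusion matrix such as the one in Figure-3.14 tabulates correct and incorrect predictions per class, and the classification report is derived from the same four counts. The sketch below computes both in plain Python; the function names and the labels are made up for illustration and are not project results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def classification_summary(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))        # (3, 1, 3, 1)
print(classification_summary(y_true, y_pred))
```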
CHAPTER 4
IMPLEMENTATION OF XAI
malignant), and negative values indicating a lower contribution towards a positive
prediction (e.g., benign).
1. Summary Plot: This plot provides an overview of the most important features
across the entire dataset, sorted by their SHAP values.
2. Force Plot: This plot shows the contribution of each feature for a specific
instance, making it easier to understand the model's reasoning for that
particular prediction.
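To make the idea of per-feature contributions concrete, the sketch below computes exact Shapley values for a tiny hypothetical scoring function by enumerating all feature coalitions. It is an illustration of the underlying definition only: the shap library uses far more efficient estimators, and the function toy_score, the instance and the all-zero baseline are assumptions made for this example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi

# Hypothetical scoring function with an interaction term (not the project model).
def toy_score(x):
    radius, texture, concavity = x
    return 0.5 * radius + 0.2 * texture + 0.1 * radius * concavity

instance = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, instance, baseline)
print(phi)
# The contributions add up to the difference between the two predictions.
print(sum(phi), toy_score(instance) - toy_score(baseline))
```

The interaction term is shared equally between the two features involved, which is why the first and third contributions differ from what the linear terms alone would give.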
The key idea behind LIME is to generate perturbed samples around the
instance of interest and train an interpretable model (e.g., a linear regression model)
on these perturbed samples, using the original model's predictions as the target
variable. The coefficients of this local interpretable model can then be used to explain
the original model's prediction for the instance being explained.
1. Instance Selection: The first step is to select the instance for which an
explanation is desired. This could be a specific data point from the test set or any
other instance of interest.
3. Model Evaluation: The original machine learning model (the one being
explained) is then evaluated on these perturbed samples, and the model's
predictions are obtained for each perturbed instance.
4. Local Surrogate Model: LIME trains an interpretable model (e.g., a linear
regression model) on the perturbed samples, using the original model's
predictions as the target variable. The interpretable model is trained to
approximate the original model's behaviour locally, around the instance being
explained.
The explanations generated by LIME are local in nature, meaning they are
specific to the instance being explained and may not generalize to other instances or
the entire dataset. However, this locality is a strength of LIME, as it allows for
capturing the model's behaviour in the vicinity of the instance of interest, which can
be particularly useful for understanding individual predictions.
It's important to note that the interpretable model used by LIME (e.g., linear
regression) is an approximation of the original model's behavior, and the quality of
the explanations depends on how well the interpretable model can approximate the
original model locally.
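The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the lime library's implementation: the Gaussian perturbation, the exponential distance kernel, the default parameter values and the black_box function are all assumptions chosen for the example.

```python
import numpy as np

def lime_explain(predict, instance, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a weighted linear surrogate around `instance` and return its coefficients.

    predict: function mapping an (n, d) array to n scores (the black-box model).
    scale, kernel_width and n_samples are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)

    # Perturbation: Gaussian noise around the instance.
    samples = instance + scale * rng.normal(size=(n_samples, instance.size))
    # Model evaluation: query the black box on the perturbed samples.
    targets = predict(samples)
    # Weighting: closer samples count more (exponential kernel on distance).
    distances = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # Local surrogate: weighted least squares with an intercept column.
    design = np.column_stack([np.ones(n_samples), samples - instance])
    root_w = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * root_w, targets * root_w[:, 0], rcond=None)
    return coef[1:]          # one local weight per feature

# Hypothetical black box: a logistic score over two features.
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))

local_weights = lime_explain(black_box, [0.0, 0.0])
print(local_weights)
```

For this toy model the first feature should receive a positive local weight and the second a smaller negative one, mirroring the signs and relative sizes of the coefficients inside the black box.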
Figure-4.1 Code Snippet for Implementation of XAI
CHAPTER 5
SYSTEM IMPLEMENTATION
The frontend website was developed using HTML, CSS, and JavaScript.
These technologies were chosen for their widespread adoption, cross-platform
compatibility, and ease of integration with the backend components.
The user interface was designed with a focus on simplicity and intuitive
navigation. The website features a clean and modern layout, with clear sections for
data input, prediction display, and XAI explanations.
Figure-5.1.1 User Interface
The HTML structure of the website is organized into logical sections, including:
Data Input: A form or input fields for users to enter patient data, such as age,
tumor size, and other relevant features.
CSS was used to style the website's appearance, ensuring a visually appealing
and consistent design across different screen sizes and devices. Responsive design
principles were implemented to provide an optimal viewing experience on various
devices, including desktops, tablets, and mobile phones.
5.1.4 JavaScript Interactivity
JavaScript was utilized to enhance the user experience and provide interactive
features. For example, users can hover over bars or points in the SHAP plots to see
additional information or tooltips. The LIME feature importance list can be sorted or
filtered based on user preferences.
Flask routes were defined to handle user requests, such as submitting patient
data and retrieving predictions and XAI explanations. The /predict route, for instance,
receives the user input data, passes it to the machine learning model, and generates
the prediction and XAI explanations.
5.2.3 Data Processing
User input data received from the frontend website is processed and formatted
to match the requirements of the machine learning model. This includes handling
missing values, scaling numerical features, and encoding categorical features.
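The three preprocessing steps can be sketched in plain Python as shown below. The training statistics in TRAIN_STATS and the categorical feature laterality are hypothetical values introduced only for this example. In practice the same statistics that were computed on the training data must be reused at prediction time.

```python
# Hypothetical training-set (mean, standard deviation) for each numerical feature.
TRAIN_STATS = {
    "age": (55.0, 12.0),
    "tumor_size": (22.0, 10.0),
}
# Hypothetical categorical feature and its allowed levels.
LATERALITY_LEVELS = ("left", "right")


def preprocess(record):
    """Turn one raw input record into a numeric feature vector."""
    vector = []
    for name, (mean, std) in TRAIN_STATS.items():
        value = record.get(name)
        if value is None:
            value = mean  # missing value: impute with the training mean
        vector.append((float(value) - mean) / std)  # standard scaling
    # One-hot encode the categorical feature; unknown levels map to all zeros.
    side = record.get("laterality")
    vector.extend(1.0 if side == level else 0.0 for level in LATERALITY_LEVELS)
    return vector


example = preprocess({"age": 67, "tumor_size": None, "laterality": "right"})
print(example)
```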
The trained machine learning model (e.g., logistic regression, random forest,
or SVM) is loaded into memory during the Flask application's initialization. When a
user submits data, the preprocessed input is fed into the model to obtain the prediction
(benign or malignant).
The XAI explanations are generated using the SHAP and LIME techniques
implemented in Python libraries such as shap and lime. The SHAP summary plot,
SHAP force plot, and LIME feature importance scores are calculated based on the
user input data and the model's prediction.
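To make the meaning of a SHAP value concrete, the sketch below computes it by hand for the simplest case, a linear model with independent features, where the SHAP value of feature i is w_i * (x_i - E[x_i]). It does not call the shap library, and the weights, bias, and background means are hypothetical numbers. The check at the end shows the additivity property that force plots visualise: the base value plus all contributions equals the prediction.

```python
def linear_model(weights, bias, x):
    """Prediction of a linear model: bias + sum_i w_i * x_i."""
    return bias + sum(w * v for w, v in zip(weights, x))


def linear_shap_values(weights, background_means, x):
    """SHAP values for a linear model with independent features:
    phi_i = w_i * (x_i - E[x_i])."""
    return [w * (v - m) for w, v, m in zip(weights, x, background_means)]


# Hypothetical model and data, for illustration only.
weights = [0.8, -0.3, 1.5]
bias = -0.2
background_means = [0.0, 1.0, 0.5]
x = [1.2, 0.4, 0.9]

phi = linear_shap_values(weights, background_means, x)
base_value = linear_model(weights, bias, background_means)
prediction = linear_model(weights, bias, x)

# Additivity: base value + sum of contributions reproduces the prediction.
print(abs(base_value + sum(phi) - prediction) < 1e-9)
```

For tree ensembles and other nonlinear models the contributions are not available in closed form, which is why a dedicated library is used in practice.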
The Flask application generates a response containing the prediction and XAI
explanations in a format suitable for rendering on the frontend website. This response
is typically in the form of JSON or HTML, depending on the request type.
The frontend website and the Flask backend were integrated seamlessly,
ensuring a smooth flow of data and communication between the two components. The
website was deployed on a web server or cloud platform, allowing healthcare
professionals to access the breast cancer detection system and its XAI explanations
from any device with an internet connection.
5.5 Conclusion
The development of the frontend website and the integration with the machine
learning and XAI models using Flask have resulted in a user-friendly and transparent
breast cancer detection system. Healthcare professionals can now access the system
through an intuitive interface, input patient data, and receive predictions along with
valuable XAI explanations that shed light on the model's decision-making process.
This project demonstrates the successful integration of advanced machine learning
techniques with modern web technologies, enabling better decision-making and
increased trust in AI-powered healthcare solutions.
CHAPTER 6
RESULTS AND FINDINGS
leap from individual models. This ensemble method's strength lies in its ability to
reduce overfitting by averaging many trees, each trained on a different data subset.
Given these results, XGBoost is the best-performing of the models evaluated.
However, in high-stakes decision-making, accuracy alone is insufficient.
Explainable AI (XAI) addresses this gap by making the behaviour of complex,
otherwise opaque models such as XGBoost open to inspection.
The implications of this are profound. When our model classifies an instance,
we can now provide a narrative: "This instance was classified as Class A primarily
because Feature X had a high value, which typically correlates with Class A.
However, Feature Y, which usually indicates Class B, had a moderating effect." Such
explanations transform our model from a black box to a transparent advisor, fostering
trust among stakeholders.
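A narrative of this kind can be generated mechanically from per-feature contributions. The sketch below assumes the contributions are already available as signed numbers (for example from SHAP or LIME); the function name narrate and the example values are hypothetical.

```python
def narrate(predicted_class, contributions, top_k=2):
    """Build a one-sentence explanation from signed feature contributions."""
    # Keep the top_k features with the largest absolute contribution.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    supporting = [name for name, value in ranked if value > 0]
    opposing = [name for name, value in ranked if value < 0]

    parts = [f"Classified as {predicted_class}"]
    if supporting:
        parts.append("mainly because of " + " and ".join(supporting))
    if opposing:
        parts.append("with a moderating effect from " + " and ".join(opposing))
    return ", ".join(parts) + "."


# Hypothetical contributions towards the predicted class.
sentence = narrate("Class A", {"Feature X": 0.42, "Feature Y": -0.15, "Feature Z": 0.03})
print(sentence)
```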
feature's importance contradicts domain knowledge, it might indicate data leakage or
the need for feature engineering. XAI also aids in fairness and bias detection. If
sensitive attributes like gender or race disproportionately influence predictions, we
can take corrective actions, aligning our model with ethical AI principles.
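One simple way to screen for such disproportionate influence is to measure what share of the total attribution falls on sensitive attributes. The sketch below assumes mean absolute attributions per feature have already been computed; the feature names, the numbers, and the 10% review threshold are hypothetical and would need to be set with domain experts.

```python
def attribution_share(mean_abs_attributions, sensitive_features):
    """Fraction of total absolute attribution carried by sensitive features."""
    total = sum(mean_abs_attributions.values())
    if total == 0:
        return 0.0
    sensitive = sum(
        value for name, value in mean_abs_attributions.items()
        if name in sensitive_features
    )
    return sensitive / total


# Hypothetical mean absolute attributions over a validation set.
attributions = {"tumor_size": 0.5, "texture": 0.3, "gender": 0.2}
share = attribution_share(attributions, {"gender", "race"})
needs_review = share > 0.10  # hypothetical threshold for a manual fairness review
print(f"sensitive share = {share:.2f}, needs review = {needs_review}")
```

A high share does not prove unfairness by itself, but it indicates where a closer audit is warranted.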
CHAPTER 7
LOCAL DATA ANALYSIS
Kashmir faces a growing silent threat: a surge in cancer cases. Data from
two prominent hospitals, Sher-i-Kashmir Institute of Medical Sciences (SKIMS)
and Shri Maharaja Hari Singh (SMHS) Hospital, paints a concerning picture.
SKIMS alone has documented a staggering 44,112 cancer cases from 2013 to
2023. This immense number, alongside the 6,379 cases reported by SMHS
Hospital from 2017 to 2023, underscores the magnitude of the public health crisis
unfolding in the Valley.
strategies.
A silent and deadly threat is gripping Jammu and Kashmir: a surge in
cancer cases. Data from a leading medical facility, the Department of Radiation
Oncology at SMHS Hospital GMC Srinagar, paints a concerning picture. Since
2017, the department has recorded a sharp increase in new patient
registrations, culminating in a record 1,640 cases in 2023. The data
show a steady year-on-year climb from 491 cases in 2017 to more than 1,000
in both 2021 and 2022, followed by a significant jump in 2023. The high
number of daily registrations, averaging around 5-6 new patients, further
emphasizes the urgency of the situation.
In-depth research is critical to identify the root causes behind this alarming rise.
Environmental factors like pollutants, unhealthy lifestyle choices like smoking
and poor diet, limitations in early detection programs, and even genetic
predispositions could all be playing a role.
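The scale of the increase can be made explicit with a short calculation based only on the two registration figures quoted above (491 cases in 2017 and 1,640 in 2023). The compound annual growth rate is a derived summary and assumes smooth growth between the two years, which the yearly data may not follow exactly.

```python
cases_2017 = 491
cases_2023 = 1640

# Overall growth between the first and last year.
growth_factor = cases_2023 / cases_2017
percent_increase = (growth_factor - 1) * 100

# Compound annual growth rate over the six intervening years.
years = 2023 - 2017
annual_growth = growth_factor ** (1 / years) - 1

# Average registrations per calendar day in 2023.
per_calendar_day = cases_2023 / 365

print(f"growth factor: {growth_factor:.2f}x ({percent_increase:.0f}% increase)")
print(f"compound annual growth: {annual_growth:.1%}")
print(f"registrations per calendar day in 2023: {per_calendar_day:.1f}")
```

Registrations in 2023 were therefore more than three times the 2017 level. The per-calendar-day average is somewhat below the quoted 5-6 patients per day, which would be consistent with registrations taking place on working days only; that reading is an assumption, not a figure from the hospital data.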
implementing effective prevention and early detection strategies, stakeholders
can work together to combat this silent epidemic. A collaborative effort involving
healthcare professionals, government agencies, community leaders, and research
institutions is crucial to pave the way for a healthier future for the people of
Jammu and Kashmir. Professor Manzoor Ahmad's statement highlights the
urgency further, emphasizing the record number of new cases. The fight against
cancer requires immediate and collective action to ensure better care for countless
patients and a healthier future for all residents of Jammu and Kashmir.
Prof. Manzoor Ahmad, Head of the Department of Radiation Oncology at
GMC Srinagar, said there is a sharp increase in new cancer cases in J&K, with the
department registering a record 1,640 cases in 2023, of which 911
were males and 729 were females.
Lung Cancer in Males: This is the most concerning finding. Lung cancer is
strongly linked to tobacco use, and its prevalence points to widespread
consumption of tobacco products in society. Prof. Ahmad emphasizes the need for extensive
public awareness campaigns to educate people about the dangers of tobacco use
and encourage them to quit.
help reduce the risk of breast cancer in women.
CHAPTER 8
FUTURE SCOPE
8.1 Data Expansion
Current breast cancer detection and risk assessment models are valuable tools,
but they can potentially become even more accurate and effective by incorporating
additional data sources. This is like casting a wider net – the more information we
have, the better we can understand an individual's risk and ultimately improve their
chances of early detection and successful treatment.
Unveiling the Lifestyle Connection: Many of our daily habits and choices can
influence our health, and breast cancer is no exception. By incorporating data on
lifestyle factors like diet, physical activity levels, smoking history, and even exposure
to environmental toxins, the model can paint a more complete picture of an
individual's risk profile. Imagine the model as a detective – the more clues it has (like
dietary habits and exercise routines), the better it can solve the mystery of a person's
susceptibility to breast cancer.
Diet: What we eat can play a role in our overall health, and research suggests that
certain dietary patterns might be linked to breast cancer risk. By including information
about a person's diet, the model can potentially identify individuals who might benefit
from dietary adjustments to lower their risk.
Physical Activity: Regular exercise has numerous health benefits, and studies have
shown a link between physical activity and a reduced risk of breast cancer. The model
can factor in a person's activity level to provide a more personalized risk assessment.
Smoking: The dangers of smoking are well-documented, and it is a major risk
factor for several cancers; studies also associate it with a modestly increased risk
of breast cancer. Including smoking history in the data set allows the model to
account for this risk factor.
With this additional lifestyle data, the model can become more sophisticated in its risk
assessments, potentially leading to earlier detection and better preventive measures
for individuals at higher risk.
Imagine the model being able to read an individual's genetic code, looking for specific
red flags that might indicate a higher risk. This allows for a more personalized
approach to breast cancer prevention and early detection.
While current methods like SHAP and LIME offer a window into how the AI
model makes decisions about breast cancer detection, there's room for further
exploration. By delving into more advanced Explainable Artificial Intelligence (XAI)
techniques, we can unlock a deeper understanding of the model's reasoning, fostering
greater trust and collaboration between healthcare professionals, patients, and the AI
system itself. Here's how these advanced techniques can illuminate the "black box" of
AI:
predictions. For example, the model could answer a question such as: "If a patient with a high-risk
prediction exercised regularly and maintained a healthy weight, how might their risk
change?" These explanations are powerful because they point towards potential
interventions or preventive measures. By understanding how adjustments to lifestyle
factors or other variables might influence the prediction, doctors can make more
informed decisions about a patient's care plan.
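A "what if" query of this kind can be sketched with a toy risk model. Everything in the example is hypothetical: the feature names, the coefficients in WEIGHTS and BIAS, and the patient values are invented for illustration and are not fitted to any data or taken from the project's model.

```python
import math

# Hypothetical logistic-regression coefficients, for illustration only.
WEIGHTS = {"age_decades": 0.30, "bmi_over_25": 0.08, "weekly_exercise_hours": -0.12}
BIAS = -2.0


def risk(patient):
    """Predicted risk from the toy logistic model."""
    z = BIAS + sum(WEIGHTS[name] * patient[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def counterfactual(patient, changes):
    """Return (risk before, risk after) when the listed features are changed."""
    altered = {**patient, **changes}
    return risk(patient), risk(altered)


patient = {"age_decades": 6, "bmi_over_25": 5, "weekly_exercise_hours": 0}
before, after = counterfactual(patient, {"bmi_over_25": 0, "weekly_exercise_hours": 4})
print(f"risk before: {before:.2f}, after lifestyle change: {after:.2f}")
```

Only features that a patient can actually change should be altered in such a query; age, for example, is kept fixed here.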
Speaking the Doctor's Language: Currently, the model might explain its reasoning
based on individual data points like tumor size or cell measurements. While valuable,
this can be technical jargon for some healthcare professionals. Concept-based
explanations bridge this gap. They translate the model's decision-making process into
human-understandable concepts. Instead of focusing on raw numbers, the explanation
might say something like: "The model predicts a high risk due to the presence of
aggressive tumor characteristics." This shift towards concepts like "tumor
aggressiveness" or "cellular abnormalities" allows doctors to grasp the model's
reasoning more intuitively. This can facilitate better communication and collaboration
between healthcare professionals and the AI system, ultimately leading to better
patient care.
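One simple approximation of a concept-based explanation is to group per-feature attributions under clinician-facing concepts and report the concept with the largest combined effect. The grouping in CONCEPTS, the feature names, and the attribution values below are hypothetical; dedicated concept-based XAI methods are more elaborate than this aggregation.

```python
# Hypothetical mapping from clinical concepts to model features.
CONCEPTS = {
    "tumor size": ["radius_mean", "area_mean"],
    "cellular irregularity": ["concavity_mean", "texture_mean"],
}


def concept_scores(attributions):
    """Sum the signed feature attributions that belong to each concept."""
    return {
        concept: sum(attributions.get(feature, 0.0) for feature in features)
        for concept, features in CONCEPTS.items()
    }


def dominant_concept(attributions):
    """Concept with the largest absolute combined attribution."""
    scores = concept_scores(attributions)
    return max(scores, key=lambda concept: abs(scores[concept]))


# Hypothetical per-feature attributions for one prediction.
attributions = {
    "radius_mean": 0.25,
    "area_mean": 0.25,
    "concavity_mean": 0.125,
    "texture_mean": -0.0625,
}
scores = concept_scores(attributions)
print(f"The prediction is driven mainly by {dominant_concept(attributions)}: {scores}")
```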
informed discussions about their health and treatment options. Ultimately, these
advancements in XAI hold the potential to improve decision-making, patient
outcomes, and communication in the fight against breast cancer.
CHAPTER 9
CONCLUSION
Breast cancer looms large as a global threat to women's health. The World
Health Organization paints a concerning picture, highlighting the immense challenge
with staggering statistics. Early detection is crucial, significantly improving patient
prognosis and survival rates. Traditional screening methods like mammograms, while
demonstrably reducing mortality, have limitations. False positives, leading to
unnecessary biopsies and psychological distress, are a significant drawback.
AI algorithms can analyze large healthcare data sets encompassing
mammograms, other medical images, and clinical information. This multifaceted analysis allows them to
identify subtle patterns indicative of early-stage breast cancer that might be missed by
traditional methods. Imagine an AI algorithm trained on millions of mammogram
images and the associated patient data. This AI can then analyze a new patient's data,
searching for even the most minute anomalies that could be indicative of cancer. By
identifying these subtle patterns, AI has the potential to detect breast cancer at its
earliest stages, significantly improving patient outcomes.
A critical strength of this research lies in its careful selection of the most
effective ML algorithm. Not all algorithms perform equally well, and high accuracy is
paramount in cancer screening. This report emphasizes the importance of selecting an
algorithm that meets stringent reliability standards. This careful
approach fosters trust in the technology and supports its wider adoption
in clinical settings.
potentially avoiding unnecessary biopsies while ensuring early detection of true
positives.
REFERENCES
[1] Haenssle, H. A., Fink, C., Schneiderbauer, R., Toberer, F., Buhl, T., Blum, A.,
& Zalaudek, I. (2018). Man against machine: diagnostic performance of a deep
learning convolutional neural network for dermoscopic melanoma recognition
in comparison to 58 dermatologists. Annals of Oncology, 29(8), 1836-1842.
[3] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer.
[4] Nelson, H. D., Tyne, K., Naik, A., Bougatsos, C., Chan, B. K., & Humphrey,
L. (2009). Screening for breast cancer: an update for the US Preventive
Services Task Force. Annals of Internal Medicine, 151(10), 727-737.
[5] Duffy, S. W., Yen, A. M. F., Chen, T. H. H., Chen, S. L. S., Chiu, S. Y. H.,
Fan, J. J. Y., & Tabar, L. (2012). Long-term benefits of breast screening. Breast
Cancer Management, 1(1), 31-38.
[6] Colditz, G. A., Rosner, B. A., Chen, W. Y., Holmes, M. D., & Hankinson, S.
E. (2004). Risk factors for breast cancer according to estrogen and progesterone
receptor status. Journal of the National Cancer Institute, 96(3), 218-228.
[7] Perou, C. M., Sørlie, T., Eisen, M. B., Van De Rijn, M., Jeffrey, S. S., Rees, C.
A., & Botstein, D. (2000). Molecular portraits of human breast
tumours. Nature, 406(6797), 747-752.
[9] Easton, D. F., Pharoah, P. D., Antoniou, A. C., Tischkowitz, M., Tavtigian, S.
V., Nathanson, K. L., & Foulkes, W. D. (2015). Gene-panel sequencing and the
prediction of breast-cancer risk. New England Journal of Medicine, 372(23),
2243-2257.
[10] Papadimitriou, N., Dimou, N., Tsilidis, K. K., Banbury, B., Martin, R. M.,
Lewis, S. J., ... & Murphy, N. (2020). Physical activity and risks of breast and
colorectal cancer: a Mendelian randomisation analysis. Nature
Communications, 11(1), 597.
[11] Monticciolo, D. L., Newell, M. S., Moy, L., Niell, B., Monsees, B., & Sickles,
E. A. (2018). Breast cancer screening in women at higher-than-average risk:
recommendations from the ACR. Journal of the American College of
Radiology, 15(3), 408-414.
[12] Love, P. E., Fang, W., Matthews, J., Porter, S., Luo, H., & Ding, L. (2023).
Explainable artificial intelligence (XAI): Precepts, models, and opportunities
for research in construction. Advanced Engineering Informatics, 57, 102024.
[13] Abbas, A. (2021). Reviewing the explainable artificial intelligence (XAI) and
its importance in tax administration. Center for Inter-American Tax Studies
(CIAT).
[14] Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi,
D. (2018). A survey of methods for explaining black box models. ACM
Computing Surveys (CSUR), 51(5), 1-42.