2023 International Conference in Advances in Power, Signal, and Information Technology

EPD: an integrated modeling technique to classify BC


Sashikanta Prusty1*
Department of Computer Science & Engineering
Siksha ‘O’ Anusandhan (Deemed to be University)
Bhubaneswar, India-751030
[email protected]

Sujit Kumar Dash2
Department of Electrical & Electronics Engineering
Siksha ‘O’ Anusandhan (Deemed to be University)
Bhubaneswar, India-751030
[email protected]

Srikanta Patnaik2
Director of Interscience Institute of Management and Technology
Bhubaneswar, India-751030
[email protected]

Sushree Gayatri Priyadarsini Prusty3
Department of Computer Science & Engineering
Siksha ‘O’ Anusandhan (Deemed to be University)
Bhubaneswar, India-751030
[email protected]

Nrusingha Tripathy3
Department of Computer Science & Engineering
Siksha ‘O’ Anusandhan (Deemed to be University)
Bhubaneswar, India-751030
[email protected]

2023 International Conference in Advances in Power, Signal, and Information Technology (APSIT) | 979-8-3503-3936-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/APSIT58554.2023.10201778

Abstract- Over the past two decades, breast cancer (BC) has become the second most common cause of cancer death and remains prevalent in low- and middle-income countries. Although many technologies have been developed and deployed in the medical field, none can yet cure the disease completely. Novel techniques are therefore needed to prevent avoidable deaths by detecting the disease at an early stage. In this study, we review key studies of the related cells and risk factors and design a novel EPD (EDA, PCA, and DT) model to classify abnormal cells as either benign (B) or malignant (M). EPD combines three techniques: Exploratory Data Analysis (EDA) to visualize the raw data, Principal Component Analysis (PCA) to select the most promising features, and a Decision Tree (DT) to predict the disease from these features. The findings suggest that the combined approach serves doctors and healthcare organizations better than the individual techniques.

Keywords: BC, AI, ML, EPD, PCA, EDA, DT

I. INTRODUCTION
BC is the second most common cause of cancer death worldwide and remains a persistent threat. Many developed nations have introduced BC screening programs to improve the quality of their medical care. Nevertheless, BC continues to be the top or second leading cause of cancer death in women in those nations, including among women taking part in screening [1, 2]. This demonstrates that too many women are not receiving enough mammographic screening. To lower breast cancer mortality, early detection must be significantly improved. Underdiagnosis, that is, failing to detect the disease at an early enough stage to prevent morbidity and mortality from breast cancer, is the main issue with current breast screening programs. A cancer diagnosis at a metastatic stage can be avoided with early detection through routine screening. In short, there is an urgent need for technologies that reduce avoidable deaths. Figure 1 shows breast mammograms of both sides containing normal and abnormal cells.
Healthcare has already seen huge advances through the use of sophisticated machines, applications, and other technologies. In this regard, ML and DL technologies have also made a considerable impact by providing a better diagnosis of the disease at an early stage. Deep learning (DL) has recently been used more and more to diagnose breast cancer and predict treatment outcomes, and the results are encouraging [3–5]. In particular, several studies have applied DL to MRI or ultrasound to predict the response to neoadjuvant chemotherapy in breast cancer patients [6-12].
Thus, in this research, we aimed to develop a novel ML model to classify the abnormal cells present in the human breast, so that patients can be treated at the initial stage. Although many models have already been developed, we propose the EPD model as a strong alternative for doctors. Before classification, the raw data must be properly visualized and the most appropriate features selected to get the best results.

Fig. 1. Representation of a human breast mammogram on both left and right-hand sides

II. MATERIAL & METHOD

A. Material
In this research, we have taken the Wisconsin Breast Cancer Database (WBCD) from the freely accessible “UCI machine learning repository” [13]. The dataset provides details on tumour traits that were calculated from a digitized image of a breast mass obtained by fine-needle aspiration (FNA). Ten features are computed for each cell nucleus visible in the image, describing the tumour's size, density, texture, symmetry, and other aspects. For each image, the mean, standard error, and "worst" value (the mean of the three largest values) of these features were calculated, yielding 30 features. The categorical target feature gives the tumour's nature, i.e. benign or malignant.

B. Method
Given the extent and amount of data collected in healthcare-related fields, it is essential to make good decisions and to support the reporting of outcomes. The effective use of data visualization can inform and facilitate decision-making.
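The dataset described under Material can be inspected with a few lines of Python. The sketch below is illustrative only: it assumes that the copy of the Wisconsin Diagnostic data bundled with scikit-learn (`load_breast_cancer`) matches the WBCD file used in this study, and it relies on pandas being installed for `as_frame=True`. Note that scikit-learn encodes the target as 0 = malignant and 1 = benign.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the UCI download.
data = load_breast_cancer(as_frame=True)
X = data.data      # one row per image, one column per feature
y = data.target    # scikit-learn encoding: 0 = malignant, 1 = benign

# Count the samples in each class using readable labels.
class_counts = y.map({0: "malignant", 1: "benign"}).value_counts()

print(X.shape)
print(class_counts)
```

The feature table has 30 numeric columns, which agrees with the 10 measurements times 3 summary statistics described above.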
Authorized licensed use limited to: Siksha O Anusandhan University. Downloaded on August 11,2023 at 03:50:53 UTC from IEEE Xplore. Restrictions apply.
Additionally, feature selection has become increasingly useful for predicting disease with novel ML models. A Decision Tree (DT) is a tree-like model that, at each level, makes a decision based on the outcome of the preceding nodes and returns a result based on those decisions. The algorithm carries this out with conditional statements. By combining all these techniques, as shown in figure 2, our proposed EPD model classifies the disease better than any individual operation.

Fig. 2. Workflow of the EPD model (WBCD database → EDA → PCA → DT; then calculate accuracy_score and recall_score, design the confusion_matrix, and plot the decision tree)

1) EDA: The strength of data visualization lies in its capacity to highlight patterns that might otherwise go unnoticed. The method of employing visual techniques to study data is called exploratory data analysis (EDA) [14]. Thus, to perform statistical analysis and to find trends and patterns in the raw data, we have taken the EDA technique as the first step of our proposed approach. The number of concave points, the perimeter, and the area, as well as the nuclear radius and malignancy, all show strong positive linear relationships, as shown in figure 3.
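The linear relationships mentioned in the EDA step can be checked numerically with a Pearson correlation matrix. This is a minimal sketch, not the plotting code behind figure 3: it assumes scikit-learn's bundled copy of the dataset and its column names (for example `mean radius`), and the `malignant` indicator column is introduced here purely for illustration.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame.copy()

# scikit-learn uses 0 = malignant, 1 = benign; flip it so that 1 = malignant.
# The column name "malignant" is a hypothetical helper, not part of the dataset.
df["malignant"] = 1 - df["target"]

cols = ["mean radius", "mean perimeter", "mean area",
        "mean concave points", "malignant"]

# Pairwise Pearson correlation between the selected columns.
corr = df[cols].corr(method="pearson")
print(corr.round(2))
```

Radius, perimeter, and area are geometrically tied to one another, so near-perfect correlations among them are expected; the last row of the matrix shows how strongly each feature moves with malignancy.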

Fig. 3. Data visualization for 30 features using the EDA technique


2) PCA: A concern with having so many variables is that they can make visualizations excessively complicated, reduce efficiency by including variables that have no influence, or make the data difficult to analyze. To improve the performance of machine learning models, meaningful features must be extracted from the raw data [15]. Principal component analysis (PCA) is a technique that breaks down a large set of correlated variables into a smaller set of uncorrelated variables referred to as principal components. By consolidating many variables into fewer components, PCA reduces the attribute space. Thus, we have used the PCA technique in this article to find the most promising features for our model prediction. The WBCD consists of 30 dimensions, but we first reduced it to 2 principal components to determine whether the classes can be separated. The two target classes, (i) dark for benign and (ii) light for malignant, can be separated almost linearly, as shown in figure 4.
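The two-component projection described above can be sketched as follows. The code assumes scikit-learn's bundled copy of the dataset and standardizes the features before PCA; the standardization step is an assumption made here, since PCA is sensitive to feature scale, and is not stated as part of the original procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumption: put every feature on a comparable scale before PCA.
X_std = StandardScaler().fit_transform(X)

# Project the 30-dimensional data onto its first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

Plotting the two columns of `X_2d` coloured by `y` gives a scatter plot of the same kind as figure 4.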

Fig. 4. Features selection using PCA projection

We choose the number of principal components so that they account for 90% of the variance in the original data. Figure 5 indicates that 6 components suffice, so we reduce the dimensionality from 30 features to 6 principal components. In addition, to prevent overfitting and improve prediction accuracy, we first transform all numerical features toward a normal distribution before scaling them. A common technique for transforming skewed data to normal is the log transformation, which brings the feature distributions closer to normal. Except for the concave points features, which could be affected by malignant cells, where the frequency of contour concavities grows considerably, practically all features in figure 6 have bell-shaped distributions.
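The preprocessing and component selection described above can be sketched in one short pipeline. Several details are assumptions made for illustration: the use of `np.log1p` (log(1 + x), which tolerates zero values), the order log transform, then scaling, then PCA, and scikit-learn's bundled copy of the dataset. The number of components this selects depends on those choices and need not equal the 6 read from figure 5.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X = data.data  # all 30 features are non-negative

# Skewness of each feature before and after the log transformation.
skew_before = X.skew()
X_log = np.log1p(X)          # assumption: log(1 + x) as the log transform
skew_after = X_log.skew()

# Scale, then keep the fewest components explaining at least 90% of variance.
X_std = StandardScaler().fit_transform(X_log)
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_std)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
print(skew_before["mean area"], skew_after["mean area"])
```

Because log(1 + x) is nearly linear for small x, features whose values stay close to zero are changed only slightly by this transform, so some skew can remain in them.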

Fig. 5. Representation of intersection point between features when x=6

Fig. 6. Log Transformation table to represent distribution as “normal” or “skewed”

3) Apply Decision Tree (DT): The DT is one of the simplest algorithms and can capture a nonlinear relationship between the features and the outcome. Scaling is typically not needed for decision trees. The best parameters, including the depth of the tree, the split criterion, and the minimum number of samples per leaf node, can be found with GridSearchCV in Python, which exhaustively searches for the optimal model parameters by cross-validated grid search over a parameter grid. From figure 6, we have seen that the concave points (CP) features are skewed, whereas all others are bell-shaped. We therefore built the DT with CP <= 0.049 as the root split, as shown in figure 7. At each node, the left-hand branch corresponds to the condition being true and the right-hand branch to it being false; at the root, the true branch is predominantly “benign” and the false branch predominantly “malignant”. Only a sample with CP <= 0.049 and R_mean <= 14.975, which reaches the leaf with Entropy = 0.154, is classified as “benign”; every other sample is classified as “malignant”.
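The parameter search described above can be sketched with scikit-learn's `GridSearchCV`. Everything specific in this example is an assumption: the candidate values in `param_grid`, the 5-fold cross-validation, the stratified 70/30 split (chosen because it yields 398 training and 171 test samples, the sizes that appear in figure 7 and Table 1), and the random seeds. The tree is fitted on the raw 30 features rather than on PCA components to keep the sketch short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumption: 70/30 split; the seed and stratification are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical search space covering the parameters named in the text.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
}

# Exhaustive search: every combination is scored by 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refitted best tree
```

Limiting `max_depth` and raising `min_samples_leaf` in the grid also acts as a simple guard against overfitting, since both restrict how finely the tree can partition the training data.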

CP_mean <= 0.049   (E = 0.958, S = 398, V = [247, 151], C = B)
    T: R_mean <= 14.975   (E = 0.322, S = 239, V = [225, 14], C = B)
        T: leaf   (E = 0.154, S = 224, V = [219, 5], C = B)
        F: leaf   (E = 0.971, S = 15, V = [6, 9], C = M)
    F: R_mean <= 16.205   (E = 0.58, S = 159, V = [22, 137], C = M)
        T: leaf   (E = 0.858, S = 78, V = [22, 56], C = M)
        F: leaf   (E = 0.0, S = 81, V = [0, 81], C = M)

Legend: CP = Concave Points, E = Entropy, S = Samples, V = Value, C = Class, B = Benign, M = Malignant, T = True, F = False

Fig. 7. Tree representation for specifying class level as either “benign” or “malignant”
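The entropy values annotated in Fig. 7 follow directly from the class counts V at each node, using the Shannon entropy in bits, E = -Σ p_i log2(p_i). The short check below recomputes them with the standard library only; the helper names `entropy` and `information_gain` are introduced here for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Class counts V = [benign, malignant] read from Fig. 7.
root = [247, 151]
left, right = [225, 14], [22, 137]

print(round(entropy(root), 3))
print(round(entropy(left), 3))
print(round(entropy(right), 3))
print(round(information_gain(root, [left, right]), 3))
```

The recomputed entropies agree with the E values shown for the root and its two children, and the information gain quantifies how much the root split on concave points reduces the class uncertainty.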

Entropy is the optimum split criterion in this case because all instances are completely exhausted after two splits, giving the tree a depth of 2. According to figure 7, the probability that a cell is malignant increases as the number of concave points and the radius increase.

4) Performance Measures: The performance of an ML classifier is commonly measured with a confusion_matrix, which summarizes the prediction results after classification. The numbers of right and wrong classifications for each value of the target variable in the confusion matrix are used to calculate the efficacy of the classification model.

Table 1: Confusion_matrix

                      Predicted M (1)    Predicted B (0)
True class M (1)            98                 12
True class B (0)             2                 59

Table 1 covers the testing dataset. Of the 110 malignant samples, 98 were predicted correctly and 12 incorrectly. Of the 61 benign samples, 59 were predicted correctly and 2 incorrectly, as shown in figure 8.

Fig. 8. Representation of confusion_matrix

III. RESULT & DISCUSSION
In this study, we implemented the EPD model on the WBCD dataset in the Jupyter Notebook 6.4.3 web-based application, using the Python programming language on the Windows 10 platform. The model works well and produces an accuracy score of 91.81%, as shown in figure 9. The performance of our EPD model rests on three basic techniques: (i) data visualization using EDA, (ii) feature selection using PCA, and (iii) classification using DT. In addition, recall and precision scores were calculated to assess how well the model identifies malignant tumours: the recall is 96.72% and the precision is 83.09%.

Fig. 9. Performance result of the EPD model
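The reported scores can be recomputed from the four counts in Table 1. The sketch below uses plain Python and takes the rows as true classes and the columns as predicted classes, as printed in the table. One assumption is needed: the reported recall and precision are matched (up to rounding) when the class in the second row and column is treated as the positive class, giving 59/61 and 59/71.

```python
# Counts from Table 1: rows are true classes, columns are predicted classes.
cm = [[98, 12],
      [2, 59]]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = (cm[0][0] + cm[1][1]) / total

# Assumption: the second row/column is the positive class.
tp = cm[1][1]   # positives predicted as positive
fn = cm[1][0]   # positives predicted as negative
fp = cm[0][1]   # negatives predicted as positive

recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
```

With the 171 test samples this gives 157/171 for accuracy, consistent with the 91.81% reported above.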

IV. CONCLUSION
As discussed, BC is among the most common diseases and affects one woman in ten worldwide. Early diagnosis is therefore essential to reduce mortality. In this article, we have proposed a model, namely EPD, that combines three operations to perform BC classification. We now know from the EDA that area, perimeter, and radius are closely connected. Because of this, it would be preferable to eliminate the redundant features among the "worst" measurements, including perimeter and area. That is why the PCA technique is used to limit unnecessary features. In addition, DT makes the decision at each level based on the previous nodes, using conditional statements. As a result, EPD performs better than each individual operation. Figure 7 also illustrates a limitation: because decision trees are extremely sensitive to noise in the input data, the complete model could change if the training set were slightly altered, which hampers the model's interpretability. To greatly reduce overfitting, our future work will develop techniques such as pruning, specifying a minimum number of samples per leaf, and defining a maximum depth for the tree.

REFERENCE
[1] American Cancer Society. Cancer Facts and Figures 2022. Atlanta, Ga: American Cancer Society; 2022. https://www.cancer.org/cancer/breast-cancer/about/how-common-is-breast-cancer.html. Accessed November 30, 2022.
[2] Breast cancer burden in EU-27. ECIS – European Cancer Information System. https://ecis.jrc.ec.europa.eu/pdf/Breast_cancer_factsheet-Oct_2020.pdf. Accessed November 30, 2022.
[3] Q. Hu, H. M. Whitney, and M. L. Giger, “A deep learning methodology for improved breast cancer diagnosis using multiparametric MRI,” Sci. Rep. 10, 10536, 2020. https://doi.org/10.1038/s41598-020-67441-4.
[4] S. Prusty, S. K. Dash, and S. Patnaik, “A novel transfer learning technique for detecting breast cancer mammograms using VGG16 bottleneck feature,” ECS Transactions, 107(1), 733, 2022.
[5] W. C. Ou, D. Polat, and B. E. Dogan, “Deep learning in breast radiology: current progress and future directions,” Eur. Radiol. 31, 4872–4885, 2021. https://doi.org/10.1007/s00330-020-07640-9.
[6] M. El Adoui, S. Drisis, and M. Benjelloun, “Multi-input deep learning architecture for predicting breast tumor response to chemotherapy using quantitative MR images,” Int. J. Comput. Assist. Radiol. Surg. 15, 1491–1500, 2020. https://doi.org/10.1007/s11548-020-02209-9.
[7] S. Prusty, S. Patnaik, and S. K. Dash, “SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer,” Frontiers in Nanotechnology, 4, 972421, 2022.
[8] S. Joo, et al., “Multimodal deep learning models for the prediction of pathologic response to neoadjuvant chemotherapy in breast cancer,” Sci. Rep. 11, 18800, 2021. https://doi.org/10.1038/s41598-021-98408-8.
[9] Y. H. Qu, et al., “Prediction of pathological complete response to neoadjuvant chemotherapy in breast cancer using a deep learning (DL) method,” Thorac. Cancer 11, 651–658, 2020. https://doi.org/10.1111/1759-7714.13309.
[10] S. G. P. Prusty, and S. Prusty, “Time Series Analysis of SAR-Cov-2 Virus in India Using Facebook’s Prophet,” In Meta Heuristic Techniques in Software Engineering and Its Applications: METASOFT 2022 (pp. 72-81). Cham: Springer International Publishing.
[11] J. Gu, et al., “Deep learning radiomics of ultrasonography can predict response to neoadjuvant chemotherapy in breast cancer at an early stage of treatment: A prospective study,” Eur. Radiol. 32, 2099–2109, 2022. https://doi.org/10.1007/s00330-021-08293-y.
[12] M. Jiang, et al., “Ultrasound-based deep learning radiomics in the assessment of pathological complete response to neoadjuvant chemotherapy in locally advanced breast cancer,” Eur. J. Cancer 147, 95–105, 2021. https://doi.org/10.1016/j.ejca.2021.01.028.
[13] http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
[14] S. Prusty, S. Patnaik, and S. K. Dash, “Exploratory Data Analysis on SARS-CoV-2 Variants in India: especially Omicron (B. 1.1. 529) as of 6th December 2021,” In 2022 International Conference on Decision Aid Sciences and Applications (DASA), 2022, (pp. 94-99). IEEE.
[15] E. Zdravevski, B. Risteska Stojkoska, M. Standl, and H. Schulz, “Automatic machine-learning based identification of jogging periods from accelerometer measurements of adolescents under field conditions,” PLoS ONE 2017, 12, e0184216.
