0% found this document useful (0 votes)
527 views491 pages

THE APPLIED DATA SCIENCE WORKSHOP Urinary Biomarkers Based Pancreatic Cancer Classification and Prediction (Vivian Siahaan Rismon Hasiholan Sianipar) (Z-Library)

Uploaded by

16061977
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
527 views491 pages

THE APPLIED DATA SCIENCE WORKSHOP Urinary Biomarkers Based Pancreatic Cancer Classification and Prediction (Vivian Siahaan Rismon Hasiholan Sianipar) (Z-Library)

Uploaded by

16061977
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 491

THE APPLIED DATA SCIENCE WORKSHOP:

URINARY BIOMARKERS BASED PANCREATIC CANCER


CLASSIFICATION AND PREDICTION
USING MACHINE LEARNING
WITH PYTHON GUI
THE APPLIED DATA SCIENCE WORKSHOP:
URINARY BIOMARKERS BASED PANCREATIC CANCER
CLASSIFICATION AND PREDICTION
USING MACHINE LEARNING
WITH PYTHON GUI
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Second Edition
 
 
 
VIVIAN SIAHAAN
RISMON HASIHOLAN SIANIPAR
 
 
Copyright © 2023 BALIGE Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews. Every effort has been made in the preparation of this book
to ensure the accuracy of the information presented. However, the information
contained in this book is sold without warranty, either express or implied. Neither the
authors, nor BALIGE Publishing or its dealers and distributors, will be held liable for
any damages caused or alleged to have been caused directly or indirectly by this
book. BALIGE Publishing has endeavored to provide trademark information about
all of the companies and products mentioned in this book by the appropriate use of
capitals. However, BALIGE Publishing cannot guarantee the accuracy of this
information.
Copyright © 2023 BALIGE Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews. Every effort has been made in the preparation of this book
to ensure the accuracy of the information presented. However, the information
contained in this book is sold without warranty, either express or implied. Neither the
authors, nor BALIGE Publishing or its dealers and distributors, will be held liable for
any damages caused or alleged to have been caused directly or indirectly by this
book. BALIGE Publishing has endeavored to provide trademark information about
all of the companies and products mentioned in this book by the appropriate use of
capitals. However, BALIGE Publishing cannot guarantee the accuracy of this
information.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Published: JULY 2023
Production reference: 21070123
Published by BALIGE Publishing Ltd.
BALIGE, North Sumatera
ABOUT THE AUTHOR
ABOUT THE AUTHOR
 
 
 
 
 
 
Vivian Siahaan is a highly motivated individual with a
passion for continuous learning and exploring new areas. Born
and raised in Hinalang Bagasan, Balige, situated on the
picturesque banks of Lake Toba, she completed her high
school education at SMAN 1 Balige. Vivian's journey into the world of
programming began with a deep dive into various languages such as
Java, Android, JavaScript, CSS, C++, Python, R, Visual Basic, Visual
C#, MATLAB, Mathematica, PHP, JSP, MySQL, SQL Server, Oracle,
Access, and more. Starting from scratch, Vivian diligently studied
programming, focusing on mastering the fundamental syntax and
logic. She honed her skills by creating practical GUI applications,
gradually building her expertise. One particular area of interest for
Vivian is animation and game development, where she aspires to make
significant contributions. Alongside her programming and
mathematical pursuits, she also finds joy in indulging in novels,
nurturing her love for literature. Vivian Siahaan's passion for
programming and her extensive knowledge are reflected in the
numerous ebooks she has authored. Her works, published by Sparta
Publisher, cover a wide range of topics, including "Data Structure with
Java," "Java Programming: Cookbook," "C++ Programming:
Cookbook," "C Programming For High Schools/Vocational Schools
and Students," "Java Programming for SMA/SMK," "Java Tutorial:
GUI, Graphics and Animation," "Visual Basic Programming: From A
to Z," "Java Programming for Animation and Games," "C#
Programming for SMA/SMK and Students," "MATLAB For Students
and Researchers," "Graphics in JavaScript: Quick Learning Series,"
"JavaScript Image Processing Methods: From A to Z," "Java GUI
Case Study: AWT & Swing," "Basic CSS and JavaScript,"
"PHP/MySQL Programming: Cookbook," "Visual Basic: Cookbook,"
"C++ Programming for High Schools/Vocational Schools and
Students," "Concepts and Practices of C++," "PHP/MySQL For
Students," "C# Programming: From A to Z," "Visual Basic for
SMA/SMK and Students," and "C# .NET and SQL Server for High
School/Vocational School and Students." Furthermore, at the ANDI
Yogyakarta publisher, Vivian Siahaan has contributed to several
notable books, including "Python Programming Theory and Practice,"
"Python GUI Programming," "Python GUI and Database," "Build
From Zero School Database Management System In Python/MySQL,"
"Database Management System in Python/MySQL," "Python/MySQL
For Management Systems of Criminal Track Record Database,"
"Java/MySQL For Management Systems of Criminal Track Records
Database," "Database and Cryptography Using Java/MySQL," and
"Build From Zero School Database Management System With
Java/MySQL." Vivian's diverse range of expertise in programming
languages, combined with her passion for exploring new horizons,
makes her a dynamic and versatile individual in the field of
technology. Her dedication to learning, coupled with her strong
analytical and problem-solving skills, positions her as a valuable asset
in any programming endeavor. Vivian Siahaan's contributions to the
world of programming and literature continue to inspire and empower
aspiring programmers and readers alike.
 
 
 
Rismon Hasiholan Sianipar, born in Pematang Siantar in 1994, is a distinguished
researcher and expert in the field of electrical engineering. After completing his
education at SMAN 3 Pematang Siantar, Rismon ventured to the city of Jogjakarta
to pursue his academic journey. He obtained his Bachelor of Engineering (S.T) and
Master of Engineering (M.T) degrees in Electrical Engineering from Gadjah Mada University
in 1998 and 2001, respectively, under the guidance of esteemed professors, Dr. Adhi Soesanto
and Dr. Thomas Sri Widodo. During his studies, Rismon focused on researching non-
stationary signals and their energy analysis using time-frequency maps. He explored the
dynamic nature of signal energy distribution on time-frequency maps and developed
innovative techniques using discrete wavelet transformations to design non-linear filters for
data pattern analysis. His research showcased the application of these techniques in various
fields. In recognition of his academic prowess, Rismon was awarded the prestigious
Monbukagakusho scholarship by the Japanese Government in 2003. He went on to pursue his
Master of Engineering (M.Eng) and Doctor of Engineering (Dr.Eng) degrees at Yamaguchi
University, supervised by Prof. Dr. Hidetoshi Miike. Rismon's master's and doctoral theses
revolved around combining the SR-FHN (Stochastic Resonance Fitzhugh-Nagumo) filter
strength with the cryptosystem ECC (elliptic curve cryptography) 4096-bit. This innovative
approach effectively suppressed noise in digital images and videos while ensuring their
authenticity. Rismon's research findings have been published in renowned international
scientific journals, and his patents have been officially registered in Japan. Notably, one of his
patents, with registration number 2008-009549, gained recognition. He actively collaborates
with several universities and research institutions in Japan, specializing in cryptography,
cryptanalysis, and digital forensics, particularly in the areas of audio, image, and video
analysis. With a passion for knowledge sharing, Rismon has authored numerous national and
international scientific articles and authored several national books. He has also actively
participated in workshops related to cryptography, cryptanalysis, digital watermarking, and
digital forensics. During these workshops, Rismon has assisted Prof. Hidetoshi Miike in
developing applications related to digital image and video processing, steganography,
cryptography, watermarking, and more, which serve as valuable training materials. Rismon's
field of interest encompasses multimedia security, signal processing, digital image and video
analysis, cryptography, digital communication, digital forensics, and data compression. He
continues to advance his research by developing applications using programming languages
such as Python, MATLAB, C++, C, VB.NET, C#.NET, R, and Java. These applications serve
both research and commercial purposes, further contributing to the advancement of signal and
image analysis. Rismon Hasiholan Sianipar is a dedicated researcher and expert in the field of
electrical engineering, particularly in the areas of signal processing, cryptography, and digital
forensics. His academic achievements, patented inventions, and extensive publications
demonstrate his commitment to advancing knowledge in these fields. Rismon's contributions
to academia and his collaborations with prestigious institutions in Japan have solidified his
position as a respected figure in the scientific community. Through his ongoing research and
development of innovative applications, Rismon continues to make significant contributions to
the field of electrical engineering.
 
 
 
 
 
 
 
 
 
 
 
 
 
ABOUT THE BOOK
ABOUT THE BOOK
 
 
 
 
 
 
 
 
 
The Applied Data Science Workshop on "Urinary Biomarkers-Based
Pancreatic Cancer Classification and Prediction Using Machine
Learning with Python GUI" embarks on a comprehensive journey,
commencing with an in-depth exploration of the dataset. During this
initial phase, the structure and size of the dataset are thoroughly
examined, and the various features it contains are meticulously
studied. The principal objective is to understand the relationship
between these features and the target variable, which, in this case, is
the diagnosis of pancreatic cancer. The distribution of each feature is
analyzed, and potential patterns, trends, or outliers that could
significantly impact the model's performance are identified.
 
To ensure the data is in optimal condition for model training,
preprocessing steps are undertaken. This involves handling missing
values through imputation techniques, such as mean, median, or
interpolation, depending on the nature of the data. Additionally, feature
engineering is performed to derive new features or transform existing
ones, with the aim of enhancing the model's predictive power. In
preparation for model building, the dataset is split into training and
testing sets. This division is crucial to assess the models' generalization
performance on unseen data accurately. To maintain a balanced
representation of classes in both sets, stratified sampling is employed,
mitigating potential biases in the model evaluation process.
 
The workshop explores an array of machine learning classifiers
suitable for pancreatic cancer classification, such as Logistic
Regression, K-Nearest Neighbors, Decision Trees, Random Forests,
Gradient Boosting, Naive Bayes, Adaboost, Extreme Gradient
Boosting, Light Gradient Boosting, Naïve Bayes, and Multi-Layer
Perceptron (MLP). For each classifier, three different preprocessing
techniques are applied to investigate their impact on model
performance: raw (unprocessed data), normalization (scaling data to a
similar range), and standardization (scaling data to have zero mean and
unit variance).
 
To optimize the classifiers' hyperparameters and boost their predictive
capabilities, GridSearchCV, a technique for hyperparameter tuning, is
employed. GridSearchCV conducts an exhaustive search over a
specified hyperparameter grid, evaluating different combinations to
identify the optimal settings for each model and preprocessing
technique.
 
During the model evaluation phase, multiple performance metrics are
utilized to gauge the efficacy of the classifiers. Commonly used
metrics include accuracy, recall, precision, and F1-score. By
comprehensively assessing these metrics, the strengths and
weaknesses of each model are revealed, enabling a deeper
understanding of their performance across different classes of
pancreatic cancer. Classification reports are generated to present a
detailed breakdown of the models' performance, including precision,
recall, F1-score, and support for each class. These reports serve as
valuable tools for interpreting model outputs and identifying areas for
potential improvement.
 
The workshop highlights the significance of graphical user interfaces
(GUIs) in facilitating user interactions with machine learning models.
By integrating PyQt, a powerful GUI development library for Python,
participants create a user-friendly interface that enables users to
interact with the models effortlessly. The GUI provides options to
select different preprocessing techniques, visualize model outputs such
as confusion matrices and decision boundaries, and gain insights into
the models' classification capabilities. One of the primary advantages
of the graphical user interface is its ability to offer users a seamless
and intuitive experience in predicting and classifying pancreatic cancer
based on urinary biomarkers. The GUI empowers users to make
informed decisions by allowing them to compare the performance of
different classifiers under various preprocessing techniques.
 
Throughout the workshop, a strong emphasis is placed on the
significance of proper data preprocessing, hyperparameter tuning, and
robust model evaluation. These crucial steps contribute to building
accurate and reliable machine learning models for pancreatic cancer
prediction. By the culmination of the workshop, participants have
gained valuable hands-on experience in data exploration, machine
learning model building, hyperparameter tuning, and GUI
development, all geared towards addressing the specific challenge of
pancreatic cancer classification and prediction.
 
In conclusion, the Applied Data Science Workshop on "Urinary
Biomarkers-Based Pancreatic Cancer Classification and Prediction
Using Machine Learning with Python GUI" embarks on a
comprehensive and transformative journey, bringing together data
exploration, preprocessing, machine learning model selection,
hyperparameter tuning, model evaluation, and GUI development. The
project's focus on pancreatic cancer prediction using urinary
biomarkers aligns with the pressing need for early detection and
treatment of this deadly disease. As participants delve into the
intricacies of machine learning and medical research, they contribute
to the broader scientific community's ongoing efforts to combat cancer
and improve patient outcomes. Through the integration of data science
methodologies and powerful visualization tools, the workshop
exemplifies the potential of machine learning in revolutionizing
medical diagnostics and healthcare practices.
 
 
 
 
 
 
 
 
 
 
 
 
CONTENT
CONTENT
 
 
 
 
 
 
 
 
 
 
 
EXPLORING DATASET AND FEATURES DISTRIBUTION 1
Description 1
Exploring Dataset 2
Information of Dataset 6
Dropping Irrelevant Columns 8
Imputing Missing Values 9
Statistical Description 10
Distribution of Diagnosis Variable 12
Distribution of All Features 16
Distribution of Plasma CA19-9 versus Diagnosis 18
Distribution of Creatinine versus Diagnosis 23
Distribution of LYVE1 versus Diagnosis 26
Distribution of REG1B versus Diagnosis 30
Distribution of TFF1 versus Diagnosis 33
Distribution of REG1A versus Diagnosis 37
   
   
VISUALIZING CATEGORIZED FEATURES 40
DISTRIBUTION 40
Distribution of Categorized Age versus Diagnosis 46
Distribution of Sex versus Diagnosis 49
Distribution of Categorized CA19-9 versus Diagnosis 52
Distribution of Categorized Creatinine versus Diagnosis 55
Extracting Categorical and Numerical Features 58
Density Distribution of Numerical Features 60
Density Distribution of Numerical Features 62
Case Distribution of Four Categorical Features versus 64
Diagnosis 65
Case Distribution of Four Categorical Features versus 66
Categorized Creatinine
67
Case Distribution of Four Categorical Features versus
68
Categorized Age
73
Case Distribution of Four Categorical Features versus Sex
76
Case Distribution of Four Categorical Features versus
Categorized Plasma CA19-9 79
Percentage Distribution of Categorized Age and Sex versus 80
Diagnosis 81
Percentage Distribution of Categorized Plasma CA19-9 and Sex 82
versus Diagnosis  
Distribution of Four Categorical Features versus LYV1  
Density Distribution of Four Categorical Features versus LYV1 84
Case and Density Distribution of Four Categorical Features  
versus REG1B
84
Case and Density Distribution of Four Categorical Features
88
versus TFF1
90
Case and Density Distribution of Four Categorical Features
versus REG1A 93
PREDICTING PANCREATIC CANCER USING MACHINE 94
LEARNING 98
Features Importance Using Random Forest Classifier 100
Features Importance Using Extra Trees Classifier 103
106
Features Importance Using Recursive Feature Elimination 116
(RFE) 128
Resampling and Splitting Data 138
Learning Curve 149
Real Values versus Predicted Values 162
Decision Boundaries and ROC 167
Training Model and Predicting Pancreatic Cancer 171
Support Vector Classifier and Grid Search 176
Logistic Regression Classifier and Grid Search 181
K-Nearest Neighbors Classifier and Grid Search  
Decision Tree Classifier and Grid Search  
Random Forest Classifier and Grid Search 203
Gradient Boosting Classifier and Grid Search  
Extreme Gradient Boosting Classifier and Grid Search 203
Multi-Layer Perceptron Classifier and Grid Search 209
Light Gradient Boosting Classifier and Grid Search 217
Source Code 222
  226
  228
IMPLEMENTING GRAPHICAL USER INTERFACE 230
USING PYQT
233
Designing GUI
235
Preprocessing Data and Populating Comboboxes and Tables
243
Resampling and Splitting Data
247
Distribution of Target Variable
253
Distribution of Age Variable
256
Distribution of Sex Variable
260
Distribution of Plasma CA19-9 Variable
263
Distribution of Creatinine Variable
266
Distribution of Numerical Features
269
Correlation Matrix and Features Importance
272
Helper Functions to Plot Model’s Performance
Training Model and Predicting Pancreatic Cancer 276
Logistic Regression Classifier 279
Support Vector Classifier 282
K-Nearest Neighbors Classifier 286
Decision Tree Classifier 288
Random Forest Classifier 291
Gradient Boosting Classifier  
Naïve Bayes Classifier
Adaboost Classifier
Extreme Gradient Boosting Classifier
Light Gradient Boosting Classifier
Multi-Layer Perceptron Classifier
Source Code
 
 
 
 
 
EXPLORING
DATASET
AND FEATURES DISTRIBUTION
 
EXPLORING
DATASET
AND FEATURES DISTRIBUTION
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Description
Pancreatic cancer is an extremely deadly type of cancer. Once
diagnosed, the five-year survival rate is less than 10%. However, if
pancreatic cancer is caught early, the odds of surviving are much
better. Unfortunately, many cases of pancreatic cancer show no
symptoms until the cancer has spread throughout the body. A
diagnostic test to identify people with pancreatic cancer could be
enormously helpful.
 
In a paper by Silvana Debernardi and colleagues, published this year
in the journal PLOS Medicine, a multi-national team of researchers
sought to develop an accurate diagnostic test for the most common
type of pancreatic cancer, called pancreatic ductal adenocarcinoma
or PDAC. They gathered a series of biomarkers from the urine of
three groups of patients:
 
Healthy controls
Patients with non-cancerous pancreatic
conditions, like chronic pancreatitis
Patients with pancreatic ductal
adenocarcinoma
When possible, these patients were age- and sex-matched. The goal
was to develop an accurate way to identify patients with pancreatic
cancer.
 
The key features are four urinary biomarkers: creatinine, LYVE1,
REG1B, and TFF1.
Creatinine is a protein that is often used
as an indicator of kidney function.
YVLE1 is lymphatic vessel endothelial
hyaluronan receptor 1, a protein that
may play a role in tumor metastasis.
REG1B is a protein that may be
associated with pancreas regeneration.
TFF1 is trefoil factor 1, which may be
related to regeneration and repair of the
urinary tract.
Age and sex, both included in the dataset, may also play a role in
who gets pancreatic cancer. The dataset includes a few other
biomarkers as well, but these were not measured in all patients (they
were collected partly to measure how various blood biomarkers
compared to urine biomarkers).
 
The goal in this dataset is predicting diagnosis, and more
specifically, differentiating between 3 (pancreatic cancer) versus 2
(non-cancerous pancreas condition) and 1 (healthy). The dataset
includes information on stage of pancreatic cancer, and diagnosis for
non-cancerous patients, but remember—these won't be available to a
predictive model. The goal, after all, is to predict the presence of
disease before it's diagnosed, not after.
 
 
 
Exploring Dataset
Step Download dataset from
1 https://viviansiahaan.blogspot.com/2023/07/the-applied-data-
science-workshop_21.html and save it to your working directory.
Unzip the files, Debernardi et al 2020 data.csv and Debernardi
et al 2020 documentation.csv, and put them into working
directory.
 
Step Open a new Python script and save it as pancreatic.py.
2  
Step Import all necessary libraries:
3  
1 # pancreatic.py
2 import numpy as np
3 import pandas as pd
4 import matplotlib
5 import matplotlib.pyplot as plt
6
import seaborn as sns
7
sns.set_style('darkgrid')
8
from sklearn.preprocessing import
9 LabelEncoder
10 import warnings
11 warnings.filterwarnings('ignore')
12 import os
13 import plotly.graph_objs as go
14 import joblib
15 from sklearn.metrics import
16 roc_auc_score,roc_curve
17 from sklearn.model_selection import
18 train_test_split, RandomizedSearchCV,
GridSearchCV,StratifiedKFold
19
from sklearn.preprocessing import
20
StandardScaler, MinMaxScaler
21
from sklearn.linear_model import
22 LogisticRegression
23 from sklearn.naive_bayes import GaussianNB
24 from sklearn.tree import
25 DecisionTreeClassifier
26 from sklearn.svm import SVC
27 from sklearn.ensemble import
28 RandomForestClassifier,
ExtraTreesClassifier
29
from sklearn.neighbors import
30
KNeighborsClassifier
31
from sklearn.ensemble import
32 AdaBoostClassifier,
33 GradientBoostingClassifier
34 from xgboost import XGBClassifier
35 from sklearn.neural_network import
36 MLPClassifier
37 from sklearn.linear_model import
SGDClassifier
38
from sklearn.preprocessing import
39
StandardScaler, LabelEncoder,
40 OneHotEncoder
from sklearn.metrics import
confusion_matrix, accuracy_score,
recall_score, precision_score
from sklearn.metrics import
classification_report, f1_score,
plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import
learning_curve
from mlxtend.plotting import
plot_decision_regions
 
This code sets up the necessary environment for working with
classification tasks on a dataset related to pancreatic disease. It
imports various libraries, machine learning models, and
evaluation metrics that will be used to preprocess the data, build,
and evaluate different classification models. Additionally, it
includes techniques for handling class imbalance in the dataset
and tools for visualizing learning curves and decision regions.
 
 
Step Read dataset:
4  
1 #Reads dataset
2 curr_path = os.getcwd()
3 df = pd.read_csv(curr_path+"/Debernardi et
4 al 2020 data.csv")
5 print(df.iloc[:,0:8].head().to_string())
print(df.iloc[:,8:14].head().to_string())
 
Output:
sample_id patient_cohort sample_origin age sex
diagnosis stage benign_sample_diagnosis
0 S1 Cohort1 BPTB 33 F
1 NaN NaN
1 S10 Cohort1 BPTB 81 F
1 NaN NaN
2 S100 Cohort2 BPTB 51 M
1 NaN NaN
3 S101 Cohort2 BPTB 61 M
1 NaN NaN
4 S102 Cohort2 BPTB 62 M
1 NaN NaN
 
plasma_CA19_9 creatinine LYVE1 REG1B
TFF1 REG1A
0 11.7 1.83222 0.893219 52.94884
654.282174 1262.000
1 NaN 0.97266 2.037585 94.46703
209.488250 228.407
2 7.0 0.78039 0.145589 102.36600
461.141000 NaN
3 8.0 0.70122 0.002805 60.57900
142.950000 NaN
4 9.0 0.21489 0.000860 65.54000
41.088000 NaN
 
Let's break down the code step by step:
1. curr_path = os.getcwd(): This line of
code gets the current working
directory (the folder where the Python
script or Jupyter notebook is located)
and assigns it to the variable curr_path.
1. df =
pd.read_csv(curr_path+"/Debernardi et
al 2020 data.csv"): This line reads a
CSV file named "Debernardi et al
2020 data.csv" located in the current
working directory. It uses pd.read_csv
from the pandas library to read the
CSV file and stores the data in a
DataFrame called df.
2. print(df.iloc[:,0:8].head().to_string()):
This line prints the first 5 rows of the
DataFrame df containing columns
from index 0 to 7 (the first 8 columns).
It uses .iloc to select the rows and
columns by integer location. .head() is
used to limit the output to the first 5
rows, and .to_string() is used to
display the DataFrame as a string.
3. print(df.iloc[:,8:14].head().to_string()):
This line prints the first 5 rows of the
DataFrame df containing columns
from index 8 to 13 (columns 9 to 14).
Similarly to the previous line, it uses
.iloc to select the rows and columns by
integer location, .head() to limit the
output to the first 5 rows, and
.to_string() to display the DataFrame
as a string.
The code essentially reads a CSV dataset, prints the first 5 rows of
the first 8 columns, and then prints the first 5 rows of the columns
from index 8 to 13. It's a quick way to inspect the structure and
initial contents of the dataset.
 
Step Check the shape of dataset:
5  
1 #Checks shape
2 print(df.shape)
 
Output:
(590, 14)
 
The line of code print(df.shape) checks and prints the shape of the
DataFrame df.
 
In the context of pandas DataFrames, the shape attribute returns a
tuple representing the dimensions of the DataFrame. The tuple
contains two values: the number of rows and the number of
columns, respectively.
 
So, when you execute print(df.shape), it will output the number of
rows and columns in the DataFrame df.
 
The output (590, 14) indicates that the DataFrame df has 590 rows
and 14 columns.
The first value 590 represents the
number of rows in the DataFrame,
meaning there are 590 rows of data.
The second value 14 represents the
number of columns in the DataFrame,
indicating there are 14 columns of
data.
In summary, the DataFrame df contains 590 records (samples or
data points) and has 14 features (columns) in the dataset.
 
Step Read every column in dataset:
6  
1 #Reads columns
2 print("Data Columns --> ",df.columns)
 
Output:
Data Columns --> Index(['sample_id',
'patient_cohort', 'sample_origin', 'age', 'sex',
'diagnosis', 'stage', 'benign_sample_diagnosis',
'plasma_CA19_9', 'creatinine', 'LYVE1', 'REG1B',
'TFF1', 'REG1A'],
dtype='object')
 
The line of code print("Data Columns --> ", df.columns) reads
and prints the names of all the columns present in the DataFrame
df.
 
Here's an explanation of each column related to pancreatic cancer:
sample_id: Represents a unique
identifier for each sample. It could be
an internal tracking ID used to
distinguish individual samples in the
dataset.
patient_cohort: Indicates the cohort or
group to which the patient belongs.
Patients might be categorized into
different cohorts based on specific
criteria. In this case, Cohort 1,
previously used samples; Cohort 2,
newly added samples
sample_origin: Describes the origin or
source of the sample. In this case,
BPTB: Barts Pancreas Tissue Bank,
London, UK; ESP: Spanish National
Cancer Research Centre, Madrid,
Spain; LIV: Liverpool University, UK;
UCL: University College London, UK.
age: Represents the age of the patient
associated with the sample. Age can be
an essential factor in understanding the
prevalence and impact of pancreatic
cancer in different age groups.
sex: Represents the sex or gender of
the patient. In this case, M = male, F =
female.
diagnosis: Indicates the diagnosis
associated with the sample. In this
case, 1 = control (no pancreatic
disease), 2 = benign hepatobiliary
disease (119 of which are chronic
pancreatitis); 3 = Pancreatic ductal
adenocarcinoma, i.e. pancreatic cancer
stage: Represents the stage or severity
of the disease or cancer progression.
The stages of pancreatic cancer are
usually denoted as IA, IB, IIA, IIIB,
III, IV, reflecting the extent of tumor
growth and metastasis.
benign_sample_diagnosis: Describes
the diagnosis of benign samples (non-
cancerous). This column is particularly
relevant when comparing cancerous
samples to non-cancerous control
samples.
plasma_CA19_9: Represents blood
plasma levels of CA 19–9 monoclonal
antibody that is often elevated in
patients with pancreatic cancer. Only
assessed in 350 patients (one goal of
the study was to compare various CA
19-9 cutpoints from a blood sample to
the model developed using urinary
samples)..
creatinine: Represents the creatinine
level, which is an indicator of kidney
function. Elevated creatinine levels
may indicate kidney impairment,
which could be relevant in assessing a
patient's overall health.
LYVE1: Represents urinary levels of
Lymphatic vessel endothelial
hyaluronan receptor 1, a protein that
may play a role in tumor metastasis.
REG1B: A feature (gene expression,
protein level, or similar) named
REG1B. Represents urinary levels of a
protein that may be associated with
pancreas regeneration.
TFF1: Represents urinary levels of
Trefoil Factor 1, which may be related
to regeneration and repair of the
urinary tract.
REG1A: Indicates urinary levels of a
protein that may be associated with
pancreas regeneration. Only assessed
in 306 patients (one goal of the study
was to assess REG1B vs REG1A).
The dataset seems to include a mix of demographic information,
clinical features, and molecular features (such as gene expressions
or biomarker levels). These columns are likely crucial for
conducting analyses, building predictive models, and gaining
insights into pancreatic cancer and its characteristics.
 
 
 
Information of Dataset
Step Check the information of dataset:
1  
1 #Checks dataset information
2 print(df.info())
 
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590 entries, 0 to 589
Data columns (total 14 columns):
# Column Non-Null Count
Dtype
--- ------ -------------- ---
--
0 sample_id 590 non-null
object
1 patient_cohort 590 non-null
object
2 sample_origin 590 non-null
object
3 age 590 non-null
int64
4 sex 590 non-null
object
5 diagnosis 590 non-null
int64
6 stage 199 non-null
object
7 benign_sample_diagnosis 208 non-null
object
8 plasma_CA19_9 350 non-null
float64
9 creatinine 590 non-null
float64
10 LYVE1 590 non-null
float64
11 REG1B 590 non-null
float64
12 TFF1 590 non-null
float64
13 REG1A 306 non-null
float64
dtypes: float64(6), int64(2), object(6)
memory usage: 64.7+ KB
None
 
The code checks and prints the information about the
DataFrame df. When you execute this code, it will provide a
summary of the dataset, including the number of non-null
values in each column, the data type of each column, and the
memory usage.
 
The output indicates the information about the DataFrame df.
Let's interpret the information provided in the output:
class
'pandas.core.frame.DataFrame':
This line indicates that the variable
df is a pandas DataFrame.
RangeIndex: 590 entries, 0 to 589:
The DataFrame contains 590 rows
or entries, with row indices ranging
from 0 to 589.
Data columns (total 14 columns):
There are a total of 14 columns in
the DataFrame.
# Column Non-Null Count Dtype:
The column headers are displayed
with additional information about
each column.
sample_id: There are 590 non-
null values in this column. The
data type is object.
patient_cohort: There are 590
non-null values in this column.
The data type is object.
sample_origin: There are 590
non-null values in this column.
The data type is object.
age: There are 590 non-null
values in this column. The data
type is int64.
sex: There are 590 non-null
values in this column. The data
type is object.
diagnosis: There are 590 non-
null values in this column. The
data type is int64.
stage: There are 199 non-null
values in this column. The data
type is object.
benign_sample_diagnosis:
There are 208 non-null values in
this column. The data type is
object.
plasma_CA19_9: There are 350
non-null values in this column.
The data type is float64.
creatinine: There are 590 non-
null values in this column. The
data type is float64.
LYVE1: There are 590 non-null
values in this column. The data
type is float64.
REG1B: There are 590 non-null
values in this column. The data
type is float64.
TFF1: There are 590 non-null
values in this column. The data
type is float64.
REG1A: There are 306 non-null
values in this column. The data
type is float64.
memory usage: 64.7+ KB:
Indicates the memory usage of the
DataFrame, which is
approximately 64.7 KB.
In summary, the output provides an overview of the
DataFrame df, showing the number of non-null values and
the data type for each column. It also reveals the presence of
missing values (indicated by the non-null counts that are less
than the total number of entries) in columns like stage,
benign_sample_diagnosis, plasma_CA19_9, and REG1A.
Missing values in a dataset might require handling during
data preprocessing to ensure meaningful analyses and
modeling.
 
 
 
Dropping Irrelevant Columns
Step Drop irrelevant columns and check null values in each column:
1  
1 #Drops irrelevant columns
2 df = df.drop(columns=
3 ['sample_id','patient_cohort',\
4 'sample_origin','stage','benign_sample_diagnosis']
)
5
 
6
#Checks null values
7
print(df.isnull().sum())
print('Total number of null values: ',
df.isnull().sum().sum())
 
Output:
age 0
sex 0
diagnosis 0
plasma_CA19_9 240
creatinine 0
LYVE1 0
REG1B 0
TFF1 0
REG1A 284
dtype: int64
Total number of null values: 524
 
The code provided performs two operations on the DataFrame df:
1. Dropping irrelevant columns:
The code removes specific columns from the DataFrame df
that are deemed irrelevant for the analysis. The columns being
dropped are 'sample_id', 'patient_cohort', 'sample_origin',
'stage', and 'benign_sample_diagnosis'. The drop() method is
used with the columns parameter to remove these columns
from the DataFrame.
2. Checking for null values:
After dropping the specified columns, the code checks for any
remaining null (missing) values in the DataFrame. The isnull()
method is used to identify null values, and sum() is applied
twice to calculate the total number of null values present in
each column and the overall total number of null values in the
DataFrame.

 
 
Imputing Missing Values
Step Imputes missing values in plasma_CA19_9 and REG1A with mean:
1  
1 #Imputes missing values in plasma_CA19_9 with mean
2 df['plasma_CA19_9'].fillna((df['plasma_CA19_9'].mean()),
3 inplace=True)
4  
5 #Imputes missing value in REG1A with mean
6 df['REG1A'].fillna((df['REG1A'].mean()), inplace=True)
7  
8 #Checks null values
9 print(df.isnull().sum())
10 print('Total number of null values: ',
df.isnull().sum().sum())
 
Output:
age 0
sex 0
diagnosis 0
plasma_CA19_9 0
creatinine 0
LYVE1 0
REG1B 0
TFF1 0
REG1A 0
dtype: int64
Total number of null values: 0
 
Here's an explanation of the code in steps:
1. The code imputes missing values in the
'plasma_CA19_9' column of the DataFrame
df using the mean value of the available data
in that column.
2. Next, the code imputes any missing value in
the 'REG1A' column of the DataFrame df
with the mean value of the available data in
that column.
3. After imputing the missing values, the code
checks for any remaining null (missing)
values in the DataFrame df.
4. It prints the count of null values for each
column to verify that all missing values have
been imputed.
5. Finally, it prints the total number of null
values in the entire DataFrame to ensure that
there are no missing values left.
The imputation of missing values is performed to make the dataset more
complete and suitable for further analysis or modeling, as missing values
could adversely affect the quality of analysis and model performance. By
replacing missing values with the mean, the code ensures a simple method
of imputation based on the available data in the respective columns.
 
 
 
Statistical Description
Step Look at some statistical measures about the dataset:
1  
1 #Looks at statistical description of data
2 print(df.describe().iloc[:,0:5].to_string())
3 print(df.describe().iloc[:,5:10].to_string())
 
Output:
age diagnosis plasma_CA19_9
creatinine LYVE1
count 590.000000 590.000000 590.000000
590.000000 590.000000
mean 59.079661 2.027119 654.002944
0.855383 3.063530
std 13.109520 0.804873 1870.760130
0.639028 3.438796
min 26.000000 1.000000 0.000000
0.056550 0.000129
25% 50.000000 1.000000 17.000000
0.373230 0.167179
50% 60.000000 2.000000 654.002944
0.723840 1.649862
75% 69.000000 3.000000 654.002944
1.139482 5.205037
max 89.000000 3.000000 31000.000000
4.116840 23.890323
 
REG1B TFF1 REG1A
count 590.000000 590.000000 590.000000
mean 111.774090 597.868722 735.281222
std 196.267110 1010.477245 1063.030198
min 0.001104 0.005293 0.000000
25% 10.757216 43.961000 195.201000
50% 34.303353 259.873974 735.281222
75% 122.741013 742.736000 735.281222
max 1403.897600 13344.300000 13200.000000
 
The code prints the statistical description of the DataFrame
df to provide a summary of the numerical data in the dataset.
The describe() method computes various summary statistics
for each numeric column in the DataFrame, including
measures like count, mean, standard deviation, minimum,
25th percentile (Q1), median (50th percentile or Q2), 75th
percentile (Q3), and maximum.
 
The output will display the summary statistics for the first
five numeric columns and then for the next five numeric
columns separately.
 
Let's analyze the output for each column and conclude:
age:
The age column represents the age
of patients in the dataset.
The data shows a relatively normal
distribution of ages, with the mean
age being approximately 59 years.
The patients' ages range from 26 to
89 years, indicating a diverse age
group in the dataset.
diagnosis:
The diagnosis column indicates
different diagnoses, encoded as 1,
2, or 3.
The mean value of approximately
2.03 suggests that the patients'
diagnoses are distributed between
the three categories.
Further domain knowledge is
needed to understand the specific
meanings of the encoded values.
plasma_CA19_9:
The plasma_CA19_9 column has
missing values, which were
imputed with the mean value of
approximately 654.
The values in this column have a
wide range, from 0 to 31,000.
The high standard deviation
(1870.76) indicates significant
variability in the CA19-9 levels
among patients.
creatinine:
The creatinine column represents
indicators of kidney function.
The values range from 0.06 to 4.12,
with a mean value of
approximately 0.86.
The standard deviation (0.64)
suggests moderate variability in the
creatinine levels among patients.
LYVE1, REG1B, TFF1, and REG1A:
These columns represent various
features, such as gene expressions
or protein levels.
Each column has different ranges
of values, with mean values of
approximately 3.06, 111.77,
597.87, and 735.28, respectively.
The standard deviations for these
columns are relatively high,
indicating considerable variations
in the feature values among
patients.
Conclusion:
The dataset contains information related to pancreatic cancer
patients, including their age, diagnosis, and various
biomarker or feature measurements. Notably, the
plasma_CA19_9 and REG1A columns have missing values,
which were imputed with the mean values. It is important to
interpret the analysis with caution, as imputed values might
introduce some bias to the data.
 
The dataset exhibits a diverse range of ages and diagnoses,
suggesting the inclusion of patients with varying
characteristics. However, to gain a deeper understanding and
draw meaningful conclusions from the data, further domain
knowledge and context about the features and diagnosis
categories are necessary.
 
Researchers and analysts can utilize this dataset to conduct
further exploratory analysis, build predictive models, and
uncover potential associations between patient characteristics
and pancreatic cancer diagnosis or prognosis. Careful
consideration of the data's limitations and assumptions
related to imputed values will be critical in drawing
meaningful insights from the analysis.
 
 
 
Distribution of Diagnosis Variable
Step Plot the distribution of diagnosis (target variable) in dataset:
1  
1 #Defines function to create pie chart and bar plot
2 as subplots
3 def plot_piechart(df, var, title=''):
4 plt.figure(figsize=(25, 10))
5 plt.subplot(121)
6 label_list =
list(df[var].value_counts().index)
7
colors = sns.color_palette("husl",
8
len(label_list))
9
10 df[var].value_counts().plot.pie(autopct="%1.1f%%",
11 \
12 colors=colors, \
13 startangle=60, labels=label_list, \
14 wedgeprops={"linewidth": 3, "edgecolor":
15 "k"}, \
16 shadow=True, textprops={'fontsize': 20})
17 plt.title("Distribution of " + var + "
18 variable " + title, fontsize=25)
19  
20 value_counts = df[var].value_counts()
21 # Print percentage values
22 percentages = value_counts / len(df) * 100
23 print("Percentage values:")
24 print(percentages)
25  
26 plt.subplot(122)
27 ax = df[var].value_counts().plot(kind="barh")
28  
29 for i, j in
enumerate(df[var].value_counts().values):
30
ax.text(.7, i, j, weight="bold",
31
fontsize=20)
32
 
33
plt.title("Count of " + var + " cases " +
title, fontsize=25)
# Print count values
print("Count values:")
print(value_counts)
plt.show()
 
plot_piechart(df,'diagnosis')
 
The result is shown in Figure 1.
 
Figure 1 The distribution of diagnosis (target variable)
 
The purpose of the code is to define a Python function called
plot_piechart() that generates two subplots: a pie chart and a
horizontal bar plot to visualize the distribution and count of
categories within a specified categorical variable in a DataFrame.
 
Here's the purpose of each part of the code:
1. plt.figure(figsize=(25, 10)): Sets the
figure size for the entire plot, ensuring
that the generated subplots have an
appropriate size for better visualization.
2. plt.subplot(121): Specifies the first
subplot as a pie chart. This means that
the pie chart will be positioned on the left
side of the plot with 1 row and 2
columns, and this is the first subplot.
3. label_list =
list(df[var].value_counts().index):
Creates a list of unique categories from
the specified categorical variable var in
the DataFrame df. This list will be used
to label the slices of the pie chart.
4. colors = sns.color_palette("husl",
len(label_list)): Generates a list of colors
from the seaborn color palette "husl" for
the number of unique categories in the
label_list. These colors will be used to
distinguish the slices of the pie chart.
5. df[var].value_counts().plot.pie(...): Plots
the pie chart using the plot.pie() method
from pandas. It visualizes the distribution
of the categories in the var variable. The
percentage of each category is displayed
on the chart.
6. plt.title("Distribution of " + var + "
variable " + title, fontsize=25): Sets the
title for the pie chart with the specified
var variable and an optional title. The
title indicates that the chart represents the
distribution of the categorical variable.
7. value_counts = df[var].value_counts():
Calculates the count of each category in
the var variable and stores it in the
value_counts variable. This count will be
used to create the horizontal bar plot in
the next subplot.
8. percentages = value_counts / len(df) *
100: Calculates the percentage of each
category in the var variable. This
information will be printed later.
9. plt.subplot(122): Specifies the second
subplot as a horizontal bar plot. This
means that the horizontal bar plot will be
positioned on the right side of the plot
with 1 row and 2 columns, and this is the
second subplot.
10. ax =
df[var].value_counts().plot(kind="barh"):
Plots the horizontal bar plot using the
plot() method with kind="barh". This
plot visualizes the count of each category
in the var variable in a horizontal format.
11. ax.text(...): Adds the count values to the
horizontal bar plot as text annotations,
displaying the count for each category
next to the corresponding horizontal bar.
12. plt.title("Count of " + var + " cases " +
title, fontsize=25): Sets the title for the
horizontal bar plot with the specified var
variable and an optional title. The title
indicates that the chart represents the
count of cases for each category.
13. The code then prints the percentage
values and count values for each
category in the var variable.
14. plt.show(): Displays the entire plot with
both the pie chart and the horizontal bar
plot as subplots.
In summary, the purpose of the plot_piechart() function is to create a
side-by-side visualization of a categorical variable's distribution using
a pie chart and the count of each category using a horizontal bar plot.
This function allows users to quickly analyze the distribution of
categorical data in a DataFrame and understand the proportion and
count of each category within the variable of interest.
 
Output:
Percentage values:
2 35.254237
3 33.728814
1 31.016949
Name: diagnosis, dtype: float64
Count values:
2 208
3 199
1 183
Name: diagnosis, dtype: int64
 
From the provided output, we can make the following specific
observations and conclusions:
 
Percentage Values:
Category 2: Approximately 35.25% of
the data belongs to this category.
Category 3: Approximately 33.73% of
the data belongs to this category.
Category 1: Approximately 31.02% of
the data belongs to this category.
Count Values:
Category 2: There are 208 occurrences of
this category in the dataset.
Category 3: There are 199 occurrences of
this category in the dataset.
Category 1: There are 183 occurrences of
this category in the dataset.
Observations:
The 'diagnosis' variable represents
different diagnoses, possibly encoded as
1, 2, or 3.
Category 2 (encoded value 2) is the most
frequent diagnosis in the dataset,
accounting for approximately 35.25% of
the data and having 208 occurrences.
Category 3 (encoded value 3) is the
second most frequent diagnosis,
comprising approximately 33.73% of the
data and having 199 occurrences.
Category 1 (encoded value 1) is the least
common diagnosis, accounting for
approximately 31.02% of the data and
having 183 occurrences.
 
Conclusions:
The dataset appears to have a relatively
balanced distribution of diagnoses across
categories 2 and 3, each accounting for a
significant portion of the data (around
35% each).
Category 1, while less frequent, still
represents a substantial portion of the
dataset (approximately 31%).
Researchers and analysts can utilize this
information to gain insights into the
distribution of diagnoses in the dataset
and potentially explore relationships
between the diagnosis and other features
or outcomes in the data.
 
 
Distribution of All Features
Step Plot distribution of all features in the whole dataset:
1  
1 # Looks at distribution of all features in
2 the whole original dataset
3 columns = list(df.columns)
4 columns.remove('diagnosis')
5 plt.subplots(figsize=(45, 50))
6 length = len(columns)
7 color_palette = sns.color_palette("Set3",
n_colors=length) # Define color palette
8  
9 for i, j in itertools.zip_longest(columns,
10 range(length)):
11 plt.subplot((length // 2), 4, j + 1)
12 plt.subplots_adjust(wspace=0.2,
13 hspace=0.5)
14 ax = df[i].hist(bins=10,
edgecolor='black', color=color_palette[j])
15 # Set color for each histogram
16 for p in ax.patches:
17 ax.annotate(format(p.get_height(),
18 '.0f'), (p.get_x() + p.get_width() / 2.,
19 p.get_height()), ha='center',
20 va='center', xytext=
21 (0, 10), weight="bold", fontsize=17,
textcoords='offset points')
22
 
plt.title(i, fontsize=30) # Adjust
title font size
plt.show()
 
The result is shown in Figure 2. The code generates a set of
histograms to visualize the distribution of each feature
(column) in the original dataset df, excluding the 'diagnosis'
column. It uses the matplotlib library for plotting and the
seaborn library to define a color palette for the histograms.
 
Here's a step-by-step explanation of the code:
1. columns = list(df.columns):
Creates a list containing the names
of all columns in the DataFrame df.
2. columns.remove('diagnosis'):
Removes the 'diagnosis' column
from the list of columns, as it is not
included in the histogram plot.
3. plt.subplots(figsize=(45, 50)):
Creates a figure with a large size
(45 inches in width and 50 inches
in height) to accommodate multiple
subplots for each feature's
histogram.
4. length = len(columns): Determines
the number of columns (features)
that need to be visualized.
5. color_palette =
sns.color_palette("Set3",
n_colors=length): Defines a color
palette using the seaborn library.
The color palette "Set3" is chosen
to provide distinct colors for each
histogram, and the number of
colors in the palette is set to match
the number of features.
6. The code then iterates through each
feature using the
itertools.zip_longest function and
creates a subplot for each
histogram.
Figure 2 The distribution of all features in the whole dataset
 
7. plt.subplot((length // 2), 4, j + 1):
Configures the position of each
subplot. It arranges the subplots in
rows of 2 and a maximum of 4
columns, making sure that the
subplots fit within the given figure
size.
8. plt.subplots_adjust(wspace=0.2,
hspace=0.5): Adjusts the horizontal
and vertical spacing between
subplots to improve readability.
9. ax = df[i].hist(bins=10,
edgecolor='black',
color=color_palette[j]): Plots the
histogram for the current feature i
from the DataFrame df. The
number of bins is set to 10, and the
histogram bars have black edges.
Each histogram is colored using the
corresponding color from the
defined color palette.
10. ax.annotate(...): Annotates each
histogram bar with its
corresponding frequency value
(height). The annotation is placed
at the center of the bar and includes
the formatted frequency value.
11. plt.title(i, fontsize=30): Sets the
title for each subplot to the name of
the corresponding feature (i). The
font size for the titles is set to 30.
12. Finally, plt.show() displays the
complete set of histograms with the
distribution of all features in the
original dataset df.
The resulting plot shows histograms for each feature,
allowing visualization of the distribution of data within each
column. The histograms provide insights into the data's
characteristics, such as its central tendency, spread, and
potential skewness or outliers. This visualization helps
researchers and analysts better understand the data and
identify any patterns or anomalies in the feature distributions.
 
 
 
Distribution of Plasma CA19-9 versus Diagnosis
Step Define another_versus_diagnosis() method to plot the
1 distribution of a feature against diagnosis feature:
 
1 from tabulate import tabulate
2 def another_versus_diagnosis(feat, num_bins):
3 fig, axes = plt.subplots(nrows=3, ncols=1,
4 figsize=(30, 22))
5 plt.subplots_adjust(wspace=0.5,
hspace=0.25)
6
7
colors = sns.color_palette("Set2")
8
diagnosis_labels = {1: 'Control (No
9
Pancreatic Disease)',
10
2: 'Benign Hepatobiliary Disease',
11
3: 'Pancreatic Cancer'}
12
13
data = {}
14
15
for diagnosis_code, ax in zip([1, 2, 3],
16 axes):
17 subset_data = df[df['diagnosis'] ==
18 diagnosis_code][feat]
19 subset_data.plot(ax=ax, kind='hist',
20 bins=num_bins, edgecolor='black',
color=colors[diagnosis_code-1])
21
22
23
ax.set_title(diagnosis_labels[diagnosis_code],
24 fontsize=30)
25 ax.set_xlabel(feat, fontsize=30)
26 ax.set_ylabel('Count', fontsize=30)
27
28 patch_data = []
29 for p in ax.patches:
30 x = p.get_x() + p.get_width() / 2.
31 y = p.get_height()
32 ax.annotate(format(y, '.0f'), (x,
33 y), ha='center', va='center', xytext=(0, 10),
34 weight="bold",
fontsize=25, textcoords='offset points')
35
patch_data.append([x, y])
36
37
data[diagnosis_labels[diagnosis_code]]
38
= patch_data
39
40
plt.show()
41
 
42
for diagnosis_label, patch_data in
data.items():
print(diagnosis_label + ":")
print(tabulate(patch_data, headers=[feat,
diagnosis_label]))
print()
 
The purpose of the code is to visualize and compare the
distribution of a specific feature (feat) in a DataFrame df for
different diagnosis categories. The code creates three subplots
(axes) in a single figure, each representing a different diagnosis
category. It displays histograms for the selected feature's
distribution within each diagnosis group. Additionally, it
tabulates and prints the data of the histograms in a formatted
manner using the tabulate library.
 
Here's a step-by-step explanation of the code:
1. from tabulate import tabulate:
Imports the tabulate function from
the tabulate library, which is used to
format and print the histogram data.
2. The function
another_versus_diagnosis(feat,
num_bins) is defined with two input
parameters:
feat: The name of the feature
(column) in the DataFrame to be
analyzed.
num_bins: The number of bins
to be used for creating the
histograms.
3. fig, axes = plt.subplots(nrows=3,
ncols=1, figsize=(30, 22)): Creates a
figure with three subplots (rows) and
one column, each having the size of
30 inches in width and 22 inches in
height.
4. plt.subplots_adjust(wspace=0.5,
hspace=0.25): Adjusts the horizontal
and vertical spacing between
subplots for better visualization.
5. colors = sns.color_palette("Set2"):
Defines a color palette using the
seaborn library's "Set2" palette,
which will be used to differentiate
the histograms for each diagnosis
category.
6. diagnosis_labels = {1: 'Control (No
Pancreatic Disease)', ...}: Creates a
dictionary diagnosis_labels that
maps each diagnosis code (1, 2, or 3)
to its corresponding label. This
mapping is used to set the title for
each subplot, indicating the
diagnosis category it represents.
7. The variable data is initialized as an
empty dictionary. This dictionary
will be used to collect the data (x and
y values) of each histogram for later
tabulation.
8. The code then enters a loop that
iterates over the three diagnosis
codes (1, 2, and 3) and their
corresponding subplots (axes).
9. Within the loop:
subset_data = df[df['diagnosis']
== diagnosis_code][feat]: Filters
the DataFrame df to get the
subset of data corresponding to
the current diagnosis code
(diagnosis_code) and the
selected feature (feat).
subset_data.plot(ax=ax,
kind='hist', bins=num_bins,
edgecolor='black',
color=colors[diagnosis_code-
1]): Plots a histogram for the
subset_data on the current
subplot (ax) with the specified
number of bins and color
corresponding to the diagnosis
category.
ax.set_title(...), ax.set_xlabel(...),
ax.set_ylabel(...): Sets the title,
x-axis label, and y-axis label for
the current subplot.
A loop inside the loop iterates
over the histogram patches
(bars) and adds annotations
(frequency values) on top of
each bar.
The x and y values of each patch
are collected in the patch_data
list, which is then stored in the
data dictionary using the
diagnosis label as the key.
10. plt.show(): Displays the figure
containing the three subplots, each
showing the histogram of the
selected feature for the different
diagnosis categories.
11. The code then enters another loop to
tabulate and print the data collected
in the data dictionary. This loop
prints the x and y values (heights) of
each histogram bar in a formatted
table for each diagnosis category.
In summary, this code aims to provide a visual and tabulated
comparison of the distribution of a specific feature among
different diagnosis categories. The subplots show histograms
for each diagnosis group, and the tabulated data provides
numerical insights into the distribution of the selected feature
within each diagnosis category. This information can be helpful
in understanding how the feature's distribution varies among
different diagnostic groups in the dataset.
 
Step Look at plasma_CA19_9 feature distribution by diagnosis
2 feature:
 
1 #Looks at plasma_CA19_9 feature
2 distribution by diagnosis feature
another_versus_diagnosis("plasma_CA19_9",
10)
 
The result is shown in Figure 3. The resulting plot from the
function another_versus_diagnosis("plasma_CA19_9", 10)
shows a side-by-side comparison of the distribution of the
"plasma_CA19_9" feature among three different diagnosis
categories: Control (No Pancreatic Disease), Benign
Hepatobiliary Disease, and Pancreatic Cancer.
 
Figure 3 The plasma_CA19_9 feature distribution by
diagnosis feature
 
Output:
Control (No Pancreatic Disease):
plasma_CA19_9 Control (No Pancreatic Disease)
--------------- ---------------------------------
32.7001 91
98.1004 1
163.501 0
228.901 0
294.301 0
359.702 0
425.102 0
490.502 0
555.903 0
621.303 91
 
Benign Hepatobiliary Disease:
plasma_CA19_9 Benign Hepatobiliary Disease
--------------- ------------------------------
96.6 104
287.8 2
479 0
670.2 100
861.4 0
1052.6 0
1243.8 0
1435 0
1626.2 1
1817.4 1
 
Pancreatic Cancer:
plasma_CA19_9 Pancreatic Cancer
--------------- -------------------
1550.57 184
4650.51 7
7750.45 3
10850.4 2
13950.3 1
17050.3 1
20150.2 0
23250.2 0
26350.1 0
29450 1
 
From the output, we can make more specific observations and
draw conclusions about the distribution of the
"plasma_CA19_9" feature among different diagnosis categories
(Control, Benign Hepatobiliary Disease, and Pancreatic
Cancer).
Control (No Pancreatic Disease):
The majority of patients in the
Control group have
"plasma_CA19_9" values within the
range of approximately 32.70 to
621.30.
There is one patient with a
"plasma_CA19_9" value of
approximately 98.10.
The "plasma_CA19_9" values in the
Control group are relatively lower
compared to the other two diagnosis
categories.
Benign Hepatobiliary Disease:
The majority of patients with Benign
Hepatobiliary Disease have
"plasma_CA19_9" values within the
range of approximately 96.60 to
1626.20.
There are two patients with
"plasma_CA19_9" values of
approximately 287.80 and 1626.20.
The distribution of
"plasma_CA19_9" values in this
group shows a broader range
compared to the Control group.
Pancreatic Cancer:
The majority of patients diagnosed
with Pancreatic Cancer have
"plasma_CA19_9" values within the
range of approximately 1550.57 to
29450.00.
There are multiple patients with
higher "plasma_CA19_9" values,
such as 184 patients with a value
around 1550.57 and 7 patients with a
value around 4650.51.
The distribution of
"plasma_CA19_9" values in this
group shows a wide range, with
some values much higher than those
observed in the other diagnosis
categories.
Conclusions:
The output highlights the significant
differences in the distribution of
"plasma_CA19_9" values among the
three diagnosis categories.
Patients with Pancreatic Cancer
generally have higher
"plasma_CA19_9" levels compared
to those with Benign Hepatobiliary
Disease and the Control group. This
suggests that "plasma_CA19_9"
might be a potential biomarker for
detecting Pancreatic Cancer.
The Control group has lower
"plasma_CA19_9" levels, indicating
that elevated levels of this feature
might be associated with disease
conditions, particularly Pancreatic
Cancer and Benign Hepatobiliary
Disease.
The Benign Hepatobiliary Disease
group shows a broader range of
"plasma_CA19_9" values than the
Control group but lower values than
the Pancreatic Cancer group. This
could indicate a potential overlap in
the "plasma_CA19_9" levels
between Benign Hepatobiliary
Disease and Pancreatic Cancer
patients.
Overall, the combination of the plotted histograms and the
tabulated frequency data provides a comprehensive overview of
the distribution of "plasma_CA19_9" levels in different
diagnosis categories. It allows researchers or medical
professionals to identify potential patterns and relationships
between the "plasma_CA19_9" feature and the diagnosis of
pancreatic diseases. Further analysis and domain-specific
knowledge are required to draw definitive conclusions and
explore the clinical implications of these findings.
 
 
 
Distribution of Creatinine versus Diagnosis
Step Look at creatinine feature distribution by diagnosis feature:
1  
1 #Looks at creatinine feature distribution
2 by diagnosis feature
another_versus_diagnosis("creatinine", 10)
 
The result is shown in Figure 4. The function
another_versus_diagnosis("creatinine", 10) generates a plot
and tabulated data to analyze the distribution of the
"creatinine" feature among three different diagnosis
categories: Control (No Pancreatic Disease), Benign
Hepatobiliary Disease, and Pancreatic Cancer.
 

Figure 4 The creatinine feature distribution by diagnosis


feature
 
Output:
Control (No Pancreatic Disease):
creatinine Control (No Pancreatic Disease)
------------ ---------------------------------
0.236944 54
0.575114 40
0.913282 40
1.25145 26
1.58962 10
1.92779 10
2.26596 2
2.60413 0
2.9423 0
3.28047 1
 
Benign Hepatobiliary Disease:
creatinine Benign Hepatobiliary Disease
------------ ------------------------------
0.220545 54
0.548535 44
0.876525 49
1.20452 24
1.53251 18
1.8605 6
2.18849 6
2.51648 4
2.84447 2
3.17246 1
 
Pancreatic Cancer:
creatinine Pancreatic Cancer
------------ -------------------
0.281053 64
0.68482 56
1.08859 38
1.49235 15
1.89612 8
2.29989 9
2.70366 5
3.10742 2
3.51119 1
3.91496 1
 
Based on the output, we can make specific observations and
draw conclusions about the distribution of the "creatinine"
feature among different diagnosis categories (Control,
Benign Hepatobiliary Disease, and Pancreatic Cancer).
 
Control (No Pancreatic Disease):
The majority of patients in the
Control group have "creatinine"
levels within the range of
approximately 0.24 to 1.93.
The highest count of patients (54)
falls within the bin with a
"creatinine" level of approximately
0.24 to 0.58.
There are only two patients with
"creatinine" levels around 2.27.
Benign Hepatobiliary Disease:
Patients with Benign Hepatobiliary
Disease have "creatinine" levels
primarily within the range of
approximately 0.22 to 2.19.
The highest count of patients (54)
has "creatinine" levels between
0.22 and 0.55.
There are a few patients with
higher "creatinine" levels, with
counts decreasing as the level
increases.
Pancreatic Cancer:
The majority of patients diagnosed
with Pancreatic Cancer have
"creatinine" levels within the range
of approximately 0.28 to 3.91.
The highest count of patients (64)
falls within the bin with a
"creatinine" level of approximately
0.28 to 0.68.
There is a gradual decrease in
patient counts as the "creatinine"
level increases.
Conclusions:
The output shows differences in the
distribution of the "creatinine"
feature among the three diagnosis
categories.
Patients with Pancreatic Cancer
tend to have higher "creatinine"
levels compared to those with
Benign Hepatobiliary Disease and
the Control group.
The Control group exhibits
relatively lower "creatinine" levels
compared to the other two
diagnosis categories.
The distribution of "creatinine"
levels varies, and higher levels of
"creatinine" might be associated
with specific medical conditions,
particularly Pancreatic Cancer.
The information provided in the
output can be valuable for medical
professionals and researchers to
identify potential correlations
between "creatinine" levels and the
diagnosis of pancreatic diseases.
To gain a deeper understanding of the clinical implications
and significance of these findings, further analysis and
domain-specific expertise are necessary. Additionally,
conducting statistical tests or machine learning models using
this feature can help assess its predictive power in diagnosing
or classifying different medical conditions related to
pancreatic diseases.
 
 
 
Distribution of LYVE1 versus Diagnosis
Step Look at LYVE1 feature distribution by diagnosis feature:
1  
1 #Looks at LYVE1 feature distribution by
2 diagnosis feature
another_versus_diagnosis("LYVE1", 10)
 
The result is shown in Figure 5. To analyze the distribution
of the "LYVE1" feature among different diagnosis categories
(Control, Benign Hepatobiliary Disease, and Pancreatic
Cancer), the function another_versus_diagnosis("LYVE1",
10) generates a plot and tabulated data.
 
Here's the description of the resulting plot and tabulated data:
 
Control (No Pancreatic Disease):
The histogram displays the
distribution of "LYVE1" values for
patients diagnosed with no
pancreatic disease (Control group).
The x-axis represents the range of
"LYVE1" values, divided into ten
equally spaced bins.
The y-axis shows the frequency
count of "LYVE1" values falling
within each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
Benign Hepatobiliary Disease:
The histogram represents the
distribution of "LYVE1" values for
patients diagnosed with benign
hepatobiliary disease.
Similar to the previous histogram,
"LYVE1" values are grouped into
ten bins, and the y-axis shows the
frequency count for each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.

Figure 5 The LYVE1 feature distribution by diagnosis


feature
 
Pancreatic Cancer:
The histogram displays the
distribution of "LYVE1" values for
patients diagnosed with pancreatic
cancer.
"LYVE1" values are grouped into
ten bins, and the y-axis shows the
frequency count for each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The tabulated data shows the
frequency count of "LYVE1"
values within each bin for each
diagnosis category.
The plot and tabulated data provide insights into how the
distribution of the "LYVE1" feature varies among the
different diagnosis categories. Researchers or medical
professionals can use this information to understand potential
patterns or differences in "LYVE1" levels in different
medical conditions related to pancreatic diseases.
 
Output:
Control (No Pancreatic Disease):
LYVE1 Control (No Pancreatic Disease)
-------- ---------------------------------
0.416085 114
1.248 26
2.07991 11
2.91182 7
3.74373 6
4.57565 8
5.40756 3
6.23947 1
7.07138 4
7.90329 3
 
Benign Hepatobiliary Disease:
LYVE1 Benign Hepatobiliary Disease
-------- ------------------------------
0.55222 100
1.65621 34
2.76019 17
3.86418 21
4.96817 12
6.07216 11
7.17614 9
8.28013 1
9.38412 2
10.4881 1
 
Pancreatic Cancer:
LYVE1 Pancreatic Cancer
-------- -------------------
1.19572 46
3.58463 37
5.97353 51
8.36244 31
10.7513 19
13.1402 13
15.5292 1
17.9181 0
20.307 0
22.6959 1
 
Based on the output, we can make specific observations and
draw conclusions about the distribution of the "LYVE1"
feature among different diagnosis categories (Control,
Benign Hepatobiliary Disease, and Pancreatic Cancer).
 
Control (No Pancreatic Disease):
The majority of patients in the
Control group have "LYVE1"
levels within the range of
approximately 0.42 to 7.90.
The highest count of patients (114)
falls within the bin with a
"LYVE1" level of approximately
0.42 to 0.55.
There are only a few patients with
"LYVE1" levels above 7.
Benign Hepatobiliary Disease:
Patients with Benign Hepatobiliary
Disease have "LYVE1" levels
primarily within the range of
approximately 0.55 to 10.49.
The highest count of patients (100)
has "LYVE1" levels between 0.55
and 1.66.
There is a gradual decrease in
patient counts as the "LYVE1"
level increases.
Pancreatic Cancer:
The majority of patients diagnosed
with Pancreatic Cancer have
"LYVE1" levels within the range of
approximately 1.20 to 22.70.
The highest count of patients (51)
falls within the bin with a
"LYVE1" level of approximately
5.97 to 8.36.
The patient counts decrease at
higher "LYVE1" levels.
Conclusions:
The output shows differences in the
distribution of the "LYVE1"
feature among the three diagnosis
categories.
Patients with Pancreatic Cancer
generally have higher "LYVE1"
levels compared to those with
Benign Hepatobiliary Disease and
the Control group.
The Control group exhibits
relatively lower "LYVE1" levels
compared to the other two
diagnosis categories.
The distribution of "LYVE1" levels
varies, and higher levels might be
associated with specific medical
conditions, particularly Pancreatic
Cancer.
The information provided in the
output can be valuable for medical
professionals and researchers to
identify potential correlations
between "LYVE1" levels and the
diagnosis of pancreatic diseases.
As with any analysis, further investigation and statistical
tests may be required to draw more definitive conclusions
and understand the clinical significance of these findings.
Additionally, domain-specific knowledge and expertise are
crucial in interpreting the results effectively.
 
Distribution of REG1B versus Diagnosis
Step Look at REG1B feature distribution by diagnosis feature:
1  
1 #Looks at REG1B feature distribution by
2 diagnosis feature
another_versus_diagnosis("REG1B", 10)
 
The result is shown in Figure 6.
 
Figure 6 The REG1B feature distribution by diagnosis
feature
 
The function another_versus_diagnosis("REG1B", 10)
generates three subplots to analyze the distribution of the
"REG1B" feature among different diagnosis categories
(Control, Benign Hepatobiliary Disease, and Pancreatic
Cancer). Each subplot corresponds to a diagnosis category
and shows the histogram of "REG1B" values for patients
within that category.
 
Here's the description of the resulting plots:
 
Control (No Pancreatic Disease):
The first subplot represents the
distribution of "REG1B" values for
patients diagnosed with no
pancreatic disease (Control group).
The x-axis represents the range of
"REG1B" values, divided into ten
equally spaced bins.
The y-axis shows the frequency
count of "REG1B" values falling
within each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The title of this subplot is "Control
(No Pancreatic Disease)."
Benign Hepatobiliary Disease:
The second subplot represents the
distribution of "REG1B" values for
patients diagnosed with benign
hepatobiliary disease.
Similar to the previous subplot,
"REG1B" values are grouped into
ten bins, and the y-axis shows the
frequency count for each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The title of this subplot is "Benign
Hepatobiliary Disease."
Pancreatic Cancer:
The third subplot represents the
distribution of "REG1B" values for
patients diagnosed with pancreatic
cancer.
"REG1B" values are grouped into
ten bins, and the y-axis shows the
frequency count for each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The title of this subplot is
"Pancreatic Cancer."
The histograms show the spread and concentration of
"REG1B" values within each diagnosis category. By
comparing the three subplots, one can identify differences in
the distribution of "REG1B" among the different medical
conditions.
 
Output:
Control (No Pancreatic Disease):
REG1B Control (No Pancreatic Disease)
-------- ---------------------------------
27.1787 140
81.534 24
135.889 14
190.245 1
244.6 2
298.955 1
353.31 0
407.666 0
462.021 0
516.376 1
 
Benign Hepatobiliary Disease:
REG1B Benign Hepatobiliary Disease
------- ------------------------------
43.221 166
129.657 19
216.094 15
302.53 1
388.967 0
475.403 4
561.839 1
648.276 0
734.712 1
821.149 1
 
Pancreatic Cancer:
REG1B Pancreatic Cancer
--------- -------------------
71.7641 107
211.989 42
352.213 21
492.438 9
632.662 4
772.887 5
913.112 5
1053.34 1
1193.56 3
1333.79 2
 
Based on the output, we can make specific observations and
draw conclusions about the distribution of the "REG1B"
feature among different diagnosis categories (Control,
Benign Hepatobiliary Disease, and Pancreatic Cancer).
 
Control (No Pancreatic Disease):
The majority of patients in the
Control group have "REG1B"
levels within the range of
approximately 27.18 to 516.38.
The highest count of patients (140)
falls within the bin with a
"REG1B" level of approximately
27.18 to 81.53.
There are a few patients with
higher "REG1B" levels, with
counts decreasing as the level
increases.
Benign Hepatobiliary Disease:
Patients with Benign Hepatobiliary
Disease have "REG1B" levels
primarily within the range of
approximately 43.22 to 821.15.
The highest count of patients (166)
has "REG1B" levels between 43.22
and 129.66.
There are fewer patients with
"REG1B" levels above 821.
Pancreatic Cancer:
The majority of patients diagnosed
with Pancreatic Cancer have
"REG1B" levels within the range
of approximately 71.76 to 1333.79.
The highest count of patients (107)
falls within the bin with a
"REG1B" level of approximately
71.76 to 211.99.
The patient counts decrease at
higher "REG1B" levels.
Conclusions:
The output shows differences in the
distribution of the "REG1B"
feature among the three diagnosis
categories.
Patients with Pancreatic Cancer
tend to have higher "REG1B"
levels compared to those with
Benign Hepatobiliary Disease and
the Control group.
The Control group exhibits
relatively lower "REG1B" levels
compared to the other two
diagnosis categories.
The distribution of "REG1B" levels
varies, and higher levels might be
associated with specific medical
conditions, particularly Pancreatic
Cancer.
The information provided in the
output can be valuable for medical
professionals and researchers to
identify potential correlations
between "REG1B" levels and the
diagnosis of pancreatic diseases.
As with any analysis, further investigation and statistical
tests may be required to draw more definitive conclusions
and understand the clinical significance of these findings.
Additionally, domain-specific knowledge and expertise are
crucial in interpreting the results effectively.
 
 
 
Distribution of TFF1 versus Diagnosis
Step Look at TFF1 feature distribution by diagnosis feature:
1  
1 #Looks at TFF1 feature distribution by
2 diagnosis feature
another_versus_diagnosis("TFF1", 10)
 
The result is shown in Figure 7. The plot consists of three
subplots, each representing the distribution of the "TFF1"
feature within a specific diagnosis category.
 
Control (No Pancreatic Disease):
The first subplot represents the
distribution of "TFF1" values for
patients diagnosed with no
pancreatic disease (Control group).
The x-axis represents the range of
"TFF1" values, divided into ten
equally spaced bins.
The y-axis shows the frequency
count of "TFF1" values falling
within each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The title of this subplot is "Control
(No Pancreatic Disease)."
 
Benign Hepatobiliary Disease:
The second subplot represents the
distribution of "TFF1" values for
patients diagnosed with benign
hepatobiliary disease.
Similar to the previous subplot,
"TFF1" values are grouped into ten
bins, and the y-axis shows the
frequency count for each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The title of this subplot is "Benign
Hepatobiliary Disease."
Pancreatic Cancer:
The third subplot represents the
distribution of "TFF1" values for
patients diagnosed with pancreatic
cancer.
"TFF1" values are grouped into ten
bins, and the y-axis shows the
frequency count for each bin.
Annotations on top of each bar
indicate the frequency count for
each bin.
The title of this subplot is
"Pancreatic Cancer."
Figure 7 The TFF1 feature distribution by diagnosis feature
 
Tabulated Data Analysis:
The tabulated data for each subplot contains two columns:
"TFF1" and the corresponding diagnosis category. Each row
represents a bin with the range of "TFF1" values and the
frequency count of patients falling within that bin for the
respective diagnosis category.
 
Conclusions:
By analyzing the plots and tabulated data, you can observe
the distribution of the "TFF1" feature among the three
diagnosis categories. Look for patterns, variations, and
differences in the "TFF1" values between the groups to
identify potential correlations with specific medical
conditions. Similar to the previous analyses, these insights
can aid medical professionals and researchers in
understanding how "TFF1" levels relate to the diagnosis of
different pancreatic diseases. As always, further statistical
analysis and domain-specific knowledge are essential to
draw meaningful conclusions from the data.
 
Output:
Control (No Pancreatic Disease):
TFF1 Control (No Pancreatic Disease)
--------- ---------------------------------
93.8344 129
281.493 34
469.151 7
656.809 5
844.468 2
1032.13 2
1219.78 1
1407.44 2
1595.1 0
1782.76 1
 
Benign Hepatobiliary Disease:
TFF1 Benign Hepatobiliary Disease
-------- ------------------------------
223.097 133
669.264 49
1115.43 13
1561.6 7
2007.77 1
2453.93 0
2900.1 1
3346.27 2
3792.43 1
4238.6 1
 
Pancreatic Cancer:
TFF1 Pancreatic Cancer
--------- -------------------
667.235 147
2001.66 27
3336.09 20
4670.52 2
6004.95 1
7339.37 1
8673.8 0
10008.2 0
11342.7 0
12677.1 1
Based on the output, we can analyze the distribution of the
"TFF1" feature among different diagnosis categories
(Control, Benign Hepatobiliary Disease, and Pancreatic
Cancer):
 
Control (No Pancreatic Disease):
The majority of patients in the
Control group have "TFF1" levels
within the range of approximately
93.83 to 1782.76.
The highest count of patients (129)
falls within the bin with "TFF1"
levels of approximately 93.83 to
281.49.
There are a few patients with
higher "TFF1" levels, with counts
decreasing as the level increases.
Benign Hepatobiliary Disease:
Patients with Benign Hepatobiliary
Disease have "TFF1" levels
primarily within the range of
approximately 223.10 to 4238.60.
The highest count of patients (133)
has "TFF1" levels between 223.10
and 669.26.
There are fewer patients with
"TFF1" levels above 4238.60.
Pancreatic Cancer:
The majority of patients diagnosed
with Pancreatic Cancer have
"TFF1" levels within the range of
approximately 667.24 to 12677.10.
The highest count of patients (147)
falls within the bin with "TFF1"
levels of approximately 667.24 to
2001.66.
The patient counts decrease at
higher "TFF1" levels.
Conclusions:
The output shows differences in the
distribution of the "TFF1" feature
among the three diagnosis
categories.
Patients with Pancreatic Cancer
tend to have higher "TFF1" levels
compared to those with Benign
Hepatobiliary Disease and the
Control group.
The Control group exhibits
relatively lower "TFF1" levels
compared to the other two
diagnosis categories.
The distribution of "TFF1" levels
varies, and higher levels might be
associated with specific medical
conditions, particularly Pancreatic
Cancer.
As with previous analyses, further
statistical analysis and domain-
specific knowledge are required to
draw more definitive conclusions
and understand the clinical
significance of these findings.
 
Distribution of REG1A versus Diagnosis
Step Look at REG1A feature distribution by diagnosis feature:
1  
1 #Looks at REG1A feature distribution by
2 diagnosis feature
another_versus_diagnosis("REG1A", 10)
 
The result is shown in Figure 8. This will generate the
desired plot and tabulated data for the distribution of the
"REG1A" feature among different diagnosis categories
(Control, Benign Hepatobiliary Disease, and Pancreatic
Cancer).
 
Figure 8 The REG1A feature distribution by diagnosis
feature
 
Output:
Control (No Pancreatic Disease):
REG1A Control (No Pancreatic Disease)
--------- ---------------------------------
80.8571 49
242.571 12
404.286 5
566 6
727.714 107
889.428 2
1051.14 0
1212.86 1
1374.57 0
1536.28 1
 
Benign Hepatobiliary Disease:
REG1A Benign Hepatobiliary Disease
-------- ------------------------------
404.175 195
1212.52 5
2020.87 4
2829.22 0
3637.57 1
4445.92 1
5254.27 1
6062.62 0
6870.97 0
7679.32 1
 
Pancreatic Cancer:
REG1A Pancreatic Cancer
------- -------------------
660 162
1980 24
3300 5
4620 3
5940 0
7260 2
8580 2
9900 0
11220 0
12540 1
 
Based on the output, we can analyze the distribution of the
"REG1A" feature among different diagnosis categories
(Control, Benign Hepatobiliary Disease, and Pancreatic
Cancer):
Control (No Pancreatic Disease):
The majority of patients in the
Control group have "REG1A"
levels within the range of
approximately 80.86 to 889.43.
The highest count of patients (107)
falls within the bin with "REG1A"
levels of approximately 727.71 to
889.43.
There are only a few patients with
"REG1A" levels beyond 889.43.
Benign Hepatobiliary Disease:
Patients with Benign Hepatobiliary
Disease have "REG1A" levels
primarily within the range of
approximately 404.18 to 7679.32.
The highest count of patients (195)
has "REG1A" levels between
404.18 and 1212.52.
There are fewer patients with
"REG1A" levels above 7679.32.
Pancreatic Cancer:
The majority of patients diagnosed
with Pancreatic Cancer have
"REG1A" levels within the range
of approximately 660 to 12540.
The highest count of patients (162)
falls within the bin with "REG1A"
levels of approximately 660 to
1980.
The patient counts decrease at
higher "REG1A" levels.
Conclusions:
The output demonstrates
differences in the distribution of
the "REG1A" feature among the
three diagnosis categories.
Patients with Pancreatic Cancer
tend to have higher "REG1A"
levels compared to those with
Benign Hepatobiliary Disease and
the Control group.
The Control group exhibits
relatively lower "REG1A" levels
compared to the other two
diagnosis categories.
The distribution of "REG1A"
levels varies, and higher levels
might be associated with specific
medical conditions, particularly
Pancreatic Cancer.
As always, further statistical
analysis and domain-specific
knowledge are required to draw
more definitive conclusions and
understand the clinical significance
of these findings.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
VISUALIZING
CATEGORIZED
FEATURES DISTRIBUTION
 
VISUALIZING
CATEGORIZED
FEATURES DISTRIBUTION
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Distribution of Categorized Age versus Diagnosis
Step Creates a dummy dataframe for visualization, and categorize
1 diagnosis feature:
 
1 #Creates a dummy dataframe for
2 visualization
3 df_dummy=df.copy()
4  
5 #Categorizes diagnosis feature
6 def cat_diagnosis(n):
7 if n == 1:
8 return 'Control (No Pancreatic Disease)'
9 if n == 2:
10 return 'Benign Hepatobiliary Disease'
11 else:
12 return 'Pancreatic Cancer'
13
14 df_dummy['diagnosis'] =
df_dummy['diagnosis'].apply(lambda x:
cat_diagnosis(x))
 
The purpose of the code is to create a dummy dataframe for
visualization purposes and categorize the "diagnosis" feature
based on its numerical values into meaningful labels that
represent different types of pancreatic diseases. The original
dataframe df is duplicated as df_dummy to avoid modifying the
original data during the categorization process.
 
Here's a step-by-step explanation of the code:
1. df_dummy = df.copy(): This line
creates a copy of the original
dataframe df and assigns it to a new
dataframe named df_dummy. This is
done to create a separate dataframe
for visualization without affecting
the original data.
2. def cat_diagnosis(n): This is a
custom function that takes a
numerical value n as input and
categorizes it into descriptive labels
corresponding to different types of
pancreatic diseases. It uses an if-else
logic to map the numerical value to
the appropriate label.
3. df_dummy['diagnosis'] =
df_dummy['diagnosis'].apply(lambda
x: cat_diagnosis(x)): This line
applies the cat_diagnosis function to
the "diagnosis" column of the
df_dummy dataframe using the
apply method. It effectively replaces
the numerical values (1, 2, or 3) in
the "diagnosis" column with the
corresponding descriptive labels
(Control, Benign Hepatobiliary
Disease, or Pancreatic Cancer) by
using the lambda function to pass
each value to the cat_diagnosis
function.
The result is a new dataframe df_dummy with the "diagnosis"
column containing descriptive labels instead of numerical
values, making it more understandable and suitable for
visualization purposes. This new dataframe can be used to
create visualizations that show the distribution of different
pancreatic diseases using meaningful labels.
 
Step Define dist_one_vs_another_plot() method to plot the
2 distribution of one feature against another feature in stacked bar
plot:
 
1 def put_label_stacked_bar(ax,fontsize):
2 #patches is everything inside of the
3 chart
4 for rect in ax.patches:
5 # Find where everything is located
6 height = rect.get_height()
7 width = rect.get_width()
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
x = rect.get_x()
y = rect.get_y()

# The height of the bar is the data value and


can be used as the label
label_text = f'{height:.0f}'

# ax.text(x, y, text)
label_x = x + width / 2
label_y = y + height / 2
 
# plots only when height is greater than
specified value
if height > 0:
ax.text(label_x, label_y,
label_text, \
ha='center', va='center', \
weight =
"bold",fontsize=fontsize)

#Plots one variable against another variable


def dist_one_vs_another_plot(df, cat1, cat2):
fig = plt.figure(figsize=(25, 15))
ax1 = fig.add_subplot(111)
group_by_stat = df.groupby([cat1,
cat2]).size()
stacked_data = group_by_stat.unstack()
group_by_stat.unstack().plot(kind='bar',
stacked=True, ax=ax1, grid=True)
ax1.set_title('Stacked Bar Plot of ' + cat1
+ ' (number of cases)', fontsize=30)
ax1.set_ylabel('Number of Cases',
fontsize=20)
ax1.set_xlabel(cat1, fontsize=20)
put_label_stacked_bar(ax1,15)
plt.show()
 
# Group values by cat2
sentiment_groups =
stacked_data.groupby(level=0, axis=0)
 
# Create table headers
headers = [cat2 for cat2 in
stacked_data.columns]
 
# Create table rows with data
rows = []
for cat, group_data in sentiment_groups:
row_values = [str(val) for val in
group_data.values.flatten()]
rows.append([cat] + row_values)
 
# Print the table
print(tabulate(rows, headers=headers,
tablefmt='grid'))
 
Let's go through the code step by step:
1. def put_label_stacked_bar(ax,
fontsize): This line defines a
function named
put_label_stacked_bar() that takes
two parameters: ax (the axes of the
plot) and fontsize (the font size for
the labels on the bars).
2. Inside the put_label_stacked_bar()
function:
The function starts a for loop
that iterates through each bar
(rect) in the axes (ax.patches)
of the stacked bar plot.
For each bar, it extracts the
height, width, x-coordinate,
and y-coordinate.
It calculates the label text to be
displayed on the bar, which is
the height of the bar formatted
as an integer.
It computes the label's x and y
coordinates to position the
label at the center of the bar.
The function then checks if the
height of the bar is greater than
zero (meaning the bar has a
non-zero value) to avoid
adding labels to empty bars in
the stacked plot.
If the height is greater than
zero, the function uses ax.text
to add the label to the center of
the bar with the specified font
size and other formatting
options.
3. def dist_one_vs_another_plot(df,
cat1, cat2): This line defines
another function named
dist_one_vs_another_plot, which
takes three parameters: df (the
DataFrame to be used for plotting),
cat1 (the first categorical variable),
and cat2 (the second categorical
variable).
4. Inside the
dist_one_vs_another_plot()
function:
It creates a new figure (fig)
and axis (ax1) for the plot with
a specified size using plt.figure
and fig.add_subplot.
It groups the data in the
DataFrame (df) by the two
categorical variables (cat1 and
cat2) and calculates the counts
of each combination using
groupby and size. The result is
stored in the variable
group_by_stat.
The function then generates a
stacked bar plot with the data,
setting stacked=True to create
a stacked bar chart.
It sets the title, ylabel, and
xlabel for the plot using
set_title, set_ylabel, and
set_xlabel, respectively.
The put_label_stacked_bar()
function is called with ax1 and
a specified font size (15) to
add labels to the bars.
Finally, the plot is displayed
using plt.show().
5. After displaying the plot, the
function proceeds to generate a
table to display the counts of each
category combination using the
tabulate function from the tabulate
library. The table is printed to the
console, providing a tabular
representation of the data used to
create the stacked bar plot. Each
row in the table corresponds to a
different category of cat1, and the
columns represent the categories of
cat2, displaying the count of cases
for each combination.
 
Step 3 Categorize age feature, plot the distribution of age f
variable against age variable in stacked bar plot:
 
1 #Categorizes age feature
2 labels = ['0-40', '40-50', '50-60','60
3 df_dummy['age'] = pd.cut(df_dummy['age
4 [0, 40, 50, 60, 90], labels=labels)
5  
6 #Plots the distribution of age feature
pie chart and bar plot
7
plot_piechart(df_dummy,'age',)
8
 
9
#Plots diagnosis variable against age
10
variable in stacked bar plots
11
dist_one_vs_another_plot(df_dummy,'age
'diagnosis')
 
The results are shown in Figure 9 and Figure 10. Let'
1. labels = ['0-40', '40-50', '50-
categorize the 'age' feature int
as '0-40', '40-50', '50-60', and
40 to 50, 50 to 60, and 60 to 9
2. df_dummy['age'] = pd.cut(df_
This line uses the pd.cut() fu
based on the specified age r
DataFrame df_dummy called
each age value.

Figure 9 The distribut


 
3. plot_piechart(df_dummy, 'age
the distribution of the 'age' fea
4. dist_one_vs_another_plot(df_
dist_one_vs_another_plot fun
feature and the 'diagnosis' fea
of cases for each combination
Both plot_piechart() and dist_one_vs_another_plot()
data.
 
 
Figure 10 The distribution of diagn
 
Output:
Percentage values:
60-90 47.796610
50-60 25.084746
40-50 17.796610
0-40 9.322034
Name: age, dtype: float64
Count values:
60-90 282
50-60 148
40-50 105
0-40 55
Name: age, dtype: int64
 
+-------+--------------------------------+
| | Benign Hepatobiliary Disease |
+=======+================================+=
| 0-40 | 35 |
+-------+--------------------------------+
| 40-50 | 48 |
+-------+--------------------------------+
| 50-60 | 51 |
+-------+--------------------------------+
| 60-90 | 74 |
+-------+--------------------------------+
 
From the output, we can draw the following specific
 
Age Distribution:
The dataset contains individua
The majority of individuals
approximately 47.8% of the d
The least represented age gr
dataset.
Diagnosis Distribution:
The dataset includes three
Disease), Benign Hepatobiliar
The number of cases for each
Age Distribution by Diagnosis:
The stacked bar plot provide
distributed within each diagno
In the Control group, the age
60 and 40-50. The age group
For Benign Hepatobiliary Di
highest number of cases, whil
In the Pancreatic Cancer grou
cases, significantly outnumbe
Relationship between Age and Diagnosis:
The table provides a more de
and diagnosis categories.
It confirms the previously obs
count of cases for each combi
Overall, the analysis suggests that there might be
conditions. The dataset contains a higher number of c
category. However, further statistical analysis and m
potential age-related risk factors for specific diagnose
 
 
 
Distribution of Sex versus Diagnosis
Step Plot the distribution of sex feature in pie chart and bar plot, and plot the distribution of diagnosis variable
1 variable in stacked bar plot:
 
1 #Plots the distribution of sex feature in
2 pie chart and bar plot
3 plot_piechart(df_dummy,'sex')
4  
5 #Plots diagnosis variable against sex
variable in stacked bar plots
6
dist_one_vs_another_plot(df_dummy,'sex',
'diagnosis')
 
The results are shown in Figure 11 and Figure 12. The purpose of these visualizations is to gain insig
distribution of cases based on sex and understand how different diagnoses are related to the sex of the pat
visualizations help in understanding any potential differences in diagnosis patterns between males and fem
dataset. By using pie charts, bar plots, and stacked bar plots, the code facilitates a clear and concise represen
data to aid in exploratory data analysis.
 
Figure 11 The distribution of sex feature in pie chart and bar plot
 

Figure 12 The distribution of diagnosis variable against sex variable in stacked bar plot
Output:
Percentage values:
F 50.677966
M 49.322034
Name: sex, dtype: float64
Count values:
F 299
M 291
Name: sex, dtype: int64
+----+--------------------------------+-----------------------------------+------------
| | Benign Hepatobiliary Disease | Control (No Pancreatic Disease) | Pancreatic
+====+================================+===================================+=============
| F | 101 | 115 |
+----+--------------------------------+-----------------------------------+------------
| M | 107 | 68 |
+----+--------------------------------+-----------------------------------+------------
 
Analysis of the Output:
 
Distribution of Sex:
The pie chart shows that the dataset contains slightly more fema
(50.68%) than male cases (49.32%).
The bar plot indicates that there are 299 female cases and 291 male
the dataset.
Stacked Bar Plot of Diagnosis against Sex:
The stacked bar plot shows the count of cases for each diagnosis
(Control, Benign Hepatobiliary Disease, Pancreatic Cancer) for bo
and female groups.
In the 'Control (No Pancreatic Disease)' category, there are 101 fem
107 males.
In the 'Benign Hepatobiliary Disease' category, there are 115 female
males.
In the 'Pancreatic Cancer' category, there are 83 females and 116 male
Conclusion:
The dataset is almost evenly split between male and female cases
slightly higher number of female cases.
The stacked bar plot reveals the distribution of diagnoses between m
females.
It seems that 'Benign Hepatobiliary Disease' and 'Pancreatic Canc
have a more balanced distribution between males and females, while
(No Pancreatic Disease)' cases show a slightly higher number of ma
compared to females.
These visualizations provide valuable insights into the dataset's sex distribution and the relationship between
different diagnoses, which could be helpful for further analysis and understanding the data's patterns and tren
 
 
 
Distribution of Categorized CA19-9 versus Diagnosis
Step Categorize plasma_CA19_9 feature, plot the distribution of
1 plasma_CA19_9 feature in pie chart and bar plot, and plot the
distribution of diagnosis variable against plasma_CA19_9 variable
in stacked bar plot:
 
1 #Categorizes plasma_CA19_9 feature
2 labels = ['0-100', '100-1000', '1000-
3 10000','10000-35000']
4 df_dummy['plasma_CA19_9'] =
pd.cut(df_dummy['plasma_CA19_9'], [0, 100, 1000,
5
10000, 35000], labels=labels)
6
 
7
#Plots the distribution of plasma_CA19_9 feature
8 in pie chart and bar plot
9 plot_piechart(df_dummy,'plasma_CA19_9')
10  
11 #Plots diagnosis variable against plasma_CA19_9
12 variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'plasma_CA19_9',
'diagnosis')
 
The results are shown in Figure 13 and Figure 14. The visualizations
and analysis provide valuable information about the distribution of
'plasma_CA19_9' values, how the diagnosis varies across these
categories, and the relationship between the 'plasma_CA19_9'
feature and different diagnoses. This information can be essential for
understanding potential patterns and trends related to the
'plasma_CA19_9' feature in the dataset.
 

Figure 13 The distribution of plasma_CA19_9 feature in pie chart


and bar plot
 

Figure 14 The distribution of diagnosis variable against


plasma_CA19_9 variable in stacked bar plot
 
Output:
Percentage values:
100-1000 53.220339
0-100 38.813559
1000-10000 7.118644
10000-35000 0.677966
Name: plasma_CA19_9, dtype: float64
Count values:
100-1000 314
0-100 229
1000-10000 42
10000-35000 4
Name: plasma_CA19_9, dtype: int64
 
 
Analysis of the Output:
Percentage values for 'plasma_CA19_9':
The highest percentage of cases
(53.22%) falls within the '100-1000'
range of 'plasma_CA19_9'. This
suggests that a significant number of
patients have moderate levels of this
biomarker.
The '0-100' range accounts for 38.81%
of the cases, indicating that a
considerable number of patients have
low levels of 'plasma_CA19_9'.
The '1000-10000' range represents
7.12% of the cases, indicating that a
smaller portion of patients have higher
levels of this biomarker.
The '10000-35000' range has the lowest
percentage at 0.68%, suggesting that
very few patients have extremely high
levels of 'plasma_CA19_9'.
Count values for 'plasma_CA19_9':
The '100-1000' range has the highest
count with 314 cases, reinforcing that a
significant number of patients fall within
this moderate range of
'plasma_CA19_9'.
The '0-100' range follows closely with
229 cases, indicating that a substantial
number of patients have low levels of
this biomarker.
The '1000-10000' range has a lower
count with 42 cases, signifying a smaller
proportion of patients have higher
'plasma_CA19_9' levels.
The '10000-35000' range has the
smallest count with only 4 cases,
implying that very few patients have
extremely high levels of this biomarker.
Stacked Bar Plot:
The stacked bar plot provides an insight
into the distribution of diagnoses
(Control, Benign Hepatobiliary Disease,
Pancreatic Cancer) within each range of
'plasma_CA19_9'.
The '100-1000' range shows a relatively
balanced distribution of diagnoses, with
'Pancreatic Cancer' being the most
prevalent, followed closely by 'Control'
and 'Benign Hepatobiliary Disease'.
In the '0-100' range, 'Control' is the most
common diagnosis, followed by 'Benign
Hepatobiliary Disease' and 'Pancreatic
Cancer'.
The '1000-10000' range has a low count,
with 'Pancreatic Cancer' being the
dominant diagnosis.
The '10000-35000' range contains only
'Pancreatic Cancer' cases, indicating that
extremely high 'plasma_CA19_9' levels
are predominantly associated with this
diagnosis.
Conclusion:
The analysis of 'plasma_CA19_9' reveals that a majority of patients
have moderate levels of this biomarker (between 100 and 1000).
Low levels (0-100) and high levels (1000-10000) are also observed
but to a lesser extent. Extremely high levels (10000-35000) are rare.
The stacked bar plot illustrates the distribution of diagnoses within
each range, providing insights into the association between
'plasma_CA19_9' levels and different disease classes. This
information can be valuable in understanding the clinical
significance of 'plasma_CA19_9' as a potential biomarker for
diagnosing specific diseases.
 
 
 
Distribution of Categorized Creatinine versus Diagnosis
Step Categorize creatinine feature, plot the distribution of creatinine feature in pie chart and bar plot, and plot th
1 diagnosis variable against creatinine variable in stacked bar plot:
 
1 #Categorizes creatinine feature
2 labels = ['0-0.5', '0.5-1', '1-2','2-5']
3 df_dummy['creatinine'] =
4 pd.cut(df_dummy['creatinine'], [0, 0.5, 1, 2,
5], labels=labels)
5
 
6
#Plots the distribution of creatinine feature
7
in pie chart and bar plot
8
plot_piechart(df_dummy,'creatinine')
9
 
10
#Plots diagnosis variable against creatinine
11 variable in stacked bar plots
12 dist_one_vs_another_plot(df_dummy,'creatinine',
'diagnosis')
 
The results are shown in Figure 15 and Figure 16.
 
 

Figure 15 The distribution of creatinine feature in pie chart and bar plot
 
 
 
The code performs the following steps:
1. Categorization of 'creatinine' feature: The 'creatinine' feature is d
different categories using pandas' pd.cut() function. The feature is gr
four intervals: '0-0.5', '0.5-1', '1-2', and '2-5', representing different
creatinine levels.
2. Plots the distribution of 'creatinine' feature: The code calls the plot
function to visualize the distribution of the 'creatinine' feature in tw
plots:
Pie chart: It shows the percentage distribution of patients in each
level category. The pie chart provides an overview of th
proportions of patients in each category.
Bar plot: It displays the count of patients in each creatinine leve
The bar plot provides a more detailed representation of the actu
patients in each category.

Figure 16 The distribution of diagnosis variable against creatinine variable in stacked bar plo
 
3. Plots diagnosis variable against creatinine variable in stacked bar
code calls the dist_one_vs_another_plot() function to visualize the r
between the 'diagnosis' variable (classes: Control, Benign He
Disease, and Pancreatic Cancer) and the 'creatinine' variable. This i
using stacked bar plots. Each stacked bar represents a 'creatinine' ca
the bars are further divided into segments based on the 'diagnosis' cl
plot allows us to observe the distribution of diagnoses within each
level category.
Overall, the code enables us to explore the distribution of creatinine levels in the dataset and examine how th
distributed within each creatinine level category. This information can be valuable for identifying potenti
between creatinine levels and different disease classes, helping researchers and healthcare professionals ga
the clinical significance of creatinine as a potential biomarker.
 
Output:
Percentage values:
0-0.5 35.084746
0.5-1 33.728814
1-2 24.576271
2-5 6.610169
Name: creatinine, dtype: float64
Count values:
0-0.5 207
0.5-1 199
1-2 145
2-5 39
Name: creatinine, dtype: int64
+-------+--------------------------------+-----------------------------------+---------
| | Benign Hepatobiliary Disease | Control (No Pancreatic Disease) | Pancrea
+=======+================================+===================================+==========
| 0-0.5 | 72 | 67 |
+-------+--------------------------------+-----------------------------------+---------
| 0.5-1 | 71 | 60 |
+-------+--------------------------------+-----------------------------------+---------
| 1-2 | 51 | 51 |
+-------+--------------------------------+-----------------------------------+---------
| 2-5 | 14 | 5 |
+-------+--------------------------------+-----------------------------------+---------
 
Analysis and Conclusion:
Distribution of 'creatinine' feature:
The 'creatinine' feature has been categorized into four intervals: '0-0
'1-2', and '2-5'.
The majority of patients have 'creatinine' levels falling within the
'0.5-1' intervals, constituting approximately 35.08% and 33.73% of t
respectively.
A smaller proportion of patients have higher 'creatinine' le
approximately 24.58% falling within the '1-2' interval and only 6.6
within the '2-5' interval.
Diagnosis versus creatinine distribution:
For patients with 'creatinine' levels in the '0-0.5' interval:
The majority are diagnosed with 'Pancreatic Cancer,' with 68 cases.
Benign Hepatobiliary Disease and Control (No Pancreatic Disea
similar number of cases in this interval, with 72 and 67 cases, respect
For patients with 'creatinine' levels in the '0.5-1' interval:
The distribution of diagnoses is quite balanced in this interval, wit
classes having similar counts ranging from 60 to 71 cases.
For patients with 'creatinine' levels in the '1-2' interval:
The number of cases in each diagnosis class is lower compared to the
'0.5-1' intervals.
'Benign Hepatobiliary Disease' has the highest count with 51 cases, f
'Control (No Pancreatic Disease)' with 51 cases, and 'Pancreatic Canc
cases.
For patients with 'creatinine' levels in the '2-5' interval:
The number of cases is much smaller in this interval.
'Pancreatic Cancer' has the highest count with 20 cases, followed
Hepatobiliary Disease' with 14 cases, and 'Control (No Pancreati
with 5 cases.
The analysis of the 'creatinine' feature against the 'diagnosis' variable helps to understand how the distributi
classes varies with different 'creatinine' level intervals. The results suggest that patients with higher 'creatinin
'2-5' interval) tend to have a higher incidence of 'Pancreatic Cancer.' This observation highlights the p
significance of 'creatinine' levels as a marker in identifying certain disease classes, particularly in pa
conditions. However, further statistical analysis and investigation are required to establish any definitive
correlations.
 
 
 
Extracting Categorical and Numerical Features
Step Extract categorical and numerical columns:
1  
1 #Checks dataset information
2 print(df_dummy.info())
3  
4 #Extracts categorical and numerical
5 columns
6 cat_cols = [col for col in
df_dummy.columns if \
7
(df_dummy[col].dtype == 'object' or \
8
df_dummy[col].dtype.name == 'category')]
9
num_cols = [col for col in
10
df_dummy.columns if \
11
(df_dummy[col].dtype != 'object' and \
12
df_dummy[col].dtype.name != 'category')]
13
 
print(cat_cols)
print(num_cols)
 
Here's a step-by-step explanation of the code:
1. Checking dataset information:
The code begins by printing
the information about the
dataset using df_dummy.info().
This provides a summary of
the DataFrame, including the
number of non-null values,
data types, and memory usage.
2. Extracting categorical and
numerical columns:
The code defines two empty
lists, cat_cols and num_cols, to
store the names of categorical
and numerical columns,
respectively.
It then iterates over each
column in the DataFrame
df_dummy using a list
comprehension with conditions
to identify whether the column
is categorical or numerical:
If the column has data type
'object' (string) or 'category', it
is considered categorical and
added to the cat_cols list.
If the column has a data type
other than 'object' or 'category',
it is considered numerical and
added to the num_cols list.
3. Printing categorical and numerical
columns:
After the iteration, the code
prints the lists cat_cols and
num_cols using print(cat_cols)
and print(num_cols).
This provides a list of all the
columns that are categorized as
categorical and numerical,
respectively.
By executing this code, you can quickly identify the
categorical and numerical columns in the DataFrame, which
can be helpful for further data analysis and visualization
tasks, as these column types may require different types of
processing and treatment.
 
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590 entries, 0 to 589
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 590 non-null category
1 sex 590 non-null object
2 diagnosis 590 non-null object
3 plasma_CA19_9 589 non-null category
4 creatinine 590 non-null category
5 LYVE1 590 non-null float64
6 REG1B 590 non-null float64
7 TFF1 590 non-null float64
8 REG1A 590 non-null float64
dtypes: category(3), float64(4), object(2)
memory usage: 30.1+ KB
None
['age', 'sex', 'diagnosis', 'plasma_CA19_9',
'creatinine']
 
['LYVE1', 'REG1B', 'TFF1', 'REG1A']
 
Let's explain the output step-by-step:
1. Dataset Information:
The dataset consists of 590
rows and 9 columns.
The age, sex, diagnosis, and
plasma_CA19_9 columns have
non-null count values for all
590 rows.
The creatinine column has a
non-null count value for all
590 rows except one (589 non-
null count).
The LYVE1, REG1B, TFF1,
and REG1A columns have
non-null count values for all
590 rows.
The data types of the columns
are as follows:
age, plasma_CA19_9, and
creatinine columns are of
data type 'category'.
sex and diagnosis columns
are of data type 'object'
(string).
LYVE1, REG1B, TFF1,
and REG1A columns are
of data type 'float64'
(floating-point numbers).
The memory usage of the
DataFrame is
approximately 30.1 KB.
2. Categorical Columns:
The categorical columns in the DataFrame are age, sex,
diagnosis, plasma_CA19_9, and creatinine. These
columns have data types either 'category' or 'object'.
3. Numerical Columns:
The numerical columns in the DataFrame are LYVE1,
REG1B, TFF1, and REG1A. These columns have data
type 'float64'.

By using the info() method, you get a comprehensive


overview of the DataFrame, including its size, non-null
values, data types, and memory usage. Understanding the
data types of each column is crucial for performing
appropriate data manipulation, analysis, and visualization
tasks. In this case, you have identified which columns
contain categorical and numerical data, which can be helpful
for further analysis and visualization.
 
 
Density Distribution of Numerical Features
Step Check numerical features density distribution:
1  
1 #Checks numerical features density
2 distribution
3 # Define a custom color palette
4 colors = sns.color_palette("husl",
len(num_cols))
5
 
6
# Checks numerical features density
7
distribution
8
fig = plt.figure(figsize=(30, 20))
9
plotnumber = 1
10
 
11
for i, column in enumerate(num_cols):
12
if plotnumber <= 6:
13
ax = plt.subplot(2, 2, plotnumber)
14
sns.distplot(df_dummy[column],
15 color=colors[i]) # Use the custom color
16 for the plot
17 plt.xlabel(column, fontsize=40)
18 for p in ax.patches:
19
20 ax.annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2.,
21
p.get_height()), ha='center', va='center',
22 xytext=(0, 10), weight="bold",
23 fontsize=30, textcoords='offset points')
24 plotnumber += 1
 
fig.suptitle('The density of numerical
features', fontsize=50)
plt.tight_layout()
plt.show()
 
The result is shown in Figure 17. The purpose of the code is
to visualize the density distribution of numerical features in a
dataset using density plots. It uses Seaborn and Matplotlib
libraries to create a figure containing multiple subplots,
where each subplot represents the density plot of a numerical
feature.
 
Here's a step-by-step explanation of the code:
1. colors = sns.color_palette("husl",
len(num_cols)): This line creates a
custom color palette using the
"husl" color map from Seaborn. It
generates a list of colors with the
same length as the number of
numerical columns in the dataset
(num_cols).
2. fig = plt.figure(figsize=(30, 20)):
This line creates a new Matplotlib
figure with a size of 30x20 inches.
3. plotnumber = 1: This variable is
used to keep track of the current
subplot number.
4. for i, column in
enumerate(num_cols):: This is a
loop that iterates over the
numerical columns (num_cols)
using enumerate. The i variable
stores the index, and column stores
the column name for each iteration.
5. if plotnumber <= 6:: This condition
checks if the current subplot
number is less than or equal to 6.
Since we want to create a 2x2 grid
of subplots (a total of 4 subplots),
we limit the loop to only 4
iterations.

Figure 17 The numerical features density distribution


 
6. ax = plt.subplot(2, 2, plotnumber):
This line creates a new subplot
with 2 rows, 2 columns, and the
current plotnumber as the position
of the subplot.
7. sns.distplot(df_dummy[column],
color=colors[i]): This line plots the
density distribution of the current
numerical column column using
Seaborn's distplot function. It uses
the custom color from the colors
list corresponding to the current
index i.
8. plt.xlabel(column, fontsize=40):
This sets the x-axis label of the
current subplot to the name of the
numerical column.
9. The for loop inside the subplot is
used to annotate the peaks of the
density plots with their respective
heights. It adds annotations to each
patch (bar) in the density plot.
10. fig.suptitle('The density of
numerical features', fontsize=50):
This sets the super title of the entire
figure.
11. plt.tight_layout(): This adjusts the
layout of the subplots to make
them fit within the figure area
without overlapping.
12. plt.show(): This displays the entire
figure containing the density plots
of the numerical features.
In summary, the code generates a figure with multiple
density plots, each representing the distribution of a
numerical feature. The plots are organized in a 2x2 grid, and
a custom color palette is used to give each subplot a different
appealing color. The annotations on the density plots provide
additional information about the peak heights, helping to
understand the distribution of each numerical feature better.
 
 
 
Density Distribution of Numerical Features
Step Check categorical features distribution:
1  
1 #Checks categorical features distribution
2 fig=plt.figure(figsize = (35, 25))
3 plotnumber = 1
4 for column in cat_cols:
5 if plotnumber <= 6:
6 ax = plt.subplot(2, 3, plotnumber)
7 sns.countplot(df_dummy[column],
8 palette = 'Spectral_r')
9 plt.xlabel(column)
10 for p in ax.patches:
11
ax.annotate(format(p.get_height(), '.0f'),
12
\
13
(p.get_x() + p.get_width() / 2.,
14 p.get_height()), \
15 ha = 'center', va = 'center', xytext =
16 (0, 10), \
17 weight = "bold",fontsize=30, \
18 textcoords = 'offset points')
19  
20 plotnumber += 1
fig.suptitle('The distribution of
categorical features distribution',
fontsize=50)
plt.tight_layout()
plt.show()
 
The result is shown in Figure 18. The purpose of the code is
to visualize the distribution of categorical features in a
dataset using count plots. It utilizes Seaborn and Matplotlib
libraries to create a figure containing multiple subplots,
where each subplot represents a count plot for a categorical
feature.
 
Here's a step-by-step explanation of the code:
1. fig = plt.figure(figsize=(35, 25)):
This line creates a new Matplotlib
figure with a size of 35x25 inches.
2. plotnumber = 1: This variable is
used to keep track of the current
subplot number.
3. for column in cat_cols:: This is a
loop that iterates over the
categorical columns (cat_cols).
4. if plotnumber <= 6:: This condition
checks if the current subplot
number is less than or equal to 6.
Since we want to create a 2x3 grid
of subplots (a total of 6 subplots),
we limit the loop to only 6
iterations.

Figure 18 The categorical features distribution


 
5. ax = plt.subplot(2, 3, plotnumber):
This line creates a new subplot
with 2 rows, 3 columns, and the
current plotnumber as the position
of the subplot.
6. sns.countplot(df_dummy[column],
palette='Spectral_r'): This line plots
the count distribution of the current
categorical column column using
Seaborn's countplot function. It
uses the 'Spectral_r' color palette to
give different colors to the bars.
7. plt.xlabel(column): This sets the x-
axis label of the current subplot to
the name of the categorical
column.
8. The for loop inside the subplot is
used to annotate the bars in the
count plot with their respective
counts.
9. fig.suptitle('The distribution of
categorical features distribution',
fontsize=50): This sets the super
title of the entire figure.
10. plt.tight_layout(): This adjusts the
layout of the subplots to make
them fit within the figure area
without overlapping.
11. plt.show(): This displays the entire
figure containing the count plots of
the categorical features.
In summary, the code generates a figure with multiple count
plots, each representing the distribution of a categorical
feature. The plots are organized in a 2x3 grid, and the
'Spectral_r' color palette is used to give each subplot a
different appealing color. The annotations on the count plots
provide information about the count of occurrences for each
category within the categorical features, helping to
understand the distribution of each categorical feature better.
 
 
 
Case Distribution of Four Categorical Features versus Diagnosis
Step Plot distribution of number of cases of four categorical
1 features versus diagnosis:
 
1 def plot_four_versus_one(df, column_names,
2 feat):
3 num_plots = len(column_names)
4 num_rows = num_plots // 2 + num_plots
5 % 2
6 fig, ax = plt.subplots(num_rows, 2,
figsize=(20, 13), facecolor='#fbe7dd')
7
 
8
for i, column in enumerate(column_names):
9
current_ax = ax[i // 2, i % 2]
10
g = sns.countplot(df[column],
11
hue=df[feat], palette='Spectral_r',
12 ax=current_ax)
13
14 for p in g.patches:
15
16 g.annotate(format(p.get_height(), '.0f'),
17 (p.get_x() + p.get_width() / 2.,
p.get_height()), ha='center', va='center',
18
xytext=(0, 10), weight="bold",
19 fontsize=20, textcoords='offset points')
20
21 current_ax.set_xlabel(column,
22 fontsize=20)
23 current_ax.set_ylabel("Count",
24 fontsize=20)
25 current_ax.tick_params(axis='x',
labelsize=15)
26
current_ax.tick_params(axis='y',
27
labelsize=15)
28
29
plt.tight_layout()
plt.show()

#Plots distribution of number of cases of


four categorical features versus diagnosis
column_names = ["age", "sex",
"plasma_CA19_9", "creatinine"]
plot_four_versus_one(df_dummy,
column_names, "diagnosis")
 
The result is shown in Figure 19. The purpose of the code is to
visualize the distribution of the number of cases for four
categorical features (age, sex, plasma_CA19_9, creatinine)
versus the diagnosis feature using count plots. The function
plot_four_versus_one is defined to create subplots and display
the count plots for each categorical feature with respect to the
diagnosis.
 
Here's a step-by-step explanation of the code:
1. num_plots = len(column_names):
This line calculates the number of
categorical features to be plotted.
2. num_rows = num_plots // 2 +
num_plots % 2: This calculates the
number of rows required for the
subplots. It uses integer division (//)
to determine the number of full rows
and then adds 1 (+ 1) for the extra
row if there are any remaining
subplots.
3. fig, ax = plt.subplots(num_rows, 2,
figsize=(20, 13),
facecolor='#fbe7dd'): This line
creates a new Matplotlib figure with
the specified number of rows and
two columns to hold the subplots.
The facecolor parameter sets the
background color of the figure.
4. The function then enters a loop to
create the count plots for each
categorical feature:
a. current_ax = ax[i // 2, i % 2]:
This selects the current subplot
(i-th subplot) to plot the count
plot for the i-th categorical
feature.
b. sns.countplot(df[column],
hue=df[feat],
palette='Spectral_r',
ax=current_ax): This creates a
count plot using Seaborn's
countplot function. The hue
parameter is set to the diagnosis
feature, which means the count
of each category in the
categorical feature will be
grouped and displayed by
diagnosis (different colors for
each diagnosis group). The
'Spectral_r' color palette is used
to give each category a different
color.
c. The for loop inside the subplot
is used to annotate the bars in
the count plot with their
respective counts.
d. current_ax.set_xlabel(column,
fontsize=20): This sets the x-
axis label of the current subplot
to the name of the categorical
feature.
e. current_ax.set_ylabel("Count",
fontsize=20): This sets the y-
axis label of the current subplot
to "Count".
f. current_ax.tick_params(axis='x',
labelsize=15): This sets the size
of the x-axis tick labels.
g. current_ax.tick_params(axis='y',
labelsize=15): This sets the size
of the y-axis tick labels.
5. plt.tight_layout(): This adjusts the
layout of the subplots to make them
fit within the figure area without
overlapping.
6. plt.show(): This displays the entire
figure containing the count plots for
the categorical features versus the
diagnosis.
In summary, the function plot_four_versus_one() creates a
figure with multiple subplots (arranged in rows and columns)
to visualize the distribution of the number of cases for four
categorical features versus the diagnosis feature. The count
plots are grouped by diagnosis, and each subplot represents the
count of each category in the categorical feature. The color
palette 'Spectral_r' is used to provide appealing colors for the
bars in the count plots.
 

Figure 19 The distribution of number of cases of four


categorical features versus diagnosis
 
 
 
Case Distribution of Four Categorical Features versus Categorized
Creatinine
Step Plot distribution of number of cases of four categorical
1 features versus creatinine:
 
1 #Plots distribution of number of cases of
2 four categorical features versus creatinine
3 column_names = ["age", "sex",
"plasma_CA19_9", "diagnosis"]
4
plot_four_versus_one(df_dummy,
column_names, "creatinine")
 
The result is shown in Figure 20. The resulting figure will
show four subplots, each representing the distribution of the
number of cases for a specific categorical feature (age, sex,
plasma_CA19_9, diagnosis) with respect to the creatinine
feature. Each count plot will display the count of each
category in the categorical feature, grouped by different
creatinine levels. The color palette used will provide
appealing colors for the bars in the count plots.
 
Figure 20 The distribution of number of cases of four
categorical features versus creatinine
 
 
 
Case Distribution of Four Categorical Features versus Categorized
Age
Step Plot distribution of number of cases of four categorical
1 features versus age:
 
1 #Plots distribution of number of cases of
2 four categorical features versus age
3 column_names = ["creatinine", "sex",
"plasma_CA19_9", "diagnosis"]
4
plot_four_versus_one(df_dummy,
column_names, "age")
 
The result is shown in Figure 21. The resulting figure will
show four subplots, each representing the distribution of the
number of cases for a specific categorical feature (creatinine,
sex, plasma_CA19_9, diagnosis) with respect to the age
feature. Each count plot will display the count of each
category in the categorical feature, grouped by different age
groups. The color palette used will provide appealing colors
for the bars in the count plots.
 
Figure 21 The distribution of number of cases of four
categorical features versus age
 
 
 
Case Distribution of Four Categorical Features versus Sex
Step Plot distribution of number of cases of four categorical
1 features versus sex:
 
1 #Plots distribution of number of cases of
2 four categorical features versus sex
3 column_names = ["creatinine", "age",
"plasma_CA19_9", "diagnosis"]
4
plot_four_versus_one(df_dummy,
column_names, "sex")
 
The result is shown in Figure 22. The code will plot the
distribution of the number of cases for four categorical
features (creatinine, age, plasma_CA19_9, diagnosis) versus
the sex feature using count plots. The function
plot_four_versus_one() will be used to create subplots and
display the count plots for each categorical feature with
respect to sex. The resulting figure will show four subplots,
each representing the distribution of the number of cases for
a specific categorical feature (creatinine, age,
plasma_CA19_9, diagnosis) with respect to the sex feature.
Each count plot will display the count of each category in the
categorical feature, grouped by different sexes. The color
palette used will provide appealing colors for the bars in the
count plots.
 
Figure 22 The distribution of number of cases of four
categorical features versus sex
 
 
 
Case Distribution of Four Categorical Features versus Categorized
Plasma CA19-9
Step Plot distribution of number of cases of four categorical
1 features versus plasma_CA19_9:
 
1 #Plots distribution of number of cases of
2 four categorical features versus
plasma_CA19_9
3
column_names = ["creatinine", "age", "sex",
4
"diagnosis"]
plot_four_versus_one(df_dummy,
column_names, "plasma_CA19_9")
 
The result is shown in Figure 23. The code will plot the
distribution of the number of cases for four categorical
features (creatinine, age, sex, diagnosis) versus the
plasma_CA19_9 feature using count plots. The function
plot_four_versus_one will be used to create subplots and
display the count plots for each categorical feature with
respect to plasma_CA19_9.
 
The resulting figure will show four subplots, each
representing the distribution of the number of cases for a
specific categorical feature (creatinine, age, sex, diagnosis)
with respect to the plasma_CA19_9 feature. Each count plot
will display the count of each category in the categorical
feature, grouped by different categories in plasma_CA19_9.
The color palette used will provide appealing colors for the
bars in the count plots.
 
Figure 23 The distribution of number of cases of four
categorical features versus plasma_CA19_9
 
 
 
Percentage Distribution of Categorized Age and Sex versus
Diagnosis
Step Plot the percentage distribution of age and sex versus
1 diagnosis in pie chart:
 
1 #Plots distribution of age and sex versus
2 diagnosis in pie chart
3 def plot_piechart_diagnosis(df, feat1,
feat2):
4
gs0 = df_dummy[df_dummy.diagnosis ==
5
'Control (No Pancreatic Disease)']
6 [feat1].value_counts()
7 gs1 = df_dummy[df_dummy.diagnosis ==
8 'Benign Hepatobiliary Disease']
9 [feat1].value_counts()
10 gs2 = df_dummy[df_dummy.diagnosis ==
'Pancreatic Cancer']
11
[feat1].value_counts()
12
ss0 = df_dummy[df_dummy.diagnosis ==
13 'Control (No Pancreatic Disease)']
14 [feat2].value_counts()
15 ss1 = df_dummy[df_dummy.diagnosis ==
16 'Benign Hepatobiliary Disease']
[feat2].value_counts()
17
ss2 = df_dummy[df_dummy.diagnosis ==
18
'Pancreatic Cancer']
19 [feat2].value_counts()
20  
21 label_gs0=list(gs0.index)
22
23 label_gs1=list(gs1.index)
24 label_gs2=list(gs2.index)
25 label_ss0=list(ss0.index)
26 label_ss1=list(ss1.index)
27 label_ss2=list(ss2.index)
28  
29 fig, ax = plt.subplots(2, 3, figsize=
30 (35, 20), facecolor='#fbe7dd')
31  
32 def print_percentage_table(data, labels,
title):
33
percentages = [f'{(value /
34
sum(data)) * 100:.1f}%' for value in data]
35
table_data = list(zip(labels,
36 percentages))
37 headers = [feat1, 'Percentage']
38 print(f"\n{title}:")
39 print(tabulate(table_data,
40 headers=headers, tablefmt='grid'))
41  
42 def plot_pie(ax, data, labels, title):
43 ax.pie(data, labels=labels,
44 shadow=True, autopct='%1.1f%%', textprops=
{'fontsize': 32})
45
ax.set_xlabel(title, fontsize=30)
46
 
47
plot_pie(ax[0, 0], gs0, label_gs0, f"
48
{feat1} feature")
49
print_percentage_table(gs0, label_gs0,
50 'diagnosis = Control (No Pancreatic
51 Disease)')
52  
53 plot_pie(ax[0, 1], gs1, label_gs1, f"
54 {feat1} feature")
55 print_percentage_table(gs1, label_gs1,
'diagnosis = Benign Hepatobiliary
56
Disease')
57
 
58
plot_pie(ax[0, 2], gs1, label_gs1, f"
59 {feat1} feature")
60 print_percentage_table(gs1, label_gs2,
61 'diagnosis = Pancreatic Cancer')
62
63 plot_pie(ax[1, 0], ss0, label_ss0, f"
64 {feat2} feature")
65
66 print_percentage_table(ss0, label_ss0,
67 'diagnosis = Control (No Pancreatic
Disease)')
68
 
69
plot_pie(ax[1, 1], ss1, label_ss1, f"
70
{feat2} feature")
71
print_percentage_table(ss1, label_ss1,
72 'diagnosis = Benign Hepatobiliary
73 Disease')
74  
plot_pie(ax[1, 2], ss1, label_ss1, f"
{feat2} feature")
print_percentage_table(ss1, label_ss2,
'diagnosis = Pancreatic Cancer')

ax[0][0].set_title('diagnosis =
Control (No Pancreatic Disease)',fontsize=
30)
ax[0][1].set_title('diagnosis = Benign
Hepatobiliary Disease',fontsize= 30)
ax[0][2].set_title('diagnosis =
Pancreatic Cancer',fontsize= 30)
plt.tight_layout()
plt.show()
 
#Plots distribution of age and sex versus
diagnosis in pie chart
plot_piechart_diagnosis(df_dummy, "age",
"sex")
 
The result is shown in Figure 24. The code aims to plot the
distribution of age and sex versus the diagnosis feature using
pie charts. The function plot_piechart_diagnosis is used to
create the visualizations.
 
Figure 24 The percentage distribution of age and sex versus
diagnosis in pie chart
 
Here's what the code does in steps:
1. The function
plot_piechart_diagnosis is defined
with three arguments: the
dataframe df, and two features
feat1 (age) and feat2 (sex) that will
be plotted against the diagnosis.
2. The code calculates the counts of
different values in feat1 (age) and
feat2 (sex) for each category in the
diagnosis feature.
3. The labels for each category in
feat1 and feat2 are extracted.
4. A 2x3 subplot is created with a
total of six pie charts (two for each
diagnosis category).
5. Two helper functions,
print_percentage_table() and
plot_pie(), are defined to display
the percentage breakdowns and to
plot the pie charts, respectively.
6. For each diagnosis category, a pie
chart is created with the counts of
different values in feat1 (age) on
the top row, and the counts of
different values in feat2 (sex) on
the bottom row. Each pie chart
represents the distribution of age
and sex within the specific
diagnosis category.
7. The percentages of different values
in feat1 and feat2 within each
diagnosis category are printed in
tabular format.
8. The titles for each subplot are set to
the respective diagnosis category.
9. The final visualization displays the
six pie charts, each showing the
distribution of age and sex for a
specific diagnosis category
(Control (No Pancreatic Disease),
Benign Hepatobiliary Disease,
Pancreatic Cancer) in different
colors, with percentage labels and
appropriate titles.
Output:
diagnosis = Control (No Pancreatic Disease):
+-------+--------------+
| age | Percentage |
+=======+==============+
| 60-90 | 36.1% |
+-------+--------------+
| 50-60 | 30.1% |
+-------+--------------+
| 40-50 | 24.6% |
+-------+--------------+
| 0-40 | 9.3% |
+-------+--------------+
 
diagnosis = Benign Hepatobiliary Disease:
+-------+--------------+
| age | Percentage |
+=======+==============+
| 60-90 | 35.6% |
+-------+--------------+
| 50-60 | 24.5% |
+-------+--------------+
| 40-50 | 23.1% |
+-------+--------------+
| 0-40 | 16.8% |
+-------+--------------+
 
diagnosis = Pancreatic Cancer:
+-------+--------------+
| age | Percentage |
+=======+==============+
| 60-90 | 35.6% |
+-------+--------------+
| 50-60 | 24.5% |
+-------+--------------+
| 40-50 | 23.1% |
+-------+--------------+
| 0-40 | 16.8% |
+-------+--------------+
 
diagnosis = Control (No Pancreatic Disease):
+-------+--------------+
| age | Percentage |
+=======+==============+
| F | 62.8% |
+-------+--------------+
| M | 37.2% |
+-------+--------------+
 
diagnosis = Benign Hepatobiliary Disease:
+-------+--------------+
| age | Percentage |
+=======+==============+
| M | 51.4% |
+-------+--------------+
| F | 48.6% |
+-------+--------------+
 
diagnosis = Pancreatic Cancer:
+-------+--------------+
| age | Percentage |
+=======+==============+
| M | 51.4% |
+-------+--------------+
| F | 48.6% |
+-------+--------------+
 
From the output, we can observe the distribution of age and
sex in relation to different diagnosis categories: Control (No
Pancreatic Disease), Benign Hepatobiliary Disease, and
Pancreatic Cancer. Here are the key insights:
 
Distribution of Age:
Across all three diagnosis
categories, the highest percentage
of cases falls in the age group of
60-90, which constitutes around
35% to 36% of the cases in each
category.
The 50-60 age group is the second
most prevalent, making up
approximately 24% to 30% of
cases across all categories.
The 40-50 age group follows
closely, accounting for around 23%
to 24% of cases.
The 0-40 age group has the lowest
representation, comprising about
9% to 17% of cases across the
different diagnosis categories.
Distribution of Sex:
In the Control (No Pancreatic
Disease) category, females (F)
make up a slightly higher
percentage (62.8%) of cases
compared to males (M) at 37.2%.
In contrast, in both the Benign
Hepatobiliary Disease and
Pancreatic Cancer categories, the
percentage of male (M) cases is
slightly higher, representing about
51.4% of cases, while females (F)
account for approximately 48.6%
of cases in each category.
Overall, the distribution of age and sex seems to be relatively
similar across the different diagnosis categories, with a
higher prevalence of cases in older age groups (60-90) and a
relatively balanced representation of males and females.
 
Conclusions:
The age groups 60-90, 50-60, and
40-50 are the most affected across
all diagnosis categories, suggesting
that age might be an important
factor in the development of
pancreatic diseases.
The representation of males and
females appears to be balanced in
the Benign Hepatobiliary Disease
and Pancreatic Cancer categories,
while a slight imbalance is
observed in the Control (No
Pancreatic Disease) category,
where females are slightly more
prevalent.
Further statistical analyses and
modeling techniques can be
applied to explore potential
associations between age, sex, and
different types of pancreatic
diseases. These insights can aid in
better understanding the disease
patterns and contribute to improved
diagnosis and treatment strategies.
 
 
Percentage Distribution of Categorized Plasma CA19-9 and Sex
versus Diagnosis
Step Plot the percentage distribution of plasma_CA19_9 and sex
1 versus diagnosis in pie chart:
 
1 #Plots distribution of plasma_CA19_9 and
2 creatinine versus diagnosis in pie chart
3 plot_piechart_diagnosis(df_dummy,
"plasma_CA19_9", "sex")
 
The result is shown in Figure 25.
 
Figure 25 The percentage distribution of plasma_CA19_9
and creatinine versus diagnosis in pie chart
 
Output:
diagnosis = Control (No Pancreatic Disease):
+-----------------+--------------+
| plasma_CA19_9 | Percentage |
+=================+==============+
| 100-1000 | 50.0% |
+-----------------+--------------+
| 0-100 | 50.0% |
+-----------------+--------------+
| 10000-35000 | 0.0% |
+-----------------+--------------+
| 1000-10000 | 0.0% |
+-----------------+--------------+
 
diagnosis = Benign Hepatobiliary Disease:
+-----------------+--------------+
| plasma_CA19_9 | Percentage |
+=================+==============+
| 100-1000 | 51.4% |
+-----------------+--------------+
| 0-100 | 47.6% |
+-----------------+--------------+
| 1000-10000 | 1.0% |
+-----------------+--------------+
| 10000-35000 | 0.0% |
+-----------------+--------------+
 
diagnosis = Pancreatic Cancer:
+-----------------+--------------+
| plasma_CA19_9 | Percentage |
+=================+==============+
| 100-1000 | 51.4% |
+-----------------+--------------+
| 1000-10000 | 47.6% |
+-----------------+--------------+
| 0-100 | 1.0% |
+-----------------+--------------+
| 10000-35000 | 0.0% |
+-----------------+--------------+
 
diagnosis = Control (No Pancreatic Disease):
+-----------------+--------------+
| plasma_CA19_9 | Percentage |
+=================+==============+
| F | 62.8% |
+-----------------+--------------+
| M | 37.2% |
+-----------------+--------------+
 
diagnosis = Benign Hepatobiliary Disease:
+-----------------+--------------+
| plasma_CA19_9 | Percentage |
+=================+==============+
| M | 51.4% |
+-----------------+--------------+
| F | 48.6% |
+-----------------+--------------+
 
diagnosis = Pancreatic Cancer:
+-----------------+--------------+
| plasma_CA19_9 | Percentage |
+=================+==============+
| M | 51.4% |
+-----------------+--------------+
| F | 48.6% |
+-----------------+--------------+
The output presents the distribution of "plasma_CA19_9"
and "creatinine" levels for each diagnosis category in the
form of pie charts. Additionally, it should have included the
distribution of "plasma_CA19_9" and "sex" for each
diagnosis category, but it is not present in the provided
output.
 
From the available pie charts, we can draw the following
conclusions:
Diagnosis: Control (No Pancreatic Disease):
The majority of cases (50%) have
"plasma_CA19_9" levels between
0-100 and 100-1000.
There are no cases with
"plasma_CA19_9" levels in the
range of 1000-10000 or 10000-
35000.
The distribution of "creatinine"
levels is not shown in the pie chart;
thus, it's difficult to draw specific
conclusions about creatinine levels
for this diagnosis category.
Diagnosis: Benign Hepatobiliary Disease:
About half of the cases
(approximately 51.4%) have
"plasma_CA19_9" levels between
100-1000.
Around 47.6% of cases have
"plasma_CA19_9" levels in the
range of 0-100.
A very small proportion
(approximately 1%) of cases have
"plasma_CA19_9" levels in the
range of 1000-10000.
There are no cases with
"plasma_CA19_9" levels in the
range of 10000-35000.
The distribution of "creatinine"
levels is not shown in the pie chart,
so we cannot draw specific
conclusions about creatinine levels
for this diagnosis category.
Diagnosis: Pancreatic Cancer:
Similar to the "Benign
Hepatobiliary Disease" group,
approximately 51.4% of cases have
"plasma_CA19_9" levels between
100-1000.
Approximately 47.6% of cases
have "plasma_CA19_9" levels in
the range of 1000-10000.
Only a very small proportion
(approximately 1%) of cases have
"plasma_CA19_9" levels in the
range of 0-100.
There are no cases with
"plasma_CA19_9" levels in the
range of 10000-35000.
The distribution of "creatinine"
levels is not shown in the pie chart,
so we cannot draw specific
conclusions about creatinine levels
for this diagnosis category.
Overall, the pie charts provide a visual representation of the
distribution of "plasma_CA19_9" levels for each diagnosis
category. However, the absence of the "creatinine"
distribution in the pie charts limits the ability to make
comprehensive comparisons between creatinine levels and
diagnosis groups. Additionally, the "sex" distribution,
mentioned in the function, is missing from the output. It
would be informative to include the "sex" distribution in the
pie charts to further analyze its relationship with the
diagnosis category.
 
 
 
Distribution of Four Categorical Features versus LYV1
Step Plot the distribution of four categorical features versus LYVE1 feature:
1  
1 plt.rcParams['figure.dpi'] = 600
2 fig = plt.figure(figsize=(10,5))
3 gs = fig.add_gridspec(2, 2)
4 gs.update(wspace=0.15, hspace=0.25)
5  
6 background_color = "#fbe7dd"
7 sns.set_palette(['#ff355d','#ffd514'])
8  
9 def
10 feat_versus_other(feat,another,legend,ax0,label):
11 for s in ["right", "top"]:
12 ax0.spines[s].set_visible(False)
13  
14 ax0.set_facecolor(background_color)
15 ax0_sns = sns.histplot(data=df, \
16 x=feat,ax=ax0,zorder=2,kde=False,hue=another,\
17 multiple="stack", shrink=.8, linewidth=0.3,alpha=1)
18  
19 put_label_stacked_bar(ax0_sns,5)
20 ax0_sns.set_xlabel('',fontsize=4, weight='bold')
21 ax0_sns.set_ylabel('',fontsize=4, weight='bold')
22  
23 ax0_sns.grid(which='major', axis='x', zorder=0,
\
24
color='#EEEEEE', linewidth=0.4)
25
ax0_sns.grid(which='major', axis='y', zorder=0,
27
\
28
color='#EEEEEE', linewidth=0.4)
29
 
30
ax0_sns.tick_params(labelsize=3, width=0.5,
31 length=1.5)
32 ax0_sns.legend(legend, ncol=2,
33 facecolor='#D8D8D8', \
34 edgecolor=background_color, fontsize=3, \
35 bbox_to_anchor=(1, 0.989), loc='upper right')
36 ax0.set_facecolor(background_color)
37 ax0_sns.set_xlabel(label)
38 plt.tight_layout()
39
40 label_diag =
41 list(df_dummy["diagnosis"].value_counts().index)
42 label_age =
list(df_dummy["age"].value_counts().index)
43
label_plas =
44
list(df_dummy["plasma_CA19_9"].value_counts().index)
45
label_sex =
46 list(df_dummy["sex"].value_counts().index)
47
48 def hist_feat_versus_four_cat(feat,label):
49 ax0 = fig.add_subplot(gs[0, 0])
50 feat_versus_other(feat,df_dummy["diagnosis"],\
51 label_diag,ax0,"diagnosis versus " + label)
52  
53 ax1 = fig.add_subplot(gs[0, 1])
54 feat_versus_other(feat,df_dummy["age"],\
55 label_age,ax1,"age versus " + label)
56  
57 ax2 = fig.add_subplot(gs[1, 0])
58
feat_versus_other(feat,df_dummy["plasma_CA19_9"],\
label_plas,ax2,"plasma_CA19_9 versus " + label)
 
ax3 = fig.add_subplot(gs[1, 1])
feat_versus_other(feat,df_dummy["creatinine"],\
label_sex,ax3,"sex versus " + label)
 
hist_feat_versus_four_cat(df_dummy["LYVE1"],"LYVE1")
 
The result is shown in Figure 26.
 

Figure 26 The distribution of four categorical features versus LYVE1


feature
 
 
The code creates a 2x2 grid of stacked bar plots, each representing the
distribution of a specific feature (e.g., "LYVE1") concerning four
different categorical variables: "diagnosis," "age," "plasma_CA19_9,"
and "sex."
The purpose of the code is to visualize the distribution of the "LYVE1"
feature concerning the different categories of each categorical variable.
The function feat_versus_other is utilized to create each stacked bar
plot.
 
Steps of the code:
1. Setting Up the Figure: The code sets up
the figure with a size of (10, 5) inches and
a resolution of 600 DPI (dots per inch). It
creates a 2x2 grid for arranging the
subplots, and adjusts the white space
between subplots using the gs.update
function.
2. Defining Colors and Background: A
custom color palette is set using
sns.set_palette, and the background color
of the plots is defined as "#fbe7dd."
3. feat_versus_other() Function: This
function is responsible for creating the
stacked bar plot for a specific feature
(e.g., "LYVE1") versus a categorical
variable. The sns.histplot function is used
to plot the histogram of the feature
"LYVE1" with the hue being the
categorical variable (e.g., "diagnosis").
Stacked bars are formed for each category
of the categorical variable, and the legend
is placed in the upper-right corner. Labels
and tick parameters are adjusted to
enhance plot aesthetics.
4. Labels and Categories: Lists of labels for
different categories are created for the
"diagnosis," "age," "plasma_CA19_9,"
and "sex" columns in the DataFrame.
5. hist_feat_versus_four_cat() Function:
This function calls the feat_versus_other
function four times for each categorical
variable: "diagnosis," "age,"
"plasma_CA19_9," and "sex." It sets the
title for each subplot with the respective
categorical variable it represents. The
function hist_feat_versus_four_cat is
called with the "LYVE1" feature, which
generates the 2x2 grid of stacked bar plots
for the distribution of "LYVE1"
concerning the four categorical variables.
Overall, the code aims to provide an overview of how the "LYVE1"
feature is distributed across different categories of "diagnosis," "age,"
"plasma_CA19_9," and "sex" using stacked bar plots. The color palette
and layout adjustments enhance the visual appeal of the plot. The same
function can be used to analyze the distribution of other features in a
similar manner.
 
 
 
 
 
Density Distribution of Four Categorical Features versus LYV1
Step Plot the density of four categorical features versus LYVE1 feature:
1  
1 def
2 prob_feat_versus_other(feat,another,legend,ax0,label):
3 for s in ["right", "top"]:
4 ax0.spines[s].set_visible(False)
5  
6 ax0.set_facecolor(background_color)
7 ax0_sns = sns.kdeplot(x=feat,ax=ax0,hue=another,\
8 linewidth=0.3,fill=True,cbar='g',zorder=2,alpha=1,\
9 multiple='stack')
10  
11 ax0_sns.set_xlabel('',fontsize=4, weight='bold')
12 ax0_sns.set_ylabel('',fontsize=4, weight='bold')
13  
14 ax0_sns.grid(which='major', axis='x', zorder=0, \
15 color='#EEEEEE', linewidth=0.4)
16 ax0_sns.grid(which='major', axis='y', zorder=0, \
17 color='#EEEEEE', linewidth=0.4)
18  
19 ax0_sns.tick_params(labelsize=3, width=0.5,
length=1.5)
20
ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8',
21
\
22
edgecolor=background_color, fontsize=3, \
23
bbox_to_anchor=(1, 0.989), loc='upper right')
24
ax0.set_facecolor(background_color)
25
ax0_sns.set_xlabel(label)
27
plt.tight_layout()
28
29
def prob_feat_versus_four_cat(feat,label):
30
ax0 = fig.add_subplot(gs[0, 0])
31 prob_feat_versus_other(feat,df_dummy["diagnosis"],\
32 label_diag,ax0,"diagnosis versus " + label)
33  
34 ax1 = fig.add_subplot(gs[0, 1])
35 prob_feat_versus_other(feat,df_dummy["age"],\
36 label_age,ax1,"age versus " + label)
37  
38 ax2 = fig.add_subplot(gs[1, 0])
39
40 prob_feat_versus_other(feat,df_dummy["plasma_CA19_9"],\
41 label_plas,ax2,"plasma_CA19_9 versus " + label)
42  
43 ax3 = fig.add_subplot(gs[1, 1])
44
prob_feat_versus_other(feat,df_dummy["creatinine"],\
label_sex,ax3,"sex versus " + label)

prob_feat_versus_four_cat(df_dummy["LYVE1"],"LYVE1")
 
The result is shown in Figure 27. The code uses two functions,
prob_feat_versus_other() and prob_feat_versus_four_cat(), to create a set
of visualizations called kernel density estimate (KDE) plots. These plots
display the probability density distribution of a specific feature, such as
"LYVE1," with respect to four different categorical variables: "diagnosis,"
"age," "plasma_CA19_9," and "sex."
Each prob_feat_versus_other() function call generates a KDE plot that
shows the probability density of the feature for a specific category of the
categorical variable. The plots are stacked on top of each other, allowing
for easy comparison between categories. The appearance of the plots is
enhanced with custom colors, grid lines, and legend placement for better
aesthetics.
 
The prob_feat_versus_four_cat() function facilitates the generation of four
KDE plots, each corresponding to one of the four categorical variables.
The plots are arranged in a 2x2 grid, providing a comprehensive view of
how the probability density of the "LYVE1" feature varies across different
categories of each categorical variable.
 
Overall, these visualizations offer a clear and visually appealing way to
understand how the "LYVE1" feature is distributed within different
categories of the four selected categorical variables.
 
Figure 27 The density of four categorical features versus LYVE1 feature
 
 
 
Case and Density Distribution of Four Categorical Features versus REG1B
Step Plot the distribution and its density of four categorical features versus
1 REG1B feature:
 
1 hist_feat_versus_four_cat(df_dummy["REG1B"],"REG1B")
2 prob_feat_versus_four_cat(df_dummy["REG1B"],"REG1B")
 
The results are shown in Figure 28 and Figure 29.
 

Figure 28 The distribution of four categorical features versus


REG1B feature
 
Figure 29 The density of four categorical features versus REG1B
feature
 
 
 
Case and Density Distribution of Four Categorical Features versus TFF1
Step Plot the distribution and its density of four categorical features
1 versus TFF1 feature:
 
1 hist_feat_versus_four_cat(df_dummy["TFF1"],"TFF1")
2 prob_feat_versus_four_cat(df_dummy["TFF1"],"TFF1")
 
The results are shown in Figure 30 and Figure 31.
 

Figure 30 The distribution of four categorical features versus


TFF1 feature
 

Figure 31 The density of four categorical features versus TFF1


feature
 
 
 
Case and Density Distribution of Four Categorical Features versus REG1A
Step Plot the distribution and its density of four categorical features versus
1 REG1A feature:
 
1 hist_feat_versus_four_cat(df_dummy["REG1A"],"REG1A")
2 prob_feat_versus_four_cat(df_dummy["REG1A"],"REG1A")
 
The results are shown in Figure 32 and Figure 33.
 
 

Figure 32 The distribution of four categorical features versus


REG1A feature
 

Figure 33 The density of four categorical features versus REG1A


feature
 
 
 
 
 
 
 
 
 
 
PREDICTING
PANCREATIC CANCER
USING MACHINE LEARNING
PREDICTING
PANCREATIC CANCER
USING MACHINE LEARNING
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Features Importance Using Random Forest Classifier
Step Convert sex feature to {0,1}, convert diagnosis feature to
1 {0,1,2}, extract output and input variables, and plot feature
importance using RandomForest Classifier:
 
1 #Converts sex feature to {0,1}
2 def map_sex(n):
3 if n == "F":
4 return 0
5
6 else:
7 return 1
8 df['sex'] = df['sex'].apply(lambda x:
9 map_sex(x))
10  
11 #Converts diagnosis feature to {0,1,2}
12 def map_diagnosis(n):
13 if n == 1:
14 return 0
15 if n == 2:
16 return 1
17 else:
18 return 2
19 df['diagnosis'] =
20 df['diagnosis'].apply(lambda x:
map_diagnosis(x))
21
 
22
#Extracts outuput and input variables
23
y = df['diagnosis'].values # Target for
24
the model
25
X = df.drop(['diagnosis'], axis = 1)
26
 
27
#Feature Importance using RandomForest
28 Classifier
29 names = X.columns
30 rf = RandomForestClassifier()
31 rf.fit(X, y)
32  
33 result_rf = pd.DataFrame()
34 result_rf['Features'] = X.columns
35 result_rf ['Values'] =
36 rf.feature_importances_
37 result_rf.sort_values('Values', inplace =
38 True, ascending = False)
39  
40 plt.figure(figsize=(25,25))
41 sns.set_color_codes("pastel")
42 sns.barplot(x = 'Values',y = 'Features',
data=result_rf, color="Blue")
43
plt.xlabel('Feature Importance',
44
fontsize=30)
45
plt.ylabel('Feature Labels', fontsize=30)
46
plt.tick_params(axis='x', labelsize=20)
47
plt.tick_params(axis='y', labelsize=20)
plt.show()
 
# Print the feature importance table
print("Feature Importance:")
print(result_rf)
 
The result is shown in Figure 34. The code performs the
following steps:
1. Map 'sex' Feature: The code
converts the 'sex' feature from a
categorical variable ('F' for female
and 'M' for male) to a binary
numerical variable (0 for female
and 1 for male) using the map_sex
function and the apply method.
2. Map 'diagnosis' Feature: The code
converts the 'diagnosis' feature
from a categorical variable
('Control (No Pancreatic Disease)',
'Benign Hepatobiliary Disease', and
'Pancreatic Cancer') to a numerical
variable (0, 1, and 2, respectively)
using the map_diagnosis function
and the apply method.
3. Extract Input and Output Variables:
The code separates the target
variable 'diagnosis' from the input
features, and stores them in
variables y and X, respectively.
4. Feature Importance using
RandomForest Classifier: The code
uses the RandomForestClassifier
from scikit-learn to determine the
importance of each feature in
predicting the target variable
'diagnosis'. The
RandomForestClassifier is trained
on the input features (X) and target
variable (y). The feature
importances are then extracted
from the model and stored in a
DataFrame named result_rf.
5. Plot Feature Importance: The code
creates a horizontal bar plot to
visualize the feature importance
scores. The features are sorted
based on their importance values in
descending order. The plot helps
understand which features
contribute the most to the
prediction of the target variable.
6. Print Feature Importance Table:
The code prints the feature
importance table (result_rf)
showing the features and their
corresponding importance scores.
The table is sorted in descending
order of feature importance.
By analyzing the feature importance results and the plot, one
can identify which features have the most significant impact
on predicting the diagnosis of patients (whether they have
Control (No Pancreatic Disease), Benign Hepatobiliary
Disease, or Pancreatic Cancer). This information can help in
feature selection for modeling and understanding the most
relevant factors in determining the disease outcomes.
 

Figure 34 The feature importance using RandomForest


classifier
 
Output:
Feature Importance:
Features Values
4 LYVE1 0.194889
6 TFF1 0.167185
5 REG1B 0.148777
2 plasma_CA19_9 0.144791
0 age 0.121171
3 creatinine 0.115149
7 REG1A 0.088212
1 sex 0.019825
 
The output provides the feature importance values obtained
from the RandomForestClassifier for predicting the diagnosis
of patients into three categories: Control (No Pancreatic
Disease), Benign Hepatobiliary Disease, and Pancreatic
Cancer.
 
Analyzing the feature importance values:
LYVE1 (Lymphatic vessel
endothelial hyaluronan receptor 1):
It has the highest feature
importance value of approximately
19.5%. This indicates that the
LYVE1 gene expression level is a
crucial factor in predicting the
diagnosis. Higher values of LYVE1
may be associated with certain
conditions or diseases.
TFF1 (Trefoil factor 1): It is the
second most important feature with
an importance value of around
16.7%. TFF1 is a gene that encodes
a protein involved in mucosal
protection and repair. Its expression
level seems to be influential in
differentiating between different
diagnosis categories.
REG1B (Regenerating islet-
derived 1 beta): This feature has an
importance value of approximately
14.9%. REG1B is a gene encoding
a protein involved in tissue
regeneration, and its expression
may be relevant for distinguishing
between different diagnosis
outcomes.
plasma_CA19_9: With an
importance value of about 14.5%,
this is a plasma marker for cancer
antigen 19-9 (CA19-9). Its
presence and concentration may be
indicative of certain medical
conditions, particularly cancer,
making it an important predictive
factor.
age: Age contributes around 12.1%
to the feature importance,
suggesting that the age of the
patients is relevant in diagnosing
different conditions.
creatinine: It has an importance
value of approximately 11.5%.
Creatinine is a waste product
generated by muscle metabolism
and is commonly used as a
measure of kidney function. Its
level may be informative in
diagnosing certain diseases or
conditions.
REG1A (Regenerating islet-
derived 1 alpha): With an
importance value of around 8.8%,
this gene plays a role in tissue
regeneration and may have
implications in differentiating
between diagnosis categories.
sex: Sex has the lowest importance
value of approximately 2%. While
it is less influential in this specific
prediction task, it may still have
some discriminative power in
certain scenarios.
Concluding, the feature importance analysis suggests that
gene expressions like LYVE1, TFF1, REG1B,
plasma_CA19_9, and other variables like age and creatinine
play critical roles in predicting the diagnosis of patients into
different categories. These insights can be valuable for
building accurate predictive models and understanding the
factors that contribute significantly to the diagnostic
outcomes. Additionally, this information can aid in further
research and clinical decision-making for patients with
hepatobiliary and pancreatic diseases.
 
 
 
Features Importance Using Extra Trees Classifier
Step Plot feature importance using ExtraTreesClassifier:
1  
1 #Feature Importance using
2 ExtraTreesClassifier
3 model = ExtraTreesClassifier()
4 model.fit(X, y)
5  
6 result_et = pd.DataFrame()
7 result_et['Features'] = X.columns
8 result_et ['Values'] =
model.feature_importances_
9
result_et.sort_values('Values',
10
inplace=True, ascending =False)
11
 
12
plt.figure(figsize=(25,25))
13
sns.set_color_codes("pastel")
14
sns.barplot(x = 'Values',y = 'Features',
15 data=result_et, color="red")
16 plt.xlabel('Feature Importance',
17 fontsize=30)
18 plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
 
The result is shown in Figure 35. The output provides the
feature importance values obtained from the
ExtraTreesClassifier for predicting the diagnosis of patients
into three categories: Control (No Pancreatic Disease),
Benign Hepatobiliary Disease, and Pancreatic Cancer.
 
The ExtraTreesClassifier is another ensemble learning
method, similar to the RandomForestClassifier, which is used
for feature selection and classification tasks. It generates
multiple decision trees and makes predictions based on their
majority votes, and it is computationally efficient.
 
Output:
Feature Importance:
Features Values
4 LYVE1 0.200785
6 TFF1 0.163505
0 age 0.144982
5 REG1B 0.131220
2 plasma_CA19_9 0.125133
3 creatinine 0.113199
7 REG1A 0.098677
1 sex 0.022500
 

Figure 35 The feature importance using ExtraTreesClassifier


 
The output provides the feature importance values obtained
from the ExtraTreesClassifier for predicting the diagnosis of
patients into three categories: Control (No Pancreatic
Disease), Benign Hepatobiliary Disease, and Pancreatic
Cancer.
 
Analyzing the feature importance values obtained from the
ExtraTreesClassifier:
LYVE1 (Lymphatic vessel
endothelial hyaluronan receptor 1):
It retains its top position with the
highest importance value,
approximately 20.1%. This
reinforces its significance in
predicting the diagnosis, consistent
with the previous findings.
TFF1 (Trefoil factor 1): Similarly,
TFF1 remains the second most
important feature with an
importance value of around 16.4%.
age: Age becomes more influential
in this model, ranking third with an
importance value of approximately
14.5%.
REG1B (Regenerating islet-
derived 1 beta): It ranks fourth in
importance with around 13.1%.
plasma_CA19_9: In this model,
plasma_CA19_9 moves down to
the fifth position with an
importance value of approximately
12.5%.
creatinine: It retains its relevance
and ranks sixth with an importance
value of around 11.3%.
REG1A (Regenerating islet-
derived 1 alpha): Similar to the
previous output, REG1A has the
seventh position with an
importance value of approximately
9.9%.
sex: As before, sex remains the
feature with the lowest importance
value, approximately 2.2%.
Comparing the results obtained from the previous
RandomForestClassifier and the current ExtraTreesClassifier,
there are slight differences in the ordering of feature
importance, but the general conclusions remain consistent.
Both models agree on the significance of LYVE1, TFF1, age,
REG1B, plasma_CA19_9, creatinine, REG1A, and sex in
predicting the diagnosis.
 
In conclusion, the ExtraTreesClassifier further confirms the
importance of gene expressions like LYVE1, TFF1, and
variables like age and creatinine in predicting the diagnosis
of patients into different categories. The consistency of these
findings across different ensemble learning models reinforces
the credibility of the feature importance analysis and
underscores the potential clinical relevance of these variables
for diagnosing hepatobiliary and pancreatic diseases.
 
 
 
Features Importance Using Recursive Feature Elimination
Step Plot feature importance using RFE:
1  
1 #Feature Importance using RFE
2 from sklearn.feature_selection import RFE
3 model = LogisticRegression()
4 # create the RFE model
5 rfe = RFE(model)
6 rfe = rfe.fit(X, y)
7  
8 result_lg = pd.DataFrame()
9 result_lg['Features'] = X.columns
10 result_lg ['Ranking'] = rfe.ranking_
11 result_lg.sort_values('Ranking',
12 inplace=True , ascending = False)
13  
14 plt.figure(figsize=(25,25))
15 sns.set_color_codes("pastel")
16 sns.barplot(x = 'Ranking',y = 'Features',
data=result_lg, color="orange")
17
plt.ylabel('Feature Labels', fontsize=30)
18
plt.tick_params(axis='x', labelsize=20)
19
plt.tick_params(axis='y', labelsize=20)
20
plt.show()
21
 
22
print("Feature Ranking:")
23
print(result_lg)
 
The result is shown in Figure 36. The output shows the
feature rankings obtained from Recursive Feature
Elimination (RFE) using Logistic Regression for predicting
the diagnosis of patients into three categories: Control (No
Pancreatic Disease), Benign Hepatobiliary Disease, and
Pancreatic Cancer.
 

Figure 36 The feature importance using RFE


 
Output:
Feature Ranking:
Features Ranking
7 REG1A 5
2 plasma_CA19_9 4
6 TFF1 3
5 REG1B 2
0 age 1
1 sex 1
3 creatinine 1
4 LYVE1 1
 
The output shows the feature rankings obtained from Recursive
Feature Elimination (RFE) using Logistic Regression for predicting
the diagnosis of patients into three categories: Control (No
Pancreatic Disease), Benign Hepatobiliary Disease, and Pancreatic
Cancer.
 
Analyzing the feature rankings obtained from RFE:
age: Age receives the highest ranking
with a value of 1. This indicates that age
is considered the most important feature
in predicting the diagnosis.
sex: Similarly, sex is ranked second with
a value of 1, which means it is highly
relevant for the diagnosis prediction.
creatinine: It is ranked third with a value
of 1, reinforcing its significance in
diagnosing hepatobiliary and pancreatic
diseases.
LYVE1 (Lymphatic vessel endothelial
hyaluronan receptor 1): LYVE1 is
ranked fourth with a value of 1,
indicating its importance in the
diagnosis prediction.
REG1B (Regenerating islet-derived 1
beta): REG1B is ranked fifth with a
value of 2, further supporting its
relevance in predicting the diagnosis.
TFF1 (Trefoil factor 1): TFF1 receives
the sixth position with a value of 3,
highlighting its relevance for the
diagnosis.
plasma_CA19_9: Plasma_CA19_9 is
ranked seventh with a value of 4,
indicating its importance in diagnosing
patients into different categories.
REG1A (Regenerating islet-derived 1
alpha): REG1A receives the eighth
position with a value of 5, which means
it is considered less important compared
to the other features.
The RFE method helps in selecting the most relevant features for the
model, and in this case, it emphasizes the importance of age, sex,
creatinine, LYVE1, REG1B, TFF1, and plasma_CA19_9 in
diagnosing hepatobiliary and pancreatic diseases. The RFE results
slightly differ from the previous findings obtained using
RandomForestClassifier and ExtraTreesClassifier, but they still
provide valuable insights into the significance of these features for
diagnosis. It is important to note that feature rankings may vary
depending on the specific algorithm and dataset used.
 
 
 
Resampling and Splitting Data
Step Split dataset into train and test data with three feature
1 scaling: raw, normalization, and standardization:
 
1 sm = SMOTE(random_state=42)
2 X,y = sm.fit_resample(X, y.ravel())
3  
4 #Splits the data into training and testing
5 X_train, X_test, y_train, y_test =
6 train_test_split(X, y, test_size = 0.2,
random_state = 2021, stratify=y)
7
X_train_raw = X_train.copy()
8
X_test_raw = X_test.copy()
9
y_train_raw = y_train.copy()
10
y_test_raw = y_test.copy()
11
 
12
X_train_norm = X_train.copy()
13
X_test_norm = X_test.copy()
14 y_train_norm = y_train.copy()
15 y_test_norm = y_test.copy()
16 norm = MinMaxScaler()
17 X_train_norm =
18 norm.fit_transform(X_train_norm)
19 X_test_norm = norm.transform(X_test_norm)
20  
21 X_train_stand = X_train.copy()
22 X_test_stand = X_test.copy()
23 y_train_stand = y_train.copy()
24 y_test_stand = y_test.copy()
25 scaler = StandardScaler()
26 X_train_stand =
scaler.fit_transform(X_train_stand)
X_test_stand =
scaler.transform(X_test_stand)
 
The purpose of the code is to prepare the data for training
and testing a machine learning model. It performs the
following tasks:
1. Oversampling with SMOTE: The
code applies the Synthetic Minority
Over-sampling Technique
(SMOTE) to handle class
imbalance. In the original dataset,
the class distribution may be
skewed, with the majority class
(Control and Benign Hepatobiliary
Disease) having more samples than
the minority class (Pancreatic
Cancer). SMOTE generates
synthetic samples for the minority
class to balance the class
distribution, effectively increasing
the representation of the minority
class.
2. Data Splitting: The data is split into
training and testing sets. The
training set (X_train and y_train) is
used to train the machine learning
model, while the testing set (X_test
and y_test) is used to evaluate the
model's performance.
3. Raw Data and Preprocessed Data:
The code creates multiple sets of
data with different preprocessing
techniques:
a. Raw Data: The original feature
and target data (X_train_raw,
X_test_raw, y_train_raw,
y_test_raw) are kept
unchanged.
b. Normalized Data: The feature
data is normalized using the
Min-Max scaling method
(X_train_norm, X_test_norm).
This scales the features to a
range of [0, 1], which can be
helpful for algorithms that are
sensitive to the scale of
features.
c. Standardized Data: The feature
data is standardized using the
Standard Scaler
(X_train_stand, X_test_stand).
This centers the features
around zero with a standard
deviation of 1, which can be
beneficial for algorithms that
assume normally distributed
features.
By creating different versions of the data with various
preprocessing techniques, the code allows for comparing the
model's performance when trained on different datasets. This
comparison can help in selecting the most appropriate data
preprocessing method that yields the best performance for
the specific machine learning task at hand.
 
 
 
Learning Curve
Step Define plot_learning_curve() method to plot learning curve
1 of a certain classifier:
 
1 def plot_learning_curve(estimator, title,
2 X, y, axes=None, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1,
3
1.0, 5)):
4
if axes is None:
5
_, axes = plt.subplots(1, 3,
6 figsize=(35, 10))
7  
8 axes[0].set_title(title)
9 if ylim is not None:
10 axes[0].set_ylim(*ylim)
11 axes[0].set_xlabel("Training
12 examples")
13 axes[0].set_ylabel("Score")
14  
15 train_sizes, train_scores,
16 test_scores, fit_times, _ = \
17 learning_curve(estimator, X, y,
cv=cv, n_jobs=n_jobs,
18
19
train_sizes=train_sizes,
20
return_times=True)
21
train_scores_mean =
22 np.mean(train_scores, axis=1)
23 train_scores_std =
24 np.std(train_scores, axis=1)
25 test_scores_mean =
np.mean(test_scores, axis=1)
26
test_scores_std = np.std(test_scores,
27
axis=1)
28
fit_times_mean = np.mean(fit_times,
29 axis=1)
30 fit_times_std = np.std(fit_times,
31 axis=1)
32  
33 # Plot learning curve
34 axes[0].grid()
35 axes[0].fill_between(train_sizes, \
36 train_scores_mean - train_scores_std,\
37 train_scores_mean +
38 train_scores_std, \
39 alpha=0.1, color="r")
40 axes[0].fill_between(train_sizes, \
41 test_scores_mean - test_scores_std,\
42 test_scores_mean +
test_scores_std, \
43
alpha=0.1, color="g")
44
axes[0].plot(train_sizes,
45
train_scores_mean, 'o-', \
46
color="r", label="Training score")
47
axes[0].plot(train_sizes,
48 test_scores_mean, 'o-', \
49 color="g", label="Cross-validation
50 score")
51 axes[0].legend(loc="best")
52  
53 # Plot n_samples vs fit_times
54 axes[1].grid()
55 axes[1].plot(train_sizes,
56 fit_times_mean, 'o-')
57 axes[1].fill_between(train_sizes, \
58 fit_times_mean - fit_times_std,\
59 fit_times_mean + fit_times_std,
alpha=0.1)
60
axes[1].set_xlabel("Training
examples")
axes[1].set_ylabel("fit_times")
axes[1].set_title("Scalability of the
model")
 
# Plot fit_time vs score
axes[2].grid()
axes[2].plot(fit_times_mean,
test_scores_mean, 'o-')
axes[2].fill_between(fit_times_mean, \
test_scores_mean - test_scores_std,\
test_scores_mean +
test_scores_std, alpha=0.1)
axes[2].set_xlabel("fit_times")
axes[2].set_ylabel("Score")
axes[2].set_title("Performance of the
model")
 
return plt
 
The code defines a function called plot_learning_curve() that
generates a plot to visualize the learning curve and
performance of a machine learning model. Here's a step-by-
step explanation of the code:
1. The function takes several
parameters:
estimator: The machine
learning model or estimator to
evaluate.
title: The title of the plot.
X: The input features of the
dataset.
y: The target variable of the
dataset.
axes: Optional parameter for
specifying the axes of the plot.
If not provided, a new figure
with three subplots will be
created.
ylim: Optional parameter for
setting the y-axis limits of the
first subplot.
cv: The cross-validation
strategy or number of cross-
validation folds to use.
n_jobs: The number of parallel
jobs to run. If set to -1, it will
use all available processors.
train_sizes: An array of
training set sizes to be used in
generating the learning curve.
2. The function initializes the first
subplot (axes[0]) with the provided
title and y-axis limit (if specified).
It sets the x-axis label as "Training
examples" and the y-axis label as
"Score".
3. The function uses the
learning_curve function to
calculate the training and test
scores, fit times, and other metrics
for the learning curve. It passes the
provided estimator, input features
X, target variable y, cross-
validation strategy cv, number of
parallel jobs n_jobs, and training
set sizes train_sizes.
4. The function calculates the mean
and standard deviation of the
training and test scores, fit times,
and other metrics.
5. The function plots the learning
curve on the first subplot (axes[0]).
It fills the area between the mean
scores plus/minus the standard
deviation to show the variance. It
plots the mean training scores,
mean test scores, and legends for
training and cross-validation
scores.
6. The function plots the scalability of
the model on the second subplot
(axes[1]). It shows the training set
sizes versus the fit times. It fills the
area between the mean fit times
plus/minus the standard deviation.
7. The function plots the performance
of the model on the third subplot
(axes[2]). It shows the fit times
versus the mean test scores. It fills
the area between the mean test
scores plus/minus the standard
deviation.
8. The function returns the plt object,
which allows further customization
or display of the plot outside the
function.
By calling this function with appropriate parameters, you can
generate a learning curve plot that visualizes the performance
and scalability of a machine learning model as the training
set size increases. It helps in understanding the bias-variance
trade-off and identifying the optimal training set size for the
model.
 
To observe and analyze each resulting plot generated by the
plot_learning_curve function, you can follow these steps:
 
Learning Curve:
Look at the trend of the training
score and cross-validation score as
the training set size increases.
If the training score and cross-
validation score are both low, it
suggests that the model is
underfitting the data.
If the training score is high and the
cross-validation score is
significantly lower, it indicates
overfitting.
If both scores are high and close to
each other, it suggests a good
balance between bias and variance.
Check the variance of the scores by
observing the shaded areas around
the lines. A wider shaded area
indicates higher variance.
Compare the training and cross-
validation scores to assess the
model's generalization ability.
Scalability of the Model:
Examine how the fit times change
with different training set sizes.
If the fit times increase linearly
with the training set size, it
indicates good scalability.
If the fit times increase
significantly or exponentially with
the training set size, it suggests
scalability issues.
Check the variance of the fit times
by observing the shaded area
around the line. A wider shaded
area indicates higher variance.
Understanding the scalability helps
assess the model's efficiency for
larger datasets.
Performance of the Model:
Analyze the relationship between
the fit times and the mean test
score.
Look for any patterns or trends in
the plot.
If the model's performance (mean
test score) improves with longer fit
times, it indicates that more
training leads to better results.
Check the variance of the test
scores by observing the shaded
area around the line. A wider
shaded area indicates higher
variance.
This plot helps understand the
trade-off between model
performance and the time required
to train the model.
By observing and analyzing each plot, you can gain insights
into the learning behavior, scalability, and performance of the
model. These insights can guide you in understanding the
model's strengths, weaknesses, and potential areas of
improvement.
 
 
Real Values versus Predicted Values and Confusion Matrix
Step Define plot_real_pred_val() to plot true values versus predicted
1 values and plot_cm() method to plot confusion matrix:
 
1 def plot_real_pred_val(Y_test, ypred, name):
2 plt.figure(figsize=(25,15))
3 acc=accuracy_score(Y_test,ypred)
4
5 plt.scatter(range(len(ypred)),ypred,color="blue",\
6 lw=5,label="Predicted")
7 plt.scatter(range(len(Y_test)), \
8 Y_test,color="red",label="Actual")
9 plt.title("Predicted Values vs True Values of
" + name, \
10
fontsize=10)
11
plt.xlabel("Accuracy: " +
12
str(round((acc*100),3)) + "%")
13
plt.legend()
14
plt.grid(True, alpha=0.75, lw=1, ls='-.')
15
plt.show()
16
 
17
def plot_cm(Y_test, ypred, name):
18
fig, ax = plt.subplots(figsize=(25, 15))
19
cm = confusion_matrix(Y_test, ypred)
20
21 sns.heatmap(cm, annot=True, linewidth=0.7,
22 linecolor='red', fmt='g', cmap="YlOrBr",
annot_kws={"size": 30})
23
plt.title(name + ' Confusion Matrix',
24
fontsize=30)
25
ax.xaxis.set_ticklabels(['Control (No
26 Pancreatic Disease)', 'Benign Hepatobiliary
27 Disease', 'Pancreatic Cancer'], fontsize=20);
28 ax.yaxis.set_ticklabels(['Control (No
29 Pancreatic Disease)', 'Benign Hepatobiliary
Disease', 'Pancreatic Cancer'], fontsize=20);
30
plt.xlabel('Y predict', fontsize=30)
plt.ylabel('Y test', fontsize=30)
plt.show()
return cm
 
Here are the steps for each function and their purposes:
plot_real_pred_val(Y_test, ypred, name):
1. Create a scatter plot to compare the
predicted values (ypred) with the true
values (Y_test) of a target variable.
2. The plot visualizes the predicted and
actual values, allowing you to observe
any discrepancies or patterns.
3. Calculate and display the accuracy of
the predictions (acc) as a percentage.
4. Set the plot title to indicate the target
variable being analyzed.
5. Set the x-axis label to show the accuracy
of the predictions.
6. Add a legend to differentiate between
the predicted and actual values.
7. Display the gridlines for better
visualization.
8. Show the plot.
plot_cm(Y_test, ypred, name):
1. Create a heatmap plot to visualize the
confusion matrix between the predicted
values (ypred) and the true values
(Y_test) of a target variable.
2. The confusion matrix displays the
counts or percentages of true positive,
true negative, false positive, and false
negative predictions.
3. The heatmap uses colors to represent
different values in the confusion matrix,
making it easier to interpret.
4. Set the plot title to indicate the target
variable's name.
5. Set the x-axis label as "Y predict" to
represent the predicted values.
6. Set the y-axis label as "Y test" to
represent the true values.
7. Show the plot.
8. Return the confusion matrix (cm).
These functions provide visualizations and metrics to assess the
performance of a model's predictions. The plot_real_pred_val()
function helps compare predicted and actual values, while the
plot_cm() function displays the confusion matrix to evaluate
prediction results.
To observe and analyze a confusion matrix, you can follow these
steps:
1. Understand the structure of the
confusion matrix: A confusion matrix is
a square matrix with dimensions
corresponding to the number of classes
or categories in your classification
problem. It consists of four main
components:
True Positive (TP): The number of
correctly predicted positive
instances.
True Negative (TN): The number of
correctly predicted negative
instances.
False Positive (FP): The number of
incorrectly predicted positive
instances (Type I error).
False Negative (FN): The number of
incorrectly predicted negative
instances (Type II error).
2. Analyze the values in the confusion
matrix:
Accuracy: It is calculated as (TP +
TN) / (TP + TN + FP + FN). It
represents the overall accuracy of
the model's predictions.
Precision: It is calculated as TP /
(TP + FP). It measures the model's
ability to correctly identify positive
instances among the predicted
positive instances.
Recall (Sensitivity or True Positive
Rate): It is calculated as TP / (TP +
FN). It measures the model's ability
to correctly identify positive
instances among the actual positive
instances.
Specificity (True Negative Rate): It
is calculated as TN / (TN + FP). It
measures the model's ability to
correctly identify negative instances
among the actual negative instances.
F1 Score: It is the harmonic mean of
precision and recall and is
calculated as 2 * (Precision *
Recall) / (Precision + Recall). It
provides a balance between
precision and recall.
3. Interpret the confusion matrix:
TP and TN represent correct
predictions, indicating that the
model correctly identified positive
and negative instances, respectively.
FP and FN represent incorrect
predictions, indicating that the
model misclassified instances as
positive or negative.
Pay attention to the imbalance
between FP and FN errors, as it can
vary based on the problem's nature
and the desired outcome.
Consider the trade-off between
precision and recall based on the
specific requirements of your
problem. If you prioritize
minimizing false positives, focus on
improving precision. If you
prioritize minimizing false
negatives, focus on improving
recall.
Compare the values in the confusion
matrix with your problem's context
and requirements to assess the
model's performance. For example,
in a medical diagnosis scenario, a
false negative (FN) error might be
more critical than a false positive
(FP) error.
4. Use additional metrics or visualizations
to gain further insights:
Calculate metrics such as precision,
recall, specificity, and F1 score to
obtain more detailed evaluation
measures.
Visualize the confusion matrix using
heatmaps or other graphical
representations to get a clearer
understanding of the distribution of
predictions and errors.
Overall, the confusion matrix provides a comprehensive overview of
the model's performance, allowing you to assess its accuracy,
precision, recall, and other relevant metrics for your specific
classification problem.
 
 
ROC and Decision Boundaries
Step Define plot_roc() method to plot ROC and
1 plot_decision_boundary() method to plot decision boundary
of two chosen feature with certain classifier:
 
1 #Plots ROC
2 def plot_roc(model,X_test, y_test, title):
3 Y_pred_prob =
4 model.predict_proba(X_test)
5 Y_pred_prob = Y_pred_prob[:, 1]
6  
7 fpr, tpr, thresholds =
roc_curve(y_test, Y_pred_prob)
8
plt.figure(figsize=(25,15))
9
plt.plot([0,1],[0,1], color='navy',
10
lw=5, linestyle='--')
11
plt.plot(fpr,tpr, label='ANN')
12
plt.xlabel('False Positive Rate')
13
plt.ylabel('True Positive Rate')
14
plt.title('ROC Curve of ' + title)
15
plt.grid(True)
16
plt.show()
17
 
18
def plot_decision_boundary(model,xtest,
19 ytest, name):
20 plt.figure(figsize=(25, 15))
21 #Trains model with two features
22 model.fit(xtest, ytest)
23  
24 plot_decision_regions(xtest.values,
25 ytest.ravel(), clf=model, legend=2)
26 plt.title("Decision boundary for " +
name + " (Test)", fontsize=30)
27
plt.xlabel("creatinine", fontsize=25)
28
plt.ylabel("LYVE1", fontsize=25)
plt.legend(fontsize=25)
plt.show()
 
 
Here are the steps to understand and analyze each function
and its purpose:
plot_roc(model, X_test, y_test, title): This function is used to
plot the Receiver Operating Characteristic (ROC) curve. The
steps involved are:
1. Predict the probabilities of the
positive class using the trained
model on the test data.
2. Extract the predicted probabilities
for the positive class.
3. Calculate the False Positive Rate
(FPR) and True Positive Rate
(TPR) using the roc_curve
function.
4. Plot the ROC curve, including the
diagonal line (representing a
random classifier) and the curve for
the model's predictions.
5. Set the labels and title of the plot.
plot_decision_boundary(model, xtest, ytest, name): This
function is used to plot the decision boundary of a
classification model. The steps involved are:
1. Fit the model using the features
(xtest) and labels (ytest).
2. Create a scatter plot of the data
points with decision regions plotted
based on the model's predictions.
3. Set the title and labels for the plot.
These functions are useful for visualizing and analyzing the
performance and decision boundaries of classification
models. The ROC curve helps assess the model's
classification performance by examining the trade-off
between the true positive rate and the false positive rate. The
decision boundary plot shows how the model separates
different classes based on the selected features.
 
By using these functions, you can gain insights into the
model's performance, evaluate its ability to distinguish
between classes, and understand how it makes decisions
based on the chosen features.
 
To observe and analyze the ROC curve, you can follow these
steps:
1. Plot the ROC curve: Use the
plot_roc() function to plot the ROC
curve for your model. Provide the
trained model, the test features
(X_test), the test labels (y_test),
and a title for the plot.
2. Interpret the ROC curve:
The ROC curve is a graph that
shows the trade-off between
the True Positive Rate (TPR)
and the False Positive Rate
(FPR) as the classification
threshold changes.
The TPR is the ratio of
correctly predicted positive
instances to the total actual
positive instances. It represents
the model's ability to correctly
identify positive samples.
The FPR is the ratio of
incorrectly predicted negative
instances to the total actual
negative instances. It
represents the model's
tendency to incorrectly classify
negative samples as positive.
The diagonal line in the plot
represents a random classifier
with an equal chance of true
positives and false positives. A
better classifier will have its
ROC curve above this line.
3. Analyze the ROC curve:
The closer the ROC curve is to
the top-left corner of the plot,
the better the model's
performance. This indicates a
higher TPR for a lower FPR.
The area under the ROC curve
(AUC-ROC) is a common
metric used to evaluate the
overall performance of the
model. A higher AUC-ROC
value (closer to 1) suggests
better discrimination between
the positive and negative
classes.
If two models have
overlapping ROC curves, you
can compare them by looking
at their AUC-ROC values. The
model with a higher AUC-
ROC value is generally
considered better.
4. Determine the optimal threshold:
Depending on your specific needs
and the nature of the problem, you
can choose a threshold that
balances the trade-off between the
TPR and FPR. This threshold can
be adjusted to prioritize either
sensitivity or specificity based on
the requirements of your
application.
In summary, the ROC curve provides a visual representation
of the model's performance across different classification
thresholds. By analyzing the curve's shape, proximity to the
diagonal line, and the AUC-ROC value, you can assess the
model's ability to discriminate between positive and negative
instances and make informed decisions about the model's
performance.
 
 
 
Training Model and Predicting Pancreatic Cancer
Step Choose two features for decision boundary:
1  
1 feat_boundary = ['creatinine','LYVE1']
2 X_feature = X[feat_boundary]
3 X_train_feat, X_test_feat, y_train_feat,
4 y_test_feat = \
5 train_test_split(X_feature, y, test_size
= 0.2, \
random_state = 2021, stratify=y)
 
The code above performs the following steps:
1. Selects the features for decision
boundary plotting: The variable
feat_boundary is a list that
specifies the features to be used for
plotting the decision boundary. In
this case, the features selected are
'creatinine' and ’LYVE1’.
2. Extracts the selected features: The
features specified in feat_boundary
are extracted from the original
feature matrix X and stored in
X_feature.
3. Splits the data into training and
testing sets: The train_test_split
function is used to split the
extracted features (X_feature) and
the target variable (y) into training
and testing sets. The testing set size
is set to 20% of the total data, and
the random state is set to 2021 for
reproducibility. The stratify
parameter ensures that the class
distribution is preserved in the
training and testing sets.
4. Stores the split data: The resulting
training and testing feature
matrices and target variables are
stored in the variables
X_train_feat, X_test_feat,
y_train_feat, and y_test_feat,
respectively. These datasets will be
used for training and evaluating the
model for decision boundary
plotting.
By performing these steps, you have prepared the necessary
data for plotting the decision boundary based on the selected
features.
 
Step Define train_model() method to train model,
2 predict_model() method to get predicted values, and
run_model() method to perform training model, predicting
results, plotting confusion matrix, plotting true values versus
predicted values, plotting ROC, plotting decision boundary,
and plotting learning curve:
 
1 def train_model(model, X, y):
2 model.fit(X, y)
3 return model
4  
5 def predict_model(model, X, proba=False):
6 if ~proba:
7 y_pred = model.predict(X)
8 else:
9 y_pred_proba =
10 model.predict_proba(X)
11 y_pred = np.argmax(y_pred_proba,
axis=1)
12
 
13
return y_pred
14
 
15
list_scores = []
16
 
17
def run_model(name, model, X_train, X_test,
18
y_train, y_test, fc, proba=False):
19
print(name)
20
print(fc)
21
22
model = train_model(model, X_train,
23 y_train)
24 y_pred = predict_model(model, X_test,
25 proba)
26
27 accuracy = accuracy_score(y_test,
28 y_pred)
29 recall = recall_score(y_test, y_pred)
30 precision = precision_score(y_test,
y_pred)
31
f1 = f1_score(y_test, y_pred)
32
33
print('accuracy: ', accuracy)
34
print('recall: ',recall)
35
print('precision: ', precision)
36
print('f1: ', f1)
37
print(classification_report(y_test,
38
y_pred))
39
40
41
plot_cm(y_test, y_pred, name)
42
plot_real_pred_val(y_test, y_pred,
43 name)
44 plot_roc(model, X_test, y_test, name)
45
plot_decision_boundary(model,X_test_feat,
y_test_feat, name)
plot_learning_curve(model, name,
X_train, y_train, cv=3);
plt.show()

list_scores.append({'Model Name': name,


\
'Feature Scaling':fc, 'Accuracy':
accuracy, \
'Recall': recall, 'Precision': precision,
'F1':f1})
 
The code defines several functions and executes a series of
steps to train and evaluate a machine learning model. Here's
an explanation of each function and the overall process:
1. train_model(model, X, y): This
function trains the specified model
on the features X and target
variable y using the fit() method. It
returns the trained model.
2. predict_model(model, X,
proba=False): This function makes
predictions using the trained model
on the features X. If proba is set to
False, it uses the predict() method
to obtain the predicted class labels.
Otherwise, it uses the
predict_proba() method to obtain
class probabilities and then selects
the class with the highest
probability using argmax(). It
returns the predicted labels.
3. list_scores: This list will store the
evaluation scores for each model.
4. run_model(name, model, X_train,
X_test, y_train, y_test, fc,
proba=False): This function runs
the model training, evaluation, and
plotting process. It takes the
following parameters:
name: The name of the model.
model: The machine learning
model to be trained and
evaluated.
X_train, X_test, y_train,
y_test: The training and testing
feature matrices and target
variables.
fc: The feature scaling method
('StandardScaler',
'MinMaxScaler', or 'None').
proba: A flag indicating
whether to obtain class
probabilities instead of class
labels.
Inside the function, the following steps are performed:
The model is trained using the
train_model() function.
Predictions are made on the
test set using the
predict_model() function.
Various evaluation metrics
(accuracy, recall, precision, F1-
score) are calculated using
scikit-learn functions.
The classification report is
printed to provide a detailed
evaluation.
The confusion matrix,
predicted vs. true values plot,
ROC curve, decision boundary
plot, and learning curve plot
are generated and displayed
using the corresponding
functions.
The evaluation scores are
stored in the list_scores list.
Overall, this function allows for training and evaluating the
model, generating visualizations, and collecting evaluation
scores for further analysis.
 
By using the run_model() function, you can train, evaluate,
and visualize different machine learning models with
different feature scaling methods. The results are displayed
for each model, including accuracy, recall, precision, F1-
score, and various plots to assess model performance. The
evaluation scores are also stored in list_scores for further
analysis or comparison between models.
 
 
 
Support Vector Classifier and Grid Search
Step Run Support Vector Machine (SVM) on three feature scaling:
1  
1 feature_scaling = {
2 'Raw':(X_train_raw, X_test_raw,
3 y_train_raw, y_test_raw),
4 'Normalization':(X_train_norm,
X_test_norm, \
5
y_train_norm, y_test_norm),
6
'Standardization':(X_train_stand,
7
X_test_stand, \
8
y_train_stand, y_test_stand),
9
}
10
 
11
#Support Vector Classifier
12
# Define the parameter grid for the Grid
13 Search
14 param_grid = {
15 'C': [0.1, 1, 10], #
16 Regularization parameter
17 'kernel': ['linear', 'rbf'], # Kernel
18 type
}
19  
20 # Create the SVC model with
21 probability=True
22 model_svc = SVC(random_state=2021,
23 probability=True)
24  
25 # Perform Grid Search for each feature
scaling method
26
for fc_name, value in
27 feature_scaling.items():
28 X_train, X_test, y_train, y_test =
29 value
30
31 # Initialize GridSearchCV
32 grid_search =
33 GridSearchCV(estimator=model_svc,
param_grid=param_grid, cv=3,
34
scoring='accuracy', n_jobs=-1)
35
36
# Perform Grid Search and fit the model
37
grid_search.fit(X_train, y_train)
38
39
# Get the best parameters and best model
40 from the Grid Search
best_params = grid_search.best_params_
best_model =
grid_search.best_estimator_

# Evaluate the best model


run_model('SVC with ' + fc_name,
best_model, X_train, X_test, y_train,
y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The results of using raw feature scaling are shown in Figure
37 – 40.
 
Output with Raw Scaling:
SVC
Raw
accuracy: 0.584
recall: 0.584
precision: 0.5899640125955916
f1: 0.5320058111380146
precision recall f1-score
support
 
1 0.54 0.98 0.69
42
2 0.54 0.17 0.25
42
3 0.69 0.61 0.65
41
accuracy 0.58
125
macro avg 0.59 0.58 0.53
125
weighted avg 0.59 0.58 0.53
125
 
The output shows the evaluation results of the Support
Vector Classifier (SVC) model using raw feature scaling on
the test set. Here's a detailed analysis of the output:
Accuracy: The model achieved an
accuracy of approximately 58.4%.
Accuracy measures the overall
correctness of the model's
predictions. In this case, the model
correctly classified around 58.4%
of the instances in the test set.
Recall: The recall score is also
approximately 58.4%. Recall
measures the model's ability to
identify all relevant instances (true
positives) for each class. Since the
classes are balanced in this case,
the recall is equal to accuracy.
Precision: The precision of the
model is around 58.99%. Precision
represents the percentage of
correctly predicted instances for
each class. The model correctly
identified about 58.99% of the
instances belonging to each class.
F1-score: The F1-score is
approximately 53.20%. The F1-
score is the weighted average of
precision and recall. It provides a
balanced measure of the model's
performance on each class, taking
into account both false positives
and false negatives. A higher F1-
score indicates better overall
performance.
Classification Report: The
classification report provides a
breakdown of the metrics for each
class (1, 2, 3). For class 1, the
precision is 54%, recall is 98%,
and F1-score is 69%. For class 2,
precision is 54%, recall is 17%,
and F1-score is 25%. For class 3,
precision is 69%, recall is 61%,
and F1-score is 65%. The macro
average of precision, recall, and
F1-score is approximately 58%,
and the weighted average is around
53%.
 

Figure 37 The confusion matrix of SVM model with raw


feature scaling
 

Figure 38 The true values versus predicted values of SVM


model with raw feature scaling
 
 
 

Figure 39 The learning curve of SVM model with raw


feature scaling
 
Figure 40 The decision boundary using two chosen features
with SVM model
Conclusion:
The SVC model with raw feature scaling achieved moderate
accuracy and performance on the test set. However, the
relatively low F1-score suggests that the model may struggle
with correctly identifying instances for some classes. This
could be due to the imbalanced distribution of instances in
the classes and the chosen hyperparameters for the SVC
model.
 
Output with Normalization Scaling:
SVC
Normalization
accuracy: 0.592
recall: 0.592
precision: 0.5935272727272727
f1: 0.592571752694271
precision recall f1-score
support
1 0.55 0.52 0.54
42
2 0.45 0.48 0.47
42
3 0.78 0.78 0.78
41
accuracy 0.59
125
macro avg 0.60 0.59 0.59
125
weighted avg 0.59 0.59 0.59
125
 
The results of using normalized feature scaling are shown in
Figure 41 – 43. The output shows the evaluation results of
the Support Vector Classifier (SVC) model using
normalization feature scaling on the test set. Here's a detailed
analysis of the output:
Accuracy: The model achieved an
accuracy of approximately 59.2%.
Accuracy measures the overall
correctness of the model's
predictions. In this case, the model
correctly classified around 59.2%
of the instances in the test set.
Recall: The recall score is also
approximately 59.2%. Recall
measures the model's ability to
identify all relevant instances (true
positives) for each class. Since the
classes are balanced in this case,
the recall is equal to accuracy.
Precision: The precision of the
model is around 59.35%. Precision
represents the percentage of
correctly predicted instances for
each class. The model correctly
identified about 59.35% of the
instances belonging to each class.
F1-score: The F1-score is
approximately 59.26%. The F1-
score is the weighted average of
precision and recall. It provides a
balanced measure of the model's
performance on each class, taking
into account both false positives
and false negatives. A higher F1-
score indicates better overall
performance.
Classification Report: The
classification report provides a
breakdown of the metrics for each
class (1, 2, 3). For class 1, the
precision is 55%, recall is 52%,
and F1-score is 54%. For class 2,
precision is 45%, recall is 48%,
and F1-score is 47%. For class 3,
precision is 78%, recall is 78%,
and F1-score is 78%. The macro
average of precision, recall, and
F1-score is approximately 59%,
and the weighted average is around
59%.
Conclusion:
The SVC model with normalization feature scaling achieved
slightly better accuracy and performance compared to the
raw feature scaling. The model showed improved precision
and recall for each class, and the F1-score also increased,
indicating better overall performance.
 
Normalization of features scales the data to a common range
(usually [0, 1]), which can help improve the convergence of
optimization algorithms like SVM. However, there is still
room for improvement in model performance. Further
hyperparameter tuning or exploring other classification
algorithms could potentially yield even better results.
 
Figure 41 The confusion matrix of SVM model with
normalized feature scaling
 

Figure 42 The true values versus predicted values of SVM


model with normalized feature scaling
 
Figure 43 The learning curve of SVM model with
normalized feature scaling
Output with Standardization Scaling:
SVC
Standardization
accuracy: 0.648
recall: 0.648
precision: 0.6384140984311196
f1: 0.6395951293441338
precision recall f1-score
support
1 0.62 0.69 0.65
42
2 0.55 0.43 0.48
42
3 0.76 0.83 0.79
41
accuracy 0.65
125
macro avg 0.64 0.65 0.64
125
weighted avg 0.64 0.65 0.64
125
 
The results of using standardized feature scaling are shown
in Figure 44 – 46. The output shows the evaluation results of
the Support Vector Classifier (SVC) model using
standardization feature scaling on the test set. Let's analyze
the output:
Accuracy: The model achieved an
accuracy of approximately 64.8%.
The accuracy measures the overall
correctness of the model's
predictions. In this case, the model
correctly classified around 64.8%
of the instances in the test set.
Recall: The recall score is also
approximately 64.8%. Recall
measures the model's ability to
identify all relevant instances (true
positives) for each class. Since the
classes are balanced in this case,
the recall is equal to accuracy.
Precision: The precision of the
model is around 63.84%. Precision
represents the percentage of
correctly predicted instances for
each class. The model correctly
identified about 63.84% of the
instances belonging to each class.
F1-score: The F1-score is
approximately 63.96%. The F1-
score is the weighted average of
precision and recall. It provides a
balanced measure of the model's
performance on each class,
considering both false positives
and false negatives. A higher F1-
score indicates better overall
performance.
Classification Report: The
classification report provides a
breakdown of the metrics for each
class (1, 2, 3). For class 1, the
precision is 62%, recall is 69%,
and F1-score is 65%. For class 2,
precision is 55%, recall is 43%,
and F1-score is 48%. For class 3,
precision is 76%, recall is 83%,
and F1-score is 79%. The macro
average of precision, recall, and
F1-score is approximately 64%,
and the weighted average is around
65%.
Conclusion:
The SVC model with standardization feature scaling
achieved the highest accuracy and performance compared to
raw and normalization feature scaling. The model showed
improved precision and recall for each class, and the F1-
score also increased, indicating better overall performance.
 

Figure 44 The confusion matrix of SVM model with


standardized feature scaling
 
Figure 46 The true values versus predicted values of SVM
model with standardized feature scaling
 
Figure 45 The learning curve of SVM model with
standardized feature scaling
 
Comparison and Analysis:
Raw Scaling: This approach leaves
the data untouched, which might
lead to the model not performing as
well due to varying scales and
ranges of features. As a result, the
accuracy is relatively low at 58.4%,
and the F1-score is also quite low
at 53.20%. The model's precision
and recall are reasonably balanced,
but they are still not optimal.
Normalization Scaling: Scaling the
features to a range of [0, 1]
improved the performance slightly
compared to raw scaling. The
accuracy increased to 59.2%, and
the F1-score improved to 59.26%.
The precision and recall are
slightly more balanced compared
to raw scaling, but they are still not
ideal.
Standardization Scaling: This
approach scales the features to
have a mean of 0 and a standard
deviation of 1. It significantly
improved the model's performance
compared to the other two scaling
techniques. The accuracy increased
to 64.8%, and the F1-score
improved to 63.96%. The precision
and recall for each class are
relatively balanced, indicating a
more robust performance.
Conclusion:
Among the three feature scaling techniques, standardization
yielded the best results for the SVC model in terms of
accuracy, F1-score, and overall performance. It significantly
outperformed raw scaling and normalization scaling in terms
of correctly classifying instances and achieving a better
balance between precision and recall for each class.
 
It is essential to apply appropriate feature scaling to machine
learning models, especially when using algorithms that rely
on distance-based calculations like SVC. Standardization
generally works well for many algorithms and helps improve
the convergence rate and performance of the model.
 
 
 
Logistic Regression Classifier and Grid Search
Step Run Logistic Regression (LR) on three feature scaling:
1  
1 #Logistic Regression Classifier
2 # Define the parameter grid for the grid
3 search
4 param_grid = {
5 'C': [0.01, 0.1, 1, 10],
6 'penalty': ['l1', 'l2'],
7 'solver': ['newton-cg', 'lbfgs',
'liblinear', 'saga'],
8
}
9
 
10
# Initialize the Logistic Regression model
11
logreg = LogisticRegression(max_iter=5000,
12
random_state=2021)
13
 
14
# Perform the grid search for each feature
15 scaling method
16 for fc_name, value in
17 feature_scaling.items():
18 X_train, X_test, y_train, y_test =
19 value
20
21 # Create GridSearchCV with the Logistic
Regression model and the parameter grid
22
grid_search = GridSearchCV(logreg,
23
param_grid, cv=3, scoring='accuracy',
24 n_jobs=-1)
25
26 # Train and perform grid search
27 grid_search.fit(X_train, y_train)
28
29 # Get the best Logistic Regression model
30 from the grid search
31 best_model =
grid_search.best_estimator_
32
# Evaluate and plot the best model
33
(setting proba=True for probability
prediction)
run_model('Logistic Regression',
best_model, X_train, X_test, y_train,
y_test, fc_name, proba=True)

# Print the best hyperparameters found


print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The purpose of the code is to perform hyperparameter tuning
for the Logistic Regression classifier using Grid Search and
evaluate the model's performance with different feature
scaling techniques.
 
Here's a step-by-step explanation of the code:
1. Define the Parameter Grid:
param_grid is a dictionary
containing different
hyperparameters and their
respective values that will be
explored during the Grid Search. It
includes the regularization
parameter C, the penalty type
penalty, and the solver for
optimization solver.
2. Initialize the Logistic Regression
Model: The Logistic Regression
model is initialized with
max_iter=5000 to ensure
convergence and
random_state=2021 for
reproducibility.
3. Perform Grid Search: For each
feature scaling method, the code
iterates through the feature_scaling
dictionary. It splits the data into
training and testing sets and creates
a GridSearchCV object
(grid_search) with the Logistic
Regression model and the defined
parameter grid.
4. Train and Perform Grid Search:
The model is trained and tuned
using cross-validation (cv=3) to
find the best combination of
hyperparameters that optimize
accuracy on the training data.
5. Get the Best Model: The best
Logistic Regression model is
obtained from the grid search based
on the combination of
hyperparameters that achieved the
highest accuracy.
6. Evaluate and Plot the Best Model:
The run_model function is called to
evaluate the best Logistic
Regression model's performance on
the test data for each feature
scaling technique. The model is
evaluated using various metrics
such as accuracy, precision, recall,
F1-score, confusion matrix, ROC
curve, and learning curve. The
proba=True parameter indicates
that the model will perform
probability prediction for ROC and
learning curve plots.
7. Print Best Hyperparameters: The
best hyperparameters found for
each feature scaling method are
printed to display the combination
of hyperparameters that resulted in
the highest accuracy during the
Grid Search.
By performing hyperparameter tuning using Grid Search, the
code aims to find the optimal hyperparameters for the
Logistic Regression model that produce the best performance
on the given dataset. Additionally, by evaluating the model
with different feature scaling techniques, it helps identify
which scaling approach works best for improving the model's
accuracy and generalization.
 
Output with Raw Scaling:
Logistic Regression
Raw
accuracy: 0.68
recall: 0.68
precision: 0.6770075471698113
f1: 0.6722490630981859
precision recall f1-score
support
 
0 0.62 0.79 0.69
42
1 0.56 0.43 0.49
42
2 0.85 0.83 0.84
41
 
accuracy 0.68
125
macro avg 0.68 0.68 0.67
125
weighted avg 0.68 0.68 0.67
125
Best Hyperparameters for Raw:
{'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
 
The results of using raw feature scaling are shown in Figure
47 – 50. Analysis of Output with Raw Scaling for Logistic
Regression:
Accuracy: The accuracy of the
model on the test data is
approximately 68%, which means
that around 68% of the test samples
were classified correctly by the
model.
Recall: The recall (also known as
sensitivity or true positive rate) is
approximately 68%, indicating that
the model is able to identify 68%
of the positive class (class 1)
instances correctly.
Precision: The precision is around
67.7%, which means that out of all
the samples predicted as positive
by the model, approximately
67.7% of them are actually
positive.
F1-Score: The F1-score is
approximately 67.2%. The F1-
score is the harmonic mean of
precision and recall, providing a
balance between the two metrics. A
higher F1-score indicates a better
balance between precision and
recall.
Confusion Matrix: The confusion
matrix shows the distribution of
true positive, true negative, false
positive, and false negative
predictions for each class. It helps
to assess the model's performance
for individual classes.
Best Hyperparameters: The best hyperparameters found
during the Grid Search for Raw Scaling are C=10,
penalty='l2' (L2 regularization), and solver='liblinear'. These
hyperparameters resulted in the highest accuracy for the
Logistic Regression model on the training data.
 
Conclusion:
With Raw Scaling, the Logistic Regression model achieved
an accuracy of approximately 68% on the test data. The
model shows reasonable performance in terms of precision
and recall for classifying the three classes. The best
hyperparameters obtained from the Grid Search helped
improve the model's accuracy. However, there is still room
for improvement, and further exploration of hyperparameters
or feature engineering might lead to even better results.
Additionally, experimenting with other feature scaling
techniques could also be beneficial to identify the best
preprocessing method for this specific dataset and model.
 

Figure 47 The confusion matrix of LR model with raw


feature scaling
 
Figure 48 The true values versus predicted values of LR
model with raw feature scaling
 

Figure 49 The decision boundary of LR model with raw


feature scaling
 
Figure 50 The learning curve of LR model with raw feature
scaling
 
Output with Normalized Scaling:
Logistic Regression
Normalization
accuracy: 0.648
recall: 0.648
precision: 0.6467393162393161
f1: 0.6407783783783784
precision recall f1-score
support
 
0 0.59 0.76 0.67
42
1 0.53 0.40 0.46
42
2 0.82 0.78 0.80
41
 
accuracy 0.65
125
macro avg 0.65 0.65 0.64
125
weighted avg 0.65 0.65 0.64
125
 
Best Hyperparameters for Normalization:
{'C': 10, 'penalty': 'l1', 'solver': 'saga'}
 
The results of using normalized feature scaling are shown in
Figure 51 – 53. Analysis of Output with Normalized Scaling
for Logistic Regression:
Accuracy: The accuracy of the
model on the test data is
approximately 64.8%, indicating
that around 64.8% of the test
samples were classified correctly
by the model.
Recall: The recall (sensitivity) is
approximately 64.8%, which
means that the model is able to
identify 64.8% of the positive class
(class 1) instances correctly.
Precision: The precision is around
64.7%, indicating that out of all the
samples predicted as positive by
the model, approximately 64.7% of
them are actually positive.
F1-Score: The F1-score is
approximately 64.1%. The F1-
score is the harmonic mean of
precision and recall, providing a
balance between the two metrics. A
higher F1-score indicates a better
balance between precision and
recall.
Confusion Matrix: The confusion
matrix shows the distribution of
true positive, true negative, false
positive, and false negative
predictions for each class. It helps
to assess the model's performance
for individual classes.
Best Hyperparameters: The best
hyperparameters found during the
Grid Search for Normalization
Scaling are C=10, penalty='l1' (L1
regularization), and solver='saga'.
These hyperparameters resulted in
the highest accuracy for the
Logistic Regression model on the
training data.
Conclusion:
With Normalization Scaling, the Logistic Regression model
achieved an accuracy of approximately 64.8% on the test
data. The model shows moderate performance in terms of
precision and recall for classifying the three classes. The best
hyperparameters obtained from the Grid Search helped
improve the model's accuracy. However, similar to the Raw
Scaling case, there is still room for improvement, and further
exploration of hyperparameters or feature engineering might
lead to even better results. Additionally, experimenting with
other feature scaling techniques or trying different classifiers
could also be beneficial to identify the best preprocessing
method and model for this specific dataset.
 
Figure 51 The confusion matrix of LR model with
normalized feature scaling
 

Figure 52 The true values versus predicted values of LR


model with normalized feature scaling
 
Figure 53 The learning curve of LR model with normalized
feature scaling
 
Output with Standardized Scaling:
Logistic Regression
Standardization
accuracy: 0.664
recall: 0.664
precision: 0.660641958041958
f1: 0.6573052167060678
precision recall f1-score
support
 
0 0.62 0.76 0.68
42
1 0.55 0.43 0.48
42
2 0.82 0.80 0.81
41
 
accuracy 0.66
125
macro avg 0.66 0.67 0.66
125
weighted avg 0.66 0.66 0.66
125
 
Best Hyperparameters for Standardization:
{'C': 1, 'penalty': 'l2', 'solver': 'newton-cg'}
 
The results of using standardized feature scaling are shown
in Figure 54 – 56. Analysis of Output with Standardized
Scaling for Logistic Regression:
Accuracy: The accuracy of the
model on the test data is
approximately 66.4%, indicating
that around 66.4% of the test
samples were classified correctly
by the model.
Recall: The recall (sensitivity) is
approximately 66.4%, which
means that the model is able to
identify 66.4% of the positive class
(class 1) instances correctly.
Precision: The precision is around
66.1%, indicating that out of all the
samples predicted as positive by
the model, approximately 66.1% of
them are actually positive.
F1-Score: The F1-score is
approximately 65.7%. The F1-
score is the harmonic mean of
precision and recall, providing a
balance between the two metrics. A
higher F1-score indicates a better
balance between precision and
recall.
Confusion Matrix: The confusion
matrix shows the distribution of
true positive, true negative, false
positive, and false negative
predictions for each class. It helps
to assess the model's performance
for individual classes.
Best Hyperparameters: The best
hyperparameters found during the
Grid Search for Standardization
Scaling are C=1, penalty='l2' (L2
regularization), and
solver='newton-cg'. These
hyperparameters resulted in the
highest accuracy for the Logistic
Regression model on the training
data.
Conclusion:
With Standardization Scaling, the Logistic Regression model
achieved an accuracy of approximately 66.4% on the test
data. The model shows moderate performance in terms of
precision and recall for classifying the three classes. The best
hyperparameters obtained from the Grid Search helped
improve the model's accuracy compared to the Normalized
Scaling case. However, similar to the previous scaling
methods, there is still room for improvement, and further
exploration of hyperparameters or feature engineering might
lead to even better results. Additionally, experimenting with
other feature scaling techniques or trying different classifiers
could also be beneficial to identify the best preprocessing
method and model for this specific dataset.
 
Figure 54 The confusion matrix of LR model with
standardized feature scaling
 

Figure 55 The true values versus predicted values of LR


model with standardized feature scaling
 
Figure 56 The learning curve of LR model with standardized
feature scaling
 
Overall Observations:
The Logistic Regression model
shows similar performance across
different feature scaling methods,
with accuracy ranging from
approximately 64.8% to 68.0%.
The recall values are also close,
indicating that the model can
identify the positive class instances
fairly well, but there is room for
improvement in distinguishing
between the classes.
The precision values are also
comparable, indicating that the
model's predictions of the positive
class have moderate reliability,
with values around 64.7% to
67.7%.
The F1-Score, which balances
precision and recall, ranges from
approximately 64.1% to 67.2%.
In terms of hyperparameters,
different scaling techniques result
in different optimal values, but they
do not show significant variations.
Conclusion:
The choice of feature scaling method does not lead to
significant differences in the model's performance for the
Logistic Regression classifier in this specific dataset.
The model's overall performance is moderate, but there is
room for improvement. Further hyperparameter tuning,
feature engineering, or exploring other classification
algorithms might lead to better results.
 
 
 
K-Nearest Neighbors Classifier and Grid Search
Step Run K-Nearest Neighbors (KNN) on three feature scaling:
1  
1 #KNN Classifier
2 # Define the parameter grid for the grid
3 search
4 param_grid = {
5 'n_neighbors': list(range(2, 10))
6 }
7  
8 # KNN Classifier Grid Search
9 for fc_name, value in
feature_scaling.items():
10
X_train, X_test, y_train, y_test =
11
value
12
13
# Initialize the KNN Classifier
14
knn = KNeighborsClassifier()
15
16
# Create GridSearchCV with the KNN model
17 and the parameter grid
18 grid_search = GridSearchCV(knn,
19 param_grid, cv=3, scoring='accuracy',
20 n_jobs=-1)
21
22 # Train and perform grid search
23 grid_search.fit(X_train, y_train)
24
25 # Get the best KNN model from the grid
26 search
27 best_model =
grid_search.best_estimator_
28
29
# Evaluate and plot the best model
30
(setting proba=True for probability
31 prediction)
32 run_model(f'KNeighbors Classifier
33 n_neighbors =
34 {grid_search.best_params_["n_neighbors"]}',
best_model, X_train, X_test,
y_train, y_test, fc_name, proba=True)

# Print the best hyperparameters found


print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The purpose of the code is to perform a grid search with
cross-validation to find the best hyperparameter for the K-
Nearest Neighbors (KNN) classifier. The code evaluates the
performance of the KNN classifier with different feature
scaling methods (raw, normalization, and standardization)
using cross-validation.
 
Here's a step-by-step explanation of the code:
1. param_grid: A dictionary that
defines the hyperparameter grid for
the KNN classifier. It specifies the
values of the hyperparameter
'n_neighbors', which represents the
number of neighbors to consider
when making predictions. The code
tests values ranging from 2 to 9 for
this hyperparameter.
2. The code iterates through each
feature scaling method (raw,
normalization, and standardization)
present in the feature_scaling
dictionary.
3. For each feature scaling method, it
splits the data into training and
testing sets.
4. It initializes the KNN classifier.
5. It creates a GridSearchCV object
(grid_search) using the KNN
model and the hyperparameter grid.
The GridSearchCV performs an
exhaustive search over the
specified hyperparameter values
and performs cross-validation to
evaluate the model's performance.
6. It fits the grid_search object to the
training data to find the best
hyperparameters for the KNN
classifier.
7. The code extracts the best KNN
model (best_model) based on the
optimal hyperparameters found
during the grid search.
8. It evaluates and plots the
performance of the best model
using the run_model function,
which includes various evaluation
metrics such as accuracy, precision,
recall, F1-score, confusion matrix,
ROC curve, and learning curve.
9. The best hyperparameters found
for each feature scaling method are
printed to the console.
By performing the grid search and evaluating the KNN
classifier with different feature scaling techniques, the code
aims to find the optimal number of neighbors for the KNN
model that yields the best performance for the given dataset.
It provides insights into how feature scaling impacts the
performance of the KNN classifier and helps identify the best
hyperparameter setting for the model.
 
Output with Raw Scaling:
KNeighbors Classifier n_neighbors = 5
Raw
accuracy: 0.696
recall: 0.696
precision: 0.7068025581395349
f1: 0.6961746697964475
precision recall f1-score
support
 
0 0.72 0.74 0.73
42
1 0.62 0.74 0.67
42
2 0.78 0.61 0.68
41
 
accuracy 0.70
125
macro avg 0.71 0.70 0.70
125
weighted avg 0.71 0.70 0.70
125
Best Hyperparameters for Raw:
{'n_neighbors': 5}
 
The results of using raw feature scaling are shown in Figure
60 – 63. The KNN classifier with raw scaling achieved an
accuracy of approximately 69.6%, which means it correctly
classified about 69.6% of the instances in the test set. The
recall (also known as sensitivity) is 69.6%, which indicates
that the model identified 69.6% of the positive class (label 1)
instances correctly. The precision (positive predictive value)
is around 70.7%, indicating that 70.7% of the instances
predicted as positive were actually positive. The F1-score,
which balances precision and recall, is approximately 69.6%.
 
Looking at the classification report, we can see that the
model performs relatively well for classifying the "Control
(No Pancreatic Disease)" (label 0) and "Benign Hepatobiliary
Disease" (label 1) classes with F1-scores of 73% and 67%,
respectively. However, it shows slightly lower performance
for classifying the "Pancreatic Cancer" (label 2) class with an
F1-score of 68%.
 
The confusion matrix shows that the model correctly
predicted 31 out of 42 instances for the "Control (No
Pancreatic Disease)" class, 31 out of 42 instances for the
"Benign Hepatobiliary Disease" class, and 25 out of 41
instances for the "Pancreatic Cancer" class.
 
Overall, the KNN classifier with raw scaling and 5 neighbors
performs reasonably well but may have some difficulty
distinguishing between the "Pancreatic Cancer" class and the
other classes. It's important to note that the performance may
vary based on the dataset, and further tuning of
hyperparameters or using additional features could
potentially improve the model's performance.
 
Figure 60 The confusion matrix of KNN model with raw
feature scaling
 

Figure 61 The true values versus predicted values of KNN


model with raw feature scaling
 
Figure 62 The decision boundary of KNN model with raw
feature scaling
 
Figure 63 The learning curve of KNN model with raw
feature scaling
Output with Normalized Scaling:
KNeighbors Classifier n_neighbors = 5
Normalization
accuracy: 0.536
recall: 0.536
precision: 0.5367064935064935
f1: 0.53034262319341
precision recall f1-score
support
 
0 0.53 0.69 0.60
42
1 0.43 0.36 0.39
42
2 0.66 0.56 0.61
41
 
accuracy 0.54
125
macro avg 0.54 0.54 0.53
125
weighted avg 0.54 0.54 0.53
125
 
Best Hyperparameters for Normalization:
{'n_neighbors': 5}
 
The results of using raw feature scaling are shown in Figure
64 – 66. The KNN classifier with normalized scaling
achieved an accuracy of approximately 53.6%. This means it
correctly classified about 53.6% of the instances in the test
set. The recall (sensitivity) is 53.6%, indicating that the
model identified 53.6% of the positive class (label 1)
instances correctly. The precision (positive predictive value)
is around 53.7%, indicating that 53.7% of the instances
predicted as positive were actually positive. The F1-score,
which balances precision and recall, is approximately 53.0%.
 
Looking at the classification report, we can see that the
model has relatively lower performance across all classes
compared to the raw scaling results. The F1-scores for
classifying the "Control (No Pancreatic Disease)" (label 0),
"Benign Hepatobiliary Disease" (label 1), and "Pancreatic
Cancer" (label 2) classes are 60%, 39%, and 61%,
respectively.
 
The confusion matrix shows that the model correctly
predicted 29 out of 42 instances for the "Control (No
Pancreatic Disease)" class, 15 out of 42 instances for the
"Benign Hepatobiliary Disease" class, and 23 out of 41
instances for the "Pancreatic Cancer" class.
 
Overall, the KNN classifier with normalized scaling
performs less effectively compared to the raw scaling
approach, suggesting that the normalization may not be well-
suited for this dataset and classification task. It's important to
consider other feature scaling techniques or further
optimization of hyperparameters to potentially improve the
model's performance. Additionally, feature engineering or the
inclusion of more informative features may contribute to
better results.
 

Figure 64 The confusion matrix of KNN model with


normalized feature scaling
 

Figure 65 The true values versus predicted values of KNN


model with normalized feature scaling
 

Figure 66 The learning curve of KNN model with


normalized feature scaling
 
Output with Standardized Scaling:
KNeighbors Classifier n_neighbors = 9
Standardization
accuracy: 0.6
recall: 0.6
precision: 0.5945658263305322
f1: 0.5926319261352297
precision recall f1-score
support
 
0 0.55 0.67 0.60
42
1 0.50 0.38 0.43
42
2 0.74 0.76 0.75
41
 
accuracy 0.60
125
macro avg 0.60 0.60 0.59
125
weighted avg 0.59 0.60 0.59
125
 
Best Hyperparameters for Standardization:
{'n_neighbors': 9}
 
The results of using standardized feature scaling are shown
in Figure 67 – 69. The KNN classifier with standardized
scaling achieved an accuracy of approximately 60%. This
means it correctly classified about 60% of the instances in
the test set. The recall (sensitivity) is also 60%, indicating
that the model identified 60% of the positive class (label 1)
instances correctly. The precision (positive predictive value)
is around 59.5%, meaning that 59.5% of the instances
predicted as positive were actually positive. The F1-score,
which balances precision and recall, is approximately 59.3%.
 
Looking at the classification report, we can see that the
model performs reasonably well across all classes, with F1-
scores of 60%, 43%, and 75% for the "Control (No
Pancreatic Disease)" (label 0), "Benign Hepatobiliary
Disease" (label 1), and "Pancreatic Cancer" (label 2) classes,
respectively.
 

Figure 67 The confusion matrix of KNN model with


standardized feature scaling
 
The confusion matrix shows that the model correctly
predicted 28 out of 42 instances for the "Control (No
Pancreatic Disease)" class, 16 out of 42 instances for the
"Benign Hepatobiliary Disease" class, and 31 out of 41
instances for the "Pancreatic Cancer" class.
 
Overall, the KNN classifier with standardized scaling
performs better than the model with normalized scaling but is
still outperformed by the raw scaling approach. It suggests
that standardization might be more suitable for this dataset
compared to normalization. However, the model's
performance can still be further improved by fine-tuning
hyperparameters, exploring different feature scaling methods,
and possibly incorporating more informative features into the
model. Regularization techniques or ensemble methods like
Random Forest or Gradient Boosting might also be
considered to enhance the classifier's predictive capabilities.
 

Figure 68 The learning curve of KNN model with


standardized feature scaling
 
Figure 69 The true values versus predicted values of KNN
model with standardized feature scaling
 
 
 
Decision Tree Classifier and Grid Search
Step Run Decision Tree (DT) classifier on three feature scaling:
1  
1 #Decision Tree Classifier
2 for fc_name, value in
3 feature_scaling.items():
4 X_train, X_test, y_train, y_test =
value
5
6
# Initialize the DecisionTreeClassifier
7
model
8
dt_clf =
9 DecisionTreeClassifier(random_state=2021)
10
11 # Define the parameter grid for the grid
12 search
13 param_grid = {
14 'max_depth': np.arange(1, 51, 1),
15 'criterion': ['gini', 'entropy'],
16 'min_samples_split': [2, 5, 10],
17 'min_samples_leaf': [1, 2, 4],
18 }
19
20 # Create GridSearchCV with the
DecisionTreeClassifier model and the
21
parameter grid
22
grid_search = GridSearchCV(dt_clf,
23 param_grid, cv=3, scoring='accuracy',
24 n_jobs=-1)
25
26 # Train and perform grid search
27 grid_search.fit(X_train, y_train)
28
29 # Get the best DecisionTreeClassifier
30 model from the grid search
31 best_model =
grid_search.best_estimator_
32
33
# Evaluate and plot the best model
34
(setting proba=True for probability
35 prediction)
36 run_model(f'DecisionTree Classifier
37 (Best Depth:
{grid_search.best_params_["max_depth"]})',
best_model, X_train, X_test,
y_train, y_test, fc_name, proba=True)

# Print the best hyperparameters found


print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The code performs a hyperparameter tuning for the Decision
Tree Classifier using different feature scaling methods (raw,
normalization, and standardization). It follows the steps
below for each feature scaling approach:
1. Set the training and testing data
based on the specific feature
scaling method.
2. Initialize the
DecisionTreeClassifier model.
3. Define the parameter grid for the
grid search. The hyperparameters
that are being tuned are:
'max_depth': The maximum
depth of the decision tree.
'criterion': The function to
measure the quality of a split
('gini' or 'entropy').
'min_samples_split': The
minimum number of samples
required to split an internal
node.
'min_samples_leaf': The
minimum number of samples
required to be at a leaf node.
4. Create GridSearchCV with the
DecisionTreeClassifier model and
the defined parameter grid.
5. Train and perform grid search to
find the best hyperparameters for
the Decision Tree model.
6. Get the best DecisionTreeClassifier
model from the grid search.
7. Evaluate and plot the best model
using the run_model function with
the "proba" parameter set to True
for probability prediction.
8. Print the best hyperparameters
found for the specific feature
scaling method.
By using GridSearchCV, the code automatically performs an
exhaustive search over the specified parameter grid and
cross-validates the results to find the best combination of
hyperparameters that yields the highest accuracy on the
validation data.
 
The process is repeated for each feature scaling method (raw,
normalization, and standardization), and the results for each
combination of the Decision Tree Classifier and feature
scaling approach are printed and plotted. The output will
provide insights into the best hyperparameters and the
corresponding performance of the Decision Tree Classifier
for each feature scaling method. This allows us to select the
best model and feature scaling approach based on their
respective evaluation metrics, such as accuracy, precision,
recall, and F1-score.
 
Output with Raw Scaling:
DecisionTree Classifier (Best Depth: 7)
Raw
accuracy: 0.544
recall: 0.544
precision: 0.5508166666666666
f1: 0.5438155880963628
precision recall f1-score
support
 
0 0.62 0.67 0.64
42
1 0.44 0.50 0.47
42
2 0.59 0.46 0.52
41
 
accuracy 0.54
125
macro avg 0.55 0.54 0.54
125
weighted avg 0.55 0.54 0.54
125
Best Hyperparameters for Raw:
{'criterion': 'entropy', 'max_depth': 7,
'min_samples_leaf': 4, 'min_samples_split': 2}
 
The results of using raw feature scaling are shown in Figure
70 – 73. The Decision Tree Classifier with raw feature
scaling produces an accuracy of approximately 54.4%. The
best hyperparameters found through the Grid Search are as
follows:
'criterion': 'entropy' (The function
to measure the quality of a split)
'max_depth': 7 (The maximum
depth of the decision tree)
'min_samples_leaf': 4 (The
minimum number of samples
required to be at a leaf node)
'min_samples_split': 2 (The
minimum number of samples
required to split an internal node)
The precision, recall, and F1-score for each class are as
follows:
Class 0 (Control): Precision - 0.62,
Recall - 0.67, F1-score - 0.64
Class 1 (Benign Hepatobiliary
Disease): Precision - 0.44, Recall -
0.50, F1-score - 0.47
Class 2 (Pancreatic Cancer):
Precision - 0.59, Recall - 0.46, F1-
score - 0.52
The macro average F1-score is approximately 0.54,
indicating a moderate overall performance of the model in
terms of F1-score.
 
These results indicate that the Decision Tree Classifier with
raw feature scaling has moderate predictive power, but it may
not be the most accurate model for this specific dataset.
Further improvements could be made by exploring different
algorithms or performing more in-depth feature engineering
and selection. Additionally, fine-tuning the hyperparameters
may help improve the model's performance. It's also
important to note that the dataset may be imbalanced, and
techniques such as oversampling or undersampling could be
considered to address this issue and potentially improve the
model's performance.
 
Figure 70 The confusion matrix of DT model with raw
feature scaling
 

Figure 71 The true values versus predicted values of DT


model with raw feature scaling
 
Figure 72 The decision boundary of DT model with raw
feature scaling
 
Figure 72 The learning curve of DT model with raw feature
scaling
 
Output with Normalized Scaling:
DecisionTree Classifier (Best Depth: 7)
Normalization
accuracy: 0.528
recall: 0.528
precision: 0.5436441873915557
f1: 0.5325276595744681
precision recall f1-score
support
 
0 0.68 0.62 0.65
42
1 0.40 0.50 0.45
42
2 0.54 0.46 0.50
41
 
accuracy 0.53
125
macro avg 0.54 0.53 0.53
125
weighted avg 0.54 0.53 0.53
125
 
Best Hyperparameters for Normalization:
{'criterion': 'entropy', 'max_depth': 7,
'min_samples_leaf': 1, 'min_samples_split': 2}
 
The results of using normalized feature scaling are shown in
Figure 74 – 76. The Decision Tree Classifier with
Normalization scaling method achieved an accuracy of
approximately 52.8%. Let's analyze the performance in more
detail:
Precision: The precision represents
the percentage of true positive
predictions out of all positive
predictions for each class. For the
Decision Tree Classifier with
Normalization scaling:
Class 0 (Control): The
precision is 0.68, meaning that
out of all predicted Control
cases, 68% are correct.
Class 1 (Benign Hepatobiliary
Disease): The precision is 0.40,
indicating that only 40% of the
predicted Benign Hepatobiliary
Disease cases are correct.
Class 2 (Pancreatic Cancer):
The precision is 0.54,
indicating that 54% of the
predicted Pancreatic Cancer
cases are correct.
Recall: The recall, also known as
sensitivity or true positive rate,
represents the percentage of true
positive predictions out of all
actual positive cases for each class.
For the Decision Tree Classifier
with Normalization scaling:
Class 0 (Control): The recall is
0.62, meaning that 62% of the
actual Control cases are
correctly identified.
Class 1 (Benign Hepatobiliary
Disease): The recall is 0.50,
indicating that only 50% of the
actual Benign Hepatobiliary
Disease cases are correctly
identified.
Class 2 (Pancreatic Cancer):
The recall is 0.46, indicating
that 46% of the actual
Pancreatic Cancer cases are
correctly identified.
F1-score: The F1-score is the
harmonic mean of precision and
recall and provides a balance
between these metrics. For the
Decision Tree Classifier with
Normalization scaling:
The weighted average F1-score is
0.533, indicating a moderate
balance between precision and
recall across all three classes.
Overall, the Decision Tree Classifier with Normalization
scaling exhibits relatively lower accuracy, precision, and
recall compared to the other classifiers we have analyzed so
far. The model seems to have difficulty distinguishing
between the classes, especially for the Benign Hepatobiliary
Disease class, which has the lowest precision and recall.
 
 
Figure 74 The confusion matrix of DT model with
normalized feature scaling
 

Figure 75 The true values versus predicted values of DT


model with normalized feature scaling
 
Figure 76 The learning curve of DT model with normalized
feature scaling
 
Output with Standardized Scaling:
DecisionTree Classifier (Best Depth: 7)
Standardization
accuracy: 0.544
recall: 0.544
precision: 0.5508166666666666
f1: 0.5438155880963628
precision recall f1-score
support
 
0 0.62 0.67 0.64
42
1 0.44 0.50 0.47
42
2 0.59 0.46 0.52
41
 
accuracy 0.54
125
macro avg 0.55 0.54 0.54
125
weighted avg 0.55 0.54 0.54
125
 
Best Hyperparameters for Standardization:
{'criterion': 'entropy', 'max_depth': 7,
'min_samples_leaf': 4, 'min_samples_split': 2}
 
The results of using standardized feature scaling are shown
in Figure 77 – 79. The Decision Tree Classifier with
Standardization scaling method achieved an accuracy of
approximately 54.4%. Let's analyze the performance in more
detail:
Precision: The precision represents
the percentage of true positive
predictions out of all positive
predictions for each class. For the
Decision Tree Classifier with
Standardization scaling:
Class 0 (Control): The
precision is 0.62, meaning that
out of all predicted Control
cases, 62% are correct.
Class 1 (Benign Hepatobiliary
Disease): The precision is 0.44,
indicating that only 44% of the
predicted Benign Hepatobiliary
Disease cases are correct.
Class 2 (Pancreatic Cancer):
The precision is 0.59,
indicating that 59% of the
predicted Pancreatic Cancer
cases are correct.
Recall: The recall, also known as
sensitivity or true positive rate,
represents the percentage of true
positive predictions out of all
actual positive cases for each class.
For the Decision Tree Classifier
with Standardization scaling:
Class 0 (Control): The recall is
0.67, meaning that 67% of the
actual Control cases are
correctly identified.
Class 1 (Benign Hepatobiliary
Disease): The recall is 0.50,
indicating that 50% of the
actual Benign Hepatobiliary
Disease cases are correctly
identified.
Class 2 (Pancreatic Cancer):
The recall is 0.46, indicating
that 46% of the actual
Pancreatic Cancer cases are
correctly identified.
F1-score: The F1-score is the
harmonic mean of precision and
recall and provides a balance
between these metrics. For the
Decision Tree Classifier with
Standardization scaling:
The weighted average F1-score
is 0.544, indicating a moderate
balance between precision and
recall across all three classes.
Overall, the Decision Tree Classifier with Standardization
scaling shows similar performance as the model with
Normalization scaling. It exhibits relatively lower accuracy,
precision, and recall compared to the other classifiers we
have analyzed so far. The model seems to struggle with
distinguishing between the classes, especially for the Benign
Hepatobiliary Disease and Pancreatic Cancer classes, which
have lower precision and recall values.
 

Figure 77 The confusion matrix of DT model with


standardized feature scaling
 

Figure 78 The true values versus predicted values of DT


model with standardized feature scaling
 

Figure 79 The learning curve of DT model with standardized


feature scaling
 
 
 
Random Forest Classifier and Grid Search
Step Run Random Forest (RF) classifier on three feature scaling:
1  
1 #Random Forest Classifier
2 # Define the parameter grid for the grid
3 search
4 param_grid = {
5 'n_estimators': [100, 200, 300],
6 'max_depth': [10, 20, 30, 40, 50],
7 'min_samples_split': [2, 5, 10],
8 'min_samples_leaf': [1, 2, 4]
9 }
10  
11 # Initialize the RandomForestClassifier model
12 rf =
13 RandomForestClassifier(random_state=2021)
14  
15 # RandomForestClassifier Grid Search
16 for fc_name, value in
feature_scaling.items():
17
X_train, X_test, y_train, y_test = value
18
19
# Create GridSearchCV with the
20 RandomForestClassifier model and the
21 parameter grid
22 grid_search = GridSearchCV(rf,
23 param_grid, cv=3, scoring='accuracy',
24 n_jobs=-1)
25
26 # Train and perform grid search
27 grid_search.fit(X_train, y_train)
28
29 # Get the best RandomForestClassifier model
from the grid search
30
best_model = grid_search.best_estimator_
31
32
# Evaluate and plot the best model (setting
33 proba=True for probability prediction)
34 run_model(f'RandomForest Classifier (Best
35 Estimators:
{grid_search.best_params_["n_estimators"]})',
36
best_model, X_train, X_test,
37
y_train, y_test, fc_name, proba=True)
38

# Print the best hyperparameters found


print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
Output with Raw Scaling:
RandomForest Classifier (Best Estimators: 300)
Raw
accuracy: 0.744
recall: 0.744
precision: 0.7379084181313599
f1: 0.7355017225672011
precision recall f1-score
support
 
0 0.74 0.83 0.79
42
1 0.69 0.52 0.59
42
2 0.78 0.88 0.83
41
 
accuracy 0.74
125
macro avg 0.74 0.75 0.74
125
weighted avg 0.74 0.74 0.74
125
Best Hyperparameters for Raw:
{'max_depth': 20, 'min_samples_leaf': 1,
'min_samples_split': 2, 'n_estimators': 300}
 
The results of using raw feature scaling are shown in Figure
80 – 83. The RandomForest Classifier with Raw scaling
method achieved an accuracy of approximately 74.4%. Let's
analyze the performance in more detail:
Precision: The precision represents
the percentage of true positive
predictions out of all positive
predictions for each class. For the
RandomForest Classifier with Raw
scaling:
Class 0 (Control): The precision
is 0.74, meaning that out of all
predicted Control cases, 74% are
correct.
Class 1 (Benign Hepatobiliary
Disease): The precision is 0.69,
indicating that 69% of the
predicted Benign Hepatobiliary
Disease cases are correct.
Class 2 (Pancreatic Cancer): The
precision is 0.78, indicating that
78% of the predicted Pancreatic
Cancer cases are correct.
Recall: The recall, also known as
sensitivity or true positive rate,
represents the percentage of true
positive predictions out of all actual
positive cases for each class. For the
RandomForest Classifier with Raw
scaling:
Class 0 (Control): The recall is
0.83, meaning that 83% of the
actual Control cases are correctly
identified.
Class 1 (Benign Hepatobiliary
Disease): The recall is 0.52,
indicating that 52% of the actual
Benign Hepatobiliary Disease
cases are correctly identified.
Class 2 (Pancreatic Cancer): The
recall is 0.88, indicating that
88% of the actual Pancreatic
Cancer cases are correctly
identified.
F1-score: The F1-score is the
harmonic mean of precision and
recall and provides a balance
between these metrics. For the
RandomForest Classifier with Raw
scaling:
The weighted average F1-score
is 0.736, indicating a good
balance between precision and
recall across all three classes.
Overall, the RandomForest Classifier with Raw scaling
demonstrates relatively better performance compared to the
previous classifiers we have analyzed. It exhibits higher
accuracy, precision, and recall, especially for the Pancreatic
Cancer class, which has high recall and precision values.
 
RandomForest Classifier is an ensemble learning method that
combines multiple decision trees to improve classification
performance. The hyperparameters used for this model
include 'max_depth', 'min_samples_leaf', 'min_samples_split',
and 'n_estimators'. The grid search helped in finding the best
combination of hyperparameters, leading to improved model
performance.
 
Figure 80 The confusion matrix of RF model with raw feature
scaling
 
 

Figure 81 The true values versus predicted values of RF


model with raw feature scaling
 

Figure 82 The decision boundary of RF model with raw


feature scaling
 
Figure 83 The learning curve of RF model with raw feature
scaling
 
Output with Normalized Scaling:
RandomForest Classifier (Best Estimators: 300)
Normalization
accuracy: 0.768
recall: 0.768
precision: 0.7640158102766799
f1: 0.76139736677116
precision recall f1-score
support
 
0 0.78 0.86 0.82
42
1 0.73 0.57 0.64
42
2 0.78 0.88 0.83
41
 
accuracy 0.77
125
macro avg 0.76 0.77 0.76
125
weighted avg 0.76 0.77 0.76
125
 
Best Hyperparameters for Normalization:
{'max_depth': 20, 'min_samples_leaf': 1,
'min_samples_split': 2, 'n_estimators': 300}
 
The results of using normalized feature scaling are shown in
Figure 84 – 86. The RandomForest Classifier with
Normalized scaling method achieved an accuracy of
approximately 76.8%. Let's analyze the performance in more
detail:
Precision: The precision represents
the percentage of true positive
predictions out of all positive
predictions for each class. For the
RandomForest Classifier with
Normalized scaling:
Class 0 (Control): The precision
is 0.78, meaning that out of all
predicted Control cases, 78%
are correct.
Class 1 (Benign Hepatobiliary
Disease): The precision is 0.73,
indicating that 73% of the
predicted Benign Hepatobiliary
Disease cases are correct.
Class 2 (Pancreatic Cancer): The
precision is 0.78, indicating that
78% of the predicted Pancreatic
Cancer cases are correct.
Recall: The recall, also known as
sensitivity or true positive rate,
represents the percentage of true
positive predictions out of all actual
positive cases for each class. For the
RandomForest Classifier with
Normalized scaling:
Class 0 (Control): The recall is
0.86, meaning that 86% of the
actual Control cases are
correctly identified.
Class 1 (Benign Hepatobiliary
Disease): The recall is 0.57,
indicating that 57% of the actual
Benign Hepatobiliary Disease
cases are correctly identified.
Class 2 (Pancreatic Cancer): The
recall is 0.88, indicating that
88% of the actual Pancreatic
Cancer cases are correctly
identified.
F1-score: The F1-score is the
harmonic mean of precision and
recall and provides a balance
between these metrics. For the
RandomForest Classifier with
Normalized scaling:
The weighted average F1-score
is 0.761, indicating a good
balance between precision and
recall across all three classes.
The RandomForest Classifier with Normalized scaling shows
improved performance compared to the model with Raw
scaling. It demonstrates higher accuracy, precision, and recall
values, especially for the Control and Pancreatic Cancer
classes, which have higher recall and precision values.
 
The normalization scaling method scales the features to a
common range, ensuring that each feature contributes equally
to the model's learning process. This can help the
RandomForest Classifier perform better in this case, as it
relies on the combined strength of multiple decision trees to
make accurate predictions.
 
Figure 84 The confusion matrix of RF model with normalized
feature scaling
 

Figure 85 The true values versus predicted values of RF


model with normalized feature scaling
 
Figure 86 The learning curve of RF model with normalized
feature scaling
Output with Standardized Scaling
RandomForest Classifier (Best Estimators: 300)
Standardization
accuracy: 0.768
recall: 0.768
precision: 0.7661666666666667
f1: 0.7610098746557482
precision recall f1-score
support
 
0 0.78 0.83 0.80
42
1 0.75 0.57 0.65
42
2 0.77 0.90 0.83
41
 
accuracy 0.77
125
macro avg 0.77 0.77 0.76
125
weighted avg 0.77 0.77 0.76
125
 
Best Hyperparameters for Standardization:
{'max_depth': 10, 'min_samples_leaf': 1,
'min_samples_split': 5, 'n_estimators': 300}
 
The results of using standardized feature scaling are shown in
Figure 87 – 89. In the output with Standardized Scaling, we
used the RandomForest Classifier with the following
hyperparameters:
Best Estimators: 300
Max Depth: 10
Min Samples Leaf: 1
Min Samples Split: 5
Now, let's analyze the results in more detail:
Accuracy: The RandomForest
Classifier achieved an accuracy of
approximately 76.8% when using
standardized feature scaling. This
means that around 76.8% of the
samples in the test set were
correctly classified by the model.
Precision: The precision for Class 0
(Control) is 0.78, for Class 1
(Benign Hepatobiliary Disease) is
0.75, and for Class 2 (Pancreatic
Cancer) is 0.77. This indicates that
the model correctly predicted 78%
of the Control cases, 75% of the
Benign Hepatobiliary Disease
cases, and 77% of the Pancreatic
Cancer cases out of all positive
predictions for each class.
Recall: The recall for Class 0 is
0.83, for Class 1 is 0.57, and for
Class 2 is 0.90. This indicates that
the model correctly identified 83%
of the actual Control cases, 57% of
the actual Benign Hepatobiliary
Disease cases, and 90% of the
actual Pancreatic Cancer cases.
F1-score: The weighted average F1-
score is 0.761, which provides a
balance between precision and
recall across all three classes. The
F1-score is a harmonic mean of
precision and recall and is a suitable
metric when dealing with
imbalanced datasets.
Support: The support represents the
number of samples of each class in
the test set. In this case, there were
42 samples for each class.
Conclusions:
The RandomForest Classifier with
Standardized Scaling achieved a
reasonably high accuracy of 76.8%,
which suggests that the model is
effective in classifying the three
classes: Control, Benign
Hepatobiliary Disease, and
Pancreatic Cancer.
The model shows good precision
for Class 0 and Class 2 (around 77-
78%), indicating that the majority
of the positive predictions for these
classes are correct.
The recall for Class 1 is relatively
low (57%), suggesting that the
model struggles to correctly identify
the samples belonging to Class 1
(Benign Hepatobiliary Disease).
This class may have characteristics
that are more challenging to
differentiate from other classes.
The F1-score of 0.761 indicates a
good balance between precision and
recall and suggests that the model
provides overall good performance
across all three classes.
In conclusion, the RandomForest Classifier with Standardized
Scaling appears to be a promising model for this classification
task. However, it's essential to continue monitoring the
model's performance on unseen data and potentially explore
other algorithms or hyperparameter combinations to further
improve its performance. Additionally, domain-specific
knowledge and further feature engineering may also
contribute to enhancing the model's predictive capabilities.
 
Figure 87 The confusion matrix of RF model with
standardized feature scaling
 

Figure 88 The true values versus predicted values of RF


model with standardized feature scaling
 
Figure 89 The learning curve of RF model with standardized
feature scaling
 
Let's compare, analyze, and conclude the three outputs of the
RandomForest Classifier with different feature scalings:
 
Output with Raw Scaling:
Accuracy: 74.4%
Recall: Class 0 (Control) - 83%,
Class 1 (Benign Hepatobiliary
Disease) - 52%, Class 2 (Pancreatic
Cancer) - 88%
Precision: Class 0 - 74%, Class 1 -
69%, Class 2 - 78%
F1-score: 0.735
Support: 42 samples for each class
Output with Normalized Scaling:
Accuracy: 76.8%
Recall: Class 0 - 86%, Class 1 -
57%, Class 2 - 88%
Precision: Class 0 - 78%, Class 1 -
73%, Class 2 - 78%
F1-score: 0.761
Support: 42 samples for each class
Output with Standardized Scaling:
Accuracy: 76.8%
Recall: Class 0 - 83%, Class 1 -
57%, Class 2 - 90%
Precision: Class 0 - 78%, Class 1 -
75%, Class 2 - 77%
F1-score: 0.761
Support: 42 samples for each class
Comparison and Analysis:
All three models achieve similar
accuracy and F1-scores, indicating
that the RandomForest Classifier is
performing consistently across
different feature scaling methods.
Raw Scaling performs slightly
worse than Normalized Scaling and
Standardized Scaling in terms of
accuracy and F1-score. This
suggests that scaling the features is
beneficial for the model's
performance.
Standardized Scaling results in the
highest recall for Class 2
(Pancreatic Cancer) at 90%. This
means the model is effective in
identifying most of the actual
positive cases for this class when
using standardized features.
Normalized Scaling shows the
highest recall for Class 0 (Control)
at 86%, indicating that the model is
good at correctly identifying most
of the true positive cases for this
class.
In terms of precision, all three
models perform similarly for each
class, showing precision values
around 75-78%.
Conclusion:
The RandomForest Classifier shows relatively good
performance across all feature scaling methods. Normalized
Scaling and Standardized Scaling result in slightly better
accuracy and F1-scores compared to Raw Scaling. However,
the differences in performance between the scaling methods
are not substantial.
 
When choosing the best feature scaling method, it's essential
to consider other factors such as computational efficiency and
the interpretability of the model. Additionally, further
investigation into feature engineering and hyperparameter
tuning could lead to further improvements in the model's
performance.
 
In conclusion, the RandomForest Classifier with either
Normalized Scaling or Standardized Scaling appears to be a
suitable model for this classification task. The choice between
the two scaling methods could depend on other considerations
specific to the application and the dataset at hand. It is
advisable to validate the model's performance on unseen data
and fine-tune it as needed for the best results.
 
 
Gradient Boosting Classifier and Grid Search
Step Run Gradient Boosting (GB) classifier on three feature
1 scaling:
 
1 #Gradient Boosting Classifier
2 # Initialize the GradientBoostingClassifier
3 model
4 gbt =
GradientBoostingClassifier(random_state=2021)
5
 
6
# Define the parameter grid for the grid
7
search
8
param_grid = {
9
'n_estimators': [100, 200, 300],
10
'max_depth': [10, 20, 30],
11
'subsample': [0.6, 0.8, 1.0],
12
'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],
13
}
14
 
15
# GradientBoosting Classifier Grid Search
16
for fc_name, value in
17 feature_scaling.items():
18 X_train, X_test, y_train, y_test = value
19  
20 # Create GridSearchCV with the
21 GradientBoostingClassifier model and the
22 parameter grid
23 grid_search = GridSearchCV(gbt,
param_grid, cv=3, scoring='accuracy',
24
n_jobs=-1)
25
 
26
# Train and perform grid search
27
grid_search.fit(X_train, y_train)
28
 
29
30 # Get the best GradientBoostingClassifier
31 model from the grid search
32 best_model = grid_search.best_estimator_
33  
34 # Evaluate and plot the best model (setting
proba=True for probability prediction)
35
run_model(f'GradientBoosting Classifier
36 (Best Estimators:
37 {grid_search.best_params_["n_estimators"]})',
38 best_model, X_train, X_test,
y_train, y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The code performs a Grid Search with Cross-Validation to
tune hyperparameters for the GradientBoostingClassifier. The
purpose of this code is to find the best combination of
hyperparameters for the Gradient Boosting model that
provides the highest accuracy on the given dataset. The
hyperparameters being tuned are:
n_estimators: The number of
boosting stages to be run. It controls
the number of weak learners
(decision trees) to be combined.
max_depth: The maximum depth of
the individual decision trees. It
controls the complexity of the weak
learners and can prevent overfitting.
subsample: The fraction of samples
used for fitting the individual weak
learners. It helps in introducing
randomness and reducing
overfitting.
max_features: The maximum
number of features to consider
when looking for the best split in
each tree. It introduces further
randomness and can prevent
overfitting.
For each feature scaling method (Raw, Normalization, and
Standardization), the code performs a Grid Search over the
hyperparameter space defined by param_grid. The best model
is selected based on the highest accuracy achieved during
cross-validation.
 
After performing the Grid Search for each feature scaling
method, the code evaluates and prints the performance metrics
of the best GradientBoostingClassifier model on the test set.
 
The output will show the accuracy, recall, precision, and F1-
score for each class, along with the average metrics for the
entire test set. Additionally, it will display the best
hyperparameters found for each feature scaling method.
 
It's important to note that Gradient Boosting is a powerful
ensemble method, and with proper hyperparameter tuning, it
can perform well on various classification tasks. The Grid
Search helps in finding the optimal combination of
hyperparameters that maximizes the model's performance on
the given data. The final selected model can then be used for
making predictions on new, unseen data.
 
Output with Standardization Scaling:
GradientBoosting Classifier (Best Estimators: 100)
Standardization
accuracy: 0.744
recall: 0.744
precision: 0.7396456479690523
f1: 0.7374037522863862
precision recall f1-score
support
 
0 0.72 0.81 0.76
42
1 0.70 0.55 0.61
42
2 0.80 0.88 0.84
41
 
accuracy 0.74
125
macro avg 0.74 0.75 0.74
125
weighted avg 0.74 0.74 0.74 125
Best Hyperparameters for Standardization:
{'max_depth': 10, 'max_features': 0.8, 'n_estimators':
100, 'subsample': 0.6}
 
The results of using standardized feature scaling are shown in Figure
90 – 93. The GradientBoosting Classifier with Standardization
Scaling has achieved an accuracy of 0.744 on the test set. The model
performs well in predicting all three classes, with recall values of
0.744 for each class, indicating that it can correctly identify a
substantial portion of the true positive cases for each class. The
precision values are also reasonably high, indicating that when the
model predicts a class, it is likely to be correct.
 
The F1-score, which is the harmonic mean of precision and recall, is
0.737, which is a balanced measure that considers both false
positives and false negatives.
 
The classification report shows performance metrics for each class
(0, 1, and 2), along with the macro-averaged and weighted-averaged
metrics for the entire test set. The macro average calculates the mean
of the metrics for each class, giving equal weight to each class. The
weighted average calculates the mean of the metrics, weighted by
the number of samples in each class.
 
The best hyperparameters found for this model with Standardization
Scaling are max_depth=10, max_features=0.8, n_estimators=100,
and subsample=0.6. These hyperparameters control the depth of the
individual trees, the number of features to consider when splitting
nodes, the number of boosting stages, and the fraction of samples
used for fitting individual trees, respectively.
 
In conclusion, the GradientBoosting Classifier with Standardization
Scaling has performed well on the given dataset, and the model's
hyperparameters have been tuned to achieve the best possible
performance. It can be used for making predictions on new, unseen
data and is suitable for the classification task at hand.
 
Figure 90 The confusion matrix of GB model with standardized
feature scaling
 

Figure 91 The true values versus predicted values of GB model with


standardized feature scaling
 
Figure 92 The learning curve of GB model with standardized
feature scaling
 

Figure 93 The decision boundary of GB model with raw feature


scaling
 
Extreme Gradient Boosting Classifier and Grid Search
Step Run XGBoost classifier on three feature scaling:
1  
1 #Extreme Gradient Boosting Classifier
2 # XGBoost Classifier Grid Search
3 for fc_name, value in
4 feature_scaling.items():
5 X_train, X_test, y_train, y_test = value
6  
7 # Define the parameter grid for the grid
search
8
param_grid = {
9
'n_estimators': [100, 200, 300],
10
'max_depth': [10, 20, 30],
11
'learning_rate': [0.01, 0.1, 0.2],
12
'subsample': [0.6, 0.8, 1.0],
13
'colsample_bytree': [0.6, 0.8, 1.0],
14
}
15
 
16
# Initialize the XGBoost classifier
17
xgb = XGBClassifier(random_state=2021,
18
use_label_encoder=False,
19 eval_metric='mlogloss')
20  
21 # Create GridSearchCV with the XGBoost
22 classifier and the parameter grid
23 grid_search = GridSearchCV(xgb,
24 param_grid, cv=3, scoring='accuracy',
25 n_jobs=-1)
26  
27 # Train and perform grid search
28 grid_search.fit(X_train, y_train)
29  
30 # Get the best XGBoost classifier model from
31 the grid search
32 best_model = grid_search.best_estimator_
33  
34 # Evaluate and plot the best model (setting
proba=True for probability prediction)
35
run_model(f'XGB Classifier (Best
36 Estimators:
37 {grid_search.best_params_["n_estimators"]})',
38 best_model, X_train, X_test,
39 y_train, y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
 
The purpose of the code is to perform hyperparameter tuning
for two different classifiers, namely the Gradient Boosting
Classifier and the Extreme Gradient Boosting (XGBoost)
Classifier, using Grid Search with Cross-Validation.
 
The code aims to find the best combination of
hyperparameters for each classifier that results in the highest
accuracy on the given dataset. It explores various
hyperparameter values using the Grid Search technique,
which involves evaluating multiple combinations of
hyperparameters through cross-validation. The code does this
for different feature scaling methods (e.g., Raw,
Normalization, Standardization) to understand how each
scaling method affects the model's performance.
 
By running the code, the output will provide insights into the
best hyperparameters for each classifier and each feature
scaling method. It will also show the corresponding accuracy,
precision, recall, and F1-score metrics for each model. This
information allows for the selection of the most suitable
model configuration for making predictions on new, unseen
data, taking into account the impact of different feature
scaling techniques.
 
Output with Standardized Scaling:
XGB Classifier (Best Estimators: 100)
Standardization
accuracy: 0.752
recall: 0.752
precision: 0.7494697947214076
f1: 0.7437256630347197
precision recall f1-score
support
 
0 0.70 0.83 0.76
42
1 0.71 0.52 0.60
42
2 0.84 0.90 0.87
41
 
accuracy 0.75
125
macro avg 0.75 0.75 0.74
125
weighted avg 0.75 0.75 0.74
125
 
Best Hyperparameters for Standardization:
{'colsample_bytree': 1.0, 'learning_rate': 0.2,
'max_depth': 20, 'n_estimators': 100, 'subsample':
1.0}
 
The results of using standardized feature scaling are shown in
Figure 94 – 97.
 
Explanation:
The XGBoost model achieved an
accuracy of approximately 75.2%
on the test set using Standardized
Scaling.
The recall, which measures the
model's ability to correctly identify
positive samples (sensitivity), is
approximately 75.2%.
The precision, which measures the
proportion of true positive
predictions among all positive
predictions, is approximately
74.9%.
The F1-score, which balances
precision and recall, is
approximately 74.4%.
The model performed best in
classifying samples of category 2
(recall = 0.90, precision = 0.84)
compared to category 1 (recall =
0.52, precision = 0.71) and category
0 (recall = 0.83, precision = 0.70).
 
Best Hyperparameters:
The best hyperparameters for the
XGBoost model with Standardized
Scaling are as follows:
'colsample_bytree': 1.0 (percentage
of features used in each tree)
'learning_rate': 0.2 (step size
shrinkage to prevent overfitting)
'max_depth': 20 (maximum depth of
each tree)
'n_estimators': 100 (number of
boosting rounds)
'subsample': 1.0 (percentage of
samples used for fitting each tree)
Conclusion:
The XGBoost model with Standardized Scaling achieved
good overall performance, with an accuracy of 75.2% and
competitive precision, recall, and F1-score. The selected
hyperparameters optimized the model's performance based on
the given dataset. The choice of feature scaling method, in this
case, Standardization, likely contributed to the model's
effectiveness. These results can be used to make predictions
on new data with similar features, taking into account the
learned patterns from the training set.
 
Figure 94 The confusion matrix of XGBoost model with
standardized feature scaling
 

Figure 95 The true values versus predicted values of


XGBoost model with standardized feature scaling
 

Figure 96 The decision boundary of XGBoost model with


standardized feature scaling
 
Figure 97 The learning curve of XGBoost model with
standardized feature scaling
 
 
 
Multi-Layer Perceptron Classifier and Grid Search
Step Run MLP classifier on three feature scaling:
1  
1 # MLP Classifier Grid Search
2 for fc_name, value in
3 feature_scaling.items():
4 X_train, X_test, y_train, y_test =
value
5
 
6
# Define the parameter grid for the grid
7
search
8
param_grid = {
9
'hidden_layer_sizes': [(50,), (100,),
10 (50, 50), (100, 50), (100, 100)],
11 'activation': ['logistic', 'relu'],
12 'solver': ['adam', 'sgd'],
13 'alpha': [0.0001, 0.001, 0.01],
14 'learning_rate': ['constant',
15 'invscaling', 'adaptive'],
16 }
17  
18 # Initialize the MLP Classifier
19 mlp = MLPClassifier(random_state=2021)
20  
21 # Create GridSearchCV with the MLP
22 Classifier and the parameter grid
23 grid_search = GridSearchCV(mlp,
param_grid, cv=3, scoring='accuracy',
24
n_jobs=-1)
25
 
26
# Train and perform grid search
27
grid_search.fit(X_train, y_train)
28
 
29
# Get the best MLP Classifier model from
30 the grid search
31 best_model =
32 grid_search.best_estimator_
33  
34 # Evaluate and plot the best model
35 (setting proba=True for probability
prediction)
36
run_model('MLP Classifier',
best_model, X_train, X_test, y_train,
y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The code performs a Grid Search with Cross-Validation for
hyperparameter tuning on the MLP (Multi-Layer Perceptron)
Classifier using different feature scaling methods. Here's an
explanation of the code:
1. Iterate over different feature
scaling methods (contained in the
'feature_scaling' dictionary) that
were previously applied to the
dataset. Each feature scaling
method includes a split of the data
into training and test sets (X_train,
X_test, y_train, y_test).
2. Define a parameter grid
('param_grid') that specifies the
hyperparameters to be tuned during
the grid search. It includes various
values for the following
hyperparameters:
'hidden_layer_sizes': A tuple
representing the number of
units in each hidden layer.
'activation': The activation
function for the hidden layers.
'solver': The optimization
algorithm for weight
optimization.
'alpha': L2 penalty
(regularization term) for
weight optimization.
'learning_rate': The learning
rate schedule for weight
updates.
3. Initialize the MLPClassifier with
default parameters, and set
'random_state' for reproducibility.
4. Create a GridSearchCV object
('grid_search') with the MLP
Classifier and the parameter grid. It
performs a 3-fold cross-validation
and uses 'accuracy' as the scoring
metric to evaluate model
performance.
5. Train the MLP Classifier and
perform the grid search, trying
different combinations of
hyperparameters to find the best
model.
6. Get the best MLP Classifier model
from the grid search, which has the
highest cross-validated accuracy.
7. Evaluate and plot the best model's
performance using the 'run_model'
function, setting 'proba=True' to
enable probability predictions.
8. Print the best hyperparameters
found for each feature scaling
method.
The code essentially tunes hyperparameters of the MLP
Classifier to find the best configuration that maximizes
accuracy using different feature scaling methods. It then
evaluates the best model's performance on the test set to
assess its generalization ability.
 
By executing this code with various feature scaling methods,
you can identify the optimal hyperparameter settings for the
MLP Classifier and assess its performance under different
scaling conditions, helping to choose the best model for your
specific dataset.
 
Output with Standardized Scaling:
MLP Classifier
Standardization
accuracy: 0.68
recall: 0.68
precision: 0.6805249709639954
f1: 0.6801927710843373
precision recall f1-score
support
 
0 0.76 0.74 0.75
42
1 0.60 0.60 0.60
42
2 0.69 0.71 0.70
41
 
accuracy 0.68
125
macro avg 0.68 0.68 0.68
125
weighted avg 0.68 0.68 0.68
125
 
Best Hyperparameters for Standardization:
{'activation': 'relu', 'alpha': 0.001,
'hidden_layer_sizes': (100, 100),
'learning_rate': 'constant', 'solver': 'adam'}
 
The results of using standardized feature scaling are shown
in Figure 98 – 101.
 
Explanation:
The MLP Classifier was trained on
data with standardized scaling,
where each feature has zero mean
and unit variance.
The model achieved an accuracy of
0.68 on the test set, indicating that
it correctly classified 68% of the
samples.
The recall (sensitivity) for each
class is 0.68, meaning that the
model correctly identified 68% of
the samples for each class.
The precision for each class ranges
from 0.60 to 0.76, showing the
fraction of correct positive
predictions out of all positive
predictions made by the model.
The F1-score, which considers both
precision and recall, ranges from
0.60 to 0.75, providing a balanced
measure of model performance.
The support column represents the
number of samples in each class.
The best hyperparameters for the MLP Classifier with
standardized scaling are as follows:
Activation function: 'relu'
L2 regularization term (alpha):
0.001
Number of units in the hidden
layers: (100, 100)
Learning rate schedule: 'constant'
Optimization algorithm: 'adam'
Figure 98 The confusion matrix of MLP model with
standardized feature scaling
 
Conclusion:
The MLP Classifier achieved a reasonable accuracy of 0.68
on the test set when using standardized scaling. The model
seems to perform moderately well in terms of precision,
recall, and F1-score for each class. The chosen
hyperparameters might not be the optimal ones, but they
provide a good starting point for further fine-tuning. To
improve the model's performance, more extensive
hyperparameter tuning and feature engineering might be
necessary. Additionally, trying different feature scaling
methods and assessing their impact on the model's
performance could be beneficial.
 

Figure 99 The true values versus predicted values of MLP


model with standardized feature scaling
 
Figure 100 The decision boundary of MLP model with
standardized feature scaling

Figure 101 The learning curve of MLP model with


standardized feature scaling
 
 
 
Light Gradient Boosting Classifier and Grid Search
Step Run LGBM classifier on three feature scaling:
1  
1 #LGBM Classifier
2 # Define the parameter grid for grid
3 search
4 param_grid = {
5 'max_depth': [10, 20, 30],
6 'n_estimators': [100, 200, 300],
7 'subsample': [0.6, 0.8, 1.0],
'random_state': [2021]
8 }
9  
10 # Initialize the LightGBM classifier
11 lgbm = LGBMClassifier()
12  
13 # Grid Search
14 for fc_name, value in
15 feature_scaling.items():
16 X_train, X_test, y_train, y_test =
value
17
 
18
# Create GridSearchCV with the LightGBM
19
classifier and the parameter grid
20
grid_search = GridSearchCV(lgbm,
21 param_grid, cv=3, scoring='accuracy',
22 n_jobs=-1)
23  
24 # Train and perform grid search
25 grid_search.fit(X_train, y_train)
26  
27 # Get the best LightGBM classifier model
28 from the grid search
29 best_model =
grid_search.best_estimator_
30
 
31
# Evaluate and plot the best model
32
(setting proba=True for probability
33 prediction)
34 run_model('LGBM Classifier',
35 best_model, X_train, X_test, y_train,
y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for
{fc_name}:")
print(grid_search.best_params_)
 
The code performs a grid search using the LightGBM
(LGBM) classifier on different datasets with different feature
scaling methods. It aims to find the best hyperparameters for
the LGBM model that result in the highest accuracy on the
test set. Here's a breakdown of the purpose of each part of the
code:
1. Define the Parameter Grid: A
dictionary param_grid is created
with hyperparameters to be tuned
during the grid search. It includes
the maximum depth of the trees
(max_depth), the number of
estimators (trees) in the ensemble
(n_estimators), the subsample ratio
of data used for training each tree
(subsample), and a fixed random
state value (random_state) for
reproducibility.
2. Initialize the LightGBM Classifier:
The LightGBM classifier is
initialized without any specific
hyperparameters set. This will be
later updated with the best
hyperparameters found during the
grid search.
3. Grid Search Loop: The code loops
over each dataset with different
feature scaling methods
(feature_scaling dictionary). For
each dataset, it performs a grid
search using GridSearchCV with
the LightGBM classifier and the
defined parameter grid
(param_grid). The grid search uses
3-fold cross-validation and
measures the accuracy as the
scoring metric.
4. Train and Perform Grid Search:
For each dataset, the LightGBM
classifier is trained using the
training set and the grid search is
performed to find the best
hyperparameters.
5. Get the Best Model: After the grid
search is complete for each dataset,
the best LightGBM classifier
model (best_estimator_) is
obtained based on the
hyperparameters that resulted in
the highest accuracy.
6. Evaluate and Plot the Best Model:
The best model is evaluated on the
test set, and its performance
metrics (accuracy, precision, recall,
F1-score) are calculated. The
results are also plotted (setting
proba=True for probability
prediction).
7. Print Best Hyperparameters: The
best hyperparameters found for
each dataset (feature scaling
method) are printed.
The code allows for a systematic hyperparameter search
across different datasets with various feature scaling methods
to identify the optimal hyperparameters for the LightGBM
classifier. The best model is then evaluated to determine how
well it performs on the unseen test data. This process helps in
finding a well-tuned model that generalizes well to new,
unseen data.
 
Output with Standardized Scaling:
LGBM Classifier
Standardization
accuracy: 0.776
recall: 0.776
precision: 0.7749333333333334
f1: 0.7723355969331872
precision recall f1-score
support
 
0 0.75 0.86 0.80
42
1 0.74 0.62 0.68
42
2 0.83 0.85 0.84
41
 
accuracy 0.78
125
macro avg 0.78 0.78 0.77
125
weighted avg 0.77 0.78 0.77
125
 
Best Hyperparameters for Standardization:
{'max_depth': 10, 'n_estimators': 300,
'random_state': 2021, 'subsample': 0.6}
 
The results of using standardized feature scaling are shown
in Figure 102 – 105.
Analysis and Conclusion:
The output shows the performance of the LGBM Classifier
on the standardized dataset (Standardization) and the best
hyperparameters found during the grid search. The
evaluation metrics on the test set are as follows:
Accuracy: 0.776
Recall: 0.776
Precision: 0.7749333333333334
F1-score: 0.7723355969331872
The precision, recall, and F1-score for each class (0, 1, 2) are
reasonably high, indicating that the classifier is performing
well in distinguishing between different classes. The
accuracy of 0.776 suggests that approximately 77.6% of the
test samples were classified correctly.
 
Best Hyperparameters:
max_depth: 10
n_estimators: 300
random_state: 2021
subsample: 0.6
The best hyperparameters are obtained from the grid search,
which is a systematic way of finding optimal parameter
combinations. These hyperparameters help the LGBM
classifier to achieve the best performance on the test set for
the standardized dataset.
 
Overall, the LGBM Classifier with standardized scaling and
the identified hyperparameters demonstrates robust
performance in classifying the data into different classes. It
outperforms some of the other classifiers previously
evaluated, indicating that it is a promising model for this
specific task.
 
Figure 102 The confusion matrix of LGBM model with
standardized feature scaling
 

Figure 103 The true values versus predicted values of


LGBM model with standardized feature scaling
 

Figure 104 The decision boundary of LGBM model with


standardized feature scaling
Figure 105 The learning curve of LGBM model with
standardized feature scaling
 
 
Following is the full version of pancreatic.py:
 
#pancreatic.py
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import os
import plotly.graph_objs as go
import joblib
import itertools
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV,Strati
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, f1_score, plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
 
#Reads dataset
curr_path = os.getcwd()
df = pd.read_csv(curr_path+"/Debernardi et al 2020 data.csv")
print(df.iloc[:,0:8].head().to_string())
print(df.iloc[:,8:14].head().to_string())
 
#Checks shape
print(df.shape)
 
#Reads columns
print("Data Columns --> ",df.columns)
 
#Checks dataset information
print(df.info())
 
#Drops irrelevant columns
df = df.drop(columns=['sample_id','patient_cohort','sample_origin','stage','benign_sample_dia
 
#Checks null values
print(df.isnull().sum())
print('Total number of null values: ', df.isnull().sum().sum())
 
#Imputes missing values in plasma_CA19_9 with mean
df['plasma_CA19_9'].fillna((df['plasma_CA19_9'].mean()), inplace=True)
 
#Imputes missing value in REG1A with mean
df['REG1A'].fillna((df['REG1A'].mean()), inplace=True)
 
#Checks null values
print(df.isnull().sum())
print('Total number of null values: ', df.isnull().sum().sum())
 
#Looks at statistical description of data
print(df.describe().iloc[:,0:5].to_string())
print(df.describe().iloc[:,5:10].to_string())
 
#Defines function to create pie chart and bar plot as subplots
def plot_piechart(df, var, title=''):
plt.figure(figsize=(25, 10))
plt.subplot(121)
label_list = list(df[var].value_counts().index)
colors = sns.color_palette("husl", len(label_list))
df[var].value_counts().plot.pie(autopct="%1.1f%%", \
colors=colors, \
startangle=60, labels=label_list, \
wedgeprops={"linewidth": 3, "edgecolor": "k"}, \
shadow=True, textprops={'fontsize': 20})
plt.title("Distribution of " + var + " variable " + title, fontsize=25)
 
value_counts = df[var].value_counts()
# Print percentage values
percentages = value_counts / len(df) * 100
print("Percentage values:")
print(percentages)
 
plt.subplot(122)
ax = df[var].value_counts().plot(kind="barh")
 
for i, j in enumerate(df[var].value_counts().values):
ax.text(.7, i, j, weight="bold", fontsize=20)
 
plt.title("Count of " + var + " cases " + title, fontsize=25)
# Print count values
print("Count values:")
print(value_counts)
plt.show()
 
plot_piechart(df,'diagnosis')
 
# Looks at distribution of all features in the whole original dataset
columns = list(df.columns)
columns.remove('diagnosis')
plt.subplots(figsize=(45, 50))
length = len(columns)
color_palette = sns.color_palette("Set3", n_colors=length) # Define color palette
 
for i, j in itertools.zip_longest(columns, range(length)):
plt.subplot((length // 2), 4, j + 1)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
ax = df[i].hist(bins=10, edgecolor='black', color=color_palette[j]) # Set color for each
for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_hei
ha='center',
va='center', xytext=(0, 10), weight="bold", fontsize=17, textcoords='offs
 
plt.title(i, fontsize=30) # Adjust title font size
plt.show()
 
from tabulate import tabulate
def another_versus_diagnosis(feat, num_bins):
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(30, 22))
plt.subplots_adjust(wspace=0.5, hspace=0.25)

colors = sns.color_palette("Set2")
diagnosis_labels = {1: 'Control (No Pancreatic Disease)',
2: 'Benign Hepatobiliary Disease',
3: 'Pancreatic Cancer'}

data = {}

for diagnosis_code, ax in zip([1, 2, 3], axes):


subset_data = df[df['diagnosis'] == diagnosis_code][feat]
subset_data.plot(ax=ax, kind='hist', bins=num_bins, edgecolor='black',
color=colors[diagnosis_code-1])

ax.set_title(diagnosis_labels[diagnosis_code], fontsize=30)
ax.set_xlabel(feat, fontsize=30)
ax.set_ylabel('Count', fontsize=30)

patch_data = []
for p in ax.patches:
x = p.get_x() + p.get_width() / 2.
y = p.get_height()
ax.annotate(format(y, '.0f'), (x, y), ha='center', va='center', xytext=(0, 10),
weight="bold", fontsize=25, textcoords='offset points')
patch_data.append([x, y])

data[diagnosis_labels[diagnosis_code]] = patch_data

plt.show()
 
for diagnosis_label, patch_data in data.items():
print(diagnosis_label + ":")
print(tabulate(patch_data, headers=[feat, diagnosis_label]))
print()

#Looks at plasma_CA19_9 feature distribution by diagnosis feature


another_versus_diagnosis("plasma_CA19_9", 10)
 
#Looks at creatinine feature distribution by diagnosis feature
another_versus_diagnosis("creatinine", 10)
 
#Looks at LYVE1 feature distribution by diagnosis feature
another_versus_diagnosis("LYVE1", 10)
 
#Looks at REG1B feature distribution by diagnosis feature
another_versus_diagnosis("REG1B", 10)
 
#Looks at TFF1 feature distribution by diagnosis feature
another_versus_diagnosis("TFF1", 10)
 
#Looks at REG1A feature distribution by diagnosis feature
another_versus_diagnosis("REG1A", 10)
 
#Creates a dummy dataframe for visualization
df_dummy=df.copy()
 
#Categorizes diagnosis feature
def cat_diagnosis(n):
if n == 1:
return 'Control (No Pancreatic Disease)'
if n == 2:
return 'Benign Hepatobiliary Disease'
else:
return 'Pancreatic Cancer'
df_dummy['diagnosis'] = df_dummy['diagnosis'].apply(lambda x: cat_diagnosis(x))
 
def put_label_stacked_bar(ax,fontsize):
#patches is everything inside of the chart
for rect in ax.patches:
# Find where everything is located
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
y = rect.get_y()

# The height of the bar is the data value and can be used as the label
label_text = f'{height:.0f}'

# ax.text(x, y, text)
label_x = x + width / 2
label_y = y + height / 2
 
# plots only when height is greater than specified value
if height > 0:
ax.text(label_x, label_y, label_text, \
ha='center', va='center', \
weight = "bold",fontsize=fontsize)

#Plots one variable against another variable


def dist_one_vs_another_plot(df, cat1, cat2):
fig = plt.figure(figsize=(25, 15))
ax1 = fig.add_subplot(111)
group_by_stat = df.groupby([cat1, cat2]).size()
stacked_data = group_by_stat.unstack()
group_by_stat.unstack().plot(kind='bar', stacked=True, ax=ax1, grid=True)
ax1.set_title('Stacked Bar Plot of ' + cat1 + ' (number of cases)', fontsize=30)
ax1.set_ylabel('Number of Cases', fontsize=20)
ax1.set_xlabel(cat1, fontsize=20)
put_label_stacked_bar(ax1,15)
plt.show()
 
# Group values by cat2
sentiment_groups = stacked_data.groupby(level=0, axis=0)
 
# Create table headers
headers = [cat2 for cat2 in stacked_data.columns]
 
# Create table rows with data
rows = []
for cat, group_data in sentiment_groups:
row_values = [str(val) for val in group_data.values.flatten()]
rows.append([cat] + row_values)
 
# Print the table
print(tabulate(rows, headers=headers, tablefmt='grid'))
 
#Categorizes age feature
labels = ['0-40', '40-50', '50-60','60-90']
df_dummy['age'] = pd.cut(df_dummy['age'], [0, 40, 50, 60, 90], labels=labels)
 
#Plots the distribution of age feature in pie chart and bar plot
plot_piechart(df_dummy,'age',)
 
#Plots diagnosis variable against age variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'age', 'diagnosis')
 
#Plots the distribution of sex feature in pie chart and bar plot
plot_piechart(df_dummy,'sex')
 
#Plots diagnosis variable against sex variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'sex', 'diagnosis')
 
#Categorizes plasma_CA19_9 feature
labels = ['0-100', '100-1000', '1000-10000','10000-35000']
df_dummy['plasma_CA19_9'] = pd.cut(df_dummy['plasma_CA19_9'], [0, 100, 1000, 10000, 35000],
labels=labels)
 
#Plots the distribution of plasma_CA19_9 feature in pie chart and bar plot
plot_piechart(df_dummy,'plasma_CA19_9')
 
#Plots diagnosis variable against plasma_CA19_9 variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'plasma_CA19_9', 'diagnosis')
 
#Categorizes creatinine feature
labels = ['0-0.5', '0.5-1', '1-2','2-5']
df_dummy['creatinine'] = pd.cut(df_dummy['creatinine'], [0, 0.5, 1, 2, 5], labels=labels)
 
#Plots the distribution of creatinine feature in pie chart and bar plot
plot_piechart(df_dummy,'creatinine')
 
#Plots diagnosis variable against creatinine variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'creatinine', 'diagnosis')
 
#Checks dataset information
print(df_dummy.info())
 
#Extracts categorical and numerical columns
cat_cols = [col for col in df_dummy.columns if (df_dummy[col].dtype == 'object' or
df_dummy[col].dtype.name == 'category')]
num_cols = [col for col in df_dummy.columns if (df_dummy[col].dtype != 'object' and
df_dummy[col].dtype.name != 'category')]
 
print(cat_cols)
print(num_cols)
 
#Checks numerical features density distribution
# Define a custom color palette
colors = sns.color_palette("husl", len(num_cols))
 
# Checks numerical features density distribution
fig = plt.figure(figsize=(30, 20))
plotnumber = 1
 
for i, column in enumerate(num_cols):
if plotnumber <= 6:
ax = plt.subplot(2, 2, plotnumber)
sns.distplot(df_dummy[column], color=colors[i]) # Use the custom color for the plot
plt.xlabel(column, fontsize=40)
for p in ax.patches:
ax.annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get
ha='center', va='center', xytext=(0, 10), weight="bold", fontsize=30, textcoords='offset poin
plotnumber += 1
 
fig.suptitle('The density of numerical features', fontsize=50)
plt.tight_layout()
plt.show()
 
#Checks categorical features distribution
fig=plt.figure(figsize = (35, 25))
plotnumber = 1
for column in cat_cols:
if plotnumber <= 6:
ax = plt.subplot(2, 3, plotnumber)
sns.countplot(df_dummy[column], palette = 'Spectral_r')
plt.xlabel(column,fontsize=40)
for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get
ha = 'center', va = 'center', xytext = (0, 10), weight = "bold",fontsize=30, textcoords = 'of
points')
 
plotnumber += 1
fig.suptitle('The distribution of categorical features distribution', fontsize=50)
plt.tight_layout()
plt.show()
 
def plot_four_versus_one(df, column_names, feat):
num_plots = len(column_names)
num_rows = num_plots // 2 + num_plots % 2
fig, ax = plt.subplots(num_rows, 2, figsize=(20, 13), facecolor='#fbe7dd')
 
for i, column in enumerate(column_names):
current_ax = ax[i // 2, i % 2]
g = sns.countplot(df[column], hue=df[feat], palette='Spectral_r', ax=current_ax)

for p in g.patches:
g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_
ha='center', va='center', xytext=(0, 10), weight="bold", fontsize=20, textcoords='offset poin

current_ax.set_xlabel(column, fontsize=20)
current_ax.set_ylabel("Count", fontsize=20)
current_ax.tick_params(axis='x', labelsize=15)
current_ax.tick_params(axis='y', labelsize=15)

plt.tight_layout()
plt.show()

#Plots distribution of number of cases of four categorical features versus diagnosis


column_names = ["age", "sex", "plasma_CA19_9", "creatinine"]
plot_four_versus_one(df_dummy, column_names, "diagnosis")
 
 
#Plots distribution of number of cases of four categorical features versus creatinine
column_names = ["age", "sex", "plasma_CA19_9", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "creatinine")
 
#Plots distribution of number of cases of four categorical features versus age
column_names = ["creatinine", "sex", "plasma_CA19_9", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "age")
 
#Plots distribution of number of cases of four categorical features versus sex
column_names = ["creatinine", "age", "plasma_CA19_9", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "sex")
 
#Plots distribution of number of cases of four categorical features versus plasma_CA19_9
column_names = ["creatinine", "age", "sex", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "plasma_CA19_9")
 
#Categorizes diagnosis feature
def cat_diagnosis(n):
if n == 1:
return 'Control (No Pancreatic Disease)'
if n == 2:
return 'Benign Hepatobiliary Disease'
else:
return 'Pancreatic Cancer'

#Plots distribution of age and sex versus diagnosis in pie chart


def plot_piechart_diagnosis(df, feat1, feat2):
gs0 = df_dummy[df_dummy.diagnosis == 'Control (No Pancreatic Disease)'][feat1].value_coun
gs1 = df_dummy[df_dummy.diagnosis == 'Benign Hepatobiliary Disease'][feat1].value_counts(
gs2 = df_dummy[df_dummy.diagnosis == 'Pancreatic Cancer'][feat1].value_counts()
ss0 = df_dummy[df_dummy.diagnosis == 'Control (No Pancreatic Disease)'][feat2].value_coun
ss1 = df_dummy[df_dummy.diagnosis == 'Benign Hepatobiliary Disease'][feat2].value_counts(
ss2 = df_dummy[df_dummy.diagnosis == 'Pancreatic Cancer'][feat2].value_counts()
 
label_gs0=list(gs0.index)
label_gs1=list(gs1.index)
label_gs2=list(gs2.index)
label_ss0=list(ss0.index)
label_ss1=list(ss1.index)
label_ss2=list(ss2.index)
 
fig, ax = plt.subplots(2, 3, figsize=(35, 20), facecolor='#fbe7dd')
 
def print_percentage_table(data, labels, title):
percentages = [f'{(value / sum(data)) * 100:.1f}%' for value in data]
table_data = list(zip(labels, percentages))
headers = [feat1, 'Percentage']
print(f"\n{title}:")
print(tabulate(table_data, headers=headers, tablefmt='grid'))
 
def plot_pie(ax, data, labels, title):
ax.pie(data, labels=labels, shadow=True, autopct='%1.1f%%', textprops={'fontsize': 32
ax.set_xlabel(title, fontsize=30)
 
plot_pie(ax[0, 0], gs0, label_gs0, f"{feat1} feature")
print_percentage_table(gs0, label_gs0, 'diagnosis = Control (No Pancreatic Disease)')
 
plot_pie(ax[0, 1], gs1, label_gs1, f"{feat1} feature")
print_percentage_table(gs1, label_gs1, 'diagnosis = Benign Hepatobiliary Disease')
 
plot_pie(ax[0, 2], gs1, label_gs1, f"{feat1} feature")
print_percentage_table(gs1, label_gs2, 'diagnosis = Pancreatic Cancer')

plot_pie(ax[1, 0], ss0, label_ss0, f"{feat2} feature")


print_percentage_table(ss0, label_ss0, 'diagnosis = Control (No Pancreatic Disease)')
 
plot_pie(ax[1, 1], ss1, label_ss1, f"{feat2} feature")
print_percentage_table(ss1, label_ss1, 'diagnosis = Benign Hepatobiliary Disease')
 
plot_pie(ax[1, 2], ss1, label_ss1, f"{feat2} feature")
print_percentage_table(ss1, label_ss2, 'diagnosis = Pancreatic Cancer')

ax[0][0].set_title('diagnosis = Control (No Pancreatic Disease)',fontsize= 30)


ax[0][1].set_title('diagnosis = Benign Hepatobiliary Disease',fontsize= 30)
ax[0][2].set_title('diagnosis = Pancreatic Cancer',fontsize= 30)
plt.tight_layout()
plt.show()
 
#Plots distribution of age and sex versus diagnosis in pie chart
plot_piechart_diagnosis(df_dummy, "age", "sex")
 
#Plots distribution of plasma_CA19_9 and creatinine versus diagnosis in pie chart
plot_piechart_diagnosis(df_dummy, "plasma_CA19_9", "sex")
 
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10,5))
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.15, hspace=0.25)
 
background_color = "#fbe7dd"
sns.set_palette(['#ff355d','#ffd514'])
 
def feat_versus_other(feat,another,legend,ax0,label):
for s in ["right", "top"]:
ax0.spines[s].set_visible(False)
 
ax0.set_facecolor(background_color)
ax0_sns = sns.histplot(data=df, x=feat,ax=ax0,zorder=2,kde=False,hue=another,multiple="st
shrink=.8
,linewidth=0.3,alpha=1)
 
put_label_stacked_bar(ax0_sns,5)
ax0_sns.set_xlabel('',fontsize=4, weight='bold')
ax0_sns.set_ylabel('',fontsize=4, weight='bold')
 
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
 
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=background_color, fontsize=
bbox_to_anchor=(1, 0.989), loc='upper right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label)
plt.tight_layout()
 
def prob_feat_versus_other(feat,another,legend,ax0,label):
for s in ["right", "top"]:
ax0.spines[s].set_visible(False)
 
ax0.set_facecolor(background_color)
ax0_sns =
sns.kdeplot(x=feat,ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='g',zorder=2,alpha=1,multi
 
ax0_sns.set_xlabel('',fontsize=4, weight='bold')
ax0_sns.set_ylabel('',fontsize=4, weight='bold')
 
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
 
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=background_color, fontsize=
bbox_to_anchor=(1, 0.989), loc='upper right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label)
plt.tight_layout()
label_diag = list(df_dummy["diagnosis"].value_counts().index)
label_age = list(df_dummy["age"].value_counts().index)
label_plas = list(df_dummy["plasma_CA19_9"].value_counts().index)
label_sex = list(df_dummy["sex"].value_counts().index)

def hist_feat_versus_four_cat(feat,label):
ax0 = fig.add_subplot(gs[0, 0])
feat_versus_other(feat,df_dummy["diagnosis"],label_diag,ax0,"diagnosis versus " + label)
 
ax1 = fig.add_subplot(gs[0, 1])
feat_versus_other(feat,df_dummy["age"],label_age,ax1,"age versus " + label)
 
ax2 = fig.add_subplot(gs[1, 0])
feat_versus_other(feat,df_dummy["plasma_CA19_9"],label_plas,ax2,"plasma_CA19_9 versus " +
 
ax3 = fig.add_subplot(gs[1, 1])
feat_versus_other(feat,df_dummy["creatinine"],label_sex,ax3,"sex versus " + label)
 
def prob_feat_versus_four_cat(feat,label):
ax0 = fig.add_subplot(gs[0, 0])
prob_feat_versus_other(feat,df_dummy["diagnosis"],label_diag,ax0,"diagnosis versus " + la
 
ax1 = fig.add_subplot(gs[0, 1])
prob_feat_versus_other(feat,df_dummy["age"],label_age,ax1,"age versus " + label)
 
ax2 = fig.add_subplot(gs[1, 0])
prob_feat_versus_other(feat,df_dummy["plasma_CA19_9"],label_plas,ax2,"plasma_CA19_9 versu
 
ax3 = fig.add_subplot(gs[1, 1])
prob_feat_versus_other(feat,df_dummy["creatinine"],label_sex,ax3,"sex versus " + label)

 
#hist_feat_versus_four_cat(df_dummy["LYVE1"],"LYVE1")
prob_feat_versus_four_cat(df_dummy["LYVE1"],"LYVE1")
 
hist_feat_versus_four_cat(df_dummy["REG1B"],"REG1B")
prob_feat_versus_four_cat(df_dummy["REG1B"],"REG1B")
 
hist_feat_versus_four_cat(df_dummy["TFF1"],"TFF1")
prob_feat_versus_four_cat(df_dummy["TFF1"],"TFF1")
 
#hist_feat_versus_four_cat(df_dummy["REG1A"],"REG1A")
prob_feat_versus_four_cat(df_dummy["REG1A"],"REG1A")
 
#Converts sex feature to {0,1}
def map_sex(n):
if n == "F":
return 0

else:
return 1
df['sex'] = df['sex'].apply(lambda x: map_sex(x))
 
#Converts diagnosis feature to {0,1,2}
def map_diagnosis(n):
if n == 1:
return 0
if n == 2:
return 1
else:
return 2
df['diagnosis'] = df['diagnosis'].apply(lambda x: map_diagnosis(x))
 
#Extracts outuput and input variables
y = df['diagnosis'].values # Target for the model
X = df.drop(['diagnosis'], axis = 1)
 
#Feature Importance using RandomForest Classifier
names = X.columns
rf = RandomForestClassifier()
rf.fit(X, y)
 
result_rf = pd.DataFrame()
result_rf['Features'] = X.columns
result_rf ['Values'] = rf.feature_importances_
result_rf.sort_values('Values', inplace = True, ascending = False)
 
plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue")
plt.xlabel('Feature Importance', fontsize=30)
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
 
# Print the feature importance table
print("Feature Importance:")
print(result_rf)
 
#Feature Importance using ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
 
result_et = pd.DataFrame()
result_et['Features'] = X.columns
result_et ['Values'] = model.feature_importances_
result_et.sort_values('Values', inplace=True, ascending =False)
 
plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y = 'Features', data=result_et, color="red")
plt.xlabel('Feature Importance', fontsize=30)
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
 
# Print the feature importance table
print("Feature Importance:")
print(result_et)
 
#Feature Importance using RFE
from sklearn.feature_selection import RFE
model = LogisticRegression()
# create the RFE model
rfe = RFE(model)
rfe = rfe.fit(X, y)
 
result_lg = pd.DataFrame()
result_lg['Features'] = X.columns
result_lg ['Ranking'] = rfe.ranking_
result_lg.sort_values('Ranking', inplace=True , ascending = False)
 
plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange")
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
 
print("Feature Ranking:")
print(result_lg)
 
#Splits the data into training and testing
sm = SMOTE(random_state=42)
X,y = sm.fit_resample(X, y.ravel())
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 202
stratify=y)
X_train_raw = X_train.copy()
X_test_raw = X_test.copy()
y_train_raw = y_train.copy()
y_test_raw = y_test.copy()
 
X_train_norm = X_train.copy()
X_test_norm = X_test.copy()
y_train_norm = y_train.copy()
y_test_norm = y_test.copy()
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train_norm)
X_test_norm = norm.transform(X_test_norm)
 
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
y_train_stand = y_train.copy()
y_test_stand = y_test.copy()
scaler = StandardScaler()
X_train_stand = scaler.fit_transform(X_train_stand)
X_test_stand = scaler.transform(X_test_stand)
 
def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
if axes is None:
_, axes = plt.subplots(3, 1, figsize=(50, 50))
 
axes[0].set_title(title)
if ylim is not None:
axes[0].set_ylim(*ylim)
axes[0].set_xlabel("Training examples")
axes[0].set_ylabel("Score")
 
train_sizes, train_scores, test_scores, fit_times, _ = \
learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
 
# Plot learning curve
axes[0].grid()
axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score", lw=10)
axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score", lw=10)
axes[0].legend(loc="best")
axes[0].set_title('Learning Curve', fontsize=50)
axes[0].set_xlabel('Training Examples', fontsize=40)
axes[0].set_ylabel('Score', fontsize=40)
axes[0].tick_params(labelsize=30)
 
# Plot n_samples vs fit_times
axes[1].grid()
axes[1].plot(train_sizes, fit_times_mean, 'o-', lw=10)
axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
fit_times_mean + fit_times_std, alpha=0.1)
axes[1].set_xlabel("Training examples", fontsize=40)
axes[1].set_ylabel("fit_times", fontsize=40)
axes[1].set_title("Scalability of the model", fontsize=50)
axes[1].tick_params(labelsize=30)

# Plot fit_time vs score


axes[2].grid()
axes[2].plot(fit_times_mean, test_scores_mean, 'o-', lw=10)
axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1)
axes[2].set_xlabel("fit_times", fontsize=40)
axes[2].set_ylabel("Score", fontsize=40)
axes[2].set_title("Performance of the model", fontsize=50)
 
return plt
 
def plot_real_pred_val(Y_test, ypred, name):
plt.figure(figsize=(20,12))
acc=accuracy_score(Y_test,ypred)
plt.scatter(range(len(ypred)),ypred,color="blue",lw=5,label="Predicted")
plt.scatter(range(len(Y_test)), Y_test,color="red",label="Actual")
plt.title("Predicted Values vs True Values of " + name, fontsize=30)
plt.xlabel("Accuracy: " + str(round((acc*100),3)) + "%", fontsize=30)
plt.legend()
plt.grid(True, alpha=0.75, lw=1, ls='-.')
plt.show()
 
def plot_cm(Y_test, ypred, name):
fig, ax = plt.subplots(figsize=(25, 15))
cm = confusion_matrix(Y_test, ypred)
sns.heatmap(cm, annot=True, linewidth=0.7, linecolor='red', fmt='g', cmap="YlOrBr", annot
{"size": 30})
plt.title(name + ' Confusion Matrix', fontsize=30)
ax.xaxis.set_ticklabels(['Control (No Pancreatic Disease)', 'Benign Hepatobiliary Disease
'Pancreatic Cancer'], fontsize=20);
ax.yaxis.set_ticklabels(['Control (No Pancreatic Disease)', 'Benign Hepatobiliary Disease
'Pancreatic Cancer'], fontsize=20);
plt.xlabel('Y predict', fontsize=30)
plt.ylabel('Y test', fontsize=30)
plt.show()
return cm

#Plots ROC
def plot_roc(model,X_test, y_test, title):
Y_pred_prob = model.predict_proba(X_test)
Y_pred_prob = Y_pred_prob[:, 1]
 
fpr, tpr, thresholds = roc_curve(y_test, Y_pred_prob)
plt.figure(figsize=(25,15))
plt.plot([0,1],[0,1], color='navy', lw=10, linestyle='--')
plt.plot(fpr,tpr, color='red', lw=10)
plt.xlabel('False Positive Rate', fontsize=30)
plt.ylabel('True Positive Rate', fontsize=30)
plt.title('ROC Curve of ' + title, fontsize=30)
plt.grid(True)
plt.show()

def plot_decision_boundary(model,xtest, ytest, name):


plt.figure(figsize=(25, 15))
#Trains model with two features
model.fit(xtest, ytest)
 
plot_decision_regions(xtest.values, ytest.ravel(), \
clf=model, legend=2)
plt.title("Decision boundary for " + name + " (Test)", fontsize=30)
plt.xlabel("creatinine", fontsize=25)
plt.ylabel("LYVE1", fontsize=25)
plt.legend(fontsize=25)
plt.show()
 
#Chooses two features for decision boundary
feat_boundary = ['creatinine','LYVE1']
X_feature = X[feat_boundary]
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(X_feature, y, test_si
random_state = 2021, stratify=y)

def train_model(model, X, y):


model.fit(X, y)
return model
 
def predict_model(model, X, proba=False):
if ~proba:
y_pred = model.predict(X)
else:
y_pred_proba = model.predict_proba(X)
y_pred = np.argmax(y_pred_proba, axis=1)
 
return y_pred
 
list_scores = []
 
def run_model(name, model, X_train, X_test, y_train, y_test, fc, proba=False):
print(name)
print(fc)

model = train_model(model, X_train, y_train)


y_pred = predict_model(model, X_test, proba)

accuracy = accuracy_score(y_test, y_pred)


recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print('accuracy: ', accuracy)


print('recall: ',recall)
print('precision: ', precision)
print('f1: ', f1)
print(classification_report(y_test, y_pred))

plot_cm(y_test, y_pred, name)


plot_real_pred_val(y_test, y_pred, name)
plot_decision_boundary(model,X_test_feat, y_test_feat, name)
plot_learning_curve(model, name, X_train, y_train, cv=3);
plt.show()

list_scores.append({'Model Name': name, 'Feature Scaling':fc, 'Accuracy': accuracy, 'Reca


'Precision': precision, 'F1':f1})
 
feature_scaling = {
#'Raw':(X_train_raw, X_test_raw, y_train_raw, y_test_raw),
#'Normalization':(X_train_norm, X_test_norm, y_train_norm, y_test_norm),
'Standardization':(X_train_stand, X_test_stand, y_train_stand, y_test_stand),
}
 
#Support Vector Classifier
# Define the parameter grid for the Grid Search
param_grid = {
'C': [0.1, 1, 10], # Regularization parameter
'kernel': ['linear', 'rbf'], # Kernel type
}
 
# Create the SVC model with probability=True
model_svc = SVC(random_state=2021, probability=True)
 
# Perform Grid Search for each feature scaling method
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model_svc, param_grid=param_grid, cv=3, scoring='acc
n_jobs=-1)

# Perform Grid Search and fit the model


grid_search.fit(X_train, y_train)

# Get the best parameters and best model from the Grid Search
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model


run_model('SVC with ' + fc_name, best_model, X_train, X_test, y_train, y_test, fc_name, p
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)

#Logistic Regression Classifier


# Define the parameter grid for the grid search
param_grid = {
'C': [0.01, 0.1, 1, 10],
'penalty': ['l1', 'l2'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'],
}
 
# Initialize the Logistic Regression model
logreg = LogisticRegression(max_iter=5000, random_state=2021)
 
# Perform the grid search for each feature scaling method
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value

# Create GridSearchCV with the Logistic Regression model and the parameter grid
grid_search = GridSearchCV(logreg, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Train and perform grid search


grid_search.fit(X_train, y_train)

# Get the best Logistic Regression model from the grid search
best_model = grid_search.best_estimator_
# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model('Logistic Regression', best_model, X_train, X_test, y_train, y_test, fc_name, p

# Print the best hyperparameters found


print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
#KNN Classifier
# Define the parameter grid for the grid search
param_grid = {
'n_neighbors': list(range(2, 10))
}
 
# KNN Classifier Grid Search
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value

# Initialize the KNN Classifier


knn = KNeighborsClassifier()

# Create GridSearchCV with the KNN model and the parameter grid
grid_search = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Train and perform grid search


grid_search.fit(X_train, y_train)

# Get the best KNN model from the grid search


best_model = grid_search.best_estimator_

# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model(f'KNeighbors Classifier n_neighbors = {grid_search.best_params_["n_neighbors"]}
best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

# Print the best hyperparameters found


print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
#Decision Tree Classifier
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value

# Initialize the DecisionTreeClassifier model


dt_clf = DecisionTreeClassifier(random_state=2021)

# Define the parameter grid for the grid search


param_grid = {
'max_depth': np.arange(1, 51, 1),
'criterion': ['gini', 'entropy'],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
}

# Create GridSearchCV with the DecisionTreeClassifier model and the parameter grid
grid_search = GridSearchCV(dt_clf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Train and perform grid search


grid_search.fit(X_train, y_train)

# Get the best DecisionTreeClassifier model from the grid search


best_model = grid_search.best_estimator_

# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model(f'DecisionTree Classifier (Best Depth: {grid_search.best_params_["max_depth"]})
best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

# Print the best hyperparameters found


print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
#Random Forest Classifier
# Define the parameter grid for the grid search
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, 40, 50],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
 
# Initialize the RandomForestClassifier model
rf = RandomForestClassifier(random_state=2021)
 
# RandomForestClassifier Grid Search
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
# Create GridSearchCV with the RandomForestClassifier model and the parameter grid
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Train and perform grid search


grid_search.fit(X_train, y_train)

# Get the best RandomForestClassifier model from the grid search


best_model = grid_search.best_estimator_

# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model(f'RandomForest Classifier (Best Estimators: {grid_search.best_params_["n_estima
best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

# Print the best hyperparameters found


print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
#Gradient Boosting Classifier
# Initialize the GradientBoostingClassifier model
gbt = GradientBoostingClassifier(random_state=2021)
 
# Define the parameter grid for the grid search
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30],
'subsample': [0.6, 0.8, 1.0],
'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],
}
 
# GradientBoosting Classifier Grid Search
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
 
# Create GridSearchCV with the GradientBoostingClassifier model and the parameter grid
grid_search = GridSearchCV(gbt, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
 
# Train and perform grid search
grid_search.fit(X_train, y_train)
 
# Get the best GradientBoostingClassifier model from the grid search
best_model = grid_search.best_estimator_
 
# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model(f'GradientBoosting Classifier (Best Estimators:
{grid_search.best_params_["n_estimators"]})',
best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
#Extreme Gradient Boosting Classifier
# XGBoost Classifier Grid Search
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
 
# Define the parameter grid for the grid search
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30],
'learning_rate': [0.01, 0.1, 0.2],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
}
 
# Initialize the XGBoost classifier
xgb = XGBClassifier(random_state=2021, use_label_encoder=False, eval_metric='mlogloss')
 
# Create GridSearchCV with the XGBoost classifier and the parameter grid
grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
 
# Train and perform grid search
grid_search.fit(X_train, y_train)
 
# Get the best XGBoost classifier model from the grid search
best_model = grid_search.best_estimator_
 
# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model(f'XGB Classifier (Best Estimators: {grid_search.best_params_["n_estimators"]})'
best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
# MLP Classifier Grid Search
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
 
# Define the parameter grid for the grid search
param_grid = {
'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
'activation': ['logistic', 'relu'],
'solver': ['adam', 'sgd'],
'alpha': [0.0001, 0.001, 0.01],
'learning_rate': ['constant', 'invscaling', 'adaptive'],
}
 
# Initialize the MLP Classifier
mlp = MLPClassifier(random_state=2021)
 
# Create GridSearchCV with the MLP Classifier and the parameter grid
grid_search = GridSearchCV(mlp, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
 
# Train and perform grid search
grid_search.fit(X_train, y_train)
 
# Get the best MLP Classifier model from the grid search
best_model = grid_search.best_estimator_
 
# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model('MLP Classifier', best_model, X_train, X_test, y_train, y_test, fc_name, proba=
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)

#LGBM Classifier
# Define the parameter grid for grid search
param_grid = {
'max_depth': [10, 20, 30],
'n_estimators': [100, 200, 300],
'subsample': [0.6, 0.8, 1.0],
'random_state': [2021]
}
 
# Initialize the LightGBM classifier
lgbm = LGBMClassifier()
 
# Grid Search
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
 
# Create GridSearchCV with the LightGBM classifier and the parameter grid
grid_search = GridSearchCV(lgbm, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
 
# Train and perform grid search
grid_search.fit(X_train, y_train)
 
# Get the best LightGBM classifier model from the grid search
best_model = grid_search.best_estimator_
 
# Evaluate and plot the best model (setting proba=True for probability prediction)
run_model('LGBM Classifier', best_model, X_train, X_test, y_train, y_test, fc_name, proba
 
# Print the best hyperparameters found
print(f"Best Hyperparameters for {fc_name}:")
print(grid_search.best_params_)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
IMPLEMENTING
GRAPHICAL USER INTERFACE
USING PYQT
IMPLEMENTING
GRAPHICAL USER INTERFACE
USING PYQT
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Designing GUI
Step Now, you will create a GUI to implement how to classify and predict pancreatic cancer using
1 some machine learning algorithms. Open Qt Designer and choose Main Window template.
Save the form as gui_pancreatic.ui.
 
Step Put three Push Button widgets onto form. Set their text property as LOAD DATA, TRAIN ML
2 MODEL, and TRAIN DL MODEL. Set their objectName property as pbLoad, pbTrainML,
dan pbTrainDL.
 
Step Put two Table Widget onto form. Set their objectName properties as twData1 and twData2.
3  
Step Add two Label widgets onto form. Set their text properties as Label 1 and Label 2 and set their
4 objectName properties as label1 and label2.
 
Step Put three Widget from Containers panel onto form and set their ObjectName property as
5 widgetPlot1, widgetPlot2, and widgetPlot3.
 
Step Right click on the three Widgets and choose Promote to …. Set Promoted class name as
6 plot_class. Click Add and Promote button. In Object Inspector window, you can see that
widgetPlot1, widgetPlot2, and widgetPlot3 are now an object of plot_class as shown in Figure
106.
 

Figure 106 The widgetPlot1, widgetPlot2, and widgetPlot3 are now an object of plot_class
 
Step Write the definition of plot_class and save it as plot_class.py as follows:
7  
1 #plot_class.py
2 from PyQt5.QtWidgets import*
3 from matplotlib.backends.backend_qt5agg
4 import FigureCanvas
5 from matplotlib.figure import Figure
6
7 class plot_class(QWidget):
8 def __init__(self, parent = None):
9 QWidget.__init__(self, parent)
10 self.canvas = FigureCanvas(Figure())
11
12 vertical_layout = QVBoxLayout()
13
vertical_layout.addWidget(self.canvas)
14
15
16 self.canvas.axis1 =
17 self.canvas.figure.add_subplot(111)
18 self.canvas.axis1 =
self.canvas.figure.subplots_adjust(
19
20
top=0.936,
21
22 bottom=0.104,
23
24 left=0.047,
25
right=0.981,

hspace=0.2,

wspace=0.2
)
 
self.canvas.figure.set_facecolor("xkcd:sand"
)
self.setLayout(vertical_layout)
 
The purpose of the code is to define a custom PyQt5 widget called plot_class, which displays a
matplotlib plot on a QWidget. This custom widget is intended to be used as a part of a graphical
user interface (GUI) in a larger application.
 
Let's break down the code step-by-step:
1. from PyQt5.QtWidgets import*: Import necessary PyQt5
widgets and modules for building the GUI.
2. from matplotlib.backends.backend_qt5agg import
FigureCanvas: Import the FigureCanvas class from the
matplotlib.backends.backend_qt5agg module. This class
allows us to embed a matplotlib figure into a PyQt5
application.
3. from matplotlib.figure import Figure: Import the Figure
class from the matplotlib.figure module. It represents the
whole figure and contains one or more axes.
4. class plot_class(QWidget): Define a custom QWidget class
called plot_class. This class will inherit properties and
functionalities from the QWidget class.
5. def __init__(self, parent = None): Constructor method to
initialize the plot_class object. It takes an optional parent
argument, which represents the parent widget, and calls the
constructor of the base class (QWidget) using
QWidget.__init__(self, parent).
6. self.canvas = FigureCanvas(Figure()): Create a
FigureCanvas instance named canvas and pass it an empty
Figure instance. The Figure is the top-level container that
holds all the plot elements.
7. vertical_layout = QVBoxLayout(): Create a QVBoxLayout
instance named vertical_layout. QVBoxLayout is a layout
manager that arranges widgets in a vertical manner.
8. vertical_layout.addWidget(self.canvas): Add the canvas
(matplotlib figure) to the vertical_layout, so it will be
displayed vertically in the GUI.
9. self.canvas.axis1 = self.canvas.figure.add_subplot(111):
Create an axis (subplot) in the figure and store it in
self.canvas.axis1. The add_subplot(111) method creates a
single subplot within the figure.
10. self.canvas.axis1 = self.canvas.figure.subplots_adjust(...):
The subplots_adjust() method is used to adjust the
subplot parameters to create margins and spacings
between subplots.
It sets the position and spacing of the subplot within the
figure.
11. self.canvas.figure.set_facecolor("xkcd:sand"): Set the
background color of the figure to "xkcd:sand," which is a
light sand color from the xkcd color survey.
12. self.setLayout(vertical_layout): Set the layout of the
plot_class widget to the vertical_layout, which contains the
matplotlib figure.
Overall, the code defines a custom widget (plot_class) that contains a matplotlib figure displayed
on a QWidget. This custom widget can be embedded into a larger PyQt5 application to show
plots and visualizations as part of the GUI.
 
Step Add a Combo Box widget and set its objectName property as cbData. Let it empty. You will
8 populate it from the code.
 
Step Add another Combo Box widget and set its objectName property as cbClassifier. Populate this
9 widget with fourteen items as shown in Figure 107.
 

Figure 107 Populating cbClassifier widget with fourteen items


 
Step Add three radio buttons and set their text properties as Raw, Norm, and Stand. Then, set their
10 objectName as rbRaw, rbNorm, and rbStand.
 
Step Write this Python script and save it as gui_pancreatic.py:
11  
1 #gui_pancreatic.py
2 from PyQt5.QtWidgets import *
3 from PyQt5.uic import loadUi
4 from matplotlib.backends.backend_qt5agg
5 import (NavigationToolbar2QT as
NavigationToolbar)
6
from matplotlib.colors import
7
ListedColormap
8
 
9
class DemoGUI_Pancreatic(QMainWindow):
10
def __init__(self):
11
QMainWindow.__init__(self)
12
loadUi("gui_pancreatic.ui",self)
13
self.setWindowTitle(\
14
15 "GUI Demo of Classifying and Predicting
16 Pancreatic Cancer")
17 self.addToolBar(NavigationToolbar(\
18 self.widgetPlot1.canvas, self))
19
20 if __name__ == '__main__':
21 import sys
22 app = QApplication(sys.argv)
23 ex = DemoGUI_Pancreatic()
ex.show()
sys.exit(app.exec_())
 
The code is a PyQt5 application that creates a GUI for classifying and predicting pancreatic
cancer. Let's break down the code step-by-step:
1. from PyQt5.QtWidgets import *: Import necessary PyQt5
widgets and modules for building the GUI.
2. from PyQt5.uic import loadUi: Import the loadUi function
from PyQt5.uic. This function is used to load the user
interface (UI) file created with Qt Designer.
3. from matplotlib.backends.backend_qt5agg import
(NavigationToolbar2QT as NavigationToolbar): Import the
NavigationToolbar2QT class from
matplotlib.backends.backend_qt5agg. This class provides a
navigation toolbar for the matplotlib plot displayed in the
PyQt5 application.
4. from matplotlib.colors import ListedColormap: Import the
ListedColormap class from matplotlib.colors. This class is
used to create a custom colormap for plotting.
5. class DemoGUI_Pancreatic(QMainWindow): Define a
custom QMainWindow class called DemoGUI_Pancreatic.
This class will inherit properties and functionalities from the
QMainWindow class.
6. def __init__(self):: Constructor method to initialize the
DemoGUI_Pancreatic object.
7. QMainWindow.__init__(self): Call the constructor of the
base class (QMainWindow) using
QMainWindow.__init__(self).
8. loadUi("gui_pancreatic.ui",self): Load the UI file
"gui_pancreatic.ui" into the current instance of the
DemoGUI_Pancreatic class. The UI file was likely created
using Qt Designer and contains the layout and design of the
GUI.
9. self.setWindowTitle("GUI Demo of Classifying and
Predicting Pancreatic Cancer"): Set the window title for the
GUI application.
10. self.addToolBar(NavigationToolbar(self.widgetPlot1.canvas,
self)): Add a navigation toolbar to the widgetPlot1 canvas.
The widgetPlot1 is likely a custom widget containing a
matplotlib plot, and the navigation toolbar provides
functionality for interacting with the plot (e.g., zooming,
panning).
11. if __name__ == '__main__':: Check if the script is being run
as the main program.

Figure 108 The form when it first runs


 
12. import sys: Import the sys module for handling system-
related operations.
13. app = QApplication(sys.argv): Create a QApplication
instance, which represents the application and manages the
GUI event loop.
14. ex = DemoGUI_Pancreatic(): Create an instance of the
DemoGUI_Pancreatic class, which sets up the GUI.
15. ex.show(): Display the GUI application.
16. sys.exit(app.exec_()): Start the event loop of the GUI
application, waiting for user interactions and handling
events until the application is closed. sys.exit() ensures a
clean exit when the application is terminated.
Overall, the code sets up a PyQt5 application with a GUI window that loads a user interface
from the "gui_pancreatic.ui" file, adds a navigation toolbar to a custom widget (widgetPlot1),
and displays the GUI window with a title "GUI Demo of Classifying and Predicting Pancreatic
Cancer." This application can be run to show the GUI interface for classifying and predicting
pancreatic cancer using data visualization and machine learning techniques.
Step Run gui_prostate.py and click LOAD DATA button. You will see form’s layout as shown in
12 Figure 108.
 
 
 
Preprocessing Data and Populating Tables
Step Import all necessary modules:
1  
1 import numpy as np
2 import pandas as pd
3 import matplotlib.pyplot as plt
4 import seaborn as sns
5 sns.set_style('darkgrid')
6 import warnings
7 import mglearn
8 warnings.filterwarnings('ignore')
9 import os
10 import joblib
11 from numpy import save
12 from numpy import load
13 from os import path
14 from sklearn.metrics import
15 roc_auc_score,roc_curve
16 from sklearn.model_selection import
train_test_split, RandomizedSearchCV,
17
GridSearchCV,StratifiedKFold
18
from sklearn.preprocessing import
19 StandardScaler, MinMaxScaler
20
21 from sklearn.linear_model import
22 LogisticRegression
23 from sklearn.naive_bayes import GaussianNB
24 from sklearn.tree import
DecisionTreeClassifier
25
from sklearn.svm import SVC
26
from sklearn.ensemble import
27
RandomForestClassifier,
28 ExtraTreesClassifier
29 from sklearn.neighbors import
30 KNeighborsClassifier
31 from sklearn.ensemble import
32 AdaBoostClassifier,
GradientBoostingClassifier
33
from xgboost import XGBClassifier
34
from sklearn.neural_network import
35
MLPClassifier
36
from sklearn.linear_model import
37 SGDClassifier
38 from sklearn.preprocessing import
39 StandardScaler, LabelEncoder,
40 OneHotEncoder
41 from sklearn.metrics import
confusion_matrix, accuracy_score,
42
recall_score, precision_score
43
from sklearn.metrics import
classification_report, f1_score,
plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import
learning_curve
from mlxtend.plotting import
plot_decision_regions
import tensorflow as tf
from sklearn.base import clone
from sklearn.decomposition import PCA
 
 
Step Define write_df_to_qtable() and populate_table() methods to populate any
2 table widget with some data:
 
 
1 # Takes a df and writes it to a qtable
2 provided. df headers become qtable headers
3 @staticmethod
4 def write_df_to_qtable(df,table):
5 headers = list(df)
6 table.setRowCount(df.shape[0])
7 table.setColumnCount(df.shape[1])
8 table.setHorizontalHeaderLabels(headers)
9  
10 # getting data from df is computationally
costly so convert it to array first
11
df_array = df.values
12
for row in range(df.shape[0]):
13
for col in range(df.shape[1]):
14
table.setItem(row, col, \
15
QTableWidgetItem(str(df_array[row,col])))
16
 
17
def populate_table(self,data, table):
18
#Populates two tables
19
self.write_df_to_qtable(data,table)
20
21
table.setAlternatingRowColors(True)
22
table.setStyleSheet(\
23
"alternate-background-color:
24
#ffb07c;background-color: #e6daa6;");
25
 
The purpose of the code is to populate a Qt QTableWidget with data from a
pandas DataFrame. The code consists of two methods:
1. write_df_to_qtable(df, table): This static
method takes a pandas DataFrame df and a Qt
QTableWidget table as input. It sets up the
QTableWidget with the same number of rows
and columns as the DataFrame and fills it with
data from the DataFrame. The DataFrame
headers are used as the headers for the
QTableWidget columns.
2. populate_table(self, data, table): This method
populates two QTableWidgets with data. It
calls the write_df_to_qtable method to
populate the table with the data from the
pandas DataFrame data. After populating the
table, it sets the alternating row colors to make
it easier to read the data.
Here's a step-by-step explanation of the write_df_to_qtable() method:
headers = list(df): Get the list of column
headers from the DataFrame df.
table.setRowCount(df.shape[0]): Set the
number of rows in the QTableWidget table to
match the number of rows in the DataFrame
df.
table.setColumnCount(df.shape[1]): Set the
number of columns in the QTableWidget table
to match the number of columns in the
DataFrame df.
table.setHorizontalHeaderLabels(headers): Set
the column headers of the QTableWidget table
using the headers obtained from the
DataFrame df.
df_array = df.values: Convert the DataFrame
df into a NumPy array to make accessing the
data faster.
The two nested loops for row in
range(df.shape[0]): and for col in
range(df.shape[1]): iterate over each cell of the
DataFrame df.
table.setItem(row, col,
QTableWidgetItem(str(df_array[row, col]))):
For each cell, create a QTableWidgetItem and
set the text of the item to the corresponding
value from the DataFrame df.
The populate_table() method calls write_df_to_qtable() to populate the table
with data and then sets the alternating row colors to improve the visual
appearance of the table.
 
In summary, the purpose of these methods is to take data from a pandas
DataFrame and display it in a Qt QTableWidget with proper headers and
alternating row colors for better readability in a GUI application.
 
Step Define initial_state() method to disable some widgets when form initially
3 runs:
 
1 def initial_state(self, state):
2 self.pbTrainML.setEnabled(state)
3 self.cbData.setEnabled(state)
4 self.cbClassifier.setEnabled(state)
5 self.cbPredictionML.setEnabled(state)
6 self.rbRaw.setEnabled(state)
7 self.rbNorm.setEnabled(state)
8 self.rbStand.setEnabled(state)
 
 
Step Read dataset, drop irrelevant columns, impute missing values in
4 plasma_CA19_9 with mean values, impute missing value in REG1A with
mean values, create dummy dataset, convert diagnosis feature to {0,1,2},
convert sex feature to {0,1}, extract outuput and input variables, and
categorize some features in df_dummy for visualization:
 
1 def read_dataset(self, dir):
2 #Loads csv file
3 df = pd.read_csv(dir)
4
5 #Drops irrelevant columns
6 df = df.drop(columns=['sample_id','patient_cohort',\
7 'sample_origin','stage','benign_sample_diagnosis'])
8
9 #Imputes missing values in plasma_CA19_9 with mean
10
11 df['plasma_CA19_9'].fillna((df['plasma_CA19_9'].mean()),
\
12
inplace=True)
13
14  
15 #Imputes missing value in REG1A with mean
16 df['REG1A'].fillna((df['REG1A'].mean()),
17 inplace=True)
18
19 #Creates dummy dataset
20 df_dummy=df.copy()
21
22 #Converts diagnosis feature to {0,1,2}
23 df['diagnosis'] = df['diagnosis'].apply(lambda x: \
24 self.map_diagnosis(x))
25
26 #Converts sex feature to {0,1}
27 df['sex'] = df['sex'].apply(lambda x:
self.map_sex(x))
28
 
29
#Categorizes df_dummy for visualization
30
df_dummy = self.df_visual(df_dummy)
31
32
return df, df_dummy
33
 
34
#Converts sex feature to {0,1}
35
def map_sex(self,n):
36
if n == "F":
37
return 0
38
else:
39
return 1
40
41
#Converts diagnosis feature to {0,1,2}
42
def map_diagnosis(self,n):
43
if n == 1:
44
return 0
45
if n == 2:
46
return 1
47
else:
48
return 2
49
 
50
 
51
#Categorizes diagnosis feature
52
53 def cat_diagnosis(self,n):
54 if n == 1:
55 return 'Control (No Pancreatic Disease)'
56 if n == 2:
57 return 'Benign Hepatobiliary Disease'
58 else:
59 return 'Pancreatic Cancer'
60
61 def df_visual(self,df_dummy):
62 #Categorizes diagnosis_result feature
63 df_dummy['diagnosis'] =
64 df_dummy['diagnosis'].apply(\
65 lambda x: self.cat_diagnosis(x))
66
67 #Categorizes age feature
68 labels = ['0-40', '40-50', '50-60','60-90']
69 df_dummy['age'] = pd.cut(df_dummy['age'], \
70 [0, 40, 50, 60, 90], labels=labels)
71
72 #Categorizes plasma_CA19_9 feature
73 labels = ['0-100', '100-1000', '1000-10000','10000-
35000']
74
df_dummy['plasma_CA19_9'] = pd.cut(\
75
df_dummy['plasma_CA19_9'], \
76
[0, 100, 1000, 10000, 35000], labels=labels)
77
78
#Categorizes creatinine feature
labels = ['0-0.5', '0.5-1', '1-2','2-5']
df_dummy['creatinine'] =
pd.cut(df_dummy['creatinine'],\
[0, 0.5, 1, 2, 5], labels=labels)
 
return df_dummy
 
The code defines several methods used to preprocess and categorize a
dataset. The main function, read_dataset, reads a CSV file, performs some
preprocessing steps, and then creates a dummy dataset for visualization
purposes. Let's explain each part of the code:
1. read_dataset(self, dir): This method takes the
directory of a CSV file as input, reads the file
into a pandas DataFrame (df), drops some
irrelevant columns, imputes missing values in
specific columns with their means, converts
some columns to binary encoding, creates a
dummy dataset (df_dummy) for visualization,
and returns both the original DataFrame df and
the dummy dataset df_dummy.
2. map_sex(self, n): This method maps the "sex"
column to binary encoding, where "F" is
mapped to 0 and any other value is mapped to
1.
3. map_diagnosis(self, n): This method maps the
"diagnosis" column to three categories, where
1 is mapped to 0, 2 is mapped to 1, and any
other value is mapped to 2.
4. cat_diagnosis(self, n): This method
categorizes the "diagnosis" feature into
human-readable labels based on the numeric
mapping applied in map_diagnosis.
5. df_visual(self, df_dummy): This method takes
the dummy dataset df_dummy as input and
further categorizes some of its features for
visualization purposes. The "diagnosis"
column is categorized into human-readable
labels using the cat_diagnosis method. The
"age," "plasma_CA19_9," and "creatinine"
columns are converted into categorical
variables with appropriate labels.
In summary, the read_dataset() method reads a CSV file, preprocesses the
data by dropping irrelevant columns and imputing missing values, and
performs binary encoding for "sex" and numeric encoding for "diagnosis." It
also creates a dummy dataset with additional categorical labels for
visualization purposes. The methods map_sex, map_diagnosis, and
cat_diagnosis handle the mapping and categorization tasks, while df_visual
performs the additional categorization for the dummy dataset. The resulting
datasets are then available for further analysis and visualization in a GUI
application or other tasks.
 
Step Define populate_cbData() to populate cbData widget:
5  
1 def populate_cbData(self):
2 self.cbData.addItems(self.df)
3 self.cbData.addItems(["Features
4 Importance"])
5 self.cbData.addItems(["Correlation
Matrix", \
"Pairwise Relationship", "Features
Correlation"])
 
 
Step Define import_dataset() method to import dataset for machine learning
6 algorithms (df) and populate two table widgets with data and its description:
 
1 def import_dataset(self):
2 curr_path = os.getcwd()
3 dataset_dir = curr_path +
4 "/Prostate_Cancer.csv"
5  
6 #Loads csv file
7 self.df = self.read_dataset(dataset_dir)
8  
9 #Populates tables with data
10 self.populate_table(self.df, self.twData1)
11 self.label1.setText('Pancreatic Cancer Data')
12  
13 self.populate_table(self.df.describe(),
self.twData2)
14
self.twData2.setVerticalHeaderLabels(['Count'
15
, \
16
'Mean', 'Std', 'Min', '25%', '50%', '75%',
17 'Max'])
18 self.label2.setText('Data Desciption')
19
20 #Turns on pbTrainML widget
21 self.pbTrainML.setEnabled(True)
22  
23 #Turns off pbLoad
24 self.pbLoad.setEnabled(False)

#Populates cbData
self.populate_cbData()
 
 
The import_dataset() method is implemented in the GUI application. It
serves the purpose of importing a dataset, performing some initial data
processing, and updating the user interface accordingly. Let's break down the
steps performed in this method:
1. curr_path = os.getcwd(): This line gets the
current working directory.
2. dataset_dir = curr_path +
"/Prostate_Cancer.csv": This line creates the
full file path of the CSV dataset file named
"Prostate_Cancer.csv" in the current working
directory.
3. self.df = self.read_dataset(dataset_dir): This
line reads the CSV file using the read_dataset
method, which performs data preprocessing
and returns the original DataFrame self.df.
4. self.populate_table(self.df, self.twData1): This
line populates the first table (twData1) in the
GUI with the data from the DataFrame self.df,
effectively displaying the main dataset with its
headers in the GUI.
5. self.label1.setText('Pancreatic Cancer Data'):
This line updates a label (label1) in the GUI to
display the title "Pancreatic Cancer Data,"
which indicates the type of data displayed in
the first table.
6. self.populate_table(self.df.describe(),
self.twData2): This line populates the second
table (twData2) in the GUI with summary
statistics of the DataFrame self.df (e.g., count,
mean, standard deviation, min, 25th percentile,
median, 75th percentile, and max).
7. self.twData2.setVerticalHeaderLabels(['Count',
'Mean', 'Std', 'Min', '25%', '50%', '75%',
'Max']): This line updates the vertical header
labels of the second table (twData2) to
represent the summary statistics.
8. self.label2.setText('Data Desciption'): This line
updates a label (label2) in the GUI to display
the title "Data Description," which indicates
the type of data displayed in the second table.
9. self.pbTrainML.setEnabled(True): This line
enables a push button widget (pbTrainML) in
the GUI, which likely initiates the training of a
machine learning model on the loaded dataset.
10. self.pbLoad.setEnabled(False): This line
disables another push button widget (pbLoad)
in the GUI, which was responsible for loading
the dataset. This is to prevent the user from
reloading the dataset once it has been loaded.
11. self.populate_cbData(): This line populates a
combo box widget (cbData) in the GUI with
the column names of the dataset, likely for
users to choose specific columns for analysis
or visualization.
In summary, the import_dataset() method is responsible for importing the
dataset, displaying the main dataset in one table and summary statistics in
another table, updating labels to provide context for the displayed data, and
enabling/disabling certain GUI widgets for proper data handling and user
interaction. The method prepares the GUI for further data analysis or model
training tasks with the imported dataset.
 
Step Connect clicked() event of pbLoad widget with import_dataset() and put it
7 inside __init__() method as shown in line 8 and invoke initial_state()
methode in line 9:
 
1 def __init__(self):
2 QMainWindow.__init__(self)
3 loadUi("gui_pancreatic.ui",self)
4 self.setWindowTitle(\
5 "GUI Demo of Classifying and Predicting
6 Pancreatic Cancer")
7 self.addToolBar(NavigationToolbar(\
8 self.widgetPlot1.canvas, self))
9 self.pbLoad.clicked.connect(self.import_dataset
)
self.initial_state(False)
 
 

Figure 109 The initial state of form


 
 

Figure 110 When LOAD DATA button is clicked, the two tables will be
populated
 
 
 
Step Run gui_pancreatic.py and you will see the other widgets are initially
8 disabled as shown in Figure 109. Then click LOAD DATA button. The two
tables will be populated as shown in Figure 110.
 
 
Resampling and Splitting Data
Step Define fit_dataset() method to resample data using SMOTE:
1  
1 def fit_dataset(self, df):
2 #Extracts diagnosis feature as target
3 variable
4 y = df['diagnosis'].values # Target
for the model
5
 
6
#Drops diagnosis feature and set input
7
variable
8
X = df.drop('diagnosis', axis = 1)
9
 
10
#Resamples data
11
sm = SMOTE(random_state=2021)
12
X,y = sm.fit_resample(X, y.ravel())
 
return X, y
 
The purpose of the fit_dataset function is to prepare the dataset for
machine learning model training by performing the following steps:
1. y = df['diagnosis'].values: Extract the
target variable (often denoted as "y")
from the DataFrame df. In this case, the
target variable is the "diagnosis" column,
which contains the class labels for each
data sample.
2. X = df.drop('diagnosis', axis=1): Set the
input variables (often denoted as "X") by
dropping the "diagnosis" column from
the DataFrame df. This means that the
input variables will include all other
columns except for the target variable
"diagnosis."
3. sm = SMOTE(random_state=2021):
Create an instance of the SMOTE
(Synthetic Minority Over-sampling
Technique) class. SMOTE is a technique
used for oversampling the minority class
in imbalanced datasets, which helps in
improving the performance of machine
learning models on such datasets.
4. X, y = sm.fit_resample(X, y.ravel()):
Resample the dataset using the SMOTE
technique. This step creates synthetic
samples for the minority class
(Pancreatic Cancer) to balance the class
distribution. The resampling process
generates new data points for the
minority class by interpolating existing
samples.
5. return X, y: Return the resampled input
variables X and the corresponding target
variable y. These resampled datasets are
now suitable for training a machine
learning model since they have a
balanced class distribution, which is
beneficial for improving model
performance, especially in cases where
the original dataset is imbalanced (i.e.,
when one class dominates the others).
Step Define train_test() to split dataset into train and test data with raw,
2 normalized, and standardized feature scaling:
 
1 def train_test(self):
2 X, y = self.fit_dataset(self.df)
3  
4 #Splits the data into training and testing
5 X_train, X_test, y_train, y_test =
6 train_test_split(X, y,\
7 test_size = 0.2, random_state = 2021,
stratify=y)
8
self.X_train_raw = X_train.copy()
9
self.X_test_raw = X_test.copy()
10
self.y_train_raw = y_train.copy()
11
self.y_test_raw = y_test.copy()
12
13
#Saves into npy files
14
save('X_train_raw.npy',
15
self.X_train_raw)
16
save('y_train_raw.npy',
17 self.y_train_raw)
18 save('X_test_raw.npy', self.X_test_raw)
19 save('y_test_raw.npy', self.y_test_raw)
20  
21 self.X_train_norm = X_train.copy()
22 self.X_test_norm = X_test.copy()
23 self.y_train_norm = y_train.copy()
24 self.y_test_norm = y_test.copy()
25 norm = MinMaxScaler()
26 self.X_train_norm[inf_cols] = \
27 norm.fit_transform(self.X_train_norm)
28 self.X_test_norm[inf_cols] = \
29 norm.transform(self.X_test_norm)
30
31 #Saves into npy files
32 save('X_train_norm.npy',
33 self.X_train_norm)
34 save('y_train_norm.npy',
35 self.y_train_norm)
36 save('X_test_norm.npy',
self.X_test_norm)
37
save('y_test_norm.npy',
38
self.y_test_norm)
39
 
40
41 self.X_train_stand = X_train.copy()
42 self.X_test_stand = X_test.copy()
43 self.y_train_stand = y_train.copy()
44 self.y_test_stand = y_test.copy()
45 scaler = StandardScaler()
46 self.X_train_stand[inf_cols] = \
47 scaler.fit_transform(self.X_train_stand)
48 self.X_test_stand[inf_cols] = \
scaler.transform(self.X_test_stand)
 
#Saves into npy files
save('X_train_stand.npy',
self.X_train_stand)
save('y_train_stand.npy',
self.y_train_stand)
save('X_test_stand.npy',
self.X_test_stand)
save('y_test_stand.npy',
self.y_test_stand)
 
The purpose of the train_test() function is to prepare the training and
testing datasets for machine learning by performing the following
steps:
1. X, y = self.fit_dataset(self.df): Call the
fit_dataset function to obtain the
resampled input variables (X) and the
corresponding target variable (y) with
balanced class distribution.
2. train_test_split(): Split the resampled
dataset into training and testing datasets
using the train_test_split() function from
scikit-learn. The training dataset will be
used to train the machine learning
models, while the testing dataset will be
used to evaluate the model's
performance.
3. Save the raw and
normalized/standardized training and
testing datasets to .npy files. This step is
essential to ensure reproducibility and
consistency in later stages of the
application.
4. Normalize the input features in the
training and testing datasets using
MinMaxScaler for normalized scaling or
standardize the features using
StandardScaler for standardized scaling.
Normalization scales the features to a
specific range (usually [0, 1]), while
standardization transforms the features to
have a mean of 0 and standard deviation
of 1. These preprocessing steps are
beneficial for some machine learning
algorithms that are sensitive to the scale
of features.
5. Save the normalized and standardized
training and testing datasets to separate
.npy files.
Overall, the train_test() function ensures that the raw, normalized, and
standardized datasets are prepared and saved for later use in the
machine learning model training and evaluation processes. The
application will use these datasets for various classifiers to determine
the best-performing model for classifying and predicting pancreatic
cancer.
 
Step Define split_data_ML() method execute splitting dataset into train
3 and test data:
 
1 def split_data_ML(self):
2 if path.isfile('X_train_raw.npy'):
3 #Loads npy files
4 self.X_train_raw = \
5 np.load('X_train_raw.npy',allow_pickle=True)
6 self.y_train_raw = \
7 np.load('y_train_raw.npy',allow_pickle=True)
8 self.X_test_raw = \
9 np.load('X_test_raw.npy',allow_pickle=True)
10 self.y_test_raw = \
11 np.load('y_test_raw.npy',allow_pickle=True)
12
13 self.X_train_norm = \
14 np.load('X_train_norm.npy',allow_pickle=True)
15 self.y_train_norm = \
16 np.load('y_train_norm.npy',allow_pickle=True)
17 self.X_test_norm = \
18 np.load('X_test_norm.npy',allow_pickle=True)
19 self.y_test_norm = \
20 np.load('y_test_norm.npy',allow_pickle=True)
21  
22 self.X_train_stand = \
23 np.load('X_train_stand.npy',allow_pickle=True
24 )
25 self.y_train_stand = \
26 np.load('y_train_stand.npy',allow_pickle=True
)
27
self.X_test_stand = \
28
np.load('X_test_stand.npy',allow_pickle=True)
29
self.y_test_stand = \
30
np.load('y_test_stand.npy',allow_pickle=True)
31
32
else:
33
self.train_test()
34
35
#Prints each shape
36
print('X train raw shape: ',
37
self.X_train_raw.shape)
38
print('Y train raw shape: ',
39 self.y_train_raw.shape)
40 print('X test raw shape: ',
41 self.X_test_raw.shape)
42 print('Y test raw shape: ',
43 self.y_test_raw.shape)
44  
45 #Prints each shape
46
47 print('X train norm shape: ',
48 self.X_train_norm.shape)
49 print('Y train norm shape: ',
self.y_train_norm.shape)
50
print('X test norm shape: ',
self.X_test_norm.shape)
print('Y test norm shape: ',
self.y_test_norm.shape)
 
#Prints each shape
print('X train stand shape: ',
self.X_train_stand.shape)
print('Y train stand shape: ',
self.y_train_stand.shape)
print('X test stand shape: ',
self.X_test_stand.shape)
print('Y test stand shape: ',
self.y_test_stand.shape)
 
 
The purpose of the split_data_ML() function is to load or generate the
preprocessed training and testing datasets for machine learning tasks.
Here's a breakdown of what the function does:
1. Check if the .npy files containing the
preprocessed datasets (X_train_raw.npy,
y_train_raw.npy, X_test_raw.npy, etc.)
exist in the current working directory.
2. If the files exist, load the datasets from
the .npy files using the np.load function
and store them in the corresponding
variables (self.X_train_raw,
self.y_train_raw, etc.).
3. If the files do not exist, call the train_test
function to generate the raw, normalized,
and standardized training and testing
datasets. This function will save these
datasets to .npy files in the current
working directory.
4. After loading or generating the datasets,
the function prints the shape of each
dataset to the console to verify that the
data is correctly loaded or generated. The
shapes of the datasets will be printed for
the raw, normalized, and standardized
versions of both the training and testing
datasets.
In summary, the split_data_ML() function ensures that the
preprocessed datasets are available for further use in machine
learning tasks. If the datasets have been previously generated and
saved, it loads them from the .npy files. Otherwise, it generates the
datasets by calling the train_test() function and then prints their
shapes for verification.
 
Step Define train_model_ML() method to invoke split_data_ML()
4 method:
 
1 def train_model_ML(self):
2 self.split_data_ML()
3  
4 #Turns on three widgets
5 self.cbData.setEnabled(True)
6 self.cbClassifier.setEnabled(True)
7 self.cbPredictionML.setEnabled(True)
8
9 #Turns off pbTrainML
10 self.pbTrainML.setEnabled(False)
11  
12 #Turns on three radio buttons
13 self.rbRaw.setEnabled(True)
14 self.rbNorm.setEnabled(True)
15 self.rbStand.setEnabled(True)
16 self.rbRaw.setChecked(True)
 
 
Figure 111 The cbData, cbClassifier, and cbPredictionML widgets
are enabled user clicks TRAIN ML MODEL button
 
 
Step Connect clicked() event of pbTrainML widget with
5 train_model_ML() and put it inside __init__() method as shown in
line 10:
 
1 def __init__(self):
2 QMainWindow.__init__(self)
3 loadUi("gui_pancreatic.ui",self)
4 self.setWindowTitle(\
5 "GUI Demo of Classifying and Predicting Pancreatic
6 Cancer")
7 self.addToolBar(NavigationToolbar(\
8 self.widgetPlot1.canvas, self))
9 self.pbLoad.clicked.connect(self.import_dataset)
10 self.initial_state(False)
self.pbTrainML.clicked.connect(self.train_model_ML
)
 
 
Step Run gui_pancreatic.py and you will see the other widgets are
6 initially disabled. Click LOAD DATA button. The two tables are
populated, LOAD DATA button is disabled, and TRAIN ML
MODEL button is enabled. Then, click on TRAIN ML MODEL
button. You will see that cbData, cbClassifier, and cbPredictionML
are enabled and pbTrainML is disabled as shown in Figure 111. You
also will find four training dan test npy files for machine learning in
your working directory.
 
 
 
Distribution of Target Variable
Step Define pie_cat() and bar_cat() method to plot distribution of a categorical
1 feature in pie and bar chart on a widget:
 
1 def pie_cat(self, df_target, var_target,
2 labels, widget):
3 df_target.value_counts().plot.pie(\
4 ax = widget.canvas.axis1,labels=labels,\
5 startangle=40,explode=
[0,0.15],shadow=True,\
6
colors=['#ff6666','#F5C7B8FF'],autopct =
7
'%1.1f%%',\
8
textprops={'fontsize': 10})
9
widget.canvas.axis1.set_title('The
10 distribution of ' + \
11 var_target + ' variable', fontweight
12 ="bold",fontsize=14)
13 widget.canvas.figure.tight_layout()
14 widget.canvas.draw()
15  
16 def bar_cat(self,df,var, widget):
17 ax =
18 df[var].value_counts().plot(kind="barh",\
19 ax = widget.canvas.axis1)
20  
21 for i,j in
enumerate(df[var].value_counts().values):
ax.text(.7,i,j,weight =
"bold",fontsize=10)
 
widget.canvas.axis1.set_title("Count of
"+ var +" cases")
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 

Step Define stacked_bar_plot to plot distribution of diagnosis variable against


2 another categorical feature:
 
1 #Plots diagnosis with other variable
2 def stacked_bar_plot(self,df,cat,ax1):
3 cmap1=plt.cm.coolwarm_r
4 group_by_stat = df.groupby([cat,
5 'diagnosis']).size()
6
g=group_by_stat.unstack().plot(kind='bar',
7
\
8
stacked=True,ax=ax1,grid=True)
9
self.put_label_stacked_bar(g,17)
10
 
11
ax1.set_title('Stacked Bar Plot of '+ \
12
cat +' (in %)', fontsize=14)
13
ax1.set_ylabel('Number of Cases')
14
ax1.set_xlabel(cat)
15
plt.show()
16
 
17
def put_label_stacked_bar(self,
18 ax,fontsize):
19 #patches is everything inside of the chart
20 for rect in ax.patches:
21 # Find where everything is located
22 height = rect.get_height()
23 width = rect.get_width()
24 x = rect.get_x()
25 y = rect.get_y()
26
27 # The height of the bar is the data value
28 label_text = f'{height:.0f}'
29
30 # ax.text(x, y, text)
31 label_x = x + width / 2
32 label_y = y + height / 2
33  
34 # plots only when height is greater than
35 specified value
36 if height > 0:
37 ax.text(label_x, label_y,
38 label_text, \
ha='center', va='center', weight =
"bold",\
fontsize=fontsize)

ax.legend(bbox_to_anchor=(1.05, 1),
\
loc='lower right', borderaxespad=0.)
 
 
Step Define choose_plot() to read currentText property of cbData widget and act
3 accordingly:
 
1 def choose_plot(self):
2 strCB = self.cbData.currentText()
3  
4 if strCB == 'diagnosis':
5 #Plots distribution of diagnosis variable in pie
6 chart
7 self.widgetPlot1.canvas.figure.clf()
8 self.widgetPlot1.canvas.axis1 = \
9 self.widgetPlot1.canvas.figure.add_subplot(121,\
10 facecolor = '#fbe7dd')
11 label_class = \
12 list(self.df_dummy["diagnosis"].value_counts().index)
13 self.pie_cat(self.df_dummy["diagnosis"],'diagnosis',
\
14
label_class, self.widgetPlot1)
15
self.widgetPlot1.canvas.figure.tight_layout()
16
self.widgetPlot1.canvas.draw()
17
 
18
self.widgetPlot1.canvas.axis1 = \
19
self.widgetPlot1.canvas.figure.add_subplot(122,\
20
facecolor = '#fbe7dd')
21
self.bar_cat(self.df_dummy,'diagnosis', \
22
self.widgetPlot1)
23
self.widgetPlot1.canvas.figure.tight_layout()
24
self.widgetPlot1.canvas.draw()
25
 
26
self.widgetPlot2.canvas.figure.clf()
27
28 self.widgetPlot2.canvas.axis1 = \
29 self.widgetPlot2.canvas.figure.add_subplot(111,\
30 facecolor = '#fbe7dd')
31 self.stacked_bar_plot(self.df_dummy,'age',\
32 self.widgetPlot2.canvas.axis1)
33 self.widgetPlot2.canvas.figure.tight_layout()
34 self.widgetPlot2.canvas.draw()
35  
36 self.widgetPlot3.canvas.figure.clf()
37 self.widgetPlot3.canvas.axis1 = \
38 self.widgetPlot3.canvas.figure.add_subplot(221,\
39 facecolor = '#fbe7dd')
40 g=sns.countplot(self.df_dummy["sex"],\
41 hue = self.df_dummy["diagnosis"], \
42 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
43 )
44 self.put_label_stacked_bar(g,17)
45 self.widgetPlot3.canvas.axis1.set_title(\
46 "sex versus diagnosis",fontweight ="bold",\
47 fontsize=14)
48  
49 self.widgetPlot3.canvas.axis1 = \
50 self.widgetPlot3.canvas.figure.add_subplot(222,\
51 facecolor = '#fbe7dd')
52
g=sns.countplot(self.df_dummy["plasma_CA19_9"],\
53
hue = self.df_dummy["diagnosis"], \
54
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
55
)
56
self.put_label_stacked_bar(g,17)
57
self.widgetPlot3.canvas.axis1.set_title(\
58
"plasma_CA19_9 versus diagnosis",\
59
fontweight ="bold",fontsize=14)
60
61
self.widgetPlot3.canvas.axis1 = \
62
self.widgetPlot3.canvas.figure.add_subplot(223,\
63
facecolor = '#fbe7dd')
64
g=sns.countplot(self.df_dummy["creatinine"],\
65
hue = self.df_dummy["diagnosis"], \
66
67 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
68 )
69 self.put_label_stacked_bar(g,17)
70 self.widgetPlot3.canvas.axis1.set_title(\
71 "creatinine versus diagnosis",\
72 fontweight ="bold",fontsize=14)
73
74 self.widgetPlot3.canvas.axis1 = \
75 self.widgetPlot3.canvas.figure.add_subplot(224,\
76 facecolor = '#fbe7dd')
77 g=sns.countplot(self.df_dummy["age"],\
78 hue = self.df_dummy["diagnosis"], \
79 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
80
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title(\
"age versus diagnosis",fontweight ="bold",\
fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
The purpose of the choose_plot() function is to generate and display multiple
plots based on the selected category (diagnosis) from a drop-down list
(cbData). The plots visualize different aspects of the diagnosis variable using
pie charts, bar charts, and count plots. The function takes the selected
category, strCB, as input and performs the following tasks:
1. If strCB is equal to 'diagnosis':
a. Clear the first plot (widgetPlot1) to
prepare for new plots.
b. Plot the distribution of the diagnosis
variable using a pie chart in the left half of
the first plot (widgetPlot1).
c. Plot the count of different diagnosis
categories using a bar chart in the right
half of the first plot (widgetPlot1).
d. Clear the second plot (widgetPlot2) to
prepare for a new plot.
e. Plot a stacked bar plot of the age variable
in the second plot (widgetPlot2).
f. Clear the third plot (widgetPlot3) to
prepare for new plots.
g. Plot count plots of sex, plasma_CA19_9,
creatinine, and age variables against the
diagnosis variable in four separate
subplots within the third plot
(widgetPlot3).
The purpose of this function is to provide interactive visualizations that allow
users to explore and understand the distribution and relationships between
different variables and the diagnosis category in the dataset.
 
Step Connect currentIndexChanged() event of cbData widget with
4 choose_plot() method and put it inside __init__() method as shown in line
11:
 
1 def __init__(self):
2 QMainWindow.__init__(self)
3 loadUi("gui_pancreatic.ui",self)
4 self.setWindowTitle(\
5 "GUI Demo of Classifying and Predicting Pancreatic
6 Cancer")
7 self.addToolBar(NavigationToolbar(\
8 self.widgetPlot1.canvas, self))
9 self.pbLoad.clicked.connect(self.import_dataset)
10 self.initial_state(False)
11 self.pbTrainML.clicked.connect(self.train_model_ML)
self.cbData.currentIndexChanged.connect(self.choose_plot
)
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML MODEL
5 buttons. Then, choose diagnosis item from cbData widget. You will see the
result as shown in Figure 112.
 
Figure 112 The distribution of diagnosis variable versus other categorical
features
 
 
 
Distribution of Age Variable
Step Add this code to the end of choose_plot() method:
1  
1 if strCB == 'age':
2 #Plots distribution of age variable in pie chart
3 self.widgetPlot1.canvas.figure.clf()
4 self.widgetPlot1.canvas.axis1 = \
5 self.widgetPlot1.canvas.figure.add_subplot(121,\
6 facecolor = '#fbe7dd')
7 label_class = \
8 list(self.df_dummy["age"].value_counts().index)
9 self.pie_cat(self.df_dummy["age"],'age', \
10 label_class, self.widgetPlot1)
11 self.widgetPlot1.canvas.figure.tight_layout()
12 self.widgetPlot1.canvas.draw()
13  
14 self.widgetPlot1.canvas.axis1 = \
15 self.widgetPlot1.canvas.figure.add_subplot(122,\
16 facecolor = '#fbe7dd')
17 self.bar_cat(self.df_dummy,'age', \
18 self.widgetPlot1)
19 self.widgetPlot1.canvas.figure.tight_layout()
20 self.widgetPlot1.canvas.draw()
21  
22 self.widgetPlot2.canvas.figure.clf()
23 self.widgetPlot2.canvas.axis1 = \
24 self.widgetPlot2.canvas.figure.add_subplot(111,\
25 facecolor = '#fbe7dd')
26 self.stacked_bar_plot(self.df_dummy,'age',\
27 self.widgetPlot2.canvas.axis1)
28 self.widgetPlot2.canvas.figure.tight_layout()
29 self.widgetPlot2.canvas.draw()
30  
31 self.widgetPlot3.canvas.figure.clf()
32 self.widgetPlot3.canvas.axis1 = \
33 self.widgetPlot3.canvas.figure.add_subplot(221,\
34 facecolor = '#fbe7dd')
35 g=sns.countplot(self.df_dummy["sex"],\
36 hue = self.df_dummy["age"], \
37 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
38 )
39 self.put_label_stacked_bar(g,17)
40 self.widgetPlot3.canvas.axis1.set_title(\
41 "sex versus age",fontweight ="bold",\
42 fontsize=14)
43  
44 self.widgetPlot3.canvas.axis1 = \
45 self.widgetPlot3.canvas.figure.add_subplot(222,\
46 facecolor = '#fbe7dd')
47
g=sns.countplot(self.df_dummy["plasma_CA19_9"],\
48
hue = self.df_dummy["age"], \
49
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
50
)
51
self.put_label_stacked_bar(g,17)
52
self.widgetPlot3.canvas.axis1.set_title(\
53
"plasma_CA19_9 versus age",\
54
fontweight ="bold",fontsize=14)
55
56
self.widgetPlot3.canvas.axis1 = \
57
self.widgetPlot3.canvas.figure.add_subplot(223,\
58
facecolor = '#fbe7dd')
59
g=sns.countplot(self.df_dummy["creatinine"],\
60
61 hue = self.df_dummy["age"], \
62 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
63 )
64 self.put_label_stacked_bar(g,17)
65 self.widgetPlot3.canvas.axis1.set_title(\
66 "creatinine versus age",\
67 fontweight ="bold",fontsize=14)
68
69 self.widgetPlot3.canvas.axis1 = \
70 self.widgetPlot3.canvas.figure.add_subplot(224,\
71 facecolor = '#fbe7dd')
72 g=sns.countplot(self.df_dummy["diagnosis"],\
73 hue = self.df_dummy["age"], \
74 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
75
self.put_label_stacked_bar(g,17)
76
self.widgetPlot3.canvas.axis1.set_title(\
77
"diagnosis versus age",fontweight ="bold",\
fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 

Figure 113 The distribution of age variable versus other categorical


features
 
Similar to the previous section, the purpose of this code is to provide
interactive visualizations that allow users to explore and understand the
distribution and relationships between different variables and the age
category in the dataset. The plots help in understanding patterns and
trends related to different age groups and their association with other
variables in the dataset.
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML
2 MODEL buttons. Then, choose age item from cbData widget. You will
see the result as shown in Figure 113.
 
 
 
Distribution of Sex Variable
Step Add this code to the end of choose_plot() method:
1  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72 if strCB == 'sex':
73 #Plots distribution of sex variable in pie chart
74 self.widgetPlot1.canvas.figure.clf()
75 self.widgetPlot1.canvas.axis1 = \
76 self.widgetPlot1.canvas.figure.add_subplot(121,\
77 facecolor = '#fbe7dd')
label_class = \
list(self.df_dummy["sex"].value_counts().index)
self.pie_cat(self.df_dummy["sex"],'sex', \
label_class, self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot1.canvas.axis1 = \
self.widgetPlot1.canvas.figure.add_subplot(122,\
facecolor = '#fbe7dd')
self.bar_cat(self.df_dummy,'sex', \
self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = \
self.widgetPlot2.canvas.figure.add_subplot(111,\
facecolor = '#fbe7dd')
self.stacked_bar_plot(self.df_dummy,'sex',\
self.widgetPlot2.canvas.axis1)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = \
self.widgetPlot3.canvas.figure.add_subplot(221,\
facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["age"],\
hue = self.df_dummy["sex"], \
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title(\
"age versus sex",fontweight ="bold",\
fontsize=14)
 
self.widgetPlot3.canvas.axis1 = \
self.widgetPlot3.canvas.figure.add_subplot(222,\
facecolor = '#fbe7dd')

g=sns.countplot(self.df_dummy["plasma_CA19_9"],\
hue = self.df_dummy["sex"], \
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title(\
"plasma_CA19_9 versus sex",\
fontweight ="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 = \
self.widgetPlot3.canvas.figure.add_subplot(223,\
facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["creatinine"],\
hue = self.df_dummy["sex"], \
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title(\
"creatinine versus sex",\
fontweight ="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 = \
self.widgetPlot3.canvas.figure.add_subplot(224,\
facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["diagnosis"],\
hue = self.df_dummy["sex"], \
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title(\
"diagnosis versus sex",fontweight ="bold",\
fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
Similar to the previous sections, the purpose of this code is to provide
interactive visualizations that allow users to explore and understand the
distribution and relationships between different variables and the sex
category in the dataset. The plots help in understanding patterns and
trends related to different sex categories and their association with other
variables in the dataset.
 
Step 2 Run gui_pancreatic.py and click L
TRAIN ML MODEL buttons. Then, c
cbData widget. You will see the result
114.
 

Figure 114 The distribution of sex va


categorical features
 
 
 
Distribution of Plasma CA19-9 Variable
Step Add this code to the end of choose_plot() method:
1  
1 if strCB == 'plasma_CA19_9':
2 #Plots distribution of plasma_CA19_9 variable in pie chart
3 self.widgetPlot1.canvas.figure.clf()
4 self.widgetPlot1.canvas.axis1 = \
5 self.widgetPlot1.canvas.figure.add_subplot(121,\
6 facecolor = '#fbe7dd')
7 label_class = \
8 list(self.df_dummy["plasma_CA19_9"].value_counts().index)
9 self.pie_cat(self.df_dummy["plasma_CA19_9"],'plasma_CA19_9'
10 , \
11 label_class, self.widgetPlot1)
12 self.widgetPlot1.canvas.figure.tight_layout()
13 self.widgetPlot1.canvas.draw()
14  
15 self.widgetPlot1.canvas.axis1 = \
16 self.widgetPlot1.canvas.figure.add_subplot(122,\
17 facecolor = '#fbe7dd')
18 self.bar_cat(self.df_dummy,'plasma_CA19_9', \
19 self.widgetPlot1)
20 self.widgetPlot1.canvas.figure.tight_layout()
21 self.widgetPlot1.canvas.draw()
22  
23 self.widgetPlot2.canvas.figure.clf()
24 self.widgetPlot2.canvas.axis1 = \
25 self.widgetPlot2.canvas.figure.add_subplot(111,\
26 facecolor = '#fbe7dd')
27 self.stacked_bar_plot(self.df_dummy,'plasma_CA19_9',\
28 self.widgetPlot2.canvas.axis1)
29 self.widgetPlot2.canvas.figure.tight_layout()
30 self.widgetPlot2.canvas.draw()
31  
32 self.widgetPlot3.canvas.figure.clf()
33 self.widgetPlot3.canvas.axis1 = \
34 self.widgetPlot3.canvas.figure.add_subplot(221,\
35 facecolor = '#fbe7dd')
36 g=sns.countplot(self.df_dummy["age"],\
37 hue = self.df_dummy["plasma_CA19_9"], \
38 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
39 self.put_label_stacked_bar(g,17)
40 self.widgetPlot3.canvas.axis1.set_title(\
41 "age versus plasma_CA19_9",fontweight ="bold",\
42 fontsize=14)
43  
44 self.widgetPlot3.canvas.axis1 = \
45 self.widgetPlot3.canvas.figure.add_subplot(222,\
46 facecolor = '#fbe7dd')
47 g=sns.countplot(self.df_dummy["sex"],\
48 hue = self.df_dummy["plasma_CA19_9"], \
49 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
50 self.put_label_stacked_bar(g,17)
51 self.widgetPlot3.canvas.axis1.set_title(\
52 "sex versus plasma_CA19_9",\
53 fontweight ="bold",fontsize=14)
54
55 self.widgetPlot3.canvas.axis1 = \
56 self.widgetPlot3.canvas.figure.add_subplot(223,\
57 facecolor = '#fbe7dd')
58 g=sns.countplot(self.df_dummy["creatinine"],\
59 hue = self.df_dummy["plasma_CA19_9"], \
60 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
61 self.put_label_stacked_bar(g,17)
62 self.widgetPlot3.canvas.axis1.set_title(\
63 "creatinine versus plasma_CA19_9",\
64 fontweight ="bold",fontsize=14)
65
66 self.widgetPlot3.canvas.axis1 = \
67 self.widgetPlot3.canvas.figure.add_subplot(224,\
68 facecolor = '#fbe7dd')
69 g=sns.countplot(self.df_dummy["diagnosis"],\
70 hue = self.df_dummy["plasma_CA19_9"], \
71 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
72 self.put_label_stacked_bar(g,17)
73 self.widgetPlot3.canvas.axis1.set_title(\
74 "diagnosis versus plasma_CA19_9",fontweight ="bold",\
75 fontsize=14)
76
77 self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
The purpose of this code is similar to the previous sections but focuses on
visualizing the distribution of data related to the plasma_CA19_9 variable and its
association with other variables in the dataset. These plots help in understanding
how the plasma_CA19_9 variable is distributed and its relationships with other
variables.
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML MODEL
2 buttons. Then, choose plasma_CA19_9 item from cbData widget. You will see
the result as shown in Figure 115.
 

Figure 115 The distribution of plasma_CA19_9 variable versus other


categorical features
 
Distribution of Creatinine Variable
Step Add this code to the end of choose_plot() method:
1  
1 if strCB == 'creatinine':
2 #Plots distribution of creatinine variable in pie
3 chart
4 self.widgetPlot1.canvas.figure.clf()
5 self.widgetPlot1.canvas.axis1 = \
6 self.widgetPlot1.canvas.figure.add_subplot(121,\
7 facecolor = '#fbe7dd')
8 label_class = \
9 list(self.df_dummy["creatinine"].value_counts().index
)
10
self.pie_cat(self.df_dummy["creatinine"],'creatinine'
11
, \
12
label_class, self.widgetPlot1)
13
self.widgetPlot1.canvas.figure.tight_layout()
14
self.widgetPlot1.canvas.draw()
15
 
16
self.widgetPlot1.canvas.axis1 = \
17
self.widgetPlot1.canvas.figure.add_subplot(122,\
18
facecolor = '#fbe7dd')
19
self.bar_cat(self.df_dummy,'creatinine', \
20
self.widgetPlot1)
21
self.widgetPlot1.canvas.figure.tight_layout()
22
self.widgetPlot1.canvas.draw()
23
 
24
self.widgetPlot2.canvas.figure.clf()
25
self.widgetPlot2.canvas.axis1 = \
26
self.widgetPlot2.canvas.figure.add_subplot(111,\
27
facecolor = '#fbe7dd')
28
self.stacked_bar_plot(self.df_dummy,'creatinine',\
29
self.widgetPlot2.canvas.axis1)
30
self.widgetPlot2.canvas.figure.tight_layout()
31
self.widgetPlot2.canvas.draw()
32
 
33
self.widgetPlot3.canvas.figure.clf()
34
self.widgetPlot3.canvas.axis1 = \
35
self.widgetPlot3.canvas.figure.add_subplot(221,\
36
facecolor = '#fbe7dd')
37
g=sns.countplot(self.df_dummy["age"],\
38
hue = self.df_dummy["creatinine"], \
39
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
40 )
41 self.put_label_stacked_bar(g,17)
42
43 self.widgetPlot3.canvas.axis1.set_title(\
44 "age versus creatinine",fontweight ="bold",\
45 fontsize=14)
46  
47 self.widgetPlot3.canvas.axis1 = \
48 self.widgetPlot3.canvas.figure.add_subplot(222,\
49 facecolor = '#fbe7dd')
50 g=sns.countplot(self.df_dummy["sex"],\
51 hue = self.df_dummy["creatinine"], \
52 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
53 )
54 self.put_label_stacked_bar(g,17)
55 self.widgetPlot3.canvas.axis1.set_title(\
56 "sex versus creatinine",\
57 fontweight ="bold",fontsize=14)
58
59 self.widgetPlot3.canvas.axis1 = \
60 self.widgetPlot3.canvas.figure.add_subplot(223,\
61 facecolor = '#fbe7dd')
62
g=sns.countplot(self.df_dummy["plasma_CA19_9"],\
63
hue = self.df_dummy["creatinine"], \
64
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
65
)
66
self.put_label_stacked_bar(g,17)
67
self.widgetPlot3.canvas.axis1.set_title(\
68
"plasma_CA19_9 versus creatinine",\
69
fontweight ="bold",fontsize=14)
70
71
self.widgetPlot3.canvas.axis1 = \
72
self.widgetPlot3.canvas.figure.add_subplot(224,\
73
facecolor = '#fbe7dd')
74
g=sns.countplot(self.df_dummy["diagnosis"],\
75
hue = self.df_dummy["creatinine"], \
76
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
77 )
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title(\
"diagnosis versus creatinine",fontweight ="bold",\
fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
The purpose of this code is to generate and display multiple plots based
on the selected category (creatinine) from a drop-down list (cbData). The
plots visualize different aspects of the creatinine variable using pie charts,
bar charts, and count plots.
 

Figure 116 The distribution of creatinine variable versus other


categorical features
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML
2 MODEL buttons. Then, choose creatinine item from cbData widget.
You will see the result as shown in Figure 116.
 
 
 
 
Distribution of Numerical Variables
Step Define feat_versus_other() method to plot distribution (histogram) of one
1 feature versus another on a widget:
 
1 def feat_versus_other(self,
2 feat,another,legend,ax0,label='',title=''):
3 background_color = "#fbe7dd"
4 sns.set_palette(['#ff355d','#66b3ff'])
5 for s in ["right", "top"]:
6 ax0.spines[s].set_visible(False)
7  
8 ax0.set_facecolor(background_color)
9 ax0_sns = sns.histplot(data=self.df, \
10 x=self.df[feat],ax=ax0,zorder=2,kde=False,
\
11
hue=another,multiple="stack", shrink=.8,\
12
linewidth=0.3,alpha=1)
13
 
14
15 self.put_label_stacked_bar(ax0_sns,17)
16  
17 ax0_sns.set_xlabel('',fontsize=10,
18 weight='bold')
19 ax0_sns.set_ylabel('',fontsize=10,
weight='bold')
20
 
21
ax0_sns.grid(which='major', axis='x',
22
zorder=0, \
23
color='#EEEEEE', linewidth=0.4)
24
ax0_sns.grid(which='major', axis='y',
25 zorder=0, \
26 color='#EEEEEE', linewidth=0.4)
27  
28 ax0_sns.tick_params(labelsize=10,
29 width=0.5, length=1.5)
30 ax0_sns.legend(legend, ncol=2,
facecolor='#D8D8D8', \
edgecolor=background_color, fontsize=14, \
bbox_to_anchor=(1, 0.989), loc='upper
right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label,fontweight
="bold",fontsize=14)
ax0_sns.set_title(title,fontweight
="bold",fontsize=16)
 
The code aims to enhance the visualization and comparison of different
features in a dataset through various types of plots. The code consists of
several functions, including pie_cat(), bar_cat(), choose_plot(), and
feat_versus_other(), each serving specific visualization purposes.
 
The pie_cat() and bar_cat() functions are responsible for generating pie
charts and horizontal bar charts, respectively, for categorical data. These
functions take the input data, such as the target variable or a specific feature,
and create visually appealing pie charts and bar charts with relevant labels,
titles, and colors to provide insights into the distribution of the data.
 
The choose_plot() function dynamically selects the appropriate type of plot
based on the selected feature in the dataset. It calls the pie_cat() and bar_cat()
functions for categorical data and also makes use of stacked_bar_plot() to
generate a stacked bar plot for comparing features. This function helps to
create a comprehensive visualization dashboard, allowing users to
interactively explore the data distribution and relationships between various
features.
 
The feat_versus_other() function is designed specifically for numerical data
and creates histograms with stacked bars to compare two features. It uses the
Seaborn library to plot the histogram, enhancing the visualization with
multiple stacked bars, labels, and a legend to clearly display the distribution
of data in relation to the two features being compared.
 
Overall, the purpose of the code is to present an interactive and informative
visualization dashboard that helps users explore and gain insights from the
dataset by effectively displaying the distribution and relationships of
different features, both categorical and numerical. This visualization can aid
in understanding the data patterns and support decision-making in data
analysis tasks related to the dataset.
 
Step Define feat_versus_other() method to plot the density of one feature versus
2 another on a widget:
 
1 def prob_feat_versus_other(self,
2 feat,another,legend,ax0,label='',title=''):
3 background_color = "#fbe7dd"
4 sns.set_palette(['#ff355d','#66b3ff'])
5 for s in ["right", "top"]:
6 ax0.spines[s].set_visible(False)
7  
8 ax0.set_facecolor(background_color)
9 ax0_sns =
sns.kdeplot(x=self.df[feat],ax=ax0,\
10
hue=another,linewidth=0.3,fill=True,cbar='g',
11
\
12
zorder=2,alpha=1,multiple='stack')
13
 
14
ax0_sns.set_xlabel('',fontsize=10,
15 weight='bold')
16 ax0_sns.set_ylabel('',fontsize=10,
17 weight='bold')
18  
19 ax0_sns.grid(which='major', axis='x',
20 zorder=0, \
21 color='#EEEEEE', linewidth=0.4)
22 ax0_sns.grid(which='major', axis='y',
zorder=0, \
23
color='#EEEEEE', linewidth=0.4)
24
 
25
ax0_sns.tick_params(labelsize=10,
26
width=0.5, length=1.5)
27
ax0_sns.legend(legend, ncol=2,
facecolor='#D8D8D8', \
edgecolor=background_color, fontsize=14, \
bbox_to_anchor=(1, 0.989), loc='upper right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label,fontweight
="bold",fontsize=14)
ax0_sns.set_title(title,fontweight
="bold",fontsize=16)
 
The prob_feat_versus_other() function serves the purpose of visualizing the
probability distribution of two numerical features in the dataset using kernel
density estimation (KDE) plots with multiple stacked areas. It aims to
provide insights into how the distribution of one feature varies with respect
to the other, highlighting potential patterns or differences between different
groups or categories represented by the another feature. The function utilizes
the Seaborn library, which simplifies the process of creating such
visualizations with minimal code.
 
When called, the function takes several arguments, including feat and
another, which specify the two numerical features to be compared. The
legend argument provides labels for the different groups in the stacked KDE
plot, allowing easy identification of each group. The ax0 parameter refers to
the axis object where the KDE plot will be drawn. The function sets a
consistent background color for the plot and defines a color palette to
represent the different groups in the stacked plot. It then utilizes Seaborn's
kdeplot function to estimate and visualize the probability distribution of the
feat feature, stacked by the another feature. The result is an informative and
visually appealing KDE plot that showcases the distributional relationship
between the two selected features. Additionally, the function takes care of
setting axis labels, titles, legends, and other visual elements to improve the
overall readability and aesthetics of the plot.
 
 
Step Define hist_num_versus_four_cat() method to plot the distribution
3 (histogram) of one numerical feature versus four categorical ones:
 
1 def hist_num_versus_four_cat(self,feat):
2 self.label_diagnosis = \
3 list(self.df_dummy["diagnosis"].value_counts().index)
4 self.label_age = \
5 list(self.df_dummy["age"].value_counts().index)
6 self.label_plasma = \
7 list(self.df_dummy["plasma_CA19_9"].value_counts().index
8 )
9 self.label_creatinine = \
10 list(self.df_dummy["creatinine"].value_counts().index)
11
12 self.widgetPlot3.canvas.figure.clf()
13 self.widgetPlot3.canvas.axis1 = \
14 self.widgetPlot3.canvas.figure.add_subplot(221,\
15 facecolor = '#fbe7dd')
16 print(self.df_dummy["diagnosis"].value_counts())
17 self.feat_versus_other(feat,\
18 self.df_dummy["diagnosis"],self.label_diagnosis,\
19 self.widgetPlot3.canvas.axis1,label=feat,\
20 title='diagnosis_result versus ' + feat)
21  
22 self.widgetPlot3.canvas.axis1 = \
23 self.widgetPlot3.canvas.figure.add_subplot(222,\
24 facecolor = '#fbe7dd')
25 print(self.df_dummy["age"].value_counts())
26 self.feat_versus_other(feat,\
27 self.df_dummy["age"],self.label_age,\
28 self.widgetPlot3.canvas.axis1,label=feat,\
29 title='radius versus ' + feat)
30  
31 self.widgetPlot3.canvas.axis1 = \
32 self.widgetPlot3.canvas.figure.add_subplot(223,\
33 facecolor = '#fbe7dd')
34 print(self.df_dummy["plasma_CA19_9"].value_counts())
35 self.feat_versus_other(feat,\
36 self.df_dummy["plasma_CA19_9"],self.label_plasma,\
37 self.widgetPlot3.canvas.axis1,label=feat,\
38 title='texture versus ' + feat)
39  
40 self.widgetPlot3.canvas.axis1 = \
41 self.widgetPlot3.canvas.figure.add_subplot(224,\
42 facecolor = '#fbe7dd')
43 print(self.df_dummy["creatinine"].value_counts())
44 self.feat_versus_other(feat,\
45 self.df_dummy["creatinine"],self.label_creatinine,\
46 self.widgetPlot3.canvas.axis1,label=feat,\
47 title='area versus ' + feat)
48  
49 self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
The hist_num_versus_four_cat() function aims to compare a numerical
feature (feat) against four categorical features in the dataset: "diagnosis,"
"age," "plasma_CA19_9," and "creatinine." For each of these categorical
features, the function creates four KDE plots using the feat_versus_other()
function (previously defined), visually representing the distribution of the
numerical feature with respect to different categories or groups within the
respective categorical feature. The function also labels each plot accordingly
to indicate the relationship between the numerical feature and the categorical
feature being analyzed.
 
When called, the function first retrieves the unique labels of each categorical
feature ("diagnosis," "age," "plasma_CA19_9," and "creatinine") from the
DataFrame. It then proceeds to create four subplots in a 2x2 grid arrangement
using the widgetPlot3 object. For each subplot, the function calls the
feat_versus_other() function, providing the current numerical feature (feat)
and the corresponding categorical feature as arguments, along with the labels
of the categories within the categorical feature. The feat_versus_other
function() is responsible for generating the KDE plots and customizing the
appearance of the plots. Finally, the function sets the labels and titles for each
subplot to reflect the specific comparison being made and ensures that the
plots are well-organized within the grid layout. The resulting visualization
provides valuable insights into how the numerical feature varies across
different categories of the categorical features.
 
Step Define prob_num_versus_two_cat() method to plot the density of one
4 numerical feature versus two categorical ones:
 
1 def prob_num_versus_two_cat(self,feat, feat_cat1,
2 feat_cat2, widget):
3 self.label_feat_cat1 = \
4 list(self.df_dummy[feat_cat1].value_counts().index
)
5
self.label_feat_cat2 = \
6
list(self.df_dummy[feat_cat2].value_counts().index
7
)
8
 
9
widget.canvas.figure.clf()
10
widget.canvas.axis1 = \
11
widget.canvas.figure.add_subplot(211,\
12
facecolor = '#fbe7dd')
13
print(self.df_dummy[feat_cat2].value_counts())
14
self.prob_feat_versus_other(feat,\
15
self.df_dummy[feat_cat2],self.label_feat_cat2,\
16
widget.canvas.axis1,label=feat,\
17
title=feat_cat2 + ' versus ' + feat)
18
 
19
widget.canvas.axis1 = \
20
widget.canvas.figure.add_subplot(212,\
21
facecolor = '#fbe7dd')
22
print(self.df_dummy[feat_cat1].value_counts())
23
24 self.prob_feat_versus_other(feat,\
25 self.df_dummy[feat_cat1],self.label_feat_cat1,\
26 widget.canvas.axis1,label=feat,\
27 title=feat_cat1 + ' versus ' + feat)
28  
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
The prob_num_versus_two_cat() function is designed to visualize the
probability distribution of a numerical feature (feat) with respect to two
categorical features (feat_cat1 and feat_cat2). The function utilizes the
previously defined prob_feat_versus_other() function to create KDE plots for
the probability distribution of the numerical feature across different
categories or groups within the two categorical features.
 
When called, the function first retrieves the unique labels of each categorical
feature (feat_cat1 and feat_cat2) from the DataFrame and stores them in
label_feat_cat1 and label_feat_cat2, respectively. It then clears the existing
content of the provided widget object to create new visualizations.
 
Next, the function creates two subplots in a vertical arrangement using the
widget.canvas figure. For each subplot, the function calls the
prob_feat_versus_other function, providing the current numerical feature
(feat) and the corresponding categorical feature (feat_cat1 or feat_cat2) as
arguments, along with the labels of the categories within that categorical
feature. The prob_feat_versus_other() function is responsible for generating
the KDE plots, representing the probability distribution of the numerical
feature for each category in the categorical feature.
 
The function sets appropriate labels and titles for each subplot to indicate the
specific categorical feature being analyzed, along with the numerical feature.
It ensures that the resulting plots are well-organized within the figure layout
and then draws the visualizations on the provided widget. The resulting
visualization provides insights into how the probability distribution of the
numerical feature varies across different categories within both categorical
features, helping to understand the relationships and patterns between these
variables.
 
Step Add this code to the end of choose_plot():
5  
1 if strCB == 'LYVE1':
2 self.prob_num_versus_two_cat("LYVE1","age", \
3 "diagnosis" ,self.widgetPlot1)
4 self.hist_num_versus_four_cat("LYVE1")
5 self.prob_num_versus_two_cat("LYVE1","creatinine"
6 , \
7 "plasma_CA19_9" ,self.widgetPlot2)
8
9 if strCB == 'REG1B':
10 self.prob_num_versus_two_cat("REG1B","age", \
11 "diagnosis" ,self.widgetPlot1)
12 self.hist_num_versus_four_cat("REG1B")
13 self.prob_num_versus_two_cat("REG1B","creatinine"
14 , \
15 "plasma_CA19_9" ,self.widgetPlot2)
16  
17 if strCB == 'TFF1':
18 self.prob_num_versus_two_cat("TFF1","age", \
19 "diagnosis" ,self.widgetPlot1)
20 self.hist_num_versus_four_cat("TFF1")
21 self.prob_num_versus_two_cat("TFF1","creatinine",
\
22
"plasma_CA19_9" ,self.widgetPlot2)
23
 
24
if strCB == 'REG1A':
25
self.prob_num_versus_two_cat("REG1A","age",
26
"diagnosis" ,self.widgetPlot1)
27
self.hist_num_versus_four_cat("REG1A")
self.prob_num_versus_two_cat("REG1A","creatinine"
, \
"plasma_CA19_9" ,self.widgetPlot2)
 
The purpose of the given code is to perform exploratory data analysis and
visualization for different genes ('LYVE1', 'REG1B', 'TFF1', 'REG1A'). The
code aims to compare and understand how the gene expression level (a
numerical feature) is related to various categorical features like 'age',
'diagnosis', 'plasma_CA19_9', and 'creatinine'. For each gene, the code
produces three sets of visualizations to gain insights into the gene's behavior
with respect to these categorical variables.
 
The prob_num_versus_two_cat() function is responsible for creating
probability distribution visualizations (KDE plots) of the gene expression
level based on two categorical features ('age' and 'diagnosis'). The
visualizations provide an understanding of how the gene expression varies
across different age groups and diagnostic categories, helping researchers
identify potential trends or patterns.
 
The hist_num_versus_four_cat() function generates histograms comparing
the gene expression level to four categorical features, namely 'diagnosis',
'age', 'plasma_CA19_9', and 'creatinine'. These histograms enable a
comprehensive examination of the gene's expression distribution concerning
these categorical variables, providing deeper insights into the potential
associations between the gene's expression and various clinical attributes.
 
Overall, the code serves as a valuable tool for exploring and understanding
the relationships between gene expression and different categorical features,
aiding researchers in identifying potential gene expression patterns related to
specific clinical attributes or conditions. By visualizing these relationships,
researchers can gain valuable insights into the biological relevance and
potential significance of the studied genes.
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML MODEL
6 buttons. Then, choose LYVE1 item from cbData widget. You will see the
result as shown in Figure 117.
 

Figure 117 The distribution of LYVE1 variable versus other features


 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML MODEL
7 buttons. Then, choose REG1B item from cbData widget. You will see the
result as shown in Figure 118.
 

Figure 118 The distribution of REG1B variable versus other features


 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML MODEL
8 buttons. Then, choose TFF1 item from cbData widget. You will see the
result as shown in Figure 119.
 
Figure 119 The distribution of TFF1 variable versus other features
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN ML MODEL
9 buttons. Then, choose REG1A item from cbData widget. You will see the
result as shown in Figure 120.
 

Figure 120 The distribution of REG1A variable versus other features


 
 
 
Correlation Matrix and Features Importance
Step Define plot_corr() method to plot correlation matrix on a
1 widget:
 
1 def plot_corr(self, data, widget):
2 corrdata = data.corr()
3 sns.heatmap(corrdata, ax =
4 widget.canvas.axis1, \
5 lw=1, annot=True, cmap="Reds")
6
widget.canvas.axis1.set_title('Correlation
7
Matrix', \
8
fontweight ="bold",fontsize=20)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
The purpose of the plot_corr() function is to create a correlation
matrix heatmap for the given input data. The function takes two
parameters: data, which is the input DataFrame containing
numerical data for which the correlation matrix needs to be
computed, and widget, which represents the plotting widget
where the heatmap will be displayed.
 
Inside the function, the correlation matrix is calculated using
the corr() method of the DataFrame. This matrix provides
information about the pairwise correlation between all
numerical columns in the input data. The correlation coefficient
measures the strength and direction of the linear relationship
between two variables, with values closer to 1 indicating a
strong positive correlation, values closer to -1 indicating a
strong negative correlation, and values close to 0 indicating a
weak or no correlation.
 
The function then utilizes Seaborn's heatmap function to
visualize the correlation matrix as a colored heatmap. The
heatmap provides an intuitive and visually appealing
representation of the correlation coefficients. Each cell in the
heatmap is colored based on the correlation value, with higher
positive correlations shown in shades of red and higher
negative correlations shown in shades of blue.
 
Additionally, the correlation values are annotated in each cell
of the heatmap, making it easy to interpret the exact correlation
coefficient for each pair of variables. The title of the heatmap is
set to "Correlation Matrix," and the layout of the figure is
adjusted for better visualization. Finally, the plot is drawn on
the specified widget, allowing the user to view and explore the
correlation matrix heatmap. This function is useful for
understanding the relationships between different numerical
variables in the input data and identifying potential patterns or
dependencies between them.
 
Step Add this code to the end of choose_plot() method:
2  
1 if strCB == 'Correlation Matrix':
2 self.widgetPlot3.canvas.figure.clf()
3 self.widgetPlot3.canvas.axis1 = \
4 self.widgetPlot3.canvas.figure.add_subplot(111
5 )
6 X,_ = self.fit_dataset(self.df)
self.plot_corr(X, self.widgetPlot3)
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Then, choose Correlation Matrix item
from cbData widget. You will see the result as shown in Figure
121.
 

Figure 121 Correlation matrix


 
Step Define plot_importance() method to plot features importance
4 on a widget:
 
1 def plot_importance(self, widget):
2 #Compares different feature importances
3 r =
4 ExtraTreesClassifier(random_state=0)
5 X,y = self.fit_dataset(self.df)
6 r.fit(X, y)
7 feature_importance_normalized = \
8 np.std([tree.feature_importances_ for tree
in \
9
r.estimators_], axis = 0)
10
 
11
12
sns.barplot(feature_importance_normalized,
13 \
14 X.columns, ax = widget.canvas.axis1)
15 widget.canvas.axis1.set_ylabel('Feature
16 Labels',\
17 fontweight ="bold",fontsize=15)
18
19 widget.canvas.axis1.set_xlabel('Features
Importance',\
20
fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_title(\
'Comparison of different Features
Importance',\
fontweight ="bold",fontsize=20)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
The purpose of the plot_importance() function is to compare
and visualize the feature importances of different features in a
dataset. This function is useful when working with machine
learning models to understand which features contribute the
most to the model's predictive power.
The function uses an Extra Trees Classifier, which is an
ensemble learning method based on decision trees, to calculate
the feature importances. It fits the classifier on the dataset
(self.df) using the fit_dataset method, which separates the
features (X) and the target variable (y). Then, it trains multiple
decision trees within the ensemble and aggregates their feature
importances to obtain a more robust estimate of the importance
for each feature. The feature_importance_normalized variable
stores the normalized standard deviation of feature importances
across all decision trees in the ensemble.
 
The function then creates a bar plot using Seaborn's barplot
function to visualize the feature importances. The feature
importances are shown on the y-axis, and the feature labels
(column names) are shown on the x-axis. The bars represent the
importance of each feature, and they are sorted from highest to
lowest importance. This allows for easy comparison of the
importance of different features in the dataset.
 
The plot is further enhanced by adding axis labels and a title to
provide better context for understanding the visualization. The
title "Comparison of different Features Importance" highlights
the purpose of the plot, and the axis labels "Features
Importance" and "Feature Labels" provide clear descriptions
for the two axes. The layout of the figure is adjusted to ensure
that all elements are properly displayed, and the plot is drawn
on the specified widget, making it accessible for users to
analyze and interpret the feature importances. Overall, this
function provides a quick and effective way to gain insights
into the significance of different features in a machine learning
model.
 
 
Step Add this code to the end of choose_plot() method:
5  
1 if strCB == 'Features Importance':
2 self.widgetPlot3.canvas.figure.clf()
3 self.widgetPlot3.canvas.axis1 = \
4 self.widgetPlot3.canvas.figure.add_subplot(111
5 )
self.plot_importance(self.widgetPlot3)
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
6 ML MODEL buttons. Then, choose Features Importance
item from cbData widget. You will see the result as shown in
Figure 122.
 

Figure 122 The features importance


 
 
 
Helper Functions to Plot Model’s Performance
Step Define plot_real_pred_val() method to plot true values and
1 predicted values and plot_cm() method to calculate and plot
confusion matrix:
 
1 def plot_real_pred_val(self, Y_pred, Y_test,
2 widget, title):
3 #Calculate Metrics
4 acc=accuracy_score(Y_test,Y_pred)
5  
6 #Output plot
7 widget.canvas.figure.clf()
8 widget.canvas.axis1 = \
9 widget.canvas.figure.add_subplot(111,\
10 facecolor='steelblue')
11
widget.canvas.axis1.scatter(range(len(Y_pred)),\
12
Y_pred,color="yellow",lw=5,label="Predictions")
13
14
widget.canvas.axis1.scatter(range(len(Y_test)),
15 \
16 Y_test,color="red",label="Actual")
17 widget.canvas.axis1.set_title(\
18 "Prediction Values vs Real Values of " + title,
19 \
20 fontsize=10)
21 widget.canvas.axis1.set_xlabel("Accuracy: " +
\
22
str(round((acc*100),3)) + "%")
23
widget.canvas.axis1.legend()
24
widget.canvas.axis1.grid(True, alpha=0.75,
25
lw=1, ls='-.')
26
widget.canvas.draw()
27
 
28
def plot_cm(self, Y_pred, Y_test, widget,
29 title):
30 cm=confusion_matrix(Y_test,Y_pred)
31 widget.canvas.figure.clf()
32 widget.canvas.axis1 =
33 widget.canvas.figure.add_subplot(111)
34 class_label = ["1", "2", "3"]
35 df_cm = pd.DataFrame(cm, \
36 index=class_label,columns=class_label)
sns.heatmap(df_cm, ax=widget.canvas.axis1, \
annot=True, cmap='plasma',linewidths=2,fmt='d')
widget.canvas.axis1.set_title("Confusion
Matrix of " + \
title, fontsize=10)
widget.canvas.axis1.set_xlabel("Predicted")
widget.canvas.axis1.set_ylabel("True")
widget.canvas.draw()
 
The plot_real_pred_val() function is used to create a scatter plot
that compares the predicted values (Y_pred) against the actual
values (Y_test) for a specific prediction task. It calculates the
accuracy score based on the predicted and actual values and then
proceeds to create the scatter plot. The predicted values are
represented as yellow dots on the plot, and the actual values are
represented as red dots. The title of the plot is set to "Prediction
Values vs Real Values of [title]", where [title] is the specified title
of the plot. Additionally, the accuracy score is displayed on the x-
axis of the plot to provide an indication of how well the model's
predictions align with the actual values.
 
The plot_cm function() is designed to visualize the confusion
matrix for a classification task. It takes the predicted values
(Y_pred) and the actual values (Y_test) and calculates the
confusion matrix (cm) using scikit-learn's confusion_matrix
function. The confusion matrix is then converted into a
DataFrame to be used with Seaborn's heatmap function. The
heatmap represents the values in the confusion matrix using
different colors, where each cell in the heatmap corresponds to the
number of samples that fall into a particular category. The x-axis
and y-axis of the heatmap are labeled as "Predicted" and "True,"
respectively, to indicate the meaning of each axis. The title of the
plot is set to "Confusion Matrix of [title]", where [title] is the
specified title for the plot.
 
Both of these functions are useful for evaluating the performance
of machine learning models in classification tasks. The scatter
plot visually shows how well the predicted values match the
actual values, providing insights into the model's accuracy and
potential biases. On the other hand, the confusion matrix heatmap
provides a more detailed breakdown of the model's predictions,
showing the true positive, false positive, true negative, and false
negative values, which are essential for evaluating the
performance of a classifier and understanding its strengths and
weaknesses.
 
 
Step Define plot_decision() method to plot ROC:
2  
1 def plot_decision(self, cla, feat1, feat2,
2 widget, title=""):
3 curr_path = os.getcwd()
4 dataset_dir = curr_path + "/Debernardi
et al 2020 data.csv"
5
6
#Loads csv file
7
df, _ = self.read_dataset(dataset_dir)
8
9
#Plots decision boundary of two features
10
feat_boundary = [feat1, feat2]
11
X_feature = df[feat_boundary]
12
X_train_feature, X_test_feature,
13
y_train_feature, \
14
y_test_feature= train_test_split(X_feature,
15 \
16 df['diagnosis'], test_size=0.3,
17 random_state = 42)
18 cla.fit(X_train_feature,
19 y_train_feature)
20
21
plot_decision_regions(X_test_feature.values,
22
\
23
24 y_test_feature.ravel(), clf=cla, legend=2,
\
ax=widget.canvas.axis1)
widget.canvas.axis1.set_title(title, \
fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_xlabel(feat1)
widget.canvas.axis1.set_ylabel(feat2)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
The plot_decision() function is used to visualize the decision
boundary of a classifier (cla) on a 2D plane defined by two
features (feat1 and feat2). The function first loads a CSV file
named "Debernardi et al 2020 data.csv" using the read_dataset
method, which is assumed to contain the dataset for the classifier.
The features specified in feat1 and feat2 are selected from the
dataset, and the target variable is assumed to be labeled as
'diagnosis'.
 
The dataset is then split into training and testing sets using the
train_test_split function from scikit-learn, with 70% of the data
used for training and 30% for testing. The classifier is then trained
on the training set using cla.fit.
 
After the classifier is trained, the decision boundary is plotted on
a 2D plane defined by the values of feat1 and feat2. The decision
regions are calculated using plot_decision_regions from the
mlxtend library, which visualizes how the classifier divides the
feature space into different regions based on its predictions. The
decision boundary is displayed on the plot, with different regions
representing different predicted classes.
 
The title of the plot is set to the value of the title parameter, and
the x-axis and y-axis are labeled with the names of feat1 and
feat2, respectively. The resulting plot shows the decision regions
of the classifier on the 2D feature plane, providing insights into
how well the classifier separates different classes based on the
selected features.
 
Step Define plot_learning_curve() to plot learning curve of any
3 machine learning model:
 
1 def plot_learning_curve(self,estimator,
2 title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0,
3
5)):
4
widget.canvas.axis1.set_title(title)
5
6 if ylim is not None:
7 widget.canvas.axis1.set_ylim(*ylim)
8 widget.canvas.axis1.set_xlabel("Training
9 examples")
10 widget.canvas.axis1.set_ylabel("Score")
11  
12 train_sizes, train_scores, test_scores,
fit_times, _ = \
13
learning_curve(estimator, X, y,
14
cv=cv, n_jobs=n_jobs,
15
train_sizes=train_sizes,
16
return_times=True)
17
train_scores_mean = np.mean(train_scores,
18 axis=1)
19 train_scores_std = np.std(train_scores,
20 axis=1)
21 test_scores_mean = np.mean(test_scores,
22 axis=1)
23 test_scores_std = np.std(test_scores,
axis=1)
24
 
25
# Plot learning curve
26
widget.canvas.axis1.grid()
27
28
widget.canvas.axis1.fill_between(train_sizes,
29 \
30 train_scores_mean - train_scores_std,\
31 train_scores_mean + train_scores_std,
alpha=0.1,color="r")

widget.canvas.axis1.fill_between(train_sizes,
\
test_scores_mean - test_scores_std,\
test_scores_mean + test_scores_std,
alpha=0.1, color="g")
widget.canvas.axis1.plot(train_sizes,
train_scores_mean, \
'o-', color="r", label="Training score")
widget.canvas.axis1.plot(train_sizes,
test_scores_mean, \
'o-', color="g", label="Cross-validation
score")
widget.canvas.axis1.legend(loc="best")
 
The plot_learning_curve() function serves the purpose of
visualizing the learning curve of a machine learning model.
Learning curves provide insights into how the model's
performance changes as the amount of training data increases.
This function takes in an estimator (machine learning model)
along with the feature data (X) and target data (y) for training. It
also accepts optional parameters like title for the plot, ylim to set
the y-axis limits, and cv for cross-validation strategy. The
function uses the learning_curve() utility from scikit-learn to
calculate the training and cross-validation scores for different
training set sizes.
 
The function then plots the learning curve on the provided widget
or subplot. The plot shows the number of training examples on
the x-axis and the model's scores (e.g., accuracy) on the y-axis.
Two curves are drawn: one for the training score and another for
the cross-validation score. The shaded areas around the curves
represent the variability of scores across different cross-validation
folds. The learning curve helps in understanding if the model is
overfitting or underfitting. When the curves converge as more
data is added, it indicates that the model might benefit from
additional training data. Conversely, a substantial gap between the
training and cross-validation curves suggests overfitting, and
regularization or feature selection may be required to improve
generalization.
 
Overall, this function is a valuable diagnostic tool for evaluating
and improving the performance of machine learning models,
providing valuable insights into data size requirements and
potential model adjustments.
 
Step Define plot_scalability_curve() to plot the scalability of the
4 model and plot_performance_curve() method to plot
performance of the model on a widget:
 
1 def plot_scalability_curve(self,estimator,
2 title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0,
3
5)):
4
widget.canvas.axis1.set_title(title, \
5
fontweight ="bold",fontsize=15)
6
if ylim is not None:
7
widget.canvas.axis1.set_ylim(*ylim)
8
widget.canvas.axis1.set_xlabel("Training
9 examples")
10 widget.canvas.axis1.set_ylabel("Score")
11  
12 train_sizes, train_scores, test_scores,
13 fit_times, _ = \
14 learning_curve(estimator, X, y, cv=cv,
n_jobs=n_jobs,
15 train_sizes=train_sizes,
16 return_times=True)
17 fit_times_mean = np.mean(fit_times, axis=1)
18 fit_times_std = np.std(fit_times, axis=1)
19  
20 # Plot n_samples vs fit_times
21 widget.canvas.axis1.grid()
22 widget.canvas.axis1.plot(train_sizes,
23 fit_times_mean, 'o-')
24
widget.canvas.axis1.fill_between(train_sizes, \
25
fit_times_mean - fit_times_std,\
26
fit_times_mean + fit_times_std,
27
alpha=0.1)
28
widget.canvas.axis1.set_xlabel("Training
29 examples")
30 widget.canvas.axis1.set_ylabel("fit_times")
31  
32 def plot_performance_curve(self,estimator,
33 title, X, y, \
34 widget, ylim=None, cv=None, n_jobs=None, \
35 train_sizes=np.linspace(.1, 1.0, 5)):
36  
37 widget.canvas.axis1.set_title(title, \
38 fontweight ="bold",fontsize=15)
39 if ylim is not None:
40 widget.canvas.axis1.set_ylim(*ylim)
41 widget.canvas.axis1.set_xlabel("Training
42 examples")
43 widget.canvas.axis1.set_ylabel("Score")
44  
45 train_sizes, train_scores, test_scores,
fit_times, _ = \
46
learning_curve(estimator, X, y, cv=cv,
47
n_jobs=n_jobs,
48
train_sizes=train_sizes,
49
return_times=True)
50
test_scores_mean = np.mean(test_scores,
51 axis=1)
52 test_scores_std = np.std(test_scores,
53 axis=1)
54 fit_times_mean = np.mean(fit_times,
axis=1)
 
# Plot n_samples vs fit_times
widget.canvas.axis1.grid()
widget.canvas.axis1.plot(fit_times_mean, \
test_scores_mean, 'o-')

widget.canvas.axis1.fill_between(fit_times_mean,
\
test_scores_mean - test_scores_std,\
test_scores_mean + test_scores_std,
alpha=0.1)
widget.canvas.axis1.set_xlabel("fit_times")
widget.canvas.axis1.set_ylabel("Score")
 
The plot_scalability_curve() function aims to visualize the
scalability of a machine learning model concerning the training
time as the size of the training data increases. This function is
essential when evaluating how the model's training time changes
with different dataset sizes. It accepts an estimator (machine
learning model), feature data (X), and target data (y) for training.
Additionally, it takes optional parameters like title for the plot,
ylim to set the y-axis limits, and cv for cross-validation strategy.
The function uses the learning_curve utility from scikit-learn to
compute the training times for different training set sizes. The plot
displays the number of training examples on the x-axis and the
average fit times (training times) on the y-axis. The shaded area
around the curve represents the standard deviation in the fit times.
This helps in understanding how the training time changes as the
dataset size increases, allowing the evaluation of the model's
scalability.
 
On the other hand, the plot_performance_curve() function is
intended to visualize how the model's performance changes
concerning the training time. This is helpful in assessing whether
the model's accuracy or other metrics improve or degrade as the
training time increases. The function takes similar parameters as
the plot_scalability_curve. It uses the learning_curve utility to
calculate the training times and performance scores (e.g.,
accuracy) for different dataset sizes. The plot displays the average
fit times on the x-axis and the average performance scores on the
y-axis. The shaded area around the curve represents the standard
deviation in the performance scores. This helps in understanding
the trade-off between training time and model performance,
enabling the identification of the optimal balance between
efficiency and accuracy. Both functions provide valuable insights
into the model's behavior with varying dataset sizes and training
times, aiding in the model selection and optimization process.
 
 
 
 
Training Model and Predicting Pancreatic Cancer
Step Define train_model() and predict_model() methods to train any
1 classifier and calculate the prediction:
 
1 def train_model(self, model, X, y):
2 model.fit(X, y)
3 return model
4  
5 def predict_model(self, model, X,
6 proba=False):
7 if ~proba:
8 y_pred = model.predict(X)
9 else:
10 y_pred_proba =
model.predict_proba(X)
11
y_pred = np.argmax(y_pred_proba,
12
axis=1)
 
return y_pred
 
The train_model() function is responsible for training a given
machine learning model with the provided feature data X and
target data y. It simply calls the fit method of the model object,
passing in the feature data X and target data y for training. After
training the model, it returns the trained model. This function is
useful for easily training different machine learning models
without having to repeat the fit step in multiple places within the
code. By calling this function with various models and datasets, it
allows for efficient experimentation and comparison of different
models.
 
The predict_model() function takes a trained machine learning
model and feature data X, and it returns the predicted target
values (y_pred). The proba parameter is a boolean flag that
determines whether the function should return the actual predicted
classes (False) or the predicted class probabilities (True) if the
model supports probability estimates (e.g., for classifiers with
predict_proba method). If proba is False, the function simply calls
the predict method of the model object to obtain the predicted
classes. If proba is True, it calls the predict_proba method to get
the class probabilities and then selects the class with the highest
probability as the predicted class for each sample. The function
returns the predicted target values (y_pred) either as class labels
or class probabilities based on the proba flag.
 
Both functions are generic and flexible, allowing for seamless
integration of various machine learning models and facilitating
model training and prediction tasks throughout the code. They
simplify the process of training models and making predictions,
promoting code modularity and readability.
 
 
Step Define run_model() method to calculate accuracy and precision.
2 It also invokes six methods to plot confusion matrix, true values
versus predicted values diagram, ROC, learning curve, and
decision boundaries:
 
1 def run_model(self, name, scaling, model,
2 X_train, X_test, y_train, y_test, train=True,
proba=True):
3
if train == True:
4
model = self.train_model(model, X_train,
5
y_train)
6
y_pred = self.predict_model(model, X_test,
7 proba)
8
9 accuracy = accuracy_score(y_test, y_pred)
10 recall = recall_score(y_test, y_pred,
11 average='weighted')
12 precision = precision_score(y_test, y_pred,
13 \
14 average='weighted')
15 f1 = f1_score(y_test, y_pred,
average='weighted')
16
17
print('accuracy: ', accuracy)
18
print('recall: ',recall)
19
print('precision: ', precision)
20
print('f1: ', f1)
21
print(classification_report(y_test, y_pred))
22
23
self.widgetPlot1.canvas.figure.clf()
24
self.widgetPlot1.canvas.axis1 = \
25
self.widgetPlot1.canvas.figure.add_subplot(111,
26
\
27
facecolor = '#fbe7dd')
28
self.plot_cm(y_pred, y_test, self.widgetPlot1,
29 \
30 name + " -- " + scaling)
31 self.widgetPlot1.canvas.figure.tight_layout()
32 self.widgetPlot1.canvas.draw()
33  
34 self.widgetPlot2.canvas.figure.clf()
35 self.widgetPlot2.canvas.axis1 = \
36 self.widgetPlot2.canvas.figure.add_subplot(111,
37 \
38 facecolor = '#fbe7dd')
39 self.plot_real_pred_val(y_pred, y_test, \
40 self.widgetPlot2, name + " -- " + scaling)
41 self.widgetPlot2.canvas.figure.tight_layout()
42 self.widgetPlot2.canvas.draw()
43  
44 self.widgetPlot3.canvas.figure.clf()
45 self.widgetPlot3.canvas.axis1 = \
46 self.widgetPlot3.canvas.figure.add_subplot(221,
\
47
facecolor = '#fbe7dd')
48
self.plot_decision(model, 'creatinine',
49
'diagnosis', \
50
self.widgetPlot3, title="The decision
51 boundaries of " + \
52 name + " -- " + scaling)
53 self.widgetPlot3.canvas.figure.tight_layout()
54
55 self.widgetPlot3.canvas.axis1 = \
56 self.widgetPlot3.canvas.figure.add_subplot(222,
57 \
58 facecolor = '#fbe7dd')
59 self.plot_learning_curve(model, \
60 'Learning Curve' + " -- " + scaling, X_train, \
61 y_train, self.widgetPlot3)
62 self.widgetPlot3.canvas.figure.tight_layout()
63
64 self.widgetPlot3.canvas.axis1 = \
65 self.widgetPlot3.canvas.figure.add_subplot(223,
66 \
67 facecolor = '#fbe7dd')
68 self.plot_scalability_curve(model, 'Scalability
of ' + \
69
name + " -- " + scaling, X_train, y_train, \
70
self.widgetPlot3)
self.widgetPlot3.canvas.figure.tight_layout()
 
self.widgetPlot3.canvas.axis1 = \
self.widgetPlot3.canvas.figure.add_subplot(224,
\
facecolor = '#fbe7dd')
self.plot_performance_curve(model, \
'Performance of ' + name + " -- " + scaling, \
X_train, y_train, self.widgetPlot3)
self.widgetPlot3.canvas.figure.tight_layout()
 
self.widgetPlot3.canvas.draw()
 
The run_model() function serves as the central hub to execute
various tasks associated with a given machine learning model. It
takes multiple parameters, including name (name of the model),
scaling (indicates whether data scaling is used), model (the
machine learning model object), X_train and X_test (feature
datasets for training and testing), y_train and y_test (target
datasets for training and testing), train (a boolean flag indicating
whether to train the model or use an already trained model), and
proba (a boolean flag indicating whether to predict probabilities
or class labels).
 
If train is set to True, it calls the train_model function to train the
model using the training data (X_train and y_train). Then, it calls
the predict_model function to predict the target values (y_pred)
using the trained model and the testing data (X_test). It then
calculates and prints several evaluation metrics such as accuracy,
recall, precision, and F1-score, and also prints a classification
report.
 
The function further produces several plots related to model
evaluation. It clears and sets up three different subplots in
self.widgetPlot1, self.widgetPlot2, and self.widgetPlot3 to show a
confusion matrix, real vs. predicted values, decision boundaries,
learning curves, scalability curves, and performance curves of the
model. These plots help visualize and analyze the performance
and behavior of the model using various datasets and metrics,
making it easier to assess its strengths and weaknesses. The
function effectively combines multiple evaluation and
visualization tasks in one central place, making it more
convenient for model evaluation and comparison across different
settings.
 
 
 
Logistic Regression Classifier
Step Define build_train_lr() method to build and train Logistic
1 Regression (LR) classifier using three feature scaling: Raw,
Normalization, and Standardization:
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19 def build_train_lr(self):
20 if path.isfile('logregRaw.pkl'):
21 #Loads model
22 self.logregRaw =
23 joblib.load('logregRaw.pkl')
24 self.logregNorm =
joblib.load('logregNorm.pkl')
25
self.logregStand =
26
joblib.load('logregStand.pkl')
27
28
if self.rbRaw.isChecked():
29
self.run_model('Logistic Regression',
30 'Raw', \
31 self.logregRaw, self.X_train_raw, \
32 self.X_test_raw, self.y_train_raw, \
33 self.y_test_raw)
34
if self.rbNorm.isChecked():
35
self.run_model('Logistic Regression', \
36
'Normalization', self.logregNorm, \
37
self.X_train_norm, self.X_test_norm, \
38
self.y_train_norm, self.y_test_norm)
39
 
40
if self.rbStand.isChecked():
41
self.run_model('Logistic Regression', \
42
'Standardization', self.logregStand, \
43
self.X_train_stand, self.X_test_stand, \
44
self.y_train_stand, self.y_test_stand)
45
 
46
else:
47
#Builds and trains Logistic Regression
48
self.logregRaw =
49 LogisticRegression(solver='lbfgs', \
50 max_iter=2000, random_state=2021)
51 self.logregNorm =
52 LogisticRegression(solver='lbfgs', \
53 max_iter=2000, random_state=2021)
54 self.logregStand =
55 LogisticRegression(solver='lbfgs', \
56 max_iter=2000, random_state=2021)

if self.rbRaw.isChecked():
self.run_model('Logistic Regression',
'Raw', \
self.logregRaw, self.X_train_raw, \
self.X_test_raw, self.y_train_raw, \
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('Logistic Regression', \
'Normalization', self.logregNorm, \
self.X_train_norm, self.X_test_norm, \
self.y_train_norm, self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('Logistic Regression', \
'Standardization', self.logregStand, \
self.X_train_stand, self.X_test_stand, \
self.y_train_stand, self.y_test_stand)
 
#Saves model
joblib.dump(self.logregRaw,
'logregRaw.pkl')
joblib.dump(self.logregNorm,
'logregNorm.pkl')
joblib.dump(self.logregStand,
'logregStand.pkl')
 
The build_train_lr() function is responsible for building and
training three different logistic regression models with
different data preprocessing techniques (raw, normalization,
and standardization). The function first checks if the trained
models are already saved in the form of Pickle files
('logregRaw.pkl', 'logregNorm.pkl', 'logregStand.pkl') by
using the path.isfile function. If the models exist, it loads
them using joblib.load and proceeds to perform model
evaluation using the run_model function for each
preprocessing technique based on the selected radio buttons
(rbRaw, rbNorm, rbStand). If the models do not exist, the
function proceeds to build new logistic regression models
with the specified hyperparameters using the
LogisticRegression class from scikit-learn. Once the models
are built and trained, it performs model evaluation and saves
the trained models as Pickle files using joblib.dump.
 
The function allows for reusing the trained models if they
have already been built and saved, avoiding the need to
retrain the models every time the program runs. This
approach saves computational time and resources.
Additionally, the function provides flexibility to evaluate
different logistic regression models with different
preprocessing techniques, enabling comparison of their
performance under various settings. The code is part of a
larger program that includes the necessary data
preprocessing steps and user interface for model evaluation
and selection. The function centralizes the logistic regression
model training and evaluation process, making the code
more organized and easier to maintain.
 
 
Step 2 Define choose_ML_model() method to read
cbClassifier widget:
 
1 def choose_ML_model(self):
2 strCB = self.cbClassifier.current
3
4 if strCB == 'Logistic Regression':
5 self.build_train_lr()
 
 
Step 3 Connect currentIndexChanged() event of cbC
choose_ML_model() method and put it inside __in
in line 12-13:
 
1 def __init__(self):
2 QMainWindow.__init__(self)
3 loadUi("gui_pancreatic.ui",self)
4 self.setWindowTitle(\
5 "GUI Demo of Classifying and Predicti
6 Cancer")
7 self.addToolBar(NavigationToolbar(\
8 self.widgetPlot1.canvas, self))
9 self.pbLoad.clicked.connect(self.impo
10 self.initial_state(False)
11 self.pbTrainML.clicked.connect(self.t
12 self.cbData.currentIndexChanged.conne
)
13
self.cbClassifier.currentIndexChanged
self.choose_ML_model)
 
 
Step 4 Run gui_pancreatic.py and click LOAD DATA and
buttons. Click on Raw radio button. Then, choose L
from cbClassifier widget. Then, you will see the re
123.
 
Figure 123 The result using LR model with ra
 

Figure 124 The result using LR model with norma


 

Figure 125 The result using LR model with standar


 
Click on Norm radio button. Then, choose Logisti
cbClassifier widget. Then, you will see the result as
 
Click on Stand radio button. Then, choose Logisti
cbClassifier widget. Then, you will see the result as
 
 
 
 
Support Vector Classifier
Step Define build_train_svm() method to build and train Support
1 Vector Machine (SVM) classifier using three feature scaling:
Raw, Normalization, and Standardization:
 
1 def build_train_svm(self):
2 if path.isfile('SVMRaw.pkl'):
3 #Loads model
4 self.SVMRaw = joblib.load('SVMRaw.pkl')
5 self.SVMNorm = joblib.load('SVMNorm.pkl')
6 self.SVMStand =
7 joblib.load('SVMStand.pkl')
8
9 if self.rbRaw.isChecked():
10 self.run_model('Support Vector Machine',
'Raw', \
11
self.SVMRaw, self.X_train_raw,
12
self.X_test_raw, \
13
self.y_train_raw, self.y_test_raw)
14
if self.rbNorm.isChecked():
15
self.run_model('Support Vector Machine', \
16
'Normalization', self.SVMNorm, \
17
self.X_train_norm, self.X_test_norm, \
18
self.y_train_norm, self.y_test_norm)
19
 
20
if self.rbStand.isChecked():
21
self.run_model('Support Vector Machine', \
22
'Standardization', self.SVMStand, \
23
self.X_train_stand, self.X_test_stand, \
24
self.y_train_stand, self.y_test_stand)
25
else:
26
#Builds and trains Logistic Regression
27
self.SVMRaw =
28
SVC(random_state=2021,probability=True)
29
self.SVMNorm =
30 SVC(random_state=2021,probability=True)
31 self.SVMStand =
32 SVC(random_state=2021,probability=True)
33
34 if self.rbRaw.isChecked():
35 self.run_model('Support Vector Machine',
36 'Raw', \
37 self.SVMRaw, self.X_train_raw,
self.X_test_raw, \
38
39 self.y_train_raw, self.y_test_raw)
40 if self.rbNorm.isChecked():
41 self.run_model('Support Vector Machine', \
42 'Normalization', self.SVMNorm, \
43 self.X_train_norm, self.X_test_norm, \
44 self.y_train_norm, self.y_test_norm)
45  
46 if self.rbStand.isChecked():
47 self.run_model('Support Vector Machine', \
48 'Standardization', self.SVMStand, \
49 self.X_train_stand, self.X_test_stand, \
50 self.y_train_stand, self.y_test_stand)
 
#Saves model
joblib.dump(self.SVMRaw,
'SVMRaw.pkl')
joblib.dump(self.SVMNorm,
'SVMNorm.pkl')
joblib.dump(self.SVMStand,
'SVMStand.pkl')
 
 

Figure 126 The result using SVM model with raw feature
scaling
 
The build_train_svm() function is responsible for building
and training three different Support Vector Machine (SVM)
models with different data preprocessing techniques (raw,
normalization, and standardization). The function first checks
if the trained models are already saved as Pickle files
('SVMRaw.pkl', 'SVMNorm.pkl', 'SVMStand.pkl') using
path.isfile. If the models exist, it loads them using joblib.load
and proceeds to perform model evaluation using the
run_model function for each preprocessing technique based
on the selected radio buttons (rbRaw, rbNorm, rbStand).
 
If the models do not exist, the function proceeds to build new
SVM models with the specified hyperparameters using the
SVC class from scikit-learn. The probability=True argument
is set to enable the models to predict class probabilities. Once
the models are built and trained, it performs model
evaluation and saves the trained models as Pickle files using
joblib.dump.
 
Similar to the build_train_lr() function, this function also
allows for reusing the trained models if they have already
been built and saved, reducing the need to retrain the models
every time the program runs. It provides flexibility to
evaluate different SVM models with different preprocessing
techniques, enabling comparison of their performance under
various settings. The function is part of a larger program that
includes data preprocessing and user interface components
for model evaluation and selection. It centralizes the SVM
model training and evaluation process, making the code more
organized and easier to maintain.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'Support Vector Machine':
2 self.build_train_svm()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose Support Vector Machine item from cbClassifier
widget. Then, you will see the result as shown in Figure 126.
 

Figure 127 The result using SVM model with normalization


feature scaling
 
Figure 128 The result using SVM model with
standardization feature scaling
 
Click on Norm radio button. Then, choose Support Vector
Machine item from cbClassifier widget. Then, you will see
the result as shown in Figure 127.
 
Click on Stand radio button. Then, choose Support Vector
Machine item from cbClassifier widget. Then, you will see
the result as shown in Figure 128.
 
 
 
K-Nearest Neighbors Classifier
Step Define build_train_knn() method to build and train K-
1 Nearest Neighbor (KNN) classifier using three feature
scaling: Raw, Normalization, and Standardization:
 
1 def build_train_knn(self):
2 if path.isfile('KNNRaw.pkl'):
3 #Loads model
4 self.KNNRaw = joblib.load('KNNRaw.pkl')
5 self.KNNNorm = joblib.load('KNNNorm.pkl')
6 self.KNNStand =
7 joblib.load('KNNStand.pkl')
8
9 if self.rbRaw.isChecked():
10 self.run_model('K-Nearest Neighbor',
'Raw', \
11
self.KNNRaw, self.X_train_raw,
12
self.X_test_raw, \
13
self.y_train_raw, self.y_test_raw)
14
if self.rbNorm.isChecked():
15
self.run_model('K-Nearest Neighbor', \
16
'Normalization', self.KNNNorm, \
17 self.X_train_norm, self.X_test_norm, \
18 self.y_train_norm, self.y_test_norm)
19  
20 if self.rbStand.isChecked():
21 self.run_model('K-Nearest Neighbor', \
22 'Standardization', self.KNNStand, \
23 self.X_train_stand, self.X_test_stand, \
24 self.y_train_stand, self.y_test_stand)
25  
26 else:
27 #Builds and trains K-Nearest Neighbor
28 self.KNNRaw =
29 KNeighborsClassifier(n_neighbors = 50)
30 self.KNNNorm =
KNeighborsClassifier(n_neighbors = 50)
31
self.KNNStand =
32
KNeighborsClassifier(n_neighbors = 50)
33
34
if self.rbRaw.isChecked():
35
self.run_model('K-Nearest Neighbor',
36 'Raw', \
37 self.KNNRaw, self.X_train_raw, \
38 self.X_test_raw, self.y_train_raw, \
39 self.y_test_raw)
40
if self.rbNorm.isChecked():
41
self.run_model('K-Nearest Neighbor', \
42
'Normalization', self.KNNNorm, \
43
self.X_train_norm, self.X_test_norm, \
44
self.y_train_norm, self.y_test_norm)
45
 
46
if self.rbStand.isChecked():
47
self.run_model('K-Nearest Neighbor', \
48
'Standardization', self.KNNStand, \
49
self.X_train_stand, self.X_test_stand, \
50
self.y_train_stand, self.y_test_stand)
51
 
52
#Saves model
joblib.dump(self.KNNRaw,
'KNNRaw.pkl')
joblib.dump(self.KNNNorm,
'KNNNorm.pkl')
joblib.dump(self.KNNStand,
'KNNStand.pkl')
 
The build_train_knn() function is responsible for building
and training three different K-Nearest Neighbor (KNN)
models with different data preprocessing techniques (raw,
normalization, and standardization). The function first checks
if the trained models are already saved as Pickle files
('KNNRaw.pkl', 'KNNNorm.pkl', 'KNNStand.pkl') using
path.isfile. If the models exist, it loads them using joblib.load
and proceeds to perform model evaluation using the
run_model function for each preprocessing technique based
on the selected radio buttons (rbRaw, rbNorm, rbStand).
 
If the models do not exist, the function proceeds to build new
KNN models with the specified hyperparameter
n_neighbors=50 using the KNeighborsClassifier class from
scikit-learn. Once the models are built and trained, it
performs model evaluation and saves the trained models as
Pickle files using joblib.dump.
 
Similar to the previous functions (build_train_lr,
build_train_svm), this function also allows for reusing the
trained models if they have already been built and saved. It
provides flexibility to evaluate different KNN models with
different preprocessing techniques and n_neighbors
hyperparameter, enabling comparison of their performance
under various settings. The function is part of a larger
program that includes data preprocessing and user interface
components for model evaluation and selection. It centralizes
the KNN model training and evaluation process, making the
code more organized and easier to maintain.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'K-Nearest Neighbor':
2 self.build_train_knn()
 
 
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose K-Nearest Neighbor item from cbClassifier widget.
Then, you will see the result as shown in Figure 129.
 
Click on Norm radio button. Then, choose K-Nearest
Neighbor item from cbClassifier widget. Then, you will see
the result as shown in Figure 130.
 
Click on Stand radio button. Then, choose K-Nearest
Neighbor item from cbClassifier widget. Then, you will see
the result as shown in Figure 131.
 

Figure 129 The result using KNN model with raw feature
scaling
 

Figure 130 The result using KNN model with normalization


feature scaling
 

Figure 131 The result using KNN model with


standardization feature scaling
 
 
 
Decision Tree Classifier
Step Define build_train_dt() method to build and train Decision
1 Tree (DT) classifier using three feature scaling: Raw,
Normalization, and Standardization:
 
1 def build_train_dt(self):
2 if path.isfile('DTRaw.pkl'):
3 #Loads model
4 self.DTRaw = joblib.load('DTRaw.pkl')
5 self.DTNorm = joblib.load('DTNorm.pkl')
6 self.DTStand = joblib.load('DTStand.pkl')
7
8 if self.rbRaw.isChecked():
9 self.run_model('Decision Tree', \
10 'Raw', self.DTRaw, self.X_train_raw, \
11 self.X_test_raw, self.y_train_raw,
12 self.y_test_raw)
13 if self.rbNorm.isChecked():
14 self.run_model('Decision Tree', \
15 'Normalization', self.DTNorm, \
16 self.X_train_norm, self.X_test_norm, \
17 self.y_train_norm, self.y_test_norm)
18  
19 if self.rbStand.isChecked():
20 self.run_model('Decision Tree', \
21 'Standardization', self.DTStand, \
22 self.X_train_stand, self.X_test_stand, \
23 self.y_train_stand, self.y_test_stand)
24  
25 else:
26 #Builds and trains Decision Tree
27 dt = DecisionTreeClassifier()
28 parameters = {
29 'max_depth':np.arange(1,20,1),\
30 'random_state':[2021]}
31 self.DTRaw = GridSearchCV(dt, parameters)
32 self.DTNorm = GridSearchCV(dt,
parameters)
33
self.DTStand = GridSearchCV(dt,
34
parameters)
35
36
if self.rbRaw.isChecked():
37
self.run_model('Decision Tree', \
38
'Raw', self.DTRaw, self.X_train_raw, \
39
self.X_test_raw, self.y_train_raw, \
40 self.y_test_raw)
41 if self.rbNorm.isChecked():
42 self.run_model('Decision Tree', \
43 'Normalization', self.DTNorm, \
44 self.X_train_norm, self.X_test_norm, \
45 self.y_train_norm, self.y_test_norm)
46  
47 if self.rbStand.isChecked():
48 self.run_model('Decision Tree', \
49 'Standardization', self.DTStand, \
50 self.X_train_stand, self.X_test_stand, \
51 self.y_train_stand, self.y_test_stand)
52  
53 #Saves model
54 joblib.dump(self.DTRaw,
55 'DTRaw.pkl')
joblib.dump(self.DTNorm,
'DTNorm.pkl')
joblib.dump(self.DTStand,
'DTStand.pkl')
 
 

Figure 132 The result using DT model with raw feature


scaling
 
The build_train_dt() function follows a similar pattern to the
previously explained functions. Its purpose is to build and
train three different Decision Tree models with different data
preprocessing techniques (raw, normalization, and
standardization).
The function first checks if the trained models are already
saved as Pickle files ('DTRaw.pkl', 'DTNorm.pkl',
'DTStand.pkl') using path.isfile. If the models exist, it loads
them using joblib.load and proceeds to perform model
evaluation using the run_model() function for each
preprocessing technique based on the selected radio buttons
(rbRaw, rbNorm, rbStand).
 
If the models do not exist, the function proceeds to build new
Decision Tree models using the DecisionTreeClassifier class
from scikit-learn and performs hyperparameter tuning using
GridSearchCV to find the best hyperparameters (max_depth
and random_state). Once the models are built and trained, it
performs model evaluation and saves the trained models as
Pickle files using joblib.dump.
 
Like the other functions in the code, build_train_dt provides
flexibility for comparing the performance of Decision Tree
models with different preprocessing techniques and
hyperparameter settings. It centralizes the Decision Tree
model training and evaluation process, making the code more
modular and easier to manage. Additionally, it allows for
reusing the trained models if they have already been built and
saved, thus saving computational time.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'Decision Tree':
2 self.build_train_dt()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose Decision Tree item from cbClassifier widget. Then,
you will see the result as shown in Figure 132.
 

Figure 133 The result using DT model with normalization


feature scaling
 
Figure 134 The result using DT model with standardization
feature scaling
 
Click on Norm radio button. Then, choose Decision Tree
item from cbClassifier widget. Then, you will see the result
as shown in Figure 133.
 
Click on Stand radio button. Then, choose Decision Tree
item from cbClassifier widget. Then, you will see the result
as shown in Figure 134.
 
 
 
Random Forest Classifier
Step Define build_train_rf() method to build and train Random
1 Forest (RF) classifier using three feature scaling: Raw,
Normalization, and Standardization:
 
1 def build_train_rf(self):
2 if path.isfile('RFRaw.pkl'):
3 #Loads model
4 self.RFRaw = joblib.load('RFRaw.pkl')
5 self.RFNorm = joblib.load('RFNorm.pkl')
6 self.RFStand = joblib.load('RFStand.pkl')
7
8 if self.rbRaw.isChecked():
9 self.run_model('Random Forest', 'Raw', \
10 self.RFRaw, self.X_train_raw,
11 self.X_test_raw, \
12 self.y_train_raw, self.y_test_raw)
13 if self.rbNorm.isChecked():
14 self.run_model('Random Forest', \
15 'Normalization', self.RFNorm, \
16 self.X_train_norm, self.X_test_norm, \
17 self.y_train_norm, self.y_test_norm)
18  
19 if self.rbStand.isChecked():
20 self.run_model('Random Forest', \
21 'Standardization', self.RFStand, \
22 self.X_train_stand, self.X_test_stand, \
23 self.y_train_stand, self.y_test_stand)
24  
25 else:
26 #Builds and trains Random Forest
27 self.RFRaw =
28 RandomForestClassifier(n_estimators=200, \
29 max_depth=20, random_state=2021)
30 self.RFNorm = RandomForestClassifier(\
31 n_estimators=200, max_depth=11,
random_state=2021)
32
self.RFStand = RandomForestClassifier(\
33
n_estimators=200, max_depth=11,
34
random_state=2021)
35
36
if self.rbRaw.isChecked():
37
self.run_model('Random Forest', 'Raw',\
38
self.RFRaw, self.X_train_raw, \
39
self.X_test_raw, self.y_train_raw, \
40
self.y_test_raw)
41
if self.rbNorm.isChecked():
42
self.run_model('Random Forest', \
43
'Normalization', self.RFNorm, \
44
self.X_train_norm, self.X_test_norm, \
45
self.y_train_norm, self.y_test_norm)
46
 
47
if self.rbStand.isChecked():
48
self.run_model('Random Forest', \
49
'Standardization', self.RFStand, \
50
self.X_train_stand, self.X_test_stand, \
51
self.y_train_stand, self.y_test_stand)
52
 
53
#Saves model
54
joblib.dump(self.RFRaw,
55
'RFRaw.pkl')
joblib.dump(self.RFNorm,
'RFNorm.pkl')
joblib.dump(self.RFStand,
'RFStand.pkl')
 
The build_train_rf() function serves the same purpose as the
previously explained functions. It builds and trains three
different Random Forest models with different data
preprocessing techniques (raw, normalization, and
standardization).
 
The function first checks if the trained models are already
saved as Pickle files ('RFRaw.pkl', 'RFNorm.pkl',
'RFStand.pkl') using path.isfile. If the models exist, it loads
them using joblib.load and proceeds to perform model
evaluation using the run_model function for each
preprocessing technique based on the selected radio buttons
(rbRaw, rbNorm, rbStand).
 
If the models do not exist, the function proceeds to build new
Random Forest models using the RandomForestClassifier
class from scikit-learn with different hyperparameter settings
(n_estimators and max_depth). Once the models are built and
trained, it performs model evaluation and saves the trained
models as Pickle files using joblib.dump.
 
Like the other functions in the code, build_train_rf() provides
flexibility for comparing the performance of Random Forest
models with different preprocessing techniques and
hyperparameter settings. It centralizes the Random Forest
model training and evaluation process, making the code more
modular and easier to manage. Additionally, it allows for
reusing the trained models if they have already been built and
saved, thus saving computational time.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'Random Forest':
2 self.build_train_rf()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose Random Forest item from cbClassifier widget.
Then, you will see the result as shown in Figure 135.
 
Click on Norm radio button. Then, choose Random Forest
item from cbClassifier widget. Then, you will see the result
as shown in Figure 136.
 
Click on Stand radio button. Then, choose Random Forest
item from cbClassifier widget. Then, you will see the result
as shown in Figure 137.
 

Figure 135 The result using RF model with raw feature


scaling
 

Figure 136 The result using RF model with normalization


feature scaling
 

Figure 137 The result using RF model with standardization


feature scaling
 
 
 
Gradient Boosting Classifier
Step Define build_train_gb() method to build and train Gradient
1 Boosting (GB) classifier using three feature scaling: Raw,
Normalization, and Standardization:
 
1 def build_train_gb(self):
2 if path.isfile('GBRaw.pkl'):
3 #Loads model
4 self.GBRaw = joblib.load('GBRaw.pkl')
5 self.GBNorm = joblib.load('GBNorm.pkl')
6 self.GBStand = joblib.load('GBStand.pkl')
7
8 if self.rbRaw.isChecked():
9 self.run_model('Gradient Boosting',
10 'Raw', \
11 self.GBRaw, self.X_train_raw, \
12 self.X_test_raw, self.y_train_raw, \
13 self.y_test_raw)
14 if self.rbNorm.isChecked():
15 self.run_model('Gradient Boosting', \
16 'Normalization', self.GBNorm, \
17 self.X_train_norm, self.X_test_norm, \
18 self.y_train_norm, self.y_test_norm)
19  
20 if self.rbStand.isChecked():
21 self.run_model('Gradient Boosting',\
22 'Standardization', self.GBStand, \
23 self.X_train_stand, self.X_test_stand, \
24 self.y_train_stand, self.y_test_stand)
25  
26 else:
27 #Builds and trains Gradient Boosting
28 self.GBRaw = GradientBoostingClassifier(\
29 n_estimators = 200, max_depth=20,
30 subsample=0.8, \
31 max_features=0.2, random_state=2021)
32 self.GBNorm =
GradientBoostingClassifier(\
33
n_estimators = 200, max_depth=20,
34
subsample=0.8,\
35
max_features=0.2, random_state=2021)
36
self.GBStand =
37 GradientBoostingClassifier(\
38 n_estimators = 200, max_depth=20,
39 subsample=0.8,\
40 max_features=0.2, random_state=2021)
41
42 if self.rbRaw.isChecked():
43 self.run_model('Gradient Boosting',
44 'Raw', \
45 self.GBRaw, self.X_train_raw, \
46 self.X_test_raw, self.y_train_raw, \
47 self.y_test_raw)
48 if self.rbNorm.isChecked():
49 self.run_model('Gradient Boosting', \
50 'Normalization', self.GBNorm, \
51 self.X_train_norm, self.X_test_norm, \
52 self.y_train_norm, self.y_test_norm)
53  
54 if self.rbStand.isChecked():
55 self.run_model('Gradient Boosting', \
56 'Standardization', self.GBStand, \
57 self.X_train_stand, self.X_test_stand, \
58 self.y_train_stand, self.y_test_stand)
59  
#Saves model
joblib.dump(self.GBRaw,
'GBRaw.pkl')
joblib.dump(self.GBNorm,
'GBNorm.pkl')
joblib.dump(self.GBStand,
'GBStand.pkl')
 
 
The build_train_gb() function is similar to the previously
explained functions, but it is specific to building and training
Gradient Boosting models. It follows the same pattern of
checking if the models have already been saved as Pickle
files ('GBRaw.pkl', 'GBNorm.pkl', 'GBStand.pkl'), loading
them if they exist, and performing model evaluation based on
selected radio buttons (rbRaw, rbNorm, rbStand).
 
If the models do not exist, the function proceeds to build new
Gradient Boosting models using the
GradientBoostingClassifier class from scikit-learn with
specific hyperparameter settings (n_estimators, max_depth,
subsample, max_features). After building and training the
models, it performs model evaluation and saves the trained
models as Pickle files using joblib.dump.
 
The purpose of this function is to build, train, and evaluate
three different Gradient Boosting models with different
preprocessing techniques (raw, normalization, and
standardization) to compare their performance. The code
structure allows for easy comparison of the models and
ensures that the models are only built and trained when
necessary, avoiding redundant computations. The function
facilitates the evaluation and comparison of Gradient
Boosting models and contributes to the overall machine
learning pipeline in the application.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'Gradient Boosting':
2 self.build_train_gb()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose Gradient Boosting item from cbClassifier widget.
Then, you will see the result as shown in Figure 138.
 

Figure 138 The result using GB model with raw feature


scaling
 

Figure 139 The result using GB model with normalization


feature scaling
 
Click on Norm radio button. Then, choose Gradient
Boosting item from cbClassifier widget. Then, you will see
the result as shown in Figure 139.
 
Click on Stand radio button. Then, choose Gradient
Boosting item from cbClassifier widget. Then, you will see
the result as shown in Figure 140.
 

Figure 140 The result using GB model with standardization


feature scaling
 
 
 
Naïve Bayes Classifier
Step Define build_train_nb() method to build and train Naïve
1 Bayes (NB) classifier using three feature scaling: Raw,
Normalization, and Standardization:
 
1 def build_train_nb(self):
2 if path.isfile('NBRaw.pkl'):
3 #Loads model
4 self.NBRaw = joblib.load('NBRaw.pkl')
5 self.NBNorm = joblib.load('NBNorm.pkl')
6 self.NBStand = joblib.load('NBStand.pkl')
7
8 if self.rbRaw.isChecked():
9 self.run_model('Naive Bayes', 'Raw', \
10 self.NBRaw, self.X_train_raw, \
11 self.X_test_raw, self.y_train_raw, \
12 self.y_test_raw)
13 if self.rbNorm.isChecked():
14 self.run_model('Naive Bayes',
15 'Normalization', \
16 self.NBNorm, self.X_train_norm, \
17 self.X_test_norm, self.y_train_norm, \
18 self.y_test_norm)
19  
20 if self.rbStand.isChecked():
21 self.run_model('Naive Bayes', \
22 'Standardization', self.NBStand, \
23 self.X_train_stand, self.X_test_stand,\
24 self.y_train_stand, self.y_test_stand)
25  
26 else:
27 #Builds and trains Naive Bayes
28 self.NBRaw = GaussianNB()
29 self.NBNorm = GaussianNB()
30 self.NBStand = GaussianNB()
31
32 if self.rbRaw.isChecked():
33 self.run_model('Naive Bayes', 'Raw', \
34 self.NBRaw, self.X_train_raw, \
35 self.X_test_raw, self.y_train_raw, \
36 self.y_test_raw)
37 if self.rbNorm.isChecked():
38 self.run_model('Naive Bayes', \
39 'Normalization', self.NBNorm, \
40 self.X_train_norm, self.X_test_norm, \
41 self.y_train_norm, self.y_test_norm)
42  
43 if self.rbStand.isChecked():
44 self.run_model('Naive Bayes', \
45 'Standardization', self.NBStand, \
46 self.X_train_stand, self.X_test_stand, \
47 self.y_train_stand, self.y_test_stand)
48  
49 #Saves model
50 joblib.dump(self.NBRaw,
51 'NBRaw.pkl')
52 joblib.dump(self.NBNorm,
53 'NBNorm.pkl')
joblib.dump(self.NBStand,
'NBStand.pkl')
 
The build_train_nb() function follows a similar pattern to the
previously explained functions, but it is specifically designed
for building and training Naive Bayes models. It checks if
the models have already been saved as Pickle files
('NBRaw.pkl', 'NBNorm.pkl', 'NBStand.pkl'). If they exist, it
loads the models; otherwise, it proceeds to build and train
new Naive Bayes models.
 
When the models are built, it uses the GaussianNB class
from scikit-learn, which implements the Gaussian Naive
Bayes algorithm, a variant of the Naive Bayes algorithm that
assumes the features follow a Gaussian distribution.
 
After training the models, the function performs model
evaluation and comparison based on the selected radio
buttons (rbRaw, rbNorm, rbStand) for different preprocessing
techniques (raw, normalization, and standardization).
 
The purpose of this function is to build, train, and evaluate
three different Naive Bayes models with different
preprocessing techniques to compare their performance. It
ensures that the models are only built and trained when
necessary, avoiding redundant computations. The function is
part of the larger machine learning pipeline in the
application, facilitating the evaluation and comparison of
Naive Bayes models.
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'Naive Bayes':
2 self.build_train_nb()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose Naive Bayes item from cbClassifier widget. Then,
you will see the result as shown in Figure 141.
 
Click on Norm radio button. Then, choose Naive Bayes item
from cbClassifier widget. Then, you will see the result as
shown in Figure 142.
 
Figure 141 The result using Naïve Bayes model with raw
feature scaling
 

Figure 142 The result using Naïve Bayes model with


normalization feature scaling
Click on Stand radio button. Then, choose Naive Bayes item
from cbClassifier widget. Then, you will see the result as
shown in Figure 143.
 

Figure 143 The result using Naïve Bayes model with


standardization feature scaling
 
 
 
Adaboost Classifier
Step Define build_train_ada() method to build and train
1 Adaboost classifier using three feature scaling: Raw,
Normalization, and Standardization:
 
1 def build_train_ada(self):
2 if path.isfile('ADARaw.pkl'):
3 #Loads model
4 self.ADARaw = joblib.load('ADARaw.pkl')
5 self.ADANorm = joblib.load('ADANorm.pkl')
6
7 self.ADAStand =
8 joblib.load('ADAStand.pkl')
9
10 if self.rbRaw.isChecked():
11 self.run_model('Adaboost', 'Raw', \
12 self.ADARaw, self.X_train_raw, \
13 self.X_test_raw, self.y_train_raw, \
14 self.y_test_raw)
15 if self.rbNorm.isChecked():
16 self.run_model('Adaboost',
17 'Normalization',\
18 self.ADANorm, self.X_train_norm, \
19 self.X_test_norm, self.y_train_norm, \
20 self.y_test_norm)
21  
22 if self.rbStand.isChecked():
23 self.run_model('Adaboost', \
24 'Standardization', self.ADAStand, \
25 self.X_train_stand, self.X_test_stand, \
26 self.y_train_stand, self.y_test_stand)
27  
28 else:
29 #Builds and trains Adaboost
30 self.ADARaw = AdaBoostClassifier(\
31 n_estimators = 100, learning_rate=0.01)
32 self.ADANorm = AdaBoostClassifier(\
33 n_estimators = 100, learning_rate=0.01)
34 self.ADAStand = AdaBoostClassifier(\
35 n_estimators = 100, learning_rate=0.01)
36
37 if self.rbRaw.isChecked():
38 self.run_model('Adaboost', 'Raw',
self.ADARaw, \
39
self.X_train_raw, self.X_test_raw, \
40
self.y_train_raw, self.y_test_raw)
41
42 if self.rbNorm.isChecked():
43 self.run_model('Adaboost',
'Normalization',\
44
self.ADANorm, self.X_train_norm, \
45
self.X_test_norm, self.y_train_norm, \
46
self.y_test_norm)
47
 
48
if self.rbStand.isChecked(): \
49
50 self.run_model('Adaboost',
51 'Standardization', \
52 self.ADAStand, self.X_train_stand, \
53 self.X_test_stand, self.y_train_stand, \
54 self.y_test_stand)
55  
#Saves model
joblib.dump(self.ADARaw,
'ADARaw.pkl')
joblib.dump(self.ADANorm,
'ADANorm.pkl')
joblib.dump(self.ADAStand,
'ADAStand.pkl')
 
 

Figure 144 The result using Adaboost model with raw


feature scaling
 
The build_train_ada() function follows the same pattern as
the previous build and train functions but is specifically
designed for building and training AdaBoost models. Similar
to the other functions, it first checks if the models have
already been saved as Pickle files ('ADARaw.pkl',
'ADANorm.pkl', 'ADAStand.pkl'). If they exist, it loads the
models; otherwise, it proceeds to build and train new
AdaBoost models.
 
When the models are built, it uses the AdaBoostClassifier
class from scikit-learn, which implements the AdaBoost
algorithm for classification tasks. The function creates three
different AdaBoost models with different preprocessing
techniques (raw, normalization, and standardization) and
trains them on the corresponding datasets.
 
After training the models, the function performs model
evaluation and comparison based on the selected radio
buttons (rbRaw, rbNorm, rbStand) for different preprocessing
techniques.
 

Figure 145 The result using Adaboost model with


normalization feature scaling
 
The purpose of this function is to build, train, and evaluate
three different AdaBoost models with different preprocessing
techniques to compare their performance. It ensures that the
models are only built and trained when necessary, avoiding
redundant computations. The function is part of the larger
machine learning pipeline in the application, facilitating the
evaluation and comparison of AdaBoost models.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'Adaboost':
2 self.build_train_ada()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose Adaboost item from cbClassifier widget. Then, you
will see the result as shown in Figure 144.
 

Figure 146 The result using Adaboost model with


standardization feature scaling
 
Click on Norm radio button. Then, choose Adaboost item
from cbClassifier widget. Then, you will see the result as
shown in Figure 145.
 
Click on Stand radio button. Then, choose Adaboost item
from cbClassifier widget. Then, you will see the result as
shown in Figure 146.
 
 
 
Extreme Gradient Boosting Classifier
Step Define build_train_xgb() method to build and train XGB
1 classifier using three feature scaling: Raw, Normalization,
and Standardization:
 
1 def build_train_xgb(self):
2 if path.isfile('XGBRaw.pkl'):
3 #Loads model
4 self.XGBRaw = joblib.load('XGBRaw.pkl')
5 self.XGBNorm = joblib.load('XGBNorm.pkl')
6 self.XGBStand =
7 joblib.load('XGBStand.pkl')
8
9 if self.rbRaw.isChecked():
10 self.run_model('XGB', 'Raw', self.XGBRaw,
\
11
self.X_train_raw, self.X_test_raw, \
12
self.y_train_raw, self.y_test_raw)
13
14 if self.rbNorm.isChecked():
15 self.run_model('XGB', 'Normalization', \
16 self.XGBNorm, self.X_train_norm, \
17 self.X_test_norm, self.y_train_norm, \
18 self.y_test_norm)
19  
20 if self.rbStand.isChecked():
21 self.run_model('XGB', 'Standardization',
\
22
self.XGBStand, self.X_train_stand, \
23
self.X_test_stand, self.y_train_stand, \
24
self.y_test_stand)
25
 
26
else:
27
#Builds and trains XGB classifier
28
29 self.XGBRaw = XGBClassifier(n_estimators
30 = 200, \
31 max_depth=20, random_state=2021, \
32 use_label_encoder=False,
eval_metric='mlogloss')
33
self.XGBNorm = XGBClassifier(n_estimators
34
= 200, \
35
max_depth=20, random_state=2021, \
36
use_label_encoder=False,
37 eval_metric='mlogloss')
38 self.XGBStand =
39 XGBClassifier(n_estimators = 200, \
40 max_depth=20, random_state=2021, \
41 use_label_encoder=False,
42 eval_metric='mlogloss')
43
44 if self.rbRaw.isChecked():
45 self.run_model('XGB', 'Raw', self.XGBRaw,
\
46
self.X_train_raw, self.X_test_raw, \
47
self.y_train_raw, self.y_test_raw)
48
49 if self.rbNorm.isChecked():
50 self.run_model('XGB', 'Normalization', \
51 self.XGBNorm, self.X_train_norm, \
52 self.X_test_norm, self.y_train_norm, \
53 self.y_test_norm)
54  
55 if self.rbStand.isChecked():
56 self.run_model('XGB', 'Standardization',
\
57
self.XGBStand, self.X_train_stand, \
self.X_test_stand, self.y_train_stand, \
self.y_test_stand)
 
#Saves model
joblib.dump(self.XGBRaw,
'XGBRaw.pkl')
joblib.dump(self.XGBNorm,
'XGBNorm.pkl')
joblib.dump(self.XGBStand,
'XGBStand.pkl')
 
The build_train_xgb() function is very similar to the previous
build and train functions, but it is designed specifically for
building and training XGBoost models. Like the others, it
first checks if the models have already been saved as Pickle
files ('XGBRaw.pkl', 'XGBNorm.pkl', 'XGBStand.pkl'). If
they exist, it loads the models; otherwise, it proceeds to build
and train new XGBoost models.
When building the models, the function uses the
XGBClassifier class from the XGBoost library, which is an
implementation of the gradient boosting algorithm. It creates
three different XGBoost models with different preprocessing
techniques (raw, normalization, and standardization) and
trains them on the corresponding datasets.
 
After training the models, the function performs model
evaluation and comparison based on the selected radio
buttons (rbRaw, rbNorm, rbStand) for different preprocessing
techniques.
 
The purpose of this function is to build, train, and evaluate
three different XGBoost models with different preprocessing
techniques to compare their performance. Like the other
build and train functions, it ensures that the models are only
built and trained when necessary, avoiding redundant
computations. The function is part of the larger machine
learning pipeline in the application, facilitating the evaluation
and comparison of XGBoost models.
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'XGB Classifier':
2 self.build_train_xgb()
 
 
Step Run gui_prostate.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose XGB Classifier item from cbClassifier widget.
Then, you will see the result as shown in Figure 147.
 
Figure 147 The result using XGB model with raw feature
scaling
 

Figure 148 The result using XGB model with normalization


feature scaling
 
Click on Norm radio button. Then, choose XGB Classifier
item from cbClassifier widget. Then, you will see the result
as shown in Figure 148.
 
Click on Stand radio button. Then, choose XGB Classifier
item from cbClassifier widget. Then, you will see the result
as shown in Figure 149.
 

Figure 149 The result using XGB model with


standardization feature scaling
 
 
 
 
Light Gradient Boosting Classifier
Step Define build_train_lgbm() method to build and train LGBM
1 classifier using three feature scaling: Raw, Normalization,
and Standardization:
 
1 def build_train_lgbm(self):
2 if path.isfile('LGBMRaw.pkl'):
3 #Loads model
4 self.LGBMRaw = joblib.load('LGBMRaw.pkl')
5 self.LGBMNorm =
6 joblib.load('LGBMNorm.pkl')
7 self.LGBMStand =
joblib.load('LGBMStand.pkl')
8
9
if self.rbRaw.isChecked():
10
self.run_model('LGBM Classifier', 'Raw',
11
\
12
self.LGBMRaw, self.X_train_raw, \
13
self.X_test_raw, self.y_train_raw,\
14
self.y_test_raw)
15
if self.rbNorm.isChecked():
16
self.run_model('LGBM Classifier', \
17
'Normalization', self.LGBMNorm, \
18
self.X_train_norm, self.X_test_norm, \
19
self.y_train_norm, self.y_test_norm)
20
 
21
if self.rbStand.isChecked():
22
self.run_model('LGBM Classifier', \
23
'Standardization', self.LGBMStand, \
24
self.X_train_stand, self.X_test_stand, \
25
self.y_train_stand, self.y_test_stand)
26
 
27
else:
28
#Builds and trains LGBMClassifier
29
classifier
30
self.LGBMRaw = LGBMClassifier(max_depth =
31 20, \
32 n_estimators=500, subsample=0.8,
33 random_state=2021)
34 self.LGBMNorm = LGBMClassifier(max_depth
35 = 20, \
36 n_estimators=500, subsample=0.8,
random_state=2021)
37
self.LGBMStand = LGBMClassifier(max_depth
38
= 20, \
39
n_estimators=500, subsample=0.8,
40 random_state=2021)
41
42 if self.rbRaw.isChecked():
43 self.run_model('LGBM Classifier', 'Raw',
44 \
45 self.LGBMRaw, self.X_train_raw, \
46 self.X_test_raw, self.y_train_raw, \
47 self.y_test_raw)
48 if self.rbNorm.isChecked():
49 self.run_model('LGBM Classifier', \
50 'Normalization', self.LGBMNorm, \
51 self.X_train_norm, self.X_test_norm, \
52 self.y_train_norm, self.y_test_norm)
53  
54 if self.rbStand.isChecked():
55 self.run_model('LGBM Classifier', \
56 'Standardization', self.LGBMStand, \
self.X_train_stand, self.X_test_stand, \
self.y_train_stand, self.y_test_stand)
 
#Saves model
joblib.dump(self.LGBMRaw,
'LGBMRaw.pkl')
joblib.dump(self.LGBMNorm,
'LGBMNorm.pkl')
joblib.dump(self.LGBMStand,
'LGBMStand.pkl')
 
 
 
Step Add this code to the end of choose_ML_model() method:
2  
1 if strCB == 'LGBM Classifier':
2 self.build_train_lgbm()
 
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
3 ML MODEL buttons. Click on Raw radio button. Then,
choose LGBM Classifier item from cbClassifier widget.
Then, you will see the result as shown in Figure 150.
 
Figure 150 The result using LGBM model with raw feature
scaling
 
Click on Norm radio button. Then, choose LGBM
Classifier item from cbClassifier widget. Then, you will see
the result as shown in Figure 151.
 
Click on Stand radio button. Then, choose LGBM
Classifier item from cbClassifier widget. Then, you will see
the result as shown in Figure 152.
 

Figure 151 The result using LGBM model with


normalization feature scaling
 

Figure 152 The result using LGBM model with


standardization feature scaling
 
 
 
Multi-Layer Perceptron Classifier
Step Define build_train_mlp() method to build and train LGBM
1 classifier using three feature scaling: Raw, Normalization,
and Standardization:
 
1 def build_train_mlp(self):
2 if path.isfile('MLPRaw.pkl'):
3 #Loads model
4 self.MLPRaw = joblib.load('MLPRaw.pkl')
5 self.MLPNorm = joblib.load('MLPNorm.pkl')
6 self.MLPStand =
7 joblib.load('MLPStand.pkl')
8
9 if self.rbRaw.isChecked():
10 self.run_model('MLP Classifier', 'Raw', \
11 self.MLPRaw, self.X_train_raw, \
12 self.X_test_raw, self.y_train_raw, \
13 self.y_test_raw)
14 if self.rbNorm.isChecked():
15 self.run_model('MLP Classifier', \
16 'Normalization', self.MLPNorm, \
17 self.X_train_norm, self.X_test_norm, \
18 self.y_train_norm, self.y_test_norm)
19  
20 if self.rbStand.isChecked():
21 self.run_model('MLP Classifier', \
22 'Standardization', self.MLPStand, \
23 self.X_train_stand, self.X_test_stand, \
24 self.y_train_stand, self.y_test_stand)
25  
26 else:
27 #Builds and trains MLP classifier
28 self.MLPRaw =
29 MLPClassifier(random_state=2021)
30 self.MLPNorm =
MLPClassifier(random_state=2021)
31
self.MLPStand =
32
MLPClassifier(random_state=2021)
33
34 if self.rbRaw.isChecked():
35 self.run_model('MLP Classifier', 'Raw', \
36 self.MLPRaw, self.X_train_raw,
37 self.X_test_raw, \
38 self.y_train_raw, self.y_test_raw)
39 if self.rbNorm.isChecked():
40 self.run_model('MLP Classifier', \
41 'Normalization', self.MLPNorm, \
42 self.X_train_norm, self.X_test_norm, \
43 self.y_train_norm, self.y_test_norm)
44  
45 if self.rbStand.isChecked():
46 self.run_model('MLP Classifier', \
47 'Standardization', self.MLPStand, \
48 self.X_train_stand, self.X_test_stand, \
49 self.y_train_stand, self.y_test_stand)
50  
51 #Saves model
52 joblib.dump(self.MLPRaw,
'MLPRaw.pkl')
joblib.dump(self.MLPNorm,
'MLPNorm.pkl')
joblib.dump(self.MLPStand,
'MLPStand.pkl')
 
 
 

Figure 153 The result using MLP model with raw feature
scaling
 
Step Add this code to the end of choose_ML_model() method:
1  
1 if strCB == 'MLP Classifier':
2 self.build_train_mlp()
 
 
Step Run gui_pancreatic.py and click LOAD DATA and TRAIN
2 ML MODEL buttons. Click on Raw radio button. Then,
choose MLP Classifier item from cbClassifier widget.
Then, you will see the result as shown in Figure 153.
 

Figure 154 The result using MLP model with normalization


feature scaling
 
Click on Norm radio button. Then, choose MLP Classifier
item from cbClassifier widget. Then, you will see the result
as shown in Figure 154.
 
Click on Stand radio button. Then, choose MLP Classifier
item from cbClassifier widget. Then, you will see the result
as shown in Figure 155.
 
Figure 155 The result using MLP model with
standardization feature scaling
 
 
Following is the full version of gui_pancreatic.py:
 
#gui_pancreatic.py
from PyQt5.QtWidgets import *
from PyQt5.uic import loadUi
from matplotlib.backends.backend_qt5agg import (NavigationToolbar2QT as
NavigationToolbar)
from matplotlib.colors import ListedColormap
 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import warnings
import mglearn
warnings.filterwarnings('ignore')
import os
import joblib
from numpy import save
from numpy import load
from os import path
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV,
GridSearchCV,StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score,
precision_score
from sklearn.metrics import classification_report, f1_score, plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
import tensorflow as tf
from sklearn.base import clone
from sklearn.decomposition import PCA
 
class DemoGUI_Pancreatic(QMainWindow):
def __init__(self):
QMainWindow.__init__(self)
loadUi("gui_pancreatic.ui",self)
self.setWindowTitle(\
"GUI Demo of Classifying and Predicting Pancreatic Cancer")
self.addToolBar(NavigationToolbar(\
self.widgetPlot1.canvas, self))
self.pbLoad.clicked.connect(self.import_dataset)
self.initial_state(False)
self.pbTrainML.clicked.connect(self.train_model_ML)
self.cbData.currentIndexChanged.connect(self.choose_plot)
self.cbClassifier.currentIndexChanged.connect(self.choose_ML_model)
# Takes a df and writes it to a qtable provided. df headers become qtable headers
@staticmethod
def write_df_to_qtable(df,table):
headers = list(df)
table.setRowCount(df.shape[0])
table.setColumnCount(df.shape[1])
table.setHorizontalHeaderLabels(headers)
 
# getting data from df is computationally costly so convert it to array first
df_array = df.values
for row in range(df.shape[0]):
for col in range(df.shape[1]):
table.setItem(row, col, QTableWidgetItem(str(df_array[row,col])))
 
def populate_table(self,data, table):
#Populates two tables
self.write_df_to_qtable(data,table)

table.setAlternatingRowColors(True)
table.setStyleSheet("alternate-background-color: #ffbacd;background-color:
#9be5aa;");
 
def initial_state(self, state):
self.pbTrainML.setEnabled(state)
self.cbData.setEnabled(state)
self.cbClassifier.setEnabled(state)
self.cbPredictionML.setEnabled(state)
self.rbRaw.setEnabled(state)
self.rbNorm.setEnabled(state)
self.rbStand.setEnabled(state)
 
def read_dataset(self, dir):
#Loads csv file
df = pd.read_csv(dir)

#Drops irrelevant columns


df = df.drop(columns=
['sample_id','patient_cohort','sample_origin','stage','benign_sample_diagnosis'])
#Imputes missing values in plasma_CA19_9 with mean
df['plasma_CA19_9'].fillna((df['plasma_CA19_9'].mean()), inplace=True)
 
#Imputes missing value in REG1A with mean
df['REG1A'].fillna((df['REG1A'].mean()), inplace=True)

#Creates dummy dataset


df_dummy=df.copy()

#Converts diagnosis feature to {0,1,2}


df['diagnosis'] = df['diagnosis'].apply(lambda x: self.map_diagnosis(x))

#Converts sex feature to {0,1}


df['sex'] = df['sex'].apply(lambda x: self.map_sex(x))
 
#Categorizes df_dummy for visualization
df_dummy = self.df_visual(df_dummy)

return df, df_dummy


 
#Converts sex feature to {0,1}
def map_sex(self,n):
if n == "F":
return 0
else:
return 1
 

#Converts diagnosis feature to {0,1,2}


def map_diagnosis(self,n):
if n == 1:
return 0
if n == 2:
return 1
else:
return 2
 
 
#Categorizes diagnosis feature
def cat_diagnosis(self,n):
if n == 1:
return 'Control (No Pancreatic Disease)'
if n == 2:
return 'Benign Hepatobiliary Disease'
else:
return 'Pancreatic Cancer'

def df_visual(self,df_dummy):
#Categorizes diagnosis_result feature
df_dummy['diagnosis'] = df_dummy['diagnosis'].apply(lambda x:
self.cat_diagnosis(x))

#Categorizes age feature


labels = ['0-40', '40-50', '50-60','60-90']
df_dummy['age'] = pd.cut(df_dummy['age'], [0, 40, 50, 60, 90],
labels=labels)

#Categorizes plasma_CA19_9 feature


labels = ['0-100', '100-1000', '1000-10000','10000-35000']
df_dummy['plasma_CA19_9'] = pd.cut(df_dummy['plasma_CA19_9'], [0, 100,
1000, 10000, 35000], labels=labels)

#Categorizes creatinine feature


labels = ['0-0.5', '0.5-1', '1-2','2-5']
df_dummy['creatinine'] = pd.cut(df_dummy['creatinine'], [0, 0.5, 1, 2, 5],
labels=labels)
 
return df_dummy
 
def import_dataset(self):
curr_path = os.getcwd()
dataset_dir = curr_path + "/Debernardi et al 2020 data.csv"

#Loads csv file


self.df, self.df_dummy = self.read_dataset(dataset_dir)
 
#Populates tables with data
self.populate_table(self.df, self.twData1)
self.label1.setText('Pancreatic Cancer Data')
 
self.populate_table(self.df.describe(), self.twData2)
self.twData2.setVerticalHeaderLabels(['Count', 'Mean', 'Std', 'Min', '25%',
'50%', '75%', 'Max'])
self.label2.setText('Data Desciption')

#Turns on pbTrainML widget


self.pbTrainML.setEnabled(True)
 
#Turns off pbLoad
self.pbLoad.setEnabled(False)

#Populates cbData
self.populate_cbData()
 
def populate_cbData(self):
self.cbData.addItems(self.df)
self.cbData.addItems(["Features Importance"])
self.cbData.addItems(["Correlation Matrix", "Pairwise Relationship", "Features
Correlation"])
 
def fit_dataset(self, df):
#Extracts diagnosis feature as target variable
y = df['diagnosis'].values # Target for the model
 
#Drops diagnosis feature and set input variable
X = df.drop('diagnosis', axis = 1)
 
#Resamples data
sm = SMOTE(random_state=2021)
X,y = sm.fit_resample(X, y.ravel())
 
return X, y
 
def train_test(self):
X, y = self.fit_dataset(self.df)
 
#Splits the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 2021, stratify=y)
self.X_train_raw = X_train.copy()
self.X_test_raw = X_test.copy()
self.y_train_raw = y_train.copy()
self.y_test_raw = y_test.copy()

#Saves into npy files


save('X_train_raw.npy', self.X_train_raw)
save('y_train_raw.npy', self.y_train_raw)
save('X_test_raw.npy', self.X_test_raw)
save('y_test_raw.npy', self.y_test_raw)
 
self.X_train_norm = X_train.copy()
self.X_test_norm = X_test.copy()
self.y_train_norm = y_train.copy()
self.y_test_norm = y_test.copy()
norm = MinMaxScaler()
self.X_train_norm = norm.fit_transform(self.X_train_norm)
self.X_test_norm = norm.transform(self.X_test_norm)

#Saves into npy files


save('X_train_norm.npy', self.X_train_norm)
save('y_train_norm.npy', self.y_train_norm)
save('X_test_norm.npy', self.X_test_norm)
save('y_test_norm.npy', self.y_test_norm)
 
self.X_train_stand = X_train.copy()
self.X_test_stand = X_test.copy()
self.y_train_stand = y_train.copy()
self.y_test_stand = y_test.copy()
scaler = StandardScaler()
self.X_train_stand = scaler.fit_transform(self.X_train_stand)
self.X_test_stand = scaler.transform(self.X_test_stand)
 
#Saves into npy files
save('X_train_stand.npy', self.X_train_stand)
save('y_train_stand.npy', self.y_train_stand)
save('X_test_stand.npy', self.X_test_stand)
save('y_test_stand.npy', self.y_test_stand)
 
def split_data_ML(self):
if path.isfile('X_train_raw.npy'):
#Loads npy files
self.X_train_raw = np.load('X_train_raw.npy',allow_pickle=True)
self.y_train_raw = np.load('y_train_raw.npy',allow_pickle=True)
self.X_test_raw = np.load('X_test_raw.npy',allow_pickle=True)
self.y_test_raw = np.load('y_test_raw.npy',allow_pickle=True)

self.X_train_norm = np.load('X_train_norm.npy',allow_pickle=True)
self.y_train_norm = np.load('y_train_norm.npy',allow_pickle=True)
self.X_test_norm = np.load('X_test_norm.npy',allow_pickle=True)
self.y_test_norm = np.load('y_test_norm.npy',allow_pickle=True)
 
self.X_train_stand = np.load('X_train_stand.npy',allow_pickle=True)
self.y_train_stand = np.load('y_train_stand.npy',allow_pickle=True)
self.X_test_stand = np.load('X_test_stand.npy',allow_pickle=True)
self.y_test_stand = np.load('y_test_stand.npy',allow_pickle=True)

else:
self.train_test()

#Prints each shape


print('X train raw shape: ', self.X_train_raw.shape)
print('Y train raw shape: ', self.y_train_raw.shape)
print('X test raw shape: ', self.X_test_raw.shape)
print('Y test raw shape: ', self.y_test_raw.shape)
 
#Prints each shape
print('X train norm shape: ', self.X_train_norm.shape)
print('Y train norm shape: ', self.y_train_norm.shape)
print('X test norm shape: ', self.X_test_norm.shape)
print('Y test norm shape: ', self.y_test_norm.shape)
 
#Prints each shape
print('X train stand shape: ', self.X_train_stand.shape)
print('Y train stand shape: ', self.y_train_stand.shape)
print('X test stand shape: ', self.X_test_stand.shape)
print('Y test stand shape: ', self.y_test_stand.shape)
 
 
def train_model_ML(self):
self.split_data_ML()

#Turns on three widgets


self.cbData.setEnabled(True)
self.cbClassifier.setEnabled(True)
self.cbPredictionML.setEnabled(True)

#Turns off pbTrainML


self.pbTrainML.setEnabled(False)

#Turns on three radio buttons


self.rbRaw.setEnabled(True)
self.rbNorm.setEnabled(True)
self.rbStand.setEnabled(True)
self.rbRaw.setChecked(True)
 
def pie_cat(self, df, var_target, labels, widget):
df.value_counts().plot.pie(ax =
widget.canvas.axis1,labels=labels,startangle=40,autopct = '%1.1f%%',textprops=
{'fontsize': 10})
widget.canvas.axis1.set_title('The distribution of ' + var_target + '
variable', fontweight ="bold",fontsize=14)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
def bar_cat(self,df,var, widget):
ax = df[var].value_counts().plot(kind="barh",ax = widget.canvas.axis1)
 
for i,j in enumerate(df[var].value_counts().values):
ax.text(.7,i,j,weight = "bold",fontsize=10)
 
widget.canvas.axis1.set_title("Count of "+ var +" cases")
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
#Plots diagnosis with other variable
def stacked_bar_plot(self,df,cat,ax1):
cmap1=plt.cm.coolwarm_r
group_by_stat = df.groupby([cat, 'diagnosis']).size()
g=group_by_stat.unstack().plot(kind='bar', stacked=True,ax=ax1,grid=True)
self.put_label_stacked_bar(g,17)
 
ax1.set_title('Stacked Bar Plot of '+ cat +' (in %)', fontsize=14)
ax1.set_ylabel('Number of Cases')
ax1.set_xlabel(cat)
plt.show()
 
def put_label_stacked_bar(self, ax,fontsize):
#patches is everything inside of the chart
for rect in ax.patches:
# Find where everything is located
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
y = rect.get_y()

# The height of the bar is the data value and can be used as the label
label_text = f'{height:.0f}'

# ax.text(x, y, text)
label_x = x + width / 2
label_y = y + height / 2
 
# plots only when height is greater than specified value
if height > 0:
ax.text(label_x, label_y, label_text, ha='center', va='center',
weight = "bold",fontsize=fontsize)

ax.legend(bbox_to_anchor=(1.05, 1), loc='lower right', borderaxespad=0.)


 
def choose_plot(self):
strCB = self.cbData.currentText()
 
if strCB == 'diagnosis':
#Plots distribution of diagnosis variable in pie chart
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(121,facecolor = '#fbe7dd')
label_class = list(self.df_dummy["diagnosis"].value_counts().index)
self.pie_cat(self.df_dummy["diagnosis"],'diagnosis', label_class,
self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(122,facecolor = '#fbe7dd')
self.bar_cat(self.df_dummy,'diagnosis', self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 =
self.widgetPlot2.canvas.figure.add_subplot(111,facecolor = '#fbe7dd')
self.stacked_bar_plot(self.df_dummy,'age',self.widgetPlot2.canvas.axis1)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["sex"],hue = self.df_dummy["diagnosis"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sex versus diagnosis",fontweight
="bold",fontsize=14)
 
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["plasma_CA19_9"],hue =
self.df_dummy["diagnosis"], palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("plasma_CA19_9 versus
diagnosis",fontweight ="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["creatinine"],hue =
self.df_dummy["diagnosis"], palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("creatinine versus diagnosis",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["age"],hue = self.df_dummy["diagnosis"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("age versus diagnosis",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
if strCB == 'age':
#Plots distribution of age variable in pie chart
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(121,facecolor = '#fbe7dd')
label_class = list(self.df_dummy["age"].value_counts().index)
self.pie_cat(self.df_dummy["age"],'age', label_class, self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(122,facecolor = '#fbe7dd')
self.bar_cat(self.df_dummy,'age', self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 =
self.widgetPlot2.canvas.figure.add_subplot(111,facecolor = '#fbe7dd')
self.stacked_bar_plot(self.df_dummy,'age',self.widgetPlot2.canvas.axis1)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["sex"],hue = self.df_dummy["age"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sex versus age",fontweight
="bold",fontsize=14)
 
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["plasma_CA19_9"],hue =
self.df_dummy["age"], palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("plasma_CA19_9 versus age",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["creatinine"],hue =
self.df_dummy["age"], palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("creatinine versus age",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["age"],hue = self.df_dummy["age"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("diagnosis versus age",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
if strCB == 'sex':
#Plots distribution of sex variable in pie chart
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(121,facecolor = '#fbe7dd')
label_class = list(self.df_dummy["sex"].value_counts().index)
self.pie_cat(self.df_dummy["sex"],'sex', label_class, self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(122,facecolor = '#fbe7dd')
self.bar_cat(self.df_dummy,'sex', self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 =
self.widgetPlot2.canvas.figure.add_subplot(111,facecolor = '#fbe7dd')
self.stacked_bar_plot(self.df_dummy,'sex',self.widgetPlot2.canvas.axis1)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["age"],hue = self.df_dummy["sex"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("age versus sex",fontweight
="bold",fontsize=14)
 
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["plasma_CA19_9"],hue =
self.df_dummy["sex"], palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("plasma_CA19_9 versus sex",fontweight
="bold",fontsize=14)
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["creatinine"],hue =
self.df_dummy["sex"], palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("creatinine versus sex",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["diagnosis"],hue = self.df_dummy["sex"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("diagnosis versus sex",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
if strCB == 'plasma_CA19_9':
#Plots distribution of plasma_CA19_9 variable in pie chart
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(121,facecolor = '#fbe7dd')
label_class =
list(self.df_dummy["plasma_CA19_9"].value_counts().index)
self.pie_cat(self.df_dummy["plasma_CA19_9"],'plasma_CA19_9', label_class,
self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(122,facecolor = '#fbe7dd')
self.bar_cat(self.df_dummy,'plasma_CA19_9', self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 =
self.widgetPlot2.canvas.figure.add_subplot(111,facecolor = '#fbe7dd')
self.stacked_bar_plot(self.df_dummy,'plasma_CA19_9',self.widgetPlot2.canvas.axis1
)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["age"],hue =
self.df_dummy["plasma_CA19_9"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("age versus plasma_CA19_9",fontweight
="bold",fontsize=14)
 
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["sex"],hue =
self.df_dummy["plasma_CA19_9"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sex versus plasma_CA19_9",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["creatinine"],hue =
self.df_dummy["plasma_CA19_9"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("creatinine versus
plasma_CA19_9",fontweight ="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["diagnosis"],hue =
self.df_dummy["plasma_CA19_9"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("diagnosis versus
plasma_CA19_9",fontweight ="bold",fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
if strCB == 'creatinine':
#Plots distribution of creatinine variable in pie chart
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(121,facecolor = '#fbe7dd')
label_class = list(self.df_dummy["creatinine"].value_counts().index)
self.pie_cat(self.df_dummy["creatinine"],'creatinine', label_class,
self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot1.canvas.axis1 =
self.widgetPlot1.canvas.figure.add_subplot(122,facecolor = '#fbe7dd')
self.bar_cat(self.df_dummy,'creatinine', self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 =
self.widgetPlot2.canvas.figure.add_subplot(111,facecolor = '#fbe7dd')
self.stacked_bar_plot(self.df_dummy,'creatinine',self.widgetPlot2.canvas.axis1)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["age"],hue =
self.df_dummy["creatinine"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("age versus creatinine",fontweight
="bold",fontsize=14)
 
self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["sex"],hue =
self.df_dummy["creatinine"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sex versus creatinine",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["plasma_CA19_9"],hue =
self.df_dummy["creatinine"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("plasma_CA19_9 versus
creatinine",fontweight ="bold",fontsize=14)

self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '#fbe7dd')
g=sns.countplot(self.df_dummy["diagnosis"],hue =
self.df_dummy["creatinine"],
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("diagnosis versus creatinine",fontweight
="bold",fontsize=14)

self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
if strCB == 'LYVE1':
self.prob_num_versus_two_cat("LYVE1","age", "diagnosis" ,self.widgetPlot1)
self.hist_num_versus_four_cat("LYVE1")
self.prob_num_versus_two_cat("LYVE1","creatinine", "plasma_CA19_9"
,self.widgetPlot2)

if strCB == 'REG1B':
self.prob_num_versus_two_cat("REG1B","age", "diagnosis" ,self.widgetPlot1)
self.hist_num_versus_four_cat("REG1B")
self.prob_num_versus_two_cat("REG1B","creatinine", "plasma_CA19_9"
,self.widgetPlot2)
 
if strCB == 'TFF1':
self.prob_num_versus_two_cat("TFF1","age", "diagnosis" ,self.widgetPlot1)
self.hist_num_versus_four_cat("TFF1")
self.prob_num_versus_two_cat("TFF1","creatinine", "plasma_CA19_9"
,self.widgetPlot2)
 
if strCB == 'REG1A':
self.prob_num_versus_two_cat("REG1A","age", "diagnosis" ,self.widgetPlot1)
self.hist_num_versus_four_cat("REG1A")
self.prob_num_versus_two_cat("REG1A","creatinine", "plasma_CA19_9"
,self.widgetPlot2)
 
if strCB == 'Correlation Matrix':
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(111)
X,_ = self.fit_dataset(self.df)
self.plot_corr(X, self.widgetPlot3)
 
if strCB == 'Features Importance':
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(111)
self.plot_importance(self.widgetPlot3)
 
def plot_corr(self, data, widget):
corrdata = data.corr()
sns.heatmap(corrdata, ax = widget.canvas.axis1, lw=1, annot=True,
cmap="Reds")
widget.canvas.axis1.set_title('Correlation Matrix', fontweight ="bold",fontsize=20)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
def plot_importance(self, widget):
#Compares different feature importances
r = ExtraTreesClassifier(random_state=0)
X,y = self.fit_dataset(self.df)
r.fit(X, y)
feature_importance_normalized = np.std([tree.feature_importances_ for tree in
r.estimators_],
axis = 0)
 
sns.barplot(feature_importance_normalized, X.columns, ax = widget.canvas.axis1)
widget.canvas.axis1.set_ylabel('Feature Labels',fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_xlabel('Features Importance',fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_title('Comparison of different Features Importance',fontweigh
widget.canvas.figure.tight_layout()
widget.canvas.draw()

def feat_versus_other(self, feat,another,legend,ax0,label='',title=''):


background_color = "#fbe7dd"
sns.set_palette(['#ff355d','#66b3ff'])
for s in ["right", "top"]:
ax0.spines[s].set_visible(False)
 
ax0.set_facecolor(background_color)
ax0_sns = sns.histplot(data=self.df, x=self.df[feat],ax=ax0,zorder=2,kde=False,hue=an
shrink=.8,linewidth=0.3,alpha=1)
 
self.put_label_stacked_bar(ax0_sns,17)
 
ax0_sns.set_xlabel('',fontsize=10, weight='bold')
ax0_sns.set_ylabel('',fontsize=10, weight='bold')
 
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
 
ax0_sns.tick_params(labelsize=10, width=0.5, length=1.5)
ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=background_color, fonts
right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label,fontweight ="bold",fontsize=14)
ax0_sns.set_title(title,fontweight ="bold",fontsize=16)
 
def prob_feat_versus_other(self,feat,another,legend,ax0,label='',title=''):
background_color = "#fbe7dd"
sns.set_palette(['#ff355d','#66b3ff'])
for s in ["right", "top"]:
ax0.spines[s].set_visible(False)
 
ax0.set_facecolor(background_color)
ax0_sns = sns.kdeplot(x=self.df[feat],ax=ax0,hue=another,linewidth=0.3,fill=True,cbar
 
ax0_sns.set_xlabel('',fontsize=4, weight='bold')
ax0_sns.set_ylabel('',fontsize=4, weight='bold')
 
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
 
ax0_sns.tick_params(labelsize=10, width=0.5, length=1.5)
ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=background_color, fonts
right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label,fontweight ="bold",fontsize=14)
ax0_sns.set_title(title,fontweight ="bold",fontsize=16)
 
def hist_num_versus_four_cat(self,feat):
self.label_diagnosis = list(self.df_dummy["diagnosis"].value_counts().index)
self.label_age = list(self.df_dummy["age"].value_counts().index)
self.label_plasma = list(self.df_dummy["plasma_CA19_9"].value_counts().index)
self.label_creatinine = list(self.df_dummy["creatinine"].value_counts().index)

self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '
print(self.df_dummy["diagnosis"].value_counts())
self.feat_versus_other(feat,self.df_dummy["diagnosis"],self.label_diagnosis,self.widgetPlot3
t versus ' + feat)
 
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '
print(self.df_dummy["age"].value_counts())
self.feat_versus_other(feat,self.df_dummy["age"],self.label_age,self.widgetPlot3.canvas.axis
 
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '
print(self.df_dummy["plasma_CA19_9"].value_counts())
self.feat_versus_other(feat,self.df_dummy["plasma_CA19_9"],self.label_plasma,self.widgetPlot
' + feat)
 
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '
print(self.df_dummy["creatinine"].value_counts())
self.feat_versus_other(feat,self.df_dummy["creatinine"],self.label_creatinine,self.widgetPlo
+ feat)
 
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
 
def prob_num_versus_two_cat(self,feat, feat_cat1, feat_cat2, widget):
self.label_feat_cat1 = list(self.df_dummy[feat_cat1].value_counts().index)
self.label_feat_cat2 = list(self.df_dummy[feat_cat2].value_counts().index)
 
widget.canvas.figure.clf()
widget.canvas.axis1 = widget.canvas.figure.add_subplot(211,facecolor = '#fbe7dd')
print(self.df_dummy[feat_cat2].value_counts())
self.prob_feat_versus_other(feat,self.df_dummy[feat_cat2],self.label_feat_cat2,widget.canvas
+ feat)
 
widget.canvas.axis1 = widget.canvas.figure.add_subplot(212,facecolor = '#fbe7dd')
print(self.df_dummy[feat_cat1].value_counts())
self.prob_feat_versus_other(feat,self.df_dummy[feat_cat1],self.label_feat_cat1,widget.canvas
+ feat)
 
widget.canvas.figure.tight_layout()
widget.canvas.draw()

def plot_real_pred_val(self, Y_pred, Y_test, widget, title):


#Calculate Metrics
acc=accuracy_score(Y_test,Y_pred)
 
#Output plot
widget.canvas.figure.clf()
widget.canvas.axis1 = widget.canvas.figure.add_subplot(111,facecolor='steelblue')
widget.canvas.axis1.scatter(range(len(Y_pred)),Y_pred,color="yellow",lw=5,label="Pred
widget.canvas.axis1.scatter(range(len(Y_test)), Y_test,color="red",label="Actual")
widget.canvas.axis1.set_title("Prediction Values vs Real Values of " + title, fontwei
widget.canvas.axis1.set_xlabel("Accuracy: " + str(round((acc*100),3)) + "%",fontweigh
widget.canvas.axis1.legend()
widget.canvas.axis1.grid(True, alpha=0.75, lw=1, ls='-.')
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
def plot_cm(self, Y_pred, Y_test, widget, title):
cm=confusion_matrix(Y_test,Y_pred)
widget.canvas.figure.clf()
widget.canvas.axis1 = widget.canvas.figure.add_subplot(111)
class_label = ['1', '2', '3']
df_cm = pd.DataFrame(cm, index=class_label,columns=class_label)
sns.heatmap(df_cm, ax=widget.canvas.axis1, annot=True, cmap='plasma',linewidths=2,fmt
widget.canvas.axis1.set_title("Confusion Matrix of " + title, fontweight ="bold",font
widget.canvas.axis1.set_xlabel("Predicted")
widget.canvas.axis1.set_ylabel("True")
widget.canvas.draw()
 
def plot_roc(self, clf, xtest, ytest, title, widget):
pred_prob = clf.predict_proba(xtest)
pred_prob = pred_prob[:, 1]
fpr, tpr, thresholds = roc_curve(ytest, pred_prob)
widget.canvas.axis1.plot(fpr,tpr, label='ANN',color='crimson', linewidth=3)
widget.canvas.axis1.set_xlabel('False Positive Rate')
widget.canvas.axis1.set_ylabel('True Positive Rate')
widget.canvas.axis1.set_title('ROC Curve of ' + title, fontweight ="bold",fontsize=15
widget.canvas.axis1.grid(True, alpha=0.75, lw=1, ls='-.')
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
def plot_decision(self, cla, feat1, feat2, widget, title=""):
curr_path = os.getcwd()
dataset_dir = curr_path + "/Debernardi et al 2020 data.csv"

#Loads csv file


df, _ = self.read_dataset(dataset_dir)

#Plots decision boundary of two features


feat_boundary = [feat1, feat2]
X_feature = df[feat_boundary]
X_train_feature, X_test_feature, y_train_feature, y_test_feature= train_test_split(X_
random_state = 42)
cla.fit(X_train_feature, y_train_feature)

plot_decision_regions(X_test_feature.values, y_test_feature.ravel(), clf=cla, legend=


widget.canvas.axis1.set_title(title, fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_xlabel(feat1)
widget.canvas.axis1.set_ylabel(feat2)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
 
def plot_learning_curve(self,estimator, title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
widget.canvas.axis1.set_title(title, fontweight ="bold",fontsize=15)
if ylim is not None:
widget.canvas.axis1.set_ylim(*ylim)
widget.canvas.axis1.set_xlabel("Training examples")
widget.canvas.axis1.set_ylabel("Score")
 
train_sizes, train_scores, test_scores, fit_times, _ = \
learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
 
# Plot learning curve
widget.canvas.axis1.grid()
widget.canvas.axis1.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
widget.canvas.axis1.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
widget.canvas.axis1.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
widget.canvas.axis1.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
widget.canvas.axis1.legend(loc="best")
 
def plot_scalability_curve(self,estimator, title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
widget.canvas.axis1.set_title(title, fontweight ="bold",fontsize=15)
if ylim is not None:
widget.canvas.axis1.set_ylim(*ylim)
widget.canvas.axis1.set_xlabel("Training examples")
widget.canvas.axis1.set_ylabel("Score")
 
train_sizes, train_scores, test_scores, fit_times, _ = \
learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True)
fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
 
# Plot n_samples vs fit_times
widget.canvas.axis1.grid()
widget.canvas.axis1.plot(train_sizes, fit_times_mean, 'o-')
widget.canvas.axis1.fill_between(train_sizes, fit_times_mean - fit_times_std,
fit_times_mean + fit_times_std, alpha=0.1)
widget.canvas.axis1.set_xlabel("Training examples")
widget.canvas.axis1.set_ylabel("fit_times")
 
def plot_performance_curve(self,estimator, title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
widget.canvas.axis1.set_title(title, fontweight ="bold",fontsize=15)
if ylim is not None:
widget.canvas.axis1.set_ylim(*ylim)
widget.canvas.axis1.set_xlabel("Training examples")
widget.canvas.axis1.set_ylabel("Score")
 
train_sizes, train_scores, test_scores, fit_times, _ = \
learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fit_times_mean = np.mean(fit_times, axis=1)
 
# Plot n_samples vs fit_times
widget.canvas.axis1.grid()
widget.canvas.axis1.plot(fit_times_mean, test_scores_mean, 'o-')
widget.canvas.axis1.fill_between(fit_times_mean, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1)
widget.canvas.axis1.set_xlabel("fit_times")
widget.canvas.axis1.set_ylabel("Score")
 
def train_model(self, model, X, y):
model.fit(X, y)
return model
 
def predict_model(self, model, X, proba=False):
if ~proba:
y_pred = model.predict(X)
else:
y_pred_proba = model.predict_proba(X)
y_pred = np.argmax(y_pred_proba, axis=1)
 
return y_pred
 
def run_model(self, name, scaling, model, X_train, X_test, y_train, y_test, train=True, prob
if train == True:
model = self.train_model(model, X_train, y_train)
y_pred = self.predict_model(model, X_test, proba)

accuracy = accuracy_score(y_test, y_pred)


recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print('accuracy: ', accuracy)


print('recall: ',recall)
print('precision: ', precision)
print('f1: ', f1)
print(classification_report(y_test, y_pred))

self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(111,facecolor = '
self.plot_cm(y_pred, y_test, self.widgetPlot1, name + " -- " + scaling)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
 
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = self.widgetPlot2.canvas.figure.add_subplot(111,facecolor = '
self.plot_real_pred_val(y_pred, y_test, self.widgetPlot2, name + " -- " + scaling)
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
 
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(221,facecolor = '
self.plot_decision(model, 'creatinine', 'diagnosis', self.widgetPlot3, title="The decision b
self.widgetPlot3.canvas.figure.tight_layout()

self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(222,facecolor = '


self.plot_learning_curve(model, 'Learning Curve' + " -- " + scaling, X_train, y_train, self
self.widgetPlot3.canvas.figure.tight_layout()

self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(223,facecolor = '


self.plot_scalability_curve(model, 'Scalability of ' + name + " -- " + scaling, X_train, y_t
self.widgetPlot3.canvas.figure.tight_layout()
 
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(224,facecolor = '
self.plot_performance_curve(model, 'Performance of ' + name + " -- " + scaling, X_train, y_t
self.widgetPlot3.canvas.figure.tight_layout()

self.widgetPlot3.canvas.draw()
 
def build_train_lr(self):
if path.isfile('logregRaw.pkl'):
#Loads model
self.logregRaw = joblib.load('logregRaw.pkl')
self.logregNorm = joblib.load('logregNorm.pkl')
self.logregStand = joblib.load('logregStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Logistic Regression', 'Raw', self.logregRaw, self.X_train_raw, self.X_test_r
if self.rbNorm.isChecked():
self.run_model('Logistic Regression', 'Normalization', self.logregNorm, self.X_train_norm, s
self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('Logistic Regression', 'Standardization', self.logregStand, self.X_train_stan
self.y_test_stand)
 
else:
#Builds and trains Logistic Regression
self.logregRaw = LogisticRegression(solver='lbfgs', max_iter=2000, random_state=2021)
self.logregNorm = LogisticRegression(solver='lbfgs', max_iter=2000, random_state=2021)
self.logregStand = LogisticRegression(solver='lbfgs', max_iter=2000, random_state=2021)

if self.rbRaw.isChecked():
self.run_model('Logistic Regression', 'Raw', self.logregRaw, self.X_train_raw, self.X_test_r
if self.rbNorm.isChecked():
self.run_model('Logistic Regression', 'Normalization', self.logregNorm, self.X_train_norm, s
self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('Logistic Regression', 'Standardization', self.logregStand, self.X_train_stan
self.y_test_stand)
 
#Saves model
joblib.dump(self.logregRaw, 'logregRaw.pkl')
joblib.dump(self.logregNorm, 'logregNorm.pkl')
joblib.dump(self.logregStand, 'logregStand.pkl')

def choose_ML_model(self):
strCB = self.cbClassifier.currentText()

if strCB == 'Logistic Regression':


self.build_train_lr()
 
if strCB == 'Support Vector Machine':
self.build_train_svm()
 
if strCB == 'K-Nearest Neighbor':
self.build_train_knn()
 
if strCB == 'Decision Tree':
self.build_train_dt()
 
if strCB == 'Random Forest':
self.build_train_rf()
 
if strCB == 'Gradient Boosting':
self.build_train_gb()
 
if strCB == 'Naive Bayes':
self.build_train_nb()
 
if strCB == 'Adaboost':
self.build_train_ada()
 
if strCB == 'XGB Classifier':
self.build_train_xgb()
 
if strCB == 'LGBM Classifier':
self.build_train_lgbm()
 
if strCB == 'MLP Classifier':
self.build_train_mlp()
 
def build_train_svm(self):
if path.isfile('SVMRaw.pkl'):
#Loads model
self.SVMRaw = joblib.load('SVMRaw.pkl')
self.SVMNorm = joblib.load('SVMNorm.pkl')
self.SVMStand = joblib.load('SVMStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Support Vector Machine', 'Raw', self.SVMRaw, self.X_train_raw, self.X_test_r
if self.rbNorm.isChecked():
self.run_model('Support Vector Machine', 'Normalization', self.SVMNorm, self.X_train_norm, s
self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('Support Vector Machine', 'Standardization', self.SVMStand, self.X_train_stan
self.y_test_stand)
 
else:
#Builds and trains Logistic Regression
self.SVMRaw = SVC(random_state=2021,probability=True)
self.SVMNorm = SVC(random_state=2021,probability=True)
self.SVMStand = SVC(random_state=2021,probability=True)

if self.rbRaw.isChecked():
self.run_model('Support Vector Machine', 'Raw', self.SVMRaw, self.X_train_raw, self.X_test_r
if self.rbNorm.isChecked():
self.run_model('Support Vector Machine', 'Normalization', self.SVMNorm, self.X_train_norm, s
self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('Support Vector Machine', 'Standardization', self.SVMStand, self.X_train_stan
self.y_test_stand)
 
#Saves model
joblib.dump(self.SVMRaw, 'SVMRaw.pkl')
joblib.dump(self.SVMNorm, 'SVMNorm.pkl')
joblib.dump(self.SVMStand, 'SVMStand.pkl')
 
def build_train_knn(self):
if path.isfile('KNNRaw.pkl'):
#Loads model
self.KNNRaw = joblib.load('KNNRaw.pkl')
self.KNNNorm = joblib.load('KNNNorm.pkl')
self.KNNStand = joblib.load('KNNStand.pkl')

if self.rbRaw.isChecked():
self.run_model('K-Nearest Neighbor', 'Raw', self.KNNRaw, self.X_train_raw, self.X_test_raw,
if self.rbNorm.isChecked():
self.run_model('K-Nearest Neighbor', 'Normalization', self.KNNNorm, self.X_train_norm, self.
self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('K-Nearest Neighbor', 'Standardization', self.KNNStand, self.X_train_stand, s
self.y_test_stand)
 
else:
#Builds and trains K-Nearest Neighbor
self.KNNRaw = KNeighborsClassifier(n_neighbors = 50)
self.KNNNorm = KNeighborsClassifier(n_neighbors = 50)
self.KNNStand = KNeighborsClassifier(n_neighbors = 50)

if self.rbRaw.isChecked():
self.run_model('K-Nearest Neighbor', 'Raw', self.KNNRaw, self.X_train_raw, self.X_test_raw,
if self.rbNorm.isChecked():
self.run_model('K-Nearest Neighbor', 'Normalization', self.KNNNorm, self.X_train_norm, self.
self.y_test_norm)
 
if self.rbStand.isChecked():
self.run_model('K-Nearest Neighbor', 'Standardization', self.KNNStand, self.X_train_stand, s
self.y_test_stand)
 
#Saves model
joblib.dump(self.KNNRaw, 'KNNRaw.pkl')
joblib.dump(self.KNNNorm, 'KNNNorm.pkl')
joblib.dump(self.KNNStand, 'KNNStand.pkl')
 
def build_train_dt(self):
if path.isfile('DTRaw.pkl'):
#Loads model
self.DTRaw = joblib.load('DTRaw.pkl')
self.DTNorm = joblib.load('DTNorm.pkl')
self.DTStand = joblib.load('DTStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Decision Tree', 'Raw', self.DTRaw, self.X_train_raw, self.X_test_raw, self.y
if self.rbNorm.isChecked():
self.run_model('Decision Tree', 'Normalization', self.DTNorm, self.X_train_norm, self.X_test
 
if self.rbStand.isChecked():
self.run_model('Decision Tree', 'Standardization', self.DTStand, self.X_train_stand, self.X_
self.y_test_stand)
 
else:
#Builds and trains Decision Tree
dt = DecisionTreeClassifier()
parameters = { 'max_depth':np.arange(1,20,1),'random_state':[2021]}
self.DTRaw = GridSearchCV(dt, parameters)
self.DTNorm = GridSearchCV(dt, parameters)
self.DTStand = GridSearchCV(dt, parameters)

if self.rbRaw.isChecked():
self.run_model('Decision Tree', 'Raw', self.DTRaw, self.X_train_raw, self.X_test_raw, self.y
if self.rbNorm.isChecked():
self.run_model('Decision Tree', 'Normalization', self.DTNorm, self.X_train_norm, self.X_test
 
if self.rbStand.isChecked():
self.run_model('Decision Tree', 'Standardization', self.DTStand, self.X_train_stand, self.X_
self.y_test_stand)
 
#Saves model
joblib.dump(self.DTRaw, 'DTRaw.pkl')
joblib.dump(self.DTNorm, 'DTNorm.pkl')
joblib.dump(self.DTStand, 'DTStand.pkl')
 
def build_train_rf(self):
if path.isfile('RFRaw.pkl'):
#Loads model
self.RFRaw = joblib.load('RFRaw.pkl')
self.RFNorm = joblib.load('RFNorm.pkl')
self.RFStand = joblib.load('RFStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Random Forest', 'Raw', self.RFRaw, self.X_train_raw, self.X_test_raw, self.y
if self.rbNorm.isChecked():
self.run_model('Random Forest', 'Normalization', self.RFNorm, self.X_train_norm, self.X_test
 
if self.rbStand.isChecked():
self.run_model('Random Forest', 'Standardization', self.RFStand, self.X_train_stand, self.X_
self.y_test_stand)
 
else:
#Builds and trains Random Forest
self.RFRaw = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=2021)
self.RFNorm = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=2021)
self.RFStand = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=2021)

if self.rbRaw.isChecked():
self.run_model('Random Forest', 'Raw', self.RFRaw, self.X_train_raw, self.X_test_raw, self.y
if self.rbNorm.isChecked():
self.run_model('Random Forest', 'Normalization', self.RFNorm, self.X_train_norm, self.X_test
 
if self.rbStand.isChecked():
self.run_model('Random Forest', 'Standardization', self.RFStand, self.X_train_stand, self.X_
self.y_test_stand)
 
#Saves model
joblib.dump(self.RFRaw, 'RFRaw.pkl')
joblib.dump(self.RFNorm, 'RFNorm.pkl')
joblib.dump(self.RFStand, 'RFStand.pkl')

def build_train_gb(self):
if path.isfile('GBRaw.pkl'):
#Loads model
self.GBRaw = joblib.load('GBRaw.pkl')
self.GBNorm = joblib.load('GBNorm.pkl')
self.GBStand = joblib.load('GBStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Gradient Boosting', 'Raw', self.GBRaw, self.X_train_raw, self.X_test_raw, se
if self.rbNorm.isChecked():
self.run_model('Gradient Boosting', 'Normalization', self.GBNorm, self.X_train_norm, self.X_
 
if self.rbStand.isChecked():
self.run_model('Gradient Boosting', 'Standardization', self.GBStand, self.X_train_stand, sel
self.y_test_stand)
 
else:
#Builds and trains Gradient Boosting
self.GBRaw = GradientBoostingClassifier(n_estimators = 200, max_depth=20, subsample=0.8, max
self.GBNorm = GradientBoostingClassifier(n_estimators = 200, max_depth=20, subsample=0.8, ma
self.GBStand = GradientBoostingClassifier(n_estimators = 200, max_depth=20, subsample=0.8, m

if self.rbRaw.isChecked():
self.run_model('Gradient Boosting', 'Raw', self.GBRaw, self.X_train_raw, self.X_test_raw, se
if self.rbNorm.isChecked():
self.run_model('Gradient Boosting', 'Normalization', self.GBNorm, self.X_train_norm, self.X_
 
if self.rbStand.isChecked():
self.run_model('Gradient Boosting', 'Standardization', self.GBStand, self.X_train_stand, sel
self.y_test_stand)
 
#Saves model
joblib.dump(self.GBRaw, 'GBRaw.pkl')
joblib.dump(self.GBNorm, 'GBNorm.pkl')
joblib.dump(self.GBStand, 'GBStand.pkl')
 
def build_train_nb(self):
if path.isfile('NBRaw.pkl'):
#Loads model
self.NBRaw = joblib.load('NBRaw.pkl')
self.NBNorm = joblib.load('NBNorm.pkl')
self.NBStand = joblib.load('NBStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Naive Bayes', 'Raw', self.NBRaw, self.X_train_raw, self.X_test_raw, self.y_t
if self.rbNorm.isChecked():
self.run_model('Naive Bayes', 'Normalization', self.NBNorm, self.X_train_norm, self.X_test_n
 
if self.rbStand.isChecked():
self.run_model('Naive Bayes', 'Standardization', self.NBStand, self.X_train_stand, self.X_te
self.y_test_stand)
 
else:
#Builds and trains Naive Bayes
self.NBRaw = GaussianNB()
self.NBNorm = GaussianNB()
self.NBStand = GaussianNB()

if self.rbRaw.isChecked():
self.run_model('Naive Bayes', 'Raw', self.NBRaw, self.X_train_raw, self.X_test_raw, self.y_t
if self.rbNorm.isChecked():
self.run_model('Naive Bayes', 'Normalization', self.NBNorm, self.X_train_norm, self.X_test_n
 
if self.rbStand.isChecked():
self.run_model('Naive Bayes', 'Standardization', self.NBStand, self.X_train_stand, self.X_te
self.y_test_stand)
 
#Saves model
joblib.dump(self.NBRaw, 'NBRaw.pkl')
joblib.dump(self.NBNorm, 'NBNorm.pkl')
joblib.dump(self.NBStand, 'NBStand.pkl')
 
def build_train_ada(self):
if path.isfile('ADARaw.pkl'):
#Loads model
self.ADARaw = joblib.load('ADARaw.pkl')
self.ADANorm = joblib.load('ADANorm.pkl')
self.ADAStand = joblib.load('ADAStand.pkl')

if self.rbRaw.isChecked():
self.run_model('Adaboost', 'Raw', self.ADARaw, self.X_train_raw, self.X_test_raw, self.y_tra
if self.rbNorm.isChecked():
self.run_model('Adaboost', 'Normalization', self.ADANorm, self.X_train_norm, self.X_test_nor
 
if self.rbStand.isChecked():
self.run_model('Adaboost', 'Standardization', self.ADAStand, self.X_train_stand, self.X_test
 
else:
#Builds and trains Adaboost
self.ADARaw = AdaBoostClassifier(n_estimators = 200, learning_rate=0.01)
self.ADANorm = AdaBoostClassifier(n_estimators = 200, learning_rate=0.01)
self.ADAStand = AdaBoostClassifier(n_estimators = 200, learning_rate=0.01)

if self.rbRaw.isChecked():
self.run_model('Adaboost', 'Raw', self.ADARaw, self.X_train_raw, self.X_test_raw, self.y_tra
if self.rbNorm.isChecked():
self.run_model('Adaboost', 'Normalization', self.ADANorm, self.X_train_norm, self.X_test_nor
 
if self.rbStand.isChecked():
self.run_model('Adaboost', 'Standardization', self.ADAStand, self.X_train_stand, self.X_test
 
#Saves model
joblib.dump(self.ADARaw, 'ADARaw.pkl')
joblib.dump(self.ADANorm, 'ADANorm.pkl')
joblib.dump(self.ADAStand, 'ADAStand.pkl')
def build_train_xgb(self):
if path.isfile('XGBRaw.pkl'):
#Loads model
self.XGBRaw = joblib.load('XGBRaw.pkl')
self.XGBNorm = joblib.load('XGBNorm.pkl')
self.XGBStand = joblib.load('XGBStand.pkl')

if self.rbRaw.isChecked():
self.run_model('XGB', 'Raw', self.XGBRaw, self.X_train_raw, self.X_test_raw, self.y_train_ra
if self.rbNorm.isChecked():
self.run_model('XGB', 'Normalization', self.XGBNorm, self.X_train_norm, self.X_test_norm, se
 
if self.rbStand.isChecked():
self.run_model('XGB', 'Standardization', self.XGBStand, self.X_train_stand, self.X_test_stan
 
else:
#Builds and trains XGB classifier
self.XGBRaw = XGBClassifier(n_estimators = 200, max_depth=20, random_state=2021, use_label_e
self.XGBNorm = XGBClassifier(n_estimators = 200, max_depth=20, random_state=2021, use_label_
self.XGBStand = XGBClassifier(n_estimators = 200, max_depth=20, random_state=2021, use_label

if self.rbRaw.isChecked():
self.run_model('XGB', 'Raw', self.XGBRaw, self.X_train_raw, self.X_test_raw, self.y_train_ra
if self.rbNorm.isChecked():
self.run_model('XGB', 'Normalization', self.XGBNorm, self.X_train_norm, self.X_test_norm, se
 
if self.rbStand.isChecked():
self.run_model('XGB', 'Standardization', self.XGBStand, self.X_train_stand, self.X_test_stan
 
#Saves model
joblib.dump(self.XGBRaw, 'XGBRaw.pkl')
joblib.dump(self.XGBNorm, 'XGBNorm.pkl')
joblib.dump(self.XGBStand, 'XGBStand.pkl')
 
def build_train_lgbm(self):
if path.isfile('LGBMRaw.pkl'):
#Loads model
self.LGBMRaw = joblib.load('LGBMRaw.pkl')
self.LGBMNorm = joblib.load('LGBMNorm.pkl')
self.LGBMStand = joblib.load('LGBMStand.pkl')

if self.rbRaw.isChecked():
self.run_model('LGBM Classifier', 'Raw', self.LGBMRaw, self.X_train_raw, self.X_test_raw, se
if self.rbNorm.isChecked():
self.run_model('LGBM Classifier', 'Normalization', self.LGBMNorm, self.X_train_norm, self.X_
 
if self.rbStand.isChecked():
self.run_model('LGBM Classifier', 'Standardization', self.LGBMStand, self.X_train_stand, sel
self.y_test_stand)
 
else:
#Builds and trains LGBMClassifier classifier
self.LGBMRaw = LGBMClassifier(max_depth = 20, n_estimators=500, subsample=0.8, random_state=
self.LGBMNorm = LGBMClassifier(max_depth = 20, n_estimators=500, subsample=0.8, random_state
self.LGBMStand = LGBMClassifier(max_depth = 20, n_estimators=500, subsample=0.8, random_stat

if self.rbRaw.isChecked():
self.run_model('LGBM Classifier', 'Raw', self.LGBMRaw, self.X_train_raw, self.X_test_raw, se
if self.rbNorm.isChecked():
self.run_model('LGBM Classifier', 'Normalization', self.LGBMNorm, self.X_train_norm, self.X_
 
if self.rbStand.isChecked():
self.run_model('LGBM Classifier', 'Standardization', self.LGBMStand, self.X_train_stand, sel
self.y_test_stand)
 
#Saves model
joblib.dump(self.LGBMRaw, 'LGBMRaw.pkl')
joblib.dump(self.LGBMNorm, 'LGBMNorm.pkl')
joblib.dump(self.LGBMStand, 'LGBMStand.pkl')

def build_train_mlp(self):
if path.isfile('MLPRaw.pkl'):
#Loads model
self.MLPRaw = joblib.load('MLPRaw.pkl')
self.MLPNorm = joblib.load('MLPNorm.pkl')
self.MLPStand = joblib.load('MLPStand.pkl')

if self.rbRaw.isChecked():
self.run_model('MLP Classifier', 'Raw', self.MLPRaw, self.X_train_raw, self.X_test_raw, self
if self.rbNorm.isChecked():
self.run_model('MLP Classifier', 'Normalization', self.MLPNorm, self.X_train_norm, self.X_te
 
if self.rbStand.isChecked():
self.run_model('MLP Classifier', 'Standardization', self.MLPStand, self.X_train_stand, self.
self.y_test_stand)
 
else:
#Builds and trains MLP classifier
self.MLPRaw = MLPClassifier(random_state=2021)
self.MLPNorm = MLPClassifier(random_state=2021)
self.MLPStand = MLPClassifier(random_state=2021)

if self.rbRaw.isChecked():
self.run_model('MLP Classifier', 'Raw', self.MLPRaw, self.X_train_raw, self.X_test_raw, self
if self.rbNorm.isChecked():
self.run_model('MLP Classifier', 'Normalization', self.MLPNorm, self.X_train_norm, self.X_te
 
if self.rbStand.isChecked():
self.run_model('MLP Classifier', 'Standardization', self.MLPStand, self.X_train_stand, self.
self.y_test_stand)
 
#Saves model
joblib.dump(self.MLPRaw, 'MLPRaw.pkl')
joblib.dump(self.MLPNorm, 'MLPNorm.pkl')
joblib.dump(self.MLPStand, 'MLPStand.pkl')

if __name__ == '__main__':
import sys
app = QApplication(sys.argv)
ex = DemoGUI_Pancreatic()
ex.show()
sys.exit(app.exec_())
 
 
 
 
 

You might also like