
ML Paper - Breast Cancer Model

The document describes a dataset containing breast cancer histopathology images and the process of building a deep learning model for cancer detection. It loads images from directories, preprocesses and splits the data, builds a model using transfer learning with VGG16, and trains the model. It then discusses adding a second pretrained model by feature concatenation and potential reasons for decreased accuracy. Finally, it suggests implementing k-fold cross-validation for a more robust performance evaluation.

Uploaded by

AwesomeDude

About the dataset:

The original dataset consisted of 162 whole mount slide images of Breast Cancer (BCa) specimens
scanned at 40x. From that, 277,524 patches of size 50 x 50 were extracted (198,738 IDC negative
and 78,786 IDC positive). Each patch's file name has the format u_xX_yY_classC.png, for example
10253_idx5_x1351_y1101_class0.png. Here u is the patient ID (10253_idx5), X is the x-coordinate
of where the patch was cropped from, Y is the y-coordinate of where the patch was cropped from,
and C indicates the class, where 0 is non-IDC and 1 is IDC.
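The naming scheme above can be unpacked with a regular expression. The following is a minimal sketch, not part of the original code; the helper name `parse_patch_name` and the pattern are assumptions made for illustration.

```python
import re

# Pattern for names such as 10253_idx5_x1351_y1101_class0.png
PATCH_NAME = re.compile(
    r"^(?P<patient>.+)_x(?P<x>\d+)_y(?P<y>\d+)_class(?P<label>[01])\.png$"
)

def parse_patch_name(file_name):
    """Split a patch file name into patient ID, crop coordinates, and class."""
    match = PATCH_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected patch file name: {file_name}")
    return {
        "patient": match.group("patient"),
        "x": int(match.group("x")),
        "y": int(match.group("y")),
        "label": int(match.group("label")),  # 0 = non-IDC, 1 = IDC
    }

info = parse_patch_name("10253_idx5_x1351_y1101_class0.png")
print(info)
```

Keeping the patient ID available is useful if the data is later split by patient rather than by patch.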

About the Model:

Code:

import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler

urls = os.listdir("/kaggle/input/breast-histopathology-images/BreastCancer/train")
path = os.path.join("/kaggle/input/breast-histopathology-images/BreastCancer/train", urls[0])

def loadImages(path, urls, target):
    # Read every file listed in urls from path, scale pixels to [0, 1],
    # resize to 100x100, and label each image with target
    images = []
    labels = []
    for i in range(len(urls)):
        img_path = path + "/" + urls[i]
        img = cv2.imread(img_path)
        img = img / 255.0
        img = cv2.resize(img, (100, 100))
        images.append(img)
        labels.append(target)
    images = np.asarray(images)
    return images, labels

Cancerous_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/1_Cancer"
Cancerous_urls = os.listdir(Cancerous_path)
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)

normal_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/0_NoCancer"
normal_urls = os.listdir(normal_path)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)

Cancerous_imgs = np.asarray(Cancerous_imgs)
normal_imgs = np.asarray(normal_imgs)

# Stack both classes into one array of images and one array of labels
data = np.r_[Cancerous_imgs, normal_imgs]
targets = np.r_[Cancerous_targets, normal_targets]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, targets, test_size=0.25)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

from tensorflow.keras.applications import VGG16


from tensorflow.keras.applications.vgg16 import preprocess_input

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(100, 100, 3))

# Freeze the pre-trained convolutional layers
for layer in base_model.layers:
    layer.trainable = False

model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=32, epochs=10,
                    validation_data=(x_test, y_test))
model.summary()

1. Data Loading:
Images from two directories, one containing cancerous images (1_Cancer) and the other
non-cancerous images (0_NoCancer), are loaded using the loadImages function.
The images are resized to a fixed size of (100, 100).
The pixel values are normalized to the range [0, 1].
2. Train-Test Split:
The dataset is split into training and testing sets using the train_test_split function from
scikit-learn.
3. Model Architecture:
The VGG16 pre-trained model is loaded from Keras applications. The top (classification)
layer is excluded ( include_top=False ), and the input shape is set to (100, 100, 3).
All layers in the pre-trained model are set to non-trainable.
A custom sequential model is created by adding the VGG16 base model, followed by a
Flatten layer to flatten the output, and three Dense layers for classification.
The Dense layers have 512, 256, and 1 neurons, respectively, with 'relu' activation functions
for the first two and a 'sigmoid' activation function for the last layer (binary classification).
The model is compiled using the Adam optimizer, binary crossentropy loss function, and
accuracy as the evaluation metric.
4. Model Training:
The model is trained using the training data ( x_train and y_train ) for 10 epochs with a
batch size of 32. Validation data ( x_test and y_test ) is used to monitor the model's
performance during training.
5. Model Summary:
The model.summary() function is used to display a summary of the model architecture,
including layer types, output shapes, and the number of trainable parameters.
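As a rough check on what model.summary() should report for the classification head, the trainable parameter count can be worked out by hand. This is a sketch under one assumption that is not stated in the original: that VGG16's five 2x2 max-pooling layers each halve the spatial size (rounding down) and that its last convolutional block has 512 channels, so a 100x100 input leaves a 3x3x512 feature map.

```python
def pooled_size(size, n_pools=5):
    """Spatial size after n_pools 2x2 max-pooling layers (floor division)."""
    for _ in range(n_pools):
        size //= 2
    return size

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

side = pooled_size(100)              # 100 -> 50 -> 25 -> 12 -> 6 -> 3
flat = side * side * 512             # length of the Flatten output
layer_sizes = [flat, 512, 256, 1]    # Flatten -> Dense(512) -> Dense(256) -> Dense(1)
head_params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_head_params = sum(head_params)

print(side, flat, head_params, total_head_params)
```

Because the VGG16 layers are frozen, only these dense-layer parameters are updated during training.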

Modification #1:

Feature Concatenation: Instead of using a single pre-trained model, use multiple pre-trained models
and concatenate their extracted features before feeding them into dense layers.

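from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Input, Concatenate
from tensorflow.keras.models import Model

# Add another pre-trained model
resnet_model = ResNet50(weights='imagenet', include_top=False, input_shape=(100, 100, 3))

for layer in resnet_model.layers:
    layer.trainable = False

# Feed one shared input to both backbones. Their feature maps can differ in
# spatial size, so flatten each one before concatenating.
inputs = Input(shape=(100, 100, 3))
vgg_features = Flatten()(base_model(inputs))
resnet_features = Flatten()(resnet_model(inputs))

# Concatenate features from both VGG16 and ResNet50
concatenated_features = Concatenate()([vgg_features, resnet_features])

# Add dense layers on top (a Sequential model cannot start from a tensor,
# so the functional API is used here)
x = Dense(512, activation='relu')(concatenated_features)
x = Dense(256, activation='relu')(x)
outputs = Dense(1, activation='sigmoid')(x)
merged_model = Model(inputs=inputs, outputs=outputs)

merged_model.compile(optimizer=Adam(learning_rate=0.001),
                     loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

# Train the model with concatenated features
history = merged_model.fit(x_train, y_train, batch_size=32, epochs=5,
                           validation_data=(x_test, y_test))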

Assume we have N pre-trained models, where model i is denoted by $f_i(X; \Theta_i)$, X is the
input tensor, and $\Theta_i$ represents the set of parameters of the i-th model. Each $f_i$ outputs a
feature tensor $H_i$.

Let C be the operation of concatenation. The concatenated feature tensor $H^{(\mathrm{concat})}$ is defined as:

$H^{(\mathrm{concat})} = C(H_1, H_2, \ldots, H_N) = [H_1, H_2, \ldots, H_N]$

Here, $[H_1, H_2, \ldots, H_N]$ denotes the concatenation of the feature tensors along a specific axis.

Now, let $W^{(\mathrm{concat})}$ be the weight matrix and $b^{(\mathrm{concat})}$ be the bias vector of the dense layer
that receives the concatenated features. The concatenation itself has no parameters.

The output of the concatenation operation is then passed through dense layers with rectified linear
unit (ReLU) activation functions:

$H^{(\mathrm{dense})} = \mathrm{ReLU}\left(W^{(\mathrm{concat})} H^{(\mathrm{concat})} + b^{(\mathrm{concat})}\right)$

Finally, the output layer produces the predicted class probability:

$\hat{y} = \sigma\left(W^{(\mathrm{output})} H^{(\mathrm{dense})} + b^{(\mathrm{output})}\right)$

Here, $\sigma$ represents the sigmoid activation function, $W^{(\mathrm{output})}$ is the weight matrix, and
$b^{(\mathrm{output})}$ is the bias vector for the output layer.

In summary, the entire process involves concatenating features from multiple pre-trained models,
passing them through dense layers, and producing the final classification probability.
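The chain of concatenation, dense layer with ReLU, and sigmoid output can be traced on a toy example. This is only an illustrative sketch in plain Python: the feature vectors, weights, and biases below are made-up numbers, and a single dense layer stands in for the two used in the model.

```python
import math

def dense(inputs, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

# Hypothetical flattened features from two backbones
h1 = [0.5, -1.0]
h2 = [2.0, 0.0, 1.0]
h_concat = h1 + h2                      # concatenation along the feature axis

# Hypothetical dense-layer parameters (2 units, 5 inputs each)
w_concat = [[0.1, 0.2, 0.3, 0.4, 0.5],
            [-0.5, 0.4, -0.3, 0.2, -0.1]]
b_concat = [0.0, 0.1]
h_dense = relu(dense(h_concat, w_concat, b_concat))

# Hypothetical output-layer parameters (1 unit)
w_output = [[1.0, -1.0]]
b_output = [0.0]
y_hat = sigmoid(dense(h_dense, w_output, b_output)[0])

print(h_concat, h_dense, y_hat)
```

The concatenated vector has as many entries as both inputs combined, which is why the first dense layer grows with every backbone that is added.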

Actual Outcome:

Several factors can contribute to a decrease in accuracy. Here are some possible reasons:

1. Feature Mismatch:
Different pre-trained models might capture different aspects or representations of the data.
If the features extracted from different models are not well-aligned or complementary,
concatenating them might introduce noise or conflicting information.
2. Model Complexity:
Combining features from multiple pre-trained models increases the overall complexity of the
model. If the dataset is not large enough, or if the models are not fine-tuned appropriately,
the increased complexity may lead to overfitting.
3. Dimensionality Mismatch:
The feature dimensions from different pre-trained models might not be compatible for
concatenation. Ensure that the features extracted from each model have the same or
compatible dimensions before concatenating.
4. Training Data Size:
If the dataset is small, training a complex model with concatenated features may lead to
poor generalization. Pre-trained models are typically trained on large datasets, and using
them in a concatenation approach might not be beneficial if your dataset is limited.
5. Learning Rate and Training Strategy:
When using a more complex model, it's essential to adjust the learning rate and training
strategy accordingly. A higher learning rate or inadequate training strategy might result in
suboptimal convergence.
6. Computational Resources:
Training a model with concatenated features from multiple pre-trained models requires
more computational resources. If the hardware limitations are reached, it might affect the
convergence of the model during training.
7. Feature Redundancy:
If the features extracted by different pre-trained models contain redundant information,
concatenating them may not bring additional discriminative power. It's crucial to analyze the
characteristics of the features and ensure they provide complementary information.
8. Hyperparameter Tuning:
The architecture and hyperparameters of the dense layers following the concatenated
features need to be tuned appropriately. The choice of activation functions, layer sizes, and
regularization techniques can significantly impact the model's performance.

Modification #2

Cross-validation: Implement k-fold cross-validation to get a better estimate of your model's
performance. This can provide a more robust evaluation compared to a single train-test split.

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(data, targets):
    x_train, x_test = data[train_index], data[test_index]
    y_train, y_test = targets[train_index], targets[test_index]

K-fold cross-validation is a technique used to assess the performance of a predictive model by
partitioning the dataset into k subsets (or folds). The model is trained and evaluated k times, using a
different fold as the test set in each iteration while using the remaining k-1 folds for training. This
process helps in obtaining a more reliable estimate of the model's performance by reducing the
impact of variability in a single train-test split.

Mathematically, the steps of k-fold cross-validation can be described as follows:

1. Partition the dataset:
Split the dataset into k non-overlapping subsets, often referred to as folds.
Let $D = \{D_1, D_2, \ldots, D_k\}$ be the k subsets.
2. Iteration: For each iteration i from 1 to k:
Use $D_i$ as the test set.
Use the union of the remaining folds $\{D_1, D_2, \ldots, D_{i-1}, D_{i+1}, \ldots, D_k\}$ as the training set.
3. Train and evaluate:
Train the model on the training set.
Evaluate the model on the test set.
4. Performance Metrics: Compute performance metrics (e.g., accuracy, precision, recall, F1
score) for each iteration.
5. Aggregate results:
Calculate the average performance across all iterations to get an overall estimate of the
model's performance.

The mathematical notation might look something like this:

Let M be the predictive model.
Let f be the performance metric (e.g., accuracy) that we want to evaluate.
The k-fold cross-validation estimate of the model's performance is given by:

$\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} f\left(M_{-i}, D_i\right)$

Here, $D_i$ is the test set in the i-th iteration, and $M_{-i}$ is the model M trained on the union of the remaining folds.

This process helps in obtaining a more robust performance estimate by ensuring that the model is
evaluated on different subsets of the data, reducing the impact of the randomness introduced by a
single train-test split.
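The procedure can be sketched end to end in a few lines. This is an illustrative example only: the labels are made up, and a majority-class predictor stands in for the neural network so that the averaging over folds is easy to follow.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, non-overlapping folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def majority_label(labels):
    """Stand-in 'model': always predict the most common training label."""
    return max(sorted(set(labels)), key=labels.count)

# Hypothetical binary labels (7 negatives, 3 positives)
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

fold_scores = []
for train, test in k_fold_indices(len(labels), k=5):
    prediction = majority_label([labels[i] for i in train])   # "train"
    hits = sum(1 for i in test if labels[i] == prediction)    # "evaluate"
    fold_scores.append(hits / len(test))

# Average of the per-fold metric
cv_estimate = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_estimate)
```

Note that a new predictor is fitted inside every iteration; reusing one fitted model across folds would let it see the test fold of a later iteration during training.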

Actual Outcome
There are several reasons why you might observe an increase in accuracy compared to a single train-
test split:

1. Better Utilization of Data:


In k-fold cross-validation, the dataset is partitioned into k subsets (folds). The model is
trained and evaluated k times, each time using a different fold as the validation set. This
allows the model to be trained on a larger portion of the data, leading to better utilization of
the available information.
2. Reduced Variability:
Single train-test splits can be sensitive to the specific choice of data in the split, leading to
variability in performance estimates. Cross-validation averages performance over multiple
splits, reducing the impact of data distribution on the evaluation.
3. Robustness to Dataset Heterogeneity:
Cross-validation helps in assessing the model's ability to generalize across different subsets
of the data. If the dataset is heterogeneous, containing various patterns or variations, cross-
validation provides a more comprehensive evaluation.
4. Mitigation of Overfitting or Underfitting:
With k-fold cross-validation, the model is trained and evaluated multiple times on different
subsets of the data. This helps in identifying whether the model is consistently overfitting or
underfitting across various data partitions.
5. More Reliable Performance Estimates:
Cross-validation provides more reliable estimates of the model's performance by averaging
the results over multiple folds. This is particularly valuable when dealing with limited data,
as it minimizes the impact of a single random split.

Modification #3:
Leave-One-Out Cross-Validation (LOOCV):

A special case of k-fold cross-validation where k is equal to the number of data points in the
dataset. In each iteration, a single data point is used as the test set, and the model is trained on
the remaining data.

Leave-One-Out Cross-Validation (LOOCV) is a cross-validation technique used to assess the
performance of a machine learning model. In LOOCV, the number of folds (k) is set equal to the
number of data points in the dataset. In each iteration, one data point is selected as the test set, and
the model is trained on the remaining data points. This process is repeated for each data point in the
dataset.

Let's break down the steps and understand the math behind LOOCV:

1. Dataset: Let D be a dataset with N data points, $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where
$x_i$ represents the features of the i-th data point, and $y_i$ is the corresponding label.
2. Iterations: LOOCV involves N iterations, where in each iteration, one data point is held out
as the test set, and the model is trained on the remaining N−1 data points.
3. Training and Testing:
Training Set: In each iteration, the model is trained on N−1 data points. The training set for
the i-th iteration is $D_{\mathrm{train},i} = D \setminus \{(x_i, y_i)\}$, meaning it excludes the i-th data point.
Testing Set: The i-th data point is used as the test set for the i-th iteration. The testing
set is $D_{\mathrm{test},i} = \{(x_i, y_i)\}$.
4. Model Evaluation: After training the model on the training set, it is evaluated on the
corresponding test set. The evaluation metrics, such as accuracy, are recorded for each iteration.
5. Performance Metrics: The performance of the model is typically assessed by averaging the
evaluation metrics over all iterations. For example, if accuracy is used as the metric, the overall
accuracy is calculated as:

$\mathrm{Accuracy}_{\mathrm{LOOCV}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$

where $\hat{y}_i$ is the prediction for $x_i$ made by the model trained on $D_{\mathrm{train},i}$.

LOOCV is considered a thorough cross-validation method because it ensures that each data point is
used as both a training and test example. However, it can be computationally expensive, especially for
large datasets, as it requires training the model N times. The choice of LOOCV or other cross-
validation methods depends on the specific characteristics of the dataset and the computational
resources available.
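To illustrate the mechanics on something small enough to follow by hand, the sketch below runs LOOCV with a 1-nearest-neighbour classifier. The one-dimensional data points and the classifier are assumptions chosen for illustration; they are not the image model used in this document.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Predict the label of the training point closest to the query."""
    distances = [abs(p - query) for p in train_points]
    return train_labels[distances.index(min(distances))]

# Hypothetical 1-D features and binary labels
points = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 2.1]
labels = [0, 0, 0, 1, 1, 1, 0]

correct = 0
for i in range(len(points)):
    # Hold out point i; train on the remaining N - 1 points
    train_points = points[:i] + points[i + 1:]
    train_labels = labels[:i] + labels[i + 1:]
    prediction = nearest_neighbor_predict(train_points, train_labels, points[i])
    correct += int(prediction == labels[i])

# Average of the per-point 0/1 scores
loocv_accuracy = correct / len(points)
print(loocv_accuracy)
```

Only the last point is misclassified here, because its nearest remaining neighbour (2.9) belongs to the other class.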

import os
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from sklearn.model_selection import LeaveOneOut

def loadImages(path, urls, target):
    images = []
    labels = []
    for i in range(len(urls)):
        img_path = os.path.join(path, urls[i])
        img = cv2.imread(img_path)
        img = img / 255.0
        img = cv2.resize(img, (100, 100))
        images.append(img)
        labels.append(target)
    images = np.asarray(images)
    return images, labels

path = "/kaggle/input/breast-histopathology-images/BreastCancer/train"
Cancerous_path = os.path.join(path, "1_Cancer")
normal_path = os.path.join(path, "0_NoCancer")

Cancerous_urls = os.listdir(Cancerous_path)
normal_urls = os.listdir(normal_path)

Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)


normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)

data = np.concatenate([Cancerous_imgs, normal_imgs], axis=0)


targets = np.concatenate([Cancerous_targets, normal_targets], axis=0)

# Initialize LOOCV
loo = LeaveOneOut()

# Build a fresh model for every fold so that no fold starts from weights
# that were already fitted on its held-out sample
def build_model():
    base_model = tf.keras.applications.VGG16(
        weights='imagenet', include_top=False, input_shape=(100, 100, 3)
    )

    for layer in base_model.layers:
        layer.trainable = False

    model = Sequential([
        base_model,
        Flatten(),
        Dense(512, activation='relu'),
        Dense(256, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    return model

# Perform LOOCV
for train_index, test_index in loo.split(data):
    x_train, x_test = data[train_index], data[test_index]
    y_train, y_test = targets[train_index], targets[test_index]

    # Train the model
    model = build_model()
    history = model.fit(x_train, y_train, batch_size=32, epochs=10,
                        validation_data=(x_test, y_test))

model.summary()

Actual Outcome:

The observed increase in accuracy when using LOOCV compared to a single train-test split may be
attributed to several factors:

1. Utilization of All Data for Training:


In LOOCV, the model is trained on all data points except one in each iteration. This
comprehensive training approach ensures that the model has exposure to the entire dataset
during training. The increased amount of training data often leads to improved model
performance.
2. Reduction in Variability:
By repeatedly training and evaluating the model on different subsets of the data, LOOCV
provides a more stable and less variable estimate of the model's performance. This is
especially important when dealing with a limited dataset, and it helps in obtaining a more
reliable assessment of how well the model generalizes to new, unseen data.
3. Increased Robustness to Data Distribution:
LOOCV's leave-one-out strategy reduces the impact of a specific data split on model
evaluation. It tests the model's ability to generalize to diverse subsets of the data, making
the performance estimate more robust and reflective of the model's overall capability.

Modification #4
Bootstrap Cross-Validation:

Uses bootstrapped samples (randomly sampled with replacement) as training and test sets. It's
particularly useful when dealing with limited data.
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

n_bootstrap_samples = 10  # assumed value; the original does not state it
accuracy_scores = []

# Perform Bootstrap Cross-Validation
for i in range(n_bootstrap_samples):
    # Generate a bootstrap sample; the seed changes per iteration so that
    # each iteration draws a different sample
    bootstrap_data, bootstrap_targets = resample(data, targets, random_state=i)

    # Split the bootstrap sample into train and test sets
    # (duplicated points can end up on both sides of this split)
    x_train, x_test, y_train, y_test = train_test_split(
        bootstrap_data, bootstrap_targets, test_size=0.25, random_state=i)

    # Train the model
    history = model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=0)

    # Evaluate the model on the test set
    y_pred = model.predict(x_test)
    y_pred_binary = np.round(y_pred).ravel().astype(int)

    # Evaluate the model and store the accuracy
    accuracy = accuracy_score(y_test, y_pred_binary)
    accuracy_scores.append(accuracy)

# Calculate and print the average accuracy over all bootstrap samples
average_accuracy = np.mean(accuracy_scores)
print(f'Average Accuracy: {average_accuracy}')

# Display model summary
model.summary()

The Bootstrap Cross-Validation is a resampling technique that involves repeatedly drawing samples
(with replacement) from the dataset to create multiple training and testing sets. The key idea is to
mimic the process of repeatedly collecting new samples, as if you were conducting multiple
experiments.

Here's the step-by-step explanation of the math behind Bootstrap Cross-Validation:

1. Dataset: Let D be the original dataset with N data points, where
$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x_i$ represents the features of the i-th data point,
and $y_i$ is the corresponding label.
2. Bootstrap Sampling: In each iteration of the Bootstrap Cross-Validation loop, a new dataset
($D'$) is created by randomly sampling N data points with replacement from the original dataset.
This means that some data points may be repeated, while others may be left out.
For a given iteration i, let $D'_i$ be the bootstrap sample created.
3. Training and Testing Sets: The model is then trained on $D'_i$ and tested on the data points that
were not included in the bootstrap sample $D'_i$.
Let $D_{\mathrm{train},i}$ be the training set for the i-th iteration ($D_{\mathrm{train},i} = D'_i$).
Let $D_{\mathrm{test},i}$ be the testing set for the i-th iteration ($D_{\mathrm{test},i} = D \setminus D'_i$).
4. Model Training and Evaluation: The model is trained on $D_{\mathrm{train},i}$ and evaluated on $D_{\mathrm{test},i}$.
5. Accuracy Calculation: The accuracy of the model on the test set $D_{\mathrm{test},i}$ is calculated using
an appropriate metric (e.g., accuracy score).
6. Iteration: Steps 2-5 are repeated for a predefined number of iterations (the number of bootstrap
samples).
7. Average Accuracy: The final performance metric (e.g., average accuracy) is calculated by
averaging the accuracy values obtained in each iteration.

Mathematically, the average accuracy (overallAccuracy) can be calculated as:

$\mathrm{overallAccuracy} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{Accuracy}_i$

where B is the number of bootstrap samples (iterations) and $\mathrm{Accuracy}_i$ is the accuracy on $D_{\mathrm{test},i}$.

The goal is to get a more robust estimate of the model's performance by simulating the process of
collecting new samples from the underlying population. Bootstrap Cross-Validation is particularly
useful when dealing with limited data, providing a way to assess the stability and reliability of a
machine learning model.
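The sampling step with out-of-bag testing, as described in the steps above, can be sketched with the standard library alone. The function name `bootstrap_split`, the sample size, and the seed are assumptions for illustration; no model is trained here, the sketch only shows how the two index sets are formed.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the untouched indices form the test set."""
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag = set(train)
    test = [i for i in range(n_samples) if i not in in_bag]
    return train, test

rng = random.Random(0)      # fixed seed so the run is repeatable
n_samples = 1000

oob_fractions = []
for _ in range(20):
    train, test = bootstrap_split(n_samples, rng)
    oob_fractions.append(len(test) / n_samples)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(mean_oob)
```

For large N the expected share of points left out of a bootstrap sample is (1 − 1/N)^N, which approaches 1/e, so roughly a third of the data is available for testing in each iteration.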

Actual Outcome:

Accuracy: 95.6%
The observed increase in accuracy when using Bootstrap Cross-Validation compared to a single train-
test split may be attributed to several factors:

1. Increased Data Diversity:


Bootstrap sampling introduces diversity in the training sets by allowing some instances to
be repeated while others are omitted. This diversity can be beneficial for the model to learn
a more generalized representation of the underlying patterns in the data.
2. Robustness to Limited Data:
When dealing with limited datasets, traditional train-test splits might lead to variability in
performance estimates due to the particular choice of the split. Bootstrap Cross-Validation
mitigates this issue by repeatedly sampling with replacement, allowing the model to be
exposed to a broader range of data instances.
3. Reduction of Overfitting:
By training and testing on multiple bootstrapped samples, the model is less likely to overfit
to specific patterns present in a single train-test split. The ensemble effect of averaging over
multiple models trained on different subsets can lead to a more robust and generalized
model.
4. Improved Confidence Intervals:
The use of bootstrapped samples enables the calculation of confidence intervals for
performance metrics, providing a measure of uncertainty around the estimated accuracy.
This information is valuable for understanding the reliability of the model's performance.
5. Utilization of All Data Points:
Since bootstrapping involves sampling with replacement, each instance in the dataset has a
chance to be included in the training set for multiple iterations. This ensures that every data
point contributes to the learning process, potentially leading to improved overall model
performance.

Modification #5

Stratified Time Series Cross-Validation:

A combination of stratified k-fold and time series cross-validation. It ensures both class balance
and temporal order preservation in time series data.

import os
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from sklearn.model_selection import StratifiedKFold
# Load images function
def loadImages(path, urls, target):
# ... (unchanged)

# Paths and URLs


path = "/kaggle/input/breast-histopathology-images/BreastCancer/train"
Cancerous_path = os.path.join(path, "1_Cancer")
normal_path = os.path.join(path, "0_NoCancer")

Cancerous_urls = os.listdir(Cancerous_path)
normal_urls = os.listdir(normal_path)

# Load data
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)

data = np.concatenate([Cancerous_imgs, normal_imgs], axis=0)


targets = np.concatenate([Cancerous_targets, normal_targets], axis=0)

# Number of stratified k-fold splits


n_splits = 5 # You can adjust this as needed

# Initialize model
base_model = tf.keras.applications.VGG16(
weights='imagenet', include_top=False, input_shape=(100, 100, 3)
)

for layer in base_model.layers:
    layer.trainable = False

model = Sequential([
base_model,
Flatten(),
Dense(512, activation='relu'),
Dense(256, activation='relu'),
Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(learning_rate=0.001),
loss=tf.keras.losses.BinaryCrossentropy(),
metrics=['accuracy'])

# Stratified Time Series Cross-Validation
# Note: shuffle=True randomizes the sample order, so these folds are
# stratified by class but are not ordered in time
stratified_kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

for train_index, test_index in stratified_kfold.split(data, targets):
    x_train, x_test = data[train_index], data[test_index]
    y_train, y_test = targets[train_index], targets[test_index]

    # Train the model
    history = model.fit(x_train, y_train, batch_size=32, epochs=10,
                        validation_data=(x_test, y_test))

# Display model summary
model.summary()

Assume we have a dataset D containing N data points, where $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$,
$x_i$ represents the features of the i-th data point, and $y_i$ is the corresponding label.

StratifiedKFold Splitting:

D is split into K folds using the StratifiedKFold method, where K is the number of folds.
The StratifiedKFold algorithm ensures that each fold maintains the same distribution of class
labels as the original dataset, preserving class balance.

Training and Testing Sets:

In each iteration of the StratifiedKFold loop, K−1 folds are used for training (D train​) and the
remaining fold is used for testing (Dtest​).
The process is repeated K times, covering all possible combinations of training and testing sets.

Temporal Order Preservation:

While preserving class balance, the StratifiedKFold method does not explicitly consider the
temporal order of the data.
To ensure temporal order preservation in time series data, it's crucial to arrange the data in a way
that respects the temporal sequence.
$x_i$ should correspond to a time point that precedes $x_j$ if $i < j$.

Model Training and Evaluation:


For each iteration, the model is trained on $D_{\mathrm{train}}$ and evaluated on $D_{\mathrm{test}}$.
The training process involves adjusting the model parameters to minimize a chosen objective
function (e.g., binary cross-entropy loss) based on the training data.

The model's performance is then evaluated on the testing set to assess its generalization
capability.

Model Summary:
After completing all iterations, the final model parameters and performance metrics are
summarized.

This process ensures class balance in every fold. Temporal order is preserved only if the data
are sorted by time and the split keeps training observations before test observations, which
StratifiedKFold does not do on its own.
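The class-balance property can be illustrated with a small hand-rolled splitter. This is a simplified sketch, not scikit-learn's actual StratifiedKFold algorithm: it deals the indices of each class round-robin into the folds, and the labels are made up.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical labels: 15 negatives and 5 positives (25% positive)
labels = [0] * 15 + [1] * 5
folds = stratified_folds(labels, k=5)

fold_sizes = [len(fold) for fold in folds]
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(fold_sizes, positives_per_fold)
```

Every fold ends up with one positive among four samples, matching the 25% positive rate of the full label list.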

Actual Outcome:

Accuracy: 94.7%

Modification #6
Custom Time Series Splitting: Instead of relying solely on StratifiedKFold, you might create a custom
time series splitting approach to ensure that each fold respects the temporal order of observations.
This can be done by specifying a custom splitting function, considering the time variable in your
dataset.

import numpy as np

def time_series_split(X, y, n_splits):
    # Indices of the rows of X, ordered by the time column
    sorted_indices = np.argsort(np.asarray(X['time_column']))

    # Calculate the size of each fold
    fold_size = len(X) // n_splits

    # Iterate over splits
    for i in range(n_splits):
        # Calculate the start and end indices for the test set;
        # the last fold absorbs any remainder
        test_start = i * fold_size
        test_end = (i + 1) * fold_size if i < n_splits - 1 else len(X)

        # Test set indices
        test_index = sorted_indices[test_start:test_end]

        # Training set indices: everything before and after the test block
        train_index = np.concatenate([sorted_indices[:test_start],
                                      sorted_indices[test_end:]])

        yield train_index, test_index

# Example usage (X must be a DataFrame with a 'time_column' column):
n_splits = 5
for train_index, test_index in time_series_split(X, y, n_splits):
    print(f"Train Index: {train_index}, Test Index: {test_index}")

Objective:
The goal is to create a custom time series splitting function that generates train and test indices
ensuring the temporal order of observations.

Key Steps:
1. Sorting by Time Variable:
Given the dataset D with features X and labels y, we first sort the indices based on a time
variable. Let's denote this time variable as T.
Mathematically, if X is a DataFrame, the sorting operation can be expressed as:
sorted_indices = argsort(X['time_column'])
2. Determining Fold Sizes:
Next, we determine the size of each fold, denoted as fold_size, which is the total
number of observations divided by the number of splits, rounded down.
fold_size = len(X) // n_splits
3. Iterating Over Splits:
We iterate over the number of splits (K) to generate train and test indices for each split.
For each iteration (i):
The start and end indices for the test set (test_start, test_end) are
determined based on the fold size.
The test set indices (test_index) are extracted from the sorted indices.
The training set indices (train_index) are formed by concatenating the
indices before the test set and after the test set.
4. Yielding Train and Test Indices:
The function yields a tuple (train_index, test_index) for each split.

Mathematical Representation:
1. Sorted Indices: sorted_indices = argsort(X['time_column'])
2. Fold Size Calculation: fold_size = len(X) // n_splits
3. Test Set Indices for Split i:
test_start(i) = i × fold_size
test_end(i) = (i + 1) × fold_size if i < n_splits − 1 else len(X)
test_index(i) = sorted_indices[test_start(i):test_end(i)]
4. Training Set Indices for Split i:
train_index(i) = concatenate(sorted_indices[:test_start(i)], sorted_indices[test_end(i):])
5. Yielding Train and Test Indices: yield train_index(i), test_index(i)

Summary:
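This custom time series splitting function sorts the observations by time and makes each test fold a
contiguous block of that ordering. The training indices include observations from both before and
after the test block, so a stricter time series evaluation would restrict training to earlier observations.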
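For comparison, a forward-chaining split trains only on observations that come before the test block. This is a different scheme from the function above, similar in spirit to scikit-learn's TimeSeriesSplit, and is shown here as a hedged sketch; the function name and the timestamps are assumptions for illustration.

```python
def forward_chaining_split(timestamps, n_splits):
    """Yield (train, test) index lists where every training point precedes the test block."""
    # Positions of the observations, ordered by time
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    fold_size = len(order) // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        # The last test block absorbs any remainder
        test_end = (i + 1) * fold_size if i < n_splits else len(order)
        yield order[:train_end], order[train_end:test_end]

# Hypothetical, unsorted timestamps
timestamps = [5, 1, 4, 2, 8, 7, 3, 6]
splits = list(forward_chaining_split(timestamps, n_splits=3))

for train, test in splits:
    print("train:", train, "test:", test)
```

The training set grows with each split, and the earliest block is never used for testing; that is the price of never training on the future.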
Actual Outcome:

Accuracy: 93.3%

Modification #7
Sliding Window Approach:

Implement a sliding window approach for training and testing sets. Instead of discrete folds, a
fixed-size test window slides forward through time, so consecutive test windows overlap, and the
training set is everything before the current window. This is particularly useful when
there is a gradual change in patterns over time.

import pandas as pd

def sliding_window_split(X, y, window_size, step_size):
    """
    Generate train/test index splits using a sliding test window.

    Parameters:
    - X: DataFrame, features
    - y: Series, labels
    - window_size: int, size of the sliding window
    - step_size: int, step size for moving the window

    Returns:
    - List of tuples: (train_index, test_index) for each iteration
    """
    n = len(X)

    train_test_splits = []
    for i in range(0, n - window_size + 1, step_size):
        # Train on everything before the window (empty on the first iteration)
        train_start, train_end = 0, i
        # Test on the window itself
        test_start, test_end = i, i + window_size

        train_index = list(range(train_start, train_end))
        test_index = list(range(test_start, test_end))

        train_test_splits.append((train_index, test_index))

    return train_test_splits

# Example Usage:
# Assuming X is your feature DataFrame and y is your label Series
window_size = 5
step_size = 2

splits = sliding_window_split(X, y, window_size, step_size)

# Displaying the resulting train-test splits
for i, (train_index, test_index) in enumerate(splits):
    print(f"Split {i + 1}:")
    print(f"Train Index: {train_index}")
    print(f"Test Index: {test_index}")
    print("-" * 30)

Basic Definitions:
X: Feature DataFrame
y: Label Series
n: Total number of observations in the time series

Sliding Window Parameters:
Window Size (w): The size of the sliding window, representing the number of consecutive
observations in each testing set.
Step Size (s): The number of observations the window moves forward after each iteration.

Mathematical Explanation:
1. Iterations:
The total number of iterations (N) is given by N = floor((n − w) / s) + 1. The window stops
once the next position would run past the end of the time series.
2. Train-Test Splits:
For each iteration (i), the training set (Ti) consists of observations from the beginning up to
the current window's starting point, and the testing set (Si) consists of observations within
the window.
3. Moving the Window:
The window then moves forward by the step size (s) for the next iteration.
4. Overlap:
When s < w, each testing set overlaps the previous window, so successive evaluations
track gradual changes in patterns over time.

Example:
Let's consider a simple example with n=10, w=3, and s=1, which gives 8 iterations.

Iteration 1: T1 = {} (empty), S1 = {0,1,2}
Iteration 2: T2 = {0}, S2 = {1,2,3}
Iteration 3: T3 = {0,1}, S3 = {2,3,4}
... and so on, up to Iteration 8: T8 = {0,...,6}, S8 = {7,8,9}.
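
As a quick check of the iteration count, the sketch below rebuilds the index sets for n=10, w=3, s=1 with the same ranges that sliding_window_split uses. The helper name sliding_window_indices is hypothetical, and it takes n directly instead of a DataFrame.

```python
def sliding_window_indices(n, w, s):
    """Return (train_index, test_index) pairs for a series of length n."""
    splits = []
    for start in range(0, n - w + 1, s):
        train_index = list(range(0, start))          # everything before the window
        test_index = list(range(start, start + w))   # the window itself
        splits.append((train_index, test_index))
    return splits


n, w, s = 10, 3, 1
splits = sliding_window_indices(n, w, s)
n_iterations = (n - w) // s + 1  # closed-form count of window positions

print(n_iterations, len(splits))
for train_index, test_index in splits[:3]:
    print(train_index, test_index)
```

The first training set is empty, so in practice the first split would typically be skipped or the loop started at a later offset.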

This approach is beneficial when there is a gradual change in patterns over time, allowing the model
to learn from different segments of the time series. Adjust the window size and step size based on the
characteristics of your time series data.

Actual Outcome:

Accuracy: 96.7%
Modification #8
Expanding Window Approach:

Similar to the sliding window, but the training set includes all data points up to the current test
set, capturing historical information. This can be useful when the model's performance benefits
from a larger training set.

import pandas as pd

def expanding_window_split(X, y):
    """
    Implement an expanding window approach for time series data.

    Parameters:
    - X: DataFrame, features
    - y: Series, labels

    Returns:
    - List of tuples: (train_index, test_index) for each iteration
    """
    n = len(X)

    train_test_splits = []
    for i in range(1, n):
        # Train on every observation before position i
        train_index = list(range(0, i))
        # Test on the next observation only, so it is never part of the training set
        test_index = [i]

        train_test_splits.append((train_index, test_index))

    return train_test_splits

# Example Usage:
# Assuming X is your feature DataFrame and y is your label Series
splits_expanding = expanding_window_split(X, y)

# Displaying the resulting train-test splits
for i, (train_index, test_index) in enumerate(splits_expanding):
    print(f"Split {i + 1}:")
    print(f"Train Index: {train_index}")
    print(f"Test Index: {test_index}")
    print("-" * 30)

Mathematical Explanation:
1. Iterations:
There are n − 1 iterations (i = 1, …, n − 1), where n is the total number of observations.
The first observation is only ever used for training.
2. Train-Test Splits:
For each iteration (i), the training set (Ti) includes all data points before the current test set
(Si).

Ti = {0, 1, 2, …, i−1}
Si = {i}

Example:
Let's consider a simple example with n=5.

Iteration 1: T1 = {0}, S1 = {1}
Iteration 2: T2 = {0,1}, S2 = {2}
Iteration 3: T3 = {0,1,2}, S3 = {3}
Iteration 4: T4 = {0,1,2,3}, S4 = {4}
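
The set definitions Ti = {0, …, i−1} and Si = {i} can be checked with a few lines of Python. This is a sketch under the assumption that each test point is the first observation after its training set; the helper name expanding_window_indices is hypothetical.

```python
def expanding_window_indices(n):
    """Pairs (T_i, S_i) with T_i = {0, ..., i-1} and S_i = {i}."""
    return [(list(range(0, i)), [i]) for i in range(1, n)]


splits = expanding_window_indices(5)
for i, (train_index, test_index) in enumerate(splits, start=1):
    # A test index must never appear in its own training set
    assert test_index[0] not in train_index
    print(f"Iteration {i}: T{i} = {train_index}, S{i} = {test_index}")
```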

This approach is useful when the model's performance benefits from a larger training set that includes
historical information. Adjustments can be made based on specific requirements and characteristics of
your time series data.

Actual Outcome:

Modification #9
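Temporal Convolutional Networks (TCN):
TCNs are deep learning architectures designed for sequential data. They use dilated
convolutions to capture long-range dependencies and can be effective for time series tasks.
Note that the Keras code listed for this modification builds a VGG16 transfer-learning image
classifier and contains no temporal convolution layers.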
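
To make the role of dilation concrete, here is a minimal pure-Python sketch of a causal dilated 1-D convolution. It is an illustration only and not part of the model code in this document: the function names, the kernel of ones, and the dilation schedule (1, 2, 4, 8) are assumptions chosen to show how the receptive field grows.

```python
def dilated_causal_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation  # only the current and past samples are read
            if idx >= 0:
                acc += weight * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Time steps seen by one output of a stack with one convolution per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Push a unit impulse through four stacked layers and see how far it spreads
signal = [1.0] + [0.0] * 15
dilations = (1, 2, 4, 8)
for d in dilations:
    signal = dilated_causal_conv1d(signal, [1.0, 1.0], d)

print(signal)
print(receptive_field(2, dilations))
```

With kernel size 2 and dilations that double at each layer, four layers already cover 16 consecutive time steps: the receptive field grows exponentially with depth while the number of weights grows only linearly.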

import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler

urls = os.listdir("/kaggle/input/breast-histopathology-images/BreastCancer/train")
path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/" + urls[0]

def loadImages(path, urls, target):
    """Read each image under `path`, scale it to [0, 1] and resize it to 100x100."""
    images = []
    labels = []
    for i in range(len(urls)):
        img_path = path + "/" + urls[i]
        img = cv2.imread(img_path)
        if img is None:
            # cv2.imread returns None for unreadable files; skip them
            continue
        img = img / 255.0
        img = cv2.resize(img, (100, 100))
        images.append(img)
        labels.append(target)
    images = np.asarray(images)
    return images, labels

Cancerous_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/1_Cancer"
Cancerous_urls = os.listdir(Cancerous_path)
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)

normal_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/0_NoCancer"
normal_urls = os.listdir(normal_path)

normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)

Cancerous_imgs = np.asarray(Cancerous_imgs)
normal_imgs = np.asarray(normal_imgs)

data = np.r_[Cancerous_imgs, normal_imgs]


targets = np.r_[Cancerous_targets, normal_targets]

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(data, targets, test_size=0.25)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

from tensorflow.keras.applications import VGG16


from tensorflow.keras.applications.vgg16 import preprocess_input

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(100, 100, 3))

# Freeze the pre-trained convolutional layers
for layer in base_model.layers:
    layer.trainable = False

model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=32, epochs=10,
                    validation_data=(x_test, y_test))

model.summary()

The key components of the code and the mathematical concepts involved are as follows:

1. Data Loading:
The code loads breast histopathology images from two directories: 1_Cancer and
0_NoCancer .
Images are resized to (100, 100) pixels and normalized by dividing by 255.0.
2. Data Preparation:
Images and corresponding labels (targets) are loaded using the loadImages function.
Cancerous_imgs and normal_imgs are numpy arrays containing the images.
Cancerous_targets and normal_targets are Python lists containing the corresponding
labels (1 for cancerous and 0 for non-cancerous); np.r_ then concatenates them into one array.
3. Train-Test Split:
The dataset is split into training and testing sets using train_test_split from scikit-learn.
4. VGG16 Model Initialization:
The code uses a pre-trained VGG16 model from Keras applications. The model is loaded
with weights pre-trained on ImageNet data.
The fully connected classification layers at the top of the network are removed
( include_top=False ), and the input shape is set to (100, 100, 3).
5. Freezing Pre-trained Layers:
The layers of the pre-trained VGG16 model are set to non-trainable.
6. Sequential Model Construction:
A new sequential model is created.
The pre-trained VGG16 model is added as the first layer of the new model.
A Flatten layer is added to convert the output to a 1D tensor.
Dense layers with ReLU activation functions are added to learn a classifier on top of the
features extracted by VGG16.
The final Dense layer with a sigmoid activation function is added for binary classification
(cancerous or non-cancerous).
7. Model Compilation:
The model is compiled using the Adam optimizer, binary cross-entropy loss, and accuracy
as the metric.
8. Model Training:
The model is trained on the training data ( x_train and y_train ) for 10 epochs.
The testing data ( x_test and y_test ) is used for validation during training.
9. Summary:
The summary of the model, including layer names, types, and parameters, is printed at the
end of the code.
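
The parameter counts that model.summary() reports for the new layers can be worked out by hand. The sketch below assumes the standard VGG16 layout (five 2x2 max-pooling stages and 512 channels in the last convolutional block); the helper names are hypothetical.

```python
def pooled_size(size, n_pools=5):
    """Spatial size after n_pools 2x2 max-pooling stages (each halves, rounding down)."""
    for _ in range(n_pools):
        size //= 2
    return size


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


side = pooled_size(100)        # 100 -> 50 -> 25 -> 12 -> 6 -> 3
flat = side * side * 512       # length of the Flatten output
head = [
    dense_params(flat, 512),   # Dense(512)
    dense_params(512, 256),    # Dense(256)
    dense_params(256, 1),      # Dense(1)
]
trainable = sum(head)

print(side, flat, head, trainable)
```

Since the VGG16 layers are frozen, only these head parameters are updated during training, and almost all of them sit in the first Dense layer.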

The mathematical concepts involved include image preprocessing (resizing, normalization), data
splitting, transfer learning (using a pre-trained VGG16 model), neural network architecture (sequential
model with dense layers), activation functions (ReLU and sigmoid), and optimization (Adam optimizer,
binary cross-entropy loss). The training process involves forward and backward passes, parameter
updates, and evaluation on the validation set. The summary provides insights into the model
architecture and the number of parameters in each layer.
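
For reference, the sigmoid output and the binary cross-entropy loss mentioned above can be written out for a single sample. This is a plain-Python sketch of the formulas, not the Keras implementation; the clipping constant eps is an assumption added to avoid log(0).

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


for logit in (-2.0, 0.0, 2.0):
    p = sigmoid(logit)
    print(f"logit={logit:+.1f}  p={p:.3f}  loss(y=1)={binary_cross_entropy(1, p):.3f}")
```

The loss shrinks as the predicted probability moves toward the true label and grows sharply for confident wrong predictions, which is the signal the optimizer uses to update the dense layers.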
Actual Outcome:

Modification #10

1st Code Snippet:

import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler

urls = os.listdir("/kaggle/input/breast-histopathology-images/BreastCancer/train")
path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/" + urls[0]

def loadImages(path, urls, target):
    """Read each image under `path`, scale it to [0, 1] and resize it to 100x100."""
    images = []
    labels = []
    for i in range(len(urls)):
        img_path = path + "/" + urls[i]
        img = cv2.imread(img_path)
        img = img / 255.0
        img = cv2.resize(img, (100, 100))
        images.append(img)
        labels.append(target)
    images = np.asarray(images)
    return images, labels

Cancerous_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/1_Cancer"
Cancerous_urls = os.listdir(Cancerous_path)
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)

normal_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/0_NoCancer"
normal_urls = os.listdir(normal_path)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)

Cancerous_imgs = np.asarray(Cancerous_imgs)
normal_imgs = np.asarray(normal_imgs)

# Stack both classes into one image array and one label array
data = np.r_[Cancerous_imgs, normal_imgs]
targets = np.r_[Cancerous_targets, normal_targets]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, targets, test_size=0.25)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

from tensorflow.keras.applications import VGG16


from tensorflow.keras.applications.vgg16 import preprocess_input

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(100, 100, 3))

# Freeze the pre-trained convolutional layers
for layer in base_model.layers:
    layer.trainable = False

model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=32, epochs=10,
                    validation_data=(x_test, y_test))

model.summary()

2nd Code Snippet

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objs as go

import cv2
from matplotlib.image import imread

import tensorflow as tf
import keras
from keras.utils import to_categorical
from keras.preprocessing import image
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import glob
import PIL
import random

random.seed(100)

breast_imgs = glob.glob(
    '/kaggle/input/breast-histopathology-images/IDC_regular_ps50_idx5/**/*.png',
    recursive=True)

for imgname in breast_imgs[:5]:
    print(imgname)

non_cancer_imgs = []
cancer_imgs = []

# The class digit is the last character before the ".png" extension
for img in breast_imgs:
    if img[-5] == '0':
        non_cancer_imgs.append(img)
    elif img[-5] == '1':
        cancer_imgs.append(img)

non_cancer_num = len(non_cancer_imgs)  # No cancer
cancer_num = len(cancer_imgs)  # Cancer

total_img_num = non_cancer_num + cancer_num

print('Number of Images of no cancer: {}'.format(non_cancer_num))  # images of non-cancer
print('Number of Images of cancer : {}'.format(cancer_num))  # images of cancer
print('Total Number of Images : {}'.format(total_img_num))

plt.figure(figsize=(15, 15))

some_non_cancerous = np.random.randint(0, len(non_cancer_imgs), 18)
some_cancerous = np.random.randint(0, len(cancer_imgs), 18)

# Non-cancerous samples fill the odd subplot positions (1, 3, ..., 35)
s = 0
for num in some_non_cancerous:
    img = image.load_img((non_cancer_imgs[num]), target_size=(100, 100))
    img = image.img_to_array(img)
    plt.subplot(6, 6, 2*s+1)
    plt.axis('off')
    plt.title('no cancer')
    plt.imshow(img.astype('uint8'))
    s += 1

# Cancerous samples fill the even subplot positions (2, 4, ..., 36)
s = 1
for num in some_cancerous:
    img = image.load_img((cancer_imgs[num]), target_size=(100, 100))
    img = image.img_to_array(img)
    plt.subplot(6, 6, 2*s)
    plt.axis('off')
    plt.title('cancer')
    plt.imshow(img.astype('uint8'))
    s += 1

# Randomly sample images from two lists, 'non_cancer_imgs' and 'cancer_imgs'
some_non_img = random.sample(non_cancer_imgs, 70000)
some_can_img = random.sample(cancer_imgs, 65000)

non_img_arr = []
can_img_arr = []

for img_path in some_non_img:
    img = cv2.imread(img_path, cv2.IMREAD_COLOR)
    if img is not None and img.size > 0:
        resized_img = cv2.resize(img, (50, 50), interpolation=cv2.INTER_LINEAR)
        non_img_arr.append([resized_img, 0])
    else:
        print(f"Warning: Unable to read image at {img_path}")

for img_path in some_can_img:
    img = cv2.imread(img_path, cv2.IMREAD_COLOR)
    if img is not None and img.size > 0:
        resized_img = cv2.resize(img, (50, 50), interpolation=cv2.INTER_LINEAR)
        can_img_arr.append([resized_img, 1])
    else:
        print(f"Warning: Unable to read image at {img_path}")

# Convert lists to numpy arrays
non_img_arr = np.array(non_img_arr, dtype=object)
can_img_arr = np.array(can_img_arr, dtype=object)

breast_img_arr = np.concatenate((non_img_arr, can_img_arr))

X = []  # List for image data
y = []  # List for labels

# Shuffle the rows of 'breast_img_arr' randomly. A shuffled index list is used because
# random.shuffle applied directly to a 2-D NumPy array can duplicate rows.
order = list(range(len(breast_img_arr)))
random.shuffle(order)
breast_img_arr = breast_img_arr[order]

# Loop through each element (feature, label) in the shuffled 'breast_img_arr'
for feature, label in breast_img_arr:
    X.append(feature)
    y.append(label)

# Convert the lists 'X' and 'y' into NumPy arrays
X = np.array(X)
y = np.array(y)

print('X shape: {}'.format(X.shape))

# Split the dataset into training and testing sets, with a test size of 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)

# Define a rate (percentage) for subsampling the training data


rate = 0.5

# Calculate the number of samples to keep in the training data based on the rate
num = int(X.shape[0] * rate)

# Convert the categorical labels in 'y_train' and 'y_test' to one-hot encoded format
y_train = to_categorical(y_train, 2)
y_test = to_categorical(y_test, 2)

print('X_train shape : {}' .format(X_train.shape))


print('X_test shape : {}' .format(X_test.shape))
print('y_train shape : {}' .format(y_train.shape))
print('y_test shape : {}' .format(y_test.shape))

from keras.preprocessing.image import ImageDataGenerator

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

# Create data generators for training and testing
# (model.fit below is called on the raw arrays, so these generators are not used)
train_datagen = datagen.flow(X_train, y_train, batch_size=32)
test_datagen = datagen.flow(X_test, y_test, batch_size=32, shuffle=False)

# Define an EarlyStopping callback
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',  # Monitor the validation loss
    patience=15,         # Epochs with no improvement after which training is stopped
)

tf.random.set_seed(42)

# Create a Sequential model
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform',
                        padding='same', input_shape=(50, 50, 3)),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform',
                        padding='same'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform',
                        padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform',
                        padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Dropout(0.3),
    keras.layers.Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform',
                        padding='same'),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu', kernel_initializer='he_uniform'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(64, activation='relu', kernel_initializer='he_uniform'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(64, activation='relu', kernel_initializer='he_uniform'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(24, activation='relu', kernel_initializer='he_uniform'),
    keras.layers.Dense(2, activation='softmax')
])

# Display a summary of the model architecture
model.summary()

# Compile the model with Adam optimizer, binary cross-entropy loss, and accuracy metric
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train[:65000],
                    y_train[:65000],
                    validation_data=(X_test[:17000], y_test[:17000]),
                    epochs=5,
                    batch_size=50,
                    callbacks=[early_stopping])

model.evaluate(X_test,y_test)

Y_pred = model.predict(X_test[:10000])
Y_pred_classes = np.argmax(Y_pred,axis = 1)
Y_true = np.argmax(y_test[:10000],axis = 1)

confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)

f, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(confusion_mtx, annot=True, linewidths=0.01, cmap="BuPu", linecolor="gray",
            fmt='.1f', ax=ax)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

prediction = model.predict(X_test)
prediction

# Define a mapping of class indices to human-readable labels
class_labels = {
    0: 'Non-Cancerous',
    1: 'Cancerous',
}

# Define a function for plotting an image from an array
def img_plot(arr, index=0):
    # Set the title for the plot
    plt.title('Test Image')
    # Display the image at the specified index in the array
    plt.imshow(arr[index])

# Set the index value to 90
index = 90

# Plot an image from the X_test array using the img_plot function
img_plot(X_test, index)

# Extract a single image from X_test, keeping the batch dimension
# (named input_img to avoid shadowing the built-in input)
input_img = X_test[index:index+1]

# Make a prediction using the CNN model and get the class with the highest probability
predicted_class_index = model.predict(input_img)[0].argmax()

# Get the true label from the y_test array
true_class_index = y_test[index].argmax()

# Get the predicted and true labels
predicted_label = class_labels[predicted_class_index]
true_label = class_labels[true_class_index]

print('Predicted Diagnosis:', predicted_label)
print('True Diagnosis:', true_label)

1. Data Loading:
The second code snippet uses the glob module to retrieve image file paths recursively
from the specified directory
( '/kaggle/input/breast-histopathology-images/IDC_regular_ps50_idx5/**/*.png' ), whereas
the first code snippet uses the os.listdir function.

2. Data Preprocessing:
In the second code snippet, the code separates the images into two lists ( non_cancer_imgs
and cancer_imgs ) based on the class digit ('0' or '1') that ends the filename, just before
the .png extension. This is a different approach than the first code snippet, where cancerous
and non-cancerous images are loaded from separate directories.
3. Image Resizing:
The two snippets use different input sizes. The first resizes every image to (100, 100) with
cv2.resize after scaling pixel values to [0, 1]. The second resizes the training images to
(50, 50), also with cv2.resize , and leaves pixel values in the 0-255 range. Keras'
image.load_img and image.img_to_array functions are used only to load the preview images
at (100, 100) for plotting.


4. Model Architecture:
The CNN architecture in the second code snippet is different. It defines a model with
multiple convolutional layers, batch normalization, max-pooling, and dense layers. The first
code snippet uses a pre-trained VGG16 model for feature extraction.
5. Data Augmentation:
The second code snippet sets up data augmentation using the ImageDataGenerator from
Keras, which can create variations of the training data by applying random transformations
like rotation, shifting, and flipping. However, the resulting generators ( train_datagen and
test_datagen ) are never passed to model.fit , so training runs on the unaugmented arrays.
The first code snippet doesn't include any data augmentation.
6. Training Loop:
Both snippets train with model.fit . The second trains on a subset of the data
( X_train[:65000] ) for 5 epochs with early stopping as a callback. With patience=15 and
only 5 epochs, early stopping cannot trigger.
7. Evaluation and Visualization:
The second code snippet evaluates the model on the full test set with model.evaluate and
builds a confusion matrix from the first 10,000 test samples
( X_test[:10000] , y_test[:10000] ), along with accuracy and loss plots. The first code
snippet does not include these visualizations.
8. Prediction and Diagnosis:
The second code snippet includes an example of making predictions on a single image from
the test set and printing the predicted and true labels.
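
The second snippet reads the label from a single character of the path ( img[-5] ). A slightly more explicit alternative is to parse the whole file name according to the u_xX_yY_classC.png convention described in the dataset overview. The sketch below is an assumption-laden illustration: the function name parse_patch_name, the regular expression, and the directory in the example path are hypothetical.

```python
import os
import re

# patient id, x and y crop coordinates, and class digit, per the naming convention
PATCH_NAME = re.compile(
    r"^(?P<patient>.+)_x(?P<x>\d+)_y(?P<y>\d+)_class(?P<label>[01])\.png$"
)


def parse_patch_name(path):
    """Return (patient_id, x, y, label) parsed from a patch file path."""
    match = PATCH_NAME.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"Unexpected patch file name: {path}")
    return (match["patient"], int(match["x"]), int(match["y"]), int(match["label"]))


# Hypothetical directory; the file name is the example from the dataset description
example = "/some/dir/10253_idx5_x1351_y1101_class0.png"
print(parse_patch_name(example))
```

Parsing the full name fails loudly on files that do not follow the convention and also exposes the patient ID, which the character-based check ignores.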
Actual Outcome:
