ML Paper - Breast Cancer Model
The original dataset consisted of 162 whole mount slide images of Breast Cancer (BCa) specimens
scanned at 40x. From these, 277,524 patches of size 50 x 50 were extracted (198,738 IDC negative
and 78,786 IDC positive). Each patch's file name has the format u_xX_yY_classC.png, for example
10253_idx5_x1351_y1101_class0.png, where u is the patient ID (10253_idx5), X is the x-coordinate
of where the patch was cropped from, Y is the y-coordinate of where the patch was cropped from,
and C indicates the class, where 0 is non-IDC and 1 is IDC.
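The naming scheme above can be unpacked mechanically. The following is a minimal sketch, assuming every patch name follows the u_xX_yY_classC.png pattern exactly; the names PATCH_NAME, PatchInfo, and parse_patch_name are hypothetical helpers and not part of the original code.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for names such as 10253_idx5_x1351_y1101_class0.png
PATCH_NAME = re.compile(
    r"^(?P<patient>.+)_x(?P<x>\d+)_y(?P<y>\d+)_class(?P<label>[01])\.png$"
)


class PatchInfo(NamedTuple):
    patient_id: str  # u, e.g. "10253_idx5"
    x: int           # x-coordinate of the crop
    y: int           # y-coordinate of the crop
    label: int       # 0 = non-IDC, 1 = IDC


def parse_patch_name(file_name: str) -> PatchInfo:
    """Split a patch file name into patient ID, coordinates, and class."""
    match = PATCH_NAME.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected patch name: {file_name}")
    return PatchInfo(
        patient_id=match["patient"],
        x=int(match["x"]),
        y=int(match["y"]),
        label=int(match["label"]),
    )


info = parse_patch_name("10253_idx5_x1351_y1101_class0.png")
print(info)
```

Keeping the patient ID available is useful later, for example to check that patches from one patient do not end up in both the training and the test set.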
Code:
import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler
urls = os.listdir("/kaggle/input/breast-histopathology-images/BreastCancer/train")
path = os.path.join("/kaggle/input/breast-histopathology-images/BreastCancer/train", urls[0])
# Load the cancerous images (label 1)
Cancerous_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/1_Cancer"
Cancerous_urls = os.listdir(Cancerous_path)
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)
# Load the non-cancerous images (label 0)
normal_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/0_NoCancer"
normal_urls = os.listdir(normal_path)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)
Cancerous_imgs = np.asarray(Cancerous_imgs)
normal_imgs = np.asarray(normal_imgs)
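import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# Pre-trained VGG16 without its classification head; all of its layers are frozen
base_model = tf.keras.applications.VGG16(
    weights='imagenet', include_top=False, input_shape=(100, 100, 3)
)
base_model.trainable = False
model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])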
1. Data Loading:
Images from two directories, one containing cancerous images (1_Cancer) and the other
non-cancerous images (0_NoCancer), are loaded using the loadImages function.
The images are resized to a fixed size of (100, 100).
The pixel values are normalized to the range [0, 1].
2. Train-Test Split:
The dataset is split into training and testing sets using the train_test_split function from
scikit-learn.
3. Model Architecture:
The VGG16 pre-trained model is loaded from Keras applications. The top (classification)
layer is excluded ( include_top=False ), and the input shape is set to (100, 100, 3).
All layers in the pre-trained model are set to non-trainable.
A custom sequential model is created by adding the VGG16 base model, followed by a
Flatten layer to flatten the output, and three Dense layers for classification.
The Dense layers have 512, 256, and 1 neurons, respectively, with 'relu' activation functions
for the first two and a 'sigmoid' activation function for the last layer (binary classification).
The model is compiled using the Adam optimizer, binary crossentropy loss function, and
accuracy as the evaluation metric.
4. Model Training:
The model is trained using the training data ( x_train and y_train ) for 5 epochs with a
batch size of 32. Validation data ( x_test and y_test ) is used to monitor the model's
performance during training.
5. Model Summary:
The model.summary() function is used to display a summary of the model architecture,
including layer types, output shapes, and the number of trainable parameters.
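Steps 1 and 2 can be illustrated on toy data. The sketch below is a NumPy-only stand-in for the scikit-learn train_test_split call mentioned above, not the original code; it assumes the images arrive as raw uint8 arrays, and the function name combine_and_split, the 20% test fraction, and the random toy images are illustrative assumptions.

```python
import numpy as np


def combine_and_split(pos_imgs, neg_imgs, test_fraction=0.2, seed=42):
    """Stack both classes, scale pixels to [0, 1], and shuffle-split."""
    x = np.concatenate([pos_imgs, neg_imgs]).astype("float32") / 255.0
    y = np.concatenate([
        np.ones(len(pos_imgs), dtype="int64"),   # 1 = cancerous
        np.zeros(len(neg_imgs), dtype="int64"),  # 0 = non-cancerous
    ])
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Toy stand-ins for the loaded 100 x 100 RGB images
toy_rng = np.random.default_rng(0)
pos = toy_rng.integers(0, 256, size=(6, 100, 100, 3), dtype=np.uint8)
neg = toy_rng.integers(0, 256, size=(14, 100, 100, 3), dtype=np.uint8)

x_train, x_test, y_train, y_test = combine_and_split(pos, neg)
print(x_train.shape, x_test.shape)
```

With class counts as unbalanced as in this dataset, a stratified split that keeps the class ratio equal in both parts may be preferable to the plain shuffle used here.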
Modification #1:
Feature Concatenation: Instead of using a single pre-trained model, use multiple pre-trained models
and concatenate their extracted features before feeding them into dense layers.
merged_model.compile(optimizer=Adam(learning_rate=0.001),
loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
Assume we have N pre-trained models, and each model i is denoted by $f_i(X; \Theta_i)$, where X is the
input tensor and $\Theta_i$ represents the set of parameters of the i-th model. Each $f_i$ outputs a feature
tensor $H^{(i)} = f_i(X; \Theta_i)$.
Let C be the operation of concatenation. The concatenated feature tensor $H^{(\text{concat})}$ is defined as:
$H^{(\text{concat})} = C(H^{(1)}, H^{(2)}, \ldots, H^{(N)})$
Here, C joins the feature tensors $H^{(1)}, H^{(2)}, \ldots, H^{(N)}$ along a specific axis.
Now, let $W^{(\text{concat})}$ be the weight matrix and $b^{(\text{concat})}$ be the bias vector of the first dense
layer applied to the concatenated features.
The concatenated features are then passed through dense layers with rectified linear
unit (ReLU) activation functions; for the first dense layer:
$Z = \mathrm{ReLU}(W^{(\text{concat})} H^{(\text{concat})} + b^{(\text{concat})})$
In summary, the entire process involves concatenating features from multiple pre-trained models,
passing them through dense layers, and producing the final classification probability.
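The concatenation step can be sketched numerically. The example below is a minimal NumPy illustration, not the merged Keras model itself: the two feature matrices stand in for the outputs of two pre-trained models, and the feature sizes (8 and 5), the hidden width (6), and the random weights are arbitrary assumptions.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
batch = 4

# Stand-ins for the flattened outputs H(1) and H(2) of two pre-trained models
h1 = rng.normal(size=(batch, 8))
h2 = rng.normal(size=(batch, 5))

# Concatenate along the feature axis: (4, 8) and (4, 5) become (4, 13)
h_concat = np.concatenate([h1, h2], axis=1)

# Dense layer with ReLU on the concatenated features
w_concat = rng.normal(size=(13, 6)) * 0.1
b_concat = np.zeros(6)
z = relu(h_concat @ w_concat + b_concat)

# Final sigmoid unit producing the classification probability
w_out = rng.normal(size=(6, 1)) * 0.1
b_out = np.zeros(1)
p = sigmoid(z @ w_out + b_out)

print(h_concat.shape, p.shape)
```

Only the feature axis may differ between the models; the batch axis must match, which is why flattened or globally pooled outputs are the usual inputs to the concatenation.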
Actual Outcome:
Several factors can contribute to a decrease in accuracy. Here are some possible reasons:
1. Feature Mismatch:
Different pre-trained models might capture different aspects or representations of the data.
If the features extracted from different models are not well-aligned or complementary,
concatenating them might introduce noise or conflicting information.
2. Model Complexity:
Combining features from multiple pre-trained models increases the overall complexity of the
model. If the dataset is not large enough, or if the models are not fine-tuned appropriately,
the increased complexity may lead to overfitting.
3. Dimensionality Mismatch:
The feature dimensions from different pre-trained models might not be compatible for
concatenation. Ensure that the features extracted from each model have the same or
compatible dimensions before concatenating.
4. Training Data Size:
If the dataset is small, training a complex model with concatenated features may lead to
poor generalization. Pre-trained models are typically trained on large datasets, and using
them in a concatenation approach might not be beneficial if your dataset is limited.
5. Learning Rate and Training Strategy:
When using a more complex model, it's essential to adjust the learning rate and training
strategy accordingly. A higher learning rate or inadequate training strategy might result in
suboptimal convergence.
6. Computational Resources:
Training a model with concatenated features from multiple pre-trained models requires
more computational resources. If the hardware limitations are reached, it might affect the
convergence of the model during training.
7. Feature Redundancy:
If the features extracted by different pre-trained models contain redundant information,
concatenating them may not bring additional discriminative power. It's crucial to analyze the
characteristics of the features and ensure they provide complementary information.
8. Hyperparameter Tuning:
The architecture and hyperparameters of the dense layers following the concatenated
features need to be tuned appropriately. The choice of activation functions, layer sizes, and
regularization techniques can significantly impact the model's performance.
Modification #2
Here, $D_i$ is the test set in the i-th iteration, and the model is trained on the union of the remaining folds.
This process helps in obtaining a more robust performance estimate by ensuring that the model is
evaluated on different subsets of the data, reducing the impact of the randomness introduced by a
single train-test split.
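The fold rotation described here can be sketched as follows, assuming the modification is standard k-fold cross-validation. This is a NumPy-only illustration; the helper name k_fold_indices and the toy sizes are assumptions, and scikit-learn's KFold provides the same behaviour.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; fold i is the test set D_i."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        # Train on the union of all remaining folds
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {i}: train={sorted(train_idx.tolist())}, test={sorted(test_idx.tolist())}")
```

The reported score is then the mean of the k per-fold accuracies, with a freshly initialised model trained for each fold.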
Actual Outcome
There are several reasons why you might observe an increase in accuracy compared to a single train-
test split:
Modification #3:
Leave-One-Out Cross-Validation (LOOCV):
A special case of k-fold cross-validation where k is equal to the number of data points in the
dataset. In each iteration, a single data point is used as the test set, and the model is trained on
the remaining data.
Let's break down the steps and understand the math behind LOOCV:
LOOCV is considered a thorough cross-validation method because it ensures that each data point is
used as both a training and test example. However, it can be computationally expensive, especially for
large datasets, as it requires training the model N times. The choice of LOOCV or other cross-
validation methods depends on the specific characteristics of the dataset and the computational
resources available.
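The LOOCV procedure can be sketched on a small example. The block below uses a nearest-centroid classifier in place of the VGG16 model so that it runs in milliseconds; the classifier, the helper names, and the six toy points are illustrative assumptions, not part of the original experiment.

```python
import numpy as np


def nearest_centroid_predict(x_train, y_train, x_test):
    """Assign each test point to the class with the closest mean."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    distances = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]


def loocv_accuracy(x, y):
    """Train N times, each time holding out exactly one data point."""
    n = len(x)
    correct = 0
    for i in range(n):
        train_mask = np.ones(n, dtype=bool)
        train_mask[i] = False  # point i is the single test example
        pred = nearest_centroid_predict(x[train_mask], y[train_mask], x[i:i + 1])
        correct += int(pred[0] == y[i])
    return correct / n


x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

accuracy = loocv_accuracy(x, y)
print(f"LOOCV accuracy: {accuracy}")
```

The loop makes the cost explicit: N data points require N separate training runs, which is why LOOCV is rarely practical for a deep network on a dataset of this size.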
import os
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from sklearn.model_selection import LeaveOneOut
path = "/kaggle/input/breast-histopathology-images/BreastCancer/train"
Cancerous_path = os.path.join(path, "1_Cancer")
normal_path = os.path.join(path, "0_NoCancer")
Cancerous_urls = os.listdir(Cancerous_path)
normal_urls = os.listdir(normal_path)
# Initialize LOOCV
loo = LeaveOneOut()
# Initialize model
base_model = tf.keras.applications.VGG16(
    weights='imagenet', include_top=False, input_shape=(100, 100, 3)
)
model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
# Perform LOOCV (data and targets hold the combined images and labels)
for train_index, test_index in loo.split(data):
    x_train, x_test = data[train_index], data[test_index]
    y_train, y_test = targets[train_index], targets[test_index]
Actual Outcome:
The observed increase in accuracy when using LOOCV compared to a single train-test split may be
attributed to several factors:
Modification #4
Bootstrap Cross-Validation:
Uses bootstrapped samples (randomly sampled with replacement) as training and test sets. It's
particularly useful when dealing with limited data.
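from sklearn.utils import resample
accuracy_scores = []
# Perform Bootstrap Cross-Validation
for i in range(n_bootstrap_samples):
    # Generate a bootstrap sample; the seed changes per iteration so that
    # each iteration draws a different sample
    bootstrap_data, bootstrap_targets = resample(data, targets, random_state=i)
    # ... (train on the bootstrap sample and append the test accuracy to accuracy_scores)
# Calculate and print the average accuracy over all bootstrap samples
average_accuracy = np.mean(accuracy_scores)
print(f'Average Accuracy: {average_accuracy}')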
The Bootstrap Cross-Validation is a resampling technique that involves repeatedly drawing samples
(with replacement) from the dataset to create multiple training and testing sets. The key idea is to
mimic the process of repeatedly collecting new samples, as if you were conducting multiple
experiments.
1. Dataset: Let D be the original dataset with N data points, where
$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x_i$ represents the features of the i-th data point,
and $y_i$ is the corresponding label.
2. Bootstrap Sampling: In each iteration of the Bootstrap Cross-Validation loop, a new dataset
$D'$ is created by randomly sampling N data points with replacement from the original dataset.
This means that some data points may be repeated, while others may be left out. The left-out
(out-of-bag) points, about 36.8% of D on average for large N, can serve as the test set.
The goal is to get a more robust estimate of the model's performance by simulating the process of
collecting new samples from the underlying population. Bootstrap Cross-Validation is particularly
useful when dealing with limited data, providing a way to assess the stability and reliability of a
machine learning model.
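The sampling step can be sketched with index arrays. The example below is a NumPy-only illustration that assumes the out-of-bag points (those not drawn into the bootstrap sample) are used as the test set, a common convention that the original code does not confirm; the helper name bootstrap_splits and the toy sizes are assumptions.

```python
import numpy as np


def bootstrap_splits(n_samples, n_rounds, seed=0):
    """Yield (train_idx, test_idx): a bootstrap sample and its out-of-bag points."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_samples)
    for _ in range(n_rounds):
        # Draw N indices with replacement: duplicates are expected
        train_idx = rng.integers(0, n_samples, size=n_samples)
        # Points never drawn form the out-of-bag test set
        test_idx = np.setdiff1d(all_idx, train_idx)
        yield train_idx, test_idx


splits = list(bootstrap_splits(n_samples=20, n_rounds=5))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Round {i}: {len(np.unique(train_idx))} unique training points, "
          f"{len(test_idx)} out-of-bag test points")
```

A single generator is advanced across rounds, so each round draws a different sample; re-seeding with the same value inside the loop would repeat one sample over and over.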
Actual Outcome:
Accuracy: 95.6%
The observed increase in accuracy when using Bootstrap Cross-Validation compared to a single train-
test split may be attributed to several factors:
Modification #5
A combination of stratified k-fold and time series cross-validation. It ensures both class balance
and temporal order preservation in time series data.
import os
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from sklearn.model_selection import StratifiedKFold
# Load images function
def loadImages(path, urls, target):
    ...  # (unchanged)
Cancerous_urls = os.listdir(Cancerous_path)
normal_urls = os.listdir(normal_path)
# Load data
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)
# Initialize model
base_model = tf.keras.applications.VGG16(
    weights='imagenet', include_top=False, input_shape=(100, 100, 3)
)
model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
StratifiedKFold Splitting:
The dataset D is split into K folds using the StratifiedKFold method, where K is the number of folds.
The StratifiedKFold algorithm ensures that each fold maintains approximately the same distribution
of class labels as the original dataset, preserving class balance.
In each iteration of the StratifiedKFold loop, K−1 folds are used for training ($D_{\text{train}}$) and the
remaining fold is used for testing ($D_{\text{test}}$).
The process is repeated K times, so that each fold serves as the test set exactly once.
While preserving class balance, the StratifiedKFold method does not explicitly consider the
temporal order of the data.
To ensure temporal order preservation in time series data, it's crucial to arrange the data in a way
that respects the temporal sequence:
$x_i$ should correspond to a time point that precedes $x_j$ if $i < j$.
The model's performance is then evaluated on the testing set to assess its generalization
capability.
Model Summary:
After completing all iterations, the final model parameters and performance metrics are
summarized.
This process ensures both class balance and temporal order preservation in the cross-validation
of time series data, providing a comprehensive evaluation of the model's performance.
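The class-balance property of stratified splitting can be checked on a toy label vector. The sketch below is a NumPy-only stand-in for scikit-learn's StratifiedKFold, written to show the mechanism; the helper name stratified_k_fold and the 15:5 class ratio are illustrative assumptions.

```python
import numpy as np


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) so that each fold keeps the class ratio."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        # Shuffle the indices of one class and deal them out across folds
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[cls_idx] = np.arange(len(cls_idx)) % k
    for fold in range(k):
        test_idx = np.flatnonzero(fold_of == fold)
        train_idx = np.flatnonzero(fold_of != fold)
        yield train_idx, test_idx


labels = np.array([0] * 15 + [1] * 5)  # 75% negative, 25% positive
splits = list(stratified_k_fold(labels, k=5))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {i}: test size={len(test_idx)}, positives in test={labels[test_idx].sum()}")
```

Because the indices are shuffled within each class, this kind of split does not by itself preserve any temporal order; that has to be handled separately.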
Actual Outcome:
Accuracy: 94.7%
Modification #6
Custom Time Series Splitting: Instead of relying solely on StratifiedKFold, you might create a custom
time series splitting approach to ensure that each fold respects the temporal order of observations.
This can be done by specifying a custom splitting function, considering the time variable in your
dataset.
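import numpy as np
def time_series_split(X, y, n_splits):
    # Sort the row positions by the time variable
    sorted_indices = np.argsort(X['time_column'].to_numpy())
    fold_size = len(X) // n_splits
    for i in range(n_splits):
        test_start = i * fold_size
        test_end = (i + 1) * fold_size if i < n_splits - 1 else len(X)
        test_index = sorted_indices[test_start:test_end]
        train_index = np.concatenate([sorted_indices[:test_start],
                                      sorted_indices[test_end:]])
        yield train_index, test_index
# Example usage:
n_splits = 5
for train_index, test_index in time_series_split(X, y, n_splits):
    print(f"Train Index: {train_index}, Test Index: {test_index}")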
Objective:
The goal is to create a custom time series splitting function that generates train and test indices
ensuring the temporal order of observations.
Key Steps:
1. Sorting by Time Variable:
Given the dataset D with features X and labels y, we first sort the indices based on a time
variable. Let's denote this time variable as T.
If X is a DataFrame, the sorting operation can be expressed as:
$\text{sorted\_indices} = \operatorname{argsort}(X[\text{'time\_column'}])$
2. Determining Fold Sizes:
Next, we determine the size of each fold, denoted as fold_size, which is the total
number of observations divided by the number of splits, rounded down to an integer:
$\text{fold\_size} = \lfloor \operatorname{len}(X) / \text{n\_splits} \rfloor$
3. Iterating Over Splits:
We iterate over the number of splits (K) to generate train and test indices for each split.
For each iteration (i):
The start and end indices for the test set (test_start, test_end) are
determined based on the fold size.
The test set indices (test_index) are extracted from the sorted indices.
The training set indices (train_index) are formed by concatenating the
indices before the test set and after the test set.
4. Yielding Train and Test Indices:
The function yields a tuple (train_index, test_index) for each split.
Mathematical Representation:
1. Sorted Indices:
$\text{sorted\_indices} = \operatorname{argsort}(X[\text{'time\_column'}])$
2. Fold Size Calculation:
$\text{fold\_size} = \lfloor \operatorname{len}(X) / \text{n\_splits} \rfloor$
3. Test Set Indices for Split i:
$\text{test\_start}_i = i \times \text{fold\_size}$
$\text{test\_end}_i = (i+1) \times \text{fold\_size}$ if $i < \text{n\_splits} - 1$, else $\operatorname{len}(X)$
$\text{test\_index}_i = \text{sorted\_indices}[\text{test\_start}_i : \text{test\_end}_i]$
4. Training Set Indices for Split i:
$\text{train\_index}_i = \operatorname{concatenate}(\text{sorted\_indices}[: \text{test\_start}_i],\ \text{sorted\_indices}[\text{test\_end}_i :])$
5. Yielding Train and Test Indices: yield $(\text{train\_index}_i, \text{test\_index}_i)$
Summary:
This custom time series splitting function sorts the observations by time, so each test fold is a
contiguous block in temporal order; the training set consists of all observations before and after that block.
Actual Outcome:
Accuracy: 93.3%
Modification #7
Sliding Window Approach:
Implement a sliding window approach for training and testing sets. Instead of discrete folds, a
fixed-size test window slides along the time axis, so consecutive test windows overlap whenever
the step size is smaller than the window size. This is particularly useful when
there is a gradual change in patterns over time.
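import pandas as pd
# Function name assumed by analogy with expanding_window_split
def sliding_window_split(X, y, window_size, step_size):
    """
    Generate train/test indices with a sliding test window.

    Parameters:
    - X: DataFrame, features
    - y: Series, labels
    - window_size: int, size of the sliding window
    - step_size: int, step size for moving the window
    Returns:
    - List of tuples: (train_index, test_index) for each iteration
    """
    n = len(X)
    train_test_splits = []
    for i in range(0, n - window_size + 1, step_size):
        train_start, train_end = 0, i
        test_start, test_end = i, i + window_size
        train_index = list(range(train_start, train_end))
        test_index = list(range(test_start, test_end))
        train_test_splits.append((train_index, test_index))
    return train_test_splits
# Example Usage:
window_size = 5
step_size = 2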
Basic Definitions:
X: Feature DataFrame
y: Label Series
n: Total number of observations in the time series
w: Window size (window_size)
s: Step size (step_size)
Mathematical Explanation:
1. Iterations:
The total number of iterations (N) is given by $N = \lfloor (n - w)/s \rfloor + 1$, so the window
slides from the start of the series to the last position where a full window still fits.
2. Train-Test Splits:
For each iteration (i), the training set ($T_i$) consists of observations from the beginning up to
the current window's starting point, and the testing set ($S_i$) consists of observations within
the window.
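Example:
Let's consider a simple example with n=10, w=3, and s=1. This gives N = (10 − 3)/1 + 1 = 8 iterations:
the first has an empty training set and tests on {0, 1, 2}, and the last trains on {0, ..., 6} and tests on {7, 8, 9}.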
This approach is beneficial when there is a gradual change in patterns over time, allowing the model
to learn from different segments of the time series. Adjust the window size and step size based on the
characteristics of your time series data.
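The example with n=10, w=3, and s=1 can be enumerated directly. The sketch below works on plain integer positions rather than a DataFrame; the helper name sliding_window_splits is an assumption, and the logic follows the train/test definition given above (train on everything before the window, test on the window).

```python
def sliding_window_splits(n, window_size, step_size):
    """Return (train_index, test_index) pairs for a sliding test window."""
    splits = []
    for start in range(0, n - window_size + 1, step_size):
        train_index = list(range(0, start))                    # everything before the window
        test_index = list(range(start, start + window_size))   # the window itself
        splits.append((train_index, test_index))
    return splits


splits = sliding_window_splits(n=10, window_size=3, step_size=1)
for i, (train_index, test_index) in enumerate(splits, start=1):
    print(f"Iteration {i}: train={train_index}, test={test_index}")
```

The first iteration has an empty training set; in practice the loop would usually start at a position that leaves a minimum amount of training data.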
Actual Outcome:
Accuracy: 96.7%
Modification #8
Expanding Window Approach:
Similar to the sliding window, but the training set includes all data points up to the current test
set, capturing historical information. This can be useful when the model's performance benefits
from a larger training set.
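import pandas as pd
def expanding_window_split(X, y):
    """
    Generate train/test indices with an expanding training window.

    Parameters:
    - X: DataFrame, features
    - y: Series, labels
    Returns:
    - List of tuples: (train_index, test_index) for each iteration
    """
    n = len(X)
    train_test_splits = []
    for i in range(1, n):
        train_index = list(range(0, i))  # all observations before the test point
        test_index = [i]
        train_test_splits.append((train_index, test_index))
    return train_test_splits
# Example Usage:
# Assuming X is your feature DataFrame and y is your label Series
splits_expanding = expanding_window_split(X, y)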
For each iteration (i), the training set ($T_i$) includes all data points up to the current test set
($S_i$):
$T_i = \{0, 1, 2, \ldots, i-1\}$
$S_i = \{i\}$
Example:
Let's consider a simple example with n=5.
Iteration 1: T1={0},S1={1}
Iteration 2: T2={0,1},S2={2}
Iteration 3: T3={0,1,2},S3={3}
... and so on.
This approach is useful when the model's performance benefits from a larger training set that includes
historical information. Adjustments can be made based on specific requirements and characteristics of
your time series data.
Actual Outcome:
Modification #9
Temporal Convolutional Networks (TCN):
TCNs are deep learning architectures designed for sequential data. They utilize dilated
convolutions to capture long-range dependencies and can be effective for time series tasks.
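The dilated convolution at the core of a TCN can be sketched in a few lines. The example below is a NumPy illustration of a single causal 1D convolution with a configurable dilation, not a full TCN and not part of the original code; the function name dilated_causal_conv1d and the toy signal are assumptions.

```python
import numpy as np


def dilated_causal_conv1d(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], with zeros before t = 0."""
    x = np.asarray(x, dtype=float)
    k = len(weights)
    pad = (k - 1) * dilation
    # Left-pad with zeros so the output at time t never sees future values
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x)
    for t in range(len(x)):
        for j in range(k):
            y[t] += weights[j] * padded[pad + t - j * dilation]
    return y


x = np.arange(1.0, 9.0)  # the sequence 1, 2, ..., 8
y_d1 = dilated_causal_conv1d(x, weights=[1.0, 1.0], dilation=1)  # x[t] + x[t-1]
y_d2 = dilated_causal_conv1d(x, weights=[1.0, 1.0], dilation=2)  # x[t] + x[t-2]
print(y_d1)
print(y_d2)
```

Stacking such layers with kernel size 2 and dilations 1, 2, and 4 gives a receptive field of 1 + (1 + 2 + 4) = 8 time steps, which is how a TCN reaches long-range dependencies with few layers.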
import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler
urls = os.listdir("/kaggle/input/breast-histopathology-images/BreastCancer/train")
path = os.path.join("/kaggle/input/breast-histopathology-images/BreastCancer/train", urls[0])
Cancerous_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/1_Cancer"
Cancerous_urls = os.listdir(Cancerous_path)
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)
normal_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/0_NoCancer"
normal_urls = os.listdir(normal_path)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)
Cancerous_imgs = np.asarray(Cancerous_imgs)
normal_imgs = np.asarray(normal_imgs)
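import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# Pre-trained VGG16 without its classification head; all of its layers are frozen
base_model = tf.keras.applications.VGG16(
    weights='imagenet', include_top=False, input_shape=(100, 100, 3)
)
base_model.trainable = False
model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])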
model.summary()
The key components of the code, and the mathematical concepts involved, are as follows:
1. Data Loading:
The code loads breast histopathology images from two directories: 1_Cancer and
0_NoCancer .
Images are resized to (100, 100) pixels and normalized by dividing by 255.0.
2. Data Preparation:
Images and corresponding labels (targets) are loaded using the loadImages function.
Cancerous_imgs and normal_imgs are numpy arrays containing the images.
Cancerous_targets and normal_targets are numpy arrays containing the corresponding
labels (1 for cancerous and 0 for non-cancerous).
3. Train-Test Split:
The dataset is split into training and testing sets using train_test_split from scikit-learn.
4. VGG16 Model Initialization:
The code uses a pre-trained VGG16 model from Keras applications. The model is loaded
with weights pre-trained on ImageNet data.
The last classification layer is removed ( include_top=False ), and the input shape is set to
(100, 100, 3).
5. Freezing Pre-trained Layers:
The layers of the pre-trained VGG16 model are set to non-trainable.
6. Sequential Model Construction:
A new sequential model is created.
The pre-trained VGG16 model is added as the first layer of the new model.
A Flatten layer is added to convert the output to a 1D tensor.
Dense layers with ReLU activation functions are added for feature extraction.
The final Dense layer with a sigmoid activation function is added for binary classification
(cancerous or non-cancerous).
7. Model Compilation:
The model is compiled using the Adam optimizer, binary cross-entropy loss, and accuracy
as the metric.
8. Model Training:
The model is trained on the training data ( x_train and y_train ) for 10 epochs.
The testing data ( x_test and y_test ) is used for validation during training.
9. Summary:
The summary of the model, including layer names, types, and parameters, is printed at the
end of the code.
The mathematical concepts involved include image preprocessing (resizing, normalization), data
splitting, transfer learning (using a pre-trained VGG16 model), neural network architecture (sequential
model with dense layers), activation functions (ReLU and sigmoid), and optimization (Adam optimizer,
binary cross-entropy loss). The training process involves forward and backward passes, parameter
updates, and evaluation on the validation set. The summary provides insights into the model
architecture and the number of parameters in each layer.
Actual Outcome:
Modification #10
import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler
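urls = os.listdir("/kaggle/input/breast-histopathology-images/BreastCancer/train")
path = os.path.join("/kaggle/input/breast-histopathology-images/BreastCancer/train", urls[0])
Cancerous_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/1_Cancer"
Cancerous_urls = os.listdir(Cancerous_path)
Cancerous_imgs, Cancerous_targets = loadImages(Cancerous_path, Cancerous_urls, 1)
normal_path = "/kaggle/input/breast-histopathology-images/BreastCancer/train/0_NoCancer"
normal_urls = os.listdir(normal_path)
normal_imgs, normal_targets = loadImages(normal_path, normal_urls, 0)
Cancerous_imgs = np.asarray(Cancerous_imgs)
normal_imgs = np.asarray(normal_imgs)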
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model = Sequential([
    base_model,
    Flatten(),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
model.summary()
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objs as go
import cv2
from matplotlib.image import imread
import tensorflow as tf
import keras
from keras.utils import to_categorical
from keras.preprocessing import image
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import glob
import PIL
import random
random.seed(100)
breast_imgs = glob.glob('/kaggle/input/breast-histopathology-images/IDC_regular_ps50_idx5/**/*.png', recursive = True)
non_cancer_imgs = []
cancer_imgs = []
for img in breast_imgs:
    # The character before ".png" is the class label (0 or 1)
    if img[-5] == '0' :
        non_cancer_imgs.append(img)
    elif img[-5] == '1' :
        cancer_imgs.append(img)
s = 0
for num in some_non_cancerous:
    # ... (load the image at path num into img)
    plt.subplot(6, 6, 2*s+1)
    plt.axis('off')
    plt.title('no cancer')
    plt.imshow(img.astype('uint8'))
    s += 1
s = 1
non_img_arr = []
can_img_arr = []
# Split the dataset into training and testing sets, with a test size of 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
random_state=42)
# Calculate the number of samples to keep in the training data based on the rate
num = int(X.shape[0] * rate)
# Convert the categorical labels in 'y_train' and 'y_test' to one-hot encoded format
y_train = to_categorical(y_train, 2)
y_test = to_categorical(y_test, 2)
tf.random.set_seed(42)
model = keras.Sequential([
    # ... (convolutional, batch normalization, max-pooling, and dense layers)
])
# Compile the model with Adam optimizer, binary cross-entropy loss, and accuracy metric
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train[:65000],
                    y_train[:65000],
                    validation_data = (X_test[:17000], y_test[:17000]),
                    epochs = 5,
                    batch_size = 50,
                    callbacks=[early_stopping])
model.evaluate(X_test,y_test)
Y_pred = model.predict(X_test[:10000])
Y_pred_classes = np.argmax(Y_pred,axis = 1)
Y_true = np.argmax(y_test[:10000],axis = 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
prediction = model.predict(X_test)
prediction
# Plot an image from the X_test array using the img_plot function
img_plot(X_test, index)
# Make a prediction using the CNN model and get the class with the highest probability
predicted_class_index = model.predict(input)[0].argmax()
1. Data Loading:
The second code snippet uses the glob module to retrieve image file paths recursively
from the specified directory ( '/kaggle/input/breast-histopathology-
images/IDC_regular_ps50_idx5/**/*.png' ), whereas the first code snippet uses the
os.listdir function.
2. Data Preprocessing:
In the second code snippet, the code separates the images into two lists ( non_cancer_imgs
and cancer_imgs ) based on the presence of '0' or '1' in the filename. This is a different
approach than the first code snippet where cancerous and non-cancerous images are
loaded from separate directories.
3. Image Resizing:
The first snippet resizes images to (100, 100), matching the VGG16 input shape, while the second
snippet resizes them to (50, 50) before they are fed into the neural network.
In the second code snippet, the resizing is done using OpenCV ( cv2.resize ),
while in the first code snippet, it's done using Keras' image.load_img and
image.img_to_array functions.
4. Model Architecture:
The CNN architecture in the second code snippet is different. It defines a model with
multiple convolutional layers, batch normalization, max-pooling, and dense layers. The first
code snippet uses a pre-trained VGG16 model for feature extraction.
5. Data Augmentation:
The second code snippet includes data augmentation using the ImageDataGenerator from
Keras, which can create variations of the training data by applying random transformations
like rotation, shifting, and flipping. The first code snippet doesn't include explicit data
augmentation.
6. Training Loop:
The training loop in the second code snippet is also different. It uses model.fit to train the
model on a subset of the data, and early stopping is incorporated as a callback.
7. Evaluation and Visualization:
The second code snippet evaluates the model on a subset of the test data
( X_test[:10000] , y_test[:10000] ) and includes the creation of confusion matrix plots,
accuracy plots, and loss plots. The first code snippet does not include these visualizations.
8. Prediction and Diagnosis:
The second code snippet includes an example of making predictions on a single image from
the test set and printing the predicted and true labels.
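The evaluation step in item 7 can be sketched on a handful of predictions. The example below is a NumPy-only illustration of turning two-column softmax-style outputs and one-hot labels into class indices and a 2 x 2 confusion matrix; the toy probabilities and the helper name confusion_matrix_2x2 are assumptions, and scikit-learn's confusion_matrix (imported in the second snippet) gives the same layout.

```python
import numpy as np


def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


# Toy model outputs (one row per image, one column per class)
y_prob = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4],
                   [0.3, 0.7],
                   [0.7, 0.3]])
# One-hot encoded true labels, as produced by to_categorical(y, 2)
y_onehot = np.array([[1, 0],
                     [0, 1],
                     [0, 1],
                     [0, 1],
                     [1, 0]])

y_pred_classes = np.argmax(y_prob, axis=1)
y_true_classes = np.argmax(y_onehot, axis=1)

cm = confusion_matrix_2x2(y_true_classes, y_pred_classes)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"Accuracy: {accuracy}")
```

For a screening task the off-diagonal entry in the second row (cancerous patches predicted as non-cancerous) is usually the more costly error, so it is worth reporting alongside the overall accuracy.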
Actual Outcome: