
Lung Cancer Detection using Convolutional Neural Network (CNN)

Last Updated : 20 May, 2025

Computer Vision is one of the applications of deep neural networks, and one such use case is predicting the presence of cancerous cells. In this article, we will learn how to build a classifier using a Convolutional Neural Network (CNN) that can distinguish normal lung tissue from cancerous tissue.

The following process will be followed to build this classifier:

Flow Chart for the Project

Below is the step-by-step process for building our CNN model.

1. Importing Libraries

We will be using NumPy, Pandas, Matplotlib, PIL, scikit-learn, OpenCV and TensorFlow/Keras:

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from glob import glob

from sklearn.model_selection import train_test_split
from sklearn import metrics

from zipfile import ZipFile
import cv2
import gc
import os

import tensorflow as tf
from tensorflow import keras
from keras import layers

import warnings
warnings.filterwarnings('ignore')

2. Importing Dataset

The dataset used for this project is available on Kaggle and consists of 15,000 images, 5,000 for each of three categories of lung conditions:

  1. Normal Class
  2. Lung Adenocarcinomas
  3. Lung Squamous Cell Carcinomas

This dataset has already been augmented: the original 250 images per category were artificially expanded to 5,000, so we won’t need to perform data augmentation ourselves.
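If you are working outside Kaggle, one way to fetch the archive is through the Kaggle API client. The following is a minimal sketch, assuming the kaggle package is installed, an API token is configured in ~/.kaggle/kaggle.json, and the dataset slug below matches the archive used here:

Python
import kaggle

# Download the archive (without unzipping) into the current directory;
# the dataset slug is an assumption based on the archive name used below.
kaggle.api.dataset_download_files(
    'andrewmvd/lung-and-colon-cancer-histopathological-images',
    path='.', unzip=False)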

  • We use Python’s zipfile module to extract the dataset, since it is stored as a compressed .zip file.
  • extractall() unpacks all the contents into the current working directory.
Python
data_path = 'lung-and-colon-cancer-histopathological-images.zip'

with ZipFile(data_path, 'r') as zf:
    zf.extractall()
    print('The data set has been extracted.')

Output:

The data set has been extracted.

3. Visualizing the Data

Here we visualize a few sample images to get a feel for the data the model will be trained on. The classes list will contain the folder names 'lung_n', 'lung_aca' and 'lung_scc', corresponding to Normal, Lung Adenocarcinoma and Lung Squamous Cell Carcinoma, the three classes we are working with.

Python
# Path to the extracted image folders (relative to the working directory)
path = 'lung_colon_image_set/lung_image_sets'

# The three class folders: lung_n, lung_aca and lung_scc
classes = os.listdir(path)

for cat in classes:
    image_dir = f'{path}/{cat}'
    images = os.listdir(image_dir)

    fig, ax = plt.subplots(1, 3, figsize=(15, 5))
    fig.suptitle(f'Images for {cat} category . . . .', fontsize=20)

    for i in range(3):
        k = np.random.randint(0, len(images))
        img = np.array(Image.open(f'{path}/{cat}/{images[k]}'))
        ax[i].imshow(img)
        ax[i].axis('off')
    plt.show()

Output:

Images for lung_n category
Images for lung_aca category
Images for lung_scc category

The above output may vary from run to run because the code samples the displayed images at random each time it is executed.

  • It selects a random sample of three images from each category and visualizes them using Matplotlib.
  • PIL.Image.open(): opens each image and converts it into a format that can be displayed.
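If you want the same panels on every run, seeding NumPy's random generator before the loop makes the selection reproducible (the seed value below is arbitrary):

Python
# Fix the random sampling so the same three images are shown each run
np.random.seed(0)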

4. Preparing the Dataset

Before training the model, we need to convert the images into a format suitable for the CNN. This involves resizing them and storing them as NumPy arrays for efficient computation.

  • Image Resizing: Since large images are computationally expensive to process, we resize them to a standard size (256x256) with cv2.resize(). We also define the training hyperparameters here: 10 epochs with a batch size of 64.
  • One-hot encoding: Labels (Y) are converted to one-hot encoded vectors using pd.get_dummies(), which lets the model output a probability for each class (a toy example follows the output below).
  • Train-Test Split: We split the dataset into training and validation sets, 80% for training and 20% for validation, so that we can evaluate the model's performance on unseen data.
Python
IMG_SIZE = 256
SPLIT = 0.2
EPOCHS = 10
BATCH_SIZE = 64

X = []
Y = []

for i, cat in enumerate(classes):
  images = glob(f'{path}/{cat}/*.jpeg')

  for image in images:
    img = cv2.imread(image)

    X.append(cv2.resize(img, (IMG_SIZE, IMG_SIZE)))
    Y.append(i)

X = np.asarray(X)
one_hot_encoded_Y = pd.get_dummies(Y).values

X_train, X_val, Y_train, Y_val = train_test_split(
    X, one_hot_encoded_Y, test_size=SPLIT, random_state=2022)

# Confirm the shapes of the training and validation sets
print(X_train.shape, X_val.shape)

Output:

(12000, 256, 256, 3) (3000, 256, 256, 3)
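To make the one-hot step concrete, here is a toy sketch (made-up labels, not the real dataset); depending on your pandas version the array dtype will be bool or uint8, but the pattern is the same:

Python
# Three classes, four toy samples; each row has a single "hot" entry
labels = [0, 1, 2, 1]
print(pd.get_dummies(labels).values)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]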

5. Model Development

Now we build the CNN, using TensorFlow and Keras to define its architecture.

  • Sequential(): Builds a linear stack of layers.
  • Conv2D(): Applies convolution with specified filters, kernel size, ReLU activation and padding.
  • MaxPooling2D(): Downsamples feature maps by taking max values over pool size.
  • Flatten(): Converts 2D feature maps into 1D vector.
  • Dense(): Fully connected layer with given units and activation.
  • BatchNormalization(): Normalizes activations to speed up training.
  • Dropout(): Randomly drops neurons to reduce overfitting.
  • model.summary(): Displays model architecture details.
Python
model = keras.models.Sequential([
    layers.Conv2D(filters=32,
                  kernel_size=(5, 5),
                  activation='relu',
                  input_shape=(IMG_SIZE,
                               IMG_SIZE,
                               3),
                  padding='same'),
    layers.MaxPooling2D(2, 2),

    layers.Conv2D(filters=64,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same'),
    layers.MaxPooling2D(2, 2),

    layers.Conv2D(filters=128,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same'),
    layers.MaxPooling2D(2, 2),

    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(3, activation='softmax')
])
model.summary()

Output:

Model Summary
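Reading the summary: each of the three MaxPooling2D layers halves the spatial dimensions, shrinking the feature maps from 256x256 to 128x128, 64x64 and finally 32x32. The Flatten layer therefore feeds 32 x 32 x 128 = 131,072 values into the first Dense layer, which is where most of the model's parameters live.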

6. Model Compilation

After defining the model architecture, we compile the model with an optimizer, a loss function and an evaluation metric, and set up the callbacks that will control training.

  • We use the Adam optimizer, which adapts per-parameter learning rates during training to speed up convergence.
  • Categorical cross-entropy is the appropriate loss function for multi-class classification, as it measures the difference between the predicted and actual probability distributions (the formula is given after the code below).
  • EarlyStopping: Stops training if validation accuracy doesn’t improve for a set number of epochs (patience).
  • ReduceLROnPlateau: Reduces learning rate when validation loss plateaus, controlled by patience and factor.
  • Custom myCallback class: Stops training early when validation accuracy exceeds 90%.
  • self.model.stop_training = True: Signals to stop training inside the callback.
Python
from keras.callbacks import EarlyStopping, ReduceLROnPlateau


class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        # Stop training once validation accuracy crosses 90%
        if logs.get('val_accuracy') > 0.90:
            print('\nValidation accuracy has reached 90%, '
                  'stopping further training.')
            self.model.stop_training = True


es = EarlyStopping(patience=3,
                   monitor='val_accuracy',
                   restore_best_weights=True)

lr = ReduceLROnPlateau(monitor='val_loss',
                       patience=2,
                       factor=0.5,
                       verbose=1)

# Compile with the optimizer, loss and metric described above
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
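For reference, categorical cross-entropy is the standard definition: for a one-hot label vector $y$ and predicted class probabilities $\hat{y}$ over $C$ classes,

$L(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$

which, since $y$ is one-hot, reduces to the negative log of the probability the model assigns to the true class.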

7. Model Training

Now we will train our model by defining the following:

  • model.fit() trains the model on training data X_train and Y_train.
  • validation_data provides validation inputs X_val and Y_val for evaluation each epoch.
  • batch_size sets the number of samples per training batch.
  • epochs defines how many times the model iterates over the entire training set.
  • verbose=1 displays training progress.
  • callbacks includes early stopping, learning rate reduction and custom callback to control training based on validation metrics.
Python
history = model.fit(X_train, Y_train,
                    validation_data = (X_val, Y_val),
                    batch_size = BATCH_SIZE,
                    epochs = EPOCHS,
                    verbose = 1,
                    callbacks = [es, lr, myCallback()])

Output:

Trained Model
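Though not part of the original walkthrough, it is often convenient at this point to persist the trained model so that the evaluation below can be rerun without retraining; a minimal sketch (the filename is illustrative):

Python
# Save the trained model in the native Keras format
model.save('lung_cancer_cnn.keras')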

8. Visualizing

Let's visualize the training and validation accuracy with each epoch.

  • pd.DataFrame(history.history) converts training history into a DataFrame.
  • history_df.loc[:, ['accuracy', 'val_accuracy']].plot() plots training and validation accuracy.
Python
history_df = pd.DataFrame(history.history)
history_df.loc[:,['accuracy','val_accuracy']].plot()
plt.show()

Output:

Training Accuracy vs Validation Accuracy

This graph shows the training and validation accuracy of the model over the epochs. The training accuracy increases steadily, reaching nearly 1.0, indicating the model is learning the training data well.

However, the validation accuracy fluctuates, suggesting the model may be overfitting: it performs well on the training data but struggles to generalize to unseen data. This can be mitigated by further fine-tuning the model, as sketched below.
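One common fine-tuning direction is stronger regularization, for example on-the-fly augmentation layers placed in front of the network. The block below is an illustrative sketch (layer choices and parameters are assumptions, not from the original setup); these Keras preprocessing layers are active only during training:

Python
# Illustrative augmentation block that could be prepended to the model
augment = keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

Exposing the network to such label-preserving variations of each slide typically narrows the gap between training and validation accuracy.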

9. Model Evaluation

Now that our model is ready, let's evaluate its performance on the validation data using different metrics. We first predict the class of each validation image and then compare the predictions with the true labels.

  • model.predict(X_val) generates predictions for validation data.
  • np.argmax(Y_val, axis=1) converts one-hot encoded true labels to class indices.
  • np.argmax(Y_pred, axis=1) converts predicted probabilities to class indices.
Python
Y_pred = model.predict(X_val)
Y_val = np.argmax(Y_val, axis=1)
Y_pred = np.argmax(Y_pred, axis=1)

Output:

94/94 ━━━━━━━━━━━━━━━━━━━━ 80s 851ms/step

Now we will generate a classification report from the predicted labels and the true labels.

  • metrics.classification_report() displays detailed evaluation metrics for each class based on true (Y_val) and predicted (Y_pred) labels, using class names from classes.
Python
print(metrics.classification_report(Y_val, Y_pred,
                                    target_names=classes))

Output:

Classification Report for the Validation Data

The classification report shows that the model performs well on normal lung tissue (lung_n), with high precision and recall and a strong F1-score. However, it struggles with lung_aca (lung adenocarcinoma) and lung_scc (lung squamous cell carcinoma), particularly in terms of recall. This tells us the model could be improved at separating the two visually similar cancer subtypes so that performance is consistent across all categories.
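To see exactly where that recall is lost, a confusion matrix is a natural follow-up; a short sketch using the class indices computed above:

Python
# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(Y_val, Y_pred)
print(pd.DataFrame(cm, index=classes, columns=classes))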

