
Video Classification with a 3D Convolutional Neural Network

Last Updated : 01 Jul, 2024

Video classification is a core technology for analyzing digital content. It finds applications across industries such as security surveillance, media and entertainment, and automotive safety. Central to this technology are 3D Convolutional Neural Networks (3D CNNs), which significantly improve the accuracy and efficiency of video classification models.

Unlike traditional methods that treat videos as separate frames, 3D CNNs consider the entire temporal dimension, leading to a better understanding of the visual content. This results in faster and more reliable identification and categorization of video content.

This guide provides a comprehensive step-by-step approach to performing video classification with 3D CNNs. From setting up your environment to evaluating your model, you will learn the basics of 3D CNN technology, data preparation, model building, training, and performance evaluation. Each section covers both theoretical concepts and practical applications, making this guide an essential resource for anyone interested in leveraging 3D CNNs for advanced video analysis.

Understanding 3D Convolutional Neural Networks

A 3D Convolutional Neural Network (3D CNN) is a type of neural network architecture designed to learn hierarchical data representations. Equipped with multiple layers, it progressively learns more complex spatial features for tasks such as classification, regression, or generation. Unlike traditional 2D Convolutional Neural Networks, which handle two-dimensional data, 3D CNNs can process three-dimensional data, capturing both spatial and temporal dependencies. This capability makes them ideal for working with volumetric data like medical images (CT scans or MRI scans) or video sequences, where understanding the spatial context and temporal progression is crucial.

3D CNNs utilize 3D convolutional layers with a three-dimensional kernel that slides over the input volume to detect patterns across three dimensions. This approach is more effective in capturing the complexities of spatial patterns compared to 2D CNNs. The inclusion of the temporal dimension allows 3D CNNs to analyze frame-to-frame relationships in video data, enhancing the model's understanding of dynamic scenes.
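
To make the idea of a three-dimensional kernel concrete, the short sketch below (using TensorFlow/Keras on a dummy tensor; the sizes are illustrative, not taken from the tutorial's dataset) applies a single Conv3D layer to a batch of 16-frame clips and prints the resulting feature-map shape.

Python
import tensorflow as tf
from tensorflow.keras.layers import Conv3D

# A dummy batch of 2 clips, each with 16 frames of 112x112 RGB pixels
clips = tf.random.normal((2, 16, 112, 112, 3))

# One 3D convolution: the (3, 3, 3) kernel spans time, height, and width
conv = Conv3D(filters=8, kernel_size=(3, 3, 3), padding='same', activation='relu')
features = conv(clips)

# (2, 16, 112, 112, 8): 'same' padding preserves the temporal and spatial sizes
print(features.shape)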

Key Differences between 2D and 3D CNNs

When analyzing video data, 3D CNNs adopt a different approach compared to 2D CNNs. Instead of considering only height and width, they also account for a third dimension, which for video is the sequence of frames over time, using a three-dimensional kernel that moves across all three dimensions. This enables them to capture both spatial and temporal features in video sequences.

2D CNNs treat video frames as separate images, ignoring connections between them, while 3D CNNs understand the patterns that develop over time. They achieve this by having convolution and pooling layers that work in three dimensions, processing sequences of frames and preserving valuable temporal information. This additional complexity in computations is justified by the more detailed and meaningful representation of the data.
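
The difference is visible in the layers' weight shapes. The sketch below (illustrative only) builds a Conv2D and a Conv3D layer and prints their kernels: the 3D kernel has an extra leading axis that spans consecutive frames.

Python
from tensorflow.keras.layers import Conv2D, Conv3D

# 2D convolution: the kernel covers height and width only
conv2d = Conv2D(32, (3, 3))
conv2d.build((None, 112, 112, 3))
print(conv2d.kernel.shape)   # (3, 3, 3, 32): height, width, input channels, filters

# 3D convolution: the kernel gains a temporal axis over consecutive frames
conv3d = Conv3D(32, (3, 3, 3))
conv3d.build((None, 16, 112, 112, 3))
print(conv3d.kernel.shape)   # (3, 3, 3, 3, 32): time, height, width, input channels, filters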

Implementing Video Classification with 3D CNNs

Step 1: Import Required Libraries

First, we need to import the necessary libraries for video processing, data manipulation, model creation, and evaluation.

Python
import os
import cv2
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm


Step 2: Load and Visualize Data

You can download the dataset from here.

Load the paths to the directories containing the training and testing videos, along with the labels associated with each video. Visualize the distribution of classes in the training labels to gain insights into the dataset's balance and determine the number of unique classes.

Python
# Define the paths to the directories containing the training and testing videos
train_videos_path = 'train'
test_videos_path = 'test'

# Load the training and testing labels from CSV files
train_labels = pd.read_csv('train.csv')
test_labels = pd.read_csv('test.csv')

# Visualize the distribution of classes in the training labels
# This code creates a bar plot showing the number of videos in each class
class_counts = train_labels['tag'].value_counts()
plt.figure(figsize=(10, 5))
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title('Distribution of Training Labels')
plt.xlabel('Class Labels')
plt.ylabel('Number of Videos')
plt.xticks(rotation=90)
plt.show()

# Calculate the number of unique classes in the training labels
num_classes = train_labels['tag'].nunique()
print(f'Number of classes: {num_classes}')

Output:

[Bar plot: Distribution of Training Labels]
Number of classes: 5

Step 3: Data Pre-Processing

Next, we define a function extract_frames that takes the path to a video file and extracts a specified number of frames, resizing each to 112x112 pixels. Frames are sampled at regular intervals across the video, and if fewer than num_frames can be read, the result is padded with blank (zero) frames so that the returned array always contains exactly num_frames frames.

Python
def extract_frames(video_path, num_frames=16):
    """
    Extract frames from a video file.

    Args:
    - video_path (str): Path to the video file.
    - num_frames (int): Number of frames to extract (default is 16).

    Returns:
    - frames (np.array): Array of extracted frames.
    """
    # Open the video file
    cap = cv2.VideoCapture(video_path)
    frames = []
    
    # Get the total number of frames in the video
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    # Calculate the interval at which frames will be extracted
    frame_interval = max(total_frames // num_frames, 1)
    
    # Iterate through the frames and extract them
    for i in range(num_frames):
        # Set the frame position to the current frame index
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * frame_interval)
        
        # Read the frame
        ret, frame = cap.read()
        
        # Break the loop if the end of the video is reached
        if not ret:
            break
        
        # Resize the frame to 112x112 pixels
        frame = cv2.resize(frame, (112, 112))
        
        # Append the frame to the frames list
        frames.append(frame)
    
    # Release the video capture object
    cap.release()
    
    # Fill any missing frames with blank (zero) frames
    while len(frames) < num_frames:
        frames.append(np.zeros((112, 112, 3), np.uint8))
    
    # Convert the frames list to a NumPy array
    return np.array(frames)
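
As a quick sanity check, you can run the function on a single clip; the file name below is only a placeholder, so substitute any video from your own training directory. Regardless of the clip's length, the result should be a fixed-size array.

Python
# Hypothetical file name; replace it with any video from your training directory
sample_path = os.path.join(train_videos_path, 'v_CricketShot_g01_c01.avi')

sample_frames = extract_frames(sample_path)
print(sample_frames.shape)   # expected: (16, 112, 112, 3)
print(sample_frames.dtype)   # uint8, BGR channel order as returned by OpenCV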


Step 4: Load Data and Create Numpy Arrays

Next, we define a function load_data that takes a DataFrame containing video labels, a directory containing video files, the number of classes in the dataset, and the number of frames to extract per video. The function preprocesses the video data by extracting frames and converting labels to one-hot encoded format, returning arrays of video frames and labels suitable for training a 3D CNN model.

Python
def load_data(labels, video_dir, num_classes, num_frames=16):
    """
    Load and preprocess video data for training or testing.

    Args:
    - labels (pd.DataFrame): DataFrame containing video labels.
    - video_dir (str): Directory containing video files.
    - num_classes (int): Number of classes in the dataset.
    - num_frames (int): Number of frames to extract per video (default is 16).

    Returns:
    - X (np.array): Array of video frames.
    - y (np.array): Array of one-hot encoded labels.
    """
    X = []
    y = []
    
    # Iterate through each row in the labels DataFrame
    for idx, row in tqdm(labels.iterrows(), total=labels.shape[0]):
        # Construct the path to the video file
        video_path = os.path.join(video_dir, row['video_name'])
        
        # Extract frames from the video
        frames = extract_frames(video_path, num_frames)
        
        # Check if the correct number of frames were extracted
        if len(frames) == num_frames:
            X.append(frames)
            y.append(row['tag'])
    
    # Convert the lists to NumPy arrays
    X = np.array(X)
    
    # Convert the labels to one-hot encoded format
    y = to_categorical(pd.factorize(y)[0], num_classes)
    
    return X, y

# Load and preprocess the training and testing data
X_train, y_train = load_data(train_labels, train_videos_path, num_classes)
X_test, y_test = load_data(test_labels, test_videos_path, num_classes)

Output:

100%|██████████| 594/594 [01:09<00:00,  8.52it/s]
100%|██████████| 224/224 [00:25<00:00, 8.89it/s]
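
Before moving on, it is worth confirming the shapes of the resulting arrays. The sample counts below come from the progress bars above; your numbers may differ if some videos fail to load.

Python
# Each clip is 16 frames of 112x112 RGB pixels; labels are one-hot encoded
print(X_train.shape, y_train.shape)   # e.g. (594, 16, 112, 112, 3) (594, 5)
print(X_test.shape, y_test.shape)     # e.g. (224, 16, 112, 112, 3) (224, 5)

One optional refinement, not applied in this guide, is to scale the raw uint8 pixel values to the [0, 1] range (for example X_train = X_train.astype('float32') / 255.0), which often makes training more stable.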

Step 5: Train-Validation Split

Next, we split the training data into a training set and a validation set using train_test_split. After the split, we have:

  • X_train: Training set features (video frames).
  • X_val: Validation set features (video frames).
  • y_train: Training set labels (one-hot encoded).
  • y_val: Validation set labels (one-hot encoded).
Python
# Split the training data into training and validation sets
# X_train and y_train are the input features (video frames) and labels (one-hot encoded) for the training data, respectively
# test_size=0.2 specifies that 20% of the training data should be used for validation, while the remaining 80% is used for actual training
# random_state=42 sets the random seed for reproducibility, ensuring that the split is the same each time the code is run
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# After this line of code, you will have:
# X_train: Training set features (video frames)
# X_val: Validation set features (video frames)
# y_train: Training set labels (one-hot encoded)
# y_val: Validation set labels (one-hot encoded)
# These datasets can then be used for training and evaluating your 3D CNN model


Step 6: Create the 3D CNN Model

Next, we define a function create_advanced_3dcnn_model that constructs an advanced 3D CNN model for video classification. The function takes the shape of input frames (input_shape) and the number of classes (num_classes) as input and returns a compiled 3D CNN model.

The create_advanced_3dcnn_model function stacks three 3D convolutional blocks, each consisting of a Conv3D layer followed by max pooling and batch normalization. These blocks are followed by a flatten layer, a dense layer with dropout for regularization, and a softmax output layer for classification. The model is compiled with the Adam optimizer, categorical cross-entropy loss, and an accuracy metric, making it suitable for video classification tasks that require capturing intricate spatio-temporal patterns.

Python
def create_advanced_3dcnn_model(input_shape, num_classes):
    """
    Create an advanced 3D CNN model for video classification.

    Args:
    - input_shape (tuple): Shape of input frames (e.g., (16, 112, 112, 3) for 16 frames of size 112x112 pixels and 3 channels).
    - num_classes (int): Number of classes in the classification task.

    Returns:
    - model (Sequential): Compiled 3D CNN model.
    """
    model = Sequential()
    
    # 3D convolutional layer with 64 filters, kernel size of (3, 3, 3), and ReLU activation
    model.add(Conv3D(64, (3, 3, 3), activation='relu', padding='same', input_shape=input_shape))
    # 3D max pooling layer with pool size of (2, 2, 2)
    model.add(MaxPooling3D((2, 2, 2)))
    # Batch normalization layer
    model.add(BatchNormalization())

    # Another 3D convolutional layer with 128 filters, kernel size of (3, 3, 3), and ReLU activation
    model.add(Conv3D(128, (3, 3, 3), activation='relu', padding='same'))
    # Another 3D max pooling layer with pool size of (2, 2, 2)
    model.add(MaxPooling3D((2, 2, 2)))
    # Another batch normalization layer
    model.add(BatchNormalization())

    # Another 3D convolutional layer with 256 filters, kernel size of (3, 3, 3), and ReLU activation
    model.add(Conv3D(256, (3, 3, 3), activation='relu', padding='same'))
    # Another 3D max pooling layer with pool size of (2, 2, 2)
    model.add(MaxPooling3D((2, 2, 2)))
    # Another batch normalization layer
    model.add(BatchNormalization())

    # Flatten layer to flatten the output of the convolutional layers
    model.add(Flatten())
    # Fully connected (dense) layer with 512 units and ReLU activation
    model.add(Dense(512, activation='relu'))
    # Dropout layer with dropout rate of 0.5
    model.add(Dropout(0.5))
    # Output layer with softmax activation for multi-class classification
    model.add(Dense(num_classes, activation='softmax'))
    
    # Compile the model with Adam optimizer, categorical crossentropy loss, and accuracy metric
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

# Define the shape of input frames and create the advanced 3D CNN model
input_shape = (16, 112, 112, 3)
model = create_advanced_3dcnn_model(input_shape, num_classes)
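
Before training, it can be helpful to print the layer-by-layer summary to verify the output shapes and parameter counts. With this architecture, the dense layer that follows the Flatten layer holds the large majority of the weights (roughly 51 million), which is worth keeping in mind when planning memory usage.

Python
# Print the layer-by-layer summary with output shapes and parameter counts
model.summary()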

Step 7: Train the Model

Next, we train the 3D CNN model (model) using the training data (X_train and y_train) and validate it on the validation data (X_val and y_val) for 10 epochs with a batch size of 8. The training process involves updating the model's weights based on the gradient of the loss function computed on the training data, with the goal of minimizing the loss and improving the model's accuracy.

  • X_train and y_train: Training set features (video frames) and labels (one-hot encoded).
  • X_val and y_val: Validation set features (video frames) and labels (one-hot encoded).
  • epochs=10: Number of epochs (iterations over the entire dataset) for training.
  • batch_size=8: Number of samples per gradient update.
Python
# Train the 3D CNN model
# X_train and y_train are the training set features (video frames) and labels (one-hot encoded), respectively
# X_val and y_val are the validation set features (video frames) and labels (one-hot encoded), respectively
# epochs=10 specifies the number of training epochs (iterations over the entire dataset)
# batch_size=8 specifies the number of samples per gradient update
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=8)

Output:

Epoch 1/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 150s 2s/step - accuracy: 0.4733 - loss: 41.4847 - val_accuracy: 0.5042 - val_loss: 314.1945
Epoch 2/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 164s 3s/step - accuracy: 0.8078 - loss: 23.4136 - val_accuracy: 0.5462 - val_loss: 249.8013
Epoch 3/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 225s 4s/step - accuracy: 0.8392 - loss: 13.7996 - val_accuracy: 0.3950 - val_loss: 465.2792
Epoch 4/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 179s 3s/step - accuracy: 0.9461 - loss: 4.9286 - val_accuracy: 0.8235 - val_loss: 28.2462
Epoch 5/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 165s 3s/step - accuracy: 0.9474 - loss: 4.8669 - val_accuracy: 0.9244 - val_loss: 5.1237
Epoch 6/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 170s 3s/step - accuracy: 0.9614 - loss: 3.1715 - val_accuracy: 0.8403 - val_loss: 39.0854
Epoch 7/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 168s 3s/step - accuracy: 0.9532 - loss: 3.4268 - val_accuracy: 0.8487 - val_loss: 27.8798
Epoch 8/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 163s 3s/step - accuracy: 0.9805 - loss: 1.1084 - val_accuracy: 0.9244 - val_loss: 6.3794
Epoch 9/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 163s 3s/step - accuracy: 0.9694 - loss: 3.3069 - val_accuracy: 0.8067 - val_loss: 51.3638
Epoch 10/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 164s 3s/step - accuracy: 0.9597 - loss: 3.3076 - val_accuracy: 0.8824 - val_loss: 10.6910
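
The validation loss above fluctuates strongly between epochs, so in practice you may prefer to keep the weights from the best epoch rather than the last one. A minimal sketch using standard Keras callbacks (not part of the original training run; the checkpoint file name is arbitrary) could look like this:

Python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop training if validation loss does not improve for 3 consecutive epochs
    EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    # Keep the weights of the best epoch on disk (arbitrary file name)
    ModelCheckpoint('best_3dcnn.keras', monitor='val_loss', save_best_only=True),
]

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=10, batch_size=8, callbacks=callbacks)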

Step 8: Evaluate the Model

In this step, we evaluate the trained 3D CNN model (model) on the test data (X_test and y_test) to measure its performance on unseen data. The model's performance is evaluated by calculating the loss and accuracy on the test set. The test accuracy gives an indication of how well the model generalizes to new, unseen data. This evaluation step is crucial for assessing the model's effectiveness in real-world scenarios.

Python
# Evaluate the trained model on the test set
# X_test and y_test are the test set features (video frames) and labels (one-hot encoded), respectively
loss, accuracy = model.evaluate(X_test, y_test)

# Print the test accuracy
print(f'Test Accuracy: {accuracy:.2f}')

Output:

7/7 ━━━━━━━━━━━━━━━━━━━━ 30s 3s/step - accuracy: 0.7098 - loss: 95.7806
Test Accuracy: 0.63

Step 9: Get Predictions

In this step, we use the trained 3D CNN model (model) to predict the class labels for the test data (X_test) and compare them with the true labels (y_test) to evaluate the model's performance.

  • The predict method is used to obtain predicted probabilities for each class for each sample in the test set.
  • The predicted probabilities are then converted to class labels using np.argmax along the axis representing the classes, resulting in y_pred_classes.
  • Similarly, the true labels (y_test) are converted to class labels, resulting in y_true_classes.
  • These class labels can be used to calculate metrics such as precision, recall, and the confusion matrix for further evaluation of the model.
Python
# Get predictions from the model
# X_test is the test set features (video frames)
# y_pred contains the predicted probabilities for each class for each sample in X_test
y_pred = model.predict(X_test)

# Convert the predicted probabilities to class labels
# y_pred_classes contains the predicted class labels (indices) for each sample in X_test
y_pred_classes = np.argmax(y_pred, axis=1)

# Convert the true labels to class labels
# y_true_classes contains the true class labels (indices) for each sample in X_test
y_true_classes = np.argmax(y_test, axis=1)

Output:

7/7 ━━━━━━━━━━━━━━━━━━━━ 14s 2s/step
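
Note that load_data encodes labels with pd.factorize, which numbers classes in order of first appearance and discards the mapping. If you want to translate predicted indices back into readable class names, one simple approach (assuming the order of first appearance in test.csv matches the order used inside load_data) is to rebuild that mapping explicitly:

Python
# Rebuild the index -> name mapping that pd.factorize produced inside load_data
# (assumes the same order of first appearance as in the loaded test labels)
_, class_names = pd.factorize(test_labels['tag'])

# Inspect a few predictions alongside the ground truth
for i in range(5):
    print(f'true: {class_names[y_true_classes[i]]:<15} pred: {class_names[y_pred_classes[i]]}')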

Step 10: Plot Confusion Matrix

In this step, we use the confusion_matrix function from scikit-learn (already imported in Step 1, repeated below for convenience) to compare the true and predicted class labels on the test set.

Python
from sklearn.metrics import classification_report, confusion_matrix

# Calculate the confusion matrix
# y_true_classes are the true class labels for each sample in the test set
# y_pred_classes are the predicted class labels for each sample in the test set
conf_matrix = confusion_matrix(y_true_classes, y_pred_classes)


Next, we use the matplotlib and seaborn libraries to plot a heatmap of the confusion matrix, which gives a visual summary of where the model confuses one class for another.

Python
# Plot the confusion matrix
# conf_matrix is the confusion matrix computed using the true and predicted class labels
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

Output:

[Heatmap: Confusion Matrix]


Step 11: Classification Report

In this step, we use the classification_report function from scikit-learn to generate a comprehensive report on the classification performance of our 3D CNN model.

Python
# Print the classification report
# y_true_classes are the true class labels for each sample in the test set
# y_pred_classes are the predicted class labels for each sample in the test set
# target_names are the unique class labels used for the classification report
class_report = classification_report(y_true_classes, y_pred_classes, target_names=train_labels['tag'].unique())
print(class_report)

Output:

              precision    recall  f1-score   support

 CricketShot       0.60      1.00      0.75        49
PlayingCello       0.37      0.25      0.30        44
       Punch       0.73      0.85      0.79        39
ShavingBeard       0.41      0.16      0.23        43
 TennisSwing       0.80      0.84      0.82        49

    accuracy                           0.63       224
   macro avg       0.58      0.62      0.58       224
weighted avg       0.59      0.63      0.58       224

Step 12: Plot Training History

In this step, we use matplotlib to plot the training and validation accuracy of the 3D CNN model over epochs, which helps visualize how the model's accuracy improves during training.

Python
# Plot the training history
# history is the training history object returned by model.fit
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

Output:

[Plot: Model Accuracy]


Next, we continue to visualize the training history of the 3D CNN model by plotting the training and validation loss over epochs, providing insights into the model's convergence and performance.

Python
# Plot the training and validation loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

Output:

[Plot: Model Loss]


Applications of 3D CNNs in Video Classification

3D CNNs excel in video classification by capturing both the spatial and temporal aspects of footage. In datasets like UCF101, which focus on action recognition, 3D CNNs achieve higher accuracy by leveraging spatio-temporal features. These features are essential for understanding complex activities and making accurate predictions.

Beyond action recognition, 3D CNNs are utilized in fields such as medical imaging and surveillance, where identifying patterns or anomalies over time is crucial for diagnosis and monitoring. Sometimes, 3D CNNs are combined with Recurrent Neural Networks (RNNs) to create hybrid models capable of handling the most complex video analysis tasks. These models integrate spatial and temporal cues, providing a comprehensive understanding of the data.

Conclusion

In this guide, we've delved into the ins and outs of using 3D Convolutional Neural Networks (3D CNNs) for video classification. We covered everything from setting up the environment and preprocessing videos to building, training, and evaluating the model. By following this step-by-step approach, you'll see just how powerful 3D CNNs are in capturing spatio-temporal features, which are crucial for accurate video analysis.

Our exploration highlights the potential of 3D CNNs to revolutionize video analytics. These advanced models can interpret both the spatial and temporal dimensions of a video, giving us a deeper understanding of its content. Of course, using these models comes with its challenges, as we need to balance complexity and computational demands. But the possibilities they offer for advancements in automated surveillance and digital media are enormous. This guide serves as a solid foundation for mastering 3D CNNs in sophisticated video classification tasks.

