Evaluation of Computer Vision Models

Last Updated : 18 Jun, 2024

Computer vision, which enables computer systems to analyse and understand images much as the human eye does, has seen numerous developments recently. Benchmarking plays an important role in model selection, and it is especially important for assessing how computer vision models perform when applied to real-world problems. This article provides a brief insight into the significance of performance evaluation within the domain of computer vision.

What is the importance of performance evaluation in computer vision?

Performance evaluation is vital in computer vision for several reasons:

  • Validation of Model Effectiveness: Testing on new data confirms that the model generalises beyond the training dataset rather than merely memorising it.
  • Benchmarking: Enables comparison with other models and with established techniques, showing where the current model stands.
  • Identifying Weaknesses: Evaluation highlights the cases and classes where the model performs poorly, pointing analysts to the areas that need further investigation.
  • Guiding Model Development: Results indicate which components of the model or the data need to be refined.

The most suitable evaluation metrics depend on the particular computer vision task. Common tasks include image classification, object detection, image segmentation, and image generation, and each has its own metrics designed to capture the gap between the model's predictions and the ground truth in a meaningful way.

Image Classification Metrics

Image classification metrics are essential for evaluating the performance of machine learning models. Key metrics include accuracy, which measures the proportion of correctly classified images; precision, which indicates the accuracy of positive predictions; recall, which assesses the model's ability to identify all relevant instances; and F1-score, which balances precision and recall for a comprehensive performance overview. These metrics help in fine-tuning models to achieve optimal classification results.

Accuracy

Accuracy is the number of correctly classified instances divided by the total number of instances. It is one of the simplest performance measures and gives a quick overall estimate of how well a model performs, although it can be misleading on imbalanced datasets.

Precision, Recall, and F1 Score

Precision and recall provide more granular insights, especially for imbalanced datasets.

  • Precision: The number of true positives divided by the total number of positive predictions (true positives plus false positives).
  • Recall: The number of true positives divided by the total number of actual positives (true positives plus false negatives).
  • F1 Score: The harmonic mean of precision and recall, giving a single measure that balances the two.

Confusion Matrix

A confusion matrix tabulates the classification results as true positives, false positives, true negatives, and false negatives. It provides insight into which types of errors a model is most likely to make.

Code Implementation of Image Classification Metrics

In the given example, the model's predictions (y_pred) are compared to the actual labels (y_true), resulting in an accuracy of 0.6, indicating that 60% of the predictions were correct. The precision and recall are both 0.6, reflecting that 60% of the predicted positives are true positives, and 60% of the actual positives were correctly identified. The F1 score, which balances precision and recall, is also 0.6. Additionally, the confusion matrix reveals the specific counts of true positives, true negatives, false positives, and false negatives, providing deeper insight into the types of errors made by the model. This holistic evaluation helps identify areas of improvement and ensures the model's robustness and reliability.

Python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Example ground truth and predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]

# Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# Precision
precision = precision_score(y_true, y_pred)
print("Precision:", precision)

# Recall
recall = recall_score(y_true, y_pred)
print("Recall:", recall)

# F1 Score
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)

# Confusion Matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", conf_matrix)


Output:

Accuracy: 0.6
Precision: 0.6
Recall: 0.6
F1 Score: 0.6
Confusion Matrix:
 [[3 2]
 [2 3]]
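
As a quick sanity check, the same precision, recall, and F1 values can be recovered directly from the confusion matrix counts. The snippet below is a minimal sketch of that calculation, using scikit-learn's convention that ravel() on a binary confusion matrix returns the counts in the order TN, FP, FN, TP.

Python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]

# For binary labels, ravel() returns the counts as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # 3 / (3 + 2) = 0.6
recall = tp / (tp + fn)      # 3 / (3 + 2) = 0.6
f1 = 2 * precision * recall / (precision + recall)  # 0.6

print("Precision:", precision, "Recall:", recall, "F1 Score:", f1)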

Object Detection Metrics

Object detection metrics are crucial for assessing the effectiveness of models in identifying and localizing objects within images. Key metrics include mean Average Precision (mAP), which measures the precision-recall trade-off across different classes; Intersection over Union (IoU), which evaluates the overlap between predicted and ground-truth bounding boxes; Precision, which indicates the accuracy of the detected objects; and Recall, which measures the model's ability to detect all relevant objects. These metrics help in optimizing and comparing the performance of object detection algorithms.

Intersection over Union (IoU)

Intersection over Union (IoU) quantifies the overlap between the predicted bounding box and the ground-truth bounding box: the area of their intersection divided by the area of their union.

Python
import numpy as np

def iou(box1, box2):
    # box = [x1, y1, x2, y2]
    x1_inter = max(box1[0], box2[0])
    y1_inter = max(box1[1], box2[1])
    x2_inter = min(box1[2], box2[2])
    y2_inter = min(box1[3], box2[3])

    intersection_area = max(0, x2_inter - x1_inter) * max(0, y2_inter - y1_inter)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - intersection_area

    return intersection_area / union_area

# Example bounding boxes
box1 = [50, 50, 150, 150]
box2 = [70, 70, 170, 170]

iou_score = iou(box1, box2)
print("IoU:", iou_score)

Output:

IoU: 0.47058823529411764
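
To see where this value comes from: the two boxes overlap over the region [70, 70, 150, 150], so the intersection area is 80 x 80 = 6400; each box has an area of 100 x 100 = 10000, the union is 10000 + 10000 - 6400 = 13600, and IoU = 6400 / 13600 ≈ 0.47.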

Mean Average Precision (mAP)

Mean Average Precision (mAP) is the mean of the average precision (AP) scores computed for each class. Because AP summarises the precision-recall curve of a class, mAP combines precision and recall into a single balanced measure of detection accuracy across all classes.

Python
import numpy as np

# Example ground truth and predictions (for simplicity, assume we have only one class)
ground_truth_boxes = {
    "image1": [[50, 50, 150, 150]],
    "image2": [[30, 30, 120, 120], [200, 200, 300, 300]]
}

detection_boxes = {
    "image1": [[55, 55, 145, 145, 0.9], [60, 60, 160, 160, 0.7]],
    "image2": [[25, 25, 125, 125, 0.8], [210, 210, 290, 290, 0.9], [100, 100, 150, 150, 0.5]]
}

def iou(box1, box2):
    x1_inter = max(box1[0], box2[0])
    y1_inter = max(box1[1], box2[1])
    x2_inter = min(box1[2], box2[2])
    y2_inter = min(box1[3], box2[3])
    
    intersection_area = max(0, x2_inter - x1_inter) * max(0, y2_inter - y1_inter)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - intersection_area
    
    return intersection_area / union_area

def calculate_ap(recalls, precisions):
    recalls = np.concatenate(([0.], recalls, [1.]))
    precisions = np.concatenate(([0.], precisions, [0.]))
    
    for i in range(precisions.size - 1, 0, -1):
        precisions[i - 1] = np.maximum(precisions[i - 1], precisions[i])
    
    indices = np.where(recalls[1:] != recalls[:-1])[0]
    ap = np.sum((recalls[indices + 1] - recalls[indices]) * precisions[indices + 1])
    return ap

def calculate_map(ground_truth_boxes, detection_boxes, iou_threshold=0.5):
    all_detections = []
    all_ground_truths = []
    image_ids = []
    
    for image_id, boxes in detection_boxes.items():
        image_ids.extend([image_id] * len(boxes))
        all_detections.extend(boxes)
        all_ground_truths.extend(ground_truth_boxes.get(image_id, []))
    
    all_detections = np.array(all_detections)
    all_ground_truths = np.array(all_ground_truths)
    
    sorted_indices = np.argsort(-all_detections[:, 4])
    all_detections = all_detections[sorted_indices]
    image_ids = np.array(image_ids)[sorted_indices]
    
    # Total number of ground-truth boxes, recorded before any are matched,
    # so that recall is computed against the full set
    total_gt = len(all_ground_truths)

    tp = np.zeros(len(all_detections))
    fp = np.zeros(len(all_detections))
    
    # Greedily match each detection (highest confidence first) to the best
    # remaining ground-truth box. Note: for brevity this example pools ground
    # truths from all images; a full implementation would only match against
    # ground truths from the same image.
    for d in range(len(all_detections)):
        max_iou = 0
        assigned_gt = -1
        for g in range(len(all_ground_truths)):
            iou_value = iou(all_detections[d][:4], all_ground_truths[g])
            if iou_value > max_iou:
                max_iou = iou_value
                assigned_gt = g
        
        if max_iou >= iou_threshold and assigned_gt >= 0:
            # True positive: remove the matched ground truth so it cannot be matched twice
            tp[d] = 1
            all_ground_truths = np.delete(all_ground_truths, assigned_gt, 0)
        else:
            # No sufficiently overlapping ground truth left: false positive
            fp[d] = 1
    
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    
    recalls = cum_tp / total_gt
    precisions = cum_tp / (cum_tp + cum_fp)
    
    ap = calculate_ap(recalls, precisions)
    return ap

# Calculate mAP
mAP = calculate_map(ground_truth_boxes, detection_boxes)
print("mAP:", mAP)

Output:

mAP: 1.0

In this toy example every ground-truth box is matched by a high-confidence detection before any false positive appears in the confidence-ranked list, so the average precision comes out as 1.0; on real detection benchmarks mAP values are typically far lower.

Image Segmentation Metrics

Image segmentation metrics are vital for evaluating how well models delineate object boundaries within images. Key metrics include Intersection over Union (IoU), which measures the overlap between predicted and ground-truth segments; Dice Coefficient, which assesses the similarity between these segments; Pixel Accuracy, which calculates the proportion of correctly classified pixels; and Mean Absolute Error (MAE), which quantifies the average deviation between predicted and actual segmentations. These metrics provide insights into the model's precision and accuracy in segmenting images.

Dice Coefficient (F1 Score)

The Dice coefficient measures segmentation accuracy by quantifying how similar the predicted segmentation mask is to the ground-truth mask: twice the size of their overlap divided by the total number of foreground pixels in the two masks.

Python
import numpy as np

def dice_coefficient(y_true, y_pred):
    y_true_f = y_true.flatten()
    y_pred_f = y_pred.flatten()
    intersection = np.sum(y_true_f * y_pred_f)
    return (2. * intersection) / (np.sum(y_true_f) + np.sum(y_pred_f))

# Example ground truth and predictions
y_true = np.array([[1, 1, 0], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1]])

dice = dice_coefficient(y_true, y_pred)
print("Dice Coefficient:", dice)

Output:

Dice Coefficient: 0.8333333333333334
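
For these masks the element-wise product has 5 non-zero entries (the overlapping foreground pixels) and each mask contains 6 foreground pixels, so the Dice coefficient is 2 x 5 / (6 + 6) = 10 / 12 ≈ 0.83.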

Jaccard Index (Intersection over Union)

The Jaccard Index is the size of the intersection divided by the size of the union. For segmentation it is the same quantity as IoU: a similarity score between the mask predicted by the model and the ground-truth segmentation.

Python
import numpy as np

def jaccard_index(y_true, y_pred):
    y_true_f = y_true.flatten()
    y_pred_f = y_pred.flatten()
    intersection = np.sum(y_true_f * y_pred_f)
    union = np.sum(y_true_f) + np.sum(y_pred_f) - intersection
    return intersection / union

# Example ground truth and predictions
y_true = np.array([[1, 1, 0], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1]])

jaccard = jaccard_index(y_true, y_pred)
print("Jaccard Index:", jaccard)

Output:

Jaccard Index: 0.7142857142857143
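
Using the same masks, the intersection contains 5 pixels and the union contains 6 + 6 - 5 = 7 pixels, so the Jaccard Index is 5 / 7 ≈ 0.71, which is consistent with the Dice value above since the two metrics are monotonically related.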

Pixel Accuracy

Pixel accuracy is a metric used in image segmentation to measure the ratio of correctly classified pixels to the total number of pixels. It provides an overall accuracy score by comparing the predicted pixel labels to the ground truth labels. Higher pixel accuracy indicates better performance in segmenting the image correctly.

Python
import numpy as np

def pixel_accuracy(y_true, y_pred):
    correct_pixels = np.sum(y_true == y_pred)
    total_pixels = y_true.size
    return correct_pixels / total_pixels

# Example ground truth and predictions
y_true = np.array([[1, 1, 0], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1]])

accuracy = pixel_accuracy(y_true, y_pred)
print("Pixel Accuracy:", accuracy)

Output:

Pixel Accuracy: 0.7777777777777778
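
The overview of this section also mentions Mean Absolute Error (MAE), which averages the absolute pixel-wise difference between the predicted and ground-truth masks. The snippet below is a minimal sketch of that idea for the same binary masks used above; for binary masks MAE is simply the fraction of mismatched pixels, i.e. 1 minus the pixel accuracy.

Python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average absolute pixel-wise difference between the two masks
    return np.mean(np.abs(y_true.astype(float) - y_pred.astype(float)))

# Example ground truth and predictions (same masks as above)
y_true = np.array([[1, 1, 0], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1]])

mae = mean_absolute_error(y_true, y_pred)
print("MAE:", mae)

Output:

MAE: 0.2222222222222222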

Image Generation Metrics

Image generation metrics are essential for assessing the quality and realism of generated images. Key metrics include Inception Score (IS), which evaluates the diversity and quality of the generated images by using a pre-trained Inception model; Frechet Inception Distance (FID), which measures the similarity between the distribution of generated images and real images; Precision and Recall, which assess the fidelity and variety of the generated images; and Perceptual Path Length (PPL), which evaluates the smoothness and consistency of image interpolations. These metrics help in fine-tuning and comparing generative models.

Inception Score (IS)

The Inception Score evaluates generated images through the class probabilities that a pre-trained Inception classifier assigns to them. A high score requires each image to be classified confidently (indicating quality) while the images as a whole cover many different classes (indicating diversity).

Python
import tensorflow as tf
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.stats import entropy

# Load the pre-trained InceptionV3 classifier with its top layer, so that
# model.predict returns softmax class probabilities p(y|x), which the
# Inception Score is defined over (rather than pooled feature vectors)
model = InceptionV3(include_top=True, weights='imagenet')

def inception_score(images, n_split=10, eps=1E-16):
    scores = []
    n_part = len(images) // n_split
    for i in range(n_split):
        subset = images[i * n_part: (i + 1) * n_part]
        p_yx = model.predict(preprocess_input(subset))
        p_y = np.expand_dims(p_yx.mean(axis=0), 0)
        kl_div = p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))
        sum_kl_div = kl_div.sum(axis=1)
        avg_kl_div = np.mean(sum_kl_div)
        is_score = np.exp(avg_kl_div)
        scores.append(is_score)
    return np.mean(scores), np.std(scores)

# Example generated images
images = np.random.random((100, 299, 299, 3)) * 255

is_score, is_std = inception_score(images)
print("Inception Score:", is_score, "Standard Deviation:", is_std)

Output:

Because the example images are random noise, the exact Inception Score and standard deviation will differ from run to run. In practice the score is computed over a large sample of images produced by the generator being evaluated, and higher values indicate better quality and diversity.

Frechet Inception Distance (FID)

FID measures how similar the distribution of generated images is to that of real images by computing the Frechet distance between Gaussians fitted to their Inception feature activations (their means and covariances).

Python
import numpy as np
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm

def calculate_fid(act1, act2):
    # calculate mean and covariance statistics
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # calculate sum squared difference between means
    ssdiff = np.sum((mu1 - mu2)**2.0)
    # calculate sqrt of product between cov
    covmean = sqrtm(sigma1.dot(sigma2))
    # check and correct imaginary numbers from sqrt
    if iscomplexobj(covmean):
        covmean = covmean.real
    # calculate score
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid

# Example activations from real and generated images
act1 = np.random.random((100, 2048))
act2 = np.random.random((100, 2048))

fid = calculate_fid(act1, act2)
print("FID:", fid)

Output:

FID: -5.284059386853983e+86

FID is non-negative by definition, so the large negative value here is purely a numerical artifact: the 2048 x 2048 covariance matrices are estimated from only 100 random activation vectors, making them rank-deficient and the matrix square root unstable. With activations extracted from thousands of real and generated images, the score comes out as a finite positive number, and lower values indicate that the two distributions are closer.

Conclusion

Evaluating computer vision models is inherently multi-faceted, because the appropriate metrics depend on the application. For image classification, accuracy, precision, recall, F1 score, and the confusion matrix are the standard tools. For object detection, IoU and mAP are the most informative; for image segmentation, the Dice coefficient, Jaccard Index, and pixel accuracy are the most appropriate metrics. For image generation, the Inception Score (IS) and Frechet Inception Distance (FID) measure the quality of generated images and how close their distribution is to the real data. Similarly, the ROC curve, which represents the trade-off between true positive rate and false positive rate, and the AUC value are essential measures for binary classification tasks. Best practice is to combine multiple evaluation criteria, context-sensitive assessment, sound validation methods, and frequent monitoring to obtain a thorough and accurate view of a model's quality. Such an approach not only benchmarks and compares models but also guides continuous refinement and optimisation, which is crucial for building robust and efficient computer vision systems.

