Object Detection Week 2 YOLOv1-YOLOv8
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python)
https://www.youtube.com/watch?v=ag3DLKsl2vk
Non Maximum Suppression
Intersection Over Union (IoU)
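As a quick illustration of these two concepts, here is a minimal NumPy sketch (the corner box format (x1, y1, x2, y2) and the greedy loop are illustrative choices, not the paper's code):

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep
```

Detection pipelines run NMS per class after thresholding the confidence scores.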
Yolo-v1 - Network Architecture
YOLO-v1
➢ YOLO is a network that was “inspired by” GoogLeNet. It
has 24 convolutional layers working as feature extractors
and 2 dense layers for making the predictions. The
architecture it builds upon is called Darknet. There is a fast
version of YOLO called “Tiny-YOLO” which only has 9
convolutional layers.
YOLOV1
The input image is divided into an S×S grid (S=7)
Each grid cell predicts B bounding boxes (B=2) and confidence scores for those boxes
Each bounding box consists of 5 predictions: x, y, w, h, and confidence
Each grid cell also predicts conditional class probabilities, P(Classi|Object). (Total number of
classes=20)
The output size of the network becomes: 7×7×(2×5+20) = 7×7×30 = 1470
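The output size follows directly from S, B, and C; a one-line check:

```python
# YOLOv1 output tensor size for the PASCAL VOC setting described above.
S, B, C = 7, 2, 20          # grid size, boxes per cell, number of classes
per_cell = B * 5 + C        # each box contributes x, y, w, h, confidence
output_size = S * S * per_cell
print(output_size)          # 7 * 7 * 30 = 1470
```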
Yolo-v1 - Network Architecture
Yolo-v1
Yolo-v1
Limitations of YOLO
YOLOV1 - implementation
MOTIVATION OF YOLOV2
➢ YOLO v1 was faster than Faster R-CNN, but it was less
accurate.
➢ YOLO v1’s weakness was bounding-box accuracy. It
didn’t predict object locations and sizes well, and it was
particularly bad at spotting small objects.
➢ SSD, another single-stage detector, broke the record by
being better (more accurate) than Faster R-CNN and
even faster than YOLO v1.
MOTIVATION OF YOLOV2
➢ Joseph Redmon and Ali Farhadi developed YOLO v2, which is
better and faster than SSD and Faster R-CNN. They made a series
of changes, and the paper details how much improvement each
incremental change brought. However, they didn’t stop there.
In YOLOv1, two bounding boxes had to share the same class probabilities, so increasing the number of
bounding boxes would not help much. In contrast, Faster R-CNN and SSD predicted class probabilities for each
bounding box, making it easier to predict multiple classes sharing a similar center location.
Increasing Feature Map Resolution
➢ They removed one max-pooling layer, leaving five of them,
so input images are downsampled by a factor of 32. This
converts the original input size of 448 x 448 into 14 x 14
feature maps, four times more grid-cell locations than the
original 7 x 7 feature map.
➢ Unlike YOLOv1, where each grid cell predicted a single set of class
probabilities regardless of the number of boxes B, YOLOv2 predicts
class and objectness scores for every anchor box.
Dimension Clusters
➢ Unlike Faster R-CNN, which used hand-picked anchor
boxes, YOLOv2 used a smart technique to find anchor
boxes for the PASCAL VOC and MS COCO datasets.
➢ Redmon and Farhadi reasoned that instead of using hand-
picked anchor boxes, they could pick better priors that
reflect the data more closely. This would give the network a
better starting point and make it much easier for the network
to predict the detections and optimize faster.
Dimension Clusters
➢ YOLOv2 uses k-means clustering on the training-set bounding boxes
to find good anchor boxes, or priors.
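A minimal sketch of this idea, using the 1 − IoU distance from the YOLOv2 paper (the helper names and the simple loop are illustrative, not YOLOv2's actual code; boxes are represented by (width, height) with shared centers):

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between one (w, h) box and each centroid, assuming shared centers."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_priors(box_wh, k=5, iters=100, seed=0):
    """k-means with distance d = 1 - IoU, as proposed in the YOLOv2 paper."""
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with highest IoU (lowest distance)
        assign = np.array([np.argmax(wh_iou(wh, centroids)) for wh in box_wh])
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = box_wh[assign == c].mean(axis=0)
    return centroids
```

The resulting k centroids are used directly as the anchor-box widths and heights.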
Direct Location Prediction
Add fine-grained feature
Multi-Scale Training
Light-weight backbone
➢ Darknet-19
➢ A fully convolutional model with 19
convolutional layers and five max-
pooling layers was designed.
YOLOV2 - Implementation
➢ https://www.maskaravivek.com/post/yolov2/
Yolo-v3
➢ YOLO v3, released in April
2018, adds further
small improvements,
including the fact that
bounding boxes are
predicted at different scales.
In this version, the Darknet
framework is expanded to 53
convolutional layers.
Yolo-v3
Yolo-v3
YOLOV3
➢ YOLO v2 introduced Darknet-19, a 19-layer network
supplemented with 11 more layers for object detection.
However, with its 30-layer architecture, YOLO v2 often
struggled to detect small objects. The authors therefore
introduced successive 3 × 3 and 1 × 1 convolutional layers
followed by shortcut connections, allowing the
network to be much deeper. The result is their
Darknet-53, which is shown below.
➢ The Original YOLO was the first object detection network to combine
the problem of drawing bounding boxes and identifying class labels in one
end-to-end differentiable network.
➢ YOLOv2 made a number of iterative improvements on top of YOLO
including BatchNorm, higher resolution, and anchor boxes.
➢ YOLOv3 built upon previous models by adding an objectness score to
bounding box prediction, added connections to the backbone network
layers, and made predictions at three separate levels of granularity to
improve performance on smaller objects.
YOLOV3-Implementation
YOLOV4
➢ The Original YOLO — YOLO was the first object detection network to
combine the problem of drawing bounding boxes and identifying class labels
in one end-to-end differentiable network.
https://playground.tensorflow.org
Class label smoothing
Overconfidence in Neural Networks
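The headings above refer to the standard label-smoothing trick used against overconfidence; a minimal sketch (the eps value and function name are illustrative):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Soften hard 0/1 targets: the true class gets 1 - eps + eps/K,
    and every class shares eps uniformly, so the target still sums to 1."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes
```

Training against these softened targets discourages the network from pushing logits to extremes.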
YOLOv4 Bag of Specials
• Mish activation,
• Cross-stage partial connections (CSP)
• Multi-input weighted residual connections (MiWRC)
Mish activation
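Mish is defined as x · tanh(softplus(x)); a minimal NumPy sketch (fine for illustration, though np.exp would overflow for very large inputs):

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * np.tanh(np.log1p(np.exp(x)))
```

Like the identity for large positive inputs, smooth and slightly negative near zero, which is credited with better gradient flow than ReLU.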
YOLOV4
➢ IMPLEMENTATION
YOLOV5
YOLOV5
➢ Yolov5 almost resembles Yolov4 with some of the following
differences:
○ Yolov4 is released in the Darknet framework, which is
written in C. Yolov5 is based on the PyTorch
framework.
➢ YOLOv3-SPP outputs 3 predictions per location. Each prediction is an 85-D vector
with embeddings for the following values:
○ Class scores
○ IoU score
○ Bounding-box coordinates (center x, center y, width, height)
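The 85-D figure decomposes as follows, assuming the 80-class COCO setting the text implies:

```python
# Per-prediction vector length for YOLOv3-SPP on COCO (80 classes).
n_classes = 80
vec_len = n_classes + 1 + 4   # class scores + IoU/objectness score + 4 box coords
print(vec_len)                # 85
```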
➢ On the other hand, YOLOX reduced the predictions at
each location (pixel) from 3 to 1. The prediction contains
only a single 4D vector, encoding the location of the box at
each foreground pixel.
Striding for anchor free
The anchor location on the image can be obtained with the following formulas:
x = s/2 + s*i
y = s/2 + s*j
The following formulas map a predicted bounding box (p_x, p_y, p_w, p_h)
to its actual location on the image (l_x, l_y, l_w, l_h), where (x, y) is the grid
intersection point the prediction belongs to and s is the stride at the current FPN level:
l_x = p_x + x
l_y = p_y + y
l_w = s*e^(p_w)
l_h = s*e^(p_h)
For example, let’s go back to the bear image with a stride of 32. If the anchor point for this
prediction is (i, j) = (2, 1), meaning intersection point 2 on the x-axis and 1 on the y-axis, the
anchor lands at (x, y) = (32/2 + 32*2, 32/2 + 32*1) = (80, 48). If the model produces the
prediction (20, 15, 0.2, 0.3):
l_x = 20 + 80 = 100
l_y = 15 + 48 = 63
l_w = 32*e^(0.2) ≈ 39
l_h = 32*e^(0.3) ≈ 43
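The worked example above can be checked with a short sketch (the function name and tuple layout are illustrative):

```python
import math

def decode(pred, i, j, s):
    """Map a prediction (p_x, p_y, p_w, p_h) at grid intersection (i, j)
    with stride s back to image coordinates, per the formulas above."""
    p_x, p_y, p_w, p_h = pred
    x = s / 2 + s * i          # anchor point on the image
    y = s / 2 + s * j
    return (p_x + x, p_y + y, s * math.exp(p_w), s * math.exp(p_h))

print(decode((20, 15, 0.2, 0.3), i=2, j=1, s=32))  # (100.0, 63.0, ~39, ~43)
```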
Introducing the Decoupled Head in YOLOX
Shared Head in YOLO
➢ Earlier YOLO networks used a coupled head architecture.
All the regression and classification scores are obtained from
the same head.
The Need for Decoupling the YOLO Head
Decoupled Head in YOLOX
Multi-Positives in YOLOX to Improve the
Recall
Label Assignment in Object Detection problem
Label Assignment Before SimOTA
OTA: Optimal Transport Assignment for Object Detection
simOTA Advanced Label Assignment Strategy
Dynamic k Estimation
Center Prior
Strong Data Augmentation in YOLOX
Mosaic Augmentation
YOLOX Implementation
YOLOV6
YOLO detectors are constantly evolving, as is evident from new YOLO models being released
every few months. With YOLOv6, let’s explore the new and exciting features it brings to the
table.
➢ YOLOv6 employs plenty of new approaches to achieve state-of-the-art
results. These can be summarized into four points:
• Anchor-free: provides better generalizability and costs less time in
post-processing.
• The model architecture: YOLOv6 comes with a revised reparameterized
backbone and neck.
• Loss functions: YOLOv6 used Varifocal loss (VFL) for classification and
Distribution Focal loss (DFL) for detection.
• Industry handy improvements: Longer training epochs, quantization, and
knowledge distillation are some techniques that make YOLOv6 models best
suited for real-time industrial applications.
Anchor-free method
➢ This makes YOLOv6 51% faster than most anchor-based
object detectors, because it has 3 times fewer predefined
priors.
The YOLOv6 Backbone Architecture
Training vs. Inference
The YOLOv6 Neck Architecture
The YOLOv6 Detection Head
YOLOv6 implementation
YOLOV7
➢ YOLOv3 model, introduced by Redmon et al. in 2018
➢ YOLOv4 model, released by Bochkovskiy et al. in 2020, YOLOv4-tiny model,
research published in 2021
➢ YOLOR (You Only Learn One Representation) model, published in 2021
➢ YOLOX model, published in 2021
➢ NanoDet-Plus model, published in 2021
➢ PP-YOLOE, an industrial object detector, published in 2022
➢ YOLOv5 model v6.1 published by Ultralytics in 2022
➢ YOLOv7, published in 2022
YOLOV7
• Backbone: ELAN (YOLOv7-p5, YOLOv7-p6), E-ELAN (YOLOv7-
E6E)
• Neck: SPPCSPC + (CSP-OSA)PANet (YOLOv7-p5, YOLOv7-p6)
+ RepConv
• Head: YOLOR + Auxiliary Head (YOLOv7-p6)
➢ The architecture is derived from YOLOv4, Scaled-YOLOv4, and
YOLOR. Using these models as a base, further experiments
were carried out to develop the new and improved YOLOv7.
backbone Extended efficient layer aggregation networks (E-ELAN)
Model scaling for concatenation-based models
Planned re-parameterized convolution
Coarse for Auxiliary and Fine for Lead Loss
YOLOV7 implementation
What’s New in YOLOv8
➢ User-friendly API (Command Line + Python).
➢ Faster and More Accurate.
➢ Supports:
○ Object Detection,
○ Instance Segmentation,
○ Image Classification.
➢ Extensible to all previous versions.
➢ New Backbone network.
➢ New Anchor-Free head.
➢ New Loss Function.
YOLOV8 implementation
Here is a summary of the steps to calculate the AP: