Object Detection Week 2 YOLOv1-YOLOv8

YOLO and Its Variants discusses the YOLO object detection algorithm and its variants YOLOv1, YOLOv2, and YOLOv3. The original YOLO algorithm detects objects in images using a single neural network, making it faster than two-stage detectors like Faster R-CNN. YOLOv2 improved on YOLOv1 by adding batch normalization, higher resolution classifiers, anchor boxes, and dimension clusters to detect objects more accurately. YOLOv3 further improved performance by using Darknet-53, a deeper backbone network with 53 convolutional layers, to better detect small objects.

Uploaded by

Ngọc Hân

YOLO and Its Variants

Dr. Vinh Dinh Nguyen


A paper list of object detection using deep learning
https://arxiv.org/abs/1905.05055
https://www.v7labs.com/blog/yolo-object-detection
Motivation
➢ In 2015, Joseph Redmon (University of Washington) developed
YOLO. One of his co-authors, Ross Girshick (Microsoft Research),
also co-authored the Faster R-CNN paper around the same time. They
probably shared common ideas in computer vision research, as there
are some similarities between YOLO and Faster R-CNN. For
example, both models apply convolutional layers to input images to
generate feature maps. However, Faster R-CNN uses a two-stage
object detection pipeline, while YOLO has no separate region
proposal step and is much faster than Faster R-CNN.
➢ YOLO has many versions (variants). Joseph Redmon
developed the first three versions of YOLO: YOLOv1, v2,
and v3. He then stopped working on YOLO.
What is YOLO?
➢ YOLO is an abbreviation of ‘You Only Look Once’. It is an
algorithm that detects and recognizes various objects in a picture
in real time. YOLO treats object detection as a regression
problem and outputs the bounding boxes together with the class
probabilities of the detected objects.

➢ The YOLO algorithm employs a convolutional neural network (CNN) to
detect objects in real time. As the name suggests, the algorithm
requires only a single forward pass through the network
to detect objects.

Why the YOLO algorithm is important
➢ Speed: This algorithm improves the speed of detection
because it can predict objects in real-time.
➢ High accuracy: YOLO is a predictive technique that provides
accurate results with minimal background errors.
➢ Learning capabilities: The algorithm has excellent learning
capabilities that enable it to learn the representations of objects
and apply them in object detection.
How the YOLO algorithm works
➢ Residual blocks (the image is divided into an S×S grid of cells)
➢ Bounding box regression
➢ Intersection Over Union (IOU)
Residual blocks
Bounding box regression
➢ Every bounding box in the image consists of the following
attributes:
○ Width (bw)
○ Height (bh)
○ Class (for example, person, car, traffic light, etc.)- This
is represented by the letter c.
○ Bounding box center (bx,by)
Intersection over union (IOU)
➢ Each grid cell is responsible for predicting bounding
boxes and their confidence scores. IOU is the area of overlap
between the predicted box and the real box divided by the area
of their union, so it is equal to 1 if the predicted bounding
box is the same as the real box. This measure is used to
eliminate predicted boxes that overlap poorly with the real box.
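A minimal sketch of the IOU computation described above, assuming boxes are given in the center format (bx, by, bw, bh) listed earlier. The function name `iou` is illustrative, not taken from the slides.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (bx, by, bw, bh) center format."""
    # Convert center/size to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


print(iou((5, 5, 4, 4), (5, 5, 4, 4)))  # identical boxes give 1.0
```

Identical boxes give an IOU of 1 and disjoint boxes give 0; everything in between measures partial overlap.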
Combination of the three techniques

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python)
https://www.youtube.com/watch?v=ag3DLKsl2vk
Non Maximum Suppression
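The slide content for this step is not preserved here, so the following is only a sketch of the commonly used greedy form of non-maximum suppression: keep the highest-scoring box, drop every remaining box whose IoU with a kept box exceeds a threshold, and repeat. The corner format (x1, y1, x2, y2), the function names, and the 0.5 threshold are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```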
Intersection Over Union (IoU)
Yolo-v1 - Network Architecture
YOLO-v1
➢ The YOLO network was “inspired by” GoogLeNet. It
has 24 convolutional layers that act as feature extractors
and 2 dense layers that make the predictions. The
framework it is built on is called Darknet. There is a fast
version of YOLO called “Tiny-YOLO”, which has only 9
convolutional layers.
YOLOV1
➢ The input image is divided into an S×S grid (S=7).
➢ Each grid cell predicts B bounding boxes (B=2) and confidence scores for those boxes.
➢ Each bounding box consists of 5 predictions: x, y, w, h, and confidence.
➢ Each grid cell also predicts conditional class probabilities, P(Class_i | Object), for a total of 20 classes.
➢ Output of the network: the output size becomes 7×7×(2×5+20) = 7×7×30 = 1470.
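The output size above follows directly from S, B, and the number of classes. A small sketch of the arithmetic, where the function name is illustrative:

```python
def yolo_v1_output_size(S=7, B=2, C=20):
    """Shape and flattened length of the YOLOv1 prediction tensor."""
    # Each cell holds B boxes of 5 values (x, y, w, h, confidence)
    # plus one set of C conditional class probabilities.
    per_cell = B * 5 + C
    return (S, S, per_cell), S * S * per_cell


shape, total = yolo_v1_output_size()
print(shape, total)  # (7, 7, 30) 1470
```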
Yolo-v1 - Network Architecture
Yolo-v1
Yolo-v1
Limitations of YOLO
YOLOV1 - implementation
MOTIVATION OF YOLOV2
➢ YOLO v1 was faster than Faster R-CNN, but it was less
accurate.
➢ YOLO v1’s weakness was bounding box accuracy. It
did not predict object locations and sizes well, and it was
particularly bad at spotting small objects.
➢ SSD, another single-stage detector, broke the record by
being better (more accurate) than Faster R-CNN and
even faster than YOLO v1.
MOTIVATION OF YOLOV2
➢ Joseph Redmon and Ali Farhadi developed YOLO v2, which is
better and faster than SSD and Faster R-CNN. They made a series
of changes, and the paper details how much improvement each
incremental change brought. However, they didn’t stop there.

➢ They wanted their object detector to recognize a wide
variety of objects. The Pascal VOC object detection dataset contains
only 20 classes. They wanted their model to recognize many
more classes of objects.
MOTIVATION OF YOLOV2
➢ The problem was that building an object detection dataset with
millions of labeled images would take too much time. Labeling
images for object detection is far more expensive than labeling them
for image classification. So, they devised a new method to train
YOLO9000 that takes advantage of the MS COCO and ImageNet
datasets simultaneously. As a result, YOLO9000 can detect
over 9000 object categories.
Changes in YOLOV2
➢ Batch Normalization
○ In YOLO v2, they added Batch Normalization to all
convolutional layers
Batch normalization
➢ Normalizing the input data => speed up training process
➢ Why do we not normalize the activation of every layer?
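A minimal sketch of what normalizing the activations of a layer means, written in plain Python for a small batch. It covers the training-time forward pass only; `gamma`, `beta`, and `eps` follow the usual batch normalization formulation and are not taken from the slides.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a batch of activation vectors."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for f in range(num_features):
        column = [row[f] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for r in range(n):
            # Zero mean, unit variance, then a learnable scale and shift.
            out[r][f] = gamma * (batch[r][f] - mean) / math.sqrt(var + eps) + beta
    return out


activations = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
for row in batch_norm(activations):
    print(row)
```

Both features end up on the same scale even though the second one started ten times larger, which is the property that helps training.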
Changes in YOLOV2
➢ High-Resolution Classifier
○ In YOLO v1, they trained a classifier on ImageNet images of
size 224 x 224. Then, they increased the image resolution to
448 x 448 to train the object detection model. Hence, the
network had to simultaneously adjust to the new input
resolution and learn object detection.
Changes in YOLOV2
➢ High-Resolution Classifier
yolo-v2
➢ Convolutional with Anchor Boxes
○ YOLOv2 removes all fully connected layers and uses
anchor boxes to predict bounding boxes.

○ YOLO v1 only predicted two bounding boxes per grid cell,
which means a total of 98 (= 7 x 7 x 2) bounding boxes per
image, much lower than Faster R-CNN.
YOLO v1 suffered from a low recall rate
YOLOV1

Two bounding boxes had to share the same class probabilities. As such, increasing the number of bounding
boxes would not benefit much. On the contrary, Faster R-CNN and SSD predicted class probabilities for each
bounding box, making it easier to predict multiple classes sharing a similar center location.
Increasing Feature Map Resolution
➢ They removed one max-pooling layer, leaving five of them,
which downsample input images by a factor of 32. This
converts the original input size of 448 x 448 into a 14 x 14
feature map, four times as many grid cell locations as the
original feature map of 7 x 7.

➢ The input size was then changed from 448 x 448 to 416 x 416,
which gives a 13 x 13 feature map (416 / 32 = 13).

➢ However, a larger feature map means more weights in the
fully-connected layers.
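The feature map sizes quoted above can be checked with a few lines of arithmetic. This sketch assumes every downsampling step halves the resolution; the function name is illustrative.

```python
def feature_map_size(input_size, num_downsamples=5):
    """Side length of the feature map after repeated 2x downsampling."""
    stride = 2 ** num_downsamples  # five halvings give a total stride of 32
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the total stride")
    return input_size // stride


print(feature_map_size(448))      # 14
print(feature_map_size(416))      # 13
print(feature_map_size(448, 6))   # 7, the original grid before removing a pooling layer
```

Going from 7 x 7 to 14 x 14 is 49 versus 196 cells, which is the fourfold increase in grid locations mentioned above.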
Going Fully Convolutional
➢ They replaced the fully-connected layers with
convolutional layers. The number of weights became
negligible since the kernel size is small and fixed no matter
the input resolution. It allowed them to use more bounding
boxes per grid cell. As such, they introduced anchor boxes
similar to Faster R-CNN.
➢ The recall rate increased thanks to the increased feature
map resolution and more bounding boxes. However, the
accuracy went down from 69.5 mAP to 69.2 mAP. So, why
did they still make this change?
➢ YOLOv1 was an anchor-free model that predicted the coordinates of
bounding boxes (B-boxes) directly using fully connected layers in each grid cell.

➢ Faster R-CNN predicts B-boxes using hand-picked priors known as
anchor boxes. Inspired by it, YOLOv2 works on the same principle.

➢ YOLOv1 predicted one set of class probabilities per grid cell,
regardless of the number of boxes B. In contrast, YOLOv2
predicts class and objectness for every anchor box.
Dimension Clusters
➢ Unlike Faster R-CNN, which used hand-picked anchor boxes,
YOLOv2 used a smart technique to
find anchor boxes for the PASCAL VOC and MS COCO
datasets.
➢ Redmon and Farhadi reasoned that, instead of using hand-
picked anchor boxes, they should pick better priors that reflect the
data more closely. Such priors give the network a better
starting point, making it easier to predict detections and
faster to optimize.
Dimension Clusters
➢ They ran k-means clustering on the training set bounding boxes
to find good anchor boxes (priors).
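A rough sketch of this clustering step, assuming each training box is reduced to its (width, height) and that the distance between a box and a centroid is 1 - IoU, as described in the YOLOv2 paper. The function names, the toy data, and the fixed random seed are illustrative choices.

```python
import random


def wh_iou(wh, centroid):
    """IoU of two boxes that share the same center, given only (width, height)."""
    inter = min(wh[0], centroid[0]) * min(wh[1], centroid[1])
    union = wh[0] * wh[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with k-means, using 1 - IoU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda c: wh_iou(wh, centroids[c]))
            clusters[best].append(wh)
        new_centroids = []
        for c in range(k):
            if clusters[c]:
                w = sum(b[0] for b in clusters[c]) / len(clusters[c])
                h = sum(b[1] for b in clusters[c]) / len(clusters[c])
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


boxes = [(10, 10), (12, 11), (11, 12), (100, 200), (110, 190), (105, 210)]
print(kmeans_anchors(boxes, k=2))  # one small prior and one tall prior
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering just because their absolute errors are larger.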
Direct Location Prediction
Add fine-grained feature
Multi-Scale Training
Light-weight backbone
➢ Darknet-19
➢ A fully convolutional model with 19
convolutional layers and five max-
pooling layers was designed.
YOLOV2 - Implementation
➢ https://www.maskaravivek.com/post/yolov2/
Yolo-v3
➢ YOLO v3 was released in April 2018 and added further
small improvements, including predicting bounding boxes
at different scales. In this version, the Darknet
backbone was expanded to 53 convolutional layers.
Yolo-v3
Yolo-v3
YOLOV3
➢ YOLO v2 introduced Darknet-19, a 19-layer network
supplemented with 11 more layers for object detection.
However, with this 30-layer architecture, YOLO v2 often
struggled to detect small objects. Therefore, the authors
introduced successive 3 × 3 and 1 × 1 convolutional layers
followed by shortcut connections, allowing the
network to be much deeper. The result is Darknet-53.
➢ The original YOLO was the first object detection network to combine
the problem of drawing bounding boxes and identifying class labels in one
end-to-end differentiable network.
➢ YOLOv2 made a number of iterative improvements on top of YOLO,
including BatchNorm, higher resolution, and anchor boxes.
➢ YOLOv3 built upon previous models by adding an objectness score to
bounding box prediction, adding connections to the backbone network
layers, and making predictions at three separate levels of granularity to
improve performance on smaller objects.
YOLOV3-Implementation
YOLOV4
➢ The original YOLO: YOLO was the first object detection network to
combine the problem of drawing bounding boxes and identifying class labels
in one end-to-end differentiable network.

➢ YOLOv2: YOLOv2 made a number of iterative improvements on top of
YOLO, including BatchNorm, higher resolution, and anchor boxes.

➢ YOLOv3: YOLOv3 built upon previous models by adding an objectness
score to bounding box prediction, adding connections to the backbone
network layers, and making predictions at three separate levels of granularity to
improve performance on smaller objects.
➢ There are two types of object detection models: two-stage object detectors
and single-stage object detectors. Single-stage detector architectures (like YOLO)
are composed of three components: a Backbone, a Neck, and a
Head that makes dense predictions, as shown in the figure below.
Yolo-v4
➢ YOLO v4 was released in April 2020, but this release is not from the first
author of YOLO. In February 2020, Joseph Redmon announced that he was leaving
the field of computer vision due to concerns about the possible negative impact of
his work.
YOLO-v4
➢ The YOLO v4 release lists three authors: Alexey Bochkovskiy, the Russian developer
who built the YOLO Windows version, Chien-Yao Wang, and Hong-Yuan Mark
Liao.
➢ Compared with the previous YOLOv3, YOLOv4 has the following advantages:
○ It is an efficient and powerful object detection model that enables anyone with a
1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.
○ The influence of state-of-the-art “Bag-of-Freebies” and “Bag-of-Specials” object
detection methods during detector training has been verified.
○ The modified state-of-the-art methods, including CBN (Cross-iteration batch
normalization), PAN (Path aggregation network), etc., are now more efficient
and suitable for single GPU training.
Backbone
➢ Models such as ResNet, DenseNet, and VGG are used as
feature extractors. They are pre-trained on image
classification datasets, like ImageNet, and then fine-tuned
on the detection dataset. These networks produce features
with increasingly high-level semantics as they get deeper
(more layers), and these features are useful for the
later parts of the object detection network.
What’s the backbone for?
➢ It is a deep neural network composed mainly of convolutional
layers. The main objective of the backbone is to extract the
essential features. Selecting the backbone is a key step,
since a good choice improves object detection performance.
Pre-trained neural networks are often used as the backbone.
YOLOv4 Backbone Network: Feature Formation
➢ CSPResNext50
➢ CSPDarknet53
➢ EfficientNet-B3
CSPDarknet53
➢ CSP( Cross-Stage-Partial connections) + Darknet53
DenseNet (Dense connected convolutional network)
Five-layer dense block
One Dense Block in DenseNet
The ResNet-18 architecture.
EfficientNet
YOLOv4 Neck: Feature Aggregation
➢ FPN, PAN, NAS-FPN, BiFPN
Feature image pyramid
SPP
Path Aggregation Network (PAN)
Path Aggregation Network (PAN)
Spatial Attention Module
YOLOV4+SAM
YOLOv4 Head: The Detection Step
Bag of freebies
➢ Bag of freebies methods are the set of methods that only
increase the cost of training or change the training
strategy while leaving the cost of inference low. Let’s present
some simple methods commonly used in computer vision.
YOLOv4 Bag of Freebies
➢ Improve performance of the network without adding to inference
time in production
CutMix data augmentation
Mosaic data augmentation.
DropBlock regularization

https://playground.tensorflow.org
Class label smoothing
Overconfidence in Neural Networks
YOLOv4 Bag of Specials
• Mish activation,
• Cross-stage partial connections (CSP)
• Multi-input weighted residual connections (MiWRC)
Mish activation
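The slide for this heading is not preserved here. As a reminder of the standard definition, Mish is x · tanh(softplus(x)) with softplus(x) = ln(1 + e^x). The sketch below uses only the standard library and a numerically stable softplus; the function names are illustrative.

```python
import math


def softplus(x):
    """ln(1 + e^x), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(value, mish(value))
```

For large positive inputs Mish behaves like the identity, and for large negative inputs it approaches zero while staying smooth everywhere.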
YOLOV4
➢ IMPLEMENTATION
YOLOV5
YOLOV5
➢ YOLOv5 closely resembles YOLOv4, with the following
differences:
○ YOLOv4 was released in the Darknet framework, which is
written in C. YOLOv5 is based on the PyTorch
framework.

○ YOLOv4 uses a .cfg file for configuration, whereas YOLOv5
uses a .yaml file.
Backbone
➢ CSPResBlock
=> C3 Module
Activation function
Loss Function
➢ YOLOv5 returns three outputs: the classes of the detected
objects, their bounding boxes, and the objectness scores.
It uses BCE (Binary Cross Entropy) to compute the
class loss and the objectness loss, and CIoU (Complete
Intersection over Union) loss to compute the location loss.
The final loss combines these three terms.
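The equation itself is not preserved in this text. The combined loss is commonly written as a weighted sum of the three terms named above; the weights λ1, λ2, λ3 are hyperparameters whose values are not given in these slides, so the form below should be read as a sketch.

```latex
\mathcal{L} = \lambda_1 \, \mathcal{L}_{cls} + \lambda_2 \, \mathcal{L}_{obj} + \lambda_3 \, \mathcal{L}_{loc}
```

Here L_cls and L_obj are the BCE class and objectness losses, and L_loc is the CIoU location loss.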
Anchor Box
○ n: extra small (nano) size model
○ s: small size model
○ m: medium size model
○ l: large size model
○ x: extra large size model
YOLOV5 implementation
YOLOX
➢ Released in July 2021, YOLOX has switched to the anchor
free approach which is different from previous YOLO
networks.
➢ In short, salient features of YOLOX are,
• Anchor free design
• Decoupled head
• simOTA label assignment strategy
• Advanced Augmentations: Mixup and Mosaic
Anchors in Object Detection
Anchor-Based Object Detection
Drawbacks of Anchor Based Approach
1. It needs a large set of anchor boxes. For example, it is more
than 100k in RetinaNet.
2. The anchor boxes require a lot of hyperparameters and
design tweaks. For example,
○ Number of anchors
○ Size of the anchors
○ The aspect ratio of the boxes
○ The number of sections the image should be divided into
Anchor Free Object Detection
Anchor Free Object Detectors
Anchor Free YOLOX
➢ YOLOX adopts the center-based approach, which has a per-pixel detection mechanism.
In anchor-based detectors, each location on the input image acts as the center for multiple
anchors.

➢ YOLOv3 SPP outputs 3 predictions per location. Each prediction has an 85D vector
with embeddings for the following values.
○ Class score
○ IoU score
○ Bounding box coordinates ( center x, center y, width, height)
➢ On the other hand, YOLOX reduced the predictions at
each location (pixel) from 3 to 1. The prediction contains
only a single 4D vector, encoding the location of the box at
each foreground pixel.
Striding for anchor free
The anchor location on the image can be obtained with the following formulas:
x = s/2 + s*i
y = s/2 + s*j
The following formulas are used to map a predicted bounding box (p_x, p_y, p_w, p_h)
to the actual location on the image (l_x, l_y, l_w, l_h) if (x, y) is the intersection point on
the grid which the prediction belongs to and s is the stride at the current FPN level:
l_x = p_x + x

l_y = p_y + y

l_w = s*e^(p_w)

l_h = s*e^(p_h)
For example, go back to the bear image with a stride of 32. If the anchor point for this
prediction is (i, j) = (2, 1), meaning intersection point 2 on the x-axis and 1 on the y-axis,
the corresponding point on the image is:

x = 32/2 + 32*2 = 80

y = 32/2 + 32*1 = 48

If the model produces the prediction (20, 15, 0.2, 0.3), the box on the image is:

l_x = 20 + 80 = 100

l_y = 15 + 48 = 63

l_w = 32*e^(0.2) ≈ 39

l_h = 32*e^(0.3) ≈ 43
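The decoding formulas above translate directly into code. This sketch reuses the symbols from the formulas (s, i, j, p_x, p_y, p_w, p_h); the function names are illustrative.

```python
import math


def anchor_point(i, j, stride):
    """Image coordinates of grid intersection (i, j) at the given stride."""
    return stride / 2 + stride * i, stride / 2 + stride * j


def decode_box(pred, i, j, stride):
    """Map a prediction (p_x, p_y, p_w, p_h) to (l_x, l_y, l_w, l_h) on the image."""
    p_x, p_y, p_w, p_h = pred
    x, y = anchor_point(i, j, stride)
    return (p_x + x, p_y + y, stride * math.exp(p_w), stride * math.exp(p_h))


l_x, l_y, l_w, l_h = decode_box((20, 15, 0.2, 0.3), i=2, j=1, stride=32)
print(l_x, l_y, round(l_w), round(l_h))  # 100.0 63.0 39 43
```

The exponential keeps the decoded width and height positive no matter what value the network outputs.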
Introducing the Decoupled Head in YOLOX
Shared Head in YOLO
➢ Earlier YOLO networks used a coupled head architecture.
All the regression and classification scores are obtained from
the same head.
The Need for Decoupling the YOLO Head
Decoupled Head in YOLOX
Multi-Positives in YOLOX to Improve the Recall
Label Assignment in Object Detection problem
Label Assignment Before SimOTA
OTA: Optimal Transport Assignment for Object Detection
simOTA Advanced Label Assignment Strategy
Dynamic k Estimation
Center Prior
Strong Data Augmentation in YOLOX
Mosaic Augmentation
YOLOX Implementation
YOLOV6

YOLO detectors are constantly evolving, as is evident from new YOLO models being released
every few months. With YOLOv6, let’s explore what new and exciting features it brings to the
table.
➢ YOLOv6 employs plenty of new approaches to achieve state-of-the-art
results. These can be summarized into four points:
• Anchor free: Hence provides better generalizability and costs less time in
post-processing.
• The model architecture: YOLOv6 comes with a revised reparameterized
backbone and neck.
• Loss functions: YOLOv6 used Varifocal loss (VFL) for classification and
Distribution Focal loss (DFL) for detection.
• Industry handy improvements: Longer training epochs, quantization, and
knowledge distillation are some techniques that make YOLOv6 models best
suited for real-time industrial applications.
Anchor-free method
➢ The anchor-free design makes YOLOv6 51% faster than most
anchor-based object detectors. This is possible because it has 3
times fewer predefined priors.

The YOLOv6 Backbone Architecture
Training Inference
The YOLOv6 Neck Architecture
The YOLOv6 Detection Head
YOLOv6 implementation
YOLOV7
➢ YOLOv3 model, introduced by Redmon et al. in 2018
➢ YOLOv4 model, released by Bochkovskiy et al. in 2020, YOLOv4-tiny model,
research published in 2021
➢ YOLOR (You Only Learn One Representation) model, published in 2021
➢ YOLOX model, published in 2021
➢ NanoDet-Plus model, published in 2021
➢ PP-YOLOE, an industrial object detector, published in 2022
➢ YOLOv5 model v6.1 published by Ultralytics in 2022
➢ YOLOv7, published in 2022
YOLOV7
• Backbone: ELAN (YOLOv7-p5, YOLOv7-p6), E-ELAN (YOLOv7-
E6E)
• Neck: SPPCSPC + (CSP-OSA)PANet (YOLOv7-p5, YOLOv7-p6)
+ RepConv
• Head: YOLOR + Auxiliary Head (YOLOv7-p6)
➢ The architecture is derived from YOLOv4, Scaled YOLOv4, and
YOLO-R. Using these models as a base, further experiments
were carried out to develop new and improved YOLOv7.

Backbone: Extended efficient layer aggregation networks (E-ELAN)
Model scaling for concatenation-based models
Planned re-parameterized convolution
Coarse for Auxiliary and Fine for Lead Loss
YOLOV7 implementation
What’s New in YOLOv8
➢ User-friendly API (Command Line + Python).
➢ Faster and more accurate.
➢ Supports:
○ Object Detection
○ Instance Segmentation
○ Image Classification
➢ Extensible to all previous versions.
➢ New backbone network.
➢ New anchor-free head.
➢ New loss function.
YOLOV8 implementation
Here is a summary of the steps to calculate the AP:

• Generate the prediction scores using the model.
• Convert the prediction scores to class labels.
• Calculate the confusion matrix: TP, FP, TN, FN.
• Calculate the precision and recall metrics.
• Calculate the area under the precision-recall curve.
• Measure the average precision.
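The steps above can be sketched as follows, assuming each detection already has a confidence score and a flag saying whether it is a true positive (for detection this flag usually comes from IoU matching against the ground truth). The area is computed as a plain sum over recall steps; benchmarks such as Pascal VOC and COCO use interpolated variants. All names here are illustrative.

```python
def average_precision(scores, is_tp, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    # Walk through the detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Add the rectangle gained by this step along the recall axis.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


scores = [0.9, 0.8, 0.7, 0.6]
is_tp = [True, False, True, True]
print(average_precision(scores, is_tp, num_ground_truth=3))  # 29/36, about 0.806
```

Averaging this value over all classes gives the mAP figures quoted for the different YOLO versions.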
Dr. Nguyen Dinh Vinh, FPT University, Can Tho
