YOLO Object Detection Explained: A Beginner's Guide | DataCamp
Zoumana Keita
A data scientist who likes to write and share knowledge with the data and AI community
Object detection is a computer vision technique for identifying and localizing objects within
an image or a video.
Image localization is the process of identifying the correct location of one or multiple
objects using bounding boxes, which correspond to rectangular shapes around the
objects. This process is sometimes confused with image classification or image recognition,
which aims to assign an image, or an object within an image, to one of a set of
categories or classes.
The illustration below corresponds to the visual representation of the previous explanation.
The object detected within the image is a “Person.”
Image by Author
In this conceptual blog, you will first learn about the benefits of object detection before
being introduced to YOLO, a state-of-the-art object detection algorithm.
In the second part, we will focus on the YOLO algorithm and how it works. After that,
we will present some real-life applications of YOLO.
The last section will explain how YOLO evolved from 2015 to 2024 before concluding with the
next steps.
What is YOLO?
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm
introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in
their famous research paper You Only Look Once: Unified, Real-Time Object Detection.
The authors frame the object detection problem as a regression problem rather than a
classification task: a single convolutional neural network (CNN) predicts spatially separated
bounding boxes and associates class probabilities with each of them.
If you're interested in image classification, consider taking the Image Processing with Keras
in Python course, where you'll build Keras-based deep neural networks for image
classification tasks. If you are more interested in PyTorch, Deep Learning with PyTorch will
teach you about convolutional neural networks and how to use them to build much more
powerful models.
Speed
Detection accuracy
Good generalization
Open-source
1. Speed
YOLO is extremely fast because it does not deal with complex pipelines. It can process
images at 45 Frames Per Second (FPS). In addition, YOLO reaches more than twice the mean
Average Precision (mAP) compared to other real-time systems, which makes it a great
candidate for real-time processing.
From the graphic below, we observe that YOLO is far ahead of the other object detectors,
at 91 FPS.
3. Better generalization
This is especially true for the new versions of YOLO, which will be discussed later in the
article. With those advancements, YOLO has gone a little further by providing better
generalization for new domains, which makes it great for applications relying on fast and
robust object detection.
For instance, the Automatic Detection of Melanoma with Yolo Deep Convolutional Neural
Networks paper shows that the first version, YOLOv1, has the lowest mAP for the automatic
detection of melanoma disease, compared to YOLOv2 and YOLOv3.
4. Open-source
Making YOLO open-source has led the community to improve the model constantly. This is
one of the reasons why YOLO has made so many improvements in such a limited time.
YOLO Architecture
YOLO architecture is similar to GoogLeNet. As illustrated below, it has 24 convolutional
layers, four max-pooling layers, and two fully connected layers.
https://www.datacamp.com/blog/yolo-object-detection-explained 2/14
12/6/24, 8:04 PM YOLO Object Detection Explained: A Beginner's Guide | DataCamp
The input image is resized to 448x448 before going through the convolutional network.
A 1x1 convolution is first applied to reduce the number of channels, followed by a 3x3
convolution to generate a cuboidal output.
The activation function under the hood is ReLU, except for the final layer, which uses a
linear activation function.
Some additional techniques, such as batch normalization and dropout, regularize the
model and prevent it from overfitting.
By completing the Deep Learning in Python course, you will be ready to use Keras to train
and test complex, multi-output networks and dive deeper into deep learning.
“Imagine you built a YOLO application that detects players and soccer balls from a given
image.
But how can you explain this process to someone, especially non-initiated people?
→ That is the whole point of this section. You will understand the whole process of how
YOLO performs object detection and how to get image (B) from image (A).”
Image by Author
Residual blocks
Non-Maximum Suppression.
1. Residual blocks
This first step starts by dividing the original image (A) into NxN grid cells of equal shape,
where N, in our case, is 4, as shown in the image on the right. Each cell in the grid is
responsible for localizing and predicting the class of the object that it covers, along with the
probability/confidence value.
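The grid assignment described above can be sketched in a few lines. This is only an illustration, not YOLO's actual code: the helper name `responsible_cell`, the image size, and the example box center are assumptions, and only the 4x4 grid comes from the text.

```python
def responsible_cell(center_x, center_y, image_w, image_h, n=4):
    """Return (row, col) of the grid cell that contains a box center.

    Hypothetical helper for illustration; n=4 matches the 4x4 grid above.
    """
    cell_w = image_w / n
    cell_h = image_h / n
    # Clamp so that a center lying exactly on the right or bottom edge
    # of the image still falls inside the grid.
    col = min(int(center_x // cell_w), n - 1)
    row = min(int(center_y // cell_h), n - 1)
    return row, col


# Assumed example: a 400x400 image with an object centered at (310, 290).
print(responsible_cell(310, 290, 400, 400))  # (2, 3)
```

With 100-pixel cells, the center falls in the third row and fourth column (zero-based indices 2 and 3), so that cell would be the one responsible for this object.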
Image by Author
YOLO determines the attributes of these bounding boxes using a single regression module in
the following format, where Y is the final vector representation for each bounding box.
Y = [pc, bx, by, bh, bw, c1, c2]
This is especially important during the training phase of the model.
pc corresponds to the probability score of the grid containing an object. For instance,
all the grids in red will have a probability score higher than zero. The image on the right
is the simplified version since the probability of each yellow cell is zero (insignificant).
Image by Author
bx, by are the x and y coordinates of the center of the bounding box with respect to
the enveloping grid cell.
bh, bw correspond to the height and the width of the bounding box with respect to the
enveloping grid cell.
c1 and c2 correspond to the two classes, Player and Ball. You can have as many
classes as your use case requires.
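As a rough sketch of how the attributes above fit together, the snippet below builds the vector for one grid cell in the order pc, bx, by, bh, bw, c1, c2. The function name `encode_box`, the pixel coordinates, and the cell geometry are illustrative assumptions; the normalization by the cell size follows the description in the list above.

```python
import numpy as np


def encode_box(box_cx, box_cy, box_w, box_h,
               cell_x0, cell_y0, cell_w, cell_h,
               class_index, num_classes=2):
    """Build [pc, bx, by, bh, bw, c1, ..., cK] for a cell that contains an object.

    Hypothetical helper: coordinates are in pixels, (cell_x0, cell_y0) is the
    top-left corner of the enveloping grid cell.
    """
    bx = (box_cx - cell_x0) / cell_w  # center offset inside the cell, in [0, 1]
    by = (box_cy - cell_y0) / cell_h
    bh = box_h / cell_h               # may exceed 1 if the box is taller than the cell
    bw = box_w / cell_w
    classes = np.zeros(num_classes)
    classes[class_index] = 1.0        # one-hot class indicator
    return np.concatenate(([1.0, bx, by, bh, bw], classes))


# Assumed example: a "Player" (class 0) centered at (310, 290), 80 px wide and
# 150 px tall, inside the 100x100 cell whose top-left corner is (300, 200).
y = encode_box(310, 290, 80, 150, 300, 200, 100, 100, class_index=0)
print(y)
```

For a cell that contains no object, pc would be 0 and the remaining entries carry no meaning.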
To understand, let’s pay closer attention to the player on the bottom right.
Image by Author
The user defines an IOU selection threshold, which can be, for instance, 0.5.
Then, YOLO computes the IOU of each grid cell, which is the Intersection area divided
by the Union Area.
Finally, it ignores the prediction of the grid cells having an IOU ≤ threshold and considers
those with an IOU > threshold.
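The three steps above can be sketched as follows. The box format (x_min, y_min, x_max, y_max), the function names, and the choice of comparing each candidate against a single reference box are assumptions made for the example, since the text does not specify them.

```python
def iou(box_a, box_b):
    """Intersection area divided by union area of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are zero when the boxes do not intersect.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_above_threshold(candidates, reference, threshold=0.5):
    """Keep only the candidates whose IOU with the reference is > threshold."""
    return [box for box in candidates if iou(box, reference) > threshold]


reference = (0, 0, 10, 10)
candidates = [(0, 0, 10, 8), (5, 5, 15, 15)]
print(keep_above_threshold(candidates, reference))  # [(0, 0, 10, 8)]
```

The first candidate overlaps the reference with an IOU of 0.8 and is kept; the second only reaches 25/175 and is ignored.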
Below is an illustration of applying the grid selection process to the bottom left object. We
can observe that the object originally had two grid candidates, and only “Grid 2” was
selected at the end.
Image by Author
YOLO Applications
YOLO object detection has many applications in our day-to-day life. In this section, we
will cover some of them in the following domains: healthcare, agriculture, security
surveillance, and self-driving cars. Let’s look at each one with specific examples.
Healthcare
Specifically, in surgery, it can be challenging to localize organs in real time due to biological
diversity from one patient to another. Kidney Recognition in CT used YOLOv3 to facilitate
localizing kidneys in 2D and 3D from computerized tomography (CT) scans.
The Biomedical Image Analysis in Python course can help you learn the fundamentals of
exploring, manipulating, and measuring biomedical image data using Python.
Agriculture
Artificial Intelligence and robotics play a major role in modern agriculture. Harvesting
robots are vision-based robots that were introduced to replace manual picking of fruits and
vegetables. One of the best models in this field uses YOLO. In Tomato detection based on a
modified YOLOv3 framework, the authors describe how they used YOLO to identify the
types of fruits and vegetables for efficient harvest.
Security surveillance
Even though object detection is mostly used in security surveillance, it is not the only
application. YOLOv3 has been used during the COVID-19 pandemic to estimate social
distance violations between people.
You can further your reading on this topic from A deep-learning-based social distance
monitoring framework for COVID-19.
Self-driving cars
Real-time object detection is part of the DNA of autonomous vehicle systems. This
integration is vital for autonomous vehicles because they need to properly identify the
correct lanes and all the surrounding objects and pedestrians to increase road safety.
YOLO's real-time aspect makes it a better candidate compared to simple image
segmentation approaches.
However, like many other solutions, the first version of YOLO has its own limitations:
Finally, the loss function used to approximate the detection performance treats errors
the same for both small and large bounding boxes, which in fact creates incorrect
localizations.
YOLOv2 or YOLO9000
YOLOv2 was created in 2016 with the idea of making the YOLO model better, faster and
stronger.
The improvements include, but are not limited to, the use of Darknet-19 as the new architecture,
batch normalization, higher-resolution inputs, convolution layers with anchors,
dimensionality clustering, and fine-grained features.
Batch normalization
Adding a batch normalization layer improved the performance by 2% mAP. This batch
normalization included a regularization effect, preventing overfitting.
Higher input resolution
YOLOv2 directly uses a higher-resolution 448×448 input instead of 224×224, which makes the
model adjust its filters to perform better on higher-resolution images. This approach
increased the accuracy by 4% mAP, after being trained for 10 epochs on the ImageNet
data.
Convolution layers using anchor boxes
Instead of predicting the exact coordinates of the objects' bounding boxes as YOLOv1
does, YOLOv2 simplifies the problem by removing the fully connected layers and using anchor
boxes to predict the bounding boxes. This approach slightly decreases the accuracy but
improves the model recall by 7%, which gives more room for improvement.
Dimensionality clustering
YOLOv2 automatically finds the previously mentioned anchor boxes using k-means
dimensionality clustering with k=5 instead of performing a manual selection. This novel
approach provides a good tradeoff between the recall and the precision of the model.
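A minimal sketch of this clustering step is shown below. It runs k-means on (width, height) pairs and uses 1 - IoU as the distance, which is the distance described in the YOLOv2 paper. The function names, the random initialization, and the synthetic box sizes are assumptions for illustration, not the original implementation.

```python
import numpy as np


def wh_iou(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as sharing one corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_boxes = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_centroids = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_boxes + area_centroids - inter)


def kmeans_anchors(boxes, k=5, iterations=100, seed=0):
    """Cluster (w, h) pairs into k anchor shapes with distance 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iterations):
        # The smallest 1 - IoU distance is the largest IoU.
        assignment = np.argmax(wh_iou(boxes, centroids), axis=1)
        new_centroids = np.array([
            boxes[assignment == j].mean(axis=0) if np.any(assignment == j)
            else centroids[j]  # keep an empty cluster's centroid unchanged
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


# Synthetic (w, h) pairs in pixels, made up for the example.
boxes = np.random.default_rng(1).uniform(10, 200, size=(200, 2))
anchors = kmeans_anchors(boxes, k=5)
print(anchors.shape)  # (5, 2)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering error, so the resulting anchors reflect box shapes rather than box sizes alone.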
For a better understanding of the k-means dimensionality clustering, take a look at our K-
Means Clustering in Python with scikit-learn and K-Means Clustering in R tutorials. They
dive into the concept of k-means clustering using Python and R.
Fine-grained features
YOLOv2 predictions are made on 13x13 feature maps, which is sufficient for detecting large
objects. To detect much finer objects, the architecture turns the 26 × 26 × 512 feature
map into a 13 × 13 × 2048 feature map and concatenates it with the
original features. This approach improved the model performance by 1%.
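The reshaping of the 26 × 26 × 512 map into 13 × 13 × 2048 can be illustrated with a space-to-depth operation: every 2 × 2 spatial block is stacked along the channel axis. The sketch below uses NumPy with a height, width, channels layout; the function name and the 1024-channel size of the coarse map are assumptions for the example.

```python
import numpy as np


def space_to_depth(x, block=2):
    """Stack each block x block spatial patch of an (H, W, C) map into channels."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/b, W/b, b, b, C)
    return x.reshape(h // block, w // block, block * block * c)


fine = np.zeros((26, 26, 512), dtype=np.float32)     # higher-resolution features
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # assumed channel count

reshaped = space_to_depth(fine)
merged = np.concatenate([reshaped, coarse], axis=-1)
print(reshaped.shape)  # (13, 13, 2048)
print(merged.shape)    # (13, 13, 3072)
```

No values are discarded in the process: the spatial resolution is halved, but each output location keeps all four neighbouring feature vectors from the finer map.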
YOLOv3
The change mainly includes a new network architecture: Darknet-53. This is a 106-layer neural
network, with upsampling networks and residual blocks. It is much bigger and more
accurate compared to Darknet-19, which is the backbone of YOLOv2. This new architecture
has been beneficial on many levels:
A logistic regression model is used by YOLOv3 to predict the objectness score for each
bounding box.
Instead of using softmax as performed in YOLOv2, independent logistic classifiers have been
introduced to accurately predict the class of the bounding boxes. This is especially useful when
facing more complex domains with overlapping labels (e.g. Person → Soccer Player). Using a
softmax would constrain each box to have only one class, which is not always true.
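The difference between the two approaches can be seen on a small example. The class names, the raw scores, and the 0.5 decision threshold below are made up for illustration; only the softmax versus independent-logistic contrast comes from the text.

```python
import numpy as np


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    exp = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exp / exp.sum()


def sigmoid(logits):
    """Independent per-class probabilities, each in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))


labels = ["Person", "Soccer Player", "Ball"]
logits = np.array([3.0, 2.5, -4.0])  # made-up raw scores for one box

softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)

# With independent logistic classifiers, several labels can be active at once.
multi_label = [name for name, p in zip(labels, sigmoid_probs) if p > 0.5]
print(multi_label)  # ['Person', 'Soccer Player']
```

With softmax, "Person" and "Soccer Player" have to share the probability mass, so a high score for one necessarily lowers the other, whereas the logistic outputs let both stay high for the same box.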
YOLOv3 makes predictions at three different scales for each location within the input
image, relying on upsampled feature maps from the previous layers. This strategy yields
fine-grained and more meaningful semantic information, which leads to better-quality detections.
YOLOv4
The image below shows YOLOv4 improving on YOLOv3's AP and FPS by 10% and
12% respectively.
YOLOv4 Speed compared to YOLOv3 and other state-of-the-art object detectors (source)
YOLOv4 is specifically designed for production systems and optimized for parallel
computations.
The backbone of YOLOv4’s architecture is CSPDarknet53, a network containing 29
convolution layers with 3 × 3 filters and approximately 27.6 million parameters.
This architecture, compared to YOLOv3, adds the following components for better object
detection:
Spatial Pyramid Pooling (SPP) block significantly increases the receptive field,
segregates the most relevant context features, and does not affect the network speed.
Instead of the Feature Pyramid Network (FPN) used in YOLOv3, YOLOv4 uses PANet for
parameter aggregation from different detection levels.
Data augmentation uses the mosaic technique that combines four training images in
addition to a self-adversarial training approach.
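To give an idea of what combining four training images means, here is a deliberately simplified sketch that tiles four equally sized images into a 2 × 2 mosaic. The function name is hypothetical, and the real Mosaic augmentation goes further (random placement, rescaling, cropping, and remapping of the bounding box labels), none of which is shown here.

```python
import numpy as np


def simple_mosaic(images):
    """Tile four equally sized (H, W, C) images into one (2H, 2W, C) image.

    Simplified illustration only: no random center, no cropping, and no
    bounding box remapping.
    """
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    top = np.concatenate([images[0], images[1]], axis=1)     # side by side
    bottom = np.concatenate([images[2], images[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)              # stacked vertically


# Four dummy 64x64 RGB images, each filled with a different constant value.
tiles = [np.full((64, 64, 3), value, dtype=np.uint8) for value in (0, 85, 170, 255)]
mosaic = simple_mosaic(tiles)
print(mosaic.shape)  # (128, 128, 3)
```

Because each training sample then contains objects from four different contexts, the model sees more varied scenes per batch.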
YOLOR
Explicit knowledge comes from normal, conscious learning. Implicit knowledge, on the other
hand, is acquired subconsciously (from experience).
Combining these two techniques, YOLOR is able to create a more robust architecture based on
three processes: (1) feature alignment, (2) prediction alignment for object detection, and (3)
canonical representation for multi-task learning.
Feature alignment
This approach introduces an implicit representation into the feature map of every feature
pyramid network (FPN), which improves the precision by about 0.5%.
Prediction alignment
The model predictions are refined by adding implicit representation to the output layers of
the network.
Canonical representation for multi-task learning
Performing multi-task training requires the execution of the joint optimization on the loss
function shared across all the tasks. This process can decrease the overall performance of
the model, and this issue can be mitigated with the integration of the canonical
representation during the model training.
From the following graphic, we can observe that YOLOR achieved state-of-the-art inference
speed on the MS COCO data compared to other models.
YOLOX
YOLOX uses a baseline that is a modified version of YOLOv3, with Darknet-53 as its backbone.
Published in the paper YOLOX: Exceeding YOLO Series in 2021, YOLOX brings to the table the
following four key characteristics to create a better model compared to the older versions.
The coupled head used in the previous YOLO versions is shown to reduce the models’
performance. YOLOX uses a decoupled head instead, which separates the classification and
localization tasks, thus increasing the performance of the model.
Integration of Mosaic and MixUp into the data augmentation approach considerably
increased YOLOX’s performance.
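As a rough illustration of the MixUp part, the sketch below blends two images with a random weight and keeps the boxes of both. It is an assumption-laden example rather than YOLOX's implementation: the function name, the Beta(1, 1) sampling of the weight, and the box tuple format are all illustrative choices.

```python
import numpy as np


def mixup(image_a, image_b, boxes_a, boxes_b, alpha=1.0, rng=None):
    """Blend two same-sized images and keep the labels of both.

    Hypothetical helper: boxes are plain lists of tuples, alpha is the
    parameter of the Beta distribution used to draw the blending weight.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))  # blending weight in [0, 1]
    mixed = (lam * image_a.astype(np.float32)
             + (1.0 - lam) * image_b.astype(np.float32))
    # In this detection-style variant, objects from both images stay in the labels.
    return mixed, list(boxes_a) + list(boxes_b), lam


black = np.zeros((4, 4, 3), dtype=np.uint8)
white = np.full((4, 4, 3), 255, dtype=np.uint8)
mixed, boxes, lam = mixup(
    black, white,
    [(0, 0, 2, 2, 0)],   # made-up box: x_min, y_min, x_max, y_max, class
    [(1, 1, 3, 3, 1)],
    rng=np.random.default_rng(0),
)
print(mixed.shape, len(boxes))  # (4, 4, 3) 2
```

The blended image shows both scenes semi-transparently, which pushes the model to detect objects under partial occlusion and unusual backgrounds.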
An anchor-free system
Anchor-based algorithms perform clustering under the hood, which increases the inference
time. Removing the anchor mechanism in YOLOX reduced the number of predictions per
image, and significantly improved inference time.
Instead of using the intersection over union (IoU) approach, the authors introduced SimOTA, a
more robust label assignment strategy that achieves state-of-the-art results by not only
reducing the training time but also avoiding extra hyperparameter issues. In addition to that,
it improved the detection mAP by 3%.
YOLOv5
YOLOv5, compared to other versions, does not have a published research paper, and it is the
first version of YOLO to be implemented in PyTorch, rather than Darknet.
Released by Glenn Jocher in June 2020, YOLOv5, similarly to YOLOv4, uses CSPDarknet53
as the backbone of its architecture. The model family comes in five sizes: YOLOv5n
(nano, the smallest), YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x (largest).
One of the major improvements in the YOLOv5 architecture is the integration of the Focus layer,
a single layer that replaces the first three layers of YOLOv3.
This integration reduced the number of layers and parameters, and also
increased both forward and backward speed without any major impact on the mAP.
The following illustration compares the training time between YOLOv4 and YOLOv5s.
YOLOv6
Written in PyTorch, this new version was not part of the official YOLO but still got the name
YOLOv6 because its backbone was inspired by the original one-stage YOLO architecture.
YOLOv6 provides outstanding results compared to the previous YOLO versions in terms of
accuracy and speed on the COCO dataset as illustrated below.
Comparison of state-of-the-art efficient object detectors. All models are tested with
TensorRT 7 except that the quantized model is with TensorRT 8 (source)
All these characteristics make YOLOv6 the right algorithm for industrial applications.
YOLOv7
Comparison of YOLOv7 inference time with other real-time object detectors (source)
YOLOv7 made major changes at two levels: (1) its architecture and (2) its trainable
bag-of-freebies:
Architectural level
YOLOv7 reformed its architecture by integrating the Extended Efficient Layer Aggregation
Network (E-ELAN) which allows the model to learn more diverse features for better learning.
In addition, YOLOv7 scales its architecture by concatenating the architecture of the models
it is derived from such as YOLOv4, Scaled YOLOv4, and YOLO-R. This allows the model to
meet the needs of different inference speeds.
Trainable bag-of-freebies
The term bag-of-freebies refers to methods that improve the model’s accuracy without
increasing the inference cost, typically at the price of extra training cost. This is how
YOLOv7 improved its detection accuracy while keeping its inference speed.
Lightweight models further optimize speed and accuracy trade-offs, with smaller model
sizes aimed at real-time applications on edge devices.
It adds support for custom data to integrate easily with custom datasets, making it
versatile for specific applications.
YOLOv8 also adds new APIs for easier deployment and model management in
production settings.
It's tailored for optimal balance between performance and resource usage, perfect for
both high-accuracy models and low-latency applications.
It automatically adjusts the resolution for different objects within an image, further
optimizing the inference process.
This latest version highlights substantial advancements in both the training process and the
architecture, focusing on efficiency, adaptability, and precision for real-time applications.
Conclusion
This article has covered the benefits of YOLO compared to other state-of-the-art object
detection algorithms, and its evolution from 2015 to 2024.
Given the rapid advancement of YOLO, there is no doubt that it will remain the leader in the
field of object detection for a very long time.
The next step of this article will be the application of the YOLO algorithm to real-world
cases. Until then, our Introduction to Deep Learning in Python course can help you learn the
fundamentals of neural networks and how to build deep learning models using Keras 2.0 in
Python.
AUTHOR
Zoumana Keita
A multi-talented data scientist who enjoys sharing his knowledge and giving back to others,
Zoumana is a YouTube content creator and a top tech writer on Medium. He finds joy in
speaking, coding, and teaching. Zoumana holds two master’s degrees. The first one in
computer science with a focus in Machine Learning from Paris, France, and the second one
in Data Science from Texas Tech University in the US. His career path started as a Software
Developer at Groupe OPEN in France, before moving on to IBM as a Machine Learning
Consultant, where he developed end-to-end AI solutions for insurance companies. Zoumana
joined Axionable, the first Sustainable AI startup based in Paris and Montreal. There, he
served as a Data Scientist and implemented AI products, mostly NLP use cases, for clients
from France, Montreal, Singapore, and Switzerland. Additionally, 5% of his time was
dedicated to Research and Development. As of now, he is working as a Senior Data Scientist
at IFC, the World Bank Group.