
Published by: International Journal of Engineering Research & Technology (IJERT)
http://www.ijert.org ISSN: 2278-0181
Vol. 10 Issue 02, February-2021

Object Detection and Classification using YOLOv3

Dr. S. V. Viraktamath, Madhuri Yavagal, Rachita Byahatti
Dept. of Electronics and Communication Engineering
SDM College of Engineering and Technology, Dharwad, India

Abstract—Autonomous driving will increasingly require more and more dependable network-based mechanisms, demanding redundant, real-time implementations. Object detection is a growing field of research within computer vision. The ability to identify and classify objects, either in a single scene or across multiple frames, has gained huge importance in a variety of settings: while operating a vehicle, for instance, the operator's attention may lapse, which can lead to disastrous collisions. To mitigate such problems, Autonomous Vehicles and ADAS (Advanced Driver Assistance Systems) handle the task of identifying and classifying objects using deep learning techniques such as the Faster Region-based Convolutional Neural Network (Faster R-CNN), the You Only Look Once model (YOLO) and the Single Shot Detector (SSD) to improve the precision of object detection. YOLO is a powerful technique, as it achieves high precision while running in real time. This paper explains the architecture and working of the YOLO algorithm for the purpose of detecting and classifying objects, trained on the classes from the COCO dataset.

Keywords—YOLO, Convolutional Neural Network, Bounding Box, Anchor Box, Fast Region Based Convolutional Neural Network, Intersection over Union, Non-Max Suppression, COCO Dataset.

I. INTRODUCTION
Fast, exact algorithms for object detection would permit computers to drive vehicles without specialized sensors, enable assistive devices to convey real-time scene information to human users, and open the potential for general-purpose, responsive robotic systems [1]. Object detection involves identifying a region of interest for an object of a given class in a picture [2]. Object detection algorithms can be arranged into two kinds:
1. Classification-based algorithms, which work in two steps. First, they define and select regions of interest in an image. Second, these regions are classified with convolutional neural networks. This arrangement is slow, since estimates must be made for every selected region. Commonly recognized examples of this type of algorithm are the Region-based Convolutional Neural Network (R-CNN) and its successors Fast R-CNN, Faster R-CNN and, most recently, Mask R-CNN [2].
2. Regression-based algorithms, which, rather than selecting regions of interest in an image, predict classes and bounding boxes for the whole picture in one run of the algorithm. The two most common models in this set are the YOLO family of algorithms, which provides maximum speed and precision for multiple-object detection in a single frame [3], and the SSD; these algorithms are typically used to track objects in real time.

To understand the YOLO algorithm, it is important to know what it predicts. It differs from the majority of neural network models because it uses a single convolutional network that predicts bounding boxes and the corresponding class probabilities. The bounding boxes are weighted by the probabilities and the model makes its detections dependent on the final weights. Thus, the end-to-end output of the model can be directly optimized and, as a result, images can be processed at a rapid pace [4]. Every bounding box can be represented using four descriptors:
1. Centre of the bounding box (bx, by)
2. Width (bw)
3. Height (bh)
4. Value 'c', referring to an object class
A pc value also needs to be predicted, indicating the likelihood that there is an object in the bounding box [5].

II. METHODOLOGY
YOLO takes an input image and divides it into grids (say a 3 X 3 grid), as shown in Fig 1.

Fig. 1: Input image divided into 3 X 3 grid [6]

On every grid, image classification and localization are applied. The bounding boxes and their corresponding class probabilities for objects are then predicted by YOLO.
In order to train the model, labelled data needs to be supplied to it. Suppose the image is divided into a 3 X 3 grid and there is a total of 3 classes into which the objects need to be categorised. Suppose the classes are people, cars and trucks; the y label is then an 8-dimensional vector for each grid cell, as shown in Table 1.

Table 1: 8-dimensional y label

y = [pc, bx, by, bh, bw, c1, c2, c3]ᵀ

Here,
• pc describes the probability of an object being present in the grid cell.
• bx, by, bh, bw define the bounding box coordinates if an entity is present.
• c1, c2, c3 reflect the three classes.
Consider the first grid of Fig 1, shown in Fig 2.

Fig. 2: Grid having no object [6]

Since there is no object in this grid, pc will be zero and all the other entries in the y label are don't-care values ('?'): as there is no entity in the grid, it does not matter what the bx, by, bh, bw, c1, c2 and c3 values are.
There are two objects (two cars) in Fig 1. YOLO takes the mid-point of each object and assigns the object to the grid cell containing that mid-point. The y label for the left-centred grid containing a car is shown in Table 2.

Table 2: y label of the left-centred grid in Fig 1

y = [1, bx, by, bh, bw, 0, 1, 0]ᵀ

Since there is an entity in this grid, pc will be 1, and bx, by, bh, bw are determined relative to this same grid cell. Since the car is the 2nd class, c2 = 1 and c1 = c3 = 0. An 8-dimensional output vector is produced for each of the 9 grid cells, so the output has dimension 3 X 3 X 8.
An object is assigned to the single grid cell in which its mid-point is found, even if the entity spreads over more than one grid cell. By increasing the number of grid cells, we can reduce the odds of different objects occurring in the same cell.
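As an illustration of this encoding, the following minimal Python sketch (not the paper's implementation; the grid size, class index and box values are assumed example inputs) builds the 3 X 3 X 8 target described above:

import numpy as np

S, C = 3, 3                     # 3 x 3 grid, 3 classes (people, cars, trucks)
y = np.zeros((S, S, 5 + C))     # per cell: [pc, bx, by, bh, bw, c1, c2, c3]

def encode(y, mx, my, bh, bw, cls):
    # Assign the object to the grid cell containing its mid-point (mx, my),
    # where mx and my are normalized image coordinates in [0, 1].
    col, row = int(mx * S), int(my * S)
    y[row, col, 0] = 1.0                      # pc = 1: an object is present
    y[row, col, 1:5] = [mx * S - col,         # bx: mid-point within the cell
                        my * S - row,         # by
                        bh, bw]               # height/width relative to the cell
    y[row, col, 5 + cls] = 1.0                # one-hot class entry (c2 for a car)

encode(y, mx=0.15, my=0.5, bh=0.8, bw=0.4, cls=1)  # a car in the left-centred cell
print(y.shape)                                     # (3, 3, 8)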
III. SPECIFICATIONS AND SYSTEM ARCHITECTURE
YOLO, in a single glance, takes the entire image and predicts the bounding box coordinates and class probabilities for these boxes. YOLO's greatest advantage is its outstanding pace: it is extremely fast and can handle 45 frames per second [1]. Amongst the three versions of YOLO, version 3 is the fastest and the most accurate at detecting small objects.
The proposed algorithm, YOLO version 3, consists of a total of 106 layers [10]. The architecture is made up of 3 distinct layer types. First is the residual layer, which is formed when an activation is forwarded to a deeper layer in the neural network; in a residual setup, the outputs of layer 1 are added to the outputs of layer 2. Second is the detection layer, which performs detection at 3 different scales or stages; the size of the grid is increased for each detection. Third is the up-sampling layer, which increases the spatial resolution of an image; here the image is up-sampled before it is scaled. A concatenation operation is also used, to concatenate the outputs of a previous layer with the present layer, and an addition operation is used to add previous layers. In Fig 3, the pink coloured blocks are the residual layers, the orange ones are the detection layers and the green ones are the up-sampling layers. Detection at three different scales is shown in Fig 3.


Fig. 3: Architecture of YOLOv3 [4]

YOLOv3 makes predictions at 3 different scales. The detection layers detect on feature maps of three different sizes, with strides 32, 16 and 8 respectively. This means that, with an input of 416 x 416, detections are made on scales of 13 x 13, 26 x 26 and 52 x 52.
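The arithmetic behind these scales is simple integer division of the input size by each stride; a short check (assuming the standard 416 x 416 input):

input_size = 416
for stride in (32, 16, 8):
    cells = input_size // stride         # side length of the detection grid
    print(f"stride {stride}: {cells} x {cells}")
# prints: stride 32: 13 x 13, stride 16: 26 x 26, stride 8: 52 x 52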
The working of YOLO is explained in sections A to I below.

A. Encoding Bounding Boxes

Fig. 4: Right-centred grid [6]

Consider the right-centred grid in Fig 4, which includes a car; for this grid, bx, by, bh and bw are determined. This grid will have a y label of the form shown in Table 2.

B. Making Predictions
The following formulae explain how the network output is converted to obtain bounding box predictions. bx, by, bw and bh are the x, y centre coordinates, width and height of our prediction. The raw outputs of the network are tx, ty, tw and th. cx and cy are the top-left coordinates of the grid cell, and pw and ph are the dimensions of the anchor box [9].

bx = σ(tx) + cx -- (1)
by = σ(ty) + cy -- (2)
bw = pw · e^(tw) -- (3)
bh = ph · e^(th) -- (4)

pc = 1 in Fig 4, since this grid has an object, and since it is a car, c2 = 1. In YOLO, bx and by are the x and y coordinates of the mid-point of the object relative to the grid cell; in this case bx ≈ 0.3 and by ≈ 0.4.

Fig. 5: x and y coordinates of the mid-point of the object, in cell coordinates running from (0,0) at the top-left to (1,1) at the bottom-right [6]

bh is the ratio of the height of the bounding box (the red box in Fig 5) to the height of the corresponding grid cell, which is about 0.8 in Fig 5. bw is the ratio of the width of the bounding box to the width of the grid cell, about 0.4. This grid (Fig 5) will have the y label shown in Table 3.

Table 3: y label of the grid shown in Fig 5

y = [1, 0.3, 0.4, 0.8, 0.4, 0, 1, 0]ᵀ

bx and by will always lie in the 0 to 1 range, since the mid-point always remains within the grid cell, while bh and bw may be greater than 1 in the event that the dimensions of the bounding box exceed the cell dimension.
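A minimal sketch of the decoding in equations (1) to (4) (illustrative only; the cell offsets and anchor dimensions are assumed example values):

import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # Convert raw network outputs (tx, ty, tw, th) into a predicted box, eqs. (1)-(4)
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sigmoid(tx) + cx        # (1) centre x: sigmoid keeps the offset in [0, 1]
    by = sigmoid(ty) + cy        # (2) centre y
    bw = pw * math.exp(tw)       # (3) width scaled from the anchor width pw
    bh = ph * math.exp(th)       # (4) height scaled from the anchor height ph
    return bx, by, bw, bh

# sigmoid(-0.85) ≈ 0.3 and sigmoid(-0.4) ≈ 0.4, echoing the Fig 5 example
print(decode_box(-0.85, -0.4, 0.1, 0.3, cx=6, cy=6, pw=3.6, ph=2.4))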


C. Intersection over Union and Non-Max Suppression
Intersection over Union (IoU) measures the intersection of the ground-truth bounding box and the predicted bounding box over their union. Consider the ground-truth and predicted bounding boxes for a vehicle, shown in Fig 6.

Fig. 6: Intersection over Union [6]

The red box in Fig 6 is the ground-truth bounding box and the purple box is the predicted one. The IoU is the area of the intersection of these two boxes over the area of their union, i.e. the shaded yellow region in Fig 6. In general,

IoU = Area of the intersection / Area of the union

i.e., in Fig 6, IoU = area of the yellow region / area of the green (union) region. If the IoU is greater than 0.5, the estimate is considered reasonable. 0.5 is an arbitrary threshold that can be changed as a specific problem demands; intuitively, the higher the threshold is raised, the better the surviving predictions become.
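A minimal sketch of this computation for axis-aligned boxes, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

def iou(a, b):
    # Intersection over Union of two boxes in (x1, y1, x2, y2) corner format
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))   # 4 / 28 ≈ 0.14, well below the 0.5 threshold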
There is one more strategy that can radically boost YOLO's output: Non-Max Suppression. One of the most well-known problems with object detection algorithms is that they may recognize an object on multiple occasions instead of detecting it only once.

Fig. 7: Vehicles recognised more than once [6]

Cars are recognised more than once, as seen in Fig 7. To obtain a single detection of each object, Non-Max Suppression is employed:
1. It looks at the probability of each detection first and selects the highest. In Fig 7, 0.9 is the greatest likelihood, so the box with probability 0.9 is selected first.
2. The method then looks at the other boxes in the frame. Boxes that have a high IoU with the currently chosen box are removed; the boxes with probabilities 0.6 and 0.7 are suppressed.
3. After those boxes have been removed, the next box is chosen from the remaining candidates, which in this situation is the one with probability 0.8.
4. Once again, it looks at the IoU with this box and suppresses boxes with a high IoU.
5. The algorithm repeats steps 1 to 4 until all final bounding boxes are either selected or suppressed.
Non-Max Suppression thus keeps the boxes of maximum likelihood and suppresses the close-by boxes of non-maximum probability; as shown in Fig 8, a single detection per object is obtained after applying the above-mentioned steps.

Fig. 8: Vehicles recognised only once [6]
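A minimal sketch of these five steps, reusing the iou() helper defined above (the 0.5 suppression threshold is an assumed value, as discussed earlier):

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: pick the highest-scoring box, suppress overlapping boxes, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)               # steps 1 and 3: highest remaining score
        keep.append(best)
        order = [i for i in order         # steps 2 and 4: drop high-IoU neighbours
                 if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep                           # step 5: loop until no boxes remain

boxes = [(0, 0, 4, 4), (1, 1, 4, 4), (10, 10, 14, 14)]
print(non_max_suppression(boxes, scores=[0.9, 0.6, 0.8]))   # -> [0, 2]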


D. Anchor Boxes
When multiple objects have their mid-points in a single grid cell, this leads to the concept of anchor boxes. Consider Fig 9, split into a 3 X 3 grid: an object is allocated to a grid cell by taking its mid-point, and here the mid-points of both objects (a person and a car) lie in the same cell.

Fig. 9: Midpoint of 2 objects in same grid [6]

Two separate shapes, called anchor boxes, are predefined so that both boxes can be produced: for each grid cell, instead of one output, two outputs are obtained, one per anchor box.

Fig. 10: Two outputs of one grid

The y label of YOLO with the anchor boxes is as shown in Table 4.

Table 4: y label with 2 anchor boxes

y = [pc, bx, by, bh, bw, c1, c2, c3, pc, bx, by, bh, bw, c1, c2, c3]ᵀ

The first 8 rows of the y label represent anchor box 1 and the next 8 represent the 2nd anchor box. Objects are allocated to anchor boxes on the basis of the similarity between the object's bounding box and the anchor box shape. Since the shape of anchor box 1 is similar to the person's bounding box, the person is allocated to anchor box 1 and the car to anchor box 2. Instead of 3 X 3 X 8 (using a 3 X 3 grid and 3 classes), the output in this situation is 3 X 3 X 16 (since 2 anchors are utilized).
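Extending the earlier encoding sketch to two anchors, the target simply gains one 8-vector per anchor, and each object is matched to the anchor of most similar shape (a sketch; the two anchor shapes are assumed example values, and it reuses the iou() helper from section C):

import numpy as np

S, A, C = 3, 2, 3                   # 3 x 3 grid, 2 anchor boxes, 3 classes
y = np.zeros((S, S, A * (5 + C)))   # 16 values per cell: two stacked 8-vectors
anchors = [(0.4, 1.2), (1.5, 0.6)]  # assumed (width, height): tall vs. wide shape

def best_anchor(bw, bh):
    # Match the object to the anchor its shape overlaps most (IoU of boxes
    # aligned at the origin), using the iou() helper defined earlier
    return max(range(A),
               key=lambda a: iou((0, 0, anchors[a][0], anchors[a][1]),
                                 (0, 0, bw, bh)))

print(y.shape)                  # (3, 3, 16)
print(best_anchor(0.4, 1.0))    # 0: the tall, person-like box matches anchor box 1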
E. Flowchart
The steps that the YOLO algorithm follows (Fig 11): Input images → Bounding box prediction → Objectness > 0.5 (bounding boxes below this are ignored) → Identifying the class confidence → Applying non-max suppression → Output image with labelled objects.

Fig. 11: Flow chart showing the working of YOLOv3
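This flow translates directly to a short top-level loop. The sketch below assumes a hypothetical predict(image) helper that yields (box, objectness, class_scores) triples, and reuses non_max_suppression() from section C:

def run_yolo(image, predict, obj_thresh=0.5):
    # Top-level pipeline following Fig 11 (sketch built on assumed helpers)
    candidates = []
    for box, objectness, class_scores in predict(image):    # bounding box prediction
        if objectness < obj_thresh:                         # ignore boxes below 0.5
            continue
        cls = max(range(len(class_scores)), key=class_scores.__getitem__)
        score = objectness * class_scores[cls]              # class confidence
        candidates.append((box, score, cls))
    keep = non_max_suppression([c[0] for c in candidates],
                               [c[1] for c in candidates])  # non-max suppression
    return [candidates[i] for i in keep]                    # labelled output boxes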
F. Testing
A new picture is split into the same number of grid cells that was selected during training. For each image, the model predicts a 3 x 3 x 16 output, and the 16 values per grid cell follow the same format as the training label.
G. Training
The input for training the model is images and their respective y labels. Fig 1 is divided into a 3 X 3 grid with two anchor boxes per cell and 3 separate classes of objects, so the corresponding y label has the shape 3 X 3 X 16. Training thus takes an image and maps it to a 3 X 3 X 16 target.
H. Implementation of YOLO
The implementation is organised into the following files (an illustrative inference sketch follows the list):
• Darknet: the algorithm is implemented using the open-source neural network framework Darknet, which was developed in C and CUDA to render the speedy GPU calculations necessary for real-time predictions.
• DNModel.py: the Darknet model file, a computer vision code that builds the model from the configuration file, appending each layer.
• Util.py: contains all the formulas used.
• imageprocees.py: performs the image processing tasks; it takes the input images, resizes them, performs up-sampling and applies the transpose function.
• detect.py: the main code that is run to perform object detection. It uses all the above-mentioned files and performs all the functions according to the YOLO concept.
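The paper's own scripts are not reproduced here; as a minimal alternative illustration, an equivalent inference pass can be run with OpenCV's DNN module, assuming the standard Darknet files yolov3.cfg and yolov3.weights are available locally:

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
img = cv2.imread("input.jpg")
# Resize to the 416 x 416 network input and scale pixel values to [0, 1]
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())   # the 3 detection scales
for out in outs:
    for det in out:                   # det = [bx, by, bw, bh, pc, class scores...]
        scores = det[5:]
        cls = int(np.argmax(scores))
        conf = float(det[4] * scores[cls])   # objectness x class confidence
        if conf > 0.5:
            print("class", cls, "confidence", round(conf, 2))

cv2.dnn.NMSBoxes can then be applied to the surviving boxes to complete the non-max suppression step of Fig 11.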

I. COCO Dataset
COCO (Common Objects in Context) means that the dataset images are everyday objects captured in everyday scenes. It provides multi-object labelling, segmentation mask annotations, image captioning, key-point detection and panoptic segmentation annotations with a total of 81 categories, making it a very flexible and versatile dataset [8].

IV. APPLICATIONS
1. Vehicle detection

Fig. 12: Vehicle Detection

Different vehicle types, i.e. cars, vans, bikes, buses, trains, vessels, bicycles and aircraft, are all detected by the YOLO model in real time. When any of the above vehicles is detected, a bounding box is formed around the vehicle, and the type of vehicle is shown above the bounding box.


2. Crowd detection

Fig. 13: Crowd Detection

This system uses object detection to screen the flow of humans in various places. The system can process information in real time and track the position of unusual crowds.
3. Optical character recognition

Fig. 14: Optical character recognition

Optical character recognition, regularly abbreviated as OCR, is the mechanical or electronic translation of images of printed, hand-written or published text into machine-coded text, regardless of whether the source is a scanned report, a photograph of a document, or a real-time scene.

4. Image fire detection

Fig. 15: Image fire detection

Real-time vision-based identification of fire has now been enabled for monitoring equipment, marking an important trend in the field of fire safety.
V. CONCLUSION
YOLO, "You Only Look Once," is one of the best-known and most powerful object detection models, and it is the first option for real-time identification of objects. The YOLO algorithm divides each input image into an S X S grid structure, and each grid cell is responsible for detection. These grid cells forecast the observed objects' boundary boxes; each box has five principal attributes: x and y for the coordinates, w and h for the object width and height, and a score for the possibility that the box holds an object.
In recent years, deep learning-based object detection has become a hot spot of research due to its powerful learning ability and its handling of scale variation. This paper discussed the YOLO approach, which classifies and detects objects using a single neural network. The network is easy to construct and can be trained directly on whole images. Unlike approaches that limit a classifier to a particular region through localization techniques, YOLO accesses the whole image when predicting boundaries, and it therefore produces fewer false positives in background regions. The algorithm "only looks once" in that it requires only one forward propagation pass through the network to make its estimations.

REFERENCES
[1] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A., "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.91
[2] Chandan G, Ayush Jain, Harsh Jain, and Mohana, "Real Time Object Detection and Tracking Using Deep Learning and OpenCV," Proceedings of the International Conference on Inventive Research in
[3] Computing Applications (ICIRCA 2018), IEEE Xplore Compliant Part Number: CFP18N67-ART; ISBN: 978-1-5386-2456-2
[4] Chethan Kumar B, Punitha R, and Mohana, "YOLOv3 and YOLOv4: Multiple Object Detection for Surveillance Applications," Proceedings of the Third International Conference on Smart Systems and Inventive Technology (ICSSIT 2020), IEEE Xplore Part Number: CFP20P17-ART; ISBN: 978-1-7281-5821-1
[5] Hassan, N. I., Tahir, N. M., Zaman, F. H. K., & Hashim, H., "People Detection System Using YOLOv3 Algorithm," 2020 10th IEEE International Conference on Control System, Computing and Engineering (ICCSCE). doi:10.1109/iccsce50387.2020.9204925
[6] Pulkit Sharma, "A Practical Guide to Object Detection using the Popular YOLO Framework – Part III," December 6, 2018.
[7] Nikhil Yadav, Utkarsh, "Comparative Study of Object Detection Algorithms," IRJET, 2017.
[8] Viraf, "Master the COCO Dataset for Semantic Image Segmentation," May 2020.
[9] Joseph Redmon, Ali Farhadi, "YOLOv3: An Incremental Improvement," University of Washington.
[10] Karlijn Alderliesten, "YOLOv3 — Real-time object detection," May 28, 2020.
[11] Arka Prava Jana, Abhiraj Biswas, Mohana, "YOLO based Detection and Classification of Objects in video records," 2018 IEEE International Conference On Recent Trends In Electronics Information Communication Technology (RTEICT), India.
[12] Akshay Mangawati, Mohana, Mohammed Leesan, H. V. Ravish Aradhya, "Object Tracking Algorithms for video surveillance applications," International Conference on Communication and Signal Processing (ICCSP), India, 2018, pp. 0676-0680.
