A Study On Real Time Object Detection Using Deep Learning IJERTV11IS050269

This document discusses object detection using deep learning techniques. It begins with an introduction to deep learning and popular object detection systems like CNN, R-CNN, RNN, Faster R-CNN, and YOLO. It then focuses on the authors' proposed real-time object detection model using YOLOv4 and its architecture. The conventional YOLO model struggles to detect small objects accurately, but the proposed model aims to provide more precise detection results. Deep learning is crucial for machine learning applications like object detection that are used in computer vision systems.

Published by : International Journal of Engineering Research & Technology (IJERT)

http://www.ijert.org ISSN: 2278-0181


Vol. 11 Issue 05, May-2022

A Study on Real Time Object Detection using Deep Learning

Pradyuman Tomar
Dept. of Electronics and Communication Engineering
Meerut Institute of Engineering and Technology
Meerut, India

Sagar
Dept. of Electronics and Communication Engineering
Meerut Institute of Engineering and Technology
Meerut, India

Sameer Haider
Dept. of Electronics and Communication Engineering
Meerut Institute of Engineering and Technology
Meerut, India

Abstract — Object detection is closely tied to the field of computer vision. It makes it possible to recognize instances of different objects in images and video recordings. It identifies the different characteristics of images and builds an intelligent, effective understanding of them, much as human vision does. In this paper, we start with a concise introduction to deep learning and to well-known object detection systems such as CNN (Convolutional Neural Network), R-CNN, RNN (Recurrent Neural Network), Faster R-CNN, and YOLO (You Only Look Once). We then focus on our proposed object detection model architecture, along with certain improvements and modifications. The conventional model has difficulty recognizing small objects in images. Our proposed model gives correct results with good precision.

Keywords — YOLOv4, CNN, Real-time object detection, Deep Learning, RNN.

I. INTRODUCTION
Although the human eye can instantaneously and accurately recognize a given visual scene, including its content, location, and nearby visuals, computer-vision-enabled robotic systems are often too slow and inaccurate. Any progress in this field that increases efficiency and performance may open the way to more intelligent, more human-like systems. As a result, advanced technologies that let humans accomplish tasks with little to no conscious thought will make our lives a lot easier.

For example, a car equipped with computer-vision-enabled assistive technology could foresee a crash and warn the driver before it happens, even if the driver is not aware of what is going on. Real-time object detection has therefore become a critical component in continuing to automate or replace human operations. Computer vision and object detection are important fields of machine learning, and they are expected to help unleash the hidden potential of general-purpose robotic systems in the future.

With ongoing innovation in current technology, making information transparently available to and from everybody connected to it has become a simple task. Most people have standard PCs (laptops) and cell phones, which has made this global expansion significantly more accessible. Alongside this globalization of the internet, the amount of information, data, and images accessible on the web and in the cloud grows by millions of items every day. Using electronic devices to process this data and carry out the important recognition tasks is indispensable, because people find it difficult to perform the same repetitive tasks. The first step of most such processes may involve recognizing a particular object or region in an image. Because the presence, location, size, or state of an object in each image is unpredictable, this recognition process is extremely hard to perform with a conventional hand-programmed computer algorithm.

Deep learning is an essential part of machine learning (ML). A large number of methods have been proposed for object detection, and these methods and techniques fall under deep learning. Object detection is an important field of machine learning and is widely used in computer vision. Deep learning has been growing in popularity since around 2006.

Various techniques have been proposed over time to tackle the problem of object identification. These techniques approach the solution through different stages; specifically, the core stages are recognition, classification, localization, and object detection. As technology has advanced over the years, these techniques have faced challenges such as output accuracy, resource cost, processing speed, and complexity. With the creation of the first Convolutional Neural Network (CNN) algorithms during the 1990s, inspired by the work of Yann LeCun et al. [1], and important research and innovations such as AlexNet [2], CNN algorithms have been able to provide solutions to the object detection problem through various approaches. With the goals of ease of use and improved accuracy and speed of recognition and detection, optimization-focused algorithms are continuously being developed and improved; examples developed over the years include Deep Residual Learning (ResNet) [5], VGGNet [3], and GoogLeNet [4].

IJERTV11IS050269 www.ijert.org 465


(This work is licensed under a Creative Commons Attribution 4.0 International License.)

Although these algorithms improved over time, window selection, that is, recognizing multiple objects in a given image, remained a problem. To address it, algorithms that combine region proposals, crop/warp features, bounding-box regression, and SVM classification, such as Regions with CNN features (R-CNN), were introduced. Although R-CNN was highly accurate compared with earlier approaches, its heavy consumption of time and space later prompted the creation of the Spatial Pyramid Pooling Network (SPPNet) [6].
Despite SPPNet's speed, it shared some of the same problems as R-CNN, and Fast R-CNN was introduced to remove them. Although Fast R-CNN could reach real-time speeds using very deep networks, it still had a computational bottleneck. Later, Faster R-CNN, an algorithm heavily based on the earlier ResNet, was introduced. Because Faster R-CNN was still not able to deliver outstanding results, YOLO was introduced. This paper reviews the You Only Look Once (YOLO) algorithm for object detection.

A. Abbreviations
Abbreviations used:
CNN – Convolutional Neural Network.
ResNet50 – Residual Neural Network (50 layers).
ResNet152 – Residual Neural Network (152 layers).
YOLO – You Only Look Once.
RNN – Recurrent Neural Network.
RCNN – Region Based CNN.

B. Overview of object detection (CNN).

Object detection is a branch of computer vision. It is a large research area with many applications, such as driverless vehicles, security, surveillance, and machine inspection. Object detection is used to find the location of an object in an image, for face detection, for medical imaging, and so on. The evolution of deep learning has changed the old methods of object detection and tracking. Computer vision tasks include recognizing characteristics in images, classifying an object in an image, classification with localization, drawing a bounding box around each object present in the image, object segmentation or semantic segmentation, and neural style transfer. Deep learning methods are currently the strongest approach to object detection.

II. LITERATURE SURVEY
Various approaches have been introduced by numerous researchers. An algorithm for the first face detector was devised by Paul Viola and Michael Jones in 2001. It detected faces in real time on a webcam feed and was implemented with OpenCV for face detection. It could not detect faces in some orientations, such as facing up or down, tilted, or wearing a mask. With the advancement of object detection in deep learning, detectors can be further classified into two kinds of model: (1) models based on region proposals; (2) models based on regression/classification.

A. Model based on region proposals
a. CNN: This network was presented by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton in 2012.
The network comprises five convolutional layers. It accepts as input an image, which is a 2D array of pixels with RGB channels. Filters, or feature detectors, are then applied to the input image to produce output feature maps. Multiple convolutions are performed in parallel, each followed by the ReLU function. A CNN handles just a single object at a time, so it does not work well when an image contains several objects. CNNs became the standard for image classification after the performance of Krizhevsky's network. A classification CNN cannot recognize objects that overlap or that appear against varied backgrounds; it not only fails to classify these different objects but also fails to identify their boundaries, differences, and relations to one another.

Figure 1. CNN layer diagram

Figure 2. CNN Flow Diagram

b. R-CNN: This network was presented by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik in 2013 and was motivated by OverFeat. It has three main parts: a region extractor, a feature extractor, and a classifier. It uses the selective search algorithm to create region proposals, extracting about 2000 small regions from each image. The convolutional network is then applied to every one of these 2000 regions, so R-CNN splits the image into regions and computes CNN features for each of them. Each region is passed through a pre-trained AlexNet, and finally an SVM classifier is applied.
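The convolution and ReLU step described for the CNN above can be illustrated with a minimal sketch. It applies one hand-written filter to a toy single-channel image; the image, the filter values, and the function name `conv2d_relu` are illustrative assumptions and are not taken from the paper, where the filters are learned and the input has RGB channels.

```python
import numpy as np

def conv2d_relu(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), then apply ReLU."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the filter with the patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU keeps positive responses only

# Toy 4x4 image: dark left half, bright right half.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 1.0]])

feature_map = conv2d_relu(image, kernel)
print(feature_map)
```

The resulting feature map is non-zero only in the column where the dark-to-bright edge sits, which is the sense in which a filter acts as a feature detector.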


Figure 3. RCNN Flow Diagram

c. Fast R-CNN: This network is an improved version of R-CNN, presented by Ross Girshick. The paper claims that Fast R-CNN is nine times faster than the earlier R-CNN. The network selects multiple sets of bounding boxes, extracts features with a CNN, and then uses a classifier and a regressor to output the class of each box.

Figure 4. Fast RCNN Flow Diagram

d. Faster R-CNN: This is an improved form of Fast R-CNN, presented by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015. The image is given as input to a convolutional network that produces a convolutional feature map. To identify the candidate regions, a separate network is used to predict the region proposals.

Figure 5. Faster RCNN Flow Diagram

B. Model based on regression/Classification.

a. YOLO: YOLO (You Only Look Once) looks at an image once to predict which objects are present and where they are. A single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. It treats detection as a regression problem. YOLO is extremely fast and accurate: it takes an image and splits it into a grid, and every grid cell predicts just a single object. YOLO is very fast at test time because it requires only a single network evaluation, performing feature extraction, bounding-box prediction, non-max suppression, and contextual reasoning all at once. YOLO is not well suited to small objects that appear in groups, such as flocks of birds. YOLO has a few variants, such as Fast YOLO. YOLO works in a fundamentally different way from the region-based models: it looks at the image just once, but in a clever way. A simple image is passed through the convolutional network in a single pass and comes out the other end as a 13×13×125 tensor describing the bounding boxes for the grid cells. All that remains is to compute the final scores for the bounding boxes and discard the ones scoring lower than 30%.

Figure 6. YOLO Network Architecture

b. SSD: In SSD (Single Shot MultiBox Detector), classification and localization are done in a single forward pass. The main benefit is speed combined with a high level of accuracy. It runs a convolutional network on the input image just once and computes a feature map. Histograms of Oriented Gradients (HOG) were invented by Navneet Dalal and Bill Triggs in 2005. In HOG, we look at every pixel and at the pixels directly surrounding it, comparing the current pixel with each surrounding pixel. HOG failed at more general object detection when there was noise or distraction in the background.

Figure 7. SSD (Single Shot MultiBox Detector)
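The final scoring step in the YOLO description above, turning the 13×13×125 tensor into box scores and discarding those below 30%, can be sketched as follows. The split 125 = 5 boxes × (4 coordinates + 1 objectness + 20 classes) is an assumption (it matches a 20-class dataset such as PASCAL VOC), as are the channel order and the helper names; the text gives only the tensor shape and the threshold.

```python
import numpy as np

GRID, BOXES, CLASSES = 13, 5, 20  # assumed split: 5 * (5 + 20) = 125 channels

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def filter_boxes(output, threshold=0.30):
    """Return (row, col, box, class_id, score) for every box scoring >= threshold."""
    preds = output.reshape(GRID, GRID, BOXES, 5 + CLASSES)
    objectness = sigmoid(preds[..., 4])            # is there an object in this box?
    class_probs = softmax(preds[..., 5:])          # which class, given an object
    scores = objectness[..., None] * class_probs   # final class-specific scores
    best_class = scores.argmax(axis=-1)
    best_score = scores.max(axis=-1)
    keep = np.argwhere(best_score >= threshold)
    return [(int(r), int(c), int(b), int(best_class[r, c, b]), float(best_score[r, c, b]))
            for r, c, b in keep]

# Synthetic network output: every box is "empty" except one confident box.
output = np.full((GRID, GRID, BOXES * (5 + CLASSES)), -10.0)
output[6, 7, 2 * 25 + 4] = 10.0       # objectness of box 2 in cell (6, 7)
output[6, 7, 2 * 25 + 5 + 3] = 10.0   # class 3 logit of the same box

detections = filter_boxes(output)
print(detections)
```

Of the 13 × 13 × 5 = 845 candidate boxes in this toy tensor, only the one confident box survives the threshold; non-max suppression would then run on the survivors.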


III. PROPOSED SYSTEM METHODOLOGY

A. PROBLEM STATEMENT
“To implement real-time object detection and recognition in images captured by webcam and in videos in a dynamic environment using a deep learning model.”

B. PROBLEM ELABORATION
The primary goal is to detect and recognize objects in real time. This requires rich data. We need to observe the different types of objects that are moving with respect to the camera. This will help us perceive and recognize the interactions between different objects. We focus on accuracy in this paper.

C. PROPOSED METHODOLOGY
This model includes a Darknet-53 feature extractor, with feature maps that are upsampled and concatenated. The proposed model contains several upgrades over earlier object detection techniques.

a. Darknet-53: The proposed framework uses a variant of Darknet that originally has a 53-layer network trained and tested on ImageNet. For detection, 53 more layers are stacked onto it, giving a total of 106 convolutional layers as the basis of the proposed framework. This is why the proposed framework is slow.

b. Detection at three scales: This model performs detection at three different scales. Detection is done by applying 1 x 1 detection kernels to feature maps of three different sizes at three different places in the network. The shape of the detection kernel is 1 x 1 x (M x (5 + N)).

c. Here M is the number of bounding boxes that a cell of the feature map can predict and N is the number of classes. The feature map produced by this kernel has the same height and width as the previous feature map, with the detection attributes along the depth. Three different scales are used. The first detection is made by the 82nd layer. The initial 61 layers of the network process the image. If we have an image of size 416 x 416, the feature map will be of size 13 x 13. Detection is made using the 1 x 1 kernel, and the resulting feature map will be 13 x 13 x 255. The second detection is made by the 94th layer of the model, and the resulting feature map will be 26 x 26 x 255. The final detection is done by the 106th layer, producing a feature map of size 52 x 52 x 255.

d. Detecting smaller objects: In this model the three layers have distinct purposes: the 13 x 13 layer detects larger objects, the 52 x 52 layer is responsible for detecting smaller objects, and the 26 x 26 layer detects medium objects.

e. Choice of anchor boxes: This model uses 9 anchor boxes in total for the detection of an object. We use k-means clustering to generate the 9 anchors. After clustering, all anchors are arranged in decreasing order of their dimensions; the three largest anchors are assigned to the first scale, the following three to the second scale, and the last three to the third scale. This model predicts more bounding boxes. It predicts boxes at 3 different scales; for images of 416 x 416, the number of predicted boxes is 10647. For class prediction, softmax is not used; independent logistic classifiers and a binary cross-entropy loss are used instead.

Figure 8. Prediction Model

D. PROPOSED FRAMEWORK
a. YOLO Algorithm: YOLO is an abbreviation of "You Only Look Once". Older object detection algorithms use regions to localize and identify objects; they do not look at the whole image, only at the regions that might contain objects. YOLO is an object detection algorithm entirely different from the region-based algorithms seen previously. In YOLO, a single convolutional network predicts the bounding boxes and the class probabilities for these boxes. It is difficult for everyone to obtain the resources that deep learning requires, and that is where YOLO comes into the picture. Furthermore, many pre-trained models and datasets are now available.
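The numbers quoted for the three detection scales in the proposed methodology can be checked with a few lines of arithmetic. The sketch below assumes M = 3 boxes per cell (9 anchors spread over 3 scales), N = 80 classes (which is what a depth of 255 implies), and strides of 32, 16, and 8; none of these values is stated explicitly in the text, and the function name is hypothetical.

```python
def detection_shapes(input_size=416, boxes_per_cell=3, num_classes=80,
                     strides=(32, 16, 8)):
    """Return the three output feature-map shapes and the total box count."""
    # Depth of the 1 x 1 detection kernel: M x (5 + N).
    depth = boxes_per_cell * (5 + num_classes)
    shapes = [(input_size // s, input_size // s, depth) for s in strides]
    # Every grid cell at every scale predicts `boxes_per_cell` boxes.
    total_boxes = sum(h * w * boxes_per_cell for h, w, _ in shapes)
    return shapes, total_boxes

shapes, total_boxes = detection_shapes()
print(shapes)       # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
print(total_boxes)  # 10647
```

Under these assumptions the sketch reproduces both the 13 / 26 / 52 grid sizes with depth 255 and the figure of 10647 predicted boxes for a 416 x 416 input.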

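The anchor selection described under "Choice of anchor boxes" can be sketched as a small k-means over box widths and heights. The text only says that k-means is used; assigning each box to the centroid with the highest IoU (a distance of 1 − IoU) is an assumption borrowed from common YOLO practice, the box data here are synthetic, and the function names are hypothetical.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as if they shared one corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k=9, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = iou_wh(boxes, centroids).argmax(axis=1)  # nearest = highest IoU
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    # Arrange anchors in decreasing order of area.
    order = np.argsort(-(centroids[:, 0] * centroids[:, 1]))
    return centroids[order]

# Synthetic (width, height) pairs standing in for labelled ground-truth boxes.
rng = np.random.default_rng(42)
boxes = rng.uniform(10, 400, size=(500, 2))

anchors = kmeans_anchors(boxes)
# Three largest for the first scale, next three for the second, last three for the third.
large, medium, small = anchors[0:3], anchors[3:6], anchors[6:9]
```

Sorting by area is one reasonable reading of "decreasing order of their dimensions"; on a real dataset the (w, h) pairs would come from the training annotations, scaled to the network input size.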

Figure 9. YOLO Algorithm Process

YOLO stores the information in vector form:

YOLO = (pc, bx, by, bh, bw, c1, c2, c3),

where pc is the probability that indicates whether an object is present or not; bx, by, bh, bw specify the bounding box of the object; and c1, c2, c3 are the classes.

So if there is an object belonging to class c1, c1 will have the value 1, otherwise 0.
YOLO uses non-max suppression: the bounding box with the highest precision is selected and the remaining boxes are discarded.

The overlap measure used by non-max suppression is:

$$\mathrm{IoU} = \frac{\text{Area of intersection}}{\text{Area of union}}$$

where IoU = Intersection over Union.

IV. DATASETS & PERFORMANCE COMPARISON AMONG VARIOUS ALGORITHMS:
The progress of detection models is closely linked to the explosion of data volume. This is because performance testing and algorithm evaluation must be carried out on datasets; moreover, datasets are a strong driving force for advancing research in the field of detection.

Model         Backbone         Size (pixels)   Test set    mAP (%)   FPS
YOLOv1        VGG16            448*448         VOC 2007    67.2      46
SSD           VGG16            300*300         VOC 2007    78.1      47
YOLOv2        Darknet-19       544*544         VOC 2007    78.6      40
YOLOv3        Darknet-53       608*608         MS COCO     35        51
YOLOv4        CSP Darknet-53   610*610         MS COCO     42.1      67.5
RCNN          VGG16            1000*600        VOC 2007    65        0.6
SPP-Net       ZF-5             1000*600        VOC 2007    55.4      -
Fast RCNN     VGG16            1000*600        VOC 2007    70.2      7
Faster RCNN   ResNet-101       1000*600        VOC 2007    76.5      6

Table 1. Comparison of object detection algorithms with performance

V. RESULT AND ANALYSIS WITH ACCURACY AND PERFORMANCE:

a. On MS COCO Dataset

Figure 10. MS COCO Dataset Performance.

b. On PASCAL VOC 2007 & 2012

Figure 11. PASCAL VOC Dataset Performance.

c. Real-Time Detection: YOLO is a fast, accurate object detection model, making it ideal for various applications in the field of computer vision. We connect YOLO to a webcam and confirm that it maintains real-time performance.

Figure 12. Picasso Dataset precision-recall curves
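The prediction vector (pc, bx, by, bh, bw, c1, c2, c3) introduced with the YOLO algorithm can be unpacked as in the sketch below. The 0.5 confidence cut-off, the `Detection` container, and the function name `decode` are illustrative assumptions; the text defines only the layout of the vector.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Detection(NamedTuple):
    confidence: float
    box: Tuple[float, float, float, float]  # (bx, by, bh, bw), in the order used in the text
    class_index: int                         # 0 for c1, 1 for c2, 2 for c3

def decode(vector: Sequence[float], min_confidence: float = 0.5) -> Optional[Detection]:
    """Split one prediction vector into confidence, box, and most likely class."""
    pc, bx, by, bh, bw, *class_scores = vector
    if pc < min_confidence:
        return None  # pc says no object is present in this cell
    class_index = max(range(len(class_scores)), key=class_scores.__getitem__)
    return Detection(pc, (bx, by, bh, bw), class_index)

hit = decode([0.9, 0.5, 0.5, 0.2, 0.3, 0, 1, 0])   # object of class c2
miss = decode([0.1, 0.5, 0.5, 0.2, 0.3, 0, 0, 0])  # low pc, treated as background
print(hit, miss)
```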

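The IoU ratio and the non-max suppression step described for the YOLO algorithm can be sketched in plain Python. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and the 0.5 overlap threshold is an assumption; the text does not state which threshold the authors use.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes on one object, plus one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]

kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is discarded in favour of the higher-scoring one, while the third box has no overlap and is kept.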

Model     VOC 2007 AP   Picasso AP   Picasso Best F1   People-Art AP
YOLOv4    59.5          53.4         0.595             45
DPM       43.2          37.9         0.460             35
RCNN      54.2          10.5         0.230             28
Poselets  36.5          17.9         0.275
D&T       ---           2.0          0.055

Table 2. Results on VOC 2007, Picasso, & People-Art Dataset

Figure 12. Qualitative sample Results

Figure 13. Test Image Detection and Identification

Figure 14. YOLOv4 On Test Image

YOLO algorithm running on sample artwork and natural images from the internet.

VI. CONCLUSION & FUTURE SCOPE:

Object detection and recognition can be considered one of the most challenging, complex, and important tasks in computer vision today. This project was developed with the underlying purpose of detecting objects in real time in pictures and in videos captured by streaming cameras or webcams.

• Pre-processing methods are proposed here, i.e. edge detection techniques that increase the contrast of the image and thereby improve our model's accuracy.

• The project can be improved and extended in the future by anybody without worrying about complexity.

• Future enhancements can focus on implementing the project on a system with a GPU for faster results and better accuracy.

• Small object detection, as benchmarked on MS COCO and needed in some face detection applications, can be improved, in particular the localization of small objects under partial occlusion. To that end we will improve the network architecture with some modifications.

• With these modifications we can reduce the network's dependency on data.

• The aim is efficient recognition of small objects with better accuracy.

We conclude that accuracy and performance can be enhanced by using pre-processing techniques such as edge detection, together with more image augmentation and contrast enhancement, so that we get better output results.

REFERENCES:

[1] T. Guo, J. Dong, H. Li, and Y. Gao, “Simple convolutional neural network on image classification,” 2017 IEEE 2nd Int. Conf. Big Data Anal. ICBDA 2017, pp. 721–724, 2017, doi: 10.1109/ICBDA.2017.8078730.
[2] J. Du, “Understanding of Object Detection Based on CNN Family and YOLO,” J. Phys. Conf. Ser., vol. 1004, no. 1, 2018, doi: 10.1088/1742-6596/1004/1/012029.
[3] Mauricio Menegaz, “Understanding YOLO – Hacker Noon,” Hackernoon. 2018.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016, doi: 10.1109/CVPR.2016.91.
[5] Wu, R.B. Research on Application of Intelligent Video Surveillance and Face Recognition Technology in Prison Security. China Security Technology and Application, 2019, 6: 16-19.
[6] Tian, J.X., Liu, G.C., Gu, S.S., Ju, Z.J., Liu, J.G., Gu, D.D. Research and Challenge of Deep Learning Methods for Medical Image Analysis. Acta Automatica Sinica, 2018, 44: 401-424.
[7] Jiang, S.Z., Bai, X. Research status and development trend of industrial robot target recognition and intelligent detection technology. Guangxi Journal of Light Industry, 2020, 36: 65-66.
    [4] Krizhevsky, A., Sutskever, I., Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
[8] Russakovsky, O., Deng, J., Su, H., et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015, 115: 211-252.
[9] Girshick, R., Donahue, J., Darrell, T., Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In: Computer Vision and Pattern Recognition. Columbus. 2014, pp. 580-587.
[10] He, K.M., Zhang, X.Y., Ren, S.Q., Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015, 37: 1904-1916.
[11] Girshick, R. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. Santiago. 2015, pp. 1440-1448.


[12] Ren, S.Q., He, K.M., Girshick, R., Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. Montreal. 2016, pp. 91-99.
[13] Redmon, J., Divvala, S., Girshick, R., Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In: Computer Vision and Pattern Recognition. Las Vegas. 2016, pp. 779-788.
[14] Pushkar Shukla, Beena Rautela and Ankush Mittal, “A Computer Vision Framework for Automatic Description of Indian Monuments”, 2017 13th International Conference on Signal Image Technology and Internet-Based Systems (SITIS), Jaipur, India, ISBN (e): 978-1-5386-4283-2, December-2015.
[15] M. Buric, M. Pobar and Ivasic-Kos, “Object Detection in Sports Videos”, 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, ISBN (e): 978-953-233-095-3, May-2018.
