
Heliyon 8 (2022) e11792

Contents lists available at ScienceDirect

Heliyon
journal homepage: www.cell.com/heliyon

Research article

Traffic sign classification using CNN and detection using faster-RCNN and YOLOV4
Njayou Youssouf *
Department of Computer Science and Engineering, Islamic University of Technology, Gazipur 1704, Bangladesh

Keywords: Convolutional neural network; Object detection; Faster R–CNN; YOLOv4; Traffic sign recognition; Traffic sign classification; GTSDB; GTSRB

Abstract: Autonomous driving cars are becoming popular everywhere, and the need for a robust traffic sign recognition system that ensures safety by recognizing traffic signs accurately and quickly is increasing. In this paper, we build a CNN that can classify 43 different traffic signs from the German Traffic Sign Recognition Benchmark (GTSRB) dataset. The dataset is made up of 39,186 images for training and 12,630 for testing. Our CNN for classification is light and reached an accuracy of 99.20% with only 0.8 M parameters. It was also tested under severe conditions to prove its generalization ability. We also used Faster R–CNN and YOLOv4 networks to implement a recognition system for traffic signs. The German Traffic Sign Detection Benchmark (GTSDB) dataset was used. Faster R–CNN obtained a mean average precision (mAP) of 43.26% at 6 frames per second (FPS), which is not suitable for real-time application. YOLOv4 achieved an mAP of 59.88% at 35 FPS and is therefore the preferred model for real-time traffic sign detection. These mAPs were obtained at an Intersection over Union threshold of 50%. A comparative analysis between these models is also presented.

1. Introduction

Technology around Advanced Driving Assistance Systems (ADAS) is ever increasing. To keep improving intelligent driving and traffic safety, object detection plays an important role in the upcoming trend of self-governing cars [1]. Tesla Inc. is the leading self-driving car manufacturer in the world and became the most valuable car company in the world during the writing of this paper, which shows the tremendous demand for self-driving cars in the auto industry. The ability of autonomous vehicles to detect, classify, and act upon the different traffic signs can be a matter of human life, so the technology ought to be accurate and fast enough to detect signs promptly. Because such a system has the benefits of saving lives and saving costs, developing and improving it is one of the main motivations for this paper, and using deep learning algorithms to design such a system is the main objective of our work. One of the advancements in scene understanding was OLIMP, introduced by Mimouna et al. (2020) [2], which made use of a multimodal dataset for a better perception of the environment. The different shapes and color coordination allow us to identify and differentiate between different signs. Though there are different aspects of traffic signs that make them differentiable, recognition is still a challenging task due to the variety of colors, shapes, environmental conditions, occlusion, illumination, etc. [3] presents different classes of low-light image improvement algorithms and their improved versions to tackle the problem of illumination in images.

A system that aims to tackle these problems of recognizing traffic signs should be able to detect a traffic sign in an image or a video feed from a camera and then classify it. In traffic sign detection, the algorithm has to handle scale, since the vehicle carrying the capturing device might be at a different distance from a sign at different times. We explored different filter sizes to identify the effect of size on the performance of the models. In this paper, we look at the different classification approaches that exist, and we propose our CNN, which is lighter and faster but slightly less accurate. We also explored algorithms for real-time object detection.

Our contributions can be listed below:

1. We introduced a lighter and faster CNN with a highly acceptable accuracy of 99.20% for traffic sign classification from GTSRB.
2. We fine-tuned the Faster R–CNN and YOLOv4 network architectures to train on and detect traffic signs from GTSDB.
3. The original GTSDB dataset divided the data into only 3 categories, i.e., prohibitory, mandatory, and danger. Our paper used this dataset not only to detect the 3 categories but to detect all 43 categories of German traffic signs individually.
4. We present a comparative analysis between these two models to understand each model's performance and compromises relative to the other.

* Corresponding author.
E-mail address: [email protected].

https://doi.org/10.1016/j.heliyon.2022.e11792
Received 25 September 2021; Received in revised form 28 June 2022; Accepted 14 November 2022
2405-8440/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Table 1. Comparison of different networks for object detection.

Model name (backbone)      Speed (ms)   COCO mAP (%)
YOLOv4 (CSPDarknet-53)     38           43.5
SSD (VGG-16)               43           39.5
Faster R–CNN (ResNet50)    92           34.9
YOLOv2 (DarkNet-19)        25           21.6
YOLOv3 (DarkNet-53)        45.5         42.4

In the next section, Section 2, we review the literature to understand what existed, what exists, and why improvements are needed. In Section 3, we discuss the methodologies used in this paper and all implementations. Section 4, Experimental Results, is where we compare our results to state-of-the-art algorithms and present a comparative analysis between the models we used. We conclude our work in Section 5.

2. Related works

First, the task of Traffic Sign Recognition (TSR) can be subdivided into two main categories: traffic sign detection (TSD) and traffic sign classification (TSC). Many CNN networks have been designed in recent years to perform object classification and detection. Comparing the results achieved by different researchers is challenging because of the use of different datasets (GTSDB, BTSD, etc.), which have different properties, such as the number and quality of training images, and these considerably affect the performance of a network. Fang et al. (2018) [4] propose a novel technique for classifying land use based on images captured by different users under different conditions. Nevertheless, over the years new techniques have been developed to improve the performance of classification and real-time object detection by reducing network size, increasing accuracy, and reducing detection time.

Before the advent of CNNs, object classification and recognition were more difficult and expensive to achieve. Maximally Stable Extremal Regions (MSER) were used in combination with Class-Specific Extremal Regions (CSER) in [5] to detect text-based regions and recognize the matching text fields, with CSER and MSER attaining precisions of 80% and 50%, respectively. Traditional networks for image classification have been developed over the years. AlexNet [6], introduced in 2012, was revolutionary in image classification: it improved on the traditional CNN and made use of the ReLU activation function instead of the tanh function. In the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), a contest for the best-performing networks, AlexNet [6] had the best performance and was used as the standard model for image classification. VGG [7] (2014), considered the next step after [6], focused on a particular aspect of CNNs, the depth of the network; it also decreased the size of the network and increased accuracy. ResNet [8] decreased the amount of computation compared to VGG [7] by a factor of 5. Other algorithms [9] did not rely on CNNs alone. CNNs are good for classification at a high level because of their invariance properties [10].

Application-specific classification of objects like traffic signs [11, 12, 1] has been introduced with different techniques to achieve better accuracy. L. Chen et al. (2017) [11] presented a combined convolutional neural network consisting of two separate CNNs, one for superclass recognition with 6 classes and the other for subclass recognition with 43 classes; the results of both CNNs were then combined to determine the final label using vector summation. The framework in [11] achieved a recognition accuracy of 95.6%. However, its time cost was 2.7 ms per classification; though the result was acceptable, the classification time was too slow for real-time application. Kedkarn et al. (2015) [1] also used techniques like SVM to classify traffic signs from GTSRB. In object detection, the methodologies developed so far can be classified into mainly two categories: one-stage and two-stage detection approaches. R–CNN [13] used region proposals to extract features and also improved precision; [13] uses selective search to generate all the proposed regions, about 2000 of them, and this selective search slowed down the network. Better networks were soon designed that achieved better performance: SPP-Net allowed the detection of input images of different resolutions, and Fast R–CNN [14] used spatial pyramid pooling (SPP) ideas to improve precision compared to R–CNN. Faster R–CNN [15] discarded the selective search method and used the RPN for feature extraction, which brought considerable improvement in the performance of the network. Abdullah et al. [9] proposed a system to detect vehicles using deep learning algorithms; [9] used mainly R–CNN and Fast R–CNN, which achieved mAPs of 64% and 75%, respectively. The high mAPs in [9] are because the detection and recognition are of vehicles, rather than traffic signs, which are much smaller and more similar. [15] also uses ROI pooling to scale its input. In two-stage detection, the network looks at the image twice before it completes; in one-stage detection, the image is looked at by the network only once, which makes it faster and more suitable for real-time applications. Liu et al. [17] introduced SSD. SSD [17] uses a grid system to divide the image into multiple sections, and each independent grid cell is responsible for detecting objects in its region. [17] has better accuracy compared to YOLO [18], which combined detection and classification into one CNN, making [18] faster but less accurate.

Figure 1. a) Top row shows images from GTSDB dataset and b) Bottom row shows images from Belgium traffic sign dataset.


Then YOLOv2 [19] and YOLOv3 [20] were improvements on the previous versions. [20] uses DarkNet-53, a bigger network with 53 convolutional layers, compared to the 19-layer DarkNet-19 of [19], thereby increasing the accuracy of [20]. YOLOv4 [21] (2020) is the latest in the YOLO series at the time of writing this paper. YOLOv4 [21] uses CSPDarkNet53, which makes it capable of detecting smaller objects, and it can be trained on a 1080 Ti or 2080 Ti GPU. A comparison of these networks on the COCO 2017 dataset is given in Table 1.

3. Methodology

Figure 2. Samples of images from GTSRB for classification.

3.1. Dataset collection and organization

For the classification module, the German Traffic Sign Recognition Benchmark, GTSRB, was used. This dataset consists of about 51,822 images stored in the Portable Pixmap (ppm) format. About 39,209 images were used for training and 12,630 were used for validation. The sizes of these images range from 15 × 15 to 250 × 250 pixels. The Region of Interest (ROI) for each image is provided and was used. The annotations of these images were also collected in comma-separated values (CSV) files. There is a total of 43 different classes in the dataset. Samples of some classes in our dataset are shown in Figure 2.

Figure 3. Preprocessing images samples.

The dataset used in the detection module of this work is the German Traffic Sign Detection Benchmark dataset (GTSDB). It consists of 900 images, all of size 1360 × 800 pixels, stored in ppm format. The labels of these images were also collected in a CSV file, which contained the file name and the ground-truth location of the actual traffic sign in the image. This dataset was divided into two sets, training and validation: the training set consisted of 600 randomly selected images, and the remaining 300 images were used for validation. Since the training set was not large enough, with only 600 images, an additional 300 images containing traffic signs were collected from the Belgium Traffic Sign Dataset (BTSD), making a total of 900 images for the training set and 300 images for the testing set. The image size is 1628 × 1236 pixels for every image from BTSD and 1360 × 800 for images from GTSDB.
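Both benchmarks ship their labels as plain delimiter-separated text, so loading them is straightforward. The sketch below assumes the semicolon-separated row layout filename;x1;y1;x2;y2;classId used by the GTSDB ground-truth file; the helper name and file path are hypothetical, not the paper's code:

    # Minimal sketch: read GTSDB-style annotations into a dict mapping each
    # image file to its list of (bounding box, class id) pairs.
    import csv
    from collections import defaultdict

    def load_annotations(path):
        boxes = defaultdict(list)
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter=";"):
                name, x1, y1, x2, y2, cls = row[:6]
                boxes[name].append(((int(x1), int(y1), int(x2), int(y2)), int(cls)))
        return boxes

    annotations = load_annotations("gt.txt")  # hypothetical path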

Figure 4. CNN architecture for traffic sign classification.


Figure 5. Distribution of images across the 43 different classes.

Then these images' ROIs were manually labeled in the Pascal VOC (Visual Object Classes) format. Any of these images may contain more than one traffic sign. The images were taken under different lighting conditions, at different angles of the sign, and with occlusion in some cases; see Figure 1. The GTSRB, GTSDB, and BTSD datasets used in this paper are publicly available.

3.2. Data pre-processing

In the detection phase, the image ROI annotations were converted from the CSV format to the Pascal VOC format, and the same was done for the classification model. In the classification phase, the images were resized to 32 × 32, converted to grayscale, and then normalized. Resizing was necessary because the image sizes in the dataset range from 15 × 15 to 250 × 250 pixels and our CNN accepts only input images of a fixed size, so all images must be resized before being passed into the network. The images were converted to grayscale because color is not a very important determinant for classifying these images, and grayscale reduces the processing complexity of the CNN. There is significant scientific value in enhancing system robustness in adverse weather conditions and in improving image quality [22]. Normalization was done by dividing each pixel by the maximum pixel value, which ensures that the input pixels have a similar data distribution and also makes convergence faster while training the network. The results of this preprocessing are shown in Figure 3. The distribution of the images across the 43 classes is shown in Figure 5. Image augmentation techniques such as shifts, brightness changes, and zoom were used to even out the number of images per class, improving classification accuracy and reducing bias. The two-stage detection network is more accurate in the detection of bounding boxes and class objects.
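To make the pipeline concrete, a minimal sketch follows, assuming 8-bit input images and using OpenCV; the helper name and the choice of cv2 are our illustration, not the paper's code:

    # Minimal sketch of the preprocessing above: resize to 32 x 32, convert
    # to grayscale, and normalize by the maximum pixel value (255 for 8-bit
    # images), which speeds up convergence during training.
    import cv2
    import numpy as np

    def preprocess(image_bgr):
        resized = cv2.resize(image_bgr, (32, 32))          # fixed CNN input size
        gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)   # color is not decisive here
        normalized = gray.astype(np.float32) / 255.0       # similar pixel distribution
        return normalized.reshape(32, 32, 1)               # add the channel axis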
3.3. Traffic sign classification module

For the classification module, the GTSRB was used. The dataset was split 60% for training, 20% for validation, and 20% for testing. The number of images in each class is not evenly distributed, so an augmentation technique was used to increase the number of training images. The class 'Speed limit 20 km/h', represented by 0 in Figure 5, has 210 images, while the class 'Speed limit 50 km/h', represented by 2, has 2250 images. Because of these discrepancies, the model may become biased towards the classes with more images. The augmentation parameters included random rotation, stretch, and flips; they are used to balance the dataset and reduce bias. In the dataset, images of the same class occur successively and similar images occur one after the other, so random shuffling was performed on the dataset to avoid fluctuations in the training accuracy and loss curves. A convolutional neural network (CNN) was constructed to perform feature extraction and classification on the training set.
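A sketch of such balancing augmentation using Keras' ImageDataGenerator is shown below; the specific parameter values are illustrative assumptions rather than the paper's settings:

    # Illustrative balancing augmentation with Keras; X_train / y_train are
    # the preprocessed, randomly shuffled training arrays (hypothetical names).
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rotation_range=10,            # random rotation
        width_shift_range=0.1,        # horizontal shift
        height_shift_range=0.1,       # vertical shift
        zoom_range=0.15,              # zoom, approximating "stretch"
        brightness_range=(0.6, 1.4),  # brightness variation
    )
    batches = augmenter.flow(X_train, y_train, batch_size=64, shuffle=True)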

Figure 6. Convolutional neural network used for classification.


Figure 7. Object detector architecture.

The convolutional filter size was 3 × 3, since this works better for smaller objects. A ReLU activation function was used in the different hidden layers, and the categorical cross-entropy loss function together with the Adam optimizer and a learning rate of 0.001 was applied. A summary of the architecture of the classification CNN in Figure 4 shows the different hidden layers, filters, and pooling applied in the network. The network architecture that was developed is shown in Figure 6. The proposed CNN for traffic sign classification was implemented using the Python framework TensorFlow, in addition to other libraries.
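To illustrate these choices (3 × 3 filters, ReLU, Adam at a learning rate of 0.001, categorical cross-entropy), a minimal Keras sketch follows; the exact layer stack and filter counts of Figure 4 are not reproduced here and are assumptions:

    # A minimal Keras sketch of the classifier's training setup. What matches
    # the text: 3 x 3 convolutions, ReLU, Adam (lr = 0.001), and categorical
    # cross-entropy; the layer depths and widths below are illustrative.
    from tensorflow.keras import layers, models, optimizers

    model = models.Sequential([
        layers.Input(shape=(32, 32, 1)),             # grayscale 32 x 32 inputs
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(43, activation="softmax"),      # 43 traffic sign classes
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])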
3.4. Traffic sign detection module

Most of the prominent state-of-the-art object detection techniques are divided into two groups: one-stage detection and two-stage detection. Examples of two-stage detection techniques are Faster R–CNN (Region-based CNN) and Mask R–CNN, which use an RPN. Examples of one-stage detection include You Only Look Once (YOLO) [18], EfficientNet [9], and SSD (Single Shot MultiBox Detector) with MobileNet; [18] has a high inference speed compared to an SSD. Our objective is to reduce the compromise between speed and accuracy that exists between the one-stage and two-stage object detection techniques. The two techniques followed in this work are the YOLOv4 [21] and Faster R–CNN [15] object detection techniques.

3.5. Faster R–CNN

Faster R–CNN [23] is a deep convolutional network used for object detection. [23] proposed a model made up of two modules: a fully convolutional neural network called the Region Proposal Network (RPN), which proposes region boxes, and the detector, which classifies the objects. In the first module, a fully convolutional neural network is used for feature extraction: this CNN extracts the main features of an image and outputs a feature map. The feature map is then passed to another network layer, the region proposal layer, which is responsible for proposing the potential regions where an object might exist and giving the class and the probability (score) that a particular proposed region belongs to that class. At every location on the feature map, a sliding window is used in the RPN, and different bounding boxes are generated for each location: 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) are used at each location for the proposed regions, which increases the generalization of the network. The RPN checks which of these locations contain objects, and those regions are passed to the next network for detection. Non-max suppression is used to remove overlapping regions. For detection, each region goes through an ROI pooling layer, then each ROI feature vector goes through a CNN and finally through a SoftMax layer, where the final prediction is made. This network architecture was fine-tuned to detect traffic signs from GTSDB.

Fine-tuning is the process of using the pre-trained weights of a model for recognition on a new dataset. The first few layers of Faster R–CNN can extract generic features ([23], 2017) since it is trained on a big dataset. We used a Faster R–CNN architecture trained on the PASCAL VOC 2007 dataset (Shaoqing et al. (2017)) as the pre-trained model. As our dataset was relatively large (39,186 images) compared to the PASCAL VOC 2007 dataset, we theorize that fine-tuning the first layers of the Faster R–CNN rather than the later layers would improve performance. We fine-tuned the earlier CONV + ReLU layers of the Faster R–CNN by decreasing the depth to 32 and the filter size to 3 × 3, because our training images are smaller (32 × 32 pixels), thereby decreasing the network size and increasing performance. We also reinitialized the FC + ReLU layer to enable training from scratch for multi-class classification. Random adjustments of hue, saturation, and contrast were used to minimize the effect of an unbalanced dataset during training.

Figure 8. Accuracy and loss function for our CNN classifier.
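To make the anchor scheme of Section 3.5 concrete, the sketch below enumerates the nine anchors (3 scales × 3 aspect ratios) generated at a single feature-map location; this is a generic illustration of the RPN idea described above, not the paper's implementation:

    # Generic illustration of the RPN anchor scheme: 9 anchors per sliding-
    # window position, from 3 scales (128, 256, 512) and 3 aspect ratios
    # (1:1, 1:2, 2:1); each ratio r = h/w preserves an area of scale**2.
    import numpy as np

    def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
        boxes = []
        for s in scales:
            for r in ratios:
                w, h = s / np.sqrt(r), s * np.sqrt(r)
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        return np.array(boxes)

    print(anchors_at(400, 300).round(1))  # the 9 anchors centered at (400, 300)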


Figure 9. A comparison of the results from Faster R–CNN and YOLOv4. The first column (a) is from Faster R–CNN and the second column (b) is from YOLOv4. In the first row, Faster R–CNN misses the No overtaking sign and falsely classifies Speed limit 120 as Speed limit 70, while YOLOv4 correctly detects and classifies all the signs. In the second row, Faster R–CNN misclassifies the Give way sign as a Bend sign; YOLOv4 classifies the Give way sign correctly. In the third row, both Faster R–CNN and YOLOv4 correctly classify the Speed limit 100 traffic sign. The last row shows the Faster R–CNN result in low light and the YOLOv4 result in bright light; both models perform well in conditions of low light and bright light.

3.6. YOLOv4

The single-stage detector used is YOLOv4. YOLOv4 achieves faster speed and better accuracy by combining several network architectures [21]; see Figure 7. The Input section in the figure represents the input image, which can be of variable resolution, from 32 × 32 to 512 × 512 or more. A data augmentation process is applied to generalize the network so it can recognize objects of different sizes. The input is then passed to the backbone section, where the feature extraction occurs. This backbone can be one of several networks, such as VGG, EfficientNet, DarkNet53, or ResNet. In Figure 7, the backbone used is CSPDarknet53, which separates the layers into different parts, of which one passes through convolutions and the other does not; the results are then combined, which improves the learning capabilities of the CNN. The next section of the YOLOv4 architecture is the Neck. The Neck adds more layers between the backbone and the dense prediction block and serves as an aggregation layer. In YOLOv3, the Feature Pyramid Network is used to extract features of different resolutions from the backbone, but in YOLOv4 the Path Aggregation Network (PANet) is used for this purpose, which gives higher accuracy. The Spatial Pyramid Pooling (SPP) used in R–CNN is also used to map an input of any size to a particular fixed output size. The final section is the head (Dense Prediction), similar to YOLOv3, which is responsible for locating the bounding box coordinates (x, y, w, h) from the previous layers and classifying the image section within the bounding box. It concurrently predicts numerous bounding boxes and the likelihood that each of those bounding boxes belongs to a specific class.


Table 2. Comparison between our CNN and other state-of-the-art CNN classifiers on GTSRB.

Model name      Time      Loss     Accuracy   No. of parameters
Our CNN         6.631 s   0.031    99.20%     0.8 M
Enet-V1 [12]    7.794 s   0.064    98.69%     0.9 M
Enet-V2 [12]    3.090 s   0.2642   96.78%     0.31 M
MCDNN [24]      11.4 s    0.024    99.46%     38.5 M
Co. CNNs [16]   -         -        99.35%     5.22 M

Table 5. Performance of the models on Video 1. (Table body not reproduced.)

Table 3. Comparison between our developed models.

Model           mAP      Speed (FPS)
Faster R–CNN    43.26%   6
YOLOv4          59.88%   35

The network uses Intersection over Union (IoU), defined in Eq. (1), to score the predicted bounding boxes against the region proposals, where $A_{gt}$ is the ground-truth box and $A_p$ is the predicted box:

$$\mathrm{IoU} = \frac{A_{gt} \cap A_p}{A_{gt} \cup A_p} \qquad (1)$$
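Eq. (1) translates directly into code for axis-aligned boxes; the (x1, y1, x2, y2) corner format used below is an assumption of this sketch:

    # Direct implementation of Eq. (1) for two axis-aligned boxes given as
    # (x1, y1, x2, y2) corners. A detection counts as correct at IoU >= 0.5,
    # the threshold used for the mAPs reported in this paper.
    def iou(box_a, box_b):
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)        # A_gt intersect A_p
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter                          # A_gt union A_p
        return inter / union if union > 0 else 0.0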

Table 4. Results of our CNN under conditions of ambiguity, very low illumination, low illumination, blurriness, occlusion, and high illumination. All were correctly predicted. (The 'Input' column of sample images is not reproduced.)

Predicted                Confidence
Speed limit (30 km/h)    100.0%
Speed limit (100 km/h)   98.67%
Speed limit (80 km/h)    67.18%
Slippery road            99.77%
Slippery road            98.28%
Speed limit (30 km/h)    99.96%

YOLOv4 was fine-tuned to detect traffic signs from GTSDB. YOLOv4 had been pre-trained on PASCAL VOC 2007 with an acceptable mAP (mean average precision). The purpose of fine-tuning is to decrease training time, increase accuracy, and reduce the cost of training; it also effectively increases the data space. The fine-tuned network was trained on the GTSDB dataset. The fine-tuning technique was applied to the state-of-the-art object detection algorithm YOLOv4, trained on the COCO dataset, to specifically suit the needs of traffic sign recognition and detection using a custom dataset. The weights of the pre-trained YOLOv4 convolutional layers were kept the same as in the pre-trained model. The input resolution was increased to 614 × 614, making it easier for the YOLOv4 model to detect smaller objects. The weights of the dense layers were updated by passing in the new data. The pre-trained YOLOv4 was trained on the COCO dataset, which contains 80 different classes for detection; our model has only 43 classes and reuses the same weights in the convolutional layers, which makes the model much faster and also increases its performance.
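For reference, adapting a Darknet YOLOv4 configuration to a new class count follows a standard recipe: each [yolo] section's classes is set to 43, and the [convolutional] layer immediately preceding it gets filters = (classes + 5) × 3. The fragment below is a sketch under those assumptions, not the paper's actual configuration file:

    # yolov4-traffic.cfg (fragment) -- hypothetical file name; values other
    # than classes/filters are illustrative.
    [convolutional]
    size=1
    stride=1
    pad=1
    filters=144        # (43 classes + 5) * 3 anchor boxes per detection scale
    activation=linear

    [yolo]
    classes=43         # all 43 German traffic sign classes
    # ... the same two edits are repeated for each of the three [yolo] heads ...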

4. Experimental results

For classification purposes, the developed CNN shown in Figure 4 achieved a good accuracy of 99.20% on the GTSRB test dataset. The accuracy and loss graphs for training and validation of the model are shown in Figure 8. Our model took approximately 6.63 s to classify all the images in the test set. The test set consists of 12,630 images; therefore, the model takes roughly 0.5 ms to classify a single image. This is considered fast for applications that are not real-time. The comparison between our CNN and other state-of-the-art CNN image classifiers is given in Table 2 [12, 16, 24]. The accuracy of our CNN is lower than that of some of the prominent classification CNNs, but the number of parameters of our network is only a small fraction of theirs: 0.8 M, versus 38.5 M for MCDNN [24] and 5.22 M for Co. CNNs [16]. To prove the robustness of our network, some of the successful classifications under different tough conditions (low and high illumination, occlusion, and blurring) are provided in Table 4.

To evaluate the performance of the different methods used in this work for object detection, Faster R–CNN and YOLOv4, an experiment was conducted as described above, and the resulting weights from these object detection techniques were evaluated on a real-time video

feed from the streets of Berlin. The process of labeling, training, and evaluating these models was done on a PC with a 3.4 GHz Intel CPU, 16 GB of RAM, and a 4 GB NVIDIA GeForce 1060 GPU. CUDA was used to improve the performance and speed of the models. The fine-tuned Faster R–CNN reached an mAP (mean Average Precision) of 43.26% on the GTSDB, with a speed of 6 FPS (frames per second), and YOLOv4 reached a much higher mAP of 59.88% on the same dataset. Table 3 shows the comparison between these models. The Faster R–CNN is too slow and would need improvements to be used in real-time applications. A video, Video 1, was fed to both models for inference; the traffic signs that appear in the video, along with whether each model correctly detected and recognized them, are given in Table 5.

Supplementary video related to this article can be found at https://doi.org/10.1016/j.heliyon.2022.e11792.

5. Conclusion

In this paper, a novel CNN architecture was designed for traffic sign recognition; it was tested on the GTSRB dataset and achieved a very high accuracy of 99.20% with minimal loss, which is comparable to the state-of-the-art architectures [12, 16, 24]. This was achieved with only 0.8 million trainable parameters, so it is a light model that can be used for traffic sign recognition on computers with limited resources. This high accuracy can be attributed to the pre-processing techniques used before training and to the robustness of the CNN.

This work was further extended to recognize traffic signs in real time. The dataset used is the GTSDB dataset, which was originally designed for the recognition of 3 superclasses, i.e., prohibitory, mandatory, and danger signs. The dataset consists of 600 images for training and 300 images for testing, an average of 200 images per superclass for training. Our goal was to detect not only the 3 superclasses but all 43 classes of German traffic signs using the same dataset; as such, we had an average of 14 images per class for training. Faster R–CNN weights trained on the COCO dataset were fine-tuned for the recognition of traffic signs; the fine-tuned parameters and process are explained in Section 3.5. Fine-tuning enabled the model to take less time to train, and the Faster R–CNN achieved a mean average precision (mAP) of 43.26% at a frame rate of 6 fps. Another model investigated for the recognition task is the YOLOv4 [21] model. This model was also fine-tuned for traffic sign detection and achieved a mean average precision (mAP) of 59.88% at a frame rate of 35 fps.

Comparisons between these models were made, and some results are shown in Figure 9. The YOLOv4 model was able to detect and classify the different German traffic signs despite using a very small amount of data for training. Other data augmentation techniques could be used to increase the number of training images and lead to better performance of the models.

Declarations

Author contribution statement

Njayou Youssouf: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability statement

Data associated with this study has been deposited at https://drive.google.com/drive/folders/1DHnRnM4B0qCHOZaexG30kAScKHT7HD2o?usp=sharing

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

References

[1] C. Kedkarn, H. Anusara, C. Ratiporn, K. Kittisak, K. Nittaya, Traffic sign classification using support vector machine and image segmentation, in: International Conference on Industrial Application and Engineering, 2015, pp. 52–58.
[2] A. Mimouna, I. Alouani, A. Ben Khalifa, Y. El Hillali, A. Taleb-Ahmed, A. Menhaj, A. Ouahabi, N.E. Ben Amara, OLIMP: a heterogeneous multimodal dataset for advanced environment perception, Electronics 9 (4) (2020) 560.
[3] W. Wang, X. Wu, X. Yuan, Z. Gao, An experiment-based review of low-light image enhancement methods, IEEE Access 8 (2020) 87884–87917.
[4] F. Fang, X. Yuan, L. Wang, Y. Liu, Z. Luo, Urban land-use classification from photographs, Geosci. Rem. Sens. Lett. IEEE 15 (12) (2018) 1927–1931.
[5] M.S. Guzel, A novel framework for text recognition in street view images, Int. J. Intell. Syst. Appl. Eng. 3, 140–144.
[6] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Proc. Adv. Neural Inf. Process. Syst. (2012) 1097–1105.
[7] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[8] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proc. CVPR (2016) 770–778.
[9] A.A. Yilmaz, M.S. Güzel, I. Askerbeyli, E. Bostanci, A vehicle detection approach using deep learning methodologies, arXiv preprint, 2018.
[10] S. Minaee, Y. Boykov, F. Porikli, et al., Image segmentation using deep learning: a survey, IEEE Trans. Pattern Anal. Mach. Intell.
[11] L. Chen, G. Zhao, J. Zhou, L. Kuang, Real-time traffic sign classification using combined convolutional neural networks, in: 4th IAPR Asian Conference on Pattern Recognition, 2017.
[12] X. Bangquan, W. Xiong, Real-time embedded traffic sign recognition using efficient convolutional neural network, IEEE Access 7 (2019) 53330–53346.
[13] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. CVPR (2014) 580–587.
[14] R. Girshick, Fast R-CNN, Proc. ICCV (2015) 1440–1448.
[15] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Proc. NIPS (2015) 91–99.
[16] D. Ciresan, U. Meier, J. Masci, J. Schmidhuber, A committee of neural networks for traffic sign classification, in: Proceedings of the International Joint Conference on Neural Networks, 2011.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in: Proc. ECCV, 2016.
[18] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, arXiv preprint arXiv:1506.02640, 2016.
[19] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517–6525.
[20] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, arXiv preprint arXiv:1804.02767, 2018.
[21] A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, YOLOv4: optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.
[22] W. Wang, X. Yuan, X. Wu, Y. Liu, Fast image dehazing method based on linear transformation, IEEE Trans. Multimed. 19 (6) (2017) 1142–1155.
[23] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
[24] D. Ciregan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3642–3649.
