Traffic sign classification using CNN and detection using FRCNN and YOLOV4
Heliyon
Research article
Keywords: Convolutional neural network; Object detection; Faster R–CNN; YOLOv4; Traffic sign recognition; Traffic sign classification; GTSDB; GTSRB

Abstract

Autonomous driving cars are becoming popular everywhere, and the need for a robust traffic sign recognition system that ensures safety by recognizing traffic signs accurately and quickly is increasing. In this paper, we build a CNN that can classify 43 different traffic signs from the German Traffic Sign Recognition Benchmark (GTSRB) dataset. The dataset is made up of 39,186 images for training and 12,630 for testing. Our CNN for classification is light and reached an accuracy of 99.20% with only 0.8 M parameters. It is also tested under severe conditions to prove its generalization ability. We also used Faster R–CNN and YOLOv4 networks to implement a recognition system for traffic signs. The German Traffic Sign Detection Benchmark (GTSDB) dataset was used. Faster R–CNN obtained a mean average precision (mAP) of 43.26% at 6 Frames Per Second (FPS), which is not suitable for real-time application. YOLOv4 achieved an mAP of 59.88% at 35 FPS, which makes it the preferred model for real-time traffic sign detection. These mAPs are obtained at an Intersection over Union threshold of 50%. A comparative analysis is also presented between these models.
https://doi.org/10.1016/j.heliyon.2022.e11792
Received 25 September 2021; Received in revised form 28 June 2022; Accepted 14 November 2022
2405-8440/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Figure 1. (a) Top row: images from the GTSDB dataset; (b) bottom row: images from the Belgian traffic sign dataset (BTSD).
3. Methodology
Figure 2. Samples of images from GTSRB for classification.

3.1. Dataset collection and organization
The images' ROIs were manually labeled in the Pascal VOC (Visual Object Classes) format. Any of these images may contain more than one traffic sign in the same image. The images were taken under different lighting conditions, from different angles of the sign, and with occlusion in some cases, see Figure 1. The GTSRB, GTSDB, and BTSD datasets used in this paper are publicly available for use.

In the detection phase, the image ROI annotations were converted from the CSV format to the Pascal VOC format, and the same was also done for the classification model. In the classification phase, the images were resized to 32 × 32, converted to grayscale, and then normalized. The images were resized because the image sizes in the dataset range from 15 × 15 to 250 × 250 pixels and our CNN only accepts input images of one fixed size, so all images must be resized before being passed into the network. The images were converted to grayscale because color is not a very important determinant for classifying the signs, and grayscale input also reduces the processing complexity of the CNN. There is significant scientific value in enhancing system robustness in adverse weather conditions and improving image quality [22]. Normalization was done by dividing each pixel by the maximum pixel value. This ensures that the input pixels have a similar data distribution and also makes the network converge faster during training. The results of this preprocessing are shown in Figure 3. The distribution of the images across the 43 classes is shown in Figure 5. Image augmentation techniques such as shifts, brightness changes, and zoom were used to even out the number of images per class, improving the classification accuracy and reducing bias. The two-stage detection network is more accurate in the detection of bounding boxes and class objects.

For the classification module, the GTSRB was used. The dataset was split in a ratio of 20% for testing, 20% for validation, and 60% for training. The number of images in each class is not evenly distributed, therefore an augmentation technique was used to increase the number of training samples. The class 'Speed limit 20 km/h', represented by 0 in Figure 5, has 210 images, while the class 'Speed limit 50 km/h', represented by 2, has 2,250 images. Because of these discrepancies, the model may become biased towards the classes with more images. The augmentation parameters included random rotations, stretches, and flips, and were used to balance the dataset and reduce bias. In the dataset, images of the same class occur successively and similar images follow one another; because of this, random shuffling was performed on the dataset to avoid fluctuations of the training and loss curves. A Convolutional Neural Network (CNN) was constructed to perform feature extraction and classification on the training set.
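A minimal sketch of the preprocessing and augmentation described above, assuming OpenCV, NumPy, and Keras are used; the specific augmentation ranges are assumptions, since the paper only names the transformation types (shifts, brightness changes, zoom, rotations).

```python
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def preprocess(image_bgr):
    """Resize to 32 x 32, convert to grayscale, and normalize to [0, 1]."""
    resized = cv2.resize(image_bgr, (32, 32))
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    normalized = gray.astype(np.float32) / 255.0      # divide by the maximum pixel value
    return normalized.reshape(32, 32, 1)               # add a channel axis for the CNN

# Augmentation roughly matching the transformations named in the text;
# the exact ranges here are illustrative assumptions.
augmenter = ImageDataGenerator(width_shift_range=0.1,
                               height_shift_range=0.1,
                               brightness_range=(0.8, 1.2),
                               zoom_range=0.15,
                               rotation_range=10)
```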
The convolutional filter size was 3 × 3, since it is better suited to smaller objects. A ReLU activation function was used in the different hidden layers, and the categorical cross-entropy loss function was applied together with the Adam optimizer and a learning rate of 0.001. A summary of the architecture of the CNN for classification, given in Figure 4, shows the different hidden layers, filters, and pooling applied to the network. The network architecture that was developed is shown in Figure 6. The proposed CNN for traffic sign classification was developed using the Python framework TensorFlow, in addition to other libraries.
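The exact layer stack is specified in Figures 4 and 6, which are not reproduced in this text; the Keras sketch below only illustrates the design choices stated above (3 × 3 convolutions, ReLU activations, pooling, Adam with a learning rate of 0.001, and categorical cross-entropy), so the particular number of layers and filters is an assumption rather than the authors' exact architecture.

```python
from tensorflow.keras import layers, models, optimizers

def build_classifier(num_classes=43, input_shape=(32, 32, 1)):
    """Small CNN classifier in the spirit of the one described in the text."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A stack of this size has roughly 0.6 M parameters, in the same range as the 0.8 M reported for the paper's classifier.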
3.4. Traffic sign detection module

Most of the prominent state-of-the-art object detection techniques fall into two groups: one-stage detection and two-stage detection. Examples of two-stage detection techniques are Faster R–CNN (Region-based CNN) and Mask R–CNN, which use an RPN. Examples of one-stage detection include You Only Look Once (YOLO) [18], EfficientNet [9], and SSD (Single Shot MultiBox Detector) MobileNet; an SSD has a high inference speed [18]. Our objective is therefore to reduce the compromise between speed and accuracy that exists between one-stage and two-stage object detection techniques. The two techniques followed in this work are the YOLOv4 [21] and Faster R–CNN [15] object detection techniques.

Fine-tuning is the process of using the pre-trained weights of a model for recognition on a new dataset. The first few layers of Faster R–CNN can extract generic features [23] since it is trained with a big dataset. We used a Faster R–CNN architecture trained on the PASCAL VOC 2007 dataset (Shaoqing et al., 2017) as the pre-trained model. As our dataset is relatively large (39,186 images) compared to the PASCAL VOC 2007 dataset, we theorize that fine-tuning the first layers of the Faster R–CNN rather than the later layers would improve performance. We fine-tuned the earlier CONV + ReLU layers of the Faster R–CNN by decreasing the depth to 32 and the filter size to 3 × 3, because our training images are smaller (32 × 32 pixels), thereby decreasing the network size and increasing performance. We also initialized the FC + ReLU layer to enable training from scratch for multi-class classification. Random adjustments of hue, saturation, and contrast were used to minimize the effect of an unbalanced dataset during training.
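The random hue, saturation, and contrast adjustments mentioned above can be expressed with TensorFlow's image ops. This is only a sketch of the idea: the paper does not report the perturbation ranges, so the values below are assumptions.

```python
import tensorflow as tf

def color_augment(image):
    """Randomly perturb hue, saturation, and contrast of an RGB image in [0, 1]."""
    image = tf.image.random_hue(image, max_delta=0.02)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.25)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.25)
    return tf.clip_by_value(image, 0.0, 1.0)   # keep pixel values in a valid range
```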
Figure 9. A comparison of the results from Faster R–CNN and YOLOv4. The first column (a) is from Faster R–CNN and the second column (b) is from YOLOv4. In the first row, Faster R–CNN misses the No overtaking sign and falsely classifies Speed limit 120 as Speed limit 70, while YOLOv4 correctly detects and classifies all the signs. In the second row, Faster R–CNN misclassifies the give way sign as a bend sign; YOLOv4 correctly classifies the give way sign. In the third row, both Faster R–CNN and YOLOv4 correctly classify the Speed limit 100 traffic sign. The last row shows Faster R–CNN results in low light and YOLOv4 results in bright light; both models perform well in these conditions.
3.6. YOLOv4

The single-stage detector used is YOLOv4. A faster speed and better accuracy can be achieved by YOLOv4, which combines several network architectures [21], Figure 7. The Input section in the figure represents the input image, which can be of variable resolution from 32 × 32 to 512 × 512 or more. A data augmentation process is applied to generalize the network so it can recognize objects of different sizes. The input is then passed to the backbone section, where feature extraction occurs. This backbone can be one of several networks such as VGG, EfficientNet, Darknet53, or ResNet. In Figure 7, the backbone used is CSPDarknet53, which separates the layers into different parts, of which one passes through convolutions and the other does not, and the results are then combined, which improves the learning capability of the CNN. The next section of the YOLOv4 architecture is the Neck. The Neck adds more layers between the backbone and the dense prediction block and serves as an aggregation layer. In YOLOv3, the Feature Pyramid Network is used to extract features of different resolutions from the backbone, but in YOLOv4, the Path Aggregation Network (PANet) is used for this purpose, which gives higher accuracy. The Spatial Pyramid Pooling (SPP) used in R–CNN is also applied to map an input of any size to a fixed output size. The final section is the head (Dense Prediction), similar to YOLOv3, which is responsible for locating the bounding box coordinates (x, y, w, h) from the previous layers and classifying the image section within the bounding box. It concurrently predicts numerous bounding boxes and the likelihood of those bounding boxes belonging to a specific class.
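To make the cross-stage split described for CSPDarknet53 concrete, the following is a deliberately simplified Keras sketch of a CSP-style block: part of the feature map bypasses the convolutional path, the rest is transformed, and the two are concatenated. It illustrates the principle only, not the actual CSPDarknet53 block, and the filter counts are arbitrary.

```python
from tensorflow.keras import layers

def csp_block(x, filters=64):
    """Simplified Cross-Stage Partial block: split, transform one branch, merge."""
    bypass = layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(x)   # untransformed part
    branch = layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(x)
    branch = layers.Conv2D(filters // 2, 3, padding="same", activation="relu")(branch)  # convolutional path
    merged = layers.Concatenate()([bypass, branch])   # results are combined
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(merged)
```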
Table 2. Comparison between our CNN and other state-of-the-art CNN classifiers on GTSRB.

Model name      Time      Loss     Accuracy   No. of Parameters
Our CNN         6.631 s   0.031    99.20%     0.8 M
Enet-V1 [12]    7.794 s   0.064    98.69%     0.9 M
Enet-V2 [12]    3.090 s   0.2642   96.78%     0.31 M
MCDNN [24]      11.4      0.024    99.46%     38.5 M
Co. CNNs [16]   -         -        99.35%     5.22 M

Table 5. Performance of the models on Video 1.
The network uses Intersection over Union (IoU), Eq. (1), to score the bounding boxes for the region proposals, where A_gt is the ground truth box and A_p is the predicted box.

IoU = (A_gt ∩ A_p) / (A_gt ∪ A_p)    (1)
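Eq. (1) translates directly into a few lines of Python for axis-aligned boxes. The (x1, y1, x2, y2) corner convention used below is an assumption, as the paper does not specify a box format.

```python
def iou(box_gt, box_pred):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_gt[0], box_pred[0])
    y1 = max(box_gt[1], box_pred[1])
    x2 = min(box_gt[2], box_pred[2])
    y2 = min(box_gt[3], box_pred[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)                 # overlap area
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    area_pred = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    union = area_gt + area_pred - inter
    return inter / union if union > 0 else 0.0
```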
Table 4. Results of our CNN under conditions of ambiguity, very low illumination, low illumination, blurriness, occlusion, and high illumination. All were correctly predicted.

Predicted                Confidence
Speed limit (30 km/h)    100.0%
Speed limit (100 km/h)   98.67%
Speed limit (80 km/h)    67.18%
Slippery road            99.77%
Slippery road            98.28%
Speed limit (30 km/h)    99.96%

YOLOv4 was fine-tuned to detect traffic signs from the GTSDB. The pre-trained YOLOv4 already reaches an acceptable mAP (mean average precision). The purpose of fine-tuning is to decrease the training time, increase accuracy, and reduce the cost of training; it also effectively increases the data space. The fine-tuning technique was applied to the state-of-the-art object detection algorithm YOLOv4, pre-trained on the COCO dataset of 80 different classes, to specifically suit the needs of traffic sign recognition and detection using our custom dataset, and the fine-tuned network was trained on the GTSDB dataset. The weights of the convolutional layers were kept the same as in the pre-trained model, while the weights of the dense layers were updated with the new data. The input resolution was increased to 614 × 614, making it easier for the YOLOv4 model to detect smaller objects. Our model has only 43 classes and reuses the same weights in the convolutional layers, which makes the model much faster and also improves its performance.
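The paper does not reproduce its training configuration. As an illustrative assumption only, the helper below computes the values that are typically edited in a Darknet-style YOLOv4 configuration when fine-tuning on a custom dataset such as this one: the class count, the filter count of the convolution feeding each YOLO head, and the input resolution.

```python
def yolo_finetune_params(num_classes=43, anchors_per_head=3, input_size=614):
    """Values typically edited in a YOLOv4 cfg when fine-tuning on a custom dataset."""
    filters = anchors_per_head * (num_classes + 5)   # (x, y, w, h, objectness) + class scores
    return {
        "width": input_size,      # network input resolution (the paper reports 614 x 614)
        "height": input_size,
        "classes": num_classes,   # set in each [yolo] layer
        "filters": filters,       # conv layer immediately before each [yolo] layer
    }

print(yolo_finetune_params())
# {'width': 614, 'height': 614, 'classes': 43, 'filters': 144}
```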
4. Experimental results
For classification purposes, the developed CNN shown in Figure 4 achieved a good accuracy of 99.20% on the GTSRB test dataset. The accuracy and loss graphs for training and validation of the model are shown in Figure 8. Our model took approximately 6.63 s to classify all the images in the test set. The test set consists of 12,630 images; therefore, the model takes about 0.5 ms to classify a single image. It is considered fast for applications that are not real-time. The comparison between our CNN and other state-of-the-art CNN image classifiers is given in Table 2 [12, 16, 24]. The accuracy of our CNN is lower than that of some prominent classification CNNs, but our network uses far fewer parameters: 0.8 M, compared with 38.5 M for MCDNN [24] and 5.22 M for Co. CNNs [16]. To prove the robustness of our network, some of the successful classifications under different tough conditions, namely low and high illumination, occlusion, and blurring, are provided in Table 4.
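A short sketch of how the reported test-set accuracy, loss, and total classification time can be measured with Keras; the model and the preprocessed test arrays are passed in as arguments, since the names used here are placeholders rather than identifiers from the authors' code.

```python
import time
import numpy as np

def evaluate_classifier(model, x_test: np.ndarray, y_test: np.ndarray):
    """Report accuracy, loss, and total/per-image classification time on a test set."""
    start = time.perf_counter()
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    elapsed = time.perf_counter() - start
    per_image_ms = 1000.0 * elapsed / len(x_test)
    print(f"accuracy = {accuracy:.2%}, loss = {loss:.3f}")
    print(f"total time = {elapsed:.2f} s, {per_image_ms:.2f} ms per image")
    return accuracy, loss, elapsed
```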
To evaluate the performance of the different object detection methods used in this work, Faster R–CNN and YOLOv4, an experiment was conducted as described above, and the resulting weights from these object detection techniques were evaluated on a real-time video
feed from the streets of Berlin. The process of labeling, training, and evaluating these models was done on a PC with a 3.4 GHz Intel CPU, 16 GB of RAM, and a 4 GB NVIDIA GeForce 1060 GPU. CUDA was used to improve the performance and speed of the models. The fine-tuned Faster R–CNN reached an mAP (mean Average Precision) of 43.26% on the GTSDB, at a speed of 6 FPS (Frames per Second), and YOLOv4 reached a much higher mAP of 59.88% on the same dataset. Table 3 shows the comparison between these models. The Faster R–CNN is too slow and would need improvements to be used in real-time applications. A video, Video 1, was fed to the inference of both models, and the traffic signs that appear in the video, along with whether each model correctly detected and recognized them, are given in Table 5.

Supplementary video related to this article can be found at https://doi.org/10.1016/j.heliyon.2022.e11792.

5. Conclusion

Data availability statement

Data associated with this study has been deposited at https://drive.google.com/drive/folders/1DHnRnM4B0qCHOZaexG30kAScKHT7HD2o?usp=sharing.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

References