Article
Traffic Sign Detection and Recognition Using Multi-Frame
Embedding of Video-Log Images
Jian Xu, Yuchun Huang * and Dakan Ying
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China;
[email protected] (J.X.); [email protected] (D.Y.)
* Correspondence: [email protected]
Abstract: The detection and recognition of traffic signs is an essential component of intelligent vehicle
perception systems, which use on-board cameras to sense traffic sign information. Unfortunately,
issues such as long-tailed distribution, occlusion, and deformation greatly decrease the detector’s
performance. In this research, YOLOv5 is used as a single classification detector for traffic sign
localization. Afterwards, we propose a hierarchical classification model (HCM) for the specific
classification, which significantly reduces the degree of imbalance between classes without changing
the sample size. To cope with the shortcomings of a single image, a training-free multi-frame
information integration module (MIM) was constructed, which can extract the detection sequence
of traffic signs based on the embedding generated by the HCM. The extracted temporal detection
information is used for the redefinition of categories and confidence. Finally, this research performed detection and recognition over the full set of classes on two publicly available datasets, TT100K and ONCE.
Experimental results show that the HCM-improved YOLOv5 has a mAP of 79.0 in full classes, which
exceeds that of state-of-the-art methods, and achieves an inference speed of 22.7 FPS. In addition,
MIM further improves model performance by integrating multi-frame information while only slightly
increasing computational resource consumption.
Figure 1. Analysis of traffic signs. (a) A common way of classifying traffic signs; (b) several situations that have a negative impact on the detector; (c) sample distribution of the TT100K dataset.
Traffic signs are designed with distinct shapes such as squares, circles, and triangles, as well as distinct red, yellow, and blue colors to highlight the sign. Traditional detection and recognition methods are thus achieved by designing manual feature descriptors for traffic sign detection and recognition. The use of sliding windows to find high-probability regions in an image containing traffic signs is one example [1]. Researchers have also attempted to determine adaptive segmentation thresholds for traffic sign extraction and classification by computing histograms of images [2,3]. Color space has also been used, for example, segmentation in HSV [4]. Researchers extracted SURF feature points from signs and used corroding images to match them [5]. In some studies, more complex feature descriptors, such as coarse localization of signs based on the Hough transform [6], were used for sign extraction. Traditional digital morphology-based methods typically necessitate clear images with high-resolution signs and no anomalies to interfere, making them difficult to apply in complex real-world scenarios.
Researchers added machine learning algorithms for refinement based on digital morphology to improve the model's robustness. Traditional HOG features were combined with SVM for traffic sign detection and classification [7,8]. The researchers attempted to segment the images based on color features and then used SVM to implement classification on the segmented regions [9–11]. Since many traffic signs are circular in shape, the researchers used the circular Hough transform to detect the signs and then classified them using an SVM [12,13]. Although the support vector machine improves detection and recognition accuracy, it is still heavily reliant on manual features. As a result, the improved classification model, while capable of more accurate and detailed classification, struggles to deal with the anomalies depicted in Figure 1b.
Object detection algorithms based on convolutional neural networks have become a better choice for academics due to the extensive use of high-performance computer
systems. However, smaller traffic signs are difficult to detect and classify accurately as
the detailed features of small targets are difficult to transfer to the deeper feature maps.
Some researchers have used image super-resolution algorithms to enhance the detection of
small targets [14,15]. Since attention mechanisms can drive the network to focus more on
channel and spatial feature acquisition, adding attention mechanisms to the model can also
improve the model’s ability to extract semantic features [16–20].
From different viewpoints, traffic signs produce images of varying scales, and changes
in scale can also affect detector performance. Researchers attempted to incorporate deformable convolution into the network, which can adaptively adjust the receptive field [19,21,22]. Because the backbone generates feature maps of varying sizes during
layer-by-layer downsampling, some researchers have fused feature information from vari-
ous scales by constructing feature pyramids in order to improve the model’s extraction of
multi-scale features [22–30].
Some researchers have improved model performance without changing the detector
by using pre-processing techniques, such as image enhancement based on probabilistic
models [31], highlighting edge features of traffic signs [32], and enhancing the hue of dark
areas of images [33]. In addition, on-board cameras typically take high-resolution images for sensing the vehicle's surroundings, but this increases the search range for traffic signs. Therefore,
an attempt has been made in some studies to construct a coarse-to-fine framework, which is
used to reduce computational costs and improve model performance [34–36]. Background
information is frequently ignored, but some researchers have improved model accuracy
by using background detail features of neighboring signs [37,38]. In addition, spiking
neural networks (SNN) are used to improve existing traffic sign detection and recognition
algorithms [39–41], which can extract time-related features and have higher computational
efficiency on hardware platforms.
To obtain accurate indication information, we must classify traffic signs down to the
smallest category, which requires the algorithm to accurately identify up to several hundred
sign categories. As shown in Figure 1c, there is a great difference in sample size between
the traffic sign classes. This will result in classes with larger sample sizes having better
classification accuracy, while classes with sparse samples perform poorly [42]. Existing
studies usually only identify traffic signs according to three categories: prohibited, warning,
and mandatory, or remove categories with a sample size of less than 100. However, each
category of traffic sign is designed to convey important guidance information. Therefore,
the detection and recognition of traffic signs need to be implemented in as comprehensive
a range of categories as possible.
On-board cameras can continuously capture traffic signs, but most existing studies
only use information from a single image. Missed detections or misclassifications due to
anomalies such as occlusion and deformation are typically present in only a few frames, but
incorrect detection of a single sign can also pose a serious hazard, negatively impacting the
environment, infrastructure, and human life. Some researchers have attempted to improve
detector performance using image sequences in previous studies [43–46], but this often
necessitates an additional training process and consumes more computational resources.
Meanwhile, successive detections can result in redundant results.
In general, the main contributions of this paper are summarized as follows:
(1) We propose a hierarchical classification model (HCM) based on the natural distribu-
tion characteristics of traffic signs, which is used for the classification of traffic signs
in full classes. Meanwhile, the HCM significantly reduces the degree of imbalance
between classes without changing the sample size.
(2) To deal with missing or misleading information caused by anomalies, this study de-
signed a multi-frame information aggregation module (MIM) to extract the detection
sequence of traffic signs, which is based on the embedding generated by the HCM.
The temporal sequence of detection information can deal with the shortcomings of a
single image, reducing false detections caused by anomalies.
(3) We validated our method using two open-source datasets, TT100K and ONCE. The
HCM-improved YOLOv5 achieves a mAP of 79.0 in full classes, exceeding existing
state-of-the-art methods. Experiments using ONCE show that MIM further improves
the performance of the model by integrating multi-frame information.
2. Methods
The image sequence I captured by the on-board camera is used as input in this work,
and we need to detect and recognize traffic signs for each frame. Specifically, we need to
detect the bounding box $b_i = (x_i, y_i, w_i, h_i)$ of each traffic sign and recognize the specific category $c_i$. The central coordinates, width, and height of the bounding box are noted as $x$, $y$, $w$, and $h$, respectively.

$$\left\{ b_t^i, c_t^i \mid I \right\}, \quad c_t^i \in K,\ t \in T,\ i \in \{1, 2, \cdots, N_t\} \tag{1}$$
In conclusion, our work can be summarized as Equation (1). K and T in the formula are the set of traffic sign categories and the set of time, respectively. $N_t$ is the number of traffic signs in the image $I_t$. Figure 2 illustrates the overall framework of our method. While the vehicle moves, the on-board camera catches street scenes, creating a temporal sequence of photos I. The processing steps of our algorithm can be summarized as follows:
Step 1: We use the previous m frames $\{I_{t-m}, I_{t-m+1}, \ldots, I_{t-1}\}$ as the reference frame set for a given image $I_t$ inside I.
Step 2: Based on the image $I_t$ and all reference frames, YOLOv5 generates a number of candidate regions through detection.
Step 3: The hierarchical classification module (HCM) implements the specific classification of candidate areas.
Step 4: Based on the embedding extracted by the HCM, the multi-frame information
integration module (MIM) searches for associated boxes in reference frames.
Step 5: MIM analyzes the detection sequences generated by the association operation,
which is used for category and confidence redefinition.
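The per-frame flow of Steps 1–5 can be written as a short processing loop. The sketch below only illustrates the control flow: the three callables (detect, hcm_classify, mim_redefine) are hypothetical stand-ins for the detector, the HCM, and the MIM described in the following subsections, not the released implementation.

```python
from collections import deque

def process_sequence(frames, detect, hcm_classify, mim_redefine, m=2):
    """Run Steps 1-5 over a temporal image sequence I."""
    reference = deque(maxlen=m)      # Step 1: detections of the previous m frames
    outputs = []
    for image in frames:
        boxes = detect(image)        # Step 2: single-class YOLOv5 boxes + confidences
        dets = [hcm_classify(image, box) for box in boxes]  # Step 3: specific class + embedding
        dets = mim_redefine(dets, list(reference))          # Steps 4-5: associate and redefine
        reference.append(dets)
        outputs.append(dets)
    return outputs
```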
2.1. Detector
Object detection is used to locate targets in an image and perform category recogni-
tion. Yet, because of small objects, large-scale changes, and long-tailed distributions, it is
challenging to accomplish robust detection by merely applying a generic object detection al-
gorithm. At the same time, the algorithm’s inference speed is a crucial assessment criterion
in order to interpret sign information in a timely manner on rapidly moving cars.
Object detection can be divided into two categories, depending on the framework:
one-stage and two-stage. Object detection algorithms with two stages generate regional
proposals first, then classify each proposed region. A considerable number of regions of
interest are generated via region proposals. For example, R-CNN [47] creates approxi-
mately 2000 proposed regions in each input image. Because each proposed region needs
independent feature extraction and classification, two-stage object detection necessitates
considerable inference space and time costs.
In contrast to the classification and regression of proposed regions, one-stage object
detection typically divides the image into a number of grids, each containing a number of a
priori boxes. Following that, based on the feature maps provided by the backbone network,
the algorithm predicts the position and class of objects within each grid [48]. Compared to
the two-stage algorithm, the one-stage object detection uses a more direct global regression
and classification. The one-stage framework allows for fast inference as there are not a
large number of candidate regions to be computed independently.
In addition to the speed of inference, the accuracy of the model is an important
consideration. Because of the scale fluctuations of traffic signs and the prevalence of small objects, generic object detection methods frequently struggle to recognize them robustly. MS-COCO [49]
defines objects with an area of fewer than 32 × 32 pixels as small objects. The sparse
appearance of small objects makes it difficult for the algorithm to distinguish between
background and object, and it also places higher challenges on the model’s detection
accuracy [50]. During the feature extraction process, the backbone network can generate
feature maps of different sizes to represent information at various scales. Shallow feature
maps contain more detailed spatial features, while deeper feature maps represent more
abstract semantic features. To address the performance decrease caused by scale factors,
researchers designed the feature pyramid network (FPN) for fusing feature information at
multiple scales by concatenating or summing elements between feature maps.
For reasons of inference speed and detection accuracy, the detection and recognition
of traffic signs require a one-stage object detection algorithm that adapts to the multi-scale
variation. Following comparison, YOLOv5 is selected as the detector in this research. As
a member of the YOLO series of object detection algorithms, YOLOv5 not only inherits the
conventional quick detection capabilities but also applies a number of tactics to mitigate
the detrimental impacts of scale variation and small objects. Specifically, benefiting from
a unique residual structure and spatial pyramidal pooling, YOLOv5 has excellent multi-
scale feature extraction capabilities, which help to extract traffic sign features at different
distances. The creation of a bi-directional feature pyramid structure improves information
transfer across features at different scales and the model’s retention of detailed features in
the image. At the same time, traffic signs have distinct shape and color qualities that set them apart from the background, so YOLOv5 can be used to precisely locate traffic signs in
an image.
In the real world, traffic signs suffer from a sample imbalance between categories,
which has a direct impact on the dataset. Although object detection algorithms contain
both localization and classification capabilities, the long-tailed distribution of traffic signs
frequently results in a significant reduction in the algorithm's classification performance.
YOLOv5 achieves multi-classification in the head by modifying the feature map size, but
the imbalanced distribution of data has a substantial impact on the model’s detection and
classification performance. Therefore, YOLOv5 is not suitable for both the detection and
recognition of traffic signs, but traffic sign localization can be effectively solved by using a
single classification that distinguishes the traffic signs from the background.
Given the input image sequence I, we will utilize YOLOv5 to locate the traffic signs in
the image. For one of the images It , the detector will obtain Nt bounding boxes and the
corresponding confidences, denoted as $D_t = \{b_t^i, conf_t^i\},\ i \in \{1, 2, \cdots, N_t\}$. The bounding box is represented by the term $b_t^i$ in the equation, which contains the width, height, and centroid coordinates of the box, while the other component $conf_t^i$ is used to describe the confidence of the detection, $conf_t^i \in [0, 1]$. By processing the entire image sequence with
YOLOv5, we can obtain a detection sequence D.
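As a rough illustration of this single-classification stage, the snippet below loads a YOLOv5 model through torch.hub and keeps only the box geometry and confidence for each frame. This is a sketch assuming the public ultralytics/yolov5 hub interface; the detector in this work is retrained with a single traffic-sign class, so the class index returned by the pretrained checkpoint is simply discarded.

```python
import torch

# A public YOLOv5 checkpoint stands in for the single-class traffic sign detector
# described in the text (the actual model is retrained with one foreground class).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

def detect(image):
    """Return D_t = {(b_t^i, conf_t^i)} for one frame."""
    results = model(image)
    detections = []
    # results.xywh[0] rows: x_center, y_center, width, height, confidence, class
    for x_c, y_c, w, h, conf, _cls in results.xywh[0].tolist():
        detections.append({'box': (x_c, y_c, w, h), 'conf': conf})
    return detections
```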
When faced with imbalanced sample sizes, previous research has often used either
under-sampling or oversampling to balance the sample size. Both strategies are data-level
approaches, but under-sampling yields less data for model training, whereas oversampling
lengthens training time and may result in model overfitting [42]. Unlike data augmentation,
we employ a grouping strategy to classify traffic sign categories. Traffic signs are classified
into three superclasses: warning, prohibitory, and mandatory, with a number of specialized
subclasses within each superclass. As illustrated in Figure 3, the superclasses range signifi-
cantly in color and shape. For example, prohibitory signs have a red circular border, but
warning signs have triangular and yellow features. This significant difference in features
makes it easy for the classification algorithm to achieve better classification accuracy on
the superclasses. The difficulty in classification is to precisely identify subclasses with
a long-tailed distribution. However, we discovered that the majority of the mandatory
signs are seen in classes with a large sample size, whereas the warning signs are typically found in classes with a small sample size. While the overall sample distribution exhibits a significant long-tailed distribution, the difference in sample size between subclasses within a superclass is much smaller. Thus, grouping can improve the overall performance of the classification model by reducing the degree of imbalance without changing the sample size.
Figure 3. Three superclasses and their corresponding subclasses.
the number of categories, then change the feature embedding $e$ to a feature map of the desired size using convolution operations, and lastly use the softmax function to produce the classification result $c$, i.e., $c = g(e, \theta)$ or $c = g(h(b), \theta)$.
As shown in Figure 4, HCM begins prediction by identifying superclasses, following
which the relevant subclass classifier is chosen for a specific classification. Simultaneously,
there are model training requirements. We adjust the number of classifications by changing
the convolution kernel parameter θ in the classification component g. Take the superclass
classifier as an example, where the classification component generates a feature map of size
1 × 3 for the identification of the three superclasses. After completing the structural design
of the HCM, we use a cross-entropy loss function for model training. The $c_{gt}$ and $c_{pre}$ in Equation (2) denote the ground truth and predicted value, respectively.
$$L_{log}\left(c_{gt}, c_{pre}\right) = -\left[c_{gt}\log c_{pre} + \left(1 - c_{gt}\right)\log\left(1 - c_{pre}\right)\right] \tag{2}$$
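A minimal sketch of the hierarchical prediction described above, assuming simple linear heads in place of the convolutional classification components g and illustrative subclass counts: the superclass head is evaluated first, and the matching subclass head is then routed per sample, with each head trained under a cross-entropy objective in the spirit of Equation (2).

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Superclass head followed by per-superclass subclass heads (illustrative sizes)."""

    def __init__(self, embed_dim=256, subclass_counts=(40, 120, 72)):
        super().__init__()
        self.super_head = nn.Linear(embed_dim, len(subclass_counts))
        self.sub_heads = nn.ModuleList(nn.Linear(embed_dim, n) for n in subclass_counts)

    def forward(self, e):
        super_logits = self.super_head(e)            # identify the superclass first
        super_idx = super_logits.argmax(dim=-1)
        # route each embedding to the subclass head of its predicted superclass
        sub_logits = [self.sub_heads[int(s)](ei) for ei, s in zip(e, super_idx)]
        return super_logits, sub_logits

loss_fn = nn.CrossEntropyLoss()  # cross-entropy training objective for each classifier
```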
For one of the frames $I_t$, we first extract all the candidate regions from the detection result $D_t$. Afterwards, the HCM identifies each candidate region $b_t^i$ independently. Following the classification results, we supplement $D_t$ with specific categories and embeddings. After processing by HCM, $D_t$ can be expressed by Equation (3).
$$D_t = \left\{ b_t^i, conf_t^i, c_t^i, e_t^i \right\}, \quad i \in \{1, 2, \cdots, N_t\} \tag{3}$$
$$ed(x_1, x_2, y_1, y_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{6}$$

$$f_{center}(x_1, x_2, y_1, y_2) = 1 - \tanh\left(\frac{relu\left(ed(x_1, x_2, y_1, y_2) - \alpha\right)}{\beta}\right) \tag{7}$$
Figure 5. Controlled normalization of Euclidean distances based on $f_{center}$.
$$f(p_1, p_2) = \omega_{cos} \times f_{cos}(e_1, e_2) + \omega_{center} \times f_{center}(x_1, x_2, y_1, y_2), \quad s.t.\ \omega_{cos} + \omega_{center} = 1,\ \omega_{cos} > \omega_{center} \tag{8}$$

Since the function f achieves result association mostly through embedding, the weight parameters should be set so that $\omega_{cos}$ is greater than $\omega_{center}$. Figure 6 shows two examples of how similarity information from both aspects can be used to distinguish among different traffic signs. For a distinguishing feature $p_t^j$ in the current frame $I_t$, $j \in \{1, 2, \cdots, N_t\}$, the similarity is calculated based on the function f for the results in the reference frame $I_{re}$, $re \in \{t - m, t - m + 1, \cdots, t - 1\}$. As stated in Equation (9), the procedure will produce a set of similarity scores. Afterwards, we extracted the maximum similarity score $s_{re}^{max}$ and the corresponding distinguishing feature $p_{re}^{max}$ based on Equations (10) and (11), respectively.

$$s_{re} = \left\{ s_{re}^i \mid s_{re}^i = f\left(p_t^j, p_{re}^i\right),\ i \in \{1, 2, \cdots, N_{re}\} \right\} \tag{9}$$

$$s_{re}^{max} = \max\left\{ s_{re}^i \right\} \tag{10}$$

$$p_{re}^{max} = \underset{p_{re}^i}{\operatorname{argmax}}\, f\left(p_t^j, p_{re}^i\right), \quad s.t.\ i \in \{1, 2, \cdots, N_{re}\} \tag{11}$$
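The similarity of Equation (8) and the per-reference-frame association of Equations (9)–(11) can be sketched as follows; the weights, α, β, and the threshold ε are the hyperparameters introduced in the text, shown here with placeholder values only.

```python
import math

def f_center(x1, y1, x2, y2, alpha=30.0, beta=60.0):
    """Equations (6)-(7): controlled normalization of the center distance."""
    ed = math.hypot(x1 - x2, y1 - y2)
    return 1.0 - math.tanh(max(ed - alpha, 0.0) / beta)   # relu(.) followed by tanh squashing

def f_cos(e1, e2):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm = math.sqrt(sum(a * a for a in e1)) * math.sqrt(sum(b * b for b in e2))
    return dot / (norm + 1e-12)

def similarity(p, q, w_cos=0.7, w_center=0.3):
    """Equation (8): the embedding term dominates the position term (w_cos > w_center)."""
    return w_cos * f_cos(p['emb'], q['emb']) + w_center * f_center(*p['center'], *q['center'])

def associate(p, reference_dets, eps=0.6):
    """Equations (9)-(11): best-matching detection in one reference frame, gated by eps."""
    scored = [(similarity(p, q), q) for q in reference_dets]   # Equation (9)
    if not scored:
        return None
    s_max, p_max = max(scored, key=lambda sq: sq[0])           # Equations (10)-(11)
    return p_max if s_max > eps else None
```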
Figure 6. Two types of association scenarios. (a) Distinguishing between different categories of traffic signs; (b) distinguishing between different instances of the same category.
If $s_{re}^{max}$ exceeds the similarity threshold $\varepsilon$, the detection result corresponding to $p_{re}^{max}$ becomes the association result in $I_{re}$, which is denoted as $d_{re}^j$. Similarly, we seek association results that satisfy the requirements in all reference frames. After that, the detection result $d_t^j$ is included to generate the sequence of detection results $u^j = \left\{ d_{t-m}^j, d_{t-m+1}^j, \ldots, d_{t-1}^j, d_t^j \right\}$. Lastly, for all traffic signs in the current frame $I_t$, we can extract the set of detection results $U = \left\{ u^j \right\},\ j \in \{1, 2, \cdots, N_t\}$.
2.3.2. Sequence Analysis
In real-world circumstances, traffic signs have anomalies such as occlusion and deformation, which can lead to false detection or misclassification by the detector. Fundamentally, the anomalies cause the image's information to be absent or deceptive. As anomalies are typically present in only a few frames, the lack of information in a single image can be compensated for by employing several detections, which can enhance the model's performance even further.
Based on the findings of the preceding analysis, MIM redefines categories and confidences based on the sequence of detection results, as illustrated in Figure 7. To count the confidence of category $c_{target}$ in the sequence, we construct a statistical function $v$, denoted as Equation (12). After that, Equation (13) is used to calculate the category with the highest cumulative confidence in the sequence $u^j$. Finally, the category with the highest confidence becomes the redefinition category $c_t^j$, while the redefinition confidence $conf_t^j$ is calculated by Equation (14).

$$v\left(d, c_{target}\right) = \begin{cases} conf, & \text{if } c = c_{target} \\ 0, & \text{if } c \neq c_{target} \end{cases} \tag{12}$$

$$c_t^j = \underset{c}{\operatorname{argmax}} \sum_{\tau = t-m}^{t} v\left(d_\tau^j, c\right), \quad s.t.\ c \in K \tag{13}$$

$$conf_t^j = \frac{1}{m+1} \sum_{\tau = t-m}^{t} v\left(d_\tau^j, c_t^j\right) \tag{14}$$
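Equations (12)–(14) amount to a confidence-weighted vote over the associated detection sequence $u^j$; a compact sketch is given below, where each slot of the sequence holds either a detection dict or None when no association was found in that frame. If the redefined confidence exceeds the threshold γ introduced next, it replaces the current result as in Equation (15).

```python
from collections import defaultdict

def redefine(sequence):
    """Category and confidence redefinition over u^j = {d_{t-m}, ..., d_t}.

    `sequence` has one entry per frame in the window (m + 1 in total);
    entries are dicts with 'cls' and 'conf', or None if no association was found."""
    votes = defaultdict(float)
    for d in sequence:
        if d is not None:
            votes[d['cls']] += d['conf']          # Equations (12)-(13): accumulate v(d, c)
    if not votes:                                 # no detection survived in the window
        return None, 0.0
    best_cls = max(votes, key=votes.get)          # category with the largest cumulative confidence
    new_conf = votes[best_cls] / len(sequence)    # Equation (14): average over the m + 1 frames
    return best_cls, new_conf
```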
Figure 7. Correlation is used to extract traffic sign detection sequences, and the current results are then redefined by integrating information from multiple detections.
To eliminate detection results with poor confidence, a hyperparameter $\gamma$ is applied. When the confidence $conf_t^j$ is greater than $\gamma$, the categories and confidence levels in $d_t^j$ are replaced by the redefined results. The final result can be expressed by Equation (15). Based on the same process, we redefine all detection results in the current frame $I_t$.

$$D_t = \left\{ b_t^j, conf_t^j, c_t^j, e_t^j \right\}, \quad j \in \{1, 2, \cdots, N_t\} \tag{15}$$

Overall, by integrating information from multiple detections, sequence analysis mitigates the unfavorable impact of abnormalities. While the majority of the results in the detection sequence are correct, the redefinition results from sequence analysis can correct for a small number of classification errors. Simultaneously, deformation might occur at viewpoints near traffic signs, which usually results in low confidence in the results. When employing multiple detections, the confidence level can be enhanced by leveraging high-confidence information from previous results, resulting in fewer missed detections.
3. Results
3.1. Dataset
We deal with specific kinds of traffic signs in this study; however, a portion of the traffic sign datasets are only labeled with three categories: warning, prohibitory, and mandatory [53,54]. In contrast, TT100K [55] and ONCE [56] are better candidates because their annotation information is more detailed.
TT100K contains 10,000 images with a resolution of 2048 × 2048. At the same time, the data annotation is further refined into 232 specific categories, as shown in Figure 8. Existing studies typically remove categories with a sample size of less than 100 [19,57–60], but we use the full traffic sign category for model training and testing. On the other hand, the ONCE dataset is an autonomous driving dataset with millions of scenes. The images in the dataset were selected from 144 h of on-board camera video, taken under different lighting and weather conditions. In order to use the temporal image data in ONCE, we annotated the ONCE test set similarly to TT100K. After removing the night data, the test set contains 13,268 time-series images with a resolution of 1920 × 1020.
Figure 8. Traffic sign categories in TT100K.
3.2. Metrics
The experiment uses precision, recall, and F1 score as metrics to evaluate the overall performance of the detector, as indicated in Equations (16)–(18). F1 is the harmonic mean of precision and recall.

$$Precision = \frac{TP}{TP + FP} \tag{16}$$

$$Recall = \frac{TP}{TP + FN} \tag{17}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{18}$$

where true positives (TP) are the number of samples that are actually positive and classified as positive by the classifier; false positives (FP) are the number of samples that are actually negative but classified as positive by the classifier; and false negatives (FN) are the number of samples that are actually positive but classified as negative by the classifier.
Certain evaluation criteria, such as precision, have the potential to mislead researchers [42]. When a long-tailed distribution exists, high scores may mistakenly represent good performance. As a consequence, we use mAP to further evaluate the model's performance. As indicated in Equation (19), mAP is the mean of each category's average precision, i.e., the sum of the average precision over all categories divided by the number of categories. This paper uses a fixed intersection-over-union (IoU) value of 0.5 for computing mAP.

$$mAP = \frac{\sum Average\ Precision}{N(Class)} \tag{19}$$
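The counting behind Equations (16)–(18) is straightforward to implement; the per-class AP and the mAP of Equation (19) are computed with a standard COCO-style evaluator in practice, so only the simpler overall metrics are sketched here, with made-up confusion counts as the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (16)-(18): precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for one class, not values from the experiments:
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)   # -> 0.800, 0.889, 0.842
```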
Table 2. The overall metrics of YOLOv5 and YOLOv7 when used as a single classification detector.
YOLOv7 outperformed YOLOv5 in terms of AP; however, when inferring, the de-
tection algorithm needed to determine the final output based on a confidence threshold.
Based on the same confidence threshold of 0.5, YOLOv7_l outperforms YOLOv5 in terms
of precision but is lower than YOLOv5 in terms of recall and F1 score. Therefore, YOLOv5
is more suitable for traffic sign detection than YOLOv7.
We tested performance using SSD, Faster RCNN, CenterNet, and YOLOv5 as baselines, which are derived from diverse object detection architectures, to show that generic object detection algorithms are difficult to apply directly to the detection and recognition
of traffic signs. SSD and YOLOv5 are typical one-stage models; Faster RCNN is a two-stage
algorithm; and CenterNet is well known for its unique anchor-free architecture. Since
TT100K has detailed category information, all baselines are trained using the training set of
TT100K until the model converges.
First, we recorded the overall metrics of the baselines on the TT100K test set in
Table 3. The results show that CenterNet and YOLOv5, which have been proposed in recent
years, outperformed SSD and Faster RCNN. Figure 9 compares the detection results of the
baselines to further analyze the reasons for the difference in performance. The traffic signs
in the red boxes are typical small objects in this case, and the signs in the yellow and blue
regions show deformations due to the viewpoint. According to the results, SSD and Faster
RCNN, which lack the ability to fuse multi-scale information, have a high percentage of
missed detections on small objects, whereas CenterNet and YOLOv5, which use a feature
pyramid structure, detect much more small-object traffic signs.
At the same time, this example reflects the negative impact of the long-tailed distribu-
tion on the detector. Specifically, the traffic sign in the yellow area has some deformation,
but the sample size of the corresponding category is sufficient. In contrast, the correspond-
ing category in the blue region has a much smaller sample size. Although the traffic signs
in the yellow and blue areas have similar deformations, the difference in sample size leads
to completely different results.
Figure 9. Detection results of different baselines on the TT100K dataset.

Table 3. The overall metrics of baselines evaluated on the TT100K dataset.

Method         Precision (%)   Recall (%)   F1 (%)
SSD            32.26           12.56        18.08
Faster RCNN    33.30           57.02        42.04
CenterNet      54.32           57.12        55.69
YOLOv5         74.37           80.93        77.51

The overall metrics do not accurately represent the model's performance across categories with varied sample sizes. As a result, we first compute the average precision of the baselines across all categories. After that, we calculated mAP in accordance with the difference in sample size, which is shown in Table 4.

Table 4. The mAP of baselines evaluated on the TT100K dataset.

Method         mAP_all   mAP_small   mAP_medium   mAP_large
SSD            5.87      1.92        8.69         13.00
Faster RCNN    10.30     7.31        11.41        16.93
CenterNet      16.95     4.87        17.11        48.88
YOLOv5         26.64     1.73        36.78        80.79

The results show that there is a significant difference in accuracy between the categories with an adequate sample size and those with fewer samples. YOLOv5 has a higher mAP in categories with sufficient sample size. However, for the categories with fewer samples, the gap between baselines is substantially lower. As a result, YOLOv5 is not suitable for combining detection and multi-classification tasks for objects having a long-tailed distribution, such as traffic signs.
The photos in TT100K are often taken in bright light, and the majority of the samples
are small. ONCE, on the other hand, records photographs in a variety of weather conditions,
such as sunny, rainy, and cloudy days, but with a lower proportion of small objects than
TT100K. We used the same strategy to compute the metrics for the ONCE baseline method
and recorded them in Table 5.
The test results on ONCE are largely consistent with those on TT100K, with the exception that Faster RCNN performs substantially better on ONCE than on TT100K, showing that small objects are the primary cause of Faster RCNN's limited performance. Figure 10 shows detection
in three types of weather to illustrate the impact of weather on model performance. The
images in the sunny environment are clear, but those in the cloudy and rainy surroundings
are substantially dimmer. The effect of environmental elements is also represented in the
baseline results; for example, cloudy and rainy conditions result in more missed detections
FOR PEERresults;
REVIEWfor example, cloudy and rainy conditions result in more missed detections 18 of
or misclassifications. It was also discovered that, while Faster RCNN performed well
overall, it lacked localization precision.
Figure 10. Detection results of different baselines on the ONCE dataset.
Table 7. The overall metrics of YOLOv5 and YOLOv5-HC evaluated on the TT100K dataset.
We computed mAP for YOLOv5-HC on categories with varied sample sizes in order to
explicitly analyze the performance of HCM on fewer sample categories, which is recorded
in Table 8. YOLOv5-HC improves accuracy across all categories. As the sample size
reduces, the degree of performance improvement of HCM on YOLOv5 increases. In
addition, we compared YOLOv5-HC with related methods in full classes, as shown in
Table 9. Compared to the state-of-the-art model, YOLOv5-HC has a 7.1% improvement in
mAP across all categories. Furthermore, our method has fewer model parameters.
Table 8. The mAP of YOLOv5 and YOLOv5-HC evaluated on the TT100K dataset.
As traffic sign identification and recognition require rapid perception information, the
model’s inference speed is also an important metric. We used Nvidia 2080ti to calculate
the inference speed of YOLOv5 and YOLOv5-HC, as shown in Table 7. Despite the fact
that the use of HCM increased the inference time, YOLOv5-HC still achieves an inference
speed of 22.7 FPS. In terms of the balance between model accuracy and inference speed,
we want to improve the accuracy of the model as much as possible while satisfying the
real-time condition. Taking the autonomous driving dataset ONCE as an example, the data
is sampled at a frequency of 10 FPS, so we consider that the inference speed of 22.7 FPS
satisfies the real-time requirement.
Table 10. The overall metrics of YOLOv5 and two improved versions evaluated on the ONCE dataset.
The on-board camera, as shown in Figure 12, takes continuous photos of the same
instance as the vehicle moves. Except for the SSD with poor overall performance, the
remaining detectors achieved accurate detection and recognition in the first two frames.
However, the majority of the detectors exhibited a missed detection at moment t due to
deformation. Although the deformed traffic signs were detected by YOLOv5-HC, the result
had a low confidence level. In contrast, because YOLOv5-HM utilizes detection information
from multiple frames, the higher confidence in the first two frames is used in the sequence
analysis to obtain a higher confidence at moment t.
To investigate the impact of MIM further, we ran mAP calculations under various
weather conditions and sample sizes, as shown in Table 11. According to the results,
YOLOv5-HM achieves an optimal value of 73.86 for the overall mAP, an improvement of
1.07 over YOLOv5-HC. YOLOv5-HM outperforms the pre-improvement model in most
sample size categories. YOLOv5-HM demonstrated the most substantial performance boost
in sunny settings, with an overall mAP improvement of 1.37. YOLOv5-HM, on the other
hand, demonstrated a relatively smaller performance improvement in cloudy and rainy
situations.
Table 11. The mAP of YOLOv5 and two improved versions evaluated on the ONCE dataset.
4. Discussion
The evaluation results based on TT100K and ONCE demonstrate that the feature pyramid structure is critical in dealing with small objects and scale variations. At the same time, the test results reveal a shortcoming in the localization accuracy of Faster RCNN, which originates from the coarseness of the feature map and the limited information offered by the candidate boxes [48]. Furthermore, the evaluation findings on classes with varying sample sizes show that the long-tailed distribution has a considerable detrimental impact on the detector's performance. For reasons of inference speed and detection performance, YOLOv5 becomes a better choice. The use of YOLOv5 as a single classification detector allows for more efficient localization of traffic signs due to their unique color and shape features. Detecting the most challenging case in Figure 11 also demonstrates increased localization performance.
To cope with detector performance loss caused by unequal distribution, we propose a hierarchical classification model (HCM) that divides traffic signs into three superclasses and corresponding subclasses. This classification makes use of the distribution characteristics of traffic signs. Specifically, mandatory signs are mostly in the categories with a large sample size, whereas warning signs are mostly in the categories with a small sample size. While the overall sample distribution exhibits a significant long-tailed distribution, the difference in sample size between subclasses within a superclass is much smaller.
Equation (20) can be utilized to quantify the degree of imbalance in sample size between classes, which is calculated as the ratio of the maximum and minimum sample size across all categories [42].

$$\rho = \frac{\max_i \left\{ |C_i| \right\}}{\min_i \left\{ |C_i| \right\}} \tag{20}$$
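Equation (20) is simple to compute from per-class sample counts; the sketch below contrasts the global ratio with the ratio inside a single superclass, using made-up counts purely to illustrate why grouping lowers ρ.

```python
def imbalance_ratio(class_counts):
    """Equation (20): ratio of the largest class size to the smallest."""
    return max(class_counts) / min(class_counts)

# Hypothetical sample counts (not TT100K statistics):
all_classes = [2500, 900, 420, 60, 12, 5]       # long-tailed distribution over every class
one_superclass = [420, 60, 12]                  # subclasses grouped under one superclass
print(imbalance_ratio(all_classes))             # 500.0
print(imbalance_ratio(one_superclass))          # 35.0 -- far smaller within a superclass
```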
To clearly highlight the differences after grouping, some of the categories were ran-
domly selected from the large, medium, and small categories. Figure 14 indicates that after
grouping, the sample distribution of traffic signs is more balanced, which is especially
noticeable for warning traffic signs. Meanwhile, when the sample size falls, the reduction
in ρ becomes bigger, resulting in more performance gains for YOLOv5-HC in the fewer
sample categories. In addition, owing to its lightweight design, the HCM’s hierarchical
classification structure does not considerably slow down inference speed. YOLOv5-HC is
similar to the two-stage object detection algorithm, but the inference speed is much faster
than Faster RCNN.
Figure 14. Change in value ρ across categories with different sample sizes after grouping. (a) small categories; (b) medium categories; (c) large categories.
Small objects, scale variations, and long-tailed distributions are the main causes of detector performance degradation. Furthermore, real-world scenes contain anomalies such as deformation and occlusion, which frequently result in missed detections or misclassifications. Challenging cases generated by anomalies are difficult to solve since the anomalies result in missing or misleading image information. To overcome the limitations of a single image, our proposed multi-frame information integration module (MIM) integrates data from multiple detections to achieve robust detection and recognition. Meanwhile, correlation can be utilized to eliminate the redundant results produced by successive detections.
The evaluation results on the ONCE dataset demonstrate that MIM achieves a per-
formance improvement in most cases. To further analyze the role of MIM, we varied
the range of information integration by adjusting the number of reference frames. The
data in Table 12 show that as the number of reference frames increases, the performance
of YOLOv5-HM gradually improves, indicating the significance of utilizing multi-frame
information. However, since traffic signs in ONCE are typically photographed three times,
there is no further improvement in model performance when the number of reference
frames m in the experiment exceeds two.
Table 12. YOLOv5-HM’s overall metrics evaluated on different reference frame numbers.
5. Conclusions
In this paper, two novel and simple-to-implement modules are proposed to improve
the performance of YOLOv5 for traffic sign detection and recognition. YOLOv5 provides
outstanding localization performance for small objects and scale variations as a single classi-
fication detector. To reduce the negative impact of long-tailed distributions on classification,
we propose a hierarchical classification module for the specific classification of traffic signs.
Through grouping, HCM divides traffic signs into three superclasses and corresponding
subclasses. The grouping takes advantage of traffic sign distributional characteristics,
which can greatly reduce sample size discrepancies between classes. However, in the pres-
ence of anomalies such as occlusion and deformation, single-image-based algorithms still
suffer from missing detection or misclassification. To deal with missing or misleading infor-
mation caused by anomalies, this study designed a multi-frame information aggregation
module to extract the detection sequence of traffic signs, which is based on the embedding
generated by the HCM. The temporal sequence of detection information can deal with the
shortcomings of a single image, reducing false detections caused by anomalies.
Experimental results based on TT100K show that YOLOv5-HC achieves a mAP of 79.0
in full classes, which exceeds state-of-the-art methods. At the same time, the inference
speed of 22.7 FPS satisfies the real-time requirement. Furthermore, YOLOv5-HM using
MIM outperformed YOLOv5-HC in terms of overall accuracy, with 0.67% improvement in
precision, 1.32% improvement in recall, and 1.05 improvement in F1 score, respectively.
YOLOv5 has some shortcomings in traffic sign detection and consumes most of the
computational resources. Therefore, we will improve the existing detection module in
our research as more advanced object detection algorithms are proposed. Meanwhile, we
will also try to improve the inference speed of the existing model using SNN, which can
better balance the inference time and model accuracy. In addition, due to the uniqueness of
the colors as well as the structure of the traffic signs, we also consider the use of VAE or
statistical models to generate distributions that can be used for traffic sign recognition.
Author Contributions: Conceptualization, Y.H. and J.X.; methodology, Y.H. and J.X.; software, J.X.;
supervision, Y.H.; validation, D.Y.; visualization, J.X. and D.Y.; writing—original draft, J.X.; writing—
review and editing, Y.H. and J.X. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by the Wuhan University–Huawei Geoinformatics Innova-
tion Laboratory.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Rehman, Y.; Khan, J.A.; Shin, H. Efficient coarser-to-fine holistic traffic sign detection for occlusion handling. IET Image Process.
2018, 12, 2229–2237. [CrossRef]
2. Xu, X.; Jin, J.; Zhang, S.; Zhang, L.; Pu, S.; Chen, Z. Smart data driven traffic sign detection method based on adaptive color
threshold and shape symmetry. Future Gener. Comput. Syst.-Int. J. Escience 2019, 94, 381–391. [CrossRef]
3. Yang, Y.; Luo, H.; Xu, H.; Wu, F. Towards Real-Time Traffic Sign Detection and Classification. IEEE Trans. Intell. Transp. Syst. 2016,
17, 2022–2031. [CrossRef]
4. Cao, J.; Song, C.; Peng, S.; Xiao, F.; Song, S. Improved Traffic Sign Detection and Recognition Algorithm for Intelligent Vehicles.
Sensors 2019, 19, 4021. [CrossRef]
5. Guo, S.; Yang, X. Fast recognition algorithm for static traffic sign information. Open Phys. 2018, 16, 1149–1156. [CrossRef]
6. Yin, S.; Ouyang, P.; Liu, L.; Guo, Y.; Wei, S. Fast Traffic Sign Recognition with a Rotation Invariant Binary Pattern Based Feature.
Sensors 2015, 15, 2161–2180. [CrossRef] [PubMed]
7. Hechri, A.; Mtibaa, A. Two-stage traffic sign detection and recognition based on SVM and convolutional neural networks. IET Image Process. 2020, 14, 939–946. [CrossRef]
8. Bouti, A.; Mahraz, M.A.; Riffi, J.; Tairi, H. A robust system for road sign detection and classification using LeNet architecture
based on convolutional neural network. Soft Comput. 2020, 24, 6721–6733. [CrossRef]
9. Madani, A.; Yusof, R. Traffic sign recognition based on color, shape, and pictogram classification using support vector machines.
Neural Comput. Appl. 2018, 30, 2807–2817. [CrossRef]
10. Lillo-Castellano, J.M.; Mora-Jimenez, I.; Figuera-Pozuelo, C.; Rojo-Alvarez, J.L. Traffic sign segmentation and classification using
statistical learning methods. Neurocomputing 2015, 153, 286–299. [CrossRef]
11. Li, H.; Sun, F.; Liu, L.; Wang, L. A novel traffic sign detection method via color segmentation and robust shape matching.
Neurocomputing 2015, 169, 77–88. [CrossRef]
12. Saadna, Y.; Behloul, A.; Mezzoudj, S. Speed limit sign detection and recognition system using SVM and MNIST datasets. Neural
Comput. Appl. 2019, 31, 5005–5015. [CrossRef]
13. Berkaya, S.K.; Gunduz, H.; Ozsen, O.; Akinlar, C.; Gunal, S. On circular traffic sign detection and recognition. Expert Syst. Appl.
2016, 48, 67–75. [CrossRef]
14. Yu, Y.; Jiang, T.; Li, Y.; Guan, H.; Li, D.; Chen, L.; Yu, C.; Gao, L.; Gao, S.; Li, J. SignHRNet: Street-level traffic signs recognition
with an attentive semi-anchoring guided high-resolution network. ISPRS J. Photogramm. Remote Sens. 2022, 192, 142–160. [CrossRef]
15. Wang, Z.-Z.; Xie, K.; Zhang, X.-Y.; Chen, H.-Q.; Wen, C.; He, J.-B. Small-Object Detection Based on YOLO and Dense Block via
Image Super-Resolution. IEEE Access 2021, 9, 56416–56429. [CrossRef]
16. Li, Y.; Li, J.; Meng, P. Attention-YOLOV4: A real-time and high-accurate traffic sign detection algorithm. Multimed. Tools Appl.
2023, 82, 7567–7582. [CrossRef]
17. Wei, H.; Zhang, Q.; Qian, Y.; Xu, Z.; Han, J. MTSDet: Multi-scale traffic sign detection with attention and path aggregation. Appl.
Intell. 2023, 53, 238–250. [CrossRef]
18. Wang, X.; Guo, J.; Yi, J.; Song, Y.; Xu, J.; Yan, W.; Fu, X. Real-Time and Efficient Multi-Scale Traffic Sign Detection Method for
Driverless Cars. Sensors 2022, 22, 6930. [CrossRef]
19. Hu, J.; Wang, Z.; Chang, M.; Xie, L.; Xu, W.; Chen, N. PSG-Yolov5: A Paradigm for Traffic Sign Detection and Recognition
Algorithm Based on Deep Learning. Symmetry 2022, 14, 2262. [CrossRef]
20. Triki, N.; Karray, M.; Ksantini, M. A Real-Time Traffic Sign Recognition Method Using a New Attention-Based Deep Convolutional
Neural Network for Smart Vehicles. Appl. Sci. 2023, 13, 4793. [CrossRef]
21. Gao, X.; Chen, L.; Wang, K.; Xiong, X.; Wang, H.; Li, Y. Improved Traffic Sign Detection Algorithm Based on Faster R-CNN. Appl.
Sci. 2022, 12, 8948. [CrossRef]
22. Liu, Z.; Shen, C.; Fan, X.; Zeng, G.; Zhao, X. Scale-aware limited deformable convolutional neural networks for traffic sign
detection and classification. IET Intell. Transp. Syst. 2020, 14, 1712–1722. [CrossRef]
23. Zhang, Y.; Lu, Y.; Zhu, W.; Wei, X.; Wei, Z. Traffic sign detection based on multi-scale feature extraction and cascade feature fusion.
J. Supercomput. 2023, 79, 2137–2152. [CrossRef]
24. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput.
Appl. 2022, 35, 7853–7865. [CrossRef]
25. Wu, J.; Liao, S. Traffic Sign Detection Based on SSD Combined with Receptive Field Module and Path Aggregation Network.
Comput. Intell. Neurosci. 2022, 2022, 4285436. [CrossRef] [PubMed]
26. Yao, Y.; Han, L.; Du, C.; Xu, X.; Jiang, X. Traffic sign detection algorithm based on improved YOLOv4-Tiny. Signal Process.-Image
Commun. 2022, 107, 116783. [CrossRef]
27. Liu, Y.; Peng, J.; Xue, J.-H.; Chen, Y.; Fu, Z.-H. TSingNet: Scale-aware and context-rich feature learning for traffic sign detection
and recognition in the wild. Neurocomputing 2021, 447, 10–22. [CrossRef]
28. Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Traffic sign detection and recognition based on pyramidal convolutional networks. Neural
Comput. Appl. 2020, 32, 6533–6543. [CrossRef]
29. Yuan, Y.; Xiong, Z.; Wang, Q. VSSA-NET: Vertical Spatial Sequence Attention Network for Traffic Sign Detection. IEEE Trans.
Image Process. 2019, 28, 3423–3434. [CrossRef]
30. Ou, Z.; Xiao, F.; Xiong, B.; Shi, S.; Song, M. FAMN: Feature Aggregation Multipath Network for Small Traffic Sign Detection. IEEE
Access 2019, 7, 178798–178810. [CrossRef]
31. Suto, J. An Improved Image Enhancement Method for Traffic Sign Detection. Electronics 2022, 11, 871. [CrossRef]
32. Khan, J.A.; Chen, Y.; Rehman, Y.; Shin, H. Performance enhancement techniques for traffic sign recognition using a deep neural
network. Multimed. Tools Appl. 2020, 79, 20545–20560. [CrossRef]
33. Khan, J.A.; Yeo, D.; Shin, H. New Dark Area Sensitive Tone Mapping for Deep Learning Based Traffic Sign Recognition. Sensors
2018, 18, 3776. [CrossRef] [PubMed]
34. Wang, Z.; Wang, J.; Li, Y.; Wang, S. Traffic Sign Recognition with Lightweight Two-Stage Model in Complex Scenes. IEEE Trans.
Intell. Transp. Syst. 2022, 23, 1121–1131. [CrossRef]
35. Liu, L.; Wang, Y.; Li, K.; Li, J. Focus First: Coarse-to-Fine Traffic Sign Detection with Stepwise Learning. IEEE Access 2020, 8,
171170–171183. [CrossRef]
36. Song, Y.; Fan, R.; Huang, S.; Zhu, Z.; Tong, R. A three-stage real-time detector for traffic signs in large panoramas. Comput. Vis.
Media 2019, 5, 403–416. [CrossRef]
37. Min, W.; Liu, R.; He, D.; Han, Q.; Wei, Q.; Wang, Q. Traffic Sign Recognition Based on Semantic Scene Understanding and
Structural Traffic Sign Location. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15794–15807. [CrossRef]
38. Tian, Y.; Gelernter, J.; Wang, X.; Li, J.; Yu, Y. Traffic Sign Detection Using a Multi-Scale Recurrent Attention Network. IEEE Trans.
Intell. Transp. Syst. 2019, 20, 4466–4475. [CrossRef]
39. Rasteh, A.; Delpech, F.; Aguilar-Melchor, C.; Zimmer, R.; Shouraki, S.B.; Masquelier, T. Encrypted internet traffic classification
using a supervised spiking neural network. Neurocomputing 2022, 503, 272–282. [CrossRef]
40. Zhang, Y.; Xu, H.; Huang, L.; Chen, C. A storage-efficient SNN-CNN hybrid network with RRAM-implemented weights for
traffic signs recognition. Eng. Appl. Artif. Intell. 2023, 123, 106232. [CrossRef]
41. Xie, K.; Zhang, Z.; Li, B.; Kang, J.; Niyato, D.; Xie, S.; Wu, Y. Efficient Federated Learning with Spike Neural Networks for Traffic
Sign Recognition. IEEE Trans. Veh. Technol. 2022, 71, 9980–9992. [CrossRef]
42. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [CrossRef]
43. Yu, J.; Ye, X.; Tu, Q. Traffic Sign Detection and Recognition in Multiimages Using a Fusion Model with YOLO and VGG Network.
IEEE Trans. Intell. Transp. Syst. 2022, 23, 16632–16642. [CrossRef]
44. Atif, M.; Zoppi, T.; Gharib, M.; Bondavalli, A. Towards Enhancing Traffic Sign Recognition through Sliding Windows. Sensors
2022, 22, 2683. [CrossRef] [PubMed]
45. Zhang, Y.; Wang, Z.; Song, R.; Yan, C.; Qi, Y. Detection-by-tracking of traffic signs in videos. Appl. Intell. 2022, 52, 8226–8242.
[CrossRef]
46. Song, S.; Li, Y.; Huang, Q.; Li, G. A New Real-Time Detection and Tracking Method in Videos for Small Target Traffic Signs. Appl.
Sci. 2021, 11, 3061. [CrossRef]
47. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
48. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [CrossRef]
49. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September
2014; pp. 740–755.
50. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020,
97, 103910. [CrossRef]
51. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada,
7–12 December 2015.
52. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
53. Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; Igel, C. Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013.
54. Zhang, J.; Zou, X.; Kuang, L.-D.; Wang, J.; Sherratt, R.S.; Yu, X. CCTSDB 2021: A More Comprehensive Traffic Sign Detection
Benchmark. Hum.-Cent. Comput. Inf. Sci. 2022, 12, 23. [CrossRef]
55. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2110–2118.
56. Mao, J.; Niu, M.; Jiang, C.; Liang, H.; Chen, J.; Liang, X.; Li, Y.; Ye, C.; Zhang, W.; Li, Z.; et al. One million scenes for autonomous
driving: ONCE dataset. arXiv 2021, arXiv:2106.11037.
57. Chu, J.; Zhang, C.; Yan, M.; Zhang, H.; Ge, T. TRD-YOLO: A Real-Time, High-Performance Small Traffic Sign Detection Algorithm.
Sensors 2023, 23, 3871. [CrossRef] [PubMed]
58. Sharma, V.; Dhiman, P.; Rout, R.K. Improved traffic sign recognition algorithm based on YOLOv4-tiny. J. Vis. Commun. Image
Represent. 2023, 91, 103774. [CrossRef]
59. Wang, L.; Wang, L.; Zhu, Y.; Chu, A.; Wang, G. CDFF: A fast and highly accurate method for recognizing traffic signs. Neural
Comput. Appl. 2023, 35, 643–662. [CrossRef]
60. Yuan, X.; Kuerban, A.; Chen, Y.; Lin, W. Faster Light Detection Algorithm of Traffic Signs Based on YOLOv5s-A2. IEEE Access
2023, 11, 19395–19404. [CrossRef]
61. Cao, J.; Zhang, J.; Jin, X. A Traffic-Sign Detection Algorithm Based on Improved Sparse R-CNN. IEEE Access 2021, 9, 122774–122788.
[CrossRef]
62. Gao, E.; Huang, W.; Shi, J.; Wang, X.; Zheng, J.; Du, G.; Tao, Y. Long-Tailed Traffic Sign Detection Using Attentive Fusion and
Hierarchical Group Softmax. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24105–24115. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.