Article
Traffic Sign Detection and Recognition Using Multi-Frame
Embedding of Video-Log Images
Jian Xu, Yuchun Huang * and Dakan Ying
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China;
[email protected] (J.X.); [email protected] (D.Y.)
* Correspondence: [email protected]
Abstract: The detection and recognition of traffic signs is an essential component of intelligent vehicle
perception systems, which use on-board cameras to sense traffic sign information. Unfortunately,
issues such as long-tailed distribution, occlusion, and deformation greatly decrease the detector’s
performance. In this research, YOLOv5 is used as a single classification detector for traffic sign
localization. Afterwards, we propose a hierarchical classification model (HCM) for the specific
classification, which significantly reduces the degree of imbalance between classes without changing
the sample size. To cope with the shortcomings of a single image, a training-free multi-frame
information integration module (MIM) was constructed, which can extract the detection sequence
of traffic signs based on the embedding generated by the HCM. The extracted temporal detection
information is used for the redefinition of categories and confidence. Finally, this research performed detection and recognition over the full set of classes on two publicly available datasets, TT100K and ONCE.
Experimental results show that the HCM-improved YOLOv5 has a mAP of 79.0 in full classes, which
exceeds that of state-of-the-art methods, and achieves an inference speed of 22.7 FPS. In addition,
MIM further improves model performance by integrating multi-frame information while only slightly
increasing computational resource consumption.
Figure 1. Analysis of traffic signs. (a) A common way of classifying traffic signs; (b) several situations that have a negative impact on the detector; (c) sample distribution of the TT100K dataset.
Traffic signs are designed with distinct shapes such as squares, circles, and triangles, as well as distinct red, yellow, and blue colors to highlight the sign. Traditional detection and recognition methods are thus achieved by designing manual feature descriptors for traffic sign detection and recognition. The use of sliding windows to find high-probability regions in an image containing traffic signs is one example [1]. Researchers have also attempted to determine adaptive segmentation thresholds for traffic sign extraction and classification by computing histograms of images [2,3]. Color space has also been used, for example, segmentation in HSV [4]. Researchers extracted SURF feature points from signs and used corroding images to match them [5]. In some studies, more complex feature descriptors, such as coarse localization of signs based on the Hough transform [6], were used for sign extraction. Traditional digital morphology-based methods typically necessitate clear images with high-resolution signs and no anomalies to interfere, making them difficult to apply in complex real-world scenarios.
Researchers added machine learning algorithms for refinement based on digital morphology to improve the model's robustness. Traditional HOG features were combined with SVM for traffic sign detection and classification [7,8]. The researchers attempted to segment the images based on color features and then used SVM to implement classification on the segmented regions [9–11]. Since many traffic signs are circular in shape, the researchers used the circular Hough transform to detect the signs and then classified them using an SVM [12,13]. Although the support vector machine improves detection and recognition accuracy, it is still heavily reliant on manual features. As a result, the improved classification model, while capable of more accurate and detailed classification, struggles to deal with the anomalies depicted in Figure 1b.
Object detection algorithms based on convolutional neural networks have become a better choice for academics due to the extensive use of high-performance computer
systems. However, smaller traffic signs are difficult to detect and classify accurately as
the detailed features of small targets are difficult to transfer to the deeper feature maps.
Some researchers have used image super-resolution algorithms to enhance the detection of
small targets [14,15]. Since attention mechanisms can drive the network to focus more on
channel and spatial feature acquisition, adding attention mechanisms to the model can also
improve the model’s ability to extract semantic features [16–20].
From different viewpoints, traffic signs produce images of varying scales, and changes
in scale can also affect detector performance. Researchers attempted to incorporate deformable convolution into the network, which can adaptively adjust the receptive field [19,21,22]. Because the backbone generates feature maps of varying sizes during
layer-by-layer downsampling, some researchers have fused feature information from vari-
ous scales by constructing feature pyramids in order to improve the model’s extraction of
multi-scale features [22–30].
Some researchers have improved model performance without changing the detector
by using pre-processing techniques, such as image enhancement based on probabilistic
models [31], highlighting edge features of traffic signs [32], and enhancing the hue of dark
areas of images [33]. In addition, on-board cameras typically take high-resolution images for sensing the vehicle's surroundings, but this increases the search range for traffic signs. Therefore,
an attempt has been made in some studies to construct a coarse-to-fine framework, which is
used to reduce computational costs and improve model performance [34–36]. Background
information is frequently ignored, but some researchers have improved model accuracy
by using background detail features of neighboring signs [37,38]. In addition, spiking
neural networks (SNN) are used to improve existing traffic sign detection and recognition
algorithms [39–41], which can extract time-related features and have higher computational
efficiency on hardware platforms.
To obtain accurate indication information, we must classify traffic signs down to the
smallest category, which requires the algorithm to accurately identify up to several hundred
sign categories. As shown in Figure 1c, there is a great difference in sample size between
the traffic sign classes. This will result in classes with larger sample sizes having better
classification accuracy, while classes with sparse samples perform poorly [42]. Existing
studies usually only identify traffic signs according to three categories: prohibited, warning,
and mandatory, or remove categories with a sample size of less than 100. However, each
category of traffic sign is designed to convey important guidance information. Therefore,
the detection and recognition of traffic signs need to be implemented in as comprehensive
a range of categories as possible.
On-board cameras can continuously capture traffic signs, but most existing studies
only use information from a single image. Missed detections or misclassifications due to
anomalies such as occlusion and deformation are typically present in only a few frames, but
incorrect detection of a single sign can also pose a serious hazard, negatively impacting the
environment, infrastructure, and human life. Some researchers have attempted to improve
detector performance using image sequences in previous studies [43–46], but this often
necessitates an additional training process and consumes more computational resources.
Meanwhile, successive detections can result in redundant results.
In general, the main contributions of this paper are summarized as follows:
(1) We propose a hierarchical classification model (HCM) based on the natural distribu-
tion characteristics of traffic signs, which is used for the classification of traffic signs
in full classes. Meanwhile, the HCM significantly reduces the degree of imbalance
between classes without changing the sample size.
(2) To deal with missing or misleading information caused by anomalies, this study de-
signed a multi-frame information aggregation module (MIM) to extract the detection
sequence of traffic signs, which is based on the embedding generated by the HCM.
The temporal sequence of detection information can deal with the shortcomings of a
single image, reducing false detections caused by anomalies.
(3) We validated our method using two open-source datasets, TT100K and ONCE. The
HCM-improved YOLOv5 achieves a mAP of 79.0 in full classes, exceeding existing
state-of-the-art methods. Experiments using ONCE show that MIM further improves
the performance of the model by integrating multi-frame information.
2. Methods
The image sequence I captured by the on-board camera is used as input in this work,
and we need to detect and recognize traffic signs for each frame. Specifically, we need to
detect the bounding box $b_i = (x_i, y_i, w_i, h_i)$ of each traffic sign and recognize the specific category $c_i$. The central coordinates, width, and height of the bounding box are noted as $x$, $y$, $w$, and $h$, respectively.

$$\left\{ b_t^i, c_t^i \mid I \right\}, \quad c_t^i \in K,\ t \in T,\ i \in \{1, 2, \cdots, N_t\} \tag{1}$$
In conclusion, our work can be summarized as Equation (1). K and T in the formula are the set of traffic sign categories and the set of time, respectively. $N_t$ is the number of traffic signs in the image $I_t$. Figure 2 illustrates the overall framework of our method. While the vehicle moves, the on-board camera catches street scenes, creating a temporal sequence of photos I. The processing steps of our algorithm can be summarized as follows:
Step 1: We use the previous m frames $\{I_{t-m}, I_{t-m+1}, \ldots, I_{t-1}\}$ as the reference frame set for a given image $I_t$ inside I.
Step 2: Based on the image $I_t$ and all reference frames, YOLOv5 generates a number of candidate regions through detection.
Step 3: The hierarchical classification module (HCM) implements the specific classification of candidate areas.
Step 4: Based on the embedding extracted by the HCM, the multi-frame information
integration module (MIM) searches for associated boxes in reference frames.
Step 5: MIM analyzes the detection sequences generated by the association operation,
which is used for category and confidence redefinition.
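The per-frame flow of Steps 1–5 can be written as a short processing loop. The sketch below only illustrates the control flow: the three callables (detect, hcm_classify, mim_redefine) are hypothetical stand-ins for the detector, the HCM, and the MIM described in the following subsections, not the released implementation.

```python
from collections import deque

def process_sequence(frames, detect, hcm_classify, mim_redefine, m=2):
    """Run Steps 1-5 over a temporal image sequence I."""
    reference = deque(maxlen=m)      # Step 1: detections of the previous m frames
    outputs = []
    for image in frames:
        boxes = detect(image)        # Step 2: single-class YOLOv5 boxes + confidences
        dets = [hcm_classify(image, box) for box in boxes]  # Step 3: specific class + embedding
        dets = mim_redefine(dets, list(reference))          # Steps 4-5: associate and redefine
        reference.append(dets)
        outputs.append(dets)
    return outputs
```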
2.1. Detector
Object detection is used to locate targets in an image and perform category recogni-
tion. Yet, because of small objects, large-scale changes, and long-tailed distributions, it is
challenging to accomplish robust detection by merely applying a generic object detection al-
gorithm. At the same time, the algorithm’s inference speed is a crucial assessment criterion
in order to interpret sign information in a timely manner on rapidly moving cars.
Object detection can be divided into two categories, depending on the framework:
one-stage and two-stage. Object detection algorithms with two stages generate regional
proposals first, then classify each proposed region. A considerable number of regions of
interest are generated via region proposals. For example, R-CNN [47] creates approxi-
mately 2000 proposed regions in each input image. Because each proposed region needs
independent feature extraction and classification, two-stage object detection necessitates
considerable inference space and time costs.
In contrast to the classification and regression of proposed regions, one-stage object
detection typically divides the image into a number of grids, each containing a number of a
priori boxes. Following that, based on the feature maps provided by the backbone network,
the algorithm predicts the position and class of objects within each grid [48]. Compared to
the two-stage algorithm, the one-stage object detection uses a more direct global regression
and classification. The one-stage framework allows for fast inference as there are not a
large number of candidate regions to be computed independently.
In addition to the speed of inference, the accuracy of the model is an important
consideration. Because of the scale fluctuations of traffic signs and the prevalence of small objects, generic object detection methods frequently struggle to recognize them robustly. MS-COCO [49]
defines objects with an area of fewer than 32 × 32 pixels as small objects. The sparse
appearance of small objects makes it difficult for the algorithm to distinguish between
background and object, and it also places higher challenges on the model’s detection
accuracy [50]. During the feature extraction process, the backbone network can generate
feature maps of different sizes to represent information at various scales. Shallow feature
maps contain more detailed spatial features, while deeper feature maps represent more
abstract semantic features. To address the performance decrease caused by scale factors,
researchers designed the feature pyramid network (FPN) for fusing feature information at
multiple scales by concatenating or summing elements between feature maps.
For reasons of inference speed and detection accuracy, the detection and recognition
of traffic signs require a one-stage object detection algorithm that adapts to the multi-scale
variation. Following comparison, YOLOv5 is selected as the detector in this research. As
a member of the YOLO series of object detection algorithms, YOLOv5 not only inherits the
conventional quick detection capabilities but also applies a number of tactics to mitigate
the detrimental impacts of scale variation and small objects. Specifically, benefiting from
a unique residual structure and spatial pyramidal pooling, YOLOv5 has excellent multi-
scale feature extraction capabilities, which help to extract traffic sign features at different
distances. The creation of a bi-directional feature pyramid structure improves information
transfer across features at different scales and the model’s retention of detailed features in
the image. At the same time, traffic signs have distinct shape and color qualities that set them apart from the background, so YOLOv5 can be used to precisely locate traffic signs in
an image.
In the real world, traffic signs suffer from a sample imbalance between categories,
which has a direct impact on the dataset. Although object detection algorithms contain
both localization and classification capabilities, the long-tailed distribution of traffic signs
frequently results in a significant reduction in the algorithm's classification performance.
YOLOv5 achieves multi-classification in the head by modifying the feature map size, but
the imbalanced distribution of data has a substantial impact on the model’s detection and
classification performance. Therefore, YOLOv5 is not suitable for both the detection and
recognition of traffic signs, but traffic sign localization can be effectively solved by using a
single classification that distinguishes the traffic signs from the background.
Given the input image sequence I, we will utilize YOLOv5 to locate the traffic signs in
the image. For one of the images It , the detector will obtain Nt bounding boxes and the
corresponding confidences, denoted as $D_t = \{b_t^i, conf_t^i\},\ i \in \{1, 2, \cdots, N_t\}$. The bounding box is represented by the term $b_t^i$ in the equation, which contains the width, height, and centroid coordinates of the box, while the other component $conf_t^i$ is used to describe the confidence of the detection, $conf_t^i \in [0, 1]$. By processing the entire image sequence with
YOLOv5, we can obtain a detection sequence D.
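As a rough illustration of this single-classification stage, the snippet below loads a YOLOv5 model through torch.hub and keeps only the box geometry and confidence for each frame. This is a sketch assuming the public ultralytics/yolov5 hub interface; the detector in this work is retrained with a single traffic-sign class, so the class index returned by the pretrained checkpoint is simply discarded.

```python
import torch

# A public YOLOv5 checkpoint stands in for the single-class traffic sign detector
# described in the text (the actual model is retrained with one foreground class).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

def detect(image):
    """Return D_t = {(b_t^i, conf_t^i)} for one frame."""
    results = model(image)
    detections = []
    # results.xywh[0] rows: x_center, y_center, width, height, confidence, class
    for x_c, y_c, w, h, conf, _cls in results.xywh[0].tolist():
        detections.append({'box': (x_c, y_c, w, h), 'conf': conf})
    return detections
```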
When faced with imbalanced sample sizes, previous research has often used either
under-sampling or oversampling to balance the sample size. Both strategies are data-level
approaches, but under-sampling yields less data for model training, whereas oversampling
lengthens training time and may result in model overfitting [42]. Unlike data augmentation,
we employ a grouping strategy to classify traffic sign categories. Traffic signs are classified
into three superclasses: warning, prohibitory, and mandatory, with a number of specialized
subclasses within each superclass. As illustrated in Figure 3, the superclasses range signifi-
cantly in color and shape. For example, prohibitory signs have a red circular border, but
warning signs have triangular and yellow features. This significant difference in features
makes it easy for the classification algorithm to achieve better classification accuracy on
the superclasses. The difficulty in classification is to precisely identify subclasses with
a long-tailed distribution. However, we discovered that the majority of the mandatory
signs are seen in classes with a large sample size, whereas the warning signs are typically found in classes with a small sample size. While the overall sample distribution exhibits a significant long-tailed distribution, the difference in sample size between subclasses within a superclass is much smaller. Thus, grouping can improve the overall performance of the classification model by reducing the degree of imbalance without changing the sample size.
Figure 3. Three superclasses and their corresponding subclasses.
the number of categories, then change the feature embedding $e$ to a feature map of the desired size using convolution operations, and lastly use the softmax function to produce the classification result $c$, i.e., $c = g(e, \theta)$ or $c = g(h(b), \theta)$.
As shown in Figure 4, HCM begins prediction by identifying superclasses, following
which the relevant subclass classifier is chosen for a specific classification. Simultaneously,
there are model training requirements. We adjust the number of classifications by changing
the convolution kernel parameter θ in the classification component g. Take the superclass
classifier as an example, where the classification component generates a feature map of size
1 × 3 for the identification of the three superclasses. After completing the structural design
of the HCM, we use a cross-entropy loss function for model training. The $c_{gt}$ and $c_{pre}$ in Equation (2) denote the ground truth and predicted value, respectively.
$$L_{log}\left(c_{gt}, c_{pre}\right) = -\left[c_{gt}\log c_{pre} + \left(1 - c_{gt}\right)\log\left(1 - c_{pre}\right)\right] \tag{2}$$
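A minimal sketch of the hierarchical prediction described above, assuming simple linear heads in place of the convolutional classification components g and illustrative subclass counts: the superclass head is evaluated first, and the matching subclass head is then routed per sample, with each head trained under a cross-entropy objective in the spirit of Equation (2).

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Superclass head followed by per-superclass subclass heads (illustrative sizes)."""

    def __init__(self, embed_dim=256, subclass_counts=(40, 120, 72)):
        super().__init__()
        self.super_head = nn.Linear(embed_dim, len(subclass_counts))
        self.sub_heads = nn.ModuleList(nn.Linear(embed_dim, n) for n in subclass_counts)

    def forward(self, e):
        super_logits = self.super_head(e)            # identify the superclass first
        super_idx = super_logits.argmax(dim=-1)
        # route each embedding to the subclass head of its predicted superclass
        sub_logits = [self.sub_heads[int(s)](ei) for ei, s in zip(e, super_idx)]
        return super_logits, sub_logits

loss_fn = nn.CrossEntropyLoss()  # cross-entropy training objective for each classifier
```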
For one of the frames $I_t$, we first extract all the candidate regions from the detection result $D_t$. Afterwards, the HCM identifies each candidate region $b_t^i$ independently. Following the classification results, we supplement $D_t$ with specific categories and embeddings. After processing by HCM, $D_t$ can be expressed by Equation (3).
$$D_t = \left\{ b_t^i, conf_t^i, c_t^i, e_t^i \right\}, \quad i \in \{1, 2, \cdots, N_t\} \tag{3}$$
$$ed(x_1, x_2, y_1, y_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{6}$$

$$f_{center}(x_1, x_2, y_1, y_2) = 1 - \tanh\left(\frac{relu\left(ed(x_1, x_2, y_1, y_2) - \alpha\right)}{\beta}\right) \tag{7}$$
Figure 5. Controlled normalization of Euclidean distances based on $f_{center}$.
$$f(p_1, p_2) = \omega_{cos} \times f_{cos}(e_1, e_2) + \omega_{center} \times f_{center}(x_1, x_2, y_1, y_2), \quad s.t.\ \omega_{cos} + \omega_{center} = 1,\ \omega_{cos} > \omega_{center} \tag{8}$$

Since the function f achieves result association mostly through embedding, the weight parameters should be set so that $\omega_{cos}$ is greater than $\omega_{center}$. Figure 6 shows two examples of how similarity information from both aspects can be used to distinguish among different traffic signs. For a distinguishing feature $p_t^j$ in the current frame $I_t$, $j \in \{1, 2, \cdots, N_t\}$, the similarity is calculated based on the function f for the results in the reference frame $I_{re}$, $re \in \{t - m, t - m + 1, \cdots, t - 1\}$. As stated in Equation (9), the procedure will produce a set of similarity scores. Afterwards, we extracted the maximum similarity score $s_{re}^{max}$ and the corresponding distinguishing feature $p_{re}^{max}$ based on Equations (10) and (11), respectively.

$$s_{re} = \left\{ s_{re}^i \mid s_{re}^i = f\left(p_t^j, p_{re}^i\right),\ i \in \{1, 2, \cdots, N_{re}\} \right\} \tag{9}$$

$$s_{re}^{max} = \max\left\{ s_{re}^i \right\} \tag{10}$$

$$p_{re}^{max} = \underset{p_{re}^i}{\operatorname{argmax}}\, f\left(p_t^j, p_{re}^i\right), \quad s.t.\ i \in \{1, 2, \cdots, N_{re}\} \tag{11}$$
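The similarity of Equation (8) and the per-reference-frame association of Equations (9)–(11) can be sketched as follows; the weights, α, β, and the threshold ε are the hyperparameters introduced in the text, shown here with placeholder values only.

```python
import math

def f_center(x1, y1, x2, y2, alpha=30.0, beta=60.0):
    """Equations (6)-(7): controlled normalization of the center distance."""
    ed = math.hypot(x1 - x2, y1 - y2)
    return 1.0 - math.tanh(max(ed - alpha, 0.0) / beta)   # relu(.) followed by tanh squashing

def f_cos(e1, e2):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm = math.sqrt(sum(a * a for a in e1)) * math.sqrt(sum(b * b for b in e2))
    return dot / (norm + 1e-12)

def similarity(p, q, w_cos=0.7, w_center=0.3):
    """Equation (8): the embedding term dominates the position term (w_cos > w_center)."""
    return w_cos * f_cos(p['emb'], q['emb']) + w_center * f_center(*p['center'], *q['center'])

def associate(p, reference_dets, eps=0.6):
    """Equations (9)-(11): best-matching detection in one reference frame, gated by eps."""
    scored = [(similarity(p, q), q) for q in reference_dets]   # Equation (9)
    if not scored:
        return None
    s_max, p_max = max(scored, key=lambda sq: sq[0])           # Equations (10)-(11)
    return p_max if s_max > eps else None
```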
Figure 6. Two types of association scenarios. (a) Distinguishing between different categories of traffic signs; (b) distinguishing between different instances of the same category.
If $s_{re}^{max}$ exceeds the similarity threshold $\varepsilon$, the detection result corresponding to $p_{re}^{max}$ becomes the association result in $I_{re}$, which is denoted as $d_{re}^j$. Similarly, we seek association results that satisfy the requirements in all reference frames. After that, the detection result $d_t^j$ is included to generate the sequence of detection results $u^j = \left\{ d_{t-m}^j, d_{t-m+1}^j, \ldots, d_{t-1}^j, d_t^j \right\}$. Lastly, for all traffic signs in the current frame $I_t$, we can extract the set of detection results $U = \left\{ u^j \right\},\ j \in \{1, 2, \cdots, N_t\}$.
2.3.2. Sequence Analysis
In real-world circumstances, traffic signs have anomalies such as occlusion and deformation, which can lead to false detection or misclassification by the detector. Fundamentally, the anomalies cause the image's information to be absent or deceptive. As anomalies are typically present in only a few frames, the lack of information in a single image can be compensated for by employing several detections, which can enhance the model's performance even further.
Based on the findings of the preceding analysis, MIM redefines categories and confidences based on the sequence of detection results, as illustrated in Figure 7. To count the confidence of category $c_{target}$ in the sequence, we construct a statistical function $v$, denoted as Equation (12). After that, Equation (13) is used to calculate the category with the highest cumulative confidence in the sequence $u^j$. Finally, the category with the highest confidence becomes the redefinition category $c_t^j$, while the redefinition confidence $conf_t^j$ is calculated by Equation (14).

$$v\left(d, c_{target}\right) = \begin{cases} conf, & \text{if } c = c_{target} \\ 0, & \text{if } c \neq c_{target} \end{cases} \tag{12}$$

$$c_t^j = \underset{c}{\operatorname{argmax}} \sum_{\tau = t-m}^{t} v\left(d_\tau^j, c\right), \quad s.t.\ c \in K \tag{13}$$

$$conf_t^j = \frac{1}{m+1} \sum_{\tau = t-m}^{t} v\left(d_\tau^j, c_t^j\right) \tag{14}$$
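Equations (12)–(14) amount to a confidence-weighted vote over the associated detection sequence $u^j$; a compact sketch is given below, where each slot of the sequence holds either a detection dict or None when no association was found in that frame. If the redefined confidence exceeds the threshold γ introduced next, it replaces the current result as in Equation (15).

```python
from collections import defaultdict

def redefine(sequence):
    """Category and confidence redefinition over u^j = {d_{t-m}, ..., d_t}.

    `sequence` has one entry per frame in the window (m + 1 in total);
    entries are dicts with 'cls' and 'conf', or None if no association was found."""
    votes = defaultdict(float)
    for d in sequence:
        if d is not None:
            votes[d['cls']] += d['conf']          # Equations (12)-(13): accumulate v(d, c)
    if not votes:                                 # no detection survived in the window
        return None, 0.0
    best_cls = max(votes, key=votes.get)          # category with the largest cumulative confidence
    new_conf = votes[best_cls] / len(sequence)    # Equation (14): average over the m + 1 frames
    return best_cls, new_conf
```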
Figure 7. Correlation is used to extract traffic sign detection sequences, and the current results are then redefined by integrating information from multiple detections.
To eliminate detection results with poor confidence, a hyperparameter $\gamma$ is applied. When the confidence $conf_t^j$ is greater than $\gamma$, the categories and confidence levels in $d_t^j$ are replaced by the redefined results. The final result can be expressed by Equation (15). Based on the same process, we redefine all detection results in the current frame $I_t$.

$$D_t = \left\{ b_t^j, conf_t^j, c_t^j, e_t^j \right\}, \quad j \in \{1, 2, \cdots, N_t\} \tag{15}$$

Overall, by integrating information from multiple detections, sequence analysis mitigates the unfavorable impact of abnormalities. While the majority of the results in the detection sequence are correct, the redefinition results from sequence analysis can correct for a small number of classification errors. Simultaneously, deformation might occur at viewpoints near traffic signs, which usually results in low confidence in the results. When employing multiple detections, the confidence level can be enhanced by leveraging high-confidence information from previous results, resulting in fewer missed detections.
3. Results
3.1. Dataset
We deal with specific kinds of traffic signs in this study; however, a portion of the traffic sign datasets are only labeled with three categories: warning, prohibitory, and mandatory [53,54]. In contrast, TT100K [55] and ONCE [56] are better candidates because their annotation information is more detailed.
TT100K contains 10,000 images with a resolution of 2048 × 2048. At the same time, the data annotation is further refined into 232 specific categories, as shown in Figure 8. Existing studies typically remove categories with a sample size of less than 100 [19,57–60], but we use the full traffic sign category for model training and testing. On the other hand, the ONCE dataset is an autonomous driving dataset with millions of scenes. The images in the dataset were selected from 144 h of on-board camera video, taken under different lighting and weather conditions. In order to use the temporal image data in ONCE, we annotated the ONCE test set similarly to TT100K. After removing the night data, the test set contains 13,268 time-series images with a resolution of 1920 × 1020.
Figure 8. Traffic sign categories in TT100K.
3.2. Metrics
The experiment uses precision, recall, and F1 score as metrics to evaluate the overall performance of the detector, as indicated in Equations (16)–(18). F1 is the harmonic mean of precision and recall.

$$Precision = \frac{TP}{TP + FP} \tag{16}$$

$$Recall = \frac{TP}{TP + FN} \tag{17}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{18}$$

where true positives (TP) are the number of samples that are actually positive and classified as positive by the classifier; false positives (FP) are the number of samples that are actually negative but classified as positive by the classifier; and false negatives (FN) are the number of samples that are actually positive but classified as negative by the classifier.
Certain evaluation criteria, such as precision, have the potential to mislead researchers [42]. When a long-tailed distribution exists, high scores may mistakenly represent good performance. As a consequence, we use mAP to further evaluate the model's performance. As indicated in Equation (19), mAP is the mean of each category's average precision, i.e., the sum of the average precision over all categories divided by the number of categories. This paper uses a fixed intersection-over-union (IoU) value of 0.5 for computing mAP.

$$mAP = \frac{\sum Average\ Precision}{N(Class)} \tag{19}$$
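The counting behind Equations (16)–(18) is straightforward to implement; the per-class AP and the mAP of Equation (19) are computed with a standard COCO-style evaluator in practice, so only the simpler overall metrics are sketched here, with made-up confusion counts as the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (16)-(18): precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for one class, not values from the experiments:
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)   # -> 0.800, 0.889, 0.842
```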
Table 2. The overall metrics of YOLOv5 and YOLOv7 when used as a single classification detector.
YOLOv7 outperformed YOLOv5 in terms of AP; however, when inferring, the de-
tection algorithm needed to determine the final output based on a confidence threshold.
Based on the same confidence threshold of 0.5, YOLOv7_l outperforms YOLOv5 in terms
of precision but is lower than YOLOv5 in terms of recall and F1 score. Therefore, YOLOv5
is more suitable for traffic sign detection than YOLOv7.
We tested performance using SSD, Faster RCNN, CenterNet, and YOLOv5 as baselines, which are derived from diverse object detection architectures, to show that generic object detection algorithms are difficult to apply directly to the detection and recognition
of traffic signs. SSD and YOLOv5 are typical one-stage models; Faster RCNN is a two-stage
algorithm; and CenterNet is well known for its unique anchor-free architecture. Since
TT100K has detailed category information, all baselines are trained using the training set of
TT100K until the model converges.
First, we recorded the overall metrics of the baselines on the TT100K test set in
Table 3. The results show that CenterNet and YOLOv5, which have been proposed in recent
years, outperformed SSD and Faster RCNN. Figure 9 compares the detection results of the
baselines to further analyze the reasons for the difference in performance. The traffic signs
in the red boxes are typical small objects in this case, and the signs in the yellow and blue
regions show deformations due to the viewpoint. According to the results, SSD and Faster
RCNN, which lack the ability to fuse multi-scale information, have a high percentage of
missed detections on small objects, whereas CenterNet and YOLOv5, which use a feature
pyramid structure, detect much more small-object traffic signs.
At the same time, this example reflects the negative impact of the long-tailed distribu-
tion on the detector. Specifically, the traffic sign in the yellow area has some deformation,
but the sample size of the corresponding category is sufficient. In contrast, the correspond-
ing category in the blue region has a much smaller sample size. Although the traffic signs
in the yellow and blue areas have similar deformations, the difference in sample size leads
to completely different results.
Figure 9. Detection results of different baselines on the TT100K dataset.

Table 3. The overall metrics of baselines evaluated on the TT100K dataset.

Method         Precision (%)   Recall (%)   F1 (%)
SSD            32.26           12.56        18.08
Faster RCNN    33.30           57.02        42.04
CenterNet      54.32           57.12        55.69
YOLOv5         74.37           80.93        77.51

The overall metrics do not accurately represent the model's performance across categories with varied sample sizes. As a result, we first compute the average precision of the baselines across all categories. After that, we calculated mAP in accordance with the difference in sample size, which is shown in Table 4.

Table 4. The mAP of baselines evaluated on the TT100K dataset.

Method         mAP_all   mAP_small   mAP_medium   mAP_large
SSD            5.87      1.92        8.69         13.00
Faster RCNN    10.30     7.31        11.41        16.93
CenterNet      16.95     4.87        17.11        48.88
YOLOv5         26.64     1.73        36.78        80.79

The results show that there is a significant difference in accuracy between the categories with an adequate sample size and those with fewer samples. YOLOv5 has a higher mAP in categories with sufficient sample size. However, for the categories with fewer samples, the gap between baselines is substantially lower. As a result, YOLOv5 is not suitable for combining detection and multi-classification tasks for objects having a long-tailed distribution, such as traffic signs.
The photos in TT100K are often taken in bright light, and the majority of the samples
are small. ONCE, on the other hand, records photographs in a variety of weather conditions,
such as sunny, rainy, and cloudy days, but with a lower proportion of small objects than
TT100K. We used the same strategy to compute the metrics for the ONCE baseline method
and recorded them in Table 5.
The test results on ONCE are largely consistent with those on TT100K, with the exception that Faster RCNN performs substantially better on ONCE than on TT100K, showing that small objects are the primary cause of Faster RCNN's limited performance. Figure 10 shows detection
in three types of weather to illustrate the impact of weather on model performance. The
images in the sunny environment are clear, but those in the cloudy and rainy surroundings
are substantially dimmer. The effect of environmental elements is also represented in the
baseline results; for example, cloudy and rainy conditions result in more missed detections
FOR PEERresults;
REVIEWfor example, cloudy and rainy conditions result in more missed detections 18 of
or misclassifications. It was also discovered that, while Faster RCNN performed well
overall, it lacked localization precision.
Figure 10. Detection results of different baselines on the ONCE dataset.
Table 7. The overall metrics of YOLOv5 and YOLOv5-HC evaluated on the TT100K dataset.
We computed mAP for YOLOv5-HC on categories with varied sample sizes in order to
explicitly analyze the performance of HCM on fewer sample categories, which is recorded
in Table 8. YOLOv5-HC improves accuracy across all categories. As the sample size
reduces, the degree of performance improvement of HCM on YOLOv5 increases. In
addition, we compared YOLOv5-HC with related methods in full classes, as shown in
Table 9. Compared to the state-of-the-art model, YOLOv5-HC has a 7.1% improvement in
mAP across all categories. Furthermore, our method has fewer model parameters.
Table 8. The mAP of YOLOv5 and YOLOv5-HC evaluated on the TT100K dataset.
As traffic sign identification and recognition require rapid perception information, the
model’s inference speed is also an important metric. We used Nvidia 2080ti to calculate
the inference speed of YOLOv5 and YOLOv5-HC, as shown in Table 7. Despite the fact
that the use of HCM increased the inference time, YOLOv5-HC still achieves an inference
speed of 22.7 FPS. In terms of the balance between model accuracy and inference speed,
we want to improve the accuracy of the model as much as possible while satisfying the
real-time condition. Taking the autonomous driving dataset ONCE as an example, the data
is sampled at a frequency of 10 FPS, so we consider that the inference speed of 22.7 FPS
satisfies the real-time requirement.
Table 10. The overall metrics of YOLOv5 and two improved versions evaluated on the ONCE dataset.
The on-board camera, as shown in Figure 12, takes continuous photos of the same
instance as the vehicle moves. Except for the SSD with poor overall performance, the
remaining detectors achieved accurate detection and recognition in the first two frames.
However, the majority of the detectors exhibited a missed detection at moment t due to
deformation. Although the deformed traffic signs were detected by YOLOv5-HC, the result
had a low confidence level. In contrast, because YOLOv5-HM utilizes detection information
from multiple frames, the higher confidence in the first two frames is used in the sequence
analysis to obtain a higher confidence at moment t.
To investigate the impact of MIM further, we ran mAP calculations under various
weather conditions and sample sizes, as shown in Table 11. According to the results,
YOLOv5-HM achieves an optimal value of 73.86 for the overall mAP, an improvement of
1.07 over YOLOv5-HC. YOLOv5-HM outperforms the pre-improvement model in most
sample size categories. YOLOv5-HM demonstrated the most substantial performance boost
in sunny settings, with an overall mAP improvement of 1.37. YOLOv5-HM, on the other
hand, demonstrated a relatively smaller performance improvement in cloudy and rainy
situations.
Table 11. The mAP of YOLOv5 and two improved versions evaluated on the ONCE dataset.
4. Discussion
The evaluation results based on TT100K and ONCE demonstrate that the feature pyramid structure is critical in dealing with small objects and scale variations. At the same time, the test results reveal a shortcoming in the localization accuracy of Faster RCNN, which originates from the coarseness of the feature map and the limited information offered by the candidate boxes [48]. Furthermore, the evaluation findings on classes with varying sample sizes show that the long-tailed distribution has a considerable detrimental impact on the detector's performance. For reasons of inference speed and detection performance, YOLOv5 becomes a better choice. The use of YOLOv5 as a single classification detector allows for more efficient localization of traffic signs due to their unique color and shape features. Detecting the most challenging case in Figure 11 also demonstrates increased localization performance.
To cope with detector performance loss caused by unequal distribution, we propose a hierarchical classification model (HCM) that divides traffic signs into three superclasses and corresponding subclasses. This classification makes use of the distribution characteristics of traffic signs. Specifically, mandatory signs are mostly in the categories with a large sample size, whereas warning signs are mostly in the categories with a small sample size. While the overall sample distribution exhibits a significant long-tailed distribution, the difference in sample size between subclasses within a superclass is much smaller.
Equation (20) can be utilized to quantify the degree of imbalance in sample size between classes, which is calculated as the ratio of the maximum and minimum sample size across all categories [42].

$$\rho = \frac{\max_i \left\{ |C_i| \right\}}{\min_i \left\{ |C_i| \right\}} \tag{20}$$
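Equation (20) is simple to compute from per-class sample counts; the sketch below contrasts the global ratio with the ratio inside a single superclass, using made-up counts purely to illustrate why grouping lowers ρ.

```python
def imbalance_ratio(class_counts):
    """Equation (20): ratio of the largest class size to the smallest."""
    return max(class_counts) / min(class_counts)

# Hypothetical sample counts (not TT100K statistics):
all_classes = [2500, 900, 420, 60, 12, 5]       # long-tailed distribution over every class
one_superclass = [420, 60, 12]                  # subclasses grouped under one superclass
print(imbalance_ratio(all_classes))             # 500.0
print(imbalance_ratio(one_superclass))          # 35.0 -- far smaller within a superclass
```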
To clearly highlight the differences after grouping, some of the categories were ran-
domly selected from the large, medium, and small categories. Figure 14 indicates that after
grouping, the sample distribution of traffic signs is more balanced, which is especially
noticeable for warning traffic signs. Meanwhile, when the sample size falls, the reduction
in ρ becomes bigger, resulting in more performance gains for YOLOv5-HC in the fewer
sample categories. In addition, owing to its lightweight design, the HCM’s hierarchical
classification structure does not considerably slow down inference speed. YOLOv5-HC is
similar to the two-stage object detection algorithm, but the inference speed is much faster
than Faster RCNN.
Figure 14. Change in value ρ across categories with different sample sizes after grouping. (a) small categories; (b) medium categories; (c) large categories.
Small objects, scale variations, and long-tailed distributions are the main causes of detector performance degradation. Furthermore, real-world scenes contain anomalies such as deformation and occlusion, which frequently result in missed detections or misclassifications. Challenging cases generated by anomalies are difficult to solve since the anomalies result in missing or misleading image information. To overcome the limitations of a single image, our proposed multi-frame information integration module (MIM) integrates data from multiple detections to achieve robust detection and recognition. Meanwhile, correlation can be utilized to eliminate the redundant results produced by successive detections.
The evaluation results on the ONCE dataset demonstrate that MIM achieves a per-
formance improvement in most cases. To further analyze the role of MIM, we varied
the range of information integration by adjusting the number of reference frames. The
data in Table 12 show that as the number of reference frames increases, the performance
of YOLOv5-HM gradually improves, indicating the significance of utilizing multi-frame
information. However, since traffic signs in ONCE are typically photographed three times,
there is no further improvement in model performance when the number of reference
frames m in the experiment exceeds two.
Table 12. YOLOv5-HM’s overall metrics evaluated on different reference frame numbers.
5. Conclusions
In this paper, two novel and simple-to-implement modules are proposed to improve
the performance of YOLOv5 for traffic sign detection and recognition. YOLOv5 provides
outstanding localization performance for small objects and scale variations as a single classi-
fication detector. To reduce the negative impact of long-tailed distributions on classification,
we propose a hierarchical classification module for the specific classification of traffic signs.
Through grouping, HCM divides traffic signs into three superclasses and corresponding
subclasses. The grouping takes advantage of traffic sign distributional characteristics,
which can greatly reduce sample size discrepancies between classes. However, in the pres-
ence of anomalies such as occlusion and deformation, single-image-based algorithms still
suffer from missing detection or misclassification. To deal with missing or misleading infor-
mation caused by anomalies, this study designed a multi-frame information aggregation
module to extract the detection sequence of traffic signs, which is based on the embedding
generated by the HCM. The temporal sequence of detection information can deal with the
shortcomings of a single image, reducing false detections caused by anomalies.
Experimental results based on TT100K show that YOLOv5-HC achieves a mAP of 79.0
in full classes, which exceeds state-of-the-art methods. At the same time, the inference
speed of 22.7 FPS satisfies the real-time requirement. Furthermore, YOLOv5-HM using
MIM outperformed YOLOv5-HC in terms of overall accuracy, with 0.67% improvement in
precision, 1.32% improvement in recall, and 1.05 improvement in F1 score, respectively.
YOLOv5 has some shortcomings in traffic sign detection and consumes most of the
computational resources. Therefore, we will improve the existing detection module in
our research as more advanced object detection algorithms are proposed. Meanwhile, we
will also try to improve the inference speed of the existing model using SNN, which can
better balance the inference time and model accuracy. In addition, due to the uniqueness of
the colors as well as the structure of the traffic signs, we also consider the use of VAE or
statistical models to generate distributions that can be used for traffic sign recognition.
Author Contributions: Conceptualization, Y.H. and J.X.; methodology, Y.H. and J.X.; software, J.X.;
supervision, Y.H.; validation, D.Y.; visualization, J.X. and D.Y.; writing—original draft, J.X.; writing—
review and editing, Y.H. and J.X. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by the Wuhan University–Huawei Geoinformatics Innova-
tion Laboratory.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Rehman, Y.; Khan, J.A.; Shin, H. Efficient coarser-to-fine holistic traffic sign detection for occlusion handling. IET Image Process.
2018, 12, 2229–2237. [CrossRef]
2. Xu, X.; Jin, J.; Zhang, S.; Zhang, L.; Pu, S.; Chen, Z. Smart data driven traffic sign detection method based on adaptive color
threshold and shape symmetry. Future Gener. Comput. Syst.-Int. J. Escience 2019, 94, 381–391. [CrossRef]
3. Yang, Y.; Luo, H.; Xu, H.; Wu, F. Towards Real-Time Traffic Sign Detection and Classification. IEEE Trans. Intell. Transp. Syst. 2016,
17, 2022–2031. [CrossRef]
4. Cao, J.; Song, C.; Peng, S.; Xiao, F.; Song, S. Improved Traffic Sign Detection and Recognition Algorithm for Intelligent Vehicles.
Sensors 2019, 19, 4021. [CrossRef]
5. Guo, S.; Yang, X. Fast recognition algorithm for static traffic sign information. Open Phys. 2018, 16, 1149–1156. [CrossRef]
6. Yin, S.; Ouyang, P.; Liu, L.; Guo, Y.; Wei, S. Fast Traffic Sign Recognition with a Rotation Invariant Binary Pattern Based Feature.
Sensors 2015, 15, 2161–2180. [CrossRef] [PubMed]
7. Hechri, A.; Mtibaa, A. Two-stage traffic sign detection and recognition based on SVM and convolutional neural networks. IET Image Process. 2020, 14, 939–946. [CrossRef]
8. Bouti, A.; Mahraz, M.A.; Riffi, J.; Tairi, H. A robust system for road sign detection and classification using LeNet architecture
based on convolutional neural network. Soft Comput. 2020, 24, 6721–6733. [CrossRef]
9. Madani, A.; Yusof, R. Traffic sign recognition based on color, shape, and pictogram classification using support vector machines.
Neural Comput. Appl. 2018, 30, 2807–2817. [CrossRef]
10. Lillo-Castellano, J.M.; Mora-Jimenez, I.; Figuera-Pozuelo, C.; Rojo-Alvarez, J.L. Traffic sign segmentation and classification using
statistical learning methods. Neurocomputing 2015, 153, 286–299. [CrossRef]
11. Li, H.; Sun, F.; Liu, L.; Wang, L. A novel traffic sign detection method via color segmentation and robust shape matching.
Neurocomputing 2015, 169, 77–88. [CrossRef]
12. Saadna, Y.; Behloul, A.; Mezzoudj, S. Speed limit sign detection and recognition system using SVM and MNIST datasets. Neural
Comput. Appl. 2019, 31, 5005–5015. [CrossRef]
13. Berkaya, S.K.; Gunduz, H.; Ozsen, O.; Akinlar, C.; Gunal, S. On circular traffic sign detection and recognition. Expert Syst. Appl.
2016, 48, 67–75. [CrossRef]
14. Yu, Y.; Jiang, T.; Li, Y.; Guan, H.; Li, D.; Chen, L.; Yu, C.; Gao, L.; Gao, S.; Li, J. SignHRNet: Street-level traffic signs recognition
with an attentive semi-anchoring guided high-resolution network. ISPRS J. Photogramm. Remote Sens. 2022, 192, 142–160. [CrossRef]
15. Wang, Z.-Z.; Xie, K.; Zhang, X.-Y.; Chen, H.-Q.; Wen, C.; He, J.-B. Small-Object Detection Based on YOLO and Dense Block via
Image Super-Resolution. IEEE Access 2021, 9, 56416–56429. [CrossRef]
16. Li, Y.; Li, J.; Meng, P. Attention-YOLOV4: A real-time and high-accurate traffic sign detection algorithm. Multimed. Tools Appl.
2023, 82, 7567–7582. [CrossRef]
17. Wei, H.; Zhang, Q.; Qian, Y.; Xu, Z.; Han, J. MTSDet: Multi-scale traffic sign detection with attention and path aggregation. Appl.
Intell. 2023, 53, 238–250. [CrossRef]
18. Wang, X.; Guo, J.; Yi, J.; Song, Y.; Xu, J.; Yan, W.; Fu, X. Real-Time and Efficient Multi-Scale Traffic Sign Detection Method for
Driverless Cars. Sensors 2022, 22, 6930. [CrossRef]
19. Hu, J.; Wang, Z.; Chang, M.; Xie, L.; Xu, W.; Chen, N. PSG-Yolov5: A Paradigm for Traffic Sign Detection and Recognition
Algorithm Based on Deep Learning. Symmetry 2022, 14, 2262. [CrossRef]
20. Triki, N.; Karray, M.; Ksantini, M. A Real-Time Traffic Sign Recognition Method Using a New Attention-Based Deep Convolutional
Neural Network for Smart Vehicles. Appl. Sci. 2023, 13, 4793. [CrossRef]
21. Gao, X.; Chen, L.; Wang, K.; Xiong, X.; Wang, H.; Li, Y. Improved Traffic Sign Detection Algorithm Based on Faster R-CNN. Appl.
Sci. 2022, 12, 8948. [CrossRef]
22. Liu, Z.; Shen, C.; Fan, X.; Zeng, G.; Zhao, X. Scale-aware limited deformable convolutional neural networks for traffic sign
detection and classification. IET Intell. Transp. Syst. 2020, 14, 1712–1722. [CrossRef]
23. Zhang, Y.; Lu, Y.; Zhu, W.; Wei, X.; Wei, Z. Traffic sign detection based on multi-scale feature extraction and cascade feature fusion.
J. Supercomput. 2023, 79, 2137–2152. [CrossRef]
24. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput.
Appl. 2022, 35, 7853–7865. [CrossRef]
25. Wu, J.; Liao, S. Traffic Sign Detection Based on SSD Combined with Receptive Field Module and Path Aggregation Network.
Comput. Intell. Neurosci. 2022, 2022, 4285436. [CrossRef] [PubMed]
26. Yao, Y.; Han, L.; Du, C.; Xu, X.; Jiang, X. Traffic sign detection algorithm based on improved YOLOv4-Tiny. Signal Process.-Image
Commun. 2022, 107, 116783. [CrossRef]
27. Liu, Y.; Peng, J.; Xue, J.-H.; Chen, Y.; Fu, Z.-H. TSingNet: Scale-aware and context-rich feature learning for traffic sign detection
and recognition in the wild. Neurocomputing 2021, 447, 10–22. [CrossRef]
28. Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Traffic sign detection and recognition based on pyramidal convolutional networks. Neural
Comput. Appl. 2020, 32, 6533–6543. [CrossRef]
29. Yuan, Y.; Xiong, Z.; Wang, Q. VSSA-NET: Vertical Spatial Sequence Attention Network for Traffic Sign Detection. IEEE Trans.
Image Process. 2019, 28, 3423–3434. [CrossRef]
30. Ou, Z.; Xiao, F.; Xiong, B.; Shi, S.; Song, M. FAMN: Feature Aggregation Multipath Network for Small Traffic Sign Detection. IEEE
Access 2019, 7, 178798–178810. [CrossRef]
31. Suto, J. An Improved Image Enhancement Method for Traffic Sign Detection. Electronics 2022, 11, 871. [CrossRef]
32. Khan, J.A.; Chen, Y.; Rehman, Y.; Shin, H. Performance enhancement techniques for traffic sign recognition using a deep neural
network. Multimed. Tools Appl. 2020, 79, 20545–20560. [CrossRef]
33. Khan, J.A.; Yeo, D.; Shin, H. New Dark Area Sensitive Tone Mapping for Deep Learning Based Traffic Sign Recognition. Sensors
2018, 18, 3776. [CrossRef] [PubMed]
34. Wang, Z.; Wang, J.; Li, Y.; Wang, S. Traffic Sign Recognition with Lightweight Two-Stage Model in Complex Scenes. IEEE Trans.
Intell. Transp. Syst. 2022, 23, 1121–1131. [CrossRef]
35. Liu, L.; Wang, Y.; Li, K.; Li, J. Focus First: Coarse-to-Fine Traffic Sign Detection with Stepwise Learning. IEEE Access 2020, 8,
171170–171183. [CrossRef]
36. Song, Y.; Fan, R.; Huang, S.; Zhu, Z.; Tong, R. A three-stage real-time detector for traffic signs in large panoramas. Comput. Vis.
Media 2019, 5, 403–416. [CrossRef]
37. Min, W.; Liu, R.; He, D.; Han, Q.; Wei, Q.; Wang, Q. Traffic Sign Recognition Based on Semantic Scene Understanding and
Structural Traffic Sign Location. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15794–15807. [CrossRef]
38. Tian, Y.; Gelernter, J.; Wang, X.; Li, J.; Yu, Y. Traffic Sign Detection Using a Multi-Scale Recurrent Attention Network. IEEE Trans.
Intell. Transp. Syst. 2019, 20, 4466–4475. [CrossRef]
39. Rasteh, A.; Delpech, F.; Aguilar-Melchor, C.; Zimmer, R.; Shouraki, S.B.; Masquelier, T. Encrypted internet traffic classification
using a supervised spiking neural network. Neurocomputing 2022, 503, 272–282. [CrossRef]
40. Zhang, Y.; Xu, H.; Huang, L.; Chen, C. A storage-efficient SNN-CNN hybrid network with RRAM-implemented weights for
traffic signs recognition. Eng. Appl. Artif. Intell. 2023, 123, 106232. [CrossRef]
41. Xie, K.; Zhang, Z.; Li, B.; Kang, J.; Niyato, D.; Xie, S.; Wu, Y. Efficient Federated Learning with Spike Neural Networks for Traffic
Sign Recognition. IEEE Trans. Veh. Technol. 2022, 71, 9980–9992. [CrossRef]
42. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [CrossRef]
43. Yu, J.; Ye, X.; Tu, Q. Traffic Sign Detection and Recognition in Multiimages Using a Fusion Model with YOLO and VGG Network.
IEEE Trans. Intell. Transp. Syst. 2022, 23, 16632–16642. [CrossRef]
44. Atif, M.; Zoppi, T.; Gharib, M.; Bondavalli, A. Towards Enhancing Traffic Sign Recognition through Sliding Windows. Sensors
2022, 22, 2683. [CrossRef] [PubMed]
45. Zhang, Y.; Wang, Z.; Song, R.; Yan, C.; Qi, Y. Detection-by-tracking of traffic signs in videos. Appl. Intell. 2022, 52, 8226–8242.
[CrossRef]
46. Song, S.; Li, Y.; Huang, Q.; Li, G. A New Real-Time Detection and Tracking Method in Videos for Small Target Traffic Signs. Appl.
Sci. 2021, 11, 3061. [CrossRef]
47. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
48. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [CrossRef]
49. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September
2014; pp. 740–755.
50. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020,
97, 103910. [CrossRef]
51. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada,
7–12 December 2015.
52. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
53. Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; Igel, C. Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013.
54. Zhang, J.; Zou, X.; Kuang, L.-D.; Wang, J.; Sherratt, R.S.; Yu, X. CCTSDB 2021: A More Comprehensive Traffic Sign Detection
Benchmark. Hum.-Cent. Comput. Inf. Sci. 2022, 12, 23. [CrossRef]
55. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2110–2118.
56. Mao, J.; Niu, M.; Jiang, C.; Liang, H.; Chen, J.; Liang, X.; Li, Y.; Ye, C.; Zhang, W.; Li, Z.; et al. One million scenes for autonomous
driving: ONCE dataset. arXiv 2021, arXiv:2106.11037.
57. Chu, J.; Zhang, C.; Yan, M.; Zhang, H.; Ge, T. TRD-YOLO: A Real-Time, High-Performance Small Traffic Sign Detection Algorithm.
Sensors 2023, 23, 3871. [CrossRef] [PubMed]
58. Sharma, V.; Dhiman, P.; Rout, R.K. Improved traffic sign recognition algorithm based on YOLOv4-tiny. J. Vis. Commun. Image
Represent. 2023, 91, 103774. [CrossRef]
59. Wang, L.; Wang, L.; Zhu, Y.; Chu, A.; Wang, G. CDFF: A fast and highly accurate method for recognizing traffic signs. Neural
Comput. Appl. 2023, 35, 643–662. [CrossRef]
60. Yuan, X.; Kuerban, A.; Chen, Y.; Lin, W. Faster Light Detection Algorithm of Traffic Signs Based on YOLOv5s-A2. IEEE Access
2023, 11, 19395–19404. [CrossRef]
61. Cao, J.; Zhang, J.; Jin, X. A Traffic-Sign Detection Algorithm Based on Improved Sparse R-CNN. IEEE Access 2021, 9, 122774–122788.
[CrossRef]
62. Gao, E.; Huang, W.; Shi, J.; Wang, X.; Zheng, J.; Du, G.; Tao, Y. Long-Tailed Traffic Sign Detection Using Attentive Fusion and
Hierarchical Group Softmax. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24105–24115. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.