Article
A Real‑Time License Plate Detection and Recognition Model in
Unconstrained Scenarios
Lingbing Tao 1 , Shunhe Hong 1 , Yongxing Lin 1,2 , Yangbing Chen 1 , Pingan He 3 and Zhixin Tie 1,2, *
1 School of Computer Science and Technology, Zhejiang Sci‑Tech University, Hangzhou 310018, China;
[email protected] (L.T.); [email protected] (S.H.); [email protected] (Y.L.);
[email protected] (Y.C.)
2 Keyi College, Zhejiang Sci‑Tech University, Shaoxing 312369, China
3 School of Science, Zhejiang Sci‑Tech University, Hangzhou 310018, China; [email protected]
* Correspondence: [email protected]
Abstract: Accurate and fast recognition of vehicle license plates from natural scene images is a cru‑
cial and challenging task. Existing methods can recognize license plates in simple scenarios, but their
performance degrades significantly in complex environments. A novel license plate detection and
recognition model, YOLOv5-PDLPR, is proposed, which employs the YOLOv5 target detection algorithm
in the license plate detection part and uses the PDLPR algorithm proposed in this paper in the license
plate recognition part. The PDLPR algorithm is mainly designed as follows: (1) A Multi‑Head Atten‑
tion mechanism is used to accurately recognize individual characters. (2) A global feature extractor
network is designed to improve the completeness of the network for feature extraction. (3) The latest
parallel decoder architecture is adopted to improve the inference efficiency. The experimental results
show that the proposed algorithm has better accuracy and speed than the comparison algorithms,
can achieve real‑time recognition, and has high efficiency and robustness in complex scenes.
Keywords: license plate recognition; multi‑head attention; global feature extractor network; parallel
decoder; YOLOv5
parative analysis. Section 6 performs an ablation study to verify the efficacy of the proposed
method. Finally, conclusions are drawn in Section 7.
2. Related Work
2.1. License Plate Detection
License plate detection is the foundation of license plate recognition, and its accuracy
directly affects the results of character recognition in license plates. Currently, there are two
main categories of license plate detection methods in common use: license plate detection algorithms
based on traditional methods, and deep learning-based license plate detection algorithms.
License plate detection algorithms based on traditional methods usually extract the in‑
trinsic properties of license plates such as edges, colors, local textures, and morphological
analysis as manual image features for license plate detection [32], such as edge feature‑based
license plate detection algorithms [33,34], color feature‑based license plate detection algo‑
rithms [35,36], texture feature‑based license plate detection algorithms [37,38], character
feature‑based license plate detection algorithms [39,40], and detection algorithms based on
more than two features [35]. As noise interference occurs in real license plate images, most
methods based on a single manual feature only work in specific scenes and have poor detec‑
tion results.
Given sufficient training data, deep learning-based license plate detection algorithms
have powerful feature representation and high performance compared to license plate detec‑
tion algorithms based on traditional methods. They can usually be divided into two cate‑
gories: one‑stage detection methods and two‑stage detection methods.
Fast R‑CNN [8] and Faster R‑CNN [9] are classical two‑stage detection networks that em‑
ploy the region proposal network in the first stage to share the convolutional features of the
whole image and generate high‑quality region proposal candidate frames. Then, in the sec‑
ond stage, the Convolution Neural Network (CNN) classifier is used to classify the candidate
frames and obtain the kind of targets.
Although two-stage license plate detection methods detect objects accurately, they are slow and
cannot meet the needs of real-time detection tasks. Therefore, one-stage
detection methods have emerged, and the representative networks include the YOLO se‑
ries [11–14], TE2E [41], SSD [42], CA‑CenterNet [43], YOLOv5 [31], Optical Flow CNN al‑
gorithms [44,45], etc.
obtain individual characters, and then used template matching to recognize the characters.
Gou et al. [6] distinguished different license plate characters with the help of limit regions of
specific characters, and then character recognition was performed using Restricted Boltzmann
Machines (RBMs). Ashtari et al. [7] employed an improved template matching‑based method
to locate the license plates. A hybrid classifier consisting of a decision tree and a support vec‑
tor machine (SVM) with a homogeneous fifth-degree polynomial kernel was then applied to
identify the extracted letters and numbers.
network with an Xception network for feature extraction and a RNN decoder combined with a
2D attention mechanism. In real‑world circumstances, it can recognize license plates with both
regular and irregular patterns. Gao et al. [20] proposed a novel license plate recognition method
using a two‑stage encoder combined with a Long Short Term Memory (LSTM) decoder. The
method is able to improve the coding quality and can recognize various types of license plates.
This category of methods uses RNN networks that treat license plates as a sequence recognition
problem; however, there is a large number of cyclic computations in RNN networks that cannot
be computed in parallel, increasing the inference time. Moreover, the long-term dependency
problem of LSTM leads to performance degradation.
The third category is to use the features extracted by the license plate detection network
for license plate recognition. The method proposed by Xu et al. [22] first performed a region‑of‑
interest (ROI) pooling operation on the feature map generated by the detection part to obtain
the feature vectors of the license plate regions. The features of the license plate are then sent into
the classifier in the recognition portion to acquire the license plate sequence. The approaches
proposed by Gong et al. [21] and Qin et al. [26] used the Feature Pyramid Networks (FPN)
in the license plate detection part to extract the shared features for classification and recogni‑
tion. The detection branch then generates bounding boxes and corner points, which are used
for Region of Interest Align (RoIAlign) and correction, respectively. Finally, the located license
plate features are used for recognition to determine the license plate sequence. This category
of methods shares the features extracted by the license plate detection network with the license
plate recognition network, reducing the calculation cost. However, the recognition networks
of this category of methods are designed in a simpler way, and the extracted features are not
semantically rich enough, resulting in insufficient recognition accuracy.
2.3. Transformer
Vaswani et al. [28] proposed the Transformer architecture, which was initially applied in
the fields of machine translation and Natural Language Processing (NLP), as a neural network
based mainly on a self‑attention mechanism. In contrast to the RNN‑based approaches, Trans‑
former makes the training process highly parallel, which can reduce model complexity and
improve text recognition accuracy. In recent years, inspired by the successful application of
Transformer in the field of NLP, several works [53–55] have proposed the use of Transformer to
replace the recursive structure in the seq2seq framework, which facilitates parallel computation
and speeds up processing.
Mahdavi et al. [56] used Transformer in the field of mathematical expression recognition.
They won the ICDAR 2019 Competition on Recognition of Handwritten Mathematical Expres‑
sions and Typeset Formula Detection with the greatest recognition rate. Yang et al. proposed
HRGAT [54] for scene text recognition using Transformer. By merging CNN feature maps with
a 2D attention map and then linking to the parallel decoder, it can swiftly recognize the text of
scenes with irregular spatial distribution. Kang et al. [55] applied Transformer network to a
handwritten text recognition task and achieved excellent performance. Ma et al. [57] proposed
the Text Attention Network (TATT), which uses CNN and Transformer to align text with spa‑
tially distorted text images, achieving state‑of‑the‑art performance in text super‑resolution tasks.
3. Proposed Method
As shown in Figure 1, the proposed YOLOv5‑PDLPR consists of two main parts: the
YOLOv5‑based license plate detection network and the PDLPR license plate recognition net‑
work proposed in this paper. The former receives an entire car picture as its input, locates the
license plate position within the picture, and then outputs a picture including only the license
plate information. The latter takes the license plate picture as its input and puts the license plate
picture through feature extraction, encoding, and decoding operations to obtain the sequence
of license plate characters.
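To make the two-part pipeline concrete, the following minimal Python sketch chains a detector and a recognizer in the way described above; the function names, their interfaces, and the crop-and-resize step are illustrative assumptions rather than the released implementation.

```python
import cv2  # assumed available for image I/O and cropping

def recognize_plate(image_path, yolov5_detector, pdlpr_recognizer):
    """Illustrative two-stage pipeline: detect the plate, crop it, then recognize the characters."""
    image = cv2.imread(image_path)            # full car picture
    # Stage 1: the YOLOv5-based detector is assumed to return one bounding box in pixel coordinates.
    x1, y1, x2, y2 = yolov5_detector(image)
    plate = image[y1:y2, x1:x2]               # picture containing only the license plate
    # Stage 2: PDLPR takes the cropped plate; 48 x 144 is the input size used in the experiments.
    plate = cv2.resize(plate, (144, 48))
    return pdlpr_recognizer(plate)            # sequence of license plate characters
```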
Figure 2. The overall framework of the license plate recognition algorithm.
3.2.1. Improved Global Feature Extractor
The Focus Structure was added at the beginning part of the IGFE to implement the feature map downsampling function while ensuring that no feature information was lost. In other parts requiring downsampling, the pooling operation was also replaced with a convolution operation with a stride of two in order to preserve the integrity of the extracted network features. This can improve the accuracy of license plate character recognition. Figure 3 shows the structure of the IGFE module, which consists of a Focus Structure module, two ConvDownSampling modules, and four RESBLOCK modules.
(1) Focus Structure Module
The structure of the Focus Structure module is shown in the bottom part of Figure 3. It was used to conduct picture slicing operations, and its operation process is shown in Figure 4: a value was taken at each interval of one pixel in an input picture so that one picture was equally divided into four feature maps. These were then concatenated along the channel direction; thus, a three-channel image became a 12-channel feature map with half the original width and height. Finally, the obtained feature map was convolved to perform the downsampling operation. A Focus Structure is better than other downsampling methods because it does not lose any feature information, which means that the extracted semantic information will be more comprehensive.
(Figure 3: the structure of the IGFE, consisting of (1) a Focus Structure module (Slice, Concat, CNN BLOCK); (2) RESBLOCK modules, each built from two residually connected CNN BLOCKs (Conv2d 3×3, stride = 1, BatchNormalization, LeakyReLU); and (3) ConvDownSampling modules (Conv2d 3×3, stride = 2, BatchNormalization, LeakyReLU).)
Figure 4. The slicing process of the Focus Structure module.
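The slicing and concatenation described above (and sketched in Figures 3 and 4) can be written compactly with strided indexing. Below is a minimal PyTorch sketch of such a Focus layer; the output channel count of the trailing convolution is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice every other pixel into four sub-maps, concatenate along channels, then convolve."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        # After slicing, a 3-channel image becomes a 12-channel map with half the width and height.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels * 4, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):                        # x: (B, 3, H, W)
        sliced = torch.cat([x[..., ::2, ::2],    # top-left pixels
                            x[..., 1::2, ::2],   # bottom-left pixels
                            x[..., ::2, 1::2],   # top-right pixels
                            x[..., 1::2, 1::2]], # bottom-right pixels
                           dim=1)                # (B, 12, H/2, W/2): no information is discarded
        return self.conv(sliced)
```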
(2) RESBLOCK module
The structure of each RESBLOCK module is shown in the central part of Figure 3, and consisted of two residually connected CNN BLOCK modules. During forward inference, the residual connected structure could prevent the network’s gradient disappearance and explosion.
In the CNN BLOCK module’s convolutional layer for extracting features, we utilized conv2d with stride = 1 and kernelSize = 3 to extract features, which were then passed via the BatchNormalization layer [58] and activation function layer in order to extract the image’s visual features.
The activation function made use of the leakyRelu [59] shown in Figure 5a rather than the Relu [60] shown in Figure 5b. The reason is that when the input of the Relu function is negative, the output is always 0, and its derivative is also 0. This tends to cause dead neurons, which means that the neurons no longer learn and the parameters no longer change. The leakyRelu is given a smaller slope value for the case where the input is negative to avoid the occurrence of dead neurons.
$$\mathrm{leakyReLU}(x)=\begin{cases} x & \text{if } x>0 \\ ax & \text{if } x\le 0 \end{cases} \ (\mathbf{a}) \qquad\qquad \mathrm{ReLU}(x)=\begin{cases} x & \text{if } x>0 \\ 0 & \text{if } x\le 0 \end{cases} \ (\mathbf{b})$$

Figure 5. The structure of the activation functions leakyRelu (a) and Relu (b).
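The building blocks described above map directly to standard layers. The following PyTorch sketch is an illustrative assumption of how the CNN BLOCK and RESBLOCK (and, with stride = 2, the ConvDownSampling module described next) could be implemented; channel counts are placeholders.

```python
import torch.nn as nn

def cnn_block(channels, stride=1):
    # CNN BLOCK: conv2d (kernelSize = 3) + BatchNormalization + leakyRelu.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(0.1, inplace=True),
    )

class ResBlock(nn.Module):
    """RESBLOCK: two residually connected CNN BLOCKs (stride = 1)."""
    def __init__(self, channels):
        super().__init__()
        self.block1 = cnn_block(channels)
        self.block2 = cnn_block(channels)

    def forward(self, x):
        # The residual connection helps prevent gradient disappearance and explosion.
        return x + self.block2(self.block1(x))

def conv_downsampling(channels):
    # ConvDownSampling (described in the next subsection): same layout, but stride = 2 replaces pooling.
    return cnn_block(channels, stride=2)
```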
(3) ConvDownSampling module
The structure of the ConvDownSampling module in this paper is the same as that of CNN BLOCK. However, we set the stride to 2 in conv2d for downsampling and used the convolution operation in place of the pooling operation for downsampling. This preserves more feature information in the downsampling process, thus improving the accuracy of license plate recognition.
3.2.2. Encoder
As shown in Figure 6, the Encoder in this paper consisted of three encoding units connected by residuals, and each unit contained four submodules: CNN BLOCK1, Multi-Head Attention, CNN BLOCK2, and Add&Norm. The CNN BLOCK1 and CNN BLOCK2 structures are the same as in Section 3.2.1, but with a few differences that will be explained later.
(Figure 6: the encoding unit, composed of Positional Encoding, CNN BLOCK1, Multi-Head Attention, CNN BLOCK2, and Add&Norm, stacked ×3.)
The CNN BLOCK1 output feature vectors were then encoded using Multi‑Head Atten‑
tion [28]. Here, parallel processing was used to compute the attention on each subspace. The
results in different spatial dimensions are then connected, and a linear conversion is performed
to obtain the final encoding result. This can attend to the connections between features in mul‑
tiple ways and in multiple spaces. The Multi-Head Attention $\mathrm{MHA}(Q, K, V)$ is calculated as shown in Equation (1):

$$\mathrm{MHA}(Q,K,V)=\left[\mathrm{head}_1\left(QW_1^Q,KW_1^K,VW_1^V\right),\ \mathrm{head}_2\left(QW_2^Q,KW_2^K,VW_2^V\right),\ \ldots,\ \mathrm{head}_h\left(QW_h^Q,KW_h^K,VW_h^V\right)\right]W^O \qquad (1)$$

where $Q$, $K$, and $V \in \mathbb{R}^{n\times d}$; $\mathrm{head}_i\left(QW_i^Q,KW_i^K,VW_i^V\right) \in \mathbb{R}^{n\times d_k}$; $W_i^Q$, $W_i^K$, and $W_i^V \in \mathbb{R}^{d\times d_k}$; $W^O \in \mathbb{R}^{d\times d}$; $n = \text{width} \times \text{height} = 108$; $d = 1024$; $d_k = d/h = 128$; $h = 8$. $\mathrm{head}_i\left(QW_i^Q,KW_i^K,VW_i^V\right)$ denotes the result of attention calculation for the $i$-th subspace; $W_i^Q$, $W_i^K$, and $W_i^V$ are the projection matrices that project $Q$, $K$, and $V$ to the $i$-th subspace, respectively; $W^O$
is the matrix for computing the linear conversion of the head; width and height are the width
and height of the feature vector output from CNN BLOCK1, respectively. The value of d is
equal to the dimensionality of the feature vector output by CNN BLOCK1. h is the number of
heads in Multi‑Head Attention, which means that the neural network attends to features in h
spaces. After the experimental comparison, the license plate recognition accuracy is the highest
when h = 8, and the experimental results are shown in Section 6. dk is the dimension of the
projection vector of the input feature vector on each subspace, which is calculated by dividing
d by h. The calculation of Q, K, and V in Equation (1) is shown in Equation (2):
$$X = [x_1, x_2, \ldots, x_m, \ldots, x_n]^T, \quad Q = XW_Q, \quad K = XW_K, \quad V = XW_V \qquad (2)$$
where $X \in \mathbb{R}^{n\times d}$; $W_Q$, $W_K$, and $W_V \in \mathbb{R}^{d\times d}$; $x_m \in \mathbb{R}^{1\times d}$. $X$ is the feature vector output from CNN BLOCK1. $W_Q$, $W_K$, and $W_V$ are three different trainable weights, which were obtained by random initialization at the very beginning of training and then updated by gradient descent during the training process, and finally the suitable weights are obtained to fit the real values. Each $\mathrm{head}_i\left(QW_i^Q, KW_i^K, VW_i^V\right)$ in Equation (1) was calculated as shown in Equation (3):

$$\mathrm{head}_i\left(QW_i^Q, KW_i^K, VW_i^V\right) = \mathrm{softmax}\!\left(\frac{QW_i^Q\left(KW_i^K\right)^T}{\sqrt{d_k}}\right) VW_i^V \qquad (3)$$
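Equations (1)–(3) can be written compactly in code. The sketch below is a plain PyTorch illustration rather than the authors' exact module; it assumes the dimensions given in the text (d = 1024, h = 8, so d_k = 128, and n = 108 feature positions).

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d=1024, h=8):
        super().__init__()
        self.h, self.d_k = h, d // h                 # d_k = d / h = 128 when d = 1024 and h = 8
        self.w_q = nn.Linear(d, d, bias=False)       # stacks all W_i^Q projections, Eq. (2)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.w_o = nn.Linear(d, d, bias=False)       # W^O, the final linear conversion of the heads

    def forward(self, q, k, v):                      # each input: (B, n, d)
        B, n, _ = q.shape
        def split(x):                                # (B, n, d) -> (B, h, n, d_k): one subspace per head
            return x.view(B, n, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        # Eq. (3): softmax(QW_i^Q (KW_i^K)^T / sqrt(d_k)) VW_i^V, computed for all heads in parallel.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = torch.softmax(scores, dim=-1) @ V    # (B, h, n, d_k)
        # Eq. (1): concatenate the heads and apply W^O.
        return self.w_o(heads.transpose(1, 2).reshape(B, n, self.h * self.d_k))
```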
The function of the Masked Multi-Head Attention is to prevent the model from focusing on subsequent sequence information and to ensure the parallelism of training. It is implemented by adding the input eigenvector matrix with an upper triangular matrix whose elements are all −∞, and then performing a softmax operation on the summed matrix. This turns the original eigenvector matrix into a lower triangular eigenvector matrix. The masking operation of the Masked Multi-Head Attention is able to restrict the region of attention at each time step, ensuring that the prediction at each location relies only on the known output prior to that location. Due to the design of Masked Multi-Head Attention, the entire training process required a single forward computation. Nevertheless, when the RNN model performs inference, the operation at time t + 1 can only continue once the operation at time t has been completed. Therefore, Masked Multi-Head Attention made the inference of the model proposed in this paper significantly faster than the RNN-based model.
The output of the Masked Multi-Head Attention was then fed into the Add&Norm module. This performed a normalization operation to prevent model overfitting and accelerate model convergence.
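The masking described above can be realized by adding an upper triangular matrix of −∞ to the attention scores before the softmax. A minimal PyTorch sketch, assuming a square score matrix over the output sequence:

```python
import torch

def masked_softmax(scores):
    """scores: (..., L, L) attention logits; positions above the diagonal correspond to future steps."""
    L = scores.size(-1)
    # Strictly upper triangular -inf mask, as described in the text.
    mask = torch.triu(torch.full((L, L), float("-inf"), device=scores.device), diagonal=1)
    # After softmax the attention matrix is lower triangular: each position attends only to known outputs.
    return torch.softmax(scores + mask, dim=-1)
```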
The CNN BLOCK3 and the CNN BLOCK4 change the dimension of the output feature
vector of the Encoder to 512 × 18 before decoding in order to reduce the size of the feature vec‑
tor and the computational load on the parallel decoder. Then, the outputs of the Add&Norm
and Encoder modules are fed into Multi‑Head Attention as Q, K, and V, respectively, where
K and V contain the feature information of the license plate image, and Q contains the seman‑
tic information of the license plate label. The Multi‑Head Attention calculates the correlation
of each image feature with the labeled text feature. The higher the correlation, the higher the
probability that the corresponding location in the image is a certain character. Here in the CNN
BLOCK3, we set the convolutional layer parameters stride to 3, kernelSize to (2,1), padding to 1,
and output dimension to 512. In the CNN BLOCK4, we set the convolutional layer parameters
stride to 3, kernelSize to 1, padding to (0,1), and the output dimension to 512.
The output features of Multi‑Head Attention were processed using the Add&Norm mod‑
ule and then input to the Feed‑Forward Network module. The Feed‑Forward Network module
consists of two linear conversions, where the feature vector is input to the first linear function,
activated with the ReLU function, and then input to the second linear function. The definition
of a Feed-Forward Network FFN(·) is shown in Equation (4):

$$\mathrm{FFN}(x) = \mathrm{Max}(0,\ xW_1 + b_1)W_2 + b_2 \qquad (4)$$
where W1 and W2 are the weights, b1 and b2 are the biases, and Max(·, ·) is the maximum func‑
tion. W1 ∈ Rd×d , W2 ∈ Rd×d , b1 ∈ Rd , and b2 ∈ Rd . In order to facilitate the connection
between model layers, the output size d was set to 512 for all sub‑layers in the model.
Finally, the output of the Feed‑Forward Network module was employed for forward in‑
ference with the Add&Norm module to speed up the convergence of the model.
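As a quick illustration, Equation (4) with d = 512 corresponds to two stacked linear layers with a ReLU between them; the following PyTorch snippet is a minimal sketch, not the authors' code.

```python
import torch.nn as nn

def feed_forward(d=512):
    # FFN(x) = Max(0, x W1 + b1) W2 + b2, with all sub-layer outputs of size d = 512.
    return nn.Sequential(
        nn.Linear(d, d),   # W1, b1
        nn.ReLU(),         # Max(0, .)
        nn.Linear(d, d),   # W2, b2
    )
```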
4. Experimental Setup
4.1. Datasets
To evaluate the effectiveness of the proposed model YOLOv5‑PDLPR, the datasets, as
shown in Table 1, were selected.
CCPD [25] is a large and diverse open source dataset of Chinese city license plates, pro‑
viding 290k images of unique license plates annotated in detail. Each image in this dataset
contains only one license plate, and each plate consists of seven characters, of which the first
character represents a provincial administrative region (31 categories in total, excluding Tai‑
wan Province, Hong Kong SAR, and Macau SAR), the second character is a letter, and each of
the remaining five characters is a letter or a number (all occurrences do not contain “I” and “O”,
with 34 categories of numbers and letters). As shown in Table 2, the dataset is grouped into nine
sub‑datasets according to recognition difficulty, illumination of the license plate area, distance
from the plate at the time of capture, horizontal and vertical tilt degree, and weather (rain, snow,
fog). During the model comparison experiments, half of the data in the sub‑dataset CCPD‑base
were randomly selected as the training set, while the other half were used as the validation set.
Six sub‑datasets (CCPD‑DB, CCPD‑FN, CCPD‑Rotate, CCPD‑Weather, CCPD‑Challenge, and
CCPD‑Tilt) were selected for testing the models.
PKUData [34], published by Yuan et al., offers pictures of license plates in various situa‑
tions, but each picture contains only the detection box for the location of the license plate and
misses the license plate sequence. Therefore, we manually annotated 2253 images from its data
subsets G1 (normal daytime environment), G2 (daytime with sun glare), and G3 (nighttime)
to evaluate license plate identification.
CLPD [24] is provided by Zhang et al. It collects 1200 images of different license plates
from 31 Chinese provincial administrative regions (excluding Taiwan Province, Hong Kong
SAR, and Macau SAR) taken in various environments. Like PKUData, CLPD is only used to
test the license plate recognition model.
AOLP [47] consists of 2049 license plate images from the Taiwan Province of China. Ac‑
cording to complexity and shooting conditions, the dataset is divided into three subsets: access
control (AC), law enforcement (LE), and road patrol (RP). AC contains 681 images, LE con‑
tains 757 images, and RP contains 611 images. Each license plate consists of six characters, each
consisting of a letter or number (excluding the “O”). As the license plate style of the Taiwan
Province is completely different from that of other Chinese provinces, we conducted three sets
of experiments on this dataset, each using two of its three subsets for training and the remaining
one for testing.
recognition algorithm was tested with the input image resized to 48 × 144 and the batch size
was set to 5.
$$IOU = \frac{Area(db \cap gb)}{Area(db \cup gb)} \qquad (5)$$
where gb is the area of the ground truth box, db is the area of the detected bounding box,
Area(·) is a function that finds the area, ∩ is an intersection operation and ∪ is a union oper‑
ation. For a fair comparison, the same evaluation criteria as [25] were used in this paper on
the CCPD dataset. The detected bounding box was considered correct only if the IOU was
greater than 0.7. When the IOU was greater than 0.6 and each character on a license plate was
correctly recognized, the license plate recognition result was considered correct.
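For reference, the IOU criterion of Equation (5) and the thresholds described above can be computed as in the following minimal Python sketch; the corner-coordinate box format is an assumption for illustration.

```python
def iou(db, gb):
    """Intersection over union of a detected box db and a ground-truth box gb, each (x1, y1, x2, y2)."""
    ix1, iy1 = max(db[0], gb[0]), max(db[1], gb[1])
    ix2, iy2 = min(db[2], gb[2]), min(db[3], gb[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_db = (db[2] - db[0]) * (db[3] - db[1])
    area_gb = (gb[2] - gb[0]) * (gb[3] - gb[1])
    return inter / (area_db + area_gb - inter)

# A detection counts as correct if IOU > 0.7; a recognition result counts as correct
# if IOU > 0.6 and every character on the plate is recognized correctly.
```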
5. Experiment Results
The performance of the proposed license plate recognition algorithm YOLOv5‑PDLPR
and the state‑of‑the‑art approaches were compared in this section using the same exper‑
imental setup.
Table 3. Comparison of the results of different license plate detection methods on the CCPD dataset.
Labeling the best performance in bold and the second best performance with underlining.
As can be seen from Table 3, the YOLOv5 algorithm has a higher average accuracy
on the entire CCPD test dataset than all the comparison algorithms. In addition, the
accuracy on the CCPD‑DB, CCPD‑FN, CCPD‑Rotate, CCPD‑Tilt, and CCPD‑Weather
increased by 6.1%, 8.9%, 3.5%, 4.2%, and 17.6%, respectively, when compared to the
second‑best algorithm. Its detection speed reached 218.3 FPS, which was 155.3% faster
than RPnet [24]. Thus, it can be seen that YOLOv5 has the highest detection efficiency.
Table 4 shows the results of a comparison between the proposed framework YOLOv5‑
PDLPR and the state‑of‑the‑art algorithms on the CCPD dataset.
As seen from Table 4, the average accuracy of the proposed algorithm on the whole
CCPD test dataset was 99.4%, and the accuracies on CCPD‑base, CCPD‑DB, CCPD‑FN, CCPD‑
Weather, and CCPD‑Challenge were 99.9%, 99.5%, 99.5%, 99.4%, and 94.1%, respectively,
which are higher than all the comparison algorithms. The accuracies obtained using YOLOv5‑
PDLPR for recognition on the sub‑datasets CCPD‑Rotate and CCPD‑Tilt were 0.1% and 0.3%
lower than the method proposed by Fan et al. [43], because they trained their models with
synthetic data, which made their models learn more features. However, the recognition re‑
sults of YOLOv5‑PDLPR were better than those of the algorithm of Fan et al. [43], which was
not trained using synthetic data. YOLOv5‑PDLPR achieved a speed of 159.8 FPS, which was
87.1% faster than the second‑fastest algorithm. This is because YOLOv5‑PDLPR uses parallel
inference to improve efficiency and saves recognition time by not performing additional cor‑
rection operations after the license plate detection task is completed. Experimental results on
this dataset show that our license plate recognition model is robust and efficient in complex
scenarios and is a real‑time recognition framework that can meet the requirements of road
surveillance.
Table 4. Comparison of results of different license plate recognition methods on the CCPD dataset.
Labeling the best performance in bold and the second best performance with underlining.
Figure 8 shows the results of detection and recognition of six license plate images by three
license plate detection recognition algorithms (Zhang et al.—2020 [24], Xu et al.—2018 [25],
YOLOv5‑PDLPR (Ours)), respectively. The method of Zhang et al.—2020 [24] does not use
the license plate detection network to determine the location of the license plate and then di‑
rectly inputs the real license plate image for recognition, and the other two methods use the
results of the license plate location network for recognition. The sequence of characters after each “GT” in the first row of Figure 8 represents the sequence of real license plate characters. The visualization results of each method for the six license plate images are listed after the method name in turn, and the sequence of characters after “Pred” below each visualization indicates the sequence of license plate characters detected and recognized by that method; if a character is in red font, it was not correctly recognized by the method. As can be seen
from Figure 8, some characters are incorrectly recognized by the method of Zhang et al. [24]
and Xu et al. [25] in the case of light intensity, a tilted license plate, and a blurred license plate,
while our method is able to accurately locate and correctly recognize them, which indicates
that our proposed algorithm performs better in complex scenarios.
Figure 8. The detection and recognition results of three license plate detection recognition algorithms. If a character is displayed in red font, it means that the corresponding method does not recognize the character correctly, refs. [24,25].
By plotting the heat map of the model, we can observe where the network is concerned during the run. As shown in Figure 9, there are six columns in total. Each column shows the heat map of different license plate pictures in complex cases. The first row of each column displays the original images of various license plates, while the second row begins with the attention map of each individual character. In each column from top to bottom, if a character on the license plate is darker, it indicates that the network pays more attention to the character features at that location, and then the network extracts these features for recognition. For example, in the second line of the license plate “皖NLE9132”, the color of the character “皖” is deeper, indicating that the network is more focused on the characteristics of the location of “皖”. Similarly, the network is able to locate the features on other character positions on the license plate and thus accurately identify the characters on the plate.
Table 5. Comparison of results of different license plate recognition methods on the CLPD and
PKUData datasets. Labeling the best performance in bold and the second best performance with
underlining.
Method | CLPD ACC | CLPD ACC (Without Chinese Characters) | PKUData ACC | PKUData ACC (Without Chinese Characters)
Xu et al., 2017 [25] | 66.5 | 78.9 | 77.6 | 78.4
Zhang et al., 2020 [24] | 76.8 | 87.6 | 88.2 | 90.5
Fan et al., 2022 [43] | 55.8 | 79.3 | 81.6 | 81.8
Fan et al., 2022 [43] (SYNTHETIC DATA) | 82.4 | 88.5 | 92.4 | 92.5
YOLOv5-PDLPR | 80.3 | 93.1 | 95.5 | 95.7
Considering Chinese characters, the test results on the CLPD dataset indicate that the
accuracy of the algorithm proposed in this paper is lower than that of the algorithm proposed
by Fan [43] using synthetic data. This is because our model does not use synthetic data and
is only trained on the CCPD dataset, and the Chinese characters in the CCPD dataset are not
evenly distributed, with “皖” accounting for more than 95%. However, when tested on the
CLPD dataset, the Chinese characters are evenly distributed, and there were also Chinese char‑
acters that do not appear in the CCPD dataset. Consequently, the model was unable to fully
learn the Chinese information, resulting in low test accuracy. Nevertheless, the accuracy of
our method is higher than the algorithm proposed by Fan [43], which was not trained using
synthetic data. When Chinese characters are not considered, our algorithm has the highest
accuracy (ACC), and the ACC improves by 5.2% compared to the second‑best result.
As shown in Table 5, the results on the PKUData dataset show that the accuracy of
the proposed YOLOv5‑PDLPR was the highest, regardless of whether Chinese characters
were considered or not. When Chinese characters were considered, the accuracy of the pro‑
posed YOLOv5‑PDLPR achieved 95.5%, which was 3.4% higher than the second‑best algo‑
rithm. When Chinese characters were not considered, the accuracy of the proposed YOLOv5‑
PDLPR achieved 95.7%, which was 3.5% higher than the second‑best algorithm. For the sam‑
ples in the PKUData dataset, the images were captured at different moments of the day and
thus under different light levels. The proposed YOLOv5-PDLPR was not trained on this
dataset; however, it still achieves high test accuracy, indicating that the model is robust.
The experimental results show that the algorithm proposed in this paper has high accu‑
racy for character recognition, can accurately recognize license plate characters even under
poor lighting conditions, and has reliable generalization ability and high robustness.
were 99.6%, 99.9%, and 99.8% for AOLP‑AC, AOLP‑LE, and AOLP‑RP, respectively, which
were 0.3%, 1.2%, and 5% higher than the second‑best algorithm, respectively. The subset of
AOLP‑RP consisted mainly of rotated license plates, and the proposed method in this paper
achieved the largest performance improvement in these plates. This result demonstrates that
the method is effective at recognizing irregular license plates.
Table 6. Comparative results of different license plate recognition methods using Box on the AOLP
dataset. The best performance is labeled in bold and the second best performance with underlining.
Table 7. Comparative results of different license plate recognition methods using GT on the AOLP
dataset. Labeling the best performance in bold.
6. Ablation Study
In this section, we conducted a series of experiments to evaluate the impact of the IGFE,
the parallel decoder, the number of decoding units in the parallel decoder, and the number of
heads in Multi‑Head Attention on the recognition accuracy. Without using a synthetic dataset,
the training dataset in the experiment was half of CCPD‑Base and the validation dataset was
the other half. The test datasets consisted of the three sub‑datasets CCPD‑DB, CCPD‑Tilt, and
CCPD‑Challenge, as these three sub‑datasets best represent the impact of natural scenes such
as light intensity, plate tilt, and plate blur on the performance of the license plate recognition
network. The batch size for the test was set to 5.
With all other conditions being the same, using ResNet‑18 [63] as the reference model,
the experiments were conducted with only IGFE as the backbone as well as after adding the
Focus Structure and the ConvDownSampling structure to IGFE, and the results are shown in
Table 8. A “√” in the table indicates that the network used in the experiment contains the
structure corresponding to this column, whereas a “×” indicates that the network used in the
experiment does not contain the structure corresponding to this column.
Table 8. The influence of different module structures in the backbone network on the accuracy of
license plate recognition. Labeling the best performance in bold.
Backbone | Focus Structure | ConvDownSampling | DB | Tilt | Challenge | Overall Accuracy
ResNet-18 | - | - | 99.0 | 99.3 | 93.3 | 97.7
IGFE (our) | × | × | 98.8 | 98.3 | 90.3 | 96.6
IGFE (our) | × | √ | 98.9 | 98.6 | 90.7 | 96.8
IGFE (our) | √ | × | 99.1 | 98.8 | 91.0 | 97.0
IGFE (our) | √ | √ | 99.5 | 99.7 | 94.4 | 98.3
The second and third rows in Table 8 show that by only retaining ConvDownSampling in
IGFE to replace the pooling downsampling operation, the overall accuracy can be improved by
0.2 percentage points. This demonstrates that replacing the network’s pooling operation with
Sensors 2024, 24, 2791 18 of 22
the convolution operation can reduce the loss of features during the downsampling process
and increase the ratio of correct model identification.
The second and fourth rows in Table 8 show that retaining only the Focus Structure in
the IGFE improved the average accuracy by 0.4 percentage points and improved the accuracy
by 0.7 percentage points on the CCPD‑Challenge sub‑dataset. This shows that using the Fo‑
cus Structure in the network can reduce feature loss during downsampling and improve the
correct rate of model recognition.
The third and fourth rows in Table 8 show that retaining only the Focus Structure in the
IGFE can improve the overall accuracy by 0.2 percentage points compared to retaining only the
ConvDownSampling structure in the IGFE. This is because when using the Focus Structure
instead of the ConvDownSampling structure, there is no loss of feature information at any
point. This makes the increase in precision attributable to the Focus Structure more apparent
in the experiment results.
The second and fifth rows in Table 8 show that keeping both the Focus Structure and the
ConvDownSampling structure in the IGFE can improve the overall accuracy by 1.7 percent‑
age points; in particular, on the CCPD‑Challenge sub‑dataset, the accuracy was improved by
4.1 percentage points. This demonstrates that when the Focus Structure and the ConvDown‑
Sampling structure are used together, less feature information is lost during feature extrac‑
tion than when the two structures are used separately. As a result, recognition accuracy is
improved a lot. In addition, the first and fifth rows of Table 8 show that the accuracy of li‑
cense plate recognition using IGFE was higher than that using ResNet‑18. This means that
the features extracted by IGFE are more complete than those extracted by ResNet‑18.
To investigate the effect of using different decoders in our proposed model YOLOv5‑
PDLPR on license plate recognition accuracy, license plate recognition experiments were con‑
ducted on CCPD‑DB, CCPD‑Tilt, and CCPD‑Challenge using LSTM, BiLSTM, Linear, and
Parallel Decoder as decoders of the model, with all other conditions being the same, and the
experimental results are shown in Table 9.
Table 9. The influence of different decoder on the accuracy of license plate recognition. Labeling the
best performance in bold.
Decoder | CCPD-DB | CCPD-Tilt | CCPD-Challenge
LSTM | 97.9 | 97.7 | 87.8
BiLSTM | 96.2 | 95.2 | 80.6
Linear | 90.3 | 81.9 | 70.1
Parallel Decoder | 99.5 | 99.7 | 94.4
As can be seen from Table 9, using the parallel decoder in the proposed model YOLOv5-PDLPR is more accurate than using LSTM, BiLSTM, or Linear as the decoder for license plate recognition. The two decoders LSTM and BiLSTM, as variants of RNN, achieve better accuracy in model YOLOv5-PDLPR than Linear, the most basic fully connected layer decoder. This indicates that the parallel decoder is able to extract the global semantics of the images more adequately than the RNN under the conditions of light intensity, plate tilt, and plate blurring, and the parallel decoder has higher accuracy compared with the traditional RNN decoder.
The number of heads in the Multi-Head Attention submodule of the Encoder module is another
factor that impacts the recognition performance of the proposed model YOLOv5-PDLPR.
To evaluate the effect of changing the number of attention heads on the license plate recog‑
nition accuracy of the proposed model YOLOv5‑PDLPR, experiments were conducted on
CCPD‑DB, CCPD‑Tilt, and CCPD‑Challenge by changing only the number of attention heads
while keeping the number of decoder blocks as three and all other conditions the same, and
the results are shown in Table 10.
Table 10. The influence of different number of attention heads on the accuracy of license plate recog‑
nition. Labeling the best performance in bold.
Head Number | CCPD-DB | CCPD-Tilt | CCPD-Challenge
1 | 99.2 | 99.4 | 93.4
4 | 99.4 | 99.5 | 93.4
8 | 99.5 | 99.7 | 94.4
16 | 98.7 | 98.6 | 90.6
As can be seen from Table 10, when the number of attention heads was less than or equal
to eight, the license plate recognition accuracy increases with the increase in the number of at‑
tention heads on each dataset. However, when the number of attention heads exceeds eight,
the license plate recognition accuracy begins to decline. This indicates that increasing the
number of attention heads can improve the recognition rate; however, there is an upper limit
of eight attention heads. Therefore, the number of attention heads was finally set to eight in
this paper.
The recognition performance is also affected by the number of decoding units. In this sec‑
tion, while keeping the number of heads in Multi‑Head Attention as eight and other conditions
the same, the experiments were conducted on CCPD‑DB, CCPD‑Tilt and CCPD‑Challenge by
changing the number of decoder blocks, and the results are shown in Table 11.
Table 11. The influence of different number of decoding units on the accuracy of license plate recog‑
nition. Labeling the best performance in bold.
The experimental results in Table 11 show that when the number of stacked decoding
units is less than or equal to 3, the recognition accuracy of the license plate recognition model
increases as the number of decoding units increases. However, when the number of decoder
blocks stacked exceeds 3, the recognition effect begins to diminish. Adding more decoding
units also deepens and complicates the network, which requires more computational cost and
makes training more difficult. Therefore, the number of decoder blocks was finally set to three
in this paper.
7. Conclusions
This paper proposed a YOLOv5‑PDLPR algorithm for resolving the problem of license
plate detection and recognition in natural scenes under complex conditions. Compared with
traditional feature extraction methods, this method included a feature extractor that can ob‑
tain global feature information, which can be used to obtain rich semantic information. Mean‑
while, the advantage of multi-headed attention was fully utilized, which allows license plate
pictures to be accurately recognized without auxiliary correction, showing excellent performance in
natural scenes. The model does not involve an RNN, so inference can be performed in parallel, which
improves the recognition efficiency significantly compared with other methods. Furthermore,
the experiments on the CCPD dataset achieved an average accuracy of 99.4% and recognition
speed of 159.8 FPS. However, due to the limited training dataset, this method can recognize
only a limited range of license plate types and has a low recognition rate for Chinese characters other than
“皖” in license plates. Therefore, in the future, the accuracy of license plate recognition can be en‑
hanced by collecting more license plate data with a balanced distribution of Chinese characters.
Author Contributions: Methodology, S.H. and Z.T.; software, S.H.; validation, S.H., Y.L. and Y.C.; writing—
original draft preparation, S.H. and L.T.; writing—review and editing, Z.T., P.H. and L.T.; visualization,
S.H. and Y.L.; supervision, Z.T.; funding acquisition, Z.T. and L.T. All authors have read and agreed to
the published version of the manuscript.
Funding: This study is partially supported by the National Natural Science Foundation of China (NSFC)
(No. 61170110), Zhejiang Provincial Natural Science Foundation of China (No. LY13F020043), and the
scientific research project of Zhejiang Provincial Department of Education (No. 21030074‑F).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Four publicly available datasets (CCPD [25], PKUData [34], CLPD [24], and
AOLP [47]) are used to validate the models.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Weihong, W.; Jiaoyang, T. Research on license plate recognition algorithms based on deep learning in complex environment. IEEE Access
2020, 8, 91661–91675. [CrossRef]
2. Shashirangana, J.; Padmasiri, H.; Meedeniya, D.; Perera, C. Automated license plate recognition: A survey on methods and techniques.
IEEE Access 2020, 9, 11203–11225. [CrossRef]
3. Abolghasemi, V.; Ahmadyfard, A. An edge‑based color‑aided method for license plate detection. Image Vis. Comput. 2009, 27, 1134–1142.
[CrossRef]
4. Lalimi, M.A.; Ghofrani, S.; McLernon, D. A vehicle license plate detection method using region and edge based methods. Comput. Electr.
Eng. 2013, 39, 834–845. [CrossRef]
5. Wu, Y.; Liu, S.; Wang, X. License plate location method based on texture and color. In Proceedings of the 2013 IEEE 4th International
Conference on Software Engineering and Service Science, Beijing, China, 23–25 May 2013.
6. Gou, C.; Wang, K.; Yao, Y.; Li, Z. Vehicle license plate recognition based on extremal regions and restricted boltzmann machines. IEEE
Trans. Intell. Transp. Syst. 2016, 17, 1097–1107. [CrossRef]
7. Ashtari, A.H.; Nordin, M.J.; Fathy, M. An Iranian license plate recognition system based on color features. IEEE Trans. Intell. Transp.
Syst. 2014, 15, 1690–1705. [CrossRef]
8. Girshick, R. Fast R‑CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r‑cnn: Towards real‑time object detection with region proposal networks. IEEE Trans. Pattern
Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
10. Ammar, A.; Koubaa, A.; Boulila, W.; Benjdira, B.; Alhabashi, Y. A Multi‑Stage Deep‑Learning‑Based Vehicle and License Plate Recog‑
nition System with Real‑Time Edge Inference. Sensors 2023, 23, 2120. [CrossRef]
11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real‑time object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
12. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Honolulu, HI, USA, 21–26 July 2017.
13. Hendry; Chen, R.C. Automatic License Plate Recognition Via Sliding‑Window Darknet‑Yolo Deep Learning. Image Vis. Comput. 2019,
87, 47–56. [CrossRef]
14. Bochkovskiy, A.; Wang, C.Y.; Liao, H. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
15. Zhuang, J.; Hou, S.; Wang, Z.; Zha, Z.J. Towards human‑level license plate recognition. In Proceedings of the European Conference on
Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
16. Castro‑Zunti, R.D.; Yépez, J.; Ko, S.B. License plate segmentation and recognition system using deep learning and OpenVINO. IET Intell.
Transp. Syst. 2020, 14, 119–126. [CrossRef]
17. Zherzdev, S.; Gruzdev, A. Lprnet: License plate recognition via deep neural networks. arXiv 2018, arXiv:1806.10447.
18. Xiao, D.; Zhang, L.; Li, J.; Li, J. Robust license plate detection and recognition with automatic rectification. J. Electron. Imaging 2021, 30,
013002. [CrossRef]
19. Yousaf, U.; Khan, A.; Ali, H.; Khan, F.G.; Rehman, Z.u.; Shah, S.; Ali, F.; Pack, S.; Ali, S. A deep learning based approach for localization
and recognition of pakistani vehicle license plates. Sensors 2021, 21, 7696. [CrossRef]
20. Gao, F.; Cai, Y.; Ge, Y.; Lu, S. EDF‑LPR: A new encoder–decoder framework for license plate recognition. IET Intell. Transp. Syst. 2020,
14, 959–969. [CrossRef]
21. Gong, Y.; Deng, L.; Tao, S.; Lu, X.; Wu, P.; Xie, Z.; Xie, M. Unified Chinese license plate detection and recognition with high efficiency. J.
Vis. Commun. Image Represent. 2022, 86, 103541. [CrossRef]
22. Xu, H.; Zhou, X.D.; Li, Z.; Liu, L.; Li, C.; Shi, Y. EILPR: Toward end‑to‑end irregular license plate recognition based on automatic
perspective alignment. IEEE Trans. Intell. Transp. Syst. 2021, 23, 2586–2595. [CrossRef]
23. Zou, Y.; Zhang, Y.; Yan, J.; Jiang, X.; Huang, T.; Fan, H.; Cui, Z. A robust license plate recognition model based on bi‑lstm. IEEE Access
2020, 8, 211630–211641. [CrossRef]
24. Zhang, L.; Wang, P.; Li, H.; Li, Z.; Shen, C.; Zhang, Y. A robust attentional framework for license plate recognition in the wild. IEEE
Trans. Intell. Transp. Syst. 2020, 22, 6967–6976. [CrossRef]
25. Xu, Z.; Yang, W.; Meng, A.; Lu, N.; Huang, H.; Ying, C.; Huang, L. Towards end‑to‑end license plate detection and recognition: A large
dataset and baseline. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
26. Qin, S.; Liu, S. Towards end‑to‑end car license plate location and recognition in unconstrained scenarios. Neural Comput. Appl. 2022, 34,
21551–21566. [CrossRef]
27. Murugan, V.; Sowmyayani, S.; Kavitha, J.; Meenakshi, S. AI Driven Smart Number Plate Identification for Automatic Identification.
In Proceedings of the 2024 IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT), Greater
Noida, India, 9–10 February 2024.
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Advances in neural
information processing systems. arXiv 2017, arXiv:1706.03762.
29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.;
et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
30. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted
Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17
October 2021.
31. Jocher, G. Yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 26 July 2022).
32. Du, S.; Ibrahim, M.; Shehata, M.; Badawy, W. Automatic license plate recognition (ALPR): A state‑of‑the‑art review. IEEE Trans. Circuits
Syst. Video Technol. 2012, 23, 311–325. [CrossRef]
33. Tian, J.; Wang, G.; Liu, J.; Xia, Y. License plate detection in an open environment by density‑based boundary clustering. J. Electron.
Imaging 2017, 26, 33017. [CrossRef]
34. Yuan, Y.; Zou, W.; Zhao, Y.; Wang, X.; Hu, X.; Komodakis, N. A robust and efficient approach to license plate detection. IEEE Trans.
Image Process. 2016, 26, 1102–1114. [CrossRef] [PubMed]
35. Dun, J.; Zhang, S.; Ye, X.; Zhang, Y. Chinese license plate localization in multi‑lane with complex background based on concomitant
colors. IEEE Intell. Transp. Syst. Mag. 2015, 7, 51–61. [CrossRef]
36. Kim, S.K.; Kim, D.W.; Kim, H.J. A recognition of vehicle license plate using a genetic algorithm based segmentation. In Proceedings of
the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 19 September 1996.
37. Zhang, H.; Jia, W.; He, X.; Wu, Q. Learning‑based license plate detection using global and local features. In Proceedings of the 18th
International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006.
38. Yu, S.; Li, B.; Zhang, Q.; Liu, C.; Meng, M.Q.H. A novel license plate location method based on wavelet transform and EMD analysis.
Pattern Recognit. 2015, 48, 114–125. [CrossRef]
39. Cho, B.K.; Ryu, S.H.; Shin, D.R.; Jung, J.I. License plate extraction method for identification of vehicle violations at a railway level crossing.
Int. J. Automot. Technol. 2011, 12, 281–289. [CrossRef]
40. Li, B.; Tian, B.; Li, Y.; Wen, D. Component‑based license plate detection using conditional random field model. IEEE Trans. Intell. Transp.
Syst. 2013, 14, 1690–1699. [CrossRef]
41. Li, H.; Wang, P.; Shen, C. Toward end‑to‑end car license plate detection and recognition with deep neural networks. IEEE Trans. Intell.
Transp. Syst. 2018, 20, 1126–1136. [CrossRef]
42. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
43. Fan, X.; Zhao, W. Improving robustness of license plates automatic recognition in natural scenes. IEEE Trans. Intell. Transp. Syst. 2022,
23, 18845–18854. [CrossRef]
44. Andriyanov, N.A.; Dementiev, V.E.; Tashlinskiy, A.G. Development of a Productive Transport Detection System Using Convolutional
Neural Networks. Pattern Recognit. Image Anal. 2022, 32, 495–500. [CrossRef]
45. Hui, T.W.; Tang, X.; Loy, C.C. A lightweight optical flow CNN—Revisiting data fidelity and regularization. IEEE Trans. Pattern Anal.
Mach. Intell. 2020, 43, 2555–2569. [CrossRef] [PubMed]
46. Maglad, K.W. A vehicle license plate detection and recognition system. J. Comput. Sci. 2012, 8, 310–315.
47. Hsu, G.S.; Chen, J.C.; Chung, Y.Z. Application‑oriented license plate recognition. IEEE Trans. Veh. Technol. 2012, 62, 552–561. [CrossRef]
48. Rahman, C.A.; Badawy, W.; Radmanesh, A. A real time vehicle’s license plate recognition system. In Proceedings of the IEEE Conference
on Advanced Video and Signal Based Surveillance, Miami, FL, USA, 22–22 July 2003.
49. Björklund, T.; Fiandrotti, A.; Annarumma, M.; Francini, G.; Magli, E. Robust license plate recognition using neural networks trained on
synthetic images. Pattern Recognit. 2019, 93, 134–146.
50. Yao, D.; Zhu, W.; Chen, Y.; Zhang, L. Chinese license plate character recognition based on convolution neural network. In Proceedings
of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017.
51. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Advances in neural information processing systems. arXiv
2015, arXiv:1506.02025.
52. Luo, C.; Jin, L.; Sun, Z. Moran: A multi‑object rectified attention network for scene text recognition. Pattern Recognit. 2019, 90, 109–118.
[CrossRef]
53. Wang, T.; Zhu, Y.; Jin, L.; Luo, C.; Chen, X.; Wu, Y.; Cai, M. Decoupled attention network for text recognition. In Proceedings of the
AAAI Conference on Artificial Intelligence, New York Hilton Midtown, New York, NY, USA, 7–12 February 2020.
54. Yang, L.; Wang, P.; Li, H.; Li, Z.; Zhang, Y. A holistic representation guided attention network for scene text recognition. Neurocomputing
2020, 414, 67–75. [CrossRef]
55. Kang, L.; Riba, P.; Rusiñol, M.; Fornés, A.; Villegas, M. Pay attention to what you read: Non‑recurrent handwritten text‑line recognition.
Pattern Recognit. 2022, 129, 108766. [CrossRef]
56. Mahdavi, M.; Zanibbi, R.; Mouchere, H.; Viard‑Gaudin, C.; Garain, U. ICDAR 2019 CROHME+ TFD: Competition on recognition of
handwritten mathematical expressions and typeset formula detection. In Proceedings of the 2019 International Conference on Document
Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019.
57. Ma, J.; Liang, Z.; Zhang, L. A text attention network for spatial deformation robust scene text image super‑resolution. In Proceedings of
the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
58. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015,
arXiv:1502.03167.
59. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th
International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013.
60. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011.
61. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal
Process. Lett. 2016, 23, 1499–1503. [CrossRef]
62. Zou, Y.; Zhang, Y.; Yan, J.; Jiang, X.; Huang, T.; Fan, H.; Cui, Z. License plate detection and recognition based on YOLOv3 and ILPRNET.
Signal Image Video Process. 2022, 16, 473–480. [CrossRef]
63. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and
contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property
resulting from any ideas, methods, instructions or products referred to in the content.