YOLOP: You Only Look Once for Panoptic Driving Perception
Dong Wu, Manwen Liao, Weitian Zhang, and Xinggang Wang, Member, IEEE

Abstract—A panoptic driving perception system is an essential part of autonomous driving. A high-precision and real-time perception system can assist the vehicle in making reasonable decisions while driving. We present a panoptic driving perception network (YOLOP) to perform traffic object detection, drivable area segmentation and lane detection simultaneously. It is composed of one encoder for feature extraction and three decoders to handle the specific tasks. Our model performs extremely well on the challenging BDD100K dataset, achieving state-of-the-art on all three tasks in terms of accuracy and speed. Besides, we verify the effectiveness of our multi-task learning model for joint training via ablative studies. To the best of our knowledge, this is the first work that can process these three visual perception tasks simultaneously in real time on an embedded device, Jetson TX2 (23 FPS), while maintaining excellent accuracy. To facilitate further research, the source code and pre-trained models will be released at https://github.com/hustvl/YOLOP.

Index Terms—Deep learning, multitask learning, traffic object detection, drivable area segmentation, lane detection.

D. Wu, M. Liao, W. Zhang and X. Wang are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: {riserwu, mwliao, wtzhang, xgwang}@hust.edu.cn).

I. INTRODUCTION

RECENTLY, extensive research on autonomous driving has revealed the importance of the panoptic driving perception system. It plays a significant role in autonomous driving as it can extract visual information from the images taken by the camera and assist the decision system to control the actions of the vehicle. In order to restrict the maneuver of vehicles, the visual perception system should be able to understand the scene and then provide the decision system with information including: locations of the obstacles, judgements of whether the road is drivable, the position of the lanes, etc. Object detection is usually involved in the panoptic driving perception system to help the vehicle avoid obstacles and follow traffic rules. Drivable area segmentation and lane detection are also needed as they are crucial for planning the driving route of the vehicle.

Many methods handle these tasks separately. For instance, Faster R-CNN [1] and YOLOv4 [2] deal with object detection; UNet [3] and PSPNet [4] are proposed to perform semantic segmentation. SCNN [5] and ENet-SAD [6] are used for detecting lanes. Despite the excellent performance these methods achieve, processing these tasks one after another takes a longer time than tackling them all at once. When deploying the panoptic driving perception system on embedded devices commonly used in self-driving cars, limited computational resources and latency should be taken into consideration.

Fig. 1. The input and output of our model. The purpose of our model is to process traffic object detection, drivable area segmentation and lane detection simultaneously in one input image. In (b), the brown bounding boxes indicate traffic objects, the green areas are the drivable areas, and the blue lines represent the lane lines. (a) Input; (b) Output.

In addition, different tasks in traffic scene understanding often have much related information, such as the three tasks mentioned above. As shown in Figure 1, the lanes are often the boundary of the drivable area, and the drivable area usually closely surrounds the traffic objects. A multi-task network is more suitable in this situation as (1) it can accelerate the image analysis process by handling multiple tasks at once instead of one by one, and (2) it can share information among multiple tasks, which may improve the performance of each task since a multi-task network often shares the same feature extraction backbone. Therefore, it is of essence to explore multi-task approaches in autonomous driving.
MultiNet [7] uses the encoder-decoder structure which has one shared encoder and three separate decoders for classification, object detection and semantic segmentation. It performs well on these tasks and achieves state-of-the-art on the KITTI drivable area segmentation task. Classification tasks, however, are not as crucial as lane detection in controlling the vehicle. DLT-Net [8] combines traffic object detection, drivable area segmentation and lane detection all together and proposes a context tensor to fuse feature maps between decoders in order to share mutual information. Although it achieves competitive performance, it does not reach real-time. Thus, we construct an efficient multi-task network for the panoptic driving perception system which includes the object detection, drivable area segmentation and lane detection tasks, and which can reach real-time on the embedded device Jetson TX2 with TensorRT deployment. By processing these three key tasks in autonomous driving all at once, we reduce the inference time of the panoptic driving perception system, constrain the computational cost to a reasonable range and enhance the performance of each task.

In order to obtain high precision and fast speed, we design a simple and efficient network architecture. We use a lightweight CNN [9] as the encoder to extract features from the image. Then these feature maps are fed to three decoders to complete their respective tasks. Our detection decoder is based on the current best-performing single-stage detection network [2] for two main reasons: (1) the single-stage detection network is faster than the two-stage detection network; (2) the grid-based prediction mechanism of the single-stage detector is more related to the other two semantic segmentation tasks, while instance segmentation is usually combined with the region-based detector [10]. The feature map output by the encoder incorporates semantic features of different levels and scales, and our segmentation branch can use these feature maps to complete pixel-wise semantic prediction excellently.

In addition to the end-to-end training strategy, we attempt some alternating optimization paradigms which train our model step by step. On the one hand, we can put unrelated tasks in different training steps to prevent inter-limitation. On the other hand, the task trained first can guide the other tasks. So this kind of paradigm sometimes works well, though it is cumbersome. However, experiments show that it is unnecessary for our model, as the one trained end to end performs well enough. As a result, our panoptic driving perception system reaches 41 FPS on a single NVIDIA TITAN XP and 23 FPS on Jetson TX2; meanwhile, it achieves state-of-the-art on the three tasks of the BDD100K dataset [11].

In summary, our main contributions are: (1) We put forward an efficient multi-task network that can jointly handle three crucial tasks in autonomous driving: object detection, drivable area segmentation and lane detection, to save computational costs, reduce inference time as well as improve the performance of each task. Our work is the first to reach real-time on embedded devices while maintaining state-of-the-art level performance on the BDD100K dataset. (2) We design ablative experiments to verify the effectiveness of our multi-tasking scheme. It is proved that the three tasks can be learned jointly without tedious alternating optimization.

II. RELATED WORK

In this section, we review solutions to the above three tasks respectively, and then introduce some related multi-task learning work. We only concentrate on solutions based on deep learning.

A. Traffic Object Detection

In recent years, with the rapid development of deep learning, many prominent object detection algorithms have emerged. Current mainstream object detection algorithms can be divided into two-stage methods and one-stage methods.

Two-stage methods complete the detection task in two steps. First, regional proposals are obtained, and then features in the regional proposals are used to locate and classify the objects. The generation of regional proposals has gone through several stages of development. R-CNN [12] creatively tries to use selective search instead of sliding windows to extract regional proposals on the original image, while Fast R-CNN [13] performs this operation directly on the feature map. The RPN network proposed in Faster R-CNN [1] greatly reduces the time consumption and obtains higher accuracy. Based on the former, R-FCN [14] proposes a fully convolutional network that replaces the fully connected layer with a convolutional layer to further speed up detection.

The SSD-series [15] and YOLO-series algorithms are milestones among one-stage methods. This kind of algorithm performs bounding box regression and object classification simultaneously. YOLO [16] divides the picture into S×S grids instead of extracting regional proposals with the RPN network, which significantly accelerates the detection speed. YOLO9000 [17] introduces the anchor mechanism to improve the recall of detection. YOLOv3 [18] uses the feature pyramid network structure to achieve multi-scale detection. YOLOv4 [2] further improves the detection performance by refining the network structure, activation function and loss function, and by applying abundant data augmentation.

B. Drivable Area Segmentation

Due to the great success of deep learning, CNN-based methods have been widely used in semantic segmentation recently. FCN [19] first introduces the fully convolutional network to semantic segmentation. It preserves the backbone of the CNN classifier and replaces the final fully connected layer with a 1×1 convolutional layer and an upsample layer. Despite the skip-connection refinement, its performance is still limited by the low-resolution output. In order to obtain higher-resolution output, UNet [3] constructs the encoder-decoder architecture. DeepLab [20] uses a CRF (conditional random field) to improve the quality of the output as well as proposes the atrous algorithm to expand the receptive field while maintaining similar computational costs. PSPNet [4] comes up with the pyramid pooling module to extract features at various scales to enhance its performance.

C. Lane Detection

In lane detection, there are lots of innovative researches based on deep learning.
[21] constructs a dual-branch network to perform semantic segmentation and pixel embedding on images. It further clusters the dual-branch features to achieve lane instance segmentation. SCNN [5] proposes slice-by-slice convolution, which enables messages to pass between pixels across rows and columns in a layer, but this convolution is very time-consuming. ENet-SAD [6] uses a self attention distillation method, which enables low-level feature maps to learn from high-level feature maps. This method improves the performance of the model while keeping the model lightweight. [22] defines lane detection as a task to find the collection of lane line locations in certain rows of the image, and this row-based classification uses global features.

D. Multi-task Approaches

The goal of multi-task learning is to learn better representations through shared information among multiple tasks. Especially, a CNN-based multi-task learning method can also achieve convolutional sharing of the network structure. Mask R-CNN [10] extends Faster R-CNN by adding a branch for predicting the object mask, which combines the instance segmentation and object detection tasks effectively, and these two tasks can promote each other's performance. LSNet [23] summarizes object detection, instance segmentation and pose estimation as location-sensitive visual recognition and uses a unified solution to handle these tasks. With a shared encoder and three independent decoders, MultiNet [7] completes the three scene perception tasks of scene classification, object detection and segmentation of the driving area simultaneously. DLT-Net [8] inherits the encoder-decoder structure, and contributively constructs context tensors between sub-task decoders to share designated information among tasks. [24] puts forward mutually interlinked sub-structures between lane area segmentation and lane boundary detection. Meanwhile, it proposes a novel loss function to constrain the lane line to the outer contour of the lane area so that they overlap geometrically. However, this prior assumption also limits its application, as it only works well in scenarios where the lane line tightly wraps the lane area. What's more, the training paradigm of a multi-task model is also worth thinking about. [25] states that joint training is appropriate and beneficial only when all those tasks are indeed related; otherwise, it is necessary to adopt alternating optimization. So Faster R-CNN [1] adopts a pragmatic 4-step training algorithm to learn shared features. This paradigm may sometimes be helpful, but it is so tedious.

III. METHODOLOGY

We put forward a simple and efficient feed-forward network that can accomplish the traffic object detection, drivable area segmentation and lane detection tasks altogether. As shown in Figure 2, our panoptic driving perception single-shot network, termed YOLOP, contains one shared encoder and three subsequent decoders to solve the specific tasks. There are no complex and redundant shared blocks between different decoders, which reduces computational consumption and allows our network to be easily trained end-to-end.

Fig. 2. The architecture of YOLOP. YOLOP shares one encoder and combines three decoders to solve different tasks. The encoder consists of a backbone and a neck.

A. Encoder

Our network shares one encoder, which is composed of a backbone network and a neck network.

1) Backbone: The backbone network is used to extract the features of the input image. Usually, some classic image classification network serves as the backbone. Due to the excellent performance of YOLOv4 [2] on object detection, we choose CSPDarknet [9] as the backbone, which solves the problem of gradient duplication during optimization [26]. It supports feature propagation and feature reuse, which reduces the amount of parameters and calculations. Therefore, it is conducive to ensuring the real-time performance of the network.

2) Neck: The neck is used to fuse the features generated by the backbone. Our neck is mainly composed of a Spatial Pyramid Pooling (SPP) module [27] and a Feature Pyramid Network (FPN) module [28]. SPP generates and fuses features of different scales, and FPN fuses features at different semantic levels, making the generated features contain information at multiple scales and multiple semantic levels. We adopt the method of concatenation to fuse features in our work.
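To make the feature-fusion step above more concrete, the following is a minimal sketch of an SPP-style block that pools the same feature map with several kernel sizes and fuses the results by concatenation, in the spirit of [27]. It is an illustrative PyTorch snippet, not the released YOLOP implementation; the kernel sizes and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Sketch of a Spatial Pyramid Pooling block: max-pool the input at
    several kernel sizes (stride 1, so spatial size is kept), then fuse
    everything by channel-wise concatenation followed by a 1x1 conv."""
    def __init__(self, in_ch=512, out_ch=512, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )
        # 1x1 conv squeezes the concatenated maps back to out_ch channels.
        self.fuse = nn.Conv2d(in_ch * (len(kernel_sizes) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        pooled = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(pooled, dim=1))

# Example: a low-resolution feature map coming out of the backbone (assumed shape).
feat = torch.randn(1, 512, 12, 20)
print(SPPBlock()(feat).shape)  # torch.Size([1, 512, 12, 20])
```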
B. Decoders

The three heads in our network are specific decoders for the three tasks.

1) Detect Head: Similar to YOLOv4, we adopt an anchor-based multi-scale detection scheme. Firstly, we use a structure called Path Aggregation Network (PAN), a bottom-up feature pyramid network [29]. FPN transfers semantic features top-down, and PAN transfers positioning features bottom-up. We combine them to obtain a better feature fusion effect, and then directly use the multi-scale fusion feature maps in the PAN for detection. Each grid of the multi-scale feature map is then assigned three prior anchors with different aspect ratios, and the detection head predicts the offset of the position and the scaling of the height and width, as well as the corresponding probability of each category and the confidence of the prediction.
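As a concrete illustration of the grid/anchor prediction just described, the snippet below decodes the raw output of one detection scale in the familiar YOLOv3/v4 style: sigmoids for the in-cell offsets, objectness and class scores, and an exponential scaling of the anchor width and height. This is a hedged sketch with assumed tensor shapes and anchor values, not the exact decoding used in the released YOLOP code.

```python
import torch

def decode_detections(raw, anchors, stride):
    """Decode one detection scale.
    raw:     (B, A, H, W, 5 + num_classes) raw head output
             (tx, ty, tw, th, objectness, class logits)
    anchors: (A, 2) anchor (width, height) in pixels -- assumed values
    stride:  downsampling factor of this feature map (e.g. 8, 16, 32)
    Returns boxes as (cx, cy, w, h) in pixels plus objectness/class probabilities."""
    B, A, H, W, _ = raw.shape
    gy, gx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).float()                  # (H, W, 2) cell indices

    xy = (torch.sigmoid(raw[..., 0:2]) + grid) * stride           # box centre in pixels
    wh = torch.exp(raw[..., 2:4]) * anchors.view(1, A, 1, 1, 2)   # anchor-scaled width/height
    obj = torch.sigmoid(raw[..., 4:5])                            # confidence of the prediction
    cls = torch.sigmoid(raw[..., 5:])                             # per-category probability
    return torch.cat((xy, wh, obj, cls), dim=-1)

# Example with assumed sizes: 3 anchors on a 48x80 grid (stride 8), 1 class.
anchors = torch.tensor([[10., 13.], [16., 30.], [33., 23.]])
out = decode_detections(torch.randn(1, 3, 48, 80, 6), anchors, stride=8)
```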
2) Drivable Area Segment Head & Lane Line Segment Head: The drivable area segment head and the lane line segment head adopt the same network structure. We feed the bottom layer of the FPN to the segmentation branch, with the size of (W/8, H/8, 256). Our segmentation branch is very simple. After three upsampling processes, we restore the output feature map to the size of (W, H, 2), which represents the probability of each pixel in the input image belonging to the drivable area/lane line or the background. Because of the shared SPP in the neck network, we do not add an extra SPP module to the segment branches as others usually do [4], which brings no improvement to the performance of our network. Additionally, we use nearest-neighbor interpolation in our upsampling layers instead of deconvolution to reduce computation cost. As a result, not only do our segment decoders produce high-precision output, but they are also very fast during inference.
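The following is a minimal sketch of such a segmentation branch: three conv + nearest-neighbor upsampling stages that take a (W/8, H/8, 256) feature map to a two-channel full-resolution map. The intermediate layer widths are assumptions for illustration; only the overall 8x upsampling, the nearest-neighbor mode and the two output channels come from the description above.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    # One stage: 3x3 conv, then double the spatial resolution with
    # nearest-neighbor interpolation (cheaper than deconvolution).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="nearest"),
    )

seg_head = nn.Sequential(             # (B, 256, H/8, W/8) -> (B, 2, H, W)
    up_block(256, 128),               # -> H/4, W/4
    up_block(128, 64),                # -> H/2, W/2
    up_block(64, 32),                 # -> H,   W
    nn.Conv2d(32, 2, kernel_size=1),  # 2 channels: target class vs. background
)

x = torch.randn(1, 256, 48, 80)       # assumed 640x384 input -> 80x48 bottom FPN map
print(seg_head(x).shape)               # torch.Size([1, 2, 384, 640])
```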
C. Loss Function

Since there are three decoders in our network, our multi-task loss contains three parts.
As for the detection loss Ldet, it is a weighted sum of the classification loss, the object loss and the bounding box loss, as in equation (1):

Ldet = α1 Lclass + α2 Lobj + α3 Lbox,   (1)

where Lclass and Lobj are focal losses [30], which are utilized to reduce the loss of well-classified examples, thus forcing the network to focus on the hard ones. Lclass is used for penalizing classification and Lobj for the confidence of one prediction. Lbox is LCIoU [31], which takes the distance, overlap rate, and similarity of scale and aspect ratio between the predicted box and the ground truth into consideration.

Both the drivable area segmentation loss Lda−seg and the lane line segmentation loss Lll−seg contain a cross entropy loss with logits, Lce, which aims to minimize the classification errors between the pixels of the network outputs and the targets. It is worth mentioning that the IoU loss, LIoU = TN / (TN + FP + FN), is added to Lll−seg as it is especially efficient for the prediction of the sparse category of lane lines. Lda−seg and Lll−seg are defined as equations (2) and (3) respectively:

Lda−seg = Lce,   (2)

Lll−seg = Lce + LIoU.   (3)

In conclusion, our final loss is a weighted sum of the three parts all together, as in equation (4):

Lall = γ1 Ldet + γ2 Lda−seg + γ3 Lll−seg,   (4)

where α1, α2, α3, γ1, γ2, γ3 can be tuned to balance all parts of the total loss.
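To make equations (1)-(4) concrete, here is a hedged PyTorch sketch of how such a weighted multi-task loss can be assembled. The focal, CIoU and IoU terms are left as plug-in callables, and the weights α and γ are placeholder values, since the exact numbers used are not listed here.

```python
import torch.nn.functional as F

def multitask_loss(det_out, da_out, ll_out, targets,
                   focal_loss, ciou_loss, iou_loss,
                   alpha=(1.0, 1.0, 1.0), gamma=(1.0, 1.0, 1.0)):
    """Weighted sum of detection, drivable-area and lane-line losses (eqs. 1-4).
    det_out / da_out / ll_out are the three decoder outputs; `targets` bundles
    the corresponding ground truth. The three loss callables are assumed to exist."""
    # Eq. (1): detection loss = focal (class) + focal (objectness) + CIoU (box).
    l_class = focal_loss(det_out["cls"], targets["cls"])
    l_obj   = focal_loss(det_out["obj"], targets["obj"])
    l_box   = ciou_loss(det_out["box"], targets["box"])
    l_det = alpha[0] * l_class + alpha[1] * l_obj + alpha[2] * l_box

    # Eq. (2): drivable-area loss = cross entropy with logits.
    l_da = F.cross_entropy(da_out, targets["da_mask"])

    # Eq. (3): lane-line loss = cross entropy with logits + IoU term.
    l_ll = F.cross_entropy(ll_out, targets["ll_mask"]) + iou_loss(ll_out, targets["ll_mask"])

    # Eq. (4): total loss.
    return gamma[0] * l_det + gamma[1] * l_da + gamma[2] * l_ll
```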
D. Training Paradigm

We attempt different paradigms to train our model. The simplest one is training end to end, so that the three tasks are learned jointly. This training paradigm is useful when all tasks are indeed related. In addition, some alternating optimization algorithms have also been tried, which train our model step by step. In each step, the model can focus on one or multiple related tasks, regardless of those that are unrelated. Even if not all tasks are related, our model can still learn adequately on each task with this paradigm. Algorithm 1 illustrates the process of one step-by-step training method.

Algorithm 1 One step-by-step training method. First, we only train the Encoder and the Detect head. Then we freeze the Encoder and the Detect head and train the two Segmentation heads. Finally, the entire network is trained jointly for all three tasks.

Input: Target neural network F with parameter group Θ = {θenc, θdet, θseg};
       Training set: T;
       Threshold for convergence: thr;
       Loss function: Lall
Output: Well-trained network F(x; Θ)
 1: procedure TRAIN(F, T)
 2:   repeat
 3:     Sample a mini-batch (xs, ys) from training set T.
 4:     ℓ ← Lall(F(xs; Θ), ys)
 5:     Θ ← arg minΘ ℓ
 6:   until ℓ < thr
 7: end procedure
 8: Θ ← Θ \ {θseg}                  // Freeze parameters of the two Segmentation heads.
 9: TRAIN(F, T)
10: Θ ← Θ ∪ {θseg} \ {θdet, θenc}   // Freeze parameters of the Encoder and Detect head and activate parameters of the two Segmentation heads.
11: TRAIN(F, T)
12: Θ ← Θ ∪ {θdet, θenc}            // Activate all parameters of the neural network.
13: TRAIN(F, T)
14: return Trained network F(x; Θ)
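As an illustration of how such a step-by-step paradigm can be realized, the sketch below freezes and unfreezes parameter groups between stages by toggling requires_grad. It is a hedged outline with hypothetical module names (encoder, detect_head, seg_heads) and a placeholder train_until_converged routine, not the released training script.

```python
import itertools
import torch

def set_trainable(modules, flag):
    # Freeze (flag=False) or activate (flag=True) a group of sub-networks.
    for p in itertools.chain(*(m.parameters() for m in modules)):
        p.requires_grad = flag

def train_until_converged(model, loader, loss_fn, thr):
    # Placeholder for the TRAIN procedure of Algorithm 1: optimize only the
    # currently trainable parameters until the loss drops below `thr`.
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if loss.item() < thr:
            break

def step_by_step_training(model, loader, loss_fn, thr):
    # Step 1: train only the encoder and the detect head (segment heads frozen).
    set_trainable([model.encoder, model.detect_head], True)
    set_trainable([model.seg_heads], False)
    train_until_converged(model, loader, loss_fn, thr)

    # Step 2: freeze encoder + detect head, train the two segmentation heads.
    set_trainable([model.encoder, model.detect_head], False)
    set_trainable([model.seg_heads], True)
    train_until_converged(model, loader, loss_fn, thr)

    # Step 3: activate everything and train the whole network jointly.
    set_trainable([model.encoder, model.detect_head, model.seg_heads], True)
    train_until_converged(model, loader, loss_fn, thr)
```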
has the diversity of geography, environment, and weather, the
Lda−seg = Lce , (2) algorithm trained on the BDD100k dataset is robust enough
to migrate to a new environment. Therefore, we choose the
Lll−seg = Lce + LIoU . (3) BDD100k dataset to train and evaluate our network. The
BDD100K dataset has three parts, training set with 70K
In conclusion, our final loss is a weighted sum of the three
images, validation set with 10K images, and test set with 20K
parts all together as in equation (4).
images. Since the label of the test set is not public, we evaluate
Lall = γ1 Ldet + γ2 Lda−seg + γ3 Lll−seg , (4) our network on the validation set.
2) Implementation Details: In order to enhance the per-
where α1 , α2 , α3 , γ1 , γ2 , γ3 can be tuned to balance all parts formance of our model, we empirically adopt some practical
of the total loss. techniques and methods of data augmentation.
With the purpose of enabling our detector to get more prior knowledge of the objects in the traffic scene, we use the k-means clustering algorithm to obtain prior anchors from all the detection frames of the dataset. We use Adam as the optimizer to train our model, and the initial learning rate, β1, and β2 are set to 0.001, 0.937, and 0.999 respectively. Warm-up and cosine annealing are used to adjust the learning rate during training, which aims at leading the model to converge faster and better [32].
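The warm-up plus cosine-annealing schedule mentioned above can be sketched as follows; the warm-up length and the minimum learning-rate factor are assumptions for illustration, as they are not specified here.

```python
import math

def lr_at_epoch(epoch, total_epochs, base_lr=0.001,
                warmup_epochs=3, min_lr_factor=0.01):
    """Linear warm-up for the first few epochs, then cosine annealing [32]."""
    if epoch < warmup_epochs:
        # Ramp linearly from 0 up to the base learning rate.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to base_lr * min_lr_factor.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_lr_factor + (1 - min_lr_factor) * cosine)

# Example: print the schedule at a few points of a hypothetical 240-epoch run.
print([round(lr_at_epoch(e, 240), 6) for e in (0, 1, 2, 3, 120, 239)])
```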
We use data augmentation to increase the variability of images so as to make our model robust in different environments. Photometric distortions and geometric distortions are taken into consideration in our training scheme. For photometric distortions, we adjust the hue, saturation and value of images. For geometric distortions, we use random rotating, scaling, translating, shearing, and left-right flipping.
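Below is a small, hedged sketch of this kind of augmentation pipeline using torchvision transforms; the jitter magnitudes and probabilities are placeholder values, and the detection boxes and segmentation masks would of course have to be transformed consistently with the image, which is omitted here.

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

# Photometric distortions: jitter hue, saturation and value (brightness).
photometric = transforms.ColorJitter(brightness=0.4, saturation=0.7, hue=0.015)

# Geometric distortions: random rotation, scaling, translation and shearing.
geometric = transforms.RandomAffine(
    degrees=10, translate=(0.1, 0.1), scale=(0.75, 1.25), shear=5
)

def augment(img):
    """Apply photometric + geometric distortions and a random left-right flip."""
    img = photometric(img)
    img = geometric(img)
    if random.random() < 0.5:
        img = TF.hflip(img)
    return img
```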
3) Experimental Setting: We select some excellent multi-task networks and networks that focus on a single task to compare with our network. Both MultiNet and DLT-Net handle multiple panoptic driving perception tasks, and they have achieved great performance in the object detection and drivable area segmentation tasks on the BDD100K dataset. Faster R-CNN is an outstanding representative of the two-stage object detection networks. YOLOv5 is the single-stage network that achieves state-of-the-art performance on the COCO dataset. PSPNet achieves splendid performance on the semantic segmentation task with its superior ability to aggregate global information. We retrain the above networks on the BDD100K dataset and compare them with our network on the object detection and drivable area segmentation tasks. Since there is no suitable existing multi-task network that processes the lane detection task on the BDD100K dataset, we compare our network with ENet [33], SCNN and ENet-SAD, three advanced lane detection networks. Besides, the performance of the joint training paradigm is compared with alternating training paradigms of many kinds. Moreover, we compare the accuracy and speed of our multi-task model trained to handle multiple tasks with the ones trained to perform a specific task. Following [6], we resize images in the BDD100K dataset from 1280×720×3 to 640×384×3. All control experiments follow the same experimental settings and evaluation metrics, and all experiments are run on an NVIDIA GTX TITAN XP.

B. Result

In this section, we simply train our model end to end and then compare it with other representative models on all three tasks.

1) Traffic Object Detection Result: Since MultiNet and DLT-Net can only detect vehicles, we only consider the vehicle detection results of the five models on the BDD100K dataset. As shown in Table I, we use Recall and mAP50 as the evaluation metrics of detection accuracy. Our model exceeds Faster R-CNN, MultiNet, and DLT-Net in detection accuracy, and is comparable to YOLOv5s, which actually uses more tricks than ours. Moreover, our model can infer in real time. YOLOv5s is faster than ours because it does not have the lane line segment head and the drivable area segment head. Visualization of the traffic object detection is shown in Figure 3.

TABLE I
Traffic Object Detection Results: comparing the proposed YOLOP with state-of-the-art detectors.

Network        Recall(%)  mAP50(%)  Speed(fps)
MultiNet       81.3       60.2      8.6
DLT-Net        89.4       68.4      9.3
Faster R-CNN   77.2       55.6      5.3
YOLOv5s        86.8       77.2      82
YOLOP (ours)   89.2       76.5      41

Fig. 3. Visualization of the traffic object detection results of YOLOP. Top row: traffic object detection results in day-time scenes. Bottom row: traffic object detection results in night scenes.

2) Drivable Area Segmentation Result: In this paper, both the "area/drivable" and "area/alternative" classes in the BDD100K dataset are categorized as "drivable area" without distinction. Our model only needs to distinguish the drivable area from the background in the image. mIoU is used to evaluate the segmentation performance of different models. The results are shown in Table II. It can be seen that our model outperforms MultiNet, DLT-Net and PSPNet by 19.9%, 20.2%, and 1.9%, respectively. Furthermore, our inference speed is 4 to 5 times faster than theirs. Visualization results of the drivable area segmentation can be seen in Figure 4.

TABLE II
Drivable Area Segmentation Results: comparing the proposed YOLOP with state-of-the-art drivable area segmentation or semantic segmentation methods.

Network        mIoU(%)  Speed(fps)
MultiNet       71.6     8.6
DLT-Net        71.3     9.3
PSPNet         89.6     11.1
YOLOP (ours)   91.5     41

Fig. 4. Visualization of the drivable area segmentation results of YOLOP. Top row: drivable area segmentation results in day-time scenes. Bottom row: drivable area segmentation results in night scenes.

3) Lane Detection Result: The lane lines in the BDD100K dataset are labeled with two lines, so it is very tricky to directly use the annotation. The experimental settings follow [6] in order to compare conveniently. First of all, we calculate the center lines based on the two-line annotations.
Then we draw the lane lines of the training set with the width set to 8 pixels, while keeping the lane line width of the test set as 2 pixels. We use pixel accuracy and IoU of lanes as the evaluation metrics. As shown in Table III, the performance of our model dramatically exceeds that of the other three models. The visualization results of lane detection can be seen in Figure 5.

TABLE III
Lane Detection Results: comparing the proposed YOLOP with state-of-the-art lane detection methods.

Network        Accuracy(%)  IoU(%)
ENet           34.12        14.64
SCNN           35.79        15.84
ENet-SAD       36.56        16.02
YOLOP (ours)   70.50        26.20

Fig. 5. Visualization of the lane detection results of YOLOP. Top row: lane detection results in day-time scenes. Bottom row: lane detection results in night scenes.
SEMANTIC SEGMENTATION METHODS .
C. Ablation Studies
We designed the following two ablation experiments to further illustrate the effectiveness of our scheme. All the evaluation metrics in this section are consistent with those used above.
1) End-to-end v.s. Step-by-step: In Table IV, we compare the performance of the joint training paradigm with alternating training paradigms of many kinds.¹ Obviously, our model performs well enough through end-to-end training, so there is no need to perform alternating optimization. However, it is interesting that the paradigms which train the detection task first seem to perform better. We think it is mainly because our model is closer to a complete detection model and the model is harder to converge when performing the detection task. What's more, the paradigms consisting of three steps slightly outperform those with two steps. Similar alternating training can be run for more steps, but we have observed negligible improvements.

¹ E, D, S and W refer to the Encoder, the Detect head, the two Segment heads and the whole network. So Algorithm 1 can be marked as ED-S-W, and the same for the others.

TABLE IV
Panoptic driving perception results: the end-to-end scheme v.s. different step-by-step schemes.

Training method  Recall(%)  AP(%)  mIoU(%)  Accuracy(%)  IoU(%)
ES-W             87.0       75.3   90.4     66.8         26.2
ED-W             87.3       76.0   91.6     71.2         26.1
ES-D-W           87.0       75.1   91.7     68.6         27.0
ED-S-W           87.5       76.1   91.6     68.0         26.8
End-to-end       89.2       76.5   91.5     70.5         26.2

2) Multi-task v.s. Single task: To verify the effectiveness of our multi-task learning scheme, we compare the performance of the multi-task scheme and the single-task scheme. On the one hand, we train our model to perform the 3 tasks simultaneously. On the other hand, we train our model to perform the traffic object detection, drivable area segmentation, and lane line segmentation tasks separately. Table V shows the comparison of the performance of these two schemes on each specific task. It can be seen that the performance our model achieves with the multi-task scheme is close to that of focusing on a single task. More importantly, the multi-task model can save a lot of time compared to executing each task individually.

TABLE V
Panoptic driving perception results: multi-task learning v.s. single task learning.

Training method  Recall(%)  AP(%)  mIoU(%)  Accuracy(%)  IoU(%)  Speed(ms/frame)
Det(only)        88.2       76.9   -        -            -       15.7
Da-Seg(only)     -          -      92.0     -            -       14.8
Ll-Seg(only)     -          -      -        79.6         27.9    14.8
Multitask        89.2       76.5   91.5     70.5         26.2    24.4

V. CONCLUSION

In this paper, we put forward a simple and efficient network, which can simultaneously handle the three driving perception tasks of object detection, drivable area segmentation and lane detection, and which can be trained end-to-end. Our model performs exceptionally well on the challenging BDD100K dataset, achieving or greatly exceeding the state-of-the-art level on all three tasks. And it can perform real-time inference on the embedded device Jetson TX2, which ensures that our network can be used in real-world scenarios.
REFERENCES

[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[2] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International Publishing, 2015, pp. 234–241.
[4] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, "Spatial as deep: Spatial cnn for traffic scene understanding," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[6] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, "Learning lightweight lane detection cnns by self attention distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1013–1021.
[7] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, "Multinet: Real-time joint semantic reasoning for autonomous driving," arXiv preprint arXiv:1612.07695, 2016.
[8] Y. Qian, J. M. Dolan, and M. Yang, "Dlt-net: Joint detection of drivable areas, lane lines, and traffic objects," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 11, pp. 4670–4679, 2019.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-yolov4: Scaling cross stage partial network," arXiv preprint arXiv:2011.08036, 2020.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[11] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "Bdd100k: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, vol. 2, no. 5, p. 6, 2018.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[13] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[14] J. Dai, Y. Li, K. He, and J. Sun, "R-fcn: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[17] J. Redmon and A. Farhadi, "Yolo9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[18] ——, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[19] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[20] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[21] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: an instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 286–291.
[22] Z. Qin, H. Wang, and X. Li, "Ultra fast structure-aware deep lane detection," arXiv preprint arXiv:2004.11757, 2020.
[23] K. Duan, L. Xie, H. Qi, S. Bai, Q. Huang, and Q. Tian, "Location-sensitive visual recognition with cross-iou loss," arXiv preprint arXiv:2104.04899, 2021.
[24] J. Zhang, Y. Xu, B. Ni, and Z. Duan, "Geometric constrained joint lane segmentation and lane boundary detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 486–502.
[25] Z. Kang, K. Grauman, and F. Sha, "Learning with whom to share in multi-task feature learning," in ICML, 2011.
[26] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "Cspnet: A new backbone that can enhance learning capability of cnn," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[29] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[31] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-iou loss: Faster and better learning for bounding box regression," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12993–13000.
[32] I. Loshchilov and F. Hutter, "Sgdr: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.
[33] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "Enet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.
Dong Wu is an undergraduate senior student in the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. His research interests include computer vision, machine learning and autonomous driving.

Manwen Liao is a senior undergraduate student from the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. He majors in Electronic Information Engineering. His research interests mainly include computer vision, machine learning, robotics and autonomous driving.

Weitian Zhang is an undergraduate senior student from Huazhong University of Science and Technology, Wuhan, Hubei, China, majoring in Electronic Information Engineering. Her main research interests include computer vision and machine learning.

Xinggang Wang (M'17) received the B.S. and Ph.D. degrees in Electronics and Information Engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2009 and 2014, respectively. He is currently an Associate Professor with the School of Electronic Information and Communications, HUST. His research interests include computer vision and machine learning. He serves as an associate editor for the Pattern Recognition and Image and Vision Computing journals and as an editorial board member of the Electronics journal.
