YOLOP: You Only Look Once for Panoptic Driving Perception
I. INTRODUCTION

…classification, object detection and semantic segmentation. It performs well on these tasks and achieves state-of-the-art results on the KITTI drivable area segmentation task. Classification tasks, however, are not as crucial as lane detection in controlling the vehicle. DLT-Net [8] combines traffic object detection, drivable area segmentation and lane detection together and proposes a context tensor to fuse feature maps between decoders in order to share mutual information. Although its performance is competitive, it does not reach real time. We therefore construct an efficient multi-task network for a panoptic driving perception system that includes the object detection, drivable area segmentation and lane detection tasks and that reaches real time on the embedded device Jetson TX2 with TensorRT deployment. By processing these three key tasks in autonomous driving all at once, we reduce the inference time of the panoptic driving perception system, constrain the computational cost to a reasonable range and enhance the performance of each task.

In order to obtain high precision and fast speed, we design a simple and efficient network architecture. We use a lightweight CNN [9] as the encoder to extract features from the image. These feature maps are then fed to three decoders to complete their respective tasks. Our detection decoder is based on the current best-performing single-stage detection network [2] for two main reasons: (1) the single-stage detection network is faster than the two-stage detection network; (2) the grid-based prediction mechanism of the single-stage detector is more closely related to the other two semantic segmentation tasks, whereas instance segmentation is usually combined with region-based detectors [10]. The feature map output by the encoder incorporates semantic features of different levels and scales, and our segmentation branches can use these feature maps to complete pixel-wise semantic prediction excellently.

In addition to the end-to-end training strategy, we attempt some alternating optimization paradigms that train our model step by step. On the one hand, we can put unrelated tasks in different training steps to prevent them from limiting each other. On the other hand, the task trained first can guide the other tasks. So this kind of paradigm sometimes works well, though it is cumbersome. Experiments show, however, that it is unnecessary for our model, as the one trained end to end performs well enough. As a result, our panoptic driving perception system reaches 41 FPS on a single NVIDIA TITAN XP and 23 FPS on Jetson TX2; meanwhile, it achieves state-of-the-art results on the three tasks of the BDD100K dataset [11].

In summary, our main contributions are: (1) We put forward an efficient multi-task network that can jointly handle three crucial tasks in autonomous driving, object detection, drivable area segmentation and lane detection, to save computational costs, reduce inference time and improve the performance of each task. Our work is the first to reach real time on embedded devices while maintaining state-of-the-art performance on the BDD100K dataset. (2) We design ablative experiments to verify the effectiveness of our multi-task scheme, showing that the three tasks can be learned jointly without tedious alternating optimization.

II. RELATED WORK

In this section, we review solutions to the above three tasks respectively, and then introduce some related multi-task learning work. We only concentrate on solutions based on deep learning.

A. Traffic Object Detection

In recent years, with the rapid development of deep learning, many prominent object detection algorithms have emerged. Current mainstream object detection algorithms can be divided into two-stage methods and one-stage methods.

Two-stage methods complete the detection task in two steps. First, region proposals are obtained, and then features in the region proposals are used to locate and classify the objects. The generation of region proposals has gone through several stages of development. R-CNN [12] creatively uses selective search instead of sliding windows to extract region proposals on the original image, while Fast R-CNN [13] performs this operation directly on the feature map. The RPN proposed in Faster R-CNN [1] greatly reduces the time consumption and obtains higher accuracy. Building on this, R-FCN [14] proposes a fully convolutional network that replaces the fully connected layer with a convolutional layer to further speed up detection.

The SSD-series [15] and YOLO-series algorithms are milestones among one-stage methods. This kind of algorithm performs bounding box regression and object classification simultaneously. YOLO [16] divides the picture into S×S grids instead of extracting region proposals with an RPN, which significantly accelerates the detection speed. YOLO9000 [17] introduces the anchor mechanism to improve the recall of detection. YOLOv3 [18] uses a feature pyramid network structure to achieve multi-scale detection. YOLOv4 [2] further improves the detection performance by refining the network structure, activation function and loss function and by applying abundant data augmentation.

B. Drivable Area Segmentation

Due to the great success of deep learning, CNN-based methods have recently been widely used in semantic segmentation. FCN [19] first introduces a fully convolutional network for semantic segmentation. It preserves the backbone of a CNN classifier and replaces the final fully connected layer with a 1×1 convolutional layer and an upsampling layer. Despite the skip-connection refinement, its performance is still limited by the low-resolution output. In order to obtain higher-resolution output, U-Net [3] constructs an encoder-decoder architecture. DeepLab [20] uses a CRF (conditional random field) to improve the quality of the output and proposes the atrous algorithm to expand the receptive field while maintaining similar computational costs. PSPNet [4] comes up with the pyramid pooling module to extract features at various scales to enhance its performance.

C. Lane Detection

In lane detection, there is a lot of innovative research based on deep learning. [21] constructs a dual-branch network …
Fig. 2. The architecture of YOLOP. YOLOP shares one encoder and combines three decoders to solve different tasks. The encoder consists of a backbone
and a neck.
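To give a rough feel for this shared-encoder, three-decoder layout, here is a minimal PyTorch-style skeleton. The module names, channel widths and head structures below are placeholder assumptions of ours for illustration only, not the actual YOLOP implementation.

```python
import torch
import torch.nn as nn

class PanopticPerceptionNet(nn.Module):
    """Toy skeleton: one shared encoder (backbone + neck) feeding a
    detection head and two segmentation heads."""

    def __init__(self, num_det_outputs=255, num_seg_classes=2):
        super().__init__()
        # Shared encoder: backbone + neck (placeholder conv stacks).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.neck = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Three task-specific decoders.
        self.detect_head = nn.Conv2d(128, num_det_outputs, 1)  # grid-based detection
        self.drivable_head = nn.Sequential(                    # drivable-area segmentation
            nn.Conv2d(128, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_seg_classes, 1),
        )
        self.lane_head = nn.Sequential(                        # lane-line segmentation
            nn.Conv2d(128, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_seg_classes, 1),
        )

    def forward(self, x):
        feat = self.neck(self.backbone(x))
        return self.detect_head(feat), self.drivable_head(feat), self.lane_head(feat)

det, da_seg, ll_seg = PanopticPerceptionNet()(torch.randn(1, 3, 384, 640))
```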
…weighted sum of classification loss, object loss and bounding box loss, as in Equation (1):

$$L_{det} = \alpha_1 L_{class} + \alpha_2 L_{obj} + \alpha_3 L_{box}, \qquad (1)$$

where $L_{class}$ and $L_{obj}$ are focal losses [30], which are used to reduce the loss contribution of well-classified examples and thus force the network to focus on the hard ones. $L_{class}$ penalizes classification and $L_{obj}$ the confidence of a prediction. $L_{box}$ is $L_{CIoU}$ [31], which takes the distance, overlap rate, scale similarity and aspect ratio between the predicted box and the ground truth into consideration.

Both the drivable area segmentation loss $L_{da-seg}$ and the lane line segmentation loss $L_{ll-seg}$ contain a cross-entropy loss with logits $L_{ce}$, which aims to minimize the classification errors between the pixels of the network outputs and the targets. It is worth mentioning that an IoU loss, $L_{IoU} = \frac{TN}{TN+FP+FN}$, is added to $L_{ll-seg}$, as it is especially efficient for predicting the sparse category of lane lines. $L_{da-seg}$ and $L_{ll-seg}$ are defined in Equations (2) and (3), respectively:

$$L_{da-seg} = L_{ce}, \qquad (2)$$

$$L_{ll-seg} = L_{ce} + L_{IoU}. \qquad (3)$$

In conclusion, our final loss is a weighted sum of the three parts, as in Equation (4):

$$L_{all} = \gamma_1 L_{det} + \gamma_2 L_{da-seg} + \gamma_3 L_{ll-seg}, \qquad (4)$$

where $\alpha_1, \alpha_2, \alpha_3, \gamma_1, \gamma_2, \gamma_3$ can be tuned to balance all parts of the total loss.
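As a concrete reading of Equations (1)-(4), the following minimal PyTorch-style sketch assembles the weighted combination. The focal, CIoU and lane IoU terms are taken as precomputed inputs rather than reimplemented, and the weight values default to placeholders, not the paper's settings.

```python
import torch.nn.functional as F

def total_loss(det_terms, da_logits, da_target, ll_logits, ll_target, ll_iou,
               alphas=(1.0, 1.0, 1.0), gammas=(1.0, 1.0, 1.0)):
    """Combine the three task losses as in Eqs. (1)-(4).

    det_terms: (l_class, l_obj, l_box), already computed with focal / CIoU losses.
    ll_iou:    scalar IoU loss for the lane-line branch.
    """
    a1, a2, a3 = alphas
    g1, g2, g3 = gammas

    # Eq. (1): detection loss as a weighted sum of its three parts.
    l_class, l_obj, l_box = det_terms
    l_det = a1 * l_class + a2 * l_obj + a3 * l_box

    # Eq. (2): drivable-area loss = cross entropy with logits.
    l_da = F.cross_entropy(da_logits, da_target)

    # Eq. (3): lane-line loss = cross entropy with logits + IoU loss.
    l_ll = F.cross_entropy(ll_logits, ll_target) + ll_iou

    # Eq. (4): weighted sum of the three parts.
    return g1 * l_det + g2 * l_da + g3 * l_ll
```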
D. Training Paradigm

We attempt different paradigms to train our model. The simplest one is training end to end, so that the three tasks are learned jointly. This training paradigm is useful when all tasks are indeed related. In addition, some alternating optimization algorithms have also been tried, which train our model step by step. In each step, the model can focus on one or several related tasks, regardless of the unrelated ones. Even if not all tasks are related, our model can still learn each task adequately with this paradigm. Algorithm 1 illustrates the process of one step-by-step training method.
Algorithm 1 One step-by-step training method. First, we only train the Encoder and the Detect head. Then we freeze the Encoder and the Detect head and train the two Segmentation heads. Finally, the entire network is trained jointly on all three tasks.

Input: Target neural network F with parameter group Θ = {θenc, θdet, θseg}; training set T; convergence threshold thr; loss function Lall.
Output: Well-trained network F(x; Θ).
 1: procedure TRAIN(F, T)
 2:   repeat
 3:     Sample a mini-batch (xs, ys) from the training set T.
 4:     ℓ ← Lall(F(xs; Θ), ys)
 5:     Θ ← arg minΘ ℓ
 6:   until ℓ < thr
 7: end procedure
 8: Θ ← Θ \ {θseg}   // Freeze the parameters of the two Segmentation heads.
 9: TRAIN(F, T)
10: Θ ← (Θ ∪ {θseg}) \ {θdet, θenc}   // Freeze the parameters of the Encoder and the Detect head and activate the parameters of the two Segmentation heads.
11: TRAIN(F, T)
12: Θ ← Θ ∪ {θdet, θenc}   // Activate all parameters of the neural network.
13: TRAIN(F, T)
14: return Trained network F(x; Θ)
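A minimal sketch of the freeze/activate logic behind Algorithm 1 is given below. It assumes a model object with backbone, neck, detect_head, drivable_head and lane_head attributes (our naming, matching the earlier skeleton) and a standard PyTorch data loader; the real training procedure contains considerably more machinery.

```python
import torch

def set_trainable(modules, flag):
    # Freeze or activate a group of sub-modules by toggling requires_grad.
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def train_until_converged(model, loader, loss_fn, thr, lr=1e-3):
    # Inner TRAIN procedure: optimise only the currently active parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss = float("inf")
    while loss > thr:
        for images, targets in loader:
            opt.zero_grad()
            loss_t = loss_fn(model(images), targets)
            loss_t.backward()
            opt.step()
            loss = loss_t.item()

def step_by_step(model, loader, loss_fn, thr):
    # Step 1: train Encoder + Detect head, Segmentation heads frozen.
    set_trainable([model.drivable_head, model.lane_head], False)
    train_until_converged(model, loader, loss_fn, thr)
    # Step 2: freeze Encoder + Detect head, train the two Segmentation heads.
    set_trainable([model.backbone, model.neck, model.detect_head], False)
    set_trainable([model.drivable_head, model.lane_head], True)
    train_until_converged(model, loader, loss_fn, thr)
    # Step 3: activate everything and train the whole network jointly.
    set_trainable([model.backbone, model.neck, model.detect_head], True)
    train_until_converged(model, loader, loss_fn, thr)
```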
IV. EXPERIMENTS

A. Setting

1) Dataset Setting: The BDD100K dataset [11] supports research on multi-task learning in the field of autonomous driving. With 100K frames of pictures and annotations for 10 tasks, it is the largest driving video dataset. Since the dataset is diverse in geography, environment and weather, an algorithm trained on BDD100K is robust enough to migrate to a new environment. Therefore, we choose the BDD100K dataset to train and evaluate our network. BDD100K has three parts: a training set with 70K images, a validation set with 10K images and a test set with 20K images. Since the labels of the test set are not public, we evaluate our network on the validation set.

2) Implementation Details: In order to enhance the performance of our model, we empirically adopt some practical techniques and methods of data augmentation.

To give our detector more prior knowledge of the objects in the traffic scene, we use the k-means clustering algorithm to obtain prior anchors from all the detection frames of the dataset. We use Adam as the optimizer to train our model, and the initial learning rate, β1 and β2 are set to 0.001, 0.937 and 0.999, respectively. Warm-up and cosine annealing are used to adjust the learning rate during training, which helps the model converge faster and better [32].
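The snippet below sketches the two ingredients just described: k-means clustering of ground-truth box sizes into prior anchors, and a learning-rate schedule with linear warm-up followed by cosine annealing on top of Adam. The anchor count, warm-up length and epoch budget are illustrative assumptions rather than the paper's exact configuration.

```python
import math
import numpy as np
import torch
from sklearn.cluster import KMeans

def kmeans_anchors(box_wh, n_anchors=9):
    """Cluster (width, height) pairs of all ground-truth boxes into prior anchors."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(np.asarray(box_wh))
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort anchors by area

def build_optimizer_and_scheduler(model, total_epochs=240, warmup_epochs=3):
    """Adam with lr=0.001, betas=(0.937, 0.999); linear warm-up then cosine annealing."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.937, 0.999))

    def lr_lambda(epoch):
        if epoch < warmup_epochs:  # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    return opt, sched
```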
We use data augmentation to increase the variability of the images and make our model robust in different environments. Both photometric and geometric distortions are considered in our training scheme. For photometric distortions, we adjust the hue, saturation and value of the images. For geometric distortions, we process the images with random rotation, scaling, translation, shearing and left-right flipping.
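A compact torchvision-style sketch of such an augmentation pipeline follows; the distortion ranges are arbitrary placeholders, and a real multi-task pipeline would also need to transform boxes and masks consistently with the image, which is omitted here.

```python
import torchvision.transforms as T

# Photometric distortions: jitter hue, saturation and value (brightness).
photometric = T.ColorJitter(brightness=0.4, saturation=0.4, hue=0.1)

# Geometric distortions: random rotation, scaling, translation, shearing and flipping.
geometric = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.75, 1.25), shear=5),
    T.RandomHorizontalFlip(p=0.5),
])

augment = T.Compose([photometric, geometric])
```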
3) Experimental Setting: We select some excellent multi-task networks and some networks that focus on a single task to compare with our network. Both MultiNet and DLT-Net handle multiple panoptic driving perception tasks, and they have achieved great performance in the object detection and drivable area segmentation tasks on the BDD100K dataset. Faster R-CNN is an outstanding representative of two-stage object detection networks. YOLOv5 is a single-stage network that achieves state-of-the-art performance on the COCO dataset. PSPNet achieves splendid performance on the semantic segmentation task with its superior ability to aggregate global information. We retrain the above networks on the BDD100K dataset and compare them with our network on the object detection and drivable area segmentation tasks. Since there is no suitable existing multi-task network that handles the lane detection task on the BDD100K dataset, we compare our network with ENet [33], SCNN and ENet-SAD, three advanced lane detection networks. Besides, the performance of the joint training paradigm is compared with several kinds of alternating training paradigms. Moreover, we compare the accuracy and speed of our multi-task model, trained to handle multiple tasks, with those of the same model trained to perform a single task. Following [6], we resize the images in the BDD100K dataset from 1280×720×3 to 640×384×3. All control experiments follow the same experimental settings and evaluation metrics, and all experiments are run on an NVIDIA GTX TITAN XP.

B. Result

In this section, we simply train our model end to end and then compare it with other representative models on all three tasks.

1) Traffic Object Detection Result: Since MultiNet and DLT-Net can only detect vehicles, we only consider the vehicle detection results of the five models on the BDD100K dataset. As shown in Table I, we use Recall and mAP50 as the evaluation metrics of detection accuracy. Our model exceeds Faster R-CNN, MultiNet and DLT-Net in detection accuracy and is comparable to YOLOv5s, which actually uses more tricks than ours. Moreover, our model can infer in real time. YOLOv5s is faster than ours because it does not have the lane line segmentation head and the drivable area segmentation head. Visualizations of the traffic object detection results are shown in Figure 3.

TABLE I
TRAFFIC OBJECT DETECTION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART DETECTORS.

Network        Recall (%)   mAP50 (%)   Speed (fps)
MultiNet       81.3         60.2        8.6
DLT-Net        89.4         68.4        9.3
Faster R-CNN   77.2         55.6        5.3
YOLOv5s        86.8         77.2        82
YOLOP (ours)   89.2         76.5        41
Fig. 3. Visualization of the traffic object detection results of YOLOP. Top Row: Traffic object detection results in day-time scenes. Bottom row: Traffic object detection results in night scenes.
Fig. 4. Visualization of the drivable area segmentation results of YOLOP. Top Row: Drivable area segmentation results in day-time scenes. Bottom row:
Drivable area segmentation results in night scenes.
2) Drivable Area Segmentation Result: In this paper, both the "area/drivable" and "area/alternative" classes of the BDD100K dataset are categorized as "drivable area" without distinction. Our model only needs to distinguish the drivable area from the background in the image. mIoU is used to evaluate the segmentation performance of the different models. The results are shown in Table II. It can be seen that our model outperforms MultiNet, DLT-Net and PSPNet by 19.9%, 20.2% and 1.9%, respectively. Furthermore, our inference speed is 4 to 5 times faster than theirs. Visualization results of the drivable area segmentation can be seen in Figure 4.

TABLE II
DRIVABLE AREA SEGMENTATION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART DRIVABLE AREA SEGMENTATION OR SEMANTIC SEGMENTATION METHODS.

Network        mIoU (%)   Speed (fps)
MultiNet       71.6       8.6
DLT-Net        71.3       9.3
PSPNet         89.6       11.1
YOLOP (ours)   91.5       41
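For concreteness, the class merging and mIoU evaluation described above can be sketched as follows in numpy. The label encoding (0 for background, non-zero for the two drivable classes) is an assumption about how the labels are prepared, not a statement about the official BDD100K format.

```python
import numpy as np

def to_binary_drivable(label_map):
    # Merge "area/drivable" and "area/alternative" into a single foreground class.
    return (label_map > 0).astype(np.uint8)

def miou_binary(pred, gt):
    """Mean IoU over the two classes (background, drivable area)."""
    ious = []
    for cls in (0, 1):
        p, g = pred == cls, gt == cls
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```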
3) Lane Detection Result: The lane lines in the BDD100K dataset are labeled with two lines each, so it is very tricky to use the annotations directly. The experimental settings follow [6] to allow a convenient comparison. First of all, we calculate the center lines from the two-line annotations. Then we draw the lane lines of the training set with a width of 8 pixels, while keeping the lane line width of the test set at 2 pixels. We use pixel accuracy and the IoU of lanes as evaluation metrics. As shown in Table III, the performance of our model dramatically exceeds that of the other three models. The visualization results of lane detection can be seen in Figure 5.
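The sketch below spells out one straightforward reading of these two metrics on binary lane masks; it is our own illustrative implementation, not the evaluation code used for Table III.

```python
import numpy as np

def lane_metrics(pred, gt):
    """Pixel accuracy and IoU of the lane class for binary masks (1 = lane pixel)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    accuracy = (pred == gt).mean()            # fraction of correctly classified pixels
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return float(accuracy), float(iou)
```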
Fig. 5. Visualization of the lane detection results of YOLOP. Top Row: Lane detection results in day-time scenes. Bottom row: Lane detection results in
night scenes.
TABLE III
LANE DETECTION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART LANE DETECTION METHODS.

Network        Accuracy (%)   IoU (%)
ENet           34.12          14.64
SCNN           35.79          15.84
ENet-SAD       36.56          16.02
YOLOP (ours)   70.50          26.20

C. Ablation Studies

We design the following two ablation experiments to further illustrate the effectiveness of our scheme. All the evaluation metrics in this section are consistent with those above.

1) End-to-end v.s. Step-by-step: In Table IV, we compare the performance of the joint training paradigm with alternating training paradigms of many kinds.¹ Obviously, our model already performs very well with end-to-end training, so there is no need to perform alternating optimization. However, it is interesting that the paradigms that train the detection task first seem to perform better. We think this is mainly because our model is closer to a complete detection model, and the model is harder to converge when performing the detection task. What is more, the paradigms consisting of three steps slightly outperform those with two steps. Similar alternating training can be run for more steps, but we have observed negligible improvements.

¹ E, D, S and W refer to the Encoder, the Detect head, the two Segment heads and the whole network. So Algorithm 1 can be marked as ED-S-W, and the same for the others.

TABLE IV
PANOPTIC DRIVING PERCEPTION RESULTS: THE END-TO-END SCHEME V.S. DIFFERENT STEP-BY-STEP SCHEMES.

2) Multi-task v.s. Single task: To verify the effectiveness of our multi-task learning scheme, we compare the performance of the multi-task scheme and the single-task scheme. On the one hand, we train our model to perform the three tasks simultaneously. On the other hand, we train our model to perform the traffic object detection, drivable area segmentation and lane line segmentation tasks separately. Table V shows the comparison of the performance of these two schemes on each specific task. It can be seen that the performance our model achieves with the multi-task scheme is close to that obtained by focusing on a single task. More importantly, the multi-task model can save a lot of time compared with executing each task individually.

TABLE V
PANOPTIC DRIVING PERCEPTION RESULTS: MULTI-TASK LEARNING V.S. SINGLE-TASK LEARNING.

V. CONCLUSION

In this paper, we put forward a simple and efficient network that can simultaneously handle the three driving perception tasks of object detection, drivable area segmentation and lane detection, and that can be trained end to end. Our model performs exceptionally well on the challenging BDD100K dataset, achieving or greatly exceeding the state of the art on all three tasks. It can also perform real-time inference on the embedded device Jetson TX2, which ensures that our network can be used in real-world scenarios.
REFERENCES

[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[2] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International Publishing, 2015, pp. 234–241.
[4] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, "Spatial as deep: Spatial CNN for traffic scene understanding," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[6] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, "Learning lightweight lane detection CNNs by self attention distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1013–1021.
[7] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, "MultiNet: Real-time joint semantic reasoning for autonomous driving," arXiv preprint arXiv:1612.07695, 2016.
[8] Y. Qian, J. M. Dolan, and M. Yang, "DLT-Net: Joint detection of drivable areas, lane lines, and traffic objects," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 11, pp. 4670–4679, 2019.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," arXiv preprint arXiv:2011.08036, 2020.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[11] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, vol. 2, no. 5, p. 6, 2018.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[13] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[14] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[17] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[18] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[19] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[20] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[21] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: An instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 286–291.
[22] Z. Qin, H. Wang, and X. Li, "Ultra fast structure-aware deep lane detection," arXiv preprint arXiv:2004.11757, 2020.
[23] K. Duan, L. Xie, H. Qi, S. Bai, Q. Huang, and Q. Tian, "Location-sensitive visual recognition with cross-IOU loss," arXiv preprint arXiv:2104.04899, 2021.
[24] J. Zhang, Y. Xu, B. Ni, and Z. Duan, "Geometric constrained joint lane segmentation and lane boundary detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 486–502.
[25] Z. Kang, K. Grauman, and F. Sha, "Learning with whom to share in multi-task feature learning," in ICML, 2011.
[26] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[29] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[31] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12993–13000.
[32] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.
[33] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.

Dong Wu is an undergraduate senior student in the School of Electronics Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. His research interests include computer vision, machine learning and autonomous driving.