YOLOP: You Only Look Once for Panoptic Driving Perception
I. INTRODUCTION

…classification, object detection and semantic segmentation. It performs well on these tasks and achieves state-of-the-art results on the KITTI drivable area segmentation task. Classification tasks, however, are not as crucial as lane detection in controlling the vehicle. DLT-Net [8] combines traffic object detection, drivable area segmentation and lane detection together and proposes a context tensor to fuse feature maps between decoders in order to share mutual information. Although its performance is competitive, it does not reach real time. We therefore construct an efficient multi-task network for a panoptic driving perception system that includes the object detection, drivable area segmentation and lane detection tasks and that reaches real time on the embedded device Jetson TX2 with TensorRT deployment. By processing these three key tasks in autonomous driving all at once, we reduce the inference time of the panoptic driving perception system, constrain the computational cost to a reasonable range and enhance the performance of each task.

In order to obtain high precision and fast speed, we design a simple and efficient network architecture. We use a lightweight CNN [9] as the encoder to extract features from the image. These feature maps are then fed to three decoders to complete their respective tasks. Our detection decoder is based on the current best-performing single-stage detection network [2] for two main reasons: (1) the single-stage detection network is faster than the two-stage detection network; (2) the grid-based prediction mechanism of the single-stage detector is more closely related to the other two semantic segmentation tasks, whereas instance segmentation is usually combined with region-based detectors [10]. The feature map output by the encoder incorporates semantic features of different levels and scales, and our segmentation branches can use these feature maps to complete pixel-wise semantic prediction excellently.

In addition to the end-to-end training strategy, we attempt some alternating optimization paradigms that train our model step by step. On the one hand, we can put unrelated tasks in different training steps to prevent them from limiting each other. On the other hand, the task trained first can guide the other tasks. So this kind of paradigm sometimes works well, though it is cumbersome. Experiments show, however, that it is unnecessary for our model, as the one trained end to end performs well enough. As a result, our panoptic driving perception system reaches 41 FPS on a single NVIDIA TITAN XP and 23 FPS on Jetson TX2; meanwhile, it achieves state-of-the-art results on the three tasks of the BDD100K dataset [11].

In summary, our main contributions are: (1) We put forward an efficient multi-task network that can jointly handle three crucial tasks in autonomous driving, object detection, drivable area segmentation and lane detection, to save computational costs, reduce inference time and improve the performance of each task. Our work is the first to reach real time on embedded devices while maintaining state-of-the-art performance on the BDD100K dataset. (2) We design ablative experiments to verify the effectiveness of our multi-task scheme, showing that the three tasks can be learned jointly without tedious alternating optimization.

II. RELATED WORK

In this section, we review solutions to the above three tasks respectively, and then introduce some related multi-task learning work. We only concentrate on solutions based on deep learning.

A. Traffic Object Detection

In recent years, with the rapid development of deep learning, many prominent object detection algorithms have emerged. Current mainstream object detection algorithms can be divided into two-stage methods and one-stage methods.

Two-stage methods complete the detection task in two steps. First, region proposals are obtained, and then features in the region proposals are used to locate and classify the objects. The generation of region proposals has gone through several stages of development. R-CNN [12] creatively uses selective search instead of sliding windows to extract region proposals on the original image, while Fast R-CNN [13] performs this operation directly on the feature map. The RPN proposed in Faster R-CNN [1] greatly reduces the time consumption and obtains higher accuracy. Building on this, R-FCN [14] proposes a fully convolutional network that replaces the fully connected layer with a convolutional layer to further speed up detection.

The SSD-series [15] and YOLO-series algorithms are milestones among one-stage methods. This kind of algorithm performs bounding box regression and object classification simultaneously. YOLO [16] divides the picture into S×S grids instead of extracting region proposals with an RPN, which significantly accelerates the detection speed. YOLO9000 [17] introduces the anchor mechanism to improve the recall of detection. YOLOv3 [18] uses a feature pyramid network structure to achieve multi-scale detection. YOLOv4 [2] further improves the detection performance by refining the network structure, activation function and loss function and by applying abundant data augmentation.

B. Drivable Area Segmentation

Due to the great success of deep learning, CNN-based methods have recently been widely used in semantic segmentation. FCN [19] first introduces a fully convolutional network for semantic segmentation. It preserves the backbone of a CNN classifier and replaces the final fully connected layer with a 1×1 convolutional layer and an upsampling layer. Despite the skip-connection refinement, its performance is still limited by the low-resolution output. In order to obtain higher-resolution output, U-Net [3] constructs an encoder-decoder architecture. DeepLab [20] uses a CRF (conditional random field) to improve the quality of the output and proposes the atrous algorithm to expand the receptive field while maintaining similar computational costs. PSPNet [4] comes up with the pyramid pooling module to extract features at various scales to enhance its performance.

C. Lane Detection

In lane detection, there is a lot of innovative research based on deep learning. [21] constructs a dual-branch network …
Fig. 2. The architecture of YOLOP. YOLOP shares one encoder and combines three decoders to solve different tasks. The encoder consists of a backbone
and a neck.
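To give a rough feel for this shared-encoder, three-decoder layout, here is a minimal PyTorch-style skeleton. The module names, channel widths and head structures below are placeholder assumptions of ours for illustration only, not the actual YOLOP implementation.

```python
import torch
import torch.nn as nn

class PanopticPerceptionNet(nn.Module):
    """Toy skeleton: one shared encoder (backbone + neck) feeding a
    detection head and two segmentation heads."""

    def __init__(self, num_det_outputs=255, num_seg_classes=2):
        super().__init__()
        # Shared encoder: backbone + neck (placeholder conv stacks).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.neck = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Three task-specific decoders.
        self.detect_head = nn.Conv2d(128, num_det_outputs, 1)  # grid-based detection
        self.drivable_head = nn.Sequential(                    # drivable-area segmentation
            nn.Conv2d(128, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_seg_classes, 1),
        )
        self.lane_head = nn.Sequential(                        # lane-line segmentation
            nn.Conv2d(128, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_seg_classes, 1),
        )

    def forward(self, x):
        feat = self.neck(self.backbone(x))
        return self.detect_head(feat), self.drivable_head(feat), self.lane_head(feat)

det, da_seg, ll_seg = PanopticPerceptionNet()(torch.randn(1, 3, 384, 640))
```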
…weighted sum of classification loss, object loss and bounding box loss, as in Equation (1):

$$L_{det} = \alpha_1 L_{class} + \alpha_2 L_{obj} + \alpha_3 L_{box}, \qquad (1)$$

where $L_{class}$ and $L_{obj}$ are focal losses [30], which are used to reduce the loss contribution of well-classified examples and thus force the network to focus on the hard ones. $L_{class}$ penalizes classification and $L_{obj}$ the confidence of a prediction. $L_{box}$ is $L_{CIoU}$ [31], which takes the distance, overlap rate, scale similarity and aspect ratio between the predicted box and the ground truth into consideration.

Both the drivable area segmentation loss $L_{da-seg}$ and the lane line segmentation loss $L_{ll-seg}$ contain a cross-entropy loss with logits $L_{ce}$, which aims to minimize the classification errors between the pixels of the network outputs and the targets. It is worth mentioning that an IoU loss, $L_{IoU} = \frac{TN}{TN+FP+FN}$, is added to $L_{ll-seg}$, as it is especially efficient for predicting the sparse category of lane lines. $L_{da-seg}$ and $L_{ll-seg}$ are defined in Equations (2) and (3), respectively:

$$L_{da-seg} = L_{ce}, \qquad (2)$$

$$L_{ll-seg} = L_{ce} + L_{IoU}. \qquad (3)$$

In conclusion, our final loss is a weighted sum of the three parts, as in Equation (4):

$$L_{all} = \gamma_1 L_{det} + \gamma_2 L_{da-seg} + \gamma_3 L_{ll-seg}, \qquad (4)$$

where $\alpha_1, \alpha_2, \alpha_3, \gamma_1, \gamma_2, \gamma_3$ can be tuned to balance all parts of the total loss.
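As a concrete reading of Equations (1)-(4), the following minimal PyTorch-style sketch assembles the weighted combination. The focal, CIoU and lane IoU terms are taken as precomputed inputs rather than reimplemented, and the weight values default to placeholders, not the paper's settings.

```python
import torch.nn.functional as F

def total_loss(det_terms, da_logits, da_target, ll_logits, ll_target, ll_iou,
               alphas=(1.0, 1.0, 1.0), gammas=(1.0, 1.0, 1.0)):
    """Combine the three task losses as in Eqs. (1)-(4).

    det_terms: (l_class, l_obj, l_box), already computed with focal / CIoU losses.
    ll_iou:    scalar IoU loss for the lane-line branch.
    """
    a1, a2, a3 = alphas
    g1, g2, g3 = gammas

    # Eq. (1): detection loss as a weighted sum of its three parts.
    l_class, l_obj, l_box = det_terms
    l_det = a1 * l_class + a2 * l_obj + a3 * l_box

    # Eq. (2): drivable-area loss = cross entropy with logits.
    l_da = F.cross_entropy(da_logits, da_target)

    # Eq. (3): lane-line loss = cross entropy with logits + IoU loss.
    l_ll = F.cross_entropy(ll_logits, ll_target) + ll_iou

    # Eq. (4): weighted sum of the three parts.
    return g1 * l_det + g2 * l_da + g3 * l_ll
```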
D. Training Paradigm

We attempt different paradigms to train our model. The simplest one is training end to end, so that the three tasks are learned jointly. This training paradigm is useful when all tasks are indeed related. In addition, some alternating optimization algorithms have also been tried, which train our model step by step. In each step, the model can focus on one or several related tasks, regardless of the unrelated ones. Even if not all tasks are related, our model can still learn each task adequately with this paradigm. Algorithm 1 illustrates the process of one step-by-step training method.
Algorithm 1 One step-by-step training method. First, we only train the Encoder and the Detect head. Then we freeze the Encoder and the Detect head and train the two Segmentation heads. Finally, the entire network is trained jointly on all three tasks.

Input: Target neural network F with parameter group Θ = {θenc, θdet, θseg}; training set T; convergence threshold thr; loss function Lall.
Output: Well-trained network F(x; Θ).
 1: procedure TRAIN(F, T)
 2:   repeat
 3:     Sample a mini-batch (xs, ys) from the training set T.
 4:     ℓ ← Lall(F(xs; Θ), ys)
 5:     Θ ← arg minΘ ℓ
 6:   until ℓ < thr
 7: end procedure
 8: Θ ← Θ \ {θseg}   // Freeze the parameters of the two Segmentation heads.
 9: TRAIN(F, T)
10: Θ ← (Θ ∪ {θseg}) \ {θdet, θenc}   // Freeze the parameters of the Encoder and the Detect head and activate the parameters of the two Segmentation heads.
11: TRAIN(F, T)
12: Θ ← Θ ∪ {θdet, θenc}   // Activate all parameters of the neural network.
13: TRAIN(F, T)
14: return Trained network F(x; Θ)
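A minimal sketch of the freeze/activate logic behind Algorithm 1 is given below. It assumes a model object with backbone, neck, detect_head, drivable_head and lane_head attributes (our naming, matching the earlier skeleton) and a standard PyTorch data loader; the real training procedure contains considerably more machinery.

```python
import torch

def set_trainable(modules, flag):
    # Freeze or activate a group of sub-modules by toggling requires_grad.
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def train_until_converged(model, loader, loss_fn, thr, lr=1e-3):
    # Inner TRAIN procedure: optimise only the currently active parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss = float("inf")
    while loss > thr:
        for images, targets in loader:
            opt.zero_grad()
            loss_t = loss_fn(model(images), targets)
            loss_t.backward()
            opt.step()
            loss = loss_t.item()

def step_by_step(model, loader, loss_fn, thr):
    # Step 1: train Encoder + Detect head, Segmentation heads frozen.
    set_trainable([model.drivable_head, model.lane_head], False)
    train_until_converged(model, loader, loss_fn, thr)
    # Step 2: freeze Encoder + Detect head, train the two Segmentation heads.
    set_trainable([model.backbone, model.neck, model.detect_head], False)
    set_trainable([model.drivable_head, model.lane_head], True)
    train_until_converged(model, loader, loss_fn, thr)
    # Step 3: activate everything and train the whole network jointly.
    set_trainable([model.backbone, model.neck, model.detect_head], True)
    train_until_converged(model, loader, loss_fn, thr)
```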
IV. EXPERIMENTS

A. Setting

1) Dataset Setting: The BDD100K dataset [11] supports research on multi-task learning in the field of autonomous driving. With 100K frames of pictures and annotations for 10 tasks, it is the largest driving video dataset. Since the dataset is diverse in geography, environment and weather, an algorithm trained on BDD100K is robust enough to migrate to a new environment. Therefore, we choose the BDD100K dataset to train and evaluate our network. BDD100K has three parts: a training set with 70K images, a validation set with 10K images and a test set with 20K images. Since the labels of the test set are not public, we evaluate our network on the validation set.

2) Implementation Details: In order to enhance the performance of our model, we empirically adopt some practical techniques and methods of data augmentation.

To give our detector more prior knowledge of the objects in the traffic scene, we use the k-means clustering algorithm to obtain prior anchors from all the detection frames of the dataset. We use Adam as the optimizer to train our model, and the initial learning rate, β1 and β2 are set to 0.001, 0.937 and 0.999, respectively. Warm-up and cosine annealing are used to adjust the learning rate during training, which helps the model converge faster and better [32].
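The snippet below sketches the two ingredients just described: k-means clustering of ground-truth box sizes into prior anchors, and a learning-rate schedule with linear warm-up followed by cosine annealing on top of Adam. The anchor count, warm-up length and epoch budget are illustrative assumptions rather than the paper's exact configuration.

```python
import math
import numpy as np
import torch
from sklearn.cluster import KMeans

def kmeans_anchors(box_wh, n_anchors=9):
    """Cluster (width, height) pairs of all ground-truth boxes into prior anchors."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(np.asarray(box_wh))
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort anchors by area

def build_optimizer_and_scheduler(model, total_epochs=240, warmup_epochs=3):
    """Adam with lr=0.001, betas=(0.937, 0.999); linear warm-up then cosine annealing."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.937, 0.999))

    def lr_lambda(epoch):
        if epoch < warmup_epochs:  # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    return opt, sched
```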
We use data augmentation to increase the variability of the images and make our model robust in different environments. Both photometric and geometric distortions are considered in our training scheme. For photometric distortions, we adjust the hue, saturation and value of the images. For geometric distortions, we process the images with random rotation, scaling, translation, shearing and left-right flipping.
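A compact torchvision-style sketch of such an augmentation pipeline follows; the distortion ranges are arbitrary placeholders, and a real multi-task pipeline would also need to transform boxes and masks consistently with the image, which is omitted here.

```python
import torchvision.transforms as T

# Photometric distortions: jitter hue, saturation and value (brightness).
photometric = T.ColorJitter(brightness=0.4, saturation=0.4, hue=0.1)

# Geometric distortions: random rotation, scaling, translation, shearing and flipping.
geometric = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.75, 1.25), shear=5),
    T.RandomHorizontalFlip(p=0.5),
])

augment = T.Compose([photometric, geometric])
```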
3) Experimental Setting: We select some excellent multi-task networks and some networks that focus on a single task to compare with our network. Both MultiNet and DLT-Net handle multiple panoptic driving perception tasks, and they have achieved great performance in the object detection and drivable area segmentation tasks on the BDD100K dataset. Faster R-CNN is an outstanding representative of two-stage object detection networks. YOLOv5 is a single-stage network that achieves state-of-the-art performance on the COCO dataset. PSPNet achieves splendid performance on the semantic segmentation task with its superior ability to aggregate global information. We retrain the above networks on the BDD100K dataset and compare them with our network on the object detection and drivable area segmentation tasks. Since there is no suitable existing multi-task network that handles the lane detection task on the BDD100K dataset, we compare our network with ENet [33], SCNN and ENet-SAD, three advanced lane detection networks. Besides, the performance of the joint training paradigm is compared with several kinds of alternating training paradigms. Moreover, we compare the accuracy and speed of our multi-task model, trained to handle multiple tasks, with those of the same model trained to perform a single task. Following [6], we resize the images in the BDD100K dataset from 1280×720×3 to 640×384×3. All control experiments follow the same experimental settings and evaluation metrics, and all experiments are run on an NVIDIA GTX TITAN XP.

B. Result

In this section, we simply train our model end to end and then compare it with other representative models on all three tasks.

1) Traffic Object Detection Result: Since MultiNet and DLT-Net can only detect vehicles, we only consider the vehicle detection results of the five models on the BDD100K dataset. As shown in Table I, we use Recall and mAP50 as the evaluation metrics of detection accuracy. Our model exceeds Faster R-CNN, MultiNet and DLT-Net in detection accuracy and is comparable to YOLOv5s, which actually uses more tricks than ours. Moreover, our model can infer in real time. YOLOv5s is faster than ours because it does not have the lane line segmentation head and the drivable area segmentation head. Visualizations of the traffic object detection results are shown in Figure 3.

TABLE I
TRAFFIC OBJECT DETECTION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART DETECTORS.

Network        Recall (%)   mAP50 (%)   Speed (fps)
MultiNet       81.3         60.2        8.6
DLT-Net        89.4         68.4        9.3
Faster R-CNN   77.2         55.6        5.3
YOLOv5s        86.8         77.2        82
YOLOP (ours)   89.2         76.5        41
Fig. 3. Visualization of the traffic object detection results of YOLOP. Top Row: Traffic object detection results in day-time scenes. Bottom row: Traffic object detection results in night scenes.
Fig. 4. Visualization of the drivable area segmentation results of YOLOP. Top Row: Drivable area segmentation results in day-time scenes. Bottom row:
Drivable area segmentation results in night scenes.
2) Drivable Area Segmentation Result: In this paper, both the "area/drivable" and "area/alternative" classes of the BDD100K dataset are categorized as "drivable area" without distinction. Our model only needs to distinguish the drivable area from the background in the image. mIoU is used to evaluate the segmentation performance of the different models. The results are shown in Table II. It can be seen that our model outperforms MultiNet, DLT-Net and PSPNet by 19.9%, 20.2% and 1.9%, respectively. Furthermore, our inference speed is 4 to 5 times faster than theirs. Visualization results of the drivable area segmentation can be seen in Figure 4.

TABLE II
DRIVABLE AREA SEGMENTATION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART DRIVABLE AREA SEGMENTATION OR SEMANTIC SEGMENTATION METHODS.

Network        mIoU (%)   Speed (fps)
MultiNet       71.6       8.6
DLT-Net        71.3       9.3
PSPNet         89.6       11.1
YOLOP (ours)   91.5       41
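For concreteness, the class merging and mIoU evaluation described above can be sketched as follows in numpy. The label encoding (0 for background, non-zero for the two drivable classes) is an assumption about how the labels are prepared, not a statement about the official BDD100K format.

```python
import numpy as np

def to_binary_drivable(label_map):
    # Merge "area/drivable" and "area/alternative" into a single foreground class.
    return (label_map > 0).astype(np.uint8)

def miou_binary(pred, gt):
    """Mean IoU over the two classes (background, drivable area)."""
    ious = []
    for cls in (0, 1):
        p, g = pred == cls, gt == cls
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```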
3) Lane Detection Result: The lane lines in the BDD100K dataset are labeled with two lines each, so it is very tricky to use the annotations directly. The experimental settings follow [6] to allow a convenient comparison. First of all, we calculate the center lines from the two-line annotations. Then we draw the lane lines of the training set with a width of 8 pixels, while keeping the lane line width of the test set at 2 pixels. We use pixel accuracy and the IoU of lanes as evaluation metrics. As shown in Table III, the performance of our model dramatically exceeds that of the other three models. The visualization results of lane detection can be seen in Figure 5.
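The sketch below spells out one straightforward reading of these two metrics on binary lane masks; it is our own illustrative implementation, not the evaluation code used for Table III.

```python
import numpy as np

def lane_metrics(pred, gt):
    """Pixel accuracy and IoU of the lane class for binary masks (1 = lane pixel)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    accuracy = (pred == gt).mean()            # fraction of correctly classified pixels
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return float(accuracy), float(iou)
```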
Fig. 5. Visualization of the lane detection results of YOLOP. Top Row: Lane detection results in day-time scenes. Bottom row: Lane detection results in
night scenes.
TABLE III
LANE DETECTION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART LANE DETECTION METHODS.

Network        Accuracy (%)   IoU (%)
ENet           34.12          14.64
SCNN           35.79          15.84
ENet-SAD       36.56          16.02
YOLOP (ours)   70.50          26.20

C. Ablation Studies

We design the following two ablation experiments to further illustrate the effectiveness of our scheme. All the evaluation metrics in this section are consistent with those above.

1) End-to-end v.s. Step-by-step: In Table IV, we compare the performance of the joint training paradigm with alternating training paradigms of many kinds.¹ Obviously, our model already performs very well with end-to-end training, so there is no need to perform alternating optimization. However, it is interesting that the paradigms that train the detection task first seem to perform better. We think this is mainly because our model is closer to a complete detection model, and the model is harder to converge when performing the detection task. What is more, the paradigms consisting of three steps slightly outperform those with two steps. Similar alternating training can be run for more steps, but we have observed negligible improvements.

¹ E, D, S and W refer to the Encoder, the Detect head, the two Segment heads and the whole network. So Algorithm 1 can be marked as ED-S-W, and the same for the others.

TABLE IV
PANOPTIC DRIVING PERCEPTION RESULTS: THE END-TO-END SCHEME V.S. DIFFERENT STEP-BY-STEP SCHEMES.

2) Multi-task v.s. Single task: To verify the effectiveness of our multi-task learning scheme, we compare the performance of the multi-task scheme and the single-task scheme. On the one hand, we train our model to perform the three tasks simultaneously. On the other hand, we train our model to perform the traffic object detection, drivable area segmentation and lane line segmentation tasks separately. Table V shows the comparison of the performance of these two schemes on each specific task. It can be seen that the performance our model achieves with the multi-task scheme is close to that obtained by focusing on a single task. More importantly, the multi-task model can save a lot of time compared with executing each task individually.

TABLE V
PANOPTIC DRIVING PERCEPTION RESULTS: MULTI-TASK LEARNING V.S. SINGLE-TASK LEARNING.

V. CONCLUSION

In this paper, we put forward a simple and efficient network that can simultaneously handle the three driving perception tasks of object detection, drivable area segmentation and lane detection, and that can be trained end to end. Our model performs exceptionally well on the challenging BDD100K dataset, achieving or greatly exceeding the state of the art on all three tasks. It can also perform real-time inference on the embedded device Jetson TX2, which ensures that our network can be used in real-world scenarios.
REFERENCES

[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[2] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International Publishing, 2015, pp. 234–241.
[4] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, "Spatial as deep: Spatial CNN for traffic scene understanding," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[6] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, "Learning lightweight lane detection CNNs by self attention distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1013–1021.
[7] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, "MultiNet: Real-time joint semantic reasoning for autonomous driving," arXiv preprint arXiv:1612.07695, 2016.
[8] Y. Qian, J. M. Dolan, and M. Yang, "DLT-Net: Joint detection of drivable areas, lane lines, and traffic objects," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 11, pp. 4670–4679, 2019.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," arXiv preprint arXiv:2011.08036, 2020.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[11] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, vol. 2, no. 5, p. 6, 2018.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[13] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[14] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[17] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[18] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[19] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[20] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[21] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: An instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 286–291.
[22] Z. Qin, H. Wang, and X. Li, "Ultra fast structure-aware deep lane detection," arXiv preprint arXiv:2004.11757, 2020.
[23] K. Duan, L. Xie, H. Qi, S. Bai, Q. Huang, and Q. Tian, "Location-sensitive visual recognition with cross-IOU loss," arXiv preprint arXiv:2104.04899, 2021.
[24] J. Zhang, Y. Xu, B. Ni, and Z. Duan, "Geometric constrained joint lane segmentation and lane boundary detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 486–502.
[25] Z. Kang, K. Grauman, and F. Sha, "Learning with whom to share in multi-task feature learning," in ICML, 2011.
[26] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[29] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[31] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12993–13000.
[32] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.
[33] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.

Dong Wu is an undergraduate senior student in the School of Electronics Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. His research interests include computer vision, machine learning and autonomous driving.