
Rethinking Efficient Lane Detection via Curve Modeling

Zhengyang Feng1*, Shaohua Guo1*, Xin Tan2,1, Ke Xu3, Min Wang4, Lizhuang Ma1,2,5†
1 Shanghai Jiao Tong University  2 East China Normal University  3 City University of Hong Kong
4 SenseTime Research  5 MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
[email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
arXiv:2203.02431v1 [cs.CV] 4 Mar 2022

Abstract

This paper presents a novel parametric curve-based method for lane detection in RGB images. Unlike state-of-the-art segmentation-based and point detection-based methods that typically require heuristics to either decode predictions or formulate a large sum of anchors, curve-based methods can learn holistic lane representations naturally. To handle the optimization difficulties of existing polynomial curve methods, we propose to exploit the parametric Bézier curve due to its ease of computation, stability, and high freedom degrees of transformations. In addition, we propose the deformable convolution-based feature flip fusion, for exploiting the symmetry properties of lanes in driving scenes. The proposed method achieves a new state-of-the-art performance on the popular LLAMAS benchmark. It also achieves favorable accuracy on the TuSimple and CULane datasets, while retaining both low latency (>150 FPS) and small model size (<10M). Our method can serve as a new baseline, to shed light on parametric curve modeling for lane detection. Code for our model and PytorchAutoDrive, a unified framework for self-driving perception, is available at: https://github.com/voldemortX/pytorch-auto-drive.

Figure 1. Lane detection strategies. Segmentation-based and point detection-based representations are local and indirect. The abstract coefficients (a, b, c, d) used in polynomial curves are hard to optimize. The cubic Bézier curve is defined by 4 actually existing control points, which roughly fit the line shape and wrap the lane line in its convex hull (dashed red lines). Best viewed in color.

1. Introduction

Lane detection is a fundamental task in autonomous driving systems, which supports the decision-making of lane-keeping, centering and changing, etc. Previous lane detection methods [2, 12] typically rely on expensive sensors such as LIDAR. Advanced by the rapid development of deep learning techniques, many works [19, 21, 22, 33, 41] are proposed to detect lane lines from RGB inputs captured by commercial front-mounted cameras.

Deep lane detection methods can be classified into three categories, i.e., segmentation-based, point detection-based, and curve-based methods (Figure 1). Among them, by relying on classic segmentation [5] and object detection [28] networks, the segmentation-based and point detection-based methods typically achieve state-of-the-art lane detection performance. The segmentation-based methods [21, 22, 41] exploit the foreground texture cues to segment the lane pixels and decode these pixels into line instances via heuristics. The point detection-based methods [15, 33, 39] typically adopt the R-CNN framework [9, 28], and detect lane lines by detecting a dense series of points (e.g., every 10 pixels in the vertical axis). Both kinds of approaches represent lane lines via indirect proxies (i.e., segmentation maps and points). To handle the learning of holistic lane lines, under cases of occlusions or adverse weather/illumination conditions, they have to rely on low-efficiency designs, such as recurrent feature aggregation (too heavy for this real-time task) [22, 41], or a large number of heuristic anchors (> 1000, which may be biased to dataset statistics) [33].

On the other hand, only a few methods [19, 32] have been proposed to model lane lines as holistic curves (typically polynomial curves, e.g., x = ay^3 + by^2 + cy + d).

* Equal Contribution.
† Lizhuang Ma is a member of Qing Yuan Research Institute, Shanghai Jiao Tong University.
While we expect the holistic curve to be a concise and elegant way to model the geometric properties of lane lines, the abstract polynomial coefficients are difficult to learn. Previous studies show that their performance lags behind the well-designed segmentation-based and point detection-based methods by a large margin (up to an 8% gap to state-of-the-art methods on the CULane [22] dataset). In this paper, we aim to answer the question of whether it is possible to build a state-of-the-art curve-based lane detector.

We observe that the classic cubic Bézier curve, with sufficient degrees of freedom to parameterize the deformations of lane lines in driving scenes, retains low computation complexity and high stability. This inspires us to model the thin and long geometric shape properties of lane lines via Bézier curves. The ease of optimization from on-image Bézier control points enables the network to be end-to-end learnable with the bipartite matching loss [38], using a sparse set of lane proposals from simple column-wise pooling (e.g., 50 proposals on the CULane dataset [22]), without any post-processing steps such as Non-Maximum Suppression (NMS), or hand-crafted heuristics such as anchors, hence leading to high speed and small model size. In addition, we observe that lane lines appear symmetrically from a front-mounted camera (e.g., between ego lane lines, or immediate left and right lanes). To model this global structure of driving scenes, we further propose the feature flip fusion, which aggregates the feature map with its horizontally flipped version to strengthen such co-existences. We base our design of feature flip fusion on the deformable convolution [42], for aligning the imperfect symmetries caused by, e.g., a rotated camera, lane changing, or non-paired lines. We conduct extensive experiments to analyze the properties of our method and show that it performs favorably against state-of-the-art lane detectors on three popular benchmark datasets. Our main contributions are summarized as follows:

• We propose a novel Bézier curve-based deep lane detector, which can model the geometric shapes of lane lines effectively, and be naturally robust to adverse driving conditions.

• We propose a novel deformable convolution-based feature flip fusion module, to exploit the symmetry property of lanes observed from front-view cameras.

• We show that our method is fast, light-weight, and accurate through extensive experiments on three popular lane detection datasets. Specifically, our method outperforms all existing methods on the LLAMAS benchmark [3], with the light-weight ResNet-34 backbone.

2. Related Work

Segmentation-based Lane Detection. These methods represent lanes as per-pixel segmentation. SCNN [22] formulates lane detection as multi-class semantic segmentation and is the basis of the 1st-place solution in the TuSimple challenge [1]. Its core spatial CNN module recurrently aggregates spatial information to complete the discontinuous segmentation predictions, which then requires heuristic post-processing to decode the segmentation map. Hence, it has a high latency, and only struggles to be real-time after an optimization by Zheng et al. [41]. Others explore knowledge distillation [13] or generative modeling [8], but their performance merely surpasses the seminal SCNN. Moreover, these methods typically assume a fixed number (e.g., 4) of lines. LaneNet [21] leverages an instance segmentation pipeline to deal with a variable number of lines, but it requires post-inference clustering to generate line instances. Some methods leverage row-wise classification [26, 40], which is a customized down-sampling of per-pixel segmentation, so they still require post-processing. Qin et al. [26] propose to trade performance for low latency, but their use of fully-connected layers results in a large model size.

In short, segmentation-based methods all require heavy post-processing due to the misalignment of representations. They also suffer from the locality of the segmentation task, so that they tend to perform worse under occlusions or extreme lighting conditions.

Point Detection-based Lane Detection. The success of object detection methods drives researchers to formulate lane detection as detecting lanes as a series of points (e.g., every 10 pixels in the vertical axis). Line-CNN [15] adapts the classic Faster R-CNN [28] as a one-stage lane line detector, but it has a low inference speed (<30 FPS). Later, LaneATT [33] adopts a more general one-stage detection approach that achieves superior performance.

However, these methods have to design heuristic lane anchors, which highly depend on dataset statistics, and require Non-Maximum Suppression (NMS) as post-processing. On the contrary, we represent lane lines as curves with a fully end-to-end pipeline (anchor-free, NMS-free).

Curve-based Lane Detection. The pioneering work [37] proposes a differentiable least squares fitting module to fit a polynomial curve (e.g., x = ay^3 + by^2 + cy + d) to points predicted by a deep neural network. PolyLaneNet [32] then directly learns to predict the polynomial coefficients with simple fully-connected layers. Recently, LSTR [19] uses transformer blocks to predict polynomials in an end-to-end fashion based on DETR [4].

A curve is a holistic representation of a lane line, which naturally eliminates occlusions, requires no post-processing, and can predict a variable number of lines. However, the performance of curve-based methods on large and challenging datasets (e.g., CULane [22] and LLAMAS [3]) still lags behind methods of other categories. They also suffer from slow convergence (over 2000 training epochs on TuSimple) and high-latency architectures (e.g., LSTR [19] uses transformer blocks, which are difficult to optimize for low latency).
We attribute their failure to the difficult-to-optimize, abstract polynomial coefficients. We propose to use the parametric Bézier curve, which is defined by actual control points on the image coordinate system¹, to address these problems.

¹ Control points of Bézier curves can actually lie outside the image, but statistically that rarely happens in autonomous driving scenes.

Bézier Curve in Deep Learning. To our knowledge, the only known successful application of Bézier curves in deep learning is ABCNet [20], which uses the cubic Bézier curve for text spotting. However, their method cannot be directly used for our task. First, it still uses NMS, so it cannot be end-to-end. We show in our work that NMS is not necessary, so that our method can be an end-to-end solution. Second, it calculates the L1 loss directly on the sparse Bézier control points, which results in optimization difficulties. We address this problem in our work by leveraging a fine-grained sampling loss. In addition, we propose the feature flip fusion module, which is specifically designed for the lane detection task.

3. BézierLaneNet

3.1. Overview

Preliminaries on Bézier Curves. The Bézier curve formulation is shown in Equation (1); it is a parametric curve defined by n + 1 control points:

B(t) = Σ_{i=0}^{n} b_{i,n}(t) P_i,  0 ≤ t ≤ 1,    (1)

where P_i is the i-th control point, and b_{i,n} are the Bernstein basis polynomials of degree n:

b_{i,n}(t) = C_n^i t^i (1 − t)^{n−i},  i = 0, ..., n.    (2)

We use the classic cubic Bézier curve (n = 3), which is empirically found sufficient for modeling lane lines. It shows better ground-truth fitting ability than the 3rd order polynomial (Table 1), which is the base function of previous curve-based methods [19, 32]. Higher-order curves do not bring substantial gains, while the higher degrees of freedom lead to instability. All point coordinates discussed here are relative to the image size (i.e., mostly in range [0, 1]).

n      Bézier   Polynomial
2nd    0.653    0.945
3rd    0.471    0.558
4th    0.315    0.330

Table 1. Comparison of n-order Bézier curves and polynomials (x = Σ_{i=0}^{n} a_i y^i) on the TuSimple [1] test set (lower is better). Since the official metrics are too loose to show any meaningful difference, we use the fine-grained LPD metric following [32].

The Proposed Architecture. The overall model architecture is shown in Figure 2. Specifically, we use the layer-3 feature of ResNets [11] as backbone following RESA [41], but we replace the dilation inside the backbone network by two dilated blocks outside, with dilation rates [4, 8] [6]. This strikes a better speed-accuracy trade-off for our method, and leaves a 16× down-sampled feature map with a larger receptive field. We then add the feature flip fusion module (Section 3.2) to aggregate opposite lane features. The enriched feature map (C × H/16 × W/16) is then pooled to (C × W/16) by average pooling, resulting in W/16 proposals (50 for CULane [22]). Two 1 × 3 1D convolutions are used to transform the pooled features, while also conveniently modeling interactions between nearby lane proposals, guiding the network to learn a substitute for the non-maximum suppression (NMS) function. Lastly, the final prediction is obtained by the classification and regression branches (each is only one 1 × 1 1D convolution). The outputs are W/16 × 8 for regression of 4 control points, and W/16 × 1 for the existence of a lane line object.

Figure 2. Pipeline. Features from a typical encoder (e.g., ResNet) are strengthened by feature flip fusion, then pooled to 1D, and two 1D convolution layers are applied. At last, the network predicts Bézier curves through a classification branch and a regression branch.

3.2. Feature Flip Fusion

By modeling lane lines as holistic curves, we focus on the geometric properties of individual lane lines (e.g., thin, long, and continuous). Now we consider the global structure of lanes from a front-mounted camera view in driving scenes. Roads have equally spaced lane lines, which appear symmetrical, and this property is worth modeling. For instance, the existence of the left ego lane line should very likely indicate its right counterpart, the structure of the immediate left lane could help describe the immediate right lane, etc.

To exploit this property, we fuse the feature map with its horizontally flipped version (Figure 3). Specifically, two separate convolution and normalization layers transform each feature map; they are then added together before a ReLU activation. With this module, we expect the model to base its predictions on both feature maps.

To account for the slight misalignment of camera-captured images (e.g., rotated camera, turning, non-paired lines), we apply deformable convolution [42] with kernel size 3 × 3 to the flipped feature map, while learning the offsets conditioned on the original feature map, for feature alignment.

Figure 3. Feature flip fusion. Alignment is achieved by calculating deformable convolution offsets, conditioned on both the flipped and original feature map. Best viewed in color.

We add an auxiliary binary segmentation branch (to segment lane line or non-lane line areas), which is removed after training, to the ResNet backbone, and we expect it to enforce the learning of spatial details. Interestingly, we find this auxiliary branch improves the performance only when it works together with the feature fusion. This is because the localization of the segmentation task may provide a more spatially accurate feature map, which in turn supports accurate fusion with the flipped features.

Visualizations are shown in Figure 4, from which we can see that the flipped feature does correct the error caused by the asymmetry introduced by the car (Figure 4(a)).

Figure 4. Grad-CAM [31] visualization on the last layer of the ResNet backbone. (a) Our model can infer the existence of an ill-marked lane line from clear markings and cars around the opposite line. Note that the car is deviated to the left; this scene was not captured with perfect symmetry. (b) When the entire road lacks clear markings, both sides are used for a better prediction. Best viewed in color.
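To make this module concrete, below is a minimal PyTorch-style sketch. The module and variable names are ours (not the repository's API), and it assumes torchvision's DeformConv2d for the 3 × 3 deformable convolution; consult the pytorch-auto-drive repository for the reference implementation.

```python
# Minimal sketch of feature flip fusion (Section 3.2); names are illustrative only.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class FeatureFlipFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Separate transform for the original (non-flipped) feature map.
        self.conv_orig = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels))
        # 3x3 deformable convolution for the flipped feature map; its offsets are
        # predicted from the original features, so the two views can be aligned
        # despite imperfect scene symmetry (rotated camera, lane change, etc.).
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.norm_flip = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); flip along the width axis.
        flipped = torch.flip(x, dims=(3,))
        offsets = self.offset(x)                      # conditioned on the original map
        aligned = self.norm_flip(self.deform(flipped, offsets))
        return self.relu(self.conv_orig(x) + aligned)
```

In the full model, the fused map is then average-pooled over the height dimension and processed by the two 1 × 3 1D convolutions and the two 1 × 1 prediction branches described in Section 3.1.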
3.3. End-to-end Fit of a Bézier Curve

Distances Between Bézier Curves. The key to learning Bézier curves is to define a good distance metric measuring the distance between the ground-truth curve and the prediction. Naively, one can directly calculate the mean L1 distance between Bézier curve control points, as in ABCNet [20]. However, as shown in Figure 5(a), a large L1 error in curvature control points can correspond to a very small visual distance between Bézier curves, especially for small or medium curvatures (which is often the case for lane lines). Since Bézier curves are parameterized by t ∈ [0, 1], we propose the more reasonable sampling loss for Bézier curves (Figure 5(b)), by sampling curves at a uniformly spaced set of t values (T), which means equal curve length between adjacent sample points. The t values can be further transformed by a re-parameterization function f(t). Specifically, given Bézier curves B(t), B̂(t), the sampling loss L_reg is:

L_reg = (1/n) Σ_{t ∈ T} || B(f(t)) − B̂(f(t)) ||_1,    (3)

where n is the total number of sampled points and is set to 100. We empirically find f(t) = t works well. This simple yet effective loss formulation makes our model easy to converge and less sensitive to the hyper-parameters that are typically involved in other curve-based or point detection-based methods, e.g., loss weighting for the endpoint loss [19] and line length loss [33] (see Figure 5(b,c)).

Bézier Ground Truth Generation. Now we introduce the generation of the Bézier curve ground truth. Since lane datasets are currently annotated by on-line key points, we need Bézier control points for the above sampling loss. Given the annotated points {(kx_i, ky_i)}_{i=1}^{m} on one lane line, where (kx_i, ky_i) denotes the 2D coordinates of the i-th point, our goal is to obtain control points {P_i(x_i, y_i)}. Similarly to [20], we use standard least squares fitting:

[ b_{0,n}(t_0)  ...  b_{n,n}(t_0) ]   [ P_0 ]   [ kx_0  ky_0 ]
[     ...       ...      ...      ] · [ ... ] = [  ...   ...  ]    (4)
[ b_{0,n}(t_m)  ...  b_{n,n}(t_m) ]   [ P_n ]   [ kx_m  ky_m ]

solved for the control points {P_i} in the least-squares sense, where {t_i}_{i=0}^{m} is uniformly sampled from 0 to 1. Different from [20], we do not restrict the ground truth to have the same endpoints as the original annotations, which leads to better quality labels.
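To make Equations (1), (2) and (4) concrete, below is a minimal NumPy sketch of Bernstein-basis sampling and least-squares control-point fitting; the function names are illustrative, not the repository's API. As noted in Appendix C, the basis matrix for a fixed set of t values can be pre-computed once, so sampling reduces to a single matrix multiplication.

```python
# Minimal sketch of cubic Bézier sampling (Eqs. 1-2), ground-truth fitting (Eq. 4)
# and the sampling loss (Eq. 3); names are illustrative only.
import numpy as np
from scipy.special import comb


def bernstein_matrix(t: np.ndarray, degree: int = 3) -> np.ndarray:
    """Rows: sample locations t; columns: Bernstein basis b_{i,n}(t)."""
    i = np.arange(degree + 1)
    return comb(degree, i) * t[:, None] ** i * (1.0 - t[:, None]) ** (degree - i)


def sample_bezier(control_points: np.ndarray, num_samples: int = 100) -> np.ndarray:
    """Evaluate B(t) at uniformly spaced t; control_points has shape (degree+1, 2)."""
    t = np.linspace(0.0, 1.0, num_samples)
    return bernstein_matrix(t, control_points.shape[0] - 1) @ control_points


def fit_bezier(keypoints: np.ndarray, degree: int = 3) -> np.ndarray:
    """Least-squares control points for annotated keypoints of shape (m, 2), Eq. (4)."""
    t = np.linspace(0.0, 1.0, len(keypoints))
    basis = bernstein_matrix(t, degree)                  # (m, degree+1)
    ctrl, *_ = np.linalg.lstsq(basis, keypoints, rcond=None)
    return ctrl                                          # (degree+1, 2)


def sampling_l1(ctrl_a: np.ndarray, ctrl_b: np.ndarray, num_samples: int = 100) -> float:
    """Mean L1 distance between matched sample points of two curves (Eq. 3)."""
    pa, pb = sample_bezier(ctrl_a, num_samples), sample_bezier(ctrl_b, num_samples)
    return float(np.abs(pa - pb).sum(axis=1).mean())
```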
Label and Prediction Matching. After obtaining the ground truth, during training we perform a one-to-one assignment between G labels and N predictions (G < N) using optimal bipartite matching, to attain a fully end-to-end pipeline. Following Wang et al. [38], we find a G-permutation of the N predictions π ∈ Π_G^N that formulates the best bipartite matching:

π̂ = arg max_{π ∈ Π_G^N} Σ_{i}^{G} Q_{i,π(i)},    (5)

Q_{i,π(i)} = p̂_{π(i)}^{1−α} · (1 − L1(b_i, b̂_{π(i)}))^{α},    (6)

where Q_{i,π(i)} ∈ [0, 1] represents the matching quality of the i-th label with the π(i)-th prediction, based on the L1 distance between the curves b_i, b̂_{π(i)} (sampling loss) and the class score p̂_{π(i)}. α is set to 0.8 by default. The above equations can be efficiently solved by the well-known Hungarian algorithm.

Wang et al. [38] also use a spatial prior that restricts the matched prediction to a spatial neighborhood of the label (object center distance, the centerness prior in FCOS [35]). However, since most lanes are long lines with a large slope, this centerness prior is not useful. See Appendix E for more investigations on matching priors.
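A minimal sketch of this assignment, using SciPy's Hungarian solver, is shown below; the array names are illustrative, and the curve distance matrix is assumed to be pre-computed with the sampling loss.

```python
# Minimal sketch of the bipartite label assignment in Eqs. (5)-(6).
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_labels(pred_scores: np.ndarray, l1_curve_dist: np.ndarray, alpha: float = 0.8):
    """pred_scores: (N,) existence probabilities p_hat; l1_curve_dist: (G, N)
    mean sampled L1 distance between each label and each prediction, in [0, 1]."""
    quality = pred_scores[None, :] ** (1 - alpha) * (1 - l1_curve_dist) ** alpha  # Q
    # The Hungarian solver minimizes cost, so negate the quality matrix (Eq. 5).
    label_idx, pred_idx = linear_sum_assignment(-quality)
    return label_idx, pred_idx   # one matched prediction per ground-truth label
```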
Figure 5. Lane loss functions. (a) The L1 distance of control points is not highly correlated with the actual distance between curves. (b) The proposed sampling loss is one unified distance metric by t-sampling. (c) Typical losses for polynomial regression [19]: at least 3 separate losses are required: y-sampling loss, y start point loss, y end point loss.

Overall Loss. Other than the Bézier curve sampling loss, there is also the classification loss L_cls for the lane object classification (existence) branch. Since the imbalance between positive and negative examples is not as severe in lane detection as in object detection, instead of the focal loss [16] we use the simple weighted binary cross-entropy loss:

L_cls = −(y log(p) + w(1 − y) log(1 − p)),    (7)

where w is the weighting for negative samples, which is set to 0.4 in all experiments. The loss L_seg for the binary segmentation branch (Section 3.2) takes the same form. The overall loss is a weighted sum of all three losses:

L = λ_1 L_reg + λ_2 L_cls + λ_3 L_seg,    (8)

where λ_1, λ_2, λ_3 are set to 1, 0.1, 0.75, respectively.

4. Experiments

4.1. Datasets

To evaluate the proposed method, we conduct experiments on three well-known datasets: TuSimple [1], CULane [22] and LLAMAS [3]. The TuSimple dataset was collected on highways with high-quality images, under fair weather conditions. The CULane dataset contains more complex urban driving scenarios, including shade, extreme illumination, and road congestion. LLAMAS is a newly formed large-scale dataset, and it is the only lane detection benchmark without public test set labels. Details of these datasets can be found in Table 2.

Dataset        Train   Val     Test    Resolution   #Lines
TuSimple [1]   3268    358     2782    720 × 1280   ≤ 5
CULane [22]    88880   9675    34680   590 × 1640   ≤ 4
LLAMAS [3]     58269   20844   20929   717 × 1276   ≤ 4*

Table 2. Details of datasets. *The number of lines in the LLAMAS dataset is more than 4, but the official metric only evaluates 4 lines.

4.2. Evaluation Metrics

For CULane [22] and LLAMAS [3], the official metric is the F1 score from [22]:

F1 = (2 · Precision · Recall) / (Precision + Recall),    (9)

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN). Lines are assumed to be 30 pixels wide, and prediction and ground-truth lines with pixel IoU over 0.5 are considered a match.

For the TuSimple [1] dataset, the official metrics include Accuracy, false positive rate (FPR), and false negative rate (FNR). Accuracy is computed as N_pred / N_gt, where N_pred is the number of correctly predicted on-line points and N_gt is the number of ground-truth on-line points.

4.3. Implementation Details

Fair Comparison. To fairly compare among different state-of-the-art methods, we re-implement representative methods [19, 22, 41] in a unified PyTorch framework. We also provide a semantic segmentation baseline [5] originally proposed in [22]. All our implementations do not use the val set in training, and tune hyper-parameters only on the val set. Some methods with reliable open-source code are reported from their own code [26, 32, 33]. For the platform-sensitive metric Frames-Per-Second (FPS), we re-evaluate all reported methods on the same RTX 2080 Ti platform. More details on implementations and FPS tests are in Appendices A to C.

Training. We train 400, 36, 20 epochs for TuSimple, CULane, and LLAMAS, respectively (training of our model takes only 12 GPU hours on a single RTX 2080 Ti), and the input resolution is 288×800 for CULane [22] and 360×640 for the others, following common practice. Other than these, all hyper-parameters are tuned on the CULane [22] val set and remain the same for our method across datasets. We use the Adam optimizer with learning rate 6 × 10−4, weight decay 1 × 10−4, batch size 20, and a Cosine Annealing learning rate schedule as in [33]. Data augmentation includes random affine transforms, random horizontal flip, and color jitter.
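For reference, the overall objective from Equations (3), (7) and (8), with the loss weights listed above, can be sketched as follows. This is a simplified illustration with our own tensor names and shapes, not the actual training code.

```python
# Minimal sketch of the total training objective (Eqs. 3, 7, 8).
import torch
import torch.nn.functional as F


def total_loss(sampled_pred, sampled_gt, exist_logits, exist_target,
               seg_logits, seg_target, w_neg: float = 0.4, lambdas=(1.0, 0.1, 0.75)):
    # L_reg: mean L1 distance over matched curve sample points (Eq. 3).
    l_reg = (sampled_pred - sampled_gt).abs().sum(dim=-1).mean()
    # L_cls / L_seg: binary cross-entropy with down-weighted negatives (Eq. 7).
    cls_w = torch.where(exist_target > 0.5, torch.ones_like(exist_target),
                        torch.full_like(exist_target, w_neg))
    l_cls = F.binary_cross_entropy_with_logits(exist_logits, exist_target, weight=cls_w)
    seg_w = torch.where(seg_target > 0.5, torch.ones_like(seg_target),
                        torch.full_like(seg_target, w_neg))
    l_seg = F.binary_cross_entropy_with_logits(seg_logits, seg_target, weight=seg_w)
    # Weighted sum of all three losses (Eq. 8).
    return lambdas[0] * l_reg + lambdas[1] * l_cls + lambdas[2] * l_seg
```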
CULane [22] | TuSimple [1]
Method | Ep. | Total | Normal | Crowd | Night | No line | Shadow | Arrow | Dazzle light | Curve | Cross ↓ | train+val | Ep. | Acc. | FPR ↓ | FNR ↓
Segmentation-based
Baseline (ResNet-18)* 12 65.30 85.45 62.63 61.04 33.88 51.72 78.15 53.05 59.70 1915 50 94.25 0.088 0.089
Baseline (ResNet-34)* 12 69.92 89.46 66.66 65.38 40.43 62.17 83.18 58.51 63.00 1713 50 95.31 0.064 0.062
Baseline (ResNet-101)* 12 71.37 90.11 67.89 67.01 43.10 70.56 85.09 61.77 65.47 1883 50 95.19 0.062 0.062
SCNN (ResNet-18) [22]* 12 72.19 90.98 70.17 66.54 43.12 66.31 85.62 62.20 65.58 1808 50 94.77 0.075 0.074
SCNN (ResNet-34) [22]* 12 72.70 91.06 70.41 67.75 44.64 68.98 86.50 61.57 65.75 2017 50 95.25 0.063 0.063
SCNN (ResNet-101) [22]* 12 73.58 91.10 71.43 68.53 46.39 72.61 86.87 61.95 67.01 1720 50 95.69 0.052 0.050
UFLD (ResNet-18) [26]** 50 68.4 87.7 66.0 62.1 40.2 62.8 81.0 58.4 57.9 1743 - − − − −
UFLD (ResNet-34) [26]** 50 72.3 90.7 70.2 66.7 44.4 69.3 85.7 59.5 69.5 2037 - − − − −
RESA (ResNet-18) [41]* 12 72.90 91.23 70.57 67.16 45.24 68.01 86.56 64.32 66.19 1679 50 95.24 0.069 0.057
RESA (ResNet-34) [41]* 12 73.66 91.31 71.80 67.54 46.57 72.74 86.94 64.46 67.31 1701 50 95.15 0.069 0.059
RESA (ResNet-101) [41]* 12 74.04 91.45 71.51 69.01 46.54 75.83 87.75 63.90 68.24 1522 50 95.56 0.058 0.051
Point detection-based
FastDraw (ResNet-18) [25] − − − − − − − − − − − X 7 94.9 0.061 0.047
CurveLanes-NAS-S [39] 12 71.4 88.3 68.6 66.2 47.9 68.0 82.5 63.2 66.0 2817 - − − − −
CurveLanes-NAS-M [39] 12 73.5 90.2 70.5 68.2 48.8 69.3 85.7 65.9 67.5 2359 - − − − −
CurveLanes-NAS-L [39] 12 74.8 90.7 72.3 68.9 49.4 70.1 85.8 67.7 68.4 1746 - − − − −
LaneATT (ResNet-18) [33]** 15 74.88 90.98 72.78 68.61 48.23 69.68 85.44 65.43 63.18 1163 X 100 95.57 0.036 0.030
LaneATT (ResNet-34) [33]** 15 76.42 91.94 74.76 70.32 49.17 77.68 88.14 65.92 68.07 1323 X 100 95.63 0.035 0.029
LaneATT (ResNet-122) [33]** 15 76.79 91.50 76.04 70.43 50.29 75.96 86.16 68.99 63.99 1265 X 100 96.10 0.056 0.022
Curve-based
PolyLaneNet (EfficientNet-B0) [32]** − − − − − − − − − − − X 2695 93.36 0.094 0.093
LSTR (ResNet-18, 1×) [19]* − − − − − − − − − − − 2000 95.06 0.049 0.042
LSTR (ResNet-18, 2×) [19]* 150 68.72 86.78 67.34 59.92 40.10 59.82 78.66 56.63 56.64 1166 - − − − −
BézierLaneNet (ResNet-18) 36 73.67 90.22 71.55 68.70 45.30 70.91 84.09 62.49 58.98 996 400 95.41 0.053 0.046
BézierLaneNet (ResNet-34) 36 75.57 91.59 73.20 69.90 48.05 76.74 87.16 69.20 62.45 888 400 95.65 0.051 0.039

Table 3. Results on test set of CULane [22] and TuSimple [1]. *reproduced results in our code framework, best performance from three
random runs. **reported from reliable open-source codes from the authors.

Testing. No post-processing is required for curve methods. Standard Gaussian blur and row selection post-processing is applied to segmentation methods. NMS is used for LaneATT [33], while we remove its post-inference B-Spline interpolation on CULane [22], to align with our framework.

4.4. Comparisons

Overview. Experimental results are shown in Tables 3 and 4. TuSimple [1] is a small dataset that features clear-weather highway scenes and has a relatively easy metric; most methods thrive on this dataset. Thus, we mainly focus on the other two large-scale datasets [3, 22], where there is still a rather clear difference between methods. For high-performance methods (> 70% F1 on CULane [22]), we also show efficiency metrics (FPS, parameter count) in Table 5.

Comparison with Curve-based Methods. As shown in Tables 3 and 4, on all datasets BézierLaneNet outperforms previous curve-based methods [19, 32] by a clear margin, advancing the state-of-the-art of curve-based methods by 6.85% on CULane [22] and 6.77% on LLAMAS [3]. Thanks to our fully convolutional and fully end-to-end pipeline, BézierLaneNet runs over 2× faster than LSTR [19]. LSTR has a speed bottleneck from its transformer architecture: the 1× and 2× models have FPS 98 and 97, respectively². While curves are difficult to learn, our method converges 4-5× faster than LSTR. For the first time, an elegant curve-based method can challenge well-designed segmentation methods or point detection methods on these datasets, while showing a favorable trade-off, with an acceptable convergence time.

² The original 420 FPS reported by the LSTR paper [19] is throughput with batch size 16; detailed discussions are in Appendix A.

Comparison with Segmentation-based Methods. These methods tend to have a low speed due to recurrent feature aggregation [22, 41] and the use of high-resolution feature maps [5, 22, 41]. BézierLaneNet outperforms them in both speed and accuracy. Our small models even compare favorably against RESA [41] and SCNN [22] with the large ResNet-101 backbone, surpassing them on CULane [22] with a clear margin (1 ∼ 2%). On LLAMAS [3], where the dataset restricts testing to 4 center lines, the segmentation approach shows strong performance (Table 4). Nevertheless, our ResNet-34 model still outperforms SCNN by 0.92%.

UFLD [26] reformulates segmentation as row-wise classification on a down-sampled feature map to achieve fast speed, at the cost of accuracy. Compared to us, UFLD (ResNet-34) is 0.9% lower on CULane Normal, while 7.4%, 3.0%, 3.2% worse on Shadow, Crowd, Night, respectively. Overall, our method with the same backbones outperforms UFLD by 3 ∼ 5%, while being faster on ResNet-34. Besides, UFLD uses large fully-connected layers to optimize latency, which causes a huge model size (the largest in Table 5).

A drawback of all segmentation methods is their weaker performance on Dazzle Light. Per-pixel (or per-pixel-grid for UFLD [26]) segmentation methods may rely on information from local textures, which is destroyed by extreme exposure to light. Our method, in contrast, predicts lane lines as holistic curves, and is hence robust to changes in local textures.

Comparison with Point Detection-based Methods. Xu et al. [39] find a series of point detection-based models with
neural architecture search techniques, called CurveLanes-NAS. Despite its complex pipeline and extensive architecture search for the best accuracy-FLOPs trade-off, our simple ResNet-34 backbone model (29.9 GFLOPs) still surpasses its large model (86.5 GFLOPs) by 0.8% on CULane. CurveLanes-NAS also performs worse under occlusions, a similar drawback to the segmentation methods without recurrent feature fusion [5, 26]. As shown in Table 3, with similar model capacity to our ResNet-34 model, CurveLanes-NAS-M (35.7 GFLOPs) is 1.4% worse on Normal scenes, but the gaps on Shadow and Crowd are 7.4% and 2.7%.

Method                                  Ep.   F1      Precision   Recall
Segmentation-based
Baseline (ResNet-34)*                   18    93.43   92.61       94.27
SCNN (ResNet-34) [22]*                  18    94.25   94.11       94.39
Point detection-based
LaneATT (ResNet-18) [33]**              15    93.46   96.92       90.24
LaneATT (ResNet-34) [33]**              15    93.74   96.79       90.88
LaneATT (ResNet-122) [33]**             15    93.54   96.82       90.47
Curve-based
PolyLaneNet (EfficientNet-B0) [32]**    75    88.40   88.87       87.93
BézierLaneNet (ResNet-18)               20    94.91   95.71       94.13
BézierLaneNet (ResNet-34)               20    95.17   95.89       94.46

Table 4. Results from the LLAMAS [3] test server.

Method                          FPS ↑   Params (M) ↓
Segmentation-based (ignoring post-processing time)
Baseline (ResNet-101)           27      43.56
SCNN (ResNet-18) [22]           21      12.63
SCNN (ResNet-34) [22]           21      22.74
SCNN (ResNet-101) [22]          14      44.15
UFLD (ResNet-34) [26]           144     71.58
RESA (ResNet-18) [41]           68      6.61
RESA (ResNet-34) [41]           54      11.99
RESA (ResNet-101) [41]          25      31.46
Point detection-based (ignoring NMS time on real images)
LaneATT (ResNet-18) [33]        165     12.02
LaneATT (ResNet-34) [33]        117     22.13
LaneATT (ResNet-122) [33]       26      8.55
Curve-based (entirely end-to-end)
BézierLaneNet (ResNet-18)       213     4.10
BézierLaneNet (ResNet-34)       150     9.49

Table 5. FPS (image/s) and model size. All FPS results are tested with 360 × 640 random inputs on the same platform. Only models with > 70% CULane [22] F1 score are shown.

Recently, LaneATT [33] achieves higher performance with a point detection network. However, its design is not fully end-to-end (it requires Non-Maximum Suppression (NMS)) and is based on heuristic anchors (>1000), which are calculated directly from the dataset's statistics and thus may systematically pose difficulties in generalization. Still, with ResNet-34, our method outperforms LaneATT on the LLAMAS [3] test server (1.43%), with a significantly higher recall (3.58%). We also achieve comparable performance to LaneATT on TuSimple [1] using only the train set, and we are only ∼1% worse on CULane. Our method performs significantly better in Dazzle Light (3.3% better) and comparably at Night (0.4% lower). It also has a lower False Positive (FP) rate on Crossroad scenes (Cross), even though LaneATT shows an extremely low-FP characteristic (large Precision-Recall gap in Table 4). Methods that rely on heuristic anchors [33] or a heuristic decoding process [22, 26, 39, 41] tend to have more false predictions in this scene. Moreover, NMS is a sequential process that could have unstable runtime in real-world applications. Even when NMS was not evaluated on real inputs, our models are 29%, 28% faster and have 2.9×, 2.3× fewer parameters, compared to LaneATT on ResNet-18 and ResNet-34 backbones, respectively.

To summarize, previous curve-based methods (PolyLaneNet [32], LSTR [19]) have significantly worse performance. Fast methods trade either accuracy (UFLD [26]) or model size (UFLD [26], LaneATT [33]) for speed. Accurate methods either discard the end-to-end pipeline (LaneATT [33]) or entirely fail the real-time requirement (SCNN [22], RESA [41]). Our BézierLaneNet is fully end-to-end, fast (>150 FPS), light-weight (<10 million parameters), and maintains consistently high accuracy across datasets.

4.5. Analysis

Although we develop our method by tuning on the val set, we re-run ablation studies with the ResNet-34 backbone (including our full method) and report performance on the CULane test set for clear comparison.

Curve representation                   F1
Cubic Bézier curve baseline            68.89
3rd Polynomial baseline                1.49
BézierLaneNet                          75.41
3rd Polynomial from BézierLaneNet      5.01

Table 6. Curve representations. Baselines directly predict curve coefficients without feature flip fusion.

Importance of the Parametric Bézier Curve. We first replace the Bézier curve prediction with a 3rd order polynomial, adding auxiliary losses for start and end points. As shown in Table 6, polynomials catastrophically fail to converge in our fully convolutional network, even when trained for 150 epochs (details in Appendix B.8). We then consider modifying LSTR [19] to predict cubic Bézier curves; the performance is similar to predicting polynomials. We conclude that a heavy MLP may be necessary to learn polynomials [19, 32], while predicting Bézier control points from a position-aware CNN is the best choice. The transformer-based LSTR decoder destroys the fine spatial information and suppresses the advancement of the curve function.

Feature Flip Fusion Design. As shown in Table 7, feature flip fusion brings a 4.07% improvement. We also find that the auxiliary segmentation loss can regularize and increase
the performance further, by 2.45%. It is worth noting that the auxiliary loss only works with feature fusion; it can lead to degenerated results when directly applied on the baseline (−3.07%). A standard 3 × 3 convolution performs worse than the deformable convolution, by 2.68% and 1.44%, before and after adding the auxiliary segmentation loss, respectively. We attribute this to the effects of feature alignment.

Bézier Curve Fitting Loss. As shown in Table 7, replacing the sampling loss by a direct loss on control points leads to inferior performance (−5.15% in the baseline setup). Inspired by the success of the IoU loss in object detection, we also implemented an IoU loss (formulas in Appendix D) for the convex hull of the Bézier control points. However, the convex hull of close-to-straight lane lines is too small; the IoU loss is numerically unstable and thus fails to facilitate the sampling loss.

CP   SP   Flip   Deform   Seg   F1
X                               63.74
     X                          68.89
     X                     X    65.82
     X    X                     70.28
     X    X      X              72.96
     X    X                X    73.97
     X    X      X         X    75.41

Table 7. Ablations. CP: control point loss [20]. SP: the proposed sampling loss. Flip: the feature flip fusion module. Deform: employ the deformable convolution in feature flip fusion. Seg: auxiliary segmentation loss.

Model                           Aug   F1
LSTR (ResNet-18, 2×) [19]       X     68.72
LSTR (ResNet-18, 2×) [19]             39.77 (−28.95)
BézierLaneNet (ResNet-34)       X     75.41
BézierLaneNet (ResNet-34)             55.11 (−20.30)

Table 8. Augmentation ablations. Aug: strong data augmentation.

Importance of Strong Data Augmentation. Strong data augmentation is defined by a series of affine transforms and color distortions; the exact policy may slightly vary for different methods. For instance, we use random affine transform, random horizontal flip, and color jitter. LSTR [19] also uses random lighting. Default augmentation includes only a small rotation (3 degrees). As shown in Table 8, strong augmentation is essential to avoid over-fitting for curve-based methods.

For segmentation-based methods [5, 22, 41], we quickly validated strong augmentation on the smaller TuSimple [1] dataset. All show a 1 ∼ 2% degradation. This suggests that they may be robust due to per-pixel prediction and heuristic post-processing, but they rely highly on learning the distribution of local features such as texture, which could become confusing under strong augmentation.

4.6. Limitations and Discussions

Curves are indeed a natural representation of lane lines. However, their elegance in modeling inevitably brings a drawback. It is difficult for the curvature coefficients to generalize when the data distribution is highly biased (almost all lane lines are straight lines in CULane). Our Bézier curve approach has already alleviated this problem to some extent and has achieved an acceptable performance (62.45) on CULane Curve. On datasets such as TuSimple and LLAMAS [1, 3], where the curvature distribution is fair enough for learning, our method achieves even better performance. To handle broader corner cases, e.g., sharp turns, blockages and bad weather, datasets such as [30, 34, 39] may be useful.

The feature flip fusion is specifically designed for a front-mounted camera, which is the typical use case of deep lane detectors. Nevertheless, there is still a strong inductive bias in assuming scene symmetry. In future work, it would be interesting to find a replacement for this module, to achieve better generalization and to remove the deformable convolution operation, which poses difficulties for efficient integration into edge devices such as Jetson.

More discussions are in Appendix G.

5. Conclusions

In this paper, we have proposed BézierLaneNet: a novel fully end-to-end lane detector based on parametric Bézier curves. The on-image Bézier curves are easy to optimize and naturally model the continuous property of lane lines, without heavy designs such as recurrent feature aggregation or heuristic anchors. Besides, a feature flip fusion module is proposed. It efficiently models the symmetry property of the driving scene, while also being robust to slight asymmetries by using deformable convolution. The proposed model has achieved favorable performance on three datasets, defeating all existing methods on the popular LLAMAS benchmark. It is also both fast (>150 FPS) and light-weight (<10 million parameters).

Acknowledgements. This work has been sponsored by the National Key Research and Development Program of China (2019YFC1521104), National Natural Science Foundation of China (61972157, 72192821), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Science and Technology Commission (21511101200), Art major project of National Social Science Fund (I8ZD22), and a SenseTime Collaborative Research Grant. We thank Jiaping Qin for guidance on road design and geometry, Yuchen Gong and Pan Chen for helping with CAM visualizations, Zhijun Gong, Jiachen Xu and Jingyu Gong for insightful discussions about math, Fengqi Liu for providing GPUs, Lucas Tabelini for cooperation in evaluating [32, 33], and the CVPR reviewers for constructive comments.
References

[1] TuSimple benchmark. https://github.com/TuSimple/tusimple-benchmark, 2017. 2, 3, 5, 6, 7, 8, 12, 14, 15
[2] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M Paixao, Filipe Mutz, et al. Self-driving cars: A survey. Expert Systems with Applications, 2021. 1
[3] Karsten Behrendt and Ryan Soussan. Unsupervised labeled lane markers using maps. In ICCV, 2019. 2, 5, 6, 7, 8, 14, 15
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 2
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 1, 5, 6, 7, 8, 11
[6] Qiang Chen, Yingming Wang, Tong Yang, Xiangyu Zhang, Jian Cheng, and Jian Sun. You only look one-level feature. In CVPR, 2021. 3
[7] MOT Highway Department and Highway Engineering Committee under China Association for Engineering Construction Standardization. Technical Standard of Highway Engineering. 2004. 14
[8] Mohsen Ghafoorian, Cedric Nugteren, Nóra Baka, Olaf Booij, and Michael Hofmann. El-gan: Embedding loss driven generative adversarial networks for lane detection. In ECCV Workshops, 2018. 2
[9] Ross Girshick. Fast r-cnn. In ICCV, 2015. 1
[10] Alexandru Gurghian, Tejaswi Koduri, Smita V Bailur, Kyle J Carey, and Vidya N Murali. Deeplanes: End-to-end lane position estimation using deep neural networks. In CVPR Workshops, 2016. 14
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 3
[12] Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. Recent progress in road and lane detection: a survey. MVA, 2014. 1
[13] Yuenan Hou, Zheng Ma, Chunxiao Liu, and Chen Change Loy. Learning lightweight lane detection cnns by self attention distillation. In ICCV, 2019. 2
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 12
[15] Xiang Li, Jun Li, Xiaolin Hu, and Jian Yang. Line-cnn: End-to-end traffic line detection with line proposal unit. ITS, 2019. 1, 2
[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017. 5
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014. 12
[18] Lizhe Liu, Xiaohao Chen, Siyu Zhu, and Ping Tan. Condlanenet: a top-to-down lane detection framework based on conditional convolution. In ICCV, 2021. 14
[19] Ruijin Liu, Zejian Yuan, Tie Liu, and Zhiliang Xiong. End-to-end lane shape prediction with transformers. In WACV, 2021. 1, 2, 3, 4, 5, 6, 7, 8, 11, 12
[20] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. Abcnet: Real-time scene text spotting with adaptive bezier-curve network. In CVPR, 2020. 3, 4, 8
[21] Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Towards end-to-end lane detection: an instance segmentation approach. In IV, 2018. 1, 2
[22] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Spatial as deep: Spatial cnn for traffic scene understanding. In AAAI, 2018. 1, 2, 3, 5, 6, 7, 8, 11, 14, 15
[23] Tim A Pastva. Bezier curve fitting. Technical report, Naval Postgraduate School, Monterey, CA, 1998. 13
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 11, 14
[25] Jonah Philion. Fastdraw: Addressing the long tail of lane detection by adapting a sequential prediction network. In CVPR, 2019. 6
[26] Zequn Qin, Huanyu Wang, and Xi Li. Ultra fast structure-aware deep lane detection. In ECCV, 2020. 2, 5, 6, 7, 11
[27] Zhan Qu, Huan Jin, Yang Zhou, Zhen Yang, and Wei Zhang. Focus on local: Detecting lane marker from bottom up via key point. In CVPR, 2021. 14
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NeurIPS, 2015. 1, 2
[29] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019. 14
[30] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. IJCV, 2018. 8
[31] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In CVPR, 2017. 4
[32] Lucas Tabelini, Rodrigo Berriel, Thiago M Paixao, Claudine Badue, Alberto F De Souza, and Thiago Oliveira-Santos. Polylanenet: Lane estimation via deep polynomial regression. In ICPR, 2020. 1, 2, 3, 5, 6, 7, 8, 12
[33] Lucas Tabelini, Rodrigo Berriel, Thiago M Paixao, Claudine Badue, Alberto F De Souza, and Thiago Oliveira-Santos. Keep your eyes on the lane: Real-time attention-guided lane detection. In CVPR, 2021. 1, 2, 4, 5, 6, 7, 8, 11, 12
[34] Xin Tan, Ke Xu, Ying Cao, Yiheng Zhang, Lizhuang Ma, and Rynson W. H. Lau. Night-time scene parsing with a large real dataset. TIP, 2021. 8
[35] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:
Fully convolutional one-stage object detection. In ICCV,
2019. 5
[36] Federal Highway Administration under United States De-
partment of Transportation. Standard Specifications for
Construction of Roads and Bridges on Federal Highway
Projects. 2014. 14
[37] Wouter Van Gansbeke, Bert De Brabandere, Davy Neven,
Marc Proesmans, and Luc Van Gool. End-to-end lane de-
tection through differentiable least-squares fitting. In ICCV
Workshops, 2019. 2
[38] Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian
Sun, and Nanning Zheng. End-to-end object detection with
fully convolutional network. In CVPR, 2021. 2, 4, 5
[39] Hang Xu, Shaoju Wang, Xinyue Cai, Wei Zhang, Xiaodan
Liang, and Zhenguo Li. Curvelane-nas: Unifying lane-
sensitive architecture search and adaptive point blending. In
ECCV, 2020. 1, 6, 7, 8
[40] Seungwoo Yoo, Hee Seok Lee, Heesoo Myeong, Sungrack
Yun, Hyoungwoo Park, Janghoon Cho, and Duck Hoon Kim.
End-to-end lane marker detection via row-wise classifica-
tion. In CVPR Workshops, 2020. 2
[41] Tu Zheng, Hao Fang, Yi Zhang, Wenjian Tang, Zheng Yang,
Haifeng Liu, and Deng Cai. Resa: Recurrent feature-shift
aggregator for lane detection. In AAAI, 2021. 1, 2, 3, 5, 6, 7,
8, 11
[42] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. De-
formable convnets v2: More deformable, better results. In
CVPR, 2019. 2, 4

Appendix Overview. The Appendix is organized as follows: Appendix A describes the FPS test protocol and environments; Appendix B introduces implementation details for each compared method (including ours in Appendix B.8); Appendix C provides implementation details for Bézier curves, including sampling, ground truth generation and transforms; Appendix D formulates the IoU loss for Bézier curves and discusses why it failed; Appendix E explores matching priors other than the centerness prior; Appendix F shows extra ablation studies on datasets other than CULane [22], to verify the generalization of feature flip fusion; Appendix G discusses limitations and recognizes new progress in the field; Appendix H presents qualitative results from our method, visualized on three datasets.

A. FPS Test Protocol

Let one Frames-Per-Second (FPS) test trial be the average runtime of 100 consecutive model inferences with the method's PyTorch [24] implementation, without calculating gradients. The input is a 3×360×640 random Tensor (some use all ones [33], which has no impact on speed). Note that none of the methods use optimization from packages like TensorRT. We wait for all CUDA kernels to finish before counting the whole runtime. We use Python time.perf_counter() since it is more precise than time.time(). For all methods, the FPS is reported as the best result from 3 trials.

Before each test trial, at least 10 forward passes are conducted as warm-up of the device. For each new method to be tested, we keep running warm-up trials of a previously recorded method until its recorded FPS is reached again, so we can guarantee a similar peak machine condition as before.

Evaluation Environment. The evaluation platform is a 2080 Ti GPU (standard frequency), on an Intel Xeon-E3 CPU server, with CUDA 10.2, CuDNN 7.6.5, PyTorch 1.6.0. FPS is a platform-sensitive metric, depending on GPU frequency, condition, bus bandwidth, software versions, etc. Also using a 2080 Ti, Tabelini et al. [33] achieve a better peak performance for all methods. Thus we use the same platform for all FPS tests, to provide fair comparisons.

Remark. Note that FPS (image/s) is different from throughput (image/s), since FPS restricts the batch size to 1, which better simulates the real-time application scenario, while throughput considers a batch size of more than 1. LSTR [19] reported 420 FPS for its fastest model, which is actually throughput with batch size 16. Our re-tested FPS is 98.

B. Specifications for Compared Methods

B.1. Segmentation Baseline

The segmentation baseline is based on DeeplabV1 [5], originally proposed in SCNN [22]. It is essentially the original DeeplabV1 without CRF; lanes are considered as different classes, and a separate lane existence branch (a series of convolution, pooling and MLP layers) is used to facilitate lane post-processing. We optimized its training and testing scheme based on recent advances [41]. Re-implemented in our codebase, it attains higher performance than what recent papers usually report.

Post-processing. First, the existence of a lane is determined by the lane existence branch. Then, the predicted per-pixel probability map is interpolated to the input image size. After that, a 9 × 9 Gaussian blur is applied to smooth the predictions. Finally, for each existing lane class, the smoothed probability map is traversed at pre-defined Y coordinates (quantized), and the corresponding X coordinates are recorded at the maximum-probability position in each row (provided it passes a fixed threshold). Lanes with fewer than two qualified points are simply discarded.

Data Augmentation. We use a simple random rotation with small angles (3 degrees), then resize to the input resolution.

B.2. SCNN

Our SCNN [22] is re-implemented from the Torch7 version of the official code. Advised by the authors, we added an initialization trick for the spatial CNN layers, and learning rate warm-up, to prevent gradient explosion caused by recurrent feature aggregation. Thus, we can safely adjust the learning rate. Our improved SCNN achieves significantly better performance than the original one.

Some may find reports of 96.53 accuracy for SCNN on TuSimple. However, that was a competition entry trained with external data. We report SCNN with ResNet backbones, trained with the same data as the other re-implemented methods in our codebase.

Post-processing. Same as Appendix B.1.

Data Augmentation. Same as Appendix B.1.

B.3. RESA

Our RESA [41] is implemented based on its published paper. A main difference from the official code release is that we do not cut out no-lane areas (in each dataset, there is a certain height range for lane annotation). Because that trick is dataset-specific and not generalizable, we do not use it for any compared method. Other differences are all validated to have better performance than the official code, at least on the CULane val set.

Post-processing. Same as Appendix B.1.

Data Augmentation. Same as Appendix B.1. The original RESA paper [41] also applies random horizontal flip, which was found ineffective in our re-implementation.

B.4. UFLD

Ultra Fast Lane Detection (UFLD) [26] is reported from their paper and open-source code. Since TuSimple FP and
FN information is not in the paper, and training from source of LSTR on this dataset. With tuning of hyper-parameters
code leads to very high FP rate (almost 20%), we did not (learning rate, epochs, prediction threshold), bug fix (the
report their performance on this dataset. We adjusted its original classification branch has 3 output channels, which
profiling scripts to calculate number of parameters and FPS should be 2), we achieve 4% better performance on CU-
in our standard. Lane than the authors’ trial. Specifically, we use learning
Post-processing. Since this method uses gridding cells rate 2.5 × 10−4 with batch size 20. 150 and 2000 epochs,
(each cell is equivalent to several pixels in a segmentation 0.95 and 0.5 prediction thresholds, for CULane and TuSim-
probability map), each point’s X coordinate is calculated ple. The lower threshold in TuSimple is due to the official
as the expectation of locations (cells from the same row), test metric, which significantly favors a high recall. How-
i.e. a weighted average by probability. Differently from ever, for real-world applications, a high recall leads to high
segmentation post-processing, it is possible to be efficiently False Positive rate, which is undesired.
implemented. We divide the curve loss weighting by 10 with our
Data Augmentation. Augmentations include random rota- LSTR-Beizer ablation, since there were 100 sample points
tion and some form of random translation. with both X and Y coordinates to fit, that is a loss scale
about 10 times the original loss (LSTR loss takes summa-
B.5. PolyLaneNet tion of point L1 distances instead of average). This modu-
PolyLaneNet [32] is reported from their paper and open- lation achieves a similar loss landscape to original LSTR.
source code. We added a profiling script to calculate num- Post-processing. This method requires no post-processing.
ber of parameters and FPS in our standard. Data Augmentation. Data augmentation includes Poly-
Post-processing. This method requires no post-processing. LaneNet’s (Appendix B.5), then appends random color dis-
Data Augmentation. Augmentations include large random tortions (brightness, contrast, saturation, hue) and random
rotation (10 degrees), random horizontal flip and random lighting by a light source calculated from the COCO dataset
10 [17]. That is by far the most complex data augmentation
crop. They are applied with a probability of 11 .
pipeline in this research field, we have validated that all
B.6. LaneATT components of this pipeline helps LSTR training.
LaneATT [33] is reported from their paper and open- Remark. The polynomial coefficients of LSTR are un-
source code. We adjusted its profiling scripts to calculate bounded, which leads to numerical instability (while the bi-
parameters and FPS in our standard. partite matching requires precision), and high failure rate
Post-processing. Non-Maximal Suppression (NMS) is im- of training. The failure rate of fp32 training on CULane
plemented by a customized CUDA kernel. An extra inter- is ∼ 30%. This is circumvented in BézierLaneNet, since
polation of lanes by B-Spline is removed both in testing and our L1 loss can be bounded to [0, 1] without influence on
profiling, since it is slowly executed on CPU and provides learning (control points easily converges to on-image).
little improvement (∼ 0.2% on CULane).
B.8. BézierLaneNet
Data Augmentation. LaneATT uses random affine trans-
forms including scale, translation and rotation. While it also BézierLaneNet is implemented in the same code frame-
uses random horizontal flip. work where we re-implemented other methods. Same as
Followup. We did not have time to validate the re- LSTR, the default prediction threshold is set to 0.95, while
implementation of LaneATT in our codebase, prior the sub- 0.5 is used for TuSimple [1].
mission deadline. Therefore, the LaneATT performance is Post-processing. This method requires no post-processing.
still reported from the official code. Our re-implementation Data Augmentation. We use augmentations similar to
indicates that all LaneATT results are reproducible except LSTR (Appendix B.7). Concretely, we remove the ran-
for the ResNet-34 backbone on CULane, which is slightly dom lighting from LSTR (to strictly avoid using knowledge
outside the standard deviation range, but still reasonable. from external data), and replace the PolyLaneNet 10 11 chance
augmentations with random affine transforms and random
B.7. LSTR

LSTR [19] is re-implemented in our codebase. All ResNet-backbone methods start from ImageNet [14] pre-training, while LSTR [19] uses a 256-channel ResNet-18 for CULane (2×) and 128 channels for other datasets (1×), which makes it impossible to use off-the-shelf pre-trained ResNets (although whether ImageNet pre-training helps lane detection is still an open question). Our reported performance of LSTR on CULane is the first documented report; note that the CULane test metric significantly favors a high recall, whereas for real-world applications a high recall leads to a high False Positive rate, which is undesired.

We divide the curve loss weighting by 10 for our LSTR-Bézier ablation, since there are 100 sample points with both X and Y coordinates to fit, i.e., a loss scale roughly 10 times the original (the LSTR loss takes the summation of point L1 distances instead of the average). This modulation achieves a loss landscape similar to the original LSTR.

Post-processing. This method requires no post-processing.

Data Augmentation. Data augmentation includes PolyLaneNet's (Appendix B.5), followed by random color distortions (brightness, contrast, saturation, hue) and random lighting with a light source calculated from the COCO dataset [17]. This is by far the most complex data augmentation pipeline in this research field; we have validated that every component of this pipeline helps LSTR training.

Remark. The polynomial coefficients of LSTR are unbounded, which leads to numerical instability (while the bipartite matching requires precision) and a high training failure rate. The failure rate of fp32 training on CULane is ∼ 30%. This is circumvented in BézierLaneNet, since our L1 loss can be bounded to [0, 1] without influencing learning (control points easily converge to on-image positions).

B.8. BézierLaneNet

BézierLaneNet is implemented in the same code framework in which we re-implemented the other methods. As with LSTR, the default prediction threshold is set to 0.95, while 0.5 is used for TuSimple [1].

Post-processing. This method requires no post-processing.

Data Augmentation. We use augmentations similar to LSTR (Appendix B.7). Concretely, we remove the random lighting from LSTR (to strictly avoid using knowledge from external data), and replace the PolyLaneNet 10/11-chance augmentations with random affine transforms and random horizontal flip, like LaneATT (Appendix B.6). The random affine parameters are: rotation (10 degrees), translation (maximum 50 pixels on X, 20 on Y), and scale (maximum 20%).
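For illustration only, the setup above could be rendered with torchvision-style transforms as in the sketch below. The color-jitter magnitudes and translation fractions are placeholders we chose for the sketch (the text only fixes the affine limits in degrees/pixels/percent), and the actual training pipeline uses joint transforms that also warp the Bézier ground truth (Appendix C).

```python
import torchvision.transforms as T

# Illustrative values only: torchvision expresses translation as a fraction of image size,
# so the (50 px, 20 px) limits are rough equivalents that depend on the input resolution.
color_augment = T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1)
geometric_augment = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.06, 0.03), scale=(0.8, 1.2)),
    T.RandomHorizontalFlip(p=0.5),
])
```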
Polynomial Ablations. For the polynomial ablations (Table 7), we modified the network to predict 6 coefficients for a 3rd-order polynomial (4 curve coefficients and start/end Y coordinates). Extra L1 losses are added for the start/end Y coordinates, similar to LSTR [19]. Despite extensive tryouts (adjusting learning rate, loss weightings, and number of epochs), even at the full BézierLaneNet setup with 150 epochs on CULane, the models still cannot converge to a good enough solution; in other words, they are not precise enough to pass the CULane metric. The sampling loss on polynomial curves can only get down to 0.02, which means 0.02 × 1640 pixels = 32.8 pixels average X coordinate error on the training set. CULane requires a 0.5 IoU between curves, which are enlarged to 30 pixels wide, so roughly a 10-pixel average error is the most that can be tolerated for meaningful results. By loosening the IoU requirement to 0.3, we can get a 15.82 F1 score for "3rd Polynomial from BézierLaneNet". Although the reviewing committee suggested adding simple regularization for this ablation to converge, regretfully we failed to do this.

C. Bézier Curve Implementation Details

Fast Sampling. The sampling of Bézier curves may seem tiresome due to the complex Bernstein basis polynomials. To quickly sample a Bézier curve at a series of fixed t values, simply pre-compute the results of the Bernstein basis polynomials, so that only one simple matrix multiplication is left.
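A minimal sketch of this pre-computation, assuming cubic curves and a fixed number of sample points (the function names are ours, not necessarily those of the released code):

```python
import torch

def cubic_bernstein(t):
    """Cubic Bernstein basis evaluated at sample locations t in [0, 1] -> (len(t), 4)."""
    u = 1.0 - t
    return torch.stack([u ** 3, 3.0 * u ** 2 * t, 3.0 * u * t ** 2, t ** 3], dim=-1)

BASIS = cubic_bernstein(torch.linspace(0, 1, 100))      # pre-computed once for 100 fixed t values

def sample_bezier(control_points):
    """control_points: (..., 4, 2) -> sampled curve points: (..., 100, 2)."""
    return BASIS @ control_points                        # one matrix multiplication at run time
```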
Remarks on GT Generation. The ground truth Bézier curves are generated with least squares fitting, a common technique for polynomials. We use it for its simplicity and the fact that it already shows near-perfect lane line fitting ability (99.996 and 99.72 F1 score on CULane test and LLAMAS val, respectively). However, it is not an ideal algorithm for parametric curves; there is a whole research field on fitting Bézier curves better than least squares [23].
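For completeness, a least-squares sketch (ours, and not necessarily the exact released routine) that fits 4 control points to ordered lane points, assuming a chord-length parameterization of t:

```python
import numpy as np

def fit_cubic_bezier(points):
    """points: (M, 2) ordered lane keypoints -> (4, 2) least-squares control points."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate(([0.0], np.cumsum(seg))) / seg.sum()        # chord-length parameterization (an assumption)
    u = 1.0 - t
    B = np.stack([u ** 3, 3.0 * u ** 2 * t, 3.0 * u * t ** 2, t ** 3], axis=-1)  # (M, 4) Bernstein matrix
    return np.linalg.pinv(B) @ points                              # minimizes the squared sampling error
```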
Bézier Curve Transform. Another implementation difficulty with Bézier curves is how to apply affine transforms (for transforming ground truth curves in data augmentation). Mathematically, an affine transform of the control points is equivalent to an affine transform of the entire curve. However, translation or rotation can move control points out of the image; in this case, a cutting of the Bézier curve is required. The classical De Casteljau's algorithm is used for cutting out an on-image Bézier curve segment. Assume a continuous on-image segment whose valid sample points lie between the minimum boundary t = t0 and the maximum boundary t = t1. The formula that cuts a cubic Bézier curve defined by control points P0, P1, P2, P3 to its on-image segment P0', P1', P2', P3' is derived as:

P0' = u0 u0 u0 P0 + (t0 u0 u0 + u0 t0 u0 + u0 u0 t0) P1 + (t0 t0 u0 + u0 t0 t0 + t0 u0 t0) P2 + t0 t0 t0 P3,
P1' = u0 u0 u1 P0 + (t0 u0 u1 + u0 t0 u1 + u0 u0 t1) P1 + (t0 t0 u1 + u0 t0 t1 + t0 u0 t1) P2 + t0 t0 t1 P3,
P2' = u0 u1 u1 P0 + (t0 u1 u1 + u0 t1 u1 + u0 u1 t1) P1 + (t0 t1 u1 + u0 t1 t1 + t0 u1 t1) P2 + t0 t1 t1 P3,
P3' = u1 u1 u1 P0 + (t1 u1 u1 + u1 t1 u1 + u1 u1 t1) P1 + (t1 t1 u1 + u1 t1 t1 + t1 u1 t1) P2 + t1 t1 t1 P3,        (10)

where u0 = 1 − t0 and u1 = 1 − t1. This formula can be efficiently implemented by matrix multiplication. The possibility of a non-continuous cubic Bézier segment on lane detection datasets is extremely low and is thus ignored for simplicity. If it does happen, Equation (10) will not change the curve, while our network can also predict out-of-image control points, which still fit the on-image lane segments.
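Equation (10) is exactly the blossom (polar form) of the cubic curve evaluated at (t0, t0, t0), (t0, t0, t1), (t0, t1, t1) and (t1, t1, t1); a direct NumPy sketch of the cut (our own reference version) is:

```python
import numpy as np

def cut_cubic_bezier(P, t0, t1):
    """P: (4, 2) control points -> (4, 2) control points of the curve restricted to [t0, t1]."""
    def blossom(s1, s2, s3):
        # Trilinear blossom of the cubic Bézier curve; Eq. (10) row by row.
        v1, v2, v3 = 1.0 - s1, 1.0 - s2, 1.0 - s3
        return (v1 * v2 * v3 * P[0]
                + (s1 * v2 * v3 + v1 * s2 * v3 + v1 * v2 * s3) * P[1]
                + (s1 * s2 * v3 + v1 * s2 * s3 + s1 * v2 * s3) * P[2]
                + s1 * s2 * s3 * P[3])

    return np.stack([blossom(t0, t0, t0), blossom(t0, t0, t1),
                     blossom(t0, t1, t1), blossom(t1, t1, t1)])
```

Affine augmentation of a ground-truth curve then amounts to transforming the 4 control points and, if sampled points fall outside the image, re-cutting with the new [t0, t1].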
D. IoU Loss for Bézier Curves

Here we briefly introduce how we formulate the IoU loss between Bézier curves. Before diving into the algorithm, there are two preliminaries (a small sketch follows the list):

• Polar sort: Anchoring on an arbitrary point inside the N-sided polygon with vertices ci = (xi, yi), i = 1, ..., N (normally the mean coordinate of the vertices, c' = ((1/N) Σ xi, (1/N) Σ yi)), the vertices are sorted by their atan2 angles. This returns a clockwise or counterclockwise polygon.

• Convex polygon area: A sorted convex polygon can be efficiently cut into consecutive triangles by simple indexing operations; the polygon area is the sum of these triangle areas. The area S of a triangle ((x1, y1), (x2, y2), (x3, y3)) is S = (1/2) |x1(y2 − y3) + x2(y3 − y1) + x3(y1 − y2)|.
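A small NumPy sketch of these two preliminaries (our own single-polygon version; the released implementation is written in PyTorch and batched):

```python
import numpy as np

def polar_sort(vertices):
    """Sort polygon vertices (N, 2) by atan2 angle around their mean point."""
    center = vertices.mean(axis=0)
    angles = np.arctan2(vertices[:, 1] - center[1], vertices[:, 0] - center[0])
    return vertices[np.argsort(angles)]

def convex_polygon_area(sorted_vertices):
    """Area of a polar-sorted convex polygon as a fan of consecutive triangles."""
    a = sorted_vertices[0]
    b = sorted_vertices[1:-1]
    c = sorted_vertices[2:]
    # Triangle area: 0.5 * |x1 (y2 - y3) + x2 (y3 - y1) + x3 (y1 - y2)|
    triangles = 0.5 * np.abs(a[0] * (b[:, 1] - c[:, 1])
                             + b[:, 0] * (c[:, 1] - a[1])
                             + c[:, 0] * (a[1] - b[:, 1]))
    return triangles.sum()
```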
Assume we have two convex hulls obtained from Bézier curves (there are many convex hull algorithms). The IoU between Bézier curves is now converted to the IoU between convex polygons. Based on the simple fact that the intersection of convex polygons is still a convex polygon, after polar-sorting all the convex hulls and determining the intersected polygon, we can formulate the IoU calculation as a series of convex polygon area calculations. The difficulty lies in how to efficiently determine the intersection between convex polygon pairs.

Consider two intersected convex polygons; their intersection includes two types of vertices:

• Intersections: intersection points between edges.

• Insiders: vertices inside/on both polygons.

For Intersections, we first represent every polygon edge by the general line equation ax + by = c. Then, for line a1x + b1y = c1 and line a2x + b2y = c2, the intersection (x', y') is calculated by:

x' = (b2 c1 − b1 c2) / det,
y' = (a1 c2 − a2 c1) / det,        (11)

where det = a1 b2 − a2 b1. All (x', y') that lie on the respective line segments are Intersections.

For Insiders, we use the following definition:

Def. 1 For a convex polygon, a point P(x, y) that lies on the same side of every edge is inside the polygon.

A sorted convex polygon is a series of edges (line segments defined by P0(x0, y0), P1(x1, y1)); the equation that decides which side a point is on with respect to a line segment is:

sign = (y − y0)(x1 − x0) − (x − x0)(y1 − y0).        (12)

sign > 0 means P is on the right side, sign < 0 means the left side, and sign = 0 means P is on the line segment. Note that equality is not a stable operation in floating-point computation, but there are simple ways to circumvent that in code, which we will not elaborate here.

There are other ways to determine Intersections and Insiders, but the above formulas can be efficiently implemented with matrix operations and indexing, making it possible to quickly train networks with batched inputs.
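To make the two vertex types concrete, here is an un-batched NumPy sketch (our own reference version, not the batched training code) that collects Insiders with Eq. (12) and Intersections with Eq. (11); polar-sorting the returned vertices and feeding them to the area routine above then gives |A ∩ B|.

```python
import numpy as np

def edge_signs(polygon, points):
    """Eq. (12) for every point against every directed edge of a sorted convex polygon."""
    p0 = polygon
    p1 = np.roll(polygon, -1, axis=0)                    # consecutive edges (p0 -> p1)
    x, y = points[:, 0:1], points[:, 1:2]
    return (y - p0[None, :, 1]) * (p1[None, :, 0] - p0[None, :, 0]) \
         - (x - p0[None, :, 0]) * (p1[None, :, 1] - p0[None, :, 1])

def insiders(poly_a, poly_b, eps=1e-9):
    """Vertices of poly_a inside (or on) poly_b: same sign of Eq. (12) for all edges."""
    s = edge_signs(poly_b, poly_a)
    keep = np.all(s <= eps, axis=1) | np.all(s >= -eps, axis=1)
    return poly_a[keep]

def line_coeffs(p0, p1):
    """General line form a x + b y = c through two points."""
    a, b = p1[1] - p0[1], p0[0] - p1[0]
    return a, b, a * p0[0] + b * p0[1]

def intersections(poly_a, poly_b, eps=1e-9):
    """Edge-edge crossing points of two sorted convex polygons, via Eq. (11)."""
    out = []
    edges_a = list(zip(poly_a, np.roll(poly_a, -1, axis=0)))
    edges_b = list(zip(poly_b, np.roll(poly_b, -1, axis=0)))
    for a0, a1 in edges_a:
        for b0, b1 in edges_b:
            ca, cb = line_coeffs(a0, a1), line_coeffs(b0, b1)
            det = ca[0] * cb[1] - cb[0] * ca[1]
            if abs(det) < eps:                           # parallel edges
                continue
            x = (cb[1] * ca[2] - ca[1] * cb[2]) / det
            y = (ca[0] * cb[2] - cb[0] * ca[2]) / det
            if (min(a0[0], a1[0]) - eps <= x <= max(a0[0], a1[0]) + eps and
                    min(a0[1], a1[1]) - eps <= y <= max(a0[1], a1[1]) + eps and
                    min(b0[0], b1[0]) - eps <= x <= max(b0[0], b1[0]) + eps and
                    min(b0[1], b1[1]) - eps <= y <= max(b0[1], b1[1]) + eps):
                out.append((x, y))
    return np.array(out).reshape(-1, 2)

def intersection_polygon(poly_a, poly_b):
    """All vertices (Insiders + Intersections) of the intersected convex polygon."""
    return np.concatenate([insiders(poly_a, poly_b),
                           insiders(poly_b, poly_a),
                           intersections(poly_a, poly_b)], axis=0)
```

With |A ∩ B| in hand, |A ∪ B| = |A| + |B| − |A ∩ B| and the enclosing area |C| (convex hull or bounding rectangle, see below) give everything needed for the GIoU steps that follow.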
Finally, once convex polygon intersections and areas can be computed, the Generalized IoU (GIoU) loss is obtained exactly as in [29]:

Input: two arbitrary convex shapes A, B ⊆ S ∈ R^n.
Output: GIoU.
1. For A and B, find the smallest enclosing convex object C, where C ⊆ S ∈ R^n.
2. IoU = |A ∩ B| / |A ∪ B|.
3. GIoU = IoU − |C \ (A ∪ B)| / |C|.

The union is computed as A ∪ B = A + B − A ∩ B. The enclosing convex object C can be computed as the convex hull of the two convex polygons, or upper-bounded by an enclosing rectangle. We implement the IoU computation purely in PyTorch [24]; the runtime of our implementation is only about 5× the runtime of the rectangle IoU loss computation. However, lane lines are mostly straight based on road design regulations [7, 36], which leads to extremely small convex hull areas for Bézier curves and thus introduces numerical instabilities in optimization. Although it succeeded in a toy polygon fitting experiment, we have so far failed to observe this loss's convergence helping learning on lane datasets.

E. GT and Prediction Matching Prior

Instead of the centerness prior, we explore a local maximum prior, i.e., we restrict a matched prediction to have a local-maximum classification logit. This prior can help the model understand the spatially sparse structure of lane lines. As shown in Figure 6, the learned feature activation for classification logits exhibits a structure similar to an actual driving scene.

Figure 6. Logits activation statistics (1 × W/16) on CULane [22].
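As an illustration of the prior (our own sketch, not the exact matching code), a local-maximum mask over the 1 × (W/16) classification logits can be computed with a 1-D max pooling and used to forbid matches at non-peak columns:

```python
import torch
import torch.nn.functional as F

def local_maximum_mask(cls_logits, window=3):
    """cls_logits: (B, W16) lane existence logits -> boolean mask of local maxima."""
    pooled = F.max_pool1d(cls_logits.unsqueeze(1), kernel_size=window,
                          stride=1, padding=window // 2).squeeze(1)
    return cls_logits >= pooled

# During GT-prediction matching, columns where the mask is False can simply be
# assigned an infinite matching cost, so only local peaks are eligible.
```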
F. Extra Results

Method                    TuSimple [1]     LLAMAS [3]
Bézier Baseline           93.36            95.27
+ Feature Flip Fusion     95.26 (+1.90)    96.00 (+0.73)

Table 9. Ablation study on TuSimple (test set Accuracy) and LLAMAS (val set F1), before and after adding the Feature Flip Fusion module. Reported as 3-run averages with the ResNet-34 backbone, since ablations on these datasets are often not stable enough to exhibit a clear difference between methods.

G. Discussions

There exists a primitive application of lane detectors from lateral views to estimate the distance to the border of the drivable area [10], which contradicts the use of feature flip fusion. In this case, a lower-order Bézier curve baseline (with row-wise instead of column-wise pooling) would possibly suffice. This is outside the focus of this paper.

Recent Progress. Recently, others have explored alternative lane representation or formulation methods that do not fully fit the three categories (segmentation, point detection, curve). Instead of the popular top-down regime, [27] propose a bottom-up approach that focuses on local details. [18] achieve state-of-the-art performance, but the complex conditional decoding of lane lines results in an unstable runtime depending on the input image, which is not desirable for a real-time system.

H. Qualitative Results

Qualitative results from our ResNet-34 backbone models are shown in Figure 7. For each dataset, 4 results are shown in two rows: the first row shows qualitatively successful predictions; the second row shows typical failure cases.

TuSimple. As shown in Figure 7(a), our model fits highway curves well; only slight errors are seen on the far side, where image details are destroyed by projection. Our typical failure case is a high FP rate, mostly attributed to the use of a low threshold (Appendix B.8). However, in the bottom-right wide road scene, our FP prediction is actually a meaningful lane line that is ignored in the center line annotations.

CULane. As shown in Figure 7(b), most lanes in this dataset are straight. Our model can make accurate predictions under heavy congestion (top-left) and shadows (top-right, shadows cast by trees). A typical failure case is inaccurate prediction under occlusion (second row); in these cases, one often cannot visually tell which is better (the ground truth or our FP prediction).

(a) TuSimple [1]. (b) CULane [22]. (c) LLAMAS [3].

Figure 7. Qualitative results from BézierLaneNet (ResNet-34) on val sets. False Positives (FP) are marked in red, True Positives (TP) are marked in green, and ground truth is drawn in blue. Blue lines that are barely visible are precisely covered by green lines. Bézier curve control points are marked with solid circles. Images are slightly resized for alignment. Best viewed in color, at 2× scale.
LLAMAS. As shown in Figure 7(c), our method is accurate on clear straight lines (top-left), and also performs well on large curvatures in a challenging scene almost entirely covered by shadow. In the bottom-left image, our model fails on a low-illumination, tainted road, while in the other low-illumination scene (bottom-right), the unsupervised annotation from LiDAR and HD-map is misled by the white arrow (see the zigzag shape of the right-most blue line).
