College of Computer and Information Science, Southwest University, Chongqing, China
ABSTRACT
Accurate and fast scene understanding is one of the challenging tasks for autonomous driving, which requires taking full advantage of LiDAR point clouds for semantic segmentation.
(Fig. 2 diagram: original point cloud → projection into (x, y, z, d, r) channels → backbone → classification head → projection back to the 3D prediction, with several auxiliary Seg Heads supervised by auxiliary losses under Plan A and Plan B.)
Fig. 2. The overall architecture of our pipeline. The backbone network can be any feature extraction structure, such as the BasicBlock of ResNet used in this paper. Plan A and Plan B are two different designs of the auxiliary loss. As indicated by the dotted lines, the auxiliary branches can be removed during inference and thus do not influence the inference speed.
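For reference, the following is a minimal sketch of the spherical range projection implied by Fig. 2, which builds the five-channel (x, y, z, d, r) input image; the sensor field-of-view bounds are typical values and are our assumption, not taken from the paper.

```python
import numpy as np

def range_projection(points, remission, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud and (N,) remission into a (5, H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d = np.linalg.norm(points, axis=1) + 1e-8          # per-point range
    yaw, pitch = -np.arctan2(y, x), np.arcsin(z / d)
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (yaw / np.pi + 1.0) * W                  # horizontal pixel coordinate
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)
    img = np.zeros((5, H, W), dtype=np.float32)        # channels: x, y, z, d, r
    order = np.argsort(d)[::-1]                        # write far points first, keep closest
    for c, val in enumerate((x, y, z, d, remission)):
        img[c, v[order], u[order]] = val[order]
    return img
```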
Table 1. Inference speed test with varying kernel size on an NVIDIA 1080Ti. Batch size = 32, input channels = output channels = 2048, resolution = 56×56, stride = 1.

Kernel size | Theoretical FLOPs (B) | Time usage (ms) | Theoretical TFLOPS
1×1         | 420.9                 | 84.5            | 9.96
3×3         | 3788.1                | 198.8           | 38.10

generated range images usually have structure features, which makes 3×3 conv much more suitable than MLP. 2) For 1×1 conv, a lower number of parameters and a lower computational cost do not imply a faster inference speed. As shown in Table 1, reproduced from RepVGG [21], under the same conditions the computational density (theoretical operations divided by time usage) of 3×3 conv can be up to 4× that of 1×1 conv on a 1080Ti GPU. Therefore, we perform a simple substitution to improve the inference speed, which is verified in Sec. 4.3 (Ablation Studies). In addition, inspired by YOLOv5 and MobileNetV3, a stronger nonlinear activation function can improve the network's expressive power without increasing the number of parameters. Hence, we adopt the SiLU and Hardswish activation functions in our model. Experiments in Sec. 4.3 also show that both functions enhance our model's performance with little influence on the inference speed.
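As a concrete illustration of this substitution, the following is a minimal PyTorch sketch; the module layout and channel sizes are ours for illustration and are not taken from the released CENet code.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, use_3x3=True, act="hardswish"):
    """Baseline 1x1 conv + ReLU vs. the 3x3 conv + SiLU/Hardswish replacement."""
    k, p = (3, 1) if use_3x3 else (1, 0)
    acts = {"relu": nn.ReLU(inplace=True),
            "silu": nn.SiLU(inplace=True),
            "hardswish": nn.Hardswish(inplace=True)}
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=p, bias=False),
        nn.BatchNorm2d(out_ch),
        acts[act],
    )

x = torch.randn(2, 128, 64, 512)                      # (B, C, H, W) range-image features
baseline = conv_block(128, 128, use_3x3=False, act="relu")
ours = conv_block(128, 128, use_3x3=True, act="hardswish")
print(baseline(x).shape, ours(x).shape)               # both preserve the 64x512 resolution
```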
Loss Function. To address 1) class imbalance, 2) the difficulty of directly optimizing the intersection-over-union (IoU), and 3) blurred segmentation boundaries, we follow [14] and use three different loss functions, namely the weighted cross-entropy loss Lwce, the Lovász-Softmax loss Lls, and the boundary loss Lbd, to supervise our model.

For the segmentation task, boundary-blurring problems between different objects generally arise from the upsampling and downsampling operations. To address this issue, we introduce a boundary loss function Lbd, defined on the predicted and ground-truth boundary maps of each class c. We give the definition of boundaries as,

ygt^b = pool(1 − ygt, θ0) − (1 − ygt)
ypd^b = pool(1 − ypd, θ0) − (1 − ypd)    (3)

Here pool(·) refers to the max-pooling operation on a sliding window of size θ0. Finally, our total loss is the weighted combination of these three loss functions. The formulation is

L = αLwce + βLls + γLbd    (4)

where α, β, and γ are the corresponding weights and are set to 1.0, 1.5, and 1.0, respectively. In addition, we set θ0 in Lbd to 3.
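The boundary maps of Eq. (3) and the weighted combination of Eq. (4) can be sketched in PyTorch as follows; the tensor layout (per-class binary maps of shape (B, C, H, W)) and the helper names are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def boundary_map(y, theta0=3):
    """Eq. (3): boundary of a (B, C, H, W) binary map y, using a theta0 x theta0
    max-pooling window with stride 1 and 'same' padding."""
    pad = (theta0 - 1) // 2
    return F.max_pool2d(1.0 - y, kernel_size=theta0, stride=1, padding=pad) - (1.0 - y)

def total_loss(l_wce, l_ls, l_bd, alpha=1.0, beta=1.5, gamma=1.0):
    """Eq. (4): weighted combination of the three loss terms."""
    return alpha * l_wce + beta * l_ls + gamma * l_bd

# Toy check: the boundary map is non-zero only along the rim of the square object.
y_gt = torch.zeros(1, 2, 8, 8)
y_gt[:, 1, 2:6, 2:6] = 1.0
y_gt[:, 0] = 1.0 - y_gt[:, 1]
b_gt = boundary_map(y_gt)
```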
Auxiliary Loss. In FIDNet, the authors integrate bilinear upsampling into the FID module to interpolate the low-resolution feature maps, which generates five point-wise feature tensors with the same resolution but encoding different levels of information. Compared with the conventional decoders used in other networks [13, 14, 22], FID is totally parameter-free and dramatically reduces the complexity and storage cost. However, this simple decoder makes the model's performance depend excessively on the low-dimensional and high-dimensional features. Additionally, unlike a progressive upsampling decoder, the simple interpolation-fusion decoding may result in feature maps at different scales not being fully aligned and decoded. To alleviate this problem, we introduce multiple auxiliary loss heads to refine the feature maps at different resolutions and improve the learning capability.

Specifically, we use auxiliary segmentation heads to predict the outputs of three feature maps with different resolutions, and compute their weighted loss together with the main loss to supervise our network to produce more semantic features. The final loss function can be defined as,

Ltotal = Lmain + λ Σ_{i=1}^{3} L(yi, ŷi)    (5)
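A minimal sketch of Eq. (5), assuming each auxiliary head returns logits paired with a matching ground-truth map; the function and variable names are ours, and the per-head criterion below merely stands in for the combined loss of Eq. (4).

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()      # stand-in for the combined loss L of Eq. (4)

def total_training_loss(main_logits, target, aux_pairs, lam=1.0):
    """L_total = L_main + lambda * sum_i L(y_i, y_hat_i), with aux_pairs holding
    (auxiliary logits, matching ground truth) for the three auxiliary heads."""
    loss = criterion(main_logits, target)
    for aux_logits, aux_target in aux_pairs:
        loss = loss + lam * criterion(aux_logits, aux_target)
    return loss
```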
Table 2. Quantitative results on the SemanticKITTI benchmark. Each row lists FPS (Hz), mean-IoU, and the class-wise IoU in the order: car, bicycle, motorcycle, truck, other-vehicle, person, bicyclist, motorcyclist, road, parking, sidewalk, other-ground, building, fence, vegetation, trunk, terrain, pole, traffic-sign.

Point-based (50K pts)
PointNet++ [4]        | 0.1  | 20.1 | 53.7 1.9 0.2 0.9 0.2 0.9 1.0 0.0 72.0 18.7 41.8 5.6 62.3 16.9 46.5 13.8 30.0 6.0 8.9
RandLa-Net [6]        | 20   | 53.9 | 94.2 26.0 25.8 40.1 38.9 49.2 48.2 7.2 90.7 60.3 73.7 20.4 86.9 56.3 81.4 61.3 66.8 49.2 47.7
KPConv [5]            | −    | 58.8 | 96.0 30.2 42.5 33.4 44.3 61.5 61.6 11.8 88.8 61.3 72.7 31.6 90.5 64.2 84.8 69.2 69.1 56.4 47.4
BAAF [16]             | 5    | 59.9 | 95.4 31.8 35.5 48.7 46.7 49.5 55.7 53.0 90.9 62.2 74.4 23.6 89.8 60.8 82.7 63.4 67.9 53.7 52.0

Voxel-based (voxel)
MinkowskiNet-lite [18]| 8.6  | 57.5 | −
MinkowskiNet [18]     | 3.4  | 63.1 | −
SPVCNN-lite [7]       | 8.1  | 58.5 | −
SPVCNN [7]            | 3.2  | 63.8 | −

Image-based (64 × 2048)
SqueezeSeg-CRF [9]    | 55   | 30.8 | 68.3 18.1 5.1 4.1 4.8 16.5 17.3 1.2 84.9 28.4 54.7 4.6 61.5 29.2 59.6 25.5 54.7 11.2 36.3
SqueezeSegV2-CRF [19] | 40   | 39.6 | 82.7 21.0 22.6 14.5 15.9 20.2 24.3 2.9 88.5 42.4 65.5 18.7 73.8 41.0 68.5 36.9 58.9 12.9 41.0
SqueezeSegV3 [20]     | 6    | 55.9 | 92.5 38.7 36.5 29.6 33.0 45.6 46.2 20.1 91.7 63.4 74.8 26.4 89.0 59.4 82.0 58.7 65.4 49.6 58.9
SalsaNext [22]        | 24   | 59.5 | 91.9 48.3 38.6 38.9 31.9 60.2 59.0 19.4 91.7 63.7 75.8 29.1 90.2 64.2 81.8 63.6 66.5 54.3 62.1
Lite-HDSeg [15]       | 20   | 63.8 | 92.3 40.0 55.4 37.7 39.6 59.2 71.6 54.3 93.0 68.2 78.3 29.3 91.5 65.0 78.2 65.8 65.1 59.5 67.7
KPRNet [15]           | 0.3  | 63.1 | 95.5 54.1 47.9 23.6 42.6 65.9 65.0 16.5 93.2 73.9 80.6 30.2 91.7 68.4 85.7 69.8 71.2 58.7 64.1

Image-based (64 × 512)
RangeNet++ [10]       | 38.5 | 41.9 | 87.4 26.2 26.5 18.6 15.6 31.8 33.6 4.0 91.4 57.0 74.0 26.4 81.9 52.3 77.6 48.4 63.6 36.0 50.0
MPF [23]              | 33.7 | 48.9 | 91.1 22.0 19.7 18.8 16.5 30.0 36.2 4.2 91.1 61.9 74.1 29.4 86.7 56.2 82.3 51.6 68.9 38.6 49.8
FIDNet-Point [15]     | 82.0 | 51.3 | 90.4 28.6 30.9 34.3 27.0 43.9 48.9 16.8 90.1 58.7 71.4 19.9 84.2 51.2 78.2 51.9 64.5 32.7 50.3
Ours                  | 84.9 | 60.7 | 92.1 45.4 42.9 43.9 46.8 56.4 63.8 29.7 91.3 66.0 75.3 31.1 88.9 60.4 81.9 60.5 67.6 49.5 59.1

Image-based (64 × 1024)
RangeNet++ [10]       | 23.3 | 48.0 | 90.3 20.6 27.1 25.2 17.6 29.6 34.2 7.1 90.4 52.3 72.7 22.8 83.9 53.3 77.7 52.5 63.7 43.8 47.2
MPF [23]              | 28.5 | 53.6 | 92.7 28.2 30.5 26.9 25.2 42.5 45.5 9.5 90.5 64.7 74.3 32.0 88.3 59.0 83.4 56.6 69.8 46.0 54.9
FIDNet-Point [15]     | 60.9 | 56.0 | 92.4 44.0 41.5 33.2 30.8 57.9 52.6 18.0 91.0 61.2 73.8 12.6 88.2 57.9 80.8 59.5 65.1 45.3 58.4
Ours                  | 67.9 | 62.3 | 93.0 50.5 47.6 41.7 43.4 64.5 65.2 32.5 90.5 65.5 74.1 29.2 90.9 65.4 81.6 65.4 65.6 55.9 61.0

Image-based (64 × 2048)
RangeNet++ [10]       | 12.8 | 52.2 | 91.4 25.7 34.4 25.7 23.0 38.3 38.8 4.8 91.8 65.0 75.2 27.8 87.4 58.6 80.5 55.1 64.6 47.9 55.9
MPF [23]              | 20.6 | 55.5 | 93.4 30.2 38.3 26.1 28.5 48.1 46.1 18.1 90.6 62.3 74.5 30.6 88.5 59.7 83.5 59.7 69.2 49.7 58.1
FIDNet-Point [15]     | 33.7 | 58.6 | 93.0 45.7 42.0 27.9 32.6 62.6 58.1 30.5 90.8 58.3 74.9 20.1 88.5 59.5 83.1 64.3 67.8 52.6 60.0
Ours                  | 37.8 | 64.7 | 91.9 58.6 50.3 40.6 42.3 68.9 65.9 43.5 90.3 60.9 75.1 31.5 91.0 66.2 84.5 69.7 70.0 61.5 67.6
Table 3. Quantitative results on SemanticPOSS. Each row lists the class-wise IoU in the order: person, rider, car, truck, plants, traffic-sign, pole, trashcan, building, cone/stone, fence, bike, ground, followed by the mIoU.

SqueezeSeg [9]          | 14.2 1.0 13.2 10.4 28.0 5.1 5.7 2.3 43.6 0.2 15.6 31.0 75.0 | 18.9
SqueezeSeg + CRF [9]    | 6.8 0.6 6.7 4.0 2.5 9.1 1.3 0.4 37.1 0.2 8.4 18.5 72.1 | 12.9
SqueezeSegV2 [19]       | 48.0 9.4 48.5 11.3 50.1 6.7 6.2 14.8 60.4 5.2 22.1 36.1 71.3 | 30.0
SqueezeSegV2 + CRF [19] | 43.9 7.1 47.9 18.4 40.9 4.8 2.8 7.4 57.5 0.6 12.0 35.3 71.3 | 26.9
RangeNet53 [10]         | 55.7 4.5 34.4 13.7 57.5 3.7 6.6 23.3 64.9 6.1 22.2 28.3 72.9 | 30.3
RangeNet53 + KNN [10]   | 57.3 4.6 35.0 14.1 58.3 3.9 6.9 24.1 66.1 6.6 23.4 28.6 73.5 | 30.9
MINet [24]              | 61.8 12.0 63.3 22.2 68.1 16.3 29.3 28.5 74.6 25.9 31.7 44.5 76.4 | 42.7
MINet + KNN [24]        | 62.4 12.1 63.8 22.3 68.6 16.7 30.1 28.9 75.1 28.6 32.2 44.9 76.3 | 43.2
FIDNet-Point [15]       | 71.6 22.7 71.7 22.9 67.7 21.8 27.5 15.8 72.7 31.3 40.4 50.3 79.5 | 45.8
FIDNet-Point + KNN [15] | 72.2 23.1 72.7 23.0 68.0 22.2 28.6 16.3 73.1 34.0 40.9 50.3 79.1 | 46.4
Ours                    | 74.9 21.8 77.0 25.3 72.0 18.0 30.9 46.9 75.9 26.1 47.5 51.7 80.7 | 49.9
Ours + KNN              | 75.5 22.0 77.6 25.3 72.2 18.2 31.5 48.1 76.3 27.7 47.7 51.4 80.3 | 50.3
For Plan A, the multi-scale ground truth is obtained by downsampling the GT labels with the corresponding rate. For Plan B, since all feature maps are upsampled to the final output size, the GT labels are used directly as their ŷi.
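A sketch of the two supervision variants under our reading of the text (the helper names are hypothetical): Plan A downsamples the labels to each auxiliary resolution, while Plan B upsamples every auxiliary prediction to the output size and reuses the full-resolution labels.

```python
import torch.nn.functional as F

def plan_a_targets(labels, scales=(2, 4, 8)):
    """Nearest-neighbour downsampling of (B, H, W) integer labels, one per scale."""
    l = labels.unsqueeze(1).float()                     # (B, 1, H, W) for interpolate
    return [F.interpolate(l, scale_factor=1.0 / s, mode="nearest").squeeze(1).long()
            for s in scales]

def plan_b_logits(aux_logits_list, out_size):
    """Bilinear upsampling of every auxiliary prediction to the final output size."""
    return [F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
            for x in aux_logits_list]
```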
4. EXPERIMENT

4.1. Dataset and Implementation details

SemanticKITTI is a large-scale dataset for the task of point cloud segmentation of autonomous driving scenes. It contains 43,551 LiDAR scans from 22 sequences collected from a city in Germany, where sequences 00 to 10 (19,130 scans) are used for training, sequences 11 to 21 (20,351 scans) for testing, and sequence 08 (4,071 scans) is held out for validation. All sequences are labeled with dense point-wise annotations.

SemanticPOSS is a much smaller, sparser, and more challenging benchmark collected by Peking University. It consists of 2,988 different, complex LiDAR scenes, each with a large number of sparse dynamic instances (e.g., pedestrians and bicycles). SemanticPOSS is divided into 6 parts; we use part 2 as the test set and the others as the training set.

Implementation details. We conduct all the experiments on single NVIDIA RTX3060 and RTX3090 GPUs. The Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 is used for network optimization, and the weight decay is set to 1e−4. During training, data augmentation is applied. For SemanticKITTI, we train the network for 100 epochs with an initial learning rate of 1e−2, which is dynamically adjusted by a cosine annealing scheduler. For SemanticPOSS, the network is trained for 3 cycles of 45 epochs, with the minimum and maximum learning rates set to 1e−5 and 1e−3, respectively.
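For illustration, the described training schedule could be set up in PyTorch roughly as follows; scheduler arguments beyond the values stated above are our assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(5, 20, kernel_size=3, padding=1)   # placeholder for the segmentation network

# SemanticKITTI: SGD (momentum 0.9, weight decay 1e-4), initial LR 1e-2,
# cosine-annealed over 100 epochs.
opt_kitti = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
sched_kitti = torch.optim.lr_scheduler.CosineAnnealingLR(opt_kitti, T_max=100)

# SemanticPOSS: 3 cycles of 45 epochs, learning rate cycled between 1e-5 and 1e-3.
opt_poss = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
sched_poss = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt_poss, T_0=45, eta_min=1e-5)
```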
4.2. Results and Discussion

Quantitative Results on SemanticKITTI: Table 2 shows the quantitative results of several state-of-the-art models on the SemanticKITTI benchmark. It reports the input size, frames per second (FPS), mean IoU, and class-wise IoU. From these results, it can be seen that for an input of size 64 × 2048, our model achieves state-of-the-art performance (64.7% mIoU) compared with point-based, voxel-based, and image-based methods, while maintaining a higher FPS (37.8 FPS). For 64 × 1024 and 64 × 512 inputs, CENet obtains superior results, 67.9 FPS with 62.3% mIoU and 84.9 FPS with 60.7% mIoU, which outperforms the current methods by a large margin. It is worth noting that our CENet shows excellent performance when the input is 64 × 512, surpassing the baseline method FIDNet [15] and many other methods [20, 22, 23] that use 64 × 2048 input.

Quantitative Results on SemanticPOSS: Table 3 shows the comparison of our proposed CENet with other related works. As can be seen from these results, all methods attain somewhat worse results due to the differences in sensors and environments, as well as the small scale and sparse structure of the features. Nevertheless, it is worth noting that our method outperforms all models not only in overall mIoU, but also in almost every class-wise IoU. This further verifies the effectiveness and efficiency of our method.
(Fig. 3 panels: (a) Input Point Cloud, (b) Ground Truth, (c) FIDNet-Point, (d) CENet (ours).)
Fig. 3. Qualitative analysis on the SemanticKITTI validation set. (a) and (b) show the input LiDAR scan and the corresponding segmentation ground truth; (c) and (d) show the segmentation error maps of FIDNet and our method for this scan, with red indicating wrong predictions.
Table 4. Ablation study evaluated on the SemanticKITTI validation set. KS denotes the 3×3 kernel-size substitution; ✓ marks the enabled components.

Baseline | Row | KS | SiLU | H-swish | Plan A | Plan B | mIoU | Params (M) | Latency (ms)
FIDNet   | 1   |    |      |         |        |        | 55.4 | 6.053 | 11.956
         | 2   |    |      |         |        |        | 58.9 | 6.053 | 11.956
         | 3   | ✓  |      |         |        |        | 59.2 | 6.774 | 11.576
         | 4   |    | ✓    |         |        |        | 60.0 | 6.053 | 11.981
         | 5   |    |      | ✓       |        |        | 60.5 | 6.053 | 12.039
         | 6   | ✓  | ✓    |         |        |        | 59.5 | 6.774 | 11.776
         | 7   | ✓  |      | ✓       |        |        | 60.6 | 6.774 | 11.562
Ours     | 8   |    |      |         | ✓      |        | 62.5 | 6.061 | 11.956
         | 9   |    |      |         |        | ✓      | 63.0 | 6.061 | 11.956
         | 10  | ✓  |      |         | ✓      |        | 63.2 | 6.782 | 11.576
         | 11  | ✓  |      |         |        | ✓      | 64.3 | 6.782 | 11.576
         | 12  | ✓  |      | ✓       |        | ✓      | 65.3 | 6.782 | 11.562

Table 6. Impact of λ in the auxiliary loss.

λ    | 0    | 0.1  | 0.5  | 1.0
mIoU | 59.2 | 61.1 | 62.8 | 64.3

Table 7. Ablation studies for the auxiliary loss module on a different backbone.

Backbone                | Params (M) | mIoU
HarDNet (vanilla)       | 2.735      | 55.1
HarDNet (ours)          | 3.138      | 58.7
HarDNet (ours) + Plan B | 3.146      | 61.3