
CENET: TOWARD CONCISE AND EFFICIENT LIDAR SEMANTIC SEGMENTATION
FOR AUTONOMOUS DRIVING

Hui-Xian Cheng, Xian-Feng Han*, Guo-Qiang Xiao

College of Computer and Information Science, Southwest University, Chongqing, China

{chenghuixian}@email.swu.edu.cn, {xianfenghan, gqxiao}@swu.edu.cn

ABSTRACT

Accurate and fast scene understanding is one of the challenging tasks for autonomous driving, which requires taking full advantage of LiDAR point clouds for semantic segmentation. In this paper, we present a concise and efficient image-based semantic segmentation network, named CENet. In order to improve the descriptive power of learned features and reduce the computational as well as time complexity, our CENet integrates convolution with a larger kernel size instead of MLP, carefully selected activation functions, and multiple auxiliary segmentation heads with corresponding loss functions into its architecture. Quantitative and qualitative experiments conducted on publicly available benchmarks, SemanticKITTI and SemanticPOSS, demonstrate that our pipeline achieves much better mIoU and inference performance compared with state-of-the-art models. The code will be available at https://github.com/huixiancheng/CENet.

Fig. 1. LiDAR semantic segmentation accuracy vs. speed on the SemanticKITTI test set [1]. (The dotted lines represent the performance of the same method at different resolutions.) Best viewed in color and zoomed in for more detail.

Index Terms— LiDAR Point Cloud, Autonomous Driving, Semantic Scene Understanding, Semantic Segmentation

1. INTRODUCTION

Recently, the rapid development of LiDAR sensors has made 3D computer vision an interesting research topic in applications such as robotics and autonomous driving [2], where accurate, real-time and robust environment perception and understanding is a challenging task. To achieve this goal, LiDAR has become the most popular and widely used choice, since 1) compared with visual cameras, LiDAR is much more robust to varying lighting and weather conditions, and 2) the acquired 3D point clouds provide rich geometric information. Therefore, LiDAR-based semantic scene perception, especially 3D point cloud semantic segmentation, has received increasing attention, aiming to perform point-wise classification.

In the past few years, the emergence of datasets such as SemanticKITTI [1] and SemanticPOSS [3] has provided benchmarks for LiDAR semantic segmentation and made it possible to apply deep learning techniques to this task. However, we cannot directly apply standard Convolutional Neural Networks (CNNs) to 3D point clouds due to their irregular and sparse structure. To address this problem, many recent methods have been proposed, which can be classified into raw point-based [4-6], voxel-based [7,8], and range image-based [9,10]. Generally, point-based networks deal with raw 3D LiDAR point clouds directly, which can obtain much better performance at higher computational complexity [5,6]. Voxel-based methods project unstructured point clouds into regular grid cells, which allows the use of 3D convolutional neural networks. Although state-of-the-art accuracy can be achieved, the higher model complexity makes it difficult for these methods to obtain real-time inference speed. For example, the inference speeds of SPVCNN [7] and Cylinder3D [8] on a Tesla V100 are only 8.0 and 7.6 fps, respectively. In addition, range image-based approaches choose to represent the raw 3D point clouds as an ordered range image using a spherical projection strategy, so that well-designed 2D CNNs can be utilized to carry out the LiDAR semantic segmentation task. These kinds of methods can provide superior inference and accuracy performance, which drives more and more researchers to focus on this field [11], though they inevitably suffer from information loss during projection [12].

On the other hand, current range image-based methods, such as KPRNet [13] and Lite-HDSeg [14], typically have an extremely large number of parameters and much lower inference speed or higher model complexity, which limits their applications in autonomous driving to some extent. To relieve this issue, FIDNet [15] presents a range image-based LiDAR semantic segmentation network that keeps the solution as simple as possible while maintaining good performance.

Nevertheless, the performance of FIDNet is not comparable to the current state-of-the-art methods [7,14,16]. Based on the fundamental idea and performance of FIDNet, we therefore rethink its design choices and propose a concise and efficient LiDAR semantic segmentation model, termed CENet. Quantitative experimental results demonstrate that our network can outperform current state-of-the-art methods with no increase in the number of effective parameters, and has a much higher inference speed (as shown in Fig. 1). Specifically, the main contributions of this paper are as follows:

• We present a newly-designed LiDAR point cloud segmentation architecture, named CENet, which improves the inference speed with no increase in parameters.
• In order to improve the nonlinear capability of the network, we adopt SiLU and Hardswish as activation functions.
• By introducing multiple auxiliary segmentation heads, we significantly improve the learning power of our network without introducing additional inference parameters.
• We conduct comprehensive experiments on the publicly available datasets SemanticKITTI and SemanticPOSS. The results show that our method achieves state-of-the-art performance.

* Corresponding author. This research was supported by the National Natural Science Foundation of China (No. 62002299), the Natural Science Foundation of Chongqing, China (No. cstc2020jcyj-msxmX0126), and the Fundamental Research Funds for the Central Universities (No. SWU120005).

2. RELATED WORK

Point-based methods work directly on the raw point clouds. PointNet [17] and PointNet++ [4] are pioneering studies that use shared MLPs to learn the properties of each point, and they inspired a series of point-based networks. KPConv [5] develops deformable convolutions that can use an arbitrary number of kernel points to learn local representations. RandLA-Net [6] adopts a random sampling strategy to considerably improve the efficiency of point cloud processing and uses local feature aggregation to reduce the information loss caused by random operations. BAAF [16] makes full use of the geometric and semantic features of points to obtain more accurate semantic segmentation by using bilateral structures and adaptive fusion.

Voxel-based methods first discretize the point clouds into voxel representations, then predict semantic labels for these voxels using 3D CNN frameworks. MinkowskiNet [18] chose to use sparse convolution instead of standard 3D convolution to reduce the computational cost. SPVNAS [7] exploits neural architecture search (NAS) to further improve the network's performance.

Image-based methods project LiDAR point clouds onto 2D multimodal images and then apply well-designed 2D CNNs for semantic segmentation. SqueezeSeg [9] and SqueezeSegV2 [19] use the lightweight SqueezeNet model and a CRF for segmentation. RangeNet++ [10] integrates Darknet into SqueezeSeg and proposes an efficient KNN post-processing method to predict point-wise labels. SqueezeSegV3 [20] proposes Spatially-Adaptive Convolution (SAC), which applies different filters depending on the location in the input image. KPRNet [13] achieves promising results by using a powerful backbone together with KPConv as the segmentation head. Lite-HDSeg [14] achieves state-of-the-art performance by introducing three different modules: an Inception-like Context Module, a Multi-class Spatial Propagation Network, and a boundary loss. Although it achieves a high inference speed, its higher number of parameters and model complexity make it less suitable for autonomous driving applications.

3. METHODOLOGY

In order to take full advantage of well-optimized conventional convolution operations for real-time LiDAR point cloud segmentation, we use the spherical projection approach to generate 2D multimodal range images by projecting the LiDAR point cloud into the spherical coordinate system, which is formulated as

    \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} \frac{1}{2}\left[1 - \arctan(y, x)\,\pi^{-1}\right] W \\ \left[1 - \left(\arcsin(z\, d^{-1}) + f_u\right) f^{-1}\right] H \end{pmatrix}    (1)

where f = f_u + f_d refers to the sensor's vertical field-of-view. The depth d of each point is calculated as d = \sqrt{x^2 + y^2 + z^2}. The final result is a projected range image of size (H, W, 5), where each pixel contains 5 channels (x, y, z, d, r), and r is the intensity information of the point. Based on this transformation, the point cloud segmentation problem is turned into an image segmentation task.
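To make the projection concrete, the following is a minimal NumPy sketch of Eq. (1). The function name, the field-of-view values, and the (H, W) = (64, 2048) resolution are illustrative assumptions rather than the authors' released code; as in common range-view implementations, the vertical coordinate is offset by the magnitude of the lower field-of-view bound.

import numpy as np

def spherical_projection(points, remission, H=64, W=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    # Project an (N, 3) point cloud to an (H, W, 5) range image following Eq. (1).
    # Assumed FOV values follow a typical 64-beam LiDAR; adjust for the actual sensor.
    f_u = abs(np.radians(fov_up_deg))      # upper part of the vertical FOV
    f_d = abs(np.radians(fov_down_deg))    # lower part of the vertical FOV
    f = f_u + f_d                          # total vertical field of view

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d = np.sqrt(x ** 2 + y ** 2 + z ** 2)  # depth of each point
    pitch = np.arcsin(z / np.maximum(d, 1e-8))

    # Horizontal and vertical image coordinates, scaled to pixel indices.
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W
    v = (1.0 - (pitch + f_d) / f) * H      # offset by the lower FOV bound (top beam -> row 0)

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    # Fill the 5-channel image (x, y, z, d, r); nearer points overwrite farther ones.
    image = np.zeros((H, W, 5), dtype=np.float32)
    order = np.argsort(d)[::-1]            # write far points first
    image[v[order], u[order], :] = np.stack(
        [x[order], y[order], z[order], d[order], remission[order]], axis=-1)
    return image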
3.1. Network Architecture

Our CENet architecture is shown in Fig. 2. We introduce the core components in detail in the following subsections.
[Fig. 2 (architecture diagram): the original point cloud is spherically projected to a 2D range image (x, y, z, d, r), processed by an Input Module and a backbone of 3×3 convolution stages (labeled channel widths 64, 128, 128, 256, 128), whose multi-scale features are concatenated and fed to the Classification Head to produce the 2D prediction, which is projected back to obtain the 3D prediction. Auxiliary Seg Heads with auxiliary losses are attached under two designs, Plan A and Plan B.]
Fig. 2. The overall architecture of our pipeline. The backbone network can be any feature extraction structure, such as the BasicBlock of ResNet used in this paper. Plan A and Plan B are two different designs of the auxiliary loss. As shown by the dotted lines, the auxiliary heads can be removed during inference and thus do not influence the inference speed.
Input Module, Classification Head and Activation Function. The range image is a special two-dimensional image with five channels, where each location can actually be considered a point representation. Therefore, FIDNet [15] uses 1 × 1 conv layers in its Input Module and Classification Head to process the input features, which, the authors argue, achieves a similar effect to the MLP used in PointNet for point feature learning. However, it is more reasonable to use 3 × 3 conv layers instead of 1 × 1, for the following reasons. 1) Unlike disordered and unstructured point clouds, the generated range images usually have structural features, which makes 3 × 3 conv much more suitable than an MLP. 2) For 1 × 1 conv, the lower number of parameters and lower computational cost do not imply faster inference speed. As shown in Table 1 from RepVGG [21], it can be concluded that under the same conditions, the computational density (theoretical operations divided by time usage) of 3 × 3 conv can be up to 4× that of 1 × 1 conv on a 1080Ti GPU. Therefore, we conduct a simple substitution to improve the inference speed, which will be verified in Sec. 4.3 Ablation Studies. In addition, inspired by YOLOv5 and MobileNetV3, we observe that a stronger nonlinear activation function can improve the network's expressive power without increasing the parameters. Hence, we adopt the SiLU and Hardswish activation functions in our model. Experiments in Sec. 4.3 also show that both functions can enhance our model's performance with little influence on inference speed.

Table 1. Inference speed test with varying kernel size on an NVIDIA 1080Ti. Batch size = 32, input channels = output channels = 2048, resolution = 56 × 56, stride = 1.

Kernel size | Theoretical FLOPs (B) | Time usage (ms) | Theoretical TFLOPS
1 × 1       | 420.9                 | 84.5            | 9.96
3 × 3       | 3788.1                | 198.8           | 38.10
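As an illustration of this design choice, below is a minimal PyTorch-style sketch of an input stem and classification head built from 3 × 3 convolutions with Hardswish activations. The module names, channel widths, and layer counts are illustrative assumptions, not the released CENet code.

import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    # 3x3 conv -> BatchNorm -> Hardswish, the basic unit discussed above.
    def __init__(self, in_ch, out_ch, act=nn.Hardswish):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = act(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class InputModule(nn.Module):
    # Lifts the 5-channel range image (x, y, z, d, r) to feature maps with
    # 3x3 convolutions instead of the 1x1 (MLP-like) layers used in FIDNet.
    def __init__(self, in_ch=5, out_ch=64):
        super().__init__()
        self.block = nn.Sequential(ConvBNAct(in_ch, out_ch), ConvBNAct(out_ch, out_ch))

    def forward(self, x):
        return self.block(x)

class ClassificationHead(nn.Module):
    # Maps (concatenated) features to per-pixel class logits.
    def __init__(self, in_ch, num_classes=20):
        super().__init__()
        self.block = nn.Sequential(
            ConvBNAct(in_ch, 128),
            nn.Conv2d(128, num_classes, kernel_size=3, padding=1))

    def forward(self, x):
        return self.block(x)

# Example: a 64 x 512 range image with 5 channels.
feats = InputModule()(torch.randn(1, 5, 64, 512))
logits = ClassificationHead(in_ch=64)(feats)   # shape (1, 20, 64, 512)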
Loss Function. In order to address 1) class imbalance, 2) the problem of directly optimizing the intersection-over-union (IoU), and 3) blurred segmentation boundaries, we follow [14] and use three different loss functions, namely the weighted cross-entropy loss L_{wce}, the Lovász-Softmax loss L_{ls}, and the boundary loss L_{bd}, to supervise our model.

For the segmentation task, the boundary-blurring problems between different objects generally arise from upsampling and downsampling operations. To address this issue, we introduce a boundary loss function, which can be formally defined as

    L_{bd}(\hat{y}, y) = 1 - \frac{2 P^c R^c}{P^c + R^c}    (2)

where P^c and R^c denote the precision and recall of the predicted boundary map y_{pd} with respect to the ground truth y_{gt} for class c. We define the boundaries as

    y_{gt}^{b} = \mathrm{pool}(1 - y_{gt}, \theta_0) - (1 - y_{gt}), \quad
    y_{pd}^{b} = \mathrm{pool}(1 - y_{pd}, \theta_0) - (1 - y_{pd})    (3)

Here, pool(·) refers to a max-pooling operation with a sliding window of size θ_0. Finally, our total loss is the weighted combination of these three loss functions:

    L = \alpha L_{wce} + \beta L_{ls} + \gamma L_{bd}    (4)

where α, β, and γ are the corresponding weights and are set to 1.0, 1.5, and 1.0, respectively. In addition, we set θ_0 in L_{bd} to 3.
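The boundary terms in Eqs. (2)-(3) can be computed with max-pooling, as in the hedged PyTorch sketch below. The function signature and the softmax-probability input are assumptions made for illustration, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def boundary_loss(probs, gt_onehot, theta0=3, eps=1e-7):
    # probs:     (B, C, H, W) softmax probabilities.
    # gt_onehot: (B, C, H, W) one-hot ground-truth labels (float).
    pad = theta0 // 2
    # Eq. (3): boundary pixels are where max-pooling the inverted map changes its value.
    gt_b = F.max_pool2d(1 - gt_onehot, theta0, stride=1, padding=pad) - (1 - gt_onehot)
    pd_b = F.max_pool2d(1 - probs, theta0, stride=1, padding=pad) - (1 - probs)

    # Per-class precision P^c and recall R^c of the predicted boundary map.
    gt_b = gt_b.flatten(2)
    pd_b = pd_b.flatten(2)
    precision = (pd_b * gt_b).sum(dim=2) / (pd_b.sum(dim=2) + eps)
    recall = (pd_b * gt_b).sum(dim=2) / (gt_b.sum(dim=2) + eps)

    # Eq. (2): one minus the boundary F1 score, averaged over classes.
    bf1 = 2 * precision * recall / (precision + recall + eps)
    return (1 - bf1).mean()

# Eq. (4): weighted combination with alpha = 1.0, beta = 1.5, gamma = 1.0, e.g.
# total = 1.0 * wce_loss + 1.5 * lovasz_loss + 1.0 * boundary_loss(probs, gt_onehot)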

Auxiliary Loss. In FIDNet, the authors integrate bilinear upsampling into the FID module to interpolate the low-resolution feature maps, which generates five point-wise feature tensors with the same resolution but encoding different levels of information. Compared with the conventional decoders used in other networks [13, 14, 22], FID is totally parameter-free and dramatically reduces the complexity and storage cost. However, this simple decoder makes the model's performance depend excessively on the low-dimensional and high-dimensional features. Additionally, unlike a progressive upsampling decoder, the simple interpolation-fusion decoding may result in feature maps at different scales not being fully aligned and decoded. To alleviate this problem, we introduce multiple auxiliary loss heads to refine the feature maps at different resolutions and improve the learning capability.

Specifically, we use auxiliary segmentation heads to predict outputs from three feature maps with different resolutions, and compute their weighted loss together with the main loss to supervise our network to produce more semantic features. The final loss function can be defined as

    L_{total} = L_{main} + \lambda \sum_{i=1}^{3} L(y_i, \hat{y}_i)    (5)

where L_{main} is the main loss, y_i is the semantic output obtained from stage i, and \hat{y}_i represents the corresponding semantic label. L(·) is computed according to Equation 4. As shown in Fig. 2, for Plan A, \hat{y}_i is obtained by downsampling the GT labels at the corresponding rate. For Plan B, since all feature maps are upsampled to the final output size, the GT labels themselves serve as \hat{y}_i.
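A minimal sketch of this deep-supervision scheme is given below, assuming Plan B (all auxiliary predictions upsampled to the label resolution). The head structure, the number of stages, and the criterion callable are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxSegHead(nn.Module):
    # Auxiliary segmentation head attached to an intermediate stage.
    # Used only during training; removed at inference time.
    def __init__(self, in_ch, num_classes=20):
        super().__init__()
        self.head = nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=1)

    def forward(self, feat, out_size):
        logits = self.head(feat)
        # Plan B: upsample the auxiliary prediction to the full label resolution.
        return F.interpolate(logits, size=out_size, mode='bilinear', align_corners=False)

def total_loss(main_logits, aux_logits_list, labels, criterion, lam=1.0):
    # Eq. (5): main loss plus lambda-weighted auxiliary losses.
    loss = criterion(main_logits, labels)
    for aux_logits in aux_logits_list:
        loss = loss + lam * criterion(aux_logits, labels)
    return loss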
Table 2. Performance comparison on the SemanticKITTI test set. For each method we report the input size, frames per second (FPS, Hz), mean IoU, and class-wise IoU in the order: car, bicycle, motorcycle, truck, other-vehicle, person, bicyclist, motorcyclist, road, parking, sidewalk, other-ground, building, fence, vegetation, trunk, terrain, pole, traffic-sign.

Point-based (50K pts):
PointNet++ [4]    | FPS 0.1  | mIoU 20.1 | 53.7 1.9 0.2 0.9 0.2 0.9 1.0 0.0 72.0 18.7 41.8 5.6 62.3 16.9 46.5 13.8 30.0 6.0 8.9
RandLA-Net [6]    | FPS 20   | mIoU 53.9 | 94.2 26.0 25.8 40.1 38.9 49.2 48.2 7.2 90.7 60.3 73.7 20.4 86.9 56.3 81.4 61.3 66.8 49.2 47.7
KPConv [5]        | FPS −    | mIoU 58.8 | 96.0 30.2 42.5 33.4 44.3 61.5 61.6 11.8 88.8 61.3 72.7 31.6 90.5 64.2 84.8 69.2 69.1 56.4 47.4
BAAF [16]         | FPS 5    | mIoU 59.9 | 95.4 31.8 35.5 48.7 46.7 49.5 55.7 53.0 90.9 62.2 74.4 23.6 89.8 60.8 82.7 63.4 67.9 53.7 52.0

Voxel-based (voxel):
MinkowskiNet-lite [18] | FPS 8.6 | mIoU 57.5 | (class-wise IoU not reported)
MinkowskiNet [18]      | FPS 3.4 | mIoU 63.1 | (class-wise IoU not reported)
SPVCNN-lite [7]        | FPS 8.1 | mIoU 58.5 | (class-wise IoU not reported)
SPVCNN [7]             | FPS 3.2 | mIoU 63.8 | (class-wise IoU not reported)

Image-based (64 × 2048):
SqueezeSeg-CRF [9]    | FPS 55  | mIoU 30.8 | 68.3 18.1 5.1 4.1 4.8 16.5 17.3 1.2 84.9 28.4 54.7 4.6 61.5 29.2 59.6 25.5 54.7 11.2 36.3
SqueezeSegV2-CRF [19] | FPS 40  | mIoU 39.6 | 82.7 21.0 22.6 14.5 15.9 20.2 24.3 2.9 88.5 42.4 65.5 18.7 73.8 41.0 68.5 36.9 58.9 12.9 41.0
SqueezeSegV3 [20]     | FPS 6   | mIoU 55.9 | 92.5 38.7 36.5 29.6 33.0 45.6 46.2 20.1 91.7 63.4 74.8 26.4 89.0 59.4 82.0 58.7 65.4 49.6 58.9
SalsaNext [22]        | FPS 24  | mIoU 59.5 | 91.9 48.3 38.6 38.9 31.9 60.2 59.0 19.4 91.7 63.7 75.8 29.1 90.2 64.2 81.8 63.6 66.5 54.3 62.1
Lite-HDSeg [14]       | FPS 20  | mIoU 63.8 | 92.3 40.0 55.4 37.7 39.6 59.2 71.6 54.3 93.0 68.2 78.3 29.3 91.5 65.0 78.2 65.8 65.1 59.5 67.7
KPRNet [13]           | FPS 0.3 | mIoU 63.1 | 95.5 54.1 47.9 23.6 42.6 65.9 65.0 16.5 93.2 73.9 80.6 30.2 91.7 68.4 85.7 69.8 71.2 58.7 64.1

Image-based (64 × 512):
RangeNet++ [10]   | FPS 38.5 | mIoU 41.9 | 87.4 26.2 26.5 18.6 15.6 31.8 33.6 4.0 91.4 57.0 74.0 26.4 81.9 52.3 77.6 48.4 63.6 36.0 50.0
MPF [23]          | FPS 33.7 | mIoU 48.9 | 91.1 22.0 19.7 18.8 16.5 30.0 36.2 4.2 91.1 61.9 74.1 29.4 86.7 56.2 82.3 51.6 68.9 38.6 49.8
FIDNet-Point [15] | FPS 82.0 | mIoU 51.3 | 90.4 28.6 30.9 34.3 27.0 43.9 48.9 16.8 90.1 58.7 71.4 19.9 84.2 51.2 78.2 51.9 64.5 32.7 50.3
Ours              | FPS 84.9 | mIoU 60.7 | 92.1 45.4 42.9 43.9 46.8 56.4 63.8 29.7 91.3 66.0 75.3 31.1 88.9 60.4 81.9 60.5 67.6 49.5 59.1

Image-based (64 × 1024):
RangeNet++ [10]   | FPS 23.3 | mIoU 48.0 | 90.3 20.6 27.1 25.2 17.6 29.6 34.2 7.1 90.4 52.3 72.7 22.8 83.9 53.3 77.7 52.5 63.7 43.8 47.2
MPF [23]          | FPS 28.5 | mIoU 53.6 | 92.7 28.2 30.5 26.9 25.2 42.5 45.5 9.5 90.5 64.7 74.3 32.0 88.3 59.0 83.4 56.6 69.8 46.0 54.9
FIDNet-Point [15] | FPS 60.9 | mIoU 56.0 | 92.4 44.0 41.5 33.2 30.8 57.9 52.6 18.0 91.0 61.2 73.8 12.6 88.2 57.9 80.8 59.5 65.1 45.3 58.4
Ours              | FPS 67.9 | mIoU 62.3 | 93.0 50.5 47.6 41.7 43.4 64.5 65.2 32.5 90.5 65.5 74.1 29.2 90.9 65.4 81.6 65.4 65.6 55.9 61.0

Image-based (64 × 2048):
RangeNet++ [10]   | FPS 12.8 | mIoU 52.2 | 91.4 25.7 34.4 25.7 23.0 38.3 38.8 4.8 91.8 65.0 75.2 27.8 87.4 58.6 80.5 55.1 64.6 47.9 55.9
MPF [23]          | FPS 20.6 | mIoU 55.5 | 93.4 30.2 38.3 26.1 28.5 48.1 46.1 18.1 90.6 62.3 74.5 30.6 88.5 59.7 83.5 59.7 69.2 49.7 58.1
FIDNet-Point [15] | FPS 33.7 | mIoU 58.6 | 93.0 45.7 42.0 27.9 32.6 62.6 58.1 30.5 90.8 58.3 74.9 20.1 88.5 59.5 83.1 64.3 67.8 52.6 60.0
Ours              | FPS 37.8 | mIoU 64.7 | 91.9 58.6 50.3 40.6 42.3 68.9 65.9 43.5 90.3 60.9 75.1 31.5 91.0 66.2 84.5 69.7 70.0 61.5 67.6

Table 3. Evaluation results on the SemanticPOSS test split. Class-wise IoU is reported in the order: person, rider, car, trunk, plants, traffic sign, pole, trashcan, building, cone/stone, fence, bike, ground, followed by mIoU.

SqueezeSeg [9]          | 14.2 1.0 13.2 10.4 28.0 5.1 5.7 2.3 43.6 0.2 15.6 31.0 75.0 | mIoU 18.9
SqueezeSeg + CRF [9]    | 6.8 0.6 6.7 4.0 2.5 9.1 1.3 0.4 37.1 0.2 8.4 18.5 72.1 | mIoU 12.9
SqueezeSegV2 [19]       | 48.0 9.4 48.5 11.3 50.1 6.7 6.2 14.8 60.4 5.2 22.1 36.1 71.3 | mIoU 30.0
SqueezeSegV2 + CRF [19] | 43.9 7.1 47.9 18.4 40.9 4.8 2.8 7.4 57.5 0.6 12.0 35.3 71.3 | mIoU 26.9
RangeNet53 [10]         | 55.7 4.5 34.4 13.7 57.5 3.7 6.6 23.3 64.9 6.1 22.2 28.3 72.9 | mIoU 30.3
RangeNet53 + KNN [10]   | 57.3 4.6 35.0 14.1 58.3 3.9 6.9 24.1 66.1 6.6 23.4 28.6 73.5 | mIoU 30.9
MINet [24]              | 61.8 12.0 63.3 22.2 68.1 16.3 29.3 28.5 74.6 25.9 31.7 44.5 76.4 | mIoU 42.7
MINet + KNN [24]        | 62.4 12.1 63.8 22.3 68.6 16.7 30.1 28.9 75.1 28.6 32.2 44.9 76.3 | mIoU 43.2
FIDNet-Point [15]       | 71.6 22.7 71.7 22.9 67.7 21.8 27.5 15.8 72.7 31.3 40.4 50.3 79.5 | mIoU 45.8
FIDNet-Point + KNN [15] | 72.2 23.1 72.7 23.0 68.0 22.2 28.6 16.3 73.1 34.0 40.9 50.3 79.1 | mIoU 46.4
Ours                    | 74.9 21.8 77.0 25.3 72.0 18.0 30.9 46.9 75.9 26.1 47.5 51.7 80.7 | mIoU 49.9
Ours + KNN              | 75.5 22.0 77.6 25.3 72.2 18.2 31.5 48.1 76.3 27.7 47.7 51.4 80.3 | mIoU 50.3

4. EXPERIMENT

4.1. Dataset and Implementation details

SemanticKITTI is a large-scale dataset for the task of point cloud segmentation of autonomous driving scenes. It contains 43,551 LiDAR scans from 22 sequences collected from a city in Germany, where sequences 00 to 10 (19,130 scans) are used for training, 11 to 21 (20,351 scans) for testing, and sequence 08 (4,071 scans) for validation. All sequences are labeled with dense point-wise annotations.

SemanticPOSS is a much smaller, sparser, and more challenging benchmark collected by Peking University. It consists of 2,988 different, complex LiDAR scenes, each with a large number of sparse dynamic instances (e.g. pedestrians and bicycles). SemanticPOSS is divided into 6 parts, where we use part 2 as the test set and the others as the training set.

Implementation details. We conduct all the experiments on a single NVIDIA RTX 3060 and RTX 3090 GPU. The Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 is used for network optimization. During training, we adopt random rotation, random point dropout, and the addition of random noise to the X, Y, Z values for data augmentation. The weight decay is set to 1e-4. For SemanticKITTI, we train the network for 100 epochs with an initial learning rate of 1e-2, which is dynamically adjusted by a cosine annealing scheduler. For SemanticPOSS, the network is trained for 3 cycles of 45 epochs, with the minimum and maximum learning rates set to 1e-5 and 1e-3, respectively.
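For illustration, a hedged PyTorch sketch of this optimization setup (SGD with momentum 0.9 and weight decay 1e-4, cosine-annealed from an initial learning rate of 1e-2) is shown below; the model, the dataloader, and the loss computation come from the surrounding training code and are assumed here.

import torch

# Assumed to exist: model, train_loader, compute_total_loss (Eqs. 4-5).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
# Cosine annealing over 100 epochs, as used for SemanticKITTI.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_total_loss(model, batch)
        loss.backward()
        optimizer.step()
    scheduler.step()

For the SemanticPOSS schedule (3 cycles of 45 epochs between 1e-5 and 1e-3), a warm-restart variant such as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts would be the natural counterpart.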
4.2. Results and Discussion

Quantitative Results on SemanticKITTI: Table 2 shows the quantitative results of several state-of-the-art models on the SemanticKITTI benchmark. It reports the input size, frames per second (FPS), mean IoU and class-wise IoU. From these results, it can be seen that for an input of size 64 × 2048, our model achieves state-of-the-art performance (64.7% mIoU) compared with point-based, voxel-based and image-based methods, while maintaining a higher FPS (37.8 FPS). For 64 × 1024 and 64 × 512 inputs, CENet obtains superior results, 67.9 FPS with 62.3% mIoU and 84.9 FPS with 60.7% mIoU, which outperforms the current methods by a large margin. It is worth noting that our CENet shows excellent performance when the input is 64 × 512, surpassing the performance of the baseline method FIDNet [15] and many other methods [20,22,23] under 64 × 2048 input.

Quantitative Results on SemanticPOSS: Table 3 shows the comparison of our proposed CENet with other related works. As can be seen from these results, all methods attain somewhat worse results due to the differences in sensors and environments, as well as the small scale and sparse structure of the features. Nevertheless, it is worth noting that our method outperforms all models not only in overall mIoU, but also in almost every class-wise IoU. This further verifies the effectiveness and efficiency of our network. Specifically, we achieve significant mIoU improvements of 7.1% and 3.9% over the previous best model MINet [24] and the baseline method FIDNet [15], respectively.
Fig. 3. Qualitative analysis on the SemanticKITTI validation set. Panels: (a) Input Point Cloud, (b) Ground Truth, (c) FIDNet-Point, (d) CENet (ours). (a) and (b) show the input LiDAR scan frame and the corresponding segmentation ground truth; (c) and (d) show the segmentation error maps of FIDNet and our method for this scan frame, with red indicating wrong predictions.

Qualitative Result: To better visualize the improvement of our model over the baseline, we provide qualitative comparison examples in Fig. 3. As can be seen from the results, our approach demonstrates a significant improvement over the baseline and is closer to the ground truth.

4.3. Ablation Studies

To quantitatively analyze the effectiveness of the different components, we conducted the following ablation experiments on the SemanticKITTI validation set. Here, we adopt a setup similar to SqueezeSegV3 [20] for efficient training and evaluation. The size of the input range image is set to 64 × 512, and we report the accuracy evaluated directly on the projected 2D image instead of the original 3D points.

Effects of Module Components. Table 4 shows the experimental results of different network design choices, using mIoU, the number of training parameters, and inference time as measures. The results in the first row are obtained using the official FIDNet code. The second row reports the performance of our network with the required normal vectors removed; it still obtains a 3.5% improvement over FIDNet. The third row validates the effectiveness of replacing the 1 × 1 convs with 3 × 3 convs. It can be concluded that although the 3 × 3 conv introduces more parameters than the 1 × 1 conv, it further reduces the model's inference time and slightly improves the performance by 0.3%. This is mainly because a larger kernel size brings a much larger receptive field, and modern computational libraries are highly optimized. Table 5 further reports the difference between 3 × 3 conv and 1 × 1 conv in model latency and total FPS for different input range image resolutions. Overall, 3 × 3 conv improves the model inference speed significantly. The fourth to seventh rows report the performance using different activation functions. From these results, we can see that the introduction of stronger nonlinearities improves the descriptive power of our model at little cost in inference speed. The results in the eighth to eleventh rows show that 1) both auxiliary segmentation head designs can significantly improve the performance of our CENet. Although the integration of the auxiliary heads introduces some additional training parameters and increases the training time, we can remove these auxiliary heads in the inference phase, so they have no effect on the network's latency. 2) The combination of 3 × 3 conv and auxiliary loss outperforms the original 1 × 1 conv with auxiliary loss. 3) Since the Plan B design computes the loss directly on the upsampled feature maps, which serves to refine the features at different stages, Plan B brings a much better performance than Plan A.

Table 4. Ablation study evaluated on the SemanticKITTI validation set (KS = 3 × 3 kernel size; a check mark denotes that the component is enabled).

Baseline | Row | KS | SiLU | H-swish | Plan A | Plan B | mIoU | Params (M) | Latency (ms)
FIDNet   | 1   |    |      |         |        |        | 55.4 | 6.053 | 11.956
         | 2   |    |      |         |        |        | 58.9 | 6.053 | 11.956
         | 3   | ✓  |      |         |        |        | 59.2 | 6.774 | 11.576
         | 4   |    | ✓    |         |        |        | 60.0 | 6.053 | 11.981
         | 5   |    |      | ✓       |        |        | 60.5 | 6.053 | 12.039
         | 6   | ✓  | ✓    |         |        |        | 59.5 | 6.774 | 11.776
         | 7   | ✓  |      | ✓       |        |        | 60.6 | 6.774 | 11.562
Ours     | 8   |    |      |         | ✓      |        | 62.5 | 6.061 | 11.956
         | 9   |    |      |         |        | ✓      | 63.0 | 6.061 | 11.956
         | 10  | ✓  |      |         | ✓      |        | 63.2 | 6.782 | 11.576
         | 11  | ✓  |      |         |        | ✓      | 64.3 | 6.782 | 11.576
         | 12  | ✓  |      | ✓       |        | ✓      | 65.3 | 6.782 | 11.562

Table 5. Impact of kernel size on model inference time.

Kernel Size | Input Resolution | Model Latency (ms) | FPS
1 × 1 | 64 × 2048 | 29.347 | 33.7
1 × 1 | 64 × 1024 | 16.151 | 60.9
1 × 1 | 64 × 512  | 11.956 | 82.0
3 × 3 | 64 × 2048 | 26.128 | 37.8
3 × 3 | 64 × 1024 | 14.578 | 67.4
3 × 3 | 64 × 512  | 11.576 | 84.8
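Latency numbers like those in Table 5 can be measured with a simple GPU timing loop, as in the hedged sketch below; the layer shapes and the use of CUDA-event timing are illustrative assumptions, not the script used by the authors.

import torch
import torch.nn as nn

def measure_latency(layer, x, warmup=20, iters=100):
    # Average forward latency (ms) of `layer` on input `x`, using CUDA events.
    layer, x = layer.cuda().eval(), x.cuda()
    with torch.no_grad():
        for _ in range(warmup):
            layer(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            layer(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(1, 128, 64, 512)                       # example feature map
conv1x1 = nn.Conv2d(128, 128, kernel_size=1)
conv3x3 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
print(measure_latency(conv1x1, x), measure_latency(conv3x3, x))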
Effects of λ in auxiliary loss. The hyperparameter λ in Equation 5 plays an important role in optimizing the model's performance. Hence, we evaluate its effect. As illustrated in Table 6, the introduction of additional supervision loss terms helps improve the segmentation performance, and the best result is achieved with λ = 1.0.

Table 6. Impact of λ in the auxiliary loss.

λ    | 0    | 0.1  | 0.5  | 1.0
mIoU | 59.2 | 61.1 | 62.8 | 64.3

Table 7. Ablation study of the auxiliary loss module on a different backbone.

Backbone                | Params (M) | mIoU
HarDNet (vanilla)       | 2.735      | 55.1
HarDNet (ours)          | 3.138      | 58.7
HarDNet (ours) + Plan B | 3.146      | 61.3
Effectiveness on a Different Backbone. In Table 7, we use a different feature extraction backbone, HarDNet (Harmonic DenseNet), to verify the generalization ability of the auxiliary loss modules. Compared with ResNet and DenseNet, HarDNet can achieve comparable accuracy on several tasks while significantly reducing GPU running time. First, we use FC-HarDNet-70 to conduct the experiment and obtain 55.1% mIoU. Then, we carefully optimize the structure of HarDNet to construct a more powerful baseline with the addition of only a few parameters. Finally, we integrate the auxiliary losses and achieve a consistent improvement in model performance.

5. CONCLUSION

In this paper, we propose a newly-designed network, termed CENet, for the LiDAR point cloud segmentation task. It is a concise and efficient model. Based on an analysis of previous studies, we choose to use standard convolution with a larger kernel size, together with the SiLU and Hardswish activation functions, to improve the learning capability of our network. We then embed multiple auxiliary segmentation heads to further improve the power of the learned features without introducing additional parameters or efficiency cost. Experimental results on SemanticKITTI and SemanticPOSS demonstrate that our CENet achieves state-of-the-art performance.

6. REFERENCES

[1] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall, "Semantickitti: A dataset for semantic scene understanding of lidar sequences," in ICCV, 2019, pp. 9297-9307.
[2] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Trans. Intell. Transp. Syst., 2020.
[3] Yancheng Pan, Biao Gao, Jilin Mei, Sibo Geng, Chengkun Li, and Huijing Zhao, "Semanticposs: A point cloud dataset with large quantity of dynamic instances," in IV. IEEE, 2020, pp. 687-693.
[4] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," NIPS, vol. 30, 2017.
[5] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas, "Kpconv: Flexible and deformable convolution for point clouds," in ICCV, 2019, pp. 6411-6420.
[6] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham, "Randla-net: Efficient semantic segmentation of large-scale point clouds," in CVPR, 2020, pp. 11108-11117.
[7] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han, "Searching efficient 3d architectures with sparse point-voxel convolution," in ECCV. Springer, 2020, pp. 685-702.
[8] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin, "Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation," arXiv preprint arXiv:2008.01550, 2020.
[9] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer, "Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud," in ICRA. IEEE, 2018, pp. 1887-1893.
[10] Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss, "Rangenet++: Fast and accurate lidar semantic segmentation," in IROS. IEEE, 2019, pp. 4213-4220.
[11] Lue Fan, Xuan Xiong, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang, "Rangedet: In defense of range view for lidar-based 3d object detection," in ICCV, 2021, pp. 2918-2927.
[12] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, "Deep learning for 3d point clouds: A survey," IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[13] Deyvid Kochanov, Fatemeh Karimi Nejadasl, and Olaf Booij, "Kprnet: Improving projection-based lidar semantic segmentation," arXiv preprint arXiv:2007.12668, 2020.
[14] Ryan Razani, Ran Cheng, Ehsan Taghavi, and Bingbing Liu, "Lite-hdseg: Lidar semantic segmentation using lite harmonic dense convolutions," in ICRA. IEEE, 2021, pp. 9550-9556.
[15] Yiming Zhao, Lin Bai, and Xinming Huang, "Fidnet: Lidar point cloud semantic segmentation with fully interpolation decoding," in IROS. IEEE, 2021, pp. 4453-4458.
[16] Shi Qiu, Saeed Anwar, and Nick Barnes, "Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion," in CVPR, 2021, pp. 1757-1767.
[17] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in CVPR, 2017, pp. 652-660.
[18] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten, "3d semantic segmentation with submanifold sparse convolutional networks," in CVPR, 2018, pp. 9224-9232.
[19] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer, "Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud," in ICRA. IEEE, 2019, pp. 4376-4382.
[20] Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka, "Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation," in ECCV. Springer, 2020, pp. 1-19.
[21] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun, "Repvgg: Making vgg-style convnets great again," in CVPR, 2021, pp. 13733-13742.
[22] Tiago Cortinhal, George Tzelepis, and Eren Erdal Aksoy, "Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds," in ISVC. Springer, 2020, pp. 207-222.
[23] Yara Ali Alnaggar, Mohamed Afifi, Karim Amer, and Mohamed ElHelw, "Multi projection fusion for real-time semantic segmentation of 3d lidar point clouds," in WACV, 2021, pp. 1800-1809.
[24] S. Li, X. Chen, Y. Liu, D. Dai, C. Stachniss, and J. Gall, "Multi-scale interaction for real-time lidar data segmentation on an embedded platform," IEEE Robotics Autom. Lett., 2021.
