Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis
… robots acting in different environments. Semantic segmentation can enhance various subsequent tasks, such as (semantically assisted) person perception, (semantic) free space detection, (semantic) mapping, and (semantic) navigation. In this paper, we propose an efficient and robust RGB-D segmentation approach …
Fig. 2: Overview of our proposed ESANet for efficient RGB-D segmentation (top) and specific network parts (bottom): RGB-D fusion with Squeeze-and-Excitation, context module, decoder module with multi-scale supervision, Non-Bottleneck-1D (NBt1D) block, and learned upsampling. Legend: kw×kh, C: convolution with kernel size kw×kh and C output channels, S2: stride 2, BN: batch normalization, Up.: upsampling, DW: depthwise; a separate symbol in the figure marks concatenation.
… incorporate depth information at all. Therefore, our ESANet uses an additional encoder for depth data. This depth encoder extracts complementary geometric information that is fused into the RGB encoder at several stages using an attention mechanism. Furthermore, both encoders use a revised architecture enabling faster inference. The decoder is comprised of multiple modules, each upsampling the resulting feature maps by a factor of 2 and refining the features using convolutions as well as by incorporating encoder features. Finally, the decoder maps the features to the classes and rescales the class mapping to the input resolution.
Our entire network features simple components implemented in PyTorch [34]. We do not use complex structures or specifically tailored operations, as these are often incompatible with conversion to ONNX [35] or NVIDIA TensorRT and, thus, result in slower inference.
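As an illustration of this deployment path, the sketch below exports a trained model to ONNX, from which a TensorRT engine can then be built; the function, input shapes, and file name are placeholders rather than the paper's actual export script.

```python
import torch

def export_to_onnx(model: torch.nn.Module, path: str = "esanet.onnx") -> None:
    """Export an RGB-D segmentation model to ONNX for later TensorRT conversion (sketch)."""
    model.eval()
    dummy_rgb = torch.randn(1, 3, 480, 640)    # indoor input resolution assumed
    dummy_depth = torch.randn(1, 1, 480, 640)
    torch.onnx.export(model, (dummy_rgb, dummy_depth), path,
                      input_names=["rgb", "depth"],
                      output_names=["segmentation"],
                      opset_version=11)
    # the resulting ONNX file can then be parsed by NVIDIA TensorRT, e.g. with trtexec
```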
In the following, we explain each part of our network design in detail as well as its motivation. Fig. 2 (bottom) depicts the exact structure of our network modules.

A. Encoder
The RGB and depth encoder both use a ResNet architecture [36] as backbone. For efficiency reasons, we do not replace strided convolutions with dilated convolutions as in PSPNet [33] or DeepLabv3 [37]. Thus, the resulting feature maps at the end of the encoder are 32 times smaller than the input image. For a trade-off between speed and accuracy, we use ResNet34 but also show results for ResNet18 and ResNet50. We replace the basic block in each layer of ResNet18 and ResNet34 with a spatially factorized version. More precisely, each 3×3 convolution is replaced by a 3×1 and a 1×3 convolution with a ReLU in between. The so-called Non-Bottleneck-1D block (NBt1D) is depicted in Fig. 2 (violet) and was initially proposed in ERFNet [26] for another network architecture. In our experiments, we show that this block can also be used in ResNet and simultaneously reduces inference time and increases segmentation performance.
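A minimal PyTorch sketch of such a factorized residual block is shown below; the class name, normalization placement, and the omission of dilation and dropout (as used in ERFNet [26]) are our simplifications.

```python
import torch
import torch.nn as nn


class NonBottleneck1D(nn.Module):
    """Residual block whose 3x3 convolutions are factorized into 3x1 and 1x3 (sketch)."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection, as in the ResNet basic block this replaces
        return self.relu(self.block(x) + x)
```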
B. RGB-D Fusion
At each of the five resolution stages in the encoders (see Fig. 2), depth features are fused into the RGB encoder. The features from both modalities are first reweighted with a Squeeze-and-Excitation (SE) module [38] and then summed element-wise, as shown in Fig. 2 (light green). Using this channel attention mechanism, the model can learn which features of which modality to focus on and which to suppress, depending on the given input. In our experiments, we show that this fusion mechanism notably improves segmentation.
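A sketch of this fusion step with one Squeeze-and-Excitation module per modality followed by element-wise addition; the class names and the reduction factor of 16 are assumptions.

```python
import torch
import torch.nn as nn


class SqueezeAndExcitation(nn.Module):
    """Channel attention: global average pooling plus a two-layer gate [38] (sketch)."""

    def __init__(self, channels: int, reduction: int = 16) -> None:
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)


class RGBDFusion(nn.Module):
    """Reweight RGB and depth features per channel, then sum them element-wise."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.se_rgb = SqueezeAndExcitation(channels)
        self.se_depth = SqueezeAndExcitation(channels)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        return self.se_rgb(rgb) + self.se_depth(depth)
```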
C. Context Module
Due to the limited receptive field of ResNet [33], we additionally incorporate context information by aggregating features at different scales using several branches in a context module similar to the Pyramid Pooling Module in PSPNet [33] (see Fig. 2, orange). Since NVIDIA TensorRT only supports pooling with fixed sizes, we carefully designed the context module such that the pooling sizes are always a factor of the input resolution of the context module and no adaptive pooling is required. Note that, depending on the image resolution of the respective dataset, the number of existing factors and, thus, the number of branches b and the pooling sizes p_w^b × p_h^b differ. Our experiments show that this additional context module improves segmentation.
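One way to realize such a context module with fixed pooling windows is sketched below; interpreting the stated pooling sizes as pooled output resolutions, as well as the branch channel width, are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextModule(nn.Module):
    """Aggregate context at several scales with fixed-size average pooling (sketch).

    `bin_sizes` are the pooled output resolutions per branch; they must divide the
    spatial input resolution, e.g. [(1, 1), (4, 3)] for a 20x15 input, where (1, 1)
    corresponds to global average pooling.
    """

    def __init__(self, channels, bin_sizes, branch_channels=128):
        super().__init__()
        self.bin_sizes = bin_sizes
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, branch_channels, 1, bias=False),
                          nn.BatchNorm2d(branch_channels),
                          nn.ReLU(inplace=True))
            for _ in bin_sizes])

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [x]
        for (bh, bw), reduce in zip(self.bin_sizes, self.reduce):
            # fixed pooling window instead of adaptive pooling; requires that
            # bh divides h and bw divides w (TensorRT-friendly)
            kernel = (h // bh, w // bw)
            pooled = F.avg_pool2d(x, kernel_size=kernel, stride=kernel)
            outs.append(F.interpolate(reduce(pooled), size=(h, w), mode='nearest'))
        return torch.cat(outs, dim=1)
```

If the reported sizes are read as output resolutions, a 20×15 module input would use bin sizes (1, 1) and (4, 3), and a 32×16 input would additionally use (16, 8), (8, 4), and (4, 2), matching the branch counts given later for the two dataset resolutions.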
D. Decoder
As shown in Fig. 2, our decoder is comprised of three decoder modules (depicted in red in Fig. 2). Our decoder module extends the one of SwiftNet [30], which is comprised of a 3×3 convolution with a fixed number of 128 channels and a subsequent bilinear upsampling. However, our experiments show that for indoor RGB-D segmentation a more complex decoder is required. Therefore, we use 512 channels in the first decoder module and decrease the number of channels in each 3×3 convolution as the resolution increases. Moreover, we incorporate three additional Non-Bottleneck-1D blocks to further increase segmentation performance. Finally, we upsample the feature maps by a factor of 2.
We do not use transposed convolutions for upsampling as they are computationally expensive and often introduce undesired gridding artifacts into the final segmentation, as shown in Fig. 3 (right). Moreover, instead of using bilinear interpolation, we propose a novel light-weight learned upsampling method (see Fig. 2, dark green), which achieves better segmentation results: In particular, we first use nearest-neighbor upsampling to enlarge the resolution. Afterwards, a 3×3 depthwise convolution is applied to combine adjacent features. We initialize the kernels such that the whole learned upsampling initially mimics bilinear interpolation. However, our network is able to adapt the weights during training and, thus, can learn how to combine adjacent features in a more useful manner, which improves segmentation performance.
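A minimal PyTorch sketch of such a learned upsampling; the class name and the exact initialization constants are ours, chosen so that the interior of the output matches bilinear interpolation after nearest-neighbor upsampling.

```python
import torch
import torch.nn as nn


class LearnedUpsampling(nn.Module):
    """Nearest-neighbor x2 upsampling followed by a 3x3 depthwise convolution whose
    kernels are initialized to mimic bilinear interpolation (sketch)."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        # depthwise: one 3x3 kernel per channel (groups=channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=1, groups=channels, bias=False)
        # after nearest-neighbor x2 upsampling, this kernel reproduces bilinear
        # interpolation in the interior of the feature map
        kernel = torch.tensor([[0.0625, 0.1250, 0.0625],
                               [0.1250, 0.2500, 0.1250],
                               [0.0625, 0.1250, 0.0625]])
        with torch.no_grad():
            self.conv.weight.copy_(kernel.expand(channels, 1, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.upsample(x))
```

Because the depthwise kernels are regular trainable parameters, the network is free to move away from this bilinear-like initialization during training.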
Although being upscaled, the resulting feature maps still lack fine-grained details that were lost during downsampling in the encoders. Therefore, we design skip connections from encoder to decoder stages of the same resolution. To be precise, we take the fused RGB-D encoder feature maps, project them with a 1×1 convolution to the same number of channels used in the decoder, and add them to the decoder feature maps. Incorporating these skip connections results in more detailed semantic segmentations.
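Putting the pieces together, a decoder-module sketch could look as follows; it reuses the NonBottleneck1D and LearnedUpsampling sketches above, and the exact point at which the projected encoder features are added is our assumption, since the text only requires matching resolutions.

```python
import torch
import torch.nn as nn


class DecoderModule(nn.Module):
    """One decoder stage (sketch): 3x3 conv to reduce channels, three NBt1D blocks,
    learned x2 upsampling, and an additive encoder skip connection."""

    def __init__(self, in_channels: int, out_channels: int, skip_channels: int) -> None:
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[NonBottleneck1D(out_channels) for _ in range(3)])
        self.upsample = LearnedUpsampling(out_channels)
        # project fused RGB-D encoder features to the decoder channel count
        self.skip_proj = nn.Conv2d(skip_channels, out_channels, 1)

    def forward(self, x: torch.Tensor, encoder_skip: torch.Tensor) -> torch.Tensor:
        x = self.upsample(self.blocks(self.conv(x)))
        # encoder_skip is assumed to have the decoder output resolution here
        return x + self.skip_proj(encoder_skip)
```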
Similar to [30], [39], we only process feature maps in the decoder until they are 4× smaller than the input images and use a 3×3 convolution to map the features to the classes of the respective dataset. Two final learned upsampling modules restore the resolution of the input image.
Instead of calculating the training loss only at the final output scale, we add supervision to each decoder module. At each scale, a 1×1 convolution computes a segmentation at a smaller scale, which is supervised by the down-scaled ground truth segmentation.
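A hedged sketch of this multi-scale supervision; the helper name, the use of cross-entropy, the void label 255, and nearest-neighbor down-scaling of the labels are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def multiscale_loss(side_outputs, target, class_weights=None):
    """Sum the segmentation loss over all decoder scales (sketch).

    side_outputs: list of logits of shape [B, num_classes, H_s, W_s], one per scale.
    target: ground-truth label map of shape [B, H, W] with class indices.
    """
    criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=255)
    total = 0.0
    for logits in side_outputs:
        # down-scale the ground truth to the side output's resolution
        scaled_target = F.interpolate(target[:, None].float(),
                                      size=logits.shape[-2:],
                                      mode='nearest').squeeze(1).long()
        total = total + criterion(logits, scaled_target)
    return total
```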
Fig. 3: Qualitative comparison of upsampling methods on the NYUv2 test set (same colors as in Fig. 1 and Fig. 6): RGB image, ground truth, learned upsampling (ours), bilinear upsampling, and transposed convolution (ACNet [11]).
IV. EXPERIMENTS
We evaluate our approach on two commonly used RGB-D indoor datasets, namely SUNRGB-D [13] and NYUv2 [12], and present an ablation study of essential parts of our network. In order to demonstrate that our approach is suitable for other areas of application as well, we also show results on the Cityscapes [14] dataset, the most widely used outdoor dataset for semantic segmentation. Finally, instead of reporting benchmark results only, we present qualitative results when using our approach in a robotic indoor application.

A. Implementation Details & Datasets
We trained our networks using PyTorch [34] for 500 epochs with batches of size 8. For optimization, we used both SGD with a momentum of 0.9 and Adam [40] with learning rates of {0.00125, 0.0025, 0.005, 0.01, 0.02, 0.04} and {0.0001, 0.0004}, respectively, and a small weight decay of 0.0001. We adapted the learning rate using PyTorch's one-cycle learning rate scheduler. To further increase the number of training samples, we augmented the images using random scaling, cropping, and flipping. For RGB images, we also applied slight color jittering in HSV space.
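A minimal sketch of this training setup with SGD and PyTorch's one-cycle scheduler; `model`, `train_loader`, and the chosen learning rate are placeholders, and `multiscale_loss` refers to the sketch above.

```python
import torch

# assumptions: `model` is an ESANet-like nn.Module returning per-scale logits,
# and `train_loader` yields (rgb, depth, label) batches of size 8
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
epochs = 500
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=epochs, steps_per_epoch=len(train_loader))

for epoch in range(epochs):
    for rgb, depth, label in train_loader:
        optimizer.zero_grad()
        side_outputs = model(rgb, depth)          # logits at several scales
        loss = multiscale_loss(side_outputs, label)
        loss.backward()
        optimizer.step()
        scheduler.step()                          # one-cycle: one step per batch
```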
The best models were chosen based on the mean intersection over union (mIoU). We used bilinear upsampling to rescale the resulting class mapping to the size of the ground truth segmentation before computing the argmax for the final segmentation mask.
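The evaluation step described here can be sketched as follows; the function names are ours, and the mIoU helper simply operates on the tensors it is given rather than prescribing the exact dataset-level protocol.

```python
import torch
import torch.nn.functional as F


def predict_segmentation(logits, target_size):
    """Rescale class scores to the ground-truth size, then take the argmax."""
    logits = F.interpolate(logits, size=target_size,
                           mode='bilinear', align_corners=False)
    return logits.argmax(dim=1)            # [B, H, W] class indices


def mean_iou(pred, target, num_classes):
    """Mean intersection over union over all classes present in the given tensors (sketch)."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / max(len(ious), 1)
```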
NYUv2 & SUNRGB-D: NYUv2 contains 1,449 indoor RGB-D images, of which 795 are used for training and 654 for testing. We used the common 40-class label setting. SUNRGB-D has 37 classes and consists of 10,335 indoor RGB-D images, including all images of NYUv2. There are 5,285 training and 5,050 testing images. Our ablation study is based on NYUv2 as it is smaller and, thus, leads to faster training. However, according to [41], training on a subset is sufficient for a reliable model selection. For both datasets, we used a network input resolution of 640×480 and applied median frequency class balancing [42]. As the input to the context module has a resolution of 20×15 due to the downsampling of 32, we used b = 2 branches, one with global average pooling and one with a pooling size of 4×3.
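Median frequency class balancing [42] weights each class by the median class frequency divided by the class's own frequency; a sketch over precomputed counts (the variable names are ours):

```python
import torch


def median_frequency_weights(pixel_counts: torch.Tensor,
                             image_pixel_counts: torch.Tensor) -> torch.Tensor:
    """Class weights following median frequency balancing (sketch).

    pixel_counts[c]: number of pixels of class c in the training set.
    image_pixel_counts[c]: total number of pixels of all images containing class c.
    """
    freq = pixel_counts.float() / image_pixel_counts.float()
    return freq.median() / freq
```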
Cityscapes: This dataset contains 5,000 images with fine-grained annotations for 19 classes. The images have a high resolution of 2048×1024. There are 2,975 images for training, 500 for validation, and 1,525 for testing. Cityscapes also provides 20k coarsely annotated images, which we did not use for training. We computed the corresponding depth images from the disparity images. Since we set the network input resolution to 1024×512, the input to our context module has a resolution of 32×16, which allows b = 4 branches in the context module, one with global average pooling and the others with pooling sizes of 16×8, 8×4, and 4×2.
For further details and other hyperparameters, we refer to our implementation available on GitHub.

Fig. 4: Comparison of RGB-D to RGB and depth networks (single encoder) and different backbones on the NYUv2 test set; mean intersection over union vs. FPS (NVIDIA Jetson AGX Xavier, TensorRT 7.1, Float16).

Fig. 5: Ablation study on the NYUv2 test set; mean intersection over union vs. FPS (NVIDIA Jetson AGX Xavier, TensorRT 7.1, Float16). Each color indicates modifying one aspect: purple: number of NBt1D blocks in the decoder module, dark green: upsampling method, and gray: usage of specific network parts with CM: no context module, Skip: no encoder-decoder skip connections, and SE: no Squeeze-and-Excitation before fusing RGB and depth.

B. Results on NYUv2 & SUNRGB-D
Fig. 4 compares our RGB-D approach on NYUv2 to single-modality baselines for RGB and depth (single encoder) and evaluates different encoder backbones. As expected, neither processing depth data nor processing RGB data alone reaches the segmentation performance of our proposed RGB-D network. Remarkably, the shallow ResNet18-based RGB-D network performs better than the much deeper ResNet50-based RGB network while still being faster. Moreover, replacing ResNet's basic block with the Non-Bottleneck-1D (NBt1D) block can further improve both segmentation and inference time. Note that ResNet50 incorporates bottleneck blocks, which cannot be replaced the same way.
Tab. I lists the results of our RGB-D approach for both indoor datasets. For the larger SUNRGB-D dataset, a similar trend can be observed. Compared to the state of the art, our smaller ESANet achieves segmentation results similar to those of the often much deeper networks. Besides focusing on segmentation performance alone, we also strive for low inference time on the embedded hardware of our robots. Therefore, we measured the inference time for all available approaches on an NVIDIA Jetson AGX Xavier using NVIDIA TensorRT. For our carefully designed ESANet, NVIDIA TensorRT enables up to 5× faster inference compared to PyTorch. As shown in Tab. I (last column), our approach enables much faster inference while performing on par with or even better than other approaches. For our application, we choose ESANet with ResNet34 backbone and Non-Bottleneck-1D (NBt1D) block (printed in bold in Tab. I) as it offers the best trade-off between inference time and performance. The last row in Tab. I further indicates that additional pretraining on synthetic data, such as SceneNet [43], should be preferred to deeper backbones, especially if the target dataset is small.

C. Ablation Study on NYUv2
Fig. 5 shows the ablation study for fundamental parts of our network architecture and justifies our design choices. Furthermore, it indicates the impact of each part when it is necessary to adapt our selected network to deviating real-time requirements.
As shown in purple, a shallow decoder similar to SwiftNet [30] is not as good as more complex decoders. Therefore, we gradually increased the number of additional NBt1D blocks in the decoder module. Apparently, a fixed number of three blocks in each decoder module performs better than a different number or a reversed layout of the encoder's design.
In dark green, different upsampling methods in the decoder are displayed. Although increasing inference time, the learned upsampling improves the mIoU by 0.9. Moreover, as shown in Fig. 3, the obtained segmentation contains more fine-grained details compared to using bilinear interpolation. It further prevents the gridding artifacts introduced by transposed convolutions as used in ACNet [11] or RedNet [10].
As shown in gray in Fig. 5, a context module, encoder-decoder skip connections, as well as reweighting modality-specific features with Squeeze-and-Excitation before fusion independently improve segmentation performance. Incorporating all three network parts leads to the best result.
TABLE I: Mean intersection over union of our ESANet compared to state-of-the-art methods on the NYUv2 and SUNRGB-D test sets, ordered by SUNRGB-D performance and backbone complexity. FPS is reported for an NVIDIA Jetson AGX Xavier (Jetpack 4.4, TensorRT 7.1, Float16). Legend: R: ResNet, *: additional test-time augmentation, i.e., flipping or multi-scale (not timed), N/A: no implementation available, †: includes operations that are not supported by TensorRT, O: expected to be slower due to complex backbone.

Method                  | Backbone      | NYUv2 | SUNRGB-D | FPS
FuseNet [9]             | 2× VGG16      | -     | 37.29    | †
RedNet [10]             | 2× R34        | -     | 46.8     | 26.0
SSMA [24]               | 2× mod. R50   | -     | 44.43    | 12.4
MMAF-Net [25]           | 2× R50        | -     | 45.5     | N/A
RedNet [10]             | 2× R50        | -     | 47.8     | 22.1
RDFNet [23]             | 2× R50        | 47.7* | -        | 7.2
ACNet [11]              | 3× R50        | 48.3  | 48.1     | 16.5
SA-Gate [22]            | 2× R50        | 50.4  | 49.4*    | 11.9
SGNet [19]              | R101          | 49.0  | 47.1     | N/A (O)
Idempotent [21]         | 2× R101       | 49.9  | 47.6     | N/A (O)
2.5D Conv [16]          | R101          | 48.5  | 48.2     | N/A (O)
MMAF-Net [25]           | 2× R152       | 44.8  | 47.0     | N/A (O)
RDFNet [23]             | 2× R152       | 50.1* | 47.7*    | 5.8
ESANet-R18              | 2× R18        | 47.32 | 46.24    | 34.7
ESANet-R18-NBt1D        | 2× R18 NBt1D  | 48.17 | 46.85    | 36.3
ESANet-R34              | 2× R34        | 48.81 | 47.08    | 27.5
ESANet-R34-NBt1D        | 2× R34 NBt1D  | 50.30 | 48.17    | 29.7
ESANet-R50              | 2× R50        | 50.53 | 48.31    | 22.6
ESANet (pre. SceneNet)  | 2× R34 NBt1D  | 51.58 | 48.04    | 29.7

TABLE II: Mean intersection over union of our ESANet on Cityscapes for both common input resolutions compared to state-of-the-art methods. FPS is reported for an NVIDIA Jetson AGX Xavier (Jetpack 4.4, TensorRT 7.1, Float16). Legend: *: trained with additional coarse data; a second marker in the original table denotes test server results.

                        |        1024×512        |       2048×1024
Method                  | Val   | Test  | FPS    | Val    | Test   | FPS
RGB:
ERFNet [26]             | -     | 69.7  | 49.9   | -      | -      | -
LEDNet [27]             | -     | 70.6* | 38.5   | -      | -      | -
ESPNetv2 [32]           | 66.4  | 66.2  | 47.4   | -      | -      | -
SwiftNet [30]           | 70.2  | -     | 64.5   | 75.4   | 75.5   | 20.8
BiSeNet [31]            | -     | -     | -      | 74.8   | 74.7   | 20.0
PSPNet [33]             | -     | -     | -      | -      | 81.2*  | 1.8
DeepLabv3 [37]          | -     | -     | -      | 79.3   | 81.3*  | 0.9
ESANet-R18-NBt1D        | 71.48 | -     | 37.2   | 77.95  | -      | 9.8
ESANet-R34-NBt1D        | 72.70 | 72.87 | 32.3   | 78.47  | 77.56  | 8.3
ESANet-R50              | 73.88 | -     | 24.9   | 79.23  | -      | 6.5
RGB-D:
SSMA [24]               | -     | -     | -      | 82.19* | 82.31* | 2.2
SA-Gate [22]            | -     | -     | -      | 81.7   | 82.8   | 2.1
LDFNet [44]             | 68.48 | 71.3  | 25.3   | -      | -      | -
ESANet-R18-NBt1D        | 74.65 | -     | 28.9   | 79.25  | -      | 7.6
ESANet-R34-NBt1D        | 75.22 | 75.65 | 23.4   | 80.09  | 78.42  | 6.2
ESANet-R50              | 75.66 | -     | 16.9   | 79.97  | -      | 4.0

D. Results on Cityscapes
To demonstrate that our approach is applicable to other areas such as outdoor environments as well, in Tab. II, we further present an evaluation on the Cityscapes dataset.
We first focus on the smaller resolution of 1024×512 as it is commonly used for efficient segmentation. Moreover, since most approaches rely solely on RGB as input, we start by comparing a single-modality RGB version of our approach. Efficient approaches with custom architectures such as ERFNet [26], LEDNet [27], and ESPNetv2 [32] are quite fast but also perform notably worse than our ESANet. Compared to ERFNet, LEDNet, and ESPNetv2, SwiftNet [30] is both faster and achieves a higher mIoU. Nevertheless, with an input resolution of 1024×512, our ESANet-R34-NBt1D still exceeds 30 FPS while outperforming all other efficient approaches by at least 2.2 mIoU. Incorporating depth further increases segmentation performance. However, the performance gain is not as high as for the indoor dataset NYUv2. We attribute this to the fact that the disparity images of Cityscapes are not as precise as the indoor depth images of NYUv2 and SUNRGB-D. Compared to the RGB-D approach LDFNet [44] with similar inference time, we achieve a notably higher mIoU.
For completeness, we also evaluated our networks on the full resolution of 2048×1024. Compared to other methods, our ESANet lies in between mobile (SwiftNet, BiSeNet) and non-mobile approaches for both mIoU and inference time. However, compared to SwiftNet (RGB, 2048×1024), our ESANet-R34-NBt1D achieves similar segmentation performance and slightly faster inference while processing RGB-D inputs with the smaller input resolution of 1024×512.

E. Application on our Robots
Instead of evaluating on benchmark datasets only, we further present qualitative results with a Kinect2 sensor [45], [46] in one of our indoor applications. We deployed our proposed ESANet-R34-NBt1D to our robot in order to accomplish the complex system for semantic scene analysis shown in Fig. 1. The obtained segmentation masks enrich the robot's visual perception, enabling stronger person perception and robust semantic mapping including a refined floor representation which indicates free space. Fig. 6 provides an insight into the entire system. For further qualitative results and a comparison to non-semantic scene perception, we refer to the attached video or our repository on GitHub.

Fig. 6: Application in our robotic scene analysis system (legend: wall, floor, ceiling, lamp, chair, table, sofa, picture, box, TV, bag, pillow).

V. CONCLUSION
In this paper, we have presented an efficient RGB-D segmentation approach, called ESANet, which is characterized by two enhanced ResNet-based encoders utilizing the Non-Bottleneck-1D block, an attention-based fusion for incorporating depth information, and a decoder utilizing a novel learned upsampling. On the indoor datasets NYUv2 and SUNRGB-D, our ESANet performs on par or even better while enabling much faster inference compared to other state-of-the-art methods. Thus, it is well suited for embedding in a complex system for scene analysis on mobile robots given limited hardware.
REFERENCES
[1] H.-M. Gross, et al., "TOOMAS: Interactive shopping guide robots in everyday use – final implementation and experiences from long-term field trials," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2009, pp. 2005–2012.
[2] B. Lewandowski, et al., "Socially compliant human-robot interaction for autonomous scanning tasks in supermarket environments," in IEEE Int. Symp. on Robot and Human Interactive Communication (RO-MAN). IEEE, 2020, pp. 363–370.
[3] H.-M. Gross, et al., "Mobile robot companion for walking training of stroke patients in clinical post-stroke rehabilitation," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2017, pp. 1028–1035.
[4] T. Q. Trinh, et al., "Autonomous mobile gait training robot for orthopedic rehabilitation in a clinical environment," in IEEE Int. Conf. on Robot and Human Interactive Communication (RO-MAN), 2020, pp. 580–587.
[5] H.-M. Gross, et al., "Robot companion for domestic health assistance: Implementation, test and case study under everyday conditions in private apartments," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 5992–5999.
[6] H.-M. Gross, et al., "Living with a mobile companion robot in your own apartment – final implementation and results of a 20-weeks field study with 20 seniors," in IEEE Int. Conf. on Robotics and Automation (ICRA), Montreal, Canada. IEEE, 2019, pp. 2253–2259.
[7] D. Seichter, et al., "Multi-task deep learning for depth-based person perception in mobile robotics," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10497–10504.
[8] E. Einhorn and H.-M. Gross, "Generic 2D/3D SLAM with NDT maps for lifelong application," in Europ. Conf. on Mobile Robots (ECMR), 2013.
[9] C. Hazirbas, et al., "FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture," in Asian Conference on Computer Vision (ACCV), 2016, pp. 213–228.
[10] J. Jiang, et al., "RedNet: Residual Encoder-Decoder Network for Indoor RGB-D Semantic Segmentation," arXiv preprint arXiv:1806.01054, 2018.
[11] X. Hu, et al., "ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation," IEEE Int. Conf. on Image Processing (ICIP), 2019.
[12] N. Silberman, et al., "Indoor Segmentation and Support Inference from RGBD Images," in Europ. Conf. on Computer Vision (ECCV), 2012.
[13] S. Song, et al., "SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 567–576.
[14] M. Cordts, et al., "The Cityscapes Dataset for Semantic Urban Scene Understanding," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223, 2016.
[15] Y. Zhong, et al., "3D Geometry-Aware Semantic Labeling of Outdoor Street Scenes," in Int. Conf. on Pattern Recognition (ICPR), 2018, pp. 2343–2349.
[16] Y. Xing, et al., "2.5D Convolution for RGB-D Semantic Segmentation," in IEEE Int. Conf. on Image Processing (ICIP), 2019, pp. 1410–1414.
[17] Y. Xing, et al., "Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing," in Europ. Conf. on Computer Vision (ECCV), 2020, pp. 1–17.
[18] W. Wang and U. Neumann, "Depth-Aware CNN for RGB-D Segmentation," in Europ. Conf. on Computer Vision (ECCV), 2018, pp. 144–161.
[19] L.-Z. Chen, et al., "Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation," arXiv preprint arXiv:2004.04534, pp. 1–11, 2020.
[20] Y. Chen, et al., "3D Neighborhood Convolution: Learning Depth-Aware Features for RGB-D and RGB Semantic Segmentation," in Int. Conf. on 3D Vision (3DV), 2019, pp. 173–182.
[21] Y. Xing, et al., "Coupling Two-Stream RGB-D Semantic Segmentation Network by Idempotent Mappings," in IEEE Int. Conf. on Image Processing (ICIP), 2019, pp. 1850–1854.
[22] X. Chen, et al., "Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation," in Europ. Conf. on Computer Vision (ECCV), 2020, pp. 561–577.
[23] S. Lee, et al., "RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation," Int. Conference on Computer Vision (ICCV), pp. 4990–4999, 2017.
[24] A. Valada, et al., "Self-supervised model adaptation for multimodal semantic segmentation," Int. Journal of Computer Vision (IJCV), 2019.
[25] F. Fooladgar and S. Kasaei, "Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images," arXiv preprint arXiv:1912.11691, pp. 1–12, 2019.
[26] E. Romera, et al., "ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation," IEEE Transactions on Intelligent Transportation Systems (ITS), pp. 263–272, 2018.
[27] Y. Wang, et al., "LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation," in IEEE Int. Conference on Image Processing (ICIP), 2019, pp. 1860–1864.
[28] G. Li, et al., "DABNet: Depth-wise Asymmetric Bottleneck for Real-time Semantic Segmentation," British Machine Vision Conference (BMVC), 2019.
[29] S.-Y. Lo, et al., "Efficient dense modules of asymmetric convolution for real-time semantic segmentation," in ACM Int. Conf. on Multimedia in Asia, 2019, pp. 1–6.
[30] M. Oršić, et al., "In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 12607–12616, 2019.
[31] C. Yu, et al., "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Europ. Conf. on Computer Vision (ECCV), 2018, pp. 325–341.
[32] S. Mehta, et al., "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9190–9200.
[33] H. Zhao, et al., "Pyramid scene parsing network," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
[34] A. Paszke, et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc., 2019, pp. 8024–8035.
[35] J. Bai, et al., "ONNX: Open neural network exchange," https://github.com/onnx/onnx, 2019.
[36] K. He, et al., "Deep residual learning for image recognition," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[37] L.-C. Chen, et al., "Rethinking Atrous Convolution for Semantic Image Segmentation," arXiv preprint arXiv:1706.05587, 2017.
[38] J. Hu, et al., "Squeeze-and-excitation networks," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[39] L.-C. Chen, et al., "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation," in Europ. Conf. on Computer Vision (ECCV), 2018, pp. 801–818.
[40] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in Int. Conf. on Learning Representations (ICLR), 2015.
[41] J. Bornschein, et al., "Small Data, Big Decisions: Model Selection in the Small-Data Regime," in Int. Conf. on Machine Learning (ICML), 2020.
[42] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," Int. Conf. on Computer Vision (ICCV), pp. 2650–2658, 2015.
[43] J. McCormac, et al., "SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?" Int. Conf. on Computer Vision (ICCV), pp. 2697–2706, 2017.
[44] S.-W. Hung, et al., "Incorporating Luminance, Depth and Color Information by a Fusion-Based Network for Semantic Segmentation," in IEEE Int. Conf. on Image Processing (ICIP), 2019, pp. 2374–2378.
[45] L. Xiang, et al., "libfreenect2: Release 0.2," 2016. [Online]. Available: https://zenodo.org/record/50641
[46] F. J. Lawin, et al., "Efficient multi-frequency phase unwrapping using kernel density estimation," in Europ. Conf. on Computer Vision (ECCV), 2016, pp. 170–185.