Article
Semantic Segmentation with Transfer Learning for
Off-Road Autonomous Driving
Suvash Sharma 1 , John E. Ball 1 , Bo Tang 1, * , Daniel W. Carruth 2 , Matthew Doude 2 and
Muhammad Aminul Islam 3
1 Department of Electrical and Computer Engineering, Mississippi State University, Starkville,
MS 39762, USA; [email protected] (S.S.); [email protected] (J.E.B.)
2 Center for Advanced Vehicular Systems, Mississippi State University, Starkville, MS 39762, USA;
[email protected] (D.W.C.); [email protected] (M.D.)
3 Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211,
USA; [email protected]
* Correspondence: [email protected]
Received: 2 April 2019; Accepted: 30 May 2019; Published: 6 June 2019
Abstract: Since state-of-the-art deep learning algorithms demand large training datasets, which are often unavailable in some domains, transferring knowledge from one domain to another has become a popular technique in the computer vision field. However, this transfer is not always straightforward owing to issues such as the size of the original network or large differences between the source and target domains. In this paper, we perform transfer learning for semantic segmentation of off-road driving environments using a pre-trained segmentation network called DeconvNet. We explore and verify two important aspects of transfer learning. First, since the original network is very large and did not perform well for our application, we propose a smaller network, which we call the light-weight network. This light-weight network is half the size of the original DeconvNet architecture. We transferred the knowledge from the pre-trained DeconvNet to our light-weight network and fine-tuned it. Second, we used synthetic datasets as an intermediate domain before training with the real-world off-road driving data. Fine-tuning a model trained with a synthetic dataset that simulates the off-road driving environment provides more accurate results for the segmentation of real-world off-road driving environments than transfer learning without a synthetic dataset does, as long as the synthetic dataset is generated with real-world variations in mind. We also show that using a synthetic dataset that is too simple and/or too random results in negative transfer. We use the Freiburg Forest dataset as the real-world off-road driving dataset.
1. Introduction
Semantic segmentation, a task based on pixel-level image classification, is a fundamental approach
in the field of computer vision for scene understanding. Compared to other techniques such as object detection, in which the exact shape of the object is not known, segmentation provides pixel-level classification output with richer information, including the object’s shape and boundary. Autonomous driving is one of several fields that need rich information for scene understanding. Because the objects of interest, such as roads, trees, and terrain, are continuous rather than discrete structures, detection algorithms often cannot provide detailed information, hindering the performance of autonomous vehicles. However, this is not true of semantic segmentation algorithms, as all the objects of interest are classified on a pixel-by-pixel basis. Nonetheless, to use this technique, one needs careful annotations of each object
of interest in the images along with a complex prediction network. Despite these challenges, there has
been tremendous work and progress in object segmentation in images and videos.
Convolutional Neural Networks (CNNs) such as Alexnet [1], VGGnet [2], and GoogleNet [3] have
been used extensively in several seminal works in the field of semantic segmentation. For semantic
segmentation, either existing classification networks are adopted as a baseline or completely new
architectures are designed from scratch. For the segmentation task that uses an existing network
as a baseline, the learned parameters on that network are used as a priori information. Semantic
segmentation can also be considered as a classification task in which each pixel is labeled with the
class of the corresponding enclosing object. The segmentation algorithm can either be single-step or
multi-step. In a single-step segmentation process, only the classification of pixels is carried out, and
the output of the segmentation network is considered to be the final result. When the segmentation
is a multi-step process, the network output is subjected to a series of post-processing steps such
as conditional random fields (CRFs) and ensemble approaches. CRFs provide a statistical modeling framework for structured prediction. In semantic segmentation, CRFs help to improve the boundary delineation in the segmented outputs. Ensemble approaches pool the strengths of several algorithms, whose results are fused using some combination rule to achieve better performance. However, these techniques increase the computational cost, making them inapplicable to our problem of scene segmentation for autonomous driving; the applicability of such post-processing steps therefore depends on the domain. The performance and usefulness of the segmentation
algorithms are evaluated on the basis of parameters such as accuracy over a benchmark dataset,
algorithm speed, boundary delineation capability, etc.
Because segmentation is important for the identification/classification of objects, the investigation of abnormalities, and similar tasks, it is applied in a number of fields, such as agriculture [4,5], medicine [6,7],
and remote sensing [8–10]. A multi-scale CNN and a series of post-processing techniques are applied
in [11] to provide a scene labeling on several datasets. The concept of both segmentation and detection
is used in [12,13] to classify the images in a pixel-wise manner. Although there has been a lot of
work in semantic segmentation, the major improvement was recorded after [14], which demonstrated superior results on the Pascal Visual Object Classes (VOC) dataset. It performs end-to-end training with supervised pre-training for segmentation, avoiding any post-processing steps. In terms
of architecture, it uses the skip layers method to combine the coarse higher-layer information with
fine lower-layer information. The methods described in [15,16] are based on an encoder–decoder
arrangement of layers in which the max-pooling indices are transferred to the decoder part, making the network more memory-efficient. In both of these works, the mirrored version of the convolutional part
acts as the deconvolutional or decoder part. The concept of dilated convolution to avoid information
loss due to the pooling layer was used in [17]. A fully connected CRF is used in [18] to enhance the
object representation along the boundary. A CRF is used as a post-processing step that improves the
segmentation results produced by the network. An enhanced version of [18] is presented in [19], which is based on spatial pyramid pooling and the concepts of dilated convolution from [17]. A new pooling technique called pyramid pooling is introduced in [20] to increase the contextual information, combined with the dilated convolution technique.
All the works mentioned above are evaluated on several benchmark datasets, and one is said to
be better than another based on the performance on those datasets. However, in real-life scenarios,
there are several areas in which adequate training data are not available. Deep convolutional neural networks require a huge amount of training data to generalize well. The lack of enough
training data in the domain of interest is one of the main reasons for using Transfer Learning (TL).
In TL, the knowledge from a domain, known as the source domain, is transferred to the domain of
interest, known as the target domain. In this technique, the deep neural network is first trained in
the domain where enough data are available. After this, the useful features are incorporated into the
target domain as a priori information. This technique is effective and beneficial when the source and
target domain tasks are comparable. The tendency of convolutional neural networks to learn general features in the lower layers and task-specific features in the higher layers makes the TL technique effective [21,22]. In particular, in fields such as medicine and remote sensing, where datasets with correct annotations are rarely available, the transfer learning technique is highly beneficial. In [23,24],
the transfer learning technique is applied for the segmentation of brain structures in brain images from
different imaging protocols. Fine-tuning of Fast R-CNN [25] for traffic sign detection and classification
for autonomous vehicles is performed in [26].
Apart from finding different applications where transfer learning might be used, there has been
a constant research effort toward the effective transfer of knowledge from one domain to another. Because it is never the case that all of the knowledge learned from the source task is useful for the target task, deciding what to transfer and how to transfer it plays an important role in the optimal performance of the TL approach. A TL method which automatically learns what and how to transfer
from previous experiences is proposed in [27]. A new way of TL for segmentation is devised in [28],
which transfers the learned features from a few strong categories, using pixel-level annotations to
predict the classes that do not have any annotations (known as weak categories). For a similar transfer
scenario, Hong et al. [29] propose an encoder–decoder architecture combined with an attention model
to semantically segment the weak categories. In [30], an ensemble technique, which is a TL approach
that trains multiple models one after the other, is demonstrated when the source and target domains
have drastic differences.
In our work, we use the TL approach for semantic segmentation specifically for off-road
autonomous driving. We use the semantic segmentation network proposed in [16] as a baseline
network. This network is trained with the Pascal VOC datasets [31] for segmentation. This domain differs greatly from the one that we are interested in (the off-road driving scene dataset). Moreover, the off-road driving scene contains fewer classes than the Pascal VOC datasets, which consist of 20 classes. Because of this, we propose decreasing the network size and performing transfer learning on the smaller network. To bridge the difference between the real-world off-road driving scenes and the Pascal VOC datasets, we use different synthetic datasets as an intermediate domain, which may help boost performance in the data-deprived domain. Similarly, to match the lower complexity and the latency requirements of the off-road autonomous driving domain, a smaller network is proposed. Motivated by previous TL approaches in CNNs [22,32] and auto-encoder neural
networks for classification [33], we transfer the trained weights from the original network to the
corresponding layers in the proposed smaller network. However, while most of the state-of-the-art
TL methods perform fine-tuning without making any changes to the original architecture (with the
exception of the last layer), to the best of our knowledge, this is the first attempt to perform transfer
learning from a bigger network to a smaller network, which is helpful to address the two important
requirements of autonomous driving. With several experiments using synthetic and real-world datasets, we verify that the network trained in the source domain, at its original size, may not transfer the best knowledge to the target domain; a smaller portion of the same architecture may work better, depending on the complexity of the target domain. In addition, this work also explores
the effect of using various synthetic datasets as an intermediate domain during TL by assessing the
performance of the network on a real-world dataset.
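As a concrete illustration of this weight-transfer step, the sketch below copies parameters from a larger pre-trained model into the matching layers of a smaller one. It is a minimal PyTorch-style sketch, not the Caffe implementation used in this work, and the layer-name mapping LAYER_MAP and checkpoint file are hypothetical placeholders for the actual correspondence between DeconvNet and the light-weight network.

```python
import torch

def transfer_matching_weights(big_state, small_model, layer_map):
    """Copy pre-trained tensors from a larger network's state dict into the
    corresponding layers of a smaller network, skipping shape mismatches."""
    small_state = small_model.state_dict()
    for small_name, big_name in layer_map.items():
        for suffix in (".weight", ".bias"):
            src, dst = big_name + suffix, small_name + suffix
            if src in big_state and dst in small_state \
                    and big_state[src].shape == small_state[dst].shape:
                small_state[dst] = big_state[src].clone()
    small_model.load_state_dict(small_state)

# Hypothetical correspondence between light-weight layers and DeconvNet layers.
LAYER_MAP = {"conv1": "conv1_1", "conv2": "conv2_1", "conv3": "conv3_1"}
# big_state = torch.load("deconvnet_pretrained.pth")      # assumed checkpoint file
# transfer_matching_weights(big_state, light_weight_net, LAYER_MAP)
```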
The main contributions of this paper are listed as follows:
• We propose a new light-weight network for semantic segmentation. Essentially, the DeconvNet architecture is reduced to half its original size, which performs better for the off-road autonomous driving domain;
• We use the TL technique to segment the Freiburg Forest dataset. During this, the light-weight
network is initialized with the trained weights from the corresponding layers in the DeconvNet architecture;
• We study the effect of using various synthetic datasets as an intermediate domain to segment the
real-world dataset in detail.
The rest of the paper is organized as follows. We briefly review the background and related work
in the semantic segmentation of off-road scenes in Section 2. The details of the proposed methods, including the DeconvNet segmentation network and our proposed light-weight network, are explained in Section 3. In Section 4, we describe all the experiments and the corresponding results, including descriptions of the datasets used. Section 5 provides a brief analysis and discussion of the obtained results. The final section of the paper includes our conclusions and notes on future work.
2.1. Background
The network is trained by optimizing an objective of the form

L = \frac{1}{N} \sum_{i=1}^{N} p(y_i \mid X_i),  (1)

where N is the total number of images or training samples per batch, X_i represents the i-th input sample, and y_i represents the corresponding label. p(\cdot) is the probability of correct classification for the corresponding input data. For any layer l, W_l^t is the weight vector of the l-th layer at time instant t, and U_l^t is the required weight update. If \alpha is the momentum and \mu is the learning rate, learning in the network occurs as follows:

U_l^{t+1} = \alpha U_l^t - \mu \frac{\partial L}{\partial W_l},  (2a)

W_l^{t+1} = W_l^t + U_l^{t+1}.  (2b)
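For concreteness, the update rule of Equations (2a) and (2b) corresponds to the following minimal NumPy sketch; the gradient grad_W is assumed to come from backpropagation, and the default values mirror the momentum and learning rate used later in this paper.

```python
import numpy as np

def momentum_sgd_step(W, U, grad_W, alpha=0.9, mu=0.001):
    """One step of Equations (2a) and (2b):
    U_{t+1} = alpha * U_t - mu * dL/dW,   W_{t+1} = W_t + U_{t+1}."""
    U_next = alpha * U - mu * grad_W   # Equation (2a)
    W_next = W + U_next                # Equation (2b)
    return W_next, U_next

# Example with a single 3x3 weight tensor and a dummy gradient.
W, U = np.zeros((3, 3)), np.zeros((3, 3))
W, U = momentum_sgd_step(W, U, grad_W=np.ones((3, 3)))
```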
3. Proposed Methods
The performance of an autonomous vehicle depends heavily on the design and decisions made by the system software. Because of the nature of autonomous vehicles,
a fast processing speed is required for scene understanding and inferencing which ultimately gives
robust control over decision making of the vehicle. We consider this requirement to be very important
in this work, thus we aim for the smallest possible network size with the highest possible accuracy.
In addition to this, transferring all the weights from large pre-trained networks provided sub-optimal
results for our synthetic and real-world dataset as the target domains are simpler than the source
domain. Therefore, to use a convolutional network of the most suitable size (one that achieves a suitable processing speed) while maintaining an acceptable accuracy level, we propose a smaller convolutional network, called the light-weight network, taking [16] as a base model. Our proposed network, which better suits our application, is half the size of the original DeconvNet architecture. Figure 1 shows the structure of
our light-weight network architecture.
The DeconvNet [16] is learned on top of the VGG-16 network [2] and takes 2D images 224 × 224
pixels in size. The deconvolutional part is a mirrored version of the convolutional part and contains
13 layers on both the convolutional and deconvolutional sides. The convolutional part converges into two fully connected layers appended at the end to impose class-specific projections. It is trained using a two-stage procedure in which the first stage involves training with easy examples, and the second stage involves fine-tuning the network learned in the first stage with more challenging images. Our light-weight network consists of seven convolutional layers and three pooling layers on the convolutional side. The deconvolutional network is the mirrored version of the convolutional network. The major modification with respect to the architecture of [16] is the removal of some intermediate layers, including the fully connected layers, which reduces the computational complexity of the network.
Both the architectures, DeconvNet and light-weight, are called encoder–decoder-based architectures,
in which the convolutional part downsamples and the deconvolutional part upsamples the feature
maps. Such architectures allow the use of max-pooling indices during upsampling which helps to
obtain better segmentation maps with preserved global context information. However, the use of
max-pooling indices slightly increases the computational cost. The original DeconvNet architecture
and proposed light-weight network architectures are shown in Figure 1. The details, including each
layer’s output and the kernel size of our light-weight network architecture, are shown in Table 1.
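To make the encoder–decoder arrangement and the reuse of max-pooling indices concrete, the following is a minimal PyTorch-style sketch of a single pooling/unpooling stage; it is an illustrative simplification with placeholder channel sizes, not the light-weight architecture of Table 1 (which was implemented in Caffe).

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """One pooling/unpooling stage illustrating how max-pooling indices
    from the encoder are reused by the decoder for upsampling."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # keep indices
        self.unpool = nn.MaxUnpool2d(2, stride=2)                   # reuse indices
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(32, num_classes, 1)             # per-pixel class scores

    def forward(self, x):
        f = self.enc(x)
        pooled, idx = self.pool(f)     # downsample, remembering where the maxima were
        up = self.unpool(pooled, idx)  # upsample back to the pre-pooling resolution
        return self.classifier(self.dec(up))

# logits = TinyEncoderDecoder()(torch.randn(1, 3, 224, 224))  # -> shape (1, 5, 224, 224)
```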
Figure 1. Top: Original DeconvNet architecture, Bottom: Proposed light-weight network architecture.
In the following two sections, we present a comparative study of the original and proposed networks in terms of computational complexity and latency.
In Equation (3), l represents the corresponding layer; n_{l-1} represents the number of filters in the (l-1)-th layer; s_l represents the spatial size (length) of the filter in the l-th layer; and m_l is the spatial size of
the output feature map. DeconvNet consists of 13 convolutional layers and 13 deconvolutional layers,
whereas the proposed light-weight network consists of seven convolutional and seven deconvolutional
layers. Incorporating the fact that the convolution and deconvolution operations are the same in
terms of computation, the overall computational complexity for both networks is shown in Table 2.
The proposed light-weight network has a complexity 1.56 times lower compared to that of the original
network. This reduction in complexity is in favor of the low latency requirement of autonomous driving.
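As an illustration of how such a comparison can be computed, the sketch below sums the standard per-layer cost n_{l-1} * s_l^2 * n_l * m_l^2 from [39] over a list of layers (the factor n_l, the number of filters in layer l, is part of the standard expression even though it is not restated above); the layer configurations shown are made-up placeholders, not the actual DeconvNet or light-weight settings.

```python
def conv_complexity(layers):
    """Sum of n_{l-1} * s_l**2 * n_l * m_l**2 over all layers, where each layer is
    given as (input_filters, filter_size, output_filters, output_map_size)."""
    return sum(n_in * s * s * n_out * m * m for n_in, s, n_out, m in layers)

# Made-up layer lists for a deeper and a shallower network (not the real configurations).
deep = [(3, 3, 64, 224), (64, 3, 64, 224), (64, 3, 128, 112), (128, 3, 128, 112)]
shallow = [(3, 3, 64, 224), (64, 3, 128, 112)]
print(conv_complexity(deep) / conv_complexity(shallow))  # relative complexity ratio
```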
Table 1. Detailed structure of proposed light-weight network architecture. Note that C is the number
of classes.
The proposed light-weight network has a frame rate of 21 frames per second (fps), which is better than that of the original network (17.7 fps).
3.2. Training
The second part of this work is about actual learning and fine-tuning the network with synthetic
and real-world datasets. We fine-tuned our proposed light-weight network with synthetic datasets as well as with a real-world dataset and report the results. Here, we explore the advantages and
disadvantages of using a synthetic dataset. We used the synthetic dataset as the intermediate domain
and the real-world dataset as the final domain. In the first training method, we performed transfer
learning using only the real-world data and observed the results. In the second training technique,
we trained the light-weight network using the synthetic dataset as an intermediate domain. In this
work, we are interested in assessing the effectiveness of our segmentation results in a real-world scenario by fine-tuning the light-weight network trained with a synthetic dataset. To do so, we fine-tuned the original model with the synthetic dataset as a first step and transferred this knowledge to the real-world dataset as a final step. As we are interested in the off-road autonomous driving scenario, we focused on how transfer learning works for segmenting the real-world dataset with and without using a synthetic dataset.
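The two training routes can be summarized by the following sketch, where load_pretrained_deconvnet and fine_tune are hypothetical placeholders for the corresponding Caffe training runs rather than real functions:

```python
# Hypothetical helpers standing in for the Caffe solver runs used in this work:
#   load_pretrained_deconvnet() -> model with Pascal VOC weights
#   fine_tune(model, dataset)   -> model fine-tuned on the given dataset

def transfer_learning(real_data, synthetic_data=None):
    model = load_pretrained_deconvnet()           # source domain: Pascal VOC
    if synthetic_data is not None:
        model = fine_tune(model, synthetic_data)  # optional intermediate domain
    return fine_tune(model, real_data)            # target domain: Freiburg Forest

# model_direct    = transfer_learning(freiburg_train)
# model_via_synth = transfer_learning(freiburg_train, synthetic_data=synth_train)
```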
In this work, we used the softmax loss available in the Caffe framework [40] as the optimization function. This loss function is basically a multinomial logistic loss applied to the softmax of the output of the final layer of the network. The softmax function is the most common function used in the output of CNNs for classification. It is used as a layer in the CNN architecture that takes an N-dimensional feature vector and produces probability values in the range (0, 1) as output.
Considering [x_1, x_2, x_3, ..., x_N] as the input to the softmax layer and [o_1, o_2, o_3, ..., o_N] as its output, the input–output mapping occurs as in Equation (4):

o_i = \frac{e^{x_i}}{\sum_{y=1}^{N} e^{x_y}}, \quad \forall i \in 1 \ldots N.  (4)
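A minimal NumPy sketch of Equation (4), with the usual max-subtraction added only for numerical stability:

```python
import numpy as np

def softmax(x):
    """Equation (4): o_i = exp(x_i) / sum_y exp(x_y)."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approx. [0.659, 0.242, 0.099]
```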
Therefore, in the classification or segmentation of input images, the softmax layer produces the probability values for all possible classes. On the basis of these probabilities, any test datum (or pixel in the case of segmentation) is assigned to the class with the maximum probability. Consider (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) to be any n training points, where x denotes the training data and y denotes the corresponding label. In Caffe [40], the softmax loss is defined in a composite form
by applying multinomial logistic loss to the softmax layer’s output. In Equation (5), the softmax loss is
defined as a cost function to be optimized.
J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} \mathbb{1}\{y_i = j\} \log \frac{e^{\theta_j^T x_i}}{\sum_{k=1}^{c} e^{\theta_k^T x_i}},  (5)
where c represents the total number of classes and \theta^T represents the transpose of the weight matrix of the network at that instant in time. With this loss function, the training was performed using the stochastic gradient descent method with a learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0005 on an Nvidia Quadro GP100 GPU with 16 GB of memory.
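For illustration only, an equivalent loss and optimizer configuration in PyTorch (the actual training used Caffe) might look as follows; light_weight_net here is a stand-in module, not the proposed architecture.

```python
import torch
import torch.nn as nn

# Stand-in for the proposed light-weight network: any module producing
# per-pixel class scores of shape [batch, C, H, W].
light_weight_net = nn.Conv2d(3, 5, kernel_size=1)

# CrossEntropyLoss applies softmax + multinomial logistic loss, i.e. Equation (5).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(light_weight_net.parameters(),
                            lr=0.001, momentum=0.9, weight_decay=0.0005)

images = torch.randn(2, 3, 224, 224)                # dummy image batch
labels = torch.randint(0, 5, (2, 224, 224))         # dummy per-pixel labels
loss = criterion(light_weight_net(images), labels)  # pixel-wise softmax loss
loss.backward()
optimizer.step()
```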
Figure 2. Sample images from two-class synthetic dataset. Best viewed in color.
Figure 3. Three sample images from the four-class high-definition dataset. Best viewed in color.
Figure 4. Sample images from four-class random synthetic dataset. Best viewed in color.
Figure 5. Sample images from Freiburg Forest dataset. Best viewed in color.
Figure 6. Segmentation of the Freiburg Forest dataset (a–c): test images, (d–f): corresponding
segmented images using DeconvNet, (g–i): corresponding segmented images using the proposed
light-weight network. Note that the color code for classes is: yellow: tree, green: road, blue: sky, red:
ground, black: obstacle. Best viewed in color.
The proposed algorithm achieved 93.1 percent overall accuracy with only the outermost layer
learning from scratch and 94.43 percent accuracy with the two outermost layers learning from scratch
with a learning rate of 0.01. All the other layers are slowly modified with a learning rate of 0.001.
This way of fine-tuning essentially means adapting the DeconvNet to the new domain, where the general properties are slowly modified/learned and the specific properties are quickly modified/learned. The question of which layers should be learned from scratch is open-ended and is mostly a function of the difference between the source and target domains. The results produced by the model
with the best accuracy (the one that is trained with the two outermost layers learned from scratch) are
shown in Figure 6.
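As an illustration of this layer-wise fine-tuning scheme, the sketch below assigns a learning rate of 0.01 to the outermost layers and 0.001 to the remaining layers using optimizer parameter groups; this is a PyTorch-style analogue of the per-layer multipliers one would set in Caffe, and the submodule names backbone and classifier are hypothetical.

```python
import torch
import torch.nn as nn

# Tiny stand-in model: "backbone" for slowly adapted layers, "classifier" for
# the outermost layers that are learned from scratch (names are illustrative).
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 32, 3, padding=1),
    "classifier": nn.Conv2d(32, 5, 1),
})

optimizer = torch.optim.SGD(
    [
        {"params": model["backbone"].parameters()},                # default lr = 0.001
        {"params": model["classifier"].parameters(), "lr": 0.01},  # learned from scratch
    ],
    lr=0.001, momentum=0.9, weight_decay=0.0005,
)
```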
Figure 7. Segmentation of the two-class synthetic dataset (a–c): test images, (d–f): corresponding
segmented images using DeconvNet, (g–i): corresponding segmented images using the proposed
light-weight network. Note that the color code for classes is: yellow: tree, green: ground. Best viewed
in color.
The four-class high-definition dataset considers more realistic properties of the real-world environment in terms of the number of classes and intra-class variability.
Figure 8. Segmentation of the four-class high-definition dataset (a–c): test images, (d–f): segmented
images using DeconvNet, (g–i): segmented images using proposed light-weight network. Note that the
color code for classes is: green: ground, red: vegetation, yellow: tree, blue: sky. Best viewed in color.
As stated above, this dataset gives greater consideration to the real-world properties of the forest environment. However, the reduction in the overall accuracy could be due to increased variation within the dataset, which caused the network to learn features that are less correlated with the target domain.
This phenomenon is sometimes called negative transfer. In Figure 10, we show the comparative results
for all the experiments.
Figure 9. Segmentation of the four-class random dataset (a–c): test images, (d–f): segmented images
using DeconvNet, (g–i): segmented images using proposed light-weight network. Note that the color
code for classes is: green: ground, red: vegetation, yellow: tree, blue: sky. Best viewed in color.
When the two-class synthetic or the four-class random synthetic dataset was used as the intermediate domain, the performance decreased slightly. The two-class synthetic dataset is a simpler dataset that does not take into account real-world environmental effects in terms of either the number of classes or their properties. This dataset merely increased the data volume, with no helpful information learned before moving into the target domain, causing negative transfer.
The random dataset, on the other hand, includes images from three different environments: an American Southeast forest ecosystem, an American Southeast meadow ecosystem, and an American Southwest desert ecosystem, each under various lighting conditions. These different environments caused a high level of randomness and a lower correlation with the target domain; this dataset likewise added no helpful knowledge during transfer learning and also caused negative transfer. The four-class high-definition dataset gave a positive TL performance, with an accuracy of 94.59% on the Freiburg test set. Unlike the two other datasets, this dataset has a higher correlation with the target domain. Additionally, the huge randomness caused by the various ecosystems in the four-class random dataset is not present in the four-class high-definition dataset. The forest and ground structures are comparatively more similar to those of the target domain, which leads to the improved performance when training with the Freiburg dataset.
Table 4. Quantitative results produced by DeconvNet and the proposed network for various TL
experiments. Shading indicates the improvement of one method over another.
We show the confusion matrices for each experiment performed with the proposed light-weight
network in Table 5. Each entry is the percentage of correctly or falsely classified pixels over all test images. We can see that the obstacle class has the lowest accuracy and the sky class has the highest accuracy in each TL experiment. The cause of the low accuracy for obstacles is that the pixels belonging to this class are very limited in the training datasets compared to the other classes. In addition, the obstacles in the training images have less structural uniformity. This results in the network learning less about the obstacle class, causing predictions biased in favor of classes with a higher number of pixels.
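For reference, per-pixel confusion matrices like those in Table 5 can be accumulated with a routine such as the following NumPy sketch; the class count, dummy labels, and per-row normalization are illustrative assumptions rather than the exact bookkeeping used for the table.

```python
import numpy as np

def pixel_confusion_matrix(y_true, y_pred, num_classes=5):
    """Entry [i, j] counts pixels whose actual class is i and predicted class is j."""
    idx = y_true.astype(int) * num_classes + y_pred.astype(int)
    counts = np.bincount(idx.ravel(), minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

# Dummy example: random labels for two 8x8 "images" with 5 classes.
rng = np.random.default_rng(0)
gt = rng.integers(0, 5, (2, 8, 8))
pred = rng.integers(0, 5, (2, 8, 8))
cm = pixel_confusion_matrix(gt, pred)
per_class_recall = np.diag(cm) / cm.sum(axis=1)  # fraction of each actual class recovered
```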
(Figure 10 column layout, left to right: input image, ground truth, without synthetic dataset, two-class synthetic, four-class high-definition, four-class random.)
Figure 10. Examples to show that the light-weight network produces better results than DeconvNet
for each of the experiments. Note that each pair of rows (a,b) represents the results produced by
DeconvNet and the proposed light-weight network, respectively. Best viewed in color.
Table 5. Confusion matrices of the test results produced by the proposed network for different TL experiments on the Freiburg Forest dataset. Note that each entry is an overall percentage. (In order: without using a synthetic dataset; using the two-class synthetic dataset; using the four-class high-definition synthetic dataset; using the four-class random synthetic dataset.)

Without synthetic dataset (columns: predicted class)
         Obstacle   Grass   Road    Tree    Sky
Grass    12.99      90.83   13.60   3.18    0.09
Road     5.58       3.11    83.09   0.98    0.17
Tree     19.28      5.62    2.44    93.58   4.62
Sky      1.28       0.30    0.84    2.12    95.07

Two-class synthetic dataset (columns: predicted class)
         Obstacle   Grass   Road    Tree    Sky
Grass    14.92      91.07   12.89   3.33    0.15
Road     4.76       3.01    84.28   1.02    0.23
Tree     23.06      5.35    2.23    93.32   4.31
Sky      1.54       0.43    0.577   2.21    95.23

Four-class high-definition synthetic dataset (columns: predicted class)
         Obstacle   Grass   Road    Tree    Sky
Grass    16.43      89.89   11.31   2.99    0.05
Road     4.85       3.47    86.72   1.07    0.21
Tree     18.16      6.17    1.43    93.41   4.50
Sky      1.10       0.35    0.50    2.40    95.18

Four-class random synthetic dataset (columns: predicted class)
         Obstacle   Grass   Road    Tree    Sky
Grass    15.22      90.42   11.75   3.21    0.25
Road     4.16       3.38    85.57   1.03    0.36
Tree     19.20      5.81    2.24    93.36   4.49
Sky      2.34       0.25    0.39    2.27    94.85
Author Contributions: S.S. provided conceptualization, implementation and writing; J.E.B. and B.T. gave
a detailed revision and suggestions about experimental design; D.W.C., M.D. and M.A.I. provided the data
and important suggestions for writing.
Funding: Portions of this work were performed in association with the Halo Project, a research and development
project with the Center for Advanced Vehicular Systems (CAVS) at Mississippi State University. Learn more at
http://www.cavs.msstate.edu.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks.
In Proceedings of the NIPS’12 the 25th International Conference on Neural Information Processing Systems,
Lake Tahoe, Nevada, 3–6 December 2012; pp. 1097–1105.
2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
2014, arXiv:1409.1556.
3. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
4. Lin, K.; Gong, L.; Huang, Y.; Liu, C.; Pan, J. Deep learning-based segmentation and quantification of
cucumber Powdery Mildew using convolutional neural network. Front. Plant Sci. 2019, 10, 155. [CrossRef]
[PubMed]
5. Bargoti, S.; Underwood, J.P. Image segmentation for fruit detection and yield estimation in apple orchards.
J. Field Robot. 2017, 34, 1039–1060. [CrossRef]
6. Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. Deep neural networks segment neuronal
membranes in electron microscopy images. In Proceedings of the NIPS’12 the 25th International Conference
on Neural Information Processing Systems, Lake Tahoe, Nevada, 3–6 December 2012; pp. 2843–2851.
7. Kolařík, M.; Burget, R.; Uher, V.; Říha, K.; Dutta, M.K. Optimized High Resolution 3D Dense-U-Net Network
for Brain and Spine Segmentation. Appl. Sci. 2019, 9, 404. [CrossRef]
8. Liu, Y.; Ren, Q.; Geng, J.; Ding, M.; Li, J. Efficient Patch-Wise Semantic Segmentation for Large-Scale Remote
Sensing Images. Sensors 2018, 18, 3232. [CrossRef] [PubMed]
9. Pan, X.; Gao, L.; Zhang, B.; Yang, F.; Liao, W. High-Resolution Aerial Imagery Semantic Labeling with Dense
Pyramid Network. Sensors 2018, 18, 3774. [CrossRef] [PubMed]
10. Papadomanolaki, M.; Vakalopoulou, M.; Karantzalos, K. A Novel Object-Based Deep Learning Framework
for Semantic Segmentation of Very High-Resolution Remote Sensing Data: Comparison with Convolutional
and Fully Convolutional Networks. Remote Sens. 2019, 11, 684. [CrossRef]
11. Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans.
Pattern Anal. Mach. Intell. 2013, 35, 1915–1929. [CrossRef] [PubMed]
12. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D images for object detection
and segmentation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 345–360.
13. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous detection and segmentation. In European
Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 297–312.
14. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings
of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12
June 2015; pp. 3431–3440.
15. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for
image segmentation. arXiv 2015, arXiv:1511.00561.
16. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of
the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015;
pp. 1520–1528.
17. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
18. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep
convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
19. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal.
Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
20. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
21. Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning transferable features with deep adaptation networks.
arXiv 2015, arXiv:1502.02791.
22. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks?
In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence,
N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: New York, NY, USA, 2014; pp. 3320–3328.
23. Van Opbroek, A.; Ikram, M.A.; Vernooij, M.W.; de Bruijne, M. Supervised image segmentation across scanner
protocols: A transfer learning approach. In Proceedings of the International Workshop on Machine Learning
in Medical Imaging, Nice, France, 1 October 2012; pp. 160–167.
24. Van Opbroek, A.; Ikram, M.A.; Vernooij, M.W.; De Bruijne, M. Transfer learning improves supervised image
segmentation across imaging protocols. IEEE Trans. Med. Imaging 2015, 34, 1018–1030. [CrossRef] [PubMed]
25. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago,
Chile, 13–16 December 2015; pp. 1440–1448. [CrossRef]
26. Wei, L.; Runge, L.; Xiaolei, L. Traffic sign detection and recognition via transfer learning. In Proceedings of
the 2018 Chinese Control And Decision Conference (CCDC), Shenyang, China, 9–11 June 2018; pp. 5884–5887.
27. Ying, W.; Zhang, Y.; Huang, J.; Yang, Q. Transfer learning via learning to transfer. In Proceedings of the 35th
International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5072–5081.
28. Xiao, H.; Wei, Y.; Liu, Y.; Zhang, M.; Feng, J. Transferable Semi-supervised Semantic Segmentation.
arXiv 2017, arXiv:1711.06828.
29. Hong, S.; Oh, J.; Lee, H.; Han, B. Learning transferrable knowledge for semantic segmentation with deep
convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3204–3212.
30. Nigam, I.; Huang, C.; Ramanan, D. Ensemble Knowledge Transfer for Semantic Segmentation.
In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe,
NV, USA, 12–15 March 2018; pp. 1499–1508.
31. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc)
challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [CrossRef]
32. Bengio, Y. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the
UTLW’11 the 2011 International Conference on Unsupervised and Transfer Learning Workshop, Washington,
DC, USA, 2 July 2011; pp. 17–36.
33. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML
Workshop on Unsupervised and Transfer Learning, Edinburgh, Scotland, 27 June 2012; pp. 37–49.
34. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016;
Volume 1.
35. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
[CrossRef]
36. Maturana, D.; Chou, P.W.; Uenoyama, M.; Scherer, S. Real-time semantic mapping for autonomous off-road
navigation. In Field and Service Robotics; Springer: Berlin/Heidelberg, Germany, 2018; pp. 335–350.
37. Adhikari, S.P.; Yang, C.; Slot, K.; Kim, H. Accurate Natural Trail Detection Using a Combination of a Deep
Neural Network and Dynamic Programming. Sensors 2018, 18, 178. [CrossRef] [PubMed]
38. Holder, C.J.; Breckon, T.P.; Wei, X. From on-road to off: transfer learning within a deep convolutional neural
network for segmentation and classification of off-road scenes. In Proceedings of the European Conference
on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 149–162.
39. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360.
40. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe:
Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international
conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 675–678.
41. Valada, A.; Oliveira, G.; Brox, T.; Burgard, W. Deep Multispectral Semantic Scene Understanding of
Forested Environments using Multimodal Fusion. In Proceedings of the 2016 International Symposium on
Experimental Robotics (ISER 2016), Tokyo, Japan, 3–6 October 2016.
42. Hudson, C.R.; Goodin, C.; Doude, M.; Carruth, D.W. Analysis of Dual LIDAR Placement for Off-Road
Autonomy Using MAVS. In Proceedings of the 2018 World Symposium on Digital Intelligence for Systems
and Machines (DISA), Kosice, Slovakia, 23–25 August 2018; pp. 137–142.
43. Goodin, C.; Sharma, S.; Doude, M.; Carruth, D.; Dabbiru, L.; Hudson, C. Training of Neural Networks with
Automated Labeling of Simulated Sensor Data; SAE Technical Paper; Society of Automotive Engineers: Troy, MI,
USA, April, 2019.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).