Overview of semantic segmentation
An overview of semantic
image segmentation.
JEREMY JORDAN
21 MAY 2018 12 MIN READ
An overview of semantic image segmentation. https://www.jeremyjordan.me/semantic-segmentation/
An example of semantic segmentation, where the goal is to predict class labels for each pixel in the image.
(Source)
One important thing to note is that we're not separating instances of the same class; we
only care about the category of each pixel. In other words, if you have two objects of the
same category in your input image, the segmentation map does not inherently
distinguish these as separate objects. There exists a different class of models, known as
instance segmentation models, which do distinguish between separate objects of the
same class.
Autonomous vehicles
We need to equip cars with the necessary perception to understand their
environment so that self-driving cars can safely integrate into our existing
roads.
A chest x-ray with the heart (red), lungs (green), and clavicles (blue) segmented. (Source)
Note: For visual clarity, I've labeled a low-resolution prediction map. In reality, the
segmentation label resolution should match the original input's resolution.
Similar to how we treat standard categorical values, we'll create our target by one-hot
encoding the class labels - essentially creating an output channel for each of the
possible classes.
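As a rough illustration of the one-hot target described above, here is a minimal NumPy sketch. The function name `one_hot_encode` and the tiny 2x2 label map are made up for the example, and I'm assuming the labels arrive as an (H, W) array of integer class ids.

```python
import numpy as np

def one_hot_encode(label_map, num_classes):
    """Turn an (H, W) map of integer class ids into an (H, W, C) one-hot target."""
    # Indexing the identity matrix with the label map picks one row per pixel,
    # which gives one output channel per class.
    return np.eye(num_classes, dtype=np.float32)[label_map]

# Hypothetical 2x2 label map with three classes.
labels = np.array([[0, 1],
                   [2, 1]])
target = one_hot_encode(labels, num_classes=3)
print(target.shape)  # (2, 2, 3)
print(target[1, 0])  # [0. 0. 1.]
```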
A prediction can be collapsed into a segmentation map (as shown in the first image) by
taking the argmax of each depth-wise pixel vector.
When we overlay a single channel of our target (or prediction), we refer to this as a
mask which illuminates the regions of an image where a specific class is present.
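To make the argmax and mask steps concrete, here is a small NumPy sketch. The prediction values are invented for illustration, and I'm assuming the class scores sit on the last axis of an (H, W, C) array.

```python
import numpy as np

# Hypothetical (H, W, C) scores for a 2x2 image and 3 classes.
prediction = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
])

# Collapse the class axis: each pixel gets the index of its highest score.
segmentation_map = prediction.argmax(axis=-1)

# A mask for one class marks the pixels where that class was chosen.
class_id = 0
mask = segmentation_map == class_id

print(segmentation_map.tolist())  # [[0, 1], [2, 0]]
print(mask.tolist())              # [[True, False], [False, True]]
```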
Recall that for deep convolutional networks, earlier layers tend to learn low-level
concepts while later layers develop more high-level (and specialized) feature mappings.
In order to maintain expressiveness, we typically need to increase the
number of feature maps (channels) as we get deeper in the network.
This didn't necessarily pose a problem for the task of image classification, because for
that task we only care about what the image contains (and not where it is located).
Thus, we could alleviate computational burden by periodically downsampling our
feature maps through pooling or strided convolutions (i.e., compressing the spatial
resolution) without concern. However, for image segmentation, we would like our
model to produce a full-resolution semantic prediction.
However, transpose convolutions are by far the most popular approach as they
allow us to learn the upsampling operation.
Whereas a typical convolution operation will take the dot product of the values
currently in the filter's view and produce a single value for the corresponding output
position, a transpose convolution essentially does the opposite. For a transpose
convolution, we take a single value from the low-resolution feature map and multiply
all of the weights in our filter by this value, projecting those weighted values into the
output feature map.
For filter sizes which produce an overlap in the output feature map (e.g., a 3x3 filter with
stride 2 - as shown in the below example), the overlapping values are simply added
together. Unfortunately, this tends to produce an undesirable checkerboard artifact in
the output, so it's best to ensure that your filter size does not produce an overlap
(for example, by choosing a filter size that is divisible by the stride).
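Here is a minimal NumPy sketch of the operation described above, written with explicit loops for readability rather than speed. The function name `transpose_conv2d` is my own, and I'm assuming a single channel, a square filter, and no padding.

```python
import numpy as np

def transpose_conv2d(x, kernel, stride):
    """Single-channel transpose convolution with no padding."""
    h, w = x.shape
    k = kernel.shape[0]  # assumes a square kernel
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # Scale the whole filter by one input value and add it into the
            # output window; overlapping windows accumulate.
            rows = slice(i * stride, i * stride + k)
            cols = slice(j * stride, j * stride + k)
            out[rows, cols] += x[i, j] * kernel
    return out

x = np.ones((2, 2))
overlapping = transpose_conv2d(x, np.ones((3, 3)), stride=2)      # 5x5 output
non_overlapping = transpose_conv2d(x, np.ones((2, 2)), stride=2)  # 4x4 output
print(overlapping)
print(non_overlapping)
```

With the 3x3 filter and stride 2, the middle row and column of the output receive contributions from two or four windows, so a constant input produces an uneven output. With the 2x2 filter and stride 2, every output position is written exactly once.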
The full network, as shown below, is trained according to a pixel-wise cross entropy
loss.
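A pixel-wise cross entropy can be sketched in a few lines of NumPy. This is only an illustration under a few assumptions of mine: the network output has already been passed through a softmax, the target is one-hot, and both are shaped (H, W, C). The function name and the `eps` clipping value are hypothetical.

```python
import numpy as np

def pixelwise_cross_entropy(probs, target, eps=1e-7):
    """Mean cross entropy over all pixels.

    probs:  (H, W, C) class probabilities, summing to 1 along the last axis.
    target: (H, W, C) one-hot labels.
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    per_pixel = -(target * np.log(probs)).sum(axis=-1)  # shape (H, W)
    return per_pixel.mean()

# Hypothetical 2x2 image with two classes and a maximally unsure prediction.
target = np.eye(2)[np.array([[0, 1], [1, 0]])]
probs = np.full((2, 2, 2), 0.5)
loss = pixelwise_cross_entropy(probs, target)
print(loss)  # ln(2), roughly 0.693
```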
However, because the encoder module reduces the resolution of the input by a factor of
32, the decoder module struggles to produce fine-grained segmentations (as
shown below).
These skip connections from earlier layers in the network (prior to a downsampling
operation) should provide the necessary detail in order to reconstruct accurate shapes
for segmentation boundaries. Indeed, we can recover more fine-grained detail with the
addition of these skip connections.
Note: The original architecture introduces a decrease in resolution due to the use of
valid padding. However, some practitioners opt to use same padding where the
padding values are obtained by image reflection at the border.
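As a quick sketch of what reflection padding looks like, NumPy's `np.pad` can mirror an array at its border. I'm assuming the `"reflect"` mode here (which does not repeat the edge pixel); `"symmetric"` is the variant that does, and which one a given implementation uses is something you'd need to check.

```python
import numpy as np

# Hypothetical 3x3 single-channel image.
image = np.arange(1, 10).reshape(3, 3)

# Pad one pixel on every side by mirroring the image about its border.
padded = np.pad(image, pad_width=1, mode="reflect")
print(padded)
# [[5 4 5 6 5]
#  [2 1 2 3 2]
#  [5 4 5 6 5]
#  [8 7 8 9 8]
#  [5 4 5 6 5]]
```

A 3x3 convolution with valid padding applied to this 5x5 array returns a 3x3 output, so the resolution of the original image is preserved.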
Whereas Long et al. (FCN paper) reported that data augmentation ("randomly
mirroring and “jittering” the images by translating them up to 32 pixels") did not result
in a noticeable improvement in performance, Ronneberger et al. (U-Net paper) credit
data augmentations ("random elastic deformations of the training samples") as a key
concept for learning. It appears as if the usefulness (and type) of data
augmentation depends on the problem domain.
The standard U-Net model consists of a series of convolution operations for each
"block" in the architecture. As I discussed in my post on common convolutional
network architectures, there exist a number of more advanced "blocks" that can be
substituted in for stacked convolutional layers.
Drozdzal et al. swap out the basic stacked convolution blocks in favor of residual
blocks. This residual block introduces short skip connections (within the block)
alongside the existing long skip connections (between the corresponding feature maps
of encoder and decoder modules) found in the standard U-Net structure. They report
that the short skip connections allow for faster convergence when training and allow
for deeper models to be trained.
Expanding on this, Jegou et al. proposed the use of dense blocks, still following a
U-Net structure, arguing that the "characteristics of DenseNets make them a very good fit
for semantic segmentation as they naturally induce skip connections and multi-scale
supervision." These dense blocks are useful as they carry low-level features from
previous layers directly alongside higher-level features from more recent layers,
allowing for highly efficient feature reuse.
One very important aspect of this architecture is the fact that the upsampling path does
not have a skip connection between the input and output of a dense block. The authors
note that because the "upsampling path increases the feature maps spatial resolution,
the linear growth in the number of features would be too memory demanding." Thus,
only the output of a dense block is passed along in the decoder module.
The FC-DenseNet103 model achieves state-of-the-art results (Oct 2017) on the CamVid dataset.
Dilated/atrous convolutions
One benefit of downsampling a feature map is that it broadens the receptive field (with
respect to the input) for the following filter, given a constant filter size. Recall that this
approach is more desirable than increasing the filter size due to the parameter
inefficiency of large filters (discussed here in Section 3.1). However, this broader
context comes at the cost of reduced spatial resolution.
Some architectures swap out the last few pooling layers for dilated convolutions with
successively higher dilation rates to maintain the same field of view while preventing
loss of spatial detail. However, it is often still too computationally expensive to
completely replace pooling layers with dilated convolutions.
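To get a feel for how dilation widens the field of view without any downsampling, here is a small sketch that computes the receptive field of a stack of stride-1 convolutions. The function name and the example dilation rates are my own choices for illustration.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (along one axis) of stacked stride-1 convolutions."""
    field = 1
    for rate in dilation_rates:
        # A dilated filter spans (kernel_size - 1) * rate + 1 input positions,
        # so each layer widens the field by (kernel_size - 1) * rate.
        field += (kernel_size - 1) * rate
    return field

plain = receptive_field(3, [1, 1, 1])    # three ordinary 3x3 layers
dilated = receptive_field(3, [1, 2, 4])  # same layers, increasing dilation
print(plain)    # 7
print(dilated)  # 15
```

Both stacks have the same number of weights, and neither reduces the spatial resolution of the feature map.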
Because the cross entropy loss evaluates the class predictions for each pixel vector
individually and then averages over all pixels, we're essentially giving equal weight
to each pixel in the image. This can be a problem if your various classes have
unbalanced representation in the image, as training can be dominated by the most
prevalent class. Long et al. (FCN paper) discuss weighting this loss for each output
channel in order to counteract a class imbalance present in the dataset.
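One way to sketch a per-channel weighting is to scale each class's contribution to the cross entropy before averaging. This is an illustration of the general idea rather than the exact scheme from the FCN paper; the function name and the weights `[1.0, 3.0]` are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(probs, target, class_weights, eps=1e-7):
    """Pixel-wise cross entropy with one weight per output channel.

    probs, target: (H, W, C); class_weights: (C,).
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    # Broadcasting scales each class channel by its weight.
    per_pixel = -(class_weights * target * np.log(probs)).sum(axis=-1)
    return per_pixel.mean()

# Hypothetical image where class 1 covers only one of four pixels.
target = np.eye(2)[np.array([[0, 0], [0, 1]])]
probs = np.full((2, 2, 2), 0.5)

unweighted = weighted_cross_entropy(probs, target, np.array([1.0, 1.0]))
weighted = weighted_cross_entropy(probs, target, np.array([1.0, 3.0]))
print(unweighted)  # ln(2), roughly 0.693
print(weighted)    # 1.5 * ln(2), roughly 1.040
```

With these weights the single rare-class pixel contributes as much to the loss as the three pixels of the common class combined.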
Meanwhile, Ronneberger et al. (U-Net paper) discuss a loss weighting scheme for each
pixel such that there is a higher weight at the border of segmented objects. This loss
weighting scheme helped their U-Net model segment cells in biomedical images in a
discontinuous fashion such that individual cells may be easily identified within the
binary segmentation map.
Notice how the binary segmentation map produces clear borders around the cells. (Source)
Another popular loss function for image segmentation tasks is based on the Dice
coefficient, which is essentially a measure of overlap between two samples. This
measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete
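overlap. The Dice coefficient was originally developed for binary data, and can be
calculated as:
$$\text{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$
where $|A \cap B|$ represents the common elements between sets A and B, and $|A|$
represents the number of elements in set A (and likewise for set B).
For the case of evaluating a Dice coefficient on predicted segmentation masks, we can
approximate $|A \cap B|$ as the element-wise multiplication between the prediction and
target mask, and then sum the resulting matrix.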
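For binary masks, the overlap computation described above can be sketched as follows. The function name and the toy masks are made up for the example, and the sketch uses the simple sum for the size of each mask.

```python
import numpy as np

def dice_coefficient(prediction, target):
    """Dice coefficient between two binary masks of the same shape."""
    # Element-wise product is 1 only where both masks are 1.
    intersection = (prediction * target).sum()
    return 2.0 * intersection / (prediction.sum() + target.sum())

# Hypothetical masks: two predicted pixels, one of which is correct.
prediction = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
dice = dice_coefficient(prediction, target)
print(dice)  # 2 * 1 / (2 + 1), roughly 0.667
```

Note that this version divides by zero when both masks are empty, so a practical implementation would need to guard against that case.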
In order to quantify $|A|$ and $|B|$, some researchers use the simple sum whereas other
researchers prefer to use the squared sum for this calculation. I don't have the practical
experience to know which performs better empirically over a wide range of tasks, so I'll
leave you to try them both and see which works better.
In case you were wondering, there's a 2 in the numerator in calculating the Dice
coefficient because our denominator "double counts" the common elements between
the two sets. In order to formulate a loss function which can be minimized, we'll simply
use $1 - \text{Dice}$. This loss function is known as the soft Dice loss because we directly
use the predicted probabilities instead of thresholding and converting them into a
binary mask.
With respect to the neural network output, the numerator is concerned with the
common activations between our prediction and target mask, whereas the
denominator is concerned with the quantity of activations in each mask separately.
This has the effect of normalizing our loss according to the size of the target mask such
that the soft Dice loss does not struggle to learn from classes with less spatial
representation in an image.
A soft Dice loss is calculated for each class separately and then averaged to yield a final
score. An example implementation is provided below.
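A minimal NumPy sketch of such a loss might look like the following. The function name, the `epsilon` smoothing term, the (H, W, C) layout, and the choice of the squared sum in the denominator are all assumptions on my part; swap in the simple sum if that works better for your task.

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, epsilon=1e-6):
    """Soft Dice loss averaged over classes.

    y_true: (H, W, C) one-hot target.
    y_pred: (H, W, C) predicted probabilities (e.g. softmax output).
    """
    spatial_axes = (0, 1)
    # Common activations between prediction and target, per class.
    numerator = 2.0 * (y_pred * y_true).sum(axis=spatial_axes)
    # Quantity of activations in each mask, here using the squared sum.
    denominator = (y_pred ** 2).sum(axis=spatial_axes) + (y_true ** 2).sum(axis=spatial_axes)
    # epsilon only guards against division by zero for absent classes.
    dice_per_class = numerator / (denominator + epsilon)
    return 1.0 - dice_per_class.mean()

# Hypothetical 2x2 target with two classes.
y_true = np.eye(2)[np.array([[0, 1], [1, 0]])]
perfect_loss = soft_dice_loss(y_true, y_true)      # prediction matches the target
worst_loss = soft_dice_loss(y_true, 1.0 - y_true)  # every pixel is wrong
print(perfect_loss)  # close to 0
print(worst_loss)    # 1.0
```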
Datasets
Cityscapes Dataset
Further Reading
Papers
Fluorescence Images
Lectures
Blog posts
PyTorch implementations