proaches incorporate probabilistic graphical models, such
as Conditional Random Fields (CRFs) and Markov Random
Fields (MRFs), into DL architectures.
Chen et al. [38] proposed a semantic segmentation algo-
rithm based on the combination of CNNs and fully connected
CRFs (Figure 10). They showed that responses from the final
layer of deep CNNs are not sufficiently localized for accurate
object segmentation (due to the invariance properties that
make CNNs good for high level tasks such as classification).
To overcome the poor localization property of deep CNNs,
they combined the responses at the final CNN layer with a
fully-connected CRF. They showed that their model is able to
localize segment boundaries with higher accuracy than was
possible with previous methods.
Fig. 10. A CNN+CRF model. The coarse score map of a CNN is up-
sampled via bilinear interpolation, and fed to a fully-connected CRF
to refine the segmentation result. From [38].
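The up-sampling step of this pipeline can be illustrated with a minimal NumPy sketch. The function name and shapes below are our own illustration, not the authors' code, and the fully-connected CRF refinement (mean-field inference over learned pairwise potentials between all pixel pairs) is omitted; we only show how a coarse class-score map is brought back to input resolution by bilinear interpolation before refinement:

```python
import numpy as np

def bilinear_upsample(scores, factor):
    """Up-sample a coarse (h, w, c) class-score map by an integer
    factor using bilinear interpolation."""
    h, w, c = scores.shape
    H, W = h * factor, w * factor
    out = np.empty((H, W, c))
    for i in range(H):
        for j in range(W):
            # Map each output pixel back to fractional input coordinates.
            y, x = i / factor, j / factor
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            out[i, j] = ((1 - dy) * (1 - dx) * scores[y0, x0]
                         + (1 - dy) * dx * scores[y0, x1]
                         + dy * (1 - dx) * scores[y1, x0]
                         + dy * dx * scores[y1, x1])
    return out

# Per-pixel labels from the up-sampled scores; in the full model the
# fully-connected CRF would refine the scores before this argmax.
coarse = np.random.rand(4, 4, 3)            # 4x4 coarse map, 3 classes
fine = bilinear_upsample(coarse, factor=8)  # 32x32 map
labels = fine.argmax(axis=-1)
```

Because bilinear interpolation is a convex combination of neighboring scores, the up-sampled map stays within the value range of the coarse map; the CRF is what then sharpens this smooth map along image edges.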
Schwing and Urtasun [39] proposed a fully-connected
deep structured network for image segmentation. They
presented a method that jointly trains CNNs and fully-
connected CRFs for semantic image segmentation, and
achieved encouraging results on the challenging PASCAL
VOC 2012 dataset. In [40], Zheng et al. proposed a similar
semantic segmentation approach integrating CRF with CNN.
In another relevant work, Lin et al. [41] proposed an
efficient algorithm for semantic segmentation based on
contextual deep CRFs. They explored “patch-patch” context
(between image regions) and “patch-background” context to
improve semantic segmentation through the use of contex-
tual information.
Liu et al. [42] proposed a semantic segmentation algorithm
that incorporates rich information into MRFs, including high-
order relations and mixture of label contexts. Unlike previous
works that optimized MRFs using iterative algorithms, they
proposed a CNN model, namely a Parsing Network, which
enables deterministic end-to-end computation in a single
forward pass.
3.3 Encoder-Decoder Based Models
Another popular family of deep models for image seg-
mentation is based on the convolutional encoder-decoder
architecture. Most DL-based segmentation works use some
kind of encoder-decoder model. We group these works
into two categories, encoder-decoder models for general
segmentation, and for medical image segmentation (to better
distinguish between applications).
3.3.1 Encoder-Decoder Models for General Segmentation
Noh et al. [43] published an early paper on semantic
segmentation based on deconvolution (a.k.a. transposed
convolution). Their model (Figure 11) consists of two parts:
an encoder using convolutional layers adopted from the
VGG 16-layer network and a deconvolutional network that
takes the feature vector as input and generates a map of
pixel-wise class probabilities. The deconvolution network
is composed of deconvolution and unpooling layers, which
identify pixel-wise class labels and predict segmentation
masks. This network achieved promising performance on the
PASCAL VOC 2012 dataset, and obtained the best accuracy
(72.5%) among the methods trained with no external data at
the time.
Fig. 11. Deconvolutional semantic segmentation. Following a convolution
network based on the VGG 16-layer net, is a multi-layer deconvolution
network to generate the accurate segmentation map. From [43].
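The deconvolution (transposed-convolution) operation at the heart of this decoder can be made concrete with a minimal single-channel NumPy sketch (function name and shapes are our own illustration): each input activation scatters a scaled copy of the kernel into a larger output map, so the spatial resolution grows with the stride.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Naive single-channel 2-D transposed convolution: every input
    value scatters kernel * x[i, j] into the output, up-sampling the
    map by `stride` (no padding, no bias)."""
    h, w = x.shape
    k = kernel.shape[0]
    H, W = (h - 1) * stride + k, (w - 1) * stride + k
    out = np.zeros((H, W))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride + k, j*stride:j*stride + k] += x[i, j] * kernel
    return out

feat = np.random.rand(8, 8)          # coarse feature map
kern = np.random.rand(4, 4)          # learned end-to-end in practice
up = transposed_conv2d(feat, kern)   # 18x18 output
```

This is the exact adjoint of a strided convolution's forward pass, which is why stacking such layers can invert the encoder's spatial down-sampling while the kernels remain learnable.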
In another promising work known as SegNet, Badri-
narayanan et al. [44] proposed a convolutional encoder-
decoder architecture for image segmentation (Figure 12).
Similar to the deconvolution network, the core trainable
segmentation engine of SegNet consists of an encoder net-
work, which is topologically identical to the 13 convolutional
layers in the VGG16 network, and a corresponding decoder
network followed by a pixel-wise classification layer. The
main novelty of SegNet is in the way the decoder upsamples
its lower resolution input feature map(s); specifically, it
uses pooling indices computed in the max-pooling step
of the corresponding encoder to perform non-linear up-
sampling. This eliminates the need for learning to up-sample.
The (sparse) up-sampled maps are then convolved with
trainable filters to produce dense feature maps. SegNet is also
significantly smaller in the number of trainable parameters
than other competing architectures. A Bayesian version of
SegNet was also proposed by the same authors to model the
uncertainty inherent to the convolutional encoder-decoder
network for scene segmentation [45].
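SegNet's index-based up-sampling can be sketched in a few lines of NumPy (the helper names below are our own, not SegNet's code): the encoder records where each pooling-window maximum came from, and the decoder scatters values back to exactly those positions, producing the sparse maps that trainable filters then densify.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max-pooling over a 2-D map that also records,
    for every window, the flat position of its maximum in x."""
    h, w = x.shape
    ph, pw = h // size, w // size
    pooled = np.zeros((ph, pw))
    indices = np.zeros((ph, pw), dtype=np.int64)
    for i in range(ph):
        for j in range(pw):
            window = x[i*size:(i+1)*size, j*size:(j+1)*size]
            di, dj = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[di, dj]
            indices[i, j] = (i*size + di) * w + (j*size + dj)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """SegNet-style unpooling: place each pooled value back at its
    recorded max location; all other positions stay zero (sparse map)."""
    out = np.zeros(out_shape)
    out.reshape(-1)[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 9., 2., 3.],
              [4., 5., 8., 6.],
              [0., 7., 1., 2.],
              [3., 1., 4., 5.]])
p, idx = max_pool_with_indices(x)
sparse = max_unpool(p, idx, x.shape)  # maxima restored at original spots
```

Since only the integer indices need to be kept, this up-sampling is parameter-free, which is one reason SegNet has far fewer trainable parameters than decoders built on transposed convolutions.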
Several other works adopt transposed convolutions, or
encoder-decoders for image segmentation, such as Stacked
Deconvolutional Network (SDN) [46], Linknet [47], W-Net
[48], and locality-sensitive deconvolution networks for RGB-
D segmentation [49].
Fig. 12. SegNet has no fully-connected layers; hence, the model is fully
convolutional. A decoder up-samples its input using the pooling indices
transferred from its encoder to produce sparse feature map(s). From [44].