Forest [27], [28] or Boosting [29], [30] to predict the class
probabilities of the center pixel. Features based on appearance [27]
or SfM and appearance [28], [29], [30] have been explored for
the CamVid road scene understanding test [22]. These per-pixel
noisy predictions (often called unary terms) from the classifiers
are then smoothed by using a pair-wise or higher order CRF [29],
[30] to improve the accuracy. More recent approaches have aimed
to produce high quality unaries by trying to predict the labels
for all the pixels in a patch as opposed to only the center pixel.
This improves the results of Random Forest based unaries [31],
but thin structured classes are still classified poorly. Dense depth maps
computed from the CamVid video have also been used as input
for classification using Random Forests [32]. Another approach
argues for the use of a combination of popular hand designed
features and spatio-temporal super-pixelization to obtain higher
accuracy [33]. The best performing technique on the CamVid
test [30] addresses the imbalance among label frequencies by
combining object detection outputs with classifier predictions in
a CRF framework. The results of all these techniques indicate the
need for improved features for classification.
Indoor RGBD pixel-wise semantic segmentation has also
gained popularity since the release of the NYU dataset [25]. This
dataset showed the usefulness of the depth channel to improve
segmentation. Their approach used features such as RGB-SIFT,
depth-SIFT and pixel location as input to a neural network
classifier to predict pixel unaries. The noisy unaries are then
smoothed using a CRF. Improvements were made using a richer
feature set, including LBP and region segmentation, followed by a
CRF, to obtain higher accuracy [34]. In more recent work [25], both
class segmentation and support relationships are inferred together
using a combination of RGB and depth based cues. Another
approach focuses on real-time joint reconstruction and semantic
segmentation, where Random Forests are used as the classifier
[35]. Gupta et al. [36] use boundary detection and hierarchical
grouping before performing category segmentation. The common
attribute in all these approaches is the use of hand engineered
features for classification of either RGB or RGBD images.
The success of deep convolutional neural networks for object
classification has more recently led researchers to exploit their
feature learning capabilities for structured prediction problems such
as segmentation. There have also been attempts to apply networks
designed for object categorization to segmentation, particularly
by replicating the deepest layer features in blocks to match
image dimensions [7], [37], [38], [39]. However, the resulting
classification is blocky [38]. Another approach using recurrent
neural networks [40] merges several low resolution predictions
to create input image resolution predictions. These techniques are
already an improvement over hand engineered features [7] but
their ability to delineate boundaries is poor.
Newer deep architectures [2], [4], [10], [13], [18] particularly
designed for segmentation have advanced the state-of-the-art by
learning to decode or map low resolution image representations
to pixel-wise predictions. The encoder network which produces
these low resolution representations in all of these architectures is
the VGG16 classification network [1] which has 13 convolutional
layers and 3 fully connected layers. The encoder network weights
are typically pre-trained on the large ImageNet object classification
dataset [41]. The decoder network varies between these architectures
and is the part responsible for producing multi-dimensional features
for each pixel for classification.
Each decoder in the Fully Convolutional Network (FCN)
architecture [2] learns to upsample its input feature map(s) and
combines them with the corresponding encoder feature map to
produce the input to the next decoder. It is an architecture which
has a large number of trainable parameters in the encoder network
(134M) but a very small decoder network (0.5M). The overall
large size of this network makes it hard to train end-to-end on
a relevant task. Therefore, the authors use a stage-wise training
process. Here each decoder in the decoder network is progressively
added to an existing trained network. The network is grown until
no further increase in performance is observed. This growth is
stopped after three decoders; ignoring the high resolution feature
maps in this way can certainly lead to a loss of edge information [4]. Apart
from training related issues, the need to reuse the encoder feature
maps in the decoder makes it memory intensive at test time. We
study this network in more detail as it is the core of other recent
architectures [10], [11].
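To make this decoding step concrete, the following is a minimal PyTorch sketch of one FCN-style decoder stage (our illustration with assumed channel sizes and class count, not the authors' released code). The coarse class-score map is upsampled by a learned transposed convolution and summed with a 1x1-convolved encoder feature map of the corresponding resolution.

import torch
import torch.nn as nn

class FCNDecoderStage(nn.Module):
    # One FCN-style decoder stage: 2x learned upsampling of the coarse
    # class-score map, fused by element-wise summation with the encoder
    # feature map (skip connection) at the matching resolution.
    def __init__(self, num_classes, skip_channels):
        super().__init__()
        # learned 2x upsampling (transposed convolution)
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=4, stride=2, padding=1)
        # 1x1 convolution that scores the encoder feature map
        self.score_skip = nn.Conv2d(skip_channels, num_classes, kernel_size=1)

    def forward(self, coarse_scores, skip_feats):
        return self.upsample(coarse_scores) + self.score_skip(skip_feats)

# Hypothetical shapes: 1/32-resolution scores fused with a pool4-like map.
stage = FCNDecoderStage(num_classes=21, skip_channels=512)
scores = torch.randn(1, 21, 7, 7)    # coarse class scores
pool4 = torch.randn(1, 512, 14, 14)  # encoder feature map at twice the resolution
out = stage(scores, pool4)           # -> (1, 21, 14, 14)

In FCN-8s this stage is applied twice (fusing the pool4 and pool3 feature maps) before a final 8x upsampling to the input resolution.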
The predictive performance of FCN has been improved further
by appending the FCN with a recurrent neural network (RNN)
[10] and fine-tuning them on large datasets [21], [42]. The RNN
layers mimic the sharp boundary delineation capabilities of CRFs
while exploiting the feature representation power of FCNs. They
show a significant improvement over FCN-8 but also show that
this difference is reduced when more training data is used to
train FCN-8. The main advantage of the CRF-RNN is revealed
when it is jointly trained with an architecture such as the FCN-
8. The fact that joint training helps is also shown in other recent
results [43], [44]. Interestingly, the deconvolutional network [4]
performs significantly better than FCN, although at the cost of
more complex training and inference. This, however, raises the
question as to whether the perceived advantage of the CRF-RNN
would be reduced as the core feed-forward segmentation engine is
made better. In any case, the CRF-RNN network can be appended
to any deep segmentation architecture including SegNet.
Multi-scale deep architectures are also being pursued [13],
[44]. They come in two flavours: (i) those which use input images
at a few scales and corresponding deep feature extraction networks,
and (ii) those which combine feature maps from different layers
of a single deep architecture [45], [11]. The common idea is to use
features extracted at multiple scales to provide both local and global
context [46], while the feature maps of the early encoding layers
retain more high frequency detail, leading to sharper class
boundaries. Some of these architectures are difficult
to train due to their parameter size [13]. Thus a multi-stage training
process is employed along with data augmentation. The inference
is also expensive with multiple convolutional pathways for feature
extraction. Others [44] append a CRF to their multi-scale network
and jointly train them. However, these are not feed-forward at test
time and require optimization to determine the MAP labels.
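As an illustration of flavour (ii), the following minimal PyTorch sketch (our own, with assumed shapes and a hypothetical helper name) fuses a detail-rich early feature map with an upsampled deep feature map by concatenation.

import torch
import torch.nn.functional as F

def fuse_multiscale(early_feats, deep_feats):
    # Upsample the coarse, semantically strong map to the resolution of
    # the earlier, high-frequency map, then concatenate along channels.
    deep_up = F.interpolate(deep_feats, size=early_feats.shape[2:],
                            mode='bilinear', align_corners=False)
    return torch.cat([early_feats, deep_up], dim=1)

# e.g. a 1/4-resolution early map and a 1/16-resolution deep map
early = torch.randn(1, 64, 56, 56)
deep = torch.randn(1, 512, 14, 14)
fused = fuse_multiscale(early, deep)  # -> (1, 576, 56, 56)

A classifier applied to the fused map then sees both the global context of the deep features and the fine detail of the early ones.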
Several of the recently proposed deep architectures for
segmentation are not feed-forward at inference time [4], [3], [18].
They require either MAP inference over a CRF [44], [43] or
aids such as region proposals [4] for inference. We believe the
perceived performance increase obtained by using a CRF is due
to the lack of good decoding techniques in their core feed-forward
segmentation engine. SegNet on the other hand uses decoders to
obtain features for accurate pixel-wise classification.
The recently proposed Deconvolutional Network [4] and its
semi-supervised variant, the Decoupled network [18], use the max
locations of the encoder feature maps (pooling indices) to perform
non-linear upsampling in the decoder network (a minimal sketch of
this unpooling follows below). The authors of
these architectures, independently of SegNet (first submitted to
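For concreteness, here is a minimal PyTorch sketch of this index-based non-linear upsampling (max-unpooling), using the library's MaxPool2d/MaxUnpool2d pair with illustrative shapes.

import torch
import torch.nn as nn

# Encoder-side max pooling that also records the argmax locations
# (the pooling indices) of each pooling window.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 8, 8)         # an encoder feature map
pooled, indices = pool(x)            # (1, 64, 4, 4) plus index map

# Decoder-side: each value is placed back at the position of the original
# maximum; all other entries are zero. The resulting sparse map is then
# typically densified by subsequent trainable convolutions.
upsampled = unpool(pooled, indices)  # (1, 64, 8, 8), sparse

Because only the pooling indices, rather than the full encoder feature maps, need to be stored, this form of decoding is considerably cheaper in memory than FCN-style skip connections.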