DCNNs at multiple image resolutions and then employ a
segmentation tree to smooth the prediction results. More
recently, [21] propose to use skip layers and concatenate the
computed intermediate feature maps within the DCNNs for
pixel classification. Further, [51] propose to pool the inter-
mediate feature maps by region proposals. These works still
employ segmentation algorithms that are decoupled from
the DCNN classifier’s results, thus risking commitment to
premature decisions.
The third family of works uses DCNNs to directly provide
dense category-level pixel labels, which makes it possible to
even discard segmentation altogether. The segmentation-
free approaches of [14], [52] directly apply DCNNs to the
whole image in a fully convolutional fashion, transforming
the last fully connected layers of the DCNN into convolu-
tional layers. In order to deal with the spatial localization
issues outlined in the introduction, [14] upsample and con-
catenate the scores from intermediate feature maps, while
[52] refine the prediction result from coarse to fine by propa-
gating the coarse results to another DCNN. Our work builds
on these works, and as described in the introduction extends
them by exerting control on the feature resolution, introduc-
ing multi-scale pooling techniques and integrating the
densely connected CRF of [22] on top of the DCNN. We
show that this leads to significantly better segmentation
results, especially along object boundaries. The combination
of DCNN and CRF is of course not new, but previous works
only tried locally connected CRF models. Specifically, [53]
use CRFs as a proposal mechanism for a DCNN-based
reranking system, while [39] treat superpixels as nodes for a
local pairwise CRF and use graph-cuts for discrete inference.
As such their models were limited by errors in superpixel
computations or ignored long-range dependencies. Our
approach instead treats every pixel as a CRF node receiving
unary potentials by the DCNN. Crucially, the Gaussian CRF
potentials in the fully connected CRF model of [22] that we
adopt can capture long-range dependencies and at the same
time the model is amenable to fast mean field inference. We
note that mean field inference had been extensively studied
for traditional image segmentation tasks [54], [55], [56], but
these older models were typically limited to short-range con-
nections. In independent work, [57] use a very similar
densely connected CRF model to refine the results of DCNN
for the problem of material classification. However, the
DCNN module of [57] was only trained by sparse point
supervision instead of dense supervision at every pixel.
Since the first version of this work was made publicly
available [38], the area of semantic segmentation has pro-
gressed drastically. Multiple groups have made important
advances, significantly raising the bar on the PASCAL VOC
2012 semantic segmentation benchmark, as reflected in the high level of activity in the benchmark's leaderboard¹ [17], [40], [58], [59], [60], [61], [62], [63]. Interestingly, most top-performing methods have adopted one or both of the key
ingredients of our DeepLab system: Atrous convolution for
efficient dense feature extraction and refinement of the raw
DCNN scores by means of a fully connected CRF. We outline
below some of the most important and interesting advances.
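To make the first of these ingredients concrete, atrous convolution can be sketched in one dimension: the filter taps are applied `rate` samples apart, enlarging the receptive field without downsampling the signal. The function name, the 'same'-padding choice, and the toy inputs below are our own illustrative assumptions, not the DeepLab implementation:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution: filter taps are applied `rate`
    samples apart, enlarging the receptive field without any
    downsampling of the input signal."""
    k = len(w)
    pad = (k - 1) * rate // 2            # 'same' padding for odd k
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += w[j] * xp[i + j * rate]
    return out

impulse = np.array([0., 0., 1., 0., 0., 0., 0.])
box = np.array([1., 1., 1.])
print(atrous_conv1d(impulse, box, rate=1))  # taps at offsets -1, 0, +1
print(atrous_conv1d(impulse, box, rate=2))  # taps at offsets -2, 0, +2
```

With rate 1 this reduces to ordinary convolution; with rate 2 the same three-tap filter covers a span of five samples at identical cost, which is the mechanism that lets a DCNN compute dense features without aggressive striding.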
End-to-end training for structured prediction has more
recently been explored in several related works. While we
employ the CRF as a post-processing method, [40], [59],
[62], [64], [65] have successfully pursued joint learning of
the DCNN and CRF. In particular, [59], [65] unroll the CRF
mean-field inference steps to convert the whole system into
an end-to-end trainable feed-forward network, while [62]
approximates one iteration of the dense CRF mean field
inference [22] by convolutional layers with learnable filters.
Another fruitful direction pursued by [40], [66] is to learn
the pairwise terms of a CRF via a DCNN, significantly
improving performance at the cost of heavier computation.
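For reference, the mean-field update that these works unroll into network layers can be sketched as follows, in the spirit of [22]. This is a toy version: the dense N x N kernel matrix stands in for the fast high-dimensional (bilateral) filtering used in practice, and the variable names are ours:

```python
import numpy as np

def meanfield_step(Q, unary, K, mu):
    """One mean-field update for a densely connected pairwise CRF.
      Q:     (N, L) current label marginals, rows summing to 1
      unary: (N, L) negative log unary potentials (e.g. DCNN scores)
      K:     (N, N) pairwise affinities with zero diagonal (toy dense kernel)
      mu:    (L, L) label compatibility, e.g. Potts: 1 - I
    """
    msg = K @ Q                    # message passing: filter the marginals
    pairwise = msg @ mu            # compatibility transform
    logits = -unary - pairwise     # combine with the unary term
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    Q_new = np.exp(logits)
    return Q_new / Q_new.sum(axis=1, keepdims=True)  # normalize per pixel
```

Each of the four steps (filtering, compatibility transform, unary addition, normalization) is differentiable, which is precisely what allows [59], [62], [65] to express iterations of this update as layers of a feed-forward network.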
In a different direction, [63] replace the bilateral filtering
module used in mean field inference with a faster domain
transform module [67], improving the speed and lowering
the memory requirements of the overall system, while [18],
[68] combine semantic segmentation with edge detection.
Weaker supervision has been pursued in a number of
papers, relaxing the assumption that pixel-level semantic
annotations are available for the whole training set [58],
[69], [70], [71], achieving significantly better results than
weakly-supervised pre-DCNN systems such as [72]. In
another line of research, [49], [73] pursue instance segmen-
tation, jointly tackling object detection and semantic
segmentation.
Fig. 1. Model illustration. A deep convolutional neural network such as VGG-16 or ResNet-101 is employed in a fully convolutional fashion, using
atrous convolution to reduce the degree of signal downsampling (from 32x down to 8x). A bilinear interpolation stage enlarges the feature maps to the
original image resolution. A fully connected CRF is then applied to refine the segmentation result and better capture the object boundaries.
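The bilinear interpolation stage of the caption can be sketched as below. The helper name and the corner-aligned sampling convention are illustrative assumptions; frameworks provide optimized equivalents:

```python
import numpy as np

def bilinear_upsample(score, factor):
    """Bilinearly enlarge an (H, W, C) score map by an integer factor,
    sampling with aligned corners. Readable reference only."""
    H, W, C = score.shape
    ys = np.linspace(0, H - 1, H * factor)
    xs = np.linspace(0, W - 1, W * factor)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]     # vertical interpolation weights
    wx = (xs - x0)[None, :, None]     # horizontal interpolation weights
    top = score[y0][:, x0] * (1 - wx) + score[y0][:, x1] * wx
    bot = score[y1][:, x0] * (1 - wx) + score[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

coarse = np.random.rand(4, 4, 21)          # e.g. 21 classes at 1/8 resolution
print(bilinear_upsample(coarse, 8).shape)  # (32, 32, 21)
```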
1. http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6
836 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 4, APRIL 2018