[60] for their COCO 2017 detection challenge submission, and show improvement in terms of both accuracy and speed for the task of semantic segmentation.
3. Methods
In this section, we briefly introduce atrous convolution [30, 21, 64, 56, 8] and depthwise separable convolution [67, 71, 74, 12, 31]. We then review DeepLabv3 [10], which is used as our encoder module, before discussing the proposed decoder module appended to the encoder output. We also present a modified Xception model [12, 60], which further improves the performance with faster computation.
3.1. Encoder-Decoder with Atrous Convolution
Atrous convolution: Atrous convolution generalizes the standard convolution operation and is a powerful tool that allows us to explicitly control the resolution of features computed by deep convolutional neural networks and to adjust the filter's field-of-view in order to capture multi-scale information. In particular, in the case of two-dimensional signals, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as follows:
y[i] = \sum_{k} x[i + r \cdot k] \, w[k]    (1)
where the atrous rate r determines the stride with which we sample the input signal. We refer interested readers to [9] for more details. Note that standard convolution is a special case in which rate r = 1. The filter's field-of-view is adaptively modified by changing the rate value.
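For concreteness, the following is a minimal NumPy sketch of Eq. (1) in two dimensions; the function name, the 'valid' padding, and the cross-correlation form are our own illustrative choices rather than part of the original formulation:

```python
import numpy as np

def atrous_conv2d(x, w, rate=1):
    """Minimal 2D atrous (dilated) convolution per Eq. (1): y[i] = sum_k x[i + r*k] w[k].
    x: (H, W) input feature map, w: (kh, kw) filter, rate: atrous rate r.
    Uses 'valid' padding; rate=1 recovers the standard convolution."""
    kh, kw = w.shape
    # Effective field-of-view of the dilated filter.
    eff_h = (kh - 1) * rate + 1
    eff_w = (kw - 1) * rate + 1
    out_h = x.shape[0] - eff_h + 1
    out_w = x.shape[1] - eff_w + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            for ki in range(kh):
                for kj in range(kw):
                    # Sample the input with stride `rate` between filter taps.
                    y[i, j] += x[i + rate * ki, j + rate * kj] * w[ki, kj]
    return y
```

With rate = 1 this reduces to the standard convolution, while larger rates enlarge the field-of-view without introducing extra parameters.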
Depthwise separable convolution: Depthwise separable convolution, which factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., 1 × 1 convolution), drastically reduces the computation complexity. Specifically, the depthwise convolution performs a spatial convolution independently for each input channel, while the pointwise convolution is employed to combine the output from the depthwise convolution. In the TensorFlow [1] implementation of depthwise separable convolution, atrous convolution is supported in the depthwise convolution (i.e., the spatial convolution). In this work, we refer to the resulting convolution as atrous separable convolution, and find that atrous separable convolution significantly reduces the computation complexity of the proposed model while maintaining similar (or better) performance.
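As a rough illustration, an atrous separable convolution block can be expressed in TensorFlow/Keras as a dilated depthwise convolution followed by a 1 × 1 pointwise convolution; the batch-normalization/ReLU placement below is an assumption for illustration and not necessarily the exact implementation used in this work:

```python
import tensorflow as tf

def atrous_separable_conv(filters, rate):
    """Sketch of an atrous separable convolution block:
    3x3 depthwise convolution with atrous rate `rate`, then 1x1 pointwise convolution."""
    return tf.keras.Sequential([
        tf.keras.layers.DepthwiseConv2D(kernel_size=3, dilation_rate=rate,
                                        padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(filters, kernel_size=1, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])

x = tf.random.normal([1, 65, 65, 256])
y = atrous_separable_conv(filters=256, rate=2)(x)  # spatial shape preserved: (1, 65, 65, 256)
```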
DeepLabv3 as encoder: DeepLabv3 [10] employs atrous convolution [30, 21, 64, 56] to extract the features computed by deep convolutional neural networks at an arbitrary resolution. Here, we denote output stride as the ratio of input image spatial resolution to the final output resolution (before global pooling or the fully-connected layer). For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution and thus output stride = 32. For the task of semantic segmentation, one can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying the atrous convolution correspondingly (e.g., we apply rate = 2 and rate = 4 to the last two blocks respectively for output stride = 8). Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with the image-level features [47]. We use the last feature map before the logits in the original DeepLabv3 as the encoder output in our proposed encoder-decoder structure. Note that the encoder output feature map contains 256 channels and rich semantic information. In addition, one could extract features at an arbitrary resolution by applying the atrous convolution, depending on the computation budget.
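To make the output stride manipulation concrete, the hypothetical helper below follows the rule described above: once the target output stride is reached, the remaining blocks use stride 1 and the removed striding is folded into a growing atrous rate. The block granularity and the example stride list are illustrative assumptions, not the exact backbone configuration:

```python
def block_strides_and_rates(nominal_strides, target_output_stride):
    """nominal_strides: per-block strides of the classification backbone
    (product = 32). Returns (strides, atrous rates) for dense feature extraction."""
    current_stride = 1   # accumulated output stride so far
    rate = 1             # accumulated atrous rate multiplier
    strides, rates = [], []
    for s in nominal_strides:
        if current_stride >= target_output_stride:
            strides.append(1)          # remove the striding ...
            rate *= s                  # ... and compensate with an atrous rate
            rates.append(rate)
        else:
            strides.append(s)
            rates.append(1)
            current_stride *= s
    return strides, rates

# Backbone with nominal per-block strides 2, 2, 2, 2, 2 (output stride 32):
print(block_strides_and_rates([2, 2, 2, 2, 2], 8))
# -> ([2, 2, 2, 1, 1], [1, 1, 1, 2, 4]), i.e., rates 2 and 4 in the last two blocks
```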
Proposed decoder: The encoder features from DeepLabv3 are usually computed with output stride = 16. In the work of [10], the features are bilinearly upsampled by a factor of 16, which could be considered a naive decoder module. However, this naive decoder module may not successfully recover object segmentation details. We thus propose a simple yet effective decoder module, as illustrated in Fig. 2. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features [25] from the network backbone that have the same spatial resolution (e.g., Conv2 before striding in ResNet-101 [27]). We apply another 1 × 1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make training harder. After the concatenation, we apply a few 3 × 3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4. We show in Sec. 4 that using output stride = 16 for the encoder module strikes the best trade-off between speed and accuracy. The performance is marginally improved when using output stride = 8 for the encoder module, at the cost of extra computation complexity.
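A sketch of this decoder in TensorFlow/Keras is given below; the specific channel counts (a 48-channel 1 × 1 reduction and two 256-channel 3 × 3 convolutions) reflect design choices discussed in Sec. 4 and should be read as assumptions here rather than the definitive implementation:

```python
import tensorflow as tf

def decoder(encoder_features, low_level_features, num_classes):
    # 1x1 convolution to reduce the low-level feature channels (48 here).
    low = tf.keras.layers.Conv2D(48, 1, use_bias=False)(low_level_features)
    low = tf.keras.layers.BatchNormalization()(low)
    low = tf.keras.layers.ReLU()(low)

    # Bilinearly upsample the encoder output by 4 and concatenate.
    x = tf.keras.layers.UpSampling2D(size=4, interpolation='bilinear')(encoder_features)
    x = tf.keras.layers.Concatenate()([x, low])

    # A few 3x3 convolutions to refine the fused features.
    for _ in range(2):
        x = tf.keras.layers.Conv2D(256, 3, padding='same', use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)

    # Per-pixel logits, then a final bilinear upsampling by 4.
    x = tf.keras.layers.Conv2D(num_classes, 1)(x)
    return tf.keras.layers.UpSampling2D(size=4, interpolation='bilinear')(x)

# Example: a 512x512 input yields encoder features at output stride 16 (32x32)
# and low-level features at output stride 4 (128x128).
enc = tf.random.normal([1, 32, 32, 256])
low = tf.random.normal([1, 128, 128, 256])
logits = decoder(enc, low, num_classes=21)   # -> (1, 512, 512, 21)
```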
3.2. Modified Aligned Xception
The Xception model [12] has shown promising image classification results on ImageNet [62] with fast computation. More recently, the MSRA team [60] modifies the Xception model (called Aligned Xception) and further pushes the performance in the task of object detection. Motivated by these findings, we work in the same direction to adapt the Xception model for the task of semantic image segmentation. In particular, we make a few more changes on top of MSRA's modifications, namely (1) deeper Xception same as