[60] for their COCO 2017 detection challenge submission, and show improvement in terms of both accuracy and speed for the task of semantic segmentation.
3. Methods
In this section, we briefly introduce atrous convolution [30, 21, 64, 56, 8] and depthwise separable convolution [67, 71, 74, 12, 31]. We then review DeepLabv3 [10], which is used as our encoder module, before discussing the proposed decoder module appended to the encoder output. We also present a modified Xception model [12, 60], which further improves the performance with faster computation.
3.1. Encoder-Decoder with Atrous Convolution
Atrous convolution: Atrous convolution generalizes the standard convolution operation and is a powerful tool that allows us to explicitly control the resolution of features computed by deep convolutional neural networks and to adjust the filter's field-of-view in order to capture multi-scale information. In particular, in the case of two-dimensional signals, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as follows:
y[i] = \sum_{k} x[i + r \cdot k] \, w[k]    (1)
where the atrous rate r determines the stride with which we sample the input signal. We refer interested readers to [9] for more details. Note that standard convolution is a special case in which rate r = 1. The filter's field-of-view is adaptively modified by changing the rate value.
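For concreteness, the following is a minimal NumPy sketch of Eq. (1) in two dimensions; the function name, the 'valid' padding, and the cross-correlation form are our own illustrative choices rather than part of the original formulation:

```python
import numpy as np

def atrous_conv2d(x, w, rate=1):
    """Minimal 2D atrous (dilated) convolution per Eq. (1): y[i] = sum_k x[i + r*k] w[k].
    x: (H, W) input feature map, w: (kh, kw) filter, rate: atrous rate r.
    Uses 'valid' padding; rate=1 recovers the standard convolution."""
    kh, kw = w.shape
    # Effective field-of-view of the dilated filter.
    eff_h = (kh - 1) * rate + 1
    eff_w = (kw - 1) * rate + 1
    out_h = x.shape[0] - eff_h + 1
    out_w = x.shape[1] - eff_w + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            for ki in range(kh):
                for kj in range(kw):
                    # Sample the input with stride `rate` between filter taps.
                    y[i, j] += x[i + rate * ki, j + rate * kj] * w[ki, kj]
    return y
```

With rate = 1 this reduces to the standard convolution, while larger rates enlarge the field-of-view without introducing extra parameters.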
Depthwise separable convolution: Depthwise separable convolution, which factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., 1 × 1 convolution), drastically reduces the computation complexity. Specifically, the depthwise convolution performs a spatial convolution independently for each input channel, while the pointwise convolution is employed to combine the output from the depthwise convolution. In the TensorFlow [1] implementation of depthwise separable convolution, atrous convolution is supported in the depthwise convolution (i.e., the spatial convolution). In this work, we refer to the resulting convolution as atrous separable convolution, and find that atrous separable convolution significantly reduces the computation complexity of the proposed model while maintaining similar (or better) performance.
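As a rough illustration, an atrous separable convolution block can be expressed in TensorFlow/Keras as a dilated depthwise convolution followed by a 1 × 1 pointwise convolution; the batch-normalization/ReLU placement below is an assumption for illustration and not necessarily the exact implementation used in this work:

```python
import tensorflow as tf

def atrous_separable_conv(filters, rate):
    """Sketch of an atrous separable convolution block:
    3x3 depthwise convolution with atrous rate `rate`, then 1x1 pointwise convolution."""
    return tf.keras.Sequential([
        tf.keras.layers.DepthwiseConv2D(kernel_size=3, dilation_rate=rate,
                                        padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(filters, kernel_size=1, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])

x = tf.random.normal([1, 65, 65, 256])
y = atrous_separable_conv(filters=256, rate=2)(x)  # spatial shape preserved: (1, 65, 65, 256)
```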
DeepLabv3 as encoder: DeepLabv3 [10] employs atrous convolution [30, 21, 64, 56] to extract the features computed by deep convolutional neural networks at an arbitrary resolution. Here, we denote output stride as the ratio of input image spatial resolution to the final output resolution (before global pooling or the fully-connected layer). For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution and thus output stride = 32. For the task of semantic segmentation, one can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying the atrous convolution correspondingly (e.g., we apply rate = 2 and rate = 4 to the last two blocks respectively for output stride = 8). Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with the image-level features [47]. We use the last feature map before the logits in the original DeepLabv3 as the encoder output in our proposed encoder-decoder structure. Note that the encoder output feature map contains 256 channels and rich semantic information. In addition, one could extract features at an arbitrary resolution by applying the atrous convolution, depending on the computation budget.
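To make the output stride manipulation concrete, the hypothetical helper below follows the rule described above: once the target output stride is reached, the remaining blocks use stride 1 and the removed striding is folded into a growing atrous rate. The block granularity and the example stride list are illustrative assumptions, not the exact backbone configuration:

```python
def block_strides_and_rates(nominal_strides, target_output_stride):
    """nominal_strides: per-block strides of the classification backbone
    (product = 32). Returns (strides, atrous rates) for dense feature extraction."""
    current_stride = 1   # accumulated output stride so far
    rate = 1             # accumulated atrous rate multiplier
    strides, rates = [], []
    for s in nominal_strides:
        if current_stride >= target_output_stride:
            strides.append(1)          # remove the striding ...
            rate *= s                  # ... and compensate with an atrous rate
            rates.append(rate)
        else:
            strides.append(s)
            rates.append(1)
            current_stride *= s
    return strides, rates

# Backbone with nominal per-block strides 2, 2, 2, 2, 2 (output stride 32):
print(block_strides_and_rates([2, 2, 2, 2, 2], 8))
# -> ([2, 2, 2, 1, 1], [1, 1, 1, 2, 4]), i.e., rates 2 and 4 in the last two blocks
```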
Proposed decoder: The encoder features from DeepLabv3 are usually computed with output stride = 16. In the work of [10], the features are bilinearly upsampled by a factor of 16, which could be considered a naive decoder module. However, this naive decoder module may not successfully recover object segmentation details. We thus propose a simple yet effective decoder module, as illustrated in Fig. 2. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features [25] from the network backbone that have the same spatial resolution (e.g., Conv2 before striding in ResNet-101 [27]). We apply another 1 × 1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make training harder. After the concatenation, we apply a few 3 × 3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4. We show in Sec. 4 that using output stride = 16 for the encoder module strikes the best trade-off between speed and accuracy. The performance is marginally improved when using output stride = 8 for the encoder module, at the cost of extra computation complexity.
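A sketch of this decoder in TensorFlow/Keras is given below; the specific channel counts (a 48-channel 1 × 1 reduction and two 256-channel 3 × 3 convolutions) reflect design choices discussed in Sec. 4 and should be read as assumptions here rather than the definitive implementation:

```python
import tensorflow as tf

def decoder(encoder_features, low_level_features, num_classes):
    # 1x1 convolution to reduce the low-level feature channels (48 here).
    low = tf.keras.layers.Conv2D(48, 1, use_bias=False)(low_level_features)
    low = tf.keras.layers.BatchNormalization()(low)
    low = tf.keras.layers.ReLU()(low)

    # Bilinearly upsample the encoder output by 4 and concatenate.
    x = tf.keras.layers.UpSampling2D(size=4, interpolation='bilinear')(encoder_features)
    x = tf.keras.layers.Concatenate()([x, low])

    # A few 3x3 convolutions to refine the fused features.
    for _ in range(2):
        x = tf.keras.layers.Conv2D(256, 3, padding='same', use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)

    # Per-pixel logits, then a final bilinear upsampling by 4.
    x = tf.keras.layers.Conv2D(num_classes, 1)(x)
    return tf.keras.layers.UpSampling2D(size=4, interpolation='bilinear')(x)

# Example: a 512x512 input yields encoder features at output stride 16 (32x32)
# and low-level features at output stride 4 (128x128).
enc = tf.random.normal([1, 32, 32, 256])
low = tf.random.normal([1, 128, 128, 256])
logits = decoder(enc, low, num_classes=21)   # -> (1, 512, 512, 21)
```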
3.2. Modified Aligned Xception
The Xception model [12] has shown promising image classification results on ImageNet [62] with fast computation. More recently, the MSRA team [60] modifies the Xception model (called Aligned Xception) and further pushes the performance in the task of object detection. Motivated by these findings, we work in the same direction to adapt the Xception model for the task of semantic image segmentation. In particular, we make a few more changes on top of MSRA's modifications, namely (1) deeper Xception same as