deconvolution, and unpooling. U-Net [43] combines skip
layers and learned deconvolution for pixel labeling of
microscopy images. The dilation architecture of [44] makes
thorough use of dilated convolution for pixel-precise output
without a random field or skip layers.
3 FULLY CONVOLUTIONAL NETWORKS
Each layer output in a convnet is a three-dimensional
array of size $h \times w \times d$, where $h$ and $w$ are spatial dimensions,
and $d$ is the feature or channel dimension. The first
layer is the image, with pixel size $h \times w$, and $d$ channels.
Locations in higher layers correspond to the locations in
the image they are path-connected to, which are called
their receptive fields.
Convnets are inherently translation invariant. Their basic
components (convolution, pooling, and activation func-
tions) operate on local input regions, and depend only on
relative spatial coordinates. Writing $\mathbf{x}_{ij}$ for the data vector at
location $(i, j)$ in a particular layer, and $\mathbf{y}_{ij}$ for the following
layer, these functions compute outputs $\mathbf{y}_{ij}$ by
$$\mathbf{y}_{ij} = f_{ks}\left(\{\mathbf{x}_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i, \delta j < k}\right),$$
where $k$ is called the kernel size, $s$ is the stride or subsampling
factor, and $f_{ks}$ determines the layer type: a matrix
multiplication for convolution or average pooling, a spatial
max for max pooling, or an elementwise nonlinearity for an
activation function, and so on for other types of layers.
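To make the local-operator view concrete, here is a minimal NumPy sketch (not from the paper; apply_fks and its arguments are illustrative names) of the formula above, instantiated as max pooling:

```python
# A minimal sketch of the generic layer operator
# y_ij = f_ks({x_{si+di, sj+dj} : 0 <= di, dj < k}): slide a k x k window
# with stride s over the input and apply f to each window.
import numpy as np

def apply_fks(x, k, s, f):
    """x: (h, w, d) input; k: kernel size; s: stride; f: window -> vector."""
    h, w, d = x.shape
    out_h = (h - k) // s + 1
    out_w = (w - k) // s + 1
    y = np.zeros((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            window = x[s * i: s * i + k, s * j: s * j + k, :]  # k x k x d patch
            y[i, j] = f(window)
    return y

# Max pooling is the special case where f takes a spatial max per channel.
x = np.random.rand(8, 8, 3)
y = apply_fks(x, k=2, s=2, f=lambda w: w.max(axis=(0, 1)))     # -> (4, 4, 3)
```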
This functional form is maintained under composition,
with kernel size and stride obeying the transformation rule
$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\, ss'}.$$
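As an illustration of this rule (a sketch under our own naming, not code from the paper), the effective kernel size and stride of a stack of layers can be computed by folding the composition rule over per-layer (kernel, stride) pairs:

```python
# Composition rule f_ks o g_k's' = (f o g)_{k' + (k-1)s', ss'}: composing
# an outer layer (k, s) with an inner layer (k', s') is equivalent to a
# single layer with the returned (kernel, stride).
def compose(ks_outer, ks_inner):
    """ks_*: (kernel, stride) pairs; the inner layer is applied to the input first."""
    k, s = ks_outer
    k2, s2 = ks_inner
    return (k2 + (k - 1) * s2, s * s2)

# Effective receptive field and overall stride of a stack of layers,
# e.g. conv 3x3/1 -> pool 2x2/2 -> conv 3x3/1 -> pool 2x2/2.
layers = [(3, 1), (2, 2), (3, 1), (2, 2)]
k, s = layers[0]
for layer in layers[1:]:
    k, s = compose(layer, (k, s))
print(k, s)   # 10 4: each output cell sees a 10x10 patch, sampled every 4 pixels
```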
While a general net computes a general nonlinear function,
a net with only layers of this form computes a nonlinear fil-
ter, which we call a deep filter or fully convolutional network.
An FCN naturally operates on an input of any size, and pro-
duces an output of corresponding (possibly resampled) spa-
tial dimensions.
A real-valued loss function composed with an FCN
defines a task. If the loss function is a sum over the spatial
dimensions of the final layer, $\ell(\mathbf{x}; \theta) = \sum_{ij} \ell'(\mathbf{x}_{ij}; \theta)$, its
parameter gradient will be a sum over the parameter gradients
of each of its spatial components. Thus stochastic gradient
descent on $\ell$ computed on whole images will be the
same as stochastic gradient descent on $\ell'$, taking all of the
final layer receptive fields as a minibatch.
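The equivalence can be checked numerically; the following sketch assumes PyTorch and uses a single 1x1 convolution as a stand-in for the final scoring layer (illustrative only, not the paper's architecture):

```python
# Whole-image training vs. a "minibatch of receptive fields": the spatial
# sum of per-pixel losses yields the same parameter gradient as treating
# every output cell as an independent classification example.
import torch
import torch.nn as nn

net = nn.Conv2d(3, 21, kernel_size=1)          # stand-in FCN scoring layer
x = torch.randn(1, 3, 10, 10)                  # whole image (or feature map)
target = torch.randint(0, 21, (1, 10, 10))     # per-pixel ground truth

# Whole-image loss: sum of the per-pixel losses l'(x_ij; theta).
scores = net(x)                                # 1 x 21 x 10 x 10
loss = nn.functional.cross_entropy(scores, target, reduction='sum')
loss.backward()
grad_whole = net.weight.grad.clone()

# Equivalent minibatch view: flatten the spatial cells into a batch of
# 100 independent examples and sum their losses.
net.zero_grad()
scores_flat = net(x).permute(0, 2, 3, 1).reshape(-1, 21)
loss_flat = nn.functional.cross_entropy(scores_flat, target.reshape(-1),
                                        reduction='sum')
loss_flat.backward()
assert torch.allclose(grad_whole, net.weight.grad)
```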
When these receptive fields overlap significantly, both
feedforward computation and backpropagation are much
more efficient when computed layer-by-layer over an entire
image instead of independently patch-by-patch.
We next explain how to convert classification nets into
fully convolutional nets that produce coarse output maps.
For pixelwise prediction, we need to connect these coarse
outputs back to the pixels. Section 3.2 describes a trick
used for this purpose (e.g., by “fast scanning” [45]). We
explain this trick in terms of network modification. As an
efficient, effective alternative, we upsample in Section 3.3,
reusing our implementation of convolution. In Section 3.4
we consider training by patchwise sampling, and give
evidence in Section 4.4 that our whole image training is
faster and equally effective.
3.1 Adapting Classifiers for Dense Prediction
Typical recognition nets, including LeNet [21], AlexNet [1],
and its deeper successors [2], [3], ostensibly take fixed-sized
inputs and produce non-spatial outputs. The fully connected
layers of these nets have fixed dimensions and throw away
spatial coordinates. However, fully connected layers can also
be viewed as convolutions with kernels that cover their entire
input regions. Doing so casts these nets into fully convolu-
tional networks that take input of any size and make spatial
output maps. This transformation is illustrated in Fig. 2.
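As a concrete illustration in PyTorch (a hedged sketch; the names are ours and the sizes only loosely follow AlexNet's fc6), a fully connected layer over a flattened c x h x w feature map can be rewritten as a convolution whose kernel covers the whole map:

```python
# "Convolutionalizing" a fully connected layer: a Linear layer on flattened
# c x h x w features computes the same function as a Conv2d with an h x w kernel.
import torch
import torch.nn as nn

c, h, w, num_out = 256, 6, 6, 4096              # fc6-like geometry (illustrative)
fc = nn.Linear(c * h * w, num_out)

conv = nn.Conv2d(c, num_out, kernel_size=(h, w))
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(num_out, c, h, w))  # reshape FC weights into a kernel
    conv.bias.copy_(fc.bias)

# On an input of the original size the two layers agree...
x = torch.randn(1, c, h, w)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

# ...but the convolutional form also accepts larger inputs, producing a
# spatial grid of scores instead of a single vector.
big = torch.randn(1, c, 15, 15)
print(conv(big).shape)                          # torch.Size([1, 4096, 10, 10])
```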
Furthermore, while the resulting maps are equivalent to
the evaluation of the original net on particular input
patches, the computation is highly amortized over the
overlapping regions of those patches. For example, while
AlexNet takes 1.2 ms (on a typical GPU) to infer the classification
scores of a $227 \times 227$ image, the fully convolutional
net takes 22 ms to produce a $10 \times 10$ grid of outputs from a
$500 \times 500$ image, which is more than 5 times faster than the
naïve approach.¹
The spatial output maps of these convolutionalized mod-
els make them a natural choice for dense problems like
semantic segmentation. With ground truth available at
every output cell, both the forward and backward passes
are straightforward, and both take advantage of the inher-
ent computational efficiency (and aggressive optimization)
of convolution. The corresponding backward times for the
AlexNet example are 2.4 ms for a single image and 37 ms
for a fully convolutional $10 \times 10$ output map, resulting in a
speedup similar to that of the forward pass.
While our reinterpretation of classification nets as fully
convolutional yields output maps for inputs of any size, the
output dimensions are typically reduced by subsampling.
The classification nets subsample to keep filters small and
computational requirements reasonable. This coarsens the
output of a fully convolutional version of these nets, reduc-
ing it from the size of the input by a factor equal to the pixel
stride of the receptive fields of the output units.
Fig. 2. Transforming fully connected layers into convolution layers ena-
bles a classification net to output a spatial map. Adding differentiable
interpolation layers and a spatial loss (as in Fig. 1) produces an efficient
machine for end-to-end pixelwise learning.
1. Assuming efficient batching of single image inputs. The classifica-
tion scores for a single image by itself take 5.4 ms to produce, which is
nearly 25 times slower than the fully convolutional version.