Forest [27], [28] or Boosting [29], [30] to predict the class
probabilities of the center pixel. Features based on appearance [27]
or SfM and appearance [28], [29], [30] have been explored for
the CamVid road scene understanding test [22]. These per-pixel
noisy predictions (often called unary terms) from the classifiers
are then smoothed by using a pair-wise or higher order CRF [29],
[30] to improve the accuracy. More recent approaches have aimed
to produce high quality unaries by trying to predict the labels
for all the pixels in a patch as opposed to only the center pixel.
This improves the results of Random Forest based unaries [31],
but thin structured classes are still classified poorly. Dense depth maps
computed from the CamVid video have also been used as input
for classification using Random Forests [32]. Another approach
argues for the use of a combination of popular hand designed
features and spatio-temporal super-pixelization to obtain higher
accuracy [33]. The best performing technique on the CamVid
test [30] addresses the imbalance among label frequencies by
combining object detection outputs with classifier predictions in
a CRF framework. The results of all these techniques indicate the
need for improved features for classification.
Indoor RGBD pixel-wise semantic segmentation has also
gained popularity since the release of the NYU dataset [25]. This
dataset showed the usefulness of the depth channel to improve
segmentation. Their approach used features such as RGB-SIFT,
depth-SIFT and pixel location as input to a neural network
classifier to predict pixel unaries. The noisy unaries are then
smoothed using a CRF. Improvements were made using a richer
feature set, including LBP and region segmentation, followed by a
CRF, to obtain higher accuracy [34]. In more recent work [25], both
class segmentation and support relationships are inferred together
using a combination of RGB and depth based cues. Another
approach focuses on real-time joint reconstruction and semantic
segmentation, where Random Forests are used as the classifier
[35]. Gupta et al. [36] use boundary detection and hierarchical
grouping before performing category segmentation. The common
attribute in all these approaches is the use of hand engineered
features for classification of either RGB or RGBD images.
The success of deep convolutional neural networks for object
classification has more recently led researchers to exploit their
feature learning capabilities for structured prediction problems such
as segmentation. There have also been attempts to apply networks
designed for object categorization to segmentation, particularly
by replicating the deepest layer features in blocks to match
image dimensions [7], [37], [38], [39]. However, the resulting
classification is blocky [38]. Another approach using recurrent
neural networks [40] merges several low resolution predictions
to create input image resolution predictions. These techniques are
already an improvement over hand engineered features [7] but
their ability to delineate boundaries is poor.
Newer deep architectures [2], [4], [10], [13], [18] particularly
designed for segmentation have advanced the state-of-the-art by
learning to decode or map low resolution image representations
to pixel-wise predictions. The encoder network which produces
these low resolution representations in all of these architectures is
the VGG16 classification network [1] which has 13 convolutional
layers and 3 fully connected layers. The encoder network weights
are typically pre-trained on the large ImageNet object classification
dataset [41]. The decoder network varies between these architectures
and is the part responsible for producing multi-dimensional features
for each pixel for classification.
Each decoder in the Fully Convolutional Network (FCN)
architecture [2] learns to upsample its input feature map(s) and
combines them with the corresponding encoder feature map to
produce the input to the next decoder. It is an architecture which
has a large number of trainable parameters in the encoder network
(134M) but a very small decoder network (0.5M). The overall
large size of this network makes it hard to train end-to-end on
a relevant task. Therefore, the authors use a stage-wise training
process. Here each decoder in the decoder network is progressively
added to an existing trained network. The network is grown until
no further increase in performance is observed. This growth is
stopped after three decoders; ignoring the high resolution feature
maps in this way can certainly lead to a loss of edge information [4]. Apart
from training related issues, the need to reuse the encoder feature
maps in the decoder makes it memory intensive at test time. We
study this network in more detail as it is the core of other recent
architectures [10], [11].
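To make this decoding step concrete, the following is a minimal PyTorch sketch of one FCN-style decoder stage (our illustration with assumed channel sizes and class count, not the authors' released code). The coarse class-score map is upsampled by a learned transposed convolution and summed with a 1x1-convolved encoder feature map of the corresponding resolution.

import torch
import torch.nn as nn

class FCNDecoderStage(nn.Module):
    # One FCN-style decoder stage: 2x learned upsampling of the coarse
    # class-score map, fused by element-wise summation with the encoder
    # feature map (skip connection) at the matching resolution.
    def __init__(self, num_classes, skip_channels):
        super().__init__()
        # learned 2x upsampling (transposed convolution)
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=4, stride=2, padding=1)
        # 1x1 convolution that scores the encoder feature map
        self.score_skip = nn.Conv2d(skip_channels, num_classes, kernel_size=1)

    def forward(self, coarse_scores, skip_feats):
        return self.upsample(coarse_scores) + self.score_skip(skip_feats)

# Hypothetical shapes: 1/32-resolution scores fused with a pool4-like map.
stage = FCNDecoderStage(num_classes=21, skip_channels=512)
scores = torch.randn(1, 21, 7, 7)    # coarse class scores
pool4 = torch.randn(1, 512, 14, 14)  # encoder feature map at twice the resolution
out = stage(scores, pool4)           # -> (1, 21, 14, 14)

In FCN-8s this stage is applied twice (fusing the pool4 and pool3 feature maps) before a final 8x upsampling to the input resolution.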
The predictive performance of FCN has been improved further
by appending the FCN with a recurrent neural network (RNN)
[10] and fine-tuning them on large datasets [21], [42]. The RNN
layers mimic the sharp boundary delineation capabilities of CRFs
while exploiting the feature representation power of FCNs. They
show a significant improvement over FCN-8 but also show that
this difference is reduced when more training data is used to
train FCN-8. The main advantage of the CRF-RNN is revealed
when it is jointly trained with an architecture such as the FCN-
8. The fact that joint training helps is also shown in other recent
results [43], [44]. Interestingly, the deconvolutional network [4]
performs significantly better than FCN, although at the cost of
more complex training and inference. This, however, raises the
question as to whether the perceived advantage of the CRF-RNN
would be reduced as the core feed-forward segmentation engine is
made better. In any case, the CRF-RNN network can be appended
to any deep segmentation architecture including SegNet.
Multi-scale deep architectures are also being pursued [13],
[44]. They come in two flavours: (i) those which use input images
at a few scales and corresponding deep feature extraction networks,
and (ii) those which combine feature maps from different layers
of a single deep architecture [45], [11]. The common idea is to use
features extracted at multiple scales to provide both local and global
context [46], while the feature maps of the early encoding layers
retain more high frequency detail, leading to sharper class
boundaries. Some of these architectures are difficult
to train due to their parameter size [13]. Thus a multi-stage training
process is employed along with data augmentation. The inference
is also expensive with multiple convolutional pathways for feature
extraction. Others [44] append a CRF to their multi-scale network
and jointly train them. However, these are not feed-forward at test
time and require optimization to determine the MAP labels.
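As an illustration of flavour (ii), the following minimal PyTorch sketch (our own, with assumed shapes and a hypothetical helper name) fuses a detail-rich early feature map with an upsampled deep feature map by concatenation.

import torch
import torch.nn.functional as F

def fuse_multiscale(early_feats, deep_feats):
    # Upsample the coarse, semantically strong map to the resolution of
    # the earlier, high-frequency map, then concatenate along channels.
    deep_up = F.interpolate(deep_feats, size=early_feats.shape[2:],
                            mode='bilinear', align_corners=False)
    return torch.cat([early_feats, deep_up], dim=1)

# e.g. a 1/4-resolution early map and a 1/16-resolution deep map
early = torch.randn(1, 64, 56, 56)
deep = torch.randn(1, 512, 14, 14)
fused = fuse_multiscale(early, deep)  # -> (1, 576, 56, 56)

A classifier applied to the fused map then sees both the global context of the deep features and the fine detail of the early ones.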
Several of the recently proposed deep architectures for
segmentation are not feed-forward at inference time [4], [3], [18].
They require either MAP inference over a CRF [44], [43] or
aids such as region proposals [4] for inference. We believe the
perceived performance increase obtained by using a CRF is due
to the lack of good decoding techniques in their core feed-forward
segmentation engine. SegNet on the other hand uses decoders to
obtain features for accurate pixel-wise classification.
The recently proposed Deconvolutional Network [4] and its
semi-supervised variant, the Decoupled network [18], use the max
locations of the encoder feature maps (pooling indices) to perform
non-linear upsampling in the decoder network (a minimal sketch of
this unpooling follows below). The authors of
these architectures, independently of SegNet (first submitted to
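For concreteness, here is a minimal PyTorch sketch of this index-based non-linear upsampling (max-unpooling), using the library's MaxPool2d/MaxUnpool2d pair with illustrative shapes.

import torch
import torch.nn as nn

# Encoder-side max pooling that also records the argmax locations
# (the pooling indices) of each pooling window.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 8, 8)         # an encoder feature map
pooled, indices = pool(x)            # (1, 64, 4, 4) plus index map

# Decoder-side: each value is placed back at the position of the original
# maximum; all other entries are zero. The resulting sparse map is then
# typically densified by subsequent trainable convolutions.
upsampled = unpool(pooled, indices)  # (1, 64, 8, 8), sparse

Because only the pooling indices, rather than the full encoder feature maps, need to be stored, this form of decoding is considerably cheaper in memory than FCN-style skip connections.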