SegNet: An Innovative Deep Convolutional Encoder-Decoder Architecture for Image Segmentation


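SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation is a paper by Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla (IEEE Senior Member) that attracted wide attention in the computer vision community in 2015. The work presents a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise image segmentation. Its core contribution is SegNet, a design made up of an encoder network, a corresponding decoder network, and a final pixel-wise classification layer.

The encoder network is topologically identical to the 13 convolutional layers of the VGG16 network. The key novelty of SegNet lies in the decoder. Each decoder uses the pooling indices recorded during the max-pooling step of the corresponding encoder to perform non-linear upsampling, which removes the need to learn how to upsample and makes decoding more efficient. The upsampled feature maps are sparse, so they are then convolved with trainable filters to produce dense feature maps. In this way the low-resolution feature maps produced by the encoder are mapped back to the full input resolution for accurate pixel-wise classification.

The main advantages of SegNet are:

1. **Efficiency and interpretability**: By reusing information from the encoder, the decoder recovers spatial detail effectively. Dropping the fully connected layers avoids their computational cost and arguably makes the model easier to interpret.
2. **Upsampling strategy**: By reusing the pooling information from the encoding stage, SegNet upsamples without any additional parameters, which reduces model complexity and helps training and generalization.
3. **Wide applicability**: Because it adapts to different input sizes and performs well, SegNet has been applied to image segmentation tasks in areas such as medical image analysis and remote sensing.
4. **Practical value**: The paper not only introduces the concept but also provides a detailed implementation and experimental results, giving later work a useful practical reference.

SegNet is regarded as a milestone contribution. It shows how encoder and decoder structures can be combined to give a novel and practical solution to image segmentation in deep learning, and the paper lets researchers and engineers understand and apply the method to image recognition and scene understanding tasks.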
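The index-based upsampling described above can be illustrated with a small sketch. It uses plain Python lists rather than a deep learning framework, handles a single 2D feature map with 2×2 pooling windows and stride 2, and the function names are illustrative rather than taken from the paper or any library.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max pooling (stride 2) that also records where each maximum came from."""
    pooled, indices = [], []
    for r in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), 2):
            # Collect (value, position) pairs for the four cells of this window.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_shape):
    """Place each pooled value back at its recorded position; all other cells stay 0."""
    height, width = out_shape
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(feature_map)
sparse = max_unpool_2x2(pooled, indices, (4, 4))
print(pooled)  # low-resolution map kept by the encoder
print(sparse)  # sparse upsampled map, before any trainable convolution
```

The unpooling step has no trainable parameters: only the stored positions are needed. In SegNet the sparse result would then pass through trainable convolutions to become dense, a step this sketch leaves out.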

general discussion regarding our approach with pointers to future work in Section 5. We conclude in Section 6.

2 LITERATURE REVIEW

Semantic pixel-wise segmentation is an active topic of research, fuelled by challenging datasets [20], [21], [22], [24], [25]. Before the arrival of deep networks, the best performing methods mostly relied on hand engineered features and classified pixels independently. Typically, a patch is fed into a classifier, e.g., Random Forest [26], [27] or Boosting [28], [29], to predict the class probabilities of the center pixel. Features based on appearance [26] or SfM and appearance [27], [28], [29] have been explored for the CamVid road scene understanding test [21]. These per-pixel noisy predictions (often called unary terms) from the classifiers are then smoothed by using a pair-wise or higher order CRF [28], [29] to improve the accuracy. More recent approaches have aimed to produce high quality unaries by trying to predict the labels for all the pixels in a patch as opposed to only the center pixel. This improves the results of Random Forest based unaries [30] but thin structured classes are classified poorly. Dense depth maps computed from the CamVid video have also been used as input for classification using Random Forests [31]. Another approach argues for the use of a combination of popular hand designed features and spatio-temporal super-pixelization to obtain higher accuracy [32]. The best performing technique on the CamVid test [29] addresses the imbalance among label frequencies by combining object detection outputs with classifier predictions in a CRF framework. The results of all these techniques indicate the need for improved features for classification.
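The patch-based scheme described above, in which a classifier sees a patch and labels only its center pixel, can be sketched as follows. This is a toy illustration: `toy_classifier` is a hypothetical stand-in that thresholds mean intensity, not a Random Forest or Boosting model, and border pixels are handled by clamping coordinates, which is an assumption of this sketch.

```python
def extract_patch(image, row, col, radius):
    """Return the square patch centred on (row, col), clamping indices at the borders."""
    height, width = len(image), len(image[0])
    return [[image[min(max(row + dr, 0), height - 1)][min(max(col + dc, 0), width - 1)]
             for dc in range(-radius, radius + 1)]
            for dr in range(-radius, radius + 1)]


def toy_classifier(patch):
    """Hypothetical classifier: label 1 if the mean patch intensity exceeds 0.5, else 0."""
    values = [v for patch_row in patch for v in patch_row]
    return 1 if sum(values) / len(values) > 0.5 else 0


def classify_pixels(image, radius=1):
    """Label every pixel independently from the patch centred on it."""
    return [[toy_classifier(extract_patch(image, r, c, radius))
             for c in range(len(image[0]))]
            for r in range(len(image))]


# A 4x4 image whose left half is dark (0) and right half is bright (1).
image = [[0, 0, 1, 1] for _ in range(4)]
labels = classify_pixels(image, radius=1)
print(labels)
```

Each pixel is labelled without reference to its neighbours' labels, which is why such predictions tend to be noisy on real images and are usually followed by a smoothing step such as a CRF.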

Indoor RGBD pixel-wise semantic segmentation has also gained popularity since the release of the NYU dataset [24]. This dataset showed the usefulness of the depth channel to improve segmentation. Their approach used features such as RGB-SIFT, depth-SIFT and pixel location as input to a neural network classifier to predict pixel unaries. The noisy unaries are then smoothed using a CRF. Improvements were made using a richer feature set including LBP and region segmentation to obtain higher accuracy [33] followed by a CRF. In more recent work [24], both class segmentation and support relationships are inferred together using a combination of RGB and depth based cues. Another approach focuses on real-time joint reconstruction and semantic segmentation, where Random Forests are used as the classifier [34]. Gupta et al. [35] use boundary detection and hierarchical grouping before performing category segmentation. The common attribute in all these approaches is the use of hand engineered features for classification of either RGB or RGBD images.

The success of deep convolutional neural networks for object classification has more recently led researchers to exploit their feature learning capabilities for structured prediction problems such as segmentation. There have also been attempts to apply networks designed for object categorization to segmentation, particularly by replicating the deepest layer features in blocks to match image dimensions [6], [36], [37], [38]. However, the resulting classification is blocky [37]. Another approach using recurrent neural networks [39] merges several low resolution predictions to create input image resolution predictions. These techniques are already an improvement over hand engineered features [6] but their ability to delineate boundaries is poor.
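To see why replicating the deepest layer's output in blocks gives a blocky result, consider the following minimal sketch. It assumes the low resolution prediction is a 2D grid of class labels and that each cell is copied into a `factor` × `factor` block; the function name and the tiny label grid are illustrative.

```python
def replicate_blocks(low_res, factor):
    """Upsample a 2D label grid by copying each cell into a factor x factor block."""
    return [[low_res[r // factor][c // factor]
             for c in range(len(low_res[0]) * factor)]
            for r in range(len(low_res) * factor)]


low_res_labels = [
    [0, 1],
    [2, 3],
]
full_res_labels = replicate_blocks(low_res_labels, 2)
for row in full_res_labels:
    print(row)
```

Every class boundary in the output falls on a multiple of `factor`, so object outlines can only follow the coarse grid rather than the true contours in the image.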

Newer deep architectures [2], [4], [9], [12], [17] particularly designed for segmentation have advanced the state-of-the-art by learning to decode or map low resolution image representations to pixel-wise predictions. The encoder network which produces these low resolution representations in all of these architectures is the VGG16 classification network [1], which has 13 convolutional layers and three fully connected layers. The weights of this encoder network are typically pre-trained on the large ImageNet object classification dataset [40]. The decoder network varies between these architectures and is the part which is responsible for producing multi-dimensional features for each pixel for classification.

Each decoder in the Fully Convolutional Network architecture [2] learns to upsample its input feature map(s) and combines them with the corresponding encoder feature map to produce the input to the next decoder. It is an architecture which has a large number of trainable parameters in the encoder network (134 M) but a very small decoder network (0.5 M). The overall large size of this network makes it hard to train end-to-end on a relevant task. Therefore, the authors use a stage-wise training process. Here each decoder in the decoder network is progressively added to an existing trained network. The network is grown until no further increase in performance is observed. This growth is stopped after three decoders; ignoring the high resolution feature maps in this way can certainly lead to loss of edge information [4]. Apart from training related issues, the need to reuse the encoder feature maps in the decoder makes it memory intensive at test time. We study this network in more detail as it is the core of other recent architectures [9], [10].
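The 134 M figure quoted for the FCN encoder can be roughly reproduced with a little arithmetic. The sketch below assumes the standard VGG16 layer widths, 3×3 kernels with biases, and a 224×224 input so that the first fully connected layer sees a 7×7×512 feature map. It also assumes that 134 M counts the 13 convolutional layers plus the first two fully connected layers, which the text does not state explicitly. The helper names are illustrative.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one convolutional layer with square kernels."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# (input channels, output channels) of the 13 VGG16 convolutional layers.
# Assumption: standard VGG16 configuration with an RGB input.
VGG16_CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

conv_total = sum(conv_params(c_in, c_out) for c_in, c_out in VGG16_CONV_CHANNELS)
fc6 = fc_params(7 * 7 * 512, 4096)  # assumes a 224x224 input, i.e. a 7x7x512 map
fc7 = fc_params(4096, 4096)
encoder_total = conv_total + fc6 + fc7

print(f"13 conv layers: {conv_total / 1e6:.1f} M parameters")
print(f"conv + fc6 + fc7: {encoder_total / 1e6:.1f} M parameters")
```

Under these assumptions the convolutional layers account for only about 14.7 M of the roughly 134 M parameters, so almost all of the encoder's size sits in the two fully connected layers.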

The predictive performance of FCN has been improved further by appending the FCN with a recurrent neural network (RNN) [9] and fine-tuning them on large datasets [20], [41]. The RNN layers mimic the sharp boundary delineation capabilities of CRFs while exploiting the feature representation power of FCNs. They show a significant improvement over FCN-8 but also show that this difference is reduced when more training data is used to train FCN-8. The main advantage of the CRF-RNN is revealed when it is jointly trained with an architecture such as the FCN-8. The fact that joint training helps is also shown in other recent results [42], [43]. Interestingly, the deconvolutional network [4] performs significantly better than FCN, although at the cost of more complex training and inference. This however raises the question as to whether the perceived advantage of the CRF-RNN would be reduced as the core feed-forward segmentation engine is made better. In any case, the CRF-RNN network can be appended to any deep segmentation architecture including SegNet.

Multi-scale deep architectures are also being pursued [12], [43]. They come in two flavours: (i) those which use input images at a few scales and corresponding deep feature extraction networks, and (ii) those which combine feature maps from different layers of a single deep architecture [10], [44]. The common idea is to use features extracted at multiple scales to provide both local and global context [45], and the use of feature maps from the early encoding layers retains more high frequency detail, leading to sharper class

BADRINARAYANAN ET AL.: SEGNET: A DEEP CONVOLUTIONAL ENCODER-DECODER ARCHITECTURE FOR IMAGE SEGMENTATION 2483
