Scene Labeling with LSTM Recurrent Neural Networks
Wonmin Byeon 1,2   Thomas M. Breuel 1   Federico Raue 1,2   Marcus Liwicki 1
1 University of Kaiserslautern, Germany.
2 German Research Center for Artificial Intelligence (DFKI), Germany.
{wonmin.byeon,federico.raue}@dfki.de   {tmb,liwicki}@cs.uni-kl.de
Abstract
This paper addresses the problem of pixel-level segmentation and classification of scene images with an entirely learning-based approach using Long Short-Term Memory (LSTM) recurrent neural networks, which are commonly used for sequence classification. We investigate two-dimensional (2D) LSTM networks for natural scene images, taking into account the complex spatial dependencies of labels. Prior methods have generally required separate classification and image segmentation stages and/or pre- and post-processing. In our approach, classification, segmentation, and context integration are all carried out by 2D LSTM networks, allowing texture and spatial model parameters to be learned within a single model. The networks efficiently capture local and global contextual information over raw RGB values and adapt well to complex scene images. Our approach, which has a much lower computational complexity than prior methods, achieved state-of-the-art performance on the Stanford Background and SIFT Flow datasets. In fact, when no pre- or post-processing is applied, LSTM networks outperform other state-of-the-art approaches. Moreover, even on a single-core Central Processing Unit (CPU), the running time of our approach is comparable to or better than that of the compared state-of-the-art approaches, which use a Graphics Processing Unit (GPU). Finally, the feature maps visualized from each layer of our networks support the hypothesis that LSTM networks are well suited for image processing tasks in general.
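As a rough sketch of the recurrence behind such networks (the precise gating, scanning directions, and aggregation used in this work are defined later in the paper; the form below follows the generic multi-dimensional LSTM of Graves et al. and is only illustrative), the cell and hidden states at pixel $(i,j)$ are updated from the input $x_{i,j}$ and from the states of the already-visited neighbors $(i-1,j)$ and $(i,j-1)$:

\[
\begin{aligned}
c_{i,j} &= \mathbf{i}_{i,j}\odot\tanh\!\big(W x_{i,j} + U h_{i-1,j} + V h_{i,j-1} + b\big) \;+\; \mathbf{f}^{\,v}_{i,j}\odot c_{i-1,j} \;+\; \mathbf{f}^{\,h}_{i,j}\odot c_{i,j-1},\\
h_{i,j} &= \mathbf{o}_{i,j}\odot\tanh(c_{i,j}),
\end{aligned}
\]

where the input gate $\mathbf{i}$, the two forget gates $\mathbf{f}^{\,v},\mathbf{f}^{\,h}$, and the output gate $\mathbf{o}$ are sigmoid functions of the same quantities $(x_{i,j}, h_{i-1,j}, h_{i,j-1})$ with their own weights. Repeating such a scan from each image corner lets every pixel's output depend, in principle, on the entire image.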
1. Introduction
Accurate scene labeling is an important step towards image understanding. The scene labeling task consists of partitioning an image into its meaningful regions and labeling each pixel with its region's class. Pixel labels can rarely be decided from low-level features alone, such as color or texture extracted from a small window around each pixel. For instance, distinguishing “grass” from “tree” or “forest” would be difficult in such a setting. In fact, humans perceptually distinguish regions via the spatial dependencies between them. For instance, visually similar regions could be predicted as “sky” or “ocean” depending on whether they are in the top or bottom part of a scene.
Consequently, a higher-level representation of scenes (their global context) is typically constructed based on the similarity of the low-level features of pixels and on their spatial dependencies, using a graphical model. Such graphical models build global dependencies from the similarities of neighboring segments. The most popular graph-based approaches are Markov Random Fields (MRF) [4, 15, 16, 25] and Conditional Random Fields (CRF) [10, 20]. However, most such methods require pre-segmentation, superpixels, or candidate areas.
More recently, deep learning has become a very active area of research in scene understanding and vision in general. In [23], color and texture features from oversegmented regions are merged by Recursive Neural Networks. This work was extended by Socher et al. [22], who combined it with convolutional neural networks. Among deep learning approaches, Convolutional Neural Networks (CNNs) [17] are one of the most successful methods for end-to-end supervised learning and have been widely used in image classification [14, 21], object recognition [12], face verification [24], and scene labeling [5, 3]. Farabet et al. [3] introduced multi-scale CNNs to learn scale-invariant features, but faced problems with global contextual coherence and spatial consistency. These problems were addressed by combining CNNs with several post-processing algorithms, i.e., superpixels, CRF, and segmentation trees. Later, Kekeç et al. [13] improved CNNs by combining two CNN models which learn context information and visual features in separate networks. Both approaches improved accuracy through carefully designed pre-processing steps to help the learning, i.e., class-frequency balancing by selecting the same number of random patches per class, and a specific color space for the input data.