Scene Labeling with LSTM Recurrent Neural Networks
Wonmin Byeon 1,2   Thomas M. Breuel 1   Federico Raue 1,2   Marcus Liwicki 1
1 University of Kaiserslautern, Germany.
2 German Research Center for Artificial Intelligence (DFKI), Germany.
{wonmin.byeon,federico.raue}@dfki.de   {tmb,liwicki}@cs.uni-kl.de
Abstract
This paper addresses the problem of pixel-level segmentation and classification of scene images with an entirely learning-based approach using Long Short-Term Memory (LSTM) recurrent neural networks, which are commonly used for sequence classification. We investigate two-dimensional (2D) LSTM networks for natural scene images, taking into account the complex spatial dependencies of labels. Prior methods have generally required separate classification and image segmentation stages and/or pre- and post-processing. In our approach, classification, segmentation, and context integration are all carried out by 2D LSTM networks, allowing texture and spatial model parameters to be learned within a single model. The networks efficiently capture local and global contextual information over raw RGB values and adapt well to complex scene images. Our approach, which has a much lower computational complexity than prior methods, achieved state-of-the-art performance on the Stanford Background and SIFT Flow datasets. In fact, when no pre- or post-processing is applied, LSTM networks outperform other state-of-the-art approaches. Moreover, even on a single-core Central Processing Unit (CPU), the running time of our approach is comparable to or better than that of the compared state-of-the-art approaches, which use a Graphics Processing Unit (GPU). Finally, the feature maps visualized from each layer of our networks support the hypothesis that LSTM networks are well suited for image processing tasks in general.
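As a rough sketch of the recurrence behind such networks (the precise gating, scanning directions, and aggregation used in this work are defined later in the paper; the form below follows the generic multi-dimensional LSTM of Graves et al. and is only illustrative), the cell and hidden states at pixel $(i,j)$ are updated from the input $x_{i,j}$ and from the states of the already-visited neighbors $(i-1,j)$ and $(i,j-1)$:

\[
\begin{aligned}
c_{i,j} &= \mathbf{i}_{i,j}\odot\tanh\!\big(W x_{i,j} + U h_{i-1,j} + V h_{i,j-1} + b\big) \;+\; \mathbf{f}^{\,v}_{i,j}\odot c_{i-1,j} \;+\; \mathbf{f}^{\,h}_{i,j}\odot c_{i,j-1},\\
h_{i,j} &= \mathbf{o}_{i,j}\odot\tanh(c_{i,j}),
\end{aligned}
\]

where the input gate $\mathbf{i}$, the two forget gates $\mathbf{f}^{\,v},\mathbf{f}^{\,h}$, and the output gate $\mathbf{o}$ are sigmoid functions of the same quantities $(x_{i,j}, h_{i-1,j}, h_{i,j-1})$ with their own weights. Repeating such a scan from each image corner lets every pixel's output depend, in principle, on the entire image.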
1. Introduction
Accurate scene labeling is an important step towards image understanding. The scene labeling task consists of partitioning an image into its meaningful regions and labeling each pixel with its region's class. Pixel labels can rarely be decided from low-level features alone, such as color or texture extracted from a small window around each pixel. For instance, distinguishing “grass” from “tree” or “forest” would be difficult in such a setting. In fact, humans perceptually distinguish regions via the spatial dependencies between them. For instance, visually similar regions could be predicted as “sky” or “ocean” depending on whether they are in the top or bottom part of a scene.
Consequently, a higher-level representation of scenes (their global context) is typically constructed based on the similarity of the low-level features of pixels and on their spatial dependencies, using a graphical model. Such graphical models build global dependencies from the similarities of neighboring segments. The most popular graph-based approaches are Markov Random Fields (MRF) [4, 15, 16, 25] and Conditional Random Fields (CRF) [10, 20]. However, most such methods require pre-segmentation, superpixels, or candidate areas.
More recently, deep learning has become a very active area of research in scene understanding and vision in general. In [23], color and texture features from oversegmented regions are merged by Recursive Neural Networks. This work was extended by Socher et al. [22], who combined it with convolutional neural networks. Among deep learning approaches, Convolutional Neural Networks (CNNs) [17] are one of the most successful methods for end-to-end supervised learning and have been widely used in image classification [14, 21], object recognition [12], face verification [24], and scene labeling [5, 3]. Farabet et al. [3] introduced multi-scale CNNs to learn scale-invariant features, but faced problems with global contextual coherence and spatial consistency. These problems were addressed by combining CNNs with several post-processing algorithms, i.e., superpixels, CRF, and segmentation trees. Later, Kekeç et al. [13] improved CNNs by combining two CNN models which learn context information and visual features in separate networks. Both approaches improved accuracy through carefully designed pre-processing steps to help the learning, i.e., class-frequency balancing by selecting the same number of random patches per class, and a specific color space for the input data.