Convolutional Neural Networks with Intra-layer
Recurrent Connections for Scene Labeling
Ming Liang Xiaolin Hu Bo Zhang
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology
Center for Brain-Inspired Computing Research (CBICR)
Tsinghua University, Beijing 100084, China
liangm07@mails.tsinghua.edu.cn, {xlhu,dcszb}@tsinghua.edu.cn
Abstract
Scene labeling is a challenging computer vision task. It requires the use of both
local discriminative features and global context information. We adopt a deep
recurrent convolutional neural network (RCNN) for this task, which was originally
proposed for object recognition. Different from traditional convolutional neural
networks (CNN), this model has intra-layer recurrent connections in the convo-
lutional layers. Each convolutional layer therefore becomes a two-dimensional
recurrent neural network. The units receive constant feed-forward inputs from the
previous layer and recurrent inputs from their neighborhoods. As the recurrent
iterations proceed, the region of context captured by each unit expands. In this
way, feature extraction and context modulation are seamlessly integrated, in con-
trast to typical methods that entail separate modules for the two steps. To further
utilize the context, a multi-scale RCNN is proposed. On two benchmark datasets,
Stanford Background and SIFT Flow, the model outperforms many state-of-the-art
models in accuracy and efficiency.
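The recurrence described in the abstract can be sketched in code: each unit combines a constant feed-forward drive with a recurrent convolution over the layer's own previous state, so the effective receptive field grows with every iteration. The following is an illustrative single-channel NumPy sketch, not code from the paper; the kernel shapes, the ReLU nonlinearity, and the iteration count are assumptions for the sake of the example.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2D convolution with 'same' zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def recurrent_conv_layer(u, w_ff, w_rec, n_iter=3):
    """Sketch of a recurrent convolutional layer (RCL).

    u      -- feed-forward input from the previous layer (held constant)
    w_ff   -- feed-forward convolution kernel
    w_rec  -- recurrent kernel applied to the layer's own state; each
              iteration lets a unit see a wider neighborhood, i.e. the
              region of context captured by each unit expands
    """
    ff = conv2d_same(u, w_ff)            # constant feed-forward drive
    x = np.maximum(ff, 0.0)              # initial state (iteration 0)
    for _ in range(n_iter):
        # state update: feed-forward drive plus recurrent neighborhood input
        x = np.maximum(ff + conv2d_same(x, w_rec), 0.0)
    return x
```

Feeding a single impulse through this layer makes the expanding context visible: with 3x3 kernels, the set of nonzero responses widens by two pixels per recurrent iteration, which is the mechanism that lets the model integrate context without a separate module.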
1 Introduction
Scene labeling (or scene parsing) is an important step towards high-level image interpretation. It
aims at fully parsing the input image by labeling the semantic category of each pixel. Compared
with image classification, scene labeling is more challenging as it simultaneously solves both seg-
mentation and recognition. The typical approach to scene labeling consists of two steps. First,
local handcrafted features are extracted [6, 15, 26, 23, 27]. Second, context information is integrated
using probabilistic graphical models [6, 5, 18] or other techniques [24, 21]. In recent years, motivated
by the success of deep neural networks in learning visual representations, CNN [12] has been incorpo-
rated into this framework for feature extraction. However, since CNN does not have an explicit mech-
anism to modulate its features with context, other methods such as conditional random fields (CRF)
[5] and recursive parsing trees [21] are still needed to integrate the context information and achieve
better results. It would be desirable to have a neural network capable of performing scene labeling
in an end-to-end manner.
A natural way to incorporate context modulation in neural networks is to introduce recurrent con-
nections. This has been extensively studied in sequence learning tasks such as online handwriting
recognition [8], speech recognition [9], and machine translation [25]. Sequential data exhibit strong
correlations along the time axis, and recurrent neural networks (RNN) are suitable for these tasks
because long-range context information can be captured by a fixed number of recurrent weights.
If scene labeling is treated as a two-dimensional analogue of sequence learning, RNN can also be
applied, but such studies are relatively scarce. Recently, a recurrent CNN (RCNN), in which the output
of the top layer of a CNN is integrated with the input at the bottom, was successfully applied to scene labeling