DF²Net: A Discriminative Feature Learning and Fusion Network for RGB-D Indoor Scene Classification
Yabei Li^{1,2}, Junge Zhang^{1,2}, Yanhua Cheng^3, Kaiqi Huang^{1,2,4}, Tieniu Tan^{1,2,4}
^1 CRIPAC & NLPR, CASIA
^2 University of Chinese Academy of Sciences
^3 Tencent Wechat AI
^4 CAS Center for Excellence in Brain Science and Intelligence Technology
yabei.li@cripac.ia.ac.cn, {jgzhang, kqhuang, tnt}@nlpr.ia.ac.cn, breezecheng@tencent.com
Abstract

This paper focuses on the task of RGB-D indoor scene classification, which is challenging for two reasons. 1) Learning a robust representation for indoor scenes is difficult because of their diverse objects and layouts. 2) Fusing the complementary cues in RGB and Depth is nontrivial since there are large semantic gaps between the two modalities. Most existing works learn representations for classification by training a deep network with softmax loss and fuse the two modalities by simply concatenating their features. However, these pipelines do not explicitly consider intra-class and inter-class similarity, nor inter-modal intrinsic relationships. To address these problems, this paper proposes a Discriminative Feature Learning and Fusion Network (DF²Net) with two-stage training. In the first stage, to better represent the scene in each modality, a deep multi-task network is constructed to simultaneously minimize the structured loss and the softmax loss. In the second stage, we design a novel discriminative fusion network which is able to learn correlative features of multiple modalities and distinctive features of each modality. Extensive analysis and experiments on the SUN RGB-D Dataset and NYU Depth Dataset V2 show the superiority of DF²Net over other state-of-the-art methods in the RGB-D indoor scene classification task.
Introduction

Scene classification is one of the basic problems in computer vision research. Recently, with the release of cost-affordable depth sensors, e.g., Kinect, which provide illumination- and color-invariant geometric cues, some intrinsic challenges in indoor scene classification, such as varying illumination and diverse objects and layouts, can potentially be alleviated.
Compared with the standard object-centric image classification problem, the task of RGB-D indoor scene classification poses several challenges. Firstly, obtaining a robust representation for scene classification in a single modality is difficult. To understand a scene, people not only recognize the objects in the scene but also consider the correlations among them. Indoor scenes are usually cluttered with
diverse objects and various layouts, resulting in large intra-class variation and severe inter-class overlap. As illustrated in Figure 1, the classroom has various views, and some views of a classroom are similar to those of other scene categories such as a lecture theatre. Secondly, although the additional depth cues offer an opportunity to benefit indoor scene classification, there are large semantic gaps between the RGB and Depth modalities. As shown in Figure 1, an RGB image gives appearance cues while a Depth image provides geometric priors. How to fully exploit the complementary cues in the RGB and Depth modalities remains an open problem.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: The difficulties of RGB-D indoor scene classification. 1) Indoor scene images have large intra-class variation and small inter-class variation. 2) RGB and Depth images have semantic gaps. Sample images are from the SUN RGB-D Dataset.
The ideal multimodal representation for an RGB-D indoor scene ought to have small distances within the same class and large distances between different classes, as shown in Figure 2(a). One of the most popular pipelines (Eitel et al. 2015) is to learn representations for the RGB and Depth images separately with softmax loss and directly concatenate them, as illustrated in Figure 2(b). However, for scene images with large intra-class variation and small inter-class variation, it is hard to obtain sufficiently discriminative RGB and Depth representations, which leaves the concatenated multimodal representation far from ideal. To ease this situation, we design a discriminative feature learning network which can explicitly model the intra-class and inter-class similarity