RGB-D Indoor Segment 451
objects are recognized using optical character recognition (OCR) software on
extracted text regions. Lee et al. [5] incorporate visual odometry and feature-
based metric-topological Simultaneous Localization And Mapping (SLAM) into
the navigation system. Then a vicinity map based on dense 3D data obtained
from an RGB-D camera is built for path planning. These methods focus only on
specific parts of the scene, so the user cannot receive information about the
whole scene.
Semantic scene analysis could help the visually impaired better understand
the surrounding environment, and there have been many recent developments in
RGB-D scene analysis. Silberman et al. [6] use depth for bottom-up
segmentation and use context features to infer support relationships in the scene.
Ren et al. [7] use kernel descriptors on superpixels and a Markov Random
Field (MRF) over superpixels with a segmentation tree to model the context of
the scene. Choi et al. [8] use a 3D geometric phrase model to capture the
semantic and geometric relationships between objects that frequently co-occur
in the same 3D spatial configuration and thereby understand indoor scenes.
Gupta et al. [9]
propose algorithms for object boundary detection and hierarchical segmentation.
Their algorithms revisit the segmentation problem from the ground up and
develop gPb-like machinery to combine depth information naturally. Wang
et al. [10] propose a label propagation method to utilize the existing massive
2D semantic labeled datasets such as ImageNet. Koppula et al. [11] parse the
indoor scene with RGB-D data on a mobile robot. A full 3D reconstruction is
built from multiple views of the scene acquired with a Kinect sensor. Then the
3D point cloud is over-segmented and used as the underlying structure for an MRF
model. These methods focus on the algorithm for general scene segmentation
and labeling, while lacking specific analysis for the visually impaired. Wang et
al. [12] use the Hough transform to extract concurrent parallel lines in the RGB
channels and then use depth information to distinguish stairs from pedestrian
crosswalks. Stairs are then further classified as upstairs or downstairs. These
methods mainly focus on the accuracy of scene segmentation while neglecting
the efficiency of the algorithm, which is a key factor in our work. Liu
et al. [13] use a graph-based segmentation algorithm which combines the result
of plane segmentation and RGB-D data. Their method is more focused on the
efficiency of the algorithm. However, to help the visually impaired better
understand the scene, more semantic analysis, such as identifying the types of
different structures, should be conducted.
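The parallel-line cue behind the stair detection of [12] can be illustrated with a bare-bones Hough transform. The sketch below is an illustrative assumption rather than the authors' implementation: the function name, bin layout, and vote threshold are chosen for the example, and a real system would feed in a Canny edge map of the RGB channels.

```python
import numpy as np

def hough_parallel_lines(edge_img, n_theta=180, threshold=90):
    """Minimal Hough transform: every edge pixel votes for all (rho, theta)
    bins it lies on; bins with at least `threshold` votes are returned as
    lines in the normal form rho = x*cos(theta) + y*sin(theta)."""
    h, w = edge_img.shape
    diag = int(np.ceil(np.hypot(h, w)))            # maximum possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    accum = np.zeros((2 * diag, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(edge_img)
    for t_idx, theta in enumerate(thetas):
        # Shift rho by `diag` so accumulator indices are non-negative.
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int) + diag
        np.add.at(accum[:, t_idx], rhos, 1)
    peaks = np.argwhere(accum >= threshold)
    return [(int(rho) - diag, float(thetas[t])) for rho, t in peaks]

# Synthetic edge map with three horizontal "stair edge" candidates.
edges = np.zeros((100, 100), dtype=np.uint8)
edges[20, :] = edges[50, :] = edges[80, :] = 1

lines = hough_parallel_lines(edges)
# All detected lines share theta = pi/2, i.e. they are mutually parallel.
```

Grouping the returned (rho, theta) pairs by theta yields sets of concurrent parallel lines; depth information sampled across those lines would then separate stairs (step-like depth discontinuities) from flat crosswalk markings.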
In man-made indoor environments, there exist many planes that carry rich
structural information. Extracting these planes can be very helpful in scene
segmentation, and many plane segmentation algorithms exist in the literature.
One way to extract planes is to apply 2D segmentation methods to 3D data.
However, this approach performs badly if two planes are very close to each
other. To take full advantage of 3D data, many new methods have been proposed.
Holz et al. [14] compute local surface normals of point clouds using integral
images; the points are then clustered, segmented, and classified in both normal
space and spherical coordinates. This method achieves