Light Field Depth Estimation via Epipolar Plane Image Analysis and
Locally Linear Embedding
Yongbing Zhang, Huijin Lv, Yebin Liu, Haoqian Wang, Xingzheng Wang, Qian Huang, Xinguang Xiang,
and Qionghai Dai
Abstract—In this paper, we propose a novel method for 4D light field depth estimation that exploits the special linear structure of the epipolar plane image (EPI) and locally linear embedding (LLE). Without incurring high computational complexity, depth maps are estimated locally by locating the optimal slope of each line segment on the EPIs, each of which is the projection of a corresponding scene point. For each pixel to be processed, we build and then minimize a matching cost that aggregates pixel intensity, pixel gradient, spatial consistency, and a reliability measure to select the optimal slope from a predefined set of directions. Next, a sub-angle estimation method is proposed to further refine the obtained optimal slope of each pixel. Furthermore, based on a local reliability measure, all pixels are classified as either reliable or unreliable. For the unreliable pixels, LLE is employed to propagate depth from the reliable pixels, based on the assumption that natural images maintain a manifold-preserving property. We demonstrate the effectiveness of our approach on a number of synthetic light field examples and real-world light field datasets, and show that it achieves higher performance than typical and recent state-of-the-art light field stereo matching methods.
Index Terms—Depth estimation, epipolar plane image (EPI),
light field, locally linear embedding (LLE).
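The LLE-based propagation step above can be summarized in a short sketch: under the manifold-preserving assumption, the weights that best reconstruct an unreliable pixel's appearance from its nearest reliable neighbors also transfer those neighbors' depths. The Python snippet below is a generic LLE formulation with hypothetical inputs (feature vectors and neighbor depths); the exact features, neighbor selection, and cost terms of our method are defined in the sequel.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights for locally linear embedding.

    x         : (d,) feature vector of the query pixel (e.g. a color patch).
    neighbors : (k, d) feature vectors of its k nearest reliable pixels.
    Returns w : (k,) weights with sum(w) == 1 minimizing ||x - w @ neighbors||^2.
    """
    diff = neighbors - x                               # (k, d) local differences
    G = diff @ diff.T                                  # (k, k) Gram matrix
    G += (reg * np.trace(G) + 1e-12) * np.eye(len(G))  # regularize for stability
    w = np.linalg.solve(G, np.ones(len(G)))            # solve G w = 1
    return w / w.sum()                                 # enforce the sum-to-one constraint

def propagate_depth(x, neighbors, neighbor_depths):
    """Transfer depth to an unreliable pixel with the same weights that
    reconstruct its appearance from reliable pixels."""
    return lle_weights(x, neighbors) @ neighbor_depths
```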
I. INTRODUCTION
Light field (LF) is a function that describes the amount of light flowing in every direction through every point in space. Unlike traditional 2D images, an LF contains not only the accumulated intensity at each image point but also the separate intensity values of the light rays in all directions, which enables a wide range of applications, especially in computer graphics, e.g., LF rendering, scene reconstruction, synthetic aperture photography, and 3D display.
This work was partially supported by the National High-tech R&D Program of China (863 Program, 2015AA015901) and the National Natural Science Foundation of China under Grants 61571254, 61571259, 61300122, U1301257, and U1201255. This paper was recommended by Associate Editor Peter Eisert. (Corresponding author: H. Wang.)
Y. Zhang, H. Lv, H. Wang, and X. Wang are with the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China (e-mail: zhang.yongbing@sz.tsinghua.edu.cn; lvhj13@mails.tsinghua.edu.cn; wanghaoqian@tsinghua.edu.cn; xingzheng.wang@sz.tsinghua.edu.cn).
Y. Liu and Q. Dai are with TNLIST and the Department of Automation, Tsinghua University, Beijing, China (e-mail: liuyebin@tsinghua.edu.cn; qhdai@tsinghua.edu.cn).
Q. Huang is with the College of Computer and Information, Hohai University, Nanjing, China (e-mail: huangqian@hhu.edu.cn).
X. Xiang is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China (e-mail: xgxiang@njust.edu.cn).
LFs are typically produced either by rendering a 3D model or by photographing a real scene. In either case, a large collection of viewpoints must be obtained to produce the LF views. Nowadays, there are many devices for capturing LFs photographically, such as camera arrays or a gantry consisting of a single moving camera [1]. However, camera arrays are hardware-intensive and need a complex calibration procedure, while the less expensive gantry with a single moving camera is limited to static scenes. Recently, plenoptic cameras such as Lytro [2] and Raytrix [3] have become commercially available, making it possible to acquire a large number of LFs of various scenes that can be applied in many specific applications, in particular depth estimation.
The quality of depth maps has a significant influence on LF-related applications; however, obtaining a dense and accurate depth map is a great challenge due to the large number of views in an LF. To derive accurate and reliable depth maps, many pioneering works on LF depth estimation have been reported in the literature. According to whether or not they employ the epipolar plane image (EPI, a 2D slice of the LF obtained by fixing one angular and one spatial coordinate), LF depth estimation methods can be divided into two categories.
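To make the EPI notion concrete, a minimal sketch follows. It assumes the 4D LF is stored as an array indexed by angular coordinates (s, t) and spatial coordinates (y, x), an assumed layout rather than one prescribed here. Fixing t and y yields a 2D slice in which a Lambertian scene point at depth Z traces a straight line whose slope equals its disparity d = fB/Z between adjacent views (focal length f, baseline B); scoring a predefined set of slopes, here with simple intensity variance in place of any particular matching cost, recovers a depth estimate.

```python
import numpy as np

def extract_epi(lf, y0, t0):
    """lf has shape (S, T, Y, X): angular (s, t) by spatial (y, x).
    Fixing the angular row t0 and the image row y0 yields the EPI E(s, x)."""
    return lf[:, t0, y0, :]

def best_slope(epi, x0, slopes):
    """Brute-force slope search at pixel x0 of the central view: a point at
    disparity d appears at x0 + d * (s - s_c) in view s, so the slope with
    the lowest intensity variance along the line is the depth estimate."""
    S, X = epi.shape
    s_c = S // 2
    scores = []
    for d in slopes:
        xs = np.round(x0 + d * (np.arange(S) - s_c)).astype(int)
        valid = (xs >= 0) & (xs < X)                 # drop samples leaving the EPI
        scores.append(np.var(epi[np.arange(S)[valid], xs[valid]]))
    return slopes[int(np.argmin(scores))]
```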
Depth estimation approaches employing EPI. To the best of our knowledge, the first attempt to utilize the EPI for depth estimation was presented by Bolles et al. [4], who detected edges in the EPI and then fitted straight-line segments to the edges to estimate the 3D structure. However, this basic line fitting is not robust enough, and consequently the reconstruction is sparse and noisy. Another approach was proposed by Criminisi et al. [5], who decomposed the scene into a set of spatio-temporal layers and obtained the disparities by exploiting the high degree of regularity in the EPI volume. To achieve higher quality, Wanner and Goldlucke [6], [7] applied the structure tensor to yield high-quality depth maps from 4D LFs. Their approach generates depth maps with higher accuracy; however, the global optimization process is computationally expensive, which hampers its practical usage.
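As a rough illustration of the structure-tensor idea in [6], [7] (the local orientation estimate only, without their global optimization), the sketch below recovers the dominant line direction at every EPI pixel from smoothed gradient products; the axis and sign conventions follow the layout assumed earlier and may differ from the original formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def epi_disparity_structure_tensor(epi, sigma=1.0, rho=2.0):
    """Per-pixel EPI line slope dx/ds via the 2D structure tensor.

    epi : 2D array E(s, x). Near-vertical lines (very large disparities)
    are numerically unstable and would be clipped in practice.
    """
    E = gaussian_filter(epi.astype(float), sigma)  # inner (gradient) smoothing
    Es = sobel(E, axis=0)                          # derivative along views s
    Ex = sobel(E, axis=1)                          # derivative along x
    Jxx = gaussian_filter(Ex * Ex, rho)            # outer smoothing of the
    Jss = gaussian_filter(Es * Es, rho)            # structure tensor entries
    Jxs = gaussian_filter(Ex * Es, rho)
    # Angle of the dominant eigenvector (gradient direction); the
    # iso-intensity line is perpendicular to it, with slope -tan(phi).
    phi = 0.5 * np.arctan2(2.0 * Jxs, Jxx - Jss)
    return -np.tan(phi)
```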
Depth estimation approaches without employing EPI. Yu et al. [8] encoded 3D line constraints and applied constrained Delaunay triangulation to implement LF stereo matching; however, this comes at a very high memory cost and is vulnerable to severe occlusions. Chen et al. [9] introduced a cost aggregation method based on a bilateral consistency metric on the surface camera (SCam) [10]. However, since [9] utilizes the color of the reference pixel as the mean of the bilateral filter, it is biased towards the reference view and consequently performs poorly when the input images are noisy. Kim et al. [11] leveraged coherence in massive LFs