However, these methods typically need a good starting point
(e.g., a visual hull model [25]).
2.5 Recovering Consistent View-Dependent Depth Maps
Instead of reconstructing a complete 3D model, we focus on
recovering a set of consistent view-dependent depth maps
from a video sequence. This focus is mainly motivated
by applications such as view interpolation, depth-based
segmentation, and video enhancement. Our work is closely
related to [19] and [15], which also aim to infer
consistent depth maps from multiple images. Kang and
Szeliski [19] proposed simultaneously optimizing a set of
depth maps at multiple key frames by adding a temporal
smoothness term. This method makes the disparities across
frames vary smoothly. However, it is sensitive to outliers
and may cause blending artifacts around object
boundaries. Gargallo and Sturm [15] formulated
3D modeling from images as a Bayesian MAP problem,
and solved it using the expectation-maximization (EM)
algorithm. They use the estimated depth map to determine
the visibility prior. Hidden variables are computed in a
probabilistic way to deal with occlusions and outliers. A
multiple-depth-map prior is finally used to smooth and
merge the depths while preserving discontinuities. In
comparison, our method statistically incorporates the
photo-consistency and geometric coherence constraints in
the data term definition. This scheme is especially effective
for processing video data because it can effectively suppress
temporal outliers by making use of the statistical informa-
tion available from multiple frames. Moreover, we use
efficient loopy belief propagation [10] to perform the overall
optimization. By combining the photo-consistency and
geometric coherence constraints, the distribution of our
data cost becomes distinctive, making the BP optimization
stable and fast to converge.
Temporal coherence constraints have also been used in
optical flow estimation [1] and occlusion detection [30], [37].
Larsen et al. [24] presented an approach for 3D reconstruc-
tion from multiple synchronized video streams. In order to
improve the final reconstruction quality, they used optical
flow to find corresponding pixels in the subsequent frames
of the same camera, and enforced the temporal consistency
in reconstructing successive frames. With the observation
that the depth error in conventional stereo methods grows
quadratically with depth, Gallup et al. [14] proposed a
multibaseline and multiresolution stereo method to achieve
constant depth accuracy by varying the baseline and
resolution proportionally to depth.
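The quadratic growth of depth error that motivates Gallup et al. [14] follows from first-order error propagation in the standard two-view stereo model. A hypothetical illustration (not code from any of the cited papers; the camera values are made up):

```python
# Sketch of stereo depth error propagation. With z = f * b / d
# (focal length f in pixels, baseline b, disparity d in pixels),
# a disparity error eps_d propagates to roughly
#   dz ~= z**2 / (f * b) * eps_d,
# i.e. the depth error grows quadratically with depth z.

def depth_error(z, f, b, eps_d):
    """First-order depth uncertainty caused by disparity noise eps_d."""
    return z ** 2 / (f * b) * eps_d

f, b, eps_d = 1000.0, 0.1, 0.5  # hypothetical camera and noise values
near = depth_error(2.0, f, b, eps_d)
far = depth_error(8.0, f, b, eps_d)
print(near, far)  # quadrupling the depth multiplies the error by 16
```

This is why scaling the baseline and resolution proportionally to depth, as in [14], keeps the accuracy roughly constant: both f*b and z² then grow together.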
In summary, although many approaches have been
proposed to model 3D objects or to estimate depths using
multiple input images, the problem of how to appropriately
extract information and recover consistent depths from a
video remains challenging. In this paper, we show that by
appropriately maintaining the temporal coherence, surpris-
ingly consistent and accurate dense depth maps can be
obtained from the video sequences. The recovered depth
maps have high quality and are readily usable in many
applications such as 3D modeling, view interpolation, layer
separation, and video enhancement.
3 FRAMEWORK OVERVIEW
Given a video sequence $\hat{I}$ with $n$ frames taken by a freely
moving camera, we denote $\hat{I} = \{I_t \mid t = 1, \ldots, n\}$, where
$I_t(\mathbf{x})$ represents the color (or intensity) of pixel $\mathbf{x}$ in frame $t$.
It is either a 3-vector in a color image or a scalar in a
grayscale image. In our experiments, we assume it is an
RGB color vector. Our objective is to estimate a set of
disparity maps $\hat{D} = \{D_t \mid t = 1, \ldots, n\}$. By convention,
disparity $D_t(\mathbf{x})$ ($d_\mathbf{x}$ for short) is defined as $d_\mathbf{x} = 1/z_\mathbf{x}$, where
$z_\mathbf{x}$ is the depth value of pixel $\mathbf{x}$ in frame $t$. For simplicity, the
terms “depth” and “disparity” are used interchangeably in
the following sections.
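Under this convention, depth and disparity are simple reciprocals, and sampling uniformly in disparity concentrates depth levels near the camera. A minimal sketch (the level count and depth range below are made-up values, not the paper's settings):

```python
# Minimal sketch of the d_x = 1 / z_x convention: disparity is
# inverse depth. Uniform disparity levels are dense in depth near
# the camera and sparse far away.

def depth_to_disparity(z):
    return 1.0 / z

def disparity_levels(z_min, z_max, m):
    """m disparity levels spaced uniformly between 1/z_max and 1/z_min."""
    d_min, d_max = 1.0 / z_max, 1.0 / z_min
    step = (d_max - d_min) / (m - 1)
    return [d_min + i * step for i in range(m)]

levels = disparity_levels(z_min=1.0, z_max=10.0, m=4)
print(levels)  # approximately [0.1, 0.4, 0.7, 1.0]
```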
The set of camera parameters for frame $t$ in a video
sequence is denoted as $C_t = \{K_t, R_t, T_t\}$, where $K_t$ is the
intrinsic matrix, $R_t$ is the rotation matrix, and $T_t$ is the
translation vector. The parameters for all frames can be
estimated reliably by structure from motion (SFM)
techniques [17], [29], [50]. Our system employs the SFM
method of Zhang et al. [50].
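With these per-frame parameters, projecting a world point into frame $t$ can be sketched as below. This is a generic pinhole-camera sketch, assuming the common convention $\mathbf{x}_{\mathrm{cam}} = R\,\mathbf{X} + T$; the paper may adopt a different sign or normalization.

```python
import numpy as np

# Sketch of pinhole projection with per-frame parameters
# C_t = {K_t, R_t, T_t}, assuming x_cam = R @ X + T.

def project(K, R, T, X):
    """Project a world point X (3,) to pixel coordinates (2,)."""
    x_cam = R @ X + T           # world -> camera coordinates
    x_img = K @ x_cam           # camera -> homogeneous image coords
    return x_img[:2] / x_img[2] # perspective division

# Toy camera: identity pose, hypothetical intrinsics.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
T = np.zeros(3)

# A point on the optical axis lands at the principal point.
print(project(K, R, T, np.array([0.0, 0.0, 2.0])))  # [320. 240.]
```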
In order to robustly estimate a set of disparity maps, we
define the following energy in a video:
$$E(\hat{D}; \hat{I}) = \sum_{t=1}^{n} \big( E_d(D_t; \hat{I}, \hat{D} \setminus D_t) + E_s(D_t) \big), \qquad (1)$$
where the data term $E_d$ measures how well disparity $\hat{D}$
fits the given sequence $\hat{I}$ and the smoothness term $E_s$
encodes the disparity smoothness. For each pixel in
disparity map $D_t$, because it maps to one point in 3D,
there should exist corresponding pixels in other nearby
frames. These pixels not only satisfy the photo-consis-
tency constraint, but also have their geometric informa-
tion consistent. We thus propose a bundle optimization
framework to model the explicit correlation among the
pixels and use the collected statistics to optimize the
disparities jointly.
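The per-frame structure of (1) can be sketched schematically. The `data_term` and `smooth_term` callables below are hypothetical placeholders standing in for $E_d$ and $E_s$, whose actual definitions are given later in the paper:

```python
# Schematic sketch of the energy in (1): a sum over frames of a
# data term (which also sees all other disparity maps, D-hat
# minus D_t) and a per-frame smoothness term.

def total_energy(D, I, data_term, smooth_term):
    """E(D, I) = sum_t [ E_d(D_t; I, D \\ D_t) + E_s(D_t) ]."""
    E = 0.0
    for t, D_t in enumerate(D):
        others = D[:t] + D[t + 1:]  # all disparity maps except D_t
        E += data_term(D_t, I, others) + smooth_term(D_t)
    return E

# Toy check with trivial placeholder terms.
D = [[1.0, 2.0], [3.0, 4.0]]
E = total_energy(D, None,
                 data_term=lambda D_t, I, others: sum(D_t),
                 smooth_term=lambda D_t: 0.0)
print(E)  # 10.0
```

The point of the structure is that each frame's data cost is coupled to the other frames' disparity maps, which is what the bundle optimization exploits.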
Fig. 2 gives an overview of our framework. With an
input video sequence, we first employ the SFM method to
recover the camera parameters. Then, we initialize the
disparity map for each frame independently. Segmentation
prior is incorporated into initialization for improving the
disparity estimation in large textureless regions. After
initialization, we perform bundle optimization to iteratively
976 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 6, JUNE 2009
Fig. 2. Overview of our method.