Example-Based Video Stereolization
With Foreground Segmentation and
Depth Propagation
Lei Wang and Cheolkon Jung, Member, IEEE
Abstract—With advances in 3DTV technology, video stereolization has attracted much attention in recent years. Although video stereolization can enrich stereoscopic 3D contents, it is hard to create good depth maps from monocular 2D videos. In this paper, we propose an automatic example-based video stereolization method with foreground segmentation and depth propagation, called EBVS. To consider both performance and computational complexity, we estimate depth maps separately for key and non-key frames. In the key frames, we first estimate an initial depth map based on examples from the RGB-D training data set, and then refine it to preserve the boundaries of foreground objects. In the non-key frames, we generate depth maps by propagating the depth map of the key frame using motion compensation. Finally, we employ depth-image-based rendering (DIBR) to generate stereoscopic views from the 2D videos and their depth maps. Extensive experiments verify that the proposed EBVS produces visually pleasing and realistic stereoscopic 3D views from 2D videos.
Index Terms—3DTV, depth generation, depth propagation, depth-image-based rendering, learning-based, stereoscopic views, video stereolization.
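For readers unfamiliar with DIBR, the following minimal Python sketch (not the implementation used in this paper) illustrates the basic idea of synthesizing a second view by shifting pixels horizontally in proportion to depth; the maximum-disparity value, the depth convention (larger means closer), and the naive hole filling are assumptions made only for illustration.

import numpy as np

def render_right_view(left, depth, max_disparity=16):
    """Toy DIBR: shift left-view pixels horizontally by a depth-dependent
    disparity to synthesize a right view.  `left` is H x W x 3 (uint8),
    `depth` is H x W with values in [0, 255] (assumed: larger = closer)."""
    h, w = depth.shape
    disp = (depth.astype(np.float32) / 255.0) * max_disparity
    right = np.zeros_like(left)
    zbuf = np.full((h, w), -1.0)               # keep the nearest source pixel per target
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disp[y, x]))    # closer pixels shift farther
            if 0 <= xr < w and disp[y, x] > zbuf[y, xr]:
                right[y, xr] = left[y, x]
                zbuf[y, xr] = disp[y, x]
        for x in range(1, w):                  # crude hole filling from the left neighbor
            if zbuf[y, x] < 0:
                right[y, x] = right[y, x - 1]
                zbuf[y, x] = 0.0
    return right

Practical DIBR pipelines typically add depth-map preprocessing and more careful hole filling than this toy version.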
I. INTRODUCTION

BECAUSE 3DTV provides realistic 3D effects to viewers based on stereoscopic 3D contents, it is expected to take a dominant position in the next-generation digital TV market. However, the promotion of 3DTV is constrained by the lack of available stereoscopic 3D contents. Although new stereoscopic contents have recently been captured by stereoscopic cameras and active depth sensors [1], converting the existing large amounts of monocular 2D videos into stereoscopic 3D contents, called video stereolization, still remains an open problem. The visual ability to perceive stereoscopic 3D contents is closely related to human depth perception. That is, the slight difference between the left-eye and right-eye views, i.e., horizontal disparity, is transformed into different depth information and leads to different stereoscopic visual perceptions: outward perception, on-screen perception, and inward perception [2].

Manuscript received February 06, 2014; revised May 20, 2014; accepted July 14, 2014. Date of publication July 22, 2014; date of current version October 13, 2014. This work was supported by the National Natural Science Foundation of China under Grant 61271298 and the International S&T Cooperation Program of China under Grant 2014DFG12780. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jing-Ming Guo.
The authors are with the Key Lab of Intelligent Perception and Image Understanding, Ministry of Education of China, Xidian University, Xi'an 710071, China (e-mail: lwang@stu.xidian.edu.cn; zhengzk@xidian.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2014.2341599
A. Related Work
A key component for video stereolization is depth map estimation from 2D videos. Up to the present, a number of studies have been conducted to estimate depth maps from monocular 2D videos automatically and semi-automatically [3]–[24].
Representative methods for automatic depth map estimation are structure from motion (SFM) [3], depth-from-defocus [4], depth from geometric perspective [5], and depth from models [6], [7], among others. SFM obtained the depth information based on the tracked feature points and the camera poses. With the calculated camera poses, multiple view stereo was applied to each frame to produce dense depth maps; this relies on the assumption of orthographic projection to estimate the 3D structure and camera motion [8].
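As a concrete illustration of the sparse side of such SFM pipelines, the sketch below uses OpenCV to track features between two frames, recover the relative camera pose, and triangulate 3D points whose z coordinates give per-feature depth. It is only a schematic stand-in built from standard perspective-projection calls (not the orthographic model of [8]), and the intrinsic matrix K is assumed to be known.

import cv2
import numpy as np

def sparse_sfm_depth(frame1, frame2, K):
    """Schematic SFM step: track features between two frames, recover the
    relative camera pose, and triangulate sparse 3D points; the z values
    serve as per-feature depth.  `K` is an assumed 3x3 intrinsic matrix."""
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    p1 = cv2.goodFeaturesToTrack(g1, maxCorners=500, qualityLevel=0.01, minDistance=7)
    p2, status, _ = cv2.calcOpticalFlowPyrLK(g1, g2, p1, None)
    p1, p2 = p1[status.ravel() == 1], p2[status.ravel() == 1]

    E, _ = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K)

    # Projection matrices of the two views (first camera at the origin).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, p1.reshape(-1, 2).T, p2.reshape(-1, 2).T)
    pts3d = (pts4d[:3] / pts4d[3]).T           # N x 3; the z column is the depth
    return p1.reshape(-1, 2), pts3d[:, 2]

Dense depth would then require interpolating or propagating these sparse values, which is where the multi-view stereo stage mentioned above comes in.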
Disparis et al. [9] and Moustakas et al. [10] dealt with dynamic scenes by segmenting rigid objects into layers and employing SFM [3] for each layer to reconstruct 3D structures. Knorr and Sikora [11], Rotem [12], and Zhang et al. [13] generated dense depth maps by synthesizing one view from the other frames in an input video to achieve lower computational complexity. However, these methods were designed to handle static scenes, and certain assumptions regarding pixel correspondence and the projection model had to be made to reconstruct the scene geometry and camera positions from two or more images. In [4], the wavelet transform was used to measure defocus information in the image, and depth values were then assigned to the high-frequency areas of the image. The depth values were obtained by analyzing the high-frequency wavelet subbands of an image, with the number of high-value wavelet transform coefficients taken as a blurring measure.
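The idea can be sketched as follows: for each image block, a 2D wavelet transform is computed and the number of large detail coefficients serves as a focus measure, which is reused as a relative depth cue. The wavelet type, block size, and threshold below are illustrative assumptions, not the settings of [4].

import numpy as np
import pywt

def defocus_depth_map(gray, block=16, wavelet="haar", thresh=10.0):
    """Toy depth-from-defocus cue: for each block, count high-magnitude
    wavelet detail coefficients (a blur/focus measure) and treat sharper
    blocks as closer.  `gray` is a 2D grayscale image."""
    h, w = gray.shape
    gray = gray.astype(np.float32)
    depth = np.zeros((h // block, w // block), dtype=np.float32)
    for by in range(depth.shape[0]):
        for bx in range(depth.shape[1]):
            patch = gray[by * block:(by + 1) * block, bx * block:(bx + 1) * block]
            _, (cH, cV, cD) = pywt.dwt2(patch, wavelet)
            details = np.concatenate([cH.ravel(), cV.ravel(), cD.ravel()])
            depth[by, bx] = np.count_nonzero(np.abs(details) > thresh)
    depth /= depth.max() + 1e-6                # normalize to [0, 1]
    return depth                               # one relative depth value per block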
In [5], depth maps were generated based on the positions of lines and vanishing points, where the vanishing point generally corresponds to the farthest distance.
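A minimal illustration of this cue, assuming the vanishing point has already been detected (e.g., by intersecting dominant lines) and assuming a simple linear depth ramp, assigns depth by image distance to that point:

import numpy as np

def vanishing_point_depth(height, width, vp):
    """Toy geometric-perspective cue: pixels closer to the vanishing point
    `vp` = (vx, vy) are treated as farther away; depth falls off linearly
    with image distance to the vanishing point."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.sqrt((xs - vp[0]) ** 2 + (ys - vp[1]) ** 2)
    return dist / (dist.max() + 1e-6)          # 0 = farthest (at the VP), 1 = nearest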
Since more than one depth cue exists in most cases, [14] utilized hybrid depth cues such as perspective geometry, defocus, and visual saliency; the final depth map was generated by fusing them together.
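One simple way to realize such fusion is a normalized weighted average of the individual cue maps; the cues and weights below are placeholders for illustration rather than those of [14].

import numpy as np

def fuse_depth_cues(cues, weights=None):
    """Fuse several per-pixel depth-cue maps (e.g., perspective, defocus,
    saliency) into one depth map by a normalized weighted average."""
    cues = [c.astype(np.float32) for c in cues]
    cues = [(c - c.min()) / (c.max() - c.min() + 1e-6) for c in cues]  # normalize each cue
    if weights is None:
        weights = np.ones(len(cues), dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    return sum(w * c for w, c in zip(weights, cues)) / weights.sum()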
Another approach was model-based automatic depth estimation, which constructs one or several depth models for natural scenes and blends them together [6], [7], [15]. Chen et al. [6] utilized edge information to segment regions, and then generated depth maps by assigning each region to an a priori hypothesis of the depth gradient. Yamada et al. [7] generated depth maps with three simple models based on color theory. Lin et al. [15] adopted a depth estimation strategy based on foreground-background separation. They adopted a three-layer back-propagation neural