724 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 2, FEBRUARY 2015
Toward Naturalistic 2D-to-3D Conversion
Weicheng Huang, Xun Cao, Member, IEEE, Ke Lu, Qionghai Dai, Senior Member, IEEE,
and Alan Conrad Bovik, Fellow, IEEE
Abstract—Natural scene statistics (NSS) models have been
developed that make it possible to impose useful perceptually
relevant priors on the luminance, colors, and depth maps of
natural scenes. We show that these models can be used to develop
3D content creation algorithms that can convert monocular
2D videos into statistically natural 3D-viewable videos. First,
accurate depth information on key frames is obtained via human
annotation. Then, both forward and backward motion vectors are
estimated and compared to decide the initial depth values, and
a compensation process is applied to further improve the depth
initialization. Then, the luminance/chrominance and initial depth
map are decomposed by a Gabor filter bank. Each subband of
depth is modeled to produce an NSS prior term. The statistical
color–depth priors are combined with the spatial smoothness
constraint in the depth propagation target function as a prior
regularizing term. The final depth map associated with each
frame of the input 2D video is optimized by minimizing the
target function over all subbands. Finally, stereoscopic frames
are rendered from the color frames and their associated depth
maps. We evaluated the quality of the generated 3D videos
using both subjective and objective quality assessment methods.
The experimental results obtained on various sequences show
that the presented method outperforms several state-of-the-art
2D-to-3D conversion methods.
Index Terms— 2D-to-3D conversion, depth propagation,
natural scene statistics, Bayesian inference.
I. INTRODUCTION
THREE-DIMENSIONAL (3D) video has become quite
popular in recent years. Yet, the proliferation of
3D capture and display devices has not been matched
by a corresponding degree of availability of quality
3D video content. Towards helping to overcome this
3D content shortage, a new 3D content creation technology,
2D-to-3D conversion, is being developed to convert existing
2D videos into 3D videos [1], [2].
Manuscript received October 15, 2013; revised May 30, 2014; accepted
December 17, 2014. Date of publication December 23, 2014; date of current
version January 9, 2015. This work was supported in part by the National
Science Foundation of China under Project 61371166 and Project 61422107,
in part by the Importation and Development of High-Caliber Talents Project
through the Beijing Municipal Institutions under Grant IDHT20130225, in
part by the National Natural Science Foundation of China under Grant
61103130 and Grant 61271435, and in part by the National Program on Key
Basic Research Project (973 Program) under Grant 2010CB731804-1. The
associate editor coordinating the review of this manuscript and approving it
for publication was Prof. Charles Boncelet.
W. Huang and K. Lu are with the College of Engineering and Information
Technology, University of Chinese Academy of Sciences, Beijing 100049,
China (e-mail: luk@ucas.ac.cn).
X. Cao is with the School of Electronic Science and Engineering, Nanjing
University, Nanjing 210093, China (e-mail: caoxun@nju.edu.cn).
Q. Dai is with the Department of Automation, Tsinghua University,
Beijing 100084, China (e-mail: qhdai@tsinghua.edu.cn).
A. C. Bovik is with the Department of Electrical and Computer
Engineering, University of Texas at Austin, Austin, TX 78712 USA (e-mail:
bovik@ece.utexas.edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2385474
2D-to-3D video conversion methods can be divided into
two categories, depending on whether human-computer
interaction is involved in the conversion process: fully-
automatic methods and semi-automatic methods [2]. Current
fully-automatic methods are generally only able to deliver
a limited 3D effect. Semi-automatic methods, however, make it
possible to balance 3D content quality against production
cost, and they have been used to convert popular older films,
such as the Star Wars series and Titanic, into successful
cinematic 3D presentations [3]. The
general approach to semi-automatic 2D-to-3D conversion is to
manually or semi-manually create high quality depth maps at
strategically chosen key frames or parts of frames, then propa-
gate depth information from the key frames to non-key frames
to initialize depth calculations at non-key frames (see Fig. 1 for
an illustration). The highest cost arises during the process of
assigning depths to key frames, whereas the 3D quality of
the final production largely depends on the accuracy of the
key frame depth maps, the key frame separations, and the
depth propagation method. Smaller key frame intervals and
more accurate key frame depths lay a better foundation for
subsequent depth propagation, leading to improved stereo
quality; unfortunately, they also increase the cost.
Developing depth propagation methods that effectively
control depth errors can make it possible to relax the key frame
interval constraints, while also significantly improving the
final quality. The additional algorithmic complexity of such
automation is a negligible cost compared with the reduction
in human-computer interaction it yields. This is the main reason why
depth propagation plays such a critical role in 2D-to-3D video
conversion.
Recently, statistical models of natural scenes have proven
to provide useful constraints on many image processing and
computer vision problems, including image compression [4],
image and video quality prediction [5], image denoising [6]
and stereo matching [7], [8]. They provide powerful statistical
priors that can force ill-posed visual problems towards stable,
naturalistic solutions. For example, the univariate distributions
of band-pass luminance images (wavelet coefficients) are
well-modeled as obeying a generalized Gaussian distribution:
P(c) = \frac{e^{-|c/s|^{p}}}{Z(s, p)}    (1)
where Z(s, p) is a normalizing constant that forces the integral
of P(c) to be 1, while the parameters p, s control the shape
and spread of the distribution, respectively. Liu et al. [7] also
showed that the conditional magnitudes of luminance and
depth are mutually dependent, i.e., regions exhibiting larger
luminance variations often have larger depth variations and
vice versa.
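As a concrete illustration of the distribution in (1), the short numerical sketch below evaluates the density using the standard closed-form normalizer for this family, Z(s, p) = 2sΓ(1/p)/p, and verifies that it integrates to one. The parameter values are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from math import gamma

def ggd_pdf(c, s, p):
    """Generalized Gaussian density P(c) = exp(-|c/s|^p) / Z(s, p).

    Z(s, p) = 2 * s * Gamma(1/p) / p is the closed-form normalizing
    constant that makes the density integrate to 1 over the real line.
    """
    Z = 2.0 * s * gamma(1.0 / p) / p
    return np.exp(-np.abs(c / s) ** p) / Z

# Shape p < 1 gives the sharp peak and heavy tails typical of
# band-pass (wavelet/Gabor) luminance coefficients.
c = np.linspace(-60.0, 60.0, 400001)
dc = c[1] - c[0]
area = ggd_pdf(c, s=1.0, p=0.7).sum() * dc  # Riemann-sum check of the normalization
print(f"integral of P(c): {area:.4f}")
```

Smaller values of p concentrate more probability mass near zero while fattening the tails, which is why fitted shape parameters well below 2 (the Gaussian case) are characteristic of band-pass natural image statistics.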
1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.