SUPERPIXEL MATCHING-BASED DEPTH PROPAGATION FOR 2D-TO-3D CONVERSION
WITH JOINT BILATERAL FILTERING
Cheolkon Jung and Jiji Cai
School of Electronic Engineering, Xidian University, Xi’an 710071, China
zhengzk@xidian.edu.cn
ABSTRACT
In this paper, we propose superpixel matching-based depth
propagation for 2D-to-3D video conversion with joint
bilateral filtering. First, we perform superpixel matching,
instead of block matching, to estimate the motion vectors of
superpixels from the reference depth map. Because a
superpixel is a group of pixels with similar characteristics,
superpixel matching estimates motion vectors more
accurately than block matching. Then, we conduct depth
compensation based on the motion vectors to obtain the
current depth map. However, superpixels vary in size,
which causes matching errors in the compensated depth
map. Thus, we perform joint bilateral filtering to refine the
depth map. Experimental results show that the proposed
algorithm successfully performs depth propagation and
produces high-quality depth maps for 2D-to-3D conversion.
Index Terms—2D-to-3D conversion, depth
compensation, depth propagation, joint bilateral filtering,
superpixel matching.
1. INTRODUCTION
Stereoscopic three-dimensional (S3D) videos dramatically
enhance the traditional viewing experience by immersing the
viewer. However, the available S3D content is insufficient
for the proliferation of S3D display devices. To overcome
this content shortage, 2D-to-3D conversion, which estimates
3D information from monocular video shots, is very useful
for converting conventional 2D video into S3D content [1].
Existing 2D-to-3D conversion methods are classified into
three main categories: manual, fully automatic, and
semi-automatic approaches [2]. Although manual methods
can produce the most accurate depth map for each individual
frame, their enormous time cost and human involvement
make them impractical in most applications. In contrast,
fully automatic 2D-to-3D conversion utilizes depth cues
such as motion, linear perspective, atmospheric perspective,
texture gradient, and relative height to estimate the 3D
structure of 2D scenes [3, 4] without any user participation.
However, these automatic techniques suffer from inaccurate
depth estimation and poor adaptation to varying video
content, which degrades the viewing experience. In recent
years, semi-automatic 2D-to-3D conversion methods [5-10]
have provided a better balance between quality and cost than
fully automatic ones by introducing human-computer
interaction. A representative approach to semi-automatic
2D-to-3D conversion is depth propagation. Its main idea is
to manually or semi-manually create high-quality depth
maps at key frames, i.e., reference depth maps, and then
propagate them to non-key frames, i.e., current depth maps.
For manual depth assignment, computer software (e.g.,
Photoshop) or algorithms (e.g., Lazy Snapping [11]) may be
used to facilitate the user's operation, which is beyond the
scope of this paper. Instead, we focus on propagating the
reference depth maps of key frames to those of non-key
frames. Recent approaches are mostly based on motion
estimation and compensation. For example, Varekamp and
Barenbrug [8] propagated depth information from key
frames to non-key frames via motion compensation. Lie et
al. [9] introduced trilateral filtering to reduce depth
propagation errors. Cao et al. [10] proposed a
semi-automatic conversion method that first performed
multi-object segmentation to create disparity maps for key
frames and then employed shifted bilateral filtering to
propagate depth to non-key frames. In general, these
methods suffer from inaccurate motion estimation and
compensation due to complex local motions or object
occlusions. As shown in Fig. 1, the three girls' hands
undergo complex local motions; moreover, the hands are
small and their colors are similar to the background.
Because block matching relies only on color differences
within rectangular blocks of a fixed size, it produces large
errors in such regions (see the red circles in Fig. 1).
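The fixed-window matching criticized above can be sketched as follows. This is a generic block-matching illustration, not the paper's implementation: the function name, block size, and search range are our own illustrative choices, and the cost is a simple sum of absolute differences (SAD) over grayscale frames, i.e., color difference within a rectangular block of a fixed size.

```python
import numpy as np

def block_match_depth(cur, ref, ref_depth, block=8, search=4):
    """Generic block matching for depth propagation: for each fixed-size
    rectangular block of the current frame, find the best SAD (color) match
    in the reference frame and copy the corresponding reference depth.
    The fixed block size and the color-only cost are exactly the
    limitations noted above for small regions with background-like colors."""
    h, w = cur.shape
    out = np.zeros_like(ref_depth)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur_blk = cur[by:by + block, bx:bx + block].astype(np.int64)
            best, best_dy, best_dx = None, 0, 0
            # Exhaustive search over a small window of candidate motions.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    sad = np.abs(cur_blk - ref[y:y + block, x:x + block]).sum()
                    if best is None or sad < best:
                        best, best_dy, best_dx = sad, dy, dx
            # Copy the reference depth of the best-matching block.
            out[by:by + block, bx:bx + block] = \
                ref_depth[by + best_dy:by + best_dy + block,
                          bx + best_dx:bx + best_dx + block]
    return out
```

When an object is smaller than the block and its color resembles the background, the SAD cost is dominated by background pixels, so the estimated motion vector (and hence the copied depth) is wrong in exactly the way Fig. 1 shows.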
In this paper, we propose superpixel matching-based depth
propagation for 2D-to-3D conversion with joint bilateral
filtering.
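As a rough illustration of the refinement step, a joint bilateral filter averages neighboring depth samples with weights that combine spatial closeness and color similarity in a guidance image, so depth edges are pulled toward color edges. The single-channel sketch below is our own minimal version; the function name, window radius, and sigma values are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=2, sigma_s=2.0, sigma_r=10.0):
    """Refine a depth map using a grayscale color image as guidance.
    Each output pixel is a weighted average of neighboring depth values,
    where the weight combines spatial closeness (sigma_s) with color
    similarity in the guidance image (sigma_r)."""
    h, w = depth.shape
    df = depth.astype(np.float64)
    gf = guide.astype(np.float64)
    d = np.pad(df, radius, mode='edge')
    g = np.pad(gf, radius, mode='edge')
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Depth and guidance values of the neighbor at offset (dy, dx).
            d_shift = d[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            g_shift = g[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            w_spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            w_range = np.exp(-((g_shift - gf) ** 2) / (2.0 * sigma_r ** 2))
            weight = w_spatial * w_range
            num += weight * d_shift
            den += weight
    return num / den  # den > 0 everywhere (the center weight is 1)
```

Because the range weight is computed on the color image rather than on the depth map itself, neighbors that belong to a different object (different color) contribute little, which suppresses the compensation errors discussed above.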
Fig. 1 Inaccurate depth propagation caused by block matching under
complex local motions and object occlusions. (a) Reference color image.
(b) Current color image. (c) Depth map estimated by block matching.