926 IEEE Transactions on Consumer Electronics, Vol. 58, No. 3, August 2012
Contributed Paper
Manuscript received 06/24/12
Current version published 09/25/12
Electronic version published 09/25/12. 0098-3063/12/$20.00 © 2012 IEEE
Inter Mode Selection for Depth Map Coding
in 3D Video
Liquan Shen, Zhaoyang Zhang, Zhi Liu
Abstract —3D video (3DV) data usually includes both
conventional 2D videos and corresponding depth maps. In 3DV
coding, color videos and depth maps need to be jointly coded.
Usually, a 3DV coding system uses a two-channel method that
encodes color videos and depth maps with two parallel codec
implementations, making the complexity and hardware
requirements nearly twice those of coding
2D color videos. While low-complexity color video coding has
been broadly studied, depth map coding has received much less
attention. The depth map represents the distance from the camera
to objects in the scene, and thus has characteristics both of
Z-axis data in world coordinates and of a video signal.
Consequently, there is a high correlation between the motion
information (prediction mode, reference frame, and motion vector)
of a color video and that of its depth map. Based on this
observation, we propose a method to
reduce depth coding complexity. An experimental analysis is
performed to study the prediction-mode correlation between the
coding information of color videos and that of depth maps. Based
on this correlation, we propose an efficient mode decision algorithm.
With almost the same rate-distortion (RD) performance, the
proposed algorithm reduces the computational complexity of depth
coding by about 70%, which is beneficial for real-time hardware
or software implementations in 3DV applications.1
Index Terms —3DTV/FTV, depth coding, mode decision.
I. INTRODUCTION
Although multi-view video (MVV) can provide both an
immersive sense of realism and free viewpoint navigation, it has
shortcomings that prevent its direct use in 3DTV or free
viewpoint television (FTV) systems [1]. The
performance of the MVV system highly depends on the number
of original views. Thus, the system must capture a very large
number of views and encode a huge amount of multi-view data
to display a realistic 3D scene with multiple viewpoints at the
decoder side. The main challenge of the MVV system is therefore
its high storage and transmission-bandwidth requirements. To
address this problem, the Moving Picture Experts Group (MPEG)
has initiated work toward a new standard for 3DTV and FTV,
referred to as 3DVC (3D video coding).
1 This work is sponsored by the Shanghai Rising-Star Program
(11QA1402400) and the Innovation Program of the Shanghai Municipal
Education Commission, and is supported by the National Natural Science
Foundation of China under grants No. 60832003, 60902085, and 61171084.
The authors are with the Key Laboratory of Advanced Display and System
Application, Shanghai University, Ministry of Education, Shanghai 200072,
China (e-mail: jsslq@163.com).
Recently, new data formats including captured 2D video sequences and
corresponding depth maps have been proposed for 3DVC. With
color videos and depth maps, virtual views can be generated
using Depth-Image-Based-Rendering (DIBR) techniques [2].
The depth image represents a relative distance from a camera to
an object in the 3D space, which is widely used in computer
vision and computer graphics fields to represent 3D information.
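In MPEG-style depth formats, the 8-bit depth sample is commonly an inverse-depth quantization between a near and a far clipping plane, which a DIBR renderer inverts to recover metric depth before warping. The sketch below illustrates that mapping; the clipping-plane values are illustrative, not parameters from this paper:

```python
def depth_sample_to_z(v, z_near=0.5, z_far=10.0):
    """Map an 8-bit depth sample v (0..255) to metric depth Z.

    Assumes the inverse-depth quantization common in MPEG-style
    depth maps: v = 255 corresponds to z_near (the closest point),
    v = 0 to z_far. z_near and z_far are illustrative values.
    """
    inv_z = (v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z

# Larger sample values mean closer objects: v = 255 -> z_near, v = 0 -> z_far.
```

A DIBR renderer would then convert Z to a per-pixel disparity using the camera baseline and focal length before warping the color pixels to the virtual view.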
Since depth maps are required to be transmitted together with 2D
videos, depth compression needs to be investigated in 3DV
coding. Typically, a depth image consists of depth samples, each
represented by a scaled 8-bit value and corresponding to a pixel
in the video frame. It can be regarded as
a typical grayscale image/video [3]. Thus the most
straightforward approach to compress depth map sequences is to
encode them using conventional image/video compression
algorithms such as H.264/AVC or multiview video coding (MVC).
Existing depth coding techniques can be classified into
two main groups according to their relation to color video
coding: independent coding and joint coding. Independent depth
coding techniques encode the depth image using the
characteristics of depth data. A platelet-based independent depth
video coding scheme is proposed in [4]; it employs a
quadtree decomposition that divides the image into blocks and
models each block with one of four platelet functions. A mesh-based
depth coding is proposed in [5] to improve compression
efficiency. The main problem with these independent coding
schemes is their limited coding efficiency, since
redundancies between the color video and the corresponding
depth map are not exploited. In contrast, joint coding
algorithms proposed in [6-9] consider the correlation between
the depth map and the corresponding video. Motion information
from the color video is utilized to improve the efficiency of
depth map coding in [6-7]. A joint coding method for both the color
video and the depth map in [8] uses the concept of the layered
depth image to represent and process multi-view video with
depth. The depth map coding for view synthesis is proposed to
improve the view rendering quality in [9]. However, these joint
depth coding techniques focus only on the improvement of
depth coding efficiency and do not evaluate the coding
complexity. As noted above, a typical 3DV coding system encodes
the color video and depth map sequences with two parallel H.264
codec implementations, so its complexity and hardware requirements
are nearly twice those of coding 2D videos. While low-complexity
color video coding has been broadly studied, depth coding has
received much less attention. Hence, how to code the depth map
sequence efficiently, and at low complexity, is an important issue.
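The correlation described above can be turned into a simple candidate-mode pruning rule, sketched below for H.264 inter macroblock partitions: since depth maps are piecewise smooth, a depth macroblock rarely needs a finer partition than its co-located color macroblock, so only the color mode and coarser modes are evaluated. The mode list and pruning rule here are an illustrative sketch, not the exact algorithm derived in the remainder of this paper:

```python
# H.264 inter macroblock partition modes, ordered coarse to fine
# (8x8 implies further sub-macroblock partitions).
MODES = ["SKIP", "16x16", "16x8", "8x16", "8x8"]

def candidate_depth_modes(color_mode):
    """Prune the mode search for a depth macroblock using the best
    mode of the co-located color macroblock.

    Illustrative heuristic: evaluate the color mode itself plus all
    coarser partitions, and skip the finer ones entirely.
    """
    idx = MODES.index(color_mode)
    return MODES[:idx + 1]

print(candidate_depth_modes("16x8"))  # ['SKIP', '16x16', '16x8']
```

Skipping the finer partitions avoids their motion estimation entirely, which is where most encoding time is spent; savings of this kind are behind the roughly 70% complexity reduction reported in the abstract.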