1051-8215 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCSVT.2014.2367356, IEEE Transactions on Circuits and Systems for Video Technology
resolution of the right view. This method belongs to the
category of reconstruction-based methods, which set up an
energy function and then minimize it to obtain the optimal
solution. The energy function usually consists of a data term
and some other constraint terms.
Here, we start constructing the energy function with the following data term. Mathematically,

$E_{\text{data}} = \left\| S K I_N^R - I_N^R(\text{low}) \right\|_2^2,$  (1)
where $S$ is a down-sampling operator and $K$ is a blurring operator, $I_N^R$ is a variable referring to the expected full-resolution right view of the $N$th frame, and $I_N^R(\text{low})$ is the initial low-resolution input of the $N$th frame. $\|\cdot\|_2$ denotes the Euclidean norm. This is a common term used in reconstruction-based methods [23], [24]. It enforces a constraint on the expected full-resolution image $I_N^R$ so that it is consistent with the low-resolution input after the blurring and down-sampling process.
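For concreteness, the data term in (1) can be sketched as follows. The blurring kernel $K$ and the down-sampling operator $S$ are not fixed by the paper; this sketch assumes a simple box blur and integer-factor decimation, both illustrative choices:

```python
import numpy as np

def data_term(I_R, I_R_low, scale=2, blur_size=3):
    """Squared L2 residual ||S K I_R - I_R_low||^2, as in Eq. (1).

    K: box blur (an illustrative stand-in for the paper's unspecified kernel).
    S: decimation by an integer factor `scale` (also an assumption).
    """
    # Box blur via a uniform convolution with edge padding.
    pad = blur_size // 2
    padded = np.pad(np.asarray(I_R, dtype=float), pad, mode='edge')
    blurred = np.zeros(I_R.shape, dtype=float)
    for dy in range(blur_size):
        for dx in range(blur_size):
            blurred += padded[dy:dy + I_R.shape[0], dx:dx + I_R.shape[1]]
    blurred /= blur_size ** 2
    # Down-sample by decimation.
    low = blurred[::scale, ::scale]
    return float(np.sum((low - I_R_low) ** 2))
```

When the candidate full-resolution image reproduces the low-resolution input exactly after blurring and decimation, this term is zero; any inconsistency is penalized quadratically.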
In addition, the high-frequency information of the left full-resolution view can be used to enhance the resolution of the right view, since the two views share many scene points. Therefore, once the correspondence between the left and right views is obtained, we can add a mapping term to the energy function. A similar disparity-based pixel mapping strategy is also applied in [10] and [17]. The explicit form of this term is:
$E_{\text{map}} = \sum_{(m,n)\in\Lambda} c_{mn} \left\| I_N^R(m,n) - I_N^L\left(m,\, n + D'_N(m,n)\right) \right\|_2^2,$  (2)
where $\Lambda$ is the pixel index set of the image grid, $I_N^L$ is the $N$th frame of the left full-resolution view, and $D'_N$ denotes the stereo correspondence of the $N$th frame (the depth map^1 of $I_N^R$ relative to $I_N^L$). It can be obtained from the corresponding left-view depth map $D_N$ of $I_N^L$ relative to $I_N^R$, and we use linear interpolation to deal with the non-integer case. $c_{mn}$ is a binary confidence value for $D'_N(m,n)$. It is necessary because the depth map may be inaccurate, especially in occlusion regions and in the non-overlapping regions of the two views. In this paper, we determine $c_{mn}$ by measuring the similarity (mean square error, MSE) between the local patch centered at $I_N^R(m,n)$ and the local patch centered at $I_N^L(m, n + D'_N(m,n))$.
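The confidence test for $c_{mn}$ can be sketched as a threshold on the patch MSE. The threshold `tau` and patch radius `r` below are illustrative choices, not values specified by the paper, and disparities are rounded to the nearest integer rather than linearly interpolated to keep the sketch short:

```python
import numpy as np

def patch(img, m, n, r=2):
    """(2r+1) x (2r+1) patch centered at (m, n), edge-padded at borders."""
    padded = np.pad(np.asarray(img, dtype=float), r, mode='edge')
    return padded[m:m + 2 * r + 1, n:n + 2 * r + 1]

def confidence(I_R, I_L, D, m, n, tau=25.0, r=2):
    """Binary confidence c_mn for D'_N(m, n): 1 if the MSE between the
    patch around (m, n) in the right view and the disparity-shifted
    patch in the left view is below tau, else 0. tau and r are
    hypothetical parameters chosen for illustration."""
    d = int(round(D[m, n]))                       # nearest-integer disparity
    n_left = int(np.clip(n + d, 0, I_L.shape[1] - 1))
    mse = np.mean((patch(I_R, m, n, r) - patch(I_L, m, n_left, r)) ** 2)
    return 1 if mse < tau else 0
```

Pixels whose mapped patches disagree strongly (typically occlusions or non-overlapping regions) receive $c_{mn} = 0$ and so contribute nothing to $E_{\text{map}}$.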
Besides the above observation about the point-to-point mapping between the two views, there is another useful property of natural images: the nonlocal prior [25]. This prior is based on the observation that image content is likely to repeat itself within some neighborhood. This self-similarity of natural images is beneficial for solving the super-resolution problem, because it means we can exploit the redundant information hidden in the full-resolution view. Leveraging the nonlocal prior, we enforce an additional nonlocal constraint between the left and right views under the guidance of stereo correspondences. The explicit form of this nonlocal regularization term is:
$E_{\text{nonlocal}} = \sum_{(m,n)\in\Lambda} c_{mn} \sum_{(p,q)\in\Omega_{nr}\left(m,\, n+D'_N(m,n)\right)} w_{mn,pq} \left\| T I_N^R(m,n) - T I_N^L(p,q) \right\|_2^2.$  (3)
Here $\Omega_{nr}(i,j)$ denotes the nonlocal neighborhood at position $(i,j)$, whose size is $(2 \times nr + 1) \times (2 \times nr + 1)$. $T$ is a vectorizing patch-extraction operator, $T I_N^R(m,n)$ is the vectorized representation of a patch centered at $(m,n)$ in image $I_N^R$, and $w_{mn,pq}$ is the nonlocal weight calculated by measuring the similarity (mean square error, MSE) between the patches $T I_N^R(m,n)$ and $T I_N^L(p,q)$ [25], [26].

^1 Depth and disparity are two interdependent terms in stereo vision. We use them interchangeably whenever appropriate.
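The weights $w_{mn,pq}$ can be sketched in the usual nonlocal-means fashion [25]: an exponentially decaying function of the patch MSE, normalized over the neighborhood. The Gaussian form and bandwidth `h` below follow that convention and are assumptions; the paper only states that the weights are computed from the patch MSE:

```python
import numpy as np

def nonlocal_weights(I_R, I_L, center_r, center_l, nr=1, r=1, h=10.0):
    """Nonlocal weights w_{mn,pq} for Eq. (3): for each (p, q) in the
    (2*nr+1)^2 neighborhood of the mapped position in the left view,
    weight = exp(-MSE(patch_R, patch_L) / h^2), normalized to sum to 1.
    h, nr, and r are illustrative parameters."""
    pad = nr + r
    I_Rp = np.pad(np.asarray(I_R, dtype=float), pad, mode='edge')
    I_Lp = np.pad(np.asarray(I_L, dtype=float), pad, mode='edge')
    m, n = center_r                # (m, n) in the right view
    i, j = center_l                # mapped position (m, n + D'_N(m, n))
    ref = I_Rp[m + pad - r:m + pad + r + 1, n + pad - r:n + pad + r + 1]
    weights = {}
    for p in range(i - nr, i + nr + 1):
        for q in range(j - nr, j + nr + 1):
            cand = I_Lp[p + pad - r:p + pad + r + 1,
                        q + pad - r:q + pad + r + 1]
            mse = np.mean((ref - cand) ** 2)
            weights[(p, q)] = np.exp(-mse / h ** 2)
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}
```

Similar left-view patches receive large weights, so the constraint pulls the reconstructed right-view patch toward the redundant high-resolution content.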
As can be seen, minimizing the above energy function relies on the calculation of stereo correspondence. The result of a stereo matching algorithm depends on the co-occurrence of distinct details in both views: the more distinct details the two views share, the more reliable the matching result is. However, since we only have mixed-resolution videos as inputs, directly matching the left view with the interpolated right view may not yield satisfactory results. We need to restore the details of the right view to obtain a reliable depth map. Therefore, we combine the calculation of stereo correspondence and super-resolution, and propose a unified energy function as follows:
$E_{SR} = E_{\text{data}} + \lambda_1 E_{\text{map}} + \lambda_2 E_{\text{nonlocal}} + \lambda_3 E_{\text{depth}},$  (4)
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularization parameters and $E_{\text{depth}}$ is the depth energy function, whose explicit form is given in the following part.
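As a minimal sketch, the unified objective in (4) is simply a weighted sum of the four terms. The $\lambda$ values below are placeholders; the paper leaves the regularization parameters to be tuned:

```python
def total_energy(e_data, e_map, e_nonlocal, e_depth,
                 lam1=0.5, lam2=0.2, lam3=0.1):
    """Unified super-resolution energy E_SR of Eq. (4).
    lam1, lam2, lam3 are illustrative values, not the paper's settings."""
    return e_data + lam1 * e_map + lam2 * e_nonlocal + lam3 * e_depth
```

In practice such a joint objective is typically minimized by alternating between updating the depth maps with the current right view fixed and updating the right view with the current depth fixed.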
C. Depth energy function
In [19], we proposed a region-based stereo matching algorithm using cooperative optimization. This method achieves high-quality depth maps with relatively high efficiency. In this paper, we extend it to the stereoscopic video case. By exploiting the temporal consistency of depth information in stereoscopic video, the extended method obtains temporally consistent depth maps. In the following, we briefly describe the idea of region-based stereo matching using cooperative optimization; we refer the reader to [19] for a detailed description.
Supposing that $R_1, \ldots, R_n$ are regions obtained by the mean-shift segmentation algorithm [27], we define a total energy function that decomposes into the sum of several subtarget energy functions. Mathematically,

$E_{\text{depth}} = \sum_{i \in \Lambda_{\text{seg}}} E_i,$  (5)
where $\Lambda_{\text{seg}}$ is the index set of regions and $E_i$ is the energy function of the $i$th region $R_i$.
Next, we give the explicit form of every subtarget $E_i$. Here, we mainly concentrate on four aspects: data energy, occlusion energy, smoothness energy, and temporal consistency energy. Mathematically, we define the energy function of the $i$th region $R_i$ as follows:

$E_i = E_i^{\text{data}} + E_i^{\text{occlusion}} + E_i^{\text{smooth}} + E_i^{\text{consistency}}.$  (6)
The first term is the data term. It evaluates the validity of the depth at position $(m,n)$ in region $R_i$ by calculating the color difference between two corresponding pixels. Its explicit form is:

$E_i^{\text{data}} = \sum_{(m,n)\in V_i^R,\,(p,q)\in V_i^L} \left\| I_N^R(m,n) - I_N^L(p,q) \right\|_\infty,$  (7)
where $\|\cdot\|_\infty$ denotes the maximum norm (infinity norm), and $V_i^L$ and $V_i^R$ denote the visible pixel sets [19], [28] on the current region of the left and right images, respectively.
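The per-region data energy of (7) can be sketched as below. The list `pairs` is a hypothetical stand-in for the visible-pixel correspondences between $V_i^R$ and $V_i^L$, whose construction follows the visibility reasoning of [19], [28] and is outside this sketch:

```python
import numpy as np

def region_data_energy(I_R, I_L, pairs):
    """Data energy of one region, Eq. (7): sum over corresponding
    visible pixel pairs of the infinity norm (maximum absolute
    difference over color channels) of the color difference.
    `pairs` is a list of ((m, n), (p, q)) correspondences (assumed)."""
    energy = 0.0
    for (m, n), (p, q) in pairs:
        diff = np.abs(np.asarray(I_R[m, n], dtype=float)
                      - np.asarray(I_L[p, q], dtype=float))
        energy += float(np.max(diff))     # infinity norm over channels
    return energy
```

Using the maximum over color channels makes the term sensitive to a mismatch in any single channel, which is stricter than an averaged color distance.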