Computing the Stereo Matching Cost with a Convolutional Neural Network
Jure Žbontar
University of Ljubljana
jure.zbontar@fri.uni-lj.si
Yann LeCun
New York University
yann@cs.nyu.edu
Abstract
We present a method for extracting depth information
from a rectified image pair. We train a convolutional neu-
ral network to predict how well two image patches match
and use it to compute the stereo matching cost. The cost
is refined by cross-based cost aggregation and semiglobal
matching, followed by a left-right consistency check to eliminate
errors in the occluded regions. Our stereo method
achieves an error rate of 2.61 % on the KITTI stereo dataset
and is currently (August 2014) the top-performing method
on this dataset.
1. Introduction
Consider the following problem: given two images taken
from cameras at different horizontal positions, the goal is
to compute the disparity d for each pixel in the left image.
Disparity refers to the difference in horizontal location of
an object in the left and right image—an object at position
(x, y) in the left image will appear at position (x − d, y) in
the right image. Knowing the disparity d of an object, we
can compute its depth z (i.e. the distance from the object to
the camera) by using the following relation:
$$z = \frac{fB}{d}, \tag{1}$$
where f is the focal length of the camera and B is the dis-
tance between the camera centers.
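As a brief illustration of Equation (1), the following sketch converts a disparity map to depth. The function name `disparity_to_depth`, the parameter names, and the numeric values are assumptions made for this example only; they are not taken from the paper. Pixels with non-positive disparity are mapped to infinity here, which is one possible convention among several.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Apply z = f * B / d elementwise.

    Hypothetical helper: disparity and focal length are in pixels, the
    baseline in meters, so depth comes out in meters. Non-positive
    disparities are mapped to infinity (an assumed convention).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Illustrative values only: f = 700 px, B = 0.5 m.
depth = disparity_to_depth(
    [[0.0, 10.0], [20.0, 40.0]], focal_length_px=700.0, baseline_m=0.5
)
print(depth)
```

Because depth is inversely proportional to disparity, halving the disparity doubles the estimated depth, so small disparity errors matter most for distant objects.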
The described problem is a subproblem of stereo recon-
struction, where the goal is to extract 3D shape from one
or more images. According to the taxonomy of Scharstein
and Szeliski [14], a typical stereo algorithm consists of four
steps: (1) matching cost computation, (2) cost aggregation,
(3) optimization, and (4) disparity refinement. Following
Hirschmüller and Scharstein [5], we refer to steps (1) and
(2) as computing the matching cost and steps (3) and (4) as
the stereo method.
We propose training a convolutional neural network [9]
on pairs of small image patches where the true disparity is
known (e.g. obtained by LIDAR). The output of the net-
work is used to initialize the matching cost between a pair
of patches. Matching costs are combined between neighbor-
ing pixels with similar image intensities using cross-based
cost aggregation. Smoothness constraints are enforced by
semiglobal matching and a left-right consistency check is
used to detect and eliminate errors in occluded regions. We
perform subpixel enhancement and apply a median filter
and a bilateral filter to obtain the final disparity map. Fig-
ure 1 depicts the inputs to and the output from our method.
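To make the left-right consistency check mentioned above more concrete, here is a minimal sketch of one common formulation. It is not necessarily the exact rule used in this paper: the function name `left_right_check`, the one-pixel tolerance `tol`, and the rounding of disparities to integer columns are all assumptions for illustration. It uses the convention stated earlier, that a pixel at (x, y) in the left image appears at (x − d, y) in the right image.

```python
import numpy as np

def left_right_check(disp_left, disp_right, tol=1.0):
    """Return a boolean mask of left-image pixels whose disparity is
    consistent with the right-image disparity map.

    Hypothetical sketch: a left pixel (x, y) with disparity d is kept if
    the right disparity at (x - d, y) differs from d by at most `tol`.
    Pixels whose match falls outside the image are marked inconsistent.
    """
    disp_left = np.asarray(disp_left, dtype=np.float64)
    disp_right = np.asarray(disp_right, dtype=np.float64)
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    ys = np.tile(np.arange(h)[:, None], (1, w))
    # Column of the corresponding pixel in the right image.
    x_right = xs - np.rint(disp_left).astype(int)
    inside = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)
    agree = np.abs(disp_left - disp_right[ys, x_clamped]) <= tol
    return inside & agree

# Illustrative one-row example with made-up disparities.
mask = left_right_check([[2, 2, 2, 2]], [[2, 5, 2, 2]])
print(mask)
```

Pixels that fail the check are typically treated as occluded or mismatched and filled in from neighboring consistent pixels in a later step.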
The two contributions of this paper are:
• We describe how a convolutional neural network can
be used to compute the stereo matching cost.
• We achieve an error rate of 2.61 % on the KITTI
stereo dataset, improving on the previous best result
of 2.83 %.
2. Related work
Before the introduction of large stereo datasets [2, 13],
relatively few stereo algorithms used ground-truth informa-
tion to learn parameters of their models; in this section, we
review the ones that did. For a general overview of stereo
algorithms see [14].
Kong and Tao [6] used sum of squared distances to com-
pute an initial matching cost. They trained a model to pre-
dict the probability distribution over three classes: the ini-
tial disparity is correct, the initial disparity is incorrect due
to fattening of a foreground object, and the initial disparity
is incorrect due to other reasons. The predicted probabilities
were used to adjust the initial matching cost. Kong
and Tao [7] later extended their work by combining predictions
obtained by computing normalized cross-correlation
over different window sizes and centers. Peris et al. [12]
initialized the matching cost with AD-Census [11] and used
multiclass linear discriminant analysis to learn a mapping
from the computed matching cost to the final disparity.
Ground-truth data was also used to learn parameters of
graphical models. Zhang and Seitz [22] used an alterna-
tive optimization algorithm to estimate optimal values of
Markov random field hyperparameters. Scharstein and Pal