Pyramid Stereo Matching Network
Jia-Ren Chang Yong-Sheng Chen
Department of Computer Science, National Chiao Tung University, Taiwan
{followwar.cs00g, yschen}@nctu.edu.tw
Abstract
Recent work has shown that depth estimation from a
stereo pair of images can be formulated as a supervised
learning task to be resolved with convolutional neural
networks (CNNs). However, current architectures rely on
patch-based Siamese networks, lacking the means to
exploit context information for finding correspondence in
ill-posed regions. To tackle this problem, we propose
PSMNet, a pyramid stereo matching network consisting of two
main modules: spatial pyramid pooling and a 3D CNN. The
spatial pyramid pooling module exploits global context
information by aggregating context at different scales and
locations to form a cost volume. The 3D CNN learns to
regularize the cost volume using multiple stacked hourglass
networks in conjunction with intermediate supervision.
The proposed approach was evaluated on several benchmark
datasets. Our method ranked first in the KITTI 2012 and
2015 leaderboards before March 18, 2018. The code of
PSMNet is available at:
https://github.com/JiaRenChang/PSMNet.
1. Introduction
Depth estimation from stereo images is essential to computer
vision applications, including autonomous driving for
vehicles, 3D model reconstruction, and object detection and
recognition [4, 31]. Given a pair of rectified stereo images,
the goal of depth estimation is to compute the disparity d
for each pixel in the reference image. Disparity refers to the
horizontal displacement between a pair of corresponding
pixels on the left and right images. For the pixel (x, y) in the
left image, if its corresponding point is found at (x − d, y)
in the right image, then the depth of this pixel is calculated
by $\frac{fB}{d}$, where f is the camera's focal length and B is the
distance between the two camera centers.
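The relation between disparity and depth can be illustrated with a minimal sketch. The function name, the units (focal length and disparity in pixels, baseline in metres), and the camera values are assumptions chosen for illustration; they do not come from the paper.

```python
def disparity_to_depth(disparity, focal_length, baseline):
    """Return depth = f * B / d for a single pixel.

    Assumed units (not fixed by the text): focal_length and disparity
    in pixels, baseline in metres, so the result is in metres.
    """
    if disparity < 0:
        raise ValueError("disparity must be non-negative")
    if disparity == 0:
        # Zero disparity corresponds to a point infinitely far away.
        return float("inf")
    return focal_length * baseline / disparity


# Hypothetical camera: f = 720 px, B = 0.54 m, and a pixel with d = 36 px.
depth = disparity_to_depth(disparity=36.0, focal_length=720.0, baseline=0.54)
print(depth)
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant points (small d) than for nearby ones.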
The typical pipeline for stereo matching involves finding
corresponding points based on matching cost, followed by
post-processing. Recently, convolutional neural networks
(CNNs) have been applied to learn how to match
corresponding points in MC-CNN [30]. Early approaches
using CNNs treated the problem of correspondence
estimation as similarity computation [27, 30], where CNNs
compute the similarity score for a pair of image patches
to further determine whether they are matched. Although
CNNs yield significant gains compared to conventional
approaches in terms of both accuracy and speed, it is still
difficult to find accurate corresponding points in inherently
ill-posed regions such as occlusion areas, repeated patterns,
textureless regions, and reflective surfaces. Solely applying
the intensity-consistency constraint between different
viewpoints is generally insufficient for accurate correspondence
estimation in such ill-posed regions, and is useless in
textureless regions. Therefore, regional support from global
context information must be incorporated into stereo
matching.
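The weakness of a purely intensity-based constraint in textureless regions can be seen in a toy example. The sketch below uses a sum of absolute differences over a small window as the matching cost. This cost, the window size, and the intensity values are assumptions for illustration only, not the learned similarity discussed above.

```python
def sad_cost(left, right, x, d, half_window=1):
    """Sum of absolute intensity differences between a window centred at
    x in the left row and a window centred at x - d in the right row."""
    return sum(
        abs(left[x + k] - right[x - d + k])
        for k in range(-half_window, half_window + 1)
    )


def best_disparities(left, right, x, max_disp, half_window=1):
    """Return every candidate disparity that attains the minimum cost."""
    costs = [sad_cost(left, right, x, d, half_window) for d in range(max_disp + 1)]
    lowest = min(costs)
    return [d for d, c in enumerate(costs) if c == lowest]


# Made-up image rows; each right row shows the same scene shifted by 2 px.
textured_left = [10, 80, 30, 60, 20, 90, 40, 70, 50, 15]
textured_right = textured_left[2:] + [0, 0]
flat_left = [50] * 10
flat_right = [50] * 10

unique = best_disparities(textured_left, textured_right, x=6, max_disp=3)
ambiguous = best_disparities(flat_left, flat_right, x=6, max_disp=3)
print(unique, ambiguous)
```

In the textured row a single disparity minimizes the cost, whereas in the flat row every candidate ties, so local intensity alone cannot select the correct match.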
One major problem with current CNN-based stereo
matching methods is how to effectively exploit context
information. Some studies attempt to incorporate semantic
information to substantially refine cost volumes or disparity
maps [8, 13, 27]. The Displets [8] method utilizes object
information by modeling 3D vehicles to resolve ambiguities
in stereo matching. ResMatchNet [27] learns to measure
reflective confidence for the disparity maps to improve
performance in ill-posed regions. GC-Net [13] employs an
encoder-decoder architecture to merge multiscale features
for cost volume regularization.
In this work, we propose a novel pyramid stereo matching
network (PSMNet) to exploit global context information
in stereo matching. Spatial pyramid pooling (SPP) [9, 32]
and dilated convolution [2, 29] are used to enlarge the
receptive fields. In this way, PSMNet extends pixel-level
features to region-level features with different scales of
receptive fields; the resultant combined global and local feature
clues are used to form the cost volume for reliable
disparity estimation. Moreover, we design a stacked hourglass
3D CNN in conjunction with intermediate supervision to
regularize the cost volume. The stacked hourglass 3D CNN
repeatedly processes the cost volume in a top-down/bottom-up
manner to further improve the utilization of global
context information.
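The idea of extending pixel-level features to region-level features at several scales can be sketched in one dimension. This is a simplified illustration, not the PSMNet module itself. The bin sizes, the nearest-neighbour upsampling, the absence of learned convolutions, and all function names are assumptions.

```python
def average_pool(row, bin_size):
    """Average consecutive, non-overlapping bins of length bin_size."""
    bins = [row[i:i + bin_size] for i in range(0, len(row), bin_size)]
    return [sum(b) / len(b) for b in bins]


def upsample_nearest(pooled, bin_size, length):
    """Repeat each pooled value so the row regains its original length."""
    return [pooled[i // bin_size] for i in range(length)]


def pyramid_features(row, bin_sizes=(2, 4, 8)):
    """Return, for each pixel, [local value, context at each scale]."""
    context_rows = [
        upsample_nearest(average_pool(row, b), b, len(row)) for b in bin_sizes
    ]
    return [[row[i]] + [ctx[i] for ctx in context_rows] for i in range(len(row))]


row = [0, 0, 0, 0, 8, 8, 8, 8]  # hypothetical 1-D feature row
features = pyramid_features(row)
print(features[0], features[4])
```

Each pixel ends up with its own value plus averages over progressively larger regions, so pixels with identical local values can still be told apart or related by their surroundings.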
Our main contributions are listed below: