Practical Stereo Matching via Cascaded Recurrent Network
with Adaptive Correlation
Jiankun Li¹, Peisen Wang¹*, Pengfei Xiong²*, Tao Cai¹, Ziwei Yan¹, Lei Yang¹, Jiangyu Liu¹, Haoqiang Fan¹, Shuaicheng Liu³,¹†
¹Megvii Research, ²Tencent, ³University of Electronic Science and Technology of China
https://github.com/megvii-research/CREStereo
Figure 1. Examples of our predictions on images from the Holopix50K [16] dataset. We show the left images of the stereo pairs and their
corresponding predicted disparities. Our results achieve high accuracy and exhibit high-quality details for fine-structured objects.
Abstract
With the advent of convolutional neural networks, stereo
matching algorithms have recently made tremendous
progress. However, it remains a great challenge to accurately
extract disparities from real-world image pairs taken
by consumer-level devices like smartphones, due to practi-
cal complicating factors such as thin structures, non-ideal
rectification, camera module inconsistencies and various
hard-case scenes. In this paper, we propose a set of in-
novative designs to tackle the problem of practical stereo
matching: 1) to better recover fine depth details, we design
a hierarchical network with recurrent refinement to update
disparities in a coarse-to-fine manner, as well as a stacked
cascaded architecture for inference; 2) we propose an adap-
tive group correlation layer to mitigate the impact of erro-
neous rectification; 3) we introduce a new synthetic dataset
with special attention to difficult cases to better generalize
to real-world scenes. Our results not only rank 1st on
both Middlebury and ETH3D benchmarks, outperforming
existing state-of-the-art methods by a notable margin, but
also exhibit high-quality details for real-life photos, which
clearly demonstrates the efficacy of our contributions.
* Equal contribution. † Corresponding author.
1. Introduction
Stereo matching is a classical research topic in computer
vision. Given a pair of rectified images, its goal is to
compute the displacement between each pair of corresponding
pixels, namely the “disparity” [34]. It plays an important role
in many applications, including autonomous driving, aug-
mented reality, simulated bokeh rendering and so forth.
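
To make the definition of disparity concrete, the following minimal sketch estimates it for a single pixel on one pair of rectified scanlines. It uses classical brute-force block matching with a sum-of-absolute-differences cost, which is not the method proposed in this paper. The function name `disparity_at`, its parameters, and the toy scanlines are all assumptions made for illustration.

```python
def disparity_at(left_row, right_row, x, max_disp, half_win=1):
    """Return the disparity d in [0, max_disp] that minimises the sum of
    absolute differences between a window centred on left_row[x] and the
    window centred on right_row[x - d].

    Hypothetical helper for illustration only; assumes rectified rows, so
    the match for a left pixel lies on the same row, shifted to the left.
    """
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        # Stop once the candidate window would leave the right row.
        if x - d - half_win < 0:
            break
        cost = sum(
            abs(left_row[x + k] - right_row[x - d + k])
            for k in range(-half_win, half_win + 1)
        )
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Toy rectified scanlines: the right row is the left row shifted by 3 pixels,
# so the bright feature at x = 6 in the left row appears at x = 3 in the right.
left_row = [10, 10, 10, 10, 10, 80, 200, 90, 10, 10, 10, 10]
true_disp = 3
right_row = left_row[true_disp:] + [10] * true_disp

estimated = disparity_at(left_row, right_row, x=6, max_disp=5)
print(estimated)
```

On this toy input the window around the bright feature matches exactly at a shift of 3 pixels. Such local matching is ambiguous in textureless regions and assumes the match lies on the same row.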
Recently, with the support of large synthetic datasets
[5, 27, 46], convolutional neural network (CNN) based
stereo matching methods have taken the accuracy of dis-
parity estimation to a new height [8, 23, 44]. However, to
make the algorithm truly practical in the scenario of every-
day consumer photography, we are still faced with three ma-
jor obstacles.
Firstly, it remains a complicated issue for most existing
algorithms to precisely recover the disparity of fine image
details, or thin structures such as nets and wire frames. The
fact that consumer photos are being produced at ever higher
resolutions only worsens the problem. In computational
bokeh, for instance, disparity errors around fine details
result in degraded renderings that are unpleasing
to human perception [32]. Secondly, perfect rectification
[24, 56] is hard to obtain for real-world stereo image
pairs, as they are often produced by camera modules with