Received December 21, 2020, accepted December 30, 2020, date of publication January 11, 2021,
date of current version January 21, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3050556
HighRes-MVSNet: A Fast Multi-View Stereo
Network for Dense 3D Reconstruction
From High-Resolution Images
RAFAEL WEILHARTER AND FRIEDRICH FRAUNDORFER, (Member, IEEE)
Institute of Computer Graphics and Vision, Graz University of Technology, 8010 Graz, Austria
Corresponding author: Rafael Weilharter (rafael.weilharter@icg.tugraz.at)
This work was supported in part by the European Institute of Innovation and Technology (EIT) RawMaterials under Project 18004, and in
part by the RESilient transport InfraSTructure to extreme events (RESIST) Project through the European Union’s Horizon 2020 Research
and Innovation Program under Grant 769066.
ABSTRACT We propose an end-to-end deep learning architecture for 3D reconstruction from
high-resolution images. While many approaches focus on improving reconstruction quality alone, we
primarily focus on decreasing memory requirements in order to exploit the abundant information provided by
modern high-resolution cameras. Towards this end, we present HighRes-MVSNet, a convolutional neural
network with a pyramid encoder-decoder structure searching for depth correspondences incrementally over
a coarse-to-fine hierarchy. The first stage of our network encodes the image features to a much smaller
resolution in order to significantly reduce the memory requirements. Additionally, we limit the depth search
range in every hierarchy level to the vicinity of the previous prediction. In this manner, we are able to produce
highly accurate 3D models while only using a fraction of the GPU memory and runtime of previous methods.
Although our method is aimed at much higher resolution images, we are still able to produce state-of-the-art
results on the Tanks and Temples benchmark and achieve outstanding scores on the DTU benchmark.
INDEX TERMS Convolutional neural network, dense 3D reconstruction, multi-view stereo.
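To illustrate the coarse-to-fine search described in the abstract, the following minimal Python sketch narrows the depth range of a single pixel around the previous estimate at each hierarchy level. It is not the HighRes-MVSNet implementation: the function names, the toy `cost` callable, the number of candidates per level, and the rule for shrinking the range are all assumptions made for illustration only.

```python
def depth_hypotheses(center, half_range, count):
    """Return `count` evenly spaced depth candidates in
    [center - half_range, center + half_range]."""
    if count < 2:
        return [center]
    step = 2.0 * half_range / (count - 1)
    return [center - half_range + i * step for i in range(count)]


def coarse_to_fine_depth(cost, d_min, d_max, count=8, levels=3):
    """Estimate one pixel's depth over `levels` hierarchy levels.

    `cost` is a hypothetical callable that maps a depth candidate to a
    matching cost (lower is better). A real network would derive this
    from learned image features rather than a closed-form function.
    """
    center = 0.5 * (d_min + d_max)
    half_range = 0.5 * (d_max - d_min)
    for _ in range(levels):
        candidates = depth_hypotheses(center, half_range, count)
        # Keep the best candidate as the prediction for this level.
        center = min(candidates, key=cost)
        # Assumed shrink rule: the next level only searches within one
        # candidate spacing on either side of the current prediction.
        half_range = 2.0 * half_range / (count - 1)
    return center


if __name__ == "__main__":
    # Toy cost with a known minimum at depth 637.0 (arbitrary units).
    estimate = coarse_to_fine_depth(lambda d: abs(d - 637.0), 425.0, 935.0)
    print(estimate)
```

With 8 candidates per level and 3 levels, this sketch evaluates 24 depth candidates in total, while the spacing between candidates shrinks at every level.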
I. INTRODUCTION
Multi-View Stereo (MVS) attempts to reconstruct a highly
detailed 3D model of an observed scene from images with
different viewpoints. The prerequisites are known intrinsic
and extrinsic camera parameters which can be obtained via
Structure from Motion (SfM) (see Fig. 1). MVS has been
a well-studied problem for decades, and traditional methods
based on geometric context [2], [6], [7], [26] achieved
great success when reconstructing scenes with Lambertian
surfaces, especially in terms of accuracy. However, they
struggle to reconstruct low-textured, specular, and
reflective regions and fall short in terms of completeness.
Furthermore, they usually take a very long time to establish
3D correspondences, and larger scenes can take several hours
to process.
FIGURE 1. Overview of the Structure from Motion pipeline. MVS attempts
to create a denser, more appealing 3D model from sparse reconstruction
information.
The associate editor coordinating the review of this manuscript and
approving it for publication was Mehul S. Raval.
To address these issues, more recent approaches [12], [15]
use deep Convolutional Neural Networks (CNNs), which
are several times faster while also improving the overall
3D reconstruction quality of a scene. This can mostly be
attributed to the fact that learning-based methods can
incorporate global semantic information such as specular and
reflective priors for more robust matching. Furthermore, if the
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 9, 2021