Novel View Synthesis via Stereo Magnification with Multi-Layer Images


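Stereo Magnification with Multi-Layer Images

T. Khakhulin 1,2   D. Korzhenkov 1   P. Solovev 1   G. Sterkin 1   A.-T. Ardelean 1,2   V. Lempitsky 2 *

1 Samsung AI Center - Moscow   2 Skolkovo Institute of Science and Technology, Moscow

https://samsunglabs.github.io/StereoLayers/

* Most of the work was done while Victor Lempitsky was at Samsung AI Center.

Abstract

Representing scenes with multiple semi-transparent colored layers has become a popular and successful choice for real-time novel view synthesis. Existing approaches infer color and transparency values over regularly spaced layers of planar or spherical shape. In this work, we introduce a new view synthesis approach based on multiple semi-transparent layers with scene-adapted geometry. Our approach infers such representations from stereo pairs in two stages. The first stage produces the geometry of a small number of data-adaptive layers from a given pair of views. The second stage infers the color and transparency values for these layers, producing the final representation for novel view synthesis. Importantly, both stages are connected through a differentiable renderer and are trained end-to-end. In the experiments, we demonstrate the advantage of the proposed approach over regularly spaced layers that are not adapted to scene geometry. Despite being orders of magnitude faster during rendering, our approach also outperforms the recently proposed IBRNet system, which is based on an implicit geometry representation.

1. Introduction

Recent years have seen rapid progress in image-based rendering and novel view synthesis, with a multitude of diverse methods based on neural rendering approaches [32]. Among this diversity, the approaches based on semi-transparent multi-layer representations [21, 29, 30, 34, 39] stand out due to their fast rendering time, their compatibility with traditional graphics engines, and the good quality of re-rendering in the vicinity of the input frames.

Existing approaches [4, 17, 21, 29, 30, 34, 39] build multi-layer representations over grids of regularly spaced surfaces such as planes or spheres with uniformly changing inverse depth. As the number of layers is necessarily limited by resource constraints and the risk of overfitting, this number is usually taken to be relatively small (e.g. 32). The resulting semi-transparent representation may therefore only coarsely approximate the true geometry of the scene, which limits the generalization to novel views and introduces artefacts. The most recent works [4, 17] use an excessive number of spheres (up to 128) and then merge the resulting geometry using a non-learned post-processing merging step. While the merge step creates a scene-adapted and compact geometric representation, it is not incorporated into the learning process of the main matching network, and it degrades the quality of novel view synthesis [4].

The coarseness of the layered geometry used by multi-layer approaches is in contrast to more traditional image-based rendering methods that start by estimating the non-discretized scene geometry in the form of a mesh [25, 33], view-dependent meshes [11], or a single-layer depth map [23, 27, 37]. Geometry estimates may come from multiview dense stereo matching or from monocular depth. All these approaches obtain a finer approximation to scene geometry, although most of them have to use a relatively slow neural rendering step to compensate for the errors in the geometry estimation.

Our approach, called StereoLayers (Fig. 1), combines scene geometry adaptation with a multi-layer representation. This model is designed for a case known as the stereo magnification problem: it reconstructs the scene from as few as two input images. The proposed method starts by building a geometric proxy that is customized to a particular scene. The proxy is formed by a small number of mesh layers with continuous depth coordinate values. In the second stage, similarly to other multi-layer approaches, we estimate the transparency and color textures for each layer, resulting in the final representation of the scene. When processing a new scene, both stages take the same pair of images of that scene as input. Two deep neural networks trained on a dataset of similar scenes are used to implement these two stages. Crucially, we train both neural networks together in an end-to-end fashion using a differentiable rendering framework [16].

[Figure 1 diagram labels: reference view, side view, plane sweep, geometry network, layered depth, unprojection onto geometry, coloring network, RGBDA structure, multi-layer image, novel view.]

Figure 1. The proposed StereoLayers pipeline estimates scene-adapted multi-layer geometry from the plane sweep volume using a pretrained geometry network, and then estimates color and transparency values using a pretrained coloring network. The layered geometry represents the scene as an ordered set of mesh layers. The geometry and coloring networks are trained together end-to-end.

We compare our approach with previously proposed methods that use regularly spaced layers on the popular RealEstate10k [39] and LLFF [21] datasets. Additionally, we propose a new, more challenging dataset for benchmarking novel view synthesis. In both cases, we observe that the scene-adaptive geometry in our approach results in better novel view synthesis quality than non-adaptive geometry. To put our work into a broader context, we also compare the performance of our system with the IBRNet system [36] and observe the advantage of our approach, along with faster rendering time. In general, our approach produces very compact scene representations that are suitable for real-time rendering on low-end devices.

To summarize, our contributions are as follows. First, we propose a new method for geometric scene reconstruction from stereo pairs. The method represents the scene with a small number of semi-transparent layers with scene-adapted geometry. Unlike other related methods, ours uses two jointly (end-to-end) trained deep networks: the first estimates the geometry of the layers, and the second estimates the transparency and color textures of the layers. Finally, we evaluate our approach on previously proposed datasets and introduce a new challenging dataset for training and evaluating novel view synthesis methods.

2. Related work

Representations for novel view synthesis. Over the years, different kinds of representations have been proposed for novel view synthesis. Almost without exception, when these representations are acquired from multiple images, the images are registered using structure-and-motion algorithms or come from a pre-calibrated stereo camera. Alternatively, some recent works investigate how to create such representations from a single image. The proposed representations fall into several groups, including volumetric representations that rely on volumetric rendering, mesh-based representations [7, 11, 12, 25, 33, 40] and point-based representations [1, 15]. Most representations of these types require extensive computations to render a novel view, such as running a raw image through a deep convolutional rendering network [32] or numerous evaluations of a scene network that has a perceptron architecture [18, 22].

An important class of representations is based on depth maps. Such depth maps can be naturally obtained using stereo matching [5] or from monocular depth estimation [27, 37]. In this class, the 3D layered inpainting approach [27] is most related to our work: starting from a monocular depth map, it segments the map into multiple layers and then applies an inpainting procedure to each layer to extend its support behind the more frontal layers. Our work has several important differences, as it uses two (rather than one) images as input and predicts the transparency of the layers. Most importantly, the estimation of the multi-layered geometry and the estimation of the layer colors and transparency are both implemented using deep architectures, which are trained in an end-to-end fashion.

Multi-layer semitransparent representations. In 1999, [30] proposed representing scenes with multiple fronto-parallel semitransparent layers and acquiring such representations through stereo matching of a pair of input views. Twenty years later, several approaches [8, 21, 29], starting from [39], exploited advances in deep learning to build deep networks that directly map plane sweep volumes (i.e. tensors obtained by the "unprojection" operation) to final representations of the same kind. The rendering of semitransparent layers is well supported by modern graphics engines, so the resulting representation is in general more suitable for interactive applications than most other representations that lead to a similar level of realism.

The multi-layer representations have been extended to wider fields of view in [3, 4, 17] by replacing planes with spheres. Two approaches [4, 17] suggested to "coalesce" (merge) the groups of nearby layers into layers with scene-adapted geometry. In both cases, the grouping of layers is predefined, and the merge process is non-learnable and uses simple accumulation heuristics. Consequently, [4] reported a loss of rendering quality as a result of the merge, which was still justified in their case by the increased rendering and storage efficiency.

Our work is closely related to the previous works on multi-layer semitransparent representations. Unlike most works in this group, our pipeline starts with the estimation of scene-adapted (non-planar, non-spherical) layers and only then estimates the colors and transparency of the layers. While [4, 17] also end up with scene-adapted semitransparent layers as a representation, our approach performs the reconstruction in the opposite order (the geometry is estimated first). More importantly, unlike [4, 17], we estimate the geometry of the layers with a neural network that is trained together with the color and transparency estimation network. In the experiments, we show that this approach results in better view synthesis.

Single-layer novel view synthesis with differentiable rendering. The SynSin [37] and the more recent Worldsheet [13] systems predict single-layer geometry from a single image and use differentiable rendering to train neural networks, similarly to our approach. Our approach considers the case of two input images and focuses on multi-layer geometry. While a variant of Worldsheet considers a two-layer extension, it is based on a different architecture and a different layer aggregation strategy, and in the experiments it does not outperform the single-layer representation.

Figure 2. View extrapolation obtained with our method. The two input images are shown in the middle. The proposed method (StereoLayers) generates plausible renderings even when the baseline is magnified by a factor of five (as in this example).

3. From stereo images to a multi-layer representation

We consider the task of stereo magnification, i.e. generating a novel view $\hat{I}_n$ from two input views (images): the reference view $I_r$ and the side view $I_s$. We assume that the relative camera poses $\pi_s$ and $\pi_n$ of the side view and the novel view with respect to the reference view, as well as the camera intrinsics $K_r$, $K_s$ and $K_n$, are known. To solve this task, our approach builds a scene representation that depends only on the side view and the reference view. Afterwards, such a representation can be rendered for any novel camera using standard rendering.

3.1. Geometry estimation

Given a trained model and a new stereo pair, the multi-layer representation is inferred in two stages. First, the structure of the scene is predicted in the form of the geometry of the mesh layers. Then, in the second stage, the opacity (alpha) and the RGB color (texture) of the layers are inferred. Note that we treat the input views asymmetrically, as we build the scene representation in the frustum of the reference camera.

We first compute the plane sweep volume (PSV) [6] by placing $P$ fronto-parallel planes in the frustum of the reference camera and unprojecting the side view onto these planes. The planes are spaced uniformly in inverse depth space at depths $\{d_1, \ldots, d_P\}$. We sample the planes at $H \times W$ resolution and concatenate the reference view as three additional channels, obtaining a tensor of size $H \times W \times (3P + 3)$, similar to the tensors used in other multi-layer approaches, e.g. [39].

The input tensor is then processed by the geometry network $F_g$. Although we consider several architecture variants, discussed below, all of them predict $L$ depth maps of size $h \times w$, which correspond to the depths along a bundle of $h \times w$ rays spaced uniformly in the image coordinate space of the reference view. In our experiments, we set the resolution of the layers to the size of the reference view, $w = W$, although sampling at a different resolution is also possible. The backbone of $F_g$ is similar to the depth prediction module of SynSin [37], i.e. a UNet-like 2D convolutional network with spectral normalization. The only difference is that we increase the number of input and output feature maps to accommodate the multi-layer nature of our model. A more detailed description of the backbone is given in the supplementary material (Section S1). We consider the following three schemes for inferring and encoding the layers.

[Figure 3 panel titles: Reference view, StereoLayers-2 (BI+RSBg), StereoLayers-2 (GC+RSBg), StereoLayers-4 (BI+RSBg), StereoMag-P4, StereoMag-32.]

Figure 3. For two stereo pairs (only the reference views are shown), we visualize horizontal slices along the blue line. Mesh vertices are shown as dots with the predicted opacity. Color encodes the layer number. The horizontal axis corresponds to the pixel coordinate, while the vertical axis shows the vertex depth with respect to the reference camera (only the most illustrative depth range is shown). Compared to StereoMag [39], the StereoLayers variants generate scene-adapted geometry in a more efficient way, which gives a more frugal geometric representation and better rendering quality.

Group compositing (GC) scheme. In this scheme, $F_g$ returns a tensor of shape $h \times w \times P$ with values in the range from 0 to 1. The $P$ channels of the PSV and the corresponding $P$ planes are divided into $L$ groups, which yield $L$ deformable layers as follows: in each group $j$ ($1 \le j \le L$), the "opacities" $\{\beta_k\}_{k=1}^{P}$ predicted by the network $F_g$ are applied to the plane depths $d_1 < $ ...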
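The plane sweep volume construction described in the geometry estimation step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `inverse_depth_planes` and `plane_sweep_volume`, the `near`/`far` depth bounds, the nearest-neighbour sampling, and the pose convention (a point in reference-camera coordinates maps to the side camera as X_s = R X_r + t) are all assumptions made for illustration.

```python
import numpy as np


def inverse_depth_planes(near, far, num_planes):
    """Depths d_1 < ... < d_P spaced uniformly in inverse depth (hypothetical helper)."""
    return 1.0 / np.linspace(1.0 / near, 1.0 / far, num_planes)


def plane_sweep_volume(ref_img, side_img, K_r, K_s, R, t, depths):
    """Return an H x W x (3P + 3) tensor built in the reference camera frustum.

    Assumed convention (not taken from the paper): X_s = R @ X_r + t.
    Sampling is nearest-neighbour; pixels that fall outside the side view are zero.
    """
    H, W, _ = ref_img.shape
    h_s, w_s, _ = side_img.shape

    # Homogeneous pixel grid of the reference view, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)
    rays = np.linalg.inv(K_r) @ pix  # viewing rays with z = 1

    channels = []
    for d in depths:
        # Points on the fronto-parallel plane z = d, moved into the side camera.
        pts_side = R @ (rays * d) + np.asarray(t, dtype=np.float64).reshape(3, 1)
        proj = K_s @ pts_side
        z = proj[2]
        valid = z > 1e-8
        safe_z = np.where(valid, z, 1.0)
        us = np.rint(proj[0] / safe_z).astype(int)
        vs = np.rint(proj[1] / safe_z).astype(int)
        valid &= (us >= 0) & (us < w_s) & (vs >= 0) & (vs < h_s)

        plane = np.zeros((H * W, 3), dtype=side_img.dtype)
        plane[valid] = side_img[vs[valid], us[valid]]
        channels.append(plane.reshape(H, W, 3))

    # The reference view is appended as three extra channels.
    channels.append(ref_img)
    return np.concatenate(channels, axis=-1)
```

With P = 32 planes and 256 x 256 images, for instance, the returned array has shape (256, 256, 99). A practical pipeline would more likely use bilinear sampling on the GPU so that the warp stays differentiable; nearest-neighbour lookup is used here only to keep the sketch short.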
