3 Approach
3.1 Background
We focus on the monocular VO problem, which takes two consecutive undistorted images $\{I_t, I_{t+1}\}$ and estimates the relative camera motion $\delta_t^{t+1} = (T, R)$, where $T \in \mathbb{R}^3$ is the 3D translation and $R \in \mathrm{SO}(3)$ denotes the 3D rotation. According to epipolar geometry theory [34], geometry-based VO proceeds in two steps. First, visual features are extracted from $I_t$ and $I_{t+1}$ and matched. Then, using the matched features, the essential matrix is computed, from which the up-to-scale camera motion $\delta_t^{t+1}$ is recovered.
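For concreteness, this classical two-step pipeline can be sketched with OpenCV's standard API (a minimal sketch, not the paper's method; the intrinsics matrix `K` and the choice of ORB features with brute-force matching are illustrative assumptions):

```python
import cv2
import numpy as np

def geometric_vo_step(img_t, img_t1, K):
    """Classical two-step VO: match features, then recover up-to-scale motion."""
    # Step 1: extract and match sparse visual features between the two frames.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_t, None)
    kp2, des2 = orb.detectAndCompute(img_t1, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Step 2: estimate the essential matrix, then decompose it into (R, T).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, T, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, T  # T has unit norm: the motion is recovered only up to scale
```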
Following the same idea, our model consists of two sub-modules. One is the matching module $M_\theta(I_t, I_{t+1})$, which estimates the dense matching result $F_t^{t+1}$ (i.e., optical flow) from two consecutive RGB images. The other is a pose module $P_\phi(F_t^{t+1})$ that recovers the camera motion $\delta_t^{t+1}$ from the matching result (Fig. 1). This modular design is also widely used in other learning-based methods, especially in unsupervised VO [13, 19, 16, 22, 18].
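To make the interface concrete, here is a minimal PyTorch sketch of the two-module design (the module names, the constructor injection, and the 6-vector motion output are our assumptions for illustration, not the paper's actual architectures):

```python
import torch.nn as nn

class TwoStageVO(nn.Module):
    """Modular VO: a matching module M_theta followed by a pose module P_phi."""
    def __init__(self, matching_net: nn.Module, pose_net: nn.Module):
        super().__init__()
        self.matching_net = matching_net  # M_theta: (I_t, I_{t+1}) -> optical flow
        self.pose_net = pose_net          # P_phi: optical flow -> camera motion

    def forward(self, img_t, img_t1):
        flow_hat = self.matching_net(img_t, img_t1)  # dense matching F_t^{t+1}
        motion_hat = self.pose_net(flow_hat)         # (T, R), e.g. as a 6-vector
        return flow_hat, motion_hat
```

Note that the pose module consumes the estimated flow rather than the ground truth, mirroring the test-time data path.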
3.2 Training on large-scale diverse data
Generalization has always been one of the most critical issues for learning-based methods. Most previous supervised models are trained on the KITTI dataset, which contains 11 labeled sequences and 23,201 image frames captured in driving scenarios [35]. Wang et al. [8] presented training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV), and reported that performance is limited by the lack of training data and the more complex dynamics of a flying robot. Surprisingly, most unsupervised methods also train their models only on very uniform scenes (e.g., KITTI and Cityscapes [37]). To our knowledge, no learning-based model has yet demonstrated the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To achieve this, we argue that the training data must cover diverse scenes and motion patterns.
TartanAir [11] is a large-scale dataset with highly diverse scenes and motion patterns, containing more than 400,000 data frames. It provides multi-modal ground truth labels including depth, segmentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, natural, and sci-fi environments. The data is collected with a simulated pinhole camera that moves with random and rich 6DoF motion patterns in 3D space.
We take advantage of the monocular image sequences $\{I_t\}$, the optical flow labels $\{F_t^{t+1}\}$, and the ground truth camera motions $\{\delta_t^{t+1}\}$ in our task. Our objective is to jointly minimize the optical flow loss $\mathcal{L}_f$ and the camera motion loss $\mathcal{L}_p$. The end-to-end loss is defined as:

$$\mathcal{L} = \lambda \mathcal{L}_f + \mathcal{L}_p = \lambda \left\| M_\theta(I_t, I_{t+1}) - F_t^{t+1} \right\| + \left\| P_\phi(\hat{F}_t^{t+1}) - \delta_t^{t+1} \right\| \qquad (1)$$
where $\lambda$ is a hyper-parameter balancing the two losses. We use $\hat{\cdot}$ to denote a variable estimated by our model.
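Eq. (1) maps directly onto a training step. A hedged sketch, continuing the `TwoStageVO` interface above (the L2 norm and the default $\lambda$ value are illustrative assumptions, not the paper's exact settings):

```python
import torch

def vo_loss(model, img_t, img_t1, flow_gt, motion_gt, lam=0.1):
    """End-to-end loss of Eq. (1): L = lambda * L_f + L_p.

    lam is a placeholder; the paper treats lambda as a tuned hyper-parameter.
    """
    flow_hat, motion_hat = model(img_t, img_t1)   # pose is computed from F_hat
    loss_f = torch.norm(flow_hat - flow_gt)       # optical flow loss L_f
    loss_p = torch.norm(motion_hat - motion_gt)   # camera motion loss L_p
    return lam * loss_f + loss_p
```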
Since TartanAir is purely synthetic, the biggest question is whether a model learned from simulated data can generalize to real-world scenes. As discussed by Wang et al. [11], a large number of studies show that a model trained purely in simulation, but with broad diversity, can be readily transferred to the real world; this is also known as domain randomization [38, 39]. In our experiments, we show that the diverse simulated data indeed enables the VO model to generalize to real-world data.
3.3 Up-to-scale loss function
The motion scale is unobservable from a monocular image sequence. In geometry-based methods, the scale is usually recovered from other sources of information, ranging from known object sizes or camera height to extra sensors such as an IMU. However, most existing learning-based VO studies neglect the scale problem and try to recover the motion with scale. This is feasible only if the model is trained and tested with the same camera and in the same type of scenario. For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground with a fixed orientation, so a model can learn to memorize the scale in this particular setup. Obviously, the model will have severe problems when tested with a different camera configuration. Imagine if the