3 Approach
3.1 Background
We focus on the monocular VO problem, which takes two consecutive undistorted images $\{I_t, I_{t+1}\}$ and estimates the relative camera motion $\delta_t^{t+1} = (T, R)$, where $T \in \mathbb{R}^3$ is the 3D translation and $R \in \mathrm{SO}(3)$ denotes the 3D rotation. According to epipolar geometry theory [34], geometry-based VO proceeds in two steps. First, visual features are extracted from $I_t$ and $I_{t+1}$ and matched. Then, using the matched features, the essential matrix is computed, from which the up-to-scale camera motion $\delta_t^{t+1}$ is recovered.
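For concreteness, this classical two-step pipeline can be sketched with OpenCV's standard API (a minimal sketch, not the paper's method; the intrinsics matrix `K` and the choice of ORB features with brute-force matching are illustrative assumptions):

```python
import cv2
import numpy as np

def geometric_vo_step(img_t, img_t1, K):
    """Classical two-step VO: match features, then recover up-to-scale motion."""
    # Step 1: extract and match sparse visual features between the two frames.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_t, None)
    kp2, des2 = orb.detectAndCompute(img_t1, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Step 2: estimate the essential matrix, then decompose it into (R, T).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, T, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, T  # T has unit norm: the motion is recovered only up to scale
```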
Following the same idea, our model consists of two sub-modules. One is the matching module $M_\theta(I_t, I_{t+1})$, which estimates the dense matching result $F_t^{t+1}$ (i.e., optical flow) from two consecutive RGB images. The other is a pose module $P_\phi(F_t^{t+1})$ that recovers the camera motion $\delta_t^{t+1}$ from the matching result (Fig. 1). This modular design is also widely used in other learning-based methods, especially in unsupervised VO [13, 19, 16, 22, 18].
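To make the interface concrete, here is a minimal PyTorch sketch of the two-module design (the module names, the constructor injection, and the 6-vector motion output are our assumptions for illustration, not the paper's actual architectures):

```python
import torch.nn as nn

class TwoStageVO(nn.Module):
    """Modular VO: a matching module M_theta followed by a pose module P_phi."""
    def __init__(self, matching_net: nn.Module, pose_net: nn.Module):
        super().__init__()
        self.matching_net = matching_net  # M_theta: (I_t, I_{t+1}) -> optical flow
        self.pose_net = pose_net          # P_phi: optical flow -> camera motion

    def forward(self, img_t, img_t1):
        flow_hat = self.matching_net(img_t, img_t1)  # dense matching F_t^{t+1}
        motion_hat = self.pose_net(flow_hat)         # (T, R), e.g. as a 6-vector
        return flow_hat, motion_hat
```

Note that the pose module consumes the estimated flow rather than the ground truth, mirroring the test-time data path.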
3.2 Training on large-scale diverse data
Generalization has always been one of the most critical issues for learning-based methods. Most previous supervised models are trained on the KITTI dataset, which contains 11 labeled sequences and 23,201 image frames captured in driving scenarios [35]. Wang et al. [8] presented training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV), and reported that performance is limited by the lack of training data and the more complex dynamics of a flying robot. Surprisingly, most unsupervised methods also train their models only on very uniform scenes (e.g., KITTI and Cityscapes [37]). To our knowledge, no learning-based model has yet demonstrated the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To achieve this, we argue that the training data must cover diverse scenes and motion patterns.
TartanAir [11] is a large-scale dataset with highly diverse scenes and motion patterns, containing more than 400,000 data frames. It provides multi-modal ground truth labels including depth, segmentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, natural, and sci-fi environments. The data is collected with a simulated pinhole camera that moves with random and rich 6DoF motion patterns in 3D space.
We take advantage of the monocular image sequences $\{I_t\}$, the optical flow labels $\{F_t^{t+1}\}$, and the ground truth camera motions $\{\delta_t^{t+1}\}$ in our task. Our objective is to jointly minimize the optical flow loss $\mathcal{L}_f$ and the camera motion loss $\mathcal{L}_p$. The end-to-end loss is defined as:

$$\mathcal{L} = \lambda \mathcal{L}_f + \mathcal{L}_p = \lambda \left\| M_\theta(I_t, I_{t+1}) - F_t^{t+1} \right\| + \left\| P_\phi(\hat{F}_t^{t+1}) - \delta_t^{t+1} \right\| \qquad (1)$$
where $\lambda$ is a hyper-parameter balancing the two losses. We use $\hat{\cdot}$ to denote a variable estimated by our model.
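Eq. (1) maps directly onto a training step. A hedged sketch, continuing the `TwoStageVO` interface above (the L2 norm and the default $\lambda$ value are illustrative assumptions, not the paper's exact settings):

```python
import torch

def vo_loss(model, img_t, img_t1, flow_gt, motion_gt, lam=0.1):
    """End-to-end loss of Eq. (1): L = lambda * L_f + L_p.

    lam is a placeholder; the paper treats lambda as a tuned hyper-parameter.
    """
    flow_hat, motion_hat = model(img_t, img_t1)   # pose is computed from F_hat
    loss_f = torch.norm(flow_hat - flow_gt)       # optical flow loss L_f
    loss_p = torch.norm(motion_hat - motion_gt)   # camera motion loss L_p
    return lam * loss_f + loss_p
```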
Since TartanAir is purely synthetic, the biggest question is whether a model learned from simulated data can generalize to real-world scenes. As discussed by Wang et al. [11], a large number of studies show that a model trained purely in simulation, but with broad diversity, can be readily transferred to the real world; this is also known as domain randomization [38, 39]. In our experiments, we show that the diverse simulated data indeed enables the VO model to generalize to real-world data.
3.3 Up-to-scale loss function
The motion scale is unobservable from a monocular image sequence. In geometry-based methods, the scale is usually recovered from other sources of information, ranging from known object sizes or camera height to extra sensors such as an IMU. However, most existing learning-based VO studies neglect the scale problem and try to recover the motion with scale. This is feasible only if the model is trained and tested with the same camera and in the same type of scenario. For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground with a fixed orientation, so a model can learn to memorize the scale in this particular setup. Obviously, the model will have severe problems when tested with a different camera configuration. Imagine if the