The idea is to maintain a second, delayed marginalization
prior, which incurs very little overhead but enables three core
techniques:
1) We can populate the delayed factor graph with new
IMU factors to perform the proposed pose graph
bundle adjustment (PGBA). This is the basis of an
IMU initialization which captures the full photometric
uncertainty, leading to increased accuracy.
2) The graph used for IMU initialization can be readvanced,
providing a marginalization prior with IMU
information for the main system.
3) When the scale changes significantly in the main
system, we can trigger marginalization replacement.
The combination of these techniques makes for a highly
accurate initializer, which is robust even to long periods of
unobservability. Building on it, we implement a visual-inertial
odometry (VIO) system featuring a photometric front-end
integrated with a new dynamic photometric weight.
We evaluate our method on three challenging datasets
(Fig. 1), covering three domains: the EuRoC dataset [7],
recorded by a flying drone; the TUM-VI dataset [8], captured
with a handheld device; and the 4Seasons dataset [9],
representing the automotive scenario. The latter features long
stretches of constant velocity, posing a particular challenge
for mono-inertial odometry.
We show that our system exceeds the state of the art in
visual-inertial odometry, even outperforming stereo-inertial
methods. In summary, our contributions are:
• Delayed marginalization compensates for the drawbacks
of marginalization while retaining its advantages.
• Pose graph bundle adjustment (PGBA) combines the
efficiency of pose graph optimization with the full
uncertainty of bundle adjustment.
• A state-of-the-art visual-inertial odometry system with
a novel multi-stage IMU initializer and dynamically
weighted photometric factors.
The full source code for our approach will be released.
II. RELATED WORK
Initially, most visual odometry and SLAM systems were
feature-based [10], using either filtering [11] or non-linear
optimization [12], [13]. More recently, direct methods
have been proposed, which optimize a photometric error
function and can operate on dense [14], [15], semi-dense [16],
or sparse point clouds [17].
Mourikis and Roumeliotis [1] have shown that a tight
integration of visual and inertial measurements can greatly
increase the accuracy and robustness of odometry. Subsequently,
many tightly-coupled visual-inertial odometry [18], [19] and
SLAM systems [20], [21], [3], [5] have been proposed.
Initialization of monocular visual-inertial systems is not
trivial, as sufficient motion is necessary for the scale to
become observable [22], [2]. Most systems [4], [3], [5] start
with a visual-only system and use its output for a separate
IMU initialization. In contrast to these systems, we continue
to optimize the scale explicitly in the main system. We note
that ORB-SLAM3 [5] also continues to refine the scale after
initialization, but this is a separate optimization that fixes all
poses and is only performed until 75 seconds after initialization.
The approach in [23] also continues to optimize the scale in the
main system, but in contrast to ours it does not transfer
covariances between the main system and the initializer, and
thus does not achieve the same level of accuracy. In contrast
to all of these systems, the proposed delayed marginalization
allows our IMU initializer to capture the full visual uncertainty
and to continuously optimize the scale in the main system.
VI-DSO [6] initializes immediately with an arbitrary scale
and explicitly optimizes the scale in the main system. It
also introduced dynamic marginalization to handle the resulting
large scale changes in the main system. Compared to it, we
propose a separate IMU initializer, delayed marginalization
as a better alternative to dynamic marginalization, a dynamic
photometric error weight, and further improvements, resulting
in greatly increased accuracy and robustness.
III. METHOD
A. Notation
We denote vectors as bold lowercase letters $\mathbf{x}$, matrices as
bold upper-case letters $\mathbf{H}$, scalars as lowercase letters $\lambda$, and
functions as uppercase letters $E$. $\mathbf{T}^{V}_{w\,\mathrm{cam}_i} \in \mathrm{SE}(3)$ represents
the transformation from camera $i$ to world in the visual
coordinate frame $V$, and $\mathbf{R}^{V}_{w\,\mathrm{cam}_i} \in \mathrm{SO}(3)$ is the respective
rotation. Poses are represented either in the visual frame
$\mathbf{P}^{V}_i := \mathbf{T}^{V}_{\mathrm{cam}_i\,w}$, or in the inertial frame $\mathbf{P}^{I}_i := \mathbf{T}^{I}_{w\,\mathrm{imu}_i}$. If not mentioned
otherwise we use poses in the visual frame, $\mathbf{P}_i := \mathbf{P}^{V}_i$. We also
use states $\mathbf{s}$, which can contain transformations, rotations,
and vectors. For states we define the subtraction operator
$\mathbf{s}_i \boxminus \mathbf{s}_j$, which applies $\log(\mathbf{R}_i \mathbf{R}_j^{-1})$ for rotations and other Lie
group elements, and a regular subtraction for vector values.
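To make the operator concrete, the following Python sketch (our own
illustrative helper, not part of the released code; the function name
`state_subtract` and the dictionary-based state layout are assumptions)
applies the rotation logarithm to Lie group components and plain
subtraction to vector components.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def state_subtract(state_i, state_j):
    """Illustrative sketch of the state subtraction operator.

    Each state is a dict mapping component names to either a scipy Rotation
    (Lie group element) or a numpy vector. Rotations are compared via
    log(R_i * R_j^{-1}); vector-valued components are subtracted directly.
    """
    delta = {}
    for name, value_i in state_i.items():
        value_j = state_j[name]
        if isinstance(value_i, Rotation):
            # Relative rotation mapped to the tangent space via the SO(3) log.
            delta[name] = (value_i * value_j.inv()).as_rotvec()
        else:
            # Regular subtraction for vector values (e.g. velocity, biases).
            delta[name] = np.asarray(value_i) - np.asarray(value_j)
    return delta


# Minimal usage example with one rotation and one vector component.
s_i = {"R": Rotation.from_euler("z", 10, degrees=True), "v": np.array([1.0, 0.0, 0.0])}
s_j = {"R": Rotation.from_euler("z", 4, degrees=True), "v": np.array([0.5, 0.0, 0.0])}
print(state_subtract(s_i, s_j))
```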
B. Direct Visual-Inertial Bundle Adjustment
The core of DM-VIO is the visual-inertial bundle adjust-
ment performed for all keyframes. As commonly done, we
jointly optimize visual and IMU variables in a combined
energy function. For the visual part we choose a direct formu-
lation based on DSO [17], as it is a very accurate and robust
system. For integrating IMU data into the bundle adjustment
we perform preintegration [24] between keyframes.
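As a rough illustration of the idea behind preintegration (a heavily
simplified sketch: bias correction, noise propagation, and the covariance
bookkeeping of [24] are omitted, and all function and variable names are
our own), consecutive IMU samples between two keyframes can be accumulated
into relative rotation, velocity, and position increments:

```python
import numpy as np
from scipy.spatial.transform import Rotation


def preintegrate(imu_samples, dt):
    """Greatly simplified IMU preintegration between two keyframes.

    imu_samples: list of (gyro, accel) pairs, each a 3-vector in the body frame.
    dt: sampling interval in seconds.
    Returns the preintegrated rotation, velocity, and position increments
    relative to the first keyframe's body frame (gravity is accounted for
    later, when the increments are composed with world-frame states).
    """
    dR = Rotation.identity()
    dv = np.zeros(3)
    dp = np.zeros(3)
    for gyro, accel in imu_samples:
        acc = dR.apply(accel)                 # acceleration in the start frame
        dp += dv * dt + 0.5 * acc * dt ** 2   # position increment
        dv += acc * dt                        # velocity increment
        dR = dR * Rotation.from_rotvec(np.asarray(gyro) * dt)  # rotation increment
    return dR, dv, dp
```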
We optimize the following energy function using the
Levenberg-Marquardt algorithm:
$E(\mathbf{s}) = W(e_{\mathrm{photo}}) \cdot E_{\mathrm{photo}} + E_{\mathrm{imu}} + E_{\mathrm{prior}}$ (1)
$E_{\mathrm{prior}}$ contains added priors on the first pose and the gravity
direction, as well as the marginalization priors explained in
Section III-C. In the following we describe the individual
energy terms and the optimized state.
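As a minimal sketch of how Eq. (1) composes the total energy (the
threshold-based weight below is a hypothetical stand-in for the dynamic
photometric weight $W$, which is described later; the argument
`e_photo_rms`, a scalar summary of the photometric residual, and all other
names are our own assumptions):

```python
def total_energy(e_photo_rms, E_photo, E_imu, E_prior, threshold=8.0):
    """Sketch of Eq. (1): weighted photometric term plus IMU and prior terms.

    The weight is a hypothetical placeholder: it equals 1 for small
    photometric residuals and decays when the residual grows, so that
    poorly tracked frames do not dominate the inertial and prior terms.
    """
    w = 1.0 if e_photo_rms < threshold else (threshold / e_photo_rms) ** 2
    return w * E_photo + E_imu + E_prior
```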
Photometric error: The photometric energy is based on
[17]. We optimize a set of active keyframes $F$, each of which
hosts a set of points $P_i$. Every point $\mathbf{p}$ is projected into all
keyframes $\mathrm{obs}(\mathbf{p})$ where it is visible, and the photometric
energy is computed:

$E_{\mathrm{photo}} = \sum_{i \in F} \sum_{\mathbf{p} \in P_i} \sum_{j \in \mathrm{obs}(\mathbf{p})} E_{\mathbf{p}j}$ (2)
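The structure of Eq. (2) can be read as a triple loop; in the following
sketch, `point_energy` is a hypothetical callback standing in for the
per-observation residual $E_{\mathbf{p}j}$ of [17], and the attribute names
are our own assumptions:

```python
def photometric_energy(active_keyframes, point_energy):
    """Sketch of Eq. (2): sum over host keyframes, hosted points,
    and observing keyframes.

    active_keyframes: iterable of keyframes, each with an attribute
        `points` (its hosted points); every point carries `observations`,
        the keyframes in which it is visible.
    point_energy: callable (host_kf, point, obs_kf) -> float, a stand-in
        for the per-observation photometric residual E_pj.
    """
    energy = 0.0
    for host_kf in active_keyframes:           # i in F
        for point in host_kf.points:           # p in P_i
            for obs_kf in point.observations:  # j in obs(p)
                energy += point_energy(host_kf, point, obs_kf)
    return energy
```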