Visual SLAM: Why Bundle Adjust?

Álvaro Parra Bustos¹, Tat-Jun Chin¹, Anders Eriksson² and Ian Reid¹
Abstract— Bundle adjustment plays a vital role in feature-based monocular SLAM. In many modern SLAM pipelines, bundle adjustment is performed to estimate the 6DOF camera trajectory and 3D map (3D point cloud) from the input feature tracks. However, two fundamental weaknesses plague SLAM systems based on bundle adjustment. First, the need to carefully initialise bundle adjustment means that all variables, in particular the map, must be estimated as accurately as possible and maintained over time, which makes the overall algorithm cumbersome. Second, since estimating the 3D structure (which requires sufficient baseline) is inherent in bundle adjustment, the SLAM algorithm will encounter difficulties during periods of slow motion or pure rotational motion.

We propose a different SLAM optimisation core: instead of bundle adjustment, we conduct rotation averaging to incrementally optimise only camera orientations. Given the orientations, we estimate the camera positions and 3D points via a quasi-convex formulation that can be solved efficiently and globally optimally. Our approach not only obviates the need to estimate and maintain the positions and 3D map at keyframe rate (which enables simpler SLAM systems), it is also more capable of handling slow motions or pure rotational motions.
I. INTRODUCTION
Let u_{i,j} be the 2D coordinates of the i-th scene point as seen in the j-th image Z_j. Given a set {u_{i,j}} of observations, structure-from-motion (SfM) aims to estimate the 3D coordinates X = {X_i} of the scene points and the 6DOF poses {(R_j, t_j)} of the images {Z_j} that agree with the observations. The bundle adjustment (BA) formulation is

\min_{\{X_i\},\{(R_j,t_j)\}} \sum_{i,j} \left\| u_{i,j} - f(X_i \mid R_j, t_j) \right\|_2^2,    (1)

where f(X_i | R_j, t_j) is the projection of X_i onto Z_j (assuming calibrated cameras). In practice, not all X_i are visible in every Z_j, thus some of the (i, j) terms are dropped. For ease of exposition, we follow [1] and regard the image set {Z_j} as inputs to BA, bearing in mind that the effective inputs are the observations {u_{i,j}} and the visibility matrix.
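To make the objective in (1) concrete, the sketch below evaluates the reprojection cost over only the visible (i, j) pairs. This is our illustrative Python, not the authors' code; the pinhole `project` function and the dictionary-based visibility encoding are assumptions.

```python
import numpy as np

def project(X, R, t):
    """f(X | R, t): project 3D point X into a calibrated camera with pose (R, t)."""
    Xc = R @ X + t           # transform into the camera frame
    return Xc[:2] / Xc[2]    # perspective division (normalised image coordinates)

def ba_cost(points, poses, observations):
    """Objective (1): sum of squared reprojection errors.

    `observations` maps each visible pair (i, j) to its 2D measurement u_ij,
    so invisible (i, j) terms are simply absent, as in the text.
    """
    cost = 0.0
    for (i, j), u_ij in observations.items():
        R, t = poses[j]
        r = u_ij - project(points[i], R, t)
        cost += float(r @ r)
    return cost
```

At the ground-truth variables the cost is zero; BA searches for the minimiser of this sum over all points and poses jointly.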
As a non-linear least squares problem, (1) is usually solved by gradient descent methods, e.g., Levenberg-Marquardt, which require initialisation for all unknowns. Thus, apart from the images {Z_j}, the total inputs to a BA instance typically include the initial values for {(R_j, t_j)} and X.
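As an illustration of this initialisation-dependent solving, a toy instance of (1) can be handed to an off-the-shelf Levenberg-Marquardt solver. The sketch below uses SciPy and is an assumption-laden simplification: the poses are held fixed and only a single 3D point is refined, whereas full BA stacks all points and poses into the variable vector.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(X, poses, obs):
    """Stacked 2D residuals u_ij - f(X | R_j, t_j) for a single point X."""
    res = []
    for (R, t), u in zip(poses, obs):
        Xc = R @ X + t
        res.append(u - Xc[:2] / Xc[2])
    return np.concatenate(res)

# Two calibrated cameras separated by a baseline along x (ground truth).
poses = [(np.eye(3), np.zeros(3)),
         (np.eye(3), np.array([-1.0, 0.0, 0.0]))]
X_true = np.array([0.2, -0.1, 4.0])
obs = [(R @ X_true + t)[:2] / (R @ X_true + t)[2] for R, t in poses]

# Levenberg-Marquardt from a perturbed initial value, as in BA.
sol = least_squares(residuals, x0=X_true + 0.5, args=(poses, obs), method='lm')
```

The solver recovers X_true here only because the initialisation is close enough; a poor initial value can stall or divert the descent, which is exactly the weakness of BA raised in the abstract.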
BA is justifiable in the maximum likelihood sense if the
errors due to the uncertainty in localising the feature points
{u_{i,j}} are Normally distributed. However, it is not obvious
that available feature detectors satisfy this property [2], [3],
[4]. While this does not reduce the usefulness of BA, its
statistical validity should not be taken for granted.
¹School of Computer Science, The University of Adelaide.
²School of Electrical Engineering and Computer Science, Queensland University of Technology.
Algorithm 1 BA-SLAM (adapted from [1]).
1: X ← Initialise points(Z_0).
2: for each keyframe step t = 1, 2, ... do
3:   s ← t − (window size) + 1.
4:   if a number n ≥ 1 of points left the field of view then
5:     X ← X ∪ Initialise n new points(Z_t).
6:   end if
7:   R_{s:t}, t_{s:t}, X ← BA(R_{s:t}, t_{s:t}, X, Z_{0:t}).
8:   if a loop is detected in Z_t then
9:     R_{1:t}, t_{1:t}, X ← BA(R_{1:t}, t_{1:t}, X, Z_{0:t}).
10:  end if
11: end for
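The control flow of Algorithm 1 can also be sketched in code. The skeleton below is structural only; `initialise_points`, `bundle_adjust`, and `loop_detected` are hypothetical stubs standing in for the real SLAM components, and points are spawned unconditionally rather than only when enough points leave the field of view.

```python
def ba_slam(frames, window_size, initialise_points, bundle_adjust, loop_detected):
    """Skeleton of Algorithm 1 (BA-SLAM); the stub callables supply real behaviour."""
    X = initialise_points(frames[0])                      # step 1: bootstrap the map
    poses = [None] * len(frames)                          # placeholders for (R_j, t_j)
    for t in range(1, len(frames)):                       # step 2: each keyframe step
        s = max(0, t - window_size + 1)                   # step 3: window start
        # steps 4-6: the real system spawns new points only when enough
        # points left the field of view; we spawn unconditionally for brevity.
        X = X | initialise_points(frames[t])
        poses[s:t + 1], X = bundle_adjust(poses[s:t + 1], X, frames[:t + 1])  # step 7
        if loop_detected(frames[t]):                      # step 8: loop closure check
            poses[:t + 1], X = bundle_adjust(poses[:t + 1], X, frames[:t + 1])  # step 9
    return poses, X
```

Plugging in trivial stubs shows the call pattern: one local BA per keyframe step, plus a system-wide BA whenever a loop is detected.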
A. BA-SLAM
Roughly speaking, monocular feature-based SLAM [5] (henceforth, “SLAM”) is the execution of SfM incrementally to process streaming input images Z_{0:t}, where

Z_{0:t} = \{Z_0, Z_1, \ldots, Z_t\}.    (2)
Several influential works [6], [7], [1], [8] have cemented the
importance of BA in SLAM. Algorithm 1, which is adapted
from [1, Table 1], describes a SLAM optimisation core based
on BA over keyframes. Specifically:
• In Step 5, new scene points are “spawned” if the current
frame Z_t does not adequately observe the map X.
• In Step 7 (a.k.a. local mapping), BA is used to estimate the camera trajectory and 3D map in the current time window. Often, local mapping is preceded by camera tracking to accurately initialise the current pose (R_t, t_t). See [1, Sec. 5.3] or [8, Sec. V] for examples.
• In Step 9 (a.k.a. loop closure), a system-wide BA is
executed to reoptimise all the variables and redistribute
accumulated drift errors. Implicit in Algorithm 1 is the
introduction of covisibility information between Z_t and older keyframes, prior to BA. Often, Step 9 is preceded
older keyframes, prior to BA. Often, Step 9 is preceded
by pose graph optimisation [9], [10], [11], [12] to give
a more accurate initialisation of the poses.
Note that Algorithm 1 is merely a “basic recipe” for SLAM.
In practice, “what will make or break a real-time SLAM
system are all the (often heuristic) nitty-gritty details” [13],
e.g., how to select features/keyframes, how to update the
covisibility graph, how to select/merge/prune 3D points, etc.
However, since our focus is on optimisation, Algorithm 1 is
sufficient to capture the core algorithmic elements of SLAM
systems based on BA, such as ORB-SLAM [8].
arXiv:1902.03747v2 [cs.CV] 14 Jun 2019