Taketomi et al. IPSJ Transactions on Computer Vision and Applications
(2017) 9:16
Page 3 of 11
techniques as relocalization. Basically, relocalization is performed to recover the camera pose, whereas loop detection is performed to obtain a geometrically consistent map.
Pose-graph optimization has been widely used to suppress the accumulated error by optimizing camera poses [12, 13]. In this method, the relationship between camera poses is represented as a graph, and a consistent graph is built to suppress the error in the optimization. Bundle adjustment (BA) is also used to minimize the reprojection error of the map by optimizing both the map and the camera poses [14]. In large environments, this optimization procedure is employed to minimize estimation errors efficiently. In small environments, BA may be performed without loop closing because the accumulated error is small.
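As a toy illustration of pose-graph optimization (our own one-dimensional example, not taken from [12, 13]): poses along a line are linked by noisy odometry edges plus one loop-closure edge, and the globally consistent poses are the least-squares solution of the edge constraints.

```python
import numpy as np

# Hypothetical 1D pose graph: five poses connected by drifting odometry
# edges and one loop-closure edge. Each edge measures z = x_j - x_i.
edges = [
    (0, 1, 1.1), (1, 2, 1.0), (2, 3, 0.9), (3, 4, 1.05),  # odometry
    (0, 4, 4.0),                                          # loop closure
]
n = 5

# Build the linear least-squares system J x = z.
J = np.zeros((len(edges), n))
z = np.zeros(len(edges))
for row, (i, j, meas) in enumerate(edges):
    J[row, i] = -1.0
    J[row, j] = 1.0
    z[row] = meas

# Fix the gauge freedom by anchoring the first pose at 0 (drop its column).
Jf = J[:, 1:]
x = np.zeros(n)
x[1:] = np.linalg.lstsq(Jf, z, rcond=None)[0]
# The 0.05 discrepancy between the summed odometry (4.05) and the loop
# closure (4.0) is spread evenly over all five edges.
```

Real pose-graph optimization works on 6 DoF poses and is nonlinear, so it is solved iteratively (e.g., by Gauss-Newton), but the principle of distributing the loop-closure error over the graph is the same.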
2.3 Summary
As listed above, the framework of vSLAM algorithms is composed of five modules: initialization, tracking, mapping, relocalization, and global map optimization. Since each vSLAM algorithm employs different methodologies for these modules, the characteristics of a vSLAM algorithm depend highly on the methodologies employed. It is therefore important to understand each module of a vSLAM algorithm in order to know its performance, advantages, and limitations.
It should be noted that the term tracking and mapping (TAM) is used instead of localization and mapping. TAM was first used in Parallel Tracking and Mapping (PTAM) [15] because localization and mapping are not performed simultaneously in the traditional sense: tracking is performed on every frame in one thread, whereas mapping is performed at certain times in another thread. After PTAM was proposed, most vSLAM algorithms have followed the framework of TAM. Therefore, TAM is used in this paper.
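The two-thread TAM design described above can be sketched as follows. This is a minimal sketch with hypothetical names (real systems such as PTAM implement actual pose estimation and bundle adjustment); the point is only the split between per-frame tracking and asynchronous keyframe-based mapping.

```python
import queue
import threading

# Keyframes flow from the tracking thread to the mapping thread via a queue.
keyframes = queue.Queue()
map_points = []

def track(frames):
    # Tracking: runs on every frame, estimating the camera pose against
    # the current map (pose estimation itself is elided in this sketch).
    for i, frame in enumerate(frames):
        if i % 3 == 0:            # promote occasional frames to keyframes
            keyframes.put(frame)
    keyframes.put(None)           # signal the mapping thread to stop

def map_worker():
    # Mapping: runs only when a new keyframe arrives, where a real system
    # would triangulate new points and refine the map (e.g., local BA).
    while True:
        kf = keyframes.get()
        if kf is None:
            break
        map_points.append(kf)

mapper = threading.Thread(target=map_worker)
mapper.start()
track(range(10))                  # stand-in for a camera frame stream
mapper.join()
```

Because mapping is decoupled from the frame rate, the expensive map refinement never blocks per-frame pose tracking, which is the key benefit of the TAM split.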
3 Related technologies
vSLAM, visual odometry, and online structure from
motion are all designed to estimate camera motion and the 3D structure of an unknown environment. In this section, we
explain the relationship among them.
3.1 Visual odometry
Odometry is the estimation of sequential changes in sensor position over time, using sensors such as a wheel encoder to acquire relative sensor movement. Camera-based odometry, called visual odometry (VO), is also one of the active research fields in the literature [16, 17]. From a technical point of view, vSLAM and VO are highly relevant techniques because both basically estimate sensor positions. According to survey papers in robotics [18, 19], the relationship between vSLAM and VO can be represented as follows.
vSLAM = VO + global map optimization
The main difference between these two techniques is global map optimization in the mapping. In other words, VO is equivalent to the modules in Section 2.1. In VO, the geometric consistency of a map is considered only over a small portion of the map, or only relative camera motion is computed without mapping. In vSLAM, on the other hand, the global geometric consistency of the map is normally considered. Therefore, to build a geometrically consistent map, global optimization is performed in recent vSLAM algorithms.
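To make the distinction concrete, the following sketch (our own example, not from [18, 19]) shows what a VO frontend does: it chains per-frame relative motions, so any measurement error accumulates as drift that VO alone cannot remove.

```python
import math

# Hypothetical VO sketch: the frontend receives per-frame relative motions
# (dx, dy, dtheta), expressed in the previous frame's coordinates, and
# simply chains them together.
def compose(pose, delta):
    x, y, th = pose
    dx, dy, dth = delta
    # Rotate the relative translation into the world frame, then accumulate.
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)

# Ideal case: four unit steps with exact 90-degree turns close the loop.
pose = (0.0, 0.0, 0.0)
for _ in range(4):
    pose = compose(pose, (1.0, 0.0, math.pi / 2))

# Biased case: a 2% rotation error per step leaves a residual position
# error; vSLAM would remove it with loop detection and global map
# optimization once the starting position is revisited.
drift = (0.0, 0.0, 0.0)
for _ in range(4):
    drift = compose(drift, (1.0, 0.0, math.pi / 2 * 1.02))
```

In the ideal case the chained motions return exactly to the origin, while the biased case ends measurably away from it, illustrating why the global map optimization term in the equation above is what separates vSLAM from VO.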
The relationship between vSLAM and VO can also be seen in the pairs of papers [20, 21] and [22, 23]. In [20] and [22], techniques for VO were first proposed. Techniques for vSLAM were then proposed by adding global optimization to the VO [21, 23].
3.2 Structure from motion
Structure from motion (SfM) is a technique for estimating camera motion and the 3D structure of the environment in a batch manner [24]. In [25], an SfM method that runs online was proposed, which the authors named real-time SfM. From a technical point of view, there is no definitive difference between vSLAM and real-time SfM. This may be why the term “real-time SfM” is not found in recent papers.
As explained in this section, vSLAM, VO, and real-
time SfM share many common components. Therefore,
we introduce all of them and do not distinguish these
technologies in this paper.
4 Feature-based methods
Two types of feature-based methods exist in the literature: filter-based and BA-based methods. In this section, we explain both types and provide a comparison. Even though some of the methods were proposed before 2010, we explain them here because they can be considered fundamental frameworks for the other methods.
4.1 MonoSLAM
The first monocular vSLAM algorithm was developed in 2003 by Davison et al. [26, 27], who named it MonoSLAM. MonoSLAM is considered a representative filter-based vSLAM algorithm. In MonoSLAM, camera motion and the 3D structure of an unknown environment are simultaneously estimated using an extended Kalman filter (EKF). The six-degree-of-freedom (6 DoF) camera motion and the 3D positions of feature points are represented as a state vector in the EKF. Uniform motion is assumed in the prediction model, and the results of feature-point tracking are used as observations. Depending on camera movement, new feature points are added to the state vector. Note that the initial map