The idea of using a single camera has become popular since the emergence of single-camera SLAM, or MonoSLAM (Davison 2003). This is probably also because a single camera is now easier to obtain than a stereo pair, being available in cell phones, personal digital assistants and personal computers. The monocular approach thus offers a simple, flexible and economical solution in terms of both hardware and processing time.
Monocular SLAM is a particular case of bearing-only SLAM. The latter is a partially observable problem, in which a single observation does not provide enough information to determine the depth of a landmark. This causes a landmark initialization problem, whose solutions can be divided into two categories: delayed and undelayed (Lemaire et al. 2007; Vidal et al. 2007). Salient features must be tracked across multiple observations to recover three-dimensional information from a single camera.
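As an illustration of why initialization may be delayed, the following minimal sketch triangulates a landmark only once two bearing observations with sufficient parallax are available. It uses a simple midpoint-triangulation rule in Python and is not the method of any of the systems cited above; the function name and the parallax threshold are illustrative assumptions.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2, min_parallax_deg=1.0):
    """Delayed-initialization sketch: intersect two bearing rays.

    c1, c2 -- camera centres (3,) in the world frame
    d1, d2 -- unit bearing vectors (3,) in the world frame
    Returns the midpoint of the closest points on the two rays, or None
    when the parallax is too small for the depth to be observable.
    """
    b = float(np.dot(d1, d2))                       # cosine of the angle between the rays
    parallax = np.degrees(np.arccos(np.clip(abs(b), 0.0, 1.0)))
    if parallax < min_parallax_deg:                 # nearly parallel rays: depth not observable
        return None
    w = c1 - c2
    p, q = float(np.dot(d1, w)), float(np.dot(d2, w))
    s = (b * q - p) / (1.0 - b * b)                 # distance along the first ray
    t = b * s + q                                   # distance along the second ray
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2))    # 3D landmark estimate
```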
Even though many contributions have been made to visual SLAM, many problems remain; the solutions proposed for the visual SLAM problem are reviewed in Sect. 6. Many visual SLAM systems accumulate large errors while the environment is being explored (or fail completely in visually complex environments), which leads to inconsistent estimates of the robot position and to incoherent maps. There are three primary reasons:
(1) First, it is generally assumed that camera movement is smooth and that the appearance of salient features remains consistent (Davison 2003; Nistér et al. 2004), but in general this is not true. These assumptions are closely tied to the choice of salient feature detector and matching technique. They lead to inaccurate camera position estimates when the captured images have little texture or are blurred by rapid movements of the sensor (e.g. due to vibration or quick changes of direction) (Pupilli and Calway 2006). Such conditions are typical when the camera is carried by a person, a humanoid robot or a quad-rotor helicopter, among others. One way of alleviating this problem to some extent is the use of keyframes (see "Appendix I") (Mouragnon et al. 2006; Klein and Murray 2008); a simple keyframe-selection heuristic is sketched in the first example after this list. Alternatively, Pretto et al. (2007) and Mei and Reid (2008) analyze the problem of real-time visual tracking over image sequences blurred by an out-of-focus camera.
(2) Second, most researchers assume that the environment to be explored is static and contains only stationary, rigid elements, whereas in reality most environments contain people and objects in motion. If this is not taken into account, the moving elements produce false matches and consequently introduce unpredictable errors throughout the system; the second example after this list sketches one simple way of rejecting such matches. The first approaches to this problem are proposed by Wang et al. (2007), Wangsiripitak and Murray (2009) and Migliore et al. (2009), as well as Lin and Wang (2010).
(3) Third, the world is visually repetitive. There are many similar textures, such as repeated architectural elements, foliage and walls of brick or stone, and some objects, such as traffic signals, appear repeatedly in urban outdoor environments. This makes it difficult to recognize a previously explored area and to perform SLAM over large areas.
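The following is a minimal sketch of the kind of keyframe-selection rule mentioned in point (1). It is only an assumed illustration: the function name, the tracked-feature ratio and the parallax thresholds are not taken from Mouragnon et al. (2006) or Klein and Murray (2008).

```python
def should_add_keyframe(n_tracked, n_tracked_in_last_kf, frames_since_kf,
                        median_parallax_deg,
                        min_track_ratio=0.7, min_frames=10, min_parallax_deg=2.0):
    """Keyframe-insertion heuristic (illustrative thresholds only).

    A new keyframe is created when tracking starts to degrade (few of the
    last keyframe's features survive) or when the camera has moved enough
    for new geometry to be triangulated reliably.
    """
    tracking_degraded = n_tracked < min_track_ratio * n_tracked_in_last_kf
    enough_baseline = (frames_since_kf >= min_frames
                       and median_parallax_deg >= min_parallax_deg)
    return tracking_degraded or enough_baseline
```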
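For point (2), one common pragmatic safeguard, not the specific method of any of the works cited there, is to fit the dominant static-scene motion with RANSAC and discard the matches that disagree with it. The sketch below assumes OpenCV is available; the function name and threshold values are illustrative.

```python
import cv2
import numpy as np

def flag_dynamic_matches(pts_prev, pts_curr, ransac_thresh_px=1.0):
    """Flag matches that disagree with the dominant (static-scene) motion.

    pts_prev, pts_curr -- Nx2 arrays of matched image points in two frames.
    A fundamental matrix is fitted with RANSAC; the consensus set is assumed
    to come from the static background, and outliers are treated as candidate
    moving objects (or plain mismatches) and excluded from the SLAM update.
    """
    F, inlier_mask = cv2.findFundamentalMat(
        pts_prev.astype(np.float32), pts_curr.astype(np.float32),
        cv2.FM_RANSAC, ransac_thresh_px, 0.99)
    if F is None or inlier_mask is None:
        return np.zeros(len(pts_prev), dtype=bool)   # estimation failed
    return inlier_mask.ravel() == 0                   # True -> likely dynamic/outlier
```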
4 Salient feature selection
We distinguish between salient features and landmarks, since some articles treat the two terms interchangeably. According to Frintrop and Jensfelt (2008), a landmark is a region in the real world described by its 3D position and appearance information. A salient feature, on the other hand, is a region of the image described by its 2D position (on the image) and an