DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes
Berta Bescos, José M. Fácil, Javier Civera and José Neira
Abstract— The assumption of scene rigidity is typical in
SLAM algorithms. Such a strong assumption limits the use
of most visual SLAM systems in populated real-world environ-
ments, which are the target of several relevant applications like
service robotics or autonomous vehicles.
In this paper we present DynaSLAM, a visual SLAM system
that, building on ORB-SLAM2 [1], adds the capabilities of dy-
namic object detection and background inpainting. DynaSLAM
is robust in dynamic scenarios for monocular, stereo and
RGB-D configurations. We are capable of detecting the moving
objects either by multi-view geometry, deep learning or both.
Having a static map of the scene allows inpainting the frame
background that has been occluded by such dynamic objects.
We evaluate our system in public monocular, stereo and
RGB-D datasets. We study the impact of several accuracy/speed
trade-offs to assess the limits of the proposed methodology.
DynaSLAM outperforms the accuracy of standard visual SLAM
baselines in highly dynamic scenarios, and it also estimates
a map of the static parts of the scene, which is essential for
long-term applications in real-world environments.
I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) is a
prerequisite for many robotic applications, for example
collision-free navigation. SLAM techniques jointly estimate
a map of an unknown environment and the robot pose
within it, only from the data streams of its on-board
sensors. The map allows the robot to continually localize
within the same environment without accumulating drift.
This is in contrast to odometry approaches that integrate the
incremental motion estimated within a local window and are
unable to correct the drift when revisiting places.
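The drift behaviour described above can be illustrated with a one-dimensional toy example (purely illustrative; the step size and noise level are assumptions, not values from the paper): integrating noisy incremental motions makes the error variance grow with the number of steps, whereas a map allows re-anchoring the estimate when a place is revisited.

```python
import random

random.seed(0)

# Odometry integrates noisy incremental motions; the accumulated
# error grows roughly as sigma * sqrt(n), i.e., without bound.
true_step = 1.0     # assumed ground-truth step length
noise_sigma = 0.05  # assumed per-step measurement noise

position_estimate = 0.0
true_position = 0.0
for _ in range(1000):
    true_position += true_step
    position_estimate += true_step + random.gauss(0.0, noise_sigma)

drift = abs(position_estimate - true_position)
print(drift)  # on the order of noise_sigma * sqrt(1000), i.e. a few percent
```

A SLAM system, in contrast, can reset this accumulated error by localizing against its map when revisiting a known place.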
Visual SLAM, where the main sensor is a camera, has
received a high degree of attention and research efforts over
the last years. The minimalistic solution of a monocular cam-
era has practical advantages with respect to size, power and
cost, but it also poses several challenges, such as the unobservability
of the scale and the difficulty of state initialization. By using more complex
setups, like stereo or RGB-D cameras, these issues are solved
and the robustness of visual SLAM systems can be greatly
improved.
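The scale ambiguity mentioned above disappears with a calibrated stereo pair, since metric depth follows directly from disparity. A minimal sketch (the focal length and baseline values are illustrative assumptions, not taken from the paper):

```python
# Depth from stereo disparity: z = f * b / d, where f is the focal
# length in pixels, b the baseline in meters and d the disparity in
# pixels. A single monocular view cannot recover z without extra
# information, which is the source of the scale unobservability.

def stereo_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth of a point observed with the given disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Example with assumed calibration: f = 700 px, b = 0.12 m.
z = stereo_depth(disparity_px=10.0, focal_px=700.0, baseline_m=0.12)
print(round(z, 2))  # 8.4 (meters)
```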
The research community has addressed SLAM from
many different angles. However, the vast majority of the
approaches and datasets assume a static environment. As
This work has been supported by NVIDIA Corporation through the
donation of a Titan X GPU, by the Spanish Ministry of Economy and
Competitiveness (projects DPI2015-68905-P and DPI2015-67275-P, FPI
grant BES-2016-077836), and by the Aragón regional government (Grupo
DGA T04-FSE).
Berta Bescos, José M. Fácil, Javier Civera and José Neira
are with the Instituto de Investigación en Ingeniería de
Aragón (I3A), Universidad de Zaragoza, Zaragoza 50018, Spain
{bbescos,jmfacil,jcivera,jneira}@unizar.es
(a) Input RGB-D frames with dynamic content.
(b) Output RGB-D frames. Dynamic content has been removed. Occluded
background has been reconstructed with information from previous views.
(c) Map of the static part of the scene, after removal of the dynamic objects.
Fig. 1: Overview of DynaSLAM results for the RGB-D case.
a consequence, they can only handle small fractions of
dynamic content by classifying it as outliers to the static
model. Although the static assumption holds for some robotic
applications, it limits the applicability of visual SLAM in
many relevant cases, such as intelligent autonomous systems
operating in populated real-world environments over long
periods of time.
Visual SLAM methods can be classified into feature-based methods
[2], [3], which rely on matching salient points and can only esti-
mate a sparse reconstruction; and direct methods [4], [5], [6],
which can in principle estimate a completely dense
reconstruction by directly minimizing the photometric
error, often with TV regularization. Some direct methods focus on
high-gradient areas, estimating semi-dense maps [7], [8].
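The photometric error minimized by direct methods can be sketched with a toy example. Here the target image is assumed to be already warped into the reference frame; in a real system this warp depends on the estimated camera pose and per-pixel depth, which are the quantities being optimized (the function and image values below are illustrative, not from any cited system):

```python
def photometric_error(ref, warped):
    """Sum of squared intensity differences between a reference image
    and a target image already warped into the reference frame.
    Images are given as lists of rows of scalar intensities.
    Direct methods minimize this quantity over pose (and depths)."""
    return sum((r - w) ** 2
               for ref_row, warp_row in zip(ref, warped)
               for r, w in zip(ref_row, warp_row))

ref = [[10, 20], [30, 40]]
print(photometric_error(ref, ref))                    # 0: perfect alignment
print(photometric_error(ref, [[11, 21], [31, 41]]))   # 4: each pixel off by 1
```

Feature-based methods instead minimize a geometric reprojection error over a sparse set of matched keypoints, which is why they yield only sparse reconstructions.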
None of the above methods, which constitute the state of the
art, addresses the very common problem of dynamic objects
in the scene, e.g., people walking, bicycles or cars. Detecting
and dealing with dynamic objects in visual SLAM poses
several challenges for both mapping and tracking, including:
1) How to detect such dynamic objects in the images to:
a) Prevent the tracking algorithm from using
matches that belong to dynamic objects.
arXiv:1806.05620v2 [cs.CV] 15 Aug 2018