Real-time large scale dense RGB-D SLAM with volumetric fusion
Thomas Whelan, Michael Kaess, Hordur Johannsson, Maurice Fallon, John J. Leonard and John McDonald
Abstract
We present a new SLAM system capable of producing high quality globally consistent surface reconstructions over hundreds
of metres in real-time with only a low-cost commodity RGB-D sensor. By using a fused volumetric surface reconstruction we
achieve a much higher quality map than would be achieved using raw RGB-D point clouds. In this paper we highlight three
key techniques associated with applying a volumetric fusion-based mapping system to the SLAM problem in real-time. First, the
use of a GPU-based 3D cyclical buffer trick to efficiently extend dense every-frame volumetric fusion of depth maps to function
over an unbounded spatial region. Second, overcoming camera pose estimation limitations in a wide variety of environments by
combining both dense geometric and photometric camera pose constraints. Third, efficiently updating the dense map according
to place recognition and subsequent loop closure constraints by the use of an “as-rigid-as-possible” space deformation. We
present results on a wide variety of aspects of the system and show through evaluation on de facto standard RGB-D benchmarks
that our system performs strongly in terms of trajectory estimation, map quality and computational performance in comparison
to other state-of-the-art systems.
Keywords: volumetric fusion, camera pose estimation, dense methods, large scale, real-time, RGB-D, SLAM, GPU
1 Introduction
The ability for a robot to create a map of an unknown environ-
ment and localise within that map is of extreme importance in
intelligent autonomous operation. Simultaneous Localisation
and Mapping (SLAM) has been one of the large focuses of
robotics research over the last two decades, with 3D mapping
becoming more and more popular within the last few years
over traditional 2D laser scan SLAM. The recent explosion
in full dense 3D SLAM is arguably a result of the release of
the Microsoft Kinect commodity RGB-D sensor, which pro-
vides high quality depth sensing capabilities for a little over
one hundred US dollars. Before the advent of the Kinect, 3D
SLAM methods required either time of flight (TOF) sensors,
3D LIDAR scanners or stereo vision, which were typically
either quite expensive or not suitable for fully mobile real-
time operation if dense reconstruction was desired. Another
recent technology which is often coupled with dense methods
is General-Purpose computing on Graphics Processing Units
(GPGPU), which exploits the massive parallelism available in
GPU hardware to perform high speed and often real-time pro-
cessing on entire images every frame. Being an affordable
commodity technology, GPU-based programming is arguably
another large enabler in recent dense SLAM research.

T. Whelan and J. McDonald are with the Department of Computer Science, National University of Ireland Maynooth, Co. Kildare, Ireland. thomas.j.whelan@nuim.ie, johnmcd@cs.nuim.ie
M. Kaess is with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. kaess@cmu.edu
H. Johannsson, M. Fallon and J. Leonard are with the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA. {hordurj,mfallon,jleonard}@mit.edu
This work was presented in part at the Robotics Science and Systems RGB-D Workshop, Sydney, July 2012 (Whelan et al. (2012)), in part at the International Conference on Robotics and Automation, Karlsruhe, May 2013 (Whelan et al. (2013a)) and in part at the International Conference on Intelligent Robots and Systems, Japan, November 2013 (Whelan et al. (2013b)).
Many visual SLAM systems and 3D reconstruction sys-
tems (both offline and online) have been published in recent
times that rely purely on RGB-D sensing capabilities because
of the Kinect’s low price and accuracy; Henry et al. (2012);
Endres et al. (2012); Stückler and Behnke (2013). The KinectFusion algorithm of Newcombe et al. (2011) is one of the
most notable RGB-D-based 3D reconstruction systems of re-
cent times, allowing real-time volumetric dense reconstruc-
tion of a desk sized scene at sub-centimetre resolution. By
fusing many individual depth maps together into a single vol-
umetric reconstruction, the models that are obtained are of
much higher quality than typical noisy single-shot raw RGB-
D point clouds. KinectFusion enables reconstructions of an
unprecedented quality at real-time speeds but comes with a
number of limitations, namely 1) restriction to a fixed small
area in space; 2) reliance on geometric information alone for
camera pose estimation; and, 3) no means of explicitly incor-
porating loop closures. These three limitations severely limit
the applicability of KinectFusion to the large scale SLAM
problem where it is desirable due to its real-time nature and
very high surface reconstruction fidelity.
In this paper we present solutions to the three aforemen-
tioned limitations such that the system can be used in a full
real-time large scale SLAM setting. We address the three
limitations respectively by 1) representing the volumetric re-
construction data structure in memory with a rolling cyclical
buffer; 2) estimating a dense photometric camera constraint
in conjunction with a dense geometric constraint and jointly
optimising for a camera pose estimate; and, 3) optimising the
dense map by means of a non-rigid space deformation param-
eterised by a loop closure constraint. In the remainder of this
section we provide a discussion on the existing work related
to the area of dense RGB-D SLAM. Following on from this
Sections 2, 3 & 4 address the issues of extended scale volu-
metric fusion, camera pose estimation, and loop closure, respec-
tively. Section 5 provides a comprehensive qualitative and
quantitative evaluation of the system using multiple bench-
mark datasets and finally Section 6 presents conclusions on
the work and future directions of our research.
1.1 Related Work
A large number of publications have been made over the last
few years specifically using RGB-D data for camera pose es-
timation, dense mapping and full SLAM pipelines. While
many visual SLAM systems existed prior to the advent of
active RGB-D sensors (e.g. Comport et al. (2007)), we will
focus mainly on the literature which makes specific use of
active RGB-D platforms. One of the earliest RGB-D track-
ing and mapping systems uses FAST feature correspondences
between frames for visual odometry and offloads dense point
cloud map building to a post-processing step utilising sparse
bundle adjustment (SBA) for global consistency by minimiz-
ing feature reprojection error (Huang et al. (2011)). One of the
first real-time dense RGB-D tracking and mapping systems
estimates an image warping function with both geometric and
photometric information to compute a camera pose estimate,
however only relies on rigid reprojection for point cloud map
reconstruction without using a method for global consistency
(Audras et al. (2011)). Similar work on dense RGB-D cam-
era tracking was done by Steinbrücker et al. (2011), also es-
timating an image warping function based on geometric and
photometric information. Recent work by Kerl et al. (2013)
presents a more robust dense photometrics-based RGB-D vi-
sual odometry system that proposes a t-distribution-based er-
ror model which more accurately matches the residual error
between RGB-D frames in scenes that are not entirely static.
Henry et al. (2012) presented one of the first full SLAM
systems based entirely upon RGB-D data, using visual feature
matching with Generalised Iterative Closest Point (GICP) to
build up a pose graph, followed by an optimised surfel
map of the area explored. The use of pose graph optimisa-
tion versus SBA is studied, minimising feature reprojection
error in an offline rigid transformation framework. Visual fea-
ture correspondences are used in conjunction with pose graph
optimisation in the RGB-D SLAM system of Endres et al.
(2012). An octree-based volumetric representation is used to
store the map, created by reprojecting all point measurements
into the global frame. This map representation is provided
by the OctoMap framework of Hornung et al. (2013), which
includes the ability to take measurement uncertainties into ac-
count and implicitly represent free and occupied space while
being space efficient. An explicit voxel volumetric occupancy
representation is used by Pirker et al. (2011) in their GPSlam
system which uses sparse visual feature correspondences for
camera pose estimation. They make use of visual place recog-
nition and sliding window bundle adjustment in a pose graph
optimisation framework. To achieve global consistency the
occupancy grid is “morphed” by a weighted average of the
log-odds perceptions of each camera for each voxel. Stückler
and Behnke (2013) register surfel maps together for camera
pose estimation and store a multi-resolution surfel map in an
octree, using pose graph optimisation for global consistency.
After pose graph optimisation is complete a globally consis-
tent map is created by fusing key views together. In recent
work Hu et al. (2012) proposed a system that uses bundle ad-
justment in order to make use of pixels for which no valid
depth exists, and Lee et al. (2012) presented a system which
exploits GPU processing power for real-time camera tracking.
Both systems produce an optimised map as a final step in the
process.
A substantial number of derived works have been published
since the advent of the KinectFusion system of Newcombe
et al. (2011), mostly focused on extending the range
of operation, with other related work on object recognition
and motion planning (Karpathy et al. (2013); Wagner et al.
(2013)). Recent work by Bylow et al. (2013) and Canelhas
et al. (2013) directly tracks the camera pose against the accu-
mulated volumetric model by exploiting the fact that the trun-
cated signed distance function (TSDF) representation used by
KinectFusion stores the signed distance to the closest surface
at voxels near the surface. This avoids the need to raycast a
vertex map for each frame to perform camera pose estima-
tion, which potentially discards information about the surface
reconstruction.
Roth and Vona (2012) extend the operational range of
KinectFusion by using a double buffering mechanism to map
between volumetric models upon camera translation and ro-
tation, using a voxel interpolation for the latter. However no
method for recovering the map is provided. Zeng et al. (2012)
replace the explicit voxel representation used by KinectFusion
with an octree representation which allows mapping of areas
up to 8m×8m×8m in size. However this method does increase
the chance for drift within the map and provides no means of
loop closure or map correction. Steinbrücker et al. (2013)
make use of a multi-scale octree to represent the signed dis-
tance function, allowing full color reconstructions of scenes
as large as an entire corridor including nine rooms spanning
a total area of 45m×12m×3.4m. After an RGB-D sequence
has been processed, a globally consistent camera trajectory is
resolved and the model is reconstructed. Keller et al. (2013)
present an extended fusion system made space efficient by us-
ing a point-based surfel representation, although lacking in
drift correction or loop closure detection. Chen et al. (2013)
present a novel hierarchical data structure that enables ex-
tremely space efficient volumetric fusion, using a streaming
framework allowing effectively unbounded mapping range,
limited only by available memory. However the system lacks
any method for mitigating drift or enforcing global consis-
tency. Nießner et al. (2013) present an alternative space effi-
cient method for large scale dense fusion that uses an intelli-
gent voxel hashing function to minimise the amount of mem-
ory required for reconstruction, but again without a means of
correcting for drift.
An alternative approach to the modern SLAM problem is
introduced by Salas-Moreno et al. (2013), whereby known ob-
jects are detected, tracked and mapped in real-time in a dense
RGB-D framework. Pose graph optimisation is used to en-
sure global consistency on the level of camera poses and de-
tected object positions. This does allow loop closure; however,
less emphasis is placed on full scene reconstruction, with
only point cloud reprojections being used for mapped loop
closure. Recent work by Henry et al. (2013b) uses multiple
smaller “patch volumes” to segment the mapped space into a
set of discrete TSDFs, each with a 6-degrees-of-freedom (6-
DOF) pose which is rigidly optimised upon loop closure de-
tection. This approach can be seen as similar to the SLAM++
approach of Salas-Moreno et al. (2013) whereby the patch
volumes are analogous to objects. While achieving global
consistency between each volume, there is no clear solution
presented for correcting the surface within any one given vol-
ume or stitching surfaces which are split between volumes,
leaving local surfaces disconnected.
Zhou et al. (2013) present an impressive method for re-
constructing 3D scenes that specifically targets the high-
frequency noise and low-frequency distortion effects often en-
countered with RGB-D data. By reconstructing fragments
of the scene, which are then aligned and deformed, very high
quality reconstructions can be obtained, albeit in
a strictly offline framework. Similar work also by Zhou and
Koltun (2013) presents a method which detects points of inter-
est in a scene and specifically optimises the camera trajectory
to preserve detailed geometry around these points, within an
offline framework.
A number of approaches that rely on keyframes have been
developed to tackle the problem of RGB-D mapping and
SLAM. Tykkälä et al. (2013) present a system which uses
real-time dense photometric keyframe-based camera track-
ing to determine a camera trajectory around an indoor envi-
ronment. Individual RGB-D frames are also fused into ex-
isting keyframes to improve reconstruction quality. An op-
tional bundle adjustment step can then be taken to optimise
the camera poses before a watertight Poisson mesh recon-
struction is computed as a post-processing step. Meilland
and Comport (2013) propose a model that unifies the benefits
of a dense voxel-based representation with a keyframe rep-
resentation allowing high quality dense mapping over large-
scales, although without detecting large loop closures or cor-
recting for drift. An intelligent forward composition approach
is proposed which enables efficient combination of reference
images to create a single predicted frame without repeated
redundant image warps. In our work we chose to avoid a
keyframe approach in spite of the resulting higher memory
requirement. A fully 3D voxel-based method removes the
need to implement specific schemes to overcome the prob-
lems associated with reconstructing complex non-concave ob-
jects and non-convex scenes.
As discussed, there exist a large number of systems utilising
RGB-D data for SLAM and related problems. However,
most are unable either to operate in real-time, to provide
an up-to-date optimised representation of the map whenever
it is requested, or to efficiently incorporate large non-
rigid updates to the map. Non-rigid surface correction is of
great interest specifically in the realm of volumetric fusion as
typically reconstructions are locally highly accurate but drift
slowly over large scales over time, where a smooth continu-
ous deformation of the surface is most suitable for correction.
In the following sections we will fully describe our approach
to RGB-D SLAM with volumetric fusion which is capable of
functioning in real-time over large scale trajectories, while ef-
ficiently applying non-rigid updates to the dense map upon
loop closure to ensure global consistency.
To clarify our definition of “real-time”: there is no of-
fline step involved in our pipeline and multiple loops can be
closed immediately as they occur during the mapping process
(shown in Multimedia Extension 2). Our system architec-
ture can be compared to that of PTAM (Klein and Murray
(2007)), whereby camera tracking and mapping run in sepa-
rate threads. While the camera tracking component runs at
frame rate in one thread, the mapping component is freed
from the computational burden of updating the map for ev-
ery frame and instead occasionally receives information from
the tracking thread to update the map for consistency.
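To make this architecture concrete, the following is a minimal C++ sketch of the two-thread split, with hypothetical types and no real tracking or mapping logic: the tracking loop runs per frame and hands occasional updates to the mapping thread through a mutex-protected queue, keeping the mapper off the per-frame critical path.

```cpp
// Minimal sketch of the tracking/mapping thread split (hypothetical types).
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>

struct TrackedFrame { int id = 0; /* pose estimate, extracted slice, ... */ };

std::queue<TrackedFrame> pending;  // tracking -> mapping hand-off
std::mutex pendingMutex;
std::atomic<bool> running{true};

void trackingThread() {
    for (int frame = 0; frame < 100; ++frame) {  // stand-in for the frame-rate loop
        TrackedFrame f{frame};                   // camera pose estimated here, per frame
        std::lock_guard<std::mutex> lock(pendingMutex);
        pending.push(f);                         // occasionally notify the mapper
    }
    running = false;
}

void mappingThread() {
    for (;;) {
        TrackedFrame f;
        bool haveWork = false;
        {
            std::lock_guard<std::mutex> lock(pendingMutex);
            if (!pending.empty()) {
                f = pending.front();
                pending.pop();
                haveWork = true;
            } else if (!running) {
                return;  // tracker finished and queue drained
            }
        }
        if (haveWork) {
            // Update the map for consistency using f, off the per-frame path.
        }
    }
}

int main() {
    std::thread tracker(trackingThread), mapper(mappingThread);
    tracker.join();
    mapper.join();
}
```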
This paper brings together work presented in our three pre-
vious publications Whelan et al. (2012), Whelan et al. (2013a)
and Whelan et al. (2013b). In this paper we provide a num-
ber of additions to that work including a method for improv-
ing camera-frustum overlap for greater reconstruction range
(Section 2.4) and a means of reducing the amount of informa-
tion required to perform map deformation, increasing compu-
tational performance (Section 5.3.2). Most significantly, this
paper presents an extensive evaluation of the system not
included in any previous work, including both quali-
tative and quantitative evaluation of trajectory estimation per-
formance, surface reconstruction quality and computational
performance.
Please note that any provided sample parameter and threshold
values are those which were used for all experiments in this
paper and are provided as an aid to those who wish to re-
implement any aspect of this work.
2 Extended Scale Volumetric Fusion
In this section we will provide some background on the us-
age of volumetric fusion for dense RGB-D-based tracking
and mapping and describe our extension to KinectFusion, the
most widely cited system that employs this approach, to allow
spatially extended mapping.

Figure 1: Two dimensional example of the structure of the truncated signed distance function representation of an implicit surface. Shown are example signed distance values stored at voxels within the truncation distance of the observed surface, with rays cast from the observing sensor.
2.1 Background
Real-time volumetric fusion with RGB-D cameras was
brought to the forefront by Newcombe et al. (2011) with the
KinectFusion system. A significant component of the system
is the cyclical pipeline used for camera tracking and scene
mapping, whereby full depth maps are fused into a volumet-
ric data structure (TSDF), which is then raycast to produce
a predicted surface that the subsequently captured depth map
is matched against using ICP. The truncated signed distance
function (TSDF) is a volumetric data structure that encodes
implicit surfaces by storing the signed distance to the closest
surface at each voxel up to a given truncation distance from
the actual surface position. Points at which the sign of the
distance value changes are known as zero crossings, which
represent the actual position of the surface, shown in Figure
1. Each voxel also stores a weight for the distance measure-
ment at that point, effectively providing a moving average of
the surface position. In the case of KinectFusion, the TSDF
is stored as a three dimensional voxel grid in GPU memory
where dense depth map integration is accomplished by sweep-
ing through the volume and updating distance measurements
accordingly, while surface raycasting is carried out by simply
projecting rays from the current camera pose and returning the
depth and surface normals at the first zero crossings encoun-
tered. Surface normals are easily computed by taking the fi-
nite difference around a given position within the TSDF, as ex-
ploited by Bylow et al. (2013) and Canelhas et al. (2013). The
entire process is very amenable to parallelisation and greatly
benefits in execution time from being implemented on a GPU
(Newcombe et al. (2011)). A point to note is that the TSDF
representation has a minimal surface thickness limitation im-
posed by the selected truncation distance. This problem was
highlighted and explored by Henry et al. (2013a) in their work
on multiple fusion volumes.

Figure 2: Visualisation of the volume shifting process for spatially extended mapping; (i) The camera motion exceeds the movement threshold $m_s$ (direction of camera motion shown by the black arrow); (ii) Volume slice leaving the volume (red) is raycast along all three axes to extract surface points and reset to free space; (iii) The raycast surface is extracted as a point cloud and fed into the Greedy Projection Triangulation (GPT) algorithm of Marton et al. (2009); (iv) New region of space (blue) enters the volume and is integrated using new modulo addressing of the volume.
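To make the raycasting step described above concrete, below is a simplified single-ray sketch, assuming a hypothetical tsdfAt() lookup (here a unit-sphere signed distance stands in for a sampled voxel grid). Marching along the ray, a sign change between consecutive samples marks a zero crossing, which is linearly interpolated for a sub-voxel depth estimate; surface normals would follow from finite differences around the hit point.

```cpp
// Simplified sketch of raycasting a TSDF along one ray: march until the
// signed distance changes sign, then linearly interpolate the zero crossing.
#include <cmath>
#include <optional>

// Stand-in sampled signed distance: a unit sphere at the origin. Real code
// samples the voxel grid, typically with trilinear interpolation.
float tsdfAt(float x, float y, float z) {
    return std::sqrt(x * x + y * y + z * z) - 1.0f;
}

std::optional<float> raycastDepth(const float o[3], const float d[3],
                                  float maxDepth, float step) {
    float prev = tsdfAt(o[0], o[1], o[2]);
    for (float t = step; t < maxDepth; t += step) {
        float cur = tsdfAt(o[0] + t * d[0], o[1] + t * d[1], o[2] + t * d[2]);
        if (prev > 0.0f && cur < 0.0f) {
            // Sign change between samples: the surface lies in between.
            // Linear interpolation gives a sub-voxel depth estimate.
            return t - step * cur / (cur - prev);
        }
        prev = cur;
    }
    return std::nullopt;  // no zero crossing along this ray
}
```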
2.2 Volume Representation
Defining the voxel space domain as $\Psi \subset \mathbb{N}^3$, the TSDF volume
$S$ at some location $s \in \Psi$ has the mapping $S(s) : \Psi \to \mathbb{R} \times \mathbb{N} \times \mathbb{N}^3$. Within GPU memory the TSDF is represented as a 3D
array of voxels. Each voxel contains a signed distance value
($S(s)_T$, truncated float16), an unsigned weight value ($S(s)_W$,
unsigned int8) and a byte for each color component R, G and
B ($S(s)_R$, $S(s)_G$, $S(s)_B$) for a total of 6 bytes per voxel. The
integration of new surface measurements is carried out in a
similar fashion to Newcombe et al. (2011): when integrating
a new signed distance function measurement $S(s)_{T_i}$ during the
fusion of a new depth map, each voxel $s \in \Psi$ at time $i$ is
updated with:

$$S(s)_{T'_i} = \frac{S(s)_{W_{i-1}}\, S(s)_{T_{i-1}} + S(s)_{W_i}\, S(s)_{T_i}}{S(s)_{W_{i-1}} + S(s)_{W_i}} \qquad (1)$$

$$S(s)_{W'_i} = \min(S(s)_{W_{i-1}} + S(s)_{W_i}, \text{max\_weight}) \qquad (2)$$
As is the case with previous approaches, we take $S(s)_{W_i} = 1$
to provide a simple moving average, and set $\text{max\_weight}$ to
128. Bylow et al. (2013) have experimented with different
weighting schemes, however we have found the original value
of 1 used by Newcombe et al. (2011) to provide good per-
formance. Using only a cubic volume, we parameterise the
TSDF by the side length in voxels $v_s$ and the dimension in
metres $v_d$. Both of these parameters control the resolution of
the reconstruction along with the size of the immediate “active
area” of reconstruction. In all experiments in this paper we set
$v_s = 512$ for total GPU memory usage of 768MB. The 6-DOF
camera pose within the TSDF at time $i$ is denoted as $P^T_i$, com-
posed of a rotation $R^T_i \in SO_3$ and a translation $t^T_i \in \mathbb{R}^3$. The
origin of the TSDF coordinate system is positioned at the cen-
ter of the volume with basis vectors aligned with the axes of
the TSDF. Initially $R^T_0 = I$ and $t^T_0 = (0, 0, 0)^\top$. The position
of the TSDF volume in voxel units in the global frame is ini-
tialised to be $g_0 = (0, 0, 0)^\top$. Note that the superscript $T$ refers
to the TSDF pose and not the transpose $\top$ operator.

Figure 3: Visualisation of the interaction between the movement threshold $m_s$ and the shifting process. Between frames 0 and 1 the camera does not cross the movement boundary (dark brown) and no shift occurs. At frame 2, the pose crosses the boundary and causes a volume shift, recentering the volume (teal) around $P^T_2$ and updating $g_2$. The underlying voxel grid quantisation is shown in light dashed lines.
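As an illustrative sketch of the voxel layout and the weighted running average of Equations (1) and (2), consider the following (not the original GPU implementation): the int16_t distance field is a CPU stand-in for the truncated float16, and the per-measurement weight of 1 and the cap of 128 follow the text.

```cpp
// Sketch of the 6-byte voxel and the running-average fusion of Eqs. (1)-(2).
#include <algorithm>
#include <cstdint>

struct Voxel {
    int16_t tsdf;     // S(s)_T: signed distance (stand-in for GPU float16)
    uint8_t weight;   // S(s)_W: measurement weight
    uint8_t r, g, b;  // colour components
};                    // 6 bytes per voxel, matching the layout above

constexpr int kMaxWeight = 128;  // max_weight from the text

// Fuse one new signed distance measurement with per-measurement weight 1,
// following Equations (1) and (2). 'scale' converts int16 <-> metric units.
void fuse(Voxel& v, float newTsdf, float scale) {
    float oldTsdf = v.tsdf * scale;
    int wOld = v.weight, wNew = 1;
    // Equation (1): weighted average of old and new distance values.
    float fused = (wOld * oldTsdf + wNew * newTsdf) / float(wOld + wNew);
    v.tsdf = static_cast<int16_t>(fused / scale);
    // Equation (2): accumulate the weight, capped at max_weight.
    v.weight = static_cast<uint8_t>(std::min(wOld + wNew, kMaxWeight));
}
```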
2.3 Volume Shifting
Unlike Newcombe et al. (2011), camera pose estimation and
surface reconstruction are not restricted to only the region
around which the TSDF was initialised. By employing mod-
ulo arithmetic in how the TSDF volume is addressed in GPU
memory we can treat the structure like a cyclical buffer which
virtually translates as the camera moves through an environ-
ment. Figure 2 provides a visual example and description of
the shifting process. It is parameterised by an integer move-
ment threshold $m_s$, defining the cubic movement boundary (in
voxels) around $g_i$ which, upon crossing, causes a volume shift,
shown in Figure 3. Discussion on the choice of value for $m_s$ is
provided in Section 5.3. Each dimension is treated indepen-
dently during a shift. When a shift is triggered, the TSDF is
virtually translated about the camera pose (in voxel units) to
bring the camera's position to within one voxel of $g_{i+1}$. The
new pose of the camera $P^T_{i+1}$ has no change in rotation, while
the shift corrected camera position ${t'}^T_{i+1}$ is calculated from $t^T_{i+1}$
by first computing the number of voxel units crossed:

$$u = \frac{v_s\, t^T_{i+1}}{v_d} \qquad (3)$$

And then shifting the pose while updating the global position
of the TSDF:

$${t'}^T_{i+1} = t^T_{i+1} - \frac{v_d\, u}{v_s} \qquad (4)$$

$$g_{i+1} = g_i + u \qquad (5)$$
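The per-axis shift computation of Equations (3) to (5) can be sketched as follows; vs and vd are the parameters defined in Section 2.2, and the truncation of u to an integer is an assumption, since the text describes u as a count of whole voxel units but does not specify the rounding.

```cpp
// Sketch of the volume shift of Equations (3)-(5), one axis at a time.
#include <array>

struct ShiftResult {
    std::array<float, 3> tCorrected;  // shift-corrected camera position t'^T_{i+1}
    std::array<int, 3> gNext;         // updated global volume position g_{i+1}
};

ShiftResult shiftVolume(const std::array<float, 3>& t,  // t^T_{i+1}
                        const std::array<int, 3>& g,    // g_i
                        int vs, float vd) {
    ShiftResult out{};
    for (int axis = 0; axis < 3; ++axis) {
        // Equation (3): voxel units crossed (truncation is an assumption).
        int u = static_cast<int>(vs * t[axis] / vd);
        // Equation (4): shift the camera position back into the volume.
        out.tCorrected[axis] = t[axis] - vd * u / vs;
        // Equation (5): advance the volume's global position.
        out.gNext[axis] = g[axis] + u;
    }
    return out;
}
```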
Figure 4: Two dimensional visualisation of the association between extracted cloud slices, the camera poses and the TSDF volume. Note that the camera poses here are in global coordinates rather than internal TSDF coordinates. A red dashed line links camera poses with extracted slices of the TSDF volume ($P_\gamma$, $P_\beta$ and $P_\alpha$ with $C_2$, $C_1$ and $C_0$ respectively). The large triangles represent camera poses that caused volume shifts while the small black squares represent those that did not.
2.3.1 Implementation
There are two parts of volumetric fusion as described by New-
combe et al. (2011) that require indexed access to the TSDF
volume; 1) Volume Integration and 2) Volume Raycasting.
Referring again to Figure 2, the new surface measurements
shown in blue can be integrated into the memory previously
used for the old surface contained within the red region of the
TSDF by ensuring all element look ups in the 3D block of
GPU memory reflect the virtual voxel translation computed
in Equation 5. Assuming row major memory ordering, an el-
ement in the unshifted cubic 3D voxel grid can be found at the
1D memory location a given by:
$$a = (x + y\,v_s + z\,v_s^2) \qquad (6)$$
The volume’s translation can be reflected in how the TSDF
is addressed for integration and raycasting by substituting the
indices in Equation 6 with values that are offset by the current
global position of the TSDF and bound within the dimensions
of the voxel grid using the modulus operator:
$$x' = (x + g_{i_x}) \bmod v_s \qquad (7)$$

$$y' = (y + g_{i_y}) \bmod v_s \qquad (8)$$

$$z' = (z + g_{i_z}) \bmod v_s \qquad (9)$$

$$a = (x' + y'\,v_s + z'\,v_s^2) \qquad (10)$$
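A sketch of this cyclical addressing, following Equations (7) to (10), is given below. The wrap() helper is an addition: C++'s % operator can return negative results for negative operands, so a positive modulus is used to keep indices inside the grid; the rest is a direct transcription of the equations.

```cpp
// Cyclical buffer addressing: the volume's global position g offsets every
// lookup, and the modulus wraps indices back into the fixed voxel grid, so
// no voxel data ever has to move in memory when the volume shifts.
#include <cstddef>

// Positive modulus (C++'s % can be negative for negative input).
inline int wrap(int i, int n) { return ((i % n) + n) % n; }

// Row-major 1D address of voxel (x, y, z) in a vs^3 grid shifted to global
// position g, following Equations (7)-(10).
inline std::size_t voxelAddress(int x, int y, int z, const int g[3], int vs) {
    int xp = wrap(x + g[0], vs);  // Equation (7)
    int yp = wrap(y + g[1], vs);  // Equation (8)
    int zp = wrap(z + g[2], vs);  // Equation (9)
    return std::size_t(xp) + std::size_t(yp) * vs
         + std::size_t(zp) * vs * vs;  // Equation (10)
}
```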
The original KinectFusion work of Newcombe et al. (2011)
benefits greatly from memory caching and pipelining func-
tionality within GPU memory to achieve high computational
performance within the integration step. In our implementa-
tion we have found that use of a cyclical addressing method