Real-time large scale dense RGB-D SLAM with volumetric fusion
Thomas Whelan, Michael Kaess, Hordur Johannsson, Maurice Fallon, John J. Leonard and John McDonald
Abstract
We present a new SLAM system capable of producing high quality globally consistent surface reconstructions over hundreds
of metres in real-time with only a low-cost commodity RGB-D sensor. By using a fused volumetric surface reconstruction we
achieve a much higher quality map than would be achieved using raw RGB-D point clouds. In this paper we highlight three
key techniques associated with applying a volumetric fusion-based mapping system to the SLAM problem in real-time. First, the
use of a GPU-based 3D cyclical buffer trick to efficiently extend dense every-frame volumetric fusion of depth maps to function
over an unbounded spatial region. Second, overcoming camera pose estimation limitations in a wide variety of environments by
combining both dense geometric and photometric camera pose constraints. Third, efficiently updating the dense map according
to place recognition and subsequent loop closure constraints by the use of an “as-rigid-as-possible” space deformation. We
present results on a wide variety of aspects of the system and show through evaluation on de facto standard RGB-D benchmarks
that our system performs strongly in terms of trajectory estimation, map quality and computational performance in comparison
to other state-of-the-art systems.
Keywords: volumetric fusion, camera pose estimation, dense methods, large scale, real-time, RGB-D, SLAM, GPU
1 Introduction
The ability for a robot to create a map of an unknown environ-
ment and localise within that map is of extreme importance in
intelligent autonomous operation. Simultaneous Localisation
and Mapping (SLAM) has been one of the large focuses of
robotics research over the last two decades, with 3D mapping
becoming more and more popular within the last few years
over traditional 2D laser scan SLAM. The recent explosion
in full dense 3D SLAM is arguably a result of the release of
the Microsoft Kinect commodity RGB-D sensor, which pro-
vides high quality depth sensing capabilities for a little over
one hundred US dollars. Before the advent of the Kinect, 3D
SLAM methods required either time of flight (TOF) sensors,
3D LIDAR scanners or stereo vision, which were typically
either quite expensive or not suitable for fully mobile real-
time operation if dense reconstruction was desired. Another
recent technology which is often coupled with dense methods
is General-Purpose computing on Graphics Processing Units
(GPGPU), which exploits the massive parallelism available in
GPU hardware to perform high speed and often real-time pro-
cessing on entire images every frame. Being an affordable
commodity technology, GPU-based programming is arguably
another large enabler in recent dense SLAM research.

T. Whelan and J. McDonald are with the Department of Computer Science, National University of Ireland Maynooth, Co. Kildare, Ireland. thomas.j.whelan@nuim.ie, johnmcd@cs.nuim.ie
M. Kaess is with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. kaess@cmu.edu
H. Johannsson, M. Fallon and J. Leonard are with the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA. {hordurj,mfallon,jleonard}@mit.edu
This work was presented in part at the Robotics Science and Systems RGB-D Workshop, Sydney, July 2012 (Whelan et al. (2012)), in part at the International Conference on Robotics and Automation, Karlsruhe, May 2013 (Whelan et al. (2013a)) and in part at the International Conference on Intelligent Robots and Systems, Japan, November 2013 (Whelan et al. (2013b)).
Many visual SLAM systems and 3D reconstruction sys-
tems (both offline and online) have been published in recent
times that rely purely on RGB-D sensing capabilities because
of the Kinect’s low price and accuracy; Henry et al. (2012);
Endres et al. (2012); Stückler and Behnke (2013). The KinectFusion algorithm of Newcombe et al. (2011) is one of the
most notable RGB-D-based 3D reconstruction systems of re-
cent times, allowing real-time volumetric dense reconstruc-
tion of a desk sized scene at sub-centimetre resolution. By
fusing many individual depth maps together into a single vol-
umetric reconstruction, the models that are obtained are of
much higher quality than typical noisy single-shot raw RGB-
D point clouds. KinectFusion enables reconstructions of an
unprecedented quality at real-time speeds but comes with a
number of limitations, namely 1) restriction to a fixed small
area in space; 2) reliance on geometric information alone for
camera pose estimation; and, 3) no means of explicitly incor-
porating loop closures. These three limitations severely limit
the applicability of KinectFusion to the large scale SLAM
problem where it is desirable due to its real-time nature and
very high surface reconstruction fidelity.
In this paper we present solutions to the three aforemen-
tioned limitations such that the system can be used in a full
real-time large scale SLAM setting. We address the three
limitations respectively by 1) representing the volumetric re-
construction data structure in memory with a rolling cyclical
buffer; 2) estimating a dense photometric camera constraint
in conjunction with a dense geometric constraint and jointly
optimising for a camera pose estimate; and, 3) optimising the
dense map by means of a non-rigid space deformation param-
eterised by a loop closure constraint. In the remainder of this
section we provide a discussion on the existing work related
to the area of dense RGB-D SLAM. Following on from this
Sections 2, 3 & 4 address the issues of extended scale volu-
metric fusion, camera pose estimation, and loop closure, respec-
tively. Section 5 provides a comprehensive qualitative and
quantitative evaluation of the system using multiple bench-
mark datasets and finally Section 6 presents conclusions on
the work and future directions of our research.
1.1 Related Work
A large number of publications have been made over the last
few years specifically using RGB-D data for camera pose es-
timation, dense mapping and full SLAM pipelines. While
many visual SLAM systems existed prior to the advent of
active RGB-D sensors (e.g. Comport et al. (2007)), we will
focus mainly on the literature which makes specific use of
active RGB-D platforms. One of the earliest RGB-D track-
ing and mapping systems uses FAST feature correspondences
between frames for visual odometry and offloads dense point
cloud map building to a post-processing step utilising sparse
bundle adjustment (SBA) for global consistency by minimiz-
ing feature reprojection error (Huang et al. (2011)). One of the
first real-time dense RGB-D tracking and mapping systems
estimates an image warping function with both geometric and
photometric information to compute a camera pose estimate,
however only relies on rigid reprojection for point cloud map
reconstruction without using a method for global consistency
(Audras et al. (2011)). Similar work on dense RGB-D cam-
era tracking was done by Steinbrücker et al. (2011), also es-
timating an image warping function based on geometric and
photometric information. Recent work by Kerl et al. (2013)
presents a more robust dense photometrics-based RGB-D vi-
sual odometry system that proposes a t-distribution-based er-
ror model which more accurately matches the residual error
between RGB-D frames in scenes that are not entirely static.
Henry et al. (2012) presented one of the first full SLAM
systems based entirely upon RGB-D data, using visual feature
matching with Generalised Iterative Closest Point (GICP) to
build up a pose graph, followed by an optimised surfel
map of the area explored. The use of pose graph optimisa-
tion versus SBA is studied, minimising feature reprojection
error in an offline rigid transformation framework. Visual fea-
ture correspondences are used in conjunction with pose graph
optimisation in the RGB-D SLAM system of Endres et al.
(2012). An octree-based volumetric representation is used to
store the map, created by reprojecting all point measurements
into the global frame. This map representation is provided
by the OctoMap framework of Hornung et al. (2013), which
includes the ability to take measurement uncertainties into ac-
count and implicitly represent free and occupied space while
being space efficient. An explicit voxel volumetric occupancy
representation is used by Pirker et al. (2011) in their GPSlam
system which uses sparse visual feature correspondences for
camera pose estimation. They make use of visual place recog-
nition and sliding window bundle adjustment in a pose graph
optimisation framework. To achieve global consistency the
occupancy grid is “morphed” by a weighted average of the
log-odds perceptions of each camera for each voxel. Stückler
and Behnke (2013) register surfel maps together for camera
pose estimation and store a multi-resolution surfel map in an
octree, using pose graph optimisation for global consistency.
After pose graph optimisation is complete a globally consis-
tent map is created by fusing key views together. In recent
work Hu et al. (2012) proposed a system that uses bundle ad-
justment in order to make use of pixels for which no valid
depth exists, and Lee et al. (2012) presented a system which
exploits GPU processing power for real-time camera tracking.
Both systems produce an optimised map as a final step in the
process.
A substantial number of derived works have been published
since the advent of the KinectFusion system of Newcombe
et al. (2011), mostly focused on extending the range
of operation, with other related work on object recognition
and motion planning (Karpathy et al. (2013); Wagner et al.
(2013)). Recent work by Bylow et al. (2013) and Canelhas
et al. (2013) directly tracks the camera pose against the accu-
mulated volumetric model by exploiting the fact that the trun-
cated signed distance function (TSDF) representation used by
KinectFusion stores the signed distance to the closest surface
at voxels near the surface. This avoids the need to raycast a
vertex map for each frame to perform camera pose estima-
tion, which potentially discards information about the surface
reconstruction.
Roth and Vona (2012) extend the operational range of
KinectFusion by using a double buffering mechanism to map
between volumetric models upon camera translation and ro-
tation, using a voxel interpolation for the latter. However no
method for recovering the map is provided. Zeng et al. (2012)
replace the explicit voxel representation used by KinectFusion
with an octree representation which allows mapping of areas
up to 8m×8m×8m in size. However this method does increase
the chance for drift within the map and provides no means of
loop closure or map correction. Steinbrücker et al. (2013)
make use of a multi-scale octree to represent the signed dis-
tance function, allowing full color reconstructions of scenes
as large as an entire corridor including nine rooms spanning
a total area of 45m×12m×3.4m. After an RGB-D sequence
has been processed, a globally consistent camera trajectory is
resolved and the model is reconstructed. Keller et al. (2013)
present an extended fusion system made space efficient by us-
ing a point-based surfel representation, although lacking in
drift correction or loop closure detection. Chen et al. (2013)
present a novel hierarchical data structure that enables ex-
tremely space efficient volumetric fusion, using a streaming
framework allowing effectively unbounded mapping range,
limited only by available memory. However the system lacks
any method for mitigating drift or enforcing global consis-
tency. Nießner et al. (2013) present an alternative space effi-
cient method for large scale dense fusion that uses an intelli-
gent voxel hashing function to minimise the amount of mem-
ory required for reconstruction, but again without a means of
correcting for drift.
An alternative approach to the modern SLAM problem is
introduced by Salas-Moreno et al. (2013), whereby known ob-
jects are detected, tracked and mapped in real-time in a dense
RGB-D framework. Pose graph optimisation is used to en-
sure global consistency on the level of camera poses and de-
tected object positions. This does allow loop closure; however,
less emphasis is placed on full scene reconstruction, with
only point cloud reprojections being used for mapped loop
closure. Recent work by Henry et al. (2013b) uses multiple
smaller “patch volumes” to segment the mapped space into a
set of discrete TSDFs, each with a 6-degrees-of-freedom (6-
DOF) pose which is rigidly optimised upon loop closure de-
tection. This approach can be seen as similar to the SLAM++
approach of Salas-Moreno et al. (2013) whereby the patch
volumes are analogous to objects. While achieving global
consistency between each volume, there is no clear solution
presented for correcting the surface within any one given vol-
ume or stitching surfaces which are split between volumes,
leaving local surfaces disconnected.
Zhou et al. (2013) present an impressive method for re-
constructing 3D scenes that specifically targets the high-
frequency noise and low-frequency distortion effects often en-
countered with RGB-D data. By reconstructing fragments
of the scene, which are then aligned and deformed, very high
quality reconstructions can be obtained, albeit in
a strictly offline framework. Similar work also by Zhou and
Koltun (2013) presents a method which detects points of inter-
est in a scene and specifically optimises the camera trajectory
to preserve detailed geometry around these points, within an
offline framework.
A number of approaches that rely on keyframes have been
developed to tackle the problem of RGB-D mapping and
SLAM. Tykkälä et al. (2013) present a system which uses
real-time dense photometric keyframe-based camera track-
ing to determine a camera trajectory around an indoor envi-
ronment. Individual RGB-D frames are also fused into ex-
isting keyframes to improve reconstruction quality. An op-
tional bundle adjustment step can then be taken to optimise
the camera poses before a watertight Poisson mesh recon-
struction is computed as a post-processing step. Meilland
and Comport (2013) propose a model that unifies the benefits
of a dense voxel-based representation with a keyframe rep-
resentation allowing high quality dense mapping over large-
scales, although without detecting large loop closures or cor-
recting for drift. An intelligent forward composition approach
is proposed which enables efficient combination of reference
images to create a single predicted frame without repeated
redundant image warps. In our work we chose to avoid a
keyframe approach in spite of the resulting higher memory
requirement. A fully 3D voxel-based method removes the
need to implement specific schemes to overcome the prob-
lems associated with reconstructing complex non-concave ob-
jects and non-convex scenes.
As discussed, there exist a large number of systems utilising
RGB-D data for SLAM and related problems. However,
most are unable either to operate in real-time, to provide
an up-to-date optimised representation of the map whenever
it is requested, or to efficiently incorporate large non-
rigid updates to the map. Non-rigid surface correction is of
great interest specifically in the realm of volumetric fusion as
typically reconstructions are locally highly accurate but drift
slowly over large scales over time, where a smooth continu-
ous deformation of the surface is most suitable for correction.
In the following sections we will fully describe our approach
to RGB-D SLAM with volumetric fusion which is capable of
functioning in real-time over large scale trajectories, while ef-
ficiently applying non-rigid updates to the dense map upon
loop closure to ensure global consistency.
To clarify our definition of “real-time”: there is no of-
fline step involved in our pipeline and multiple loops can be
closed immediately as they occur during the mapping process
(shown in Multimedia Extension 2). Our system architec-
ture can be compared to that of PTAM (Klein and Murray
(2007)), whereby camera tracking and mapping run in sepa-
rate threads. While the camera tracking component runs at
frame rate in one thread, the mapping component is freed
from the computational burden of updating the map for ev-
ery frame and instead occasionally receives information from
the tracking thread to update the map for consistency.
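To make this architecture concrete, the following is a minimal C++ sketch of the two-thread split, with hypothetical types and no real tracking or mapping logic: the tracking loop runs per frame and hands occasional updates to the mapping thread through a mutex-protected queue, keeping the mapper off the per-frame critical path.

```cpp
// Minimal sketch of the tracking/mapping thread split (hypothetical types).
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>

struct TrackedFrame { int id = 0; /* pose estimate, extracted slice, ... */ };

std::queue<TrackedFrame> pending;  // tracking -> mapping hand-off
std::mutex pendingMutex;
std::atomic<bool> running{true};

void trackingThread() {
    for (int frame = 0; frame < 100; ++frame) {  // stand-in for the frame-rate loop
        TrackedFrame f{frame};                   // camera pose estimated here, per frame
        std::lock_guard<std::mutex> lock(pendingMutex);
        pending.push(f);                         // occasionally notify the mapper
    }
    running = false;
}

void mappingThread() {
    for (;;) {
        TrackedFrame f;
        bool haveWork = false;
        {
            std::lock_guard<std::mutex> lock(pendingMutex);
            if (!pending.empty()) {
                f = pending.front();
                pending.pop();
                haveWork = true;
            } else if (!running) {
                return;  // tracker finished and queue drained
            }
        }
        if (haveWork) {
            // Update the map for consistency using f, off the per-frame path.
        }
    }
}

int main() {
    std::thread tracker(trackingThread), mapper(mappingThread);
    tracker.join();
    mapper.join();
}
```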
This paper brings together work presented in our three pre-
vious publications Whelan et al. (2012), Whelan et al. (2013a)
and Whelan et al. (2013b). In this paper we provide a num-
ber of additions to that work including a method for improv-
ing camera-frustum overlap for greater reconstruction range
(Section 2.4) and a means of reducing the amount of informa-
tion required to perform map deformation, increasing compu-
tational performance (Section 5.3.2). Most significantly, this
paper presents an extensive evaluation of the system not
included in any previous work, including both quali-
tative and quantitative evaluation of trajectory estimation per-
formance, surface reconstruction quality and computational
performance.
Please note that any provided sample parameter and threshold
values are those which were used for all experiments in this
paper and are provided as an aid to those who wish to re-
implement any aspect of this work.
2 Extended Scale Volumetric Fusion
In this section we will provide some background on the us-
age of volumetric fusion for dense RGB-D-based tracking
and mapping and describe our extension to KinectFusion, the
most widely cited system that employs this approach, to allow
spatially extended mapping.

Figure 1: Two dimensional example of the structure of the truncated signed distance function representation of an implicit surface. Shown are example signed distance values stored at voxels within the truncation distance of the observed surface, with rays cast from the observing sensor.
2.1 Background
Real-time volumetric fusion with RGB-D cameras was
brought to the forefront by Newcombe et al. (2011) with the
KinectFusion system. A significant component of the system
is the cyclical pipeline used for camera tracking and scene
mapping, whereby full depth maps are fused into a volumet-
ric data structure (TSDF), which is then raycast to produce
a predicted surface that the subsequently captured depth map
is matched against using ICP. The truncated signed distance
function (TSDF) is a volumetric data structure that encodes
implicit surfaces by storing the signed distance to the closest
surface at each voxel up to a given truncation distance from
the actual surface position. Points at which the sign of the
distance value changes are known as zero crossings, which
represent the actual position of the surface, shown in Figure
1. Each voxel also stores a weight for the distance measure-
ment at that point, effectively providing a moving average of
the surface position. In the case of KinectFusion, the TSDF
is stored as a three dimensional voxel grid in GPU memory
where dense depth map integration is accomplished by sweep-
ing through the volume and updating distance measurements
accordingly, while surface raycasting is carried out by simply
projecting rays from the current camera pose and returning the
depth and surface normals at the first zero crossings encoun-
tered. Surface normals are easily computed by taking the fi-
nite difference around a given position within the TSDF, as ex-
ploited by Bylow et al. (2013) and Canelhas et al. (2013). The
entire process is very amenable to parallelisation and greatly
benefits in execution time from being implemented on a GPU
(Newcombe et al. (2011)). A point to note is that the TSDF
representation has a minimal surface thickness limitation im-
posed by the selected truncation distance. This problem was
highlighted and explored by Henry et al. (2013a) in their work
on multiple fusion volumes.

Figure 2: Visualisation of the volume shifting process for spatially extended mapping; (i) The camera motion exceeds the movement threshold $m_s$ (direction of camera motion shown by the black arrow); (ii) Volume slice leaving the volume (red) is raycast along all three axes to extract surface points and reset to free space; (iii) The raycast surface is extracted as a point cloud and fed into the Greedy Projection Triangulation (GPT) algorithm of Marton et al. (2009); (iv) New region of space (blue) enters the volume and is integrated using new modulo addressing of the volume.
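To make the raycasting step described above concrete, below is a simplified single-ray sketch, assuming a hypothetical tsdfAt() lookup (here a unit-sphere signed distance stands in for a sampled voxel grid). Marching along the ray, a sign change between consecutive samples marks a zero crossing, which is linearly interpolated for a sub-voxel depth estimate; surface normals would follow from finite differences around the hit point.

```cpp
// Simplified sketch of raycasting a TSDF along one ray: march until the
// signed distance changes sign, then linearly interpolate the zero crossing.
#include <cmath>
#include <optional>

// Stand-in sampled signed distance: a unit sphere at the origin. Real code
// samples the voxel grid, typically with trilinear interpolation.
float tsdfAt(float x, float y, float z) {
    return std::sqrt(x * x + y * y + z * z) - 1.0f;
}

std::optional<float> raycastDepth(const float o[3], const float d[3],
                                  float maxDepth, float step) {
    float prev = tsdfAt(o[0], o[1], o[2]);
    for (float t = step; t < maxDepth; t += step) {
        float cur = tsdfAt(o[0] + t * d[0], o[1] + t * d[1], o[2] + t * d[2]);
        if (prev > 0.0f && cur < 0.0f) {
            // Sign change between samples: the surface lies in between.
            // Linear interpolation gives a sub-voxel depth estimate.
            return t - step * cur / (cur - prev);
        }
        prev = cur;
    }
    return std::nullopt;  // no zero crossing along this ray
}
```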
2.2 Volume Representation
Defining the voxel space domain as $\Psi \subset \mathbb{N}^3$, the TSDF volume
$S$ at some location $s \in \Psi$ has the mapping $S(s) : \Psi \to \mathbb{R} \times \mathbb{N} \times \mathbb{N}^3$. Within GPU memory the TSDF is represented as a 3D
array of voxels. Each voxel contains a signed distance value
($S(s)_T$, truncated float16), an unsigned weight value ($S(s)_W$,
unsigned int8) and a byte for each color component R, G and
B ($S(s)_R$, $S(s)_G$, $S(s)_B$) for a total of 6 bytes per voxel. The
integration of new surface measurements is carried out in a
similar fashion to Newcombe et al. (2011): when integrating
a new signed distance function measurement $S(s)_{T_i}$ during the
fusion of a new depth map, each voxel $s \in \Psi$ at time $i$ is
updated with:

$$S(s)_{T'_i} = \frac{S(s)_{W_{i-1}}\, S(s)_{T_{i-1}} + S(s)_{W_i}\, S(s)_{T_i}}{S(s)_{W_{i-1}} + S(s)_{W_i}} \qquad (1)$$

$$S(s)_{W'_i} = \min(S(s)_{W_{i-1}} + S(s)_{W_i}, \text{max\_weight}) \qquad (2)$$
As is the case with previous approaches, we take $S(s)_{W_i} = 1$
to provide a simple moving average, and set $\text{max\_weight}$ to
128. Bylow et al. (2013) have experimented with different
weighting schemes, however we have found the original value
of 1 used by Newcombe et al. (2011) to provide good per-
formance. Using only a cubic volume, we parameterise the
TSDF by the side length in voxels $v_s$ and the dimension in
metres $v_d$. Both of these parameters control the resolution of
the reconstruction along with the size of the immediate “active
area” of reconstruction. In all experiments in this paper we set
$v_s = 512$ for total GPU memory usage of 768MB. The 6-DOF
camera pose within the TSDF at time $i$ is denoted as $P^T_i$, com-
posed of a rotation $R^T_i \in SO_3$ and a translation $t^T_i \in \mathbb{R}^3$. The
origin of the TSDF coordinate system is positioned at the cen-
ter of the volume with basis vectors aligned with the axes of
the TSDF. Initially $R^T_0 = I$ and $t^T_0 = (0, 0, 0)^\top$. The position
of the TSDF volume in voxel units in the global frame is ini-
tialised to be $g_0 = (0, 0, 0)^\top$. Note that the superscript $T$ refers
to the TSDF pose and not the transpose $\top$ operator.

Figure 3: Visualisation of the interaction between the movement threshold $m_s$ and the shifting process. Between frames 0 and 1 the camera does not cross the movement boundary (dark brown) and no shift occurs. At frame 2, the pose crosses the boundary and causes a volume shift, recentering the volume (teal) around $P^T_2$ and updating $g_2$. The underlying voxel grid quantisation is shown in light dashed lines.
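As an illustrative sketch of the voxel layout and the weighted running average of Equations (1) and (2), consider the following (not the original GPU implementation): the int16_t distance field is a CPU stand-in for the truncated float16, and the per-measurement weight of 1 and the cap of 128 follow the text.

```cpp
// Sketch of the 6-byte voxel and the running-average fusion of Eqs. (1)-(2).
#include <algorithm>
#include <cstdint>

struct Voxel {
    int16_t tsdf;     // S(s)_T: signed distance (stand-in for GPU float16)
    uint8_t weight;   // S(s)_W: measurement weight
    uint8_t r, g, b;  // colour components
};                    // 6 bytes per voxel, matching the layout above

constexpr int kMaxWeight = 128;  // max_weight from the text

// Fuse one new signed distance measurement with per-measurement weight 1,
// following Equations (1) and (2). 'scale' converts int16 <-> metric units.
void fuse(Voxel& v, float newTsdf, float scale) {
    float oldTsdf = v.tsdf * scale;
    int wOld = v.weight, wNew = 1;
    // Equation (1): weighted average of old and new distance values.
    float fused = (wOld * oldTsdf + wNew * newTsdf) / float(wOld + wNew);
    v.tsdf = static_cast<int16_t>(fused / scale);
    // Equation (2): accumulate the weight, capped at max_weight.
    v.weight = static_cast<uint8_t>(std::min(wOld + wNew, kMaxWeight));
}
```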
2.3 Volume Shifting
Unlike Newcombe et al. (2011), camera pose estimation and
surface reconstruction are not restricted to only the region
around which the TSDF was initialised. By employing mod-
ulo arithmetic in how the TSDF volume is addressed in GPU
memory we can treat the structure like a cyclical buffer which
virtually translates as the camera moves through an environ-
ment. Figure 2 provides a visual example and description of
the shifting process. It is parameterised by an integer move-
ment threshold $m_s$, defining the cubic movement boundary (in
voxels) around $g_i$ which, upon crossing, causes a volume shift,
shown in Figure 3. Discussion on the choice of value for $m_s$ is
provided in Section 5.3. Each dimension is treated indepen-
dently during a shift. When a shift is triggered, the TSDF is
virtually translated about the camera pose (in voxel units) to
bring the camera's position to within one voxel of $g_{i+1}$. The
new pose of the camera $P^T_{i+1}$ has no change in rotation, while
the shift corrected camera position ${t'}^T_{i+1}$ is calculated from $t^T_{i+1}$
by first computing the number of voxel units crossed:

$$u = \frac{v_s\, t^T_{i+1}}{v_d} \qquad (3)$$

And then shifting the pose while updating the global position
of the TSDF:

$${t'}^T_{i+1} = t^T_{i+1} - \frac{v_d\, u}{v_s} \qquad (4)$$

$$g_{i+1} = g_i + u \qquad (5)$$
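The per-axis shift computation of Equations (3) to (5) can be sketched as follows; vs and vd are the parameters defined in Section 2.2, and the truncation of u to an integer is an assumption, since the text describes u as a count of whole voxel units but does not specify the rounding.

```cpp
// Sketch of the volume shift of Equations (3)-(5), one axis at a time.
#include <array>

struct ShiftResult {
    std::array<float, 3> tCorrected;  // shift-corrected camera position t'^T_{i+1}
    std::array<int, 3> gNext;         // updated global volume position g_{i+1}
};

ShiftResult shiftVolume(const std::array<float, 3>& t,  // t^T_{i+1}
                        const std::array<int, 3>& g,    // g_i
                        int vs, float vd) {
    ShiftResult out{};
    for (int axis = 0; axis < 3; ++axis) {
        // Equation (3): voxel units crossed (truncation is an assumption).
        int u = static_cast<int>(vs * t[axis] / vd);
        // Equation (4): shift the camera position back into the volume.
        out.tCorrected[axis] = t[axis] - vd * u / vs;
        // Equation (5): advance the volume's global position.
        out.gNext[axis] = g[axis] + u;
    }
    return out;
}
```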
Figure 4: Two dimensional visualisation of the association between extracted cloud slices, the camera poses and the TSDF volume. Note that the camera poses here are in global coordinates rather than internal TSDF coordinates. A red dashed line links camera poses with extracted slices of the TSDF volume ($P_\gamma$, $P_\beta$ and $P_\alpha$ with $C_2$, $C_1$ and $C_0$ respectively). The large triangles represent camera poses that caused volume shifts while the small black squares represent those that did not.
2.3.1 Implementation
There are two parts of volumetric fusion as described by New-
combe et al. (2011) that require indexed access to the TSDF
volume; 1) Volume Integration and 2) Volume Raycasting.
Referring again to Figure 2, the new surface measurements
shown in blue can be integrated into the memory previously
used for the old surface contained within the red region of the
TSDF by ensuring all element look ups in the 3D block of
GPU memory reflect the virtual voxel translation computed
in Equation 5. Assuming row major memory ordering, an el-
ement in the unshifted cubic 3D voxel grid can be found at the
1D memory location a given by:
$$a = (x + y\,v_s + z\,v_s^2) \qquad (6)$$
The volume’s translation can be reflected in how the TSDF
is addressed for integration and raycasting by substituting the
indices in Equation 6 with values that are offset by the current
global position of the TSDF and bound within the dimensions
of the voxel grid using the modulus operator:
$$x' = (x + g_{i_x}) \bmod v_s \qquad (7)$$

$$y' = (y + g_{i_y}) \bmod v_s \qquad (8)$$

$$z' = (z + g_{i_z}) \bmod v_s \qquad (9)$$

$$a = (x' + y'\,v_s + z'\,v_s^2) \qquad (10)$$
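A sketch of this cyclical addressing, following Equations (7) to (10), is given below. The wrap() helper is an addition: C++'s % operator can return negative results for negative operands, so a positive modulus is used to keep indices inside the grid; the rest is a direct transcription of the equations.

```cpp
// Cyclical buffer addressing: the volume's global position g offsets every
// lookup, and the modulus wraps indices back into the fixed voxel grid, so
// no voxel data ever has to move in memory when the volume shifts.
#include <cstddef>

// Positive modulus (C++'s % can be negative for negative input).
inline int wrap(int i, int n) { return ((i % n) + n) % n; }

// Row-major 1D address of voxel (x, y, z) in a vs^3 grid shifted to global
// position g, following Equations (7)-(10).
inline std::size_t voxelAddress(int x, int y, int z, const int g[3], int vs) {
    int xp = wrap(x + g[0], vs);  // Equation (7)
    int yp = wrap(y + g[1], vs);  // Equation (8)
    int zp = wrap(z + g[2], vs);  // Equation (9)
    return std::size_t(xp) + std::size_t(yp) * vs
         + std::size_t(zp) * vs * vs;  // Equation (10)
}
```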
The original KinectFusion work of Newcombe et al. (2011)
benefits greatly from memory caching and pipelining func-
tionality within GPU memory to achieve high computational
performance within the integration step. In our implementa-
tion we have found that use of a cyclical addressing method