nonparametric method. The method is based on pixel-
wise segmentation and the statistical model for pose
estimation, which is susceptible to independent non-
rigid body motion of past frames, and therefore some
dynamic pixels may be included in the energy function
minimization. Meanwhile, Scona et al. [24] proposed
StaticFusion, which maintains a map of the static background and uses it for pose estimation in a model-to-frame manner. However, StaticFusion cannot handle scenes with fast camera motion because there is not enough time to update the map. Finally, Sun et al.
[25] proposed a motion removal approach as a preprocessing step and integrated it into the front end of RGB-D SLAM. However, this method can handle only a single moving object rather than multiple moving objects in dynamic scenes.
Another line of related work consists of off-line methods.
Roussos et al. [26] proposed an approach of multi-
body motion segmentation and reconstruction based
on the energy function. The algorithm effectively
gives the camera pose, scene depth, and 3D recon-
struction in dynamic scenes. Unfortunately, the method processes RGB-D data in batches and is therefore an off-line system. Wang et al.
[27] estimated dense optical flow from frames, where
dynamic objects can be excluded by clustering motion
patterns based on optical flow. However, owing to the heavy computational cost, this method could not achieve real-time performance.
Regarding motion segmentation, Stückler et al. [28]
proposed an efficient real-time dense motion segmentation method, whose weakness is that it applies only to rigid-body segmentation. Although some unsupervised learning-based methods [29–31] have recently been proposed and achieved good results, they do not always perform well in unfamiliar dynamic scenes, since they require a large dataset to train the network; thus, they suffer from a generalization problem.
3 Dense Visual Odometry Approach
3.1 Overview
In this paper, we propose a visual odometry approach based on a nonparametric statistical model and a clustering model. An overview is shown in Fig. 1. First, K-means clustering is used to segment each frame into N clusters based on depth and intensity. Each cluster is treated as a rigid body, so the pixel-wise motion segmentation problem is reduced to cluster-wise segmentation.
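To make this step concrete, the following is a minimal sketch of cluster-wise segmentation by K-means on per-pixel intensity and depth. The feature scaling, the number of clusters, and the use of scikit-learn's KMeans are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_frame(intensity, depth, n_clusters=24):
    """Cluster pixels by (intensity, depth) so that each cluster can be
    treated as an approximately rigid region.

    intensity, depth: HxW float arrays (depth in meters, 0 = invalid).
    Returns an HxW label image; pixels with invalid depth get label -1.
    """
    h, w = depth.shape
    valid = depth > 0
    # Per-pixel feature vector; scaling the two channels is an
    # illustrative choice to balance their influence.
    feats = np.stack([intensity[valid] / 255.0, depth[valid]], axis=1)
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(feats)
    labels = np.full((h, w), -1, dtype=np.int32)
    labels[valid] = km.labels_
    return labels
```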
Second, the initial camera pose is estimated by minimizing photometric and depth residuals under a Cauchy M-estimator, and the estimated pose is then used to warp previous frames into the current frame's coordinate system. After regularization, the warped frames are used to compute a temporal consistency residual for each cluster, which enforces the continuity of cluster motion. Third, the temporal consistency residuals are used to build a nonparametric statistical model based on the t-distribution, and moving objects are identified by means of a dynamic threshold condition.
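As a rough sketch of how such a statistical model could yield per-cluster confidences and a dynamic threshold, the fragment below fits the scale of a t-distribution to the temporal consistency residuals and derives a weight and a moving/static decision per cluster. The degrees of freedom, the fixed-point scale estimate, and the threshold rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def t_dist_scale(residuals, nu=5.0, iters=10):
    """Estimate the scale of a zero-mean t-distribution fitted to the
    residuals by fixed-point iteration (a standard weighted-variance
    update); nu and the iteration count are illustrative."""
    sigma2 = np.mean(residuals ** 2) + 1e-12
    for _ in range(iters):
        w = (nu + 1.0) / (nu + residuals ** 2 / sigma2)
        sigma2 = np.mean(w * residuals ** 2) + 1e-12
    return np.sqrt(sigma2)

def cluster_confidences(residuals, labels, n_clusters, nu=5.0, thresh=2.0):
    """Per-cluster weight (confidence) from t-distribution weights, plus a
    simple dynamic threshold: clusters whose median absolute residual
    exceeds thresh * sigma are flagged as moving (thresh is an assumption)."""
    sigma = t_dist_scale(residuals, nu)
    w = (nu + 1.0) / (nu + (residuals / sigma) ** 2)
    conf = np.zeros(n_clusters)
    moving = np.zeros(n_clusters, dtype=bool)
    for c in range(n_clusters):
        mask = labels == c
        if mask.any():
            conf[c] = w[mask].mean()
            moving[c] = np.median(np.abs(residuals[mask])) > thresh * sigma
    return conf, moving
```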
Finally, the probability confidence of each static cluster, derived from the statistical model, is used as a weight in the energy function optimization to obtain a more accurate camera pose estimate. Afterward, the warp function is updated with the newly estimated transformation for the next iteration.
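To illustrate how a Cauchy M-estimator and the per-cluster static confidences might enter the optimization, the sketch below evaluates one iteratively reweighted least-squares surrogate of the robust energy. The scale parameter c and the multiplicative combination of the two weights are assumptions for illustration, not the paper's exact energy function.

```python
import numpy as np

def cauchy_weights(residuals, c=1.0):
    """Cauchy M-estimator weight w(r) = 1 / (1 + (r / c)^2); large
    residuals (likely from dynamic pixels) are strongly down-weighted."""
    return 1.0 / (1.0 + (residuals / c) ** 2)

def weighted_energy(residuals, labels, static_conf, c=1.0):
    """IRLS surrogate of the robust energy: squared residuals scaled by
    the Cauchy weight and by the static confidence of each pixel's cluster."""
    valid = labels >= 0                    # ignore pixels without a cluster
    r = residuals[valid]
    w = cauchy_weights(r, c) * static_conf[labels[valid]]
    return np.sum(w * r ** 2)
```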
3.2 Preliminaries
Since the RGB-D sensor simultaneously provides a color image and a depth image, a pair of frames $(I_{k-1}, Z_{k-1})$ and $(I_k, Z_k)$ is given as input, where $I(\mathbf{x}) \in \mathbb{R}$ and $Z(\mathbf{x}) \in \mathbb{R}$ represent the intensity and depth, respectively, of pixel $\mathbf{x} = (u, v)^T \in \mathbb{R}^2$. Intensity is converted from the color image as $0.299R + 0.587G + 0.114B$. In homogeneous coordinates, given a 3D point $\mathbf{p}_k = (X_k, Y_k, Z_k, 1)^T$, the projection function and its inverse between the 3D point and its pixel on the image are as follows:
$$\mathbf{x} = \pi(\mathbf{p}_k) = \left( \frac{X_k f_x}{Z_k} + o_x,\; \frac{Y_k f_y}{Z_k} + o_y \right) \qquad (1)$$
$$\mathbf{p}_k = \pi^{-1}(\mathbf{x}, Z_k) = \left( \frac{(u - o_x)\, Z_k}{f_x},\; \frac{(v - o_y)\, Z_k}{f_y},\; Z_k,\; 1 \right) \qquad (2)$$
where $f_x$ and $f_y$ are the focal lengths and $(o_x, o_y)$ is the principal point.
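Equations (1) and (2) translate directly into code. Below is a small sketch of the pinhole projection and back-projection, assuming the intrinsics $(f_x, f_y, o_x, o_y)$ are passed explicitly; the function names are illustrative.

```python
import numpy as np

def project(p, fx, fy, ox, oy):
    """Eq. (1): project a 3D point p = (X, Y, Z, ...) onto the image plane."""
    X, Y, Z = p[0], p[1], p[2]
    u = fx * X / Z + ox
    v = fy * Y / Z + oy
    return np.array([u, v])

def back_project(u, v, Z, fx, fy, ox, oy):
    """Eq. (2): lift pixel (u, v) with depth Z to a homogeneous 3D point."""
    X = (u - ox) * Z / fx
    Y = (v - oy) * Z / fy
    return np.array([X, Y, Z, 1.0])
```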
As the camera moves, the 3D point $\mathbf{p}$ in the previous frame's camera coordinate system can be rigidly transformed to the current frame with the transformation matrix $T_{k-1}^{k} \in SE(3)$. The new coordinate of the 3D point in the current camera frame can be obtained by the following function: