Visual-lidar Odometry and Mapping: Low-drift, Robust, and Fast
Ji Zhang and Sanjiv Singh
Abstract— Here we present a general framework for combining visual odometry and lidar odometry in a fundamental, first-principles way. The method shows improvements in
performance over the state of the art, particularly in robustness
to aggressive motion and temporary lack of visual features. The
proposed on-line method starts with visual odometry to estimate
the ego-motion and to register point clouds from a scanning
lidar at a high frequency but low fidelity. Then, scan matching
based lidar odometry refines the motion estimation and point
cloud registration simultaneously. We show results with datasets
collected in our own experiments as well as using the KITTI
odometry benchmark. Our proposed method is ranked #1 on
the benchmark in terms of average translation and rotation
errors, with a relative position drift of 0.75%. In addition to comparing motion estimation accuracy, we evaluate the robustness of the method when the sensor suite moves at high speed and is subject to significant ambient lighting changes.
I. INTRODUCTION
Recent separate results in visual odometry and lidar odom-
etry are promising in that they can provide solutions to 6-
DOF state estimation, mapping, and even obstacle detection.
However, drawbacks are present when each sensor is used alone. Visual odometry methods require moderate lighting conditions and fail when too few distinct visual features are available. On the other hand, motion estimation via moving
lidars involves motion distortion in point clouds as range
measurements are received at different times during contin-
uous lidar motion. Hence, the motion often has to be solved
with a large number of variables. Scan matching also fails in
degenerate scenes such as those dominated by planar areas.
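To make the distortion issue concrete, the sketch below de-skews a single lidar sweep by interpolating the sensor pose across the sweep duration and re-projecting each point into the frame at the end of the sweep. It is a minimal illustration assuming a constant-velocity motion model, and the function name and pose inputs are hypothetical; it is not the formulation used in this paper.

```python
# Minimal sketch: de-skewing one lidar sweep by interpolating the sensor pose
# over the sweep. The constant-velocity assumption and the names used here are
# illustrative only, not the V-LOAM formulation.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def deskew_sweep(points, stamps, pose_start, pose_end):
    """Re-project every point into the sensor frame at the end of the sweep.

    points:     (N, 3) raw lidar points, each in the sensor frame at its own
                measurement time.
    stamps:     (N,) per-point times, normalized to [0, 1] over the sweep.
    pose_start: (Rotation, translation) of the sensor in the world frame at sweep start.
    pose_end:   (Rotation, translation) of the sensor in the world frame at sweep end.
    """
    R0, t0 = pose_start
    R1, t1 = pose_end

    # Interpolate the pose at each point's timestamp (constant velocity between endpoints).
    key_rots = Rotation.from_quat(np.vstack([R0.as_quat(), R1.as_quat()]))
    R_i = Slerp([0.0, 1.0], key_rots)(stamps)              # one rotation per point
    t_i = (1.0 - stamps)[:, None] * t0 + stamps[:, None] * t1

    # Transform each point to the world frame, then into the frame at sweep end.
    p_world = R_i.apply(points) + t_i
    return R1.inv().apply(p_world - t1)

# Toy usage: a sweep collected while the sensor translates 0.5 m along x.
pts = np.random.rand(1000, 3) * 10.0
ts = np.linspace(0.0, 1.0, len(pts))
I = Rotation.identity()
deskewed = deskew_sweep(pts, ts, (I, np.zeros(3)), (I, np.array([0.5, 0.0, 0.0])))
```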
Here, we propose a fundamental, first-principles method for ego-motion estimation that combines a monocular camera and a 3D lidar. We would like to accurately estimate the
6-DOF motion as well as a spatial, metric representation of
the environment, in real-time and onboard a robot navigating
in an unknown environment. While cameras and lidars have
complementary strengths and weaknesses, it is not straight-
forward to combine them in a traditional filter. Our method
tightly couples the two modes such that it can handle both
aggressive motion including translation and rotation, and
lack of optical texture as in complete whiteout or blackout
imagery. In non-pathological cases, high accuracy in motion
estimation and environment reconstruction is possible.
J. Zhang and S. Singh are with the Robotics Institute at Carnegie Mellon University. Emails: zhangji@cmu.edu and ssingh@cmu.edu.

Fig. 1. The method aims at motion estimation and mapping using a monocular camera combined with a 3D lidar. A visual odometry method estimates motion at a high frequency but low fidelity to register point clouds. Then, a lidar odometry method matches the point clouds at a low frequency to refine the motion estimates and incrementally build maps. The lidar odometry also removes distortion in the point clouds caused by drift of the visual odometry. The combination of the two sensors allows the method to map accurately even under rapid motion and in undesirable lighting conditions.

Our proposed method, namely V-LOAM, exploits the advantages of each sensor and compensates for the drawbacks of the other, and hence shows further improvement in performance over the state of the art. The method has two sequentially staggered processes. The first uses visual odometry running at a high frequency, the image frame rate (60 Hz), to
estimate motion. The second uses lidar odometry at a low
frequency (1 Hz) to refine motion estimates and remove
distortion in the point clouds caused by drift of the visual
odometry. The distortion-free point clouds are matched and
registered to incrementally build maps. The result is that
the visual odometry handles rapid motion, and the lidar
odometry ensures low drift and robustness in undesirable
lighting conditions. Our finding is that the maps are often
accurate without the need for post-processing. Although
loop closure can further improve the maps, we intentionally
choose not to do so since the emphasis of this work is to
push the limit of accurate odometry estimation.
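As a structural sketch only, the skeleton below mirrors this two-rate design: a visual-odometry update runs once per camera frame and accumulates a high-frequency motion estimate, and once per lidar sweep a scan-matching step refines the pose and registers the sweep onto the map. The class names, interfaces, and data layout are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative skeleton of the two sequentially staggered processes described
# above: per-frame visual odometry (~60 Hz) and per-sweep lidar odometry (~1 Hz).
# VisualOdometry, LidarOdometry, and run() are hypothetical placeholders.
import numpy as np

class VisualOdometry:
    """High-frequency, low-fidelity frame-to-frame motion estimation."""
    def update(self, image):
        # Feature tracking and motion estimation would go here.
        return np.eye(4)                       # 4x4 incremental transform

class LidarOdometry:
    """Low-frequency scan matching that refines motion and builds the map."""
    def __init__(self):
        self.pose = np.eye(4)
        self.map_clouds = []
    def refine(self, sweep, vo_motion_over_sweep):
        # Scan matching against the map would go here; the visual-odometry
        # motion serves as the initial guess and is used to undistort the sweep.
        self.pose = self.pose @ vo_motion_over_sweep
        self.map_clouds.append(sweep)
        return self.pose

def run(frames, sweep_done_at):
    """frames: iterable of camera images; sweep_done_at: dict mapping the index
    of the frame at which a lidar sweep completes to that sweep's point cloud."""
    vo, lo = VisualOdometry(), LidarOdometry()
    pose = np.eye(4)                           # high-rate pose, updated every frame
    vo_since_sweep = np.eye(4)                 # VO motion accumulated over the sweep
    for k, image in enumerate(frames):
        step = vo.update(image)
        pose = pose @ step                     # ~60 Hz estimate, drifts over time
        vo_since_sweep = vo_since_sweep @ step
        if k in sweep_done_at:                 # a sweep just finished (~1 Hz)
            pose = lo.refine(sweep_done_at[k], vo_since_sweep)
            vo_since_sweep = np.eye(4)         # restart accumulation for next sweep
    return pose
```

In the full system the refined pose would also feed back into the high-rate estimate published between sweeps; the sketch collapses that feedback into the single pose variable for brevity.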
The basic algorithm of V-LOAM is general enough that it
can be adapted to use range sensors of different kinds, e.g., a time-of-flight camera. The method can also be configured to
provide localization only, if a prior map is available.
In addition to evaluation on the KITTI odometry bench-
mark [1], we further experiment with a wide-angle camera
and a fisheye camera. Our conclusion is that the fisheye
camera provides more robustness but lower accuracy because
of its larger field of view and higher image distortion.
However, after the scan matching refinement, the final motion
estimation reaches the same level of accuracy. Our experimental results can be seen in a publicly available video: www.youtube.com/watch?v=-6cwhPMAap8.
II. RELATED WORK
Vision and lidar based methods are common for state
estimation [2]. With stereo cameras [3], [4], the baseline
provides a reference to help determine the scale of the motion. However, if a monocular camera is used [5]–[7], the scale of the motion is generally unsolvable without aid from other sensors or assumptions about the motion. The introduction of
RGB-D cameras provides an efficient way to associate visual
images with depth. Motion estimation with RGB-D cameras