Multi-View 3D Object Detection Network for Autonomous Driving
Xiaozhi Chen¹, Huimin Ma¹, Ji Wan², Bo Li², Tian Xia²
¹Department of Electronic Engineering, Tsinghua University
²Baidu Inc.
{chenxz12@mails., mhmpub@}tsinghua.edu.cn, {wanji, libo24, xiatian}@baidu.com
Abstract
This paper aims at high-accuracy 3D object detection in autonomous driving scenarios. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 10.3% higher AP than the state-of-the-art among the LIDAR-based methods on the hard data.
1. Introduction
3D object detection plays an important role in the visual perception system of autonomous driving cars. Modern self-driving cars are commonly equipped with multiple sensors, such as LIDAR and cameras. Laser scanners provide accurate depth information, while cameras preserve much more detailed semantic information. Fusing LIDAR point clouds and RGB images should therefore yield higher performance and safety for self-driving cars.
The focus of this paper is on 3D object detection utilizing both LIDAR and image data. We aim at highly accurate 3D localization and recognition of objects in road scenes. Recent LIDAR-based methods place 3D windows in 3D voxel grids to score the point cloud [26, 7] or apply convolutional networks to the front view point map in a dense box prediction scheme [17]. Image-based methods [4, 3] typically first generate 3D box proposals and then perform region-based recognition using the Fast R-CNN [10] pipeline. Methods based on LIDAR point clouds usually achieve more accurate 3D locations, while image-based methods have higher accuracy in terms of 2D box evaluation. [11, 8] combine LIDAR and images for 2D detection by employing early or late fusion schemes. However, for the more challenging task of 3D object detection, a well-designed model is required to exploit the strengths of multiple modalities.
In this paper, we propose a Multi-View 3D object detection network (MV3D) which takes multimodal data as input and predicts the full 3D extent of objects in 3D space. The main idea for utilizing multimodal information is to perform region-based feature fusion. We first propose a multi-view encoding scheme to obtain a compact and effective representation for sparse 3D point clouds. As illustrated in Fig. 1, the multi-view 3D detection network consists of two parts: a 3D Proposal Network and a Region-based Fusion Network. The 3D proposal network utilizes a bird's eye view representation of the point cloud to generate highly accurate 3D candidate boxes. The benefit of 3D object proposals is that they can be projected to any view in 3D space. The multi-view fusion network extracts region-wise features by projecting the 3D proposals onto the feature maps from multiple views.
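
As a concrete illustration of this projection step, the following is a minimal sketch of mapping a 3D box onto a bird's-eye-view feature map to obtain a region of interest. The detection range, grid resolution, and feature-map stride below are illustrative assumptions, not the paper's exact settings.

    # Minimal sketch (illustrative, not the paper's exact code): project the
    # corners of a 3D box onto a bird's-eye-view (BEV) feature map as an ROI.
    import numpy as np

    # Assumed detection range, grid resolution, and backbone stride.
    BEV_X_RANGE = (0.0, 70.0)    # metres ahead of the sensor
    BEV_Y_RANGE = (-40.0, 40.0)  # metres to either side
    RESOLUTION = 0.1             # metres per BEV cell
    FEAT_STRIDE = 8              # downsampling factor of the conv backbone

    def box3d_to_bev_roi(corners_xyz):
        """Map the 8 corners of a 3D box (8 x 3 array, LIDAR coordinates)
        to an axis-aligned (x1, y1, x2, y2) ROI on the BEV feature map."""
        # Discretize metric x/y coordinates into BEV pixel indices; the
        # height (z) axis is flattened out in the bird's eye view.
        px = (corners_xyz[:, 0] - BEV_X_RANGE[0]) / RESOLUTION
        py = (corners_xyz[:, 1] - BEV_Y_RANGE[0]) / RESOLUTION
        # Take the tightest axis-aligned box around the projected corners and
        # rescale from BEV pixels to feature-map cells for ROI pooling.
        return (px.min() / FEAT_STRIDE, py.min() / FEAT_STRIDE,
                px.max() / FEAT_STRIDE, py.max() / FEAT_STRIDE)

Because the proposals live in 3D, the same corners can equally be projected into the front view or, given the camera calibration, onto the image plane.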
views. We design a deep fusion approach to enable inter-
actions of intermediate layers from different views. Com-
bined with drop-path training [15] and auxiliary loss, our
approach shows superior performance over the early/late fu-
sion scheme. Given the multi-view feature representation,
the network performs oriented 3D box regression which
predict accurate 3D location, size and orientation of objects
in 3D space.
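
The difference between the fusion schemes can be made concrete with a minimal sketch: early fusion joins the views once at the input, late fusion once at the output, while deep fusion joins intermediate features at every stage so the per-view paths can interact. The element-wise mean join, placeholder layers, and depth below are illustrative assumptions rather than the paper's exact architecture.

    # Minimal numpy sketch contrasting early, late, and deep fusion over a
    # list of same-shaped per-view feature arrays (assumed setup).
    import numpy as np

    def layer(x):
        # Stand-in for a learned per-view transformation (e.g., a conv block).
        return np.maximum(x, 0.0)

    def early_fusion(views, depth=3):
        x = np.mean(views, axis=0)           # join once, at the input
        for _ in range(depth):
            x = layer(x)
        return x

    def late_fusion(views, depth=3):
        outs = list(views)
        for _ in range(depth):
            outs = [layer(v) for v in outs]  # independent per-view paths
        return np.mean(outs, axis=0)         # join once, at the output

    def deep_fusion(views, depth=3):
        xs = list(views)
        for _ in range(depth):
            xs = [layer(x) for x in xs]      # per-view transformations
            joint = np.mean(xs, axis=0)      # join intermediate features and
            xs = [joint] * len(xs)           # feed the result to every path
        return xs[0]

Drop-path training [15] regularizes such a multi-path network by randomly dropping individual paths during training.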
We evaluate our approach on the tasks of 3D proposal generation, 3D localization, 3D detection, and 2D detection on the challenging KITTI [9] object detection benchmark. Experiments show that our 3D proposals significantly outperform the recent 3D proposal methods 3DOP [4] and Mono3D [3]. In particular, with only 300 proposals, we obtain 99.1% and 91% 3D recall at Intersection-over-Union (IoU) thresholds of 0.25 and 0.5, respectively. The LIDAR-