End-to-end Interpretable Neural Motion Planner
Wenyuan Zeng¹,²∗  Wenjie Luo¹,²∗  Simon Suo¹,²  Abbas Sadat¹  Bin Yang¹,²  Sergio Casas¹,²  Raquel Urtasun¹,²
¹Uber Advanced Technologies Group   ²University of Toronto
{wenyuan,wenjie,suo,abbas,byang10,sergio.casas,urtasun}@uber.com
∗ denotes equal contribution.
Abstract
In this paper, we propose a neural motion planner for
learning to drive autonomously in complex urban scenar-
ios that include traffic-light handling, yielding, and interac-
tions with multiple road-users. Towards this goal, we design
a holistic model that takes as input raw LiDAR data and an
HD map and produces interpretable intermediate represen-
tations in the form of 3D detections and their future trajec-
tories, as well as a cost volume defining the goodness of
each position that the self-driving car can take within the
planning horizon. We then sample a set of diverse physi-
cally possible trajectories and choose the one with the min-
imum learned cost. Importantly, our cost volume is able to
naturally capture multi-modality. We demonstrate the ef-
fectiveness of our approach in real-world driving data cap-
tured in several cities in North America. Our experiments
show that the learned cost volume yields safer plans
than all the baselines.
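The planning scheme described above (sample physically feasible trajectories, score each against the learned cost volume, keep the minimum-cost one) can be sketched as follows. This is a toy illustration: the array shapes, the random sampler, and all names are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learned cost volume: one H x W cost map per future timestep.
T, H, W = 5, 50, 50
cost_volume = rng.random((T, H, W))

def sample_trajectories(n):
    """Sample n toy trajectories as (row, col) waypoints, one per timestep."""
    start = np.array([H // 2, W // 2])
    steps = rng.integers(-1, 2, size=(n, T, 2))  # small random motions
    return np.clip(start + np.cumsum(steps, axis=1), 0, H - 1)

def trajectory_cost(traj):
    """Sum the cost volume along a trajectory's waypoints."""
    return sum(cost_volume[t, r, c] for t, (r, c) in enumerate(traj))

trajectories = sample_trajectories(100)
best = min(trajectories, key=trajectory_cost)  # minimum learned cost
```

In the paper the sampler is restricted to physically possible trajectories and the cost volume is produced by the network; here both are random placeholders.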
1. Introduction
Self-driving vehicles (SDVs) are going to revolutionize
the way we live. Building reliable SDVs at scale is, how-
ever, not a solved problem. As is the case in many appli-
cation domains, the field of autonomous driving has been
transformed in the past few years by the success of deep
learning. Existing approaches that leverage this technology
can be characterized into two main frameworks: end-to-end
driving and traditional engineering stacks.
End-to-end driving approaches [3, 24] take the output of
the sensors (e.g., LiDAR, images) and use it as input to a
neural net that outputs control signals, e.g., steering com-
mand and acceleration. The main benefit of this framework
is its simplicity: a model can be built with only a few lines
of code, and labeled training data can be obtained automati-
cally by recording human driving on an SDV platform. In
practice, this approach suffers from compounding errors,
since self-driving control is a sequential decision problem,
and it requires massive amounts of data to
generalize. Furthermore, the network offers little interpretability,
making it difficult to analyze its mistakes. It is also
hard to incorporate sophisticated prior knowledge about the
scene, e.g. that vehicles should not collide.
In contrast, most self-driving car companies utilize a
traditional engineering stack, where the problem is divided
into subtasks: perception, prediction, motion planning and
control. Perception is in charge of estimating all actors’ po-
sitions and motions, given the current and past evidences.
This involves solving tasks such as 3D object detection and
tracking. Prediction¹, on the other hand, tackles the prob-
lem of estimating the future positions of all actors as well
as their intentions (e.g., changing lanes, parking). Finally,
motion planning takes the output from previous stacks and
generates a safe trajectory for the SDV to execute via a con-
trol system. This framework has interpretable intermediate
representations by construction, and prior knowledge can be
easily exploited, for example in the form of high definition
maps (HD maps).
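The modular stack just described can be sketched as a chain of stages, each consuming only the previous stage's interpretable output. All class and function names below are hypothetical placeholders, not the paper's interfaces, and each stage body is a trivial stand-in.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Detection:
    position: Point          # current (x, y) of an actor

@dataclass
class Forecast:
    waypoints: List[Point]   # predicted future (x, y) positions

def perceive(sensor_frame: List[Point]) -> List[Detection]:
    # Stand-in for 3D detection and tracking from LiDAR + HD-map input.
    return [Detection(position=p) for p in sensor_frame]

def predict(detections: List[Detection]) -> List[Forecast]:
    # Trivial constant-position forecast as a placeholder prediction model.
    return [Forecast(waypoints=[d.position] * 3) for d in detections]

def plan(forecasts: List[Forecast]) -> List[Point]:
    # Placeholder planner: advance along x, skipping predicted-occupied cells.
    occupied = {p for f in forecasts for p in f.waypoints}
    return [(float(t), 0.0) for t in range(3) if (float(t), 0.0) not in occupied]

frame: List[Point] = [(5.0, 1.0), (8.0, -2.0)]
trajectory = plan(predict(perceive(frame)))
```

The hand-off structure, where each module optimizes its own objective and passes a fixed interface downstream, is exactly what makes the stack interpretable but hard to optimize jointly.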
However, solving each of these sub-tasks is not only
hard, but may also lead to sub-optimal overall system
performance. Most self-driving companies have large en-
gineering teams working on each sub-problem in isolation,
and they train each sub-system with a task specific objec-
tive. As a consequence, an advance in one sub-system does
not easily translate to an overall system performance im-
provement. For instance, 3D detection tries to maximize
AP, where each actor has the same weight. However, in
a driving scenario, high-precision detections of near-range
actors that may influence the SDV’s motion, e.g. through in-
teractions (cutting in, sudden stopping), are more critical. In
addition, uncertainty estimations are difficult to propagate
and computation is not shared among different sub-systems.
This leads to longer reaction times for the SDV and makes
the overall system less reliable.
In this paper we bridge the gap between these two frame-
works. Towards this goal, we propose the first end-to-
¹We’ll use prediction and motion forecasting interchangeably.