End-to-end Learning of Multi-sensor 3D Tracking by Detection
Davi Frossard Raquel Urtasun
Uber Advanced Technologies Group
University of Toronto
{frossard, urtasun}@uber.com
Abstract— In this paper we propose a novel approach to
tracking by detection that can exploit both camera and
LIDAR data to produce very accurate 3D trajectories. Towards
this goal, we formulate the problem as a linear program that
can be solved exactly, and learn convolutional networks for
detection as well as matching in an end-to-end manner. We
evaluate our model on the challenging KITTI dataset and show
very competitive results.
I. INTRODUCTION
One of the fundamental tasks in perception systems for
autonomous driving is to track traffic participants.
This task, commonly referred to as multi-target tracking,
consists of identifying how many objects there are in each
frame, as well as linking their trajectories over time. Despite
many decades of research, tracking is still an open problem.
Challenges include dealing with object truncation, high-speed
targets, lighting conditions, sensor motion and complex
interactions between targets, which lead to occlusion and
path crossing.
Most modern computer vision approaches to multi-target
tracking are based on tracking by detection [1], where a set
of candidate objects is first identified via object detectors.
These detections are then associated over time in a
second step by solving a discrete problem. Both tracking and
detection are typically formulated in 2D, and a variety of
cues based on appearance and motion are exploited.
In robotics, tracking-by-filtering methods are more prevalent,
where the input is filtered in search of moving objects and their
state is predicted over time [2]. LIDAR-based approaches
are the most common option for 3D tracking, since this
sensor provides an accurate spatial representation of the world,
allowing for precise positioning of the objects of interest.
However, matching is more difficult, as LIDAR does not
capture appearance well when compared to the richness of
images.
In this paper, we propose an approach that can take
advantage of both LIDAR and camera data. Towards this goal,
we formulate the problem as inference in a deep structured
model, where the potentials are computed using convolutional
neural nets. Notably, our matching cost for associating two
detections exploits both appearance and motion via a siamese
network that processes images and motion representations
through convolutional layers. Inference in our model can be
done exactly and efficiently by a set of feedforward passes
followed by solving a linear program. Importantly, our model
is formulated such that it can be trained end-to-end to solve
both the detection and tracking problems. We refer the reader
to Figure 1 for an overview of our approach.
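To make the role of the matching cost concrete, below is a
minimal sketch in PyTorch-style Python. It is not the exact
architecture of this paper: the layer sizes, the 6-D motion
descriptor, and all names (MatchingNet, appearance, spatial)
are assumptions for illustration. The key property it shows is
weight sharing in the appearance branch, so both detections
are embedded in the same feature space before fusion with the
spatial/motion features.

import torch
import torch.nn as nn

class MatchingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Appearance branch: one conv stack with shared ("siamese")
        # weights, applied to the image crop of each detection.
        self.appearance = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Spatial/motion branch: consumes a motion descriptor, e.g. the
        # 3D box displacement after ego-motion compensation (assumed 6-D).
        self.spatial = nn.Sequential(nn.Linear(6, 32), nn.ReLU())
        # Fusion head producing a scalar matching score for the pair.
        self.score = nn.Sequential(
            nn.Linear(64 + 64 + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, crop_t, crop_t1, motion):
        f_t = self.appearance(crop_t)    # embedding of detection at frame t
        f_t1 = self.appearance(crop_t1)  # embedding of detection at frame t+1
        f_m = self.spatial(motion)       # embedding of the motion descriptor
        return self.score(torch.cat([f_t, f_t1, f_m], dim=1))

# Usage: score one candidate pair of 64x64 crops with a 6-D motion descriptor.
net = MatchingNet()
score = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), torch.randn(1, 6))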
II. RELATED WORK
Recent work on multiple object tracking falls along
two fronts: filtering-based and batch-based methods.
Filtering-based methods rely on the Markov assumption
to estimate the posterior distribution of the trajectories.
Bayesian or Monte Carlo filtering methods such as Gaussian
Processes [3], Particle Filters and Kalman Filters [2] are
commonly employed. One advantage of filtering approaches
is their efficiency, which allows for real-time applications.
However, they suffer from the propagation of early errors,
which are hard to mitigate. To tackle this shortcoming, batch
methods utilize object hypotheses from a detector (tracking
by detection) over entire sequences to estimate trajectories,
which allows for global optimization and the use of higher-level
cues. Estimating trajectories then becomes a data association
problem, i.e., deciding which detections from the set should
be linked to form correct trajectories. The association can be
estimated with Markov Chain Monte Carlo (MCMC) [4], [5],
linear programming [6], [7] or with a flow graph [8].
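As a toy illustration of linear-programming-based association
(not the formulation of this paper or of any specific work
cited above), consider linking detections across two consecutive
frames. With at-most-one-link constraints per detection, the
constraint matrix is totally unimodular, so the LP relaxation is
integral and an off-the-shelf solver recovers the exact discrete
assignment. The affinity values below are made up for the example.

import numpy as np
from scipy.optimize import linprog

# scores[i, j]: affinity of linking detection i at frame t
# to detection j at frame t+1 (illustrative values).
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.1]])
n, m = scores.shape
c = -scores.flatten()  # linprog minimizes, so negate the affinities

# Each detection links to at most one counterpart:
# every row sum and column sum of the link matrix is <= 1.
A_ub = np.zeros((n + m, n * m))
for i in range(n):
    A_ub[i, i * m:(i + 1) * m] = 1  # row-sum constraint for detection i
for j in range(m):
    A_ub[n + j, j::m] = 1           # column-sum constraint for detection j
b_ub = np.ones(n + m)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1), method="highs")
links = res.x.reshape(n, m).round().astype(int)
print(links)  # [[1 0 0], [0 1 0]]: the two high-affinity links are chosen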
Online methods have also been proposed in order to tackle
the runtime limitations of batch methods [1], [9]. Milan et
al. [10] use Recurrent Neural Networks (RNNs) to encode the
state space and solve the association problem.
Our work also expands on previous research on pixel
matching, which has typically been used for stereo estimation
and includes methods such as random forest classifiers [11],
Markov random fields (MRFs) [12] and, more classically,
slanted-plane models [13]. In our research, we focus on a
deep learning approach to the matching problem by exploiting
convolutional siamese networks [14], [15]. Previous methods,
however, focused on matching pairs of small image patches.
In [16], deep learning is exploited for tracking. However, this
approach is only similar to our method at a very high level:
using deep learning in a tracking by detection framework.
Our appearance matching is based on a fully convolutional
network with no requirements for optical flow, and learning is
done strictly via backpropagation. Furthermore, we reason in
3D, and the spatial branch of our matching networks corrects
for things such as ego-motion and car resemblance. In contrast,
[16] uses optical flow and is trained piecewise using Gradient
Boosting.
Tracking methods usually employ hand-crafted feature
extractors with distance functions such as Chi-Square or