DeepSignals: Predicting Intent of Drivers Through Visual Signals
Davi Frossard 1,2, Eric Kee 1, Raquel Urtasun 1,2
1 Uber Advanced Technologies Group, 2 University of Toronto
{frossard, ekee, urtasun}@uber.com
Abstract— Detecting the intention of drivers is an essential task in self-driving, necessary to anticipate sudden events like lane changes and stops. Turn signals and emergency flashers communicate such intentions, providing seconds of potentially critical reaction time. In this paper, we propose to detect these signals in video sequences using a deep neural network that reasons about both spatial and temporal information. Our experiments on more than a million frames show high per-frame accuracy in very challenging scenarios.
I. INTRODUCTION
Autonomous driving has emerged as one of the most impactful applications of Artificial Intelligence (AI), with the potential to change the way we live. Before self-driving cars become the norm, however, humans and robots will have to share the roads. In this shared scenario, communication between vehicles is critical to alert others of maneuvers that would otherwise be sudden or dangerous. A social understanding of human intent is therefore essential to the progress of self-driving. This poses additional complexity for self-driving systems, as such interactions are generally difficult to learn.
Drivers communicate their intent to make unexpected maneuvers in order to give warning much further in advance than would otherwise be possible to infer from motion alone. Driver movements do communicate intent (for example, slowing down to indicate that a merge will be allowed, or driving close to a lane boundary to indicate a desired merge position), but such motion cues are subtle, context dependent, and near-term. In contrast, visual signals, and in particular signal lights, are unambiguous and can be given far in advance to warn of unexpected maneuvers.
For example, without detecting a turn signal, a parked car may appear as likely to remain parked as to pull into oncoming traffic. Analogously, when drivers plan to cut in front of another vehicle, they will generally signal in advance for safety. Buses likewise signal with flashers when stopping to pick up and drop off passengers, allowing vehicles approaching from behind to change lanes, thereby reducing delays and congestion.
These everyday behaviors are safe when drivers understand the intentions of their peers, but are dangerous if visual signals are ignored. Humans expect self-driving vehicles to respond. We therefore consider in this work the problem of predicting driver intent through visual signals, and focus specifically on interpreting signal lights.
Fig. 1: A vehicle signaling left passes through an occlusion. The actor's intent to turn left is correctly detected (left arrow), even during the occlusion (question mark).

Estimating the state of turn signals is a difficult problem: The visual evidence is small (typically only a few pixels), particularly at range, and occlusions are frequent. In addition,
intra-class variations can be large. While some regulation
exists, many vehicles have stylized blinkers, such as light
bars with sequential lights in the direction being signaled,
and the regulated frequency of blinking (1.5 ± 0.5 Hz [1])
is not always followed. Furthermore, since we are interested in estimating intent, vehicle pose must also be decoded: a left turn signal corresponds to a flashing light on the left side of a vehicle we are following, but to a flashing light on the right side of an oncoming vehicle. We refer the reader to Figure 2 for an illustration of some of the challenges of turn signal estimation.
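To make this pose dependence concrete, the following minimal sketch (our own illustration, not a component of the proposed network; the enum names and function are hypothetical) maps the image side of an observed flashing light to the signaled direction, given the vehicle's orientation relative to the camera.

from enum import Enum

class Orientation(Enum):
    LEADING = "leading"    # vehicle ahead of us, facing away from the camera
    ONCOMING = "oncoming"  # vehicle facing toward the camera

class Side(Enum):
    IMAGE_LEFT = "image_left"
    IMAGE_RIGHT = "image_right"

def signaled_direction(flashing: Side, orientation: Orientation) -> str:
    # For a leading vehicle, image-left coincides with the vehicle's left.
    # For an oncoming vehicle, the mapping is mirrored: its left side
    # appears on the right of the image.
    if orientation is Orientation.LEADING:
        return "left" if flashing is Side.IMAGE_LEFT else "right"
    return "right" if flashing is Side.IMAGE_LEFT else "left"

# A light flashing on the image-left of an oncoming vehicle is the
# driver's right turn signal.
assert signaled_direction(Side.IMAGE_LEFT, Orientation.ONCOMING) == "right"

Of course, real traffic contains arbitrary orientations and turning vehicles, which is part of why a learned model is preferable to such hand-written rules.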
Surprisingly little work in the literature has considered this problem. Earlier published works [2], [3] use hand-engineered features, are trained in part on synthetic data, and are evaluated on limited datasets. Other approaches have considered only nighttime scenarios [4], [5]. Such methods are unlikely to generalize to the diversity of driving scenarios encountered every day.
In this paper, we identify visual signal detection as an important problem in self-driving. We introduce a large-scale dataset of vehicle signals, and propose a modern deep learning approach to directly estimate turn signal states from diverse, real-world video sequences. A principled network is designed to model the subproblems of turn signal detection: attention, scene understanding, and temporal signal detection. This results in a differentiable system that can be trained end-to-end using deep learning techniques, rather than relying upon hard-coded premises of how turn signals should behave.
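As a rough illustration of such a spatio-temporal design, the sketch below pairs a convolutional per-frame encoder with a recurrent layer and a per-frame classification head. It is a minimal PyTorch sketch under our own assumptions (a ResNet-18 encoder, a single LSTM, and five output states such as left, right, flashers, off, and unknown); it is not the exact architecture proposed in this paper.

import torch
import torch.nn as nn
from torchvision import models

class TurnSignalNet(nn.Module):
    # CNN + LSTM sketch: a convolutional encoder embeds each frame, a
    # recurrent layer reasons over time, and a linear head outputs a
    # turn signal state per frame.
    def __init__(self, num_states: int = 5, hidden: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose the 512-d frame feature
        self.encoder = backbone
        self.temporal = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_states)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, H, W) crops of a tracked vehicle
        b, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1))        # (b*t, 512)
        hidden, _ = self.temporal(feats.view(b, t, -1))  # (b, t, hidden)
        return self.head(hidden)                         # per-frame logits

# Example: two 8-frame sequences of 224x224 crops.
logits = TurnSignalNet()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 8, 5])

Because every component is differentiable, the encoder, temporal model, and classifier can be trained jointly from labeled video, which is the end-to-end property referred to above.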
We demonstrate the effectiveness of our approach on a
new, challenging real-world dataset comprising 34 hours of