Published as a conference paper at ICLR 2020
2 BACKGROUND AND RELATED WORK
Adversarial examples for object detection.
Since the first physical adversarial examples against a traffic sign classifier were demonstrated by
Eykholt et al. (2018), several works in adversarial machine learning (Eykholt et al., 2017; Xie et al.,
2017; Lu et al., 2017a;b; Zhao et al., 2018b; Chen et al., 2018) have focused on the visual perception
task in autonomous driving, and more specifically on object detection models. To achieve high attack
effectiveness in practice, the key
challenge is how to design robust attacks that can survive distortions in real-world driving scenarios
such as different viewing angles, distances, lighting conditions, and camera limitations. For example,
Lu et al. (2017a) show that AEs against Faster R-CNN (Ren et al., 2015) generalize well
across a sequence of images in digital space, but fail for most of the sequence in the physical world;
Eykholt et al. (2017) generate adversarial stickers that, when attached to a stop sign,
can fool the YOLOv2 (Redmon & Farhadi, 2017) object detector, though this is only demonstrated in an
indoor experiment at short distance; Chen et al. (2018) generate AEs based on expectation-over-transformation
techniques, but their evaluation shows that the AEs are not robust to multiple viewing
angles, probably because perspective transformations were not considered (Zhao et al., 2018b). Only
recently have physical adversarial attacks against object detectors achieved a decent success rate
(70%) in fixed-speed (6 km/h and 30 km/h) road tests (Zhao et al., 2018b).
While the current progress in attacking object detection is indeed impressive, in this paper we
argue that in the actual visual perception pipeline of autonomous driving, object tracking, or more
specifically MOT, is an integral step; without considering it, existing adversarial attacks against
object detection still cannot affect the visual perception results, even with a high attack success rate.
As shown in our evaluation in §4, with a common MOT setup, an attack on object detection needs
to reliably fool at least 60 consecutive frames to erase one object (e.g., a stop sign) from the tracking
results, in which case even a 98% attack success rate on the object detector is not enough (§4).
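A back-of-the-envelope estimate illustrates why a 98% per-frame success rate falls short, under our simplifying assumption (not a measured setting from this paper) that per-frame attack successes are independent:

```python
# Probability that an attack with per-frame success rate p fools all of
# n consecutive frames, assuming per-frame independence (an illustrative
# simplification: real per-frame successes may be correlated).
p, n = 0.98, 60
prob_all_frames = p ** n
print(round(prob_all_frames, 3))  # ~0.298, i.e., the object survives most attempts
```

So even a near-perfect per-frame attack erases the object from 60 consecutive frames less than a third of the time under this assumption.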
MOT background.
MOT aims to identify objects and their trajectories in a video frame sequence.
With the recent advances in object detection, tracking-by-detection (Luo et al., 2014) has become
the dominant MOT paradigm, where the detection step identifies the objects in the images and the
tracking step links the objects to the trajectories (i.e., trackers). Such paradigm is widely adopted
in autonomous driving systems today (Baidu; Kato et al., 2018; 2015; Zhao et al., 2018a; Ess et al.,
2010; MathWorks; Udacity), and a more detailed illustration is in Fig. 1. As shown, each detected
object at time t is associated with a dynamic state model (e.g., position, velocity), which
represents the past trajectory of the object (track|t−1). A per-track Kalman filter (Baidu; Kato et al.,
2018; Feng et al., 2019; Murray, 2017; Yoon et al., 2016) is used to maintain the state model; it
operates in a recursive predict-update loop: the predict step estimates the current object state according
to a motion model, and the update step takes the detection results detc|t as a measurement to update its
state estimation result track|t.
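The predict-update loop above can be sketched as a minimal per-track Kalman filter with a constant-velocity motion model; the class name, state layout, and noise values below are illustrative assumptions, not the configuration of any particular autonomous driving system:

```python
# Minimal per-track Kalman filter sketch: state is [x, y, vx, vy] for a
# bbox center, with a constant-velocity motion model. Noise covariances
# are illustrative placeholders.
import numpy as np

class KalmanTracker:
    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])      # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                    # state covariance
        self.F = np.array([[1, 0, dt, 0],            # constant-velocity
                           [0, 1, 0, dt],            # transition matrix
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],             # measurement model:
                           [0, 1, 0, 0]], dtype=float)  # position only
        self.Q = np.eye(4) * 0.01                    # process noise
        self.R = np.eye(2) * 1.0                     # measurement noise

    def predict(self):
        # Predict step: extrapolate the state with the motion model.
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                        # predicted position

    def update(self, z):
        # Update step: fuse the detection result z (detc|t) into the
        # state estimate (track|t).
        y = z - self.H @ self.state                  # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

After a few predict-update cycles on a moving detection, the velocity components of the state converge, which is what lets the predict step extrapolate the object's position between frames.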
The association between detected objects and existing trackers is formulated as a bipartite matching
problem (Sharma et al., 2018; Feng et al., 2019; Murray, 2017) based on pairwise similarity
costs between the trackers and the detected objects, and the most commonly used similarity metric is
a spatial-based cost, which measures the overlap between bounding boxes, or bboxes (Baidu;
Long et al., 2018; Xiang et al., 2015; Sharma et al., 2018; Feng et al., 2019; Murray, 2017; Zhu
et al., 2018; Yoon et al., 2016; Bergmann et al., 2019; Bewley et al., 2016). To reduce errors in this
association, an accurate velocity estimate is necessary in the Kalman filter prediction (Choi, 2015;
Yilmaz et al., 2006). Because camera frames are discrete, the Kalman filter uses its velocity model
to estimate the location of the tracked object in the next frame, compensating for the object's
motion between frames. However, as described later in §3, this error-reduction process unexpectedly
makes it possible to perform tracker hijacking.
MOT manages tracker creation and deletion with two thresholds. Specifically, a new tracker is
created only after its object has been consistently detected for a certain number of frames; this
threshold is referred to as the hit count, or H, in the rest of the paper. It helps filter out
occasional false positives produced by object detectors. Conversely, a tracker is deleted if no
object has been associated with it for a duration of R frames, called the reserved age; this prevents
trackers from being accidentally deleted due to infrequent false negatives of object detectors. The
configuration of R and H usually depends on both the accuracy of the detection models and the frame
rate (fps). Previous work suggests a configuration of R = 2·fps and H = 0.2·fps (Zhu et al., 2018),
which gives R = 60 frames and H = 6 frames for a common 30-fps visual perception system. We