Published as a conference paper at ICLR 2020
2 BACKGROUND AND RELATED WORK
Adversarial examples for object detection.
Since the first physical adversarial examples against a traffic sign classifier were demonstrated by
Eykholt et al. (2018), several works in adversarial machine learning (Eykholt et al., 2017; Xie et al.,
2017; Lu et al., 2017a;b; Zhao et al., 2018b; Chen et al., 2018) have focused on the visual perception
task in autonomous driving, and more specifically on object detection models. To achieve high attack
effectiveness in practice, the key
challenge is how to design robust attacks that can survive distortions in real-world driving scenarios
such as different viewing angles, distances, lighting conditions, and camera limitations. For example,
Lu et al. (2017a) show that AEs against Faster R-CNN (Ren et al., 2015) generalize well
across a sequence of images in digital space, but fail for most of the sequence in the physical world;
Eykholt et al. (2017) generate adversarial stickers that, when attached to a stop sign,
can fool the YOLOv2 (Redmon & Farhadi, 2017) object detector, though this is only demonstrated in an
indoor experiment at short distance; Chen et al. (2018) generate AEs based on expectation-over-transformation
techniques, but their evaluation shows that the AEs are not robust to multiple viewing
angles, probably because perspective transformations were not considered (Zhao et al., 2018b). Only
recently have physical adversarial attacks against object detectors achieved a decent success rate
(70%) in fixed-speed (6 km/h and 30 km/h) road tests (Zhao et al., 2018b).
While the current progress in attacking object detection is indeed impressive, in this paper we
argue that in the actual visual perception pipeline of autonomous driving, object tracking, or more
specifically MOT, is an integral step; without considering it, existing adversarial attacks against
object detection still cannot affect the visual perception results, even with a high attack success rate.
As shown in our evaluation in §4, with a common MOT setup, an attack on object detection needs
to reliably fool at least 60 consecutive frames to erase one object (e.g., a stop sign) from the tracking
results, in which case even a 98% attack success rate on the object detector is not enough (§4).
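A back-of-the-envelope estimate illustrates why a 98% per-frame success rate falls short, under our simplifying assumption (not a measured setting from this paper) that per-frame attack successes are independent:

```python
# Probability that an attack with per-frame success rate p fools all of
# n consecutive frames, assuming per-frame independence (an illustrative
# simplification: real per-frame successes may be correlated).
p, n = 0.98, 60
prob_all_frames = p ** n
print(round(prob_all_frames, 3))  # ~0.298, i.e., the object survives most attempts
```

So even a near-perfect per-frame attack erases the object from 60 consecutive frames less than a third of the time under this assumption.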
MOT background.
MOT aims to identify objects and their trajectories in a video frame sequence.
With the recent advances in object detection, tracking-by-detection (Luo et al., 2014) has become
the dominant MOT paradigm, where the detection step identifies the objects in the images and the
tracking step links the objects to the trajectories (i.e., trackers). Such paradigm is widely adopted
in autonomous driving systems today (Baidu; Kato et al., 2018; 2015; Zhao et al., 2018a; Ess et al.,
2010; MathWorks; Udacity), and a more detailed illustration is in Fig. 1. As shown, each detected
object at time t is associated with a dynamic state model (e.g., position, velocity), which
represents the past trajectory of the object (track|t−1). A per-track Kalman filter (Baidu; Kato et al.,
2018; Feng et al., 2019; Murray, 2017; Yoon et al., 2016) is used to maintain the state model; it
operates in a recursive predict-update loop: the predict step estimates the current object state according
to a motion model, and the update step takes the detection results detc|t as a measurement to update its
state estimation result track|t.
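The predict-update loop above can be sketched as a minimal per-track Kalman filter with a constant-velocity motion model; the class name, state layout, and noise values below are illustrative assumptions, not the configuration of any particular autonomous driving system:

```python
# Minimal per-track Kalman filter sketch: state is [x, y, vx, vy] for a
# bbox center, with a constant-velocity motion model. Noise covariances
# are illustrative placeholders.
import numpy as np

class KalmanTracker:
    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])      # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                    # state covariance
        self.F = np.array([[1, 0, dt, 0],            # constant-velocity
                           [0, 1, 0, dt],            # transition matrix
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],             # measurement model:
                           [0, 1, 0, 0]], dtype=float)  # position only
        self.Q = np.eye(4) * 0.01                    # process noise
        self.R = np.eye(2) * 1.0                     # measurement noise

    def predict(self):
        # Predict step: extrapolate the state with the motion model.
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                        # predicted position

    def update(self, z):
        # Update step: fuse the detection result z (detc|t) into the
        # state estimate (track|t).
        y = z - self.H @ self.state                  # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

After a few predict-update cycles on a moving detection, the velocity components of the state converge, which is what lets the predict step extrapolate the object's position between frames.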
The association between detected objects and existing trackers is formulated as a bipartite matching
problem (Sharma et al., 2018; Feng et al., 2019; Murray, 2017) based on pairwise similarity
costs between the trackers and the detected objects, and the most commonly used similarity metric is
a spatial-based cost, which measures the overlap between bounding boxes, or bboxes (Baidu;
Long et al., 2018; Xiang et al., 2015; Sharma et al., 2018; Feng et al., 2019; Murray, 2017; Zhu
et al., 2018; Yoon et al., 2016; Bergmann et al., 2019; Bewley et al., 2016). To reduce errors in this
association, an accurate velocity estimate is necessary in the Kalman filter prediction (Choi, 2015;
Yilmaz et al., 2006). Because camera frames are discrete, the Kalman filter uses its velocity model
to estimate the location of the tracked object in the next frame, compensating for the object's
motion between frames. However, as described later in §3, this error-reduction process unexpectedly
makes it possible to perform tracker hijacking.
MOT manages tracker creation and deletion with two thresholds. Specifically, a new tracker is
created only after its object has been consistently detected for a certain number of frames; this
threshold is referred to as the hit count, or H, in the rest of the paper. It helps filter out
occasional false positives produced by object detectors. Conversely, a tracker is deleted if no
object has been associated with it for a duration of R frames, called the reserved age; this prevents
trackers from being accidentally deleted due to infrequent false negatives of object detectors. The
configuration of R and H usually depends on both the accuracy of the detection models and the frame
rate (fps). Previous work suggests a configuration of R = 2·fps and H = 0.2·fps (Zhu et al., 2018),
which gives R = 60 frames and H = 6 frames for a common 30-fps visual perception system. We