conditions via $S^+$ or $S^-$, we identify which measure applies to the most salient pair $(p^*, q^*)$ to decide on the perturbation direction $\theta'$ accordingly. A history vector $\eta$ is added to prevent oscillatory perturbations. Similar to NT-JSMA, M-JSMA terminates when the predicted class $\hat{y}(x) = \arg\max_c f(x)^{(c)}$ no longer matches the true class $y$.
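To make the shared control flow concrete, the sketch below outlines the non-targeted outer loop common to NT-JSMA and M-JSMA: at each iteration the most salient pixel pair is perturbed, and the loop stops as soon as the prediction leaves the true class. This is a minimal illustration under stated assumptions rather than a reference implementation; in particular, `select_pair` is a hypothetical helper standing in for the $S^+$/$S^-$ saliency computation (and, for M-JSMA, the search over candidate classes and directions), and `f` is assumed to be a PyTorch classifier over inputs scaled to $[0, 1]$.

```python
import torch

def non_targeted_jsma(f, x, y_true, select_pair, theta=1.0, max_iters=100):
    """Sketch of the NT-/M-JSMA outer loop: perturb the most salient pixel
    pair each iteration, stop once the predicted class differs from y_true.
    `select_pair` is a hypothetical stand-in for the S+/S- saliency search."""
    x_adv = x.clone()
    flat = x_adv.view(-1)                    # flattened view for pixel indexing
    history = torch.zeros_like(flat)         # history vector eta, discourages oscillation
    for _ in range(max_iters):
        probs = f(x_adv.unsqueeze(0)).softmax(dim=-1).squeeze(0)
        if probs.argmax().item() != y_true:  # termination: predicted class != true class
            return x_adv, True
        # Most salient pair (p*, q*) and signed step theta' (+theta or -theta),
        # chosen via S+ or S- while consulting the history vector.
        (p, q), direction = select_pair(f, x_adv, y_true, history)
        for idx in (p, q):
            flat[idx] = (flat[idx] + direction * theta).clamp(0.0, 1.0)
            history[idx] += direction
    return x_adv, False
```

In M-JSMA, `select_pair` would score both $S^+$ and $S^-$ over every candidate class and return whichever pair is most salient overall, together with the matching perturbation sign.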
Table 4: Performance comparison of the original JSMA, non-targeted JSMA, and maximal JSMA variants ($|\theta| = 1$, $= 1$): % of successful attacks, average $L_0$ and $L_2$ perturbation distances, and average entropy $H(f(x))$ of misclassified softmax prediction probabilities.

Attack    |          MNIST          |         F-MNIST         |         CIFAR10
          |     %    L0    L2     H |     %    L0    L2     H |     %    L0    L2     H
JSMA+F    |   100  34.8  4.32  0.90 |  99.9  93.1  6.12  1.22 |   100  34.7  3.01  1.27
JSMA-F    |   100  32.1  3.88  0.88 |  99.9  82.2  4.37  1.21 |   100  36.9  2.13  1.23
NT-JSMA+F |   100  17.6  3.35  0.64 |   100  18.8  3.27  1.03 |  99.9  17.5  2.36  1.16
NT-JSMA-F |   100  19.7  3.44  0.70 |  99.9  33.2  2.99  0.98 |  99.9  19.6  1.68  1.12
M-JSMA_F  |   100  14.9  3.04  0.62 |  99.9  18.7  3.42  1.02 |  99.9  17.4  2.16  1.12
Table 4 summarizes attacks carried out on correctly-classified test-set instances in the MNIST [116], Fashion MNIST [119], and CIFAR10 [120] datasets, using the targeted, non-targeted, and maximal JSMA variants. For targeted attacks, we consider only adversaries that were misclassified in the fewest iterations over target classes. The JSMA+F results showed that on average only (34.8 $L_0$ distance)/(28 × 28 pixels of an MNIST image) = 4.4% of pixels needed to be perturbed in order to create adversaries, thus corroborating findings from [80]. More importantly, as evidenced by lower $L_0$ values, NT-JSMA found adversaries much faster than the fastest targeted attacks across all 3 datasets, while M-JSMA was consistently even faster and on average only perturbed (14.9 $L_0$ distance)/(28 × 28 pixels) = 1.9% of input pixels. Additionally, the quality of adversaries found by NT-JSMA and M-JSMA was also superior, as indicated by smaller $L_2$ perceptual differences between the adversaries $x'$ and the original inputs $x$, and by lower misclassification uncertainty, as reflected by the prediction entropy $H(f(x)) = -\sum_c f(x)^{(c)} \cdot \log f(x)^{(c)}$. Since M-JSMA considers all possible target classes, as well as both the $S^+$ and $S^-$ measures and both perturbation directions, these results show that it inherits the combined benefits of the original JSMA and NT-JSMA.
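The three quality measures reported in Table 4 are straightforward to compute from an original input $x$, its adversary $x'$, and the softmax output $f(x')$; a minimal NumPy sketch (assuming flattened arrays and probabilities that already sum to one) is shown below.

```python
import numpy as np

def l0_distance(x, x_adv):
    """Number of input features (pixels) that were changed."""
    return int(np.count_nonzero(x != x_adv))

def l2_distance(x, x_adv):
    """Euclidean (perceptual) distance between original and adversary."""
    return float(np.linalg.norm((x_adv - x).ravel()))

def prediction_entropy(probs, eps=1e-12):
    """H(f(x)) = -sum_c f(x)^(c) * log f(x)^(c); lower entropy means the
    (mis)classification is made with higher confidence."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.sum(probs * np.log(probs)))
```

For example, the 14.9 average $L_0$ distance reported for M-JSMA_F on MNIST corresponds to 14.9/(28 × 28) ≈ 1.9% of the input pixels being modified.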
4.9 Substitute Blackbox Attack
All of the techniques covered so far are whitebox attacks, relying upon access to a model’s innards. Papernot
et al. [41] proposed one of the early practical blackbox methods, called the Substitute Blackbox Attack (SBA).
The key idea is to train a substitute model to mimic the blackbox model, and use whitebox attack methods
on this substitute. This approach leverages the transferability property of adversarial examples. Concretely,
the attacker first gathers a synthetic dataset, obtains predictions on the synthetic dataset from the targeted
model, and then trains a substitute model to imitate the targeted model’s predictions.
After the substitute model is trained, adversaries can be generated using any whitebox attack, since the details of the substitute model are known (e.g., [41] used FGSM [32] (see Section 4.2) and JSMA [80] (see Section 4.8)). We refer to an SBA variant by the type of adversarial attack used against the substitute model; for example, if the attacker uses FGSM to attack the substitute model, we refer to this as FGSM-SBA.
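As an illustration of the FGSM-SBA pipeline, the sketch below first fits a substitute network to the labels returned by the black-box target and then runs FGSM against that substitute; by transferability, the resulting adversaries are then submitted to the target. The `oracle` interface, optimizer settings, and $\epsilon$ value are assumptions chosen for brevity and do not reproduce the exact experimental setup of [41].

```python
import torch
import torch.nn.functional as F

def train_substitute(oracle, synthetic_x, substitute, epochs=10, lr=1e-2):
    """Fit the substitute to imitate the black-box oracle, which is assumed
    to return only predicted class labels for the queried inputs."""
    with torch.no_grad():
        labels = oracle(synthetic_x)
    opt = torch.optim.SGD(substitute.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(substitute(synthetic_x), labels)
        loss.backward()
        opt.step()
    return substitute

def fgsm_sba(substitute, x, y, eps=0.1):
    """Craft adversaries with FGSM against the white-box substitute; by
    transferability they are likely to also fool the black-box target."""
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(substitute(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```

Here $y$ would typically be the labels assigned by the oracle, so that increasing the loss pushes the adversary away from the class the substitute (and, ideally, the target) currently predicts.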
The success of this approach depends on choosing adequately-similar synthetic data samples and a sub-
stitute model architecture using high-level knowledge of the target classifier setup. As such, an intimate
knowledge of the domain and the targeted model is likely to aid the attacker. Even in the absence of such specific expertise, the transferability property suggests that adversaries generated from a well-trained substitute model are likely to fool the targeted model as well.
Papernot et al. [41] note that in practice the attacker is constrained from making unlimited queries to the
targeted model. Consequently, the authors introduced the Jacobian-based Dataset Augmentation technique,
which generates a limited number of additional samples around a small initial synthetic dataset to efficiently
replicate the target model’s decision boundaries. Concretely, given an initial sample x, one calculates the