iteratively. At each iteration, the pixel with the highest derivative is modified by a fixed amount (the
attack budget), the saliency map is recomputed, and the process repeats until the prediction changes
to a target class. The adversarial images produced by JSMA are subtle and effective for attacks, but
they remain excessively expensive to compute.
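The greedy loop described above can be sketched as follows. This is a deliberately simplified stand-in, not the actual JSMA saliency map (which scores pairs of pixels via the forward Jacobian); the linear model, its weights, and the step size `theta` are illustrative assumptions.

```python
import numpy as np

# Toy linear classifier (illustrative weights, not from the paper).
W = np.array([[1.0, 0.0, -1.0],
              [-1.0, 0.5, 1.0]])

predict = lambda v: int(np.argmax(W @ v))
grad_target = lambda v, t: W[t]  # d(target logit)/d(input) for a linear model

def greedy_saliency_attack(x, target, theta=0.1, max_iters=50):
    # Simplified JSMA-style loop: at each iteration, increase the single
    # pixel whose derivative most favors the target class by a fixed
    # budget theta, recompute, and stop once the prediction flips.
    x_adv = x.copy()
    for _ in range(max_iters):
        if predict(x_adv) == target:
            break
        g = grad_target(x_adv, target)
        i = int(np.argmax(g))            # most salient pixel
        x_adv[i] = min(x_adv[i] + theta, 1.0)
    return x_adv

x = np.array([0.8, 0.2, 0.1])            # toy input, predicted class 0
x_adv = greedy_saliency_attack(x, target=1)
```

Because only one pixel moves per iteration and the saliency must be recomputed each time, the cost of this loop grows with the number of pixels that need changing, which is the source of JSMA's expense.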
The fast gradient sign method (FGSM) [7] was introduced as a computationally inexpensive but
effective alternative to JSMA. FGSM follows the gradient direction of the cost function and
introduces a fixed amount of perturbation to maximize that cost. In practice, the examples produced
by this attack are more easily detectable and require a larger distortion to achieve misclassification
than those obtained from JSMA. An iterative version of FGSM, in which a smaller perturbation is applied
multiple times, was introduced by Kurakin et al. [14].
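Both variants can be sketched on a toy model. The linear classifier and all constants below are illustrative assumptions; the single-step function follows FGSM's sign-of-gradient rule, and the iterative variant applies smaller steps clipped to the original budget, in the spirit of Kurakin et al.

```python
import numpy as np

# Hypothetical linear classifier: logits = W x + b (illustrative weights).
W = np.array([[1.0, -2.0], [-1.0, 2.0]])
b = np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_loss_wrt_x(x, y):
    # Gradient of the cross-entropy loss w.r.t. the input for this linear model.
    p = softmax(W @ x + b)
    return W.T @ (p - np.eye(2)[y])

def fgsm(x, y, eps):
    # Single-step FGSM: move eps along the sign of the input gradient.
    return x + eps * np.sign(grad_loss_wrt_x(x, y))

def iterative_fgsm(x, y, eps, steps, alpha):
    # Iterative variant: repeated small steps, clipped to the eps-ball around x.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_loss_wrt_x(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

x = np.array([0.5, 0.2])      # toy input, true class 0
adv = fgsm(x, 0, eps=0.3)     # flips the prediction to class 1
```

A single gradient computation per example is what makes FGSM cheap relative to JSMA's per-pixel loop.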
Instead of using a fixed attack budget as the previous two methods do, DeepFool [22] was the first
method to compute and apply the minimal perturbation necessary for misclassification under the L2
norm. The method performs iterative steps in the adversarial direction of the gradient given by
a locally linear approximation of the classifier. This makes it more accurate than
FGSM and faster than JSMA, since all pixels are modified simultaneously at each step,
but its iterative nature still renders DeepFool computationally expensive. In [23, 24], the authors extend
DeepFool to craft a universal perturbation that can be applied indifferently to any instance: a fixed
distortion is computed from a set of inputs so as to maximize the predictive error of the model
on that sample. The perturbation is computed by a greedy approach and requires multiple iterations
over the given sample before converging. To the extent that the sample is representative of the data
distribution, the computed perturbation has a good chance of causing misclassification on unseen
samples as well.
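For an affine binary classifier the minimal L2 perturbation has a closed form, and this is the basic step that DeepFool iterates on a local linear approximation of a general classifier. A minimal sketch of that step, with illustrative weights:

```python
import numpy as np

def deepfool_linear(x, w, b, overshoot=0.02):
    # For an affine binary classifier f(x) = w.x + b, the minimal L2
    # perturbation reaching the decision boundary is -f(x) * w / ||w||^2.
    # A small overshoot pushes the point just past the boundary.
    f = w @ x + b
    r = -f * w / (w @ w)
    return x + (1 + overshoot) * r

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 0.5])          # f(x) = 2.0 > 0, so class +1
x_adv = deepfool_linear(x, w, b)  # crosses to the other side of the boundary
```

On a nonlinear classifier this step is repeated, relinearizing around the current point until the label changes, which is where the iterative cost comes from.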
One method that aims to compute good approximations of Problem (1) while keeping the computational
cost of perturbing examples low was proposed by Carlini and Wagner [2]. The authors cast
the formulation of Szegedy et al. [31] into a more efficient optimization problem, which allows them
to craft effective adversarial samples with low distortion. They define three similar targeted attacks,
based on different distortion measures: L2, L0 and L∞, respectively. In practice, even these attacks
are computationally expensive.
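The L2 variant, for instance, minimizes the distortion plus a margin-based surrogate of targeted misclassification. The sketch below evaluates such an objective for a hypothetical linear model; the model, weights, and constants `c` and `kappa` are illustrative assumptions, not the authors' exact parameterization.

```python
import numpy as np

def cw_l2_objective(delta, x, target, logits_fn, c=1.0, kappa=0.0):
    # Carlini-Wagner-style L2 objective: squared distortion plus a margin
    # term that becomes non-positive once the target class wins by kappa.
    z = logits_fn(x + delta)
    other = np.max(np.delete(z, target))
    margin = max(other - z[target], -kappa)
    return float(delta @ delta + c * margin)

# Toy linear model (illustrative weights).
logits = lambda v: np.array([[1.0, -1.0], [-1.0, 1.0]]) @ v

x = np.array([0.5, 0.0])  # currently classified as class 0
obj_zero = cw_l2_objective(np.zeros(2), x, target=1, logits_fn=logits)
obj_move = cw_l2_objective(np.array([-0.3, 0.3]), x, target=1, logits_fn=logits)
```

In the full attack this objective is minimized over delta with a gradient-based optimizer, and the constant c is itself tuned by search, which is one reason the attack remains expensive.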
If it is difficult to find new methods that are both effective at jeopardizing a model and computationally
affordable, defending against adversarial attacks is an even harder task. On the one hand, a good
defense should harden a model against any known attack; on the other hand, it should not compromise
the discriminative power of the model. In the following paragraph, we report the most effective
defenses proposed for tackling adversarial examples.
Defenses A common technique for defending a model against adversarial examples consists in augmenting
the training data with perturbed examples (a technique known as 'adversarial training' [31]),
either by feeding the model both true and adversarial examples or by training it with the modified
objective function:
Ĵ(θ, x, y) = αJ(θ, x, y) + (1 − α)J(θ, x + ∆x, y)
with J the original loss function. The aim of such a defense is to increase the model's robustness in specific
directions (those of the adversarial perturbations) by ensuring that it predicts the same class for a true
example and its perturbations along those directions. In practice, the additional instances are crafted
against the considered model using one or more attack strategies, such as FGSM [7], DeepFool [22] and
virtual adversarial examples [21].
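The modified objective can be evaluated directly. The sketch below uses a hypothetical linear model and a fixed perturbation ∆x; both are illustrative assumptions, and in practice ∆x would come from an attack such as FGSM.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def J(theta, x, y):
    # Original loss J(theta, x, y): cross-entropy of a toy linear model
    # theta = (W, b) on input x with true label y.
    W, b = theta
    return float(-np.log(softmax(W @ x + b)[y]))

def J_hat(theta, x, y, delta_x, alpha=0.5):
    # Modified objective: alpha*J(theta, x, y) + (1-alpha)*J(theta, x+delta_x, y).
    return alpha * J(theta, x, y) + (1 - alpha) * J(theta, x + delta_x, y)

theta = (np.array([[1.0, -1.0], [-1.0, 1.0]]), np.zeros(2))
x, y = np.array([0.5, 0.0]), 0
delta_x = np.array([-0.2, 0.2])   # stand-in for an attack-crafted perturbation
loss = J_hat(theta, x, y, delta_x)
```

Setting α = 1 recovers standard training, while lower values of α weight the loss on the perturbed copy more heavily.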
However, adversarially training a model is effective only against adversarial examples crafted on the
original model, an improbable situation given that an attacker might not have access to
exactly the same model for computing the perturbations. Additionally, adversarial training has been
shown to be easily bypassed by a two-step attack [32], which first applies a random perturbation
to an instance and then performs any classical attack technique. The success of this new attack,
and of black-box attacks in general, is due to the sharpness of the loss around the training examples:
if smoothing the loss in a few adversarial directions makes gradient-based attacks ineffective in those