is typically performed in a white-box fashion, and so in order to utilize and properly compare against
the adversarial training techniques of Madry et al. (2017), it is important to have strong white-box
attacks.
For ease of presentation, we will describe the attacks assuming that $f : \mathbb{R} \to \mathbb{R}^k$ discretizes inputs
into thermometer encodings; in order to attack one-hot encodings, simply replace all instances of
$f_{\text{therm}}$ with $f_{\text{onehot}}$, $\tau$ with $\chi$, and $C$ with the identity function $I$. We represent the adversarial
image after $t$ steps of the attack as $z^t$, where the value of the $i$th pixel is $z_i^t$.
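To ground this notation, the following is a minimal NumPy sketch of the encoding functions, under the assumption of $k$ buckets uniformly quantizing pixel values in $[0, 1]$; the bucket count K = 16 and the uniform bucketing function b are illustrative choices rather than values from the text, while chi, tau, C, f_onehot, and f_therm mirror the notation above.

```python
import numpy as np

K = 16  # number of buckets; an illustrative choice

def b(x, k=K):
    """Bucket index of a pixel value x in [0, 1] (uniform quantization is our assumption)."""
    return int(np.clip(np.floor(x * k), 0, k - 1))

def chi(l, k=K):
    """One-hot code of bucket l: a vertex of the simplex Delta^k."""
    v = np.zeros(k)
    v[l] = 1.0
    return v

def C(v):
    """Cumulative sum along the bucket axis, so that C(chi(l)) = tau(l)."""
    return np.cumsum(v)

def tau(l, k=K):
    """Thermometer code of bucket l: ones from position l upward (the cumulative-sum convention)."""
    return C(chi(l, k))

def f_onehot(x, k=K):
    return chi(b(x, k), k)

def f_therm(x, k=K):
    return tau(b(x, k), k)
```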
The first attack, Discrete Gradient Ascent (DGA), follows the direction of the gradient of the loss
with respect to f(x), but is constrained at every step to be a discretized vector. If we have discretized
the input image into k-dimensional vectors using the one-hot encoding, this corresponds to moving
to a vertex of the simplex $(\Delta^k)^n$ at every step. The second attack, Logit-Space Projected Gradient
Ascent (LS-PGA), relaxes this assumption, allowing intermediate iterates to be in the interior of the
simplex. The final adversarial image is obtained by projecting the final point back to the nearest
vertex of the simplex.
Note that if the number of attack steps is 1, then the two attacks are equivalent; however, for larger
numbers of attack steps, LS-PGA is a generalization of DGA.
2.3.1 DISCRETE GRADIENT ASCENT (DGA)
Following PGD (Madry et al., 2017), we initialize DGA by placing each pixel into a random bucket
that is within ε of the pixel’s true value. At each step of the attack, we look at all buckets that are
within ε of the true value, and select the bucket likely to do the most ‘harm’, as estimated from the
gradient of the model’s loss at the previous step: for each candidate bucket, we linearly approximate
the change in loss from setting that bucket’s indicator variable to 1.
$$z_i^0 = f_{\text{therm}}\big(x_i + U(-\varepsilon, \varepsilon)\big)$$

$$\text{harm}(z_i^t)_l = \begin{cases} (z_i^t - \tau(l))^\top \cdot \dfrac{\partial L(z^t)}{\partial z_i^t} & \text{if } \exists\,(-\varepsilon \le \eta \le \varepsilon) \text{ s.t. } b(x_i + \eta) = l \\[4pt] 0 & \text{otherwise.} \end{cases}$$

$$z_i^{t+1} = \tau\Big(\operatorname*{arg\,max}_l \, \text{harm}(z_i^t)_l\Big)$$
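To make the update rule concrete, here is a minimal NumPy sketch of the initialization and one DGA step for a single image, reusing b, tau, f_therm, and K from the earlier sketch; the caller-supplied loss_grad standing in for $\partial L(z^t)/\partial z_i^t$ is our assumption, since the model itself is not shown.

```python
import numpy as np

def dga_init(x, eps, k=K):
    """z^0: thermometer-encode a uniformly perturbed copy of the image."""
    eta = np.random.uniform(-eps, eps, size=x.shape)
    return np.stack([f_therm(v, k) for v in np.clip(x + eta, 0.0, 1.0)])

def dga_step(z, x, loss_grad, eps, k=K):
    """One Discrete Gradient Ascent step.

    z:         current encoding, shape (n, k), one code row per pixel
    x:         true pixel values in [0, 1], shape (n,)
    loss_grad: caller-supplied callable returning dL(z)/dz, shape (n, k)
    """
    g = loss_grad(z)                          # gradient of the loss at the previous iterate
    z_next = np.empty_like(z)
    for i, xi in enumerate(x):
        lo = b(max(xi - eps, 0.0), k)         # lowest bucket reachable within eps
        hi = b(min(xi + eps, 1.0), k)         # highest bucket reachable within eps
        # We use -inf (rather than the equation's 0) so disallowed buckets never win the argmax.
        harm = np.full(k, -np.inf)
        for l in range(lo, hi + 1):
            harm[l] = (z[i] - tau(l, k)) @ g[i]   # harm(z_i^t)_l from the equation above
        z_next[i] = tau(int(np.argmax(harm)), k)
    return z_next
```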
Because the outcome of this optimization procedure will vary depending on the initial random per-
turbation, we suggest strengthening the attack by re-running it several times and using the pertur-
bation with the greatest loss. The pseudo-code for the DGA attack is given in Section B of the
appendix.
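A possible restart wrapper over the sketches above, with the step and restart counts as purely illustrative values (the paper's actual pseudo-code lives in its appendix):

```python
def dga_attack(x, loss, loss_grad, eps, steps=7, restarts=3, k=K):
    """Re-run DGA from several random initializations; keep the highest-loss result."""
    best_z, best_loss = None, -np.inf
    for _ in range(restarts):
        z = dga_init(x, eps, k)
        for _ in range(steps):
            z = dga_step(z, x, loss_grad, eps, k)
        current = loss(z)                 # caller-supplied scalar loss on the encoding
        if current > best_loss:
            best_z, best_loss = z, current
    return best_z
```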
2.3.2 LOGIT-SPACE PROJECTED GRADIENT ASCENT (LS-PGA)
To perform LS-PGA, we soften the discrete encodings into continuous relaxations, and then perform
standard Projected Gradient Ascent (PGA) on these relaxed values. We represent the distribution
over embeddings as a softmax over logits u, each corresponding to the unnormalized log-weight
of a specific bucket’s embedding. To improve the attack, we scale the logits with temperature T ,
allowing us to trade off between how closely our softmax approximates a true one-hot distribution
as in the Gumbel-softmax trick (Jang et al., 2016; Maddison et al., 2016), and how much gradient
signal the logits receive. At each step of a multi-step attack, we anneal this value via exponential
decay with rate δ.
$$z_i^t = C\!\left(\sigma\!\left(\frac{u_i^t}{T^t}\right)\right) \qquad\quad z_i^{\text{final}} = \tau\!\left(\operatorname*{arg\,max}\, u_i^{\text{final}}\right) \qquad\quad T^t = T^{t-1} \cdot \delta$$
We initialize each of the logits randomly with values sampled from a standard normal distribution.
At each step, we ensure that the model does not assign any probability to buckets that are not
within ε of the true value by fixing the corresponding logits to −∞. The model’s loss is a continuous
function of the logits, so we can simply utilize attacks designed for continuous-valued inputs, in
this case PGA.
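Putting the pieces together, this is a minimal sketch of LS-PGA for a single image, again reusing b, C, tau, and K from the first sketch; the step count, learning rate, initial temperature, and the caller-supplied grad_wrt_logits (which must differentiate the model's loss through $z_i = C(\sigma(u_i / T))$, e.g. via an autodiff framework) are all our assumptions.

```python
import numpy as np

def ls_pga(x, grad_wrt_logits, eps, steps=7, lr=0.1, T0=1.0, delta=0.9, k=K):
    """Logit-Space Projected Gradient Ascent for a single image.

    grad_wrt_logits: caller-supplied callable (u, T) -> dL/du, where the caller
    forms the relaxed encoding z_i = C(softmax(u_i / T)) and differentiates the
    model's loss through it (not shown here).
    """
    n = x.shape[0]
    u = np.random.randn(n, k)                # logits drawn from a standard normal
    reachable = np.zeros((n, k), dtype=bool)
    for i, xi in enumerate(x):
        reachable[i, b(max(xi - eps, 0.0), k): b(min(xi + eps, 1.0), k) + 1] = True
    u[~reachable] = -np.inf                  # no probability mass outside the eps-ball
    T = T0
    for _ in range(steps):
        g = grad_wrt_logits(u, T)            # dL/du through z = C(softmax(u / T))
        u = u + lr * np.nan_to_num(g)        # gradient ascent on the loss
        u[~reachable] = -np.inf              # re-fix the disallowed logits
        T *= delta                           # anneal: T^t = T^{t-1} * delta
    # Project to the nearest vertex: hard code of the highest-logit bucket.
    return np.stack([tau(int(np.argmax(u[i])), k) for i in range(n)])
```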