Table 1 provides summary statistics for all
datasets, as well as the observed test performances
for additional context.
4 Experiments
We run a battery of experiments that aim to ex-
amine empirical properties of learned attention
weights and to interrogate their interpretability
and transparency. The key questions are: Do
learned attention weights agree with alternative,
natural measures of feature importance? And:
had we attended to different features, would the
prediction have been different?
More specifically, in Section 4.1, we empir-
ically analyze the correlation between gradient-
based feature importance and learned attention
weights, and between leave-one-out (LOO;
or ‘feature erasure’) measures and the same. In
Section 4.2 we then consider counterfactual (to
those observed) attention distributions. Under the
assumption that attention weights are explanatory,
such counterfactual distributions may be viewed
as alternative potential explanations; if these do
not correspondingly change model output, then it
is hard to argue that the original attention weights
provide meaningful explanation in the first place.
To generate counterfactual attention distribu-
tions, we first consider randomly permuting ob-
served attention weights and recording associated
changes in model outputs (4.2.1). We then pro-
pose explicitly searching for “adversarial” atten-
tion weights that maximally differ from the ob-
served attention weights (which one might show in
a heatmap and use to explain a model prediction),
and yet yield an effectively equivalent prediction
(4.2.2). The latter strategy also provides a use-
ful potential metric for the reliability of attention
weights as explanations: we can report a measure
quantifying how different attention weights can be
for a given instance without changing the model
output by more than some threshold ε.
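As a toy illustration of the permutation idea (not the authors' implementation), one can shuffle a set of attention weights over fixed per-token scores and measure the resulting change in the output distribution. The decoder below and all function names are hypothetical stand-ins.

```python
import math
import random

def tvd(y1, y2):
    # Total Variation Distance between two output distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(y1, y2))

def output_dist(alphas, token_scores):
    # Toy "decoder": attention-weighted score -> binary output distribution.
    s = sum(a * t for a, t in zip(alphas, token_scores))
    p = 1.0 / (1.0 + math.exp(-s))
    return [p, 1.0 - p]

def permutation_effect(alphas, token_scores, seed=0):
    # Shuffle the observed attention weights and record the output change.
    permuted = alphas[:]
    random.Random(seed).shuffle(permuted)
    return tvd(output_dist(alphas, token_scores),
               output_dist(permuted, token_scores))
```

Under this sketch, uniform attention weights yield zero TVD for any permutation, while peaked weights over heterogeneous token scores can move the output substantially.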
All results presented below are generated on test
sets. We present results for Additive attention;
results for Scaled Dot-Product attention are
generally comparable. We provide a web
interface to interactively browse the (very large
set of) plots for all datasets, model variants, and
experiment types: https://successar.github.
io/AttentionExplanation/docs/.
In the following sections, we use Total Variation Distance (TVD) as the measure of change between output distributions, defined as $\text{TVD}(\hat{y}_1, \hat{y}_2) = \frac{1}{2} \sum_{i=1}^{|\mathcal{Y}|} |\hat{y}_{1i} - \hat{y}_{2i}|$. We use the Jensen-Shannon Divergence (JSD) to quantify the difference between two attention distributions: $\text{JSD}(\alpha_1, \alpha_2) = \frac{1}{2} \text{KL}[\alpha_1 \,\|\, \frac{\alpha_1 + \alpha_2}{2}] + \frac{1}{2} \text{KL}[\alpha_2 \,\|\, \frac{\alpha_1 + \alpha_2}{2}]$.
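Both measures can be written directly in code. A minimal pure-Python sketch (function names are ours; inputs are probability distributions represented as lists):

```python
import math

def tvd(y1, y2):
    # Total Variation Distance: (1/2) * sum_i |y1_i - y2_i|.
    return 0.5 * sum(abs(a - b) for a, b in zip(y1, y2))

def kl(p, q):
    # KL divergence KL[p || q]; terms with p_i = 0 contribute zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(a1, a2):
    # Jensen-Shannon Divergence: average KL to the mixture distribution.
    m = [(x + y) / 2.0 for x, y in zip(a1, a2)]
    return 0.5 * kl(a1, m) + 0.5 * kl(a2, m)
```

With natural logarithms, JSD is bounded above by ln 2, attained by distributions with disjoint support; it is zero iff the two distributions are identical.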
4.1 Correlation Between Attention and
Feature Importance Measures
We empirically characterize the relationship be-
tween attention weights and corresponding fea-
ture importance scores. Specifically we measure
correlations between attention and: (1) gradient-based
measures of feature importance ($\tau_g$), and
(2) differences in model output induced by
leaving features out ($\tau_{loo}$). While these measures are
themselves insufficient for interpretation of neu-
ral model behavior (Feng et al., 2018), they do
provide measures of individual feature importance
with known semantics (Ross et al., 2017). It is thus
instructive to ask whether these measures correlate
with attention weights.
The process we follow to quantify this is described by Algorithm 1. We denote the input resulting from removing the word at position $t$ in $x$ by $x_{-t}$. Note that we disconnect the computation graph at the attention module so that the gradient does not flow through this layer: this means the gradients tell us how the prediction changes as a function of inputs, keeping the attention distribution fixed.⁴
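The "disconnect the computation graph" detail can be illustrated with a toy model and finite differences: the gradient with respect to an input is taken while the attention weights are pinned at their observed values. Everything below is an illustrative sketch, not the authors' implementation.

```python
import math

def forward(x, alphas=None):
    # Toy 1-D "encoder": one hidden scalar per token.
    h = [math.tanh(xi) for xi in x]
    if alphas is None:
        # Attention computed from hidden states (softmax over scores).
        e = [math.exp(hi) for hi in h]
        alphas = [ei / sum(e) for ei in e]
    # Toy "decoder": sigmoid of the attended summary.
    s = sum(a * hi for a, hi in zip(alphas, h))
    return 1.0 / (1.0 + math.exp(-s)), alphas

def grad_fixed_attention(x, t, eps=1e-6):
    # d(output)/d(x_t), with the attention distribution held fixed at its
    # observed values -- mimicking a detached attention module.
    y0, alphas = forward(x)
    xp = list(x)
    xp[t] += eps
    yp, _ = forward(xp, alphas=alphas)  # reuse the observed attention
    return (yp - y0) / eps
```

In an autodiff framework this corresponds to applying a stop-gradient (e.g., detaching the attention tensor) before the gradient of the output with respect to the inputs is computed.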
Algorithm 1 Feature Importance Computations
$h \leftarrow \text{Enc}(x)$, $\hat{\alpha} \leftarrow \text{softmax}(\phi(h, Q))$
$\hat{y} \leftarrow \text{Dec}(h, \hat{\alpha})$
$g_t \leftarrow |\sum_{w=1}^{|V|} \mathbb{1}[x_{tw} = 1] \, \frac{\partial \hat{y}}{\partial x_{tw}}|, \ \forall t \in [1, T]$
$\tau_g \leftarrow \text{Kendall-}\tau(\hat{\alpha}, g)$
$\Delta\hat{y}_t \leftarrow \text{TVD}(\hat{y}(x_{-t}), \hat{y}(x)), \ \forall t \in [1, T]$
$\tau_{loo} \leftarrow \text{Kendall-}\tau(\hat{\alpha}, \Delta\hat{y})$
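The correlations in Algorithm 1 can be computed with any Kendall-τ routine. A minimal O(T²) pairwise version is sketched below for illustration (in practice one might use an existing implementation such as scipy.stats.kendalltau):

```python
def kendall_tau(a, b):
    # Pairwise Kendall-tau: (concordant - discordant) / total pairs.
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Applied once per instance, the two arguments would be the attention weights and either the per-token gradient magnitudes (giving τ_g) or the per-token LOO output changes (giving τ_loo).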
Table 2 reports summary statistics of Kendall $\tau$ correlations for each dataset. Full distributions are shown in Figure 2, which plots histograms of $\tau_g$ for every data point in the respective corpora. (Corresponding plots for $\tau_{loo}$ are similar and the full set can be browsed via the aforementioned online supplement.) We plot these separately for each class: orange represents instances predicted as positive, and purple those predicted
⁴ For further discussion concerning our motivation here, see the Appendix. We also note that LOO results are comparable, and do not have this complication.