attention to be interpretable, the blue, upper-right values (i∗, not r, flips a decision) should be much larger than the orange, lower-left values (r, not i∗, flips a decision), which should be close to zero.⁵
Although for some datasets in Table 2, the “or-
ange” values are non-negligible, we mostly see
that their fraction of total off-diagonal values mir-
rors the fraction of negative occurrences of Eq. 1
in Figure 4. However, it’s somewhat startling that in the vast majority of cases, erasing i∗ does not change the decision (“no” row of each table). This
is likely explained in part by the signal pertinent
to the classification being distributed across a doc-
ument (e.g., a “Sports” question in the Yahoo An-
swers dataset could signal “sports” in a few sen-
tences, any one of which suffices to correctly cate-
gorize it). However, given that these results are for
the HAN models, which typically compute atten-
tion over ten or fewer sentences, this is surprising.
Altogether, examining importance from a single-weight angle paints a tentatively positive picture of attention’s interpretability, but it also raises several questions about the many cases where the impacts of i∗ and r are almost identical (i.e., ∆JS values close to 0), or where neither i∗ nor r causes a decision flip. To answer these questions, we require tests with a broader scope.
5 Importance of Sets of Attention Weights
Often, we care about determining the collective importance of a set of components I′. To address
that aspect of attention’s interpretability and close
gaps left by single-weight tests, we introduce tests
to determine how multiple attention weights per-
form together as importance predictors.
5.1 Multi-Weight Tests
For a hypothesized ranking of importance, such as
that implied by attention weights, we would ex-
pect the items at the top of that ranking to func-
tion as a concise explanation for the model’s deci-
sion. The less concise these explanations get, and
the farther down the ranking the items that truly drive the model’s decision fall, the less plausible it becomes that the ranking truly describes importance. In other words, we expect that the top items in a truly useful ranking of importance would comprise a minimal necessary set of information for making the model’s decision.

⁵We see this pattern especially strongly for FLANs (see Appendix), which is unsurprising since I is all words in the input text, so most attention weights are very small.
The idea of a minimal set of inputs necessary
to uphold a decision is not new; Li et al. (2016)
use reinforcement learning to attempt to construct
such a minimal set of words, Lei et al. (2016) train
an encoder to constrain the input prior to clas-
sification, and much of the work that has been
done on extractive summarization takes this con-
cept as a starting point (Lin and Bilmes, 2011).
However, such work has focused on approximat-
ing minimal sets, instead of evaluating the ability
of other importance-determining “shortcuts” (such
as attention weight orderings) to identify them.
Nguyen (2018) leverages the idea of minimal sets in a way much more similar to ours, comparing different input importance orderings.
Concretely, to assess the validity of an impor-
tance ranking method (e.g., attention), we begin
erasing representations from the top of the rank-
ing downward until the model’s decision changes.
Ideally, we would then enumerate all possible
subsets of that instance’s components, observe
whether the model’s decision changed in response
to removing each subset, and then report whether
the size of the minimal decision-flipping subset
was equal to the number of items that had needed
to be removed to achieve a decision flip by follow-
ing the ranking. However, the exponential num-
ber of subsets for any given instance’s sequence of
components (word or sentence representations, in
our case) makes such a strategy computationally
prohibitive, and so we adopt a different approach.
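As a concrete illustration, the following sketch shows the ranked-erasure procedure in Python. The model.predict interface, and the choice of zeroing out representations as the erasure operation, are assumptions made here for illustration only; they are not prescribed by this section.

import numpy as np

def flip_set_size(model, components, ranking):
    """Erase components in ranking order (most to least important) until the
    model's decision changes.

    `model` is assumed to expose a predict(components) method returning a
    class label; `components` is a list of vector representations (words or
    sentences); `ranking` is a list of component indices. Returns the number
    of components erased when the decision first flips, or None if it never
    flips.
    """
    original_decision = model.predict(components)
    erased = set()
    for num_erased, idx in enumerate(ranking, start=1):
        erased.add(idx)
        # One possible erasure operation: zero out the removed representations.
        masked = [np.zeros_like(c) if i in erased else c
                  for i, c in enumerate(components)]
        if model.predict(masked) != original_decision:
            return num_erased
    return None  # the decision never flips, even with everything erased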
Instead, in addition to our hypothesized impor-
tance ranking (attention weights), we consider al-
ternative rankings of importance; if, using those,
we repeatedly discover cases where removing a
smaller subset of items would have sufficed to
change the decision, this signals that our candidate
ranking is a poor indicator of importance.
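The comparison itself can then be phrased as a check over per-instance flip-set sizes. The sketch below, reusing the hypothetical flip_set_size routine above, flags instances where some alternative ranking flips the decision after erasing fewer items than the candidate ranking; a high rate of such instances counts as evidence against the candidate.

def candidate_beaten(model, components, candidate_ranking, alternative_rankings):
    """Return True if any alternative ranking flips the model's decision with
    strictly fewer erasures than the candidate (e.g., attention-based) ranking."""
    candidate_size = flip_set_size(model, components, candidate_ranking)
    for alt in alternative_rankings:
        alt_size = flip_set_size(model, components, alt)
        if alt_size is not None and (candidate_size is None or alt_size < candidate_size):
            return True
    return False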
5.2 Alternative Importance Rankings
Exhaustively searching the space of component
subsets would be far too time-consuming in prac-
tice, so we introduce three other ranking schemes.
The first is to randomly rank importance. We
expect that this ranking will perform quite poorly,
but it provides a point of comparison by which
to validate that ranking by descending attention
weights is at least somewhat informative.
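For instance, the random baseline can be generated by shuffling the component indices; the sketch below is illustrative only, and the seed handling and the number of random orderings per instance are implementation choices not prescribed here.

import random

def random_ranking(num_components, seed=0):
    """Baseline importance ranking: a uniformly random ordering of indices."""
    rng = random.Random(seed)
    order = list(range(num_components))
    rng.shuffle(order)
    return order

# Hypothetical usage: `attention_ranking` would hold the component indices
# sorted by descending attention weight for one instance.
# beaten = candidate_beaten(model, components, attention_ranking,
#                           [random_ranking(len(components), seed=s) for s in range(10)])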