Table 1 provides summary statistics for all
datasets, as well as the observed test performances
for additional context.
4 Experiments
We run a battery of experiments that aim to ex-
amine empirical properties of learned attention
weights and to interrogate their interpretability
and transparency. The key questions are: Do
learned attention weights agree with alternative,
natural measures of feature importance? And:
had we attended to different features, would the
prediction have been different?
More specifically, in Section 4.1, we empir-
ically analyze the correlation between gradient-
based feature importance and learned attention
weights, and between leave-one-out (LOO;
or ‘feature erasure’) measures and the same. In
Section 4.2 we then consider counterfactual (to
those observed) attention distributions. Under the
assumption that attention weights are explanatory,
such counterfactual distributions may be viewed
as alternative potential explanations; if these do
not correspondingly change model output, then it
is hard to argue that the original attention weights
provide meaningful explanation in the first place.
To generate counterfactual attention distribu-
tions, we first consider randomly permuting ob-
served attention weights and recording associated
changes in model outputs (4.2.1). We then pro-
pose explicitly searching for “adversarial” atten-
tion weights that maximally differ from the ob-
served attention weights (which one might show in
a heatmap and use to explain a model prediction),
and yet yield an effectively equivalent prediction
(4.2.2). The latter strategy also provides a use-
ful potential metric for the reliability of attention
weights as explanations: we can report a measure
quantifying how different attention weights can be
for a given instance without changing the model
output by more than some threshold ε.
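As a toy illustration of the permutation idea (not the authors' implementation), one can shuffle a set of attention weights over fixed per-token scores and measure the resulting change in the output distribution. The decoder below and all function names are hypothetical stand-ins.

```python
import math
import random

def tvd(y1, y2):
    # Total Variation Distance between two output distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(y1, y2))

def output_dist(alphas, token_scores):
    # Toy "decoder": attention-weighted score -> binary output distribution.
    s = sum(a * t for a, t in zip(alphas, token_scores))
    p = 1.0 / (1.0 + math.exp(-s))
    return [p, 1.0 - p]

def permutation_effect(alphas, token_scores, seed=0):
    # Shuffle the observed attention weights and record the output change.
    permuted = alphas[:]
    random.Random(seed).shuffle(permuted)
    return tvd(output_dist(alphas, token_scores),
               output_dist(permuted, token_scores))
```

Under this sketch, uniform attention weights yield zero TVD for any permutation, while peaked weights over heterogeneous token scores can move the output substantially.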
All results presented below are generated on test
sets. We present results for Additive attention;
results for Scaled Dot-Product attention are
generally comparable. We provide a web
interface to interactively browse the (very large
set of) plots for all datasets, model variants, and
experiment types: https://successar.github.
io/AttentionExplanation/docs/.
In the following sections, we use Total Variation Distance (TVD) as the measure of change between output distributions, defined as $\text{TVD}(\hat{y}_1, \hat{y}_2) = \frac{1}{2} \sum_{i=1}^{|\mathcal{Y}|} |\hat{y}_{1i} - \hat{y}_{2i}|$. We use the Jensen-Shannon Divergence (JSD) to quantify the difference between two attention distributions: $\text{JSD}(\alpha_1, \alpha_2) = \frac{1}{2} \text{KL}[\alpha_1 \,\|\, \frac{\alpha_1 + \alpha_2}{2}] + \frac{1}{2} \text{KL}[\alpha_2 \,\|\, \frac{\alpha_1 + \alpha_2}{2}]$.
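Both measures can be written directly in code. A minimal pure-Python sketch (function names are ours; inputs are probability distributions represented as lists):

```python
import math

def tvd(y1, y2):
    # Total Variation Distance: (1/2) * sum_i |y1_i - y2_i|.
    return 0.5 * sum(abs(a - b) for a, b in zip(y1, y2))

def kl(p, q):
    # KL divergence KL[p || q]; terms with p_i = 0 contribute zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(a1, a2):
    # Jensen-Shannon Divergence: average KL to the mixture distribution.
    m = [(x + y) / 2.0 for x, y in zip(a1, a2)]
    return 0.5 * kl(a1, m) + 0.5 * kl(a2, m)
```

With natural logarithms, JSD is bounded above by ln 2, attained by distributions with disjoint support; it is zero iff the two distributions are identical.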
4.1 Correlation Between Attention and
Feature Importance Measures
We empirically characterize the relationship be-
tween attention weights and corresponding fea-
ture importance scores. Specifically we measure
correlations between attention and: (1) gradient-based
measures of feature importance ($\tau_g$), and
(2) differences in model output induced by
leaving features out ($\tau_{loo}$). While these measures are
themselves insufficient for interpretation of neu-
ral model behavior (Feng et al., 2018), they do
provide measures of individual feature importance
with known semantics (Ross et al., 2017). It is thus
instructive to ask whether these measures correlate
with attention weights.
The process we follow to quantify this is described by Algorithm 1. We denote the input resulting from removing the word at position $t$ in $x$ by $x_{-t}$. Note that we disconnect the computation graph at the attention module so that the gradient does not flow through this layer: this means the gradients tell us how the prediction changes as a function of inputs, keeping the attention distribution fixed.⁴
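The "disconnect the computation graph" detail can be illustrated with a toy model and finite differences: the gradient with respect to an input is taken while the attention weights are pinned at their observed values. Everything below is an illustrative sketch, not the authors' implementation.

```python
import math

def forward(x, alphas=None):
    # Toy 1-D "encoder": one hidden scalar per token.
    h = [math.tanh(xi) for xi in x]
    if alphas is None:
        # Attention computed from hidden states (softmax over scores).
        e = [math.exp(hi) for hi in h]
        alphas = [ei / sum(e) for ei in e]
    # Toy "decoder": sigmoid of the attended summary.
    s = sum(a * hi for a, hi in zip(alphas, h))
    return 1.0 / (1.0 + math.exp(-s)), alphas

def grad_fixed_attention(x, t, eps=1e-6):
    # d(output)/d(x_t), with the attention distribution held fixed at its
    # observed values -- mimicking a detached attention module.
    y0, alphas = forward(x)
    xp = list(x)
    xp[t] += eps
    yp, _ = forward(xp, alphas=alphas)  # reuse the observed attention
    return (yp - y0) / eps
```

In an autodiff framework this corresponds to applying a stop-gradient (e.g., detaching the attention tensor) before the gradient of the output with respect to the inputs is computed.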
Algorithm 1 Feature Importance Computations
$h \leftarrow \text{Enc}(x)$, $\hat{\alpha} \leftarrow \text{softmax}(\phi(h, Q))$
$\hat{y} \leftarrow \text{Dec}(h, \hat{\alpha})$
$g_t \leftarrow |\sum_{w=1}^{|V|} \mathbb{1}[x_{tw} = 1] \, \frac{\partial \hat{y}}{\partial x_{tw}}|, \ \forall t \in [1, T]$
$\tau_g \leftarrow \text{Kendall-}\tau(\hat{\alpha}, g)$
$\Delta\hat{y}_t \leftarrow \text{TVD}(\hat{y}(x_{-t}), \hat{y}(x)), \ \forall t \in [1, T]$
$\tau_{loo} \leftarrow \text{Kendall-}\tau(\hat{\alpha}, \Delta\hat{y})$
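The correlations in Algorithm 1 can be computed with any Kendall-τ routine. A minimal O(T²) pairwise version is sketched below for illustration (in practice one might use an existing implementation such as scipy.stats.kendalltau):

```python
def kendall_tau(a, b):
    # Pairwise Kendall-tau: (concordant - discordant) / total pairs.
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Applied once per instance, the two arguments would be the attention weights and either the per-token gradient magnitudes (giving τ_g) or the per-token LOO output changes (giving τ_loo).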
Table 2 reports summary statistics of Kendall $\tau$ correlations for each dataset. Full distributions are shown in Figure 2, which plots histograms of $\tau_g$ for every data point in the respective corpora. (Corresponding plots for $\tau_{loo}$ are similar and the full set can be browsed via the aforementioned online supplement.) We plot these separately for each class: orange represents instances predicted as positive, and purple those predicted
⁴ For further discussion concerning our motivation here, see the Appendix. We also note that LOO results are comparable, and do not have this complication.