Explainability Methods for Graph Convolutional Neural Networks

Phillip E. Pope* (HRL Laboratories, LLC), pepope@hrl.com
Soheil Kolouri* (HRL Laboratories, LLC), skolouri@hrl.com
Mohammad Rostami (HRL Laboratories, LLC), mrostami@hrl.com
Charles E. Martin (HRL Laboratories, LLC), cemartin@hrl.com
Heiko Hoffmann (HRL Laboratories, LLC), hhoffmann@hrl.com

Abstract

With the growing use of graph convolutional neural networks (GCNNs) comes the need for explainability. In this paper, we introduce explainability methods for GCNNs. We develop the graph analogues of three prominent explainability methods for convolutional neural networks: contrastive gradient-based (CG) saliency maps, Class Activation Mapping (CAM), and Excitation Backpropagation (EB), together with their variants, gradient-weighted CAM (Grad-CAM) and contrastive EB (c-EB). We show a proof of concept of these methods on classification problems in two application domains: visual scene graphs and molecular graphs. To compare the methods, we identify three desirable properties of explanations: (1) their importance to classification, as measured by the impact of occlusion, (2) their contrastivity with respect to different classes, and (3) their sparseness on a graph. We call the corresponding quantitative metrics fidelity, contrastivity, and sparsity, and evaluate each method on them. Finally, we analyze the salient subgraphs obtained from the explanations and report frequently occurring patterns.

1. Introduction

Much of the recent success in computer vision is attributed to the advent of deep convolutional neural networks (CNNs) [21]. This has led to state-of-the-art performance on a variety of computer vision tasks, including object recognition [11, 13], object detection [27], and semantic segmentation [26]. The end-to-end learning strategy of CNNs makes them powerful data-driven tools for learning from large amounts of visual data. At the same time, this end-to-end learning strategy hampers the interpretability and explainability of the decisions CNNs make. Recently, a growing body of work has focused on the inner workings of CNNs [38, 23, 22] and on methods for explaining the decisions these networks make [42, 31, 39, 40]. Zhang et al. [41] give a good survey of explainability methods for CNNs.

However, deep CNNs were designed for grid-structured data, such as images, in Euclidean space, where convolution is an operation defined on inputs with ordered elements. In many applications, though, we need to process data defined on different structures, such as graphs and manifolds, on which CNNs cannot be used directly. Such non-Euclidean spaces arise in a variety of applications, including scene-graph analysis [15], 3D shape analysis [25], social networks [37], and chemistry [35]. Geometric deep learning [6, 1] is a recently emerged field that aims to overcome these limitations of CNNs and broaden their applicability. In particular, by extending the convolution operation to graphs and to non-Euclidean spaces in general, CNNs can be generalized to graph-structured data. This extension of CNNs to non-Euclidean spaces has led to graph convolutional neural networks (GCNNs) [7, 9, 19].

Beyond a model's excellent performance, we also need techniques to explain why the model makes the predictions it does. Such explanations can help identify and localize the parts of the input data that are relevant to the decisions the model makes for a given task. Inspired by explainability work for CNNs [42], we introduce explainability methods for the decisions of GCNNs. For graphs, explainability may be even more helpful than for images, since a non-expert cannot intuitively determine the relevant context within a graph, for example, when identifying groups of atoms (subgraph structures on a molecular graph) that contribute to a particular property of a molecule.

We adapt three common explainability methods, originally designed for CNNs, and extend them to GCNNs. These three methods are gradient-based saliency maps [32], Class Activation Mapping (CAM) [39], and Excitation Backpropagation (EB) [39]. In addition, we adapt two variants: gradient-weighted CAM (Grad-CAM) [31] and contrastive EB. We evaluate the adapted methods on two different applications: visual scene graphs and molecular graphs. For GCNNs, we use the formulation proposed by Kipf et al. [18].
Our specific contributions in this work are the following three:

• Adaptation of explainability methods for CNNs to GCNNs,
• Demonstration of the explainability techniques on two graph classification problems: visual scene graphs and molecules, and
• Characterization of each method's trade-offs using metrics for fidelity, contrastivity, and sparsity.

The remainder of this paper is structured as follows. In Section 2, we discuss related work in interpretability and GCNNs. In Section 3, we review the mathematical definitions of GCNNs and explainability methods on CNNs, and then define the analogous explainability methods on GCNNs. In Section 4, we detail our experiments on visual scene graphs and molecules and show example results. Moreover, we quantitatively evaluate the performance of these four methods with respect to three metrics, fidelity, contrastivity, and sparsity, each designed to capture certain desirable properties of explanations. We use these metrics to evaluate the merits of each method. Lastly, in the experimental section, we analyze the frequencies of salient substructures identified by Grad-CAM and report the top results for each dataset.

2. Related Work

Interpretability: A long-standing limitation of general deep neural networks has been the difficulty in interpreting and explaining the classification results. Recently, explainability methods have been devised for deep networks and specifically CNNs [32, 42, 31, 39, 40, 41]. These methods enable one to probe a CNN and identify the important substructures of the input data (as deemed by the network) for the decision regarding a task, which can be used as an explanatory tool or as a tool to discover unknown underlying substructures in the data.
For example, in the area of medical imaging [34], in addition to classifying images as containing malignant lesions, the lesions can be localized, since the CNN can provide reasoning for classifying an input image.

The most straightforward approach for generating a sensitivity map over the input data, to discover the importance of the underlying substructures, is to calculate a gradient map within a layer by considering the norm of the gradient vector with respect to an input for each network weight [32]. However, gradient maps are known to be noisy, and smoothing these maps might be necessary [33]. More advanced techniques, including Class Activation Mapping (CAM) [42], Gradient-weighted Class Activation Mapping (Grad-CAM) [31], and Excitation Backpropagation (EB) [39], improve gradient maps by taking into account some notion of context. These techniques have been shown to be effective on CNNs and can identify highly abstract notions in images. See Zhang et al. [41] for a survey of explainability methods for CNNs.

Graph Convolutional Neural Networks: The mathematical foundation of GCNNs is deeply rooted in the field of graph signal processing [3, 4] and spectral graph theory, in which signal operations like the Fourier transform and convolutions are extended to signals living on graphs. GCNNs emerged from spectral graph theory, e.g., as introduced by Bruna et al. [2] and Henaff et al. [12]. GCNNs based on spectral graph theory enable the definition of parameterized filters akin to those of CNNs. They are, however, often computationally expensive and therefore slow.
To overcome the computational bottleneck of spectral GCNNs, various authors have proposed to approximate smooth filters in the spectral domain [6, 19], for instance using Chebyshev polynomials or a first-order approximation of spectral graph convolutions. In this work, we use the GCNN formulation defined by Kipf and Welling [19] due to its faster training times and higher predictive accuracy.

GCNNs have recently found use in diverse applications. Monti et al. [25] used GCNNs for super-pixel classification as well as for classifying research papers from their citation network. Defferrard et al. [6] used GCNNs on N-grams for text categorization. In [36], GCNNs were used for shape segmentation, and in [14], they were used for skeleton-based action recognition. More recently, Johnson et al. [15] used GCNNs to analyze scene graphs, with the application of image generation from scene graphs. In chemistry, GCNNs have been used to predict various chemical properties of organic molecules. GCNNs provide state-of-the-art performance on several chemical prediction tasks, including toxicity prediction [16], solubility [7], and energy prediction [30]. In this paper, we focus on explainability methods for GCNNs, with applications to scene-graph classification and molecule classification.

3. Methods

We compare and contrast the application of popular explainability methods to Graph Convolutional Neural Networks (GCNNs). Furthermore, we explore the benefits of a number of enhancements to these approaches.

3.1. Explainability for CNNs

The three main groups of popular explainability methods are contrastive gradients, Class Activation Mapping, and Excitation Backpropagation.

Contrastive gradient-based saliency maps [32] are perhaps the most straightforward and well-established approach. In this approach, one simply differentiates the output of the model with respect to the model input, thus creating a heat-map, where the norm of the gradient over input variables indicates their relative importance.
The resulting gradient in the input space points in the direction corresponding to the maximum positive rate of change in the model output. Therefore, the negative values in the gradient are discarded, to retain only the parts of the input that contribute positively to the solution:

$$L^c_{\mathrm{Gradient}} = \Big\| \mathrm{ReLU}\Big( \frac{\partial y^c}{\partial x} \Big) \Big\|, \qquad (1)$$

where $y^c$ is the score for class $c$ before the softmax layer, and $x$ is the input. While easy to compute and interpret, saliency maps generally perform worse than newer techniques (like CAM, Grad-CAM, and EB), and it was recently argued that saliency maps tend to represent noise rather than signal [17].

Class Activation Mapping provides an improvement over saliency maps for convolutional neural networks, including GCNNs, by identifying important, class-specific features at the last convolutional layer as opposed to the input space. It is well known that such features tend to be more semantically meaningful (e.g., faces instead of edges). The downside of CAM is that it requires the layer immediately before the softmax classifier (output layer) to be a convolutional layer followed by a global average pooling (GAP) layer. This precludes the use of more complex, heterogeneous networks, such as those that incorporate several fully connected layers before the softmax layer.

To compute CAM, let $F_k \in \mathbb{R}^{u \times v}$ be the $k$-th feature map of the convolutional layer preceding the softmax layer. Denote the global average pool (GAP) of $F_k$ by

$$e_k = \frac{1}{Z} \sum_i \sum_j F_{k,i,j}, \qquad (2)$$

where $Z = uv$. Then, a given class score, $y^c$, can be defined as

$$y^c = \sum_k w_k^c e_k, \qquad (3)$$

where the weights $w_k^c$ are learned based on the input-output behavior of the network. The weight $w_k^c$ encodes the importance of feature $k$ for predicting class $c$.
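To make Equations (2)-(3), and the class-specific heat-map built from them, concrete, here is a minimal NumPy sketch of CAM given the final convolutional feature maps and the learned classifier weights. The array shapes and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cam_heatmap(feature_maps, class_weights):
    """Class Activation Mapping: weighted sum of the final convolutional
    feature maps, rectified to keep only positive evidence.

    feature_maps : array of shape (K, u, v), the K feature maps F_k
    class_weights: array of shape (K,), the weights w^c_k for class c
    """
    # sum_k w^c_k * F_k[i, j], contracted over the feature axis
    heatmap = np.tensordot(class_weights, feature_maps, axes=1)
    return np.maximum(heatmap, 0.0)  # ReLU

def class_score(feature_maps, class_weights):
    """Class score y^c = sum_k w^c_k e_k, where e_k is the global
    average pool of F_k (Eqs. 2-3)."""
    e = feature_maps.mean(axis=(1, 2))  # e_k = (1/Z) sum_ij F_k[i, j]
    return float(class_weights @ e)
```

Because the GAP operation commutes with the weighted sum, averaging `cam_heatmap` over all spatial positions recovers `class_score`, which is the property CAM exploits.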
By upscaling each feature map to the size of the input images (to undo the effect of the pooling layers), the class-specific heat-map in the pixel space becomes

$$L^c_{\mathrm{CAM}}[i, j] = \mathrm{ReLU}\Big( \sum_k w_k^c F_{k,i,j} \Big). \qquad (4)$$

The Grad-CAM method improves upon CAM by relaxing the architectural restriction that the penultimate layer must be convolutional. This is achieved by using feature-map weights $\alpha_k^c$ that are based on back-propagated gradients. Specifically, Grad-CAM defines the weights as

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial F_{k,i,j}}. \qquad (5)$$

Following the intuition behind Equation (4) for CAM, the heat-map in the pixel space according to Grad-CAM is computed as

$$L^c_{\mathrm{Grad\text{-}CAM}}[i, j] = \mathrm{ReLU}\Big( \sum_k \alpha_k^c F_{k,i,j} \Big), \qquad (6)$$

where the ReLU function ensures that only features that have a positive influence on the class prediction are non-zero.

Excitation Backpropagation is an intuitively simple but empirically effective explanation method. In [28], it is argued and demonstrated experimentally that explainability approaches such as EB [39], which ignore nonlinearities in the backward pass through the network, are able to generate heat-maps that "conserve" evidence for or against a network predicting any particular class. Let $a_i^l$ be the $i$-th neuron in layer $l$ of a neural network and $a_j^{(l-1)}$ be a neuron in layer $(l-1)$. Define the relative influence of neuron $a_j^{(l-1)}$ on the activation $y_i^l \in \mathbb{R}$ of neuron $a_i^l$, where $y_i^l = \sigma\big(\sum_j W_{ji}^{(l-1)} y_j^{(l-1)}\big)$ and $W^{(l-1)}$ denotes the synaptic weights between layers $(l-1)$ and $l$, as a probability distribution $P(a_j^{(l-1)})$ over the neurons in layer $(l-1)$. This probability distribution can be factored as

$$P(a_j^{(l-1)}) = \sum_i P(a_j^{(l-1)} \mid a_i^l)\, P(a_i^l). \qquad (7)$$

Zhang et al. [39] then define the conditional probability $P(a_j^{(l-1)} \mid a_i^l)$ as

$$P(a_j^{(l-1)} \mid a_i^l) = \begin{cases} Z_i^{(l-1)}\, y_j^{(l-1)}\, W_{ji}^{(l-1)} & \text{if } W_{ji}^{(l-1)} \geq 0, \\ 0 & \text{otherwise}, \end{cases} \qquad (8)$$

where $Z_i^{(l-1)} = \big( \sum_j y_j^{(l-1)} W_{ji}^{(l-1)} \big)^{-1}$ is a normalization factor such that $\sum_j P(a_j^{(l-1)} \mid a_i^l) = 1$. For a given input (e.g., an image), EB generates a class-$c$ heat-map in the pixel space by starting at the output layer and applying Equation (7) recursively.

The explainability methods reviewed above were originally designed for CNNs, which are defined on uniform, square grids. Here, we are interested in signals supported on non-Euclidean structures, e.g., graphs.

[Figure 1: Our GCNN + GAP architecture, together with a visualization of the input features and adjacency matrix of a sample molecule (SMILES: CC(C)NCC(O)COc1cccc2ccccc12.[Cl]) from the BBBP dataset. The three graph-convolution layers (l = 1, 2, 3) have widths 128, 256, and 512, followed by global average pooling and a softmax classifier.]

In the following, we first briefly discuss GCNNs and then describe the extension of these explainability methods to GCNNs. Intuitively, an image can be conceptualized as a lattice graph with pixel values as node features. In this sense, GCNNs generalize CNNs to accommodate arbitrary connections between the nodes.

3.2. Graph Convolutional Neural Networks

Consider an attributed graph with $N$ nodes, node attributes $X \in \mathbb{R}^{N \times d}$, and (weighted or binary) adjacency matrix $A \in \mathbb{R}^{N \times N}$. Furthermore, let the degree matrix of this graph be $D_{ii} = \sum_j A_{ij}$. Following the work of Kipf and Welling [19], we define a graph convolutional layer as

$$F^l(X, A) = \sigma\Big( \tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}} F^{(l-1)}(X, A)\, W^l \Big), \qquad (9)$$

where $F^l$ denotes the convolutional activations at the $l$-th layer, $F^0 = X$, $\tilde{A} = A + I_N$ is the adjacency matrix with added self-connections ($I_N \in \mathbb{R}^{N \times N}$ being the identity matrix), $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $W^l \in \mathbb{R}^{d_l \times d_{l+1}}$ are the trainable convolution weights, and $\sigma(\cdot)$ is an element-wise nonlinear activation function. Figure 1 shows the GCNN architecture used in this paper, in which the activations of layers $l = 1, 2, 3$ follow Equation (9), a first-order approximation of localized spectral filters on graphs.

For molecule classification, each molecule can be represented as an attributed graph $G_i = (X_i, A_i)$, where the node features $X_i$ summarize the local chemical environment of the atoms in the molecule, including atom types, hybridization types, and valence structures [35], and the adjacency matrix encodes the chemical bonds between atoms and demonstrates the connectivity of the whole molecule (see Figure 1). For a labeled molecular dataset $D = \{G_i = (X_i, A_i), y_i\}_{i=1}^{M}$, where the label $y_i$ indicates a certain chemical property, e.g., blood-brain-barrier penetrability or toxicity, the task is to learn a classifier that maps each molecule to its corresponding label, $g: (X_i, A_i) \to y_i$. Given that our task is to classify individual graphs (i.e., molecules) that may have varying numbers of nodes, we use several graph-convolution layers followed by a global average pooling (GAP) layer over the graph nodes (e.g., atoms). In this way, all graphs are represented by a fixed-size vector. Finally, the GAP features are fed to a classifier. To make CAM [42] applicable, we simply use a softmax classifier after the GAP layer.
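As a concrete illustration, the propagation rule of Equation (9) and a node-level Grad-CAM heat-map in the style of Equation (15) can be sketched in NumPy as follows. This is a toy sketch, not the paper's Keras implementation: the class-specific gradient weights $\alpha$ are taken as given rather than computed by automatic differentiation, and all names are illustrative.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer (Eq. 9):
    F^l = sigma(D~^{-1/2} A~ D~^{-1/2} F^{l-1} W^l), with A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])        # add self-connections
    d = A_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    V = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(V @ X @ W, 0.0)       # ReLU as the nonlinearity

def grad_cam_nodes(F, alpha):
    """Node heat-map L^c[n] = ReLU(sum_k alpha_k F[n, k]).
    F: (N, K) node activations at some layer; alpha: (K,) class-specific
    weights (in the paper, alpha_k is the mean over nodes of dy^c/dF_k,n)."""
    return np.maximum(F @ alpha, 0.0)
```

Stacking `gcn_layer` three times with widths 128, 256, and 512 and averaging the last activations over nodes would mirror the GCNN + GAP architecture of Figure 1.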
3.3. Explainability for GCNNs

In this subsection, we describe the extension of the CNN explainability methods to GCNNs. Let the $k$-th graph-convolutional feature map at layer $l$ be

$$F_k^l(X, A) = \sigma\big( V F^{(l-1)}(X, A)\, W_k^l \big), \qquad (10)$$

where $W_k^l$ denotes the $k$-th column of the matrix $W^l$, and $V = \tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}}$ (cf. Equation (9)). In this notation, the $k$-th feature of the $l$-th layer at node $n$ is $F_{k,n}^l$. The GAP feature after the final convolutional layer $L$ is then calculated as

$$e_k = \frac{1}{N} \sum_{n=1}^{N} F_{k,n}^L(X, A), \qquad (11)$$

with the class score $y^c = \sum_k w_k^c e_k$. Using this notation, we extend the explainability methods to GCNNs as follows. The gradient-based heat-map on node $n$ is

$$L^c_{\mathrm{Gradient}}[n] = \Big\| \mathrm{ReLU}\Big( \frac{\partial y^c}{\partial X_n} \Big) \Big\|. \qquad (12)$$

The CAM heat-map is calculated as

$$L^c_{\mathrm{CAM}}[n] = \mathrm{ReLU}\Big( \sum_k w_k^c F_{k,n}^L(X, A) \Big). \qquad (13)$$

Grad-CAM's class-specific weights for class $c$ at layer $l$ and feature $k$ are calculated as

$$\alpha_k^{l,c} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial y^c}{\partial F_{k,n}^l}, \qquad (14)$$

and the heat-map calculated from layer $l$ is

$$L^c_{\mathrm{Grad\text{-}CAM}}[l, n] = \mathrm{ReLU}\Big( \sum_k \alpha_k^{l,c} F_{k,n}^l(X, A) \Big). \qquad (15)$$

Grad-CAM can generate heat-maps with respect to different layers of the network. Moreover, for the model shown in Figure 1, Grad-CAM's heat-map at the final convolutional layer is equivalent to CAM's heat-map, $L^c_{\mathrm{Grad\text{-}CAM}}[L, n] = L^c_{\mathrm{CAM}}[n]$ (see [31] for more details). In this paper, we report results for $L^c_{\mathrm{Grad\text{-}CAM}}[L, n]$ and for the variant Grad-CAM-Avg, defined as

$$L^c_{\mathrm{Grad\text{-}CAM\text{-}Avg}}[n] = \frac{1}{L} \sum_{l=1}^{L} L^c_{\mathrm{Grad\text{-}CAM}}[l, n]. \qquad (16)$$

The Excitation-Backpropagation heat-map is calculated by back-propagating through the softmax classifier, the GAP layer, and the stack of graph-convolution layers. Back-propagation through the softmax classifier and the GAP layer starts from $p(c) = 1$ for the class of interest and zero otherwise. Back-propagation through the graph-convolution layers, however, is more involved. To simplify the notation, we decompose the graph-convolution operator into two functions: the first is a local averaging over the atoms ($V_{n,m} \geq 0$), and the second is a fixed perceptron applied to each atom (akin to one-by-one convolutions in CNNs). The corresponding back-propagation rules for these two functions can be defined analogously to Equation (8). We generate the heat-map on the input layer by back-propagating recursively and averaging the probability heat-map over the input features:

$$L^c_{\mathrm{EB}}[n] = \frac{1}{d_{\mathrm{in}}} \sum_{k=1}^{d_{\mathrm{in}}} p(F_{k,n}^0). \qquad (20)$$

The contrastive extension of $L^c_{\mathrm{EB}}$ follows Equation (8) of [39]; we refer to it as the contrastive variant, c-EB.

4. Experiments

This section describes our experiments and our analysis of class-specific explanations. We experiment in two application domains, namely visual scene graphs and molecules. In addition, we present a frequency analysis of the graph substructures identified by the explanation methods in the supplementary material.

4.1. Scene-Graph Explanations

Our goal is to train GCNNs for scene-graph classification and to use our proposed explainability methods on these GCNNs. Scene graphs are a graph-structured data type in which nodes are objects in a scene and edges are relationships between objects. We take the data for our first experiment from the Visual Genome dataset [20], which contains image and scene-graph pairs. There are many types of objects and relationships, and the data was collected from the free-text responses of crowd workers. Objects have associated regions of the image, defined by bounding boxes.

We construct two binary classification tasks for scene graphs from Visual Genome: urban vs. country and indoor vs. outdoor. We model each class with a set of keywords, which are used to query the Visual Genome data for matches in any attribute of an image. The image set for each class is the union of the image sets returned for each keyword; in addition, any intersection between classes is removed. The keywords used to define each class are as follows:

• "country": country, rural, farm, countryside, cattle, crops, sheep
• "urban": urban, city, downtown
• "indoor": indoor, room, office, bedroom, bathroom
• "outdoor": outdoor, nature, outside

These keywords are neither comprehensive nor fully representative of each word's meaning. They are synthetic constructions built for the purpose of studying graph explanation methods. Since the aim of this study is to investigate explanations, the exact details of the class definitions are incidental. The keywords were chosen to give approximately balanced classes between the classification pairs. The class proportions of each dataset are reported in Table 1.

For simplicity, we collapse all relationships into a single type denoting a generic relationship between two objects. We note that various extensions of GCNNs exist for graphs with relational edges [29], to which the explainability methods developed in this paper could also be applied. Each object (i.e., node) in a scene graph has a bounding box. We use a pretrained InceptionV3 network and extract deep features of the bounding boxes of the underlying image as visual features, where each cropped image is zero-padded to a fixed size. The features extracted from each bounding box have size d = 2048. The resulting scene graphs thus contain both relational and visual data. The final task is to classify the scene graphs as urban vs. country and indoor vs. outdoor.

We report this metric in Table 2. Grad-CAM showed the highest contrastivity. Sparsity was designed to measure the localization of an explanation.
Sparse explanations are particularly useful for studying large graphs, where manual inspection of all nodes is infeasible. More precisely, we define this measure as one minus the number of objects identified in either explanation, $\hat{m}^0 \vee \hat{m}^1$, divided by the total number of nodes in the graph.

4.2. Molecular-Graph Explanations

We examine a second application domain: identifying the functional groups of organic molecules responsible for biomolecular properties. We evaluate on three binary-classification molecular datasets: BBBP, BACE, and the NR-ER task from TOX21 [35]. Each dataset contains experimentally determined binary classifications of small organic molecules. The BBBP dataset contains measurements of whether a molecule permeates the human blood-brain barrier, which is of great importance for drug design. The BACE dataset contains measurements of whether a molecule inhibits the human enzyme β-secretase. The TOX21 dataset contains molecular measurements for several toxicity targets; from this data, we select the NR-ER task, which concerns activation of the estrogen receptor [24]. These datasets are imbalanced; see Table 1 for the class proportions. Furthermore, we follow the recommendations of [35], the original paper describing the MoleculeNet datasets, for the train/test partitions. In particular, for BACE and BBBP, [35] recommends the so-called "scaffold" split, which partitions molecules by their structure, i.e., structurally similar molecules are placed in the same split. We emphasize that training the GCNNs and the standard dataset splits are not contributions of our paper; we simply follow standard practice for these datasets.

4.3. Training and Evaluation

We partition the datasets into train/test splits of 80:20. For all datasets, we use the GCNN + GAP architecture shown in Figure 1, with the following configuration: three graph-convolutional layers of sizes 128, 256, and 512, followed by a GAP layer and a softmax classifier. Training was run for 25 epochs with the ADAM optimizer, with a learning rate of 0.001, β1 = 0.9, and β2 = 0.999. Models were implemented with Keras and the TensorFlow backend [5]. Table 1 reports the AUC-ROC and AUC-PR evaluation metrics of the trained model for each dataset. For some molecule classification results, we observed higher test performance than train performance on average. While this is unusual, the results are consistent with those reported in [35].

4.4. Analysis of Explanation Methods

After training the models for each dataset, we applied to all samples …
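A minimal sketch of two of the graph-level metrics discussed above, assuming binarized explanation masks $\hat{m}^0$ and $\hat{m}^1$ over the graph nodes. The sparsity formula follows the definition given in the text; the contrastivity normalization is our reading of the excerpt (the exact formula is not reproduced here), so treat it as an assumption.

```python
import numpy as np

def sparsity(mask0, mask1):
    """Sparsity as defined in the text: one minus the fraction of nodes
    identified by either class explanation, 1 - |m0 OR m1| / N."""
    either = np.logical_or(mask0, mask1)
    return 1.0 - either.sum() / either.size

def contrastivity(mask0, mask1):
    """One plausible reading of contrastivity: the Hamming distance
    between the two binarized class explanations, normalized by the
    number of nodes activated in either explanation."""
    either = np.logical_or(mask0, mask1)
    if either.sum() == 0:
        return 0.0  # no activated nodes: nothing to contrast
    return float(np.logical_xor(mask0, mask1).sum() / either.sum())
```

For example, two masks that highlight entirely disjoint node sets would score a contrastivity of 1.0, while identical masks score 0.0; a sparsity near 1.0 means the explanations touch only a small fraction of the graph.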