to successive matrix products of the weight matrices and the gradient with respect to activation functions till the final convolution layer that the gradients are being propagated to. Hence, this weight $\alpha^c_k$ represents a partial linearization of the deep network downstream from $A$, and captures the 'importance' of feature map $k$ for a target class $c$.
We perform a weighted combination of forward activation maps, and follow it by a ReLU to obtain,

$$L^c_{\text{Grad-CAM}} = \text{ReLU}\underbrace{\Big(\sum_k \alpha^c_k A^k\Big)}_{\text{linear combination}} \tag{2}$$
Notice that this results in a coarse heatmap of the same size as the convolutional feature maps ($14 \times 14$ in the case of the last convolutional layers of VGG [52] and AlexNet [33] networks)³. We apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence on the class of interest, i.e. pixels whose intensity should be increased in order to increase $y^c$.
Negative pixels are likely to belong to other categories in the
image. As expected, without this ReLU, localization maps
sometimes highlight more than just the desired class and
perform worse at localization. Figures 1c, 1f and 1i, 1l show
Grad-CAM visualizations for ‘tiger cat’ and ‘boxer (dog)’
respectively. Ablation studies are available in Sec. B.
In general, $y^c$ need not be the class score produced by an image classification CNN. It could be any differentiable activation, including words from a caption or an answer to a question.
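To make (1) and (2) concrete, the following is a minimal PyTorch sketch, not the authors' released implementation; the torchvision VGG-16 backbone, the random placeholder input, and the choice of the last convolutional layer (before the final max-pool, giving $14 \times 14$ maps) are assumptions for illustration.

```python
import torch
from torchvision import models

model = models.vgg16(pretrained=True).eval()
img = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image

A = model.features[:-1](img)             # last conv activations A^k: (1, 512, 14, 14)
A.retain_grad()                          # keep gradients for this non-leaf tensor
out = model.features[-1](A)              # final max-pool -> (1, 512, 7, 7)
out = model.avgpool(out).flatten(1)
scores = model.classifier(out)           # class scores y^c (pre-softmax)

c = scores.argmax().item()               # explain the top-scoring class
scores[0, c].backward()                  # populates A.grad with dy^c/dA^k_ij

alpha = A.grad.mean(dim=(2, 3))          # eq. (1): global-average-pool the gradients
cam = torch.relu((alpha[:, :, None, None] * A).sum(dim=1))  # eq. (2)
# `cam` is a coarse 14x14 heatmap; upsample to the input size to overlay it.
```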
3.1 Grad-CAM generalizes CAM
In this section, we discuss the connections between Grad-CAM and Class Activation Mapping (CAM) [59], and formally prove that Grad-CAM generalizes CAM for a wide variety of CNN-based architectures. Recall that CAM produces a localization map for an image classification CNN with a specific kind of architecture where global average pooled convolutional feature maps are fed directly into softmax. Specifically, let the penultimate layer produce $K$ feature maps, $A^k \in \mathbb{R}^{u \times v}$, with each element indexed by $i, j$. So $A^k_{ij}$ refers to the activation at location $(i, j)$ of the feature map $A^k$. These feature maps are then spatially pooled using Global Average Pooling (GAP) and linearly transformed to produce a score $Y^c$ for each class $c$,
$$Y^c = \sum_k \underbrace{w^c_k}_{\text{class feature weights}} \overbrace{\frac{1}{Z}\sum_i \sum_j}^{\text{global average pooling}} \underbrace{A^k_{ij}}_{\text{feature map}} \tag{3}$$
³We find that Grad-CAM maps become progressively worse as we move to earlier convolutional layers, as they have smaller receptive fields and focus only on less semantic, local features.
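To make the CAM scoring in (3) concrete before the derivation, here is a small sketch of the GAP-plus-linear head; all shapes and tensors are illustrative, not taken from the paper's experiments:

```python
import torch

K, u, v, C = 512, 14, 14, 1000     # illustrative: K feature maps, C classes
Z = u * v                          # number of spatial locations
A = torch.randn(1, K, u, v)        # penultimate feature maps A^k
w = torch.randn(C, K)              # class feature weights w^c_k

F_gap = A.sum(dim=(2, 3)) / Z      # (1/Z) sum_ij A^k_ij, named F^k in (4) below
Y = F_gap @ w.t()                  # class scores Y^c as in (3)
```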
Let us define $F^k$ to be the global average pooled output,

$$F^k = \frac{1}{Z} \sum_i \sum_j A^k_{ij} \tag{4}$$
CAM computes the final scores by,

$$Y^c = \sum_k w^c_k \cdot F^k \tag{5}$$
where $w^c_k$ is the weight connecting the $k$-th feature map with the $c$-th class. Taking the gradient of the score for class $c$ ($Y^c$) with respect to the feature map $F^k$, we get,

$$\frac{\partial Y^c}{\partial F^k} = \frac{\partial Y^c / \partial A^k_{ij}}{\partial F^k / \partial A^k_{ij}} \tag{6}$$
Taking the partial derivative of (4) w.r.t. $A^k_{ij}$, we can see that $\frac{\partial F^k}{\partial A^k_{ij}} = \frac{1}{Z}$. Substituting this in (6), we get,

$$\frac{\partial Y^c}{\partial F^k} = \frac{\partial Y^c}{\partial A^k_{ij}} \cdot Z \tag{7}$$
From (5) we get that $\frac{\partial Y^c}{\partial F^k} = w^c_k$. Hence,

$$w^c_k = Z \cdot \frac{\partial Y^c}{\partial A^k_{ij}} \tag{8}$$
Summing both sides of (8) over all pixels $(i, j)$,

$$\sum_i \sum_j w^c_k = \sum_i \sum_j Z \cdot \frac{\partial Y^c}{\partial A^k_{ij}} \tag{9}$$
Since $Z$ and $w^c_k$ do not depend on $(i, j)$, rewriting this as

$$Z\, w^c_k = Z \sum_i \sum_j \frac{\partial Y^c}{\partial A^k_{ij}} \tag{10}$$
Note that $Z$ is the number of pixels in the feature map (or $Z = \sum_i \sum_j 1$). Thus, we can re-order terms and see that

$$w^c_k = \sum_i \sum_j \frac{\partial Y^c}{\partial A^k_{ij}} \tag{11}$$
Up to a proportionality constant ($1/Z$) that gets normalized out during visualization, the expression for $w^c_k$ is identical to $\alpha^c_k$ used by Grad-CAM (1). Thus, Grad-CAM is a strict generalization of CAM. This generalization allows us to generate visual explanations from CNN-based models that cascade convolutional layers with much more complex interactions, such as those for image captioning and VQA (Sec. 8.2).
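This equivalence is also easy to check numerically. Below is a sketch with a tiny GAP-plus-linear (CAM-style) head, using autograd to recover the class weights as in (11); all sizes are illustrative:

```python
import torch

K, u, v, C = 8, 14, 14, 5                  # tiny illustrative CAM-style head
A = torch.randn(1, K, u, v, requires_grad=True)
w = torch.randn(C, K)                      # ground-truth CAM weights w^c_k

Y = A.mean(dim=(2, 3)) @ w.t()             # GAP then linear, as in (3)
c = 2                                      # any target class
Y[0, c].backward()                         # A.grad holds dY^c/dA^k_ij

# Eq. (11): summing the gradients over (i, j) recovers w^c_k exactly;
# Grad-CAM's alpha^c_k is (1/Z) times this sum, i.e. identical up to 1/Z.
w_recovered = A.grad.sum(dim=(2, 3))[0]    # shape (K,)
assert torch.allclose(w_recovered, w[c], atol=1e-5)
```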