to successive matrix products of the weight matrices and the gradient with respect to activation functions till the final convolution layer that the gradients are being propagated to. Hence, this weight $\alpha^c_k$ represents a partial linearization of the deep network downstream from $A$, and captures the 'importance' of feature map $k$ for a target class $c$.
We perform a weighted combination of forward activation maps, and follow it by a ReLU to obtain,

$$L^c_{\text{Grad-CAM}} = \text{ReLU}\underbrace{\Big(\sum_k \alpha^c_k A^k\Big)}_{\text{linear combination}} \tag{2}$$
Notice that this results in a coarse heatmap of the same size as the convolutional feature maps ($14 \times 14$ in the case of the last convolutional layers of VGG [52] and AlexNet [33] networks)³. We apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence on the class of interest, i.e. pixels whose intensity should be increased in order to increase $y^c$.
Negative pixels are likely to belong to other categories in the
image. As expected, without this ReLU, localization maps
sometimes highlight more than just the desired class and
perform worse at localization. Figures 1c, 1f and 1i, 1l show
Grad-CAM visualizations for ‘tiger cat’ and ‘boxer (dog)’
respectively. Ablation studies are available in Sec. B.
In general, $y^c$ need not be the class score produced by an image classification CNN. It could be any differentiable activation, including words from a caption or an answer to a question.
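To make (1) and (2) concrete, the following is a minimal PyTorch sketch, not the authors' released implementation; the torchvision VGG-16 backbone, the random placeholder input, and the choice of the last convolutional layer (before the final max-pool, giving $14 \times 14$ maps) are assumptions for illustration.

```python
import torch
from torchvision import models

model = models.vgg16(pretrained=True).eval()
img = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image

A = model.features[:-1](img)             # last conv activations A^k: (1, 512, 14, 14)
A.retain_grad()                          # keep gradients for this non-leaf tensor
out = model.features[-1](A)              # final max-pool -> (1, 512, 7, 7)
out = model.avgpool(out).flatten(1)
scores = model.classifier(out)           # class scores y^c (pre-softmax)

c = scores.argmax().item()               # explain the top-scoring class
scores[0, c].backward()                  # populates A.grad with dy^c/dA^k_ij

alpha = A.grad.mean(dim=(2, 3))          # eq. (1): global-average-pool the gradients
cam = torch.relu((alpha[:, :, None, None] * A).sum(dim=1))  # eq. (2)
# `cam` is a coarse 14x14 heatmap; upsample to the input size to overlay it.
```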
3.1 Grad-CAM generalizes CAM
In this section, we discuss the connections between Grad-CAM and Class Activation Mapping (CAM) [59], and formally prove that Grad-CAM generalizes CAM for a wide variety of CNN-based architectures. Recall that CAM produces a localization map for an image classification CNN with a specific kind of architecture where global average pooled convolutional feature maps are fed directly into softmax. Specifically, let the penultimate layer produce $K$ feature maps, $A^k \in \mathbb{R}^{u \times v}$, with each element indexed by $i, j$. So $A^k_{ij}$ refers to the activation at location $(i, j)$ of the feature map $A^k$. These feature maps are then spatially pooled using Global Average Pooling (GAP) and linearly transformed to produce a score $Y^c$ for each class $c$,
$$Y^c = \sum_k \underbrace{w^c_k}_{\text{class feature weights}} \overbrace{\frac{1}{Z}\sum_i \sum_j}^{\text{global average pooling}} \underbrace{A^k_{ij}}_{\text{feature map}} \tag{3}$$
³We find that Grad-CAM maps become progressively worse as we move to earlier convolutional layers, as they have smaller receptive fields and focus only on less semantic, local features.
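To make the CAM scoring in (3) concrete before the derivation, here is a small sketch of the GAP-plus-linear head; all shapes and tensors are illustrative, not taken from the paper's experiments:

```python
import torch

K, u, v, C = 512, 14, 14, 1000     # illustrative: K feature maps, C classes
Z = u * v                          # number of spatial locations
A = torch.randn(1, K, u, v)        # penultimate feature maps A^k
w = torch.randn(C, K)              # class feature weights w^c_k

F_gap = A.sum(dim=(2, 3)) / Z      # (1/Z) sum_ij A^k_ij, named F^k in (4) below
Y = F_gap @ w.t()                  # class scores Y^c as in (3)
```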
Let us define $F^k$ to be the global average pooled output,

$$F^k = \frac{1}{Z} \sum_i \sum_j A^k_{ij} \tag{4}$$
CAM computes the final scores by,

$$Y^c = \sum_k w^c_k \cdot F^k \tag{5}$$
where $w^c_k$ is the weight connecting the $k$-th feature map with the $c$-th class. Taking the gradient of the score for class $c$ ($Y^c$) with respect to the feature map $F^k$, we get,

$$\frac{\partial Y^c}{\partial F^k} = \frac{\partial Y^c / \partial A^k_{ij}}{\partial F^k / \partial A^k_{ij}} \tag{6}$$
Taking the partial derivative of (4) w.r.t. $A^k_{ij}$, we can see that $\frac{\partial F^k}{\partial A^k_{ij}} = \frac{1}{Z}$. Substituting this in (6), we get,

$$\frac{\partial Y^c}{\partial F^k} = \frac{\partial Y^c}{\partial A^k_{ij}} \cdot Z \tag{7}$$
From (5) we get that $\frac{\partial Y^c}{\partial F^k} = w^c_k$. Hence,

$$w^c_k = Z \cdot \frac{\partial Y^c}{\partial A^k_{ij}} \tag{8}$$
Summing both sides of (8) over all pixels $(i, j)$,

$$\sum_i \sum_j w^c_k = \sum_i \sum_j Z \cdot \frac{\partial Y^c}{\partial A^k_{ij}} \tag{9}$$
Since $Z$ and $w^c_k$ do not depend on $(i, j)$, rewriting this as

$$Z\, w^c_k = Z \sum_i \sum_j \frac{\partial Y^c}{\partial A^k_{ij}} \tag{10}$$
Note that $Z$ is the number of pixels in the feature map (or $Z = \sum_i \sum_j 1$). Thus, we can re-order terms and see that

$$w^c_k = \sum_i \sum_j \frac{\partial Y^c}{\partial A^k_{ij}} \tag{11}$$
Up to a proportionality constant ($1/Z$) that gets normalized out during visualization, the expression for $w^c_k$ is identical to $\alpha^c_k$ used by Grad-CAM (1). Thus, Grad-CAM is a strict generalization of CAM. This generalization allows us to generate visual explanations from CNN-based models that cascade convolutional layers with much more complex interactions, such as those for image captioning and VQA (Sec. 8.2).
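This equivalence is also easy to check numerically. Below is a sketch with a tiny GAP-plus-linear (CAM-style) head, using autograd to recover the class weights as in (11); all sizes are illustrative:

```python
import torch

K, u, v, C = 8, 14, 14, 5                  # tiny illustrative CAM-style head
A = torch.randn(1, K, u, v, requires_grad=True)
w = torch.randn(C, K)                      # ground-truth CAM weights w^c_k

Y = A.mean(dim=(2, 3)) @ w.t()             # GAP then linear, as in (3)
c = 2                                      # any target class
Y[0, c].backward()                         # A.grad holds dY^c/dA^k_ij

# Eq. (11): summing the gradients over (i, j) recovers w^c_k exactly;
# Grad-CAM's alpha^c_k is (1/Z) times this sum, i.e. identical up to 1/Z.
w_recovered = A.grad.sum(dim=(2, 3))[0]    # shape (K,)
assert torch.allclose(w_recovered, w[c], atol=1e-5)
```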