object part of a certain category, while remaining inactivated on images of other categories¹. Let $\mathbf{I}$ denote a set of training images, where $\mathbf{I}_c \subset \mathbf{I}$ represents the subset that belongs to category $c$ ($c = 1, 2, \ldots, C$). Theoretically, we can use different types of losses to learn CNNs for multi-class classification and for binary classification of a single class (i.e., $c = 1$ for images of the category and $c = 2$ for random images).

¹ To avoid ambiguity, we evaluate or visualize the semantic meaning of each filter by using the feature map after the ReLU and mask operations.
In the following paragraphs, we focus on the learning of a single filter $f$ in a conv-layer. Fig. 2 shows the structure of our interpretable conv-layer. We add a loss to the feature map $x$ of the filter $f$ after the ReLU operation. The filter loss $\text{Loss}_f$ pushes the filter $f$ to represent a specific object part of the category $c$ and to remain silent on images of other categories. Please see Section 3.2 for the determination of the category $c$ for the filter $f$. Let $\mathbf{X} = \{x \mid x = f(I) \in \mathbb{R}^{n \times n}, I \in \mathbf{I}\}$ denote the set of feature maps of $f$ after the ReLU operation w.r.t. different images. Given an input image $I \in \mathbf{I}_c$, the feature map in an intermediate layer $x = f(I)$ is an $n \times n$ matrix, $x_{ij} \geq 0$. If the target part appears, we expect the feature map $x = f(I)$ to exclusively activate at the target part's location; otherwise, the feature map should remain inactivated.
Therefore, a high interpretability of the filter $f$ requires a high mutual information between the feature map $x = f(I)$ and the part location, i.e. the part location can roughly determine activations on the feature map $x$. Accordingly, we formulate the filter loss as the negative mutual information:
$$\text{Loss}_f = -MI(\mathbf{X}; \Omega) = -\sum_{\mu \in \Omega} p(\mu) \sum_{x} p(x|\mu)\, \log \frac{p(x|\mu)}{p(x)} \qquad (1)$$
where $MI(\cdot)$ denotes the mutual information; $\Omega = \{\mu_1, \mu_2, \ldots, \mu_{n^2}\} \cup \{\mu^-\}$. We use $\mu_1, \mu_2, \ldots, \mu_{n^2}$ to denote the $n^2$ neural units on the feature map $x$, each $\mu = [i, j] \in \Omega$, $1 \leq i, j \leq n$, corresponding to a location candidate for the target part. $\mu^-$ denotes a dummy location for the case when the target part does not appear on the image.
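To make Equation (1) concrete, the following minimal sketch computes the filter loss, assuming the prior $p(\mu)$ and the likelihood $p(x|\mu)$ (both defined below) are already available as arrays; the function and variable names are ours, not the paper's.

```python
import numpy as np

def filter_loss(p_mu, p_x_given_mu):
    """Negative mutual information -MI(X; Omega) of Eq. (1).

    p_mu         : (n*n + 1,)      prior over locations, incl. the dummy mu^-
    p_x_given_mu : (|X|, n*n + 1)  likelihood of each feature map given a location
    """
    # p(x) = sum_mu p(mu) p(x|mu): marginal over all location candidates
    p_x = p_x_given_mu @ p_mu                              # shape (|X|,)
    # MI(X; Omega) = sum_mu p(mu) sum_x p(x|mu) log[ p(x|mu) / p(x) ]
    ratio = p_x_given_mu / (p_x[:, None] + 1e-12)
    mi = np.sum(p_mu[None, :] * p_x_given_mu * np.log(ratio + 1e-12))
    return -mi                                             # the loss is -MI
```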
Given an input image, the above loss forces each filter to match one and only one of the templates (defined below), i.e. it makes the feature map of the filter contain at most a single significant activation peak. This ensures that each filter represents a specific object part.
• $p(\mu)$ measures the probability of the target part appearing at the location $\mu$. If annotations of part locations were given, the computation of $p(\mu)$ would be simple: one could manually assign a semantic part to the filter $f$, and then $p(\mu)$ could be determined using the part annotations.
However, in our study, the target part of filter $f$ is not pre-defined before the learning process. Instead, the part corresponding to $f$ needs to be determined during the learning process. More crucially, we do not have any ground-truth annotations of the target part, which increases the difficulty of calculating $p(\mu)$.
• The conditional likelihood $p(x|\mu)$ measures the fitness between a feature map $x$ and the part location $\mu \in \Omega$. In order to simplify the computation of $p(x|\mu)$, we design $n^2$ templates for $f$, $\{T_{\mu_1}, T_{\mu_2}, \ldots, T_{\mu_{n^2}}\}$.
As shown in Fig. 3, each template $T_{\mu_i}$ is an $n \times n$ matrix that describes the ideal distribution of activations for the feature map $x$ when the target part mainly triggers the $i$-th unit in $x$. In addition, we design a negative template $T^-$ corresponding to the dummy location $\mu^-$; the feature map matches $T^-$ when the target part does not appear on the input image. In this study, the prior probability is given as $p(\mu_i) = \frac{\alpha}{n^2}$ and $p(\mu^-) = 1 - \alpha$, where $\alpha$ is a constant prior likelihood.
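A minimal sketch of this prior follows; the function name and the convention of keeping $\mu^-$ as the last entry are our choices, and the values of $\alpha$ and $n$ are left to the caller since the paper treats $\alpha$ as a constant.

```python
import numpy as np

def location_prior(n, alpha):
    """p(mu_i) = alpha / n^2 for each of the n^2 positive locations;
    p(mu^-) = 1 - alpha for the dummy location (stored as the last entry)."""
    p_mu = np.full(n * n + 1, alpha / (n * n))
    p_mu[-1] = 1.0 - alpha
    return p_mu
```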
Note that in Equation (1), we do not manually assign filters to different categories. Instead, we use the negative template $\mu^-$ to help the assignment of filters: the negative template ensures that each filter represents a specific object part (if the input image does not contain the target part, then its feature map is supposed to match $\mu^-$), which also ensures a clear assignment of filters to categories. Here, we assume that two categories do not share object parts, e.g. eyes of dogs and those of cats do not have similar contextual appearances.
We define $p(x|\mu)$ below, which follows a standard form widely used in [25], [38].
$$p(x|\mu) \approx p(x|T_\mu) = \frac{1}{Z_\mu} \exp\big[\mathrm{tr}(x \cdot T_\mu)\big] \qquad (2)$$
where $Z_\mu = \sum_{x \in \mathbf{X}} \exp[\mathrm{tr}(x \cdot T_\mu)]$. $\mathrm{tr}(\cdot)$ indicates the trace of a matrix, and $\mathrm{tr}(x \cdot T_\mu) = \sum_{ij} x_{ij} t_{ji}$, where $x, T_\mu \in \mathbb{R}^{n \times n}$. $p(x) = \sum_\mu p(\mu)\, p(x|\mu)$.
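Putting Equation (2) and these definitions together, a sketch of the likelihood computation might look as follows; `feature_maps` and `templates` are illustrative names, and the template construction itself is sketched after the next paragraph.

```python
import numpy as np

def likelihoods(feature_maps, templates):
    """p(x|mu) ~ p(x|T_mu) of Eq. (2) for a set X of feature maps.

    feature_maps : (|X|, n, n)      ReLU feature maps x = f(I)
    templates    : (n*n + 1, n, n)  templates T_mu, with T^- last
    returns      : (|X|, n*n + 1)   p(x|T_mu), normalized over x in X
    """
    # tr(x . T_mu) = sum_ij x_ij * t_ji, i.e. the inner product of x
    # with the transpose of each template
    scores = np.einsum('bij,mji->bm', feature_maps, templates)
    scores -= scores.max(axis=0, keepdims=True)    # stabilize the exponential
    exps = np.exp(scores)
    return exps / exps.sum(axis=0, keepdims=True)  # Z_mu normalizes over X
```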
Part templates: As shown in Fig. 3, a negative template is given as $T^- = (t^-_{ij})$, $t^-_{ij} = -\tau < 0$, where $\tau$ is a positive constant. A positive template corresponding to $\mu$ is given as $T_\mu = (t^+_{ij})$, $t^+_{ij} = \tau \cdot \max\!\big(1 - \beta \frac{\|[i,j] - \mu\|_1}{n}, -1\big)$, where $\|\cdot\|_1$ denotes the $L$-1 norm distance. Note that the lowest value in a positive template is $-\tau$ instead of 0. This is because the negative values in the template penalize neural activations outside the domain of the highest activation peak, which ensures that each filter mainly has at most a single significant activation peak.
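Under the definitions above, the templates can be sketched as below; the function and argument names are ours, and the values of $\tau$ and $\beta$ are left to the caller since the paper treats them as constants.

```python
import numpy as np

def part_templates(n, tau, beta):
    """Build the n^2 positive templates T_mu and the negative template T^-.

    Returns an array of shape (n*n + 1, n, n) with T^- as the last slice.
    """
    grid = np.stack(np.meshgrid(np.arange(n), np.arange(n),
                                indexing='ij'), axis=-1)       # (n, n, 2)
    templates = []
    for mu in grid.reshape(-1, 2):
        # L1 distance ||[i, j] - mu||_1 of every unit to the ideal location
        dist = np.abs(grid - mu).sum(axis=-1)
        # t+_ij = tau * max(1 - beta * ||[i, j] - mu||_1 / n, -1)
        templates.append(tau * np.maximum(1.0 - beta * dist / n, -1.0))
    templates.append(np.full((n, n), -tau))                    # t-_ij = -tau
    return np.stack(templates)
```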
3.1 Part localization & the mask layer
Given an input image $I$, the filter $f$ computes a feature map $x$ after the ReLU operation. Without ground-truth annotations of the target part for $f$, in this study, we determine the part location on $x$ during the learning process. We consider the neural unit with the strongest activation, $\hat{\mu} = \operatorname{argmax}_{\mu = [i,j]} x_{ij}$, $1 \leq i, j \leq n$, as the target part location.
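This localization step amounts to a single argmax over the feature map, e.g. (a sketch with our naming):

```python
import numpy as np

def localize_part(x):
    """mu_hat = argmax_{[i, j]} x_ij: the unit with the strongest
    activation on the ReLU feature map x (an n x n matrix)."""
    return np.unravel_index(np.argmax(x), x.shape)
```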