should be coupled to capsule j.
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{3}$$
The log priors can be learned discriminatively at the same time as all the other weights. They depend
on the location and type of the two capsules but not on the current input image². The initial coupling
coefficients are then iteratively refined by measuring the agreement between the current output $v_j$ of
each capsule, j, in the layer above and the prediction $\hat{u}_{j|i}$ made by capsule i.
The agreement is simply the scalar product $a_{ij} = v_j \cdot \hat{u}_{j|i}$. This agreement is treated as if it was a log
likelihood and is added to the initial logit, $b_{ij}$, before computing the new values for all the coupling
coefficients linking capsule i to higher level capsules.
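As a small worked example of one refinement step (the numbers are illustrative assumptions, not values from the paper), consider a capsule i with two candidate parents and initial logits $b_{i1} = b_{i2} = 0$, so that Eq. 3 gives $c_{i1} = c_{i2} = 0.5$. If the agreements measured in the first iteration happen to be $a_{i1} = 1$ and $a_{i2} = -1$, the updated logits are $b_{i1} = 1$ and $b_{i2} = -1$, and Eq. 3 then gives

$$c_{i1} = \frac{e^{1}}{e^{1} + e^{-1}} = \frac{1}{1 + e^{-2}} \approx 0.88, \qquad c_{i2} = 1 - c_{i1} \approx 0.12,$$

so most of the output of capsule i is now routed to the parent whose current output agrees with its prediction.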
In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in
the layer above using different transformation matrices for each member of the grid as well as for
each type of capsule.
Procedure 1 Routing algorithm.
procedure ROUTING($\hat{u}_{j|i}$, $r$, $l$)
    for all capsule i in layer l and capsule j in layer (l + 1): $b_{ij} \leftarrow 0$.
    for r iterations do
        for all capsule i in layer l: $c_i \leftarrow \mathrm{softmax}(b_i)$    ▷ softmax computes Eq. 3
        for all capsule j in layer (l + 1): $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$
        for all capsule j in layer (l + 1): $v_j \leftarrow \mathrm{squash}(s_j)$    ▷ squash computes Eq. 1
        for all capsule i in layer l and capsule j in layer (l + 1): $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
    return $v_j$
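Procedure 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `routing`, `squash` and `softmax`, the nested-list layout `u_hat[i][j]`, and the toy inputs are all hypothetical, and `squash` is written from the usual form of Eq. 1, which is defined outside this section.

```python
import math


def squash(s):
    """Assumed form of Eq. 1: keep the direction of s, map its length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(logits):
    """Eq. 3 for one lower-level capsule: normalise its logits over all parents."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing(u_hat, r):
    """u_hat[i][j] is the prediction vector of lower capsule i for upper capsule j.

    Returns the upper-level outputs v and the coupling coefficients c
    that were used in the final iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # initial logits b_ij = 0
    v, c = [], []
    for _ in range(r):
        c = [softmax(b_i) for b_i in b]
        v = []
        for j in range(n_out):
            # s_j is the coupling-weighted sum of the predictions for capsule j
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # agreement a_ij = u_hat_{j|i} . v_j is added to the logit
                b[i][j] += sum(u * w for u, w in zip(u_hat[i][j], v[j]))
    return v, c


# Toy input: both lower capsules agree about parent 0 and disagree about parent 1.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
]
v, c = routing(u_hat, r=3)
print(c[0])  # coupling of lower capsule 0 to each parent
```

In this toy case the predictions for parent 1 cancel, so its output stays at zero and contributes no agreement, while the couplings to parent 0 grow with each iteration.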
3 Margin loss for digit existence
We are using the length of the instantiation vector to represent the probability that a capsule’s entity
exists. We would like the top-level capsule for digit class $k$ to have a long instantiation vector if and
only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss,
$L_k$, for each digit capsule, k:
$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda \, (1 - T_k) \max(0, \|v_k\| - m^-)^2 \tag{4}$$
where $T_k = 1$ iff a digit of class $k$ is present³ and $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting
of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity
vectors of all the digit capsules. We use $\lambda = 0.5$. The total loss is simply the sum of the losses of all
digit capsules.
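The margin loss of Eq. 4 can be illustrated with a short Python sketch. The function name `margin_loss` and the example capsule lengths are assumptions made for illustration; only the constants $m^+ = 0.9$, $m^- = 0.1$ and $\lambda = 0.5$ come from the text.

```python
def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of the per-class losses L_k of Eq. 4.

    lengths[k] is the length ||v_k|| of digit capsule k,
    targets[k] is T_k (1 if a digit of class k is present, else 0).
    """
    total = 0.0
    for length, t in zip(lengths, targets):
        # penalise a present digit whose capsule is shorter than m_plus
        present = t * max(0.0, m_plus - length) ** 2
        # penalise (down-weighted by lam) an absent digit whose capsule is longer than m_minus
        absent = lam * (1 - t) * max(0.0, length - m_minus) ** 2
        total += present + absent
    return total


# Hypothetical two-class example: class 0 is present, class 1 is absent.
print(margin_loss([0.95, 0.05], [1, 0]))  # both capsules are inside their margins
print(margin_loss([0.50, 0.30], [1, 0]))  # both capsules violate their margins
```

In the second call the present class contributes $(0.9 - 0.5)^2 = 0.16$ and the absent class contributes $0.5 \times (0.3 - 0.1)^2 = 0.02$, so the total is about 0.18.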
4 CapsNet architecture
A simple CapsNet architecture is shown in Fig. 1. The architecture is shallow with only two
convolutional layers and one fully connected layer. Conv1 has 256, 9 × 9 convolution kernels with a
stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature
detectors that are then used as inputs to the primary capsules.
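The spatial size of the Conv1 output follows from the kernel size and stride. The sketch below assumes a 28 × 28 MNIST input and no padding, neither of which is stated in this section; the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output width of an unpadded ("valid") convolution along one axis."""
    return (input_size - kernel_size) // stride + 1


# Assumed 28 x 28 input, 9 x 9 kernel, stride 1 (Conv1 as described above).
conv1_size = conv_output_size(28, 9, 1)
print(conv1_size)  # 20, i.e. a 20 x 20 grid for each of the 256 kernels
```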
The primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics
perspective, activating the primary capsules corresponds to inverting the rendering process. This is a
very different type of computation than piecing instantiated parts together to make familiar wholes,
which is what capsules are designed to be good at.
The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional
8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 × 9 kernel and a stride
of 2). Each primary capsule output sees the outputs of all 256 × 81 Conv1 units whose receptive
² For MNIST we found that it was sufficient to set all of these priors to be equal.
³ We do not allow an image to contain two instances of the same digit class. We address this weakness of
capsules in the discussion section.