should be coupled to capsule $j$.

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \qquad (3)$$
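The routing softmax of Eq. 3 can be sketched in NumPy. This is an illustrative sketch, not the authors' code; the layout of the logit array (lower-level capsules as rows, upper-level capsules as columns) is an assumption:

```python
import numpy as np

def coupling_coefficients(b):
    """Eq. 3: c_ij = exp(b_ij) / sum_k exp(b_ik).

    b: routing logits, assumed shape (num_lower, num_upper); the softmax
    is taken over the upper-layer capsules k for each lower capsule i.
    """
    b = b - b.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(b)
    return e / e.sum(axis=1, keepdims=True)

b = np.zeros((3, 4))            # initial logits b_ij = 0
c = coupling_coefficients(b)
print(c[0])                     # uniform: each c_ij = 1/4 when all logits are zero
```

With all logits initialized to zero, every lower-level capsule spreads its output uniformly over the capsules above it, which is the starting point that routing then refines.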
The log priors can be learned discriminatively at the same time as all the other weights. They depend on the location and type of the two capsules but not on the current input image². The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output $v_j$ of each capsule $j$ in the layer above and the prediction $\hat{u}_{j|i}$ made by capsule $i$. The agreement is simply the scalar product $a_{ij} = v_j \cdot \hat{u}_{j|i}$. This agreement is treated as if it were a log likelihood and is added to the initial logit $b_{ij}$ before computing the new values for all the coupling coefficients linking capsule $i$ to higher-level capsules.
In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule
in
the layer above using different transformation matrices for each member of the grid as well as
for
each type of capsule.
Procedure 1 Routing algorithm.
1: procedure ROUTING($\hat{u}_{j|i}$, $r$, $l$)
2:   for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l+1)$: $b_{ij} \leftarrow 0$.
3:   for $r$ iterations do
4:     for all capsule $i$ in layer $l$: $c_i \leftarrow \mathrm{softmax}(b_i)$    ▷ softmax computes Eq. 3
5:     for all capsule $j$ in layer $(l+1)$: $s_j \leftarrow \sum_i c_{ij}\hat{u}_{j|i}$
6:     for all capsule $j$ in layer $(l+1)$: $v_j \leftarrow \mathrm{squash}(s_j)$    ▷ squash computes Eq. 1
7:     for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l+1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
   return $v_j$
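Procedure 1 can be sketched end to end in NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the prediction tensor layout `(num_i, num_j, dim_j)` is an assumed convention, and `squash` implements the squashing nonlinearity of Eq. 1:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Eq. 1: shrink short vectors toward 0, long vectors toward unit length."""
    sq_norm = np.sum(s**2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def routing(u_hat, r):
    """Procedure 1: dynamic routing.

    u_hat: predictions u_hat_{j|i}, assumed shape (num_i, num_j, dim_j).
    r:     number of routing iterations.
    Returns the output vectors v_j, shape (num_j, dim_j).
    """
    num_i, num_j, _ = u_hat.shape
    b = np.zeros((num_i, num_j))                      # step 2: b_ij <- 0
    for _ in range(r):                                # step 3
        c = np.exp(b - b.max(axis=1, keepdims=True))  # step 4: c_i <- softmax(b_i)
        c /= c.sum(axis=1, keepdims=True)
        s = (c[:, :, None] * u_hat).sum(axis=0)       # step 5: s_j <- sum_i c_ij u_hat_{j|i}
        v = squash(s)                                 # step 6: v_j <- squash(s_j)
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)  # step 7: b_ij += u_hat_{j|i} . v_j
    return v

rng = np.random.default_rng(0)
v = routing(rng.standard_normal((6, 10, 16)), r=3)
print(v.shape)  # (10, 16)
```

Note that the logits $b_{ij}$ are re-initialized to zero for each forward pass, so the coupling coefficients are determined by the agreement on the current input rather than being learned parameters.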
3 Margin loss for digit existence
We are using the length of the instantiation vector to represent the probability that a capsule’s entity
exists. We would like the top-level capsule for digit class k to have a long instantiation vector if and
only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss $L_k$ for each digit capsule $k$:
$$L_k = T_k \,\max(0,\, m^+ - \|v_k\|)^2 + \lambda\, (1 - T_k)\, \max(0,\, \|v_k\| - m^-)^2 \qquad (4)$$
where $T_k = 1$ iff a digit of class $k$ is present³ and $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use $\lambda = 0.5$. The total loss is simply the sum of the losses of all digit capsules.
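Eq. 4 is straightforward to express in NumPy. A minimal sketch, not the authors' code (the example capsule lengths are made up for illustration):

```python
import numpy as np

def margin_loss(v_lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Eq. 4, summed over all digit capsules.

    v_lengths: ||v_k|| for each digit capsule, shape (num_classes,).
    targets:   T_k, 1 if digit class k is present in the image, else 0.
    """
    present = targets * np.maximum(0.0, m_plus - v_lengths) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, v_lengths - m_minus) ** 2
    return np.sum(present + absent)

lengths = np.array([0.95, 0.05, 0.2])  # hypothetical capsule lengths
targets = np.array([1.0, 0.0, 0.0])    # only class 0 is present
print(margin_loss(lengths, targets))
```

In this example only the third capsule is penalized: its length 0.2 exceeds $m^- = 0.1$ even though its class is absent, while the present class already exceeds $m^+$ and the second capsule is below $m^-$.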
4 CapsNet architecture
A simple CapsNet architecture is shown in Fig. 1. The architecture is shallow with only two
convolutional layers and one fully connected layer. Conv1 has 256 9 × 9 convolution kernels with a
stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local
feature
detectors that are then used as inputs to the primary capsules.
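The resulting feature-map size follows from standard "valid" convolution arithmetic. A quick check, assuming a 28 × 28 MNIST input (the input size is not stated in this excerpt):

```python
def conv_out(size, kernel, stride):
    """Spatial size of a 'valid' (no padding) convolution output."""
    return (size - kernel) // stride + 1

# Conv1: 9x9 kernels at stride 1 on an assumed 28x28 input.
h1 = conv_out(28, 9, 1)
print(h1)  # 20 -> Conv1 output is 20 x 20 with 256 channels
```

The same helper applies to the stride-2 9 × 9 capsule convolutions described below.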
The primary capsules are the lowest level of multi-dimensional entities and, from an inverse
graphics
perspective, activating the primary capsules corresponds to inverting the rendering
process. This is a
very different type of computation from piecing instantiated parts together to
make familiar wholes,
which is what capsules are designed to be good at.
The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of
convolutional
8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 × 9
kernel and a stride
of 2). Each primary capsule output sees the outputs of all 256 × 81 Conv1