Figure 2: Illustration of the Inverted Dot-Product Attention Routing with the pose admitting matrix structure.
Procedure 1 Inverted Dot-Product Attention Routing algorithm returns updated poses of the capsules in layer $L+1$ given poses in layers $L$ and $L+1$ and weights between layers $L$ and $L+1$.
1: procedure INVERTED DOT-PRODUCT ATTENTION ROUTING($P^L$, $P^{L+1}$, $W^L$)
2:   for all capsule $i$ in layer $L$ and capsule $j$ in layer $(L+1)$: $v^L_{ij} \leftarrow W^L_{ij} \cdot p^L_i$   ▷ vote
3:   for all capsule $i$ in layer $L$ and capsule $j$ in layer $(L+1)$: $a^L_{ij} \leftarrow {p^{L+1}_j}^{\top} \cdot v^L_{ij}$   ▷ agreement
4:   for all capsule $i$ in layer $L$: $r^L_{ij} \leftarrow \exp(a^L_{ij}) \,/\, \sum_{j'} \exp(a^L_{ij'})$   ▷ routing coefficient
5:   for all capsule $j$ in layer $(L+1)$: $p^{L+1}_j \leftarrow \sum_i r^L_{ij} v^L_{ij}$   ▷ pose update
6:   for all capsule $j$ in layer $(L+1)$: $p^{L+1}_j \leftarrow \mathrm{LayerNorm}(p^{L+1}_j)$   ▷ normalization
7: return $P^{L+1}$
transformation is done using a learned transformation matrix $W^L_{ij}$:
$$v^L_{ij} = W^L_{ij} \cdot p^L_i, \qquad (1)$$
where the matrix $W^L_{ij} \in \mathbb{R}^{d_{L+1} \times d_L}$ if the pose has a vector structure and $W^L_{ij} \in \mathbb{R}^{\sqrt{d_{L+1}} \times \sqrt{d_L}}$ (requires $d_{L+1} = d_L$) if the pose has a matrix structure. Next, the agreement ($a^L_{ij}$) is computed by the dot-product similarity between a pose $p^{L+1}_j$ and a vote $v^L_{ij}$:
$$a^L_{ij} = {p^{L+1}_j}^{\top} \cdot v^L_{ij}. \qquad (2)$$
The pose $p^{L+1}_j$ is obtained from the previous iteration of this procedure, and will be set to $\mathbf{0}$ initially.
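To make this step concrete, below is a minimal PyTorch sketch of the vote and agreement computation for the vector-pose case; the tensor names and shapes (n_L lower-level capsules, n_L1 higher-level capsules) are illustrative assumptions, not the authors' released code.

import torch

def votes_and_agreements(pose_L, pose_L1, W):
    # pose_L:  (n_L, d_L)             poses of the lower-level capsules
    # pose_L1: (n_L1, d_L1)           poses of the higher-level capsules (zeros at the first iteration)
    # W:       (n_L, n_L1, d_L1, d_L) learned transformation matrices W^L_ij
    votes = torch.einsum('ijab,ib->ija', W, pose_L)           # Eq. (1): v^L_ij, shape (n_L, n_L1, d_L1)
    agreements = torch.einsum('ja,ija->ij', pose_L1, votes)   # Eq. (2): a^L_ij, shape (n_L, n_L1)
    return votes, agreements

For the matrix-pose variant, $W^L_{ij}$ and $p^L_i$ would instead be reshaped into $\sqrt{d} \times \sqrt{d}$ matrices, multiplied, and flattened back into vectors.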
Step 2: Computing Poses: The agreement scores $a^L_{ij}$ are passed through a softmax function to determine the routing probabilities $r^L_{ij}$:
$$r^L_{ij} = \frac{\exp(a^L_{ij})}{\sum_{j'} \exp(a^L_{ij'})}, \qquad (3)$$
where $r^L_{ij}$ is an inverted attention score representing how higher-level capsules compete for the attention of lower-level capsules. Using the routing probabilities, we update the pose $p^{L+1}_j$ for capsule $j$ in layer $L+1$ from all capsules in layer $L$:
$$p^{L+1}_j = \mathrm{LayerNorm}\!\left(\sum_i r^L_{ij}\, v^L_{ij}\right). \qquad (4)$$
We adopt Layer Normalization (Ba et al., 2016) as the normalization, which we empirically find improves the convergence of routing. The routing algorithm is summarized in Procedure 1 and Figure 2.
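Putting the two steps together, the following is a compact sketch of the full routing loop of Procedure 1 in PyTorch; the function name, tensor shapes, and number of routing iterations are assumptions made for exposition rather than the authors' released implementation.

import torch
import torch.nn.functional as F

def inverted_dot_product_attention_routing(pose_L, W, num_iters=2):
    # pose_L: (n_L, d_L) lower-level poses; W: (n_L, n_L1, d_L1, d_L) transformation matrices.
    n_L1, d_L1 = W.shape[1], W.shape[2]
    votes = torch.einsum('ijab,ib->ija', W, pose_L)                   # Eq. (1): votes, computed once
    pose_L1 = torch.zeros(n_L1, d_L1,
                          dtype=pose_L.dtype, device=pose_L.device)  # p^{L+1}_j set to 0 initially
    for _ in range(num_iters):
        agreements = torch.einsum('ja,ija->ij', pose_L1, votes)      # Eq. (2): dot-product agreement
        r = F.softmax(agreements, dim=1)                              # Eq. (3): softmax over higher-level capsules j
        pose_L1 = torch.einsum('ij,ija->ja', r, votes)                # Eq. (4): weighted sum of votes ...
        pose_L1 = F.layer_norm(pose_L1, (d_L1,))                      # ... followed by Layer Normalization
    return pose_L1

For instance, with pose_L of shape (32, 16) and W of shape (32, 10, 16, 16), calling inverted_dot_product_attention_routing(pose_L, W) would return ten higher-level poses of dimension 16. Because the softmax is taken over the higher-level capsules $j$, each lower-level capsule distributes a unit budget of routing probability among them, which is the sense in which the attention is inverted.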
4 INFERENCE AND LEARNING
To explain how inference and learning are performed, we use Figure 1 as an example. Note that the choice of the backbone, the number of capsule layers, the number of capsules per layer, and the design of the classifier may vary for different sets of experiments. We defer the discussion of these configurations to Sections 5 and 6 and to the Appendix.