Architecture. Following the STM practice [18], we take res4 features with stride 16 from the base ResNets as our backbone features and discard res5. A $3\times3$ convolutional layer without non-linearity is used as a projection head from the backbone feature to either the key space ($C^k$ dimensional) or the value space ($C^v$ dimensional). We set $C^v$ to be 512 following STM and discuss the choice of $C^k$ in Section 4.1.
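To make this concrete, here is a minimal PyTorch sketch of the two projection heads. The module names, the 1024-channel res4 width (standard for ResNet-50), and the default $C^k = 64$ are our assumptions; the choice of $C^k$ is discussed in Section 4.1.

```python
import torch.nn as nn

class KeyValueProjection(nn.Module):
    """Sketch: project stride-16 res4 backbone features to key/value spaces."""
    def __init__(self, in_dim=1024, c_k=64, c_v=512):  # widths assumed
        super().__init__()
        # 3x3 convolutions without non-linearity, as described above.
        self.key_proj = nn.Conv2d(in_dim, c_k, kernel_size=3, padding=1)
        self.value_proj = nn.Conv2d(in_dim, c_v, kernel_size=3, padding=1)

    def forward(self, feat):
        # feat: (B, in_dim, H, W) at stride 16
        return self.key_proj(feat), self.value_proj(feat)
```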
Feature reuse. As seen from Figure 1, both the key encoder and the value encoder are processing the same frame, albeit with different inputs. It is natural to reuse features from the key encoder (with fewer inputs and a deeper network) at the value encoder. To avoid bloating the feature dimensions and for simplicity, we concatenate the last-layer features from both encoders (before the projection head) and process them with two ResBlocks [62] and a CBAM block³ [64] as the final value output.
3.2 Memory Reading and Decoding
Given $T$ memory frames and a query frame, the feature extraction step generates the following: memory key $k^M \in \mathbb{R}^{C^k \times THW}$, memory value $v^M \in \mathbb{R}^{C^v \times THW}$, and query key $k^Q \in \mathbb{R}^{C^k \times HW}$, where $H$ and $W$ are (stride 16) spatial dimensions. Then, for any similarity measure $c: \mathbb{R}^{C^k} \times \mathbb{R}^{C^k} \to \mathbb{R}$, we can compute the pairwise affinity matrix $S$ and the softmax-normalized affinity matrix $W$, where $S, W \in \mathbb{R}^{THW \times HW}$, with:
$$S_{ij} = c(k^M_i, k^Q_j), \qquad W_{ij} = \frac{\exp(S_{ij})}{\sum_n \exp(S_{nj})}, \tag{1}$$
where $k_i$ denotes the feature vector at the $i$-th position. The similarities are normalized by $\sqrt{C^k}$ as in standard practice [18, 33]; this normalization is not shown for brevity. In STM [18], the dot product is used as $c$.
Memory reading regularization like KMN [22] or top-k filtering [21] can be applied at this step.
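As a sketch, Equation 1 with the dot product as $c$ (including the $\sqrt{C^k}$ scaling) reduces to a single matrix product followed by a softmax over the memory axis; variable names are ours:

```python
import torch

def affinity(k_mem, k_qry):
    """Eq. 1 sketch with dot-product similarity.

    k_mem: (C_k, T*H*W) flattened memory keys
    k_qry: (C_k, H*W)   flattened query keys
    Returns W of shape (T*H*W, H*W); each column sums to one.
    """
    c_k = k_mem.shape[0]
    S = k_mem.t() @ k_qry / (c_k ** 0.5)  # pairwise similarities
    return torch.softmax(S, dim=0)        # normalize over memory positions
```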
With the normalized affinity matrix $W$, the aggregated readout feature $v^Q \in \mathbb{R}^{C^v \times HW}$ for the query frame can be computed as a weighted sum of the memory features with an efficient matrix multiplication:
$$v^Q = v^M W, \tag{2}$$
which is then passed to the decoder for mask generation.
In the case of multi-object segmentation, only Equation 2 has to be repeated, as $W$ is defined between image features only and is thus the same for different objects. In the case of STM [18], $W$ must be recomputed for each object instead. A detailed running time analysis can be found in Section 6.2.
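In the same sketch notation, Equation 2 and its reuse across objects look as follows (the per-object value tensors are hypothetical):

```python
def readout(v_mem, W):
    """Eq. 2 sketch: weighted sum of memory values, v^Q = v^M W.

    v_mem: (C_v, T*H*W), W: (T*H*W, H*W) -> v_qry: (C_v, H*W)
    """
    return v_mem @ W

# Multi-object case: W depends only on image features, so it is
# computed once per query frame and shared across all objects.
# W = affinity(k_mem, k_qry)                         # once
# readouts = [readout(v, W) for v in object_values]  # repeated per object
```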
Decoder. Our decoder structure stays close to that of STM [18], as it is not the focus of this paper. Features are processed and gradually upsampled by a factor of two at each stage, with higher-resolution features from the key encoder incorporated using skip-connections. The final layer of the decoder produces a stride 4 mask, which is bilinearly upsampled to the original resolution. In the case of multiple objects, soft aggregation [18] of the output masks is used.
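A minimal sketch of one such upsampling stage follows; module names and channel widths are our assumptions, and the actual STM-style decoder has more machinery:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleStage(nn.Module):
    """Sketch: 2x upsample and merge a higher-resolution skip feature."""
    def __init__(self, skip_dim, in_dim, out_dim):
        super().__init__()
        self.skip_conv = nn.Conv2d(skip_dim, out_dim, 3, padding=1)
        self.out_conv = nn.Conv2d(in_dim, out_dim, 3, padding=1)

    def forward(self, skip_feat, x):
        # Upsample decoder features by 2x, then add the projected
        # key-encoder skip feature from the matching resolution.
        x = F.interpolate(x, scale_factor=2, mode='bilinear',
                          align_corners=False)
        return torch.relu(self.out_conv(x) + self.skip_conv(skip_feat))
```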
3.3 Memory Management
So far, we have assumed the existence of a memory bank of size $T$. Here, we describe the construction of the memory bank. For each memory frame, we store two items: a memory key and a memory value. Note that all memory frames (except the first one) were once query frames. The memory key is simply reused from the query key described in Section 3.1, without extra computation. The memory value is computed after mask generation of that frame, independently for each object, as the value encoder takes both the image and the object mask as inputs.
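This bookkeeping can be sketched as a simple list-based store (names are ours):

```python
import torch

class MemoryBank:
    """Sketch: per-frame (key, value) storage for T memory frames."""
    def __init__(self):
        self.keys = []    # each (C_k, H*W): the reused query key
        self.values = []  # each (num_objects, C_v, H*W): per-object values

    def add(self, query_key, value):
        # Key is reused from the query key (no extra computation);
        # value is encoded per object after that frame's mask is generated.
        self.keys.append(query_key)
        self.values.append(value)

    def get(self):
        # Concatenate along the spatial axis: (C_k, T*H*W), (N, C_v, T*H*W).
        return torch.cat(self.keys, dim=-1), torch.cat(self.values, dim=-1)
```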
STM [18] considers every fifth query frame as a memory frame, and the immediately previous frame as a temporary memory frame, to ensure accurate matching. In the case of STCN, we find that it is unnecessary, and in fact harmful, to include the last frame as temporary memory. This is a direct consequence of using shared key encoders: 1) key features are sufficiently robust to match well without the need for close-range (temporal) propagation, and 2) the temporary memory key would otherwise be too similar to that of the query, since the image context usually changes smoothly and we do not have the encoding noise that results from distinct encoders, leading to drifting.⁴ This modification also reduces the number of calls to the value encoder, contributing to a significant speedup.
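Put together, the STCN memory schedule is a simple rule. This sketch reuses the hypothetical helpers above; frames, first_frame_mask, key_encoder, value_encoder, and decoder stand in for the inputs and networks of Section 3.1 and are our assumptions.

```python
MEMORY_EVERY = 5  # every fifth frame, following the STM convention

bank = MemoryBank()
mask = first_frame_mask  # given ground-truth mask for frame 0
for t, frame in enumerate(frames):
    query_key = key_encoder(frame)         # hypothetical call
    if t > 0:
        k_mem, v_mem = bank.get()
        W = affinity(k_mem, query_key)
        mask = decoder(readout(v_mem, W))  # hypothetical call
    # No temporary memory of the previous frame, unlike STM.
    if t % MEMORY_EVERY == 0:
        bank.add(query_key, value_encoder(frame, mask))
```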
³ We find this block to be non-essential in a later experiment, but it is kept for consistency.
⁴ This effect is amplified by the use of L2 similarity. See the supplementary material for a full comparison.