The details of augmented residual connections and
other layers are introduced as follows.
2.1 Augmented Residual Connections
To provide richer features for alignment processes, RE2 adopts an augmented version of residual connections to connect consecutive blocks. For a sequence of length $l$, we denote the input and output of the $n$-th block as $x^{(n)} = (x^{(n)}_1, x^{(n)}_2, \dots, x^{(n)}_l)$ and $o^{(n)} = (o^{(n)}_1, o^{(n)}_2, \dots, o^{(n)}_l)$, respectively. Let $o^{(0)}$ be a sequence of zero vectors. The input of the first block $x^{(1)}$, as mentioned before, is the output of the embedding layer (denoted by blank rectangles in Figure 1). The input of the $n$-th block $x^{(n)}$ ($n \ge 2$) is the concatenation of the input of the first block $x^{(1)}$ and the summation of the outputs of the previous two blocks (denoted by rectangles with diagonal stripes in Figure 1):
$$x^{(n)}_i = [x^{(1)}_i;\, o^{(n-1)}_i + o^{(n-2)}_i], \quad (1)$$
where $[\cdot\,;\cdot]$ denotes the concatenation operation.
With augmented residual connections, there are
three parts in the input of alignment and fusion
layers, namely original point-wise features kept
untouched along the way (Embedding vectors),
previous aligned features processed and refined by
previous blocks (Residual vectors), and contextual
features from the encoder layer (Encoded vectors).
Each of these three parts plays a complementary role in the text matching process.
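
As a concrete illustration, the following minimal PyTorch-style sketch shows how the block inputs of Eq. (1) can be assembled. It is not the authors' released implementation: the `blocks` list and its callables are hypothetical placeholders that abstract away the encoder, alignment, and fusion layers inside each block.

```python
import torch

def run_blocks(x1, blocks):
    """Minimal sketch of Eq. (1): x^(n)_i = [x^(1)_i ; o^(n-1)_i + o^(n-2)_i] for n >= 2."""
    # x1: output of the embedding layer, shape (batch, seq_len, d); it is also x^(1).
    # blocks: hypothetical list of callables; blocks[n-1] maps the n-th block's
    #         input to its output o^(n) (encoder + alignment + fusion, abstracted away).
    o_prev = blocks[0](x1)                  # o^(1), computed from x^(1) = x1
    o_prev_prev = torch.zeros_like(o_prev)  # o^(0) is a sequence of zero vectors
    for block in blocks[1:]:
        # Eq. (1): concatenate the first block's input with the sum of the
        # outputs of the two preceding blocks.
        x_n = torch.cat([x1, o_prev + o_prev_prev], dim=-1)
        o_prev_prev, o_prev = o_prev, block(x_n)
    return o_prev                           # output of the last block
```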
2.2 Alignment Layer
A simple form of alignment based on the attention
mechanism is used following Parikh et al. (2016)
with minor modifications. The alignment layer, as
shown in Figure 1, takes features from the two se-
quences as input and computes the aligned repre-
sentations as output. Input from the first sequence
of length $l_a$ is denoted as $a = (a_1, a_2, \dots, a_{l_a})$ and input from the second sequence of length $l_b$ is denoted as $b = (b_1, b_2, \dots, b_{l_b})$. The similarity score $e_{ij}$ between $a_i$ and $b_j$ is computed as the dot product of the projected vectors:
$$e_{ij} = F(a_i)^\top F(b_j). \quad (2)$$
F is an identity function or a single-layer feed-
forward network. The choice is treated as a hyper-
parameter.
The output vectors $a'$ and $b'$ are computed by weighted summation of representations of the other sequence. The summation is weighted by similarity scores between the current position and the corresponding positions in the other sequence:
$$a'_i = \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b} \exp(e_{ik})}\, b_j, \qquad b'_j = \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a} \exp(e_{kj})}\, a_i. \quad (3)$$
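
The alignment layer of Eqs. (2)-(3) can be sketched as follows, assuming batched tensors and omitting the masking of padded positions that a real implementation would need. The `proj` argument stands in for the projection $F$ and is an assumption for illustration, not the authors' code.

```python
import torch

def align(a, b, proj=None):
    """Minimal sketch of Eqs. (2)-(3)."""
    # a: (batch, l_a, d), b: (batch, l_b, d); padding masks are omitted here.
    # proj: the projection F of Eq. (2) -- identity (None) or a single-layer
    #       feed-forward network; the paper treats this choice as a hyperparameter.
    fa = a if proj is None else proj(a)
    fb = b if proj is None else proj(b)
    # Eq. (2): e_ij = F(a_i)^T F(b_j), computed for all pairs (i, j) at once.
    e = torch.matmul(fa, fb.transpose(1, 2))                             # (batch, l_a, l_b)
    # Eq. (3): attention-weighted summaries of the other sequence.
    a_prime = torch.matmul(torch.softmax(e, dim=2), b)                   # (batch, l_a, d)
    b_prime = torch.matmul(torch.softmax(e, dim=1).transpose(1, 2), a)   # (batch, l_b, d)
    return a_prime, b_prime
```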
2.3 Fusion Layer
The fusion layer compares local and aligned representations from three perspectives and then fuses them together. The output of the fusion layer for the first sequence, $\bar{a}$, is computed by
$$\begin{aligned}
\bar{a}^1_i &= G_1([a_i; a'_i]), \\
\bar{a}^2_i &= G_2([a_i; a_i - a'_i]), \\
\bar{a}^3_i &= G_3([a_i; a_i \circ a'_i]), \\
\bar{a}_i &= G([\bar{a}^1_i; \bar{a}^2_i; \bar{a}^3_i]),
\end{aligned} \quad (4)$$
where $G_1$, $G_2$, $G_3$, and $G$ are single-layer feed-forward networks with independent parameters and $\circ$ denotes element-wise multiplication. The subtraction operator highlights the difference between the two vectors, while the multiplication highlights similarity. Formulations for $\bar{b}$ are similar and omitted here.
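
A minimal sketch of Eq. (4) is given below. The ReLU activations and the output dimension `d_out` are assumptions made for illustration and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Minimal sketch of Eq. (4) for one sequence; the formulation for b is identical."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # G1, G2, G3 and G: single-layer feed-forward networks with independent
        # parameters; the ReLU activation and d_out are illustrative assumptions.
        self.g1 = nn.Sequential(nn.Linear(2 * d_in, d_out), nn.ReLU())
        self.g2 = nn.Sequential(nn.Linear(2 * d_in, d_out), nn.ReLU())
        self.g3 = nn.Sequential(nn.Linear(2 * d_in, d_out), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(3 * d_out, d_out), nn.ReLU())

    def forward(self, a, a_prime):
        # Three comparison perspectives: plain concatenation, difference, and
        # element-wise product, each paired with the original vector a_i.
        x1 = self.g1(torch.cat([a, a_prime], dim=-1))
        x2 = self.g2(torch.cat([a, a - a_prime], dim=-1))
        x3 = self.g3(torch.cat([a, a * a_prime], dim=-1))
        return self.g(torch.cat([x1, x2, x3], dim=-1))
```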
2.4 Prediction Layer
The prediction layer takes the vector representa-
tions of the two sequences $v_1$ and $v_2$ from the pooling layers as input and predicts the final target following Mou et al. (2016):
$$\hat{\mathbf{y}} = H([v_1; v_2; v_1 - v_2; v_1 \circ v_2]). \quad (5)$$
H is a multi-layer feed-forward neural network.
In a classification task, $\hat{\mathbf{y}} \in \mathbb{R}^C$ represents the unnormalized predicted scores for all classes, where $C$ is the number of classes. The predicted class is $\hat{y} = \operatorname{argmax}_i \hat{\mathbf{y}}_i$. In a regression task, $\hat{y}$ is the predicted scalar value.
In symmetric tasks like paraphrase identifica-
tion, a symmetric version of the prediction layer
is used for better generalization:
$$\hat{\mathbf{y}} = H([v_1; v_2; |v_1 - v_2|; v_1 \circ v_2]). \quad (6)$$
We also provide a simplified version of the pre-
diction layer. Which version to use is treated as
a hyperparameter. The simplified prediction layer
can be expressed as:
$$\hat{\mathbf{y}} = H([v_1; v_2]). \quad (7)$$
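
For illustration, the following sketch assembles the input features of $H$ for the three prediction-layer variants in Eqs. (5)-(7). The `mode` flag and its names are hypothetical, since the paper simply treats the choice of variant as a hyperparameter.

```python
import torch

def prediction_features(v1, v2, mode="full"):
    """Assembles the input of H for the three prediction-layer variants (Eqs. 5-7)."""
    # v1, v2: pooled vector representations of the two sequences, shape (batch, d).
    # The returned features are fed into H, a multi-layer feed-forward network.
    if mode == "full":           # Eq. (5)
        feats = [v1, v2, v1 - v2, v1 * v2]
    elif mode == "symmetric":    # Eq. (6), e.g. for paraphrase identification
        feats = [v1, v2, torch.abs(v1 - v2), v1 * v2]
    else:                        # Eq. (7), simplified version
        feats = [v1, v2]
    return torch.cat(feats, dim=-1)
```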