Siamese Recurrent Architectures for Learning Sentence Similarity

Jonas Mueller
Computer Science & Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Aditya Thyagarajan
Department of Computer Science and Engineering
M. S. Ramaiah Institute of Technology
Abstract
We present a siamese adaptation of the Long Short-Term Memory (LSTM) network for labeled data composed of pairs of variable-length sequences. Our model is applied to assess semantic similarity between sentences, where we exceed the state of the art, outperforming carefully handcrafted features and recently proposed neural network systems of greater complexity. For these applications, we provide word-embedding vectors supplemented with synonymic information to the LSTMs, which use a fixed-size vector to encode the underlying meaning expressed in a sentence (irrespective of the particular wording/syntax). By restricting subsequent operations to rely on a simple Manhattan metric, we compel the sentence representations learned by our model to form a highly structured space whose geometry reflects complex semantic relationships. Our results are the latest in a line of findings that showcase LSTMs as powerful language models capable of tasks requiring intricate understanding.
Introduction
Text understanding and information retrieval are important tasks which may be greatly enhanced by modeling the underlying semantic similarity between sentences/phrases. In particular, a good model should not be susceptible to variations of wording/syntax used to express the same idea. Learning such a semantic textual similarity metric has thus generated a great deal of research interest (Marelli et al. 2014). However, this remains a hard problem: labeled data is scarce, sentences have both variable length and complex structure, and bag-of-words/TF-IDF models, while dominant in natural language processing (NLP), are limited in this context by their inherent term-specificity (cf. Mihalcea, Corley, and Strapparava 2006).
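To make this term-specificity limitation concrete, the minimal sketch below (an illustrative assumption on our part: it uses scikit-learn's TfidfVectorizer, which plays no role in this paper) scores two paraphrases that share almost no vocabulary; the bag-of-words representation assigns them near-zero similarity despite their nearly identical meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two paraphrases expressing the same idea with (almost) disjoint vocabulary.
s1 = "A child is playing outside."
s2 = "The kid plays in the yard."

# TF-IDF vectors overlap only on shared terms, so the cosine similarity
# between these two sentences is near zero despite their shared meaning.
vectors = TfidfVectorizer().fit_transform([s1, s2])
print(cosine_similarity(vectors[0], vectors[1]))
```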
As an alternative to these ideas, Mikolov et al. (2013) and others have demonstrated the effectiveness of neural word representations for analogies and other NLP tasks. Recently, interest has shifted toward extending these ideas beyond the individual word level to larger bodies of text such as sentences, where a mapping is learned to represent each sentence as a fixed-length vector (Kiros et al. 2015; Tai, Socher, and Manning 2015; Le and Mikolov 2014).
Naturally suited for variable-length inputs like sentences, recurrent neural networks (RNNs), especially the Long Short-Term Memory model of Hochreiter and Schmidhuber (1997), have been particularly successful in this setting for tasks such as text classification (Graves 2012) and language translation (Sutskever, Vinyals, and Le 2014). RNNs adapt standard feedforward neural networks for sequence data $(x_1, \ldots, x_T)$, where at each $t \in \{1, \ldots, T\}$, updates to a hidden-state vector $h_t$ are performed via

$$h_t = \mathrm{sigmoid}\left( W x_t + U h_{t-1} \right) \qquad (1)$$
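As a purely illustrative sketch of this update, the NumPy snippet below applies Eq. (1) over a toy sequence; the dimensions, initialization scale, and the function name rnn_step are assumptions chosen for illustration rather than details taken from the paper.

```python
import numpy as np

def rnn_step(W, U, x_t, h_prev):
    """One basic-RNN hidden-state update, as in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-(W @ x_t + U @ h_prev)))

# Illustrative sizes: 300-dimensional inputs, 50 hidden units.
d_in, d_hid = 300, 50
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((d_hid, d_in))
U = 0.01 * rng.standard_normal((d_hid, d_hid))

h_t = np.zeros(d_hid)
for x_t in rng.standard_normal((7, d_in)):  # a toy sequence of 7 inputs
    h_t = rnn_step(W, U, x_t, h_t)
```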
While Siegelmann and Sontag (1995) have shown that the basic RNN is Turing-complete, optimization of the weight matrices is difficult because the backpropagated gradients become vanishingly small over long sequences. In practice, the LSTM is superior to basic RNNs for learning long-range dependencies through its use of memory cell units that can store/access information across lengthy input sequences. Like RNNs, the LSTM sequentially updates a hidden-state representation, but these steps also rely on a memory cell containing four components (which are real-valued vectors): a memory state $c_t$, an output gate $o_t$ that determines how the memory state affects other units, as well as an input (and forget) gate $i_t$ (and $f_t$) that controls what gets stored in (and omitted from) memory based on each new input and the current state. Below are the updates performed at each $t \in \{1, \ldots, T\}$ in an LSTM parameterized by weight matrices $W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o$ and bias vectors $b_i, b_f, b_c, b_o$:
$$i_t = \mathrm{sigmoid}\left( W_i x_t + U_i h_{t-1} + b_i \right) \qquad (2)$$
$$f_t = \mathrm{sigmoid}\left( W_f x_t + U_f h_{t-1} + b_f \right) \qquad (3)$$
$$\widetilde{c}_t = \tanh\left( W_c x_t + U_c h_{t-1} + b_c \right) \qquad (4)$$
$$c_t = i_t \odot \widetilde{c}_t + f_t \odot c_{t-1} \qquad (5)$$
$$o_t = \mathrm{sigmoid}\left( W_o x_t + U_o h_{t-1} + b_o \right) \qquad (6)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (7)$$
A more thorough exposition of the LSTM model and its variants is provided by Graves (2012) and Greff et al. (2015).
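For readers who prefer code, here is a minimal NumPy sketch of a single LSTM step implementing Eqs. (2)-(7), with $\odot$ realized as element-wise multiplication; the dimensions, initialization, and the name lstm_step are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM update implementing Eqs. (2)-(7)."""
    Wi, Wf, Wc, Wo, Ui, Uf, Uc, Uo, bi, bf, bc, bo = params
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)       # input gate, Eq. (2)
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)       # forget gate, Eq. (3)
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)   # candidate memory, Eq. (4)
    c_t = i_t * c_tilde + f_t * c_prev               # memory state, Eq. (5)
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)       # output gate, Eq. (6)
    h_t = o_t * np.tanh(c_t)                         # hidden state, Eq. (7)
    return h_t, c_t

# Illustrative sizes only; the paper's hyperparameters are not assumed here.
d_in, d_hid = 300, 50
rng = np.random.default_rng(0)
params = ([0.01 * rng.standard_normal((d_hid, d_in)) for _ in range(4)]
          + [0.01 * rng.standard_normal((d_hid, d_hid)) for _ in range(4)]
          + [np.zeros(d_hid) for _ in range(4)])

h_t, c_t = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.standard_normal((7, d_in)):  # a toy sequence of 7 inputs
    h_t, c_t = lstm_step(params, x_t, h_t, c_t)
```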
Although the success of LSTM language models eludes current theoretical understanding, Sutskever, Vinyals, and Le (2014) empirically validate the intuition that an effectively trained network maps each sentence onto a fixed-length vector which encodes the underlying meaning expressed in the text. Recent works have proposed many