Node $\tau$ calculates the attention weight on its neighbor $\eta$ using hop query $\hat{q}_{\tau,0}$ and key $\hat{k}_{\eta,0}$. Then it uses the weights to combine its neighbors' values $\hat{v}_{\eta,0}$ and forms a globalized representation $\hat{h}^l_{\tau,0}$.
The two attention mechanisms are combined to form the new representation of layer $l$:
$\tilde{h}^l_{\tau,0} = \text{Linear}([h^l_{\tau,0} \circ \hat{h}^l_{\tau,0}]),$  (8)
$\tilde{h}^l_{\tau,i} = h^l_{\tau,i}; \quad \forall i \neq 0.$  (9)
Note that the non-hub tokens ($i \neq 0$) still have access to the hop attention in the previous layer through Eqn. (6).
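To make the update concrete, below is a minimal single-head sketch of the hub-token combination in Eqns. (8)-(9), assuming node hidden states of shape (num_nodes, seq_len, dim) with the hub [CLS] token at position 0 and a 0/1 edge matrix; the module and tensor names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HopAttentionCombine(nn.Module):
    """Minimal sketch of the hub-token update in Eqns. (8)-(9).

    h: (num_nodes, seq_len, dim) hidden states of all node sequences, hub token
    at position 0. edges: (num_nodes, num_nodes) 0/1 torch tensor where
    edges[i, j] = 1 means information flows from node i to node j.
    Single-head and unbatched for clarity.
    """

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)            # hop query  \hat{q}
        self.k = nn.Linear(dim, dim)            # hop key    \hat{k}
        self.v = nn.Linear(dim, dim)            # hop value  \hat{v}
        self.combine = nn.Linear(2 * dim, dim)  # Linear([h \circ \hat{h}]) in Eqn. (8)

    def forward(self, h, edges):
        hub = h[:, 0, :]                                     # h^l_{tau,0} for every node tau
        q, k, v = self.q(hub), self.k(hub), self.v(hub)
        scores = q @ k.t() / hub.size(-1) ** 0.5             # (num_nodes, num_nodes)
        scores = scores.masked_fill(edges.t() == 0, float("-inf"))  # only attend to neighbors
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))   # hop weights; 0 for isolated nodes
        h_hat = attn @ v                                     # globalized \hat{h}^l_{tau,0}
        h_new = h.clone()                                    # Eqn. (9): non-hub tokens unchanged
        h_new[:, 0, :] = self.combine(torch.cat([hub, h_hat], dim=-1))  # Eqn. (8)
        return h_new
```

A batched, multi-head variant follows the same pattern.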
One layer of eXtra Hop attention can be viewed as a single step of information propagation along edges $E$. For example, in Figure 1a, the document node $d_3$ updates its representation by gathering information from its neighbor $d_1$ using the hop attention $d_1 \rightarrow d_3$. When multiple Transformer-XH layers are stacked, this information in $d_1$ includes both $d_1$'s local contexts from its in-sequence attention and cross-sequence information from the hop attention $d_2 \rightarrow d_1$ of the $(l-1)$-th layer. Hence, an $L$-layer Transformer-XH can attend over information from up to $L$ hops away.
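This L-hop reach can be checked with a toy reachability computation; the sketch below uses a hypothetical three-node graph matching the Figure 1a example (edges d1 → d3 and d2 → d1) and shows that after two layers d3 has received information originating at d2.

```python
import numpy as np

# Hypothetical 3-node evidence graph mirroring Figure 1a: edges d1 -> d3 and
# d2 -> d1 (node order d1, d2, d3). A[i, j] = 1 means information flows from
# node i to node j in one hop-attention step.
A = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 0, 0]])

# reach[i, j] = 1 if node j's hub has (transitively) received information
# originating at node i after the layers applied so far.
reach = np.eye(3, dtype=int)
for layer in range(2):                              # two stacked Transformer-XH layers
    reach = ((reach + reach @ A) > 0).astype(int)

print(reach[:, 2])                                  # -> [1 1 1]: d3 now sees d1, d2, and itself
```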
Together, three main properties equip Transformer-XH to effectively model raw structured text data: the propagation of information (values) along edges, the importance of that information (hop attention weights), and the balance of in-sequence and cross-sequence information (attention combination). The representations learned in H can innately express nuances in structured text that are required for complex reasoning tasks such as multi-hop QA and natural language inference.
3 APPLICATION TO MULTI-HOP QUESTION ANSWERING
This section describes how Transformer-XH applies to multi-hop QA. Given a question q, the task
is to find an answer span a in a large open-domain document corpus, e.g. the first paragraph of
all Wikipedia pages. By design, the questions are complex and often require information from
multiple documents to answer. For example, in the case shown in Figure 1b, the correct answer
“Cambridge” requires combining the information from both the Wikipedia pages “Facebook” and
“Harvard University”. To apply Transformer-XH in the open domain multi-hop QA task, we first
construct an evidence graph and then apply Transformer-XH on the graph to find the answer.
Evidence Graph Construction. The first step is to find the relevant candidate documents D for
the question q and connect them with edges E to form the graph G. Our set D consists of three
sources. The first two sources are from canonical information retrieval and entity linking techniques:
$D^{ir}$: the top 100 documents retrieved by DrQA's TF-IDF on the question (Chen et al., 2017).
$D^{el}$: the Wikipedia documents associated with the entities that appear in the question, annotated by entity linking systems: TagMe (Ferragina & Scaiella, 2010) and CMNS (Hasibi et al., 2017).
For better retrieval quality, we use a BERT ranker (Nogueira & Cho, 2019) on the set $D^{ir} \cup D^{el}$ and keep the top two ranked ones in $D^{ir}$ and the top one per question entity in $D^{el}$. Then the third source $D^{exp}$ includes all documents connected to or from any top ranked documents via Wikipedia hyperlinks (e.g., “Facebook” → “Harvard University”).
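A rough sketch of this candidate construction is given below; the four callables (tfidf_retrieve, entity_link, bert_rank, wiki_links) are hypothetical stand-ins for DrQA's TF-IDF retriever, the TagMe/CMNS linkers, the BERT ranker, and a Wikipedia hyperlink index, and are not interfaces from the paper's code.

```python
def build_candidate_documents(question, tfidf_retrieve, entity_link, bert_rank, wiki_links):
    """Sketch of the three-source candidate set: D^ir, D^el, and D^exp.

    tfidf_retrieve(question, k) -> list of docs; entity_link(question) -> dict
    {entity: candidate docs}; bert_rank(question, docs) -> dict {doc: score};
    wiki_links(doc) -> docs linked to or from doc. All four are hypothetical
    placeholders.
    """
    d_ir = tfidf_retrieve(question, k=100)                     # D^ir: top-100 retrieved docs
    d_el = entity_link(question)                               # D^el, grouped by question entity

    pool = list(dict.fromkeys(d_ir + [d for docs in d_el.values() for d in docs]))
    scores = bert_rank(question, pool)                         # BERT ranker over D^ir ∪ D^el

    top = sorted(d_ir, key=lambda d: -scores[d])[:2]           # keep top two from D^ir
    for docs in d_el.values():                                 # keep top one per question entity
        if docs:
            top.append(max(docs, key=lambda d: scores[d]))

    d_exp = [linked for d in top for linked in wiki_links(d)]  # D^exp: hyperlink expansion
    return list(dict.fromkeys(top + d_exp))                    # deduplicated node set X
```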
The final graph comprises all documents from the three sources as nodes X. The edge matrix E is flexible. We experiment with various edge matrix settings, including directed edges along Wikipedia links, i.e., $e_{ij} = 1$ if there is a hyperlink from document $i$ to $j$; bidirectional edges along Wiki links; and fully-connected graphs, which rely on Transformer-XH to learn the edge importance.
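As an illustration, the three edge settings could be constructed as follows; links_to(a, b) is a hypothetical predicate for "document a hyperlinks to document b".

```python
import numpy as np

def build_edge_matrix(docs, links_to, mode="directed"):
    """Sketch of the three edge settings described above."""
    n = len(docs)
    if mode == "fully_connected":
        return np.ones((n, n), dtype=int)       # let Transformer-XH learn edge importance
    e = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and links_to(docs[i], docs[j]):
                e[i, j] = 1                     # directed edge along the Wikipedia link
    if mode == "bidirectional":
        e = np.maximum(e, e.T)                  # also add the reverse direction
    return e
```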
Similar to previous work (Ding et al., 2019), the textual representation for each node in the graph
is the [SEP]-delimited concatenation of the question, the anchor text (the text of the hyperlink in the parent node that points to the child node), and the paragraph itself. More details on the evidence graph
construction are in Appendix A.1.
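For illustration, one node's input sequence could be assembled with a standard BERT tokenizer as in the sketch below; the exact ordering and truncation policy are assumptions, since the text above only specifies that the three pieces are [SEP]-delimited.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def node_inputs(question, anchor_text, paragraph, max_len=512):
    """[CLS] question [SEP] anchor text [SEP] paragraph [SEP] for one graph node."""
    sep = tokenizer.sep_token
    text = f"{question} {sep} {anchor_text} {sep} {paragraph}"
    return tokenizer(text, truncation=True, max_length=max_len, return_tensors="pt")
```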
Transformer-XH on Evidence Graph. Transformer-XH takes the input nodes X and edges E,
and produces the global representation of all text sequences:
$H^L = \text{Transformer-XH}(X, E).$  (10)
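In code, Eqn. (10) amounts to stacking per-node in-sequence attention and the hop-attention hub update over all node sequences; the minimal sketch below reuses the HopAttentionCombine module sketched earlier and substitutes a stock TransformerEncoderLayer for the paper's BERT-initialized Transformer layers.

```python
import torch.nn as nn

class TransformerXHLayer(nn.Module):
    """One sketched layer: per-node in-sequence attention, then the hub update."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # A stock encoder layer stands in for the paper's BERT-initialized layer.
        self.in_sequence = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.hop = HopAttentionCombine(dim)      # from the earlier sketch

    def forward(self, h, edges):
        return self.hop(self.in_sequence(h), edges)

class TransformerXH(nn.Module):
    """H^L = Transformer-XH(X, E), Eqn. (10): L layers over all node sequences."""
    def __init__(self, dim=768, heads=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(TransformerXHLayer(dim, heads) for _ in range(num_layers))

    def forward(self, x, edges):
        h = x                                    # x: (num_nodes, seq_len, dim) token embeddings
        for layer in self.layers:
            h = layer(h, edges)
        return h                                 # H^L, used downstream for answer prediction
```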