3.2 Neural Bellman-Ford Networks
While the generalized Bellman-Ford algorithm can solve many classical methods (Theorem 6), these
methods instantiate the path formulation with handcrafted operators (Table 1), and may not be opti-
mal for link prediction. To improve the capacity of the path formulation, we propose a general frame-
work, Neural Bellman-Ford Networks (NBFNet), to learn the operators in the pairwise representa-
tions.
Neural Parameterization. We relax the semiring assumption and parameterize the generalized
Bellman-Ford algorithm (Equations 3 and 4) with 3 neural functions, namely the INDICATOR, MESSAGE
and AGGREGATE functions. The INDICATOR function replaces the indicator function $\mathbb{1}_q(u = v)$.
The MESSAGE function replaces the binary multiplication operator ⊗. The AGGREGATE function
is a permutation invariant function over sets that replaces the n-ary summation operator ⊕. Note
that one may alternatively define AGGREGATE as the commutative binary operator ⊕ and apply it
to a sequence of messages. However, this will make the parameterization more complicated.
Now consider the generalized Bellman-Ford algorithm for a given entity u and relation q. In this
context, we abbreviate $h_q^{(t)}(u, v)$ as $h_v^{(t)}$, i.e., the representation of entity v in the t-th iteration. It
should be stressed that $h_v^{(t)}$ is still a pairwise representation, rather than a node representation. By
substituting the neural functions into Equations 3 and 4, we get our Neural Bellman-Ford Networks.
$$h_v^{(0)} \leftarrow \text{INDICATOR}(u, v, q) \tag{5}$$
$$h_v^{(t)} \leftarrow \text{AGGREGATE}\left(\left\{\text{MESSAGE}\left(h_x^{(t-1)}, w_q(x, r, v)\right) \,\middle|\, (x, r, v) \in E(v)\right\} \cup \left\{h_v^{(0)}\right\}\right) \tag{6}$$
NBFNet can be interpreted as a novel GNN framework for learning pairwise representations. Com-
pared to common GNN frameworks [28, 43] that compute the pairwise representation as two inde-
pendent node representations $h_q(u)$ and $h_q(v)$, NBFNet initializes a representation on the source
node u, and reads out the pairwise representation on the target node v. Intuitively, our framework can
be viewed as a source-specific message passing process, where every node learns a representation
conditioned on the source node. The pseudo code of NBFNet is outlined in Algorithm 1.
Algorithm 1 Neural Bellman-Ford Networks
Input: source node u, query relation q, #layers T
Output: pair representations h_q(u, v) for all v ∈ V
1: for v ∈ V do                                  ⊲ Boundary condition
2:     h_v^(0) ← INDICATOR(u, v, q)
3: end for
4: for t ← 1 to T do                             ⊲ Bellman-Ford iteration
5:     for v ∈ V do
6:         M^(t) ← {h_v^(0)}                     ⊲ Message augmentation
7:         for (x, r, v) ∈ E(v) do
8:             m^(t)_(x,r,v) ← MESSAGE^(t)(h_x^(t-1), w_q(x, r, v))
9:             M^(t) ← M^(t) ∪ {m^(t)_(x,r,v)}
10:        end for
11:        h_v^(t) ← AGGREGATE^(t)(M^(t))
12:    end for
13: end for
14: return h_v^(T) as h_q(u, v) for all v ∈ V
Design Space. Now we discuss some principled designs for MESSAGE, AGGREGATE and INDICATOR
functions by drawing insights from traditional methods. Note that the potential design space for
NBFNet is much larger than what is presented here, as one can always borrow MESSAGE and
AGGREGATE functions from the arsenal of message-passing GNNs [17, 14, 60].
For the MESSAGE function, traditional methods instantiate it as natural summation, natural multi-
plication or min over scalars. Therefore, we may use the vectorized version of summation or
multiplication. Intuitively, the summation of $h_x^{(t-1)}$ and $w_q(x, r, v)$ can be interpreted as a
translation of $h_x^{(t-1)}$ by $w_q(x, r, v)$ in the pairwise representation space, while multiplication
corresponds to scaling. Such transformations correspond to the relational operators [16, 40] in
knowledge graph embeddings [5, 62, 53, 26, 47]. For example, translation and scaling are the
relational operators used in TransE [5] and DistMult [62] respectively. We also consider the rotation
operator in RotatE [47].
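As a rough sketch of these three MESSAGE choices, assuming h_x and the edge representation are vectors of dimension dim (and, for rotation, that the edge input is a vector of dim/2 angles, following the RotatE convention of pairing dimensions as complex coordinates):

import torch

def translate(h_x, w):            # TransE-style MESSAGE: translation
    return h_x + w

def scale(h_x, w):                # DistMult-style MESSAGE: element-wise scaling
    return h_x * w

def rotate(h_x, theta):           # RotatE-style MESSAGE: rotation by angles theta
    h_re, h_im = h_x.chunk(2, dim=-1)          # view h_x as dim/2 complex numbers
    cos, sin = theta.cos(), theta.sin()
    return torch.cat([h_re * cos - h_im * sin,
                      h_re * sin + h_im * cos], dim=-1)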
The AGGREGATE function is instantiated as natural summation, max or min in traditional methods,
which are reminiscent of set aggregation functions [65, 60, 7] used in GNNs. Therefore, we specify
the AGGREGATE function to be sum, mean, or max, followed by a linear transformation and a non-
linear activation. We also consider the principal neighborhood aggregation (PNA) proposed in a
recent work [7], which jointly learns the types and scales of the aggregation function.
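A minimal sketch of such an aggregation module is given below, assuming the messages for a node are stacked into a tensor of shape (num_messages, dim); the "pna" branch only concatenates a few aggregation statistics in the spirit of [7] and omits the degree scalers, so it is an approximation rather than the full PNA.

import torch
import torch.nn as nn

class Aggregate(nn.Module):
    def __init__(self, dim, kind="sum"):
        super().__init__()
        self.kind = kind
        # mean/max/min/std statistics are concatenated for the "pna"-like variant
        in_dim = dim * 4 if kind == "pna" else dim
        self.linear = nn.Linear(in_dim, dim)

    def forward(self, messages):                  # messages: (num_messages, dim)
        if self.kind == "sum":
            pooled = messages.sum(dim=0)
        elif self.kind == "mean":
            pooled = messages.mean(dim=0)
        elif self.kind == "max":
            pooled = messages.max(dim=0).values
        else:                                     # simplified "pna"-style pooling
            pooled = torch.cat([messages.mean(dim=0),
                                messages.max(dim=0).values,
                                messages.min(dim=0).values,
                                messages.std(dim=0)], dim=-1)
        # linear transformation followed by a non-linear activation, as above
        return torch.relu(self.linear(pooled))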