Our framework (GRAPHEDM) is general. Specific choices of the aforementioned (encoder and decoder) networks
allow GRAPHEDM to realize specific graph embedding methods. GRAPHEDM is illustrated in Figure 2. Before
presenting the taxonomy and showing realizations of various methods using our framework, we briefly discuss an
application perspective.
Output The GRAPHEDM model can return a reconstructed graph similarity matrix $\widehat{W}$ (often used to train unsupervised embedding algorithms), as well as output labels $\hat{y}^S$ for supervised applications. The label output space varies depending on the supervised application.
• Node-level supervision, with $\hat{y}^N \in \mathcal{Y}^{|V|}$, where $\mathcal{Y}$ represents the node label space. If $\mathcal{Y}$ is categorical, then this is also known as (semi-)supervised node classification (Section 6.2.1), in which case the label decoder network produces labels for each node in the graph. If the $d$-dimensional $Z$ is such that $d = |\mathcal{Y}|$, then the label decoder network can simply be a softmax activation across the rows of $Z$, producing a distribution over labels for each node (a minimal sketch of this and the other label decoders is given after this list). Additionally, the graph decoder network might also be leveraged in supervised node classification tasks, as it can be used to regularize embeddings (e.g. neighboring nodes should have nearby embeddings, regardless of node labels).
• Edge-level supervision, with $\hat{y}^E \in \mathcal{Y}^{|V| \times |V|}$, where $\mathcal{Y}$ represents the edge label space. For example, $\mathcal{Y}$ can be multinomial in knowledge graphs (for describing the types of relationships between two entities), setting $\mathcal{Y} = \{0, 1\}^{\#(\text{relation types})}$. It is common to have $\#(\text{relation types}) = 1$, in which case edge relations are binary and the task is known as link prediction. In this review, when $\hat{y}^E \in \{0, 1\}^{|V| \times |V|}$ (i.e. $\mathcal{Y} = \{0, 1\}$), rather than naming the output of the decoder $\hat{y}^E$, we instead follow the standard nomenclature and position link prediction as an unsupervised task (Section 4). Then, in lieu of $\hat{y}^E$, we use $\widehat{W}$, the output of the graph decoder network (which is trained to reconstruct the target similarity matrix $s(W)$), to rank potential edges.
• Graph-level supervision, with $\hat{y}^G \in \mathcal{Y}$. In the graph classification task (Section 6.2.2), the label decoder network converts node embeddings $Z$, using the input adjacency matrix $W$, into graph labels via graph pooling. More concretely, the graph pooling operation is similar to pooling in standard CNNs, where the goal is to downsample local feature representations to capture higher-level information. However, unlike images, graphs do not have a regular grid structure, and it is hard to define a pooling pattern that could be applied to every node in the graph. One possible approach is graph coarsening, which groups similar nodes into clusters to produce smaller graphs [32]. Other graph pooling methods exist, such as DiffPool [120] or SortPooling [123], the latter of which creates an ordering of nodes based on their structural roles in the graph. We do not cover the details of graph pooling operators here and refer the reader to recent surveys [116] for more details about graph pooling.
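To make these three output types concrete, the following is a minimal NumPy sketch of possible label decoders: a row-wise softmax over $Z$ for node labels, an inner-product graph decoder producing $\widehat{W}$ for ranking candidate edges, and mean pooling followed by a linear layer for graph labels. The function names and the specific decoder choices (inner-product reconstruction, mean pooling) are illustrative assumptions, not components prescribed by GRAPHEDM.

```python
# Illustrative sketch of the three label-decoder flavours described above.
# Decoder choices (row-wise softmax, inner-product decoder, mean pooling) are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def node_label_decoder(Z):
    """Node-level: when d = |Y|, a row-wise softmax over Z yields a
    distribution over labels for every node (shape |V| x |Y|)."""
    return softmax(Z, axis=1)

def graph_decoder(Z):
    """Link prediction: reconstruct a similarity matrix W-hat (here with a
    simple inner-product decoder) whose entries rank candidate edges."""
    return Z @ Z.T

def graph_label_decoder(Z, W_out):
    """Graph-level: pool node embeddings into one graph representation
    (mean pooling here; coarsening, DiffPool or SortPooling in practice),
    then map it to graph-label scores with a linear layer W_out."""
    h_graph = Z.mean(axis=0)           # (d,)
    return softmax(h_graph @ W_out)    # distribution over graph labels

# Toy usage with random embeddings for a 5-node graph.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))            # |V| = 5 nodes, d = 3 = |Y|
W_out = rng.normal(size=(3, 2))        # 2 graph classes
print(node_label_decoder(Z).shape)     # (5, 3): per-node label distributions
print(graph_decoder(Z).shape)          # (5, 5): reconstructed similarity matrix
print(graph_label_decoder(Z, W_out))   # distribution over 2 graph labels
```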
3.2 Taxonomy of objective functions
We now focus our attention on the optimization of models that can be described in the GRAPHEDM framework by describing the loss functions used for training. Let $\Theta = \{\Theta^E, \Theta^D, \Theta^S\}$ denote all model parameters. GRAPHEDM models can be optimized using a combination of the following loss terms:
• Supervised loss term, $\mathcal{L}^S_{\mathrm{SUP}}$, which compares the predicted labels $\hat{y}^S$ to the ground truth labels $y^S$. This term depends on the task the model is being trained for. For instance, in semi-supervised node classification tasks ($S = N$), the graph vertices are split into labelled and unlabelled nodes ($V = V_L \cup V_U$), and the supervised loss is computed for each labelled node in the graph:
$$\mathcal{L}^N_{\mathrm{SUP}}(y^N, \hat{y}^N; \Theta) = \sum_{i \,\mid\, v_i \in V_L} \ell(y^N_i, \hat{y}^N_i; \Theta),$$
where $\ell(\cdot)$ is the loss function used for classification (e.g. cross-entropy). Similarly, for graph classification tasks ($S = G$), the supervised loss is computed at the graph level and can be summed across multiple training graphs:
$$\mathcal{L}^G_{\mathrm{SUP}}(y^G, \hat{y}^G; \Theta) = \ell(y^G, \hat{y}^G; \Theta).$$
A minimal sketch of these supervised loss terms is given below.
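The following sketch implements the two supervised loss terms above, assuming cross-entropy as $\ell$ and one-hot labels; the masking over labelled nodes mirrors the sum over $\{i \mid v_i \in V_L\}$. Function names and the one-hot encoding are assumptions made for the sake of the example.

```python
# Illustrative sketch of the supervised loss terms, assuming cross-entropy as the
# classification loss and per-node label distributions (e.g. from a softmax decoder).
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """l(y, y_hat): cross-entropy between one-hot targets and predicted distributions."""
    return -np.sum(y_true_onehot * np.log(y_pred_probs + eps), axis=-1)

def node_supervised_loss(y_N, y_hat_N, labelled_mask):
    """L^N_SUP: sum of l over labelled nodes only (semi-supervised node classification)."""
    per_node = cross_entropy(y_N, y_hat_N)           # shape (|V|,)
    return per_node[labelled_mask].sum()

def graph_supervised_loss(y_G, y_hat_G):
    """L^G_SUP: a single graph-level loss; sum it over the graphs in a training batch."""
    return cross_entropy(y_G, y_hat_G)

# Toy usage: 4 nodes, 3 classes, only nodes 0 and 2 are labelled (V_L).
y_N = np.eye(3)[[0, 1, 2, 0]]                        # one-hot node labels
y_hat_N = np.full((4, 3), 1.0 / 3)                   # uniform predictions
labelled = np.array([True, False, True, False])      # membership in V_L
print(node_supervised_loss(y_N, y_hat_N, labelled))  # approx. 2 * log(3)
```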
• Graph regularization loss term, $\mathcal{L}_{G,\mathrm{REG}}$, which leverages the graph structure to impose regularization constraints on the model parameters. This loss term measures the distance between the decoded similarity matrix