knowledge can be integrated much more easily into neural network based models after knowledge representation learning.
Translational Distance Models With distance-based scoring functions, this type of model measures the plausibility of a fact as the distance between the two entities after a translation carried out by the relation. Inspired by the linguistic regularities in [38], TransE [39] represents entities and relations in a d-dimensional vector space so that the embedded entities h and t can be connected by the translation vector r, i.e., h + r ≈ t when (h, r, t) holds.
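As a minimal sketch (assuming the L2 distance; the original work also allows L1, and all names here are illustrative), the TransE scoring function can be written as:

```python
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    Higher (less negative) scores indicate more plausible triplets."""
    return -np.linalg.norm(h + r - t, ord=2)

# Toy check: a triplet that holds should score close to zero.
d = 50
h, r = np.random.randn(d), np.random.randn(d)
t = h + r + 0.01 * np.random.randn(d)  # t ≈ h + r when (h, r, t) holds
print(transe_score(h, r, t))
```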
To tackle the insufficiency of a single space for both entities and relations, TransH [40] and TransR [41] allow an entity to have distinct
representations when involved in different relations. TransH
introduces relational hyperplanes assuming that entities
and relations share the same semantic space, while TransR exploits a separate space for each relation to consider different attributes of entities. TransD [42] argues that entities serve as different types even under the same relation and constructs dynamic mapping matrices by considering the interactions between entities and relations. Owing to the heterogeneity and imbalance of entities and relations, TranSparse [43] simplifies TransR by enforcing sparseness on the projection matrices.
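To make the contrast between the projection strategies above concrete, here is a minimal sketch of the TransH and TransR scoring functions (parameter names such as w_r and M_r follow common notation and are illustrative, not taken from the original implementations):

```python
import numpy as np

def transh_score(h, r, t, w_r):
    """TransH: project entities onto the relational hyperplane with unit
    normal vector w_r, then translate by r within that hyperplane."""
    h_proj = h - np.dot(w_r, h) * w_r
    t_proj = t - np.dot(w_r, t) * w_r
    return -np.linalg.norm(h_proj + r - t_proj)

def transr_score(h, r, t, M_r):
    """TransR: map entities from entity space into the relation-specific
    space via the projection matrix M_r, then translate by r."""
    return -np.linalg.norm(M_r @ h + r - M_r @ t)
```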
Semantic Matching Models Semantic matching models measure the plausibility of facts by matching latent semantics of entities and relations with similarity-based scoring functions. RESCAL [44] associates each entity with a vector and each relation with a matrix; the score of a fact (h, r, t) is defined by a bilinear function. To reduce computational complexity, DistMult [45] simplifies RESCAL by restricting relation matrices to be diagonal. Combining the expressive
power of RESCAL with the efficiency and simplicity of Dist-
Mult, HolE [46] composes the entity representations with
the circular correlation operation, and the compositional
vector is then matched with the relation representation to
score the triplet. Unlike the models above, SME [47] conducts semantic matching between entities and relations using neural network architectures. NTN [48] combines projected entities with a relational tensor and predicts scores through a relational linear output layer.
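For illustration, the similarity-based scoring functions of RESCAL, DistMult, and HolE can be sketched as follows (a simplification that omits training details; the FFT identity is a standard way to compute circular correlation):

```python
import numpy as np

def rescal_score(h, M_r, t):
    """RESCAL: bilinear score h^T M_r t with a full relation matrix M_r."""
    return h @ M_r @ t

def distmult_score(h, r, t):
    """DistMult: RESCAL with M_r restricted to diag(r), i.e. h^T diag(r) t."""
    return np.sum(h * r * t)

def hole_score(h, r, t):
    """HolE: match the relation vector r against the circular correlation
    of h and t, computed via corr(h, t) = ifft(conj(fft(h)) * fft(t))."""
    corr = np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)).real
    return r @ corr
```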
Graph Neural Network Models The above models embed entities and relations using only facts stored as a collection of triplets, whereas graph neural network based models take the whole structure of the graph into account. The graph convolutional network (GCN), first proposed in [49] and refined through continuous efforts [50], [51], [52], [53], has become an effective tool for creating node embeddings by aggregating local information from the graph neighborhood of each node.
As an extension of graph convolutional networks, R-GCN
[54] is developed to deal with the highly multi-relational
data characteristic of realistic knowledge bases. SACN [55]
employs an end-to-end network learning framework where
the encoder leverages graph node structure and attributes,
and the decoder simplifies ConvE [56] and keeps the trans-
lational property of TransE. Following the same framework
of SACN, Nathani et al. [57] propose an attention-based
feature embedding that captures both entity and relation
features in the encoder. Vashishth et al. [58] argue that the combination of relations and nodes should be considered comprehensively during message passing. They therefore propose CompGCN, which leverages various entity-relation composition operations from knowledge graph embedding techniques and scales with the number of relations to embed both nodes and relations jointly.
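As a rough sketch of the relation-specific message passing behind R-GCN (omitting the basis decomposition and other regularization from [54]; the names and the per-relation mean normalizer are illustrative):

```python
import numpy as np

def rgcn_layer(X, adjs, W_rel, W_self):
    """One simplified R-GCN layer.
    X: (num_nodes, d_in) node features; adjs[r]: (num_nodes, num_nodes)
    adjacency matrix of relation r; W_rel[r], W_self: (d_in, d_out).
    Each node aggregates neighbor messages per relation, normalized by
    its per-relation in-degree, plus a self-loop term."""
    out = X @ W_self  # self-connection
    for A, W in zip(adjs, W_rel):
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # c_{i,r}
        out += (A @ (X @ W)) / deg
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```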
3 OVERVIEW OF KNOWLEDGE ENHANCED PRE-TRAINED MODELS
3.1 The Motivation of Knowledge Enhanced Pre-trained Models
The recent rapid development of pre-trained models has attracted much attention from researchers. However, despite the great effort invested in their creation, pre-trained models suffer from an inability to understand the deep semantics of text and to perform logical reasoning. In addition, the knowledge learned by these models resides in their parameters and is uninterpretable. Poor robustness and the lack of interpretability can be greatly alleviated by infusing entity features and factual knowledge from KGs. We refer to the models that integrate knowledge through retrieval or injection as KEPTMs.
Most of the pre-trained models introduced in this paper focus on leveraging linguistic knowledge and world knowledge, which belong to the factual or conceptual knowledge defined in Section 2.2.1. This kind of knowledge provides rich information about entities and relations for pre-trained models, sharply improving their capability for deep understanding and reasoning.
3.2 A Taxonomy of Knowledge Enhanced Pre-trained Models
To compare and analyze existing KEPTMs, we first categorize them into three groups according to the type of injected knowledge: entity enhanced pre-trained models, triplet enhanced pre-trained models, and other knowledge enhanced pre-trained models.
Entity enhanced pre-trained models all store knowledge and language information within the parameters of the pre-trained model and thus belong to coupled-based KEPTMs. We further classify them into entity features fused and knowledge graph supervised pre-trained models according to the method of entity injection.
For triplet enhanced pre-trained models, we divide them into coupled-based and decoupled-based KEPTMs according to whether the triplets and the corpus are coupled. Since coupled-based KEPTMs entangle word embeddings and knowledge embeddings during pre-training, they fail to maintain the interpretability of symbolic knowledge. We further categorize coupled-based KEPTMs into three groups according to the method of triplet infusion: embedding combined, data structure unified, and joint training KEPTMs. Decoupled-based KEPTMs, in contrast, preserve the embeddings of knowledge and language separately and thus preserve the interpretability of symbolic knowledge. We classify them as retrieval-based KEPTMs because they utilize knowledge by retrieving relevant information.
Other knowledge enhanced pre-trained models can also be categorized into coupled-based and decoupled-based KEPTMs, which we further divide into joint training and retrieval-based KEPTMs, respectively.