Large-scale Knowledge Base Completion: Inferring via
Grounding Network Sampling over Selected Instances
Zhuoyu Wei 1,2, Jun Zhao 1, Kang Liu 1, Zhenyu Qi 2, Zhengya Sun 1 and Guanhua Tian 2
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2 Interactive Digital Media Technology Research, Institute of Automation, Chinese Academy of Sciences
{zhuoyu.wei, zhenyu.qi, zhengya.sun, guanhua.tian}@ia.ac.cn, {jzhao, kliu}@nlpr.ia.ac.cn
ABSTRACT
Constructing large-scale knowledge bases has attracted much
attention in recent years, for which Knowledge Base Com-
pletion (KBC) is a key technique. In general, inferring new
facts in a large-scale knowledge base is not a trivial task.
The large number of candidate facts to be inferred causes the
majority of previous approaches to fail. Inference-based
approaches can achieve high precision when their formulas are
accurate, but they must infer candidate instances one by one,
and extremely large candidate sets bog them down in expensive
calculations. In contrast, the existing
embedding-based methods can easily calculate similarity-
based scores for each candidate instance as opposed to using
inference, so they can handle large-scale data. However, this
type of method does not consider explicit logical semantics
and usually has unsatisfactory precision. To resolve the
limitations of both types of methods, we propose an approach
called Inferring via Grounding Network Sampling over Selected
Instances. We first employ an embedding-based model to perform
instance selection and generate much smaller candidate sets for
subsequent fact inference, which not only narrows the candidate
sets but also filters out some of the noisy instances. Then, we
make inferences only within these candidate sets by running a
data-driven inference algorithm on a Markov Logic Network
(MLN), which we call Inferring via Grounding Network Sampling
(INS). In this process, we incorporate the similarity prior
generated by the embedding-based model into INS to improve
inference precision. The experimental results show that our
approach improves Hits@1 from 32.911% to 71.692% on the FB15K
dataset and achieves much better AP@n scores than
state-of-the-art methods.
Categories and Subject Descriptors
E.1 [Data]: DATA STRUCTURES—Graphs and networks
Keywords
Knowledge Base Completion; Embedding; Inference
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Permissions@acm.org.
CIKM’15 October 19-23, 2015, Melbourne, VIC, Australia
© 2015 ACM. ISBN 978-1-4503-3794-6/15/10 ...$15.00.
DOI: http://dx.doi.org/10.1145/2806416.2806513.
1. INTRODUCTION
Automatically extracting facts from text and constructing
large-scale knowledge bases (KBs) have grown vigorously in
recent years. As a result, several typical knowledge bases
have been built, such as Freebase [2], NELL [6], YAGO [10],
and Knowledge Vault [7]. However, these extracted repositories
are far from complete. According to the conclusion in [7],
using an existing knowledge base to complete itself is an
important complement to automatic knowledge extraction for
increasing the number of facts in KBs, and it cannot be
substituted by other techniques. Therefore, this paper focuses
on large-scale knowledge base completion (KBC) and is committed
to predicting the missing links in existing knowledge bases.
In general, according to the process of KBC, there are
two types of approaches: inference-based approaches and
embedding-based approaches.
First, inference-based approaches [22, 16, 24] usually em-
ploy logic formulas to infer the missing links among exist-
ing entities in a KB. They manually or automatically con-
struct various logic formulas and learn the weight of each
formula by sampling or counting groundings from existing
KBs. These weighted formulas are viewed as the long-range
interactions across several relations. The biggest limitation
of such approaches is the computation complexity. These
methods need to infer knowledge one by one, which implies
the computation complexity is linearly growing with the size
of candidate sets. However, usually, there are extremely
large candidate sets for some specific relations in large-scale
KBC, and in each candidate set, only one or a few are actu-
ally correct. For example, Barack Obama’s mother is miss-
ing in a KB, and we need to find out who Barack Obama’s
mother is. All p ersons or females in the KB are candidates,
but only one is the correct selection. The huge candidate set
brings inference-based approaches to an unacceptable run-
ning time. Although some methods have avoided this issue
through simple operations, such as only adding a small part
of false facts to testing sets [13, 23], this strategy is too coarse
to obtain precise inference results. On the other hand, there
are some noise candidates, which may violate formulas and
mislead inference algorithms. Therefore, existing inference
methods that rely on formulas cannot remove the noises by
themselves, and the noises may result in the decrease of the
performance.
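To make the linear cost concrete, the one-by-one grounding described above can be sketched as follows. This is a minimal illustration with hypothetical predicates and toy entities, not the paper's actual INS algorithm: even a single Horn formula such as parent(h, t) ∧ female(t) ⇒ mother(h, t) must be grounded once per candidate tail entity, so runtime grows with the candidate set.

```python
def infer_mother(head, candidates, parent_facts, female_set):
    """Score every candidate tail by checking the formula body.

    Cost is O(|candidates|): in a large KB the candidate set may
    contain every person (or every female) entity, so this loop
    dominates the running time of formula-based inference.
    """
    scored = []
    for tail in candidates:  # one grounding check per candidate
        body_holds = (head, tail) in parent_facts and tail in female_set
        scored.append((tail, 1.0 if body_holds else 0.0))
    # Rank candidates by score, best first.
    return sorted(scored, key=lambda pair: -pair[1])

# Toy KB (hypothetical facts): only one candidate satisfies the body.
parents = {("obama", "ann_dunham"), ("obama", "barack_sr")}
females = {"ann_dunham", "alice"}
ranking = infer_mother("obama",
                       ["ann_dunham", "barack_sr", "alice"],
                       parents, females)
print(ranking[0][0])  # "ann_dunham" ranks first
```

With millions of entities as candidates, this per-candidate grounding is exactly what makes unrestricted inference intractable, and it is the cost the instance-selection step in the proposed approach is designed to avoid.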
In contrast, embedding-based methods [21, 5, 12, 4, 3, 26,
18] are not affected by huge candidate sets because they can
easily calculate similarity-based scores for each candidate
instance after learning representations of entities and rela-