Research on simple KBQA concentrates on proposing an effective answer prediction module to rank entities accurately.
Early attempts at solving the simple KBQA task employed existing semantic parsing tools to parse a simple natural language question into an uninstantiated logic form, and then adapted it to the KB schema by aligning the lexicons. This step results in an executable logic form $l_q$ for $q$. In detail, the existing semantic parsing tools usually follow Combinatory Categorial Grammars (CCGs) [28], [29], [30] to build domain-independent logic forms. Then, different methods [28], [29], [30], [31], [32] were proposed to perform schema matching and lexicon extension, which results in logic forms grounded in the KB schema. For the simple KBQA task, this logic form is usually a single triple starting from the topic entity and connecting to the answer entities. However, early methods heavily relied on rule-based mapping, which is hard to generalize to large-scale datasets [33], [34], [35].
Thus, follow-up work proposed scoring functions to automatically learn the lexical correspondence between logic forms and questions [36], [37]. With the development of deep learning, several advanced neural architectures, such as Convolutional Neural Networks [38], Hierarchical Residual BiLSTMs [9], Match-Aggregation modules [39], and Neural Module Networks [40], have been utilized to measure this semantic similarity. This line of work is known as semantic parsing-based methods.
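To make this similarity-scoring step concrete, the following minimal sketch ranks candidate relations against a question using a bag-of-words encoder and cosine similarity. The question, the relation names, and the encoder are illustrative assumptions only; the cited methods use learned neural encoders such as CNNs or BiLSTMs rather than word counts.

```python
import numpy as np

def bow_vector(text, vocab):
    """Bag-of-words encoding; a toy stand-in for a learned neural encoder."""
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Hypothetical question and candidate KB relations of the topic entity.
question = "what is the place of birth of albert einstein"
candidates = ["place of birth", "date of birth", "place of death"]

tokens = sorted({t for s in [question] + candidates for t in s.lower().split()})
vocab = {t: i for i, t in enumerate(tokens)}

q_vec = bow_vector(question, vocab)
scores = {r: cosine(q_vec, bow_vector(r, vocab)) for r in candidates}
print(max(scores, key=scores.get))  # -> "place of birth"
```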
Information retrieval-based methods have also been developed over the past decades. They retrieve a question-specific graph $G_q$ from the entire KB. Generally, the entities one hop away from the topic entity, together with their connecting relations, form the subgraph for solving a simple question. The question and the candidate answers in the subgraph can be represented as low-dimensional dense vectors. Different ranking functions have been proposed to rank these candidate answers, and the top-ranked entities are taken as the predicted answers [7], [41], [42]. Afterwards, Memory Networks [43] were employed to generate the final answer entities [44], [45]. More recent work [8], [46], [47] applies attention mechanisms or multi-column modules to this framework to boost the ranking accuracy. Figure 2 displays the different intermediate outputs of the two kinds of methods.
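The following minimal sketch illustrates this retrieve-then-rank pipeline: it collects one-hop candidates around a topic entity and ranks them by the similarity between question and candidate vectors. The tiny KB, the random embedding table, and the scoring function are assumptions for illustration only, not any cited system's implementation.

```python
import numpy as np

# Toy KB of (head, relation, tail) triples; real systems query a large KB.
KB = [
    ("Einstein", "place_of_birth", "Ulm"),
    ("Einstein", "field", "Physics"),
    ("Ulm", "located_in", "Germany"),
]

rng = np.random.default_rng(0)
EMB = {}  # name -> vector; a stand-in for learned embeddings

def embed(name, dim=8):
    if name not in EMB:
        EMB[name] = rng.normal(size=dim)
    return EMB[name]

def one_hop_candidates(topic):
    """Candidate answers: tails of triples whose head is the topic entity."""
    return [(r, t) for h, r, t in KB if h == topic]

def rank(question_vec, topic):
    """Score candidates by similarity between question and candidate vectors."""
    scored = [(t, float(question_vec @ (embed(r) + embed(t))))
              for r, t in one_hop_candidates(topic)]
    return sorted(scored, key=lambda x: -x[1])

q_vec = rng.normal(size=8)  # placeholder for an encoded question
print(rank(q_vec, "Einstein"))  # top-ranked entity = predicted answer
```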
There has been other work on simple KBQA focusing on improving the topic entity linking modules [9], [48] and on incorporating rules or external resources to help answer questions over large-scale KBs [49], [50], [51], [52]. Recent work also tries to improve knowledge-aware dialogue generation via the KBQA task [53]. With the development of neural network techniques, simple KBQA has been well studied [10], while complex KBQA remains open and attractive due to its unsolved challenges and wide applications.
2.4 Evaluation Protocol
In order to comprehensively evaluate KBQA systems, effective measurements from multiple aspects should be taken into consideration. Considering the goals to be achieved, we categorize the measurements into three aspects: reliability, robustness, and system-user interaction [62].
Reliability: For each question, there is an answer set (with one or multiple elements) as the ground truth. The KBQA system usually selects the entities with the top confidence scores to form its predicted answer set. If an answer predicted by the KBQA system exists in the ground-truth answer set, it is a correct prediction. Previous studies [36], [63], [64] adopt classical evaluation metrics such as Precision, Recall, $F_1$, and Hits@1. For a question $q$, Precision indicates the ratio of the correct predictions over all the predicted answers. It is formally defined as:
$$\text{Precision} = \frac{|A_q \cap \tilde{A}_q|}{|\tilde{A}_q|},$$
where $\tilde{A}_q$ denotes the predicted answers and $A_q$ the ground-truth answers. Recall is the ratio of the correct predictions over all the ground-truth answers. It is computed as:
$$\text{Recall} = \frac{|A_q \cap \tilde{A}_q|}{|A_q|}.$$
Ideally, we expect the KBQA system to achieve both high Precision and high Recall simultaneously. Thus, the $F_1$ score is most commonly used to give a comprehensive evaluation:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
Some other methods [44], [65], [66], [67] use Hits@1 to assess whether the top-ranked prediction is a correct answer; averaged over a dataset, it measures the fraction of questions whose top prediction is correct. For a single question $q$, it is computed as:
$$\text{Hits@1} = \mathbb{I}(\tilde{a}_q \in A_q),$$
where $\tilde{a}_q$ is the top-1 prediction in $\tilde{A}_q$ and $\mathbb{I}(\cdot)$ is the indicator function.
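As a concrete illustration, the following sketch computes these four metrics for a single ranked prediction list; the example answer sets are hypothetical.

```python
def evaluate(predicted, gold):
    """Precision, Recall, F1, and Hits@1 for one question.

    `predicted` is a ranked list of predicted answers (best first);
    `gold` is the set of ground-truth answers A_q.
    """
    pred_set, gold_set = set(predicted), set(gold)
    correct = pred_set & gold_set
    precision = len(correct) / len(pred_set) if pred_set else 0.0
    recall = len(correct) / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    hits1 = 1.0 if predicted and predicted[0] in gold_set else 0.0
    return precision, recall, f1, hits1

# Hypothetical example: two of three predictions are correct,
# and the top-ranked prediction is in the gold set.
print(evaluate(["Ulm", "Berlin", "Bern"], {"Ulm", "Bern", "Zurich"}))
# -> (0.666..., 0.666..., 0.666..., 1.0)
```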
Robustness: Practical KBQA models are supposed to be built with strong generalizability to out-of-distribution questions at test time [14]. However, current KBQA datasets are mostly generated from templates and lack diversity [62]. Moreover, the scale of training datasets is limited by expensive labeling costs. Furthermore, the training data for a KBQA system can hardly cover all possible user queries due to the broad coverage and combinatorial explosion of queries. To promote the robustness of KBQA models, Gu et al. [14] proposed three levels of generalization (i.e., i.i.d., compositional, and zero-shot) and released a large-scale KBQA dataset, GrailQA, to support further research. At the basic level, KBQA models are assumed to be trained and tested with questions drawn from the same distribution, which is what most existing studies focus on. Beyond that, robust KBQA models should generalize to novel compositions of seen schema items (e.g., relations and entity types). To achieve better generalization and serve more users, robust KBQA models are further expected to handle questions whose schema items or domains are not covered in the training stage.
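As a rough sketch of how these three levels can be operationalized (a simplification in the spirit of [14], not the dataset's exact criteria), a test question can be bucketed by comparing the schema items of its logic form against those seen during training:

```python
def generalization_level(question_schema, train_items, train_compositions):
    """Classify a test question into i.i.d., compositional, or zero-shot.

    `question_schema` is the set of schema items (relations, entity types)
    used by the question's logic form; `train_items` is the set of items
    seen in training; `train_compositions` is the set of item combinations
    (frozensets) seen in training. Simplified from [14].
    """
    if not question_schema <= train_items:
        return "zero-shot"       # at least one schema item is unseen
    if frozenset(question_schema) not in train_compositions:
        return "compositional"   # items seen, but their combination is novel
    return "i.i.d."              # whole composition seen in training

# Hypothetical schema items for illustration:
train_items = {"place_of_birth", "located_in", "person"}
train_comps = {frozenset({"place_of_birth", "person"})}
print(generalization_level({"place_of_birth", "located_in"},
                           train_items, train_comps))
# -> "compositional"
```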
System-user Interaction: While most current studies pay much attention to offline evaluation, the interaction between users and KBQA systems is often neglected. On the one hand, in search scenarios, a user-friendly interface and an acceptable response time should be taken into consideration. To evaluate these, user feedback should be collected and the efficiency of the system should be judged. On the other hand, users' search intents may easily be misunderstood by systems if only single-round service is provided. Therefore, it is important to evaluate the interaction capability of a KBQA system. For example, to check whether they could