Plogs: Materializing Datalog Programs with
MapReduce for Scalable Reasoning
Haijiang Wu∗‡, Jie Liu†‡, Tao Wang‡, Dan Ye‡, Jun Wei†‡, and Hua Zhong‡
∗University of Chinese Academy of Sciences, Beijing, China
†State Key Laboratory of Computer Science, Beijing, China
‡Institute of Software, Chinese Academy of Sciences, Beijing, China
{wuhaijiang12,ljie,wangtao,yedan,wj,zhongh}@otcaix.iscas.ac.cn
Abstract—With the rapid growth of semantic data, scalable reasoning has attracted more and more attention. However, most existing work on scalable reasoning focuses only on the RDFS/OWL ter Horst semantics, which are small fragments of OWL 2 RL and are limited in expressivity. Since the OWL 2 RL semantics extended with SWRL rules can be expressed in the Datalog language, materialization of Datalog programs is widely adopted in traditional reasoners. In this paper, we propose a dependency-aware approach to parallel materialization of Datalog programs for scalable reasoning. We first present an algorithm that automatically translates the execution of a Datalog rule into MapReduce jobs, together with several optimizations that speed up rule evaluation. Because of the dependencies among rules, the rule execution order has a significant impact on reasoning performance. We therefore propose a sampling-based method to capture rule dependencies and design a dependency-aware strategy to schedule rule evaluation. Finally, we implement a system to evaluate the proposed approach with a series of semantic rule sets on large synthetic and real knowledge bases. The experimental results show that the proposed optimizations are highly effective and that our system achieves approximately linear scalability.
Keywords—Semantic Web, Datalog, MapReduce, Parallel Inference
I. INTRODUCTION
The semantics of a knowledge base often imply important information that can be revealed by a reasoning task. With the development of knowledge construction techniques, the volume of semantic data grows rapidly, and some large knowledge bases have evolved to contain billions of RDF triples (e.g., YAGO [2], NELL [4], and DBpedia [1]). Such large-scale data bring new challenges to semantic reasoning. Current inference algorithms for large semantic data sets make tradeoffs between complexity and expressivity.
The W3C proposed OWL 2 RL for applications that require scalable reasoning without sacrificing too much expressivity^1. However, existing reasoners focus on the RDFS/OWL ter Horst semantics, which are fragments of OWL 2 RL. Urbani et al. built a scalable inference engine on top of Hadoop, an open-source shared-nothing parallel programming framework [5]. Rong et al. proposed an efficient parallel inference engine using Spark [3]. However, both of these works support only the RDFS/OWL ter Horst semantics; their approaches are specific to these rule sets and cannot easily be extended to support application-specific rules.
^1 https://www.w3.org/TR/owl2-profiles/
Datalog is a popular logic programming language for deductive databases based on Horn clause logic. It can express the OWL 2 RL semantics extended with SWRL rules [11] and is widely used in semantic-based applications. One can materialize all the consequences of a Datalog program so that queries can be answered with shorter response times [5], [18]. Some popular reasoners support efficient materialization of Datalog programs to implement semantic reasoning on a single node [10], [14], but they are not viable for large-scale data due to hardware resource limitations. Other works [11], [17] have implemented parallel materialization of Datalog programs in centralized, main-memory, multi-core systems, but such shared-memory frameworks have limited scalability for reasoning over large knowledge bases.
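To make the notion of materialization concrete, the sketch below is a minimal, single-process Python illustration (not the parallel system proposed in this paper): it computes the fixpoint of one transitivity-style rule over a handful of hypothetical subClassOf facts.

    # Minimal sketch: fixpoint materialization of the rule
    #   subClassOf(X, Z) :- subClassOf(X, Y), subClassOf(Y, Z)
    # over a hypothetical in-memory set of (sub, super) pairs.
    def materialize(pairs):
        facts = set(pairs)
        while True:
            derived = {(x, z)
                       for (x, y1) in facts
                       for (y2, z) in facts
                       if y1 == y2} - facts
            if not derived:        # fixpoint reached: nothing new to add
                return facts
            facts |= derived

    if __name__ == "__main__":
        kb = {("Student", "Person"), ("Person", "Agent"), ("Agent", "Thing")}
        for s, o in sorted(materialize(kb)):
            print("subClassOf(%s, %s)" % (s, o))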
In this paper, we propose an efficient and highly scalable approach to parallel materialization of Datalog programs that uses MapReduce and takes the dependencies between rules into account. The major contributions and novelties of our work are as follows:
First, we propose an algorithm that automatically translates a Datalog rule into MapReduce jobs. We then design a data-partition model to avoid loading unnecessary data. Based on the data-partition model, we introduce a partial-cache strategy that accelerates rule evaluation by caching small files in the memory of each computing node. Moreover, we optimize the job execution order to avoid join operations between two large sets, which might cause out-of-memory errors or significant disk I/O overhead.
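The following sketch illustrates the general idea behind such a translation; it only simulates, in a single Python process, a reduce-side join for a rule with two body atoms that share one variable. The rule, the relation names, and the data are hypothetical, and this is not the translation algorithm proposed in the paper.

    # Hypothetical sketch of a reduce-side join for the rule
    #   ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z)
    # The map phase keys each body fact by the shared variable Y;
    # the reduce phase joins the two groups and emits head facts.
    from collections import defaultdict

    def map_phase(relation, fact):
        if relation == "parent":      # parent(X, Y): join key is Y
            x, y = fact
            return (y, ("parent", x))
        if relation == "ancestor":    # ancestor(Y, Z): join key is Y
            y, z = fact
            return (y, ("ancestor", z))

    def reduce_phase(key, values):
        lefts = [x for tag, x in values if tag == "parent"]
        rights = [z for tag, z in values if tag == "ancestor"]
        return [("ancestor", (x, z)) for x in lefts for z in rights]

    def run_job(parents, ancestors):
        groups = defaultdict(list)
        for rel, facts in (("parent", parents), ("ancestor", ancestors)):
            for fact in facts:
                key, value = map_phase(rel, fact)
                groups[key].append(value)    # stands in for the shuffle phase
        out = []
        for key, values in groups.items():
            out.extend(reduce_phase(key, values))
        return out

    if __name__ == "__main__":
        print(run_job([("alice", "bob")], [("bob", "carol")]))
        # [('ancestor', ('alice', 'carol'))]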
Second, we find that the rule evaluation order has a significant impact on inference performance because of rule dependencies. Since the dependencies among rules vary with the knowledge base, analyzing rule dependencies at the level of literals may introduce false dependencies and thus cause unnecessary job runs. We therefore capture the true rule dependencies by evaluating the Datalog program on a small sample of the knowledge base, and we arrange the rule execution order according to the captured dependencies.
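The sketch below illustrates the general idea under a simplifying assumption: rule R1 is treated as a prerequisite of R2 whenever, on the sampled data, R1 derives facts over a predicate that appears in the body of R2. The rule names, the sample encoding, and the topological scheduling shown here are hypothetical simplifications, not the strategy described in this paper.

    # Hypothetical sketch: derive a rule dependency graph from a sampled run,
    # then schedule rules in an order that respects the observed dependencies.
    from graphlib import TopologicalSorter   # Python 3.9+

    def observed_dependencies(sample_run):
        # sample_run maps each rule name to (derived_predicates, body_predicates)
        # observed while evaluating the Datalog program on the sampled data.
        deps = {rule: set() for rule in sample_run}
        for r1, (derived, _) in sample_run.items():
            for r2, (_, body) in sample_run.items():
                if r1 != r2 and derived & body:   # R1 feeds R2
                    deps[r2].add(r1)
        return deps

    def schedule(deps):
        # A topological order over the (acyclic part of the) dependency graph;
        # cyclic dependencies would require iterative evaluation instead.
        return list(TopologicalSorter(deps).static_order())

    if __name__ == "__main__":
        sample = {
            "R1": ({"type"}, {"subClassOf"}),
            "R2": ({"subClassOf"}, {"subClassOf"}),
        }
        print(schedule(observed_dependencies(sample)))   # ['R2', 'R1']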