Plogs: Materializing Datalog Programs with
MapReduce for Scalable Reasoning
Haijiang Wu∗‡, Jie Liu†‡, Tao Wang‡, Dan Ye‡, Jun Wei†‡, and Hua Zhong‡
∗University of Chinese Academy of Sciences, Beijing, China
†State Key Laboratory of Computer Science, Beijing, China
‡Institute of Software, Chinese Academy of Sciences, Beijing, China
{wuhaijiang12,ljie,wangtao,yedan,wj,zhongh}@otcaix.iscas.ac.cn
Abstract—With the rapid growth of semantic data, scalable reasoning has attracted more and more attention. However, most existing work on scalable reasoning focuses only on the RDFS/OWL ter Horst semantics, which are small fragments of OWL 2 RL and are limited in expressivity. Since the OWL 2 RL semantics extended with SWRL rules can be expressed in the Datalog language, materialization of Datalog programs is widely adopted in traditional reasoners. In this paper, we propose a dependency-aware approach to parallel materialization of Datalog programs for scalable reasoning. We first present an algorithm that automatically translates the execution of a Datalog rule into MapReduce jobs, together with several optimizations that speed up rule evaluation. Because of the dependencies among rules, the rule execution order has a significant impact on reasoning performance. We therefore propose a sampling-based method to capture rule dependencies and design a dependency-aware strategy to schedule rule evaluation. Finally, we implement a system to evaluate the proposed approach with a series of semantic rule sets on large synthetic and real knowledge bases. The experimental results show that the proposed optimizations are highly effective and that our system achieves approximately linear scalability.
Keywords—Semantic Web, Datalog, MapReduce, Parallel Inference
I. INTRODUCTION
The semantics of a knowledge base often imply important information that can be revealed by a reasoning task. With the development of knowledge construction techniques, the volume of semantic data grows rapidly, and some large knowledge bases have evolved to contain billions of RDF triples (e.g., YAGO [2], NELL [4], and DBpedia [1]). Such large-scale data bring new challenges to semantic reasoning. Current inference algorithms for large semantic data sets make tradeoffs between complexity and expressivity.
The W3C proposed OWL 2 RL for applications that require scalable reasoning without sacrificing too much expressivity^1. However, existing reasoners focus on the RDFS/OWL ter Horst semantics, which are fragments of OWL 2 RL. Urbani et al. built a scalable inference engine on top of Hadoop, an open-source shared-nothing parallel programming framework [5]. Rong et al. proposed an efficient parallel inference engine using Spark [3]. However, both of these works support only the RDFS/OWL ter Horst semantics; their approaches are specific to these rule sets and cannot easily be extended to support application-specific rules.
^1 https://www.w3.org/TR/owl2-profiles/
Datalog is a popular logic programming language for deductive databases based on Horn clause logic. It can express the OWL 2 RL semantics extended with SWRL rules [11] and is widely used in semantic-based applications. One can materialize all the consequences of a Datalog program so that queries can be answered with shorter response times [5], [18]. Some popular reasoners support efficient materialization of Datalog programs to implement semantic reasoning on a single node [10], [14], but they are not viable for large-scale data due to hardware resource limitations. Other works [11], [17] have implemented parallel materialization of Datalog programs in centralized, main-memory, multi-core systems, but such shared-memory frameworks have limited scalability for reasoning over large knowledge bases.
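To make the notion of materialization concrete, the sketch below is a minimal, single-process Python illustration (not the parallel system proposed in this paper): it computes the fixpoint of one transitivity-style rule over a handful of hypothetical subClassOf facts.

    # Minimal sketch: fixpoint materialization of the rule
    #   subClassOf(X, Z) :- subClassOf(X, Y), subClassOf(Y, Z)
    # over a hypothetical in-memory set of (sub, super) pairs.
    def materialize(pairs):
        facts = set(pairs)
        while True:
            derived = {(x, z)
                       for (x, y1) in facts
                       for (y2, z) in facts
                       if y1 == y2} - facts
            if not derived:        # fixpoint reached: nothing new to add
                return facts
            facts |= derived

    if __name__ == "__main__":
        kb = {("Student", "Person"), ("Person", "Agent"), ("Agent", "Thing")}
        for s, o in sorted(materialize(kb)):
            print("subClassOf(%s, %s)" % (s, o))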
In this paper, we propose an efficient and highly scalable approach to parallel materialization of Datalog programs that uses MapReduce and takes the dependencies between rules into account. The major contributions and novelties of our work are as follows:
First, we propose an algorithm that automatically translates a Datalog rule into MapReduce jobs. We then design a data-partition model to avoid loading unnecessary data. Based on the data-partition model, we introduce a partial-cache strategy that accelerates rule evaluation by caching small files in the memory of each computing node. Moreover, we optimize the job execution order to avoid join operations between two large sets, which might cause out-of-memory errors or significant disk I/O overhead.
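The following sketch illustrates the general idea behind such a translation; it only simulates, in a single Python process, a reduce-side join for a rule with two body atoms that share one variable. The rule, the relation names, and the data are hypothetical, and this is not the translation algorithm proposed in the paper.

    # Hypothetical sketch of a reduce-side join for the rule
    #   ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z)
    # The map phase keys each body fact by the shared variable Y;
    # the reduce phase joins the two groups and emits head facts.
    from collections import defaultdict

    def map_phase(relation, fact):
        if relation == "parent":      # parent(X, Y): join key is Y
            x, y = fact
            return (y, ("parent", x))
        if relation == "ancestor":    # ancestor(Y, Z): join key is Y
            y, z = fact
            return (y, ("ancestor", z))

    def reduce_phase(key, values):
        lefts = [x for tag, x in values if tag == "parent"]
        rights = [z for tag, z in values if tag == "ancestor"]
        return [("ancestor", (x, z)) for x in lefts for z in rights]

    def run_job(parents, ancestors):
        groups = defaultdict(list)
        for rel, facts in (("parent", parents), ("ancestor", ancestors)):
            for fact in facts:
                key, value = map_phase(rel, fact)
                groups[key].append(value)    # stands in for the shuffle phase
        out = []
        for key, values in groups.items():
            out.extend(reduce_phase(key, values))
        return out

    if __name__ == "__main__":
        print(run_job([("alice", "bob")], [("bob", "carol")]))
        # [('ancestor', ('alice', 'carol'))]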
Second, we find that the rule evaluation order has a significant impact on inference performance because of rule dependencies. Since the dependencies among rules vary with the knowledge base, analyzing rule dependencies at the level of literals may introduce false dependencies and thus cause unnecessary job runs. We therefore capture the true rule dependencies by evaluating the Datalog program on a small sample of the knowledge base, and we arrange the rule execution order according to the captured dependencies.
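The sketch below illustrates the general idea under a simplifying assumption: rule R1 is treated as a prerequisite of R2 whenever, on the sampled data, R1 derives facts over a predicate that appears in the body of R2. The rule names, the sample encoding, and the topological scheduling shown here are hypothetical simplifications, not the strategy described in this paper.

    # Hypothetical sketch: derive a rule dependency graph from a sampled run,
    # then schedule rules in an order that respects the observed dependencies.
    from graphlib import TopologicalSorter   # Python 3.9+

    def observed_dependencies(sample_run):
        # sample_run maps each rule name to (derived_predicates, body_predicates)
        # observed while evaluating the Datalog program on the sampled data.
        deps = {rule: set() for rule in sample_run}
        for r1, (derived, _) in sample_run.items():
            for r2, (_, body) in sample_run.items():
                if r1 != r2 and derived & body:   # R1 feeds R2
                    deps[r2].add(r1)
        return deps

    def schedule(deps):
        # A topological order over the (acyclic part of the) dependency graph;
        # cyclic dependencies would require iterative evaluation instead.
        return list(TopologicalSorter(deps).static_order())

    if __name__ == "__main__":
        sample = {
            "R1": ({"type"}, {"subClassOf"}),
            "R2": ({"subClassOf"}, {"subClassOf"}),
        }
        print(schedule(observed_dependencies(sample)))   # ['R2', 'R1']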