2.1 The Property Graph Data Model
Graph processing systems represent graph-structured data as a property graph [33], which associates user-defined properties with each vertex and edge. The properties can
include meta-data (e.g., user profiles and time stamps)
and program state (e.g., the PageRank of vertices or in-
ferred affinities). Property graphs derived from natural
phenomena such as social networks and web graphs often
have highly skewed, power-law degree distributions and
orders of magnitude more edges than vertices [18].
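To make the model concrete, a property graph can be sketched as two collections of user-defined properties keyed by vertex and edge identifiers. The type and field names below are illustrative, not drawn from any particular system:

```scala
// A sketch of a property graph over user-defined property types V and E.
// Identifiers, field names, and the example properties are illustrative.
case class Vertex[V](id: Long, prop: V)
case class Edge[E](srcId: Long, dstId: Long, prop: E)

case class PropertyGraph[V, E](
  vertices: Seq[Vertex[V]],
  edges: Seq[Edge[E]])

// Example: a tiny social graph with profile meta-data on vertices
// and time stamps on edges.
val g = PropertyGraph(
  vertices = Seq(Vertex(1L, "alice"), Vertex(2L, "bob")),
  edges    = Seq(Edge(1L, 2L, "2014-03-01")))
```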
In contrast to dataflow systems whose operators
(e.g., join) can span multiple collections, operations in
graph processing systems (e.g., vertex programs) are typi-
cally defined with respect to a single property graph with
a pre-declared, sparse structure. While this restricted fo-
cus facilitates a range of optimizations (Section 2.3), it
also complicates the expression of analytics tasks that
may span multiple graphs and sub-graphs.
2.2 The Graph-Parallel Abstraction
Algorithms ranging from PageRank to latent factor anal-
ysis iteratively transform vertex properties based on the
properties of adjacent vertices and edges. This common
pattern of iterative local transformations forms the ba-
sis of the graph-parallel abstraction. In the graph-parallel
abstraction [13], a user-defined vertex program is instantiated concurrently for each vertex and interacts with adjacent vertex programs through messages (e.g., Pregel [22]) or shared state (e.g., PowerGraph [13]). Each vertex program can read and modify its vertex property and, in some cases [13, 20], adjacent vertex properties. When all vertex programs vote to halt, the program terminates.
As a concrete example, in Listing 1 we express the
PageRank algorithm as a Pregel vertex program. The
vertex program for vertex v begins by summing the
messages encoding the weighted PageRank of neighbor-
ing vertices. The PageRank is updated using the resulting
sum and is then broadcast to its neighbors (weighted by
the number of links). Finally, the vertex program assesses
whether it has converged (locally) and votes to halt.
The extent to which vertex programs run concurrently
differs across systems. Most systems (e.g., [7, 13, 22, 34])
adopt the bulk synchronous execution model, in which
all vertex programs run concurrently in a sequence of
super-steps. Some systems (e.g., [13, 20, 37]) also support an asynchronous execution model that mitigates the
port an asynchronous execution model that mitigates the
effect of stragglers by running vertex programs as re-
sources become available. However, the gains due to an
asynchronous programming model are often offset by
the additional complexity and so we focus on the bulk-
synchronous model and rely on system level techniques
(e.g., pipelining and speculation) to address stragglers.
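In the style of Listing 1's pseudocode, a bulk synchronous execution can be sketched as a driver loop over super-steps; the helper names (allVertices, initialMessages, collectSentMessages, votedToHalt) are ours:

```scala
// Sketch of a bulk synchronous driver (helper names are illustrative).
// Each super-step runs all active vertex programs, then delivers the
// messages they sent before the next super-step begins.
var active = allVertices
var inbox: Map[Id, List[Double]] = initialMessages
while (active.nonEmpty) {
  // Run every active vertex program concurrently on its inbox.
  for (v <- active) PageRank(v, inbox.getOrElse(v, Nil))
  // Collect the messages sent during this super-step, grouped by
  // destination, and deactivate the vertices that voted to halt.
  inbox = collectSentMessages()
  active = active.filterNot(votedToHalt)
}
```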
def PageRank(v: Id, msgs: List[Double]) {
  // Compute the message sum
  var msgSum = 0.0
  for (m <- msgs) { msgSum += m }
  // Update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    val msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}
Listing 1: PageRank in Pregel: computes the sum of the inbound messages, updates the PageRank value for the vertex, and then sends the new weighted PageRank value to neighboring vertices. Finally, if the PageRank did not change, the vertex program votes to halt.
While the graph-parallel abstraction is well suited for
iterative graph algorithms that respect the static neigh-
borhood structure of the graph (e.g., PageRank), it is not
well suited to express computation where disconnected
vertices interact or where computation changes the graph
structure. For example, tasks such as graph construction
from raw text or unstructured data, graph coarsening, and
analysis that spans multiple graphs are difficult to express
in the vertex centric programming model.
2.3 Graph System Optimizations
The restrictions imposed by the graph-parallel abstraction
along with the sparse graph structure enable a range of
important system optimizations.
The GAS Decomposition: Gonzalez et al. [13] observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a
boring vertices by collecting messages in the form of a
generalized commutative associative sum and then broad-
casting new messages in an inherently parallel loop. They
proposed the GAS decomposition which splits vertex pro-
grams into three data-parallel stages: Gather, Apply, and
Scatter. In Listing 2 we decompose the PageRank vertex
program into the Gather, Apply, and Scatter stages.
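The idea can be sketched here in the same pseudocode style (our function signatures, not the paper's Listing 2): PageRank's commutative associative sum becomes Gather, the rank update becomes Apply, and the per-edge message becomes Scatter:

```scala
// A sketch of PageRank in the GAS decomposition (notation is ours).
// Gather: combine inbound contributions with a commutative,
// associative sum.
def gather(a: Double, b: Double): Double = a + b
// Apply: update the vertex property from the gathered sum.
def apply(v: Id, msgSum: Double): Unit = { PR(v) = 0.15 + 0.85 * msgSum }
// Scatter: compute the contribution along each out-edge in parallel.
def scatter(v: Id, j: Id): Double = PR(v) / NumLinks(v)
```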
The GAS decomposition leads to a pull-based model of
message computation: the system asks the vertex program for the value of the message between adjacent vertices rather than the user sending messages directly from the vertex program. As a consequence, the GAS decomposition enables vertex-cut partitioning, improved work balance, serial edge-iteration [34], and reduced data movement.
However, the GAS decomposition also prohibits direct
communication between vertices that are not adjacent in
the graph and therefore hinders the expression of more
general communication patterns.
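As a sketch of why the pull-based model enables vertex-cuts: each edge is assigned to a single partition, and a vertex is mirrored on every partition that holds one of its edges, so per-mirror gather sums can be combined at a master copy. The helper names below are ours:

```scala
// Sketch of vertex-cut partitioning (helper names are illustrative).
// Each edge is assigned to exactly one partition, e.g., by hashing
// its endpoint pair.
def partitionOf(src: Long, dst: Long, numParts: Int): Int =
  math.abs((src, dst).hashCode) % numParts

// A vertex is mirrored on every partition containing one of its
// edges; partial gather sums from mirrors are combined at a master.
def mirrors(v: Long, edges: Seq[(Long, Long)], numParts: Int): Set[Int] =
  edges.collect { case (s, d) if s == v || d == v =>
    partitionOf(s, d, numParts)
  }.toSet
```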