GraphChi: A Large-Scale Graph Computation System on a Single Machine


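"GraphChi is a system for large-scale graph computation that can efficiently process graphs with billions of edges on a single personal computer. By breaking large graphs into small parts and using a novel parallel sliding windows method, it makes it possible to run complex graph mining, data mining, and machine learning algorithms on consumer-grade hardware. GraphChi also supports dynamic graphs and can carry out computation while processing a large volume of graph updates, and it performs well on both solid-state drives (SSDs) and mechanical hard drives. Compared with existing distributed systems, GraphChi solves the same problems in reasonable time with fewer resources, which makes large-scale graph computation more widely accessible. The work was carried out jointly by Aapo Kyrola, Guy Blelloch, and Carlos Guestrin, with the aim of lowering the barrier to large-scale graph computation."

This document describes the design and implementation of GraphChi, an offline computation framework for large graphs that targets a single personal computer. Most current graph computation systems rely on distributed clusters, which are hard for non-experts to develop for. GraphChi uses a known graph partitioning method to split a large graph into small pieces and combines it with the parallel sliding windows strategy.

The parallel sliding windows method is GraphChi's core innovation. It lets the system run iterative computation efficiently from disk, uses disk I/O effectively, and reduces memory requirements, so that large graphs can be processed on a single consumer-grade computer. GraphChi also extends to dynamic graphs whose structure changes over time (nodes added or removed, edges updated), and it can apply these changes while computation is running, at more than one hundred thousand updates per second.

The experiments validate GraphChi's performance on both SSDs and traditional rotational hard drives. Compared with existing distributed systems, GraphChi uses only a fraction of their resources yet still completes the same computations in reasonable time. This lowers the barrier to large-scale graph computation: anyone with a modern personal computer can process large graphs, which encourages wider use of graph algorithms and graph analysis.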


rest of the graph from disk. In the early phase of our project, we explored this option, but found it difficult to find a good cache policy to sufficiently reduce disk access. Ultimately, we rejected this approach for two reasons. First, the performance would be highly unpredictable, as it would depend on structural properties of the input graph. Second, optimizing graphs for locality is costly, and sometimes impossible, if a graph is supplied without metadata required to efficiently cluster it. General graph partitioners are not currently an option, since even the state-of-the-art graph partitioner, METIS [ ], requires hundreds of gigabytes of memory to work with graphs of billions of edges.

Graph compression. Compact representation of real-world graphs is a well-studied problem; the best algorithms can store web-graphs in only 4 bits/edge (see [ ]). Unfortunately, while the graph structure can often be compressed and stored in memory, we also associate data with each of the edges and vertices, which can take significantly more space than the graph itself.

Bulk-Synchronous Processing. For a synchronous system, the random access problem can be solved by writing updated edges into a scratch file, which is then sorted (using disk-sort) and used to generate the input graph for the next iteration. For algorithms that modify only the vertices, not the edges, such as Pagerank, a similar solution has been used [ ]. However, it cannot be efficiently used to perform asynchronous computation.
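The scratch-file idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from GraphChi: the function name `bsp_iteration`, the in-memory list standing in for the scratch file, and the built-in sort standing in for the disk-sort are all hypothetical. Each vertex pushes its current value along its out-edges, the records are sorted by destination, and a second sequential scan combines them.

```python
def bsp_iteration(edges_by_src, vertex_value, combine):
    """One synchronous iteration over edges given as (src, dst, value) in source order."""
    # 1) Sequential scan in source order; append each updated edge to the scratch list.
    scratch = [(dst, src, vertex_value[src]) for src, dst, _ in edges_by_src]
    # 2) Sort by destination (an in-memory stand-in for an external disk sort).
    scratch.sort()
    # 3) Sequential scan in destination order. Every message carries a value
    #    from the previous iteration, so the computation is synchronous.
    new_value = dict(vertex_value)
    for dst, _src, msg in scratch:
        new_value[dst] = combine(new_value[dst], msg)
    # 4) Re-sort by source to produce the input graph for the next iteration.
    next_edges = sorted((src, dst, msg) for dst, src, msg in scratch)
    return next_edges, new_value


# Minimum-label propagation on the chain 0 -> 1 -> 2.
edges = [(0, 1, None), (1, 2, None)]
labels = {0: 0, 1: 1, 2: 2}
edges, labels = bsp_iteration(edges, labels, min)
```

After one iteration vertex 2 holds label 1 rather than 0, because it only sees the value vertex 1 had before the iteration started. An asynchronous model would let it observe the fresher value, which is exactly what this scheme cannot provide cheaply.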

3 Parallel Sliding Windows

This section describes the Parallel Sliding Windows (PSW) method (Algorithm 2). PSW can process a graph with mutable edge values efficiently from disk, with only a small number of non-sequential disk accesses, while supporting the asynchronous model of computation. PSW processes graphs in three stages: it 1) loads a subgraph from disk; 2) updates the vertices and edges; and 3) writes the updated values to disk. These stages are explained in detail below, with a concrete example. We then present an extension to graphs that evolve over time, and analyze the I/O costs of the PSW method.

3.1 Loading the Graph

Under the PSW method, the vertices V of graph G = (V, E) are split into P disjoint intervals. For each interval, we associate a shard, which stores all the edges that have destination in the interval. Edges are stored in the order of their source (Figure 1). Intervals are chosen to balance the number of edges in each shard; the number of intervals, P, is chosen so that any one shard can be loaded completely into memory. Similar data layout for sparse graphs was used previously, for example, to implement I/O efficient Pagerank and SpMV [5, 22].

Figure 1: The vertices of graph G = (V, E) are divided into P intervals. Each interval is associated with a shard, which stores all edges that have destination vertex in that interval.
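The layout described above can be sketched as follows. This is a minimal in-memory illustration, not GraphChi's preprocessing code: the function name `build_shards`, the greedy interval selection, and the `max_edges_per_shard` budget (standing in for "fits in memory") are assumptions made for the example. Vertices are numbered from 0, and edges are `(src, dst, value)` tuples.

```python
def build_shards(edges, num_vertices, max_edges_per_shard):
    """Split vertices 0..num_vertices-1 into intervals and build one shard per interval."""
    # Count in-edges per vertex, since a shard holds the in-edges of its interval.
    in_degree = [0] * num_vertices
    for _, dst, _ in edges:
        in_degree[dst] += 1

    # Greedy choice (an assumption): close the current interval as soon as
    # adding the next vertex would push its shard over the edge budget.
    intervals = []
    start, count = 0, 0
    for v in range(num_vertices):
        if count > 0 and count + in_degree[v] > max_edges_per_shard:
            intervals.append((start, v - 1))
            start, count = v, 0
        count += in_degree[v]
    intervals.append((start, num_vertices - 1))

    # Map every vertex to the index of the interval that contains it.
    owner = [0] * num_vertices
    for p, (lo, hi) in enumerate(intervals):
        for v in range(lo, hi + 1):
            owner[v] = p

    # Place each edge in the shard of its destination, then order by source.
    shards = [[] for _ in intervals]
    for src, dst, value in edges:
        shards[owner[dst]].append((src, dst, value))
    for shard in shards:
        shard.sort(key=lambda e: (e[0], e[1]))
    return intervals, shards


edges = [(0, 1, "a"), (2, 1, "b"), (3, 0, "c"), (1, 2, "d"), (0, 3, "e"),
         (4, 3, "f"), (5, 4, "g"), (2, 5, "h"), (3, 5, "i")]
intervals, shards = build_shards(edges, num_vertices=6, max_edges_per_shard=3)
```

In this sketch a single vertex whose in-degree exceeds the budget still gets an interval of its own, so the budget is a target rather than a hard limit.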

PSW does graph computation in execution intervals by processing vertices one interval at a time. To create the subgraph for the vertices in interval p, their edges (with their associated values) must be loaded from disk. First, shard(p), which contains the in-edges for the vertices in interval(p), is loaded fully into memory. We thus call shard(p) the memory-shard. Second, because the edges are ordered by their source, the out-edges for the vertices are stored in consecutive chunks in the other shards, requiring an additional P − 1 block reads. Importantly, edges for interval(p+1) are stored immediately after the edges for interval(p). Intuitively, when PSW moves from an interval to the next, it slides a window over each of the shards. We call the other shards the sliding shards. Note that if the degree distribution of a graph is not uniform, the window length is variable. In total, PSW requires only P sequential disk reads to process each interval. A high-level illustration of the process is given in Figure 2, and the pseudo-code of the subgraph loading is provided in Algorithm 3.
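The loading step can be illustrated with a small simulation. This is a hedged sketch rather than the paper's Algorithm 3: shards are plain Python lists instead of files, the name `load_subgraph` and the `window_pos` bookkeeping are hypothetical, and intervals are assumed to be visited in increasing order so that each window starts where the previous one ended.

```python
def load_subgraph(shards, intervals, p, window_pos):
    """Collect the in-edges and out-edges of interval p.

    window_pos[q] is the offset in shard q at which the previous window ended;
    it is advanced in place. Returns (in_edges, out_edges, block_reads).
    """
    _lo, hi = intervals[p]
    in_edges = list(shards[p])   # memory-shard: one full sequential read
    out_edges = []
    reads = 1
    for q, shard in enumerate(shards):
        start = end = window_pos[q]
        # Edges are sorted by source, so the out-edges of interval p form
        # one contiguous run that begins where the last window stopped.
        while end < len(shard) and shard[end][0] <= hi:
            end += 1
        window_pos[q] = end
        if q == p:
            # This window lies inside the memory-shard, which is already loaded.
            out_edges.extend(in_edges[start:end])
        else:
            out_edges.extend(shard[start:end])
            reads += 1           # one block read per sliding shard
    return in_edges, out_edges, reads


intervals = [(0, 1), (2, 3), (4, 5)]
shards = [
    [(0, 1, "a"), (2, 1, "b"), (3, 0, "c")],   # destinations 0..1
    [(0, 3, "e"), (1, 2, "d"), (4, 3, "f")],   # destinations 2..3
    [(2, 5, "h"), (3, 5, "i"), (5, 4, "g")],   # destinations 4..5
]
window_pos = [0] * len(shards)
for p in range(len(intervals)):
    in_edges, out_edges, reads = load_subgraph(shards, intervals, p, window_pos)
```

Each call counts one read for the memory-shard plus one per sliding shard, matching the P reads per interval stated above; a full pass over all P intervals therefore issues on the order of P² block reads in this simulation.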

3.2 Parallel Updates

After the subgraph for interval p has been fully loaded from disk, PSW executes the user-defined update-function for each vertex in parallel. As update-functions can modify the edge values, to prevent adjacent vertices from accessing edges concurrently (race conditions), we enforce external determinism, which guarantees that each execution of PSW produces exactly the same result. This guarantee is straightforward to implement: vertices that have edges with both end-points in the same interval are flagged as critical, and are updated in sequential order. Non-critical vertices do not share edges with other vertices in the interval, and can be updated safely in parallel. Note that the update of a critical vertex will observe changes in edges done by
