2.1 The Property Graph Data Model
Graph processing systems represent graph-structured data as a property graph [33], which associates user-defined properties with each vertex and edge. The properties can
include meta-data (e.g., user profiles and time stamps)
and program state (e.g., the PageRank of vertices or in-
ferred affinities). Property graphs derived from natural
phenomena such as social networks and web graphs often
have highly skewed, power-law degree distributions and
orders of magnitude more edges than vertices [18].
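To make the model concrete, a property graph can be sketched as two collections of user-defined properties keyed by vertex and edge identifiers. The type and field names below are illustrative, not drawn from any particular system:

```scala
// A sketch of a property graph over user-defined property types V and E.
// Identifiers, field names, and the example properties are illustrative.
case class Vertex[V](id: Long, prop: V)
case class Edge[E](srcId: Long, dstId: Long, prop: E)

case class PropertyGraph[V, E](
  vertices: Seq[Vertex[V]],
  edges: Seq[Edge[E]])

// Example: a tiny social graph with profile meta-data on vertices
// and time stamps on edges.
val g = PropertyGraph(
  vertices = Seq(Vertex(1L, "alice"), Vertex(2L, "bob")),
  edges    = Seq(Edge(1L, 2L, "2014-03-01")))
```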
In contrast to dataflow systems whose operators
(e.g., join) can span multiple collections, operations in
graph processing systems (e.g., vertex programs) are typi-
cally defined with respect to a single property graph with
a pre-declared, sparse structure. While this restricted fo-
cus facilitates a range of optimizations (Section 2.3), it
also complicates the expression of analytics tasks that
may span multiple graphs and sub-graphs.
2.2 The Graph-Parallel Abstraction
Algorithms ranging from PageRank to latent factor anal-
ysis iteratively transform vertex properties based on the
properties of adjacent vertices and edges. This common
pattern of iterative local transformations forms the ba-
sis of the graph-parallel abstraction. In the graph-parallel
abstraction [13], a user-defined vertex program is instantiated concurrently for each vertex and interacts with adjacent vertex programs through messages (e.g., Pregel [22]) or shared state (e.g., PowerGraph [13]). Each vertex program can read and modify its vertex property and, in some cases [13, 20], adjacent vertex properties. When all vertex programs vote to halt, the program terminates.
As a concrete example, in Listing 1 we express the
PageRank algorithm as a Pregel vertex program. The
vertex program for vertex v begins by summing the
messages encoding the weighted PageRank of neighbor-
ing vertices. The PageRank is updated using the resulting
sum and is then broadcast to its neighbors (weighted by
the number of links). Finally, the vertex program assesses
whether it has converged (locally) and votes to halt.
The extent to which vertex programs run concurrently
differs across systems. Most systems (e.g., [7, 13, 22, 34])
adopt the bulk synchronous execution model, in which
all vertex programs run concurrently in a sequence of
super-steps. Some systems (e.g., [13, 20, 37]) also support an asynchronous execution model that mitigates the
port an asynchronous execution model that mitigates the
effect of stragglers by running vertex programs as re-
sources become available. However, the gains due to an
asynchronous programming model are often offset by
the additional complexity and so we focus on the bulk-
synchronous model and rely on system level techniques
(e.g., pipelining and speculation) to address stragglers.
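In the style of Listing 1's pseudocode, a bulk synchronous execution can be sketched as a driver loop over super-steps; the helper names (allVertices, initialMessages, collectSentMessages, votedToHalt) are ours:

```scala
// Sketch of a bulk synchronous driver (helper names are illustrative).
// Each super-step runs all active vertex programs, then delivers the
// messages they sent before the next super-step begins.
var active = allVertices
var inbox: Map[Id, List[Double]] = initialMessages
while (active.nonEmpty) {
  // Run every active vertex program concurrently on its inbox.
  for (v <- active) PageRank(v, inbox.getOrElse(v, Nil))
  // Collect the messages sent during this super-step, grouped by
  // destination, and deactivate the vertices that voted to halt.
  inbox = collectSentMessages()
  active = active.filterNot(votedToHalt)
}
```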
def PageRank(v: Id, msgs: List[Double]) {
  // Compute the message sum
  var msgSum = 0.0
  for (m <- msgs) { msgSum += m }
  // Update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    val msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}
Listing 1: PageRank in Pregel: computes the sum of the inbound messages, updates the PageRank value for the vertex, and then sends the new weighted PageRank value to neighboring vertices. Finally, if the PageRank did not change, the vertex program votes to halt.
While the graph-parallel abstraction is well suited for
iterative graph algorithms that respect the static neigh-
borhood structure of the graph (e.g., PageRank), it is not
well suited to express computation where disconnected
vertices interact or where computation changes the graph
structure. For example, tasks such as graph construction
from raw text or unstructured data, graph coarsening, and
analysis that spans multiple graphs are difficult to express
in the vertex centric programming model.
2.3 Graph System Optimizations
The restrictions imposed by the graph-parallel abstraction
along with the sparse graph structure enable a range of
important system optimizations.
The GAS Decomposition: Gonzalez et al. [13] observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a
boring vertices by collecting messages in the form of a
generalized commutative associative sum and then broad-
casting new messages in an inherently parallel loop. They
proposed the GAS decomposition which splits vertex pro-
grams into three data-parallel stages: Gather, Apply, and
Scatter. In Listing 2 we decompose the PageRank vertex
program into the Gather, Apply, and Scatter stages.
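The idea can be sketched here in the same pseudocode style (our function signatures, not the paper's Listing 2): PageRank's commutative associative sum becomes Gather, the rank update becomes Apply, and the per-edge message becomes Scatter:

```scala
// A sketch of PageRank in the GAS decomposition (notation is ours).
// Gather: combine inbound contributions with a commutative,
// associative sum.
def gather(a: Double, b: Double): Double = a + b
// Apply: update the vertex property from the gathered sum.
def apply(v: Id, msgSum: Double): Unit = { PR(v) = 0.15 + 0.85 * msgSum }
// Scatter: compute the contribution along each out-edge in parallel.
def scatter(v: Id, j: Id): Double = PR(v) / NumLinks(v)
```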
The GAS decomposition leads to a pull-based model of
message computation: the system asks the vertex program for the value of the message between adjacent vertices rather than the user sending messages directly from the vertex program. As a consequence, the GAS decomposition enables vertex-cut partitioning, improved work balance, serial edge-iteration [34], and reduced data movement.
However, the GAS decomposition also prohibits direct
communication between vertices that are not adjacent in
the graph and therefore hinders the expression of more
general communication patterns.
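As a sketch of why the pull-based model enables vertex-cuts: each edge is assigned to a single partition, and a vertex is mirrored on every partition that holds one of its edges, so per-mirror gather sums can be combined at a master copy. The helper names below are ours:

```scala
// Sketch of vertex-cut partitioning (helper names are illustrative).
// Each edge is assigned to exactly one partition, e.g., by hashing
// its endpoint pair.
def partitionOf(src: Long, dst: Long, numParts: Int): Int =
  math.abs((src, dst).hashCode) % numParts

// A vertex is mirrored on every partition containing one of its
// edges; partial gather sums from mirrors are combined at a master.
def mirrors(v: Long, edges: Seq[(Long, Long)], numParts: Int): Set[Int] =
  edges.collect { case (s, d) if s == v || d == v =>
    partitionOf(s, d, numParts)
  }.toSet
```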