2 Large Scale and Big Data
cloud programs can be tailored for graph or data parallelism, which require employ-
ing either data striping and distribution or graph partitioning and mapping. Lastly,
from architectural and management perspectives, a cloud program can typically be
organized in one of two ways, master/slave or peer-to-peer. Such organizations define the
program’s complexity, efficiency, and scalability.
Added to the above design considerations, when constructing cloud programs,
special attention must be paid to various challenges like scalability, communication,
heterogeneity, synchronization, fault tolerance, and scheduling. First, scalability is
hard to achieve in large-scale systems (e.g., clouds) due to several reasons such as
the inability of parallelizing all parts of algorithms, the high probability of load
imbalance, and the inevitability of synchronization and communication overheads.
Second, exploiting locality and minimizing network traffic are not easy to accomplish
on (public) clouds since network topologies are usually unexposed. Third, heterogeneity,
caused by two common realities on clouds (virtualization environments
and variety in datacenter components), imposes difficulties in scheduling tasks and
masking hardware and software differences across cloud nodes. Fourth, synchronization
mechanisms must guarantee mutually exclusive accesses as well as properties
like avoiding deadlocks and transitive closures, which are highly likely in distributed
settings. Fifth, fault-tolerance mechanisms, including task resiliency, distributed
checkpointing, and message logging, should be incorporated since the likelihood of
failures increases on large-scale (public) clouds. Finally, task locality, high parallelism,
task elasticity, and service level objectives (SLOs) need to be addressed in task
and job schedulers for effective execution of programs.
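To see why the first of these challenges, the inability to parallelize all parts of an algorithm, caps scalability, Amdahl's law gives the classical upper bound on speedup when some fraction of the work must run serially. The short sketch below (the function name is ours, for illustration only) evaluates that bound:

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when serial_fraction of the
    algorithm cannot be parallelized across the given workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Even if only 5% of an algorithm is serial, 100 workers
# deliver far less than a 100x speedup.
print(round(amdahl_speedup(0.05, 100), 2))  # ~16.81
```

Note that this bound ignores the synchronization and communication overheads mentioned above, which in practice reduce achievable speedup even further.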
Although designing, addressing, and implementing the requirements and challenges
of cloud programs are crucial, they are difficult, require time and resource
investments, and pose correctness and performance issues. Recently, distributed
analytics engines such as MapReduce, Pregel, and GraphLab were developed to
relieve programmers from worrying about most of the needs of constructing cloud
programs, letting them focus mainly on the sequential parts of their algorithms. Typically, these
analytics engines automatically parallelize sequential algorithms provided by users
in high-level programming languages like Java and C++, synchronize and schedule
constituent tasks and jobs, and handle failures, all without any involvement from
users/developers. In this chapter, we first define some common terms in the theory
of distributed programming, draw a requisite relationship between distributed systems
and clouds, and discuss the main requirements and challenges for building
distributed programs for clouds. While discussing the main requirements for building
cloud programs, we indicate how MapReduce, Pregel, and GraphLab address each
requirement. Finally, we close with a summary of the chapter and a comparison
among MapReduce, Pregel, and GraphLab.
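As a concrete, deliberately simplified illustration of the contract such engines expose, the sketch below mimics the MapReduce programming model in plain Python: the user supplies only sequential map and reduce functions, while the engine (here reduced to a single-process stand-in, with all names ours) would handle partitioning, shuffling, scheduling, and failures.

```python
from collections import defaultdict

def word_count_map(document):
    # User-supplied sequential logic: emit (word, 1) for each word.
    for word in document.split():
        yield word, 1

def word_count_reduce(word, counts):
    # User-supplied sequential logic: aggregate all values for one key.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Stand-in for the engine: on a real cluster, the map calls run in
    # parallel across nodes, intermediate pairs are shuffled over the
    # network, and failed tasks are transparently re-executed.
    groups = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["a rose is a rose"], word_count_map, word_count_reduce)
print(result)  # {'a': 2, 'rose': 2, 'is': 1}
```

The user-facing code stays entirely sequential; parallelism, communication, and fault tolerance live inside the engine, which is precisely the division of labor that MapReduce, Pregel, and GraphLab aim for.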
1.2 TAXONOMY OF PROGRAMS
A computer program consists of variable declarations, variable assignments, expressions,
and flow control statements written typically using a high-level programming
language such as Java or C++. Computer programs are compiled before being executed on
machines. After compilation, they are converted to machine instructions/code that