does not have to restart a query if one of the nodes involved in query
processing fails.
Given the proven operational benefits and resource consumption
savings of using cheap, unreliable commodity hardware to build
a shared-nothing cluster of machines, and the trend towards
extremely low-end hardware in data centers [14], the probability
of a node failure occurring during query processing is increasing
rapidly. This problem only gets worse at scale: the larger the
amount of data that needs to be accessed for analytical queries, the
more nodes are required to participate in query processing. This
further increases the probability of at least one node failing during
query execution. Google, for example, reports an average of 1.2
failures per analysis job [8]. If a query must restart each time a
node fails, then long, complex queries are difficult to complete.
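To make this scaling effect concrete, the back-of-the-envelope sketch below (our illustration, not drawn from any cited measurement; the 0.1% per-node failure rate is an assumed figure) shows how the probability of at least one node failing during a query grows with the number of participating nodes, assuming independent failures.

```python
# Sketch: probability that at least one of n nodes fails during a query,
# assuming each node fails independently with probability p over the
# query's lifetime. The value of p is purely illustrative.
def prob_any_failure(n, p):
    return 1.0 - (1.0 - p) ** n

# With an assumed 0.1% per-node failure probability, a 10-node query rarely
# sees a failure, while a 10,000-node query almost always does.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} nodes: {prob_any_failure(n, p=0.001):.4f}")
```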
Ability to run in a heterogeneous environment. As described
above, there is a strong trend towards increasing the number of
nodes that participate in query execution. It is nearly impossible
to get homogeneous performance across hundreds or thousands of
compute nodes, even if each node runs on identical hardware or on
an identical virtual machine. Partial failures that do not cause a
complete node failure, but that result in degraded hardware
performance, become more common at scale. Individual node disk fragmentation
and software configuration errors can also cause degraded perfor-
mance on some nodes. Concurrent queries (or, in some cases, con-
current processes) further reduce the homogeneity of cluster perfor-
mance. On virtualized machines, concurrent activities performed
by different virtual machines located on the same physical machine
can cause 2-4% variation in performance [5].
If the amount of work needed to execute a query is equally di-
vided among the nodes in a shared-nothing cluster, then there is a
danger that the time to complete the query will be approximately
equal to the time for the slowest compute node to complete its assigned
task. A node with degraded performance would thus have a dis-
proportionate effect on total query time. A system designed to run
in a heterogeneous environment must take appropriate measures to
prevent this from occurring.
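A minimal numerical sketch of this straggler effect follows (our illustration; the node count and per-node timings are assumed, not measured).

```python
# Sketch: with work divided evenly, the query finishes only when the
# slowest node finishes, so a single degraded node dominates query time.
NUM_NODES = 100
node_times = [10.0] * NUM_NODES   # seconds of work on each healthy node
node_times[0] = 40.0              # one node running at a quarter of the speed

query_time = max(node_times)                  # all partitions must finish
average_time = sum(node_times) / NUM_NODES    # what an even division suggests
print(f"query time: {query_time:.1f}s, average per-node time: {average_time:.1f}s")
```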
Flexible query interface. There are a variety of customer-facing
business intelligence tools that work with database software and
aid in visualization, query generation, result dashboarding, and
advanced data analysis. These tools are an important part of the
analytical data management picture since business analysts are of-
ten not technically advanced and do not feel comfortable interfac-
ing with the database software directly. Business Intelligence tools
typically connect to databases using ODBC or JDBC, so databases
that want to work with these tools must accept SQL queries through
these interfaces.
Ideally, the data analysis system should also have a robust mech-
anism for allowing the user to write user-defined functions (UDFs),
and queries that utilize UDFs should automatically be parallelized
across the processing nodes in the shared-nothing cluster. Thus,
both SQL and non-SQL interface languages are desirable.
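As an illustration of the kind of interface these requirements imply, the sketch below submits SQL over ODBC from Python using the pyodbc driver and invokes a user-defined function from within the query; the data source name, credentials, table, and the sentiment_score UDF are hypothetical placeholders, and whether such a UDF is actually parallelized across nodes depends entirely on the underlying system.

```python
import pyodbc  # BI tools connect through ODBC/JDBC drivers in much the same way

# Hypothetical DSN, credentials, table, and UDF -- placeholders for illustration.
conn = pyodbc.connect("DSN=analytics_cluster;UID=analyst;PWD=secret")
cursor = conn.cursor()

# A SQL query that calls a user-defined function; a system meeting the
# flexible-query-interface property would push sentiment_score() down to the
# nodes holding partitions of the reviews table and run it in parallel.
cursor.execute("""
    SELECT product_id, AVG(sentiment_score(review_text)) AS avg_sentiment
    FROM reviews
    GROUP BY product_id
""")
for product_id, avg_sentiment in cursor.fetchall():
    print(product_id, avg_sentiment)
```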
4. BACKGROUND AND SHORTFALLS OF
AVAILABLE APPROACHES
In this section, we give an overview of the parallel database and
MapReduce approaches to performing data analysis, and list the
properties described in Section 3 that each approach meets.
4.1 Parallel DBMSs
Parallel database systems stem from research performed in the
late 1980s and most current systems are designed similarly to the
early Gamma [10] and Grace [12] parallel DBMS research projects.
These systems all support standard relational tables and SQL, and
implement many of the performance enhancing techniques devel-
oped by the research community over the past few decades, in-
cluding indexing, compression (and direct operation on compressed
data), materialized views, result caching, and I/O sharing. Most
(or even all) tables are partitioned over multiple nodes in a shared-
nothing cluster; however, the mechanism by which data is parti-
tioned is transparent to the end-user. Parallel databases use an op-
timizer tailored for distributed workloads that turns SQL commands
into a query plan whose execution is divided equally among multi-
ple nodes.
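The following simplified sketch (our own, with made-up data and a trivial aggregate; it reflects no particular system's mechanism) illustrates what transparent partitioning means in practice: rows are hashed to nodes, each node computes over its local partition, and a coordinator combines the partial results.

```python
# Sketch: hash-partition a table's rows across the nodes of a shared-nothing
# cluster, run an aggregate on each local partition, and merge the results.
NUM_NODES = 4

def node_for(key):
    # Each row is assigned to a node by hashing its partitioning key.
    return hash(key) % NUM_NODES

rows = [("alice", 120), ("bob", 75), ("carol", 310), ("dave", 42)]
partitions = {n: [] for n in range(NUM_NODES)}
for key, value in rows:
    partitions[node_for(key)].append((key, value))

# Each node scans only its local partition; the coordinator sums the partials.
local_sums = [sum(v for _, v in partitions[n]) for n in range(NUM_NODES)]
print("SELECT SUM(value) FROM t  ->", sum(local_sums))
```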
Of the desired properties of large scale data analysis workloads
described in Section 3, parallel databases best meet the “perfor-
mance property” due to the performance push required to compete
on the open market, and the ability to incorporate decades' worth
of performance tricks published in the database research commu-
nity. Parallel databases can achieve especially high performance
when administered by a highly skilled DBA who can carefully de-
sign, deploy, tune, and maintain the system, but recent advances
in automating these tasks and bundling the software into appliance
(pre-tuned and pre-configured) offerings have given many parallel
databases high performance out of the box.
Parallel databases also score well on the flexible query interface
property. Implementation of SQL and ODBC is generally a given,
and many parallel databases allow UDFs (although the ability for
the query planner and optimizer to parallelize UDFs well over a
shared-nothing cluster varies across different implementations).
However, parallel databases generally do not score well on the
fault tolerance and ability to operate in a heterogeneous environ-
ment properties. Although particular details of parallel database
implementations vary, their historical assumptions that failures are
rare events and “large” clusters mean dozens of nodes (instead of
hundreds or thousands) have resulted in engineering decisions that
make it difficult to achieve these properties.
Furthermore, in some cases, there is a clear tradeoff between
fault tolerance and performance, and parallel databases tend to
choose the performance extreme of these tradeoffs. For example,
frequent check-pointing of completed sub-tasks increases the fault
tolerance of long-running read queries, yet this check-pointing
reduces performance. In addition, pipelining intermediate results
between query operators can improve performance, but can result
in a large amount of work being lost upon a failure.
4.2 MapReduce
MapReduce was introduced by Dean et al. in 2004 [8].
Understanding the complete details of how MapReduce works is
not a necessary prerequisite for understanding this paper. In short,
MapReduce processes data distributed (and replicated) across
many nodes in a shared-nothing cluster via three basic operations.
First, a set of Map tasks are processed in parallel by each node in
the cluster without communicating with other nodes. Next, data is
repartitioned across all nodes of the cluster. Finally, a set of Reduce
tasks are executed in parallel by each node on the partition it
receives. This can be followed by an arbitrary number of additional
Map-repartition-Reduce cycles as necessary. MapReduce does not
create a detailed query execution plan that specifies which nodes
will run which tasks in advance; instead, this is determined at
runtime. This allows MapReduce to adjust to node failures and
slow nodes on the fly by assigning more tasks to faster nodes and
reassigning tasks from failed nodes. MapReduce also checkpoints
the output of each Map task to local disk in order to minimize the
amount of work that has to be redone upon a failure.
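A minimal word-count sketch of these three operations follows, written as plain Python rather than against the actual Google or Hadoop APIs; the documents and node counts are made up for illustration.

```python
from collections import defaultdict

# Input data spread across "nodes": one list of documents per node.
node_inputs = [["the cat sat"], ["the dog sat", "the cat ran"]]

# 1. Map: each node emits (key, value) pairs from its local data, independently.
map_outputs = [[(word, 1) for doc in docs for word in doc.split()]
               for docs in node_inputs]

# 2. Repartition (shuffle): pairs are routed to reduce nodes by hashing the key.
NUM_REDUCE_NODES = 2
reduce_inputs = [defaultdict(list) for _ in range(NUM_REDUCE_NODES)]
for local in map_outputs:
    for key, value in local:
        reduce_inputs[hash(key) % NUM_REDUCE_NODES][key].append(value)

# 3. Reduce: each node aggregates the values for the keys it received.
for n, groups in enumerate(reduce_inputs):
    for key, values in sorted(groups.items()):
        print(f"reduce node {n}: {key} -> {sum(values)}")
```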
Of the desired properties of large scale data analysis workloads,