5. Management and administration:
Separate systems require significantly
more work to manage and deploy than a single one. Even for users, they
require learning multiple APIs and execution models.
Because of these limitations, a unified abstraction for cluster computing would
have significant benefits not only in usability but also in performance, especially for
complex applications and multi-user settings.
1.2 Resilient Distributed Datasets (RDDs)
To address this problem, we introduce a new abstraction, resilient distributed
datasets (RDDs), that forms a simple extension to the MapReduce model. The insight
behind RDDs is that although the workloads that MapReduce was unsuited for
(e.g., iterative, interactive and streaming queries) seem at first very different, they all
require a common feature that MapReduce lacks: efficient data sharing across parallel
computation stages. With an efficient data sharing abstraction and MapReduce-
like operators, all of these workloads can be expressed efficiently, capturing the
key optimizations in current specialized systems. RDDs offer such an abstraction
for a broad set of parallel computations, in a manner that is both efficient and
fault-tolerant.
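To make this data sharing concrete, the following is a minimal sketch in Spark's Scala API (runnable in the Spark shell, where a SparkContext named sc is predefined); the input path and the queries over it are hypothetical. The dataset is loaded and filtered once, cached in memory, and then shared by two computations without passing through a replicated file system:

```scala
// Illustrative sketch (not from the thesis text): load a dataset once,
// cache it in memory, and share it across two parallel computations.
// Assumes the Spark shell, where a SparkContext named sc is predefined;
// the HDFS path and the ERROR filter are hypothetical.
val errors = sc.textFile("hdfs://example/logs")
               .filter(_.contains("ERROR"))
               .cache()                      // keep partitions in memory

// Both queries reuse the in-memory partitions of errors; neither one
// writes intermediate data to a replicated file system.
val total    = errors.count()
val byModule = errors.map(line => (line.split("\t")(0), 1))
                     .reduceByKey(_ + _)
                     .collect()
```

In plain MapReduce, each of these computations would instead begin by rereading the input from stable storage.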
In particular, previous fault-tolerant processing models for clusters, such as
MapReduce and Dryad, structured computations as a directed acyclic graph (DAG)
of tasks. This allowed them to efficiently replay just part of the DAG for fault recov-
ery. Between separate computations, however (e.g., between steps of an iterative
algorithm), these models provided no storage abstraction other than replicated file
systems, which add significant costs due to data replication across the network.
RDDs are a fault-tolerant distributed memory abstraction that avoids replication.
Instead, each RDD remembers the graph of operations used to build it, similarly
to batch computing models, and can efficiently recompute data lost on failure. As
long as the operations that create RDDs are relatively coarse-grained, i.e., a single
operation applies to many data elements, this technique is much more efficient
than replicating the data over the network. RDDs work well for a wide range of
today’s data-parallel algorithms and programming models, all of which apply each
operation to many items.
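As a minimal illustration of this recovery strategy (again assuming the Spark shell and a hypothetical input path), each transformation below extends the graph of operations that the resulting RDD remembers, and this recorded lineage can be inspected directly:

```scala
// Illustrative sketch: each RDD remembers the coarse-grained operations
// that built it rather than replicating its data over the network.
val counts = sc.textFile("hdfs://example/input")   // hypothetical path
               .flatMap(_.split(" "))
               .map(word => (word, 1))
               .reduceByKey(_ + _)

// Print the recorded lineage graph; if a partition of counts is lost,
// only the operations feeding that partition are rerun.
println(counts.toDebugString)
```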
While it may seem surprising that just adding data sharing greatly increases the
generality of MapReduce, we explore from several perspectives why this is so. First,
from an expressiveness perspective, we show that RDDs can emulate any distributed
system, and will do so efficiently as long as the system tolerates some network
latency. This is because, once augmented with fast data sharing, MapReduce can
emulate the Bulk Synchronous Parallel (BSP) [108] model of parallel computing,
with the main drawback being the latency of each MapReduce step. Empirically,
in our Spark system, this can be as low as 50–100 ms.
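As a sketch of what such an emulation looks like, consider the standard PageRank computation on RDDs (assuming the Spark shell; the toy graph, iteration count, and damping constants are illustrative). Each loop iteration is one BSP superstep: vertices emit messages in parallel, and the reduceByKey shuffle both delivers the messages and serves as the synchronization barrier, paying the per-step latency noted above:

```scala
// Illustrative sketch: BSP supersteps emulated with MapReduce-like
// RDD steps. The graph and constants are hypothetical examples.
val links = sc.parallelize(Seq(            // page -> outgoing links
  "a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a")
)).cache()
var ranks = links.mapValues(_ => 1.0)      // initial vertex state

for (_ <- 1 to 10) {                       // ten supersteps
  val contribs = links.join(ranks).values.flatMap {
    case (out, rank) => out.map(dst => (dst, rank / out.size))
  }
  // The shuffle delivers all messages and synchronizes the superstep.
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)
```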
Second, from a systems perspective, RDDs, unlike plain MapReduce, give applications enough control