Large-Scale Data Processing: Apache Spark and MapReduce

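Updated 2024-07-21 · PDF, 12.13 MB

Data Algorithms, by Mahmoud Parsian, addresses the need for efficient, scalable, and parallel algorithms in the big data era. As data volumes grow explosively in areas such as search engines, genomic analysis, and social media, the computing power needed to process that data must grow substantially as well. The MapReduce framework emerged in response: it provides a parallel, distributed way to process large (gigabyte-, terabyte-, or petabyte-scale) data sets. The book focuses on Apache Spark and MapReduce/Hadoop implementations and teaches, through worked examples, how to implement MapReduce on both platforms.

MapReduce is a software framework for processing large data sets on clusters of commodity servers. It has two main phases, Map and Reduce. In the Map phase, the input is divided into splits that are processed in parallel on the cluster's nodes, and each map task emits intermediate key-value pairs. In the Reduce phase, the values that share a key are collected and aggregated to produce the final result. This model lets data processing make full use of distributed computing resources.

Apache Spark is a fast, general-purpose, scalable big data processing system; its support for in-memory computing speeds up data processing. Spark offers a high-level API that makes distributed applications easier to write. Its core concept is the resilient distributed dataset (RDD), an immutable, partitioned collection of data that can be operated on in parallel across a cluster. Spark also includes SQL query support (Spark SQL), stream processing (Spark Streaming), and a machine learning library (MLlib), making it a one-stop solution for data analysis.

Hadoop is another widely used open source big data framework, consisting mainly of HDFS (the Hadoop Distributed File System) and MapReduce. HDFS provides highly fault-tolerant storage, and MapReduce handles the processing. Hadoop is designed to handle petabyte-scale data and is well suited to batch jobs. The book shows how to write MapReduce programs on both Spark and Hadoop, covering everything from basic concepts to practical programming techniques. It may also include errata and a revision history.

Data Algorithms is a guide for IT professionals that covers key algorithms and frameworks for big data processing, and a useful reference for developers who want to deepen their understanding and practice in this area.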

Why Use MapReduce?

As we’ve discussed, MapReduce works on the premise of “scaling out” by adding more commodity servers. This is in contrast to “scaling up” (adding more resources, such as memory and CPUs, to a single node in a system). Scaling up can be very costly, and at some point you won’t be able to add more resources due to cost and software or hardware limits. Many times, there are promising main-memory-based algorithms available for solving data problems, but they lack scalability because main memory is a bottleneck. For example, in DNA sequencing analysis, you might need over 512 GB of RAM, which is very costly and not scalable.

If you need to increase your computational power, you’ll need to distribute the work across more than one machine. For example, to do DNA sequencing of 500 GB of sample data, it would take one server over four days to complete just the alignment phase; using 60 servers with MapReduce can cut this time to less than two hours. To process large volumes of data, you must be able to split the data into chunks that are processed separately and recombined later. MapReduce/Hadoop and Spark/Hadoop enable you to increase your computational power by writing just two functions: map() and reduce(). So it’s clear that data analytics has a powerful new tool in the MapReduce paradigm, which has recently surged in popularity thanks to open source solutions such as Hadoop.
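The “two functions” idea can be made concrete with a small, framework-free sketch. The Java below is not Hadoop or Spark code: the class WordCountSketch and its run() driver are hypothetical stand-ins for what a framework does around the two user-written functions (feeding chunks to map(), grouping the intermediate pairs by key, and handing each group to reduce()). Everything runs in one process, so it illustrates the programming model only, not the distribution.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Framework-free illustration of the MapReduce model (assumed names, not a Hadoop/Spark API).
final class WordCountSketch {

    // User-written map(): one input chunk -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // User-written reduce(): one key plus all of its values -> a single total.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Stand-in for the framework: map() per chunk, group by key, reduce() per key.
    static Map<String, Integer> run(List<String> chunks) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String chunk : chunks) {
            for (Map.Entry<String, Integer> pair : map(chunk)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two "chunks" of input, as if the data had been split across two servers.
        System.out.println(run(Arrays.asList("to be or not", "to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In a real cluster the grouping step is the shuffle, and the calls to map() and reduce() would run on different machines; here the loop in run() plays both roles.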

In a nutshell, MapReduce provides the following benefits:

- Programming model + infrastructure
- The ability to write programs that run on hundreds/thousands of machines
- Automatic parallelization and distribution
- Fault tolerance (if a server dies, the job will be completed by other servers)
- Program/job scheduling, status checking, and monitoring
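The fault-tolerance and parallelization items in the list above can be pictured with a toy scheduler. This is only a sketch under strong assumptions: the Worker class, its healthy flag, and the round-robin retry policy are all invented for illustration, the “servers” are objects in one JVM, and real frameworks detect and recover from failures in more involved ways.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

// Toy model of task scheduling with retries (assumed names; not how Hadoop or Spark is implemented).
final class FaultTolerantJobSketch {

    // A simulated server; healthy == false models a machine that has died.
    static final class Worker {
        final String name;
        final boolean healthy;

        Worker(String name, boolean healthy) {
            this.name = name;
            this.healthy = healthy;
        }

        long sum(List<Integer> chunk) {
            if (!healthy) {
                throw new IllegalStateException(name + " is down");
            }
            long total = 0;
            for (int v : chunk) {
                total += v;
            }
            return total;
        }
    }

    // Runs one task, moving on to the next worker whenever the current one fails.
    static long runTask(List<Integer> chunk, List<Worker> workers, int firstChoice) {
        for (int attempt = 0; attempt < workers.size(); attempt++) {
            Worker w = workers.get((firstChoice + attempt) % workers.size());
            try {
                return w.sum(chunk);
            } catch (IllegalStateException dead) {
                // Reschedule the same chunk on another worker.
            }
        }
        throw new IllegalStateException("no healthy worker left");
    }

    // One task per chunk, run in parallel on local threads; partial sums are added at the end.
    static long runJob(List<List<Integer>> chunks, List<Worker> workers) {
        return IntStream.range(0, chunks.size())
                .parallel()
                .mapToLong(i -> runTask(chunks.get(i), workers, i))
                .sum();
    }

    public static void main(String[] args) {
        List<Worker> workers = Arrays.asList(
                new Worker("w0", true), new Worker("w1", false), new Worker("w2", true));
        List<List<Integer>> chunks = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6));
        System.out.println(runJob(chunks, workers)); // prints 21, even though w1 is down
    }
}
```

The caller of runJob() never sees the failure of w1; the chunk assigned to it is simply processed by w2, which is the behavior the fault-tolerance bullet describes.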

Hadoop and Spark

Hadoop is the de facto standard for implementation of MapReduce applications. It is composed of one or more master nodes and any number of slave nodes. Hadoop simplifies distributed applications by saying that “the data center is the computer,” and by providing map() and reduce() functions (defined by the programmer) that allow application developers to utilize those data centers. Hadoop implements the MapReduce paradigm efficiently and is quite simple to learn; it is a powerful tool for processing large amounts of data in the range of terabytes and petabytes.

In this book, most of the MapReduce algorithms are presented in a cookbook format (compiled, complete, and working solutions) and implemented in Java/MapReduce/Hadoop and/or Java/Spark/Hadoop. Both the Hadoop and Spark frameworks are open source and enable us to perform a huge volume of computations and data processing in distributed environments.

These frameworks enable scaling by providing a “scale-out” methodology. They can be set up to run intensive computations in the MapReduce paradigm on thousands of servers. Spark’s API is a higher-level abstraction than Hadoop’s API; for this reason, we are able to express Spark solutions in a single Java driver class.

Hadoop and Spark are two different distributed software frameworks. Hadoop is a MapReduce framework on which you may run jobs supporting the map(), combine(), and reduce() functions. The MapReduce paradigm works well for one-pass computation (first map(), then reduce()), but is inefficient for multipass algorithms.
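The role of combine() between map() and reduce() is easiest to see by counting how many intermediate pairs would have to be shuffled. The sketch below is plain Java with invented names (CombinerSketch, shuffledPairs); it is not the Hadoop API, and it assumes a word-count job, where pre-aggregation is safe because addition is associative and commutative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of map() -> combine() -> reduce() (assumed names, not the Hadoop API).
final class CombinerSketch {

    // map(): emit each word of a chunk; the implied value of every emitted word is 1.
    static List<String> map(String chunk) {
        List<String> words = new ArrayList<>();
        for (String w : chunk.split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // combine(): pre-aggregate one mapper's output locally, before anything is shuffled.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : words) {
            local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    // reduce(): merge the partial counts that arrive for each key.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((w, n) -> totals.merge(w, n, Integer::sum));
        }
        return totals;
    }

    // Number of (key, value) pairs that would be shuffled when the combiner is used.
    static int shuffledPairs(List<Map<String, Integer>> partials) {
        int pairs = 0;
        for (Map<String, Integer> partial : partials) {
            pairs += partial.size();
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("a a a b", "a b b");
        List<Map<String, Integer>> partials = new ArrayList<>();
        int rawPairs = 0;
        for (String chunk : chunks) {
            List<String> words = map(chunk);
            rawPairs += words.size();      // pairs shuffled if there were no combiner
            partials.add(combine(words));
        }
        System.out.println("pairs without combine: " + rawPairs);                 // 7
        System.out.println("pairs with combine: " + shuffledPairs(partials));     // 4
        System.out.println(reduce(partials));                                     // {a=4, b=3}
    }
}
```

The final counts are the same with or without the combiner; only the volume of intermediate data changes, which is why a combiner is an optimization rather than a required step.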

Spark is not a MapReduce framework, but can be easily used to support a MapReduce framework’s functionality; it has the proper API to handle map() and reduce() functionality. Spark is not tied to a map phase and then a reduce phase. A Spark job can be an arbitrary DAG (directed acyclic graph) of map and/or reduce/shuffle phases.
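As a rough picture of a job with more than one shuffle phase, consider counting words and then grouping the words by their count. With a fixed map-then-reduce structure this would typically be written as two chained jobs with intermediate output stored between them; in a DAG-style engine both stages belong to one job. The sketch below uses plain Java collections as a local stand-in: MultiStageSketch, countWords(), and groupByCount() are hypothetical names, not Spark calls, and nothing here is distributed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for a two-stage pipeline (plain Java collections, not Spark RDDs).
final class MultiStageSketch {

    // Stage 1: split lines into words and sum a count of 1 per word (a shuffle keyed by word).
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2: re-key by count and group the words that share it (a second shuffle).
    static Map<Integer, List<String>> groupByCount(Map<String, Integer> counts) {
        Map<Integer, List<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "c b a");
        // Stage 2 consumes stage 1's output directly; no intermediate files are written.
        System.out.println(groupByCount(countWords(lines)));
        // prints {1=[c], 2=[b], 3=[a]}
    }
}
```

The dependency of groupByCount() on countWords() is the simplest possible DAG: two stages in a line. Longer or branching pipelines follow the same pattern of each stage consuming the output of the stages before it.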

Spark programs may run with or without Hadoop, and Spark may use HDFS (Hadoop Distributed File System) or other persistent storage for input/output. In a nutshell, for a given Spark program or job, the Spark engine creates a DAG of task stages to be performed on the cluster, while Hadoop/MapReduce, on the other hand, creates a DAG with two predefined stages, map and reduce. Note that DAGs created by Spark can contain any number of stages. This allows most Spark jobs to complete faster than they would in Hadoop/MapReduce, with simple jobs completing after just one stage and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs. As mentioned, Spark’s API is a higher-level abstraction than

