Fast Data Processing with Spark

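"Fast Data Processing with Spark", written by Holden Karau, is a book focused on Apache Spark's high-speed distributed computing technology. It aims to help readers understand and master how Spark makes batch data processing simple and efficient. As an open-source computing framework, Spark is widely regarded for its speed, ease of use, and flexibility in big data processing.

Spark's core features include Resilient Distributed Datasets (RDDs), Spark SQL for structured data processing, Spark Streaming for real-time stream processing, the MLlib machine learning library, and the GraphX graph computation framework. Together these components form a powerful data processing ecosystem capable of handling petabyte-scale data.

The RDD is the foundation of Spark: a fault-tolerant, immutable collection of data that can be distributed across multiple nodes of a cluster. RDDs support parallel operations, namely transformations and actions, which can execute quickly in memory and so significantly increase processing speed. The RDD design allows data to be recovered automatically even when a node fails, which keeps the system highly available.

Spark SQL lets users query structured data with SQL or the DataFrame API. It is compatible with Hadoop's Hive, so existing Hive users can move to Spark smoothly. Spark Streaming provides micro-batch processing of real-time data streams and can handle streaming data from many sources, such as TCP sockets, Kafka, and Flume.

MLlib is Spark's machine learning library. It contains a variety of machine learning algorithms, such as classification, regression, clustering, and collaborative filtering, as well as tools for model evaluation and feature selection. These algorithms are designed to be scalable and to run on large datasets. In addition, GraphX provides an API for graph data processing that supports creating and querying graphs and applying graph algorithms such as PageRank.

The book likely covers installing and configuring Spark, setting up a working environment, developing Spark applications, and deploying and tuning Spark clusters in real projects. Readers can also learn how to integrate Spark with other data storage systems (such as HDFS, Cassandra, and HBase) and how to use Spark for complex data analysis and mining. "Fast Data Processing with Spark" is a comprehensive, in-depth guide to Spark, suited to data engineers, data scientists, architects, and anyone who wants to understand and use Spark for large-scale data processing. Through this book, readers will come to understand how Spark works and be able to use it effectively on large-scale data problems.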

Installing Spark and Setting Up Your Cluster

This chapter details some common methods for setting up Spark. Spark on a single machine is excellent for testing, but you will also learn to use Spark's built-in deployment scripts to deploy to a dedicated cluster via SSH (Secure Shell). This chapter also covers using Mesos, YARN, Puppet, or Chef to deploy Spark. For cloud deployments of Spark, this chapter looks at EC2 (both traditional EC2 and Elastic MapReduce). Feel free to skip this chapter if you already have your local Spark instance installed and want to get straight to programming.

Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark from http://spark-project.org/download (Version 0.7 as of this writing). Coders who live dangerously can clone the code directly from the repository git://github.com/mesos/spark.git. Both the source code and pre-built binaries are available. To interact with Hadoop Distributed File System (HDFS), you need to use a Spark version that is built against the same version of Hadoop as your cluster. For Version 0.7 of Spark, the pre-built package is built against Hadoop 1.0.4. If you are up for the challenge, it's recommended that you build from source, since that gives you the flexibility of choosing which HDFS version you want to support as well as applying patches. You will need the appropriate version of Scala installed and the matching JDK. For Version 0.7.1 of Spark, you require Scala 2.9.2 or a later 2.9 series release (2.9.3 works well). At the time of this writing, Ubuntu's LTS release (Precise) has Scala Version 2.9.1. Additionally, the current stable version has 2.9.2 and Fedora 18 has 2.9.2. Up-to-date package information can be found at http://packages.ubuntu.com/search?keywords=scala. The latest version of Scala is available from http://scala-lang.org/download. It is important to choose the version of Scala that matches the version requested by Spark, as Scala is a fast-evolving language.
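The version requirement above (Scala 2.9.2 or a later 2.9 series release for Spark 0.7.1) can be expressed as a small check. The following is a minimal sketch only; scala_version_ok is a hypothetical helper name, not something shipped with Spark or Scala, and it expects you to pass in the version string yourself.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given Scala version
# is a 2.9-series release at or above 2.9.2, the requirement the text
# states for Spark 0.7.1.
scala_version_ok() {
    version=$1
    case $version in
        2.9.*) ;;               # must belong to the 2.9 series
        *) return 1 ;;
    esac
    patch=${version#2.9.}       # strip the "2.9." prefix
    patch=${patch%%[!0-9]*}     # keep only the leading digits (drops suffixes such as ".final")
    [ -n "$patch" ] && [ "$patch" -ge 2 ]
}

# Usage: pass the version reported by your Scala installation.
# scala_version_ok 2.9.3 && echo "Scala version is suitable for Spark 0.7.1"
```

With the versions mentioned in the text, 2.9.1 (Ubuntu Precise) fails this check, while 2.9.2 and 2.9.3 pass.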


The tarball file contains a bin directory that needs to be added to your path, and SCALA_HOME should be set to the path where the tarball file is extracted. Scala can be installed from the tarball by running:

wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz &&
tar -xvf scala-2.9.3.tgz &&
cd scala-2.9.3 &&
export PATH=`pwd`/bin:$PATH &&
export SCALA_HOME=`pwd`

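You will probably want to add these to your .bashrc file or equivalent. Because `pwd` would be re-evaluated in whatever directory each new shell starts, write the absolute path of the extracted directory instead (/path/to/scala-2.9.3 below is a placeholder):

export SCALA_HOME=/path/to/scala-2.9.3

export PATH=$SCALA_HOME/bin:$PATH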
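One way to persist these settings without typing the path by hand is to resolve the directory once and append the result to the startup file. This is a sketch under assumptions: persist_scala_env is a hypothetical helper name, and the default target of $HOME/.bashrc is only a guess at your shell setup.

```shell
# Hypothetical helper: appends SCALA_HOME and PATH exports for the given
# Scala directory to a shell startup file (default: $HOME/.bashrc).
persist_scala_env() {
    scala_dir=$1
    rc_file=${2:-$HOME/.bashrc}
    # Skip if a SCALA_HOME export is already present, so reruns add nothing.
    if [ -f "$rc_file" ] && grep -q '^export SCALA_HOME=' "$rc_file"; then
        return 0
    fi
    {
        # The directory is written out as a fixed, already-resolved path.
        printf 'export SCALA_HOME="%s"\n' "$scala_dir"
        # Single quotes keep $SCALA_HOME and $PATH unexpanded until the file is sourced.
        printf '%s\n' 'export PATH="$SCALA_HOME/bin:$PATH"'
    } >> "$rc_file"
}

# Usage, run from inside the extracted scala-2.9.3 directory:
# persist_scala_env "$(pwd)"
```

Resolving the directory at the moment the helper runs means the startup file stores a fixed path rather than a command that is evaluated again in every new shell.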

Spark is built with sbt (simple build tool, which is no longer very simple), and build times can be quite long when compiling Scala's source code. Don't worry if you don't have sbt installed; the build script will download the correct version for you. On an admittedly under-powered Core 2 laptop with an SSD, installing a fresh copy of Spark took about seven minutes. If you decide to build Version 0.7 from source, you would run:

wget http://www.spark-project.org/download-spark-0.7.0-sources-tgz &&
tar -xvf download-spark-0.7.0-sources-tgz &&
cd spark-0.7.0 &&
sbt/sbt package

If you are going to use a version of HDFS that doesn't match the default version for your Spark instance, you will need to edit project/SparkBuild.scala and set HADOOP_VERSION to the corresponding version and recompile it with:

sbt/sbt clean compile
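The edit to project/SparkBuild.scala can also be scripted. The sketch below assumes the file contains a line of the form val HADOOP_VERSION = "1.0.4"; check your copy of the file before relying on it. set_hadoop_version is a hypothetical helper name, and the version in the usage line is only an example.

```shell
# Hypothetical helper. Assumes the build file contains a line of the form:
#   val HADOOP_VERSION = "1.0.4"
# The new version must not contain "/" or "&", which are special to sed.
set_hadoop_version() {
    build_file=$1
    new_version=$2
    tmp_file="$build_file.tmp"
    # Rewrite only the quoted value on the HADOOP_VERSION line. Writing to a
    # temporary file first avoids the non-portable "sed -i" option.
    sed "s/\(val HADOOP_VERSION = \)\"[^\"]*\"/\1\"$new_version\"/" "$build_file" > "$tmp_file" &&
        mv "$tmp_file" "$build_file"
}

# Usage, from the Spark source directory:
# set_hadoop_version project/SparkBuild.scala 0.20.205.0 && sbt/sbt clean compile
```

The pattern matches the exact name HADOOP_VERSION followed by " = ", so other settings whose names merely start with HADOOP_ are left untouched.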

The sbt tool has made great progress with dependency resolution, but it is still strongly recommended that developers do a clean build rather than an incremental build, because incremental builds still don't get it quite right all the time.

Once you have started the build, it's probably a good time for a break, such as getting a cup of coffee. If you find it stuck on a single line that says "Resolving [XYZ]...." for a long time (say five minutes), stop it and run sbt/sbt package again.

If you can live with the restrictions (such as the fixed HDFS version), using the pre-built binary will get you up and running far quicker. To download and unpack the pre-built version, use the following command:

wget http://www.spark-project.org/download-spark-0.7.0-prebuilt-tgz &&
tar -xvf download-spark-0.7.0-prebuilt-tgz &&
cd spark-0.7.0


Running Spark on EC2

There are many handy scripts to run Spark on EC2 in the ec2 directory. These scripts can be used to run multiple Spark clusters, and even to run on spot instances. Spark can also be run on Elastic MapReduce (EMR). This is Amazon's solution for MapReduce cluster management, which gives you more flexibility around scaling instances.

Running Spark on EC2 with the scripts

To get started, you should make sure that you have EC2 enabled on your account by signing up for it at https://portal.aws.amazon.com/gp/aws/manageYourAccount. It is a good idea to generate a separate access key pair for your Spark cluster, which you can do at https://portal.aws.amazon.com/gp/aws/securityCredentialsR. You will also need to create an EC2 key pair so that the Spark script can SSH to the launched machines; this can be done at https://console.aws.amazon.com/ec2/home by selecting Key Pairs under Network & Security. Remember that key pairs are created per region, so make sure you create your key pair in the same region in which you intend to run your Spark instances. Give it a name that you can remember (this chapter uses spark-keypair as its example key pair name), as you will need it for the scripts. You can also choose to upload your public SSH key instead of generating a new key. These keys are sensitive, so keep them private. You also need to set your access key pair as the AWS_ACCESS_KEY and AWS_SECRET_KEY environment variables for the Amazon EC2 scripts:

chmod 400 spark-keypair.pem

export AWS_ACCESS_KEY="..."

export AWS_SECRET_KEY="..."
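Before launching anything, it can help to confirm that these settings are in place. The following is a minimal pre-flight sketch; check_ec2_env is a hypothetical helper name and is not part of the Spark or Amazon EC2 scripts. It only checks that both variables are non-empty and that the key file is private, not that the credentials are valid.

```shell
# Hypothetical pre-flight helper: succeeds only if both AWS variables are
# set and the key file grants no permissions to group or others.
check_ec2_env() {
    key_file=$1
    if [ -z "${AWS_ACCESS_KEY:-}" ] || [ -z "${AWS_SECRET_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY and AWS_SECRET_KEY must both be set" >&2
        return 1
    fi
    if [ ! -f "$key_file" ]; then
        echo "key file not found: $key_file" >&2
        return 1
    fi
    # Characters 5-10 of the ls mode string are the group and other bits;
    # after chmod 400 they are all dashes.
    group_other=$(ls -ld "$key_file" | cut -c5-10)
    if [ "$group_other" != "------" ]; then
        echo "key file is accessible by group or others; run: chmod 400 $key_file" >&2
        return 1
    fi
}

# Usage:
# check_ec2_env spark-keypair.pem && echo "ready to run the EC2 scripts"
```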

You will find it useful to download the EC2 scripts provided by Amazon from http://aws.amazon.com/developertools/Amazon-EC2/351. Once you unzip the resulting ZIP file, you can add the bin folder to your PATH variable in a similar manner to what you did with the Spark bin folder:

wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip

unzip ec2-api-tools.zip

cd ec2-api-tools-*

export EC2_HOME=`pwd`

export PATH=$PATH:`pwd`/bin
