Quick Start with Spark: Deployment and Distributed Computing Explained

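"Learn Spark Quickly", written by Holden Karau, is a detailed tutorial on Apache Spark that aims to help readers quickly master this powerful distributed computing framework. Spark is designed to simplify big data processing, with particular emphasis on fast data processing. The book is aimed at readers who want to get started quickly with Spark in areas such as big data analysis, machine learning, and real-time stream processing.

Spark's core features include:

1. High-performance computing: with its in-memory computing model, Spark can run faster than Hadoop MapReduce, because it keeps intermediate results in memory, which reduces disk I/O and speeds up iterative computation.
2. Easy-to-use APIs: Spark provides a set of easy-to-use APIs, such as Spark SQL (for SQL queries) and Spark Streaming (for real-time stream processing), which make data analysis more intuitive.
3. Scalability: Spark supports cluster deployment and can scale to thousands of nodes to meet large-scale data processing needs.
4. Data sharing: Spark's shared memory model lets different tasks share data, which reduces the overhead of copying data.
5. Machine learning support: the MLlib library, which is part of Spark, provides a rich set of machine learning algorithms for predictive analysis.
6. Interactive environment: the Spark Shell and Spark Notebook (an interactive environment based on Jupyter Notebook) let developers experiment and iterate quickly.

The book covers Spark installation and configuration, core components (such as RDD, DataFrame, and Spark SQL), distributed computing, Spark Streaming, Spark MLlib, and some advanced topics such as the Spark ecosystem and best practices.

On copyright: no part of the content may be reproduced, stored, or transmitted in any form without the prior written permission of Packt Publishing. In preparing the book, the author and publisher made every effort to ensure the accuracy of the information, but the information is not guaranteed to be free of errors and is provided without warranty of any kind. Neither the author nor Packt Publishing and its dealers and distributors accept liability for any direct or indirect damages caused by the book. Likewise, although Packt Publishing has tried to provide correct trademark information for the trademarks mentioned in the book, it cannot guarantee the accuracy of that information.

The book was first published in October 2013. It is a text that is continuously updated to keep pace with the technology, and it suits developers, data analysts, and engineers interested in big data processing as an introductory guide or reference. As Spark keeps evolving and new versions are released, readers should also consult the official documentation for the latest information.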

Installing Spark and Setting Up Your Cluster

This chapter details some common methods for setting up Spark. Spark on a single machine is excellent for testing, but you will also learn to use Spark's built-in deployment scripts to deploy to a dedicated cluster via SSH (Secure Shell). This chapter also covers using Mesos, YARN, Puppet, or Chef to deploy Spark. For cloud deployments of Spark, this chapter looks at EC2 (both traditional EC2 and Elastic MapReduce). Feel free to skip this chapter if you already have your local Spark instance installed and want to get straight to programming.

Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark from http://spark-project.org/download (Version 0.7 as of this writing). For coders who live dangerously, try cloning the code directly from the repository git://github.com/mesos/spark.git. Both the source code and pre-built binaries are available. To interact with Hadoop Distributed File System (HDFS), you need to use a Spark version that is built against the same version of Hadoop as your cluster. For Version 0.7 of Spark, the pre-built package is built against Hadoop 1.0.4. If you are up for the challenge, building from source is recommended, since it gives you the flexibility of choosing which HDFS version you want to support as well as the ability to apply patches.

You will need the appropriate version of Scala installed and the matching JDK. For Version 0.7.1 of Spark, you require Scala 2.9.2 or a later 2.9 series release (2.9.3 works well). At the time of this writing, Ubuntu's LTS release (Precise) has Scala Version 2.9.1. Additionally, Ubuntu's current stable release has 2.9.2 and Fedora 18 has 2.9.2. Up-to-date package information can be found at http://packages.ubuntu.com/search?keywords=scala. The latest version of Scala is available from http://scala-lang.org/download. It is important to choose the version of Scala that matches the version requested by Spark, as Scala is a fast-evolving language.
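Because the required Scala version is tied to the Spark release, it can help to check a version string before starting a build. The following is a minimal sketch and not part of Spark: the function name scala_version_ok is hypothetical, and it only encodes the rule stated above for Spark 0.7.1 (Scala 2.9.2 or a later 2.9 series release).

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for Scala 2.9.2 or a later 2.9.x release.
scala_version_ok() {
    version="$1"
    case "$version" in
        2.9.*) ;;               # only the 2.9 series is acceptable
        *) return 1 ;;
    esac
    patch="${version#2.9.}"     # e.g. "3" from "2.9.3"
    patch="${patch%%[!0-9]*}"   # drop any trailing suffix such as ".final"
    [ -n "$patch" ] && [ "$patch" -ge 2 ]
}
```

For example, scala_version_ok 2.9.3 succeeds, while scala_version_ok 2.9.1 and scala_version_ok 2.10.0 both fail.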


The tarball le contains a bin directory that needs to be added to your path, and

SCALA_HOME should be set to the path where the tarball le is extracted. Scala can

be installed from source by running:

wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz && tar -xvf scala-2.9.3.tgz && cd scala-2.9.3 && export PATH=`pwd`/bin:$PATH && export SCALA_HOME=`pwd`

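You will probably want to add these to your .bashrc file or equivalent, using the absolute path of the extracted directory rather than `pwd`, which would be evaluated in whatever directory each new shell starts in (the path below is a placeholder):

export SCALA_HOME=/path/to/scala-2.9.3

export PATH=$SCALA_HOME/bin:$PATH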

Spark is built with sbt (simple build tool, which is no longer very simple), and build times can be quite long when compiling Scala source code. Don't worry if you don't have sbt installed; the build script will download the correct version for you.

On an admittedly under-powered Core 2 laptop with an SSD, installing a fresh copy of Spark took about seven minutes. If you decide to build Version 0.7 from source, you would run:

wget http://www.spark-project.org/download-spark-0.7.0-sources-tgz && tar -xvf download-spark-0.7.0-sources-tgz && cd spark-0.7.0 && sbt/sbt package

If you are going to use a version of HDFS that doesn't match the default version for your Spark instance, you will need to edit project/SparkBuild.scala and set HADOOP_VERSION to the corresponding version and recompile it with:

sbt/sbt clean compile
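The edit to project/SparkBuild.scala can also be scripted. The sketch below is illustrative only: the function name set_hadoop_version is hypothetical, and it assumes the build file contains a line of the form val HADOOP_VERSION = "1.0.4", which should be confirmed against your copy of the source before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the HADOOP_VERSION line in a build file.
# Assumes the file has a line of the form:  val HADOOP_VERSION = "1.0.4"
set_hadoop_version() {
    build_file="$1"
    new_version="$2"
    # Fail early if the expected line is missing rather than silently doing nothing
    grep -q 'val HADOOP_VERSION = "' "$build_file" || return 1
    # Write to a temporary file, then move it into place (portable alternative to sed -i)
    tmp_file="${build_file}.tmp"
    sed "s/\(val HADOOP_VERSION = \)\"[^\"]*\"/\1\"${new_version}\"/" "$build_file" > "$tmp_file" &&
        mv "$tmp_file" "$build_file"
}
```

A typical use, from the top of the Spark source tree, would be set_hadoop_version project/SparkBuild.scala 1.0.3 followed by sbt/sbt clean compile, where 1.0.3 is only an example value standing in for your cluster's Hadoop version.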

The sbt tool has made great progress with dependency resolution, but it still doesn't get it quite right all the time, so it's strongly recommended for developers to do a clean build rather than an incremental build.

Once you have started the build, it's probably a good time for a break, such as getting a cup of coffee. If you find it stuck on a single line that says "Resolving [XYZ]...." for a long time (say five minutes), stop it and rerun sbt/sbt package.

If you can live with the restrictions (such as the fixed HDFS version), using the pre-built binary will get you up and running far quicker. To download and unpack the pre-built version, use the following command:

wget http://www.spark-project.org/download-spark-0.7.0-prebuilt-tgz && tar -xvf download-spark-0.7.0-prebuilt-tgz && cd spark-0.7.0


Running Spark on EC2

There are many handy scripts to run Spark on EC2 in the ec2 directory. These scripts can be used to run multiple Spark clusters, and even to run on spot instances. Spark can also be run on Elastic MapReduce (EMR). This is Amazon's solution for MapReduce cluster management, which gives you more flexibility around scaling instances.

Running Spark on EC2 with the scripts

To get started, you should make sure that you have EC2 enabled on your account by signing up for it at https://portal.aws.amazon.com/gp/aws/manageYourAccount. It is a good idea to generate a separate access key pair for your Spark cluster, which you can do at https://portal.aws.amazon.com/gp/aws/securityCredentialsR. You will also need to create an EC2 key pair, so that the Spark script can SSH to the launched machines; this can be done at https://console.aws.amazon.com/ec2/home by selecting Key Pairs under Network & Security. Remember that key pairs are created "per region", so you need to make sure you create your key pair in the same region as you intend to run your Spark instances. Make sure to give it a name that you can remember (we will use spark-keypair in this chapter as the example key pair name) as you will need it for the scripts. You can also choose to upload your public SSH key instead of generating a new key. Key pairs are sensitive, so make sure that you keep them private. You also need to restrict the permissions on the key pair file and set AWS_ACCESS_KEY and AWS_SECRET_KEY as environment variables for the Amazon EC2 scripts:

chmod 400 spark-keypair.pem

export AWS_ACCESS_KEY="..."

export AWS_SECRET_KEY="..."
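Before launching anything, it can be worth confirming that these prerequisites are in place. The following pre-flight check is a sketch under stated assumptions: the function name check_ec2_env is hypothetical and not part of the Spark or Amazon scripts, and it only tests the conditions described above (both variables set, key file present with mode 400).

```shell
#!/bin/sh
# Hypothetical pre-flight check before running the EC2 scripts.
# Usage: check_ec2_env path/to/keypair.pem
check_ec2_env() {
    key_file="$1"
    [ -n "${AWS_ACCESS_KEY:-}" ] || { echo "AWS_ACCESS_KEY is not set" >&2; return 1; }
    [ -n "${AWS_SECRET_KEY:-}" ] || { echo "AWS_SECRET_KEY is not set" >&2; return 1; }
    [ -f "$key_file" ] || { echo "key file not found: $key_file" >&2; return 1; }
    # GNU stat uses -c, BSD/macOS stat uses -f; try the GNU form first
    mode="$(stat -c '%a' "$key_file" 2>/dev/null || stat -f '%Lp' "$key_file")"
    [ "$mode" = "400" ] || { echo "expected mode 400 on $key_file, found $mode" >&2; return 1; }
}
```

Calling check_ec2_env spark-keypair.pem returns success only when all of the conditions hold, and otherwise prints the first problem it finds to standard error.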

You will nd it useful to download the EC2 scripts provided by Amazon from

http://aws.amazon.com/developertools/Amazon-EC2/351. Once you unzip

the resulting ZIP le, you can add the bin folder to your PATH variable in a similar

manner to what you did with the Spark bin folder:

wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip

unzip ec2-api-tools.zip

cd ec2-api-tools-*

export EC2_HOME=`pwd`

export PATH=$PATH:`pwd`/bin
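If lines like these end up in a shell startup file, PATH can accumulate duplicate entries each time the file is sourced. One way to avoid that is sketched below; the helper name path_append is hypothetical and is not part of the EC2 tools.

```shell
#!/bin/sh
# Hypothetical helper: append a directory to PATH only if it is not already there.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;            # already on PATH, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}
```

With EC2_HOME set as above, path_append "$EC2_HOME/bin" adds the tools' bin folder once, and repeated calls leave PATH unchanged.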
