PacktPub 2013: A Practical Guide to Accelerating Big Data Processing with Spark


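Fast Data Processing with Spark (October 2013), written by Holden Karau and published by Packt Publishing, shows how to use Apache Spark, a powerful big data processing framework, to simplify high-speed distributed computing. Spark began as a research project at UC Berkeley's AMPLab in 2009 and was open-sourced in 2010. It aims to make processing very large data sets simple, and it is known for its speed, its in-memory computation model, and its stream-processing support.

Spark's core idea is in-memory computation: it can keep data in memory rather than on disk, which significantly speeds up processing. The book covers Spark's basic concepts and core components such as Resilient Distributed Datasets (RDDs), Spark SQL, and Spark Streaming, as well as cluster management and resource scheduling. With these tools, readers learn how to design, develop, and optimize Spark applications for real-time or batch data analysis.

Drawing on her background in big data, Holden Karau explains how to use Spark for data cleaning, transformation, and aggregation, and emphasizes Spark's interactive programming model, such as the Spark shell, which lets data scientists and developers iterate quickly and experiment with analysis strategies. The book also shares practical experience and best practices to help readers apply Spark to real business problems.

Although the book was first published in 2013, Spark has kept evolving and has become a key component of the big data ecosystem. As Spark continues to gain new features, including improvements from Databricks and a growing ecosystem, the book remains a useful reference, but it should be supplemented with the latest Spark documentation and tutorials.

On copyright: all rights are reserved. No part of the book may be reproduced, stored, or transmitted in any form without the prior written permission of the publisher, except for brief quotations embedded in critical articles or reviews. The author and Packt Publishing have tried to ensure that the information is accurate, but it is provided without warranty of any kind, and they accept no liability for any direct or indirect damages arising from it.

Overall, Fast Data Processing with Spark is an introductory guide to big data processing that is worth studying in depth, especially for data analysts, engineers, and machine learning practitioners who want to build their distributed computing skills. For professionals who want to keep up with Spark, following the Spark community and its latest resources is just as important.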

Installing Spark and Setting Up Your Cluster

This chapter details some common methods for setting up Spark. Spark on a single machine is excellent for testing, but you will also learn to use Spark's built-in deployment scripts to deploy to a dedicated cluster via SSH (Secure Shell). This chapter also covers using Mesos, YARN, Puppet, or Chef to deploy Spark. For cloud deployments of Spark, this chapter looks at EC2 (both traditional EC2 and Elastic MapReduce). Feel free to skip this chapter if you already have a local Spark instance installed and want to get straight to programming.

Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark from http://spark-project.org/download (Version 0.7 as of this writing). Coders who live dangerously can clone the code directly from the repository at git://github.com/mesos/spark.git. Both the source code and pre-built binaries are available. To interact with Hadoop Distributed File System (HDFS), you need to use a Spark version that is built against the same version of Hadoop as your cluster. For Version 0.7 of Spark, the pre-built package is built against Hadoop 1.0.4. If you are up for the challenge, building from source is recommended, since it gives you the flexibility to choose which HDFS version to support and to apply patches.

You will need the appropriate version of Scala installed and the matching JDK. For Version 0.7.1 of Spark, you require Scala 2.9.2 or a later 2.9 series release (2.9.3 works well). At the time of this writing, Ubuntu's LTS release (Precise) has Scala Version 2.9.1. Additionally, the current stable version has 2.9.2 and Fedora 18 has 2.9.2. Up-to-date package information can be found at http://packages.ubuntu.com/search?keywords=scala. The latest version of Scala is available from http://scala-lang.org/download. It is important to choose the version of Scala that matches the version requested by Spark, as Scala is a fast-evolving language.
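The version requirement above can be checked mechanically before starting a build. The following is a minimal sketch rather than part of Spark's tooling: the function name scala_version_ok is hypothetical, and the only rule it encodes is the one stated in the text (2.9.2 or a later 2.9 series release).

```shell
# Return 0 if the given Scala version is 2.9.2 or a later 2.9.x release.
# The 2.9.2 floor comes from the text; the parsing is an illustrative assumption.
scala_version_ok() {
    version="$1"
    case "$version" in
        2.9.*) ;;              # only the 2.9 series is acceptable
        *) return 1 ;;
    esac
    patch="${version#2.9.}"    # e.g. "3" from "2.9.3"
    patch="${patch%%[!0-9]*}"  # drop suffixes such as ".final" or "-RC1"
    [ -n "$patch" ] && [ "$patch" -ge 2 ]
}
```

For example, scala_version_ok 2.9.3 succeeds while scala_version_ok 2.9.1 and scala_version_ok 2.10.0 fail. Feeding it the installed version means parsing the output of scala -version, whose exact format is not covered here and may differ between releases.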


The tarball file contains a bin directory that needs to be added to your path, and SCALA_HOME should be set to the path where the tarball file is extracted. Scala can be installed from the tarball by running:

wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz && tar -xvf scala-2.9.3.tgz && cd scala-2.9.3 && export PATH=`pwd`/bin:$PATH && export SCALA_HOME=`pwd`

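You will probably want to add these to your .bashrc file or equivalent. Because `pwd` is evaluated in whatever directory the shell starts in, use the absolute path of the extracted directory instead (shown here with the placeholder /path/to/scala-2.9.3):

export SCALA_HOME=/path/to/scala-2.9.3
export PATH=$SCALA_HOME/bin:$PATH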
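One way to persist these settings without typing the path by hand is to resolve the extracted directory to an absolute path once and append the result to the rc file. This is a sketch under assumptions: the function name persist_scala_env and its two arguments (the extracted directory and the rc file) are hypothetical and are not something Scala or Spark provides.

```shell
# Append SCALA_HOME and PATH exports for an extracted Scala directory to an rc file.
# Both arguments are illustrative; the text only says ".bashrc or equivalent".
persist_scala_env() {
    scala_dir="$(cd "$1" && pwd)" || return 1   # resolve to an absolute path
    rc_file="$2"
    # Skip if an earlier run already recorded SCALA_HOME.
    if [ -f "$rc_file" ] && grep -q '^export SCALA_HOME=' "$rc_file"; then
        return 0
    fi
    {
        printf 'export SCALA_HOME="%s"\n' "$scala_dir"
        printf 'export PATH="$SCALA_HOME/bin:$PATH"\n'
    } >> "$rc_file"
}
```

Run from the directory that contains the extracted tarball, a call such as persist_scala_env scala-2.9.3 ~/.bashrc would add the two lines once; calling it again leaves the file unchanged.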

Spark is built with sbt (simple build tool, which is no longer very simple), and build times can be quite long when compiling Scala source code. Don't worry if you don't have sbt installed; the build script will download the correct version for you.

On an admittedly under-powered Core 2 laptop with an SSD, installing a fresh copy of Spark took about seven minutes. If you decide to build Version 0.7 from source, you would run:

wget http://www.spark-project.org/download-spark-0.7.0-sources-tgz && tar -xvf download-spark-0.7.0-sources-tgz && cd spark-0.7.0 && sbt/sbt package

If you are going to use a version of HDFS that doesn't match the default version for your Spark instance, you will need to edit project/SparkBuild.scala and set HADOOP_VERSION to the corresponding version and recompile it with:

sbt/sbt clean compile
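The edit to project/SparkBuild.scala can be scripted when the same change has to be repeated on several checkouts. The sketch below assumes the file contains a line of the form val HADOOP_VERSION = "1.0.4"; that exact format is an assumption about the build file and should be checked against your copy. The function name set_hadoop_version is hypothetical.

```shell
# Rewrite the HADOOP_VERSION value in a build definition file.
# Assumes a line of the form:  val HADOOP_VERSION = "1.0.4"
# The new version must not contain "/" because it is spliced into a sed expression.
set_hadoop_version() {
    build_file="$1"
    new_version="$2"
    # Refuse to continue if the expected line is not present.
    grep -q 'val HADOOP_VERSION = "' "$build_file" || return 1
    # Write to a temporary file first so that a failed sed leaves the original intact.
    sed "s/\(val HADOOP_VERSION = \"\)[^\"]*\"/\1${new_version}\"/" "$build_file" \
        > "$build_file.tmp" && mv "$build_file.tmp" "$build_file"
}
```

A call such as set_hadoop_version project/SparkBuild.scala 0.20.205.0 would be followed by the clean compile shown above. The version string here is only an example value.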

The sbt tool has made great progress with dependency resolution, but it's still strongly recommended that developers do a clean build rather than an incremental build, because sbt still doesn't get it quite right all the time.

Once you have started the build, it's probably a good time for a break, such as getting a cup of coffee. If you find it stuck on a single line that says "Resolving [XYZ]...." for a long time (say five minutes), stop it and rerun sbt/sbt package.

If you can live with the restrictions (such as the fixed HDFS version), using the pre-built binary will get you up and running far quicker. To run the pre-built version, use the following command:

wget http://www.spark-project.org/download-spark-0.7.0-prebuilt-tgz && tar -xvf download-spark-0.7.0-prebuilt-tgz && cd spark-0.7.0


Running Spark on EC2

There are many handy scripts to run Spark on EC2 in the ec2 directory. These scripts can be used to run multiple Spark clusters, and even to run on spot instances. Spark can also be run on Elastic MapReduce (EMR), Amazon's solution for MapReduce cluster management, which gives you more flexibility around scaling instances.

Running Spark on EC2 with the scripts

To get started, make sure that you have EC2 enabled on your account by signing up for it at https://portal.aws.amazon.com/gp/aws/manageYourAccount. It is a good idea to generate a separate access key pair for your Spark cluster, which you can do at https://portal.aws.amazon.com/gp/aws/securityCredentialsR. You will also need to create an EC2 key pair so that the Spark script can SSH to the launched machines; this can be done at https://console.aws.amazon.com/ec2/home by selecting Key Pairs under Network & Security. Remember that key pairs are created per region, so make sure you create your key pair in the same region in which you intend to run your Spark instances. Give it a name that you can remember (this chapter uses spark-keypair as its example key pair name), as you will need it for the scripts. You can also choose to upload your public SSH key instead of generating a new key. These keys are sensitive, so make sure that you keep them private. You also need to set your access key pair as the AWS_ACCESS_KEY and AWS_SECRET_KEY environment variables for the Amazon EC2 scripts:

chmod 400 spark-keypair.pem

export AWS_ACCESS_KEY="..."

export AWS_SECRET_KEY="..."
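Before launching anything, it can help to confirm that the three prerequisites above are in place. The following is a minimal sketch, not part of the Spark or Amazon scripts: the function name check_ec2_prereqs is hypothetical, and it only checks that the two variables are non-empty in the current shell and that the key file carries no group or other permissions, as chmod 400 would leave it.

```shell
# Check that the credentials and key file described above are in place.
# Prints one line per problem to stderr and returns non-zero if any check fails.
check_ec2_prereqs() {
    key_file="$1"
    status=0
    if [ -z "${AWS_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY is not set" >&2
        status=1
    fi
    if [ -z "${AWS_SECRET_KEY:-}" ]; then
        echo "AWS_SECRET_KEY is not set" >&2
        status=1
    fi
    if [ ! -f "$key_file" ]; then
        echo "key file not found: $key_file" >&2
        status=1
    else
        # Characters 5-10 of the ls mode string are the group and other bits.
        group_other="$(ls -lL "$key_file" | cut -c5-10)"
        if [ "$group_other" != "------" ]; then
            echo "key file is accessible to group or others: $key_file" >&2
            status=1
        fi
    fi
    return "$status"
}
```

With the example key pair name from this chapter, the call would be check_ec2_prereqs spark-keypair.pem.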

You will find it useful to download the EC2 scripts provided by Amazon from http://aws.amazon.com/developertools/Amazon-EC2/351. Once you unzip the resulting ZIP file, you can add the bin folder to your PATH variable in a similar manner to what you did with the Spark bin folder:

wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip

unzip ec2-api-tools.zip

cd ec2-api-tools-*

export EC2_HOME=`pwd`

export PATH=$PATH:`pwd`/bin
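After changing PATH, a quick check that the new commands actually resolve can save a confusing failure later. This is an illustrative sketch: the helper name missing_commands is hypothetical, and which commands to pass is up to you (for the EC2 API tools, a command such as ec2-describe-regions is a reasonable candidate, assuming your download includes it).

```shell
# Print the name of every argument that is not found as a command on PATH.
# Prints nothing when all of them resolve.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}
```

An empty result from missing_commands ec2-describe-regions indicates that the bin folder was added correctly.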
