Getting Started with Fast Data Processing in Spark: Installation and Cluster Configuration


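"Fast Data Processing with Spark" is a book by Holden Karau about efficient data processing with Apache Spark. Its first chapter covers how to install Spark and set up a cluster.

Apache Spark is an open-source computing framework for large-scale data processing, known for being efficient, easy to use, and well suited to complex data analysis. It provides a unified platform that supports batch processing, real-time stream processing, machine learning, and graph processing. Its core feature is in-memory computation: keeping data in memory makes processing much faster than the traditional Hadoop MapReduce model.

In the first chapter, "Installing Spark and Setting Up Your Cluster", the author may cover the following key topics:

1. **Spark components**: Spark Core (the core engine), Spark SQL (structured data processing), Spark Streaming (real-time processing), MLlib (machine learning library), and GraphX (graph processing library).
2. **Installing Spark**: the steps may include downloading a Spark distribution, configuring environment variables, choosing a suitable version (for example, one compatible with your Hadoop version), and installing dependencies such as Java and Scala.
3. **Cluster configuration**: how to set up a Spark Standalone cluster or run Spark on resource managers such as Hadoop YARN, Mesos, or Kubernetes, and how to edit configuration files such as `spark-defaults.conf` and the `slaves` file.
4. **Submitting jobs**: how to use the `spark-submit` command to submit Spark jobs to a cluster, along with parameter tuning and resource allocation.
5. **Development environment**: recommended IDEs such as IntelliJ IDEA and PyCharm, and how to configure a Spark project.
6. **Data sources and persistence**: how to read from and write to data sources such as HDFS, Cassandra, and HBase, and Spark's persistence strategies, such as the cache levels of RDDs (resilient distributed datasets).
7. **Monitoring and debugging**: how to use the Spark UI and the Spark History Server to monitor job execution, and how to troubleshoot common problems.
8. **Performance optimization**: tuning parallelism, partitioning strategies, optimizing wide dependencies, reducing shuffle operations, and memory management.
9. **Case studies**: practical examples of Spark in different scenarios, such as log analysis, recommendation systems, and image recognition.

The author, Holden Karau, is a software engineer from Canada who currently works at Google and has extensive experience and open-source contributions. Her deep understanding of Scala and big data processing makes the book a valuable resource for learning Spark. If you are interested in big data processing and Spark, the book is a good starting point. More information is available on the author's personal website, blog, and GitHub.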

Installing Spark and Setting Up Your Cluster


The tarball file contains a bin directory that needs to be added to your path, and SCALA_HOME should be set to the path where the tarball file is extracted. Scala can be installed from the tarball by running:

wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz && tar -xvf scala-2.9.3.tgz && cd scala-2.9.3 && export PATH=`pwd`/bin:$PATH && export SCALA_HOME=`pwd`

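You will probably want to add these to your .bashrc file or equivalent. In a startup file, `pwd` would expand to whatever directory the shell starts in, so spell out the extracted directory instead (/path/to/scala-2.9.3 is a placeholder):

export SCALA_HOME=/path/to/scala-2.9.3

export PATH=$SCALA_HOME/bin:$PATH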
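Persisting these settings could be sketched as a small helper that writes them to a startup file only once. The function name `persist_scala_env` and the marker comment are hypothetical, introduced here for illustration; the helper records an absolute directory rather than `pwd`, since `pwd` would be re-evaluated every time the startup file is sourced.

```shell
# Sketch: append Scala environment settings to a shell startup file, once.
# persist_scala_env is a hypothetical helper name, not part of Scala or Spark.
#   $1 = absolute path of the extracted Scala directory
#   $2 = startup file to update (defaults to ~/.bashrc)
persist_scala_env() {
  local scala_dir="$1"
  local rc_file="${2:-$HOME/.bashrc}"
  local marker="# scala environment (added by persist_scala_env)"

  # If the marker is already present, do nothing, so repeated runs
  # do not duplicate the export lines.
  if [ -f "$rc_file" ] && grep -qF "$marker" "$rc_file"; then
    return 0
  fi

  {
    echo "$marker"
    echo "export SCALA_HOME=\"$scala_dir\""
    # Single quotes keep $SCALA_HOME and $PATH unexpanded until the
    # startup file is actually sourced.
    echo 'export PATH="$SCALA_HOME/bin:$PATH"'
  } >> "$rc_file"
}
```

Run from inside the extracted directory, a call such as `persist_scala_env "$(pwd)"` would record the current location as a fixed path.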

Spark is built with sbt (simple build tool, which is no longer very simple), and build times can be quite long when compiling Scala's source code. Don't worry if you don't have sbt installed; the build script will download the correct version for you.

On an admittedly under-powered Core 2 laptop with an SSD, installing a fresh copy of Spark took about seven minutes. If you decide to build Version 0.7 from source, you would run:

wget http://www.spark-project.org/download-spark-0.7.0-sources-tgz && tar -xvf download-spark-0.7.0-sources-tgz && cd spark-0.7.0 && sbt/sbt package

If you are going to use a version of HDFS that doesn't match the default version for your Spark instance, you will need to edit project/SparkBuild.scala and set HADOOP_VERSION to the corresponding version and recompile it with:

sbt/sbt clean compile
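The edit to project/SparkBuild.scala can also be scripted. The sketch below assumes the file defines the setting on a single line of the form `val HADOOP_VERSION = "..."`; that layout is an assumption, so check your copy of the file before relying on it. The function name `set_hadoop_version` is hypothetical.

```shell
# Sketch: rewrite the HADOOP_VERSION setting in a Spark build file.
# Assumption: the file contains a line like
#     val HADOOP_VERSION = "1.0.4"
# set_hadoop_version is a hypothetical helper name.
#   $1 = Hadoop version string to build against
#   $2 = build file (defaults to project/SparkBuild.scala)
set_hadoop_version() {
  local version="$1"
  local build_file="${2:-project/SparkBuild.scala}"
  local tmp_file
  tmp_file="$(mktemp)"

  # Replace only the quoted value on the HADOOP_VERSION line. Writing to a
  # temporary file first avoids the GNU/BSD differences of sed -i.
  if sed "s/^\([[:space:]]*val HADOOP_VERSION = \)\"[^\"]*\"/\1\"$version\"/" \
      "$build_file" > "$tmp_file"; then
    # Copy the contents back so the original file keeps its permissions.
    cat "$tmp_file" > "$build_file"
    rm -f "$tmp_file"
  else
    rm -f "$tmp_file"
    return 1
  fi
}
```

From the Spark source directory, `set_hadoop_version 0.20.205.0` followed by `sbt/sbt clean compile` would then rebuild against that version; the version string here is only an example.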

The sbt tool has made great progress with dependency resolution, but it's still strongly recommended for developers to do a clean build rather than an incremental build, because incremental builds still don't get it quite right all the time.

Once you have started the build, it's probably a good time for a break, such as getting a cup of coffee. If you find it stuck on a single line that says "Resolving [XYZ]...." for a long time (say five minutes), stop it and restart sbt/sbt package.
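The stop-and-restart advice can be roughly automated. The sketch below is a coarser approximation: it limits the total run time of each attempt rather than detecting a stall on one "Resolving" line, and it assumes the `timeout` utility from GNU coreutils is available. The function name `retry_with_timeout` is hypothetical.

```shell
# Sketch: run a command under a time limit and retry it a few times.
# Assumption: GNU coreutils `timeout` is installed.
# retry_with_timeout is a hypothetical helper name.
#   $1 = time limit per attempt (any duration `timeout` accepts, e.g. 30m)
#   $2 = maximum number of attempts
#   remaining arguments = the command to run
retry_with_timeout() {
  local limit="$1"
  local attempts="$2"
  shift 2
  local n=1

  while [ "$n" -le "$attempts" ]; do
    # timeout kills the command if it runs longer than the limit and
    # then returns a non-zero status, which triggers the next attempt.
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed or timed out" >&2
    n=$((n + 1))
  done
  return 1
}
```

For example, `retry_with_timeout 30m 3 sbt/sbt package` would give each build attempt up to thirty minutes; the limit is an arbitrary choice and should comfortably exceed a normal build on your machine.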

If you can live with the restrictions (such as the fixed HDFS version), using the pre-built binary will get you up and running far quicker. To download and unpack the pre-built version, use the following command:

wget http://www.spark-project.org/download-spark-0.7.0-prebuilt-tgz && tar -xvf download-spark-0.7.0-prebuilt-tgz && cd spark-0.7.0

For More Information:

www.packtpub.com/fast-data-processing-with-spark/book
