Spark: The Definitive Guide (an ideal choice for data scientists and engineers)
"《Spark:权威指南》是一本专为数据科学家和数据工程师设计的深入指南,由Bill Chambers和Matei Zaharia共同编写,版权属于2018年的Databricks公司。这本书旨在帮助读者轻松理解并利用Apache Spark进行大数据处理。数据科学家和数据工程师虽然职责有所不同,但实际工作中这两者的界限常常模糊。数据科学家主要负责通过Spark进行交互式查询,以解答问题和构建统计模型,而数据工程师则关注编写可维护、可重复的生产应用程序,可能是为了在实践中应用数据科学家的模型,或者仅为后续分析(如构建数据导入管道)做准备。 书中详细介绍了Spark的基本概念、架构、API以及最佳实践。Spark以其高效的大数据处理能力、内存计算模型和分布式计算框架而闻名,特别适合实时分析和机器学习任务。作者们确保了内容不仅涵盖了基础操作,还深入探讨了高级特性,如DataFrame API、Spark SQL、Spark Streaming、MLlib(机器学习库)等,以及如何将Spark与其他技术(如Hadoop、Kafka等)无缝集成。 《Spark:权威指南》不仅仅是一本技术手册,它也强调了代码组织、性能优化和故障恢复等方面的重要性和策略。书中包含了大量的示例和实战项目,让读者能在实践中快速上手和提升技能。此外,该书还提供了及时的在线修订历史和错误报告链接,确保读者获取到最新、最准确的信息。 无论你是初入Spark的世界,还是希望深化理解或扩展你的技能,这本书都是不可或缺的资源。通过阅读《Spark:权威指南》,你将能全面掌握这个强大工具,推动你的数据分析和工程工作达到新的高度。"
Many applications require multiple passes over the data, and in MapReduce, each pass had to be
written as a separate MapReduce job, which had to be launched separately on the cluster and load the data from scratch.
To address this problem, the Spark team first designed an API based on functional programming
that could succinctly express multistep applications. The team then implemented this API over a
new engine that could perform efficient, in-memory data sharing across computation steps. The
team also began testing this system with both Berkeley and external users.
The first version of Spark supported only batch applications, but soon enough another
compelling use case became clear: interactive data science and ad hoc queries. By simply
plugging the Scala interpreter into Spark, the project could provide a highly usable interactive
system for running queries on hundreds of machines. The AMPlab also quickly built on this idea
to develop Shark, an engine that could run SQL queries over Spark and enable interactive use by
analysts as well as data scientists. Shark was first released in 2011.
After these initial releases, it quickly became clear that the most powerful additions to Spark
would be new libraries, and so the project began to follow the “standard library” approach it has
today. In particular, different AMPlab groups started MLlib, Spark Streaming, and GraphX.
They also ensured that these APIs would be highly interoperable, enabling writing end-to-end
big data applications in the same engine for the first time.
In 2013, the project had grown to widespread use, with more than 100 contributors from more
than 30 organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache
Software Foundation as a long-term, vendor-independent home for the project. The early
AMPlab team also launched a company, Databricks, to harden the project, joining the
community of other companies and organizations contributing to Spark. Since that time, the
Apache Spark community released Spark 1.0 in 2014 and Spark 2.0 in 2016, and continues to
make regular releases, bringing new features into the project.
Finally, Spark’s core idea of composable APIs has also been refined over time. Early versions of
Spark (before 1.0) largely defined this API in terms of functional operations—parallel operations
such as maps and reduces over collections of Java objects. Beginning with 1.0, the project added
Spark SQL, a new API for working with structured data—tables with a fixed data format that is
not tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations
across libraries and APIs by understanding both the data format and the user code that runs on it
in more detail. Over time, the project added a plethora of new APIs that build on this more
powerful structured foundation, including DataFrames, machine learning pipelines, and
Structured Streaming, a high-level, automatically optimized streaming API. In this book, we will
spend a significant amount of time explaining these next-generation APIs, most of which are
marked as production-ready.
The Present and Future of Spark
Spark has been around for a number of years but continues to gain in popularity and use cases.
Many new projects within the Spark ecosystem continue to push the boundaries of what’s
possible with the system. For example, a new high-level streaming engine, Structured Streaming,
was introduced in 2016. This technology is a huge part of companies solving massive-scale data
challenges, from technology companies like Uber and Netflix using Spark’s streaming and
machine learning tools, to institutions like NASA, CERN, and the Broad Institute of MIT and
Harvard applying Spark to scientific data analysis.
Spark will continue to be a cornerstone of companies doing big data analysis for the foreseeable
future, especially given that the project is still developing quickly. Any data scientist or engineer
who needs to solve big data problems probably needs a copy of Spark on their machine—and
hopefully, a copy of this book on their bookshelf!
Running Spark
This book contains an abundance of Spark-related code, and it’s essential that you’re prepared to
run it as you learn. For the most part, you’ll want to run the code interactively so that you can
experiment with it. Let’s go over some of your options before we begin working with the coding
parts of the book.
You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala and runs
on the Java Virtual Machine (JVM), so to run Spark either on your laptop or a cluster,
all you need is an installation of Java. If you want to use the Python API, you will also need a
Python interpreter (version 2.7 or later). If you want to use R, you will need a version of R on
your machine.
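If you are not sure what is already installed, a quick check from a terminal will tell you. The following is only a sketch of that kind of check, assuming java (and optionally python and R) are on your PATH:

java -version     # any recent JVM is enough to run Spark itself
python --version  # only needed if you plan to use the Python API
R --version       # only needed if you plan to use R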
There are two options we recommend for getting started with Spark: downloading and installing
Apache Spark on your laptop, or running a web-based version in Databricks Community Edition,
a free cloud environment for learning Spark that includes the code in this book. We explain both
of those options next.
Downloading Spark Locally
If you want to download and run Spark locally, the first step is to make sure that you have Java
installed on your machine (available as java), as well as a Python version if you would like to
use Python. Next, visit the project’s official download page, select the package type of “Pre-built
for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR
file, or tarball, that you will then need to extract. The majority of this book was written using
Spark 2.2, so downloading version 2.2 or later should be a good starting point.
Downloading Spark for a Hadoop cluster
Spark can run locally without any distributed storage system, such as Apache Hadoop. However,
if you would like to connect the Spark version on your laptop to a Hadoop cluster, make sure you
download the right Spark version for that Hadoop version, which can be chosen at
http://spark.apache.org/downloads.html by selecting a different package type. We discuss how
Spark runs on clusters and the Hadoop file system in later chapters, but at this point we
recommend just running Spark on your laptop to start out.
NOTE
In Spark 2.2, the developers also added the ability to install Spark for Python via pip install
pyspark. This functionality came out as this book was being written, so we weren’t able to include all
of the relevant instructions.
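If you do go the pip route, the installation itself is a single command. The following is only a sketch, assuming you have a working pip for your Python interpreter:

pip install pyspark   # installs Spark's Python package along with the pyspark launcher
pyspark               # opens the same interactive Python console described below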
Building Spark from source
We won’t cover this in the book, but you can also build and configure Spark from source. You
can select a source package on the Apache download page to get just the source and follow the
instructions in the README file for building.
After you’ve downloaded Spark, you’ll want to open a command-line prompt and extract the
package. In our case, we’re installing Spark 2.2. The following is a code snippet that you can run
on any Unix-style command line to extract the tarball you downloaded and move into the resulting
directory:
cd ~/Downloads
tar -xf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7   # enter the extracted directory, not the .tgz file
Note that Spark has a large number of directories and files within the project. Don’t be
intimidated! Most of these directories are relevant only if you’re reading source code. The next
section will cover the most important directories—the ones that let us launch Spark’s different
consoles for interactive use.
Launching Spark’s Interactive Consoles
You can start an interactive shell in Spark for several different programming languages. The
majority of this book is written with Python, Scala, and SQL in mind; thus, those are our
recommended starting points.
Launching the Python console
You’ll need Python 2 or 3 installed in order to launch the Python console. From Spark’s home
directory, run the following code:
./bin/pyspark
After you’ve done that, type “spark” and press Enter. You’ll see the SparkSession object printed,
which we cover in Chapter 2.
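To check that the console is working end to end, you can run a tiny computation against that SparkSession. The snippet below is only an illustrative sketch; the myRange name is our own choice, and spark is the SparkSession object the console creates for you:

# create a DataFrame with a single column named "number" containing 1,000 rows
myRange = spark.range(1000).toDF("number")
myRange.count()  # should print 1000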
Launching the Scala console
To launch the Scala console, you will need to run the following command:
./bin/spark-shell
After you’ve done that, type “spark” and press Enter. As in Python, you’ll see the SparkSession
object, which we cover in Chapter 2.
Launching the SQL console
Parts of this book will cover a large amount of Spark SQL. For those, you might want to start the
SQL console. We’ll revisit some of the more relevant details after we actually cover these topics
in the book.
./bin/spark-sql
Running Spark in the Cloud
If you would like to have a simple, interactive notebook experience for learning Spark, you
might prefer using Databricks Community Edition. Databricks, as we mentioned earlier, is a
company founded by the Berkeley team that started Spark, and offers a free community edition
of its cloud service as a learning environment. The Databricks Community Edition includes a
copy of all the data and code examples for this book, making it easy to quickly run any of them.
To use the Databricks Community Edition, follow the instructions at
https://github.com/databricks/Spark-The-Definitive-Guide. You will be able to use Scala,
Python, SQL, or R from a web browser–based interface to run and visualize results.
Data Used in This Book
We’ll use a number of data sources in this book for our examples. If you want to run the code
locally, you can download them from the official code repository for this book as described at
https://github.com/databricks/Spark-The-Definitive-Guide. In short, you will download the data,
put it in a folder, and then run the code snippets in this book!
Chapter 2. A Gentle Introduction to Spark
Now that our history lesson on Apache Spark is complete, it's time to begin using and applying
it! This chapter presents a gentle introduction to Spark, in which we will walk through the core
architecture of a cluster, Spark Application, and Spark’s structured APIs using DataFrames and
SQL. Along the way we will touch on Spark’s core terminology and concepts so that you can
begin using Spark right away. Let’s get started with some basic background information.
Spark’s Basic Architecture
Typically, when you think of a “computer,” you think about one machine sitting on your desk at
home or at work. This machine works perfectly well for watching movies or working with
spreadsheet software. However, as many users likely experience at some point, there are some
things that your computer is not powerful enough to perform. One particularly challenging area
is data processing. Single machines do not have enough power and resources to perform
computations on huge amounts of information (or the user probably does not have the time to
wait for the computation to finish). A cluster, or group, of computers pools the resources of
many machines together, giving us the ability to use all of those cumulative resources as if they were
a single computer. A group of machines alone is not powerful enough, though; you need a framework to
coordinate work across them. Spark does just that, managing and coordinating the execution of
tasks on data across a cluster of computers.
The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like
Spark’s standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to
these cluster managers, which will grant resources to our application so that we can complete our
work.
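To make "submitting an application" a bit more concrete, the bin directory of your Spark download also contains spark-submit, which accepts a --master argument naming the cluster manager to use. The line below is only a sketch that runs one of the Python examples shipped with the Spark distribution on your local machine rather than on a real cluster:

./bin/spark-submit --master "local[*]" ./examples/src/main/python/pi.py 10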
Spark Applications
Spark Applications consist of a driver process and a set of executor processes. The driver process
runs your main() function, sits on a node in the cluster, and is responsible for three things:
maintaining information about the Spark Application; responding to a user’s program or input;
and analyzing, distributing, and scheduling work across the executors (discussed momentarily).
The driver process is absolutely essential—it’s the heart of a Spark Application and maintains all
relevant information during the lifetime of the application.
The executors are responsible for actually carrying out the work that the driver assigns them.
This means that each executor is responsible for only two things: executing code assigned to it
by the driver, and reporting the state of the computation on that executor back to the driver node.
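To make that division of labor concrete, here is a minimal sketch of a standalone PySpark program (the application name and variable names are our own). The code outside the Spark operations runs in the driver process, while the actual scan and aggregation are carried out by executors; with a local master, the driver and executors simply share your machine:

from pyspark.sql import SparkSession

# The driver starts here: it builds the SparkSession and plans the work.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("driver-and-executors") \
    .getOrCreate()

# The driver only describes this computation; executors run it partition by partition
# and report their results back to the driver, which assembles the final answer.
total = spark.range(1000000).selectExpr("sum(id)").collect()[0][0]
print(total)

spark.stop()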