Mastering Spark: A Quick Guide to Big Data Analysis

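*Learning Spark*, written by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (the creator of Spark), explains how to install, configure, and apply Spark in an accessible way. It is a valuable guide for readers who want to learn Spark as a big data analysis framework. Spark itself is highly regarded in the big data field and is often described as a leading framework for building big data applications quickly.

Apache Spark is an open source cluster computing system designed to speed up data processing while simplifying the programming model. Its core features include distributed datasets, in-memory caching, and an interactive shell, which let developers write concise parallel code in Python, Java, or Scala and process large datasets efficiently. According to this description, the book covers the following key topics:

1. **Spark architecture basics**: explains Spark's basic components, such as Master and Worker nodes, RDDs (resilient distributed datasets), and the DataFrame/Dataset API, to help readers understand how Spark runs in a distributed environment.
2. **Installation and configuration**: covers installing Spark in various environments, including local mode, Standalone mode, Hadoop YARN, and Mesos, along with the related configuration options and best practices.
3. **Spark shell and interactive programming**: shows how to use the Spark shell for quick data exploration and testing, and how to use the Scala REPL for interactive data analysis.
4. **Distributed dataset operations**: details RDD operations, including transformations and actions, and how to combine them into complex parallel processing jobs.
5. **DataFrame and Dataset API**: discusses DataFrame and Dataset, two higher-level APIs that add SQL compatibility and, in the case of Dataset, compile-time type safety, simplifying data processing and analysis.
6. **Storage and persistence**: covers different ways of reading and writing data, including HDFS, Cassandra, and HBase, and how to use Spark's caching mechanism to improve performance.
7. **Batch and real-time stream processing**: explains how to use Spark SQL for batch processing and Spark Streaming for real-time data streams, enabling low-latency streaming applications.
8. **Machine learning and graph processing**: introduces MLlib for building machine learning models and GraphX for processing graph data and running graph algorithms.
9. **Performance tuning**: gives guidance on optimizing Spark jobs, including executor and driver memory settings and the level of parallelism.
10. **Case studies**: uses real applications, such as recommender systems and web log analysis, to show Spark at work in different domains and help readers turn theory into practice.

By reading *Learning Spark*, data scientists and engineers alike can get started with Spark quickly and put it to work on big data analysis. The book's examples and hands-on guidance help readers master Spark's core concepts and techniques.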

Our Java examples are written to work with Java version 6 and higher. Java 8 introduces a new syntax called lambdas that makes writing inline functions much easier, which can simplify Spark code. We have chosen not to take advantage of this syntax in most of our examples, as most organizations are not yet using Java 8. If you would like to try Java 8 syntax, you can see the Databricks blog post on this topic. Some of the examples will also be ported to Java 8 and posted to the book’s GitHub site.
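
To make the difference concrete, here is a minimal sketch that writes the same inline function twice: once as an anonymous inner class (the style that works on Java 6) and once as a Java 8 lambda. It uses no Spark classes so that it runs with the JDK alone. The `Fn` interface, the `mapAll` helper, and the class name are hypothetical stand-ins invented for this illustration; Spark's Java API has its own function interfaces, which are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-method interface, standing in for the kind of
// function object a Spark-style map() call would accept.
interface Fn<T, R> {
    R call(T value);
}

// Hypothetical demo class; none of these names come from the book.
final class LambdaSyntaxDemo {

    // Applies fn to every element and collects the results in order.
    static <T, R> List<R> mapAll(List<T> input, Fn<T, R> fn) {
        List<R> out = new ArrayList<R>();
        for (T item : input) {
            out.add(fn.call(item));
        }
        return out;
    }

    // Pre-Java 8 style: the inline function is an anonymous inner class.
    static List<Integer> lengthsAnonymous(List<String> lines) {
        return mapAll(lines, new Fn<String, Integer>() {
            @Override
            public Integer call(String line) {
                return line.length();
            }
        });
    }

    // Java 8 style: the same function written as a lambda expression.
    static List<Integer> lengthsLambda(List<String> lines) {
        return mapAll(lines, line -> line.length());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark", "rdd", "lambda");
        System.out.println(lengthsAnonymous(lines)); // prints [5, 3, 6]
        System.out.println(lengthsLambda(lines));    // prints [5, 3, 6]
    }
}
```

Both methods return the same result; the lambda version simply drops the class and method scaffolding. The lambda form compiles only on Java 8 or later, which is the trade-off the paragraph above describes.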

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (O’Reilly). Copyright 2015 Databricks, 978-1-449-35862-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.

Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan


A Unified Stack

The Spark project contains multiple closely integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project.

A philosophy of tight integration has several benefits. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. For example, when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one. These costs include deployment, maintenance, testing, support, and others. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.

Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team has to maintain only one system.

Here we will briefly introduce each of Spark’s components, shown in Figure 1-1.

