Learning Apache Spark 2.0 in Depth: A Crash Course in Big Data Processing

Updated 2024-07-19 · PDF, 12.13 MB

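"Learning Apache Spark 2: Process big data with the speed of light! Written by Muhammad Asif Abbasi and published by Packt Publishing. The book gives a comprehensive introduction to the key components Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX, and is a classic English-language tutorial on Spark 2.0."

In big data processing, Apache Spark has become an indispensable tool, and it performs especially well in real-time analytics and complex computation. Learning Apache Spark 2 explains the core concepts and techniques of Spark 2.0 in an accessible way, with the aim of helping readers quickly master this powerful distributed computing framework.

1. **Spark Core**: As the foundation of Spark, Spark Core provides distributed task scheduling, memory management, fault recovery, and interoperability. The book explains in detail how to create and run Spark applications, what an RDD (resilient distributed dataset) is, and how to optimize memory usage and task scheduling.
2. **Spark SQL**: Spark SQL is the part of Spark used for structured data processing. It integrates SQL queries with the DataFrame API, so developers can process data with either SQL or the DataFrame API. The Spark SQL chapter covers creating, transforming, and querying DataFrames, as well as integrating SQL with Hive for large-scale data warehouse processing.
3. **Spark Streaming**: Spark Streaming provides a high-level abstraction for processing real-time data streams. It breaks a stream into small batch jobs, which lets Spark reuse its core functionality for stream processing. The book covers how to set up and operate on DStreams (discretized streams), and how to handle windowed data and state management.
4. **MLlib**: MLlib, Spark's machine learning library, includes common machine learning algorithms such as classification, regression, clustering, and collaborative filtering, along with tools for model selection and evaluation. The book discusses how to use these algorithms and how to build and tune machine learning pipelines.
5. **GraphX**: GraphX is Spark's graph processing library. It provides an API for creating, manipulating, and analyzing graph data, so developers can work with complex network data such as social networks and recommender systems. The book covers graph representations, implementing graph algorithms, and combining GraphX with other Spark components.

In addition, the book covers deploying Spark in different cluster environments, including local mode, Standalone mode, YARN, and Mesos, and discusses performance tuning strategies and best practices. Readers come away with an understanding of Spark's fundamentals as well as hands-on and project experience, making them more effective at big data processing.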

    Importing relevant libraries
    Defining the schema for ratings
    Defining the schema for movies
    Loading ratings and movies data
    Data partitioning
    Training an ALS model
    Predicting the test dataset
    Evaluating model performance
    Using implicit preferences
    Sanity checking
    Model Deployment
  References
  Summary
Chapter 10: Customer Churn Prediction
  Overview of customer churn
  Why is predicting customer churn important?
  How do we predict customer churn with Spark?
  Data set description
  Code example
  Defining schema
  Loading data
  Data exploration
    PySpark import code
    Exploring international minutes
    Exploring night minutes
    Exploring day minutes
    Exploring eve minutes
  Comparing minutes data for churners and non-churners
  Comparing charge data for churners and non-churners
  Exploring customer service calls
  Scala code – constructing a scatter plot
    Exploring the churn variable
  Data transformation
  Building a machine learning pipeline
  References
  Summary
Appendix: There's More with Spark
  Performance tuning
  Data serialization
  Memory tuning
    Execution and storage
    Tasks running in parallel

https://www.iteblog.com

Preface


2004-2006: Google published papers on the Google File System (GFS) (2003) and MapReduce (2004), demonstrating that the backbone of its search engine was resilient to failures and almost linearly scalable. Doug Cutting took particular interest in this development, as he could see that the GFS and MapReduce papers directly addressed Nutch's shortcomings. He added a MapReduce implementation to Nutch, which ran on 20 nodes and was much easier to program. Of course, we are talking in comparative terms here.

2006-2008: In 2006 Cutting went to work at Yahoo, which had lost the search crown to Google and was equally impressed by the GFS and MapReduce papers. The storage and processing parts of Nutch were spun out to form a separate project named Hadoop under the Apache Software Foundation (ASF), whereas the Nutch web crawler remained a separate project. Hadoop became a top-level Apache project in 2008. On February 19, 2008, Yahoo announced that its search index was running on a Hadoop cluster with more than 10,000 cores (truly an amazing feat).

We haven't forgotten about the proprietary database vendors. The majority of them didn't expect Hadoop to change anything for them, as database vendors typically focused on relational data, which was smaller in volume but higher in value. I was talking to the CTO of a major database vendor (who will remain unnamed) and discussing this new and increasingly popular elephant (Hadoop, of course! Thanks to Doug Cutting's son for choosing a sane name. I mean, he could have chosen anything else, and you know how kids name things these days...). The CTO was quite adamant that the real value was in the relational data, which was the bread and butter of his company, and that although unstructured data had huge volumes, it had less business value. This was something of an 80-20 rule for data: from a size perspective, unstructured data was 4 times the size of structured data (80-20), whereas the structured data had 4 times the value of the unstructured data. I would say that the relational database vendors massively underestimated the value of unstructured data back then.

Anyway, back to Hadoop. After the announcement by Yahoo, a lot of companies wanted a piece of the action. They realised something big was about to happen in the data space. Lots of interesting use cases started to appear in the Hadoop space, and MapReduce, the de facto compute engine on Hadoop, wasn't able to meet all those expectations.

The MapReduce Conundrum: The original Hadoop comprised primarily HDFS and MapReduce as a compute engine. The original use case of web-scale search meant that the architecture was primarily aimed at long-running batch jobs (typically single-pass jobs without iterations), such as indexing web pages. The core requirements of such a framework were scalability and fault tolerance, as you don't want to restart a job that has been running for 3 days and has completed 95% of its work. Furthermore, the objective of MapReduce was to target acyclic data flows.


A typical MapReduce program is composed of a Map() operation and optionally a Reduce() operation, and any workload had to be converted to the MapReduce paradigm before you could get the benefit of Hadoop. Not only that, the majority of other open source projects on Hadoop also used MapReduce to perform computation. For example, Hive and Pig Latin both generated MapReduce jobs to operate on big data sets. The problem with the architecture of MapReduce was that the output data from each step of a job had to be stored in a distributed file system before the next step could begin. This meant that each iteration had to reload the data from disk, incurring a significant performance penalty.

Furthermore, while typically designed for batch jobs, Hadoop has often been used for exploratory analysis through SQL-like interfaces such as Pig and Hive. Each query incurs significant latency due to the initial MapReduce job setup and the initial data read, which often means longer wait times for users.
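
The reload penalty described above can be illustrated with a toy, single-process sketch in plain Python. It is not Hadoop or Spark code: the JSON file stands in for data on a distributed store, and the names `load_dataset`, `gradient_step`, `mapreduce_style`, and `in_memory_style` are hypothetical helpers invented for this illustration. The sketch models only the repeated input read, assuming an iterative job that refines an estimate of the mean over several passes.

```python
import json
import os
import tempfile


def write_dataset(path, values):
    """Persist the input once; the file stands in for a distributed store."""
    with open(path, "w") as f:
        json.dump(values, f)


def load_dataset(path, io_counter):
    """Read the whole dataset from disk and record that a read happened."""
    io_counter["reads"] += 1
    with open(path) as f:
        return json.load(f)


def gradient_step(values, estimate, lr=0.5):
    """One pass over the data: move the estimate toward the mean."""
    gradient = sum(estimate - v for v in values) / len(values)
    return estimate - lr * gradient


def mapreduce_style(path, iterations):
    """Each iteration behaves like a separate job that re-reads its input."""
    io = {"reads": 0}
    estimate = 0.0
    for _ in range(iterations):
        values = load_dataset(path, io)  # reloaded from disk on every step
        estimate = gradient_step(values, estimate)
    return estimate, io["reads"]


def in_memory_style(path, iterations):
    """Load once, keep the working set in memory, then iterate over it."""
    io = {"reads": 0}
    values = load_dataset(path, io)  # single read
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(values, estimate)
    return estimate, io["reads"]


def demo(iterations=10):
    """Run both styles on the same small dataset and return their results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.json")
        write_dataset(path, [1.0, 2.0, 3.0, 4.0])
        return mapreduce_style(path, iterations), in_memory_style(path, iterations)


if __name__ == "__main__":
    (mr_estimate, mr_reads), (mem_estimate, mem_reads) = demo()
    print("reload every iteration:", mr_estimate, "disk reads:", mr_reads)
    print("load once, iterate:    ", mem_estimate, "disk reads:", mem_reads)
```

Both functions arrive at the same estimate; the difference is that the first touches the disk once per iteration while the second touches it once in total. On a real cluster each of those reads is a pass over a large distributed dataset, so the gap widens with the number of iterations.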

Beginning of Spark: In June 2010, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica published a paper proposing a framework that could outperform Hadoop by a factor of 10 in iterative machine learning jobs. That framework is now known as Spark. The paper aimed to address two major inadequacies of the Hadoop/MapReduce framework:

Iterative jobs

Interactive analysis

The idea that you could plug the gaps of MapReduce for iterative and interactive analysis while maintaining its scalability and resilience meant that the platform could be used across a wide variety of use cases.

This created huge interest in Spark, particularly among communities of users who had become frustrated with the relatively slow response of MapReduce, especially for interactive queries. In 2015 Spark became the most active open source project in big data, and it gained many new features and improvements over the course of the project. The community grew almost 300%, with attendance at Spark Summit increasing from just 1,100 in 2014 to almost 4,000 in 2015. The number of meetup groups grew by a factor of 4, and the number of contributors to the project increased from just over 100 in 2013 to 600 in 2015.

Spark is today the hottest technology for big data analytics. Numerous benchmarks have confirmed that it is the fastest engine out there. If you go to any big data conference, be it Strata + Hadoop World or Hadoop Summit, Spark is considered to be the technology of the future.
