Spark 2.0 for Data Science: An In-Depth Look at Machine Learning and Core Architecture

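"Spark for Data Science is a book about using Spark for data science analysis. It covers Spark's basic architecture and core concepts, concentrates on Spark's applications in machine learning, and targets Spark 2.0. The book is co-authored by Srinivas Duvvuri and Bikramaditya Singhal and published by Packt Publishing."

In the big data field, Apache Spark has become a leading tool for processing large datasets, particularly in data science and machine learning. Its core advantage is in-memory computation: keeping working data in memory avoids repeated disk I/O between processing steps, and on some workloads Spark has been reported to run up to 100 times faster than the traditional Hadoop MapReduce model.

Spark's basic architecture consists of the following main components:

1. **Spark Core**: the foundation of Spark, providing distributed task scheduling, memory management, and fault recovery.
2. **Spark SQL**: combines SQL queries with the DataFrame API, so users can operate on structured data with either SQL or DataFrames.
3. **Spark Streaming**: processes real-time data streams, achieving low-latency processing through micro-batching.
4. **MLlib**: Spark's machine learning library, with algorithms for classification, regression, clustering, and collaborative filtering, plus tools for model evaluation and feature selection.
5. **GraphX**: handles graph data and graph computation, supporting graph analysis and graph algorithms.

In data science, Spark is mainly applied in the following areas:

1. **Data preprocessing**: Spark makes it convenient to clean, transform, and normalize data in preparation for modeling.
2. **Model training**: MLlib provides models such as linear regression, logistic regression, decision trees, random forests, and gradient-boosted trees, which can be trained quickly on large datasets.
3. **Model validation and tuning**: Spark supports cross-validation and grid search to help select the best model parameters.
4. **Prediction and deployment**: a trained model can be applied to new data to make predictions and can be deployed to production through a model-serving system or other means.

Spark for Data Science explains these Spark technologies in depth and demonstrates, through worked examples, how to use Spark for data analysis and machine learning projects. The authors share the experience and techniques they have gained from working with Spark, helping readers understand how Spark works and analyze data more efficiently.

Although the book aims to be accurate, the technology evolves quickly and some information may have changed. In practice, readers should consult the latest official documentation and community resources for the most accurate information. When using any technology or tool, be aware of the potential risks and responsibilities, comply with applicable laws and regulations, and respect data privacy.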
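
The model validation and tuning step mentioned above (cross-validation combined with grid search) can be illustrated with a minimal plain-Python sketch. It is not Spark code: the model (a one-dimensional ridge regression through the origin), the function names, and the data are all assumptions chosen for illustration. In Spark itself this role is played by classes such as `CrossValidator` and `ParamGridBuilder` in `pyspark.ml.tuning`.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_slope(xs, ys, lam):
    """Closed-form 1-D ridge regression through the origin: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return mean((y - w * x) ** 2 for x, y in zip(xs, ys))


def grid_search_cv(xs, ys, lambdas, k=3):
    """Return (best_lambda, scores); scores maps each lambda to its mean validation MSE."""
    folds = k_fold_indices(len(xs), k)
    scores = {}
    for lam in lambdas:                      # grid search: try every candidate value
        fold_errors = []
        for held_out in folds:               # cross-validation: hold out one fold at a time
            held = set(held_out)
            train_x = [x for i, x in enumerate(xs) if i not in held]
            train_y = [y for i, y in enumerate(ys) if i not in held]
            w = fit_ridge_slope(train_x, train_y, lam)
            fold_errors.append(
                mse(w, [xs[i] for i in held_out], [ys[i] for i in held_out])
            )
        scores[lam] = mean(fold_errors)
    best = min(scores, key=scores.get)       # lowest average validation error wins
    return best, scores


# Hypothetical noiseless data following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_lambda, scores = grid_search_cv(xs, ys, lambdas=[0.0, 1.0, 10.0], k=3)
print(best_lambda, scores)
```

Because the toy data is noiseless, the unregularized candidate fits every fold exactly and is selected; with noisy data a nonzero penalty could win instead.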

Chapter 8: Analyzing Unstructured Data (continued)
    Summary
    References

Chapter 9: Visualizing Big Data
    Why visualize data?
    A data engineer's perspective
    A data scientist's perspective
    A business user's perspective
    Data visualization tools
    IPython notebook
    Apache Zeppelin
    Third-party tools
    Data visualization techniques
    Summarizing and visualizing
    Subsetting and visualizing
    Sampling and visualizing
    Modeling and visualizing
    Summary
    References
    Data source citations

Chapter 10: Putting It All Together
    A quick recap
    Introducing a case study
    The business problem
    Data acquisition and data cleansing
    Developing the hypothesis
    Data exploration
    Data preparation
    Too many levels in a categorical variable
    Numerical variables with too much variation
    Missing data
    Continuous data
    Categorical data
    Preparing the data
    Model building
    Data visualization
    Communicating the results to business users
    Summary
    References

Chapter 11: Building Data Science Applications

Preface

In this smart age, data analytics is the key to sustaining and promoting business growth. Every business is trying to leverage its data as much as possible, using all sorts of data science tools and techniques to progress along the analytics maturity curve. This sudden rise in data science requirements is the obvious reason for the scarcity of data scientists. It is very difficult to meet the market demand with unicorn data scientists who are experts in statistics, machine learning, and mathematical modelling as well as programming.

The availability of unicorn data scientists will only decrease as market demand increases. A solution was therefore needed that not only empowers unicorn data scientists to do more, but also creates what Gartner calls “Citizen Data Scientists”. Citizen data scientists are developers, analysts, BI professionals, or other technologists whose primary job function is outside statistics or analytics but who are passionate enough to learn data science. They are becoming the key enablers in democratizing data analytics across organizations and industries as a whole.

There is an ever-growing plethora of tools and techniques designed to facilitate big data analytics at scale. This book is an attempt to create citizen data scientists who can leverage Apache Spark’s distributed computing platform for data analytics.

This book is a practical guide to learning statistical analysis and machine learning in order to build scalable data products. It helps you master the core concepts of data science and of Apache Spark so that you can jump-start any real-life data analytics project. Throughout the book, all chapters are supported by examples that can be executed on a home computer, so that readers can easily follow and absorb the concepts. Every chapter attempts to be self-contained, so the reader can start from any chapter, with pointers to relevant chapters for details. While the chapters start from the basics for a beginner to learn and comprehend, they are comprehensive enough for senior architects as well.

What this book covers

Chapter 1, Big Data and Data Science – An Introduction, briefly discusses the various challenges in big data analytics and how Apache Spark solves those problems on a single platform. It also explains how data analytics has evolved to what it is now and gives a basic idea of the Spark stack.

Chapter 2, The Spark Programming Model, covers the design considerations of Apache Spark and the supported programming languages. It also explains the Spark core components and covers in detail the RDD API, which is the basic building block of Spark.
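
As a rough illustration of the programming model behind the RDD API, the following plain-Python sketch mimics how RDD transformations (`map`, `filter`) are recorded lazily and only executed when an action (`collect`, `reduce`) is called. `ToyRDD` is a hypothetical single-machine stand-in written for this note; it is not Spark's implementation and omits partitioning, distribution, and fault tolerance.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new ToyRDD, runs nothing yet.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new ToyRDD, runs nothing yet.
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        # Chain the recorded steps over the source as lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: materialize all results as a list.
        return list(self._iterate())

    def reduce(self, fn):
        # Action: fold the results into a single value.
        return reduce(fn, self._iterate())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)   # still empty: no work has been done

result = even_squares.collect()
total = even_squares.reduce(lambda a, b: a + b)
print(calls_before_action, result, total)
```

Each action replays the recorded steps from the source, which loosely mirrors how an uncached RDD is recomputed from its lineage.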

Chapter 3, Introduction to DataFrames, explains DataFrames, the handiest and most useful component for data scientists to work with. It explains Spark SQL and the Catalyst optimizer that powers DataFrames, and demonstrates various DataFrame operations with code examples.

Chapter 4, Unified Data Access, covers the various ways to source data from different sources, consolidate it, and work with it in a unified way. It covers the streaming aspect of collecting real-time data and operating on it, and also discusses the under-the-hood fundamentals of these APIs.

Chapter 5, Data Analysis on Spark, discusses the complete data analytics lifecycle. With ample code examples, it explains how to source data from different sources, prepare the data using data cleaning and transformation techniques, and perform descriptive and inferential statistics to uncover hidden insights in the data.

Chapter 6, Machine Learning, explains various machine learning algorithms, how they are implemented in the MLlib library, and how they can be used with the pipeline API for streamlined execution. It covers the fundamentals of every algorithm it presents, so it can serve as a one-stop reference.
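
To give a feel for what a pipeline API streamlines, here is a minimal plain-Python sketch of the fit/transform chaining idea: each stage is fitted on the output of the previous stage, and the fitted stages are then applied in the same order to new data. The classes `Scaler`, `ThresholdClassifier`, and `ToyPipeline` are hypothetical names invented for this sketch, not MLlib classes; Spark's own pipeline abstractions live in `pyspark.ml`.

```python
class Scaler:
    """Stage that learns min/max of x in fit() and rescales x to [0, 1] in transform()."""

    def fit(self, rows):
        values = [r["x"] for r in rows]
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, rows):
        span = (self.hi - self.lo) or 1.0   # avoid division by zero on constant data
        return [{**r, "x_scaled": (r["x"] - self.lo) / span} for r in rows]


class ThresholdClassifier:
    """Stage that learns a cutoff halfway between the two class means of x_scaled."""

    def fit(self, rows):
        pos = [r["x_scaled"] for r in rows if r["label"] == 1]
        neg = [r["x_scaled"] for r in rows if r["label"] == 0]
        self.cutoff = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def transform(self, rows):
        return [{**r, "prediction": int(r["x_scaled"] > self.cutoff)} for r in rows]


class ToyPipeline:
    """Chains stages: fit() fits each stage on the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Hypothetical training data: small x values are class 0, large ones class 1.
train = [
    {"x": 1.0, "label": 0},
    {"x": 2.0, "label": 0},
    {"x": 8.0, "label": 1},
    {"x": 9.0, "label": 1},
]
model = ToyPipeline([Scaler(), ThresholdClassifier()]).fit(train)
predictions = [r["prediction"] for r in model.transform([{"x": 1.5}, {"x": 8.5}])]
print(predictions)
```

Bundling the stages this way ensures that new data goes through exactly the same preprocessing that the model saw during training.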

Chapter 7, Extending Spark with SparkR, is primarily intended for R programmers who want to leverage Spark for data analytics. It explains how to program with SparkR and how to use the machine learning algorithms of R libraries.

Chapter 8, Analyzing Unstructured Data, focuses exclusively on unstructured data analysis. It explains how to source unstructured data, process it, and perform machine learning on it. It also covers some dimension reduction techniques that were not covered in the “Machine Learning” chapter.

Chapter 9, Visualizing Big Data, presents various visualization techniques that are supported on Spark. It explains the different visualization requirements of data engineers, data scientists, and business users, and suggests the right kinds of tools and techniques for each. It also discusses leveraging IPython/Jupyter notebooks and Zeppelin, an Apache project, for data visualization.
