Spark 2.0 for Data Science: An In-Depth Exploration of Machine Learning

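Updated 2024-07-18. PDF, 19.65 MB.

Spark is an open-source big data processing framework that is popular in data science for its efficiency, ease of use, and scalability. The book Spark for Data Science aims to help readers use Spark 2.0 for data analysis and explore machine learning in depth. For data science, Spark provides a rich toolset: Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for real-time stream processing. These components make Spark a powerful tool for data scientists, able to handle workloads ranging from batch data to real-time streams.

Spark's core feature is its in-memory computing model, which speeds up data processing significantly, especially for iterative algorithms and for data exploration that requires frequent interaction. Spark also supports several programming languages (Python, Java, Scala, and R), so team members with different backgrounds can use it comfortably. For machine learning, the MLlib library includes algorithms for classification, regression, clustering, and collaborative filtering, along with tools for model evaluation and tuning. Through Spark's APIs, users can build and train models and make high-performance predictions on large datasets.

The book likely covers basic Spark operations such as creating DataFrames, cleaning and transforming data, and querying data with Spark SQL. It also explains in depth how to use MLlib for supervised and unsupervised learning, including practical techniques for model selection, feature engineering, and hyperparameter tuning. Readers may additionally learn how to deploy a Spark cluster, for example on Apache Mesos, on Hadoop YARN, or in standalone mode, and how to use interactive environments such as Jupyter Notebook for data science experiments.

On the applied side, the authors may also discuss how to integrate Spark into the workflow of a data science project, including data import, preprocessing, modeling, validation, and model deployment. The book's examples and exercises help readers build the skills to solve real problems.

Spark for Data Science is a practical guide for data scientists and readers interested in big data analytics, offering a platform for understanding Spark in depth and applying it to data science work. By working through the book, readers can master Spark's technical details and learn how to apply it to data-driven decision making and innovation.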

Summary
References

Chapter 9: Visualizing Big Data
Why visualize data?
A data engineer's perspective
A data scientist's perspective
A business user's perspective
Data visualization tools
IPython notebook
Apache Zeppelin
Third-party tools
Data visualization techniques
Summarizing and visualizing
Subsetting and visualizing
Sampling and visualizing
Modeling and visualizing
Summary
References
Data source citations

Chapter 10: Putting It All Together
A quick recap
Introducing a case study
The business problem
Data acquisition and data cleansing
Developing the hypothesis
Data exploration
Data preparation
Too many levels in a categorical variable
Numerical variables with too much variation
Missing data
Continuous data
Categorical data
Preparing the data
Model building
Data visualization
Communicating the results to business users
Summary
References

Chapter 11: Building Data Science Applications

https://www.iteblog.com

Preface

In this smart age, data analytics is the key to sustaining and promoting business growth. Every business is trying to leverage its data as much as possible, using all sorts of data science tools and techniques to progress along the analytics maturity curve. This sudden rise in data science requirements is the obvious reason for the scarcity of data scientists. It is very difficult to meet market demand with unicorn data scientists who are experts in statistics, machine learning, and mathematical modelling as well as programming.

The availability of unicorn data scientists will only decrease as market demand increases, and that trend will continue. A solution was therefore needed that not only empowers unicorn data scientists to do more, but also creates what Gartner calls “Citizen Data Scientists”. Citizen data scientists are the developers, analysts, BI professionals, and other technologists whose primary job function lies outside statistics or analytics but who are passionate enough to learn data science. They are becoming the key enablers in democratizing data analytics across organizations and industries as a whole.

There is an ever-growing plethora of tools and techniques designed to facilitate big data analytics at scale. This book is an attempt to create citizen data scientists who can leverage Apache Spark’s distributed computing platform for data analytics.

This book is a practical guide to learning statistical analysis and machine learning in order to build scalable data products. It helps you master the core concepts of data science and of Apache Spark so that you can jump-start any real-life data analytics project. Every chapter is supported by sufficient examples, which can be executed on a home computer, so that readers can easily follow and absorb the concepts. Every chapter attempts to be self-contained, so that the reader can start from any chapter, with pointers to relevant chapters for details. While the chapters start from the basics so that a beginner can learn and comprehend them, they are also comprehensive enough for a senior architect.

What this book covers

Chapter 1, Big Data and Data Science – An Introduction, briefly discusses the various challenges in big data analytics and how Apache Spark solves those problems on a single platform. It also explains how data analytics has evolved into what it is today and gives a basic overview of the Spark stack.

Chapter 2, The Spark Programming Model, covers the design considerations of Apache Spark and the supported programming languages. It also explains the Spark core components and covers in detail the RDD API, which is the basic building block of Spark.

Chapter 3, Introduction to DataFrames, explains DataFrames, the handiest and most useful component for data scientists to work with. It explains Spark SQL and the Catalyst optimizer that powers DataFrames, and demonstrates various DataFrame operations with code examples.

Chapter 4, Unified Data Access, covers the various ways to source data from different sources, consolidate it, and work with it in a unified way. It covers the streaming aspect of collecting real-time data and operating on it, and discusses the under-the-hood fundamentals of these APIs.

Chapter 5, Data Analysis on Spark, discusses the complete data analytics lifecycle. With ample code examples, it explains how to source data from different sources, prepare the data using data cleaning and transformation techniques, and perform descriptive and inferential statistics to uncover hidden insights in the data.

Chapter 6, Machine Learning, explains various machine learning algorithms, how they are implemented in the MLlib library, and how they can be used with the pipeline API for streamlined execution. It covers the fundamentals of every algorithm it discusses, so it can serve as a one-stop reference.

Chapter 7, Extending Spark with SparkR, is intended primarily for R programmers who want to leverage Spark for data analytics. It explains how to program with SparkR and how to use the machine learning algorithms of R libraries.

Chapter 8, Analyzing Unstructured Data, focuses exclusively on unstructured data analysis. It explains how to source unstructured data, process it, and perform machine learning on it. It also covers some dimensionality reduction techniques that were not covered in the “Machine Learning” chapter.

Chapter 9, Visualizing Big Data, teaches various visualization techniques that are supported on Spark. It explains the different visualization requirements of data engineers, data scientists, and business users, and suggests the right kinds of tools and techniques for each. It also discusses using the IPython/Jupyter notebook and Zeppelin, an Apache project, for data visualization.
