Spark 2.0 for Data Science: An In-Depth Guide with Machine Learning Applications

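Spark for Data Science is a guide to data analysis and machine learning with Apache Spark 2.0, the latest Spark release at the time the book was written. It was written by Srinivas Duvvuri and Bikramaditya Singhal and published by Packt Publishing. The book is protected by copyright and may not be reproduced, stored, or transmitted without permission. The authors made every effort to ensure the accuracy of the information, but the book is sold without warranty: no liability is accepted for direct or indirect damages caused by the information it contains.

Apache Spark is an open-source distributed computing framework well suited to large-scale data processing and real-time analytics, and Spark 2.0 is a major release of it. The book guides readers through efficient data processing with Spark, including data cleaning, transformation, modeling, and deploying machine learning models. Spark's strengths are fault tolerance, in-memory computation, and stream processing, which allow fast, low-latency responses on very large datasets. The book covers Spark SQL for data manipulation, DataFrames and Datasets as the core data structures, and Spark MLlib for machine learning tasks such as classification, regression, clustering, and collaborative filtering. Spark Streaming and GraphX are also introduced to show how Spark applies to real-time stream processing and graph computation.

Spark executes jobs on a DAG (directed acyclic graph) engine, which makes task parallelization and optimization more efficient. Readers learn the fundamentals of Spark as well as how to tune performance and work more productively in real projects. Although the authors aimed to provide current and accurate information, readers should keep track of later developments, since subsequent Spark releases may introduce new features and improvements. Likewise, the trademark information cited in the book may be out of date or incomplete, and readers should verify it themselves.

Spark for Data Science is a practical handbook for studying Spark in depth and applying it. It is aimed at data scientists, developers, and other professionals interested in big data processing who want to stay competitive in a data-driven world.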

Summary
References

Chapter 9: Visualizing Big Data
Why visualize data?
A data engineer's perspective
A data scientist's perspective
A business user's perspective
Data visualization tools
IPython notebook
Apache Zeppelin
Third-party tools
Data visualization techniques
Summarizing and visualizing
Subsetting and visualizing
Sampling and visualizing
Modeling and visualizing
Summary
References
Data source citations

Chapter 10: Putting It All Together
A quick recap
Introducing a case study
The business problem
Data acquisition and data cleansing
Developing the hypothesis
Data exploration
Data preparation
Too many levels in a categorical variable
Numerical variables with too much variation
Missing data
Continuous data
Categorical data
Preparing the data
Model building
Data visualization
Communicating the results to business users
Summary
References

Chapter 11: Building Data Science Applications

Preface

In this smart age, data analytics is the key to sustaining and promoting business growth. Every business is trying to leverage its data as much as possible, using all sorts of data science tools and techniques, to progress along the analytics maturity curve. This sudden rise in data science requirements is the obvious reason for the scarcity of data scientists. It is very difficult to meet the market demand with unicorn data scientists who are experts in statistics, machine learning, mathematical modelling, and programming.

The availability of unicorn data scientists is only going to decrease as market demand increases, and this will continue to be so. A solution was therefore needed that not only empowers unicorn data scientists to do more, but also creates what Gartner calls "Citizen Data Scientists". Citizen data scientists are developers, analysts, BI professionals, or other technologists whose primary job function is outside statistics or analytics but who are passionate enough to learn data science. They are becoming the key enablers in democratizing data analytics across organizations and industries as a whole.

There is an ever-growing plethora of tools and techniques designed to facilitate big data analytics at scale. This book is an attempt to create citizen data scientists who can leverage Apache Spark's distributed computing platform for data analytics.

This book is a practical guide to learning statistical analysis and machine learning in order to build scalable data products. It helps you master the core concepts of data science and of Apache Spark so that you can jump-start any real-life data analytics project. Throughout the book, the chapters are supported by sufficient examples, which can be executed on a home computer, so that readers can easily follow and absorb the concepts. Every chapter attempts to be self-contained so that the reader can start from any chapter, with pointers to relevant chapters for details. While the chapters start from the basics for a beginner to learn and comprehend, they are comprehensive enough for senior architects as well.

What this book covers

Chapter 1, Big Data and Data Science – An Introduction, briefly discusses the various challenges in big data analytics and how Apache Spark solves those problems on a single platform. It also explains how data analytics has evolved to what it is now and gives a basic idea of the Spark stack.

Chapter 2, The Spark Programming Model, talks about the design considerations of Apache Spark and the supported programming languages. It also explains the Spark core components and covers in detail the RDD API, which is the basic building block of Spark.

Chapter 3, Introduction to DataFrames, explains DataFrames, the handiest and most useful component for data scientists to work with. It explains Spark SQL and the Catalyst optimizer that powers DataFrames, and demonstrates various DataFrame operations with code examples.

Chapter 4, Unified Data Access, talks about the various ways to bring in data from different sources, consolidate it, and work with it in a unified way. It covers the streaming aspect of collecting real-time data and operating on it. It also talks about the under-the-hood fundamentals of these APIs.

Chapter 5, Data Analysis on Spark, discusses the complete data analytics lifecycle. With ample code examples, it explains how to source data from different sources, prepare the data using data cleaning and transformation techniques, and perform descriptive and inferential statistics to uncover hidden insights in the data.

Chapter 6, Machine Learning, explains various machine learning algorithms, how they are implemented in the MLlib library, and how they can be used with the pipeline API for streamlined execution. It covers the fundamentals of every algorithm it discusses, so it can serve as a one-stop reference.

Chapter 7, Extending Spark with SparkR, is primarily intended for R programmers who want to leverage Spark for data analytics. It explains how to program with SparkR and how to use the machine learning algorithms of R libraries.

Chapter 8, Analyzing Unstructured Data, deals exclusively with unstructured data analysis. It explains how to source unstructured data, process it, and perform machine learning on it. It also covers some of the dimension reduction techniques that were not covered in the "Machine Learning" chapter.

Chapter 9, Visualizing Big Data, teaches various visualization techniques that are supported on Spark. It explains the different visualization requirements of data engineers, data scientists, and business users, and suggests the right kinds of tools and techniques for each. It also talks about leveraging the IPython/Jupyter notebook and Zeppelin, an Apache project, for data visualization.
