Chapter 1
Big Data Technology Landscape
We are in the age of big data. Data has not only become the lifeblood of any organization, but is also growing
exponentially. Data generated today is several orders of magnitude larger than what was generated just a few years
ago. The challenge is how to get business value out of this data. This is the problem that big data–related
technologies aim to solve. Therefore, big data has become one of the hottest technology trends over the last
few years. Some of the most active open source projects are related to big data, and the number of these
projects is growing rapidly. The number of startups focused on big data has exploded in recent years. Large
established companies are making significant investments in big data technologies.
Although the term “big data” is hot, its definition is vague. People define it in different ways. One
definition relates to the volume of data; another definition relates to the richness of data. Some define big
data as data that is “too big” by traditional standards, whereas others define big data as data that captures
more nuances about the entity it represents. An example of the former would be a dataset whose
volume runs into hundreds of terabytes or petabytes. If this data were stored in a traditional relational database
(RDBMS) table, it would have billions of rows. An example of the latter definition is a dataset with extremely
wide rows. If this data were stored in a relational database table, it would have thousands of columns.
Another popular definition of big data is data characterized by three Vs: volume, velocity, and variety. I just
discussed volume. Velocity means that data is generated at a fast rate. Variety refers to the fact that data can
be unstructured, semi-structured, or multi-structured.
Standard relational databases could not easily handle big data. The core technology for these databases
was designed several decades ago when few organizations had petabytes or even terabytes of data. Today
it is not uncommon for some organizations to generate terabytes of data every day. Both the volume of data
and the rate at which it is generated are exploding. Hence there was a need for new technologies that could
not only process and analyze large volumes of data, but also ingest them at a fast pace.
Other key factors driving big data technologies include scalability, high availability, and fault
tolerance, all at low cost. Technology for processing and analyzing large datasets has been extensively
researched and has long been available in the form of proprietary commercial products. For example, MPP
(massively parallel processing) databases have been around for a while. MPP databases use a “shared-
nothing” architecture, where data is stored and processed across a cluster of nodes. Each node comes with
its own set of CPUs, memory, and disks. They communicate via a network interconnect. Data is partitioned
across a cluster of nodes. There is no contention among the nodes, so they can all process data in parallel.
Examples of such databases include Teradata, Netezza, Greenplum, ParAccel, and Vertica. Teradata was
founded in the late 1970s, and by the 1990s its database was capable of processing terabytes of data. However,
proprietary MPP products are expensive, and not everybody can afford them.
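To make the shared-nothing idea concrete, the following minimal Scala sketch simulates it on a single machine: rows are hash-partitioned across a set of nodes, each node aggregates only its own partition (no shared state, hence no contention), and a coordinator merges the partial results. The Node class and the partitioning scheme here are hypothetical, for illustration only; in a real MPP database each partition lives on separate hardware and the local aggregations run in parallel.

object SharedNothingSketch {
  // Each node owns its partition of the data; nothing is shared between nodes.
  final case class Node(id: Int, rows: Vector[(String, Long)]) {
    // Local aggregation: per-key sums computed entirely from this node's rows.
    def localSums: Map[String, Long] =
      rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val numNodes = 4
    val data = Vector(("a", 1L), ("b", 2L), ("a", 3L), ("c", 4L), ("b", 5L))

    // Partition rows by a hash of the key, as an MPP database would.
    val nodes = data
      .groupBy { case (k, _) => math.abs(k.hashCode) % numNodes }
      .map { case (id, rows) => Node(id, rows) }
      .toVector

    // Each node computes its partial result independently; in a real system
    // this happens in parallel on separate machines. The coordinator then
    // merges the partial sums into the final answer.
    val merged = nodes
      .map(_.localSums)
      .foldLeft(Map.empty[String, Long]) { (acc, partial) =>
        partial.foldLeft(acc) { case (a, (k, v)) =>
          a.updated(k, a.getOrElse(k, 0L) + v)
        }
      }

    println(merged) // e.g. Map(a -> 4, b -> 7, c -> 4)
  }
}

Because no node ever reads another node's data, adding nodes adds capacity almost linearly; this is the property that lets MPP databases, and later systems such as Spark, scale out across a cluster.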
This chapter introduces some of the open source big data–related technologies. Although it may seem
that the technologies covered in this chapter have been randomly picked, they are connected by a common
theme. Either they are used with Spark, or Spark provides a better alternative to them. As you
start using Spark, you are likely to run into these technologies. In addition, familiarity with them will
help you better understand Spark itself, which I introduce in Chapter 3.