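Large-Scale Data Processing in Practice: Big Data Analytics with Spark

"Big Data Analytics with Spark" is a practitioner's guide to using Spark for large-scale data processing, machine learning, graph analytics, and high-velocity data stream processing. Written by Mohammed Guller, the book explores how Spark is applied in big data analytics, and also covers Spark as an alternative to MapReduce and the fundamentals of Scala.

Within the big data ecosystem, Spark has become an industry favorite as a fast, general-purpose, and scalable data processing framework. It offers a more efficient way to process large datasets: compared with Hadoop's MapReduce model, Spark speeds up processing considerably through in-memory computation. MapReduce relies mainly on disk I/O, whereas Spark caches data in memory, which makes iterative computation and interactive queries more efficient.

Spark's core components are Spark Core, Spark SQL, Spark Streaming, MLlib (the machine learning library), and GraphX (graph computation). Spark Core provides the underlying infrastructure for distributed task scheduling and memory management. Spark SQL combines SQL queries with DataFrames, which simplifies structured data processing. Spark Streaming handles real-time stream processing and can ingest high-volume data from a variety of sources. MLlib gives data scientists a rich set of machine learning algorithms and simplifies model building and experimentation. GraphX is a library for processing graph data and supports complex graph algorithms.

Scala is a multi-paradigm programming language that combines object-oriented and functional programming, and it is the primary language for Spark. Learning Scala is essential for understanding Spark in depth and for developing Spark applications. The Spark API is concise and expressive, so developers can build complex data processing pipelines with little effort.

The book explains in detail how to use Spark for large-scale data processing, including reading, transforming, cleaning, and aggregating data, and how to use Spark SQL for query optimization. Its chapters also cover the machine learning workflow, such as feature engineering, model selection, and evaluation, and show how to implement common machine learning algorithms such as classification, regression, and clustering with MLlib. In addition, the book discusses how Spark processes graph data and how Spark Streaming processes real-time data streams, in scenarios such as the Internet of Things and social media analytics.

By reading this book, readers can learn the basics of Spark and also see how to apply it to complex big data analytics problems in real projects. Data engineers, data scientists, and other practitioners interested in big data can all benefit from it and improve their skills in big data processing and analytics.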


About the Technical Reviewers

Sundar Rajan Raman is a big data architect currently working for Bank

of America. He has a bachelor of technology degree from the National

Institute of Technology, Silchar, India. He is a seasoned Java and J2EE

programmer with expertise in Hadoop, Spark, MongoDB, and big data

analytics. He has worked at companies such as AT&T, Singtel, and

Deutsche Bank. He is also a platform specialist with vast experience in

SonicMQ, WebSphere MQ, and TIBCO with respective certifications.

His current focus is on big data architecture. More information about

Raman is available at https://in.linkedin.com/pub/sundar-rajan-raman/7/905/488.

I would like to thank my wife, Hema, and daughter, Shriya, for their

patience during the review process.

Heping Liu has a PhD degree in engineering, focusing on the algorithm

research of forecasting and intelligence optimization and their

applications. Dr. Liu is an expert in big data analytics and machine

learning. He worked for a few startup companies, where he played a

leading role by building the forecasting, optimization, and machine

learning models under the big data infrastructure and by designing and

creating the big data infrastructure to support the model development.

Dr. Liu has been active in the academic area. He has published 20

academic papers, which have appeared in Applied Soft Computing and the

Journal of the Operational Research Society. He has worked as a reviewer

for 20 top academic journals, such as IEEE Transactions on Evolutionary

Computation and Applied Soft Computing. Dr. Liu has been an editorial

board member of the International Journal of Business Analytics.

www.it-ebooks.info


Acknowledgments

Many people have contributed to this book directly or indirectly. Without the support, encouragement, and

help that I received from various people, it would not have been possible for me to write this book. I would

like to take this opportunity to thank those people.

First and foremost, I would like to thank my beautiful wife, Tarannum, and my three amazing kids, Sarah,

Soha, and Sohail. Writing a book is an arduous task. Working full-time and writing a book at the same time

meant that I was not spending much time with my family. During work hours, I was busy with work. Evenings

and weekends were completely consumed by the book. I thank my family for providing me all the support

and encouragement. Occasionally, Soha and Sohail would come up with ingenious plans to get me to play

with them, but for the most part, they let me work on the book when I should have been playing with them.

Next, I would like to thank Matei Zaharia, Reynold Xin, Michael Armbrust, Tathagata Das, Patrick

Wendell, Joseph Bradley, Xiangrui Meng, Joseph Gonzalez, Ankur Dave, and other Spark developers. They

have not only created an amazing piece of technology, but also continue to rapidly enhance it. Without their

invention, this book would not exist.

Spark was new and few people knew about it when I first proposed using it at Glassbeam to solve

some of the problems we were struggling with at that time. I would like to thank our VP of Engineering,

Ashok Agarwal, and CEO, Puneet Pandit, for giving me the permission to proceed. Without the hands-on

experience that I gained from embedding Spark in our product and using it on a regular basis, it would have

been difficult to write a book on it.

Next, I would like to thank my technical reviewers, Sundar Rajan Raman and Heping Liu. They

painstakingly checked the content for accuracy, ran the examples to make sure that the code works, and

provided helpful suggestions.

Finally, I would like to thank the people at Apress who worked on this book, including Chris Nelson,

Jill Balzano, Kim Burton-Weisman, Celestin John Suresh, Nikhil Chinnari, Dhaneesh Kumar, and others.

Jill Balzano coordinated all the book-related activities. As an editor, Chris Nelson’s contribution to this

book is invaluable. I appreciate his suggestions and edits. This book became much better because of his

involvement. My copy editor, Kim Burton-Weisman, read every sentence in the book to make sure it is

written correctly and fixed the problematic ones. It was a pleasure working with the Apress team.

—Mohammed Guller

Danville, CA


Introduction

This book is a concise and easy-to-understand tutorial for big data and Spark. It will help you learn how to use

Spark for a variety of big data analytic tasks. It covers everything that you need to know to productively use Spark.

One of the benefits of purchasing this book is that it will help you learn Spark efficiently; it will save

you a lot of time. The topics covered in this book can be found on the Internet. There are numerous

blogs, presentations, and YouTube videos covering Spark. In fact, the amount of material on Spark can be

overwhelming. You could spend months reading bits and pieces about Spark at different places on

the Web. This book provides a better alternative with the content nicely organized and presented in an

easy-to-understand format.

The content and the organization of the material in this book are based on the Spark workshops that

I occasionally conduct at different big data–related conferences. The positive feedback given by the

attendees for both the content and the flow motivated me to write this book.

One of the differences between a book and a workshop is that the latter is interactive. However, after

conducting a number of Spark workshops, I know the kind of questions people generally have and I have

addressed those in the book. Still, if you have questions as you read the book, I encourage you to contact me

via LinkedIn or Twitter. Feel free to ask any question. There is no such thing as a stupid question.

Rather than cover every detail of Spark, the book covers important Spark-related topics that you need

to know to effectively use Spark. My goal is to help you build a strong foundation. Once you have a strong

foundation, it is easy to learn all the nuances of a new technology. In addition, I wanted to keep the book as

simple as possible. If Spark looks simple after reading this book, I have succeeded in my goal.

No prior experience is assumed with any of the topics covered in this book. It introduces the key

concepts, step by step. Each section builds on the previous section. Similarly, each chapter serves as a

stepping-stone for the next chapter. You can skip some of the later chapters covering the different Spark

libraries if you don’t have an immediate need for that library. However, I encourage you to read all the

chapters. Even though it may not seem relevant to your current project, it may give you new ideas.

You will learn a lot about Spark and related technologies from reading this book. However, to get the

most out of this book, type the examples shown in the book. Experiment with the code samples. Things

become clearer when you write and execute code. If you practice and experiment with the examples as you

read the book, by the time you finish reading it, you will be a solid Spark developer.

One of the resources that I find useful when I am developing Spark applications is the official Spark API

(application programming interface) documentation. It is available at http://spark.apache.org/docs/latest/api/scala.

As a beginner, you may find it hard to understand, but once you have learned the basic

concepts, you will find it very useful.

Another useful resource is the Spark mailing list. The Spark community is active and helpful. Not only

do the Spark developers respond to questions, but experienced Spark users also volunteer their time helping

new users. No matter what problem you run into, chances are that someone on the Spark mailing list has

solved that problem.

And, you can reach out to me. I would love to hear from you. Feedback, suggestions, and questions are

welcome.

—Mohammed Guller

LinkedIn: www.linkedin.com/in/mohammedguller

Twitter: @MohammedGuller


Chapter 1

Big Data Technology Landscape

We are in the age of big data. Data has not only become the lifeblood of any organization, but is also growing

exponentially. Data generated today is several orders of magnitude larger than what was generated just a few years

ago. The challenge is how to get business value out of this data. This is the problem that big data–related

technologies aim to solve. Therefore, big data has become one of the hottest technology trends over the last

few years. Some of the most active open source projects are related to big data, and the number of these

projects is growing rapidly. The number of startups focused on big data has exploded in recent years. Large

established companies are making significant investments in big data technologies.

Although the term “big data” is hot, its definition is vague. People define it in different ways. One

definition relates to the volume of data; another definition relates to the richness of data. Some define big

data as data that is “too big” by traditional standards; whereas others define big data as data that captures

more nuances about the entity that it represents. An example of the former would be a dataset whose

volume exceeds several terabytes or even petabytes. If this data were stored in a traditional relational database

(RDBMS) table, it would have billions of rows. An example of the latter definition is a dataset with extremely

wide rows. If this data were stored in a relational database table, it would have thousands of columns.

Another popular definition of big data is data characterized by three Vs: volume, velocity, and variety. I just

discussed volume. Velocity means that data is generated at a fast rate. Variety refers to the fact that data can

be unstructured, semi-structured, or multi-structured.

Standard relational databases could not easily handle big data. The core technology for these databases

was designed several decades ago when few organizations had petabytes or even terabytes of data. Today

it is not uncommon for some organizations to generate terabytes of data every day. Not only the volume

of data, but also the rate at which it is being generated is exploding. Hence there was a need for new

technologies that could not only process and analyze large volumes of data, but also ingest large volumes of

data at a fast pace.

Other key driving factors for the big data technologies include scalability, high availability, and fault

tolerance at a low cost. Technology for processing and analyzing large datasets has been extensively

researched and has been available in the form of proprietary commercial products for a long time. For example, MPP

(massively parallel processing) databases have been around for a while. MPP databases use a “shared-nothing”

architecture, where data is stored and processed across a cluster of nodes. Each node comes with

its own set of CPUs, memory, and disks. They communicate via a network interconnect. Data is partitioned

across a cluster of nodes. There is no contention among the nodes, so they can all process data in parallel.

Examples of such databases include Teradata, Netezza, Greenplum, ParAccel, and Vertica. Teradata was

invented in the late 1970s, and by the 1990s, it was capable of processing terabytes of data. However,

proprietary MPP products are expensive. Not everybody can afford them.
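The shared-nothing idea described above can be illustrated with a small sketch. It is not how any of the named products work internally; it only assumes that rows are assigned to nodes by hashing a key, so that each node can aggregate its own partition without consulting the others. The function names (`node_for`, `partition_rows`, `local_sum`, `cluster_sum`) and the sample rows are hypothetical, and the "nodes" are plain Python lists processed one after another rather than separate machines working in parallel.

```python
import zlib
from collections import Counter


def node_for(key, num_nodes):
    # Deterministic hash of the key decides which node owns the row.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, num_nodes):
    # Each node gets its own list; no row is stored on two nodes.
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


def local_sum(node_rows):
    # Work a single node can do using only the rows it holds.
    totals = Counter()
    for key, value in node_rows:
        totals[key] += value
    return totals


def cluster_sum(rows, num_nodes):
    # In a real cluster the local sums would run in parallel on separate
    # machines; here they run sequentially in one process.
    partials = [local_sum(part) for part in partition_rows(rows, num_nodes)]
    merged = Counter()
    for totals in partials:
        merged.update(totals)
    return dict(merged)


rows = [("east", 10), ("west", 5), ("east", 7), ("north", 1)]
totals = cluster_sum(rows, num_nodes=4)
```

Because every row with a given key lands on the same node, the per-node totals never overlap, and merging them is a simple union.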

This chapter introduces some of the open source big data–related technologies. Although it may seem

that the technologies covered in this chapter have been randomly picked, they are connected by a common

theme. They are used with Spark, or Spark provides a better alternative to some of these technologies. As you

start using Spark, you may run into these technologies. In addition, familiarity with these technologies will

help you better understand Spark, which we will introduce in Chapter 3.


Hadoop

Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system

for processing large datasets across a cluster of commodity servers. It provides a simple programming

framework for large-scale data processing using the resources available across a cluster of computers.

Hadoop is inspired by a system invented at Google to create an inverted index for its search product. Two

papers from Google describe that system. The first, “The Google File System” by Sanjay Ghemawat, Howard

Gobioff, and Shun-Tak Leung (2003), is available at research.google.com/archive/gfs.html. The second,

“MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat (2004), is

available at research.google.com/archive/mapreduce.html. Inspired by these papers, Doug Cutting and Mike

Cafarella developed an open source implementation, which later became Hadoop.

Many organizations have replaced expensive proprietary commercial products with Hadoop for

processing large datasets. One reason is cost. Hadoop is open source and runs on a cluster of commodity

hardware. You can scale it easily by adding cheap servers. High availability and fault tolerance are provided

by Hadoop, so you don’t need to buy expensive hardware. Second, it is better suited for certain types of data

processing tasks, such as batch processing and ETL (extract, transform, load) of large-scale data.

Hadoop is built on a few important ideas. First, it is cheaper to use a cluster of commodity servers for

both storing and processing large amounts of data than using high-end powerful servers. In other words,

Hadoop uses scale-out architecture instead of scale-up architecture.

Second, implementing fault tolerance through software is cheaper than implementing it in hardware.

Fault-tolerant servers are expensive. Hadoop does not rely on fault-tolerant servers. It assumes that servers

will fail and transparently handles server failures. An application developer need not worry about handling

hardware failures. Those messy details can be left for Hadoop to handle.
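As a rough illustration of handling failures in software rather than hardware, the sketch below retries a task on the next node when the first one fails. It is a toy simulation, not Hadoop's scheduler: the node functions, the `NodeFailure` exception, and `run_with_failover` are all hypothetical names invented for this example.

```python
class NodeFailure(Exception):
    """Raised by a simulated node that has crashed."""


def healthy_node(task):
    # A working node simply runs the task and returns its result.
    return task()


def crashed_node(task):
    # A broken node never completes the task.
    raise NodeFailure("simulated hardware failure")


def run_with_failover(task, nodes):
    # Try each node in turn; the caller never sees individual failures,
    # only the final result and a record of which nodes were skipped.
    failed = []
    for node in nodes:
        try:
            return node(task), failed
        except NodeFailure:
            failed.append(node.__name__)
    raise RuntimeError("task failed on every node")


result, failed_nodes = run_with_failover(
    lambda: sum(range(10)), [crashed_node, healthy_node]
)
```

The code that defines the task (here, a simple sum) contains no failure handling at all; that concern lives entirely in the retry loop.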

Third, moving code from one computer to another over a network is a lot more efficient and faster than

moving a large dataset across the same network. For example, assume you have a cluster of 100 computers

with a terabyte of data on each computer. One option for processing this data would be to move it to a very

powerful server that can process 100 terabytes of data. However, moving 100 terabytes of data will take a

long time, even on a very fast network. In addition, you will need very expensive hardware to process data

with this approach. Another option is to move the code that processes this data to each computer in your

100-node cluster; it is a lot faster and more efficient than the first option. Moreover, you don’t need high-end

servers, which are expensive.
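A back-of-the-envelope calculation makes the difference concrete. The numbers below are assumptions chosen for illustration, not figures from the text: a 10 Gbit/s link into the central server, decimal terabytes, and a 10 MB program shipped to each of the 100 nodes one after another over the same link. Protocol overhead and disk speed are ignored.

```python
TB = 10**12  # bytes in a decimal terabyte (assumption)
MB = 10**6   # bytes in a decimal megabyte (assumption)


def transfer_seconds(num_bytes, link_bits_per_second):
    # Ideal transfer time: total bits divided by link speed.
    return num_bytes * 8 / link_bits_per_second


link = 10 * 10**9  # assumed 10 Gbit/s network link

# Option 1: move all 100 TB of data to one powerful server.
move_data_hours = transfer_seconds(100 * TB, link) / 3600

# Option 2: send an assumed 10 MB program to each of the 100 nodes.
move_code_seconds = 100 * transfer_seconds(10 * MB, link)

print(f"move data: about {move_data_hours:.1f} hours")
print(f"move code: about {move_code_seconds:.1f} seconds")
```

Under these assumptions, moving the data takes roughly 22 hours, while shipping the code takes under a second, before any processing has even started.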

Fourth, writing a distributed application can be made easy by separating core data processing logic

from distributed computing logic. Developing an application that takes advantage of resources available on

a cluster of computers is a lot harder than developing an application that runs on a single computer. The

pool of developers who can write applications that run on a single machine is several orders of magnitude larger

than the pool of those who can write distributed applications. Hadoop provides a framework that hides the complexities

of writing distributed applications. It thus allows organizations to tap into a much bigger pool of application

developers.
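The separation of data processing logic from distributed computing logic can be sketched as follows. This is a single-process toy, not Hadoop's API: `run_job` stands in for the framework, while `words` and the built-in `sum` are the only pieces an application developer would write. All names and the sample input are hypothetical, and the map and reduce roles are borrowed loosely from the MapReduce paper title cited earlier.

```python
from collections import defaultdict


def run_job(partitions, map_fn, reduce_fn):
    # "Framework" side: iterates over partitions, groups values by key,
    # and applies the reduce function. A real framework would run each
    # partition on a different machine; here everything runs in one loop.
    groups = defaultdict(list)
    for partition in partitions:
        for record in partition:
            for key, value in map_fn(record):
                groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


def words(line):
    # "Application" side: pure data logic, no knowledge of partitions.
    return [(word.lower(), 1) for word in line.split()]


partitions = [["big data", "Big cluster"], ["data data"]]
counts = run_job(partitions, words, sum)
```

The application code never mentions partitions or machines, so the same `words` function would work unchanged if `run_job` were replaced by a truly distributed implementation.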

Although people talk about Hadoop as a single product, it is not really a single product. It consists of

three key components: a cluster manager, a distributed compute engine, and a distributed file system

(see Figure1-1).
