Dataflow Scripting with Pig and Hadoop

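"Programming Pig: Dataflow Scripting with Hadoop", written by Alan Gates, explains in detail how to process data with Pig on top of Hadoop. The book runs to more than 200 pages and aims to help readers understand and master Pig Latin so that they can process large data sets effectively. Pig is a tool in the Hadoop ecosystem that provides a high-level language for data processing, which makes large-scale dataflow work simpler and more intuitive.

Pig Latin is the scripting language that Pig uses. It lets users define their data processing logic, which Pig then compiles into a set of jobs that run on the Hadoop MapReduce framework. Pig Latin is a dataflow language: users describe the steps to apply to their data, and Pig handles the details of turning those steps into MapReduce jobs. This makes Pig a good fit for data analysts and others who do not want to write MapReduce code by hand.

The book's contents likely cover the following main topics:

1. **Pig Latin basics**: data types, operators, loading and storing data, and data transformation functions. For example, readers can learn how to use LOAD to read data from HDFS, FOREACH and GROUP to aggregate data, and FILTER and JOIN to filter and join data.
2. **Pig script design**: how to build effective data processing flows, including the concept of a pipeline, and how to extend Pig with UDFs (user-defined functions) to handle more complex computations.
3. **Pig and Hadoop integration**: how Pig works with Hadoop MapReduce, the execution model of Pig jobs, and how to debug and optimize Pig scripts to make full use of a Hadoop cluster.
4. **Performance optimization**: how to analyze Pig logs, identify performance bottlenecks, and improve processing efficiency. This may involve handling data skew, reducing the size of intermediate results, and choosing a suitable partitioning strategy.
5. **Case studies**: the book may include real data processing cases that show how to apply Pig to practical problems such as data cleaning, data analysis, and data mining.
6. **Best practices**: guidelines to follow when developing Pig scripts so that the code stays readable, maintainable, and extensible.
7. **The Pig ecosystem**: how Pig interacts with other Hadoop components (such as Hive and HBase), and Pig's place and value in the big data processing ecosystem.

With this book, readers can learn the basic usage of Pig Latin, gain a deeper understanding of the principles and methods of big data processing, and improve their ability to process large-scale data in a Hadoop environment. It offers both practical experience and theoretical background for data scientists, data engineers, and IT professionals interested in big data.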

Douglas of the Hadoop project provided me with very helpful feedback on the sections covering Hadoop and MapReduce.

I would also like to thank Mike Loukides and the entire team at O’Reilly. They have made writing my first book an enjoyable and exhilarating experience. Finally, thanks to Yahoo! for nurturing Pig and dedicating more than 25 engineering years (and still counting) of effort to it, and for graciously giving me the time to write this book.

xiv | Preface

Part of the specification of a MapReduce job is the key on which data will be collected. For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID to be your key so that you could see everything done by each user on your website. In the shuffle phase, which happens after the map phase, data is collected together by the key the user has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.

In the reduce phase, the application is presented each key, together with all of the records containing that key. Again this is done in parallel on many machines. After processing each group, the reducer can write its output. See the next section for a walkthrough of a simple MapReduce program. For more details on how MapReduce works, see “MapReduce” on page 189.
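The keyed shuffle described above can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code: the log records, the number of reducers, and the use of CRC32 as the partitioning hash are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

# Hypothetical web server log records as (user_id, action) pairs.
# The user ID is the key chosen for the job.
LOG_RECORDS = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "view /home"),
    ("carol", "login"),
    ("bob", "logout"),
    ("alice", "logout"),
]

NUM_REDUCERS = 3  # assumed number of reducers for this sketch


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Map a key to a reducer index with a stable hash (CRC32, an assumption).

    The same key always yields the same index, which is what guarantees
    that every record for a key reaches the same reducer.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(records, num_reducers=NUM_REDUCERS):
    """Shuffle phase: collect records by key and route each key to one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        partitions[reducer_for(key, num_reducers)][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Reduce phase: each reducer sees a key with all of its records.

    Here the reducer simply counts the actions recorded for each user.
    """
    output = {}
    for partition in partitions:
        for key, values in partition.items():
            output[key] = len(values)
    return output


partitions = shuffle(LOG_RECORDS)
actions_per_user = reduce_phase(partitions)
```

Because `reducer_for` depends only on the key, no user's records are ever split across reducers, so each reducer can process its groups independently of the others.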

MapReduce’s hello world

Consider a simple MapReduce application that counts the number of times each word appears in a given text. This is the “hello world” program of MapReduce. In this example the map phase will read each line in the text, one at a time. It will then split out each word into a separate string, and, for each word, it will output the word and a 1 to indicate it has seen the word one time. The shuffle phase will use the word as the key, hashing the records to reducers. The reduce phase will then sum up the number of times each word was seen and write that together with the word as output. Let’s consider the case of the nursery rhyme “Mary Had a Little Lamb.” Our input will be:

    Mary had a little lamb
    its fleece was white as snow
    and everywhere that Mary went
    the lamb was sure to go.

Let’s assume that each line is sent to a different map task. In reality, each map is assigned much more data than this, but it makes the example easier to follow. The data flow through MapReduce is shown in Figure 1-1.

Once the map phase is complete, the shuffle phase will collect all records with the same word onto the same reducer. For this example we assume that there are two reducers: all words that start with A-L are sent to the first reducer, and M-Z are sent to the second reducer. The reducers will then output the summed counts for each word.
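The word count walkthrough can be simulated in plain Python, which makes the three phases easy to trace by hand. This is a minimal single-process sketch, not a Hadoop job: the function names are invented for illustration, and stripping trailing punctuation from words (so that "go." counts as "go") is an assumption the text does not specify.

```python
from collections import defaultdict

# The four input lines; each is treated as if it went to its own map task.
TEXT = [
    "Mary had a little lamb",
    "its fleece was white as snow",
    "and everywhere that Mary went",
    "the lamb was sure to go.",
]


def map_line(line):
    """Map phase: emit (word, 1) for every word on the line.

    Trailing punctuation is stripped (an assumption of this sketch).
    """
    return [(word.strip(".,!?"), 1) for word in line.split()]


def partition(word):
    """Routing rule from the walkthrough: A-L to reducer 0, M-Z to reducer 1."""
    return 0 if word[0].upper() <= "L" else 1


def shuffle(pairs, num_reducers=2):
    """Shuffle phase: collect all records for a word onto one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for word, one in pairs:
        reducers[partition(word)][word].append(one)
    return reducers


def reduce_counts(grouped):
    """Reduce phase: sum the 1s recorded for each word."""
    return {word: sum(ones) for word, ones in grouped.items()}


mapped = [pair for line in TEXT for pair in map_line(line)]
reducers = shuffle(mapped)
results = [reduce_counts(grouped) for grouped in reducers]
```

After this runs, `results[0]` holds the counts for words starting with A-L and `results[1]` the counts for M-Z. The comparison is case-sensitive, so "Mary" would be counted separately from a lowercase "mary" if one appeared.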

Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. See Example 1-1 for a Pig Latin script that will do a word count of “Mary Had a Little Lamb.”

2 | Chapter 1: Introduction

Pig Latin, a Parallel Dataflow Language

Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows like the word count example given previously. They can also be complex workflows that include points where multiple inputs are joined, and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
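To make the DAG description concrete, here is a small Python sketch of a dataflow in which two inputs are joined and the result is split into two streams. It illustrates the graph structure only and says nothing about how Pig itself is implemented. The operator functions, node names, and sample data are all hypothetical, and `graphlib.TopologicalSorter` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


# Hypothetical operators; each one is a node in the graph.
def load_users():
    return [("u1", "Ann"), ("u2", "Bo")]


def load_clicks():
    return [("u1", "/home"), ("u1", "/cart"), ("u2", "/home")]


def join(users, clicks):
    """Join point: two input flows meet in one operator."""
    names = dict(users)
    return [(uid, names[uid], url) for uid, url in clicks if uid in names]


def only_home(joined):
    """First stream after the split: keep visits to /home."""
    return [row for row in joined if row[2] == "/home"]


def count_rows(joined):
    """Second stream after the split: count all joined rows."""
    return len(joined)


# node name -> (operator, upstream node names); the upstream lists are the edges.
DAG = {
    "users": (load_users, []),
    "clicks": (load_clicks, []),
    "joined": (join, ["users", "clicks"]),
    "home_visits": (only_home, ["joined"]),
    "total": (count_rows, ["joined"]),
}


def run(dag):
    """Run every operator after all of its upstream operators have run."""
    order = TopologicalSorter(
        {node: deps for node, (_, deps) in dag.items()}
    ).static_order()
    results = {}
    for node in order:
        operator, deps = dag[node]
        results[node] = operator(*(results[dep] for dep in deps))
    return results


results = run(DAG)
```

The script contains no branching on data and no explicit loop over records at the graph level: the structure of `DAG` alone determines what flows where, which is the sense in which a dataflow program differs from a control-flow program.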

This means that Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow. For information on how to integrate the data flow described by a Pig Latin script with control flow, see Chapter 9.

Comparing query and dataflow languages

After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to form queries. It allows users to describe what question they want answered, but not how they want it answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.

Another major difference is that SQL is oriented around answering one question. When users want to do several data operations together, they must either write separate queries, storing the intermediate data into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also, using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query.

Pig, however, is designed with a long series of data operations in mind, so there is no need to write the data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables. This is illustrated in Examples 1-2 and 1-3.

Consider a case where a user wants to group one table on a key and then join it with a second table. Because joins happen before grouping in a SQL query, this must be expressed either as a subquery or as two queries with the results stored in a temporary table. Example 1-3 will use a temporary table, as that is more readable.
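The following Python sketch is not the book's Example 1-2 or 1-3. It is a hedged stand-in that shows the same "group first, then join" pipeline written as a linear sequence of steps, in the order the data actually flows. The table names, columns, and rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical input tables.
orders = [("u1", 20.0), ("u2", 5.0), ("u1", 7.5)]   # (user_id, amount)
users = [("u1", "Ann"), ("u2", "Bo"), ("u3", "Cy")]  # (user_id, name)

# Step 1: group the first table on its key and aggregate each group.
totals = defaultdict(float)
for user_id, amount in orders:
    totals[user_id] += amount

# Step 2: join the grouped result with the second table on the same key.
names = dict(users)
report = [
    (user_id, names[user_id], total)
    for user_id, total in sorted(totals.items())
    if user_id in names
]
```

Each step consumes the output of the previous one, so the code reads top to bottom in pipeline order. That ordering is what a dataflow language preserves, whereas a nested subquery places the first step innermost.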

