Pig Programming on the Hadoop Architecture: A Guide to Dataflow Processing


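Programming Pig: Dataflow Scripting with Hadoop, by Alan Gates, is a technical book that explains in detail how to process data with the Pig language on the Hadoop architecture. Pig is an important big-data processing tool in the Apache Hadoop ecosystem. It provides a dataflow programming model in which developers describe, step by step, how large datasets are processed, using operators that will look familiar to SQL users. The book aims to help readers understand Pig's basic concepts and syntax and apply them in real projects. It covers the following key topics:

1. **Pig language basics**: The author first introduces Pig Latin, a concise scripting language, including core elements such as variable declarations, constant definitions, and function calls. Readers learn how to write Pig scripts that read, transform, and load data.
2. **The dataflow model**: Pig's core design idea is dataflow-based processing. It treats data as a series of records and processes them through a sequence of operators such as load, filter, map, join, sort, and store. Understanding this model is essential for writing efficient Pig scripts.
3. **Hadoop integration**: The book examines how Pig works with the Hadoop MapReduce framework, explaining how Pig executes jobs and uses the resources of a Hadoop cluster. This includes distributed computation, partitioning strategies, error handling, and optimization strategies.
4. **Data cleaning and preprocessing**: Pig provides a rich function library for cleaning, transforming, and aggregating data, including date handling, string operations, and math functions. This part shows how to use these tools to prepare data for later analysis.
5. **Performance tuning and debugging**: To get the best performance on large datasets, the book discusses how to tune Pig scripts, for example by optimizing the query plan, choosing a JOIN type, and setting configuration parameters. It also covers methods for diagnosing and debugging performance bottlenecks.
6. **Practical cases**: The book includes several hands-on cases from areas such as e-commerce, social media, and log analysis, so that readers can see through concrete examples where Pig fits in real projects and what value it brings.
7. **Version currency**: Because the book was published in 2011, it covers the Pig features that were current at that time. Later Hadoop and Pig releases may differ, so readers should supplement it with the official documentation and community material.

Programming Pig is suited to Hadoop developers, data analysts, and anyone who wants to process data with Pig in the big-data field. Through accessible explanations and examples, it helps readers improve their data processing skills within the Hadoop ecosystem.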

Douglas of the Hadoop project provided me with very helpful feedback on the sections covering Hadoop and MapReduce.

I would also like to thank Mike Loukides and the entire team at O’Reilly. They have made writing my first book an enjoyable and exhilarating experience. Finally, thanks to Yahoo! for nurturing Pig and dedicating more than 25 engineering years (and still counting) of effort to it, and for graciously giving me the time to write this book.


Part of the specification of a MapReduce job is the key on which data will be collected. For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID to be your key so that you could see everything done by each user on your website. In the shuffle phase, which happens after the map phase, data is collected together by the key the user has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.

In the reduce phase, the application is presented each key, together with all of the records containing that key. Again this is done in parallel on many machines. After processing each group, the reducer can write its output. See the next section for a walkthrough of a simple MapReduce program. For more details on how MapReduce works, see “MapReduce” on page 189.
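The three phases described above can be imitated in a single process. The following is a minimal Python sketch, not Hadoop code: the log records, the reducer count, and the function names are all assumptions made up for illustration, and `crc32` merely stands in for whatever partitioning function a real cluster would use.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical web server log records: (user_id, action).
LOG_RECORDS = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "view_page"),
    ("carol", "login"),
    ("bob", "logout"),
    ("alice", "logout"),
]

NUM_REDUCERS = 2  # assumed number of reducers for this illustration


def map_phase(records):
    """Emit (key, value) pairs; the chosen key here is the user ID."""
    for user_id, action in records:
        yield user_id, action


def shuffle_phase(pairs, num_reducers):
    """Collect values by key and assign each key to exactly one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Deterministic hash, so every record for a key lands on the same reducer.
        reducer_index = crc32(key.encode("utf-8")) % num_reducers
        partitions[reducer_index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Receive each key with all of its records; here we simply list them."""
    return {key: list(values) for key, values in partition.items()}


partitions = shuffle_phase(map_phase(LOG_RECORDS), NUM_REDUCERS)
results = [reduce_phase(p) for p in partitions]
```

Each entry of `results` plays the role of one reducer's output. Because the partition is computed from the key alone, no user ID can appear in more than one entry.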

MapReduce’s hello world

Consider a simple MapReduce application that counts the number of times each word appears in a given text. This is the “hello world” program of MapReduce. In this example the map phase will read each line in the text, one at a time. It will then split out each word into a separate string, and, for each word, it will output the word and a 1 to indicate it has seen the word one time. The shuffle phase will use the word as the key, hashing the records to reducers. The reduce phase will then sum up the number of times each word was seen and write that together with the word as output. Let’s consider the case of the nursery rhyme “Mary Had a Little Lamb.” Our input will be:

    Mary had a little lamb
    its fleece was white as snow
    and everywhere that Mary went
    the lamb was sure to go.

Let’s assume that each line is sent to a different map task. In reality, each map is assigned much more data than this, but it makes the example easier to follow. The data flow through MapReduce is shown in Figure 1-1.

Once the map phase is complete, the shuffle phase will collect all records with the same word onto the same reducer. For this example we assume that there are two reducers: all words that start with A-L are sent to the first reducer, and M-Z are sent to the second reducer. The reducers will then output the summed counts for each word.
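The walkthrough above can be reproduced with a short Python simulation. This is only a sketch of the described flow, not Hadoop or Pig code. The function names are hypothetical, and two details are assumptions the text does not state: trailing punctuation is stripped from words, and the A-L / M-Z split ignores letter case.

```python
from collections import defaultdict

LINES = [
    "Mary had a little lamb",
    "its fleece was white as snow",
    "and everywhere that Mary went",
    "the lamb was sure to go.",
]


def map_task(line):
    """Map: emit (word, 1) for every word in one line."""
    for token in line.split():
        word = token.strip(".,;:!?")  # assumption: drop surrounding punctuation
        if word:
            yield word, 1


def reducer_for(word):
    """Shuffle rule from the text: A-L goes to reducer 0, M-Z to reducer 1."""
    return 0 if word[0].upper() <= "L" else 1


def shuffle(mapped_pairs):
    """Collect all the 1s emitted for a word onto that word's reducer."""
    reducers = [defaultdict(list), defaultdict(list)]
    for word, count in mapped_pairs:
        reducers[reducer_for(word)][word].append(count)
    return reducers


def reduce_task(grouped):
    """Reduce: sum the 1s collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# One map task per line, as in the walkthrough.
mapped = [pair for line in LINES for pair in map_task(line)]
word_counts = [reduce_task(group) for group in shuffle(mapped)]
```

Here `word_counts[0]` holds the totals produced by the first reducer (words starting with A-L) and `word_counts[1]` those of the second, so "lamb" is counted by the first reducer and "Mary" by the second.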

Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. See Example 1-1 for a Pig Latin script that will do a word count of “Mary Had a Little Lamb.”


Pig Latin, a Parallel Dataflow Language

Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows like the word count example given previously. They can also be complex workflows that include points where multiple inputs are joined, and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
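To make the DAG picture concrete, a dataflow can be written down as a mapping from each operator to the operators whose output it consumes. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The operator names are invented for illustration and do not come from any real script. The flow has two inputs that are joined, and the join feeds two separate output streams.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Nodes are operators; each set lists the operators feeding into that node,
# so every (producer, consumer) pair is one data-flow edge.
# All names are hypothetical.
dataflow = {
    "load_users": set(),
    "load_clicks": set(),
    "filter_clicks": {"load_clicks"},
    "join": {"load_users", "filter_clicks"},
    "group_by_user": {"join"},
    "store_totals": {"group_by_user"},
    "store_joined": {"join"},
}

# A valid execution order exists only because the graph has no cycles;
# static_order() would raise graphlib.CycleError otherwise.
order = list(TopologicalSorter(dataflow).static_order())
```

Any order produced this way runs each operator after everything it depends on, which is the property that lets a dataflow be scheduled without explicit control flow.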

This means that Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow. For information on how to integrate the data flow described by a Pig Latin script with control flow, see Chapter 9.

Comparing query and dataflow languages

After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to form queries. It allows users to describe what question they want answered, but not how they want it answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.

Another major difference is that SQL is oriented around answering one question. When users want to do several data operations together, they must either write separate queries, storing the intermediate data into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also, using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query.

Pig, however, is designed with a long series of data operations in mind, so there is no need to write the data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables. This is illustrated in Examples 1-2 and 1-3.

Consider a case where a user wants to group one table on a key and then join it with a second table. Because joins happen before grouping in a SQL query, this must be expressed either as a subquery or as two queries with the results stored in a temporary table. Example 1-3 will use a temporary table, as that is more readable.
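The group-then-join case can be written as a plain sequence of steps, in the order the data actually moves. The following Python sketch only illustrates that ordering and is not one of the book's examples. The two tables, their columns, and their values are hypothetical.

```python
from collections import defaultdict

# Hypothetical tables, invented for illustration.
transactions = [  # (customer, purchase)
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 20.0),
]
profiles = [  # (customer, zipcode)
    ("alice", "94301"),
    ("bob", "10001"),
]

# Step 1: group the first table on its key and aggregate each group.
totals = defaultdict(float)
for customer, purchase in transactions:
    totals[customer] += purchase

# Step 2: join the grouped result with the second table on the same key.
zipcode_by_customer = dict(profiles)
answer = [
    (customer, total, zipcode_by_customer[customer])
    for customer, total in sorted(totals.items())
    if customer in zipcode_by_customer
]
```

The first step appears first and its result is simply a named intermediate value, which is the shape a dataflow script has. No temporary table or nested subquery is needed to order the two operations.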

