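An In-Depth Guide to Apache Pig Programming

"Programming Pig, by Alan Gates, covers the basic and advanced features of Apache Pig, including the Pig Latin scripting language, the interactive shell (Grunt) commands, and user-defined functions (UDFs)."

In big data processing, Apache Pig is a powerful tool. It provides a high-level dataflow language, Pig Latin, for building large-scale data processing pipelines. Programming Pig, written by Alan Gates, aims to help readers understand and apply Pig in depth. Both beginners and experienced users can benefit from it.

1. **Pig Latin**: Pig Latin is the core of Pig. It is a high-level language for defining data processing tasks. It simplifies the MapReduce programming model and lets users focus on the data transformation logic rather than on the underlying parallelism and distributed execution. Pig Latin includes operators such as LOAD, FILTER, JOIN, GROUP, and ORDER, which can be combined into complex processing pipelines.
2. **Interactive shell commands**: Pig provides an interactive shell in which users can run Pig Latin scripts, inspect data, and debug and test processing tasks. Through the shell, users can see the results of each processing step as they go and iterate quickly on their logic.
3. **User-defined functions (UDFs)**: Pig lets users write their own functions in Java to handle tasks that Pig Latin cannot express directly. UDFs extend Pig and can perform complex data transformation, cleansing, and aggregation. By defining UDFs, users can integrate their own business logic into a Pig pipeline.
4. **Dataflow design**: In Pig, data processing is treated as a series of pipelined operations. Each operation (such as FILTER or JOIN) takes a data set as input and produces a new data set. This model makes the processing easy to understand and easy to parallelize.
5. **Performance optimization**: The book describes how to make Pig Latin scripts more efficient, for example by choosing an appropriate JOIN strategy, reducing data transfer, and combining multiple operations.
6. **Case studies**: To make the theory more practical, the book may include real-world case studies that show how to solve specific data processing problems and how to deploy and run Pig jobs in a production environment.
7. **Error handling and debugging**: The book also covers how to identify and resolve problems in Pig jobs, including syntax errors, type mismatches, and data quality issues, and offers debugging techniques.
8. **Integration with other tools**: Pig integrates with other tools in the Hadoop ecosystem (such as HDFS, HBase, and Hive), which makes data processing pipelines more flexible and able to meet a range of storage and query needs.

Programming Pig is a comprehensive introduction to Apache Pig. It helps readers master the syntax and usage of Pig Latin, understand how Pig works, and extend Pig with UDFs, so that they can work more efficiently on big data processing.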

Douglas of the Hadoop project provided me with very helpful feedback on the sections covering Hadoop and MapReduce.

I would also like to thank Mike Loukides and the entire team at O’Reilly. They have made writing my first book an enjoyable and exhilarating experience. Finally, thanks to Yahoo! for nurturing Pig and dedicating more than 25 engineering years (and still counting) of effort to it, and for graciously giving me the time to write this book.

xiv | Preface

Part of the specification of a MapReduce job is the key on which data will be collected. For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID to be your key so that you could see everything done by each user on your website. In the shuffle phase, which happens after the map phase, data is collected together by the key the user has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.
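The guarantee that every record for a given key reaches the same reducer can be sketched in a few lines of Python. This is only an illustration of hash partitioning, not Hadoop's actual partitioner: the user IDs, the actions, the `NUM_REDUCERS` value, and the helper names `reducer_for` and `shuffle` are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reducers, chosen for illustration


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer index from a stable hash of the key."""
    # zlib.crc32 gives the same value on every run, unlike hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(records, num_reducers=NUM_REDUCERS):
    """Collect (key, value) records by key and assign each key to one reducer."""
    reducers = defaultdict(lambda: defaultdict(list))
    for key, value in records:
        reducers[reducer_for(key, num_reducers)][key].append(value)
    return reducers


# Hypothetical map output for the web-log example: (user_id, action) pairs.
map_output = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "view /home"),
    ("carol", "login"),
    ("bob", "logout"),
    ("alice", "logout"),
]

partitions = shuffle(map_output)
for reducer_id in sorted(partitions):
    for user_id, actions in partitions[reducer_id].items():
        print(reducer_id, user_id, actions)
```

Because the reducer index depends only on the key, all of one user's actions land in the same partition, whichever map task produced them.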

In the reduce phase, the application is presented each key, together with all of the records containing that key. Again this is done in parallel on many machines. After processing each group, the reducer can write its output. See the next section for a walkthrough of a simple MapReduce program. For more details on how MapReduce works, see “MapReduce” on page 189.

MapReduce’s hello world

Consider a simple MapReduce application that counts the number of times each word appears in a given text. This is the “hello world” program of MapReduce. In this example the map phase will read each line in the text, one at a time. It will then split out each word into a separate string, and, for each word, it will output the word and a 1 to indicate it has seen the word one time. The shuffle phase will use the word as the key, hashing the records to reducers. The reduce phase will then sum up the number of times each word was seen and write that together with the word as output. Let’s consider the case of the nursery rhyme “Mary Had a Little Lamb.” Our input will be:

Mary had a little lamb

its fleece was white as snow

and everywhere that Mary went

the lamb was sure to go.

Let’s assume that each line is sent to a different map task. In reality, each map is assigned much more data than this, but it makes the example easier to follow. The data flow through MapReduce is shown in Figure 1-1.

Once the map phase is complete, the shuffle phase will collect all records with the same word onto the same reducer. For this example we assume that there are two reducers: all words that start with A-L are sent to the first reducer, and M-Z are sent to the second reducer. The reducers will then output the summed counts for each word.
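The three phases of this walkthrough can be simulated in plain Python. The sketch below runs in a single process and only mimics the data flow. It is not Hadoop code, and the function names are hypothetical. It assumes words are split on whitespace with no punctuation stripping or case folding, so "go." keeps its period.

```python
from collections import defaultdict

lines = [
    "Mary had a little lamb",
    "its fleece was white as snow",
    "and everywhere that Mary went",
    "the lamb was sure to go.",
]


def map_phase(line):
    """Emit (word, 1) for every whitespace-separated word in one line."""
    return [(word, 1) for word in line.split()]


def partition(word):
    """Reducer 0 takes words starting with A-L, reducer 1 takes M-Z."""
    return 0 if word[0].upper() <= "L" else 1


def shuffle_phase(mapped):
    """Collect all the 1s for each word onto the reducer chosen by partition()."""
    reducers = [defaultdict(list), defaultdict(list)]
    for word, count in mapped:
        reducers[partition(word)][word].append(count)
    return reducers


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Each line is treated as if it went to its own map task.
mapped = [pair for line in lines for pair in map_phase(line)]
results = [reduce_phase(grouped) for grouped in shuffle_phase(mapped)]

for reducer_id, counts in enumerate(results):
    for word in sorted(counts):
        print(reducer_id, word, counts[word])
```

With this input, "lamb" is counted twice on the first reducer, and "Mary" and "was" are each counted twice on the second.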

Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. See Example 1-1 for a Pig Latin script that will do a word count of “Mary Had a Little Lamb.”

2 | Chapter 1: Introduction

Pig Latin, a Parallel Dataflow Language

Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows like the word count example given previously. They can also be complex workflows that include points where multiple inputs are joined, and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
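To make the DAG view concrete, here is a minimal Python sketch that models a data flow as a graph of operators and derives a valid execution order. The operator names are hypothetical and this is not how Pig represents its plans internally. It only illustrates a flow with two inputs that are joined and then split into two streams. It uses the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each operator (node) maps to the set of operators whose output it consumes.
# An edge therefore represents a data flow between two operators.
dataflow = {
    "load_users": set(),
    "load_clicks": set(),
    "join": {"load_users", "load_clicks"},  # two inputs are joined
    "filter_recent": {"join"},              # the joined data is split
    "group_by_user": {"join"},              # into two separate streams
    "store_recent": {"filter_recent"},
    "store_counts": {"group_by_user"},
}

# static_order() yields every operator after all of its inputs.
# It raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dataflow).static_order())
print(order)
```

Any order produced this way runs both loads before the join, and the join before either downstream branch.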

This means that Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow. For information on how to integrate the data flow described by a Pig Latin script with control flow, see Chapter 9.

Comparing query and dataflow languages

After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to form queries. It allows users to describe what question they want answered, but not how they want it answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.

Another major difference is that SQL is oriented around answering one question. When users want to do several data operations together, they must either write separate queries, storing the intermediate data into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also, using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query.

Pig, however, is designed with a long series of data operations in mind, so there is no need to write the data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables. This is illustrated in Examples 1-2 and 1-3.

Consider a case where a user wants to group one table on a key and then join it with a second table. Because joins happen before grouping in a SQL query, this must be expressed either as a subquery or as two queries with the results stored in a temporary table. Example 1-3 will use a temporary table, as that is more readable.
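The group-then-join order can be written as a straight sequence of steps. The Python sketch below is not the book's Example 1-2 or 1-3. The tables, column meanings, and the choice of summing an amount per key are assumptions made purely for illustration. It shows how a dataflow style lets the first step of the pipeline appear first in the code.

```python
from collections import defaultdict

# Hypothetical input tables, invented for this sketch.
orders = [("alice", 30), ("bob", 15), ("alice", 20), ("carol", 5)]  # (user, amount)
users = [("alice", "Paris"), ("bob", "Oslo"), ("dave", "Lima")]     # (user, city)

# Step 1: group the first table on its key and aggregate each group.
totals = defaultdict(int)
for user, amount in orders:
    totals[user] += amount

# Step 2: join the grouped result with the second table on the same key.
# This is an inner join, so users without orders (and orders without a
# matching user row) are dropped.
joined = [(user, city, totals[user]) for user, city in users if user in totals]

print(joined)
```

The steps read top to bottom in the order they run, with no nesting and no explicitly managed temporary table.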
