Alan Gates and Daniel Dai
Programming Pig, Second Edition
by Alan Gates and Daniel Dai
Copyright © 2017 Alan Gates, Daniel Dai. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/
institutional sales department: 800-998-9938 or firstname.lastname@example.org.
Editor: Marie Beaugureau
Production Editor: Nicholas Adams
Copyeditor: Rachel Head
Proofreader: Kim Cofer
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
November 2016: Second Edition
Revision History for the Second Edition
▪ 2016-11-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491937099 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Programming Pig, the cover image,
and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
To my wife, Barbara, and our boys, Adam and Joel. Their support, encouragement, and sacrificed
Saturdays have made this book possible.
To my wife Jenny, my older son Ethan, and my younger son Charlie, who was delivered during the
writing of this book.
Data is addictive. Our ability to collect and store it has grown massively in the last several decades, yet
our appetite for ever more data shows no sign of being satiated. Scientists want to be able to store more
data in order to build better mathematical models of the world. Marketers want better data to understand
their customers’ desires and buying habits. Financial analysts want to better understand the workings of
their markets. And everybody wants to keep all their digital photographs, movies, emails, etc.
Before the computer and Internet revolutions, the US Library of Congress was one of the largest
collections of data in the world. It is estimated that its printed collections contain approximately 10
terabytes (TB) of information. Today, large Internet companies collect that much data on a daily basis.
And it is not just Internet applications that are producing data at prodigious rates. For example, the Large
Synoptic Survey Telescope (LSST) under construction in Chile is expected to produce 15 TB of data
every night.
Part of the reason for the massive growth in available data is our ability to collect much more data.
Every time someone clicks a website’s links, the web server can record information about what page the
user was on and which link he clicked. Every time a car drives over a sensor in the highway, its speed
can be recorded. But much of the reason is also our ability to store that data. Ten years ago, telescopes
took pictures of the sky every night. But they could not store the collected data at the same level of
detail that will be possible when the LSST is operational. The extra data was being thrown away because
there was nowhere to put it. The ability to collect and store vast quantities of data only feeds our data
addiction.
One of the most commonly used tools for storing and processing data in computer systems over the last
few decades has been the relational database management system (RDBMS). But as datasets have
grown large, only the more sophisticated (and hence more expensive) RDBMSs have been able to reach
the scale many users now desire. At the same time, many engineers and scientists involved in processing
the data have realized that they do not need everything offered by an RDBMS. These systems are
powerful and have many features, but many data owners who need to process terabytes or petabytes of
data need only a subset of those features.
The high cost and unneeded features of RDBMSs have led to the development of many alternative data-
processing systems. One such alternative system is Apache Hadoop. Hadoop is an open source project
started by Doug Cutting. Over the past several years, Yahoo! and a number of other web companies have
driven the development of Hadoop, which was based on papers published by Google describing how its
engineers were dealing with the challenge of storing and processing the massive amounts of data they
were collecting. Hadoop is installed on a cluster of machines and provides a means to tie together
storage and processing in that cluster. For a history of the project, see Hadoop: The Definitive Guide, by
Tom White (O’Reilly).
The development of new data-processing systems such as Hadoop has spurred the porting of existing
tools and languages and the construction of new tools, such as Apache Pig. Tools like Pig provide a
higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop
without requiring them to write extensive data-processing applications in low-level Java code.
Who Should Read This Book
This book is intended for Pig programmers, new and old. Those who have never used Pig will find
introductory material on how to run Pig and to get them started writing Pig Latin scripts. For seasoned
Pig users, this book covers almost every feature of Pig: different modes it can be run in, complete
coverage of the Pig Latin language, and how to extend Pig with your own user-defined functions
(UDFs). Even those who have been using Pig for a long time are likely to discover features they have
not used before.
Some knowledge of Hadoop will be useful for readers and Pig users. If you’re not already familiar with
it or want a quick refresher, “Pig on Hadoop” walks through a very simple example of a Hadoop job.
Small snippets of Java, Python, and SQL are used in parts of this book. Knowledge of these languages is
not required to use Pig, but knowledge of Python and Java will be necessary for some of the more
advanced features. Those with a SQL background may find “Comparing Query and Data Flow
Languages” to be a helpful starting point in understanding the similarities and differences between Pig
Latin and SQL.
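As a small taste of that comparison, here is a hypothetical grouping query expressed both ways. The table, relation, and field names below are invented for illustration and do not come from the book:

```
-- In SQL, one declarative statement:
--   SELECT customer, AVG(amount) FROM purchases GROUP BY customer;

-- In Pig Latin, the same computation as a step-by-step dataflow:
purchases = LOAD 'purchases' AS (customer:chararray, amount:double);
grouped   = GROUP purchases BY customer;
averaged  = FOREACH grouped GENERATE group AS customer, AVG(purchases.amount);
```

Where SQL describes the result you want, Pig Latin describes the sequence of transformations that produces it; “Comparing Query and Data Flow Languages” explores this distinction in depth.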
What’s New in This Edition
The second edition covers Pig 0.10 through Pig 0.16, which is the latest version at the time of writing.
For features introduced before 0.10, we will not call out the initial version of the feature. For newer
features introduced after 0.10, we will point out the version in which the feature was introduced.
Pig runs on both Hadoop 1 and Hadoop 2 for all the versions covered in the book. To simplify our
discussion, in this edition we assume Hadoop 2 is the target platform and point out the differences for
Hadoop 1 wherever applicable.
The second edition has two new chapters: “Pig on Tez” (Chapter 11) and “Use Cases and Programming
Examples” (Chapter 13). Other chapters have also been updated with the latest additions to Pig and
information on existing features not covered in the first edition. These include but are not limited to:
▪ New data types (boolean, datetime, biginteger, bigdecimal) are introduced in Chapter 3.
▪ New UDFs are covered in various places, including support for leveraging Hive UDFs (Chapter 4)
and applying Bloom filters (Chapter 7).
▪ New Pig operators and constructs such as rank, cube, assert, nested foreach and nested
cross, and casting relations to scalars are presented in Chapter 5.
▪ New performance optimizations — map-side aggregation, schema tuples, the shared JAR cache,
auto local and direct fetch modes, etc. — are covered in Chapter 7.
▪ Embedding Pig in scripting languages is covered in Chapter 8 and Chapter 13 (“k-Means”).
We also describe the Pig progress notification listener in Chapter 8.
▪ We look at the new EvalFunc interface in Chapter 9, including the topics of compile-time
evaluation, shipping dependent JARs automatically, and variable-length inputs.
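To give a flavor of the operators listed above, the following sketch shows assert, rank, and cube applied to a hypothetical sales dataset. The relation and field names are invented for illustration; the chapters cited above cover the real syntax and semantics in detail:

```
-- Load a hypothetical sales dataset
sales = LOAD 'sales' AS (product:chararray, region:chararray, amount:double);

-- assert: fail the job if any record has a non-positive amount
ASSERT sales BY amount > 0.0, 'amount must be positive';

-- rank: number the records by amount, highest first
ranked = RANK sales BY amount DESC;

-- cube: aggregate over all combinations of product and region
cubed  = CUBE sales BY CUBE(product, region);
totals = FOREACH cubed GENERATE FLATTEN(group), SUM(cube.amount) AS total;
```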