Alan Gates and Daniel Dai
Programming Pig, Second Edition
by Alan Gates and Daniel Dai
Copyright © 2017 Alan Gates, Daniel Dai. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/
institutional sales department: 800-998-9938 or firstname.lastname@example.org.
Editor: Marie Beaugureau
Production Editor: Nicholas Adams
Copyeditor: Rachel Head
Proofreader: Kim Cofer
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
November 2016: Second Edition
Revision History for the Second Edition
▪ 2016-11-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491937099 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Programming Pig, the cover image,
and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
To my wife, Barbara, and our boys, Adam and Joel. Their support, encouragement, and sacrificed
Saturdays have made this book possible.
To my wife Jenny, my older son Ethan, and my younger son Charlie, who was delivered during the
writing of this book.
Data is addictive. Our ability to collect and store it has grown massively in the last several decades, yet
our appetite for ever more data shows no sign of being satiated. Scientists want to be able to store more
data in order to build better mathematical models of the world. Marketers want better data to understand
their customers’ desires and buying habits. Financial analysts want to better understand the workings of
their markets. And everybody wants to keep all their digital photographs, movies, emails, etc.
Before the computer and Internet revolutions, the US Library of Congress was one of the largest
collections of data in the world. It is estimated that its printed collections contain approximately 10
terabytes (TB) of information. Today, large Internet companies collect that much data on a daily basis.
And it is not just Internet applications that are producing data at prodigious rates. For example, the Large
Synoptic Survey Telescope (LSST) under construction in Chile is expected to produce 15 TB of data
every night.
Part of the reason for the massive growth in available data is our ability to collect much more data.
Every time someone clicks a website’s links, the web server can record information about what page the
user was on and which link he clicked. Every time a car drives over a sensor in the highway, its speed
can be recorded. But much of the reason is also our ability to store that data. Ten years ago, telescopes
took pictures of the sky every night. But they could not store the collected data at the same level of
detail that will be possible when the LSST is operational. The extra data was being thrown away because
there was nowhere to put it. The ability to collect and store vast quantities of data only feeds our data
addiction.
One of the most commonly used tools for storing and processing data in computer systems over the last
few decades has been the relational database management system (RDBMS). But as datasets have
grown large, only the more sophisticated (and hence more expensive) RDBMSs have been able to reach
the scale many users now desire. At the same time, many engineers and scientists involved in processing
the data have realized that they do not need everything offered by an RDBMS. These systems are
powerful and have many features, but many data owners who need to process terabytes or petabytes of
data need only a subset of those features.
The high cost and unneeded features of RDBMSs have led to the development of many alternative data-
processing systems. One such alternative system is Apache Hadoop. Hadoop is an open source project
started by Doug Cutting. Over the past several years, Yahoo! and a number of other web companies have
driven the development of Hadoop, which was based on papers published by Google describing how its
engineers were dealing with the challenge of storing and processing the massive amounts of data they
were collecting. Hadoop is installed on a cluster of machines and provides a means to tie together
storage and processing in that cluster. For a history of the project, see Hadoop: The Definitive Guide, by
Tom White (O’Reilly).
The development of new data-processing systems such as Hadoop has spurred the porting of existing
tools and languages and the construction of new tools, such as Apache Pig. Tools like Pig provide a
higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop
without requiring them to write extensive data-processing applications in low-level Java code.
Who Should Read This Book
This book is intended for Pig programmers, new and old. Those who have never used Pig will find
introductory material on how to run Pig and to get them started writing Pig Latin scripts. For seasoned
Pig users, this book covers almost every feature of Pig: different modes it can be run in, complete
coverage of the Pig Latin language, and how to extend Pig with your own user-defined functions
(UDFs). Even those who have been using Pig for a long time are likely to discover features they have
not used before.
Some knowledge of Hadoop will be useful for readers and Pig users. If you’re not already familiar with
it or want a quick refresher, “Pig on Hadoop” walks through a very simple example of a Hadoop job.
Small snippets of Java, Python, and SQL are used in parts of this book. Knowledge of these languages is
not required to use Pig, but knowledge of Python and Java will be necessary for some of the more
advanced features. Those with a SQL background may find “Comparing Query and Data Flow
Languages” to be a helpful starting point in understanding the similarities and differences between Pig
Latin and SQL.
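As a small taste of that comparison, here is a hypothetical grouping query expressed both ways. The table, relation, and field names below are invented for illustration and do not come from the book:

```
-- In SQL, one declarative statement:
--   SELECT customer, AVG(amount) FROM purchases GROUP BY customer;

-- In Pig Latin, the same computation as a step-by-step dataflow:
purchases = LOAD 'purchases' AS (customer:chararray, amount:double);
grouped   = GROUP purchases BY customer;
averaged  = FOREACH grouped GENERATE group AS customer, AVG(purchases.amount);
```

Where SQL describes the result you want, Pig Latin describes the sequence of transformations that produces it; “Comparing Query and Data Flow Languages” explores this distinction in depth.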
What’s New in This Edition
The second edition covers Pig 0.10 through Pig 0.16, which is the latest version at the time of writing.
For features introduced before 0.10, we will not call out the initial version of the feature. For newer
features introduced after 0.10, we will point out the version in which the feature was introduced.
Pig runs on both Hadoop 1 and Hadoop 2 for all the versions covered in the book. To simplify our
discussion, in this edition we assume Hadoop 2 is the target platform and point out the differences for
Hadoop 1 wherever applicable.
The second edition has two new chapters: “Pig on Tez” (Chapter 11) and “Use Cases and Programming
Examples” (Chapter 13). Other chapters have also been updated with the latest additions to Pig and
information on existing features not covered in the first edition. These include but are not limited to:
▪ New data types (boolean, datetime, biginteger, bigdecimal) are introduced in Chapter 3.
▪ New UDFs are covered in various places, including support for leveraging Hive UDFs (Chapter 4)
and applying Bloom filters (Chapter 7).
▪ New Pig operators and constructs such as rank, cube, assert, nested foreach and nested
cross, and casting relations to scalars are presented in Chapter 5.
▪ New performance optimizations — map-side aggregation, schema tuples, the shared JAR cache,
auto local and direct fetch modes, etc. — are covered in Chapter 7.
▪ Embedding Pig in scripting languages is covered in Chapter 8 and Chapter 13 (“k-Means”).
We also describe the Pig progress notification listener in Chapter 8.
▪ We look at the new EvalFunc interface in Chapter 9, including the topics of compile-time
evaluation, shipping dependent JARs automatically, and variable-length inputs.
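To give a flavor of the operators listed above, the following sketch shows assert, rank, and cube applied to a hypothetical sales dataset. The relation and field names are invented for illustration; the chapters cited above cover the real syntax and semantics in detail:

```
-- Load a hypothetical sales dataset
sales = LOAD 'sales' AS (product:chararray, region:chararray, amount:double);

-- assert: fail the job if any record has a non-positive amount
ASSERT sales BY amount > 0.0, 'amount must be positive';

-- rank: number the records by amount, highest first
ranked = RANK sales BY amount DESC;

-- cube: aggregate over all combinations of product and region
cubed  = CUBE sales BY CUBE(product, region);
totals = FOREACH cubed GENERATE FLATTEN(group), SUM(cube.amount) AS total;
```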