engine logs, she can run Pig Latin queries over it directly.
She need only provide a function that gives Pig the ability to
parse the content of the file into tuples. There is no need to
go through a time-consuming data import process prior to
running queries, as in conventional database management
systems. Similarly, the output of a Pig program can be
formatted in the manner of the user’s choosing, according
to a user-provided function that converts tuples into a byte
sequence. Hence it is easy to use the output of a Pig analysis
session in a subsequent application, e.g., a visualization or
spreadsheet application such as Excel.
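For instance, a user-provided parsing function can be invoked at load time via the USING clause, and a companion serialization function at store time (the function and file names here are hypothetical illustrations):

```
queries = LOAD 'query_log.txt' USING myParser()
          AS (userId, queryString, timestamp);
-- ... analysis steps ...
STORE results INTO 'output.txt' USING mySerializer();
```

The same mechanism lets the output serializer emit whatever byte format a downstream application expects.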
It is important to keep in mind that Pig is but one of many
applications in the rich “data ecosystem” of a company like
Yahoo! By operating over data residing in external files, and
not taking control of the data, Pig readily interoperates
with other applications in the ecosystem.
The reasons that conventional database systems do re-
quire importing data into system-managed tables are three-
fold: (1) to enable transactional consistency guarantees, (2)
to enable efficient point lookups (via physical tuple identi-
fiers), and (3) to curate the data on behalf of the user, and
record the schema so that other users can make sense of the
data. Pig only supports read-only data analysis workloads,
and those workloads tend to be scan-centric, so transactional
consistency and index-based lookups are not required. Also,
in our environment users often analyze a temporary data set
for a day or two, and then discard it, so data curating and
schema management can be overkill.
In Pig, stored schemas are strictly optional. Users may
supply schema information on the fly, or perhaps not at all.
Thus, in Example 1, if the user knows that the third field of
the file that stores the urls table is pagerank but does not
want to provide the schema, the first line of the Pig Latin
program can be written as:
good_urls = FILTER urls BY $2 > 0.2;
where $2 uses positional notation to refer to the third field.
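Alternatively, the user can name the fields on the fly by attaching a schema at load time, and then refer to them by name (the file name here is hypothetical):

```
urls = LOAD 'urls.txt' AS (url, category, pagerank);
good_urls = FILTER urls BY pagerank > 0.2;
```

No stored schema is consulted in either case; the schema, if any, lives only in the program.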
2.3 Nested Data Model
Programmers often think in terms of nested data struc-
tures. For example, to capture information about the posi-
tional occurrences of terms in a collection of documents, a
programmer would not think twice about creating a struc-
ture of the form Map< documentId, Set<positions> > for
each term.
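In Java, for example, such a nested per-term structure might be sketched as follows (class and method names are illustrative, not part of Pig):

```java
import java.util.*;

// For each term, map a document id to the set of positions
// at which the term occurs in that document:
// Map<documentId, Set<positions>>, keyed by term.
public class TermIndex {
    private final Map<String, Map<Integer, Set<Integer>>> index = new HashMap<>();

    // Record that `term` occurs at `position` in document `docId`.
    public void addOccurrence(String term, int docId, int position) {
        index.computeIfAbsent(term, t -> new HashMap<>())
             .computeIfAbsent(docId, d -> new TreeSet<>())
             .add(position);
    }

    // Positions of `term` in `docId` (empty if none recorded).
    public Set<Integer> positions(String term, int docId) {
        return index.getOrDefault(term, Map.of())
                    .getOrDefault(docId, Set.of());
    }
}
```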
Databases, on the other hand, allow only flat tables, i.e.,
only atomic fields as columns, unless one is willing to violate
the First Normal Form (1NF) [7]. To capture the same in-
formation about terms above, while conforming to 1NF, one
would need to normalize the data by creating two tables:
term_info: (termId, termString, ...)
position_info: (termId, documentId, position)
The same positional occurrence information can then be
reconstructed by joining these two tables on termId and
grouping on termId, documentId.
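The regrouping step can be sketched in plain Java as follows; this is only an illustration of the reconstruction work the database must do (the join with term_info on termId is analogous, and all names here are ours, not Pig's):

```java
import java.util.*;
import java.util.stream.*;

public class Reconstruct {
    // One flattened 1NF row, as in position_info(termId, documentId, position).
    public record PositionInfo(int termId, int docId, int position) {}

    // Rebuild the nested Map<termId, Map<documentId, Set<position>>> form
    // by grouping the flat rows on termId, then on documentId.
    public static Map<Integer, Map<Integer, Set<Integer>>> regroup(List<PositionInfo> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                PositionInfo::termId,
                Collectors.groupingBy(
                        PositionInfo::docId,
                        Collectors.mapping(PositionInfo::position,
                                Collectors.toCollection(TreeSet::new)))));
    }
}
```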
Pig Latin has a flexible, fully nested data model (described
in Section 3.1), and allows complex, non-atomic data types
such as set, map, and tuple to occur as fields of a table.
There are several reasons why a nested model is more ap-
propriate for our setting than 1NF:
• A nested data model is closer to how programmers think,
and consequently much more natural to them than nor-
malization.
• Data is often stored on disk in an inherently nested fash-
ion. For example, a web crawler might output for each
url, the set of outlinks from that url. Since Pig oper-
ates directly on files (Section 2.2), separating the data
out into normalized form, and later recombining through
joins can be prohibitively expensive for web-scale data.
• A nested data model also allows us to fulfill our goal of
having an algebraic language (Section 2.1), where each
step carries out only a single data transformation. For
example, each tuple output by our GROUP primitive has
one non-atomic field: a nested set of tuples from the
input that belong to that group. The GROUP construct is
explained in detail in Section 3.5.
• A nested data model allows programmers to easily write
a rich set of user-defined functions, as shown in the next
section.
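To illustrate the third point with the urls table of Example 1: GROUP emits one tuple per category whose second field is the nested set of input tuples belonging to that group (the sample tuples below are hypothetical, and the exact output schema of GROUP is given in Section 3.5):

```
grouped = GROUP urls BY category;
-- a tuple of grouped might look like:
-- (news, {(cnn.com, news, 0.9), (bbc.com, news, 0.8)})
```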
2.4 UDFs as First-Class Citizens
A significant part of the analysis of search logs, crawl data,
click streams, etc., is custom processing. For example, a user
may be interested in performing natural language stemming
of a search term, or figuring out whether a particular web
page is spam, and countless other tasks.
To accommodate specialized data processing tasks, Pig
Latin has extensive support for user-defined functions
(UDFs). Essentially all aspects of processing in Pig Latin in-
cluding grouping, filtering, joining, and per-tuple processing
can be customized through the use of UDFs.
The input and output of UDFs in Pig Latin follow our
flexible, fully nested data model. Consequently, a UDF to
be used in Pig Latin can take non-atomic parameters as
input, and also output non-atomic values. This flexibility is
often very useful as shown by the following example.
Example 2. Continuing with the setting of Example 1,
suppose we want to find for each category, the top 10 urls
according to pagerank. In Pig Latin, one can simply write:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
         category, top10(urls);
where top10() is a UDF that accepts a set of urls (one
group at a time), and outputs a set containing the top 10
urls by pagerank for that group.² Note that our final output
in this case contains non-atomic fields: there is a tuple for
each category, and one of the fields of the tuple is the set of
the top 10 urls in that category.
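The core logic of such a UDF might be sketched in plain Java as follows (the record and method names are ours, and a real Pig UDF would additionally implement Pig's UDF interface; following the footnote, we parameterize by k rather than fixing it at 10):

```java
import java.util.*;
import java.util.stream.*;

public class TopUrls {
    // One row of the urls table: (url, category, pagerank).
    public record Url(String url, String category, double pagerank) {}

    // Return the top k urls of one group, ordered by descending pagerank.
    // The paper's top10() corresponds to topK(group, 10).
    public static List<Url> topK(Collection<Url> group, int k) {
        return group.stream()
                .sorted(Comparator.comparingDouble(Url::pagerank).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Because the UDF both consumes and produces a non-atomic value (a set of tuples), no flattening or re-joining is needed around the call site.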
Due to our flexible data model, the return type of a UDF
does not restrict the context in which it can be used. Pig
Latin has only one type of UDF that can be used in all the
constructs such as filtering, grouping, and per-tuple process-
ing. This is in contrast to SQL, where only scalar functions
may be used in the SELECT clause, set-valued functions can
only appear in the FROM clause, and aggregation functions
can only be applied in conjunction with a GROUP BY or a
PARTITION BY.
Currently, Pig UDFs are written in Java. We are building
support for interfacing with UDFs written in arbitrary lan-
²In practice, a user would probably write a more generic
function than top10(): one that takes k as a parameter to
find the top k tuples, and also the field according to which
the top k must be found (pagerank in this example).