engine logs, she can run Pig Latin queries over it directly.
She need only provide a function that gives Pig the ability to
parse the content of the file into tuples. There is no need to
go through a time-consuming data import process prior to
running queries, as in conventional database management
systems. Similarly, the output of a Pig program can be
formatted in the manner of the user’s choosing, according
to a user-provided function that converts tuples into a byte
sequence. Hence it is easy to use the output of a Pig analysis
session in a subsequent application, e.g., a visualization or
spreadsheet application such as Excel.
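For instance, a user-provided parsing function can be invoked at load time via the USING clause, and a companion serialization function at store time (the function and file names here are hypothetical illustrations):

```
queries = LOAD 'query_log.txt' USING myParser()
          AS (userId, queryString, timestamp);
-- ... analysis steps ...
STORE results INTO 'output.txt' USING mySerializer();
```

The same mechanism lets the output serializer emit whatever byte format a downstream application expects.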
It is important to keep in mind that Pig is but one of many
applications in the rich “data ecosystem” of a company like
Yahoo! By operating over data residing in external files, and
not taking control of the data, Pig readily interoperates
with other applications in the ecosystem.
The reasons that conventional database systems do re-
quire importing data into system-managed tables are three-
fold: (1) to enable transactional consistency guarantees, (2)
to enable efficient point lookups (via physical tuple identi-
fiers), and (3) to curate the data on behalf of the user, and
record the schema so that other users can make sense of the
data. Pig only supports read-only data analysis workloads,
and those workloads tend to be scan-centric, so transactional
consistency and index-based lookups are not required. Also,
in our environment users often analyze a temporary data set
for a day or two, and then discard it, so data curating and
schema management can be overkill.
In Pig, stored schemas are strictly optional. Users may
supply schema information on the fly, or perhaps not at all.
Thus, in Example 1, if the user knows that the third field of
the file that stores the urls table is pagerank but does not
want to provide the schema, the first line of the Pig Latin
program can be written as:
good_urls = FILTER urls BY $2 > 0.2;
where $2 uses positional notation to refer to the third field.
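Alternatively, the user can name the fields on the fly by attaching a schema at load time, and then refer to them by name (the file name here is hypothetical):

```
urls = LOAD 'urls.txt' AS (url, category, pagerank);
good_urls = FILTER urls BY pagerank > 0.2;
```

No stored schema is consulted in either case; the schema, if any, lives only in the program.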
2.3 Nested Data Model
Programmers often think in terms of nested data struc-
tures. For example, to capture information about the posi-
tional occurrences of terms in a collection of documents, a
programmer would not think twice about creating a struc-
ture of the form Map< documentId, Set<positions> > for
each term.
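In Java, for example, such a nested per-term structure might be sketched as follows (class and method names are illustrative, not part of Pig):

```java
import java.util.*;

// For each term, map a document id to the set of positions
// at which the term occurs in that document:
// Map<documentId, Set<positions>>, keyed by term.
public class TermIndex {
    private final Map<String, Map<Integer, Set<Integer>>> index = new HashMap<>();

    // Record that `term` occurs at `position` in document `docId`.
    public void addOccurrence(String term, int docId, int position) {
        index.computeIfAbsent(term, t -> new HashMap<>())
             .computeIfAbsent(docId, d -> new TreeSet<>())
             .add(position);
    }

    // Positions of `term` in `docId` (empty if none recorded).
    public Set<Integer> positions(String term, int docId) {
        return index.getOrDefault(term, Map.of())
                    .getOrDefault(docId, Set.of());
    }
}
```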
Databases, on the other hand, allow only flat tables, i.e.,
only atomic fields as columns, unless one is willing to violate
the First Normal Form (1NF) [7]. To capture the same in-
formation about terms above, while conforming to 1NF, one
would need to normalize the data by creating two tables:
term_info: (termId, termString, ...)
position_info: (termId, documentId, position)
The same positional occurrence information can then be
reconstructed by joining these two tables on termId and
grouping on termId, documentId.
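The regrouping step can be sketched in plain Java as follows; this is only an illustration of the reconstruction work the database must do (the join with term_info on termId is analogous, and all names here are ours, not Pig's):

```java
import java.util.*;
import java.util.stream.*;

public class Reconstruct {
    // One flattened 1NF row, as in position_info(termId, documentId, position).
    public record PositionInfo(int termId, int docId, int position) {}

    // Rebuild the nested Map<termId, Map<documentId, Set<position>>> form
    // by grouping the flat rows on termId, then on documentId.
    public static Map<Integer, Map<Integer, Set<Integer>>> regroup(List<PositionInfo> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                PositionInfo::termId,
                Collectors.groupingBy(
                        PositionInfo::docId,
                        Collectors.mapping(PositionInfo::position,
                                Collectors.toCollection(TreeSet::new)))));
    }
}
```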
Pig Latin has a flexible, fully nested data model (described
in Section 3.1), and allows complex, non-atomic data types
such as set, map, and tuple to occur as fields of a table.
There are several reasons why a nested model is more ap-
propriate for our setting than 1NF:
• A nested data model is closer to how programmers think,
and consequently much more natural to them than nor-
malization.
• Data is often stored on disk in an inherently nested fash-
ion. For example, a web crawler might output for each
url, the set of outlinks from that url. Since Pig oper-
ates directly on files (Section 2.2), separating the data
out into normalized form, and later recombining through
joins can be prohibitively expensive for web-scale data.
• A nested data model also allows us to fulfill our goal of
having an algebraic language (Section 2.1), where each
step carries out only a single data transformation. For
example, each tuple output by our GROUP primitive has
one non-atomic field: a nested set of tuples from the
input that belong to that group. The GROUP construct is
explained in detail in Section 3.5.
• A nested data model allows programmers to easily write
a rich set of user-defined functions, as shown in the next
section.
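To illustrate the third point with the urls table of Example 1: GROUP emits one tuple per category whose second field is the nested set of input tuples belonging to that group (the sample tuples below are hypothetical, and the exact output schema of GROUP is given in Section 3.5):

```
grouped = GROUP urls BY category;
-- a tuple of grouped might look like:
-- (news, {(cnn.com, news, 0.9), (bbc.com, news, 0.8)})
```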
2.4 UDFs as First-Class Citizens
A significant part of the analysis of search logs, crawl data,
click streams, etc., is custom processing. For example, a user
may be interested in performing natural language stemming
of a search term, or figuring out whether a particular web
page is spam, and countless other tasks.
To accommodate specialized data processing tasks, Pig
Latin has extensive support for user-defined functions
(UDFs). Essentially all aspects of processing in Pig Latin in-
cluding grouping, filtering, joining, and per-tuple processing
can be customized through the use of UDFs.
The input and output of UDFs in Pig Latin follow our
flexible, fully nested data model. Consequently, a UDF to
be used in Pig Latin can take non-atomic parameters as
input, and also output non-atomic values. This flexibility is
often very useful as shown by the following example.
Example 2. Continuing with the setting of Example 1,
suppose we want to find for each category, the top 10 urls
according to pagerank. In Pig Latin, one can simply write:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
         category, top10(urls);
where top10() is a UDF that accepts a set of urls (one
group at a time), and outputs a set containing the top 10
urls by pagerank for that group.² Note that our final output
in this case contains non-atomic fields: there is a tuple for
each category, and one of the fields of the tuple is the set of
the top 10 urls in that category.
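The core logic of such a UDF might be sketched in plain Java as follows (the record and method names are ours, and a real Pig UDF would additionally implement Pig's UDF interface; following the footnote, we parameterize by k rather than fixing it at 10):

```java
import java.util.*;
import java.util.stream.*;

public class TopUrls {
    // One row of the urls table: (url, category, pagerank).
    public record Url(String url, String category, double pagerank) {}

    // Return the top k urls of one group, ordered by descending pagerank.
    // The paper's top10() corresponds to topK(group, 10).
    public static List<Url> topK(Collection<Url> group, int k) {
        return group.stream()
                .sorted(Comparator.comparingDouble(Url::pagerank).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Because the UDF both consumes and produces a non-atomic value (a set of tuples), no flattening or re-joining is needed around the call site.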
Due to our flexible data model, the return type of a UDF
does not restrict the context in which it can be used. Pig
Latin has only one type of UDF that can be used in all the
constructs such as filtering, grouping, and per-tuple process-
ing. This is in contrast to SQL, where only scalar functions
may be used in the SELECT clause, set-valued functions can
only appear in the FROM clause, and aggregation functions
can only be applied in conjunction with a GROUP BY or a
PARTITION BY.
Currently, Pig UDFs are written in Java. We are building
support for interfacing with UDFs written in arbitrary lan-
²In practice, a user would probably write a more generic
function than top10(): one that takes k as a parameter to
find the top k tuples, and also the field according to which
the top k must be found (pagerank in this example).