and generally improve in quality with more of it. Knowing more about a problem space
generally leads to better decisions (or algorithm efficacy), which in turn leads to happier
users, more money, reduced fraud, healthier people, safer conditions, or whatever the
desired result might be.
Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infra-
structure for building many of the types of applications described earlier. Made up of
a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a com-
putation layer that implements a processing paradigm called MapReduce, Hadoop is
an open source, batch data processing system for enormous amounts of data. We live
in a flawed world, and Hadoop is designed to survive in it by not only tolerating hard-
ware and software failures, but also treating them as first-class conditions that happen
regularly. Hadoop uses a cluster of plain old commodity servers with no specialized
hardware or network infrastructure to form a single, logical, storage and compute plat-
form, or cluster, that can be shared by multiple individuals or groups. Computation in
Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction
for developers that obviates complex synchronization and network programming. Un-
like many other distributed data processing systems, Hadoop runs the user-provided
processing logic on the machine where the data lives rather than dragging the data
across the network, a huge win for performance.
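To make that programming model concrete, the sketch below shows the canonical word count job written against the org.apache.hadoop.mapreduce API. The class names (WordCount, WordCountMapper, WordCountReducer) and the whitespace tokenization are illustrative choices for this sketch, not an example drawn from this book.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A minimal word count sketch: the developer supplies only these two small
// functions, and the framework handles parallel execution across the cluster.
public class WordCount {

  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Emit a (word, 1) pair for every token in the input line.
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      // All counts for a given word arrive at the same reducer; sum them.
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}

A small driver class would configure the job's input and output paths and submit it to the cluster; splitting the input, scheduling map tasks near their data, shuffling and sorting intermediate pairs, and rerunning failed tasks are all handled by the framework rather than by the developer.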
For those interested in the history, Hadoop was modeled after two papers produced
by Google, one of the many companies to have these kinds of data-intensive processing
problems. The first, presented in 2003, describes a pragmatic, scalable, distributed
filesystem optimized for storing enormous datasets, called the Google Filesystem, or
GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive,
distributed processing applications. The following year, another paper, titled "MapReduce:
Simplified Data Processing on Large Clusters," was presented, defining a pro-
gramming model and accompanying framework that provided automatic paralleliza-
tion, fault tolerance, and the scale to process hundreds of terabytes of data in a single
job over thousands of machines. When paired, these two systems could be used to build
large data processing clusters on relatively inexpensive, commodity machines. These
papers directly inspired the development of HDFS and Hadoop MapReduce, respec-
tively.
Interest and investment in Hadoop have led to an entire ecosystem of related software,
both open source and commercial. Within the Apache Software Foundation alone,
projects that explicitly make use of, or integrate with, Hadoop are springing up regu-
larly. Some of these projects make authoring MapReduce jobs easier and more acces-
sible, while others focus on getting data in and out of HDFS, simplify operations, enable
deployment in cloud environments, and so on. Here is a sampling of the more popular
projects with which you should familiarize yourself:
Apache Hive
Hive creates a relational database-style abstraction that allows developers to write
a dialect of SQL, which in turn is executed as one or more MapReduce jobs on the
cluster.