Roadmap
This book has 13 chapters divided into five parts.
Part 1 contains a single chapter that introduces this book. It reviews Hadoop basics and looks at how to get Hadoop up and running on a single host. It wraps up with a walk-through of how to write and execute a MapReduce job.
Part 2, “Data logistics,” consists of two chapters that cover the techniques and tools required to deal with data fundamentals: getting data in and out of Hadoop, and working with various data formats. Getting data into Hadoop is one of the
first roadblocks commonly encountered when working with Hadoop, and chapter 2
is dedicated to looking at a variety of tools that work with common enterprise data
sources. Chapter 3 covers how to work with ubiquitous data formats such as XML and JSON in MapReduce, before going on to look at data formats better suited to working with big data.
Part 3 is called “Big data patterns,” and looks at techniques to help you work effec-
tively with large volumes of data. Chapter 4 examines how to optimize MapReduce
join and sort operations, and chapter 5 covers working with large numbers of small files, as well as compression. Chapter 6 looks at how to debug MapReduce performance
issues, and also covers a number of techniques to help make your jobs run faster.
Part 4 is all about “Data science,” and delves into the tools and methods that help
you make sense of your data. Chapter 7 covers how to represent data such as graphs
for use with MapReduce, and looks at several algorithms that operate on graph data.
Chapter 8 describes how R, a popular statistical and data mining platform, can be inte-
grated with Hadoop. Chapter 9 describes how Mahout can be used in conjunction
with MapReduce for massively scalable predictive analytics.
Part 5 is titled “Taming the elephant,” and examines a number of technologies
that make it easier to work with MapReduce. Chapters 10 and 11 cover Hive and Pig
respectively, both of which are MapReduce domain-specific languages (DSLs) geared toward providing high-level abstractions. Chapter 12 looks at Crunch and Cascading, which
are Java libraries that offer their own MapReduce abstractions, and chapter 13 covers
techniques to help write unit tests, and to debug MapReduce problems.
The appendixes start with appendix A, which provides instructions for installing Hadoop and the other related technologies covered in the book. Appendix B covers low-level Hadoop ingress/egress mechanisms that the tools covered in chapter 2
leverage. Appendix C looks at how
HDFS supports reads and writes, and appendix D
covers a couple of MapReduce join frameworks written by the author and utilized in
chapter 4.
Code conventions and downloads
All source code in listings or in text is in a fixed-width font like this to separate it
from ordinary text. Code annotations accompany many of the listings, highlighting
important concepts.