ABOUT THIS BOOK
Roadmap
This book has 10 chapters divided into four parts.
Part 1 contains two chapters that form the introduction to this book. They review
Hadoop basics and look at how to get Hadoop up and running on a single host. YARN,
which is new in Hadoop version 2, is also examined, and some operational tips are
provided for performing basic functions in YARN.
Part 2, “Data logistics,” consists of three chapters that cover the techniques and
tools required to deal with data fundamentals: how to work with various data formats,
how to organize and optimize your data, and how to get data into and out of Hadoop.
Picking the right format for your data and determining how to organize data in HDFS
are the first items you’ll need to address when working with Hadoop, and they’re
covered in chapters 3 and 4, respectively. Getting data into Hadoop is one of the
bigger hurdles commonly encountered when working with Hadoop, and chapter 5 is
dedicated to looking at a variety of tools that work with common enterprise data sources.
Part 3 is called “Big data patterns,” and it looks at techniques to help you work
effectively with large volumes of data. Chapter 6 covers how to represent data such as
graphs for use with MapReduce, and it looks at several algorithms that operate on graph
data. Chapter 7 looks at more advanced data structures and algorithms such as graph
processing and using HyperLogLog for working with large datasets. Chapter 8 looks at how
to debug and test MapReduce jobs and tune their performance, and it also covers a number
of techniques to help make your jobs run faster.
Part 4 is titled “Beyond MapReduce,” and it examines a number of technologies
that make it easier to work with Hadoop. Chapter 9 covers the most prevalent and
promising SQL technologies for data processing on Hadoop, and Hive, Impala, and
Spark SQL are examined. The final chapter looks at how to write your own YARN
application, and it provides some insights into some of the more advanced features you
can use in your applications.
The appendix covers instructions for the source code that accompanies this book,
as well as installation instructions for Hadoop and all the other related technologies
covered in the book.
Finally, there are two bonus chapters available from the publisher’s website at
www.manning.com/HadoopinPracticeSecondEdition: chapter 11, “Integrating R and
Hadoop for statistics and more,” and chapter 12, “Predictive analytics with Mahout.”
What’s new in the second edition?
This second edition covers Hadoop 2, which at the time of writing is the current
production-ready version of Hadoop. The first edition of the book covered Hadoop 0.22
(Hadoop 1 wasn’t yet out). Since then, Hadoop 2 has turned the world upside down and
opened up the Hadoop platform to processing paradigms beyond MapReduce. YARN,
the new scheduler and application manager in Hadoop 2, is complex and new to the
community, which prompted me to dedicate a new chapter 2 to covering YARN basics
and to discussing how MapReduce now functions as a YARN application.