Preface
I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl-and-
analysis project at Verisign. We were making discoveries similar to those that Doug Cutting and
others at Nutch had made several years earlier about how to efficiently store and manage
terabytes of crawled and analyzed data. At the time, we were getting by with our homegrown
distributed system, but it couldn't support the influx of a new data stream, or the requirement
to join that stream with our crawl data, within the required timeline.
After some research, we came across the Hadoop project, which seemed to be a perfect fit for
our needs—it supported storing large volumes of data and provided a compute mechanism to
combine and process that data. Within a few months, we built and deployed a MapReduce application
encompassing a number of MapReduce jobs, woven together with our own MapReduce
workflow management system, onto a small cluster of 18 nodes. It was a revelation to observe
our MapReduce jobs crunching through our data in minutes. Of course, what we weren’t
expecting was the amount of time that we would spend debugging and performance-tuning our
MapReduce jobs. Not to mention the new roles we took on as production administrators—the
biggest surprise in this role was the number of disk failures we encountered during those first
few months supporting production.
As our experience and comfort level with Hadoop grew, we continued to build more of our
functionality using Hadoop to help with our scaling challenges. We also started to evangelize the
use of Hadoop within our organization and helped kick-start other projects that were also facing
big data challenges.
The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was
relearning how to solve problems with it. MapReduce is its own flavor of parallel programming,
and it’s quite different from the in-JVM programming that we were accustomed to. The first big
hurdle was training our brains to think MapReduce, a topic that the book Hadoop in Action by
Chuck Lam (Manning Publications, 2010) covers well.
After one is used to thinking in MapReduce, the next challenge is typically the logistics of
working with Hadoop, such as how to move data in and out of HDFS and how to work with that
data effectively and efficiently. These areas of Hadoop haven't received much
coverage, and that’s what attracted me to the potential of this book—the chance to go beyond
fundamental word-count Hadoop examples and cover some of the trickier and dirtier aspects
of Hadoop.
As I’m sure many authors have experienced, I went into this project confidently believing that
writing this book was just a matter of transferring my experiences onto paper. Boy, did I get a
reality check, but not altogether an unpleasant one, because writing introduced me to new
approaches and tools that ultimately improved my own Hadoop abilities. I hope that you get
as much out of reading this book as I did writing it.