Intel® Distribution for Apache Hadoop*
Software: Optimization and Tuning Guide
Configuring and managing your Hadoop* environment for performance and cost
Executive Summary
The amount of data being produced every day is growing at an astounding rate. The term "big data" has been coined to represent these new large and complex data sets. Traditional database management tools are no longer a good match for processing and managing big data. Fortunately, there are new tools available, like the Hadoop* framework, that are built to handle the challenge with ease.
This paper provides guidance for optimizing and tuning Intel® Distribution for Apache Hadoop* (Intel® Distribution) software, a big data system optimized to run on Intel processor-based architecture. This guidance is based on benchmark testing done both at Intel and at customer sites. It begins with an introduction to big data and the Intel Distribution software, and then breaks down the Hadoop system into its component layers. The guide then provides tips for hardware and software configuration, followed by tuning best practices geared toward providing optimal performance of the Intel Distribution based on the type of workload.
There are many players involved in configuring and managing a Hadoop environment. Throughout this guide, we've clearly identified which sections are of most interest to various roles in your IT organization.
Introduction
Data is exploding at a phenomenal rate, with worldwide growth predicted to reach 8 zettabytes by 2015. Much of this data is characterized by data sets that are larger, more varied in structure and format, and generated at a faster rate than ever before, a combination often referred to as big data. The analysis of big data presents new challenges for IT, but also exciting opportunities for organizations to gain richer insights into customers, partners, and their business.
The Hadoop platform was designed to solve the challenge of big data, including complex data such as a mixture of unstructured and structured data types. Although the Hadoop framework excels at processing and managing large data sets, many variables must be fine-tuned to achieve optimal performance in each specific Hadoop environment.
Some Hadoop workloads, such as analytical jobs, are CPU intensive, while others, such as extract, transform, load (ETL) jobs, are I/O intensive.
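For example, an I/O-bound ETL job often benefits from compressing intermediate map output, trading spare CPU cycles for reduced disk and network traffic during the shuffle, while a CPU-bound analytical job may gain little from the same setting. A minimal illustrative sketch using the standard Hadoop `mapred-site.xml` property names (the Snappy codec shown here is an assumption for illustration, not a recommendation from this guide; Hadoop 1.x deployments use the older names `mapred.compress.map.output` and `mapred.map.output.compression.codec`):

```xml
<!-- mapred-site.xml: compress intermediate map output to reduce
     disk and network I/O during the shuffle phase of I/O-bound jobs -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

This is one example of a single tuning knob; the point is that the right value depends on whether the workload is bound by CPU or by I/O.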
Configuration and tuning decisions within the Hadoop platform, spanning both hardware and software, should be made based on the type of workload being performed. This optimization guide provides best-practice guidelines for a range of common Hadoop workloads.
Table of Contents
Executive Summary
Introduction
Components of Intel® Distribution for Apache Hadoop* Software
Resource Recommendations
Optimizing and Tuning the Hadoop* System
Configuring and Optimizing the Software Layer
Configuring and Optimizing the Hardware Layer
Benchmarking
Conclusion