it can be used along with systems like Oracle, MySQL, and SQL Server. Each of these systems offers connector components that move data into and out of Hadoop's framework for processing. We will review a few of these components in this chapter and illustrate how they interact with Hadoop.
Business Analytics and Big Data
Business Analytics is the study of data through statistical and operational analysis. Hadoop allows you to conduct operational analysis on its data stores. These results help organizations make better business decisions.
To understand this further, let's build a big data profile. Because of the volume of data involved, the data can be distributed across storage and compute nodes, an arrangement that benefits from using Hadoop. Because the data is distributed rather than centralized, it lacks the characteristics of an RDBMS. This allows you to use large data stores and an assortment of data types with Hadoop.
For example, consider the large data stores behind services like Google, Bing, or Twitter. All of these data stores can grow exponentially with activity, such as queries from a large user base. Hadoop's components can help you process these large data stores.
A business such as Google can use Hadoop to manipulate, manage, and produce meaningful results from its data stores. The traditional tools commonly used for Business Analytics are not designed to work with or analyze extremely large datasets, but Hadoop is a solution that fits these business models.
The Components of Hadoop
Hadoop Common is the foundation of Hadoop, because it contains the primary services and basic processes, such as the abstraction of the underlying operating system and its filesystem. Hadoop Common also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop. The Hadoop Common package even provides source code and documentation, as well as a contribution section. You can't run Hadoop without Hadoop Common.
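To give a feel for what Hadoop Common provides, here is a minimal sketch using its Configuration class, which loads the stack's configuration files from the classpath and exposes them as key/value properties. The hdfs://localhost:9000 address shown is a hypothetical single-node value for illustration, not something your installation is guaranteed to use.

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: Hadoop Common's Configuration class loads the
// framework's settings (core-default.xml, then core-site.xml from
// the classpath) and exposes them as key/value properties.
public class CommonConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // fs.defaultFS names the default filesystem; out of the box
        // it falls back to the local filesystem ("file:///").
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default filesystem: " + defaultFs);

        // Properties can also be set programmatically, which is handy
        // for experiments before editing core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // hypothetical address
        System.out.println("Now pointing at: " + conf.get("fs.defaultFS"));
    }
}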
As with any stack, there are requirements that Apache provides for configuring Hadoop Common. A general understanding of Linux or Unix administration is helpful in setting this up. Hadoop Common, also referred to as the Hadoop Stack, is not designed for a beginner, so the pace of your implementation rests on your experience. In fact, Apache clearly states on its site that Hadoop is not a tool to tackle while you are still learning how to administer a Linux environment. You should be comfortable in that environment before attempting to install Hadoop.
The Distributed File System (HDFS)
With Hadoop Common now installed, it is time to examine the rest of the Hadoop Stack. HDFS delivers a distributed filesystem that is designed to run on commodity hardware. Most businesses find these minimal system requirements appealing. The environment can be set up in a virtual machine (VM) or on a laptop for an initial walkthrough, and later advanced to a server deployment. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large datasets.
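As a small taste of what working with HDFS looks like, the following minimal sketch writes a file into HDFS through the Java FileSystem API and reads back its block size and replication factor. The NameNode address and the /user/demo path are assumptions for illustration, not fixed values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS and inspect it.
// Assumes a NameNode at hdfs://localhost:9000 (adjust for your setup).
public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed address

        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // HDFS splits files into large blocks and replicates them across
        // DataNodes; block size and replication are visible per file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        fs.close();
    }
}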