Big Data Made Easy: A Detailed Guide to the Hadoop Tools


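Big Data Made Easy is a professional English-language book written by Michael Frampton and published by Apress, aimed at readers who want an in-depth understanding of the Hadoop-based big data toolset. It introduces the core concepts and techniques of big data processing in an accessible way, and is intended particularly for readers who want to build a solid foundation in the field.

The book begins with the problems that big data poses, helping readers understand why large volumes of complex data need to be processed and why the Hadoop ecosystem matters. Chapter 1 explores the data problem and its solutions, showing that big data processing is not only a technical problem but also a key to business insight. Chapter 2 explains how to use Hadoop, YARN (Yet Another Resource Negotiator), and ZooKeeper to store and configure data: Hadoop is the foundation for distributed computing, YARN handles resource management and task scheduling, and ZooKeeper provides coordination services for distributed systems, ensuring data consistency.

Chapters 3 and 4 focus on data collection and processing. They use Nutch (a web crawler) and Solr (a full-text search tool) to collect data, then process it at scale with the MapReduce model, demonstrating the power of distributed computing. Chapter 5 discusses data scheduling and workflow management so that complex analysis tasks run efficiently, covering task allocation, dependency management, and parallel processing. Chapter 6 concentrates on moving data, both within a Hadoop cluster and between Hadoop and other systems, which is essential for data integration and for building a distributed data warehouse. Chapter 7 introduces data monitoring, helping readers understand the state of the system, identify performance bottlenecks, and optimize data processing through log analysis and metric tracking. Chapter 8 covers cluster management, including hardware selection, configuration tuning, and failure recovery, to keep a Hadoop cluster running reliably.

Chapters 9 and 10 concentrate on data analysis and the ETL (Extract, Transform, Load) process, explaining how to use Hadoop for in-depth analysis, extract valuable information, and clean and transform data into a form suitable for analysis. The final chapter looks at Hadoop for reporting and visualization: how to turn processed data into readable reports that support business decisions.

Big Data Made Easy contains many practical cases, and every chapter includes working code examples so that readers can quickly apply what they learn. The author uses CentOS as the main operating system platform, so readers can follow the tutorials in a common Linux environment. For beginners and for developers with some experience, the book is a valuable resource for understanding and mastering big data processing.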

CHAPTER 2

Storing and Configuring Data with Hadoop, YARN, and ZooKeeper

This chapter introduces Hadoop versions V1 and V2, laying the groundwork for the chapters that follow. Specifically, you first will source the V1 software, install it, and then configure it. You will test your installation by running a simple word-count Map Reduce task. As a comparison, you will then do the same for V2, as well as install a ZooKeeper quorum. You will then learn how to access ZooKeeper via its commands and client to examine the data that it stores. Lastly, you will learn about the Hadoop command set in terms of shell, user, and administration commands. The Hadoop installation that you create here will be used for storage and processing in subsequent chapters, when you will work with Apache tools like Nutch and Pig.
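As a rough local analogy only (this is not Hadoop code and not the book's example), the map, shuffle, and reduce phases of a word count can be sketched as a shell pipeline. The function names wc_map, wc_shuffle, wc_reduce, and word_count are hypothetical and exist only for this illustration.

```shell
# map: split the input into one lowercase word per line
wc_map() {
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# shuffle: sort so that identical words (keys) become adjacent
wc_shuffle() {
  sort
}

# reduce: count each run of identical words and print "word count"
wc_reduce() {
  uniq -c | awk '{ print $2, $1 }'
}

# the whole job: map, then shuffle, then reduce
word_count() {
  wc_map | wc_shuffle | wc_reduce
}

# Example (reads text on standard input):
#   printf 'The cat and the hat\n' | word_count
```

On a real cluster the three phases run in parallel across many nodes, whereas this pipeline runs on one machine; only the shape of the computation is the same.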

An Overview of Hadoop

Apache Hadoop is available as three download types via the hadoop.apache.org website. The releases are named as follows:

• Hadoop-1.2.1

• Hadoop-0.23.10

• Hadoop-2.3.0

The first release relates to Hadoop V1, while the second two relate to Hadoop V2. There are two different release types for V2 because the version that is numbered 0.xx is missing extra components such as name node (NN) high availability (HA). Because V1 and V2 have different architectures and are installed differently, I examine Hadoop V1 first and then Hadoop V2 (YARN). In the next section, I will give an overview of each version and then move on to the interesting stuff, such as how to source and install both.

Because I have only a single small cluster available for the development of this book, I install the different versions of Hadoop and its tools on the same cluster nodes. If any action is carried out for the sake of demonstration, which would otherwise be dangerous from a production point of view, I will flag it. This is important because, in a production system, when you are upgrading, you want to be sure that you retain all of your data. However, for demonstration purposes, I will be upgrading and downgrading periodically.

So, in general terms, what is Hadoop? Here are some of its characteristics:

• It is an open-source system developed by Apache in Java.

• It is designed to handle very large data sets.

• It is designed to scale to very large clusters.

• It is designed to run on commodity hardware.


The cluster-level Job Tracker handles client requests via a Map Reduce (MR) API. The clients need only process via the MR API, as the Map Reduce framework and system handle the scheduling, resources, and failover in the event of a crash. Job Tracker handles jobs via data node–based Task Trackers that manage the actual tasks or processes. Job Tracker manages the whole client-requested job, passing subtasks to individual slave nodes and monitoring their availability and the tasks’ completion.

Hadoop V1 scales only to clusters of around 4,000 to 5,000 nodes, and there are also limitations on the number of concurrent processes that can run. It has only a single processing type, Map Reduce, which, although powerful, does not allow for requirements like graph or real-time processing.

The Differences in Hadoop V2

With YARN, Hadoop V2’s Job Tracker has been split into a master Resource Manager and slave-based Application Master processes. It separates the major tasks of the Job Tracker: resource management and monitoring/scheduling. The Job History server now has the function of providing information about completed jobs. The Task Tracker has been replaced by a slave-based Node Manager, which handles slave node–based resources and manages tasks on the node. The actual tasks reside within containers launched by the Node Manager. The Map Reduce function is controlled by the Application Master process, while the tasks themselves may be either Map or Reduce tasks.

Hadoop V2 also offers the ability to use non-Map Reduce processing, like Apache Giraph for graph processing, or Impala for data query. Resources on YARN can be shared among all three processing systems.

Figure 2-2 shows client task requests being sent to the global Resource Manager and the slave-based Node Managers launching containers, which hold the actual tasks. The Node Managers also monitor the containers’ resource usage. The Application Master requests containers from the scheduler and receives status updates from the container-based Map Reduce tasks.

Figure 2-2. Hadoop V2 architecture

This architecture enables Hadoop V2 to scale to much larger clusters and provides the ability to have a higher number of concurrent processes. It also now offers the ability, as mentioned earlier, to run different types of processes concurrently, not just Map Reduce.


This is an introduction to the Hadoop V1 and V2 architectures. You might have the opportunity to work with both versions, so I give examples for installation and use of both. The architectures are obviously different, as seen in Figures 2-1 and 2-2, and so the actual installation/build and usage differ as well. For example, for V1 you will carry out a manual install of the software while for V2 you will use the Cloudera software stack, which is described next.

The Hadoop Stack

Before we get started with the Hadoop V1 and V2 installations, it is worth discussing the work of companies like Cloudera and Hortonworks. They have built stacks of Hadoop-related tools that have been tested for interoperability. Although I describe how to carry out a manual installation of software components for V1, I show how to use one of the software stacks for the V2 install.

When you’re trying to use multiple Hadoop platform tools together in a single stack, it is important to know what versions will work together without error. If, for instance, you are using ten tools, then the task of tracking compatible version numbers quickly becomes complex. Luckily there are a number of Hadoop stacks available. Suppliers can provide a single tested package that you can download. Two of the major companies in this field are Cloudera and Hortonworks. Apache Bigtop, a testing suite that I will demonstrate in Chapter 8, is also used as the base for the Cloudera Hadoop stack.

Table 2-1 shows the current stacks from these companies, listing components and versions of tools that are compatible at the time of this writing.

Table 2-1. Hadoop Stack Tool Version Details

Tool         Cloudera CDH 4.6.0   Hortonworks Data Platform 2.0
Ambari       -                    1.4.4
DataFu       0.0.4                -
Flume        1.4.0                1.4.0
Hadoop       2.0.0                2.2.0
HCatalog     0.5.0                0.12.0
HBase        0.94                 0.96.1
Hive         0.10.0               0.12.0
Hue          2.5.0                2.3.0
Mahout       0.7                  0.8.0
Oozie        3.3.2                4.0.0
Parquet      1.2.5                -
Pig          0.11                 0.12.0
Sentry       1.1.0                -
Sqoop        1.4.3                1.4.4
Sqoop2       1.99.2               -
Whirr        0.8.2                -
ZooKeeper    3.4.5                3.4.5
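To make the version-tracking problem concrete, here is a minimal sketch that stores a few rows of Table 2-1 as text and queries them. Only tools that appear with two version numbers in the table are included. The function names stack_versions, tool_version, and differing_tools, and the short stack labels cdh and hdp, are hypothetical.

```shell
# A subset of Table 2-1: tool, Cloudera CDH 4.6.0 version, Hortonworks HDP 2.0 version
stack_versions() {
  cat <<'EOF'
Flume 1.4.0 1.4.0
Hadoop 2.0.0 2.2.0
HBase 0.94 0.96.1
Hive 0.10.0 0.12.0
Pig 0.11 0.12.0
ZooKeeper 3.4.5 3.4.5
EOF
}

# usage: tool_version <tool> <cdh|hdp>
# Prints the version of a tool in the chosen stack.
tool_version() {
  local col
  case "$2" in
    cdh) col=2 ;;
    hdp) col=3 ;;
    *) echo "unknown stack: $2" >&2; return 1 ;;
  esac
  stack_versions | awk -v tool="$1" -v col="$col" '$1 == tool { print $col }'
}

# Prints the tools whose versions differ between the two stacks.
# Appending "" forces awk to compare the versions as strings, not numbers.
differing_tools() {
  stack_versions | awk '($2 "") != ($3 "") { print $1 }'
}
```

For example, `tool_version Hive hdp` prints 0.12.0. Even with six tools, four versions differ between the stacks, which is the bookkeeping that a tested stack takes off your hands.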


While I use a Hadoop stack in the rest of the book, here I will show the process of downloading, installing, configuring, and running Hadoop V1 so that you will be able to compare the use of V1 and V2.

Environment Management

Before I move into the Hadoop V1 and V2 installations, I want to point out that I am installing both Hadoop V1 and V2 on the same set of servers. Hadoop V1 is installed under /usr/local while Hadoop V2 is installed as a Cloudera CDH release and so will have a defined set of directories:

• Logging under /var/log; for example, /var/log/hadoop-hdfs/

• Configuration under /etc/hadoop/conf/

• Executables defined as services under /etc/init.d/; for example, hadoop-hdfs-namenode

I have also created two sets of .bashrc environment configuration files for the Linux Hadoop user account:

[hadoop@hc1nn ~]$ pwd
/home/hadoop
[hadoop@hc1nn ~]$ ls -l .bashrc*
lrwxrwxrwx. 1 hadoop hadoop 16 Jun 30 17:59 .bashrc -> .bashrc_hadoopv2
-rw-r--r--. 1 hadoop hadoop 1586 Jun 18 17:08 .bashrc_hadoopv1
-rw-r--r--. 1 hadoop hadoop 1588 Jul 27 11:33 .bashrc_hadoopv2

By switching the .bashrc symbolic link between the Hadoop V1 (.bashrc_hadoopv1) and V2 (.bashrc_hadoopv2) files, I can quickly navigate between the two environments. Each installation has a completely separate set of resources. This approach enables me to switch between Hadoop versions on my single set of testing servers while writing this guide. From a production viewpoint, however, you would install only one version of Hadoop at a time.
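The symbolic link switch described above could be wrapped in a small helper. This is a sketch under assumptions: the file names .bashrc_hadoopv1 and .bashrc_hadoopv2 come from the listing, but the function name switch_hadoop_env and its optional directory argument are hypothetical and not part of the book.

```shell
# usage: switch_hadoop_env <v1|v2> [directory]
# Points .bashrc at the environment file for the requested Hadoop version.
switch_hadoop_env() {
  local version="$1"             # v1 or v2
  local dir="${2:-$HOME}"        # directory that holds the .bashrc files
  local target=".bashrc_hadoop${version}"

  # Refuse to create a dangling link if the environment file is missing
  if [ ! -f "${dir}/${target}" ]; then
    echo "no such environment file: ${dir}/${target}" >&2
    return 1
  fi

  # -s: symbolic link, -f: replace the existing link, -n: do not follow it
  ln -sfn "${target}" "${dir}/.bashrc"
}
```

After `switch_hadoop_env v1`, a new login shell for the hadoop user would read the V1 settings. The link target is written as a relative name, matching the 16-character link shown in the listing.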

Hadoop V1 Installation

Before you attempt to install Hadoop, you must ensure that Java 1.6.x is installed and that SSH (secure shell) is installed and running. The master name node must be able to create an SSH session to reach each of its data nodes without using a password in order to manage them. On CentOS, you can install SSH via the root account as follows:

yum install openssh-server

This will install the secure shell daemon process. Repeat this installation on all of your servers, then start the service (as root):

service sshd restart
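The Java prerequisite can be checked on each server with a short script. The sketch below assumes the version banner has the usual form `java version "1.6.0_45"`; the helper names java_version_from and is_java_16 are hypothetical.

```shell
# Reads a `java -version` banner on standard input and prints the quoted
# version string, for example 1.6.0_45.
java_version_from() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeeds only if the given version string belongs to the 1.6.x series.
is_java_16() {
  case "$1" in
    1.6.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example; `java -version` writes its banner to standard error, hence 2>&1:
#   v="$(java -version 2>&1 | java_version_from)"
#   is_java_16 "$v" && echo "Java $v is suitable" || echo "found Java '$v'"
```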

Now, in order to make the SSH sessions from the name node to the data nodes operate without a password, you must create an SSH key on the name node and copy the key to each of the data nodes. You create the key with the ssh-keygen command as the hadoop user (I created the hadoop user account during the installation of the CentOS operating system on each server), as follows:

ssh-keygen
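The key then has to reach every data node. As a hedged sketch of that step, the helper below only prints the commands it would use, so they can be reviewed and run one at a time. The use of ssh-copy-id is an assumption, since the excerpt stops before showing how the key is copied, and the data node names are placeholders (the excerpt names only the name node, hc1nn). The function name copy_key_commands is hypothetical.

```shell
# usage: copy_key_commands <user> <host>...
# Prints (does not run) one ssh-copy-id command per data node.
copy_key_commands() {
  local user="$1"
  shift
  local host
  for host in "$@"; do
    printf 'ssh-copy-id %s@%s\n' "$user" "$host"
  done
}

# Example with placeholder host names:
#   copy_key_commands hadoop datanode1 datanode2
```

Each ssh-copy-id run asks for the account password once. Afterward, `ssh hadoop@datanode1` from the name node should open a session without a password prompt.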
