it can be used along with systems like Oracle, MySQL, and SQL Server. Each of these systems offers connector components that move data into and out of Hadoop's framework for processing. We will review a few of these components in this chapter and illustrate how they interact with Hadoop.
Business Analytics and Big Data
Business Analytics is the study of data through statistical and operational analysis. Hadoop allows you to conduct operational analysis on its data stores. These results help organizations make better business decisions.
To understand this further, let's build a big data profile. Because of the volume of data involved, the data can be distributed across storage and compute nodes, an arrangement that benefits from using Hadoop. Because the data is distributed rather than centralized, it lacks the characteristics of an RDBMS. This allows you to use large data stores and an assortment of data types with Hadoop.
For example, consider the large data stores behind services like Google, Bing, or Twitter. All of these data stores can grow exponentially with activity, such as queries from a large user base. Hadoop's components can help you process these large data stores.
A business such as Google can use Hadoop to manipulate, manage, and produce meaningful results from its data stores. The traditional tools commonly used for Business Analytics are not designed to work with or analyze extremely large datasets, but Hadoop is a solution that fits these business models.
The Components of Hadoop
Hadoop Common is the foundation of Hadoop, because it contains the primary services and basic processes, such as the abstraction of the underlying operating system and its filesystem. Hadoop Common also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop. The Hadoop Common package even provides source code and documentation, as well as a contribution section. You can't run Hadoop without Hadoop Common.
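To give a feel for what Hadoop Common provides, here is a minimal sketch using its Configuration class, which loads the stack's configuration files from the classpath and exposes them as key/value properties. The hdfs://localhost:9000 address shown is a hypothetical single-node value for illustration, not something your installation is guaranteed to use.

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: Hadoop Common's Configuration class loads the
// framework's settings (core-default.xml, then core-site.xml from
// the classpath) and exposes them as key/value properties.
public class CommonConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // fs.defaultFS names the default filesystem; out of the box
        // it falls back to the local filesystem ("file:///").
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default filesystem: " + defaultFs);

        // Properties can also be set programmatically, which is handy
        // for experiments before editing core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // hypothetical address
        System.out.println("Now pointing at: " + conf.get("fs.defaultFS"));
    }
}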
As with any stack, there are requirements that Apache provides for configuring Hadoop Common. A general understanding of Linux or Unix administration is helpful in setting this up. Hadoop Common, also referred to as the Hadoop Stack, is not designed for a beginner, so the pace of your implementation rests on your experience. In fact, Apache clearly states on its site that Hadoop is not a tool to tackle while you are still learning how to administer a Linux environment. You should be comfortable in that environment before attempting to install Hadoop.
The Distributed File System (HDFS)
With Hadoop Common now installed, it is time to examine the rest of the Hadoop Stack. HDFS delivers a distributed filesystem that is designed to run on commodity hardware. Most businesses find these minimal system requirements appealing. The environment can be set up in a virtual machine (VM) or on a laptop for an initial walkthrough, and later advanced to a server deployment. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large datasets.
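As a small taste of what working with HDFS looks like, the following minimal sketch writes a file into HDFS through the Java FileSystem API and reads back its block size and replication factor. The NameNode address and the /user/demo path are assumptions for illustration, not fixed values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS and inspect it.
// Assumes a NameNode at hdfs://localhost:9000 (adjust for your setup).
public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed address

        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // HDFS splits files into large blocks and replicates them across
        // DataNodes; block size and replication are visible per file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        fs.close();
    }
}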