Hadoop in Depth: A Distributed Data Processing Framework

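"Apache Hadoop is an open source software framework that supports data-intensive distributed applications and is released under the Apache v2 license. Hadoop supports running applications on large clusters of commodity hardware. Hadoop derives from the concepts of Google's MapReduce and the Google File System (GFS)."

In the third edition of Hadoop: The Definitive Guide, author Tom White explores this big data processing framework in depth. The book is divided into the following parts, which cover Hadoop's core concepts and technologies:

1. **Meet Hadoop**: This part introduces Hadoop's background and design goals, emphasizing its role in storing and analyzing big data. It compares Hadoop with other systems, namely relational database management systems (RDBMS), grid computing, and volunteer computing, to show where Hadoop has an advantage when processing large-scale data. It also briefly reviews Hadoop's history and gives an overview of Apache Hadoop and its ecosystem, including its releases.

2. **MapReduce**: MapReduce is Hadoop's core computation model, and this chapter uses a weather dataset to show how it works. The data is first stored in a specific format and analyzed with Unix tools, then analyzed again with Hadoop MapReduce. The book explains the responsibilities of the map and reduce functions and how to implement them in Java. It also discusses how MapReduce scales out, the data flow, combiner functions, and how to run a distributed MapReduce job. Hadoop Streaming (for non-Java languages such as Ruby and Python) and Hadoop Pipes (the C++ interface) are introduced here as well.

3. **The Hadoop Distributed Filesystem (HDFS)**: HDFS is Hadoop's data storage system, and this chapter examines its design. The book discusses basic HDFS concepts, such as the roles of the NameNode and the DataNodes, as well as HDFS's fault-tolerance mechanisms. Key characteristics such as block distribution, the replication policy, and data access methods are covered in detail.

4. **Later chapters**: Although this content is not provided here, later chapters typically cover other components of the Hadoop ecosystem, such as the YARN (Yet Another Resource Negotiator) resource manager, the HBase distributed database, the Pig and Hive data analysis tools, and the Sqoop data import and export tool. Together these components form a complete big data processing platform.

The book is very helpful for understanding how Hadoop works, the MapReduce programming model, and the internals of HDFS, and it is a valuable resource for learning and applying Hadoop. Developers, data analysts, and system administrators alike can benefit from it and improve their ability to handle big data problems.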

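The map and reduce responsibilities mentioned in the MapReduce chapter summary can be sketched in plain Java, without any Hadoop API. This is only an illustration of the programming model under stated assumptions: the class name `MaxTemperatureSketch`, the helper methods `map`, `reduce`, and `maxByYear`, and the simplified `year,temperature` input format are all hypothetical and are not taken from the book's code or the real weather dataset format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the MapReduce model (no Hadoop dependency).
class MaxTemperatureSketch {

    // Map step: turn one "year,temperature" line (assumed format) into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    // Reduce step: collapse all temperatures seen for one year into their maximum.
    static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Driver: apply map to every line, group the pairs by key, then reduce each group.
    // The grouping stands in for the shuffle that a real framework performs between the two steps.
    static Map<String, Integer> maxByYear(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values only, not real weather records.
        List<String> lines = List.of("1949,111", "1950,0", "1949,78", "1950,22", "1950,-11");
        System.out.println(maxByYear(lines)); // prints {1949=111, 1950=22}
    }
}
```

In a real Hadoop job the framework, not the driver loop above, would distribute the map calls across the cluster and group the intermediate pairs by key before the reduce step runs.
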
Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.


Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems (for data storage, data analysis, and coordination) are simple. If there’s a common theme, it is about raising the level of abstraction: to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

1. “The science of fun,” Alex Bellos, The Guardian, May 31, 2008, http://www.guardian.co.uk/science/2008/may/31/maths.science.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name, to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/. Or, if you’re using an IDE, its auto-complete mechanism can help.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com/. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book?

The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop and sketches the history of the project. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.

The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining data.

Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.

Later chapters are dedicated to projects that build on Hadoop or are related to it. Chapters 11 and 12 present Pig and Hive, which are analytics platforms built on HDFS and MapReduce, whereas Chapters 13, 14, and 15 cover HBase, ZooKeeper, and Sqoop, respectively.

Finally, Chapter 16 is a collection of case studies contributed by members of the Apache Hadoop community.
