Hadoop Distributed Systems Explained: The Definitive Guide

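"Hadoop: The Definitive Guide (original edition)"

Hadoop is an open-source distributed computing framework maintained by the Apache Software Foundation and designed to store and process large-scale data. Its core consists of two main components: the Hadoop Distributed File System (HDFS) and the MapReduce computation model.

HDFS is a highly scalable, fault-tolerant distributed file system designed to run on low-cost hardware. It provides streaming access to data and is suited to large data sets. HDFS replicates data across nodes to keep it available and reliable, so the system continues to operate even when some nodes fail. It does not fully comply with the POSIX standard; instead, it relaxes some POSIX requirements in favor of efficient processing of large data sets.

MapReduce is Hadoop's core computation model, inspired by a paper from Google. It splits a large data-processing job into two phases: Map and Reduce. In the Map phase, the input data is split and distributed to different nodes in the cluster for parallel processing. In the Reduce phase, the Map outputs are aggregated and processed to produce the final result. This parallelism greatly speeds up data processing.

Beyond HDFS and MapReduce, the Hadoop ecosystem includes many other tools and services, such as HBase (a distributed NoSQL database organized by column families), Hive (a tool for data warehousing and SQL-like queries), Pig (a high-level scripting language for data analysis), and ZooKeeper (a service for distributed coordination). Together these tools provide a comprehensive big data solution.

Hadoop's design emphasizes scalability and fault tolerance, which makes it well suited to cloud environments. In the cloud, Hadoop can take advantage of elastic compute resources, scaling out or in as data volumes change. Cloud platforms such as Amazon Web Services also offer Hadoop as a managed service through EMR (Elastic MapReduce), which makes large-scale data processing in the cloud convenient.

Hadoop: The Definitive Guide, written by Tom White, covers Hadoop in depth, including installation, configuration, tuning, and the use of related tools. The book is friendly to beginners and also gives experienced developers practical guidance on understanding, deploying, and operating Hadoop systems.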

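The Map and Reduce phases described above can be illustrated with a small in-memory word count. This is only a sketch under stated assumptions: it uses plain Java collections rather than the Hadoop API, it runs on a single machine rather than a cluster, and the class and method names (`MapReduceSketch`, `map`, `reduce`, `run`) are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory word count that mimics the Map and Reduce phases.
// It does not use the Hadoop API; every name here is hypothetical.
final class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all values that were emitted for one key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key, then reduce each group.
    // In a real cluster the grouping step happens across machines.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the per-word counts for two sample lines.
        System.out.println(run(List.of("Hadoop stores data", "Hadoop processes data")));
    }
}
```

Because each call to `map` depends only on its own input line, and each call to `reduce` depends only on the values for one key, both phases could in principle run on different machines at the same time, which is the property the description above attributes to MapReduce.
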
Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master, not only of the technology but also of common sense and plain talk.

—Doug Cutting

Shed in the Yard, California


Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems (for data storage, data analysis, and coordination) are simple. If there’s a common theme, it is about raising the level of abstraction: to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

Source of the Martin Gardner quotation: “The science of fun,” Alex Bellos, The Guardian, May 31, 2008, http://www.guardian.co.uk/science/2008/may/31/maths.science.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/. Or, if you’re using an IDE, its auto-complete mechanism can help.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com/. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book?

The rest of this book is organized as follows. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.

The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining data.

Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.

Chapters 11, 12, and 13 present Pig, HBase, and ZooKeeper, respectively.

Finally, Chapter 14 is a collection of case studies contributed by members of the Apache Hadoop community.
