Preface
I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl-and-
analysis project at Verisign. We were making discoveries similar to those that Doug Cutting and
others at Nutch had made several years earlier about how to efficiently store and manage
terabytes of crawled and analyzed data. At the time, we were getting by with our homegrown
distributed system, but it couldn't support the influx of a new data stream, or the requirement
to join that stream with our crawl data, within the required timeline.
After some research, we came across the Hadoop project, which seemed to be a perfect fit for
our needs—it supported storing large volumes of data and provided a compute mechanism to
combine and process that data. Within a few months, we built and deployed a MapReduce application
encompassing a number of MapReduce jobs, woven together with our own MapReduce
workflow management system, onto a small cluster of 18 nodes. It was a revelation to observe
our MapReduce jobs crunching through our data in minutes. Of course, what we weren’t
expecting was the amount of time that we would spend debugging and performance-tuning our
MapReduce jobs. Not to mention the new roles we took on as production administrators—the
biggest surprise in this role was the number of disk failures we encountered during those first
few months supporting production.
As our experience and comfort level with Hadoop grew, we continued to build more of our
functionality using Hadoop to help with our scaling challenges. We also started to evangelize the
use of Hadoop within our organization and helped kick-start other projects that were also facing
big data challenges.
The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was
relearning how to solve problems with it. MapReduce is its own flavor of parallel programming,
and it’s quite different from the in-JVM programming that we were accustomed to. The first big
hurdle was training our brains to think MapReduce, a topic that the book Hadoop in Action by
Chuck Lam (Manning Publications, 2010) covers well.
After one is used to thinking in MapReduce, the next challenge is typically the logistics of
working with Hadoop, such as how to move data in and out of HDFS and how to work with that
data effectively and efficiently. These areas of Hadoop haven't received much
coverage, and that’s what attracted me to the potential of this book—the chance to go beyond
fundamental word-count Hadoop examples and cover some of the trickier and dirtier aspects
of Hadoop.
As I’m sure many authors have experienced, I went into this project confidently believing that
writing this book was just a matter of transferring my experiences onto paper. Boy, did I get a
reality check, but not altogether an unpleasant one, because writing introduced me to new
approaches and tools that ultimately improved my own Hadoop abilities. I hope that you get
as much out of reading this book as I did writing it.