Hadoop in Practice (Second Edition): 104 Practical Techniques for Conquering Big Data

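"Hadoop in Practice" (Manning, 2nd ed., 2014) is a book focused on hands-on work with Hadoop. The second edition is updated for the changes and new features in Hadoop's core architecture, including MapReduce 2. New chapters cover YARN and the integration of Kafka, Impala, and Spark SQL with Hadoop. The book also provides new and updated techniques for Flume, Sqoop, and Mahout, all of which have had major new releases. In total it offers more than 100 tested, ready-to-apply techniques for big data processing. The author, Alex Holmes, thoroughly revised the first edition to reflect recent developments in Hadoop.

Hadoop is an open-source big data framework, developed as an Apache Software Foundation project, for distributed storage and computation over large datasets. MapReduce is Hadoop's core computational model. The MapReduce 2 covered in the second edition runs on YARN (Yet Another Resource Negotiator), a major rework of the original MapReduce runtime: YARN separates cluster resource management from per-application job scheduling and monitoring, which makes the system more flexible and efficient.

YARN is a key component of the Hadoop ecosystem. It lets different compute frameworks, such as Spark and Tez, run on the same cluster, which improves resource utilization. Kafka is a high-throughput distributed messaging system commonly used for real-time stream processing and data integration. Spark SQL is part of Apache Spark and provides a unified way to process structured and semi-structured data; integrated with Hadoop, it can query large datasets efficiently.

Impala is a fast, low-latency SQL query engine developed by Cloudera. It works directly with Hadoop's HDFS and HBase and gives Hadoop an interactive experience closer to that of a traditional database. The new edition of Hadoop in Practice offers in-depth practical guidance on integrating these technologies.

Flume is a data collection tool for Hadoop that gathers, aggregates, and moves large volumes of log data from a variety of sources. Sqoop imports and exports data between Hadoop and traditional relational databases, which simplifies the interaction between big data systems and conventional data warehouses. Mahout is a machine learning library built on Hadoop that provides many machine learning algorithms; with its new release, the book may include more examples of applying machine learning in big data scenarios.

The second edition of "Hadoop in Practice" is a practical Hadoop reference that covers the key technologies of the Hadoop ecosystem as of its 2014 publication. It is a useful guide for developers and data engineers who want to understand and apply Hadoop in depth. Its code examples, practical techniques, and coverage of newer technologies help readers build their big data skills quickly.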

preface

I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl-and-analysis project at Verisign. We were making discoveries similar to those that Doug Cutting and others at Nutch had made several years earlier about how to efficiently store and manage terabytes of crawled and analyzed data. At the time, we were getting by with our homegrown distributed system, but the influx of a new data stream and requirements to join that stream with our crawl data couldn’t be supported by our existing system in the required timeline.

After some research, we came across the Hadoop project, which seemed to be a perfect fit for our needs: it supported storing large volumes of data and provided a compute mechanism to combine them. Within a few months, we built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with our own MapReduce workflow management system, onto a small cluster of 18 nodes. It was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course, what we weren’t expecting was the amount of time that we would spend debugging and performance-tuning our MapReduce jobs, not to mention the new roles we took on as production administrators. The biggest surprise in this role was the number of disk failures we encountered during those first few months supporting production.

As our experience and comfort level with Hadoop grew, we continued to build more of our functionality using Hadoop to help with our scaling challenges. We also started to evangelize the use of Hadoop within our organization and helped kick-start other projects that were also facing big data challenges.

ABOUT THIS BOOK

Roadmap

This book has 10 chapters divided into four parts.

Part 1 contains two chapters that form the introduction to this book. They review Hadoop basics and look at how to get Hadoop up and running on a single host. YARN, which is new in Hadoop version 2, is also examined, and some operational tips are provided for performing basic functions in YARN.

Part 2, “Data logistics,” consists of three chapters that cover the techniques and tools required to deal with data fundamentals: working with various data formats, organizing and optimizing your data, and getting data into and out of Hadoop. Picking the right format for your data and determining how to organize data in HDFS are the first items you’ll need to address when working with Hadoop, and they’re covered in chapters 3 and 4 respectively. Getting data into Hadoop is one of the bigger hurdles commonly encountered when working with Hadoop, and chapter 5 is dedicated to looking at a variety of tools that work with common enterprise data sources.

Part 3 is called “Big data patterns,” and it looks at techniques to help you work effectively with large volumes of data. Chapter 6 covers how to represent data such as graphs for use with MapReduce, and it looks at several algorithms that operate on graph data. Chapter 7 looks at more advanced data structures and algorithms such as graph processing and using HyperLogLog for working with large datasets. Chapter 8 looks at how to tune, debug, and test MapReduce performance issues, and it also covers a number of techniques to help make your jobs run faster.

Part 4 is titled “Beyond MapReduce,” and it examines a number of technologies that make it easier to work with Hadoop. Chapter 9 covers the most prevalent and promising SQL technologies for data processing on Hadoop, and Hive, Impala, and Spark SQL are examined. The final chapter looks at how to write your own YARN application, and it provides some insights into some of the more advanced features you can use in your applications.

The appendix covers instructions for the source code that accompanies this book, as well as installation instructions for Hadoop and all the other related technologies covered in the book.

Finally, there are two bonus chapters available from the publisher’s website at www.manning.com/HadoopinPracticeSecondEdition: chapter 11 “Integrating R and Hadoop for statistics and more” and chapter 12 “Predictive analytics with Mahout.”

What’s new in the second edition?

This second edition covers Hadoop 2, which at the time of writing is the current production-ready version of Hadoop. The first edition of the book covered Hadoop 0.22 (Hadoop 1 wasn’t yet out), and Hadoop 2 has turned the world upside-down and opened up the Hadoop platform to processing paradigms beyond MapReduce. YARN, the new scheduler and application manager in Hadoop 2, is complex and new to the community, which prompted me to dedicate a new chapter 2 to covering YARN basics and to discussing how MapReduce now functions as a YARN application.

