Roadmap
This book has 13 chapters divided into five parts.
Part 1 contains a single chapter that introduces this book. It reviews Hadoop basics and looks at how to get Hadoop up and running on a single host. It wraps up with a walk-through of how to write and execute a MapReduce job.
Part 2, “Data logistics,” consists of two chapters that cover the techniques and tools required to deal with data fundamentals: getting data in and out of Hadoop, and working with various data formats. Getting data into Hadoop is one of the
first roadblocks commonly encountered when working with Hadoop, and chapter 2
is dedicated to looking at a variety of tools that work with common enterprise data
sources. Chapter 3 covers how to work with ubiquitous data formats such as XML and JSON in MapReduce, before going on to look at data formats better suited to working with big data.
Part 3 is called “Big data patterns,” and looks at techniques to help you work effec-
tively with large volumes of data. Chapter 4 examines how to optimize MapReduce
join and sort operations, and chapter 5 covers working with large numbers of small files, as well as compression. Chapter 6 looks at how to debug MapReduce performance
issues, and also covers a number of techniques to help make your jobs run faster.
Part 4 is all about “Data science,” and delves into the tools and methods that help
you make sense of your data. Chapter 7 covers how to represent data such as graphs
for use with MapReduce, and looks at several algorithms that operate on graph data.
Chapter 8 describes how R, a popular statistical and data mining platform, can be inte-
grated with Hadoop. Chapter 9 describes how Mahout can be used in conjunction
with MapReduce for massively scalable predictive analytics.
Part 5 is titled “Taming the elephant,” and examines a number of technologies
that make it easier to work with MapReduce. Chapters 10 and 11 cover Hive and Pig
respectively, both of which are MapReduce domain-specific languages (DSLs) geared toward providing high-level abstractions. Chapter 12 looks at Crunch and Cascading, which
are Java libraries that offer their own MapReduce abstractions, and chapter 13 covers
techniques to help write unit tests, and to debug MapReduce problems.
The appendixes start with appendix A, which provides instructions for installing Hadoop and the other related technologies covered in the book. Appendix B covers low-level Hadoop ingress/egress mechanisms that the tools covered in chapter 2
leverage. Appendix C looks at how
HDFS supports reads and writes, and appendix D
covers a couple of MapReduce join frameworks written by the author and utilized in
chapter 4.
Code conventions and downloads
All source code in listings or in text is in a fixed-width font like this to separate it
from ordinary text. Code annotations accompany many of the listings, highlighting
important concepts.