HBase: The Definitive Guide (Distributed Big Data Storage in Practice)


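"HBase: The Definitive Guide" is a detailed guide to HBase written by Lars George. The book explores how HBase integrates closely with Hadoop to provide large-scale distributed data storage, and it presents several ways of accessing HBase, including the Java client and the REST, Avro, and Thrift APIs. It also covers HBase's internal architecture, such as the storage format, the write-ahead log (WAL), and background processes, and shows how to use Hadoop's MapReduce to process massive datasets. In addition, readers learn how to manage and tune an HBase cluster and how to carry out practical tasks such as designing schemas, copying tables, bulk-importing data, and removing nodes.

HBase is an open source, non-relational database modeled on Google's Bigtable, and it is particularly well suited to large-scale data. One of its core characteristics is tight integration with the Hadoop ecosystem, which lets HBase scale out across large numbers of inexpensive commodity servers while providing highly available, high-performance storage. HBase uses HDFS (the Hadoop Distributed File System) as its underlying storage, which gives the data fault tolerance and durability. Scalability comes mainly from horizontal scaling: a large dataset is distributed across many nodes, each storing part of the data, which makes petabyte-scale data manageable.

HBase offers several access interfaces. The native Java client is the most direct route for developers. To support applications outside the Java environment, HBase exposes REST, Avro, and Thrift APIs through gateway servers, so that applications written in other languages (such as Python, PHP, or C++) can also interact with it.

The book describes the HBase architecture in detail. In the storage format, each table is divided into multiple Regions, each Region consists of multiple Stores, and each Store contains a MemStore and HFiles. The write-ahead log (WAL) protects the consistency and reliability of the data: even after a system crash, data that had not yet been written to disk can be recovered.

HBase also uses Hadoop's MapReduce framework for batch processing of large datasets. Users can write MapReduce jobs that perform complex analysis and processing on the data in HBase tables, such as aggregation and filtering.

On management and tuning, the book covers how to adjust cluster configuration for different workloads, how to design schemas for better read and write performance, and how to replicate tables and bulk-import data. It also discusses how to handle node failures so that the cluster keeps running stably.

"HBase: The Definitive Guide" is a comprehensive guide to HBase. Both beginners and experienced developers can draw on it for an in-depth understanding and practical, hands-on guidance in using HBase for big data storage and processing.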
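The storage layout and the role of the write-ahead log described above can be illustrated with a small toy model. The following Python sketch is not HBase code and does not use any HBase API; the names `ToyStore`, `ToyRegion`, and `crash_and_recover` are hypothetical, and it assumes one Store per Region, a flush threshold counted in rows, and plain in-memory lists standing in for the WAL and for HFiles. It only shows the general idea: a write is appended to the log first, then buffered in the MemStore, and a full MemStore is flushed to an immutable sorted file. After a simulated crash, replaying the log rebuilds the writes that had not yet been flushed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy stand-in for one Store: a MemStore plus immutable sorted 'HFiles'."""
    flush_threshold: int = 3                     # assumed: flush after this many rows
    memstore: dict = field(default_factory=dict)
    hfiles: list = field(default_factory=list)   # oldest first

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered rows out as one sorted, immutable file.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row):
        # Newest data wins: check the MemStore, then files from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row:
                    return value
        return None


class ToyRegion:
    """Toy Region with a single Store and an append-only log (the 'WAL')."""

    def __init__(self, flush_threshold=3):
        self.wal = []
        self.store = ToyStore(flush_threshold)

    def put(self, row, value):
        self.wal.append((row, value))   # log the edit before applying it
        self.store.put(row, value)

    def get(self, row):
        return self.store.get(row)


def crash_and_recover(region):
    """Simulate a crash: the MemStore is lost, flushed files and the log survive."""
    recovered = ToyRegion(region.store.flush_threshold)
    recovered.store.hfiles = list(region.store.hfiles)
    for row, value in region.wal:       # replay the log in its original order
        recovered.put(row, value)
    return recovered


if __name__ == "__main__":
    region = ToyRegion(flush_threshold=3)
    for row, value in [("row1", "a"), ("row2", "b"), ("row3", "c"),
                       ("row4", "d"), ("row1", "a2")]:
        region.put(row, value)
    recovered = crash_and_recover(region)
    print(recovered.get("row1"), recovered.get("row4"))
```

This toy replays the entire log, which is safe here because replaying writes in order is idempotent. A real system would track which edits have already been flushed and skip them.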

Foreword

The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural language search engine for the Web. Their indexing pipeline was an involved multistep process that produced an index about two orders of magnitude larger, on average, than your standard term-based index. The datastore that they’d built on top of the then nascent Amazon Web Services to hold the index intermediaries and the webcrawl was buckling under the load (Ring. Ring. “Hello! This is AWS. Whatever you are running, please turn it off!”). They were looking for an alternative. The Google BigTable paper* had just been published.

Chad Walters, Powerset’s head of engineering at the time, reflects back on the experience as follows:

“Building an open source system to run on top of Hadoop’s Distributed Filesystem (HDFS) in much the same way that BigTable ran on top of the Google File System seemed like a good approach because: 1) it was a proven scalable architecture; 2) we could leverage existing work on Hadoop’s HDFS; and 3) we could both contribute to and get additional leverage from the growing Hadoop ecosystem.”

After the publication of the Google BigTable paper, there were on-again, off-again discussions around what a BigTable-like system on top of Hadoop might look like. Then, in early 2007, out of the blue, Mike Cafarella dropped a tarball of thirty-odd Java files into the Hadoop issue tracker: “I’ve written some code for HBase, a BigTable-like file store. It’s not perfect, but it’s ready for other people to play with and examine.” Mike had been working with Doug Cutting on Nutch, an open source search engine. He’d done similar drive-by code dumps there to add features such as a Google File System clone so the Nutch indexing process was not bounded by the amount of disk you attach to a single machine. (This Nutch distributed filesystem would later grow up to be HDFS.) Jim Kellerman of Powerset took Mike’s dump and started filling in the gaps, adding tests and getting it into shape so that it could be committed as part of Hadoop. The first commit of the HBase code was made by Doug Cutting on April 3, 2007, under the contrib subdirectory. The first HBase “working” release was bundled as part of Hadoop 0.15.0 in October 2007.

* “BigTable: A Distributed Storage System for Structured Data” by Fay Chang et al.

Not long after, Lars, the author of the book you are now reading, showed up on the #hbase IRC channel. He had a big-data problem of his own, and was game to try HBase. After some back and forth, Lars became one of the first users to run HBase in production outside of the Powerset home base. Through many ups and downs, Lars stuck around. I distinctly remember a directory listing Lars made for me a while back on his production cluster at WorldLingo, where he was employed as CTO, sysadmin, and grunt. The listing showed ten or so HBase releases from Hadoop 0.15.1 (November 2007) on up through HBase 0.20, each of which he’d run on his 40-node cluster at one time or another during production.

Of all those who have contributed to HBase over the years, it is poetic justice that Lars is the one to write this book. Lars was always dogging HBase contributors that the documentation needed to be better if we hoped to gain broader adoption. Everyone agreed, nodded their heads in assent, amen’d, and went back to coding. So Lars started writing critical how-tos and architectural descriptions in between jobs and his intra-European travels as unofficial HBase European ambassador. His Lineland blogs on HBase gave the best description, outside of the source, of how HBase worked, and at a few critical junctures, carried the community across awkward transitions (e.g., an important blog explained the labyrinthian HBase build during the brief period we thought an Ivy-based build to be a “good idea”). His luscious diagrams were poached by one and all wherever an HBase presentation was given.

HBase has seen some interesting times, including a period of sponsorship by Microsoft, of all things. Powerset was acquired in July 2008, and after a couple of months during which Powerset employees were disallowed from contributing while Microsoft’s legal department vetted the HBase codebase to see if it impinged on SQL Server patents, we were allowed to resume contributing (I was a Microsoft employee working near full time on an Apache open source project). The times ahead look promising, too, whether it’s the variety of contortions HBase is being put through at Facebook (as the underpinnings for their massive Facebook mail app, or fielding millions of hits a second on their analytics clusters) or more deploys along the lines of Yahoo!’s 1k-node HBase cluster used to host their snapshot of Microsoft’s Bing crawl. Other developments include HBase running on filesystems other than Apache HDFS, such as MapR.

What is plain to me, though, is that none of these developments would have been possible were it not for the hard work put in by our awesome HBase community driven by a core of HBase committers. Some members of the core have only been around a year or so (Todd Lipcon, Gary Helmling, and Nicolas Spiegelberg), and we would be lost without them, but a good portion have been there from close to project inception and have shaped HBase into the (scalable) general datastore that it is today. These include Jonathan Gray, who gambled his startup streamy.com on HBase; Andrew Purtell, who built an HBase team at Trend Micro long before such a thing was fashionable; Ryan Rawson, who got StumbleUpon (which became the main sponsor after HBase moved on from Powerset/Microsoft) on board, and who had the sense to hire Jean-Daniel Cryans, now a power contributor but just a bushy-tailed student at the time. And then
