HBase: The Definitive Guide (Hadoop Cluster Deployment and the HBase Database Explained)


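Hadoop is an open source distributed computing framework, maintained by the Apache Software Foundation, for processing large data sets. In a Hadoop cluster, data is spread across many machines and managed through the Hadoop Distributed File System (HDFS), which replicates blocks to provide fault tolerance. HBase is a key component of the Hadoop ecosystem: a distributed, column-oriented NoSQL database that is particularly well suited to real-time reads and writes over very large data sets.

"HBase: The Definitive Guide", written by Lars George, is one of the best resources for understanding HBase in depth. The book covers HBase's design philosophy, architecture, installation and configuration, data model, row key and column family design, and how to process data with MapReduce. It also covers how HBase integrates with the rest of the Hadoop stack, notably HDFS, along with performance tuning and failure recovery strategies.

The core features of HBase include:

1. **Scalability**: HBase scales horizontally. Adding nodes lets the cluster handle more data while maintaining performance.
2. **Column-oriented storage**: Data is grouped by column family and each family is stored together on disk, rather than in the row-oriented layout of a traditional database, so queries that touch only certain columns are more efficient.
3. **Sparsity**: Cells that hold no value are simply not stored, so wide tables with many missing values waste no space. Random reads and writes stay fast regardless of how many columns a row actually has.
4. **High availability**: HDFS replication keeps data durable, and regions are reassigned automatically when a server fails. Region splitting and merging keep load balanced as tables grow.
5. **Hadoop compatibility**: As part of the Hadoop ecosystem, HBase interoperates with other Hadoop tools and services.
6. **Real-time queries**: HBase supports low-latency reads, which suits workloads such as log analysis and online advertising.

For developers and data engineers who want to study HBase in depth and apply it in real projects, the book is a comprehensive and practical guide to installation, configuration, table design, data model optimization, and troubleshooting. Beginners and experienced users alike can use it to improve their understanding and use of HBase. The book is published by O'Reilly Media, which also offers an online edition.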
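The data model behind the column-oriented and sparse storage described above is often summarized as a sparse, multidimensional sorted map: row key, then column, then timestamp, then value. The following is a minimal in-memory sketch of that idea in plain Java. It is only an illustration and does not use the HBase client API; the class name `SparseTableSketch`, its methods, and the sample row keys and column names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical layout (an assumption-laden sketch, NOT the HBase API).
class SparseTableSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    // Store one cell version; only cells that are written occupy any space.
    public void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    // Return the newest version of a cell, or null if the cell was never written.
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Return row keys in [startRow, stopRow) in sorted order, like a range scan.
    public List<String> scan(String startRow, String stopRow) {
        return new ArrayList<String>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    // Number of columns actually stored for a row (0 if the row is absent).
    public int storedColumns(String row) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        return columns == null ? 0 : columns.size();
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("user#1001", "info:email", 1L, "a@old.example");
        table.put("user#1001", "info:email", 2L, "a@new.example");
        table.put("user#1001", "info:name", 1L, "Ada");
        table.put("user#1002", "info:name", 1L, "Bob"); // no email cell: nothing is stored for it
        table.put("user#3000", "info:name", 1L, "Cy");

        System.out.println(table.get("user#1001", "info:email")); // a@new.example
        System.out.println(table.get("user#1002", "info:email")); // null
        System.out.println(table.scan("user#1000", "user#2000")); // [user#1001, user#1002]
        System.out.println(table.storedColumns("user#1002"));     // 1
    }
}
```

Because rows are kept sorted by key, a range scan is just a sub-map lookup, which is one reason row key design matters so much in practice.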

Foreword

The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural language search engine for the Web. Their indexing pipeline was an involved multistep process that produced an index about two orders of magnitude larger, on average, than your standard term-based index. The datastore that they’d built on top of the then-nascent Amazon Web Services to hold the index intermediaries and the webcrawl was buckling under the load (Ring. Ring. “Hello! This is AWS. Whatever you are running, please turn it off!”). They were looking for an alternative. The Google BigTable paper* had just been published.

Chad Walters, Powerset’s head of engineering at the time, reflects back on the experience as follows:

Building an open source system to run on top of Hadoop’s Distributed Filesystem (HDFS) in much the same way that BigTable ran on top of the Google File System seemed like a good approach because: 1) it was a proven scalable architecture; 2) we could leverage existing work on Hadoop’s HDFS; and 3) we could both contribute to and get additional leverage from the growing Hadoop ecosystem.

After the publication of the Google BigTable paper, there were on-again, off-again discussions around what a BigTable-like system on top of Hadoop might look like. Then, in early 2007, out of the blue, Mike Cafarella dropped a tarball of thirty-odd Java files into the Hadoop issue tracker: “I’ve written some code for HBase, a BigTable-like file store. It’s not perfect, but it’s ready for other people to play with and examine.” Mike had been working with Doug Cutting on Nutch, an open source search engine. He’d done similar drive-by code dumps there to add features such as a Google File System clone so the Nutch indexing process was not bounded by the amount of disk you attach to a single machine. (This Nutch distributed filesystem would later grow up to be HDFS.)

Jim Kellerman of Powerset took Mike’s dump and started filling in the gaps, adding tests and getting it into shape so that it could be committed as part of Hadoop. The first commit of the HBase code was made by Doug Cutting on April 3, 2007, under the contrib subdirectory. The first HBase “working” release was bundled as part of Hadoop 0.15.0 in October 2007.

* “BigTable: A Distributed Storage System for Structured Data” by Fay Chang et al.

Not long after, Lars, the author of the book you are now reading, showed up on the #hbase IRC channel. He had a big-data problem of his own, and was game to try HBase. After some back and forth, Lars became one of the first users to run HBase in production outside of the Powerset home base. Through many ups and downs, Lars stuck around. I distinctly remember a directory listing Lars made for me a while back on his production cluster at WorldLingo, where he was employed as CTO, sysadmin, and grunt. The listing showed ten or so HBase releases from Hadoop 0.15.1 (November 2007) on up through HBase 0.20, each of which he’d run on his 40-node cluster at one time or another during production.

Of all those who have contributed to HBase over the years, it is poetic justice that Lars is the one to write this book. Lars was always dogging HBase contributors, insisting that the documentation needed to be better if we hoped to gain broader adoption. Everyone agreed, nodded their heads in assent, amen’d, and went back to coding. So Lars started writing critical how-tos and architectural descriptions in between jobs and his intra-European travels as unofficial HBase European ambassador. His Lineland blogs on HBase gave the best description, outside of the source, of how HBase worked, and at a few critical junctures, carried the community across awkward transitions (e.g., an important blog explained the labyrinthine HBase build during the brief period we thought an Ivy-based build to be a “good idea”). His luscious diagrams were poached by one and all wherever an HBase presentation was given.

HBase has seen some interesting times, including a period of sponsorship by Microsoft, of all things. Powerset was acquired in July 2008, and after a couple of months during which Powerset employees were disallowed from contributing while Microsoft’s legal department vetted the HBase codebase to see if it impinged on SQL Server patents, we were allowed to resume contributing (I was a Microsoft employee working near full time on an Apache open source project). The times ahead look promising, too, whether it’s the variety of contortions HBase is being put through at Facebook (as the underpinnings for their massive Facebook mail app, or fielding millions of hits a second on their analytics clusters) or more deploys along the lines of Yahoo!’s 1k node HBase cluster used to host their snapshot of Microsoft’s Bing crawl. Other developments include HBase running on filesystems other than Apache HDFS, such as MapR.

What is plain to me, though, is that none of these developments would have been possible were it not for the hard work put in by our awesome HBase community driven by a core of HBase committers. Some members of the core have only been around a year or so (Todd Lipcon, Gary Helmling, and Nicolas Spiegelberg) and we would be lost without them, but a good portion have been there from close to project inception and have shaped HBase into the (scalable) general datastore that it is today. These include Jonathan Gray, who gambled his startup streamy.com on HBase; Andrew Purtell, who built an HBase team at Trend Micro long before such a thing was fashionable; Ryan Rawson, who got StumbleUpon (which became the main sponsor after HBase moved on from Powerset/Microsoft) on board, and who had the sense to hire John-Daniel Cryans, now a power contributor but just a bushy-tailed student at the time. And then
