and generally improve in quality with more of it. Knowing more about a problem space
generally leads to better decisions (or algorithm efficacy), which in turn leads to happier
users, more money, reduced fraud, healthier people, safer conditions, or whatever the
desired result might be.
Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infra-
structure for building many of the types of applications described earlier. Made up of
a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a com-
putation layer that implements a processing paradigm called MapReduce, Hadoop is
an open source, batch data processing system for enormous amounts of data. We live
in a flawed world, and Hadoop is designed to survive in it by not only tolerating hard-
ware and software failures, but also treating them as first-class conditions that happen
regularly. Hadoop uses a cluster of plain old commodity servers with no specialized
hardware or network infrastructure to form a single, logical, storage and compute plat-
form, or cluster, that can be shared by multiple individuals or groups. Computation in
Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction
for developers that obviates complex synchronization and network programming. Un-
like many other distributed data processing systems, Hadoop runs the user-provided
processing logic on the machine where the data lives rather than dragging the data
across the network, a huge win for performance.
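To make that programming model concrete, the sketch below shows the canonical word count job written against the org.apache.hadoop.mapreduce API. The class names (WordCount, WordCountMapper, WordCountReducer) and the whitespace tokenization are illustrative choices for this sketch, not an example drawn from this book.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A minimal word count sketch: the developer supplies only these two small
// functions, and the framework handles parallel execution across the cluster.
public class WordCount {

  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Emit a (word, 1) pair for every token in the input line.
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      // All counts for a given word arrive at the same reducer; sum them.
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}

A small driver class would configure the job's input and output paths and submit it to the cluster; splitting the input, scheduling map tasks near their data, shuffling and sorting intermediate pairs, and rerunning failed tasks are all handled by the framework rather than by the developer.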
For those interested in the history, Hadoop was modeled after two papers produced
by Google, one of the many companies to have these kinds of data-intensive processing
problems. The first, presented in 2003, describes a pragmatic, scalable, distributed
filesystem optimized for storing enormous datasets, called the Google Filesystem, or
GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive,
distributed processing applications. The following year, another paper, titled "MapReduce:
Simplified Data Processing on Large Clusters," was presented, defining a pro-
gramming model and accompanying framework that provided automatic paralleliza-
tion, fault tolerance, and the scale to process hundreds of terabytes of data in a single
job over thousands of machines. When paired, these two systems could be used to build
large data processing clusters on relatively inexpensive, commodity machines. These
papers directly inspired the development of HDFS and Hadoop MapReduce, respec-
tively.
Interest and investment in Hadoop have led to an entire ecosystem of related software,
both open source and commercial. Within the Apache Software Foundation alone,
projects that explicitly make use of, or integrate with, Hadoop are springing up regu-
larly. Some of these projects make authoring MapReduce jobs easier and more acces-
sible, while others focus on getting data in and out of HDFS, simplify operations, enable
deployment in cloud environments, and so on. Here is a sampling of the more popular
projects with which you should familiarize yourself:
Apache Hive
Hive creates a relational database-style abstraction that allows developers to write
a dialect of SQL, which in turn is executed as one or more MapReduce jobs on the
cluster.