Hadoop Operations: A Practical Guide
"Hadoop Operations,由Eric Sammer撰写,涵盖了Hadoop操作的实用知识,相比definite guide内容更深入。本书由O'Reilly Media出版,适用于教育、商业和销售推广用途。"
在大数据处理领域,Apache Hadoop是一个关键的开源框架,它允许分布式存储和处理大量数据。《Hadoop Operations》这本书,作者Eric Sammer,提供了关于实际操作Hadoop集群的深度指导,旨在帮助读者更好地理解和管理Hadoop环境。
书中可能包含以下几个核心知识点:
1. **Hadoop生态系统**:介绍Hadoop生态系统中的主要组件,如HDFS(Hadoop Distributed File System)用于分布式存储,MapReduce用于大规模数据处理,以及YARN(Yet Another Resource Negotiator)作为资源管理系统。
2. **安装与配置**:详细阐述如何在各种操作系统环境下安装Hadoop,包括集群部署的步骤,以及最佳实践和配置优化技巧。
3. **数据管理**:讲解Hadoop如何处理数据输入、输出,以及数据分块、复制策略和容错机制。
4. **监控与性能调优**:提供监控Hadoop集群性能的方法,包括使用工具如Ganglia和Ambari,以及如何通过调整参数来提升性能。
5. **故障排查与维护**:介绍常见问题的解决策略,如节点故障、网络问题和数据一致性问题,以及如何进行定期维护和升级。
6. **安全性**:涵盖Hadoop的安全特性,如Hadoop的权限控制模型HDFS的ACLs,Kerberos认证,以及如何实施数据加密和安全策略。
7. **实时处理**:讨论Hadoop与其他实时处理技术如Storm和Spark的集成,以满足低延迟的数据处理需求。
8. **案例研究**:可能包含实际企业应用Hadoop的案例,展示Hadoop在不同行业的解决方案和成功故事。
9. **扩展与集成**:介绍如何与其他大数据工具如HBase、Hive、Pig等进行集成,以及如何使用Hadoop与NoSQL数据库配合工作。
10. **最佳实践**:总结作者和社区的经验,给出在实际操作Hadoop时的最佳实践建议,以提高效率和稳定性。
《Hadoop Operations》适合已经有一定Hadoop基础的读者,希望通过深入学习提高Hadoop集群管理和运维能力的专业人士。书中的实战经验和深入分析将有助于读者在实际工作中解决复杂的问题,提升Hadoop集群的稳定性和效率。
…and generally improve in quality with more of it. Knowing more about a problem space generally leads to better decisions (or algorithm efficacy), which in turn leads to happier users, more money, reduced fraud, healthier people, safer conditions, or whatever the desired result might be.
Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infrastructure for building many of the types of applications described earlier. Made up of a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly. Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.
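To make that abstraction concrete, here is a minimal, hypothetical word count job written against Hadoop's Java MapReduce API (the driver class that configures and submits the job is omitted). The mapper emits a count of 1 for every token it sees, and the reducer sums those counts per word; the framework takes care of partitioning, shuffling, sorting, and retrying failed tasks.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: runs where the input block lives and emits (word, 1) pairs.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
```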
For those interested in the history, Hadoop was modeled after two papers produced by Google, one of the many companies to have these kinds of data-intensive processing problems. The first, presented in 2003, describes a pragmatic, scalable, distributed filesystem optimized for storing enormous datasets, called the Google Filesystem, or GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive, distributed processing applications. The following year, another paper, titled “MapReduce: Simplified Data Processing on Large Clusters,” was presented, defining a programming model and accompanying framework that provided automatic parallelization, fault tolerance, and the scale to process hundreds of terabytes of data in a single job over thousands of machines. When paired, these two systems could be used to build large data processing clusters on relatively inexpensive, commodity machines. These papers directly inspired the development of HDFS and Hadoop MapReduce, respectively.
Interest and investment in Hadoop has led to an entire ecosystem of related software, both open source and commercial. Within the Apache Software Foundation alone, projects that explicitly make use of, or integrate with, Hadoop are springing up regularly. Some of these projects make authoring MapReduce jobs easier and more accessible, while others focus on getting data in and out of HDFS, simplify operations, enable deployment in cloud environments, and so on. Here is a sampling of the more popular projects with which you should familiarize yourself:
Apache Hive
Hive creates a relational database-style abstraction that allows developers to write a dialect of SQL, which in turn is executed as one or more MapReduce jobs on the cluster. Developers, analysts, and existing third-party packages already know and speak SQL (Hive’s dialect of SQL is called HiveQL and implements only a subset of any of the common standards). Hive takes advantage of this and provides a quick way to reduce the learning curve to adopting Hadoop and writing MapReduce jobs. For this reason, Hive is by far one of the most popular Hadoop ecosystem projects. Hive works by defining a table-like schema over an existing set of files in HDFS and handling the gory details of extracting records from those files when a query is run. The data on disk is never actually changed, just parsed at query time. HiveQL statements are interpreted and an execution plan of prebuilt map and reduce classes is assembled to perform the MapReduce equivalent of the SQL statement.
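As an illustration of that table-over-files model, the following HiveQL sketch (the table, columns, and HDFS path are hypothetical) layers a schema onto delimited files already sitting in HDFS and then queries them; the SELECT is compiled into MapReduce work while the underlying files remain untouched.

```sql
-- Define a schema over existing HDFS files; Hive stores only the metadata.
CREATE EXTERNAL TABLE web_logs (
  ip      STRING,
  ts      STRING,
  url     STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';

-- Compiled into one or more MapReduce jobs at query time.
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;
```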
Apache Pig
Like Hive, Apache Pig was created to simplify the authoring of MapReduce jobs, obviating the need to write Java code. Instead, users write data processing jobs in a high-level scripting language from which Pig builds an execution plan and executes a series of MapReduce jobs to do the heavy lifting. In cases where Pig doesn’t support a necessary function, developers can extend its set of built-in operations by writing user-defined functions in Java (Hive supports similar functionality as well). If you know Perl, Python, Ruby, JavaScript, or even shell script, you can learn Pig’s syntax in the morning and be running MapReduce jobs by lunchtime.
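For a sense of what that scripting language looks like, here is a short Pig Latin sketch over the same hypothetical log files used in the Hive example above; each statement builds on the previous relation, and Pig turns the whole script into a series of MapReduce jobs.

```pig
-- Load tab-delimited log records from HDFS and declare a schema.
logs    = LOAD '/data/web_logs' USING PigStorage('\t')
            AS (ip:chararray, ts:chararray, url:chararray, status:int);
errors  = FILTER logs BY status >= 500;            -- keep only server errors
by_url  = GROUP errors BY url;                     -- group error records by URL
counts  = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;
STORE counts INTO '/data/error_counts';            -- write results back to HDFS
```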
Apache Sqoop
Not only does Hadoop not want to replace your database, it wants to be friends with it. Exchanging data with relational databases is one of the most popular integration points with Apache Hadoop. Sqoop, short for “SQL to Hadoop,” performs bidirectional data transfer between Hadoop and almost any database with a JDBC driver. Using MapReduce, Sqoop performs these operations in parallel with no need to write code.
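A typical invocation looks roughly like the following (the connection string, username, table, and paths are hypothetical); Sqoop generates and runs a map-only MapReduce job that pulls the table into HDFS using several parallel tasks.

```bash
# Import a relational table into HDFS with four parallel map tasks.
# -P prompts for the database password instead of putting it on the command line.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```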
For even greater performance, Sqoop supports database-specific plug-ins that use native features of the RDBMS rather than incurring the overhead of JDBC. Many of these connectors are open source, while others are free or available from commercial vendors at a cost. Today, Sqoop includes native connectors (called direct support) for MySQL and PostgreSQL. Free connectors exist for Teradata, Netezza, SQL Server, and Oracle (from Quest Software), and are available for download from their respective company websites.
Apache Flume
Apache Flume is a streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop. It provides native connectivity and support for writing directly to HDFS, and simplifies reliable, streaming data delivery from a variety of sources including RPC services, log4j appenders, syslog, and even the output from OS commands. Data can be routed, load-balanced, replicated to multiple destinations, and aggregated from thousands of hosts by a tier of agents.
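A minimal sketch of one agent, assuming Flume NG's properties-file configuration format (the agent name, port, and HDFS path are made up): a syslog source feeds an in-memory channel, which drains to an HDFS sink.

```properties
# One agent named "agent1": syslog in, HDFS out, buffered in memory.
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1

agent1.sources.r1.type = syslogtcp
agent1.sources.r1.port = 5140
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000

agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.k1.channel = c1
```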
Apache Oozie
It’s not uncommon for large production clusters to run many coordinated MapReduce jobs in a workflow. Apache Oozie is a workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster. Workflows can be triggered by time or events such as data arriving in a directory, and job failure handling logic can be implemented so that policies are adhered to. Oozie presents a REST service for programmatic management of workflows and status retrieval.
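The sketch below shows roughly what a one-action workflow definition looks like; the workflow name, paths, mapper and reducer classes, and schema version are all illustrative, and the ${jobTracker}, ${nameNode}, and ${date} parameters would come from the job's properties file. A real deployment would typically pair this with a coordinator definition so the workflow fires on a schedule or when input data arrives.

```xml
<workflow-app name="daily-aggregate" xmlns="uri:oozie:workflow:0.2">
  <start to="aggregate"/>

  <action name="aggregate">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.AggregateMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.AggregateReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/data/raw/${date}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/data/aggregated/${date}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Aggregation failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```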
Apache Whirr
Apache Whirr was developed to simplify the creation and deployment of ephemeral clusters in cloud environments such as Amazon’s AWS. Run as a command-line tool either locally or within the cloud, Whirr can spin up instances, deploy Hadoop, configure the software, and tear it down on demand. Under the hood, Whirr uses the powerful jclouds library so that it is cloud provider-neutral. The developers have put in the work to make Whirr support both Amazon EC2 and Rackspace Cloud. In addition to Hadoop, Whirr understands how to provision Apache Cassandra, Apache ZooKeeper, Apache HBase, ElasticSearch, Voldemort, and Apache Hama.
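To give a flavor of how that works, here is a hedged sketch of a Whirr recipe; the cluster name, instance counts, and the use of environment variables for credentials are illustrative.

```properties
# hadoop.properties: a small, throwaway Hadoop cluster on EC2.
whirr.cluster-name=ops-test
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

The cluster would then be brought up with `whirr launch-cluster --config hadoop.properties` and torn down again with `whirr destroy-cluster --config hadoop.properties`.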
Apache HBase
Apache HBase is a low-latency, distributed (nonrelational) database built on top of HDFS. Modeled after Google’s Bigtable, HBase presents a flexible data model with scale-out properties and a very simple API. Data in HBase is stored in a semi-columnar format partitioned by rows into regions. It’s not uncommon for a single table in HBase to be well into the hundreds of terabytes or in some cases petabytes. Over the past few years, HBase has gained a massive following based on some very public deployments such as Facebook’s Messages platform. Today, HBase is used to serve huge amounts of data to real-time systems in major production deployments.
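The HBase shell gives a quick feel for the row-key-plus-column-family data model; the table, column family, and row key below are hypothetical, loosely styled after a messaging use case.

```
hbase> create 'messages', 'd'                  # table with a single column family "d"
hbase> put 'messages', 'user42-000917', 'd:body', 'hello there'
hbase> get 'messages', 'user42-000917'
hbase> scan 'messages', {STARTROW => 'user42', LIMIT => 10}
```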
Apache ZooKeeper
A true workhorse, Apache ZooKeeper is a distributed, consensus-based coordination system used to support distributed applications. Distributed applications that require leader election, locking, group membership, service location, and configuration services can use ZooKeeper rather than reimplement the complex coordination and error handling that comes with these functions. In fact, many projects within the Hadoop ecosystem use ZooKeeper for exactly this purpose (most notably, HBase).
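As a minimal sketch of the group membership pattern (the ensemble addresses and znode paths are hypothetical, and the parent /workers znode is assumed to exist), a process registers itself by creating an ephemeral, sequential znode that ZooKeeper removes automatically if the process’s session dies; the same primitive underpins leader election, where the lowest sequence number wins.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ensemble; the watcher fires once the session is established.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
      if (event.getState() == KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Ephemeral + sequential: the znode vanishes when this session ends, and the
    // appended sequence number gives every worker a unique, ordered name.
    String znode = zk.create("/workers/worker-", "worker-host-1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    System.out.println("Registered as " + znode);
  }
}
```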
Apache HCatalog
A relatively new entry, Apache HCatalog is a service that provides shared schema and data access abstraction services to applications within the ecosystem. The long-term goal of HCatalog is to enable interoperability between tools such as Apache Hive and Pig so that they can share dataset metadata information.
The Hadoop ecosystem is exploding into the commercial world as well. Vendors such as Oracle, SAS, MicroStrategy, Tableau, Informatica, Microsoft, Pentaho, Talend, HP, Dell, and dozens of others have all developed integration or support for Hadoop within one or more of their products. Hadoop is fast becoming (or, as a growing group would believe, already has become) the de facto standard for truly large-scale data processing in the data center.
If you’re reading this book, you may be a developer with some exposure to Hadoop looking to learn more about managing the system in a production environment. Alternatively, it could be that you’re an application or system administrator tasked with owning the current or planned production cluster. Those in the latter camp may be rolling their eyes at the prospect of dealing with yet another system. That’s fair, and we won’t spend a ton of time talking about writing applications, APIs, and other pesky code problems. There are other fantastic books on those topics, especially Hadoop: The Definitive Guide by Tom White (O’Reilly). Administrators do, however, play an absolutely critical role in planning, installing, configuring, maintaining, and monitoring Hadoop clusters. Hadoop is a comparatively low-level system, leaning heavily on the host operating system for many features, and it works best when developers and administrators collaborate regularly. What you do impacts how things work.
It’s an extremely exciting time to get into Apache Hadoop. The so-called big data space is all the rage, sure, but more importantly, Hadoop is growing and changing at a staggering rate. Each new version (and there have been a few big ones in the past year or two) brings another truckload of features for both developers and administrators alike. You could say that Hadoop is experiencing software puberty; thanks to its rapid growth and adoption, it’s also a little awkward at times. You’ll find, throughout this book, that there are significant changes between even minor versions. It’s a lot to keep up with, admittedly, but don’t let it overwhelm you. Where necessary, the differences are called out, and a section in Chapter 4 is devoted to walking you through the most commonly encountered versions.
This book is intended to be a pragmatic guide to running Hadoop in production. Those who have some familiarity with Hadoop may already know alternative methods for installation or have differing thoughts on how to properly tune the number of map slots based on CPU utilization.² That’s expected and more than fine. The goal is not to enumerate all possible scenarios, but rather to call out what works, as demonstrated in critical deployments.
Chapters 2 and 3 provide the necessary background, describing what HDFS and MapReduce are, why they exist, and at a high level, how they work. Chapter 4 walks you through the process of planning for a Hadoop deployment, including hardware selection, basic resource planning, operating system selection and configuration, Hadoop distribution and version selection, and network concerns for Hadoop clusters. If you are looking for the meat and potatoes, Chapter 5 is where it’s at, with configuration and setup information, including a listing of the most critical properties, organized by topic. Those who have strong security requirements or want to understand identity, access, and authorization within Hadoop will want to pay particular attention to Chapter 6. Chapter 7 explains the nuts and bolts of sharing a single large cluster across multiple groups and why this is beneficial while still adhering to service-level agreements by managing and allocating resources accordingly. Once everything is up and running, Chapter 8 acts as a run book for the most common operations and tasks. Chapter 9 is the rainy day chapter, covering the theory and practice of troubleshooting complex distributed systems such as Hadoop, including some real-world war stories. In an attempt to minimize those rainy days, Chapter 10 is all about how to effectively monitor your Hadoop cluster. Finally, Chapter 11 provides some basic tools and techniques for backing up Hadoop and dealing with catastrophic failure.

2. We also briefly cover the flux capacitor and discuss the burn rate of energon cubes during combat.