Hadoop Operations: A Practical Guide
"Hadoop Operations,由Eric Sammer撰写,涵盖了Hadoop操作的实用知识,相比definite guide内容更深入。本书由O'Reilly Media出版,适用于教育、商业和销售推广用途。"
在大数据处理领域,Apache Hadoop是一个关键的开源框架,它允许分布式存储和处理大量数据。《Hadoop Operations》这本书,作者Eric Sammer,提供了关于实际操作Hadoop集群的深度指导,旨在帮助读者更好地理解和管理Hadoop环境。
书中可能包含以下几个核心知识点:
1. **Hadoop生态系统**:介绍Hadoop生态系统中的主要组件,如HDFS(Hadoop Distributed File System)用于分布式存储,MapReduce用于大规模数据处理,以及YARN(Yet Another Resource Negotiator)作为资源管理系统。
2. **安装与配置**:详细阐述如何在各种操作系统环境下安装Hadoop,包括集群部署的步骤,以及最佳实践和配置优化技巧。
3. **数据管理**:讲解Hadoop如何处理数据输入、输出,以及数据分块、复制策略和容错机制。
4. **监控与性能调优**:提供监控Hadoop集群性能的方法,包括使用工具如Ganglia和Ambari,以及如何通过调整参数来提升性能。
5. **故障排查与维护**:介绍常见问题的解决策略,如节点故障、网络问题和数据一致性问题,以及如何进行定期维护和升级。
6. **安全性**:涵盖Hadoop的安全特性,如Hadoop的权限控制模型HDFS的ACLs,Kerberos认证,以及如何实施数据加密和安全策略。
7. **实时处理**:讨论Hadoop与其他实时处理技术如Storm和Spark的集成,以满足低延迟的数据处理需求。
8. **案例研究**:可能包含实际企业应用Hadoop的案例,展示Hadoop在不同行业的解决方案和成功故事。
9. **扩展与集成**:介绍如何与其他大数据工具如HBase、Hive、Pig等进行集成,以及如何使用Hadoop与NoSQL数据库配合工作。
10. **最佳实践**:总结作者和社区的经验,给出在实际操作Hadoop时的最佳实践建议,以提高效率和稳定性。
《Hadoop Operations》适合已经有一定Hadoop基础的读者,希望通过深入学习提高Hadoop集群管理和运维能力的专业人士。书中的实战经验和深入分析将有助于读者在实际工作中解决复杂的问题,提升Hadoop集群的稳定性和效率。
…and generally improve in quality with more of it. Knowing more about a problem space generally leads to better decisions (or algorithm efficacy), which in turn leads to happier users, more money, reduced fraud, healthier people, safer conditions, or whatever the desired result might be.
Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infrastructure for building many of the types of applications described earlier. Made up of a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly. Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.
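To make that abstraction concrete, here is a minimal, hypothetical word count job written against Hadoop's Java MapReduce API (the driver class that configures and submits the job is omitted). The mapper emits a count of 1 for every token it sees, and the reducer sums those counts per word; the framework takes care of partitioning, shuffling, sorting, and retrying failed tasks.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: runs where the input block lives and emits (word, 1) pairs.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
```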
For those interested in the history, Hadoop was modeled after two papers produced by Google, one of the many companies to have these kinds of data-intensive processing problems. The first, presented in 2003, describes a pragmatic, scalable, distributed filesystem optimized for storing enormous datasets, called the Google Filesystem, or GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive, distributed processing applications. The following year, another paper, titled “MapReduce: Simplified Data Processing on Large Clusters,” was presented, defining a programming model and accompanying framework that provided automatic parallelization, fault tolerance, and the scale to process hundreds of terabytes of data in a single job over thousands of machines. When paired, these two systems could be used to build large data processing clusters on relatively inexpensive, commodity machines. These papers directly inspired the development of HDFS and Hadoop MapReduce, respectively.
Interest and investment in Hadoop has led to an entire ecosystem of related software, both open source and commercial. Within the Apache Software Foundation alone, projects that explicitly make use of, or integrate with, Hadoop are springing up regularly. Some of these projects make authoring MapReduce jobs easier and more accessible, while others focus on getting data in and out of HDFS, simplify operations, enable deployment in cloud environments, and so on. Here is a sampling of the more popular projects with which you should familiarize yourself:
Apache Hive
Hive creates a relational database-style abstraction that allows developers to write a dialect of SQL, which in turn is executed as one or more MapReduce jobs on the cluster. Developers, analysts, and existing third-party packages already know and speak SQL (Hive’s dialect of SQL is called HiveQL and implements only a subset of any of the common standards). Hive takes advantage of this and provides a quick way to reduce the learning curve to adopting Hadoop and writing MapReduce jobs. For this reason, Hive is by far one of the most popular Hadoop ecosystem projects. Hive works by defining a table-like schema over an existing set of files in HDFS and handling the gory details of extracting records from those files when a query is run. The data on disk is never actually changed, just parsed at query time. HiveQL statements are interpreted and an execution plan of prebuilt map and reduce classes is assembled to perform the MapReduce equivalent of the SQL statement.
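As an illustration of that table-over-files model, the following HiveQL sketch (the table, columns, and HDFS path are hypothetical) layers a schema onto delimited files already sitting in HDFS and then queries them; the SELECT is compiled into MapReduce work while the underlying files remain untouched.

```sql
-- Define a schema over existing HDFS files; Hive stores only the metadata.
CREATE EXTERNAL TABLE web_logs (
  ip      STRING,
  ts      STRING,
  url     STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';

-- Compiled into one or more MapReduce jobs at query time.
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;
```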
Apache Pig
Like Hive, Apache Pig was created to simplify the authoring of MapReduce jobs, obviating the need to write Java code. Instead, users write data processing jobs in a high-level scripting language from which Pig builds an execution plan and executes a series of MapReduce jobs to do the heavy lifting. In cases where Pig doesn’t support a necessary function, developers can extend its set of built-in operations by writing user-defined functions in Java (Hive supports similar functionality as well). If you know Perl, Python, Ruby, JavaScript, or even shell script, you can learn Pig’s syntax in the morning and be running MapReduce jobs by lunchtime.
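For a sense of what that scripting language looks like, here is a short Pig Latin sketch over the same hypothetical log files used in the Hive example above; each statement builds on the previous relation, and Pig turns the whole script into a series of MapReduce jobs.

```pig
-- Load tab-delimited log records from HDFS and declare a schema.
logs    = LOAD '/data/web_logs' USING PigStorage('\t')
            AS (ip:chararray, ts:chararray, url:chararray, status:int);
errors  = FILTER logs BY status >= 500;            -- keep only server errors
by_url  = GROUP errors BY url;                     -- group error records by URL
counts  = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;
STORE counts INTO '/data/error_counts';            -- write results back to HDFS
```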
Apache Sqoop
Not only does Hadoop not want to replace your database, it wants to be friends with it. Exchanging data with relational databases is one of the most popular integration points with Apache Hadoop. Sqoop, short for “SQL to Hadoop,” performs bidirectional data transfer between Hadoop and almost any database with a JDBC driver. Using MapReduce, Sqoop performs these operations in parallel with no need to write code.
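A typical invocation looks roughly like the following (the connection string, username, table, and paths are hypothetical); Sqoop generates and runs a map-only MapReduce job that pulls the table into HDFS using several parallel tasks.

```bash
# Import a relational table into HDFS with four parallel map tasks.
# -P prompts for the database password instead of putting it on the command line.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```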
For even greater performance, Sqoop supports database-specific plug-ins that use native features of the RDBMS rather than incurring the overhead of JDBC. Many of these connectors are open source, while others are free or available from commercial vendors at a cost. Today, Sqoop includes native connectors (called direct support) for MySQL and PostgreSQL. Free connectors exist for Teradata, Netezza, SQL Server, and Oracle (from Quest Software), and are available for download from their respective company websites.
Apache Flume
Apache Flume is a streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop. It provides native connectivity and support for writing directly to HDFS, and simplifies reliable, streaming data delivery from a variety of sources including RPC services, log4j appenders, syslog, and even the output from OS commands. Data can be routed, load-balanced, replicated to multiple destinations, and aggregated from thousands of hosts by a tier of agents.
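A minimal sketch of one agent, assuming Flume NG's properties-file configuration format (the agent name, port, and HDFS path are made up): a syslog source feeds an in-memory channel, which drains to an HDFS sink.

```properties
# One agent named "agent1": syslog in, HDFS out, buffered in memory.
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1

agent1.sources.r1.type = syslogtcp
agent1.sources.r1.port = 5140
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000

agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.k1.channel = c1
```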
Apache Oozie
It’s not uncommon for large production clusters to run many coordinated MapReduce jobs in a workflow. Apache Oozie is a workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster. Workflows can be triggered by time or events such as data arriving in a directory, and job failure handling logic can be implemented so that policies are adhered to. Oozie presents a REST service for programmatic management of workflows and status retrieval.
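The sketch below shows roughly what a one-action workflow definition looks like; the workflow name, paths, mapper and reducer classes, and schema version are all illustrative, and the ${jobTracker}, ${nameNode}, and ${date} parameters would come from the job's properties file. A real deployment would typically pair this with a coordinator definition so the workflow fires on a schedule or when input data arrives.

```xml
<workflow-app name="daily-aggregate" xmlns="uri:oozie:workflow:0.2">
  <start to="aggregate"/>

  <action name="aggregate">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.AggregateMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.AggregateReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/data/raw/${date}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/data/aggregated/${date}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Aggregation failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```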
Apache Whirr
Apache Whirr was developed to simplify the creation and deployment of ephemeral clusters in cloud environments such as Amazon’s AWS. Run as a command-line tool either locally or within the cloud, Whirr can spin up instances, deploy Hadoop, configure the software, and tear it down on demand. Under the hood, Whirr uses the powerful jclouds library so that it is cloud provider-neutral. The developers have put in the work to make Whirr support both Amazon EC2 and Rackspace Cloud. In addition to Hadoop, Whirr understands how to provision Apache Cassandra, Apache ZooKeeper, Apache HBase, ElasticSearch, Voldemort, and Apache Hama.
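To give a flavor of how that works, here is a hedged sketch of a Whirr recipe; the cluster name, instance counts, and the use of environment variables for credentials are illustrative.

```properties
# hadoop.properties: a small, throwaway Hadoop cluster on EC2.
whirr.cluster-name=ops-test
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

The cluster would then be brought up with `whirr launch-cluster --config hadoop.properties` and torn down again with `whirr destroy-cluster --config hadoop.properties`.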
Apache HBase
Apache HBase is a low-latency, distributed (nonrelational) database built on top of HDFS. Modeled after Google’s Bigtable, HBase presents a flexible data model with scale-out properties and a very simple API. Data in HBase is stored in a semi-columnar format partitioned by rows into regions. It’s not uncommon for a single table in HBase to be well into the hundreds of terabytes or in some cases petabytes. Over the past few years, HBase has gained a massive following based on some very public deployments such as Facebook’s Messages platform. Today, HBase is used to serve huge amounts of data to real-time systems in major production deployments.
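The HBase shell gives a quick feel for the row-key-plus-column-family data model; the table, column family, and row key below are hypothetical, loosely styled after a messaging use case.

```
hbase> create 'messages', 'd'                  # table with a single column family "d"
hbase> put 'messages', 'user42-000917', 'd:body', 'hello there'
hbase> get 'messages', 'user42-000917'
hbase> scan 'messages', {STARTROW => 'user42', LIMIT => 10}
```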
Apache ZooKeeper
A true workhorse, Apache ZooKeeper is a distributed, consensus-based coordination system used to support distributed applications. Distributed applications that require leader election, locking, group membership, service location, and configuration services can use ZooKeeper rather than reimplement the complex coordination and error handling that comes with these functions. In fact, many projects within the Hadoop ecosystem use ZooKeeper for exactly this purpose (most notably, HBase).
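As a minimal sketch of the group membership pattern (the ensemble addresses and znode paths are hypothetical, and the parent /workers znode is assumed to exist), a process registers itself by creating an ephemeral, sequential znode that ZooKeeper removes automatically if the process’s session dies; the same primitive underpins leader election, where the lowest sequence number wins.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ensemble; the watcher fires once the session is established.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
      if (event.getState() == KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Ephemeral + sequential: the znode vanishes when this session ends, and the
    // appended sequence number gives every worker a unique, ordered name.
    String znode = zk.create("/workers/worker-", "worker-host-1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    System.out.println("Registered as " + znode);
  }
}
```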
Apache HCatalog
A relatively new entry, Apache HCatalog is a service that provides shared schema and data access abstraction services to applications within the ecosystem. The long-term goal of HCatalog is to enable interoperability between tools such as Apache Hive and Pig so that they can share dataset metadata information.
The Hadoop ecosystem is exploding into the commercial world as well. Vendors such as Oracle, SAS, MicroStrategy, Tableau, Informatica, Microsoft, Pentaho, Talend, HP, Dell, and dozens of others have all developed integration or support for Hadoop within one or more of their products. Hadoop is fast becoming (or, as a growing group would believe, already has become) the de facto standard for truly large-scale data processing in the data center.
If you’re reading this book, you may be a developer with some exposure to Hadoop looking to learn more about managing the system in a production environment. Alternatively, it could be that you’re an application or system administrator tasked with owning the current or planned production cluster. Those in the latter camp may be rolling their eyes at the prospect of dealing with yet another system. That’s fair, and we won’t spend a ton of time talking about writing applications, APIs, and other pesky code problems. There are other fantastic books on those topics, especially Hadoop: The Definitive Guide by Tom White (O’Reilly). Administrators do, however, play an absolutely critical role in planning, installing, configuring, maintaining, and monitoring Hadoop clusters. Hadoop is a comparatively low-level system, leaning heavily on the host operating system for many features, and it works best when developers and administrators collaborate regularly. What you do impacts how things work.
It’s an extremely exciting time to get into Apache Hadoop. The so-called big data space is all the rage, sure, but more importantly, Hadoop is growing and changing at a staggering rate. Each new version (and there have been a few big ones in the past year or two) brings another truckload of features for both developers and administrators alike. You could say that Hadoop is experiencing software puberty; thanks to its rapid growth and adoption, it’s also a little awkward at times. You’ll find, throughout this book, that there are significant changes between even minor versions. It’s a lot to keep up with, admittedly, but don’t let it overwhelm you. Where necessary, the differences are called out, and a section in Chapter 4 is devoted to walking you through the most commonly encountered versions.
This book is intended to be a pragmatic guide to running Hadoop in production. Those who have some familiarity with Hadoop may already know alternative methods for installation or have differing thoughts on how to properly tune the number of map slots based on CPU utilization.² That’s expected and more than fine. The goal is not to enumerate all possible scenarios, but rather to call out what works, as demonstrated in critical deployments.
Chapters 2 and 3 provide the necessary background, describing what HDFS and MapReduce are, why they exist, and at a high level, how they work. Chapter 4 walks you through the process of planning for a Hadoop deployment, including hardware selection, basic resource planning, operating system selection and configuration, Hadoop distribution and version selection, and network concerns for Hadoop clusters. If you are looking for the meat and potatoes, Chapter 5 is where it’s at, with configuration and setup information, including a listing of the most critical properties, organized by topic. Those who have strong security requirements or want to understand identity, access, and authorization within Hadoop will want to pay particular attention to Chapter 6. Chapter 7 explains the nuts and bolts of sharing a single large cluster across multiple groups and why this is beneficial while still adhering to service-level agreements by managing and allocating resources accordingly. Once everything is up and running, Chapter 8 acts as a run book for the most common operations and tasks. Chapter 9 is the rainy day chapter, covering the theory and practice of troubleshooting complex distributed systems such as Hadoop, including some real-world war stories. In an attempt to minimize those rainy days, Chapter 10 is all about how to effectively monitor your Hadoop cluster. Finally, Chapter 11 provides some basic tools and techniques for backing up Hadoop and dealing with catastrophic failure.

2. We also briefly cover the flux capacitor and discuss the burn rate of energon cubes during combat.