Hadoop in Action: An Introduction to a Distributed Programming Framework


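"Hadoop in Action is an introductory book on Hadoop, published through the Manning Early Access Program. It covers Hadoop fundamentals, practical applications, and running Hadoop in cloud environments."

Before exploring Hadoop in depth, it helps to understand the core concepts of a distributed programming framework. Hadoop is an open source framework designed for storing and processing large volumes of data. Its main components are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS provides distributed storage, while MapReduce is a programming model for processing large data sets in parallel.

**Chapter 1: Introducing Hadoop** This chapter introduces the basic concepts of Hadoop, including how it addresses the problems of big data processing and its core design ideas. It emphasizes Hadoop's scalability and fault tolerance, the properties that allow large data processing clusters to be built on inexpensive hardware.

**Chapter 2: Starting Hadoop** Here readers learn how to install and configure a Hadoop environment. This includes preparing hardware such as servers or virtual machines, setting up the network topology, and configuring each node of the cluster. The chapter also covers starting and monitoring Hadoop to make sure the cluster runs properly.

**Chapter 3: Components of Hadoop** This chapter looks closely at the individual components of Hadoop, such as the NameNode, DataNode, TaskTracker, and JobTracker. These components work together to provide reliable data storage and efficient task execution. It also discusses YARN (Yet Another Resource Negotiator), the resource manager that is part of MapReduce v2 (MRv2) and improves the management and utilization of cluster resources.

**Chapter 4: Writing basic MapReduce programs** In this chapter readers learn how to write MapReduce programs. The map phase is responsible for splitting and processing the data, and the reduce phase aggregates the results. The chapter explains how the two phases work through examples so that beginners can get started quickly.

**Chapter 5: Advanced MapReduce** This chapter goes deeper into MapReduce and introduces advanced topics such as custom partitioning, combiners, and reducer optimization. These techniques help improve the performance and efficiency of MapReduce jobs.

**Chapter 6: Programming practices** This part covers best practices for developing Hadoop applications, including error handling, logging, and data serialization and deserialization. It also discusses how to test and debug MapReduce jobs.

**Chapter 7: Hadoop in practice** This chapter offers practical examples and techniques for solving real problems, such as importing and exporting data and cleaning and transforming data. It serves as a practical handbook for Hadoop developers.

**Chapter 8: Managing Hadoop** This chapter explains how to manage and maintain a Hadoop cluster, including monitoring, performance tuning, troubleshooting, and security policies. It also introduces tools such as the Hadoop command-line tools and web interface that help administrators control the cluster.

**Chapter 9: Running Hadoop in the cloud** This chapter describes how to deploy and run a Hadoop cluster on Amazon Web Services (AWS) or other cloud platforms. It covers choosing cloud services, controlling cost, and elastic scaling strategies.

**Chapter 10: Programming with Pig** Pig is a high-level language on top of Hadoop that simplifies data processing. This chapter introduces Pig Latin syntax and shows how to use Pig for data analysis.

**Chapter 11: Hive and the Hadoop ecosystem** Hive is a data warehouse system built on Hadoop for querying and analyzing large data sets. This chapter explores HQL, Hive's SQL-like query language, and how Hive integrates with other components such as HBase and Spark.

**Chapter 12: Case studies** Through concrete cases, this chapter shows how Hadoop is applied in different industries, such as internet advertising, social media analysis, and financial risk management.

**Appendix: HDFS file commands** This part lists common HDFS file system commands for working with files from the command line.

Hadoop in Action gives readers a complete learning path for Hadoop, from basic concepts to advanced applications and practice in cloud environments. It covers Hadoop development and administration and is a useful reference for Hadoop beginners and developers.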

CHAPTER 1 Introducing Hadoop

        for each value in values {
            sum = sum + value;
        }
        emit ((String)token, (Integer) sum);
    }

We’ve said before that the outputs of both the map and reduce functions are lists. As you can see from the pseudo-code, in practice we use a special function in the framework called emit() to generate the elements in the list one at a time. This emit() function further relieves the programmer from managing a large list.
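To make the role of emit() concrete, here is a minimal plain-Java sketch of the word-counting pseudo-code. It does not use any Hadoop classes: the class name EmitSketch, the run() driver, and the use of a BiConsumer as the emit callback are all assumptions made for illustration. The driver only imitates, on one machine, the grouping step that a framework would perform between map and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Plain-Java stand-in for the pseudo-code; no Hadoop classes are involved. */
final class EmitSketch {

    // map: emit (token, 1) for every whitespace-separated token in the document.
    static void map(String document, BiConsumer<String, Integer> emit) {
        StringTokenizer tokens = new StringTokenizer(document);
        while (tokens.hasMoreTokens()) {
            emit.accept(tokens.nextToken(), 1);
        }
    }

    // reduce: sum the values collected for one token and emit a single pair.
    static void reduce(String token, List<Integer> values,
                       BiConsumer<String, Integer> emit) {
        int sum = 0;
        for (int value : values) {
            sum = sum + value;
        }
        emit.accept(token, sum);
    }

    // Hypothetical driver playing the framework's role:
    // group the map output by key, then call reduce once per key.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            map(document, (token, one) ->
                    grouped.computeIfAbsent(token, k -> new ArrayList<>()).add(one));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            reduce(entry.getKey(), entry.getValue(), counts::put);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

Neither map() nor reduce() builds or returns a list; each hands pairs to the callback one at a time, which is the convenience the paragraph above attributes to emit().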

The code looks similar to what we had in section 1.5.1, except this time it will actually work at scale. Hadoop makes building scalable distributed programs easy, doesn’t it? Now let’s turn this pseudo-code into a Hadoop program.

1.6 Counting words with Hadoop: running your first program

Now that you know what Hadoop and the MapReduce framework are about, let’s get it running. In this chapter, we’ll run Hadoop only on a single machine, which can be your desktop or laptop computer. The next chapter will show you how to run Hadoop over a cluster of machines, which is what you’d want for practical deployment. Running Hadoop on a single machine is mainly useful for development work.

Linux is the official development and production platform for Hadoop, although Windows is a supported development platform as well. For a Windows box, you’ll need to install cygwin (http://www.cygwin.com/) to enable shell and Unix scripts.

NOTE

Many people have reported success in running Hadoop in development mode on other variants of Unix, such as Solaris and Mac OS X. In fact, the MacBook Pro seems to be the laptop of choice among Hadoop developers, as it’s ubiquitous at Hadoop conferences and user group meetings.

Running Hadoop requires Java (version 1.6 or higher). Mac users should get it from Apple. You can download the latest JDK for other operating systems from Sun at http://java.sun.com/javase/downloads/index.jsp. Install it and remember the root of the Java installation, which we’ll need later.

To install Hadoop, first get the latest stable release at http://hadoop.apache.org/core/releases.html. After you unpack the distribution, edit the script conf/hadoop-env.sh to set JAVA_HOME to the root of the Java installation you noted earlier. For example, in Mac OS X, you’ll replace this line

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

with this line

export JAVA_HOME=/Library/Java/Home

You’ll be using the Hadoop script quite often. Let’s run it without any arguments to see its usage documentation:

bin/hadoop

We get

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

We’ll cover the various Hadoop commands in the course of this book. For our current purpose, we only need to know that the command to run a (Java) Hadoop program is bin/hadoop jar <jar>. As the command implies, Hadoop programs written in Java are packaged in jar files for execution.

Fortunately for us, we don’t need to write a Hadoop program first; the default installation already has several sample programs we can use. The following command shows what is available in the examples jar file:

bin/hadoop jar hadoop-*-examples.jar

You’ll see about a dozen example programs prepackaged with Hadoop, and one of them is a word counting program called... wordcount! The important (inner) classes of that program are shown in listing 1.2. We’ll see how this Java program implements the word counting map and reduce functions we had in pseudo-code in listing 1.1. We’ll modify this program to understand how to vary its behavior. For now we’ll assume it works as expected and only follow the mechanics of executing a Hadoop program.

Without specifying any arguments, executing wordcount will show its usage information:

bin/hadoop jar hadoop-*-examples.jar wordcount

which shows the arguments list:

wordcount [-m <maps>] [-r <reduces>] <input> <output>

The only required parameters are an input directory (<input>) of text documents you want to analyze and an output directory (<output>) where the program will dump its output.

To execute wordcount, we need to first create an input directory:

mkdir input

and put some documents in it. You can add any text document to the directory. For illustration, let’s use the text version of the 2002 State of the Union address, obtained from http://www.gpoaccess.gov/sou/. We now analyze its word counts and see the results:

bin/hadoop jar hadoop-*-examples.jar wordcount input output

more output/*

You’ll see a word count of every word used in the document, listed in alphabetical order. This is not bad considering you have not written a single line of code yet! But also note a number of shortcomings in the included wordcount program. Tokenization is based purely on whitespace characters and not punctuation marks, making States, States., and States: separate words. The same is true for capitalization, where States and states appear as separate words. Furthermore, we would like to leave out words that show up in the document only once or twice.
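The whitespace-only behavior can be reproduced outside Hadoop. The sketch below is a plain-Java illustration, not the bundled example program: the class name WhitespaceTokenizing and the sample sentence are made up for this purpose, and only java.util classes are used.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Illustrative only: counts tokens the way a whitespace-only tokenizer sees them. */
final class WhitespaceTokenizing {

    // Split on StringTokenizer's default delimiters (whitespace) and count each token.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Punctuation and capitalization keep these four tokens apart.
        System.out.println(count("States, States. States: states"));
        // prints {States,=1, States.=1, States:=1, states=1}
    }
}
```

All four tokens refer to the same word, yet each is counted once under its own key.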

Fortunately, the source code for wordcount is available and included in the installation at src/examples/org/apache/hadoop/examples/WordCount.java. We can modify it as per our requirements. Let’s first set up a directory structure for our playground and make a copy of the program.

mkdir playground

mkdir playground/src

mkdir playground/classes

cp src/examples/org/apache/hadoop/examples/WordCount.java playground/src/WordCount.java

Before we make changes to the program, let’s go through compiling and executing this new copy in the Hadoop framework.

javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java

jar -cvf playground/wordcount.jar -C playground/classes/ .

You’ll have to remove the output directory before each run of this Hadoop command, because the job creates it automatically and will not run if it already exists.

bin/hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output

Look at the files in your output directory again. As we haven’t changed any program code, the result should be the same as before. We’ve only compiled our own copy rather than running the precompiled version.

Now we are ready to modify WordCount to add some extra features. Listing 1.2 is a partial view of the WordCount.java program. Comments and supporting code are stripped out.

Listing 1.2 WordCount.java

public class WordCount extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    ...
}

The main functional distinction between WordCount.java and our MapReduce pseudo-code is that in WordCount.java, map() processes one line of text at a time, whereas our pseudo-code processes a document at a time. This distinction may not even be apparent from looking at WordCount.java, as it’s Hadoop’s default configuration.
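For word counting, the granularity of each map() call does not change the final totals, because the per-token counts are summed afterward either way. The plain-Java sketch below illustrates this under stated assumptions: the class LineVersusDocument and its methods are hypothetical, and mapCall() merely stands in for one invocation of a map function.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Illustrative only: compares one map call per document with one map call per line. */
final class LineVersusDocument {

    // Stand-in for a single map() call: add 1 to the running total of every token.
    static void mapCall(String input, Map<String, Integer> totals) {
        StringTokenizer itr = new StringTokenizer(input);
        while (itr.hasMoreTokens()) {
            totals.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    // Document at a time, as in the pseudo-code: one call for the whole text.
    static Map<String, Integer> byDocument(String document) {
        Map<String, Integer> totals = new TreeMap<>();
        mapCall(document, totals);
        return totals;
    }

    // Line at a time, as in WordCount.java: one call per line of the text.
    static Map<String, Integer> byLine(String document) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : document.split("\n")) {
            mapCall(line, totals);
        }
        return totals;
    }

    public static void main(String[] args) {
        String document = "to be or\nnot to be";
        System.out.println(byDocument(document).equals(byLine(document))); // prints true
    }
}
```

The equivalence holds here because no token spans a line break; a computation that needs to see text across lines would have to treat the two granularities differently.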

The code in listing 1.2 is virtually identical to our pseudo-code in listing 1.1, though the Java syntax makes it more verbose. The map and reduce functions are inside inner classes of WordCount. You may notice we use special classes such as LongWritable, IntWritable, and Text instead of the more familiar Long, Integer, and String classes of Java. Consider these implementation details for now. The new classes have additional serialization capabilities needed by Hadoop’s internals.
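To give a rough idea of what "serialization capabilities" can mean for a wrapper type, here is a self-contained sketch. IntBox is a hypothetical class written for this illustration and is not Hadoop's IntWritable; it only mirrors the general pattern of a value that can write itself to a java.io.DataOutput and read itself back from a java.io.DataInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a serializable wrapper type; not the real IntWritable. */
final class IntBox {
    private int value;

    IntBox() { }                      // empty instance to be filled by readFields()
    IntBox(int value) { this.value = value; }

    int get() { return value; }

    // Write the wrapped int as four bytes.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrite this instance with an int read from the input.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Serialize to a byte array and deserialize into a fresh instance.
    static IntBox roundTrip(IntBox original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            IntBox copy = new IntBox();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new IntBox(42)).get()); // prints 42
    }
}
```

A plain Integer offers no such write and read methods, which is one way to see why a framework that moves keys and values between machines would prefer its own wrapper types.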

The changes we want to make to the program are easy to spot. We see that WordCount uses Java’s StringTokenizer in its default setting, which tokenizes based only on whitespace. To ignore standard punctuation marks, we add them to the StringTokenizer’s list of delimiter characters:

StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]`");
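The effect of the extended delimiter list can be checked in isolation. The following plain-Java sketch uses the same delimiter string as the line above; the class name PunctuationDelimiters and the sample sentence are assumptions for illustration, and no Hadoop classes are involved.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Illustrative only: counts tokens using a caller-supplied delimiter list. */
final class PunctuationDelimiters {

    // Same delimiter characters as in the modified StringTokenizer line.
    static final String DELIMITERS = " \t\n\r\f,.:;?![]`";

    // Every character in delimiters separates tokens and is never part of a token.
    static Map<String, Integer> count(String text, String delimiters) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text, delimiters);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("States, States. States: states", DELIMITERS));
        // prints {States=3, states=1}
    }
}
```

The three punctuated forms now collapse into a single States entry, while the lowercase states is still counted separately because delimiters do not affect capitalization.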
