Hadoop Distributed Programming in Action

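Hadoop in Action is a solid introductory text. It covers the basic concepts of the Hadoop distributed programming framework, how to start Hadoop, and its components, as well as writing MapReduce programs, advanced MapReduce features, programming practices, Hadoop administration, Hadoop in cloud computing, Pig programming, Hive, and the Hadoop ecosystem.

Before exploring Hadoop in depth, we first need to understand what it is. Hadoop is an open source framework for storing and processing large amounts of data, and it is especially suited to unstructured data. Its design is based on the MapReduce programming model and the GFS (Google File System) concepts published by Google, and it is maintained by the Apache Software Foundation.

The "Introducing Hadoop" chapter presents Hadoop's basic concepts, including its two core components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is Hadoop's storage system; it distributes large files across many inexpensive servers and provides high fault tolerance and high availability. MapReduce is Hadoop's computation model for processing and generating large data sets, running distributed parallel computation in two phases, map and reduce.

The "Starting Hadoop" chapter guides readers through setting up and configuring a Hadoop cluster. This typically includes installing a Java environment, downloading a Hadoop distribution, configuring communication between cluster nodes, and initializing HDFS and YARN (Yet Another Resource Negotiator, the resource scheduler that succeeded the one in the original MapReduce).

The "Components of Hadoop" chapter describes other components of the Hadoop ecosystem, such as Hadoop Common (shared libraries and services), Hadoop YARN, and Hadoop MapReduce, along with surrounding projects such as HBase (a NoSQL database), Hive (a data warehouse tool), Pig (a data flow language), and Oozie (a workflow scheduler).

In the "Hadoop in Action" part, readers learn to write basic MapReduce programs. A MapReduce program consists of a Mapper and a Reducer: the Mapper processes the input data and sends its results to the Reducer, which aggregates the data and produces the final output. This part also covers key concepts such as error handling, data partitioning, and sorting.

The "Advanced MapReduce" chapter goes deeper into advanced MapReduce features, which may include Combiner optimization, custom Partitioner and Input/OutputFormat classes, and Secondary Sort for more complex data processing.

The "Programming practices" chapter discusses good Hadoop programming practices, such as data formatting, logging, performance tuning, and code modularization. The "Cookbook" chapter offers practical Hadoop programming examples and solutions to help readers with real problems. The "Managing Hadoop" part explains how to monitor, debug, maintain, and scale a Hadoop cluster, including log analysis, performance monitoring, troubleshooting, and resource management.

The "Hadoop Gone Wild" chapter explores Hadoop in cloud computing environments, such as Amazon EMR (Elastic MapReduce), and the use of Pig and Hive for higher-level data processing and analysis. Finally, the book includes case studies showing Hadoop in use across different industries, and an appendix listing HDFS file commands for looking up and operating on HDFS.

"Hadoop in Action" is a comprehensive introduction to Hadoop and its ecosystem, suitable for readers who are interested in distributed computing or who plan to use Hadoop for big data processing.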

CHAPTER 1 Introducing Hadoop

    for each value in values {
        sum = sum + value;
    }
    emit ((String)token, (Integer) sum);
}

We've said before that the outputs of both the map and reduce functions are lists. As you can see from the pseudo-code, in practice we use a special function in the framework called emit() to generate the elements in the list one at a time. This emit() function further relieves the programmer from managing a large list.

The code looks similar to what we have in section 1.5.1, except this time it will actually work at scale. Hadoop makes building scalable distributed programs easy, doesn't it? Now let's turn this pseudo-code into a Hadoop program.
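The role of emit() can be illustrated in plain Java before any Hadoop code is involved. The following is a minimal sketch, not Hadoop's API: the class EmitSketch, its run() method, and the use of a BiConsumer as the emit callback are all assumptions made for illustration. The run() method plays the part of the framework by grouping emitted pairs by key between the two phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical in-memory stand-in for the framework; not Hadoop's API.
class EmitSketch {

    // map: emit (token, 1) for every whitespace-separated token in one document.
    static void map(String document, BiConsumer<String, Integer> emit) {
        StringTokenizer tokens = new StringTokenizer(document);
        while (tokens.hasMoreTokens()) {
            emit.accept(tokens.nextToken(), 1);
        }
    }

    // reduce: sum the values collected for one token and emit the total.
    static void reduce(String token, List<Integer> values,
                       BiConsumer<String, Integer> emit) {
        int sum = 0;
        for (int value : values) {
            sum = sum + value;
        }
        emit.accept(token, sum);
    }

    // Plays the role of the framework: runs map, groups by key, runs reduce.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            map(document, (token, one) ->
                    grouped.computeIfAbsent(token, k -> new ArrayList<>()).add(one));
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((token, values) -> reduce(token, values, result::put));
        return result;
    }

    public static void main(String[] args) {
        // prints {I=2, as=2, do=2, not=1, say=1}
        System.out.println(run(Arrays.asList("do as I say", "not as I do")));
    }
}
```

Neither map() nor reduce() ever builds or returns a list; each only hands one pair at a time to the callback, which is the convenience the emit() function provides.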

1.6 Counting words with Hadoop: running your first program

Now that you know what Hadoop and the MapReduce framework are about, let's get it running. In this chapter, we'll run Hadoop only on a single machine, which can be your desktop or laptop computer. The next chapter will show you how to run Hadoop over a cluster of machines, which is what you'd want for practical deployment. Running Hadoop on a single machine is mainly useful for development work.

Linux is the official development and production platform for Hadoop, although Windows is a supported development platform as well. For a Windows box, you'll need to install Cygwin (http://www.cygwin.com/) to enable shell and Unix scripts.

NOTE

Many people have reported success in running Hadoop in development mode on other variants of Unix, such as Solaris and Mac OS X. In fact, the MacBook Pro seems to be the laptop of choice among Hadoop developers, as it is ubiquitous at Hadoop conferences and user group meetings.

Running Hadoop requires Java (version 1.6 or higher). Mac users should get it from Apple. You can download the latest JDK for other operating systems from Sun at http://java.sun.com/javase/downloads/index.jsp. Install it and remember the root of the Java installation, which we'll need later.

To install Hadoop, first get the latest stable release at http://hadoop.apache.org/core/releases.html. After you unpack the distribution, edit the script conf/hadoop-env.sh to set JAVA_HOME to the root of the Java installation you noted earlier. For example, in Mac OS X, you'll replace this line

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

with this line

export JAVA_HOME=/Library/Java/Home

You'll be using the Hadoop script quite often. Let's run it without any arguments to see its usage documentation:

bin/hadoop

We get

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

We'll cover the various Hadoop commands in the course of this book. For our current purpose, we only need to know that the command to run a (Java) Hadoop program is bin/hadoop jar <jar>. As the command implies, Hadoop programs written in Java are packaged in jar files for execution.

Fortunately for us, we don't need to write a Hadoop program first; the default installation already has several sample programs we can use. The following command shows what is available in the examples jar file:

bin/hadoop jar hadoop-*-examples.jar

You'll see about a dozen example programs prepackaged with Hadoop, and one of them is a word counting program called... wordcount! The important (inner) classes of that program are shown in listing 1.2. We'll see how this Java program implements the word counting map and reduce functions we had in pseudo-code in listing 1.1. We'll modify this program to understand how to vary its behavior. For now we'll assume it works as expected and only follow the mechanics of executing a Hadoop program.

Without specifying any arguments, executing wordcount will show its usage information:

bin/hadoop jar hadoop-*-examples.jar wordcount

which shows the arguments list:

wordcount [-m <maps>] [-r <reduces>] <input> <output>


The only parameters are an input directory (<input>) of text documents you want to analyze and an output directory (<output>) where the program will dump its output. To execute wordcount, we need to first create an input directory:

mkdir input

and put some documents in it. You can add any text document to the directory. For illustration, let's put the text version of the 2002 State of the Union address, obtained from http://www.gpoaccess.gov/sou/. We now analyze its word counts and see the results:

bin/hadoop jar hadoop-*-examples.jar wordcount input output

more output/*

You'll see a word count of every word used in the document, listed in alphabetical order. This is not bad considering you have not written a single line of code yet! But, also note a number of shortcomings in the included wordcount program. Tokenization is based purely on whitespace characters and not punctuation marks, making States, States., and States: separate words. The same is true for capitalization, where States and states appear as separate words. Furthermore, we would like to leave out words that show up in the document only once or twice.
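The tokenization shortcoming is easy to reproduce outside Hadoop. The sketch below is plain Java and only imitates the counting; the class name WhitespaceTokenDemo and the sample sentence are made up for illustration. It uses StringTokenizer in its default setting, the same way the stock example does.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical demo class; it only imitates the counting, not the Hadoop job.
class WhitespaceTokenDemo {

    // Counts tokens split on whitespace only (StringTokenizer's default).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {States,=1, States.=1, states=1, the=3}
        System.out.println(count("the States, the States. the states"));
    }
}
```

Three spellings of the same word end up under three different keys, which is exactly the behavior we want to change.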

Fortunately, the source code for wordcount is available and included in the installation at src/examples/org/apache/hadoop/examples/WordCount.java. We can modify it to suit our requirements. Let's first set up a directory structure for our playground and make a copy of the program.

mkdir playground

mkdir playground/src

mkdir playground/classes

cp src/examples/org/apache/hadoop/examples/WordCount.java playground/src/WordCount.java

Before we make changes to the program, let's go through compiling and executing this new copy in the Hadoop framework.

javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java

jar -cvf playground/wordcount.jar -C playground/classes/ .

You'll have to remove the output directory each time you run this Hadoop command, because Hadoop creates it automatically and will not run the job if the directory already exists.

bin/hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output

Look at the files in your output directory again. As we haven't changed any program code, the result should be the same as before. We've only compiled our own copy rather than running the precompiled version.

Now we are ready to modify WordCount to add some extra features. Listing 1.2 is a partial view of the WordCount.java program. Comments and supporting code are stripped out.

Listing 1.2 WordCount.java

public class WordCount extends Configured implements Tool {

    // Mapper: collects (word, 1) for every whitespace-separated token in a line.
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    ...
}

The main functional distinction between WordCount.java and our MapReduce pseudo-code is that in WordCount.java, map() processes one line of text at a time whereas our pseudo-code processes a document at a time. This distinction may not even be apparent from looking at WordCount.java as it's Hadoop's default configuration.
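For word counting, the line-at-a-time and document-at-a-time views give the same totals, because counts are summed per word regardless of how the input is cut up. The sketch below checks this in plain Java. It is not Hadoop code: the class LineAtATimeSketch and its methods are hypothetical, and for brevity the counts are summed directly in a map rather than emitted as pairs and reduced.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical comparison of the two input granularities; not Hadoop's API.
class LineAtATimeSketch {

    // Tokenizes one chunk of text on whitespace and adds its words to counts.
    static void mapChunk(String chunk, Map<String, Integer> counts) {
        StringTokenizer itr = new StringTokenizer(chunk);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    // Calls mapChunk once per line, as WordCount.java's map() is invoked.
    static Map<String, Integer> countByLine(String document) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : document.split("\n")) {
            mapChunk(line, counts);
        }
        return counts;
    }

    // Calls mapChunk once for the whole document, as the pseudo-code does.
    static Map<String, Integer> countByDocument(String document) {
        Map<String, Integer> counts = new TreeMap<>();
        mapChunk(document, counts);
        return counts;
    }

    public static void main(String[] args) {
        String doc = "to be or\nnot to be";
        // prints true
        System.out.println(countByLine(doc).equals(countByDocument(doc)));
    }
}
```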

The code in listing 1.2 is virtually identical to our pseudo-code in listing 1.1, though the Java syntax makes it more verbose. The map and reduce functions are inside inner classes of WordCount. You may notice we use special classes such as LongWritable, IntWritable, and Text instead of the more familiar Long, Integer, and String classes of Java. Consider these implementation details for now. The new classes have additional serialization capabilities needed by Hadoop's internals.
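To give a rough feel for what "serialization capabilities" means here, the following plain-Java sketch wraps an int and can write itself to, and read itself back from, a binary stream. The class IntBox is a hypothetical stand-in, not Hadoop's IntWritable; only the method names write() and readFields() are borrowed from Hadoop's Writable interface, and the round-trip helper is an assumption added for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for IntWritable; not the Hadoop class.
class IntBox {
    private int value;

    IntBox() { }

    IntBox(int value) { this.value = value; }

    int get() { return value; }

    // Writes the wrapped int in binary form (4 bytes).
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrites this object's state from the binary form.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Serializes to a byte array and reads the bytes back into a new object.
    static IntBox roundTrip(IntBox original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            IntBox copy = new IntBox();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints 42
        System.out.println(roundTrip(new IntBox(42)).get());
    }
}
```

An object that can be reduced to bytes and rebuilt elsewhere is what lets a framework move keys and values between machines.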

The changes we want to make to the program are easy to spot. We see that WordCount uses Java's StringTokenizer in its default setting, which tokenizes based only on whitespace. To ignore standard punctuation marks, we add them to the StringTokenizer's list of delimiter characters:

StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]`");
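The effect of the longer delimiter list can be checked in isolation. This is a small standalone sketch, with a made-up class name (DelimiterDemo) and sample sentence; the delimiter string is copied from the line above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class for trying out the delimiter list outside Hadoop.
class DelimiterDemo {

    // Whitespace plus punctuation, copied from the delimiter list in the text.
    static final String DELIMITERS = " \t\n\r\f,.:;?![]`";

    // Splits a line on any delimiter character and returns the tokens in order.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [States, States, States, states]
        System.out.println(tokenize("States, States. States: states"));
    }
}
```

Punctuation no longer sticks to the words, although States and states still differ by case.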
