Understanding Hadoop in Depth: Distributed Big Data Processing

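"Hadoop in Action is a book that introduces Hadoop and its applications. It covers Hadoop's core components, the MapReduce programming model, managing large data sets, and running Hadoop in cloud environments."

Hadoop is an open source framework, developed under the Apache foundation, that is widely used for big data processing. Its design goal is to let users write and run distributed applications without having to understand the low-level details of distributed systems. The core of Hadoop consists of two main parts: the Hadoop Distributed File System (HDFS) and the MapReduce computation model.

HDFS is the foundation of Hadoop. It is a highly fault-tolerant distributed file system that is well suited to storing and processing large volumes of data. HDFS runs on low-cost hardware and provides high-throughput data access, so applications can read and write large amounts of data quickly. Unlike traditional file systems, HDFS relaxes its adherence to the POSIX standard and emphasizes streaming data access, which lets it handle large data sets efficiently.

MapReduce is the computation model Hadoop uses to process data. It splits a large job into small parts and processes them in parallel to improve throughput. The map phase applies a mapping function to each split of the input data, and the reduce phase aggregates the map results to produce the final output. This model makes processing massive data sets simpler and more efficient.

The book Hadoop in Action explains in detail how to use Hadoop. Starting with chapter 1, "Introducing Hadoop", readers learn the basic principles of writing scalable, distributed, data-intensive programs, and how Hadoop and MapReduce work. Later chapters go progressively deeper: starting and managing Hadoop, writing basic and advanced MapReduce programs, programming best practices, programming with Pig, and running Hadoop in the cloud. The book also includes an appendix of HDFS file commands for reference.

By reading this book, readers can understand Hadoop's architecture and principles and also pick up hands-on operating and programming skills, so that they can use Hadoop effectively to process large data sets for data analysis and mining. For readers who want to get into big data or improve their existing Hadoop skills, it is a valuable reference.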

Please post comments or corrections to the Author Online forum:

http://www.manning-sandbox.com/forum.jspa?forumID=544

To install Hadoop, first get the latest stable release at http://hadoop.apache.org/core/releases.html. After you unpack the distribution, edit the script conf/hadoop-env.sh so that JAVA_HOME is set to the root of the Java installation you noted earlier. For example, in Mac OS X, you will replace this line:

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

with this line:

export JAVA_HOME=/Library/Java/Home

You will be using the Hadoop script quite often. Let’s run the script without any arguments to see its usage documentation:

bin/hadoop

We get

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

We will cover the various Hadoop commands throughout this book. For our current purpose, we just need to know that the command to run a (Java) Hadoop program is bin/hadoop jar <jar>. As the command implies, Hadoop programs written in Java are packaged in jar files for execution.

Luckily for us, we don’t need to write a Hadoop program first; the default installation already has several sample programs we can use. The following command shows what is available in the examples jar file:

bin/hadoop jar hadoop-*-examples.jar

You should see about a dozen example programs prepackaged with Hadoop, and one of them is a word counting program called... wordcount! The important (inner) classes of that program are shown in listing 1.x. In a moment, we will explain how this Java program implements the word counting map and reduce functions we had in pseudo-code in listing 1.x, and we will also explain how you can modify this program to vary its behavior. For now we will assume it works as expected and we just want to show the mechanics of executing a Hadoop program.

Licensed to Chi Wu <cswu@synnex.com>

Without specifying any arguments, executing wordcount will show its usage information:

bin/hadoop jar hadoop-*-examples.jar wordcount

which shows the arguments list:

wordcount [-m <maps>] [-r <reduces>] <input> <output>

The only parameters are an input directory (<input>) of text documents you want to analyze and an output directory (<output>) where the program will dump its output. So to execute wordcount, we need to first create an input directory:

mkdir input

and put some documents in it. You can put any text documents there. For illustration, we will put the text version of the 2002 State of the Union address, obtained from http://www.gpoaccess.gov/sou/, into the input directory. We now analyze its word counts and see the results:

bin/hadoop jar hadoop-*-examples.jar wordcount input output

more output/*

You will see a word count of every word used in the document, listed in alphabetical order. This is not bad considering you have not written a single line of code yet! However, you will also notice a number of shortcomings in the included wordcount program. Tokenization is based purely on whitespace characters and not punctuation marks, so “States”, “States.”, and “States:” are all considered separate words. We would also like to ignore capitalization, so “States” and “states” can be counted together. Furthermore, most words have a small count, showing up in the document only once or twice, and we don’t really care about those infrequent words.
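
The whitespace-only tokenization described above is easy to reproduce outside Hadoop. The following is a minimal plain-Java sketch, not code from the Hadoop distribution: the class name WhitespaceTokenDemo and the sample sentence are made up for illustration. It uses java.util.StringTokenizer with its default delimiters, which is what the stock wordcount relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class for illustration; not part of Hadoop.
class WhitespaceTokenDemo {

    // Splits a line the way a default StringTokenizer does: on whitespace only.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample text: punctuation stays attached to each token,
        // and case differences are preserved.
        System.out.println(tokenize("States states States. States:"));
        // prints [States, states, States., States:]
    }
}
```

Each of the four tokens would be counted as a different word, which is the shortcoming the modified program sets out to fix.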

Fortunately, the source code for wordcount is available and included in the installation at src/examples/org/apache/hadoop/examples/WordCount.java. We can modify it to do what we want. We first set up a directory structure for our playground and make a copy of the program.

mkdir playground

mkdir playground/src

mkdir playground/classes

cp src/examples/org/apache/hadoop/examples/WordCount.java playground/src/WordCount.java

Before we make changes to the program, let’s go through compiling and executing this new copy in the Hadoop framework.

javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java

jar -cvf playground/wordcount.jar -C playground/classes/ .

You will have to remove the output directory each time you run this Hadoop command, since it will attempt to create that directory.

bin/hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output
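
The output-directory cleanup mentioned above can be scripted. The sketch below is one way to do it in plain Java, under the assumption that the job writes to a directory on the local file system, as in the walkthrough here where input and output are ordinary local directories. The class name OutputDirCleaner is hypothetical, and the code uses only java.nio.file from the JDK, not any Hadoop API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper for illustration; assumes a local output directory.
class OutputDirCleaner {

    // Deletes dir and everything under it; does nothing if dir is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OutputDirCleaner <output-dir>");
            return;
        }
        deleteRecursively(Paths.get(args[0]));
    }
}
```

Requiring the directory as an explicit argument is a deliberate choice: a recursive delete with a silent default is easy to run in the wrong place.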

Look at the files in your output directory again. Since we haven’t changed any program code, the result should be the same as before. We just went through compiling our own copy rather than running the precompiled version.

Now we are ready to modify WordCount to add some extra features. Listing 1.2 is a partial view of the WordCount.java program. Comments and supporting code are stripped out.

Listing 1.2 WordCount.java

public class WordCount extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);    // #1
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());                      // #2
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));          // #3
        }
    }

    ...
}

The main functional distinction between WordCount.java and our MapReduce pseudo-code is that WordCount.java is set up such that map() processes one line of text at a time whereas our pseudo-code processes a document at a time. This distinction may not even be apparent from looking at WordCount.java as it’s simply a use of Hadoop’s default configuration.
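
To make the line-at-a-time behavior concrete, here is a small simulation in plain Java. It is a sketch under stated assumptions rather than Hadoop code: the class LineAtATimeSketch and its methods are hypothetical, an in-memory TreeMap stands in for the framework's grouping of map output by key, and the key that Hadoop passes to map() is ignored because the listing never uses it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process simulation; not part of Hadoop.
class LineAtATimeSketch {

    // "map": called once per line, records a 1 for every token on that line.
    static void map(String line, Map<String, List<Integer>> grouped) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
        }
    }

    // "reduce": called once per word, sums the values collected for it.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        // A sorted map stands in for the grouping and ordering by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            map(line, grouped);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two made-up lines; map() runs twice, once for each.
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
        // prints {a=2, b=2, c=1}
    }
}
```

Because the counts for a word are summed in reduce regardless of which map() call produced them, processing a line at a time gives the same totals as processing a whole document at a time.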

Otherwise, the code in listing 1.2 is virtually identical to our pseudo-code in listing 1.1. Of course, the Java syntax makes it a bit more verbose. The map and reduce functions are inside inner classes of WordCount. You may notice we use special classes such as LongWritable, IntWritable, and Text instead of the more familiar Long, Integer, and String classes of Java. It’s really an implementation detail for Hadoop as these new classes provide serialization capabilities better tuned for Hadoop.
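
As a rough illustration of what such serialization-oriented classes look like, the sketch below defines a tiny stand-in in plain Java. SimpleWritable and SimpleIntWritable are hypothetical names, not Hadoop classes. The two-method shape, write(DataOutput) and readFields(DataInput), mirrors Hadoop's Writable interface, but everything else here is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a Writable-style contract; not a Hadoop type.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical int wrapper that serializes itself as exactly four bytes.
class SimpleIntWritable implements SimpleWritable {
    private int value;

    SimpleIntWritable() { }

    SimpleIntWritable(int value) { this.value = value; }

    int get() { return value; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrites this object's state, so one instance can be reused.
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    static byte[] toBytes(SimpleWritable w) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        w.write(new DataOutputStream(buffer));
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = toBytes(new SimpleIntWritable(42));
        SimpleIntWritable copy = new SimpleIntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(bytes.length + " bytes, value " + copy.get());
        // prints 4 bytes, value 42
    }
}
```

The mutable, reusable design is also why listing 1.2 can keep a single word object and call word.set() for every token instead of allocating a new object each time.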

The changes we want to make to the program are easy to spot. We see in line #1 that WordCount uses Java’s StringTokenizer in its default setting, which tokenizes based only on whitespace. To ignore standard punctuation marks, we add them to the StringTokenizer’s list of delimiter characters:

StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]`");

When looping through the set of tokens, each token is extracted and stored in a Text object in line #2. (In Hadoop, the special class Text is used in place of String.) We want word count to ignore capitalization, so we lowercase all the words before turning them into Text objects.

word.set(itr.nextToken().toLowerCase());
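
The combined effect of the first two changes can be checked without running a Hadoop job. The following plain-Java sketch applies the same delimiter string and the same toLowerCase() call to a made-up sentence. The class name NormalizedTokenDemo and the sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class for illustration; not part of Hadoop.
class NormalizedTokenDemo {

    // Same delimiter characters as the modified line #1 above.
    static final String DELIMITERS = " \t\n\r\f,.:;?![]`";

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
        while (itr.hasMoreTokens()) {
            // Same normalization as the modified line #2 above.
            tokens.add(itr.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("States, states. States: we've"));
        // prints [states, states, states, we've]
    }
}
```

The apostrophe is not in the delimiter list, so a contraction such as "we've" survives as a single token.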

Finally, we only care about words with counts higher than four. We therefore modify #3 to collect the word count into the output only if that condition is met. (This is Hadoop’s equivalent of the emit() function in our pseudo-code.)

if (sum > 4) output.collect(key, new IntWritable(sum));
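
The effect of that condition can be seen in isolation with a small plain-Java sketch. It is a simplified stand-in, not the Hadoop reducer: ThresholdReduceSketch and the constant THRESHOLD are hypothetical names, Integer replaces IntWritable, and a Map replaces the OutputCollector.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the modified reduce step; not Hadoop code.
class ThresholdReduceSketch {

    // Hypothetical name for the cutoff that the text hard-codes as 4.
    static final int THRESHOLD = 4;

    // Sums the values for one key and records the total only if it exceeds
    // the threshold; otherwise nothing is written for that key.
    static void reduce(String key, Iterator<Integer> values,
                       Map<String, Integer> output) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        if (sum > THRESHOLD) {
            output.put(key, sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> output = new TreeMap<>();
        // Made-up inputs: one word seen five times, another seen four times.
        reduce("the", Collections.nCopies(5, 1).iterator(), output);
        reduce("corps", Collections.nCopies(4, 1).iterator(), output);
        System.out.println(output);
        // prints {the=5}
    }
}
```

A word with exactly four occurrences is dropped, since the test is strictly greater than four.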

After making changes to those three lines, you can recompile the program and execute it again. The results are shown in table 1.1.

Table 1.1. Words that have occurred more than four times in the 2002 State of the Union Address.

11th (5) citizens (9) its (6) over (6) to (123)

a (69) congress (10) jobs (11) own (5) together (5)

about (5) corps (6) join (7) page (7) tonight (5)

act (7) country (10) know (6) people (12) training (5)

afghanistan (10) destruction (5) last (6) protect (5) united (6)

all (10) do (6) lives (6) regime (5) us (6)

allies (8) every (8) long (5) regimes (6) want (5)

also (5) evil (5) make (7) security (19) war (12)

America (33) for (27) many (5) september (5) was (11)

American (15) free (6) more (11) so (12) we (76)

americans (8) freedom (10) most (5) some (6) we've (5)

an (7) from (15) must (18) states (9) weapons (12)

and (210) good (13) my (13) tax (7) were (7)

are (17) great (8) nation (11) terror (13) while (5)

as (18) has (12) need (7) terrorist (12) who (18)

ask (5) have (32) never (7) terrorists (10) will (49)

at (16) health (5) new (13) than (6) with (22)

be (23) help (7) no (7) that (29) women (5)

been (8) home (5) not (15) the (184) work (7)

best (6) homeland (7) now (10) their (17) workers (5)

budget (7) hope (5) of (130) them (8) world (17)

but (7) i (29) on (32) these (18) would (5)

by (13) if (8) one (5) they (12) yet (8)

camps (8) in (79) opportunity (5) this (28) you (12)

can (7) is (44) or (8) thousands (5)

children (6) it (21) our (78) time (7)

It appears that 128 words have a frequency count greater than four. Many of these words appear frequently in almost any English text. For example, there is “a” (69), “and” (210), “i” (29), “in” (79), “the” (184) and many others. However, we also see many words that summarize the issues facing the United States at that time: “terror” (13), “terrorist” (12), “terrorists” (10), “security” (19), “weapons” (12), “destruction” (5), “afghanistan” (10), “freedom” (10), “jobs” (11), “budget” (7), and many others.

1.7 History of Hadoop

Hadoop started out as a subproject of Nutch, which in turn was a subproject of Apache Lucene. Doug Cutting founded all three projects in his spare time, and each project was a logical progression of the previous one.

Lucene is a full-featured text indexing and searching library. Given a text collection, a developer can easily add search capability to the documents using the Lucene engine. Desktop search, enterprise search, and many domain-specific search engines have been built using Lucene. In particular, Nutch is the most ambitious extension of Lucene, as it tries to build a complete Web search engine using Lucene as its core component. Nutch has parsers for HTML, a Web crawler, a link-graph database, and other extra components necessary for a Web search engine. Doug Cutting envisions Nutch as an open, democratic alternative to the proprietary technologies in commercial offerings such as Google.

Besides having extra components like crawler and parser, a Web search engine differs from a basic document search engine in terms of scale. Whereas Lucene is targeted at
