Hadoop MapReduce Explained: From Getting Started to Advanced Use


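Hadoop MapReduce is a distributed computing framework developed by the Apache Software Foundation for processing large data sets (multi-terabyte) in parallel. This document is an introduction to and tutorial on the MapReduce framework, suitable for both first-time and experienced users.

1. **Purpose**: The document describes all user-facing aspects of the MapReduce framework and serves as a detailed guide for developers writing applications that make effective use of a Hadoop cluster, particularly for data-heavy tasks such as text analysis and log processing.
2. **Prerequisites**: Hadoop must be installed, configured, and running before you start. Beginners are advised to complete the single-node setup; users who need large distributed clusters should complete the cluster setup. Installation and configuration may include downloading Hadoop, setting environment variables, and starting the daemons.
3. **Overview**: MapReduce breaks a complex computation into two simple phases, Map and Reduce. The input data is split into chunks and the Map phase applies a user-defined operation to each chunk (mapping); the Reduce phase collects and merges the Map results (reduction). This pattern lets a job make efficient use of the processors and memory across the cluster.
4. **Inputs and outputs**: Input can come in various formats, such as text files or database records. Output is the processed data, in the same format or a custom one. The user must specify the input path and the desired output path.
5. **Example: WordCount v1.0**: The classic WordCount example shows how to write basic Map and Reduce functions, how to set up the Mapper and Reducer classes, and how to configure the job and submit it to Hadoop.
6. **MapReduce user interfaces**: This part covers the interfaces through which users interact with the framework, including the task execution environment, job configuration, and job monitoring. These interfaces give developers detailed control over task parameters, error handling, and performance tuning.
7. **Example: WordCount v2.0**: An upgraded example. The new version may include optimizations, improved error handling, and API changes, and may come with code samples and tips on using the newer features to improve performance.

The tutorial offers both theory and hands-on guidance on using the Hadoop MapReduce framework for data processing in real projects, for beginners and experienced developers alike.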

MapReduce Tutorial


The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of < <word>, 1>.

For the given sample input the first map emits:

< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:

< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
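The tokenizing step can be tried without a cluster. The following is a minimal plain-Java sketch, not the tutorial's Mapper: `MapStepSketch` and its `map` method are hypothetical names, the Hadoop types are replaced by `String` and `Integer`, and the two input lines are inferred from the pairs listed above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java illustration only; MapStepSketch is a hypothetical name, not a Hadoop class.
class MapStepSketch {

    // Mimics the body of a WordCount map call: one input line in,
    // one <word, 1> pair out per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line); // default delimiters are whitespace
        while (tokenizer.hasMoreTokens()) {
            emitted.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Assumed sample input, reconstructed from the emitted pairs.
        String[] lines = {"Hello World Bye World", "Hello Hadoop Goodbye Hadoop"};
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                System.out.println("< " + pair.getKey() + ", " + pair.getValue() + ">");
            }
        }
    }
}
```

A line containing only whitespace yields no pairs at all, which is one way a single input record can map to zero outputs.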

We'll learn more about the number of maps spawned for a given job, and how to control them in a fine-grained manner, a bit later in the tutorial.

WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer as per the job configuration) for local aggregation, after being sorted on the keys.

The output of the first map:

< Bye, 1>
< Hello, 1>
< World, 2>

The output of the second map:

< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
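The effect of local aggregation on one map's output can be approximated in plain Java. This is a sketch under assumptions rather than Hadoop's combiner machinery: `CombinerSketch` is a hypothetical name, and a `TreeMap` stands in for both the sort on keys and the per-key summing, which the framework performs as separate steps.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; CombinerSketch is a hypothetical name, not a Hadoop class.
class CombinerSketch {

    // Takes the words emitted by a single map (each with an implicit count of 1)
    // and returns per-word totals. TreeMap keeps the keys in sorted order.
    static Map<String, Integer> combine(String[] emittedWords) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : emittedWords) {
            combined.merge(word, 1, Integer::sum); // add 1 to the running total for this word
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(combine(new String[] {"Hello", "World", "Bye", "World"}));
        System.out.println(combine(new String[] {"Hello", "Hadoop", "Goodbye", "Hadoop"}));
    }
}
```

Each map now hands three pairs to the reduce side instead of four, which is the point of combining locally: less intermediate data to move.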

The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (i.e. words in this example).

Thus the output of the job is:

< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
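To see how the two combined map outputs become the final result, the grouping and summing can be imitated in plain Java. This is a rough sketch, not the tutorial's Reducer: `ReduceStepSketch`, `groupByKey`, and `run` are hypothetical names, and `groupByKey` is only a stand-in for the framework's grouping of values by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; ReduceStepSketch is a hypothetical name, not a Hadoop class.
class ReduceStepSketch {

    // Mirrors the reduce logic: sum every value that arrived for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Stand-in for the framework's grouping: collects values from all map outputs per key, sorted by key.
    static Map<String, List<Integer>> groupByKey(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output.entrySet()) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Groups the map outputs, then calls reduce once per key.
    static Map<String, Integer> run(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(mapOutputs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> firstMap = new TreeMap<>();
        firstMap.put("Bye", 1);
        firstMap.put("Hello", 1);
        firstMap.put("World", 2);
        Map<String, Integer> secondMap = new TreeMap<>();
        secondMap.put("Goodbye", 1);
        secondMap.put("Hadoop", 2);
        secondMap.put("Hello", 1);
        System.out.println(run(Arrays.asList(firstMap, secondMap)));
    }
}
```

Only "Hello" receives values from both maps (1 and 1), so it is the only key whose count changes between the combiner output and the job output.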


The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats etc., in the JobConf. It then calls the JobClient.runJob (line 55) to submit the job and monitor its progress.

We'll learn more about JobConf, JobClient, Tool and other interfaces and classes a bit later in the tutorial.

6 MapReduce - User Interfaces

This section provides a reasonable amount of detail on every user-facing aspect of the MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial.

Let us first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods.

We will then discuss other core interfaces including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter and others.

Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.

6.1 Payload

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.

6.1.1 Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Overall, Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup.
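The call order described here (configure once, map once per record, close once) can be illustrated with a small plain-Java stand-in. Everything in this sketch is hypothetical: `MapperLifecycleSketch` is not Hadoop's Mapper interface, a `String` job name stands in for the JobConf, a record index stands in for the input key, and `runTask` only imitates what the framework does for one InputSplit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these names belong to the Hadoop API.
class MapperLifecycleSketch {
    private final List<String> calls = new ArrayList<>();
    private String jobName; // hypothetical setting read during initialization

    // Plays the role of configure(JobConf): called once, before any record.
    void configure(String jobName) {
        this.jobName = jobName;
        calls.add("configure");
    }

    // Plays the role of map(...): called once for every record in the split.
    void map(int recordIndex, String record) {
        calls.add("map#" + recordIndex);
    }

    // Plays the role of close(): called once, after the last record.
    void close() {
        calls.add("close");
    }

    // Stand-in for the framework driving one map task over one split.
    static List<String> runTask(List<String> split) {
        MapperLifecycleSketch mapper = new MapperLifecycleSketch();
        mapper.configure("wordcount-sketch");
        for (int i = 0; i < split.size(); i++) {
            mapper.map(i, split.get(i));
        }
        mapper.close();
        return mapper.calls;
    }

    public static void main(String[] args) {
        System.out.println(runTask(Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop")));
    }
}
```

Per-task setup such as reading a setting belongs in the configure step and teardown in the close step, so neither cost is paid again for every record.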
