Hadoop Streaming: Detailed Explanation and Practical Guide

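Updated 2024-09-29 · PDF, 49 KB

Hadoop Streaming is a utility in the Hadoop ecosystem that lets any executable program (such as a script or a command-line tool) serve as the Mapper and Reducer of a MapReduce job. It passes data through standard input and standard output, so non-Java programs can take part in Hadoop's data-processing pipeline.

The basic principle is that the Mapper and Reducer programs receive data on standard input (stdin) and send their results to standard output (stdout). When submitting a job, the user specifies where the Mapper and Reducer programs are located, usually as the path to an executable file. The Mapper and Reducer can be written as, for example, shell scripts or Python scripts.

The section on packaging files with job submissions explains that the Mapper and Reducer programs, together with any files they depend on, are packaged (for example as a tar or zip archive) and supplied with the job at submission time, so that the nodes of the Hadoop cluster can access the files they need.

Streaming options and usage:

1. Mapper-Only Jobs: some jobs need only a Mapper phase and no Reducer phase, which is achieved by not specifying a Reducer.
2. Specifying Other Plugins for Jobs: the user can specify other plugins, such as a Combiner, to improve performance.
3. Large files and archives in Hadoop Streaming: large files or archives can be used in a streaming job, provided that every required file is included in what is submitted with the job.
4. Specifying Additional Configuration Variables for Jobs: extra configuration variables can be set to adjust the job's behavior, such as HDFS parameters or MapReduce-specific settings.
5. Other Supported Options: further options cover, for example, the input and output formats and the log level.

Further usage examples include:

1. Customizing how lines are split into key/value pairs, so that the parsing logic fits the data.
2. Using a useful Partitioner class (such as KeyFieldBasedPartitioner) to perform a secondary sort.
3. Working with the Hadoop Aggregate package to perform simple aggregations, similar to built-in reduce operations.
4. Field selection, which extracts specific fields from the input data, similar to the Unix `cut` command.

The frequently asked questions section answers common questions, such as how to run a set of independent tasks, how to process input file by file, how to decide the number of Reducers, whether aliases can be used in shell scripts, and whether Unix pipes can be used.

Hadoop Streaming gives non-Java developers a flexible data-processing framework in which existing scripts or command-line tools can be reused for large-scale data analysis, which broadens the range of tasks Hadoop can be applied to. Understanding how Hadoop Streaming works and how to use it makes it easier to handle complex data-processing tasks efficiently.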

1. Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

2. How Does Streaming Work

In the above example, both the mapper and the reducer are executables that read their input from stdin (line by line) and emit their output to stdout. The utility creates a map/reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.
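To make the stdin/stdout contract concrete, here is a minimal sketch of a mapper written in Python. It is an illustration only, not part of the original guide: the function name `run_mapper` and the word-count logic are assumptions chosen for the example, and the demonstration feeds it in-memory streams instead of a real job.

```python
import io


def run_mapper(stdin, stdout):
    """Read text lines from stdin and write one 'word<TAB>1' line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


# Local demonstration with in-memory streams (hypothetical input).
# In a real streaming job the script would instead call
# run_mapper(sys.stdin, sys.stdout) after importing sys.
demo_out = io.StringIO()
run_mapper(io.StringIO("to be or\nnot to be\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Passing the streams as parameters keeps the logic testable without a cluster; the same function works unchanged when handed the process's real stdin and stdout.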

When an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
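The default splitting rule described above can be modeled in a few lines of Python. This is a sketch of the rule as stated in the text, not the framework's actual implementation; the function name `split_key_value` is hypothetical.

```python
def split_key_value(line, separator="\t"):
    """Split an output line into (key, value) at the first separator.

    Models the default rule: everything before the first tab is the key,
    everything after it is the value; with no tab, the whole line is the
    key and the value is None.
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    return (key, value) if found else (line, None)


print(split_key_value("user42\tclick\thome"))  # only the first tab splits
print(split_key_value("no-tab-here"))          # whole line becomes the key
```

Note that only the first tab matters: any later tabs stay inside the value.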

When an executable is specified for reducers, each reducer task launches the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
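A reducer executable sees its input as plain "key<TAB>value" lines, so it has to detect key boundaries itself. The sketch below sums integer values per key. It assumes that the lines arrive grouped (sorted) by key, which this passage does not state explicitly, and that every value is an integer; the names `parse` and `reduce_lines` are hypothetical.

```python
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>value' line into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def reduce_lines(lines):
    """Yield 'key<TAB>total' for each run of consecutive lines sharing a key.

    Assumption: input lines are already grouped by key, so one pass with
    groupby is enough and no per-key dictionary is needed.
    """
    pairs = (parse(line) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Local demonstration with hypothetical, already-sorted input lines.
sample = ["be\t1\n", "be\t1\n", "or\t1\n", "to\t1\n"]
for out_line in reduce_lines(sample):
    print(out_line)
```

In a real job the function would be driven by `sys.stdin` and its results written to `sys.stdout`; streaming over grouped input keeps memory use constant regardless of how many keys there are.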

This is the basis for the communication protocol between the map/reduce framework and the streaming mapper/reducer.

You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \


doudou0411

