Hadoop 2.9.0 MapReduce Configuration Explained: mapred-default.xml and mapred-site.xml

Updated 2024-09-08, 194 KB PDF

"这是关于Hadoop 2.9.0版本的MapReduce配置文件mapred-default.xml的属性集合，包含了所有默认属性及其描述。文件中关键属性已经通过PDF注释进行了详细解释，便于理解和配置Hadoop MapReduce作业。" 在Hadoop MapReduce框架中，`mapred-default.xml`和`mapred-site.xml`是两个非常重要的配置文件，它们定义了MapReduce作业运行时的默认行为和特定设置。以下是一些关键属性的详细说明： 1. **mapreduce.job.hdfs-servers**：此属性的默认值是`${fs.defaultFS}`，它指定了HDFS（Hadoop Distributed File System）的默认名称节点地址。这用于MapReduce作业提交和数据读取。 2. **mapreduce.job.committer.setup.cleanup.needed**：这个属性如果设置为`true`，表示作业需要进行job setup和job cleanup阶段。默认情况下，它是`true`。如果作业不需要这些阶段，可以将其设置为`false`来优化性能。 3. **mapreduce.task.io.sort.factor**：这个参数指定了在排序文件时同时合并的流的数量，默认值是10。它影响了系统在排序过程中打开的文件句柄数。 4. **mapreduce.task.io.sort.mb**：设定在排序文件时使用的缓冲内存总量，单位是MB，默认值为100MB。这决定了每个合并流能分配到的内存大小，有助于减少磁盘I/O操作。 5. **mapreduce.map.sort.spill.percent**：这是一个软限制，当序列化缓冲区达到这个百分比（默认为0.80或80%）时，后台线程将开始溢写内容到磁盘。即使已经有溢写正在进行，也不会阻塞收集，所以实际溢写的数据量可能超过这个阈值。 6. **mapreduce.jobtracker.address**：这个属性定义了MapReduce作业追踪器的主机和端口。如果设置为"local"，则作业会在本地作为一个单一的map和reduce任务运行。 7. **mapreduce.local.clientfactory.class.name**：这个属性指定了客户端工厂类，即`org.apache.hadoop.mapred.LocalClientFactory`，它负责创建与本地作业Tracker交互的客户端实例。以上只是部分关键属性的描述，完整的`mapred-default.xml`文件中包含更多的配置选项，如作业调度、任务执行策略、日志管理、内存分配等。正确理解和配置这些属性对于优化Hadoop MapReduce作业的性能和资源利用率至关重要。在实际应用中，开发者可以根据具体需求调整这些默认设置，以实现最佳的系统性能和作业完成效率。

name value description

mapreduce.job.hdfs-servers ${fs.defaultFS}

mapreduce.job.committer.setup.cleanup.needed true true, if job needs job-setup and job-cleanup. false, otherwise

mapreduce.task.io.sort.factor 10 The number of streams to merge at once while sorting files. This determines the number of open file handles.

mapreduce.task.io.sort.mb 100 The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.

mapreduce.map.sort.spill.percent 0.80 The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than 0.5.
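To make the relationship between the sort buffer size and the spill threshold concrete, here is a small sketch of the arithmetic. The function name is hypothetical, and the calculation is a simplification: it treats the whole buffer as available for records and ignores any internal bookkeeping the framework does.

```python
def spill_threshold_bytes(sort_mb=100, spill_percent=0.80):
    """Approximate buffer fill level, in bytes, at which a background spill starts.

    Defaults mirror mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent.
    """
    buffer_bytes = sort_mb * 1024 * 1024
    return round(buffer_bytes * spill_percent)


# With the defaults, the spill starts at roughly 80 MB of buffered data.
print(spill_threshold_bytes())
# Doubling the buffer doubles the amount collected before each spill.
print(spill_threshold_bytes(sort_mb=200))
```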

mapreduce.jobtracker.address local The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.

mapreduce.local.clientfactory.class.name org.apache.hadoop.mapred.LocalClientFactory This is the client factory that is responsible for creating the local job runner client.

mapreduce.jobtracker.system.dir ${hadoop.tmp.dir}/mapred/system The directory where MapReduce stores control files.

mapreduce.jobtracker.staging.root.dir ${hadoop.tmp.dir}/mapred/staging The root of the staging area for users' job files. In practice, this should be the directory where users' home directories are located (usually /user).

mapreduce.cluster.temp.dir ${hadoop.tmp.dir}/mapred/temp A shared directory for temporary files.

mapreduce.job.maps 2 The default number of map tasks per job. Ignored when mapreduce.framework.name is "local".

mapreduce.job.reduces 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.framework.name is "local".

mapreduce.job.running.map.limit 0 The maximum number of simultaneous map tasks per job. There is no limit if this value is 0 or negative.

mapreduce.job.running.reduce.limit 0 The maximum number of simultaneous reduce tasks per job. There is no limit if this value is 0 or negative.

mapreduce.job.max.map -1 Limit on the number of map tasks allowed per job. There is no limit if this value is negative.

mapreduce.job.reducer.preempt.delay.sec 0 The threshold (in seconds) after which an unsatisfied mapper request triggers reducer preemption when there is no anticipated headroom. If set to 0 or a negative value, the reducer is preempted as soon as lack of headroom is detected. Default is 0.

mapreduce.job.reducer.unconditional-preempt.delay.sec 300 The threshold (in seconds) after which an unsatisfied mapper request triggers a forced reducer preemption irrespective of the anticipated headroom. By default, it is set to 5 mins. Setting it to 0 leads to immediate reducer preemption. Setting to -1 disables this preemption altogether.
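The two preemption delays above interact, so a small decision function may help. This is only one reading of the two descriptions, not the framework's actual allocator logic; the function name and its arguments are hypothetical, and treating every negative unconditional delay as "disabled" is an assumption (the description only mentions -1).

```python
def should_preempt_reducer(waited_sec, has_headroom,
                           preempt_delay_sec=0,
                           unconditional_delay_sec=300):
    """Decide whether a reducer would be preempted for a waiting mapper request.

    waited_sec: how long the mapper request has been unsatisfied.
    has_headroom: whether headroom for the mapper is anticipated.
    """
    # Forced preemption ignores headroom; a negative value disables it (assumption).
    if unconditional_delay_sec >= 0 and waited_sec >= unconditional_delay_sec:
        return True
    # Conditional preemption applies only when no headroom is anticipated.
    # A delay of 0 or less means "preempt as soon as the lack is detected".
    if not has_headroom:
        return waited_sec >= max(preempt_delay_sec, 0)
    return False


print(should_preempt_reducer(waited_sec=10, has_headroom=True))
print(should_preempt_reducer(waited_sec=300, has_headroom=True))
```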

mapreduce.job.max.split.locations 10 The max number of block locations to store for each split for locality calculation.

mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The MapReduce ApplicationMaster won't attempt to read submitted split metainfo files bigger than this configured value. No limits if set to -1.

mapreduce.map.maxattempts 4 Expert: The maximum number of attempts per map task. In other words, the framework will try to execute a map task this many times before giving up on it.

mapreduce.reduce.maxattempts 4 Expert: The maximum number of attempts per reduce task. In other words, the framework will try to execute a reduce task this many times before giving up on it.

mapreduce.reduce.shuffle.fetch.retry.enabled ${yarn.nodemanager.recovery.enabled} Set to enable fetch retry during host restart.

mapreduce.reduce.shuffle.fetch.retry.interval-ms 1000 The interval at which the fetcher retries a fetch after a non-fatal failure caused by an event such as an NM restart.

mapreduce.reduce.shuffle.fetch.retry.timeout-ms 30000 The timeout for the fetcher's retries after a non-fatal failure caused by an event such as an NM restart.

mapreduce.reduce.shuffle.retry-delay.max.ms 60000 The maximum number of ms the reducer will delay before retrying to download map data.

mapreduce.reduce.shuffle.parallelcopies 5 The default number of parallel transfers run by reduce during the copy(shuffle) phase.

mapreduce.reduce.shuffle.connect.timeout 180000 Expert: The maximum amount of time (in milliseconds) a reduce task spends trying to connect to a remote node to get map output.

mapreduce.reduce.shuffle.read.timeout 180000 Expert: The maximum amount of time (in milliseconds) a reduce task waits for map output data to be available for reading after obtaining a connection.

mapreduce.shuffle.listen.queue.size 128 The length of the shuffle server listen queue.

mapreduce.shuffle.connection-keep-alive.enable false set to true to support keep-alive connections.

mapreduce.shuffle.connection-keep-alive.timeout 5 The number of seconds a shuffle client attempts to retain an HTTP connection. Refer to the "Keep-Alive: timeout=" header in the HTTP specification.

mapreduce.task.timeout 600000 The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.

mapreduce.map.memory.mb 1024 The amount of memory to request from the scheduler for each map task.

mapreduce.map.cpu.vcores 1 The number of virtual cores to request from the scheduler for each map task.

mapreduce.reduce.memory.mb 1024 The amount of memory to request from the scheduler for each reduce task.

mapreduce.reduce.cpu.vcores 1 The number of virtual cores to request from the scheduler for each reduce task.

mapred.child.java.opts -Xmx200m Java opts for the task processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc Usage of -Djava.library.path can cause programs to no longer function if hadoop native libraries are used. These values should instead be set as part of LD_LIBRARY_PATH in the map / reduce JVM env using the mapreduce.map.env and mapreduce.reduce.env config settings.
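The `@taskid@` interpolation described for mapred.child.java.opts amounts to a plain string substitution. The sketch below mimics it in Python for illustration; the function name and the task ID string are made up for this example and are not taken from the Hadoop code.

```python
def interpolate_java_opts(opts, task_id):
    """Replace the @taskid@ symbol with the task ID; other '@' characters stay as-is."""
    return opts.replace("@taskid@", task_id)


# The option string is the example from the property description above.
opts = "-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc"
# Hypothetical task ID, used only to show the substitution.
print(interpolate_java_opts(opts, "attempt_example_m_000000_0"))
```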

mapred.child.env User added environment variables for the task processes. Examples: 1) A=foo This will set the env variable A to foo. 2) B=$B:c This will inherit the nodemanager's B env variable on Unix. 3) B=%B%;c This will inherit the nodemanager's B env variable on Windows.

mapreduce.admin.user.env Expert: Additional execution environment entries for map and reduce task processes. This is not an additive property. You must preserve the original value if you want your map and reduce tasks to have access to native libraries (compression, etc). When this value is empty, the command to set the execution environment will be OS dependent: For Linux, use LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native. For Windows, use PATH = %PATH%;%HADOOP_COMMON_HOME%\bin.

yarn.app.mapreduce.am.log.level INFO The logging level for the MR ApplicationMaster. The allowed levels are: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL. The setting here could be overridden if "mapreduce.job.log4j-properties-file" is set.

mapreduce.map.log.level INFO The logging level for the map task. The allowed levels are: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL. The setting here could be overridden if "mapreduce.job.log4j-properties-file" is set.

mapreduce.reduce.log.level INFO The logging level for the reduce task. The allowed levels are: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL. The setting here could be overridden if "mapreduce.job.log4j-properties-file" is set.

mapreduce.map.cpu.vcores 1 The number of virtual cores required for each map task.

mapreduce.reduce.cpu.vcores 1 The number of virtual cores required for each reduce task.

mapreduce.reduce.merge.inmem.threshold 1000 The threshold, in terms of the number of files, for the in-memory merge process. When we accumulate the threshold number of files, we initiate the in-memory merge and spill to disk. A value of 0 or less indicates that we don't want any threshold and instead depend only on the ramfs's memory consumption to trigger the merge.

mapreduce.reduce.shuffle.merge.percent 0.66 The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.

mapreduce.reduce.shuffle.input.buffer.percent 0.70 The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.

mapreduce.reduce.input.buffer.percent 0.0 The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.

mapreduce.reduce.shuffle.memory.limit.percent 0.25 Expert: Maximum percentage of the in-memory limit that a single shuffle can consume. Range of valid values is [0.0, 1.0]. If the value is 0.0 map outputs are shuffled directly to disk.
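The shuffle memory percentages above are defined relative to one another, which is easier to see with numbers. The sketch below applies them to a given maximum heap size; the function name, the returned keys, and the 1000 MB heap are assumptions for illustration, and the real framework may round or adjust these values differently.

```python
def shuffle_memory_limits(max_heap_mb,
                          input_buffer_percent=0.70,
                          merge_percent=0.66,
                          memory_limit_percent=0.25):
    """Derive reduce-side shuffle memory thresholds (in MB) from the heap size.

    Defaults mirror mapreduce.reduce.shuffle.input.buffer.percent,
    mapreduce.reduce.shuffle.merge.percent and
    mapreduce.reduce.shuffle.memory.limit.percent.
    """
    # Memory set aside for holding map outputs during the shuffle.
    in_memory_limit = max_heap_mb * input_buffer_percent
    return {
        "in_memory_limit_mb": in_memory_limit,
        # Usage level at which an in-memory merge is initiated.
        "merge_trigger_mb": in_memory_limit * merge_percent,
        # Largest share a single shuffle may take.
        "single_shuffle_max_mb": in_memory_limit * memory_limit_percent,
    }


limits = shuffle_memory_limits(1000)
for name, value in limits.items():
    print(f"{name}: {value:.1f}")
```

With a hypothetical 1000 MB heap and the default percentages, about 700 MB is available for map outputs, an in-memory merge starts near 462 MB, and a single shuffle can take up to about 175 MB.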

mapreduce.shuffle.ssl.enabled false Whether to use SSL for the Shuffle HTTP endpoints.

mapreduce.shuffle.ssl.file.buffer.size 65536 Buffer size for reading spills from file when using SSL.

mapreduce.shuffle.max.connections 0 Max allowed connections for the shuffle. Set to 0 (zero) to indicate no limit on the number of connections.

mapreduce.shuffle.max.threads 0 Max allowed threads for serving shuffle connections. Set to zero to indicate the default of 2 times the number of available processors (as reported by Runtime.availableProcessors()). Netty is used to serve requests, so a thread is not needed for each connection.

mapreduce.shuffle.transferTo.allowed This option can enable/disable using the NIO transferTo method in the shuffle phase. NIO transferTo does not perform well on Windows in the shuffle phase. Thus, with this configuration property it is possible to disable it, in which case a custom transfer method will be used. Recommended value is false when running Hadoop on Windows. For Linux, it is recommended to set it to true. If nothing is set then the default value is false for Windows, and true for Linux.

mapreduce.shuffle.transfer.buffer.size 131072 This property is used only if mapreduce.shuffle.transferTo.allowed is set to false. In that case, this property defines the size of the buffer used in the buffer copy code for the shuffle phase. The size of this buffer determines the size of the IO requests.

mapreduce.reduce.markreset.buffer.percent 0.0 The percentage of memory, relative to the maximum heap size, to be used for caching values when using the mark-reset functionality.

mapreduce.map.speculative true If true, then multiple instances of some map tasks may be executed in parallel.

mapreduce.reduce.speculative true If true, then multiple instances of some reduce tasks may be executed in parallel.

mapreduce.job.speculative.speculative-cap-running-tasks 0.1 The max percent (0-1) of running tasks that can be speculatively re-executed at any time.

mapreduce.job.speculative.speculative-cap-total-tasks 0.01 The max percent (0-1) of all tasks that can be speculatively re-executed at any time.

mapreduce.job.speculative.minimum-allowed-tasks 10 The minimum allowed tasks that can be speculatively re-executed at any time.

mapreduce.job.speculative.retry-after-no-speculate 1000 The waiting time(ms) to do next round of speculation if there is no task speculated in this round.

mapreduce.job.speculative.retry-after-speculate 15000 The waiting time(ms) to do next round of speculation if there are tasks speculated in this round.
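The three speculation caps above combine into a single limit on concurrent speculative attempts. The sketch below assumes they are combined by taking the largest of the three values, which is how the default speculator is understood to behave; this is an assumption that should be checked against the Hadoop source before relying on it, and the function and parameter names are hypothetical.

```python
def allowed_speculative_tasks(total_tasks, running_tasks,
                              cap_running=0.1, cap_total=0.01,
                              minimum_allowed=10):
    """Estimate how many tasks may be speculatively re-executed at one time.

    Defaults mirror speculative-cap-running-tasks, speculative-cap-total-tasks
    and minimum-allowed-tasks. Combining them with max() is an assumption.
    """
    allowed = max(minimum_allowed, cap_total * total_tasks)
    allowed = max(allowed, cap_running * running_tasks)
    return int(allowed)


# Small job: the minimum of 10 dominates.
print(allowed_speculative_tasks(total_tasks=100, running_tasks=50))
# Large job: 1% of 5000 total tasks dominates.
print(allowed_speculative_tasks(total_tasks=5000, running_tasks=200))
```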

mapreduce.job.map.output.collector.class org.apache.hadoop.mapred.MapTask$MapOutputBuffer The MapOutputCollector implementation(s) to use. This may be a comma-separated list of class names, in which case the map task will try to initialize each of the collectors in turn. The first to successfully initialize will be used.

mapreduce.job.speculative.slowtaskthreshold 1.0 The number of standard deviations by which a task's average progress rate must be lower than the average of all running tasks' for the task to be considered too slow.

mapreduce.job.ubertask.enable false Whether to enable the small-jobs "ubertask" optimization, which runs "sufficiently small" jobs sequentially within a single JVM. "Small" is defined by the following maxmaps, maxreduces, and maxbytes settings. Note that configurations for application masters also affect the "Small" definition - yarn.app.mapreduce.am.resource.mb must be larger than both mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and yarn.app.mapreduce.am.resource.cpu-vcores must be larger than


2018/3/15 file:///F:/Hadoop/hadoop-2.9.0/share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml


devalone
