How to create a MapReduce job

Date: 2023-08-20 16:31:36

Implementing MapReduce with mincemeat

I have been taking the Coursera course Web Intelligence and Big Data. Last Friday the instructor assigned homework that requires writing a MapReduce program in Python. The assignment reads as follows:

Programming Assignment for HW3
Homework 3 (Programming Assignment A)

Download data files bundled as a .zip file from hw3data.zip

Each file in this archive contains entries that look like:

journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.

that represent bibliographic information about publications, formatted as follows:

paper-id:::author1::author2::…. ::authorN:::title

Your task is to compute how many times every term occurs across titles, for each author. For example, for the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.

Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that 'terms' must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, "program" and "program." should both be counted as the term 'program', and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.

The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python. These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.

I strongly recommend mincemeat.py which is much faster than octo.py even though the latter was covered first in the lecture video as an example. Both are very similar.

Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.

Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won't learn anything about map-reduce.

Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong.
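
The map and reduce functions for this assignment might be sketched as follows. This is only an illustration under stated assumptions, not the course's reference solution: the names mapfn, reducefn and run_local are chosen here for illustration, STOPWORDS is a tiny placeholder for the real list in stopwords.py, and hyphens are turned into spaces to follow the "map-reduce" example in the assignment text. The local driver stands in for whatever octo.py or mincemeat.py would do when distributing the work.

```python
import re
from collections import Counter, defaultdict

# Placeholder stop-word set (assumption); the real list is in stopwords.py.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def mapfn(key, line):
    """Emit one (author, term) pair per author and title term of a record."""
    parts = line.strip().split(":::")
    if len(parts) != 3:
        return  # skip malformed records
    _paper_id, authors, title = parts
    # Hyphens become spaces; anything that is not a letter, digit or space is dropped.
    title = title.lower().replace("-", " ")
    title = re.sub(r"[^a-z0-9 ]", "", title)
    terms = [t for t in title.split() if len(t) > 1 and t not in STOPWORDS]
    for author in authors.split("::"):
        for term in terms:
            yield author, term


def reducefn(author, terms):
    """Count how often each term occurs across all titles of one author."""
    return dict(Counter(terms))


def run_local(lines):
    """Serial stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for index, line in enumerate(lines):
        for author, term in mapfn(index, line):
            grouped[author].append(term)
    return {author: reducefn(author, terms) for author, terms in grouped.items()}


if __name__ == "__main__":
    sample = [
        "journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo"
        ":::Programmer-Defined Control Abstractions in Modula-2.",
    ]
    for author, counts in run_local(sample).items():
        print(author, counts)
```

Emitting one pair per author and term keeps the map step stateless, so the cumulative count across multiple papers and files falls out of the grouping step rather than from any bookkeeping inside the mapper.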

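To create a MapReduce job, follow these steps:

1. Write the MapReduce program: define the Map and Reduce functions, along with the job's input and output formats.
2. Package the MapReduce program: build the program into a JAR file.
3. Deploy a Hadoop cluster: install and configure the Hadoop cluster.
4. Prepare the input data: upload the input data to HDFS.
5. Run the MapReduce job: use the hadoop jar command to run the job, specifying the input and output paths.
6. Monitor the job: while the job is running, use the hadoop job command to check its progress (newer Hadoop releases mark this command as deprecated in favor of mapred job).
7. Retrieve the output data: when the job finishes, the output is stored in the specified output path, and you can download it from HDFS to the local file system.

These are the general steps for creating a MapReduce job; the details may vary with your environment and requirements.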
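
To make step 1 more concrete, here is a minimal word-count sketch in Python that simulates the map, shuffle and reduce phases locally. It is an assumption-laden illustration rather than a Hadoop program: the steps above describe a Java program packaged as a JAR, whereas the names mapper, reducer and run_job below are hypothetical and the sorted() call merely stands in for the sort-and-shuffle that Hadoop performs between the two phases on a real cluster.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts of consecutive records sharing a key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_job(lines):
    """Local stand-in for the framework: map, sort (shuffle), then reduce."""
    shuffled = sorted(mapper(lines))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    print(run_job(["Hello world", "hello Hadoop"]))
```

The reducer relies on its input being sorted by key, which is why the driver sorts the mapper output first; on a cluster that guarantee comes from the framework, not from user code.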
