Write a Hadoop MapReduce program that performs simple statistics. The input is a set of text files; each file contains many lines, and each line is a numeric string representing a very large number. Write a program that computes and outputs the sum, maximum, minimum, and average of all the numbers contained in the input files.
Date: 2024-05-01 13:16:06
Below is an implementation using the Hadoop MapReduce framework.
First, define the Mapper and Reducer classes. The Mapper parses each numeric line into a number and emits it as a key-value pair under a single shared key, so that one Reducer receives every value. The Reducer computes the sum, maximum, minimum, and count of the input numbers and writes the results at the end.
```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SimpleStatistics {

    public static class TokenizerMapper extends Mapper<Object, Text, NullWritable, LongWritable> {
        // long is used because the problem states the numbers are very large
        private final LongWritable number = new LongWritable();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line holds a single number; skip lines that fail to parse.
            try {
                number.set(Long.parseLong(value.toString().trim()));
                context.write(NullWritable.get(), number);
            } catch (NumberFormatException e) {
                // ignore invalid lines
            }
        }
    }

    public static class StatsReducer extends Reducer<NullWritable, LongWritable, Text, LongWritable> {
        private final LongWritable result = new LongWritable();
        private final Text type = new Text();

        @Override
        public void reduce(NullWritable key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            long max = Long.MIN_VALUE;
            long min = Long.MAX_VALUE;
            for (LongWritable val : values) {
                long num = val.get();
                sum += num;
                count++;
                if (num > max) {
                    max = num;
                }
                if (num < min) {
                    min = num;
                }
            }
            if (count > 0) {
                type.set("sum"); result.set(sum); context.write(type, result);
                type.set("max"); result.set(max); context.write(type, result);
                type.set("min"); result.set(min); context.write(type, result);
                // integer division: the average is truncated toward zero
                type.set("avg"); result.set(sum / count); context.write(type, result);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "simple statistics");
        job.setJarByClass(SimpleStatistics.class);
        job.setMapperClass(TokenizerMapper.class);
        // No combiner: the reducer's output types (Text, LongWritable) do not
        // match the mapper's output types (NullWritable, LongWritable).
        job.setReducerClass(StatsReducer.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // All values share one key, so a single reducer sees the full data set.
        job.setNumReduceTasks(1);
        // All arguments except the last are input paths.
        for (int i = 0; i < args.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
In the code above, the Mapper's map() function parses each numeric input line and writes the value to the context under a single shared key. The Reducer's reduce() function computes the sum, maximum, minimum, and count of the input numbers and writes the results to the context.
In main(), we set the Mapper and Reducer classes, specify the input and output paths, and run the job. Run the program with a command such as:
```shell
hadoop jar SimpleStatistics.jar SimpleStatistics input1 input2 ... inputN output
```
Here input1 input2 ... inputN are the input file paths, separated by spaces (the program treats every argument except the last as an input path), and output is the output directory path, which must not already exist.
After the job finishes, the output file will contain the following statistics:
- sum: the sum of all numbers
- max: the maximum value
- min: the minimum value
- avg: the average value (truncated to an integer by the integer division)
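The aggregation performed in the reducer can be sanity-checked outside Hadoop with a plain-Java sketch. The `StatsCheck` class and its sample values below are illustrative helpers for this answer, not part of the MapReduce job:

```java
import java.util.Arrays;

public class StatsCheck {
    // Mirrors the reducer's aggregation over a small in-memory sample:
    // returns {sum, max, min, truncated average}.
    static long[] stats(long[] nums) {
        long sum = 0, count = 0;
        long max = Long.MIN_VALUE, min = Long.MAX_VALUE;
        for (long n : nums) {
            sum += n;
            count++;
            if (n > max) max = n;
            if (n < min) min = n;
        }
        // Integer division truncates the average, exactly as in the reducer.
        return new long[] { sum, max, min, sum / count };
    }

    public static void main(String[] args) {
        long[] r = stats(new long[] { 10, 3, 7 });
        // sum = 20, max = 10, min = 3, avg = 20 / 3 = 6 (truncated)
        System.out.println(Arrays.toString(r));
    }
}
```

The only difference from the reducer is that the numbers arrive from an array instead of an `Iterable<LongWritable>`; the arithmetic is identical.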
Hope this helps!