Counting the occurrences of each word in all files under the `/input` directory (in Eclipse): a Map-phase implementation, a Reduce-phase implementation, and a Driver main class are required.
Counting how many times each word appears across all text files in an `input` directory is a classic exercise in the Hadoop MapReduce framework. The job breaks down into two main phases: a Map phase and a Reduce phase, tied together by a Driver class.
**Map phase**:
In this phase, each Map task reads a split of an input file, tokenizes the text into words, and emits each word as a key-value pair to the Reducers. A Java implementation might look like this:
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        for (String token : line.split("\\W+")) { // split on runs of non-word characters
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one); // emit the word with a count of 1
            }
        }
    }
}
```
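The tokenization above hinges on `split("\\W+")`, which splits on runs of non-word characters (anything outside `[a-zA-Z0-9_]`). A minimal standalone sketch of that behavior, independent of Hadoop:

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "Hello, world! Hello Hadoop.";
        // "\\W+" treats commas, spaces, and periods as one delimiter run,
        // so punctuation does not produce empty or dirty tokens here
        String[] tokens = line.split("\\W+");
        System.out.println(Arrays.toString(tokens)); // [Hello, world, Hello, Hadoop]
    }
}
```

Note that a token can still be empty when the line *starts* with a non-word character, which is why the mapper keeps the `!token.isEmpty()` guard.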
**Reduce phase**:
The Reduce phase receives, for each key, all values emitted by the Map tasks for that key, and computes their sum. Here we aggregate the per-word counts:
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // accumulate the counts for this word
        }
        result.set(sum);
        context.write(key, result); // emit the word and its total count
    }
}
```
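What the framework does between the two phases is group all `(word, 1)` pairs by key before handing them to the reducer. Outside Hadoop, that group-and-sum step can be sketched with a plain `HashMap` (the sample input here is made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class ReduceDemo {
    public static void main(String[] args) {
        // Simulated map-phase output: one (word, 1) pair per token
        String[] mapOutput = {"hello", "hadoop", "hello"};

        Map<String, Integer> counts = new HashMap<>();
        for (String word : mapOutput) {
            // merge() plays the role of the reducer's summing loop:
            // insert 1 for a new key, otherwise add 1 to the existing count
            counts.merge(word, 1, Integer::sum);
        }
        System.out.println(counts); // hello -> 2, hadoop -> 1 (map order unspecified)
    }
}
```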
**Driver main class**:
The driver class, e.g. `WordCountDriver.java`, wires up the Mapper and Reducer and submits the job:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional: pre-aggregates map output locally
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
To run the program, package the classes into a jar and pass the `input` directory and the desired `output` directory as command-line arguments, e.g. `hadoop jar wordcount.jar WordCountDriver input output`.
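Before submitting to a cluster, the Map and Reduce logic can be exercised end-to-end locally by chaining the same tokenize-then-sum steps over in-memory lines. This is a minimal sketch (class and method names are illustrative, not part of the Hadoop job):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Combines the mapper's tokenization and the reducer's summing in one pass
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key, like MapReduce output
        for (String line : lines) {
            for (String token : line.split("\\W+")) { // same regex as the mapper
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum); // same aggregation as the reducer
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello world");
        System.out.println(count(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```

Running the same logic on a small sample first makes it much easier to tell a logic bug from a cluster-configuration problem.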