Complete code: counting words that start with "a" and with "b" in Hadoop, using Eclipse
Date: 2024-03-22 11:42:17
Below is the complete code for counting how many words start with "a" and how many start with "b", written as a Hadoop MapReduce job (developed in Eclipse):
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into whitespace-separated tokens.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                // Emit ("a", 1) or ("b", 1); the comparison is case-sensitive.
                if (token.startsWith("a")) {
                    word.set("a");
                    context.write(word, one);
                } else if (token.startsWith("b")) {
                    word.set("b");
                    context.write(word, one);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts emitted for this key ("a" or "b").
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner, since summing is associative.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
This example uses Hadoop's MapReduce framework, which splits the input file into blocks and processes them in parallel across compute nodes. In the map phase, a StringTokenizer breaks each input line into words; every word beginning with "a" is emitted as the key-value pair ("a", 1), and every word beginning with "b" as ("b", 1). In the reduce phase, the counts for each key are summed, producing the total number of words starting with "a" and the total starting with "b".
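The mapper's classification logic can be checked locally without a Hadoop cluster. The sketch below (a hypothetical helper class, not part of the job above) applies the same StringTokenizer-and-prefix test to a single line and tallies the results:

```java
import java.util.StringTokenizer;

// Local sketch of TokenizerMapper's logic: tokenize one line and
// count how many words start with "a" and how many with "b".
public class PrefixCountDemo {

    // Returns {countA, countB} for the given line.
    public static int[] countPrefixes(String line) {
        int a = 0, b = 0;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (token.startsWith("a")) {        // case-sensitive, like the mapper
                a++;
            } else if (token.startsWith("b")) {
                b++;
            }
        }
        return new int[]{a, b};
    }

    public static void main(String[] args) {
        int[] counts = countPrefixes("apple banana avocado berry cherry");
        System.out.println("a\t" + counts[0]);  // a  2
        System.out.println("b\t" + counts[1]);  // b  2
    }
}
```

Note that `startsWith` is case-sensitive, so "Apple" would not be counted; lowercasing each token first would change that behavior if desired.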
To run this code, install Hadoop (standalone mode is sufficient). Create a new Java project in Eclipse, copy the code above into a class named WordCount.java, and export it as a runnable JAR file. Then run the following command:
```
hadoop jar WordCount.jar input.txt output
```
Here, WordCount.jar is the name of the exported JAR, input.txt is the input file to process, and output is the directory for the results (it must not already exist, or the job will fail). Once the job completes, the output directory contains the counts of words starting with "a" and "b".