Developing a WordCount program in IDEA on Ubuntu with bounded data sources: 1. reading from elements 2. reading from a collection 3. reading from a file — code walkthrough
Date: 2024-09-30 16:14:31
On Ubuntu, developing a WordCount program in IntelliJ IDEA against a bounded data source is usually done with the Hadoop MapReduce Java API, which reads its input from files (or directories) on the file system and writes its results back out the same way. Here we assume you want to read the data from a file.
First, make sure Java, Hadoop, and the relevant IDEA configuration are in place. The basic steps are as follows:
1. **Create the project**:
- Create a new Maven Java project named `wordcount`.
- Add the Hadoop dependency to `pom.xml`:
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.x.y</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```
Replace `3.x.y` with your installed Hadoop version (`hadoop version` prints it). The `hadoop-client` artifact provides the MapReduce Java API used by the code below; `provided` scope is appropriate because `hadoop jar` supplies these classes at run time.
2. **Write the Mapper**:
`WordCountMapper.java`:
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on whitespace and emit a (word, 1) pair per token
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```
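The mapper's split-and-filter logic can be checked outside Hadoop first. A minimal sketch in plain Java (no Hadoop types; the class name `TokenizeSketch` is just for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // Mirrors the mapper: split a line on whitespace, drop empty tokens
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello  world hello")); // prints [hello, world, hello]
    }
}
```

Filtering empty tokens matters because `split` can produce empty strings at the start of a line that begins with whitespace.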
3. **Write the Reducer**:
`WordCountReducer.java`:
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}
```
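The reduce step is just a per-key sum over the values the shuffle grouped together. The same aggregation can be sketched in plain Java with a map from word to running total (the class name `ReduceSketch` is illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

public class ReduceSketch {
    // Fold a stream of words into per-word totals, the same
    // aggregation the reducer performs for each grouped key
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> totals = new HashMap<>();
        for (String word : words) {
            totals.merge(word, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals =
            countWords(new String[] {"hello", "world", "hello"});
        System.out.println(totals.get("hello")); // prints 2
    }
}
```

Because this summation is associative, the same class can double as the combiner in the job below, pre-aggregating counts on the map side.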
4. **Write the Driver**:
`WordCountDriver.java`:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
5. **Run WordCount**:
From the project root, package the project and submit the jar:
```sh
$ mvn package
$ hadoop jar target/wordcount.jar WordCountDriver input.txt output
```
(adjust the jar name to whatever `mvn package` actually produces under `target/`). Here `input.txt` is the file to analyze and `output` is the directory where the results are written; it must not already exist when the job starts, and the word counts end up in files such as `output/part-r-00000`.
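To sanity-check what the job should produce before running it on a cluster, the whole pipeline can be simulated locally in plain Java — map (tokenize) plus shuffle/reduce (per-word sum), with a `TreeMap` standing in for MapReduce's sorted output. This is a sketch, not Hadoop's actual execution path:

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSimulation {
    // Tokenize every line and sum per-word counts;
    // TreeMap keeps keys sorted, like reducer output
    static Map<String, Integer> run(String[] lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    totals.merge(token, 1, Integer::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Each printed line mirrors one line of part-r-00000: "word<TAB>count"
        for (Map.Entry<String, Integer> e :
                run(new String[] {"hello world", "hello hadoop"}).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the two sample lines this prints `hadoop 1`, `hello 2`, `world 1`, which is the shape of output the real job writes for the same input.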