1. Using the IMDB movie data file and the MapReduce algorithm, complete the following tasks: 1) find the ids of the ten movies with the highest average rating; 2) print the names of those movies together with their category labels.
Posted: 2024-03-16 07:46:42
OK, I understand your question. First, note that the IMDB movie data file must be loaded into the Hadoop Distributed File System (HDFS) before it can be processed with MapReduce. The solution is as follows:
1. Preprocessing
Upload the IMDB movie data file to HDFS, for example with Hadoop's built-in command-line tool `hdfs dfs -put`.
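For example (the directory and file names below are placeholders; adjust them to your cluster layout):

```shell
# Create a target directory on HDFS and upload the local ratings file
hdfs dfs -mkdir -p /imdb
hdfs dfs -put ratings.csv /imdb/
```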
2. MapReduce implementation
(1) Mapper
The Mapper takes one line of the IMDB movie data file as input and emits a (movieId, rating) key-value pair; the code below assumes a 4-field CSV record whose second field is the movie id and whose third field is the rating. Implementation:
```
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (movieId, rating) for every well-formed input line.
// Assumes a 4-field CSV record whose 2nd field is the movie id
// and whose 3rd field is the rating.
public class MovieRatingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final DoubleWritable rating = new DoubleWritable();
    private final Text movieId = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split(",");
        if (tokens.length == 4) {
            movieId.set(tokens[1]);
            rating.set(Double.parseDouble(tokens[2]));
            context.write(movieId, rating);
        }
    }
}
```
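The mapper's split-and-filter step is easy to sanity-check outside Hadoop. A minimal plain-Java sketch of the same logic (the class name `RatingLineParser` and the sample line are illustrative; like the mapper, it assumes a 4-field CSV record whose second field is the movie id and third the rating):

```java
// Standalone sketch of the mapper's parsing logic (no Hadoop needed).
public class RatingLineParser {

    // Returns "movieId:rating" for a well-formed line, or null for
    // malformed input -- mirroring the length check in the mapper.
    public static String parse(String line) {
        String[] tokens = line.split(",");
        if (tokens.length != 4) {
            return null; // skip malformed records, as the mapper does
        }
        String movieId = tokens[1];
        double rating = Double.parseDouble(tokens[2]);
        return movieId + ":" + rating;
    }

    public static void main(String[] args) {
        System.out.println(parse("196,242,3.0,881250949")); // well-formed
        System.out.println(parse("bad line"));              // malformed -> null
    }
}
```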
(2) Reducer
The Reducer receives the Mapper's output and emits (movieId, averageRating) key-value pairs. Implementation:
```
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Averages all ratings seen for one movieId and emits (movieId, average).
public class MovieRatingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable avgRating = new DoubleWritable();

    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        avgRating.set(sum / count);
        context.write(key, avgRating);
    }
}
```
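One caveat: this Reducer cannot simply be reused as a combiner, because an average of partial averages is not, in general, the overall average. A tiny plain-Java check of that fact (illustrative only; a combiner-safe design would emit (sum, count) pairs instead):

```java
import java.util.Arrays;

// Demonstrates why averaging is not combiner-safe:
// avg(avg(a, b), c) generally differs from avg(a, b, c).
public class AverageCheck {

    public static double avg(double... xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        double direct = avg(1.0, 2.0, 6.0);       // 9.0 / 3 = 3.0
        double nested = avg(avg(1.0, 2.0), 6.0);  // avg(1.5, 6.0) = 3.75
        System.out.println(direct + " vs " + nested);
    }
}
```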
(3) Driver
The Driver class configures the MapReduce job and writes the result to HDFS. Implementation:
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits the average-rating job; args[0] is the HDFS
// input path and args[1] the HDFS output path (must not yet exist).
public class MovieRatingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Movie Rating");
        job.setJarByClass(MovieRatingDriver.class);
        job.setMapperClass(MovieRatingMapper.class);
        job.setReducerClass(MovieRatingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
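Once the three classes are packaged into a jar, the job can be submitted in the usual way (the jar name and the paths below are placeholders):

```shell
# args: HDFS input file, HDFS output directory (must not yet exist)
hadoop jar movie-rating.jar MovieRatingDriver /imdb/ratings.csv /imdb/avg-ratings
```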
3. Interpreting the results
The MapReduce job produces the average rating of every movie. The next step is to find the ids of the ten movies with the highest averages and print each movie's name and category label; a small Java program can do this. Implementation:
```
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads the MapReduce output (args[0]) and the movie metadata file
// (args[1]), then prints the name and category of the ten movies
// with the highest average rating.
public class TopTenMovies {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Load the movie metadata once, keyed by movieId.
        // Note: a naive split(",") breaks if titles contain commas.
        Map<String, String[]> movies = new HashMap<>();
        try (BufferedReader movieReader =
                new BufferedReader(new InputStreamReader(fs.open(new Path(args[1]))))) {
            String movieLine;
            while ((movieLine = movieReader.readLine()) != null) {
                String[] tokens = movieLine.split(",");
                movies.put(tokens[0], tokens);
            }
        }

        // Keep only the ten highest averages; the TreeMap is sorted
        // descending, so its LAST key is the current minimum.
        TreeMap<Double, String> topTen = new TreeMap<>(Collections.reverseOrder());
        try (BufferedReader br =
                new BufferedReader(new InputStreamReader(fs.open(new Path(args[0]))))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split("\\s+");
                String movieId = tokens[0];
                double rating = Double.parseDouble(tokens[1]);
                if (topTen.size() < 10) {
                    topTen.put(rating, movieId);
                } else if (rating > topTen.lastKey()) {
                    topTen.remove(topTen.lastKey()); // evict the current minimum
                    topTen.put(rating, movieId);
                }
            }
        }

        // Print name and category for each of the top ten, best first.
        for (Map.Entry<Double, String> entry : topTen.entrySet()) {
            String[] tokens = movies.get(entry.getValue());
            if (tokens != null && tokens.length >= 4) {
                System.out.println("Movie Name: " + tokens[1] + ", Category: " + tokens[3]);
            }
        }
        fs.close();
    }
}
```
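One subtlety in the program above: the TreeMap is keyed on the rating, so two movies with exactly the same average silently overwrite each other. A duplicate-safe sketch of the same top-K idea in plain Java, using a min-heap over in-memory data (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Duplicate-safe top-K selection: a min-heap of (movieId, rating)
// entries keeps the K highest ratings without collapsing ties.
public class TopK {

    public static List<String> topMovies(Map<String, Double> avgRatings, int k) {
        PriorityQueue<Map.Entry<String, Double>> heap =
                new PriorityQueue<>((a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Double> e : avgRatings.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum
            }
        }
        // Drain the heap (smallest first) and reverse so the best comes first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> ratings = Map.of("m1", 9.0, "m2", 9.0, "m3", 8.5, "m4", 7.0);
        System.out.println(topMovies(ratings, 3)); // both 9.0 movies survive
    }
}
```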
This program takes two inputs: the MapReduce job's output file and the IMDB movie data file. It reads the MapReduce output, keeps only the ten movie ids with the highest average ratings, then looks up each movie's name and category label in the IMDB data file and prints them to the console.
Finally, run the program from the command line:
```
hadoop jar top-ten-movies.jar TopTenMovies <path-to-mapreduce-output> <path-to-movies-data>
```