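Given a movie dataset, use Spark to complete the following tasks:
(1) Find the 10 movies that were rated most often, with their rating counts (movie title, rating count).
(2) For men and women separately, find the 20 movies with the highest average rating (gender, movie title, rating).
(3) For the movie with movieID 2116, compute the average rating for each age (age, rating).
(4) For each year, find the movie genre with the highest rating (year, genre, rating).
(5) Find the 10 highest-rated Comedy movies (movie title, rating).
(6) Find the year that has the most well-rated movies (average rating >= 4.0), then list the 10 highest-rated movies of that year (year, movie title, rating).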
Time: 2024-01-22 22:01:47 Views: 89
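Happy to help! Here are my answers, one per task. The snippets assume an existing SparkSession named `spark` and CSV files with header rows.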
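(1) Find the 10 movies that were rated most often, with their rating counts (movie title, rating count)
```python
# Assumes ratings.csv has columns userId, movieId, rating, timestamp and movies.csv has movieId, title, genres
ratings = spark.read.format("csv").option("header", "true").load("ratings.csv")
movies = spark.read.format("csv").option("header", "true").load("movies.csv")
# Count ratings per movie, then attach the title and keep the 10 largest counts
movie_counts = ratings.groupBy("movieId").count()
top_10_movies = movie_counts.join(movies, "movieId").select("title", "count").orderBy("count", ascending=False).limit(10)
```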
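(2) For men and women separately, find the 20 movies with the highest average rating (gender, movie title, rating)
```python
from pyspark.sql.functions import avg
ratings = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("ratings.csv")
# Assumption: users.csv has columns userId, gender ("M"/"F"), age; movies.csv has movieId, title, genres
users = spark.read.format("csv").option("header", "true").load("users.csv")
movies = spark.read.format("csv").option("header", "true").load("movies.csv")
ratings_gender = ratings.join(users, "userId").select("gender", "movieId", "rating")
# Average rating per (gender, movie), with the title attached for the output
avg_by_gender = ratings_gender.groupBy("gender", "movieId").agg(avg("rating").alias("avg_rating")).join(movies, "movieId").select("gender", "title", "avg_rating")
top_20_male = avg_by_gender.filter(avg_by_gender.gender == "M").orderBy("avg_rating", ascending=False).limit(20)
top_20_female = avg_by_gender.filter(avg_by_gender.gender == "F").orderBy("avg_rating", ascending=False).limit(20)
```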
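Task (2) is a "top N within each group" problem: average per (gender, movie) first, then rank inside each gender. As a way to sanity-check the Spark result on a few rows, here is a minimal plain-Python sketch of the same logic. The sample rows and the function name `top_n_per_gender` are invented for illustration and are not part of the dataset or of any Spark API.

```python
from collections import defaultdict

# Hypothetical sample rows: (gender, title, rating). Not taken from the real dataset.
rows = [
    ("M", "Movie A", 5), ("M", "Movie A", 4), ("M", "Movie B", 3),
    ("F", "Movie A", 2), ("F", "Movie B", 5), ("F", "Movie B", 4),
]

def top_n_per_gender(rows, n):
    """Return {gender: [(title, avg_rating), ...]} with the n best titles per gender."""
    sums = defaultdict(lambda: [0.0, 0])  # (gender, title) -> [rating sum, rating count]
    for gender, title, rating in rows:
        acc = sums[(gender, title)]
        acc[0] += rating
        acc[1] += 1
    per_gender = defaultdict(list)
    for (gender, title), (total, count) in sums.items():
        per_gender[gender].append((title, total / count))
    # Sort by average descending, then by title so ties come out in a stable order
    return {g: sorted(v, key=lambda t: (-t[1], t[0]))[:n] for g, v in per_gender.items()}

result = top_n_per_gender(rows, 1)
print(result)
```

Note that a movie with a single 5-star rating will rank first under this definition; whether to require a minimum number of ratings is a choice the task statement leaves open.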
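(3) For the movie with movieID 2116, compute the average rating for each age (age, rating)
```python
from pyspark.sql.functions import avg
ratings = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("ratings.csv")
# Assumption: users.csv has columns userId, gender, age
users = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("users.csv")
movie_2116_ratings = ratings.filter(ratings.movieId == 2116)
movie_2116_ratings_age = movie_2116_ratings.join(users, "userId").select("age", "rating")
movie_2116_ratings_avg = movie_2116_ratings_age.groupBy("age").agg(avg("rating").alias("avg_rating")).orderBy("age")
```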
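(4) For each year, find the movie genre with the highest rating (year, genre, rating)
```python
from pyspark.sql.functions import avg, col, explode, from_unixtime, row_number, split, year
from pyspark.sql.window import Window
ratings = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("ratings.csv")
movies = spark.read.format("csv").option("header", "true").load("movies.csv")
joined_data = ratings.join(movies, "movieId")
# Assumption: timestamp holds Unix epoch seconds and genres is pipe-separated (e.g. "Action|Comedy")
year_ratings = joined_data.withColumn("year", year(from_unixtime(col("timestamp")))).withColumn("genre", explode(split(col("genres"), "\\|"))).select("year", "genre", "rating")
genre_year_avg = year_ratings.groupBy("year", "genre").agg(avg("rating").alias("avg_rating"))
# Keep only the highest-rated genre within each year
w = Window.partitionBy("year").orderBy(col("avg_rating").desc())
top_genre_year = genre_year_avg.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).select("year", "genre", "avg_rating").orderBy("year")
```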
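A `genres` value may list several genres separated by `|`, so a per-genre average needs one record per (rating, genre) pair before grouping. The following plain-Python sketch shows that split-then-aggregate step and the per-year maximum on a handful of invented rows. The data and the function name `top_genre_per_year` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, genres, rating). Invented for illustration only.
rows = [
    (2000, "Action|Comedy", 4),
    (2000, "Comedy", 2),
    (2000, "Drama", 5),
    (2001, "Action", 5),
    (2001, "Action|Drama", 3),
]

def top_genre_per_year(rows):
    """Return {year: (genre, avg_rating)} for the best-rated genre of each year."""
    stats = defaultdict(lambda: [0.0, 0])  # (year, genre) -> [rating sum, rating count]
    for year, genres, rating in rows:
        # One rating contributes to every genre listed for the movie
        for genre in genres.split("|"):
            acc = stats[(year, genre)]
            acc[0] += rating
            acc[1] += 1
    best = {}
    for (year, genre), (total, count) in stats.items():
        mean = total / count
        if year not in best or mean > best[year][1]:
            best[year] = (genre, mean)
    return best

best = top_genre_per_year(rows)
print(best)
```

Grouping by the unsplit `genres` string instead would treat "Action|Comedy" as a genre of its own, which is usually not what the task intends.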
(5) Find the 10 highest-rated Comedy movies (movie title, rating)
```python
from pyspark.sql.functions import avg, col
ratings = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("ratings.csv")
movies = spark.read.format("csv").option("header", "true").load("movies.csv")
comedy_ratings = ratings.join(movies, "movieId").filter(col("genres").contains("Comedy"))
# Group by movieId as well as title so that two movies sharing a title are not merged
top_10_comedy = comedy_ratings.groupBy("movieId", "title").agg(avg("rating").alias("avg_rating")).orderBy("avg_rating", ascending=False).select("title", "avg_rating").limit(10)
```
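(6) Find the year that has the most well-rated movies (average rating >= 4.0), then list the 10 highest-rated movies of that year (year, movie title, rating)
```python
from pyspark.sql.functions import avg, col, regexp_extract
ratings = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("ratings.csv")
movies = spark.read.format("csv").option("header", "true").load("movies.csv")
high_ratings = ratings.groupBy("movieId").agg(avg("rating").alias("avg_rating")).filter("avg_rating >= 4.0")
# Assumption: titles end with the release year in parentheses, e.g. "Toy Story (1995)"
high_ratings_data = high_ratings.join(movies, "movieId").withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)\s*$", 1)).filter(col("year") != "").select("year", "title", "avg_rating")
# Year that has the most movies with avg_rating >= 4.0
best_year = high_ratings_data.groupBy("year").count().orderBy(col("count").desc()).first()["year"]
top_10_movies = high_ratings_data.filter(col("year") == best_year).orderBy(col("avg_rating").desc()).limit(10)
```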
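Task (6) has two steps: first pick the year with the most movies whose average rating is at least 4.0, then rank the movies of that year. The plain-Python sketch below walks through both steps on invented data. It assumes the release year appears in parentheses at the end of the title, which the task statement does not confirm; the names `release_year` and `top_in_best_year` are hypothetical helpers.

```python
import re
from collections import Counter

# Hypothetical (title, avg_rating) pairs, invented for illustration.
movies = [
    ("Alpha (1995)", 4.5), ("Beta (1995)", 4.1), ("Gamma (1995)", 3.2),
    ("Delta (1999)", 4.8), ("Epsilon (1999)", 3.9), ("No Year", 4.9),
]

# Assumption: the title ends with a four-digit year in parentheses
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")

def release_year(title):
    """Return the trailing year as an int, or None if the title has none."""
    m = YEAR_RE.search(title)
    return int(m.group(1)) if m else None

def top_in_best_year(movies, threshold=4.0, n=10):
    """Return (best_year, top n rows of that year), or (None, []) if no movie qualifies."""
    dated = [(release_year(t), t, r) for t, r in movies]
    dated = [row for row in dated if row[0] is not None]
    # Step 1: count well-rated movies per year
    high_counts = Counter(y for y, _, r in dated if r >= threshold)
    if not high_counts:
        return None, []
    best_year = high_counts.most_common(1)[0][0]
    # Step 2: rank the movies of that year by average rating, highest first
    ranked = sorted((row for row in dated if row[0] == best_year), key=lambda row: -row[2])
    return best_year, ranked[:n]

best_year, top = top_in_best_year(movies, n=2)
print(best_year, top)
```

Sorting the whole table by year and rating and taking the first 10 rows would only return the most recent year's movies, which answers a different question.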