使用hive根据数据文件夹ml-1m中三个.dat文件,完成如下功能 (1) 创建相应数据库和数据表,上传数据( (2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数) (3)分别求男性,女性当中评分最高(按平均分算)的10部电影(性别,电影名,影评分)评论次数大于等于50次 (4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分) (5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分) (6)求好片(平均评分>=4.0)最多的那个年份的最好看的10部电影 (7)求1997年上映的电影中,评分最高的10部Comedy类电影 (8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分) (9)各年评分最高的电影类型(年份,类型,影评分) (10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
时间: 2023-10-02 15:04:20 浏览: 205
(1)创建相应数据库和数据表,上传数据
首先需要在Hive中创建数据库和数据表,然后将数据上传到对应的数据表中。
```sql
-- 创建数据库
CREATE DATABASE IF NOT EXISTS movie_db;
-- 切换到该数据库
USE movie_db;
-- 创建数据表
CREATE TABLE IF NOT EXISTS movies (
movieid INT,
title STRING,
genres STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '::'
STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS ratings (
userid INT,
movieid INT,
rating FLOAT,
timestamp STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '::'
STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS users (
userid INT,
gender STRING,
age INT,
occupation INT,
zipcode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '::'
STORED AS TEXTFILE;
-- 将数据上传到对应的数据表中
LOAD DATA LOCAL INPATH '/path/to/ml-1m/movies.dat' OVERWRITE INTO TABLE movies;
LOAD DATA LOCAL INPATH '/path/to/ml-1m/ratings.dat' OVERWRITE INTO TABLE ratings;
LOAD DATA LOCAL INPATH '/path/to/ml-1m/users.dat' OVERWRITE INTO TABLE users;
```
(2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
```sql
SELECT m.title, COUNT(r.movieid) AS rating_count
FROM movies m JOIN ratings r ON m.movieid = r.movieid
GROUP BY m.title
ORDER BY rating_count DESC
LIMIT 10;
```
(3)分别求男性,女性当中评分最高(按平均分算)的10部电影(性别,电影名,影评分)评论次数大于等于50次
```sql
SELECT u.gender, m.title, AVG(r.rating) AS avg_rating
FROM movies m JOIN ratings r ON m.movieid = r.movieid
JOIN users u ON r.userid = u.userid
WHERE u.gender = 'F' AND r.userid IN (
SELECT userid
FROM ratings
GROUP BY userid
HAVING COUNT(movieid) >= 50
) OR u.gender = 'M' AND r.userid IN (
SELECT userid
FROM ratings
GROUP BY userid
HAVING COUNT(movieid) >= 50
)
GROUP BY u.gender, m.title
HAVING COUNT(r.movieid) >= 50
ORDER BY u.gender, avg_rating DESC
LIMIT 10;
```
(4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
```sql
SELECT
CASE
WHEN age BETWEEN 1 AND 18 THEN '1-18'
WHEN age BETWEEN 18 AND 24 THEN '18-24'
WHEN age BETWEEN 25 AND 34 THEN '25-34'
WHEN age BETWEEN 35 AND 44 THEN '35-44'
WHEN age BETWEEN 45 AND 49 THEN '45-49'
WHEN age BETWEEN 50 AND 55 THEN '50-55'
ELSE '56+'
END AS age_group,
AVG(r.rating) AS avg_rating
FROM ratings r JOIN users u ON r.userid = u.userid
WHERE r.movieid = 2116
GROUP BY age_group;
```
(5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
```sql
SELECT u.userid, m.title, AVG(r.rating) AS avg_rating
FROM movies m JOIN ratings r ON m.movieid = r.movieid
JOIN users u ON r.userid = u.userid
WHERE u.gender = 'F' AND r.userid IN (
SELECT userid
FROM ratings
GROUP BY userid
ORDER BY COUNT(movieid) DESC
LIMIT 1
)
GROUP BY u.userid, m.title
HAVING COUNT(r.movieid) >= 50
ORDER BY avg_rating DESC
LIMIT 10;
```
(6)求好片(平均评分>=4.0)最多的那个年份的最好看的10部电影
```sql
SELECT m.title, AVG(r.rating) AS avg_rating, COUNT(r.movieid) AS rating_count, SUBSTR(r.timestamp, 1, 4) AS year
FROM movies m JOIN ratings r ON m.movieid = r.movieid
WHERE AVG(r.rating) >= 4.0
GROUP BY m.title, year
ORDER BY rating_count DESC, avg_rating DESC
LIMIT 10;
```
(7)求1997年上映的电影中,评分最高的10部Comedy类电影
```sql
SELECT m.title, AVG(r.rating) AS avg_rating
FROM movies m JOIN ratings r ON m.movieid = r.movieid
WHERE m.genres LIKE '%Comedy%' AND SUBSTR(r.timestamp, 1, 4) = '1997'
GROUP BY m.title
ORDER BY avg_rating DESC
LIMIT 10;
```
(8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
```sql
SELECT m.genres, m.title, AVG(r.rating) AS avg_rating
FROM movies m JOIN ratings r ON m.movieid = r.movieid
GROUP BY m.genres, m.title
HAVING COUNT(r.movieid) >= 100
ORDER BY m.genres, avg_rating DESC
LIMIT 5;
```
(9)各年评分最高的电影类型(年份,类型,影评分)
```sql
SELECT SUBSTR(r.timestamp, 1, 4) AS year, m.genres, MAX(avg_rating) AS max_rating
FROM (
SELECT movieid, AVG(rating) AS avg_rating
FROM ratings
GROUP BY movieid
) r JOIN movies m ON r.movieid = m.movieid
GROUP BY year, m.genres;
```
(10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
```sql
INSERT OVERWRITE DIRECTORY '/path/to/output'
SELECT u.zipcode, m.title, AVG(r.rating) AS avg_rating
FROM movies m JOIN ratings r ON m.movieid = r.movieid
JOIN users u ON r.userid = u.userid
GROUP BY u.zipcode, m.title
HAVING COUNT(r.movieid) >= 50
ORDER BY u.zipcode, avg_rating DESC;
```
阅读全文