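2. Convert the RDD generated in the previous problem into a DataFrame, register it as a temporary view, and answer the following with queries: 1) find the title, author, and price of books published by "人民文学出版社"; 2) find the title and rating of books with more than 100000 ratings; 3) count the number of books at each star level.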
Posted: 2024-02-01 08:15:52
Assuming the generated RDD is named `bookRDD`, it can be converted to a DataFrame as follows:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType

# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Define the DataFrame schema; each RDD element must hold these 7 fields in this order
schema = StructType([
    StructField("book_name", StringType(), True),
    StructField("author", StringType(), True),
    StructField("publisher", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("rating", FloatType(), True),
    StructField("rating_count", IntegerType(), True),
    StructField("star", IntegerType(), True)
])

# Convert the RDD to a DataFrame
bookDF = spark.createDataFrame(bookRDD, schema=schema)

# Register a temporary view so the data can be queried with SQL
bookDF.createOrReplaceTempView("book")
```
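With an explicit schema, `createDataFrame` verifies by default that each field already has a matching Python type (`float` for `FloatType`, `int` for `IntegerType`), so an RDD of raw strings would be rejected. The sketch below shows one way to prepare the rows. It assumes, hypothetically, that each record is a comma-separated line with the seven fields in schema order; the helper names and the separator are illustrative and should be adapted to the actual output of the previous problem.

```python
from typing import Optional, Tuple

# One typed row, in the same field order as the schema above
BookRow = Tuple[str, str, str, Optional[float], Optional[float], Optional[int], Optional[int]]


def _to_float(text: str) -> Optional[float]:
    """Return the text as a float, or None if it is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return None


def _to_int(text: str) -> Optional[int]:
    """Return the text as an int, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_book_line(line: str, sep: str = ",") -> BookRow:
    """Split one raw record and cast each field to the type the schema expects.

    Assumption: exactly seven fields per line, separated by `sep`.
    """
    fields = [field.strip() for field in line.split(sep)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    name, author, publisher, price, rating, rating_count, star = fields
    return (
        name,
        author,
        publisher,
        _to_float(price),
        _to_float(rating),
        _to_int(rating_count),
        _to_int(star),
    )


# Usage, assuming a hypothetical RDD of text lines named rawRDD:
# bookRDD = rawRDD.map(parse_book_line)
```

Unparseable numbers become `None` here, which the nullable schema fields accept; dropping such rows with a `filter` is an equally reasonable choice.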
1. Find the title, author, and price of books published by "人民文学出版社":
```sql
SELECT book_name, author, price
FROM book
WHERE publisher = '人民文学出版社'
```
2. Find the title and rating of books with more than 100000 ratings:
```sql
SELECT book_name, rating
FROM book
WHERE rating_count > 100000
```
3. Count the number of books at each star level:
```sql
SELECT star, COUNT(*) as count
FROM book
GROUP BY star
```
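The SQL statements above still have to be submitted through the SparkSession. A minimal sketch of how that might look in PySpark follows; it assumes the `book` temporary view is already registered, and `QUERIES` and `run_queries` are illustrative names rather than part of any API. The `spark` parameter is left untyped so the snippet does not depend on how the session was created.

```python
# The three queries from this answer, keyed by a short label
QUERIES = {
    "books_by_publisher": """
        SELECT book_name, author, price
        FROM book
        WHERE publisher = '人民文学出版社'
    """,
    "popular_books": """
        SELECT book_name, rating
        FROM book
        WHERE rating_count > 100000
    """,
    "books_per_star": """
        SELECT star, COUNT(*) as count
        FROM book
        GROUP BY star
    """,
}


def run_queries(spark):
    """Run every query against the `book` view and return the results by label."""
    results = {}
    for label, query in QUERIES.items():
        # spark.sql returns a DataFrame; nothing is computed until an action runs
        results[label] = spark.sql(query)
    return results


# Usage, assuming `spark` is the active SparkSession:
# for label, df in run_queries(spark).items():
#     print(label)
#     df.show()
```

Calling `show()` prints the first rows of each result; `collect()` would return them to the driver as a list of `Row` objects instead.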