There is a file book.txt whose columns are (title, link, country, author, translator, publisher, publish date, price, star rating, score, number of ratings, synopsis). Convert book.txt into an RDD, convert that RDD into a DataFrame, register it as a temporary view, and use it to answer the following: 1) query the title, author, and price of books published by 人民文学出版社; 2) query the title and score of books with more than 100,000 ratings; 3) count the number of books at each star rating.
Posted: 2023-08-13 22:04:23
First, read book.txt and convert it into an RDD:
```python
# Import SparkSession from pyspark
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.appName("book").getOrCreate()

# Read book.txt as an RDD of lines
rdd = spark.sparkContext.textFile("book.txt")
```
Next, split each line on commas and convert the fields into a typed tuple:
```python
# Split each line on commas
rdd = rdd.map(lambda x: x.split(","))
# Cast the numeric fields: price, star, and score to float
# (scores like 9.2 are not valid ints), number of ratings to int
rdd = rdd.map(lambda x: (x[0], x[1], x[2], x[3], x[4], x[5], x[6],
                         float(x[7]), float(x[8]), float(x[9]), int(x[10]), x[11]))
```
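Real export files are rarely perfectly clean: a row with a missing column or a non-numeric price would crash the whole job at the `float()` cast. A small defensive parser (a sketch; the field order follows the column list above, and `parse_line` is a name introduced here for illustration) lets malformed rows be dropped instead:

```python
# Hypothetical defensive parser: returns None for any row that cannot be
# parsed, so bad rows can be filtered out instead of failing the Spark job.
def parse_line(line):
    fields = line.split(",")
    if len(fields) != 12:          # expect exactly 12 columns
        return None
    try:
        return (fields[0], fields[1], fields[2], fields[3], fields[4],
                fields[5], fields[6], float(fields[7]), float(fields[8]),
                float(fields[9]), int(fields[10]), fields[11])
    except ValueError:             # non-numeric price/star/score/ratings
        return None

# Usage with the RDD from above:
# rdd = spark.sparkContext.textFile("book.txt").map(parse_line) \
#                         .filter(lambda t: t is not None)
```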
Next, convert the RDD into a DataFrame and register it as a temporary view:
```python
# Convert the RDD to a DataFrame with named columns
df = rdd.toDF(["book_name", "link", "country", "author", "translator", "press",
               "publish_time", "price", "star", "score", "comments", "brief"])
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("book")
```
Now the three questions can be answered with SQL queries:
1. Query the title, author, and price of books published by 人民文学出版社:
```python
result1 = spark.sql("SELECT book_name, author, price FROM book WHERE press='人民文学出版社'")
result1.show()
```
2. Query the title and score of books with more than 100,000 ratings:
```python
result2 = spark.sql("SELECT book_name, score FROM book WHERE comments>100000")
result2.show()
```
3. Count the number of books at each star rating:
```python
result3 = spark.sql("SELECT star, COUNT(*) AS count FROM book GROUP BY star")
result3.show()
```
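The three statements are plain ANSI SQL, so they can be sanity-checked locally without a Spark cluster. The sketch below runs them against a few made-up rows in an in-memory SQLite table (the sample data is invented for illustration, and only the columns the queries touch are included):

```python
import sqlite3

# In-memory table mirroring the columns used by the three queries.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE book (
    book_name TEXT, author TEXT, press TEXT,
    price REAL, star REAL, score REAL, comments INTEGER)""")

# Hypothetical sample rows, invented purely to exercise the queries.
rows = [
    ("红楼梦", "曹雪芹", "人民文学出版社", 59.7, 5.0, 9.6, 250000),
    ("活着", "余华", "作家出版社", 20.0, 5.0, 9.4, 500000),
    ("小王子", "圣埃克苏佩里", "人民文学出版社", 22.0, 4.5, 9.1, 80000),
]
conn.executemany("INSERT INTO book VALUES (?,?,?,?,?,?,?)", rows)

q1 = conn.execute(
    "SELECT book_name, author, price FROM book WHERE press='人民文学出版社'").fetchall()
q2 = conn.execute(
    "SELECT book_name, score FROM book WHERE comments>100000").fetchall()
q3 = conn.execute(
    "SELECT star, COUNT(*) FROM book GROUP BY star").fetchall()
```

Each `fetchall()` returns the same rows Spark's `show()` would display for the corresponding query, which makes the filter and grouping logic easy to verify before submitting the real job.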