How to group and aggregate a PySpark DataFrame
Date: 2023-05-22 17:05:38
You can use the DataFrame's groupBy() and agg() methods to perform grouped aggregation.
For example:
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Create the DataFrame
data = [("alice", 25, "female", 3000),
        ("bob", 30, "male", 5000),
        ("tom", 35, "male", 4000),
        ("jerry", 35, "female", 6000),
        ("hanmeimei", 25, "female", 3500),
        ("lilei", 30, "male", 4500)]
columns = ["name", "age", "gender", "salary"]
df = spark.createDataFrame(data, columns)

# Group by gender, then aggregate
df.groupBy("gender").agg(F.avg("age"), F.sum("salary")).show()
# Output (row order may vary):
# +------+------------------+-----------+
# |gender|          avg(age)|sum(salary)|
# +------+------------------+-----------+
# |female|28.333333333333332|      12500|
# |  male|31.666666666666668|      13500|
# +------+------------------+-----------+
```
The code above groups the DataFrame by the gender column, then computes each group's average age and total salary.