Titanic with PySpark
Date: 2023-12-08 16:37:39
The following example code analyzes the Titanic dataset with PySpark:
```python
# Import the required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
# Create the SparkSession
spark = SparkSession.builder.appName("Titanic Analysis").getOrCreate()
# Read the data (the file has a header row; column types are inferred)
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)
# Data cleaning: drop identifier-like columns, fill missing embarkation ports, relabel survival
df = df.drop("PassengerId", "Name", "Ticket", "Cabin")
df = df.fillna({"Embarked": "S"})
df = df.withColumn("Survived", when(col("Survived") == 1, "Yes").otherwise("No"))
# Data analysis: count passengers per group and express each count as a share of all passengers
total = df.count()  # computed once instead of re-running the count for every group
survival_rate = df.groupBy("Survived").agg(count("*").alias("Total"))
survival_rate = survival_rate.withColumn("Percentage", col("Total") / total * 100)
gender_rate = df.groupBy("Sex").agg(count("*").alias("Total"))
gender_rate = gender_rate.withColumn("Percentage", col("Total") / total * 100)
class_rate = df.groupBy("Pclass").agg(count("*").alias("Total"))
class_rate = class_rate.withColumn("Percentage", col("Total") / total * 100)
# Show the results
survival_rate.show()
gender_rate.show()
class_rate.show()
```
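This code reads the Titanic dataset with PySpark, then cleans and analyzes it. It computes the share of passengers by survival outcome, by sex, and by passenger class, and prints each result.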
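To make the meaning of the `Percentage` column concrete, the same group, count, and divide-by-total logic can be checked by hand in plain Python. This is only a minimal sketch: the `sample_rows` list and the `share_by` helper are hypothetical names invented for illustration, and the rows are made up rather than taken from `titanic.csv`.

```python
from collections import Counter

# Hypothetical sample rows, invented for illustration (not taken from titanic.csv)
sample_rows = [
    {"Sex": "male", "Pclass": 3, "Survived": "No"},
    {"Sex": "female", "Pclass": 1, "Survived": "Yes"},
    {"Sex": "male", "Pclass": 3, "Survived": "No"},
    {"Sex": "male", "Pclass": 2, "Survived": "Yes"},
]


def share_by(rows, key):
    """Return {group value: (total, percentage of all rows)} for one column."""
    counts = Counter(row[key] for row in rows)  # plays the role of groupBy(key) + count("*")
    n = len(rows)                               # plays the role of df.count()
    return {value: (total, total / n * 100) for value, total in counts.items()}


# Print the per-group totals and shares for the same three columns as above
for column in ("Survived", "Sex", "Pclass"):
    print(column, share_by(sample_rows, column))
```

Within each column the percentages add up to 100, which is a quick sanity check that also applies to the PySpark results.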