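Please write code for me: Python Data Analysis Case 08: Predicting the Survival of Titanic Passengers (Full Machine Learning Workflow), with plots to illustrate the results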
Time: 2024-09-15 17:04:27 Views: 68
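Sure. Below is a simple example of data preprocessing, feature engineering, and model training built on Python and pandas, using the Titanic dataset to predict survival. It reads the data, cleans it, engineers two family-related features, selects the model inputs, tunes a random forest classifier with a grid search, and plots the resulting confusion matrix.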
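Before settling on a cleaning strategy, it can help to measure how much of each column is missing, because dropping every row that contains any missing value may discard most of the data when a single column is sparse. The following is a minimal sketch of that check, not part of the original answer. The `toy` frame and its values are invented for illustration; only the column names mirror the Titanic CSV.

```python
import numpy as np
import pandas as pd

# Made-up toy rows for illustration only: the column names mirror the Titanic CSV,
# but the values are invented and not taken from the dataset.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 35.0, 54.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
    "Survived": [0, 1, 1, 0, 0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Dropping every row with any missing value keeps only fully populated rows
rows_after_dropna = len(toy.dropna())

# Alternative: drop the sparse column and fill the remaining gaps
cleaned = toy.drop(columns=["Cabin"])
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

print(rows_after_dropna, len(cleaned))
```

On the real CSV, the same `isna().mean()` check can be run on `data` right after `pd.read_csv` to see which columns are worth filling and which are too sparse to keep.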
```python
# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Load the data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
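# Preprocessing: fill gaps instead of dropping every incomplete row (Cabin is mostly missing)
data['Age'] = data['Age'].fillna(data['Age'].median())  # fill missing ages with the median
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})  # encode sex as a number
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])  # fill unknown ports of embarkation with the mode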
# Feature engineering
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1  # the passenger plus relatives aboard
data['IsAlone'] = (data['FamilySize'] == 1).astype(int)  # 1 if travelling alone, else 0
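# Split features and target
# Drop identifier and free-text columns, then one-hot encode the port of embarkation
X = pd.get_dummies(data.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin']), columns=['Embarked'])
y = data['Survived']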
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the model
model = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
# Hyperparameter tuning (grid search)
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Predict and evaluate
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)
# Visualize the results: normalize each row so cells show the share of each true class
sns.heatmap(conf_mat / conf_mat.sum(axis=1, keepdims=True), annot=True, fmt=".2%", cmap='Blues')
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.title(f"Confusion Matrix (Accuracy: {accuracy * 100:.2f}%)")
plt.show()
```