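1. Analyze the Titanic data to find the factors that affect passenger survival; 2. Preprocess the Titanic data: fill in missing values, encode the data, and select suitable features; 3. Train models with logistic regression, KNN, decision tree, and random forest, and output the results. Code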
Posted: 2024-03-18 18:38:49
Titanic training dataset, used to analyze how strongly various factors affect the survival rate of Titanic passengers
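Sure. Below is example Python code for analyzing, preprocessing, and modeling the Titanic data:
1. Analyze the Titanic data to find the factors that affect passenger survival: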
```python
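import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Load the data
train_data = pd.read_csv('train.csv')
# Exploratory data analysis: each bar shows the mean of Survived for one category.
# A new figure is opened before each plot so the charts do not overwrite each other.
plt.figure()
sns.barplot(x='Sex', y='Survived', data=train_data)  # survival rate by sex
plt.figure()
sns.barplot(x='Pclass', y='Survived', data=train_data)  # survival rate by ticket class
plt.figure()
sns.barplot(x='SibSp', y='Survived', data=train_data)  # survival rate by number of siblings/spouses aboard
plt.figure()
sns.barplot(x='Embarked', y='Survived', data=train_data)  # survival rate by port of embarkation
plt.show()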
```
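The bar charts give a visual impression; the same comparison can also be read as numbers. The following is a minimal sketch, assuming the column names from `train.csv`. The helper `survival_rate_by` and the small `sample` frame are made up for illustration and are not real Titanic records.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of Survived for each category of `column`, highest first."""
    return df.groupby(column)['Survived'].mean().sort_values(ascending=False)


# Tiny made-up sample with the same column names as train.csv (not real data)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Survived': [0, 1, 1, 0, 1, 0],
})

sex_rates = survival_rate_by(sample, 'Sex')
print(sex_rates)
```

With the real data, calling `survival_rate_by(train_data, 'Pclass')` would give the per-class survival rates that the corresponding bar chart displays.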
2. Preprocess the Titanic data: fill in missing values, encode the data, and select suitable features:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
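# Load the data
train_data = pd.read_csv('train.csv')
# Fill missing values: median for Age, most frequent value for Embarked.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
# Encode the categorical columns as integers (the encoder is refit for each column)
le = LabelEncoder()
train_data['Sex'] = le.fit_transform(train_data['Sex'])
train_data['Embarked'] = le.fit_transform(train_data['Embarked'])
# Select the features and the target
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = train_data[features]
y = train_data['Survived']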
```
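It can help to check how many values are missing before filling them, and to consider one-hot encoding for `Embarked`, since integer labels imply an order between ports that does not exist. The sketch below is one possible variant, not the only valid approach; it swaps `LabelEncoder` for `pd.get_dummies`, and the `sample` frame is made-up data that only mimics the column names.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with missing values (not real Titanic records)
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing values per column before imputation
missing_before = sample.isnull().sum()

# Fill Age with the median and Embarked with the most frequent value
sample['Age'] = sample['Age'].fillna(sample['Age'].median())
sample['Embarked'] = sample['Embarked'].fillna(sample['Embarked'].mode()[0])

# One-hot encode the categorical columns: one 0/1 column per category
encoded = pd.get_dummies(sample, columns=['Sex', 'Embarked'], dtype=int)

print(missing_before)
print(encoded.columns.tolist())
```

One-hot encoding adds columns, so the `features` list would need to name the new columns (for example `Sex_male`) instead of the original ones.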
3. Train models with logistic regression, KNN, decision tree, and random forest, and output the results:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
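# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Candidate models; max_iter is raised to help logistic regression converge on unscaled features,
# and random_state is fixed so the tree-based results are reproducible
models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
]
# Train each model and report its accuracy on the test set
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{model.__class__.__name__} accuracy: {acc*100:.2f}%")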
```
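A single 80/20 split can make the ranking of the four models depend on which rows land in the test set. As a rough sketch of a steadier comparison, the example below uses 5-fold cross-validation and scales the features for the two distance- and gradient-sensitive models. It runs on synthetic data from `make_classification` (an assumption made so the block runs without `train.csv`); `X_demo`, `y_demo`, and `cv_scores` are illustrative names, and with the real data `X` and `y` from step 2 would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (7 columns, binary target); not real data
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)

# Scaling is wrapped in a pipeline so it is fit only on each training fold
models = {
    'LogisticRegression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'KNeighborsClassifier': make_pipeline(StandardScaler(), KNeighborsClassifier()),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean()*100:.2f}% (std {scores.std()*100:.2f})")
```

The standard deviation printed next to each mean gives a sense of how much the accuracy moves between folds, which a single split cannot show.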