Titanic survivors in Jupyter
Date: 2023-11-30 11:43:15
The following steps classify Titanic survivors in a Jupyter Notebook:
1. Import the required libraries and load the datasets
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
```
2. Explore and visualize the data
```python
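# Show the first 5 rows
train_data.head()
# Show column dtypes and non-null counts
train_data.info()
# Summary statistics for the numeric features
train_data.describe()
# Correlation between the numeric features
# (numeric_only=True skips text columns such as Name, which would otherwise raise an error in recent pandas)
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
# Number of survivors and non-survivors
sns.countplot(x='Survived', data=train_data)
plt.show()
# Number of survivors and non-survivors by sex
sns.countplot(x='Survived', hue='Sex', data=train_data)
plt.show()
# Number of survivors and non-survivors by passenger class
sns.countplot(x='Survived', hue='Pclass', data=train_data)
plt.show()
# Age distribution of survivors and non-survivors
sns.histplot(x='Age', hue='Survived', data=train_data, kde=True)
plt.show()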
```
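The count plots show group sizes, but the survival rate per group is often easier to compare as a number. A minimal sketch, assuming a small made-up DataFrame named `toy` that only mimics the `Survived`, `Sex`, and `Pclass` columns (it is not real Titanic data):

```python
import pandas as pd

# Toy stand-in for train_data (made-up rows, for illustration only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 3, 1, 2],
})

# The mean of a 0/1 column is the share of survivors in each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Replacing `toy` with `train_data` should give the corresponding rates for the real data, as long as this runs before `Sex` is converted to numbers.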
3. Clean the data and engineer features
```python
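# Fill missing values (plain assignment avoids the chained inplace warning in recent pandas)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())
# Embarked can also contain missing values; fill them with the most frequent
# training port so that no NaN is left after the mapping below
most_common_port = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(most_common_port)
test_data['Embarked'] = test_data['Embarked'].fillna(most_common_port)
# Convert sex and port of embarkation to numeric features
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})
train_data['Embarked'] = train_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
test_data['Embarked'] = test_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# Create the new features FamilySize and IsAlone
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1
train_data['IsAlone'] = np.where(train_data['FamilySize'] == 1, 1, 0)
test_data['IsAlone'] = np.where(test_data['FamilySize'] == 1, 1, 0)
# Drop features the model does not use
train_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch'], axis=1, inplace=True)
test_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch'], axis=1, inplace=True)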
```
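`Series.map` returns NaN for any value that is not a key of the mapping, including values that were already missing, and scikit-learn's `LogisticRegression` rejects NaN inputs. A quick check after cleaning can catch this early. The sketch below uses a made-up frame named `toy` with one missing `Age` and one missing `Embarked`; filling `Embarked` with the most frequent value is one possible choice, not the only one:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (made-up values, for illustration only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Fill the numeric gap with the median and the categorical gap with the mode
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Map ports to integers only after the gaps are filled
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Count the remaining missing values per column; all should be zero
missing = toy.isna().sum()
print(missing)
```

Running `train_data.isna().sum()` and `test_data.isna().sum()` in the same way shows whether any column still needs attention before training.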
4. Train the model and make predictions
```python
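from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Separate the features from the target
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']
# Hold out 20% of the labelled data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# max_iter is raised from the default of 100 so the solver can converge on unscaled features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate on the held-out split
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))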
```
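A single 80/20 split gives only one accuracy estimate, and it can shift noticeably with a different `random_state`. Cross-validation is one common way to get a steadier number. The sketch below is an illustration rather than part of the original steps: it uses synthetic features generated with NumPy instead of the Titanic data, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 3 features, label driven mostly by feature 0
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Fit and score the model on 5 different train/validation splits
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

With the real data, passing the cleaned `X` and `y` from the step above in place of the synthetic arrays should work the same way.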