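Titanic Passenger Survival Prediction: Algorithm Steps and Code for Implementing the Model
Date: 2024-10-15 22:29:10
Titanic passenger survival prediction is a classic data science project, usually approached by building a model with machine learning techniques. The steps and Python code outline for a basic model are as follows: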
1. **Data loading**:
```python
import pandas as pd
data = pd.read_csv('titanic.csv')  # assumes the data is loaded from a CSV file
```
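2. **Data preprocessing**:
- Check for missing values: `data.isnull().sum()`
- Fill or drop missing values: for example, fill the NaN entries in `Age` with the mean age
- Encode categorical variables: convert `Sex`, `Embarked`, and similar columns to numeric form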
```python
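data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data['Age'] = data['Age'].fillna(data['Age'].mean())  # fill missing ages with the mean age
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])  # most frequent port
data['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})  # assumes the usual S/C/Q codes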
```
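As a small illustration of the preprocessing bullets above, the sketch below runs them on a hand-made toy DataFrame. The rows are invented for illustration and are not real Titanic records. It also swaps in `pd.get_dummies` for `Embarked` as one possible alternative to integer codes, since the embarkation ports have no natural order.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; not real Titanic records.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing values per column before any imputation
missing_counts = toy.isnull().sum()

# Impute: mean for the numeric column, most frequent value for the categorical one
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Binary encoding for Sex, one indicator column per port for Embarked
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})
toy = pd.get_dummies(toy, columns=['Embarked'], dtype=int)

print(missing_counts)
print(toy)
```

One-hot columns avoid implying that one port is "greater" than another, which can matter for linear models such as logistic regression.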
3. **Feature engineering**:
- Create new features, such as family size and whether the passenger travelled alone
```python
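data['FamilySize'] = data['SibSp'] + data['Parch']  # relatives aboard, excluding the passenger
# Compare element-wise; a plain `if` on a whole Series raises ValueError
data['IsAlone'] = (data['FamilySize'] == 0).astype(int)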
```
4. **Splitting into training and test sets**:
```python
from sklearn.model_selection import train_test_split
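# Features: drop the target and the identifier/free-text columns
# (column names assume the standard Kaggle Titanic CSV)
X = data.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y = data['Survived']  # target variable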
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
5. **Choosing a model**:
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # higher iteration cap helps convergence on unscaled features
```
6. **Training the model**:
```python
model.fit(X_train, y_train)
```
7. **Evaluating the model**:
```python
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # fraction of test passengers classified correctly
print(f'Accuracy: {accuracy:.3f}')
```
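Accuracy alone can hide how the errors are distributed between the two classes. As a minimal sketch, the example below computes a confusion matrix, precision, and recall next to accuracy. The label lists `y_true` and `y_hat` are hypothetical values invented for illustration, not output from the model above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration (1 = survived, 0 = did not survive)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)

print(acc)
print(cm)
print(precision, recall)
```

With the real model, `y_test` and `y_pred` would take the place of the two hypothetical lists.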
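8. **Model optimization** (optional):
You may need to tune the model's hyperparameters or try other algorithms, such as decision trees, random forests, or support vector machines.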
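One possible way to carry out this tuning step is a cross-validated grid search. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, a random forest as the alternative algorithm, and an arbitrary small parameter grid. None of these choices come from the original steps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and target (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small illustrative grid; the values are arbitrary starting points
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}

# 5-fold cross-validation on the training set for every grid combination
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
# score() evaluates the refitted best estimator on the held-out test set
test_accuracy = search.score(X_te, y_te)

print(best_params)
print(test_accuracy)
```

Keeping the test set out of the search means the final accuracy is measured on data that played no part in choosing the hyperparameters.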