Titanic Survival Prediction with the Random Forest Algorithm
Date: 2023-09-22 15:11:34
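Random forest is an ensemble learning algorithm: it trains many decision trees and combines their predictions, which usually gives better accuracy than any single tree. It is a natural fit for the Titanic survival prediction problem, where the task is to classify each passenger as survived or not survived.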
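To make the ensemble idea concrete, the following minimal sketch compares a single decision tree with a random forest. It uses synthetic data from `make_classification` rather than the Titanic data, and the sample size, feature counts, and 50-tree setting are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (a stand-in, not the Titanic data).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# One decision tree versus a forest of 50 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# score() returns the mean accuracy on the given test data.
tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print('Single tree accuracy:', tree_acc)
print('Random forest accuracy:', forest_acc)
print('Trees in the forest:', len(forest.estimators_))
```

The forest typically scores at least as well as the single tree, though on any one split of the data this is not guaranteed.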
The steps below use Python to predict Titanic survival with a random forest:
1. Import the required modules and load the dataset
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load the dataset
titanic = pd.read_csv('titanic.csv')
```
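Before preprocessing, it can help to check which columns contain missing values and which are still text. The sketch below does this on a few hypothetical rows that only borrow the usual Titanic column names; the values are made up and are not taken from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using Titanic-style column names; not real data.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing values per column to decide what needs imputing.
missing = sample.isna().sum()
print(missing)

# Column dtypes show which features are still text and need encoding.
print(sample.dtypes)
```

On the real dataset, the same two calls would be applied to the `titanic` DataFrame.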
2. Preprocess the data
```python
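# Drop columns that are not used as features
titanic = titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Fill missing values: mean for Age, most frequent value for Embarked.
# Assign the result back; calling fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions.
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])
# Convert categorical features to numeric codes
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# Split into features and target, then into training (70%) and test (30%) sets
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)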
```
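Mapping `Embarked` to 0/1/2 imposes an order on the ports that has no real meaning. One alternative, shown here as a sketch rather than as part of the original workflow, is one-hot encoding with `pd.get_dummies`. The rows below are hypothetical and only reuse the column names from the steps above.

```python
import pandas as pd

# Hypothetical rows; not real data.
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# One indicator column per port instead of the integer codes 0/1/2.
encoded = pd.get_dummies(sample, columns=['Embarked'], dtype=int)

# Sex has only two values, so a single 0/1 column is enough.
encoded['Sex'] = encoded['Sex'].map({'male': 0, 'female': 1})
print(encoded)
```

Tree-based models are often fairly tolerant of integer codes, so whether this changes the accuracy on the Titanic data would need to be checked empirically.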
3. Train the random forest model
```python
# Create a random forest classifier with 100 trees
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training set
rfc.fit(X_train, y_train)
```
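Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which features the trees relied on. The sketch below demonstrates this on randomly generated stand-in data with Titanic-style column names; the label rule is an assumption invented for the demo, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with Titanic-style column names; values are random.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=200),
    'Sex': rng.integers(0, 2, size=200),
    'Age': rng.uniform(1, 80, size=200),
    'Fare': rng.uniform(5, 100, size=200),
})
# Invented label rule for the demo: depends only on Sex and Age.
y_demo = ((X_demo['Sex'] == 1) | (X_demo['Age'] < 10)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

With the model from the steps above, the equivalent call would be `pd.Series(rfc.feature_importances_, index=X_train.columns)`.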
4. Predict on the test set and compute the accuracy
```python
# Predict labels for the test set
y_pred = rfc.predict(X_test)
# Compute the fraction of correct predictions
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```
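Accuracy from a single 70/30 split can vary noticeably with the choice of `random_state`. A common complement is k-fold cross-validation, sketched below with `cross_val_score`. The data here is synthetic and the choice of 5 folds is an assumption, not something the original steps specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with the preprocessed X and y in practice.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train and evaluate on 5 different train/validation partitions.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

The spread of the fold accuracies gives a sense of how much a single reported number might move.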
The complete code is as follows:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
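# Load the dataset
titanic = pd.read_csv('titanic.csv')
# Drop columns that are not used as features
titanic = titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Fill missing values: mean for Age, most frequent value for Embarked.
# Assign the result back; calling fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions.
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])
# Convert categorical features to numeric codes
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# Split into features and target, then into training (70%) and test (30%) sets
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a random forest classifier with 100 trees
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training set
rfc.fit(X_train, y_train)
# Predict labels for the test set
y_pred = rfc.predict(X_test)
# Compute the fraction of correct predictions
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)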
```