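Titanic dataset classification. Requirements: (1) Use the KNN algorithm to classify the Titanic dataset. (2) Evaluate the trained model on the test set and print the accuracy.
Posted: 2023-12-14 15:39:40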
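Sure, this is a classic machine learning classification problem, and I can help you work through it.
First, import the required libraries and load the dataset. pandas and numpy handle the data, and sklearn provides the KNN model. Download the dataset to your local machine before loading it; the code below assumes `train.csv` and `test.csv` are in the working directory.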
```python
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Load the training and test sets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
```
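Next, preprocess the data. This covers data cleaning, feature engineering, and normalization. Other Titanic classification examples are a useful reference for the specific choices.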
```python
for df in (train_data, test_data):
    # Data cleaning: drop Cabin (mostly missing) and the free-text Name and Ticket
    # columns, which cannot be min-max scaled, then drop rows with missing values
    df.drop(['Cabin', 'Name', 'Ticket'], axis=1, inplace=True)
    df.dropna(inplace=True)
    # Feature engineering: encode Sex as 0/1 and one-hot encode Embarked
    df['Sex'] = (df['Sex'] == 'male').astype('int')
    for port in ('S', 'C', 'Q'):
        df['Embarked_' + port] = (df['Embarked'] == port).astype('int')
    df.drop(['Embarked'], axis=1, inplace=True)
# Extract features and labels (PassengerId is an identifier, not a feature)
train_x = train_data.drop(['Survived', 'PassengerId'], axis=1)
train_y = train_data['Survived']
test_x = test_data.drop(['PassengerId'], axis=1)
# Min-max normalization: scale both sets with the training set's min and max
x_min, x_max = train_x.min(), train_x.max()
train_x = (train_x - x_min) / (x_max - x_min)
test_x = (test_x - x_min) / (x_max - x_min)
```
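The normalization step can also be delegated to scikit-learn. The following is a minimal sketch, assuming two small made-up tables (`demo_train` and `demo_new` and their values are illustrative, not taken from the dataset). `MinMaxScaler` learns each column's minimum and maximum from the training rows and reuses them for new rows, so both sets end up on the same scale.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features standing in for preprocessed Titanic columns
demo_train = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Fare': [10.0, 50.0, 110.0]})
demo_new = pd.DataFrame({'Age': [25.0, 60.0], 'Fare': [60.0, 10.0]})

scaler = MinMaxScaler()
# fit_transform learns min/max from the training rows and scales them to [0, 1]
train_scaled = scaler.fit_transform(demo_train)
# transform reuses the training min/max, so new rows can fall outside [0, 1]
new_scaled = scaler.transform(demo_new)

print(train_scaled)
print(new_scaled)
```

A value outside the training range (Age 60 here) maps above 1, which is expected when the scaler is fitted on the training rows only.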
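Then use the `KNeighborsClassifier` class from sklearn to build the KNN model. Before building it, choose the value of K (the `n_neighbors` parameter). Cross-validation is one way to find a good value.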
```python
# Build the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_x, train_y)
# Predict labels for the test set
test_y = knn.predict(test_x)
```
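Choosing K by cross-validation could look roughly like the sketch below. The helper `select_k`, its candidate list, and the synthetic data from `make_classification` are assumptions made for illustration; with the Titanic features you would pass `train_x` and `train_y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def select_k(x, y, candidates=(1, 3, 5, 7, 9, 11), folds=5):
    """Return (best_k, scores); scores maps each k to its mean CV accuracy."""
    scores = {}
    for k in candidates:
        model = KNeighborsClassifier(n_neighbors=k)
        # Mean accuracy over the cross-validation folds for this k
        scores[k] = cross_val_score(model, x, y, cv=folds, scoring='accuracy').mean()
    # Pick the k with the highest mean accuracy (first one wins on ties)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data (hypothetical, not the Titanic dataset)
demo_x, demo_y = make_classification(n_samples=200, n_features=6, random_state=0)
best_k, scores = select_k(demo_x, demo_y)
print('Best K:', best_k)
```

Odd values of K are used here to reduce ties in the majority vote for a two-class problem; the range to search is a judgment call.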
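Finally, compute the model's accuracy. Accuracy needs true labels, and the code above never reads a `Survived` column from `test.csv` (Kaggle's copy of that file does not include one). The snippet below therefore holds out 20% of the labeled training rows and scores a model on that held-out part.
```python
from sklearn.model_selection import train_test_split
# Hold out 20% of the labeled rows; scoring on the rows used for fitting overstates accuracy
fit_x, val_x, fit_y, val_y = train_test_split(train_x, train_y, test_size=0.2, random_state=42, stratify=train_y)
val_knn = KNeighborsClassifier(n_neighbors=5).fit(fit_x, fit_y)
accuracy = val_knn.score(val_x, val_y)
print('Accuracy:', accuracy)
```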
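Accuracy is the fraction of predictions that match the true labels, which is also what a scikit-learn classifier's `score` method reports. A minimal sketch with made-up labels (`y_true` and `y_pred` are illustrative values, not model output) shows that a direct comparison and `accuracy_score` give the same number:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of the 8 predictions match
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Fraction of positions where prediction and truth agree
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```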
The complete code is as follows:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the training and test sets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
for df in (train_data, test_data):
    # Data cleaning: drop Cabin (mostly missing) and the free-text Name and Ticket
    # columns, which cannot be min-max scaled, then drop rows with missing values
    df.drop(['Cabin', 'Name', 'Ticket'], axis=1, inplace=True)
    df.dropna(inplace=True)
    # Feature engineering: encode Sex as 0/1 and one-hot encode Embarked
    df['Sex'] = (df['Sex'] == 'male').astype('int')
    for port in ('S', 'C', 'Q'):
        df['Embarked_' + port] = (df['Embarked'] == port).astype('int')
    df.drop(['Embarked'], axis=1, inplace=True)
# Extract features and labels (PassengerId is an identifier, not a feature)
train_x = train_data.drop(['Survived', 'PassengerId'], axis=1)
train_y = train_data['Survived']
test_x = test_data.drop(['PassengerId'], axis=1)
# Min-max normalization: scale both sets with the training set's min and max
x_min, x_max = train_x.min(), train_x.max()
train_x = (train_x - x_min) / (x_max - x_min)
test_x = (test_x - x_min) / (x_max - x_min)
# Build the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_x, train_y)
# Predict labels for the test set (test.csv is assumed to have no Survived column)
test_y = knn.predict(test_x)
# Compute accuracy on a 20% hold-out of the labeled rows;
# scoring on the rows used for fitting overstates accuracy
fit_x, val_x, fit_y, val_y = train_test_split(train_x, train_y, test_size=0.2, random_state=42, stratify=train_y)
val_knn = KNeighborsClassifier(n_neighbors=5).fit(fit_x, fit_y)
accuracy = val_knn.score(val_x, val_y)
print('Accuracy:', accuracy)
```
I hope this example helps!