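Predicting sex (the score may be low): read the adults.txt file and train a logistic regression model that predicts a person's sex from race, occupation, and hours worked per week (hours_per_week)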
Posted: 2024-03-18 17:46:06
First, read adults.txt into a pandas DataFrame:
```python
import pandas as pd
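# Assumes a headerless file in the UCI "adult" column order; drop header/names if yours has a header row.
# skipinitialspace=True strips the blank after each comma, so values read 'Male' rather than ' Male'.
df = pd.read_csv('adults.txt', header=None, names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'], skipinitialspace=True)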
```
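Before moving on, it can help to confirm that the file parses the way the column list assumes. The sketch below does not read adults.txt: it parses two illustrative rows written in the UCI "adult" layout (comma plus space as separator, `?` for missing values), which is an assumption about your file. The `na_values='?'` argument is an optional addition that turns those placeholders into real missing values.

```python
import io
import pandas as pd

# Two illustrative rows in the assumed UCI "adult" layout (not read from adults.txt).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "38, ?, 180211, Some-college, 10, Divorced, ?, "
    "Unmarried, White, Female, 0, 0, 30, United-States, <=50K\n"
)
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week',
           'native-country', 'income']

# skipinitialspace drops the blank after each comma; na_values marks '?' as missing.
toy = pd.read_csv(sample, header=None, names=columns,
                  skipinitialspace=True, na_values='?')

# Inspect only the columns this task uses, then count missing values in them.
used = ['race', 'occupation', 'hours-per-week', 'sex']
print(toy[used])
print(toy[used].isna().sum())
```

If the missing-value count is non-zero on the real file, those rows need to be dropped or filled before the encoding step.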
Next, preprocess the data: encode the categorical variables as integers and use sex as the label column:
```python
from sklearn.preprocessing import LabelEncoder
# Encode each categorical column as integer codes (a fresh encoder per column)
categorical_cols = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
# Use sex as the label column; strip stray whitespace so the mapping keys match
df['sex'] = df['sex'].str.strip().map({'Male': 0, 'Female': 1})
```
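The encoding above throws each fitted encoder away, so it is not obvious later which integer stands for which category. A minimal sketch of one way to keep that mapping, using a hypothetical toy column in place of `df['race']` (the variable names `race_encoder` and `race_mapping` are made up for this example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for df['race']; values are illustrative.
toy = pd.DataFrame({'race': ['White', 'Black', 'Asian-Pac-Islander', 'White']})

# Keep a reference to the fitted encoder instead of discarding it.
race_encoder = LabelEncoder()
toy['race_code'] = race_encoder.fit_transform(toy['race'])

# classes_ is sorted; each label's position in it is its integer code.
race_mapping = {label: code for code, label in enumerate(race_encoder.classes_)}
print(race_mapping)

# Encode a new raw value with the same fitted encoder.
new_code = race_encoder.transform(['Black'])[0]
print(new_code)
```

With the mapping at hand, integer inputs passed to the model at prediction time can be traced back to real category names.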
Next, split the data into training and test sets and train a logistic regression model:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
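# Split into training and test sets (80% / 20%)
X = df[['race', 'occupation', 'hours-per-week']]
y = df['sex']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model (a higher max_iter helps avoid convergence warnings on unscaled features)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
# Predict on the test set and compute accuracy
y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', accuracy)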
```
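One likely reason the score comes out low is that integer label codes impose an arbitrary order on race and occupation, which a linear model then treats as a numeric scale. The sketch below shows a common alternative, one-hot encoding inside a pipeline, together with a majority-class baseline for comparison. It is not the method used above, and it runs on synthetic data generated in the block (the category names and the rule linking occupation to the label are invented), so the printed numbers say nothing about adults.txt.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for adults.txt (NOT the real data).
rng = np.random.default_rng(0)
n = 1000
occupation = rng.choice(['Sales', 'Craft-repair', 'Adm-clerical'], size=n)
toy = pd.DataFrame({
    'race': rng.choice(['White', 'Black', 'Other'], size=n),
    'occupation': occupation,
    'hours-per-week': rng.integers(20, 61, size=n),
})
# Invented rule: one occupation maps to label 1, with 10% of labels flipped as noise.
is_clerical = occupation == 'Adm-clerical'
flipped = rng.random(n) < 0.1
toy['sex'] = np.where(is_clerical ^ flipped, 1, 0)

X = toy[['race', 'occupation', 'hours-per-week']]
y = toy['sex']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# One-hot encode the nominal columns and standardize the numeric one.
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['race', 'occupation']),
    ('scale', StandardScaler(), ['hours-per-week']),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Baseline that always predicts the most frequent class in the training set.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

model_acc = model.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)
print(f'one-hot logistic regression: {model_acc:.3f}')
print(f'majority-class baseline:     {baseline_acc:.3f}')
```

On the real data the same pipeline takes the raw string columns directly, so the LabelEncoder step is not needed for the features. Comparing against the baseline matters because an accuracy close to the share of the majority class means the model has learned little.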
Finally, use the trained model to predict on new data and print the results:
```python
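import pandas as pd
# New samples as integer codes in the training column order; a DataFrame keeps the feature names the model was fitted with
new_data = pd.DataFrame([[0, 1, 40], [1, 4, 50], [2, 7, 60]], columns=['race', 'occupation', 'hours-per-week'])
# Predict
y_pred = lr.predict(new_data)
# Print the results (label 1 = Female, 0 = Male)
for (race, occupation, hours), pred in zip(new_data.to_numpy(), y_pred):
    print('Race:', race, 'Occupation:', occupation, 'Hours per week:', hours, 'Prediction:', 'Female' if pred else 'Male')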
```
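Note that accuracy_score reports the fraction of test samples classified correctly, and predict returns the predicted class label for each row (0 for Male, 1 for Female).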
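Accuracy alone can be misleading when one class is much more common than the other. The sketch below uses small hand-written label lists (hypothetical, not model output) to show how a classifier that always predicts Male can still reach a decent accuracy while never identifying a single Female sample; `confusion_matrix` and `recall_score` make that visible.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 0 = Male, 1 = Female (7 male and 3 female samples).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
# A degenerate classifier that predicts Male for every sample.
y_hat = [0] * 10

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
# Share of true Female samples that were predicted as Female.
female_recall = recall_score(y_true, y_hat, pos_label=1)

print('accuracy:', acc)
print('confusion matrix:')
print(cm)
print('recall for Female:', female_recall)
```

Checking the confusion matrix on the real test set shows whether a low or mediocre score comes from the model collapsing onto the majority class.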