Data mining experiment source code for the adult dataset
Date: 2023-10-01 09:12:35
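The adult dataset poses a binary classification problem: predict whether a person's annual income exceeds 50K US dollars. The dataset contains 14 attributes, including age, work class, education level, marital status, race, sex, and hours worked per week. Below is example source code for mining the adult dataset.
First, we import the necessary libraries and load the dataset: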
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic for inline plots; remove this line when running as a plain script
%matplotlib inline
# Load the dataset; fields are separated by a comma followed by whitespace
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adult_data = pd.read_csv(url, header=None, sep=r',\s*', engine='python')
adult_data.columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income',
]
```
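To check the parsing options without downloading the file, the following minimal sketch reads two rows from an in-memory string. The second row is made up for illustration, and `na_values='?'` is an optional alternative to replacing `?` later; it is not part of the original code.

```python
import io
import pandas as pd

# Illustrative rows in the adult.data layout: comma + space separated, '?' marks a missing value.
# The second row is hypothetical and only serves to show how '?' is parsed.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income',
]
# The regex separator strips the space after each comma; na_values turns '?' into NaN on load
df = pd.read_csv(sample, header=None, names=columns, sep=r',\s*',
                 engine='python', na_values='?')

missing_per_column = df.isna().sum()
print(df.shape)
print(missing_per_column[missing_per_column > 0])
```

If `?` is converted to NaN at load time like this, a later `dropna()` is enough to remove the incomplete rows.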
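Next, we preprocess the data: the categorical attributes need to be encoded and the missing values handled. Missing values in this dataset are marked with `?`.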
```python
# Handle missing values first: the dataset marks them with '?'. This must happen
# before encoding, otherwise '?' is encoded as just another category and never removed
adult_data = adult_data.replace('?', np.nan)
adult_data = adult_data.dropna()
# Encode the categorical attributes (and the income label) as integers
from sklearn.preprocessing import LabelEncoder
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation',
                       'relationship', 'race', 'sex', 'native-country', 'income']
for column in categorical_columns:
    # A fresh encoder per column maps that column's labels to 0..n-1
    adult_data[column] = LabelEncoder().fit_transform(adult_data[column])
```
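The preprocessing order matters, and it can be useful to keep the fitted encoders so the integer codes can be mapped back to labels. The sketch below illustrates both points on a small toy frame; the `toy` data and the `encoders` dictionary are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for a few adult columns (values are illustrative only)
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov', 'Private'],
    'sex': ['Male', 'Female', 'Female', 'Male'],
    'age': [39, 50, 38, 53],
})

# Step 1: mark '?' as missing and drop those rows before any encoding
toy = toy.replace('?', np.nan).dropna().reset_index(drop=True)

# Step 2: fit one encoder per categorical column and keep it for later decoding
encoders = {}
for col in ['workclass', 'sex']:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# The stored encoder recovers the original labels from the integer codes
decoded = encoders['workclass'].inverse_transform(toy['workclass'])
print(toy)
print(list(decoded))
```

Keeping one encoder per column avoids the situation where a single reused `LabelEncoder` only remembers the classes of the last column it was fitted on.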
Next, we split the dataset into a training set and a test set and perform feature selection.
```python
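# Split the dataset into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x = adult_data.drop('income', axis=1)
y = adult_data['income']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Feature selection: keep the 10 features with the highest chi-squared scores
# (chi2 requires non-negative feature values, which holds for this encoded data)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=10)
# Fit on the training set only, then apply the same selection to both sets
selector.fit(x_train, y_train)
x_train = selector.transform(x_train)
x_test = selector.transform(x_test)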
```
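By default `transform` returns a plain NumPy array, so the column names of the selected features are lost. A minimal sketch of how to see which features were kept, using `get_support()` and `scores_` on synthetic data (the `features` frame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data: one column depends strongly on the target, two are pure noise
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
features = pd.DataFrame({
    'informative': target * 5 + rng.integers(0, 2, size=n),
    'noise_a': rng.integers(0, 10, size=n),
    'noise_b': rng.integers(0, 10, size=n),
})

selector = SelectKBest(chi2, k=1).fit(features, target)

# get_support() returns a boolean mask over the input columns
kept = features.columns[selector.get_support()].tolist()
# scores_ holds the chi-squared statistic of every input column
scores = pd.Series(selector.scores_, index=features.columns).sort_values(ascending=False)
print(kept)
print(scores)
```

On the adult data the same two calls would be applied to the `selector` fitted on the training set, using the column names of the frame it was fitted on.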
Next, we train a logistic regression model and use it to make predictions.
```python
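# Train and predict
from sklearn.linear_model import LogisticRegression
# max_iter is raised because the features are unscaled and the default
# 100 iterations may not be enough for the solver to converge
classifier = LogisticRegression(max_iter=1000)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
# Compute the accuracy and the confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print('Accuracy:', accuracy)
print('Confusion Matrix:\n', cm)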
```
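Accuracy alone can look good when one class dominates, and in this dataset the `<=50K` class is the clear majority. As a rough sketch, precision and recall for the positive class can be read directly from the confusion matrix; the labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 stands for '<=50K', 1 for '>50K' (illustrative values only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print('accuracy:', accuracy)
print('precision:', precision)
print('recall:', recall)
```

Here the accuracy is 0.7 while the recall for the positive class is only 0.5, which is the kind of gap the confusion matrix makes visible.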
Finally, we tune the model's hyperparameters and evaluate its performance.
```python
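# Hyperparameter tuning with a grid search
from sklearn.model_selection import GridSearchCV
# The default 'lbfgs' solver does not support the l1 penalty,
# so the grid fixes the solver to 'liblinear', which supports both l1 and l2
parameters = [{'C': [0.1, 1, 10, 100], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}]
grid_search = GridSearchCV(estimator=classifier, param_grid=parameters, scoring='accuracy', cv=10, n_jobs=-1)
grid_search = grid_search.fit(x_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print('Best Accuracy:', best_accuracy)
print('Best Parameters:', best_parameters)
# Performance evaluation: score the tuned model on the held-out test set
from sklearn.metrics import classification_report
y_pred_tuned = grid_search.best_estimator_.predict(x_test)
print(classification_report(y_test, y_pred_tuned))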
```
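Note that `best_score_` is the mean cross-validated accuracy on the training folds, not the accuracy on the test set. The following self-contained sketch, on synthetic data from `make_classification` rather than the adult data, shows how the two numbers can be obtained side by side; the grid values and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 'liblinear' supports both penalties searched over in the grid
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters
tuned_pred = search.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, tuned_pred)

print('Cross-validated accuracy:', search.best_score_)
print('Test accuracy:', test_accuracy)
print('Best parameters:', search.best_params_)
```

Reporting the test accuracy of the refit model gives an estimate that was not used to pick the hyperparameters.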
That concludes the source code example for data mining on the adult dataset.