Compare the metrics of KNN, naive Bayes, and decision-tree classifiers on the imbalanced data of Section 1.1, plot the ROC curves of the three classifiers, and generate the imbalanced samples automatically instead of importing data
Date: 2023-12-14 21:38:06 · Views: 36
For imbalanced data, several approaches are available:
1. Oversampling: increase the number of minority-class samples so the class ratio becomes more balanced.
2. Undersampling: reduce the number of majority-class samples so the class ratio becomes more balanced.
3. Ensemble learning: combine the outputs of several classifiers to obtain a more accurate result.
Here we use oversampling to handle the imbalance, specifically the SMOTE algorithm, which generates new synthetic minority samples.
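SMOTE's core step can be sketched in a few lines (the two points below are invented for illustration): a synthetic sample is a random interpolation between a minority point and one of its minority-class nearest neighbors, rather than a plain copy:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])         # a minority sample (toy values)
neighbor = np.array([3.0, 4.0])  # one of its k nearest minority neighbors
lam = rng.uniform()              # random interpolation factor in [0, 1)
# the synthetic point lies on the segment between x and its neighbor
synthetic = x + lam * (neighbor - x)
print(synthetic)
```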
First, we generate an imbalanced dataset:
```python
from collections import Counter
from sklearn.datasets import make_classification
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=10000, random_state=10)
print('Original dataset shape %s' % Counter(y))
```
Output:
```
Original dataset shape Counter({1: 9000, 0: 1000})
```
As the output shows, the imbalance is severe: the minority class has only 1,000 samples.
Next, we use SMOTE to generate synthetic samples:
```python
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
```
Output:
```
Resampled dataset shape Counter({1: 9000, 0: 9000})
```
The class ratio is now balanced.
Next, we train KNN, naive Bayes, and decision-tree classifiers and compare their classification metrics on the Section 1.1 imbalanced data:
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
classifiers = [KNeighborsClassifier(), GaussianNB(), DecisionTreeClassifier()]
# split once, outside the loop, so all classifiers see the same data
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42)
for clf in classifiers:
    clf_name = clf.__class__.__name__
    print("=" * 30)
    print(clf_name)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
    print("Precision: {:.2f}%".format(precision_score(y_test, y_pred) * 100))
    print("Recall: {:.2f}%".format(recall_score(y_test, y_pred) * 100))
    print("F1 Score: {:.2f}%".format(f1_score(y_test, y_pred) * 100))
```
Output:
```
==============================
KNeighborsClassifier
Accuracy: 81.26%
Precision: 79.63%
Recall: 84.22%
F1 Score: 81.86%
==============================
GaussianNB
Accuracy: 73.45%
Precision: 70.47%
Recall: 81.48%
F1 Score: 75.54%
==============================
DecisionTreeClassifier
Accuracy: 87.12%
Precision: 87.09%
Recall: 87.15%
F1 Score: 87.12%
```
KNN and the decision tree outperform naive Bayes, with the decision tree performing best overall.
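All four metrics above derive from the confusion matrix. A small hand-made example (the labels below are invented, not taken from this run) makes the relationship explicit:

```python
from sklearn.metrics import confusion_matrix

# toy ground truth and predictions, purely to illustrate the formulas
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
# sklearn's 2x2 confusion matrix flattens to (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision = TP/(TP+FP) = {tp / (tp + fp):.2f}")
print(f"recall    = TP/(TP+FN) = {tp / (tp + fn):.2f}")
```

On imbalanced data, inspecting these four counts directly is often more informative than accuracy alone.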
Next, we plot the ROC curves of the three classifiers:
```python
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42)
for clf in classifiers:
    clf_name = clf.__class__.__name__
    clf.fit(X_train, y_train)
    y_pred_proba = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label='{} ROC curve (area = {:.2f})'.format(clf_name, roc_auc))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
```
Output:
![image-20211019153621619](./images/image-20211019153621619.png)
All three classifiers produce reasonable ROC curves; the decision tree's curve lies closest to the top-left corner, indicating the best performance.
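As a sanity check on the curve-based AUC above, `sklearn.metrics.roc_auc_score` computes the same area directly from the scores, without building the curve first. The labels and scores below are toy values for illustration:

```python
from sklearn.metrics import roc_auc_score

# toy labels and predicted scores, just to show the direct AUC call
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, scores))  # → 0.75
```

Equivalently, the AUC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative: here 3 of the 4 positive/negative pairs are ranked correctly, giving 0.75.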