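Using two features from the Iris dataset, help me train a strictly linearly separable (hard-margin) SVM for a binary classification problem in Python with 5-fold cross-validation, and compute the classification accuracy. Then visualize the data points and the decision boundary, and mark the support vectors and the margin.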
Time: 2024-10-06 18:05:45 Views: 40
First, import the required libraries and load the Iris dataset. This example uses scikit-learn, which provides many machine learning algorithms and tools.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
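# Load the Iris dataset
iris = datasets.load_iris()
# Keep only setosa (0) and versicolor (1): a hard-margin SVM needs a binary problem
mask = iris.target < 2
X = iris.data[mask, :2]  # first two features: sepal length and sepal width
y = iris.target[mask]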
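# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the data so that both features are on the same scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear-kernel SVM; a very large C penalizes margin violations so heavily
# that the soft-margin solver approximates a hard-margin (strictly separable) SVM
svm = SVC(kernel='linear', C=1e10)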
# Estimate classification accuracy with 5-fold cross-validation
cv_scores = cross_val_score(svm, X_train_scaled, y_train, cv=5)
mean_cv_score = cv_scores.mean()
print(f"Mean 5-fold cross-validation accuracy: {mean_cv_score * 100:.2f}%")
# Train the model on the full training set
svm.fit(X_train_scaled, y_train)
# Predict on the test set and compute its accuracy
y_pred = svm.predict(X_test_scaled)
test_acc = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {test_acc * 100:.2f}%")
# Visualize the training points, decision boundary, support vectors and margin
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, cmap='viridis')
plt.title('Iris: two standardized features with the linear SVM boundary')
# The boundary satisfies w0*x0 + w1*x1 + b = 0, so x1 = -(w0*x0 + b) / w1
w = svm.coef_[0]
b = svm.intercept_[0]
xs = np.linspace(X_train_scaled[:, 0].min(), X_train_scaled[:, 0].max(), 100)
plt.plot(xs, -(w[0] * xs + b) / w[1], 'k-', linewidth=2)
plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
# Circle the support vectors
support_vectors = X_train_scaled[svm.support_]
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=80,
            facecolors='none', edgecolors='r')
# Draw the margin: the lines where the decision function equals +1 and -1
plt.plot(xs, -(w[0] * xs + b - 1) / w[1], 'k--', lw=2)
plt.plot(xs, -(w[0] * xs + b + 1) / w[1], 'k--', lw=2)
# Keep the view on the data even if the lines are steep
plt.ylim(X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5)
plt.show()
```
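As a small self-check on the geometry behind the boundary and margin lines, here is a minimal sketch in plain Python. For a 2-D linear SVM with decision function f(x) = w0*x0 + w1*x1 + b, the boundary is the line f(x) = 0, the margin lines are f(x) = +1 and f(x) = -1, and the margin width is 2 / ||w||. The helper `margin_geometry` is hypothetical, and the weights `(3, 4)` and bias `-5` are made-up numbers for illustration, not values taken from the fitted model.

```python
import math


def margin_geometry(w, b, x0):
    """Return (boundary, plus, minus, width) for a 2-D linear SVM at abscissa x0.

    w = (w0, w1) and b define f(x) = w0*x0 + w1*x1 + b.
    Hypothetical helper for illustration only; assumes w1 != 0.
    """
    w0, w1 = w
    boundary = -(w0 * x0 + b) / w1      # x1 where f(x) = 0
    plus = -(w0 * x0 + b - 1.0) / w1    # x1 where f(x) = +1
    minus = -(w0 * x0 + b + 1.0) / w1   # x1 where f(x) = -1
    width = 2.0 / math.hypot(w0, w1)    # perpendicular distance between margin lines
    return boundary, plus, minus, width


# Made-up weights, chosen so that ||w|| = 5
boundary, plus, minus, width = margin_geometry((3.0, 4.0), -5.0, x0=1.0)
print(boundary, plus, minus, width)
```

With a fitted scikit-learn model, the same formulas would presumably be fed `svm.coef_[0]` and `svm.intercept_[0]`. Note that the vertical gap between the two margin lines is 2 / |w1|, which is larger than the true margin width 2 / ||w|| unless the boundary is horizontal.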