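Using two features of the Iris dataset, help me implement a binary linearly separable SVM in Python with 5-fold cross-validation and compute the classification accuracy. Then visualize the data points and the decision boundary, and mark the support vectors and the margin.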
Posted: 2024-10-06 13:05:48
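In Python, the scikit-learn library can load the Iris dataset and fit a binary linear support vector machine (SVM) on it. This answer uses `sklearn.model_selection.KFold` for 5-fold cross-validation, `sklearn.svm.SVC` to build the SVM model, `accuracy_score` to compute the accuracy, and `matplotlib` for visualization.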
First, make sure the required libraries are installed:
```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```
Then follow these steps:
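1. Import the libraries and load the dataset. Iris has three classes, so keep two of them (here classes 0 and 1) for binary classification:
```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns

iris = datasets.load_iris()
mask = iris.target < 2       # keep classes 0 (setosa) and 1 (versicolor)
X = iris.data[mask, :2]      # first two features: sepal length and sepal width
y = iris.target[mask]
```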
2. Preprocess the data (standardization):
```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
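Fitting the scaler on the full dataset, as above, lets the held-out folds influence the scaling statistics. A minimal sketch of one way to avoid that is to wrap the scaler and the classifier in a scikit-learn pipeline, so the scaler is re-fitted on the training part of each fold. The variable names (`X_bin`, `y_bin`, `pipeline_scores`) and the choice of classes 0 and 1 are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target < 2               # assumption: use classes 0 and 1 only
X_bin = iris.data[mask, :2]          # first two features
y_bin = iris.target[mask]

# StandardScaler is fitted inside each fold, on that fold's training data only
pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(pipeline, X_bin, y_bin, cv=cv)

print("Fold accuracies:", pipeline_scores)
print("Mean accuracy:", pipeline_scores.mean())
```

For a dataset this small the difference from scaling everything up front is likely negligible, but the pipeline form keeps the cross-validation estimate free of leakage.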
3. Split the data into training and test sets, and set up 5-fold cross-validation:
```python
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
```
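4. Build the binary linear SVM and compute the cross-validation scores and their mean:
```python
svm_model = SVC(kernel='linear', C=1)  # linear kernel; a larger C approximates a hard margin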
scores = cross_val_score(svm_model, X_train, y_train, cv=kfold)
accuracy = scores.mean()
print("Cross-validation accuracy:", scores)
print("Mean accuracy:", accuracy)
```
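`cross_val_score` hides the per-fold loop. To make explicit what 5-fold cross-validation does, here is a small self-contained sketch that iterates over `KFold.split` by hand and scores each fold with `accuracy_score`. The names `X_demo`, `y_demo`, and `fold_acc` are introduced only for this example, and restricting the data to classes 0 and 1 is an assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target < 2               # assumption: use classes 0 and 1 only
X_demo = iris.data[mask, :2]
y_demo = iris.target[mask]

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_acc = []
for train_idx, val_idx in kf.split(X_demo):
    # Train on four folds, evaluate on the remaining one
    clf = SVC(kernel='linear', C=1)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_val_pred = clf.predict(X_demo[val_idx])
    fold_acc.append(accuracy_score(y_demo[val_idx], y_val_pred))

print("Per-fold accuracy:", np.round(fold_acc, 3))
print("Mean accuracy:", np.mean(fold_acc))
```

Each sample is used for validation exactly once, so the mean of the five fold accuracies is the cross-validated accuracy estimate.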
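5. Train the model, predict on the test set, and compute the test accuracy:
```python
from sklearn.metrics import accuracy_score

svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```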
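6. Visualize the data points, the decision boundary, the margin, and the support vectors:
```python
def plot_data_and_boundary(X, y, model, title):
    """Plot a fitted binary linear SVC; assumes y contains exactly two classes."""
    # Evaluate the decision function on a grid covering the data
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500), np.linspace(y_min, y_max, 500))
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Shade the margin band, then draw the boundary (0) and the margin lines (-1, +1)
    plt.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, cmap='RdBu_r')
    plt.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])
    # One scatter call per class so that the legend gets proper labels
    for label, color in zip(model.classes_, ['red', 'blue']):
        plt.scatter(X[y == label, 0], X[y == label, 1], c=color, s=50, label=f'Class {label}')
    # Circle the support vectors
    sv = model.support_vectors_
    plt.scatter(sv[:, 0], sv[:, 1], s=150, facecolors='none', edgecolors='k', label='Support vectors')
    plt.xlabel('Feature 1 (standardized)')
    plt.ylabel('Feature 2 (standardized)')
    plt.title(title)
    plt.legend(loc='upper right')

plt.figure(figsize=(8, 6))
plot_data_and_boundary(X_scaled, y, svm_model, 'Iris Dataset with SVM Linear Classifier and Support Vectors')

# Support vectors found during training
support_vectors = svm_model.support_vectors_
print("\nSupport vectors:")
print(support_vectors)
# For a linear SVM, each margin line lies at distance 1 / ||w|| from the hyperplane
w = svm_model.coef_[0]
margin = 1 / np.linalg.norm(w)
print(f"\nMargin (hyperplane to margin line): {margin}")
print(f"Margin width (between the two margin lines): {2 * margin}")
plt.show()
```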
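Running this produces a two-dimensional plot of the data points together with the linear decision boundary found by the SVM. The support vectors are the training points that lie on or inside the margin; they alone determine the boundary. The margin is the distance from the hyperplane to each margin line (the nearest points of each class in the separable case), which equals 1/||w|| for a linear SVM, so the band between the two margin lines is 2/||w|| wide.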
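The relationship between the weight vector and the margin can be checked numerically. The sketch below fits a linear `SVC` on two Iris classes, rebuilds the decision function from `coef_` and `intercept_`, and derives the margin as 1/||w||. The names `X_two`, `y_two`, `manual_scores`, and `sv_dist` are made up for this example, and the data is left unscaled for brevity.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target < 2               # assumption: use classes 0 and 1 only
X_two = iris.data[mask, :2]
y_two = iris.target[mask]

clf = SVC(kernel='linear', C=1).fit(X_two, y_two)
w, b = clf.coef_[0], clf.intercept_[0]

# For a linear kernel, decision_function(x) is w . x + b
manual_scores = clf.support_vectors_ @ w + b
library_scores = clf.decision_function(clf.support_vectors_)

# Distance from the hyperplane to each margin line, and to each support vector
margin = 1 / np.linalg.norm(w)
sv_dist = np.abs(manual_scores) / np.linalg.norm(w)

print("Margin 1/||w||:", margin)
print("Support vector distances to the hyperplane:", np.round(sv_dist, 3))
```

With a soft margin (finite `C`), some support vectors may sit inside the margin band, so their distances can be smaller than 1/||w||; those lying exactly on a margin line have a distance equal to it.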