Load the iris dataset (there are three species; encode them as {0, 1, 2}), standardize the features, fit a model and plot the decision regions for a binary classification problem and for a multi-class problem, explain the meaning of each parameter in every statement, explore how different regularization methods and different C values affect the weights, and interpret the resulting plots.
Posted: 2023-06-11 15:10:13
First, import the required libraries and load the iris dataset:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from mlxtend.plotting import plot_decision_regions
# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target
```
Note that `load_iris` already encodes the three species (setosa, versicolor, virginica) as {0, 1, 2}, so no re-encoding is needed; we can simply verify it:
```python
# Labels are already encoded as {0, 1, 2} by load_iris
print(np.unique(y))  # [0 1 2]
```
Next, standardize the features:
```python
# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)
```
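Standardization can be sanity-checked directly: after `StandardScaler`, every feature column should have mean 0 and unit variance. A minimal self-contained sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
# fit_transform learns each column's mean and std, then rescales to z-scores
X_std = StandardScaler().fit_transform(X)

print(np.allclose(X_std.mean(axis=0), 0))  # True
print(np.allclose(X_std.std(axis=0), 1))   # True
```

This matters for regularized logistic regression: the penalty treats all weights equally, so features on very different scales would otherwise be penalized unevenly.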
For the binary classification problem, we use sepal length and petal length (columns 0 and 2) as the two features, and merge classes 1 and 2 so that the labels become binary; then we fit the model and plot the decision regions:
```python
# Binary classification
X_bin = X[:, [0, 2]]
y_bin = y.copy()
y_bin[y_bin == 2] = 1
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin, test_size=0.3, random_state=42)
# Fit model
clf_bin = LogisticRegression(C=1.0, penalty='l2', solver='liblinear')
clf_bin.fit(X_train, y_train)
# Predict
y_pred_bin = clf_bin.predict(X_test)
print('Binary Classification Accuracy:', accuracy_score(y_test, y_pred_bin))
# Plot decision regions
plot_decision_regions(X=X_bin, y=y_bin, clf=clf_bin, legend=2)
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Petal length (standardized)')
plt.title('Logistic Regression - Decision Region (Binary Classification)')
plt.show()
```
For the multi-class problem, we again use sepal length and petal length as features, fit the model on all three classes, and plot the decision regions:
```python
# Multi-class classification
X_multi = X[:, [0, 2]]
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_multi, y, test_size=0.3, random_state=42)
# Fit model
clf_multi = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs', multi_class='auto')
clf_multi.fit(X_train, y_train)
# Predict
y_pred_multi = clf_multi.predict(X_test)
print('Multi-Class Classification Accuracy:', accuracy_score(y_test, y_pred_multi))
# Plot decision regions
plot_decision_regions(X=X_multi, y=y, clf=clf_multi, legend=3)
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Petal length (standardized)')
plt.title('Logistic Regression - Decision Region (Multi-Class Classification)')
plt.show()
```
The parameters used above have the following meanings:
- `X`: feature matrix
- `y`: label array
- `StandardScaler`: class that standardizes features to zero mean and unit variance
- `LogisticRegression`: logistic regression model class
- `train_test_split`: function that splits the dataset into training and test sets (`test_size=0.3` reserves 30% for testing; `random_state=42` makes the split reproducible)
- `accuracy_score`: function that computes classification accuracy
- `plot_decision_regions`: mlxtend function that plots a classifier's decision regions
- `C`: inverse regularization strength; smaller values mean stronger regularization
- `penalty`: regularization method, L1 (`'l1'`) or L2 (`'l2'`)
- `solver`: optimization algorithm; `liblinear` supports both L1 and L2, while `lbfgs` supports L2 only
- `multi_class`: multi-class strategy, one-vs-rest (`'ovr'`) or `'multinomial'`; `'auto'` chooses based on the solver
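As a concrete illustration of the `C` parameter (a minimal sklearn-only sketch, with the same two standardized features as above): smaller C means a stronger L2 penalty, which shrinks the weights toward zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)[:, [0, 2]]
y = iris.target

norms = []
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, penalty='l2', solver='lbfgs', max_iter=1000)
    clf.fit(X, y)
    # Overall magnitude of the fitted weight matrix
    norms.append(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: ||coef|| = {norms[-1]:.3f}")
```

The printed norm grows as C increases, because a larger C weakens the penalty and lets the weights grow to fit the training data more closely.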
To explore the effect of regularization and C on the weights, we can vary these parameters and observe how the weights and the decision regions change. For example, switching to L1 regularization with C = 0.1 (note that L1 requires an L1-capable solver such as `saga` or `liblinear`):
```python
# Fit model
clf_multi = LogisticRegression(C=0.1, penalty='l1', solver='saga', multi_class='auto')
clf_multi.fit(X_train, y_train)
# Predict
y_pred_multi = clf_multi.predict(X_test)
print('Multi-Class Classification Accuracy:', accuracy_score(y_test, y_pred_multi))
# Plot decision regions
plot_decision_regions(X=X_multi, y=y, clf=clf_multi, legend=3)
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Petal length (standardized)')
plt.title('Logistic Regression - Decision Region (Multi-Class Classification)')
plt.show()
# Print weights
print('Weights:', clf_multi.coef_)
```
As the printed weights show, L1 regularization makes the weight vector sparse: some coefficients are driven exactly to zero, so the model effectively ignores those features. The C value controls the weight magnitudes: smaller C means stronger regularization and smaller weights. In the resulting plot, the stronger penalty produces simpler, more nearly linear boundaries between the class regions.
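The sparsity claim can be checked directly by counting zero coefficients under each penalty at the same (strong) regularization level; a minimal sketch using all four standardized features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # all four features
y = iris.target

zeros = {}
for penalty in ['l1', 'l2']:
    clf = LogisticRegression(C=0.01, penalty=penalty, solver='saga',
                             max_iter=10000)
    clf.fit(X, y)
    # Count coefficients that the penalty drove exactly to zero
    zeros[penalty] = int(np.sum(clf.coef_ == 0))
    print(f"{penalty}: zero weights = {zeros[penalty]} / {clf.coef_.size}")
```

With a small C, the L1 model zeroes out a substantial fraction of its weights, while the L2 model only shrinks them: every L2 coefficient stays nonzero.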