Please provide the implementation code for the "Course Content" section of the file
Posted: 2024-10-29 16:18:16
Based on the "Course Content" section of the file, the implementation code for each task is given below:
### 1. Write a hold-out train/test split function `my_split(data, ratio)` that returns the training and test sets (without using `sklearn`).
```python
import numpy as np

def my_split(data, ratio):
    """Hold-out split: shuffle, then take the first `ratio` fraction as the training set."""
    data = np.asarray(data).copy()  # copy so the caller's array is not shuffled in place
    np.random.shuffle(data)
    split_idx = int(len(data) * ratio)
    train_set = data[:split_idx]
    test_set = data[split_idx:]
    return train_set, test_set
```
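A quick sanity check of the split sizes (the function is repeated here so the snippet runs on its own): with 10 rows and `ratio=0.8` we expect an 8/2 split, and the two parts together should contain exactly the original rows.

```python
import numpy as np

def my_split(data, ratio):
    # same logic as above, reproduced so this snippet is standalone
    data = np.asarray(data).copy()
    np.random.shuffle(data)
    split_idx = int(len(data) * ratio)
    return data[:split_idx], data[split_idx:]

data = np.arange(20).reshape(10, 2)  # 10 rows of toy data
train, test = my_split(data, 0.8)
print(train.shape, test.shape)  # (8, 2) (2, 2)
```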
### 2. Write a classification metrics function `my_metrics(y, y_predict)` that returns accuracy, precision, recall, and F1 score (without using `sklearn`).
```python
def my_metrics(y, y_predict):
    y = np.asarray(y)          # allow plain lists as input
    y_predict = np.asarray(y_predict)
    tp = np.sum((y == 1) & (y_predict == 1))  # true positives
    tn = np.sum((y == 0) & (y_predict == 0))  # true negatives
    fp = np.sum((y == 0) & (y_predict == 1))  # false positives
    fn = np.sum((y == 1) & (y_predict == 0))  # false negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return accuracy, precision, recall, f1_score
```
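A worked example helps verify the counting logic (the function is repeated so the snippet is standalone). With the labels below there are 2 true positives, 1 true negative, 1 false positive, and 1 false negative, so accuracy is 3/5 and precision, recall, and F1 all equal 2/3.

```python
import numpy as np

def my_metrics(y, y_predict):
    # same logic as above, reproduced so this snippet is standalone
    y, y_predict = np.asarray(y), np.asarray(y_predict)
    tp = np.sum((y == 1) & (y_predict == 1))
    tn = np.sum((y == 0) & (y_predict == 0))
    fp = np.sum((y == 0) & (y_predict == 1))
    fn = np.sum((y == 1) & (y_predict == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return accuracy, precision, recall, f1

# tp = 2 (indices 0, 3), tn = 1 (index 1), fp = 1 (index 4), fn = 1 (index 2)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc, prec, rec, f1 = my_metrics(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.6, then 2/3 for the other three
```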
### 3. Iris classification exercise
#### (1) Load the iris dataset and construct the classification inputs X and targets y
```python
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
```
#### (2) Randomly split the dataset into training and test sets
```python
train_data, test_data = my_split(np.column_stack((X, y)), 0.8)
X_train, y_train = train_data[:, :-1], train_data[:, -1]
X_test, y_test = test_data[:, :-1], test_data[:, -1]
```
#### (3) Standardize the training and test sets
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
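The key point in the step above is that `fit_transform` learns the mean and standard deviation from the training set only, while `transform` reuses those same statistics on the test set. A small sketch on synthetic data (not the iris split above) makes this visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # toy "training" features
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))    # toy "test" features

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(train_scaled.mean(axis=0).round(6))  # ~[0, 0, 0]
print(train_scaled.std(axis=0).round(6))   # ~[1, 1, 1]
```

The test set's scaled mean is generally *not* exactly zero, which is correct: scaling it with its own statistics would leak test-set information into preprocessing.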
#### (4) Create a `LogisticRegression` model from `sklearn.linear_model` and fit it on the training set
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)
model.fit(X_train_scaled, y_train)
```
#### (5) Use the trained model to predict on the test inputs, obtaining the test-set predictions
```python
y_pred = model.predict(X_test_scaled)
```
#### (6) Evaluate the test predictions with `sklearn.metrics` using accuracy, precision, recall, and F1 score
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
```
#### (7) Compute the confusion matrix and visualize it
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()
```
#### (8) Compare model performance without standardization
```python
model_unscaled = LogisticRegression(max_iter=10000)
model_unscaled.fit(X_train, y_train)
y_pred_unscaled = model_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
precision_unscaled = precision_score(y_test, y_pred_unscaled, average='weighted')
recall_unscaled = recall_score(y_test, y_pred_unscaled, average='weighted')
f1_unscaled = f1_score(y_test, y_pred_unscaled, average='weighted')
print(f"Unscaled Accuracy: {accuracy_unscaled}")
print(f"Unscaled Precision: {precision_unscaled}")
print(f"Unscaled Recall: {recall_unscaled}")
print(f"Unscaled F1 Score: {f1_unscaled}")
```
### 4. Linear regression exercise
#### (1) Create a simple univariate regression sample: `y = 3x + 4 + noise`
```python
np.random.seed(0)
X = np.random.rand(100, 1)
noise = np.random.randn(100, 1) * 0.1
y = 3 * X + 4 + noise
```
#### (2) To vectorize the linear model \( y = w_0 + w_1 x \) as \( \hat{y} = \mathbf{w}^T \mathbf{x} \), write the bias \( w_0 \) as its product with a constant feature \( x_0 = 1 \)
```python
X_b = np.c_[np.ones((100, 1)), X]
```
#### (3) Compute \( w \) in closed form via the normal equation \( \hat{w} = (X_b^T X_b)^{-1} X_b^T y \), obtaining the bias \( w_0 \) and weight \( w_1 \)
```python
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
w0, w1 = theta_best[0][0], theta_best[1][0]
print(f"w0: {w0}, w1: {w1}")
```
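The closed-form result can be cross-checked against NumPy's least-squares solver `np.linalg.lstsq` (a sketch that regenerates the same seeded data so it runs standalone); both should land near the true parameters w0 = 4 and w1 = 3:

```python
import numpy as np

# regenerate the sample from step (1) with the same seed
np.random.seed(0)
X = np.random.rand(100, 1)
noise = np.random.randn(100, 1) * 0.1
y = 3 * X + 4 + noise
X_b = np.c_[np.ones((100, 1)), X]

theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # normal equation
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)           # least-squares solver

print(theta_normal.ravel())                    # close to [4, 3]
print(np.allclose(theta_normal, theta_lstsq))  # True
```

In practice `lstsq` (or `np.linalg.pinv`) is preferred over explicitly inverting \( X_b^T X_b \), which can be numerically unstable for ill-conditioned data.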
#### (4) Implement gradient descent to find \( w \)
```python
learning_rate = 0.1
n_iterations = 1000
m = 100  # number of samples

theta = np.random.randn(2, 1)  # random initialization
for iteration in range(n_iterations):
    # gradient of the MSE cost: (2/m) * X^T (X·theta - y)
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients

w0_gd, w1_gd = theta[0][0], theta[1][0]
print(f"w0 (Gradient Descent): {w0_gd}, w1 (Gradient Descent): {w1_gd}")
```
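With these settings gradient descent should converge to the same optimum as the normal equation. A standalone sketch of that check (regenerating the seeded data from step (1)):

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 1)
noise = np.random.randn(100, 1) * 0.1
y = 3 * X + 4 + noise
X_b = np.c_[np.ones((100, 1)), X]
m = 100

# closed-form solution as the reference
theta_closed = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# batch gradient descent with the same settings as above
theta = np.random.randn(2, 1)
for _ in range(1000):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - 0.1 * gradients

print(np.allclose(theta, theta_closed, atol=1e-3))  # True
```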
#### (5) Study how different learning rates affect convergence speed, with a brief discussion aided by visualization
```python
learning_rates = [0.01, 0.1, 0.5]
for lr in learning_rates:
    theta = np.random.randn(2, 1)
    costs = []
    for iteration in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - lr * gradients
        cost = np.mean((X_b.dot(theta) - y) ** 2)  # MSE after this update
        costs.append(cost)
    plt.plot(range(n_iterations), costs, label=f'Learning Rate: {lr}')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.legend()
plt.title('Convergence of Gradient Descent with Different Learning Rates')
plt.show()
```
These snippets implement all the tasks described in the file, covering the full pipeline from data splitting to model training and evaluation. Hope they help!