Python code: use a decision-tree classifier to predict on a bank-customer default dataset, prune the model to obtain the best parameters, and plot the pruned model's learning curve. How does it differ from the unpruned case?
In Python, we can train a decision-tree model with sklearn's `DecisionTreeClassifier` and search its parameters with `GridSearchCV`. Constraining parameters such as `max_depth` or `min_samples_leaf` amounts to pre-pruning; sklearn also supports post-pruning via cost-complexity pruning (`ccp_alpha`). The basic steps are:
1. Import the necessary libraries:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
```
2. Load and preprocess the data:
```python
# Load the bank-customer default data
data = pd.read_csv('bank_data.csv')
X = data.drop('default', axis=1)  # features
y = data['default']               # target variable

# Standardize the features (not strictly required for trees, but harmless)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
```
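If `bank_data.csv` is not at hand, a synthetic stand-in can be generated first so the rest of the code runs end to end. The column names below are purely illustrative assumptions, not the real dataset's schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical bank-customer features; names are illustrative only
data = pd.DataFrame({
    'age': rng.integers(18, 70, n),
    'income': rng.normal(50_000, 15_000, n),
    'balance': rng.normal(1_000, 500, n),
})

# Toy rule: default is more likely for low income and low balance, plus noise
data['default'] = (((data['income'] < 45_000) & (data['balance'] < 900))
                   | (rng.random(n) < 0.05)).astype(int)

data.to_csv('bank_data.csv', index=False)
```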
3. Create the decision-tree model and run a grid search (including pre-pruning parameters such as `max_depth` and `min_samples_leaf`):
```python
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1, 11),
    'min_samples_leaf': [1, 5, 10],
}
tree_model = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(tree_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
```
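The grid above performs pre-pruning. For the post-pruning mentioned earlier, scikit-learn exposes cost-complexity pruning through `ccp_alpha`; a minimal sketch on synthetic data (a stand-in for the bank dataset) could search the pruning path like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank default data
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate alphas come from the cost-complexity pruning path of a full tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = np.unique(np.clip(path.ccp_alphas, 0, None))  # clip tiny negatives from floating point

# Treat ccp_alpha as just another hyperparameter in the grid search
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    {'ccp_alpha': ccp_alphas}, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
```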
4. Train the decision tree with the best parameters and predict:
```python
best_tree = DecisionTreeClassifier(**best_params, random_state=42)
best_tree.fit(X_train, y_train)
y_pred = best_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Best parameters: {best_params}, Accuracy: {accuracy}")
```
5. Plot the learning curve (assuming you have the full training and validation data):
```python
def plot_learning_curve(model, X, y, title, ylim=None, cv=None):
    train_sizes, train_scores, validation_scores = learning_curve(
        model, X, y, cv=cv, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 5),
        scoring='accuracy'
    )
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    validation_mean = np.mean(validation_scores, axis=1)
    validation_std = np.std(validation_scores, axis=1)
    plt.plot(train_sizes, train_mean, label='Training accuracy')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15)
    plt.plot(train_sizes, validation_mean, label='Validation accuracy')
    plt.fill_between(train_sizes, validation_mean - validation_std, validation_mean + validation_std, alpha=0.15)
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Accuracy")
    plt.legend(loc="lower right")
    plt.tight_layout()

plot_learning_curve(best_tree, X_scaled, y, "Learning Curve with Pruned Tree", ylim=(0.7, 1.0))
plt.show()
```
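To see the contrast the question asks about without plotting, `learning_curve` can be run for both an unpruned and a depth-limited tree. The snippet below is a self-contained sketch on synthetic data, not results from the bank dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data (stand-in for the bank dataset)
X, y = make_classification(n_samples=800, n_features=15, n_informative=5,
                           flip_y=0.1, random_state=1)

def train_val_gap(model):
    """Gap between mean training and validation accuracy at the largest train size."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 4),
        scoring='accuracy')
    return train_scores[-1].mean() - val_scores[-1].mean()

gap_unpruned = train_val_gap(DecisionTreeClassifier(random_state=1))
gap_pruned = train_val_gap(DecisionTreeClassifier(max_depth=4, random_state=1))
print(gap_unpruned, gap_pruned)
```

The unpruned tree's gap stays large (its training accuracy saturates near 1.0 on noisy labels), while the depth-limited tree's curves sit much closer together.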
**Differences:**
- An **unpruned model** usually grows a deep tree and tends to overfit: it performs very well on the training data but generalizes poorly to new data.
- A **pruned model** limits the tree's complexity, which reduces the risk of overfitting: the gap between training and validation error shrinks, the learning curve is smoother, and adding more training samples beyond a certain point yields comparatively smaller gains.
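Both bullet points can be checked directly: on noisy data an unpruned tree grows deep and memorizes the training set, while a pruned one trades a little training accuracy for a much smaller generalization gap. A sketch on synthetic data (not the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

# Unpruned trees grow much deeper than pruned ones
print(unpruned.get_depth(), pruned.get_depth())

# The train/test accuracy gap shrinks after pruning
gap_unpruned = unpruned.score(X_train, y_train) - unpruned.score(X_test, y_test)
gap_pruned = pruned.score(X_train, y_train) - pruned.score(X_test, y_test)
print(gap_unpruned, gap_pruned)
```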