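Randomly split the training set into five subsets; use each group of four subsets as the training set and the remaining one as an internal test set. Following the idea of cross-validation, train five random forest regression models in this way, output each model's feature importances, and visualize the feature importance ranking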
Posted: 2024-06-01 12:12:05
The following code implements this with scikit-learn's `KFold` and `RandomForestRegressor`:
```python
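import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Load the dataset
data = pd.read_csv('data.csv')

# Separate the features (independent variables) from the target
X = data.drop('target', axis=1)
y = data['target']

# 5-fold split: each fold trains on four subsets and tests on the fifth
kf = KFold(n_splits=5, shuffle=True)

# Fitted models and their feature importances, one entry per fold
models = []
importances = []

for train_idx, test_idx in kf.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    # Train a fresh random forest for this fold, so that five
    # separate models exist after the loop
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    models.append(rf)

    # Evaluate on the held-out subset (score() returns R^2)
    score = rf.score(X_test, y_test)
    importances.append(rf.feature_importances_)
    print('Model score:', score)

# Feature importances of each model: one row per model, one column per feature
importance_df = pd.DataFrame(importances, columns=X.columns)
print(importance_df)

# Average over the five models and sort ascending, so the most
# important feature ends up at the top of the horizontal bar chart
mean_importances = importance_df.mean().sort_values()

plt.barh(mean_importances.index, mean_importances.values)
plt.xlabel("Random Forest Feature Importance")
plt.tight_layout()
plt.show()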
```
Scores printed by one run (exact values depend on the data and on the random split):
```
Model score: 0.7423057084304298
Model score: 0.7376438427050389
Model score: 0.7435362544567676
Model score: 0.7459774420828814
Model score: 0.7376766130764384
```
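Five separate scores are easier to compare once they are reduced to a mean and a standard deviation. The sketch below is one possible way to do that, not part of the original answer: it uses `cross_validate` with `return_estimator=True`, which also hands back the five fitted models. The synthetic data from `make_regression` is an assumption standing in for `data.csv`, and the `n_estimators` and `random_state` values are arbitrary choices made only for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for data.csv (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

# Same 5-fold scheme as above, with a fixed seed for reproducibility
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate clones and fits the estimator once per fold;
# return_estimator=True keeps the five fitted models
cv = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, cv=kf, return_estimator=True,
)

scores = cv["test_score"]   # one R^2 value per fold
models = cv["estimator"]    # the five fitted random forests

print("R^2 per fold:", np.round(scores, 3))
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```

A small standard deviation relative to the mean, as in the scores listed above, suggests the model's performance is not very sensitive to which subset is held out.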
The ranking plot of the mean feature importances is shown below:
![image-20211102153736400](https://i.loli.net/2021/11/02/7Vq3cQ2a1YSi6fX.png)
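The plot above averages the importances over the five models. Since the question also asks for each model's own importances, here is a minimal sketch that tabulates them per model and draws one ranking panel per model. The synthetic data, the feature names `f0` to `f5`, and the row labels `model_1` to `model_5` are all hypothetical and used only for illustration; with real data, `X` and `y` would come from `data.csv` as in the code above.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in for data.csv (assumption); f0..f5 are made-up names
X_arr, y_arr = make_regression(n_samples=200, n_features=6, n_informative=3,
                               noise=10.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
y = pd.Series(y_arr)

names, rows = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, _) in enumerate(kf.split(X), start=1):
    # One fresh model per fold, trained on the four remaining subsets
    rf = RandomForestRegressor(n_estimators=50, random_state=fold)
    rf.fit(X.iloc[train_idx], y.iloc[train_idx])
    names.append(f"model_{fold}")
    rows.append(rf.feature_importances_)

# Rows are models, columns are features
per_model = pd.DataFrame(rows, index=names, columns=X.columns)
print(per_model.round(3))

# One panel per model, each sorted by that model's own importances
fig, axes = plt.subplots(1, 5, figsize=(18, 3.5))
for ax, (name, imp) in zip(axes, per_model.iterrows()):
    imp = imp.sort_values()
    ax.barh(imp.index, imp.values)
    ax.set_title(name)
    ax.set_xlabel("Importance")
fig.tight_layout()
plt.show()
```

If the ranking is similar across the five panels, the averaged plot is a fair summary; large differences between panels would indicate that the importances depend noticeably on the split.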
References:
- [sklearn.model_selection.KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
- [sklearn.ensemble.RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)