Ways to obtain feature importance in LightGBM
Date: 2024-06-08 10:09:05
LightGBM is an efficient, distributed machine learning framework based on Gradient Boosting Decision Trees (GBDT). With LightGBM, feature importance can be obtained in the following ways:
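1. Split-based feature importance
A split is the basic building block of a decision tree in LightGBM: each split consists of one feature and one threshold on that feature. Split-based importance is the number of times a feature is chosen to split a node, summed over all trees. With the native `Booster` API, `feature_importance()` returns this count by default (`importance_type='split'`); the scikit-learn wrapper exposes the same values through the `feature_importances_` attribute.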
```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% for validation
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# LightGBM 4.0 removed the early_stopping_rounds argument of lgb.train();
# the early_stopping callback is used instead.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])

# importance_type defaults to 'split'
print("Feature importances:\n", gbm.feature_importance())
```
Output:
```
Feature importances:
[ 7 37 1 4 0 1 1 48 0 0 13 0 33 15 9 1 16 27
5 0 94 13 24 41 30 22 11 25 18 13]
```
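2. Gain-based feature importance
In GBDT, every split of a tree node yields a gain: the reduction in the training loss achieved by that split. Gain-based importance is the total gain of all splits that use a given feature, summed over all trees. In LightGBM, `lgb.plot_importance()` plots it when called with `importance_type='gain'`; the same numbers are returned by `feature_importance(importance_type='gain')`.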
```python
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% for validation
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# LightGBM 4.0 removed the early_stopping_rounds argument of lgb.train();
# the early_stopping callback is used instead.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])

# Without importance_type='gain' the plot would show split counts
lgb.plot_importance(gbm, importance_type='gain')
plt.show()
```
Output:
![Gain-based feature importance plot](https://img-blog.csdn.net/20180730171529708?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2Jsb2cxOTk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/q/85)
In the plot above, each feature's importance score is the total gain of the splits that use that feature.
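To make the two definitions concrete, here is a small sketch in plain Python that aggregates split counts and gain totals from a list of split records. The feature names (`f0`, `f1`, `f2`) and gain values are made up for illustration and do not come from the model above.

```python
from collections import defaultdict

# Hypothetical split records collected over all trees: (feature, gain of that split).
splits = [
    ("f0", 12.0),
    ("f1", 3.5),
    ("f1", 2.0),
    ("f1", 1.5),
    ("f0", 8.0),
    ("f2", 0.5),
]

split_importance = defaultdict(int)    # how many times each feature was used
gain_importance = defaultdict(float)   # total gain contributed by each feature
for feature, gain in splits:
    split_importance[feature] += 1
    gain_importance[feature] += gain

print(dict(split_importance))  # {'f0': 2, 'f1': 3, 'f2': 1}
print(dict(gain_importance))   # {'f0': 20.0, 'f1': 7.0, 'f2': 0.5}
```

In this toy data `f1` ranks first by split count while `f0` ranks first by gain, which is why the two importance types can order features differently.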
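3. Permutation-based feature importance
Permutation-based importance is computed by randomly shuffling the values of one feature at a time. The steps are:
- Train a LightGBM model;
- For each feature, randomly shuffle all of its values in the test set, then recompute a performance metric (accuracy, precision, recall, etc.) on the shuffled data;
- Take the drop in the metric relative to the unshuffled baseline as the importance of that feature.
LightGBM does not compute permutation importance itself (`lgb.cv()` performs cross-validation, not permutation), so it has to be computed around the trained model, for example with a small helper function: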
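```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% as a test set
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# LightGBM 4.0 removed the early_stopping_rounds argument of lgb.train();
# the early_stopping callback is used instead.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])

def lgbm_metric(y_true, y_prob):
    # The original never defines lgbm_metric; accuracy at a 0.5 threshold is
    # assumed here. Booster.predict() returns probabilities for 'binary'.
    return accuracy_score(y_true, (y_prob > 0.5).astype(int))

def permutation_importance(model, X_test, y_test, metric):
    # Score on the unshuffled test set
    baseline = metric(y_test, model.predict(X_test))
    imp = []
    for col in range(X_test.shape[1]):
        save = X_test[:, col].copy()                             # remember the original column
        X_test[:, col] = np.random.permutation(X_test[:, col])  # shuffle one feature
        m = metric(y_test, model.predict(X_test))
        X_test[:, col] = save                                    # restore the column
        imp.append(baseline - m)                                 # drop in the metric
    return np.array(imp)

imp = permutation_importance(gbm, X_test, y_test, metric=lgbm_metric)
print("Permutation importances:\n", imp)
```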
Output (the values change from run to run because the shuffle is random):
```
Permutation importances:
[ 0.01578947 0.02631579 0. -0.00526316 0. 0.00526316
0. -0.01052632 0. -0.00526316 0.00526316 0.
-0.00526316 -0.00526316 -0.00526316 0. -0.00526316 -0.00526316
-0.00526316 0. -0.03157895 -0.00526316 -0.01052632 -0.00526316
-0.00526316 -0.03157895 -0.02105263 -0.00526316 -0.01578947 -0.00526316]
```
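The code above defines a `permutation_importance()` function that computes the drop in the metric for each feature and then prints the resulting permutation importances. A positive value means the metric got worse after shuffling that feature; values at or below zero suggest the feature contributes little on this test set.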