Optimal Binning with Decision Trees in Python
Posted: 2023-06-29 09:06:35
Decision-tree optimal binning is a common feature-engineering technique: it discretizes a continuous variable into ordered categories, which can reduce model complexity. Below is a simple Python implementation:
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split


def calc_chi2(cross_tab):
    """
    Compute the chi-square statistic for a contingency table.
    :param cross_tab: DataFrame, crosstab of the binned feature vs. the target
    :return: chi-square statistic
    """
    total = cross_tab.values.sum()
    row_totals = cross_tab.sum(axis=1)
    col_totals = cross_tab.sum(axis=0)
    # Expected counts under independence of rows and columns
    expect = np.outer(row_totals, col_totals) / total
    chi2 = ((cross_tab - expect) ** 2 / expect).sum().sum()
    return chi2


def best_split_bin(x, y, max_bins=10, min_samples_leaf=0.05):
    """
    Use a decision tree to choose optimal bin edges for a single feature.
    :param x: feature variable (1-D Series or array)
    :param y: target variable
    :param max_bins: maximum number of bins
    :param min_samples_leaf: minimum fraction of samples in each leaf (bin)
    :return: list of bin edges, e.g. [-inf, 10, 20, inf]
    """
    tree = DecisionTreeRegressor(
        max_leaf_nodes=max_bins,            # at most max_bins leaves, i.e. max_bins bins
        min_samples_leaf=min_samples_leaf,  # keeps every bin reasonably populated
    )
    tree.fit(np.asarray(x).reshape(-1, 1), y)
    # Internal nodes carry the split thresholds; leaf nodes have feature == -2.
    thresholds = tree.tree_.threshold[tree.tree_.feature != -2]
    return [-np.inf] + sorted(thresholds.tolist()) + [np.inf]


# Example usage (assumes data.csv stores the feature in column 0 and the target in column 1)
df = pd.read_csv('data.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, 0], df.iloc[:, 1], test_size=0.2
)
bins = best_split_bin(X_train, y_train)
print(bins)

# Optional quality check: chi-square of the binned feature against the target
cross_tab = pd.crosstab(pd.cut(X_train, bins=bins), y_train)
print(calc_chi2(cross_tab))
```
The function uses a decision tree to choose the binning scheme and returns a list of bin edges. For example, `[-inf, 10, 20, inf]` denotes three bins: the first contains values less than or equal to 10, the second values greater than 10 and up to 20, and the third values greater than 20. You can adjust the parameters to suit your needs: `max_bins` caps the number of bins, and `min_samples_leaf` sets the minimum fraction of total samples that each leaf (bin) must hold.
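Because the snippet above depends on an external `data.csv`, here is a self-contained sketch of the same idea on synthetic data. The synthetic target (a step at x = 40 plus noise) and all variable names are illustrative assumptions, not part of the original code; the tree should recover an edge near the step.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: y jumps from 0 to 1 at x = 40, plus small noise,
# so the tree should place a split threshold close to 40.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=1000)
y = (x > 40).astype(float) + rng.normal(0, 0.1, size=1000)

tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=0.05)
tree.fit(x.reshape(-1, 1), y)

# Internal nodes hold split thresholds; leaf nodes are marked with feature == -2.
edges = sorted(tree.tree_.threshold[tree.tree_.feature != -2].tolist())
bins = [-np.inf] + edges + [np.inf]

# Apply the edges: right=True gives (low, high] intervals, matching the
# "greater than low, at most high" convention described above.
binned = pd.cut(pd.Series(x), bins=bins, right=True)
print(bins)
print(binned.value_counts().sort_index())
```

With `max_leaf_nodes=4` the tree produces at most three internal thresholds, so the edge list has at most five entries including the infinite endpoints.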