Chi-square binning in Python
Date: 2023-06-28 11:12:45
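Chi-square binning is a supervised discretization method. It splits a continuous variable into a number of intervals so that the target distribution varies little within each interval and differs clearly between intervals. In Python it can be implemented with the following steps: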
1. Import the required libraries
```python
import pandas as pd
import numpy as np
import scipy.stats as ss
```
2. Load the data
```python
data = pd.read_csv('data.csv')
```
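The later steps assume that `data` has a continuous column `feature` and a binary column `target` (1 = bad), as used in the code throughout this answer. If no `data.csv` is at hand, a synthetic stand-in can be generated. The sample size, the seed, and the logistic relationship below are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: 1,000 rows, arbitrary seed
rng = np.random.default_rng(0)
feature = rng.normal(loc=0.0, scale=1.0, size=1000)

# Assumed relationship: the probability of target == 1 rises with feature
prob_bad = 1.0 / (1.0 + np.exp(-feature))
target = (rng.random(1000) < prob_bad).astype(int)

data = pd.DataFrame({'feature': feature, 'target': target})
print(data.head())
```

Because the bad rate increases with `feature` by construction, neighbouring bins at the two ends of the range should come out clearly different, which makes the merge step easier to inspect.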
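3. Choose the initial number of bins
Pick the number of initial bins by rule of thumb, then create them with equal-frequency or equal-width binning. These fine-grained bins are the starting point that the merge step reduces later.
```python
# Equal-frequency binning into (at most) 10 initial bins;
# duplicates='drop' avoids an error when repeated values produce identical bin edges
data['rank'] = pd.qcut(data['feature'], 10, labels=False, duplicates='drop')
```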
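To see how the two initial binning strategies differ, the sketch below applies `pd.qcut` (equal frequency) and `pd.cut` (equal width) to the same skewed toy series. The series itself is made up for illustration.

```python
import numpy as np
import pandas as pd

# Skewed toy data: 1, 4, 9, ..., 10000
values = pd.Series(np.arange(1, 101) ** 2)

# Equal-frequency: each bin holds about the same number of rows
freq_bins = pd.qcut(values, 10, labels=False, duplicates='drop')
# Equal-width: each bin spans the same range of values
width_bins = pd.cut(values, 10, labels=False)

freq_counts = freq_bins.value_counts().sort_index()
width_counts = width_bins.value_counts().sort_index()
print("equal-frequency counts:", freq_counts.tolist())
print("equal-width counts:    ", width_counts.tolist())
```

On skewed data the equal-width bins end up very uneven in size, so equal-frequency binning is usually the safer starting point for chi-square merging, where sparse bins give unstable statistics.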
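4. Compute the chi-square value
The chi-square value measures how strongly the target distribution differs between bins. The function below compares the observed bad and good counts of each bin with the counts expected under a single bad rate.
```python
def chi2(df, total_col, bad_col, overall_rate):
    """Pearson chi-square of bad/good counts against one expected bad rate."""
    total = df[total_col].to_numpy(dtype=float)
    bad = df[bad_col].to_numpy(dtype=float)
    good = total - bad
    expected_bad = total * overall_rate
    expected_good = total - expected_bad
    stat = 0.0
    # Skip cells whose expected count is zero to avoid dividing by zero
    for observed, expected in ((bad, expected_bad), (good, expected_good)):
        mask = expected > 0
        stat += np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])
    return float(stat)

# Sample count and bad count (target == 1) per initial bin
total = data.groupby('rank')['feature'].count()
bad = data.groupby('rank')['target'].sum()
overall_rate = bad.sum() / total.sum()
chi2_value = chi2(pd.concat([total, bad], axis=1), 'feature', 'target', overall_rate)
```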
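As a worked check of the statistic, the sketch below computes the Pearson chi-square by hand for two adjacent bins and compares it with `scipy.stats.chi2_contingency`. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical counts for two adjacent bins: rows are bins, columns are [bad, good]
observed = np.array([[10, 90],
                     [30, 70]], dtype=float)

# Expected counts if both bins shared the pooled bad rate (40 / 200 = 0.2)
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

manual_chi2 = ((observed - expected) ** 2 / expected).sum()

# correction=False turns off Yates' continuity correction so the two values agree
scipy_chi2, p_value, dof, _ = ss.chi2_contingency(observed, correction=False)
print(manual_chi2, scipy_chi2, dof)
```

Here the expected counts are 20 bad and 80 good in each bin, so the statistic is 2 × (100/20 + 100/80) = 12.5 with one degree of freedom. A value this large suggests the two bins have different bad rates and should stay separate.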
5. Merge bins
Compute the chi-square value of every pair of adjacent bins and merge the pair with the smallest value. Repeat until every adjacent pair reaches the preset minimum chi-square value and the number of bins does not exceed the maximum.
```python
def merge(data, col, target, max_interval, min_chi2):
    # Sample count and bad count per initial bin, ordered by bin index
    stats = data.groupby(col)[target].agg(total='count', bad='sum').sort_index()
    group_intervals = [[i] for i in stats.index]
    while len(group_intervals) > 1:
        chi2_values = []
        for i in range(len(group_intervals) - 1):
            # Two-row table: one row for each of the two neighbouring groups
            pair = pd.DataFrame({
                'total': [stats.loc[g, 'total'].sum() for g in group_intervals[i:i + 2]],
                'bad': [stats.loc[g, 'bad'].sum() for g in group_intervals[i:i + 2]],
            })
            # Expected bad rate if the two groups were merged
            pair_rate = pair['bad'].sum() / pair['total'].sum()
            chi2_values.append(chi2(pair, 'total', 'bad', pair_rate))
        min_value = min(chi2_values)
        # Stop once all neighbours differ enough and the bin count is within the limit
        if min_value >= min_chi2 and len(group_intervals) <= max_interval:
            break
        merge_index = chi2_values.index(min_value)
        group_intervals[merge_index] = group_intervals[merge_index] + group_intervals.pop(merge_index + 1)
    return group_intervals

intervals = merge(data, 'rank', 'target', 10, 3.841)
```
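The threshold 3.841 passed to `merge` corresponds to the 95% quantile of the chi-square distribution with one degree of freedom, which is the usual choice when comparing two bins on a binary target. A short sketch of how such thresholds can be looked up with `scipy.stats.chi2.ppf`; the other confidence levels are shown only for comparison.

```python
import scipy.stats as ss

# Two adjacent bins x two classes (bad / good) -> (2 - 1) * (2 - 1) = 1 degree of freedom
dof = 1

# Critical value for each confidence level
thresholds = {conf: ss.chi2.ppf(conf, dof) for conf in (0.90, 0.95, 0.99)}
for conf, value in thresholds.items():
    print(f"confidence={conf:.2f}  threshold={value:.3f}")
```

A higher confidence level raises the threshold, so more neighbouring bins get merged and the final binning is coarser.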
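6. Apply the binning result to the data
```python
def apply_intervals(x, intervals):
    # Return the index of the merged group that contains the original bin x
    for i, group in enumerate(intervals):
        if x in group:
            return i
    return None  # x is not in any group (for example, a missing value)

data['rank_new'] = data['rank'].apply(lambda x: apply_intervals(x, intervals))
```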
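Once `rank_new` is assigned, it helps to check what each final bin looks like. The sketch below summarises value range, size, and bad rate per bin. The toy frame stands in for `data` after this step, and its values are made up.

```python
import pandas as pd

# Toy frame standing in for `data` after step 6 (values are made up)
data = pd.DataFrame({
    'feature':  [0.1, 0.4, 0.9, 1.3, 1.8, 2.2, 2.7, 3.5],
    'target':   [0,   0,   0,   1,   0,   1,   1,   1],
    'rank_new': [0,   0,   0,   1,   1,   1,   2,   2],
})

# Per-bin value range, sample count, and bad count
summary = data.groupby('rank_new').agg(
    lower=('feature', 'min'),
    upper=('feature', 'max'),
    total=('target', 'count'),
    bad=('target', 'sum'),
)
summary['bad_rate'] = summary['bad'] / summary['total']
print(summary)
```

The `lower` and `upper` columns give the observed feature range of each final bin, which can serve as cut points for scoring new data, and a bad rate that changes steadily from bin to bin is a quick sign that the merge produced meaningful groups.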
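These are the steps for implementing chi-square binning in Python. Note that chi-square binning is not a universal discretization method; choose the method that fits the data and the task at hand.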