Python code for chi-square binning of data with missing values
Date: 2023-10-07 20:09:09
Below is a simple Python example that applies chi-square binning to data containing missing values:
```python
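import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def _pair_p_value(counts):
    """p-value of the chi-square test on two adjacent bins (rows) x neg/pos (columns)."""
    # A zero row or column total makes the expected frequencies undefined;
    # treat such a pair as indistinguishable so that it gets merged.
    if (counts.sum(axis=0) == 0).any() or (counts.sum(axis=1) == 0).any():
        return 1.0
    return chi2_contingency(counts)[1]


def chisq_bin(data, col, target, max_bins=5, iv_min=0.02):
    """
    Chi-square binning that keeps missing values in a bin of their own.
    :param data: data set (DataFrame)
    :param col: numeric variable to bin (needs some non-missing values)
    :param target: binary target variable coded as 0/1
    :param max_bins: maximum number of bins, including the missing-value bin
    :param iv_min: minimum IV; None is returned when the IV falls below it
    :return: Series of bin numbers aligned with data.index (-1 = missing), or None
    """
    x, y = data[col], data[target]
    missing = x.isnull()
    has_missing = bool(missing.any())

    # Start from fine-grained quantile bins over the non-missing values
    # (20 initial bins is an arbitrary starting point).
    _, edges = pd.qcut(x[~missing], q=20, retbins=True, duplicates='drop')
    edges = list(edges)
    codes = pd.cut(x[~missing], edges, labels=False, include_lowest=True)
    counts = (pd.crosstab(codes, y[~missing])
              .reindex(index=range(len(edges) - 1), columns=[0, 1], fill_value=0)
              .to_numpy(dtype=float))

    # Repeatedly merge the adjacent pair that differs least (largest p-value).
    value_bins = max(max_bins - int(has_missing), 1)
    while len(counts) > 1:
        p_values = [_pair_p_value(counts[i:i + 2]) for i in range(len(counts) - 1)]
        i = int(np.argmax(p_values))
        # Stop once the bin budget is met and all neighbours differ significantly.
        if len(counts) <= value_bins and p_values[i] < 0.05:
            break
        counts[i] += counts[i + 1]
        counts = np.delete(counts, i + 1, axis=0)
        del edges[i + 1]

    # Append the missing-value bin, then compute WOE and IV over all bins.
    if has_missing:
        counts = np.vstack([counts, [(y[missing] == 0).sum(), (y[missing] == 1).sum()]])
    smoothed = counts + 0.5  # avoids log(0) for bins that lack one class
    neg_prop = smoothed[:, 0] / smoothed[:, 0].sum()
    pos_prop = smoothed[:, 1] / smoothed[:, 1].sum()
    woe = np.log(neg_prop / pos_prop)
    iv = ((neg_prop - pos_prop) * woe).sum()
    if iv < iv_min:
        return None

    # Open-ended outer edges so that values outside the observed range still get a bin.
    edges[0], edges[-1] = -np.inf, np.inf
    result = pd.cut(x, edges, labels=False)
    return result.fillna(-1).astype(int)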
```
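This function treats missing values as a separate category and splits the remaining data into bins according to the results of chi-square tests. It returns a pandas Series in which each value is the bin number assigned to the variable. You can use this result in place of the original variable and then process it further with techniques such as one-hot encoding or WOE encoding.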
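To make the merge criterion concrete, the following minimal sketch runs the chi-square test on a single pair of adjacent bins. The counts are made-up illustration values, the helper name `should_merge` is hypothetical, and the 0.05 significance level is assumed to match the threshold used in the code above.

```python
import numpy as np
from scipy.stats import chi2_contingency


def should_merge(bin_a, bin_b, alpha=0.05):
    """Return True when two adjacent bins are not significantly different.

    Each bin is a (neg_count, pos_count) pair. `should_merge` is a
    hypothetical helper for illustration, not part of any library.
    """
    table = np.array([bin_a, bin_b])
    p_value = chi2_contingency(table)[1]
    return bool(p_value >= alpha)


# Made-up counts: similar positive rates (10% vs 11%) and very different ones (10% vs 40%).
similar = should_merge((90, 10), (89, 11))
different = should_merge((90, 10), (60, 40))
print(similar, different)
```

Bins with similar positive rates yield a large p-value and are candidates for merging, while bins whose rates differ markedly are kept apart.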
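As a follow-up to the encoding step mentioned above, here is a minimal sketch of WOE encoding applied to a Series of bin numbers. The sample data, the helper name `woe_encode`, the 0.5 smoothing constant, and the use of -1 as the label of the missing-value bin are all assumptions for illustration. The sign convention (log of the negative share over the positive share) follows the code above.

```python
import numpy as np
import pandas as pd


def woe_encode(bins, target, smooth=0.5):
    """Map each bin number to its WOE value, ln(neg share / pos share).

    `woe_encode` is a hypothetical helper; `smooth` is an assumed additive
    constant that keeps the logarithm finite when a bin has no negatives
    or no positives.
    """
    counts = pd.crosstab(bins, target).reindex(columns=[0, 1], fill_value=0) + smooth
    woe = np.log((counts[0] / counts[0].sum()) / (counts[1] / counts[1].sum()))
    return bins.map(woe), woe


# Made-up example: the label -1 is assumed to mark the missing-value bin.
bins = pd.Series([-1, -1, 0, 0, 0, 1, 1, 1, 1, 1], name='bin')
target = pd.Series([1, 0, 0, 0, 1, 1, 1, 1, 0, 1], name='target')
encoded, woe_table = woe_encode(bins, target)
print(woe_table)
```

Because the missing-value bin is just another bin number, it receives its own WOE value without any special handling.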