With 1,000 features, how do I use WOE and IV in Python to select usable features as model input variables?
Date: 2024-03-31 12:33:53 Views: 15
The following Python code implements feature selection based on WOE and IV:
```python
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# Compute the IV of one feature, with WOE calculated per distinct value (bin).
# Assumes the feature is already discrete or binned, and that target is 0/1 with 1 = bad.
def cal_iv(df, feature, target):
    # Count all samples and bad samples per distinct value; NaN forms its own bin
    stats = df.groupby(feature, dropna=False)[target].agg(All='count', Bad='sum')
    stats['Good'] = stats['All'] - stats['Bad']
    # Add 0.5 to every count so bins with no good or no bad samples do not give log(0)
    # (a common smoothing adjustment, used here instead of dropping those bins)
    dist_good = (stats['Good'] + 0.5) / (stats['Good'] + 0.5).sum()
    dist_bad = (stats['Bad'] + 0.5) / (stats['Bad'] + 0.5).sum()
    woe = np.log(dist_good / dist_bad)
    # IV is the sum over bins of (good share - bad share) * WOE
    return float(((dist_good - dist_bad) * woe).sum())
# Load the data
data = pd.read_csv('data.csv')
# Randomly split the data into training and test sets
train_data, test_data = train_test_split(data, test_size=0.3, random_state=42)
# Compute the IV of every feature, using the training set only
iv_values = []
for col in data.columns:
    if col != 'target':
        iv = cal_iv(train_data, col, 'target')
        iv_values.append((col, iv))
# Sort the features by IV in descending order
iv_values = sorted(iv_values, key=lambda x: x[1], reverse=True)
# Select the top N features by IV as model inputs
N = 10
selected_features = [x[0] for x in iv_values[:N]]
# Train a decision tree and evaluate its predictions on the test set
X_train = train_data[selected_features]
y_train = train_data['target']
X_test = test_data[selected_features]
y_test = test_data['target']
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print('AUC:', auc)
```
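For reference, the quantities that `cal_iv` works with follow the usual WOE and IV definitions. The symbols are introduced here for illustration only: $G_i$ and $B_i$ are the numbers of good and bad samples in bin $i$, and $G$ and $B$ are the totals over all bins.

$$
\mathrm{WOE}_i = \ln\!\left(\frac{G_i / G}{B_i / B}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{G_i}{G} - \frac{B_i}{B}\right)\mathrm{WOE}_i
$$

A bin with $G_i = 0$ or $B_i = 0$ makes the logarithm undefined, which is why such bins need to be merged, dropped, or smoothed before the sum is taken.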
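In this code, `data` is the dataset containing the 1,000 features and `target` is the target variable, with 1 marking a bad sample. The dataset is first split randomly into training and test sets. The IV of each feature is then computed on the training set, and the features are sorted by IV in descending order. Finally, the top N features by IV are used as model inputs to train a decision tree, which is evaluated by AUC on the test set. Note that `cal_iv` treats every distinct value of a feature as its own bin, so continuous features should be binned before their IV is computed.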
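With 1,000 features, many are likely to be continuous, and a fixed top-N cut may keep weak features or drop useful ones. The sketch below shows one possible variant: each numeric feature is binned into equal-frequency bins with `pd.qcut` before its IV is computed, and features are kept when their IV reaches a threshold. The helper names `binned_iv` and `select_by_iv`, the 0.5 smoothing constant, the 10 bins, and the 0.02 cutoff (a commonly quoted rule of thumb, not a fixed rule) are all assumptions made for illustration, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def binned_iv(values, target, n_bins=10, eps=0.5):
    """IV of a numeric feature after equal-frequency binning (hypothetical helper)."""
    # A feature with fewer than two distinct values carries no information
    if values.nunique() < 2:
        return 0.0
    # Integer bin codes from quantile bins; duplicates='drop' merges repeated edges.
    # Missing values come back as NaN and are placed in their own bin, coded -1.
    codes = pd.qcut(values, q=n_bins, labels=False, duplicates='drop').fillna(-1)
    stats = target.groupby(codes).agg(['count', 'sum'])
    bad = stats['sum']
    good = stats['count'] - bad
    # eps smooths the counts so empty good/bad cells do not produce log(0)
    dist_good = (good + eps) / (good + eps).sum()
    dist_bad = (bad + eps) / (bad + eps).sum()
    woe = np.log(dist_good / dist_bad)
    return float(((dist_good - dist_bad) * woe).sum())


def select_by_iv(df, target_col, min_iv=0.02, n_bins=10):
    """Return the features whose IV is at least min_iv, sorted by IV (hypothetical helper)."""
    ivs = {
        col: binned_iv(df[col], df[target_col], n_bins)
        for col in df.columns
        if col != target_col
    }
    iv_series = pd.Series(ivs, dtype=float).sort_values(ascending=False)
    return iv_series[iv_series >= min_iv]


# Synthetic demo: 'signal' drives the target, 'noise' is unrelated to it
rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
prob_bad = 1 / (1 + np.exp(-(1.5 * signal - 1.0)))
demo = pd.DataFrame({
    'signal': signal,
    'noise': noise,
    'target': (rng.random(n) < prob_bad).astype(int),
})
kept = select_by_iv(demo, 'target')
print(kept)
```

In the original script, `select_by_iv(train_data, 'target').index.tolist()` could stand in for the top-N step, provided all feature columns are numeric. Computing IV on the training set only, as the original code does, keeps the test set out of the selection step.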