Random Forest Hyperparameter Tuning in Practice (Credit Card Fraud Prediction)
Date: 2023-09-04 11:09:29
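Random forest is a widely used machine learning algorithm that handles both classification and regression. In practice, its performance depends to a large extent on the choice of hyperparameters, so tuning is an important step when using it. Below, we walk through random forest tuning using credit card fraud prediction as the example.
1. Data Preparation
We use the credit card fraud dataset from Kaggle. First, we import the required libraries and load the dataset: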
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv('creditcard.csv')
```
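The dataset contains 284,807 transactions, of which 492 are fraudulent (0.172%). Because the classes are this imbalanced, we split the data into 10 folds with StratifiedKFold, which keeps the fraud ratio roughly constant in every fold, and evaluate the model on held-out data.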
```python
from sklearn.model_selection import StratifiedKFold
X = data.drop(['Class'], axis=1)
y = data['Class']
skf = StratifiedKFold(n_splits=10)
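# Each iteration overwrites the previous split, so after the loop these
# variables hold the last fold, which the following sections train and test on.
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]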
```
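The loop above keeps only the last split. As a minimal sketch of scoring on every fold instead, the block below uses `cross_validate` on a small synthetic dataset. The synthetic data, the `_demo` names, and the reduced settings (5 folds, 20 trees, shuffling) are assumptions chosen so the example runs quickly; they are not taken from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for creditcard.csv (assumption): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stratification keeps the share of positives nearly identical in every test fold.
fold_pos_rates = [
    float(y_demo[test_idx].mean())
    for _, test_idx in skf_demo.split(X_demo, y_demo)
]

# Fit and score the model once per fold, collecting several metrics at a time.
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X_demo,
    y_demo,
    cv=skf_demo,
    scoring=["precision", "recall", "f1"],
)
mean_recall = cv_results["test_recall"].mean()

print("Positive rate per fold:", fold_pos_rates)
print("Mean recall across folds:", mean_recall)
```

Averaging over folds gives a steadier estimate than a single split, which matters when each test fold contains only a handful of positive cases.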
2. Building the Random Forest Model
We build a random forest model with the default parameters:
```python
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
```
3. Model Evaluation
We evaluate the model with the confusion matrix, accuracy, precision, recall, and F1 score:
```python
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
```
The output is as follows:
```
True Negatives: 284294
False Positives: 4
False Negatives: 40
True Positives: 323
Accuracy: 0.9995435553526912
Precision: 0.9877300613496932
Recall: 0.8897959183673469
F1 Score: 0.9361702127659575
```
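The accuracy is very high, but the recall is noticeably lower, which means the model misses some fraudulent transactions (false negatives). Note that the four counts above sum to 284,661, close to the size of the full dataset rather than a single held-out fold of about 28,000 rows, so the figures are best read as illustrative.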
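To see why accuracy alone says little on this dataset, the arithmetic can be sketched with the class counts quoted earlier (284,807 transactions, 492 fraudulent). The baseline here is a hypothetical "model" that labels every transaction as legitimate; the variable names are illustrative.

```python
# Class counts quoted earlier in the article.
n_total = 284807
n_fraud = 492

# Hypothetical baseline: predict "not fraud" for every transaction.
tp, fn = 0, n_fraud            # every fraud case is missed
tn, fp = n_total - n_fraud, 0  # every legitimate case is correct

baseline_accuracy = (tp + tn) / n_total
baseline_recall = tp / (tp + fn)

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall: {baseline_recall:.0%}")
```

The baseline reaches about 99.83% accuracy while catching no fraud at all, which is why recall and F1 are the more informative numbers to watch here.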
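4. Tuning in Practice
To improve the model, we need to adjust the random forest's hyperparameters. The commonly tuned ones include n_estimators, max_depth, min_samples_split, min_samples_leaf, and max_features. We can use GridSearchCV to search over them.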
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}
rfc = RandomForestClassifier(random_state=0)
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
```
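Here we use five-fold cross-validation and set n_jobs=-1 so the search runs in parallel on all available CPU cores. Because no scoring argument is passed, GridSearchCV ranks the candidates by the classifier's default score, which is accuracy. Next, we can inspect the best parameter combination: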
```python
print("Best Parameters:", grid_search.best_params_)
```
The output is as follows:
```
Best Parameters: {'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
```
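For a sense of what this search costs, the grid defined above can be counted with `ParameterGrid`. This is a small sketch that repeats the article's grid; `n_candidates`, `cv_folds`, and `total_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

# 3 * 3 * 3 * 3 * 2 parameter combinations.
n_candidates = len(ParameterGrid(param_grid))

# One fit per combination per fold (the final refit on all training data is extra).
cv_folds = 5
total_fits = n_candidates * cv_folds

print("Candidates:", n_candidates)
print("Cross-validation fits:", total_fits)
```

With 162 candidates and 5 folds the search trains 810 forests of up to 300 trees each, so on a dataset of this size it can take a long time even with parallel execution.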
Finally, we retrain the model with the best parameter combination and evaluate it again:
```python
rfc = RandomForestClassifier(random_state=0, max_depth=15, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=300)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
```
The output is as follows:
```
True Negatives: 284293
False Positives: 5
False Negatives: 23
True Positives: 340
Accuracy: 0.9996137776061234
Precision: 0.9855072463768116
Recall: 0.936734693877551
F1 Score: 0.9606299212598425
```
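After tuning, recall rises from about 0.890 to 0.937 and the F1 score from about 0.936 to 0.961, at the cost of one additional false positive, so the tuned model catches more fraud while precision stays almost unchanged.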
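Since the goal is better recall and F1 rather than accuracy, one option is to tell the search which metric to optimise. The sketch below passes `scoring="f1"` to `GridSearchCV` and reuses the refitted `best_estimator_` instead of rebuilding the model by hand. The synthetic data, the reduced grid, and the `_demo`, `small_grid`, and `search` names are assumptions chosen so the example runs quickly; they are not the article's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for creditcard.csv (assumption): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Deliberately tiny grid so the sketch finishes in seconds.
small_grid = {"n_estimators": [20, 40], "max_depth": [5, 10]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    scoring="f1",  # rank candidates by F1 on the positive class, not accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of X_tr.
best_model = search.best_estimator_
test_recall = recall_score(y_te, best_model.predict(X_te))

print("Best parameters:", search.best_params_)
print("Held-out recall:", test_recall)
```

Keeping a test set that the search never sees, as `X_te` is here, gives a cleaner picture of how much the tuning actually helped.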