Implementing random forest classification with sklearn and analyzing feature importance
Date: 2023-11-22 09:53:22
Sure. Random forest classification can be implemented with the following steps:
1. Import the required libraries and load the dataset.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the data: every column except the last is a feature, the last column is the target
data = pd.read_csv('data.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
```
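If no `data.csv` is at hand, the same column layout can be tried out on a built-in dataset. The sketch below is only a stand-in: using scikit-learn's iris dataset in place of `data.csv` is an assumption made for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for data.csv (assumption): the iris dataset as one DataFrame,
# with the target in the last column, matching the layout used above.
data = load_iris(as_frame=True).frame

X = data.iloc[:, :-1]  # all columns except the last: features
y = data.iloc[:, -1]   # last column: target

print(X.shape, y.shape)
print(X.columns.tolist())
```

Any CSV whose last column holds the class label can be loaded the same way with `pd.read_csv`.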
2. Split the dataset into a training set and a test set.
```python
# Split the dataset: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
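For classification it is often useful to keep the class proportions the same in both subsets. A minimal sketch, assuming the iris dataset as a placeholder for the real data, passes `stratify=y` to `train_test_split`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): iris has 3 classes with 50 samples each
X, y = load_iris(return_X_y=True, as_frame=True)

# stratify=y keeps the class ratio of y in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())
```

Without `stratify`, a small or imbalanced dataset can end up with a test set that under-represents some class, which makes the accuracy figure less reliable.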
3. Define the random forest classifier and train the model.
```python
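# Define the random forest classifier: 100 trees, fixed seed for reproducibility
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model on the training set
rfc.fit(X_train, y_train)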
```
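A random forest can also estimate its own generalization accuracy from the samples each tree did not see during bootstrapping (the out-of-bag samples). The sketch below is one optional variation, again assuming the iris dataset as placeholder data; it enables this with `oob_score=True`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data (assumption)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rfc.fit(X_train, y_train)

print("Number of trees:", len(rfc.estimators_))
print("OOB accuracy estimate:", rfc.oob_score_)
```

The OOB estimate comes from the training data alone, so it can serve as a sanity check next to the test-set accuracy.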
4. Predict on the test set and compute the accuracy.
```python
# Predict the labels of the test set
y_pred = rfc.predict(X_test)
# Compute the accuracy: the fraction of test samples predicted correctly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
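Accuracy is a single number and hides which classes get confused with each other. As a possible complement, the sketch below prints a confusion matrix and per-class precision and recall; the iris dataset is assumed here as placeholder data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Placeholder data (assumption)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score for each class
print(classification_report(y_test, y_pred))
```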
5. Compute the feature importances and visualize them.
```python
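# Get the feature importances from the trained model
importances = rfc.feature_importances_
# Visualize the feature importances
import matplotlib.pyplot as plt
# Attach the feature names so the bars are labeled
feat_importances = pd.Series(importances, index=X.columns)
# Horizontal bar chart of the 10 most important features
feat_importances.nlargest(10).plot(kind='barh')
plt.show()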
```
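The bar chart shows only the averaged importance. To judge how stable each value is, one option is to look at how much it varies across the individual trees of the forest. The following is a sketch of that idea, assuming the iris dataset as placeholder data; the `summary` table is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Placeholder data (assumption)
X, y = load_iris(return_X_y=True, as_frame=True)
rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Forest-level importances (they sum to 1)
importances = rfc.feature_importances_
# Standard deviation of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in rfc.estimators_], axis=0)

summary = pd.DataFrame(
    {"importance": importances, "std": std}, index=X.columns
).sort_values("importance", ascending=False)
print(summary)
```

A feature with a high importance but an equally large standard deviation is ranked less reliably than one whose importance is consistent from tree to tree.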
The complete code is as follows:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the data: every column except the last is a feature, the last column is the target
data = pd.read_csv('data.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Split the dataset: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rfc.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = rfc.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Get the feature importances from the trained model
importances = rfc.feature_importances_

# Visualize the 10 most important features as a horizontal bar chart
feat_importances = pd.Series(importances, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
```
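Here, `data.csv` is the dataset file, `X` holds the feature variables, `y` is the target variable, `n_estimators` is the number of decision trees, and `random_state` is the random seed. `feature_importances_` returns the impurity-based importance of each feature, normalized so that the values sum to 1. `feat_importances.nlargest(10).plot(kind='barh')` draws a horizontal bar chart of the 10 most important features.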
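Impurity-based importances are computed on the training data and tend to favor features with many distinct values. Permutation importance is a different technique that can be used as a cross-check: it measures how much the test score drops when one feature's values are shuffled. The sketch below is not part of the original answer and assumes the iris dataset as placeholder data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data (assumption)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out test set and record the score drop
result = permutation_importance(rfc, X_test, y_test, n_repeats=10, random_state=42)

perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm)
```

If both methods rank the same features at the top, that ranking can be trusted more than either result alone.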