C4.5 feature selection: sklearn code
Date: 2024-05-25 13:05:25
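C4.5 is a decision tree algorithm that selects features by information gain ratio. scikit-learn does not implement C4.5 itself: its `DecisionTreeClassifier` is an optimized version of CART that builds binary trees. Setting the `criterion` parameter to `"entropy"` makes it split on information gain (entropy reduction), which is the closest built-in option but is not the gain ratio. The example below fits such a tree and reads its feature importances: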
```python
from sklearn.tree import DecisionTreeClassifier

# Create the decision tree classifier (entropy-based splits)
clf = DecisionTreeClassifier(criterion='entropy')

# Train the model (X and y must be defined beforehand)
clf.fit(X, y)

# Get the feature importances
feature_importances = clf.feature_importances_

# Print the feature importances
for i, importance in enumerate(feature_importances):
    print(f"Feature {i+1}: importance = {importance}")
```
Note that `X` and `y` in the code above stand for the feature matrix and the target variable. Replace them with your own data.
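To turn these importances into an actual selection step, one option is scikit-learn's `SelectFromModel`. The sketch below is only an illustration: the iris dataset and the `"mean"` threshold are assumptions, not part of the answer above, and the ranking is based on entropy reduction, not on C4.5's gain ratio.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own feature matrix X and target y.
iris = load_iris()
X, y = iris.data, iris.target

# Fit an entropy-based tree inside the selector; random_state makes the run repeatable.
selector = SelectFromModel(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    threshold="mean",  # keep features whose importance is at least the mean importance
)
selector.fit(X, y)

# Boolean mask of the kept features, and the reduced feature matrix.
mask = selector.get_support()
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
X_selected = selector.transform(X)

print("Importances:", selector.estimator_.feature_importances_)
print("Selected features:", selected_names)
print("Reduced shape:", X_selected.shape)
```

A fixed numeric threshold (for example `threshold=0.1`) can be used instead of `"mean"` if you prefer an absolute cutoff.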
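Related questions
Decision tree C4.5 classification model: Python code
C4.5 is a common decision tree algorithm for classification problems. scikit-learn does not ship a C4.5 implementation, but its CART-based `DecisionTreeClassifier` with `criterion='entropy'` is often used as an approximation. Here is a simple example: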
```
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Build the decision tree model
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf.fit(X_train, y_train)

# Evaluate the model on the test set
score = clf.score(X_test, y_test)
print('Accuracy:', score)
```
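The code above first loads the iris dataset and splits it into a training set and a test set. It then builds a decision tree with scikit-learn's `DecisionTreeClassifier` and fits it on the training set. Finally, it computes the model's accuracy on the test set. Because `train_test_split` is called without `random_state`, the reported accuracy varies from run to run.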
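To see which features and thresholds the fitted tree actually splits on, the learned rules can be printed with `export_text`. This is a minimal sketch under the same setup as above; the `random_state=0` values are an assumption added only to make the output repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Same model as above, with a fixed seed (assumption) for repeatable output.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Render the tree as indented if/else rules, one line per node.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
print("Tree depth:", clf.get_depth())
```

Each split has the form `feature <= threshold`, which shows that the tree is binary (CART-style) and does not use the multiway splits of C4.5.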
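Python implementation of a C4.5 decision tree
The following is a simplified Python implementation of a C4.5-style decision tree. It splits on the gain ratio, but it treats every feature as categorical and does not prune: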
```python
import math
from collections import Counter

import pandas as pd


class C45DecisionTree:
    """Multiway decision tree that splits on the gain ratio, as C4.5 does.

    Simplification: every feature is treated as categorical; C4.5's threshold
    splits for continuous features and its pruning are not implemented.
    """

    def __init__(self, epsilon=0.1):
        # Used here as the minimum gain ratio required to split a node.
        self.epsilon = epsilon
        self.tree = {}
        self.feature_names = []
        self.default_label = None

    def calc_entropy(self, data):
        """Shannon entropy (in bits) of the class labels in the last column."""
        n = len(data)
        label_counts = Counter(row[-1] for row in data)
        return -sum((c / n) * math.log2(c / n) for c in label_counts.values())

    def split_data(self, data, axis, value):
        """Rows whose feature `axis` equals `value`, with that column removed."""
        return [row[:axis] + row[axis + 1:] for row in data if row[axis] == value]

    def choose_best_feature(self, data):
        """Index of the feature with the highest gain ratio, or -1 if none reaches epsilon."""
        num_features = len(data[0]) - 1
        base_entropy = self.calc_entropy(data)
        best_ratio, best_feature = 0.0, -1
        for i in range(num_features):
            new_entropy = 0.0
            split_info = 0.0
            for value in set(row[i] for row in data):
                sub_data = self.split_data(data, i, value)
                prob = len(sub_data) / len(data)
                new_entropy += prob * self.calc_entropy(sub_data)
                split_info -= prob * math.log2(prob)
            if split_info == 0:  # only one distinct value: nothing to split on
                continue
            ratio = (base_entropy - new_entropy) / split_info
            if ratio > best_ratio:
                best_ratio, best_feature = ratio, i
        return best_feature if best_ratio >= self.epsilon else -1

    def majority_cnt(self, label_list):
        """Most frequent class label."""
        return Counter(label_list).most_common(1)[0][0]

    def create_tree(self, data, labels):
        class_list = [row[-1] for row in data]
        # All samples share one class: return it as a leaf.
        if class_list.count(class_list[0]) == len(class_list):
            return class_list[0]
        # No features left: return the majority class.
        if len(data[0]) == 1:
            return self.majority_cnt(class_list)
        best_feat = self.choose_best_feature(data)
        # No feature is informative enough: return the majority class.
        if best_feat == -1:
            return self.majority_cnt(class_list)
        best_feat_label = labels[best_feat]
        my_tree = {best_feat_label: {}}
        # Copy instead of deleting in place so sibling branches are unaffected.
        sub_labels = labels[:best_feat] + labels[best_feat + 1:]
        for value in set(row[best_feat] for row in data):
            my_tree[best_feat_label][value] = self.create_tree(
                self.split_data(data, best_feat, value), sub_labels)
        return my_tree

    def fit(self, X, y):
        self.feature_names = list(X.columns)
        targets = y.tolist()
        data = [row + [label] for row, label in zip(X.values.tolist(), targets)]
        # Fallback prediction for feature values never seen during training.
        self.default_label = self.majority_cnt(targets)
        self.tree = self.create_tree(data, self.feature_names[:])
        return self

    def predict(self, X):
        return [self.predict_single(x) for x in X.values.tolist()]

    def predict_single(self, x):
        node = self.tree
        # Walk down until a leaf (a non-dict value) is reached.
        while isinstance(node, dict):
            (feature, branches), = node.items()
            value = x[self.feature_names.index(feature)]
            if value not in branches:
                return self.default_label
            node = branches[value]
        return node


# Test code
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
# Iris features are continuous; bin each into 3 quantile bins because this
# implementation only handles categorical features.
X = X.apply(lambda col: pd.qcut(col, q=3, labels=False, duplicates="drop"))
y = pd.Series(iris.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = C45DecisionTree()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
```
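The splitting criterion used above is the gain ratio: the information gain of a feature divided by its split information (the entropy of the feature's own value distribution). As a sanity check, here is a small standalone sketch of that calculation. The `entropy` and `gain_ratio` helpers and the toy weather-style data are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy in bits of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of one categorical feature with respect to the class labels."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        conditional += len(subset) / n * entropy(subset)
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value has zero split information and cannot split.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print("outlook:", gain_ratio(outlook, play))
print("windy:", gain_ratio(windy, play))
```

On this data `outlook` gets a gain ratio of 1 and `windy` gets 0, so a C4.5-style tree would split on `outlook` first.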