Implementing a C4.5 Decision Tree in Python
C4.5 is a decision-tree algorithm based on the information gain ratio; it is an improvement over ID3. The basic steps of a Python implementation of a C4.5 decision tree are as follows:
Step 1: Compute the gain ratio of each feature
First, compute the gain ratio of every feature, i.e. how much each feature contributes to the classification. The gain ratio is defined as:
$$GainRatio(D,A) = \frac{Gain(D,A)}{IV(A)}$$
where $Gain(D,A)$ is the information gain of feature $A$ on dataset $D$, and $IV(A)$ is the intrinsic value of feature $A$:
$$IV(A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$
Here $n$ is the number of distinct values of $A$, and $D_i$ is the subset of samples on which $A$ takes its $i$-th value.
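As a quick sanity check on these formulas, here is a minimal standalone sketch (the toy data is invented purely for illustration) that computes the entropy, conditional entropy, information gain, intrinsic value, and gain ratio with NumPy:
```python
import numpy as np

def entropy(labels):
    # Empirical entropy H(D) in bits
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Toy data: one feature value and one class label per sample
x = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
y = np.array([1, 1, 0, 0, 0, 0])

ent = entropy(y)                                  # H(D) ~= 0.918
cond = sum((x == v).mean() * entropy(y[x == v])   # H(D|A) ~= 0.459
           for v in np.unique(x))
gain = ent - cond                                 # Gain(D, A) ~= 0.459
iv = entropy(x)                                   # IV(A): entropy of the split itself = 1.0
print(gain / iv)                                  # GainRatio(D, A) ~= 0.459
```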
Step 2: Choose the feature with the largest gain ratio as the split feature of the current node
Select the feature whose gain ratio is largest as the split feature of the current node, partition the dataset into one subset per value of that feature, and build the tree recursively on each subset, as sketched below.
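In code, this selection step is just an argmax over per-feature scores; a minimal sketch, where the `gain_ratio` argument stands in for a scoring function such as the `calc_info_gain_ratio` method of the full implementation below:
```python
import numpy as np

def best_split_feature(X, y, gain_ratio):
    # Score each feature column and pick the one with the largest gain ratio
    scores = np.array([gain_ratio(X[:, f], y) for f in range(X.shape[1])])
    return int(np.argmax(scores))
```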
Step 3: Stopping conditions
While building the tree you need stopping conditions, for example: a preset maximum depth is reached, or the number of samples falls below a threshold.
Python implementation:
```python
import numpy as np
import pandas as pd

class DecisionTree:
    def __init__(self, epsilon=1, max_depth=10):
        # epsilon: minimum number of samples required to split a node
        self.epsilon = epsilon
        self.max_depth = max_depth
        self.tree = {}

    def calc_entropy(self, y):
        """Compute the empirical entropy H(D) in bits."""
        n = len(y)
        if n <= 1:
            return 0.0
        _, counts = np.unique(y, return_counts=True)
        probs = counts / n
        return float(-np.sum(probs * np.log2(probs)))

    def calc_cond_entropy(self, x, y):
        """Compute the conditional entropy H(D|A) for feature column x."""
        n = len(y)
        if n <= 1:
            return 0.0
        ent = 0.0
        for v in np.unique(x):
            sub_y = y[x == v]
            ent += len(sub_y) / n * self.calc_entropy(sub_y)
        return ent

    def calc_info_gain(self, x, y):
        """Compute the information gain Gain(D, A) = H(D) - H(D|A)."""
        return self.calc_entropy(y) - self.calc_cond_entropy(x, y)

    def calc_info_gain_ratio(self, x, y):
        """Compute the gain ratio Gain(D, A) / IV(A)."""
        info_gain = self.calc_info_gain(x, y)
        iv = self.calc_entropy(x)  # IV(A) is the entropy of the split itself
        if iv == 0:
            return 0.0
        return info_gain / iv

    def _majority(self, y):
        """Return the most frequent class label in y."""
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]

    def fit(self, X, y):
        """Build the decision tree and store it in self.tree."""
        self.tree = self._build(X, y, depth=0)
        return self

    def _build(self, X, y, depth):
        """Recursively build a subtree; returns a dict node or a leaf label."""
        n_samples, n_features = X.shape
        # Stop if all samples belong to the same class
        if len(np.unique(y)) == 1:
            return y[0]
        # Stop if the number of samples falls below the threshold
        if n_samples < self.epsilon:
            return self._majority(y)
        # Stop if the maximum depth is reached
        if depth >= self.max_depth:
            return self._majority(y)
        # Choose the feature with the largest gain ratio
        gains = np.array([self.calc_info_gain_ratio(X[:, f], y)
                          for f in range(n_features)])
        best_feature = int(np.argmax(gains))
        # Stop if even the best gain ratio is below the threshold
        if gains[best_feature] < 1e-4:
            return self._majority(y)
        # Recurse: one child subtree per value of the chosen feature
        node = {'feature': best_feature}
        for v in np.unique(X[:, best_feature]):
            idx = X[:, best_feature] == v
            node[v] = self._build(X[idx], y[idx], depth + 1)
        return node

    def predict(self, x):
        """Predict the class of a single sample x (a 1-D array of feature values)."""
        node = self.tree
        while isinstance(node, dict):
            node = node[x[node['feature']]]
        return node

# Example usage (assumes a data.csv with feature columns and a 'class' column)
data = pd.read_csv('data.csv')
X = data.drop(['class'], axis=1).values
y = data['class'].values
clf = DecisionTree(epsilon=5)
clf.fit(X, y)
print(clf.tree)
```
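Since data.csv is not provided, here is a self-contained check on a tiny synthetic dataset (values invented purely for illustration) using the DecisionTree class above:
```python
import numpy as np

# Toy dataset: two categorical features, binary class label
X = np.array([
    ['sunny', 'hot'],
    ['sunny', 'cool'],
    ['rain',  'hot'],
    ['rain',  'cool'],
    ['rain',  'cool'],
])
y = np.array(['no', 'no', 'yes', 'yes', 'yes'])

clf = DecisionTree(epsilon=1).fit(X, y)
print(clf.tree)   # {'feature': 0, 'rain': 'yes', 'sunny': 'no'}
print(clf.predict(np.array(['sunny', 'hot'])))  # -> 'no'
```
Note that predict walks the tree for one sample at a time, so an unseen feature value at a split node will raise a KeyError; a production implementation would fall back to the majority class in that case.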