Implementing an ID3 decision tree in Python
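Here are the steps for implementing an ID3 decision tree in Python:
1. Data preprocessing: convert the dataset into a feature matrix and a label vector.
2. Feature selection: for each feature, compute the conditional entropy of the labels given that feature, and split on the feature with the lowest conditional entropy (equivalently, the highest information gain).
3. Recursive tree construction: use the selected feature to partition the dataset into subsets, one per feature value, and recursively build a subtree for each subset.
4. Prediction: use the finished tree to classify new samples.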
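To make step 2 concrete, here is a minimal sketch of the information-gain calculation in plain Python. The toy `rows` and `labels` and the helper names `entropy` and `information_gain` are made up for illustration and are separate from the class-based implementation in this answer.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus their conditional entropy given one feature column."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: columns are (outlook, windy); labels are play / stay.
rows = [
    ("sunny", "no"),
    ("sunny", "yes"),
    ("rainy", "no"),
    ("rainy", "yes"),
]
labels = ["play", "play", "stay", "stay"]

gain_outlook = information_gain(rows, labels, 0)  # outlook separates the classes perfectly
gain_windy = information_gain(rows, labels, 1)    # windy carries no information about the class
print(gain_outlook, gain_windy)
```

In this toy data `outlook` has a gain of 1 bit and `windy` a gain of 0, so ID3 would split on `outlook` first.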
Here is the Python implementation:
```python
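import numpy as np


class Node:
    def __init__(self, feature=None, value=None, result=None):
        self.feature = feature  # index of the feature this node splits on
        self.value = value      # feature value associated with the split
        self.result = result    # predicted label (majority label for internal nodes)
        self.children = {}      # maps feature value -> child node


class DecisionTree:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon  # minimum information gain required to split
        self.tree = None        # root of the fitted tree

    # Compute the entropy of the labels
    def entropy(self, y):
        count = np.unique(y, return_counts=True)[1]
        p = count / len(y)
        return -np.sum(p * np.log2(p))

    # Compute the conditional entropy of the labels given one feature
    def conditional_entropy(self, X, y, feature):
        values = np.unique(X[:, feature])
        ce = 0
        for value in values:
            index = X[:, feature] == value
            ce += np.sum(index) / len(y) * self.entropy(y[index])
        return ce

    # Most frequent label; works for any label type, not only non-negative ints
    def majority(self, y):
        labels, counts = np.unique(y, return_counts=True)
        return labels[np.argmax(counts)]

    # Choose the best feature: lowest conditional entropy = highest information gain
    def choose_feature(self, X, y):
        features = X.shape[1]
        best_feature, best_feature_ce = None, float('inf')
        for feature in range(features):
            ce = self.conditional_entropy(X, y, feature)
            if ce < best_feature_ce:
                best_feature, best_feature_ce = feature, ce
        return best_feature

    # Build the decision tree recursively
    def build_tree(self, X, y):
        # If the dataset is empty, return None
        if len(y) == 0:
            return None
        # If all labels are identical, return a leaf node
        if len(np.unique(y)) == 1:
            return Node(result=y[0])
        # If there are no features, return a leaf with the most frequent label
        if X.shape[1] == 0:
            return Node(result=self.majority(y))
        # Otherwise, choose the best feature to split on
        feature = self.choose_feature(X, y)
        gain = self.entropy(y) - self.conditional_entropy(X, y, feature)
        # Stop when the gain is below the threshold; without this check,
        # samples with identical features but different labels recurse forever
        if gain < self.epsilon:
            return Node(result=self.majority(y))
        node = Node(feature=feature, result=self.majority(y))
        values = np.unique(X[:, feature])
        for value in values:
            index = X[:, feature] == value
            node.children[value] = self.build_tree(X[index], y[index])
        return node

    # Train the model
    def fit(self, X, y):
        self.tree = self.build_tree(X, y)

    # Predict labels for each row of X
    def predict(self, X):
        results = []
        for x in X:
            node = self.tree
            while node.children:
                child = node.children.get(x[node.feature])
                # Value not seen during training: stop and use this node's majority label
                if child is None:
                    break
                node = child
            results.append(node.result)
        return np.array(results)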
```
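Prediction walks down from the root, at each node following the branch that matches the sample's value for that node's feature. The sketch below shows the walk on a small hand-built tree stored as nested dicts. The tree, the `default` key, and the `predict_one` helper are hypothetical and are not part of the class above. Returning a default label when no branch matches is just one possible way to handle values not seen during training.

```python
# Hypothetical hand-built tree: internal nodes hold a feature index, a default
# label, and children keyed by feature value; leaves hold only a label.
tree = {
    "feature": 0,
    "default": "play",
    "children": {
        "sunny": {"label": "play"},
        "rainy": {
            "feature": 1,
            "default": "stay",
            "children": {"yes": {"label": "stay"}, "no": {"label": "play"}},
        },
    },
}


def predict_one(node, sample):
    """Follow matching branches until a leaf or an unseen feature value is reached."""
    while "children" in node:
        child = node["children"].get(sample[node["feature"]])
        if child is None:
            return node["default"]  # unseen value: fall back to the node's default label
        node = child
    return node["label"]


print(predict_one(tree, ("rainy", "no")))   # follows rainy, then no
print(predict_one(tree, ("cloudy", "no")))  # "cloudy" has no branch, so the root default is used
```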
That completes the Python implementation of the ID3 decision tree. You can test it with the following code:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt = DecisionTree()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
The reported output is:
```
Accuracy: 1.0
```
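The reported accuracy on the iris test split is 100%. Treat this figure with caution: ID3 assumes categorical features, while the iris features are continuous measurements, so every distinct value becomes its own branch and the result depends on whether each test value also appeared in the training split.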
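One common way to use ID3 with continuous features is to discretize each column into a few bins before training. The sketch below shows the idea in plain Python with equal-frequency cut points. The helpers `make_bins` and `discretize`, the choice of three bins, and the sample values are all assumptions for illustration, not part of the original answer.

```python
from bisect import bisect_right


def make_bins(values, n_bins=3):
    """Return n_bins - 1 cut points that split the sorted values into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def discretize(value, cuts):
    """Map a continuous value to a bin index in 0..len(cuts)."""
    return bisect_right(cuts, value)


# Hypothetical petal lengths (cm) standing in for one training column.
train_column = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
cuts = make_bins(train_column)

train_bins = [discretize(v, cuts) for v in train_column]
unseen_bin = discretize(5.5, cuts)  # 5.5 never appears in training but still maps to a bin
print(cuts, train_bins, unseen_bin)
```

The cut points are computed from the training column only and then reused for test data, so every test value falls into a bin the tree has already seen.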