Implementing C4.5 on the Iris Dataset
To classify the iris dataset with the C4.5 algorithm, you can follow these steps:
1. Prepare the dataset
The iris dataset contains 150 samples, each with 4 numeric features and 1 class label. You can load it with the sklearn library in Python; `load_iris` already encodes the class labels as the integers 0–2, so no further label mapping is needed.
```python
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
```
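If you want to confirm what was loaded, a quick check of the shapes and class names (the values in the comments are what `load_iris` returns):
```python
print(X.shape)            # (150, 4)
print(y.shape)            # (150,)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```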
2. Define the C4.5 algorithm
Define a C4.5 class with the following methods:
- `split()`: given the data at the current node and the list of candidate features, compute each feature's gain ratio (information gain divided by split information) and return the best feature together with the corresponding sub-datasets (see the formulas below);
- `build_tree()`: recursively build the decision tree until a stopping condition is met (too few samples at the node, the maximum depth is reached, or all samples belong to the same class);
- `predict()`: classify new data points by walking down the tree and return the predicted labels.
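For reference, these are the standard quantities C4.5 uses to choose a split: the entropy of a node's label distribution, the information gain of a feature $A$, the split information, and their ratio. The `split()` method below selects the feature with the largest gain ratio.

$$
H(D) = -\sum_{k} p_k \log_2 p_k, \qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)
$$

$$
\mathrm{SplitInfo}(D, A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}, \qquad
\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}
$$

where $D_v$ denotes the subset of samples whose value of feature $A$ equals $v$.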
```python
import numpy as np
from collections import Counter
import math

class C45DecisionTree:
    def __init__(self, min_samples_split=2, min_info_gain=1e-4, max_depth=None):
        self.min_samples_split = min_samples_split
        self.min_info_gain = min_info_gain
        self.max_depth = max_depth
        self.tree = None

    def entropy(self, y):
        # Shannon entropy of the class-label distribution
        counter = Counter(y)
        probs = [count / len(y) for count in counter.values()]
        return -sum(p * math.log(p, 2) for p in probs)

    def info_gain(self, X, y, feature):
        # Information gain and split information for one feature
        base_entropy = self.entropy(y)
        feature_values = X[:, feature]
        unique_values = np.unique(feature_values)
        new_entropy = 0.0
        split_info = 0.0
        for value in unique_values:
            subset_y = y[feature_values == value]
            ratio = len(subset_y) / len(y)
            new_entropy += ratio * self.entropy(subset_y)
            split_info -= ratio * math.log(ratio, 2)
        return base_entropy - new_entropy, split_info

    def split(self, X, y, features):
        # Choose the feature with the highest gain ratio (info gain / split info)
        max_gain_ratio = -1
        best_feature = None
        best_sets = None
        for feature in features:
            info_gain, split_info = self.info_gain(X, y, feature)
            if split_info == 0:  # only one distinct value: nothing to split on
                continue
            gain_ratio = info_gain / split_info
            if gain_ratio > max_gain_ratio:
                max_gain_ratio = gain_ratio
                best_feature = feature
                best_sets = {}
                feature_values = X[:, feature]
                for value in np.unique(feature_values):
                    mask = feature_values == value
                    best_sets[value] = (X[mask], y[mask])
        return best_feature, best_sets, max_gain_ratio

    def build_tree(self, X, y, features, depth=0):
        # Recursively grow the tree until a stopping condition is met
        n_samples, _ = X.shape
        if n_samples < self.min_samples_split or depth == self.max_depth:
            return Counter(y).most_common(1)[0][0]
        if len(np.unique(y)) == 1:
            return y[0]
        best_feature, best_sets, gain_ratio = self.split(X, y, features)
        # No usable feature or negligible gain: return the majority class
        if best_feature is None or gain_ratio < self.min_info_gain:
            return Counter(y).most_common(1)[0][0]
        tree = {best_feature: {}}
        for value, (sub_X, sub_y) in best_sets.items():
            sub_features = [f for f in features if f != best_feature]
            tree[best_feature][value] = self.build_tree(sub_X, sub_y, sub_features, depth=depth + 1)
        return tree

    def fit(self, X, y):
        # Train the tree; remember the overall majority class as a fallback
        self.default_class = Counter(y).most_common(1)[0][0]
        self.tree = self.build_tree(X, y, list(range(X.shape[1])))

    def predict(self, X):
        # Classify every row of X
        return np.array([self._predict(x, self.tree) for x in X])

    def _predict(self, x, tree):
        # Walk down the tree; unseen feature values fall back to the majority class
        if not isinstance(tree, dict):
            return tree
        feature = next(iter(tree))
        sub_tree = tree[feature]
        value = x[feature]
        if value in sub_tree:
            return self._predict(x, sub_tree[value])
        return self.default_class
```
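One caveat: the class above splits on each distinct feature value, which generalizes poorly on iris because its four features are continuous; a test sample whose exact value never occurred in training falls through to the majority-class fallback. Standard C4.5 instead handles continuous attributes with binary threshold splits (`x[feature] <= t` vs. `x[feature] > t`). The snippet below is a minimal, standalone sketch of how such a threshold could be chosen for one feature; `best_threshold_split` is an illustrative helper, not part of the class above.
```python
import math
import numpy as np
from collections import Counter

def entropy(y):
    counts = Counter(y)
    return -sum((c / len(y)) * math.log(c / len(y), 2) for c in counts.values())

def best_threshold_split(X, y, feature):
    """Return (best_gain, best_threshold) for a binary split on one continuous feature."""
    values = np.unique(X[:, feature])
    base = entropy(y)
    best_gain, best_t = 0.0, None
    # Candidate thresholds are midpoints between consecutive sorted values
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2
        left, right = y[X[:, feature] <= t], y[X[:, feature] > t]
        gain = base - (len(left) / len(y)) * entropy(left) - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```
Integrating this into the class would mean storing the chosen threshold at each node and routing samples by comparing against it at prediction time.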
3. Classify the dataset with C4.5
Use the `C45DecisionTree` class above to train and evaluate the model, and compute its accuracy on a held-out test set.
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the C4.5 model
model = C45DecisionTree()
model.fit(X_train.values, y_train)
# Predict on the test set and compute accuracy
y_pred = model.predict(X_test.values)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```
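As an optional sanity check, you can compare the result with scikit-learn's built-in `DecisionTreeClassifier` using the entropy criterion. Note that this is an optimized CART-style tree rather than true C4.5, so it only serves as a rough accuracy baseline:
```python
from sklearn.tree import DecisionTreeClassifier

# Baseline: sklearn's entropy-based tree on the same train/test split
baseline = DecisionTreeClassifier(criterion='entropy', random_state=42)
baseline.fit(X_train, y_train)
print('Baseline accuracy:', accuracy_score(y_test, baseline.predict(X_test)))
```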
This completes the C4.5 implementation for the iris dataset.