The attributes are gender, grade, class-committee membership, and employment status, where the class-label attribute is employment status. Based on the C4.5 algorithm, implement it in Python, train on 50 training samples, compute the information measures for each attribute, then evaluate on 10 test samples. Give the complete code.
Sure, here is a Python implementation based on the C4.5 algorithm:
```python
import pandas as pd
import numpy as np
import math
# Read the dataset (gender, grade, class committee, employment status;
# the class label must be in the last column)
data = pd.read_csv("dataset.csv")

# Split into training and test sets
train_data = data.sample(n=50, random_state=1)                        # 50 random samples for training
test_data = data.drop(train_data.index).sample(n=10, random_state=1)  # 10 of the remaining samples for testing
# Decision-tree node
class Node:
    def __init__(self):
        self.children = {}   # child nodes keyed by feature value
        self.feature = None  # feature used to split at this node (None for a leaf)
        self.label = None    # class label: leaf prediction / majority-class fallback
# Information entropy of the class label in a data subset
def entropy(data):
    counts = data.iloc[:, -1].value_counts()
    ent = 0.0
    for count in counts:
        p = count / len(data)
        ent -= p * math.log(p, 2)
    return ent
# Information gain of splitting on a feature
def gain(data, feature):
    values = data[feature].unique()
    ent = entropy(data)
    for value in values:
        sub_data = data[data[feature] == value]
        ent -= len(sub_data) / len(data) * entropy(sub_data)
    return ent
# Information gain ratio (C4.5): the gain divided by the split information
# (intrinsic value) of the feature, not by the entropy of the class label
def gain_ratio(data, feature):
    counts = data[feature].value_counts()
    split_info = -sum(c / len(data) * math.log(c / len(data), 2) for c in counts)
    if split_info == 0:
        return 0
    return gain(data, feature) / split_info
# Choose the feature with the largest gain ratio
def choose_feature(data):
    features = data.columns[:-1]
    best_feature = None
    max_gain_ratio = -1
    for feature in features:
        gr = gain_ratio(data, feature)
        if gr > max_gain_ratio:
            max_gain_ratio = gr
            best_feature = feature
    return best_feature
# Build the decision tree recursively
def build_tree(data):
    node = Node()
    # store the majority class at every node as a fallback for unseen feature values
    node.label = data.iloc[:, -1].value_counts().index[0]
    # if all samples belong to one class, return a leaf with that class
    if len(data.iloc[:, -1].unique()) == 1:
        return node
    # if no features are left to split on, return a leaf with the majority class
    if len(data.columns) == 1:
        return node
    # split on the feature with the largest gain ratio
    best_feature = choose_feature(data)
    node.feature = best_feature
    for value in data[best_feature].unique():
        sub_data = data[data[best_feature] == value].drop(best_feature, axis=1)
        node.children[value] = build_tree(sub_data)
    return node
# Predict the class of a single sample
def predict(node, sample):
    # leaf node: return its label
    if node.feature is None:
        return node.label
    value = sample[node.feature]
    # unseen feature value: fall back to the majority class stored at this node
    if value not in node.children:
        return node.label
    return predict(node.children[value], sample)
# Accuracy on the test set
def accuracy(test_data, root):
    correct = 0
    for i in range(len(test_data)):
        sample = test_data.iloc[i, :-1]
        label = test_data.iloc[i, -1]
        if predict(root, sample) == label:
            correct += 1
    return correct / len(test_data)
# Build the tree on the training set
root = build_tree(train_data)

# Evaluate on the test set
acc = accuracy(test_data, root)
print("Accuracy:", acc)
```
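The question also asks to report the information measures for each attribute. A minimal sketch that does so with the `entropy`, `gain`, and `gain_ratio` functions defined above (run it after `train_data` has been created):
```python
# Print the class-label entropy of the training set and, for every attribute,
# its information gain and gain ratio
print("entropy of training set:", round(entropy(train_data), 4))
for feature in train_data.columns[:-1]:
    print(feature,
          "gain =", round(gain(train_data, feature), 4),
          "gain ratio =", round(gain_ratio(train_data, feature), 4))
```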
Note that the dataset must be prepared to match this code: every sample needs a correct class label, the class-label column (employment status) must be the last column, and continuous attributes such as the grade should be discretized first, since this implementation only performs categorical splits.
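For illustration only, here is one way to generate a toy `dataset.csv` in the expected shape; the column names, value categories, and the `pd.cut` bins are assumptions made for this sketch, not part of the original answer.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 60  # 50 training samples + 10 test samples

# Hypothetical columns; the last column is the class label
toy = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "grade": rng.integers(40, 100, size=n),          # raw score (continuous)
    "class_committee": rng.choice(["yes", "no"], size=n),
    "employment": rng.choice(["employed", "unemployed"], size=n),
})

# Discretize the continuous grade into categorical bins
toy["grade"] = pd.cut(toy["grade"], bins=[0, 60, 80, 100], labels=["low", "mid", "high"])

toy.to_csv("dataset.csv", index=False)
```
Because the labels here are drawn at random, the resulting accuracy is only a smoke test of the code, not a meaningful evaluation.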