Python Implementation of the C4.5 Algorithm
Date: 2023-07-26 07:07:31
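C4.5 is a decision tree algorithm that builds a classification tree from a dataset. The Python implementation below covers its core: ID3-style recursive tree construction with information gain as the splitting criterion. Full C4.5 additionally ranks splits by gain ratio, handles continuous attributes, and prunes the tree.
First, import the required libraries: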
```python
import pandas as pd
import numpy as np
import math
```
Next, define a function that computes the Shannon entropy of a column of class labels:
```python
def entropy(target_col):
    """Shannon entropy (in bits) of the class labels in target_col."""
    _, counts = np.unique(target_col, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))
```
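As a quick sanity check, the formula $H = -\sum_i p_i \log_2 p_i$ can be evaluated on a hand-made label column. The sketch below restates the entropy function so that it runs on its own; the label arrays are made-up illustration data and are not part of the original post.

```python
import numpy as np

def entropy(target_col):
    # Shannon entropy in bits: H = -sum(p_i * log2(p_i)) over class frequencies.
    _, counts = np.unique(target_col, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

# Hypothetical label columns, for illustration only.
balanced = np.array(["yes", "no", "yes", "no"])  # two equally likely classes
pure = np.array(["yes", "yes", "yes"])           # a single class

print(entropy(balanced))  # 1.0, the maximum for two classes
print(entropy(pure))      # zero bits (NumPy may display it as -0.0)
```

A pure column carries no uncertainty, which is why the tree-building step can stop as soon as a subset contains a single class.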
Then, define a function that computes the information gain of splitting the dataset on a given attribute:
```python
def InfoGain(data, split_attribute_name, target_name="class"):
    """Information gain from splitting data on split_attribute_name."""
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    # Entropy of each subset, weighted by the fraction of rows it holds.
    # Boolean indexing avoids where().dropna(), which also drops rows with NaN in other columns.
    weighted_entropy = sum(
        (count / counts.sum()) * entropy(data[data[split_attribute_name] == val][target_name])
        for val, count in zip(vals, counts)
    )
    return total_entropy - weighted_entropy
```
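C4.5 ranks candidate attributes by gain ratio rather than raw information gain: the gain is divided by the split information, which is the entropy of the attribute's own value distribution, so attributes with many distinct values are penalized. A minimal, self-contained sketch follows. The names `info_gain`, `split_info`, and `gain_ratio` and the toy DataFrame are assumptions for illustration and do not appear in the original code.

```python
import numpy as np
import pandas as pd

def entropy(col):
    # Shannon entropy in bits of the values in col.
    _, counts = np.unique(col, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def info_gain(data, attribute, target="class"):
    # Entropy of the target minus the size-weighted entropy of each subset.
    weighted = sum(
        (len(subset) / len(data)) * entropy(subset[target])
        for _, subset in data.groupby(attribute)
    )
    return entropy(data[target]) - weighted

def split_info(data, attribute):
    # Entropy of the attribute's own value distribution.
    return entropy(data[attribute])

def gain_ratio(data, attribute, target="class"):
    si = split_info(data, attribute)
    if si == 0:
        # The attribute has a single value, so it cannot split the data.
        return 0.0
    return info_gain(data, attribute, target) / si

# Hypothetical toy data: "id" is unique per row, "windy" has two values.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "windy": ["yes", "yes", "no", "no"],
    "class": ["stay", "stay", "go", "go"],
})

print(info_gain(toy, "id"), info_gain(toy, "windy"))    # both attributes reach gain 1.0
print(gain_ratio(toy, "id"), gain_ratio(toy, "windy"))  # 0.5 versus 1.0
```

Both attributes separate the classes perfectly, so information gain cannot tell them apart, while gain ratio prefers `windy` over the row identifier. Swapping this criterion into the attribute-selection step is one way to move the code closer to C4.5.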
Next, define a recursive function that selects the best split attribute at each node and builds the tree:
```python
def ID3(data, originaldata, features, target_attribute_name="class", parent_node_class=None):
    """Recursively build a decision tree as nested dicts: {feature: {value: subtree_or_label}}."""
    # No rows left: fall back to the majority class of the full dataset.
    # Checked first, because indexing the labels of an empty subset would fail.
    if len(data) == 0:
        labels, counts = np.unique(originaldata[target_attribute_name], return_counts=True)
        return labels[np.argmax(counts)]
    # All rows share one class: return it as a leaf.
    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]
    # No features left to split on: return the parent's majority class.
    if len(features) == 0:
        return parent_node_class
    # Majority class of the current node, passed down as the fallback for its children.
    labels, counts = np.unique(data[target_attribute_name], return_counts=True)
    parent_node_class = labels[np.argmax(counts)]
    # Split on the feature with the highest information gain.
    item_values = [InfoGain(data, feature, target_attribute_name) for feature in features]
    best_feature = features[np.argmax(item_values)]
    tree = {best_feature: {}}
    features = [f for f in features if f != best_feature]
    for value in np.unique(data[best_feature]):
        sub_data = data[data[best_feature] == value]
        # Pass originaldata rather than the global dataset, so the function has no outside dependency.
        tree[best_feature][value] = ID3(sub_data, originaldata, features, target_attribute_name, parent_node_class)
    return tree
```
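The nested dictionary returned by `ID3` can be used for classification by walking from the root to a leaf. The sketch below assumes the `{feature: {value: subtree_or_label}}` layout produced above. The function name `predict`, its `default` argument, and the hand-written example tree are hypothetical and not part of the original post.

```python
def predict(tree, sample, default=None):
    # Walk down the tree until a leaf (any non-dict node) is reached.
    while isinstance(tree, dict):
        feature = next(iter(tree))     # each internal node has exactly one feature key
        branches = tree[feature]
        value = sample.get(feature)
        if value not in branches:
            # Value not seen during training: fall back to the caller's default.
            return default
        tree = branches[value]
    return tree

# Hand-written tree in the same layout, for illustration only.
example_tree = {
    "outlook": {
        "sunny": {"windy": {"yes": "stay", "no": "go"}},
        "rainy": "stay",
    }
}

print(predict(example_tree, {"outlook": "sunny", "windy": "no"}))   # go
print(predict(example_tree, {"outlook": "snowy"}, default="stay"))  # stay
```

Returning a caller-supplied default for unseen values is only one possible design choice; another would be to store each node's majority class in the tree and fall back to that.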
Finally, call the functions above to build the classification tree:
```python
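# Replace 'your_data.csv' with the path to your own data file.
dataset = pd.read_csv('your_data.csv')
# Assumes the last column is the target and is named "class",
# the default target_attribute_name of ID3.
features = list(dataset.columns[:-1])
tree = ID3(dataset, dataset, features)
print(tree)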
```
This completes the Python implementation of the decision tree core of C4.5.