Implementing a Decision Tree for the Watermelon 2.0 Dataset in Python
Sure. For implementing a decision tree on the watermelon 2.0 dataset in Python, here is the basic workflow along with sample code:
1. First, import the necessary modules:
```
import numpy as np
import pandas as pd
from math import log
import operator
```
2. Read in the watermelon 2.0 data and preprocess it:
```
watermelon_data = pd.read_csv("watermelon_2.0.csv")   # load the dataset
watermelon_data = watermelon_data.iloc[:, 1:]          # drop the ID column, keep features and the label
```
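If you don't have watermelon_2.0.csv on hand, a small in-memory DataFrame with the same layout works just as well. The column names below follow the usual watermelon 2.0 attributes, and the two rows are only illustrative, not the full 17-sample dataset:
```
# Hypothetical stand-in for watermelon_2.0.csv (illustrative rows only)
watermelon_data = pd.DataFrame(
    [["青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
     ["浅白", "蜷缩", "浊响", "模糊", "平坦", "硬滑", "否"]],
    columns=["色泽", "根蒂", "敲声", "纹理", "脐部", "触感", "好瓜"],
)
```
Whatever the source, the key assumptions in the code below are that the label (好瓜) sits in the last column and all remaining columns are discrete features.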
3. Define create_tree, the function that builds the decision tree:
```
def create_tree(data_set, features):
    # If every sample belongs to the same class, return that class as a leaf
    class_list = data_set.iloc[:, -1].tolist()
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    # If all features have been used up, return the majority class
    if len(features) == 0:
        return majority_count(class_list)
    # Choose the feature with the highest information gain
    best_feature = choose_best_feature_to_split(data_set, features)
    if best_feature == -1:
        # No feature yields positive information gain
        return majority_count(class_list)
    best_feature_name = features[best_feature]
    my_tree = {best_feature_name: {}}
    # Remove the chosen feature from the feature list
    del features[best_feature]
    # Build one subtree for every value the chosen feature takes
    feature_values = data_set.iloc[:, best_feature].tolist()
    unique_values = set(feature_values)
    for value in unique_values:
        sub_features = features[:]
        # Recurse on the subset of samples that take this value
        my_tree[best_feature_name][value] = create_tree(split_data_set(data_set, best_feature, value), sub_features)
    return my_tree
```
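The returned tree is a nested dictionary: the top-level key is the splitting feature, the second-level keys are that feature's values, and the leaves are class labels. The shape is roughly like this (the feature names and splits here are purely illustrative, not the tree actually learned from the data):
```
# Illustrative shape of the output only
example_tree = {
    "纹理": {
        "清晰": {"根蒂": {"蜷缩": "是", "硬挺": "否"}},
        "模糊": "否",
    }
}
```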
4. Define split_data_set, the function that partitions the dataset:
```
def split_data_set(data_set, axis, value):
    # `axis` is a positional column index, so look up the column name first
    col_name = data_set.columns[axis]
    ret_data_set = data_set.loc[data_set[col_name] == value].drop(col_name, axis=1)
    return ret_data_set
```
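As a quick usage sketch, assuming column 0 of watermelon_data is 色泽, the call below keeps only the 青绿 samples and drops that column:
```
# Hypothetical usage: column 0 is assumed to be 色泽
subset = split_data_set(watermelon_data, 0, "青绿")
print(subset.shape)   # fewer rows, and one fewer column than watermelon_data
```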
5. Define calc_shannon_ent, the function that computes information entropy:
```
def calc_shannon_ent(data_set):
    num_entries = len(data_set)
    # Count how many samples fall into each class (the last column is the label)
    label_counts = data_set.iloc[:, -1].value_counts()
    shannon_ent = 0.0
    for count in label_counts:
        prob = float(count) / num_entries
        shannon_ent -= prob * log(prob, 2)
    return shannon_ent
```
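As a sanity check, a toy DataFrame whose labels are split evenly between two classes should give exactly 1 bit of entropy:
```
toy = pd.DataFrame({"好瓜": ["是", "是", "否", "否"]})
print(calc_shannon_ent(toy))   # 1.0, since p(是) = p(否) = 0.5
```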
6. Define choose_best_feature_to_split, the function that selects the best feature:
```
def choose_best_feature_to_split(data_set, features):
    base_entropy = calc_shannon_ent(data_set)
    best_info_gain = 0.0
    best_feature = -1
    for i in range(len(features)):
        feat_list = data_set.iloc[:, i].tolist()
        unique_values = set(feat_list)
        new_entropy = 0.0
        # Weighted entropy of the subsets obtained by splitting on feature i
        for value in unique_values:
            sub_data_set = split_data_set(data_set, i, value)
            prob = len(sub_data_set) / float(len(data_set))
            new_entropy += prob * calc_shannon_ent(sub_data_set)
        # Information gain = entropy before the split minus weighted entropy after
        info_gain = base_entropy - new_entropy
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature
```
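Putting the last three functions together, you can already ask which attribute has the largest information gain on the full dataset. The features list below simply mirrors the non-label columns, which is how the rest of the code assumes it is built:
```
features = watermelon_data.columns[:-1].tolist()
best = choose_best_feature_to_split(watermelon_data, features)
print(features[best])   # the attribute with the highest information gain
```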
7. Define majority_count, the function that returns the most frequent class:
```
def majority_count(class_list):
    # Tally how many times each class appears and return the most frequent one
    class_count = {}
    for vote in class_list:
        if vote not in class_count:
            class_count[vote] = 0
        class_count[vote] += 1
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
```
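With all the pieces in place, building the tree only requires the data and the list of feature names. Note that create_tree deletes entries from the list it is given, so pass in a copy if you want to reuse it:
```
features = watermelon_data.columns[:-1].tolist()
watermelon_tree = create_tree(watermelon_data, features[:])   # pass a copy; create_tree mutates it
print(watermelon_tree)
```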
That covers the basic workflow and sample code for implementing a decision tree on the watermelon 2.0 dataset in Python. Feel free to ask if you have any questions.