Decision Tree Algorithm: A Python Implementation
Posted: 2023-07-20 14:43:21
Below is a simple Python implementation of a decision tree algorithm (ID3, which splits on information gain):
```python
from math import log

# Compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)      # total number of samples
    labelCounts = {}               # count of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # probability of this label
        shannonEnt -= prob * log(prob, 2)            # H = -sum(p * log2(p))
    return shannonEnt

# Split the dataset on a given feature (axis) and value;
# the chosen feature is removed from the returned subsets
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Choose the best feature to split on (highest information gain)
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # last column is the class label
    baseEntropy = calcShannonEnt(dataSet)  # entropy before splitting
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]  # all values of feature i
        uniqueVals = set(featList)                      # unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)   # split on feature i
            prob = len(subDataSet) / float(len(dataSet))   # weight of this branch
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain of feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# Majority vote: decide the class of a leaf node
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return sortedClassCount[0][0]

# Recursively build the decision tree
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]     # all class labels
    if classList.count(classList[0]) == len(classList):  # all labels identical: leaf
        return classList[0]
    if len(dataSet[0]) == 1:  # no features left: majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # pick the best feature
    bestFeatLabel = labels[bestFeat]              # its human-readable name
    myTree = {bestFeatLabel: {}}
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]      # copy so the caller's label list is untouched
        del subLabels[bestFeat]    # drop the feature we just used
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)  # recurse
    return myTree
```
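As a quick sanity check on the entropy formula used in `calcShannonEnt`, here is a hand-worked example. The label counts (2 "yes", 3 "no") are toy numbers chosen for illustration, not data from the original post:

```python
from math import log

# 2 'yes' and 3 'no' labels out of 5 samples
p_yes, p_no = 2 / 5, 3 / 5

# H = -(p_yes * log2(p_yes) + p_no * log2(p_no))
entropy = -(p_yes * log(p_yes, 2) + p_no * log(p_no, 2))
print(round(entropy, 4))  # → 0.971
```

A uniform 50/50 split would give the maximum entropy of 1.0 bit; a pure dataset (all one label) gives 0.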
This implementation covers computing entropy, splitting the dataset on a given feature, choosing the best split, majority voting for leaf nodes, and recursively building the tree. To use it, pass in the dataset (each row is a list of feature values with the class label as the last element) together with the list of feature names; the decision tree is returned as nested dictionaries.
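To see the feature-selection step on concrete numbers, the sketch below recomputes information gain by hand for a hypothetical five-sample dataset with two binary features (the data and the inline `entropy` helper are illustrative assumptions, not part of the original code):

```python
from math import log

def entropy(rows):
    """Shannon entropy of the class labels (last column of each row)."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum((c / len(rows)) * log(c / len(rows), 2) for c in counts.values())

# Hypothetical dataset: two binary features, class label last
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

base = entropy(data)  # entropy before splitting
gains = []
for i in range(2):    # for each feature column
    remainder = 0.0
    for v in set(row[i] for row in data):
        subset = [row for row in data if row[i] == v]
        remainder += len(subset) / len(data) * entropy(subset)  # weighted entropy
    gains.append(base - remainder)

print([round(g, 3) for g in gains])  # → [0.42, 0.171]
```

Feature 0 yields the larger gain, so `chooseBestFeatureToSplit` would return index 0 for this dataset and `createTree` would split on it first.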