Decision Tree Python Code Implementation

Decision trees can be implemented in Python with the scikit-learn library. Below is a simple example:

```python
# Import the required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create the decision tree classifier
clf = DecisionTreeClassifier()

# Train the model on the training set
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Compute the model's accuracy
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```

This code first imports the required libraries: `datasets` to load the dataset, `train_test_split` to split it into training and test sets, `DecisionTreeClassifier` to create the decision tree classifier, and `metrics` to compute the model's accuracy.

The code then loads the classic iris dataset as example data and splits it into training and test sets, reserving 30% of the samples for testing. It creates a decision tree classifier object `clf`, fits it on the training set, predicts on the test set, and prints the accuracy.

This is only a minimal decision tree example; you can modify and extend the code to fit your own needs.
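If you also want to inspect the rules the tree learned, scikit-learn's `sklearn.tree.export_text` prints them as indented if/else text. The following is a minimal sketch that assumes the `clf` and `iris` objects from the example above:

```python
from sklearn.tree import export_text

# Print the learned tree as indented decision rules;
# feature_names makes the output use the iris column names.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

You can also pass `max_depth=...` to `DecisionTreeClassifier` to limit the tree's depth and reduce overfitting.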

Related Recommendations

ID3 decision tree on the iris dataset, implemented in Python:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

class Node:
    def __init__(self, feature=None, value=None, target=None, left=None, right=None):
        self.feature = feature  # feature used to split the data
        self.value = value      # feature value routed to the left child
        self.target = target    # class label of a leaf node
        self.left = left        # left child
        self.right = right      # right child

class ID3DecisionTree:
    def __init__(self):
        self.tree = None  # the decision tree

    # compute the information entropy
    def _entropy(self, y):
        labels = np.unique(y)
        probs = [np.sum(y == label) / len(y) for label in labels]
        return -np.sum([p * np.log2(p) for p in probs])

    # compute the conditional entropy given a feature
    def _conditional_entropy(self, X, y, feature):
        feature_values = np.unique(X[:, feature])
        probs = [np.sum(X[:, feature] == value) / len(X) for value in feature_values]
        entropies = [self._entropy(y[X[:, feature] == value]) for value in feature_values]
        return np.sum([p * e for p, e in zip(probs, entropies)])

    # choose the best feature (minimum conditional entropy = maximum information gain)
    def _select_feature(self, X, y):
        n_features = X.shape[1]
        entropies = [self._conditional_entropy(X, y, feature) for feature in range(n_features)]
        return np.argmin(entropies)

    # build the decision tree recursively; this toy version makes a binary
    # split between the smallest value of the chosen feature and everything else
    def _build_tree(self, X, y):
        if len(np.unique(y)) == 1:
            # leaf node: every sample has the same class
            return Node(target=y[0])
        feature = self._select_feature(X, y)
        feature_values = np.unique(X[:, feature])
        if len(feature_values) < 2:
            # no further split possible: leaf with the most common class
            return Node(target=np.argmax(np.bincount(y)))
        left_mask = X[:, feature] == feature_values[0]
        right_mask = ~left_mask
        left = self._build_tree(X[left_mask], y[left_mask])     # build the left subtree
        right = self._build_tree(X[right_mask], y[right_mask])  # build the right subtree
        return Node(feature=feature, value=feature_values[0], left=left, right=right)

    # train the decision tree
    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

    # predict a single sample
    def _predict_sample(self, x):
        node = self.tree
        while node.target is None:
            node = node.left if x[node.feature] == node.value else node.right
        return node.target

    # predict multiple samples
    def predict(self, X):
        return np.array([self._predict_sample(x) for x in X])

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=1)

# Train the model
model = ID3DecisionTree()
model.fit(train_X, train_y)

# Predict on the test set
pred_y = model.predict(test_X)

# Compute the accuracy
accuracy = np.sum(pred_y == test_y) / len(test_y)
print('Accuracy:', accuracy)
```

Note that ID3's equality-based splits suit discrete features; for continuous measurements like iris's, the C4.5 version below, which splits on numeric thresholds, is a much better fit.

C4.5 decision tree implemented in Python:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

class Node:
    def __init__(self, feature=None, threshold=None, target=None, left=None, right=None):
        self.feature = feature      # feature used to split the data
        self.threshold = threshold  # threshold used to split the data
        self.target = target        # class label of a leaf node
        self.left = left            # left child
        self.right = right          # right child

class C45DecisionTree:
    def __init__(self, min_samples_split=2, min_gain_ratio=1e-4):
        self.min_samples_split = min_samples_split  # minimum samples on each side of a split
        self.min_gain_ratio = min_gain_ratio        # minimum gain ratio required to split
        self.tree = None                            # the decision tree

    # compute the information entropy
    def _entropy(self, y):
        labels = np.unique(y)
        probs = [np.sum(y == label) / len(y) for label in labels]
        return -np.sum([p * np.log2(p) for p in probs])

    # compute the conditional entropy of a binary split at `threshold`
    def _conditional_entropy(self, X, y, feature, threshold):
        left_mask = X[:, feature] <= threshold
        right_mask = X[:, feature] > threshold
        left_prob = np.sum(left_mask) / len(X)
        right_prob = np.sum(right_mask) / len(X)
        entropies = [self._entropy(y[left_mask]), self._entropy(y[right_mask])]
        return np.sum([p * e for p, e in zip([left_prob, right_prob], entropies)])

    # compute the information gain
    def _information_gain(self, X, y, feature, threshold):
        return self._entropy(y) - self._conditional_entropy(X, y, feature, threshold)

    # compute the gain ratio (information gain divided by the split information)
    def _gain_ratio(self, X, y, feature, threshold):
        gain = self._information_gain(X, y, feature, threshold)
        left_p = np.sum(X[:, feature] <= threshold) / len(X)
        right_p = np.sum(X[:, feature] > threshold) / len(X)
        split_info = -np.sum([p * np.log2(p) for p in [left_p, right_p] if p > 0])
        return gain / split_info if split_info != 0 else 0

    # choose the best feature and split threshold
    def _select_feature_and_threshold(self, X, y):
        n_features = X.shape[1]
        max_gain_ratio = -1
        best_feature, best_threshold = None, None
        for feature in range(n_features):
            for threshold in np.unique(X[:, feature]):
                n_left = np.sum(X[:, feature] <= threshold)
                n_right = np.sum(X[:, feature] > threshold)
                if n_left >= self.min_samples_split and n_right >= self.min_samples_split:
                    gain_ratio = self._gain_ratio(X, y, feature, threshold)
                    if gain_ratio > max_gain_ratio:
                        max_gain_ratio = gain_ratio
                        best_feature, best_threshold = feature, threshold
        if max_gain_ratio < self.min_gain_ratio:
            # gain too small, or no candidate split found: signal "no split"
            return None, None
        return best_feature, best_threshold

    # build the decision tree recursively
    def _build_tree(self, X, y):
        if len(np.unique(y)) == 1:
            # leaf node: every sample has the same class
            return Node(target=y[0])
        feature, threshold = self._select_feature_and_threshold(X, y)
        if feature is None or threshold is None:
            # no valid split: leaf with the most common class
            return Node(target=np.argmax(np.bincount(y)))
        left_mask = X[:, feature] <= threshold
        right_mask = X[:, feature] > threshold
        left = self._build_tree(X[left_mask], y[left_mask])     # build the left subtree
        right = self._build_tree(X[right_mask], y[right_mask])  # build the right subtree
        return Node(feature=feature, threshold=threshold, left=left, right=right)

    # train the decision tree
    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

    # predict a single sample
    def _predict_sample(self, x):
        node = self.tree
        while node.target is None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.target

    # predict multiple samples
    def predict(self, X):
        return np.array([self._predict_sample(x) for x in X])

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=1)

# Train the model
model = C45DecisionTree(min_samples_split=5)
model.fit(train_X, train_y)

# Predict on the test set
pred_y = model.predict(test_X)

# Compute the accuracy
accuracy = np.sum(pred_y == test_y) / len(test_y)
print('Accuracy:', accuracy)
```
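As a quick sanity check on the shared `_entropy` helper (a small sketch of ours, not part of the original snippets): a 50/50 label split carries exactly one bit of entropy, and a pure node carries none.

```python
import numpy as np

tree = C45DecisionTree()
# H = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit for a 50/50 split
print(tree._entropy(np.array([0, 0, 1, 1])))  # -> 1.0
# A pure node has zero entropy (printed as -0.0 due to the leading minus)
print(tree._entropy(np.array([1, 1, 1, 1])))  # -> -0.0
```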
The following is an ID3-style decision tree implementation in Python, selecting splits by information gain, including a function that prints the tree structure:

```python
from math import log
import operator

def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of the dataset."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def createDataSet():
    """Create the example dataset."""
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

def splitDataSet(dataSet, axis, value):
    """Split the dataset on the given feature, dropping that feature column."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    """Choose the split with the largest information gain."""
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    """Choose the class of a leaf node by majority vote."""
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    """Create the decision tree. Note: this mutates `labels`; pass a copy to keep the original."""
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop when all samples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # all features used: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)  # choose the best feature
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy the remaining feature names
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def printTree(myTree, depth=0):
    """Print the tree structure with indentation."""
    if not isinstance(myTree, dict):
        print("%s%s" % (" " * depth, myTree))
    else:
        for key, val in myTree.items():
            print("%s%s" % (" " * depth, key))
            printTree(val, depth + 1)

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    myTree = createTree(dataSet, labels)
    printTree(myTree)
```

Running it prints the tree structure:

```
no surfacing
 0
  no
 1
  flippers
   0
    no
   1
    yes
```
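The snippet above only builds and prints the tree. If you also want to classify new samples with it, a small recursive lookup over the nested-dict structure works; the `classify` helper below is an addition of ours (a hedged sketch, not part of the original code):

```python
def classify(input_tree, feat_labels, test_vec):
    """Walk the nested-dict tree and return the predicted class for test_vec."""
    first_key = list(input_tree.keys())[0]     # feature name at this node
    second_dict = input_tree[first_key]
    feat_index = feat_labels.index(first_key)  # column of that feature
    subtree = second_dict[test_vec[feat_index]]
    if isinstance(subtree, dict):              # internal node: keep descending
        return classify(subtree, feat_labels, test_vec)
    return subtree                             # leaf: the class label

# Example: an animal that surfaces but has no flippers.
# createTree mutates `labels`, so pass it a copy when building the tree.
dataSet, labels = createDataSet()
myTree = createTree(dataSet, labels[:])
print(classify(myTree, labels, [1, 0]))  # -> 'no'
```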
The decision tree algorithm can also be implemented with the `tree` module of the sklearn library. First, import the module: `from sklearn import tree`. Next, instantiate a `DecisionTreeClassifier` object and set its parameters as needed. These include `criterion` (the split criterion, `'entropy'` or `'gini'`), `random_state` (the random seed), and `splitter` (the split strategy, `'best'` or `'random'`). For example: `clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=None, splitter='best')`. Then train the model on the training set: `clf = clf.fit(X_train, y_train)`. Next, score the model on the test set (the score ranges from 0 to 1): `test_score = clf.score(X_test, y_test)`. That is the basic decision-tree classification workflow, where `X_train` and `y_train` are the training features and labels, and `X_test` and `y_test` are the test features and labels. You can also use `tree.export_graphviz` to export the fitted tree in DOT format for plotting, for example: `tree.export_graphviz(clf, out_file='tree.dot')`. That covers the basic implementation; you can tune the parameters and extend the code as needed to improve the model's performance. [1][2][3]

References:
- [1][2] 决策树算法python实现: https://blog.csdn.net/qq_46033892/article/details/126234841
- [3] 决策树算法Python实现: https://blog.csdn.net/qq_46465907/article/details/120431621
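Putting those steps together, a minimal end-to-end sketch might look like the following (the iris dataset is assumed purely for illustration):

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

# Load example data and split it into training and test sets
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Instantiate and train the classifier with entropy as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
clf = clf.fit(X_train, y_train)

# Score on the test set (mean accuracy, between 0 and 1)
print(clf.score(X_test, y_test))

# Export the fitted tree in DOT format for plotting with Graphviz
tree.export_graphviz(clf, out_file='tree.dot')
```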
Below is a simple Python implementation of a CART decision tree:

```python
import numpy as np

class CARTDecisionTree:
    def __init__(self, max_depth=10, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split

    def fit(self, X, y):
        self.tree = self.build_tree(X, y)

    def build_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))

        # Check whether we should stop splitting
        if (depth >= self.max_depth or n_labels == 1
                or n_samples < self.min_samples_split):
            return np.argmax(np.bincount(y))

        # Find the best split feature and threshold
        best_feature, best_threshold = self.get_best_split(X, y, n_features)
        if best_feature is None:
            # no valid split found: return the majority class
            return np.argmax(np.bincount(y))

        # Split the samples and build the subtrees recursively
        left_indices = X[:, best_feature] < best_threshold
        right_indices = X[:, best_feature] >= best_threshold
        left_subtree = self.build_tree(X[left_indices], y[left_indices], depth + 1)
        right_subtree = self.build_tree(X[right_indices], y[right_indices], depth + 1)
        return {'feature': best_feature, 'threshold': best_threshold,
                'left': left_subtree, 'right': right_subtree}

    def get_best_split(self, X, y, n_features):
        best_gini = float('inf')
        best_feature, best_threshold = None, None
        # Scan all features and thresholds for the best split
        for feature in range(n_features):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_indices = X[:, feature] < threshold
                right_indices = X[:, feature] >= threshold
                # np.sum counts the True entries of the boolean masks
                if np.sum(left_indices) == 0 or np.sum(right_indices) == 0:
                    continue
                gini = self.gini_index(y, left_indices, right_indices)
                if gini < best_gini:
                    best_gini = gini
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def gini_index(self, y, left_indices, right_indices):
        n_left, n_right = np.sum(left_indices), np.sum(right_indices)
        gini_left, gini_right = 0, 0
        if n_left > 0:
            _, counts_left = np.unique(y[left_indices], return_counts=True)
            gini_left = 1 - np.sum((counts_left / n_left) ** 2)
        if n_right > 0:
            _, counts_right = np.unique(y[right_indices], return_counts=True)
            gini_right = 1 - np.sum((counts_right / n_right) ** 2)
        # Weighted average of the two children's Gini impurities
        return (n_left * gini_left + n_right * gini_right) / (n_left + n_right)

    def predict(self, X):
        return np.array([self.predict_sample(x, self.tree) for x in X])

    def predict_sample(self, x, tree):
        if not isinstance(tree, dict):  # leaf node: the stored class label
            return tree
        feature, threshold = tree['feature'], tree['threshold']
        if x[feature] < threshold:
            return self.predict_sample(x, tree['left'])
        return self.predict_sample(x, tree['right'])
```

Note that this CART implementation only supports classification. To use it for regression, you would need to replace the `gini_index` method with another evaluation criterion (such as MSE).
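For completeness, here is one way to try the class out (a hedged sketch; the iris data and the accuracy computation are our own additions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Fit the hand-rolled CART tree on the iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)

model = CARTDecisionTree(max_depth=5)
model.fit(X_train, y_train)

# Accuracy on the held-out test set
pred = model.predict(X_test)
print('Accuracy:', np.mean(pred == y_test))
```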
Sure, here is a simple decision tree in Python that does not rely on any third-party libraries:

```python
import math

class DecisionTree:
    def __init__(self, data, feature_names):
        # Each row of `data` is [feature_1, ..., feature_n, class_label];
        # `feature_names` names the feature columns.
        self.feature_names = feature_names
        features = [row[:-1] for row in data]
        labels = [row[-1] for row in data]
        self.tree = self.build_tree(features, labels, feature_names[:])

    def build_tree(self, features, labels, feature_names):
        # If every sample has the same class, return that class directly
        if labels.count(labels[0]) == len(labels):
            return labels[0]
        # If the feature set is empty, return the most common class
        if len(features[0]) == 0:
            return self.majority_vote(labels)
        # Choose the best feature to split on
        best_index = self.get_best_feature(features, labels)
        best_name = feature_names[best_index]
        tree = {best_name: {}}
        # Remove the chosen feature before recursing
        sub_names = feature_names[:best_index] + feature_names[best_index + 1:]
        unique_values = set(row[best_index] for row in features)
        for value in unique_values:
            sub_features = [row[:best_index] + row[best_index + 1:]
                            for row in features if row[best_index] == value]
            sub_labels = [label for row, label in zip(features, labels)
                          if row[best_index] == value]
            tree[best_name][value] = self.build_tree(sub_features, sub_labels, sub_names)
        return tree

    def get_best_feature(self, features, labels):
        num_features = len(features[0])
        base_entropy = self.calculate_entropy(labels)
        best_info_gain = 0.0
        best_feature_index = 0
        for i in range(num_features):
            unique_values = set(row[i] for row in features)
            new_entropy = 0.0
            for value in unique_values:
                sub_labels = [labels[j] for j in range(len(labels))
                              if features[j][i] == value]
                prob = len(sub_labels) / float(len(labels))
                new_entropy += prob * self.calculate_entropy(sub_labels)
            info_gain = base_entropy - new_entropy
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_feature_index = i
        return best_feature_index

    def calculate_entropy(self, labels):
        num_labels = len(labels)
        label_counts = {}
        # Count how often each class occurs
        for label in labels:
            label_counts[label] = label_counts.get(label, 0) + 1
        entropy = 0.0
        for key in label_counts:
            prob = float(label_counts[key]) / num_labels
            entropy -= prob * math.log(prob, 2)
        return entropy

    def majority_vote(self, labels):
        label_counts = {}
        # Count how often each class occurs
        for label in labels:
            label_counts[label] = label_counts.get(label, 0) + 1
        # Return the most frequent class
        return max(label_counts.items(), key=lambda x: x[1])[0]

    def classify(self, input_tree, feature_names, sample):
        # Walk the tree, looking features up by name in the full name list
        first_key = list(input_tree.keys())[0]
        second_dict = input_tree[first_key]
        feature_index = feature_names.index(first_key)
        value = sample[feature_index]
        if value not in second_dict:
            return None  # feature value never seen during training
        subtree = second_dict[value]
        if isinstance(subtree, dict):
            return self.classify(subtree, feature_names, sample)
        return subtree
```

This code implements a basic ID3 decision tree algorithm. You can use it to build a decision tree model and classify new data with it.
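As a quick usage sketch (the toy dataset below mirrors the 'no surfacing'/'flippers' example used earlier and is our own addition):

```python
# Rows are [no surfacing, flippers, class]
data = [[1, 1, 'yes'],
        [1, 1, 'yes'],
        [1, 0, 'no'],
        [0, 1, 'no'],
        [0, 1, 'no']]
feature_names = ['no surfacing', 'flippers']

dt = DecisionTree(data, feature_names)
print(dt.tree)                                      # the nested-dict tree
print(dt.classify(dt.tree, feature_names, [1, 0]))  # -> 'no'
```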
