Hand-written decision tree: implementing accuracy, precision, recall, and F-measure
Date: 2023-12-21 19:03:12
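Implementing a decision tree has two parts: building the tree, and using the built tree to make predictions. The sections below describe how to compute accuracy, precision, recall, and the F-measure for those predictions.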
1. Accuracy
Accuracy is the ratio of correctly classified samples to the total number of samples:
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
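Here TP (true positives) is the number of samples that are actually positive and are predicted positive; TN (true negatives) is the number that are actually negative and are predicted negative; FP (false positives) is the number that are actually negative but are predicted positive; FN (false negatives) is the number that are actually positive but are predicted negative.
While the decision tree makes its predictions, count TP, TN, FP, and FN, then compute accuracy from those four counts.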
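As a minimal sketch of that counting step, the snippet below tallies the four counts for a binary problem and derives accuracy from them. The function names `confusion_counts` and `accuracy`, the `positive` parameter, and the toy label lists are illustrative assumptions, not part of the original code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if t == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN); 0.0 for an empty input."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)             # 3 3 1 1
print(accuracy(tp, tn, fp, fn))   # 0.75
```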
2. Precision
Precision is the fraction of samples predicted positive that are actually positive:
$$Precision=\frac{TP}{TP+FP}$$
3. Recall
Recall is the fraction of actually positive samples that the classifier predicts as positive:
$$Recall=\frac{TP}{TP+FN}$$
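Both ratios are undefined when their denominator is zero, for example when the classifier predicts no positives at all. The sketch below shows one possible way to handle that; returning 0.0 in the undefined case is an assumed convention, and the helper names and counts are made up for illustration.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive (assumed convention)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives (assumed convention)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Illustrative counts: 3 true positives, 1 false positive, 2 false negatives
print(precision(3, 1))  # 0.75
print(recall(3, 2))     # 0.6
print(precision(0, 0))  # 0.0 (no positive predictions)
```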
4. F-measure
The F-measure combines precision and recall into a single metric; common variants are F1, F2, and F0.5. F1 is the harmonic mean of precision and recall:
$$F_1=\frac{2\cdot Precision\cdot Recall}{Precision+Recall}$$
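F1, F2, and F0.5 are all instances of the general F-beta formula, $F_\beta=\frac{(1+\beta^2)\cdot Precision\cdot Recall}{\beta^2\cdot Precision+Recall}$, where a larger $\beta$ weights recall more heavily. The following is a small sketch of that formula; the function name `f_beta` and the example values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta=1 gives F1, beta=2 gives F2, beta=0.5 gives F0.5."""
    b2 = beta * beta
    denom = b2 * precision + recall
    # Return 0.0 when both precision and recall are zero (assumed convention)
    return (1 + b2) * precision * recall / denom if denom else 0.0


p, r = 0.75, 0.6  # illustrative values
print(round(f_beta(p, r, beta=1.0), 4))  # 0.6667
print(round(f_beta(p, r, beta=2.0), 4))  # 0.625  (closer to recall)
print(round(f_beta(p, r, beta=0.5), 4))  # 0.7143 (closer to precision)
```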
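A hand-written decision tree can follow these steps:
1. Define a data structure that holds the attributes and methods of a tree node;
2. Implement tree construction, recursively generating child nodes until a stopping condition is met;
3. Implement prediction by traversing the tree and following the decision condition at each node;
4. While predicting, count TP, TN, FP, and FN, then compute accuracy, precision, recall, and the F-measure.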
Code example (using the iris dataset):
```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


# Decision tree node
class TreeNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # index of the splitting feature
        self.threshold = threshold  # splitting threshold
        self.left = left            # subtree for feature < threshold
        self.right = right          # subtree for feature >= threshold
        self.label = label          # class label (leaf nodes only)


# Gini impurity of a label array
def cal_gini(y):
    n = len(y)
    if n == 0:
        return 0.0
    gini = 1.0
    for label in np.unique(y):
        p = np.sum(y == label) / n
        gini -= p ** 2
    return gini


# Reduction in Gini impurity from splitting on feature < threshold
def cal_gini_gain(X, y, feature, threshold):
    n = len(y)
    left_idx = X[:, feature] < threshold
    left_y, right_y = y[left_idx], y[~left_idx]
    return (cal_gini(y)
            - len(left_y) / n * cal_gini(left_y)
            - len(right_y) / n * cal_gini(right_y))


# Pick the feature and threshold with the largest Gini gain
def choose_best_feature(X, y):
    best_feature, best_threshold, best_gini_gain = None, None, -1
    for feature in range(X.shape[1]):
        # Skip the smallest value: it would leave the left branch empty
        for threshold in np.unique(X[:, feature])[1:]:
            gini_gain = cal_gini_gain(X, y, feature, threshold)
            if gini_gain > best_gini_gain:
                best_gini_gain = gini_gain
                best_feature = feature
                best_threshold = threshold
    return best_feature, best_threshold


# Most frequent label (a NumPy array has no .count method)
def majority_label(y):
    return Counter(y.tolist()).most_common(1)[0][0]


# Build the tree recursively
def build_tree(X, y, max_depth=5):
    if max_depth == 0 or len(np.unique(y)) == 1:
        return TreeNode(label=majority_label(y))
    feature, threshold = choose_best_feature(X, y)
    if feature is None:  # no valid split: all rows have identical features
        return TreeNode(label=majority_label(y))
    left_idx = X[:, feature] < threshold
    right_idx = ~left_idx
    left_tree = build_tree(X[left_idx], y[left_idx], max_depth - 1)
    right_tree = build_tree(X[right_idx], y[right_idx], max_depth - 1)
    return TreeNode(feature, threshold, left_tree, right_tree)


# Predict the label of a single sample
def predict(tree, x):
    if tree.label is not None:
        return tree.label
    if x[tree.feature] < tree.threshold:
        return predict(tree.left, x)
    return predict(tree.right, x)


# Compute accuracy, precision, recall, and F1 (positive class = 1)
def evaluate(y_true, y_pred):
    TP, TN, FP, FN = 0, 0, 0, 0
    for t, p in zip(y_true, y_pred):
        if t == p:
            if t == 1:
                TP += 1
            else:
                TN += 1
        else:
            if t == 1:
                FN += 1
            else:
                FP += 1
    accuracy = (TP + TN) / len(y_true)
    # Guard against zero denominators (e.g. no positive predictions)
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Turn the labels into a binary problem: class 1 versus the rest
y_binary = (y == 1).astype(int)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)
# Build the decision tree
tree = build_tree(X_train, y_train)
# Predict the test set
y_pred = [predict(tree, x) for x in X_test]
# Compute the evaluation metrics
accuracy, precision, recall, f1 = evaluate(y_test, y_pred)
print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1-score: {:.4f}".format(f1))
```
Reported output:
```
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000
```
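The code above implements a simple decision tree classifier and computes accuracy, precision, recall, and the F-measure. Iris is a relatively simple dataset that a decision tree can fit well, which is why every metric reaches 1 in the reported output; keep in mind that the test split here holds only 30 samples (20% of 150), so the exact values depend on the split and the tree depth. In practice, choose the evaluation metrics that suit the dataset and the task.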
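One way to sanity-check hand-written metric code is to compare it against the functions in `sklearn.metrics`. The sketch below does that on a small made-up label list; the helper `manual_metrics` is a hypothetical, compact restatement of the four formulas rather than the function used in the example above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def manual_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 from raw counts (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1


# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

manual = manual_metrics(y_true, y_pred)
reference = (
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
for name, m, ref in zip(("accuracy", "precision", "recall", "f1"), manual, reference):
    print(name, m, ref, abs(m - ref) < 1e-12)
```

If the two columns disagree on real data, the usual suspects are a swapped FP/FN branch or a different choice of positive class.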