Implementing a decision tree that selects splits by the Gini index
Date: 2023-10-20 18:05:49
A Python implementation of the decision tree algorithm
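Decision trees are a common machine learning method for both classification and regression. One widely used way to choose the split at each node is the Gini index, the criterion used by CART classification trees: the split whose child nodes have the lowest size-weighted impurity is preferred.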
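For reference, the standard definitions behind this criterion can be written as follows. The symbols $D$, $p_k$, $a$, $t$, $D_L$ and $D_R$ are notation chosen here for illustration, not names taken from the code. For a node whose sample set $D$ contains $K$ classes, with $p_k$ the fraction of samples in class $k$:

$$
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}
$$

Splitting $D$ on feature $a$ at threshold $t$ gives $D_L = \{x \in D : x_a < t\}$ and $D_R = \{x \in D : x_a \ge t\}$, and the quantity to minimise is the size-weighted sum of the child impurities:

$$
\mathrm{Gini\_index}(D, a, t) = \frac{|D_L|}{|D|}\,\mathrm{Gini}(D_L) + \frac{|D_R|}{|D|}\,\mathrm{Gini}(D_R)
$$

A pure node (all samples in one class) has $\mathrm{Gini}(D) = 0$, and a perfectly balanced two-class node has $\mathrm{Gini}(D) = 0.5$.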
Below is a Python implementation of a decision tree that selects splits by the Gini index:
```python
import numpy as np
from collections import Counter


class DecisionTree:
    """CART-style classification tree that chooses splits by the Gini index."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # None: grow until every leaf is pure

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.n_classes_ = len(np.unique(y))
        self.n_features_ = X.shape[1]
        self.tree_ = self._grow_tree(X, y)
        return self

    def predict(self, X):
        return np.array([self._predict(inputs) for inputs in np.asarray(X)])

    def _gini(self, y):
        """Gini impurity of a label array: 1 - sum_k p_k^2."""
        if y.size == 0:
            return 0.0
        # Count the labels actually present, so labels need not be 0..K-1.
        _, counts = np.unique(y, return_counts=True)
        p = counts / y.size
        return 1.0 - np.sum(p ** 2)

    def _best_split(self, X, y):
        """Find the (feature, threshold) pair with the lowest weighted Gini index."""
        best_gini = np.inf
        best_feature, best_threshold = None, None
        for feature in range(self.n_features_):
            # Skip the smallest value: `x < min` would leave the left child empty.
            thresholds = np.unique(X[:, feature])[1:]
            for threshold in thresholds:
                left = X[:, feature] < threshold
                n_left = left.sum()
                # Weight each child's impurity by its share of the samples.
                gini = (n_left * self._gini(y[left])
                        + (y.size - n_left) * self._gini(y[~left])) / y.size
                if gini < best_gini:
                    best_gini = gini
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def _grow_tree(self, X, y, depth=0):
        """Recursively build the tree: leaves are class labels, internal nodes are dicts."""
        count = Counter(y.tolist())
        majority = max(count, key=count.get)
        # Stop when the node is pure or the depth limit is reached.
        if len(count) == 1 or depth == self.max_depth:
            return majority
        feature, threshold = self._best_split(X, y)
        if feature is None:  # every feature is constant, nothing left to split on
            return majority
        indices_left = X[:, feature] < threshold
        return {
            'feature': feature,
            'threshold': threshold,
            'left': self._grow_tree(X[indices_left], y[indices_left], depth + 1),
            'right': self._grow_tree(X[~indices_left], y[~indices_left], depth + 1),
        }

    def _predict(self, inputs):
        """Walk from the root to a leaf and return that leaf's class label."""
        tree = self.tree_
        while isinstance(tree, dict):
            if inputs[tree['feature']] < tree['threshold']:
                tree = tree['left']
            else:
                tree = tree['right']
        return tree
```
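In the code above, `_gini` computes the Gini impurity of a set of labels, `_best_split` searches every feature and candidate threshold for the best split point, `_grow_tree` builds the tree recursively, and `_predict` walks the tree from the root down to a leaf to classify a single sample. `fit` takes the dataset X and the labels y and builds the decision tree model; `predict` takes a dataset X and returns the predicted class for each row.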
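To make the split criterion concrete, here is a minimal standalone sketch of the Gini computation on a toy dataset. The helper names `gini` and `weighted_gini` and the four-sample data are hypothetical, introduced only for this illustration; they are not part of the class above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Gini index of the split `value < threshold`, weighted by child size."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy data: one feature, two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]

print(gini(labels))                        # 0.5: the two classes are evenly mixed
print(weighted_gini(values, labels, 3.0))  # 0.0: both children are pure
print(weighted_gini(values, labels, 2.0))  # about 0.333: the right child is still mixed
```

The threshold 3.0 wins because it yields the lowest weighted value. Weighting by child size matters: without it, a split that isolates a single sample into a pure child could look as good as one that cleanly separates many samples.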
You can test the implementation above with the following code:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTree(max_depth=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
```
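This code loads the iris dataset with `load_iris`, splits it into a training set (80%) and a test set (20%), trains a `DecisionTree` on the training set, and predicts labels for the test set. Finally, it computes the accuracy of those predictions.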
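As a rough sketch of what that accuracy figure means, the fraction of correctly predicted labels can be computed by hand. The function name `accuracy` and the sample labels below are hypothetical and used only for illustration; with its default settings, `accuracy_score` returns the same kind of fraction.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: four of the five predictions match.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(accuracy(y_true, y_pred))  # 0.8
```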