Writing your own decision tree code for heart disease prediction — process the data and output whether the person has the disease, without using the sklearn library
Heart_disease_prediction: (decision tree) the goal is to predict whether a person will develop heart disease
OK, let me explain the basic principles and implementation steps of a decision tree. A decision tree is a tree-structured classification model: it repeatedly splits the data into smaller subsets so that the samples within each subset are as homogeneous in class as possible, and finally yields a classification result.
Building a decision tree involves the following steps:
1. Choose the best feature: pick the feature that separates the different classes as cleanly as possible. This is typically measured by information gain or Gini impurity (a small worked example follows this list).
2. Split the dataset: use the chosen feature to divide the dataset into two or more subsets. For each subset, apply steps 1 and 2 again to split it into ever smaller subsets, until a predefined stopping condition is reached.
3. Build the decision tree: each subset produced by a split becomes a new node attached under its parent, with the first split serving as the root; together these nodes form the decision tree.
4. Predict new data: when a new sample arrives, route it from the root down the tree according to each node's test until it reaches a leaf, whose stored value is the predicted class.
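To make step 1 concrete, here is a minimal, self-contained sketch of the entropy and information-gain calculation on a toy label array. The numbers and the split are invented purely for illustration; they are not part of the model below:
```python
import numpy as np

def entropy(y):
    # H(S) = -sum over classes k of p_k * log2(p_k)
    probs = np.bincount(y) / len(y)
    return -sum(p * np.log2(p) for p in probs if p > 0)

# Toy labels: 5 "left" samples and 5 "right" samples, as if a feature
# threshold had already split them (the split itself is invented here)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
left, right = y[:5], y[5:]

parent_h = entropy(y)                                 # 1.0: classes are balanced
child_h = 0.5 * entropy(left) + 0.5 * entropy(right)  # weighted child entropy
print(parent_h - child_h)                             # information gain, about 0.278
```
A split is worth making only when this gain is positive; the model below uses exactly this criterion to compare candidate splits.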
Below is a decision tree model written in Python:
```python
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature this node splits on
        self.threshold = threshold  # samples with feature value <= threshold go left
        self.left = left            # left child node
        self.right = right          # right child node
        self.value = value          # predicted class; set only on leaf nodes

class DecisionTree:
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.root = None

    def fit(self, X, y):
        self.root = self.build_tree(X, y, 0)

    def predict(self, X):
        return [self._predict(inputs) for inputs in X]

    def _predict(self, inputs):
        # Walk from the root to a leaf, following each node's threshold test
        node = self.root
        while node.left:
            if inputs[node.feature] <= node.threshold:
                node = node.left
            else:
                node = node.right
        return node.value

    def build_tree(self, X, y, depth):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))
        # Stopping conditions: max depth reached, pure node, or too few samples
        if depth >= self.max_depth or n_labels == 1 or n_samples < 2:
            leaf_value = self.get_leaf_value(y)
            return Node(value=leaf_value)
        # Choose the best feature and threshold
        best_feature, best_threshold = self.get_best_split(X, y, n_samples, n_features)
        # If no split yields positive information gain, stop with a leaf
        if best_feature is None:
            return Node(value=self.get_leaf_value(y))
        # Split the dataset and build each subtree recursively
        left_indices, right_indices = self.split(X[:, best_feature], best_threshold)
        left = self.build_tree(X[left_indices, :], y[left_indices], depth + 1)
        right = self.build_tree(X[right_indices, :], y[right_indices], depth + 1)
        return Node(best_feature, best_threshold, left, right)

    def get_best_split(self, X, y, n_samples, n_features):
        # Exhaustively try every (feature, threshold) pair, keeping the largest gain
        best_feature, best_threshold = None, None
        max_info_gain = 0  # require strictly positive gain to accept a split
        for feature in range(n_features):
            feature_values = X[:, feature]
            thresholds = np.unique(feature_values)
            for threshold in thresholds:
                gain = self.get_info_gain(y, feature_values, threshold, n_samples)
                if gain > max_info_gain:
                    max_info_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def get_info_gain(self, y, feature_values, threshold, n_samples):
        # Information gain = parent entropy - weighted average of child entropies
        parent_entropy = self.entropy(y, n_samples)
        left_indices, right_indices = self.split(feature_values, threshold)
        if len(left_indices) == 0 or len(right_indices) == 0:
            return 0
        n_left, n_right = len(left_indices), len(right_indices)
        left_entropy = self.entropy(y[left_indices], n_left)
        right_entropy = self.entropy(y[right_indices], n_right)
        child_entropy = (n_left / n_samples) * left_entropy + (n_right / n_samples) * right_entropy
        return parent_entropy - child_entropy

    def entropy(self, y, n_samples):
        if n_samples == 0:
            return 0
        counts = np.bincount(y)
        probabilities = counts / n_samples
        return sum(-p * np.log2(p) for p in probabilities if p > 0)

    def split(self, feature_values, threshold):
        # Indices of the samples on each side of the threshold
        left_indices = np.argwhere(feature_values <= threshold).flatten()
        right_indices = np.argwhere(feature_values > threshold).flatten()
        return left_indices, right_indices

    def get_leaf_value(self, y):
        # Majority class among the samples that reached this leaf
        counts = np.bincount(y)
        return np.argmax(counts)
```
This model selects features by information gain, the entropy-based criterion used in ID3 (the Gini impurity mentioned in step 1 is an alternative criterion, not what this code uses). You can apply it to a heart disease prediction dataset to classify whether a person has the disease; a minimal usage sketch follows.
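For completeness, here is a minimal usage sketch built on the DecisionTree class above. The file name heart.csv, the comma-separated layout, and the final target column holding 0/1 labels are assumptions modeled on the common Kaggle/UCI heart disease dataset; adjust them to match your actual data:
```python
import numpy as np

# Assumed layout: heart.csv has numeric feature columns followed by a final
# 'target' column with 0/1 labels (an assumption; adapt to your dataset)
data = np.genfromtxt("heart.csv", delimiter=",", skip_header=1)
X, y = data[:, :-1], data[:, -1].astype(int)

# Simple 80/20 train/test split without sklearn
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

tree = DecisionTree(max_depth=5)
tree.fit(X[train_idx], y[train_idx])

predictions = np.array(tree.predict(X[test_idx]))  # 1 = disease, 0 = no disease
accuracy = np.mean(predictions == y[test_idx])
print(f"Test accuracy: {accuracy:.3f}")
print("First predictions:", predictions[:10])
```
Keeping max_depth small (here 5) limits overfitting, since the exhaustive threshold search would otherwise keep growing the tree until every leaf is pure.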