Problem 3: Implement the CART classification/regression tree algorithm with numpy and use it to make predictions on the iris / Boston datasets.
Tasks:
(1) Load the dataset.
(2) Split the data into a training set and a test set.
(3) Train the model (reference template: cart_numpy_template.py).
(4) Output the tree model.
(5) Make predictions and evaluate model performance.
Optional extensions:
(1) Add a sample-count threshold TN and a Gini-index threshold TG as stopping conditions.
(2) Support splitting on discrete features.
Posted: 2024-01-29 21:02:53
Sure, let me walk through this problem.
First, load the datasets. Both the iris dataset and the Boston housing dataset can be imported from sklearn.datasets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the import below requires an older version (or fetch the Boston data from OpenML instead). The loading code is as follows:
```python
# load_boston requires scikit-learn < 1.2; it was removed in later versions.
from sklearn.datasets import load_iris, load_boston
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
boston = load_boston()
X_boston = boston.data
y_boston = boston.target
```
Next, split each dataset into a training set and a test set with the train_test_split function from sklearn.model_selection:
```python
from sklearn.model_selection import train_test_split
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)
X_boston_train, X_boston_test, y_boston_train, y_boston_test = train_test_split(X_boston, y_boston, test_size=0.2, random_state=42)
```
Now you can implement the CART classification/regression tree with numpy, following the template cart_numpy_template.py. For the optional extensions, a sample-count threshold TN and a Gini-index threshold TG can be added as stopping conditions (they correspond to the min_samples_leaf and min_impurity_decrease parameters below), which also helps the model generalize. Discrete features can be branched on using the Gini index as the selection criterion, as CART does. Only the CART classification tree is shown here:
```python
import numpy as np

class CARTClassifier:
    """CART classification tree built with numpy.
    min_samples_leaf acts as the TN sample-count threshold and
    min_impurity_decrease as the TG Gini threshold; both stop splitting."""

    def __init__(self, min_samples_leaf=1, min_impurity_decrease=0.0):
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease

    def fit(self, X, y):
        self.n_classes_ = len(np.unique(y))
        self.tree_ = self._build_tree(X, y)

    def predict(self, X):
        return np.array([self._predict(inputs) for inputs in X])

    def _build_tree(self, X, y):
        n_samples, n_features = X.shape
        # Stop if the node is pure or too small (TN threshold).
        if len(np.unique(y)) == 1 or n_samples <= self.min_samples_leaf:
            return {'leaf': True, 'class': np.argmax(np.bincount(y))}
        best_feature, best_threshold, best_gain = self._best_split(X, y)
        # Stop if no valid split exists or the Gini decrease is below TG.
        if best_feature is None or best_gain < self.min_impurity_decrease:
            return {'leaf': True, 'class': np.argmax(np.bincount(y))}
        left_idx = X[:, best_feature] <= best_threshold
        return {'leaf': False, 'feature': best_feature, 'threshold': best_threshold,
                'left': self._build_tree(X[left_idx], y[left_idx]),
                'right': self._build_tree(X[~left_idx], y[~left_idx])}

    def _best_split(self, X, y):
        parent_gini = self._gini(y)
        best_feature, best_threshold, best_gain = None, None, 0.0
        for feature_idx in range(X.shape[1]):
            # Skip the maximum value: splitting there leaves the right child empty.
            for threshold in np.unique(X[:, feature_idx])[:-1]:
                mask = X[:, feature_idx] <= threshold
                y_left, y_right = y[mask], y[~mask]
                # Weighted Gini impurity of the two children.
                child_gini = (len(y_left) * self._gini(y_left)
                              + len(y_right) * self._gini(y_right)) / len(y)
                gain = parent_gini - child_gini
                if gain > best_gain:
                    best_feature, best_threshold, best_gain = feature_idx, threshold, gain
        return best_feature, best_threshold, best_gain

    def _gini(self, y):
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def _predict(self, inputs):
        node = self.tree_
        while not node['leaf']:
            node = node['left'] if inputs[node['feature']] <= node['threshold'] else node['right']
        return node['class']
```
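The evaluation script below also calls a CARTRegressor for the Boston data, which is not defined above. Here is a minimal sketch of such a regressor (my own naming, mirroring CARTClassifier): splits minimize the children's summed squared deviation from the mean, and leaves predict the mean target value.

```python
import numpy as np

class CARTRegressor:
    """Minimal CART regression tree: splits minimize the sum of squared
    deviations within the children; each leaf predicts the mean target."""

    def __init__(self, min_samples_split=2, max_depth=10):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth

    def fit(self, X, y):
        self.tree_ = self._build(np.asarray(X, float), np.asarray(y, float), 0)
        return self

    def predict(self, X):
        return np.array([self._predict_one(row) for row in np.asarray(X, float)])

    def _build(self, X, y, depth):
        # Stop when the node is pure, too small, or too deep.
        if len(y) < self.min_samples_split or depth >= self.max_depth or np.all(y == y[0]):
            return {'leaf': True, 'value': float(np.mean(y))}
        feat, thr = self._best_split(X, y)
        if feat is None:
            return {'leaf': True, 'value': float(np.mean(y))}
        mask = X[:, feat] <= thr
        return {'leaf': False, 'feature': feat, 'threshold': thr,
                'left': self._build(X[mask], y[mask], depth + 1),
                'right': self._build(X[~mask], y[~mask], depth + 1)}

    def _best_split(self, X, y):
        best = (None, None)
        best_score = np.var(y) * len(y)  # parent sum of squared deviations
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j])[:-1]:  # skip max: right child would be empty
                mask = X[:, j] <= thr
                yl, yr = y[mask], y[~mask]
                score = np.var(yl) * len(yl) + np.var(yr) * len(yr)
                if score < best_score:
                    best, best_score = (j, thr), score
        return best

    def _predict_one(self, row):
        node = self.tree_
        while not node['leaf']:
            node = node['left'] if row[node['feature']] <= node['threshold'] else node['right']
        return node['value']
```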
Finally, output the tree model, make predictions, and evaluate model performance:
```python
clf = CARTClassifier()
clf.fit(X_iris_train, y_iris_train)
print(clf.tree_)
y_iris_pred = clf.predict(X_iris_test)
iris_accuracy = np.sum(y_iris_pred == y_iris_test) / len(y_iris_test)
print('Iris accuracy:', iris_accuracy)
# CARTRegressor is the regression counterpart of CARTClassifier.
clf = CARTRegressor()
clf.fit(X_boston_train, y_boston_train)
print(clf.tree_)
y_boston_pred = clf.predict(X_boston_test)
boston_mse = np.mean((y_boston_pred - y_boston_test) ** 2)
print('Boston MSE:', boston_mse)
```
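For the optional extension on discrete features, CART makes binary splits of the form feature == value vs. feature != value, choosing the value with the lowest weighted Gini impurity. A small standalone sketch (the function names here are my own, not from the template):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_categorical_split(x, y):
    """Binary CART split on a discrete feature: partition samples into
    (x == v) vs (x != v) and return the value v with the lowest
    weighted Gini impurity, along with that impurity."""
    best_v, best_g = None, np.inf
    for v in np.unique(x):
        mask = (x == v)
        if mask.all():  # only one category present: no split possible
            continue
        g = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / len(y)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g
```

For example, with x = ['r', 'r', 'g', 'g', 'b', 'b'] and y = [0, 0, 1, 1, 1, 1], the best binary partition is {'r'} vs {'g', 'b'}, which separates the classes perfectly.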