Implement AdaBoost with unpruned decision trees as the base learner, and train an AdaBoost ensemble on watermelon dataset 3.0α
Posted: 2024-02-09 09:08:02
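Sure, let me answer your question.
First, let's be clear about the basic steps of the AdaBoost algorithm:
1. Initialize the weight distribution over the training samples; normally every sample starts with the same weight.
2. For each weak classifier, train a model on the current weight distribution and compute its weighted classification error rate.
3. Compute the weak classifier's weight from its error rate, `alpha = 0.5 * ln((1 - error) / error)`, so that a weak classifier with a lower error rate gets a larger weight.
4. Update the sample weights so that misclassified samples gain weight and correctly classified samples lose weight, then renormalize.
5. Repeat steps 2-4 until the specified number of weak classifiers is reached or the error rate reaches a specified threshold.
6. Combine all weak classifiers into one strong classifier that predicts by weighted vote.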
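To make steps 2-4 concrete, here is a minimal sketch of a single boosting round on five toy samples. The arrays `y_true` and `y_pred` are made-up illustration data, not taken from the watermelon dataset, and labels are assumed to be encoded as -1/+1.

```python
import numpy as np

# Toy data (illustrative only): five samples, the last one is misclassified.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])
weight = np.full(5, 0.2)                      # step 1: uniform weights

# Step 2: weighted error rate of this weak classifier.
error = np.sum(weight[y_pred != y_true])      # 0.2

# Step 3: classifier weight, larger when the error is smaller.
alpha = 0.5 * np.log((1 - error) / error)     # 0.5 * ln(4) = ln(2)

# Step 4: y_true * y_pred is +1 for a correct prediction and -1 for a mistake,
# so correct samples are scaled down and mistakes are scaled up.
new_weight = weight * np.exp(-alpha * y_true * y_pred)
new_weight /= new_weight.sum()                # renormalize to sum to 1

print(alpha, new_weight)
```

After the update the single misclassified sample carries about half of the total weight, which forces the next weak classifier to pay attention to it.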
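Next, we implement AdaBoost following the steps above.
First, load watermelon dataset 3.0α. This version of the dataset keeps only two continuous features (density and sugar content), so the tree splits each feature on a threshold, and we pick splits by gain ratio as in C4.5.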
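```python
import pandas as pd
import numpy as np

data = pd.read_csv('watermelon_3.0_alpha.csv')
# Assumes the first column is a sample ID and the last column is the label.
X = data.iloc[:, 1:-1].values.astype(float)
labels = data.iloc[:, -1].values
# The AdaBoost update below needs labels in {-1, +1}. The CSV's label encoding is
# not shown here, so as an assumption the larger sorted label value becomes +1.
y = np.where(labels == np.unique(labels)[-1], 1, -1)
```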
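Next, we define the node class and the decision tree class. Since we follow C4.5, splits are chosen by gain ratio, with the sample weights taking the place of sample counts. I won't go into the implementation details here; if you are interested, see my other articles.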
```python
class Node:
    """Internal nodes hold a split; leaves hold a label."""

    def __init__(self, feature=None, threshold=None, label=None):
        self.feature = feature      # index of the feature used for the split
        self.threshold = threshold  # samples with x[feature] <= threshold go left
        self.label = label          # predicted class, set on leaves only
        self.left = None
        self.right = None


class DecisionTree:
    def __init__(self, max_depth=5):
        self.max_depth = max_depth

    def fit(self, X, y, weight):
        self.root = self._build_tree(X, y, weight, depth=0)
        return self

    def predict(self, X):
        return np.array([self._predict_one(x, self.root) for x in X])

    def _predict_one(self, x, node):
        # Walk down from the root until a leaf is reached.
        while node.label is None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.label

    @staticmethod
    def _majority(y, weight):
        # Weighted majority vote: the class with the largest total weight.
        classes = np.unique(y)
        totals = [np.sum(weight[y == c]) for c in classes]
        return classes[int(np.argmax(totals))]

    @staticmethod
    def _entropy(y, weight):
        # Class entropy where each sample counts with its weight.
        total = np.sum(weight)
        h = 0.0
        for c in np.unique(y):
            p_c = np.sum(weight[y == c]) / total
            if p_c > 0:
                h -= p_c * np.log2(p_c)
        return h

    def _build_tree(self, X, y, weight, depth):
        node = Node()
        n_features = X.shape[1]
        if depth >= self.max_depth or len(np.unique(y)) == 1:
            node.label = self._majority(y, weight)
            return node
        best_gain, best_feature, best_threshold = 0.0, None, None
        for i in range(n_features):
            for val in np.unique(X[:, i]):
                left = X[:, i] <= val
                if left.all() or not left.any():
                    continue  # one side would be empty
                gain = self._information_gain(y, weight, left)
                if gain > best_gain:
                    best_gain, best_feature, best_threshold = gain, i, val
        if best_feature is None:
            # No split improves on the parent: make this node a leaf.
            node.label = self._majority(y, weight)
            return node
        node.feature = best_feature
        node.threshold = best_threshold
        left = X[:, best_feature] <= best_threshold
        node.left = self._build_tree(X[left], y[left], weight[left], depth + 1)
        node.right = self._build_tree(X[~left], y[~left], weight[~left], depth + 1)
        return node

    def _information_gain(self, y, weight, left):
        # Returns the gain ratio (information gain / split information)
        # of the split described by the boolean mask `left`.
        total = np.sum(weight)
        p_left = np.sum(weight[left]) / total
        p_right = 1.0 - p_left
        if p_left <= 0 or p_right <= 0:
            return 0.0
        h = self._entropy(y, weight)
        h_left = self._entropy(y[left], weight[left])
        h_right = self._entropy(y[~left], weight[~left])
        gain = h - p_left * h_left - p_right * h_right
        split_info = -p_left * np.log2(p_left) - p_right * np.log2(p_right)
        if split_info == 0:
            return 0.0
        return gain / split_info
```
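As a standalone check of the split criterion, here is a small sketch of a weighted gain ratio outside the tree class. The helper names `weighted_entropy` and `weighted_gain_ratio` and the toy arrays are my own for illustration; the formula is the usual information gain divided by split information, with sample weights used in place of counts.

```python
import numpy as np

def weighted_entropy(y, w):
    """Class entropy in bits, each sample counted with its weight."""
    total = w.sum()
    probs = np.array([w[y == c].sum() / total for c in np.unique(y)])
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

def weighted_gain_ratio(y, w, left):
    """Gain ratio of the binary split given by the boolean mask `left`."""
    p_left = w[left].sum() / w.sum()
    p_right = 1.0 - p_left
    if p_left <= 0 or p_right <= 0:
        return 0.0                      # degenerate split, one side is empty
    gain = (weighted_entropy(y, w)
            - p_left * weighted_entropy(y[left], w[left])
            - p_right * weighted_entropy(y[~left], w[~left]))
    split_info = -p_left * np.log2(p_left) - p_right * np.log2(p_right)
    return gain / split_info

# Toy data (illustrative only): two positives, two negatives, equal weights.
y_demo = np.array([1, 1, -1, -1])
w_demo = np.full(4, 0.25)
clean = weighted_gain_ratio(y_demo, w_demo, np.array([True, True, False, False]))
useless = weighted_gain_ratio(y_demo, w_demo, np.array([True, False, True, False]))
print(clean, useless)
```

A split that separates the two classes perfectly scores 1, while a split that leaves both children as mixed as the parent scores 0. Because the weights enter every probability, samples that AdaBoost has up-weighted pull the chosen threshold toward themselves.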
Next, we define the AdaBoost class, which trains the ensemble.
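```python
class AdaBoost:
    def __init__(self, n_estimators=10, max_depth=5):
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    def fit(self, X, y):
        # y must be encoded as -1 / +1 for the weight update below.
        n_samples = X.shape[0]
        weight = np.ones(n_samples) / n_samples
        self.estimators = []
        self.alpha = []
        for _ in range(self.n_estimators):
            tree = DecisionTree(max_depth=self.max_depth)
            tree.fit(X, y, weight)
            y_pred = tree.predict(X)
            error = np.sum(weight[y_pred != y])
            if error >= 0.5:
                break  # no better than random guessing, stop boosting
            perfect = error < 1e-10
            error = max(error, 1e-10)  # avoid division by zero
            alpha = np.log((1 - error) / error) / 2
            self.estimators.append(tree)
            self.alpha.append(alpha)
            if perfect:
                break  # the tree already fits the weighted training set
            # Misclassified samples (y * y_pred == -1) gain weight.
            weight = weight * np.exp(-alpha * y * y_pred)
            weight = weight / np.sum(weight)
        return self

    def predict(self, X):
        y_pred = np.zeros(X.shape[0])
        # Iterate over the trees actually trained, which may be fewer
        # than n_estimators if boosting stopped early.
        for alpha, tree in zip(self.alpha, self.estimators):
            y_pred += alpha * tree.predict(X)
        return np.sign(y_pred)
```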
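The weighted vote in `predict` is easy to see on a toy case. The sketch below uses made-up classifier weights and predictions (not output from the trees above) to show how one strong weak classifier can outvote two weaker ones.

```python
import numpy as np

# Hypothetical weights of three weak classifiers (illustrative values).
alphas = np.array([0.7, 0.3, 0.3])

# Rows: weak classifiers, columns: two test samples, entries in {-1, +1}.
votes = np.array([
    [+1, -1],
    [-1, +1],
    [-1, -1],
])

# Weighted sum per sample, then the sign gives the ensemble prediction.
scores = alphas @ votes          # sample 0: 0.7 - 0.3 - 0.3, sample 1: -0.7 + 0.3 - 0.3
ensemble = np.sign(scores)
print(scores, ensemble)
```

On the first sample two of the three classifiers vote -1, yet the ensemble predicts +1 because the first classifier's weight exceeds the other two combined. Note that `np.sign` returns 0 when the weighted votes cancel exactly, so a tie-breaking rule is needed if that case matters.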
Finally, we train and test on watermelon dataset 3.0α and compute the model's accuracy.
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = AdaBoost(n_estimators=10, max_depth=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print('Accuracy:', acc)
```
Output from one run (the split is random, so the value changes between runs):
```
Accuracy: 0.8333333333333334
```
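As you can see, we have implemented AdaBoost and obtained a reasonable result on watermelon dataset 3.0α. Keep in mind that the dataset has only 17 samples, so a 30% split leaves 6 test samples and 0.833 corresponds to 5 of 6 correct; a single split like this is a noisy estimate.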
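With so few samples, one option is to evaluate with leave-one-out instead of a single random split. The sketch below is one possible way to do that, not part of the original solution: `leave_one_out_accuracy` and the stand-in model `MajorityClass` are hypothetical names introduced here, and the demo data is made up so that the block runs on its own.

```python
import numpy as np

def leave_one_out_accuracy(make_model, X, y):
    """Train on all samples but one, test on the held-out one, for every sample."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i            # boolean mask excluding sample i
        model = make_model()
        model.fit(X[train], y[train])
        hits += int(model.predict(X[i:i + 1])[0] == y[i])
    return hits / n

class MajorityClass:
    """Hypothetical stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

# Toy data (illustrative only): four positives and one negative.
X_demo = np.arange(10, dtype=float).reshape(5, 2)
y_demo = np.array([1, 1, 1, 1, -1])
loo_acc = leave_one_out_accuracy(MajorityClass, X_demo, y_demo)
print(loo_acc)
```

Any object with `fit(X, y)` and `predict(X)` can be plugged in, so with the class defined earlier the call would look like `leave_one_out_accuracy(lambda: AdaBoost(n_estimators=10, max_depth=5), X, y)`.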