Optimize the KNN workflow covered in class and wrap it in a prediction function (e.g. predict) in the style of sklearn; split iris.csv into a training set and a test set, and report the classification accuracy of the predictions. Build a KD tree with NumPy. Test data: X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]) # two features per sample; y = np.array(['苹果', '苹果', '香蕉', '苹果', '香蕉', '香蕉']) # the label of each sample. Implement KD-tree search with NumPy (optional, for students who are able).
The KNN workflow:
1. Read the training data.
2. Compute the distance from the test point to every point in the training set.
3. Sort the distances in ascending order.
4. Take the K points with the smallest distances.
5. Count how many times each class appears among these K points.
6. Predict the class that appears most often.
Code wrapping this workflow in a prediction function:
```python
import numpy as np
def knn_predict(X_train, y_train, x_test, k):
    # Euclidean distance from the test point to every training point
    distances = np.sqrt(np.sum((X_train - x_test)**2, axis=1))
    # Indices of the k closest training points
    nearest_indices = np.argsort(distances)[:k]
    nearest_labels = y_train[nearest_indices]
    # Majority vote among the k nearest labels
    unique_labels, counts = np.unique(nearest_labels, return_counts=True)
    return unique_labels[np.argmax(counts)]

# Test code
X_train = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y_train = np.array([0, 1, 1, 0, 1, 0])
X_test = np.array([[3, 5], [6, 6], [8, 5]])
y_test = np.array([0, 1, 1])
for i in range(len(X_test)):
    prediction = knn_predict(X_train, y_train, X_test[i], 3)
    print("Predicted label:", prediction)
    print("True label:", y_test[i])
```
Output:
```
Predicted label: 0
True label: 0
Predicted label: 1
True label: 1
Predicted label: 1
True label: 1
```
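The task statement also asks for an sklearn-style wrapper (fit/predict) and an accuracy score on a train/test split of iris.csv, which the bare function above does not provide. A minimal sketch follows; the class name KNNClassifier, the assumed file layout (a header row, four numeric feature columns, the label in the last column), and the 70/30 split ratio are illustrative assumptions, not part of the original answer.
```python
import numpy as np

class KNNClassifier:
    """Minimal sklearn-style KNN: fit() stores the training data,
    predict() takes a majority vote among the k nearest neighbors."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            # Euclidean distance from x to every training point
            distances = np.sqrt(np.sum((self.X_train - x)**2, axis=1))
            nearest = self.y_train[np.argsort(distances)[:self.k]]
            labels, counts = np.unique(nearest, return_counts=True)
            preds.append(labels[np.argmax(counts)])  # majority vote
        return np.array(preds)

# Assumed iris.csv layout: header row, 4 numeric feature columns, label last
data = np.loadtxt('iris.csv', delimiter=',', skiprows=1, dtype=str)
X, y = data[:, :4].astype(float), data[:, 4]

# Shuffle, then split 70/30 into training and test sets
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
split = int(0.7 * len(X))
clf = KNNClassifier(k=3).fit(X[idx[:split]], y[idx[:split]])
y_pred = clf.predict(X[idx[split:]])
print("Accuracy:", np.mean(y_pred == y[idx[split:]]))
```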
Code to build a KD tree with NumPy:
```python
import numpy as np
class KdNode:
    def __init__(self, point=None, split=None, left=None, right=None):
        self.point = point
        self.split = split  # which dimension this node splits on
        self.left = left
        self.right = right

class KdTree:
    def __init__(self, data):
        self.root = self.build(data)

    def build(self, data):
        if len(data) == 0:
            return None
        # Split on the dimension with the largest variance
        split = np.argmax(np.var(data, axis=0))
        sorted_data = data[np.argsort(data[:, split])]
        mid = len(data) // 2  # the median point becomes this node
        return KdNode(
            point=sorted_data[mid],
            split=split,
            left=self.build(sorted_data[:mid]),
            right=self.build(sorted_data[mid+1:])
        )

    def search(self, point):
        """Return the nearest neighbor of `point` in the tree."""
        self.nearest_point = None
        self.nearest_dist = np.inf  # squared distance to the current best
        self.search_node(self.root, point)
        return self.nearest_point

    def search_node(self, node, point):
        if node is None:
            return
        dist = np.sum((point - node.point)**2)
        if dist < self.nearest_dist:
            self.nearest_dist = dist
            self.nearest_point = node.point
        # Signed offset from the query to this node's splitting plane
        split_diff = point[node.split] - node.point[node.split]
        near, far = (node.left, node.right) if split_diff < 0 else (node.right, node.left)
        # Descend into the subtree containing the query first
        self.search_node(near, point)
        # Visit the other subtree only if the splitting plane is closer than the current best
        if split_diff**2 < self.nearest_dist:
            self.search_node(far, point)
# Test code
X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = KdTree(X)
# [2, 3], [5, 4], and [4, 7] are all tied at squared distance 5 from [3, 5];
# this search returns [5, 4]
print(tree.search(np.array([3, 5])))
```
Output:
```
[5 4]
```
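As a sanity check on the KD-tree result, and to tie back to the labeled data from the task, the nearest neighbors can also be found by brute force with plain NumPy. This snippet is an illustrative addition rather than part of the original answer; note that three points tie for nearest, so the brute-force pick and the KD-tree result differ but are equally correct.
```python
import numpy as np

X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y = np.array(['苹果', '苹果', '香蕉', '苹果', '香蕉', '香蕉'])
query = np.array([3, 5])

# Squared Euclidean distance from the query to every sample
sq_dists = np.sum((X - query)**2, axis=1)
print(sq_dists)  # [ 5  5 37  5 41 25] -> [2,3], [5,4], [4,7] tie at 5

# Brute force picks the first tied point; the KD tree returned [5 4]
print(X[np.argmin(sq_dists)])  # [2 3]

# Majority vote among the 3 nearest neighbors, using the task's labels
nearest_labels = y[np.argsort(sq_dists)[:3]]
labels, counts = np.unique(nearest_labels, return_counts=True)
print(labels[np.argmax(counts)])  # 苹果
```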