How the KNN Algorithm Is Implemented
Posted: 2024-04-30 10:14:48
KNN (K-Nearest Neighbors) is a widely used algorithm for classification and regression. Its implementation proceeds as follows:
1. Data preparation: assemble a labeled training set containing samples whose classes are known.
2. Distance computation: for each sample to be classified, compute its distance to every sample in the training set. Common distance metrics include Euclidean distance and Manhattan distance.
3. Choosing K: pick K, the number of nearest neighbors that will vote. K is usually chosen based on the problem at hand and empirical tuning (for example, via cross-validation).
4. Voting: tally the labels of the K nearest neighbors and take the class with the most votes.
5. Output: assign the sample to the majority class.
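The five steps above are exactly what an off-the-shelf KNN classifier performs internally. As a minimal sketch (assuming scikit-learn is available), the whole pipeline fits in a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: a labeled dataset, split into train/test
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 2-4: distance computation, neighbor lookup, and majority
# voting are all handled inside fit/predict
clf = KNeighborsClassifier(n_neighbors=5)  # K = 5
clf.fit(X_train, y_train)

# Step 5: predicted classes for the test samples
y_pred = clf.predict(X_test)
print(clf.score(X_test, y_test))  # fraction of correct predictions
```

The hand-rolled implementation in the next section makes each of these hidden steps explicit.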
Related questions
The KNN algorithm step by step, with code for each step
KNN (K-Nearest Neighbors) is an instance-based learning method. Its core idea: find the k training samples closest to a test sample, then assign the test sample the label that occurs most often among those k samples.
The implementation steps, with code for each:
1. Compute the distance between the test sample and every training sample, typically using Euclidean or Manhattan distance:
```python
import numpy as np

def euclidean_distance(x1, x2):
    """Euclidean distance between two vectors x1 and x2."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def manhattan_distance(x1, x2):
    """Manhattan distance between two vectors x1 and x2."""
    return np.sum(np.abs(x1 - x2))
```
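A quick sanity check of the two metrics on a concrete pair of points: for (0, 0) and (3, 4), the Euclidean distance is 5 (the 3-4-5 right triangle) and the Manhattan distance is 3 + 4 = 7:

```python
import numpy as np

def euclidean_distance(x1, x2):
    """Euclidean distance between two vectors."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def manhattan_distance(x1, x2):
    """Manhattan distance between two vectors."""
    return np.sum(np.abs(x1 - x2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
```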
2. Sort the distances and pick the k training samples closest to the test sample:
```python
def get_k_nearest_neighbors(X_train, x_test, k, distance_metric):
    """Return the indices of the k training samples nearest to x_test.

    X_train: training samples
    x_test: the test sample
    k: number of neighbors
    distance_metric: distance function, e.g. euclidean_distance
    """
    distances = []
    for i in range(len(X_train)):
        dist = distance_metric(X_train[i], x_test)
        distances.append((i, dist))
    # Sort (index, distance) pairs by distance, ascending
    distances = sorted(distances, key=lambda x: x[1])
    # Keep the indices of the k closest samples
    neighbors = [idx for idx, _ in distances[:k]]
    return neighbors
```
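The same sort-and-slice step can also be written in vectorized NumPy with `np.argsort`, which is shorter and avoids the Python-level loop. This is a sketch under the assumption of Euclidean distance (the function name is ours, not from the original):

```python
import numpy as np

def get_k_nearest_neighbors_vectorized(X_train, x_test, k):
    """Indices of the k training rows nearest to x_test (Euclidean distance)."""
    # Row-wise Euclidean distances, computed in one shot via broadcasting
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # argsort gives the indices that would sort dists ascending; keep the first k
    return np.argsort(dists)[:k].tolist()

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
x_test = np.array([0.2, 0.1])
print(get_k_nearest_neighbors_vectorized(X_train, x_test, k=2))  # [0, 3]
```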
3. Count how often each class appears among the k neighbors and predict the majority class:
```python
def predict(X_train, y_train, x_test, k, distance_metric):
    """Predict the class of x_test by majority vote among its k nearest neighbors.

    X_train: training samples
    y_train: labels corresponding to X_train
    x_test: the test sample
    k: number of neighbors
    distance_metric: distance function, e.g. euclidean_distance
    """
    neighbors = get_k_nearest_neighbors(X_train, x_test, k, distance_metric)
    label_counts = {}
    for i in neighbors:
        label = y_train[i]
        label_counts[label] = label_counts.get(label, 0) + 1
    # Sort (label, count) pairs by count, descending, and return the top label
    sorted_label_counts = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_label_counts[0][0]
```
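The vote-counting step is equivalent to `collections.Counter.most_common` from the standard library; a compact helper (ours, not from the original) illustrating the same majority vote:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label that occurs most often in labels."""
    # most_common(1) returns a list with the single (label, count) pair
    return Counter(labels).most_common(1)[0][0]

print(majority_vote([1, 0, 1, 2, 1]))  # 1
```

In both versions, ties between equally frequent labels are broken by which label was counted first.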
That covers the KNN implementation and the code for each step.
The KNN algorithm step by step, with code for each step, demonstrated on the Iris dataset
Below is a demonstration of using KNN to classify the Iris dataset.
1. Import the required libraries and load the dataset:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
```
2. Split the data into a training set and a test set:
```python
# Split into training and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
3. Implement the KNN algorithm:
```python
def euclidean_distance(x1, x2):
    """Euclidean distance between two vectors x1 and x2."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def get_k_nearest_neighbors(X_train, x_test, k, distance_metric):
    """Return the indices of the k training samples nearest to x_test.

    X_train: training samples
    x_test: the test sample
    k: number of neighbors
    distance_metric: distance function, e.g. euclidean_distance
    """
    distances = []
    for i in range(len(X_train)):
        dist = distance_metric(X_train[i], x_test)
        distances.append((i, dist))
    distances = sorted(distances, key=lambda x: x[1])
    neighbors = [idx for idx, _ in distances[:k]]
    return neighbors

def predict(X_train, y_train, x_test, k, distance_metric):
    """Predict the class of x_test by majority vote among its k nearest neighbors.

    y_train: labels corresponding to X_train
    """
    neighbors = get_k_nearest_neighbors(X_train, x_test, k, distance_metric)
    label_counts = {}
    for i in neighbors:
        label = y_train[i]
        label_counts[label] = label_counts.get(label, 0) + 1
    sorted_label_counts = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_label_counts[0][0]
```
4. Predict on the test set and compute the accuracy:
```python
# Predict each test sample
y_pred = []
for i in range(len(X_test)):
    pred = predict(X_train, y_train, X_test[i], k=3, distance_metric=euclidean_distance)
    y_pred.append(pred)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
```
The complete code:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def euclidean_distance(x1, x2):
    """Euclidean distance between two vectors x1 and x2."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def get_k_nearest_neighbors(X_train, x_test, k, distance_metric):
    """Return the indices of the k training samples nearest to x_test.

    X_train: training samples
    x_test: the test sample
    k: number of neighbors
    distance_metric: distance function, e.g. euclidean_distance
    """
    distances = []
    for i in range(len(X_train)):
        dist = distance_metric(X_train[i], x_test)
        distances.append((i, dist))
    distances = sorted(distances, key=lambda x: x[1])
    neighbors = [idx for idx, _ in distances[:k]]
    return neighbors

def predict(X_train, y_train, x_test, k, distance_metric):
    """Predict the class of x_test by majority vote among its k nearest neighbors.

    y_train: labels corresponding to X_train
    """
    neighbors = get_k_nearest_neighbors(X_train, x_test, k, distance_metric)
    label_counts = {}
    for i in neighbors:
        label = y_train[i]
        label_counts[label] = label_counts.get(label, 0) + 1
    sorted_label_counts = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_label_counts[0][0]

# Predict each test sample
y_pred = []
for i in range(len(X_test)):
    pred = predict(X_train, y_train, X_test[i], k=3, distance_metric=euclidean_distance)
    y_pred.append(pred)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
```
The output is:
```
Accuracy: 1.0
```
As the output shows, KNN reaches 100% accuracy on this particular Iris train/test split. Iris is a small, well-separated dataset, so perfect accuracy on a single split is not unusual; accuracy on other splits may be slightly lower.
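As a sanity check, scikit-learn's built-in KNeighborsClassifier can be run on the same split with the same k; it should closely agree with the hand-written implementation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Identical split to the hand-written version above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3)  # same k as before
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test)}")
```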