An Improved KNN Classification Algorithm Based on Sampling
Zhiwei Cheng 1,a, Caisen Chen 1,b, Xuehuan Qiu 1,c and Huan Xie 1,d
1 The Academy of Armored Forces Engineering, Beijing 100072, China
a cheng.zw@mail.scut.edu.cn, b caisenchen@163.com, c qiuxuehuan@139.com, d 2387633126@qq.com
Keywords: KNN, classification algorithm, computational overhead, sampling.
Abstract. The k-nearest-neighbor (KNN) algorithm is widely used as a simple and effective classification algorithm. To find the k nearest neighbors, the traditional KNN classification algorithm must calculate the distance from the test sample to every training sample. When the training set is very large, this produces a high computational overhead and slows classification. We therefore optimize the distance calculation of the KNN algorithm. Since KNN considers only the k training samples closest to the test sample, training samples that are far away have no effect on the classification result. The improved method samples the training data around the test data, which reduces the number of distance calculations between the test data and the training data and thus reduces the time complexity of the algorithm. Experimental results show that the optimized KNN classification algorithm outperforms the traditional KNN algorithm.
1. Introduction
In the fields of machine learning and data mining, classification is an important form of data analysis and a key tool for prediction. Classification is a supervised learning method: by analyzing the training sample data, a classification rule is obtained and used to predict the class of test sample data. Common classification algorithms include decision trees, association rules, Bayesian methods, neural networks, genetic algorithms, and the KNN algorithm [1]. This paper studies the traditional KNN algorithm.
KNN is an instance-based lazy classification method [2] and has no training process [3]. Because of its simple implementation and good performance, it is widely used in machine learning and data mining. However, the traditional KNN algorithm has a drawback in the testing phase: it must calculate the distance from each new sample to all training samples during classification. When the training set is large, prediction becomes slow and performance suffers.
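For reference, the brute-force computation described above can be sketched as follows. This is a minimal Python illustration, not the authors' code; the function name knn_predict and its parameters are our own naming. It classifies one test point by computing Euclidean distances to every one of the n training samples, which is exactly the per-query cost that motivates this paper.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify one test point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the test sample to ALL n training samples.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]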
At present, many improved methods for the KNN algorithm have been proposed. Ugur et al. [4] proposed a density-based training sample pruning method, which evens out the sample distribution density and reduces the computation of KNN. Nikolay K et al. [5] proposed building a classification model that uses a central document in place of the original samples, which reduces the number of samples whose similarity must be computed by the KNN algorithm and thus increases classification speed. Wanita S et al. [6] proposed a reduction algorithm and a merging algorithm that lower the computational complexity; without reducing the original accuracy, they construct a fast classification algorithm that achieves fast classification.
KNN suffers from low performance because it calculates the distance from the test sample to all training samples [7]. To solve this problem, we propose in this paper a method that samples the training samples, improving the classification efficiency of the KNN algorithm by reducing the number of distance calculations. During testing, training samples are sampled along each axis direction (d1, d2, d3, ..., dn) around the value of each test sample. We then calculate the distances from the test sample to the sampled training samples and select the k training samples with the smallest distances to decide the class of the test sample.
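The following is a minimal sketch of this sampling idea under our own assumptions: we treat the sampling region as an axis-aligned window of half-width r around the test point on every axis; the window size r and the fallback to the full training set are illustrative choices, not values specified by the paper.

import numpy as np
from collections import Counter

def sampled_knn_predict(X_train, y_train, x_test, k=5, r=1.0):
    """KNN after sampling candidates within +/- r of the test point on each axis."""
    # Boolean mask: True where a training sample lies inside the window
    # on every coordinate axis (d1, d2, ..., dn).
    inside = np.all(np.abs(X_train - x_test) <= r, axis=1)
    candidates = np.flatnonzero(inside)
    # Fallback (our assumption): if the window holds fewer than k samples,
    # fall back to the full training set.
    if candidates.size < k:
        candidates = np.arange(X_train.shape[0])
    # Distances are now computed only for the sampled candidates.
    dists = np.linalg.norm(X_train[candidates] - x_test, axis=1)
    nearest = candidates[np.argsort(dists)[:k]]
    return Counter(y_train[nearest]).most_common(1)[0][0]

With a well-chosen r, only a small fraction of the training samples falls inside the window, so the number of distance calculations per query drops accordingly while the k nearest neighbors are still found as long as they lie within the window.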