An Improved KNN Classification Algorithm Based on Sampling
Zhiwei Cheng 1,a, Caisen Chen 1,b, Xuehuan Qiu 1,c and Huan Xie 1,d
1 The Academy of Armored Forces Engineering, Beijing 100072, China
a cheng.zw@mail.scut.edu.cn, b caisenchen@163.com, c qiuxuehuan@139.com, d 2387633126@qq.com
Keywords: KNN, classification algorithm, computational overhead, sampling.
Abstract. The k-nearest-neighbor (KNN) algorithm is widely used as a simple and effective classification algorithm. To find the k nearest neighbors, the traditional KNN classification algorithm must calculate the distance from the test sample to every training sample. When the training set is very large, this produces a high computational overhead and slows classification. We therefore optimize the distance calculation of the KNN algorithm. Since KNN considers only the k training samples closest to the test sample, training samples that are far away have no effect on the classification result. The improved method samples the training data around the test data, which reduces the number of distance calculations between the test data and the training data and thus reduces the time complexity of the algorithm. Experimental results show that the optimized KNN classification algorithm outperforms the traditional KNN algorithm.
1. Introduction
In the fields of machine learning and data mining, classification is an important form of data analysis and a key tool for prediction. Classification is a supervised learning method: by analyzing the training sample data, a classification rule is obtained and used to predict the class of test sample data. Common classification algorithms include decision trees, association rules, Bayesian methods, neural networks, genetic algorithms, and the KNN algorithm [1]. This paper studies the traditional KNN algorithm.
KNN is an instance-based lazy classification method [2] and has no training process [3]. Because of its simple implementation and good performance, it is widely used in machine learning and data mining. However, the traditional KNN algorithm has a drawback in the testing phase: it must calculate the distance from each new sample to all training samples during classification. When the training set is large, prediction becomes slow and performance suffers.
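For reference, the brute-force computation described above can be sketched as follows. This is a minimal Python illustration, not the authors' code; the function name knn_predict and its parameters are our own naming. It classifies one test point by computing Euclidean distances to every one of the n training samples, which is exactly the per-query cost that motivates this paper.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify one test point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the test sample to ALL n training samples.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]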
At present, many improved methods for the KNN algorithm have been proposed. Ugur et al. [4] proposed a density-based training sample pruning method, which evens out the sample distribution density and reduces the computation of KNN. Nikolay K et al. [5] proposed building a classification model that uses a central document in place of the original samples, which reduces the number of samples whose similarity must be computed by the KNN algorithm and thus increases classification speed. Wanita S et al. [6] proposed a reduction algorithm and a merging algorithm that lower the computational complexity; without reducing the original accuracy, they construct a fast classification algorithm that achieves fast classification.
KNN suffers from low performance because it calculates the distance from the test sample to all training samples [7]. To solve this problem, we propose in this paper a method that samples the training samples, improving the classification efficiency of the KNN algorithm by reducing the number of distance calculations. During testing, training samples are sampled along each axis direction (d1, d2, d3, ..., dn) around the value of each test sample. We then calculate the distances from the test sample to the sampled training samples and select the k training samples with the smallest distances to decide the class of the test sample.
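The following is a minimal sketch of this sampling idea under our own assumptions: we treat the sampling region as an axis-aligned window of half-width r around the test point on every axis; the window size r and the fallback to the full training set are illustrative choices, not values specified by the paper.

import numpy as np
from collections import Counter

def sampled_knn_predict(X_train, y_train, x_test, k=5, r=1.0):
    """KNN after sampling candidates within +/- r of the test point on each axis."""
    # Boolean mask: True where a training sample lies inside the window
    # on every coordinate axis (d1, d2, ..., dn).
    inside = np.all(np.abs(X_train - x_test) <= r, axis=1)
    candidates = np.flatnonzero(inside)
    # Fallback (our assumption): if the window holds fewer than k samples,
    # fall back to the full training set.
    if candidates.size < k:
        candidates = np.arange(X_train.shape[0])
    # Distances are now computed only for the sampled candidates.
    dists = np.linalg.norm(X_train[candidates] - x_test, axis=1)
    nearest = candidates[np.argsort(dists)[:k]]
    return Counter(y_train[nearest]).most_common(1)[0][0]

With a well-chosen r, only a small fraction of the training samples falls inside the window, so the number of distance calculations per query drops accordingly while the k nearest neighbors are still found as long as they lie within the window.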