k*-Nearest Neighbors: From Global to Local
Oren Anava
The Voleon Group
oren@voleon.com
Kfir Y. Levy
ETH Zurich
yehuda.levy@inf.ethz.ch
Abstract
The weighted k-nearest neighbors algorithm is one of the most fundamental non-parametric methods in pattern recognition and machine learning. The question of setting the optimal number of neighbors as well as the optimal weights has received much attention throughout the years; nevertheless, this problem seems to have remained unsettled. In this paper we offer a simple approach to locally weighted regression/classification, where we make the bias-variance tradeoff explicit. Our formulation enables us to phrase a notion of optimal weights, and to efficiently and adaptively find these weights, as well as the optimal number of neighbors, for each data point whose value we wish to estimate. The applicability of our approach is demonstrated on several datasets, showing superior performance over standard locally weighted methods.
1 Introduction
The k-nearest neighbors (k-NN) algorithm [1, 2] and Nadaraya-Watson estimation [3, 4] are the cornerstones of non-parametric learning. Owing to their simplicity and flexibility, these procedures have become the methods of choice in many scenarios [5], especially in settings where the underlying model is complex. Modern applications of the k-NN algorithm include recommendation systems [6], text categorization [7], heart disease classification [8], and financial market prediction [9], amongst others.
A successful application of the weighted k-NN algorithm requires a careful choice of three ingredients: the number of nearest neighbors k, the weight vector α, and the distance metric. The latter requires domain knowledge and is thus henceforth assumed to be set and known in advance to the learner. Surprisingly, even under this assumption, the problem of choosing the optimal k and α is not fully understood and has been studied extensively since the 1950s under many different regimes. Most of the theoretical work focuses on the asymptotic regime in which the number of samples n goes to infinity [10, 11, 12], and ignores the practical regime in which n is finite. More importantly, the vast majority of k-NN studies aim at finding an optimal value of k per dataset, which seems to overlook the specific structure of the dataset and the properties of the data points whose labels we wish to estimate. While kernel-based methods such as Nadaraya-Watson enable an adaptive choice of the weight vector α, there still remains the question of how to choose the kernel's bandwidth, which could be thought of as the parallel of the number of neighbors k in k-NN. Moreover, there is no principled approach towards choosing the kernel function in practice.
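To make the ingredients above concrete, the following is a minimal sketch of the two standard locally weighted estimators discussed here, not the adaptive method proposed in this paper: a distance-weighted k-NN regressor with inverse-distance weights and a Nadaraya-Watson estimator with a Gaussian kernel. The function names, the inverse-distance weighting, and the Gaussian kernel are illustrative assumptions; note how k and the bandwidth play parallel roles as locality parameters, each fixed globally rather than per decision point.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k, eps=1e-12):
    """Distance-weighted k-NN regression at a single query point x.

    Inverse-distance weights are one common textbook choice; the number of
    neighbors k is a single global parameter, not chosen per point.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    w = 1.0 / (dists[idx] + eps)             # inverse-distance weights
    w /= w.sum()                             # normalize to sum to one
    return float(w @ y_train[idx])

def nadaraya_watson_predict(X_train, y_train, x, bandwidth):
    """Nadaraya-Watson estimate at x with a Gaussian kernel.

    The bandwidth is the analogue of k above: it controls how local the
    weighted average is, and is likewise fixed globally in the basic scheme.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    w = np.exp(-0.5 * (dists / bandwidth) ** 2)   # Gaussian kernel weights
    w /= w.sum()
    return float(w @ y_train)

# Toy usage on synthetic 2-D data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
x0 = np.array([0.2, -0.4])
print(weighted_knn_predict(X, y, x0, k=10))
print(nadaraya_watson_predict(X, y, x0, bandwidth=0.3))
```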
In this paper we offer a coherent and principled approach to adaptively choosing the number of neighbors k and the corresponding weight vector $\alpha \in \mathbb{R}^k$ per decision point. Given a new decision point, we aim to find the best locally weighted predictor, in the sense of minimizing the distance between our prediction and the ground truth. In addition to yielding predictions, our approach enables us to provide a per-decision-point guarantee for the confidence of our predictions. Fig. 1 illustrates the importance of choosing k adaptively. In contrast to previous works on non-parametric regression/classification, we do not assume that the data $\{(x_i, y_i)\}_{i=1}^n$ arrives from some (unknown) underlying distribution, but rather make a weaker assumption that the labels $\{y_i\}_{i=1}^n$ are independent