IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 25, NO. 5, MAY 1995
A k-Nearest Neighbor Classification Rule Based on Dempster-Shafer Theory

Thierry Denœux
Abstract: In this paper, the problem of classifying an unseen pattern on the basis of its nearest neighbors in a recorded data set is addressed from the point of view of Dempster-Shafer theory. Each neighbor of a sample to be classified is considered as an item of evidence that supports certain hypotheses regarding the class membership of that pattern. The degree of support is defined as a function of the distance between the two vectors. The evidence of the k nearest neighbors is then pooled by means of Dempster's rule of combination. This approach provides a global treatment of such issues as ambiguity and distance rejection, and imperfect knowledge regarding the class membership of training patterns. The effectiveness of this classification scheme, as compared to the voting and distance-weighted k-NN procedures, is demonstrated using several sets of simulated and real-world data.
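The combination scheme summarized above is developed in detail later in the paper. Purely to fix ideas about the kind of evidence pooling involved, a minimal Python sketch is given below. The distance-decaying support alpha * exp(-gamma * d) is an assumed placeholder rather than the paper's own definition, and all function and variable names are illustrative; only the combination step follows standard Dempster's rule.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions given as
    {frozenset(focal_element): mass} dictionaries over the same frame."""
    combined, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                combined[C] = combined.get(C, 0.0) + a * b
            else:
                conflict += a * b              # product mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {C: v / (1.0 - conflict) for C, v in combined.items()}

def ds_knn(x, X_train, y_train, k=5, alpha=0.95, gamma=1.0):
    """Pool the evidence of the k nearest neighbors of x by Dempster's rule.
    The support alpha * exp(-gamma * d) is an illustrative choice only."""
    frame = frozenset(np.unique(y_train))      # frame of discernment: all known classes
    d = np.linalg.norm(X_train - x, axis=1)
    m = {frame: 1.0}                           # vacuous mass function before any evidence
    for i in np.argsort(d)[:k]:
        s = alpha * np.exp(-gamma * d[i])      # support decreases with distance
        m = dempster_combine(m, {frozenset([y_train[i]]): s, frame: 1.0 - s})
    # decide for the singleton (single class) carrying the largest combined mass
    best = max((C for C in m if len(C) == 1), key=m.get)
    return next(iter(best))
```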
I. INTRODUCTION
In classification problems, complete statistical knowledge regarding the conditional density functions of each class is rarely available, which precludes application of the optimal Bayes classification procedure. When no evidence supports one form of the density functions rather than another, a good solution is often to build up a collection of correctly classified samples, called the training set, and to classify each new pattern using the evidence of nearby sample observations. One such non-parametric procedure was introduced by Fix and Hodges [11], and has since become well known in the Pattern Recognition literature as the voting k-nearest neighbor (k-NN) rule. According to this rule, an unclassified sample is assigned to the class represented by a majority of its k nearest neighbors in the training set. Cover and Hart [4] have provided a statistical justification of this procedure by showing that, as the number N of samples and k both tend to infinity in such a manner that k/N → 0, the error rate of the k-NN rule approaches the optimal Bayes error rate. Beyond this remarkable property, the k-NN rule owes much of its popularity in the Pattern Recognition community to its good performance in practical applications. However, in the finite sample case, the voting k-NN rule is not guaranteed to be the optimal way of using the information contained in the neighborhood of unclassified patterns. This is the reason why the improvement of this rule has remained an active research topic over the past 40 years.
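For concreteness, a minimal sketch of the voting k-NN rule just described might look as follows (plain NumPy, Euclidean distance; ties in the vote are broken arbitrarily by Counter, and the function and variable names are illustrative rather than taken from the paper):

```python
import numpy as np
from collections import Counter

def voting_knn(x, X_train, y_train, k=5):
    """Assign x to the class represented by a majority of its k nearest neighbors."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training samples
    nn = np.argsort(d)[:k]                    # indices of the k closest neighbors
    return Counter(y_train[nn]).most_common(1)[0][0]
```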
The main drawback of the voting k-NN rule is that it implicitly assumes the k nearest neighbors of a data point x
to be contained in a region of relatively small volume, so that sufficiently good resolution in the estimates of the different conditional densities can be obtained. In practice, however, the distance between x and one of its closest neighbors is not always negligible, and can even become very large outside the regions of high density. This has several consequences. First, it can be questioned whether it is still reasonable in that case to give all the neighbors an equal weight in the decision, regardless of their distances to the point x to be classified. In fact, given the k nearest neighbors x^(1), ..., x^(k) of x, and d^(1), ..., d^(k) the corresponding distances arranged in increasing order, it is intuitively appealing to give the label of x^(i) a greater importance than to the label of x^(j) whenever d^(i) < d^(j). Dudani [10] has proposed to assign to the i-th nearest neighbor x^(i) a weight w^(i) defined as:

    w^(i) = (d^(k) - d^(i)) / (d^(k) - d^(1))   if d^(k) ≠ d^(1),
    w^(i) = 1                                   otherwise.
The unknown pattern x is then assigned to the class for which the weights of the representatives among the k nearest neighbors sum to the greatest value. This rule was shown by Dudani to be admissible, i.e. to yield lower error rates than those obtained using the voting k-NN procedure for at least one particular data set. However, several researchers, repeating Dudani's experiments, reached less optimistic conclusions [1], [16], [6]. In particular, Bailey and Jain [1] showed that the distance-weighted k-NN rule is not necessarily better than the majority rule for small sample sizes if ties are broken in a judicious manner. These authors also showed that, in the infinite sample case (N → ∞), the error rate of the traditional unweighted k-NN rule is better than that of any weighted k-NN rule. However, Macleod et al. [15] presented arguments showing that this conclusion may not apply if the training set is finite. They also proposed a simple extension of Dudani's rule allowing for a more effective use of the k-th neighbor in the classification.
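Under the same illustrative conventions as the earlier sketch, Dudani's distance-weighted rule (without the Macleod et al. extension) could be written as follows; the names and defaults are assumptions, and ties are not broken in the judicious manner advocated by Bailey and Jain:

```python
import numpy as np

def dudani_weighted_knn(x, X_train, y_train, k=5):
    """Distance-weighted k-NN with Dudani's weights
    w^(i) = (d^(k) - d^(i)) / (d^(k) - d^(1))."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                       # k nearest neighbors, closest first
    d_nn = d[nn]
    d1, dk = d_nn[0], d_nn[-1]
    # weight 1 for the nearest neighbor, 0 for the k-th; uniform if all distances tie
    w = np.ones_like(d_nn) if dk == d1 else (dk - d_nn) / (dk - d1)
    labels = y_train[nn]
    scores = {c: w[labels == c].sum() for c in np.unique(labels)}
    return max(scores, key=scores.get)           # class with the largest summed weight
```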
Apart from this discussion, it can also be argued that, because the weights are constrained to span the interval [0, 1], the distance-weighted k-NN procedure can still give considerable importance to observations that are very dissimilar to the pattern to be classified. This represents a serious drawback when all the classes cannot be assumed to be represented in the training set, as is often the case in application areas such as target recognition in noncooperative environments [5] or diagnostic problems [9]. In such situations, it may be wise to consider that a point that is far away from any