when the distance becomes far enough. Numerical approximation techniques called the (Improved) Fast Gauss Transform (IFGT) [14, 36, 49, 50] can further improve these approaches. But the IFGT approach (and fast multipole methods in general) is based on heuristics and does not offer formal theoretical guarantees on the approximation-time trade-off.
In order to have a formal theoretical guarantee to derive an (ε, δ)-approximation, random sampling is a baseline method, but it requires O((1/ε²) log(1/δ)) samples to be included in Q, which could lead to expensive query evaluations, especially for small ε and/or δ values.
A recent technique using discrepancy theory [33] creates a small representation of a kernel density estimate (smaller than the random sampling approach) while still bounding the ℓ_∞ error. It works by creating a min-cost matching of points in P; that is, P is decomposed into |P|/2 pairs so that the sum over all distances between paired points is minimized. Then it randomly removes one point from each pair, reducing the size of P by half. This process is repeated until either the desired subset size or the tolerable error level is reached. However, computing the min-cost matching [11] is expensive, so this approach is only of theoretical interest and not directly feasible for large data. Yet, it will serve as the basis for a family of our proposed algorithms.
A powerful type of kernel is a reproducing kernel [2, 32] (an example is the Gaussian kernel), which has the property that K(p, q) = ⟨p, q⟩_{H_K}; that is, the similarity between objects p and q defines an inner product in a reproducing kernel Hilbert space (RKHS) H_K. This inner-product structure (the so-called "kernel trick") has led to many powerful techniques in machine learning; see [38, 40] and references therein. Most of these techniques are not specifically interested in the kernel density estimate; however, the RKHS offers the property that a single element of this space essentially represents the entire kernel density estimate. These RKHS approximations are typically constructed through some form of random sampling [41, 48], but one technique known as "kernel herding" [7] uses a greedy approach and requires a significantly smaller size in theory; however, it bounds only the ℓ_2 error, as opposed to the sampling techniques, which bound the stronger ℓ_∞ error [24].
Kernel density estimates have been used in the database and data mining community for density and selectivity estimations, e.g., [17, 51]. But the focus in these works is how to use kernel density estimates for approximating range queries and performing selectivity estimation, rather than computing approximate kernel density estimates for fast evaluations. When the end-objective is to use a kernel density estimate to do density or selectivity estimation, one may also use histograms [16, 22, 26, 34] or range queries [12, 13, 19, 20, 47] to achieve similar goals, but these do not have the same smoothness and statistical properties as kernel density estimates [42]. Nevertheless, the focus of this work is on computing approximate kernel density estimates that enable fast query evaluations, rather than exploring how to use kernel density estimates in different application scenarios (which is a well-explored topic in the literature).
4. WARM-UP: ONE DIMENSION
Efficient construction of approximate kernel density estimates in one dimension is fairly straightforward. But it is still worth investigating these procedures in more detail since, to our knowledge, this has not been done at truly large scale before, and the techniques developed will be useful in understanding the higher-dimensional version.
Baseline method: random sampling (RS). A baseline method for constructing an approximate kernel density estimate in one dimension is random sampling. It is well known [7, 33] that if we let Q be a random sample from P of size O((1/ε²) log(1/δ)), then with probability at least 1 − δ the random sample Q ensures that ‖kde_P − kde_Q‖_∞ ≤ ε.
That said, the first technique (RS) follows from this observation directly and just randomly samples O((1/ε²) log(1/δ)) points from P to construct a set Q. In the centralized setting, we can employ the one-pass reservoir sampling technique [46] to implement RS efficiently. For large data that is stored in distributed nodes, RS can still be implemented efficiently using the recent results on generating random samples from distributed streams [9].
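To make the construction concrete, the following is a minimal sketch of RS in the centralized setting using one-pass reservoir sampling. The sample-size constant, the Gaussian kernel choice, and all function names are illustrative assumptions rather than the implementation analyzed here.

```python
import math
import random

def rs_sample(stream, eps, delta, seed=0):
    """One-pass reservoir sampling of k = O((1/eps^2) * log(1/delta)) points.

    `stream` is any iterable over the 1-d points of P; the constant hidden in
    the big-O is an illustrative choice, not the one used in the analysis.
    """
    k = max(1, math.ceil((1.0 / eps**2) * math.log(1.0 / delta)))
    rng = random.Random(seed)
    reservoir = []
    for i, p in enumerate(stream):
        if i < k:
            reservoir.append(p)
        else:
            j = rng.randint(0, i)   # item i+1 replaces a slot with prob. k/(i+1)
            if j < k:
                reservoir[j] = p
    return reservoir

def kde_query(Q, x, sigma):
    """Evaluate kde_Q(x) with a Gaussian kernel (an assumed kernel choice)."""
    return sum(math.exp(-((x - q) ** 2) / (2 * sigma**2)) for q in Q) / len(Q)
```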
The construction cost is O(n). The approximate kde has a size O((1/ε²) log(1/δ)), and its query cost (to evaluate kde_Q(x) for any input x) is O((1/ε²) log(1/δ)).
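To put this size in perspective, with ε = 0.05 and δ = 0.01, and ignoring the hidden constant, the bound suggests a sample of roughly (1/0.05²) · ln(1/0.01) ≈ 400 × 4.6 ≈ 1,840 points, independent of n.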
RS can be used as a preprocessing step for any other technique, i.e., for any technique that constructs a kde over P, we run that technique over a random sample from P instead. This may be especially efficient at extremely large scale (where n ≫ 1/ε²) and where sampling can be done in an efficient manner. This may require that we initially sample a larger set Q than the final output to meet the approximation quality required by other techniques.
Grouping selection (GS). A limitation of RS is that it requires a large sample size (sometimes the entire set) in order to guarantee a desired level of accuracy. As a result, its size and query cost become expensive for small ε and δ. Hence, we introduce another method, called the grouping selection (GS) method. It leverages the following lemma on γ-perturbations.
Lemma 1. Consider n arbitrary values {γ_1, γ_2, . . . , γ_n} such that ‖γ_i‖ ≤ γ for each i ∈ [n]. Then let Q = {q_1, q_2, . . . , q_n} such that q_i = p_i + γ_i for all p_i ∈ P. Then ‖kde_P − kde_Q‖_∞ ≤ γ/σ.
Proof. This follows directly from the (1/σ)-Lipschitz condition on the kernel K (which states that the maximum gradient of K is 1/σ): perturbing each point by at most γ changes each term K(p_i, x) by at most γ/σ, and hence changes their average by at most γ/σ.
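For completeness, the chain of inequalities behind this argument can be written out as follows, assuming the standard unweighted definition kde_P(x) = (1/n) Σ_{p∈P} K(p, x).

```latex
\begin{align*}
\bigl|\mathrm{kde}_P(x) - \mathrm{kde}_Q(x)\bigr|
  &= \Bigl|\tfrac{1}{n}\sum_{i=1}^{n}\bigl(K(p_i,x) - K(q_i,x)\bigr)\Bigr| \\
  &\le \tfrac{1}{n}\sum_{i=1}^{n}\bigl|K(p_i,x) - K(q_i,x)\bigr|
   \;\le\; \tfrac{1}{n}\sum_{i=1}^{n}\tfrac{1}{\sigma}\,\|p_i - q_i\|
   \;\le\; \frac{\gamma}{\sigma}.
\end{align*}
```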
Using Lemma 1, we can select one point q in every segment ℓ of length εσ from P and assign a weight to q that is proportional to the number of points from P in ℓ, to construct an ε-approximate kde of P. Specifically, GS is implemented as follows. After sorting P if it is not already sorted, we sweep points from smallest to largest. When we encounter p_i, we scan until we reach the first p_j such that p_i + εσ < p_j. Then we put p_i (or the centroid of p_i through p_{j−1}) in Q with weight w(p_i) = (j − i)/n. Since Q constructed by GS is weighted, the evaluation of kde_Q(x) should follow the weighted query evaluation as specified in equation (4).
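As a concrete illustration, the following is a minimal sketch of the GS sweep over a sorted one-dimensional P, using the centroid variant. The Gaussian kernel, the assumed weighted form Σ_{q∈Q} w(q)·K(q, x) (with weights summing to one), and all identifier names are illustrative assumptions rather than the exact formulation of equation (4).

```python
import math

def gs_compress(P_sorted, eps, sigma):
    """Group a sorted 1-d list P into segments of length eps*sigma.

    Returns a list of (representative, weight) pairs, where each representative
    is the centroid of its group and the weights sum to 1.
    """
    n = len(P_sorted)
    Q = []
    i = 0
    while i < n:
        j = i
        # scan until the first p_j with p_i + eps*sigma < p_j
        while j < n and P_sorted[j] <= P_sorted[i] + eps * sigma:
            j += 1
        group = P_sorted[i:j]
        Q.append((sum(group) / len(group), (j - i) / n))
        i = j
    return Q

def kde_weighted_query(Q, x, sigma):
    """Weighted evaluation of kde_Q(x) with a Gaussian kernel (assumed form)."""
    return sum(w * math.exp(-((x - q) ** 2) / (2 * sigma**2)) for q, w in Q)
```

For example, `Q = gs_compress(sorted(P), eps=0.05, sigma=1.0)` followed by `kde_weighted_query(Q, x, 1.0)` evaluates the compressed estimate at a query point x.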
Theorem 1. The method GS gives an ε-approximate kernel density estimate of P.
Proof. The weighted output Q of GS corresponds to a point set Q′ that has n · w(q) unweighted points at the same location as each q ∈ Q; then kde_Q = kde_{Q′}. We claim that Q′ is an εσ-perturbation of P, which implies the theorem. To see this claim, we consider any set {p_i, p_{i+1}, . . . , p_{j−1}} of points that are grouped to a single point q ∈ Q. Since