when the distance becomes far enough. Numerical approximation techniques called the (Improved) Fast Gauss Transform (IFGT) [14, 36, 49, 50] can further improve these approaches. But the IFGT approach (and fast multipole methods in general) is based on heuristics and does not offer formal theoretical guarantees on the approximation-time trade-off.
In order to have a formal theoretical guarantee to derive an (ε, δ)-approximation, random sampling is a baseline method, but it requires O((1/ε²) log(1/δ)) samples to be included in Q, which could lead to expensive query evaluations, especially for small ε and/or δ values.
A recent technique using discrepancy theory [33] creates a small representation of a kernel density estimate (smaller than the random sampling approach) while still bounding the ℓ_∞ error. It works by creating a min-cost matching of points in P; that is, P is decomposed into |P|/2 pairs so that the sum over all distances between paired points is minimized. Then it randomly removes one point from each pair, reducing the size of P by half. This process is repeated until either the desired subset size or the tolerable error level is reached. However, computing the min-cost matching [11] is expensive, so this approach is only of theoretical interest and not directly feasible for large data. Yet, it will serve as the basis for a family of our proposed algorithms.
A powerful type of kernel is a reproducing kernel [2, 32] (an example is the Gaussian kernel), which has the property that K(p, q) = ⟨p, q⟩_{H_K}; that is, the similarity between objects p and q defines an inner product in a reproducing kernel Hilbert space (RKHS) H_K. This inner-product structure (the so-called "kernel trick") has led to many powerful techniques in machine learning; see [38, 40] and references therein. Most of these techniques are not specifically interested in the kernel density estimate; however, the RKHS offers the property that a single element of this space essentially represents the entire kernel density estimate. These RKHS approximations are typically constructed through some form of random sampling [41, 48], but one technique known as "kernel herding" [7] uses a greedy approach and requires a significantly smaller size in theory; however, it bounds only the ℓ_2 error, as opposed to the sampling techniques, which bound the stronger ℓ_∞ error [24].
Kernel density estimates have been used in the database and data mining community for density and selectivity estimations, e.g., [17, 51]. But the focus in these works is how to use kernel density estimates for approximating range queries and performing selectivity estimation, rather than computing approximate kernel density estimates for fast evaluations. When the end-objective is to use a kernel density estimate to do density or selectivity estimation, one may also use histograms [16, 22, 26, 34] or range queries [12, 13, 19, 20, 47] to achieve similar goals, but these do not have the same smoothness and statistical properties as kernel density estimates [42]. Nevertheless, the focus of this work is on computing approximate kernel density estimates that enable fast query evaluations, rather than exploring how to use kernel density estimates in different application scenarios (which is a well-explored topic in the literature).
4. WARM-UP: ONE DIMENSION
Efficient construction of approximate kernel density estimates in one dimension is fairly straightforward. But it is still worth investigating these procedures in more detail since, to our knowledge, this has not been done at truly large scale before, and the techniques developed will be useful in understanding the higher-dimensional version.
Baseline method: random sampling (RS). A baseline method for constructing an approximate kernel density estimate in one dimension is random sampling. It is well known [7, 33] that if we let Q be a random sample from P of size O((1/ε²) log(1/δ)), then with probability at least 1 − δ the random sample Q ensures that ‖kde_P − kde_Q‖_∞ ≤ ε.
That said, the first technique (RS) follows from this observation directly and just randomly samples O((1/ε²) log(1/δ)) points from P to construct a set Q. In the centralized setting, we can employ the one-pass reservoir sampling technique [46] to implement RS efficiently. For large data that is stored in distributed nodes, RS can still be implemented efficiently using the recent results on generating random samples from distributed streams [9].
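To make the construction concrete, the following is a minimal sketch of RS in the centralized setting using one-pass reservoir sampling. The sample-size constant, the Gaussian kernel choice, and all function names are illustrative assumptions rather than the implementation analyzed here.

```python
import math
import random

def rs_sample(stream, eps, delta, seed=0):
    """One-pass reservoir sampling of k = O((1/eps^2) * log(1/delta)) points.

    `stream` is any iterable over the 1-d points of P; the constant hidden in
    the big-O is an illustrative choice, not the one used in the analysis.
    """
    k = max(1, math.ceil((1.0 / eps**2) * math.log(1.0 / delta)))
    rng = random.Random(seed)
    reservoir = []
    for i, p in enumerate(stream):
        if i < k:
            reservoir.append(p)
        else:
            j = rng.randint(0, i)   # item i+1 replaces a slot with prob. k/(i+1)
            if j < k:
                reservoir[j] = p
    return reservoir

def kde_query(Q, x, sigma):
    """Evaluate kde_Q(x) with a Gaussian kernel (an assumed kernel choice)."""
    return sum(math.exp(-((x - q) ** 2) / (2 * sigma**2)) for q in Q) / len(Q)
```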
The construction cost is O(n). The approximate kde has a size O((1/ε²) log(1/δ)), and its query cost (to evaluate kde_Q(x) for any input x) is O((1/ε²) log(1/δ)).
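To put this size in perspective, with ε = 0.05 and δ = 0.01, and ignoring the hidden constant, the bound suggests a sample of roughly (1/0.05²) · ln(1/0.01) ≈ 400 × 4.6 ≈ 1,840 points, independent of n.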
RS can be used as a preprocessing step for any other technique, i.e., for any technique that constructs a kde over P, we run that technique over a random sample from P instead. This may be especially efficient at extremely large scale (where n ≫ 1/ε²) and where sampling can be done in an efficient manner. This may require that we initially sample a larger set Q than the final output to meet the approximation quality required by other techniques.
Grouping selection (GS). A limitation of RS is that it requires a large sample size (sometimes the entire set) in order to guarantee a desired level of accuracy. As a result, its size and query cost become expensive for small ε and δ. Hence, we introduce another method, called the grouping selection (GS) method. It leverages the following lemma on γ-perturbations.
Lemma 1. Consider n arbitrary values {γ_1, γ_2, . . . , γ_n} such that ‖γ_i‖ ≤ γ for each i ∈ [n]. Then let Q = {q_1, q_2, . . . , q_n} such that q_i = p_i + γ_i for all p_i ∈ P. Then ‖kde_P − kde_Q‖_∞ ≤ γ/σ.
Proof. This follows directly from the (1/σ)-Lipschitz condition on the kernel K (which states that the maximum gradient of K is 1/σ): perturbing each point by at most γ changes each term K(p_i, x) by at most γ/σ, and hence changes their average by at most γ/σ.
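For completeness, the chain of inequalities behind this argument can be written out as follows, assuming the standard unweighted definition kde_P(x) = (1/n) Σ_{p∈P} K(p, x).

```latex
\begin{align*}
\bigl|\mathrm{kde}_P(x) - \mathrm{kde}_Q(x)\bigr|
  &= \Bigl|\tfrac{1}{n}\sum_{i=1}^{n}\bigl(K(p_i,x) - K(q_i,x)\bigr)\Bigr| \\
  &\le \tfrac{1}{n}\sum_{i=1}^{n}\bigl|K(p_i,x) - K(q_i,x)\bigr|
   \;\le\; \tfrac{1}{n}\sum_{i=1}^{n}\tfrac{1}{\sigma}\,\|p_i - q_i\|
   \;\le\; \frac{\gamma}{\sigma}.
\end{align*}
```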
Using Lemma 1, we can select one point q in every segment ℓ of length εσ from P and assign a weight to q that is proportional to the number of points from P in ℓ, to construct an ε-approximate kde of P. Specifically, GS is implemented as follows. After sorting P if it is not already sorted, we sweep points from smallest to largest. When we encounter p_i, we scan until we reach the first p_j such that p_i + εσ < p_j. Then we put p_i (or the centroid of p_i through p_{j−1}) in Q with weight w(p_i) = (j − i)/n. Since Q constructed by GS is weighted, the evaluation of kde_Q(x) should follow the weighted query evaluation as specified in equation (4).
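As a concrete illustration, the following is a minimal sketch of the GS sweep over a sorted one-dimensional P, using the centroid variant. The Gaussian kernel, the assumed weighted form Σ_{q∈Q} w(q)·K(q, x) (with weights summing to one), and all identifier names are illustrative assumptions rather than the exact formulation of equation (4).

```python
import math

def gs_compress(P_sorted, eps, sigma):
    """Group a sorted 1-d list P into segments of length eps*sigma.

    Returns a list of (representative, weight) pairs, where each representative
    is the centroid of its group and the weights sum to 1.
    """
    n = len(P_sorted)
    Q = []
    i = 0
    while i < n:
        j = i
        # scan until the first p_j with p_i + eps*sigma < p_j
        while j < n and P_sorted[j] <= P_sorted[i] + eps * sigma:
            j += 1
        group = P_sorted[i:j]
        Q.append((sum(group) / len(group), (j - i) / n))
        i = j
    return Q

def kde_weighted_query(Q, x, sigma):
    """Weighted evaluation of kde_Q(x) with a Gaussian kernel (assumed form)."""
    return sum(w * math.exp(-((x - q) ** 2) / (2 * sigma**2)) for q, w in Q)
```

For example, `Q = gs_compress(sorted(P), eps=0.05, sigma=1.0)` followed by `kde_weighted_query(Q, x, 1.0)` evaluates the compressed estimate at a query point x.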
Theorem 1. The method GS gives an ε-approximate kernel density estimate of P.
Proof. The weighted output Q of GS corresponds to a point set Q′ that has n · w(q) unweighted points at the same location as each q ∈ Q; then kde_Q = kde_{Q′}. We claim that Q′ is an εσ-perturbation of P, which implies the theorem. To see this claim, we consider any set {p_i, p_{i+1}, . . . , p_{j−1}} of points that are grouped to a single point q ∈ Q. Since