HyperLogLog: An Optimized Algorithm for Approximately Counting Distinct Elements in Large Data Sets

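Updated 2024-07-18 · PDF, 956 KB

The HyperLogLog paper, written by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier, addresses cardinality estimation: approximating the number of distinct elements in a very large data collection. The algorithm makes a single pass over the data and uses m units of auxiliary memory (typically "short bytes") to produce an estimate with a relative standard error of about 1.04/√m. Compared with LogLog, the best previously known algorithm, HyperLogLog reaches the same accuracy with only 64% of the memory. As a result, 1.5 kilobytes of memory are enough to estimate cardinalities beyond 10^9 with a typical error of 2%.

HyperLogLog is an important tool in data mining and big-data analytics because it makes cardinality computation feasible on very large inputs. Cardinality is the number of distinct elements in a collection, and computing it without storing the raw data is hard. Exact methods based on sorting or hash tables need memory that grows with the number of distinct elements, which becomes impractical at massive scale.

The paper analyzes in detail how the algorithm works and how well it performs. The core idea is probabilistic. Each element is hashed, and the algorithm observes the position of the leftmost 1-bit in the hashed value (equivalently, one plus the number of leading zeros). A long run of leading zeros is rare, so the largest position observed indicates, on a logarithmic scale, how many distinct values have been seen. The key is to summarize these observations compactly, so that a small amount of memory still gives an accurate estimate.

Both LogLog and HyperLogLog split the stream into m substreams ("buckets") according to the first bits of each hashed value, and the register for each bucket records the largest leftmost-1 position seen in that bucket. The two algorithms differ in how the registers are combined. LogLog averages the register values, which amounts to a geometric mean of the quantities 2^M[j]. HyperLogLog uses a harmonic mean instead, which is less sensitive to unusually large registers. This lowers the variance, so the same accuracy is reached with less memory.

The paper also covers error analysis and practical optimizations, such as corrections at the small and large ends of the cardinality range. Because of its efficiency and small memory footprint, HyperLogLog is widely used in big-data systems, such as Google's BigQuery and Presto (the SQL query engine that originated at Facebook), for real-time analytics and stream processing.

In short, HyperLogLog is an efficient way to estimate cardinality approximately on large data sets. It strikes a good balance between memory use and accuracy, which matters for big-data analytics and real-time monitoring.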
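The accuracy and memory figures in the summary can be checked with a few lines of arithmetic. The sketch below evaluates the standard error 1.04/√m and compares register counts against LogLog at equal accuracy. The LogLog constant 1.30 does not appear on this page; it is the commonly cited value for that algorithm and is an assumption here, as are the function and variable names.

```python
import math

HLL_CONSTANT = 1.04     # from the summary: standard error ~ 1.04 / sqrt(m)
LOGLOG_CONSTANT = 1.30  # assumption: commonly cited LogLog constant, not given on this page


def standard_error(constant, m):
    """Relative standard error of an estimator that uses m registers."""
    return constant / math.sqrt(m)


def registers_needed(constant, target_error):
    """Real-valued register count m for which constant / sqrt(m) equals target_error."""
    return (constant / target_error) ** 2


m = 2048
hll_error = standard_error(HLL_CONSTANT, m)

# At equal accuracy the register counts scale as the square of the constants,
# so the ratio is (1.04 / 1.30)^2 whatever target error is chosen.
memory_ratio = registers_needed(HLL_CONSTANT, 0.02) / registers_needed(LOGLOG_CONSTANT, 0.02)

print(f"HyperLogLog standard error with m={m}: {hll_error:.2%}")
print(f"HyperLogLog registers / LogLog registers at equal error: {memory_ratio:.2f}")
```

With m = 2048 the error is about 2.3%, and the register ratio is 0.64, which agrees with the 64% figure quoted above.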

130 P. Flajolet, É. Fusy, O. Gandouet, F. Meunier

Let $h : D \to [0,1] \equiv \{0,1\}^{\infty}$ hash data from domain $D$ to the binary domain.
Let $\rho(s)$, for $s \in \{0,1\}^{\infty}$, be the position of the leftmost 1-bit ($\rho(0001\cdots) = 4$).

Algorithm HYPERLOGLOG (input $\mathcal{M}$: multiset of items from domain $D$).
assume $m = 2^{b}$ with $b \in \mathbb{Z}_{>0}$;
initialize a collection of $m$ registers, $M[1], \ldots, M[m]$, to $-\infty$;
for $v \in \mathcal{M}$ do
    set $x := h(v)$;
    set $j = 1 + \langle x_1 x_2 \cdots x_b \rangle_2$; {the binary address determined by the first $b$ bits of $x$}
    set $w := x_{b+1} x_{b+2} \cdots$; set $M[j] := \max(M[j], \rho(w))$;
compute $Z := \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$; {the "indicator" function}
return $E := \alpha_m m^2 Z$ with $\alpha_m$ as given by Equation (3).

Fig. 2: The HYPERLOGLOG Algorithm.
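The pseudocode of Fig. 2 can be turned into a short Python sketch. Several choices below are assumptions for illustration and not part of the figure: the hash function (SHA-256 truncated to 64 bits in place of an infinite bit string), the helper names, 0-based register indices, and the closed-form values used for $\alpha_m$ (commonly used approximations, used here in place of the integral referenced as Equation (3)). The sketch follows the figure literally, so registers start at $-\infty$ and the estimate is 0 whenever a register was never hit. It is therefore only meaningful when the number of distinct items is well above $m$.

```python
import hashlib
import math

HASH_BITS = 64  # assumption: the "infinite" hash string is truncated to 64 bits


def hash64(item):
    """Hypothetical hash h: map an item to a 64-bit integer via SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rho(w, width):
    """1-based position of the leftmost 1-bit of w, read as a width-bit string.

    Returns width + 1 for w == 0, since the truncated string contains no 1-bit.
    """
    return width - w.bit_length() + 1


def alpha(m):
    """Assumed approximations of alpha_m (closed forms, not the exact integral)."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)  # intended for m >= 128


def hyperloglog(items, b):
    """Raw estimate E = alpha_m * m^2 * Z, following Fig. 2 with m = 2^b registers."""
    assert b >= 4, "the alpha approximations above assume m >= 16"
    m = 1 << b
    width = HASH_BITS - b
    registers = [-math.inf] * m
    for v in items:
        x = hash64(v)
        j = x >> width                  # first b bits of x (0-based address)
        w = x & ((1 << width) - 1)      # remaining bits x_{b+1} x_{b+2} ...
        registers[j] = max(registers[j], rho(w, width))
    # 2^(-M[j]) is +infinity for a register still at -infinity, which forces Z = 0.
    total = sum(math.inf if r == -math.inf else 2.0 ** -r for r in registers)
    z = 1.0 / total
    return alpha(m) * m * m * z


estimate = hyperloglog(range(100_000), b=10)
print(f"estimate for 100000 distinct items with m=1024: {estimate:.0f}")
```

Feeding the same items twice leaves every register unchanged, so duplicates do not affect the estimate.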

in $\mathcal{M}$. A suitable hash function $h$ has been fixed. The algorithm relies on a specific bit-pattern observable in conjunction with stochastic averaging. Given a string $s \in \{0,1\}^{\infty}$, let $\rho(s)$ represent the position of the leftmost 1 (equivalently one plus the length of the initial run of 0's). The stream $\mathcal{M}$ is split into substreams $\mathcal{M}_1, \ldots, \mathcal{M}_m$, based on the first $b$ bits of hashed values of items, where $m = 2^{b}$, and each substream is processed independently. For $\mathcal{N} \equiv \mathcal{M}_j$ such a substream (regarded as composed of hashed values stripped of their initial $b$ bits), the corresponding observable is then

$$\mathrm{Max}(\mathcal{N}) := \max_{x \in \mathcal{N}} \rho(x), \qquad (1)$$

with the convention that $\mathrm{Max}(\emptyset) = -\infty$. The algorithm gathers on the fly (in registers $M[j]$) the values $M^{(j)}$ of $\mathrm{Max}(\mathcal{M}_j)$ for $j = 1, \ldots, m$. Once all the elements have been scanned, the algorithm computes the indicator,

$$Z := \left( \sum_{j=1}^{m} 2^{-M^{(j)}} \right)^{-1}. \qquad (2)$$
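A small worked example may help fix the definitions of $\rho$, Max, and the indicator $Z$. The sketch below uses four hand-made substreams of short bit strings (already stripped of their address bits). The strings, the helper names, and the truncation to four bits are all illustrative assumptions. Exact fractions are used so the arithmetic can be followed by hand.

```python
import math
from fractions import Fraction


def rho(bits):
    """1-based position of the leftmost '1' in a bit string, e.g. rho('0001') == 4."""
    return bits.index("1") + 1  # raises ValueError for an all-zero (truncated) string


def max_observable(substream):
    """Equation (1): largest rho over the substream, with Max(empty) = -infinity."""
    return max((rho(x) for x in substream), default=-math.inf)


# Hypothetical substreams for m = 4 registers.
substreams = [
    ["1011", "0110"],  # rho values 1, 2 -> Max = 2
    ["0001", "0100"],  # rho values 4, 2 -> Max = 4
    ["0010"],          # rho value 3     -> Max = 3
    ["1000", "1111"],  # rho values 1, 1 -> Max = 1
]

registers = [max_observable(n) for n in substreams]

# Equation (2): Z is the reciprocal of the sum of 2^(-M^(j)).
# Here the sum is 1/4 + 1/16 + 1/8 + 1/2 = 15/16, so Z = 16/15.
Z = 1 / sum(Fraction(1, 2 ** r) for r in registers)

print(registers)  # [2, 4, 3, 1]
print(Z)          # 16/15
```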

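It then returns a normalized version of the harmonic mean of the $2^{M^{(j)}}$ in the form,

$$E := \frac{\alpha_m m^2}{\sum_{j=1}^{m} 2^{-M^{(j)}}}, \quad \text{with} \quad \alpha_m := \left( m \int_0^{\infty} \left( \log_2 \left( \frac{2+u}{1+u} \right) \right)^{m} du \right)^{-1}. \qquad (3)$$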
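The constant $\alpha_m$ in Equation (3) has no simple closed form, but the integral is easy to evaluate numerically. The sketch below uses composite Simpson's rule on a truncated interval. The truncation point, step count, and function names are assumptions chosen for illustration; for $m \ge 16$ the integrand is negligible beyond $u = 10$. The rounded values 0.673 and 0.709 used for comparison are the figures commonly quoted for $m = 16$ and $m = 64$, and are likewise not taken from this page.

```python
import math


def integrand(u, m):
    """The integrand of Equation (3): (log2((2 + u) / (1 + u)))^m."""
    return math.log2((2.0 + u) / (1.0 + u)) ** m


def alpha(m, upper=10.0, steps=200_000):
    """Approximate alpha_m by composite Simpson's rule on [0, upper].

    Assumes m >= 16, so that the tail beyond `upper` is negligible,
    and that `steps` is even.
    """
    h = upper / steps
    total = integrand(0.0, m) + integrand(upper, m)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2   # Simpson weights: 4 at odd nodes, 2 at even nodes
        total += weight * integrand(i * h, m)
    integral = total * h / 3.0
    return 1.0 / (m * integral)


for m in (16, 64, 128):
    print(m, round(alpha(m), 4))

# As m grows, alpha_m approaches 1 / (2 ln 2).
print("limit:", round(1.0 / (2.0 * math.log(2.0)), 4))
```

The limit $1/(2 \ln 2) \approx 0.7213$ follows from expanding the integrand near $u = 0$, where it behaves like $e^{-mu/(2 \ln 2)}$.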

Here is the intuition underlying the algorithm. Let $n$ be the unknown cardinality of $\mathcal{M}$. Each substream will comprise approximately $n/m$ elements. Then, its Max-parameter should be close to $\log_2(n/m)$. The harmonic mean ($mZ$ in our notations) of the quantities $2^{\mathrm{Max}}$ is then likely to be of the order of $n/m$. Thus, $m^2 Z$ should be of the order of $n$. The constant $\alpha_m$, provided by our subsequent analysis, is finally introduced so as to correct a systematic multiplicative bias present in $m^2 Z$.
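The intuition can be probed with a small simulation, sketched below under stated assumptions. Hashed values are drawn as uniform random 64-bit integers from a seeded generator, which stands in for an ideal hash applied to $n$ distinct items. The parameter choices and names are illustrative. Registers start at 0 for simplicity, which is harmless only because $n$ is far larger than $m$ and every register is hit with overwhelming probability.

```python
import math
import random


def simulate_registers(n, b, seed=0, hash_bits=64):
    """Fill m = 2^b registers from n uniform random hash values."""
    rng = random.Random(seed)
    m = 1 << b
    width = hash_bits - b
    registers = [0] * m  # assumes n >> m, so no register stays empty
    for _ in range(n):
        x = rng.getrandbits(hash_bits)
        j = x >> width                       # address from the first b bits
        w = x & ((1 << width) - 1)           # remaining bits
        position = width - w.bit_length() + 1  # leftmost 1-bit position in w
        registers[j] = max(registers[j], position)
    return registers


n, b = 200_000, 10
m = 1 << b
registers = simulate_registers(n, b)

Z = 1.0 / sum(2.0 ** -r for r in registers)
mean_register = sum(registers) / m
harmonic_mean = m * Z   # harmonic mean of the 2^M[j]; should be of order n / m
raw = m * m * Z         # should be of order n, up to a constant factor
bias = raw / n          # the multiplicative bias that alpha_m corrects

print(f"log2(n/m) = {math.log2(n / m):.2f}, mean register = {mean_register:.2f}")
print(f"n/m = {n / m:.1f}, harmonic mean mZ = {harmonic_mean:.1f}")
print(f"m^2 Z / n = {bias:.3f}")
```

If the analysis holds, the ratio $m^2 Z / n$ should sit near $1/\alpha_m$, roughly 1.39 for large $m$, with random fluctuations of a few percent at $m = 1024$.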

Our main statement, Theorem 1 below, deals with the situation of ideal multisets:

Footnote: The algorithm can be adapted to cope with any integral value of $m \ge 3$, at the expense of a few additional arithmetic operations.
