Truly Low-Space Element Distinctness and Subset Sum via Pseudorandom Hash Functions

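This paper gives truly low-space algorithms for Element Distinctness and Subset Sum using pseudorandom hash functions. The authors are Lijie Chen, Ce Jin, R. Ryan Williams, and Hongxun Wu. It was posted on November 3, 2021 as arXiv:2111.01759v1, in computer science under Data Structures and Algorithms (cs.DS).

Element Distinctness is a classic problem in computer science: given an array of n integers, each of O(log n) bits, decide whether all elements are pairwise distinct (equivalently, whether some value occurs twice). Earlier work by Beame, Clifford, and Machmouchi (FOCS 2013) solves the problem in roughly O(n^1.5) time with O(log n) bits of working space, but it assumes a random oracle, that is, read-only random access to polynomially many random bits. This paper resolves the question of whether the dependence on a random oracle can be removed. The authors give a new randomized algorithm that uses only O(log^3 n log log n) bits of space and still runs in roughly O(n^1.5) time, so the space stays polylogarithmic even when the random bits are counted.

As a by-product, the authors obtain an algorithm for Subset Sum, which asks whether a set of integers has a non-empty subset whose elements sum to a given target. The algorithm runs in O*(2^0.86n) time and poly(n) space. The paper thus shows how pseudorandom hash functions can be used to obtain efficient algorithms under tight space limits, and deepens the understanding of hashing and randomized algorithms.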

Claim 1. Fix an index $\vec{k}$ corresponding to a level-$i$ node ($\vec{k} = (0, \ldots, k_i, k_{i+1}, \ldots, k_\ell)$ and $k_i > 0$). Conditioned on the event $F_{\vec{k}}$, with probability 1/2 the node $\mu_{\vec{k}}$ has a level-$i$ child $\nu$ (i.e., for $\vec{k}' = (0, \ldots, k_i - 1, k_{i+1}, \ldots, k_\ell)$, $F_{\vec{k}'}$ holds) and $\mathrm{next}(\nu)$ is distributed uniformly in $[n]$.

Assuming that Claim 1 holds, (12) follows by a simple induction. However, it is not hard to see that Claim 1 does not hold for the original tree $T$. To understand the issue, let $\vec{k}'$ be as in Claim 1 and assume $\mu_{\vec{k}}$ exists (i.e., $F_{\vec{k}}$ holds). We wish to better understand the conditions under which $\mu_{\vec{k}'}$ exists. Letting $\mathbf{r}_{<i}$ and $\mathbf{g}_{<i}$ denote $(\mathbf{r}_1, \ldots, \mathbf{r}_{i-1})$ and $(\mathbf{g}_1, \ldots, \mathbf{g}_{i-1})$ respectively, we additionally fix $(\mathbf{r}_{<i}, \mathbf{g}_{<i}) = (r_{<i}, g_{<i})$ (we use $r_{<i} \wedge g_{<i}$ to denote this event for simplicity).

The existence condition of $\mu_{\vec{k}'}$ in $T$. Let $\alpha$ be the smallest-numbered node such that $\alpha > \mu_{\vec{k}}$ and the level of $\alpha$ is greater than $i-1$. Then $\mu_{\vec{k}'}$ exists if and only if $\alpha$ exists and $\mathrm{level}(\alpha) = i$. Hence, our goal is to determine $\alpha$. By definition, to move from $\mu_{\vec{k}}$ to $\alpha$ in the random walk $w$, one first moves to the node corresponding to vertex $\mathrm{next}(\mu_{\vec{k}})$, and then keeps going to the next node until reaching a node with level at least $i$. The following algorithm implements this procedure and returns the simulated random walk, and we observe that it only uses the values of $(r_{\le i}, g_{\le i})$. Note that we use $(\cdots)$ to denote a sequence of vertices, and use $\circ$ to denote the concatenation of two sequences.

Algorithm 1: Simulating the random walk from s′ until reaching a level greater than i

Function sim(s′, i)
    if i = 0 then
        return (s′)                      // stop here since all nodes have levels at least 1
    s_0 ← s′, j ← 0, w ← ()              // start from s_0 = s′
    repeat
        w ← w ∘ sim(s_j, i − 1)          // simulate from s_j until hitting a node with level at least i
        x_{j+1} ← w_{|w|}                // vertex x_{j+1} corresponds to the next node after s_j with level ≥ i
        if g_i(a_{x_{j+1}}) = 1 then
            s_{j+1} ← r_i(a_{x_{j+1}})   // move to the next node since the node corresponding to x_{j+1} has level i
        j ← j + 1
    until g_i(a_{x_j}) = 0
    return w                             // stop here since the node corresponding to x_j has level > i

Function Find(s′, i)
    return the last vertex in the sequence returned by sim(s′, i)
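The recursion of sim and Find can be sketched in Python as follows. This is a minimal illustration, not the authors' code: vertices are 0-indexed, the per-level functions are passed as dictionaries `g[i]` and `r[i]` keyed by array value (hypothetical names), and the sketch assumes that $g_i$ and $r_i$ are evaluated on the array value $a_x$ of the current vertex $x$. Like sim, it does not terminate if the walk runs into a cycle.

```python
def sim(s, i, a, g, r):
    """Walk from vertex s until reaching a node of level greater than i.

    a    : input array (list); a[x] is the value stored at vertex x
    g[i] : dict mapping an array value to 0/1 (does level i continue the walk?)
    r[i] : dict mapping an array value to the next vertex chosen at level i
    Returns the list of vertices visited, in order.
    """
    if i == 0:
        return [s]  # every node has level at least 1, so stop immediately
    w = []
    cur = s
    while True:
        w += sim(cur, i - 1, a, g, r)  # advance to the next node of level >= i
        x = w[-1]
        if g[i][a[x]] == 0:
            return w  # the node at x has level > i
        cur = r[i][a[x]]  # the node at x has level exactly i: follow r_i


def find(s, i, a, g, r):
    """Last vertex of the simulated walk."""
    return sim(s, i, a, g, r)[-1]


# Toy instance with n = 4 vertices and two levels (all values are made up).
a = [5, 7, 9, 11]
g = {1: {5: 1, 7: 0, 9: 1, 11: 0}, 2: {5: 0, 7: 1, 9: 0, 11: 0}}
r = {1: {5: 1, 7: 2, 9: 3, 11: 0}, 2: {5: 0, 7: 2, 9: 0, 11: 3}}
walk_level_1 = sim(0, 1, a, g, r)
walk_level_2 = sim(0, 2, a, g, r)
```

In this toy instance the level-1 walk stops after two vertices, and the level-2 walk strings two level-1 walks together, which mirrors how sim at level i concatenates sub-walks of level i − 1.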

One can see that $\mathrm{sim}(\mathrm{next}(\mu), i-1)$ generates the entire sub-walk after $\mu$ until reaching the next node with level at least $i$. Now, the hope is to argue that, conditioning on $F_{\vec{k}} \wedge r_{<i} \wedge g_{<i}$, we have $g_i(\mathrm{Find}(\mathrm{next}(\mu), i-1)) = 1$ with probability 1/2.

Two issues with the original random walk $w$. There are two important issues with the argument above:

1. We need to argue that $g_i(\mathrm{Find}(\mathrm{next}(\mu), i-1))$ is independent of the event $F_{\vec{k}} \wedge r_{<i} \wedge g_{<i}$.

2. Even if $g_i(\mathrm{Find}(\mathrm{next}(\mu), i-1)) = 1$, it could be the case that $w$ stops during the simulation of $\mathrm{sim}(\mathrm{next}(\mu), i-1)$ due to a collision, and in that case $\mu_{\vec{k}'}$ also does not exist.

The second issue is fundamental, as it reveals the "global dependency nature" of the original random walk $w$: the event that $w$ stops depends on all entries in $w$.

(Footnote: One also needs to show that with probability 1/2, $\mu$ has a level-$j$ child with a uniformly random next-value, for all $j < i$. We ignore this part in the technical overview.)

A locally simulatable extended random walk. To circumvent the second issue, we wish for our extended random walk $\widetilde{w}$ to be locally simulatable. That is, knowing that node $\mu$ exists and knowing the value of $\mathrm{next}(\mu)$, together with fixed $r_{<i}$ and $g_{<i}$, one should be able to simulate the extended random walk $\widetilde{w}$ after $\mu$ until reaching a node with level at least $i$. The second issue above amounts to the fact that $\mathrm{sim}(\mu, i)$ fails to locally simulate the walk $w$, since it does not have enough information to determine whether $w$ has already terminated during its simulation (it cannot determine whether there is a collision between the encountered node and the earlier nodes in $w$).

Similar to the basic extended random walk in Section 2.3, for each $i \in [\ell]$, we extend the domain of $g_i$ and $r_i$ from $[m]$ to $[m] \cup \{\star_t : t \in \mathbb{N}\}$ as follows: for each $t \in \mathbb{N}$, we sample $g_i(\star_t) \in \{0, 1\}$ and $r_i(\star_t) \in [n]$ uniformly at random, where all samples are independent.

Since the "local" simulation with respect to node 0, $\mathrm{next}(0) = s$, and fixed $r_{\le \ell}$ and $g_{\le \ell}$ is just the entire random walk, we will define our extended random walk by giving its local simulation in Algorithm 2, and we set $\widetilde{w} \leftarrow \mathrm{walk}(s, \ell, 0)$. Note that $\mathrm{walk}(s, \ell, 0)$ also gives the extended tree $\widetilde{T}$ by specifying level and next.

Algorithm 2: Algorithm for extended walk

1  Function walk(s′, i, μ_0) (where s′ ∈ [n], 0 ≤ i ≤ ℓ)
2      if i = 0 then return (s′)
3      C_0 ← ∅, star ← false
4      j ← 0, s_0 ← s′, w ← ()
5      repeat
6          w ← w ∘ walk(s_j, i − 1, μ_0 + |w|)
7          x_{j+1} ← w_{|w|}
8          (y, star) ← (a_{x_{j+1}}, false) if a_{x_{j+1}} ∉ C_j ∧ ¬star;
                       (⋆_t, true) otherwise (where t := min{t ∈ ℕ | ⋆_t ∉ C_j})
9          μ_{j+1} ← μ_0 + |w|
10         if g_i(y) = 1 then
11             C_{j+1} ← C_j ∪ {y}, s_{j+1} ← r_i(y)
12             level(μ_{j+1}) ← i, next(μ_{j+1}) ← r_i(y)
13         j ← j + 1
14     until g_i(y) = 0
15     return w
16 Function ExtFind(s′, i)
17     return the last vertex in the sequence returned by walk(s′, i, 0)

(Footnote: Indeed, if the simulation $\mathrm{sim}(\mathrm{next}(\mu), i-1)$ encounters a collision (two nodes $\alpha, \beta$ whose array values coincide), it would loop forever.)

(Footnote: See Section 5.1 for a detailed explanation of Algorithm 2.)

Establishing Claim 1 for $\widetilde{T}$. One can inspect that the algorithm walk behaves the same as sim until a collision occurs at Line 8 (that is, there is a collision in $\{a_{x_1}, a_{x_2}, \ldots, a_{x_{j+1}}\}$). That is, $\mathrm{sim}(s, \ell)$ and $\mathrm{walk}(s, \ell, 0)$ behave the same until the walk reaches two positions $j \ne k$ whose array values collide. This implies that (9) holds.

To show Claim 1 holds for $\widetilde{w}$ and $\widetilde{T}$, we still have to argue that $g_i(\mathrm{ExtFind}(\mathrm{next}(\mu), i-1))$ is independent of the event $F_{\vec{k}} \wedge r_{<i} \wedge g_{<i}$. Formally proving this requires a delicate induction, but the intuition is that $F_{\vec{k}}$ depends on at most $k_i$ values in $g_i$ and $r_i$, and the procedure walk carefully ensures that $g_i(\mathrm{ExtFind}(\mathrm{next}(\mu), i-1))$ is never one of them. Hence, since $k_i \le \tau/4$ and $g_i$ is $\tau$-wise independent, we have the desired independence.

Handling $E_{\mathrm{bad}}$ and the two-vertex case. We have just established Condition (12), which gives a lower bound for $E_{\mathrm{total}}$; now we briefly discuss how to obtain an upper bound on $E_{\mathrm{bad}}$ sufficient for proving the desired lower bound on $\Pr[u \in f^*_{a,h}(s)]$ using (14). One can first observe that (13) cannot hold for all possible choices of the three indices, as there could be a collision between these three paths. In fact, let $K$ be the total number of nodes in the union of the paths corresponding to these three indices

. Then a revised estimate for Pr[F

∧next(µ

) =

u ∧F

∧F

∧a

next(µ

)

= a

next(µ

)

] should be



· n



−1

. By a careful calculation, one can show that this revised estimate is still enough to show that $E_{\mathrm{bad}}$ is upper bounded by $O(2^{3\ell})$, which is good enough for our purposes.

However, even establishing this revised estimate is quite challenging. Recall that the conjunction of the existence events $F$ for the three indices is equivalent to the condition that, for every level-$i$ node $\beta$ on the paths from the root to the three corresponding nodes $\mu$, the corresponding value of $g_i$ equals 1. This amounts to $K$ events and we hope to show they are all independent. However, this is not true in general, as there can be a collision of $a$-values between two different paths among these three paths. We overcome this issue by showing that for each "bad node" $\mu$, there must exist a "bad" collision pair on the extended walk without this issue. In such a case one can establish a revised estimate; subtracting all these revised estimates from $E_{\mathrm{good}}$ would still yield a good lower bound on $\Pr[u \in f^*_{a,h}(s)]$.

Our proof for lower-bounding $\Pr[u, v \in f^*_{a,h}(s)]$ follows the same template as above, while using a more involved analysis to handle the dependency issues across the paths (we have to consider four paths now: two corresponding to $u$ and $v$, and the other two corresponding to the "bad" collision pair).

3 Preliminaries

Let $[n]$ denote $\{1, 2, \ldots, n\}$. We use $\mathbb{N}$ to denote the set of non-negative integers. We use $\widetilde{O}(f)$ to denote $O(f \cdot \mathrm{poly}\log f)$ in the usual way; $\widetilde{\Omega}$ and $\widetilde{\Theta}$ are defined similarly.

We measure the space complexity of an algorithm by the maximum number of bits in its working memory:

the read-only input is not counted. We measure the time complexity by the number of word operations (with

word length Θ(log n)) in the word RAM model.

For Element Distinctness and List Disjointness, we always assume the input arrays of length $n$ consist of positive integers bounded from above by $m = n^c + c$, where $c$ is a fixed constant independent of $n$. (We often abbreviate this by saying $m = \mathrm{poly}(n)$.) For an array $a \in [m]^n$, define the second frequency moment $F_2(a) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{1}[a_i = a_j]$ as the number of colliding pairs $(i, j)$ (including the case where $i = j$). Note that $n \le F_2(a) \le n^2$.
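As a quick check of the definition: a value that occurs c times in the array contributes c^2 ordered pairs (i, j), so the second frequency moment is the sum of squared multiplicities. The sketch below computes it with a counter; it uses linear space and is meant only to illustrate the definition and the bounds n ≤ F_2(a) ≤ n^2, not a low-space method. The function names are ours.

```python
from collections import Counter


def second_frequency_moment(a):
    """F_2(a): number of ordered pairs (i, j), including i == j, with a[i] == a[j]."""
    return sum(c * c for c in Counter(a).values())


def has_duplicate(a):
    """All elements are distinct exactly when F_2(a) equals len(a)."""
    return second_frequency_moment(a) > len(a)


f2_mixed = second_frequency_moment([1, 2, 2, 3])   # multiplicities 1, 2, 1
f2_equal = second_frequency_moment([4, 4, 4])      # one value, three times
f2_distinct = second_frequency_moment([1, 2, 3])   # all distinct
```

The two extreme cases match the bounds stated above: an all-distinct array attains F_2(a) = n and an all-equal array attains F_2(a) = n^2.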

We will use the following standard pseudorandomness construction.

Theorem 3.1 (Explicit k-wise independent hash family, [CW79]; see also [Vad12, Corollary 3.34]). For $n, m, k \in \mathbb{N}$, there is a family of $k$-wise independent functions $\mathcal{H} \subseteq \{h \mid h \colon \{0,1\}^n \to \{0,1\}^m\}$ such that every function from $\mathcal{H}$ can be described in $k \cdot \max\{n, m\}$ random bits, and evaluating a function from $\mathcal{H}$ (given its description, and given an input $x \in \{0,1\}^n$) takes time $\mathrm{poly}(n, m, k)$.
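For intuition, one classical way to obtain k-wise independence is to evaluate a uniformly random polynomial of degree below k over a prime field. The sketch below is an illustrative stand-in over Z_p, not the bit-string construction cited in Theorem 3.1; the names `sample_kwise`, `evaluate`, and `joint_value_counts`, and the choice of prime `p`, are assumptions of the sketch. The description of one function is its k coefficients.

```python
import random
from itertools import product


def sample_kwise(k, p, rng=random):
    """Draw k coefficients in Z_p; they describe one function Z_p -> Z_p."""
    return [rng.randrange(p) for _ in range(k)]


def evaluate(coeffs, x, p):
    """Evaluate c_0 + c_1 x + ... + c_{k-1} x^{k-1} mod p (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def joint_value_counts(k, p, points):
    """Enumerate the whole family and count how often each tuple of values occurs."""
    counts = {}
    for coeffs in product(range(p), repeat=k):
        key = tuple(evaluate(coeffs, x, p) for x in points)
        counts[key] = counts.get(key, 0) + 1
    return counts


# For k = 2 and p = 5, check the joint distribution of the values at two distinct points.
counts = joint_value_counts(k=2, p=5, points=(1, 3))
seed = sample_kwise(4, 101, random.Random(0))
```

Enumerating the family for k = 2 shows every pair of output values occurring equally often at two distinct inputs, which is exactly pairwise independence; the same interpolation argument gives k-wise independence for general k as long as k ≤ p.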

We often use bold font letters (e.g., X) to denote random variables. We also use supp(X) to denote the

support of random variable X.

For a set U, we often use x ∈

U to denote the process of selecting an element x from U uniformly at

random.

4 Properties of the Pseudorandom Family and their Implications

We will first define our pseudorandom hash family in Section 4.1, and then give the proofs of our main theorems in Section 4.2, assuming some key technical lemmas that will be proved in subsequent sections.

4.1 Construction of the Pseudorandom Family

We first introduce some handy notation. For two functions $a, b \colon [m] \to ([n] \cup \{\star\})$, we naturally view them as "restrictions" (where $\star$ means "unrestricted"), and define their composition as

$$(a \bullet b)(x) := \begin{cases} b(x) & \text{if } b(x) \ne \star, \\ a(x) & \text{otherwise.} \end{cases}$$

Observe that $(a \bullet b) \bullet c = a \bullet (b \bullet c)$.
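The composition of restrictions is easy to mirror in code. In this small sketch `None` plays the role of the unrestricted symbol, and the three lookup tables are made-up examples; it checks associativity on one instance rather than proving it.

```python
STAR = None  # stands for the "unrestricted" symbol


def compose(a, b):
    """Return a • b: use b(x) where b is restricted, otherwise fall back to a(x)."""
    def composed(x):
        bx = b(x)
        return bx if bx is not STAR else a(x)
    return composed


# Three toy restrictions on the domain {0, 1, 2, 3}, given as lookup tables.
A = {0: 7, 1: STAR, 2: 8, 3: STAR}
B = {0: STAR, 1: 5, 2: STAR, 3: STAR}
C = {0: STAR, 1: STAR, 2: 6, 3: STAR}
a, b, c = A.get, B.get, C.get

left = compose(compose(a, b), c)    # (a • b) • c
right = compose(a, compose(b, c))   # a • (b • c)
left_values = [left(x) for x in range(4)]
right_values = [right(x) for x in range(4)]
```

In both groupings the rightmost restriction that is not unrestricted at x decides the value, which is why the order of bracketing does not matter.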

Let $\ell \le \log n$ and $\tau = O(\log n \log\log n)$ be two positive integer parameters to be determined later. A sample $h \colon [m] \to ([n] \cup \{\star\})$ from $\mathcal{H}_{\ell,m,n}$ is generated by an $\ell$-level iterative restriction process, defined as follows.

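Drawing a sample $h$ from the pseudorandom hash function family $\mathcal{H}_{\ell,m,n}$

1. For each $i \in [\ell]$, independently draw two random functions $g_i \colon [m] \to \{0, 1\}$ and $r_i \colon [m] \to [n]$ from $\tau$-wise independent hash families (Theorem 3.1). Define $h_i \colon [m] \to [n] \cup \{\star\}$ to be

$$h_i(x) := \begin{cases} \star & \text{if } g_i(x) = 0, \\ r_i(x) & \text{if } g_i(x) = 1. \end{cases}$$

2. Define $h$ to be $h_\ell \bullet \cdots \bullet h_2 \bullet h_1$.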

Intuitively, the functions $g_i \colon [m] \to \{0, 1\}$ control whether the value of $h(x)$ should be restricted at the $i$-th level, while the functions $r_i \colon [m] \to [n]$ determine the value that $h(x)$ is restricted to at the $i$-th level. Note that $h(x) = \star$ if $g_1(x) = \cdots = g_\ell(x) = 0$, and $h(x) = r_j(x)$ if $g_1(x) = \cdots = g_{j-1}(x) = 0$ and $g_j(x) = 1$.

Since $m = \mathrm{poly}(n)$, the seed length for each $i \in [\ell]$ is $O(\log^2 n \log\log n)$ bits (Theorem 3.1), and hence the total seed length for describing the hash function $h$ is $O(\ell \log^2 n \log\log n) = O(\log^3 n \log\log n)$. Slightly abusing notation, we also use $h \in \mathcal{H}_{\ell,m,n}$ to denote that $h$ is a hash function generated as above.
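The evaluation order of the composed function can be sketched as follows: scan the levels from the first one and return the value of the first level that restricts x. This is a simplified illustration. In particular, `sample_h` fills fully random tables as a stand-in for the τ-wise independent functions, so it uses far more randomness than the seed length discussed above, and the names `build_h` and `sample_h` as well as the 0-indexed domain are assumptions of the sketch.

```python
import random

STAR = None  # stands for the "unrestricted" output


def build_h(gs, rs):
    """Combine per-level tables into one function h.

    gs[i][x] is 0 or 1 and rs[i][x] is the restricted value, for list
    positions i = 0, 1, ... (position i corresponds to level i + 1 in the text).
    """
    def h(x):
        for g_i, r_i in zip(gs, rs):
            if g_i[x] == 1:
                return r_i[x]  # the first restricting level decides h(x)
        return STAR            # no level restricts x
    return h


def sample_h(levels, m, n, rng):
    """Stand-in sampler: fully random tables instead of tau-wise independent ones."""
    gs = [[rng.randrange(2) for _ in range(m)] for _ in range(levels)]
    rs = [[rng.randrange(n) for _ in range(m)] for _ in range(levels)]
    return build_h(gs, rs), gs, rs


# Hand-made tables with m = 3 inputs and two levels.
gs = [[1, 0, 0], [1, 1, 0]]
rs = [[4, 5, 6], [7, 8, 9]]
h = build_h(gs, rs)
values = [h(0), h(1), h(2)]
```

With these tables, input 0 is decided at the first level, input 1 at the second, and input 2 stays unrestricted, matching the case distinction in the note on h(x) above.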

Digraph $G_{a,h}$ and reachable set $f^*_{a,h}(s)$. Next we set up some notation. Recall that $a \in [m]^n$ is the input array. For a hash function $h \colon [m] \to ([n] \cup \{\star\})$, we define a mapping $f_{a,h} \colon [n] \to ([n] \cup \{\star\})$ by $f_{a,h}(x) := h(a_x)$. This mapping naturally defines an $n$-vertex digraph $G_{a,h}$, where each vertex $x \in [n]$ has one outgoing edge $x \mapsto h(a_x)$ if $h(a_x) \ne \star$, and no outgoing edge if $h(a_x) = \star$.

We use $f^*_{a,h}(s)$ to denote the set of vertices reachable in $G_{a,h}$ from $s$. When $a$ and $h$ are clear from context, we will simply write $f^*_{a,h}(s)$ as $f^*(s)$. Since each vertex in $G_{a,h}$ has at most one outgoing edge, note that the vertices in $f^*(s)$ form either a path or a "rho-shaped" component.
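The two possible shapes of the reachable set can be seen by following the unique outgoing edge until it disappears or revisits a vertex. The sketch below does this with a visited set, so it uses linear space and is only an illustration of the definition; the function name, the 0-indexed vertices, and the toy tables are assumptions.

```python
def reachable_set(a, h, s):
    """Vertices reachable from s when each vertex x has the single edge x -> h(a[x]).

    h returns None for the unrestricted symbol (no outgoing edge).
    Returns (visited vertices in order, shape), where shape is "path" or "rho".
    """
    order, seen = [], set()
    x = s
    while x is not None and x not in seen:
        seen.add(x)
        order.append(x)
        x = h(a[x])
    return order, ("path" if x is None else "rho")


# Toy input (0-indexed vertices); the tables are made up for illustration.
a = [10, 20, 30, 20, 40]
h_rho = {10: 1, 20: 2, 30: 3, 40: None}.get
h_path = {10: 1, 20: 4, 30: 0, 40: None}.get
rho_result = reachable_set(a, h_rho, 0)
path_result = reachable_set(a, h_path, 0)
```

In the first table the walk from vertex 0 returns to vertex 2 because vertices 1 and 3 hold the same array value, giving a rho shape; in the second the walk ends at a vertex with no outgoing edge.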

4.2 Proofs of the Main Results

Let $a = (a_1, \ldots, a_n) \in [m]^n$ be the read-only input array. The BCM Element Distinctness algorithm [BCM13] uses the following version of Floyd's cycle-finding algorithm performed on the digraph specified by $f_{a,h}$.

Lemma 4.1 ([BCM13, Theorem 2.1]). Assuming oracle access to $f_{a,h} \colon [n] \to ([n] \cup \{\star\})$, there is a deterministic algorithm COLLIDE($s$) which finds the pair $(u, v) \in [n] \times [n]$ (if it exists) such that $u, v \in f^*_{a,h}(s)$, $u \ne v$ and $a_u = a_v$, in $O(|f^*_{a,h}(s)|)$ time and $O(\log n)$ space.
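To make the cycle-finding step concrete, here is a textbook tortoise-and-hare sketch on a functional graph whose vertices may lack an outgoing edge. It is our illustration of the general technique, not necessarily the exact COLLIDE procedure of [BCM13]: it keeps only a constant number of vertex pointers, locates the two distinct vertices whose edges enter the same cycle vertex, and then checks whether their array values agree. The names `collide` and `find_duplicate` and the toy data are assumptions.

```python
def collide(f, s):
    """Floyd-style search from s, where f(x) is the unique out-neighbour of x,
    or None if x has no outgoing edge.

    Returns a pair (u, v) with u != v and f(u) == f(v) on the walk from s,
    or None if the walk ends or s already lies on the cycle.
    """
    tortoise = hare = s
    while True:
        tortoise = f(tortoise)
        hare = f(hare)
        if hare is not None:
            hare = f(hare)
        if tortoise is None or hare is None:
            return None  # the walk is a path that ends: no cycle, no collision
        if tortoise == hare:
            break
    # Second phase: walking in lockstep from s and from the meeting point,
    # the two pointers first share a successor at the entry of the cycle.
    u, v = s, hare
    if u == v:
        return None  # s is on the cycle, so the walk has no tail
    while f(u) != f(v):
        u, v = f(u), f(v)
    return u, v


def find_duplicate(a, h, s):
    """Run collide on f(x) = h(a[x]) and keep the pair only if a[u] == a[v]."""
    pair = collide(lambda x: h(a[x]), s)
    if pair is not None and a[pair[0]] == a[pair[1]]:
        return pair
    return None


# Toy input: vertices 1 and 3 hold the same value, so their edges coincide.
a = [10, 20, 30, 20]
h = {10: 1, 20: 2, 30: 3}.get
duplicate = find_duplicate(a, h, 0)
```

The final equality check matters because two distinct vertices can share a successor through a collision of the hash function alone, in which case the pair is not a duplicate in the array.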

In the BCM algorithm, $h$ was chosen from a truly random hash family. Our goal is to show that sampling $h$ from our pseudorandom hash family $\mathcal{H}_{\ell,m,n}$ also suffices. To do this, we need the following two properties of our hash family $\mathcal{H}_{\ell,m,n}$.

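Lemma 4.2 (Bounding the visit probability for a single vertex). Suppose $\ell = \log n - \frac{1}{2} \log F_2(a) - 10$. For every vertex $v \in [n]$, we have

$$\Pr_{h \in \mathcal{H}_{\ell,m,n},\; s \in [n]} [v \in f^*_{a,h}(s)] = \Theta\left(\frac{1}{\sqrt{F_2(a)}}\right).$$

Lemma 4.3 (Lower bound for collision probability). Suppose $\ell = \log n - \frac{1}{2} \log F_2(a) - 10$. For every $u, v \in [n]$ such that $u \ne v$ and $a_u = a_v$, we have

$$\Pr_{h \in \mathcal{H}_{\ell,m,n},\; s \in [n]} [u, v \in f^*_{a,h}(s)] \ge \Omega\left(\frac{1}{F_2(a)}\right).$$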

Lemma 4.2 is proved in Section 6 and Lemma 4.3 is proved in Section 7.

Remark 4.4. In Lemma 4.2, we obtain both a lower bound and an upper bound for $\Pr_{h,s}[v \in f^*_{a,h}(s)]$, and we will see shortly that only the upper bound is used in the proof of Theorem 1.1; the lower bound part of Lemma 4.2 can be seen as a warm-up for the proof of Lemma 4.3, which requires proving a lower bound for the more involved two-vertex case (see Section 7).

Since $\ell \le \log n$, each hash function $h$ from our hash family $\mathcal{H}_{\ell,m,n}$ can be described with a seed of $O(\log^3 n \log\log n)$ bits and can be evaluated in $\mathrm{poly}\log(n)$ time and $O(\log^3 n \log\log n)$ space. Armed with the two lemmas above, we can prove our main theorems.

(Footnote: The original BCM algorithm works for $f_{a,h} \colon [n] \to [n]$, but it works equally well when some vertices $v$ may have no outgoing edges, i.e., $f_{a,h}(v) = \star$.)

(Footnote: We ignore all floors and ceilings for simplicity.)
