A Space- and Time-Efficient Algorithm for Maintaining the Densest Subgraph in Big Data Streams


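This article looks at how to handle the dense subgraph problem efficiently on dynamic streams in a big-data setting. "Space- and Time-Efficient Algorithm for Maintaining Dense Subgraphs on One-Pass Dynamic Streams" is a research paper posted on arXiv by Sayan Bhattacharya, Monika Henzinger, Danupon Nanongkai, and Charalampos E. Tsourakakis, dated April 10, 2015. In many graph mining applications, processing a large data stream in real time while staying efficient in both time and space is a key challenge. The paper focuses on the densest subgraph problem: finding the subset of nodes whose induced subgraph has the largest ratio of edges to nodes. The problem has applications in social network analysis, community detection, and recommender systems.

The paper proposes an algorithm for maintaining a dense subgraph on a dynamic stream that is efficient in both time and space. Under edge insertions and deletions, it maintains a $(4+\epsilon)$-approximate solution with high probability. It uses $\tilde{O}(n)$ space, where $n$ is the number of nodes in the graph, and $\tilde{O}(1)$ amortized time per update, where $\tilde{O}$ hides $O(\operatorname{poly}\log_{1+\epsilon} n)$ factors. The algorithm therefore stays efficient on large inputs even while the graph keeps changing. According to the paper, this is the first algorithm for a graph problem that is efficient in both space and time simultaneously, which gives real-time graph analysis on large data a new theoretical foundation.

In summary, the paper offers a new method for the dense subgraph problem on dynamic streams, relevant to both graph mining and online data analysis, and its space and time bounds make it useful to researchers and practitioners alike.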

Roughly speaking, they state that we can use the $(\alpha, d, L)$-decomposition to $2\alpha(1+\epsilon)^2$-approximate the densest subgraph by setting $L = O(\log n/\epsilon)$ and trying different values of $d$ in powers of $(1+\epsilon)$.

Theorem 2.2. Fix any $\alpha \ge 1$, $d \ge 0$, $\epsilon \in (0, 1)$, and $L \leftarrow 2 + \lceil \log_{(1+\epsilon)} n \rceil$. Let $d^* \leftarrow \max_{S \subseteq V} \rho(S)$ be the maximum density of any subgraph in $G = (V, E)$, and let $(Z_1, \ldots, Z_L)$ be an $(\alpha, d, L)$-decomposition of $G = (V, E)$. We have

• (1) If $d > 2(1+\epsilon)d^*$, then $Z_L = \emptyset$.

• (2) Else if $d < d^*/\alpha$, then $Z_L \neq \emptyset$ and there is an index $j \in \{1, \ldots, L-1\}$ such that $\rho(Z_j) \ge d/(2(1+\epsilon))$.
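To make Theorem 2.2 concrete, here is a minimal Python sketch of one way to build a valid decomposition offline. It assumes the two conditions of Definition 2.1 as they are used in the proofs of this section: $Z_1 = V$, every node of $Z_i$ with induced degree above $\alpha d$ must enter $Z_{i+1}$, and no node with induced degree below $d$ may. Keeping exactly the nodes of degree at least $d$ satisfies both for any $\alpha \ge 1$. The function names and the example graph are hypothetical, not taken from the paper, and this is not the paper's streaming algorithm.

```python
import math

def num_levels(n, eps):
    """L = 2 + ceil(log_{1+eps} n), as in Theorem 2.2."""
    return 2 + math.ceil(math.log(n) / math.log(1 + eps))

def density(nodes, edges):
    """rho(S): edges with both endpoints in S, divided by |S| (0 for the empty set)."""
    if not nodes:
        return 0.0
    return sum(1 for u, v in edges if u in nodes and v in nodes) / len(nodes)

def exact_decomposition(vertices, edges, d, L):
    """Return [Z_1, ..., Z_L], keeping at each level the nodes of induced degree >= d."""
    levels = [set(vertices)]                     # Z_1 = V
    for _ in range(L - 1):
        current = levels[-1]
        degree = dict.fromkeys(current, 0)
        for u, v in edges:
            if u in current and v in current:    # count only edges inside Z_i
                degree[u] += 1
                degree[v] += 1
        levels.append({v for v in current if degree[v] >= d})
    return levels

# Hypothetical example: a 4-clique on {0,1,2,3} plus a pendant node 4, so d* = 6/4 = 1.5.
VERTICES = range(5)
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
L = num_levels(len(VERTICES), eps=0.5)
small_d = exact_decomposition(VERTICES, EDGES, d=1, L=L)  # d < d*/alpha with alpha = 1
large_d = exact_decomposition(VERTICES, EDGES, d=5, L=L)  # d > 2(1+eps)d* = 4.5
```

With these inputs, `large_d[-1]` comes out empty, matching case (1), while `small_d[-1]` is non-empty and some earlier level has density at least $d/(2(1+\epsilon))$, matching case (2).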

Corollary 2.3. Fix $\alpha$, $\epsilon$, $L$, $d^*$ as in Theorem 2.2. Let $\pi, \sigma > 0$ be any two numbers satisfying $\alpha \cdot \pi < d^* < \sigma/(2(1+\epsilon))$. Discretize the range $[\pi, \sigma]$ into powers of $(1+\epsilon)$, by defining $d_k \leftarrow (1+\epsilon)^{k-1} \cdot \pi$ for every $k \in [K]$, where $K$ is any integer strictly greater than $\lceil \log_{(1+\epsilon)}(\sigma/\pi) \rceil$. For every $k \in [K]$, construct an $(\alpha, d_k, L)$-decomposition $(Z_1(k), \ldots, Z_L(k))$ of $G = (V, E)$. Let $k' \leftarrow \max\{k \in [K] : Z_L(k) \neq \emptyset\}$. Then we have the following guarantees:

• $d^*/(\alpha(1+\epsilon)) \le d_{k'} \le 2(1+\epsilon) \cdot d^*$.

• There exists an index $j' \in \{1, \ldots, L-1\}$ such that $\rho(Z_{j'}(k')) \ge d_{k'}/(2(1+\epsilon))$.
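The search over powers of $(1+\epsilon)$ in Corollary 2.3 can be sketched as follows. This is an offline illustration under stated assumptions, not the paper's dynamic algorithm: it uses exact peeling (valid for $\alpha = 1$) and the bounds $\pi = 1/4$ and $\sigma = 2(1+\epsilon)n$, which satisfy $\alpha\pi < d^* < \sigma/(2(1+\epsilon))$ for any graph with at least one edge, since then $1/2 \le d^* < n$. The helper names are hypothetical.

```python
import math

def density(nodes, edges):
    """rho(S): edges with both endpoints in S, divided by |S| (0 for the empty set)."""
    if not nodes:
        return 0.0
    return sum(1 for u, v in edges if u in nodes and v in nodes) / len(nodes)

def peel(vertices, edges, d, L):
    """Exact decomposition: Z_{i+1} keeps the nodes of Z_i with induced degree >= d."""
    levels = [set(vertices)]
    for _ in range(L - 1):
        current = levels[-1]
        degree = dict.fromkeys(current, 0)
        for u, v in edges:
            if u in current and v in current:
                degree[u] += 1
                degree[v] += 1
        levels.append({v for v in current if degree[v] >= d})
    return levels

def approx_densest(vertices, edges, eps):
    """Return (d_{k'}, densest level among Z_1(k'), ..., Z_{L-1}(k'))."""
    vertices = list(vertices)
    n = len(vertices)
    L = 2 + math.ceil(math.log(n) / math.log(1 + eps))
    pi, sigma = 0.25, 2 * (1 + eps) * n            # assumed bounds, alpha = 1
    K = math.ceil(math.log(sigma / pi) / math.log(1 + eps)) + 1
    best = None
    for k in range(1, K + 1):
        d_k = (1 + eps) ** (k - 1) * pi
        levels = peel(vertices, edges, d_k, L)
        if levels[-1]:                             # Z_L(k) is non-empty
            best = (d_k, levels)                   # the last hit is k' = max such k
    if best is None:
        raise ValueError("graph has no edges")
    d_kp, levels = best
    return d_kp, max(levels[:-1], key=lambda Z: density(Z, edges))

# Hypothetical example: a 4-clique on {0,1,2,3} plus a pendant node 4 (d* = 1.5).
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
D_KP, DENSE = approx_densest(range(5), EDGES, eps=0.5)
```

Rebuilding every decomposition from scratch costs far more than the update time the paper targets; the sketch only illustrates how $k'$ and $j'$ are chosen.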

We will use the above corollary as follows. Since $K = O(\log_{1+\epsilon} n)$, it is not hard to maintain $k'$ and the set of nodes $Z_{j'}(k')$. The corollary guarantees that the density of the set of nodes $Z_{j'}(k')$ is a $(2\alpha(1+\epsilon)^2)$-approximation to $d^*$.

The rest of this section is devoted to proving Theorem 2.2.

The first lemma relates the density to the minimum degree. Its proof can be found in the full version.

Lemma 2.4. Let $S^* \subseteq V$ be a subset of nodes with maximum density, i.e., $\rho(S^*) \ge \rho(S)$ for all $S \subseteq V$. Then $D_v(S^*) \ge \rho(S^*)$ for all $v \in S^*$. Thus, the degree of each node in $G(S^*)$ is at least the density of $S^*$.
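Lemma 2.4 can be checked by brute force on a tiny graph. One standard argument for why it should hold (the full version's proof may differ): if some $v \in S^*$ had $D_v(S^*) < \rho(S^*)$, then deleting $v$ would leave a set of density $(|E(S^*)| - D_v(S^*))/(|S^*| - 1) > \rho(S^*)$, contradicting maximality. The sketch below enumerates all subsets, which is feasible only for very small inputs; the helper names and the example graph are hypothetical.

```python
from itertools import combinations

def density(nodes, edges):
    """rho(S): edges with both endpoints in S, divided by |S| (0 for the empty set)."""
    if not nodes:
        return 0.0
    return sum(1 for u, v in edges if u in nodes and v in nodes) / len(nodes)

def induced_degree(v, nodes, edges):
    """D_v(S): number of edges joining v to another node of S."""
    return sum(1 for a, b in edges
               if (a == v and b in nodes) or (b == v and a in nodes))

def densest_subset(vertices, edges):
    """Exhaustive search over all non-empty subsets of the vertex set."""
    vertices = list(vertices)
    subsets = (set(c) for r in range(1, len(vertices) + 1)
               for c in combinations(vertices, r))
    return max(subsets, key=lambda S: density(S, edges))

# Hypothetical example: a 4-clique on {0,1,2,3} plus a pendant node 4.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
S_STAR = densest_subset(range(5), EDGES)
RHO_STAR = density(S_STAR, EDGES)
MIN_DEGREE = min(induced_degree(v, S_STAR, EDGES) for v in S_STAR)
```

Here the densest subset is the 4-clique, whose minimum induced degree (3) is indeed at least its density (1.5).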

Proof of Theorem 2.2. (1) Suppose that $d > 2(1+\epsilon)d^*$. Consider any level $i \in [L-1]$, and note that $\delta(Z_i) = 2 \cdot \rho(Z_i) \le 2 \cdot \max_{S \subseteq V} \rho(S) = 2d^* < d/(1+\epsilon)$. It follows that the number of nodes $v$ in $G(Z_i)$ with degree $D_v(Z_i) \ge d$ is less than $|Z_i|/(1+\epsilon)$, as otherwise $\delta(Z_i) \ge d/(1+\epsilon)$. Let us define the set $C_i = \{v \in Z_i : D_v(Z_i) < d\}$. We have $|Z_i \setminus C_i| \le |Z_i|/(1+\epsilon)$. Now, from Definition 2.1 we have $Z_{i+1} \cap C_i = \emptyset$, which, in turn, implies that $|Z_{i+1}| \le |Z_i \setminus C_i| \le |Z_i|/(1+\epsilon)$. Thus, for all $i \in [L-1]$, we have $|Z_{i+1}| \le |Z_i|/(1+\epsilon)$. Multiplying all these inequalities, for $i = 1$ to $L-1$, we conclude that $|Z_L| \le |Z_1|/(1+\epsilon)^{L-1}$. Since $|Z_1| = |V| = n$ and $L = 2 + \lceil \log_{(1+\epsilon)} n \rceil$, we get $|Z_L| \le n/(1+\epsilon)^{(1+\log_{(1+\epsilon)} n)} < 1$. This can happen only if $Z_L = \emptyset$.
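Two steps in part (1) are stated tersely, so a worked version may help. The first is the averaging step behind "as otherwise $\delta(Z_i) \ge d/(1+\epsilon)$", which uses only that degrees are non-negative and that $\delta(Z_i) = 2\rho(Z_i)$ is the average degree in $G(Z_i)$. The second is the closing arithmetic, which uses $\lceil x \rceil \ge x$ and $(1+\epsilon)^{\log_{(1+\epsilon)} n} = n$.

```latex
% Averaging step: if at least |Z_i|/(1+\epsilon) nodes of G(Z_i) had degree >= d, then
\delta(Z_i) \;=\; \frac{1}{|Z_i|} \sum_{v \in Z_i} D_v(Z_i)
        \;\ge\; \frac{1}{|Z_i|} \cdot \frac{|Z_i|}{1+\epsilon} \cdot d
        \;=\; \frac{d}{1+\epsilon}.

% Final bound, using L - 1 = 1 + \lceil \log_{(1+\epsilon)} n \rceil \ge 1 + \log_{(1+\epsilon)} n:
|Z_L| \;\le\; \frac{n}{(1+\epsilon)^{L-1}}
      \;\le\; \frac{n}{(1+\epsilon)^{1+\log_{(1+\epsilon)} n}}
      \;=\; \frac{n}{(1+\epsilon)\, n}
      \;=\; \frac{1}{1+\epsilon} \;<\; 1.
```

Since $|Z_L|$ is a non-negative integer strictly below 1, it must be 0.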

(2) Suppose that $d < d^*/\alpha$, and let $S^* \subseteq V$ be a subset of nodes with highest density, i.e., $\rho(S^*) = d^*$. We will show that $S^* \subseteq Z_i$ for all $i \in \{1, \ldots, L\}$. This will imply that $Z_L \neq \emptyset$. Clearly, we have $S^* \subseteq V = Z_1$. By induction hypothesis, assume that $S^* \subseteq Z_i$ for some $i \in [L-1]$. We show that $S^* \subseteq Z_{i+1}$. By Lemma 2.4, for every node $v \in S^*$, we have $D_v(Z_i) \ge D_v(S^*) \ge \rho(S^*) = d^* > \alpha d$. Hence, from Definition 2.1, we get $v \in Z_{i+1}$ for all $v \in S^*$. This implies that $S^* \subseteq Z_{i+1}$.

Next, we will show that if $d < d^*/\alpha$, then there is an index $j \in \{1, \ldots, L-1\}$ such that $\rho(Z_j) \ge d/(2(1+\epsilon))$. For the sake of contradiction, suppose that this is not the case. Then we have $d < d^*/\alpha$ and $\delta(Z_i) = 2 \cdot \rho(Z_i) < d/(1+\epsilon)$ for every $i \in \{1, \ldots, L-1\}$. Then, applying an argument similar to case (1), we conclude that $|Z_{i+1}| \le |Z_i|/(1+\epsilon)$ for every $i \in \{1, \ldots, L-1\}$, which implies that $Z_L = \emptyset$. Thus, we arrive at a contradiction.

3 Warmup: A Single Pass Streaming Algorithm

In this section, we present a single-pass streaming algorithm for maintaining a $(2+\epsilon)$-approximate solution to the densest subgraph problem. The algorithm handles a dynamic (turnstile) stream of edge insertions/deletions in $\tilde{O}(n)$ space. In particular, we do not worry about the update time of our algorithm. Our main result in this section is summarized in Theorem 3.1.

Theorem 3.1. We can process a dynamic stream of updates in the graph $G$ in $\tilde{O}(n)$ space, and with high probability return a $(2 + O(\epsilon))$-approximation of $d^* = \max_{S \subseteq V} \rho(S)$ at the end of the stream.

Throughout this section, we fix a small constant $\epsilon \in (0, 1/2)$ and a sufficiently large constant $c > 1$. Moreover, we set $\alpha \leftarrow (1+\epsilon)/(1-\epsilon)$ and $L \leftarrow 2 + \lceil \log_{(1+\epsilon)} n \rceil$. The main technical lemma is below and states that we can construct an $(\alpha, d, L)$-decomposition by sampling $\tilde{O}(n)$ edges.

Lemma 3.2. Fix an integer $d > 0$, and let $S$ be a collection of $cm(L-1)\log n/d$ mutually independent random samples (each consisting of one edge) from the edge-set $E$ of the input graph $G = (V, E)$. With high probability we can construct from $S$ an $(\alpha, d, L)$-decomposition $(Z_1, \ldots, Z_L)$ of $G$, using only $\tilde{O}(n + m/d)$ bits of space.

Proof. We partition the samples in $S$ evenly among $(L-1)$ groups $\{S_i\}$, $i \in [L-1]$. Thus, each $S_i$ is a collection of $cm\log n/d$ mutually independent random samples from the edge-set $E$, and, furthermore, the collections $\{S_i\}$, $i \in [L-1]$, themselves are mutually independent. Our algorithm works as follows.

• Set $Z_1 \leftarrow V$.

• For $i = 1$ to $(L-1)$: Set $Z_{i+1} \leftarrow \{v \in Z_i : D_v(Z_i, S_i) \ge (1-\epsilon)\alpha c \log n\}$.
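The two steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than the paper's streaming implementation: the full edge list is held in memory and each group $S_i$ is drawn from it uniformly with replacement, whereas a genuine one-pass algorithm would have to obtain its samples from the stream. Logarithms are natural logs, and the function name, the constant $c = 100$, and the example graph are all hypothetical.

```python
import math
import random
from itertools import combinations

def sampled_decomposition(vertices, edges, d, eps, c, rng):
    """Build (Z_1, ..., Z_L) from sampled degrees, following the two steps of the proof."""
    vertices = list(vertices)
    n, m = len(vertices), len(edges)
    alpha = (1 + eps) / (1 - eps)
    L = 2 + math.ceil(math.log(n) / math.log(1 + eps))
    group_size = math.ceil(c * m * math.log(n) / d)       # |S_i| = c m log n / d
    threshold = (1 - eps) * alpha * c * math.log(n)
    levels = [set(vertices)]                              # Z_1 = V
    for _level in range(L - 1):
        current = levels[-1]
        sampled_degree = dict.fromkeys(current, 0)        # D_v(Z_i, S_i)
        for _sample in range(group_size):                 # a fresh, independent group S_i
            u, v = rng.choice(edges)
            if u in current and v in current:
                sampled_degree[u] += 1
                sampled_degree[v] += 1
        levels.append({v for v in current if sampled_degree[v] >= threshold})
    return levels

# Hypothetical example: a 6-clique on {0,...,5} plus a pendant node 6.
# With d = 2 and eps = 0.25, clique nodes have degree 5 > alpha*d and node 6 has degree 1 < d.
EDGES = list(combinations(range(6), 2)) + [(5, 6)]
LEVELS = sampled_decomposition(range(7), EDGES, d=2, eps=0.25, c=100,
                               rng=random.Random(0))
```

In this example every clique node's expected sampled degree is about twice the threshold and the pendant node's is well under half of it, so the final level should be the 6-clique except with negligible probability.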

To analyze the correctness of the algorithm, define the (random) sets $A_i = \{v \in Z_i : D_v(Z_i, E) > \alpha d\}$ and $B_i = \{v \in Z_i : D_v(Z_i, E) < d\}$ for all $i \in [L-1]$. Note that for all $i \in [L-1]$, the random sets $Z_i, A_i, B_i$ are completely determined by the outcomes of the samples in $\{S_j\}$, $j < i$. In particular, the samples in $S_i$ are chosen independently of the sets $Z_i, A_i, B_i$. Let $\mathcal{E}_i$ be the event that (a) $Z_{i+1} \supseteq A_i$ and (b) $Z_{i+1} \cap B_i = \emptyset$. By Definition 2.1, the output $(Z_1, \ldots, Z_L)$ is a valid $(\alpha, d, L)$-decomposition of $G$ iff the event $\bigcap_{i=1}^{L-1} \mathcal{E}_i$ occurs. Consider any $i \in [L-1]$. Below, we show that the event $\mathcal{E}_i$ occurs with high probability. The lemma follows by taking a union bound over all $i \in [L-1]$.

Fix any instantiation of the random set $Z_i$. Condition on this event, and note that this event completely determines the sets $A_i, B_i$. Consider any node $v \in A_i$. Let $X_{v,i}(j) \in \{0, 1\}$ be an indicator random variable for the event that the $j$-th sample in $S_i$ is of the form $(u, v)$, with $u \in N_v(Z_i)$. Note that the random variables $\{X_{v,i}(j)\}$, $j$, are mutually independent. Furthermore, we have $E[X_{v,i}(j) \mid Z_i] = D_v(Z_i)/m > \alpha d/m$ for all $j$. Since there are $cm\log n/d$ such samples in $S_i$, by linearity of expectation we get: $E[D_v(Z_i, S_i) \mid Z_i] = \sum_j E[X_{v,i}(j) \mid Z_i] > (cm\log n/d) \cdot (\alpha d/m) = \alpha c \log n$. The node $v$ is included in $Z_{i+1}$ iff $D_v(Z_i, S_i) \ge (1-\epsilon)\alpha c \log n$, and this event, in turn, occurs with high probability (by Chernoff bound). Taking a union bound over all nodes $v \in A_i$, we conclude that $\Pr[Z_{i+1} \supseteq A_i \mid Z_i] \ge 1 - 1/(\operatorname{poly} n)$. Using a similar line of reasoning, we get that $\Pr[Z_{i+1} \cap B_i = \emptyset \mid Z_i] \ge 1 - 1/(\operatorname{poly} n)$. Invoking a union bound over these two events, we get $\Pr[\mathcal{E}_i \mid Z_i] \ge 1 - 1/(\operatorname{poly} n)$. Since this holds for all possible instantiations of $Z_i$, the event $\mathcal{E}_i$ itself occurs with high probability.
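The Chernoff step and the "similar line of reasoning" for $B_i$ can be spelled out as below. This is one plausible instantiation using the standard multiplicative Chernoff bounds, not necessarily the exact form used in the full version; $\log$ is taken as the natural logarithm, and another base only changes the constant in the exponent. It also shows why $\alpha = (1+\epsilon)/(1-\epsilon)$ is a convenient choice: the single threshold $(1-\epsilon)\alpha c \log n$ equals $(1+\epsilon) c \log n$.

```latex
% Multiplicative Chernoff bounds for a sum X of independent 0/1 variables, 0 < \epsilon < 1:
%   \Pr[X \le (1-\epsilon)\mu_L] \le \exp(-\epsilon^2 \mu_L / 2)   whenever E[X] \ge \mu_L,
%   \Pr[X \ge (1+\epsilon)\mu_H] \le \exp(-\epsilon^2 \mu_H / 3)   whenever E[X] \le \mu_H.

% (a) v \in A_i: E[D_v(Z_i, S_i) | Z_i] > \alpha c \log n, so take \mu_L = \alpha c \log n:
\Pr\bigl[ D_v(Z_i, S_i) < (1-\epsilon)\alpha c \log n \,\big|\, Z_i \bigr]
    \;\le\; \exp\!\bigl(-\epsilon^2 \alpha c \log n / 2\bigr)
    \;=\; n^{-\epsilon^2 \alpha c / 2}.

% (b) v \in B_i: E[D_v(Z_i, S_i) | Z_i] < (cm \log n / d)(d/m) = c \log n,
%     and (1-\epsilon)\alpha = 1+\epsilon, so take \mu_H = c \log n:
\Pr\bigl[ D_v(Z_i, S_i) \ge (1+\epsilon) c \log n \,\big|\, Z_i \bigr]
    \;\le\; \exp\!\bigl(-\epsilon^2 c \log n / 3\bigr)
    \;=\; n^{-\epsilon^2 c / 3}.
```

For constant $\epsilon$ and a sufficiently large constant $c$, both bounds are $1/\operatorname{poly}(n)$, and they survive a union bound over at most $n$ nodes.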

The space requirement of the algorithm, ignoring polylog factors, is proportional to the number of samples in $S$ (which is $cm(L-1)\log n/d$) plus the number of nodes in $V$ (which is $n$). Since
