改进的随机游走图采样算法：解决现有问题与新方法

PDF格式 | 388KB | 更新于2024-08-27 | 181 浏览量 | 举报

"基于随机游动的图采样是当前在大规模图数据处理中被广泛采纳的基本技术，用于高效地收集均匀节点样本。本文主要针对三种常用的随机游动图采样算法进行深入分析：重权随机游动（Re-weighted Random Walk, RW）、Metropolis-Hastings随机游动（Metropolis-Hastings Random Walk, MH）以及最大度随机游动（Maximum Degree Random Walk, MD）。首先，作者指出这三种算法在实际应用中存在一些局限性。RW算法可能过于偏向于频繁访问高度连接的节点，导致采样结果非均匀；MH算法虽然能够提供更好的全局采样效果，但由于其接受/拒绝机制，采样效率可能会降低；而MD算法虽然简单易实现，但它可能导致采样集中于具有高度连接性的局部区域，牺牲了全局视角。为了克服这些算法的不足，本文提出了两个新的随机游动图采样方法：拒绝控制的Metropolis-Hastings算法（Rejection-Controlled Metropolis-Hastings, RCMH）和广义最大度随机游动算法（Generalized Maximum Degree Random Walk, GMD）。RCMH算法通过调整接受/拒绝策略，在保持局部连通性和全局采样之间找到了一个平衡点，从而改善了RW算法的非均匀性问题，同时又避免了MH算法的效率瓶颈。另一方面，GMD算法则在保留MD算法的优点——对全局结构的敏感性的同时，通过更精细的设计来减少局部聚集现象，使得采样更具多样性。本文不仅对现有随机游动图采样算法进行了深入剖析，还提出了创新的方法来改进其性能，为大规模图数据的分析和挖掘提供了更为有效和均衡的工具。这不仅有助于提高图采样的效率，也对实际应用中的网络数据分析、推荐系统和社交网络研究等领域具有重要意义。"

where S is the set of sampled nodes and w

(u) ∝ 1 /d

denotes the weight of node u. Note that here the weight

(u) can be up to any multiplicative constant.

It was shown that the above estimator can be interpret-

ed using the importance sampling (IS) framework [9], [23].

Speciﬁcally, instead of drawing nodes from the target dis-

tribution, the IS framework samples nodes from a different

and easily implemented trial distribution [ 12]. In the RW

algorithm, the target distribution is the uniform distribution

, and the trial distribution is π

. According to the IS

framework [12], the importance weight for a node u is

given by w

(u)

= π

(u)/π

(u)=2m/nd

∝ 1/d

meeting the deﬁnition in Eq. (2). It was turned out that

the estimator E

(f) is asymptotically unbiased [12], i.e.,

(f) → E

(f) as n →∞. The variance of the estimator

(f) depends on the variance of f(u)w

(u).Iff(u) is

uncorrelated with w

(u)=π

(u)/π

(u), such variance

relies on the similarity between π

(u) and π

(u). Intuitively,

the closer the trial distribution π

and the target distribution

are, the lower the variance of the estimator E

(f).Liu

in [12] proposes a “rule of thumb” to quantify this intuition

which can also be used to measure the effectiveness of the

IS framework. Speciﬁcally, assume that there is an estimator

E using N independent samples from the target distribution q

(q = π

in our case). Then, Liu’s “rule of thumb” states that

the numbe r of samples from the trial distribution p (p = π

in our case) is at most N (1+var

(q(X)/p(X))) to ach ieve

the same variance as that o f

E [12], [11], [24], [25]. As a

consequence, we strive to seek a trial distribution p such that

var

(q(X)/p(X)) (also called the χ-distance between q and

p [12])isassmallaspossible.

In RW, the trial distribution π

is proportional to the

degree of nodes, while the target distribution is the uniform

distribution π

. As shown in our experiments, π

is typically

far from the uniform distribution π

in many real-world

graphs. That is to say, there is a large deviation between the

trial and target distributions of RW in many real-world graphs.

By Liu’s “rule of thumb”, the effectiveness of RW depends on

the similarity between π

and π

. Therefore, in many real-

world graphs, RW suffers from the large deviation problem.

Metropolis-Hastings random walk (MH). The MH algor ithm

is an application of the Metropolis-Hastings algorithm [26],

[27] for unbiased graph sampling. It makes use of a modiﬁed

random walk to draw nodes from the graph. In particular, it

modiﬁes the transition p robabilities of a random walk so that

the walk converges into a uniform distribution. The transition

probability of MH is given by

⎧

⎨

⎩

1/d

× min{1,d

},ifv∈ N(u),

1 −



u=w

,ifu= v,

0,otherwise,

(3)

where min{1,d

} is the acceptance function of MH. Let

v ∈ N(u) be a neighbor of node u.WhenMHdrawsa

node v from N(u) with probability 1/d

, MH accepts v as

a sample with probability min{1,d

}, and reject it with

probability 1 − min{1,d

}. It h as turned out that the

stationary distribution of MH is π

= π

=[1/n, ··· , 1/n]

[10], [1]. Therefore, the sample mean obtained by MH can be

directly used to construct an unbiased estimator for E

(f).

(a)

(b)

(c)

Fig. 1: Running example

Using the language in IS framework, the trial distribution of

MH is equivalent to th e target distribution, i.e., π

= π

This suggests that MH can overcome the large deviation

problem of RW. However, this is not free. To obtain the

uniform samples, MH has to reject a considerable number of

samples. In other words, to achieve the same sample size as

that of RW, MH g enerally needs to d raw a lot more nodes than

RW. In the setting o f unbiased graph sampling, the primary

costs of different algorithms are measured by the number of

nodes drawn by various algorithms [1]. If we ﬁx the number of

nodes that different algorithms need to draw, then the number

of ﬁnal samples obtained by MH is much smaller than that of

RW, which clearly degrades the performance of MH. Hence,

the MH algorithm suffers from the sample rejection problem.

Maximum-degree random walk (MD). The MD algor ithm

runs a random walk on a dynamically created regular graph to

collect nodes [7], [11]. The idea is that the algorithm modiﬁes

the original graph into a regular graph by adding self-loops

on the nodes so that the degree of each node equals the

maximum degree of the o riginal graph . Note that this can

be done implicitly and on-the-ﬂy. Speciﬁcally, when the walk

traverses a node u, the walk selects a neighbor node from N(u)

with probability 1/d

max

,whered

max

denotes the maximum

degree of the original graph. Following this process, in each

step, the walk stays at u with probability (d

max

− d

)/d

max

That is to say, th e walk traverses the self-loop with probability

max

−d

)/d

max

. Fig. 1(a) and Fig. 1(b) illustrate the original

graph and the corresponding regular graph respectively. For the

graph in Fig. 1(a), the maximum degree is 4. Hence, each node

in the corresponding regular graph shown in Fig. 1(b) has a

degree of 4 including self-loops. The maximum-degree random

walk on a graph is equivalent to the traditional random walk

on the corresponding regular graph. This fact indicates that the

stationary distribution of the maximum-degree random walk is

a uniform distribution. As a result, the sample m ean obtained

by MD can be directly utilized to construct an unbiased

estimator for E

(f) [7]. Similarly, using the language in

IS framework, the trial distribution of MD is equal to the

target distribution π

. Therefore, MD can circumvent the large

deviation problem of RW.

MD also has its drawbacks. First, the algorithm will produce

many repeated samples, because MD needs to traverse the

self-loops. This issue worsens when the walk traverses the

low-degree nodes as these nodes must add many self-loops.

Note that too many repeated samples typically result in a large

variance for the random walk based samplers [2]. Thus, MD

suffers from the repeated samples problem. Second, MD re-

quires the knowledge of maximum degree which is impractical

in the context of unbiased graph sampling via crawling. To

overcome this problem, Bar-Yossef et al. [7] propose to set

a very large constant as the maximum degree. Clearly, this

929

剩余11页未读，继续阅读

weixin_38555304

粉丝: 2

改进的随机游走图采样算法：解决现有问题与新方法

基于随机游动的加权图采样

KnightKing:通用的分布式图形随机游动引擎

无线传感器网络中基于自适应滤波器的数据收集策略

KnightKing: 开创性分布式图随机游动引擎技术

高效准确的随机游动近似HITS主题搜索算法

D2AGE模型：异构图上距离搜索的距离感知DAG嵌入

非平稳随机过程的统计特性与模型建立

vue.js v2.5.17

DM8-SQL语言详解及其数据管理和查询操作指南

1108_ba_open_report.pdf

最新资源