优化随机采样算法：Reservoir Sampling

需积分: 12 71 浏览量更新于2024-07-21 收藏 1.44MB PDF 举报

"Random Sampling for Econometrics - 谷歌首席经济学家Jeffrey Scott Vitter的论文" 这篇论文由谷歌首席经济学家Jeffrey Scott Vitter撰写，主要关注的是在经济学领域中的随机抽样方法，特别是针对大数据集的无放回随机抽样问题。随机抽样在经济学分析中扮演着至关重要的角色，因为它能帮助研究者从庞大的数据集中提取有代表性的子集，进行有效的数据分析和经济计量建模。论文提出了一种名为"Algorithm Z"的快速算法，用于在不知道总体大小N的情况下，从包含N个记录的数据池中抽取n个记录的随机样本。这一算法的特点是在一次遍历中完成，仅需常量空间，并且预期运行时间为O(n(1+log(N/n)))，这是在常数因子内的最优时间复杂度。这意味着不论数据集的大小如何，Algorithm Z都能高效地完成任务。作者还探讨了几种优化策略，这些策略可以显著提升原始算法的速度，使其运行效率提高一个数量级。论文提供了一个高效的类似Pascal的实现版本，这个版本综合了这些优化，适用于各种通用情况。理论分析和实证研究表明，Algorithm Z相对于现有的随机抽样方法具有显著的性能优势。该论文涉及的主要知识点包括： 1. 随机抽样：理解如何从大型数据集中抽取随机样本，这对于经济学分析中的统计推断至关重要。 2. 无放回抽样：在这种抽样方法中，一旦记录被选中，就不会放回数据池，这会影响样本的统计特性。 3. 随机数生成：随机性是抽样的基础，论文可能涉及了如何生成满足特定分布的随机数。 4. 算法分析：评估Algorithm Z的时间和空间复杂度，以证明其效率。 5. 统计软件：论文可能讨论了如何将提出的算法应用于实际的统计分析软件中。 6. 算法设计与性能：强调了优化算法以提高运行速度和实用性的重要性。这篇论文为处理大数据集的随机抽样提供了新的解决方案，对于从事经济学、数据分析和数学软件开发的专业人士来说，具有很高的参考价值。通过Algorithm Z，研究者可以更有效地从海量数据中获取关键信息，进而做出更为准确的经济决策和预测。

Random Sampling with a Reservoir

true random sample of size n. This means that the (t + 1)st record should be

chosen for the reservoir with probability L n/(t + 1). Thus the average size of

the reservoir must be at least as big as that in Algorithm R, which is n(1 +

- H,) =

n (1 + ln(N/n)). This gives a lower bound on the time required to do

the sampling.

The framework we use in this paper to develop reservoir algorithms faster than

Algorithm R revolves around the following random variate:

Definition 2. The random variable 9(n, t) is defined to be the number of

records in the file that are

skipped over

before the next record is chosen for the

reservoir, where n is the size of the sample and where t is the number of records

processed so far. For notational simplicity, we shall often write 9 for 9(n, t),

in which case the parameters n and t will be implicit.

The basic idea for our reservoir algorithms is to repeatedly generate 9, skip

that number of records, and select the next record for the reservoir. As in

Algorithm R, we maintain a set of

candidates in the reservoir. Initially, the

first

records are chosen as candidates, and we set t :=

Our reservoir algorithms

have the following basic outline:

while not

eof

begin

(Process

the rest of the records)

Generate an independent random variate Y(n, t);

SKZP-RECORDS(Y);

(Skip over the next y records)

if not

eof

then

begin

{Make the next record a candidate, replacing one at random)

A:= TRUNC (n

RANDOM( ));

(AT is uniform in the range 0 I A 5 n - 1)

READ-NEXT-RECORD(C[J])

end

+ 9 + 1;

end,

Our sampling algorithms differ from one another in the way that 9 is

generated. Algorithm R can be cast in this framework: it generates 9 in O(y)

time, using O(9) calls to

RANDOM.

The three new algorithms we introduce in

this paper (Algorithms X, Y, and Z) generate 9 faster than does Algorithm R.

Algorithm Z, which is covered in Sections 5-9, generates 9 in constant time, on

the average, and thus by (2.1) it runs in average time O(n(l + log(N/n))). By

the lower bound we derived above, this is optimum, up to a constant factor.

The range of

9’(n,

t) is the set of nonnegative integers. The distribution

function F(s) = Prob(9 I s), for s L 0, can be expressed in two ways:

F(s) = l - (t + s + 1)” = l -

+ 1 -

n)”

tt + +i

(3.1)

(The notation ub denotes the “falling power” a(a - 1) . . . (a - b + 1) =

u!/(u -

b)!,

and the corresponding notation ub denotes the “rising power”

a(u + 1) *. . (a +

- 1) = (a +

- l)!/(a - l)!.) The two corresponding

expressions for the probability function f(s) = Prob(9 = s), for s z 0 are

f(s) = n L = -

n (t - n)‘+’

t+s+ 1

(t+s)”

t-n(t+l)“+”

(3.2)

ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985.

剩余20页未读，继续阅读

piano1462

粉丝: 0
资源: 1

优化随机采样算法：Reservoir Sampling

random sampling with reservior

Random sampling of bandlimited signals on graphs.zip

MMSE-based random sampling for iterative detection for large-scale MIMO systems

Sub-image Method Based on LBP Preprocessing and Recursive Random Sampling for Face Recognition

Stabilization for Networked Control Systems with Random Sampling Periods

random sampling算法

ISO 24153：2009 Random sampling and randomization procedures - 完

01-Random sampling of bandlimited signals on graphs.pdf

Multiple-image authentication with a cascaded multilevel architecture based on amplitude field random sampling and phase information multiplexing

Random Under Sampling Source Code Matlab:Random Under Sampling Source Code Matlab-matlab开发

最新资源