Algorithm 3.1: FREQUENT(k)
n ← 0; T ← ∅;
for each i do
    n ← n + 1;
    if i ∈ T then c_i ← c_i + 1;
    else if |T| < k − 1 then
        T ← T ∪ {i};
        c_i ← 1;
    else for all j ∈ T do
        c_j ← c_j − 1;
        if c_j = 0 then T ← T \ {j};
Algorithm 3.2: LOSSYCOUNTING(k)
n ← 0; ∆ ← 0; T ← ∅;
for each i do
    n ← n + 1;
    if i ∈ T then c_i ← c_i + 1;
    else
        T ← T ∪ {i};
        c_i ← 1 + ∆;
    if ⌊n/k⌋ ≠ ∆ then
        ∆ ← ⌊n/k⌋;
        for all j ∈ T do
            if c_j < ∆ then T ← T \ {j};
Algorithm 3.3: SPACESAVING(k)
n ← 0; T ← ∅;
for each i do
    n ← n + 1;
    if i ∈ T then c_i ← c_i + 1;
    else if |T| < k then
        T ← T ∪ {i};
        c_i ← 1;
    else
        j ← arg min_{j∈T} c_j;
        c_i ← c_j + 1;
        T ← T ∪ {i} \ {j};
Figure 1: Pseudocode for counter-based algorithms
algorithm has a streaming flavor: it takes only one pass through
the input (which can be ordered arbitrarily) to find a majority item.
To verify that the stored item really is a majority, a second pass
is needed that simply counts the true number of occurrences of the
stored item.
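To make the two-pass structure concrete, here is a minimal Python sketch of this approach; the function names are ours, not taken from the papers discussed.

def majority_candidate(stream):
    # One pass: maintain a single stored item and counter.
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate

def find_majority(items):
    # items must support two passes, e.g., a list.
    candidate = majority_candidate(items)
    if candidate is not None and items.count(candidate) > len(items) // 2:
        return candidate
    return None  # no item occurs more than n/2 times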
Frequent Algorithm. Twenty years later, two papers were pub-
lished [27, 20] which include essentially the same generalization
of the Majority algorithm to solve the problem of finding all items
in a sequence whose frequency exceeds a 1/k fraction of the total
count. Instead of keeping a single counter and item from the input,
the FREQUENT algorithm stores k − 1 (item,counter) pairs. The
natural generalization of the Majority algorithm is to compare each
new item against the stored items T , and increment the correspond-
ing counter if it is amongst them. Else, if there is some counter with
count zero, it is allocated to the new item, and the counter set to 1. If
all k − 1 counters are allocated to distinct items, then all are decre-
mented by 1. A grouping argument shows that any item
which occurs more than n/k times must be stored by the algorithm
when it terminates. Pseudocode to illustrate this algorithm is given
in Algorithm 3.1, making use of set notation to represent the oper-
ations on the set of stored items T : items are added and removed
from this set using set union and set subtraction respectively, and
we allow ranging over the members of this set (thus implementa-
tions will have to choose appropriate data structures which allow
the efficient realization of these operations). We also assume that
each item j stored in T has an associated counter c_j. For items not stored in T, c_j is defined as 0 and does not need to be explicitly stored.
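As an illustration, a minimal Python rendering of Algorithm 3.1 might look as follows, using a single dict to play the role of both T and the counters c_j (this sketch is ours; a production implementation would use the more careful data structures discussed below):

def frequent(stream, k):
    # counters plays the role of T: it maps each stored item to its count c_i.
    counters = {}
    for i in stream:
        if i in counters:
            counters[i] += 1
        elif len(counters) < k - 1:
            counters[i] = 1
        else:
            # All k-1 counters are allocated: decrement each, dropping zeros.
            for j in list(counters):
                counters[j] -= 1
                if counters[j] == 0:
                    del counters[j]
    return counters

Any item occurring more than n/k times is guaranteed to remain in counters at the end, though, as with Majority, a second pass over the input is needed to weed out false positives.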
It is sometimes stated that the FREQUENT algorithm does not solve the frequency estimation problem accurately, i.e., provide a bound on the true frequency of the items it retains, but this is erroneous. As observed by Bose et al. [7], executing this algorithm with k = 1/ε ensures that the count associated with each item on termination is at most εn below the true value.
The papers published in 2002 (which cite [22]) were in fact re-
discoveries of an algorithm first published in 1982. This n/k gen-
eralization was first proposed by Misra and Gries [34]. Misra and
Gries proposed “Algorithm 3”, which is equivalent to that described
in the previous paragraph. In deference to this early discovery, this
algorithm has been referred to as the “Misra-Gries” algorithm in
more recent work on streaming algorithms. In the same paper, “Algorithm 2” also solves the problem correctly, but its worst-case space bounds were only conjectured, not proven.
The time cost of the algorithm is dominated by the O(1) dictio-
nary operations per update, and the cost of decrementing counts.
Misra and Gries use a balanced search tree, and argue that the
decrement cost is amortized O(1); Karp et al. propose a hash table
to implement the dictionary [27]; and Demaine et al. show how the
cost of decrementing can be made worst-case O(1) by representing
the counts using offsets and maintaining multiple linked lists [20].
Lossy Counting. The LOSSYCOUNTING algorithm was proposed
by Manku and Motwani in 2002 [30], in addition to a randomized
sampling-based algorithm and techniques for extending from fre-
quent items to frequent itemsets. The algorithm stores tuples which
comprise an item, a lower bound on its count, and a ‘delta’ (∆)
value which records the difference between the upper bound and
the lower bound. When processing the ith item, if it is currently
stored then its lower bound is increased by one; else, a new tuple
is created with the lower bound set to one, and ∆ set to ⌊i/k⌋.
Periodically, all tuples whose upper bound is less than ⌊i/k⌋ are
deleted. These are correct upper and lower bounds on the count of
each item, so at the end of the stream, all items whose count ex-
ceeds n/k must be stored. As with FREQUENT, setting k = 1/ε
ensures that the error in any approximate count is at most εn. A
careful argument demonstrates that the worst case space used by
this algorithm is O((1/ε) log n), and for certain input distributions it
is O(1/ε).
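A minimal Python sketch of this tuple-based version, under our own naming, may help; here each stored item maps to a pair (lower bound, ∆), and we choose to prune once every k items, which is when ⌊n/k⌋ can increase:

def lossy_counting(stream, k):
    entries = {}  # item -> (lower bound on count, delta)
    n = 0
    for i in stream:
        n += 1
        if i in entries:
            low, delta = entries[i]
            entries[i] = (low + 1, delta)
        else:
            entries[i] = (1, n // k)
        if n % k == 0:
            # Delete tuples whose upper bound (low + delta) falls below n/k.
            entries = {item: (low, delta)
                       for item, (low, delta) in entries.items()
                       if low + delta >= n // k}
    return entries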
Storing the delta values ensures that highly frequent items which
first appear early on in the stream have very accurate approximated
counts. But this adds to the storage cost. A variant of this algo-
rithm is presented by Manku in slides for the paper [31], which
dispenses with explicitly storing the delta values, and instead has
all items sharing an implicit value of ∆(i) = ⌊i/k⌋. The modified
algorithm stores (item, count) pairs. For each item in the stream,
if it is stored, then the count is incremented; otherwise, it is initial-
ized with a count of 1. Every time ∆(i) increases, all counts are
decremented by 1, and all items with zero count are removed from
the data structure. The same proof suffices to show that the space
bound is O((1/ε) log n). This version of the algorithm is quite simi-
lar to Algorithm 2 presented in [34]; but in [31], a space bound is
proven. The time cost is O(1) dictionary operations, plus the peri-
odic compress operations which require a linear scan of the stored
items. This can be performed once every O((1/ε) log n) updates, in
which time the number of items stored has at most doubled, mean-
ing that the amortized cost of compressing is O(1). We give pseu-
docode for this version of the algorithm in Algorithm 3.2, where
again T represents the set of currently monitored items, updated by
set operations, and c_j are the corresponding counts.
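Rendered as a Python sketch (again with our own naming), Algorithm 3.2 becomes the following; initializing new items to 1 + ∆ and pruning counts below ∆ is equivalent to the repeated global decrements described above:

def lossy_counting_variant(stream, k):
    counts = {}  # item -> c_i, playing the role of T
    n, delta = 0, 0
    for i in stream:
        n += 1
        if i in counts:
            counts[i] += 1
        else:
            counts[i] = 1 + delta
        if n // k != delta:
            # delta(i) = floor(n/k) has increased: prune low-count items.
            delta = n // k
            counts = {j: c for j, c in counts.items() if c >= delta}
    return counts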
Space Saving. The deterministic algorithms presented thus far all
have a similar flavor: a set of items and counters are kept, and vari-
ous simple rules are applied when a new item arrives. The SPACE-