Algorithm 3.1: FREQUENT(k)
n ← 0; T ← ∅;
for each i do
    n ← n + 1;
    if i ∈ T then c_i ← c_i + 1;
    else if |T| < k − 1 then
        T ← T ∪ {i};
        c_i ← 1;
    else for all j ∈ T do
        c_j ← c_j − 1;
        if c_j = 0 then T ← T \ {j};
Algorithm 3.2: LOSSYCOUNTING(k)
n ← 0; ∆ ← 0; T ← ∅;
for each i do
    n ← n + 1;
    if i ∈ T then c_i ← c_i + 1;
    else
        T ← T ∪ {i};
        c_i ← 1 + ∆;
    if ⌊n/k⌋ ≠ ∆ then
        ∆ ← ⌊n/k⌋;
        for all j ∈ T do
            if c_j < ∆ then T ← T \ {j};
Algorithm 3.3: SPACESAVING(k)
n ← 0; T ← ∅;
for each i do
    n ← n + 1;
    if i ∈ T then c_i ← c_i + 1;
    else if |T| < k then
        T ← T ∪ {i};
        c_i ← 1;
    else
        j ← arg min_{j∈T} c_j;
        c_i ← c_j + 1;
        T ← T ∪ {i} \ {j};
Figure 1: Pseudocode for counter-based algorithms
algorithm has a streaming flavor: it takes only one pass through
the input (which can be ordered arbitrarily) to find a majority item.
To verify that the stored item really is a majority, a second pass
is needed that simply counts the true number of occurrences of the
stored item.
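To make the two-pass structure concrete, here is a minimal Python sketch of this approach; the function names are ours, not taken from the papers discussed.

def majority_candidate(stream):
    # One pass: maintain a single stored item and counter.
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate

def find_majority(items):
    # items must support two passes, e.g., a list.
    candidate = majority_candidate(items)
    if candidate is not None and items.count(candidate) > len(items) // 2:
        return candidate
    return None  # no item occurs more than n/2 times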
Frequent Algorithm. Twenty years later, two papers were pub-
lished [27, 20] which include essentially the same generalization
of the Majority algorithm to solve the problem of finding all items
in a sequence whose frequency exceeds a 1/k fraction of the total
count. Instead of keeping a single counter and item from the input,
the FREQUENT algorithm stores k − 1 (item,counter) pairs. The
natural generalization of the Majority algorithm is to compare each
new item against the stored items T , and increment the correspond-
ing counter if it is amongst them. Else, if there is some counter with
count zero, it is allocated to the new item, and the counter set to 1. If
all k − 1 counters are allocated to distinct items, then all are decre-
mented by 1. A grouping argument shows that any item
which occurs more than n/k times must be stored by the algorithm
when it terminates. Pseudocode to illustrate this algorithm is given
in Algorithm 3.1, making use of set notation to represent the oper-
ations on the set of stored items T : items are added and removed
from this set using set union and set subtraction respectively, and
we allow ranging over the members of this set (thus implementa-
tions will have to choose appropriate data structures which allow
the efficient realization of these operations). We also assume that
each item j stored in T has an associated counter c_j. For items not stored in T, c_j is defined as 0 and does not need to be explicitly stored.
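As an illustration, a minimal Python rendering of Algorithm 3.1 might look as follows, using a single dict to play the role of both T and the counters c_j (this sketch is ours; a production implementation would use the more careful data structures discussed below):

def frequent(stream, k):
    # counters plays the role of T: it maps each stored item to its count c_i.
    counters = {}
    for i in stream:
        if i in counters:
            counters[i] += 1
        elif len(counters) < k - 1:
            counters[i] = 1
        else:
            # All k-1 counters are allocated: decrement each, dropping zeros.
            for j in list(counters):
                counters[j] -= 1
                if counters[j] == 0:
                    del counters[j]
    return counters

Any item occurring more than n/k times is guaranteed to remain in counters at the end, though, as with Majority, a second pass over the input is needed to weed out false positives.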
It is sometimes stated that the FREQUENT algorithm does not solve the frequency estimation problem accurately, i.e., provide a bound on the true frequency of the items it retains, but this is erroneous. As observed by Bose et al. [7], executing this algorithm with k = 1/ε ensures that the count associated with each item on termination is at most εn below the true value.
The papers published in 2002 (which cite [22]) were in fact re-
discoveries of an algorithm first published in 1982. This n/k gen-
eralization was first proposed by Misra and Gries [34]. Misra and
Gries proposed “Algorithm 3”, which is equivalent to that described
in the previous paragraph. In deference to this early discovery, this
algorithm has been referred to as the “Misra-Gries” algorithm in
more recent work on streaming algorithms. In the same paper, “Algorithm 2” also solves the problem correctly, but its worst-case space bounds were only conjectured, not proven.
The time cost of the algorithm is dominated by the O(1) dictio-
nary operations per update, and the cost of decrementing counts.
Misra and Gries use a balanced search tree, and argue that the
decrement cost is amortized O(1); Karp et al. propose a hash table
to implement the dictionary [27]; and Demaine et al. show how the
cost of decrementing can be made worst-case O(1) by representing
the counts using offsets and maintaining multiple linked lists [20].
Lossy Counting. The LOSSYCOUNTING algorithm was proposed
by Manku and Motwani in 2002 [30], in addition to a randomized
sampling-based algorithm and techniques for extending from fre-
quent items to frequent itemsets. The algorithm stores tuples which
comprise an item, a lower bound on its count, and a ‘delta’ (∆)
value which records the difference between the upper bound and
the lower bound. When processing the ith item, if it is currently
stored then its lower bound is increased by one; else, a new tuple
is created with the lower bound set to one, and ∆ set to ⌊i/k⌋.
Periodically, all tuples whose upper bound is less than ⌊i/k⌋ are
deleted. These are correct upper and lower bounds on the count of
each item, so at the end of the stream, all items whose count ex-
ceeds n/k must be stored. As with FREQUENT, setting k = 1/ε
ensures that the error in any approximate count is at most εn. A
careful argument demonstrates that the worst case space used by
this algorithm is O((1/ε) log n), and for certain input distributions it
is O(1/ε).
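A minimal Python sketch of this tuple-based version, under our own naming, may help; here each stored item maps to a pair (lower bound, ∆), and we choose to prune once every k items, which is when ⌊n/k⌋ can increase:

def lossy_counting(stream, k):
    entries = {}  # item -> (lower bound on count, delta)
    n = 0
    for i in stream:
        n += 1
        if i in entries:
            low, delta = entries[i]
            entries[i] = (low + 1, delta)
        else:
            entries[i] = (1, n // k)
        if n % k == 0:
            # Delete tuples whose upper bound (low + delta) falls below n/k.
            entries = {item: (low, delta)
                       for item, (low, delta) in entries.items()
                       if low + delta >= n // k}
    return entries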
Storing the delta values ensures that highly frequent items which
first appear early on in the stream have very accurate approximated
counts. But this adds to the storage cost. A variant of this algo-
rithm is presented by Manku in slides for the paper [31], which
dispenses with explicitly storing the delta values, and instead has
all items sharing an implicit value of ∆(i) = ⌊i/k⌋. The modified
algorithm stores (item, count) pairs. For each item in the stream,
if it is stored, then the count is incremented; otherwise, it is initial-
ized with a count of 1. Every time ∆(i) increases, all counts are
decremented by 1, and all items with zero count are removed from
the data structure. The same proof suffices to show that the space
bound is O((1/ε) log n). This version of the algorithm is quite simi-
lar to Algorithm 2 presented in [34]; but in [31], a space bound is
proven. The time cost is O(1) dictionary operations, plus the peri-
odic compress operations which require a linear scan of the stored
items. This can be performed once every O((1/ε) log n) updates, in
which time the number of items stored has at most doubled, mean-
ing that the amortized cost of compressing is O(1). We give pseu-
docode for this version of the algorithm in Algorithm 3.2, where
again T represents the set of currently monitored items, updated by
set operations, and c_j are the corresponding counts.
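Rendered as a Python sketch (again with our own naming), Algorithm 3.2 becomes the following; initializing new items to 1 + ∆ and pruning counts below ∆ is equivalent to the repeated global decrements described above:

def lossy_counting_variant(stream, k):
    counts = {}  # item -> c_i, playing the role of T
    n, delta = 0, 0
    for i in stream:
        n += 1
        if i in counts:
            counts[i] += 1
        else:
            counts[i] = 1 + delta
        if n // k != delta:
            # delta(i) = floor(n/k) has increased: prune low-count items.
            delta = n // k
            counts = {j: c for j, c in counts.items() if c >= delta}
    return counts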
Space Saving. The deterministic algorithms presented thus far all
have a similar flavor: a set of items and counters are kept, and vari-
ous simple rules are applied when a new item arrives. The SPACE-