2 Notation and tools
Model and assumptions
We analyze our algorithms in the cache-oblivious model [
14
]. In this model,
the machine has a two-level memory hierarchy, where the fast level has an unknown size of
M
words and
a slow level of unbounded size where our data reside. We assume that the fast level plays the role of a
cache for the slow level with an optimal replacement strategy where the transfers (a.k.a. I/Os) between
the two levels are done in blocks of an unknown size of
B ≤ M
words; the I/O cost of an algorithm is
the total number of such block transfers. Scanning and sorting are two fundamental building blocks in
the design of cache-oblivious algorithms [
14
]: under the tall-cache assumption [
8
], given an array of
N
contiguous items the I/Os required for scanning and sorting are
scan(N) = O
1 +
N
B
I/Os and sort(N) = O
N
B
log
M/B
N
B
.
Hypergraphs
An
r
-hypergraph on a vertex set
V
is a subset
E
of
V
r
, the set of subsets of
V
of
cardinality
r
. An element of
E
is called an edge. We call an ordered
r
-tuple from
V
an oriented edge; if
e
is an edge, an oriented edge whose vertices are those in
e
is called an orientation of
e
. From now on we
will focus on 3-hypergraphs; generalization to arbitrary
r
is straightforward. We define valid orientations
those oriented edges (
v
0
, v
1
, v
2
) where
v
1
< v
2
(for arbitrary
r
,
v
1
< ··· < v
r−1
). Then for each edge there
are 6 orientations, but only 3 valid orientations (r! orientations of which r are valid).
We say that a valid oriented edge (
v
0
, v
1
, v
2
) is the
i
-th orientation if
v
0
is the
i
-th smallest among the
three; in particular, the 0-th orientation is the canonical orientation. Edges correspond bijectively with
their canonical orientations. Furthermore, valid orientations can be mapped bijectively to pairs (
e, v
) where
e
is an edge and
v
a vertex contained in
e
, simply by the correspondence (
v
0
, v
1
, v
2
)
7→
(
{v
0
, v
1
, v
2
}, v
0
).
In the following all the orientations are assumed to be valid, so we will use the term orientation to mean
valid orientation.
3 The Majewski–Wormald–Havas–Czech technique
Majewski et al. [
21
] proposed a technique (MWHC) to compute an order-preserving minimal perfect hash
function, that is, a function mapping a set of keys
S
in some specified way into [
|S|
]. The technique
actually makes it possible to store succinctly any function
f
:
S →
[
σ
], for arbitrary
σ
. In this section we
briefly describe their construction.
First, we choose three random
1
hash functions
h
0
, h
1
, h
2
:
S →
[
γn
] and generate a 3-hypergraph
2
with
γn
vertices, where
γ
is a constant above the critical threshold
c
3
[
25
], by mapping each key
x
to
the edge
{h
0
(
x
)
, h
1
(
x
)
, h
2
(
x
)
}
. The goal is to find an array
u
of
γn
integers in [
σ
] such that for each key
x
one has
f
(
x
) =
u
h
0
(x)
+
u
h
1
(x)
+
u
h
2
(x)
mod σ
. This yields a linear system with
n
equations and
γn
variables
u
i
; if the associated hypergraph is peelable, it is easy to solve the system. Since
γ
is larger than
the critical threshold, the algorithm succeeds with probability 1 −o(1) as n → ∞ [25].
By storing such values
u
i
, each requiring
dlog σe
bits, plus the three hash functions, we will be able to
recover
f
(
x
). Overall, the space required will be
dlog σeγn
bits, which can be reduced to
dlog σen
+
γn
+
o
(
n
)
using a ranking structure [
17
]. This technique can be easily extended to construct MPHFs: we define the
function
f : S →
[3] as
x 7→ i
where
h
i
(
x
) is a degree-1 vertex when the edge corresponding to
x
is peeled;
it is then easy to see that
h
f(x)
(
x
) :
S →
[
γn
] is a PHF. The function can be again made minimal by
adding a ranking structure on the vector u [6].
As noted in the introduction, the peeling procedure needed to solve the linear system can be performed
in linear time using a greedy algorithm (referred to as standard linear-time peeling). However, this
procedure requires random access to several integers per key, needed for bookkeeping; moreover, since
the graph is random, the visit order is close to random. As a consequence, if the key set is so large that
it is necessary to spill to the disk part of the working data structures, the I/O volume slows down the
algorithm to unacceptable rates.
1
Like most MWHC implementations, in our experiments we use a Jenkins hash function with a 64-bit seed in place of a
fully random hash function.
2
Although the technique works for r-hypergraphs, r = 3 provides the lowest space usage [25].
3