An optional user-supplied struct declaration may be
placed at the end of the declaration section, just before the %%
separator. This feature enables “typed attribute” initialization.
For example, in Figure 1 struct months is defined to have
four fields that correspond to the initializer values given for the
month names and their respective associated values, e.g.:
struct months {
char *name;
int number;
int days;
int leap_days;
};
%%
Lines containing keywords and associated attributes appear
in the keywords and optional attributes section of the keyfile.
The first field of each line always contains the keyword itself,
left-justified against the first column and without surround-
ing quotation marks. Additional attribute fields can follow
the keyword. Attributes are separated from the keyword and
from each other by field separators, and they continue up to the
“end-of-line marker,” which is the newline character (’\n’) by
default.
Attribute field values are used to initialize components of
the user-supplied struct appearing at the end of the decla-
ration section, e.g.:
january, 1, 31, 31
february, 2, 28, 29
march, 3, 31, 31
...
As with lex and yacc, it is legal to omit the initial dec-
laration section entirely. In this case, the keyfile begins with
the first non-comment line (lines beginning with a "#" char-
acter are treated as comments and ignored). This format style
is useful for building keyword set recognizers that possess no
associated attributes. For example, a perfect hash function for
frequently occurring English words can efficiently filter out
uninformative words, such as “the,” “as,” and “this,” from con-
sideration in a key-word-in-context indexing application [5].
Again, as with lex and yacc, all text in the optional third
auxiliary code section is included verbatim into the generated
output file, starting immediately after the final %% and extend-
ing to the end of the keyfile. It is the user’s responsibility to
ensure that the inserted code is valid C or C++. In the Fig-
ure 1 example, this auxiliary code provides a test driver that is
conditionally included if the DEBUG symbol is enabled when
compiling the generated C or C++ code.
4 Design and Implementation Strate-
gies
Many articles describe perfect hashing [10, 7, 11, 12] and min-
imal perfect hashing algorithms [8, 13, 6, 14, 15]. Few articles,
however, describe the design and implementation of a general-
purpose perfect hashing generator tool in detail. This section
describes the data structures, algorithms, output format, and
reusable components in gperf.
gperf is written in
4,000 lines of C++ source code. C++
was chosen as the implementation language since it supports
data abstraction better than C, while maintaining C’s efficiency
and expressiveness [16].
gperf’s three main phases for generating a perfect or near-
perfect hash function are shown in Figure 2: Figure 6 illus-
trates gperf’s overall program structure. and described be-
january
february
...
december
KEYFILE
Asso_Values
Key_List
GPERF
int hash
(const char *str,
unsigned int len)
{
// ...
C
/C++ CODE
Figure 2: gperf’s Processing Phases
low:
1. Process command-line options, read keywords and at-
tributes (the input format is described in Section 3), and
initialize internal objects (described in Section 4.1).
2. Perform a non-backtracking, heuristically guided search
for a perfect hash function(described in Section 4.2.1 and
Section 4.2.2).
3. Generate formatted C or C++ code according to the
command-lineoptions (output format is described in Sec-
tion 4.3).
The following section outlines gperf’s perfect hash func-
tion generation algorithms and internal objects, examines its
generated source code output, describes several reusable class
components, and discusses the program’s current limitations
and future enhancements.
4.1 Internal Objects
gperf’s implementation centers around two internal objects:
the keyword signatures list (Key List)andtheassociated
values array (asso values), both of which are described
below.
4.1.1 The Keyword Signatures List
Every user-specified keyword and its attributes are read from
the keyfile and stored in a node on a linked list, called
Key List. gperf only considers a subset of each key-
words’ characters while it searches for a perfect hash function.
The subset is called the “keyword signature,” or keysig.
3