LINEAR HASHING : A NEW TOOL FOR FILE AND TABLE ADDRESSING.
Witold Litwin
I. N. R. I. A.
78 150 Le Chesnay, France.
Abstract.
Linear hashing is a hashing in which the address space may grow or shrink dynamically. A file or a table may then support any number of insertions or deletions without access or memory load performance deterioration. A record in the file is, in general, found in one access, while the load may stay practically constant up to 90 %. A record in a table is found in a mean of 1.7 accesses, while the load is constantly 80 %. No other algorithms attaining such a performance are known.
1. INTRODUCTION
The most fundamental data structures are files
and tables of records identified by a primary key.
Hashing and trees (B-tree, binary tree,.. ) are
the basic addressing techniques for those files
and tables; thousands of publications have dealt with this subject. If a file or a table is almost static, hashing allows a record to be found, in general, in one access. A tree always requires several accesses. However, when the file or the table is, as usual, dynamic, then a tree still works reasonably well, while the performance of hashing may become very bad. It may even become necessary to rehash all records into a new file.
We have shown, however, in /LIT77/ that hashing may be a tool for dynamic files, if the hashing function is dynamically modified in the course of insertions or deletions. We have called this new type of hashing virtual hashing (VH), in contrast to the well known hashing with a static function, which we will refer to as classical hashing.
Through an algorithm called VH0 /LIT77/, we have shown that a record in a dynamic file may typically be found in two accesses, while the load stays close to 70 % /LIT77a/. Another algorithm, called VH1 /LIT78/, /LIT78a/, has shown that a record may even be found typically in one access, while the load during insertions oscillates between 45 % and 90 % and is 67.5 % on average. It showed also that the average load during insertions may be always greater than 63 % and almost always greater than 85 %, if we accept that the average successful search requires 1.6 accesses /LIT79a/. Finally, a generalisation of VH1, called VH2, has shown that for a similar load, the average successful search requires very close to one access /LIT79a/. Two other algorithms similar to VH0 have been proposed, Dynamic Hashing (DH) /LAR78/ and Extendible Hashing (EH) /FAG78/.
Since trees typically lead to more than 3 or 4 accesses per search and to a load close to 70 % /KNU74/, /COM79/, all these VH algorithms offer better access performance for similar or higher load factors.
VH0, DH and EH require at least two accesses per search because the data structure which represents the dynamically created hashing functions must be on the disk. VH1 and VH2 are faster, because the functions are represented by a bit table which, depending on file parameters, needs 1 kbyte of core (main storage) per file for 7 000 to more than 1 500 000 records. In this paper, we present the algorithm which goes further: only a few bytes of core suffice now for a file of any size. For any number of insertions or deletions, the load of a file may therefore be high and a record may be found, in general, in one access. No other algorithms attaining such a performance are known.
The algorithm is called Linear Virtual Hashing or Linear Hashing in short (LH). The choice of file parameters may lead to a mean number of accesses per successful search not greater than 1.03, while the load stays close to 60 %. It may also lead to a load staying equal to 90 % while the successful search requires 1.35 accesses in the average.

CH1534-7/80/0000-0212$00.75 © 1980 IEEE
Even if the buffer in core may contain only one record, a search in the file needs 1.7 accesses in the average while the load remains at 80 %. This property makes LH probably the best performing tool for dynamic tables as well.
The next section describes the principles of LH. We first show the basic schema for hashing. We then discuss the computing of the physical addresses of buckets, when the storage for them is allocated in a non-contiguous manner. Finally, we present some variants of the basic schema.
Section 3 shows the performance of the Linear Hashing. First, we show access and memory load performance of the basic schema. Next, the performance is analysed for a variant with a, so-called, load control.
Section 4 concludes the paper. We sum up the advantages which Linear Hashing brings, we show some application areas and, finally, we indicate directions for further research.
2. PRINCIPLES OF THE LINEAR HASHING

2.1. Basic schema
We recall that hashing is a technique which addresses records provided with an identifier called primary key or, simply, key. The key, let it be c, is usually a non-negative integer and, in a work on the addressing by primary key, we may disregard the rest of the record. A simple pseudo-random function, let it be h, called a hashing function, assigns to c the memory cell identified by the value h(c). The hashing by division, c |--> c mod M ; M = 2,3,.. ; is an example of a hashing function. The cells are called buckets and may contain b records, b = 1,2,... The record is inserted into the bucket h(c), called primary bucket for c, unless the bucket is already full. The search for c always starts with the access to the bucket h(c).
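In modern terms, the basic schema above can be sketched as follows. This is a hypothetical illustration, not code from the paper; the names insert and search are ours, while N, b and h match the notation just introduced:

```python
N = 100  # number of buckets, addresses 0..N-1
b = 5    # bucket capacity in records

def h(c):
    """Hashing by division: assigns to the key c the bucket address c mod N."""
    return c % N

buckets = [[] for _ in range(N)]

def insert(c):
    """Store c in its primary bucket h(c); signal a collision if it is full."""
    bucket = buckets[h(c)]
    if len(bucket) < b:
        bucket.append(c)
        return True
    return False  # collision: the primary bucket is already full

def search(c):
    """The search always starts (and, without overflow, ends) at bucket h(c)."""
    return c in buckets[h(c)]

insert(4900)  # h(4900) = 0
```

A successful search thus costs a single bucket access as long as no bucket overflows.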
If the bucket is full when c should be stored, we speak about a collision. An algorithm called a collision resolution method (CRM) is then applied which, typically, stores c in a bucket m such that m ≠ h(c). c then becomes an overflow record and the bucket m is called overflow bucket for c. If (i) overflow buckets are not primary for any c, (ii) each of them is devoted to only one h(c) and (iii) a new overflow bucket for an h(c) is chained to the existing ones, then we have a bucket chaining CRM. If, in particular, the capacity b' of an overflow bucket is b' = 1, we call this CRM separate chaining /KNU74/.
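A bucket chaining CRM satisfying (i)-(iii) can be sketched as follows (an illustrative reconstruction, not the paper's code; the class name Bucket is ours, and b' is written b_prime):

```python
b = 5        # primary bucket capacity
b_prime = 1  # overflow bucket capacity; b' = 1 gives separate chaining

class Bucket:
    def __init__(self, capacity):
        self.records = []
        self.capacity = capacity
        self.overflow = None  # next overflow bucket chained to this h(c)

def insert(bucket, c):
    """Walk the chain for h(c); append a new overflow bucket when all are full."""
    while True:
        if len(bucket.records) < bucket.capacity:
            bucket.records.append(c)
            return
        if bucket.overflow is None:
            bucket.overflow = Bucket(b_prime)
        bucket = bucket.overflow

def search(bucket, c):
    """Return the number of bucket accesses needed to find c, or None."""
    accesses = 0
    while bucket is not None:
        accesses += 1
        if c in bucket.records:
            return accesses
        bucket = bucket.overflow
    return None

primary = Bucket(b)
for key in (0, 100, 200, 300, 400, 500):  # six keys, capacity 5: one overflows
    insert(primary, key)
```

As the text notes next, the overflow record (here 500) costs at least two accesses, while the first five keys are still found in one.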
A search for an overflow record requires at least two accesses. If all collisions are resolved only by overflow record creations, as it was assumed until recently /LIT77/, then access performance must rapidly deteriorate when primary buckets become full. If the insertion of c leads to a collision and no records already stored in the bucket h(c) should become overflow records, then c may be stored in its primary bucket only if a new hashing function is chosen. The new function, let it be h', should assign new addresses to some of the records hashed with h on h(c) and the file should be reorganized in consequence. If h = h' for all other records, the reorganizing needs to move only a few records and so may be performed dynamically. The new function is then called by us a dynamically created hashing function or, shortly, a split function.
The modification to the hashing function and to the file is called a split; h(c) is, under the circumstances, the split address. The idea in VH in general and so, in particular, in LH is to use splits in order to avoid the accumulation of overflow records. Splits are typically performed during some insertions. All splits result from the application of split functions. For LH, as well as for VH0 and for VH1, the basic split functions are defined as follows /LIT79/, /LIT80/:
- Let C be the key space. Let h0 : C --> {0, 1,.., N-1} be the function that is used to load the file. The functions h1, h2,.., hi,.. are called split functions for h0 if they obey the following requirements:

(1)  hi : C --> {0, 1,.., 2^i N - 1}

(2)  For any c, either:

         hi(c) = h_{i-1}(c)                    (2.1)
     or:
         hi(c) = h_{i-1}(c) + 2^{i-1} N        (2.2)
We assume that, typically, each hi ; i = 0,1,.. ; hashes randomly. This means that the probability that c is mapped by hi to a given address is 1/(2^i N). This also means that (2.1) and (2.2) are equiprobable events.
Fig. 1 illustrates the use of split functions. The file is created with h0 : c |--> c mod N, where N = 100. The bucket capacity is b = 5 records. For split functions we choose the hashing by division, namely we put:

    hi : c |--> c mod 2^i N.

This choice respects (2) since, obviously, for any non-negative integers k, L, either:

    k mod 2L = k mod L
or:
    k mod 2L = k mod L + L.
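This identity is easy to check mechanically. The following sketch (ours, not the paper's) verifies that the split functions by division indeed obey requirement (2):

```python
N = 100

def h(i, c):
    """hi(c) = c mod (2**i * N), the split functions chosen by division."""
    return c % (2**i * N)

# Requirement (2): hi(c) is either h_{i-1}(c) or h_{i-1}(c) + 2**(i-1) * N.
for c in range(0, 50000, 7):   # an arbitrary sample of keys
    for i in range(1, 5):
        assert h(i, c) in (h(i - 1, c), h(i - 1, c) + 2**(i - 1) * N)
```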
We assume that a collision occurs during the insertion of c = 4900. Instead of simply storing c as an overflow record, we change h0 to the following h:

    h(c) = h1(c)    if h0(c) = 0
    h(c) = h0(c)    otherwise.

We then reorganize the file. We thus have applied h1 as the split function and we have performed the split for the address 0. The hashing function h results from the split and is a dynamic hashing function.
[Figure 1 (diagram): (a) the file of buckets 0, 1,.., 53,.., 99 under h0, with 4 900 colliding on bucket 0; (b) buckets 0,.., 99 and the new bucket 100 after the split with h1.]
Fig. 1. The use of a split for a collision resolution. (a) - a collision occurs for the bucket 0. (b) - the collision is resolved without creating an overflow record and the address space is extended.
Fig. 1.b shows the new state of the file. Since h = h0 for each address except 0, no records other than those hashed to 0 have been moved. Since h ≠ h0 for the records which h0 hashed to the address 0, for approximately half the number of these records the address has been changed. It follows from (2) that all these records had the same new address which, under the circumstances, had to be 100. A new bucket has therefore been appended to the last bucket of the file, to which all these records have been moved in one move. Since for all these records the bucket 100 is henceforward the primary bucket, they are all accessible in one access. In particular, this is also the case of the new record, i.e., 4 900. On the other hand, the records which remained in the bucket 0 continue to be accessible in one access. In contrast to what could be done if a classical hashing was used, the split has resolved the collision without creating an overflow record and without access performance deterioration.
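The split of Fig. 1 can be replayed in a few lines. The five keys filling bucket 0 below are our own arbitrary choice; everything else follows the example (N = 100, b = 5, c = 4 900):

```python
N, b = 100, 5
h0 = lambda c: c % N        # the loading function
h1 = lambda c: c % (2 * N)  # the split function applied to address 0

buckets = {a: [] for a in range(N)}
for key in (0, 100, 300, 500, 700):  # five keys that h0 hashes to 0
    buckets[h0(key)].append(key)

c = 4900
assert h0(c) == 0 and len(buckets[0]) == b  # inserting c collides on bucket 0

# Split the address 0: rehash its records (and c) with h1; by (2), records
# whose address changes all land in the new bucket N appended to the file.
old = buckets[0] + [c]
buckets[0] = [k for k in old if h1(k) == 0]
buckets[N] = [k for k in old if h1(k) == N]
```

Afterwards every record, including 4 900, is again reachable in one access from its primary bucket, and no overflow record was created.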
Let us now assume that the file addressed with h0 undergoes a sequence of insertions which did not yet lead to overflow records. Furthermore, let us assume that a split is performed iff a collision occurs. The natural idea would be to split the bucket which undergoes the collision; this was implicit for all algorithms for VH. However, split addresses must then be random and this must lead to dynamic hashing functions using tables. Dynamic hashing functions which do not need tables may be obtained only if the split addresses are chosen in a predefined order. To perform splits in some predefined order, instead of splitting the bucket which undergoes the collision, is the main new idea in the linear hashing.
Let m be the address of a collision. Let n be the address of a split to be performed in the course of the resolution of this collision. Since the values of m are random while those of n are predefined, usually n ≠ m. If so, we assume that the new record is stored as an overflow record from the bucket m through a classical CRM, bucket chaining for instance. Next, we assume that n is given by a pointer which thus indicates the bucket to be split.
For the first N collisions, the buckets are pointed in the linear order 0, 1, 2,.., N-1 and all splits use h1. (2.2) implies then that the file becomes progressively larger, including one after another the buckets N, N+1,.., 2N-1. A bucket usually undergoes a split not when it undergoes the collision, but with some delay. The delay corresponds to the number of buckets which has to be pointed while the pointer travels up, from the address indicated in the moment of the collision, to the address of this collision.
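Although the paper develops the complete addressing rule further on, the pointer mechanism just described can already be sketched. This is our reconstruction under the stated assumptions (one split of the pointed bucket per collision; for brevity, overflow records simply stay in their primary bucket's list instead of going through a separate CRM):

```python
N, b = 4, 2  # deliberately tiny parameters, for illustration only

class LH:
    def __init__(self):
        self.i = 0                       # current level of split functions
        self.n = 0                       # pointer: next bucket to be split
        self.buckets = [[] for _ in range(N)]

    def h(self, level, c):
        return c % (2**level * N)        # split functions by division

    def address(self, c):
        a = self.h(self.i, c)
        if a < self.n:                   # bucket a was already split at level i
            a = self.h(self.i + 1, c)
        return a

    def insert(self, c):
        a = self.address(c)
        self.buckets[a].append(c)
        if len(self.buckets[a]) > b:     # collision: split the pointed bucket n
            self.split()

    def split(self):
        old = self.buckets[self.n]
        self.buckets.append([])          # new bucket n + 2**i * N, appended
        self.buckets[self.n] = [c for c in old
                                if self.h(self.i + 1, c) == self.n]
        self.buckets[-1] = [c for c in old
                            if self.h(self.i + 1, c) != self.n]
        self.n += 1
        if self.n == 2**self.i * N:      # every bucket of the level is split:
            self.i += 1                  # move to the next level and
            self.n = 0                   # restart the pointer at 0

f = LH()
for c in range(40):
    f.insert(c)
```

Note that the bucket m of a collision is usually not the bucket n that gets split; m is only split later, when the pointer reaches it.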
With this mechanics, no matter what is the address, let it be m1, of the first collision, LH performs the first split using h1 and for the address 0. The records from the bucket 0 are then randomly distributed between the bucket 0 and a new bucket N, while, unless m1 = 0, an overflow record is created for the bucket m1. The second collision, no matter what is its address, let us say m2, leads to an analogous result, except that,