ENHANCEMENTS TO THE VOTING ALGORITHM
Sushil Jajodia and David Mutchler
Computer Science and Systems Branch
Code 5590
Naval Research Laboratory
Washington, DC 20375-5000
ABSTRACT
There are several consistency control algorithms for managing replicated files in the face of network partitioning due to site or communication link failures. In this paper, we consider the popular voting scheme along with three enhancements: voting with a primary site, dynamic voting, and dynamic voting with linearly ordered copies. We develop a stochastic model which compares the file availabilities afforded by each of these schemes. We show that in this model dynamic voting with linearly ordered copies provides the greatest availability.
I. INTRODUCTION
There are several consistency control algorithms for managing replicated data in the face of network partitioning due to site or communication link failures [4]. Voting [5,12,15] is the best known example of such a scheme. It has several appealing aspects: its availability is reasonable; its simple statement permits a clear correctness proof; and it is simple to implement. Voting with a primary site is a simple extension of voting. More recently, researchers have introduced two other enhancements to voting, called
dynamic voting [6] (see also [3]) and dynamic voting with linearly ordered copies [7]. These enhancements share all the advantages of the voting scheme; we show that they provide greater availability as well.
Sections II and III give formal statements of the problem and the four algorithms listed above. Section IV provides a stochastic analysis of the availabilities of these algorithms. The model we use assumes that update requests arrive much more frequently than sites fail or are repaired. In the context of our model, we state theorems that compare the availabilities of the four algorithms. Our main result is that dynamic voting with linearly ordered copies provides the greatest availability.
II. FORMAL SPECIFICATION OF PROBLEM
The distributed database (DDB) system consists of a collection of independent computers, called nodes or sites, connected via communication links. We assume that site failures are clean, i.e., nodes stop executing without performing any incorrect actions, and that node crashes are
detectable by other nodes. We do not include Byzantine failures [11], where sites may act in an arbitrary and malicious manner. Site or communication failures may separate the sites into more than one connected component of communicating sites. We call each connected component a partition.
There are several logical files in the DDB, and a physical copy of each logical file is stored at one or more sites. Each site keeps a history of all updates which it performed on a file. We assume that each site runs a concurrency control protocol which ensures that the execution of all transactions within any partition is serializable [8,1]. While serializability of transactions at each site is certainly desirable, it is not sufficient to guarantee that the transactions running in different sites will combine to yield a serializable result; therefore, it is necessary to run a consistency control protocol which correctly manages the replicated data in the presence of failures. (An excellent survey of several of these strategies is given in [4].)
In a pessimistic consistency control protocol, mutual consistency of a replicated file is maintained by making sure that all reads are fresh and that files are updated in at most one partition at any given time. We will call such a partition the majority partition. Different pessimistic protocols use different definitions of a majority partition. When site or communication link recoveries cause partitions to unite, the nodes form a new partition by comparing their histories and obtaining, if necessary, all updates that they have missed. If there does not exist a majority partition, all sites in the system must wait until enough sites and communication links are repaired so that there is once again a majority partition in the system. Since this wait is unavoidable [14], the challenge is to come up with a pessimistic consistency control algorithm which not only preserves mutual consistency of the various copies of a file but also achieves high availability.
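To make the majority-partition test concrete, the sketch below (in Python) checks whether a set of mutually reachable sites forms a majority partition under plain voting, assuming one vote per copy; the function and site names are illustrative choices of ours, not notation from any of the protocols defined in Section III.

    # A minimal sketch of the majority-partition test under plain voting.
    # Assumptions (not from the paper): each site holding a copy carries
    # one vote, and a partition is the majority partition iff it holds
    # strictly more than half of the votes.

    def is_majority_partition(partition, copy_sites):
        """Return True iff `partition` holds a strict majority of the copies."""
        votes = len(set(partition) & set(copy_sites))
        return votes > len(copy_sites) / 2

    copy_sites = {"A", "B", "C", "D", "E"}                     # sites storing the file f
    print(is_majority_partition({"A", "D", "E"}, copy_sites))  # True: 3 of 5 votes
    print(is_majority_partition({"B", "C"}, copy_sites))       # False: 2 of 5 votes

The three enhancements studied in this paper differ precisely in how they define the majority partition, replacing this fixed test with more permissive ones.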
We can pictorially represent the history of the network's failures and recoveries by using the notion of a partition graph [10], defined as follows.
Definition 1. A partition graph for a file f is a directed acyclic graph such that the nodes correspond to the partitions and the edges correspond to either a fragmentation of a partition into two or more subpartitions or a coalescence of two or more partitions into a single partition.
Example 1. An example of a partition graph is shown in Figure 1. The source node is labeled with the names of the sites ABCDE that have copies of the file f, indicating that these sites are all connected and that the copies of f are mutually consistent. The initial partition is fragmented into two partitions ABC and DE. Later B becomes isolated from AC, and subsequently A and C also become isolated. Ultimately, A, D, and E resume communication and form a single partition.
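As a concrete rendering of Definition 1, the sketch below builds the partition graph of Example 1 as a directed acyclic graph; the adjacency-list representation and the helper names split and merge are our own illustrative choices, not notation from the paper.

    # A sketch of the partition graph of Example 1 (Figure 1) as a DAG.
    # Nodes are partitions (frozensets of site names); a directed edge runs
    # from a partition to each partition it fragments into, or from each
    # partition taking part in a coalescence to the resulting partition.

    from collections import defaultdict

    edges = defaultdict(list)

    def split(parent, children):
        # Fragmentation: one partition breaks into two or more subpartitions.
        for child in children:
            edges[frozenset(parent)].append(frozenset(child))

    def merge(parents, child):
        # Coalescence: two or more partitions unite into a single partition.
        for parent in parents:
            edges[frozenset(parent)].append(frozenset(child))

    split("ABCDE", ["ABC", "DE"])  # initial fragmentation
    split("ABC", ["B", "AC"])      # B becomes isolated from AC
    split("AC", ["A", "C"])        # A and C become isolated
    merge(["A", "DE"], "ADE")      # A, D, and E form a single partition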