This pap er contains a new, detailed presentation of the
paxos
algorithm,
based on a formal decomp osition into several interacting comp onents. It also
contains a correctness proof and a time performance and fault-tolerance analysis.
The
multipaxos
algorithm is also described together with an application to data
replication. The formal framework used for the presentation is provided by the
Clock GTA.
The correctness proof uses automaton composition and invariant assertion
methods. Comp osition is useful for representing a system using separate compo-
nents. We provide a mo dular presentation of the
paxos
algorithm, obtained by
decomposing it into several components. Each one of these components copes
with a sp ecic aspect of the problem. The correctness of each piece is proved
by means of invariants, i.e., prop erties of system states that are always true in
an execution.
The time performance and fault-tolerance analysis is conditional on the sta-
bilization of the system behavior starting from some p oint in an execution. Using
the Clock GTA we prove that when the system stabilizes
paxos
reaches consen-
sus in 24
`
+ 10
n`
+ 13
d
time and uses 10
n
messages, where
n
is the numb er of
processes. This performance is for a worst-case scenario. We also discuss the
mul-
tipaxos
protocol and provide a data replication algorithm using
multipaxos
.
With
multipaxos
the high availability of the replicated data is combined with
high fault tolerance.
Related work.
The consensus algorithms of Dwork et al. [5] and of Chandra
and Toueg [2] b ear some similarities with
paxos
. The algorithm of [5] also uses
rounds conducted by a leader, but the rounds are conducted sequentially, whereas
in
paxos
a leader can start a round at anytime and multiple leaders are allowed.
The strategy used in each round by the algorithm of [5] is dierent from the
one used by
paxos
. The time analysis provided in [5] is conditional on a \global
stabilization time" after which process response times and message delivery times
satisfy the time assumptions. This is similar to our stabilized analysis. A similar
time analysis, applied to the problem of reliable group communication, can b e
found in [6].
The algorithm of Chandra and Toueg is based on the idea of an abstract fail-
ure detector. It turns out that failure detectors provide an abstract and mo dular
way of incorporating partial synchrony assumptions in the model of computa-
tion. One of the algorithms in [2] uses the failure detector
S
which incorporates
the partial synchrony considered in this paper. That algorithm is based on the
rotating co ordinator paradigm and as
paxos
uses ma jorities to achieve consis-
tency. The p erformances of the Toueg and Chandra algorithm and of the
paxos
algorithm seem to b e comparable.
Both the Chandra and Toueg algorithm and the Dwork et al. algorithm
consider a distributed setting that does not allow process restarts and channel
failures (however the Chandra and Toueg algorithm can be modied to work with
loss of messages). The
paxos
algorithm tolerates pro cess restarts and channel
failures; this makes
paxos
more suitable in practice.
multipaxos
can b e easily used to implement a data replication algorithm.