Distributed Snapshots
iterations in a sequential program, which are repeated until successive iterations
produce no change, that is, stability is attained. Stability must be detected so
that one phase can be terminated and the next phase initiated [10]. The
termination of a computational phase is not identical to the termination of a
computation. When a computation terminates, all activities cease: messages are
not sent and process states do not change. There may be activity during the
stable behavior that indicates the end of a computational phase: messages may
be sent and received, and processes may change state, but this activity serves no
purpose other than to signal the end of a phase. In this paper, we are concerned
with the detection of stable system properties; the cessation of activity is only
one example of a stable property.
Strictly speaking, properties such as “the system is deadlocked” are not stable
if the deadlock is “broken” and computation is reinitiated. However, to keep
exposition simple, we shall partition the overall problem into the problems of (1)
detecting the termination of one phase (and informing all processes that a phase
has ended) and (2) initiating a new phase. The following is a stable property:
“the kth computational phase has terminated,” k = 1, 2, . . . . Hence, the methods
presented in this paper are applicable to detecting the termination of the kth
phase for a given k.
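The defining feature of a stable property is that once it holds in some global state, it continues to hold in every subsequent state. The following sketch is illustrative only (the predicate and run are invented, not from the paper): it checks stability over a finite run in which each state records the highest phase completed so far.

```python
# Illustrative sketch (names and data invented, not from the paper):
# a property is stable if, once it holds, it holds forever after.
def is_stable(property_holds, states):
    """Check, over a finite run, that once the property holds it never
    becomes false again."""
    seen = False
    for s in states:
        if seen and not property_holds(s):
            return False
        seen = seen or property_holds(s)
    return True

# Each state here is simply the highest phase completed so far.
run = [0, 0, 1, 1, 2, 2]

# "Phase 1 has terminated" is stable: once true, it stays true.
assert is_stable(lambda s: s >= 1, run)

# "The system is currently in phase 1" is not stable: it becomes
# false when phase 2 completes.
assert not is_stable(lambda s: s == 1, run)
```

This is why "the kth phase has terminated" qualifies, while transient conditions such as "the channel is empty" do not: a later send can falsify the latter.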
In this paper we restrict attention to the problem of detecting stable properties.
The problem of initiating the next phase of computation is not considered here
because the solution to that problem varies significantly depending on the
application, being different for database deadlock detection than for detecting
the termination of a diffusing computation.
We have to present our algorithms in terms of a model of a system. The model
chosen is not important in itself; we could have couched our discussion in terms
of other models. We shall describe our model informally and only to the level of
detail necessary to make the algorithms clear.
2. MODEL OF A DISTRIBUTED SYSTEM
A distributed system consists of a finite set of processes and a finite set of
channels. It is described by a labeled, directed graph in which the vertices
represent processes and the edges represent channels. Figure 1 is an example.
Channels are assumed to have infinite buffers, to be error-free, and to deliver
messages in the order sent. (The infinite buffer assumption is made for ease of
exposition: bounded buffers may be assumed provided there exists a proof that
no process attempts to add a message to a full buffer.) The delay experienced by
a message in a channel is arbitrary but finite. The sequence of messages received
along a channel is an initial subsequence of the sequence of messages sent along
the channel. The state of a channel is the sequence of messages sent along the
channel, excluding the messages received along the channel.
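The channel model above can be sketched in code. This is a minimal illustration under the stated assumptions (error-free FIFO delivery, unbounded buffer); the class and method names are invented for exposition and do not come from the paper.

```python
# Sketch of the channel model: an error-free FIFO channel whose state
# is the sequence of messages sent but not yet received.
from collections import deque

class Channel:
    def __init__(self):
        self._buffer = deque()  # messages in transit, oldest first

    def send(self, message):
        # Sending appends the message to the channel state.
        self._buffer.append(message)

    def receive(self):
        # Delivery is in the order sent (FIFO), so receiving removes
        # the oldest in-transit message.
        return self._buffer.popleft()

    def state(self):
        # State of the channel: messages sent along it, excluding
        # the messages received along it.
        return list(self._buffer)

c = Channel()
c.send("m1"); c.send("m2"); c.send("m3")
c.receive()       # delivers "m1" first (FIFO order)
print(c.state())  # ['m2', 'm3']
```

The `deque` stands in for the infinite buffer; under the paper's bounded-buffer variant one would additionally have to prove that no `send` occurs on a full channel.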
A process is defined by a set of states, an initial state (from this set), and a set
of events. An event e in a process p is an atomic action that may change the state
of p itself and the state of at most one channel c incident on p: the state of c may
be changed by the sending of a message along c (if c is directed away from p) or
the receipt of a message along c (if c is directed towards p). An event e is defined
by (1) the process p in which the event occurs, (2) the state s of p immediately
ACM Transactions on Computer Systems, Vol. 3, No. 1, February 1985.