Appears in the Proceedings of the 2001 USENIX Annual Technical Conference (USENIX ’01) 3
higher-priority thread A’s critical section detects an
interference with a lower-priority thread B, A helps
B to finish its critical section first. During helping,
A lends B its priority to ensure that no other,
lower-priority activity can interfere. When B has
finished, A executes its own critical section.
Wait-free object implementations satisfy a stronger
form of block-freedom than lock-free synchroniza-
tion (discussed in the next paragraph), as they guar-
antee freedom from starvation. Many authors there-
fore regard wait-free synchronization as a special
case of lock-free synchronization. How-
ever, wait-free synchronization can also be imple-
mented using locks, albeit with a nonblocking help-
ing scheme. For example, a locking scheme with
priority inheritance can be considered a wait-free
synchronization scheme as long as critical sections
never block.
Lock-free synchronization works completely
without locks. Critical code sections are designed
such that they prepare their results out of line and
then try to commit them to the pool of shared
data using an atomic memory update instruction
like compare-and-swap (CAS). The compare part
of CAS is used to detect conflicts between two
threads that simultaneously try to update the data;
if the compare fails, the whole operation is restarted.
If needed, retries can be delayed with an exponential
backoff to avoid retry contention.²
This synchronization mechanism has some nice
properties: because there are no locks, it avoids
deadlocks; because operations never hold locks on
critical data, it provides better insulation from
crashed threads, resulting in higher robustness and
fault tolerance; moreover, it is automatically
multiprocessing-safe.
Preconditions for using lock-free synchronization
are that primitives for atomic memory modification
are available and that data resides in type-stable
memory. We do not discuss type-stable memory
management in this paper (see [7] for a discussion
of operating-system–related issues); the rest of this
subsection discusses atomic memory modification.
² Backoff is never needed on single-CPU systems.
Atomic memory update. The x86 CPUs have
two kinds of atomic memory-modification opera-
tions: a test-and-set instruction (TAS) and a CAS
instruction. Newer models (Intel Pentium and
later) also have a double-width (8-byte)
compare-and-swap instruction (CASW). However,
these CPUs do not support atomically updating two
independent memory words (two-word compare-
and-swap, CAS2).
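For concreteness, CAS as exposed by C11 can be wrapped as below; on x86, a compiler typically emits `lock cmpxchg` for the 4-byte case and, on 32-bit targets, `lock cmpxchg8b` (the CASW instruction mentioned above) for the 8-byte case. The wrapper names `cas32`/`cas64` are ours, chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* CAS: if *p equals `expected`, atomically replace it with `desired`
 * and return true; otherwise leave *p unchanged and return false. */
static bool cas32(_Atomic uint32_t *p, uint32_t expected, uint32_t desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}

/* Double-width CAS (CASW): same semantics on an 8-byte word. */
static bool cas64(_Atomic uint64_t *p, uint64_t expected, uint64_t desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}
```

A failed CAS indicates that another thread updated the word in the meantime, which is exactly the conflict-detection property lock-free algorithms rely on.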
A number of data structures can be imple-
mented without locks directly on top of CAS and
CASW (i.e., without the overhead of a software-
implemented multi-word CAS): counters and bit-
fields with widths up to 8 bytes, stacks, and FIFO
queues [21, 18].
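As an example of such a directly CAS-based structure, a lock-free stack push can be written as follows. This is a sketch of the well-known pattern, not code from the cited works; note that a lock-free pop additionally needs a version counter updated together with the head pointer (e.g., via CASW) to avoid the ABA problem, which we omit here.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Lock-free stack push on top of single-word CAS: link the new node
 * to the current head out of line, then try to commit the new head. */
struct node {
    struct node *next;
    int value;
};

static struct node *_Atomic head;

static void push(struct node *n)
{
    struct node *old;
    do {
        old = atomic_load(&head);
        n->next = old;               /* prepare result out of line */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}
```

If another thread pushes concurrently, the CAS fails, the loop re-reads the head, re-links the node, and retries.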
Valois introduced a lock-free single-linked list de-
sign supporting insertions and deletions anywhere
in a list, as well as several other data structures
[23, 22]. These designs also work with just CAS.
However, Greenwald [6] has criticized them for be-
ing quite complex, difficult to get right, and com-
putationally expensive.
Most of the algorithms for lock-free data-structure
synchronization that have been developed recently
assume availability of a stronger atomic primitive
like CAS2. These data structures include general
single-linked and double-linked lists [6].
A number of techniques exist for implement-
ing lock-free and wait-free general multi-word
compare-and-swap (MWCAS) on top of CAS and
CAS2, enabling nonblocking synchronization for
arbitrarily complex data structures [11, 19, 2, 6].
These techniques have considerable overhead in
both space and runtime complexity, especially
when compared to common lock-based operations,
making them less interesting for kernel design.
The most common technique to implement atomic
multi-word updates on uniprocessors is to prevent
preemption during the update. This is usually done
by disabling interrupt delivery in the CPU. The dis-
advantage of this method is (of course) that it does
not work on multiprocessors.
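The uniprocessor technique amounts to the following sketch. The `irq_save`/`irq_restore` functions are hypothetical placeholders (stubbed out here so the fragment is self-contained); in a real kernel they would disable and restore interrupt delivery, e.g., via x86 `cli`/`sti` or by saving and restoring the interrupt flag.

```c
#include <stdbool.h>

/* Hypothetical interrupt-control stubs; a real kernel would touch the
 * CPU's interrupt flag here.  They are no-ops in this sketch. */
static bool irq_save(void)            { return true; }   /* pretend IRQs on */
static void irq_restore(bool enabled) { (void)enabled; }

struct pair { long a, b; };

/* Multi-word update that is atomic on a uniprocessor: with interrupt
 * delivery disabled, no other activity can run between the two stores.
 * On a multiprocessor this is NOT sufficient. */
static void atomic_pair_update(struct pair *p, long a, long b)
{
    bool flags = irq_save();    /* prevent preemption during the update */
    p->a = a;
    p->b = b;
    irq_restore(flags);
}
```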
Bershad [4] has proposed to implement CAS in
software using an implementation and lock known