3. CoCo: A Concurrent Compactor
In this section we present CoCo, which is a non-intrusive concur-
rent compaction mechanism allowing moving of objects concur-
rently with the run of the program threads, providing high respon-
siveness and maintaining a program’s lock-freedom.
The CoCo mechanism can be incorporated into a full com-
paction algorithm, such as the Compressor [23], to compact the
entire heap and eliminate fragmentation, or it may be used with any
on-the-fly mark and sweep collector [14, 13, 15] (as it is used here)
to do partial compaction to reduce fragmentation. The overhead of
CoCo increases with the number of objects to be moved, because its
overhead is higher during the move. Thus, its design goal was that
of a partial compactor. In STOPLESS, we employ the mark-sweep
collector to finish updating pointers to the relocated objects. This is
an easy task while traversing the graph of live objects (proposed by
[11]). Alternatively, an additional final stage can be added to let the
compactor explicitly and concurrently fix pointers, perhaps using a
mechanism such as the one employed by the Compressor [23].
When an object is to be moved in CoCo, it must be tagged
previous to the run of CoCo (e.g., by the sweep procedure) by
atomically setting a bit in the object header and adding it to a
list accessible to CoCo. Creating a copy of the original object and
making the program switch to working with the new object instead
of the original one, keeping lock-freedom, maintaining acceptable
memory coherence, and reducing the overheads to an acceptable
measure is nontrivial. The original lock-free copying mechanism
of Herlihy and Moss [20] employed a chain of immutable copies
of the object, one for each object mutation. This is a high overhead
to pay in space and time. CoCo also incurs some, though smaller,
space and time overheads. First, CoCo employs a read barrier,
which has its cost, but an interesting cloning mechanism is used
to eliminate this cost almost entirely when the compactor is idle.
Second, during object copying, CoCo creates a temporary wide
object to represent a mutated object. A forwarding pointer is kept
in each old object pointing to the new copy of the object. In the
wide object, each field is juxtaposed with a status field; the ‘wide’
field (the status and original field combination) can be atomically
modified using a compare-and-swap. We ensure that wide fields are
at most twice the size of the processor word; for example, on a 32-
bit architecture the largest wide field would have a 32-bit status and
a 32-bit payload
2
, thus allowing a 64-bit compare-and-swap to be
used. Such a double-word compare-and-swap is available on most
modern instruction set architectures. If the original field is already
twice the processor word size (such as a 64-bit field on a 32-bit
processor), we first split the field into two 32-bit halves.
The details of how objects are copied and how mutators access
objects to be moved is described in Sections 3.2-3.3. Section 3.4
describes an extension of the basic mechanism that allows mutators
to perform atomic operations (e.g., CAS) on objects while they are
being moved.
3.1 The challenge
A reader who has not previously dealt with real-time or lock-free
collectors may wonder why it is difficult to construct a collector
that supports lock-free programs. We illustrate the problem by dis-
cussing a generic real-time collector, similar to the ones proposed
by Nettles and O’Toole [27], Cheng and Blelloch [7, 10], and Hud-
son and Moss [21]. The basic idea is to create and maintain two
copies of each object. The fresh new copy is created by the collector
and thereafter, any application thread is responsible for executing
writes to both the original and the replicated copy. The main copy
(used for reading the current values) is the original object. Once all
2
We use ‘payload’ to refer to objects fields not added by CoCo, such as the
forwarding pointer word, or the status fields in the wide object.
objects have an updated replica, the copying phase terminates by
stopping all program threads and modifying their root set pointers
to point to the copied objects.
The main problem with this solution is that the two copies of an
object are not guaranteed to contain the same information, unless
proper locking mechanisms are introduced. Suppose two threads
try to concurrently modify a field f, which originally holds the
value 0. Thread T 1 tries to write the value 1 into f and Thread
T 2 tries to write the value 2. Although they attempt to write to
the field concurrently, one of the writes will happen before the
other. Assume that T 1 writes first. A third thread that reads this
field may see the field value going from 0 to 1 and then to 2.
However, threads T 1 and T 2 next concurrently attempt to write
to the replica, possibly happening in a different order, making 1
be the value that prevails in the replica. A third thread that reads
the field in the original location and then in the copied location
may observe the sequence of values 0, 1, 2, 1 in the field f . Such a
sequence should never be observed by any thread according to any
reasonable memory model. To solve this, previous work employed
locking or assumed that there were no concurrent (non-blocking)
writes to a memory location. However, non-blocking concurrent
accesses are essential for any lock-free real-time algorithm.
A second problem is that in the generic algorithm, the threads
are all halted simultaneously to shift from the original copy to
the replica. This also involves some undesirable locking mecha-
nism, making it possible for one slow thread to block others. If the
threads are not stopped simultaneously, then they may be in dif-
ferent stages, where some of them are still reading the old replica,
whereas others are not writing to it anymore. Various other haz-
ardous races exist.
Past solutions include disallowing simultaneous writes [21]; or
(inefficiently) creating a full copy of the object for each modifi-
cation [20]; limiting the run to a uniprocessor or changing the ac-
cessed copy while the program threads halt [3]. Collectors that han-
dle concurrent compaction as the application executes concurrently
[11, 23] employ virtual memory (page protection) to simultane-
ously block access to stale data. The problem with these collectors
is that they induce a trap storm whose duration is tens of millisec-
onds and during which the program is practically halted. CoCo’s
responsiveness is three orders of magnitude better (less than tens of
microseconds).
CoCo does not need to stop the threads simultaneously. It also
does not rely on locking to keep the replicas coherent. The main
idea is to create a temporary wide object in which each field is as-
sociated with a status word. The status changes atomically with the
data and indicates the current location of the data values. The use
of this temporary wide object and incremental object copying, in
which the transfer of data location happens field by field, provides
the high responsiveness of CoCo. The algorithm is described in the
following subsections.
3.2 The object copying mechanism
In what follows, we assume that CoCo runs on a single thread con-
currently with the program. An extension to several CoCo threads
is discussed in Subsection 3.5 below. CoCo traverses the objects
that need to be copied one by one in any order. The first step of
copying an object is to create an uninitialized wide version of the
object, as discussed earlier. Objects contain a special header word
used as a forwarding pointer for CoCo during the compaction. In
the first phase, the forwarding pointer will store a reference to the
wide object. Later it will point to the new copy of the object.
CoCo copies each payload field from the original object into
the wide object. At the same time, the mutator may modify the
wide object (modifications to the original are no longer allowed
after a wide-object pointer has been installed to the header of the