address 0, it enters the “shared” state in CPU 0’s cache, and is still valid in memory. CPU 3 also loads the data at address 0, so that it is in the “shared” state in both CPUs’ caches, and is still valid in memory. Next, CPU 0 loads some other cache line (at address 8), which forces the data at address 0 out of its cache via an invalidation, replacing it with the data at address 8. CPU 2 now does a load from address 0, but this CPU realizes that it will soon need to store to it, and so it uses a “read invalidate” message in order to gain an exclusive copy, invalidating it from CPU 3’s cache (though the copy in memory remains up to date). Next, CPU 2 does its anticipated store, changing the state to “modified”. The copy of the data in memory is now out of date. CPU 1 then does an atomic increment, using a “read invalidate” to snoop the data from CPU 2’s cache and invalidate it, so that the copy in CPU 1’s cache is in the “modified” state (and the copy in memory remains out of date). Finally, CPU 1 reads the cache line at address 8, which uses a “writeback” message to push address 0’s data back out to memory.
Note that we end with data in some of the CPUs’ caches.
Quick Quiz 5: What sequence of operations would put the CPUs’ caches all back into the “invalid” state?
3 Stores Result in Unnecessary Stalls
Although the cache structure shown in Figure 1 provides good performance for repeated reads and writes from a given CPU to a given item of data, its performance for the first write to a given cache line is quite poor. To see this, consider Figure 4, which shows a timeline of a write by CPU 0 to a cache line held in CPU 1’s cache. Since CPU 0 must wait for the cache line to arrive before it can write to it, CPU 0 must stall for an extended period of time. (The time required to transfer a cache line from one CPU’s cache to another’s is typically a few orders of magnitude more than that required to execute a simple register-to-register instruction.)
Figure 4: Writes See Unnecessary Stalls. (Timeline: CPU 0’s write sends an “invalidate” message to CPU 1, and CPU 0 stalls until CPU 1’s “acknowledgement” returns.)
But there is no real reason to force CPU 0 to stall for so long: after all, regardless of what data happens to be in the cache line that CPU 1 sends it, CPU 0 is going to unconditionally overwrite it.
3.1 Store Buffers
One way to prevent this unnecessary stalling of writes is to add “store buffers” between each CPU and its cache, as shown in Figure 5. With the addition of these store buffers, CPU 0 can simply record its write in its store buffer and continue executing. When the cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from the store buffer to the cache line.
However, there are complications that must be addressed, which are covered in the next two sections.
3.2 Store Forwarding
To see the first complication, a violation of self-
consistency, consider the following code with vari-
ables “a” and “b” both initially zero, and with the
cache line containing variable “a” initially owned by
CPU 1 and that containing “b” initially owned by
CPU 0: