and found that ARM CPUs are not vulnerable to the attacks
described in this paper.
IBM.
Finally, we also notified IBM security about the
findings reported in this work. IBM responded that none of
its CPUs, including System Z and POWER, are affected.
The RIDL Attack.
In a concurrent independent work,¹ the
RIDL attack [56] analyzes additional buffers present inside
Intel CPUs, with specific attention to the Line Fill Buffer
(LFB) and load ports. There, they show that faulty loads from
the LFB or load ports leak information across various security
domains. We note, however, that Fallout is different from (and
complementary to) RIDL, because the two attacks exploit
different microarchitectural elements (the LFB and load ports for
RIDL; the Store Buffer and WTF optimization for Fallout). In
particular, RIDL can be used to recover values recently placed
in the LFB, whereas Fallout allows the attacker to recover the
value of a specific attacker-chosen write in the store buffer.
2 Background
In this section, we provide the background required to under-
stand our attack, including a description of caches and cache
attacks, transient execution attacks, and Intel Transactional
Synchronization Extensions.
2.1 Caches and Cache Attacks
Caches are an essential part of modern processors. They are
small and fast memories where the CPU stores copies of
data from the main memory to hide the main memory access
latency. Modern CPUs have a variety of different caches and
buffers for various purposes. The main cache hierarchy is the
instruction and data cache hierarchy consisting of multiple
levels, which vary in size and latency. The L1 is the smallest
and fastest cache. The L3 cache, also called the last-level
cache (LLC), is typically the largest and slowest.
Cache Organization.
Modern caches are typically set-
associative, i.e., a cache line is stored in a fixed set, as deter-
mined by part of its virtual or physical address. Addresses
that map to the same set are called congruent. On modern
processors, the last-level cache is typically physically indexed
and shared across cores. It is also often inclusive of L1 and L2,
which means that all data stored in L1 and L2 is also stored in
the last-level cache. The cache hierarchy thus exposes the
latency difference between a main memory access (cache miss)
and a cache access (cache hit), i.e., exactly the latency
difference that caches introduce. This difference can be
exploited in side channels against a non-colluding victim, or in
covert channels, where sender and receiver collude to transmit
information.
¹ Both teams made contact on May 7th, provided each other with an
overview of their findings, and coordinated public disclosure as well as
communication with Intel. For a complete timeline describing the flow of
information related to this disclosure, see mdsattacks.com.
Cache Attacks.
Different cache attack techniques have
been proposed in the past, such as Prime+Probe [45, 47] and
Flush+Reload [58]. Flush+Reload and its variants [17,
19, 36, 60] work on shared memory at a cache-line granularity.
The attacker repeatedly flushes a cache line and measures
how long it takes to reload it. The reload time will be high
unless another process has accessed the cache line in the
meantime, bringing it back into the cache. In contrast,
Prime+Probe attacks work
without shared memory, and only at a cache-set granularity.
The attacker repeatedly accesses a set of congruent memory
addresses, filling an entire cache set with its own cache lines,
and measures how long that takes. As this is repeated in a loop,
the cache set is always filled with the attacker’s cache lines.
Hence the access time will always be rather low. However,
if another process accesses a memory location in the same
cache set, it will evict one of the attacker’s cache lines and
the access time will increase.
Cache attacks have been used to break cryptographic
implementations [11, 12, 38, 45, 47, 58, 59], infer user in-
put [19,36,48], and break system-level security [18,24]. Both
Prime+Probe and Flush+Reload have also been used in high-
performance covert channels [17, 38, 42], also as a building
block of transient execution attacks such as Meltdown [37],
Spectre [32], and Foreshadow [55, 57] that we detail below.
2.2 Superscalar Processors
To achieve their high performance, modern processors are
often superscalar, that is, they perform multiple operations
in parallel. In current implementations, e.g., in modern Intel
processors (see Fig. 1), execution of a program is divided
between two main parts: the frontend and the execution engine.
The frontend is responsible for processing the machine-code
instructions of the program, decoding them into a stream of
micro-ops (µOPs) that are sent to the execution engine for
execution.
Out-of-order Execution.
The execution engine consists
of multiple execution units, which can execute various µOPs.
To allow superscalar execution, the execution engine follows
a variant of Tomasulo’s algorithm [54], which executes µOPs
when the data they depend on is available, rather than following
strict program order. Once executed, the µOPs arrive at the
reorder buffer, whose purpose is to retire µOPs in program
order, ensuring that the architecturally-visible effects of µOPs
take effect in the order the programmer specified.
Speculative Execution.
The stream of µOPs that the
frontend generates does not necessarily correspond to the
sequence of instructions in the program. A major cause of
deviation is branch prediction. When the frontend reaches a
branch instruction, it often does not yet know where execution
will proceed. Instead of waiting, the frontend attempts to pre-
dict the outcome of the branch and proceed from there. In the
case that the prediction is correct, the generated µOPs match
the program and can be processed. Otherwise, at some later
stage, the processor notices the misprediction. The frontend