2.1 DRAM Organization
The memory hierarchy is optimized for sequential accesses, often at the expense of random accesses, at all levels: DRAM chip, DRAM channel, memory controllers, and caches.

Sequential DRAM reads benefit from spatial locality in DRAM chip rows. Before a DRAM read or write can occur, a DRAM row – typically 8–16 KB of data – must be destructively loaded (activated) into an internal buffer for its contents to be accessed. Consecutive read or write requests to the same row are limited only by DRAM channel bandwidth, so sequential accesses take full advantage of spatial locality in row accesses. Random accesses, in contrast, must find independent request streams to hide the high latency of a row cycle in different banks (DRAM page misses), or, worse, the additional latency of accessing different rows of the same bank (DRAM page conflicts).
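As an illustration of the two patterns (a minimal C++ sketch, not code from the paper; the arrays data and idx are hypothetical), the sequential loop below amortizes each row activation over many accesses, while the indirect loop is likely to touch a different row on nearly every access once data far exceeds cache capacity:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sequential scan: consecutive elements fall in the same DRAM row, so a
    // single row activation serves many accesses (page hits).
    uint64_t sum_sequential(const std::vector<uint64_t>& data) {
      uint64_t sum = 0;
      for (std::size_t i = 0; i < data.size(); i++)
        sum += data[i];
      return sum;
    }

    // Random gather: once data far exceeds the caches, each access likely
    // lands in a closed row of some bank (page miss) or evicts an open row
    // of the same bank (page conflict).
    uint64_t sum_indirect(const std::vector<uint64_t>& data,
                          const std::vector<uint32_t>& idx) {
      uint64_t sum = 0;
      for (std::size_t i = 0; i < idx.size(); i++)
        sum += data[idx[i]];
      return sum;
    }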
Hardware memory controllers reorder requests to minimize page misses and other high-latency request sequences, such as bus turnarounds on read-to-write transitions. But requests can be rescheduled only within a short window of visibility, e.g., 48 cache lines per memory channel [38], shared among all cores. From the hardware's perspective all requests are equally urgent, and they cannot be reordered outside of this window.
DRAM interfaces are also optimized for sequential accesses. Modern CPUs transfer blocks of at least 64 bytes from DRAM. Double Data Rate (DDR) DRAM interfaces transfer data on both the rising and falling edges of a clock signal: e.g., a DDR3-1600 [24] chip operating at 800 MHz delivers a theoretical bandwidth of 12.8 GB/s. Current generation CPUs use DDR3/DDR4 interfaces with burst transfers of 4 cycles (mandatory for DDR4), i.e., eight transfers of 8 bytes, leading to a 64-byte minimum transfer on a 64-bit DRAM interface. Even though larger transfers would improve DRAM power, reliability, and peak performance, this size coincides with current cache line sizes.
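For reference, the arithmetic behind these figures (standard DDR parameters, restated here rather than quoted from the paper):

    DDR3-1600:        800 MHz clock × 2 transfers/cycle = 1600 MT/s
    peak bandwidth:   1600 MT/s × 8 bytes/transfer (64-bit bus) = 12.8 GB/s
    minimum transfer: 8-transfer burst (4 DDR clock cycles) × 8 bytes = 64 bytes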
Sequential memory accesses benefit significantly from large cache lines, as they have good spatial locality and can use all of the data within each line. By contrast, random memory accesses use only a small portion of the memory bandwidth they consume. In many applications with irregular memory accesses, each DRAM access uses only 4–8 bytes of each cache line transferred: 64 bytes for reads, and 128 bytes for writes (a cache-line read plus write-back), netting 3–6% effective bandwidth.
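The 3–6% figure is simply the ratio of useful to transferred bytes for such an update (a 64-byte line read plus its 64-byte write-back):

    4 B / 128 B ≈ 3%        8 B / 128 B ≈ 6%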
Applications that stall on memory but do not saturate the memory bandwidth are bound by memory latency. Random DRAM device latency has been practically stagnant at ∼50 ns for the last decade [41, 42]. A DRAM request at peak bandwidth has 100× worse latency than an L1 cache lookup, and in the time it takes to wait for memory, a SIMD superscalar core can execute ∼5,000 64-bit operations.
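One back-of-the-envelope accounting that yields a figure of this magnitude (the core parameters here are our assumptions, not the paper's): with an L1 lookup of ∼4 cycles, a 100× slower DRAM request spans roughly 400 cycles, and a wide SIMD core sustaining on the order of 12 64-bit operations per cycle executes about

    400 cycles × ∼12 64-bit ops/cycle ≈ 4,800 ≈ 5,000 operations

during a single memory stall.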
2.2 CPU Request Reordering
CPU out-of-order execution allows multiple long-latency memory accesses to be overlapped, but only up to the limits of small hardware structures. By Little's Law, memory throughput is the ratio of outstanding memory requests to DRAM latency. The visible effect of either low memory-level parallelism or high DRAM latency is underutilized memory bandwidth. Such "latency bound" programs perform well in-cache, and may have no explicit data dependencies, yet become serialized on large inputs.
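As a concrete instance of Little's Law using the numbers in this section (10 outstanding line-fill requests per core, 64-byte lines, ∼50 ns latency):

    per-core bandwidth ≈ 10 requests × 64 B / 50 ns = 12.8 GB/s

so a single core with only ten outstanding misses can at best match one DDR3-1600 channel, and any increase in latency or drop in outstanding requests directly lowers achieved bandwidth.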
Hardware prefetchers for sequential accesses increase memory-level parallelism by keeping more DRAM requests outstanding; therefore, sequential reads and writes achieve higher DRAM bandwidth than random accesses. Current hardware has no prefetchers for random requests.
Current CPUs can handle 10 demand requests per core (in the Line Fill Buffers between L1 and L2) [1, 11], as long as all requests fit within the out-of-order execution window. The effective Memory Level Parallelism (MLP) is most constrained by the capacity limits of resources released in FIFO order, e.g., the reorder buffer, or the load and store buffers. A loop body with a large number of instructions per memory load may reduce the effective parallelism, e.g., a 192-micro-op reorder buffer (ROB) must hold all micro-ops issued since the oldest non-retired instruction. Branch mispredictions, especially of hard-to-predict branches that depend on indirect memory accesses, further reduce the effective ROB size. Finally, atomic operations drain store buffers [44] and reduce Instruction Level Parallelism (ILP). While hardware mechanisms, including memory controllers and out-of-order schedulers, can reorder only a limited window of tens of operations, Milk efficiently orchestrates billions of accesses.
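A hypothetical C++ sketch of this effect (our example, not the paper's): both loops below issue one random load per iteration, but the second adds a data-dependent branch and a large amount of per-iteration work, so far fewer loads can be in flight simultaneously:

    #include <cstddef>
    #include <cstdint>

    // Independent indirect loads: MLP is capped mainly by the ~10 line fill
    // buffers per core, since little else in the loop occupies the ROB.
    uint64_t sum_simple(const uint64_t* data, const uint32_t* idx, size_t n) {
      uint64_t sum = 0;
      for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
      return sum;
    }

    // Stand-in for a large loop body (hypothetical filler work).
    static inline uint64_t heavy_transform(uint64_t v) {
      for (int k = 0; k < 16; k++)
        v = v * 0x9e3779b97f4a7c15ULL + k;
      return v;
    }

    // Same loads, but each iteration adds a branch that depends on the loaded
    // value plus many extra micro-ops: the extra instructions fill ROB entries
    // between loads, and mispredictions flush younger loads, so fewer cache
    // misses overlap and effective MLP drops.
    uint64_t sum_branchy(const uint64_t* data, const uint32_t* idx, size_t n) {
      uint64_t acc = 0;
      for (size_t i = 0; i < n; i++) {
        uint64_t v = data[idx[i]];
        if (v & 1)                       // hard to predict: data dependent
          acc += heavy_transform(v);     // many micro-ops per memory access
        else
          acc += v;
      }
      return acc;
    }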
3. DESIGN
Given the inefficiency of random DRAM references and the limitations of hardware latency-hiding mechanisms, the Milk compiler uses a software-based approach to plan all memory accesses and harvest locality beyond hardware capabilities. To use the Milk compiler, programs must fit the Milk execution model and carry milk¹ annotations.
Milk achieves significant performance improvements by reordering memory accesses for better cache locality. The reordered references maximize temporal locality by partitioning indirect references so that all references to the same memory location are processed within cache capacity. The planned accesses also improve spatial locality by grouping indirect references to neighboring memory locations. Furthermore, Milk avoids true sharing, false sharing, and expensive synchronization overheads by ensuring that only one core writes to each cache line.
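The following C++ sketch conveys the underlying idea (it is a simplification, not the code Milk actually generates; names such as Update and partition_elems are ours): indirect updates are first deferred, using only sequential writes, into partitions whose target ranges fit in cache, and each partition is then drained in turn, so every random access hits a cache-resident range:

    #include <cstdint>
    #include <vector>

    // Deferred update: which element to touch and what to add to it.
    struct Update { uint32_t index; uint32_t delta; };

    void indirect_add(std::vector<uint32_t>& target,
                      const std::vector<Update>& updates,
                      size_t partition_elems /* chosen to fit in cache */) {
      size_t num_parts = (target.size() + partition_elems - 1) / partition_elems;
      std::vector<std::vector<Update>> parts(num_parts);

      // Phase 1: defer each update by appending it (a sequential write) to
      // the partition that owns its target index.
      for (const Update& u : updates)
        parts[u.index / partition_elems].push_back(u);

      // Phase 2: drain one partition at a time; all random accesses now fall
      // within a cache-resident range. With partition boundaries aligned to
      // cache lines, different cores could drain different partitions without
      // ever writing the same cache line.
      for (const auto& p : parts)
        for (const Update& u : p)
          target[u.index] += u.delta;
    }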
However, naively reorganizing indirect references can add non-trivial overhead. Milk keeps the additional bandwidth low by using only efficient sequential DRAM references. Although its data transformations require an investment of additional CPU cycles and sequential DRAM bandwidth, we show how to minimize these overheads with DRAM-conscious Clustering.
For the Milk compiler to perform the optimization automatically, users must annotate indirect accesses in parallel OpenMP loops with a milk clause, which is sufficient for simple loops like Figure 4a. Explicit milk directives can select indirect references that should be deferred, along with their context (see line 12 in Figure 4b). Optional combiner functions allow programmers to summarize the combined effects of updates targeting the same address. Section 4 describes milk's syntax in more detail.
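Since Figure 4 is not reproduced in this excerpt, the fragment below sketches the kind of loop described above; the annotation style is inferred from the prose (the definitive syntax is given in Section 4), and the array names are illustrative:

    // Count updates indexed by data-dependent values: the milk clause marks
    // the loop's indirect accesses as deferrable, letting the compiler
    // reorder them for locality.
    #pragma omp parallel for milk
    for (int i = 0; i < num_edges; i++)
      count[edge_dst[i]]++;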
The Milk execution model is similar to MapReduce [16]. We do not offer a Map interface, however, since efficient iteration over in-memory data structures can use domain-specific knowledge (e.g., incoming vs. outgoing neighbor traversal in graph processing, 2-D loop nests for image processing).
¹ Milk's milk is an homage to Cilk's cilk [19].