2.1 DRAM Organization
The memory hierarchy is optimized for sequential accesses, often at the expense of random accesses, at all levels: DRAM chip, DRAM channel, memory controllers, and caches.

Sequential DRAM reads benefit from spatial locality in DRAM chip rows. Before a DRAM read or write can occur, a DRAM row – typically 8–16 KB of data – must be destructively loaded (activated) into an internal buffer for its contents to be accessed. Consecutive read or write requests to the same row are limited only by DRAM channel bandwidth, so sequential accesses take full advantage of spatial locality in row accesses. Random accesses, in contrast, must find independent request streams to hide the high latency of a row cycle in different banks (DRAM page misses), or, worse, the additional latency of accessing different rows of the same bank (DRAM page conflicts).
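As an illustration of the two patterns (a minimal C++ sketch, not code from the paper; the arrays data and idx are hypothetical), the sequential loop below amortizes each row activation over many accesses, while the indirect loop is likely to touch a different row on nearly every access once data far exceeds cache capacity:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sequential scan: consecutive elements fall in the same DRAM row, so a
    // single row activation serves many accesses (page hits).
    uint64_t sum_sequential(const std::vector<uint64_t>& data) {
      uint64_t sum = 0;
      for (std::size_t i = 0; i < data.size(); i++)
        sum += data[i];
      return sum;
    }

    // Random gather: once data far exceeds the caches, each access likely
    // lands in a closed row of some bank (page miss) or evicts an open row
    // of the same bank (page conflict).
    uint64_t sum_indirect(const std::vector<uint64_t>& data,
                          const std::vector<uint32_t>& idx) {
      uint64_t sum = 0;
      for (std::size_t i = 0; i < idx.size(); i++)
        sum += data[idx[i]];
      return sum;
    }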
Hardware memory controllers reorder requests to minimize page misses and other high-latency request sequences, such as bus turnarounds on read-to-write transitions. But requests can be rescheduled only within a short window of visibility, e.g., 48 cache lines per memory channel [38], shared among all cores. From the hardware's perspective all requests are equally urgent, and they cannot be reordered outside of this window.
DRAM interfaces are also optimized for sequential accesses. Modern CPUs transfer blocks of at least 64 bytes from DRAM. Double Data Rate (DDR) DRAM interfaces transfer data on both the rising and falling edges of a clock signal: e.g., a DDR3-1600 [24] chip operating at 800 MHz delivers a theoretical bandwidth of 12.8 GB/s. Current generation CPUs use DDR3/DDR4 interfaces with burst transfers of 4 cycles (mandatory for DDR4), i.e., eight transfers of 8 bytes, leading to a 64-byte minimum transfer on a 64-bit DRAM interface. Even though larger transfers would improve DRAM power, reliability, and peak performance, this size coincides with current cache line sizes.
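For reference, the arithmetic behind these figures (standard DDR parameters, restated here rather than quoted from the paper):

    DDR3-1600:        800 MHz clock × 2 transfers/cycle = 1600 MT/s
    peak bandwidth:   1600 MT/s × 8 bytes/transfer (64-bit bus) = 12.8 GB/s
    minimum transfer: 8-transfer burst (4 DDR clock cycles) × 8 bytes = 64 bytes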
Sequential memory accesses benefit significantly from large cache lines, as they have good spatial locality and can use all of the data within each line. By contrast, random memory accesses use only a small portion of the memory bandwidth they consume. In many applications with irregular memory accesses, each DRAM access uses only 4–8 bytes of each cache line transferred: 64 bytes for reads, and 128 bytes for writes (a cache-line read plus write-back), netting 3–6% effective bandwidth.
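The 3–6% figure is simply the ratio of useful to transferred bytes for such an update (a 64-byte line read plus its 64-byte write-back):

    4 B / 128 B ≈ 3%        8 B / 128 B ≈ 6%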
Applications that stall on memory but do not saturate the memory bandwidth are bound by memory latency. Random DRAM device latency has been practically stagnant at ∼50 ns for the last decade [41, 42]. A DRAM request at peak bandwidth has 100× worse latency than an L1 cache lookup, and in the time it takes to wait for memory, a SIMD superscalar core can execute ∼5,000 64-bit operations.
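One back-of-the-envelope accounting that yields a figure of this magnitude (the core parameters here are our assumptions, not the paper's): with an L1 lookup of ∼4 cycles, a 100× slower DRAM request spans roughly 400 cycles, and a wide SIMD core sustaining on the order of 12 64-bit operations per cycle executes about

    400 cycles × ∼12 64-bit ops/cycle ≈ 4,800 ≈ 5,000 operations

during a single memory stall.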
2.2 CPU Request Reordering
CPU out-of-order execution allows multiple long-latency memory accesses to be overlapped, but only up to the limits of small hardware structures. By Little's Law, memory throughput is the ratio of outstanding memory requests to DRAM latency. The visible effect of either low memory-level parallelism or high DRAM latency is underutilized memory bandwidth. Such "latency bound" programs perform well in-cache, and may have no explicit data dependencies, yet become serialized on large inputs.
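As a concrete instance of Little's Law using the numbers in this section (10 outstanding line-fill requests per core, 64-byte lines, ∼50 ns latency):

    per-core bandwidth ≈ 10 requests × 64 B / 50 ns = 12.8 GB/s

so a single core with only ten outstanding misses can at best match one DDR3-1600 channel, and any increase in latency or drop in outstanding requests directly lowers achieved bandwidth.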
Hardware prefetchers for sequential accesses increase memory-level parallelism by keeping more DRAM requests outstanding; therefore, sequential reads and writes achieve higher DRAM bandwidth than random accesses. Current hardware has no prefetchers for random requests.
Current CPUs can handle 10 demand requests per core (in the Line Fill Buffers between L1 and L2) [1, 11], as long as all requests fit within the out-of-order execution window. The effective Memory Level Parallelism (MLP) is most constrained by the capacity limits of resources released in FIFO order, e.g., the reorder buffer, or the load and store buffers. A loop body with a large number of instructions per memory load may reduce the effective parallelism, e.g., a 192-micro-op reorder buffer (ROB) must hold all micro-ops issued since the oldest non-retired instruction. Branch mispredictions, especially of hard-to-predict branches that depend on indirect memory accesses, further reduce the effective ROB size. Finally, atomic operations drain store buffers [44] and reduce Instruction Level Parallelism (ILP). While hardware mechanisms, including memory controllers and out-of-order schedulers, can reorder only a limited window of tens of operations, Milk efficiently orchestrates billions of accesses.
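A hypothetical C++ sketch of this effect (our example, not the paper's): both loops below issue one random load per iteration, but the second adds a data-dependent branch and a large amount of per-iteration work, so far fewer loads can be in flight simultaneously:

    #include <cstddef>
    #include <cstdint>

    // Independent indirect loads: MLP is capped mainly by the ~10 line fill
    // buffers per core, since little else in the loop occupies the ROB.
    uint64_t sum_simple(const uint64_t* data, const uint32_t* idx, size_t n) {
      uint64_t sum = 0;
      for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
      return sum;
    }

    // Stand-in for a large loop body (hypothetical filler work).
    static inline uint64_t heavy_transform(uint64_t v) {
      for (int k = 0; k < 16; k++)
        v = v * 0x9e3779b97f4a7c15ULL + k;
      return v;
    }

    // Same loads, but each iteration adds a branch that depends on the loaded
    // value plus many extra micro-ops: the extra instructions fill ROB entries
    // between loads, and mispredictions flush younger loads, so fewer cache
    // misses overlap and effective MLP drops.
    uint64_t sum_branchy(const uint64_t* data, const uint32_t* idx, size_t n) {
      uint64_t acc = 0;
      for (size_t i = 0; i < n; i++) {
        uint64_t v = data[idx[i]];
        if (v & 1)                       // hard to predict: data dependent
          acc += heavy_transform(v);     // many micro-ops per memory access
        else
          acc += v;
      }
      return acc;
    }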
3. DESIGN
Given the inefficiency of random DRAM references and the limitations of hardware latency-hiding mechanisms, the Milk compiler uses a software-based approach to plan all memory accesses and harvest locality beyond hardware capabilities. To use the Milk compiler, programs must fit the Milk execution model and carry milk¹ annotations.
Milk achieves significant performance improvements by reordering memory accesses for better cache locality. The reordered references maximize temporal locality by partitioning indirect references so that all references to the same memory location are processed within cache capacity. The planned accesses also improve spatial locality by grouping indirect references to neighboring memory locations. Furthermore, Milk avoids true sharing, false sharing, and expensive synchronization overheads by ensuring that only one core writes to each cache line.
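The following C++ sketch conveys the underlying idea (it is a simplification, not the code Milk actually generates; names such as Update and partition_elems are ours): indirect updates are first deferred, using only sequential writes, into partitions whose target ranges fit in cache, and each partition is then drained in turn, so every random access hits a cache-resident range:

    #include <cstdint>
    #include <vector>

    // Deferred update: which element to touch and what to add to it.
    struct Update { uint32_t index; uint32_t delta; };

    void indirect_add(std::vector<uint32_t>& target,
                      const std::vector<Update>& updates,
                      size_t partition_elems /* chosen to fit in cache */) {
      size_t num_parts = (target.size() + partition_elems - 1) / partition_elems;
      std::vector<std::vector<Update>> parts(num_parts);

      // Phase 1: defer each update by appending it (a sequential write) to
      // the partition that owns its target index.
      for (const Update& u : updates)
        parts[u.index / partition_elems].push_back(u);

      // Phase 2: drain one partition at a time; all random accesses now fall
      // within a cache-resident range. With partition boundaries aligned to
      // cache lines, different cores could drain different partitions without
      // ever writing the same cache line.
      for (const auto& p : parts)
        for (const Update& u : p)
          target[u.index] += u.delta;
    }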
However, naively reorganizing indirect references can add non-trivial overhead. Milk keeps the additional bandwidth low by using only efficient sequential DRAM references. Although its data transformations require an investment of additional CPU cycles and sequential DRAM bandwidth, we show how to minimize these overheads with DRAM-conscious Clustering.
For the Milk compiler to perform the optimization automatically, users must annotate indirect accesses in parallel OpenMP loops with a milk clause, which is sufficient for simple loops like Figure 4a. Explicit milk directives can select indirect references that should be deferred, along with their context (see line 12 in Figure 4b). Optional combiner functions allow programmers to summarize the combined effects of updates targeting the same address. Section 4 describes milk's syntax in more detail.
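Since Figure 4 is not reproduced in this excerpt, the fragment below sketches the kind of loop described above; the annotation style is inferred from the prose (the definitive syntax is given in Section 4), and the array names are illustrative:

    // Count updates indexed by data-dependent values: the milk clause marks
    // the loop's indirect accesses as deferrable, letting the compiler
    // reorder them for locality.
    #pragma omp parallel for milk
    for (int i = 0; i < num_edges; i++)
      count[edge_dst[i]]++;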
The Milk execution model is similar to MapReduce [16]. We do not offer a Map interface, however, since efficient iteration over in-memory data structures can use domain-specific knowledge (e.g., incoming vs. outgoing neighbor traversal in graph processing, 2-D loop nests for image processing).
¹ Milk's milk is an homage to Cilk's cilk [19].