address 0, it enters the “shared” state in CPU 0’s cache, and is still valid in memory. CPU 3 also loads the data at address 0, so that it is in the “shared” state in both CPUs’ caches, and is still valid in memory. Next, CPU 0 loads some other cache line (at address 8), which forces the data at address 0 out of its cache via an invalidation, replacing it with the data at address 8. CPU 2 now does a load from address 0, but this CPU realizes that it will soon need to store to it, and so it uses a “read invalidate” message in order to gain an exclusive copy, invalidating it from CPU 3’s cache (though the copy in memory remains up to date). Next, CPU 2 does its anticipated store, changing the state to “modified”. The copy of the data in memory is now out of date. CPU 1 then does an atomic increment, using a “read invalidate” to snoop the data from CPU 2’s cache and invalidate it, so that the copy in CPU 1’s cache is in the “modified” state (and the copy in memory remains out of date). Finally, CPU 1 reads the cache line at address 8, which uses a “writeback” message to push address 0’s data back out to memory.
Note that we end with data in some of the CPUs’ caches.
Quick Quiz 5: What sequence of operations would put the CPUs’ caches all back into the “invalid” state?
3 Stores Result in Unnecessary Stalls
Although the cache structure shown in Figure 1 provides good performance for repeated reads and writes from a given CPU to a given item of data, its performance for the first write to a given cache line is quite poor. To see this, consider Figure 4, which shows a timeline of a write by CPU 0 to a cache line held in CPU 1’s cache. Since CPU 0 must wait for the cache line to arrive before it can write to it, CPU 0 must stall for an extended period of time. (The time required to transfer a cache line from one CPU’s cache to another’s is typically a few orders of magnitude more than that required to execute a simple register-to-register instruction.)
Figure 4: Writes See Unnecessary Stalls. (Timeline: CPU 0’s write sends an “invalidate” message to CPU 1, and CPU 0 stalls until CPU 1’s “acknowledgement” returns.)
But there is no real reason to force CPU 0 to stall for so long: after all, regardless of what data happens to be in the cache line that CPU 1 sends it, CPU 0 is going to unconditionally overwrite it.
3.1 Store Buffers
One way to prevent this unnecessary stalling of writes is to add “store buffers” between each CPU and its cache, as shown in Figure 5. With the addition of these store buffers, CPU 0 can simply record its write in its store buffer and continue executing. When the cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from the store buffer to the cache line.
However, there are complications that must be addressed, which are covered in the next two sections.
3.2 Store Forwarding
To see the first complication, a violation of self-
consistency, consider the following code with vari-
ables “a” and “b” both initially zero, and with the
cache line containing variable “a” initially owned by
CPU 1 and that containing “b” initially owned by
CPU 0: