SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs


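ECC memory, also called error-checking and correcting memory, is a memory technology that detects and corrects errors in stored data. It plays a key role in modern computer systems, particularly for applications that demand high data integrity and system stability. This document describes how ECC memory can be used to automatically detect memory leaks and memory corruption during production runs, improving software availability and security.

ECC memory works by storing extra bits alongside each block of data. When data is written to or read from memory, the ECC controller computes the check code held in these extra bits. If the data was altered while stored, the ECC logic detects the mismatch and attempts to repair it. Hardware protection of this kind does not, however, address memory leaks and memory corruption, which remain two major software problems in production environments. Memory leaks gradually exhaust system resources, and memory corruption can create serious security risks. According to the statistics cited, 68% of the vulnerabilities reported in 2003 stemmed from memory leaks or memory corruption. Traditional dynamic monitoring tools such as Purify can detect these problems, but they carry a high runtime overhead, slowing a program by up to 20 times, which rules out continuous use in production.

The document therefore introduces SafeMem, a tool that detects memory leaks and memory corruption during production runs without requiring new hardware support. SafeMem uses existing ECC memory technology, combined with analysis of dynamic memory usage behavior, to identify memory problems. This approach keeps the performance impact low and allows memory problems to be found and handled without disrupting the production environment, improving system stability and security.

ECC memory is an important defense against data errors, and SafeMem offers an effective, economical way to address memory problems in production. By understanding how ECC memory works and how a tool such as SafeMem uses it, developers can better manage and maintain their systems and reduce the failures and security risks caused by memory problems.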

Figure 1: Read/Write Operations for ECC Memory. (a) Write to ECC memory: data from the CPU and cache passes through the ECC generator in the ECC memory controller, and the data is stored in memory together with its ECC code. (b) Read from ECC memory: the ECC memory controller regenerates the ECC code for the data read from memory, corrects single-bit errors, and reports multi-bit errors.
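The read path in Figure 1 (correct single-bit errors, report multi-bit errors) can be illustrated with a small single-error-correcting, double-error-detecting code. The sketch below uses an extended Hamming (8,4) code over a 4-bit value purely for illustration; it is an assumption for teaching purposes, not the code used by any particular ECC memory controller, which protects 32-bit or 64-bit groups in hardware. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with parity bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None for an uncorrectable error."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    odd_parity = sum(code) % 2       # 1 means an odd number of bits flipped

    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        code[syndrome] ^= 1          # single-bit error: syndrome is its index
        status = "corrected"
    else:
        return None, "multi-bit error"   # even number of flips: report only

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of a codeword yields `"corrected"` with the original value, while flipping two bits yields `"multi-bit error"`, mirroring the two outcomes shown in panel (b).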

2.2 Using ECC to Monitor Memory Accesses

2.2.1 Main Idea

Our work makes a novel use of ECC memory to monitor memory accesses for software debugging. More specifically, we use ECC memory for two purposes: (1) detecting illegal accesses (e.g., out-of-bound memory accesses, or accesses to freed memory buffers) to monitored memory locations; (2) pruning false positives in memory leak detection. More details about each specific usage are described in Sections 3 and 4.

Both usages require detection of accesses to some monitored memory locations. To achieve this goal, we use ECC protection in a way similar to page protection, which is commonly exploited in shared virtual memory systems [20]. Even though ECC groups are either 32 bits or 64 bits in granularity, using ECC for memory protection has to be at cache-line granularity, because accesses to main memory use this granularity.

The advantage of using ECC protection over using page protection is that the former is at cache-line granularity, whereas the latter is at page granularity. Therefore, ECC protection can significantly reduce the amount of false sharing and padding space. In our experiments, we have compared these two approaches quantitatively, and our results show that ECC protection can reduce the amount of memory wasted on memory monitoring by up to 74 times (see Section 6).
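The effect of granularity on padding can be seen with simple arithmetic. The sketch below assumes that each monitored buffer must occupy whole protection granules by itself, and it assumes a 64-byte cache line and a 4096-byte page; the buffer sizes are made up for illustration. It does not reproduce the paper's measurements, which come from the experiments in Section 6.

```python
CACHE_LINE = 64   # assumed cache-line size in bytes
PAGE = 4096       # assumed page size in bytes


def round_up(n, granule):
    """Smallest multiple of granule that is >= n."""
    return -(-n // granule) * granule


def monitoring_waste(buffer_sizes, granule):
    """Padding bytes needed if every buffer is rounded up to whole granules."""
    return sum(round_up(size, granule) - size for size in buffer_sizes)


# Hypothetical buffer sizes in bytes, chosen only for illustration.
sizes = [24, 100, 512, 1000]
line_waste = monitoring_waste(sizes, CACHE_LINE)   # 40 + 28 + 0 + 24
page_waste = monitoring_waste(sizes, PAGE)         # 4072 + 3996 + 3584 + 3096
```

For these four buffers the cache-line scheme pads 92 bytes and the page scheme pads 14748 bytes. The ratio depends entirely on the size distribution of the monitored buffers, so small allocations favor cache-line granularity the most.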

These advantages of ECC protection are also exploited by some fine-grained distributed shared memory systems, such as Blizzard [25]. Different from those works, we use ECC protection for software debugging instead of implementing cache coherence operations. Therefore, we have different design trade-offs. In addition, they used special ECC memory controllers, whereas we use a standard off-the-shelf ECC memory controller, which has much more limited functionality available to software. For example, most commercial ECC memory controllers do not allow software to directly access the ECC code. Moreover, unlike page protection faults, operating systems do not deliver the ECC-error interrupt to user-level programs. Therefore, we need to first address all these challenges before we use ECC for monitoring memory accesses to watched locations.

We modify the Linux operating system to provide three new system calls: (1) WatchMemory(address, size), which registers a memory region starting from address to be monitored by SafeMem. The memory region and its size need to be cache-line aligned. (2) DisableWatchMemory(address), which removes monitoring from the specified memory region. (3) RegisterECCFaultHandler(function), which registers a user-level ECC fault handler. When an ECC fault occurs, the fault is delivered to this user-level handler.
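The contract of the three calls can be modeled in user space. The following is a minimal simulation, not the kernel implementation: the class `WatchRegistry`, its method names, the handler signature, the `access` method, and the 64-byte cache-line size are all assumptions introduced here to illustrate the alignment rule and fault delivery described above.

```python
CACHE_LINE = 64  # assumed cache-line size in bytes


class WatchRegistry:
    """Toy model of WatchMemory / DisableWatchMemory / RegisterECCFaultHandler."""

    def __init__(self):
        self.regions = {}    # start address -> size in bytes
        self.handler = None  # user-level fault handler, if registered

    def watch_memory(self, address, size):
        # Both the start address and the size must be cache-line aligned.
        if size <= 0 or address % CACHE_LINE or size % CACHE_LINE:
            raise ValueError("region must be cache-line aligned")
        self.regions[address] = size

    def disable_watch_memory(self, address):
        self.regions.pop(address, None)

    def register_ecc_fault_handler(self, function):
        self.handler = function

    def access(self, address):
        """Simulate one memory access; return True if it raised a fault."""
        for start, size in list(self.regions.items()):
            if start <= address < start + size:
                if self.handler is not None:
                    self.handler(self, start, address)
                return True
        return False
```

A handler registered with `register_ecc_fault_handler` receives the registry, the start of the faulting region, and the faulting address, so it can call `disable_watch_memory` itself.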

In our work, we only need to detect the first access to each monitored location because: (1) For memory corruption detection, the first access to a monitored location is a bug. SafeMem then simply pauses program execution to allow programmers to attach an interactive debugger, such as gdb, to check the program state and analyze the bug. (2) For memory leak detection, the first access to a monitored location indicates a false positive. Then this location no longer needs to be monitored. Therefore, in both cases, the user-level ECC fault handler of SafeMem can disable the monitoring for the faulted lines using the DisableWatchMemory() system call.
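The two cases above amount to a small decision in the fault handler. The sketch below is one way to express it, assuming the handler knows why each line was watched; the `Purpose` enum, the `watched` dictionary, and the returned action strings are hypothetical names, and removing the dictionary entry stands in for the DisableWatchMemory() call.

```python
from enum import Enum


class Purpose(Enum):
    CORRUPTION_GUARD = 1  # any access to this line is a bug
    LEAK_SUSPECT = 2      # line belongs to a buffer suspected of being leaked


def on_first_access(watched, line, log):
    """Handle the first access to a watched line and stop watching it.

    watched maps a line address to the Purpose it was registered for.
    """
    purpose = watched.pop(line)  # stands in for DisableWatchMemory(line)
    if purpose is Purpose.CORRUPTION_GUARD:
        log.append(("bug", line))
        return "pause-for-debugger"   # programmer can now attach gdb
    log.append(("false-positive", line))
    return "resume"                   # buffer is still in use, not a leak
```

Because the entry is removed in both branches, a second access to the same line never reaches the handler, which matches the first-access-only requirement.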

2.2.2 Design Issues

Data Scrambling. Since most commercial ECC memory controllers do not allow software to directly modify an ECC code, we use a special trick to “scramble” the ECC code of a watched ECC-group. When WatchMemory is called, SafeMem first disables the ECC functionality, and writes the scrambled data into this ECC-group. It then flushes the data from cache into memory. Since ECC is disabled, the ECC code for this line remains the same, i.e., the old code. Finally, SafeMem enables ECC. Figure 2 shows the process of this trick. During the disable-enable period, we lock the memory bus to avoid any other background memory accesses, such as those made by other processors or DMAs, so that other memory locations are not affected by this WatchMemory operation. After this operation, the first access to this location triggers an ECC fault because of the mismatch between the old ECC code and the scrambled data.
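The disable, write, enable sequence can be simulated with a toy memory model. Everything in this sketch is an assumption for illustration: `toy_ecc` is a simple XOR fold standing in for a real ECC code, `SCRAMBLE_MASK` is an arbitrary pattern chosen so that the toy code changes, and saving the original word so it can be restored later is a choice made here rather than something stated in the text. Cache flushing and memory-bus locking are not modeled.

```python
def toy_ecc(word):
    """Stand-in check code (NOT a real ECC): XOR of the 8 bytes of a word."""
    code = 0
    for shift in range(0, 64, 8):
        code ^= (word >> shift) & 0xFF
    return code


class ToyEccMemory:
    def __init__(self):
        self.data = {}           # address -> 64-bit word
        self.code = {}           # address -> stored check code
        self.ecc_enabled = True

    def write(self, addr, word):
        self.data[addr] = word
        if self.ecc_enabled:     # the code is only regenerated while ECC is on
            self.code[addr] = toy_ecc(word)

    def read(self, addr):
        word = self.data[addr]
        if self.ecc_enabled and toy_ecc(word) != self.code[addr]:
            raise RuntimeError("ECC fault at 0x%x" % addr)
        return word


SCRAMBLE_MASK = 0x3  # arbitrary pattern; guarantees the toy code differs


def watch_memory(mem, addr):
    """Scramble the data at addr while leaving its old check code in place."""
    original = mem.data[addr]
    mem.ecc_enabled = False                    # 1. disable ECC
    mem.write(addr, original ^ SCRAMBLE_MASK)  # 2. write scrambled data
    mem.ecc_enabled = True                     # 3. re-enable ECC
    return original                            # kept so it can be restored


def disable_watch_memory(mem, addr, original):
    """Restore the data with ECC enabled, which regenerates a matching code."""
    mem.write(addr, original)
```

After `watch_memory`, the next `read` of that address raises the simulated fault because the stored code still describes the unscrambled word; after `disable_watch_memory`, reads succeed again.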

Proceedings of the 11th Int’l Symposium on High-Performance Computer Architecture (HPCA-11 2005)
