[Figure 1: Read/Write Operations for ECC Memory. (a) Write to ECC memory. (b) Read from ECC memory.
Components shown: CPU, cache, ECC memory controller with ECC generator, and memory holding data and ECC code. On a read, the regenerated ECC code is compared with the stored code; single-bit errors are corrected and multi-bit errors are reported.]
2.2 Using ECC to Monitor Memory Accesses
2.2.1 Main Idea
Our work makes a novel use of ECC memory to monitor
memory accesses for software debugging. More specifi-
cally, we use ECC memory for two purposes: (1) detect-
ing illegal accesses (e.g., out-of-bound memory accesses,
or accesses to freed memory buffers) to monitored memory
locations; (2) pruning false positives in memory leak
detection. Each usage is described in detail in
Sections 3 and 4.
Both usages require detection of accesses to some moni-
tored memory locations. To achieve this goal, we use ECC
protection in a way similar to page protection, which is
commonly exploited in shared virtual memory systems [20].
Even though ECC groups are either 32 bits or 64 bits wide,
ECC-based memory protection has to operate at cache-line
granularity, because main memory is accessed in units of
cache lines.
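To make the granularity constraint concrete, the following minimal sketch rounds an arbitrary byte range out to whole cache lines. The 64-byte line size and the helper names are assumptions made for illustration; they are not taken from SafeMem, and the real line size depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; the real value is platform-dependent. */
#define CACHE_LINE_SIZE 64u

/* Round an address down to the start of its cache line. */
static uintptr_t line_align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* Round an address up to the next line boundary (no-op if already aligned). */
static uintptr_t line_align_up(uintptr_t addr)
{
    return (addr + CACHE_LINE_SIZE - 1) & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* Number of whole cache lines needed to cover [addr, addr + size). */
static size_t lines_covering(uintptr_t addr, size_t size)
{
    assert(addr + size >= addr);   /* the range must not wrap around */
    if (size == 0)
        return 0;
    uintptr_t first = line_align_down(addr);
    uintptr_t last  = line_align_up(addr + size);
    return (size_t)((last - first) / CACHE_LINE_SIZE);
}
```

Under these assumptions, an 8-byte object that sits inside one line occupies one protected line, while a 16-byte object that straddles a line boundary occupies two.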
The advantage of using ECC protection over using page
protection is that the former is at cache line granularity,
whereas the latter is at page granularity. Therefore, ECC
protection can significantly reduce the amount of false
sharing and padding space. In our experiments, we compared
these two approaches quantitatively; the results show that
ECC protection reduces the memory wasted on monitoring by
a factor of up to 74 (see Section 6).
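The padding argument can be illustrated with simple arithmetic. The sketch below computes how many bytes are wasted when a buffer must be rounded up to a whole number of protection units. The 4096-byte page and 64-byte line sizes used in the note that follows are assumed values, and the numbers are illustrative only; they are not the measurements reported in Section 6.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed when a buffer of `size` bytes must occupy a
 * whole number of protection units of `unit` bytes each. */
static size_t padding_waste(size_t size, size_t unit)
{
    assert(unit > 0);
    size_t rem = size % unit;
    return rem == 0 ? 0 : unit - rem;
}
```

For a 100-byte buffer, this gives 3996 bytes of padding with an assumed 4096-byte page as the protection unit, versus 28 bytes with an assumed 64-byte cache line.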
These advantages of ECC protection are also exploited
by some fine-grained distributed shared memory systems,
such as Blizzard [25]. Unlike those systems, we use ECC
protection for software debugging rather than for
implementing cache coherence operations, so we face
different design trade-offs. In addition, they used special
ECC memory controllers, whereas we use a standard off-
the-shelf ECC memory controller, which has much more
limited functionality available to software. For example,
most commercial ECC memory controllers do not allow
software to directly access the ECC code. Moreover, unlike
page protection faults, ECC-error interrupts are not
delivered to user-level programs by the operating system.
Therefore, we must address these challenges before we can
use ECC to monitor accesses to watched locations.
We modify the Linux operating system to provide three
new system calls: (1) WatchMemory(address, size), which
registers the memory region of the given size starting at
address to be monitored by SafeMem; both the starting
address and the size must be cache-line aligned.
(2) DisableWatchMemory(address), which removes monitoring
from the specified memory region.
(3) RegisterECCFaultHandler(function), which registers a
user-level ECC fault handler. When an ECC fault occurs,
it is delivered to this user-level handler.
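The interface can be pictured with a small user-space mock. The text gives only the call names and parameter names, so the C types, return values, error conventions, table size, 64-byte line size, and the is_watched() helper below are all assumptions made for illustration. The mock only records watched regions in a table; it does not touch ECC hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumed line size */
#define MAX_WATCHED     16    /* arbitrary table size for this mock */

/* Hypothetical handler type: receives the faulting address. */
typedef void (*ecc_fault_handler_t)(void *fault_addr);

struct watched_region { uintptr_t start; size_t size; int in_use; };

static struct watched_region watch_table[MAX_WATCHED];
static ecc_fault_handler_t   user_handler;

static int is_line_aligned(uintptr_t v) { return (v % CACHE_LINE_SIZE) == 0; }

/* Mock WatchMemory: 0 on success, -1 if misaligned, empty, or table full. */
int WatchMemory(void *address, size_t size)
{
    uintptr_t a = (uintptr_t)address;
    if (size == 0 || !is_line_aligned(a) || !is_line_aligned(size))
        return -1;
    for (int i = 0; i < MAX_WATCHED; i++) {
        if (!watch_table[i].in_use) {
            watch_table[i] = (struct watched_region){ a, size, 1 };
            return 0;
        }
    }
    return -1;
}

/* Mock DisableWatchMemory: 0 on success, -1 if no region starts there. */
int DisableWatchMemory(void *address)
{
    uintptr_t a = (uintptr_t)address;
    for (int i = 0; i < MAX_WATCHED; i++) {
        if (watch_table[i].in_use && watch_table[i].start == a) {
            watch_table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Mock RegisterECCFaultHandler: remembers the user-level handler. */
int RegisterECCFaultHandler(ecc_fault_handler_t fn)
{
    if (fn == NULL)
        return -1;
    user_handler = fn;
    return 0;
}

/* Hypothetical helper: 1 if addr lies inside a watched region, else 0. */
int is_watched(const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    for (int i = 0; i < MAX_WATCHED; i++) {
        if (watch_table[i].in_use && a >= watch_table[i].start &&
            a - watch_table[i].start < watch_table[i].size)
            return 1;
    }
    return 0;
}
```

Rejecting misaligned requests in the mock mirrors the stated requirement that the region and its size be cache-line aligned; how the real system calls report such errors is not specified here.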
In our work, we only need to detect the first access to
each monitored location because: (1) For memory corrup-
tion detection, the first access to a monitored location is a
bug. SafeMem then simply pauses program execution to al-
low programmers to attach an interactive debugger, such as
gdb, to check the program state and analyze the bug. (2) For
memory leak detection, the first access to a monitored loca-
tion indicates a false positive. Then this location no longer
needs to be monitored. Therefore, in both cases, SafeMem's
user-level ECC fault handler can disable monitoring for
the faulting lines using the DisableWatchMemory()
system call.
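The first-access-only policy can be sketched as a tiny simulation in which memory is an array of cache lines and a watched line raises a fault exactly once. All names prefixed with sim_ are hypothetical stand-ins, not SafeMem functions, and the handler here only counts faults, whereas a corruption detector would pause the program at that point.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LINES 8                  /* tiny simulated memory, in cache lines */

static int watched[NUM_LINES];       /* 1 while the line is monitored */
static int fault_count[NUM_LINES];   /* faults delivered per line */

/* Start monitoring one line. */
static void sim_watch(size_t line)
{
    assert(line < NUM_LINES);
    watched[line] = 1;
}

/* Stand-in for the DisableWatchMemory() system call. */
static void sim_disable_watch(size_t line)
{
    watched[line] = 0;
}

/* User-level handler: record the event, then stop monitoring the line. */
static void sim_ecc_fault_handler(size_t line)
{
    fault_count[line]++;
    sim_disable_watch(line);
}

/* Simulated access: a watched line faults before the access completes. */
static void sim_access(size_t line)
{
    assert(line < NUM_LINES);
    if (watched[line])
        sim_ecc_fault_handler(line);
}
```

Because the handler disables monitoring, a second access to the same line proceeds without a fault, which is the behavior both the corruption and the leak detection cases rely on.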
2.2.2 Design Issues
Data Scrambling. Since most commercial ECC memory
controllers do not allow software to directly modify an ECC
code, we use a special trick to “scramble” the ECC code
of a watched ECC-group. When WatchMemory is called,
SafeMem first disables the ECC functionality, and writes
the scrambled data into this ECC-group. It then flushes the
data from cache into memory. Since ECC is disabled, the
ECC code for this line remains the same, i.e., the old code.
Finally, SafeMem re-enables ECC. Figure 2 illustrates this
process. During the disable-enable period, we lock
the memory bus to avoid any other background memory ac-
cesses, such as those made by other processors or DMAs, so
that other memory locations are not affected by this Watch-
Memory operation. After this operation, the first access to
this location triggers an ECC fault because of the mismatch
between the old ECC code and the scrambled data.
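The disable, write, enable sequence can be modeled with a toy simulation of a single ECC group. This is a sketch under stated assumptions: the check code is a simple XOR of the data bytes standing in for a real SEC-DED code, the scramble mask is arbitrary (the scrambling pattern is not specified in this section), and the cache flush and bus locking steps are omitted. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One simulated ECC group: 64 data bits plus an 8-bit check code. */
struct ecc_group { uint64_t data; uint8_t code; };

static int ecc_enabled = 1;   /* models the controller's ECC on/off switch */

/* Toy check code: XOR of the eight data bytes. Real controllers use a
 * SEC-DED code; this stand-in only detects a mismatch. */
static uint8_t toy_code(uint64_t d)
{
    uint8_t c = 0;
    for (int i = 0; i < 8; i++)
        c ^= (uint8_t)(d >> (8 * i));
    return c;
}

/* Write path: the code is regenerated only while ECC is enabled. */
static void mem_write(struct ecc_group *g, uint64_t d)
{
    g->data = d;
    if (ecc_enabled)
        g->code = toy_code(d);
}

/* Read path: returns 1 (ECC fault) if ECC is enabled and the stored code
 * does not match the stored data, otherwise 0. */
static int mem_read(const struct ecc_group *g, uint64_t *out)
{
    *out = g->data;
    return ecc_enabled && toy_code(g->data) != g->code;
}

/* Arbitrary mask that flips bits in the low byte only, so the toy code
 * of the scrambled data always differs from the old code. */
#define SCRAMBLE_MASK 0x5AULL

/* Model of the WatchMemory trick for one group (no flush, no bus lock). */
static void sim_watch_group(struct ecc_group *g)
{
    ecc_enabled = 0;                        /* 1. disable ECC             */
    mem_write(g, g->data ^ SCRAMBLE_MASK);  /* 2. write scrambled data;   */
                                            /*    the old code is kept    */
    ecc_enabled = 1;                        /* 3. re-enable ECC           */
}
```

In this model, a read before sim_watch_group() reports no fault, and the first read afterwards reports a mismatch between the old code and the scrambled data.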
Proceedings of the 11th Int’l Symposium on High-Performance Computer Architecture (HPCA-11 2005)
1530-0897/05 $20.00 © 2005 IEEE