EXPLODE: A Lightweight System for Detecting Storage System Errors


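eXplode (EXPLODE) is a lightweight, general system for detecting serious storage system errors, developed by Junfeng Yang, Can Sar, and Dawson Engler at the Stanford University Computer Systems Laboratory. Storage systems such as file systems, databases, and RAID carry one central expectation: data that users hand to them must be stored safely and must not be lost or corrupted. Because these systems often hold the only copy of the data, losing it can be catastrophic. Honoring this contract is hard. Storage system code must handle a crash correctly at any program point, however the data happens to be distributed across volatile and persistent storage at that moment, which makes the code extremely difficult to write. EXPLODE's contribution is to adapt model checking, a comprehensive and usually heavyweight formal verification technique, to practical use. Driven by user-written checkers that can be tailored to a specific system, EXPLODE steers a storage system into tricky corner cases, including but not limited to errors in crash recovery. The approach is more systematic and more efficient than pure testing while remaining lightweight. Its structured checking process can uncover errors that rarely show up in everyday operation, helping developers fix bugs early and reducing the risk of data loss.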

 1: const char *dir = "/mnt/sbd0/test-dir";
 2: const char *file = "/mnt/sbd0/test-file";
 3: static void do_fsync(const char *fn) {
 4:   int fd = open(fn, O_RDONLY);
 5:   fsync(fd);
 6:   close(fd);
 7: }
 8: void FsChecker::mutate(void) {
 9:   switch(choose(4)) {
10:   case 0: systemf("mkdir %s%d", dir, choose(5)); break;
11:   case 1: systemf("rmdir %s%d", dir, choose(5)); break;
12:   case 2: systemf("rm %s", file); break;
13:   case 3: systemf("echo \"test\" > %s", file);
14:     if(choose(2) == 0)
15:       sync();
16:     else {
17:       do_fsync(file);
18:       // fsync parent to commit the new directory entry
19:       do_fsync("/mnt/sbd0");
20:     }
21:     check_crash_now(); // invokes check() for each crash
22:     break;
23:   }
24: }
25: void FsChecker::check(void) {
26:   ifstream in(file);
27:   if(!in)
28:     error("fs", "file gone!");
29:   char buf[1024];
30:   in.read(buf, sizeof buf);
31:   in.close();
32:   if(strncmp(buf, "test", 4) != 0)
33:     error("fs", "wrong file contents!");
34: }

Figure 2: Example file system checker. We omit the class initialization code and some sanity checks.

Checkers range from aggressively system-specific (or even code-version specific) to the fairly generic. Their size scales with the complexity of the invariants checked, from a few tens to many thousands of lines.

Figure 2 shows a file system checker that checks a simple correctness property: a file that has been synchronously written to disk (using either the fsync or sync system calls) should persist after a crash. Mail servers, databases and other application storage systems depend on this behavior to prevent crash-caused data obliteration. While simple, the checker illustrates common features of many checkers, including the fact that it catches some interesting bugs.
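Seen from the application's side, the property amounts to a sequence like the one below. This is a minimal sketch assuming a POSIX system: it writes a file, fsyncs it, and then fsyncs the parent directory so that the new directory entry is committed too, mirroring lines 17 to 19 of Figure 2. The helper name `durable_write` is hypothetical and is not part of EXPLODE.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>

// Hypothetical helper: writes `data` to `dir`/`name` and returns true only if
// both the file and its parent directory were fsync'd successfully.
inline bool durable_write(const std::string& dir, const std::string& name,
                          const std::string& data) {
    const std::string path = dir + "/" + name;
    int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    // write() may be partial, so loop until all bytes reach the kernel.
    std::size_t off = 0;
    bool ok = true;
    while (ok && off < data.size()) {
        ssize_t n = write(fd, data.data() + off, data.size() - off);
        if (n < 0) ok = false;
        else off += static_cast<std::size_t>(n);
    }
    ok = ok && fsync(fd) == 0;     // commit file data and metadata
    ok = (close(fd) == 0) && ok;
    if (!ok) return false;

    // fsync the parent so the directory entry itself survives a crash.
    int dfd = open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return false;
    ok = fsync(dfd) == 0;
    ok = (close(dfd) == 0) && ok;
    return ok;
}
```

A caller that gets `true` back is relying on exactly the guarantee the checker tests: after a crash, the file should still exist with the written contents.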

The mutate method calls choose(4) (line 9) to fork and do each of four possible actions: (1) create a directory, (2) delete it, (3) create a test file, or (4) delete it. The first two actions then call choose(5) and create or delete one of five directories (the directory name is based on choose's return value). The file creation action calls choose(2) (line 14) and forces the test file to disk using sync in one child and fsync in the other. As Figure 3 shows, one mutate call creates thirteen children.

Figure 3: Choices made by one invocation of the mutate method in Figure 2's checker. It creates thirteen children.
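The count of thirteen can be reproduced with a small enumerator. The sketch below is not EXPLODE's implementation: instead of forking, it re-runs the function once per sequence of choices, and the names `ChoiceExplorer`, `count_paths`, and `mutate_model` are hypothetical. `mutate_model` copies only the branching structure of Figure 2's mutate and performs no file system operations.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical replay-based stand-in for choose(): the explored function is
// re-executed once for every distinct sequence of choices.
class ChoiceExplorer {
public:
    // Returns the next choice in [0, n). Replays the recorded prefix first,
    // then extends the current path with choice 0.
    int choose(int n) {
        if (pos_ < path_.size()) return path_[pos_++].value;
        path_.push_back({0, n});
        ++pos_;
        return 0;
    }

    // Runs `body` once per leaf of the choice tree; returns the leaf count.
    std::size_t count_paths(const std::function<void(ChoiceExplorer&)>& body) {
        path_.clear();
        std::size_t leaves = 0;
        do {
            pos_ = 0;
            body(*this);
            ++leaves;
        } while (advance());
        return leaves;
    }

private:
    struct Choice { int value; int limit; };

    // Moves to the next unexplored path: drop exhausted trailing choices,
    // then bump the deepest choice that still has alternatives left.
    bool advance() {
        while (!path_.empty() && path_.back().value + 1 == path_.back().limit)
            path_.pop_back();
        if (path_.empty()) return false;
        ++path_.back().value;
        return true;
    }

    std::vector<Choice> path_;
    std::size_t pos_ = 0;
};

// Branching structure of Figure 2's mutate(), with the side effects removed.
inline void mutate_model(ChoiceExplorer& ex) {
    switch (ex.choose(4)) {
    case 0: ex.choose(5); break;  // mkdir on one of five directories
    case 1: ex.choose(5); break;  // rmdir on one of five directories
    case 2: break;                // rm the test file
    case 3: ex.choose(2); break;  // create the file, then sync or fsync
    }
}
```

Calling `ChoiceExplorer{}.count_paths(mutate_model)` visits 5 + 5 + 1 + 2 = 13 leaves, matching the thirteen children in Figure 3.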

The checker calls EXPLODE to check crashes. While other code in the system can also initiate such checking, typically it is the mutate method's responsibility: it issues operations that change the storage system, so it knows the correct system state and when this state changes. In our example, after mutate forces the file to disk it calls the EXPLODE routine check_crash_now(). EXPLODE then generates all crash disks at the exact moment of the call and invokes the check method on each after repairing and mounting it using the underlying storage component (see §3.3). The check method checks if the test file exists (line 27) and has the right contents (line 32). While simple, this exact checker catches an interesting bug in JFS where upon crash, an fsync'd file loses all its contents, triggered by the corner-case reuse of a directory inode as a file inode (§7.3 discusses a more sophisticated version of this checker).
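One way to picture "all crash disks" is as every state the disk could be in if some subset of the still-buffered writes had reached it before the crash. The sketch below enumerates those states for a toy block disk and runs a check on each. It illustrates that assumption only; how EXPLODE actually builds crash disks is not described in this section, and `Disk`, `Write`, and `for_each_crash_disk` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Disk = std::map<int, std::string>;  // block number -> block contents

struct Write {
    int block;
    std::string data;
};

// Applies every subset of `pending` (in issue order) to a copy of `base` and
// calls `check` on the result. Returns how many crash disks failed the check.
// The enumeration is exponential; `pending` must hold fewer than 64 writes.
inline std::size_t for_each_crash_disk(
        const Disk& base,
        const std::vector<Write>& pending,
        const std::function<bool(const Disk&)>& check) {
    std::size_t failures = 0;
    const std::uint64_t subsets = std::uint64_t{1} << pending.size();
    for (std::uint64_t mask = 0; mask < subsets; ++mask) {
        Disk crash = base;
        for (std::size_t i = 0; i < pending.size(); ++i)
            if (mask & (std::uint64_t{1} << i))
                crash[pending[i].block] = pending[i].data;
        if (!check(crash)) ++failures;
    }
    return failures;
}
```

In this model, a check that passes on every generated disk corresponds to a storage system that has truly committed the data, since no subset of the remaining writes can undo it.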

So far we have described how a single mutate call works. The next section shows how it fits in the checking process. In addition, checking crashes at only a single code point is crude; Section 6 describes the routines EXPLODE provides for more comprehensive checking.

3.3 Setting up checked code: Storage components

Since EXPLODE checks live storage systems, these systems must be up and running. For each storage subsystem involved in checking, clients provide a storage component that implements five methods:

1. init: one-time initialization, such as formatting a file system partition or creating a fresh database.

2. mount: set up the storage system so that operations can be performed on it.

3. unmount: tear down the storage system; used by EXPLODE to clear the storage system's state so it can explore a different one (§5.2).

4. recover: repair the storage system after an EXPLODE-simulated crash.

5. threads: return the thread IDs for the storage system's kernel threads. EXPLODE reduces non-determinism by only running these threads when it wants to (§5.2).
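The five methods map naturally onto an abstract class. The sketch below is an assumption about the shape of such an interface, not EXPLODE's actual declaration: the class names, the signatures, and the `check_crash_disk` helper are hypothetical. The helper follows the order given earlier (repair, mount, then check); the unmount at the end is an added assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical interface mirroring the five methods listed above.
class StorageComponent {
public:
    virtual ~StorageComponent() = default;
    virtual void init() = 0;                 // one-time setup, e.g. format
    virtual void mount() = 0;                // make the system usable
    virtual void unmount() = 0;              // tear down, clearing its state
    virtual void recover() = 0;              // repair after a simulated crash
    virtual std::vector<int> threads() = 0;  // kernel thread IDs, if any
};

// Toy component that only records which methods were called, in order.
class LoggingComponent : public StorageComponent {
public:
    void init() override { log_ += "init;"; mounted_ = false; }
    void mount() override { log_ += "mount;"; mounted_ = true; }
    void unmount() override { log_ += "unmount;"; mounted_ = false; }
    void recover() override { log_ += "recover;"; }
    std::vector<int> threads() override { return {}; }

    const std::string& log() const { return log_; }
    bool mounted() const { return mounted_; }

private:
    std::string log_;
    bool mounted_ = false;
};

// Repairs and mounts a crash disk, runs the check, then unmounts so that a
// different state can be explored next.
template <typename Check>
bool check_crash_disk(StorageComponent& sc, Check check) {
    sc.recover();
    sc.mount();
    const bool ok = check();
    sc.unmount();
    return ok;
}
```

A real component would shell out to tools such as a file system's format and repair utilities inside init and recover; the logging version exists only to make the call order visible.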
