Coping with Complex Failures: IRON File Systems and the Fail-Partial Failure Model
"IRON File Systems" (iron-sosp05) is a computer-science research paper that examines the complex failure modes of modern disks and how to build more robust file systems to cope with them. The abstract observes that commodity file systems have traditionally assumed a disk either works or fails completely, whereas modern disks in fact exhibit more complex partial failures, such as latent sector errors and block corruption. The authors propose a fail-partial failure model that accounts for these localized failures common in the real world. To study how current commodity file systems handle such realistic disk faults, they develop and apply a novel failure-policy fingerprinting framework, which probes a file system's reaction to disk failures under a range of realistic scenarios. They classify the resulting failure policies under a new taxonomy, Internal RObustNess (IRON), which covers both failure-detection and recovery techniques. Through this classification they find that the failure policies of existing commodity file systems are often inconsistent, sometimes buggy, and generally inadequate in their recovery capabilities. In response, the authors design, implement, and evaluate a prototype IRON file system aimed at greater fault tolerance and data integrity. The paper's contributions are a new model of disk failure, an exposure of the limitations of existing file systems in handling partial failures, and proposed improvements. For system developers and storage researchers, this work may well shape future file-system designs to better accommodate the complex failure behavior of modern disks.
Electrical: A power spike or surge can damage in-drive circuits
and hence lead to drive failure [68]. Thus, electrical problems can
lead to entire disk failure.
Drive firmware: Interesting errors arise in the drive controller,
which consists of many thousands of lines of real-time, concurrent
firmware. For example, disks have been known to return correct
data but circularly shifted by a byte [37] or have memory leaks
that lead to intermittent failures [68]. Other firmware problems
can lead to poor drive performance [54]. Some firmware bugs are
well-enough known in the field that they have specific names; for
example, “misdirected” writes are writes that place the correct data
on the disk but in the wrong location, and “phantom” writes are
writes that the drive reports as completed but that never reach the
media [73]. Phantom writes can be caused by a buggy or even mis-
configured cache (i.e., write-back caching is enabled). In summary,
drive firmware errors often lead to sticky or transient block corrup-
tion but can also lead to performance problems.
Transport: The transport connecting the drive and host can also be
problematic. For example, a study of a large disk farm [67] reveals
that most of the systems tested had interconnect problems, such
as bus timeouts. Parity errors also occurred with some frequency,
either causing requests to succeed (slowly) or fail altogether. Thus,
the transport often causes transient errors for the entire drive.
Bus controller: The main bus controller can also be problematic.
For example, the EIDE controller on a particular series of moth-
erboards incorrectly indicates completion of a disk request before
the data has reached the main memory of the host, leading to data
corruption [72]. A similar problem causes some other controllers to
return status bits as data if the floppy drive is in use at the same time
as the hard drive [26]. Others have also observed IDE protocol ver-
sion problems that yield corrupt data [23]. In summary, controller
problems can lead to transient block failure and data corruption.
Low-level drivers: Recent research has shown that device driver
code is more likely to contain bugs than the rest of the operating
system [15, 22, 66]. While some of these bugs will likely crash the
operating system, others can issue disk requests with bad parame-
ters, data, or both, resulting in data corruption.
2.3 The Fail-Partial Failure Model
From our discussion of the many root causes for failure, we are
now ready to put forth a more realistic model of disk failure. In our
model, failures manifest themselves in three ways:
• Entire disk failure: The entire disk is no longer accessible. If
permanent, this is the classic “fail-stop” failure.
• Block failure: One or more blocks are not accessible; often re-
ferred to as “latent sector errors” [33, 34].
• Block corruption: The data within individual blocks is altered.
Corruption is particularly insidious because it is silent – the storage
subsystem simply returns “bad” data upon a read.
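The three failure manifestations above can be sketched in code; the checksum check below also illustrates why block corruption is the insidious case: without an integrity check, the read "succeeds" and the bad data propagates. This is a minimal illustrative sketch (all names are hypothetical, not from the paper):

```python
import hashlib
from enum import Enum, auto

class DiskFailure(Enum):
    ENTIRE_DISK = auto()       # classic fail-stop: nothing is accessible
    BLOCK_FAILURE = auto()     # latent sector error: the request returns an error
    BLOCK_CORRUPTION = auto()  # silent: "bad" data is returned without any error

def checksum(block: bytes) -> bytes:
    """Per-block checksum stored out-of-band alongside the data."""
    return hashlib.sha256(block).digest()

def verified_read(block: bytes, stored_sum: bytes):
    """Return the block if its checksum matches; otherwise surface
    the otherwise-silent corruption as an explicit failure."""
    if checksum(block) != stored_sum:
        return None, DiskFailure.BLOCK_CORRUPTION
    return block, None

good = b"file data"
s = checksum(good)
assert verified_read(good, s) == (good, None)          # clean read passes
assert verified_read(b"bitrot!!!", s)[1] is DiskFailure.BLOCK_CORRUPTION
```

The key point the sketch makes: block failure announces itself via an error code, but corruption must be actively detected, e.g. by a checksum kept separately from the block.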
We term this model the Fail-Partial Failure Model, to empha-
size that pieces of the storage subsystem can fail. We now discuss
some other key elements of the fail-partial model, including the
transience, locality, and frequency of failures, and then discuss how
technology and market trends will impact disk failures over time.
2.3.1 Transience of Failures
In our model, failures can be “sticky” (permanent) or “transient”
(temporary). Which behavior manifests itself depends upon the
root cause of the problem. For example, a low-level media problem
portends the failure of subsequent requests. In contrast, a transport
or higher-level software issue might at first cause block failure or
corruption; however, the operation could succeed if retried.
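The transient/sticky distinction directly motivates a bounded retry loop: a retry recovers a transport glitch but merely wastes time on a media failure. A minimal sketch, with all names hypothetical:

```python
def read_with_retry(read_fn, retries=3):
    """Retry a failing read a bounded number of times.
    A transient fault (e.g. a bus timeout) may succeed on retry;
    a sticky fault (e.g. damaged media) fails every attempt, and
    the final error is surfaced to the caller."""
    last_err = None
    for _ in range(retries):
        try:
            return read_fn()
        except IOError as e:
            last_err = e      # possibly transient: try again
    raise last_err            # still failing: treat as sticky

# A simulated transient fault: the first two attempts hit a
# transport problem, the third succeeds.
attempts = {"n": 0}
def flaky_read():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("bus timeout")
    return b"sector data"

assert read_with_retry(flaky_read) == b"sector data"
```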
2.3.2 Locality of Failures
Because multiple blocks of a disk can fail, one must consider
whether such block failures are dependent. The root causes of
block failure suggest that some forms of block failure do indeed
exhibit spatial locality [34]. For example, a scratched surface can
render a number of contiguous blocks inaccessible. However, all
failures do not exhibit locality; for example, a corruption due to a
misdirected write may impact only a single block.
2.3.3 Frequency of Failures
Block failures and corruptions do occur – as one commercial
storage system developer succinctly stated, “Disks break a lot – all
guarantees are fiction” [29]. However, one must also consider how
frequently such errors occur, particularly when modeling overall re-
liability and deciding which failures are most important to handle.
Unfortunately, as Talagala and Patterson point out [67], disk drive
manufacturers are loath to provide information on disk failures;
indeed, people within the industry refer to an implicit industry-wide
agreement to not publicize such details [4]. Not surprisingly, the
actual frequency of drive errors, especially errors that do not cause
the whole disk to fail, is not well-known in the literature. Previous
work on latent sector errors indicates that such errors occur more
commonly than absolute disk failure [34], and more recent research
estimates that such errors may occur five times more often than ab-
solute disk failures [57].
In terms of relative frequency, block failures are more likely to
occur on reads than writes, due to internal error handling common
in most disk drives. For example, failed writes to a given sector
are often remapped to another (distant) sector, allowing the drive
to transparently handle such problems [31]. However, remapping
does not imply that writes cannot fail. A failure in a component
above the media (e.g., a stuttering transport), can lead to an unsuc-
cessful write attempt; the move to network-attached storage [24]
serves to increase the frequency of this class of failures. Also, for
remapping to succeed, free blocks must be available; a large scratch
could render many blocks unwritable and quickly use up reserved
space. Reads are more problematic: if the media is unreadable, the
drive has no choice but to return an error.
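The remapping behavior described above, including the limit imposed by a finite spare pool, can be sketched as follows (an illustrative model, not the paper's implementation; all names are hypothetical):

```python
class Drive:
    """Toy model of in-drive write remapping: a write to a damaged
    sector is transparently redirected to a reserved spare sector;
    once the spare pool is exhausted, further failed writes become
    visible to the host."""

    def __init__(self, bad_sectors, spares):
        self.bad = set(bad_sectors)   # sectors with damaged media
        self.spares = list(spares)    # reserved replacement sectors
        self.remap = {}               # logical sector -> spare sector
        self.data = {}

    def write(self, sector, block):
        if sector in self.bad and sector not in self.remap:
            if not self.spares:
                return False          # spare pool exhausted: write fails
            self.remap[sector] = self.spares.pop()
        self.data[self.remap.get(sector, sector)] = block
        return True

# One spare, two bad sectors: the first failed write is hidden by
# remapping, the second surfaces as an error.
d = Drive(bad_sectors={7, 8}, spares=[100])
assert d.write(7, b"a")       # remapped transparently
assert not d.write(8, b"b")   # no spares left: failure is visible
```

The sketch also shows why a large scratch is dangerous: each newly damaged sector consumes a spare, so a run of contiguous failures can exhaust the pool and turn previously invisible write failures into visible ones.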
2.3.4 Trends
In many other areas (e.g., processor performance), technology
and market trends combine to improve different aspects of com-
puter systems. In contrast, we believe that technology trends and
market forces may combine to make storage system failures occur
more frequently over time, for the following three reasons.
First, reliability is a greater challenge when drives are made in-
creasingly more dense; as more bits are packed into smaller spaces,
drive logic (and hence complexity) increases [5].
Second, at the low-end of the drive market, cost-per-byte domi-
nates, and hence many corners are cut to save pennies in IDE/ATA
drives [5]. Low-cost “PC class” drives tend to be tested less and
have less internal machinery to prevent failures from occurring [31].
The result, in the field, is that ATA drives are observably less reli-
able [67]; however, cost pressures serve to increase their usage,
even in server environments [23].
Finally, the amount of software is increasing in storage systems
and, as others have noted, software is often the root cause of er-
rors [25]. In the storage system, hundreds of thousands of lines of
software are present in the lower-level drivers and firmware. This
low-level code is generally the type of code that is difficult to write
and debug [22, 66] – hence a likely source of increased errors in
the storage stack.