Combined Redundancy (CoRed): Eliminating Single Points of Failure in Software-Based Redundancy Design

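"Eliminating Single Points of Failure in Software-Based Redundancy Design"

In the domain of safety-critical embedded and cyber-physical systems, software-based redundancy is generally regarded as an effective and economical way to improve reliability. Triple modular redundancy (TMR) in particular is a well-known solution that strengthens system robustness through redundant execution. However, although TMR improves reliability in many respects, it does not completely eliminate single points of failure (SPOFs). These unprotected SPOFs, such as the voter, must be handled carefully in every safety consideration.

The paper is by Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Wolfgang Schröder-Preikschat, and Reiner Schmid, who are affiliated with the Chair of Distributed Systems and Operating Systems at Friedrich-Alexander University Erlangen-Nuremberg, Germany, and with the system architecture and platforms department at Siemens, respectively. They propose a holistic approach named "Combined Redundancy" (CoRed), which hardens the safety-critical parts of a system against soft errors while effectively eliminating the vulnerabilities caused by SPOFs.

CoRed does not focus on redundant execution alone. It takes a holistic approach that increases fault tolerance by protecting system components at several levels and from several angles. This approach may combine hardware and software measures so that the system keeps operating correctly even when one or more components fail. For example, using techniques such as error-correcting codes, dynamic monitoring, and fault prediction, CoRed can detect potential faults and take corrective action before they lead to serious problems.

Single points of failure are a key concern in system design because they can bring down the entire system. The core goal of CoRed is to identify and resolve these potential SPOFs, particularly at the software level, where faults are often harder to predict and prevent. By considering these fault sources at design time, CoRed lets the system recover on its own more effectively and reduces the need for external intervention. In addition, CoRed may involve the design of the interaction between redundant components in order to avoid relying on a single decision mechanism (such as the voter in TMR). For example, several different comparison and decision strategies can be used to lower the risk that the failure of one component disables the entire redundant system.

CoRed offers an innovative way to overcome the limitations of traditional redundancy designs, especially the SPOF problem in TMR. It stresses the importance of considering reliability and safety holistically in system design, which matters for developing safer and more robust cyber-physical systems. With this combined approach, engineers can build systems that depend less on any single component and thereby improve overall reliability and safety.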


B. Holistic Protection Approach

Before going into detail, we first briefly discuss the overall approach (Figure 1). For each part of the processing chain, CoRed uses tailored measures for ensuring reliability. The basic sphere of replication (SOR) is implemented by TMR, as used for the sensor data acquisition and the computation in this example. In addition, CoRed employs data-flow encoding (EAN) to extend the SOR beyond the TMR boundaries: inputs and outputs are encoded and decoded, respectively, within the replicas' protection domain, subsequently ensuring the data integrity.

Still, the voting, inevitable in TMR systems, tears gaps in the SOR. CoRed's Encoded (Exact) Voter can determine a quorum on encoded results. However, data-flow encoding alone is insufficient, as it leaves the control flow unprotected. To tackle this issue, CoRed additionally introduces control-flow monitoring (CFM).

Finally, the voter passes its decision to the output, where it is sent to the actuator. A convenient side effect is that the data can remain encoded, extending the sphere of replication even further. For instance, the encoded values can be transmitted to a distributed actuator ECU, or the outputs can be seamlessly connected to the inputs of another CoRed block. In this way, even complex applications and systems can be composed.

The tolerance-based voting at the input side represents an exception. To avoid the performance penalties of the encoded operations, it consists of two parts. The Pre-Stages, which reside within the replicas, mutually determine the input distances and variants based on a tolerance range; that is, they compute the costly part. Subsequently, the Encoded Tolerance Voter determines a quorum among the encoded variants as usual.
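
The idea of a tolerance range can be illustrated with a minimal, unencoded sketch. It does not reproduce CoRed's split into Pre-Stages and an Encoded Tolerance Voter; the function name `tolerance_vote`, the majority rule, and the choice of the median supporter as the result are assumptions made only for illustration.

```python
from typing import Optional, Sequence


def tolerance_vote(readings: Sequence[float], tolerance: float) -> Optional[float]:
    """Return a reading supported by a majority within `tolerance`, else None.

    Hypothetical helper: a reading "supports" a candidate if their distance
    does not exceed the tolerance range.
    """
    n = len(readings)
    for candidate in sorted(readings):
        # Distance computation: this is the costly part of the vote.
        supporters = sorted(r for r in readings if abs(r - candidate) <= tolerance)
        if 2 * len(supporters) > n:
            # A majority agrees within the tolerance; report its median member.
            return supporters[len(supporters) // 2]
    return None  # no quorum


print(tolerance_vote([10.0, 10.2, 55.0], 0.5))  # the outlier 55.0 is masked
print(tolerance_vote([1.0, 5.0, 9.0], 0.5))     # no two readings agree
```

Sensor replicas rarely deliver bit-identical values, which is why an exact comparison is not sufficient at the input side.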

The remainder of this section details the techniques employed by CoRed step by step.

C. Basic Protection

Applying the CoRed approach should not require in-depth knowledge of the application to be safeguarded or the underlying system platform (runtime environment and hardware). We therefore employ the well-known and proven concept of TMR [ ] as the basis of the CoRed approach, as it efficiently detects and masks transient faults of replicated instances. Here, TMR is especially suited, as it can be easily applied and does not require further knowledge of the safety-critical application itself.

The processing is triplicated in terms of its state and, optionally, its code, and mapped to the replica tasks, which reside in dedicated protection domains of the runtime environment. The redundant execution thereby spans the initial sphere of replication.

One of the advantages of implementing the replication on the coarse-grained software component level is that it decreases the bandwidth required for output comparison and input replication. That in turn potentially simplifies the voting and replication logic [15].
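
As a minimal sketch of the basic TMR scheme, the following runs a task three times and masks a single deviating result by exact majority voting. The names `majority_vote` and `run_tmr` are hypothetical, and the replicas here are plain function calls rather than tasks in separate protection domains. Note that this plain voter is itself unprotected, which is the kind of gap the following sections address.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by more than half of the replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if 2 * count > len(results) else None


def run_tmr(task: Callable[[int], int], x: int) -> Optional[int]:
    """Execute `task` three times on the same input and vote on the results."""
    results = [task(x) for _ in range(3)]
    return majority_vote(results)


print(majority_vote([42, 42, 43]))  # one faulty replica is masked
print(majority_vote([1, 2, 3]))     # no majority: the fault is only detected
```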

D. Eliminating input and output vulnerabilities

The basic TMR approach protects only the replica execution itself, while the propagation of data across the SOR boundaries and the voting procedure are still susceptible to transient faults. The corruption of output data within the voting procedure or on the transmission level to the actuator elements can still lead to a silent data corruption. Even worse, corrupted input data will lead to a silent data corruption in every case, as the replicas will work with flawed data and produce apparently correct results. Data crossing the boundaries has to be protected to prevent the formation of single points of failure.

To overcome this weakness and extend the protection across the SOR boundaries, we combined the basic TMR approach with an arithmetic encoding of the data propagation, thereby giving the name Combined Redundancy (CoRed).

To be more precise, we use an extension of an AN-Code that is based on the VCP design presented by Forin et al. [ ], specifically tailored to our purposes. It uses a combination of per-value signatures and a timestamp to detect data and sequence faults.

To get a feel for this EAN, we exemplify the basics in the following. An arithmetic code can detect data manipulation and, at the same time, preserve arithmetic operations on encoded data. The result of an encoded arithmetic operation applied to encoded operands is again valid encoded data.

The basic AN-Code is the simplest form of an arithmetic code, formed by multiplying the operands by a constant $A$:

$X_c = X \cdot A$ (1)

A division by $A$ can then restore the original value of AN-encoded data. If the remainder of the division does not equal zero, the value is an invalid code word, which exposes a data corruption. The multiplication factor $A$ has to be chosen carefully to minimize the residual error probability and achieve an adequate Hamming distance. Most AN-Code implementations therefore suggest a large prime number [ ].
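
The properties of a bare AN-Code can be tried out with a small sketch. The constant `A = 65521` (a prime) and the function names are assumptions chosen for illustration, not values from the paper, and Python's unbounded integers hide the word-width limits a real embedded implementation has to respect.

```python
A = 65521  # assumed prime constant, chosen only for this illustration


def an_encode(x: int) -> int:
    """Encode a value by multiplying it with the constant A."""
    return x * A


def an_is_valid(code: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code % A == 0


def an_decode(code: int) -> int:
    """Restore the original value; reject invalid code words."""
    if not an_is_valid(code):
        raise ValueError("invalid code word: data corruption detected")
    return code // A


x_c, y_c = an_encode(7), an_encode(5)
print(an_decode(x_c + y_c))          # addition works directly on encoded data
print(an_is_valid(x_c ^ (1 << 3)))   # a single flipped bit yields an invalid word
```

Because an odd constant never divides a power of two, any single bit flip changes the remainder and is therefore detected by the validity check.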

A bare AN-Code can efficiently detect bit manipulations of encoded values. However, it cannot safely indicate addressing errors (erroneously pointing to another valid code word), nor can it reveal outdated or out-of-sequence data, as it is not aware of periods.

Therefore, the Extended AN Code used in the CoRed approach features a unique signature $B_X$ per value to detect addressing errors and, in addition, a timestamp $D$ to reveal outdated data:

$X_c = X \cdot A + B_X + D$ (2)

As the dynamic timestamp $D$, a cycle counter with the range $0..D_{max}$ can be used. The constant value of the signature $B_X$ can then be chosen arbitrarily under the constraint $B_X + D_{max} < A$. Furthermore, the minimum distance between two signatures has to be greater than $D_{max}$.
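
A minimal sketch of such an extended code follows, assuming that a code word is checked by comparing its remainder modulo A against the expected signature plus timestamp, which requires the signature plus the largest timestamp to stay below A. All constants (`A`, `D_MAX`, `B_X`, `B_Y`) and function names are hypothetical and chosen only for illustration.

```python
A = 65521      # assumed prime constant
D_MAX = 255    # assumed upper bound of the cycle counter
B_X = 1000     # assumed signature of value X
B_Y = 2000     # assumed signature of value Y; |B_Y - B_X| > D_MAX


def ean_encode(x: int, signature: int, timestamp: int) -> int:
    """Encode x with a per-value signature and the current cycle counter."""
    if not 0 <= timestamp <= D_MAX:
        raise ValueError("timestamp out of range")
    if not 0 < signature + D_MAX < A:
        raise ValueError("signature violates the assumed range constraint")
    return x * A + signature + timestamp


def ean_check(code: int, signature: int, timestamp: int) -> bool:
    """Valid only if the remainder matches the expected signature and timestamp."""
    return code % A == signature + timestamp


def ean_decode(code: int, signature: int, timestamp: int) -> int:
    """Restore the original value; reject corrupted, misaddressed, or stale data."""
    if not ean_check(code, signature, timestamp):
        raise ValueError("invalid code word")
    return (code - signature - timestamp) // A


x_c = ean_encode(7, B_X, 3)
print(ean_check(x_c, B_X, 3))  # expected value, current cycle
print(ean_check(x_c, B_Y, 3))  # addressing error: wrong signature
print(ean_check(x_c, B_X, 4))  # outdated data: wrong timestamp
```

Keeping the signatures more than `D_MAX` apart ensures that the sum of one signature and any timestamp can never equal the sum of another signature and any timestamp, so a misaddressed value cannot pass as a merely older or newer copy of the expected one.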

Finally, to put EAN into use within arbitrary calculations, all arithmetic operations must be adapted. The result of an operation on the encoded operands $X_c$ and $Y_c$ is an encoded value that also includes its specific signature. Applying the inverse
