B. Holistic Protection Approach
Before going into detail, we briefly discuss the overall approach
(Figure 1). For each part of the processing chain, CoRed uses
tailored measures to ensure reliability. The basic SOR is
implemented by TMR, as used for the sensor data acquisition and the
computation in this example. In addition, CoRed employs data-flow
encoding (EAN) to extend the SOR beyond the TMR boundaries: inputs
and outputs are encoded and decoded, respectively, within the
replicas' protection domain, thereby ensuring data integrity.
Still, the voting, which is inevitable in TMR systems, leaves gaps in
the SOR. CoRed's Encoded (Exact) Voter addresses this by determining
a quorum on the encoded results. However, data-flow encoding alone is
insufficient, as it leaves the control flow unprotected. To tackle
this issue, CoRed additionally introduces control-flow monitoring
(CFM).
Finally, the voter passes its decision to the output, where it is
sent to the actuator. A convenient side effect is that the data can
remain encoded, extending the sphere of replication even further. For
instance, the encoded values can be transmitted to a distributed
actuator ECU, or the outputs can be connected seamlessly to the
inputs of another CoRed block. In this way, even complex applications
and systems can be composed.
The tolerance-based voting at the input side represents an exception.
To avoid the performance penalties of encoded operations, it consists
of two parts. The Pre-Stages, which reside within the replicas,
mutually determine the input distances and variants based on a
tolerance range and hence compute the costly part. Subsequently, the
Encoded Tolerance Voter determines a quorum among the encoded
variants as usual.
The remainder of this section details the techniques employed by
CoRed step by step.
C. Basic Protection
Applying the CoRed approach should not require in-depth knowledge of
the application to be safeguarded or of the underlying system
platform (runtime environment and hardware). We therefore employ the
well-known and proven concept of TMR [11] as the basis of the CoRed
approach, as it efficiently detects and masks transient faults in
replicated instances. TMR is especially well suited here because it
is easy to apply and does not require further knowledge of the
safety-critical application itself.
The processing is triplicated in terms of its state and, optionally,
its code, and is mapped to the replica tasks, which reside in
dedicated protection domains of the runtime environment. The
redundant execution thereby spans the initial sphere of replication.
One advantage of implementing the replication at the coarse-grained
software component level is that it decreases the bandwidth required
for output comparison and input replication. That in turn potentially
simplifies the voting and replication logic [15].
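The masking behavior of the basic TMR scheme can be illustrated with a minimal sketch. It assumes a deterministic component whose outputs can be compared for exact equality; the names `majority_vote`, `run_tmr`, and `control_law` are hypothetical and do not stem from the CoRed implementation. Note that this plain voter is itself unprotected, which is the weakness the following subsection addresses.

```python
from typing import Callable, Optional, Sequence


def majority_vote(results: Sequence[int]) -> Optional[int]:
    """Return the value at least two of three replicas agree on, else None."""
    a, b, c = results
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # no quorum: the fault is detected but cannot be masked


def run_tmr(component: Callable[[int], int], sensor_value: int,
            fault_in: Optional[int] = None) -> Optional[int]:
    """Execute the component three times and vote on the outputs.

    fault_in optionally names one replica (0..2) whose output suffers a
    simulated transient bit flip.
    """
    outputs = []
    for replica in range(3):
        out = component(sensor_value)
        if replica == fault_in:
            out ^= 0x10  # simulated single bit flip in this replica's result
        outputs.append(out)
    return majority_vote(outputs)


def control_law(x: int) -> int:
    """Hypothetical stand-in for the replicated computation."""
    return 2 * x + 1


fault_free = run_tmr(control_law, 20)
masked = run_tmr(control_law, 20, fault_in=1)
```

Because replication happens at the component level, only one output value per replica has to be compared, which keeps the voter small.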
D. Eliminating Input and Output Vulnerabilities
The basic TMR approach protects only the replica execution itself,
while the propagation of data across the SOR boundaries and the
voting procedure are still susceptible to transient faults. The
corruption of output data within the voting procedure or during
transmission to the actuator elements can still lead to silent data
corruption. Even worse, corrupted input data will lead to silent data
corruption in every case, as the replicas operate on flawed data and
produce apparently correct results. Data crossing the boundaries
therefore has to be protected to prevent the formation of single
points of failure.
To overcome this weakness and extend the protection across the SOR
boundaries, we combine the basic TMR approach with an arithmetic
encoding of the data propagation, which gives the approach its name:
Combined Redundancy (CoRed).
To be more precise, we use an extension of an AN-Code, which is based
on the VCP design presented by Forin et al. [12] and specifically
tailored to our purposes. It uses a combination of per-value
signatures and a timestamp to detect data and sequence faults.
To give a feel for this EAN, we illustrate the basics in the
following. An arithmetic code can detect data manipulation and, at
the same time, preserve arithmetic operations on encoded data: the
result of an encoded arithmetic operation applied to encoded operands
is again valid encoded data.
The basic AN-Code is the simplest form of an arithmetic code. It is
formed by multiplying the operand $X$ by a constant $A$, where $X_c$
denotes the encoded value:
$$X_c = X \cdot A \qquad (1)$$
A division by $A$ can then restore the original value of AN-encoded
data. If the remainder of the division does not equal zero, the value
is an invalid code word, which exposes a data corruption. The
multiplication factor $A$ has to be chosen carefully to minimize the
residual error probability and achieve an adequate Hamming distance.
Most AN-Code implementations therefore suggest a large prime number
[16].
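The basic AN-Code can be sketched in a few lines. The constant `A = 65521` is an assumption chosen for illustration only (it is a prime, but not a value recommended by the text), and the function names are hypothetical. The sketch also shows that addition can be carried out directly on code words.

```python
A = 65521  # assumed code constant for illustration (a prime)


def an_encode(x: int) -> int:
    """Equation (1): multiply the plain value by A."""
    return x * A


def an_is_valid(xc: int) -> bool:
    """A code word is valid if it is divisible by A without remainder."""
    return xc % A == 0


def an_decode(xc: int) -> int:
    """Check the code word and restore the plain value by dividing by A."""
    if not an_is_valid(xc):
        raise ValueError("invalid code word: data corruption detected")
    return xc // A


x_c = an_encode(3)
y_c = an_encode(4)
sum_c = x_c + y_c                # addition operates directly on code words
corrupted = sum_c ^ (1 << 5)     # simulated single bit flip
```

Here `sum_c` decodes to 7 without ever leaving the encoded domain, while the bit flip shifts the value by 32, which is not a multiple of `A`, so the check fails.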
A bare AN-Code can efficiently detect bit manipulations of encoded
values. However, it cannot reliably indicate addressing errors, that
is, erroneously pointing to another valid code word, nor can it
reveal outdated or out-of-sequence data, as it is not aware of
execution periods.
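Both limitations can be reproduced with a short sketch. It assumes the same illustrative constant `A = 65521` and a hypothetical memory layout with two encoded variables; none of these names come from the original design.

```python
A = 65521  # assumed code constant for illustration (a prime)


def an_encode(x: int) -> int:
    return x * A


def an_is_valid(xc: int) -> bool:
    return xc % A == 0


# Hypothetical memory holding two AN-encoded variables.
memory = {"speed": an_encode(120), "pressure": an_encode(7)}

# Addressing error: "pressure" is read where "speed" was intended.
misaddressed = memory["pressure"]

# Outdated data: a value of "speed" left over from an earlier period.
outdated = an_encode(100)

# Both are valid code words, so the bare AN-Code check accepts them.
undetected = an_is_valid(misaddressed) and an_is_valid(outdated)
```

The check passes in both cases because every multiple of `A` is a valid code word, regardless of which variable or period it belongs to.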
Therefore, the Extended AN-Code (EAN) used in the CoRed approach
features a unique signature $B_X$ per value to detect addressing
errors and, in addition, a timestamp $D$ to reveal outdated data. The
encoded value $X_c$ is then formed as:
$$X_c = X \cdot A + B_X + D \qquad (2)$$
As the dynamic timestamp $D$, a cycle counter with the range
$0 \ldots D_{max}$ can be used. The constant value of $B_X$ can then
be chosen arbitrarily, subject to the constraint $B_X + D_{max} < A$.
Furthermore, the minimum distance between any two signatures has to
be greater than $D_{max}$.
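The following sketch illustrates Equation (2) together with the two parameter constraints. All concrete numbers (`A`, `D_MAX`, the signatures) and all names are assumptions for illustration; the decoder is a plausible reading of the scheme, in which the expected signature and timestamp are subtracted before the divisibility check, and not the authors' implementation.

```python
A = 65521      # assumed code constant (a prime), not a value from the paper
D_MAX = 255    # assumed upper bound of the cycle counter D

# Hypothetical per-value signatures B_X, spaced more than D_MAX apart.
SIGNATURES = {"speed": 1000, "pressure": 2000}


def parameters_are_valid() -> bool:
    """Check B_X + D_max < A and a signature spacing greater than D_max."""
    sigs = sorted(SIGNATURES.values())
    fits = all(b + D_MAX < A for b in sigs)
    spaced = all(hi - lo > D_MAX for lo, hi in zip(sigs, sigs[1:]))
    return fits and spaced


def ean_encode(x: int, name: str, d: int) -> int:
    """Equation (2): X_c = X * A + B_X + D."""
    if not 0 <= d <= D_MAX:
        raise ValueError("timestamp out of range")
    return x * A + SIGNATURES[name] + d


def ean_decode(xc: int, name: str, d: int) -> int:
    """Remove the expected signature and timestamp, then check against A."""
    payload = xc - SIGNATURES[name] - d
    if payload % A != 0:
        raise ValueError("invalid code word: corrupted, misaddressed, or outdated")
    return payload // A


def is_accepted(xc: int, name: str, d: int) -> bool:
    """Return True if the code word passes the check for this value and cycle."""
    try:
        ean_decode(xc, name, d)
        return True
    except ValueError:
        return False


speed_c = ean_encode(120, "speed", d=5)
```

With these parameters, `speed_c` is accepted only when decoded as "speed" in cycle 5. Because the signatures lie more than `D_MAX` apart, the sum of signature and timestamp of one value can never coincide with that of another, so a misaddressed read is rejected for any pair of timestamps.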
Finally, to put EAN into use within arbitrary calculations, all
arithmetic operations must be adapted. The result of an operation on
the encoded operands $X$ and $Y$ is an encoded value $Z$ that also
includes the specific signature $B_Z$. Applying the inverse