American Institute of Aeronautics and Astronautics
Fault and failure are interdependent, recursive concepts of “cause” and “effect,” as shown in Figure 1. Seen from
one perspective, a fault explains a given failure, but from another, that same fault is seen as the failure that needs an
explanation. For example, in the Columbia tragedy of 2003, the hole in the leading edge of the wing is the failure
that needs explaining, and its cause is a chunk of insulation foam hitting the wing and causing a structural breach.
However, from the perspective of the designers of the External Tank, the foam falling off the External Tank is the
failure to be explained, and its cause was air bubbles in the foam insulation. In turn, the air bubbles in the foam can
be seen as the failure, and flaws in the foam application process seen as the fault that explains it. This process can
continue for quite a long time in a failure investigation, but ultimately the investigation stops and no further causes
are sought. The first causes in these long chains of explanation are the root causes; failure is often the result of the
interactive effects of several root causes. The term “root cause” is also relative, because as far as one group is
concerned, the explanation that satisfies them so that they require no deeper explanation is their root cause. However,
another group may not be satisfied with this. For them, the original group’s root cause is not a cause at all, but a
failure to be explained. When they stop their investigation, their first causes are the root causes. The recursive nature
of these terms helps to explain the major difficulties that many groups have had in defining them, but also explains
their utility.
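The recursion described above can be illustrated with a minimal Python sketch. It is an illustration only, not part of SHM theory as published: the `Event` class, the `explain` and `root_cause` functions, and the `depth` parameter are hypothetical names introduced here, and the chain is simplified to a single cause per event even though the text notes that failures often result from several interacting root causes. The event descriptions are taken from the Columbia example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """One link in a causal chain (hypothetical structure).

    The same event is a failure relative to its cause and a fault
    relative to the event it explains.
    """
    description: str
    cause: Optional["Event"] = None


def explain(failure: Event, depth: int) -> List[str]:
    """Follow the chain from `failure` for at most `depth` causal steps."""
    chain = [failure.description]
    node = failure
    for _ in range(depth):
        if node.cause is None:
            break
        node = node.cause
        chain.append(node.description)
    return chain


def root_cause(failure: Event, depth: int) -> str:
    """Return the last cause reached by an investigation that stops after `depth` steps."""
    return explain(failure, depth)[-1]


# Chain taken from the Columbia example in the text.
process = Event("flaws in the foam application process")
bubbles = Event("air bubbles in the foam insulation", cause=process)
foam = Event("foam falling off the External Tank and striking the wing", cause=bubbles)
hole = Event("hole in the leading edge of the wing", cause=foam)
```

Under these assumptions, `root_cause(hole, 1)` returns the foam event, while `root_cause(foam, 1)` treats that same event as the failure and returns the air bubbles. An investigation that goes deeper, `root_cause(hole, 3)`, stops at the foam application process. The root cause therefore depends on where the investigating group starts and where it chooses to stop.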
F. Human Causation of System Failure
Human causation of the majority of failures is a key axiom of SHM theory. Human faults, whether individual or
social via miscommunication or lack of communication, are the root causes for most failures, other than a relatively
small percentage of failures caused by expected system wear-out or environmental causes. Those who have
compiled aerospace failure databases suggest that the vast majority (80% or more) of failures are ultimately due to
one of two fundamental causes: individual performance failures, and social communicative failures. This should
come as little surprise. As is now well understood through studies in the history and sociology of science and
technology, humans create and operate systems for their own purposes, using individual and social processes.
It is therefore human failings in these areas that lead to faults in design, manufacturing, or operations. We interpret
the Columbia Accident Investigation Board’s finding that NASA’s “culture” is a cause of human spaceflight disasters
as an instance of the principle of human causation of system failure.
The results of human faults differ, depending on when they occur in the system life cycle. Human mistakes in the
design phase generally lead to “design faults” or “common mode failures,” since they lead to faults in all copies of
the system. Mistakes in manufacturing generally lead to faults in single copies of the system. These are typically
called “random part failures,” though the label “random” is usually a cover for our inability to find the human fault
that is almost always the root cause. Manufacturing faults can also lead to failures in all copies of the system, but
in that case the fault lies in the design of the equipment that manufactures the copies, so it is
ultimately a design flaw. Mistakes in operations are generally considered human operational faults, and are often
blamed on the operators. However, because most failures are ultimately due to humans, design, manufacturing, and
operational faults all share this fundamental similarity. For SHM, the implication of human causation is that SHM must
address all failure causes, whether “design faults,” “manufacturing faults,” or “operator faults,” and that the basic
rates of occurrence of these faults are roughly the same because of their common human causation. Too often the SHM
design focuses on random part failure, despite the fact that the other two major failure types are usually just as frequent.
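The life-cycle classification in this section can be summarized as a small Python sketch. The names `Phase` and `classify` and the `affects_all_copies` flag are hypothetical, introduced only for illustration. The scope for operational mistakes is left as `None` because the text does not state one.

```python
from enum import Enum
from typing import Optional, Tuple


class Phase(Enum):
    DESIGN = "design"
    MANUFACTURING = "manufacturing"
    OPERATIONS = "operations"


def classify(phase: Phase, affects_all_copies: bool = False) -> Tuple[str, Optional[str]]:
    """Return (fault label, scope) for a human mistake made in `phase`.

    Scope is "all copies", "single copy", or None where the text gives no rule.
    """
    if phase is Phase.DESIGN:
        return ("design fault", "all copies")
    if phase is Phase.MANUFACTURING:
        if affects_all_copies:
            # The text traces this case to the design of the manufacturing
            # equipment, so it is treated as a design fault.
            return ("design fault", "all copies")
        return ("manufacturing fault", "single copy")
    # Operations: the text names the fault type but not its scope.
    return ("operator fault", None)
```

For example, `classify(Phase.MANUFACTURING)` yields a single-copy manufacturing fault, whereas `classify(Phase.MANUFACTURING, affects_all_copies=True)` is reclassified as a design fault, mirroring the argument about manufacturing equipment above.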
G. System Complexity
Figure 1. Failure / Fault Recursion. Failures and faults are
recursive concepts, and the classification will depend on the
perspective of the observer and whether the analysis is causal or
explanatory.
Downloaded by SHANGHAI JIAO TONG UNIVERSITY on April 8, 2013 | http://arc.aiaa.org | DOI: 10.2514/6.2011-1493