410 IEEE TRANSACTIONS ON RELIABILITY, VOL. 64, NO. 1, MARCH 2015
Effect of Failure Propagation on Cold vs. Hot Standby
Tradeoff in Heterogeneous 1-Out-of-N:G Systems
Gregory Levitin, Senior Member, IEEE, Liudong Xing, Senior Member, IEEE,
Hanoch Ben-Haim, and Yuanshun Dai, Member, IEEE
Abstract—This paper considers 1-out-of-N:G heterogeneous
fault-tolerant systems that are designed with a mix of hot and cold
standby redundancies to achieve the tradeoff between restoration
and operation costs of standby elements. In such systems, the
way in which the elements are distributed between hot and cold
standby groups and the initiation sequence of all the cold standby
elements can greatly affect the system reliability and mission cost.
Therefore, it is important to solve the optimal standby element
distributing and sequencing problem (SE-DSP). A failure that
occurs in a system element can propagate, causing the outage
of other system elements, which complicates the solution of the
SE-DSP. In this paper, we first propose a numerical
method for evaluating the reliability and expected mission cost
of 1-out-of-N:G systems with mixed hot and cold redundancy
types and propagated failures. Two different failure propagation
modes are considered: an element failure causing the outage of all
the system elements, and an element failure causing the outage
of only working or hot standby elements but not cold standby
elements. A genetic algorithm is used as an optimization tool
for solving the formulated SE-DSP, leading to a solution
that can minimize the expected mission cost of the system while
providing a desired level of system reliability. Effects of the
failure propagation probability on the system reliability, expected
mission cost, as well as the optimization results are investigated.
The suggested methodology can facilitate a reliability-cost tradeoff
study of the considered systems, thus assisting in optimal decision
making regarding the system's standby policy. Examples are
provided for illustrating the considered problem as well as the
proposed solution methodology.
Index Terms—Cold standby, failure propagation, hot standby,
mission cost, optimization, standby system.
Manuscript received October 27, 2013; revised May 20, 2014; accepted June
03, 2014. Date of publication September 11, 2014; date of current version
February 27, 2015. This work was supported in part by the National Natural Science
Foundation of China (No. 61170042) and Jiangsu Province development and
reform commission (No. 2013-883). Associate Editor: S. Eryilmaz.
G. Levitin is with the Collaborative Autonomic Computing Laboratory,
School of Computer Science, University of Electronic Science and Technology
of China. He is also with The Israel Electric Corporation, Haifa 31000, Israel
(e-mail: levitin@iec.co.il).
L. Xing is with the University of Massachusetts, Dartmouth, MA 02747 USA
(e-mail: lxing@umassd.edu).
H. Ben-Haim is with The Israel Electric Corporation, Haifa 31000, Israel.
Y. Dai is with the Collaborative Autonomic Computing Laboratory, School of
Computer Science, University of Electronic Science and Technology of China.
Digital Object Identifier 10.1109/TR.2014.2355514
ACRONYMS AND ABBREVIATIONS
cdf     cumulative distribution function
pdf     probability density function
pmf     probability mass function
r.v.    random variable
GA      genetic algorithm
HS      hot standby
CS      cold standby
SE-DSP  standby element distributing and sequencing problem
PF      propagated failure
NOMENCLATURE
number of elements in the system
number of HS elements
index of the element initiated after failures
representing the time-to-failure (or switching off) of element
probability that element fails by itself in time interval after its initiation
probability that the failure of element propagates
probability that HS element fails in time interval after the mission beginning
probability that HS element fails before time interval given that no PF happened before this interval
probability that all HS elements fail before the time interval causing no PF
probability that at least one HS element generates a PF before time interval
probability that a PF originated from HS elements happens in time interval
cost (per time unit) of keeping element in hot standby (or operation) mode
cost (per time unit) of keeping element in cold standby mode
startup cost of cold standby element
startup cost of hot standby element
mission time
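The two failure propagation modes described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the numerical method proposed in the paper. It assumes exponentially distributed times-to-failure, that all HS elements are exposed to failure from the mission beginning, and that CS elements are started one at a time in a fixed sequence once the HS group is exhausted. Mission cost is not modeled. All names (`rates`, `prop_probs`, `hot`, `cold_sequence`, `mission_survives`, `estimate_reliability`) are hypothetical and chosen for illustration only.

```python
import random


def mission_survives(rates, prop_probs, hot, cold_sequence, mission_time, mode, rng):
    """Simulate one mission; return True if some element works until mission_time.

    Assumption: element i has an exponential time-to-failure with rate rates[i].
    mode="all":    a propagated failure (PF) takes out every element.
    mode="active": a PF takes out only operating / hot standby elements;
                   cold standby elements not yet started are unaffected.
    """
    # Phase 1: every hot element is exposed to failure from time 0.
    failures = sorted((rng.expovariate(rates[i]), i) for i in hot)
    t = 0.0  # time at which the hot group is exhausted
    for fail_time, i in failures:
        if fail_time >= mission_time:
            return True  # this hot element outlives the mission
        t = fail_time
        if rng.random() < prop_probs[i]:  # the failure propagates
            if mode == "all":
                return False  # cold standby elements are lost as well
            break  # remaining hot elements go down; cold ones survive

    # Phase 2: cold elements are started one at a time, in the given order.
    for i in cold_sequence:
        t += rng.expovariate(rates[i])  # element i works until it fails
        if t >= mission_time:
            return True
        if mode == "all" and rng.random() < prop_probs[i]:
            return False  # PF from the working element removes the waiting ones
    return False


def estimate_reliability(rates, prop_probs, hot, cold_sequence, mission_time,
                         mode, n_runs=20000, seed=1):
    """Monte Carlo estimate of the probability that the mission succeeds."""
    rng = random.Random(seed)
    ok = sum(
        mission_survives(rates, prop_probs, hot, cold_sequence,
                         mission_time, mode, rng)
        for _ in range(n_runs)
    )
    return ok / n_runs


# Illustrative data: one hot element (0) and two cold elements (1, 2).
rates = [1.0, 1.0, 1.0]
prop_probs = [0.5, 0.5, 0.5]
r_all = estimate_reliability(rates, prop_probs, [0], [1, 2], 1.0, "all")
r_active = estimate_reliability(rates, prop_probs, [0], [1, 2], 1.0, "active")
print(r_all, r_active)
```

For this special case (identical rates, a single hot element) the estimates can be checked against closed forms: about e^{-1}(1 + 0.5 + 0.125) ≈ 0.598 when a PF takes out all elements, and e^{-1}(1 + 1 + 0.5) ≈ 0.920 when cold standby elements are immune. The gap between the two values shows why the propagation mode matters when choosing how elements are split between the hot and cold groups.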
0018-9529 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.