Reliability Analysis for Software Cluster Systems based on Proportional Hazard Model
Chunyan Hou¹, Chen Chen²,*, Jinsong Wang¹, Kai Shi¹
¹ School of Computer and Communication Engineering, Tianjin University of Technology, Tianjin, China
² College of Computer and Control Engineering, Nankai University, Tianjin, China
chunyanhou@163.com, nkchenchen@nankai.edu.cn, {jswang, shikai}@tjut.edu.cn
Abstract—With the widespread adoption of software cluster
systems, their reliability is drawing increasing attention
from both academia and industry. A cluster system is a kind of software
load-sharing system (LSS) whose reliability is significantly
dependent on system software. Therefore, traditional reliability
analysis methods for hardware LSSs are not applicable to
cluster systems. In this paper, we develop a reliability analysis
model for redundant cluster systems consisting of initial servers
and cold standby servers used to replace failed ones. The system
reliability process is modeled with a state-based non-homogeneous
Markov process (NHMP), where each state corresponds to a
non-homogeneous Poisson process (NHPP). The NHPP arrival rate is
expressed using Cox’s proportional hazard model (PHM) in terms of
the cumulative and instantaneous workload of the system software.
In addition to redundant cluster systems without repair, the model
can also be extended to analyze those with restart. The analysis
results can support cluster management and design decisions.
Finally, the evaluation experiments show the potential of our model.
Keywords—cluster system; load-sharing system; cumulative
workload; software reliability; software aging
I. INTRODUCTION
The rapid development of new technologies has led to a large
number of critical commercial applications on the Internet. As
users become dependent on these services, service failure
or interruption can cause great losses for service providers.
Therefore, high availability and high performance have
become increasingly important for satisfying more demanding
quality of service (QoS) requirements. A widely adopted
technique to significantly improve system availability and
performance is clustering [1]. A cluster is a set of servers and
related resources that act like a single system and provide high
availability, load balancing and parallel processing. These
servers are usually identical. If one server fails, another can act
as a backup. Compared to the expensive high availability
systems with proprietary tightly coupled hardware and
software, cluster systems use commercially available
computers networked in a loosely-coupled fashion, and provide
high availability and performance in a cost-effective way.
In the past few decades, computing capacity of cluster
systems has increased dramatically. However, a linear increase
of cluster size results in an exponential failure rate. System
software and applications running on cluster systems are
becoming more and more complex, which makes them prone
to bugs and other software failures. It has been reported that
software faults and failures result in more outages in larger
computer systems than hardware faults [2] and they cause huge
economic losses or risk to human lives. A large percentage of
software failures is due to software aging [3]. Fifty years
ago, the notion of software aging was formally introduced in
[4]. Since then, much theoretical and experimental research has
been conducted to characterize and understand this
important phenomenon. Software aging can be understood as
a continued and growing degradation of the software's
internal state during its operational life. A general
characteristic of this phenomenon is the gradual performance
degradation and/or an increase in failure rate [5].
To counteract software aging, a proactive fault
management technique called software rejuvenation was
proposed. It involves occasionally stopping the running
software, cleaning its internal state and/or its environment and
restarting it. In order to determine the time epochs for
triggering software rejuvenation, analytic models [6, 7],
monitoring system resources followed by statistical analysis [8,
9], or a combination [10] have been proposed. However, most
of these works concentrated on the impact of
rejuvenation on cluster availability, not on system capacity and
performability. They also did not consider the workload and
failure rate variations caused by user behavior patterns. Many
empirical studies of mechanical systems [11] and computer
systems [12] have shown that workload strongly affects
system failure rate. Moreover, software rejuvenation is
a preventive maintenance technique, so it is not
useful for system design decisions.
System design is the foundation of building a software
system. Just as a good beginning is half the battle, good design
helps to shorten the period of software development, reduce the
cost of operating and maintaining a system, and avoid costly rework.
During the design phase, a cluster system is designed to meet not
only the functional requirements of customers, but also the
non-functional ones. Nowadays, non-functional quality
properties such as system lifetime, reliability, performance, and
failure rate have drawn more and more attention from customers.
With regard to software aging, customers
may want to know how long a cluster system would run before
aging. Traditionally, system designers only make rough
estimates of these quality properties based on their
previous experience, which is subjective, inaccurate, and
misleading. Therefore, a reliability model is necessary for
cluster system development to present system reliability-related
properties precisely under various workloads and system design
schemes. The numerical quality indexes obtained
from the model are not only deliverable to customers but also
very meaningful for system designers and maintainers in making
related decisions. In this paper, we aim to contribute such a
reliability model of cluster systems.
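To make the idea of a workload-dependent failure rate concrete, the following is a minimal sketch, not the model developed in this paper. It assumes a single server whose failures follow an NHPP with a Cox-style intensity lambda(t) = baseline * exp(beta_cum * W(t) + beta_inst * w(t)), where w(t) is the instantaneous workload and W(t) its running integral (the cumulative workload). The function names, the constant baseline, and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def phm_intensity(baseline, beta_cum, beta_inst, cum_load, inst_load):
    """Cox-style intensity: baseline hazard scaled by exp of a linear
    combination of cumulative and instantaneous workload (assumed form)."""
    return baseline * math.exp(beta_cum * cum_load + beta_inst * inst_load)


def reliability(t_end, workload, baseline=0.01, beta_cum=0.0,
                beta_inst=0.0, steps=1000):
    """Probability of no failure in [0, t_end] for an NHPP:
    R(t) = exp(-integral_0^t lambda(s) ds), using the trapezoidal rule."""
    dt = t_end / steps
    cum_load = 0.0                      # W(t), integral of the workload so far
    integral = 0.0                      # integral of the intensity so far
    prev_w = workload(0.0)
    prev_rate = phm_intensity(baseline, beta_cum, beta_inst, cum_load, prev_w)
    for i in range(1, steps + 1):
        w = workload(i * dt)
        cum_load += 0.5 * (w + prev_w) * dt
        rate = phm_intensity(baseline, beta_cum, beta_inst, cum_load, w)
        integral += 0.5 * (rate + prev_rate) * dt
        prev_w, prev_rate = w, rate
    return math.exp(-integral)


if __name__ == "__main__":
    constant_load = lambda t: 1.0       # hypothetical workload profile
    print(reliability(100.0, constant_load))                 # no workload effect
    print(reliability(100.0, constant_load, beta_cum=0.01))  # aging via cumulative load
```

With both coefficients set to zero the sketch reduces to a homogeneous Poisson process, so the reliability is simply exp(-baseline * t). A positive `beta_cum` makes the intensity grow as workload accumulates, which is one simple way to express software aging.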
2016 IEEE 40th Annual Computer Software and Applications Conference
0730-3157/16 $31.00 © 2016 IEEE
DOI 10.1109/COMPSAC.2016.177