B. Liu et al. / Future Generation Computer Systems 83 (2018) 1–13 3
variety of pure performance models for cloud services were proposed in the last few years; see [21] and references therein. These models complement ours in capturing IaaS cloud service behaviors. In the following we focus on the literature on cloud availability analysis. In [22], cloud service availability was evaluated from a user-centric point of view, unlike our work, which takes a cloud service provider's point of view.
2.2. Sensitivity analysis
Sensitivity analysis exposes system QoS bottlenecks and provides guidelines for system optimization. It can be divided into nonparametric and parametric sensitivity analysis [23]. The first kind studies output variations caused by modifications in the structure of a model (e.g., the addition or removal of a given component). The second studies output variations due to changes in system parameter values. There are several approaches for performing sensitivity analysis [11]. The following three approaches are used in this paper:
(i) Vary one parameter at a time within its considered range while keeping the others constant, and observe the system measures of interest as the parameter varies. To determine the parameters that cause the greatest impact on system QoS, simulations or numerical analyses must be performed for all parameters over their defined ranges.
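The one-at-a-time approach can be sketched in a few lines of code. This is a minimal illustration, not the paper's procedure: the QoS measure `qos` (a single-PM availability formula) and the parameter ranges are hypothetical stand-ins.

```python
# One-at-a-time sensitivity: vary each parameter across its range
# while holding all other parameters at their baseline values.

def qos(params):
    # Hypothetical QoS measure: steady-state availability of a single
    # PM with failure rate lam and repair rate mu.
    return params["mu"] / (params["mu"] + params["lam"])

baseline = {"lam": 0.001, "mu": 0.1}
ranges = {"lam": [0.0005, 0.001, 0.002], "mu": [0.05, 0.1, 0.3]}

impact = {}
for name, values in ranges.items():
    outputs = []
    for v in values:
        p = dict(baseline)
        p[name] = v                      # vary only this parameter
        outputs.append(qos(p))
    impact[name] = max(outputs) - min(outputs)  # observed output variation

# Parameters ordered by their impact on the QoS measure.
ranked = sorted(impact, key=impact.get, reverse=True)
print(ranked)
```

The cost of this approach grows with the number of parameters and range sizes, which is why the differential and index-based methods below are attractive alternatives.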
(ii) Differential sensitivity analysis (also called the direct method). It computes the sensitivity of a given measure Y, which depends on a specific parameter θ, as S_θ(Y) = ∂Y/∂θ, or SS_θ(Y) = (∂Y/∂θ) · (θ/Y) for a scaled sensitivity. The sign of SS_θ(Y) denotes whether an increase of θ causes a corresponding increase or instead a decrease of the measure Y. Its absolute value indicates the magnitude of the variation of Y for small variations of θ. This method is only suitable for continuous parameters.
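When Y(θ) is only available numerically, the scaled sensitivity SS_θ(Y) can be approximated with a central finite difference. The sketch below uses a hypothetical availability measure (single PM, failure rate lam, repair rate mu) purely for illustration.

```python
def scaled_sensitivity(Y, theta, h=1e-6):
    # SS_theta(Y) = (dY/d theta) * theta / Y, with the derivative
    # approximated by a central finite difference.
    dY = (Y(theta + h) - Y(theta - h)) / (2 * h)
    return dY * theta / Y(theta)

# Hypothetical measure: steady-state availability of one PM with
# failure rate lam and fixed repair rate mu = 0.1.
mu = 0.1
availability = lambda lam: mu / (mu + lam)

ss = scaled_sensitivity(availability, 0.001)
print(ss)  # negative: increasing the failure rate decreases availability
```

For this closed-form measure the exact value is SS_λ(A) = −λ/(μ + λ), which the finite-difference estimate matches closely.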
(iii) Sensitivity index. This technique is designed for integer-valued parameters, which are not properly evaluated by the differential sensitivity analysis approach. The sensitivity formula is S_θ(Y) = 1 − min{Y(θ)}/max{Y(θ)}, where θ ∈ [θ_1, θ_n], min{Y(θ)} = min{Y(θ_1), Y(θ_2), ..., Y(θ_n)} and max{Y(θ)} = max{Y(θ_1), Y(θ_2), ..., Y(θ_n)}.
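The sensitivity-index formula translates directly into code. In this sketch the measure function (availability as a function of an integer number of spares) is a hypothetical stand-in for a model output.

```python
def sensitivity_index(Y, thetas):
    # S_theta(Y) = 1 - min{Y(theta)} / max{Y(theta)} over the
    # integer parameter values theta_1 .. theta_n.
    values = [Y(t) for t in thetas]
    return 1 - min(values) / max(values)

# Hypothetical measure: availability improves with the number of spares n.
avail = lambda n: 1 - 0.01 ** (n + 1)

s = sensitivity_index(avail, range(0, 4))
print(s)
```

The index lies in [0, 1]: a value near 0 means the integer parameter barely affects Y over its range, while a value near 1 flags a strong impact.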
Sensitivity analysis has been conducted in cloud systems.
In [23], the last two methods mentioned above were used for
sensitivity analysis of the availability of a virtualized system, which
was modeled as a continuous-time Markov chain (CTMC). The
authors in [24] studied a hierarchical model, which consisted of
several independent sub-models, each of which was modeled as
a CTMC. Thus, the overall system measure is the product of the measures of the sub-models, and the sensitivity of the overall system availability with respect to a continuous system parameter can be obtained by combining the sensitivity of the overall availability with respect to each component and the sensitivity of that component's availability with respect to the parameter. In our hierarchical models, however, there exist complex interactions among sub-models, and it is hard, if not impossible, to compute the derivative of the whole system measure with respect to a system parameter. In Section 6, we show that although S_θ(Y) cannot be calculated for each parameter, we can still identify the parameters with the most significant impact on the system by applying the differential sensitivity analysis method to each sub-model and then discarding parameters with less impact on system QoS.
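For the independent-sub-model case of [24], the chain/product rule makes the computation concrete: if the overall availability is A = A1 · A2 · A3 and parameter θ appears only in sub-model k, then ∂A/∂θ = (∂A_k/∂θ) · ∏_{i≠k} A_i. The sketch below illustrates this with hypothetical sub-model availability formulas; it is not the paper's model, whose sub-models interact and do not decompose this way.

```python
# Chain-rule sensitivity for a hierarchical model whose overall
# availability is the product of independent sub-model availabilities.

def sub_avail(lam, mu):
    # Availability of one sub-model with failure rate lam, repair rate mu.
    return mu / (mu + lam)

def d_sub_avail_dlam(lam, mu):
    # Derivative of the sub-model availability w.r.t. its failure rate.
    return -mu / (mu + lam) ** 2

params = [(0.001, 0.1), (0.002, 0.2), (0.0005, 0.05)]  # hypothetical values
A = [sub_avail(l, m) for l, m in params]
A_sys = A[0] * A[1] * A[2]

# Sensitivity of A_sys w.r.t. lam of sub-model 0 (product rule):
# dA_sys/dlam0 = (dA0/dlam0) * A1 * A2.
S = d_sub_avail_dlam(*params[0]) * A[1] * A[2]
print(A_sys, S)
```

This decomposition breaks down as soon as the sub-models exchange parameters or interact, which is precisely the situation in our hierarchical models.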
3. System description
In this paper, we assume that there are three PM pools (namely
hot, warm and cold) in a CDC. It is known that there exist several
types of failures in a cloud system such as software failures, hard-
ware failures and network failures [8]. This paper considers the
overall effect of these possible failures with an aggregated mean
time to failure (MTTF) [25,26]. Failure detection is assumed to be an instantaneous event. PMs in the same pool have independent and identically distributed TTFs, and the TTFs of the hot, warm and cold PM pools are exponentially distributed. As in [8], the failure rates are assumed to satisfy λ_h > λ_w ≫ λ_c in this paper. Three possible reasons for this assumption are as follows. First, software execution can accelerate the failure of hardware components such as fans and hard disks. Second, software aging is unavoidable, and a computer is eventually forced to shut down if no proactive action is taken. Third, a computer could generate corrupted files, which can damage the computer hardware in the long term.
Upon failure of a hot PM, the failed PM is moved from the hot pool to the pre-determined repair station for repair. Meanwhile, an available PM in the warm pool is moved to the hot pool. When the warm pool is empty but a PM is available in the cold pool, that PM is moved to the hot pool instead. Similarly, when a warm PM fails, it is moved from the warm pool for repair and a PM is moved from the cold pool to take over its role. For each pool, if a PM has moved from another pool to play the role of a failed PM, the moved PM returns to its original pool once the failed PM completes its repair. The time to move a PM from one pool to another follows an exponential distribution. PM repair activities are work-conserving and repaired PMs are as good as new. We consider the following two repair policies:
(1) Independent repair station (IRS). Each pool has its own repair station, with at least one repair facility. Each facility repairs a failed PM independently. A PM of a pool can be repaired only by a repair facility of that pool's repair station. If the number of PMs awaiting repair in a pool exceeds the number of corresponding repair facilities/servers, the failed PMs are placed in the corresponding waiting queue. Hot, warm and cold PM repair times are exponentially distributed.
(2) Sharing repair station (SRS). The hot, warm and cold pools share a single repair station. Failed hot PMs have repair priority over the failed PMs of the other pools, and failed warm PMs have priority over failed cold PMs. The priority is non-preemptive. As in the previous policy, PM repair times are exponentially distributed.
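The non-preemptive priority of the SRS policy amounts to a simple selection rule: whenever a repair facility becomes free, it serves the highest-priority non-empty queue and never interrupts a repair in progress. A minimal sketch, where the queue representation and PM identifiers are illustrative assumptions:

```python
def next_to_repair(queues):
    # Non-preemptive priority under SRS: a free repair facility always
    # takes the head of the highest-priority non-empty queue; PMs already
    # under repair are never preempted.
    for pool in ("hot", "warm", "cold"):   # priority order: hot > warm > cold
        if queues[pool]:
            return queues[pool].pop(0)
    return None

queues = {"hot": [], "warm": ["w1"], "cold": ["c1", "c2"]}
print(next_to_repair(queues))  # the warm PM is served before any cold PM
```

Under IRS, by contrast, each pool's queue is served only by that pool's own facilities, so no such cross-pool selection rule is needed.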
Table 1 summarizes the definitions of the system input parameters used in the following sections. n_h, n_w, n_c, n_rh, n_rw and n_rc are design parameters, whereas MTTF, MTTR and MTTM values can be measured experimentally. Note that we use notations similar to those in [8] in order to highlight the differences between our models and those in [8], and thereby indicate the modeling challenges addressed in this paper.
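The pool-replacement rule described earlier in this section (a failed hot PM's role is taken by a warm PM if one is available, otherwise by a cold PM) can be sketched as follows; the dictionary representation and function names are illustrative assumptions, not part of the SRN models:

```python
def replacement_pool(pools):
    # On a hot-PM failure, prefer a warm PM; fall back to a cold PM;
    # if neither pool has a PM, no replacement is possible.
    if pools["warm"] > 0:
        return "warm"
    if pools["cold"] > 0:
        return "cold"
    return None

def hot_pm_fails(pools):
    pools["hot"] -= 1            # failed PM goes to the repair station
    src = replacement_pool(pools)
    if src is not None:
        pools[src] -= 1          # borrowed PM leaves its pool ...
        pools["hot"] += 1        # ... and takes over the hot role
    return src

pools = {"hot": 3, "warm": 1, "cold": 2}
print(hot_pm_fails(pools), pools)
```

The three branches of `replacement_pool` correspond to the three hot-failure cases modeled by the transitions of the SRN in Section 4.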
4. System models under SRS policy
This section first presents the monolithic SRN model under the SRS repair policy. Then the corresponding scalable interacting SRN sub-models are given.
4.1. Monolithic SRN model
Fig. 1 shows the monolithic SRN model for the availability analysis of an IaaS cloud under the SRS repair policy. The numbers of tokens in places P_h, P_w and P_c represent the numbers of non-failed PMs in the hot, warm and cold pools, respectively. The firing of each of the transitions T_bwhf, T_bchf and T_hf represents the failure event of a hot PM. That is, one of three cases occurs when a hot PM fails:
Case (F1) A non-failed warm PM is available for moving to the hot pool, represented by firing T_bwhf;