An In-Depth Look at Software Cluster System Reliability Based on the Proportional Hazard Model


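With the widespread adoption of software cluster systems, their reliability has become a focus of both academia and industry. Traditional reliability analysis methods for hardware load-sharing systems do not apply to software clusters, because the latter depend mainly on system software, whose failure behavior differs from that of hardware. This paper proposes a new reliability analysis model for redundant software cluster systems that accounts for the combined operation of initial servers and cold standby servers.

The authors first treat a software cluster system as a special kind of software load-sharing system (LSS) whose reliability is significantly affected by complex software-level factors. To address this, they use a state-based non-homogeneous Markov process (NHMH) as the modeling tool. Each state in the model corresponds to a non-homogeneous Poisson process (NHPP), which captures the randomness and time dependence of server failures. This lets the model describe the failure probability and recovery time of the system in different operating states (for example normal operation, a single-server failure, or multiple-server failures). By incorporating the state transition probabilities of the initial and cold standby servers, the paper examines how the two server types jointly affect overall cluster reliability, including how redundancy strategies (such as N+1 backup or higher levels of redundancy) reduce the risk of system failure.

In building the model, the researchers may also discuss key concepts such as the failure rate function, mean time between failures, and failure density, which are basic measures for evaluating system reliability. They may also use statistical methods such as conditional probability, fault tree analysis, or Monte Carlo simulation to estimate system reliability and the distribution of failure modes. The paper may further include case studies or simulation experiments to validate the model. By analyzing historical data, the researchers may find that the results of the reliability analysis based on the proportional hazard model agree well with observed behavior, which would strengthen the model's value as a guide for practical decisions.

Overall, the paper offers a reliability analysis framework dedicated to evaluating and optimizing software cluster systems, with both theoretical and practical value for improving the availability and stability of software systems. With a deeper understanding of the characteristics and failure behavior of software clusters, researchers and engineers can devise more effective fault-tolerance strategies to meet growing business demands.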

Reliability Analysis for Software Cluster Systems based on Proportional Hazard Model

Chunyan Hou, Chen Chen 2,*, Jinsong Wang, Kai Shi

School of Computer and Communication Engineering, Tianjin University of Technology, Tianjin, China

College of Computer and Control Engineering, Nankai University, Tianjin, China

chunyanhou@163.com, nkchenchen@nankai.edu.cn, {jswang, shikai}@tjut.edu.cn

Abstract—With the universal application of software cluster systems, their reliability is drawing more and more attention from both academia and industry. A cluster system is a kind of software load-sharing system (LSS) whose reliability is significantly dependent on system software. Therefore, traditional reliability analysis methods for hardware LSSs are not applicable to cluster systems. In this paper, we develop a reliability analysis model for redundant cluster systems consisting of initial servers and cold standby servers used to replace failed ones. The system reliability process is modeled with a state-based non-homogeneous Markov process (NHMH), where each state corresponds to a non-homogeneous Poisson process (NHPP). The NHPP arrival rate is expressed using Cox’s proportional hazard model (PHM) in terms of the cumulative and instantaneous workload of the system software. In addition to redundant cluster systems without repair, the model can also be extended to analyze those with restart. The analysis results are meaningful for supporting cluster management and design decisions. Finally, the evaluation experiments show the potential of our model.

Keywords—cluster system; load-sharing system; cumulative workload; software reliability; software aging
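The abstract's idea of an NHPP whose arrival rate follows Cox's PHM can be illustrated with a minimal sketch. This excerpt does not give the authors' baseline hazard, covariates, or coefficients, so the generic form rate(t) = baseline(t) * exp(beta_cum * W(t) + beta_inst * w(t)), the function names `phm_rate` and `nhpp_reliability`, and every numeric value below are assumptions made for illustration only. The sketch covers a single NHPP state, not the full state-based model with standby servers.

```python
import math


def phm_rate(t, baseline, beta_cum, beta_inst, cum_workload, inst_workload):
    """Hypothetical PHM-style failure rate at time t.

    baseline, cum_workload and inst_workload are callables of t; the two
    workload terms act as covariates that scale the baseline hazard.
    """
    exponent = beta_cum * cum_workload(t) + beta_inst * inst_workload(t)
    return baseline(t) * math.exp(exponent)


def nhpp_reliability(t_end, rate, steps=10_000):
    """Probability of no failure in [0, t_end] for an NHPP.

    Uses R(t) = exp(-integral of rate over [0, t]), with the integral
    approximated by the composite trapezoidal rule.
    """
    h = t_end / steps
    total = 0.5 * (rate(0.0) + rate(t_end))
    for k in range(1, steps):
        total += rate(k * h)
    return math.exp(-h * total)


if __name__ == "__main__":
    # Assumed values: constant baseline hazard, constant normalized load,
    # so the cumulative workload grows linearly with time.
    load = 1.0

    def rate(t):
        return phm_rate(
            t,
            baseline=lambda _: 1e-3,
            beta_cum=0.01,
            beta_inst=0.5,
            cum_workload=lambda s: load * s,
            inst_workload=lambda _: load,
        )

    for hours in (10.0, 50.0, 100.0):
        print(hours, nhpp_reliability(hours, rate))
```

With both coefficients set to zero the rate reduces to the baseline, and the reliability becomes the familiar exponential form exp(-baseline * t). A positive coefficient on cumulative workload makes the rate grow over time, which is one simple way to express aging-like behavior.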

I. INTRODUCTION

The rapid development of new technologies has led to a large number of critical commercial applications on the Internet. As users become dependent on these services, service failure or interruption can cause great losses for service providers. Therefore, high availability and high performance have become increasingly important for satisfying more demanding quality of service (QoS) requirements. A widely adopted technique to significantly improve system availability and performance is clustering [1]. A cluster is a set of servers and related resources that act like a single system and provide high availability, load balancing and parallel processing. These servers are usually identical. If one server fails, another can act as a backup. Compared to expensive high-availability systems with proprietary, tightly coupled hardware and software, cluster systems use commercially available computers networked in a loosely coupled fashion, and provide high availability and performance in a cost-effective way.

In the past few decades, the computing capacity of cluster systems has increased dramatically. However, a linear increase in cluster size results in an exponential increase in failure rate. System software and applications running on cluster systems are becoming more and more complex, which makes them prone to bugs and other software failures. It has been reported that software faults and failures result in more outages in larger computer systems than hardware faults [2], and they cause huge economic losses or risks to human lives. A large percentage of software failures is due to software aging [3]. Fifty years ago, the notion of software aging was formally introduced in [4]. Since then, much theoretical and experimental research has been conducted in order to characterize and understand this important phenomenon. Software aging can be understood as a continued and growing degradation of the software's internal state during its operational life. A general characteristic of this phenomenon is gradual performance degradation and/or an increase in failure rate [5].

To counteract software aging, a proactive fault management technique called software rejuvenation was proposed. It involves occasionally stopping the running software, cleaning its internal state and/or its environment, and restarting it. To determine the time epochs for triggering software rejuvenation, analytic models [6, 7], monitoring of system resources followed by statistical analysis [8, 9], or a combination of the two [10] have been proposed. However, most of these works concentrated on the impact of rejuvenation on cluster availability, not on system capacity and performability. They also did not consider the workload and failure rate variations caused by user behavior patterns. Many empirical studies of mechanical systems [11] and computer systems [12] have shown that workload strongly affects system failure rate. In addition, software rejuvenation is a preventive maintenance technique, so it is not useful for system design decisions.

System design is the foundation of building a software system. Just as well begun is half done, good design helps to shorten the period of software development, reduce the cost of operating and maintaining a system, and avoid costly rework. During the design phase, a cluster system is designed to meet not only the functional requirements of customers, but also the non-functional ones. Nowadays, non-functional quality properties such as system lifetime, reliability, performance and failure rate have drawn more and more attention from customers. With regard to software aging, customers may want to know how long a cluster system would run before aging. Traditionally, system designers make only rough estimates of these quality properties based on their previous experience, which is subjective, inaccurate and misleading. Therefore, a reliability model is necessary for cluster system development to precisely characterize reliability-related properties under various workloads and system design schemes. The numerical quality indexes obtained from the model are not only deliverable to customers but also very meaningful for system designers and maintainers making related decisions. In this paper, we aim to contribute such a reliability model for cluster systems.

2016 IEEE 40th Annual Computer Software and Applications Conference

DOI 10.1109/COMPSAC.2016.177
