A Discussion of Virtual-Machine-Level Heterogeneous Checkpoint/Restart

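Virtual-machine-based heterogeneous checkpointing is a fault-tolerance technique for applications running in distributed computing environments. The document, "Virtual Machine Based Heterogeneous Checkpointing.pdf", examines how to design and implement such a mechanism at the virtual machine level, so that after a failure an application can be restarted from a state that was saved under a different hardware architecture or operating system, without losing the work already done.

Traditional checkpoint/restart mechanisms usually capture the state of the entire application process, including memory contents, machine registers, and system resources. Such process-level checkpoints can be costly to take and port poorly across platforms. A virtual-machine-level checkpoint instead saves the application's state relative to the virtual machine rather than the whole process, which can reduce both the amount of saved data and the time needed to restart.

The core idea is that a running application periodically (or when a specific condition triggers it) saves its state with respect to the virtual machine (such as the virtual machine's internal registers, stack, and heap) to stable storage. The saved state is designed to be portable, so it can be restored in an environment whose hardware and operating system differ from those of the original machine. For example, if an application runs on an x86 server and a checkpoint is taken there, a restart can create a new virtual machine instance on an ARM server, load the saved state into it, and continue running.

To achieve this heterogeneity, the paper may discuss the following key points:

1. **Virtual machine abstraction**: how to define and abstract the application's state at the virtual machine level so that it does not depend on the details of the underlying hardware.
2. **State isolation**: how to keep the state of each virtual machine instance independent, so that only the parts matching the target environment are restored on migration.
3. **Compatibility adaptation**: how to handle differences between platforms, such as memory layout, processor instruction set, and drivers, so that a checkpoint can be loaded and executed successfully.
4. **Restart strategy**: how to accurately reconstruct the earlier state after a new virtual machine instance starts, including memory mappings and file system mount points.
5. **Performance optimization**: how to reduce the impact of checkpointing and restart on normal execution, for example by choosing a suitable compression algorithm or by using incremental checkpoints.
6. **Security considerations**: in a heterogeneous environment, data migration and privacy may become concerns, so the paper may discuss measures such as encryption and access control.

In summary, the paper presents an approach to building a cross-platform, virtual-machine-level heterogeneous checkpoint/restart mechanism, which is relevant to cloud computing, distributed computing, and high-availability systems, and which helps improve system reliability and flexibility.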
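The periodic save-to-stable-storage and restart cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's mechanism: it checkpoints application-level state rather than virtual machine state, and the function names (`save_checkpoint`, `load_checkpoint`, `sum_of_squares`), the JSON file format, and the `fail_at` hook are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to the storage device
        os.replace(tmp_path, path)  # the old checkpoint stays valid until here
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def sum_of_squares(n, path, interval=1000, fail_at=None):
    """Compute sum(i*i for i in range(n)), checkpointing every `interval` steps.

    `fail_at` simulates a crash at that iteration (hypothetical test hook).
    """
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"] * state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(state, path)
    return state["total"]
```

Calling `sum_of_squares` again with the same `path` after a failure resumes from the last saved iteration instead of from zero. Writing to a temporary file and renaming it means a crash during the save leaves the previous checkpoint intact.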

Virtual Machine Based Heterogeneous Checkpointing∗

Adnan Agbaria†, Roy Friedman

Computer Science Department
Technion – Israel Institute of Technology
Haifa, 32000, Israel

Email: {adnan,roy}@cs.technion.ac.il

Abstract

Checkpointing an application is the act of saving the application's state during its execution on stable storage so that if the application fails, it can be restarted from the last saved state, thereby avoiding loss of the work that was already done. A heterogeneous checkpoint/restart mechanism allows an application to be restarted from a saved state that was taken in a hardware architecture and/or operating system that can be different from those in the machine on which it is restarted. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application w.r.t. a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which continues running from there. The heterogeneous checkpoint/restart mechanism reported here was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses some lessons and ideas for extending the work to native code OCaml, and to Java Virtual Machines.
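To make "dumping the state of the application w.r.t. a virtual machine" concrete, the following toy sketch may help. It is not the OCaml mechanism the paper describes: `TinyVM`, its three opcodes, and the JSON dump format are invented for illustration. The point is only that the saved state consists of virtual machine quantities (code, program counter, stack) rather than a host process image, so a fresh copy of the virtual machine can load it and continue.

```python
import json


class TinyVM:
    """A toy stack machine; its whole state is (code, pc, stack)."""

    def __init__(self, code, pc=0, stack=None):
        self.code = code
        self.pc = pc
        self.stack = list(stack or [])

    def step(self):
        op, *args = self.code[self.pc]
        self.pc += 1
        if op == "push":
            self.stack.append(args[0])
        elif op == "add":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a + b)
        elif op == "mul":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")

    def run(self, max_steps=None):
        """Run to the end of the program, or for at most max_steps steps."""
        steps = 0
        while self.pc < len(self.code):
            if max_steps is not None and steps >= max_steps:
                break
            self.step()
            steps += 1
        return self.stack

    def dump(self):
        """Serialize VM-level state as text, independent of host layout."""
        return json.dumps({"code": self.code, "pc": self.pc, "stack": self.stack})

    @classmethod
    def load(cls, blob):
        """Build a new VM copy from a dump produced by dump()."""
        s = json.loads(blob)
        return cls(s["code"], s["pc"], s["stack"])


program = [["push", 2], ["push", 3], ["add"], ["push", 4], ["mul"]]
vm = TinyVM(program)
vm.run(max_steps=3)            # execute part of the program
saved = vm.dump()              # checkpoint taken mid-execution
restored = TinyVM.load(saved)  # new VM copy, possibly on another host
result = restored.run()        # continues from the saved program counter
```

Because the dump is plain text describing virtual machine state, nothing in it depends on the host's word size, byte order, or address space layout. A real virtual machine has far more state (heap objects, pointers, open files), which is where the design issues the paper reports on arise.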

1 Introduction

One aspect of executing heavy computations, whether in an interconnected setting or on a single local computer, is checkpoint and restart [6, 14, 19]. That is, it is common practice to periodically save the state of a long-running computation so that if the application fails, it can be restarted from the last checkpoint. This avoids losing the entire computation due to such a failure. Most practical checkpoint/restart (hereafter, C/R) work done so far assumes that the system is homogeneous, that is, a system in which all computers are of the same architecture and run the same operating system. This is despite the fact that most interconnected systems are heterogeneous, i.e., consist of nodes with different hardware architectures and operating systems.

∗ This research is supported by the Bar-Nir Bergreen Software Technology Center of Excellence.

† Also supported by Israeli Ministry of Science and Technology, Grant Number 1628-2-00.

The problem of C/R is much more difficult in heterogeneous systems than it is in homogeneous ones. In the latter, checkpointing can simply be done by dumping the process core. There are optimized implementations that attempt to greatly reduce the amount of saved data. But even they can rely on the fact that the architecture and operating system of the computer in which the failed application is restarted are the same as the ones in the computer in which the state was saved. In particular, such implementations can assume that the data representation, the machine registers, the stack, heap and data segments, as well as the machine's native instruction sets are all the same. Yet, in heterogeneous environments the above assumptions do not hold. Thus, to improve the utilization of clusters as well as meta and grid computing systems, it is desirable to construct C/R mechanisms that can operate across multiple platforms and operating systems.
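The data representation assumption is easy to demonstrate. The sketch below uses Python's standard `struct` module to show that the same integer has different byte layouts under different byte orders, and that a checkpoint format can avoid the problem by fixing one byte order and one integer width. The helper names `encode_int_list` and `decode_int_list` and the length-prefixed layout are assumptions for this example, not a format taken from the paper.

```python
import struct
import sys

value = 0x01020304

native = struct.pack("=I", value)  # host byte order, standard 4-byte size
big = struct.pack(">I", value)     # explicit big-endian
little = struct.pack("<I", value)  # explicit little-endian

# Reading bytes with the wrong byte order silently yields a different number.
misread = struct.unpack("<I", big)[0]


def encode_int_list(values):
    """Encode signed 64-bit integers with a length prefix, all big-endian."""
    return struct.pack(f">I{len(values)}q", len(values), *values)


def decode_int_list(blob):
    """Decode a blob produced by encode_int_list on any host."""
    (count,) = struct.unpack_from(">I", blob, 0)
    return list(struct.unpack_from(f">{count}q", blob, 4))
```

A raw memory dump corresponds to the `native` encoding, which is only meaningful on a machine with the same byte order and word size. Fixing the byte order and width in the saved format, as the two helpers do, is one way a heterogeneous mechanism can make saved data readable on a different architecture.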

In recent years there has been a growing proliferation of virtual machine based programming languages, such as Java, C#, OCaml, and LISP, that also implement their own memory management. These languages are translated to a byte-code representation, which is independent of a particular architecture's instruction set. Similarly, the state of the application depends on the virtual machine's internal registers, stack, heap and data segment
