Virtual Machine Based Heterogeneous Checkpointing ∗
Adnan Agbaria †
Roy Friedman
Computer Science Department
Technion – Israel Institute of Technology
Haifa, 32000
Israel
Email: {adnan,roy}@cs.technion.ac.il
Abstract
Checkpointing an application is the act of saving the
application’s state during its execution on stable stor-
age so that if the application fails, it can be restarted
from the last saved state, thereby avoiding loss of the
work that was already done. A heterogeneous checkpoint/restart
mechanism allows an application to be restarted
from a state that was saved on a hardware
architecture and/or operating system that may differ
from those of the machine on which it is restarted.
This paper explores how to construct such a mecha-
nism at the virtual machine level. That is, rather than
dumping the entire state of the application process, the
mechanism reported here dumps the state of the application
with respect to a virtual machine. During restart, the
saved state is loaded into a new copy of the virtual ma-
chine, which continues running from there. The het-
erogeneous checkpoint/restart mechanism reported here
was developed for the OCaml variant of ML. The pa-
per reports on the main issues encountered in build-
ing such a mechanism and the design choices made,
presents performance evaluations, and discusses some
lessons and ideas for extending the work to native code
OCaml, and to Java Virtual Machines.
∗ This research is supported by the Bar-Nir Bergreen Software Technology Center of Excellence.
† Also supported by Israeli Ministry of Science and Technology, Grant Number 1628-2-00.
1 Introduction
One aspect of executing heavy computations,
whether in an interconnected setting or on a single
local computer, is checkpoint and restart [6, 14, 19].
That is, it is common practice to periodically save the
state of a long-running computation so that if the application
fails, it can be restarted from the last checkpoint.
This avoids losing the entire computation due
to such a failure. Most practical checkpoint/restart
(hereafter, C/R) work done so far assumes that the
system is homogeneous, that is, one
in which all computers are of the same architecture
and run the same operating system. This is despite
the fact that most interconnected systems are hetero-
geneous, i.e., consist of nodes with different hardware
architectures and operating systems.
The problem of C/R is much more difficult in het-
erogeneous systems than it is in homogeneous ones. In
the latter, a checkpoint can simply be taken by dumping
the process core. There are optimized implementations
that attempt to greatly reduce the amount of
saved data. But even they can rely on the fact that the
architecture and operating system of the computer in
which the failed application is restarted are the same
as the ones in the computer in which the state was
saved. In particular, such implementations can assume
that the data representation, the machine registers, the
stack, heap and data segments, as well as the machine’s
native instruction sets are all the same. Yet, in hetero-
geneous environments the above assumptions do not
hold. Thus, to improve the utilization of clusters as
well as meta and grid computing systems, it is desir-
able to construct C/R mechanisms that can operate
across multiple platforms and operating systems.
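To make the data representation issue concrete, the following is a minimal sketch, not the mechanism developed in this paper, of saving a toy machine state in an architecture-neutral form. The VMState record, the MAGIC tag, and the save and restore functions are all hypothetical names introduced only for illustration. The sketch assumes that the state consists of two integer registers and an integer stack, and writes every field as a fixed-width big-endian value so that the saved bytes do not depend on the word size or byte order of the machine that produced them.

```python
import struct
from dataclasses import dataclass, field
from typing import List

MAGIC = b"VMCK"  # hypothetical 4-byte tag identifying this toy format
HEADER = ">4sqqI"  # ">" = big-endian, standard sizes, no padding


@dataclass
class VMState:
    """Toy virtual machine state (an assumption, not the OCaml VM layout)."""
    pc: int                      # program counter: index into the byte-code
    acc: int                     # accumulator register
    stack: List[int] = field(default_factory=list)


def save(state: VMState) -> bytes:
    """Encode the state using explicit big-endian, fixed-width fields."""
    header = struct.pack(HEADER, MAGIC, state.pc, state.acc, len(state.stack))
    # Each stack slot is written as a signed 64-bit integer.
    body = struct.pack(f">{len(state.stack)}q", *state.stack)
    return header + body


def restore(blob: bytes) -> VMState:
    """Decode a checkpoint produced by save() on any host architecture."""
    header_size = struct.calcsize(HEADER)
    magic, pc, acc, n = struct.unpack(HEADER, blob[:header_size])
    if magic != MAGIC:
        raise ValueError("not a toy VM checkpoint")
    stack = list(struct.unpack(f">{n}q", blob[header_size:]))
    return VMState(pc, acc, stack)
```

Because the encoding fixes both the field widths and the byte order, a checkpoint written on one machine decodes to the same state on another. A raw dump of the process core gives no such guarantee, since its layout follows the native word size, byte order, and memory segments of the machine on which it was taken.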
In recent years there has been a growing proliferation of
virtual machine based programming languages, such as
Java, C#, OCaml, and LISP, that also implement their
own memory management. These languages are translated
to a byte-code representation, which is independent
of a particular architecture’s instruction set. Similarly,
the state of the application depends on the virtual ma-
chine’s internal registers, stack, heap and data segment