Fault Tolerant Barnes-Hut Algorithm on Non-Volatile Memory
Wenzhe Zhang, Kai Lu, Xiaoping Wang, Xu Li
Science and Technology on Parallel and Distributed Processing Laboratory
Collaborative Innovation Center of High Performance Computing
State Key Laboratory of High-end Server & Storage Technology
College of Computer, National University of Defense Technology
Changsha, PR China
zhangwenzhe@nudt.edu.cn
lukainudt@163.com
xiaopingwang@nudt.edu.cn
lixu@nudt.edu.cn
Abstract—Today, high performance computers have entered the
Petaflops realm. As system scale increases, the Mean Time To
Failure (MTTF) declines rapidly; it is estimated that the MTTF
of High Performance Computing (HPC) systems might drop below
one hour in the near future. Under such a failure rate,
scientific applications cannot complete correctly and in a
timely manner without a fault tolerance mechanism. Algorithm
Based Fault Tolerance (ABFT) is a very cost-effective method to
incorporate fault tolerance into applications. However, current
ABFT approaches are mainly used in matrix operations and are
not suitable for general data structures. To fill this gap, we
propose an approach that enhances ABFT based on the emerging
Non-Volatile Random-Access Memory (NVRAM) technologies, making
ABFT applicable to algorithms operating on link-based data
structures. Our approach ensures data consistency by
maintaining the atomicity of each iteration. We demonstrate the
practicality of our approach by applying it to the Barnes-Hut
algorithm, which is widely used in high performance computing
to solve the N-body problem. Experimental results show that our
approach is able to survive fail-stop failures with a
performance overhead of 7%.
Keywords—fault tolerance; Barnes-Hut algorithm; non-volatile
memory.
I. INTRODUCTION
On the November 2012 Top500 list, the peak
performance of the most powerful computer reached
17.5 Petaflops, and high performance computers are on the
road to the Exaflops realm. Meanwhile, the number of system
components, such as CPU cores, memory modules, network links,
and storage devices, has grown considerably. With this increase
in system scale, the reliability and availability of such
systems have declined. Yang et al. [1] demonstrate that HPC
systems are approaching a reliability wall: increasing the
system scale may lengthen application completion time due to
reliability issues. It is estimated that the MTTF of future HPC
systems will be less than one hour [2, 3]. Under such a failure
rate, scientific applications cannot complete correctly and in
a timely manner without a fault tolerance mechanism.
The Checkpoint-Restart (CR) technique is commonly used to
improve system reliability. However, the cost of the CR
mechanism becomes unacceptable as system scale increases
[2, 25]. Thus, Du et al. [4] advocate that HPC systems adopt
ABFT approaches to improve system availability.
ABFT approaches [4-7] adapt algorithms to apply
appropriate mathematical operations to both the original data
and redundant recovery data. Once a failure occurs, they can
recover the application dataset with very low overhead.
Currently, ABFT approaches are mainly used in matrix
operations and are not suitable for general data structures.
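As a concrete illustration of the checksum idea underlying these matrix-oriented ABFT schemes, consider the following simplified sketch (our own example, not any specific scheme from [4-7]): a matrix is extended with a checksum row holding column sums, and a single lost row can then be reconstructed from the surviving rows and the checksums.

```c
#include <assert.h>

#define ROWS 3
#define COLS 4

/* Encode: append a checksum row holding the sum of each column. */
static void encode(double m[ROWS + 1][COLS]) {
    for (int j = 0; j < COLS; j++) {
        double s = 0.0;
        for (int i = 0; i < ROWS; i++) s += m[i][j];
        m[ROWS][j] = s;
    }
}

/* Recover one lost row: subtract the surviving rows from the
 * checksum row, column by column. */
static void recover_row(double m[ROWS + 1][COLS], int lost) {
    for (int j = 0; j < COLS; j++) {
        double s = m[ROWS][j];
        for (int i = 0; i < ROWS; i++)
            if (i != lost) s -= m[i][j];
        m[lost][j] = s;
    }
}
```

This works because matrix operations can be adapted to preserve the checksum invariant; for general link-based data structures no such algebraic invariant is available, which is the gap the present approach targets.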
Focusing on fail-stop failures, we propose an approach that
extends ABFT to algorithms operating on general link-based
data structures. Our approach leverages the emerging NVRAM
technologies and ensures that the application can continue to
execute after a failure by maintaining the atomic execution of
each iteration. Based on this approach, we design and
implement a fault tolerant version of the Barnes-Hut
algorithm, which is widely used in high performance computing
to solve the N-body problem. Experimental results show that
our approach is as efficient as current ABFT approaches
[4, 6], with an overhead within 7%.
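The core idea of making each iteration atomic, so that a fail-stop failure never leaves the persistent dataset half-updated, can be sketched as an undo log kept in NVRAM. This is a minimal illustration under our own assumptions, not the paper's actual implementation: the names (`begin_iteration`, `logged_store`, `commit_iteration`, `recover`) are hypothetical, and `pflush` stands in for whatever cache-line write-back and fence primitive the NVRAM platform provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical NVRAM flush primitive; on real hardware this would
 * issue a cache-line write-back plus a memory fence. */
static void pflush(const void *addr, size_t len) { (void)addr; (void)len; }

#define LOG_CAP 64

/* Undo log kept in (simulated) NVRAM: records old values so an
 * interrupted iteration can be rolled back after a fail-stop failure. */
struct undo_log {
    int    active;          /* nonzero while an iteration is open */
    size_t n;               /* number of valid entries            */
    long  *addr[LOG_CAP];   /* logged locations                   */
    long   old[LOG_CAP];    /* their pre-iteration values         */
};

static void begin_iteration(struct undo_log *log) {
    log->n = 0;
    log->active = 1;
    pflush(log, sizeof *log);
}

/* Persist the old value in the log, then perform the store. */
static void logged_store(struct undo_log *log, long *p, long v) {
    log->addr[log->n] = p;
    log->old[log->n] = *p;
    log->n++;
    pflush(log, sizeof *log);
    *p = v;
    pflush(p, sizeof *p);
}

/* Commit: once 'active' is cleared, the iteration's updates are final. */
static void commit_iteration(struct undo_log *log) {
    log->active = 0;
    pflush(log, sizeof *log);
}

/* Recovery after a restart: if a failure struck mid-iteration,
 * roll the logged stores back in reverse order. */
static void recover(struct undo_log *log) {
    if (!log->active) return;
    while (log->n > 0) {
        log->n--;
        *log->addr[log->n] = log->old[log->n];
        pflush(log->addr[log->n], sizeof(long));
    }
    log->active = 0;
    pflush(log, sizeof *log);
}
```

After a failure, the dataset in NVRAM is either in the pre-iteration state (rolled back) or the post-iteration state (committed), so the algorithm can simply resume from the last completed iteration.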
The rest of the paper is organized as follows: Section 2
presents the NVRAM technologies; Section 3 gives the details
of our approach; Section 4 reviews the features of the
Barnes-Hut algorithm; Section 5 presents the implementation of
our fault tolerant Barnes-Hut algorithm; Section 6 evaluates
the performance and overhead of our algorithm; Section 7
discusses related work; finally, Section 8 concludes the
paper.
II. BACKGROUND
We use the term Non-Volatile Random-Access Memory
(NVRAM) to refer to technologies that allow persistent
storage to be attached to the memory bus and accessed through
load/store instructions. The new technologies, such as phase-