Fault Tolerant Barnes-Hut Algorithm on Non-Volatile Memory
Wenzhe Zhang, Kai Lu, Xiaoping Wang, Xu Li
Science and Technology on Parallel and Distributed Processing Laboratory
Collaborative Innovation Center of High Performance Computing
State Key Laboratory of High-end Server & Storage Technology
College of Computer, National University of Defense Technology
Changsha, PR China
zhangwenzhe@nudt.edu.cn
lukainudt@163.com
xiaopingwang@nudt.edu.cn
lixu@nudt.edu.cn
Abstract—Today, high performance computers have entered the
Petaflops realm. As system scale increases, the Mean Time To
Failure (MTTF) declines rapidly; it is estimated that the MTTF
of High Performance Computing (HPC) systems might drop below
one hour in the near future. Under such a failure rate,
scientific applications cannot complete correctly and in a
timely manner without a fault tolerance mechanism. Algorithm
Based Fault Tolerance (ABFT) is a very cost-effective method to
incorporate fault tolerance into applications. However, current
ABFT approaches are mainly used in matrix operations and are
not suitable for general data structures. To fill this gap, we
propose an approach that enhances ABFT based on the emerging
Non-Volatile Random-Access Memory (NVRAM) technologies, making
ABFT applicable to algorithms operating on link-based data
structures. Our approach ensures data consistency by
maintaining the atomicity of each iteration. We demonstrate the
practicality of our approach by applying it to the Barnes-Hut
algorithm, which is widely used in high performance computing
to solve the N-body problem. Experimental results show that our
approach is able to survive fail-stop failures with a
performance overhead of 7%.
Keywords—fault tolerance; Barnes-Hut algorithm; non-volatile
memory.
I. INTRODUCTION
On the November 2012 Top500 list, the peak
performance of the most powerful computer reached
17.5 Petaflops, and high performance computers are on the
road to the Exaflops realm. Meanwhile, the number of system
components, such as CPU cores, memory modules, network links,
and storage devices, has grown considerably. With this increase
in system scale, the reliability and availability of such
systems have declined. Yang et al. [1] demonstrate that HPC
systems are approaching a reliability wall: increasing the
system scale may lengthen application completion time due to
reliability issues. It is estimated that the MTTF of future HPC
systems will be less than one hour [2, 3]. Under such a failure
rate, scientific applications cannot complete correctly and in
a timely manner without a fault tolerance mechanism.
The Checkpoint-Restart (CR) technique is commonly used to
improve system reliability. However, the cost of the CR
mechanism becomes unacceptable as system scale increases
[2, 25]. Thus, Du et al. [4] advocate that HPC systems adopt
ABFT approaches to improve system availability.
ABFT approaches [4-7] adapt algorithms to apply
appropriate mathematical operations to both the original data
and redundant recovery data. Once a failure occurs, they can
recover the application dataset with very low overhead.
Currently, ABFT approaches are mainly used in matrix
operations and are not suitable for general data structures.
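As a concrete illustration of the checksum idea underlying these matrix-oriented ABFT schemes, consider the following simplified sketch (our own example, not any specific scheme from [4-7]): a matrix is extended with a checksum row holding column sums, and a single lost row can then be reconstructed from the surviving rows and the checksums.

```c
#include <assert.h>

#define ROWS 3
#define COLS 4

/* Encode: append a checksum row holding the sum of each column. */
static void encode(double m[ROWS + 1][COLS]) {
    for (int j = 0; j < COLS; j++) {
        double s = 0.0;
        for (int i = 0; i < ROWS; i++) s += m[i][j];
        m[ROWS][j] = s;
    }
}

/* Recover one lost row: subtract the surviving rows from the
 * checksum row, column by column. */
static void recover_row(double m[ROWS + 1][COLS], int lost) {
    for (int j = 0; j < COLS; j++) {
        double s = m[ROWS][j];
        for (int i = 0; i < ROWS; i++)
            if (i != lost) s -= m[i][j];
        m[lost][j] = s;
    }
}
```

This works because matrix operations can be adapted to preserve the checksum invariant; for general link-based data structures no such algebraic invariant is available, which is the gap the present approach targets.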
Focusing on fail-stop failures, we propose an approach that
extends ABFT to algorithms operating on general link-based
data structures. Our approach leverages the emerging NVRAM
technologies and ensures that the application can continue to
execute after a failure by maintaining the atomic execution of
each iteration. Based on this approach, we design and
implement a fault tolerant version of the Barnes-Hut
algorithm, which is widely used in high performance computing
to solve the N-body problem. Experimental results show that
our approach is as efficient as current ABFT approaches
[4, 6], with an overhead within 7%.
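The core idea of making each iteration atomic, so that a fail-stop failure never leaves the persistent dataset half-updated, can be sketched as an undo log kept in NVRAM. This is a minimal illustration under our own assumptions, not the paper's actual implementation: the names (`begin_iteration`, `logged_store`, `commit_iteration`, `recover`) are hypothetical, and `pflush` stands in for whatever cache-line write-back and fence primitive the NVRAM platform provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical NVRAM flush primitive; on real hardware this would
 * issue a cache-line write-back plus a memory fence. */
static void pflush(const void *addr, size_t len) { (void)addr; (void)len; }

#define LOG_CAP 64

/* Undo log kept in (simulated) NVRAM: records old values so an
 * interrupted iteration can be rolled back after a fail-stop failure. */
struct undo_log {
    int    active;          /* nonzero while an iteration is open */
    size_t n;               /* number of valid entries            */
    long  *addr[LOG_CAP];   /* logged locations                   */
    long   old[LOG_CAP];    /* their pre-iteration values         */
};

static void begin_iteration(struct undo_log *log) {
    log->n = 0;
    log->active = 1;
    pflush(log, sizeof *log);
}

/* Persist the old value in the log, then perform the store. */
static void logged_store(struct undo_log *log, long *p, long v) {
    log->addr[log->n] = p;
    log->old[log->n] = *p;
    log->n++;
    pflush(log, sizeof *log);
    *p = v;
    pflush(p, sizeof *p);
}

/* Commit: once 'active' is cleared, the iteration's updates are final. */
static void commit_iteration(struct undo_log *log) {
    log->active = 0;
    pflush(log, sizeof *log);
}

/* Recovery after a restart: if a failure struck mid-iteration,
 * roll the logged stores back in reverse order. */
static void recover(struct undo_log *log) {
    if (!log->active) return;
    while (log->n > 0) {
        log->n--;
        *log->addr[log->n] = log->old[log->n];
        pflush(log->addr[log->n], sizeof(long));
    }
    log->active = 0;
    pflush(log, sizeof *log);
}
```

After a failure, the dataset in NVRAM is either in the pre-iteration state (rolled back) or the post-iteration state (committed), so the algorithm can simply resume from the last completed iteration.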
The rest of the paper is organized as follows: Section 2
presents the NVRAM technologies; Section 3 gives the details
of our approach; Section 4 reviews the features of the
Barnes-Hut algorithm; Section 5 presents the implementation of
our fault tolerant Barnes-Hut algorithm; Section 6 evaluates
the performance and overhead of our algorithm; Section 7
discusses related work; finally, Section 8 concludes the
paper.
II. BACKGROUND
We use the term Non-Volatile Random-Access Memory
(NVRAM) to refer to technologies that allow persistent
storage to be attached to the memory bus and accessed through
load/store instructions. The new technologies, such as phase-