Large-Scale MPI-CUDA Parallel Methods for Accelerating Incompressible Flow Computations


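Advances in high-performance computing (HPC) and parallel programming have made large-scale simulation far more efficient. This study combines the Message Passing Interface (MPI) with the Compute Unified Device Architecture (CUDA) to parallelize large incompressible flow computational fluid dynamics (CFD) simulations on multi-GPU clusters. The MPI-CUDA implementation targets today's heterogeneous multi-GPU clusters, whose deep memory hierarchy makes it hard to write scalable and efficient simulation code.

The research team, based at Boise State University, uses MPI for distributed-memory parallelism and communication between nodes and CUDA for fine-grained parallelism on each GPU, and overlaps GPU data transfers with MPI communication to improve performance. On 64 nodes of the NCSA Lincoln Tesla cluster, using 128 GPUs with a total of 30,720 processing elements, they reach approximately 2.4 teraflops of sustained performance. The result indicates that multi-GPU clusters can substantially accelerate CFD simulations, especially for large and complex problems where conventional CPU clusters struggle to keep up and hybrid parallel platforms are becoming the trend.

Three strategies are explored to evaluate and improve the efficiency and scalability of the MPI-CUDA implementation:

1. **Data copy optimization**: managing and copying data carefully to reduce data movement between GPUs and the host, which lowers communication overhead.
2. **Task parallelism and pipelined scheduling**: dividing work finely between the GPU and the CPU to balance load and reduce stalls in the computation.
3. **Overlapping communication with computation**: using CUDA streams and non-blocking MPI calls so that data exchange and computation proceed at the same time.

The results matter beyond CFD: they offer practical experience for other applications that depend on large-scale parallel computing and show how GPU resources can be used effectively in a modern HPC environment. The paper, presented at the AIAA Aerospace Sciences Meeting, may also inform future development of high-performance parallel software for distributed environments, particularly in fields with strict demands on computing speed and problem size.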
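The cluster figures quoted in the summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (64 nodes, 128 GPUs, 30,720 processing elements, roughly 2.4 teraflops); the derived per-GPU and per-core values are back-of-the-envelope averages, not measurements reported by the authors, and the variable names are illustrative.

```python
# Figures quoted in the summary (2.4 TFLOPS is approximate).
NODES = 64
GPUS = 128
PROCESSING_ELEMENTS = 30_720
SUSTAINED_TFLOPS = 2.4

# Derived averages: simple division, not measured values.
gpus_per_node = GPUS // NODES
cores_per_gpu = PROCESSING_ELEMENTS // GPUS
gflops_per_gpu = SUSTAINED_TFLOPS * 1000.0 / GPUS
mflops_per_core = gflops_per_gpu * 1000.0 / cores_per_gpu

print(f"GPUs per node:         {gpus_per_node}")
print(f"Cores per GPU:         {cores_per_gpu}")
print(f"Avg GFLOPS per GPU:    {gflops_per_gpu:.2f}")
print(f"Avg MFLOPS per core:   {mflops_per_core:.1f}")
```

The division gives 2 GPUs per node and 240 processing elements per GPU, so the quoted totals are mutually consistent.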

Figure 1. Three performance metrics on six selected CPU and GPU devices based on incompressible flow computations on a single device. Actual sustained performance is used rather than peak device performance. a) Sustained GFLOPS, b) MFLOPS/Watt, c) MFLOPS/Dollar

Figure 1 shows three performance metrics on each platform using an incompressible flow CFD code. In Figure 1a, the GPU version of the CFD code is clearly an improvement over the Pthreads shared-memory parallel CPU version. Both of these implementations are written in C and use identical numerical methods. The main impact on individual GPU performance was the introduction of compute capability 1.3, which greatly reduces the memory latency in some computing kernels due to the relaxed memory coalescing rules. Compute capability 1.3 also added support for double precision, which is important in many solutions.

Figure 1b shows the performance relative to the peak electrical power of the device. GPU devices show a definite advantage over the CPUs in terms of energy-efficient computing. The consumer video cards have a slight power advantage over the Tesla series, partly explained by having significantly less active global memory. The recent paper by Kindratenko et al. details the measured power use of two clusters built using NVIDIA Tesla S1070 accelerators. They find significant power usage from intensive global memory accesses, implying that CUDA kernels using shared memory can not only achieve higher performance but also use less power at the same time. Approximately 70% of the S1070's peak power is used while running their molecular dynamics program NAMD. Figure 1c shows the performance relative to the street price of the device, which sheds light on the cost effectiveness of GPU computing. The consumer GPUs are better in this regard, ignoring other factors such as the additional memory present in the compute server GPUs.
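The two derived metrics in Figure 1 are simple ratios of sustained performance to peak power and to street price. The sketch below shows how they could be computed; the `Device` class, the function names, and both sample devices are hypothetical, and the numbers are placeholders chosen for easy arithmetic, not values read from the figure.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    sustained_gflops: float   # measured on the application, not peak
    peak_watts: float         # peak electrical power of the device
    street_price_usd: float   # purchase price of the device


def mflops_per_watt(d: Device) -> float:
    """Energy efficiency: sustained MFLOPS divided by peak power."""
    return d.sustained_gflops * 1000.0 / d.peak_watts


def mflops_per_dollar(d: Device) -> float:
    """Cost effectiveness: sustained MFLOPS divided by street price."""
    return d.sustained_gflops * 1000.0 / d.street_price_usd


# Placeholder devices (hypothetical numbers, for illustration only).
devices = [
    Device("hypothetical CPU", 2.0, 100.0, 400.0),
    Device("hypothetical GPU", 20.0, 200.0, 500.0),
]

for d in devices:
    print(f"{d.name}: {mflops_per_watt(d):.1f} MFLOPS/W, "
          f"{mflops_per_dollar(d):.1f} MFLOPS/$")
```

Using sustained rather than peak GFLOPS in the numerator follows the figure caption; dividing by peak power rather than measured power makes the efficiency figure conservative.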

The rationale for clusters of GPU hardware is identical to that for CPU clusters: larger problems can be solved and total performance increases. Figure 1c indicates that clusters of commodity hardware can offer compelling price/performance benefits. By spreading the models over a cluster with multiple GPUs in each node, memory size limitations can be overcome such that inexpensive GPUs become practical for solving large computational problems. Today's motherboards can accommodate up to 8 GPUs in a single node, enabling large-scale compute power in small to medium size clusters. However, the resulting heterogeneous architecture with a deep memory hierarchy creates challenges in developing scalable and efficient simulation applications. In the following sections, we focus on maximizing performance on a multi-GPU cluster through a series of mixed MPI-CUDA implementations.
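To illustrate how spreading a model over several GPUs relaxes the per-device memory limit, the sketch below splits a structured grid into slabs along one axis and estimates the memory each GPU would hold. This is an assumed one-dimensional decomposition for illustration; this excerpt does not describe the decomposition the authors use, and the grid size, field count, and function names are hypothetical.

```python
def slab_decompose(nz: int, num_gpus: int) -> list[tuple[int, int]]:
    """Split nz grid planes into contiguous [start, end) slabs, one per GPU.

    Remainder planes go to the lowest ranks, so slab sizes differ by at most 1.
    """
    base, extra = divmod(nz, num_gpus)
    slabs = []
    start = 0
    for rank in range(num_gpus):
        size = base + (1 if rank < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


def slab_bytes(nx: int, ny: int, nz_local: int, fields: int,
               bytes_per_value: int, ghost_layers: int = 1) -> int:
    """Upper-bound memory for one slab, with ghost planes on both sides.

    Edge slabs have only one neighbor, so they need slightly less.
    """
    return nx * ny * (nz_local + 2 * ghost_layers) * fields * bytes_per_value


# Hypothetical case: 512^3 grid, 4 single-precision fields, 8 GPUs.
nx = ny = nz = 512
fields, bytes_per_value, num_gpus = 4, 4, 8

total = nx * ny * nz * fields * bytes_per_value
slabs = slab_decompose(nz, num_gpus)
per_gpu = max(slab_bytes(nx, ny, end - start, fields, bytes_per_value)
              for start, end in slabs)

print(f"whole domain: {total / 2**20:.0f} MiB")
print(f"largest slab: {per_gpu / 2**20:.0f} MiB on each of {num_gpus} GPUs")
```

The ghost planes are the data that neighboring GPUs must exchange every step, which is where the communication cost of the cluster approach comes from.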

III. Related Work

GPU computing has evolved from hardware rendering pipelines that were not amenable to non-rendering tasks, to the modern General Purpose Graphics Processing Unit (GPGPU) paradigm. Owens et al. survey the early history as well as the state of GPGPU computing in 2007. Early work on GPU computing is extensive and used custom programming to reshape the problem in a way that could be processed by a rendering pipeline, often one without 32-bit floating point support. The advent of DirectX 9 hardware in 2003 with floating point support, combined with early work on high level language support such as BrookGPU, Cg, and Sh, led to a rapid expansion of the field (refs. 14–19).

Brandvik and Pullan show the implementation of 2D and 3D Euler solvers on a single GPU, showing 29× speedup for the 2D solver and 16× speedup for the 3D solver. One unique feature of their paper is the implementation of the solvers in both BrookGPU and CUDA. Elsen et al. show the conversion of a subset of an existing Navier-Stokes

American Institute of Aeronautics and Astronautics
