MPI-CUDA并行实现：多GPU集群中流体计算的加速研究

5星 · 超过95%的资源需积分: 10 23 浏览量更新于2024-07-30 2 收藏 1.12MB PDF 举报

"该资源是关于在多GPU集群上实现流体计算的MPI-CUDA方法的研究，探讨了如何在带有深度内存层次结构的异构架构上开发可扩展且高效的模拟代码，以加速计算流体动力学（CFD）模拟。在伊利诺伊州国家超级计算应用中心（NCSA）的林肯Tesla集群上，利用128个GPU和30,720个处理元素，实现了约2.4 TeraFLOPS的计算性能。" 在现代科学计算领域，图形处理单元（GPU）因其多核架构而成为通用并行计算平台，极大地提升了仿真应用的速度。多GPU工作站能够加速计算问题，但面对更大的问题时，需要更强大的资源。因此，传统的中央处理器（CPU）集群现在开始配备多个GPU，以解决大规模问题。这种多GPU集群的异构架构带来了独特的挑战，尤其是在开发可扩展且高效的模拟代码时。本文研究了一种混合MPI-CUDA实现策略，以应对这些挑战。MPI（Message Passing Interface）是一种用于分布式内存系统中的进程间通信的标准，而CUDA是NVIDIA提供的用于编程GPU的并行计算平台和编程模型。通过结合使用MPI和CUDA，研究者可以实现数据传输和MPI通信与GPU上的计算的重叠，从而提高效率。研究中，作者在林肯Tesla集群上针对不可压缩流体计算进行了实验，该集群由64个节点组成，每个节点都配备了GPU。结果显示，他们能够在128个GPU上保持大约2.4 TeraFLOPS的持续计算性能，总共使用了30,720个处理元素。这表明多GPU集群对于计算流体动力学（CFD）模拟有显著的加速效果。通过使用先进的MPI特性和CUDA编程技术，研究者能够有效地管理多GPU环境中的数据传输和通信，减轻了GPU之间以及GPU与CPU之间的数据交换瓶颈。这不仅提高了计算速度，还展示了在大规模并行计算中的可扩展性。这篇研究提供了在多GPU集群上进行大规模并行流体计算的方法，对于理解如何优化GPU计算资源的利用和提升计算效率具有重要价值。其结果对于需要进行复杂CFD模拟的工程和科学研究具有指导意义，如航空航天、气候建模或生物流体动力学等领域。

GFLOPS.

a) b) c)

Figure 1. Three performance metrics on six selected CPU and GPU devices based on incompressible ﬂow

computations on a single device. Actual sustained performance is used rather than peak device performance.

a) Sustained GFLOPS, b) MFLOPS/Watt, c) MFLOPS/Dollar

Figure 1a shows three performance metrics on each platform using an incompressible ﬂow CFD code. The GPU

version of the CFD code is clearly an improvement over the Pthreads shared-memory parallel CPU version. Both

of these implementations are written in C and use identical numerical methods

. The main impact on individual

GPU performance was the introduction of compute capability 1.3, which greatly reduces the memory latency in some

computing kernels due to the relaxed memory coalescing rules

. Compute capability 1.3 also added support for double

precision which is important in many solutions.

Figure 1b shows the performance relative to the peak electrical power of the device. GPU devices show a deﬁnite

advantage over the CPUs in terms of energy-efﬁcient computing. The consumer video cards have a slight power

advantage over the Tesla series, partly explained by having signiﬁcantly less active global memory. The recent paper

by Kindratenko et al.

details the measured power use of two clusters built using NVIDIA Tesla S1070 accelerators.

They ﬁnd signiﬁcant power usage from intensive global memory accesses, implying CUDA kernels using shared

memory not only can achieve higher performance but can use less power at the same time. Approximately 70%

of the S1070’s peak power is used while running their molecular dynamics program NAMD. Figure 1c shows the

performance relative to the street price of the device which sheds light on the cost effectiveness of GPU computing.

The consumer GPUs are better in this regard, ignoring other factors such as the additional memory present in the

compute server GPUs.

The rationale for clusters of GPU hardware is identical to that for CPU clusters – larger problems can be solved

and total performance increases. Figure 1c indicates that clusters of commodity hardware can offer compelling

price/performance beneﬁts. By spreading the models over a cluster with multiple GPUs in each node, memory size

limitations can be overcome such that inexpensive GPUs become practical for solving large computational problems.

Today’s motherboards can accommodate up to 8 GPUs in a single node

, enabling large-scale compute power in

small to medium size clusters. However, the resulting heterogeneous architecture with a deep memory hierarchy cre-

ates challenges in developing scalable and efﬁcient simulation applications. In the following sections, we focus on

maximizing performance on a multi-GPU cluster through a series of mixed MPI-CUDA implementations.

III. Related Work

GPU computing has evolved from hardware rendering pipelines that were not amenable to non-rendering tasks, to

the modern General Purpose Graphics Processing Unit (GPGPU) paradigm. Owens et al.

survey the early history

as well as the state of GPGPU computing in 2007. Early work on GPU computing is extensive and used custom

programming to reshape the problem in a way that could be processed by a rendering pipeline, often one without

32-bit ﬂoating point support. The advent of DirectX 9 hardware in 2003 with ﬂoating point support, combined with

early work on high level language support such as BrookGPU, Cg, and Sh, led to a rapid expansion of the ﬁeld.

14–19

Brandvik and Pullan

show the implementation of 2D and 3D Euler solver on a single GPU, showing 29× speedup

for the 2D solver and 16× speedup for the 3D solver. One unique feature of their paper is the implementation of the

solvers in both BrookGPU and CUDA. Elsen et al.

show the conversion of a subset of an existing Navier-Stokes

American Institute of Aeronautics and Astronautics

剩余16页未读，继续阅读

rockyrockrr

粉丝: 2
资源: 12

MPI-CUDA并行实现：多GPU集群中流体计算的加速研究

An MPI-CUDA Implementation for Massively Parallel Incompressible

MPI-CUDA-LSQR.zip_matlab例程_Unix_Linux_

MPI-plus-MPI-slides:关于 MPI+MPI（MPI-1 加 MPI-3 共享内存）编程模型的演示

大规模MPI-CUDA并行方法加速无压流计算

mpi-m:MPI-M文档档案

Tuple Space implementation based on MPI-开源

matlab灰度阈值分割代码-UCM-MPI-GPU:密集标签框架的GPU-MPI实现

mpi-s7:Node.JS库与MPI-USB适配器进行通信

mpi-nccl-tests:使用GPU Direct RDMA进行MPI + NCCL测试

二维传热的有限差分法。具有MPI-2并行IO和MPI-3邻.zip

最新资源