Accelerating Linpack with CUDA
on heterogeneous clusters
Massimiliano Fatica
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara CA 95050
mfatica@nvidia.com
ABSTRACT
This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogeneous clusters, where both CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts the calls to DGEMM and DTRSM and executes them simultaneously on both GPUs and CPU cores. An 8U cluster is able to sustain more than a teraflop using a CUDA-accelerated version of HPL.
1. INTRODUCTION
The Linpack benchmark is very popular in the HPC space, because it is used as the performance measure for ranking supercomputers in the TOP500 list of the world's fastest computers [1]. The TOP500 list was created in 1993 and is updated twice a year, at the International Supercomputing Conference in Europe and at the Supercomputing Conference in the US. In this study we used HPL [2], High Performance Linpack, a reference implementation of the Linpack benchmark written by the Innovative Computing Laboratory at the University of Tennessee. HPL is a software package that solves a (random) dense linear system in double precision arithmetic on distributed-memory computers. It is the most widely used implementation of the Linpack benchmark and is freely available from Netlib (http://www.netlib.org/benchmark/hpl). The HPL package provides a testing and timing program to quantify both the accuracy of the obtained solution and the time it took to compute it.
We performed benchmarks on two different systems, a
workstation with a single GPU and an 8-node cluster with
multiple GPUs, with the following specifications:
1. SUN Ultra 24 workstation with an Intel Core2 Extreme Q6850 (3.0 GHz) CPU, 8 GB of memory and a Tesla C1060 card.
2. Cluster with 8 nodes, each node connected to half of a
Tesla S1070 system, which contains 4 GPUs, so that each node is connected to 2 GPUs. Each node has 2 Intel Xeon E5462 CPUs (2.8 GHz, 1600 MHz FSB) and 16 GB of memory. The nodes are connected with SDR (Single Data Rate) InfiniBand.
Peak performance for the CPU is computed as the product of the number of cores, the number of operations per clock and the clock frequency. The CPUs in both systems have 4 cores and can issue 4 double precision operations per clock, so the peak performance is 16 × clock frequency. The first system therefore has a peak double precision (DP) CPU performance of 48 GFlops; the second has a peak DP CPU performance of 89.6 GFlops per node (716.8 GFlops total peak CPU performance for the cluster).
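As a check, the same arithmetic can be written out explicitly. This is only a sketch using the figures quoted above:

    /* Peak DP GFlops = cores * DP ops per clock * clock frequency (GHz). */
    static const double peak_q6850   = 4 * 4 * 3.0;           /* 48.0 GFlops (workstation CPU) */
    static const double peak_node    = 2 * (4 * 4 * 2.8);     /* 89.6 GFlops (2x Xeon E5462)   */
    static const double peak_cluster = 8 * 2 * (4 * 4 * 2.8); /* 716.8 GFlops (8-node cluster) */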
2. GPU ARCHITECTURE AND CUDA
The GPU architecture has now evolved into a highly parallel, multi-threaded processor with very high floating point performance and memory bandwidth.
The latest generation of NVIDIA GPUs also added IEEE 754 double-precision support. NVIDIA's Tesla, a product line for high performance computing, has GPUs with 240 single precision cores, 30 double precision cores and 4 GB of memory. Each double precision unit can perform one fused multiply-add per clock, which counts as two floating point operations, so the peak double precision performance is 30 × 2 × clock frequency = 60 × clock frequency. The PCI-e card (C1060) has a clock frequency of 1.296 GHz, and the 1U system with 4 GPUs (S1070) has a clock frequency of 1.44 GHz.
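The same formula gives the per-GPU peaks; again only a sketch, using the clock frequencies quoted above:

    /* Peak DP GFlops per GPU = 30 FMA units * 2 flops * clock (GHz). */
    static const double peak_c1060 = 60 * 1.296; /* 77.76 GFlops, PCI-e card       */
    static const double peak_s1070 = 60 * 1.44;  /* 86.40 GFlops per GPU in S1070  */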
The GPU is especially well suited to problems that can be expressed as data-parallel computations, i.e. the same program is executed on many data elements in parallel, with high arithmetic intensity (the ratio of arithmetic operations to memory operations). CUDA [3] is a parallel programming model and software environment designed to expose the parallel capabilities of GPUs. CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
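For illustration only (this is not code from the HPL port), a minimal kernel and its launch could look like the following; vecAdd and all variable names are hypothetical:

    // Each of the n threads handles one array element, instead of a
    // single thread looping over all of them.
    __global__ void vecAdd(const double *a, const double *b,
                           double *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard against the last, partial block
            c[i] = a[i] + b[i];
    }

    // Host-side launch: enough 256-thread blocks to cover n elements,
    // with d_a, d_b, d_c assumed to be device pointers.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);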
The software environment also provides a performance
profiler, a debugger and commonly used libraries for HPC:
1. CUBLAS library: a BLAS implementation.
2. CUFFT library: an FFT implementation.
The implementation described in this paper was done using the CUBLAS library and the CUDA runtime; no specialized kernels have been written.
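To make the CUBLAS path concrete, the sketch below (a hypothetical helper, not the actual interception library) shows how a DGEMM call can be offloaded with the CUBLAS API of that generation: the operands are copied to device memory, cublasDgemm performs the multiply, and the result is copied back. It assumes cublasInit() has already been called and omits all error checking.

    #include <cublas.h>

    /* Hypothetical helper: compute C = alpha*A*B + beta*C on the GPU.
       Matrices are column-major, as in standard BLAS. */
    void gpu_dgemm(int m, int n, int k, double alpha,
                   const double *A, int lda, const double *B, int ldb,
                   double beta, double *C, int ldc)
    {
        double *dA, *dB, *dC;
        cublasAlloc(m * k, sizeof(double), (void **)&dA);
        cublasAlloc(k * n, sizeof(double), (void **)&dB);
        cublasAlloc(m * n, sizeof(double), (void **)&dC);

        /* Copy the host matrices to the GPU. */
        cublasSetMatrix(m, k, sizeof(double), A, lda, dA, m);
        cublasSetMatrix(k, n, sizeof(double), B, ldb, dB, k);
        cublasSetMatrix(m, n, sizeof(double), C, ldc, dC, m);

        cublasDgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);

        /* Copy the result back to the host. */
        cublasGetMatrix(m, n, sizeof(double), dC, m, C, ldc);

        cublasFree(dA);
        cublasFree(dB);
        cublasFree(dC);
    }

The interception library described in the abstract goes further than this sketch: it splits each call so that GPUs and CPU cores work on the matrices simultaneously.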