Accelerating Linpack with CUDA
on heterogeneous clusters
Massimiliano Fatica
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara CA 95050
mfatica@nvidia.com
ABSTRACT
This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogeneous clusters, where both CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts the calls to DGEMM and DTRSM and executes them simultaneously on both GPUs and CPU cores. An 8U cluster is able to sustain more than a teraflop using a CUDA-accelerated version of HPL.
1. INTRODUCTION
The Linpack benchmark is very popular in the HPC space, because it is used as the performance measure for ranking supercomputers in the TOP500 list of the world's fastest computers [1]. The TOP500 list was created in 1993 and is updated twice a year, at the International Supercomputing Conference in Europe and at the Supercomputing Conference in the US. In this study we used HPL [2], High Performance Linpack, a reference implementation of the Linpack benchmark written by the Innovative Computing Laboratory at the University of Tennessee. HPL is a software package that solves a (random) dense linear system in double precision arithmetic on distributed-memory computers. It is the most widely used implementation of the Linpack benchmark and is freely available from Netlib (http://www.netlib.org/benchmark/hpl). The HPL package provides a testing and timing program to quantify both the accuracy of the obtained solution and the time it took to compute it.
We performed benchmarks on two different systems, a
workstation with a single GPU and an 8-node cluster with
multiple GPUs, with the following specifications:
1. SUN Ultra 24 workstation with an Intel Core2 Extreme Q6850 (3.0 GHz) CPU, 8 GB of memory and a Tesla C1060 card.
2. Cluster with 8 nodes, each node connected to half of a
Tesla S1070 system, which contains 4 GPUs, so that each node is connected to 2 GPUs. Each node has 2 Intel Xeon E5462 CPUs (2.8 GHz, 1600 MHz FSB) and 16 GB of memory. The nodes are connected with SDR (Single Data Rate) InfiniBand.
Peak performance for the CPU is computed as the product of the number of cores, the number of operations per clock and the clock frequency. The CPUs in both systems have 4 cores and can issue 4 double precision operations per clock, so the peak performance is 16 × clock frequency. The first system therefore has a peak double precision (DP) CPU performance of 48 GFlops; the second has a peak DP CPU performance of 89.6 GFlops per node (716.8 GFlops total peak CPU performance for the cluster).
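As a check, the same arithmetic can be written out explicitly. This is only a sketch using the figures quoted above:

    /* Peak DP GFlops = cores * DP ops per clock * clock frequency (GHz). */
    static const double peak_q6850   = 4 * 4 * 3.0;           /* 48.0 GFlops (workstation CPU) */
    static const double peak_node    = 2 * (4 * 4 * 2.8);     /* 89.6 GFlops (2x Xeon E5462)   */
    static const double peak_cluster = 8 * 2 * (4 * 4 * 2.8); /* 716.8 GFlops (8-node cluster) */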
2. GPU ARCHITECTURE AND CUDA
The GPU architecture has now evolved into a highly parallel, multi-threaded processor with very high floating point performance and memory bandwidth.
The latest generation of NVIDIA GPUs also added IEEE 754 double-precision support. NVIDIA's Tesla, a product line for high performance computing, has GPUs with 240 single precision cores, 30 double precision cores and 4 GB of memory. Each double precision unit can perform one fused multiply-add per clock, which counts as two floating point operations, so the peak double precision performance is 30 × 2 × clock frequency = 60 × clock frequency. The PCI-e card (C1060) has a clock frequency of 1.296 GHz, and the 1U system with 4 GPUs (S1070) has a clock frequency of 1.44 GHz.
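The same formula gives the per-GPU peaks; again only a sketch, using the clock frequencies quoted above:

    /* Peak DP GFlops per GPU = 30 FMA units * 2 flops * clock (GHz). */
    static const double peak_c1060 = 60 * 1.296; /* 77.76 GFlops, PCI-e card       */
    static const double peak_s1070 = 60 * 1.44;  /* 86.40 GFlops per GPU in S1070  */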
The GPU is especially well suited to problems that can be expressed as data-parallel computations, i.e. the same program is executed on many data elements in parallel, with high arithmetic intensity (the ratio of arithmetic operations to memory operations). CUDA [3] is a parallel programming model and software environment designed to expose the parallel capabilities of GPUs. CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
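For illustration only (this is not code from the HPL port), a minimal kernel and its launch could look like the following; vecAdd and all variable names are hypothetical:

    // Each of the n threads handles one array element, instead of a
    // single thread looping over all of them.
    __global__ void vecAdd(const double *a, const double *b,
                           double *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard against the last, partial block
            c[i] = a[i] + b[i];
    }

    // Host-side launch: enough 256-thread blocks to cover n elements,
    // with d_a, d_b, d_c assumed to be device pointers.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);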
The software environment also provides a performance
profiler, a debugger and commonly used libraries for HPC:
1. CUBLAS library: a BLAS implementation.
2. CUFFT library: an FFT implementation.
The implementation described in this paper was done using the CUBLAS library and the CUDA runtime; no specialized kernels have been written.
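To make the CUBLAS path concrete, the sketch below (a hypothetical helper, not the actual interception library) shows how a DGEMM call can be offloaded with the CUBLAS API of that generation: the operands are copied to device memory, cublasDgemm performs the multiply, and the result is copied back. It assumes cublasInit() has already been called and omits all error checking.

    #include <cublas.h>

    /* Hypothetical helper: compute C = alpha*A*B + beta*C on the GPU.
       Matrices are column-major, as in standard BLAS. */
    void gpu_dgemm(int m, int n, int k, double alpha,
                   const double *A, int lda, const double *B, int ldb,
                   double beta, double *C, int ldc)
    {
        double *dA, *dB, *dC;
        cublasAlloc(m * k, sizeof(double), (void **)&dA);
        cublasAlloc(k * n, sizeof(double), (void **)&dB);
        cublasAlloc(m * n, sizeof(double), (void **)&dC);

        /* Copy the host matrices to the GPU. */
        cublasSetMatrix(m, k, sizeof(double), A, lda, dA, m);
        cublasSetMatrix(k, n, sizeof(double), B, ldb, dB, k);
        cublasSetMatrix(m, n, sizeof(double), C, ldc, dC, m);

        cublasDgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);

        /* Copy the result back to the host. */
        cublasGetMatrix(m, n, sizeof(double), dC, m, C, ldc);

        cublasFree(dA);
        cublasFree(dB);
        cublasFree(dC);
    }

The interception library described in the abstract goes further than this sketch: it splits each call so that GPUs and CPU cores work on the matrices simultaneously.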