GPU Clusters for High-Performance Computing
Volodymyr V. Kindratenko #1, Jeremy J. Enos #1, Guochun Shi #1, Michael T. Showerman #1,
Galen W. Arnold #1, John E. Stone *2, James C. Phillips *2, Wen-mei Hwu §3

# National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
1205 West Clark Street, Urbana, IL 61801, USA
1 {kindr|jenos|gshi|mshow|arnoldg}@ncsa.uiuc.edu

* Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign
405 North Mathews Avenue, Urbana, IL 61801, USA
2 {johns|jim}@ks.uiuc.edu

§ Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
1308 West Main Street, Urbana, IL 61801, USA
3 hwu@crhc.uiuc.edu
Abstract—Large-scale GPU clusters are gaining popularity in
the scientific computing community. However, their deployment
and production use are associated with a number of new
challenges. In this paper, we present our efforts to address some
of the challenges with building and running GPU clusters in HPC
environments. We touch upon such issues as balanced cluster
architecture, resource sharing in a cluster environment,
programming models, and applications for GPU clusters.
I. INTRODUCTION
Commodity graphics processing units (GPUs) have rapidly
evolved to become high performance accelerators for data-
parallel computing. Modern GPUs contain hundreds of
processing units, capable of achieving up to 1 TFLOPS for
single-precision (SP) arithmetic, and over 80 GFLOPS for
double-precision (DP) calculations. Recent high-performance
computing (HPC)-optimized GPUs contain up to 4 GB of on-board memory, and are
capable of sustaining memory bandwidths exceeding 100 GB/sec. The massively parallel
hardware architecture and high performance of floating point
arithmetic and memory operations on GPUs make them
particularly well-suited to many of the same scientific and
engineering workloads that occupy HPC clusters, leading to
their incorporation as HPC accelerators [1], [2], [4], [5], [10].
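For illustration, the per-device resources mentioned above can be queried at run time through the CUDA runtime API. The following minimal sketch (not code from the systems described in this paper; error checking omitted for brevity) enumerates the GPUs in a node and reports their multiprocessor counts and on-board memory:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Enumerate all CUDA-capable devices visible in this host node.
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Report the resources most relevant to HPC workloads:
        // multiprocessor count and total on-board memory.
        printf("GPU %d: %s, %d multiprocessors, %.1f GB on-board memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}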
Beyond their appeal as cost-effective HPC accelerators,
GPUs also have the potential to significantly reduce space,
power, and cooling demands, and reduce the number of
operating system images that must be managed relative to
traditional CPU-only clusters of similar aggregate
computational capability. In support of this trend, NVIDIA
has begun producing commercially available “Tesla” GPU
accelerators tailored for use in HPC clusters. The Tesla GPUs
for HPC are available either as standard add-on boards, or in
high-density self-contained 1U rack mount cases containing
four GPU devices with independent power and cooling, for
attachment to rack-mounted HPC nodes that lack adequate
internal space, power, or cooling for internal installation.
Although successful use of GPUs as accelerators in large
HPC clusters can confer the advantages outlined above, it also
presents a number of new challenges in terms of the application
development process, job scheduling and resource
management, and security. In this paper, we describe our
experiences in deploying two GPU clusters at NCSA, present
data on performance and power consumption, and present
solutions we developed for hardware reliability testing,
security, job scheduling and resource management, and other
unique challenges posed by GPU accelerated clusters. We
also discuss some of our experiences with current GPU
programming toolkits, and their interoperability with other
parallel programming APIs such as MPI and Charm++.
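As a concrete illustration of the MPI interoperability discussed later in the paper, a common pattern in GPU clusters is to bind each MPI process to a distinct device on its node. The sketch below assumes an MPI-3 library (for MPI_Comm_split_type) and at least as many GPUs per node as ranks placed on that node; it is illustrative only and is not code from the clusters described here:

// Bind each MPI rank on a node to a distinct GPU (minimal sketch).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Determine this rank's index among the ranks sharing the node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    // Map local ranks onto the GPUs visible in this node.
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    cudaSetDevice(local_rank % device_count);

    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    printf("rank %d -> GPU %d of %d\n", world_rank,
           local_rank % device_count, device_count);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}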
II. GPU CLUSTER ARCHITECTURE
Several GPU clusters have been deployed in the past decade; see, for example,
the installations by GraphStream, Inc. [3]. However, the majority of them were
deployed as visualization systems. Only recently have attempts been made to
deploy GPU compute clusters. Two early examples of such
installations include a 160-node “DQ” GPU cluster at LANL
[4] and a 16-node “QP” GPU cluster at NCSA [5], both based
on NVIDIA QuadroPlex technology. The majority of such
installations are highly experimental in nature and GPU
clusters specifically deployed for production use in HPC
environments are still rare.
At NCSA we have deployed two GPU clusters based on the
NVIDIA Tesla S1070 Computing System: a 192-node
production cluster “Lincoln” [6] and an experimental 32-node
cluster “AC” [7], which is an upgrade from our prior QP
system [5]. Both clusters went into production in 2009.
There are three principal components used in a GPU cluster:
host nodes, GPUs, and interconnect. Since the expectation is
for the GPUs to carry out a substantial portion of the
calculations, host memory, PCIe bus, and network
interconnect performance characteristics need to be matched
with the GPU performance in order to maintain a well-
balanced system. In particular, high-end GPUs, such as the
NVIDIA Tesla, require full-bandwidth PCIe Gen 2 x16 slots
that do not degrade to x8 speeds when multiple GPUs are used.
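The bandwidth actually delivered by a slot can be checked with a simple timed transfer. The following sketch is illustrative only (the 64 MB buffer size and repetition count are arbitrary, and it is not the test suite used on our clusters); it measures sustained host-to-device bandwidth using pinned host memory, where a slot running at x8 will report roughly half the rate of a full x16 Gen 2 slot:

// Host-to-device bandwidth check (minimal sketch, no error checking).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;            // 64 MB per transfer
    const int    reps  = 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);            // pinned host memory
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a series of host-to-device copies with CUDA events.
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host-to-device: %.2f GB/s\n",
           (bytes / 1e9) * reps / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}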
Also, InfiniBand QDR interconnect is highly desirable to
match the GPU-to-host bandwidth. Host memory also needs
to at least match the amount of memory on the GPUs in order
to enable their full utilization, and a one-to-one ratio of CPU