Hybrid Parallel Programming with MPI and Unified Parallel C∗
James Dinan
Dept. Comp. Sci. and Eng.
The Ohio State University
2015 Neil Avenue
Columbus, OH U.S.A.
dinan@cse.ohio-state.edu
Pavan Balaji
Math. and Comp. Sci. Division
Argonne National Laboratory
9700 S. Cass Avenue
Argonne, IL U.S.A.
balaji@mcs.anl.gov
Ewing Lusk
Math. and Comp. Sci. Division
Argonne National Laboratory
9700 S. Cass Avenue
Argonne, IL U.S.A.
lusk@mcs.anl.gov
P. Sadayappan
Dept. Comp. Sci. and Eng.
The Ohio State University
2015 Neil Avenue
Columbus, OH U.S.A.
saday@cse.ohio-state.edu
Rajeev Thakur
Math. and Comp. Sci. Division
Argonne National Laboratory
9700 S. Cass Avenue
Argonne, IL U.S.A.
thakur@mcs.anl.gov
ABSTRACT
The Message Passing Interface (MPI) is one of the most widely
used programming models for parallel computing. However, the
amount of memory available to an MPI process is limited by the
amount of local memory within a compute node. Partitioned Global
Address Space (PGAS) models such as Unified Parallel C (UPC)
are growing in popularity because of their ability to provide a shared
global address space that spans the memories of multiple compute
nodes. However, taking advantage of UPC can require a large re-
coding effort for existing parallel applications.
In this paper, we explore a new hybrid parallel programming
model that combines MPI and UPC. This model allows MPI pro-
grammers incremental access to a greater amount of memory, en-
abling memory-constrained MPI codes to process larger data sets.
In addition, the hybrid model offers UPC programmers an opportu-
nity to create static UPC groups that are connected over MPI. As we
demonstrate, the use of such groups can significantly improve the
scalability of locality-constrained UPC codes. This paper presents
a detailed description of the hybrid model and demonstrates its ef-
fectiveness in two applications: a random access benchmark and
the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance: using hybrid UPC groups that span two cluster nodes, random access benchmark performance increases by a factor of 1.33, and using groups that span four cluster nodes, Barnes-Hut achieves a twofold speedup at the expense of a 2% increase in code size.
∗
This work was supported in part by the Office of Advanced Sci-
entific Computing Research, Office of Science, U.S. Department
of Energy under contract DE-AC02-06CH11357; by the National
Science Foundation under grant #0702182; and by a resource grant
from the Ohio Supercomputer Center.
Copyright 2010 Association for Computing Machinery. ACM acknowl-
edges that this contribution was authored or co-authored by an employee,
contractor or affiliate of the U.S. Government. As such, the Government re-
tains a nonexclusive, royalty-free right to publish or reproduce this article,
or to allow others to do so, for Government purposes only.
CF’10, May 17–19, 2010, Bertinoro, Italy.
Copyright 2010 ACM 978-1-4503-0044-5/10/05 ...$10.00.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming—
Parallel programming; D.3.3 [Programming Languages]: Lan-
guage Constructs and Features—Concurrent programming struc-
tures
General Terms
Design, Languages, Performance
Keywords
MPI, UPC, PGAS, Hybrid Parallel Programming
1. INTRODUCTION
The Message Passing Interface (MPI) is considered to be the de
facto standard for parallel programming today [11]. The flexible,
feature-rich interface provided by MPI has successfully allowed
many complex scientific applications to be represented and mapped
efficiently to large-scale high-end computing systems. However,
the amount of memory available to an MPI process is limited by each process's virtual address space; and, for a variety of scientific applications, this space is insufficient to solve emerging problems.
Many scientific applications today are written in MPI using a
one-process-per-core model that partitions memory among the cores.
As systems grow, memory per core remains constant or decreases.
Shared memory hybrid parallel programming with MPI and OpenMP
avoids partitioning of memory and, for some applications, provides
access to a large enough amount of memory to simulate increas-
ingly large problems [18]. For many other applications, however,
the memory requirement grows superlinearly with problem size.
In particular, the simulation of the phenomena in the nucleus of an atom via the Green's function Monte Carlo (GFMC) method has a per-process memory requirement that grows as $2^A \cdot A!$ in the number of nucleons $A$ [17]. Hybridization of this MPI code with OpenMP
has successfully extended it to simulate carbon-12, which requires
roughly 0.5 GB memory per node. For larger atoms, however, the
per-MPI-process memory requirements quickly exceed the avail-
able memory per node. Thus, a new solution is needed.
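To make the scaling concrete, the following back-of-the-envelope comparison (illustrative only; oxygen-16 is chosen here simply as an example of a larger nucleus, and the $2^A \cdot A!$ growth is assumed to hold exactly) shows the factor by which the per-process requirement grows between carbon-12 ($A = 12$) and oxygen-16 ($A = 16$):
$$
\frac{2^{16} \cdot 16!}{2^{12} \cdot 12!} = 2^{4} \cdot 13 \cdot 14 \cdot 15 \cdot 16 = 698{,}880 \approx 7 \times 10^{5}.
$$
Under this assumption, even a baseline footprint of well under a gigabyte grows to hundreds of terabytes, far beyond the memory of any single compute node, regardless of how many OpenMP threads share that node's memory.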
Partitioned global address space (PGAS) models such as Unified Parallel C (UPC) [21] are relative newcomers to large-scale sci-