Computer Physics Communications 211 (2017) 54–60
Parallel 3-dim fast Fourier transforms with load balancing of the
plane waves
Xingyu Gao^{a,b,c}, Zeyao Mo^{a,b,c}, Jun Fang^{b,c}, Haifeng Song^{a,b,c}, Han Wang^{b,c,*}
a Laboratory of Computational Physics, Huayuan Road 6, Beijing 100088, PR China
b Institute of Applied Physics and Computational Mathematics, Fenghao East Road 2, Beijing 100094, PR China
c CAEP Software Center for High Performance Numerical Simulation, Huayuan Road 6, Beijing 100088, PR China
* Corresponding author at: Institute of Applied Physics and Computational Mathematics, Fenghao East Road 2, Beijing 100094, PR China. E-mail address: wang_han@iapcm.ac.cn (H. Wang).
Article info
Article history:
Received 29 January 2016
Received in revised form
23 May 2016
Accepted 2 July 2016
Available online 7 July 2016
Keywords:
First-principles calculation
Kohn–Sham equation
Plane wave
FFT
Load balancing
Abstract
The plane wave method is the most widely used approach for solving the Kohn–Sham equations in first-principles
materials science computations. In this procedure, the three-dimensional (3-dim) fast Fourier transform (FFT)
of the trial wave functions is a routine operation and one of the algorithms that is most demanding to scale
on a parallel machine. We propose a new partitioning algorithm for the 3-dim FFT grid that trades off
communication overhead against load balancing of the plane waves. Qualitative analysis and numerical results
show that our approach allows plane wave first-principles calculations to scale up to a larger number of nodes.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
In the context of Density Functional Theory (DFT), solving the
Kohn–Sham equation is the most time-consuming part of first-
principles materials science computations [1–3]. The plane wave
method, a widely used numerical approach [4], leads to a
large-scale dense algebraic eigenvalue problem. This problem
is usually solved by iterative diagonalization methods such as
Davidson's [5], RMM-DIIS [3], LOBPCG [6], Chebyshev polynomial
filtering subspace iteration [7], etc. The elementary operation
of these iteration methods is the matrix–vector multiplication.
Since the large-scale dense matrix is not suitable for explicit
assembly, we realize the matrix–vector multiplication by applying
the Hamiltonian operator to trial wave functions. The local term of
the effective potential is one part of the Hamiltonian operator. To
compute its action with lower time complexity, we perform two
3-dim FFTs on each trial wave function in every matrix–vector
multiplication.
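For illustration, a minimal serial sketch of this operation in Python (assuming a numpy-style FFT; the names apply_local_potential, psi_g and v_loc are our own, not taken from any particular plane wave code):

    import numpy as np

    def apply_local_potential(psi_g, v_loc):
        # psi_g : (N, N, N) complex array of plane wave (Fourier) coefficients
        # v_loc : (N, N, N) real array, local effective potential on the FFT grid
        psi_r = np.fft.ifftn(psi_g)   # first 3-dim FFT: reciprocal -> real space
        vpsi_r = v_loc * psi_r        # pointwise product in real space, O(N^3)
        return np.fft.fftn(vpsi_r)    # second 3-dim FFT: back to reciprocal space

The pointwise product in real space costs O(N^3) operations, whereas applying the local potential as a dense matrix on the N^3 plane wave coefficients would cost O(N^6); the two FFTs bring the total down to O(N^3 log N).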
Three features make the trial wave function's FFT one of the
most demanding algorithms to scale on a parallel machine. The
first is the moderately sized FFT grid, rather than a large one:
the ratio of computation to communication in the parallel
3-dim FFT is of order log N, where N, the single dimension of the
FFT grid, is usually O(10^2) in most first-principles calculations
of bulk materials. The second is the communication overhead
accumulated over many executions, one for each of a large number
of wave functions; thousands of FFTs may be executed at each step
of the iterative diagonalization. The third is the all-to-all
communication required by the data transposes. This can limit
the parallel scaling because the large number of small messages in
the network causes contention as well as latency issues.
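As a rough estimate of the first feature (assuming the textbook count of about 5 M log_2 M floating point operations for a complex FFT of M = N^3 points, and a transpose that communicates all N^3 grid values):

    \frac{\text{computation}}{\text{communication}} \sim \frac{5 N^{3} \log_{2} N^{3}}{N^{3}} = 15 \log_{2} N \approx 10^{2} \quad \text{for } N = O(10^{2}),

so the kernel offers little computation to hide its communication behind, in contrast to dense linear algebra whose ratio grows with N.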
It has already been recognized that making fewer and larger
messages can speed up parallel trial wave functions’ FFTs. The
hybrid OpenMP/MPI implementation [8,9] can lead to fewer and
larger messages compared to a pure MPI version. And a blocked
version [9] performs a number of trial wave functions’ FFTs at the
same time to aggregate the message sizes and reduce the latency
problem.
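To illustrate the blocking idea (a schematic sketch, not the implementation of Refs. [8,9]; it assumes mpi4py and a slab layout in which the leading array axis indexes the destination rank):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P = comm.Get_size()

    def transpose_single(psi):
        # One all-to-all per wave function: P small messages.
        # psi : (P, chunk) complex array; row i is the piece destined for rank i.
        recv = np.empty_like(psi)
        comm.Alltoall(psi, recv)
        return recv

    def transpose_blocked(psi_block):
        # One all-to-all for a block of B wave functions: each message is
        # B times larger, so the per-message latency is paid once, not B times.
        # psi_block : (P, B, chunk) complex array; psi_block[i] goes to rank i
        # and holds the chunks of all B wave functions contiguously.
        recv = np.empty_like(psi_block)
        comm.Alltoall(psi_block, recv)
        return recv

The total communication volume is unchanged; only the number of messages, and hence the latency cost, is reduced.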
In first-principles calculations, we should consider not only
the parallel scaling of the trial wave functions' FFTs, but also the
load balancing of the intensive computations on the plane waves
that expand the wave functions. The workload of these computations
is inhomogeneously distributed on a standard 3-dim FFT grid, so
a greedy algorithm is usually used to optimize the load
balancing. However, this algorithm results in global all-to-all
communications across all the processors, so the latency
overhead grows in proportion to the number of processors.
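The greedy algorithm referred to here is, in essence, longest-processing-time assignment of grid columns to processors; a minimal sketch under that assumption (the function name and data layout are illustrative, not the paper's code):

    import heapq

    def greedy_partition(column_loads, n_procs):
        # Repeatedly give the heaviest unassigned column of the FFT grid
        # to the currently least-loaded processor.
        # column_loads : per-column plane wave counts
        # returns owner[j] = rank of the processor that owns column j
        heap = [(0, rank) for rank in range(n_procs)]  # (current load, rank)
        heapq.heapify(heap)
        owner = [0] * len(column_loads)
        for j in sorted(range(len(column_loads)),
                        key=lambda j: column_loads[j], reverse=True):
            load, rank = heapq.heappop(heap)           # least-loaded processor
            owner[j] = rank
            heapq.heappush(heap, (load + column_loads[j], rank))
        return owner

Because neighboring columns end up scattered across ranks, reassembling the full 3-dim grid for the FFT then requires the global all-to-all communication described above.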