This chapter describes the MPI FFTW routines for distributed-memory (and shared-memory) machines supporting MPI (Message Passing Interface). The MPI routines are significantly different from the ordinary FFTW because the transform data here are distributed over multiple processes, so that each process gets only a portion of the array. Currently, multi-dimensional (but not one-dimensional) transforms of both real and complex data are supported.
The FFTW MPI library code is all located in the mpi subdirectory of the FFTW package (along with source code for test programs). On Unix systems, the FFTW MPI libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including --with-mpi in the flags to the configure script (see Section Installation on Unix).
The only requirement of the FFTW MPI code is that you have the standard MPI 1.1 (or later) libraries and header files installed on your system. A free implementation of MPI is available from the MPICH home page.
Previous versions of the FFTW MPI routines have had an unfortunate tendency to expose bugs in MPI implementations. The current version has been largely rewritten, and hopefully avoids some of the problems. If you run into difficulties, try passing the optional workspace to (r)fftwnd_mpi (see below), as this allows us to use the standard (and hopefully well-tested) MPI_Alltoall primitive for communications. Please let us know (fftw@theory.lcs.mit.edu) how things work out.
Several test programs are included in the mpi directory. The ones most useful to you are probably the fftw_mpi_test and rfftw_mpi_test programs, which are run just like an ordinary MPI program and accept the same parameters as the other FFTW test programs (cf. tests/README). For example, mpirun ...params... fftw_mpi_test -r 0 will run non-terminating complex-transform correctness tests of random dimensions. They can also do performance benchmarks.
Usage of the MPI FFTW routines is similar to that of the uniprocessor FFTW. We assume that the reader already understands the usage of the uniprocessor FFTW routines, described elsewhere in this manual. Some familiarity with MPI is also helpful.
A typical program performing a complex two-dimensional MPI transform might look something like:
#include <fftw_mpi.h>

int main(int argc, char **argv)
{
     const int NX = ..., NY = ...;
     fftwnd_mpi_plan plan;
     fftw_complex *data;

     MPI_Init(&argc,&argv);

     plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

     ...allocate and initialize data...

     fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);

     ...

     fftwnd_mpi_destroy_plan(plan);

     MPI_Finalize();
}
The calls to MPI_Init and MPI_Finalize are required in all MPI programs; see the MPI home page for more information. Note that all of your processes run the program in parallel, as a group; there is no explicit launching of threads/processes in an MPI program.
As in the ordinary FFTW, the first thing we do is to create a plan (of type fftwnd_mpi_plan), using:

fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
                                       int nx, int ny,
                                       fftw_direction dir, int flags);

Except for the first argument, the parameters are identical to those of fftw2d_create_plan. (There are also analogous fftwnd_mpi_create_plan and fftw3d_mpi_create_plan functions. Transforms of any rank greater than one are supported.) The first argument is an MPI communicator, which specifies the group of processes that are to be involved in the transform; the standard constant MPI_COMM_WORLD indicates all available processes.
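By analogy with the uniprocessor fftw3d_create_plan and fftwnd_create_plan, the declarations of these variants should look something like the following sketch (consult fftw_mpi.h for the authoritative prototypes):

/* 3d-specific MPI plan creation (sketch, by analogy with the
   uniprocessor fftw3d_create_plan): */
fftwnd_mpi_plan fftw3d_mpi_create_plan(MPI_Comm comm,
                                       int nx, int ny, int nz,
                                       fftw_direction dir, int flags);

/* general multi-dimensional MPI plan creation (sketch, by analogy
   with the uniprocessor fftwnd_create_plan): */
fftwnd_mpi_plan fftwnd_mpi_create_plan(MPI_Comm comm,
                                       int rank, const int *n,
                                       fftw_direction dir, int flags);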
Next, one has to allocate and initialize the data. This is somewhat tricky, because the transform data is distributed across the processes involved in the transform. It is discussed in detail by the next section (see Section MPI Data Layout).
The actual computation of the transform is performed by the function fftwnd_mpi, which differs somewhat from its uniprocessor equivalent and is described by:

void fftwnd_mpi(fftwnd_mpi_plan p, int n_fields,
                fftw_complex *local_data, fftw_complex *work,
                fftwnd_mpi_output_order output_order);
There are several things to notice here:
* First of all, all fftw_mpi transforms are in-place: the output is in the local_data parameter, and there is no need to specify FFTW_IN_PLACE in the plan flags.
* The MPI transforms support only a limited subset of the howmany/stride/dist functionality of the uniprocessor routines: the n_fields parameter is equivalent to howmany=n_fields, stride=n_fields, and dist=1. (Conceptually, the n_fields parameter allows you to transform an array of contiguous vectors, each with length n_fields.) n_fields is 1 if you are only transforming a single, ordinary array.
* The work parameter is an optional workspace. If it is not NULL, it should be exactly the same size as the local_data array. If it is provided, FFTW is able to use the built-in MPI_Alltoall primitive for (often) greater efficiency at the expense of extra storage space.
* The last parameter specifies whether the output data is left in FFTW's normal ordering (FFTW_NORMAL_ORDER), or if it is transposed (FFTW_TRANSPOSED_ORDER). Leaving the data transposed results in significant performance improvements due to a saved communication step (needed to un-transpose the data). Specifically, the first two dimensions of the array are transposed, as is described in more detail by the next section.
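To illustrate the last two points, the call in the earlier example could be changed to supply a workspace and to leave the output transposed. This is only a sketch; total_local_size is obtained from fftwnd_mpi_local_sizes, described in the next section:

/* sketch: allocate a workspace the same size as the local data,
   then transform in place, leaving the output transposed */
fftw_complex *work =
     (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);

fftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);

free(work);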
The output of fftwnd_mpi is identical to that of the corresponding uniprocessor transform. In particular, you should recall our conventions for normalization and the sign of the transform exponent.
The same plan can be used to compute many transforms of the same size. After you are done with it, you should deallocate it by calling fftwnd_mpi_destroy_plan.
Important: The FFTW MPI routines must be called in the same order by all processes involved in the transform. You should assume that they all are blocking, as if each contained a call to MPI_Barrier.
Programs using the FFTW MPI routines should be linked with -lfftw_mpi -lfftw -lm on Unix, in addition to whatever libraries are required for MPI.
The transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over.
In particular, the array is divided according to the rows (first dimension) of the data: each process gets a subset of the rows of the data. (This is sometimes called a "slab decomposition.") One consequence of this is that you can't take advantage of more processors than you have rows (e.g. a 64x64x64 array can use at most 64 processors). This isn't usually much of a limitation, however, as each processor needs a fair amount of data in order for the parallel-computation benefits to outweigh the communications costs.

Below, the first dimension of the data will be referred to as `x' and the second dimension as `y'.
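For example, a 64x64x64 transform distributed over four processes would typically give each process a 16x64x64 slab of rows; the exact assignment for your plan is reported by fftwnd_mpi_local_sizes, described below.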
FFTW supplies a routine to tell you exactly how much data resides on the current process:
void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                            int *local_nx,
                            int *local_x_start,
                            int *local_ny_after_transpose,
                            int *local_y_start_after_transpose,
                            int *total_local_size);
Given a plan p, the other parameters of this routine are set to values describing the required data layout, described below.

total_local_size is the number of fftw_complex elements that you must allocate for your local data (and workspace, if you choose). (This value should, of course, be multiplied by n_fields if that parameter to fftwnd_mpi is not 1.)
The data on the current process has local_nx rows, starting at row local_x_start. If fftwnd_mpi is called with FFTW_TRANSPOSED_ORDER output, then y will be the first dimension of the output, and the local y extent will be given by local_ny_after_transpose and local_y_start_after_transpose. Otherwise, the output has the same dimensions and layout as the input.
For instance, suppose you want to transform three-dimensional data of size nx x ny x nz. Then, the current process will store a subset of this data, of size local_nx x ny x nz, where the x indices correspond to the range local_x_start to local_x_start+local_nx-1 in the "real" (i.e. logical) array.
If fftwnd_mpi is called with FFTW_TRANSPOSED_ORDER output, then the result will be a ny x nx x nz array, of which a local_ny_after_transpose x nx x nz subset is stored on the current process (corresponding to y values starting at local_y_start_after_transpose).
The following is an example of allocating such a three-dimensional array (local_data) before the transform and initializing it to some function f(x,y,z):
fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                       &local_ny_after_transpose,
                       &local_y_start_after_transpose,
                       &total_local_size);

local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                    total_local_size);

for (x = 0; x < local_nx; ++x)
     for (y = 0; y < ny; ++y)
          for (z = 0; z < nz; ++z)
               local_data[(x*ny + y)*nz + z]
                    = f(x + local_x_start, y, z);
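If the transform is then computed with FFTW_TRANSPOSED_ORDER output, y becomes the first (distributed) dimension of the result. The following sketch shows how the locally stored portion of the transposed output might be traversed; use_output here is just a hypothetical stand-in for whatever you do with each element:

/* sketch: read back the local slab of the transposed (ny x nx x nz)
   output; the local y indices are offset by
   local_y_start_after_transpose in the logical array */
for (y = 0; y < local_ny_after_transpose; ++y)
     for (x = 0; x < nx; ++x)
          for (z = 0; z < nz; ++z)
               use_output(local_data[(y*nx + x)*nz + z],
                          x, y + local_y_start_after_transpose, z);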
Some important things to remember:
* Although the local data is of dimensions local_nx x ny x nz in the above example, do not allocate the array to be of size local_nx*ny*nz. Use total_local_size instead.
* The amount of data on each process will not necessarily be the same; in fact, local_nx may even be zero for some processes. (For example, suppose you are doing a 6x6 transform on four processors. There is no way to effectively use the fourth processor in a slab decomposition, so we leave it empty. Proof left as an exercise for the reader.)
* If you want to compute the inverse transform of the output of fftwnd_mpi, the dimensions of the inverse transform are given by the dimensions of the output of the forward transform. For example, if you are using FFTW_TRANSPOSED_ORDER output in the above example, then the inverse plan should be created with dimensions ny x nx x nz.
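For the three-dimensional example above, the forward and inverse plans might therefore be created as in the following sketch (fftw3d_mpi_create_plan was mentioned earlier; note the swapped nx and ny in the inverse plan):

plan  = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz,
                               FFTW_FORWARD, FFTW_ESTIMATE);

/* the inverse of a transform whose output was left in
   FFTW_TRANSPOSED_ORDER uses the transposed (ny x nx x nz)
   dimensions: */
iplan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, ny, nx, nz,
                               FFTW_BACKWARD, FFTW_ESTIMATE);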
MPI transforms specialized for real data are also available, similar to the uniprocessor rfftwnd transforms. Just as in the uniprocessor case, the real-data MPI functions gain roughly a factor of two in speed (and save a factor of two in space) at the expense of more complicated data formats in the calling program. Before reading this section, you should definitely understand how to call the uniprocessor rfftwnd functions and also the complex MPI FFTW functions.
The following is an example of a program using rfftwnd_mpi. It computes the size nx x ny x nz transform of a real function f(x,y,z), multiplies the imaginary part by 2 for fun, then computes the inverse transform. (We'll also use FFTW_TRANSPOSED_ORDER output for the transform, and additionally supply the optional workspace parameter to rfftwnd_mpi, just to add a little spice.)
#include <rfftw_mpi.h>

int main(int argc, char **argv)
{
     const int nx = ..., ny = ..., nz = ...;
     int local_nx, local_x_start, local_ny_after_transpose,
         local_y_start_after_transpose, total_local_size;
     int x, y, z;
     rfftwnd_mpi_plan plan, iplan;
     fftw_real *data, *work;
     fftw_complex *cdata;

     MPI_Init(&argc,&argv);

     /* create the forward and backward plans: */
     plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
                                    nx, ny, nz,
                                    FFTW_REAL_TO_COMPLEX,
                                    FFTW_ESTIMATE);
     iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
     /* note transposition --> */    ny, nx, nz,
                                     FFTW_COMPLEX_TO_REAL,
                                     FFTW_ESTIMATE);

     rfftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                             &local_ny_after_transpose,
                             &local_y_start_after_transpose,
                             &total_local_size);

     data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* workspace is the same size as the data: */
     work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* initialize data to f(x,y,z): */
     for (x = 0; x < local_nx; ++x)
          for (y = 0; y < ny; ++y)
               for (z = 0; z < nz; ++z)
                    data[(x*ny + y) * (2*(nz/2+1)) + z]
                         = f(x + local_x_start, y, z);

     /* Now, compute the forward transform: */
     rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);

     /* the data is now complex, so typecast a pointer: */
     cdata = (fftw_complex*) data;

     /* multiply imaginary part by 2, for fun:
        (note that the data is transposed) */
     for (y = 0; y < local_ny_after_transpose; ++y)
          for (x = 0; x < nx; ++x)
               for (z = 0; z < (nz/2+1); ++z)
                    cdata[(y*nx + x) * (nz/2+1) + z].im *= 2.0;

     /* Finally, compute the inverse transform; the result
        is transposed back to the original data layout: */
     rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);

     free(data);
     free(work);
     rfftwnd_mpi_destroy_plan(plan);
     rfftwnd_mpi_destroy_plan(iplan);
     MPI_Finalize();
}
There's a lot of stuff in this example, but it's all just what you would have guessed, right? We replaced all the fftwnd_mpi* functions by rfftwnd_mpi*, but otherwise the parameters were pretty much the same. The data layout is distributed among the processes just like for the complex transforms (see Section MPI Data Layout), but in addition the final dimension is padded just like it is for the uniprocessor in-place real transforms (see Section Array Dimensions for Real Multi-dimensional Transforms). In particular, the z dimension of the real input data is padded to a size 2*(nz/2+1), and after the transform it contains nz/2+1 complex values.
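In other words, repeating the index arithmetic from the example above, element (x,y,z) of the real input and element (y,x,z) of the transposed complex output are accessed like this (a recap of the example, not a new interface):

/* real input, last dimension padded to 2*(nz/2+1) fftw_real values: */
fftw_real    in_xyz  = data[(x*ny + y) * (2*(nz/2+1)) + z];

/* complex output in FFTW_TRANSPOSED_ORDER, viewed through the
   typecast pointer cdata, with nz/2+1 complex values per z line: */
fftw_complex out_yxz = cdata[(y*nx + x) * (nz/2+1) + z];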
Some other important things to know about the real MPI transforms:
* The output ordering of the transform (FFTW_NORMAL_ORDER or FFTW_TRANSPOSED_ORDER) must be the same for both forward and backward transforms. (This is not required in the complex case.)
* total_local_size is the required size in fftw_real values, not fftw_complex values as it is for the complex transforms.
* local_ny_after_transpose and local_y_start_after_transpose describe the portion of the array after the transform; that is, they are indices in the complex array for an FFTW_REAL_TO_COMPLEX transform and in the real array for an FFTW_COMPLEX_TO_REAL transform.
* rfftwnd_mpi always expects fftw_real* array arguments, but of course these pointers can refer to either real or complex arrays, depending upon which side of the transform you are on. Just as for in-place uniprocessor real transforms (and also in the example above), this is most easily handled by typecasting to a complex pointer when handling the complex data.
* There are also rfftwnd_mpi_create_plan and rfftw2d_mpi_create_plan functions, analogous to the uniprocessor rfftwnd_create_plan and rfftw2d_create_plan functions, and any rank greater than one is supported.
Programs using the MPI FFTW real transforms should link with -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm on Unix.
The MPI FFTW also includes routines for parallel one-dimensional transforms of complex data (only). Although the speedup is generally worse than it is for the multi-dimensional routines,(4) these distributed-memory one-dimensional transforms are especially useful for performing one-dimensional transforms that don't fit into the memory of a single machine.
The usage of these routines is straightforward, and is similar to that of the multi-dimensional MPI transform functions. You first include the header <fftw_mpi.h> and then create a plan by calling:

fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n,
                                   fftw_direction dir, int flags);
The last three arguments are the same as for fftw_create_plan (except that all MPI transforms are automatically FFTW_IN_PLACE). The first argument specifies the group of processes you are using, and is usually MPI_COMM_WORLD (all processes). A plan can be used for many transforms of the same size, and is destroyed when you are done with it by calling fftw_mpi_destroy_plan(plan).
The transform itself is computed by:
void fftw_mpi(fftw_mpi_plan p, int n_fields,
              fftw_complex *local_data, fftw_complex *work);
n_fields, as in fftwnd_mpi, is equivalent to howmany=n_fields, stride=n_fields, and dist=1, and should be 1 when you are computing the transform of a single array. local_data contains the portion of the array local to the current process, described below. work is either NULL or an array exactly the same size as local_data; in the latter case, FFTW can use the MPI_Alltoall communications primitive which is (usually) faster at the expense of extra storage.
To find out what portion of the array is stored local to the current process, you call the following routine:
void fftw_mpi_local_sizes(fftw_mpi_plan p,
                          int *local_n,
                          int *local_start,
                          int *local_n_after_transform,
                          int *local_start_after_transform,
                          int *total_local_size);
total_local_size is the number of fftw_complex elements you should actually allocate for local_data (and work). local_n and local_start indicate that the current process stores local_n elements corresponding to the indices local_start to local_start+local_n-1 in the "real" array. After the transform, the process may store a different portion of the array. The portion of the data stored on the process after the transform is given by local_n_after_transform and local_start_after_transform.
Note that, if you compute both a forward and a backward transform of the same size, the local sizes are guaranteed to be consistent. That is, the local size after the forward transform will be the same as the local size before the backward transform, and vice versa.
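Putting the pieces together, a minimal one-dimensional MPI transform program might look something like the following sketch (the elided size and initialization are up to you; error checking is omitted):

#include <fftw_mpi.h>

int main(int argc, char **argv)
{
     const int N = ...;          /* length of the 1d transform */
     int local_n, local_start, local_n_after_transform,
         local_start_after_transform, total_local_size;
     fftw_mpi_plan plan;
     fftw_complex *data, *work;

     MPI_Init(&argc, &argv);

     plan = fftw_mpi_create_plan(MPI_COMM_WORLD, N,
                                 FFTW_FORWARD, FFTW_ESTIMATE);

     fftw_mpi_local_sizes(plan, &local_n, &local_start,
                          &local_n_after_transform,
                          &local_start_after_transform,
                          &total_local_size);

     data = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);
     work = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);

     ...initialize data[0..local_n-1], which corresponds to indices
        local_start..local_start+local_n-1 of the logical array...

     /* in-place transform; the optional workspace lets FFTW use
        MPI_Alltoall for communications */
     fftw_mpi(plan, 1, data, work);

     ...use the output, which now occupies local_n_after_transform
        elements starting at index local_start_after_transform...

     free(work);
     free(data);
     fftw_mpi_destroy_plan(plan);
     MPI_Finalize();
}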
There are several things you should consider in order to get the best performance out of the MPI FFTW routines.
First, if possible, the first and second dimensions of your data should be divisible by the number of processes you are using. (If only one can be divisible, then you should choose the first dimension.) This allows the computational load to be spread evenly among the processes, and also reduces the communications complexity and overhead. In the one-dimensional transform case, the size of the transform should ideally be divisible by the square of the number of processors.
Second, you should consider using the FFTW_TRANSPOSED_ORDER output format if it is not too burdensome. The speed gains from communications savings are usually substantial.
Third, you should consider allocating a workspace for (r)fftw(nd)_mpi, as this can often (but not always) improve performance (at the cost of extra storage).
Fourth, you should experiment with the best number of processors to use for your problem. (There comes a point of diminishing returns, when the communications costs outweigh the computational benefits.(5)) The fftw_mpi_test program can output helpful performance benchmarks. It accepts the same parameters as the uniprocessor test programs (cf. tests/README) and is run like an ordinary MPI program. For example, mpirun -np 4 fftw_mpi_test -s 128x128x128 will benchmark a 128x128x128 transform on four processors, reporting timings and parallel speedups for all variants of fftwnd_mpi (transposed, with workspace, etcetera). (Note also that there is the rfftw_mpi_test program for the real transforms.)