

MPI FFTW

This chapter describes the MPI FFTW routines for distributed-memory (and shared-memory) machines supporting MPI (Message Passing Interface). The MPI routines are significantly different from the ordinary FFTW because the transform data here are distributed over multiple processes, so that each process gets only a portion of the array. Currently, multi-dimensional transforms of both real and complex data, as well as one-dimensional transforms of complex data, are supported.

MPI FFTW Installation

The FFTW MPI library code is all located in the mpi subdirectory of the FFTW package (along with source code for test programs). On Unix systems, the FFTW MPI libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including --with-mpi in the flags to the configure script (see Section Installation on Unix).

The only requirement of the FFTW MPI code is that you have the standard MPI 1.1 (or later) libraries and header files installed on your system. A free implementation of MPI is available from the MPICH home page.

Previous versions of the FFTW MPI routines have had an unfortunate tendency to expose bugs in MPI implementations. The current version has been largely rewritten, and hopefully avoids some of these problems. If you run into difficulties, try passing the optional workspace to (r)fftwnd_mpi (see below), as this allows FFTW to use the standard (and hopefully well-tested) MPI_Alltoall primitive for communications. Please let us know (fftw@theory.lcs.mit.edu) how things work out.

Several test programs are included in the mpi directory. The ones most useful to you are probably the fftw_mpi_test and rfftw_mpi_test programs, which are run just like an ordinary MPI program and accept the same parameters as the other FFTW test programs (cf. tests/README). For example, mpirun ...params... fftw_mpi_test -r 0 will run non-terminating complex-transform correctness tests of random dimensions. The test programs can also run performance benchmarks.

Usage of MPI FFTW for Complex Multi-dimensional Transforms

Usage of the MPI FFTW routines is similar to that of the uniprocessor FFTW. We assume that the reader already understands the usage of the uniprocessor FFTW routines, described elsewhere in this manual. Some familiarity with MPI is also helpful.

A typical program performing a complex two-dimensional MPI transform might look something like:

#include <fftw_mpi.h>

int main(int argc, char **argv)
{
        const int NX = ..., NY = ...;
        fftwnd_mpi_plan plan;
        fftw_complex *data;

        MPI_Init(&argc,&argv);

        plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD,
                                      NX, NY,
                                      FFTW_FORWARD, FFTW_ESTIMATE);

        ...allocate and initialize data...

        fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);

        ...

        fftwnd_mpi_destroy_plan(plan);
        MPI_Finalize();
        return 0;
}

The calls to MPI_Init and MPI_Finalize are required in all MPI programs; see the MPI home page for more information. Note that all of your processes run the program in parallel, as a group; there is no explicit launching of threads/processes in an MPI program.

As in the ordinary FFTW, the first thing we do is to create a plan (of type fftwnd_mpi_plan), using:

fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
                                       int nx, int ny,
                                       fftw_direction dir, int flags);

Except for the first argument, the parameters are identical to those of fftw2d_create_plan. (There are also analogous fftwnd_mpi_create_plan and fftw3d_mpi_create_plan functions. Transforms of any rank greater than one are supported.) The first argument is an MPI communicator, which specifies the group of processes that are to be involved in the transform; the standard constant MPI_COMM_WORLD indicates all available processes.
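
For example, a three-dimensional plan could be created in either of two ways, as in the sketch below. (Only fftw2d_mpi_create_plan's prototype is given above, so the exact argument lists here are an assumption, by analogy with the uniprocessor fftw3d_create_plan and fftwnd_create_plan interfaces.)

        fftwnd_mpi_plan plan;
        int n[3];

        /* specialized three-dimensional interface: */
        plan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, NX, NY, NZ,
                                      FFTW_FORWARD, FFTW_ESTIMATE);

        /* ...or the general interface for transforms of any rank
           (argument order assumed by analogy with the uniprocessor
           fftwnd_create_plan): */
        n[0] = NX; n[1] = NY; n[2] = NZ;
        plan = fftwnd_mpi_create_plan(MPI_COMM_WORLD, 3, n,
                                      FFTW_FORWARD, FFTW_ESTIMATE);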

Next, one has to allocate and initialize the data. This is somewhat tricky, because the transform data is distributed across the processes involved in the transform; it is discussed in detail in the next section (see Section MPI Data Layout).

The actual computation of the transform is performed by the function fftwnd_mpi, which differs somewhat from its uniprocessor equivalent and is described by:

void fftwnd_mpi(fftwnd_mpi_plan p,
                int n_fields,
                fftw_complex *local_data, fftw_complex *work,
                fftwnd_mpi_output_order output_order);

There are several things to notice here:

First, all of the MPI transforms are effectively FFTW_IN_PLACE: the output is returned in the local_data array, overwriting the input.

Second, the n_fields parameter tells fftwnd_mpi to transform n_fields interleaved arrays at once; it is equivalent to howmany=n_fields, stride=n_fields, and dist=1 in the uniprocessor interface, and should be 1 when you are computing the transform of a single array.

Third, the work parameter is an optional workspace. If it is not NULL, it must be an array of exactly the same size as local_data; supplying it allows FFTW to use the MPI_Alltoall primitive for communications, which is (usually) faster at the expense of extra storage.

Fourth, the output_order parameter is either FFTW_NORMAL_ORDER or FFTW_TRANSPOSED_ORDER; in the latter case, the output is left with its first two dimensions transposed, which saves communications (see Section MPI Data Layout).

The output of fftwnd_mpi is identical to that of the corresponding uniprocessor transform. In particular, you should recall our conventions for normalization and the sign of the transform exponent.

The same plan can be used to compute many transforms of the same size. After you are done with it, you should deallocate it by calling fftwnd_mpi_destroy_plan.

Important: The FFTW MPI routines must be called in the same order by all processes involved in the transform. You should assume that they all are blocking, as if each contained a call to MPI_Barrier.

Programs using the FFTW MPI routines should be linked with -lfftw_mpi -lfftw -lm on Unix, in addition to whatever libraries are required for MPI.
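
For example, using the mpicc compiler wrapper that many MPI implementations provide (the wrapper name and any additional MPI link flags are implementation-dependent), the link step might look like:

        mpicc -o my_mpi_fft my_mpi_fft.c -lfftw_mpi -lfftw -lm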

MPI Data Layout

The transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over.

In particular, the array is divided according to the rows (first dimension) of the data: each process gets a subset of the rows of the data. (This is sometimes called a "slab decomposition.") One consequence of this is that you can't take advantage of more processors than you have rows (e.g. a 64x64x64 array can use at most 64 processors). This isn't usually much of a limitation, however, as each processor needs a fair amount of data in order for the parallel-computation benefits to outweigh the communications costs.

Below, the first dimension of the data will be referred to as `x' and the second dimension as `y'.

FFTW supplies a routine to tell you exactly how much data resides on the current process:

void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                            int *local_nx,
                            int *local_x_start,
                            int *local_ny_after_transpose,
                            int *local_y_start_after_transpose,
                            int *total_local_size);

Given a plan p, the other parameters of this routine are set to values describing the required data layout for the transform, as follows.

total_local_size is the number of fftw_complex elements that you must allocate for your local data (and workspace, if you choose). (This value should, of course, be multiplied by n_fields if that parameter to fftwnd_mpi is not 1.)

The data on the current process has local_nx rows, starting at row local_x_start. If fftwnd_mpi is called with FFTW_TRANSPOSED_ORDER output, then y will be the first dimension of the output, and the local y extent will be given by local_ny_after_transpose and local_y_start_after_transpose. Otherwise, the output has the same dimensions and layout as the input.

For instance, suppose you want to transform three-dimensional data of size nx x ny x nz. Then, the current process will store a subset of this data, of size local_nx x ny x nz, where the x indices correspond to the range local_x_start to local_x_start+local_nx-1 in the "real" (i.e. logical) array. If fftwnd_mpi is called with FFTW_TRANSPOSED_ORDER output, then the result will be a ny x nx x nz array, of which a local_ny_after_transpose x nx x nz subset is stored on the current process (corresponding to y values starting at local_y_start_after_transpose).

The following is an example of allocating such a three-dimensional array (local_data) before the transform and initializing it to some function f(x,y,z):

        int x, y, z;
        int local_nx, local_x_start, local_ny_after_transpose,
            local_y_start_after_transpose, total_local_size;
        fftw_complex *local_data;

        fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                               &local_ny_after_transpose,
                               &local_y_start_after_transpose,
                               &total_local_size);

        local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                            total_local_size);

        for (x = 0; x < local_nx; ++x)
                for (y = 0; y < ny; ++y)
                        for (z = 0; z < nz; ++z)
                                local_data[(x*ny + y)*nz + z]
                                        = f(x + local_x_start, y, z);
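
With FFTW_TRANSPOSED_ORDER output, the same logical element (x,y,z) of the ny x nx x nz result would instead be addressed as in the following sketch, which simply applies the row-major layout described above to the transposed dimensions:

        /* element (x,y,z) of the transposed output, valid when
           local_y_start_after_transpose <= y <
           local_y_start_after_transpose + local_ny_after_transpose: */
        local_data[((y - local_y_start_after_transpose)*nx + x)*nz + z]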

Some important things to remember:

Although the local data in the example above has dimensions local_nx x ny x nz, you should always allocate total_local_size elements rather than local_nx*ny*nz; these two numbers are not necessarily the same.

The amount of data on each process will not necessarily be the same; in fact, local_nx may even be zero for some processes. (For example, a 6x6 transform on four processes cannot use all four processes in a slab decomposition.)

All arrays are, of course, in row-major order (see Section Multi-dimensional Array Format).

If you want to compute the inverse transform of FFTW_TRANSPOSED_ORDER output, the inverse plan should be created with the transposed dimensions (ny x nx x nz in the example above), since the output of the forward transform becomes the input of the inverse transform.

The data layout depends only upon the dimensions of the array, not on the plan, so you are guaranteed that different plans for the same size (or inverse plans) will use the same (consistent) data layout.

Usage of MPI FFTW for Real Multi-dimensional Transforms

MPI transforms specialized for real data are also available, similar to the uniprocessor rfftwnd transforms. Just as in the uniprocessor case, the real-data MPI functions gain roughly a factor of two in speed (and save a factor of two in space) at the expense of more complicated data formats in the calling program. Before reading this section, you should definitely understand how to call the uniprocessor rfftwnd functions and also the complex MPI FFTW functions.

The following is an example of a program using rfftwnd_mpi. It computes the size nx x ny x nz transform of a real function f(x,y,z), multiplies the imaginary part by 2 for fun, then computes the inverse transform. (We'll also use FFTW_TRANSPOSED_ORDER output for the transform, and additionally supply the optional workspace parameter to rfftwnd_mpi, just to add a little spice.)

#include <rfftw_mpi.h>

int main(int argc, char **argv)
{
        const int nx = ..., ny = ..., nz = ...;
        int local_nx, local_x_start, local_ny_after_transpose,
            local_y_start_after_transpose, total_local_size;
        int x, y, z;
        rfftwnd_mpi_plan plan, iplan;
        fftw_real *data, *work;
        fftw_complex *cdata;

        MPI_Init(&argc,&argv);

        /* create the forward and backward plans: */
        plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
                                       nx, ny, nz,
                                       FFTW_REAL_TO_COMPLEX,
                                       FFTW_ESTIMATE);
        iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
          /* note transposition --> */  ny, nx, nz,
                                        FFTW_COMPLEX_TO_REAL,
                                        FFTW_ESTIMATE);

        rfftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                               &local_ny_after_transpose,
                               &local_y_start_after_transpose,
                               &total_local_size);

        data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

        /* workspace is the same size as the data: */
        work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

        /* initialize data to f(x,y,z): */
        for (x = 0; x < local_nx; ++x)
                for (y = 0; y < ny; ++y)
                        for (z = 0; z < nz; ++z)
                                data[(x*ny + y) * (2*(nz/2+1)) + z]
                                        = f(x + local_x_start, y, z);

        /* Now, compute the forward transform: */
        rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);

        /* the data is now complex, so typecast a pointer: */
        cdata = (fftw_complex*) data;
        
        /* multiply imaginary part by 2, for fun:
           (note that the data is transposed) */
        for (y = 0; y < local_ny_after_transpose; ++y)
                for (x = 0; x < nx; ++x)
                        for (z = 0; z < (nz/2+1); ++z)
                                cdata[(y*nx + x) * (nz/2+1) + z].im
                                        *= 2.0;

        /* Finally, compute the inverse transform; the result
           is transposed back to the original data layout: */
        rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);

        free(data);
        free(work);
        rfftwnd_mpi_destroy_plan(plan);
        rfftwnd_mpi_destroy_plan(iplan);
        MPI_Finalize();
        return 0;
}

There's a lot of stuff in this example, but it's all just what you would have guessed, right? We replaced all the fftwnd_mpi* functions by rfftwnd_mpi*, but otherwise the parameters were pretty much the same. The data layout is distributed among the processes just as for the complex transforms (see Section MPI Data Layout), but in addition the final dimension is padded just as it is for the uniprocessor in-place real transforms (see Section Array Dimensions for Real Multi-dimensional Transforms). In particular, the z dimension of the real input data is padded to a size 2*(nz/2+1), and after the transform it contains nz/2+1 complex values.

Some other important things to know about the real MPI transforms:

As for the complex transforms, the local sizes returned by rfftwnd_mpi_local_sizes (including total_local_size) are given in units of fftw_real, and the amount of data on each process will not necessarily be the same.

The optional work array, if it is not NULL, must be the same size as the data (total_local_size fftw_real elements).

The output ordering of the transform (FFTW_NORMAL_ORDER or FFTW_TRANSPOSED_ORDER) must be the same for the forward and backward transforms; with FFTW_TRANSPOSED_ORDER, the inverse plan is created with its first two dimensions transposed (ny x nx x nz in the example above).

Programs using the MPI FFTW real transforms should link with -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm on Unix.

Usage of MPI FFTW for Complex One-dimensional Transforms

The MPI FFTW also includes routines for parallel one-dimensional transforms of complex data (only). Although the speedup is generally worse than it is for the multi-dimensional routines,(4) these distributed-memory one-dimensional transforms are especially useful for performing one-dimensional transforms that don't fit into the memory of a single machine.

The usage of these routines is straightforward, and is similar to that of the multi-dimensional MPI transform functions. You first include the header <fftw_mpi.h> and then create a plan by calling:

fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm,
                                   int n, fftw_direction dir, int flags);

The last three arguments are the same as for fftw_create_plan (except that all MPI transforms are automatically FFTW_IN_PLACE). The first argument specifies the group of processes you are using, and is usually MPI_COMM_WORLD (all processes). A plan can be used for many transforms of the same size, and is destroyed when you are done with it by calling fftw_mpi_destroy_plan(plan).

The transform itself is computed by:

void fftw_mpi(fftw_mpi_plan p, int n_fields,
              fftw_complex *local_data, fftw_complex *work);

n_fields, as in fftwnd_mpi, is equivalent to howmany=n_fields, stride=n_fields, and dist=1, and should be 1 when you are computing the transform of a single array. local_data contains the portion of the array local to the current process, described below. work is either NULL or an array exactly the same size as local_data; in the latter case, FFTW can use the MPI_Alltoall communications primitive which is (usually) faster at the expense of extra storage.

To find out what portion of the array is stored local to the current process, you call the following routine:

void fftw_mpi_local_sizes(fftw_mpi_plan p,
                          int *local_n, int *local_start,
                          int *local_n_after_transform,
                          int *local_start_after_transform,
                          int *total_local_size);

total_local_size is the number of fftw_complex elements you should actually allocate for local_data (and work). local_n and local_start indicate that the current process stores local_n elements corresponding to the indices local_start to local_start+local_n-1 in the "real" array. After the transform, the process may store a different portion of the array. The portion of the data stored on the process after the transform is given by local_n_after_transform and local_start_after_transform.

Note that, if you compute both a forward and a backward transform of the same size, the local sizes are guaranteed to be consistent. That is, the local size after the forward transform will be the same as the local size before the backward transform, and vice versa.
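
Putting these routines together, a one-dimensional transform might look something like the following sketch, analogous to the multi-dimensional example earlier (initialization is elided, and N is a placeholder for your transform size):

#include <fftw_mpi.h>

int main(int argc, char **argv)
{
        const int N = ...;
        fftw_mpi_plan plan;
        fftw_complex *local_data, *work;
        int local_n, local_start, local_n_after_transform,
            local_start_after_transform, total_local_size;

        MPI_Init(&argc,&argv);

        plan = fftw_mpi_create_plan(MPI_COMM_WORLD, N,
                                    FFTW_FORWARD, FFTW_ESTIMATE);

        fftw_mpi_local_sizes(plan, &local_n, &local_start,
                             &local_n_after_transform,
                             &local_start_after_transform,
                             &total_local_size);

        local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                            total_local_size);
        /* optional workspace, exactly the same size as local_data: */
        work = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                      total_local_size);

        ...initialize local_data[0..local_n-1], which corresponds to
           indices local_start..local_start+local_n-1 of the array...

        fftw_mpi(plan, 1, local_data, work);

        free(work);
        free(local_data);
        fftw_mpi_destroy_plan(plan);
        MPI_Finalize();
        return 0;
}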

MPI Tips

There are several things you should consider in order to get the best performance out of the MPI FFTW routines.

First, if possible, the first and second dimensions of your data should be divisible by the number of processes you are using. (If only one of them can be divisible, then you should choose the first dimension.) This allows the computational load to be spread evenly among the processes, and also reduces the communications complexity and overhead. In the one-dimensional transform case, the size of the transform should ideally be divisible by the square of the number of processes.

Second, you should consider using the FFTW_TRANSPOSED_ORDER output format if it is not too burdensome. The speed gains from communications savings are usually substantial.
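
For a complex three-dimensional transform pair, this might look something like the following sketch (using fftw3d_mpi_create_plan as discussed earlier; note that the inverse plan is created with the first two dimensions swapped, since its input is the transposed output of the forward transform):

        plan  = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
        iplan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, ny, nx, nz,
                                       FFTW_BACKWARD, FFTW_ESTIMATE);

        /* forward transform, leaving the output transposed: */
        fftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);
        ...
        /* inverse transform, transposing the data back: */
        fftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);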

Third, you should consider allocating a workspace for (r)fftw(nd)_mpi, as this can often (but not always) improve performance (at the cost of extra storage).

Fourth, you should experiment with the best number of processes to use for your problem. (There comes a point of diminishing returns, when the communications costs outweigh the computational benefits.(5)) The fftw_mpi_test program can output helpful performance benchmarks. It accepts the same parameters as the uniprocessor test programs (cf. tests/README) and is run like an ordinary MPI program. For example, mpirun -np 4 fftw_mpi_test -s 128x128x128 will benchmark a 128x128x128 transform on four processes, reporting timings and parallel speedups for all variants of fftwnd_mpi (transposed, with workspace, etcetera). (Note also that there is the rfftw_mpi_test program for the real transforms.)

