Go to the first, previous, next, last section, table of contents.


Multi-threaded FFTW

In this chapter we document the parallel FFTW routines for shared-memory threads on SMP hardware. These routines, which support parellel one- and multi-dimensional transforms of both real and complex data, are the easiest way to take advantage of multiple processors with FFTW. They work just like the corresponding uniprocessor transform routines, except that they take the number of parallel threads to use as an extra parameter. Any program that uses the uniprocessor FFTW can be trivially modified to use the multi-threaded FFTW.

Installation and Supported Hardware/Software

All of the FFTW threads code is located in the threads subdirectory of the FFTW package. On Unix systems, the FFTW threads libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including --with-threads in the flags to the configure script (see Section Installation on Unix). (Note also that the threads routines, when enabled, are automatically tested by the `make check' self-tests.)

The threads routines require your operating system to have some sort of shared-memory threads support. Specifically the FFTW threads package works with POSIX threads (included with most versions of Unix, including Linux), Solaris threads, BeOS threads (tested on BeOS DR8.2), and Win32 threads (reported to work by users). (There is also untested code to use MacOS MP threads.) If you have a shared-memory machine that uses a different threads API, it should be a simple matter to include support for it; see the file fftw_threads-int.h for more detail.

SMP hardware is not required, although of course you need multiple processors to get any benefit from the multithreaded transforms.

Usage of Multi-threaded FFTW

Here, it is assumed that the reader is already familiar with the usage of the uniprocessor FFTW routines, described elsewhere in this manual. We only describe what one has to change in order to use the multi-threaded routines.

First, instead of including <fftw.h> or <rfftw.h>, you should include <fftw_threads.h> or <rfftw_threads.h>, respectively.

Second, before calling any FFTW routines, you should call the function:

int fftw_threads_init(void);

This function, which should only be called once (probably in your main() function), performs any one-time initialization required to use threads on your system. It returns zero if successful, and a non-zero value if there was an error (in which case, something is seriously wrong and you should probably exit the program).

Third, when you want to actually compute the transform, you should use one of the following transform routines instead of the ordinary FFTW functions:

All of these routines take exactly the same arguments and have exactly the same effects as their uniprocessor counterparts (i.e. without the `_threads') except that they take one extra parameter, nthreads (of type int), before the normal parameters.(3) The nthreads parameter specifies the number of threads of execution to use when performing the transform (actually, the maximum number of threads).

For example, to parallelize a single one-dimensional transform of complex data, instead of calling the uniprocessor fftw_one(plan, in, out), you would call fftw_threads_one(nthreads, plan, in, out). Passing an nthreads of 1 means to use only one thread (the main thread), and is equivalent to calling the uniprocessor routine. Passing an nthreads of 2 means that the transform is potentially parallelized over two threads (and two processors, if you have them), and so on.

These are the only changes you need to make to your source code. Calls to all other FFTW routines (plan creation, destruction, wisdom, etcetera) are not parallelized and remain the same. (The same plans and wisdom are used by both uniprocessor and multi-threaded transforms.) Your arrays are allocated and formatted in the same way, and so on.

Programs using the parallel complex transforms should be linked with -lfftw_threads -lfftw -lm on Unix. Programs using the parallel real transforms should be linked with -lrfftw_threads -lfftw_threads -lrfftw -lfftw -lm. You will also need to link with whatever library is responsible for threads on your system (e.g. -lpthread on Linux).

How Many Threads to Use?

There is a fair amount of overhead involved in spawning and synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have.

As a general rule, you don't want to use more threads than you have processors. (Using more threads will work, but there will be extra overhead with no benefit.) In fact, if the problem size is too small, you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of parallelization is best for your problem size. Useful tools to help you do this are the test programs that are automatically compiled along with the threads libraries, fftw_threads_test and rfftw_threads_test (in the threads subdirectory). These take the same arguments as the other FFTW test programs (see tests/README), except that they also take the number of threads to use as a first argument, and report the parallel speedup in speed tests. For example,

fftw_threads_test 2 -s 128x128

will benchmark complex 128x128 transforms using two threads and report the speedup relative to the uniprocessor transform.

For instance, on a 4-processor 200MHz Pentium Pro system running Linux 2.2.0, we found that the "crossover" point at which 2 threads became beneficial for complex transforms was about 4k points, while 4 threads became beneficial at 8k points.

Using Multi-threaded FFTW in a Multi-threaded Program

It is perfectly possible to use the multi-threaded FFTW routines from a multi-threaded program (e.g. have multiple threads computing multi-threaded transforms simultaneously). If you have the processors, more power to you! However, the same restrictions apply as for the uniprocessor FFTW routines (see Section Thread-safety). In particular, you should recall that you may not create or destroy plans in parallel.

Tips for Optimal Threading

Not all transforms are equally well-parallelized by the multi-threaded FFTW routines. (This is merely a consequence of laziness on the part of the implementors, and is not inherent to the algorithms employed.) Mainly, the limitations are in the parallel one-dimensional transforms. The things to avoid if you want optimal parallelization are as follows:

Parallelization deficiencies in one-dimensional transforms


Go to the first, previous, next, last section, table of contents.