In this chapter we document the parallel FFTW routines for shared-memory threads on SMP hardware. These routines, which support parallel one- and multi-dimensional transforms of both real and complex data, are the easiest way to take advantage of multiple processors with FFTW. They work just like the corresponding uniprocessor transform routines, except that they take the number of parallel threads to use as an extra parameter. Any program that uses the uniprocessor FFTW can be trivially modified to use the multi-threaded FFTW.
All of the FFTW threads code is located in the threads subdirectory of the FFTW package. On Unix systems, the FFTW threads libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including --with-threads in the flags to the configure script (see Section Installation on Unix). (Note also that the threads routines, when enabled, are automatically tested by the `make check' self-tests.)
The threads routines require your operating system to have some sort of shared-memory threads support. Specifically the FFTW threads package works with POSIX threads (included with most versions of Unix, including Linux), Solaris threads, BeOS threads (tested on BeOS DR8.2), and Win32 threads (reported to work by users). (There is also untested code to use MacOS MP threads.) If you have a shared-memory machine that uses a different threads API, it should be a simple matter to include support for it; see the file fftw_threads-int.h for more detail.
SMP hardware is not required, although of course you need multiple processors to get any benefit from the multithreaded transforms.
Here, it is assumed that the reader is already familiar with the usage of the uniprocessor FFTW routines, described elsewhere in this manual. We only describe what one has to change in order to use the multi-threaded routines.
First, instead of including <fftw.h> or <rfftw.h>, you should include <fftw_threads.h> or <rfftw_threads.h>, respectively.
Second, before calling any FFTW routines, you should call the function:
int fftw_threads_init(void);
This function, which should only be called once (probably in your main() function), performs any one-time initialization required to use threads on your system. It returns zero if successful, and a non-zero value if there was an error (in which case, something is seriously wrong and you should probably exit the program).
Third, when you want to actually compute the transform, you should use one of the following transform routines instead of the ordinary FFTW functions:
fftw_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
fftw_threads_one(nthreads, plan, in, out)
fftwnd_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
fftwnd_threads_one(nthreads, plan, in, out)
rfftw_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
rfftw_threads_one(nthreads, plan, in, out)
rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
rfftwnd_threads_real_to_complex_one(nthreads, plan, in, out)
rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
rfftwnd_threads_complex_to_real_one(nthreads, plan, in, out)
All of these routines take exactly the same arguments and have exactly the same effects as their uniprocessor counterparts (i.e. without the `_threads') except that they take one extra parameter, nthreads (of type int), before the normal parameters.(3) The nthreads parameter specifies the number of threads of execution to use when performing the transform (actually, the maximum number of threads).
For example, to parallelize a single one-dimensional transform of complex data, instead of calling the uniprocessor fftw_one(plan, in, out), you would call fftw_threads_one(nthreads, plan, in, out). Passing an nthreads of 1 means to use only one thread (the main thread), and is equivalent to calling the uniprocessor routine. Passing an nthreads of 2 means that the transform is potentially parallelized over two threads (and two processors, if you have them), and so on.
These are the only changes you need to make to your source code. Calls to all other FFTW routines (plan creation, destruction, wisdom, etcetera) are not parallelized and remain the same. (The same plans and wisdom are used by both uniprocessor and multi-threaded transforms.) Your arrays are allocated and formatted in the same way, and so on.
Programs using the parallel complex transforms should be linked with -lfftw_threads -lfftw -lm on Unix. Programs using the parallel real transforms should be linked with -lrfftw_threads -lfftw_threads -lrfftw -lfftw -lm. You will also need to link with whatever library is responsible for threads on your system (e.g. -lpthread on Linux).
There is a fair amount of overhead involved in spawning and synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have.
As a general rule, you don't want to use more threads than you have processors. (Using more threads will work, but there will be extra overhead with no benefit.) In fact, if the problem size is too small, you may want to use fewer threads than you have processors.
You will have to experiment with your system to see what level of parallelization is best for your problem size. Useful tools to help you do this are the test programs that are automatically compiled along with the threads libraries, fftw_threads_test and rfftw_threads_test (in the threads subdirectory). These take the same arguments as the other FFTW test programs (see tests/README), except that they also take the number of threads to use as a first argument, and report the parallel speedup in speed tests. For example,
fftw_threads_test 2 -s 128x128
will benchmark complex 128x128 transforms using two threads and report the speedup relative to the uniprocessor transform.
For instance, on a 4-processor 200MHz Pentium Pro system running Linux 2.2.0, we found that the "crossover" point at which 2 threads became beneficial for complex transforms was about 4k points, while 4 threads became beneficial at 8k points.
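One way to act on such measurements is to record the crossover sizes in a small table and look up the thread count before each transform. The following is a minimal sketch under stated assumptions: the helper choose_nthreads and the table are hypothetical, "4k" and "8k" are taken to mean 4096 and 8192 points, and the two entries describe only the Pentium Pro system mentioned above, so you would replace them with figures measured on your own machine.

```c
#include <assert.h>
#include <stddef.h>

/* One measured crossover: from min_points upward, nthreads pays off. */
struct crossover {
    size_t min_points;
    int nthreads;
};

/* Assumed values, copied from the single measurement quoted above.
 * Entries must be sorted by increasing min_points. */
static const struct crossover crossovers[] = {
    { 4096, 2 },
    { 8192, 4 },
};

/*
 * Hypothetical helper: pick a thread count for a transform of npoints
 * points on a machine with nprocs processors. Never returns more
 * threads than processors, and falls back to 1 for small problems.
 */
int choose_nthreads(size_t npoints, int nprocs)
{
    int best = 1;
    size_t i;

    assert(nprocs >= 1);

    for (i = 0; i < sizeof crossovers / sizeof crossovers[0]; i++) {
        if (npoints >= crossovers[i].min_points &&
            crossovers[i].nthreads <= nprocs)
            best = crossovers[i].nthreads;
    }
    return best;
}
```

The result would be passed as the nthreads argument of the transform routines. How nprocs is obtained is system-specific; many Unix systems offer sysconf(_SC_NPROCESSORS_ONLN), but the sketch deliberately leaves that choice to the caller.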
It is perfectly possible to use the multi-threaded FFTW routines from a multi-threaded program (e.g. have multiple threads computing multi-threaded transforms simultaneously). If you have the processors, more power to you! However, the same restrictions apply as for the uniprocessor FFTW routines (see Section Thread-safety). In particular, you should recall that you may not create or destroy plans in parallel.
Not all transforms are equally well-parallelized by the multi-threaded FFTW routines. (This is merely a consequence of laziness on the part of the implementors, and is not inherent to the algorithms employed.) Mainly, the limitations are in the parallel one-dimensional transforms. The things to avoid if you want optimal parallelization are as follows:
Single in-place transforms don't parallelize completely. (Multiple in-place transforms, i.e. howmany > 1, are fine.) Again, you should avoid these in any case if you want high performance, as they require transforming to a scratch array and copying back.
Single one-dimensional real (rfftw) transforms don't parallelize completely. This is unfortunate, but parallelizing this correctly would have involved a lot of extra code (and a much larger library). You still get some benefit from additional processors, but if you have a very large number of processors you will probably be better off using the parallel complex (fftw) transforms. Note that multi-dimensional real transforms or multiple one-dimensional real transforms are fine.