Cooley–Tukey FFT algorithm


The Cooley–Tukey algorithm, named after J. W. Cooley and John Tukey, is the most common fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of an arbitrary composite size N = N1N2 in terms of N1 smaller DFTs of sizes N2, recursively, to reduce the computation time to O(N log N) for highly composite N. Because of the algorithm's importance, specific variants and implementation styles have become known by their own names, as described below.
Because the Cooley–Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT. For example, Rader's or Bluestein's algorithm can be used to handle large prime factors that cannot be decomposed by Cooley–Tukey, or the prime-factor algorithm can be exploited for greater efficiency in separating out relatively prime factors.
The algorithm, along with its recursive application, was invented by Carl Friedrich Gauss. Cooley and Tukey independently rediscovered and popularized it 160 years later.

History

This algorithm, including its recursive application, was invented around 1805 by Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but his work was not widely recognized. Gauss did not analyze the asymptotic computational time, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries. FFTs became popular after James Cooley of IBM and John Tukey of Princeton published a paper in 1965 reinventing the algorithm and describing how to perform it conveniently on a computer.
Tukey reportedly came up with the idea during a meeting of President Kennedy's Science Advisory Committee discussing ways to detect nuclear-weapon tests in the Soviet Union by employing seismometers located outside the country. These sensors would generate seismological time series. However, analysis of this data would require fast algorithms for computing DFTs due to the number of sensors and length of time. This task was critical for the ratification of the proposed nuclear test ban so that any violations could be detected without the need to visit Soviet facilities. Another participant at that meeting, Richard Garwin of IBM, recognized the potential of the method and put Tukey in touch with Cooley. However, Garwin made sure that Cooley did not know the original purpose. Instead, Cooley was told that this was needed to determine periodicities of the spin orientations in a 3-D crystal of helium-3. Cooley and Tukey subsequently published their joint paper, and wide adoption quickly followed due to the simultaneous development of analog-to-digital converters capable of sampling at rates up to 300 kHz.
The fact that Gauss had described the same algorithm was not realized until several years after Cooley and Tukey's 1965 paper. Their paper cited as inspiration only the work by I. J. Good on what is now called the prime-factor FFT algorithm (PFA); although Good's algorithm was initially thought to be equivalent to the Cooley–Tukey algorithm, it was quickly realized that PFA is a quite different algorithm (it works only for sizes with relatively prime factors, unlike Cooley–Tukey's support for any composite size).

The radix-2 DIT case

A radix-2 decimation-in-time (DIT) FFT is the simplest and most common form of the Cooley–Tukey algorithm, although highly optimized Cooley–Tukey implementations typically use other forms of the algorithm as described below. Radix-2 DIT divides a DFT of size N into two interleaved DFTs of size N/2 with each recursive stage.
The discrete Fourier transform is defined by the formula:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk},$$

where k is an integer ranging from 0 to N−1.
Radix-2 DIT first computes the DFTs of the even-indexed inputs ($x_{2m} = x_0, x_2, \ldots, x_{N-2}$) and of the odd-indexed inputs ($x_{2m+1} = x_1, x_3, \ldots, x_{N-1}$), and then combines those two results to produce the DFT of the whole sequence. This idea can then be performed recursively to reduce the overall runtime to O(N log N). This simplified form assumes that N is a power of two; since the number of sample points N can usually be chosen freely by the application, this is often not an important restriction.
The radix-2 DIT algorithm rearranges the DFT of the function $x_n$ into two parts: a sum over the even-numbered indices $n = 2m$ and a sum over the odd-numbered indices $n = 2m+1$:

$$X_k = \sum_{m=0}^{N/2-1} x_{2m} e^{-\frac{2\pi i}{N}(2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N}(2m+1)k}$$

One can factor a common multiplier $e^{-\frac{2\pi i}{N}k}$ out of the second sum, as shown in the equation below. It is then clear that the two sums are the DFT of the even-indexed part $x_{2m}$ and the DFT of the odd-indexed part $x_{2m+1}$ of the function $x_n$. Denote the DFT of the Even-indexed inputs $x_{2m}$ by $E_k$ and the DFT of the Odd-indexed inputs $x_{2m+1}$ by $O_k$ and we obtain:

$$X_k = \underbrace{\sum_{m=0}^{N/2-1} x_{2m} e^{-\frac{2\pi i}{N/2} mk}}_{\text{DFT of even-indexed part of } x_n} + e^{-\frac{2\pi i}{N}k} \underbrace{\sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk}}_{\text{DFT of odd-indexed part of } x_n} = E_k + e^{-\frac{2\pi i}{N}k} O_k.$$

Note that the equalities hold for $k = 0, \ldots, N-1$, but the crux is that $E_k$ and $O_k$ are calculated in this way for $k = 0, \ldots, \frac{N}{2}-1$ only. Thanks to the periodicity of the complex exponential, $X_{k+\frac{N}{2}}$ is also obtained from $E_k$ and $O_k$:

$$X_{k+\frac{N}{2}} = E_{k+\frac{N}{2}} + e^{-\frac{2\pi i}{N}(k+\frac{N}{2})} O_{k+\frac{N}{2}} = E_k - e^{-\frac{2\pi i}{N}k} O_k,$$

using the facts that $E_k$ and $O_k$ are periodic with period N/2 and that $e^{-\frac{2\pi i}{N}(k+\frac{N}{2})} = e^{-\pi i} e^{-\frac{2\pi i}{N}k} = -e^{-\frac{2\pi i}{N}k}$. We can rewrite $X_k$ and $X_{k+\frac{N}{2}}$ as:

$$X_k = E_k + e^{-\frac{2\pi i}{N}k} O_k$$
$$X_{k+\frac{N}{2}} = E_k - e^{-\frac{2\pi i}{N}k} O_k$$
This result, expressing the DFT of length N recursively in terms of two DFTs of size N/2, is the core of the radix-2 DIT fast Fourier transform. The algorithm gains its speed by re-using the results of intermediate computations to compute multiple DFT outputs. Note that final outputs are obtained by a +/− combination of $E_k$ and $e^{-\frac{2\pi i}{N}k} O_k$, which is simply a size-2 DFT; when this is generalized to larger radices below, the size-2 DFT is replaced by a larger DFT.
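As a quick numerical check of these two relations, the following Python sketch (using numpy.fft.fft for the half-size DFTs; N = 8 is an arbitrary example size) combines $E_k$ and $O_k$ via the butterfly and compares against the full-size DFT:

import numpy as np

N = 8
x = np.random.default_rng(0).standard_normal(N)

E = np.fft.fft(x[0::2])                           # E_k: DFT of even-indexed inputs
O = np.fft.fft(x[1::2])                           # O_k: DFT of odd-indexed inputs
w = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors exp(-2*pi*i*k/N)

X = np.concatenate([E + w * O, E - w * O])        # butterfly: X_k and X_{k+N/2}
assert np.allclose(X, np.fft.fft(x))              # matches the direct size-N DFT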
This process is an example of the general technique of divide and conquer algorithms; in many conventional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadth-first fashion.
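For illustration, a breadth-first traversal might look like the following Python sketch (a minimal, unoptimized version assuming a power-of-two length; the initial reordering by bit-reversed indices is discussed under Variations below):

import cmath

def fft_breadth_first(x):
    # Iterative radix-2 DIT FFT; len(x) must be a power of 2.
    N = len(x)
    bits = N.bit_length() - 1
    # Reorder the input by bit-reversed indices so that each pass
    # combines adjacent sub-DFTs in place.
    X = [x[int(format(n, '0%db' % bits)[::-1], 2)] for n in range(N)]
    size = 2
    while size <= N:                      # passes: all size-2 DFTs, then size-4, ..., size-N
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, N, size):   # every block of this size across the array
            w = 1.0 + 0j
            for k in range(half):         # butterflies within the block
                p = X[start + k]
                q = w * X[start + k + half]
                X[start + k] = p + q
                X[start + k + half] = p - q
                w *= w_step
        size *= 2
    return X

All size-2 transforms are completed before any size-4 transform begins, which is the breadth-first order referred to above; a depth-first (recursive) version would instead finish one half of the array before touching the other.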
The above re-expression of a size-N DFT as two size-N/2 DFTs is sometimes called the Danielson–Lanczos lemma, since the identity was noted by those two authors in 1942. They applied their lemma in a "backwards" recursive fashion, repeatedly doubling the DFT size until the transform spectrum converged. The Danielson–Lanczos work predated widespread availability of mechanical or electronic computers and required manual calculation; they reported a computation time of 140 minutes for a size-64 DFT operating on real inputs to 3–5 significant digits. Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size-2048 complex DFT on an IBM 7094. Rescaling the time by the number of operations, this corresponds roughly to a speedup factor of around 800,000.

Pseudocode

In pseudocode, this procedure could be written as follows:
X0,...,N−1 ← ditfft2(x, N, s):             DFT of (x0, xs, x2s, ..., x(N−1)s):
    if N = 1 then
        X0 ← x0                                      trivial size-1 DFT base case
    else
        X0,...,N/2−1 ← ditfft2(x, N/2, 2s)           DFT of (x0, x2s, x4s, ..., x(N−2)s)
        XN/2,...,N−1 ← ditfft2(x+s, N/2, 2s)         DFT of (xs, xs+2s, xs+4s, ..., x(N−1)s)
        for k = 0 to N/2−1 do                        combine DFTs of two halves into full DFT:
            p ← Xk
            q ← exp(−2πi k/N) Xk+N/2
            Xk ← p + q
            Xk+N/2 ← p − q
        end for
    end if
Here, ditfft2(x, N, 1) computes X = DFT(x) out-of-place by a radix-2 DIT FFT, where N is an integer power of 2 and s = 1 is the stride of the input x array. x + s denotes the array starting with xs.
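For concreteness, the pseudocode translates into runnable Python along these lines (a direct, unoptimized sketch; list slicing stands in for the pointer offset x + s):

import cmath

def ditfft2(x, N, s):
    # DFT of x[0], x[s], x[2*s], ..., x[(N-1)*s]; N must be a power of 2.
    if N == 1:
        return [x[0]]                      # trivial size-1 DFT base case
    # DFTs of the even-indexed (stride 2s) and odd-indexed inputs.
    E = ditfft2(x, N // 2, 2 * s)
    O = ditfft2(x[s:], N // 2, 2 * s)      # the slice x[s:] plays the role of x + s
    X = [0j] * N
    for k in range(N // 2):                # combine DFTs of two halves
        p = E[k]
        q = cmath.exp(-2j * cmath.pi * k / N) * O[k]
        X[k] = p + q
        X[k + N // 2] = p - q
    return X

For example, ditfft2([1, 2, 3, 4], 4, 1) returns [10, −2+2j, −2, −2−2j] (within rounding), matching the DFT of that sequence.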
High-performance FFT implementations make many modifications to the implementation of such an algorithm compared to this simple pseudocode. For example, one can use a larger base case than N = 1 to amortize the overhead of recursion, the twiddle factors exp(−2πi k/N) can be precomputed, and larger radices are often used for cache reasons; these and other optimizations together can improve the performance by an order of magnitude or more. Several of these ideas are described in further detail below.
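As a sketch of the first two of these ideas (the cutoff of 8 and the helper names are arbitrary illustrations, not tuned or standard choices):

import cmath

def dft_direct(x):
    # Naive O(N^2) DFT, used as a larger base case to amortize recursion overhead.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft(x, base=8):
    # Radix-2 DIT FFT with one twiddle table shared by every recursion level.
    N = len(x)
    w = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]  # precomputed once

    def go(xs):
        M = len(xs)
        if M <= base:
            return dft_direct(xs)          # larger base case than M = 1
        E, O = go(xs[0::2]), go(xs[1::2])
        stride = N // M                    # size-M twiddles are every (N/M)-th table entry
        return ([E[k] + w[k * stride] * O[k] for k in range(M // 2)] +
                [E[k] - w[k * stride] * O[k] for k in range(M // 2)])

    return go(x)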

Idea

More generally, Cooley–Tukey algorithms recursively re-express a DFT of a composite size N = N1N2 as:
  1. Perform N1 DFTs of size N2.
  2. Multiply by complex roots of unity.
  3. Perform N2 DFTs of size N1.
Typically, either N1 or N2 is a small factor, called the radix. If N1 is the radix, it is called a decimation in time (DIT) algorithm, whereas if N2 is the radix, it is decimation in frequency (DIF). The version presented above was a radix-2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/− combination of the even and odd transforms is a size-2 DFT.
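In code, these three steps can be carried out directly; the following NumPy sketch (illustrative names, using numpy.fft.fft for the sub-transforms) computes a size-N DFT from an arbitrary factorization N = N1·N2 in the DIT convention:

import numpy as np

def ct_step(x, N1, N2):
    # One general Cooley-Tukey step for N = N1 * N2.
    N = N1 * N2
    A = x.reshape(N2, N1).T                       # A[n1, n2] = x[N1*n2 + n1]
    A = np.fft.fft(A, axis=1)                     # step 1: N1 DFTs of size N2
    n1 = np.arange(N1).reshape(N1, 1)
    k2 = np.arange(N2).reshape(1, N2)
    A = A * np.exp(-2j * np.pi * n1 * k2 / N)     # step 2: twiddle factors
    A = np.fft.fft(A, axis=0)                     # step 3: N2 DFTs of size N1
    return A.reshape(N)                           # row-major read-out: X[N2*k1 + k2]

x = np.arange(6.0)
assert np.allclose(ct_step(x, 2, 3), np.fft.fft(x))

Note the transposition implicit in the re-indexing: the input is read column-major (reshape then transpose) while the output is read row-major, matching the general factorization described below.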

Variations

There are many other variations on the Cooley–Tukey algorithm.
  • Mixed-radix implementations handle composite sizes with a variety of factors in addition to two, usually employing the O(N²) algorithm for the prime base cases of the recursion (it is also possible to employ an N log N algorithm for the prime base cases, such as Rader's or Bluestein's algorithm).
  • Split radix merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve what was long the lowest known arithmetic operation count for power-of-two sizes, although recent variations achieve an even lower count.
Another way of looking at the Cooley–Tukey algorithm is that it re-expresses a size N one-dimensional DFT as an N1 by N2 two-dimensional DFT, where the output matrix is transposed. The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input or output indices. If, instead of using a small radix, one employs a radix of roughly √N and explicit input/output matrix transpositions, it is called a four-step FFT algorithm, initially proposed to improve memory locality, e.g. for cache optimization or out-of-core operation, and was later shown to be an optimal cache-oblivious algorithm.
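The bit-reversal claim is easy to see for radix 2: recursively sorting the indices into evens and odds, exactly as the DIT recursion does, leaves them in bit-reversed order. A small Python illustration (deinterleave is a hypothetical helper name):

def deinterleave(idx):
    # Recursively split indices into even-indexed then odd-indexed halves,
    # mirroring the radix-2 DIT recursion.
    if len(idx) <= 1:
        return idx
    return deinterleave(idx[0::2]) + deinterleave(idx[1::2])

print(deinterleave(list(range(8))))
# [0, 4, 2, 6, 1, 5, 3, 7] -- the value at each position is that position's
# index with its three bits reversed, e.g. 1 = 001 lands where 100 = 4 was.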
The general Cooley–Tukey factorization rewrites the indices k and n as $k = N_2 k_1 + k_2$ and $n = N_1 n_2 + n_1$, respectively, where the indices $k_a$ and $n_a$ run from $0 \ldots N_a - 1$ (for $a$ of 1 or 2). That is, it re-indexes the input (n) and output (k) as N1 by N2 two-dimensional arrays in column-major and row-major order, respectively; the difference between these indexings is a transposition, as mentioned above. When this re-indexing is substituted into the DFT formula for nk, the $N_1 N_2 n_2 k_1$ cross term vanishes (its exponential $e^{-2\pi i n_2 k_1}$ is unity), and the remaining terms give

$$X_{N_2 k_1 + k_2} = \sum_{n_1=0}^{N_1-1} \left[ \left( \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1} e^{-\frac{2\pi i}{N_2} n_2 k_2} \right) e^{-\frac{2\pi i}{N} n_1 k_2} \right] e^{-\frac{2\pi i}{N_1} n_1 k_1}$$

where each inner sum is a DFT of size N2, each outer sum is a DFT of size N1, and the $e^{-\frac{2\pi i}{N} n_1 k_2}$ term bracketed with the inner sum is the twiddle factor.
An arbitrary radix r can be employed, as was shown by both Cooley and Tukey as well as Gauss. Cooley and Tukey originally assumed that the radix butterfly required O(r²) work and hence reckoned the complexity for a radix r to be $O(r^2 \cdot N/r \cdot \log_r N) = O(N (r/\log_2 r) \log_2 N)$; from calculation of values of $r/\log_2 r$ for integer values of r from 2 to 12 the optimal radix is found to be 3. This analysis was erroneous, however: the radix-butterfly is also a DFT and can be performed via an FFT algorithm in O(r log r) operations, hence the radix r actually cancels in the complexity $O(r \log r \cdot N/r \cdot \log_r N) = O(N \log N)$, and the optimal r is determined by more complicated considerations. In practice, quite large r are important in order to effectively exploit e.g. the large number of processor registers on modern processors, and even an unbounded radix $r = \sqrt{N}$ also achieves O(N log N) complexity and has theoretical and practical advantages for large N as mentioned above.
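Their mistaken optimum is easy to reproduce; a short Python check of $r/\log_2 r$ over r = 2, ..., 12 (illustrative only):

import math

costs = {r: r / math.log2(r) for r in range(2, 13)}   # Cooley and Tukey's cost model
print(min(costs, key=costs.get))   # 3, since 3/log2(3) ~ 1.89 beats 2/1 = 2 and 4/2 = 2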