A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT). Fourier analysis converts a signal from its original domain (often time or space) to a representation in the frequency domain and vice versa. The DFT is obtained by decomposing a sequence of values into components of different frequencies.^{[1]} This operation is useful in many fields, but computing it directly from the definition is often too slow to be practical. An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors.^{[2]} As a result, it manages to reduce the complexity of computing the DFT from , which arises if one simply applies the definition of DFT, to , where is the data size. The difference in speed can be enormous, especially for long data sets where N may be in the thousands or millions. In the presence of roundoff error, many FFT algorithms are much more accurate than evaluating the DFT definition directly or indirectly. There are many different FFT algorithms based on a wide range of published theories, from simple complexnumber arithmetic to group theory and number theory.
Fast Fourier transforms are widely used for applications in engineering, music, science, and mathematics. The basic ideas were popularized in 1965, but some algorithms had been derived as early as 1805.^{[1]} In 1994, Gilbert Strang described the FFT as "the most important numerical algorithm of our lifetime",^{[3]}^{[4]} and it was included in Top 10 Algorithms of 20th Century by the IEEE magazine Computing in Science & Engineering.^{[5]}
The bestknown FFT algorithms depend upon the factorization of N, but there are FFTs with O(N log N) complexity for all N, even for prime N. Many FFT algorithms only depend on the fact that is an Nth primitive root of unity, and thus can be applied to analogous transforms over any finite field, such as numbertheoretic transforms. Since the inverse DFT is the same as the DFT, but with the opposite sign in the exponent and a 1/N factor, any FFT algorithm can easily be adapted for it.
The development of fast algorithms for DFT can be traced to Carl Friedrich Gauss's unpublished work in 1805 when he needed it to interpolate the orbit of asteroids Pallas and Juno from sample observations.^{[6]}^{[7]} His method was very similar to the one published in 1965 by James Cooley and John Tukey, who are generally credited for the invention of the modern generic FFT algorithm. While Gauss's work predated even Joseph Fourier's results in 1822, he did not analyze the computation time and eventually used other methods to achieve his goal.
Between 1805 and 1965, some versions of FFT were published by other authors. Frank Yates in 1932 published his version called interaction algorithm, which provided efficient computation of Hadamard and Walsh transforms.^{[8]} Yates' algorithm is still used in the field of statistical design and analysis of experiments. In 1942, G. C. Danielson and Cornelius Lanczos published their version to compute DFT for xray crystallography, a field where calculation of Fourier transforms presented a formidable bottleneck.^{[9]}^{[10]} While many methods in the past had focused on reducing the constant factor for computation by taking advantage of "symmetries", Danielson and Lanczos realized that one could use the "periodicity" and apply a "doubling trick" to get runtime.^{[11]}
James Cooley and John Tukey published a more general version of FFT in 1965 that is applicable when N is composite and not necessarily a power of 2.^{[12]} Tukey came up with the idea during a meeting of President Kennedy's Science Advisory Committee where a discussion topic involved detecting nuclear tests by the Soviet Union by setting up sensors to surround the country from outside. To analyze the output of these sensors, an FFT algorithm would be needed. In discussion with Tukey, Richard Garwin recognized the general applicability of the algorithm not just to national security problems, but also to a wide range of problems including one of immediate interest to him, determining the periodicities of the spin orientations in a 3D crystal of Helium3.^{[13]} Garwin gave Tukey's idea to Cooley (both worked at IBM's Watson labs) for implementation.^{[14]} Cooley and Tukey published the paper in a relatively short time of six months.^{[15]} As Tukey did not work at IBM, the patentability of the idea was doubted and the algorithm went into the public domain, which, through the computing revolution of the next decade, made FFT one of the indispensable algorithms in digital signal processing.
Let x_{0}, ..., x_{N1} be complex numbers. The DFT is defined by the formula
where is a primitive Nth root of 1.
Evaluating this definition directly requires operations: there are N outputs X_{k}, and each output requires a sum of N terms. An FFT is any method to compute the same results in operations. All known FFT algorithms require ? operations, although there is no known proof that a lower complexity score is impossible.^{[16]}
To illustrate the savings of an FFT, consider the count of complex multiplications and additions for N=4096 data points. Evaluating the DFT's sums directly involves N^{2} complex multiplications and N(N  1) complex additions, of which operations can be saved by eliminating trivial operations such as multiplications by 1, leaving about 30 million operations. On the other hand, the radix2 CooleyTukey algorithm, for N a power of 2, can compute the same result with only (N/2)log_{2}(N) complex multiplications (again, ignoring simplifications of multiplications by 1 and similar) and N log_{2}(N) complex additions, in total about 30,000 operations  a thousand times less than with direct evaluation. In practice, actual performance on modern computers is usually dominated by factors other than the speed of arithmetic operations and the analysis is a complicated subject (for example, see Frigo & Johnson, 2005),^{[17]} but the overall improvement from to remains.
By far the most commonly used FFT is the CooleyTukey algorithm. This is a divide and conquer algorithm that recursively breaks down a DFT of any composite size N = N_{1}N_{2} into many smaller DFTs of sizes N_{1} and N_{2}, along with O(N) multiplications by complex roots of unity traditionally called twiddle factors (after Gentleman and Sande, 1966^{[18]}).
This method (and the general idea of an FFT) was popularized by a publication of Cooley and Tukey in 1965,^{[12]} but it was later discovered^{[1]} that those two authors had independently reinvented an algorithm known to Carl Friedrich Gauss around 1805^{[19]} (and subsequently rediscovered several times in limited forms).
The best known use of the CooleyTukey algorithm is to divide the transform into two pieces of size N/2 at each step, and is therefore limited to poweroftwo sizes, but any factorization can be used in general (as was known to both Gauss and Cooley/Tukey^{[1]}). These are called the radix2 and mixedradix cases, respectively (and other variants such as the splitradix FFT have their own names as well). Although the basic idea is recursive, most traditional implementations rearrange the algorithm to avoid explicit recursion. Also, because the CooleyTukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT, such as those described below.
There are FFT algorithms other than CooleyTukey. Cornelius Lanczos did pioneering work on the FFT and FFS (fast Fourier sampling method) with G. C. Danielson (1940).^{[]}
For N = N_{1}N_{2} with coprime N_{1} and N_{2}, one can use the primefactor (GoodThomas) algorithm (PFA), based on the Chinese remainder theorem, to factorize the DFT similarly to CooleyTukey but without the twiddle factors. The RaderBrenner algorithm (1976)^{[20]} is a CooleyTukeylike factorization but with purely imaginary twiddle factors, reducing multiplications at the cost of increased additions and reduced numerical stability; it was later superseded by the splitradix variant of CooleyTukey (which achieves the same multiplication count but with fewer additions and without sacrificing accuracy). Algorithms that recursively factorize the DFT into smaller operations other than DFTs include the Bruun and QFT algorithms. (The RaderBrenner^{[20]} and QFT algorithms were proposed for poweroftwo sizes, but it is possible that they could be adapted to general composite N. Bruun's algorithm applies to arbitrary even composite sizes.) Bruun's algorithm, in particular, is based on interpreting the FFT as a recursive factorization of the polynomial z^{N}  1, here into realcoefficient polynomials of the form z^{M}  1 and z^{2M} + az^{M} + 1.
Another polynomial viewpoint is exploited by the Winograd FFT algorithm,^{[21]}^{[22]} which factorizes z^{N}  1 into cyclotomic polynomialsthese often have coefficients of 1, 0, or 1, and therefore require few (if any) multiplications, so Winograd can be used to obtain minimalmultiplication FFTs and is often used to find efficient algorithms for small factors. Indeed, Winograd showed that the DFT can be computed with only O(N) irrational multiplications, leading to a proven achievable lower bound on the number of multiplications for poweroftwo sizes; unfortunately, this comes at the cost of many more additions, a tradeoff no longer favorable on modern processors with hardware multipliers. In particular, Winograd also makes use of the PFA as well as an algorithm by Rader for FFTs of prime sizes.
Rader's algorithm, exploiting the existence of a generator for the multiplicative group modulo prime N, expresses a DFT of prime size N as a cyclic convolution of (composite) size N  1, which can then be computed by a pair of ordinary FFTs via the convolution theorem (although Winograd uses other convolution methods). Another primesize FFT is due to L. I. Bluestein, and is sometimes called the chirpz algorithm; it also reexpresses a DFT as a convolution, but this time of the same size (which can be zeropadded to a power of two and evaluated by radix2 CooleyTukey FFTs, for example), via the identity
Hexagonal fast Fourier transform aims at computing an efficient FFT for the hexagonally sampled data by using a new addressing scheme for hexagonal grids, called Array Set Addressing (ASA).
In many applications, the input data for the DFT are purely real, in which case the outputs satisfy the symmetry
and efficient FFT algorithms have been designed for this situation (see e.g. Sorensen, 1987).^{[23]}^{[24]} One approach consists of taking an ordinary algorithm (e.g. CooleyTukey) and removing the redundant parts of the computation, saving roughly a factor of two in time and memory. Alternatively, it is possible to express an evenlength realinput DFT as a complex DFT of half the length (whose real and imaginary parts are the even/odd elements of the original real data), followed by O(N) postprocessing operations.
It was once believed that realinput DFTs could be more efficiently computed by means of the discrete Hartley transform (DHT), but it was subsequently argued that a specialized realinput DFT algorithm (FFT) can typically be found that requires fewer operations than the corresponding DHT algorithm (FHT) for the same number of inputs.^{[23]} Bruun's algorithm (above) is another method that was initially proposed to take advantage of real inputs, but it has not proved popular.
There are further FFT specializations for the cases of real data that have even/odd symmetry, in which case one can gain another factor of roughly two in time and memory and the DFT becomes the discrete cosine/sine transform(s) (DCT/DST). Instead of directly modifying an FFT algorithm for these cases, DCTs/DSTs can also be computed via FFTs of real data combined with O(N) pre and postprocessing.
Unsolved problem in computer science: What is the lower bound on the complexity of fast Fourier transform algorithms? Can they be faster than ? (more unsolved problems in computer science)

A fundamental question of longstanding theoretical interest is to prove lower bounds on the complexity and exact operation counts of fast Fourier transforms, and many open problems remain. It is not even rigorously proved whether DFTs truly require ?(N log N) (i.e., order N log N or greater) operations, even for the simple case of power of two sizes, although no algorithms with lower complexity are known. In particular, the count of arithmetic operations is usually the focus of such questions, although actual performance on modernday computers is determined by many other factors such as cache or CPU pipeline optimization.
Following work by Shmuel Winograd (1978),^{[21]} a tight ?(N) lower bound is known for the number of real multiplications required by an FFT. It can be shown that only irrational real multiplications are required to compute a DFT of poweroftwo length . Moreover, explicit algorithms that achieve this count are known (Heideman & Burrus, 1986;^{[25]} Duhamel, 1990^{[26]}). However, these algorithms require too many additions to be practical, at least on modern computers with hardware multipliers (Duhamel, 1990;^{[26]} Frigo & Johnson, 2005).^{[17]}
A tight lower bound is not known on the number of required additions, although lower bounds have been proved under some restrictive assumptions on the algorithms. In 1973, Morgenstern^{[27]} proved an ?(N log N) lower bound on the addition count for algorithms where the multiplicative constants have bounded magnitudes (which is true for most but not all FFT algorithms). This result, however, applies only to the unnormalized Fourier transform (which is a scaling of a unitary matrix by a factor of ), and does not explain why the Fourier matrix is harder to compute than any other unitary matrix (including the identity matrix) under the same scaling. Pan (1986)^{[28]} proved an ?(N log N) lower bound assuming a bound on a measure of the FFT algorithm's "asynchronicity", but the generality of this assumption is unclear. For the case of poweroftwo N, Papadimitriou (1979)^{[29]} argued that the number of complexnumber additions achieved by CooleyTukey algorithms is optimal under certain assumptions on the graph of the algorithm (his assumptions imply, among other things, that no additive identities in the roots of unity are exploited). (This argument would imply that at least real additions are required, although this is not a tight bound because extra additions are required as part of complexnumber multiplications.) Thus far, no published FFT algorithm has achieved fewer than complexnumber additions (or their equivalent) for poweroftwo N.
A third problem is to minimize the total number of real multiplications and additions, sometimes called the "arithmetic complexity" (although in this context it is the exact count and not the asymptotic complexity that is being considered). Again, no tight lower bound has been proven. Since 1968, however, the lowest published count for poweroftwo N was long achieved by the splitradix FFT algorithm, which requires real multiplications and additions for N > 1. This was recently reduced to (Johnson and Frigo, 2007;^{[16]} Lundy and Van Buskirk, 2007^{[30]}). A slightly larger count (but still better than split radix for N >= 256) was shown to be provably optimal for N satisfiability modulo theories problem solvable by brute force (Haynal & Haynal, 2011).^{[31]}
Most of the attempts to lower or prove the complexity of FFT algorithms have focused on the ordinary complexdata case, because it is the simplest. However, complexdata FFTs are so closely related to algorithms for related problems such as realdata FFTs, discrete cosine transforms, discrete Hartley transforms, and so on, that any improvement in one of these would immediately lead to improvements in the others (Duhamel & Vetterli, 1990).^{[32]}
All of the FFT algorithms discussed above compute the DFT exactly (i.e. neglecting floatingpoint errors). A few "FFT" algorithms have been proposed, however, that compute the DFT approximately, with an error that can be made arbitrarily small at the expense of increased computations. Such algorithms trade the approximation error for increased speed or other properties. For example, an approximate FFT algorithm by Edelman et al. (1999)^{[33]} achieves lower communication requirements for parallel computing with the help of a fast multipole method. A waveletbased approximate FFT by Guo and Burrus (1996)^{[34]} takes sparse inputs/outputs (time/frequency localization) into account more efficiently than is possible with an exact FFT. Another algorithm for approximate computation of a subset of the DFT outputs is due to Shentov et al. (1995).^{[35]} The Edelman algorithm works equally well for sparse and nonsparse data, since it is based on the compressibility (rank deficiency) of the Fourier matrix itself rather than the compressibility (sparsity) of the data. Conversely, if the data are sparsethat is, if only K out of N Fourier coefficients are nonzerothen the complexity can be reduced to O(K log(N) log(N/K)), and this has been demonstrated to lead to practical speedups compared to an ordinary FFT for N/K > 32 in a largeN example (N = 2^{22}) using a probabilistic approximate algorithm (which estimates the largest K coefficients to several decimal places).^{[36]}
Even the "exact" FFT algorithms have errors when finiteprecision floatingpoint arithmetic is used, but these errors are typically quite small; most FFT algorithms, e.g. CooleyTukey, have excellent numerical properties as a consequence of the pairwise summation structure of the algorithms. The upper bound on the relative error for the CooleyTukey algorithm is O(? log N), compared to O(?N^{3/2}) for the naïve DFT formula,^{[18]} where ? is the machine floatingpoint relative precision. In fact, the root mean square (rms) errors are much better than these upper bounds, being only O(? ) for CooleyTukey and O(? ) for the naïve DFT (Schatzman, 1996).^{[37]} These results, however, are very sensitive to the accuracy of the twiddle factors used in the FFT (i.e. the trigonometric function values), and it is not unusual for incautious FFT implementations to have much worse accuracy, e.g. if they use inaccurate trigonometric recurrence formulas. Some FFTs other than CooleyTukey, such as the RaderBrenner algorithm, are intrinsically less stable.
In fixedpoint arithmetic, the finiteprecision errors accumulated by FFT algorithms are worse, with rms errors growing as O for the CooleyTukey algorithm (Welch, 1969).^{[38]} Moreover, even achieving this accuracy requires careful attention to scaling to minimize loss of precision, and fixedpoint FFT algorithms involve rescaling at each intermediate stage of decompositions like CooleyTukey.
To verify the correctness of an FFT implementation, rigorous guarantees can be obtained in O(N log N) time by a simple procedure checking the linearity, impulseresponse, and timeshift properties of the transform on random inputs (Ergün, 1995).^{[39]}
As defined in the multidimensional DFT article, the multidimensional DFT
transforms an array x_{n} with a ddimensional vector of indices by a set of d nested summations (over for each j), where the division n/N, defined as , is performed elementwise. Equivalently, it is the composition of a sequence of d sets of onedimensional DFTs, performed along one dimension at a time (in any order).
This compositional viewpoint immediately provides the simplest and most common multidimensional DFT algorithm, known as the rowcolumn algorithm (after the twodimensional case, below). That is, one simply performs a sequence of d onedimensional FFTs (by any of the above algorithms): first you transform along the n_{1} dimension, then along the n_{2} dimension, and so on (or actually, any ordering works). This method is easily shown to have the usual O(N log N) complexity, where is the total number of data points transformed. In particular, there are N/N_{1} transforms of size N_{1}, etcetera, so the complexity of the sequence of FFTs is:
In two dimensions, the x_{k} can be viewed as an matrix, and this algorithm corresponds to first performing the FFT of all the rows (resp. columns), grouping the resulting transformed rows (resp. columns) together as another matrix, and then performing the FFT on each of the columns (resp. rows) of this second matrix, and similarly grouping the results into the final result matrix.
In more than two dimensions, it is often advantageous for cache locality to group the dimensions recursively. For example, a threedimensional FFT might first perform twodimensional FFTs of each planar "slice" for each fixed n_{1}, and then perform the onedimensional FFTs along the n_{1} direction. More generally, an asymptotically optimal cacheoblivious algorithm consists of recursively dividing the dimensions into two groups and that are transformed recursively (rounding if d is not even) (see Frigo and Johnson, 2005).^{[17]} Still, this remains a straightforward variation of the rowcolumn algorithm that ultimately requires only a onedimensional FFT algorithm as the base case, and still has O(N log N) complexity. Yet another variation is to perform matrix transpositions in between transforming subsequent dimensions, so that the transforms operate on contiguous data; this is especially important for outofcore and distributed memory situations where accessing noncontiguous data is extremely timeconsuming.
There are other multidimensional FFT algorithms that are distinct from the rowcolumn algorithm, although all of them have O(N log N) complexity. Perhaps the simplest nonrowcolumn FFT is the vectorradix FFT algorithm, which is a generalization of the ordinary CooleyTukey algorithm where one divides the transform dimensions by a vector of radices at each step. (This may also have cache benefits.) The simplest case of vectorradix is where all of the radices are equal (e.g. vectorradix2 divides all of the dimensions by two), but this is not necessary. Vector radix with only a single nonunit radix at a time, i.e. , is essentially a rowcolumn algorithm. Other, more complicated, methods include polynomial transform algorithms due to Nussbaumer (1977),^{[40]} which view the transform in terms of convolutions and polynomial products. See Duhamel and Vetterli (1990)^{[32]} for more information and references.
An O(N^{5/2}log N) generalization to spherical harmonics on the sphere S^{2} with N^{2} nodes was described by Mohlenkamp,^{[41]} along with an algorithm conjectured (but not proven) to have O(N^{2} log^{2}(N)) complexity; Mohlenkamp also provides an implementation in the libftsh library.^{[42]} A sphericalharmonic algorithm with O(N^{2}log N) complexity is described by Rokhlin and Tygert.^{[43]}
The fast folding algorithm is analogous to the FFT, except that it operates on a series of binned waveforms rather than a series of real or complex scalar values. Rotation (which in the FFT is multiplication by a complex phasor) is a circular shift of the component waveform.
Various groups have also published "FFT" algorithms for nonequispaced data, as reviewed in Potts et al. (2001).^{[44]} Such algorithms do not strictly compute the DFT (which is only defined for equispaced data), but rather some approximation thereof (a nonuniform discrete Fourier transform, or NDFT, which itself is often computed only approximately). More generally there are various other methods of spectral estimation.
The FFT is used in digital recording, sampling, additive synthesis and pitch correction software.^{[45]}
The FFT's importance derives from the fact that it has made working in the frequency domain equally computationally feasible as working in the temporal or spatial domain. Some of the important applications of the FFT include:^{[15]}^{[46]}
Language  Command/Method  Prerequisites 

R  stats::fft(x)  None 
Octave/MATLAB  fft(x)  None 
Python  fft.fft(x)  numpy 
Mathematica  Fourier[x]  None 
Julia  fft(A [,dims])  FFTW 
FFTrelated algorithms:
FFT implementations:
Other links:
journal=
(help)