The FFT
Via Matrix Factorizations
A Key to Designing High Performance Implementations
Charles Van Loan
Department of Computer Science
Cornell University
A High Level Perspective...
Blocking For Performance
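\[
A = \begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1q} \\
A_{21} & A_{22} & \cdots & A_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
\underbrace{A_{p1}}_{n_1} & \underbrace{A_{p2}}_{n_2} & \cdots & \underbrace{A_{pq}}_{n_q}
\end{bmatrix}
\begin{matrix} \}\ n_1 \\ \}\ n_2 \\ \vdots \\ \}\ n_q \end{matrix}
\]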
A well known strategy for high-performance Ax = b and Ax = λx
solvers.
Factoring for Performance
One way to execute a matrix-vector product
y = Fnx
when Fn = At · · · A2A1 is as follows:
y = x
for k = 1:t
    y = Ak y
end
A different factorization Fn = Ãt̃ · · · Ã1 would yield a different
algorithm.
The Discrete Fourier Transform (n = 8)
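\[
y = F_8 x =
\begin{bmatrix}
\omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} \\
\omega_8^{0} & \omega_8^{1} & \omega_8^{2} & \omega_8^{3} & \omega_8^{4} & \omega_8^{5} & \omega_8^{6} & \omega_8^{7} \\
\omega_8^{0} & \omega_8^{2} & \omega_8^{4} & \omega_8^{6} & \omega_8^{8} & \omega_8^{10} & \omega_8^{12} & \omega_8^{14} \\
\omega_8^{0} & \omega_8^{3} & \omega_8^{6} & \omega_8^{9} & \omega_8^{12} & \omega_8^{15} & \omega_8^{18} & \omega_8^{21} \\
\omega_8^{0} & \omega_8^{4} & \omega_8^{8} & \omega_8^{12} & \omega_8^{16} & \omega_8^{20} & \omega_8^{24} & \omega_8^{28} \\
\omega_8^{0} & \omega_8^{5} & \omega_8^{10} & \omega_8^{15} & \omega_8^{20} & \omega_8^{25} & \omega_8^{30} & \omega_8^{35} \\
\omega_8^{0} & \omega_8^{6} & \omega_8^{12} & \omega_8^{18} & \omega_8^{24} & \omega_8^{30} & \omega_8^{36} & \omega_8^{42} \\
\omega_8^{0} & \omega_8^{7} & \omega_8^{14} & \omega_8^{21} & \omega_8^{28} & \omega_8^{35} & \omega_8^{42} & \omega_8^{49}
\end{bmatrix} x
\]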
ω8 = cos(2π/8) − i · sin(2π/8)
The DFT Matrix In General...
If ωn = cos(2π/n) − i · sin(2π/n) then
\[
[F_n]_{pq} = \omega_n^{pq}
= \left(\cos(2\pi/n) - i\,\sin(2\pi/n)\right)^{pq}
= \cos(2pq\pi/n) - i\,\sin(2pq\pi/n)
\]
Fact:
\[
F_n^H F_n = n I_n
\]
Thus, $F_n/\sqrt{n}$ is unitary.
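The definition and the fact above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `dft_matrix` and `gram` are hypothetical and chosen for illustration, and the O(n^3) check is meant for small n only.

```python
import cmath

def dft_matrix(n):
    """Return F_n as a list of rows, with [F_n]_{pq} = omega_n^(p*q)."""
    omega = cmath.exp(-2j * cmath.pi / n)  # cos(2*pi/n) - i*sin(2*pi/n)
    return [[omega ** (p * q) for q in range(n)] for p in range(n)]

def gram(F):
    """Form F^H F by direct summation (O(n^3); for checking only)."""
    n = len(F)
    return [[sum(F[k][p].conjugate() * F[k][q] for k in range(n))
             for q in range(n)] for p in range(n)]

n = 8
F = dft_matrix(n)
G = gram(F)
# Largest deviation of F^H F from n * I_n; expected to be at roundoff level.
max_err = max(abs(G[p][q] - (n if p == q else 0))
              for p in range(n) for q in range(n))
print(max_err < 1e-12)
```

Dividing every entry of `F` by `n ** 0.5` would give the unitary scaling mentioned above.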
Data Sparse Matrices
An n-by-n matrix A is data sparse if it can be represented with
many fewer than $n^2$ numbers.
Example 1.
A has lots of zeros. (“Traditional Sparse”)
Example 2.
A is Toeplitz...
\[
A = \begin{bmatrix}
a & b & c & d \\
e & a & b & c \\
f & e & a & b \\
g & f & e & a
\end{bmatrix}
\]
More Examples of Data Sparse Matrices
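A is a Kronecker Product B ⊗ C, e.g.,
\[
A = \begin{bmatrix} b_{11}C & b_{12}C \\ b_{21}C & b_{22}C \end{bmatrix}
\]
If $B \in \mathbb{R}^{m_1 \times m_1}$ and $C \in \mathbb{R}^{m_2 \times m_2}$ then $A = B \otimes C$ has $m_1^2 m_2^2$
entries but is parameterized by just $m_1^2 + m_2^2$ numbers.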
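Data sparsity pays off when the product (B ⊗ C)x is computed from B and C alone. The sketch below assumes the standard identity (B ⊗ C) vec(X) = vec(C X B^T) with column-major vec; the function names `kron` and `kron_matvec` are hypothetical.

```python
def kron(B, C):
    """Form B (x) C explicitly; block (i1, j1) equals B[i1][j1] * C."""
    m1, m2 = len(B), len(C)
    return [[B[i // m2][j // m2] * C[i % m2][j % m2]
             for j in range(m1 * m2)] for i in range(m1 * m2)]

def kron_matvec(B, C, x):
    """Compute (B (x) C) x without forming the Kronecker product."""
    m1, m2 = len(B), len(C)
    # View x as the m2-by-m1 array X (column-major): X[i][j] = x[i + m2*j].
    # Step 1: W = C X.
    W = [[sum(C[i][k] * x[k + m2 * j] for k in range(m2))
          for j in range(m1)] for i in range(m2)]
    # Step 2: Y = W B^T, flattened column-major (j outer, i inner).
    return [sum(W[i][k] * B[j][k] for k in range(m1))
            for j in range(m1) for i in range(m2)]

B = [[1, 2], [3, 4]]
C = [[0, 1, 0], [1, 0, 2], [0, 3, 1]]
x = [1, -1, 2, 0, 3, 1]

A = kron(B, C)                      # 36 stored entries
y_explicit = [sum(A[i][j] * x[j] for j in range(6)) for i in range(6)]
y_implicit = kron_matvec(B, C, x)   # uses only the 4 + 9 entries of B and C
print(y_explicit == y_implicit)
```

The same reshaping idea is what lets the FFT factors of the form I ⊗ B be applied without ever being stored as full matrices.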
Extreme Data Sparsity
\[
A = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{\ell=1}^{n}
S(i, j, k, \ell) \cdot
\underbrace{(2\text{-by-}2) \otimes \cdots \otimes (2\text{-by-}2)}_{d \text{ times}}
\]
A is $2^d$-by-$2^d$ but is parameterized by $O(dn^4)$ numbers.
Factorization of Fn
The DFT matrix can be factored into a short product of sparse
matrices, e.g.,
\[
F_{1024} = A_{10} \cdots A_2 A_1 P_{1024}
\]
where each A-matrix has 2 nonzeros per row and $P_{1024}$ is a permutation.
From Factorization to Algorithm
If $n = 2^{10}$ and
\[
F_n = A_{10} \cdots A_2 A_1 P_n
\]
then
y = Pn x
for k = 1:10
    y = Ak y    ← 2n flops.
end
computes y = Fn x and requires O(n log n) flops.
Recursive Block Structure
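\[
F_8(:, [\,0\ 2\ 4\ 6\ 1\ 3\ 5\ 7\,]) =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4 & 0 \\ 0 & F_4 \end{bmatrix}
\]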
Fn/2 “shows up” when you permute the columns of Fn so that
the even-indexed columns (0, 2, 4, ...) come first.
Recursion...
We build an 8-point DFT from two 4-point DFTs...
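\[
F_8 x =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4\, x(0{:}2{:}7) \\ F_4\, x(1{:}2{:}7) \end{bmatrix}
\]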
Radix-2 FFT: Recursive Implementation
function y = fft(x, n)
if n = 1
    y = x
else
    m = n/2;  ω = exp(−2πi/n)
    Ω = diag(1, ω, . . . , ω^(m−1))
    zT = fft(x(0:2:n − 1), m)
    zB = Ω · fft(x(1:2:n − 1), m)
    y = [ Im  Im ; Im  −Im ] [ zT ; zB ]
end
Overall: 5n log n flops.
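The recursion above translates almost line for line into a runnable program. This is a minimal sketch in Python, assuming the input length is a power of two; `fft_recursive` and `dft_naive` are hypothetical names, and the naive O(n^2) DFT is included only as a reference for checking.

```python
import cmath

def fft_recursive(x):
    """Radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    m = n // 2
    omega = cmath.exp(-2j * cmath.pi / n)
    z_top = fft_recursive(x[0::2])                     # F_m x(0:2:n-1)
    z_bot = fft_recursive(x[1::2])                     # F_m x(1:2:n-1)
    z_bot = [omega ** k * z_bot[k] for k in range(m)]  # apply Omega
    # y = [I I; I -I] [z_top; z_bot]
    return ([z_top[k] + z_bot[k] for k in range(m)] +
            [z_top[k] - z_bot[k] for k in range(m)])

def dft_naive(x):
    """Reference DFT straight from the definition y = F_n x."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

x = [1, 2, 0, -1, 3, 0.5, -2, 4]
y_fast = fft_recursive(x)
y_ref = dft_naive(x)
err = max(abs(a - b) for a, b in zip(y_fast, y_ref))
print(err < 1e-12)
```

The slicing `x[0::2]` and `x[1::2]` copies data at every level, which keeps the sketch short but is not how a tuned implementation would manage memory.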
The Divide-and-Conquer Picture
[Figure: divide-and-conquer tree for n = 16. Each node splits into its even-position and odd-position halves.]
(0:1:15)
(0:2:15)  (1:2:15)
(0:4:15)  (2:4:15)  (1:4:15)  (3:4:15)
(0:8:15)  (4:8:15)  (2:8:15)  (6:8:15)  (1:8:15)  (5:8:15)  (3:8:15)  (7:8:15)
[0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Towards a Nonrecursive Implementation
The Radix-2 Factorization...
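If $n = 2m$ and
\[
\Omega_m = \mathrm{diag}(1, \omega_n, \ldots, \omega_n^{m-1}),
\]
then
\[
F_n \Pi_n =
\begin{bmatrix} F_m & \Omega_m F_m \\ F_m & -\Omega_m F_m \end{bmatrix}
=
\begin{bmatrix} I_m & \Omega_m \\ I_m & -\Omega_m \end{bmatrix}
(I_2 \otimes F_m),
\]
where $\Pi_n = I_n(:, [\,0{:}2{:}n-1 \ \ 1{:}2{:}n-1\,])$.
Note: $I_2 \otimes F_m = \begin{bmatrix} F_m & 0 \\ 0 & F_m \end{bmatrix}$.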
The Cooley-Tukey Factorization
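\[
n = 2^t, \qquad F_n = A_t \cdots A_1 P_n
\]
Pn = the n-by-n “bit reversal” permutation matrix
\[
A_q = I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q, \quad r = n/L
\]
\[
\Omega_{L/2} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L/2-1}),
\qquad \omega_L = \exp(-2\pi i/L)
\]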
The Bit Reversal Permutation
Reading the leaves of the divide-and-conquer tree for n = 16 from left to right gives the bit-reversed index order:
[0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Bit Reversal
x(0)  = x(0000)  →  x(0000) = x(0)
x(1)  = x(0001)  →  x(1000) = x(8)
x(2)  = x(0010)  →  x(0100) = x(4)
x(3)  = x(0011)  →  x(1100) = x(12)
x(4)  = x(0100)  →  x(0010) = x(2)
x(5)  = x(0101)  →  x(1010) = x(10)
x(6)  = x(0110)  →  x(0110) = x(6)
x(7)  = x(0111)  →  x(1110) = x(14)
x(8)  = x(1000)  →  x(0001) = x(1)
x(9)  = x(1001)  →  x(1001) = x(9)
x(10) = x(1010)  →  x(0101) = x(5)
x(11) = x(1011)  →  x(1101) = x(13)
x(12) = x(1100)  →  x(0011) = x(3)
x(13) = x(1101)  →  x(1011) = x(11)
x(14) = x(1110)  →  x(0111) = x(7)
x(15) = x(1111)  →  x(1111) = x(15)
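The table above can be generated by reversing the t-bit binary expansion of each index. The following sketch assumes n = 2^t; the names `bit_reverse` and `bit_reversal_permutation` are hypothetical.

```python
def bit_reverse(k, t):
    """Reverse the t-bit binary representation of the integer k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)   # move the lowest bit of k onto r
        k >>= 1
    return r

def bit_reversal_permutation(x):
    """Return P_n x, i.e. entry k of the result is x[bit_reverse(k)]."""
    n = len(x)
    t = n.bit_length() - 1       # assumes n is a power of two
    return [x[bit_reverse(k, t)] for k in range(n)]

order = bit_reversal_permutation(list(range(16)))
print(order)
# [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
```

Because reversing the bits twice restores the index, applying the permutation twice returns the original vector, so P_n is its own inverse and transpose.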
Butterfly Operations
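This matrix is block diagonal...
\[
A_q = I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q, \quad r = n/L
\]
r copies of things like this (an 8-by-8 block is shown; × marks a nonzero from ±Ω):
1 . . . × . . .
. 1 . . . × . .
. . 1 . . . × .
. . . 1 . . . ×
1 . . . × . . .
. 1 . . . × . .
. . 1 . . . × .
. . . 1 . . . ×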
At the Scalar Level...
[Figure: butterfly diagram. Inputs a and b, with b weighted by ω; outputs a + ωb (top) and a − ωb (bottom).]
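Putting the bit reversal and the scalar butterflies together gives a nonrecursive FFT that follows the Cooley-Tukey factorization Fn = At · · · A1 Pn. This is a minimal sketch, assuming n is a power of two; the function names are hypothetical and the twiddle factors are recomputed rather than tabulated to keep the code short.

```python
import cmath

def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_cooley_tukey(x):
    """Compute y = A_t ... A_1 P_n x with scalar butterflies."""
    n = len(x)
    t = n.bit_length() - 1                      # assumes n = 2^t
    y = [complex(x[bit_reverse(k, t)]) for k in range(n)]   # y = P_n x
    for q in range(1, t + 1):                   # y = A_q y
        L = 2 ** q
        half = L // 2
        omega_L = cmath.exp(-2j * cmath.pi / L)
        for start in range(0, n, L):            # r = n/L diagonal blocks
            for j in range(half):
                a = y[start + j]
                b = omega_L ** j * y[start + j + half]
                y[start + j] = a + b            # a + omega*b
                y[start + j + half] = a - b     # a - omega*b
    return y

def dft_naive(x):
    """Reference DFT from the definition, for checking only."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

x = [2, -1, 0.5, 3, 1, 0, -2, 4, 1, 1, 0, -3, 2, 5, -1, 0]
y = fft_cooley_tukey(x)
err = max(abs(a - b) for a, b in zip(y, dft_naive(x)))
print(err < 1e-11)
```

Swapping the order of the `start` and `j` loops gives the other butterfly loop ordering; the arithmetic is the same, only the memory access pattern changes.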
Signal Flow Graph (n = 8)
[Figure: signal flow graph for the n = 8 radix-2 FFT. Inputs appear in bit-reversed order (x0, x4, x2, x6, x1, x5, x3, x7) and outputs in natural order (y0, ..., y7). Three butterfly stages carry the weights ω8^0 (first stage), ω8^0 and ω8^2 (second stage), and ω8^0, ω8^1, ω8^2, ω8^3 (third stage).]
The Transposed Stockham Factorization
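If $n = 2^t$, then
\[
F_n = S_t \cdots S_2 S_1,
\]
where for q = 1:t the factor $S_q = A_q \Gamma_{q-1}$ is defined by
\begin{align*}
A_q &= I_r \otimes B_L, & L &= 2^q, \quad r = n/L, \\
\Gamma_{q-1} &= \Pi_{r_*} \otimes I_{L_*}, & L_* &= L/2, \quad r_* = 2r, \\
B_L &= \begin{bmatrix} I_{L_*} & \Omega_{L_*} \\ I_{L_*} & -\Omega_{L_*} \end{bmatrix}, \\
\Omega_{L_*} &= \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L_*-1}).
\end{align*}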
Perfect Shuffle
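\[
(\Pi_4 \otimes I_2)
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} x_0 \\ x_1 \\ x_4 \\ x_5 \\ x_2 \\ x_3 \\ x_6 \\ x_7 \end{bmatrix}
\]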
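The transposed Stockham factors can be applied directly to a vector: each step shuffles blocks of length L∗ and then applies butterflies, so no bit reversal is needed. The sketch below assumes n = 2^t and takes Πm to be the perfect shuffle Im(:, [0:2:m−1 1:2:m−1]); `shuffle_blocks` and `fft_transposed_stockham` are hypothetical names.

```python
import cmath

def shuffle_blocks(x, block):
    """Apply Pi_m (x) I_block, where m = len(x) // block."""
    m = len(x) // block
    out = []
    for i in range(m):
        src = i // 2 + (i % 2) * (m // 2)   # interleave the two halves
        out.extend(x[src * block:(src + 1) * block])
    return out

def fft_transposed_stockham(x):
    """Compute y = S_t ... S_1 x with S_q = A_q * Gamma_{q-1}."""
    n = len(x)
    t = n.bit_length() - 1                  # assumes n = 2^t
    y = [complex(v) for v in x]
    for q in range(1, t + 1):
        L = 2 ** q
        Ls = L // 2                         # L_* = L/2
        y = shuffle_blocks(y, Ls)           # Gamma_{q-1} = Pi_{r_*} (x) I_{L_*}
        omega_L = cmath.exp(-2j * cmath.pi / L)
        for start in range(0, n, L):        # A_q = I_r (x) B_L
            for j in range(Ls):
                a = y[start + j]
                b = omega_L ** j * y[start + j + Ls]
                y[start + j] = a + b
                y[start + j + Ls] = a - b
    return y

def dft_naive(x):
    """Reference DFT from the definition, for checking only."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

shuffled = shuffle_blocks(list(range(8)), 2)
print(shuffled)   # [0, 1, 4, 5, 2, 3, 6, 7]

x = [1, -2, 3, 0.5, 0, 4, -1, 2, 2, 2, -3, 1, 0, 1, 5, -4]
y = fft_transposed_stockham(x)
err = max(abs(a - b) for a, b in zip(y, dft_naive(x)))
print(err < 1e-11)
```

The shuffle here builds a new list at every step, which mirrors the usual description of Stockham-type algorithms as working between two arrays rather than in place.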
Cooley-Tukey Array Interpretation
Step q:
[Figure: columns 2k and 2k+1 of the L∗-by-r∗ array (L∗ = 2^(q−1), r∗ = n/L∗) are combined into column k of the L-by-r array (L = 2^q, r = n/L).]
Reshaping
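\[
x = \begin{bmatrix} \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \end{bmatrix}
\;\rightarrow\;
x_{2 \times 4} = \begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}
\]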
Transposed Stockham Array Interp
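\[
x^{(q-1)}_{L_* \times r_*} = F_{L_*}\, x^{T}_{r_* \times L_*}
\]
[Figure: an L∗-by-r∗ array with L∗ = 2^(q−1) rows and r∗ = n/L∗ columns; columns k and k+r are marked.]
\[
x^{(q)} = S_q x^{(q-1)}
\]
\[
x^{(q)}_{L \times r} = F_{L}\, x^{T}_{r \times L}
\]
[Figure: an L-by-r array with L = 2^q rows and r = n/L columns; column k is marked.]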
2 × 2 × 2 Basic Radix-2 Versions
Store intermediate DFTs by row or column.
Intermediate DFTs adjacent or not.
How the two butterfly loops are ordered.
\[
x = \left( I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix}
\right) x,
\qquad L = 2^q, \quad r = n/L
\]
The Gentleman-Sande Idea
It can be shown that $F_n^T = F_n$ and so if
\[
F_n = A_t \cdots A_1 P_n^T
\]
then
\[
F_n = F_n^T = P_n A_1^T \cdots A_t^T
\]
and we can compute y = Fn x as follows...
y = x
for k = t:−1:1
    y = Ak^T y
end
y = Pn y
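In scalar terms, the transposed butterfly B_L^T = [I I; Ω −Ω] maps (u, v) to (u + v, ω(u − v)), and the bit reversal moves to the end. The following is a minimal sketch under the assumption n = 2^t; the function names are hypothetical.

```python
import cmath

def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_gentleman_sande(x):
    """Compute y = P_n A_1^T ... A_t^T x."""
    n = len(x)
    t = n.bit_length() - 1                      # assumes n = 2^t
    y = [complex(v) for v in x]
    for q in range(t, 0, -1):                   # y = A_q^T y, q = t, ..., 1
        L = 2 ** q
        half = L // 2
        omega_L = cmath.exp(-2j * cmath.pi / L)
        for start in range(0, n, L):
            for j in range(half):
                u = y[start + j]
                v = y[start + j + half]
                y[start + j] = u + v
                y[start + j + half] = omega_L ** j * (u - v)
    return [y[bit_reverse(k, t)] for k in range(n)]   # y = P_n y

def dft_naive(x):
    """Reference DFT from the definition, for checking only."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

x = [3, 1, -4, 1, 5, -9, 2, 6]
y = fft_gentleman_sande(x)
err = max(abs(a - b) for a, b in zip(y, dft_naive(x)))
print(err < 1e-12)
```

Compared with the Cooley-Tukey ordering, the stages run from the largest block size down to the smallest, and the twiddle multiplication happens after the subtraction instead of before it.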
Convolution and Other Applications
From “problem space” to “DFT space” via
for k = t:−1:1
    x = Ak^T x
end
x = Pn x
Do your thing in DFT space. Then inverse transform back to
problem space via
x = Pn^T x
for k = 1:t
    x = conj(Ak) x
end
x = x/n
The factors are conjugated on the way back because the inverse of Fn is conj(Fn)/n.
Can avoid the Pn ops by working in “scrambled” DFT space.
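Circular convolution shows why the permutations can be skipped: pointwise multiplication does not care about the order of the entries, as long as both operands are scrambled the same way. The sketch below assumes n = 2^t and uses conjugated twiddle factors for the inverse; the names `to_scrambled_dft` and `from_scrambled_dft` are hypothetical.

```python
import cmath

def to_scrambled_dft(x):
    """Apply A_1^T ... A_t^T, giving F_n x in bit-reversed order (no P_n)."""
    y = [complex(v) for v in x]
    n = len(y)
    L = n
    while L >= 2:
        half = L // 2
        w = cmath.exp(-2j * cmath.pi / L)
        for start in range(0, n, L):
            for j in range(half):
                u, v = y[start + j], y[start + j + half]
                y[start + j] = u + v
                y[start + j + half] = w ** j * (u - v)
        L //= 2
    return y

def from_scrambled_dft(s):
    """Apply conj(A_t) ... conj(A_1) and divide by n (no P_n)."""
    y = list(s)
    n = len(y)
    L = 2
    while L <= n:
        half = L // 2
        w = cmath.exp(2j * cmath.pi / L)      # conjugated twiddle factor
        for start in range(0, n, L):
            for j in range(half):
                a = y[start + j]
                b = w ** j * y[start + j + half]
                y[start + j] = a + b
                y[start + j + half] = a - b
        L *= 2
    return [v / n for v in y]

def circular_convolution_direct(a, b):
    """Reference O(n^2) circular convolution."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]

a = [1, 2, 3, 4, 0, -1, 2, 5]
b = [2, 0, -1, 1, 3, 0, 0, 1]
sa, sb = to_scrambled_dft(a), to_scrambled_dft(b)
c_fast = from_scrambled_dft([p * q for p, q in zip(sa, sb)])
c_ref = circular_convolution_direct(a, b)
err = max(abs(u - v) for u, v in zip(c_fast, c_ref))
print(err < 1e-11)
```

Neither transform touches a bit-reversal table, which is the point of working in scrambled DFT space.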
Radix-4
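Can combine four quarter-length DFTs to produce a single full-length DFT:
\[
v =
\begin{bmatrix}
I & I & I & I \\
I & -iI & -I & iI \\
I & -I & I & -I \\
I & iI & -I & -iI
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix}
(a + c) + (b + d) \\
(a - c) - i(b - d) \\
(a + c) - (b + d) \\
(a - c) + i(b - d)
\end{bmatrix}
\]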
The radix-4 butterfly.
Better re-use of data.
Fewer flops. Radix-4 FFT is 4.25n log n (instead of 5n log n).
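The saving in the radix-4 butterfly comes from sharing the sums and differences a ± c and b ± d, and from the fact that multiplying by ±i needs no real multiplications. A scalar sketch, with the hypothetical name `radix4_butterfly`, checked against an explicit multiplication by F4:

```python
def radix4_butterfly(a, b, c, d):
    """Return F_4 applied to (a, b, c, d) using shared partial sums."""
    s0, s1 = a + c, a - c
    t0, t1 = b + d, b - d
    return (s0 + t0,          # (a + c) + (b + d)
            s1 - 1j * t1,     # (a - c) - i(b - d)
            s0 - t0,          # (a + c) - (b + d)
            s1 + 1j * t1)     # (a - c) + i(b - d)

# Explicit 4-point DFT matrix, used only as a reference.
F4 = [[1, 1, 1, 1],
      [1, -1j, -1, 1j],
      [1, -1, 1, -1],
      [1, 1j, -1, -1j]]

vals = (1 + 2j, -3j, 0.5, 2 - 1j)
direct = [sum(F4[p][q] * vals[q] for q in range(4)) for p in range(4)]
fast = radix4_butterfly(*vals)
err = max(abs(u - v) for u, v in zip(direct, fast))
print(err < 1e-12)
```

In a full radix-4 FFT the inputs b, c and d would first be scaled by twiddle factors; that step is left out here so the sketch matches the matrix shown above.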
Mixed Radix
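[Figure: splitting tree for n = 96. The length-96 DFT splits into four length-24 DFTs, and each length-24 DFT splits into three length-8 DFTs.]
96
24  24  24  24
8 8 8   8 8 8   8 8 8   8 8 8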
Multiple DFTs
Given: n1-by-n2 matrix X.
Multicolumn DFT Problem...
X ← Fn1 X
Multirow DFT Problem...
X ← XFn2
Blocked Multiple DFTs
X ← Fn1 X becomes
\[
[\, X_1 \mid X_2 \mid \cdots \mid X_p \,] \leftarrow
[\, F_{n_1} X_1 \mid F_{n_1} X_2 \mid \cdots \mid F_{n_1} X_p \,]
\]
The 4-Step Framework
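A matrix reshaping of the x ← Fn x operation when n = n1 n2:
\[
\begin{array}{ll}
x_{n_1 \times n_2} \leftarrow x_{n_1 \times n_2} F_{n_2} & \text{Multiple row DFT} \\
x_{n_1 \times n_2} \leftarrow F_n(0{:}n_1 - 1,\ 0{:}n_2 - 1) \mathbin{.*} x_{n_1 \times n_2} & \text{Pointwise multiply} \\
x_{n_2 \times n_1} \leftarrow x_{n_1 \times n_2}^{T} & \text{Transpose} \\
x_{n_2 \times n_1} \leftarrow x_{n_2 \times n_1} F_{n_1} & \text{Multiple row DFT}
\end{array}
\]
Can be arranged so communication is concentrated in the transpose step.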
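The four steps can be traced on a small example. This sketch assumes column-major reshaping, so that entry (j1, j2) of the n1-by-n2 array is x[j1 + n1*j2], and uses a naive DFT for the row transforms; `four_step_dft` and `dft` are hypothetical names.

```python
import cmath

def dft(x):
    """Naive DFT from the definition; stands in for any row-DFT routine."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

def four_step_dft(x, n1, n2):
    """Compute F_n x for n = n1*n2 via the 4-step framework."""
    n = n1 * n2
    # Reshape (column-major): X[j1][j2] = x[j1 + n1*j2].
    X = [[x[j1 + n1 * j2] for j2 in range(n2)] for j1 in range(n1)]
    # Step 1: X <- X F_{n2}   (DFT of each row; F is symmetric).
    X = [dft(row) for row in X]
    # Step 2: X <- F_n(0:n1-1, 0:n2-1) .* X   (pointwise twiddle scaling).
    X = [[cmath.exp(-2j * cmath.pi * j1 * k2 / n) * X[j1][k2]
          for k2 in range(n2)] for j1 in range(n1)]
    # Step 3: transpose to an n2-by-n1 array.
    Z = [[X[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # Step 4: Z <- Z F_{n1}   (DFT of each row).
    Z = [dft(row) for row in Z]
    # Flatten column-major: y[k2 + n2*k1] = Z[k2][k1].
    return [Z[k2][k1] for k1 in range(n1) for k2 in range(n2)]

x = [complex(k % 5, (3 * k) % 7) for k in range(12)]
y = four_step_dft(x, 3, 4)
err = max(abs(a - b) for a, b in zip(y, dft(x)))
print(err < 1e-11)
```

Steps 1, 2 and 4 act on rows independently, so in a distributed setting only the transpose in step 3 needs to move data between processors.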
Distributed Transpose: Example
Initial:
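\[
X = \begin{bmatrix}
X_{00} & X_{01} & X_{02} & X_{03} \\
X_{10} & X_{11} & X_{12} & X_{13} \\
X_{20} & X_{21} & X_{22} & X_{23} \\
X_{30} & X_{31} & X_{32} & X_{33}
\end{bmatrix}.
\]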
Transpose each block:
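\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{01}^T & X_{02}^T & X_{03}^T \\
X_{10}^T & X_{11}^T & X_{12}^T & X_{13}^T \\
X_{20}^T & X_{21}^T & X_{22}^T & X_{23}^T \\
X_{30}^T & X_{31}^T & X_{32}^T & X_{33}^T
\end{bmatrix}.
\]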
Now regard as 2-by-2 and block transpose each block:
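\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{02}^T & X_{12}^T \\
X_{01}^T & X_{11}^T & X_{03}^T & X_{13}^T \\
X_{20}^T & X_{30}^T & X_{22}^T & X_{32}^T \\
X_{21}^T & X_{31}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]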
Now do a 2-by-2 block transpose:
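\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{20}^T & X_{30}^T \\
X_{01}^T & X_{11}^T & X_{21}^T & X_{31}^T \\
X_{02}^T & X_{12}^T & X_{22}^T & X_{32}^T \\
X_{03}^T & X_{13}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]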
Factorization and Transpose
\[
x_{n \times m} \leftarrow x_{m \times n}^{T}
\]
corresponds to
\[
x \leftarrow P(m, n)\, x
\]
where P(m, n) is a perfect shuffle permutation, e.g.,
\[
P(3, 4) = I_{12}(:, [\,0\ 3\ 6\ 9\ 1\ 4\ 7\ 10\ 2\ 5\ 8\ 11\,])
\]
Different multi-pass transposition algorithms correspond to different factorizations of P(m, n).
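The link between transposition and index permutation can be seen on the 3-by-4 case. This sketch assumes column-major storage and uses the index vector as a gather, y[k] = x[v[k]]; whether that gather is written as P or as its transpose depends on the convention chosen, and `transpose_index_vector` is a hypothetical name.

```python
def transpose_index_vector(rows, cols):
    """Index vector v with y[k] = x[v[k]], where x holds a rows-by-cols
    array column-major and y holds its cols-by-rows transpose column-major."""
    # Entry (j, i) of the transpose sits at position j + cols*i in y
    # and equals entry (i, j) of the original, at position i + rows*j in x.
    return [i + rows * j for i in range(rows) for j in range(cols)]

v = transpose_index_vector(3, 4)
print(v)   # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]

# Check on a 3-by-4 array stored column-major.
rows, cols = 3, 4
x = [10 * i + j for j in range(cols) for i in range(rows)]   # X[i][j] = 10*i + j
y = [x[k] for k in v]
# y, read as a 4-by-3 column-major array Y, should satisfy Y[j][i] = X[i][j].
ok = all(y[j + cols * i] == 10 * i + j
         for i in range(rows) for j in range(cols))
print(ok)
```

A multi-pass transpose would realize the same index vector as a composition of simpler permutations, each with a more cache-friendly or communication-friendly access pattern.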
Two-Dimensional FFTs
If X is an n1-by-n2 matrix then its 2D DFT is
X ← Fn1 XFn2
Option 1.
X ← Fn1 X
X ← XFn2
Option 2. Assume $n_1 = n_2$ and $F_{n_1} = A_t \cdots A_1$.
for q = 1:t
    X ← Aq X Aq^T
end
Intermingling the column and row butterfly computations can
result in better locality.
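Option 1 is easy to check against the defining double sum. The sketch below uses a naive one-dimensional DFT as a stand-in for any FFT routine; `dft2` and `dft2_direct` are hypothetical names, and the intermingled Option 2 is not shown.

```python
import cmath

def dft(x):
    """Naive 1D DFT from the definition; any FFT could be used instead."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

def dft2(X):
    """Option 1: X <- F_{n1} X (all columns), then X <- X F_{n2} (all rows)."""
    n1, n2 = len(X), len(X[0])
    cols = [dft([X[i][j] for i in range(n1)]) for j in range(n2)]
    Y = [[cols[j][i] for j in range(n2)] for i in range(n1)]
    return [dft(row) for row in Y]

def dft2_direct(X):
    """Reference: entry (k1, k2) as a double sum over all of X."""
    n1, n2 = len(X), len(X[0])
    return [[sum(X[j1][j2]
                 * cmath.exp(-2j * cmath.pi * (j1 * k1 / n1 + j2 * k2 / n2))
                 for j1 in range(n1) for j2 in range(n2))
             for k2 in range(n2)] for k1 in range(n1)]

X = [[1, 2, 0, -1],
     [0, 3, 1, 2],
     [4, -2, 5, 0]]
Y = dft2(X)
Y_ref = dft2_direct(X)
err = max(abs(Y[i][j] - Y_ref[i][j]) for i in range(3) for j in range(4))
print(err < 1e-11)
```

Doing the row transforms first and the column transforms second gives the same result, since the two multiplications act on opposite sides of X.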
3-Dimensional DFTs
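Given X(1:n1, 1:n2, 1:n3), apply the DFT in each of the three dimensions.
If
x = reshape(X(1:n1, 1:n2, 1:n3), n1 n2 n3, 1)
then the problem is to compute
\[
x \leftarrow (F_{n_3} \otimes F_{n_2} \otimes F_{n_1})\, x
\]
i.e.,
\begin{align*}
x &\leftarrow (I_{n_3} \otimes I_{n_2} \otimes F_{n_1})\, x \\
x &\leftarrow (I_{n_3} \otimes F_{n_2} \otimes I_{n_1})\, x \\
x &\leftarrow (F_{n_3} \otimes I_{n_2} \otimes I_{n_1})\, x
\end{align*}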
d-Dimensional DFTs
Sample for d = 5:
µ     Array layout              Operation
1     X(α1, α2, α3, α4, α5)     F_{n1}
      X(α2, α3, α4, α5, α1)     Π^T_{n1,n}
2     X(α2, α3, α4, α5, α1)     F_{n2}
      X(α3, α4, α5, α1, α2)     Π^T_{n2,n}
3     X(α3, α4, α5, α1, α2)     F_{n3}
      X(α4, α5, α1, α2, α3)     Π^T_{n3,n}
4     X(α4, α5, α1, α2, α3)     F_{n4}
      X(α5, α1, α2, α3, α4)     Π^T_{n4,n}
5     X(α5, α1, α2, α3, α4)     F_{n5}
      X(α1, α2, α3, α4, α5)     Π^T_{n5,n}
Intermingling of component DFTs and tensor transpositions.
References
FFTW: http:[Link]
C. Van Loan (1992). Computational Frameworks for the Fast
Fourier Transform, SIAM Publications, Philadelphia, PA.