
FFT Matrix Factorization Techniques

The document discusses efficient implementations of the discrete Fourier transform (DFT) using matrix factorizations. It describes how the DFT matrix can be factored into sparse matrices involving block diagonal matrices and permutation matrices. This factorization leads to an algorithm for computing the DFT in O(n log n) operations using a divide-and-conquer approach, by recursively breaking the problem into smaller DFT subproblems. The Cooley-Tukey algorithm is presented as an efficient non-recursive implementation of this approach.

Uploaded by

Olimpiu Stoicuta

The FFT

Via Matrix Factorizations


A Key to Designing High Performance Implementations

Charles Van Loan


Department of Computer Science
Cornell University
A High Level Perspective...
Blocking For Performance

 
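\[
A = \begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1q} \\
A_{21} & A_{22} & \cdots & A_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
A_{p1} & A_{p2} & \cdots & A_{pq}
\end{bmatrix}
\]

The braces label the block row heights and block column widths $n_1, n_2, \ldots, n_q$.

A well known strategy for high-performance $Ax = b$ and $Ax = \lambda x$ solvers.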
Factoring for Performance

One way to execute a matrix-vector product

$y = F_n x$

when $F_n = A_t \cdots A_2 A_1$ is as follows:

y = x
for k = 1:t
    y = A_k y
end

A different factorization $F_n = \tilde{A}_{\tilde{t}} \cdots \tilde{A}_1$ would yield a different algorithm.
The Discrete Fourier Transform (n = 8)

 
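\[
y = F_8 x =
\begin{bmatrix}
\omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} \\
\omega_8^{0} & \omega_8^{1} & \omega_8^{2} & \omega_8^{3} & \omega_8^{4} & \omega_8^{5} & \omega_8^{6} & \omega_8^{7} \\
\omega_8^{0} & \omega_8^{2} & \omega_8^{4} & \omega_8^{6} & \omega_8^{8} & \omega_8^{10} & \omega_8^{12} & \omega_8^{14} \\
\omega_8^{0} & \omega_8^{3} & \omega_8^{6} & \omega_8^{9} & \omega_8^{12} & \omega_8^{15} & \omega_8^{18} & \omega_8^{21} \\
\omega_8^{0} & \omega_8^{4} & \omega_8^{8} & \omega_8^{12} & \omega_8^{16} & \omega_8^{20} & \omega_8^{24} & \omega_8^{28} \\
\omega_8^{0} & \omega_8^{5} & \omega_8^{10} & \omega_8^{15} & \omega_8^{20} & \omega_8^{25} & \omega_8^{30} & \omega_8^{35} \\
\omega_8^{0} & \omega_8^{6} & \omega_8^{12} & \omega_8^{18} & \omega_8^{24} & \omega_8^{30} & \omega_8^{36} & \omega_8^{42} \\
\omega_8^{0} & \omega_8^{7} & \omega_8^{14} & \omega_8^{21} & \omega_8^{28} & \omega_8^{35} & \omega_8^{42} & \omega_8^{49}
\end{bmatrix} x
\]

$\omega_8 = \cos(2\pi/8) - i \sin(2\pi/8)$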
The DFT Matrix In General...

If $\omega_n = \cos(2\pi/n) - i \sin(2\pi/n)$ then

\[
[F_n]_{pq} = \omega_n^{pq}
= \left(\cos(2\pi/n) - i \sin(2\pi/n)\right)^{pq}
= \cos(2pq\pi/n) - i \sin(2pq\pi/n)
\]

Fact:
\[
F_n^H F_n = n I_n
\]

Thus, $F_n/\sqrt{n}$ is unitary.
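The definition and the unitarity fact can be checked numerically. The following is a minimal Python sketch, assuming 0-based indices $p, q = 0, \ldots, n-1$ as in the $n = 8$ example; the helper names `dft_matrix` and `gram` are illustrative, not part of the slides.

```python
import cmath

def dft_matrix(n):
    """Return F_n as a list of rows, with [F_n]_{pq} = omega_n^(p*q)."""
    omega = cmath.exp(-2j * cmath.pi / n)   # cos(2*pi/n) - i*sin(2*pi/n)
    return [[omega ** (p * q) for q in range(n)] for p in range(n)]

def gram(F):
    """Return F^H F, the conjugate transpose of F times F."""
    n = len(F)
    return [[sum(F[k][i].conjugate() * F[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

F8 = dft_matrix(8)
G = gram(F8)
# Largest deviation of F_8^H F_8 from 8 * I_8 (should be rounding noise).
max_err = max(abs(G[i][j] - (8 if i == j else 0))
              for i in range(8) for j in range(8))
print(max_err < 1e-12)
```

Dividing every entry of `F8` by `8 ** 0.5` would give the unitary matrix mentioned above.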
Data Sparse Matrices

An n-by-n matrix $A$ is data sparse if it can be represented with many fewer than $n^2$ numbers.

Example 1.
A has lots of zeros. (“Traditional Sparse”)

Example 2.
A is Toeplitz...
\[
A = \begin{bmatrix}
a & b & c & d \\
e & a & b & c \\
f & e & a & b \\
g & f & e & a
\end{bmatrix}
\]
More Examples of Data Sparse Matrices

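A is a Kronecker Product $B \otimes C$, e.g.,

\[
A = \begin{bmatrix} b_{11} C & b_{12} C \\ b_{21} C & b_{22} C \end{bmatrix}
\]

If $B \in \mathbb{R}^{m_1 \times m_1}$ and $C \in \mathbb{R}^{m_2 \times m_2}$ then $A = B \otimes C$ has $m_1^2 m_2^2$ entries but is parameterized by just $m_1^2 + m_2^2$ numbers.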
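As a small illustration of the entry count versus the parameter count, here is a Python sketch of the Kronecker product for matrices stored as lists of rows. The function name `kron` and the sample matrices are assumptions made for this example.

```python
def kron(B, C):
    """Kronecker product B (x) C: block (i, j) of the result is B[i][j] * C."""
    return [[b * c for b in brow for c in crow]
            for brow in B for crow in C]

B = [[1, 2],
     [3, 4]]          # m1 = 2, so m1^2 = 4 parameters
C = [[0, 1],
     [1, 0]]          # m2 = 2, so m2^2 = 4 parameters

A = kron(B, C)
entries = len(A) * len(A[0])        # m1^2 * m2^2 entries in A
parameters = 4 + 4                  # m1^2 + m2^2 numbers define A
for row in A:
    print(row)
```

Here `A` is 4-by-4 (16 entries) but is fully determined by the 8 numbers in `B` and `C`.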
Extreme Data Sparsity

\[
A = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{\ell=1}^{n}
S(i, j, k, \ell) \cdot
\underbrace{(2\text{-by-}2) \otimes \cdots \otimes (2\text{-by-}2)}_{d \text{ times}}
\]

$A$ is $2^d$-by-$2^d$ but is parameterized by $O(d n^4)$ numbers.


Factorization of Fn

The DFT matrix can be factored into a short product of sparse matrices, e.g.,

$F_{1024} = A_{10} \cdots A_2 A_1 P_{1024}$

where each A-matrix has 2 nonzeros per row and $P_{1024}$ is a permutation.
From Factorization to Algorithm

If $n = 2^{10}$ and
$F_n = A_{10} \cdots A_2 A_1 P_n$
then

y = P_n x
for k = 1:10
    y = A_k y      ← 2n flops
end

computes $y = F_n x$ and requires $O(n \log n)$ flops.
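A worked count for this case, taking the per-factor cost of $2n$ flops at face value and, as a rough comparison that is not on the slide, assuming a dense matrix-vector product costs on the order of $n^2$ multiply-add pairs:

\[
\underbrace{10}_{\log_2 n} \cdot \, 2n = 20n = 20 \cdot 1024 = 20{,}480
\qquad \text{versus} \qquad
n^2 = 1024^2 = 1{,}048{,}576 .
\]

In general the loop runs $\log_2 n$ times, so the total is $2n \log_2 n$ under the same counting convention.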


Recursive Block Structure

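\[
F_8(:, [\,0\ 2\ 4\ 6\ 1\ 3\ 5\ 7\,]) =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4 & 0 \\ 0 & F_4 \end{bmatrix}
\]

$F_{n/2}$ "shows up" when you permute the columns of $F_n$ so that the even-indexed columns (0, 2, 4, ...) come first.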
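The block structure can be confirmed numerically for $n = 8$. This Python sketch builds both sides of the identity, with columns 0, 2, 4, 6 placed before columns 1, 3, 5, 7; the helper names are illustrative assumptions.

```python
import cmath

def dft_matrix(n):
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (p * q) for q in range(n)] for p in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

n, m = 8, 4
F8, F4 = dft_matrix(n), dft_matrix(m)

cols = list(range(0, n, 2)) + list(range(1, n, 2))   # [0 2 4 6 1 3 5 7]
lhs = [[row[c] for c in cols] for row in F8]         # F8(:, cols)

w8 = cmath.exp(-2j * cmath.pi / n)
# Butterfly matrix [[I, Omega], [I, -Omega]], Omega = diag(1, w8, w8^2, w8^3).
B = [[0] * n for _ in range(n)]
for j in range(m):
    B[j][j] = B[j + m][j] = 1
    B[j][j + m] = w8 ** j
    B[j + m][j + m] = -(w8 ** j)

# Block diagonal matrix with two copies of F4.
D = [[0] * n for _ in range(n)]
for i in range(m):
    for j in range(m):
        D[i][j] = D[i + m][j + m] = F4[i][j]

rhs = matmul(B, D)
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(n) for j in range(n))
print(err < 1e-12)
```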
Recursion...

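We build an 8-point DFT from two 4-point DFTs...

\[
F_8 x =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4 \, x(0{:}2{:}7) \\ F_4 \, x(1{:}2{:}7) \end{bmatrix}
\]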
Radix-2 FFT: Recursive Implementation

function y = fft(x, n)
    if n = 1
        y = x
    else
        m = n/2; ω = exp(−2πi/n)
        Ω = diag(1, ω, . . . , ω^(m−1))
        zT = fft(x(0:2:n−1), m)
        zB = Ω · fft(x(1:2:n−1), m)
        y = [ I_m  I_m ; I_m  −I_m ] [ zT ; zB ]
    end

Overall: 5n log_2 n flops.
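The recursive radix-2 procedure translates almost line for line into Python. This is a minimal sketch that assumes the input length is a power of two; `fft_recursive` and the reference routine `dft_naive` are names chosen for this example.

```python
import cmath

def dft_naive(x):
    """Reference O(n^2) DFT: y_p = sum_q x_q * omega_n^(p*q)."""
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

def fft_recursive(x):
    """Radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    m = n // 2
    z_top = fft_recursive(x[0::2])          # DFT of even-indexed entries
    z_bot = fft_recursive(x[1::2])          # DFT of odd-indexed entries
    y = [0] * n
    for j in range(m):
        t = cmath.exp(-2j * cmath.pi * j / n) * z_bot[j]   # Omega * z_bot
        y[j] = z_top[j] + t                 # top half:    z_top + Omega z_bot
        y[j + m] = z_top[j] - t             # bottom half: z_top - Omega z_bot
    return y

x = [1, 2, 3, 4, 0, -1, 2.5, 7]
diff = max(abs(a - b) for a, b in zip(fft_recursive(x), dft_naive(x)))
print(diff < 1e-12)
```

The slicing `x[0::2]` and `x[1::2]` copies data at every level; an in-place variant would avoid that at the cost of index bookkeeping.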
The Divide-and-Conquer Picture

Each index set (a:s:15) splits into its even-position part (a:2s:15) on the left and its odd-position part (a+s:2s:15) on the right:

(0:1:15)
(0:2:15)   (1:2:15)
(0:4:15) (2:4:15)   (1:4:15) (3:4:15)
(0:8:15) (4:8:15) (2:8:15) (6:8:15)   (1:8:15) (5:8:15) (3:8:15) (7:8:15)
[0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Towards a Nonrecursive Implementation

The Radix-2 Factorization...

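If $n = 2m$ and
$\Omega_m = \mathrm{diag}(1, \omega_n, \ldots, \omega_n^{m-1})$,
then
\[
F_n \Pi_n =
\begin{bmatrix} F_m & \Omega_m F_m \\ F_m & -\Omega_m F_m \end{bmatrix}
=
\begin{bmatrix} I_m & \Omega_m \\ I_m & -\Omega_m \end{bmatrix}
(I_2 \otimes F_m),
\]
where $\Pi_n = I_n(:, [\,0{:}2{:}n-1 \ \ 1{:}2{:}n-1\,])$.

Note:
\[
I_2 \otimes F_m = \begin{bmatrix} F_m & 0 \\ 0 & F_m \end{bmatrix}.
\]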
The Cooley-Tukey Factorization

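$n = 2^t$

$F_n = A_t \cdots A_1 P_n$

$P_n$ = the n-by-n "bit reversal" permutation matrix

\[
A_q = I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q, \quad r = n/L
\]

\[
\Omega_{L/2} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L/2-1}),
\qquad \omega_L = \exp(-2\pi i/L)
\]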
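Read from right to left, the factorization gives a nonrecursive procedure: apply the bit reversal permutation, then apply the butterfly factors for q = 1, ..., t. The Python sketch below follows that reading, assuming n is a power of two; the function names are illustrative and no factor matrix is formed explicitly.

```python
import cmath

def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_cooley_tukey(x):
    n = len(x)
    t = n.bit_length() - 1                          # n = 2^t assumed
    y = [x[bit_reverse(k, t)] for k in range(n)]    # y = P_n x
    for q in range(1, t + 1):                       # y = A_q y
        L = 2 ** q
        half = L // 2
        for start in range(0, n, L):                # r = n/L diagonal blocks
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)   # omega_L^j
                a, b = y[start + j], w * y[start + j + half]
                y[start + j], y[start + j + half] = a + b, a - b
    return y

def dft_naive(x):
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

x = [3, -1, 4, 1, -5, 9, 2, 6]
diff = max(abs(a - b) for a, b in zip(fft_cooley_tukey(x), dft_naive(x)))
print(diff < 1e-12)
```

Each pass over `q` touches every entry once, which is where the per-factor linear cost comes from.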
The Bit Reversal Permutation

The leaves of the divide-and-conquer tree appear in bit-reversed order:

(0:1:15)
(0:2:15)   (1:2:15)
(0:4:15) (2:4:15)   (1:4:15) (3:4:15)
(0:8:15) (4:8:15) (2:8:15) (6:8:15)   (1:8:15) (5:8:15) (3:8:15) (7:8:15)
[0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Bit Reversal
       
x(0)  = x(0000)  →  x(0000) = x(0)
x(1)  = x(0001)  →  x(1000) = x(8)
x(2)  = x(0010)  →  x(0100) = x(4)
x(3)  = x(0011)  →  x(1100) = x(12)
x(4)  = x(0100)  →  x(0010) = x(2)
x(5)  = x(0101)  →  x(1010) = x(10)
x(6)  = x(0110)  →  x(0110) = x(6)
x(7)  = x(0111)  →  x(1110) = x(14)
x(8)  = x(1000)  →  x(0001) = x(1)
x(9)  = x(1001)  →  x(1001) = x(9)
x(10) = x(1010)  →  x(0101) = x(5)
x(11) = x(1011)  →  x(1101) = x(13)
x(12) = x(1100)  →  x(0011) = x(3)
x(13) = x(1101)  →  x(1011) = x(11)
x(14) = x(1110)  →  x(0111) = x(7)
x(15) = x(1111)  →  x(1111) = x(15)
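The table can be reproduced in two ways: by reversing the 4-bit index directly, or by repeatedly splitting the index list into even and odd positions as in the divide-and-conquer tree. A short Python sketch, with illustrative function names:

```python
def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def split_order(idx):
    """Leaf order of the recursive even/odd splitting of an index list."""
    if len(idx) == 1:
        return idx
    return split_order(idx[0::2]) + split_order(idx[1::2])

by_bits = [bit_reverse(k, 4) for k in range(16)]
by_tree = split_order(list(range(16)))
print(by_bits)
print(by_bits == by_tree)
```

Both lists start 0, 8, 4, 12, matching the right-hand column of the table.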
Butterfly Operations
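This matrix is block diagonal...
\[
A_q = I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q, \quad r = n/L
\]
r copies of things like this
\[
\begin{bmatrix}
1 & & & & \times & & & \\
 & 1 & & & & \times & & \\
 & & 1 & & & & \times & \\
 & & & 1 & & & & \times \\
1 & & & & \times & & & \\
 & 1 & & & & \times & & \\
 & & 1 & & & & \times & \\
 & & & 1 & & & & \times
\end{bmatrix}
\]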
At the Scalar Level...

[Figure: the butterfly. Inputs a and b, with b scaled by ω, produce the outputs a + ωb and a − ωb.]
Signal Flow Graph (n = 8)

[Figure: signal flow graph of the 8-point radix-2 FFT. Inputs in bit-reversed order x0, x4, x2, x6, x1, x5, x3, x7; outputs y0, y1, ..., y7. Three butterfly stages: the first uses the weight $\omega_8^0$, the second uses $\omega_8^0$ and $\omega_8^2$, and the third uses $\omega_8^0, \omega_8^1, \omega_8^2, \omega_8^3$.]
The Transposed Stockham Factorization

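If $n = 2^t$, then
$F_n = S_t \cdots S_2 S_1$,
where for q = 1:t the factor $S_q = A_q \Gamma_{q-1}$ is defined by

\[
A_q = I_r \otimes B_L, \qquad L = 2^q, \quad r = n/L,
\]
\[
\Gamma_{q-1} = \Pi_{r_*} \otimes I_{L_*}, \qquad L_* = L/2, \quad r_* = 2r,
\]
\[
B_L = \begin{bmatrix} I_{L_*} & \Omega_{L_*} \\ I_{L_*} & -\Omega_{L_*} \end{bmatrix},
\]
\[
\Omega_{L_*} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L_*-1}).
\]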


Perfect Shuffle

   
\[
(\Pi_4 \otimes I_2)
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} x_0 \\ x_1 \\ x_4 \\ x_5 \\ x_2 \\ x_3 \\ x_6 \\ x_7 \end{bmatrix}
\]
Cooley-Tukey Array Interpretation

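Step q:

[Figure: columns 2k and 2k+1 of an array with $L_* = 2^{q-1}$ rows and $r_* = n/L_*$ columns are combined into column k of an array with $L = 2^q$ rows and $r = n/L$ columns.]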
Reshaping

 
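\[
x = \begin{bmatrix} \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \end{bmatrix}
\quad \rightarrow \quad
x_{2 \times 4} = \begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}
\]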
Transposed Stockham Array Interp

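Before step q, columns k and k+r of the array with $L_* = 2^{q-1}$ rows and $r_* = n/L_*$ columns:

\[
x^{(q-1)}_{L_* \times r_*} = F_{L_*} \, x^{T}_{r_* \times L_*}
\]

\[
x^{(q)} = S_q \, x^{(q-1)}
\]

After step q, column k of the array with $L = 2^q$ rows and $r = n/L$ columns:

\[
x^{(q)}_{L \times r} = F_{L} \, x^{T}_{r \times L}
\]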
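The array interpretation says that step q combines columns k and k+r of the $L_*$-by-$r_*$ array into column k of the L-by-r array, so no bit reversal is needed. The Python sketch below implements that reading with column-major storage and a second work array; it assumes n is a power of two, and the names are chosen for this example.

```python
import cmath

def fft_transposed_stockham(x):
    n = len(x)                      # assumed to be a power of two
    y = list(x)                     # x^(0): a 1-by-n array
    l_star = 1                      # L* = 2^(q-1)
    while l_star < n:
        L = 2 * l_star
        r = n // L
        nxt = [0] * n               # x^(q), stored column-major as L-by-r
        for k in range(r):
            for j in range(l_star):
                w = cmath.exp(-2j * cmath.pi * j / L)   # omega_L^j
                a = y[k * l_star + j]                   # column k, row j
                b = w * y[(k + r) * l_star + j]         # column k+r, row j
                nxt[k * L + j] = a + b
                nxt[k * L + j + l_star] = a - b
        y = nxt
        l_star = L
    return y

def dft_naive(x):
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

x = [2, 7, 1, 8, 2, 8, 1, 8]
diff = max(abs(a - b) for a, b in zip(fft_transposed_stockham(x), dft_naive(x)))
print(diff < 1e-12)
```

The trade-off relative to the bit-reversal approach is the extra length-n workspace `nxt`.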
2 × 2 × 2 Basic Radix-2 Versions

Store intermediate DFTs by row or column

Intermediate DFTs adjacent or not.

How the two butterfly loops are ordered.

\[
x = \left( I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix}
\right) x,
\qquad L = 2^q, \quad r = n/L
\]
The Gentleman-Sande Idea

It can be shown that $F_n^T = F_n$ and so if

$F_n = A_t \cdots A_1 P_n^T$

then

$F_n = F_n^T = P_n A_1^T \cdots A_t^T$

and we can compute $y = F_n x$ as follows...

y = x
for k = t:-1:1
    y = A_k^T y
end
y = P_n y
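Since each butterfly block $[\,I \ \ \Omega;\ I \ -\Omega\,]$ has transpose $[\,I \ \ I;\ \Omega \ -\Omega\,]$, the transposed factors add and subtract first and scale afterwards. A minimal Python sketch of the loop above, assuming n is a power of two and using illustrative function names:

```python
import cmath

def bit_reverse(k, t):
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_gentleman_sande(x):
    n = len(x)
    t = n.bit_length() - 1                      # n = 2^t assumed
    y = list(x)
    for q in range(t, 0, -1):                   # y = A_q^T y, q = t, ..., 1
        L = 2 ** q
        half = L // 2
        for start in range(0, n, L):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)
                a, b = y[start + j], y[start + j + half]
                y[start + j] = a + b            # row block [I  I]
                y[start + j + half] = w * (a - b)   # row block [Omega  -Omega]
    return [y[bit_reverse(k, t)] for k in range(n)]   # y = P_n y

def dft_naive(x):
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

x = [1, -2, 3, -4, 5, -6, 7, -8]
diff = max(abs(a - b) for a, b in zip(fft_gentleman_sande(x), dft_naive(x)))
print(diff < 1e-12)
```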
Convolution and Other Applications

From "problem space" to "DFT space" via

for k = t:-1:1
    x = A_k^T x
end
x = P_n x

Do your thing in DFT space. Then inverse transform back to problem space via

x = P_n^T x
for k = 1:t
    x = conj(A_k) x
end
x = x/n

Here conj(A_k) is the complex conjugate of A_k, which is needed because $F_n^{-1} = \bar{F}_n / n$.

Can avoid the P_n ops by working in "scrambled" DFT space.
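To make the "scrambled" idea concrete, here is a Python sketch of circular convolution that never applies the bit reversal permutation: the forward pass uses only the transposed butterflies, the pointwise product is taken in bit-reversed order, and the inverse pass uses the conjugated butterflies followed by division by n. It assumes both inputs have the same power-of-two length; the function names are illustrative.

```python
import cmath

def to_scrambled_dft(x):
    """Apply A_t^T, ..., A_1^T; the result is F_n x in bit-reversed order."""
    y, n = list(x), len(x)
    L = n
    while L >= 2:
        half = L // 2
        for s in range(0, n, L):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)
                a, b = y[s + j], y[s + j + half]
                y[s + j], y[s + j + half] = a + b, w * (a - b)
        L //= 2
    return y

def from_scrambled_dft(y):
    """Apply conj(A_1), ..., conj(A_t) to bit-reversed data, then divide by n."""
    y, n = list(y), len(y)
    L = 2
    while L <= n:
        half = L // 2
        for s in range(0, n, L):
            for j in range(half):
                w = cmath.exp(2j * cmath.pi * j / L)    # conjugated weight
                a, b = y[s + j], w * y[s + j + half]
                y[s + j], y[s + j + half] = a + b, a - b
        L *= 2
    return [v / n for v in y]

def circular_convolve(a, b):
    A, B = to_scrambled_dft(a), to_scrambled_dft(b)
    return from_scrambled_dft([p * q for p, q in zip(A, B)])

c = circular_convolve([1, 2, 3, 4], [1, 0, 0, 1])
print([round(v.real, 10) for v in c])
```

The pointwise product is order independent as long as both transforms are scrambled the same way, which is why the permutation can be dropped.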


Radix-4

Can combine four quarter-length DFTs to produce a single full-length DFT:

\[
v =
\begin{bmatrix}
I & I & I & I \\
I & -iI & -I & iI \\
I & -I & I & -I \\
I & iI & -I & -iI
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix}
(a + c) + (b + d) \\
(a - c) - i(b - d) \\
(a + c) - (b + d) \\
(a - c) + i(b - d)
\end{bmatrix}
\]

The radix-4 butterfly.

Better re-use of data.
Fewer flops. Radix-4 FFT is 4.25n log n (instead of 5n log n).
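The right-hand side shows how the butterfly reuses the four sums and differences. In the scalar case, where each block I is just the number 1, the butterfly matrix is $F_4$ itself, which gives an easy check. A minimal Python sketch with an illustrative function name:

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """Scalar radix-4 butterfly built from shared sums and differences."""
    s0, s1 = a + c, a - c
    t0, t1 = b + d, b - d
    return (s0 + t0, s1 - 1j * t1, s0 - t0, s1 + 1j * t1)

def dft_naive(x):
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

vals = (1 + 2j, -3, 0.5j, 4)
v = radix4_butterfly(*vals)
diff = max(abs(p - q) for p, q in zip(v, dft_naive(list(vals))))
print(diff < 1e-12)
```

Multiplication by ±i only swaps real and imaginary parts (with a sign change), so the butterfly itself needs no genuine complex multiplications.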
Mixed Radix

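[Figure: mixed-radix splitting tree for n = 96. The length-96 problem splits into four length-24 problems, and each length-24 problem splits into three length-8 problems.]

96
24   24   24   24
8 8 8   8 8 8   8 8 8   8 8 8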
Multiple DFTs

Given: $n_1$-by-$n_2$ matrix $X$.

Multicolumn DFT Problem...

$X \leftarrow F_{n_1} X$

Multirow DFT Problem...

$X \leftarrow X F_{n_2}$
Blocked Multiple DFTs

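$X \leftarrow F_{n_1} X$ becomes

\[
\begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_p \end{bmatrix}
\leftarrow
\begin{bmatrix} F_{n_1} X_1 \mid F_{n_1} X_2 \mid \cdots \mid F_{n_1} X_p \end{bmatrix}
\]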
The 4-Step Framework

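A matrix reshaping of the $x \leftarrow F_n x$ operation when $n = n_1 n_2$:

$x_{n_1 \times n_2} \leftarrow x_{n_1 \times n_2} F_{n_2}$      Multiple row DFT

$x_{n_1 \times n_2} \leftarrow F_n(0{:}n_1-1,\ 0{:}n_2-1) \mathbin{.*} x_{n_1 \times n_2}$      Pointwise multiply

$x_{n_2 \times n_1} \leftarrow x_{n_1 \times n_2}^T$      Transpose

$x_{n_2 \times n_1} \leftarrow x_{n_2 \times n_1} F_{n_1}$      Multiple row DFT

Can be arranged so communication is concentrated in the transpose step.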
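The four steps can be traced on a small case. The Python sketch below assumes the reshapes are column-major (entry (a, b) of the $n_1$-by-$n_2$ array is x[a + n1*b]) and uses a naive DFT for the row transforms; the function names are chosen for this example.

```python
import cmath

def dft_naive(x):
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

def four_step_dft(x, n1, n2):
    n = n1 * n2
    # Reshape x as an n1-by-n2 array (column-major), stored as a list of rows.
    X = [[x[a + n1 * b] for b in range(n2)] for a in range(n1)]
    # Step 1: multiple row DFT, X <- X F_{n2}.
    X = [dft_naive(row) for row in X]
    # Step 2: pointwise multiply by F_n(0:n1-1, 0:n2-1).
    X = [[cmath.exp(-2j * cmath.pi * a * c / n) * X[a][c] for c in range(n2)]
         for a in range(n1)]
    # Step 3: transpose to an n2-by-n1 array.
    Z = [[X[a][c] for a in range(n1)] for c in range(n2)]
    # Step 4: multiple row DFT, Z <- Z F_{n1}.
    Z = [dft_naive(row) for row in Z]
    # Read the n2-by-n1 result back as a vector (column-major).
    return [Z[k % n2][k // n2] for k in range(n)]

x = [1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9]
diff = max(abs(a - b) for a, b in zip(four_step_dft(x, 3, 4), dft_naive(x)))
print(diff < 1e-12)
```

In a distributed setting the row DFTs and the pointwise multiply are local to whoever owns the rows, which is consistent with the remark that only the transpose needs to move data.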
Distributed Transpose: Example

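Initial:
\[
X = \begin{bmatrix}
X_{00} & X_{01} & X_{02} & X_{03} \\
X_{10} & X_{11} & X_{12} & X_{13} \\
X_{20} & X_{21} & X_{22} & X_{23} \\
X_{30} & X_{31} & X_{32} & X_{33}
\end{bmatrix}.
\]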
Transpose each block:
\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{01}^T & X_{02}^T & X_{03}^T \\
X_{10}^T & X_{11}^T & X_{12}^T & X_{13}^T \\
X_{20}^T & X_{21}^T & X_{22}^T & X_{23}^T \\
X_{30}^T & X_{31}^T & X_{32}^T & X_{33}^T
\end{bmatrix}.
\]
Now regard as 2-by-2 and block transpose each block:
\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{02}^T & X_{12}^T \\
X_{01}^T & X_{11}^T & X_{03}^T & X_{13}^T \\
X_{20}^T & X_{30}^T & X_{22}^T & X_{32}^T \\
X_{21}^T & X_{31}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]
Now do a 2-by-2 block transpose:
\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{20}^T & X_{30}^T \\
X_{01}^T & X_{11}^T & X_{21}^T & X_{31}^T \\
X_{02}^T & X_{12}^T & X_{22}^T & X_{32}^T \\
X_{03}^T & X_{13}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]
Factorization and Transpose

$x_{n \times m} \leftarrow x_{m \times n}^T$

corresponds to

$x \leftarrow P(m, n)\, x$

where $P(m, n)$ is a perfect shuffle permutation, e.g.,

$P(3, 4) = I_{12}(:, [\,0\ 3\ 6\ 9\ 1\ 4\ 7\ 10\ 2\ 5\ 8\ 11\,])$

Different multi-pass transposition algorithms correspond to different factorizations of $P(m, n)$.
Two-Dimensional FFTs

If $X$ is an $n_1$-by-$n_2$ matrix then its 2D DFT is

$X \leftarrow F_{n_1} X F_{n_2}$

Option 1.
$X \leftarrow F_{n_1} X$
$X \leftarrow X F_{n_2}$

Option 2. Assume $n_1 = n_2$ and $F_{n_1} = A_t \cdots A_1$.

for q = 1:t
    X ← A_q X A_q^T
end

Intermingling the column and row butterfly computations can result in better locality.
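Option 1 is the familiar column-then-row approach. The Python sketch below applies a naive 1D DFT to every column and then to every row, and compares the result with the direct double sum; the function names are illustrative, and a fast 1D transform could be substituted for `dft_naive`.

```python
import cmath

def dft_naive(x):
    n = len(x)
    return [sum(x[q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n))
            for p in range(n)]

def dft2(X):
    """2D DFT of a matrix given as a list of rows: F_{n1} X F_{n2}."""
    n1, n2 = len(X), len(X[0])
    # X <- F_{n1} X: transform each column.
    cols = [dft_naive([X[i][j] for i in range(n1)]) for j in range(n2)]
    Y = [[cols[j][p] for j in range(n2)] for p in range(n1)]
    # X <- X F_{n2}: transform each row.
    return [dft_naive(row) for row in Y]

def dft2_direct(X):
    """Reference: Z[p][q] = sum_{i,j} X[i][j] w1^(p*i) w2^(j*q)."""
    n1, n2 = len(X), len(X[0])
    return [[sum(X[i][j]
                 * cmath.exp(-2j * cmath.pi * p * i / n1)
                 * cmath.exp(-2j * cmath.pi * j * q / n2)
                 for i in range(n1) for j in range(n2))
             for q in range(n2)] for p in range(n1)]

X = [[1, 2, 3],
     [4, 5, 6]]
Z, R = dft2(X), dft2_direct(X)
diff = max(abs(Z[p][q] - R[p][q]) for p in range(2) for q in range(3))
print(diff < 1e-12)
```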
3-Dimensional DFTs

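Given $X(1{:}n_1, 1{:}n_2, 1{:}n_3)$, apply DFT in each of the three dimensions.
If
$x = \mathrm{reshape}(X(1{:}n_1, 1{:}n_2, 1{:}n_3),\ n_1 n_2 n_3,\ 1)$

then the problem is to compute

$x \leftarrow (F_{n_3} \otimes F_{n_2} \otimes F_{n_1})\, x$

i.e.,
$x \leftarrow (I_{n_3} \otimes I_{n_2} \otimes F_{n_1})\, x$
$x \leftarrow (I_{n_3} \otimes F_{n_2} \otimes I_{n_1})\, x$
$x \leftarrow (F_{n_3} \otimes I_{n_2} \otimes I_{n_1})\, x$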
d-Dimensional DFTs

Sample for d = 5:

µ = 1: apply $F_{n_1}$ to X(α1, α2, α3, α4, α5); then $\Pi^T_{n_1,n}$ gives X(α2, α3, α4, α5, α1)
µ = 2: apply $F_{n_2}$ to X(α2, α3, α4, α5, α1); then $\Pi^T_{n_2,n}$ gives X(α3, α4, α5, α1, α2)
µ = 3: apply $F_{n_3}$ to X(α3, α4, α5, α1, α2); then $\Pi^T_{n_3,n}$ gives X(α4, α5, α1, α2, α3)
µ = 4: apply $F_{n_4}$ to X(α4, α5, α1, α2, α3); then $\Pi^T_{n_4,n}$ gives X(α5, α1, α2, α3, α4)
µ = 5: apply $F_{n_5}$ to X(α5, α1, α2, α3, α4); then $\Pi^T_{n_5,n}$ gives X(α1, α2, α3, α4, α5)

Intermingling of component DFTs and tensor transpositions.


References

FFTW: http:[Link]

C. Van Loan (1992). Computational Frameworks for the Fast Fourier Transform, SIAM Publications, Philadelphia, PA.
