Advanced Topics in Communications Electronics
FPGA Information Processing Systems
Lecture 4 Unfolding and Retiming
Chapter 4, the book by Keshab Parhi
Lecture 4 1
Example of Unfolding
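[Figure: original DFG computing y(n) = a*y(n-9) + x(n), with a 9D feedback loop through the multiplier a, next to its 2-unfolded version with inputs x(2k), x(2k+1), outputs y(2k), y(2k+1), two multipliers by a, and feedback delays 5D and 4D.]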
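A quick numerical check of this example, sketched in Python. It assumes the original DFG computes y(n) = a*y(n-9) + x(n) (read off the 9D loop) with zero initial state; the function names and the test input are illustrative only.

```python
def original(x, a):
    """y(n) = a*y(n-9) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 9] if n >= 9 else 0.0) + xn)
    return y


def unfolded_by_2(x, a):
    """Two samples per iteration k, using the 5D and 4D feedback edges."""
    assert len(x) % 2 == 0
    even, odd = [], []  # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        # y(2k) = a*y(2k-9) + x(2k); y(2k-9) = y(2(k-5)+1) = odd[k-5]  (5D)
        y0 = a * (odd[k - 5] if k >= 5 else 0.0) + x[2 * k]
        # y(2k+1) = a*y(2k-8) + x(2k+1); y(2k-8) = y(2(k-4)) = even[k-4]  (4D)
        y1 = a * (even[k - 4] if k >= 4 else 0.0) + x[2 * k + 1]
        even.append(y0)
        odd.append(y1)
    # Interleave the two output streams back into y(n).
    return [v for pair in zip(even, odd) for v in pair]


x = [float(n % 7) for n in range(40)]
assert original(x, 0.5) == unfolded_by_2(x, 0.5)
```

The 9 delays of the original loop reappear as 5 + 4 delays in the unfolded loop, so the total is unchanged.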
J-Slow from J-Unfolding
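If the input to a delay element is x(kJ+m), the output is x(kJ+m-J).
If J = 2 and the input to D is y(2k+1), the output is y(2k-1).
For 5D, the output is y(2k-9).
[Figure: the 2-unfolded DFG with inputs x(2k), x(2k+1), outputs y(2k), y(2k+1), multipliers by a, and feedback delays 5D and 4D.]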
Unfolding Algorithm for Factor J
Example of Unfolding
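[Figure: a DFG with nodes A, B, C, D and one 9D edge, next to its 2-unfolded version with nodes A0, B0, C0, D0, A1, B1, C1, D1 and edge delays 5D and 4D.]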
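Unfolding without Loops
For an edge u -> v with w delays: u_i -> v_((i+w)%J) with floor((i+w)/J) delays
[Figure: edge u -> v with 37D, input x(n), output y(n-37), and its 4-unfolded version: u0 -> v1 (9D), u1 -> v2 (9D), u2 -> v3 (9D), u3 -> v0 (10D). Inputs x(4k), x(4k+1), x(4k+2), x(4k+3) enter u0..u3; outputs y(4k-37), y(4k-36), y(4k-35), y(4k-34) leave v0..v3.]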
For circuits without loops, unfolding == parallel processing
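The edge rule of J-unfolding (edge u -> v with w delays becomes J edges u_i -> v_((i+w)%J), each with floor((i+w)/J) delays) can be sketched in a few lines of Python. The edge-list representation and the function name `unfold` are assumptions made for illustration.

```python
def unfold(edges, J):
    """J-unfold a DFG given as (u, v, w) edges, w = number of delays.

    Returns edges ((u, i), (v, j), delays) between the node copies.
    """
    out = []
    for u, v, w in edges:
        for i in range(J):
            out.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return out


# The 37D edge of this slide, unfolded by 4.
result = unfold([("u", "v", 37)], 4)
for edge in result:
    print(edge)

# The J copies together carry exactly the original number of delays.
assert sum(d for _, _, d in result) == 37
```

For the 37D edge this gives three edges with 9 delays and one with 10, matching the figure.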
Another Example
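[Figure: a DFG with nodes u, v, s and edge delays D, 5D, 6D, next to its 3-unfolded version with nodes u0, u1, u2, v0, v1, v2, s0, s1, s2, whose edges carry D or 2D.]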
Properties of Unfolding
Critical Path
Retiming
[Figure: a circuit with two multipliers, before and after its delay elements D are moved.]
The number of delay elements (registers) along each complete path or loop is preserved.
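A minimal Python sketch of this invariant, using the usual textbook formulation of retiming, which the slide itself does not spell out: each node v gets an integer r(v), and an edge u -> v with w delays ends up with w + r(v) - r(u) delays. The three-node loop and the values of r below are hypothetical.

```python
def retime(edges, r):
    """Return retimed edges; each edge u -> v gets w + r[v] - r[u] delays."""
    out = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would get {w_r} delays")
        out.append((u, v, w_r))
    return out


# Hypothetical loop A -> B -> C -> A holding 2 delays in total.
loop = [("A", "B", 0), ("B", "C", 1), ("C", "A", 1)]
retimed = retime(loop, {"A": 0, "B": 1, "C": 0})
print(retimed)

# Around a loop the r terms cancel, so the delay count is unchanged.
assert sum(w for _, _, w in retimed) == sum(w for _, _, w in loop)
```

Along a path from node a to node b the same sum changes by r(b) - r(a), so it is preserved whenever the two end nodes share the same r value.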
Effect of Retiming
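[Figure: the same DFG before and after retiming, with nodes p (0), q (1), r (1), s (4), t (4), u (0); computation times are in parentheses, and edges carry delays D and 2D.]
Before retiming: critical path (period) 6
After retiming: critical path (period) 4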
Clock Skew and Useful Skew
[Figure: three flip-flops in a chain, FF -> logic (delay = 9) -> FF -> logic (delay = 5) -> FF, with clock arrival times t1, t2, t3.]
Zero skew: t1 = t2 = t3, clock period = 9
Useful skew: t1 = t3, t2 = t3 + 2, clock period = 7
Clock Skew Optimization
T: clock period
ti: clock signal arrival time at flip-flop i
Dij: combinational circuit delay between flop i and j
Minimize T
s.t. ti + Dij,max ≤ tj + T – tsetup
ti + Dij,min ≥ tj + thold
for all flop pairs (i , j ) with combinational logic in between
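One way to explore this optimization without an LP solver is sketched below in Python. It is a swapped-in technique, not the method of the slides: for a fixed T the setup and hold inequalities are difference constraints, so feasibility can be tested with Bellman-Ford, and T is then found by bisection. The function names, the `equal` parameter (used to pin t1 = t3 as in the useful-skew example) and the choice Dij,min = Dij,max are all assumptions.

```python
def arrival_times(T, n, paths, t_setup=0.0, t_hold=0.0, equal=()):
    """Return clock arrival times that make period T feasible, else None.

    paths: (i, j, d_min, d_max) for logic between flop i and flop j.
    A constraint t[v] - t[u] <= c is stored as an edge (u, v, c).
    """
    edges = []
    for i, j, d_min, d_max in paths:
        edges.append((j, i, T - t_setup - d_max))  # setup: ti - tj <= T - tsetup - Dmax
        edges.append((i, j, d_min - t_hold))       # hold:  tj - ti <= Dmin - thold
    for a, b in equal:                             # ta == tb
        edges.append((a, b, 0.0))
        edges.append((b, a, 0.0))
    dist = [0.0] * n  # as if a virtual source reached every flop at cost 0
    for _ in range(n):
        changed = False
        for u, v, c in edges:
            if dist[u] + c < dist[v] - 1e-12:
                dist[v] = dist[u] + c
                changed = True
        if not changed:
            return dist  # no negative cycle: constraints are satisfiable
    return None          # still relaxing after n rounds: negative cycle


def min_period(n, paths, t_setup=0.0, t_hold=0.0, equal=(), tol=1e-6):
    """Bisect on T between 0 and the zero-skew period."""
    lo = 0.0
    hi = max(d_max for _, _, _, d_max in paths) + t_setup
    if arrival_times(hi, n, paths, t_setup, t_hold, equal) is None:
        raise ValueError("zero-skew starting point is infeasible")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if arrival_times(mid, n, paths, t_setup, t_hold, equal) is not None:
            hi = mid
        else:
            lo = mid
    return hi


# Useful-skew example: delays 9 and 5, with t1 = t3 (flops indexed 0, 1, 2).
paths = [(0, 1, 9.0, 9.0), (1, 2, 5.0, 5.0)]
print(min_period(3, paths, equal=[(0, 2)]))     # close to 7
print(arrival_times(7.0, 3, paths, equal=[(0, 2)]))
```

With these numbers the search settles at a period of about 7, and at T = 7 the middle flop's clock arrives 2 time units after the other two, in line with the useful-skew slide.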
Sample Period Reduction
[Figure: the DFG with nodes p (0), q (1), r (1), s (4), t (4), u (0) and edge delays D and 2D, next to its 2-unfolded version with nodes p0, q0, r0, s0, t0, u0, p1, q1, r1, s1, t1, u1 and single delays D.]
Fractional Iteration Bound
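[Figure: a loop of four nodes s, t, u, v, each with computation time (1), with delays D on its edges, next to its 3-unfolded version with nodes s0, s1, s2, t0, t1, t2, u0, u1, u2, v0, v1, v2 arranged in the rows s0 t1 u1 v2, s1 t2 u2 v0, s2 t0 u0 v1.]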
Overlapping Scheduling
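[Figure: a DFG with nodes A (1), B (2), C (4) and delays D; iteration bound: 3.5. Next to it, the 2-unfolded version with nodes A0, B0, C0, A1, B1, C1, the precedence graph of each, and schedules on two processors along a time axis. Original: P1 runs B, C and P2 runs A. Unfolded: P1 runs B0, C0, A1 and P2 runs A0, B1, C1.]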
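The stated bound of 3.5 can be reproduced with a short Python sketch. Because the edges of the figure are not recoverable, it assumes one reading that is consistent with the number: a single loop through A, B and C that holds 2 delays. It also relies on the standard property that J-unfolding a loop with w delays yields gcd(w, J) loops, each visiting the original loop J/gcd(w, J) times and holding w/gcd(w, J) delays. Function names are illustrative.

```python
from fractions import Fraction
from math import gcd


def loop_bound(node_times, delays):
    """Loop bound = total computation time / number of delays in the loop."""
    return Fraction(sum(node_times), delays)


def unfolded_loop_bound(node_times, delays, J):
    """Bound of each loop produced by J-unfolding a single loop."""
    g = gcd(delays, J)
    return Fraction((J // g) * sum(node_times), delays // g)


print(loop_bound([1, 2, 4], 2))              # 7/2
print(unfolded_loop_bound([1, 2, 4], 2, 2))  # 7
```

Under that assumption the 2-unfolded graph has an integer bound of 7 time units per iteration, and each of its iterations produces two samples, so the average stays at 3.5 per sample.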
Word-Level Parallel Processing
[Figure: a filter with input x(n) and output y(n), built from multipliers by c, b, a (nodes C, B, A), two adders (nodes D, E) and delays 2D and 4D, next to its 3-unfolded version with inputs x(3k), x(3k+1), x(3k+2), outputs y(3k), y(3k+1), y(3k+2), nodes A0..A2, B0..B2, C0..C2, D0..D2, E0..E2, and edge delays D or 2D.]
Parallelism Levels
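[Figure: three ways to process the 6-bit words a5..a0 and b5..b0.]
Bit-parallel: all six bits of each word (a5 a4 a3 a2 a1 a0 and b5 b4 b3 b2 b1 b0) are presented at once.
Digit-serial, digit size 2: each word arrives on two lines, a4 a2 a0 and a5 a3 a1 (likewise b4 b2 b0 and b5 b3 b1).
Bit-serial: each word arrives on one line, a5 a4 a3 a2 a1 a0 (likewise b5 b4 b3 b2 b1 b0).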
Bit-Serial Adder
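[Figure: bit-serial adder. The words a3 a2 a1 a0 and b3 b2 b1 b0 enter one bit per cycle and the sum s3 s2 s1 s0 leaves one bit per cycle. The carry is fed back through a delay D; a switch selects the constant 0 at time 4p+0 and the delayed carry at times 4p+1, 2, 3.]
p is word index
0, 1, 2, 3 are bit indices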
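The behavior of this adder can be sketched as a cycle-by-cycle Python simulation. It assumes bits arrive least-significant first (bit index 0 at time 4p+0) and that the final carry of each word is discarded; the helper names are illustrative.

```python
def bit_serial_add(a_bits, b_bits, W=4):
    """Add streams of W-bit words, one bit per clock cycle, LSB first."""
    carry = 0  # contents of the delay element D
    s_bits = []
    for t, (a, b) in enumerate(zip(a_bits, b_bits)):
        cin = 0 if t % W == 0 else carry  # switch: 0 at Wp+0, D otherwise
        total = a + b + cin
        s_bits.append(total & 1)
        carry = total >> 1
    return s_bits


def to_bits(x, W=4):
    """LSB-first list of the W bits of x."""
    return [(x >> q) & 1 for q in range(W)]


def from_bits(bits):
    return sum(bit << q for q, bit in enumerate(bits))


# Two consecutive words: 5 + 6 and then 9 + 3.
a_stream = to_bits(5) + to_bits(9)
b_stream = to_bits(6) + to_bits(3)
s_stream = bit_serial_add(a_stream, b_stream)
print(from_bits(s_stream[:4]), from_bits(s_stream[4:]))  # 11 12
```

Resetting the carry-in at 4p+0 is what keeps the carry of one word from leaking into the next.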
Unfolding Switches
[Figure: an edge u -> v with a switch that closes at time Wp+q.]
p is word index
q is bit index
Example of Unfolding Switch
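[Figure: an edge u -> v with a switch at 12p+1,7,9,11, unfolded by 3 into u0 -> v0 with a switch at 4p+3, u1 -> v1 with a switch at 4p+0,2, and u2 -> v2 with a switch at 4p+3.]
12p + 1 = 3(4p + 0) + 1
12p + 7 = 3(4p + 2) + 1
12p + 9 = 3(4p + 3) + 0
12p + 11 = 3(4p + 3) + 2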
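The four decompositions above follow one pattern, Wp + u = J((W/J)p + floor(u/J)) + (u mod J), which a small Python sketch can apply mechanically. It assumes J divides W and that the switched edge carries no delays; the function name `unfold_switch` is hypothetical.

```python
def unfold_switch(W, instants, J):
    """Unfold a switch that closes at W*p + u (u in instants) by factor J.

    Returns (W // J, {i: [new switching instants on edge u_i -> v_i]}).
    """
    if W % J:
        raise ValueError("this sketch needs W to be a multiple of J")
    table = {i: [] for i in range(J)}
    for u in sorted(instants):
        # W*p + u = J*((W/J)*p + u // J) + (u % J)
        table[u % J].append(u // J)
    return W // J, table


print(unfold_switch(12, [1, 7, 9, 11], 3))
# (4, {0: [3], 1: [0, 2], 2: [3]})
```

The result reads as: u0 -> v0 at 4p+3, u1 -> v1 at 4p+0,2, and u2 -> v2 at 4p+3.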
Dummy Node
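[Figure: node C receives from A through an edge with 2D and a switch at 6p+1,5, and from B through a switch at 6p+0,2,3,4. A dummy node D is inserted so that the delays and the switch sit on separate edges. The 3-unfolded result has nodes A0, A1, A2, B0, B1, B2, C0, C1, C2, D0, D1, D2, delays D, and switches at 2p+0 and 2p+1.]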
2-Unfolding Bit-Serial Adder
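[Figure: the bit-serial adder (inputs a3 a2 a1 a0 and b3 b2 b1 b0, output s3 s2 s1 s0, carry delay D, switches at 4p+0 and 4p+1, 2, 3) redrawn as a DFG with nodes labelled A, B, S, X, D, Z, next to its 2-unfolded version with nodes A0, A1, B0, B1, S0, S1, X0, X1, D0, D1, Z0, Z1, one delay D, and switches at 2p+0 and 2p+1.]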
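Unfolding Bit-Serial Adder (J=2)
[Figure: the bit-serial adder (inputs a3 a2 a1 a0 and b3 b2 b1 b0, output s3 s2 s1 s0, carry delay D, switch at 4p+0 and 4p+1, 2, 3) unfolded by 2. Two chained adders process a0 b0, a2 b2 and a1 b1, a3 b3 and produce s0, s2 and s1, s3. The carry-in of the first adder is 0 at 2p+0; the carry out of the second adder passes through the delay D and is used at 2p+1.]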
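Unfolding Bit-Serial Adder (J=4)
[Figure: the bit-serial adder (inputs a3 a2 a1 a0 and b3 b2 b1 b0, output s3 s2 s1 s0, switch at 4p+0 and 4p+1, 2, 3) unfolded by 4. Four chained adders take a0 b0, a1 b1, a2 b2, a3 b3, with carry-in 0 to the first adder and a carry out from the last, and produce s0 s1 s2 s3 in a single cycle.]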
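A compact Python sketch of the unfolded adders, written as one function that consumes J bits per clock cycle. It assumes J divides the word length W and LSB-first ordering; the function name and the integer interface are illustrative choices, not taken from the slides.

```python
def unfolded_serial_add(a, b, W=4, J=2):
    """Add two W-bit words, consuming J bits per clock cycle."""
    assert W % J == 0
    carry = 0  # the single delay element D that survives unfolding
    s = 0
    for cycle in range(W // J):            # W/J cycles per word
        cin = 0 if cycle == 0 else carry   # switch at (W/J)p + 0
        for i in range(J):                 # J chained adders in one cycle
            q = cycle * J + i              # bit index within the word
            total = ((a >> q) & 1) + ((b >> q) & 1) + cin
            s |= (total & 1) << q
            cin = total >> 1               # ripples to the next adder
        carry = cin                        # carry out is stored in D
    return s, carry


for J in (1, 2, 4):
    print(J, unfolded_serial_add(9, 5, J=J))
```

J=1 is the original bit-serial adder, J=2 the digit-serial version with digit size 2, and J=4 needs a single cycle per word, so its delay element is never read and the structure behaves like a 4-bit ripple-carry adder.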