FPGA - Ch5 - Unfolding
BMĐT
Lecturer: Hồ Trung Mỹ
Ch.05
Unfolding
References:
1. Slides from Prof. Parhi's book
2. Slides by Prof. Viktor Öwall
3. Slides by Prof. Lan-Da Van
Outline
5.1 Introduction
5.2 An Algorithm for Unfolding
5.3 Properties of Unfolding
5.4 Critical Path, Unfolding, and Retiming
5.5 Applications of Unfolding
5.6 Conclusions
5.1 Introduction
Unfolding is a transformation technique applied to a DSP program to create a new program describing more than one iteration of the original program.
Figure: looping over 1 2 3 … versus the unfolded sequence 1 2 3 1 2 3 1 2 3.
Unfolding is also referred to as loop unrolling, a term used in:
• assembly programming
• compiler theory
With unfolding factor J, the unfolded program describes J consecutive iterations of the original program.
Applications to high-speed and low-power VLSI architectures:
• Reveal hidden concurrencies so that the program can be scheduled to a smaller iteration period (sample period reduction, reaching the iteration bound T∞, increased throughput)
• Design parallel architectures at the word level and bit level
Unfolding Example
Example
y(n) = a·y(n−9) + x(n). After 2-unfolding we have:
y(2k) = a·y(2k−9) + x(2k)
y(2k+1) = a·y(2k−8) + x(2k+1)
The above equations can be rewritten as
y(2k) = a·y(2(k−5)+1) + x(2k)
y(2k+1) = a·y(2(k−4)+0) + x(2k+1)
Figure: the original DFG (a loop with 9D) and its 2-unfolded DFG (J = 2), in which y(2k) is computed from y(2k+1) through 5D and y(2k+1) is computed from y(2k) through 4D.
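The equivalence of the original recursion and its 2-unfolded form can be checked numerically. The following is a minimal sketch, assuming zero initial conditions (y(n) = 0 for n < 0); the function names are illustrative and not part of the slides.

```python
def original(x, a):
    """y(n) = a*y(n-9) + x(n), assuming y(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        prev = y[n - 9] if n >= 9 else 0
        y.append(a * prev + x[n])
    return y


def unfolded_by_2(x, a):
    """Each iteration k produces two outputs, y(2k) and y(2k+1)."""
    if len(x) % 2:
        raise ValueError("this sketch expects an even number of samples")
    even, odd = [], []  # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        # y(2k) = a*y(2(k-5)+1) + x(2k): odd stream, 5 iterations back (5D)
        prev_odd = odd[k - 5] if k >= 5 else 0
        even.append(a * prev_odd + x[2 * k])
        # y(2k+1) = a*y(2(k-4)) + x(2k+1): even stream, 4 iterations back (4D)
        prev_even = even[k - 4] if k >= 4 else 0
        odd.append(a * prev_even + x[2 * k + 1])
    # Interleave the two streams back into y(0), y(1), y(2), ...
    return [v for pair in zip(even, odd) for v in pair]
```

Both functions return the same sequence for any input of even length, which is what the 5D and 4D edges of the unfolded DFG express.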
5.2 Unfolding Algorithm
Example
Figure: the original DFG (nodes A, B, C, D, with 9D) and its 2-unfolded DFG (J = 2) with nodes A0, B0, C0, D0 and A1, B1, C1, D1 and delays 5D and 4D.
i   w   (i+w)%J   ⌊(i+w)/J⌋
0   9   1         4
1   9   0         5
J-unfolded DFG: each edge U → V with w delays becomes the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays, for i = 0, 1, 2, …, J−1.
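The edge rule above translates almost directly into code. This is a minimal sketch, assuming a DFG is given as a list of (U, V, w) edges; the representation and the function name `unfold` are illustrative choices, not something defined in the slides.

```python
def unfold(edges, J):
    """Unfold a DFG by factor J.

    edges: list of (U, V, w) tuples, meaning an edge U -> V with w delays.
    Returns a list of ((U, i), (V, j), delays) tuples, where (U, i) is
    copy i of node U in the unfolded DFG.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            # U_i -> V_{(i+w) % J} with floor((i+w)/J) delays
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded
```

For a single edge with 9 delays and J = 2, `unfold([("U", "V", 9)], 2)` yields U0 → V1 with 4 delays and U1 → V0 with 5 delays, matching the table.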
Another DFG Unfolding Example (1)
Figure: original DFG with nodes Q, R, S, T and edges carrying 2D and 3D, unfolded with J = 2.
i   w   (i+w)%J   ⌊(i+w)/J⌋
0   2   0         1
1   2   1         1
0   3   1         1
1   3   0         2
Step 1. Duplicate J copies of each node: Q0, R0, S0, T0 and Q1, R1, S1, T1.
Another DFG Unfolding Example (2)
Step 2. Add all edges with 0 delay on them.
Another DFG Unfolding Example (3)
Step 3. Use the table above to figure out the edges with delays. The four unfolded edges carry D, D, D and 2D.
3-Unfolded Example
Figure: a DFG with nodes U, V, T and its 3-unfolded DFG with nodes U0, V0, T0, U1, V1, T1, U2, V2, T2.
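4-Unfolded Example
Step 1. For each node U in the original DFG, draw J nodes U0, U1, U2, …, UJ−1 (here J = 4).
For each edge U → V with w delays, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays, i = 0, 1, …, J−1.
Figure: edge U → V with 37D and its 4-unfolded version (nodes U0…U3, V0…V3; three edges with 9D and one with 10D).
i   w    (i+w)%J   ⌊(i+w)/J⌋
0   37   1         9
1   37   2         9
2   37   3         9
3   37   0         10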
5.3 Properties of Unfolding
Unfolding preserves the number of registers (delays) in a DFG.
For a loop with w delays in a DFG that has been unfolded J times, it leads to g.c.d.(w, J) loops in the unfolded DFG, with each of these loops containing w/g.c.d.(w, J) delays and J/g.c.d.(w, J) copies of each node that appears in the original loop.
Unfolding a DFG with iteration bound T∞ results in a J-unfolded DFG with iteration bound J·T∞.
A path with w delays (w < J) in a DFG will lead to (J − w) paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore, it cannot create a critical path in the J-unfolded DFG.
Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then J-unfolding it.
g.c.d. = greatest common divisor. For example, the g.c.d. of 8 and 12 is 4.
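The delay-related properties can be checked with a few lines of arithmetic. The sketch below assumes that an edge (or path) with w delays unfolds into J copies carrying ⌊(i+w)/J⌋ delays each, as in the unfolding rule; the function name is hypothetical.

```python
def unfolded_delays(w, J):
    """Delays on the J unfolded copies of an edge or path with w delays."""
    return [(i + w) // J for i in range(J)]


def check_properties(w, J):
    delays = unfolded_delays(w, J)
    # Unfolding preserves the total number of delays.
    preserved = sum(delays) == w
    if w < J:
        # (J - w) copies without delay and w copies with exactly one delay.
        split_ok = delays.count(0) == J - w and delays.count(1) == w
    else:
        # J or more delays: every copy keeps at least one delay.
        split_ok = min(delays) >= 1
    return preserved and split_ok
```

For example, `unfolded_delays(2, 3)` gives one copy without delay and two copies with one delay each, while `unfolded_delays(37, 4)` keeps at least nine delays on every copy.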
Example
Figure: a single loop through nodes A, B, C with 6 delays in total, together with its 3-unfolded DFG (nodes A0…C2) and its 4-unfolded DFG (nodes A0…C3).
3-unfolded: g.c.d.(6, 3) = 3 loops, 6/3 = 2 delays per loop, 3/3 = 1 copy of each node that appears in the original loop.
4-unfolded: g.c.d.(6, 4) = 2 loops, 6/2 = 3 delays per loop, 4/2 = 2 copies of each node that appears in the original loop.
Example
In the original DFG: 2 loops.
A loop with 10 delays, J = 5: g.c.d.(10, 5) = 5 loops, 10/5 = 2 delays per loop, 5/5 = 1 copy of each node that appears in the original loop.
A loop with 10 delays, J = 2: g.c.d.(10, 2) = 2 loops, 10/2 = 5 delays per loop, 2/2 = 1 copy of each node that appears in the original loop.
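The loop counts in these examples follow from the g.c.d. property, and they can also be confirmed by tracing the unfolded loops. The sketch below makes a simplifying assumption: the whole loop is collapsed into a single node with a self-loop of w delays, which does not change the number of loops, delays per loop, or copies per node. Function names are illustrative.

```python
from math import gcd


def loop_stats(w, J):
    """Closed-form result of J-unfolding a loop with w delays."""
    g = gcd(w, J)
    return {"loops": g, "delays_per_loop": w // g, "copies_per_node": J // g}


def trace_loops(w, J):
    """Follow U_i -> U_{(i+w) % J} and return (copies, delays) per loop."""
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        i, copies, delays = start, 0, 0
        while True:
            seen.add(i)
            copies += 1
            delays += (i + w) // J  # delays on the edge leaving copy i
            i = (i + w) % J
            if i == start:
                break
        loops.append((copies, delays))
    return loops
```

For w = 6 and J = 4, `trace_loops` finds two loops, each visiting two node copies and holding three delays, in agreement with `loop_stats(6, 4)`.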
Unfolding and Iteration Bound
5.4 Critical Path, Unfolding, and Retiming
Consider a path with w < J delays in the original DFG. J-unfolding of this path leads to (J − w) paths with no delays and w paths with 1 delay each. The paths with no delays may create a new critical path.
Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore a path in the original DFG with J or more delays cannot create a critical path in the J-unfolded DFG.
From these, we can retime the original DFG such that the J-unfolded version of the retimed DFG will meet a specified critical path delay c. The critical path of the J-unfolded DFG can be c only if there exists a path in the original DFG with computation time c and fewer than J delays.
Assume that the critical path of the J-unfolded DFG is to be c: if D(U,V) > c, then Wr(U,V) = W(U,V) + r(V) − r(U) ≥ J.
Unfolding and Critical Path
Figure: the critical path of a DFG and the critical path of its 3-unfolded version (J = 3).
5.5 Applications of Unfolding
Applications of Unfolding:
Sample Period Reduction
• Case 1: A node in the DFG has computation time greater than T∞.
• Case 2: The iteration bound T∞ is not an integer.
• Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer.
Parallel Processing
• Word-Level Parallel Processing
• Bit-Level Parallel Processing
Sample Period Reduction
Case 1:
The original DFG cannot have sample period equal to the iteration bound when a node computation time is more than the iteration bound.
Figure annotations: T∞ = 3, Tcritical = 6.
Sample Period Reduction
Case 2: T∞ is not an integer
The original DFG cannot have sample period equal to the iteration bound when the iteration bound is not an integer.
Recall: In number theory, two integers a and b are said to be relatively prime, mutually prime, or coprime (also written co-prime) if the only positive integer (factor) that divides both of them is 1. This is equivalent to their greatest common divisor (gcd) being 1. (source: Wikipedia)
Sample Period Reduction
Case 2: T∞ is not an integer
Figure: original DFG with nodes S, T, U, V (computation time 1 each) and T∞ = 4/3, together with its 3-unfolded DFG (three copies of each node, computation time 1 each).
Original DFG: even retiming cannot achieve a critical path of less than 2.
3-unfolded DFG: T∞ = 4 for each loop. The number of samples per iteration is 3, hence the minimum sampling period of the unfolded DFG is 4/3 = T∞ of the original DFG.
Rule of thumb: In general, if a critical loop bound is tl/wl where tl and wl are mutually prime, then wl-unfolding should be used.
Sample Period Reduction
Case 3: The longest node computation time is larger than T∞, and T∞ is not an integer
In this case, unfold with a factor J where J·T∞ is an integer.
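The three cases can be folded into one small calculation. The sketch below rests on a stated assumption: J is the smallest factor for which J·T∞ is an integer and is at least the longest node computation time, which is how the textbook treatment is usually summarized; the slide itself only states the integer condition. The function name and the numbers in the usage note are hypothetical.

```python
from fractions import Fraction
from math import ceil


def min_unfolding_factor(t_inf, t_max):
    """Smallest J with J*t_inf an integer and J*t_inf >= t_max.

    t_inf: iteration bound as a Fraction (e.g. Fraction(4, 3)).
    t_max: longest node computation time.
    """
    t_inf = Fraction(t_inf)
    d = t_inf.denominator            # J must be a multiple of d
    # Smallest multiple k*d such that k*d*t_inf >= t_max.
    k = max(1, ceil(Fraction(t_max) / (d * t_inf)))
    return k * d
```

With T∞ = 4/3 and unit node times this returns 3 (Case 2), with T∞ = 3 and a node of time 4 it returns 2 (Case 1), and with T∞ = 4/3 and a node of time 6 it returns 6 (Case 3). In every case the resulting sample period J·T∞/J equals the original iteration bound.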
Parallel processing
Unfolding transformation is used to derive parallel
processing architectures from serial processing
architectures.
Figure: a serial filter structure with coefficients b0, b1, b2 and output y(n).
Parallel processing
Parallel processing:
Word-Level Parallel Processing
• Unfolding a word-serial architecture by J creates a word-parallel architecture that processes J words per clock cycle.
Bit-Level Parallel Processing
• Bit-serial processing
• Bit-parallel processing
• Digit-serial processing
Bit-level Parallel Processing
Let W be the word length of the data.
Bit-serial processing: one bit is processed per clock cycle and a complete word is processed in W clock cycles.
Bit-parallel processing: one word of W bits is processed every clock cycle (the W-unfolded architecture).
Digit-serial processing: N bits (1 < N < W) are processed per clock cycle and a word is processed in W/N clock cycles. N is the digit size (the N-unfolded architecture).
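The three processing styles differ only in how many bits of a word are presented per clock cycle. The sketch below is an illustration under two assumptions that the slide does not spell out: bits are sent least significant first, and W is a multiple of N. The function name is hypothetical.

```python
def to_digits(word, W, N):
    """Split a W-bit word into W/N digits of N bits, least significant first.

    N = 1 models bit-serial, 1 < N < W digit-serial, N = W bit-parallel.
    Each list element is what one clock cycle would process.
    """
    if W % N:
        raise ValueError("this sketch assumes W is a multiple of N")
    mask = (1 << N) - 1
    return [(word >> (N * cycle)) & mask for cycle in range(W // N)]
```

For a word length of W = 6, the word 0b101101 takes six cycles with N = 1, three cycles with N = 2, two cycles with N = 3, and a single cycle with N = 6.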
Figure: A demonstration of bit-parallel, bit-serial, and digit-serial processing styles for wordlength W = 6
Figure: The adder architectures with bit-parallel, bit-serial, and digit-serial styles
Figure: Bit-serial adder with word length of 4
Unfolding of Switches
The following assumptions are made when unfolding an edge U → V containing a switch:
• The word length W is a multiple of the unfolding factor J, i.e. W = W'J
• All edges into and out of the switch have no delays.
If so, an edge U → V switched at time instances Wl + u can be unfolded as:
• Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J)
• Draw an edge from the node Uu%J to the node Vu%J, which is switched at time instance (W'l + ⌊u/J⌋)
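The rewriting of a switching instance is plain integer arithmetic. This is a minimal sketch, assuming W is a multiple of J as required above; the function name and the returned tuple layout are illustrative.

```python
def unfold_switch(W, u, J):
    """Unfold a switch closed at time instances W*l + u by factor J.

    Returns (copy, period, offset): the edge U_copy -> V_copy is switched
    at time instances period*l + offset in the unfolded DFG.
    """
    if W % J:
        raise ValueError("W must be a multiple of J")
    W_prime = W // J
    # W*l + u = J*(W_prime*l + u // J) + (u % J)
    return u % J, W_prime, u // J
```

For W = 4 and J = 2, the instances 4l+0, 4l+1, 4l+2, 4l+3 map to copy 0 at 2l+0, copy 1 at 2l+0, copy 0 at 2l+1, and copy 1 at 2l+1.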
Example: Unfolding of Switches, J=3 (figures, parts 1 to 4)
Switch with multiple instances (figures, parts 1 and 2)
Switches with Delays
Unfolding a DFG containing an edge having a switch and
a positive number of delays is done by introducing a
dummy node.
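Figure: edge A → C with 2D is switched at time instances 6l + 1, 5 and edge B → C at 6l + 0, 2, 3, 4. A dummy node D is inserted after the 2D delays so that the switched edge D → C carries no delays.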
Switches with Delays
Figure: the unfolded DFG, with labels for the dummy nodes, the switching instances, and a dead node.
Figure: serial adder with input A (signals N, D, C, Mi, Ni, Ci, Si), switched at time instances 6l + 0 and 6l + 1,2,3,4,5.
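Example: How to Unfold a Bit-serial Adder
Figure: bit-serial adder X with inputs A and B (bits ai, bi) and sum output S (bits si). The carry out couti is fed back through a delay D. A switch selects the reset value (node Z, carry = 0) at time 4l+0 and the delayed carry at times 4l+1,2,3.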
Example: How to Unfold a Bit-serial Adder
Figure: the same adder with a dummy node D inserted between the carry delay and the switch. The switch connects Z → X at 4l+0 (reset, carry = 0) and D → X at 4l+1,2,3.
Unfold Bit-serial Adder, J=2
Figure: J = 2 copies of every node: A0, B0, X0, D0, Z0, S0 and A1, B1, X1, D1, Z1, S1.
For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, …, J−1.
If an edge has w = 0: Ui → Vi with 0 delays.
Unfold Bit-serial Adder, J=2
Figure: the 2-unfolded nodes with the delay edges drawn; one delay D remains.
For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, …, J−1.
X → D (w = 1) for i = 0: X0 → D1 with 0 delays
X → D (w = 1) for i = 1: X1 → D0 with 1 delay
Unfold the Switch, J=2
Figure: the bit-serial adder DFG (nodes A, B, X, D, Z, S) with the switch edges Z → X at 4l+0 and D → X at 4l+1,2,3.
Rewrite each switching instance as J(W'l + ⌊u/J⌋) + (u%J) with W = 4, J = 2, W' = 2:
4l+0 → 2(2l+0)+0
4l+1 → 2(2l+0)+1
4l+2 → 2(2l+1)+0
4l+3 → 2(2l+1)+1
D1 → X1 at time 2l+0,1, i.e. always closed
Unfold the Switch, J=2
Figure: the 2-unfolded adder (nodes A0, B0, X0, D0, Z0, S0 and A1, B1, X1, D1, Z1, S1) with one delay D; Z1 is a dead node.
Z0 → X0 at time 2l+0
D0 → X0 at time 2l+1
D1 → X1 at time 2l+0,1, i.e. always closed
Remove Dead and Dummy Nodes
Figure: the 2-unfolded adder after removing the dead node Z1 and the dummy nodes D0 and D1 (remaining nodes: A0, B0, X0, S0, Z0 and A1, B1, X1, S1).
Z0 → X0 at time 2l+0
D0 → X0 at time 2l+1 (carry to the next iteration, through one delay D)
D1 → X1 at time 2l+0,1, i.e. always closed (carry within the iteration)
Figure: The digit-serial adder (digit size of 2) designed by unfolding the bit-serial adder using J = 2
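A behavioural simulation makes the result of the 2-unfolding concrete. The sketch below models the bit-serial adder (word length 4, carry reset at 4l+0) and the digit-serial adder obtained for J = 2 (carry reset at 2l+0, carry passed within the iteration from copy 0 to copy 1, and through one delay from copy 1 to copy 0). It is an illustration at the bit level, assuming least significant bit first; all function names are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))


def to_bits(value, W):
    """W bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(W)]


def bit_serial_add(a_bits, b_bits, W=4):
    """One bit per clock; carry-in is 0 at clock W*l + 0."""
    carry, out = 0, []
    for t in range(len(a_bits)):
        cin = 0 if t % W == 0 else carry
        s, carry = full_adder(a_bits[t], b_bits[t], cin)
        out.append(s)
    return out


def digit_serial_add_2(a_bits, b_bits, W=4):
    """Two bits per clock (J = 2); W is assumed even."""
    carry, out = 0, []
    for k in range(len(a_bits) // 2):
        # Copy 0: reset at clock (W/2)*l + 0, otherwise the delayed carry.
        cin0 = 0 if k % (W // 2) == 0 else carry
        s0, c0 = full_adder(a_bits[2 * k], b_bits[2 * k], cin0)
        # Copy 1: carry-in always comes from copy 0 in the same clock.
        s1, carry = full_adder(a_bits[2 * k + 1], b_bits[2 * k + 1], c0)
        out += [s0, s1]
    return out
```

Feeding both adders the same stream of 4-bit words produces identical sum bits, with the digit-serial version needing half as many clock cycles.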
Fully Parallel Adder, i.e. J=4
Figure: J = 4 copies of every node, ordered from LSB (copy 0) to MSB (copy 3): A0…A3, B0…B3, X0…X3, D0…D3, Z0…Z3, S0…S3; one delay D remains.
For each node U in the original DFG, draw J nodes U0, U1, U2, …, UJ−1.
For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, …, J−1.
Unfold the Switch, J=4
Figure: the bit-serial adder DFG with the switch edges Z → X at 4l+0 and D → X at 4l+1,2,3.
With W = 4, J = 4, W' = 1:
4l+0 → 4(1l+0)+0
4l+1 → 4(1l+0)+1
4l+2 → 4(1l+0)+2
4l+3 → 4(1l+0)+3
Bit-parallel Adder
Figure: the 4-unfolded adder, from LSB (copy 0) to MSB (copy 3), with nodes A0…A3, B0…B3, X0…X3, D0…D3, Z0…Z3, S0…S3 and one delay D.
Only one time instance (l+0) remains, i.e. fully parallel:
Z0 → X0, D1 → X1, D2 → X2 and D3 → X3
Remove Dead and Dummy Nodes
Dead nodes can be removed.
Dummy nodes can be removed.
Bit-parallel Adder
Figure: the 4-unfolded adder after removing dead and dummy nodes (A0…A3, B0…B3, X0…X3, S0…S3, Z0), with carry in (Cin) and carry out.
The result is a carry ripple adder with inputs a3 a2 a1 a0 and b3 b2 b1 b0, and outputs Cout and s3 s2 s1 s0.
Figure: The digit-serial adder (digit size of 4) designed by unfolding the bit-serial adder using J = 4
If Wordlength is not a multiple of J
Determine L = lcm{W, J} (lcm = least common multiple)
Replace the switching instance Wl+u with L/W instances Ll+u+wW, for w = 0, 1, ..., L/W−1
i.e. the switching periodicity has been changed from W to L
Perform the unfolding as previously
Identify the correspondence between original instances and expanded instances
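The expansion to period L followed by the usual switch unfolding can be sketched in a few lines. The code below is an illustration under the steps listed above; the function name and the tuple layout (copy, period, offset) are hypothetical, and `gcd` is used to form the least common multiple so that the sketch also runs on older Python versions.

```python
from math import gcd


def unfold_switch_general(W, u, J):
    """Unfold a switch closed at W*l + u by J when W need not divide by J.

    Returns a list of (copy, period, offset): the edge U_copy -> V_copy is
    switched at time instances period*l + offset in the unfolded DFG.
    """
    L = W * J // gcd(W, J)            # L = lcm(W, J)
    result = []
    for w in range(L // W):
        t = u + w * W                 # expanded instance L*l + t
        # L*l + t = J*((L/J)*l + t // J) + (t % J)
        result.append((t % J, L // J, t // J))
    return result
```

For the bit-serial adder with W = 4 and J = 3 (L = 12), the reset instance 4l+0 expands to 12l+0, 12l+4, 12l+8, which places the reset on copy 0 at 4l+0, on copy 1 at 4l+1, and on copy 2 at 4l+2.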
Example: Unfold Bit-serial Adder by J=3 (figures, parts 1/11 to 11/11)
Figure: the 3-unfolded adder with nodes A0, B0, X0, D0, Z0, S0; A1, B1, X1, D1, Z1, S1; and A2, B2, X2, D2, Z2, S2.
Switching instances after unfolding (W = 4, J = 3, L = 12):
Z0 → X0 at 4l+0;  D0 → X0 at 4l+1,2,3
Z1 → X1 at 4l+1;  D1 → X1 at 4l+0,2,3
Z2 → X2 at 4l+2;  D2 → X2 at 4l+0,1,3
Remove Dead and Dummy Nodes
Figure: the 3-unfolded adder after removing the dummy nodes D0, D1, D2 (remaining nodes: A0…A2, B0…B2, X0…X2, S0…S2, Z0…Z2), with the same switching instances.
Figure: Digit-serial adder with digit size of 3
5.6 Conclusions
Unfolding algorithm is a graph-based transformation
technique
Reveal hidden concurrencies of the program
Retrieve a smaller iteration period of the algorithm
Unfolding and retiming
Applications for sample period reduction
Applications for parallel processing
Word-level
Bit-level
• Bit-serial
• Bit-parallel
• Digit-serial