FPGA - Ch5 - Unfolding

The document discusses unfolding, which is a technique for transforming algorithms by creating a new program that describes multiple iterations of the original program. Unfolding duplicates nodes and edges in a data flow graph (DFG) based on an unfolding factor J. This preserves precedence constraints between operations and maintains the number of registers/delays in the DFG. Unfolding can reveal hidden parallelism and is used to design high-speed and low-power architectures.

Uploaded by

Eli Eli Trần

ĐHBK Tp HCM (Ho Chi Minh City University of Technology)

BMĐT (Department of Electronics)
Lecturer: Hồ Trung Mỹ

Ch.05
Unfolding

References:
1. Slides from Prof. Parhi's book
2. Slides by Prof. Viktor Öwall
3. Slides by Prof. Lan-Da Van
Outline
5.1 Introduction
5.2 An Algorithm for Unfolding
5.3 Properties of Unfolding
5.4 Critical Path, Unfolding, and Retiming
5.5 Applications of Unfolding
5.6 Conclusions

5.1 Introduction
• Unfolding is a transformation technique applied to a DSP algorithm to create a new program that describes more than one iteration of the original program.
[Figure: a loop over the body "1 2 3" is unfolded into the repeated sequence "1 2 3 1 2 3 1 2 3 ..."]
• Unfolding is also referred to as loop unrolling, a term used in
  • assembly programming
  • compiler theory
• An unfolding factor of J describes J consecutive iterations of the original program.
• Applications to high-speed and low-power VLSI architectures:
  • Reveal hidden concurrencies so that the program can be scheduled to a smaller iteration period (sample period reduction, reaching the iteration bound T∞, which increases throughput).
  • Design parallel architectures at the word level and bit level.
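Unfolding Example
• Example: y(n) = a·y(n-9) + x(n). After 2-unfolding we have
  y(2k) = a·y(2k-9) + x(2k)
  y(2k+1) = a·y(2k-8) + x(2k+1)
• The above equations can be rewritten as
  y(2k) = a·y(2(k-5)+1) + x(2k)
  y(2k+1) = a·y(2(k-4)+0) + x(2k+1)
[Figure: the original DFG (input x(n), multiplier a, feedback edge with 9D, output y(n)) and its 2-unfolded version (J = 2), in which the feedback edge into y(2k) carries 5D and the feedback edge into y(2k+1) carries 4D]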
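The equivalence of the original recursion and its 2-unfolded form can be checked numerically. The following is a minimal sketch, assuming zero initial conditions (y(n) = 0 for n < 0); the function names `original` and `unfolded_by_2` are illustrative and do not come from the slides.

```python
def original(x, a):
    """Direct evaluation of y(n) = a*y(n-9) + x(n), assuming y(n) = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        prev = y[n - 9] if n >= 9 else 0
        y.append(a * prev + xn)
    return y


def unfolded_by_2(x, a):
    """Evaluate two samples per iteration k using the 2-unfolded equations."""
    assert len(x) % 2 == 0, "this sketch assumes an even number of samples"
    even, odd = [], []  # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        # y(2k) needs y(2(k-5)+1): the odd stream, 5 iterations back (5D)
        y_even = a * (odd[k - 5] if k >= 5 else 0) + x[2 * k]
        # y(2k+1) needs y(2(k-4)): the even stream, 4 iterations back (4D)
        y_odd = a * (even[k - 4] if k >= 4 else 0) + x[2 * k + 1]
        even.append(y_even)
        odd.append(y_odd)
    # Interleave the two streams back into sample order.
    return [v for pair in zip(even, odd) for v in pair]
```

Both functions compute the same output sequence; the unfolded one simply produces two samples per loop iteration, which is what the 5D and 4D edges of the unfolded DFG express.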
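5.2 Unfolding Algorithm

• Step 1: For each node U in the original DFG, draw J nodes U0, U1, ..., UJ-1 with the same function as U.
• Step 2: For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, ..., J-1.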
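The two steps of the unfolding algorithm (J copies of every node, and J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for every original edge) can be sketched in a few lines. The edge-list representation and the name `unfold` are assumptions made for illustration only.

```python
def unfold(edges, J):
    """Return the edges of the J-unfolded DFG.

    edges: list of (U, V, w) tuples, where w is the number of delays on U -> V.
    Result: list of ((U, i), (V, j), delays) tuples, one per unfolded edge.
    Node copies are represented as (name, index) pairs, e.g. ("U", 0) for U0.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            # Edge U_i -> V_{(i+w) % J} carries floor((i+w)/J) delays.
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


if __name__ == "__main__":
    # Edge U -> V with 37 delays, unfolded by J = 4.
    for src, dst, d in unfold([("U", "V", 37)], 4):
        print(src, "->", dst, "with", d, "delays")
```

Summing the delays over the J unfolded copies of an edge always gives back w, which is why unfolding preserves the number of delays in a DFG without switches.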
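Example
[Figure: the DFG for y(n) = a·y(n-9) + x(n), drawn with nodes A, B, C, D and a 9D edge, and its 2-unfolded version (J = 2) with nodes A0, B0, C0, D0, A1, B1, C1, D1 and edges carrying 5D and 4D]

Unfolding the 9D edge with J = 2:

i   w   (i+w)%J   ⌊(i+w)/J⌋
0   9   1         4
1   9   0         5

Each edge of the original DFG maps to J edges of the J-unfolded DFG:

$$ U \xrightarrow{\;w\;} V \quad\Longrightarrow\quad U_i \xrightarrow{\;\lfloor (i+w)/J \rfloor\;} V_{(i+w)\,\%\,J}, \qquad i = 0, 1, 2, \dots, J-1 $$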
Another DFG Unfolding Example (1)

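J = 2

i   w   (i+w)%J   ⌊(i+w)/J⌋
0   2   0         1
1   2   1         1
0   3   1         1
1   3   0         2

[Figure: original DFG with nodes Q, R, S, T and two delayed edges (2D and 3D); the unfolded graph so far contains only the node copies Q0, R0, S0, T0, Q1, R1, S1, T1]

Step 1. Duplicate J copies of each node.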
Another DFG Unfolding Example (2)

J = 2

[Figure: original DFG with nodes Q, R, S, T and two delayed edges (2D and 3D); the unfolded graph now also contains the copies of every zero-delay edge]

Step 2. Add all edges with 0 delay on them.
Another DFG Unfolding Example (3)

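J = 2

i   w   (i+w)%J   ⌊(i+w)/J⌋
0   2   0         1
1   2   1         1
0   3   1         1
1   3   0         2

[Figure: original DFG with nodes Q, R, S, T and two delayed edges (2D and 3D); the 2-unfolded DFG now contains the delayed edges, with delays D, D, D, and 2D]

Step 3. Use the table above to figure out the edges with delays.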
3-Unfolded Example
[Figure: a DFG with nodes U, V, T and its 3-unfolded version with nodes U0, V0, T0, U1, V1, T1, U2, V2, T2; edge labels in the figure include D, 2D, 5D, and 6D]
4-Unfolded Example
J = 4

Step 1. For each node U in the original DFG, draw J nodes U0, U1, U2, ..., UJ-1.

Step 2. For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, ..., J-1.

i   w    (i+w)%J   ⌊(i+w)/J⌋
0   37   1         9
1   37   2         9
2   37   3         9
3   37   0         10

[Figure: original edge U → V with 37D; 4-unfolded edges U0 → V1 (9D), U1 → V2 (9D), U2 → V3 (9D), U3 → V0 (10D)]
Unfolding preserves precedence constraints
• The edges in the original DFG explicitly show the precedence constraints for 1 iteration of the original program.
• The edges in the J-unfolded DFG explicitly show the precedence constraints for J iterations of the program.
• The edge Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays in the unfolded DFG corresponds to the edge U → V with w delays in the original DFG:

$$ U \xrightarrow{\;w\;} V \quad\Longrightarrow\quad U_i \xrightarrow{\;\lfloor (i+w)/J \rfloor\;} V_{(i+w)\,\%\,J} $$

• The k-th iteration of Ui corresponds to the (Jk+i)-th iteration of the node U, and the (k + ⌊(i+w)/J⌋)-th iteration of V(i+w)%J corresponds to the (Jk+i+w)-th iteration of the node V.
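The index bookkeeping behind this correspondence can be written out as a short derivation. It is a sketch under the usual DFG convention (assumed here) that a value produced by U in iteration n is consumed by V in iteration n + w:

$$
\begin{aligned}
\text{iteration of } U &: \; n = Jk + i, \qquad 0 \le i \le J-1,\\
\text{iteration of } V &: \; n + w = Jk + i + w
  = J\left(k + \left\lfloor \tfrac{i+w}{J} \right\rfloor\right) + \big((i+w) \,\%\, J\big).
\end{aligned}
$$

The remainder $(i+w)\,\%\,J$ selects the copy of V that executes this iteration, and the quotient $\lfloor (i+w)/J \rfloor$ is the number of unfolded iterations (delays) separating the producer $U_i$ from that copy.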
5.3 Properties of Unfolding
• Unfolding preserves the number of registers (delays) in a DFG.
• A loop with w delays in a DFG that has been unfolded J times leads to g.c.d.(w, J) loops in the unfolded DFG, with each of these loops containing
  • w/g.c.d.(w, J) delays and
  • J/g.c.d.(w, J) copies of each node that appears in the original loop.
• Unfolding a DFG with iteration bound T∞ results in a J-unfolded DFG with iteration bound J·T∞.
• A path with w delays (w < J) in a DFG leads to (J - w) paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
• Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore, it cannot create a critical path in the J-unfolded DFG.
• Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then unfolding it by J.

g.c.d. = greatest common divisor. For example, the g.c.d. of 8 and 12 is 4.
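The loop property can be checked by unfolding a single loop and tracing the resulting cycles. The sketch below assumes the loop is given as an ordered edge list forming one simple cycle; `unfold_loop` is an illustrative name, not something defined in the slides.

```python
from math import gcd


def unfold_loop(loop_edges, J):
    """Unfold one simple loop and describe the loops that result.

    loop_edges: [(U, V, w), ...] forming a single simple loop.
    Returns one (copies_per_node, delays) pair per loop of the J-unfolded DFG.
    """
    # In a simple loop every node has exactly one outgoing edge, so each
    # unfolded node copy has exactly one successor.
    succ = {}
    for u, v, w in loop_edges:
        for i in range(J):
            succ[(u, i)] = ((v, (i + w) % J), (i + w) // J)

    first = loop_edges[0][0]
    seen, loops = set(), []
    for i in range(J):
        start = (first, i)
        if start in seen:
            continue
        node, delays, copies = start, 0, 0
        while True:
            seen.add(node)
            if node[0] == first:
                copies += 1  # how many copies of `first` this loop visits
            node, d = succ[node]
            delays += d
            if node == start:
                break
        loops.append((copies, delays))
    return loops


if __name__ == "__main__":
    loop = [("A", "B", 1), ("B", "C", 2), ("C", "A", 3)]  # 6 delays in total
    for J in (3, 4):
        print(J, unfold_loop(loop, J), "expected loops:", gcd(6, J))
```

For the 6-delay loop, J = 3 gives three loops with 2 delays each, and J = 4 gives two loops with 3 delays and 2 copies of each node, in line with the g.c.d. rule.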
Example

A single loop with w = 2 delays, unfolded with J = 3:

• g.c.d.(2, 3) = 1 loop in the unfolded DFG
• 2/1 = 2 delays in that loop
• 3/1 = 3 copies of each node that appears in the original loop
Example

A single loop through nodes A, B, and C with w = 6 delays in total (edge delays D, 2D, and 3D).

3-unfolded:
• g.c.d.(6, 3) = 3 loops
• 6/3 = 2 delays per loop
• 3/3 = 1 copy of each node that appears in the original loop
[Figure: 3-unfolded DFG with nodes A0, B0, C0, A1, B1, C1, A2, B2, C2]

4-unfolded:
• g.c.d.(6, 4) = 2 loops
• 6/2 = 3 delays per loop
• 4/2 = 2 copies of each node that appears in the original loop
[Figure: 4-unfolded DFG with nodes A0 to A3, B0 to B3, C0 to C3]
Example
In the original DFG: 2 loops
• g.c.d.(10, 5) = 5 (loops)
  • 10/5 = 2 delays/loop
  • 5/5 = 1 copy of each node that appears in the original loop
• g.c.d.(10, 2) = 2 (loops)
  • 10/2 = 5 delays/loop
  • 2/2 = 1 copy of each node that appears in the original loop
Unfolding and Iteration Bound
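
The relation between unfolding and the iteration bound follows from the loop property. As a sketch, let a loop $l$ of the original DFG have computation time $t_l$ and $w_l$ delays, and write $g = \gcd(w_l, J)$ (the symbol $g$ is introduced here only for brevity). Each loop that $l$ produces in the J-unfolded DFG contains $J/g$ copies of every node of $l$ and $w_l/g$ delays, so its loop bound is

$$
\frac{(J/g)\, t_l}{w_l/g} = J \,\frac{t_l}{w_l}.
$$

Taking the maximum over all loops gives

$$
T_\infty' = \max_l \left( J \,\frac{t_l}{w_l} \right) = J\, T_\infty ,
$$

where $T_\infty'$ denotes the iteration bound of the J-unfolded DFG. Since the unfolded DFG processes J samples per iteration, the bound per sample is unchanged.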

5.4 Critical Path, Unfolding, and Retiming
• Consider a path with w delays (w < J) in the original DFG. J-unfolding of this path leads to (J - w) paths with no delays and w paths with 1 delay each, which may create a new critical path.
• Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore a path in the original DFG with J or more delays cannot create a critical path in the J-unfolded DFG.
• From these, we can retime the original DFG such that the J-unfolded version of the retimed DFG will meet a specified critical path delay c. A path of computation time c can appear without delays in the J-unfolded DFG only if the original DFG has a path with computation time c and fewer than J delays.
• Assume that the desired critical path of the J-unfolded DFG is c. If D(U,V) > c, then wr(U,V) = W(U,V) + r(V) - r(U) ≥ J must hold.
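How the delays of a path are distributed over its J unfolded copies can be computed edge by edge. This is a minimal sketch; the function name `unfolded_path_delays` and the list-of-edge-delays input format are assumptions for illustration.

```python
def unfolded_path_delays(edge_delays, J):
    """Total delays on each of the J unfolded copies of a path.

    edge_delays: delays on the consecutive edges of a path in the original DFG.
    Entry i of the result is the delay count of the copy that starts at the
    i-th copy of the path's first node.
    """
    totals = []
    for i in range(J):
        idx, total = i, 0
        for w in edge_delays:
            total += (idx + w) // J   # delays on this unfolded edge
            idx = (idx + w) % J       # index of the node copy it reaches
        totals.append(total)
    return totals


if __name__ == "__main__":
    # A path with w = 2 delays and J = 5: J - w copies without delays, w copies with 1.
    print(unfolded_path_delays([1, 1], 5))
    # A path with w = 7 >= J = 3 delays: every copy keeps at least one delay.
    print(unfolded_path_delays([3, 4], 3))
```

Only paths with fewer than J delays yield zero-delay copies, which is why retiming the original DFG so that every long path carries at least J delays bounds the critical path of the unfolded DFG.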
Unfolding and Critical Path
[Figure: the critical path in the original DFG and the critical path in its 3-unfolded (J = 3) version]
5.5 Applications of Unfolding
Applications of Unfolding:
• Sample period reduction
  • Case 1: A node in the DFG has a computation time greater than T∞.
  • Case 2: The iteration bound is not an integer.
  • Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer.
• Parallel processing
  • Word-level parallel processing
  • Bit-level parallel processing
Sample Period Reduction
Case 1:
• The original DFG cannot have a sample period equal to the iteration bound when a node computation time is more than the iteration bound.
[Figure: original DFG, annotated T∞ = 3 and Tcritical = 6]
• The minimum sample period after retiming is 4 u.t., which is greater than T∞ = 3.


Sample Period Reduction
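Case 1:
• The unfolded DFG performs 2 iterations of the original DFG in 6 u.t., so the sample period of the unfolded DFG is 6/2 = 3 u.t., which is the same as the iteration bound of the original DFG.
[Figure: 2-unfolded DFG, annotated T∞ = 6 and Tcritical = 6]
• Rule of thumb: in general, if the computation time tU of a node U is greater than the iteration bound T∞, then ⌈tU/T∞⌉-unfolding should be used.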
Sample Period Reduction
Case 2: T∞ is not an integer
• The original DFG cannot have a sample period equal to the iteration bound when the iteration bound is not an integer.

Recall: In number theory, two integers a and b are said to be relatively prime, mutually prime, or coprime (also written co-prime) if the only positive integer (factor) that divides both of them is 1. This is equivalent to their greatest common divisor (gcd) being 1. (source: Wikipedia)
Sample Period Reduction
Case 2: T∞ is not an integer
[Figure: original DFG with nodes S, T, U, V, each with computation time (1), and T∞ = 4/3; its 3-unfolded version contains three copies of each node]
• In the original DFG, even retiming cannot achieve a critical path of less than 2.
• In the 3-unfolded DFG, T∞ = 4 for each loop. The number of samples processed per iteration is 3, and hence the minimum sampling period of the unfolded DFG is 4/3, which equals T∞ of the original DFG.
• Rule of thumb: in general, if a critical loop bound is tl/wl, where tl and wl are mutually prime, then wl-unfolding should be used.
Sample Period Reduction
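Case 3: T∞ is not an integer and the longest node computation time is larger than T∞
• In this case, use the smallest unfolding factor J for which J·T∞ is an integer and J·T∞ is at least the longest node computation time.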
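The three sample-period-reduction cases can be folded into one search for the smallest suitable unfolding factor. This is a sketch under the assumption that the goal is the smallest J for which J·T∞ is an integer and is at least the longest node computation time; the function name is illustrative, and the Case 3 numbers in the usage lines are hypothetical rather than taken from the slides.

```python
from fractions import Fraction


def min_unfolding_factor(t_inf, longest_node_time):
    """Smallest J such that J*t_inf is an integer and J*t_inf >= longest_node_time.

    t_inf: iteration bound, given as an int or a Fraction (e.g. Fraction(4, 3)).
    longest_node_time: largest node computation time in the DFG.
    """
    t_inf = Fraction(t_inf)
    J = 1
    while (J * t_inf).denominator != 1 or J * t_inf < longest_node_time:
        J += 1
    return J


if __name__ == "__main__":
    # Case 1: T_inf = 3 with a node taking 4 u.t. (node time assumed here).
    print(min_unfolding_factor(3, 4))
    # Case 2: T_inf = 4/3 with unit-time nodes.
    print(min_unfolding_factor(Fraction(4, 3), 1))
    # Case 3: T_inf = 4/3 with a hypothetical node taking 6 u.t.
    print(min_unfolding_factor(Fraction(4, 3), 6))
```

Exact rational arithmetic is used so that the integrality test on J·T∞ is not affected by floating-point rounding.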
Parallel processing
• The unfolding transformation is used to derive parallel processing architectures from serial processing architectures.
[Figure: 3-tap FIR filter with input x(n), delay elements D producing x(n-1) and x(n-2), coefficients b0, b1, b2, and output y(n)]
Parallel processing
Parallel processing:
• Word-level parallel processing
  • Unfolding a word-serial architecture by J creates a word-parallel architecture that processes J words per clock cycle.
• Bit-level parallel processing
  • Bit-serial processing
  • Bit-parallel processing
  • Digit-serial processing
Bit-level Parallel Processing
Let W be the word length of the data.
Bit-serial processing: one bit is processed per clock cycle, and a complete word is processed in W clock cycles.
Bit-parallel processing: one word of W bits is processed every clock cycle (the bit-serial architecture unfolded by W).
Digit-serial processing: N bits (1 < N < W) are processed per clock cycle, and a word is processed in W/N clock cycles. N is the digit size (the bit-serial architecture unfolded by N).
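The three processing styles differ only in how many bits of a word are presented per clock cycle. The sketch below splits a W-bit word into N-bit digits, assuming least-significant-digit-first ordering and that W is a multiple of N; the helper name `to_digits` is hypothetical.

```python
def to_digits(value, W, N):
    """Split a W-bit word into W // N digits of N bits, least significant first.

    N = 1 corresponds to bit-serial, N = W to bit-parallel, and
    1 < N < W to digit-serial processing; one digit is consumed per clock cycle.
    """
    if W % N != 0:
        raise ValueError("this sketch assumes W is a multiple of N")
    mask = (1 << N) - 1
    return [(value >> (N * c)) & mask for c in range(W // N)]


if __name__ == "__main__":
    word = 0b101101  # word length W = 6
    for N in (1, 2, 3, 6):
        digits = to_digits(word, 6, N)
        print(f"N={N}: {len(digits)} clock cycles, digits {digits}")
```

The length of the returned list is the number of clock cycles needed per word, W/N, matching the definitions above.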
[Figure: a demonstration of bit-parallel, bit-serial, and digit-serial processing styles for word length W = 6]

[Figure: the adder architectures in bit-parallel, bit-serial, and digit-serial styles]

[Figure: bit-serial adder with word length of 4; l denotes the iteration index]
Unfolding of Switches
• The following assumptions are made when unfolding an edge U → V containing a switch:
  • The word length W is a multiple of the unfolding factor J, i.e. W = W'J.
  • All edges into and out of the switch have no delays.
• If so, an edge U → V can be unfolded as follows:
  • Write the switching instance as
    Wl + u = J(W'l + ⌊u/J⌋) + (u%J)
  • Draw an edge from the node Uu%J to the node Vu%J, which is switched at time instance (W'l + ⌊u/J⌋).
[Figure: edge U → V containing a switch with switching instance Wl + u]
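The rewriting of switching instances can be sketched as a small helper. It assumes, as stated above, that W is a multiple of J; the function name and the dictionary return format are illustrative choices, not part of the slides.

```python
def unfold_switch(W, J, instances):
    """Unfold the switching instances W*l + u of an edge U -> V by a factor J.

    instances: the offsets u (0 <= u < W) at which the original switch is closed.
    Returns {copy: offsets}, meaning that the edge U_copy -> V_copy is
    switched at time instances (W // J) * l + c for every c in offsets.
    """
    if W % J != 0:
        raise ValueError("this sketch assumes W is a multiple of J")
    result = {}
    for u in instances:
        # W*l + u = J * ((W // J) * l + u // J) + (u % J)
        result.setdefault(u % J, []).append(u // J)
    return {copy: sorted(offsets) for copy, offsets in result.items()}


if __name__ == "__main__":
    # Bit-serial adder with W = 4: Z -> X at 4l + 0, D -> X at 4l + 1, 2, 3.
    print(unfold_switch(4, 2, [0]))
    print(unfold_switch(4, 2, [1, 2, 3]))
```

With W = 4 and J = 2 this yields Z0 → X0 at 2l + 0, D0 → X0 at 2l + 1, and D1 → X1 at 2l + 0 and 2l + 1, i.e. always closed.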
Example: Unfolding of Switches, J=3

Example: Unfolding of Switches, J=3 (2)

Example: Unfolding of Switches, J=3 (3)

Example: Unfolding of Switches, J=3 (4)

Switch with multiple instances

Switch with multiple instances (2)

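Switches with Delays
• Unfolding a DFG containing an edge having a switch and a positive number of delays is done by introducing a dummy node.
[Figure: nodes A and B feed node C through a switch; the edge from A carries 2D and is selected at time instances 6l + 1, 5, while B is selected at 6l + 0, 2, 3, 4. After inserting a dummy node D on the delayed edge, the edges into and out of the switch have no delays.]
• To unfold by J = 3, the 6 switching instances can be rewritten as
  6l + 0 = 3(2l + 0) + 0
  6l + 1 = 3(2l + 0) + 1
  6l + 2 = 3(2l + 0) + 2
  6l + 3 = 3(2l + 1) + 0
  6l + 4 = 3(2l + 1) + 1
  6l + 5 = 3(2l + 1) + 2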
Switches with Delays
[Figure: the unfolded DFG with its switching instances, showing the dummy nodes and a dead node]
• It may be noted that the number of delays is not preserved by unfolding of circuits containing switches.
Example: How to Unfold a Bit-serial Adder
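[Figure: bit-serial adder with inputs M and N (bits Mi, Ni), sum output S (bit Si), and carry bit Ci fed back through a delay D. A switch selects the reset value Z = 0 at time instances 6l + 0 and the delayed carry bit at 6l + 1, 2, 3, 4, 5.]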
Example: How to Unfold a Bit-serial Adder
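[Figure: bit-serial adder X with inputs A and B (bits ai, bi), sum output S (bit si), and carry output couti fed back through a delay D. A switch selects Z (reset, carry = 0) at time instances 4l + 0 and the delayed carry at 4l + 1, 2, 3.]

Example: How to Unfold a Bit-serial Adder
[Figure: the same bit-serial adder after inserting a dummy node D between the carry delay and the switch, so that the edges into and out of the switch have no delays]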
Unfold Bit-serial Adder, J=2

[Figure: node copies A0, B0, X0, D0, Z0, S0 and A1, B1, X1, D1, Z1, S1]
• For each node U in the original DFG, draw J nodes U0, U1, U2, ..., UJ-1.

Unfold Bit-serial Adder, J=2
[Figure: the node copies with the zero-delay edges added]
• For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, ..., J-1.
• If an edge has w = 0 ⇒ Ui → Vi with 0 delays.

Unfold Bit-serial Adder, J=2
[Figure: the node copies with the edges X0 → D1 (no delay) and X1 → D0 (one delay, D) added]
• X → D (w = 1) for i = 0 ⇒ X0 → D1 with 0 delays
• X → D (w = 1) for i = 1 ⇒ X1 → D0 with 1 delay
Unfold the Switch, J=2
[Figure: bit-serial adder with dummy node D; switch edges Z → X at 4l + 0 and D → X at 4l + 1, 2, 3]
• Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
  Z → X: 4l + 0 = 2(2l + 0) + 0
  D → X: 4l + 1 = 2(2l + 0) + 1
         4l + 2 = 2(2l + 1) + 0
         4l + 3 = 2(2l + 1) + 1
• Z0 → X0 at time 2l + 0
• D0 → X0 at time 2l + 1
• D1 → X1 at time 2l + 0, 1, i.e. always closed

Unfold the Switch, J=2
[Figure: 2-unfolded DFG with nodes A0, B0, X0, D0, Z0, S0 and A1, B1, X1, D1, Z1, S1; X0 is switched between Z0 (2l + 0) and D0 (2l + 1), D1 → X1 is always closed, one delay D remains, and Z1 is a dead node]

Remove Dead and Dummy Nodes
[Figure: the 2-unfolded DFG after removing the dead node and the dummy nodes. The carry passes directly from X0 to X1 within an iteration and from X1 back to X0 through one delay (D = 1) for the next iteration; X0 selects Z0 at 2l + 0 and the delayed carry at 2l + 1.]
The digit-serial adder designed (digit size of 2) by
unfolding the bit-serial adder using J = 2

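A behavioural comparison shows that the 2-unfolded (digit-serial) adder computes the same sums as the bit-serial adder. This is a sketch that assumes word length W = 4, LSB-first bit streams, and a carry that is reset at the start of every word; both function names are illustrative.

```python
def bit_serial_add(a_bits, b_bits, W=4):
    """Add two LSB-first bit streams, one bit per clock cycle."""
    out, carry = [], 0
    for t, (a, b) in enumerate(zip(a_bits, b_bits)):
        if t % W == 0:
            carry = 0              # switch selects Z (carry = 0) at W*l + 0
        total = a + b + carry
        out.append(total & 1)
        carry = total >> 1         # held in the delay D until the next cycle
    return out


def digit_serial_add(a_bits, b_bits, W=4):
    """2-unfolded version: two bits (one digit of size 2) per clock cycle."""
    out, carry = [], 0
    for k in range(len(a_bits) // 2):
        if k % (W // 2) == 0:
            carry = 0              # Z0 -> X0 at (W/2)*l + 0
        s0 = a_bits[2 * k] + b_bits[2 * k] + carry               # adder copy X0
        s1 = a_bits[2 * k + 1] + b_bits[2 * k + 1] + (s0 >> 1)   # carry within the iteration
        out += [s0 & 1, s1 & 1]
        carry = s1 >> 1            # carry to the next iteration (one delay)
    return out


if __name__ == "__main__":
    to_bits = lambda v: [(v >> i) & 1 for i in range(4)]
    a, b = to_bits(5) + to_bits(15), to_bits(6) + to_bits(1)
    print(bit_serial_add(a, b))
    print(digit_serial_add(a, b))
```

The unfolded adder needs W/2 clock cycles per word instead of W, at the cost of two full-adder copies chained within each cycle.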
Fully Parallel Adder, i.e. J=4
[Figure: node copies A0 to A3, B0 to B3, X0 to X3, D0 to D3, Z0 to Z3, S0 to S3, ordered from LSB (index 0) to MSB (index 3), with one delay D on the edge X3 → D0]
• For each node U in the original DFG, draw J nodes U0, U1, U2, ..., UJ-1.
• For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays for i = 0, 1, ..., J-1.
Unfold the Switch, J=4
[Figure: bit-serial adder with dummy node D; switch edges Z → X at 4l + 0 and D → X at 4l + 1, 2, 3]
• Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
  Z → X: 4l + 0 = 4(l + 0) + 0
  D → X: 4l + 1 = 4(l + 0) + 1
         4l + 2 = 4(l + 0) + 2
         4l + 3 = 4(l + 0) + 3
• Only 1 time instance (l + 0) remains, i.e. fully parallel.
• Z0 → X0, D1 → X1, D2 → X2, and D3 → X3

Bit-parallel Adder

[Figure: 4-unfolded DFG in which Z0 → X0, D1 → X1, D2 → X2, and D3 → X3 are always closed (only 1 time instance, i.e. fully parallel); the unused Z copies are marked as dead nodes]

Remove Dead and Dummy Nodes
• Dead nodes can be removed.
• Dummy nodes can be removed.

Bit-parallel Adder
[Figure: the resulting carry-ripple adder with inputs a3 a2 a1 a0 and b3 b2 b1 b0, carry in Cin (Z0), sum outputs s3 s2 s1 s0, and carry out Cout]

The digit-serial adder designed (digit size of 4) by
unfolding the bit-serial adder using J = 4

If Wordlength is not a multiple of J
• Determine L = lcm{W, J}, where lcm = least common multiple.
• Replace the switching instance Wl + u with the L/W instances Ll + u + wW, for w = 0, 1, ..., L/W - 1, i.e. the switching periodicity is changed from W to L.
• Perform the unfolding as previously.
• Identify the correspondence between the original instances and the expanded instances.
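The expansion to period L followed by unfolding can be sketched as follows. The function name and return format are assumptions for illustration; the procedure itself follows the steps listed above.

```python
from math import gcd


def unfold_switch_general(W, J, instances):
    """Unfold switching instances W*l + u when W need not be a multiple of J.

    Returns (period, {copy: offsets}): the edge U_copy -> V_copy is switched
    at time instances period * l + c for every c in offsets.
    """
    L = W * J // gcd(W, J)  # least common multiple of W and J
    # Replace W*l + u by the L/W instances L*l + u + w*W, w = 0 .. L/W - 1.
    expanded = sorted(u + w * W for u in instances for w in range(L // W))
    result = {}
    for v in expanded:
        # L*l + v = J * ((L // J) * l + v // J) + (v % J)
        result.setdefault(v % J, []).append(v // J)
    return L // J, {copy: sorted(offs) for copy, offs in result.items()}


if __name__ == "__main__":
    # Bit-serial adder, W = 4, unfolded by J = 3 (L = 12).
    print(unfold_switch_general(4, 3, [0]))        # Z -> X
    print(unfold_switch_general(4, 3, [1, 2, 3]))  # D -> X
```

For W = 4 and J = 3 the unfolded switches have period 4: the Z copies are selected at 4l + 0, 4l + 1, and 4l + 2 for copies 0, 1, and 2, and the D copies at the remaining instances.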
Example: Unfold Bit-serial Adder by J=3 (1/11)
• Word length W = 4 is not a multiple of the unfolding factor J = 3.
• Determine L = lcm{W, J} = lcm{4, 3} = 12.
• Replace Wl + u with Ll + u + wW for w = 0, 1, 2.
[Figure: bit-serial adder with dummy node D; switch edges Z → X at 4l + 0 and D → X at 4l + 1, 2, 3]

Example: Unfold Bit-serial Adder by J=3 (2/11)

Example: Unfold Bit-serial Adder by J=3 (3/11)

Example: Unfold Bit-serial Adder by J=3 (4/11)

Example: Unfold Bit-serial Adder by J=3 (5/11)

Example: Unfold Bit-serial Adder by J=3 (6/11)

Example: Unfold Bit-serial Adder by J=3 (7/11)
[Figure: 3-unfolded DFG with nodes A0, B0, X0, D0, Z0, S0 through A2, B2, X2, D2, Z2, S2. X0 selects Z0 at 4l + 0 and D0 at 4l + 1, 2, 3; X1 selects Z1 at 4l + 1; X2 selects Z2 at 4l + 2.]

Example: Unfold Bit-serial Adder by J=3 (8/11)

Example: Unfold Bit-serial Adder by J=3 (9/11)
[Figure: as in (7/11), with X1 now also selecting D1 at 4l + 0, 2, 3]

Ex: Unfold Bit-serial Adder by J=3 (10/11)

Ex: Unfold Bit-serial Adder by J=3 (11/11)
[Figure: the complete 3-unfolded DFG, with the following switching instances]

Copy   Z → X   D → X
0      4l+0    4l+1,2,3
1      4l+1    4l+0,2,3
2      4l+2    4l+0,1,3

Remove Dead and Dummy Nodes
[Figure: the 3-unfolded DFG after removing the dummy nodes D0, D1, D2; nodes A0, B0, X0, Z0, S0 through A2, B2, X2, Z2, S2 remain, with the same switching instances as in the table above]

Digit-serial adder with digit size of 3

5.6 Conclusions
• The unfolding algorithm is a graph-based transformation technique
  • Reveals hidden concurrencies of the program
  • Achieves a smaller iteration period of the algorithm
• Unfolding and retiming
• Applications for sample period reduction
• Applications for parallel processing
  • Word-level
  • Bit-level
    • Bit-serial
    • Bit-parallel
    • Digit-serial
