Finite State Machine Datapath Design, Optimization, and Implementation
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI: 10.2200/S00087ED1V01Y200702DCS014
A Publication in the Morgan & Claypool Publishers series
Lecture #14
Series Editor: Mitchell Thornton, Southern Methodist University
Series ISSN
Robert Reese
Mississippi State University
ABSTRACT
Finite State Machine Datapath Design, Optimization, and Implementation explores the design space
of combined FSM/Datapath implementations. The lecture starts by examining performance issues
in digital systems such as clock skew and its effect on setup and hold time constraints, and the use
of pipelining for increasing system clock frequency. This is followed by definitions for latency and
throughput, with associated resource tradeoffs explored in detail through the use of dataflow graphs
and scheduling tables applied to examples taken from digital signal processing applications. Also,
design issues relating to functionality, interfacing, and performance for different types of memories
commonly found in ASICs and FPGAs such as FIFOs, single-ports, and dual-ports are examined.
Selected design examples are presented in implementation-neutral Verilog code and block diagrams,
with associated design files available as downloads for both Altera Quartus and Xilinx Virtex FPGA
platforms. A working knowledge of Verilog, logic synthesis, and basic digital design techniques is
required. This lecture is suitable as a companion to the synthesis lecture titled Introduction to Logic
Synthesis using Verilog HDL.
KEYWORDS:
Verilog, datapath, scheduling, latency, throughput, timing, pipelining, memories, FPGA, flowgraph
Table of Contents
Chapter 1 – Calculating Maximum Clock Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2 – Improving design performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 3 – Finite State Machine with Datapath (FSMD) Design . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4 – Embedded Memory Usage in Finite State Machine with
Datapath (FSMD) Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Table of Figures
Figure 1.1: Inverter propagation delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Figure 1.2: AND gate propagation delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Figure 1.3: Glitches caused by propagation delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Figure 1.4: XOR gate architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Figure 1.5: D-type flip-flop input options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Figure 1.6: Relative setup and hold time timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Figure 1.7: Sequential circuit for propagation delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 1.8: Calculating adjusted setup/hold times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 1.9: Adjusted setup and hold timings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Figure 1.10: Board-level schematic to compute maximum clock frequency . . . . . . . . . . . . . . . . . 15
Figure 2.1: Adding an output register to the sequential circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25
Figure 2.2: Adding input registers to the sequential circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 2.3: Operation of a Delay Locked Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 2.4: Board-level schematic to compute maximum clock frequency . . . . . . . . . . . . . . . . . . 30
Figure 3.1: Saturating Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 3.2: Unsigned Saturating Adder (8-bit) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 3.3: Implementation for 1-F operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 3.4: Multiplication of an 8-bit color operand by 9-bit blend operand . . . . . . . . . . . . . . . . 40
Figure 3.5: Dataflow Graph of the Blend Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 3.6: Naı̈ve Implementation of the Blend Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure 3.7: Blend Equation Implementation with Latency = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 3.8: Cycle Timing for Latency = 2, Initiation period = 2 clocks . . . . . . . . . . . . . . . . . . . . . 44
Figure 3.9: Cycle Timing for Latency = 2, Initiation period = 1 clocks . . . . . . . . . . . . . . . . . . . . . 47
Figure 3.10: Multiplication of an 8-bit color operand by 9-bit blend
operand with pipeline stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Figure 3.11: Blend Equation Implementation with Pipelined Multiplier, Latency = 3 . . . . . . . 51
CHAPTER 1
book will only consider the delays associated with the gate but with the understanding that it is
defined by the underlying transistors.
[Figures 1.1 and 1.2: propagation delay waveforms for an inverter (In → Out) and a two-input AND gate (A, B → Y), with tPHL and tPLH measured at the 50% point of the transitions.]
For a two-input gate, four propagation delays are found: A2Y tplh, A2Y tphl, B2Y tplh, and
B2Y tphl. For simplicity, the worst case of the four propagation delays is taken as the total tpd
for the entire gate (Y tpd). The same approach applies to a combinational gate with any number of
inputs. Typically, the datasheet for a logic device contains the worst-case tpd along with the
typical tpd.
[Figure 1.3: glitches on output Z caused by propagation delay tpd after transitions on input X. Figure 1.4: XOR gate built from inverters N1, N2, N3, AND gate A1, and OR gates O1, O2, with inputs X, Y and output Z.]
Gate delays: NOT = 10 ns, AND = 25 ns, OR = 20 ns.

Path delays through the XOR gate:
X: A1 + O2 = 45 ns
X: O1 + N3 + O2 = 50 ns
Y: N2 + A1 + O2 = 55 ns
Y: N1 + O1 + N3 + O2 = 60 ns
The worst-case delay path is 60 ns. On the datasheet, the maximum tpd would be listed as
60 ns. This is also the minimum period of the clock if the XOR gate is used in a real circuit.
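The path table above can be checked mechanically: each path delay is the sum of its gate delays, and the gate's datasheet tpd is the worst case over all paths. A sketch (Python is used here purely for illustration; the gate and path names match the figure):

```python
# Worst-case propagation delay through the XOR of Fig. 1.4,
# built from NOT/AND/OR gates (delays in ns, from the table above).
delay = {"NOT": 10, "AND": 25, "OR": 20}

# Each input-to-output path, listed as the gate types it crosses.
paths = {
    "X: A1 + O2":           ["AND", "OR"],
    "X: O1 + N3 + O2":      ["OR", "NOT", "OR"],
    "Y: N2 + A1 + O2":      ["NOT", "AND", "OR"],
    "Y: N1 + O1 + N3 + O2": ["NOT", "OR", "NOT", "OR"],
}

path_delays = {name: sum(delay[g] for g in gates)
               for name, gates in paths.items()}
worst = max(path_delays.values())
print(path_delays)  # 45, 50, 55, 60 ns
print(worst)        # 60 ns: the datasheet tpd for the whole gate
```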
[Figure 1.5: D-type flip-flop symbol with D and C (clock) inputs, Q output, and S (set) and R (reset) inputs.]
edge-triggered signal. For a latch, the enable is a level-sensitive signal. This book uses flip-flops
in its examples since this is the most commonly used design style. While many types of flip-flops
exist, such as SR, D, T, and JK flip-flops, this book will only discuss D flip-flops since they are
the simplest and most straightforward; the other types can be analyzed using the same techniques.
In a D flip-flop, the input is copied to the output at the clock edge. The D flip-flop can have a
variety of input options, as shown in Fig. 1.5.
A specialized type of flip-flop is called a register. A register has an enable input that prevents
the input from being transferred to the output on every clock cycle; the input is only copied
when the enable is set high. Registers often come in arrays that share the same control signals
but have different data inputs and outputs. Sometimes the term register is used synonymously with
the term flip-flop.
The output for a memory element has a tpd like a combinational gate; however, it is measured
differently. Since the output for a register only changes on a clock transition, tpd is measured from
the time the clock changes to the time the input is copied to the output. Since the data output
does not change when the data input changes, tpd is not measured from the data input to the data
output. However, the clock-to-output propagation delay (tC2Q ) is not the only delay associated with
a register.
[Figure 1.6: setup time tsu and hold time thd shown relative to the active clock edge.]
[Figure 1.7: sequential circuit. Input X → buffer A (1 ns); input Y → buffer B (1 ns). A and U1's output feed OR gate E (8 ns); B and U1's output feed gate F (7 ns), which drives U2's D input; B and U2's output feed gate G (8 ns), which drives U1's D input; E, U2's output, and B feed the 3-input AND gate H (9 ns) → output buffer D (6 ns) → Z. Registers U1 and U2: tsu = 3 ns, thd = 4 ns, tC2Q = 5 ns. Clk → clock buffer C (2 ns) → both register clocks.]
For the input X, the path proceeds through the input buffer A, the OR gate E, the AND gate H,
and the output buffer D. The propagation delays for these gates are added together:
1 + 8 + 9 + 6 = 24 ns (1.2)
For the input Y, the path starts at the input buffer B and proceeds through the AND gate H,
and the output buffer D. The propagation delays for these gates are added together to get 1 + 9 +
6 = 16 ns.
X: A + E + H + D = 24 ns
Y: B + H + D = 16 ns
1 + 9 + 6 = 16 ns (1.4)
The larger of these two delays is the worst-case tP2P for this circuit. The path “A + E
+ H + D” is the worst-case with a delay of 24 ns. The list of delays is in Table 1.3.
Some circuit analysis programs treat the clock-to-output delay the same as the pin-to-pin
combinational delay, so sometimes on the analysis report there will be no clock-to-output delay
listed. The clock input is counted as a regular input. Often these reports will list the worst-case
delays for each input, so the clock-to-output delay can be found by searching this list.
There are two clock-to-output paths through the circuit. Both paths pass through the input
buffer C. One path then proceeds through the first register U1, through the OR gate E, through
the 3-input AND gate H, and finally to the output buffer D.
2 + 5 + 8 + 9 + 6 = 30 ns (1.7)
Clk C + U1 + E + H + D 30 ns
Clk C + U2 + H + D 22 ns
The second path proceeds through the second register U2, through the 3-input AND gate
H, and finally to the output buffer D.
2 + 5 + 9 + 6 = 22 ns (1.9)
The larger of these two delays is the worst-case tC2Q for this circuit. The path “C + U1 +
E + H + D” is the worst-case with a delay of 30 ns. The list of delays is in Table 1.4.
U1 U1 + F + U2 15 ns
U2 U2 + G + U1 16 ns
There are two registers in this design. Starting with register U1, there is only one path from
the output of this register to another register. This path passes through gate F to the input of register
U2. Therefore, computing this register-to-register path is easy.
5 + 7 + 3 = 15 ns (1.12)
Starting with register U2, there is only one path from the output to another register. This path
passes through gate G to the input of register U1.
5 + 8 + 3 = 16 ns (1.14)
The two register-to-register paths in Table 1.5 above are 15 ns and 16 ns. The worst-case tR2R
is therefore 16 ns through the path “U2 + G + U1”. If all the registers have the same clock-to-output
delay and tsu (as is often the case), the only difference between the paths is the combinational circuits
between the registers. This can make computing tR2R much easier.
P2P A+E+H+D 24 ns
C2Q C + U1 + E + H + D 30 ns
R2R U2 + G + U1 16 ns
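The three worst-case delays just derived can be reproduced from the element delays of Fig. 1.7; the sketch below mirrors the path lists used in this section (an illustrative calculation, not timing-tool output):

```python
# Element delays for the circuit of Fig. 1.7 (ns).
d = {"A": 1, "B": 1, "C": 2, "D": 6, "E": 8, "F": 7, "G": 8, "H": 9,
     "tC2Q": 5, "tsu": 3}

def path(*elems):
    """Sum the delays of the elements along one path."""
    return sum(d[e] for e in elems)

# Pin-to-pin: input buffer -> combinational gates -> output buffer.
tP2P = max(path("A", "E", "H", "D"), path("B", "H", "D"))
# Clock-to-output: clock buffer -> register tC2Q -> gates -> output buffer.
tC2Q = max(path("C", "tC2Q", "E", "H", "D"), path("C", "tC2Q", "H", "D"))
# Register-to-register: source tC2Q -> gate -> destination register tsu.
tR2R = max(path("tC2Q", "F", "tsu"), path("tC2Q", "G", "tsu"))

t_min = max(tP2P, tC2Q, tR2R)
print(tP2P, tC2Q, tR2R, t_min)  # 24 30 16 -> minimum clock period 30 ns
```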
and thd after the clock at the inputs to the internal register. If the sequential circuit was going to
be packaged into a chip and sold to a customer, the customer may not know how to check if the
internal register setup and hold requirements have been met. Therefore tsu and thd requirements are
recomputed for the entire sequential circuit and that information is passed to the customer.
For setup time, the data signal must not change for a given time before the clock edge. If the
input signal is delayed, such as, through a combinational gate or input buffer as in Fig. 1.8, the input
may violate the tsu requirement. Therefore, any delay added between the input pin and the register
input must be added to the setup time requirement. The delay between the clock input pin and the
clock input to the register must also be subtracted from tsu . This means if the delays between the
pins to the register are the same, there will be no change in tsu . Only when there is a difference in
the delays will the setup time change.
This procedure must be repeated for each register in the design that has an external input
routed to its input through any combinational path. The longest delay from the data input to the
registers is used as the worst case. The shortest delay from the clock input to the registers is used as
the worst case. The difference between these two paths is the adjustment to the setup time.
For hold time, if the clock signal is delayed, such as through an input buffer, the input may
violate the thd requirement. The worst case for thd is the opposite worst case for tsu : the longest delay
from the clock input of the circuit to the register, and the shortest delay from the data input to the
register. The difference between these two paths is the adjustment to the hold time.
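The two rules can be written down directly: external tsu adds the longest data delay and subtracts the shortest clock delay, while external thd adds the longest clock delay and subtracts the shortest data delay. A minimal sketch (the function name is illustrative, not from the text):

```python
def external_setup_hold(tsu, thd, data_delays, clk_delays):
    """Adjust a register's tsu/thd for the pin-to-register delays,
    per the worst-case rules described above (all times in ns)."""
    ext_tsu = tsu + max(data_delays) - min(clk_delays)
    ext_thd = thd + max(clk_delays) - min(data_delays)
    return ext_tsu, ext_thd

# Example 1.4 (Fig. 1.7): data paths Y->U1 = 9 ns and Y->U2 = 8 ns;
# clock paths Clk->U1 = Clk->U2 = 2 ns; register tsu = 3 ns, thd = 4 ns.
ext_tsu, ext_thd = external_setup_hold(3, 4, [9, 8], [2, 2])
print(ext_tsu, ext_thd)          # 10 -2
print(ext_tsu, max(ext_thd, 0))  # a negative thd is usually specified as 0
```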
[Figure 1.9: adjusted setup and hold timing windows shown relative to the clock edge.]
When tsu and thd have been adjusted correctly for the external inputs, the internal tsu and thd
at the register inputs will not be violated. The timing diagram in Fig. 1.9 shows how the internal
delays shift the setup and hold requirements.
1.4.9 Example 1.4 Using the same circuit in Fig. 1.7, find the adjustments to the tsu and thd
for the circuit.
In this design, the data input is delivered to the input to two registers. The first path is routed
from the Y input through the input buffer, through the OR gate G, and then to the input of the
U1 register. The second path passes through the input buffer, through the AND gate F, and then to
the U2 register. Note there are no paths from the X input to the inputs of any registers. Table 1.7
provides the set of all input to register delays.
The calculation for tsu will include the longest data delay and the shortest clock delay. For
this example, the longest data delay is tpd data U1 that will add 9 ns to tsu . The shortest clock delay is
Y to U1 B + G + U1 9 ns tpd data U1
Y to U2 B + F + U2 8 ns tpd data U2
Clk to U1 C + U1 2 ns tpd clk U1
Clk to U2 C + U2 2 ns tpd clk U2
tpd clk U1 that will subtract 2 ns from tsu . Given tsu of 3 ns, the external tsu for this circuit is 10 ns.
(9 − 2) + 3 = 10 ns (1.18)
The calculation for thd will include the longest clock delay and the shortest data delay. For
this example, the longest clock delay is tpd clk U1 that will add 2 ns to thd . The shortest data delay is
tpd data U1 that will subtract 8 ns from the hold time. Given thd of 4 ns, the external thd for this circuit
is −2 ns.
(2 − 8) + 4 = −2 ns (1.20)
The setup and hold window, during which the data cannot change, is 8 ns. The negative sign in
the hold time calculation means the data input can actually start changing before the clock edge.
This is not an intuitive behavior for a digital circuit, so a negative thd is often specified as
zero instead. Setting thd to zero increases the effective setup and hold window to 10 ns.
First, the pin-to-pin combinational delay is found for any path from the X input to the output.
There is one pin-to-pin path: from input A, through the X input of U1, through the X input of U2,
to output B.

[Figure 1.10: board-level schematic: two identical chips U1 and U2 connected in series (A → U1 → U2 → B) with a shared Clk; each chip has tsu = 10 ns, thd = 0 ns, tC2Q = 30 ns, tP2P = 24 ns.]

The delay of this path adds the two pin-to-pin delays together: 24 + 24 = 48 ns.
24 + 24 = 48 ns (1.22)
Two clock-to-output delays exist for this circuit. The first path passes through the clock
input of U1 and then through the X input of U2. The second path passes only through the clock
input of U2. Since the clock-to-output delays of the two chips are the same, the first path is
longer: 30 + 24 = 54 ns.
30 + 24 = 54 ns (1.24)
Three tR2R exist for this circuit. The first path goes through the U1 clock-to-output, through
the X input of U2, and then back to the Y input of U1. The second is through the U1 clock-to-output
to the input of Y on U2. The third is through the U2 clock-to-output to the input of Y on U1.
The longest path is the first since it passes through the combinational portion of U2 for 30 + 24 +
10 = 64 ns.
30 + 24 + 10 = 64 ns (1.26)
The three worst-case paths and the chip minimum clock period limit the clock frequency for
the board-level system. The largest of these values (48 ns, 54 ns, 64 ns, 30 ns) is 64 ns, which is the
minimum clock period for the board which corresponds to 15.63 MHz. This frequency is much
lower than the chip clock frequency. Note that the combinational delay of the chip contributes most
of the slow-down to the circuit.
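The board-level limit is simply the largest of the four candidate periods; as a sketch using the numbers above:

```python
# Board-level minimum clock period for the two-chip system (ns).
tP2P_board = 24 + 24        # pin-to-pin through both chips
tC2Q_board = 30 + 24        # chip 1 clock-to-output + chip 2 combinational
tR2R_board = 30 + 24 + 10   # U1 tC2Q + U2 combinational + U1 tsu
t_chip = 30                 # each chip's own minimum clock period

t_min = max(tP2P_board, tC2Q_board, tR2R_board, t_chip)
f_max_mhz = 1e3 / t_min     # period in ns -> frequency in MHz
print(t_min, f_max_mhz)     # 64 ns -> 15.625 MHz (about 15.63 MHz)
```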
Table 1.9 shows delay evolution for the Xilinx Virtex family of field programmable gate arrays
(FPGAs) over time. The top row gives each FPGA family name as well as the CMOS technology,
supply voltage, and date of first introduction. A CMOS technology designated as 220 nm (1 nanometer
= 1.0e–9 m) means that the shortest-channel MOS transistors have a channel length of 220 nm
(the value 220 nm is more commonly written as 0.22 μm, but nm is used for consistency purposes).
The Xilinx Virtex FPGA family uses a static RAM lookup table (LUT) as the programmable logic
element. A LUT is a small memory that is used to implement a boolean function; its contents are
loaded from a non-volatile memory at power up. The Virtex 1, 2, and 4 families use a 16×1 LUT,
which means that it can implement one boolean function of four variables; the Virtex-5 family uses
a 64×2 LUT (two boolean functions of the same six variables). The LUT delays given in Table 1.9
are for a mid-range speed grade of these devices. CMOS integrated circuits being made on the same
fabrication line can have a range of delays because of variations in the CMOS fabrication process.
Thus, devices coming off a fabrication line are tested and separated into different speed grades,
with the higher performing devices being sold at a premium price. The supply voltages of Table 1.9
have decreased over time because transistor-switching speeds reach a maximum at lower voltages as
transistor channel lengths shrink. Lowering the supply voltage has the added benefit of reducing
power consumption, which is important because excessive heating due to high power consumption
has become a problem as increasing numbers of transistors are used in a single integrated circuit.
The delays of Table 1.9 are given in picoseconds (1 ps = 1.0e–12 s). Observe that the LUT
propagation delays in Table 1.9 have decreased by almost an order of magnitude across the families
(the Virtex-5 LUT tpd would be even faster if it used the smaller LUT of the previous families).
The D flip-flop (DFF) clock-to-Q propagation delay shows a similar improvement. The DFF tsu
and thd are hard to compare because these times include a MUX delay on the D-input of the DFF
for the Virtex 1, 2, and 4 families; the setup/hold times for the Virtex-5 DFF do not include
this delay. However, in general, DFF tsu and thd also decrease as transistor channel lengths decrease.
The Input/Output buffer (IOB) delays are relatively constant over this time because the bonding
pad size used to connect the integrated circuit to the package does not shrink as transistor channel
length shrinks. The delays associated with any digital logic within the IO pad decrease, but the
IO pad delay is dominated by the off-chip load for an output pad, and by the input capacitive
load for an input pad. Any changes in these delays over time are due to architectural changes in
the pad design, such as providing different ranges of output drive strength current, or the need to
accommodate different IO standards over time.
For modern programmable logic devices, the device delays are kept in a database that is
included in the design toolkit being used to create the design. The timing analysis tool in the FPGA
vendor’s design toolkit uses these device delay times to calculate external setup and hold times,
maximum operating frequency, and internal setup and hold constraints using the timing equations
presented in this chapter.
1.7 SUMMARY
This chapter has discussed how to find the important timings of a circuit such as maximum clock
frequency by analyzing the delay paths through the gates and registers. By categorizing the delay
paths through the circuit, the total number of delay paths that need to be calculated can be minimized.
These timings of the internal chip design can also be used to find the maximum clock frequency of
the board-level system.
a. Calculate the worst-case pin-to-pin combinational delay, clock-to-output delay, and register-
to-register delay.
b. Use this data to find the maximum clock frequency.
c. Calculate tsu and thd for the external inputs.
1. [Circuit: input X → buffer A (2 ns); A and register U1's output feed gate C (8 ns) → output Z, and gate D (7 ns) → U1's input; U1: tsu = 4 ns, thd = 5 ns, tC2Q = 6 ns; Clk → clock buffer B (3 ns).]
2. [Circuit: inputs FB → buffer A (4 ns) and OV → buffer B (4 ns); registers U1 and U2 (tsu = 2 ns, thd = 4 ns, tC2Q = 6 ns); gates F (3 ns), E (5 ns), G (6 ns), and H (8 ns); output Z via a 3 ns output buffer; Clk → clock buffer C (2 ns).]
CALCULATING MAXIMUM CLOCK FREQUENCY 21
3. [Circuit: input IN → buffer A (2 ns); gates G (5 ns), F (7 ns), and C (5 ns); output buffer B (3 ns); registers U1 and U2 (tsu = 3 ns, thd = 2 ns, tC2Q = 6 ns); Clk → clock buffers D and E (3 ns each).]
X tpd P2P 2 + 8 = 10 10 ns
tpd C2Q 3 + 6 + 8 = 17 17 ns
tpd R2R 6 + 7 + 4 = 17 17 ns
Tclk max(10, 17, 17) 17 ns
Fclk 1/Tclk 58.8 MHz
tsu X 4 + (2 + 7) −3 = 10 10 ns
thd X 5 + 3− (2 + 7) = −1 or 0 0 ns
2.
OV tpd P2P 4 + 8 + 3 = 15 15 ns
tpd C2Q 2 + 6 + 3 + 8 + 3 = 22 22 ns
tpd R2R 6 + 3 + 5 + 6 + 2 = 22 22 ns
Tclk max(15, 22, 22) 22 ns
Fclk 1/Tclk 45.5 MHz
tsu FB 2 + (4 + 5 + 6) − 2 = 15 15 ns
thd FB 4 + 2 − (4 + 5) = − 3 or 0 0 ns
3.
IN tpd P2P 0 0 ns
tpd C2Q 3 + 3 + 6 + 5 + 3 = 20 20 ns
tpd R2R 6 + 7 + 3 + 3 (gate E) = 19 19 ns
Tclk max(0, 20, 19) 20 ns
Fclk 1/Tclk 50 MHz
tsu IN 3 + (2 + 5 + 7) − (3 + 3) = 11 11 ns
thd IN 2 + 3 − (2 + 5) = − 2 or 0 0 ns
CHAPTER 2
of the circuit, there are no combinational circuits after this to add to the clock-to-output delay. The
only clock-to-output delay paths possible are through these output registers, so the analysis is greatly
simplified.
The output registers must be placed before the output buffer because the buffer is not an actual
gate in the design; its delay represents the interface from the chip to the board. The output
circuitry often has a significant delay because of the need for high fan-out drive, a larger
voltage swing, and over-voltage protection. Therefore, placing the register immediately before
this buffer is the optimum location.
One consequence of this approach is the impact of tR2R through the circuit. Since there are
more registers in the design, there are more register-to-register delays to be computed. Sometimes
the worst-case tR2R will increase because of this. If the clock frequency was limited by the
pin-to-pin delay or the clock-to-output delay, then reducing those delays will still increase the
clock frequency, provided tR2R does not grow by a larger amount. If registers are added to the outputs,
the worst-case tR2R will usually become the largest delay path of the circuit.
Another consequence of this approach is the impact on latency. Latency is the time required
for an input to propagate through a circuit to the output. If a circuit is all combinational, then the
latency is in the same clock period in which the data input is applied. By adding registers to the output
of the circuit, the latency increases into the next clock period. Adding a set of registers to all outputs
of a device means the latency of each input will increase to the beginning of the next clock period.
While this is a disadvantage, the impact on performance is usually not significant: the latency
has increased, but the clock period has (usually) decreased, so the two effects often cancel each
other out.
While latency may have increased by one clock cycle, the rate at which data is being input and
output is the same. New data is input and output every clock cycle. The throughput of the data is
the same, even though the latency has increased. Therefore, the overall computing performance of
the device will increase. This effect is called pipelining, which will be covered in much more detail
in the next chapter.
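The latency/throughput tradeoff can be quantified. If one new input enters every clock, processing N items takes roughly (latency + N − 1) clock periods, so for large N the shorter clock period of the registered design outweighs the extra cycle of latency. A sketch (the delay values are the ones from this chapter's example; the comparison itself is illustrative):

```python
def total_time_ns(n_items, latency_cycles, clock_ns):
    """Time to process n_items when a new item enters every clock and
    each result appears latency_cycles after its input is applied."""
    return (latency_cycles + n_items - 1) * clock_ns

# Unregistered outputs: latency 1 cycle, 30 ns clock.
# Registered outputs:   latency 2 cycles, 25 ns clock.
for n in (1, 10, 1000):
    print(n, total_time_ns(n, 1, 30), total_time_ns(n, 2, 25))
# n = 1000: 30000 ns vs 25025 ns -- throughput wins over latency.
```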
2.2.1 Example 2.1 Add a register to the output of the circuit in Fig. 1.7 and recompute the
maximum clock frequency. Compare the new computations with the computations before the circuit
improvements. The new circuit is shown in Fig. 2.1.
The analysis for this circuit is the same as for all maximum clock frequency calcula-
tions. The worst-case pin-to-pin combinational delay, clock-to-output delay, and tR2R must be
found. Since the output is now registered, there is no pin-to-pin combinational delay. This
measurement can be excluded from the analysis, or set to zero for continuity in the final
comparison.
[Figure 2.1: the circuit of Fig. 1.7 with an output register U3 inserted between the 3-input AND gate H (9 ns) and the output buffer D (6 ns). All other elements are unchanged: input buffers A and B (1 ns each), OR gate E (8 ns), gates F (7 ns) and G (8 ns), registers U1–U3 (tsu = 3 ns, thd = 4 ns, tC2Q = 5 ns), clock buffer C (2 ns).]
The clock-to-output delay only has one path to compute. Since this delay can pass through at
most one register, the only register it can now pass through to the output is the new added register.
This path proceeds from the clock buffer C, through the register U3, and through the output buffer
D. The improved clock-to-output delay is 13 ns.
2 + 5 + 6 = 13 ns (2.2)
Adding another register has increased the number of register-to-register paths from two to four.
The paths are listed in Table 2.1. The worst-case path is from U1, through gates E and H, to the
new output register U3, for a total delay of 25 ns.
U1 U1 + F + U2 15 ns
U2 U2 + G + U1 16 ns
U1 U1 + E + H + U3 25 ns
U2 U2 + H + U3 17 ns
Parameter Before After
P2P 24 ns 0 ns
C2Q 30 ns 13 ns
R2R 16 ns 25 ns
Clock Period 30 ns 25 ns
Clock Frequency 33.3 MHz 40 MHz
The clock period is set by taking the largest of the three worst-case paths, zero ns for the
pin-to-pin combinational delay, 13 ns for the clock-to-output delay, and 25 ns for tR2R . Therefore,
the minimum clock period is 25 ns, which corresponds to a maximum clock frequency of 40 MHz.
Before adding the register on the output, the minimum clock period was set by the clock-to-
output delay. Since this delay decreased to 13 ns, it is no longer limiting the clock period. The tR2R has
increased, but is still less than the previous limiting value of 30 ns. This means the maximum clock
frequency has significantly increased by adding a single register to the design. The total comparison
of measured values is present in Table 2.2.
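These before/after values can be reproduced from the element delays; the sketch below follows the analysis of Fig. 2.1, with the four register-to-register paths of Table 2.1:

```python
# Fig. 2.1: delays (ns) after adding output register U3.
tC2Q_reg, tsu_reg = 5, 3
r2r = {
    "U1 + F + U2":     tC2Q_reg + 7 + tsu_reg,      # 15
    "U2 + G + U1":     tC2Q_reg + 8 + tsu_reg,      # 16
    "U1 + E + H + U3": tC2Q_reg + 8 + 9 + tsu_reg,  # 25
    "U2 + H + U3":     tC2Q_reg + 9 + tsu_reg,      # 17
}
tP2P = 0            # outputs registered: no pin-to-pin combinational path
tC2Q = 2 + 5 + 6    # C + U3 + D = 13 ns

t_min = max(tP2P, tC2Q, max(r2r.values()))
print(t_min, 1e3 / t_min)  # 25 ns -> 40.0 MHz
```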
The tsu of the circuit before adding input registers is computed by finding the longest combina-
tional path to any register in the design. The addition of the output register increases the worst-case
delay to 18 ns from the circuit input X to the U3 register through gates A, E, and H. The minimum
[Figure 2.2: the circuit of Fig. 2.1 with input registers added: X → buffer A (1 ns) → register U5, and Y → buffer B (1 ns) → register U4; all other elements and delays unchanged.]
clock delay remains the same. Therefore, the new circuit tsu increases to 19 ns.
(18 − 2) + 3 = 19 ns (2.4)
The thd of the circuit before adding input registers is computed by finding the shortest com-
binational path to any register in the design. The addition of the output register does not increase
this value. The shortest path is the same as the previous analysis at 8 ns. This means thd remains the
same at –2 ns, which should be set to zero since it is negative. The setup and hold window is now
19 ns because of the addition of the output registers.
(2 − 8) + 4 = −2 ns (2.6)
Adding input registers after the input buffers simplifies the computations because the number
of paths from each input is reduced to one per input. For this circuit, the combinational delay for
each input is 1 ns, and the delay for the clock is 2 ns. This means the new tsu is 2 ns, and the new
thd is 5 ns. This means the setup and hold window is now 7 ns. The comparison between tsu and thd
Parameter Original With output register With input and output registers
Setup Time 10 ns 19 ns 2 ns
Hold Time 0 ns 0 ns 5 ns
Setup and Hold Window 10 ns 19 ns 7 ns
(1 − 2) + 3 = 2 ns (2.8)
(2 − 1) + 4 = 5 ns (2.10)
The setup and hold window is nearly doubled when output registers were added to the design.
When registers were added to the inputs, the setup and hold window decreased to the smallest
possible window. The window cannot decrease below this because it is limited by the setup and hold
window of the register, which is also 7 ns.
A DLL can change the phase of the internal clock either manually or automatically. The advantage
is that the active clock edge can be placed anywhere, so the clock delay used in the clock-to-output,
tsu, and thd calculations can be set to whatever is needed. Typically the DLL aligns the internal
clock with the external clock to remove the delay added by the clock signal's input buffer: the
input buffer adds a fixed delay to the clock signal, and the DLL effectively removes that same
amount. Note that this technique cannot be used to reduce the delays on the data signals because
they do not have a predictable repeating pattern.
0 + 5 + 6 = 11 ns (2.12)
The pin-to-pin combinational delay and the register-to-register delay are not affected by the
change to the clock because they do not include the clock buffer C. The maximum clock frequency
must be checked because this change might affect it if the clock-to-output delay was the limiting
factor. Typically tR2R limits the maximum clock frequency, so often the clock frequency will not
change when adding a DLL.
The tsu and thd also depend on the clock delay, so they will be affected by adding a DLL. The
minimum and maximum clock delay is set to zero and tsu and thd are recalculated.
(1 − 0) + 3 = 4 ns (2.14)
(0 − 1) + 4 = 3 ns (2.16)
The new tsu is 4 ns, and the new thd is 3 ns. The setup and hold window has not changed from
7 ns.
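Both recalculations reuse the Chapter 1 adjustment rule, with the DLL driving the effective clock delay to zero. A sketch (the helper name is illustrative):

```python
def external_setup_hold(tsu, thd, data_delay, clk_delay):
    # Same adjustment rule as before, for a single data/clock path (ns).
    return tsu + data_delay - clk_delay, thd + clk_delay - data_delay

# With input registers (Fig. 2.2): 1 ns data buffer, 2 ns clock buffer.
no_dll = external_setup_hold(3, 4, 1, 2)
# The DLL aligns the internal clock, so the effective clock delay is 0.
with_dll = external_setup_hold(3, 4, 1, 0)
print(no_dll, with_dll)  # (2, 5) -> (4, 3); the 7 ns window is unchanged
```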
[Figure 2.4: board-level schematic: two fully registered chips U1 and U2 connected in series (A → U1 → U2 → B) with a shared Clk; each chip has tsu = 4 ns, thd = 3 ns, tC2Q = 11 ns.]
First, since there is no combinational path through the chip, there is no calculation for the
pin-to-pin combinational path for the board. This value is excluded when computing maximum
clock frequency.
One clock-to-output delay exists for this circuit. This path passes only through the clock input
of U2. If there is no clock delay, the clock-to-output for the board is the same as the clock-to-output
of the chip. This delay is 11 ns.
Two register-to-register delays exist for this circuit. The first is through the U1 clock-to-output
to either input on U2. The second is through the U2 clock-to-output to the Y input of U1.
Both paths have the same delay of 11 + 4 = 15 ns.
11 + 4 = 15 ns (2.18)
The three worst-case paths and the chip minimum clock period limit the clock frequency for
the board-level system. The largest of these four values (0 ns, 11 ns, 15 ns, 25 ns) is 25 ns, which is
also the minimum clock period for the chip. This means the board can operate at the same frequency
as the chips on the board. Note the removal of the combinational paths greatly reduces the delays at
the board level.
2.6 SUMMARY
By understanding the parameters that dictate the maximum clock frequency of a circuit, the design
can be modified to reduce the longest delays to improve circuit performance. Reducing the combi-
national delay paths increases the maximum clock frequency by targeting the worst-case paths. By
registering all inputs and outputs, the circuit can operate at its maximum frequency within a larger
system. Using additional technologies like DLLs can further increase the circuit performance within
a larger system.
1. [Circuit: input X → buffer A (2 ns); A and register U1's output feed gate C (8 ns) → gate E (3 ns) → output Z, and gate D (7 ns) → U1's input; U1: tsu = 4 ns, thd = 5 ns, tC2Q = 6 ns; Clk → clock buffer B (3 ns).]
2. [Circuit: inputs FB → buffer A (4 ns) and OV → buffer B (4 ns); registers U1 and U2 (tsu = 2 ns, thd = 4 ns, tC2Q = 6 ns); gates F (3 ns), E (5 ns), G (6 ns), and H (8 ns); output Z via a 3 ns output buffer; Clk → clock buffer C (2 ns).]
3. [Circuit: IN → buffer A (2 ns) → register U1; U1 → gate G (5 ns) → gate F (7 ns) → register U2 → output buffer B (3 ns) → OUT; registers: tsu = 3 ns, thd = 2 ns, tC2Q = 6 ns; Clk → clock buffers D and E (3 ns each).]
3. For this problem, assume the clock routed to the output register passes through both clock
buffers, and the clock to the input register passes through only the first clock buffer, and the DLL
only removes the delay in clock buffer D.
IMPROVING DESIGN PERFORMANCE 33
2.
   tpd P2P (OV)   0                          0 ns
   tpd C2Q        2 + 6 + 3 = 11             11 ns
   tpd R2R        6 + 3 + 5 + 6 + 2 = 22     22 ns
   Tclk           max(0, 11, 22)             22 ns
   Fclk           1/Tclk                     45.5 MHz
   tsu (FB)       2 + 4 − 2 = 4              4 ns
   thd (FB)       4 + 2 − 4 = 2              2 ns
3.
   tpd P2P (IN)   0                          0 ns
   tpd C2Q        3 + 3 + 6 + 3 = 15         15 ns
   tpd R2R        6 + 5 + 7 + 3 − 3 = 18     18 ns
   Tclk           max(0, 15, 18)             18 ns
   Fclk           1/Tclk                     55.6 MHz
   tsu (IN)       3 + 2 − 3 = 2              2 ns
   thd (IN)       2 + 3 − 2 = 3              3 ns

DLL effects:
   tpd C2Q        0 + 3 + 6 + 3 = 12         12 ns
   tsu (IN)       3 + 2 − 0 = 5              5 ns
   thd (IN)       2 + 0 − 2 = 0              0 ns
CHAPTER 3
Finite State Machine with Datapath Design
number by 2^Y to produce the final decimal result. From the 5.3 format example of Table 3.1, the
value 0b10001111 converted to its 8.0 value is 143, which is 17.875 when divided by 2^3.
The numbers in a fixed-point datapath are assumed to share a common X.Y format. The
logic used to implement binary addition and multiplication works the same regardless of where the
decimal point is located, as long as both numbers have the same X.Y format, i.e., the decimal points are
aligned. This is in contrast to a floating-point datapath, which can perform computation on numbers
whose decimal points do not align. Floating-point computation blocks require significantly more
logic to implement than fixed-point logic blocks. Floating-point computation is used in applications
that require an extended range for its numerical data. This chapter does not cover floating-point
number encoding or implementation of floating-point computational elements. However, since this
chapter treats computation elements as black boxes, the lessons learned in this chapter concerning
clock-cycle versus execution unit tradeoffs in datapath design using fixed-point datapaths can easily
be applied to floating-point datapaths.
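The X.Y conversion just described can be checked numerically; the following Python sketch (the function names are our own, not from the text) treats the bit pattern as an X+Y-bit integer and scales by 2^Y:

```python
def fixed_to_decimal(bits, y):
    """Interpret the unsigned integer 'bits' as an X.Y fixed-point
    value: divide the raw integer by 2**y."""
    return bits / (2 ** y)

def decimal_to_fixed(value, y):
    """Quantize a decimal value to X.Y format by scaling by 2**y
    and truncating toward zero."""
    return int(value * (2 ** y))

# The 5.3 example from Table 3.1: 0b10001111 is 143 as an 8.0
# integer, and 143 / 2**3 = 17.875 as a 5.3 value.
print(fixed_to_decimal(0b10001111, 3))   # 17.875
print(decimal_to_fixed(17.875, 3))       # 143
```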
overflows, as the result is greater than the maximum value of 255. A saturating adder that clips
the result to its maximum value in the overflow case is shown for the same operation in Fig. 3.1b.
While the results in Fig. 3.1a and Fig. 3.1b are both incorrect, the saturating operation produces
a result that is closer to the correct answer, which is desirable in applications that cannot take any
other corrective action on overflow. Figure 3.1c demonstrates an underflow case (a borrow into the
most significant binary digit) for unsigned eight-bit subtraction. The same operation is performed
in Fig. 3.1d using a saturating subtraction operation, which clips the result to its minimum value of
zero.
An eight-bit unsigned saturating adder is shown in Fig. 3.2. The output is saturated to its
maximum value of ‘b11111111 when the eight-bit sum produces a carryout of ‘1’.
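The saturating behavior of Figs. 3.1 and 3.2 can be modeled behaviorally; the sketch below is a Python analogue of the eight-bit saturating adder and subtractor, not the book's Verilog:

```python
def sat_add8(a, b):
    """Eight-bit unsigned saturating add: clip to 255 when the
    nine-bit sum produces a carry out."""
    s = a + b
    return 255 if s > 255 else s

def sat_sub8(a, b):
    """Eight-bit unsigned saturating subtract: clip to 0 when a
    borrow out of the most significant bit occurs."""
    d = a - b
    return 0 if d < 0 else d

print(sat_add8(0b11111111, 1))   # 255 (saturates instead of wrapping)
print(sat_sub8(0x40, 0xC0))      # 0   (saturates instead of wrapping)
```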
3.6 MULTIPLICATION
A good question to ask at this point is “How does saturating arithmetic operate for multiplication?”
To answer this, recall that the binary multiplication of two N-bit numbers requires a 2N-bit
result to contain all of the bits produced by the multiplication. However, it is usually not possible
to retain all 2N bits in the datapath calculation, as successive multiplications would continually
require the datapath size to double in order to prevent any data loss. Assuming that only N bits of
an N × N bit multiplication are kept, two strategies can be used for discarding half of the bits
of the 2N-bit product. If the fixed-number format used for the calculation is N.0 (integers), then a
saturating multiplier can be built that saturates the result to the maximum value in case of overflow
in the same manner as was done for addition. In this case, the upper N bits of the 2N-bit product are
discarded and the lower N bits saturated to the maximum value.
Another approach is to encode the fixed-point numbers in a 0.N format, which means that
the product of the N × N multiplication can never overflow, since the two N-bit numbers being
multiplied are always less than one. Hardware saturation of the result is not required; instead, the
lower N bits of the 2N-bit product are discarded. The bits that are discarded are the least significant
bits of the product, causing successive multiplications to automatically saturate towards a minimum
value of zero, as precision is lost due to only retaining the upper N bits of the product. This is the
approach used in this chapter, as the multiplier design does not have to be modified and the examples
used in this chapter assume a 0.8 fixed-point number format.
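In 0.8 format, discarding the lower eight bits of the 16-bit product amounts to a right shift by eight; a minimal Python model (our own naming):

```python
def mult_0p8(a, b):
    """Multiply two 0.8 fixed-point operands (raw values 0..255
    representing 0..255/256) and keep the upper eight bits of the
    16-bit product."""
    return (a * b) >> 8

# 0.5 * 0.5: 0.5 is 0b10000000 = 128 in 0.8 format.
# (128 * 128) >> 8 = 64, i.e., 64/256 = 0.25.
print(mult_0p8(128, 128))   # 64

# Precision loss truncates toward zero: the smallest values vanish,
# so successive products saturate toward a minimum of zero.
print(mult_0p8(1, 1))       # 0
```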
The nine-bit encoding of F is 'b100000000 if F is equal to one, and 'b0dddddddd for any other
value of F, where dddddddd is the 0.8 fixed-point equivalent of F. For computation speed purposes,
the lower eight bits of 1-F are computed as the one's complement of the lower eight bits of F
when F is not equal to one or zero. The one's complement operation produces an error of one least
significant bit (LSb), but this is deemed acceptable in pixel blend operations, in which computation
speed is the most critical factor. The 1-F operation implementation is shown in Fig. 3.3. The mxa
multiplexer and the zero-detect logic handle the special case of F = 0.0 (9'b000000000), in which
case the output is 1.0 (9'b100000000). The mxb multiplexer handles the case of F = 1.0, which is
[Figure 3.3: implementation of the 1-F operation, consisting of zero-detect logic on a[8:0] and output multiplexers mxa (selects 9'b100000000 when the input is zero) and mxb (selects 9'b000000000 when a[8] = 1).]

//do 1-F operation
module oneminus (a, y);
  input  [8:0] a;
  output [8:0] y;
  reg [8:0] a_1c;

  //handle '0' input case
  always @(a) begin
    if (a == 9'b000000000)
      // input is zero, convert to '1.0'
      a_1c = 9'b100000000;
    else begin
      // do one's complement of the lower eight bits
      a_1c[8]   = a[8];
      a_1c[7:0] = ~a[7:0];
    end
  end

  //handle '1.0' input case (a[8]==1)
  assign y = a[8] ? 9'b000000000 : a_1c;
endmodule
detected by examining the most significant bit (MSb) of F. If F is not equal to zero or one, then the
output is the one's complement of the lower eight bits of F. The most significant bit is not included in
this one's complement operation, as complementing it would make the output value greater than or equal to one.
The multiplication operations in the blend equation have an eight-bit color operand, either Ca
or Cb, and a nine-bit blend operand, either F or 1-F. When the nine-bit blend operand is not equal to
one, the multiplication result is the product of the lower eight bits of the nine-bit blend operand
and the eight-bit color operand. When the nine-bit operand is equal to one, the product of the
multiplication should be exactly equal to the eight-bit color operand, which is accomplished by using a
multiplexer on the output of the multiplier and testing the most significant bit of the nine-bit blend
operand. The multiplication implementation is shown in Fig. 3.4; the Verilog blendmult module
assumes the availability of an 8×8 multiplier component named mult8x8.
Table 3.2 gives some example blend computations for three cases: A, B, and C. In Case A,
the blend factor F is 1.0, causing Cnew to be exactly equal to Ca . In Case B, the blend factor F is
zero, causing Cnew to be exactly equal to Cb . In Case C, the blend factor F is 0.5; note that the 1-F
computation gives a value of 0.49609375 that is incorrect by one LSb due to the use of the one’s
complement to compute 1-F. This one LSb error is propagated to the final result of 0.49609375,
which should be exactly equal to 0.5 if precise arithmetic is used for the computation of 0.75 *
0.5 + (1 − 0.5)* 0.25.
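The one-LSb propagation in Case C can be verified with a behavioral model of the whole computation; the sketch below (our Python, not the book's Verilog) strings together the 1-F, blend-multiply, and saturating-add operations:

```python
def oneminus(f):
    # nine-bit 1-F: 0 -> 1.0, 1.0 -> 0, else one's complement of low byte
    if f == 0:
        return 0b100000000
    return 0 if f & 0b100000000 else (~f) & 0xFF

def blendmult(c, f):
    # eight-bit color c times nine-bit blend operand f;
    # f == 1.0 passes c through unchanged (output multiplexer)
    return c if f & 0b100000000 else (c * (f & 0xFF)) >> 8

def sat_add8(a, b):
    s = a + b
    return 255 if s > 255 else s

def blend(ca, cb, f):
    # Cnew = Ca*F + Cb*(1-F)
    return sat_add8(blendmult(ca, f), blendmult(cb, oneminus(f)))

# Table 3.2, Case C: Ca = 0.75 ('hC0), Cb = 0.25 ('h40), F = 0.5 ('h080)
cnew = blend(0xC0, 0x40, 0x080)
print(cnew, cnew / 256)   # 127 0.49609375 -- one LSb below the exact 0.5
```

Cases A (F = 'h100) and B (F = 'h000) reproduce Ca and Cb exactly, as the table states.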
output dataset contains Cnew. The latency of a datapath is the number of clock cycles required
for a calculation on an input dataset, measured from the first element of the input dataset to
the last element of the output dataset. The total computation time of the datapath for an input dataset
is the latency multiplied by the clock period. The initiation period measures how often a datapath
can accept a new input dataset and is the number of clock cycles from the first element of the input
dataset to the first element of the next input dataset. The throughput of a datapath is the number of
input datasets processed per unit time; lowering the initiation period (providing input datasets more
often) or decreasing the clock period increases the throughput of a datapath.
The constraints of a datapath determine how it is designed. Constraints are measured in both
time and area (number of gates). One common constraint for datapath design is the minimum
time constraint, i.e., design the datapath to perform computation in the least amount of time.
Another common constraint is the minimum area constraint, i.e., design the datapath to use the
minimum number of logic gates. These two constraints are contradictory to each other as performing
a computation in a fewer number of clock cycles usually requires more execution units so that
computations can be performed in parallel, which means more logic gates. In this chapter, we specify
time constraints for a datapath as latency and initiation period values, which are measured in clock
cycles. We do not specify a clock period constraint, as this is dependent upon the implementation
technology such as the particular FPGA family used for the datapath.
Figure 3.5 shows the DFG of the blend equation. In a DFG, circles represent computations,
with arrows linking circles to show the dataflow between computations. The operations (circles) of
the DFG are labeled n1, n2, . . .nN for referral purposes. DFGs are useful in high-level synthesis
tools that synthesize a datapath solution given latency and initiation period constraints. Our DFG
usage is very informal and is principally used to visualize dependencies between computations; the
reader is referred to [2] for a complete discussion of DFGs.
While a DFG shows the data dependencies between computations, the datapath diagram
shows an implementation of the DFG’s computation. A datapath diagram shows the computation
elements and registers that are used to perform the computation and how these elements inter-
connect. Figure 3.6 is a datapath diagram for a naïve implementation of the blend equation. This
implementation is termed naïve as it is simply a one-to-one assignment of the nodes of the DFG to
execution units. This is an undesirable implementation as the execution units are chained together,
[Figure 3.5: DFG of the blend equation. Node n1 is the 1-F operation on input F; node n2 multiplies Ca by F; node n3 multiplies Cb by 1-F; node n4 is the saturating addition producing Cnew. Legend: * multiply operation (9-bit × 8-bit), + addition operation (saturating), 1- 1-F operation.]
creating a long delay path that results in a large clock period. For example purposes, relative delays
of bmult = 2.0, satadd = 1.0, and oneminus = 0.4 are assumed with no time units specified. The
longest combinational delay through this datapath is then 0.4 + 2.0 + 1.0 = 3.4 time units, which
forces the clock period of the system to be at least Tcq (register clock-to-q delay) + 3.4 + Tsu (register
setup time) assuming the inputs and outputs of the datapath are registered. Assuming that Tcq and
Tsu are both 0.1, this gives a system clock period of 3.6 time units.
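The clock-period arithmetic for the naïve datapath can be checked directly; a small Python tabulation of the numbers above:

```python
# Relative unit delays assumed in the text (no time units specified)
t_oneminus, t_bmult, t_satadd = 0.4, 2.0, 1.0
t_cq = t_su = 0.1   # register clock-to-q and setup times

# Naive implementation: all three units chained combinationally
chained = t_oneminus + t_bmult + t_satadd
t_clk = t_cq + chained + t_su
print(chained, t_clk)   # 3.4 3.6
```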
Figure 3.7 shows a better implementation of the blend equation where DFFs have been
placed after the multipliers and after the adder to break the combinational delay path, assuming
that the inputs originate from a registered source. This implementation still has the 1-F calculation
chained with the n3 bmult execution unit, as the 1-F operation is designed for a low combinational
delay by using the one’s complement operation that allows it to be chained with another execution
unit. Within the datapath’s Verilog code, the DFFs are implemented by the always block and are
synthesized as rising edge triggered via the posedge clk in the always block’s sensitivity list. Observe
that the longest tR2R path of Fig. 3.7 is 2.6, which is shorter than the longest combinational path
of Fig. 3.6, allowing for a higher clock frequency.
The cycle-by-cycle timing for the implementation of Fig. 3.7 is shown in Fig. 3.8 for the
blend computations of Table 3.2. The latency of the datapath is two clock cycles due to the two
DFFs in series for any path through the datapath. The initiation period as implemented in Fig. 3.8
is two clocks as new input values are only provided every two clock cycles. Observe that this datapath
takes 2 * 2.6 = 5.2 time units to compute an output result for an input dataset, which is actually
longer than the 3.6 clock period of Fig. 3.6. One reason for this is that dividing the combinational
delay by adding registers does not also divide the Tcq and setup times of the DFFs, which remain
constant. Furthermore, the combinational delay path is not divided evenly when the registers are
inserted. The delay of the register-to-register path that includes the adder is only 0.1 (Tcq ) + 1.0
[Figure 3.7 datapath diagram: DFFs placed after the bmult multipliers (u2, u3) and after the satadd adder; inputs are assumed to come from a registered source. Unit delays: oneminus = 0.4, bmult = 2.0, satadd = 1.0. Reg-to-reg path A = Tcq + oneminus + bmult + Tsu = 0.1 + 0.4 + 2.0 + 0.1 = 2.6 time units; reg-to-reg path B = Tcq + satadd + Tsu = 0.1 + 1.0 + 0.1 = 1.2 time units.]

module blend2clk(clk,ca,cb,f,cnew);
  input clk;
  input [7:0] ca,cb;
  input [8:0] f;
  output [7:0] cnew;

  wire [7:0] u2y,u3y,u4y;
  wire [8:0] u1y;
  reg [7:0] u3q, u2q, cnew;

  bmult    u2 (.c(ca),.f(f),.y(u2y));
  oneminus u1 (.a(f),.y(u1y));
  bmult    u3 (.c(cb),.f(u1y),.y(u3y));
  satadd   u4 (.a(u3q),.b(u2q),.y(u4y));

  // always block that adds DFFs to datapath
  always @(posedge clk)
  begin
    cnew <= u4y; //dff on output
    u3q  <= u3y; //dff on u3 output
    u2q  <= u2y; //dff on u2 output
  end
endmodule
(satadd) + 0.1 (Tsu) = 1.2 time units, as compared to the longest path of 2.6 time units. This is not a
good division of work between the datapath stages; an optimum division of labor evenly divides the
delay path between the datapath stages. However, this datapath's faster clock period of 2.6 time units
allows computations outside of the datapath to execute faster than that possible with the datapath
of Fig. 3.6.
[Figure 3.8: cycle timing for the Fig. 3.7 datapath over clocks 0-7, with Ca = 'hC0, Cb = 'h40, and F taking the values 'h100, 'h000, and 'h080 on successive input datasets.]
The timing diagram of Fig. 3.8 is one way to view a datapath’s activity. A scheduling table, as
shown in Table 3.3, provides another viewpoint of a datapath’s activity. A scheduling table shows
how DFG operations map to datapath resources such as input/output busses and execution units.
Each row of the scheduling table shows the activity of the datapath resources for that clock cycle. A
blank entry for a resource indicates that the resource is idle for that clock cycle. Indices such as ‘(0)’,
‘(1)’, etc., are used with input data values, output data values, and DFG node names to track the
dataset computation that is being performed. The row entries in a schedule eventually repeat as the
datapath performs the same operations on each input dataset. The last two rows in Table 3.3 form
the generalized schedule, that is, the repeated operations on the datapath resources for each dataset.
The percentage time that each datapath resource is busy during the generalized schedule is listed
in the %utilization row of Table 3.3. Each of the datapath resources in Table 3.3 is only utilized 50%
of the time as each resource is idle for one clock period of the two clock cycles that form the
generalized schedule.
TABLE 3.3 (fragment):
CLOCK   RESOURCES
3       n4(1)
4       ca(2)   cb(2)   f(2)   n2(2)   n1(2)   n3(2)   cnew(1)
[Figure 3.9: cycle timing for initiation period = 1 over clocks 0-4, with Ca = 'hC0, Cb = 'h40, and F = 'h100, 'h000, 'h080.]
is provided before the output corresponding to first input dataset is produced. This means that the
datapath has calculations on multiple datasets in progress simultaneously, with each input dataset in
a different computation state. In this case, the computations for the two input datasets are said to be
pipelined, or overlapped. Because the datapath resources of Fig. 3.7 are idle 50% of the time as shown
by Fig. 3.8, no extra datapath resources are required to support the new initiation period of one clock
cycle. Lowering the initiation period to one clock cycle doubles the throughput of the datapath, as
a new result is now available with each clock instead of every two clock cycles. However, lowering
the initiation period (increasing the throughput) does not affect the latency of the datapath. The
cycle timing and scheduling table for an initiation period of one clock cycle are shown in Fig. 3.9
and Table 3.4, respectively. Observe that each resource is now utilized 100%, which is the best that
can be achieved.
In Section 3.6, we observed that the tR2R paths of Fig. 3.7 were not evenly balanced, which
is undesirable as the longest tR2R path determines the clock period. The excess time in the clock
period for the shorter tR2R paths is wasted time; distributing the delays more evenly would produce
a shorter clock period. Note that the longest delay path in Fig. 3.7 contains the 1-F and multiplier
units, with the multiplier having the longest delay of any execution unit. A pipeline stage inserted
in the multiplier, that is, DFFs inserted within the multiplier logic, should reduce the length of this
delay path. Figure 3.10 shows the blend multiplication of Fig. 3.4 modified to include a pipeline stage
within the 8 × 8 multiplier. This example assumes the existence of an unsigned 8 × 8 multiplier with
one pipeline stage named mult8 × 8pipe. Observe that inserting a pipeline stage in the mult8 × 8pipe
component is not sufficient by itself; the other two paths through the multiplier for c[7:0] and
f[8] must also have DFFs inserted so that the data streams remain synchronized when they reach
the output multiplexer. If we assume that the multiplier pipeline stage perfectly divides the old
combinational delay path by two, then the output and input delays of the multiplier both become
FIGURE 3.10: Multiplication of an eight-bit color operand by nine-bit blend operand with pipeline
stage.
equal to 1.1 time units as seen in Fig. 3.10. This decreased delay path comes at the cost of a clock
cycle of latency through the blend multiplication unit.
Figure 3.11 shows the blend implementation of Fig. 3.7 modified to use the pipelined multi-
plier of Fig. 3.10. The longest tR2R path has been reduced from 2.6 to 1.6 time units, at the cost of
an extra clock cycle of latency.
The cycle timing for the blend implementation with the pipelined multiplier is shown in
Fig. 3.12. The only difference between this timing and the timing in Fig. 3.9 is the extra clock cycle
of latency. Table 3.5 shows the scheduling table for the blend implementation with the pipelined
multiplier. The table entries for the bmultpipe units show two calculations, one for each pipeline
stage of the bmultpipe unit. The extra clock cycle of latency in the bmultipipe units causes the satadd
unit to remain idle until clock cycle two, as opposed to clock cycle one in the Table 3.4 schedule.
In comparing the cycle timings and schedules for the two clock cycle latency versus the three
clock cycle latency solutions, a good question to ask is “When is it not advantageous to pipeline
execution units?” Each clock cycle of latency is one more clock cycle that it takes for the pipeline to
become full and for all execution units to become active. A pipelined datapath with a large latency is
efficient as long as it has a continuous stream of input data. If the application using the datapath does
TABLE 3.5: Schedule for latency = 3, initiation period = 1 (fragment; each bmultpipe entry lists the calculations in its two pipeline stages)
CLOCK   INPUT               BMULTPIPE (U2)   ONEMINUS   BMULTPIPE (U3)   SATADD   OUTPUT
3       ca(3) cb(3) f(3)    n2(3), n2(2)     n1(3)      n3(3), n3(2)     n4(1)    cnew(0)
4       ca(4) cb(4) f(4)    n2(4), n2(3)     n1(4)      n3(4), n3(3)     n4(2)    cnew(1)
not provide continuous input data, thus allowing the pipeline to become empty or partially empty,
then the datapath throughput is significantly decreased.
Table 3.6 compares the datapaths that have been discussed to this point by clock period,
latency, initiation period, and throughput.
[Figure 3.12: cycle timing for the blend implementation with the pipelined multiplier over clocks 0-5, with Ca = 'hC0, Cb = 'h40, and F = 'h100, 'h000, 'h080.]
The throughput value measures the number of input datasets processed per time unit, and is
calculated by Eq. (3.2), assuming that the pipeline is filled. Decreasing either the initiation period
or the clock period improves throughput, as is seen in rows (c) and (d) of Table 3.6. However,
these improvements come at a cost. Decreasing the initiation period generally requires adding more
datapath resources, even though this was not necessary in this simple example. Decreasing the clock
period by pipelining execution units adds latency to the datapath.
Throughput = 1 / (initiation period × clock period) (3.2)
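Eq. (3.2) can be applied to the datapaths discussed so far; a small Python sketch using the time-unit values from this chapter (the comparison mirrors the rows of Table 3.6):

```python
def throughput(initiation_period, clock_period):
    # Eq. (3.2): input datasets processed per time unit, pipeline full
    return 1.0 / (initiation_period * clock_period)

# Fig. 3.7 datapath, initiation period = 2, clock period = 2.6
print(round(throughput(2, 2.6), 4))   # 0.1923
# Same datapath with initiation period = 1 doubles the throughput
print(round(throughput(1, 2.6), 4))   # 0.3846
# Pipelined-multiplier version: clock period = 1.6
print(round(throughput(1, 1.6), 4))   # 0.625
```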
[Figure 3.13: single-multiplier blend datapath. Multiplexer mf selects between f and the oneminus output u1y; multiplexer mc selects between ca and cb; the shared bmult output u2y feeds registers rA and rB (load signals ld_n2 and ld_n3, register outputs u2q and u3q); the satadd output u4y feeds output register rC (load signal ld_cnew). Verilog excerpt:]

module blend1mult(clk,reset_b,ca,cb,f,cnew);
  input clk, reset_b;
  input [7:0] ca,cb;
  input [8:0] f;
  output [7:0] cnew;

  bmult    u2 (.c(ma),.f(mf),.y(u2y));
  oneminus u1 (.a(f),.y(u1y));
  satadd   u4 (.a(u3q),.b(u2q),.y(u4y));
  fsm      u3 (.clk(clk), .reset_b(reset_b),.msel(msel),.ld_n2(ld_n2),
               .ld_n3(ld_n3), .ld_cnew(ld_cnew));
endmodule
cycle 3(i + 0), the multiplier’s operands are Ca and F, while in clock cycle 3(i + 1) the multiplier’s
operands are Cb and 1-F. This means that a multiplexer is needed on the multiplier’s inputs to choose
between the two sets of operands. The other problem is that a register is required to store the n2
result produced in clock cycle 3(i + 1) until it is needed in clock cycle 3(i + 2) for the n4 operation.
A datapath that implements this schedule is shown in Fig. 3.13. This datapath uses registers instead
of DFFs to break the combinational delay path and to store intermediate results. A register has a
load input (LD); the register accepts a new input value only when LD is asserted and when the
active clock edge occurs. By contrast, a DFF accepts a new input on each active clock edge. The
three registers are named rA, rB, and rC. A register transfer operation (RTL) is added to cells in the
module fsm(clk,reset_b,msel,ld_n2,ld_n3,ld_cnew);
input clk,reset_b;
output msel,ld_n2,ld_n3,ld_cnew;
reg msel,ld_n2,ld_n3,ld_cnew;
reg [1:0] state, nstate;
scheduling table for each clock cycle in which a register write occurs. The RTL notation “ca*f→rA” for
execution unit u2 in clock zero indicates that register rA is loaded with the result of the multiplication
that has the ca and f input busses as operands. Note that the rA and rB registers controlled by the
ld_n2 and ld_n3 load signals have their data inputs connected to the multiplier output u2y. The ld_n2
load signal is asserted in clock cycle 3(i + 0) to store the n2 result, while the ld_n3 load signal is
asserted in clock cycle 3(i + 1) to store the n3 result. The ld_cnew load signal is asserted in clock
cycle 3(i + 2) to load the output register with the satadd n4 result. The multiplexer select signal
msel is negated in clock cycle 3(i + 0) to pass the Ca, F operands to the multiplier, while msel is
asserted in clock cycle 3(i + 1) to select Cb, 1-F as the multiplier operands. As an optimization, register
rB could be replaced with DFFs as its contents are only needed in the following clock cycle. The
Cnew(i − 1) output value is held stable by the rC register for the duration of the computation; this
might be useful if this value is used by a destination datapath. If this is not required, then register
rC could also be replaced by DFFs.
A finite state machine component named FSM is responsible for driving the datapath’s control
lines of msel, ld n2, ld n3, and ld cnew with the correct values in the appropriate clock cycles. The
control signals in Fig. 3.13 are drawn with dotted lines to distinguish them from the data busses
that are operated on by the execution units. The control signals and FSM component are typically
not drawn in a datapath diagram; they are included here since this is the first datapath example that
has required a FSM. Figure 3.14 shows the FSM implementation. Three states are required since
the datapath’s operation is a repeating computation covering three clock cycles.
[Figure 3.15 waveforms over clocks 0-9: clk; reset_b; Ca = 'hC0; Cb = 'h40; F = 'h100, 'h000, 'h080; u2q (= Ca * F) = 'h00, 'hC0, 'h00, 'h60; and the msel, ld_n2, ld_n3, ld_cnew control signals.]
FIGURE 3.15: Cycle timing for the single multiplier blend implementation.
This FSM implementation uses two state DFFs and a Gray-code encoding for the state imple-
mentation; an alternate encoding method such as one-hot encoding could have been used as well.
The FSM requires an asynchronous reset input to initialize the state registers to state S0; in this
example the reset signal is named reset b and is a low-true input. The polarity choice for the reset
signal, low-true or high-true, is implementation dependent.
reg msel,ld_n2,ld_n3,ld_cnew,ordy;
reg [1:0] state, nstate;
FIGURE 3.16: Handshaking added to FSM for single multiplier blend implementation.
is asserted for one clock cycle when valid data is placed on the Cnew output bus, by delaying the
ld_cnew signal that is asserted in state S2 by one clock cycle. This is implemented by a DFF that is
synthesized via the Verilog assignment ordy <= ld_cnew within the always block used for the state
registers of the FSM. In the algorithmic state machine (ASM) chart, the ordy signal action is described
by the annotation ld_cnew@1c→ordy, which reads, “ordy is assigned the value of ld_cnew, delayed
by one clock”. Figure 3.17 shows the cycle timing of the modified datapath for one computation;
the assertion of irdy indicates valid input data and causes the computation to begin. The ordy signal
is asserted when the Cnew output bus contains the computation result. The changes required to the
blend1mult module of Fig. 3.13 to support the new handshaking signals are left as an exercise for
the reader.
[Figure 3.17 waveforms over clocks 0-9: clk; reset_b; Ca = 'hC0; Cb = 'h40; F = 'h080; irdy; u2q (= Ca * F) = 'h60; state sequence s0, s1, s2, s0; and the msel, ld_n2, ld_n3, ld_cnew, ordy signals.]
FIGURE 3.17: Cycle timing for the single multiplier blend implementation with handshaking.
CLOCK    INPUT   REGISTER    BMULT        ONEMINUS   SATADD      OUTPUT
         (DIN)   (RF)        (U2)         (U1)       (U4)        (CNEW)
0        f(0)    din→rF
1        ca(0)   f(0)        n2(0)        n1(0)
                             din*rF→rA
2        cb(0)               n3(0)
                             din*u1→rB
3                                                    n4(0)
                                                     rA+rB→rC
4        f(1)    f(1)→rF                                         cnew(0)
4(i+0)   f(i)    din→rF
4(i+1)   ca(i)   f(i)        n2(i)        n1(i)                  cnew(i-1)
                             din*rF→rA
4(i+2)   cb(i)               n3(i)
                             din*u1→rB
4(i+3)                                               n4(i)
                                                     rA+rB→rC
%utilization  75%   25%      50%          25%        25%         25%
[ASM chart for the shared-input-bus FSM: the machine waits for irdy, then sequences through its states asserting ld_n2, then msel and ld_n3, then ld_cnew; ordy is generated by a DFF that delays ld_cnew by one clock.]
Figure 3.18a shows the datapath for the blend implementation with a shared input bus. The
nine-bit din data bus is used for the F, Ca , and Cb data values. The multiplexer that was used on
input c of the bmult multiplier in Fig. 3.13 is no longer needed, as the Ca , Cb input values are now
time-multiplexed over the din databus.
Figure 3.18b shows the ASM chart for the datapath’s FSM control; the FSM uses handshaking
in the same manner as used in Fig 3.16. The Verilog code for this implementation is left as an exercise
for the reader.
[Figure 3.19: DFG of Eq. (3.3). Node n1 multiplies X by b0, node n2 multiplies Y@1 by a1, and node n3 adds the two products to produce Y. Legend: * multiply operation, + addition operation. The iteration critical loop contains n2 and n3.]
is the output computed from the previous input dataset, and is not the output of Y delayed by one
clock cycle. A special class of digital filters known as infinite impulse response (IIR) filters have the
general structure of (Eq. 3.3), except that multiple previous output values (Y@1, Y@2, . . .Y@n) and
multiple previous input values (X, X@1, X@2, . . . X@k) are typically used as shown in (Eq. 3.4).
The values ai (a1 , a2 , a3, . . . an ) and bi (b0 , b1 ,..bk ) that are multiplied by the previous output and
previous input values are called the filter coefficients, and are determined by the filter’s specifications
(cutoff frequencies for low pass, band pass, high pass; roll-off constraints, etc.). Each multiplication
operation is called a filter tap, and increasing the number of filter taps improves the filter quality.
Y = Y@1 × a1 + X × b0 (3.3)

Y = (Y@1 × a1 + Y@2 × a2 + ... + Y@n × an)
  + (X × b0 + X@1 × b1 + ... + X@k × bk) (3.4)
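A behavioral sketch of the first-order recursion of Eq. (3.3) in 0.8 fixed point can make the dataset-to-dataset dependence concrete; the coefficient and input values below are our own illustration, not from the text:

```python
def mult_0p8(a, b):
    # 0.8 x 0.8 multiply, keep the upper eight bits of the product
    return (a * b) >> 8

def sat_add8(a, b):
    s = a + b
    return 255 if s > 255 else s

def iir_first_order(xs, a1, b0):
    """Y = Y@1 * a1 + X * b0 over a stream of input datasets.
    Y@1 is the previous *output dataset*, initialized to zero."""
    y = 0
    out = []
    for x in xs:
        y = sat_add8(mult_0p8(y, a1), mult_0p8(x, b0))
        out.append(y)
    return out

# a1 = b0 = 0.5 (128 in 0.8 format); step input X = 0.5 (128)
print(iir_first_order([128, 128, 128], 128, 128))  # [64, 96, 112]
```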
One of the features of a non-recursive equation is that a datapath implementation can always
achieve an initiation period of one clock cycle by overlapping computations and adding the required
extra resources such as input data busses, execution units, and registers. However, assuming that
execution units cannot be chained, the minimum initiation period of a recursive calculation depends
upon the iteration critical loop, which is the shortest path through the data flowgraph involving a
previous output. Figure 3.19 gives the DFG of Eq. (3.3), with the iteration critical loop containing
nodes n2 and n3. Each node requires one clock cycle assuming that execution unit chaining is not
allowed, thus resulting in a minimum initiation period for this DFG of two clock cycles.
Table 3.9 shows a schedule for Eq. (3.3) that meets the minimum initiation period of two
clock cycles. This schedule assumes that the filter coefficients are loaded into the datapath over the
shared input data bus during an initialization phase, which is done before the datapath computation
loop is entered.
Figure 3.20 shows the datapath and FSM control for the schedule of Table 3.9, with eight-bit
data used for all calculations and 0.8 fixed-point encoding assumed. The ASM chart shows the states
divided into two groups: initialization and computation. The S0 and S1 states are used to initialize
the a1 , b0 coefficient registers of the datapath with the a1 , b0 values input over the din input bus
in consecutive clock cycles once the irdy handshaking signal is asserted. States S2 and S3 form the
TABLE 3.9: Schedule for latency = 2, initiation period = 2, Eq. (3.3) implementation
[Figure 3.20 ASM chart: states S0 and S1 form the initialization phase, loading the b0 and a1 coefficient registers (ld_b0, ld_a1) once irdy is asserted; states S2 and S3 form the computation loop (ld_rArB in S2, ld_y in S3), looping while irdy remains asserted; ordy is generated by a DFF that delays ld_y by one clock (ld_y@1c→ordy).]
computation loop, with new X values available over the din input bus as long as the irdy handshaking
signal is asserted. The computation loop is exited when the irdy handshaking signal is negated. The
ordy output handshaking signal is produced by delaying the ld y signal of the FSM by one clock
cycle. The Verilog code for this implementation is left as an exercise for the reader.
[Figure 3.21: DFG for Eq. (3.5), with nodes n1 through n7 (four multiplies, three additions). Legend: * multiply operation, + addition operation. The shortest path to Y is three clocks, assuming no execution unit chaining and no execution unit pipelining.]
methodology appropriate for higher complexity datapaths is developed. This methodology does not
attempt to include all of the optimizations found in behavioral synthesis methodologies [3], but
rather serves to illustrate the key problems in datapath scheduling.
Equation (3.5) is a four-tap finite impulse response (FIR) filter, and is used as the target equation
for the datapath implementations that follow. A FIR filter differs from an IIR filter (Eq. 3.4)
in that it is a non-recursive equation — the filter does not use past output values. A FIR filter
generally requires more filter taps than an IIR filter to achieve the same filter quality. As with the IIR
equation, X@1 means the X input from the previous input dataset, and is not the X input delayed
by a clock cycle. Please note that because of the regular structure of the FIR equation, an efficient
datapath implementation can be done for the case of initiation period = 1, where each addition
and multiplication operation is mapped to an individual execution unit. This equation is used in
this section to illustrate the more difficult problem of mapping multiple flowgraph operations onto
the same execution unit, when resource constraints prevent one-to-one mappings of operations to
execution units.
Y = X × b0 + X @1 × b1 + X @2 × b2 + X @3 × b3 (3.5)
Figure 3.21 shows the DFG for Eq. (3.5). The shortest path through this DFG is three clock
cycles, assuming no execution unit chaining and non-pipelined execution units. This shortest path
of three clock cycles is the minimum achievable latency for this equation.
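To make the data dependencies concrete, Eq. (3.5) can be sketched behaviorally. The Python below is only an illustration: the book's design examples are in Verilog, the function names are invented, and the fixed-point saturating arithmetic of the real datapath is not modeled.

```python
# Behavioral sketch of the four-tap FIR filter of Eq. (3.5).
# hist holds [X, X@1, X@2, X@3], one entry per past input dataset.

def fir4(hist, b):
    """hist = [X, X@1, X@2, X@3]; b = [b0, b1, b2, b3]."""
    return sum(x * c for x, c in zip(hist, b))

def run_fir(xs, b):
    hist = [0.0, 0.0, 0.0, 0.0]   # X, X@1, X@2, X@3
    ys = []
    for x in xs:
        hist = [x] + hist[:3]     # new dataset: X becomes X@1, X@1 becomes X@2, ...
        ys.append(fir4(hist, b))
    return ys

print(run_fir([1, 0, 0, 0], [0.5, 0.25, 0.125, 0.0625]))
# the impulse response reproduces the coefficients: [0.5, 0.25, 0.125, 0.0625]
```

Note that `hist` advances once per input dataset, not once per clock, matching the text's point that X@1 is the X input from the previous dataset.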
Table 3.10 shows the steps in the datapath design methodology that is followed in this section.
This methodology’s goal is a datapath that uses the minimum number of execution units to meet a
set of target constraints.
The target constraints in this methodology are initiation period and latency, both measured in
clock cycles. Step #2 computes a lower bound for each type of resource required to meet the target
constraints, using Eq. (3.6). The result of Eq. (3.6) is a lower bound for the resource, which means
that it cannot be done with any fewer resources than this value, and may actually require more than this lower bound.
FINITE STATE MACHINE WITH DATAPATH DESIGN 65
[Table 3.10 residue: columns STEP and ACTION; the methodology steps themselves are not recoverable.]
TABLE 3.11: Schedule for Figure 3.40 using two multipliers, one adder for target latency = 3, target initiation period = 3. [Columns: CLOCK, INPUT, MULT (U1), MULT (U2), SATADD (U3), OUTPUT; row data not recoverable.]
of adders must be increased from one to two. However, performing the n5 and n6 computations in
clock #1 requires that the n3, n4 multiply operations be performed by clock #0, which requires that
the number of multipliers be increased from two to four.
Table 3.12 shows that the scheduling now succeeds with the increased resources of four
multipliers and two adders for the target latency of three clocks. However, meeting this target required
a doubling of the resources from their lower bound computations, which may not be acceptable if
resources are limited. Relaxing the target constraints must be done if the resource requirements are
too high.
If the target constraints are relaxed to initiation period = 4 clocks and latency = 4 clocks,
then the new lower bound computations are shown in Eqs (3.10) and (3.11) (the input bus resource
is omitted for brevity as it clearly does not affect the scheduling).
# of multipliers = ⌈4 / 4⌉ = 1 (3.10)

# of adders = ⌈3 / 4⌉ = 1 (3.11)
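Equations (3.10) and (3.11) are ceiling divisions of the operation count by the initiation period, and a one-line Python helper (the function name is invented) reproduces them:

```python
import math

# Resource lower bound: the number of flowgraph operations of a given type
# divided by the initiation period, rounded up to the next whole unit.
def lower_bound(num_ops, initiation_period):
    return math.ceil(num_ops / initiation_period)

print(lower_bound(4, 4))   # Eq. (3.10): four multiplies over 4 clocks -> 1 multiplier
print(lower_bound(3, 4))   # Eq. (3.11): three additions over 4 clocks -> 1 adder
print(lower_bound(4, 3))   # initiation period of 3 clocks -> 2 multipliers
```

The last line matches the earlier initiation period = 3 case, whose lower bound was two multipliers and one adder.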
Table 3.13 shows that the scheduling attempt fails for these resource lower bounds, because
the addition operations n5 and n7 cannot be scheduled within the target latency of four clocks. The
three addition operations must begin in clock #1 if they are to be completed within the four clock
latency using only one adder. If the n6 addition operation is scheduled in clock #1, then the n3 and
n4 multiply operations must be scheduled in clock #0, which requires two multipliers.
TABLE 3.12: Schedule for Figure 3.40 using four multipliers, two adders for target latency = 3, target initiation period = 3. [Columns: CLOCK, INPUT, MULT (U1), MULT (U2), MULT (U3), MULT (U4), SATADD (U6), SATADD (U7), OUTPUT; row data not recoverable.]
TABLE 3.13: Schedule for Figure 3.40 using one multiplier, one adder for target latency = 4, target initiation period = 4

CLOCK   INPUT   MULT    SATADD
0       x(0)    n4(0)
1               n3(0)
2               n2(0)   n6(0)
3               n1(0)

Scheduling fails; operations n5 and n7 are not scheduled within the target latency.
Table 3.14 shows that scheduling is successful for the target latency of four clocks after the
number of multipliers is increased from one to two. Assuming that this resource increase is acceptable,
the datapath design can continue with register scheduling.
TABLE 3.14: Schedule for Figure 3.40 using two multipliers, one adder for target latency = 4, target initiation period = 4. [Columns: CLOCK, INPUT, MULT (U1), MULT (U2), SATADD (U3), OUTPUT; row data not recoverable.]
The Produced column of Table 3.15 lists values either produced by computations or input to the datapath during the cycle and saved for a future clock
cycle. For example, in clock cycle i + 0, the x value in the Produced column is input by the datapath
during that cycle and must be saved as it becomes the x@1 value in the next dataset computation.
The Consumed column lists items from the Initial column that are no longer needed after this clock
cycle. The Total Registers column is the total number of registers needed during that clock cycle, and
is computed as Initial + Produced – Consumed, as registers whose values are consumed can now be
used to store new values. The maximum register count in the Total Registers column is the number
of registers required by the datapath for this schedule; in this case it is seven registers. This does
not include the registers required for coefficients b0 , b1 , b2 , and b3 as they are loaded during the
initialization phase and do not change during the computation loop. The total number of datapath
registers is 11 (7 + 4) once the coefficient registers are included. Observe that the scheduling of
node operations in Table 3.14 affects the number of registers required for a particular clock cycle.
For example, if node operations n1, n2 were scheduled in clock i + 0 instead of nodes n3, n4, then
the x@3 value would not be consumed in clock cycle i + 0, and the register count for that clock cycle
would be seven. This does not increase the maximum number of registers for this datapath, but this
may not be true for other datapaths.
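The Total Registers bookkeeping described above can be sketched as follows; the (produced, consumed) pairs in the example are hypothetical sample values, not the entries of Table 3.15.

```python
# Register bookkeeping sketch: the next clock's Initial count is this
# clock's Initial + Produced - Consumed, and the datapath needs
# max(Total) registers over the schedule.

def register_counts(initial, rows):
    """rows: list of (produced, consumed) pairs, one per clock cycle."""
    live = initial
    totals = []
    for produced, consumed in rows:
        live = live + produced - consumed   # Total = Initial + Produced - Consumed
        totals.append(live)
    return totals

# Hypothetical four-clock schedule starting with 4 live values:
totals = register_counts(4, [(2, 1), (2, 1), (1, 2), (1, 2)])
print(totals)        # [5, 6, 5, 4]
print(max(totals))   # 6 registers required for this hypothetical schedule
```

As the text notes, reordering node operations between clocks shifts values between rows of this computation and can change the maximum.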
The registering requirements of Table 3.15 can be mapped to specific registers on a clock-by-
clock basis as shown in Table 3.16. The seven registers identified in Table 3.15 are named rA, rB,
rC, rD, rE, rF, and rY, with the register contents corresponding to the Initial and Produced columns
of Table 3.15. If a register’s content is changed during a clock cycle, then this is indicated by a
register write operation such as “n3→rD” (the result of operation n3 is written to register rD) or
“rE→rA” (the contents of register rE is written to register rA). This write operation is shown because
this translates into a load line assertion for this register in the finite state machine control of the
datapath. If a register's contents are no longer required after a clock cycle, then that table cell is shown
as blank even though the register's contents have not physically changed (i.e., the n6 computation
result in register rF is consumed in clock i + 3 and no new value is written to register rF, so the
table cell entry for rF is blank in clock i + 3 even though the n6 computation result is still physically
present). The initial row shows the assumed register contents at the beginning of the i + 0 clock
cycle; the assignment of x@1, x@2, x@3 to registers rA, rB, rC is an arbitrary choice. Observe that
register transfers in clock i + 3 such as "rE→rA", which writes the current x value to rA, are done to
get ready for the next set of computations, as x becomes x@1, x@1 becomes x@2, and x@2 becomes
x@3. The register choices made in Table 3.16 affect the multiplexing requirements of the datapath;
in this methodology we do not attempt to optimize the register assignments in order to reduce the
multiplexing.
The execution unit scheduling of Table 3.14 and the register content scheduling in Table 3.16 are
now combined into one table that completely specifies the datapath operation, as shown in Table 3.17.
The execution unit operations are now specified as RTL, such as "rC * b3→rC" for the n4 computation
done in clock i + 0. The table also contains a column that contains register to register transfers such
as “rE→rA”. Observe that the choice of a particular unused register for storing a result affects
the multiplexing needed for a register input. For example, the n1 and n3 computations are both
written to register rC, while n2 and n4 are written to register rD. From Table 3.17, it is seen that
register rD receives results only from multiplier unit u2, and thus does not require a multiplexer on
its input. However, in clock cycle i + 0 if register rD had been chosen for computation n3, and
rC for computation n4, then register rD would receive results from both the u1 and u2 multiplier
units, requiring a multiplexer on the rD register input. After creating initial versions of the execution
unit scheduling, register scheduling, and combined execution unit/register scheduling tables, the
multiplexing requirements become visible and changes can be made to register assignments to reduce
the number of multiplexors in the datapaths. It should be noted that high-level synthesis tools exist
that perform these optimizations automatically.
The datapath and FSM implementation of the scheduling in Table 3.17 is shown in Fig. 3.22.
The FSM control such as register load signals and multiplexer select signals are not shown in the
datapath; the presence of these signals is assumed. Datapath diagrams such as Fig. 3.22a quickly
become unwieldy as the datapath complexity increases and are also not strictly necessary, as the
[Figure 3.22 residue: (a) datapath with registers rA, rB, rC, rD, rE, rF, and rY, multiplexers mx1 through mx6, multipliers u1 and u2 (inputs selected from din, the b0, b1, b2, b3 coefficients, and register outputs), and saturating adder u3 driving the Y output through rY; the coefficients and X values are input over the shared din data bus. (b) FSM flowchart with the irdy decision controlling the computation loop and ordy produced from ld_rY delayed by one clock.]
FIGURE 3.22: Datapath, FSM for implementation using Table 3.17 scheduling.
scheduling operations in Table 3.17 specify datapath operations. The Verilog code that implements
the datapath is the final representation of the datapath operation, with datapath diagrams only
used as an aid for visualizing the components and their interconnection that comprise the datapath.
The FSM control comprises eight states: four states for the initialization of the coefficient
registers, and four states for the compute loop. The assignments of registers to the mx1, mx2 and
mx4, mx5 multiplexer inputs were done so that the select lines of these two pairs of multiplexors
can be connected together. Thus, the number of multiplexer select signals in the ASM chart can be
reduced from what is shown, as the mx1, mx2 and mx4, mx5 signals have the same values in each
state and thus each pair can be driven by one signal. The assignments of inputs to the mx3 and mx6
multiplexer were arbitrarily chosen.
[Figure residue: a generalized overlapped schedule in which computations i, i+1, i+2, ... each have a latency of L clocks and begin N clocks apart (the initiation period); the number of overlapped computations is L/N.]
TABLE 3.18: Schedule for Figure 3.40 using one multiplier, one adder for target latency = 5, target initiation period = 5. [Row data not recoverable.]
TABLE 3.19: Schedule for Figure 3.44 using one multiplier, one adder for target latency = 5, target initiation period = 5. [Row data not recoverable.]
[Additional table residue: columns CLOCK, INPUT, MULT (U1), MULT (U2), SATADD (U3), SATADD (U4), OUTPUT.]
generalized schedule. For example, it would not work to schedule the n6, n5, and n7 operations all
on the u3 adder as this cannot be repeated within the two clocks of the initiation period.
Table 3.21 shows that the number of temporary registers required by this schedule is eight, so the total
number of registers needed for the datapath, including the four coefficient registers, is 12. Assuming
the clock period remains the same, doubling the throughput has only cost one additional register
and one extra adder. The reason for this small increase in resources is the low utilization
of the resources in the latency = 4, initiation period = 4 solution of Table 3.14.
The remaining detailed register scheduling and datapath design is left as an exercise for the
reader.
3.19 SUMMARY
DFGs are useful tools for visualizing the data dependencies of a computation. Latency and initiation
period constraints determine the number of registers and execution units required to implement
a particular computation. A scheduling table is used to map computations to available execution
units and registers. Overlapped computations and pipelined executions are both useful techniques
for increasing the throughput of a datapath.
Equation 3.14 implements an operation known as bilinear filtering in which a new color Cnew
is produced from four colors C00, C01 , C10 , C11 using two blend factors, u and v. As an example, if v
= 0.5 and u = 0.5, then Cnew is an equal blend of each color (Cnew = 0.25*C00 + 0.25*C01 + 0.25*C10
+ 0.25*C11 ). The data types and operations in Eq. 3.14 are the same as in the blend equation. The
colors are 0.8 fixed-point values, while u, v are nine-bit values encoded in the same manner as F in
Figure 3.25 shows a DFG for Eq. (3.14) that assumes a single nine-bit input databus, with
the u, v blend factors input during the datapath initialization phase and multiple four-tuples of C00,
C01 , C10, and C11 input during the computation loop for use with these blend factors. The square
boxes around C01 , C10, and C11 and the arrows linking C00, C01 , C10, and C11 indicate that these are
input operations over a shared input bus.
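As a behavioral reference for the exercises, the standard bilinear blend can be sketched in Python. This is only an illustration: Eq. (3.14)'s 0.8 fixed-point colors, nine-bit u and v encodings, and saturating arithmetic are abstracted into floats, and the factor-to-color pairing shown is the conventional one, which Eq. (3.14) may order differently.

```python
# Float sketch of bilinear filtering in its conventional form; no
# fixed-point quantization or saturation is modeled.

def bilinear(c00, c01, c10, c11, u, v):
    return ((1 - u) * (1 - v) * c00 + u * (1 - v) * c01 +
            (1 - u) * v * c10 + u * v * c11)

# u = v = 0.5 blends the four colors equally, matching the text's example
# (Cnew = 0.25*C00 + 0.25*C01 + 0.25*C10 + 0.25*C11):
print(bilinear(0.2, 0.4, 0.6, 0.8, 0.5, 0.5))   # approximately 0.5
```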
The following questions reference Eq. (3.14) and Fig. 3.25. Use the minimum number of
execution units in all implementations.
7. Using the methodology of Table 3.10, design a datapath that has latency = 6 clocks and
initiation period = 6 clocks. Assume that C00, C01 , C10, and C11 are available in successive
clock cycles in the first four clocks of the initiation period.
8. Using the methodology of Table 3.10, design a datapath that has latency = 8 clocks and
initiation period = 4 clocks. Assume that C00, C01 , C10, and C11 are available in successive
clock cycles in the four clocks that comprise the initiation period.
9. If multiplier units with one pipeline stage are used, then the shortest path becomes eight
clocks. Using multiplier units with one pipeline stage, design a datapath that has latency =
eight clocks and initiation period = eight clocks.
[Figure 3.25 residue: dataflow graph for Eq. (3.14) with nodes n1 through n11 combining the inputs C00, C01, C10, C11 and the blend factors u, v, 1-u, 1-v; * denotes a 9-bit x 8-bit = 8-bit multiply and + a saturating addition, with n11 the final saturating addition producing Cnew. The shortest path is six clocks, assuming no execution unit chaining and no execution unit pipelining; there are multiple paths of this shortest length.]
FIGURE 3.25: Dataflow Graph for Equation 3.14
82 FINITE STATE MACHINE DATAPATH DESIGN
10. If multiplier units with one pipeline stage are used, then the shortest path becomes eight
clocks. Using multiplier units with one pipeline stage, design a datapath that has latency =
eight clocks and initiation period = four clocks.
3.21 REFERENCES
[1] Kai Hwang, Computer Arithmetic: Principles, Architecture, and Design, Wiley, 1979.
[2] S. S. Bhattacharyya, P. K. Murthy et al., Software Synthesis from Dataflow Graphs, Kluwer Academic Publishers, 1996.
[3] Sumit Gupta, Rajesh Gupta et al., SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits, Springer, 2005.
CHAPTER 4
Embedded Memory Usage in Finite State Machine with Datapath (FSMD) Designs
1. Discuss the operational differences between synchronous and asynchronous embedded mem-
ories, and between single-port, dual-port, and FIFO memories.
2. Implement FSM/datapaths that incorporate single-port synchronous RAMs.
3. Discuss application scenarios for FIFOs and dual-port memories.
4. Use two-phase and four-phase handshaking for data transfer.
5. Use a two-flop synchronizer for asynchronous input synchronization.
longer access times. Figure 4.1b shows sample contents for an 8 × 4 ROM; this memory requires a
three-bit address bus (log2(K)) and a four-bit data output bus.
Figure 4.2 shows a synchronous version of a K × N ROM. DFFs are placed on the address
inputs (i.e., these inputs are registered), thus latching the address inputs on a rising clock edge.
The data output bus is available in both registered and unregistered versions. A designer might
use the registered dout version if the ROM’s access time is large and the designer did not want
the ROM’s access time summed with the datapath delay that follows the ROM’s output. This is
similar to the methodology used in Chapter 3 in which registers are placed between execution units
(adders, multipliers) to break long combinational paths, reducing critical path length and increasing
system clock frequency. The tradeoff associated with using the registered dout bus is a clock cycle
of latency for data access; the registered dout value in the current clock cycle corresponds to the
memory contents of the address bus value latched on the rising clock edge of the previous clock cycle.
By contrast, the unregistered dout bus contains the memory contents of the address value latched
on the rising clock edge of the current clock cycle. The registered dout value is available a Tcq
propagation delay after the rising clock edge; Tcq is less than the TACCESS time. It should be noted that
the availability of both registered and unregistered dout buses in synchronous embedded memories
is a design decision made by the FPGA vendor and thus will vary by FPGA vendor and by FPGA
family. In this text, the assumption is made that both registered and unregistered dout buses are
available.
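The registered versus unregistered dout tradeoff can be modeled behaviorally. In the Python sketch below (class and method names are invented; the book's own examples are in Verilog), one call models one rising clock edge.

```python
# Behavioral sketch of the synchronous ROM of Fig. 4.2: the address is
# latched on the rising clock edge; the unregistered dout reflects the
# just-latched address, while the registered dout lags it by one clock.

class SyncROM:
    def __init__(self, contents):
        self.mem = contents
        self.addr_reg = None    # address latched on the previous edge
        self.dout_reg = None    # registered dout

    def rising_edge(self, addr):
        # the registered dout picks up the previous cycle's memory output
        self.dout_reg = None if self.addr_reg is None else self.mem[self.addr_reg]
        self.addr_reg = addr                  # latch the new address
        unreg_dout = self.mem[self.addr_reg]  # valid an access time after the edge
        return unreg_dout, self.dout_reg

rom = SyncROM({0b000: 0b0110, 0b001: 0b1010, 0b010: 0b1101})
print(rom.rising_edge(0b001))   # unregistered dout = 0b1010; registered not yet valid
print(rom.rising_edge(0b010))   # unregistered dout = 0b1101; registered dout = 0b1010
```

The one-clock lag of `dout_reg` is exactly the clock cycle of latency the text describes as the cost of using the registered dout bus.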
Random Access Memory (RAM) is an embedded memory block whose contents can be
modified under application control. Figure 4.3 shows an asynchronous K × N RAM; the additional
signals on this embedded memory block when compared to the asynchronous ROM of Fig. 4.1
are the data input bus (din) and write enable (we) input. New data on the din bus is written to
the current address location when the we enable signal experiences a high-to-low transition; there
EMBEDDED MEMORY USAGE IN FINITE STATE MACHINE WITH DATAPATH (FSMD) DESIGNS 87
[Figure 4.1 residue: (b) 8 x 4 ROM sample contents, addr/data: 000/0110, 001/1010, 010/1101, 011/0000, 100/0000, 101/1111, 110/0101, 111/1001. The ROM has K locations, each location contains N bits; M[i] is read as "contents of location i".]
[Figure 4.2 residue: synchronous ROM; the input address is latched on the rising clock edge, the unregistered output is available after a delay from the rising clock edge, and the registered output is available after one clock cycle of delay.]
[Figure 4.2 waveform residue: over clocks 1 through 3, the unregistered dout shows M[i], M[j], M[k] an access time after each rising edge, while the registered dout shows the same values delayed one clock.]
[Figure 4.3 residue: asynchronous K x N RAM with addr[log2(K)-1:0], din[N-1:0], dout[N-1:0], and we ports; reads have a TACCESS output delay, and new data is latched on the falling edge of we.]
[Figure 4.4 waveform residue: synchronous RAM reads of M[i]=5, M[j]=47, M[k]=32 followed by writes of 78 to location i and 13 to location j while we is asserted, with din values 78, 13, 62.]
is also a minimum high pulse-width requirement on the we signal with setup (tsu) and hold (thd)
constraints for din on the falling we edge.
Figure 4.4 shows read and write operations for a synchronous K × N RAM. The read
operation for a synchronous RAM is the same as for a synchronous ROM: the address input is
latched on the rising clock edge and output data is available either an access time later (unregistered
dout) or a Tcq time after the next rising clock edge (registered dout). In Fig. 4.4, clock cycles four
and five demonstrate write operations to the RAM. The addr, din, and we inputs are latched on
the rising clock edge; a logic one value on the we signal indicates a write operation. Location i is
written with the value 78 (din bus value) in clock cycle four; observe that the unregistered dout
bus reflects this new value a tpd (at least TACCESS, and possibly longer depending on the memory)
after the rising clock edge of clock cycle four. Location j is written in clock five with the value 13.
The din bus value does not affect memory operation when we is negated.
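A corresponding behavioral sketch of the synchronous RAM write (again Python with invented names): addr, din, and we are treated as latched at the call, which models one rising clock edge, and only the unregistered dout is modeled.

```python
# Behavioral sketch of the synchronous RAM write of Fig. 4.4: a latched
# we = 1 performs the write, and the unregistered dout reflects the
# just-written value; din is ignored when we is negated.

class SyncRAM:
    def __init__(self, contents):
        self.mem = dict(contents)

    def rising_edge(self, addr, din, we):
        if we:
            self.mem[addr] = din   # write when the latched we is 1
        return self.mem[addr]      # unregistered dout for this cycle

ram = SyncRAM({5: 47})
print(ram.rising_edge(5, 78, we=1))   # 78: dout shows the just-written value
print(ram.rising_edge(5, 99, we=0))   # 78: din ignored when we is negated
```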
Synchronous RAMs are almost always preferred over asynchronous RAMs in designs in order
to avoid problems with timing uncertainty during write operations. Figure 4.5 shows a finite state
machine (FSM) connected to an asynchronous RAM, with the timing diagram illustrating a write
[Figure 4.5 residue: FSM driving an asynchronous RAM; waveforms show the we, din (data A, B, C), and addr (addr A, B, C) signals changing with timing uncertainty after each rising clock edge, and the gated we* signal formed from we and the clock.]
operation. New addr, din, and we values are provided by the FSM some delay after the rising
clock edge. This delay is dependent on how the signals are generated by the FSM (registered only, or
registered plus combinational encoding) as well as wiring delays between the FSM and the RAM.
Wire delay in FPGAs can be significant and can also vary significantly depending on the number
of programmable switches that a signal passes through between the blocks. This timing uncertainty
is problematic during a write operation as the input data and address values that are latched on the
falling edge of the write enable signal are unknown. This problem is sometimes attacked by AND’ing
the we signal from the FSM with the inverse of the clock signal and using this new signal (we*) as
the RAM we. However, this approach relies on the assumption that the address and data input bus
values have a longer delay than we*, which is an assumption that may not be true and whose timing
may be violated if routing delays change between the FSM block and RAM.
The timing problems in Fig. 4.5 can be avoided by using a synchronous RAM, as shown
in Fig. 4.6. The data/addr/we signals to the synchronous RAM only have to satisfy tsu and thd
relative to the rising clock edge. The timing uncertainty for these signals can be an issue for thd, but
tpd of the data/addr/we signals after the rising clock edge is typically much larger than the RAM
thd, which is either zero or very small. The astute reader may observe that because a synchronous
RAM is an asynchronous RAM with registered inputs, the race condition between the addr/din
signals and the we signal is simply moved inside the synchronous RAM block. This is true, but it is
the responsibility of the synchronous RAM designer to solve this timing problem, and it is not an
issue for a designer who wishes to use synchronous RAM blocks since correct operation is guaranteed
as long as the input tsu and thd are met.
[Figure 4.6 residue: FSM driving a synchronous RAM; the data/addr/we inputs must only satisfy the setup (Tsu) and hold (Thd) times of the synchronous RAM, so the timing uncertainty of these signals is not an issue.]
[Figure residue: requirements for the example design, (a) initialize memory locations starting at a location P, and (b) be able to sum N memory locations starting at a location P, i.e., result = sum of M[i] for i = P to P+N-1. Both N and P are variable.]
• Initialization: the datapath initializes the RAM's contents starting at a location P. Both P
and initialization data are provided from an external input data bus.
• Computation: the datapath sums the contents of N locations, starting at location P. Both N
and P are specified by an external input data bus, and the result is given on an external
output data bus.
[Figure 4.8 residue: initialization cycle timing; din carries the start address P followed by data destined for M[P], M[P+1], M[P+2], M[P+3], M[P+4] in successive clocks. clk, start, mode, din are all inputs.]
[Figure 4.9 residue: computation cycle timing; din carries the start address P and then N, the number of locations, and the result later appears on dout with ordy asserted. clk, start, mode, din are inputs; dout, ordy are outputs. result = sum of M[i] for i = P to P+N-1.]
Datapath operation is controlled by assertion of a start input, with a mode input deter-
mining if initialization or computation is performed.
The cycle timing specification for the initialization operation is shown in Fig. 4.8. The com-
bination of start = 1 and mode = 1 causes the initialization operation to begin. The starting
address P for the initialization operation is provided on the din input bus in the clock cycle following
start assertion. Memory locations M[P], M[P+1], M[P+2], etc., are written in successive clock
cycles with data provided on din; locations are written as long as start is asserted (Figure 4.8
shows writes to only four locations; more locations could have been written). The negation of start
signals the end of the initialization operation.
Figure 4.9 gives the computation mode timing specification. The start address (P) and num-
ber of locations to sum (N) are provided in the first two clock cycles after start assertion with
mode = 0. At some later time, the output ready (ordy) output is asserted by the datapath when the
result is available on the dout data bus. The number of clock cycles required for the computation
is implementation dependent.
[Figure 4.10 residue: datapath with an address counter (ld, inc inputs; width w = log2(K)), a computation counter (ld, dec inputs and a zero? output), a synchronous K x N RAM (addr, din, we, dout), an adder plus an output register with synchronous clear forming an accumulator that drives dout, and a set/reset flip-flop generating ordy. The FSM takes start, mode, and zero as inputs and drives en_cc, ld_cc, en_ac, ld_ac, we, ld_r, clr_r, set_ordy, and clr_ordy.]
A datapath (Fig. 4.10) and finite state machine (ASM chart shown in Fig. 4.11) perform
the required operations of initialization and computation. The datapath particulars are:
• The address counter provides the RAM address; it is used to sequentially access memory
locations during both initialization and computation operations. The counter is loaded with
P at the start of both operation modes, and has an increment-by-one functionality.
• The computation counter tracks the number of locations remaining to be summed during
the computation operation and is loaded with N at the beginning of this operation.
The computation operation is halted when the count value reaches zero. The counter has a
decrement-by-one functionality.
• The adder coupled with an output register provides an accumulator functionality, that is,
successive additions add the register value to the contents of the currently accessed memory
location. The register has a synchronous clear function since the register value must be zero
for the first addition. The dout bus is the accumulator output.
• A synchronous K × N RAM is used as the embedded memory block.
• A set/reset flip-flop (SRFF) is used to implement the output ready (ordy) signal; an SRFF
is useful when a signal must be asserted for several clock cycles.
The FSM sequences the actions on the datapath according to the ASM chart given in Fig. 4.11.
State S0 waits for start assertion, and then branches to the first states of the initialization operation
or computation operation based on the mode input.
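The intended result of the two operation modes can be sketched behaviorally; the Python below (function names invented, FSM states and cycle-by-cycle timing abstracted away) captures only what each mode is supposed to compute.

```python
# Behavioral sketch of the two operation modes of Figs. 4.8 and 4.9:
# initialize memory starting at P, then sum N locations starting at P.

def initialize(mem, p, data):
    for i, d in enumerate(data):
        mem[p + i] = d           # successive writes to M[P], M[P+1], ...

def compute(mem, p, n):
    result = 0
    for i in range(p, p + n):    # result = sum of M[i] for i = P .. P+N-1
        result += mem[i]
    return result

mem = {}
initialize(mem, 4, [6, 3, 11, 48, 20])   # sample data of Fig. 4.12
print(compute(mem, 5, 2))                # 14 = M[5] + M[6] = 3 + 11
```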
[Figure 4.11 residue: (a) ASM chart for the memory summing operation. State S0 waits for start assertion, then branches on mode to S1_i (initialization: ld_ac, load address counter from DIN) or S1_c (computation: clr_r, ld_ac, clr_ordy; clear accumulator, load address counter from DIN, clear output ready). The computation path loads the computation counter from DIN in S2_c (ld_cc), loops until the computation counter's zero? output indicates all values are summed, and then asserts set_ordy (set output ready). (b) An intermediate state is needed to correct the problem of summing the first memory location twice.]
The initialization operation is straightforward. The first state S1_i loads the starting address
into the address counter by asserting the address counter's ld input. The second state S2_i writes
data values in the RAM by asserting the RAM's write enable; the input data is provided on the
din data bus. The address counter is incremented in S2_i by assertion of the address counter's
inc input. State S2_i returns to state S0 when start is negated. Fig. 4.12 is a timing diagram
for the initialization operation with example data, and contains both external and internal signals.
Data is written to locations four through eight on the leading rising edges of clocks four through
eight. Observe that even though start is negated in clock cycle seven, the data in this clock cycle
is written to RAM as specified in Fig. 4.8.
Two versions of the computation operation are provided: an incorrect version of three states
(S1_c, S2_c, and S3_c) and a correct version of four states (S1_c, S2_c, S2b_c, and S3_c). The incorrect
version appears to be a straightforward implementation of the computation operation of Fig. 4.9 in
that the starting address and locations to be summed are captured in states S1_c and S2_c, with state
[Figure 4.12 residue: waveforms over clocks 1 through 8; start is asserted with mode = 1 to initialize RAM locations, din carries the start address 4 followed by data 6, 3, 11, 48, 20, the address counter steps 4, 5, 6, 7, 8, and the RAM we is asserted for each write, producing M[4]=6, M[5]=3, M[6]=11, M[7]=48, M[8]=20.]
FIGURE 4.12: Initialization operation showing both external and internal signals for sample data.
S3_c is used to sum the memory contents. However, Fig. 4.13 illustrates the reason for the incorrect
behavior by attempting to sum two locations, starting at location five. In the first clock cycle of state
S3_c (clock 4), the memory dout bus contains M[5] = 3, the accumulator value is zero, and the adder
output is 3 + 0 = 3. The accumulator load signal is asserted in S3_c, so in clock five the accumulator
becomes three, and the address counter is incremented to location six. However, even though the
address counter value is now six, this value is not latched into the RAM until the next clock cycle,
and thus the RAM dout remains at M[5] = 3 for clock five. This means that at the end of clock
five, the new value loaded into the accumulator is 3 + 3 = 6, causing the first location to be included
twice in the accumulated sum. The next clock adds M[6] = 11 to the accumulated 6 for a final result of 17, which
is incorrect. The correct result should be M[5] + M[6] = 3 + 11 = 14.
There are multiple ways to correct the errant behavior of Fig. 4.13; one solution is to not
assert the accumulator load line in the first clock cycle after state S2_c. This is done by inserting
a new state named S2b_c between states S2_c and S3_c; state S2b_c increments the address counter
and decrements the computation counter in the same way as state S3_c, but it does not load the
accumulator register. Fig. 4.14 shows the datapath/FSM operation with the new S2b_c state producing the correct
sum of M[5] + M[6] = 3 + 11 = 14.
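The double-count and its fix can be reproduced with a small cycle-level Python model in which the RAM dout lags the address counter by one clock (names invented; the n + 1 loop clocks approximate the counter-expiry timing):

```python
# Cycle-level sketch of the computation loop of Figs. 4.13 and 4.14.
# Key detail: the synchronous RAM's dout reflects the address latched one
# clock earlier, so it lags the address counter by one clock cycle.

def summing_loop(mem, p, n, use_s2b_c):
    acc = 0
    prev_addr = p            # address latched before the first loop clock
    addr = p
    for k in range(n + 1):   # n + 1 loop clocks before the counter expires
        dout = mem[prev_addr]
        # the incorrect FSM loads the accumulator on every loop clock;
        # the corrected FSM (state S2b_c) suppresses the first load
        if not (use_s2b_c and k == 0):
            acc += dout
        prev_addr, addr = addr, addr + 1
    return acc

mem = {4: 6, 5: 3, 6: 11, 7: 48, 8: 20}          # sample data of Fig. 4.12
print(summing_loop(mem, 5, 2, use_s2b_c=False))  # 17: M[5] is counted twice
print(summing_loop(mem, 5, 2, use_s2b_c=True))   # 14 = M[5] + M[6]
```

Tracing the incorrect case reproduces the text's sequence exactly: dout is M[5], M[5], M[6] over the three loop clocks, accumulating 3, 6, 17.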
[Figure 4.13 residue: waveforms for the incorrect computation; start is asserted with mode = 0, din carries start address 5 and # of locations 2, the address counter steps 5, 6, 7, 8 while ld_ac, en_ac, en_cc, and ld_r sequence the counter loads, increments, and accumulator (dout) loads, and set_ordy asserts ordy at the end.]
Figure 4.15 shows the conceptual operation for an eight-entry FIFO. A FIFO has a write port for placing data into the FIFO, and
a read port for removing data from the FIFO. Figure 4.15a shows an empty eight-element FIFO.
A write operation in Fig. 4.15b places dataA into the buffer, followed by a second write of dataB in
Fig. 4.15c. Read operations in Fig. 4.15d and Fig. 4.15e first remove dataA and then dataB, thus
illustrating the FIFO nature of the buffer. Figure 4.15f shows a full FIFO after eight successive write
operations.
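This conceptual behavior can be sketched with a Python deque; the class name and return conventions are invented, and the full/empty checks loosely mirror the w_full and w_empty status outputs described later for Fig. 4.17.

```python
# Conceptual model of the eight-entry FIFO of Fig. 4.15.

from collections import deque

class Fifo:
    def __init__(self, depth=8):
        self.buf = deque()
        self.depth = depth

    def write(self, data):
        if len(self.buf) == self.depth:
            return False             # FIFO full: write request refused
        self.buf.append(data)
        return True

    def read(self):
        if not self.buf:
            return None              # FIFO empty: nothing to remove
        return self.buf.popleft()    # first value written is first read out

f = Fifo()
f.write("dataA")
f.write("dataB")
print(f.read())   # dataA
print(f.read())   # dataB
```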
Figure 4.16 provides two sample uses of a FIFO in a digital system. One common use is for
buffering data from an external input channel as shown in Fig. 4.16a. Many input channels have the
characteristic that data arrives in irregular bursts, and the individual data elements cannot always be
processed by the digital system as they arrive, since the system may be busy with other tasks. The
[Figure 4.14 residue: waveforms for the corrected computation over clocks 1 through 8; start is asserted with mode = 0, din carries start address 5 and # of locations 2, the state sequence is S0, S1_c, S2_c, S2b_c (the added state), S3_c, S0, and the address counter steps 5, 6, 7, 8. The accumulator load is not done in S2b_c, and set_ordy asserts ordy when the result is loaded.]
FIFO holds the data until the system is ready for input processing. If handshaking signals are not
used to regulate the data flow of the input channel, then the FIFO size is chosen to accommodate
the maximum expected number of data elements to arrive between input processing tasks by the
digital system.
Another typical FIFO usage is for data transfer between cooperating FSM/datapaths operating
in different clock domains as shown in Fig. 4.16b. Data is written to the FIFO synchronized by clock
domain A, and removed from the FIFO synchronized by clock domain B. Data transfer between
two independent clock domains is an asynchronous transfer, that is, data can arrive at any time and
is not synchronized to the receiver’s active clock edge. This uncertainty in data arrival can cause
[Figure 4.15 residue: each panel shows the write port and read port of an eight-entry FIFO. (a) empty FIFO, all entries FREE; (b) after the write of dataA; (c) after the write of dataB; (d) after the read of dataA; further panels show the remaining reads and a full FIFO holding entries dataC through dataJ.]
tsu and thd violations in the receiver’s input register, resulting in a corrupted data transfer. A FIFO
that supports independent read and write clocks is one method for solving this asynchronous data
transfer problem.
The design of a FIFO with independent read/write clocks is challenging from a timing per-
spective, and is beyond the scope of this text, but FPGA vendors provide these as ready-to-use
embedded memory blocks. Figure 4.17 shows a sample interface for a FIFO with independent
read/write clocks. The write port consists of the write clock (w_clk), input data bus (din), write
request input (w_req), empty status output (w_empty), and full status output (w_full). Data is
written to the FIFO on the active edge of w_clk when the w_req input is asserted. The w_empty
output is asserted when the FIFO is empty, and w_full is asserted when the FIFO is full, with
transitions synchronized to the write clock. The read port consists of the read clock (r_clk), output
data bus (dout), read request input (r_req), empty status output (r_empty), and full status
output (r_full). Data is read from the FIFO on the active edge of r_clk when the r_req input
is asserted. The timing diagram in Fig. 4.17 shows dataA, dataB written to an empty FIFO in clocks
three and four, and data read from the FIFO in clocks five and six. Observe that the empty sta-
tus output is negated after the write of dataA to the FIFO, and is asserted after the read of dataB
from the FIFO. For simplicity, the timing diagram assumes a common clock for the read and write
ports. It must be noted that the timing details of FIFOs with independent read/write clocks can
vary significantly from one FPGA vendor to another, and even between FPGA families of the same
FPGA vendor. Thus, Fig. 4.17 is provided for example purposes only; the reader must consult the
data sheets for FIFO blocks offered by a particular FPGA vendor when incorporating a FIFO into
a digital system.
Some FIFO blocks have additional status signals named almost empty and
almost full, with configurable thresholds for these conditions. These signals help regulate
the data flow between the writing and reading digital systems. Two error
conditions associated with FIFOs are:
• Writing to a full FIFO (input data is typically discarded). This condition is avoided by writing
to the FIFO only when the full signal is negated.
• Reading from an empty FIFO (output data is unknown). This condition is avoided by reading
from the FIFO only when the empty signal is negated.
In some FIFO implementations, the triggering of these error conditions may corrupt the inter-
nal FIFO status and produce erratic subsequent behavior, and error status signals (read error,
write error) may be provided for system monitoring.
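These two guard conditions can be illustrated with a small software model. The following Python sketch (illustrative only, not a hardware implementation; the class and method names are invented for this example) refuses a write when full and a read when empty, and preserves first-in, first-out ordering:

```python
class Fifo:
    """Minimal software model of a depth-limited FIFO with full/empty
    status flags (a behavioral sketch, not a hardware design)."""
    def __init__(self, depth):
        self.depth = depth
        self.data = []

    @property
    def full(self):
        return len(self.data) == self.depth

    @property
    def empty(self):
        return len(self.data) == 0

    def write(self, value):
        # Guard: write only when full is negated, else the datum is lost.
        if self.full:
            raise RuntimeError("write error: FIFO full")
        self.data.append(value)

    def read(self):
        # Guard: read only when empty is negated, else the output is unknown.
        if self.empty:
            raise RuntimeError("read error: FIFO empty")
        return self.data.pop(0)

fifo = Fifo(depth=8)
fifo.write("dataA")
fifo.write("dataB")
assert fifo.read() == "dataA"   # first-in, first-out ordering
assert fifo.read() == "dataB"
assert fifo.empty
```

The same ordering matches the write/read sequence of Fig. 4.15: dataA, written first, is also removed first.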
The digital system designer using a dual-port memory is responsible for creating a system that
avoids forbidden simultaneous operations. This usually involves external handshaking signals that
coordinate access to the memory (the FIFO's empty/full signals fulfill this purpose in a FIFO design).
Figure 4.19 shows two datapaths using a true dual-port memory and two handshaking signals, request
(req) and acknowledge (ack), to send data from datapath A to datapath B. Figure 4.19a uses a
two-phase protocol for accomplishing the data transfer; a change in the req signal indicates data
availability from datapath A, with a corresponding change in the ack signal acknowledging receipt of
the data by datapath B. In a two-phase protocol, data is transferred on each transition of
the req line. A two-phase protocol requires changes in the req line to be detected, and is sometimes
referred to as an edge-triggered or transition-sensitive protocol.
A four-phase protocol is used in Fig. 4.19b to accomplish the data transfer; a logic one on req
indicates data availability, while a logic one on ack indicates data acceptance. Both the ack
and req signals are negated (logic zero) before a new data transfer begins. A four-phase protocol
is referred to as a level-sensitive protocol because the logic states of the handshaking signals indicate
data availability and data acceptance.
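The four phases can be traced with a short simulation. The Python sketch below (an illustrative model, not from the text; the function name is invented) walks req and ack through the four phases for each datum and records the handshake states:

```python
def four_phase_transfer(values):
    """Step-by-step trace of a four-phase (level-sensitive) handshake.
    Each datum walks through the four phases: req=1 (data valid),
    ack=1 (data accepted), req=0, then ack=0 (both negated before reuse)."""
    log, received = [], []
    req = ack = 0
    for v in values:
        bus = v
        req = 1                   # phase 1: sender asserts req, data valid on bus
        log.append((req, ack))
        received.append(bus)
        ack = 1                   # phase 2: receiver latches data, asserts ack
        log.append((req, ack))
        req = 0                   # phase 3: sender negates req
        log.append((req, ack))
        ack = 0                   # phase 4: receiver negates ack; ready for reuse
        log.append((req, ack))
    return received, log

received, log = four_phase_transfer([5, 7])
assert received == [5, 7]
# req/ack both return to zero between transfers (level-sensitive protocol)
assert log[3] == (0, 0) and log[7] == (0, 0)
```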
Both four-phase and two-phase protocols can be readily expressed in modern HDLs. Conventional
pros/cons are claimed for two-phase versus four-phase protocols; however, all of these tradeoffs are
technology and design dependent, with designer experience determining the protocol choice for a
particular design.
The reader may question the necessity for using req/ack signals and instead want to indi-
cate data availability by having datapath A write a nonzero value to a specified memory location
being monitored by datapath B. This works only if the dual-port memory supports a simultaneous
read during write operation to the same location, which is not the case for most true dual-port
memories. It should be noted that if the two datapaths and the dual-port memory all share the same
clock, then a simultaneous read during write operation to the same location is typically supported.
The advantages of a dual-port memory over a FIFO are that the dual-port allows bi-directional
transfers between two datapaths and provides greater flexibility in data access. The disadvantage is
that handshaking signals for avoiding forbidden simultaneous accesses may need to be provided by
the designer.
A synchronizer is needed for any asynchronous input to a synchronous system. The reader is
referred to [1] for a more complete discussion of metastability and synchronizer design.
In Fig. 4.19, the DFF clocked by clk b on the ack output of datapath B and the DFF
clocked by clk a on the req output of datapath A are included to ensure that the ack and req
outputs are glitch-free, that is, they only experience a single high-to-low or low-to-high transition
during any clock period. These DFFs can be removed if these signals are already registered within
the datapath. An FSM output signal that is generated by combinational gating using an FSM’s state
registers may experience glitches due to different delay paths through the logic gates. Because the
req and ack outputs are asynchronous inputs to the receiving datapaths, these glitches could be
treated as valid inputs, causing incorrect operation. If the two datapaths shared a common clock,
then glitch-free outputs would not be needed, because it is assumed that the outputs would be stable
(satisfy tsu/thd) by the time the active clock edge occurred.
4.7 SUMMARY
This chapter has introduced the reader to commonly available embedded memory blocks found in
modern FPGAs. Synchronous RAM blocks are preferred over asynchronous RAM blocks because
timing constraints for the designer are simplified when using synchronous RAM. Typical usage of
RAM blocks requires counters to drive address lines, adding an extra clock cycle of latency from
assertion of counter input to RAM output. FIFOs and dual-ports are useful for data exchange
between datapaths that use different clock domains.
[Fig. 4.20: FSM/Datapath A (clock domain A, clk_a) and FSM/Datapath B (clock domain B, clk_b) exchange N-bit values over din/dout buses using two handshaking pairs, req_1/ack_1 and req_2/ack_2. Each handshaking signal is registered at its source and passes through a two-DFF synchronizer in the receiving clock domain; each datapath adds "1" to the value it receives before loading its register (Reg A or Reg B).]
c. Create a read-side FSM that removes elements from the FIFO whenever the empty
signal is negated; remove data as fast as possible from the FIFO (one clock per datum).
Ensure that your FSM does not attempt to read from an empty FIFO.
d. Change the read/write clocks such that the write clock has a 1/3 longer period than the
read clock. Verify that your design performs as expected.
6. This problem refers to Fig. 4.20. Using four-phase handshaking and with the datapath A clock
period 2/3 that of datapath B, create FSMs for datapaths A/B that accomplish the following
(steps a through c are FSM A operation; steps d through f are FSM B operation).
a. After reset, FSM A initializes Register A to zero.
b. FSM A then transmits the Reg A value to FSM B using the handshaking pair
req_1/ack_1 and its dout bus.
c. FSM A then waits for a value to be transmitted back from FSM B on its din bus using
the handshaking pair req_2/ack_2. This new value is incremented by one via the adder
and loaded into Reg A. (At this point, FSM A loops through steps b and c, resulting in a
continuously incrementing value being transmitted between FSM A and FSM B.)
The x value represents the current input sample value, x@1 the input sample value from
the previous sample period, x@2 the input sample value from two sample periods previously, etc.
The filter coefficients a0, a1, ..., aN determine the filter's performance characteristics, such as low
pass, high pass, or band pass. A Java applet that produces FIR filter coefficients given a filter
specification is available at [2]. Typical results from the applet are given in Table 4.1.
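The filter computation itself is a sum of products of coefficients and delayed samples. As a point of reference, a direct (unoptimized) software version of result = x*a0 + x@1*a1 + ... + x@N*aN might look like the following Python sketch (the function name is invented for illustration):

```python
def fir(samples, coeffs):
    """Direct-form FIR: y[n] = a0*x[n] + a1*x[n-1] + ... + aN*x[n-N].
    Samples older than the start of the stream are taken as zero."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, a in enumerate(coeffs):
            if n - k >= 0:
                acc += a * samples[n - k]   # a_k multiplies x@k
        out.append(acc)
    return out

# A moving-average filter (all coefficients equal) is the simplest FIR low pass.
y = fir([1.0, 1.0, 1.0, 1.0], [0.25, 0.25, 0.25, 0.25])
assert y == [0.25, 0.5, 0.75, 1.0]
```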
This project’s task is to build a fixed-point, programmable FIR filter that allows the filter
order and coefficients to be dynamically loaded. As with the memory sum example of Section 4.3,
the filter has an initialization mode in which the filter order and coefficients are loaded, and a
[Table 4.1: Typical applet results. Rectangular window FIR filter; Filter type: Low Pass (LP); Order: 20.]
computation mode that accepts new input samples and produces a new output value for each
input sample. Figure 4.21 gives the cycle specification for initialization mode, which is entered
when start is asserted and mode is a logic one. The start input is negated when the last filter
coefficient is entered.
In Fig. 4.22, computation mode is entered when start is asserted and mode is logic zero.
The filter then waits for assertion of input ready (irdy), which indicates that a new sample value is
present on the din input data bus. The filter asserts output ready (ordy) when the filter computation
is finished and the dout data bus contains the final result. The filter then returns to waiting for the
next assertion of irdy. Computation mode is exited when start is negated.
[Fig. 4.21: initialization-mode waveform. Successive din values are the filter order N followed by the coefficients a0, a1, a2, a3, ..., aN; din is don't-care (XX) otherwise.]
[Fig. 4.22: computation-mode waveform. irdy marks each new current sample value x on din, and ordy marks the final result on dout. clk, start, mode, irdy, and din are all inputs; dout and ordy are outputs; result = x*a0 + x@1*a1 + ... + x@N*aN.]
Because the datapath contains only one multiplier and one adder, an FIR calculation for a
new input sample requires at least N + 1 clocks. The multiplier is a signed multiplier, which is
generally available as a building block from FPGA vendors. It was mentioned in Chapter 3 that a
K-bit × K-bit multiplier produces a 2K-bit result. For unsigned fixed-point numbers mapped to
the range (1.0 – 0.0], it was noted that the lower K bits of the 2K-bit product could be discarded,
since these represented the K least significant bits, and the datapath size could be kept at K bits.
However, what bits should be discarded for a signed K-bit × K-bit multiplier using numbers
in the range (+1.0 to −1.0]? One may intuit that it would also be the least significant K bits, but the
true answer is somewhat more complex. To illustrate, examine Eq. 4.2, which shows the multiplication
of +0.5 * −0.5:

+0.5 * −0.5 = −0.25    (4.2)

The numbers +0.5, −0.5 mapped to 12-bit two's complement are +0.5 * 2048 = 1024 =
0x400 and −0.5 * 2048 = −1024 = 0xC00. The signed binary multiplication of Eq. 4.2 produces:

0x400 * 0xC00 = 0xF00000    (4.3)

Dropping the least significant 12 bits (the last three hex digits), the value 0xF00 is equal to
−256 as a 12-bit two's complement integer. Mapping −256 to the range (+1.0 to −1.0] produces:

−256 / 2048 = −0.125    (4.4)

which is one-half the expected value of −0.25. Equation 4.5 shows the reason for this by examining
the number range of the multiplication result:

(+1.0 to −1.0] * (+1.0 to −1.0] = [+1.0 to −1.0)    (4.5)
The multiplier output range has to be extended by an additional integer bit because the value
+ 1.0 is now included in the output range (because −1.0 * −1.0 = + 1.0). This means that the upper
two bits of the 24-bit product are dedicated to the sign and integer portion of the result. This also
has the unfortunate result that the output number range of (+2.0, − 2.0] is now different from the
input number range of (+ 1.0, − 1.0]. The extra bit needed for the integer portion of the product to
encode + 1.0 is wasted if the multiplier is never given the inputs of −1.0 * − 1.0. Because one of the
multiplier inputs is always a coefficient, the coefficient choices can be restricted to exclude −1.0.
This means that the actual values produced by the multiplier fall in the range (+1.0 to −1.0],
and thus the most significant bit of the multiplier output can be discarded. Note that discarding the most
significant bit is the same as shifting the multiplier output to the left by one, which is multiplication
by two. Multiplying the result of Eq. 4.4 by two gives the expected result: −0.125 * 2 = −0.25.
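The scaling argument of Eqs. 4.2 through 4.4 can be checked numerically. The Python sketch below (illustrative only; FRAC and to_fixed are names invented here) maps ±0.5 to 12-bit two's complement, forms the 24-bit product, and shows that naively dropping the low 12 bits halves the result, while the extra left shift restores the scale:

```python
FRAC = 11  # 12-bit two's complement: 1 sign bit, 11 fraction bits

def to_fixed(x):
    # Map a value in (+1.0 to -1.0] onto a 12-bit two's complement integer.
    return int(round(x * (1 << FRAC)))

assert to_fixed(0.5) == 1024 and to_fixed(-0.5) == -1024
prod = to_fixed(0.5) * to_fixed(-0.5)      # signed 24-bit product

# Naively dropping the low 12 bits keeps 1 sign + 1 integer + 10 fraction
# bits, so the recovered value comes out scaled by one half (Eq. 4.4):
naive = (prod >> 12) / (1 << FRAC)
assert naive == -0.125

# Discarding the redundant extra integer bit instead, i.e. shifting left by
# one before dropping the low 12 bits, restores the expected value:
fixed = (prod >> 11) / (1 << FRAC)
assert fixed == -0.25
```

Python's arithmetic right shift on negative integers matches the sign-extending shift a hardware datapath would use, so the comparison is exact here.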
The datapath of Fig. 4.24 shows 15 bits of the 24-bit multiplier product being retained (nine
bits are discarded). The bits discarded from the 24-bit product are the most significant bit, and the
eight least significant bits. This gives three extra least significant bits for rounding purposes as the
FIR sum is being accumulated. Only the most significant 12-bits of the accumulator register are
used for the dout output result.
The adder shown in the datapath of Fig. 4.24 is a two’s complement saturating adder, which
saturates the output result to the maximum positive or maximum negative value if two’s complement
overflow occurs. Fig. 4.25 shows a conceptual implementation of a two's complement saturating
adder (this logic works, but more efficient implementations exist).
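The saturation behavior can be sketched as a behavioral software model (Python, not the gate-level logic of Fig. 4.25; the function name is invented for this example):

```python
def sat_add12(a, b):
    """Two's complement saturating add on 12-bit signed values.
    On overflow the result clips to the nearest representable extreme
    instead of wrapping around."""
    MAX, MIN = (1 << 11) - 1, -(1 << 11)   # +2047 and -2048
    s = a + b
    if s > MAX:
        return MAX    # positive overflow: saturate to maximum positive
    if s < MIN:
        return MIN    # negative overflow: saturate to maximum negative
    return s

assert sat_add12(2000, 100) == 2047    # clips rather than wrapping negative
assert sat_add12(-2000, -100) == -2048
assert sat_add12(1000, -600) == 400    # no overflow: ordinary sum
```

A hardware implementation would instead detect overflow from the operand and result sign bits, since it cannot rely on unbounded integers as this model does.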
A good functional check is to provide a digitized sine wave of a particular frequency and observe the
output to determine if the filter function (low-pass, high-pass, band-pass) is accomplished. The
pseudo code in Listing 1 produces input values for one cycle of a sine wave of a given frequency f
sampled at a frequency S (the digital filter applet of [2] assumes a sample frequency of 8000 Hz).
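Listing 1 itself is not reproduced in this excerpt; the following Python sketch is an assumed equivalent that generates one cycle of a sine wave of frequency f sampled at S Hz, scaled to the 12-bit two's complement input range:

```python
import math

def sine_samples(f, S, scale=2047):
    """One cycle of a sine wave of frequency f sampled at S Hz, scaled to
    12-bit two's complement values (an assumed stand-in for Listing 1)."""
    n = round(S / f)                        # samples in one cycle
    return [int(round(scale * math.sin(2 * math.pi * k / n)))
            for k in range(n)]

x = sine_samples(f=1000, S=8000)            # 8 samples per cycle at 8000 Hz
assert len(x) == 8
assert x[0] == 0 and x[2] == 2047           # peak at one quarter cycle
assert x[6] == -2047                        # trough at three quarters cycle
```

Feeding such samples to a low-pass filter at a frequency above the cutoff should visibly attenuate the output amplitude.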
• The coefficients of an N-order FIR filter are symmetric, as seen in Table 4.1: a0 = aN, a1 =
a(N − 1), etc. The number of memory locations used in the coefficient RAM can therefore be
reduced from N + 1 to (N/2) + 1.
• The number of clock cycles required for producing the output given an input sample can
be reduced by distributing the input samples and coefficients among multiple RAMs and
including more multipliers and adders. This is the hardware resource versus computation
time tradeoff examined in Chapter 3.
• The maximum clock period can be decreased at the cost of greater clock cycle latency by using
the registered dout output of the RAM blocks and by placing a pipeline register between
the multiplier and adder.
• Some FPGA vendors offer embedded RAM blocks that have built-in shift register function-
ality as required for digital filter implementations and could replace the counter logic that is
currently used to access the RAMs.
• Some FPGA vendors offer library support for floating-point execution units; change the
datapath from 12-bit fixed-point to single-precision floating-point.
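The first enhancement above, exploiting coefficient symmetry, can be sketched in software. The Python function below (name and interface invented for illustration) pre-adds the two samples that share each coefficient, so only (N/2) + 1 multiplies are needed per output:

```python
def folded_fir_tap_sum(x_window, coeffs):
    """Exploit FIR coefficient symmetry (a_k == a_{N-k}): pre-add the two
    sample values that share a coefficient, halving the multiply count.
    x_window holds x, x@1, ..., x@N for the current output sample."""
    N = len(coeffs) - 1
    acc = 0.0
    for k in range((N // 2) + 1):
        if k == N - k:
            acc += coeffs[k] * x_window[k]                    # lone middle tap
        else:
            acc += coeffs[k] * (x_window[k] + x_window[N - k])
    return acc

coeffs = [0.1, 0.2, 0.4, 0.2, 0.1]      # symmetric, order N = 4
window = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = sum(a * x for a, x in zip(coeffs, window))
assert abs(folded_fir_tap_sum(window, coeffs) - direct) < 1e-12
```

In hardware the pre-addition would be an extra adder ahead of the multiplier, paired with the halved coefficient RAM described in the bullet above.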
[Figure: Input Waveform. One cycle of the digitized sine-wave test input, amplitude -1 to +1 versus time.]
[Figure: Output Waveform. The corresponding filtered output, amplitude -1 to +1 versus time.]
4.13 REFERENCES
[1] R. Ginosar, "Fourteen ways to fool your synchronizer," Proc. of the Ninth International Symposium
on Asynchronous Circuits and Systems, 12–15 May 2003, pp. 89–96.
[2] FIR Digital Filter Design Applet, Online as of August 2007: http://www.dsptutor.freeuk.com/
FIRFilterDesign/FIRFilterDesign.html.
Author Biography
Justin Stanford Davis received his Ph.D. in Electrical Engineering from the Georgia Institute of
Technology in August 2003, as well as his M.S. and B.E.E. degrees in 1999 and 1997. During
the summers of 1998 and 1999, he worked at Hewlett-Packard (now Agilent Technologies). In
fall of 2003 he joined the faculty in the Department of Electrical Engineering at Mississippi State
University as an Assistant Professor. In the summer of 2007 he joined Raytheon Missile Systems as
a Senior Electrical Engineer. His research interests include digital design for high-speed systems,
SoCs, and SoPs, as well as signal integrity and systems engineering.
Robert B. Reese received the B.S. degree from Louisiana Tech University, Ruston, in 1979 and
the M.S. and Ph.D. degrees from Texas A&M University, College Station, in 1982 and 1985,
respectively, all in electrical engineering. He served as a Member of the Technical Staff of the
Microelectronics and Computer Technology Corporation (MCC), Austin, TX, from 1985 to 1988.
Since 1988, he has been with the Department of Electrical and Computer Engineering at Mississippi
State University, Mississippi State, where he is an Associate Professor. Courses that he teaches include
VLSI systems and Digital System design. His research interests include self-timed digital systems
and computer architecture.