Verilog Text Book
Digital Logic Design Using Verilog
Coding and RTL Synthesis
Vaibbhav Taraate
Semiconductor Training @ Rs.1
Pune, Maharashtra
India
Preface
This century is an era of miniaturization and high-speed chips, and complex ASICs
are designed in less time than ever before. The evolution of technology since
1990 has opened up a new paradigm for ASIC designers. Customers expect speedy
delivery of ASIC products, which puts considerable pressure on design teams to
deliver high-performance designs using fewer resources.
The evolution of process node technology over the past decade has driven real
change in the semiconductor industry. Many new design techniques and flows have
evolved and stabilized during this period, and EDA tool companies help designers
complete their designs in a shorter time span.
In today’s industrial scenario, designers no longer spend much time drawing
schematics to design digital logic circuits. EDA tools have enabled the best
design practices through hardware description languages such as VHDL and
Verilog. Synthesis tools are used primarily to convert the HDL into the
equivalent logic structure or gate-level netlist. The latest EDA tool features have
also improved the productivity and efficiency of design teams.
The book is organized into three sections. The first section consists of Chaps. 1–9
and describes digital logic design and synthesizable Verilog RTL.
Section I is organized so that the reader gains a better understanding of the
basics of digital logic and synthesizable RTL. This section helps the reader
understand Verilog HDL constructs, hardware inference, simulation concepts, and
design guidelines for simple to complex designs. A few practical scenarios are
included in this section to aid understanding.
Chapter 1 discusses the evolution of logic design, logic design abstraction
levels, IC design methodologies and flow, Verilog module declaration, and
different design styles. This chapter also discusses the simulation and synthesis
flow for Verilog RTL and the key Verilog HDL features.
Chapters 2 and 3 describe combinational logic design and synthesizable RTL.
These chapters also focus on the practical issues and scenarios that arise while
designing combinational logic using Verilog RTL.
Chapter 4 discusses the key Verilog coding guidelines and the role of Verilog
in writing efficient RTL for combinational designs.
Chapter 5 discusses sequential logic design and covers most of the simple to
complex practical design scenarios. This chapter also deals with synthesizable
sequential design issues, timing diagrams, and simulation of the design.
Chapter 6 discusses the key Verilog coding guidelines and the role of Verilog
in writing efficient RTL for sequential designs.
Chapter 7 describes efficient RTL coding for a few complex designs and also
gives information about the synthesis results and the key practical scenarios
for these designs.
Chapter 8 deals with FSMs and the design of an efficient FSM using suitable
FSM encoding styles.
Chapter 9 describes simulation concepts and PLD-based design. This chapter
also describes the design guidelines to follow while using PLDs.
Section II consists of Chaps. 10–12 and mainly deals with logic synthesis,
static timing analysis, and constraining ASIC designs. This section is organized
so that the reader gains a better understanding of synthesizable Verilog RTL and
of constraining designs for given specifications. It also deals with static
timing analysis and the practical issues in improving design performance.
Chapter 10 discusses logic synthesis, the ASIC design flow, design constraints,
and the gate-level netlist.
Chapter 11 describes static timing analysis and the timing reports and analysis
for complex RTL designs. This chapter also deals with a few practical scenarios
in the design and with performance improvement techniques.
Chapter 12 discusses the different design constraints and how to tweak the
architecture, micro-architecture, and RTL to improve design performance. This
chapter also deals with DRC, optimization, and performance improvement
scenarios for a better understanding of the design constraints.
Section III consists of Chaps. 13–15 and mainly discusses advanced RTL design
concepts such as multiple clock domain designs, the need for synchronizers,
clock domain crossing, low power designs, and SOC-based designs and challenges.
Every chapter in this section discusses the key practical scenarios using
efficient Verilog RTL.
Chapter 13 describes multiple clock domain designs, synchronizers, and the need
for them. This chapter also focuses on synchronous and asynchronous FIFO
buffers and their RTL design using Verilog, and concludes with a case study.
Chapter 14 discusses most of the low power design techniques and the goals of
designers in implementing low power designs. This chapter also deals with the
low power design architecture and power sequencing for low power designs.
Chapter 15 focuses on real-life SOC-based designs and the role of Verilog in
implementing them.
The book contains many practical examples ranging from simple to complex logic
depth. These enable the reader to better understand how to code efficient RTL
using Verilog. Synthesizable designs and frequent issues in the RTL design life
cycle are organized in each section for better understanding.
This book is targeted at engineering students, inexperienced engineers, and
professionals who want to implement practical, synthesizable, efficient RTL
using Verilog.
Acknowledgments
This book was possible because of the help of many people. I truly appreciate their
direct and indirect help during the writing of this book. Among them, I am very
thankful to my dearest friend, Ishita Thaker (Ish), for encouraging me to write this
book. This book would not have been possible if my wife Somi had not reviewed the
book contents and grammatical mistakes.
Special thanks to my son Siddesh and my daughter Kajal for their ideas and
creative thoughts while creating the diagrams for this book. I truly
appreciate the sacrifices of Siddesh and Kajal.
Special thanks to all the students to whom I have taught the subject for more than
a decade. I also want to thank all my teachers for their valuable help during my
engineering and postgraduate studies at IIT Powai (Mumbai).
Special thanks to all the Springer staff, especially Swati Maherishi and Aparajita
Singh, for their good conversations, support, and encouragement.
Special thanks in advance to all the readers across the world for buying,
reading, and enjoying the book!
Contents
1 Introduction
1.1 Evolution of Logic Design
1.2 System and Logic Design Abstractions
1.3 Integrated Circuit Design and Methodologies
1.3.1 RTL Design
1.3.2 Functional Verification
1.3.3 Synthesis
1.3.4 Physical Design
1.4 Verilog HDL
1.5 Verilog Design Description
1.5.1 Structural Design
1.5.2 Behavior Design
1.5.3 Synthesizable RTL Design
1.6 Key Verilog Terminologies
1.6.1 Verilog Arithmetic Operators
1.6.2 Verilog Logical Operators
1.6.3 Verilog Equality and Inequality Operators
1.6.4 Verilog Sign Operators
1.6.5 Verilog Bitwise Operators
1.6.6 Verilog Relational Operators
1.6.7 Verilog Concatenation and Replication Operators
1.6.8 Verilog Reduction Operators
1.6.9 Verilog Shift Operators
1.7 Summary
2 Combinational Logic Design (Part I)
2.1 Introduction to Combinational Logic
2.2 Logic Gates and Synthesizable RTL
2.2.1 NOT or Invert Logic
2.2.2 Two-Input OR Logic
14.5 Low Power Design Architecture and UPF Case Study
14.5.1 Isolation Cells
14.5.2 Retention Cells
14.5.3 Level Shifters
14.5.4 Power Sequencing and Scheduling
14.6 Summary
References
15 System on Chip (SOC) Design
15.1 What is System on Chip (SOC)?
15.2 SOC Architecture
15.3 SOC Design Flow
15.3.1 IP Design and Reuse
15.3.2 SOC Design Considerations
15.3.3 Hardware Software Codesign
15.3.4 Interface Timings
15.3.5 EDA Tool and License Requirements
15.3.6 Developing the Required Prototyping Platform
15.3.7 Developing the Test Plan
15.3.8 Developing the Verification Environment
15.3.9 Prototyping Using FPGAs
15.3.10 ASIC Porting
15.4 SOC Design Challenges
15.5 Case Study
15.6 SOC Design Blocks
15.6.1 Microprocessors or Microcontrollers
15.6.2 Counters and Timers
15.6.3 General Purpose IO Block
15.6.4 Universal Asynchronous Receiver and Transmitter (UART)
15.6.5 Bus Arbitration Logic
15.7 Summary
Index
About the Author
Chapter 1
Introduction
Abstract This chapter gives an overview of the design abstraction levels and the
evolution of logic design from the perspective of system design. The chapter
focuses mainly on building familiarity with Verilog HDL, the different modeling
styles, and the Verilog operators. It is organized to cover basic through
practical scenarios in detail. All the Verilog operators are described with
meaningful examples for easy understanding.
Keywords RTL · IEEE 1364-2005 · Behavioral model · Structural model · Verilog ·
VHDL · Moore’s law · Concurrent · Sequential · Procedural blocks · Always ·
Four-value logic · Operators · Arithmetic · Shift · Logical · Bitwise · Concatenation ·
Case equality · Case inequality · Continuous assignments · Net · Variable · Data types
1.1 Evolution of Logic Design
In 1958, Jack Kilby, a young electrical engineer at Texas Instruments, figured
out how to place circuit elements (transistors, resistors, and capacitors) on a
small piece of germanium. But even before 1958, many revolutionary ideas had
been published and conceptualized.
Engineers working on complex VLSI-based ASIC chip designs can see for
themselves how accurate Moore’s prediction was. In the present decade, chip area
has shrunk considerably: the process technology node on which design houses and
foundries are working is 14 nm, and a chip packs billions of cells onto a small
silicon die. With the evolution in design and manufacturing technologies, most
designs are implemented using the Very High Speed Integrated Circuit Hardware
Description Language (VHSIC HDL, or VHDL) or using Verilog. This book focuses on
Verilog as the hardware description language. The evolution in the EDA industry
has opened up new, efficient pathways for design engineers to complete their
milestones in less time.
1.2 System and Logic Design Abstractions
As shown in Fig. 1.1, most designs have various abstraction levels. The
design approach can be top-down or bottom-up. The implementation team decides
on the right approach depending on the design complexity and the availability
of design resources. Most complex designs use the top-down approach rather
than the bottom-up approach.
The design is initially described as a functional model, and the architecture and
micro-architecture of the design are derived from the functional design
specifications. Architecture design involves estimating the memory, processor
logic, and throughput requirements, together with the associated glue logic and
functional design requirements. The architecture is expressed as functional
blocks and represents the functionality of the design in block diagram form.
The micro-architecture is the detailed representation of every architecture block.
It describes the block and sub-block level details, interface and pin connections,
and hierarchical design details. Information about synchronous or asynchronous
designs and about clock and reset trees can also be described in the
micro-architecture document.
RTL stands for Register Transfer Level. RTL design uses the micro-architecture
as the reference design document, and the design is coded in Verilog RTL for the
required functionality. Efficient design and coding guidelines play an important
role at this stage, and efficient RTL reduces the overall time required during
the implementation phase. The outcome of the RTL design stage is the gate-level
netlist, which is generated by RTL synthesis and represents the functional design
in the form of combinational and sequential logic cells.
Finally, the switch-level design is the abstraction used at the layout level to
represent the design in the form of switches: PMOS, NMOS, and CMOS.
[Fig. 1.1: design abstraction levels (Functional Design, Architecture, Micro-architecture, RTL Design), annotated with the Top-Down and Bottom-Up approaches]
1.3 Integrated Circuit Design and Methodologies
With the evolution of VLSI design technology, designs are becoming more
complex, and SOC-based designs are feasible in a shorter design cycle time.
Customer demand for products in a shorter design cycle time can be met by using
an efficient design flow. The design needs to evolve from the specification
stage to the final layout. The use of EDA tools with suitable features has made
it possible to produce bug-free designs with proven functionality. The design
flow is shown in Fig. 1.2, and it consists of three major steps to generate the
netlist.
1.3.1 RTL Design
The functional design is described in document form using the architecture and
micro-architecture. RTL design using Verilog uses the micro-architecture
document to code the design. The RTL designer follows suitable design and coding
guidelines while implementing the RTL design. An efficient RTL design always
plays an important role during the implementation cycle. During this step, the
designer describes the block-level and top-level functionality using efficient
Verilog RTL.
[Fig. 1.2: design flow (Verilog RTL Design; Functional Design Verification with a "Pass?" decision; Synthesis with Constraints and a "Constraints Met?" decision; Physical Design)]
1.3.2 Functional Verification
After an efficient Verilog RTL description has been completed for the given design
specifications, the design functionality is verified using an industry-standard
simulator. Pre-synthesis simulation is performed without any delays, and its
focus is to verify the functionality of the design. The common practice in
industry is to verify the design functionality by writing a testbench. The
testbench drives stimulus signals into the design and monitors the outputs from
the design. In the present scenario, automation in the verification flow and new
verification methodologies have evolved and are used to verify complex design
functionality in a shorter span of time using the proper resources. The role of
the verification engineer is to test for functional mismatches between the
expected output and the actual output. If a functional mismatch is found during
simulation, it must be rectified before moving to the synthesis step.
Functional verification is an iterative process that continues until the design
meets the required functionality and the target coverage.
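The testbench flow described above can be sketched as follows. This is a minimal illustration and not an example from the book: the module names and2_dut and and2_tb, the signal names, and the 10-unit delay are assumptions chosen for the sketch. The testbench drives every input combination into a small design under test and compares the actual output with the expected value.

```verilog
// Hypothetical design under test: a two-input AND gate.
module and2_dut (input wire a_in, input wire b_in, output wire y_out);
  assign y_out = a_in & b_in;
endmodule

// Self-checking testbench: applies stimulus and monitors the output.
module and2_tb;
  reg  a_in, b_in;          // stimulus driven by the testbench
  wire y_out;               // response observed from the design
  integer i, errors;

  and2_dut dut (.a_in(a_in), .b_in(b_in), .y_out(y_out));

  initial begin
    errors = 0;
    for (i = 0; i < 4; i = i + 1) begin
      {a_in, b_in} = i[1:0];            // force the stimulus
      #10;                              // let the output settle
      if (y_out !== (a_in & b_in)) begin
        errors = errors + 1;            // functional mismatch found
        $display("MISMATCH a=%b b=%b y=%b", a_in, b_in, y_out);
      end
    end
    if (errors == 0) $display("PASS");
    $finish;
  end
endmodule
```

A mismatch reported here would be fixed in the RTL before the design moves on to synthesis.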
1.3.3 Synthesis
When the functional requirements of the design are met, the next step is synthesis.
The synthesis tool takes the RTL Verilog code, design constraints, and libraries as
inputs and generates the gate-level netlist as an output. Synthesis is an iterative
process that continues until the design constraints are met. The primary design
constraints are area, speed, and power. If the design constraints are not met, the
synthesis tool performs more optimization on the RTL design. If the constraints
are still not met after optimization, it becomes necessary to modify the RTL code
or tweak the micro-architecture. The synthesis tool generates area, speed, and
power reports along with the gate-level netlist as outputs.
1.3.4 Physical Design
Physical design involves floor-planning of the design, power planning, place and
route, clock tree synthesis, post-layout verification, static timing analysis, and
generation of GDSII for an ASIC design. This step is outside the scope of the
subsequent discussions.
1.4 Verilog HDL
Verilog is standardized as IEEE 1364 and is used to describe digital electronic
circuits. Verilog HDL is used mainly for design and verification at the RTL level
of abstraction. Verilog was created by Prabhu Goel and Phil Moorby in 1984 at
Gateway Design Automation. The Verilog IEEE standards are Verilog-95
(IEEE 1364-1995), Verilog-2001 (IEEE 1364-2001), and Verilog-2005
(IEEE 1364-2005). Verilog is case sensitive. Before proceeding to RTL design and
synthesis, it is essential to have a basic understanding of the Verilog code
structure (Fig. 1.3).
The Verilog code structure template is shown in Fig. 1.3. In the template, wire is a
net data type and reg is a variable data type. A wire does not hold data and is
driven by continuous assignments. A reg holds its value and is assigned in
procedural blocks.
Every Verilog module starts with the ‘module’ keyword and ends with ‘endmodule’.
A module consists of the port declarations, net declarations, and the
functionality of the design.
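The module structure described above can be sketched as follows. This is an illustrative template under stated assumptions, not the book's Fig. 1.3: the module name code_structure_demo and all port and signal names are hypothetical.

```verilog
// Hypothetical module illustrating the basic Verilog code structure.
module code_structure_demo (
  input  wire a_in,      // port declarations
  input  wire b_in,
  output wire y_net,     // driven by a continuous assignment
  output reg  y_var      // assigned inside a procedural block
);
  wire and_net;                    // net declaration: holds no data

  assign and_net = a_in & b_in;    // continuous assignment to a net
  assign y_net   = and_net;

  always @* begin                  // procedural block
    y_var = a_in | b_in;           // procedural assignment to a reg
  end
endmodule
```

Although y_var is declared as reg, the always block is purely combinational, so synthesis is expected to infer a gate rather than a storage element.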
1.5 Verilog Design Description
In practice, Verilog HDL descriptions are categorized into three different coding
styles: structural, behavioral, and synthesizable RTL. Consider the design
structure of the half adder shown in Fig. 1.4c, which is used to describe the
different coding styles. Figure 1.4 shows the truth table, schematic, and logic
structure realization of the half adder.
1.5.1 Structural Design
Structural design defines the data structure of the design, which is described in
the form of a netlist using the required net connectivity. Structural design is mainly the
1.5.2 Behavior Design
The name itself indicates the nature of the coding style. In the behavioral style of
Verilog code, the functionality is coded from the truth table of the specific design.
The design is treated as a black box with inputs and outputs. The main
intention of the designer is to map the functionality at the outputs according to
the required set of inputs (Example 1.2).
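As an illustration of coding directly from a truth table, a half adder might be written in the behavioral style as shown below. This is a sketch rather than the book's Example 1.2; the module name half_adder_behavioral and the port names are assumptions.

```verilog
// Behavioral half adder coded row by row from its truth table.
module half_adder_behavioral (
  input  wire a_in,
  input  wire b_in,
  output reg  sum_out,
  output reg  carry_out
);
  always @* begin
    case ({a_in, b_in})
      2'b00:   {carry_out, sum_out} = 2'b00;
      2'b01:   {carry_out, sum_out} = 2'b01;
      2'b10:   {carry_out, sum_out} = 2'b01;
      2'b11:   {carry_out, sum_out} = 2'b10;
      default: {carry_out, sum_out} = 2'bxx;  // x or z on an input
    endcase
  end
endmodule
```

The module is treated as a black box: only the mapping from input combinations to outputs is described, with no reference to gates.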
1.6.2 Verilog Logical Operators
Verilog supports the logical AND (&&), OR (||), and negation (!) operators to
perform the desired logical operations. Logical operators return a single-bit
value as the result of the operation. Table 1.2 describes the functional use of
the logical operators (Example 1.5).
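A minimal sketch of the logical operators is given below. The module name logical_ops_demo and the four-bit port widths are assumptions for illustration; each multi-bit operand is treated as true when it is non-zero.

```verilog
module logical_ops_demo (
  input  wire [3:0] a_in,
  input  wire [3:0] b_in,
  output wire       and_out,   // 1 when both operands are non-zero
  output wire       or_out,    // 1 when at least one operand is non-zero
  output wire       not_out    // 1 when a_in is zero
);
  assign and_out = a_in && b_in;
  assign or_out  = a_in || b_in;
  assign not_out = !a_in;
endmodule
```

With a_in = 4'b0101 and b_in = 4'b0000, and_out is 0, or_out is 1, and not_out is 0.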
1.6.3 Verilog Equality and Inequality Operators
Verilog equality and inequality operators return a true or false value after
comparing two operands. Table 1.3 describes the functionality of these operators
(Example 1.6).
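The difference between logical equality and case equality can be sketched as follows. The module name equality_ops_demo and the port names are assumptions. The case equality operators compare x and z literally and are generally used in testbenches rather than in synthesizable RTL.

```verilog
module equality_ops_demo (
  input  wire [3:0] a_in,
  input  wire [3:0] b_in,
  output wire       eq_out,       // ==  : x when x/z bits make the result ambiguous
  output wire       neq_out,      // !=  : x under the same condition
  output wire       case_eq_out,  // === : x and z are compared literally
  output wire       case_neq_out  // !== : always returns 0 or 1
);
  assign eq_out       = (a_in == b_in);
  assign neq_out      = (a_in != b_in);
  assign case_eq_out  = (a_in === b_in);
  assign case_neq_out = (a_in !== b_in);
endmodule
```

In simulation, driving both inputs with 4'b10x1 gives eq_out = x but case_eq_out = 1.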
1.6.4 Verilog Sign Operators
Verilog supports the unary operators ‘+’ and ‘−’ to assign a sign to an operand.
Table 1.4 describes the sign operators (Example 1.7).
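The sign operators can be illustrated with the short sketch below. The module name sign_ops_demo and the signed four-bit ports are assumptions made for the example.

```verilog
module sign_ops_demo (
  input  wire signed [3:0] a_in,
  output wire signed [3:0] plus_out,   // +a_in leaves the value unchanged
  output wire signed [3:0] minus_out   // -a_in is the two's complement of a_in
);
  assign plus_out  = +a_in;
  assign minus_out = -a_in;
endmodule
```

For a_in = 3 (4'b0011), minus_out is -3, which is represented as 4'b1101.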
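1.6.5 Verilog Bitwise Operators
Verilog supports bitwise operations. Bitwise operators take two single-bit or
multi-bit operands (one operand for bitwise negation) and return a result that
is computed bit by bit. Verilog has no dedicated two-operand NAND or NOR
operator; these functions are obtained by negating the result of a bitwise AND
or OR. Table 1.5 describes the functionality and use of the bitwise operators
(Example 1.8).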
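A compact sketch of the bitwise operators follows, including NAND and NOR built by negating AND and OR. The module name bitwise_ops_demo and the port names are assumptions for illustration.

```verilog
module bitwise_ops_demo (
  input  wire [3:0] a_in,
  input  wire [3:0] b_in,
  output wire [3:0] not_out,
  output wire [3:0] and_out,
  output wire [3:0] or_out,
  output wire [3:0] xor_out,
  output wire [3:0] xnor_out,
  output wire [3:0] nand_out,
  output wire [3:0] nor_out
);
  assign not_out  = ~a_in;
  assign and_out  = a_in & b_in;
  assign or_out   = a_in | b_in;
  assign xor_out  = a_in ^ b_in;
  assign xnor_out = a_in ~^ b_in;
  assign nand_out = ~(a_in & b_in);  // NAND: negate the bitwise AND
  assign nor_out  = ~(a_in | b_in);  // NOR: negate the bitwise OR
endmodule
```

With a_in = 4'b1100 and b_in = 4'b1010, and_out is 4'b1000, xor_out is 4'b0110, and nand_out is 4'b0111.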
1.6.6 Verilog Relational Operators
Verilog supports relational operators that compare two binary numbers and return
a true (‘1’) or false (‘0’) value. Table 1.6 describes the relational operators
(Example 1.9).
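The four relational operators can be sketched as shown below. The module name relational_ops_demo and the unsigned four-bit ports are assumptions; with unsigned operands the comparison is an unsigned magnitude comparison.

```verilog
module relational_ops_demo (
  input  wire [3:0] a_in,
  input  wire [3:0] b_in,
  output wire gt_out, ge_out, lt_out, le_out
);
  assign gt_out = (a_in >  b_in);
  assign ge_out = (a_in >= b_in);
  assign lt_out = (a_in <  b_in);
  assign le_out = (a_in <= b_in);
endmodule
```

With a_in = 9 and b_in = 5, gt_out and ge_out are 1 while lt_out and le_out are 0.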
1.6.7 Verilog Concatenation and Replication Operators
Verilog supports concatenation and replication of any binary string. Table 1.7
describes the functionality of the concatenation and replication operators
(Example 1.10).
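Concatenation and replication can be sketched as follows. The module name concat_demo and the port names are assumptions; the sign-extension output is included as one common use of replication.

```verilog
module concat_demo (
  input  wire [3:0] a_in,
  input  wire [3:0] b_in,
  output wire [7:0] concat_out,  // a_in in the upper nibble, b_in in the lower
  output wire [7:0] repl_out,    // a_in[1:0] repeated four times
  output wire [7:0] sext_out     // a_in sign-extended to eight bits
);
  assign concat_out = {a_in, b_in};
  assign repl_out   = {4{a_in[1:0]}};
  assign sext_out   = {{4{a_in[3]}}, a_in};
endmodule
```

With a_in = 4'b1001 and b_in = 4'b0110, concat_out is 8'b1001_0110, repl_out is 8'b0101_0101, and sext_out is 8'b1111_1001.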
1.6.8 Verilog Reduction Operators
Verilog supports reduction operators, which take a single multi-bit operand and
return a single-bit value after bitwise reduction. Table 1.8 describes the
reduction operators (Example 1.11).
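The reduction operators can be sketched as shown below. The module name reduction_ops_demo and the port names are assumptions; each operator collapses all bits of a_in into one result bit.

```verilog
module reduction_ops_demo (
  input  wire [3:0] a_in,
  output wire and_out, nand_out, or_out, nor_out, xor_out, xnor_out
);
  assign and_out  = &a_in;    // 1 only when every bit is 1
  assign nand_out = ~&a_in;
  assign or_out   = |a_in;    // 1 when any bit is 1
  assign nor_out  = ~|a_in;
  assign xor_out  = ^a_in;    // 1 when the number of 1 bits is odd
  assign xnor_out = ~^a_in;
endmodule
```

With a_in = 4'b1011, and_out is 0, or_out is 1, and xor_out is 1 because three bits are set.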
1.6.9 Verilog Shift Operators
Verilog shift operators require two operands: the value to be shifted and the
number of bit positions to shift. Table 1.9 describes the functionality of the
shift operators (Example 1.12).
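A minimal sketch of the shift operators is given below. The module name shift_ops_demo and the port names are assumptions, and the arithmetic right shift (>>>, available from Verilog-2001) is included as an assumption about what the reader may want to compare against the logical shifts.

```verilog
module shift_ops_demo (
  input  wire        [7:0] a_in,
  input  wire        [2:0] shamt_in,  // number of positions to shift
  output wire        [7:0] sll_out,   // logical left shift, zero fill
  output wire        [7:0] srl_out,   // logical right shift, zero fill
  output wire signed [7:0] sra_out    // arithmetic right shift, sign fill
);
  assign sll_out = a_in << shamt_in;
  assign srl_out = a_in >> shamt_in;
  assign sra_out = $signed(a_in) >>> shamt_in;
endmodule
```

With a_in = 8'b1001_0000 and shamt_in = 2, sll_out is 8'b0100_0000, srl_out is 8'b0010_0100, and sra_out is 8'b1110_0100.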
1.7 Summary
As discussed earlier, Verilog is a case-sensitive language used for the design and
verification of logic circuits. The following key points summarize this chapter.
1. Verilog is an efficient hardware description language for describing design
functionality.
2. Although there are different description styles, in practice the designer uses
the synthesizable RTL coding style. Verilog supports concurrent and sequential
designs.
3. Verilog supports four logic values: logical ‘0’, logical ‘1’, high impedance ‘z’,
and unknown ‘x’.
4. Verilog uses concurrent and sequential statements, and supports different
operators to perform logical and arithmetic operations.
5. Verilog is used for both the design and the verification of digital logic.
6. Verilog is case sensitive and has both synthesizable and non-synthesizable
constructs.
Chapter 2
Combinational Logic Design (Part I)
Abstract This chapter describes the use of Verilog HDL to code combinational
logic designs and covers small gate count designs. The chapter is organized to
build a practical understanding of synthesizable Verilog HDL through key
practical scenarios and applications. The synthesizable Verilog HDL is
described for the required functionality, and the synthesized logic is explained
for practical understanding. This chapter is useful for building the practical
expertise needed to code combinational designs using synthesizable Verilog
constructs.
Keywords Logic gates · NOT · AND · NAND · OR · NOR · EXOR · EXNOR · Buffer · Adder ·
Subtractor · Gray · Binary · Code-conversion · Blocking assignment ·
Continuous assignment · Procedural block · Always · Tri state · Two’s complement
2.1 Introduction to Combinational Logic
Combinational logic is implemented using logic gates, and in combinational logic
the output is a function of the present inputs only. The goal of a designer is
always to implement the logic using the minimum number of logic gates or logic
cells. Minimization techniques include K-maps, Boolean algebra, Shannon’s
expansion theorem, and hyperplanes. The designer should aim for a design with
optimal performance and lower area. Area minimization techniques therefore have
an important role in the design of combinational logic or functions. In the
present scenario, designs are very complex, and the design functionality is
described using the hardware description language Verilog. The subsequent
sections focus on the use of Verilog RTL to describe combinational designs.
2.2 Logic Gates and Synthesizable RTL
This section discusses the logic gates and their synthesizable Verilog RTL.
2.2.1 NOT or Invert Logic
NOT logic complements the input. NOT logic is also called an inverter.
The synthesizable RTL is shown in Example 2.1. The truth table of NOT logic is
shown in Table 2.1.
The synthesized NOT logic is shown in Fig. 2.1. The input port of the NOT logic
gate is named ‘a_in’ and the output is named ‘y_out’.
2.2.2 Two-Input OR Logic
OR logic generates a logical ‘1’ output when at least one of the inputs is
logical ‘1’. The synthesizable RTL is shown in Example 2.2. The truth table of
OR logic is shown in Table 2.2.
The synthesized OR logic is shown in Fig. 2.2. The input ports of the OR logic
gate are named ‘a_in’ and ‘b_in’, and the output is named ‘y_out’.
Example 2.2 Synthesizable Verilog code for two-input OR logic. Note: While describing
the design functionality, make sure that all the input ports are listed in the
sensitivity list. Signals missing from the sensitivity list cause a mismatch between
simulation and synthesis results, as discussed in Chap. 3
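The sensitivity-list guideline can be sketched for the two-input OR gate as follows. The original listing of Example 2.2 is not reproduced in this text, so this is an illustrative reconstruction under assumptions: the module name or2_logic is hypothetical, and the port names follow the ones used in the surrounding description.

```verilog
module or2_logic (
  input  wire a_in,
  input  wire b_in,
  output reg  y_out
);
  // Both inputs appear in the sensitivity list. With always @(a_in) alone,
  // simulation would ignore changes on b_in, while synthesis would still
  // infer a plain OR gate, so the two results would disagree.
  always @(a_in or b_in) begin
    y_out = a_in | b_in;
  end
endmodule
```

In Verilog-2001 and later, always @* builds the complete sensitivity list automatically and avoids this class of mismatch.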
AND logic generates a logical ‘1’ output when both inputs, ‘a’ and ‘b’, are
logical ‘1’. The synthesizable RTL is shown in Example 2.4. The truth table of
AND logic is shown in Table 2.4.
Example 2.4 Synthesizable Verilog code for two-input AND logic. Note: An AND gate can
be visualized as two switches in series and is used in programmable logic devices
(PLDs) as one of the elements for realizing the required logic. A programmable AND
plane can be created using AND logic gates with programmable inputs as the primary
elements
The synthesized two-input AND logic is shown in Fig. 2.4. The input ports of the
AND logic gate are named ‘a_in’ and ‘b_in’, and the output is named ‘y_out’.
NAND logic is the complement of AND logic. The synthesizable RTL is shown in
Example 2.5. The truth table of NAND logic is shown in Table 2.5.
Example 2.5 Synthesizable Verilog RTL for two-input NAND logic. Note: NAND logic is
also treated as universal logic, because all possible logic functions can be
realized using NAND gates alone. NAND logic is used to implement storage elements
such as latches and flip-flops and also to realize combinational functions.
According to De Morgan’s theorem, the bubbled OR is equivalent to NAND
The synthesized NAND logic is shown in Fig. 2.5. The input ports of the NAND logic
gate are named ‘a_in’ and ‘b_in’, and the output is named ‘y_out’.
Two-input XOR is called exclusive OR logic and generates a logical ‘1’ output
when the two inputs are not equal. The synthesizable RTL is shown in Example 2.6.
The truth table of XOR logic is shown in Table 2.6.
The synthesized two-input XOR logic is shown in Fig. 2.6. The input ports of the
XOR logic gate are named ‘a_in’ and ‘b_in’, and the output is named ‘y_out’.
If an XOR cell or gate is not available in the library, the XOR logic is realized
using AND-OR-Invert logic or using the minimum number of NAND gates.
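One way to realize XOR from NAND gates alone is sketched below using Verilog gate primitives. The module name xor2_from_nand and the internal net names are assumptions; the structure is the well-known four-gate NAND realization, shown only as an illustration of what a library without an XOR cell might map to.

```verilog
module xor2_from_nand (
  input  wire a_in,
  input  wire b_in,
  output wire y_out
);
  wire n1, n2, n3;               // internal nets (hypothetical names)

  nand g1 (n1, a_in, b_in);      // n1 = ~(a & b)
  nand g2 (n2, a_in, n1);        // n2 = ~(a & n1)
  nand g3 (n3, b_in, n1);        // n3 = ~(b & n1)
  nand g4 (y_out, n2, n3);       // y_out = a ^ b
endmodule
```

Adding a fifth NAND gate wired as an inverter on y_out gives the XNOR function.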
Two-input XNOR is called exclusive NOR logic and generates a logical ‘1’ output
when both inputs are equal. XNOR is the complement of XOR logic. The
synthesizable RTL for XNOR is shown in Example 2.7. The truth table of XNOR logic
is shown in Table 2.7.
The synthesized XNOR logic is shown in Fig. 2.7. The input ports of the XNOR logic
gate are named ‘a_in’ and ‘b_in’, and the output is named ‘y_out’.
If an XNOR cell is not available in the library, the XNOR logic is realized using
AND-OR-Invert logic or using the minimum number of NAND or NOR gates. A minimum
of five two-input NAND gates is required to realize a two-input XNOR gate.
Example 2.6 Synthesizable Verilog code for two-input XOR logic. Note: An XOR gate can
be implemented using two-input NAND gates; four two-input NAND gates are required
to implement a two-input XOR gate. XOR gates are used to implement arithmetic
operations such as addition and subtraction
Tri-state logic has three logic states, namely logical ‘0’, logical ‘1’, and high
impedance ‘z’. The synthesizable RTL is shown in Example 2.8. The truth table of
tri-state buffer logic is shown in Table 2.8.
The synthesized tri-state logic is shown in Fig. 2.8. The data input port of the
tri-state logic is named ‘data_in’, the enable input is named ‘enable’, and the
output is named ‘data_out’.
Example 2.8 Synthesizable Verilog code for tri-state logic. Note: Avoid the use of
tri-state logic while developing the RTL, because tri-state logic is difficult to
test. Instead of tri-state logic, it is recommended to use multiplexers to develop
logic with an enable
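The tri-state buffer and the multiplexer-based alternative recommended in the note can be sketched side by side. This is not the book's Example 2.8: the module names tristate_buffer and mux2_enable and the default_in port are assumptions, and an active-high enable is assumed.

```verilog
// Tri-state buffer: drives data_in when enabled, otherwise high impedance.
module tristate_buffer (
  input  wire data_in,
  input  wire enable,
  output wire data_out
);
  assign data_out = enable ? data_in : 1'bz;
endmodule

// Multiplexer-based alternative: the output is always driven, never 'z'.
module mux2_enable (
  input  wire data_in,
  input  wire default_in,   // hypothetical value passed when disabled
  input  wire enable,
  output wire data_out
);
  assign data_out = enable ? data_in : default_in;
endmodule
```

Because the multiplexer output is always driven to a known value, it is easier to observe and test than a shared tri-state net.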
2.3 Arithmetic Circuits
Arithmetic operations such as addition and subtraction have an important role in
the efficient design of processor logic. The arithmetic logic unit (ALU) of any
processor can be designed to perform addition, subtraction, increment, and
decrement operations. The arithmetic designs are described in RTL Verilog code to
achieve optimal area and a shorter critical path. This section describes the
important logic blocks used to perform arithmetic operations, with the equivalent
Verilog RTL descriptions.
2.3.1 Adder
Adders are used to perform the binary addition of two binary numbers. Adders are
used for signed or unsigned addition operations.
A half adder has two one-bit inputs, ‘a_in’ and ‘b_in’, and generates two one-bit
outputs, ‘sum_out’ and ‘carry_out’, where ‘sum_out’ is the sum output and
‘carry_out’ is the carry output. Table 2.9 is the truth table for the half adder,
and the RTL is described in Example 2.9.
Example 2.9 Synthesizable RTL code for half adder. Note: Half adders are used as the
basic component to perform addition. Full adder logic circuits are designed by
instantiating half adders as components
The synthesized half adder is shown in Fig. 2.9. The input ports of the half adder
are named ‘a_in’ and ‘b_in’, and the outputs are named ‘sum_out’ and ‘carry_out’.
A full adder performs addition on three one-bit binary inputs. Consider three
one-bit binary inputs named ‘a_in’, ‘b_in’, and ‘c_in’, and one-bit binary outputs
named ‘sum_out’ and ‘carry_out’. Table 2.10 is the truth table for the full adder,
and the RTL is described in Example 2.10.
Table 2.10 Truth table for full adder
c_in  a_in  b_in  sum_out  carry_out
0     0     0     0        0
0     0     1     1        0
0     1     0     1        0
0     1     1     0        1
1     0     0     1        0
1     0     1     0        1
1     1     0     0        1
1     1     1     1        1
Example 2.10 Synthesizable Verilog code for full adder. Note: A full adder consumes
more area, so it is highly recommended to implement the adder logic using
multiplexers
The synthesized full adder is shown in Fig. 2.10. The input ports of the full adder
are named ‘a_in’, ‘b_in’, and ‘c_in’, and the outputs are named ‘sum_out’ and
‘carry_out’.
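A full adder built by instantiating two half adders, as mentioned in the note to Example 2.9, might look like the sketch below. The original listings are not reproduced in this text, so the module names half_adder and full_adder and the internal net names are assumptions; the port names follow the surrounding description.

```verilog
module half_adder (
  input  wire a_in, b_in,
  output wire sum_out, carry_out
);
  assign sum_out   = a_in ^ b_in;
  assign carry_out = a_in & b_in;
endmodule

module full_adder (
  input  wire a_in, b_in, c_in,
  output wire sum_out, carry_out
);
  wire s1, c1, c2;   // intermediate sum and carries (hypothetical names)

  half_adder ha1 (.a_in(a_in), .b_in(b_in), .sum_out(s1),      .carry_out(c1));
  half_adder ha2 (.a_in(s1),   .b_in(c_in), .sum_out(sum_out), .carry_out(c2));

  assign carry_out = c1 | c2;   // a carry from either stage propagates out
endmodule
```

For c_in = 1, a_in = 1, b_in = 0 the outputs are sum_out = 0 and carry_out = 1, matching the corresponding row of Table 2.10.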
2.3.2 Subtractor
Subtractors are used to perform the binary subtraction of two binary numbers. This
section describes the half and full subtractors.
A half subtractor has two one-bit inputs, ‘a’ and ‘b’, and generates two one-bit
outputs, ‘d’ and ‘bor’, where ‘d’ is the difference output and ‘bor’ is the borrow
output. Table 2.11 is the truth table for the half subtractor, and the RTL is
described in Example 2.11.
The synthesized half subtractor is shown in Fig. 2.11. The input ports of the half
subtractor are named ‘a’ and ‘b’, and the outputs are named ‘d’ and ‘bor’.
A full subtractor performs subtraction on three one-bit binary inputs.
Consider three one-bit inputs named ‘a’, ‘b’, and ‘c’, and one-bit binary outputs
named ‘d’ and ‘bor’. Table 2.12 is the truth table for the full subtractor, and
the RTL is described in Example 2.12 and Fig. 2.12.
Example 2.11 Synthesizable Verilog code for half subtractor. Note Half subtractors are used as
basic component to perform the binary subtractions. Full subtractor logic circuits are designed
using the instantiation of half subtractors as components
Example 2.12 Synthesizable Verilog code for full subtractor. Note It is recommended to use the
full adder to perform the subtraction operation. Subtraction is performed using two’s complement
addition
Synthesized full subtractor is shown in the Fig. 2.12 input ports of full subtractor
are named as ‘a,’ ‘b,’ ‘c’ and output as ‘d,’ ‘bor.’
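The half and full subtractors can be sketched with continuous assignments as shown below. The sketch assumes that the third input ‘c’ of the full subtractor is the borrow input, so that the circuit computes a − b − c; the module names are hypothetical and the code is not the book's Examples 2.11 and 2.12.

```verilog
// Half subtractor sketch: computes a - b.
module half_subtractor (
    input  wire a,
    input  wire b,
    output wire d,    // difference
    output wire bor   // borrow
);
    assign d   = a ^ b;
    assign bor = ~a & b;   // borrow is needed only for 0 - 1
endmodule

// Full subtractor sketch: computes a - b - c, with c assumed to be the borrow input.
module full_subtractor (
    input  wire a,
    input  wire b,
    input  wire c,
    output wire d,
    output wire bor
);
    assign d   = a ^ b ^ c;
    assign bor = (~a & b) | (~(a ^ b) & c);
endmodule
```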
Multi-bit adders and subtractors are used in the design of arithmetic units for
processors. The logic density depends upon the number of input bits of the adder or
subtractor.
Many practical designs use multi-bit adders and subtractors. It is industrial
practice to use the full adder as the basic component to perform addition. For
example, if the designer needs to implement a four-bit adder, then four full adders
are required. As shown in Example 2.13, addition is performed on two four-bit
binary numbers ‘A’ and ‘B.’ The final result is the four-bit sum at ‘S.’ The carry
input is Ci and the carry output is Co.
The synthesized four-bit adder is shown in Fig. 2.13. The input ports of the four-bit
adder are named ‘A,’ ‘B,’ and ‘Ci,’ and the outputs are ‘S’ and ‘Co.’
Example 2.13 Synthesizable Verilog code for four-bit adder. Note Four-bit addition operation
uses four full adders. Depending on signed or unsigned addition requirements the Verilog code can
be modified
Design of addition and subtraction can be accomplished using the adders only.
Subtraction can be performed using two’s complement addition. For example
consider the scenario shown in the Table 2.13.
The synthesized four-bit adder/subtractor for Example 2.14 is shown in Fig. 2.14.
The input ports of the four-bit adder/subtractor are named ‘A,’ ‘B,’ and ‘Ci,’ and the
outputs are ‘S’ and ‘Co.’ When the control input SUB is equal to logic '0' the circuit
performs addition, and when SUB is equal to logic '1' it performs subtraction,
which is 2's complement addition.
Example 2.14 Synthesizable Verilog code for four-bit adder and subtractor. Note Consider the SUB
control input as Ci and S4 as Co in the synthesized logic. Here, the resource used is the binary full
adder to perform both additions and subtractions. Subtraction is performed using
adders only. Resource sharing and resource utilization are discussed in Chap. 3
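The adder/subtractor scheme described above can be sketched behaviorally as follows. This is an illustrative sketch rather than the book's Example 2.14: the module name `add_sub_4bit` and the internal wire `b_eff` are assumptions, and SUB is used as the carry input, as the note suggests.

```verilog
// Four-bit adder/subtractor sketch using 2's complement addition.
// SUB = 0: S = A + B.  SUB = 1: S = A + (~B) + 1 = A - B.
module add_sub_4bit (
    input  wire [3:0] A,
    input  wire [3:0] B,
    input  wire       SUB,
    output wire [3:0] S,
    output wire       Co
);
    // XOR with SUB passes B unchanged for addition and inverts it for subtraction.
    wire [3:0] b_eff = B ^ {4{SUB}};

    // SUB also acts as the carry input, supplying the "+1" of the 2's complement.
    assign {Co, S} = A + b_eff + SUB;
endmodule
```

For A = 4'b0101, B = 4'b0011 and SUB = 1, the sum is 0101 + 1100 + 1 = 1_0010, so S = 4'b0010 (decimal 2). In this sketch Co = 1 during subtraction indicates that no borrow occurred.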
In most practical scenarios, comparators are used to check the equality of two
binary numbers, and parity detectors are used to compute the even or odd parity of
a given binary number. It is essential for the design engineer to have a good
understanding of both.
Comparators are used to compare two binary numbers. As discussed earlier, Verilog
supports four-value logic: logical ‘0,’ logical ‘1,’ unknown (don’t care) ‘x,’ and high
impedance ‘z.’ Verilog supports the logical equality operator (==) and the logical
inequality operator (!=), which are used to describe the comparison of two numbers.
These operators are used in synthesizable Verilog RTL code.
For example consider the operational Table 2.14. As shown in the table; when
A, B both are equal then output ‘Y’ is assigned to XOR of ‘A,’ ‘B’ and for unequal
case output ‘Y’ is assigned to AND of ‘A,’ ‘B’ (Example 2.15).
Synthesized equivalent block representation is shown in the Fig. 2.15
(Example 2.15).
Example 2.15 Synthesizable Verilog code for 1-bit comparator. Note Logical equality and
inequality operators are used in the synthesizable RTL code. If any of the operands is ‘x’ or
‘z’, the result of the comparison is unknown (‘x’) and is treated as false in a condition
Parity detectors are used to detect the even or odd parity for the binary number
string. For even number of 1’s, the output required is logical ‘0’ and for odd number
of 1’s the output required is logical ‘1,’ then the RTL Verilog can be described as
shown in the Example 2.16.
Example 2.16 Synthesizable Verilog code for parity detector. Note Parity detectors are used in
many of the DSP applications and an integral module for encryption engines
The operational table for the parity detector is shown below in Table 2.15. For
odd number of 1’s the output is logical ‘1’ and for even number of 1’s output is
assigned as logical ‘0’.
Synthesized equivalent block representation is shown in the Fig. 2.16.
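A parity detector with the behavior described above (output ‘1’ for an odd number of 1’s, ‘0’ for an even number) can be sketched with the reduction XOR operator. The 8-bit input width and the names `parity_detector`, `data_in` and `parity_out` are assumptions for illustration; they are not taken from Example 2.16.

```verilog
// Parity detector sketch. The 8-bit width is an assumed value.
module parity_detector (
    input  wire [7:0] data_in,
    output wire       parity_out
);
    // Reduction XOR: 1 when data_in holds an odd number of 1's, 0 otherwise.
    assign parity_out = ^data_in;
endmodule
```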
This section deals with the code converters commonly used in designs. As the name
indicates, code converters are used to convert a code from one number
system to another. In practical scenarios, binary to gray and gray
to binary converters are used.
The base of the binary number system is 2, and between successive values of a
multi-bit binary number one or more bits can change at a time. In gray code, only
one bit changes between successive values.
The RTL description of four-bit binary to gray code conversion is given in
Example 2.17.
Example 2.17 Synthesizable Verilog code for four-bit binary to gray code converter. Note Gray
codes are used in the multiple clock domain designs to transfer the control information from one of
the clock domain to another clock domain
The gray to binary code converter performs the opposite conversion to binary to
gray, and the RTL description of four-bit gray to binary code conversion is given in
Example 2.18.
The synthesized equivalent block representation is shown in Fig. 2.18.
Example 2.18 Synthesizable Verilog code for four-bit gray to binary code converter. Note Gray
codes are used in the gray counter implementation and also in the error correcting mechanism
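The two four-bit converters can be sketched as below. These are illustrative modules and not the book's Examples 2.17 and 2.18; the module and port names are assumptions.

```verilog
// Binary to gray sketch: each gray bit is the XOR of adjacent binary bits.
module bin2gray_4bit (
    input  wire [3:0] bin_in,
    output wire [3:0] gray_out
);
    assign gray_out = bin_in ^ (bin_in >> 1);  // MSB passes through unchanged
endmodule

// Gray to binary sketch: each binary bit is the XOR of the gray bit
// with the next more significant binary bit.
module gray2bin_4bit (
    input  wire [3:0] gray_in,
    output wire [3:0] bin_out
);
    assign bin_out[3] = gray_in[3];
    assign bin_out[2] = bin_out[3] ^ gray_in[2];
    assign bin_out[1] = bin_out[2] ^ gray_in[1];
    assign bin_out[0] = bin_out[1] ^ gray_in[0];
endmodule
```

For example, binary 4'b0110 maps to gray 4'b0101, and feeding 4'b0101 to the second module returns 4'b0110.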
2.4 Summary
As discussed in this chapter, the following are the important points to be
considered while implementing combinational logic RTL.
1. Use minimum area by sharing the arithmetic resources.
2. Use all the required signals in the sensitivity list to avoid simulation and
synthesis mismatch.
3. Avoid use of tri-state logic and implement the required logic using multiplexers
with a proper enable circuit.
4. Verilog supports four-value logic: logical ‘0,’ logical ‘1,’ unknown (don’t care)
‘x,’ and high impedance ‘z.’
• Use fewer adders in the design. Adders can be implemented using
multiplexers.
• NAND and NOR are universal logic gates and can be used to implement any
combinational or sequential logic.
Chapter 3
Combinational Logic Design (Part II)
Abstract This chapter describes complex combinational logic designs and
covers detailed, practice-oriented scenarios while describing multiplexers,
decoders, encoders, and priority encoders. The use of constructs like ‘‘if-else,’’
‘‘case,’’ and the continuous assignment ‘‘assign’’ is described in detail with
meaningful practical examples. The main focus of this chapter is to describe the
design functionality with synthesizable logic. The chapter also focuses on the
key practical issues that need to be tackled while describing designs in Verilog HDL.
Keywords MUX · Decoder · Encoder · Priority logic · Parallel logic · Nested
statement · Concurrent execution · Do not care · Priority · Inference · Simulation ·
Synthesis · Gate level · Block level
3.1 Multiplexers
Multiplexers are used to select one input from many. Multiplexers are also
called universal logic, and the term used in practice is MUX. By
using suitable multiplexers, any combinational logic function can be realized.
Multiplexers are used as selection logic in ASIC and FPGA-based designs.
A multiplexer consumes less area than an adder, and most of the time a MUX
is used to implement arithmetic components such as adders and subtractors.
The block diagram of an n:1 MUX is shown in Fig. 3.1. It consists of ‘n’ input
lines, ‘m’ select lines, and one output line. Input lines are denoted as “i[0], i[1], …,
i[n − 1],” select lines by “s[0], s[1], …, s[m − 1],” and the output line by ‘y’.
The relation between the input lines and select lines is given by $n = 2^m$. For
example, a 4:1 MUX has four input lines, so $m = \log_2 n = 2$, that is, it has two
select lines.
A 2:1 MUX has two input lines, one select line and one output line. When ‘s’ input
is logical ‘0’ output ‘y’ is assigned to ‘a’ and output is assigned to ‘b’ for ‘s’ equal
to logical ‘1’. Table 3.1 describes the truth table of 2:1 MUX and Example 3.1 is
synthesizable RTL for 2:1 multiplexer.
Example 3.1 Synthesizable Verilog code for 2:1 MUX. Note Conditional assignments are used
to select from many inputs so infers the multiplexer
Fig. 3.2 Synthesized 2:1 MUX. Note A 2:1 multiplexer symbolic representation is used to
describe the implementation of higher complexity multiplexers. Multiplexer is treated as universal
logic. Using multiplexers all possible combinational logic can be realized
Example 3.2 Synthesizable Verilog code for 2:1 MUX using ‘‘if-else’’
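The 2:1 MUX behavior described above (‘y’ follows ‘a’ when ‘s’ is ‘0’ and ‘b’ when ‘s’ is ‘1’) can be sketched in the two styles the examples refer to. The module names are assumptions, and the code is a sketch rather than the book's Examples 3.1 and 3.2.

```verilog
// 2:1 MUX sketch using a continuous assignment with the conditional operator.
module mux2to1_assign (
    input  wire a,
    input  wire b,
    input  wire s,
    output wire y
);
    assign y = s ? b : a;  // s = 0 selects a, s = 1 selects b
endmodule

// The same 2:1 MUX sketched with if-else inside a combinational always block.
module mux2to1_if (
    input  wire a,
    input  wire b,
    input  wire s,
    output reg  y
);
    always @(*) begin
        if (s)
            y = b;
        else
            y = a;   // the else branch keeps the logic purely combinational
    end
endmodule
```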
A four-to-one (4:1) MUX has four input lines and a single output line. The 4:1 MUX
has two select lines, which are used to select one of the inputs at a time. The truth
table of the 4:1 MUX is shown in Table 3.2 and Example 3.4 describes the
synthesizable RTL for the 4:1 MUX.
The equivalent hardware inferred for the 4:1 MUX described in
Example 3.4 is shown in Fig. 3.4. As shown in Fig. 3.4, input ‘a’ has the
highest priority compared to the other inputs and input ‘d’ has the least priority.
The hardware inferred in Fig. 3.4 is also generated by the Verilog RTL
described in the Example 3.5.
Example 3.3 Synthesizable Verilog code for 2:1 MUX using ‘‘case’’. Note ‘‘if-else’’ generates
priority logic and ‘‘case’’ generates parallel logic. It is recommended to use ‘‘case’’ statement to
describe MUX, decoders. It is recommended to use ‘‘if-else’’ to describe priority logic
The 4:1 MUX can be implemented by using 2:1 MUX and the equivalent repre-
sentation is shown in Fig. 3.5.
As shown in Fig. 3.5; 4:1 MUX is implemented by using three 2:1 multiplexers.
The Verilog RTL is described in the Example 3.8.
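A structural sketch of a 4:1 MUX built from three 2:1 multiplexers is given below. The select encoding (00 selects ‘a’, 01 ‘b’, 10 ‘c’, 11 ‘d’), the module names `mux2` and `mux4_from_mux2`, and the internal wire names are assumptions for illustration and may differ from Example 3.8.

```verilog
// Hypothetical 2:1 MUX cell used as the building block.
module mux2 (
    input  wire a,
    input  wire b,
    input  wire s,
    output wire y
);
    assign y = s ? b : a;
endmodule

// 4:1 MUX sketch built from three 2:1 multiplexers.
module mux4_from_mux2 (
    input  wire       a,
    input  wire       b,
    input  wire       c,
    input  wire       d,
    input  wire [1:0] sel,
    output wire       y
);
    wire lo;  // selection between a and b
    wire hi;  // selection between c and d

    mux2 u0 (.a(a),  .b(b),  .s(sel[0]), .y(lo));
    mux2 u1 (.a(c),  .b(d),  .s(sel[0]), .y(hi));
    mux2 u2 (.a(lo), .b(hi), .s(sel[1]), .y(y));  // final stage picks the pair
endmodule
```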
3.2 Decoders
A decoder has ‘n’ select (input) lines and ‘m’ output lines and is used to generate
either active high or active low outputs. The relation between select lines and output
lines is given by $m = 2^n$. Depending on the logic status of the ‘n’ input lines, one
of the output lines goes high or low at a time. Figure 3.6 represents a 3:8 decoder; as
shown in Fig. 3.6, X2, X1, and X0 are the select inputs and Y0 to Y7 are the active
high output lines.
The truth table of 3–8 decoder is shown in Table 3.3. For the active high output
at a time one of the output line is active high.
Figure 3.7 is a symbolic representation of a 3:8 decoder with active high enable
input ‘en’. The truth table described above holds for the decoder with active
high enable ‘en=1’. When ‘en=0’ the decoder is disabled and the output is
Y = 8'b0000_0000. The synthesizable RTL is shown in Example 3.9.
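The enabled 3:8 decoder can be sketched as follows. The port names ‘en’, ‘X’ and ‘Y’ follow the text (with X2, X1, X0 grouped as a vector), while the module name and the shift-based coding style are assumptions and not necessarily what Example 3.9 uses.

```verilog
// 3:8 decoder sketch with active high enable and active high outputs.
module decoder3to8 (
    input  wire       en,
    input  wire [2:0] X,
    output reg  [7:0] Y
);
    always @(*) begin
        if (en)
            Y = 8'b0000_0001 << X;   // exactly one output bit is high
        else
            Y = 8'b0000_0000;        // decoder disabled
    end
endmodule
```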
The 1 line to 2 or (1:2) decoder has one select input ‘‘Sel’’ and two output lines
‘‘Out_Y0’’ and ‘‘Out_Y1’’ The truth table and equivalent representation is shown in
Table 3.4 and Fig. 3.8, respectively.
The Verilog RTL is shown in the Example 3.10 and the equivalent synthesized
result in Fig. 3.9.
The 1 line to 2 or (1:2) decoder has one select input ‘‘Sel’’, enable input ‘‘En’’ and
two output lines ‘‘Out_Y0’’ and ‘‘Out_Y1’’. The truth table and equivalent repre-
sentation are shown in Table 3.5.
The Verilog RTL is shown in the Example 3.11 and the equivalent synthesized
result in Fig. 3.10.
The 2 line to 4 or (2:4) decoder has two select inputs ‘‘Sel [1], Sel [0],’’ enable input
‘‘En’’ and four output lines ‘‘Out_Y[3], Out_Y[2], Out_Y[1], and Out_Y[0]’’. The
truth table and equivalent representation is shown in Table 3.6.
The Synthesizable Verilog RTL is described in the Example 3.12 and the
equivalent hardware inferred is shown in Fig. 3.11.
Example 3.11 Verilog RTL for 1:2 decoder with enable input
Fig. 3.10 Synthesized logic for 1:2 decoder with active high enable
The 2 line to 4 or (2:4) decoder has two select inputs ‘‘Sel [1], Sel [0],’’ an active low
enable input ‘‘En_bar’’ and four active low output lines ‘‘Out_Y[3], Out_Y[2],
Out_Y[1], and Out_Y[0]’’. The truth table and equivalent representation are shown
in Table 3.7.
The Synthesizable Verilog RTL is described in the Example 3.13 and the
equivalent hardware inferred is shown in Fig. 3.12.
Both the RTL Verilog codes described in the Example 3.13 infer the same
hardware and are shown in Fig. 3.12.
The 4 line to 16 or (4:16) decoder has four select inputs ‘‘Sel [3]: Sel [0],’’ active
low enable input ‘‘En_bar’’ and designed by using 2:4 decoder which has four
active low output lines ‘‘Out_Y[3], Out_Y[2], Out_Y[1], and Out_Y[0]’’. The
equivalent representation is shown in Fig. 3.13 (Example 3.14).
Table 3.7 Truth table for 2:4 decoder with active low enable and active low output
En_bar  Sel[1]  Sel[0]  Out_Y[3]  Out_Y[2]  Out_Y[1]  Out_Y[0]
0       0       0       1         1         1         0
0       0       1       1         1         0         1
0       1       0       1         0         1         1
0       1       1       0         1         1         1
1       X       X       1         1         1         1
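Table 3.7 can be turned into RTL along the following lines. The port names follow the table; the module name and the use of ‘‘case’’ are assumptions, so this is a sketch rather than a copy of Example 3.13.

```verilog
// 2:4 decoder sketch with active low enable and active low outputs (Table 3.7).
module decoder2to4_n (
    input  wire       En_bar,
    input  wire [1:0] Sel,
    output reg  [3:0] Out_Y
);
    always @(*) begin
        if (!En_bar) begin
            case (Sel)
                2'b00:   Out_Y = 4'b1110;
                2'b01:   Out_Y = 4'b1101;
                2'b10:   Out_Y = 4'b1011;
                2'b11:   Out_Y = 4'b0111;
                default: Out_Y = 4'b1111;  // covers 'x'/'z' on Sel in simulation
            endcase
        end
        else begin
            Out_Y = 4'b1111;               // disabled: all outputs inactive (high)
        end
    end
endmodule
```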
Fig. 3.13 Synthesized logic for 4:16 decoder using 2:4 decoders
3.3 Encoders
The function of an encoder is the reverse of that of a decoder. An encoder has ‘n’
input lines and ‘m’ output lines, and the relation between input lines and output
lines is given by $n = 2^m$. For example, consider a 4:2 encoder: the number of input
lines is n = 4 and the number of output lines is m = 2. The block diagram of the 4:2
encoder is shown in Fig. 3.14 with the equivalent gate level representation, and the
truth table is described in Table 3.8.
The Verilog RTL description for the 4:2 encoder is given in Example 3.15.
The Verilog RTL infers hardware similar to that shown in Fig. 3.14. As described in
Example 3.15, the output is ‘00’ both when none of the inputs is active high and
when In_A[0] is active high. To distinguish the two cases, a status flag needs to be
incorporated in the design: for In_A[0] = 1 the output is ‘00’ and the status flag is
one, whereas for the default condition the status flag should be zero.
Table 3.8 Truth table for 4:2 encoder
In[3]  In[2]  In[1]  In[0]  Out_Y[1]  Out_Y[0]
1      0      0      0      1         1
0      1      0      0      1         0
0      0      1      0      0         1
0      0      0      1      0         0
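A 4:2 encoder with the status flag discussed in the text can be sketched as follows. The flag name `valid` and the module name are hypothetical, and the input is named In_A as in the prose; this is not the book's Example 3.15.

```verilog
// 4:2 encoder sketch with a status flag (Table 3.8).
// "valid" is a hypothetical name for the status flag described in the text.
module encoder4to2 (
    input  wire [3:0] In_A,
    output reg  [1:0] Out_Y,
    output reg        valid
);
    always @(*) begin
        valid = 1'b1;                 // assume a legal one-hot input by default
        case (In_A)
            4'b0001: Out_Y = 2'b00;
            4'b0010: Out_Y = 2'b01;
            4'b0100: Out_Y = 2'b10;
            4'b1000: Out_Y = 2'b11;
            default: begin
                Out_Y = 2'b00;        // same code as In_A[0], so the flag
                valid = 1'b0;         // is needed to tell the cases apart
            end
        endcase
    end
endmodule
```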
Table 3.9 Truth table for priority 4:2 encoder
In_A[3]  In_A[2]  In_A[1]  In_A[0]  Out_Y[1]  Out_Y[0]
1        X        X        X        1         1
0        1        X        X        1         0
0        0        1        X        0         1
0        0        0        1        0         0
Priority encoders are used in practical applications. A priority encoder has ‘n’ input
lines and ‘m’ output lines, and the relation between input lines and output lines is
given by $n = 2^m$. For example, consider a 4:2 priority encoder: the number of input
lines is n = 4 and the number of output lines is m = 2. The block diagram of the 4:2
priority encoder is shown in Fig. 3.15 with the equivalent gate level representation,
and the truth table is described in Table 3.9. The input In_A[3] has the highest
priority and the input In_A[0] has the lowest priority; ‘X’ indicates a do not care.
Fig. 3.16 Synthesized 4:2 priority encoder logic. Note In practical applications, encoders are
used to design the control logic. As ‘‘case’’ generates parallel logic and ‘‘if-else’’ generates
priority logic, ‘‘case’’ is used to describe the behavior of an encoder and ‘‘if-else’’ is used to
describe the behavior of a priority encoder. Priority encoders are used to sense level sensitive interrupts
The Verilog RTL description for 4:2 priority encoder is described in the
Example 3.16. The Verilog RTL infers the hardware as shown in Fig. 3.16.
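The priority behavior of Table 3.9 maps naturally onto a nested ‘‘if-else’’ chain, as the sketch below shows. The module name is an assumption and the code is illustrative rather than a reproduction of Example 3.16.

```verilog
// 4:2 priority encoder sketch (Table 3.9): In_A[3] has the highest priority.
module priority_encoder4to2 (
    input  wire [3:0] In_A,
    output reg  [1:0] Out_Y
);
    always @(*) begin
        if (In_A[3])
            Out_Y = 2'b11;
        else if (In_A[2])
            Out_Y = 2'b10;
        else if (In_A[1])
            Out_Y = 2'b01;
        else
            Out_Y = 2'b00;   // In_A[0] active, or no input active
    end
endmodule
```

The final else assigns Out_Y for every remaining input combination, so no latch is inferred.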
3.4 Summary
As discussed in this chapter, combinational logic RTL can be written efficiently
using the Verilog constructs. The key points are summarized below:
1. ‘‘assign’’ is used to infer the 2:1 MUX. The MUX is treated as a universal logic cell.
2. ‘‘if-else’’ generates the 2:1 MUX and ‘‘nested if’’ generates the priority logic.
3. ‘‘case-endcase’’ is used to model the parallel logic and is used inside the
procedural block.
4. The ‘‘default’’ condition in ‘‘case-endcase’’ is used to describe the conditions
not otherwise specified in the case.
5. The synthesis tool ignores the sensitivity list specified in the procedural blocks.
6. Decoders are used to select one memory or input–output device at a time.
7. Priority encoders are used in the design of interrupt control logic, and the logic
can be described by using nested ‘‘if-else’’.
Chapter 4
Combinational Design Guidelines
Abstract This chapter describes the design guidelines for combinational
logic designs. In practical ASIC designs, these guidelines are used to improve
the readability and performance of the design. The key practical guidelines discussed
are the use of ‘if-else’ and ‘case’ constructs and the practical scenarios for inferring
parallel and priority logic. The practical use of resource sharing and the use of
blocking assignments to describe combinational logic are explained in
detail. The key highlight of the chapter is the description of stratified event queuing
and logical partitioning. This chapter also describes the scenarios of a missing else or
default in the sequential statements and of combinational loops in the design. All the
guidelines in this chapter are covered with meaningful practical examples and
the synthesized logic is explained for better understanding.
Keywords Stratified event queue · Logical partitioning · Active · Inactive · NBA ·
Monitor · Postponed · Delay assignments · Logical equality · Case equality ·
Logical inequality · Case inequality · Blocking · Non-blocking · if-else · case ·
Default · Sensitivity · Looping · Race conditions · Oscillatory behavior · Multiple
driver
The design and coding guidelines are used to improve the performance,
readability, and reusability of the design. A combinational design, where the
output is a function of the present inputs, should be described in such a way that the
design has the least propagation delay and the least area.
Verilog supports two kinds of assignments in the procedural blocks:
blocking (=) and non-blocking (<=) assignments. It is
always recommended to use blocking assignments while describing
combinational logic. The reason is simple, but as an engineer it is essential
to understand the fundamentals behind it.
To understand blocking assignments, let us first understand the
concept of the stratified event queue. According to the IEEE 1364-2005 Verilog
standard, the stratified event queue is divided into four major regions, named
Active, Inactive, NBA, and Monitor.
The major question is why the stratified event queue needs to be understood
and what exactly it is used for. As the name indicates, the
stratified event queue is used to evaluate the expressions and update the results.
Figure 4.1 describes the stratified event queue according to the Verilog IEEE
1364-2005 standard.
As shown in Fig. 4.1, the Verilog stratified event queue has four main regions,
which are explained below.
i. Active Queue Most of the Verilog events are scheduled in the active event
queue. These events can be scheduled in any order and evaluated or updated in
any order. The active queue is used to update the blocking assignments and
continuous assignments, evaluate the RHS of the non-blocking assignments (the
LHS of an NBA is not updated in the active queue), execute $display commands,
and update the primitives.
ii. Inactive Queue The #0 delay assignments are updated in the inactive queue.
Using #0 delays in Verilog is not good practice, as it unnecessarily
complicates the event scheduling and ordering. Most of the time, the designer
uses #0 delay assignments to fool the simulator and avoid race
conditions.
iii. NBA Queue The LHS of the non-blocking assignments is updated in this queue.
iv. Monitor Queue This queue is used to evaluate and update the $monitor and
$strobe commands, after all the variables have been updated for the current
simulation time.
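The effect of the regions listed above can be observed with a small simulation-only sketch. The module and signal names are hypothetical; the point is that $display runs in the active region, before the non-blocking update, while $strobe runs at the end of the time step.

```verilog
// Simulation-only sketch of active, NBA, and monitor (postponed) regions.
module event_queue_demo;
    reg a, b;

    initial begin
        a = 1'b0;
        b = 1'b0;
        #1;
        a = 1'b1;    // blocking: updated immediately in the active region
        b <= 1'b1;   // non-blocking: RHS evaluated now, LHS updated in the NBA region
        $display("display: a=%b b=%b", a, b);  // prints a=1 b=0 (b not yet updated)
        $strobe ("strobe : a=%b b=%b", a, b);  // prints a=1 b=1 (end of time step)
    end
endmodule
```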
As discussed above, the blocking assignments execute sequentially inside the
procedural block. A blocking assignment blocks all the trailing statements in the
procedural block while the current statement executes. The execution of the blocking
[Fig. 4.1 Verilog stratified event queue. Events arrive from the previous time slot.
Active queue: (1) update of blocking assignments, (2) evaluation of the RHS of the
non-blocking assignments, (3) $display commands, (4) update of the outputs of
primitives, (5) evaluation of the inputs of primitives, (6) continuous assignments.
Inactive queue: #0 delay assignments and update. NBA queue: used to update the LHS
of NBA. Monitor or postponed queue: used to update $monitor and $strobe commands.]
It is recommended to incorporate all the required signals and inputs in the
sensitivity list of a combinational procedural block. Consider Example 4.2, which
describes the functionality of two-input NAND logic.
Example 4.1 Blocking assignments update in the procedural block. Note The major issue with
the blocking assignments is during the use of the same variable on the RHS side in one procedural
block and on LHS side in another procedural block. If both the procedural blocks are scheduled in
the same simulation time or on the same clock edge, it generates the race condition in the design.
This will be discussed in the subsequent chapters
In the Example 4.2, the synthesis tool ignores the sensitivity list and generates
the two input NAND gate as synthesizable output but the simulator ignores the
changes in the input ‘b_in’ and generates the output waveform. This leads to
simulation and synthesis mismatch. The simulation result is shown in Fig. 4.2.
begin
q_out<= data_in;
end
begin
end
In the procedural block, if the blocking (=) assignments are used, they are
updated in the active event queue. All the non-blocking assignments (<=) are
evaluated in the active event queue but updated in the non-blocking event queue.
Fig. 4.2 Incomplete sensitivity list waveform. Note To avoid the simulation and synthesis
mismatch it is recommended to use the procedural block: always@(*). According to IEEE
1364-2001 standard the ‘*’ in the sensitivity list will include all the inputs and required signals
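The recommendation in the note can be sketched for the two-input NAND case. The names `nand2` and `y_out` are assumptions, as is the idea that the original example listed only ‘a_in’ in its sensitivity list; the sketch shows the always@(*) form that avoids the mismatch.

```verilog
// Two-input NAND sketch with a complete sensitivity list.
// With always @(a_in) alone, a simulator would not re-evaluate y_out when
// only b_in changes; always @(*) includes every signal read in the block.
module nand2 (
    input  wire a_in,
    input  wire b_in,
    output reg  y_out
);
    always @(*) begin
        y_out = ~(a_in & b_in);
    end
endmodule
```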
4.4 Combinational Loops in Design
Unintentional combinational loops in the design are very difficult to debug and
fix during the implementation phase, and they generate oscillatory behavior.
Example 4.3 describes a combinational loop in the design.
Figure 4.3 describes the synthesized output for the combinational loop.
As discussed above, combinational loops are among the most dangerous and
critical design errors. A combinational loop can occur when the same signal
is used and updated in multiple procedural blocks. If the same signal is present
on both the right-hand side and the left-hand side of an expression, the
design has a combinational loop.
Combinational loops exhibit the oscillatory behavior and during updating, they
can have race conditions. Consider the design scenario shown in Example 4.4.
In Example 4.4, both always blocks execute concurrently and due to that, while
updating the b value the new value to a is assigned. This has race condition in the
design. This design generates the oscillatory behavior due to events on a, b.
Fig. 4.3 Combinational loop outcome. Note It is recommended that the design should not have
any combinational loop. To avoid the combinational loop break the feedback path by using the
sequential elements
always@(a)
begin
  b=a;
end
always@(b)
begin
  a=b;
end
always@(posedge clk)
begin
  b<=a;
end
always@(posedge clk)
begin
  a<=b;
end
executed concurrently, the non-blocking assignments are queued in the NBA queue
and hence generates the structure as shown in Fig. 4.4.
It is recommended that the design should not have unintended latches, as a latch is
transparent during the active level and transfers the data to its output.
Unintentional latches are not recommended in ASIC design because they cause issues
during design testing or DFT. Even during STA, the timing algorithm is
not able to determine whether the data is sampled on the positive edge of the clock
or on the negative edge of the clock. As the real intention of the designer
is not reflected in the inferred hardware, STA for such paths is difficult. This
will be discussed in subsequent chapters.
Consider the functionality shown in the following Example 4.7.
In the above code, as during the else clause the information about the update of
b_in is not given, it infers the latch and holds the previous value of b_in. The
diagrammatic representation is shown in Fig. 4.5. If-else statement infers multi-
plexer for a_in assignment and for b_in assignment it infers the positive level
sensitive latch controlled by enable input c_in.
As shown in Fig. 4.5, due to missing b_in assignment in the else clause it
generates the latch and holds the previous value assigned in the if clause. Latches
if (c_in==1)
begin
  a_in=1'b1;
  b_in=1'b1;
end
else
begin
  a_in=1'b0;
end
end
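One common way to avoid the unintended latch is to assign every output on every path, for instance with default assignments at the top of the block. In the sketch below the names `a_out` and `b_out` are hypothetical stand-ins for the assigned signals, and the value driven when c_in is ‘0’ is an assumption, since the text does not state the intended value.

```verilog
// Latch-free sketch: default assignments cover every path through the block.
module no_latch (
    input  wire c_in,
    output reg  a_out,
    output reg  b_out
);
    always @(*) begin
        a_out = 1'b0;          // defaults: both outputs get a value
        b_out = 1'b0;          // even when the if condition is false (assumed value)
        if (c_in == 1'b1) begin
            a_out = 1'b1;
            b_out = 1'b1;
        end
    end
endmodule
```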
As discussed above, blocking assignments are denoted by (=) and are used inside a
procedural block to describe the functionality of combinational logic.
Readers should not confuse them with the (=) used in the
continuous assignment ‘assign’. Example 4.8 uses multiple assign constructs to
describe the functionality of the design.
Example 4.8 Continuous assignment Verilog RTL. Note It is recommended to use the full adder
to perform the subtraction operation. Subtraction is performed using 2’s complement addition.
Multiple continuous assignment statements execute concurrently
Consider the use of blocking assignments in a procedural block. If
the order of the blocking assignments is not proper, there is a chance of
simulation and synthesis mismatch.
In Example 4.9, the difference between the simulation and synthesis
outcomes is due to the ordering of the blocking statements. A blocking assignment
blocks the execution of the next statement until the current statement has
executed. Readers are encouraged to use only blocking assignments for combinational
logic, but care should be taken with statement order to get the intended results.
The synthesis result for the above example is shown in Fig. 4.6 and it generates
two wires. During simulation, however, ‘y2_out’ is updated with the value of
‘a_in’ from the previous time stamp. This results in simulation and synthesis mismatch.
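The ordering issue can be illustrated with the following sketch, which is an assumption about the kind of code involved and not the book's Example 4.9. The signal `y1_out` and the module name are hypothetical; ‘a_in’ and ‘y2_out’ come from the text.

```verilog
// Ordering sketch for blocking assignments in a combinational block.
module blocking_order (
    input  wire a_in,
    output reg  y1_out,
    output reg  y2_out
);
    always @(*) begin
        // Intended order: y1_out is updated first, so y2_out sees the new value.
        y1_out = a_in;
        y2_out = y1_out;
        // If the two statements were swapped inside always @(a_in), y2_out
        // would pick up the old y1_out in simulation, while synthesis would
        // still produce two wires driven by a_in.
    end
endmodule
```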
4.7 Use of If-Else Versus Case Statements
When all the case conditions are covered in ‘case-endcase’, the statement is said to
be a full-case statement. For combinational design, the case statement should use
only blocking assignments.
The synthesis result for 4:1 MUX is shown in Fig. 4.7 and it infers parallel logic.
Fig. 4.7 Parallel logic inference for 4:1 MUX using ‘case’
While describing the functionality of decoding logic, use the continuous assignment
or ‘case’ construct. Both will generate the parallel logic. As discussed in Chap. 3,
decoder has parallel select inputs and generates parallel outputs.
If the decoder is described using ‘case-endcase’ statement, then it also infers the
parallel logic. The hardware description for decoder implementation using ‘assign’
and ‘case-endcase’ is shown in Fig. 4.8 (Example 4.11).
To describe the encoder functionality, use the ‘if-else’ construct, as it allows the
priority to be defined. The functionality of the 4:2 encoder is described using the
‘if-else’ construct and it infers priority logic. For Example 4.12, the synthesized
outcome is shown in Fig. 4.9.
If all conditions are not covered in the ‘case-endcase’ expression, then it infers
latches in the design. If all case conditions are not required by the design
functionality, then it is recommended to use the ‘default’ clause. If ‘default’ is
missing, the synthesizer flags a warning for the missing ‘case’ conditions and
infers latches in the design.
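A ‘case’ description that stays latch-free by including a ‘default’ clause can be sketched as follows. The 4:1 MUX is used only as a convenient illustration, and the module and port names are assumptions.

```verilog
// 4:1 MUX sketch using case with a default clause to avoid latch inference.
module mux4to1_case (
    input  wire       a,
    input  wire       b,
    input  wire       c,
    input  wire       d,
    input  wire [1:0] sel,
    output reg        y
);
    always @(*) begin
        case (sel)
            2'b00:   y = a;
            2'b01:   y = b;
            2'b10:   y = c;
            default: y = d;   // covers 2'b11 and any 'x'/'z' select in simulation
        endcase
    end
endmodule
```

Removing the default branch here would leave ‘y’ unassigned for sel = 2'b11, which is the situation in which a latch is inferred.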
As shown in the example the 4:1 MUX functionality is described using nested
‘if-else’ but due to missing ‘else’ clause it infers 4:1 MUX with the unintentional
latches. It is recommended to avoid the unintentional latches by incorporating the
‘else’ clause at the desired and required places in the RTL code.
For the Example 4.14, the similar hardware is generated as shown in Fig. 4.10.
Logical equality (==) and logical inequality (!=) operators are used in
synthesizable designs, whereas case equality (===) and case inequality (!==) are not
recommended in synthesizable designs.
always@(a_in, b_in)
begin
  if (a_in==b_in)
    y_out = a_in ^ b_in;
  else
    y_out = a_in & b_in;
end
// If either a_in or b_in is 'x' or 'z', (a_in==b_in) evaluates to 'x',
// the else branch is taken, and the result is y_out = a_in & b_in;
always@(a_in, b_in)
begin
  if (a_in===b_in)
    y_out = a_in ^ b_in;
  else
    y_out = a_in & b_in;
end
// With ===, 'x' and 'z' are compared literally: when a_in and b_in hold the
// same value, including both 'x' or both 'z', the result is y_out = a_in ^ b_in;
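The difference between the two operator families can be checked with a small simulation-only sketch. The module name is hypothetical and the signal names mirror the fragments above.

```verilog
// Simulation-only sketch comparing logical equality and case equality.
module equality_demo;
    reg a_in, b_in;

    initial begin
        a_in = 1'bx;
        b_in = 1'bx;
        $display("x ==  x gives %b", a_in ==  b_in);  // x: result is unknown
        $display("x === x gives %b", a_in === b_in);  // 1: 'x' matches 'x' literally

        b_in = 1'b0;
        $display("x ==  0 gives %b", a_in ==  b_in);  // x: result is unknown
        $display("x === 0 gives %b", a_in === b_in);  // 0: the values differ
    end
endmodule
```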
Example 4.15 describes a design without resource sharing. The intended design
functionality is the combinational logic shown in Table 4.1.
As shown in the synthesized logic in Fig. 4.11, it uses three full adders and two
multiplexers. The synthesized logic is inefficient because all the additions are
performed simultaneously and only the multiplexer output depends on the control
signals. This wastes power and is inefficient as far as area utilization is concerned.
Resource sharing is one of the efficient techniques used in ASIC design to share
common resources. As discussed for Example 4.15, the adders generate their
results simultaneously and wait for the control signal, either ‘s1_in’ or ‘s2_in’
(Example 4.16, Fig. 4.12).
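The idea of resource sharing can be sketched with a simplified, single-select case. This does not reproduce Table 4.1 or Examples 4.15 and 4.16: the module name, the four-bit operand widths, and the signals `s_in`, `op1` and `op2` are all assumptions chosen for illustration.

```verilog
// Resource sharing sketch: the operands are multiplexed first, so a single
// adder serves both cases (instead of two adders followed by a multiplexer).
module shared_adder (
    input  wire [3:0] a_in,
    input  wire [3:0] b_in,
    input  wire [3:0] c_in,
    input  wire [3:0] d_in,
    input  wire       s_in,    // hypothetical select: 1 -> a_in + b_in, 0 -> c_in + d_in
    output wire [4:0] y_out
);
    wire [3:0] op1 = s_in ? a_in : c_in;
    wire [3:0] op2 = s_in ? b_in : d_in;

    assign y_out = op1 + op2;  // one shared adder, result includes the carry bit
endmodule
```

The trade-off in this sketch is that the multiplexers now sit in front of the adder, so the select signal is on the path into the addition.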
Example 4.15 Verilog RTL without using the concept of resource sharing
4.16 Summary
Abstract This chapter describes the detail practical understanding about the
sequential logic designs. RTL coding using Verilog is described in detail with the
practical scenarios and concepts. The Verilog RTL for the flip-flops, latches, var-
ious counters, shift registers, and memories is covered with the synthesized results
and explanations. The practical do’s and don’ts are explained with the meaningful
diagrams and timing sequences. This chapter will be useful for the ASIC designers
while coding for the sequential logic. This chapter also covers the necessity of
registered input and register outputs.
Keywords Latch · Flip-flop · Edge triggered · Level sensitive · Asynchronous ·
Synchronous · Toggle · Cumulative delay · Up-down · Shift register · Ripple ·
Johnson · Ring · Register input · Register output · Memory · Performance ·
Timing analysis · Glitch · Spike
Sequential logic is defined as digital logic whose output is a function of the present
input and the past output, so sequential logic holds binary data. Sequential logic
elements are latches and flip-flops, and they are used to design the sequential
circuits for the given design functionality. For the RTL design engineer it is essential
to understand efficient RTL design for clock-based logic circuits. Sequential
logic elements are used to hold larger amounts of data in complex designs. The
logic is triggered on the clock. The subsequent chapter discusses efficient
Verilog RTL to describe the required sequential logic. In practical applications,
it is always essential to describe logic that can be triggered on either the
positive edge or the negative edge of the clock. It is always expected that
the designed circuit should generate a finite output for a finite duration of the clock
period. Figure 5.1 describes the basic sequential logic triggered on the positive edge
of the clock. The output from the logic is a function of the present input and the past output.
Latches are level sensitive. In the D latch, D stands for the data input. A latch is sensitive to either the positive or the negative level of the clock or enable. The positive level sensitive latch is shown in Fig. 5.2 and its truth table is described in Table 5.1.
As shown in Table 5.1, when the latch enable ‘E’ is at the positive level (logical ‘1’), the output ‘Q’ is equal to the data input ‘D’; otherwise the output remains in the previous state (past output), shown by Qn−1. The timing sequence is shown in Fig. 5.3.
From the timing sequence it is clear that the output ‘Q’ is equal to the data input ‘D’ during the time period for which the enable input ‘E’ is at the positive level, so the D latch is transparent during this period. During the negative level (logical ‘0’) of the enable ‘E’, the D latch holds the previous value.
The next question is how to describe the positive level sensitive D latch using Verilog. The design functionality is simple to visualize and to describe. Example 5.1 describes the Verilog RTL for the positive level sensitive D latch and the synthesizable hardware is shown in Fig. 5.4.
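A positive level sensitive D latch of the kind described above can be sketched as follows. This is an illustrative sketch and not a reproduction of Example 5.1; the module name is an assumption, and the port names ‘D’, ‘E’, and ‘Q’ follow Table 5.1.

```verilog
// Sketch of a positive level sensitive D latch (module name assumed).
module d_latch_pos (
    input  wire D,   // data input
    input  wire E,   // latch enable, active high
    output reg  Q
);
    // Level sensitive: evaluated whenever E or D changes.
    always @(E or D)
        if (E)
            Q <= D;  // transparent while E = 1
        // No else branch: Q holds its previous value while E = 0,
        // which is what makes the synthesizer infer a latch.
endmodule
```

The missing ‘else’ branch is intentional here. In combinational code the same omission would create an unintended latch.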
The truth table of the negative level sensitive D latch is described in Table 5.2. The latch has an active low (negative level sensitive) latch enable ‘LE_n’, data input ‘D’, and output ‘Q’.
The equivalent gate level representation is shown in Fig. 5.5. The latch is transparent during the negative level of ‘LE_n’ and holds the data during the positive level of ‘LE_n’. The timing sequence is shown in Fig. 5.6.
The Verilog RTL description is shown in Example 5.2 and the synthesized
hardware is shown in Fig. 5.7.
Example 5.2 Synthesizable verilog RTL for negative level sensitive D latch
5.2 Flip-Flop
A flip-flop is an edge triggered logic circuit. It can be triggered either on the positive edge or on the negative edge of the clock. A flip-flop can be realized by cascading positive and negative level sensitive latches, and it is used as a memory storage element. The flip-flop types are set-reset (SR), JK, D, and toggle. In ASIC design the D flip-flop is used as the sequential circuit element, where D stands for the data input. The subsequent section discusses the positive and negative edge triggered flip-flops.
An asynchronous reset is not part of the data path. It initializes the flip-flop irrespective of the active clock edge, hence the name. This technique is not recommended for internally generated reset signals because it is prone to glitches. The designer needs to synchronize such a reset signal internally to avoid the glitches, and the internally synchronized reset signal is then applied to the storage elements. Reset deassertion is the main problem with asynchronous resets: if the reset is released close to the active clock edge, the flip-flops may leave reset in different clock cycles or become metastable. This problem can be overcome by using a two-stage level synchronizer, which avoids race conditions during reset deassertion.
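A two-stage synchronizer for reset deassertion can be sketched as follows. The module name and the output name ‘sync_reset_n’ are assumptions; only ‘reset_n’ comes from the text. The reset is asserted asynchronously and released synchronously to ‘clk’.

```verilog
// Sketch of a two-stage reset synchronizer (names other than reset_n assumed).
module reset_sync (
    input  wire clk,
    input  wire reset_n,       // raw active low asynchronous reset
    output wire sync_reset_n   // reset with synchronized deassertion
);
    reg [1:0] sync_ff;

    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            sync_ff <= 2'b00;                // assert immediately
        else
            sync_ff <= {sync_ff[0], 1'b1};   // release after two clock edges

    assign sync_reset_n = sync_ff[1];
endmodule
```

The output ‘sync_reset_n’ would then drive the asynchronous reset pins of the storage elements in that clock domain.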
The Verilog RTL is shown in Example 5.3 and uses the active low asynchronous reset signal ‘reset_n’.
The synthesized logic for D flip-flop with asynchronous reset ‘reset_n’ is shown
in Fig. 5.10.
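A D flip-flop with an active low asynchronous reset can be sketched as follows. This is an illustrative sketch rather than the text of Example 5.3; the module name and the names ‘d_in’ and ‘q_out’ are assumptions.

```verilog
// Sketch of a positive edge triggered D flip-flop with asynchronous reset.
module dff_async_reset (
    input  wire clk,
    input  wire reset_n,   // active low asynchronous reset
    input  wire d_in,      // assumed name
    output reg  q_out      // assumed name
);
    // reset_n appears in the sensitivity list, so it acts
    // independently of the clock edge.
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q_out <= 1'b0;
        else
            q_out <= d_in;
endmodule
```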
110 5 Sequential Logic Design
With a synchronous reset, the reset signal is part of the data path and takes effect only on the active clock edge. Because the reset is sampled by the clock, the synchronous reset does not have the issues of glitches or hazards, so this approach is well suited for the design. This mechanism does not require an additional synchronization circuit.
Verilog RTL is described in Example 5.4 and uses active low synchronous reset
signal ‘reset_n’.
The synthesized logic for positive edge triggered D flip-flop with synchronous
reset input is shown in Fig. 5.11.
If multiple signals or inputs are part of the data path and are sampled on the active edge of the clock, then the output of the sequential cell is assigned on the active edge of the clock. Consider the Verilog RTL shown in Example 5.6. The inputs ‘reset_n’ and ‘load_en’ are synchronous and are sampled on the positive edge of the clock. The synchronous input ‘reset_n’ has the highest priority and ‘load_en’ has the lowest priority.
The synthesized logic is shown in Fig. 5.13, where ‘reset_n’ and ‘load_en’ are part of the data path.
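The priority between the synchronous inputs can be sketched as follows. This is an illustration and not the text of Example 5.6; the module name and the names ‘d_in’ and ‘q_out’ are assumptions, and holding the output when ‘load_en’ is low is also an assumption.

```verilog
// Sketch of a flip-flop with synchronous reset and synchronous load.
module dff_sync_reset_load (
    input  wire clk,
    input  wire reset_n,   // active low synchronous reset, highest priority
    input  wire load_en,   // synchronous load enable, lowest priority
    input  wire d_in,      // assumed name
    output reg  q_out      // assumed name
);
    // Only clk is in the sensitivity list, so reset_n and load_en
    // become part of the data path in front of the D input.
    always @(posedge clk)
        if (!reset_n)
            q_out <= 1'b0;
        else if (load_en)
            q_out <= d_in;
        // otherwise q_out holds its value (assumed behaviour)
endmodule
```

The order of the ‘if-else’ branches sets the priority, which is why ‘reset_n’ is tested first.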
5.3 Synchronous and Asynchronous Reset 113
Example 5.5 Verilog RTL of D flip-flop with asynchronous load and reset
If all the storage elements are triggered by the same source clock signal, then the design is said to be synchronous. The advantage of a synchronous design is that all the outputs change together, so the overall propagation delay of the design is equal to the propagation delay of a single flip-flop or storage element. Static timing analysis (STA) is straightforward for synchronous logic, and the performance can be improved further by using pipelining. Most ASIC implementations use synchronous logic. This section describes the synchronous counter design.
A four-bit binary counter counts from ‘0000’ to ‘1111’ and a four-bit BCD counter counts from ‘0000’ to ‘1001’. Figure 5.14 shows the four-bit binary counter, where every stage acts as a divide-by-two counter.
As shown in Fig. 5.14, the counter has four output lines ‘QA, QB, QC, QD’, where ‘QA’ is the LSB and ‘QD’ is the MSB. The output at ‘QA’ toggles on every clock pulse, so its frequency is the input clock frequency divided by two. The output at ‘QB’ toggles every two clock cycles and is therefore divided by four, the output at ‘QC’ toggles every four clock cycles and is divided by eight, and the output at ‘QD’ toggles every eight clock cycles and is divided by sixteen. In practical applications counters are used as clock divider networks. Counters are also used in frequency synthesizers to generate variable frequency outputs.
Fig. 5.13 Synthesized logic with synchronous reset_n and synchronous load
Counters are used to generate a predefined count sequence on the active edge of the clock. In ASIC design it is essential to write efficient RTL code for the counter by using synthesizable constructs. A three-bit up counter is described using Verilog to generate a synthesizable design. The counter counts from ‘000’ to ‘111’ on the positive edge of the clock and wraps around to ‘000’ on the next positive edge of the clock. The counter described in Example 5.7 is a presettable counter and has a synchronous active high ‘load_en’ input to load the required three-bit preset value. The three-bit data input is indicated as ‘data_in’.
The counter has an active low asynchronous ‘reset_n’ input; when it is asserted (logic ‘0’) the output ‘q_out’ is ‘000’. During normal operation ‘reset_n’ is held at logic ‘1’.
The synthesized logic is shown in Fig. 5.15 and has the three-bit data input lines ‘data_in’, the active high ‘load_en’, and the active low reset input ‘reset_n’. The output is indicated by the ‘q_out’ lines and the positive edge triggered clock by ‘clk’.
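The presettable three-bit up counter described above can be sketched as follows. The port names follow the text; the module name is an assumption, and the sketch is not a reproduction of Example 5.7.

```verilog
// Sketch of a three-bit presettable up counter.
module up_counter3 (
    input  wire       clk,
    input  wire       reset_n,   // active low asynchronous reset
    input  wire       load_en,   // active high synchronous load
    input  wire [2:0] data_in,
    output reg  [2:0] q_out
);
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q_out <= 3'b000;
        else if (load_en)
            q_out <= data_in;        // load the preset value
        else
            q_out <= q_out + 1'b1;   // '111' wraps around to '000'
endmodule
```

Because ‘q_out’ is three bits wide, the wrap from ‘111’ to ‘000’ needs no explicit comparison.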
5.4 Synchronous Counters 117
The counter has an active low asynchronous ‘reset_n’ input; when it is asserted (logic ‘0’) the output ‘q_out’ is ‘000’. During normal operation ‘reset_n’ is held at logic ‘1’.
The synthesized logic is shown in Fig. 5.17 and has the three-bit data input lines ‘data_in’, the active high ‘load_en’, and the active low reset input ‘reset_n’. The output is indicated by the ‘q_out’ lines and the positive edge triggered clock by ‘clk’.
The synthesized logic is shown in Fig. 5.19 and has the three-bit data input lines ‘data_in’, the active high ‘load_en’, and the active low reset input ‘reset_n’. The output is indicated by the ‘q_out’ lines, the positive edge triggered clock by ‘clk’, and the select line by ‘up_down’.
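An up-down counter with the ports listed above can be sketched as follows. The module name is an assumption, and so is the polarity of ‘up_down’ (logic ‘1’ counts up), since the text does not state it.

```verilog
// Sketch of a three-bit presettable up-down counter.
module up_down_counter3 (
    input  wire       clk,
    input  wire       reset_n,   // active low asynchronous reset
    input  wire       load_en,   // active high synchronous load
    input  wire       up_down,   // assumed polarity: 1 = up, 0 = down
    input  wire [2:0] data_in,
    output reg  [2:0] q_out
);
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q_out <= 3'b000;
        else if (load_en)
            q_out <= data_in;
        else if (up_down)
            q_out <= q_out + 1'b1;
        else
            q_out <= q_out - 1'b1;
endmodule
```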
Gray counters are used in multiple clock domain designs because only one bit changes on each active clock edge. Gray codes are used in synchronizers. In the Gray counter described in the example, only one bit changes on the active clock edge with reference to the previous output of the counter. The active high reset input is ‘rst’. When ‘rst = 1’ the counter output ‘out’ is assigned to ‘0000’.
The counter described in Example 5.10 is a presettable counter and has a synchronous active high ‘load_en’ input to load the required four-bit preset value. The four-bit data input is indicated as ‘data_in’.
The counter has an active high asynchronous reset input ‘rst’; when it is asserted (logic ‘1’) the output ‘out’ is ‘0000’. During normal operation ‘rst’ is held at logic ‘0’.
Many practical applications need both binary and Gray counters. The Gray counter output can be generated from the binary counter output by using combinational logic. Refer to Chap. 2 for the binary to Gray and Gray to binary code converters.
The parameterized binary and Gray counter is described in Example 5.11, and the Verilog RTL generates four-bit binary and Gray outputs. For ‘rst_n = 0’ the binary and Gray counter outputs are assigned to ‘0000’. The four-bit Gray code output is denoted as ‘gray’.
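A parameterized binary and Gray counter can be sketched as follows. The names ‘rst_n’ and ‘gray’ come from the text; the module name, the output name ‘binary’, the parameter name ‘N’, and the asynchronous nature of the reset are assumptions. The sketch is not a reproduction of Example 5.11.

```verilog
// Sketch of a parameterized binary and Gray counter.
module bin_gray_counter #(
    parameter N = 4                 // counter width (assumed parameter name)
) (
    input  wire         clk,
    input  wire         rst_n,      // active low reset, assumed asynchronous
    output reg  [N-1:0] binary,     // assumed output name
    output wire [N-1:0] gray
);
    always @(posedge clk or negedge rst_n)
        if (!rst_n)
            binary <= {N{1'b0}};
        else
            binary <= binary + 1'b1;

    // Binary to Gray conversion: each bit is the XOR of the
    // corresponding binary bit and the next higher binary bit.
    assign gray = binary ^ (binary >> 1);
endmodule
```

When ‘binary’ is zero after reset, ‘gray’ is zero as well, which matches the reset behaviour described in the text.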
The simulation result for the four-bit binary counter is shown in the timing sequence in Fig. 5.20; on every positive edge of the clock the counter output increments by one.
Ring counters are used in practical applications to provide a predefined delay. These counters are synchronous in nature and are used in applications such as traffic light controllers and timers to introduce a predefined amount of delay. The internal logic structure of the four-bit ring counter using D flip-flops is shown in Fig. 5.21. As shown, the output of the MSB flip-flop is fed back to the LSB flip-flop input and the counter shifts the data on every active edge of the clock signal.
The Verilog RTL for the four-bit ring counter is described in Example 5.12. The counter has a ‘set_in’ input to set the initialization value of ‘1000’ and works on the positive edge of the clock signal.
The synthesized logic is shown in Fig. 5.22.
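A four-bit ring counter with the behaviour described above can be sketched as follows. The name ‘set_in’ and the value ‘1000’ come from the text; the module name, the output name ‘q_out’, and treating ‘set_in’ as a synchronous input are assumptions.

```verilog
// Sketch of a four-bit ring counter.
module ring_counter4 (
    input  wire       clk,
    input  wire       set_in,   // loads the initial pattern (assumed synchronous)
    output reg  [3:0] q_out     // assumed name
);
    always @(posedge clk)
        if (set_in)
            q_out <= 4'b1000;
        else
            // The MSB is fed back to the LSB; the other bits shift up by one.
            q_out <= {q_out[2:0], q_out[3]};
endmodule
```

After initialization the single ‘1’ circulates as 1000, 0001, 0010, 0100, and back to 1000.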
Example 5.11 Verilog RTL for parameterized binary and Gray counter
The Johnson counter is a special type of synchronous counter designed by using a shift register, in which the complement of the last stage output is fed back to the first stage. The internal structure of the three-bit Johnson counter is shown in Fig. 5.23.
The Verilog RTL for four-bit Johnson counter is shown in Example 5.13.
The synthesized logic is shown in Fig. 5.24.
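A four-bit Johnson counter can be sketched as follows. This is an illustration and not the text of Example 5.13; the module name, the port names, and the asynchronous active low reset are assumptions.

```verilog
// Sketch of a four-bit Johnson (twisted ring) counter.
module johnson_counter4 (
    input  wire       clk,
    input  wire       reset_n,   // assumed active low asynchronous reset
    output reg  [3:0] q_out      // assumed name
);
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q_out <= 4'b0000;
        else
            // The complement of the MSB is fed back to the LSB.
            q_out <= {q_out[2:0], ~q_out[3]};
endmodule
```

Starting from 0000 the counter steps through 0001, 0011, 0111, 1111, 1110, 1100, 1000, and back to 0000, so four flip-flops give eight states.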
In practical applications counters are designed using a parameter to improve readability and reusability. The integer parameter value defines the number of bits of the counter.
The Verilog RTL for the eight-bit parameterized counter is shown in Fig. 5.25. The synthesizable top-level module for the parameterized counter is shown in Fig. 5.26.
Shift registers are used in most practical applications to perform shifting or rotation operations on the active edge of the clock. The shifter timing sequence with reference to the positive edge of the clock signal is shown in Fig. 5.27. As shown in the timing sequence, on every positive edge of the clock the data shifts by one bit from the LSB towards the next stage; hence, for the four-bit shift register there is a latency of four clock cycles before valid output data is available at the MSB.
The Verilog RTL for the serial input serial output shift register is described in Example 5.14. As described in the example, the data ‘d_in’ is shifted on every clock edge to generate the serial output ‘q_out’. During normal operation the reset input ‘reset_n’ is held at logic ‘1’. The shift register needs four clock pulses before any change on the serial input appears at the serial output.
The synthesized logic with four registers for the serial input serial output shift register is shown in Fig. 5.28.
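A four-bit serial input serial output shift register can be sketched as follows. The names ‘d_in’, ‘q_out’, and ‘reset_n’ come from the text; the module name, the internal name ‘shift_reg’, and the asynchronous nature of the reset are assumptions.

```verilog
// Sketch of a four-bit serial input serial output shift register.
module siso_shift4 (
    input  wire clk,
    input  wire reset_n,   // active low reset, assumed asynchronous
    input  wire d_in,      // serial input
    output wire q_out      // serial output
);
    reg [3:0] shift_reg;   // assumed internal name

    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            shift_reg <= 4'b0000;
        else
            shift_reg <= {shift_reg[2:0], d_in};   // d_in enters at the LSB

    assign q_out = shift_reg[3];   // data leaves from the MSB after four clocks
endmodule
```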
Most practical applications involve right or left shifting of data. Consider a protocol which involves the processing of strings, where the requirement is to shift the string to the right or to the left by one bit or by multiple bits. In such a scenario bidirectional (right/left) shift registers are used.
The Verilog RTL for the bidirectional shift register is described in Example 5.15, and the direction of the data is controlled by the ‘right_left’ input. For ‘right_left = 1’ the data is shifted towards the right and for ‘right_left = 0’ the data is shifted towards the left.
The synthesized logic is shown in Fig. 5.29. It consists of four registers with additional combinational logic to control the direction of the data flow.
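A four-bit bidirectional shift register can be sketched as follows. Only ‘right_left’ and its polarity come from the text; the module name, the other port names, and the reset are assumptions.

```verilog
// Sketch of a four-bit bidirectional shift register.
module bidir_shift4 (
    input  wire       clk,
    input  wire       reset_n,     // assumed active low asynchronous reset
    input  wire       right_left,  // 1 = shift right, 0 = shift left
    input  wire       d_in,        // assumed serial input
    output reg  [3:0] q_out        // assumed parallel output
);
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q_out <= 4'b0000;
        else if (right_left)
            q_out <= {d_in, q_out[3:1]};   // shift right, d_in enters at the MSB
        else
            q_out <= {q_out[2:0], d_in};   // shift left, d_in enters at the LSB
endmodule
```

The ‘if-else’ on ‘right_left’ is what becomes the additional multiplexing logic in front of each register.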
5.5 Shift Register 131
[Fig. 5.25 Verilog RTL for the parameterized counter. Annotation: for the parameterized counter the parameter value is configured to be 8 (parameter N = 8) and hence it acts as an eight-bit counter.]
Example 5.14 Verilog RTL for serial input serial output shift register
PD and four-bit parallel output lines are named as QA, QB, QC, and QD. The PIPO register is triggered on the positive edge of the clock signal.
The Verilog RTL is described in Example 5.16.
The synthesized logic for the four-bit PIPO register is shown in Fig. 5.31.
Timing is a very important parameter for ASIC designs, and meeting timing for the sequential circuits is crucial in complex ASIC designs. The detailed timing analysis and frequency calculations for RTL designs will be discussed in the subsequent section.
For a better understanding of the same, it is essential to have an overview of registered inputs and registered outputs. In practical ASIC designs the Verilog code should be written efficiently with registered inputs and registered outputs. The reason is to simplify timing analysis and to get clean register-to-register paths.
The Verilog RTL with the registered output is shown in Example 5.17. It is assumed that another module drives the input signals ‘a’, ‘b’, ‘c’, ‘d’, and ‘select’, and that all these inputs are registered. This gives a clean register-to-register path and easy timing analysis.
The synthesized logic is shown in Fig. 5.32 and generates the eight-bit parallel input parallel output register. The logic is triggered on the positive edge of the clock.
In asynchronous counters the flip-flops are not driven by a common clock source. If the output of the LSB flip-flop is used as the clock input of the subsequent flip-flop, then the design is asynchronous. The issue with the asynchronous design is the cumulative clock-to-Q delay of the flip-flops due to the cascading of the stages. Asynchronous counters are not recommended in ASIC design because they are prone to glitches or spikes, and the timing analysis for such designs is very complex.
5.7 Asynchronous Counter Design 139
The ripple counter is an asynchronous counter and is shown in Fig. 5.33. As shown in the logic diagram, all the flip-flops are positive edge triggered and the LSB register receives the clock from the master clock source. The output of the LSB flip-flop is given as the clock input to the next stage, and so on for the subsequent stages.
The Verilog RTL for the four-bit ripple up counter is shown in Example 5.18.
The synthesized logic is shown in Fig. 5.34.
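A four-bit ripple up counter can be sketched as follows. This is an illustration and not the text of Example 5.18; all names are assumptions. To count up, each stage after the first toggles on the falling edge of the previous output, which is equivalent to a positive edge triggered flip-flop clocked from the inverted output of the previous stage.

```verilog
// Sketch of a four-bit ripple (asynchronous) up counter; names assumed.
module ripple_up_counter4 (
    input  wire       clk,
    input  wire       reset_n,   // assumed active low asynchronous reset
    output wire [3:0] q_out
);
    reg q0, q1, q2, q3;

    // Only the LSB stage is clocked by the master clock.
    always @(posedge clk or negedge reset_n)
        if (!reset_n) q0 <= 1'b0; else q0 <= ~q0;

    // Each later stage is clocked by the previous stage output.
    always @(negedge q0 or negedge reset_n)
        if (!reset_n) q1 <= 1'b0; else q1 <= ~q1;

    always @(negedge q1 or negedge reset_n)
        if (!reset_n) q2 <= 1'b0; else q2 <= ~q2;

    always @(negedge q2 or negedge reset_n)
        if (!reset_n) q3 <= 1'b0; else q3 <= ~q3;

    assign q_out = {q3, q2, q1, q0};
endmodule
```

The clock-to-Q delays add up from stage to stage, so ‘q3’ settles last and ‘q_out’ can pass through transient values. This is the cumulative delay issue described above.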
In most ASIC and SoC-based designs memories are used to store binary data. Memories can be of type ROM or RAM, and single port or dual port. The objective of this section is to describe a basic single port read-write memory. The timing sequence is shown in Fig. 5.35.
5.8 Memory Modules and Design 141
[Fig. 5.35 Timing sequence of the single port read-write memory; signals: clk, cs, rd_wr, address, data_in]
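A basic single port read-write memory using the signals of the timing sequence can be sketched as follows. The names ‘clk’, ‘cs’, ‘rd_wr’, ‘address’, and ‘data_in’ come from the figure; the module name, the output name ‘data_out’, the parameter names, the memory size, and the polarity of ‘rd_wr’ (logic ‘1’ reads, logic ‘0’ writes) are assumptions.

```verilog
// Sketch of a single port synchronous read-write memory.
module single_port_mem #(
    parameter ADDR_W = 4,   // assumed address width
    parameter DATA_W = 8    // assumed data width
) (
    input  wire              clk,
    input  wire              cs,        // chip select, assumed active high
    input  wire              rd_wr,     // assumed polarity: 1 = read, 0 = write
    input  wire [ADDR_W-1:0] address,
    input  wire [DATA_W-1:0] data_in,
    output reg  [DATA_W-1:0] data_out   // assumed name
);
    reg [DATA_W-1:0] mem [0:(1 << ADDR_W) - 1];

    always @(posedge clk)
        if (cs) begin
            if (rd_wr)
                data_out <= mem[address];   // registered read
            else
                mem[address] <= data_in;    // write
        end
endmodule
```

Both the read and the write happen on the positive clock edge, so the read data appears one clock after the address is applied.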
5.9 Summary
The following are the key points to summarize the sequential logic design.
1. Latches are level sensitive and not recommended in the ASIC designs.
2. Flip-flops are edge triggered and are recommended in the ASIC designs.
3. Flip-flops are described by using procedural block ‘always’ and triggered by
either ‘posedge clk’ or ‘negedge clk’.
4. Binary counters can be designed by using synchronous design concept or
asynchronous design concept.
5. Gray counters can be designed by using the binary counters with the additional
combinational logic.
6. Synchronous counters are recommended in the ASIC design as timing analysis
will be easy and they are not prone to the glitches.
7. Asynchronous counters are prone to glitches or spikes and hence not recommended in the ASIC designs.
8. Special counters like ring and Johnson can be designed by using the shift
registers.
9. The memories can be described by using the Verilog RTL to perform the read
and write operation.
Chapter 6
Sequential Design Guidelines
Abstract This chapter describes the key sequential design guidelines used in ASIC design. These guidelines are essential for any ASIC design; they improve readability and performance and need to be followed by an ASIC design engineer. The key guidelines include the use of nonblocking assignments in sequential designs, the use of synchronous resets, and clock gating. The guidelines for using pipelined stages in the design are described in detail and are useful for improving the design performance. This chapter also covers basic information about describing the Verilog RTL with multiple clocks and multiphase clocks, and the issues with asynchronous resets.
Keywords Blocking · Nonblocking · Synchronous reset · Asynchronous reset · Cycle stealing · Time borrowing · Clock gating · Pipelining · Multiphase · Multiple clock domains · Data path · Control path
As described in Example 6.1, blocking assignments are used in multiple “always” blocks. Each procedural “always” block is triggered on the positive edge of the clock and the synthesizer infers sequential logic. As discussed already, all the blocking assignments are evaluated and updated in the active queue. Readers are requested to refer to the section on stratified event queuing in Chap. 4.
As described in Example 6.1, both “always” blocks execute in parallel and generate the output of a two-bit serial-input serial-output shift register. The first “always” block generates an output “b_in.” The output generated from the first “always” block is used as an input by the other “always” block. Hence, the synthesizer infers a two-bit serial-input serial-output shift register. Note that in simulation the result depends on the order in which the simulator executes the two blocks, which is one reason why nonblocking assignments are preferred for sequential logic.
The synthesized logic for Example 6.1 is shown in Fig. 6.1, and has inputs “a_in” and “clk” and an output “y_out.”
If blocking assignments are used to describe the sequential logic and multiple assignments are used in the same “always” procedural block, then the synthesized logic may or may not match the intended design. The reason is that with blocking assignments each statement completes, and updates its target, before the next statement is executed. A later statement therefore sees the value just assigned by an earlier one, so intermediate storage elements can be optimized away and the synthesis output may be undesired.
Consider the design scenario described in Example 6.2. The intention is to create a three-bit serial-input serial-output shift register, but after synthesis Example 6.2 infers only a single flip-flop.
Fig. 6.1 Synthesized logic for blocking assignments in multiple always block
The synthesized logic is shown in Fig. 6.2 and has inputs ‘a’ and ‘clk’ and an output ‘y’. The required functionality is a serial-input serial-output shift register, but the above example infers a single flip-flop due to the use of blocking assignments. So, it is recommended to use nonblocking assignments while describing the RTL for sequential functionality.
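The collapse into a single flip-flop can be sketched as follows. This is an illustration of the effect and not the text of Example 6.2; the names ‘a’, ‘clk’, and ‘y’ follow the text, while the module name and the intermediate names ‘t1’ and ‘t2’ are assumptions.

```verilog
// Sketch: blocking assignments in one always block collapse the shift register.
module blocking_same_block (
    input  wire clk,
    input  wire a,
    output reg  y
);
    reg t1, t2;   // assumed intermediate names

    always @(posedge clk) begin
        t1 = a;    // t1 is updated immediately
        t2 = t1;   // sees the new t1, that is, a
        y  = t2;   // sees the new t2, that is, a
    end
    // Effectively y = a on every clock edge: a single flip-flop,
    // with t1 and t2 optimized away because nothing else reads them.
endmodule
```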
6.1 Use of Blocking Assignments 149
Consider the design scenario described in Example 6.3. The intention is to create a three-bit serial-input serial-output shift register and, because of the order of the blocking assignment statements inside the “begin” and “end” block, synthesis generates the three-bit serial-input serial-output shift register.
The synthesized logic is shown in Fig. 6.3 and has inputs ‘a’ and ‘clk’ and an output ‘y’. The required functionality is a serial-input serial-output shift register and the synthesizer infers it. So, the important point to remember is that the order of the blocking assignment statements inside the procedural “always” block is a decisive factor in synthesis.
If nonblocking assignments are used to describe the sequential logic and multiple assignments are used in the same “always” procedural block, then the intended logic is always inferred by the synthesizer. The reason is that with nonblocking assignments all the right-hand sides of the statements in the “begin-end” block are evaluated first and the targets are updated together at the end of the time step, so the statements behave as if executed in parallel. Each assignment therefore infers its own register.
Consider the design scenario described in Example 6.5. The intention is to create a three-bit serial-input serial-output shift register using nonblocking assignments.
The synthesized logic is shown in Fig. 6.5 and has inputs ‘a’ and ‘clk’ and an output ‘y’. The required functionality is a serial-input serial-output shift register and the synthesizer infers it.
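A three-bit shift register written with nonblocking assignments can be sketched as follows. This is an illustration and not the text of Example 6.5; the names ‘a’, ‘clk’, and ‘y’ follow the text, while the module name and the intermediate names ‘t1’ and ‘t2’ are assumptions.

```verilog
// Sketch: nonblocking assignments infer one flip-flop per assignment.
module nonblocking_same_block (
    input  wire clk,
    input  wire a,
    output reg  y
);
    reg t1, t2;   // assumed intermediate names

    always @(posedge clk) begin
        t1 <= a;    // all right-hand sides are sampled first,
        t2 <= t1;   // so t2 gets the old t1
        y  <= t2;   // and y gets the old t2
    end
endmodule
```

Writing the three statements in any other order gives the same three flip-flops, because every right-hand side uses the value from before the clock edge.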
6.2 Nonblocking Assignments 151
Fig. 6.4 Synthesized logic for nonblocking assignments in the different always blocks
Fig. 6.5 Synthesized logic for nonblocking assignments in the same always block
Consider the design scenario described in Example 6.6. The intention is to create a three-bit serial-input serial-output shift register using nonblocking assignments. The nonblocking assignments used in Sect. 6.2.2 are reordered in Example 6.6.
The synthesized logic is shown in Fig. 6.5 and has inputs ‘a’ and ‘clk’ and an output ‘y’. The required functionality is a serial-input serial-output shift register and the synthesizer infers it. So, the important point to remember is that the order of the nonblocking assignment statements inside the procedural “always” block is not a decisive factor in inferring logic.
Example 6.6 Nonblocking assignment with order change in the same always block
In practical sequential designs, latches and flip-flops are used as the elements to implement the intended design functionality. A latch is level triggered and a flip-flop is edge triggered. Most ASIC designs use flip-flops as the sequential elements.
6.3.1 D Flip-Flop
As discussed earlier, a flip-flop is edge triggered. The area of a flip-flop cell is larger than that of a latch, and additional power control logic is required because the power consumption due to the free running clock is higher. A flip-flop does not support cycle stealing or time borrowing; the operation needs to be completed in one clock cycle. For a flip-flop-based design the setup time must be met, and the overall operating frequency of the design depends upon the critical path in the design. Timing analysis and time budgeting are clearer for flip-flop-based designs.
The D flip-flop RTL is described in Example 6.7 and uses the nonblocking
assignment. Input ‘D’ is assigned to output ‘Q’ on positive edge of clock.
The synthesized logic for the positive edge triggered D flip-flop is shown in
Fig. 6.6.
6.3.2 Latch
As discussed earlier, a latch is level triggered. The area of a latch cell is smaller than that of a flip-flop, and additional power control logic is not required because the power consumption is lower due to the low switching activity at the latch enable input. A latch supports cycle stealing or time borrowing, which is useful in pipelining: the operation need not be completed in one clock cycle. For a latch-based design, a slow path can borrow time from the next stage, so the overall operating frequency of the design is not set by the slowest path alone. Timing analysis and time budgeting are more difficult for latch-based designs.
6.3 Latch Versus Flip-Flop 155
The D Latch RTL is described in Example 6.8 and uses the nonblocking
assignment. Input ‘D’ is assigned to output ‘Q’ on positive level of latch enable
input.
The synthesized logic for positive level sensitive latch is shown in Fig. 6.7.
Design engineers are often confused about the use of the reset input. Deciding when to use an asynchronous reset and when to use a synchronous reset is one of the key challenges for the engineer. So, an ASIC design engineer needs a good understanding of resets, reset issues, and reset trees. Reset tree structure and synchronization will be discussed subsequently. This section describes the synchronous and asynchronous reset description using Verilog HDL.
6.4 Use of Synchronous Versus Asynchronous Reset 157
For sequential designs, use the “if-else” construct to describe priority logic, for example to assign priorities to control signals. Use the “case” construct to describe parallel logic. Please refer to Chap. 5 for detailed information about the use of the “if-else” and “case” constructs.
Internally generated clock signals are derived from the system or master clock. They need to be avoided because they cause functional and timing issues in the design. The functional and timing problems are due to the combinational logic propagation delays. An internally generated clock signal can have glitches or spikes, which can trigger the sequential logic multiple times or generate an undesired output. Such designs can also have timing violations due to violation of the setup or hold time.
It is always recommended to generate internal clocks from a register output. Even then, due to the propagation delay of the flip-flop, the overall cumulative delay or skew can generate glitches or spikes in the design.
In Example 6.11, the Verilog RTL generates an internal clock, and the generated internal clock signal is used by another sequential procedural block.
The synthesized logic is shown in Fig. 6.10; the first register is driven by the clock ‘clk’ and the clock of the second register is driven by ‘int_clk’.
Gated clock signals are used to enable or disable switching at the clock input, and can be used in single or multiple clock domain designs by using enable inputs. When the enable input is high the clock domain is on, and when the enable input is low the clock domain is off. Clock gating logic is required to control turning the clock on or off. Clock gating is an efficient technique used in ASIC design to reduce the switching power at the clock input of a register. By using the clock gating structure, the clock switching can be stopped as and when required according to the functional requirements of the design.
The issue with this simple clock gating is that it is problematic in synchronous designs, because the gating logic introduces a significant amount of clock skew and can introduce glitches on the gated clock. To avoid the glitches, special care needs to be taken by the ASIC design engineer.
The Verilog RTL is described in Example 6.12 and uses the enable input to control the clock switching activity. For ‘enable=1’ the clock input ‘clk’ toggles, and for ‘enable=0’ the gated clock is held permanently low, so there is no switching at the clock input of the register.
The synthesized logic is shown in Fig. 6.11, where the clock is gated by using AND logic.
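AND-based clock gating of the kind described above can be sketched as follows. The names ‘clk’ and ‘enable’ come from the text; the module name and the names ‘gated_clk’, ‘d_in’, and ‘q_out’ are assumptions, and the sketch is not a reproduction of Example 6.12.

```verilog
// Sketch of simple AND-based clock gating.
module gated_clock_reg (
    input  wire clk,
    input  wire enable,   // 1 = clock passes, 0 = gated clock held low
    input  wire d_in,     // assumed name
    output reg  q_out     // assumed name
);
    // If 'enable' changes while 'clk' is high, 'gated_clk' can glitch.
    // This is the hazard the designer has to take care of.
    wire gated_clk = clk & enable;

    always @(posedge gated_clk)
        q_out <= d_in;
endmodule
```

One common way to reduce the glitch risk, stated here as general practice rather than as the method of the text, is to change ‘enable’ only while ‘clk’ is low.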
Pipelining is a powerful technique used to improve the performance of the design at the cost of latency. This technique is used in many processor designs and many ASIC design applications to perform multiple tasks at a time, that is, for concurrent execution. This section discusses the design without pipelining and the design with pipelining.
During the initial stage of the design, most of the designs are described by using
Verilog RTL without the use of the pipelined logic. If the desired speed that is
design performance is not met, then ASIC designer can tweak the design by using
6.8 Use of Pipelining in Design 163
To improve the design performance, the output of the combinational AND logic can be given to an additional pipeline register, and the output of the pipeline register can drive one of the inputs of the OR logic.
This technique improves the overall performance of the design at the cost of one clock cycle of latency. The improvement in the design performance is due to the reduction in the combinational delay in the register-to-register path.
The Verilog RTL is described in Example 6.14; the pipelining is achieved by adding the additional register logic.
The synthesized RTL logic is shown in Fig. 6.13 and consists of three registers triggered by the common clock source ‘clk’.
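The pipelined AND-OR structure can be sketched as follows. This is one possible arrangement with three registers and not the text of Example 6.14; the module name and all signal names other than ‘clk’ are assumptions.

```verilog
// Sketch of an AND-OR path with one added pipeline stage.
module pipelined_and_or (
    input  wire clk,
    input  wire a_in,   // assumed name
    input  wire b_in,   // assumed name
    input  wire c_in,   // assumed name
    output reg  y_out   // assumed name
);
    reg and_reg;   // pipeline register after the AND logic
    reg c_reg;     // delays c_in by one cycle to stay aligned with and_reg

    always @(posedge clk) begin
        and_reg <= a_in & b_in;       // stage 1
        c_reg   <= c_in;              // stage 1
        y_out   <= and_reg | c_reg;   // stage 2
    end
endmodule
```

Each register-to-register path now contains only one gate, and the result appears one clock cycle later than in the design without the pipeline stage.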
Multiple clock sources or signals can be used in multiple clock domain designs. These clock signals can be generated by different sources and can be used in the ASIC design to trigger different “always” procedural blocks. The data transfer from one clock domain to another needs additional synchronizers in the data path and control path; these are discussed in the subsequent chapters.
The Verilog RTL is described in Example 6.15 and uses two different clock signals ‘clk1’ and ‘clk2’. Two different procedural blocks are used to describe the functionality using ‘clk1’ and ‘clk2’, respectively. Defining multiple clock signals in the same module is not the best practice. In multiple clock domain designs, the different design blocks need to be described according to the functional requirements, and they can be triggered on different clock signals.
The synthesized output is shown in Fig. 6.14 and generates the outputs ‘f1_out’ and ‘f2_out’. The upper register is triggered on the positive edge of the clock ‘clk1’. The lower register is triggered on the negative edge of ‘clk2’.
Clock signals that are generated from the same clock source, are used to trigger multiple procedural blocks, and have a difference in arrival time are called multiphase clock signals. For example, if one procedural block is triggered on the positive edge of the clock and another procedural block is triggered on the negative edge of the clock, then there is a phase difference of 180° in the triggering of the registers, and these signals are treated as phase shifted signals.
The Verilog RTL is shown in Example 6.16; one of the procedural blocks is triggered by the positive edge of the clock and the other is triggered by the negative edge of the clock.
The synthesized logic is shown in Fig. 6.15, where two different registers are triggered on different edges of the clock.
The following key guidelines should be followed while modeling an asynchronous design:
1. If asynchronous reset signals are used, use the dual edge synchronizer to
synchronize the internally generated reset signals.
2. Avoid driving the asynchronous reset of a flip-flop from the output of a preceding
flip-flop, as this can cause race conditions.
3. Avoid asynchronous pulse generators, as they create issues in the design, during
timing closure, and even during place and route.
4. Use a ripple counter only if low power consumption is the goal; ripple counters
degrade performance because the propagation delays of the individual flip-flops
accumulate along the cascade.
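As an illustration of guideline 1, the sketch below shows a common two-flip-flop reset synchronizer (asynchronous assertion, synchronous release). It assumes that this is the kind of structure the guideline refers to; the module and signal names are hypothetical.

```verilog
// Hypothetical sketch: reset asserted asynchronously, released synchronously.
module reset_sync (
    input  wire clk,
    input  wire async_reset_n,  // raw active low reset
    output wire sync_reset_n    // reset released in step with clk
);
    reg [1:0] sync_ff;

    always @(posedge clk or negedge async_reset_n)
        if (!async_reset_n) sync_ff <= 2'b00;               // assert immediately
        else                sync_ff <= {sync_ff[0], 1'b1};  // release after two clock edges

    assign sync_reset_n = sync_ff[1];
endmodule
```

The downstream registers would then use `sync_reset_n` as their asynchronous reset, so that reset release does not violate recovery and removal timing.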
6.13 Summary
Chapter 7
Complex Designs Using Verilog RTL
The complex designs can be efficiently implemented by using Verilog. Nowadays design
complexity has increased and the designs are targeted for lower power, high speed,
and least area. This chapter discusses the use of Verilog constructs to design logic
for the required functionality.
[Figure: ALU block diagram. An arithmetic unit (inputs cin_in, a_in, b_in; outputs result_out_a, co_out) and a logic unit (inputs a_in, b_in; output result_out_b) feed a multiplexer that drives result_out; the control unit decodes op_code to generate sel]
Abstract Complex ASIC designs can be described by using Verilog RTL. In a practical
scenario the objective is to describe the design functionality with efficient Verilog
RTL that follows the key combinational and sequential design guidelines. This chapter
describes complex designs such as ALUs, parity generators, parity checkers, and barrel
shifters. It also discusses the synthesized logic with its data path and control
paths. The complex examples are explained with practical aspects, diagrams, and
functional tables. This chapter is useful for ASIC and FPGA designers who need to
understand issues such as combinational designs, critical paths, and registered
inputs and outputs.
Keywords ALU · Logic unit · Arithmetic unit · Data path · Control path · Function ·
Task · Timing control · Delay control · Parity checker · Parity generator ·
Combinational shifter · Protocol · Register input · Register output · Barrel shifter · DSP
As discussed in the previous chapters Verilog HDL is efficiently used to code the
functionality of the design. The concurrent and sequential constructs discussed in
the previous chapters can be used to infer the synthesized logic. In the practical
The arithmetic logic unit (ALU) is used in most processors to perform arithmetic and
logical operations. The processor performs one operation at a time depending on the
operational code (opcode). For 8-bit processors, the ALU performs operations on two
8-bit operands. An operand is the data on which the operation is performed. Similarly,
for 16-bit processors the ALU performs operations on two 16-bit numbers.
As shown in Fig. 7.1, the ALU architecture performs operations on two four-bit numbers
A (A3 is the MSB and A0 is the LSB) and B (B3 is the MSB and B0 is the LSB) and a carry
input C0. The ALU generates an output F (F3 is the MSB and F0 is the LSB) and an output
carry Cout3. In a practical ASIC design scenario, a one-bit ALU is designed to perform
the operation on a single bit of data. The operation performed depends on the opcode
bits on the lines S1 and S0. As shown in the figure, the ALU is designed to execute
four instructions. Table 7.1 describes the ALU functionality and lists the operation
performed for each status of the select lines ‘S1’ and ‘S0’. In this example the opcode
is 2 bits wide and is indicated by ‘S1’ and ‘S0’.
performing either AND, OR, XOR or complement operation. Table 7.2 shown
below describes the different logical operations. The complement operation is
performed by using adder having one input A0 and another input logical ‘1’.
The issue with this type of design is the use of parallel and multiplexing logic. The
data path runs from the inputs A0 and B0 to the multiplexer data inputs, and the
control path consists of the multiplexer select lines ‘S1’ and ‘S0’. As shown in
Fig. 7.2, the logic unit performs all the operations at the same time, and the
multiplexer selects the result of one of them as ‘F0’. This technique is inefficient
because it needs more area and power and does not have an efficient implementation
mechanism. If ‘S1’ and ‘S0’ are late-arriving signals and this block is used in a
register-to-register path, timing violations are possible. Another important aspect is
that resource sharing is not used in this design.
So, it is recommended to write efficient Verilog RTL for the logic unit using the
“case” construct while sharing the common resources. The following section describes
the Verilog RTL for the logic unit to infer the parallel logic and the logic with
registered inputs and outputs.
Example 7.1 describes the functionality to perform operations on two 8-bit binary
inputs “a_in” and “b_in”. The operations are those listed in Table 7.3. The Verilog RTL
infers parallel logic with multiplexer encoding.
In Example 7.1 the functionality is described by a procedural “always” block with the
“case” construct. All the case conditions are described, and in the “default”
condition the logic unit generates the output “result_out” equal to 8'b0000_0000.
The functionality of the logic unit can also be modeled using the full-case construct.
In Example 7.2 the functionality is described by a procedural “always” block in which
all the case conditions are covered by the full “case” construct.
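The listings of Examples 7.1 and 7.2 and the contents of Table 7.3 are not reproduced here, so the following is only a hedged sketch of an 8-bit logic unit coded with the "case" construct. The 2-bit opcode width and the opcode-to-operation mapping are assumptions; only the port names a_in, b_in, and result_out come from the text.

```verilog
// Hypothetical sketch of an 8-bit logic unit; opcode mapping is assumed.
module logic_unit (
    input  wire [7:0] a_in,
    input  wire [7:0] b_in,
    input  wire [1:0] op_code,
    output reg  [7:0] result_out
);
    always @(*) begin
        case (op_code)
            2'b00:   result_out = a_in & b_in;   // AND
            2'b01:   result_out = a_in | b_in;   // OR
            2'b10:   result_out = a_in ^ b_in;   // XOR
            2'b11:   result_out = ~a_in;         // complement of a_in
            default: result_out = 8'b0000_0000;  // reached only for x/z opcodes in simulation
        endcase
    end
endmodule
```

Because all four opcode values are listed, the case is already full; the "default" branch only guards simulation against unknown select values and keeps the block free of inferred latches if an opcode is later removed.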
7.1 ALU Design 175
The synthesized logic using the full-case construct for the 8-bit logic unit is shown
in Fig. 7.3. As shown in the figure, it infers logic gates with multiplexing logic. In
a practical scenario it is recommended to use the adders as common resources to
implement both the logic and arithmetic units. Readers are requested to refer to
Chap. 9 for the improved design of the 8-bit ALU.
Example 7.2 Verilog RTL for 8-bit ALU using full-case construct
For efficient and clean timing analysis it is recommended to use registered inputs and
registered outputs. If all the inputs are sampled on the active edge of the clock and
all the outputs are captured on the active edge of the clock, the design consists of
well-defined register-to-register timing paths. Registered inputs and outputs give a
clean data path, and the output is free of glitches and hazards. For performance
improvement, pipelining can be used to reduce the data arrival time. Please refer to
Chap. 11 for detailed information about timing analysis.
Example 7.3 uses registered input and registered output logic. The inputs are sampled
on the positive edge of the clock “clk” and the outputs are launched on the positive
edge of “clk”. During the reset condition ‘reset_n=0’ the logic unit is initialized to
logic ‘0’.
Example 7.3 Verilog RTL for 8-bit logic unit with register inputs and outputs
The above example generates the logic unit with all the inputs and outputs registered
on the positive edge of the clock. Readers are requested to assume that every register
has an asynchronous reset input “reset_n”. The synthesized logic is shown in Fig. 7.4.
[Figure: register logic stages capture a_in, b_in, and op_code as reg_a_in, reg_b_in, and reg_op_code; the logic unit output reg_result_out is registered to produce result_out; all registers are clocked by clk]
Fig. 7.4 Synthesized logic unit with registered inputs and outputs
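The listing of Example 7.3 is not reproduced here. The sketch below shows one way the registered-input, registered-output structure described above could be written. The internal names follow the labels of Fig. 7.4, while the opcode width and operation mapping are assumptions.

```verilog
// Hypothetical sketch: logic unit with registered inputs and outputs.
module logic_unit_registered (
    input  wire       clk,
    input  wire       reset_n,      // active low asynchronous reset
    input  wire [7:0] a_in,
    input  wire [7:0] b_in,
    input  wire [1:0] op_code,      // assumed 2-bit opcode
    output reg  [7:0] result_out
);
    reg [7:0] reg_a_in, reg_b_in;
    reg [1:0] reg_op_code;
    reg [7:0] reg_result_out;       // combinational result before the output register

    // Input and output registers
    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) begin
            reg_a_in    <= 8'h00;
            reg_b_in    <= 8'h00;
            reg_op_code <= 2'b00;
            result_out  <= 8'h00;
        end else begin
            reg_a_in    <= a_in;
            reg_b_in    <= b_in;
            reg_op_code <= op_code;
            result_out  <= reg_result_out;
        end
    end

    // Combinational logic unit placed between the two register stages
    always @(*) begin
        case (reg_op_code)
            2'b00:   reg_result_out = reg_a_in & reg_b_in;
            2'b01:   reg_result_out = reg_a_in | reg_b_in;
            2'b10:   reg_result_out = reg_a_in ^ reg_b_in;
            default: reg_result_out = ~reg_a_in;
        endcase
    end
endmodule
```

With this arrangement every path through the logic unit starts and ends at a register, at the cost of two clock cycles of latency from input to output.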
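[Figure: arithmetic unit block with inputs cin_in, a_in, b_in and outputs result_out, co_out; op_code_in drives the control unit]
[Figure: one-bit arithmetic unit. A full adder with inputs cin_in, a_in[0], and tmp_b_in[0] produces result_out[0] and co_out[0]; tmp_b_in[0] is selected from b_in[0], ~b_in[0], 0, and 1 by the control unit according to op_code_in]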
The synthesized logic for the one-bit arithmetic unit is shown in Fig. 7.6. The logic
uses a full adder as the component that performs addition and subtraction.
Subtraction is performed using 2’s complement addition. The synthesized logic also
contains a 4:1 multiplexer that passes the required operand to one of the inputs of
the full adder depending on the opcode.
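The structure described above can be sketched for 8-bit operands as follows. This is an illustration only: the 2-bit opcode and the mapping of opcode values to the four multiplexer inputs are assumptions, since the original table and listing are not reproduced here.

```verilog
// Hypothetical sketch: adder shared between add and subtract operations.
module arithmetic_unit (
    input  wire [7:0] a_in,
    input  wire [7:0] b_in,
    input  wire       cin_in,
    input  wire [1:0] op_code_in,   // assumed 2-bit opcode
    output wire [7:0] result_out,
    output wire       co_out
);
    reg [7:0] tmp_b_in;

    // 4:1 multiplexer that selects the second adder operand
    always @(*) begin
        case (op_code_in)
            2'b00:   tmp_b_in = b_in;    // a_in + b_in + cin_in
            2'b01:   tmp_b_in = ~b_in;   // a_in + ~b_in + cin_in (a_in - b_in when cin_in = 1)
            2'b10:   tmp_b_in = 8'h00;   // a_in + cin_in
            default: tmp_b_in = 8'hFF;   // a_in - 1 + cin_in
        endcase
    end

    // Single shared adder; the 9-bit left-hand side keeps the carry out
    assign {co_out, result_out} = a_in + tmp_b_in + cin_in;
endmodule
```

Only one adder is inferred, because the opcode changes the operand fed to the adder rather than selecting between separate adders and subtractors.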
Figure 7.7 illustrates the ALU with the associated logic circuit to perform operations
on two 8-bit numbers “a_in” and “b_in”. For logic operations, the carry input (cin_in)
is ignored and the output “result_out” is generated depending on the operational code
of the instruction. Depending on the operational code, the ALU performs either an
arithmetic or a logical operation. During arithmetic operations, if the result is
wider than 8 bits, the carry output “co_out” is set to logical ‘1’, which indicates a
carry out of the MSB (Table 7.5).
Table 7.6 describes the number of bits required at the inputs and outputs of the ALU
design for 11 instructions. The table describes the seven arithmetic instructions and
four logical instructions. The pin or signal description is shown in Table 7.5.
An efficient Verilog RTL description using the two different ‘‘case’’ constructs
to infer the parallel logic is described in Example 7.5. For the ‘op_code_in[3]=0’
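[Figure: ALU block with inputs cin_in, a_in, b_in and outputs result_out, co_out; op_code drives the control unit]
[Figure: ALU built from an arithmetic unit (inputs cin_in, a_in, b_in; outputs result_out_a, co_out) and a logic unit (inputs a_in, b_in; output result_out_b); a multiplexer controlled by sel drives result_out, and the control unit decodes op_code]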
Tasks and functions are used in Verilog to describe commonly used functional
behavior. Instead of replicating the same code in different places, it is good and
common practice to use functions or tasks depending on the requirement. For easy
maintenance of the code, functions and tasks can be used like subroutines.
7.2 Functions and Tasks 185
The following example describes a task used to count the 1’s in a given string. The
following are the key points to remember while using a task:
1. A task can contain timing control statements and delay operators.
2. A task can have input and output declarations.
3. A task can contain function calls, but a function cannot contain a task call.
4. A task passes results back through its output arguments; it does not return a
value when called.
5. A task can call other tasks.
6. It is not recommended to use a task while writing synthesizable Verilog RTL.
7. Tasks are used for writing behavioral or simulatable models.
Example 7.6 counts the number of 1’s in the given string. In this example the task is
used with the arguments “data_in” and “out”. The name of the task is
“count_1s_in_byte”. Most protocol descriptions need to perform some operation on an
input string. In this example the string is the 8-bit input “data_in” and the result
is the 4-bit output “out”. It is not recommended to use the task to generate
synthesized logic.
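The listing of Example 7.6 is not reproduced here. A minimal simulation-only sketch of such a task, using the task and argument names given in the text, might look as follows; the wrapper module, the test value, and the variable names byte_val and ones are assumptions.

```verilog
// Hypothetical simulation sketch: counting 1's with a task.
module count_ones_task_demo;
    reg [7:0] byte_val;
    reg [3:0] ones;

    task count_1s_in_byte;
        input  [7:0] data_in;
        output [3:0] out;
        integer i;
        begin
            out = 4'd0;
            for (i = 0; i < 8; i = i + 1)
                if (data_in[i])
                    out = out + 4'd1;
        end
    endtask

    initial begin
        byte_val = 8'b1011_0010;              // contains four 1's
        count_1s_in_byte(byte_val, ones);     // result comes back through the output argument
        $display("data_in=%b ones=%0d", byte_val, ones);  // prints ones=4
        $finish;
    end
endmodule
```

Note that the task is invoked as a statement and hands its result back through the output argument, in line with point 4 above.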
The following example describes a function used to count the 1’s in a given string.
The following are the key points to remember while using a function:
1. A function cannot contain timing control statements or delay operators.
2. A function must have at least one input argument declaration.
3. A function can contain function calls, but a function cannot contain a task call.
4. A function executes in zero simulation time and returns a single value when called.
5. It is not recommended to use a function while writing synthesizable Verilog RTL.
6. Functions are used for writing behavioral or simulatable models.
7. Functions should not contain nonblocking assignments.
Example 7.7 counts the number of 1’s in the given string. In this example, the
function is used with the argument “data_in”. The name of the function is
“count_1s_in_byte”. Most protocol descriptions need to perform some operation on an
input string. In this example the string is the 8-bit input “data_in” and the result
is the 4-bit output “out”. It is not recommended to use the function to generate
synthesized logic.
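The listing of Example 7.7 is not reproduced here. A comparable simulation-only sketch with a function is shown below; the wrapper module, the test value, and the variable names byte_val and out are assumptions made for illustration.

```verilog
// Hypothetical simulation sketch: counting 1's with a function.
module count_ones_function_demo;
    reg [7:0] byte_val;
    reg [3:0] out;

    function [3:0] count_1s_in_byte;
        input [7:0] data_in;
        integer i;
        begin
            count_1s_in_byte = 4'd0;
            for (i = 0; i < 8; i = i + 1)
                count_1s_in_byte = count_1s_in_byte + data_in[i];
        end
    endfunction

    initial begin
        byte_val = 8'b1011_0010;               // contains four 1's
        out = count_1s_in_byte(byte_val);      // function call is used as an expression
        $display("data_in=%b ones=%0d", byte_val, out);  // prints ones=4
        $finish;
    end
endmodule
```

Unlike the task, the function returns its value through its own name and is used inside an expression.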
In most practical ASIC and SOC designs, Verilog RTL is used to describe protocol
behavior. The objective is functional correctness of the design together with timing-
and cycle-accurate models. In most practical applications, the parity of the data
needs to be detected and reported as even or odd. If a string contains an even number
of 1’s, its parity is even; if it contains an odd number of 1’s, its parity is odd.
This section focuses on the parity generator and checker.
Efficient Verilog RTL is described in Example 7.8. As described in the RTL, the even
or odd parity is generated at the output “q_out”. Even parity is indicated by logic
‘0’ and odd parity by logic ‘1’.
The synthesized result is shown in Fig. 7.9; it is register logic with combinational
logic at the data input of the flip-flop. The synthesizer can infer multiple
registers depending on the number of nonblocking assignments inside the
edge-sensitive procedural “always” block.
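The listing of Example 7.8 is not reproduced here. One hedged way to obtain the described structure (XOR logic feeding a flip-flop) is the reduction XOR operator, as sketched below. The 8-bit input width, the input name, and the asynchronous reset are assumptions; only “q_out” comes from the text.

```verilog
// Hypothetical sketch: registered parity generator.
module parity_gen (
    input  wire       clk,
    input  wire       reset_n,   // assumed active low asynchronous reset
    input  wire [7:0] data_in,   // assumed 8-bit data string
    output reg        q_out      // 0 = even number of 1's, 1 = odd number of 1's
);
    always @(posedge clk or negedge reset_n)
        if (!reset_n) q_out <= 1'b0;
        else          q_out <= ^data_in;  // reduction XOR of all data bits
endmodule
```

The single nonblocking assignment infers one flip-flop, with the XOR tree as the combinational logic at its data input.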
Consider a practical scenario in which the design uses multiple functional blocks.
The design requirement is to complement the input ‘b’ when “add_sub=1” and to pass
the input ‘b’ unchanged when “add_sub=0”. The adder operates on the input ‘a’ and the
result from the complement logic, and it generates the outputs “cy_out” and “sum”.
The parity checker is used at the output stage to find whether the string contains an
even or an odd number of 1’s.
The Verilog RTL is shown in Example 7.9. The inputs of the logic are ‘a’, ‘b’, and
‘add_sub’, and the output is ‘p’.
The partial synthesized logic is shown in Fig. 7.10. The overall synthesized logic
consists of three blocks: “complementor”, “adder”, and “parity_checker”.
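The listing of Example 7.9 is not reproduced here. The three blocks could be sketched in a single module as follows. The 8-bit operand width, the use of add_sub as the adder carry input (to complete the 2's complement), and the choice to compute parity over the carry and the sum together are all assumptions.

```verilog
// Hypothetical sketch: complementor, adder, and parity checker.
module add_sub_parity (
    input  wire [7:0] a,
    input  wire [7:0] b,
    input  wire       add_sub,   // 0: pass b, 1: complement b
    output wire       p          // 1 when {cy_out, sum} holds an odd number of 1's
);
    wire [7:0] b_comp;
    wire [7:0] sum;
    wire       cy_out;

    // Complementor
    assign b_comp = add_sub ? ~b : b;

    // Adder (add_sub as carry in is an assumption)
    assign {cy_out, sum} = a + b_comp + add_sub;

    // Parity checker on the adder outputs
    assign p = ^{cy_out, sum};
endmodule
```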
In most DSP applications, combinational shifters are used to perform shifting
operations on the data input. These combinational shifters are called barrel
shifters. The advantage of a barrel shifter is that it shifts the data by the number
of bits given on its control inputs without any clocked logic. Most barrel shifters
are designed using multiplexer logic.
Example 7.10 has an 8-bit input “d_in”, a three-bit control input “c_in”, and an
8-bit output “q_out”. The synthesis outcome is shown in Fig. 7.11.
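The listing of Example 7.10 is not reproduced here. A multiplexer-based barrel shifter with the port names given above can be sketched with three stages, one per control bit. The choice of a left rotate (rather than a plain shift) is an assumption.

```verilog
// Hypothetical sketch: 8-bit mux-based barrel shifter (rotate left).
module barrel_shifter (
    input  wire [7:0] d_in,
    input  wire [2:0] c_in,    // rotate amount, 0 to 7
    output wire [7:0] q_out
);
    wire [7:0] stage1;
    wire [7:0] stage2;

    // Each stage is a row of 2:1 multiplexers controlled by one bit of c_in
    assign stage1 = c_in[0] ? {d_in[6:0],   d_in[7]}     : d_in;    // rotate by 1
    assign stage2 = c_in[1] ? {stage1[5:0], stage1[7:6]} : stage1;  // rotate by 2
    assign q_out  = c_in[2] ? {stage2[3:0], stage2[7:4]} : stage2;  // rotate by 4
endmodule
```

For example, with c_in = 3'b011 the first two stages are active and the data is rotated left by three positions in total.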
7.4 Barrel Shifters 193
7.5 Summary
6. Parity generators are used to generate an even or an odd parity for the data input
string.
7. Barrel shifters are combinational shifters and designed by using mux-based
logic.
Chapter 8
Finite State Machines
Keywords FSM · Moore · Mealy · Binary encoding · Gray encoding · One-hot encoding ·
State register · Current state · Next state · State transition table · State diagrams ·
State transition diagrams · Output combinational logic · Level to pulse conversion ·
Glitch free output · STA · Timing path · Sequential logic · Sequence detector
The finite state machine (FSM) is a very important design block in ASIC design. Most
ASIC and controller designs need efficient, synthesizable state machines. FSMs can be
described very efficiently using Verilog HDL, and an ASIC design engineer needs an
in-depth understanding of the efficient coding of state machines.
© Springer India 2016 197
V. Taraate, Digital Logic Design Using Verilog,
DOI 10.1007/978-81-322-2791-5_8
FSMs are classified as Moore and Mealy machines. In a Moore machine, the output is a
function of the current state only; in a Mealy machine, the output is a function of
the current state as well as the present input. In a Moore machine, the output is
stable for one clock cycle and hence is free of glitches and hazards. In a Mealy
machine, the output may or may not be stable for one clock cycle, because it depends
on both the current state and the input, and the input can change within the cycle.
The timing analysis for the Moore machine is very simple due to the clean
register-to-register path, but for the Mealy machine there is a chance of timing
violations if the input changes during the setup time window.
The disadvantage of the Moore machine is that it needs more states than the Mealy
machine. In practice, a Mealy machine typically has one state less than the
equivalent Moore machine.
Figure 8.1 describes the internal structure of the Moore machine. It consists of the
next state logic, a combinational block that depends on the ‘Current_state’ and the
input; the state register, which is loaded with the ‘Next_state’; and the output
logic, which is purely combinational and depends only on the ‘Current_state.’ As
discussed earlier, the Moore machine output is a function of the ‘Current_state’ and
hence is stable for one clock cycle.
[Figures: Moore and Mealy machine block diagrams; signals Input, Next_state, Current_state, Output, clk]
Figure 8.2 describes the internal structure of the Mealy machine. It consists of the
next state logic, a combinational block that depends on the ‘Current_state’ and the
input; the state register, which is loaded with the ‘Next_state’; and the output
logic, which is purely combinational and depends on the ‘Current_state’ as well as the
input. As discussed earlier, the Mealy machine output is a function of the
‘Current_state’ and the inputs, and hence may or may not be stable for one clock
cycle. Because of this, Mealy machines are prone to glitches.
How to code an efficient FSM is an important point to discuss. For an ASIC design
engineer, the overall efficiency of the design depends on efficient Verilog RTL
coding. Most inexperienced ASIC engineers use a single procedural ‘always’ block to
describe the behavior of the FSM. But a single ‘always’ block FSM leads to
inefficient coding and creates issues while synthesizing the design and even during
timing analysis.
In practical scenarios, two- or three-procedural block FSMs are used. This chapter
recommends the three-procedural block FSM. A multiple-procedural block FSM increases
the number of lines of code but is the most efficient during synthesis and timing
analysis. It also improves the overall readability and reusability during reviews and
the design cycle.
1. One procedural ‘always’ block describes the ‘Next State Logic’; it is level
sensitive to changes on the inputs and the ‘Current_state.’
2. Another procedural ‘always’ block describes the register logic; it is sensitive to
the positive or negative edge of the clock and hence infers the state register
sequential logic.
3. For the Moore machine, the third procedural ‘always’ block is sensitive to changes
on the ‘Current_state’ and infers the output combinational logic.
4. For the Mealy machine, the third procedural ‘always’ block is sensitive to changes
on the ‘Current_state’ as well as the input and infers the output combinational
logic.
Table 8.1 illustrates the key differences between Moore and Mealy machines.
The template shown in Fig. 8.3 gives the steps and declarations for FSM coding.
A practical FSM for the toggle flip-flop is described using the three-procedural
block FSM. The example gives efficient Verilog RTL for Table 8.2. The state table
describes the state transition on each clock edge (Example 8.1).
The synthesized logic for the toggle flip-flop FSM is shown in Fig. 8.4. It infers a
state register that is triggered on the positive edge of the clock and has an active
low asynchronous reset ‘reset_n’. Because of the ‘case’ construct, decoding logic is
inferred; treat this logic as a ‘Not’ gate. The output is generated at ‘y_out’ and
toggles on every positive edge of the clock ‘clk.’
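The listing of Example 8.1 is not reproduced here. The three-procedural block style for the toggle flip-flop can be sketched as below; the state names and their encoding are assumptions, while clk, reset_n, and y_out come from the text.

```verilog
// Hypothetical sketch: toggle flip-flop as a three-block Moore FSM.
module toggle_ff_fsm (
    input  wire clk,
    input  wire reset_n,   // active low asynchronous reset
    output reg  y_out
);
    localparam S0 = 1'b0,
               S1 = 1'b1;

    reg current_state, next_state;

    // Block 1: state register (nonblocking assignments)
    always @(posedge clk or negedge reset_n)
        if (!reset_n) current_state <= S0;
        else          current_state <= next_state;

    // Block 2: next state logic (blocking assignments)
    always @(*)
        case (current_state)
            S0:      next_state = S1;
            S1:      next_state = S0;
            default: next_state = S0;
        endcase

    // Block 3: Moore output logic, a function of the current state only
    always @(*)
        case (current_state)
            S0:      y_out = 1'b0;
            S1:      y_out = 1'b1;
            default: y_out = 1'b0;
        endcase
endmodule
```

The next state logic reduces to an inverter on the state bit, which matches the remark above that the decoding logic can be treated as a ‘Not’ gate.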
The partial state transition diagram of the level to pulse converter is shown in
Fig. 8.5. As shown in the diagram, the FSM remains in the state ‘S0’ for the input
data_in=0 and remains in the state ‘S1’ for the input data_in=1. The state transition
table for the Mealy level to pulse converter is shown in Table 8.3.
The synthesizable Verilog RTL using three procedural blocks is described in
Example 8.2.
As described in the Verilog RTL, the output of the level to pulse converter is a
function of the input ‘data_in’ and the ‘current_state.’ The Verilog RTL generates the
structure shown in Fig. 8.6 for the level to pulse converter.
The synthesized logic for the Mealy level to pulse converter is shown in Fig. 8.7; it
infers register logic with a combinational structure at the output. Thus, the output
of the Mealy level to pulse converter is a function of the ‘current_state’ and the
input ‘data_in.’
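The listing of Example 8.2 is not reproduced here. A hedged sketch of a Mealy level to pulse converter in the three-block style is given below. The state encoding and the reset behavior are assumptions; the output is high only while the FSM is in ‘S0’ and ‘data_in’ is 1.

```verilog
// Hypothetical sketch: Mealy level to pulse converter.
module level_to_pulse_mealy (
    input  wire clk,
    input  wire reset_n,   // assumed active low asynchronous reset
    input  wire data_in,
    output reg  y_out
);
    localparam S0 = 1'b0,   // data_in was 0 at the last clock edge
               S1 = 1'b1;   // data_in was 1 at the last clock edge

    reg current_state, next_state;

    // State register
    always @(posedge clk or negedge reset_n)
        if (!reset_n) current_state <= S0;
        else          current_state <= next_state;

    // Next state logic: the state follows the input level
    always @(*)
        case (current_state)
            S0:      next_state = data_in ? S1 : S0;
            S1:      next_state = data_in ? S1 : S0;
            default: next_state = S0;
        endcase

    // Mealy output logic: depends on the current state and the input
    always @(*)
        case (current_state)
            S0:      y_out = data_in;   // rising level seen: produce the pulse
            S1:      y_out = 1'b0;      // level already seen: no pulse
            default: y_out = 1'b0;
        endcase
endmodule
```

The result is equivalent to a D flip-flop whose inverted output is ANDed with the input, which appears consistent with the signal labels (data_in, D, clk, ~Q, y_out) of the structure figure.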
8.1 Moore Versus Mealy Machines 201
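[Figure annotations: remain in state s0 for data_in = ‘0’ with y_out = ‘0’; remain in state s1 for data_in = ‘1’ with y_out = ‘0’]
Fig. 8.5 Partial state transition diagram for Mealy level to pulse converter
[Figure: D flip-flop with data input data_in and clock clk; the ~Q output and data_in generate y_out]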
An FSM can be described in many styles; in practice, three encoding styles are used
to describe FSMs. These styles are named as
a. Binary Encoding An FSM can be described using the binary encoding style. With
this style, the number of register elements equals log2 of the number of states,
rounded up to the next integer. For an FSM with four states, the number of registers
is log2 4 = 2.
b. Gray Encoding An FSM can be efficiently described using the Gray encoding
technique, in which Gray codes are used to represent the states. The number of
register elements again equals log2 of the number of states, rounded up to the next
integer. For an FSM with four states, the number of registers is log2 4 = 2.
c. One-hot encoding FSMs can be efficiently described using the one-hot encoding
style. One-hot indicates that only one state bit is active high, or hot, at a time.
The number of register elements equals the number of states in the FSM. For an FSM
with four states, the number of registers is also 4. This style requires more area,
but its advantage is a clean register-to-register path, which makes STA very simple.
If the FSM has 16 states, one-hot encoding needs 16 flip-flops.
The comparison of different FSM styles for 16 states is shown in Table 8.4.
The encoding representation for 4-state FSM is shown in Table 8.5.
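Since Table 8.5 is not reproduced here, the following sketch shows how the state parameters of a four-state FSM are commonly declared in each style. The parameter names and the specific code assigned to each state are assumptions chosen for illustration.

```verilog
// Hypothetical sketch: state parameter declarations for a 4-state FSM.
module fsm_encoding_styles;
    // Binary encoding: 2 state bits
    localparam [1:0] BIN_S0  = 2'b00, BIN_S1  = 2'b01,
                     BIN_S2  = 2'b10, BIN_S3  = 2'b11;

    // Gray encoding: 2 state bits, one bit changes between successive states
    localparam [1:0] GRAY_S0 = 2'b00, GRAY_S1 = 2'b01,
                     GRAY_S2 = 2'b11, GRAY_S3 = 2'b10;

    // One-hot encoding: 4 state bits, exactly one bit is high per state
    localparam [3:0] HOT_S0  = 4'b0001, HOT_S1 = 4'b0010,
                     HOT_S2  = 4'b0100, HOT_S3 = 4'b1000;

    initial begin
        $display("binary  : %b %b %b %b", BIN_S0,  BIN_S1,  BIN_S2,  BIN_S3);
        $display("gray    : %b %b %b %b", GRAY_S0, GRAY_S1, GRAY_S2, GRAY_S3);
        $display("one-hot : %b %b %b %b", HOT_S0,  HOT_S1,  HOT_S2,  HOT_S3);
    end
endmodule
```

Declaring the states as parameters means the encoding style can be changed by editing only these declarations and the width of the state registers.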
As discussed earlier, the binary encoding style can be used if area is a constraint
on the design. In this encoding style the state parameters are represented in binary
format.
As discussed earlier, the Gray encoding style can also be used if area is a
constraint on the design. In this encoding style the state parameters are represented
in Gray format.
The synthesized logic for the two-bit Gray counter is shown in Fig. 8.11. As shown in
the figure, the state register is triggered on the positive edge of the clock and has
an active low asynchronous reset ‘reset_n.’ The output combinational logic is a
decoding structure inferred from the ‘case’ construct.
The synthesized logic for the two-bit binary counter using one-hot encoding is shown
in Fig. 8.12. As shown in the figure, the state register is triggered on the positive
edge of the clock and has an active low asynchronous reset ‘reset_n.’ This encoding
method uses four registers to represent the functionality. The output combinational
logic is a decoding structure that generates the two-bit output using the ‘case’
construct.
FSMs are used to describe the functionality of sequence detectors. In practical
scenarios, efficient Verilog RTL is used to detect the required sequence. Depending
on the requirements, either a Moore or a Mealy machine can be used to detect the
sequence.
The state transition table for the sequence detector is shown with the required
output (Table 8.9).
The synthesizable Verilog RTL to detect the sequence 1010 is shown in Example 8.6.
The output of the sequence detector is 2 bits wide; when the sequence 1010 is found,
it generates the output “11”. To get a single-bit output, the two output bits can be
given to AND logic.
The synthesizable Mealy machine sequence detector infers state register sequential
logic consisting of two registers and combinational decoding logic. Its output is a
function of the state and the input. The synthesized logic is shown in Fig. 8.13.
Another Mealy machine sequence detector, for the sequence 101, is shown in the state
diagram of Fig. 8.14. It also works for the overlapping sequence 10101 and generates
the output ‘y_out = 1’ each time the sequence ‘101’ is detected. The state transition
table is shown in Table 8.10.
As shown in Table 8.10, the Mealy machine output ‘y_out’ is active high when the
input sequence ‘101’ is detected. The output is a function of the current state and
the input.
The Verilog RTL for the Mealy sequence detector is described in Example 8.7.
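The listing of Example 8.7 is not reproduced here. An overlapping Mealy detector for the sequence ‘101’ can be sketched as follows; the input name, the state names, and the binary state encoding are assumptions.

```verilog
// Hypothetical sketch: overlapping Mealy detector for the sequence 101.
module seq_detector_101_mealy (
    input  wire clk,
    input  wire reset_n,   // assumed active low asynchronous reset
    input  wire data_in,   // assumed serial input name
    output reg  y_out
);
    localparam [1:0] S0 = 2'b00,   // nothing matched yet
                     S1 = 2'b01,   // "1" received
                     S2 = 2'b10;   // "10" received

    reg [1:0] current_state, next_state;

    // State register
    always @(posedge clk or negedge reset_n)
        if (!reset_n) current_state <= S0;
        else          current_state <= next_state;

    // Next state logic
    always @(*)
        case (current_state)
            S0:      next_state = data_in ? S1 : S0;
            S1:      next_state = data_in ? S1 : S2;
            S2:      next_state = data_in ? S1 : S0;  // the final 1 can start a new match
            default: next_state = S0;
        endcase

    // Mealy output logic: asserted in S2 while the closing 1 is present
    always @(*)
        y_out = (current_state == S2) && data_in;
endmodule
```

For the input stream 10101, the output is asserted twice, once for each occurrence of ‘101’, because the state returns to S1 rather than S0 after a detection.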
The goal of FSM coding is efficient synthesis and fast debugging. Reusability and
ease of modifying the state encoding are other important points on which an ASIC
designer needs to focus. The coding style should be compact as well as readable.
The following are key guidelines used to improve the FSM performance.
a. Do not use a single ‘always’ block FSM, as it is harder to read and does not yield
efficient synthesis results.
b. Use multiple-procedural block FSMs. In practical ASIC designs, two or three
‘always’ block FSMs are used, as they improve readability and reusability and yield
efficient synthesis results.
c. Declare the state parameters according to the required state encoding and then
declare the next_state and current_state.
d. Use nonblocking assignments for describing the state register logic.
e. Use blocking assignments for describing the next state combinational logic.
f. Use blocking assignments for describing the output combinational logic.
g. Use the ‘default’ condition in the ‘case’ construct to avoid the inference of
latches.
h. When the ‘if-else’ construct is used, the number of ‘if-else’ clauses should be
the same as the number of transitions in the state diagram.
i. Register the FSM outputs to ensure that the outputs are glitch free.
j. For a better and more efficient synthesis outcome, use the one-hot encoding method.
8.6 Summary
Abstract Programmable logic devices (PLDs) are used extensively in research and in
industrial applications to realize complex designs because of their programmability.
PLDs are used to prototype ASIC SOCs because of the availability of configurable
logic blocks, multipliers, and DSP blocks. This chapter discusses PLD evolution, FPGA
architecture, why to use FPGAs, FPGA design guidelines, and logic realization using
FPGAs. It also discusses simulation constructs and the different delays with a basic
testbench.
Keywords PLD · CPLD · PAL · PLA · PROM · SPLD · FPGA · Programmable ASIC · LUT · CLB ·
IOB · Interconnect · Logic capacity · Logic density · DSP · Multiplier · Processor core ·
IO standards · Structured ASIC · Flash · SRAM · Antifuse · STA · RTL · Simulation ·
Intra-delay · Inter-delay · Combinational loop · Grouping · Clock gating · DLL · PLL ·
IOB · CLB · LUT · Interconnect · Clock skew · IP · XILINX · Spartan
9.1 Key Simulation Concepts
The design entered in Verilog or VHDL needs to be simulated to check its functional
correctness. To verify the functionality of the HDL RTL, a testbench needs to be
written using nonsynthesizable Verilog constructs. Please refer to Appendix I for the
synthesizable and nonsynthesizable Verilog constructs.
Consider that the Verilog RTL consists of the blocking assignments shown in
Example 9.1.
In the example, the procedural “always” block executes on every event on the clock
“clk.” The “initial” block executes only once and is used to assign the values to
‘a,’ ‘b,’ ‘c,’ and ‘d.’ The simulation result for the blocking assignment is shown in
Waveform 9.1.
Consider that the Verilog RTL consists of the nonblocking assignments shown in
Example 9.2.
The simulation result for the above Verilog code using nonblocking assignments is
shown in Waveform 9.2.
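The listings of Examples 9.1 and 9.2 are not reproduced here. The difference between the two assignment types can be illustrated with the self-contained simulation sketch below. The expressions, the variable names, and the single clock edge are assumptions and are not taken from the original examples.

```verilog
// Hypothetical simulation sketch: blocking versus nonblocking assignments.
module blocking_vs_nonblocking;
    reg       clk;
    reg [7:0] a, b, c;   // updated with blocking assignments
    reg [7:0] x, y, z;   // updated with nonblocking assignments

    initial begin
        clk = 1'b0;
        a = 8'h02; b = 8'h03; c = 8'h04;
        x = 8'h02; y = 8'h03; z = 8'h04;
        #5 clk = 1'b1;   // single rising edge at time 5
        #1 $display("blocking   : a=%h b=%h c=%h", a, b, c);  // a=07 b=0b c=12
           $display("nonblocking: x=%h y=%h z=%h", x, y, z);  // x=07 y=06 z=05
        $finish;
    end

    // Blocking: each statement sees the values updated by the previous one
    always @(posedge clk) begin
        a = b + c;   // 03 + 04 = 07
        b = a + c;   // new a: 07 + 04 = 0b
        c = a + b;   // new a and b: 07 + 0b = 12
    end

    // Nonblocking: all right-hand sides use the values from before the edge
    always @(posedge clk) begin
        x <= y + z;  // 03 + 04 = 07
        y <= x + z;  // old x: 02 + 04 = 06
        z <= x + y;  // old x and y: 02 + 03 = 05
    end
endmodule
```

The same three expressions give different results only because of when the left-hand sides are updated, which is why nonblocking assignments are used for registers that sample each other.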
[Waveform: clk; (a) 02 then 07; (b) 03 then 0b; (c) 04 then 12; (d) 05 then 05]
[Waveform: (a) 02 then 07; (b) 03 then 07; (c) 04 then 05; (d) 05 then 05]
An inter-assignment delay with a blocking assignment delays both the evaluation and
the update of the assignment.
Consider the Verilog code shown in Example 9.3.
Waveform 9.3 gives the simulation results for the blocking assignment with
inter-assignment delays.
[Waveform: (a) 04 then 04; (b) 03 then 08; (c) 02 then 0b; (d) 00 then 10; time axis 0, 4, 7, 8, 12]
224 9 Simulation Concepts and PLD-Based Designs
An intra-assignment delay with a blocking assignment delays the update of the
assignment but not the evaluation: the right-hand side is evaluated immediately and
the result is assigned after the delay.
Consider the Verilog code shown in Example 9.4.
Waveform 9.4 gives the simulation results for the blocking assignment with
intra-assignment delays.
An inter-assignment delay with a nonblocking assignment delays both the evaluation
and the update of the assignment.
Consider the Verilog code shown in Example 9.5.
Waveform 9.5 gives the simulation results for the nonblocking assignment with
inter-assignment delays.
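[Waveform: clk; (a) 04 then 04; (b) 03 then 08; (c) 02 then 0b; (d) 00 then 10; time axis 0, 4, 7, 8, 12]
Waveform 9.4 Simulation result for the Verilog blocking assignment with intra-assignment delay
[Waveform: (a) 04 then 04; (b) 03 then 08; (c) 02 then 0b; (d) 00 then 10; time axis 0, 4, 7, 8, 12]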
An intra-assignment delay with a nonblocking assignment delays the update of the
assignment but not the evaluation of the assignment.
Consider the Verilog code shown in Example 9.6.
Waveform 9.6 gives the simulation results for the nonblocking assignment with
intra-assignment delays.
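The listings of Examples 9.3 to 9.6 are not reproduced here. The effect of the delay position can be illustrated with the simulation sketch below, in which an operand changes while the delays are pending. All names, values, and delay amounts are assumptions chosen for illustration.

```verilog
// Hypothetical simulation sketch: inter- versus intra-assignment delays.
module assignment_delay_demo;
    reg [7:0] a, b;
    reg [7:0] y_inter, y_intra, y_nb_intra;

    // Operands: a changes from 4 to 10 at time 2, b stays at 3
    initial begin
        a = 8'd4;
        b = 8'd3;
        #2 a = 8'd10;
    end

    // Inter-assignment delay: wait, then evaluate and update at time 6
    initial begin
        #1;
        #5 y_inter = a + b;         // evaluated at time 6: 10 + 3 = 13
    end

    // Intra-assignment delay, blocking: evaluate at time 1, update at time 6
    initial begin
        #1;
        y_intra = #5 a + b;         // evaluated at time 1: 4 + 3 = 7
    end

    // Intra-assignment delay, nonblocking: evaluate at time 1, schedule the
    // update for time 6, and continue with the next statement immediately
    initial begin
        #1;
        y_nb_intra <= #5 a + b;     // evaluated at time 1: 4 + 3 = 7
        $display("t=%0t: statement after the nonblocking assignment runs at once", $time);
    end

    initial begin
        #10 $display("y_inter=%0d y_intra=%0d y_nb_intra=%0d",
                     y_inter, y_intra, y_nb_intra);   // 13, 7, 7
        $finish;
    end
endmodule
```

The inter-assignment form picks up the new value of ‘a’, while both intra-assignment forms keep the value sampled before the delay; the nonblocking form additionally does not hold up the statements that follow it.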
In Chaps. 1–8, we discussed design synthesis and hardware inference in detail.
Verilog HDL is also powerful for simulation of the design. Using nonsynthesizable
constructs, the Verilog Design Under Verification (DUV) can be verified for
functional correctness. Consider a simple Verilog design of a ring counter with the
inputs “clk” and “reset_n” and the four-bit output “q_out [3:0].” The RTL description
of the ring counter is shown in Example 9.7.
9.2 Simulation Using Verilog 227
[Waveform: (a) 04 then 04; (b) 03 then 08; (c) 02 then 07; (d) 00 then 06; time axis 0, 4, 7, 8, 12]
The testbench for the ring counter is described in Example 9.8; it forces the
stimulus onto the DUV.
The above testbench generates the results shown in Waveform 9.7.
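The listings of Examples 9.7 and 9.8 are not reproduced here. A ring counter and a basic stimulus-only testbench can be sketched as below. The reset value, the rotation direction, the clock period, and the instance and module names are assumptions; clk, reset_n, and q_out come from the text.

```verilog
// Hypothetical sketch: 4-bit ring counter (DUV).
module ring_counter (
    input  wire       clk,
    input  wire       reset_n,   // active low asynchronous reset
    output reg  [3:0] q_out
);
    always @(posedge clk or negedge reset_n)
        if (!reset_n) q_out <= 4'b1000;                 // assumed reset pattern
        else          q_out <= {q_out[0], q_out[3:1]};  // rotate right by one bit
endmodule

// Hypothetical sketch: basic testbench using nonsynthesizable constructs.
module ring_counter_tb;
    reg        clk;
    reg        reset_n;
    wire [3:0] q_out;

    ring_counter duv (.clk(clk), .reset_n(reset_n), .q_out(q_out));

    // Free-running clock with a period of 10 time units
    always #5 clk = ~clk;

    initial begin
        clk     = 1'b0;
        reset_n = 1'b0;                       // start in reset
        $monitor("t=%0t reset_n=%b q_out=%b", $time, reset_n, q_out);
        #12 reset_n = 1'b1;                   // release reset
        #80 $finish;                          // run for several clock cycles
    end
endmodule
```

This testbench only applies stimulus and prints the response; checking the response against expected values would turn it into the kind of self-checking testbench mentioned in the following paragraph.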
As discussed above, basic simulation can be carried out by writing a testbench that
forces the stimulus onto the design under test. For FPGA designs of lower complexity,
this approach can work. But for large SOC design modules, it is essential to use
sophisticated self-checking testbenches. The verification engineer needs to
understand the creation of test cases, test plans, and test vectors. The best
industry practice is to use a verification architecture with drivers, monitors, and
checkers. This discussion is out of scope for FPGA-based design.
[Waveform: signals clk, reset_n]
In the past decade, the Programmable logic devices (PLD) market has grown and
the demand of the PLDs has increased to realize and prototype the new ideas. The
chip which has programmable features and can be programmed is called as PLD.
The PLD is also named as filed programmable device (FPD). FPDs are used to
implement the digital logic, where the integrated circuit can be configured by the
user to realize the different designs. The programming of such integrated circuit is
accomplished by using the special programming using the EDA tools.
The first programmable chip introduced in the market was the Programmable Read
Only Memory (PROM). A PROM has a number of address lines and data lines. The
address lines are used as the logic circuit inputs and the data lines as the logic
circuit outputs. However, the PROM has an inefficient architecture for this purpose
and cannot be used to realize complex digital logic. The device developed during
the 1970s is the PLA, which has two levels of logic and is used to realize
small-density logic. After the PLA, the real evolution of programmable logic
devices took place: the SPLD, CPLD, and FPGA evolved during the early 1980s. An
early programmable logic device is shown in Fig. 9.1.
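The use of a PROM as logic can be sketched as follows, assuming a hypothetical 8 x 2
memory whose three address lines carry the inputs {a, b, cin} of a full adder and
whose two data lines carry {cout, sum}. The module and port names are illustrative
only.

```verilog
// 8 x 2 read-only memory used as a full adder.
// addr = {a, b, cin} (logic inputs), data = {cout, sum} (logic outputs).
module prom_full_adder_sketch (
  input  wire [2:0] addr,
  output reg  [1:0] data
);
  always @* begin
    case (addr)
      3'b000:  data = 2'b00;
      3'b001:  data = 2'b01;
      3'b010:  data = 2'b01;
      3'b011:  data = 2'b10;
      3'b100:  data = 2'b01;
      3'b101:  data = 2'b10;
      3'b110:  data = 2'b10;
      3'b111:  data = 2'b11;
      default: data = 2'b00;
    endcase
  end
endmodule
```

Every additional input doubles the number of stored words, which is one way to see
why this architecture is inefficient for complex logic.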
Modern FPGAs are also called programmable ASICs and are used in various
applications, including ASIC SOC design and prototyping. The main programming
types for any FPGA are SRAM based, flash based, and antifuse based, and they are
discussed in this section.
Most of the FPGAs in the market are based on SRAM technology. They store the
configuration bit file in SRAM cells designed using latches. As SRAM is volatile,
these FPGAs need to be configured at every power-up. There are two programming
modes: Master and Slave. The SRAM memory cell is shown in Fig. 9.4.
In the Master mode, the FPGA reads the configuration data from an external source
such as flash memory.
In the Slave mode, the FPGA is configured by an external master device such as a
processor. The external configuration interface can be JTAG, which is also called
boundary scan.
In flash-based FPGAs, flash memory is used to store the configuration data, so the
primary resource for this type of FPGA is the flash memory. These FPGAs have lower
power consumption and are less susceptible to radiation effects. In SRAM-based
FPGAs, by contrast, the internal flash is used only during power-up to load the
configuration file. The floating gate transistor used in the flash memory is shown
in Fig. 9.5.
Antifuse-based FPGAs can be programmed only once, which makes them different from
the previous two types of FPGAs. An antifuse is the opposite of a fuse: initially
it does not conduct current, but it can be burned to conduct current.
Once these FPGAs are programmed, there is no way to reprogram them, because a burned
antifuse cannot be forced back to its initial state. The antifuse is shown in Fig. 9.6.
The key building blocks in the FPGA architecture are described in this section.
The FPGA architecture is shown in Fig. 9.7.
1. Configurable Logic Block (CLB) A CLB consists of look-up tables (LUTs),
multiplexers, and registers. RAM-based LUTs are used to implement the digital
logic. CLBs can be programmed to realize a wide variety of logic functions, and
they can also be used to store data.
2. Input–Output Block (IOB) This block controls the data flow between the
internal logic and the IO pins of the device. Each IO supports bidirectional
data flow with tri-state control. About 24 different IO standards are supported,
including seven high-performance differential standards. Double data rate
registers are also provided, along with a digitally controlled impedance
feature.
3. Block RAM (BRAM) Block RAMs store large amounts of data and are available in
the form of dual-port RAM, for example an 18-Kbit dual-port block RAM. A device
can contain multiple such blocks.
4. Digital clock managers (DCMs) They are used for clock management and provide
a fully calibrated digital clock solution. They are used to distribute the clock
uniformly, to delay clock signals, and to multiply or divide clock signals with
uniform clock skew.
5. Multipliers A dedicated multiplier block is used to multiply two n-bit
numbers, where n depends on the device. If n = 18, the dedicated block
multiplies two 18-bit numbers.
6. DSP blocks These embedded blocks are used to realize DSP functions such as
filtering and data processing. They improve the overall performance of the FPGA
when it processes large amounts of data in DSP applications.
The FPGA design flow includes the following key steps and is shown in Fig. 9.8.
1. Design entry
2. Design simulation and synthesis
3. Design implementation
4. Device programming.
The design steps are elaborated in the following section.
[Fig. 9.8 FPGA design flow; recoverable labels: design validation, bit-stream
generation, device programming, device test]
Before design entry, design planning needs to be done from the design
specifications. The design specifications need to be converted into the architecture
and micro-architecture, which break the overall design into small modules that
realize the intended functionality. During the architecture design phase, the
memory, speed, and power requirements need to be estimated. The FPGA device for the
implementation is chosen on the basis of these requirements.
Design entry is done by using either Verilog (.v) or VHDL (.vhd) files. After
design entry, the design needs to be simulated to check its functional correctness.
This is called functional simulation.
During functional simulation, a set of inputs is applied to the design to check its
functional correctness. Although timing, area, or power problems can still crop up
later in the design cycle, the designer is at least sure about the functionality of
the design.
The major goal of the hardware design engineer is to generate efficient
hardware. Synthesis is the process of converting one level of design abstraction
into another. In logic synthesis, the HDL is converted into a netlist. The netlist
is device independent and can be in a standard format such as the Electronic Design
Interchange Format (EDIF).
During design implementation the design goes through the translate, map, and place
and route steps. The EDA tool translates the design into the required format and
maps it onto the FPGA depending on the required area. The mapping is performed by
using the actual logic cells or macrocells, the programmable interconnects, and
the IO blocks. Special dedicated blocks such as multipliers, DSP blocks, and BRAMs
are also mapped by the vendor tools. The blocks are then placed on the predefined
geometry inside the FPGA and routed by using the programmable interconnects to
obtain the intended functionality. This step is called place and route.
Timing analysis is performed to check the timing performance of the design and
whether the constraints are met; it is called post-layout STA. During STA, the
timing paths are checked with the delays associated with the programmable
interconnects. Extracting the RC delays and using them for timing analysis is
called back annotation.
If the design is targeted at a specific FPGA, the EDA tool generates a device
utilization summary. Please refer to Appendix II for the XILINX Spartan series
devices.
The architecture of a modern FPGA consists of an array of CLBs, block RAMs,
multipliers, DSP blocks, IOBs, and digital clock managers. A delay-locked loop (DLL)
is used to distribute the clock with uniform clock skew. The floor plan of the
XILINX SPARTAN series FPGA is shown in the following figure.
As shown in the following figure, a basic CLB consists of LUTs, a flip-flop, and
multiplexer logic. The configuration data is held in latches. The CLB architecture
is vendor dependent and can consist of multiple LUTs, flip-flops, multiplexers, and
latches. The following Verilog code is realized by using a single four-input LUT
without a register, and the output is called a combinatorial output.
always@(a_in, b_in)
begin
end
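A complete module of this kind can be sketched as follows. The logic function, the
extra inputs c_in and d_in, and the output name y_out are assumptions chosen for
illustration; any function of up to four inputs fits into one four-input LUT.

```verilog
// Combinational logic of four inputs: a candidate for a single 4-input LUT.
// The function itself is a hypothetical example.
module lut4_comb_sketch (
  input  wire a_in, b_in, c_in, d_in,
  output reg  y_out
);
  always @(a_in, b_in, c_in, d_in)
  begin
    y_out = (a_in & b_in) | (c_in ^ d_in);   // no clock, no register
  end
endmodule
```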
The following Verilog functional block uses a single LUT with a single register
during realization, and hence the logic is called sequential logic.
always@(posedge clk)
begin
end
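A complete module for the registered case can be sketched in the same way. The
function and the port names other than clk are assumptions; the point is only that
the LUT output is captured by the flip-flop of the same CLB.

```verilog
// The same hypothetical four-input function, now followed by a register.
module lut4_reg_sketch (
  input  wire clk,
  input  wire a_in, b_in, c_in, d_in,
  output reg  q_out
);
  always @(posedge clk)
  begin
    q_out <= (a_in & b_in) | (c_in ^ d_in);   // LUT output stored on the clock edge
  end
endmodule
```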
The CLB shown in Fig. 9.9 can also be used to implement a 16-bit shift register.
The LUTs can be cascaded to build longer shift registers or used for pipelining the
design.
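A 16-bit shift register of this kind can be sketched as shown below. The module and
port names are hypothetical. The description deliberately has no reset, on the
assumption that a LUT used as a shift register has no reset input; whether the tool
actually maps it to a LUT or to 16 flip-flops depends on the vendor and the device.

```verilog
// 16-bit shift register with a selectable output tap.
module shift16_sketch (
  input  wire       clk,
  input  wire       d_in,
  input  wire [3:0] tap,     // selects which of the 16 stages drives q_out
  output wire       q_out
);
  reg [15:0] sr;

  always @(posedge clk)
    sr <= {sr[14:0], d_in};  // shift by one position on every clock

  assign q_out = sr[tap];
endmodule
```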
An input–output block establishes the interface between the logic and the outside
world. It consists of a number of registers and buffers with a tri-state control
mechanism. The block can be used to provide a registered input and a registered
output.
The IOB structure of a modern FPGA is complex and can include support for many IO
controls, such as DDR and special-purpose high-speed interfaces. The basic IO block
structure is shown in Fig. 9.10.
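The registered input, registered output, and tri-state control of an IO block can be
described at the RTL level as in the following sketch. All names are hypothetical,
and whether the registers are packed into the IOB depends on the vendor tool and its
settings.

```verilog
// Bidirectional pad with registered input, registered output,
// and registered tri-state control.
module iob_bidir_sketch (
  input  wire clk,
  input  wire out_en,        // 1 = drive the pad, 0 = release it (high impedance)
  input  wire data_to_pad,
  output reg  data_from_pad,
  inout  wire pad
);
  reg out_reg, oe_reg;

  always @(posedge clk) begin
    out_reg       <= data_to_pad;  // registered output
    oe_reg        <= out_en;       // registered tri-state control
    data_from_pad <= pad;          // registered input
  end

  assign pad = oe_reg ? out_reg : 1'bz;
endmodule
```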
The XILINX Spartan-3 family supports 200 MHz block RAM organized in four columns,
in the form of synchronous configurable 18-kbit blocks. Each block RAM contains
18,432 bits, of which 16 kbits are allocated for data storage and the remaining
2 kbits for parity. Block RAM can be used as single-port or dual-port memory, and
each port has independent access. Each port is synchronous, with its own clock,
clock enable, and write enable. Read operations are also synchronous and require
the clock enable. Block RAM is used to store data and to build FIFOs, buffers, and
stacks, and even to implement complex state machines. A single-port RAM is shown
in Fig. 9.11.
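A synchronous single-port RAM can be sketched as below. The 1024 x 16 organization
is an assumption chosen to match the 16 kbits of data storage in one block, and the
module and port names are hypothetical. Whether the description is mapped onto block
RAM depends on the synthesis tool and the coding style it expects.

```verilog
// Single-port synchronous RAM: clock enable, write enable, synchronous read.
module spram_sketch #(
  parameter ADDR_WIDTH = 10,
  parameter DATA_WIDTH = 16
) (
  input  wire                  clk,
  input  wire                  en,     // clock enable for the port
  input  wire                  we,     // write enable
  input  wire [ADDR_WIDTH-1:0] addr,
  input  wire [DATA_WIDTH-1:0] din,
  output reg  [DATA_WIDTH-1:0] dout
);
  reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];

  always @(posedge clk)
    if (en) begin
      if (we) mem[addr] <= din;
      dout <= mem[addr];   // on a write, the old content of the location is read
    end
endmodule
```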
The Xilinx device family uses the delay-locked loop (DLL) and Altera uses the
phase-locked loop (PLL) as the clock manager. The role of the DCM or DLL is to
provide complete control over phase shift, clock skew, and clock frequency. The DCM
or DLL supports the following functions.
• Phase shifting
• Clock skew elimination
• Frequency synthesis.
The DCM consists of a variable delay line and a clock distribution network; its
basic block diagram is shown in Fig. 9.12.
Each embedded multiplier in a Spartan-3 FPGA has two 18-bit inputs and generates a
36-bit output. The multiplier is an embedded block, and each device has 4–104
embedded multiplier blocks. The main advantage of the embedded multiplier is that
it requires less power than a CLB-based multiplier. Embedded multipliers are used
to implement fast arithmetic functions with minimum use of the general-purpose
resources. Multipliers can be cascaded by using the routing resources; the
following figure shows the multiplier configured as 22-bit multiplied by 16-bit to
generate a 38-bit product. The multiplier can be used for signed or unsigned
multiplication. Multipliers are extensively used in DSP applications. The basic
block is shown in Fig. 9.13.
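An 18-bit by 18-bit signed multiplication with a registered 36-bit product can be
sketched as follows. The names are hypothetical, and mapping onto an embedded
multiplier instead of CLB logic depends on the device and the tool settings.

```verilog
// Signed 18 x 18 multiplier with a registered 36-bit product.
module mult18x18_sketch (
  input  wire               clk,
  input  wire signed [17:0] a_in,
  input  wire signed [17:0] b_in,
  output reg  signed [35:0] p_out
);
  always @(posedge clk)
    p_out <= a_in * b_in;   // operands are sign-extended to the 36-bit result width
endmodule
```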
The RTL Verilog code discussed in Chaps. 1–8 can be targeted to a suitable FPGA by
following the design guidelines for FPGAs. The following are a few of the design
guidelines that need to be followed while designing with FPGAs.
Guidelines for using Verilog to implement efficient RTL are listed in this section,
and it is always recommended to use them during the RTL design phase. A few of
these guidelines are described with reference to the Verilog stratified event queue
discussed in Chaps. 4 and 6.
I. Binary encoding is efficient for a design having 16 or fewer states. As the
number of states increases, the next-state combinational logic becomes slower.
II. One-hot encoding is efficient and reliable compared with binary encoding
because of its glitch-free behavior. One-hot encoding requires low-density
next-state logic and is useful in the design of larger FSM blocks. Its main
drawback is that it uses more registers.
III. While designing an FSM, the designer needs to take care of the following key
points:
a. Do not leave any states undefined. Initialize the unused states to the reset
value or use default statements.
b. Do not implement the FSM with a combination of registers and latches.
Avoid unintentional latches in the FSM design to improve reliability.
c. Model the FSM blocks by using case statements to infer parallel logic.
d. Separate the next-state logic, the output combinational logic, and the state
register logic into different always blocks to improve the speed of the FSM
and to get better synthesis results.
e. Register the FSM outputs, as this preserves the hierarchy.
f. Use look-ahead Mealy machines for better design performance.
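The FSM guidelines above can be sketched with a small hypothetical controller. The
states, ports, and binary encoding are assumptions made for illustration. The sketch
keeps the state register and the next-state logic in separate always blocks, covers
the unused encoding with a default branch, and registers the output by decoding the
next state.

```verilog
// Three-state FSM sketch: IDLE -> RUN -> FINISH -> IDLE.
module fsm_sketch (
  input  wire clk,
  input  wire reset_n,   // asynchronous, active low
  input  wire start,
  input  wire done,
  output reg  busy
);
  localparam [1:0] IDLE = 2'b00, RUN = 2'b01, FINISH = 2'b10;

  reg [1:0] state, next_state;

  // State register
  always @(posedge clk or negedge reset_n)
    if (!reset_n) state <= IDLE;
    else          state <= next_state;

  // Next-state combinational logic
  always @* begin
    next_state = state;              // default assignment avoids an unintended latch
    case (state)
      IDLE:    if (start) next_state = RUN;
      RUN:     if (done)  next_state = FINISH;
      FINISH:  next_state = IDLE;
      default: next_state = IDLE;    // unused encoding 2'b11 returns to the reset state
    endcase
  end

  // Registered output, decoded from the next state
  always @(posedge clk or negedge reset_n)
    if (!reset_n) busy <= 1'b0;
    else          busy <= (next_state == RUN);
endmodule
```

Because busy is decoded from next_state, it changes in the same clock cycle in which
the state register enters or leaves RUN.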
I. Use signal grouping to improve the performance of an FPGA-based design. For
example, if the expression q = (x + y + z + w) is implemented on an FPGA, the
inferred hardware is a cascaded structure, whereas with the grouped expression
q = (x + y) + (z + w) the inferred hardware is a parallel structure. Grouping
therefore improves the timing performance of the design.
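The two expressions can be placed side by side as in the following sketch. The
module name, the parameter WIDTH, and the two output names are assumptions. Both
outputs are functionally identical; only the adder structure the tool is likely to
infer differs, and a tool may restructure either form during optimization.

```verilog
// Cascaded versus grouped addition of four operands.
module add_grouping_sketch #(
  parameter WIDTH = 8
) (
  input  wire [WIDTH-1:0] x, y, z, w,
  output wire [WIDTH+1:0] q_cascade,
  output wire [WIDTH+1:0] q_grouped
);
  // Three adders in series: ((x + y) + z) + w
  assign q_cascade = x + y + z + w;

  // Two adders in parallel followed by one adder
  assign q_grouped = (x + y) + (z + w);
endmodule
```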
9.7.5 Assignments
Most synthesis tools ignore the sensitivity list of a combinational procedural
block, but the simulator executes the procedural block only when there is an event
on one of the signals in the sensitivity list. An incomplete sensitivity list
therefore creates a simulation and synthesis mismatch.
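The mismatch can be reproduced with the following sketch, in which the signal names
are hypothetical. Both blocks describe the same multiplexer, but sel_in is missing
from the first sensitivity list, so in simulation y_bad does not follow a change on
sel_in alone, while synthesis is likely to build the same multiplexer for both
outputs.

```verilog
module sens_list_sketch (
  input  wire a_in, b_in, sel_in,
  output reg  y_bad,
  output reg  y_good
);
  // Incomplete sensitivity list: sel_in is missing
  always @(a_in or b_in)
    y_bad = sel_in ? a_in : b_in;

  // Complete sensitivity list using the Verilog-2001 wildcard
  always @*
    y_good = sel_in ? a_in : b_in;
endmodule
```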
Logic duplication is a powerful technique to reduce net delay, because it enables
the placement tool to place the replicated logic in various areas of the die [2].
The major drawback of this technique is that replicating registers or sequential
logic increases the area of the design.
As far as FPGA area minimization is concerned, logic duplication can be a very
efficient tool, but it depends on the design-specific scenario. Consider the example
of implementing an 8:256 decoder using a single case statement. If the FPGA
architecture has a logic block with two four-input LUTs and an output generation
LUT, as shown in Fig. 9.14 [3], then implementing a single output uses three LUTs,
so the 256-bit output uses 768 LUTs. Logic duplication can be achieved by splitting
the case statement to implement two 4:16 decoders. If the two 4:16 decoders are
used with an array of 256 AND gates, the overall device utilization is just
288 LUTs for the 8:256 decoder, which reduces the device utilization by 480 LUTs.
That is a very large reduction in the overall area. For the 8:256 decoder, the
logic duplication is accomplished by using the four-input LUTs and two-input LUTs;
the structure of the logic block is shown in Fig. 9.14 [3].
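The split decoder can be sketched as follows, with hypothetical module and signal
names. Two 4:16 decoders (16 LUTs each under the assumption of one four-input LUT
per output) feed an array of 256 two-input AND gates, which gives the
16 + 16 + 256 = 288 LUTs quoted above. The actual utilization depends on the device
and the tool.

```verilog
// 8:256 decoder built from two 4:16 decoders and a 256-gate AND array.
module decoder_8to256_sketch (
  input  wire [7:0]   addr_in,
  output wire [255:0] y_out
);
  reg [15:0] dec_lo, dec_hi;

  // Two 4:16 one-hot decoders
  always @* begin
    dec_lo = 16'b0;
    dec_lo[addr_in[3:0]] = 1'b1;
    dec_hi = 16'b0;
    dec_hi[addr_in[7:4]] = 1'b1;
  end

  // One two-input AND gate per output
  genvar h, l;
  generate
    for (h = 0; h < 16; h = h + 1) begin : g_hi
      for (l = 0; l < 16; l = l + 1) begin : g_lo
        assign y_out[16*h + l] = dec_hi[h] & dec_lo[l];
      end
    end
  endgenerate
endmodule
```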
The performance and reliability of an FPGA-based design depend on the clocking
scheme. For FPGA-based design and implementation, the following is recommended:
a. Use a single global clock.
b. Avoid the use of gated clocks.
c. Avoid mixed use of positive and negative edge triggered flip-flops.
d. Avoid the use of internally generated clock signals.
e. Avoid ripple counters and asynchronous clock division.
Most FPGA vendors recommend not using internally generated clocks, as they cause
functional and timing issues in the design. If internally generated clocks are
required in the design, use a DLL [3] or PLL [2] to generate them. Clocks generated
internally by combinational logic are prone to glitches, which create functional
issues in the design, and the combinational delays create timing issues in FPGA
designs. The major problem with internally generated clocks arises during synthesis
and timing analysis. Xilinx [3] provides the global clock buffer library components
BUFGCTRL [3] and BUFGMUX [3] to generate internal clocks.
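One common way to follow guidelines d and e without a DLL or PLL is a clock enable.
This is a substitute technique chosen for illustration and is not described in the
text above. In the sketch, with hypothetical names, logic that should run at one
quarter of the clock rate stays on the single global clock and is enabled once every
four cycles, instead of being clocked by a divided clock.

```verilog
// Clock enable instead of an internally divided clock.
module clk_enable_sketch (
  input  wire       clk,        // single global clock
  input  wire       reset_n,    // asynchronous, active low
  output reg  [7:0] count_out
);
  reg  [1:0] div_cnt;
  wire       tick = (div_cnt == 2'b11);  // high for one clock in every four

  // Free-running divider counter, used only to create the enable
  always @(posedge clk or negedge reset_n)
    if (!reset_n) div_cnt <= 2'b00;
    else          div_cnt <= div_cnt + 1'b1;

  // Slow logic: same clock, updated only when the enable is high
  always @(posedge clk or negedge reset_n)
    if (!reset_n)  count_out <= 8'd0;
    else if (tick) count_out <= count_out + 1'b1;
endmodule
```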
To avoid glitches, it is recommended to register the output of the internally
generated clocks. It is recommended to use the clock generation logic. For low power
In a synchronous design, the data input is sampled on every active edge of the clock,
and the clock signal controls the activities of the inputs and outputs. Figure 9.17
represents a synchronous design in which the combinational logic (CL) drives the
data input of the flip-flop. For proper operation of the design, the data input must
be stable for at least the setup time of the register before the active clock edge
and for at least the hold time of the register after it. The propagation delay of
combinational logic limits the
difficult and complex while using asynchronous resets. At the same time automatic
insertion of the test structure is difficult.
On the other hand, synchronous resets are difficult to implement, as they require
more resources and depend on the clock. Synchronous resets slow down the design
performance. It is recommended that the FPGA designer avoid internally generated
conditional resets [2, 3].
It has been observed in FPGA-based designs that a reset deassertion circuit is
required when an asynchronous reset is used. If the reset signal is deasserted and
does not pass the setup and hold timing checks, the flip-flop can go into a
metastable state, which can lead to functional issues in the design [5].
It is recommended to use synchronized asynchronous resets, that is, resets that are
asynchronously asserted and synchronously deasserted. Figure 9.19 is the recommended
representation of an asynchronous active-low reset (reset_n) passing through a
two-level synchronizer.
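A two-level reset synchronizer of this kind can be sketched as follows. The module
name, the internal register names, and the output name reset_sync_n are assumptions;
the circuit in Fig. 9.19 may differ in detail. The output goes low as soon as reset_n
goes low and returns high only after two rising clock edges.

```verilog
// Asynchronous assertion, synchronous deassertion of an active-low reset.
module reset_sync_sketch (
  input  wire clk,
  input  wire reset_n,        // asynchronous active-low reset from the pin
  output wire reset_sync_n    // reset to be distributed inside the design
);
  reg sync_ff1, sync_ff2;

  always @(posedge clk or negedge reset_n)
    if (!reset_n) begin
      sync_ff1 <= 1'b0;       // asserted immediately, without a clock
      sync_ff2 <= 1'b0;
    end else begin
      sync_ff1 <= 1'b1;       // deassertion ripples through two flip-flops
      sync_ff2 <= sync_ff1;
    end

  assign reset_sync_n = sync_ff2;
endmodule
```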
For very large or complex FPGA-based designs with multiple levels of hierarchy, it
is essential to use a linting tool that can provide proper information about the
reset and clock trees [2, 3].
Reducing power is critical for many applications, and because of design complexity
the use of power-efficient FPGA devices or architectures alone is not sufficient.
The designer must understand the features of the EDA tools that optimize dynamic
power. Many FPGA vendors recommend reducing the switching activity in the
sequential logic and clock routing [2, 3]. For low power design, it is recommended
to use gated clocks or low power clock gating cells. The dynamic power of a cell
depends on the voltage, the load capacitance, and the clock frequency, and it
increases with switching at the clock input. To reduce dynamic power, it is
therefore recommended to use clock gating cells. Figure 9.21 shows the clock gating
cell.
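A behavioral model of a clock gating cell can be sketched as below, assuming the
common structure of a latch followed by an AND gate; the cell in Fig. 9.21 may differ,
and all names are hypothetical. The latch is transparent while the clock is low, so
a change on enable during the high phase cannot create a glitch on the gated clock.
In a real design a vendor or library gating cell would normally be instantiated
instead of this model.

```verilog
// Latch-based clock gating cell (behavioral sketch).
module clock_gate_sketch (
  input  wire clk,
  input  wire enable,
  output wire gated_clk
);
  reg enable_latched;

  // Level-sensitive latch, transparent while clk is low
  always @(clk or enable)
    if (!clk) enable_latched <= enable;

  assign gated_clk = clk & enable_latched;
endmodule
```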
FPGA vendors always recommend having a detailed understanding of the FPGA device
and its architecture. It is recommended to use the vendor-specific design and
coding guidelines to improve the performance of the design. It is highly
recommended to encrypt the IP by using proper security standards.
9.8 Summary
References
Keywords ASIC, FPGA, STA, Data path, Control path, Library, Link library,
Target library, Search path, Technology library, Design constraints,
Optimization constraints, Resource allocation, Structuring, Partitioning,
Clock definitions, Skew definitions, Input delay, Output delay, Read design,
Check design, Analyze, Elaborate, Compile, Map efforts
ASIC stands for Application Specific Integrated Circuit. Integrated circuits are
fabricated on silicon wafers, and each wafer consists of thousands of dies. An
integrated circuit designed for a specific application is called an ASIC. Examples
are a chip designed for a car controller, a chip designed for satellite
communication, and interfacing chips that establish communication between the CPU
and memory. Microprocessors and memories are general-purpose integrated circuits and
are not treated as ASICs. The main types of ASIC are shown in Fig. 10.1.
In a full-custom ASIC, the design starts from scratch. The ASIC design engineers
create the ASIC logic cells and the layout required for all the logic. Both analog
and digital designs can be implemented by using full-custom ASICs. In this type of
ASIC, predefined standard cells or gates are not used to describe the functionality
of the design.
In a semi-custom (standard cell) ASIC, the designer uses predefined logic cells,
which are also called standard cells. Examples of standard cells are logic gates,
MUXes, flip-flops, and latches. These standard cells are predefined and pretested,
so the designer saves design time and money, and there is less risk in using them.
These ASIC designs are flexible like full-custom ASIC designs but reduce the overall
risk. The standard cell libraries themselves are designed by using the full-custom
design flow, but in semi-custom design the already designed libraries are used.
In a gate array ASIC, an array consisting of a number of transistors is predefined
on the silicon wafer. The array is also called the base or basic array, and the
transistor cell is called the basic cell or base cell. The interconnect between the
cells and the internal structure of the cell are customized, which improves the
programmability. The types of these ASICs are as follows:
a. Channeled Gate Array
b. Channelless Gate Array
c. Structured Gate Array.
While designing an ASIC, the following key objectives need to be considered:
1. Speed of an ASIC Whether the ASIC is working at the desired high speed or not.
2. Area of an ASIC What is the maximum area of an ASIC?
3. Power of an ASIC What is the leakage and dynamic power dissipation in the
best case and worst case scenarios?
4. Time to Market of an ASIC What is the time to market for an ASIC?
To design an ASIC, the designer needs an in-depth understanding of the key steps
from specification to layout. These key steps define the design flow. Figure 10.2
shows a simple ASIC design flow with the key steps used in the design cycle.
As shown in the above figure, an ASIC design flow consists of key design steps that
can be treated as design milestones. Every ASIC design starts with a basic idea,
and the idea for the chip functionality is an outcome of in-depth market research.
After the idea for the design functionality is frozen, the actual ASIC design
implementation cycle starts with the specification definitions. The following
section describes the key steps in the ASIC design flow.
The input to the design specification is the data collected through market
research or the data for the feasible ideas. The following key points need to be
described in the design specification document:
a. Functionality of the design, that is, what exactly the chip does
b. Design goals and constraints for the design
c. Performance constraints like the speed, power, and area for the said design
d. Technology constraints like the physical dimensions, space and size for the cell
level design
e. Fabrication techniques for the ASIC design
f. Vendor-dependent constraints and third-party IPs
g. Memories and macros used for the design
h. The data rate and the interface definitions for the design
i. Packaging information and the testing or verification planning for the design
j. Risk and dependability and time to market for the design.
The above specifications are described in the form of block diagrams, and this
phase is called architecture level design. As discussed in the previous section, the
architecture consists of the block level representation of an ASIC design. For
example, if a 16-bit processor needs to be designed, the architecture can consist
of the ALU, control logic, instruction decoder and encoder, interrupt logic, serial
IO controller, bus arbitration logic, counters, and pointer logic. All these blocks
are interconnected to form the architecture required for the specific application.
The chip architect designs multiple architectures, and the best possible
architecture for the ASIC is frozen depending on the requirements of speed, power,
and resources. This architecture document is used in the ASIC design cycle to
describe the functionality of each and every block.
After the architecture for an ASIC is frozen, the architecture blocks are described
in the form of smaller blocks with their interface and logic details; this is called
the microarchitecture of the design. The microarchitecture of every functional block
can be represented for the intended design functionality by using logic elements.
A chip architect with a good amount of industrial experience can design a viable
and feasible microarchitecture by understanding the functional, timing, and power
requirements. Most ASIC microarchitectures include the low power definitions, the
DFT friendly design details, the timing details for the interfaces, and the area
details. In the microarchitecture, the software and hardware design partitioning is
defined along with the technology-dependent component details.
Once the RTL verification is completed and the coverage goals are met, the RTL is
converted into an optimized gate level netlist. The process of converting an RTL
design into the gate level netlist is called synthesis. The EDA tool uses the
Verilog or VHDL RTL, the design constraints, and the standard cell library as inputs
and generates the gate level netlist as output. Figure 10.3 shows the synthesizer
inputs and outputs.
Popular synthesis tools in the industry include Synopsys Design Compiler and
Cadence RTL Compiler. The synthesis tool considers timing, power, and testability
as the major factors while generating the gate level netlist. The synthesizer tries
to meet the constraints by calculating the cost of various implementations. The gate
level netlist is a structural description containing only standard cells. The gate
level netlist is verified for functional correctness, and this is called gate level
verification.
After successful verification of the RTL design, the design needs to be checked for
timing violations. This process is called pre-layout STA. During this milestone,
the goal of the ASIC engineer is to find the timing violations in the design. STA
in this phase is performed without considering the parasitic (RC) effects. The
objective is to fix the setup time violations and to improve the overall
performance of the design. In most ASICs, the hold time violations for the various
timing paths are fixed after CTS and routing.
Before the physical implementation of an ASIC design, the gate level netlist is
given as an input to the DFT (Design For Testability) tool. The objective is to find
the various design faults. As discussed above, the RTL should be DFT friendly to
allow quicker scan chain insertion and to determine the overall fault coverage of
the design. The DFT techniques and processes are included in Appendix III.
The next milestone in the ASIC flow is physical design and implementation. In this
phase the gate level netlist is converted into a geometric representation, which
can be treated as the layout of the design. A detailed discussion of physical design
is out of scope, and readers are requested to refer to books on physical design and
synthesis. The basic physical design flow is shown in Fig. 10.4 and consists of the
following key steps.
The gate level netlist is given to the physical implementation tool to generate the
layout. The key steps in physical design implementation are floor planning, power
planning, placement of standard cells and macros, clock tree synthesis, and
routing. The design is basically converted from the gate level abstraction to the
switch level abstraction by using CMOS cells. The generated netlist is given to the
STA tool to fix the timing violations, and this process is called post-layout STA.
The physical design and implementation tool uses the design rule library to
produce the GDSII file. The design rule library consists of guidelines based on the
fabrication process. The GDSII file is used by the foundry to fabricate the
integrated circuit. Industry-leading tools for physical design and implementation
include IC Compiler from Synopsys and Encounter from Cadence.
Physical verification needs to be performed to verify the intended design
functionality and to make sure that the layout follows the design rules. After
physical verification and timing analysis, the layout is ready for fabrication. In
this phase the layout data is converted into photolithography masks. After
fabrication, the wafer is diced into individual chips, which are packaged and
tested.
This section focuses only on logic synthesis, using the Design Compiler to perform
RTL to gate level synthesis. As discussed above, the RTL synthesis tool uses the RTL
design in either Verilog (.v) or VHDL (.vhd) files, the design constraints (.sdc),
and the library (.lib) as inputs and generates the optimized gate level netlist
using the standard cells available in the library. The gate level netlist is
technology dependent and can change if the process node varies. Depending on the
functionality, the gate level netlist for a 40 nm process can differ from the gate
level netlist generated for smaller process nodes such as 20 or 14 nm.
The ASIC synthesis tool internally performs a few steps to generate the gate level
netlist. The key steps for ASIC synthesis are translate, map, and optimize. The key
steps for FPGA synthesis are translate, optimize, and map. Figure 10.5 gives brief
information about the ASIC synthesis steps used to generate the gate level netlist.
1. Read Library: To perform RTL synthesis, the synthesizer reads the DesignWare
libraries, technology libraries, and symbol libraries. The DesignWare library
consists of complex cells such as adders, comparators, and multipliers. The
technology library consists of logic gates, flip-flops, latches, etc. During
synthesis, the synthesizer algorithms automatically determine when to use the
technology library cells and when to use the DesignWare library components.
These library cells are used efficiently to generate the gate level netlist.
2. The next step is to read the RTL description written in either Verilog or
VHDL.
3. After reading the libraries and the HDL, the synthesis tool performs many
required steps, such as optimization, conversion to unoptimized Boolean logic,
and technology-independent optimization, and finally maps the logic by using the
technology library, which is also called the target library. This process is
called linking the logic to the desired target library.
4. The synthesizer uses design constraints such as area, speed, and power while
optimizing the design with the standard cells in the target library. Basically,
the link library can be an IO library, cell library, or macro library and is used
to link the design, whereas the target library is used while optimizing the design.
5. For efficient RTL coding, the RTL design engineer should have a good
understanding of the target standard cell library. After the design is optimized,
it is ready for DFT, whose goal is to detect faults in the design early. The DFT
friendly RTL needs to be written at the RTL design stage itself to enable quick
scan insertion and testing for various faults in the design.
6. The optimized netlist can be in the Verilog (.v) format or in the database (.ddc)
format and is used by the placement and routing tool. Based on the routing,
back-annotation can be performed with the actual routing delays for accurate
timing analysis. If the timing goals are not met, the design can be resynthesized
to meet them.
When the Synopsys Design Compiler is invoked, it reads the startup file
.synopsys_dc.setup. There should be two startup files: one in the current working
directory and another in the root directory where the Design Compiler is installed.
To use the tool, the following important parameters need to be set up.
1. search_path This parameter is used to search for the synthesis technology
library for reference during synthesis.
2. target_library This parameter is used by the synthesizer while mapping the
logic cells during synthesis. The target library consists of the logic cells.
3. symbol_library All the logic cells have a symbolic representation used for
visualization after synthesis. This parameter is used to point to the library that
contains the visual information for the logic cells present in the technology
synthesis library.
4. link_library The tool uses the cells from the target_library for mapping the
desired functionality; this parameter is used to point to the reference of the
logic gates in the synthesis technology library.
The above four parameters for .synopsys_dc.setup are described by using the
following [1]
Once the above variables or parameters are set up for the desired process node
library, the synthesis tool can be invoked at the command prompt.
Design objects are used during synthesis. Every design is a description of a logic
circuit that performs some logical operations. The design can be a single system
description or can consist of multiple sub-systems. The design objects are
described in Table 10.1.
The design is described in VHDL or Verilog using synthesizable constructs. This
design needs to be used as an input by Design Compiler. Table 10.2 describes the
key commands used by Design Compiler for various definitions.
It is essential for the ASIC design engineer to understand the difference between
the read command and the analyze and elaborate commands. The following are the
key highlights:
1. The analyze and elaborate commands are used in order to pass required
parameters while elaborating the design.
2. The read command is used to read pre-compiled designs or netlists into DC.
10.5 Constraining Design Using Synopsys DC 265
After the design has been read into Design Compiler, the check_design command is
used. Table 10.3 describes the command used to check for errors in the design.
The clock needs to be defined using the create_clock command, and it is used as
the reference for timing analysis. Table 10.4 describes the clock definition
commands.
If the designer wishes to use a clock with a variable duty cycle, with the rising
edge at 1 ns and a clock period of 5 ns, then the same command can be modified as
As discussed in the previous section, skew is the difference between the arrival
times of the clock signal at the clock pins of different flip-flops. If the clock
at the source flip-flop is delayed with reference to the destination flip-flop,
the skew is called negative clock skew and is useful for hold. If the clock at the
destination flip-flop is delayed with reference to the source flip-flop, the skew
is called positive clock skew and is useful for setup, because with the clock at
the destination flip-flop delayed, the data can arrive later.
Design Compiler does not synthesize the clock tree; hence, clock skew is used to
model the propagation delay that exists in the clock tree.
Table 10.6 Commands used for the input, output delay definitions
Command [1] | Description
set_input_delay -clock <clock_name> <input_delay> <input_port> | Used to define the input port delay with reference to the clock. To define 1 ns delay with reference to the clock, the command can be used as: set_input_delay -clock master_clock 1 data_in
set_output_delay -clock <clock_name> <output_delay> <output_port> | Used to define the output port delay with reference to the clock. To define 1 ns delay with reference to the clock, the command can be used as: set_output_delay -clock master_clock 1 data_out
Table 10.5 describes the commands used by Design Compiler while defining clock
skew for the design.
The input and output delays can be defined by using set_input_delay and
set_output_delay, respectively. Table 10.6 describes the commands with the
required parameter definitions.
The input and output delays can be defined as min or max depending on the
requirements. Table 10.7 describes the min and max delay definitions.
The design can be synthesized using the compile command; prior to synthesis, the
design constraints need to be applied to the design. The design can be synthesized
using different effort levels such as low, medium, and high.
Table 10.8 describes the compile command.
The design can be saved in various formats by using the write command in Design
Compiler. The format can be Verilog or database format (ddc).
Table 10.9 describes the command used to save the design.
268 10 ASIC RTL Synthesis
Before discussing synthesis, timing, and reports, let us understand the different
synthesis techniques used for optimization. Optimization can be performed at the
code level or during synthesis. A fully optimized design is one that has met the
area and timing requirements. Optimization at the RTL level can be achieved by
modifying the code while preserving the intended functionality. In such
optimizations, care has to be taken that the optimized code gives the same
simulation results before and after synthesis. There are a few standard techniques
used in practice to get better synthesis optimizations and results. A few of these
techniques are discussed in this section.
This optimization technique is used for better synthesis results; it works by
sharing hardware resources.
Consider the Verilog procedural block described in the following example.
always@(*)
begin
  if(a_in==1)
    y_out = b_in+c_in;   // first adder
  else
    y_out = b_in+d_in;   // second adder
end
The above functionality generates two adders: one to perform the addition of b_in
and c_in and another to perform the addition of b_in and d_in. It also generates a
2:1 MUX to select one of the adder outputs. The synthesis result is shown in
Fig. 10.6.
In the above synthesis result, the common input b_in is not shared properly. If
the above code is modified using only one adder then the synthesis optimization is
better due to minimum area. Figure 10.7 shows the synthesis output.
The modified optimized Verilog code functionality is described in the following
example.
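always@(*)
begin
  if(a_in==1)
    y_tmp = c_in;   // 2:1 MUX selects the second operand
  else
    y_tmp = d_in;
end
// Single shared adder. This line is an assumed completion: the listing stops at
// the MUX, but b_in must still be added to produce y_out.
assign y_out = b_in + y_tmp;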
So before resource sharing the area was larger; the resource sharing technique is
effective in reducing the area.
In most Verilog RTL code, the RTL engineer uses expressions or sub-expressions.
Most of the time, the sub-expressions are not reused. If the computed values of
the sub-expressions are reused, the synthesizer is able to generate better
results.
Consider the example shown below. In the following example b_in + c_in is
used for the multiple assignments.
Instead of the two assignments the single continuous assignment can give the
better logic using minimum resources.
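The idea of reusing a common sub-expression can be sketched as follows. This is only an illustrative sketch: the module name, the port names other than b_in and c_in, the 8-bit widths, and the AND/OR operations applied to the shared sum are all assumptions, not the book's original listing.

```verilog
// Hypothetical example: names, widths, and operators are assumptions.
module cse_example (
    input  wire [7:0] a_in, b_in, c_in, d_in,
    output wire [8:0] y_out,
    output wire [8:0] z_out
);
    // Without sharing, each output would describe its own adder:
    //   assign y_out = a_in & (b_in + c_in);
    //   assign z_out = d_in | (b_in + c_in);

    // The common sub-expression b_in + c_in is computed once ...
    wire [8:0] sum_bc = b_in + c_in;

    // ... and reused by both outputs, so only one adder is described.
    assign y_out = {1'b0, a_in} & sum_bc;
    assign z_out = {1'b0, d_in} | sum_bc;
endmodule
```

Many synthesis tools detect such sharing on their own, but writing the shared term explicitly makes the intent visible in the RTL.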
Consider another piece of Verilog RTL in which a common factor can be reused to
write more efficient RTL.
always@(*)
begin
if (a_in)
else
end
In the above example the common factor is (c_in + d_in) and can be reused. The
above code can be modified as
always@(*)
begin
if (a_in)
else
end
These minor modifications in the Verilog RTL can generate more optimized
logic.
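A minimal sketch of common factoring is given below. Only a_in and the common factor (c_in + d_in) come from the text; the operands b_in and e_in, the widths, and the module name are assumptions chosen for illustration.

```verilog
// Hypothetical example: b_in, e_in, widths, and module name are assumptions.
module factor_example (
    input  wire       a_in,
    input  wire [7:0] b_in, c_in, d_in, e_in,
    output reg  [9:0] y_out
);
    // The common factor is computed once, outside the branches.
    wire [8:0] cd_sum = c_in + d_in;

    always @(*) begin
        if (a_in)
            y_out = b_in + cd_sum;   // was: b_in + (c_in + d_in)
        else
            y_out = e_in + cd_sum;   // was: e_in + (c_in + d_in)
    end
endmodule
```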
In most Verilog RTL, expressions are used in for or while loops. The values of
these expressions may or may not change on every iteration. Statements in for or
while loops whose values do not change can be handled by changing the code. The
synthesizer handles such scenarios during optimization, but there is a chance of
redundant logic being generated. This can be avoided by moving the expression
outside the loop. Consider the following Verilog RTL.
z_out = y_tmp-9;
In the above example it is assumed that y_tmp is not assigned a new value within
the loop, so the above expression remains constant for every iteration of the
loop. The synthesizer generates nine subtractors during synthesis, which occupies
more area. The above Verilog RTL functionality can be modified to avoid the
unnecessary logic.
z_out = tmp;
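Moving the invariant expression out of the loop might look like the sketch below. The nine iterations match the nine subtractors mentioned in the text; everything else (the module name, sel_in, the packed 72-bit output, and the 8-bit widths) is an assumption made only so the example is complete.

```verilog
// Hypothetical example: sel_in, the packed output, and widths are assumptions.
module loop_invariant_example (
    input  wire [7:0]  y_tmp,
    input  wire [8:0]  sel_in,
    output reg  [71:0] z_out      // nine 8-bit results packed together
);
    integer i;
    reg [7:0] tmp;

    always @(*) begin
        // Loop-invariant subtraction hoisted out of the loop: one subtractor.
        tmp = y_tmp - 8'd9;
        for (i = 0; i < 9; i = i + 1) begin
            // Inside the loop only the precomputed value is used.
            z_out[8*i +: 8] = sel_in[i] ? tmp : 8'd0;
        end
    end
endmodule
```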
Consider the use of constants in Verilog RTL code. Instead of writing code that
computes a constant, use the directly computed or required value for y_out. The
Verilog RTL piece of code is shown in the following example.
Instead of using the above unnecessary Verilog RTL, the better way is to use the
value 9 for y_out; this technique is called constant folding.
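Constant folding can be sketched as below. The unfolded form shown in the comment (two constants, 4 and 5, added together) is purely an assumption for illustration; only the folded value 9 for y_out comes from the text.

```verilog
// Hypothetical example: the constants 4 and 5 are assumptions.
module constant_folding_example (
    output wire [3:0] y_out
);
    // Unfolded form, describing an addition of two constants:
    //   localparam A = 4, B = 5;
    //   assign y_out = A + B;

    // Folded form: the computed value is written directly.
    assign y_out = 4'd9;
endmodule
```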
The section of the code which is never executed is called as dead zone code. The
dead zone code elimination technique has to be used for the better synthesis results.
The piece of Verilog RTL is shown in the following example
integer c_in=3;
integer b_in =2;
always@(*)
begin
  if (b_in >c_in)
    y_out=1;   // never executed: 2 > 3 is false
  else
    y_out=0;
end
In the above code the condition b_in > c_in is always false, and hence the if
statement always takes the else branch and y_out is always 0. The synthesizer
performs this kind of optimization during synthesis, but if the code is modified
to remove the dead branch, the synthesis time is reduced.
In the most of the Verilog RTL designs if parentheses are used properly then the
synthesis results can be better optimized.
For example if the assign statement is used in the Verilog RTL without any
parentheses then it generates the logic with more propagation delay (Fig. 10.8).
If the above statement is modified as shown below then it gives the clear timing
and data path (Fig. 10.9).
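The effect of parentheses can be sketched with a four-operand addition. This is an assumed example (the operands, widths, and output names are not from the original listing), and a synthesis tool may restructure either form during optimization.

```verilog
// Hypothetical example: operand names and widths are assumptions.
module paren_example (
    input  wire [7:0] a_in, b_in, c_in, d_in,
    output wire [9:0] y_chain,
    output wire [9:0] y_tree
);
    // Evaluated left to right as ((a_in + b_in) + c_in) + d_in:
    // three adders in series on the longest path.
    assign y_chain = a_in + b_in + c_in + d_in;

    // Parentheses group the operands into two parallel additions
    // followed by one more: only two adder levels on the longest path.
    assign y_tree  = (a_in + b_in) + (c_in + d_in);
endmodule
```

Both outputs compute the same sum; only the described structure, and therefore the likely propagation delay, differs.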
The design needs to be structured and partitioned for a better synthesis outcome.
In practice, a design that is better partitioned generates better synthesis
results and also reduces the synthesis runtime. The following are key guidelines
recommended for design partitioning:
1. Partition the design for design reuse.
2. Use a different module for each different functionality.
3. Keep related combinational logic in the same block.
4. Use separate blocks for structured logic and for random logic.
5. Partition the design at the top level.
6. Do not use glue logic at the top level.
7. Use a separate module for state machines, that is, isolate the state machines
from the other logic.
8. Limit the logic size to a maximum of 10 K gates for every block.
9. Avoid the use of multiple clocks in the same block.
10. Isolate the synchronizers for multiple clock domain designs.
The timing and area reports and the synthesis script are discussed in Chap. 12
10.7 Summary
5. RTL synthesizer uses the Verilog RTL, libraries, and constraints as an input.
6. The physical design flow has the steps like floor planning, power planning,
CTS, placement and routing, and back-annotation.
7. The Synopsys DC uses the different optimization techniques while performing
the synthesis.
8. The optimization can be achieved by modification in the RTL code or by using
the synthesis optimization algorithms.
Reference
1. www.synopsys.com Guidelines and practices for successful logic synthesis version 1998, 08
Aug 1998
Chapter 11
Static Timing Analysis
Abstract Static timing analysis (STA) is used for the timing checks of any ASIC
design. The objective of this chapter is to discuss in detail the STA concepts
used by the timing analyzer. This chapter discusses the register timing parameters
and their use in frequency calculations. Positive clock skew and negative clock
skew are also discussed in detail with practical scenarios. This chapter also
focuses on the different timing paths and on the SDC commands and their use while
writing the script. The solutions and techniques to fix setup and hold violations
are also discussed. Timing exceptions such as false and multicycle paths are
covered with practical scenarios.
Keywords Timing violations · Setup time · Hold time · Clock to q delay · Timing
paths · Timing exceptions · Positive clock skew · Negative clock skew · Slack ·
Data arrival time · Data required time · Logic duplication · Priority encoding ·
Multiplexed encoding · Register balancing · Pipelining · Asynchronous paths ·
Hold fixes · Data path · Clock path · Maximum operating frequency
In the previous section we have discussed the key RTL Verilog concepts and a few
important synthesis commands in detail, but we have not discussed the timing
parameters for the ASIC design. Timing analysis is a very important phase for any
ASIC design, and it can be performed during various design phases. Timing analysis
can be performed before the design layout stage and after the design layout stage.
So it is essential to understand the key timing considerations for an ASIC design.
Before layout, timing analysis is performed on the gate-level netlist of the
design with the goal of fixing setup time violations. The timing analysis tool
uses the design constraint file and the vendor libraries to perform the timing
analysis for the design. Timing analysis is of two types: static and dynamic.
Static timing analysis (STA) is performed without using any set of vectors, and
dynamic timing analysis is performed using a set of vectors for the design. The
goal is to fix the setup and hold time violations for the design.
For any sequential element, two important design parameters are setup and hold
time.
If the setup time or hold time is violated, the flip-flop can go into a metastable
state. So it is essential to find and fix the timing violations in the design, and
the violations are found by the timing analysis tool. A popular timing analysis
tool is Synopsys PrimeTime (Synopsys PT). This section focuses on the key timing
considerations and their importance during timing closure.
The minimum amount of time required for which the input signal ‘d’ of flip-flop
should maintain stable value either logic ‘0’ or logic ‘1’ before arrival of an active
edge of the clock is called as setup time.
The setup time needs to be checked for the design, particularly when the design is
over constrained, because there can then be many setup violations. The designer
may find that the violations are due to constraints that are too tight.
To meet the setup time, the data should arrive at the ‘D’ input of the flip-flop a
certain amount of time before the arrival of the active clock edge. For example,
if a design operates at a 200 MHz clock frequency (clock cycle time period = 5 ns)
and has a setup time requirement of 1 ns, then the data should arrive within 4 ns
of the launching clock edge, that is, at least 1 ns before the next active clock
edge, so that the required setup time of 1 ns is met.
Consider Fig. 11.1, consisting of combinational logic with a register. If the
setup time is tsu, then the data arrival time, which depends on tcomb, should be
such that the setup time is met.
So the required time for the data to reach the ‘D’ input is Tclk − tsu. The data
arrival time is tcomb, that is, the delay of the combinational logic.
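The setup condition for this input path can be written out as a short worked equation. The sketch below uses the symbols from the text (Tclk, tsu, tcomb) and the 200 MHz, 1 ns figures from the earlier example; it assumes an ideal clock with no skew.

```latex
% Setup check: data arrival time must not exceed data required time
\[ t_{comb} \le T_{clk} - t_{su} \]
% With T_clk = 5 ns (200 MHz) and t_su = 1 ns:
\[ t_{comb} \le 5\,\text{ns} - 1\,\text{ns} = 4\,\text{ns} \]
```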
Figure 11.2 shows the valid setup time region with the condition to meet the
desired setup time. Consider the positive edge of the clock the data arrives at the D
input of flip-flop prior to setup time window. So there is no setup violation in the
design.
Data arrival time is the amount of time the data takes to arrive at the data
input of the D flip-flop. It is given by
11.1 Setup Time 279
The amount of time for which the input signal ‘d’ of flip-flop should maintain the
stable value either logic ‘0’ or logic ‘1’ after arrival of an active edge of the clock is
called as hold time.
Hold time is an important timing parameter in the design. For most designs
constrained at high frequency, it is critical to meet the hold requirement. Most
of the hold violations are reported and fixed during STA at the ASIC layout
stages. Hold violations in the design are due to the fact that the data arrives
too early, changing before the hold window has elapsed.
For example, consider the scenario in Fig. 11.3, in which the design is
constrained at a 200 MHz operating frequency, that is, the clock cycle time is
5 ns. If the hold time requirement is 1 ns and the data at the ‘D’ input of the
flop changes during the 1 ns window after the arrival of the active clock edge,
then there is a hold violation in the design.
As shown in Fig. 11.3 the valid constant data is present on the D input of the
register during setup and hold time durations. Both setup and hold times are met for
the said design; hence, there is no timing violation in the design.
The data arrival time, which is the propagation delay of the flip-flop plus the
combinational delay, should be greater than the hold time of the flip-flop.
If the propagation delay of the flip-flop is 3 ns and the combinational delay is
1 ns, then the data will never change during the 1 ns hold window, so there is no
chance of a hold violation in the design.
But consider the scenario in which the flip-flop propagation delay is 0.8 ns, the
hold time is 1 ns, and there is no combinational logic in the data path; then a
hold violation occurs in the design.
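The two hold scenarios above can be checked with a short worked calculation. The symbols tpff, tcomb, and th follow the text; zero clock skew between the launching and capturing flip-flops is assumed.

```latex
% Hold check: the earliest new data must arrive after the hold window
\[ t_{pff} + t_{comb} \ge t_{h} \]
% Scenario 1: t_pff = 3 ns, t_comb = 1 ns, t_h = 1 ns
\[ 3\,\text{ns} + 1\,\text{ns} = 4\,\text{ns} \ge 1\,\text{ns} \quad \text{(hold met)} \]
% Scenario 2: t_pff = 0.8 ns, no combinational logic, t_h = 1 ns
\[ 0.8\,\text{ns} + 0\,\text{ns} = 0.8\,\text{ns} < 1\,\text{ns} \quad \text{(hold violated)} \]
```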
So with reference to our discussion, it is important to note that the data should be
stable at the D input of register during setup and hold time windows.
The amount of time required for the flip-flop to generate valid output either logic ‘0’
or logic ‘1’ after arrival of an active clock edge is called as propagation delay of
flip-flop. Propagation delay of flip-flop is also called as clock to ‘q’ delay.
Consider tsu is the setup time of flip-flop, th is the hold time of flip-flop, and tpff is
the propagation delay of flip-flop. Figure 11.4 describes the various timing
parameters for the register.
As shown in Fig. 11.4 the timing parameters of flip-flop (reg1) are given as tpff1 and
tsu1, and the timing parameters for the register 2 are given as tpff2 and tsu2. The
combinational logic design delay in the data path is given as tcomb.
These timing parameters are used to find out maximum operating frequency for
the design.
11.3 Clock to Q Delay 281
To find the maximum operating frequency for the design, find the data arrival
time and the data required time. The data arrival time is the sum of all the
delays in the register-to-register path.
Therefore, the data arrival time is given by tpff1 + tcomb.
The data required time is given by Tclk − tsu2, where Tclk is one clock cycle time
and tsu2 is the setup time of the second flip-flop.
So the maximum frequency is calculated by equating the data arrival time and the
data required time.
Consider that both registers have the same timing parameter values, that is,
tpff1 = tpff2 = 2 ns, tsu1 = tsu2 = 1 ns, and tcomb = 2 ns. Then the maximum
operating frequency is
$$f_{max} = \frac{1}{t_{pff1} + t_{comb} + t_{su2}} = \frac{1}{2\,\text{ns} + 2\,\text{ns} + 1\,\text{ns}} = 200\ \text{MHz}$$
Consider the example shown in Fig. 11.5. In this example the register 1 is triggered
early and register 2 is triggered late. Register 1 is called as launch flip-flop and
register 2 is called as capture flip-flop. As the launch flip-flop is triggered first and
capture flip-flop is triggered last, there is skew in the clock pulse and it is called as
positive clock skew.
In the above example, the clock and data travel in the same direction, and due to
the buffer the clock clk1 is delayed by the buffer delay as compared to the clk
input of register Reg1.
To find the maximum operating frequency for the above design, find the data
arrival time and the data required time. The data arrival time is the sum of all
the delays in the register-to-register path.
Therefore, the data arrival time is given by tpff1 + tcomb.
The data required time is given by Tclk − tsu2 + tbuf, where Tclk is one clock
cycle time, tsu2 is the setup time of the second flip-flop, and tbuf is the delay
of the buffer in the clock path.
So the maximum frequency is calculated by equating the data arrival time and the
data required time.
Consider that both registers have the same timing parameter values, that is,
tpff1 = tpff2 = 2 ns, tsu1 = tsu2 = 1 ns, tbuf = 1 ns, and tcomb = 2 ns. Then the
maximum operating frequency is
$$f_{max} = \frac{1}{t_{pff1} + t_{comb} + t_{su2} - t_{buf}} = \frac{1}{2\,\text{ns} + 2\,\text{ns} + 1\,\text{ns} - 1\,\text{ns}} = 250\ \text{MHz}$$
So from the above discussion it is clear that positive skew helps to improve the
performance of the design. In the above example, due to the buffer delay of 1 ns,
the clock at register 2 is delayed by 1 ns as compared to the clk at register 1.
This extra 1 ns relaxes the setup requirement and hence raises the maximum
frequency from 200 MHz to 250 MHz, an increase of 50 MHz.
Let us consider another example shown in Fig. 11.6. In this example source
flip-flop is triggered last and destination flip-flop is triggered first. In the other way
one can perceive that the clock and data are traveling in the opposite direction.
To find the maximum operating frequency for the above design, find the data
arrival time and the data required time. The data arrival time is the sum of all
the delays in the register-to-register path, including the delay tbuf of the
buffer in the clock path, which delays the launch clock.
Therefore, the data arrival time is given by tpff1 + tcomb + tbuf.
The data required time is given by Tclk − tsu2, where Tclk is one clock cycle time
and tsu2 is the setup time of the second flip-flop.
So the maximum frequency is calculated by equating the data arrival time and the
data required time.
Consider that both registers have the same timing parameter values, that is,
tpff1 = tpff2 = 2 ns, tsu1 = tsu2 = 1 ns, tbuf = 1 ns, and tcomb = 2 ns. Then the
maximum operating frequency is
$$f_{max} = \frac{1}{t_{pff1} + t_{comb} + t_{buf} + t_{su2}} = \frac{1}{2\,\text{ns} + 2\,\text{ns} + 1\,\text{ns} + 1\,\text{ns}} \approx 166.67\ \text{MHz}$$
So from the above discussion it is clear that negative clock skew degrades the
performance of design. In the above example due to the buffer delay of 1 ns the
As discussed in the above section, STA is a non-vectored approach to check the
timing performance of the ASIC design by checking for timing violations in all
possible timing paths.
A timing path in the design starts at a ‘start point’: the clock port of a
register or an input port of the design is treated as a start point. A timing path
terminates at an ‘end point’: the data input of a register or an output port is
treated as an end point.
For any RTL design there can be four timing paths, and they are named as
• Input-to-register path (Input-to-reg path)
• Register-to-output path (Reg-to-output path)
• Register-to-register path (Reg-to-reg path)
• Input-to-output path (combinational path)
The timing analyzer checks for the worst possible delays through each of the logic
elements in the timing paths but ignores the logical operations. As the timing
analyzer ignores the logic operations, it is a non-vectored approach and is faster
than a simulator. But the reader needs to understand that timing analysis is used
to check the timing correctness of the design and not its logical functional
correctness.
This section discusses the different timing paths in the design.
Input-to-register path has start point input port ‘q1’ and end point data input ‘d2’ of
the register element. This path is also called as input register path group.
Figure 11.7 shows the input port ‘q1’ and combinational logic (Combo logic) and
the path from ‘q1’ to ‘d2’ through combo logic is treated as input-to-register path.
Register-to-output path has the start point clock input port ‘clk’ and the end
point data output ‘q_out’ of the register element. This path is also called the
output register path group. Figure 11.8 shows the start point port ‘clk’; the data
‘d’ travels through the register and then through the combinational logic to the
output, and hence the path is named the register-to-output path.
11.5 Timing Paths in Design 285
Register-to-register path has the start point clock input port ‘clk’ of the first
register, which acts as the launch register, and the end point data input ‘d2’ of
the second register element. This path is also called the clock path group.
Figure 11.9 shows the clock port ‘clk’; the data launched by register 1 passes
through the combinational logic (Combo logic) before arriving at the data input
‘d2’ of the second register. This path decides the maximum operating frequency
for the design.
Input-to-output path has start point input port ‘d’ and end point data output
‘q1_out.’ This path is also called as combinational path group. Figure 11.10 shows
the input port ‘d’ and the data passes through the combinational logic (Combo
logic) to generate an output ‘q1_out.’
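The four timing paths can be seen together in one small module. The sketch below is only illustrative: the module, port, and register names and the particular gates are assumptions and do not correspond to the signals in Figs. 11.7 to 11.10.

```verilog
// Hypothetical example: all names and gates are assumptions for illustration.
module timing_paths_example (
    input  wire clk,
    input  wire a_in,
    input  wire b_in,
    output wire q_out,
    output wire comb_out
);
    reg q1, q2;

    always @(posedge clk) begin
        // Input-to-register path: a_in/b_in -> AND gate -> D input of q1.
        q1 <= a_in & b_in;
        // Register-to-register path: clk -> q1 -> inverter -> D input of q2.
        q2 <= ~q1;
    end

    // Register-to-output path: clk -> q2 -> inverter -> output port q_out.
    assign q_out = ~q2;

    // Input-to-output (combinational) path: a_in/b_in -> XOR -> comb_out.
    assign comb_out = a_in ^ b_in;
endmodule
```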
In practice, the design timing goals are described using the clock definitions for
the design and by defining the IO timing relative to the clock. The reason for
these definitions is that, in synchronous designs, data arrives from a clocked
device and goes to a clocked device.
The following template shown in Fig. 11.11 describes the definition required to
specify the timing goals for the design.
11.6 Timing Goals for the Design 287
Use the SDC commands to define the clock, input delays, output delays, and
clock skew.
The SDC [1] commands to specify the timing goals are listed in Fig. 11.12.
So from the above discussion it is clear that a setup violation is due to early
capture clock arrival and late data arrival. To overcome setup violations, the
data should arrive early, the launch clock should arrive early, and the capture
clock should arrive late.
A hold violation is due to the data arriving too early or the capture clock
arriving late. A hold violation can be fixed by delaying the data, delaying the
launch clock, or making the capture clock arrive earlier.
In practice, the min and max corner analysis can be performed using the minimum
and maximum values of the timing parameters. During setup analysis, consider the
maximum delays in the data path and the minimum delays in the clock path. During
hold analysis, consider the minimum delays in the data path and the maximum delays
in the clock path.
The following example shown in Fig. 11.13 is used to describe the minimum,
maximum analysis for the design.
In this example the minimum delays are considered in the clock path and
maximum delays are considered in the data path. Consider the timing parameters of
register 1 and register 2.
Consider the first register delay as (1.35, 1.5) ns, the second register delay as
(1.65, 1.75) ns, and the combinational path delay as 2 ns. The inverter
propagation delay is (0.75, 0.8) ns and the setup time of both registers is
(0.6, 0.65) ns.
Fig. 11.12 SDC commands to specify timing goals
Define the clock for 200 MHz operating frequency and having 50% duty cycle by using
Skew in the design is due to the inverters in the clock path and is calculated as
follows. Using minimum delay analysis, the skew in the design is 1.2
− 0.6 = 0.6 ns. This skew is due to the additional inverter delay for the capture
flop.
Data arrival time is equal to Tpff1 + Tcombo = 1.5 + 2 = 3.5 ns.
Data required time is equal to tclk + tskew − tsu = tclk + 0.6 − 0.6 = tclk.
Equating the two gives a minimum time period of tclk = 3.5 ns, so the maximum
operating frequency is
$$f_{max} = \frac{1}{3.5\,\text{ns}} \approx 285.7\ \text{MHz}$$
The above calculation is for the setup analysis, that is, maximum delay in the
data path and minimum delay in the clock path.
As discussed in Sects. 11.1−11.7, the setup time and hold time should be met for
all the registers in the design. If the setup or hold time is not met, it results
in a timing violation. The following are a few techniques used to fix the design
violations.
The last option for fixing design violations is to make the required changes at
the architecture level of the design. Architecture level changes are not
recommended, as they can have a significant impact on the design and
implementation cycle. But if the timing constraints are still not met after
changes to the microarchitecture or during optimization, it is sometimes essential
to make changes at the architecture level. The designer needs to suggest the
required changes to the chief architect. The chief architect needs to take care of
the design functionality, as changes in the architecture can affect it. It is
essential to make the desired changes while keeping the same design functionality.
If the design optimization fails to meet the required timing, then it is essential
to make the required changes at the microarchitecture level. The microarchitecture
document is the golden reference document for the RTL design, and because of that
the designer has insight into it. A detailed understanding of the
microarchitecture always plays a crucial role during the RTL design stage.
Synthesizers used at the RTL level are efficient due to their built-in synthesis
and optimization algorithms. They are driven by the coding and design styles
adopted at the synthesis level. If the design does not meet timing, then
optimization algorithms need to be used. To meet the desired timing goals, the
designer can use optimization concepts such as pipelining, register duplication,
and register balancing. Consider a scenario in which the design has 100 timing
violations and 20–30 of them cannot be fixed using synthesis optimizations; the
better approach is then to make the required changes in the RTL code to fix these
violations.
Readers need to ask themselves why it is challenging to fix the timing violations
in the design. As the design complexity increases from block level to chip level,
the multiple hierarchies in the design increase the propagation delay between
registers, and this has a significant overall impact on register-to-register
timing. Multiple timing paths may be violated because the setup and hold time
parameters are not met.
It is a general observation that the design meets the timing goals at the block
level, that is, the design does not have any timing violations at the block level.
But at the top level, due to the integration of multiple blocks, several timing
violations are possible. At the top level these violations can be fixed by
minimizing the logic levels between the registers. If the data required time is
greater than the data arrival time, then the register-to-register path is treated
as clean due to positive slack. This indicates that there is no setup violation on
that path at the top level.
11.9 Fixing Setup Violations in the Design 291
Logic duplication increases the effective area but generates two independent paths
during synthesis. This technique is effective in fixing setup time violations. For
example, consider Fig. 11.14. The inputs in_1, in_2, in_3, and in_4 are registered
inputs, and the combinational logic is in the register-to-register path. If every
adder has a propagation delay of 3 ns, then the overall combinational path delay
is 6 ns. With logic duplication, the two independent paths can be optimized to
improve the timing.
As shown in Fig. 11.15, two independent paths have been created using the logic
duplication technique, and hence these two paths can be optimized independently
while retaining the same functionality, at the cost of increased area.
The popular used encoding techniques are priority encoding and multiplexed
encoding. Consider the continuous assignment statement.
assign y_out=a_in && b_in && c_in && d_in && e_in && f_in
&& g_in && h_in;
The above statement generates the priority logic, where h_in has highest priority
over any other input signal. It generates equivalent logic as shown in Fig. 11.16.
In the priority encoding method the overall delay is of 7tpd, and if tpd is equal to
1 ns then the overall propagation delay is of 7 ns.
To improve the design performance it is essential to reduce the propagation
delay of combinational logic and hence multiplexed encoding technique can be
efficient as compared to the priority encoding technique. Figure 11.17 shows the
multiplex encodings using the continuous assignment.
assign y_out= ( (a_in && b_in) && (c_in && d_in) ) && ( (e_in
&& f_in) && (g_in && h_in) );
As shown in Fig. 11.17, the number of levels has been reduced from seven to three,
and hence the overall propagation delay for the multiplexed encoding is only 3tpd.
If tpd is 1 ns, then the overall propagation delay for the multiplexed encoding is
only a three-stage delay, that is, 3 ns. So this technique improves the
performance as compared to the priority encoding technique.
For any design, late-arriving control signals have a significant impact on the
design timing. Due to the late arrival of a control signal the setup time may be
violated.
In the example shown in Fig. 11.18, in_1 and in_2 are the multiplexer inputs and
arrive quickly, but sel_in is the select line of the multiplexer and arrives very late.
The select input sel_in is therefore a late-arriving signal and has a significant
impact on the setup time of the design.
To improve the timing and to avoid the setup time violation, the combinational
logic can be moved ahead of the multiplexer and the multiplexer can be moved
later in the path. The combinational logic is duplicated at the inputs of the
multiplexer. This technique increases the area but improves the overall design
performance, because the combinational logic is evaluated while the late-arriving
signal is still on its way. Another important point is that this technique allows the
logic to be partitioned efficiently into two groups for further improvement in the
timing.
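The restructuring can be sketched in Verilog as shown here. This is a minimal sketch under stated assumptions: the function comb_logic is a hypothetical stand-in for whatever combinational logic follows the multiplexer, the polarity of sel_in (1 selects in_1) is assumed, and the module and output names are illustrative only.

```verilog
// Sketch only: late_sel_mux, comb_logic, y_before and y_after are hypothetical.
module late_sel_mux #(parameter W = 8) (
    input  wire [W-1:0] in_1,
    input  wire [W-1:0] in_2,
    input  wire         sel_in,    // late-arriving select
    output wire [W-1:0] y_before,  // multiplexer first, logic after it
    output wire [W-1:0] y_after    // logic duplicated, multiplexer last
);
    // Placeholder for the combinational logic of the data path.
    function [W-1:0] comb_logic(input [W-1:0] x);
        comb_logic = (x << 1) ^ x;
    endfunction

    // sel_in must arrive before comb_logic can start evaluating.
    assign y_before = comb_logic(sel_in ? in_1 : in_2);

    // comb_logic is evaluated on both inputs in parallel; the late
    // sel_in only has to pass through the final multiplexer.
    assign y_after = sel_in ? comb_logic(in_1) : comb_logic(in_2);
endmodule
```

The two outputs are functionally equal; the second form uses two copies of the logic, which is the area cost described above.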
Register balancing is one of the powerful techniques to fix the setup time and to
improve the design performance. Consider an operating frequency of 200 MHz for
the design, that is, a clock period of 5 ns. If a register-to-register path has a high
combinational delay, so that the data arrival time is greater than the data
required time, then the slack is negative and the setup time for the design is violated.
Consider the example shown in Fig. 11.19. The register 1 to register 2 path has
combinational logic with a delay of 3 ns. If the setup time of the register is 1 ns,
the propagation delay of the flip-flop is 2 ns, and the hold time is 0.5 ns, then the
data arrival time for the register 1 to register 2 path is 2 ns + 3 ns = 5 ns and the
data required time is Tclk - 1 ns. This path therefore needs a clock period of at
least Tclk = 6 ns, which violates the setup time for the given design constraint of
5 ns.
For the register 2 to register 3 path the specified combinational delay is 1 ns.
With the same timing parameters for the register, the data arrival time is
2 ns + 1 ns = 3 ns and the data required time is Tclk - 1 ns = 4 ns, so this path
meets the timing constraint. The data arrives at the D input of register 3 at 3 ns
and waits for the clock edge, which arrives 2 ns later. After the setup time of 1 ns
is accounted for, there is an additional time margin of 1 ns, and this margin can be
used to improve the design performance. This technique is called balancing the
timing between two register-to-register paths.
This can be achieved by splitting the combinational logic between register 1
and register 2 into two parts and pushing the part with a delay of 1 ns into the
register 2 to register 3 path.
This gives clean timing for both register-to-register paths, as the data arrival
time for each path becomes 4 ns. This meets the design constraints, and the design
meets the target operating frequency of 200 MHz.
A hold time violation occurs in the design if the data at the D input of a register
changes too soon after the active clock edge. For example, consider the design
shown in Fig. 11.1. If the combinational logic delay is small and, because of that,
the data at the D input of the register changes very fast, then there is a hold
violation in the design. Any data change during the hold time window is a hold
violation.
11.10 Hold Violation Fix 295
To fix a hold violation it is recommended to insert buffers in the data path, but
care needs to be taken that this does not violate the setup time requirements of the
design. Inserting buffers into the data path increases the time taken for the data to
change at the D input of the register, and thus the hold violation can be fixed. The
logic after inserting the buffers in the data path is shown in Fig. 11.20.
There are two main timing exceptions: false paths and multicycle paths. These
timing exceptions need to be reported to the timing analyzer using SDC
commands.
If changes on a signal or port do not affect the output of the design then the
path needs to be reported as a false path. A false path is a timing exception and
needs to be notified to the synthesis tool. For example, consider the following
expression:
assign y_out = (a_in | b_in) & (c_in | d_in);
In this example, if d_in is tied to zero for some reason, then the logical output
depends only on the a_in, b_in, and c_in inputs, and the path from d_in to y_out is
considered a false path.
Asynchronous path: An asynchronous path needs to be notified to the synthesis
tool. Violations reported on such a path are false violations and need to be
ignored.
Figure 11.21 describes the false path and this needs to be reported to the timing
analyzer. The SDC command discussed in Chap. 10 can be used to specify the false
path.
If any path in the design has a delay of more than one clock cycle then the path
is treated as a multicycle path. Consider a design scenario where the register (FF4)
to register (FF5) delay is 40 ns and the clock period is 5 ns. The number of clock
cycles required to update the D input of the register with a new value is
40 ns / 5 ns = 8. This needs to be communicated to the tool so that the setup and
hold windows can be moved accordingly. The multicycle path is a timing exception.
11.11 Timing Exceptions in the Design 297
The SDC command discussed in Chap. 10 can be used to specify the multicycle
path (Fig. 11.22).
The performance of an ASIC design can be improved by adding multiple pipeline
stages. The overall latency to get the output data depends on the number of
pipeline stages. Pipelining increases the area, because more registers are needed
to hold the multibit data at each stage, but it shortens the register-to-register
paths and therefore improves the overall performance of the design. Readers are
requested to refer to Chap. 6 for a better understanding of pipelining.
11.13 Summary
The following are the key points that need to be remembered by ASIC design
engineers:
1. STA is a non-vectored approach and is faster than simulation.
2. The flip-flop timing parameters are setup time, hold time, and clock-to-q delay.
3. If the setup or hold time is violated then the design goes into the metastable state.
4. There are four timing paths in the design, and the register-to-register path
decides the maximum operating frequency for the design.
5. For the setup analysis the timing analyzer uses the late clock latency for the data
arrival path and the early clock latency for the clock arrival path. The clock
latency for setup is defined with reference to rising (-rise) or falling (-fall) clock
transitions.
6. For the hold analysis the timing analyzer uses the early clock latency for the data
arrival time and the late clock latency for the clock arrival time.
7. Multicycle paths and false paths are the timing exceptions.
Abstract This chapter discusses constraining a design using the Synopsys Design
Compiler (DC). Every ASIC design needs to meet its constraints. The constraints
are classified as optimization, design rule, and environmental constraints. This
chapter covers area minimization techniques and design optimization techniques
using meaningful practical design scenarios. It also describes the key commands
used to boost the design performance and the commands used for FSM
extraction. The sample scripts given in the chapter can be used for design
optimization and report generation.
Modern ASIC SOCs are very complex in nature and consist of more than
100 K gates. Design complexity has grown exponentially in the past decade due to
the demand for sophisticated and intelligent devices. In such a scenario there is
additional overhead and cost during design synthesis and timing closure. As
discussed in Chap. 10, the ASIC design passes through various phases which
include architecture exploration, microarchitecture design, design entry using HDL,
simulation, and synthesis. Synopsys DC is a leading EDA tool used to synthesize
the design and Synopsys PT is used for timing closure.
An ASIC design engineer is required to have exposure to design synthesis and
timing analysis. These concepts are covered in Chaps. 10 and 11, respectively. A
real understanding of the design constraints, and of the commands used to
constrain the design for area, speed, and power, is very much required to design a
chip. This chapter focuses on the design constraints using Synopsys DC.
The design constraints are classified as design rule constraints and optimization
constraints. The classification is shown in Fig. 12.1.
The synthesis flow is discussed in Chap. 10 with the key SDC commands. For easy
reference the synthesis flow is shown in Fig. 12.2. The blocks of the flow are also
the steps followed while carrying out synthesis for any design. The compilation
strategy can be chosen as top-down or bottom-up. The commands used at each
step are discussed subsequently.
[Fig. 12.2 Synthesis flow. Recoverable step labels: Specify Technology Requirements, Design Environment Definitions, Optimize Design]
1. Read Design Object: Design object is Verilog RTL code which is simulated for
the functional correctness. The commands used at this step are
302 12 Constraining ASIC Design
2. Specify Technology Requirements: In this step the design rules and the required
libraries need to be specified. The commands used in this step are
5. Setting Design Constraints: The constraints need to be set for the design opti-
mization and for the timing analysis. The commands used in this step are
7. Analyze and debug the design: This step is important to understand the potential
problems in the design by generating various reports. The commands used in
this step are
8. Generate Script file: The design database is stored in the form of script file.
Consider the top level object as a full adder with inputs ‘a_in, b_in, c_in’ and
outputs ‘sum_out, carry_out’. The top-down compilation run is shown using the
following script, which can be used in a practical scenario. Refer to Chap. 10 for
the SDC commands. To synthesize and compile the design use the script shown in
Example 12.1.
Any design can be compiled using either the top-down or the bottom-up
compilation approach. Each compilation method has its own advantages and
disadvantages.
The top-down compilation uses the top level design constraints and is easier to
execute than the bottom-up compilation approach. The following are the
advantages and disadvantages of the top-down compilation.
Advantages
1. Optimization engines work on full design, complete paths
2. Usually get best optimization result
3. No iteration required
4. Simpler constraints
5. Simpler data management
12.2 Compilation Strategy 305
Disadvantages
1. Longer runtime
2. More memory requirements
3. More runtime
The commands used for the top-down compilation are
The bottom-up compilation compiles the submodules first and then moves
towards the top level. Care must be taken by the designer to set the
“set_dont_touch” attribute on the submodules to avoid recompilation of the
submodules. The designer needs to know the timing information for the inputs and
outputs of each submodule. The advantages and disadvantages are discussed below.
Advantages
1. Faster as compared to top-down compilation
2. Less processing required per run
3. Less memory requirement
Disadvantages
1. Optimization works on the submodule or subdesign
2. More iteration required
3. More hierarchies to maintain
Consider the design has two submodules. The commands used for the bottom-up
compilation are
There are several techniques for minimizing the overall area of the design. The
highest priority of the designer is to optimize for timing, followed by area. There
are several efficient area minimization techniques at the RTL level; resource
sharing was discussed in the previous section. The following are the key
guidelines used to optimize for area:
1. Avoid using combinational logic as an individual block or module.
2. Do not use glue logic between two modules.
3. Use the set_max_area attribute while synthesizing the design.
[Figure labels: d_in, Combo Logic1: A, Combo Logic2: B, D flip-flop, clk]
12.3 Area Minimization Techniques 307
If module II in Fig. 12.3 is replaced by glue logic, that is, an instance of a logic
gate, then it glues the different modules together as shown in Fig. 12.5. This type of
design partitioning is not good, because the design is not partitioned properly and
the logic gate cannot be optimized by the design compiler. To avoid this type of
scenario it is recommended to use the group command and to group the glue
logic into either module I or module II. The following command is used to group
the glue logic into module I:
[Fig. 12.5 labels: Module I (instance m1), Module III, glue logic, d_in1, q1, Combo Logic1: A, D flip-flop, clk]
Following command is used to group the glue logic into module II:
To obtain the minimum possible area it is recommended to use the attribute
set_max_area. This attribute is effective in the optimization of the design. The
design compiler gives the highest priority to timing optimization, and the area
optimization phase starts only if the timing is met. The priorities for the design
optimization are listed below.
1. Design rule constraints (DRC)
2. Timing
3. Power
4. Area
The area report is generated by the design compiler using report_area command.
The sample area report is shown in Example 12.2. The area report for any design
consists of the number of ports, nets, references. It also gives information about the
combinational, sequential, and total cell area.
During optimization the timing has a higher priority than power and area.
During the first phase of optimization the design compiler checks for design rule
constraint (DRC) violations, then the timing violations and the power constraints,
and finally the area constraints. This section discusses a few timing optimization
commands supported by the design compiler.
Most of the time the design engineer uses the option map_effort medium while
performing the synthesis. It is advisable to use map_effort medium during the first
phase of synthesis as it reduces the compilation time. If the design constraints are
not met then the designer can go for an incremental compilation with the option
map_effort high. This can improve the design performance by about 5–10 %.
The command is shown below.
The hierarchy of the design can be broken by using logical flattening of the
design. This option places all the logic gates of the design at the same level of
hierarchy, which allows the compiler to achieve better performance and better
area utilization for the design. If the hierarchical design is large then this option
may not work out, because as the number of hierarchies in the design increases
the compiler takes a large amount of time for the design optimization.
Use the following command to achieve the logical flattening of the design
The design performance can be boosted by up to 10 % by using the map_effort
high option. But if the timing is still not met with the incremental compilation
under the design constraints, then it is essential to group the critical timing paths
and use the weight factor to boost the design performance. This command is useful
to improve the timing performance. The command is shown below.
Consider a design scenario which has a setup violation of 0.38 ns. The setup
violation is the difference between the data required time and the data arrival
time, so the slack is negative and the setup time is violated.
The above commands remove the setup violation and generate the timing report
with positive slack, as shown in Example 12.4.
As shown in the above timing report for the max analysis, with the map_effort
high option and a weight factor of 8 the slack is met.
In the practical ASIC designs, the design can have multiple hierarchies. Consider
that the TOP level design consists of submodules X, Y, Z. If independently these
submodules are synthesized and optimized for the design constraints they might
meet the timing requirements independently. When these submodules are instan-
tiated in the higher level of hierarchy (TOP) then it may be possible that they do not
meet the timing. The reason for this may be the glue logic used among the sub-
modules X, Y, Z or the tight constraints at the top level hierarchy.
Register balancing is a very efficient and powerful command to move the
combinational logic from one pipelined stage to another pipelined stage. This
technique improves the design performance by moving the logic and hence
reducing the worst register-to-register delay. Consider the pipelined design shown
in Fig. 12.6, which consists of three flip-flops and combinational logic. In the first
pipelined stage the combinational logic is a 4-variable function. The second
pipelined stage has an 8-variable function as its combinational logic and has more
propagation delay than the 4-variable combinational logic. Because the
propagation delays of the two pipelined stages differ, the design performance is
limited by the register-to-register timing path which has more delay.
Under such circumstances register balancing can be used to shift combinational
logic from one pipelined stage to the other without affecting the functionality of
the design. This is achieved by the compiler using the following set of commands.
[Fig. 12.6 labels: d_in, D flip-flop, Combo Logic: A, D flip-flop, Combo Logic: B, D flip-flop, clk]
For the optimization of finite state machines the FSM compiler is used. The FSM
compiler is used to achieve a small area and to improve the design performance. In
practical ASIC designs the state machines are always coded as an independent
block. A module which mixes state machines with other logic cannot be
considered good design partitioning. The reason is that if the other required logic
is isolated from the state machine logic then the designer can choose the best
suited encoding style while coding the state machines. So always use different
submodules for the logic and for the state machines for better design performance.
The script shown in Example 12.5 can be used for FSM extraction and
optimization.
Fixing hold violations is very easy as compared to fixing setup violations. To fix
setup violations it is essential to modify the architecture of the design, which in
turn has a greater impact on the RTL coding of the design. The setup violations are
fixed during the prelayout STA and the hold violations can be fixed during the
prelayout or postlayout STA phase. The design compiler can fix the hold
violations automatically. Use the following command to fix the hold violation
12.4.8.1 report_qor
This command is used to generate a report which consists of the timing summary
of all the path groups. It gives the overall status of the timing for the design.
Example 12.6 shows a sample report with multiple timing path groups using the
report_qor command.
12.4.8.2 report_constraints
This command is used to show the difference between the user constraints and the
actual design values. Example 12.7 is a sample of the report_constraints
command.
12.4.8.3 report_constraints_all
This command is used to show all the timing and DRC violations. Example 12.8 is
a sample of the report_constraints_all command.
Following are the important commands used to validate the design (Table 12.1).
Following are the important commands used to define design rules, power, and
optimization constraints (Table 12.2).
Example 12.9 is the sample script and can be used to constrain the design for
operating frequency of 500 MHz.
12.7 Summary
8. For the optimization of the finite state machines the FSM compiler is used. The
use of FSM compiler is to achieve the small area optimization and to improve
the design performance.
Abstract In practical ASIC and SOC designs multiple clocks are used, and such
designs are called multiple clock domain designs. These kinds of designs need to be
described using efficient design architectures and Verilog RTL. This chapter
focuses on the key design techniques used to describe multiple clock domain
designs while passing data from one clock domain to another. The key highlights
of the chapter are the detailed descriptions of the synchronizers and of the data
path and control path synchronization logic using efficient Verilog RTL. This
chapter also discusses the key design challenges in multiple clock domain designs
and the design guidelines to describe efficient clock domain designs.
Keywords Metastability, CDC, Skew, STA, FIFO, Level synchronizer,
Pulse synchronizer, Mux synchronizer, FSM, Sending clock domain, Receiver
clock domain, Edge detection, Level-to-pulse conversion, Gray counter, Binary
to gray, Gray to binary, MCP, Convergence of data, Legal states, Gray
encoding, FSM encoding, Reset synchronization, Data correlation, Setup and
hold time
[Fig. 13.1 labels: regA, regB, CLK1, CLK2]
Designing single clock domain logic is a very simple approach. If all the
flip-flops in the design are clocked by a single clock source then the design is said
to be synchronous. If the flip-flops are triggered by different clock sources then the
design is said to be an asynchronous clock domain design. In modern ASICs or
SOCs the design can have multiple clock sources of different frequencies. For
example, consider Fig. 13.1, in which the flip-flop regA is triggered by CLK1 and
the flip-flop regB is triggered by CLK2.
In Fig. 13.1 the data is sampled in the clock domain with the clock source CLK1,
and the output from clock domain1 is data_out_1. The flip-flop in clock domain1 is
named regA and the flip-flop in clock domain2 is named regB. The regB has CLK2
as its clock input and samples the data generated by clock domain1 on the rising
edge of CLK2. The output from clock domain2 is data_out. The difference between
single clock domain and multiple clock domain designs is the phase difference
between the arrivals of the clock signals. The clock sources CLK1 and CLK2 are
different for the two domains, and regardless of whether the frequencies are the
same or different the design is treated as a multiple clock domain design. The data
is launched from one clock domain and captured in another clock domain.
The data transfer can be from a slower clock domain to a faster clock domain or
from a faster clock domain to a slower clock domain. A data or control signal
crossing from one clock domain to another is described by the term clock domain
crossing.
The synthesizable outcome of Example 13.1 is shown in Fig. 13.2.
The timing sequence for Example 13.1 is shown in Fig. 13.3. The flip-flop
propagation delays are not considered in this timing sequence.
If we consider the multiple clock domain design described by Example 13.2, then
there is an issue in the data integrity of the design due to the unstable or
metastable output. In practice multiple clocks are not used in the same module.
13.2 What Is Clock Domain Crossing (CDC) 323
[Figure labels: regA, regB, clk, data_in, data_out_1, data_out, CLK1, CLK2]
Fig. 13.4 Synthesis result for the multiple clock domain design logic
Describe the design functionality separately for the different clock domains. That
is, module 1 can use CLK1 as its clock source and module 2 can use CLK2 as its
clock source.
The synthesizable outcome of the above Verilog RTL code is shown in
Fig. 13.4.
The timing sequence for Example 13.2 is shown in Fig. 13.5.
As shown in Fig. 13.5, the output from regB, i.e., data_out, is in the metastable
state for one clock cycle. Metastability arises in the design when multiple events
occur very close to each other, so that a setup or hold time violation occurs. It
results in synchronization failure between the clock domains and is caused by the
different clock frequencies and different clock phases in the design. It is essential
for the designer to understand why the design goes into the metastable state. Every
flip-flop has a setup time and a hold time, and if the data changes during the setup
and hold time window then the design has a timing violation, which results in an
unstable output. The metastable state is not a stable state of the design, so if the
output data_out is fed to other logic then the output from that logic is
unpredictable or an invalid logic state. To avoid metastability issues it is essential
to have synchronizers in the data path and control path of the design.
The issue of metastability can be resolved by adding level synchronizers while
sampling the control signals passed from one clock domain to another clock
domain. Figure 13.6 shows the multiple clock domain design with the two-stage
level-synchronizer logic.
As shown in Fig. 13.6, the level synchronizer is used in the second clock domain.
The level synchronizer is designed using regC and regB and is used to sample the
data ‘data_out_1’ from clock domain1. The output of register regC can be
metastable, but register regB generates the stable legal output ‘data_out’ on the
next clock edge.
The Verilog RTL is shown in Example 13.3.
Example 13.3 Verilog RTL for the use of two-stage level synchronizer in the control path
Level synchronizers are used to pass control signal information from one clock
domain to another clock domain. In practice either two-stage or three-stage
synchronizers are used. A two-stage level synchronizer uses two registers and a
three-stage level synchronizer uses three registers. The latency of the control
information transfer depends on the number of registers. The two-stage level
synchronizer is shown in Fig. 13.8.
As described in the above section, the functionality of the two-stage level
synchronizer can be described using Example 13.4.
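For orientation, a two-stage level synchronizer of the kind described above can be sketched as follows. This is a minimal sketch and not the listing of Example 13.4: the module name and the active-low asynchronous reset rst_n are assumptions, while data_out_1, data_out_2, and data_out follow the signal names used in this chapter.

```verilog
// Sketch only: level_sync_2stage and rst_n are hypothetical.
module level_sync_2stage (
    input  wire clk2,        // receiving clock domain
    input  wire rst_n,       // assumed active-low asynchronous reset
    input  wire data_out_1,  // signal launched from clock domain1
    output reg  data_out     // second stage (regB): stable output
);
    reg data_out_2;          // first stage (regC): may go metastable

    always @(posedge clk2 or negedge rst_n) begin
        if (!rst_n) begin
            data_out_2 <= 1'b0;
            data_out   <= 1'b0;
        end else begin
            data_out_2 <= data_out_1;  // sample the crossing signal
            data_out   <= data_out_2;  // resample one clock later
        end
    end
endmodule
```

The first register is given one full period of ‘clk2’ to settle before the second register samples it, which is why the synchronized output appears two active clock edges after the input changes.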
The three-stage level synchronizer is described using the Verilog RTL shown in
Example 13.5 and the synthesized logic is shown in Fig. 13.9.
In multiple clock domain designs the data can be passed from a slower clock
domain to a faster clock domain or from a faster clock domain to a slower clock
domain, depending on the design architecture and requirement. In both cases
synchronizers need to be incorporated in the data path and control path of the
design.
Passing a control signal from the slower clock domain to the faster clock
domain is not a problem, as the signal launched by the slower clock domain can be
sampled by the faster clock domain multiple times using the two-stage or
three-stage level synchronizer.
For example, consider that ‘clk1’ is 100 MHz and ‘clk2’ is 200 MHz. As the
second clock domain is faster than the first clock domain, there is no issue while
sampling the control signals passed to the second clock domain. But in practical
design scenarios a problem occurs when the control information needs to be
passed from the faster clock domain to the slower clock domain. The issue is due
to the nonconvergence of the legal states of the control signals passed from clock
domain1 to clock domain2.
13.3 Level Synchronizers 329
Fig. 13.10 Timing sequence for capturing the data in the slower clock domain (signals: clk1, data_in, data_out_1, clk2, data_out_2, data_out)
As shown in Fig. 13.10, because the clock ‘clk2’ in clock domain2 is slower, the
data output ‘data_out_1’ is sampled on the active edge of ‘clk2’ but the desired
output is not produced. Both registers in the second clock domain generate logic
‘0’, which is an unexpected output: ‘data_out_2’ and ‘data_out’ both stay at logic
‘0’, as shown in the timing sequence. The issue of sampling the data from the faster
clock domain in the slower clock domain can be resolved using a pulse stretcher.
The level-to-pulse generator on the positive clock edge is shown in Fig. 13.11.
Another mechanism to achieve the legal convergence of the data is to use a
handshaking mechanism with handshaking signals.
As shown in Fig. 13.12, the sampled signal in clock domain2 is reported back to
clock domain1 as a handshaking signal. This handshake acts as an
acknowledgment or notification to the faster clock domain1 that the control signal
it passed has been successfully sampled by the slower clock domain. This kind of
mechanism is used in most practical scenarios, and the faster clock domain can
send another control signal after receiving the valid notification or
acknowledgment signal from the slower clock domain.
[Figure labels (Figs. 13.11 and 13.12): data_in, data_out_1, data_out, Register A, Register B, clk, Clock domain1, Clock domain2, clk1, clk2]
After the valid handshake signal is received from clock domain2, the output of the
two-stage level synchronizer in clock domain1 should be used to control the next
data_in. This requires additional logic.
This type of synchronizer uses the two-stage level synchronizer with an additional
register to sample the output of the two-stage level synchronizer. The
synchronized output is generated by XORing the output of the two-stage level
synchronizer with the sampled output of the additional third register. This kind of
synchronizer is also called a toggle synchronizer and is used to synchronize a
pulse generated in the sending clock domain into the destination clock domain.
While passing data from the faster clock domain to the slower clock domain, a
pulse can be missed if only a two-stage level synchronizer is used. In such
scenarios pulse synchronizers are very efficient and useful. The pulse synchronizer
diagram is shown in Fig. 13.13.
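A pulse (toggle) synchronizer along these lines can be sketched in Verilog as follows. This is a minimal sketch under stated assumptions: the toggle flip-flop in the sending domain, the reset rst_n, the input name pulse_in, and the internal register names are hypothetical and are not taken from Fig. 13.13; only the three receiving registers and the XOR gate follow the description above.

```verilog
// Sketch only: pulse_sync, pulse_in, rst_n, toggle, s1..s3 are hypothetical.
module pulse_sync (
    input  wire clk1,       // sending clock domain
    input  wire clk2,       // receiving clock domain
    input  wire rst_n,      // assumed active-low asynchronous reset
    input  wire pulse_in,   // one-cycle pulse in the clk1 domain
    output wire sync_data   // one-cycle pulse in the clk2 domain
);
    reg toggle;             // changes level once per input pulse
    reg s1, s2, s3;         // s1, s2: two-stage synchronizer; s3: extra register

    // Sending domain: convert each pulse into a level change.
    always @(posedge clk1 or negedge rst_n)
        if (!rst_n)        toggle <= 1'b0;
        else if (pulse_in) toggle <= ~toggle;

    // Receiving domain: synchronize the level, then delay it by one clock.
    always @(posedge clk2 or negedge rst_n)
        if (!rst_n) {s1, s2, s3} <= 3'b000;
        else        {s1, s2, s3} <= {toggle, s1, s2};

    // A level change shows up as a difference between s2 and s3
    // for exactly one clk2 cycle.
    assign sync_data = s2 ^ s3;
endmodule
```

Because a level rather than a narrow pulse crosses the clock boundary, the event is not lost when ‘clk2’ is slower than ‘clk1’, provided that successive input pulses are spaced far enough apart for each level change to be sampled.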
[Fig. 13.13 labels: data_in, Register C, XOR gate, sync_data, clk, clk1, clk2, Clock domain1]
Use a pair of data and control signals while sending information from clock
domain1 to clock domain2: multiple-bit data accompanied by a single-bit control
signal. At the receiving end, depending on the ratio of the sending clock to the
receiving clock, use the level or pulse synchronizer to generate the control signal
for the multiplexer. This technique is similar to the MCP and is effective if the data
is stable for multiple clock cycles across the clock boundary. The diagram is shown
in Fig. 13.14.
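The receiving side of such a scheme can be sketched as follows. This is a minimal sketch, assuming a level synchronizer on the control signal and data that the sender holds stable until it has been captured; the module, port, and register names and the reset rst_n are hypothetical and are not taken from Fig. 13.14.

```verilog
// Sketch only: mux_sync, ctrl_in, rst_n, s1 and s2 are hypothetical.
module mux_sync #(parameter W = 8) (
    input  wire         clk2,      // receiving clock domain
    input  wire         rst_n,     // assumed active-low asynchronous reset
    input  wire [W-1:0] data_in,   // multiple-bit data, held stable by sender
    input  wire         ctrl_in,   // single-bit control from clock domain1
    output reg  [W-1:0] data_out
);
    reg s1, s2;                    // two-stage level synchronizer for ctrl_in

    always @(posedge clk2 or negedge rst_n) begin
        if (!rst_n) begin
            s1       <= 1'b0;
            s2       <= 1'b0;
            data_out <= {W{1'b0}};
        end else begin
            s1 <= ctrl_in;
            s2 <= s1;
            // Multiplexer: load new data when the synchronized control
            // is high, otherwise keep the previous value.
            if (s2) data_out <= data_in;
        end
    end
endmodule
```

Only the single-bit control signal passes through the synchronizer; the data bus is sampled directly, which is safe only because it is assumed not to change while the synchronized control is active.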
Passing multiple control signals from one clock domain to another clock
domain is one of the key challenges for an ASIC or SOC design engineer. When
multiple signals are passed from one clock domain to another, the arrival times of
all the control signals are very important. If all the control signals arrive at the
same time then the skew is zero and there is no issue while capturing these signals
in the other clock domain. But in practical scenarios there may be skew between
the multiple control signals due to their different arrival times from clock domain1
to clock domain2, and this can be the cause of synchronization failure. Consider
the design scenario shown in Fig. 13.15, where ‘enable’, ‘load_en’, and ‘ready’ need
to be passed from one clock domain to another clock domain. In such a scenario, if
independent level synchronizers are used for all the required control signals then
there might be synchronization failure at the receiving end due to skew.
[Fig. 13.15 labels: data_in, data_out, enable, load_en, ready, three two-stage level synchronizers, sequential logic, clk2]
Consider the case where ‘ready’ and ‘load_en’ are arrived and sampled at a time
but due to late arrival of ‘enable’ input in the receiving clock domain2. The data
output from the first register of synchronizer does not change and it does not sample
the new value. Hence there is practical issue in generating valid legal state output.
The sampling of multiple control signals is described in the Example 13.6.
The practical and feasible design solution to this problem is to develop logic in
clock domain1 that generates a single control signal from ‘enable’, ‘load_en’, and
‘ready’, and to pass this single-bit control signal from clock domain1 to clock
domain2. The architecture modification is shown in Fig. 13.16.
The Verilog RTL for the receiving second clock domain logic is shown in
Example 13.7.
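The consolidation idea above can be sketched as follows. This is a minimal illustration and not the book's Example 13.7: the module names, the AND combining function, the WIDTH parameter, and the assumption that reset_n is already synchronized to each clock domain are all hypothetical choices made for the sketch.

```verilog
// Sending side (clk1): combine the three controls and register the result,
// so that only one glitch-free bit crosses the clock boundary.
module ctrl_consolidate_tx (
    input  wire clk1,
    input  wire reset_n,   // assumed synchronized to clk1
    input  wire enable,
    input  wire load_en,
    input  wire ready,
    output reg  cons_sig
);
    always @(posedge clk1 or negedge reset_n) begin
        if (!reset_n)
            cons_sig <= 1'b0;
        else
            cons_sig <= enable & load_en & ready; // assumed combining function
    end
endmodule

// Receiving side (clk2): two-stage level synchronizer on the single bit,
// then use the synchronized bit as the load enable for the data register.
module ctrl_consolidate_rx #(parameter WIDTH = 8) (
    input  wire             clk2,
    input  wire             reset_n,  // assumed synchronized to clk2
    input  wire             cons_sig,
    input  wire [WIDTH-1:0] data_in,
    output reg  [WIDTH-1:0] data_out
);
    reg sync_ff1, sync_ff2;

    always @(posedge clk2 or negedge reset_n) begin
        if (!reset_n) begin
            sync_ff1 <= 1'b0;
            sync_ff2 <= 1'b0;
            data_out <= {WIDTH{1'b0}};
        end else begin
            sync_ff1 <= cons_sig;   // first stage may go metastable
            sync_ff2 <= sync_ff1;   // second stage gives a settled value
            if (sync_ff2)
                data_out <= data_in;
        end
    end
endmodule
```

Because only one bit crosses the boundary, there is no second signal for it to be skewed against.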
Design Scenario I
Consider the design scenario of passing multiple signals from clock domain1 to
clock domain2. If clock domain1 generates two output signals ‘enable_1’ and
‘enable_2’, and the receiving clock domain2 uses these two signals for pipelined
control logic as illustrated in Example 13.8, there is a chance of
synchronization failure. The synthesized logic is shown in Fig. 13.17.
The issue with the Verilog RTL described in Example 13.8 is the sampling of
‘enable_1’ and ‘enable_2’ in the receiving clock domain2. Although two-stage level
synchronizers are used to sample ‘enable_1’ and ‘enable_2’, a small skew between the
arrivals of ‘enable_1’ and ‘enable_2’ can cause a problem. The pipelined stage shown in
Fig. 13.17 can miss the data because of this skew, which results in invalid output.
[Fig. 13.16 Consolidated control signal passing in the multiple clock domains: ‘enable’, ‘load_en’, and ‘ready’ feed combinational logic and Register A in the clk1 domain to form cons_sig, which passes through a two-stage level synchronizer into the clk2 sequential logic between data_in and data_out]
Example 13.7 Partial Verilog RTL for consolidated control signal receiving
Figure 13.18 illustrates that ‘data_out’ stays permanently at zero and is never loaded
because of the skew between ‘enable1_1’ and ‘enable2_1’. If these two signals are
skewed, they are sampled one clock cycle apart in the receiving clock domain.
Solution The practical solution is to use the consolidated enable signal: sample
‘enable_cons’ in the second clock domain and generate the valid ‘enable2_2’
signal from the output of the two-stage level synchronizer. Figure 13.19 shows the
generation and use of the consolidated control signal (Example 13.9).
Design Scenario II
Consider the design scenario of passing a multiple-bit encoder output from one
clock domain to another. The encoder outputs ‘encoder_1’ and ‘encoder_2’ are passed
from clock domain1 to clock domain2, where they are sampled using two-stage level
synchronizers. The outputs of the level synchronizers are used as inputs to a
2:4 decoder. The decoder output may be erroneous if there is skew between the
inputs ‘encoder_1’ and ‘encoder_2’.
The Verilog RTL functionality in clock domain2 is shown in Example 13.10.
Example 13.8 Verilog RTL for using multiple signals for pipelined operation
With reference to Example 13.10, the issue in the output is due to skew between
‘encoder_1’ and ‘encoder_2’. Because of the skew the decoder output ‘decoder_out[1]’
is permanently zero and is never asserted. This problem can be fixed by using an
enable control signal while sampling ‘encoder_1’ and ‘encoder_2’ from clock domain1
in clock domain2. The enable control signal can be one clock cycle wide and can act
as a device-ready or control signal, so that the control information is passed only
when ‘enable=1’. The enable signal can be asserted together with the encoder output
or one clock cycle after the encoder output is asserted. The assertion and
deassertion logic for the enable input can be designed separately.
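A minimal sketch of the enable-qualified sampling described above is given below. It is not the book's Example 13.10 or 13.11. The module name, the bit ordering {encoder_2, encoder_1}, and the internal register names are assumptions, and the sketch further assumes that ‘enable’ is held long enough for clk2 to sample it and that the encoder bits stay stable while ‘enable’ is high.

```verilog
// Receiving side (clk2): only the single-bit enable is synchronized.
// The encoder bits are captured together, and only after the synchronized
// enable says they are stable, so skew between them cannot be observed.
module encoder_rx (
    input  wire       clk2,
    input  wire       reset_n,    // assumed synchronized to clk2
    input  wire       enable,     // registered in clock domain1
    input  wire       encoder_1,  // assumed stable while enable is high
    input  wire       encoder_2,
    output reg  [3:0] decoder_out
);
    reg       en_ff1, en_ff2;     // two-stage level synchronizer for enable
    reg [1:0] enc_q;              // encoder value captured in clk2 domain

    always @(posedge clk2 or negedge reset_n) begin
        if (!reset_n) begin
            en_ff1 <= 1'b0;
            en_ff2 <= 1'b0;
            enc_q  <= 2'b00;
        end else begin
            en_ff1 <= enable;
            en_ff2 <= en_ff1;
            if (en_ff2)
                enc_q <= {encoder_2, encoder_1}; // assumed bit order
        end
    end

    // 2:4 decoder on the safely captured value
    always @(*) begin
        decoder_out        = 4'b0000;
        decoder_out[enc_q] = 1'b1;
    end
endmodule
```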
[Fig. 13.17: synthesized logic for Example 13.8, with signals data_in, data_out, enable_1, and clk2]
Fig. 13.18 Timing sequence for the use of multiple control signals for pipelined control logic (signals: clk2, enable_1, enable_2, enable1_2, enable2_2, data_in, data_out_2, data_out)
Fig. 13.19 Modified architecture to register the consolidated control signal for pipelined logic (blocks: combinational logic, two-stage level synchronizer, Register A, Register B, Register C; signals: enable1_1, enable2_1, enable_cons, data_in, data_out, clk2)
Example 13.9 Partial Verilog RTL for the use of the consolidated control signals for pipelined
logic
The following Verilog RTL describes the sampling of the decoder output in the
clock domain2 (Example 13.11).
Example 13.10 Partial Verilog RTL for the sampling of encoder output
As discussed in the above section, passing multi-bit signals from one clock domain
to another is a difficult and error-prone task. Although multistage level
synchronizers can be used, synchronization cannot be guaranteed because of the skew
between the multiple signals. For multi-bit data, other techniques are therefore used
to pass the data from one clock domain to another. Two main techniques are used in
practical ASIC designs:
(a) Handshaking mechanism
(b) FIFO memory buffers
As discussed in the previous section, one or more handshake signals are required
while passing data from one clock domain to another. Consider the design scenario
shown in Fig. 13.20, where multi-bit data needs to be passed from the transmitter to
the receiver. The transmitter is clocked in clock domain1 and the receiver is clocked
by another clock in the second clock domain, so exchanging multi-bit data using only
level synchronizers is not effective. The ASIC designer can instead build the
architecture around the handshake signals ‘datavalid’ and ‘deviceready’. In most
practical scenarios where latency is not a bottleneck, this mechanism is effective
for passing multi-bit data.
Example 13.11 Partial Verilog RTL for the pushing decoder in the single clock domain
[Fig. 13.20: transmitter in clock domain 1 (clk1) and receiver in clock domain 2 (clk2), connected by data, datavalid, deviceready, and ack, with two-stage level synchronizers on the handshake signals]
As shown in Fig. 13.20, when the transmitter passes multi-bit data from clock
domain1, the receiver captures the data in the other clock domain on the receive
clock edge and generates an active high ‘datavalid’ signal to indicate that valid
data has been received in the second clock domain. The transmitter uses ‘datavalid’
as a handshaking signal: until ‘datavalid’ goes active high, the transmitter cannot
place new data on the data lines. As two- or three-stage level synchronizers are used
to sample the data in the second clock domain, it is recommended that ‘datavalid’
remain active for at least two or three clock cycles. The overall latency of the data
transfer depends on the number of synchronizer stages and the number of handshaking
signals used. This poor latency is one of the biggest disadvantages of the handshake
mechanism.
If required, another handshaking signal ‘deviceready’ can be generated along with
‘datavalid’. The receiving clock domain can notify the transmitter clock domain of
the receiver status by asserting ‘deviceready’. While designing the handshake
mechanism, care needs to be taken over the generation of ‘deviceready’ and
‘datavalid’: the ‘deviceready’ handshake signal should go to logic ‘1’ only after
‘datavalid’ is deasserted.
For FSM control, use the architecture shown in Fig. 13.21 to establish the handshake
across the clock domains using request and ack signals.
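A request/ack handshake of the kind mentioned above might be sketched as follows. This is an illustrative four-phase handshake under stated assumptions, not the architecture of Fig. 13.21: the module names hs_tx and hs_rx, the ‘send’ and ‘busy’ ports, and the WIDTH parameter are hypothetical, and each reset_n is assumed to be synchronized to its own clock.

```verilog
// Transmitter (clk1): holds data stable while req is high and drops req
// only after the synchronized ack has been seen.
module hs_tx #(parameter WIDTH = 8) (
    input  wire             clk1,
    input  wire             reset_n,
    input  wire             send,     // hypothetical: request to send data_in
    input  wire [WIDTH-1:0] data_in,
    input  wire             ack,      // from the clk2 domain
    output reg  [WIDTH-1:0] data,
    output reg              req,
    output wire             busy      // high while a transfer is in flight
);
    reg ack_ff1, ack_ff2;             // two-stage synchronizer for ack
    assign busy = req | ack_ff2;

    always @(posedge clk1 or negedge reset_n) begin
        if (!reset_n) begin
            ack_ff1 <= 1'b0;
            ack_ff2 <= 1'b0;
            req     <= 1'b0;
            data    <= {WIDTH{1'b0}};
        end else begin
            ack_ff1 <= ack;
            ack_ff2 <= ack_ff1;
            if (send && !busy) begin
                data <= data_in;      // data changes only when idle
                req  <= 1'b1;
            end else if (ack_ff2) begin
                req  <= 1'b0;         // receiver has taken the data
            end
        end
    end
endmodule

// Receiver (clk2): captures data once the synchronized req is high,
// then mirrors req back as ack.
module hs_rx #(parameter WIDTH = 8) (
    input  wire             clk2,
    input  wire             reset_n,
    input  wire             req,
    input  wire [WIDTH-1:0] data,
    output reg              ack,
    output reg  [WIDTH-1:0] data_out
);
    reg req_ff1, req_ff2;             // two-stage synchronizer for req

    always @(posedge clk2 or negedge reset_n) begin
        if (!reset_n) begin
            req_ff1  <= 1'b0;
            req_ff2  <= 1'b0;
            ack      <= 1'b0;
            data_out <= {WIDTH{1'b0}};
        end else begin
            req_ff1 <= req;
            req_ff2 <= req_ff1;
            if (req_ff2 && !ack)
                data_out <= data;     // data bus is stable at this point
            ack <= req_ff2;
        end
    end
endmodule
```

Each transfer costs several cycles of both clocks, which illustrates the latency penalty discussed above.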
In practical ASIC designs, FIFO memory buffers are used as data path synchronizers
to pass data between multiple clock domains. The sender (transmitter) clock domain
writes data into the FIFO memory buffer using write_clk, and the receiver clock
domain reads the data using read_clk.
A FIFO basically consists of the memory buffer, the write domain logic, the read
domain logic, and the empty and full flag generation logic. The FIFO with its
different logic blocks is shown in Fig. 13.22.
[Fig. 13.22: FIFO block diagram with write logic (write_data, write_clk), read logic (read_data, read_clk), write full generation logic (write_full), and read empty generation logic (read_empty)]
While passing multiple bits of data or control signals it is essential to use Gray
encoding, because successive Gray-coded values differ in only a single bit, so the
receiving clock domain never has to sample more than one changing bit. For example,
if 4-bit data from a binary counter needs to be passed from the sending clock domain
to the receiving clock domain, use binary-to-gray converter logic in the sending
clock domain. This guarantees only a one-bit change across the clock boundary. After
sampling the Gray counter value in the receiving clock domain, use gray-to-binary
conversion logic to perform operations on the binary numbers. The technique is
described in Fig. 13.23.
Please refer to Chap. 2 for the Verilog RTL of the gray-to-binary converter. The
gray-to-binary code conversion for a 4-bit number is described in Example 13.12.
[Fig. 13.23: binary-to-gray and gray-to-binary conversion blocks with Register A, Register B, and Register C across the clk1 and clk2 domains]
Please refer to Chap. 2 for the Verilog RTL of the binary-to-gray converter. The
binary-to-gray code conversion for a 4-bit number is described in Example 13.13.
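For reference, the two conversions can be sketched as small parameterized modules. This is a generic formulation and not the book's Example 13.12 or 13.13; the module names and the parameter N are assumptions.

```verilog
// Binary-to-gray: each Gray bit is the XOR of a binary bit and its
// more significant neighbor; the MSB is passed through unchanged.
module bin2gray #(parameter N = 4) (
    input  wire [N-1:0] bin,
    output wire [N-1:0] gray
);
    assign gray = bin ^ (bin >> 1);
endmodule

// Gray-to-binary: binary bit i is the XOR of Gray bits N-1 down to i.
module gray2bin #(parameter N = 4) (
    input  wire [N-1:0] gray,
    output reg  [N-1:0] bin
);
    integer i;
    always @(*) begin
        for (i = 0; i < N; i = i + 1)
            bin[i] = ^(gray >> i);   // reduction XOR of the upper bits
    end
endmodule
```

For instance, binary 4'b0110 converts to Gray 4'b0101, and converting 4'b0101 back yields 4'b0110.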
In multiple clock domain designs it is recommended to use Gray codes, because only
one bit changes between two successive Gray numbers. The Verilog RTL for the
Gray counter is shown in Example 13.14 (Fig. 13.24).
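One common way to build such a counter is to keep a binary count internally and register its Gray-coded value, so that the output crossing the clock boundary comes straight from flip-flops. The sketch below follows that approach; it is not the book's Example 13.14, and the module name, the ‘enable’ port, and the parameter N are assumptions.

```verilog
module gray_counter #(parameter N = 4) (
    input  wire         clk,
    input  wire         reset_n,
    input  wire         enable,    // hypothetical count enable
    output reg  [N-1:0] gray_out   // registered, one bit changes per count
);
    reg  [N-1:0] bin_count;
    wire [N-1:0] bin_next = bin_count + 1'b1;

    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) begin
            bin_count <= {N{1'b0}};
            gray_out  <= {N{1'b0}};
        end else if (enable) begin
            bin_count <= bin_next;
            gray_out  <= bin_next ^ (bin_next >> 1); // binary-to-gray
        end
    end
endmodule
```

Registering gray_out matters: if the XOR logic drove the crossing directly, combinational glitches could momentarily change more than one bit.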
13.8 Design Guidelines for the Multiple Clock Domain Designs
CDC design errors can cause serious design failures, and these failures are
expensive during the chip design cycle. They can be avoided by following a few
guidelines during the design and verification phases.
1. Metastability While passing control signal or data information, use registered
logic in the sending clock domain. If unregistered logic is used to pass data from
the sending clock domain to the receiving clock domain, glitches or hazards can
produce multiple transitions within a single clock cycle. These can force the
receiving register into a metastable state through violation of its setup or hold
time. Registering the signal before it crosses avoids multiple transitions during a
single clock cycle. Metastability blocking logic is shown in Fig. 13.25.
2. Use of MCP Multi-Cycle Path (MCP) formulation is highly recommended to avoid
metastability issues while passing data and control signal information across clock
domains. The MCP strategy is to create control and data pairs, so that multi-bit data
and a single-bit control signal are passed from the sending clock domain to the
receiving clock domain. The control information is sampled in the receiving clock
domain using a pulse synchronizer, and the data can be passed from the sending clock
domain to the receiving clock domain with or without synchronizers. This technique is
highly effective because the data holds a stable value for multiple cycles and is
sampled in the receiving clock domain using the synchronized signal generated by the
pulse synchronizer. The following key points need to be considered at clock domain
crossing boundaries:
a. Control signals must be synchronized using the multistage synchronizers.
b. Control signals should be free of hazards and glitches.
c. There should be single transition across clock boundaries.
d. Control signal should be stable for at least single clock cycle.
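The control and data pairing described in guideline 2 can be sketched as below. The pulse synchronizer here is a toggle-based variant, which is one common choice and not necessarily the structure used elsewhere in this book. All names (mcp_rx, clk_r, ctrl_toggle, data_s, data_r, data_r_valid) are hypothetical, and the sending domain is assumed to register ctrl_toggle, flip it once per new word, and hold data_s stable until the next flip.

```verilog
// Receiving side of an MCP crossing: only the control bit is synchronized;
// the multi-bit data is sampled unsynchronized while it is known stable.
module mcp_rx #(parameter WIDTH = 8) (
    input  wire             clk_r,
    input  wire             reset_n,      // assumed synchronized to clk_r
    input  wire             ctrl_toggle,  // registered in sending domain
    input  wire [WIDTH-1:0] data_s,       // held stable by the sender
    output reg  [WIDTH-1:0] data_r,
    output reg              data_r_valid  // one clk_r cycle per new word
);
    reg  t_ff1, t_ff2, t_ff3;
    wire load_pulse = t_ff2 ^ t_ff3;      // edge detect on synchronized toggle

    always @(posedge clk_r or negedge reset_n) begin
        if (!reset_n) begin
            t_ff1        <= 1'b0;
            t_ff2        <= 1'b0;
            t_ff3        <= 1'b0;
            data_r       <= {WIDTH{1'b0}};
            data_r_valid <= 1'b0;
        end else begin
            t_ff1        <= ctrl_toggle;  // two-stage synchronizer
            t_ff2        <= t_ff1;
            t_ff3        <= t_ff2;        // delayed copy for edge detection
            data_r_valid <= load_pulse;
            if (load_pulse)
                data_r <= data_s;
        end
    end
endmodule
```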
[Figures: synchronizer logic with signals clk, reset_n, data_in, and sync_data through Register C, crossing from Clock Domain 1 (clk1) to clk2]
This is highly recommended as the STA will be easier due to clean reg-to-reg
paths. Even the design verification will be easier due to the design partitioning
and using the single clock.
6. Clock naming conventions It is recommended to use the clock naming con-
ventions to identify the clock source. The naming conventions for the clock
should be supported by the meaningful prefix. For example, for sending clock
domain use clk_s and for the receiving clock domain use clk_r.
7. Reset synchronization For ASIC and SOC designs it is highly recommended
to use reset synchronizers when asserting the reset, and it is essential to
incorporate a reset synchronizer to avoid metastability during reset
deassertion. Every SOC has a single reset that is either positive-level sensitive
or negative-level sensitive, so if synchronizers are not used the flip-flops
can enter a metastable state.
8. Avoid hold time violations To avoid the hold time violations it is recommended
to pass the stable data for multiple clock cycles from the sending clock domain
to the receiving clock domain. If data is passed from the faster clock domain to
the slower clock domain then the data should be stable for multiple clock cycles
to avoid the hold time violations.
9. Avoid loss of correlation Across the clock domain boundary, loss of correlation
can occur in several ways. A few of them are
a. Multiple bits on the bus
b. Multiple handshake signals
c. Unrelated signals
To avoid this, use the clock intent verification technique. This technique
verifies that multi-bit signals are passed correctly across the clock boundaries.
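The reset synchronizer recommended in guideline 7 is often built so that reset asserts asynchronously and deasserts synchronously. A minimal sketch under that assumption follows; the module and signal names are hypothetical, and an active-low reset is assumed.

```verilog
module reset_sync (
    input  wire clk,
    input  wire async_reset_n,  // raw active-low reset
    output wire sync_reset_n    // deasserts on a clk edge
);
    reg rst_ff1, rst_ff2;

    always @(posedge clk or negedge async_reset_n) begin
        if (!async_reset_n) begin
            rst_ff1 <= 1'b0;     // assert immediately, no clock needed
            rst_ff2 <= 1'b0;
        end else begin
            rst_ff1 <= 1'b1;     // release ripples through two stages
            rst_ff2 <= rst_ff1;
        end
    end

    assign sync_reset_n = rst_ff2;
endmodule
```

In a multiple clock domain design one such synchronizer would typically be instantiated per clock domain.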
The FIFO design and case study are described using the following Verilog RTL, and
the key steps are described in the following template.
The FIFO memory buffer top-level pin diagram is shown in Fig. 13.27 and has two
different clock domains. The input (sender) clock domain works on write_clk, and
the other (receiver) clock domain works on read_clk.
[Fig. 13.27: FIFO memory buffer top-level pins: write_data, write_address, write_full, write_inc, write_clk, and wreset_n on the write side; read_data, read_address, read_empty, read_inc, read_clk, and rreset_n on the read side]
13.11 Summary
Abstract In modern lower process node ASIC designs, power is considered a major
factor. Low power chips are required in many applications such as mobile,
computing, processing, and video and audio controller designs, and most SOC designs
need low power design support. This chapter discusses the low power design
techniques at the RTL level and the use of the consistent format UPF at the logical
design level. It is useful for RTL design engineers who need to understand the UPF
terminology and the key commands for inclusion of level shifter, retention, and
isolation cells. The chapter also describes multiple power domain creation with the
UPF commands.
Keywords CMOS · Static power · Dynamic power · Switching power · Net power ·
Parasitic · Multi supply multi voltage · Isolation logic · Isolation cell ·
Level shifter · Retention cell · Clock gating · Voltage scaling · UPF
In modern ASIC and SOC designs, power optimization is crucial. The power
requirements of the ASIC or SOC play an important role in the planning of the
design. The overall power estimates for the chip and the design of a low power
architecture and microarchitecture are decisive factors in the ASIC design flow. The
goal of the chip architect is to design the architecture and microarchitecture for
low power aware designs. As the process node has shrunk from 90 to 14 nm in the past
decade, the voltage levels have dropped substantially.
As power is one of the crucial factors in the design of an SOC, it has become a
main problem in every category of design. The power density is measured in watts
per square millimeter and is rising at an alarming rate in SOC designs. From the
SOC design perspective, power or energy management therefore needs to be part of the
design from the architecture stage itself, and low power design techniques need to
be used from RTL to GDSII.
Power management is required for all designs below the 90 nm process node. At these
smaller geometries the leakage current must be managed aggressively, because leakage
becomes a primary source of power dissipation in CMOS. The leakage current is the
summation of the cell leakage currents and is state dependent.
The dynamic power is defined as addition of the summation of the internal cell
dynamic power and summation of power dissipated due to wires. The following are
the few equations which describe the leakage and dynamic power:
$$P_{\mathrm{leakage}} = \sum \text{Cell leakage}$$
where the cell leakage can be computed using the library cell leakage and is state
dependent.
$$P_{\mathrm{dynamic}} = \sum \text{Cell dynamic power} + \sum \frac{1}{2}\, C_l\, V^{2}\, T_r$$
where $C_l$ is the capacitive load at the pin or net, $V$ is the voltage level, and
$T_r$ is the toggle rate.
In the past decade, the major interest of the designer was to improve the design
performance, that is, throughput, latency, and frequency, and to reduce the silicon
area. Below 90 nm, however, power management has become key for SOC designs. Owing
to the exponential growth in wireless and mobile communications and other
intelligent home electronics applications, the demand is for complex functionality
and high-speed computation. Low power management and long battery life are also key
for such applications in a competitive market. Such devices are expected to be
lightweight, small, and cool, and to have a long battery life.
Consider the ASIC standard cell as inverter shown in Fig. 14.1. As shown in the
figure inverter is designed using PMOS and NMOS. PMOS passes strong ‘1’ and
NMOS passes strong ‘0.’ At a time either PMOS is on or NMOS is on. But
[Fig. 14.1: CMOS inverter built from a PMOS and an NMOS transistor, with input a_in and output y_out]
So the power dissipation of any CMOS cell is a function of the switching
activity, capacitance, voltage, and the structure of the transistor. The total
power for any CMOS cell is the summation of the dynamic and leakage power:
$$P_{\mathrm{total}} = P_{\mathrm{dynamic}} + P_{\mathrm{leakage}}$$
Dynamic power is the summation of the switching power and the short-circuit power.
The switching power dissipation is due to the charging and discharging of internal
and net capacitances. The short-circuit power dissipation occurs while the gate is
switching state and is due to the short circuit between the supply voltage and
ground. The following equations describe the switching and short-circuit power:
$$P_{\mathrm{switching}} = \alpha\, f\, C_{\mathrm{eff}}\, V_{dd}^{2}$$
where $\alpha$ is the switching activity, $f$ is the switching frequency, $C_{\mathrm{eff}}$ is the
effective capacitance, and $V_{dd}$ is the supply voltage.
$$P_{\mathrm{short\text{-}circuit}} = I_{sc}\, V_{dd}\, f$$
where $I_{sc}$ is the short-circuit current during switching, $f$ is the switching
frequency, and $V_{dd}$ is the supply voltage.
Dynamic power can be reduced by reducing the switching activity, the clock
frequency (which also reduces the design performance), the capacitance, and the
supply voltage. Cells with faster slew consume less dynamic power, so cell
selection is important in the reduction of dynamic power.
Leakage power is given by the following equation and is a function of the
supply voltage $V_{dd}$, the switching threshold voltage $V_{th}$, and the size of the transistor:
$$P_{\mathrm{leakage}} = f\left(V_{dd},\; V_{th},\; \frac{W}{L}\right)$$
In the above equation, $W$ is the width of the transistor and $L$ is the length of
the transistor.
Power-saving opportunities at the different design stages are listed in Table 14.1.
Several techniques are used to reduce power, and a few of the commonly
used power management techniques are listed in Table 14.2.
A few other important techniques used in power management at the various
abstraction levels are listed in Table 14.3.
Clock gating is very efficient in reducing the dynamic power. In most applications
power is wasted through unnecessary toggling of the clock signal. Even when the
register contents are not changing, the clock signal at the register clk input
toggles on every clock cycle, and this is a major source of dynamic power. The clock
trees are also major sources of dynamic power, as they have large capacitive loads
and switch at the maximum rate. So if data is not loaded into the register
frequently, a significant amount of power is wasted, and this can be saved using the
clock gating technique. Clock gating can be applied at the register (leaf) level, or
at the block level, where the entire functional block can be disabled by disabling
its clock tree. This reduces the switching and hence the dynamic power.
This technique is effective in reducing the dynamic power dissipation in the data
path of any block by using enable signals. Most of the time the data path
elements are sampled periodically, so this sampling can be controlled using
enable inputs. While the enable signal is inactive, the datapath inputs can be
forced to a constant value, which reduces the dynamic power through reduced
switching.
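The idea of forcing datapath inputs to a constant while the enable is inactive (commonly called operand isolation) can be sketched as follows. The multiplier datapath, the module name, and the port names are assumptions chosen only to make the sketch concrete.

```verilog
module operand_isolation #(parameter WIDTH = 8) (
    input  wire               clk,
    input  wire               reset_n,
    input  wire               enable,
    input  wire [WIDTH-1:0]   a_in,
    input  wire [WIDTH-1:0]   b_in,
    output reg  [2*WIDTH-1:0] result
);
    // When enable is low the multiplier sees constant zeros, so toggling
    // on a_in and b_in does not propagate into the multiplier logic.
    wire [WIDTH-1:0]   a_iso   = a_in & {WIDTH{enable}};
    wire [WIDTH-1:0]   b_iso   = b_in & {WIDTH{enable}};
    wire [2*WIDTH-1:0] product = a_iso * b_iso;

    always @(posedge clk or negedge reset_n) begin
        if (!reset_n)
            result <= {(2*WIDTH){1'b0}};
        else if (enable)
            result <= product;   // result is only sampled when enabled
    end
endmodule
```

The AND gates add a small amount of area and delay on the operand paths, so the technique pays off mainly for large or frequently idle datapath blocks.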
This technique is very effective while optimizing for area, power, and speed using
different threshold voltages. Most libraries provide cells with different switching
threshold voltages. An efficient synthesis EDA tool can use library cells of
different switching threshold voltages to meet the area and speed constraints with
the lowest power dissipation.
Dynamic Voltage and Frequency Scaling (DVFS) is a very efficient technique for
reducing the active power consumption. As discussed in the earlier section, the
power dissipation is proportional to the square of the voltage, so lowering the
voltage has a squared effect on the power consumption. In this technique, depending
on the performance and power requirements, the frequency and voltage can be scaled
down on the fly to reduce the power dissipation. The technique is very effective in
optimizing both static and dynamic power through optimization of the frequency and
voltage levels.
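As a rough worked illustration of the squared dependence on voltage, take the switching power to be proportional to the frequency and to the square of the supply voltage, with activity and capacitance unchanged. The operating points below (1.2 V scaled to 0.9 V, and the clock scaled to 75 % of its original frequency) are assumed purely for illustration and do not come from the text.

```latex
\[
\frac{P_{\mathrm{new}}}{P_{\mathrm{old}}}
  = \frac{f_{\mathrm{new}}}{f_{\mathrm{old}}}
    \left(\frac{V_{\mathrm{new}}}{V_{\mathrm{old}}}\right)^{2}
  = 0.75 \times \left(\frac{0.9}{1.2}\right)^{2}
  = 0.75 \times 0.5625
  \approx 0.42
\]
```

Under these assumptions the switching power falls to roughly 42 % of its original value, with the voltage reduction alone contributing the factor of 0.5625.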
Power gating or power shut off (PSO) is one of the effective techniques: design
modules that are not in use are switched off using power switches. It is one of the
most powerful techniques for reducing leakage power. In most industrial
applications the leakage power can be reduced by more than 90 % using power gating
switches. Implementing this technique requires a clear understanding of the
power-down sequence and of isolation cells. It is essential to use isolation logic
together with state retention elements, and level shifters where needed, to
implement power gating.
Isolation logic is used at the output of a power-down block to prevent unpowered and
floating signals from the power-down block from propagating. In simulation these
signals are denoted by ‘X.’ Isolation cells are placed between two power domains,
connected between the power-off domain and the power-on domain. The purpose of the
isolation cell is to isolate the outputs of the block before power is switched off
and to keep them isolated until power is switched on again. In a few design
scenarios isolation cells can be used at the block level to prevent the connection
to power-down logic. If the block logic is the driving power domain and it is in the
off state, the isolation cell can be located in the driving domain to isolate the
signals going from the driving power domain to the receiving power-on domain.
During power-off mode it is usually essential to retain the state of the registers,
because the register state is needed during power recovery. Most low power designs
use state retention power gating (SRPG) flip-flops for this purpose. Most EDA tool
cell libraries include an SRPG cell.
At present there are many low power design techniques at the RTL and gate level. It
is essential for the design team to understand the low power goals so that the
techniques are defined uniformly, ensuring consistency and predictability in the
overall design cycle. Most SOC designs apply low power design techniques guided by
power analysis and optimization. This section focuses on the low power design
techniques.
1. Modeling and power estimation: For low power design and power management of
any SOC it is essential to prepare library models with the required power data,
and to develop transistor-level models for the custom blocks. The common
practice in SOC design at the RTL level is to use Power Compiler to estimate the
power consumption based on the switching activity information from the RTL
simulation data. This technique is useful for estimating the power consumption
at an early stage of the design. Another important point at the gate level is to
develop glitch-free low power designs with support for state and path
dependencies. As gate-level analysis is more accurate than RTL-level analysis, it
is essential to use time-based analysis of the peak power and hot spots.
2. Clock Gating: Use clock gating cells to minimize the power during the RTL
design. Clock gating can be implemented by identifying the register banks with a
synchronous load enable, and by gating the clock with the enable instead of
recirculating the data when the enable is inactive. If Power Compiler is used at
the RTL level, it automatically optimizes the static and dynamic power
dissipation together with delay and area to meet the design constraints.
Clock gating stops the clock and forces the original circuit in the zone of no
transition. In the practical scenario if we consider the functional block as
always @(posedge clk)
begin
  if (enable)
    data_out <= data_in;
end
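The above block can generate the synthesis result shown in Fig. 14.3.
[Fig. 14.3: register without clock gating, with inputs data_in, enable, and clk and output data_out]
[Fig. 14.4: register with clock gating, where enable passes through a latch and is combined with clk to generate G_clk for the register between data_in and data_out]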
The logic generated above is without clock gating and has higher power
dissipation. To reduce the power consumption, clock gating logic can be used. It
eliminates the multiplexers at the register inputs and thus avoids the
recirculation of data. This results in area and power savings and reduces the
power consumption in the clock network. The synthesized logic using clock gating
is shown in Fig. 14.4. The timing sequence is shown in Fig. 14.5.
A drawback of clock gating is that the logic used to implement it is redundant, so
there can be issues in testing and verification. Another important point is that
glitches and hazards on the enable signal must be stopped from reaching the gated
clock, and this is achieved using a transparent latch between the enable and the
AND logic gate.
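A behavioral sketch of the latch-plus-AND gating structure described above is shown below. It is meant only to illustrate the structure: the module and signal names (clock_gate, gated_register, en_latch, g_clk) are assumptions, and in a real flow the gating logic is normally a library clock gating cell inserted by the synthesis tool rather than hand-written RTL.

```verilog
// Latch-based clock gate: the latch is transparent while clk is low, so
// enable can only change the gating condition during the low phase and
// cannot create glitches or short pulses on g_clk.
module clock_gate (
    input  wire clk,
    input  wire enable,
    output wire g_clk
);
    reg en_latch;

    always @(clk or enable) begin
        if (!clk)
            en_latch <= enable;   // latch holds its value while clk is high
    end

    assign g_clk = clk & en_latch;
endmodule

// Register bank clocked by the gated clock: no feedback multiplexer is
// needed because the register simply receives no clock edge when disabled.
module gated_register #(parameter WIDTH = 8) (
    input  wire             clk,
    input  wire             enable,
    input  wire [WIDTH-1:0] data_in,
    output reg  [WIDTH-1:0] data_out
);
    wire g_clk;

    clock_gate u_gate (.clk(clk), .enable(enable), .g_clk(g_clk));

    always @(posedge g_clk)
        data_out <= data_in;
endmodule
```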
Clock gating can be efficiently implemented using Power Compiler from
Synopsys. Use the command set_clock_gating_signals. Figure 14.6 illustrates the
inputs and outputs used for Power Compiler.
[Fig. 14.6: Power Compiler flow from the library and elaboration to the elaborated unmapped design]
The outcome of Power Compiler is the elaborated unmapped design. Power
Compiler uses the source RTL code and the library as inputs to optimize for low
power.
The following are a few of the key points that need to be considered while
implementing clock gating using Power Compiler.
1. General clock gating can be included in or excluded from the design for
hierarchical modules. The command used is set_clock_gating_signals. The
designer needs to take care while using Power Compiler for this: each design
should have a single command line for both the inclusion and the exclusion of
clock gating.
2. If the design has multiple registers and a few of them need to be excluded
from the clock gating strategy, they should have a separate enable signal.
If the same enable signal is used, the same clock gating is generated for the
entire register bank. For example, if the data bus is defined as data_in[7:0]
with registered inputs and the lower nibble data_in[3:0] needs to be excluded
from clock gating, then data_in[3:0] and data_in[7:4] should have different
enables.
3. Clock gating signals, whether single bit or multiple bits, have the added
advantage of avoiding the recirculation of data by removing the multiplexers.
However, the clock gating logic itself can consume additional area and power.
4. Do not use clock gating for master-slave flip-flops. It is common practice for
clock gating logic to be applied at the slave flip-flop if the clock gating
conditions are met, and such a design may not perform the desired operation.
Use the command set_clock_gating_exclude to exclude the master-slave flip-flops.
5. While using clock gating it is common practice to specify a minimum bus
width. The minimum bus width can be 5 or more. Use the command
set_clock_gating_style -minimum_bitwidth.
6. In most design practices at the RTL level, if procedural ‘always’ blocks
contain ‘case’ with a ‘default’ clause or conditional expressions such as
‘if-else’, modify the RTL to include the default condition in every ‘if-else’
statement. Example 14.1 describes the modification of the procedural block
using ‘default’ as the ‘else’ clause.
7. If the same enable is shared by multiple register banks, the Power Compiler
feature for sharing the clock and enable signal across multiple register banks
can be used. This saves overall area. Example 14.2 has two different
procedural blocks, and the same clock gating logic can be used for both of
them.
8. Use simple clocking strategies for automatic clock gating insertion. Keeping
the number of clock domains to a minimum simplifies timing analysis and clock
tree synthesis. The lower-level modules can use enable signals instead of
divided clocks. Use the set_dont_touch_network command to avoid compilation
changes on the clock network. During a multiple-step compilation process this
prevents changes to the clock gating logic.
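// Example 14.1 (partial): each 'if' inside a 'case' branch has an 'else'.
// Assumed names: sel_in, e0_in. Only the 'else' arm is legible in the source.
always @(*)
begin
  case (a_in)
    2'b00: begin
      if (sel_in) c_in = e0_in;
      else c_in = e1_in;
    end
    default: c_in = 1'b0; // branches from 2'b01 onward are not recoverable
  endcase
end

// Example 14.2 (partial): two register banks sharing clk and enable.
// The sensitivity lists and reset values are assumed.
always @(posedge clk or negedge reset_n)
begin : block_1
  if (~reset_n) data_out <= 0;
  else if (enable)
    data_out <= data_in;
end

always @(posedge clk or negedge reset_n)
begin : block_2
  if (~reset_n) data_out_1 <= 0;
  else if (enable)
    data_out_1 <= data_in_1;
end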
9. Use simple set and reset strategies. Complex set and reset strategies may
result in design logic that is prone to issues during gate-level functional
debugging. The designer needs to take care to partition the logic properly for
synthesis while using internal set and reset signals.
10. Clock balancing and clock buffer insertion need to be used efficiently
to achieve efficient clock tree synthesis (CTS). CTS tools work by adding or
moving buffers and resizing cells along the clock tree network to manage the
required skew and insertion delay.
Unified Power Format (UPF) is the standard used to design electronic systems with
power treated as a design feature. The standard is used for low power ASIC
designs. The reasons for using UPF are as follows:
1. There is no method at the HDL level of abstraction that supports accurate
management and distribution of power.
2. Vendor-specific power formats are inconsistent and are prone to bugs due to
inconsistent specifications.
3. UPF provides the following and can be used consistently in low power ASIC
designs
a. Power distribution architecture
i. Define the power domains
ii. Define power switches
iii. Define power rails
b. Power strategy
i. Creation of power state tables
c. Set and map
i. Isolation
ii. Retention
iii. Level shifter
iv. Switches
UPF is IEEE 1801 standard and can be used throughout the design flow for power
aware design intent. Example 14.3 describes the use of UPF at various stages.
As discussed earlier, isolation cells are used at the outputs of a power-down
block. The isolation cell can be set using the following UPF command.
Figure 14.7 shows the design using an isolation cell.
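The clamping behaviour of an isolation cell can be pictured with a small
behavioural Verilog model. This is only a sketch for simulation intuition: in a
real flow the cell is inserted from the UPF power intent and is not written in
the RTL. The module name iso_cell_model, the port names, and the clamp-to-0
choice are assumptions and are not taken from the text.

```verilog
// Behavioural sketch of an isolation cell that clamps to 0.
// All names are hypothetical; real isolation cells come from the
// technology library and are inserted from the power intent.
module iso_cell_model (
    input  wire data_in,   // output of the power-down block
    input  wire iso_en,    // 1 = isolate (block is, or is about to be, off)
    output wire data_out   // value seen by the always-on logic
);
    // While isolation is enabled the output is held at a known 0,
    // otherwise the data passes through unchanged.
    assign data_out = iso_en ? 1'b0 : data_in;
endmodule
```

Clamping to a known value keeps the unknown outputs of the switched-off block
from propagating into logic that is still powered.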
As discussed in the above section, retention cells are used to retain the state
of key registers during the power-off state. Figure 14.8 shows the setting of
the retention cell in the design.
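The idea of state retention can be sketched with a behavioural register that
keeps a shadow copy of its value. This is a simplified, fully synchronous
illustration under stated assumptions: the names retention_reg_model, save,
restore, and shadow are hypothetical, and library retention cells usually have
their own save and restore protocol and a separate always-on supply, neither
of which is modelled here.

```verilog
// Behavioural sketch of a retention register (names are hypothetical).
module retention_reg_model (
    input  wire clk,
    input  wire reset_n,   // active-low asynchronous reset of the main flop
    input  wire save,      // copy the live state into the shadow storage
    input  wire restore,   // copy the shadow value back after power-up
    input  wire d,
    output reg  q
);
    reg shadow;  // stands in for the always-on retention storage

    // Main register: restore has priority over normal data capture.
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q <= 1'b0;
        else if (restore)
            q <= shadow;
        else
            q <= d;

    // Shadow storage is not reset, so it keeps its value across a reset
    // of the main register.
    always @(posedge clk)
        if (save)
            shadow <= q;
endmodule
```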
Level shifters translate signals from one voltage level to another. The
translation can be from a low to a high voltage level or from a high to a low
voltage level. Level shifters can be set and mapped using the following UPF
commands. Figure 14.9 shows the use of the command to set and map the level
shifter. The key points to consider are:
1. Pick the correct power domain
2. Select input ports, output ports, or both
3. Use the up-shift or down-shift rule
4. Define the location
Set level shifter
A specific sequence is generally followed for power-down: isolation, then state
retention, and finally power shut-off. For the power-up cycle the opposite
sequence needs to be followed. During the power-up cycle it is recommended to
apply a specific reset sequence. The following timing sequence gives information
about the power-up/down sequence.
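The ordering described above (isolate, retain, shut off, and the reverse on
power-up) can be sketched as a small state machine. This is a minimal sketch
under stated assumptions: the module name power_seq_model, the signal names,
and the one-clock spacing between steps are all hypothetical, and a real
controller would wait for acknowledgements from the power switch and would
include the reset sequence recommended above.

```verilog
// Sketch of a power-down/power-up sequencer (all names are hypothetical).
// Power-down: isolate -> save state -> switch supply off.
// Power-up:   switch supply on -> restore state -> remove isolation.
module power_seq_model (
    input  wire clk,
    input  wire reset_n,
    input  wire power_down_req,  // 1 = request shut-off, 0 = request power-up
    output reg  iso_en,          // isolation enable
    output reg  save,            // one-cycle retention save pulse
    output reg  restore,         // one-cycle retention restore pulse
    output reg  power_on         // power switch control (1 = supply on)
);
    localparam ON = 3'd0, ISOLATE = 3'd1, SAVE = 3'd2,
               OFF = 3'd3, WAKE = 3'd4, RESTORE = 3'd5;
    reg [2:0] state;

    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) begin
            state    <= ON;
            iso_en   <= 1'b0;
            save     <= 1'b0;
            restore  <= 1'b0;
            power_on <= 1'b1;
        end else begin
            save    <= 1'b0;   // default: pulses last one clock cycle
            restore <= 1'b0;
            case (state)
                ON:      if (power_down_req) begin
                             iso_en <= 1'b1;        // step 1: isolate
                             state  <= ISOLATE;
                         end
                ISOLATE: begin
                             save  <= 1'b1;         // step 2: save state
                             state <= SAVE;
                         end
                SAVE:    begin
                             power_on <= 1'b0;      // step 3: shut off
                             state    <= OFF;
                         end
                OFF:     if (!power_down_req) begin
                             power_on <= 1'b1;      // reverse step 3
                             state    <= WAKE;
                         end
                WAKE:    begin
                             restore <= 1'b1;       // reverse step 2
                             state   <= RESTORE;
                         end
                RESTORE: begin
                             iso_en <= 1'b0;        // reverse step 1
                             state  <= ON;
                         end
                default: state <= ON;
            endcase
        end
    end
endmodule
```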
For designs with multiple clock domains that have different power sequences, and
with multiple clock gating schemes that share a few common power control
signals, a higher verification effort is required to ensure correct sequencing
of power-on and power-off.
UPF can be used from RTL to GDSII, and the basic UPF support is shown in
Fig. 14.10. During verification with UPF, both the functional intent and the
power intent should be analyzed, and robust verification using advanced
verification techniques is needed.
The power domains can be created using the following UPF command:
For example, to create a power domain named pdA, the UPF command used is given
below and the outcome is shown in Fig. 14.11:
The supply port can be created using the following UPF command:
For example, to create a supply port named spAOn, the command used is given
below and the outcome is shown in Fig. 14.12:
The supply net can be created using the following UPF command:
For example, to create a supply net named RET, the UPF command used is given
below and the outcome is shown in Fig. 14.13:
For example, to create a supply net named PR, the UPF command is given below
and the outcome is shown in Fig. 14.14:
The power switch can be created using the following UPF command:
For example, to create the power switch SW1, the UPF command used is given
below with the net outcome shown in Fig. 14.15:
The connection for a supply net can be created using the following UPF command:
For example, to connect the power supply net, the command used is given below
and the net outcome is shown in Fig. 14.16:
14.6 Summary
References
Abstract SOCs are complex, high-density ASICs and need to be validated using
FPGAs. At present there is growing demand for FPGA prototyping to realize
ASICs. A single FPGA or multiple FPGAs can be used to prototype the desired SOC
functionality. This chapter discusses the SOC components, the challenges, and
the SOC design flow. The coding of the individual key SOC blocks is also
discussed in this chapter.
Keywords SOC · IP · Timing accurate model · Cycle accurate model · Data
hazards · Structural hazards · Pipelining · STA · Timing closure · FPGA
prototyping · ASIC porting · Verification plan · Test plan · DFT · UPF ·
Control path · Data path · Control hazard · Arbitration · IO · Timers ·
Counters · Memories · Microcontroller · Microprocessor
A System on Chip (SOC) is designed using the ASIC design flow, and PLDs are used
for proof of concept. At present, designs are complex and consist of multiple
functional blocks that perform the desired operations at high performance. The
main SOC design challenge is to achieve low power, high performance, and a small
die area.
As SOC complexity has increased over the past decade, it has become extremely
important to detect defects in SOCs during the early stages of the design cycle.
The best and most affordable way is to use modern FPGAs to realize or prototype
the design. At present most complex designs are prototyped using modern FPGA
architectures.
It is essential to understand why FPGA prototyping has become popular during
this decade. The main reasons are the lower non-recurring investment and the
availability of high-performance computing and reprogrammable blocks in the
FPGA. SOCs consist of processors, IO interfaces such as Ethernet, SATA, USB,
UART, SPI, and I2C, high-performance DSP computational capabilities, video and
audio codecs, and high-speed memory controllers such as DDR II or DDR III.
Modern FPGAs are used for SOC prototyping as they have most of the capabilities
listed above to achieve high performance.
In the present decade IP and SOC complexity has increased considerably. There is
demand for SOC designs and for FPGAs with high-density functional blocks that
can be used to validate the SOC. This is also called ASIC or SOC prototyping.
Consider that a typical SOC has a processor core, various memories, a PLL as the
clock source, functional blocks in multiple power domains, peripherals,
communication interfaces, and analog-to-digital and digital-to-analog
converters. An important point in the design of an SOC is the partitioning of
the design into hardware and software blocks. At present FPGAs are used for SOC
prototyping because they are reconfigurable and because the soft and hard IPs
they contain accelerate the performance of the design.
The different blocks of an SOC are shown in Fig. 15.1. A complex SOC contains
different communication interfaces such as USB and Bluetooth, and most SOCs
support the standard protocols. For any SOC design it is essential to meet the
area, speed, and power constraints. Achieving the required functionality within
these constraints is a challenge because little time is available to design and
market the product, owing to the high demand for new features and functional
requirements. SOC design always needs a realistic plan, resources, and the
necessary validation, testing, and prototyping setup.
15.3 SOC Design Flow 383
[Fig. 15.1 block labels: Processor, Memory, Memory Controller, USB, SPI, SATA, LCD, Ethernet, Audio Codec]
The SOC design flow for design and validation is shown in Fig. 15.2. As shown
in the figure it has multiple steps, which include design feasibility and
implementation, FPGA prototyping and testing, and ASIC porting. The key steps
are discussed in the subsequent section (Fig. 15.3).
Most SOCs use intellectual property (IP) blocks. As a designer, it is important
to validate the IPs used in an SOC for their features, timing requirements, and
functionality. The most important parameter in IP design is the overall
functionality of the design. IPs sell in the semiconductor market because of
their features, timing performance, and low power requirements. Consider a
simple tablet: its selling point in the market is the availability of functional
features. Likewise, IPs are not sold in the market on the basis of their
interfaces alone.
Most SOC design teams use third-party IPs that are functionally and timing
proven. Instead of spending time on the design of IPs, most SOC design teams use
the multiple IPs needed to meet the desired functional requirements. All the
required IPs can be integrated together according to the speed and power
requirements. The overall integration of IPs is a challenge, but it can be
overcome by understanding the architecture details, timing, and power figures of
the IPs. An IP can be a soft IP or a hard IP. IP vendors can provide
synthesizable, process-independent RTL, or a netlist with the necessary timing
information, together with high-performance, user-friendly interfaces.
IPs should exhibit the required functionality and should be delivered with the
synthesizable RTL, synthesis scripts, design constraints, and interface details.
384 15 System on Chip (SOC) Design
IP integration and SOC validation then become easier and take less time. The
growing complexity of SOCs is due to the following few factors:
Capturing the design requirements and analyzing the design is the first
important step in SOC design. The input for this phase is the design or product
specification provided by the client or end user. The analysis involves a
feasibility study of all the features requested. The feasibility study is an
important phase, as it exposes the risks and dependencies in the implementation.
The feasibility study needs to be done for all the features while keeping the
time to market in mind. This study gives the roadmap and the challenges involved
in the SOC design implementation cycle, and is also useful for planning various
versions with milestone details. The design specifications are analyzed and
understood during this phase.
This is also called design partitioning; the design has to be partitioned into
hardware and software. An important consideration while partitioning the design
is how parallel execution should be incorporated. Because present-day SOCs are
complex, functionality can be implemented using parallelism, which in turn can
improve the design performance. Complex computational tasks or algorithms need
to be partitioned during the design partitioning phase. Most complex
computational blocks need to be implemented in hardware. Design partitioning is
an important and decisive phase that defines what needs to be implemented in
software and what needs to be implemented in hardware.
For example, consider the design of a video decoder that needs multiple frame
support. The video decoder can be implemented efficiently in hardware, and
parallelism can be incorporated for a few of the decoder features. DSP
functional blocks with a high computational load, such as FFTs, FIR and IIR
filters, or high speed multipliers, can be implemented effectively and
efficiently in hardware.
Let us consider the scenario of protocol implementation. Most protocols, such as
Ethernet, USB, and AHB, can be implemented efficiently using hardware-software
codesign. These algorithms should be functionally and timing proven. This helps
to reduce latency in the design. For most protocol implementations it is
essential to consider the software and hardware design partitioning and the
design cycle, together with the overall overheads on the design.
The major challenge in hardware-software design partitioning is the analysis of
throughput and power requirements. For example, consider an SOC design in which
fixed-length packets need to be transferred at a fixed time interval. If the
design is implemented in hardware, care needs to be taken to keep the
interaction between the hardware and software to a minimum. FIFO buffers and
timers can be used to minimize this interaction.
A FIFO is a first in first out memory buffer and can hold packet information up
to the depth of the FIFO. At the start of a data transfer the FIFO can interact
with procedural calls defined in the software. To track the time duration,
timers can be implemented in hardware with a communication interface to the FIFO
to indicate the end of the timing intervals. This type of handshaking mechanism
can be implemented easily in hardware.
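The FIFO buffer described above can be sketched as a small single-clock design.
This is a minimal sketch under stated assumptions, not the implementation used
in the book: the module name sync_fifo, the parameters, and the port names are
hypothetical, and the timer interface mentioned in the text is left out.

```verilog
// Minimal single-clock FIFO sketch (all names are hypothetical).
module sync_fifo #(
    parameter WIDTH     = 8,
    parameter ADDR_BITS = 4              // depth = 2**ADDR_BITS entries
) (
    input  wire             clk,
    input  wire             reset_n,
    input  wire             write_en,
    input  wire [WIDTH-1:0] write_data,
    input  wire             read_en,
    output reg  [WIDTH-1:0] read_data,
    output wire             full,
    output wire             empty
);
    reg [WIDTH-1:0]   mem [0:(1<<ADDR_BITS)-1];
    // One extra pointer bit distinguishes "full" from "empty".
    reg [ADDR_BITS:0] wr_ptr, rd_ptr;

    assign empty = (wr_ptr == rd_ptr);
    assign full  = (wr_ptr[ADDR_BITS] != rd_ptr[ADDR_BITS]) &&
                   (wr_ptr[ADDR_BITS-1:0] == rd_ptr[ADDR_BITS-1:0]);

    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) begin
            wr_ptr    <= 0;
            rd_ptr    <= 0;
            read_data <= 0;
        end else begin
            // Writes are ignored when full, reads are ignored when empty.
            if (write_en && !full) begin
                mem[wr_ptr[ADDR_BITS-1:0]] <= write_data;
                wr_ptr <= wr_ptr + 1'b1;
            end
            if (read_en && !empty) begin
                read_data <= mem[rd_ptr[ADDR_BITS-1:0]];
                rd_ptr    <= rd_ptr + 1'b1;
            end
        end
    end
endmodule
```

The full and empty flags are the handshake signals that software, or a hardware
timer, can poll before starting or ending a transfer.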
The design architecture for both the hardware and software activities can be
created by considering power aware design and throughput requirements.
For every SOC, it is essential to have functionally and timing proven bus
interfaces. Most applications use advanced high-speed bus protocols. These
protocols need to be validated for functional and timing correctness. IO
interfaces need to be targeted for high-speed data transfer. Many different
kinds of IO interfaces are used in SOC designs: general purpose IOs,
differential IOs, and high speed IOs.
The clock distribution network is used to deliver the clock to all the registers
in the SOC with uniform skew. The clocking policy plays a crucial role in the
overall design performance. Uniform clock skew can be achieved by building a
suitable clock tree using clock tree synthesis. The choice between a single
clock structure and a multiple clock domain structure needs to be made at the
architecture level. The use of synchronous or asynchronous logic also needs to
be defined at the architecture level.
Reset can be asynchronous or synchronous and needs to be defined at the
architecture phase of the SOC.
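The difference between the two reset styles shows up directly in the
sensitivity list of the always block. The sketch below places both styles side
by side; the module and signal names are illustrative assumptions.

```verilog
// Asynchronous versus synchronous reset (names are hypothetical).
module reset_styles (
    input  wire clk,
    input  wire reset_n,   // active-low reset
    input  wire d,
    output reg  q_async,
    output reg  q_sync
);
    // Asynchronous reset: reset_n appears in the sensitivity list, so the
    // register clears as soon as reset_n goes low, without a clock edge.
    always @(posedge clk or negedge reset_n)
        if (!reset_n)
            q_async <= 1'b0;
        else
            q_async <= d;

    // Synchronous reset: reset_n is sampled only on the active clock edge,
    // so it becomes part of the data path logic.
    always @(posedge clk)
        if (!reset_n)
            q_sync <= 1'b0;
        else
            q_sync <= d;
endmodule
```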
Choose the necessary EDA tools and licenses for FPGA prototyping of the SOC and
for ASIC porting. Commonly used industry standard tools are
Simulator: QuestaSim and VCS
Synthesis: Synplify Pro and Synopsys DC
STA: PrimeTime (Synopsys PT)
For SOC validation use the necessary prototyping and development platform. The
prototyping platform consists of multiple FPGA boards used to realize and
validate the SOC, together with the required IPs, DSP functionality, memories,
and general purpose processors. Prototyping boards with the interfaces needed to
realize the SOC must be available, along with a debug or test setup.
Most SOCs are tested using a test setup consisting of the available EDA tools
and logic analyzers. At the start of the SOC design cycle, the architect
analyzes the design and functional requirements, and the prototyping platform
can be built according to the required speed and the estimated gate count. The
important overall factors here are time to market, budget allocation, and
design time requirements. If DSP capabilities are available in the FPGA then it
is wise to implement the DSP functionality on the FPGA.
For SOCs with a large gate count, the necessary test cases need to be developed
with the required test vectors. The features can be extracted from the top level
functional specifications and the required test cases can be documented in the
test plan document. The test vectors developed have a significant impact on the
quality of the verification and on achieving the coverage goals. The test cases
can be documented as basic, corner, and random test cases. Constrained random
verification with the required coverage goals can be achieved by using the
necessary test cases and test plan.
Use verification languages such as Verilog and high level verification languages
such as SystemVerilog or SystemC for early detection of bugs and to achieve the
coverage goals. Verification planning that improves the overall design quality
by capturing bugs early in the design cycle is always crucial in large gate
count SOC designs. The overall objective is to achieve the required
functionality in less time. The verification environment needs to be built to
achieve the coverage goals. The verification architecture can include the
necessary bus functional models, drivers, monitors, and scoreboards for robust
checking of the design functionality. The goal of the overall verification
planning and environment creation is automation, which minimizes the time needed
to complete the functional checks.
At the architecture and microarchitecture level the gate count is estimated for
the SOC. As discussed already, the prototyping platform can have multiple FPGAs
with the required high speed interfaces. FPGAs can be chosen depending on the
complexity of the design. The main selection criteria are lower power and higher
speed. The following key points need to be considered while prototyping using
FPGAs:
1. Use the FPGA functional blocks to meet the area requirements. Choose a
suitable FPGA platform and use 70 % of the FPGA resources.
2. The area, speed, and power constraints need to be extracted at the chip level
and at the block level.
3. Use the block level constraints while synthesizing the blocks and use the chip
level constraints at the top level.
4. If high performance DSP algorithms need to be coded, use the DSP functional
macro blocks to realize the computationally heavy DSP filtering and
processing algorithms.
5. Try to choose an FPGA platform with the required high speed interfaces such as
USB, Ethernet, PCI, and memory controllers.
6. Choose the mechanism for interaction between software and hardware.
7. Choose the required tool options for automatic place and route of the design
to meet the design constraints.
8. The FPGAs should be capable of achieving the functionality at higher speed.
Low power requirements for the design: Most FPGA designs demand low power in
today's market. SOCs can be designed to meet the desired power. Use the Unified
Power Format (UPF) to achieve the desired low power requirements.
After the SOC has been realized and validated using FPGAs, the design needs to
be migrated to an ASIC. For quick realization of the ASIC, the designer needs to
do the following:
1. Replace the clock gating logic with the equivalent component from the ASIC
library.
2. Insert DFT and check for the stuck at fault coverage.
3. Use the low power intent design using UPF.
4. Use the block level and chip level constraints while migrating from FPGA to
ASIC flow.
5. Synthesize the design for the required constraints.
6. Implement the physical design using the design flow for the required area,
speed, and power.
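Item 1 of the list above is easier to carry out when the clock gating logic
sits in its own small module, so that the module body can be swapped for the
library clock gating cell during ASIC porting. The behavioural model below is a
sketch under stated assumptions: the name clock_gate_model and its ports are
hypothetical, and the actual cell name and pins depend on the ASIC library.

```verilog
// Behavioural latch-based clock gate (names are hypothetical).
// For the ASIC, this module body would be replaced by an instance of the
// integrated clock gating cell from the target library.
module clock_gate_model (
    input  wire clk,
    input  wire enable,
    output wire gated_clk
);
    reg enable_latched;

    // The latch is transparent only while clk is low, so a change on
    // enable during the high phase cannot glitch the gated clock.
    always @(clk or enable)
        if (!clk)
            enable_latched <= enable;

    assign gated_clk = clk & enable_latched;
endmodule
```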
While designing an SOC there may be many design challenges; a few of them are
listed below.
1. Use of the modeling abstraction levels In practice, different modeling levels
are used from the design specifications through to fabrication of the chip. It
is a good decision to use different levels of abstraction while designing an
SOC.
a. Functional modeling To describe the functionality and to obtain valid and
accurate output using simulators.
b. Cycle accurate modeling To understand the number of cycles consumed while
performing an operation.
c. Event level modeling To check whether the number of events within a clock
cycle is accurate.
d. Memory accurate modeling To check whether the memory contents and layout
are accurate.
e. Transaction level modeling To check whether the number of transactions is
accurate.
2. RTL design An efficient, synthesizable RTL description is one of the key
challenges, and the ASIC SOC design engineer needs to take care of the
following:
a. Order of continuous assignments and loop free design. The outcome is latch
free synthesis results.
b. Defining the hierarchy of the design and having an efficient design partition.
c. Registering the inputs and outputs of each module.
d. Assigning each register in a single clock domain.
e. DFT friendly RTL design and low power aware RTL.
f. Proper use of blocking and nonblocking assignments.
3. RTL verification The goal is to detect bugs early in the design cycle and to
achieve the coverage goals. The main challenge is to understand the usage of
event-driven or cycle accurate simulators and their features. While creating
the testbench architecture, care needs to be taken to build a self checking
testbench and to automate the tests for higher coverage. The use of transport
and inertial delays during verification, and the use of zero delay models, is
another key challenge for robust verification.
4. Synthesis The goal should be to meet the desired power, speed, and area
requirements. For low power designs, use isolation cells, retention cells,
level shifters, and clock gating logic. For speed improvement, use techniques
like register balancing, pipelining, and register retiming. For area
minimization, use techniques like multiplex decoding, grouping, and constant
data propagation.
5. Hazard free designs For any efficient ASIC SOC design, it is recommended not
to have the hazards. There are potential issues in the design due to hazards for
15.4 SOC Design Challenges 391
example, a write after write hazard can cause problems in the design if the
second write does not complete properly after the first write of the data. The
following are a few important points to consider for a hazard free design.
a. Data hazard Can be a potential problem if the data or address is not computed
or has not arrived at the required time stamp.
b. Structural hazard Can be a potential problem when a limited number of
resources must perform multiple activities at a time. To overcome these
hazards use registers and sequence the operations using a pipelined
structure. The following are a few examples of structural hazards:
i. Memories with a limited number of ports and less latency.
ii. Non-pipelined designs and a limited number of processing units.
iii. Implementation of multiplier algorithms without pipelining or Booth
multiplication.
c. Control hazards Can be a potential problem due to the late arrival of a
control signal, or when it is not clear when to perform the operation.
d. Read and write hazard Can be a potential problem if the read and write
operations are performed during the same time stamp.
e. Timing estimation and analysis The challenge is to meet the required timing
for the SOC, and the following help to address it:
i. Use a pipelined design with the required number of pipeline stages.
ii. Use the grouping technique and logic duplication for clean register
to register paths.
iii. Use techniques to reduce the critical path timing delays.
6. Interface and protocol implementations Most SOC designs use protocols and, as
discussed earlier, meeting the timing performance at the interface level is
also an important aspect of an efficient SOC. The following are a few points
that need to be considered while modeling protocols and interfaces.
a. Use of a handshaking mechanism for transaction notification.
b. Use of general purpose IOs and special IOs for the interfaces.
c. Understanding the timing details at the pin and signal level.
d. Use of serializers, deserializers, and parallelism while modeling the
protocols.
7. SOC components Selecting the required SOC components or describing the SOC
RTL design is one of the key challenges. The main SOC components can be
microprocessors or microcontrollers, IOs, arbiters, memories, general purpose
controllers, interrupt controllers, and DMA controllers. Describing the RTL
for every individual component is a key challenge, as the goal is to achieve
the required area, speed, and power.
8. Design Implementation and Testing After completion of the hardware and
software component design, the integration of hardware and software is the
major challenge due to the interface synchronization requirements. Testing the
SOC needs an efficient verification and test plan that covers the features.
An SOC design case study for a moderately complex design is discussed in this
section. As discussed in the above section, the SOC consists of a microprocessor
or microcontroller to perform processing operations on multiple operands, RAM
and ROM memory banks, general purpose IO and control mechanisms, counters and
timers, and a UART. To keep the SOC easy to understand, complex modules like DSP
controllers, DMA controllers, video controllers, and complex arbiters are not
discussed in the case study. Readers are encouraged to use the fundamental
design concepts to describe the architectures and to code the RTL for these
complex modules.
The key SOC design blocks and the Verilog RTL for these blocks are discussed in
this section. The key SOC design blocks are
1. Microprocessor or microcontroller
2. Counters and timers
3. General purpose IO
4. UART
5. Bus arbitration logic
The memories are discussed in Chaps. 7 and 9 and readers are referred to the
memory section. The objective of this section is to discuss the design aspects
of these blocks. Finally, these individual blocks can act as IPs and can be
integrated together to achieve the desired functional requirement.
The SOC with moderate gate complexity is shown in Fig. 15.3 and it consists of
most of the blocks mentioned above.
The generalized architecture for a processor is shown in Fig. 15.4. As shown in
the figure it consists of an ALU, instruction register and decoder logic, a
control and timing unit, and program and stack pointer incremental logic. It
also contains bus arbitration logic. While designing the processor it is
essential to partition the design so that the individual modules can be coded
using synthesizable constructs. Data path and control path logic need to be
partitioned for better visibility and better
15.6 SOC Design Blocks 393
[Fig. 15.4 labels: Operand A, Operand B, Instruction Decoder, Control Bus]
timing and performance. Readers are referred to Chaps. 1–9 for the RTL coding
concepts needed to design efficient processor blocks.
In most designs the requirement is to count a predefined number of pulses,
depending on an external event, using the active edge of the clock. An efficient
RTL design that is functionally correct and achieves the desired performance is
the major goal. Consider the block level representation of the timer or counter
block shown in Fig. 15.5. The RTL description for the block is shown in
Example 15.1.
Fig. 15.5 Top level signal diagram for the timer and counter block (signals: Address_in, Write_data, Write_enable, Read_enable, Event_1, Event_2, Read_data, interrupt)
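A block with the signals of Fig. 15.5 could be sketched as follows. This is not
the book's Example 15.1, which is not reproduced here; it is a minimal sketch
under stated assumptions. The 8-bit width, the one-bit address map, the clk and
reset_n ports, and the roles given to event_1 (count) and event_2 (clear) are
all assumptions made for illustration.

```verilog
// Sketch of an event counter with a programmable limit.
// Port roles, widths and the address map are assumptions.
module timer_counter_sketch (
    input  wire       clk,
    input  wire       reset_n,
    input  wire       address_in,    // 0: limit register, 1: current count
    input  wire       write_enable,
    input  wire [7:0] write_data,
    input  wire       read_enable,
    input  wire       event_1,       // assumed role: count event
    input  wire       event_2,       // assumed role: clear the count
    output reg  [7:0] read_data,
    output reg        interrupt
);
    reg [7:0] limit;   // programmable terminal count
    reg [7:0] count;

    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) begin
            limit     <= 8'hFF;
            count     <= 8'd0;
            read_data <= 8'd0;
            interrupt <= 1'b0;
        end else begin
            interrupt <= 1'b0;                  // one-cycle pulse by default

            if (write_enable && address_in == 1'b0)
                limit <= write_data;

            if (event_2)
                count <= 8'd0;
            else if (event_1) begin
                if (count == limit) begin
                    // An event arrived while the count equals the limit:
                    // wrap to zero and raise the interrupt for one clock.
                    count     <= 8'd0;
                    interrupt <= 1'b1;
                end else
                    count <= count + 8'd1;
            end

            if (read_enable)
                read_data <= address_in ? count : limit;
        end
    end
endmodule
```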
Most ASIC SOC designs use general purpose bidirectional IOs. Multiple IOs are
required depending on the interface inputs and outputs. IOs are used to
communicate with the outside world. The generalized structure of a bidirectional
IO is shown in Fig. 15.6.
The partial Verilog RTL is described in Example 15.2.
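The core of a bidirectional IO is a tri-state driver on the pad together with a
read path from the same pad. The sketch below is not the book's Example 15.2;
it is a one-bit illustration under stated assumptions, where the module name,
the clk and reset_n ports, and the address map (0 for the data register, 1 for
the direction register) are hypothetical.

```verilog
// One-bit bidirectional IO sketch (names and address map are assumptions).
module bidir_io_sketch (
    input  wire clk,
    input  wire reset_n,
    input  wire write_enable,
    input  wire address_in,     // 0: data register, 1: direction register
    input  wire write_data,
    output wire read_data,      // current value on the pad
    inout  wire pad
);
    reg data_reg;    // value driven when the pad is configured as an output
    reg output_en;   // 1 = drive the pad, 0 = high impedance (input mode)

    always @(posedge clk or negedge reset_n)
        if (!reset_n) begin
            data_reg  <= 1'b0;
            output_en <= 1'b0;   // come out of reset as an input
        end else if (write_enable) begin
            if (address_in)
                output_en <= write_data;
            else
                data_reg  <= write_data;
        end

    // Tri-state driver: release the pad when it is used as an input.
    assign pad       = output_en ? data_reg : 1'bz;
    assign read_data = pad;
endmodule
```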
[Fig. 15.6 labels: Address_in, Write_enable, Write_data, Read_data, Address Decoder, enable, PAD]
[Fig. 15.7 labels: Universal Asynchronous Receiver and Transmitter; signals: Address_in, Write_enable, Write_data, Read_data, Serial_input, Serial_output, Interrupt]
These kinds of blocks can be used for serial data transfer. The basic protocol
uses an active low start bit, then 8 bits of serial data, and finally an active
high stop bit. The data rate can be adjusted by generating the baud clock using
a baud rate generator.
The UART consists of a transmitter that sends serial data on the
“serial_output” pin and a receiver that receives serial data on the
“serial_input” pin. The data rate is controlled by the baud rate control block.
The control logic block can be designed using multiple data buffers and FIFOs.
The block level architecture for the UART is shown in Figs. 15.7 and 15.8.
[Fig. 15.8 labels: Control logic, Transmitter, Receiver, Baud rate generator; signals: Write_data, Read_data, Address_in, Write_enable, Read_enable, Interrupt, CLK, Serial_output, Serial_input]
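The frame format described above (start bit, 8 data bits, stop bit) can be
illustrated with a transmitter sketch. This is a minimal sketch under stated
assumptions: the baud rate generator is folded into the transmitter as a simple
clock divider, the data is sent least significant bit first, and the module
name, the busy output, and the CLKS_PER_BIT parameter are hypothetical. The
receiver, the control logic, and the FIFOs are not shown.

```verilog
// UART transmitter sketch: start bit (0), 8 data bits LSB first, stop bit (1).
// Names and the CLKS_PER_BIT value are assumptions.
module uart_tx_sketch #(
    parameter CLKS_PER_BIT = 16      // clock cycles per serial bit
) (
    input  wire       clk,
    input  wire       reset_n,
    input  wire       write_enable,  // load a byte; ignored while busy
    input  wire [7:0] write_data,
    output wire       serial_output,
    output wire       busy
);
    reg [9:0]  shift_reg;   // {stop, data[7:0], start}
    reg [3:0]  bits_left;   // bits of the frame still to be sent
    reg [15:0] baud_cnt;    // divides clk down to the bit rate

    assign busy          = (bits_left != 4'd0);
    assign serial_output = busy ? shift_reg[0] : 1'b1;   // line idles high

    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) begin
            shift_reg <= 10'h3FF;
            bits_left <= 4'd0;
            baud_cnt  <= 16'd0;
        end else if (!busy) begin
            if (write_enable) begin
                shift_reg <= {1'b1, write_data, 1'b0};
                bits_left <= 4'd10;
                baud_cnt  <= 16'd0;
            end
        end else if (baud_cnt == CLKS_PER_BIT - 1) begin
            // One full bit period has elapsed: move to the next bit.
            baud_cnt  <= 16'd0;
            shift_reg <= {1'b1, shift_reg[9:1]};
            bits_left <= bits_left - 4'd1;
        end else
            baud_cnt <= baud_cnt + 16'd1;
    end
endmodule
```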
Bus arbiters are used to share the same resource among multiple masters or
clients. In practice, typical shared resources are memories, multipliers, and
buses. The arbiter decides which client is given service, and the priority can
be static or round robin. The arbiter is shown in Fig. 15.9 and the partial RTL
is described in Example 15.3.
[Fig. 15.9 labels: Bus arbitration logic; signals: clk, Request_0, Request_1, Request_2, Grant_0, Grant_1, Grant_2]
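A static priority arbiter with the three request and grant signals of Fig. 15.9
could be sketched as follows. This is not the book's Example 15.3; the priority
order (request_0 highest), the registered grants, and the reset_n port are
assumptions made for illustration. A round robin arbiter would additionally
remember the last granted client and rotate the priority.

```verilog
// Static (fixed) priority arbiter sketch: request_0 has the highest priority.
// The priority order and the reset port are assumptions.
module bus_arbiter_sketch (
    input  wire clk,
    input  wire reset_n,
    input  wire request_0, request_1, request_2,
    output reg  grant_0, grant_1, grant_2
);
    always @(posedge clk or negedge reset_n) begin
        if (!reset_n)
            {grant_0, grant_1, grant_2} <= 3'b000;
        else if (request_0)
            {grant_0, grant_1, grant_2} <= 3'b100;
        else if (request_1)
            {grant_0, grant_1, grant_2} <= 3'b010;
        else if (request_2)
            {grant_0, grant_1, grant_2} <= 3'b001;
        else
            {grant_0, grant_1, grant_2} <= 3'b000;   // no request, no grant
    end
endmodule
```

At most one grant is active at a time, which is the property that lets the
clients share a single resource safely.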
15.7 Summary
The list of synthesizable and non-synthesizable Verilog constructs is tabulated in the following
Table
Verilog construct | Used for | Synthesizable construct | Non-synthesizable construct
------------------|----------|-------------------------|----------------------------
module | The code inside the module and the endmodule consists of the declarations and functionality of the design | Yes | No
Instantiation | If the module is synthesizable then the instantiation is also synthesizable | Yes | No
initial | Used in the test benches | No | Yes
always | Procedural block with the reg type assignment on LHS side. The block is sensitive to the events | Yes | No
assign | Continuous assignment with wire data type for modeling the combinational logic | Yes | No
primitives | UDPs are non-synthesizable whereas other Verilog primitives are synthesizable | Yes | No
force and release | These are used in test benches and non-synthesizable | No | Yes
delays | Used in the test benches and synthesis tool ignores the delays | No | Yes
fork and join | Used during simulation | No | Yes
ports | Used to indicate the direction, input, output and inout. The input is used at the top module | Yes | No
parameter | Used to make the design more generic | Yes | No
time | Not supported for the synthesis | No | Yes
real | Not supported for synthesis | No | Yes
functions and task | Both are synthesizable, provided that the task does not have the timing constructs | Yes | No
loop | The for loop is synthesizable and used for the multiple iterations | Yes | No
Verilog operators | Used for arithmetic, bitwise, unary, logical, relational etc. and are synthesizable | Yes | No
Blocking and non-blocking assignments | Used to describe the combinational and sequential design functionality respectively | Yes | No
if-else, case, casex, casez | These are used to describe the design functionality depending on the priority and parallel hardware requirements | Yes | No
Compiler directives (`ifdef, `undef, `define) | Used during synthesis | Yes | No
Bits and part select | It is synthesizable and used for the bit or part select | Yes | No
Appendix II
Xilinx Spartan Devices
Design for testability (DFT) and its necessity are discussed here in summarized
form. In practical ASIC design, DFT is used to find various kinds of faults in
the design. For FPGA designs this step is excluded. DFT is needed for early
detection of faults in the design using scan chain insertion. The functional
abstraction of a defect is called a fault, and the abstraction of a fault is a
system level error. Physical testing is carried out after manufacturing of the
chip to understand fabrication-related issues or faults.
The defects in the design can be physical or electrical. Physical defects are
due to defective silicon or defective oxide. Electrical defects are shorts,
opens, transistor defects, and changes in the threshold voltage.
A few of the faults in the design are the following:
1. Stuck at faults: Stuck at one or Stuck at zero
2. Memory faults or pattern-sensitive faults
3. Bridging faults
4. Cross point faults
5. Delay faults
Testing process is the process of test pattern generation, test pattern application
and output evaluation.
Generally, the test flow includes the following:
1. Identify the target faults
2. Test generation
3. Fault Simulation
4. Testability
5. DFT
1. RTL design
2. Simulation
3. Synthesis
4. Insert scan chain
5. Layout
If every data input of the registers can be forced to a known value during the
test, then the design is controllable.
Observability indicates the ability to observe a node at a primary output. The
design needs to be controllable and observable.
As shown in the following design, the input of comb_logic1 is controllable and
the output of comb_logic3 is observable, but comb_logic1 and comb_logic2 are not
observable. So for detection of faults, it is essential to make comb_logic1,
comb_logic2, and comb_logic3 controllable as well as observable.
• The basic DFT techniques are ad hoc DFT and structured DFT. Structured
DFT includes scan-based DFT, which is further classified as MUX-based
DFT and level-sensitive, element-based DFT. Other structured DFT
techniques are MBIST (memory built-in self-test) and LBIST (logic built-in
self-test). JTAG is used for boundary scan.
The basic MUX-based technique is described below.
• MUX-based scan cell
The MUX-based scan cell has the additional inputs Test_data and Scan_en.
The MUX is inserted at the input of the D flip-flop. During testing,
Scan_en=1 and the D input is Test_data. During normal operation,
Scan_en=0 and Data_in passes through the combinational logic to the D input.
Thus, the cell works in both the test and normal modes. The clk can
be scan_clk during the test mode.
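The MUX-based scan cell described above can be sketched in Verilog as follows. The port names Data_in, Test_data, Scan_en and clk follow the text. The module names scan_dff and scan_chain2, the output q and the ports scan_in and scan_out are assumptions made for this sketch, as is the two-cell chain that shows how such cells could be connected serially.

```verilog
// Minimal sketch of a MUX-based scan cell (module name is hypothetical).
module scan_dff (
    input  wire clk,        // functional clock, or scan_clk in test mode
    input  wire Scan_en,    // 1: test mode, 0: normal mode
    input  wire Data_in,    // functional data from the combinational logic
    input  wire Test_data,  // test data
    output reg  q
);
    // 2:1 MUX in front of the D input of the flip-flop
    wire d = Scan_en ? Test_data : Data_in;

    always @(posedge clk)
        q <= d;
endmodule

// Hypothetical two-cell scan chain: in test mode the cells form a
// shift register from scan_in to scan_out.
module scan_chain2 (
    input  wire       clk,
    input  wire       Scan_en,
    input  wire [1:0] Data_in,
    input  wire       scan_in,
    output wire [1:0] q,
    output wire       scan_out
);
    scan_dff u_cell0 (.clk(clk), .Scan_en(Scan_en), .Data_in(Data_in[0]),
                      .Test_data(scan_in), .q(q[0]));
    scan_dff u_cell1 (.clk(clk), .Scan_en(Scan_en), .Data_in(Data_in[1]),
                      .Test_data(q[0]),    .q(q[1]));

    assign scan_out = q[1];
endmodule
```

With Scan_en=1, each clock edge shifts one bit along the chain, so every register can be loaded from scan_in and read out at scan_out. With Scan_en=0, the flip-flops capture their functional Data_in values as in normal operation.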
Appendix III: Design For Testability 407
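[Figure: design in which comb_logic1, comb_logic2 and comb_logic3 are separated by D flip-flops clocked by clk]
[Figure: MUX-based scan cell with inputs Data_in, Test_data, Scan_en and clk]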
The following guidelines need to be followed to make the design DFT friendly:
1. Generated clocks in the design: There should not be generated clocks in the
design, as they are not controllable.
2. Combinational feedback loop: There should not be any combinational loop in
the design, as it creates issues in the timing analysis. Hence it is essential to
break the combinational loop.
3. Gated clocks: Gated clocks need to be avoided, as they are not controllable.
4. Asynchronous control signals: There should not be any internally generated
asynchronous control signals.
5. Do not mix positive and negative edge-triggered flip-flops.
6. Avoid the use of latches in the design.
7. If shift registers are used, do not replace them with scan-enabled
flip-flops; only ensure the enable control.
8. Do not use the clock input as data.
9. Bypass the memories during DFT.
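Where a generated clock cannot be avoided, one commonly used workaround is to bypass it with the primary clock during test. This technique is not described in the text above, so the following is only a sketch under assumptions: the module name clk_div_bypass, the test_mode input and the divide-by-two clock are all hypothetical.

```verilog
// Sketch of a test-mode bypass for a generated clock. All names are
// hypothetical. test_mode is assumed to be a static primary input.
module clk_div_bypass (
    input  wire clk,        // primary clock, controllable from a pin
    input  wire rst_n,      // primary asynchronous reset, active low
    input  wire test_mode,  // 1: use the primary clock, 0: divided clock
    output wire clk_out
);
    reg clk_div2;

    // Internally generated clock: divide the primary clock by two
    always @(posedge clk or negedge rst_n)
        if (!rst_n) clk_div2 <= 1'b0;
        else        clk_div2 <= ~clk_div2;

    // In test mode the downstream flip-flops are clocked directly by
    // the primary clock, so their clock is controllable again.
    assign clk_out = test_mode ? clk : clk_div2;
endmodule
```

In a real design the final selection would normally use a dedicated clock multiplexer cell from the library rather than a plain assign statement, and test_mode would be held constant for the whole test.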