CLOCK TREE SYNTHESIS
PHYSICAL DESIGN
Y.V.L.Tanuja
Definition
• It is the process of connecting the clock pins of sequential cells to the clock
net such that clock skew is minimized.
• The process of distributing the clock and balancing the load is called CTS.
• CTS inserts buffers or inverters along the clock paths of an ASIC design in
order to achieve zero/minimum skew or balanced skew.
• Before CTS, all clock pins are driven by a single clock source. The CTS starting
point is the clock source and the CTS ending points are the clock pins of
sequential cells.
• CTS QoR decides timing convergence & power. In most ICs the clock network
consumes 30-40% of total power, so an efficient clock architecture, clock gating &
clock tree implementation help to reduce power.
Inputs of CTS:
Technology file (.tf)
Netlist
SDC
Library files (.lib & .lef) & TLU+ file
Placement DEF file
Clock specification file which contains Insertion delay, skew, clock
transition, clock cells, NDR, CTS tree type, CTS exceptions, list of
buffers/inverters etc...
Checklist before CTS:
• Before going to CTS, the design should meet the following requirements:
• The clock sources are identified with the create_clock or
create_generated_clock commands.
• The placement of standard cells and optimization is done.
• NOTE: use the check_legality -verbose command to verify that the placement
is legalized. If cells are not legalized, the QoR suffers and the CTS stage
might have a long run time.
• Power and ground nets - pre-routed
• Congestion - acceptable
• Timing - acceptable
• Estimated max tran/cap - no violations
• High fan-out nets such as scan enable and reset are synthesized with buffers.
CTS Targets:
• Skew
• Insertion delay
CTS Goals/Constraints:
• Max transition
• Max capacitance
• Max fanout
CTS Flow:
1) Read CTS SDC
2) Compile CTS using CTS spec file
3) Place clock tree cells
4) Route clock tree
About Skew:
• The difference in the clock latencies of two flops
belonging to the same clock domain.
- If the capture clock latency is more than the launch
clock latency, it is positive skew. This helps to meet setup.
- If the capture clock latency is less than the launch
clock latency, it is negative skew. This helps to meet hold.
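The effect of skew sign on setup and hold can be sketched with the usual single-cycle slack expressions. This is a minimal illustration, not a timing engine: the function names and all the picosecond values are assumptions chosen for the example.

```python
# Minimal sketch: how skew sign shifts setup and hold slack (values in ps).
# All names and numbers here are illustrative assumptions.

def skew(launch_latency, capture_latency):
    """Capture clock latency minus launch clock latency."""
    return capture_latency - launch_latency

def setup_slack(period, skew_value, clk_to_q, max_logic_delay, setup_time):
    """Data must arrive before the next capture edge (shifted by skew)."""
    return (period + skew_value) - (clk_to_q + max_logic_delay + setup_time)

def hold_slack(skew_value, clk_to_q, min_logic_delay, hold_time):
    """Data must stay stable after the same capture edge (shifted by skew)."""
    return (clk_to_q + min_logic_delay) - (hold_time + skew_value)

positive = skew(launch_latency=300, capture_latency=350)   # +50 ps
negative = skew(launch_latency=350, capture_latency=300)   # -50 ps

# Positive skew adds margin to setup and removes it from hold.
print(setup_slack(1200, positive, 100, 1000, 80), hold_slack(positive, 100, 30, 40))
# Negative skew does the opposite.
print(setup_slack(1200, negative, 100, 1000, 80), hold_slack(negative, 100, 30, 40))
```

With these assumed numbers, positive skew turns a 20 ps zero-skew setup margin into 70 ps, while negative skew raises the hold margin from 90 ps to 140 ps.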
• Types of skew:
Local Skew:
- The difference in the clock latencies of two logically
connected flops of the same clock domain.
Global Skew:
- The difference between the highest and the lowest
clock latency among the flops of the same clock domain.
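The two definitions can be compared on a small example. This is a sketch under assumptions: the flop names, latencies, and connectivity list are hypothetical, and local skew is reported here as the worst value over all logically connected pairs.

```python
# Minimal sketch: local vs. global skew from per-flop clock latencies (ps).
# Flop names, latencies, and connectivity are hypothetical.

def global_skew(latencies):
    """Highest latency minus lowest latency in the clock domain."""
    values = list(latencies.values())
    return max(values) - min(values)

def local_skew(latencies, connected_pairs):
    """Worst latency difference over logically connected (launch, capture) pairs."""
    return max(abs(latencies[cap] - latencies[launch])
               for launch, cap in connected_pairs)

latencies_ps = {"ff1": 300, "ff2": 320, "ff3": 290, "ff4": 350}
pairs = [("ff1", "ff2"), ("ff2", "ff3")]  # ff4 has no data path to the others

print(global_skew(latencies_ps))        # 60
print(local_skew(latencies_ps, pairs))  # 30
```

The global skew is set by ff3 and ff4, which never exchange data, so the local skew that actually affects timing is smaller.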
[Figures: local skew, global skew, and useful skew]
About clock latency/insertion delay
• The time taken by the clock to reach the sink
point from the clock source is called latency.
It is divided into two parts:
– Clock Source Latency &
– Clock Network Latency.
• Clock Source Latency:
- The delay from the clock waveform
origin point to the clock definition point.
• Clock Network Latency:
- The delay from the clock definition point to
the destination/sink point.
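As a small worked example of the split above, total latency at each sink is the sum of the two parts. The sink names and picosecond values are assumptions made for illustration.

```python
# Minimal sketch: total clock latency = source latency + network latency (ps).
# Sink names and values are hypothetical.

source_latency_ps = 200                        # waveform origin -> definition point
network_latency_ps = {"ff1": 300, "ff2": 320}  # definition point -> each sink

total_latency_ps = {sink: source_latency_ps + net
                    for sink, net in network_latency_ps.items()}

print(total_latency_ps)  # {'ff1': 500, 'ff2': 520}

# The source latency is common to both sinks, so it cancels out of the skew.
skew_ps = total_latency_ps["ff2"] - total_latency_ps["ff1"]
print(skew_ps)  # 20
```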
CTS Goals/Constraints:
Clock period : 1.2 ns (833 MHz)
Max Transition Targets:
- Data Transition : 10% of clock period
- Clock Transition : 20% of clock period
• Max transition (clock or data) is the maximum slew that is allowed at the
cell input pin.
• This comes either from the library, or it can come from a manually
constrained file from the designer.
How to fix max trans violation?
There are several ways to fix transition time violations:
1) Increase the driver size.
2) Break the nets in the case of long nets.
3) Break the large fanout by duplicating drivers or with buffering.
4) Change the VT if the option is available (changing drivers from hvt to svt or lvt).
5) Reduce the load by downsizing the cells (special cases) after looking at the
timing impact on the design.
6) Change the load to hvt because hvt has a higher lib limit.
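As a quick arithmetic check of the targets on this slide, the following sketch converts the 1.2 ns period to a frequency and applies the 10% and 20% fractions. The function names are assumptions; integer picoseconds are used to keep the arithmetic exact.

```python
# Minimal sketch: derive frequency and transition targets from the clock period.
# Function names are hypothetical; percentages are the ones quoted on the slide.

def frequency_mhz(period_ps):
    """Clock frequency in MHz for a period given in picoseconds."""
    return 1_000_000 / period_ps

def transition_targets_ps(period_ps, data_percent=10, clock_percent=20):
    """Max transition targets as a percentage of the clock period."""
    return {
        "data": period_ps * data_percent // 100,
        "clock": period_ps * clock_percent // 100,
    }

period_ps = 1200  # 1.2 ns
print(round(frequency_mhz(period_ps), 1))   # about 833.3 MHz
print(transition_targets_ps(period_ps))     # {'data': 120, 'clock': 240}
```

So for a 1.2 ns clock the data transition target is 0.12 ns and the clock transition target is 0.24 ns.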
Max capacitance :
• The capacitance at a node is the sum of the input pin capacitances of the
fanout cells driven by the output pin and the capacitance of the net.
- Max capacitance is generally used to meet the power and timing constraints
on the design.
- It means that the output pin of any cell in the design should not see a
capacitance higher than this limit,
- so that the delay of the cell for that particular timing arc stays within a
particular range.
- The value chosen is completely design dependent.
- Generally start with 50-60% of the library characterized value and adjust it
based on power and timing constraints.
- This violation can be removed by increasing the drive strength of the cell.
Max Fanout:
- The max fanout of an output measures its load driving capability: it is the
greatest number of gate inputs to which the output can be safely connected.
- Fanout load is a dimensionless number.
- This information is present in the .lib file.
- Max fanout is available for output pins only.
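The two checks can be sketched as follows. This is an illustration under assumptions: the function names and the femtofarad values are hypothetical, the 60% derate is one point in the 50-60% range mentioned above, and the fanout check assumes the common convention of summing the fanout load of each driven input pin against the driver's max fanout.

```python
# Minimal sketch: max capacitance and max fanout checks on one driver output.
# Names and values are hypothetical; capacitances are in fF.

def node_capacitance(net_cap, sink_pin_caps):
    """Load seen by the driver: net capacitance plus all driven input pin caps."""
    return net_cap + sum(sink_pin_caps)

def max_cap_target(lib_max_cap, derate_percent=60):
    """Design limit taken as a percentage of the library characterized value."""
    return lib_max_cap * derate_percent / 100

def violates_max_cap(load, target):
    return load > target

def violates_max_fanout(sink_fanout_loads, driver_max_fanout):
    """Sum of dimensionless fanout loads compared with the driver's limit."""
    return sum(sink_fanout_loads) > driver_max_fanout

target = max_cap_target(lib_max_cap=100)                 # 60% of 100 fF
light = node_capacitance(net_cap=20, sink_pin_caps=[12, 12, 12])
heavy = node_capacitance(net_cap=20, sink_pin_caps=[12, 12, 12, 12])

print(violates_max_cap(light, target))   # False (56 fF load)
print(violates_max_cap(heavy, target))   # True  (68 fF load)
print(violates_max_fanout([1, 1, 1, 1], driver_max_fanout=3))  # True
```

Upsizing the driver raises the library limit and therefore the target, which is why increasing drive strength removes the violation.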
Clock Tree Architectures
• Depending on the application, the clock frequency
and the available resources in terms of area and
routing there are three broad clock tree architectures:
Single Point Clock Tree Synthesis – This is the
simplest clock tree architecture that offers lowest clock
switching power but local clock skew can be fairly
large.
• Single Point CTS is most suitable for low frequency
applications, or designs with multiple clock domains.
• Most SoC applications use single point CTS.
The clock divergence point begins at the clock
source itself, and therefore the
OCV (on-chip variation) penalty for single point
CTS is the highest of all clock tree architectures.
Clock Mesh – Clock Mesh lies at the opposite
end of the spectrum. It offers very tight
clock balancing, resulting in small clock skews,
which makes it the architecture of choice
for high-frequency GHz applications,
particularly with a single clock domain.
• CPU and GPU applications tend to use clock
mesh. The biggest disadvantage of clock
mesh architecture is that depending on the
density of the clock mesh, it can take up
plenty of routing resources.
• Clock mesh cannot be gated and it tends to
be highly capacitive and therefore is power
hungry. The common clock path extends up
to the mesh, and therefore it incurs minimum
OCV penalty.
Multi-Source Clock Tree Synthesis (MSCTS) – MSCTS
is a hybrid approach that tends to offer better
clock skew than single point CTS while
dissipating less power than a
clock mesh design.
• As the name suggests, it splits the design into
multiple partitions, and has one clock TAP point for
each partition.
• The clock from the clock port to these TAP points is
routed with the help of an H-Tree.
• The multiple TAP points subsequently act as clock
sources for all the sink pins within their respective
partitions. The global clock tree part, as shown in
Figure 5 can be a coarse mesh or an H-tree structure.
• The common clock path for an MS-CTS design is
therefore more than that of a single point CTS, and
less than that of a clock mesh.
Clock Tree Reference:
• By default, each clock tree reference list contains all the clock buffers
and clock inverters in the logic library.
• The clock tree reference list is used for:
- Clock tree synthesis
- Boundary cell insertion
- Sizing
- Delay insertion : If a large delay is needed, instead of adding many buffers
we can add a single delay cell with the required delay value.
The advantage is reduced area and power, but delay cells have high variation,
so their use in the clock tree is not recommended.
Boundary cell insertions:
• When we are working on a block-level design, we
might want to preserve the boundary conditions of the
block’s clock ports (the boundary clock pin).
• A boundary cell is a fixed buffer that is inserted
immediately after the boundary clock pins to preserve
the boundary conditions of the clock pin.
• When boundary cell insertion is enabled, a buffer from
the clock tree reference list is inserted immediately
after each boundary clock pin. For multi-voltage
designs, buffers are inserted at the boundary in the
default voltage area.
• The boundary cells are fixed for clock tree synthesis
after insertion; they cannot be moved or sized. In addition,
no cells are inserted between a clock pin and its
boundary cell.
Clock Tree Exceptions:
• Non-Stop pin:
- Non-stop pins are pins that the tool traces through
even though they would normally be considered
endpoints of the clock tree.
Example:
- The clock pins of sequential cells driving
generated clocks are implicit non-stop pins.
- Clock pins of ICG cells.
• Float pin: Float pins are clock pins that have
special insertion delay requirements, and
balancing is done according to that delay [macro
modeling].
• This is the same as a sync pin, but the internal clock
latency of the pin is taken into consideration
while building the clock tree.
• It is used to adjust the clock arrival for specific endpoints
with respect to all other endpoints.
Example - Clock entry pin of hard macros
• Exclude pin:
Exclude pins are clock tree endpoints that are excluded
from clock tree timing calculation and optimization. The
tool considers exclude pins only in calculations and
optimizations for design rule constraints.
During CTS, the tool isolates exclude pins from the
clock tree by inserting a guide buffer before the pin;
these pins need not be considered during clock
tree propagation.
Example - Non-clock input pin of a sequential cell
Beyond the exclude pin the tool never performs skew or
insertion delay optimization, but it does perform design rule
fixing.
• Stop pin: Stop pins are the endpoints of the clock
tree that are used for delay balancing. In CTS,
the tool uses stop pins in calculation &
optimization for both DRC and clock tree
timing.
Example - Clock sinks are implicit stop pins
• The optimization is done only up to the stop
pin, as shown in the figure. The clock signal
should not propagate after reaching the
stop/sync pin. This pin needs to be considered for
building the clock tree.
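The way these exceptions change what the tool balances can be summarized in a short sketch. This is a simplified model of the descriptions above, not any tool's actual behavior: the `PinType` enum, the function names, and the picosecond values are all assumptions.

```python
# Minimal sketch: how pin exceptions affect balancing (simplified model).
# PinType and both functions are hypothetical names; latencies are in ps.
from enum import Enum

class PinType(Enum):
    STOP = "stop"          # normal endpoint, balanced
    FLOAT = "float"        # endpoint with internal latency to account for
    EXCLUDE = "exclude"    # endpoint ignored for skew/insertion delay
    NON_STOP = "non_stop"  # not an endpoint, tracing continues through it

def balancing_target_ps(pin_type, target_latency_ps, internal_latency_ps=0):
    """Latency the tree should reach at the pin, or None if it is not balanced."""
    if pin_type is PinType.STOP:
        return target_latency_ps
    if pin_type is PinType.FLOAT:
        # The macro's internal clock latency is spent after the pin,
        # so the tree up to the pin must be shorter by that amount.
        return target_latency_ps - internal_latency_ps
    return None  # exclude pins are not balanced; non-stop pins are traced through

def traced_through(pin_type):
    return pin_type is PinType.NON_STOP

print(balancing_target_ps(PinType.STOP, 500))        # 500
print(balancing_target_ps(PinType.FLOAT, 500, 150))  # 350
print(balancing_target_ps(PinType.EXCLUDE, 500))     # None
```

For example, a hard macro with an assumed 150 ps of internal clock latency gets a 350 ps tree so that its internal flops see the clock at the same time as ordinary sinks at 500 ps.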
Don't Touch Sub-tree:
• If we want to preserve a portion of an existing clock
tree, we put a don’t touch exception on the sub-tree.
- CLK1 is the pre-existing clock and path 1 is
optimized with respect to CLK1.
- CLK2 is the newly generated clock. The don’t touch
sub-tree attribute is set w.r.t. C1.
Example:
- If path1 is 300ps and path2 is 200ps, delay is added
on path2 during balancing.
- If path1 is 200ps and path2 is 300ps, delay cannot be
added on path1 during balancing because the don’t touch
attribute is set on path1, so we get a violation.
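The two cases in the example can be worked through with a small sketch. The function name is an assumption; it models only the rule stated above, that delay may be added on path2 but never on the don't touch path1.

```python
# Minimal sketch: balancing two paths when path1 is a don't touch sub-tree.
# The function name is hypothetical; delays are in ps.

def balance_with_dont_touch(path1_ps, path2_ps):
    """Return (delay added on path2, remaining skew that cannot be fixed)."""
    if path1_ps >= path2_ps:
        # path2 is the shorter path, and it may be padded freely.
        return path1_ps - path2_ps, 0
    # path1 is shorter but frozen, so the difference remains as a violation.
    return 0, path2_ps - path1_ps

print(balance_with_dont_touch(300, 200))  # (100, 0)
print(balance_with_dont_touch(200, 300))  # (0, 100)
```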
• Don't Buffer Net: Used to improve results by preventing the tool
from buffering certain nets. Don’t buffer nets have higher priority than DRC fixing.
- CTS does not add buffers on such nets. Example - If the path is a false path, there
is no need to balance it, so set the don’t buffer net attribute.
• Don't Size Cell: To prevent sizing of cells on the clock path during CTS and
optimization, we must identify the cells as don’t size cells.
• Specifying Size-Only Cells: During CTS & optimization, size-only cells can only
be sized, not moved or split.
- After sizing, if a size-only cell overlaps with an adjacent cell, it might be
moved during the legalization step.
CTS Algorithms
• RC Tree based CTS
• H- tree based algorithm
• X- tree based algorithm
• Method of means and medians (MMM)
• Geometric matching algorithm
• Pi configuration
RC Tree based CTS
H- tree based algorithm
• Perfect synchronization between the clock signals is
achieved by an ‘H’-shaped routing structure
before the clock arrives at the
sub-blocks or synchronous
elements. With an ideal H-tree,
zero skew can be achieved.
• Consider ‘a’ to ‘p’ as the clock pins of sequential elements; the four boxed
modules are sub-modules within the top module. All these sequential
elements need to get the clock at the same time.
• To achieve this, an H-tree is built within the top module and within each sub-module.
• All the clock pins are then exactly 9 units from the clock definition point.
• The marked points are called tap points; when the signal reaches a tap
point, it splits into two different directions.
• This is how the clock takes exactly 9 time units to reach all the sequential
elements with zero skew.
Advantages:
- Balanced latencies & Low skew
Disadvantages:
- Requires big driver, thus lots of power
- Requires more routing resources
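The equal-distance property of an H-tree can be checked with a short recursive sketch. This is an idealized geometric model: the function name, the starting half-length of 4 units, and the two levels are assumptions, and it does not reproduce the exact 9-unit geometry of the figure.

```python
# Minimal sketch: an idealized two-level H-tree reaching 16 sinks.
# Each level runs `half` units horizontally, then `half` units vertically,
# and the segment length halves at the next level. Geometry is hypothetical.

def h_tree_sinks(x, y, half, levels, dist=0.0):
    """Return a list of ((x, y), wire length from the root) for every sink."""
    if levels == 0:
        return [((x, y), dist)]
    sinks = []
    for dx in (-half, half):          # horizontal bar of the H
        for dy in (-half, half):      # vertical bars at each end
            sinks.extend(
                h_tree_sinks(x + dx, y + dy, half / 2, levels - 1,
                             dist + abs(dx) + abs(dy)))
    return sinks

sinks = h_tree_sinks(0.0, 0.0, half=4.0, levels=2)
distances = {d for _, d in sinks}

print(len(sinks))   # 16
print(distances)    # {12.0}
```

Every sink is the same wire length from the root, which is what gives the H-tree its nominally zero skew when loads and wire widths are also matched.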
X- tree based algorithm
• An X-tree is similar to an H-tree; the only difference is that the connections
are not rectilinear. The module design used for the H-tree is taken, and the
X-tree is implemented on it to show the difficulties.
• Advantages:
- Balanced latencies & Low skew
• Disadvantages:
- Crosstalk
Geometric matching algorithm
• The H, X and MMM algorithms were explained with 16-point
structures; for the Geometric Matching Algorithm
let us consider an 8-point structure.
• The physical locations of the sub-modules are not
symmetric, so developing an H-tree among these
sub-modules is practically not possible. First, sub-modules
are grouped in pairs, and the resulting two-point
trees are named X-1, X-2, X-3 and X-4.
• The optimal entry point may not be equidistant from the entry points of
X-1 and X-2; buffer insertion can balance the delay caused by the
unequal net lengths.
• Then pairs of two-point trees are joined together to form H-like
structures. The resultant H-trees are named X-12 and X-34.
• The tap points of the two H structures cannot be connected using
rectilinear nets.
• In order to connect the two trees, the geometrical position of one H-tree
is changed to be compatible with the other tree’s tap point.
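The bottom-up pairing described above can be sketched as follows. This is a simplified stand-in, not the full algorithm: it pairs points greedily by nearest Manhattan distance instead of solving an optimal matching, it places each tap at the geometric midpoint without any buffer-based delay balancing, and the sink coordinates are made up.

```python
# Minimal sketch: bottom-up pairing of 8 asymmetric sinks into a clock tree.
# Greedy nearest-neighbour pairing is used as a simplification of geometric
# matching; taps are plain midpoints. All coordinates are hypothetical.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def merge_level(points):
    """Pair each point with its nearest remaining neighbour; return tap points."""
    remaining = list(points)
    taps = []
    while len(remaining) > 1:
        p = remaining.pop(0)
        q = min(remaining, key=lambda r: manhattan(p, r))
        remaining.remove(q)
        taps.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2))
    taps.extend(remaining)  # an unpaired point moves up unchanged
    return taps

def build_tree(sinks):
    """Return the points at every level, from the sinks up to the single root."""
    levels = [list(sinks)]
    while len(levels[-1]) > 1:
        levels.append(merge_level(levels[-1]))
    return levels

sinks = [(0, 0), (1, 3), (5, 1), (6, 4), (2, 9), (3, 12), (9, 8), (10, 11)]
levels = build_tree(sinks)
print([len(level) for level in levels])  # [8, 4, 2, 1]
```

The 8 sinks collapse to 4 two-point trees, then 2 merged trees, then one root, mirroring the X-1 to X-4 and X-12, X-34 grouping in the text.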
Pi configuration
• In the pi configuration, the total number of buffers
inserted along the clock path at each level is a multiple
of the number at the previous level.
• This type of structure uses the same number of
buffers and the same wire geometry on every branch, and
relies on matching the delay components at each level of the clock tree.
• A pi-structured clock tree is considered to be
balanced.
• Π and H-tree are considered the most efficient clock routing
algorithms because they avoid crosstalk and give
minimum skew.
CTS Optimization process:
• Buffer sizing
• Gate sizing
• Buffer relocation
• Level adjustment
• HFN synthesis
• Delay insertion
• Fix max transition
• Fix max capacitance
• Reduce disturbances to other cells as much as possible.
• Perform logical and placement optimization to fix all possible timing violations.
CTS Optimization Techniques:
1. Buffer/Gate Sizing: Sizes up or down
buffers and gates to improve both skew
and insertion delay.
2. Buffer/Gate Relocation: Physical
location of the buffer or gate is moved to
reduce skew and insertion delay.
3. Delay Insertion: Delay is inserted for
shortest paths.
4. Dummy Load Insertion: Uses load
balancing to fine tune the clock skew by
increasing the shortest path delay.
CTS quality checks
• There are following quality checks for CTS:
- Minimum insertion delay
- Skew balancing
- Duty cycle
- Pulse width
- Clock tree power consumption
- Signal integrity & cross-talk issue
Checks after CTS:
• In the latency report, check whether skew is minimal and insertion delay is
balanced.
• In the QoR report, check whether timing (especially hold) is met; if not, find out why.
• In the utilization report, check whether standard cell utilization is acceptable.
• Check global route congestion.
• Check placement legality of cells.
• Check whether the timing violations are caused by missing constraints,
such as undefined false paths, asynchronous paths, half-cycle paths, or
multi-cycle paths in the design.
Outputs of CTS:
- Timing report
- Congestion report
- Skew report
- Insertion delay report
- CTS DEF file