Overcoming Power Compiler limitations to optimize clock gating
Sylvain Haas
ABSTRACT
Synopsys Power Compiler is a simple tool that helps designers achieve a very low power design
by replacing the feedback loops of multi-bit registers with a single clock-gating cell. After a quick
review of Power Compiler features and advantages, we will focus on its limitations: a single
level of clock gating only, no hierarchical understanding of the design, no ability to use one
clock gating cell for several registers with slightly different enabling conditions, etc. Once those
limitations and their impact on the quality of results have been illustrated with a few design
examples, we will determine solutions to get the best results from Power Compiler.
2.1 Principle
During design elaboration, automatic clock gating insertion can be invoked with the
-gate_clock option. When that option is used, a single clock-gating cell replaces the
multiplexers and feedback loops of multi-bit registers with a synchronous load-enable.
The minimum bit-size of a register to enable that replacement is a user-defined parameter.
A value of 6 to 8 is usually considered a good choice, since there is a tradeoff between the area
cost of the clock gating cell and the area gained by removing the feedback loops. That value was
mentioned during the presentation of [1]; however, it depends on the design and library
characteristics, as we will see in a later chapter. Moreover, power is not necessarily reduced
when a cell gates too few registers, because of the intrinsic power consumption of the cell itself.
The replacement principle is illustrated in the schematics below:
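The replacement can also be sketched behaviorally. The Python model below is only an illustration under simplifying assumptions (one value per cycle, ideal glitch-free gating); the function names and the stimulus are hypothetical and are not part of Power Compiler. It shows that both structures produce the same register contents, while the gated register receives a clock edge only when the load-enable is high.

```python
def mux_feedback_register(d_values, enables, init=0):
    """Register clocked every cycle; a multiplexer recirculates Q when enable is low."""
    q, trace, clock_edges = init, [], 0
    for d, en in zip(d_values, enables):
        clock_edges += 1      # the flip-flops see every clock edge
        q = d if en else q    # feedback multiplexer selects D or the current Q
        trace.append(q)
    return trace, clock_edges


def clock_gated_register(d_values, enables, init=0):
    """Register without multiplexer; the clock edge is delivered only when enable is high."""
    q, trace, clock_edges = init, [], 0
    for d, en in zip(d_values, enables):
        if en:                # the gating cell lets the clock pulse through
            clock_edges += 1
            q = d
        trace.append(q)
    return trace, clock_edges


# Hypothetical stimulus: six cycles, the register is loaded on three of them.
d = [5, 7, 7, 9, 1, 4]
en = [1, 0, 0, 1, 0, 1]
mux_trace, mux_edges = mux_feedback_register(d, en)
gated_trace, gated_edges = clock_gated_register(d, en)
print(mux_trace == gated_trace, mux_edges, gated_edges)
```

With this stimulus the two traces are identical, and the gated register sees three clock edges instead of six.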
2.2 Benefits
The major expected advantage of the method is the reduction of power consumption since multi-
bit registers that have been automatically gated by the tool will only receive a clock when they
really need to change their contents.
However, this is not the only possible advantage: replacing 32 multiplexers by a single clock
gating cell, even a large one, tends to reduce the design area. This is due both to the cumulative
area of the multiplexer cells, which is larger than the clock gating cell area, and to the reduction
of routing congestion.
The multiplexer replacement also has a timing impact: it can remove one level of logic on the
datapath, which may be considered an improvement. However, it may also tighten the timing
constraint on the multiplexer selection logic (the load-enable signal), because the enable has
to meet the setup constraint of a latch whose clock insertion delay is sometimes much lower than
the clock insertion delay of the flip-flops that generate that enable. This point is discussed
more thoroughly in a later chapter.
Finally, ease of use and setup is one of the big advantages of Power Compiler. It is very
simple to invoke from Design Compiler or Physical Compiler, it saves days of RTL coding
compared with manual insertion of the clock gating cells, and it helps keep the design
library-independent.
Flow: analyze, set up the clock gating constraints, elaborate. The clock gating constraints are set as follows:
set_clock_gating_style \
    -sequential_cell "latch" \
    -positive_edge_logic "integrated" \
    -negative_edge_logic "integrated" \
    -control_signal "scan_mode" \
    -control_point "before" \
    -minimum_bitwidth 6
[Figure: gate activity figures annotated on the schematics of the manual clock gating and Power Compiler results]
The activity figures show one high-activity gate for the manual clock gating, whereas there are
four high-activity gates in the Power Compiler result. This is due to the merging of the
global enable with the register load enables. Even after power optimization, Power Compiler
cannot further reduce the number of high-activity gates. In that case, Power Compiler
optimization is not able to achieve a result as good as our original manual clock gating.
So, with implicit decoding, the whole array is seen as being updated: Power Compiler is
currently unable to build a separate enable condition based on the value of the selector; it
is however possible to help it by explicitly using the selector in the enabling condition.
Since they receive the same enable, Power Compiler understands them as a single register
that is wider than the bitwidth limit, thus enabling its automatic gating.
• The same trick can be used to have the exact desired enable for the automatic clock
gating cell; that can be used to share a global enable that proves more power efficient,
etc.
• Manual clock gating must be instantiated in a different sub-module than the registers that
use the gated clock (e.g. at the top of the module hierarchy or in a separate sub-module
that could include all the manual clock gating cells).
Although design B is the typical case where Power Compiler is more efficient than
manual clock gating or combined clock gating when the global enable activity is reasonably low,
that test case shows that completely different results are possible when the global enable
activity becomes much higher than the activity of the remaining input signals. The
major lesson from that example is the tremendous impact of the stimuli on the strategy to
use; using Power Compiler to generate a second level of refined clock gating seems to achieve
good results. It is also important to remember that the clock tree consumption is not included in
the data, and it should further favor the solution with two levels of clock gating over the ‘Power
Compiler only’ and ‘manual only’ solutions.
This is exactly our expectation: with more clock gating cells instantiated by Power Compiler, there
is more logic that toggles because of the high-activity global enable. It is clear that instantiating
several clock gating cells with a shared global enable is not always as power efficient as the
original single clock gating cell that received that global enable.
That example also shows that the benefit of using both manual and Power Compiler clock gating
increases with the number of automatic clock gating cells. The additional level of clock gating
from Power Compiler activates the registers only when needed. Without that additional level of
clock gating, all the registers receive their clock, which makes them consume unnecessary
power.
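The comparison between the three strategies can be made concrete with a toy count of clock pulses. The sketch below is a strong simplification and not the measurement method used for the results in this paper: it only counts clock pulses at register bank clock pins and at gating cell clock pins, ignores the enable logic itself, and uses hypothetical enable patterns and a hypothetical function name.

```python
def clock_pulses(global_en, local_en):
    """Count clock pulses per strategy.

    global_en: one 0/1 value per cycle (global enable).
    local_en:  one list of 0/1 values per register bank (load enables).
    Returns, per strategy, the pulses seen by the register banks and the
    pulses seen on the clock pins of the gating cells.
    """
    cycles = len(global_en)
    banks = len(local_en)
    active = sum(global_en)  # cycles where the global enable is high
    # A bank is really loaded only when both global and local enables are high.
    loads = sum(g & l for bank in local_en for g, l in zip(global_en, bank))
    return {
        # One manual cell on the root clock; every bank is clocked when global_en is high.
        "manual only": {"registers": banks * active, "gating cells": cycles},
        # One automatic cell per bank on the root clock, enable = global AND local.
        "Power Compiler only": {"registers": loads, "gating cells": banks * cycles},
        # Manual cell on the root clock, automatic cells on the already gated clock.
        "two levels": {"registers": loads, "gating cells": cycles + banks * active},
    }


# Hypothetical activity: 8 cycles, 2 register banks.
global_en = [1, 1, 0, 0, 1, 0, 0, 0]
local_en = [
    [1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]
result = clock_pulses(global_en, local_en)
for strategy, counts in result.items():
    print(strategy, counts)
```

In this toy case the two-level structure delivers as few clock pulses to the registers as the Power Compiler only structure, while its gating cells see fewer clock pulses because the second level is fed by the already gated clock.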
[Figure: gain (%) versus register width; curves: no original clock gating (power), no original clock gating (area), with original clock gating (power)]
The inversion point is around 2 for the area and between 1 and 2 for the power. Let us now
consider design C, which is a very big design with over 3,000 flip-flops and a wide variety of
register widths. The following graph shows the results.
[Figure: design C results versus bitwidth limit, including the number of clock gating cells]
For that example, the inversion point is not as obvious as before, but a bitwidth limit of 6
seems to be the best choice.
We have seen two examples, one with more and more registers gated (practical case) and a
second one with width variations. The register width variation results can be explained by the
library characteristics regarding power and area: the integrated clock gating cell in our library is
exactly twice as large as the multiplexer, hence the inversion point of the area curve at a bitwidth
value of 2. The power inversion point depends on the library and the activity file, and since the
power estimations were not performed with a clock tree, we cannot determine the exact limit; we
can simply expect it to be around the same value as the area limit.
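The area inversion point follows from simple arithmetic. The sketch below uses areas normalized to the multiplexer area, with the clock gating cell at exactly twice that area as stated above; the function names are hypothetical, and the model deliberately ignores routing and timing effects, which also influence the real results.

```python
def area_gain(width, mux_area=1.0, gating_cell_area=2.0):
    """Cell area saved by replacing `width` multiplexers with one clock gating cell.

    Areas are normalized to the multiplexer area; routing is ignored.
    """
    return width * mux_area - gating_cell_area


def break_even_width(mux_area=1.0, gating_cell_area=2.0):
    """Register width at which the replacement neither saves nor costs cell area."""
    return gating_cell_area / mux_area


print(break_even_width())  # inversion point of the area curve
for width in (1, 2, 6, 32):
    print(width, area_gain(width))
```

Below the break-even width the clock gating cell costs more area than the multiplexers it removes; above it, the gain grows linearly with the register width.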
Design C results show how the parameter variations influence the results on a design that was
not tuned for Power Compiler, with the same library as for design B. It is interesting to notice that
the best bitwidth limit is 6, which is in the range of commonly accepted values, but the curve
variations are not as smooth as for design B and the best value could easily be different with
another reference simulation.
There are practical implications of those results that we will describe in a later chapter.
The above results clearly show that over-constraining the design has a very negative impact on
the overall power consumption, regardless of the strategy. Moreover, the best results are not
obtained with the same strategy depending on the timing constraints.
When the timing target is not achievable, the removal of one level of multiplexers by Power
Compiler has a positive impact on the timing: the worst-case path is 0.10 ns shorter. However,
power consumption is not improved since the power target comes after the timing target.
4.4.6 Area
Many syntheses were performed on the designs presented above. The averaged area gain is
drawn in the following graph. The three designs had different behaviors regarding the area,
which explains the local variations of the graph. However, a global trend is visible that shows an
area reduction in proportion to the number of automatic clock gating cells. That effect is caused
by the replacement of multi-bit multiplexers by a single clock gate as well as by the relaxed
timing on the data paths.
Figure 7: Area gain with Power Compiler clock gating
[Plot: area gain on a logarithmic axis]
Several interesting comments can be drawn from the figures in that table. The average gain
lies between 5% and 7%, depending on the clock gating strategy. On our test cases, the Power
Compiler clock gating clearly allowed a better power optimization during the second
synthesis stage. It is important to understand that the achieved improvements are only as real as
the simulation used to calculate the logic activity is representative. The
minimum and maximum figures show that the power improvement of that second stage can be
much greater than 10%; in some cases, however, a degradation is observed. Those cases occurred
when the design was over-constrained: instead of improving power, the incremental synthesis stage
focused on improving timing. Clearly, over-constraining the design prevents
efficient power optimization.
On several test cases, multiple power optimizations were successively performed using an
updated activity file every time. They all showed the same results as the original power
optimization with less than 1‰ difference. The first power optimization gives very interesting
results, but no additional improvement can be expected from any other power optimization.
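[Figure: the enable is generated by flip-flops clocked by clk, goes through combinatorial logic, and is captured by the clock gating cell. tCG: clock gate insertion delay; t∆: insertion delay delta]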
The clock gating cell enable is calculated from flip-flops that receive a clock with tTOT insertion
delay. That enable is latched with the same root clock after tCG insertion delay, which arrives t∆
earlier than tTOT (t∆ = tTOT - tCG). The time available to the enable logic is therefore shorter by
t∆ than under the default Synopsys behavior, and its constraint must be tightened accordingly.
Unfortunately, t∆ is unknown during synthesis since it depends on the clock tree structure.
If several levels of clock gating are used in the design, t∆ becomes larger for the cells that are
closer to the root clock pin. Moreover, the larger t∆ is, the more efficient clock gating is, which
is our goal. So, during synthesis, we need to account for a large t∆ value.
The easiest solution is to have synthesis consider the tTOT insertion delay for all the design
flip-flops, which we assume is known for synthesis (it is usually project-dependent; if it is
unknown, a greater value than the expected real value can be used), and to force the tCG insertion
delay to zero for all the clock gating cell latches. The effect is an over-constraint of the enable
logic that guarantees no post-clock-tree trouble.
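The effect of that choice on the enable path can be checked with a small calculation. The sketch below is an illustration only: the function name, the setup and uncertainty parameters, and all numeric values are hypothetical, and the model assumes the enable is launched at tTOT and must be stable one clock period later at the latch, whose clock arrives at tCG.

```python
def enable_path_budget(period, t_tot, t_cg, setup=0.0, uncertainty=0.0):
    """Time available to the enable logic between launch and capture.

    The enable is launched by flip-flops clocked at t_tot and captured by the
    clock gating cell, whose clock arrives at t_cg, i.e. t_delta earlier.
    """
    t_delta = t_tot - t_cg
    return period - t_delta - setup - uncertainty


# Hypothetical values in ns.
PERIOD, T_TOT = 10.0, 2.0

default_budget = enable_path_budget(PERIOD, T_TOT, t_cg=T_TOT)  # same latency everywhere
real_budget = enable_path_budget(PERIOD, T_TOT, t_cg=0.5)       # a possible post-clock-tree case
worst_budget = enable_path_budget(PERIOD, T_TOT, t_cg=0.0)      # tCG forced to zero in synthesis

print(default_budget, real_budget, worst_budget)
```

Because the budget obtained with tCG forced to zero is never larger than the budget for any real, non-negative tCG, an enable path that meets the synthesis constraint also meets it after clock tree insertion, as long as the real tTOT does not exceed the assumed one.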
If those over-constraints cannot be met, they can be relaxed with the effect of creating new
insertion delay constraints during the clock tree elaboration: non-null tCG means the
corresponding clock gating cells must receive their clocks with an insertion delay greater than
tCG; if tTOT is not fixed, the final clock tree insertion delay will have to be smaller than tTOT .
Below is a short script example that implements the clock gating constraints as described above:
# Root clock, plus clocks defined on the input and output pins of the clock gating cells
create_clock -period $CLKPER -name CLK [get_ports {<clock port list>}]
create_clock -period $CLKPER -name POST_CG_CLK [get_pins -hierarchical "*clk_gate*/clkout"]
create_clock -period $CLKPER -name PRE_CG_CLK [get_pins -hierarchical "*clk_gate*/clkin"]
set_clock_uncertainty $UNCERTAINTY [all_clocks]
# Flip-flops see the full insertion delay (tTOT)
set_clock_latency $INSERTION_DELAY [get_clocks {CLK POST_CG_CLK}]
# Clock gating cell latches see a null insertion delay (tCG = 0)
set_clock_latency 0 [get_clocks {PRE_CG_CLK}]
7 References
[1] Darren Jones, "How To Successfully Use Gated Clocking in an ASIC Design", SNUG Boston 2002.