Overcoming Power Compiler limitations

to optimize clock gating

Sylvain Haas

Motorola Semiconductor Products Sector


[email protected]

ABSTRACT

Synopsys Power Compiler is a simple tool that helps designers achieve a very low power design
by replacing the feedback loops of multi-bit registers with a single clock-gating cell. After a
quick review of Power Compiler's features and advantages, we focus on its limitations: a single
level of clock gating only, no hierarchical understanding of the design, no ability to use one
clock gating cell for several registers with slightly different enabling conditions, etc. Once
those limitations and their impact on the quality of results have been understood through a few
design examples, we determine solutions to get the best results from Power Compiler.

SNUG Europe 2003 1 Overcoming Power Compiler limitations


to optimize clock gating
1 Introduction
Systems on chip are increasingly used in mobile applications that require very low power
consumption, and a growing number of techniques exist to reduce the power consumption of
digital systems. The techniques available to RTL designers include clock gating, which can now
be managed automatically by Power Compiler.
The purpose of this paper is to discuss Power Compiler automatic clock gating generation and to
understand its current capabilities in order to use it better, and thus achieve a design with
reduced power consumption and very little impact on area and timing. We consider two
questions. What are the benefits of Power Compiler for an existing design that might already
include clock gating, and can we improve its power with as little work as possible? How can we
achieve the lowest power consumption on a new design using Power Compiler?
The structure of the paper follows the stages a designer goes through to answer those
questions: we first understand how to use the tool, then discover its limitations, determine
ways to overcome them, elaborate design strategies, measure the influence of several identified
parameters, and finally wrap up the acquired experience into a set of principles and guidelines.

2 Power Compiler clock gating


One of Power Compiler's features is the automatic instancing of clock gating circuitry based on
an analysis of the RTL functionality. This chapter introduces the principles of automatic clock
gating, its benefits and the Power Compiler design flow.

2.1 Principle
During design elaboration, automatic clock gating insertion can be invoked with the
-gate_clock option. When that option is used, a single clock-gating cell replaces the
multiplexers and feedback loops of multi-bit registers that have a synchronous load-enable.
The minimum register width for which that replacement is performed is a user-defined parameter.
A value of 6 to 8 is usually considered a good choice, since there is a tradeoff between the
area cost of the clock gating cell and the area gained by removing the feedback loops. That
value was mentioned during the presentation of [1]; however, it depends on the design and
library characteristics, as we will see in a later chapter. Moreover, it is not obvious that
power is reduced when a cell gates too few registers, because of the intrinsic power
consumption of the cell.
The replacement principle is illustrated in the schematics below:

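Figure 1: Power Compiler principle

[Schematic: an <n>-bit register (din, dout, clk) with a feedback multiplexer controlled by load,
and the equivalent <n>-bit register whose clock is gated by a single <1>-bit cell driven by load
and clk.]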
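
As a minimal sketch of the RTL pattern involved, a register with a synchronous load-enable can be written as shown below. The module name and the WIDTH parameter are assumptions made for this illustration, and the comments describe the intended mapping rather than a guaranteed tool result.

```verilog
// Illustrative sketch: register with a synchronous load-enable.
// Module name and WIDTH parameter are assumptions made for this example.
module load_enable_reg (clk, load, din, dout);
  parameter WIDTH = 32;
  input              clk;
  input              load;
  input  [WIDTH-1:0] din;
  output [WIDTH-1:0] dout;
  reg    [WIDTH-1:0] dout;

  // No else branch: dout holds its value when load is low.
  // Without clock gating, this hold behaviour is built as one feedback
  // multiplexer per bit. With elaborate -gate_clock, and WIDTH at or above
  // the minimum bitwidth, it is expected to be built by gating clk with load.
  always @(posedge clk)
    if (load)
      dout <= din;
endmodule
```

The absence of an else branch is what creates the feedback loop that the tool looks for.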

2.2 Benefits
The major expected advantage of the method is the reduction of power consumption, since multi-bit
registers that have been automatically gated by the tool only receive a clock when they really
need to change their contents.
This is not the only possible advantage: replacing 32 multiplexers by a single clock gating
cell, even a large one, tends to reduce the design area. This is due both to the cumulative area
of the multiplexer cells, which is larger than the clock gating cell area, and to the reduction
of routing congestion.
The multiplexer replacement also has a timing impact: it can remove one level of logic from the
datapath, which may be considered an improvement. However, it may also tighten the timing
constraint on the multiplexer selection logic (the load-enable signal), because the enable has
to meet the setup constraint of a latch whose clock insertion delay is sometimes much lower than
the clock insertion delay of the flip-flops that generate that enable. This point is discussed
more thoroughly in a later chapter.
Finally, ease of use and setup is one of Power Compiler's big advantages. It is very simple to
invoke from Design Compiler or Physical Compiler, it saves days of RTL coding compared with
manual insertion of the clock gating cells, and it helps keep the design library-independent.

2.3 Design flow


Although this paper focuses on a very specific feature of Power Compiler, the complete Power
Compiler design flow is summarized below:

Figure 2: Power Compiler design flow

Input: RTL

analyze
    analyze -lib WORK design_rtl.v

setup clock gating constraints
    set_clock_gating_style \
        -sequential_cell "latch" \
        -positive_edge_logic "integrated" \
        -negative_edge_logic "integrated" \
        -control_signal "scan_mode" \
        -control_point "before" \
        -minimum_bitwidth 6

elaborate (result: Gtech GL w/ clock gating)
    elaborate -gate_clock -lib WORK design

compile (result: compiled GL w/ clock gating)
    compile/compile + physopt/compile_physical
    write -format verilog -hier -out design_gl.v

simulate
    vcs

backannotate activity
    reset_switching_activity -all -verbose
    read_saif -input design.saif -instance tb/design

optimize (result: GL optimized for power)
    set_max_dynamic_power 0
    set_max_leakage_power 0
    set physopt_enable_power_optimization true
    compile -inc/physopt -inc

2.4 Power Compiler with Physical Compiler


When Power Compiler is used within Physical Compiler, the clock-gating cells are placed with
respect to the placement of the gated registers. The gating cells are placed as close as
possible to the flip-flops they drive, using a soft bound constraint.
If the clock gating cells are not fully integrated and are thus made of several discrete elements,
Physical Compiler places those elements very close together, using a hard bound constraint, in
order to achieve a gated clock without glitches. That issue does not exist with integrated clock
gating cells.
Moreover, Physical Compiler provides user-controlled capabilities to rewire and remove Power
Compiler clock gating in order to achieve timing goals. Power Compiler within Physical Compiler
helps designers manage skew and timing effectively.

3 What do we want to achieve?


Let us first clarify the goals we want to achieve and state the constraints we have, before starting
to analyze Power Compiler results regarding automatic clock gating efficiency and comparing
them with our expectations.

For a new design, the first and primary objective is to obtain a low power design without loss of
productivity. We probably could achieve that goal without Power Compiler, by simply exploring
refined clock gating structures and evaluating them for power consumption. Unfortunately, this
is very time consuming. Furthermore, that implies RTL tuning that will hamper its reusability
because the tuning will be dependent on power estimations that may vary according to the
synthesis results.
In the case of an existing design, that objective becomes obtaining a lower power design than the
original one, considering that it might already include clock gating. We would like to achieve
that goal without spending days or weeks in RTL modifications. In the case when there is an
existing clock gating structure, we want to quickly refine that structure in order to reduce power.
The second objective is to keep a clean RTL, simple to write, easy to understand and modify (i.e.
focused on functionality rather than physical implementation).
Since area and timing are still very critical, there is no desire to trade them off against
power. Area should therefore be reduced or at least unaffected, and timing should improve or at
least be maintained.
Finally, back-end work should not become more difficult, which implies a reasonable complexity
for the clock tree structure.
Of course, not all of those goals are achievable simultaneously, but we will try to get as close
as possible to them. They also determine how we analyze Power Compiler results and help us
select the best strategies to get an optimal clock gating structure.

4 Understanding Power Compiler capabilities


Now that the objectives are clearly defined, we can start using Power Compiler to try and reach
them. In all the following paragraphs, several designs have been compiled, simulated and
analyzed for power, area and timing. The same simulation was used to have back-annotated
activity for all the Gate Level netlist variants. The power estimations were always performed
using Gate Level netlists with no clock tree, and are thus not as accurate as post-route
estimations. However, that was considered to be accurate enough for comparing strategies
applied at the RTL level.

4.1 Problem statement


Let us consider a simple design with existing clock gating. We would like to remove the clock
gates from the RTL code and rely on Power Compiler to automatically perform clock gating,
thus achieving a lower power design with no more library-dependent instances.
The single manual clock-gating cell has been removed and its enable used as a global enable on
all the registers that formerly received the gated clock.
The following power consumption was measured on the design for all three options:
Clock gating option                        Power
No clock gating                            172.66 uW
Original manual clock gating                43.34 uW
Power Compiler automatic clock gating       48.18 uW

Both Power Compiler and manual clock gating provided significant power savings over the
design without clock gating (about 72% and 75% respectively). However, Power Compiler did not
achieve a lower power design than the original one. Below we investigate the reasons for that
disappointing performance and try to find techniques to further increase the savings that Power
Compiler can provide.

4.2 Problem analysis


The clock gating enables have been analyzed; their structure and the activity on a few
representative nodes are given below for the manual and Power Compiler netlists (activity is
measured in number of toggles during the simulation span):
Figure 3: Activity of the enable logic with manual clock gating

[Schematic of the enable logic and the flip-flops it drives; annotated node activities: <1999>,
<2000> and <1999> toggles.]

Figure 4: Activity of the enable logic with automatic clock gating

[Schematic of the enable logic and the flip-flops it drives; annotated node activities: <1999>,
<30>, <464>, <2000>, <1999>, <2000>, <228>, <28> and <2> toggles.]

The activity figures show a single high-activity gate with manual clock gating, whereas there are
four high-activity gates in the Power Compiler result. This is due to the merging of the global
enable with the registers' load-enables. Even after power optimization, Power Compiler cannot
further reduce the number of high-activity gates. In that case, Power Compiler optimization is
not able to achieve a result as good as our original manual clock gating.
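
To make the merge concrete, the following sketch shows what the RTL may look like once the manual clock gate has been removed and its enable reused as a global enable. All names (global_en, load_a, load_b) and the 16-bit widths are illustrative assumptions; this is not the code of the design measured above.

```verilog
// Illustrative sketch: the former manual clock-gate enable is reused as a
// global enable that is ANDed with each register's own load-enable.
module merged_enable_regs (clk, global_en, load_a, load_b,
                           din_a, din_b, reg_a, reg_b);
  input         clk;
  input         global_en;       // formerly the enable of the manual clock gate
  input         load_a, load_b;  // register-specific load-enables
  input  [15:0] din_a, din_b;
  output [15:0] reg_a, reg_b;
  reg    [15:0] reg_a, reg_b;

  // One gating condition per register: global_en && load_x.
  // Every toggle of global_en propagates into each of those enable cones.
  always @(posedge clk)
    if (global_en && load_a)
      reg_a <= din_a;

  always @(posedge clk)
    if (global_en && load_b)
      reg_b <= din_b;
endmodule
```

With the manual gate, global_en only toggled a single gating cell. In this form it feeds one enable cone per gated register, which is consistent with the additional high-activity gates of Figure 4.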

4.3 Power Compiler areas of enhancements


Power Compiler provides several benefits (automatic clock gating insertion, no RTL changes,
library independence) and has interesting features: the feedback loop multiplexer removal and
the ability to perform clock gating at the register level, which may prove to be impractical if
performed manually when there are over a hundred different registers.
However, we have found an example showing that Power Compiler is not always able to yield an
optimal result as simply as we would like. Since we want to achieve the lowest possible power
consumption, we will discuss Power Compiler's areas of enhancement and ways of overcoming its
limitations, in order to reach, in a reasonable time frame, a lower power consumption than would
have been possible without the tool.

4.3.1 Need for enhanced clock-gating implementation beyond elaboration


Power Compiler inserts clock gating at the module level during the elaboration phase. That
strategy limits Power Compiler's visibility to the module it is working on, without knowledge of
the upper and lower levels of hierarchy. This results in the following list of features, which
may be seen as limitations when they prevent us from easily obtaining the clock gating we desire,
but also as opportunities to direct the tool's understanding of the code:
• Since clock gating is calculated at module level, feedback loops that span several
levels of hierarchy cannot be understood; their replacement is only performed in the
sub-module that also includes the registers. So, complex feedback loops that go through several
levels of hierarchy are not completely replaced; that also means that isolating the registers
in separate sub-modules prevents Power Compiler from gating their clocks.
• After elaboration, the clock gating enable structure is fixed; it therefore cannot depend on
the switching activity.
• A single level of clock gating is generated.
• When the bitwidth limit is not reached, Power Compiler is unable to share one clock gate
between several small registers that would have a more general enable.
• The feedback loop replacement sometimes leads to sub-optimal solutions in terms of power.
For example, there are cases when it is more efficient to have a single clock gate with a more
general enable than 2 or 3 clock gates with more complex enables.
Synopsys is known to be working on enhancements in a later release that will remove some of
the limitations listed above.

4.3.2 Need for enhanced RTL code understanding


Following are other areas of enhancements that can easily be worked around once they have
been identified:
• Power Compiler does not recognize manually inserted clock gating elements (that feature
is not supported yet), thus preventing it from instancing its refined level of clock gating.
• When a multi-bit register has different enabling conditions depending on the bits, Power
Compiler splits the register into as many parts as there are enabling conditions and
applies its rules on each one; that means a 16-bit register with a specific enable on its 2
lower bits will receive a clock gate for its 14 upper bits only (if the bitwidth limit is
strictly greater than 2).
• Arrays of registers are not understood correctly by Power Compiler when they are
accessed with an implicit decoder as in the example below:
    reg [31:0] my_buffer[7:0];
    ...
    if (condition_a)
        my_buffer[sel[2:0]] <= newvalue[31:0];    // the whole array is considered to be updated
    else if (condition_b)
        my_buffer[0] <= anothernewvalue[31:0];    // only buffer[0] is considered to be updated
    else if (sel[2:0] == 3'b010)
        my_buffer[4] <= yetanother[31:0];         // only buffer[4] is considered to be updated
    else if (sel[2:0] == 3'b110)
        my_buffer[sel[2:0]] <= thelastone[31:0];  // the whole array is updated for Power Compiler

So, with implicit decoding, the whole array is seen as being updated: Power Compiler is
currently unable to build a separate enable condition based on the value of the selector; it
is however possible to help it by explicitly using the selector in the enabling condition.
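
A minimal sketch of that workaround follows. It restates the condition_a branch of the example above with the selector compared explicitly in each enabling condition and a constant array index. The module wrapper and the rd_sel and rd_data read ports are illustrative assumptions, and whether a given tool version extracts one enable per entry from this form should be checked on the resulting netlist.

```verilog
// Illustrative sketch: explicit selector decoding in the enabling conditions.
// Module name and read port (rd_sel, rd_data) are assumptions for this example.
module buffer_explicit_sel (clk, condition_a, sel, newvalue, rd_sel, rd_data);
  input         clk;
  input         condition_a;
  input  [2:0]  sel;
  input  [31:0] newvalue;
  input  [2:0]  rd_sel;
  output [31:0] rd_data;

  reg [31:0] my_buffer [7:0];

  // Constant index on the left-hand side, selector value in the condition:
  // each entry has its own explicit enable instead of an implicit decoder.
  always @(posedge clk) begin
    if (condition_a && (sel[2:0] == 3'd0)) my_buffer[0] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd1)) my_buffer[1] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd2)) my_buffer[2] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd3)) my_buffer[3] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd4)) my_buffer[4] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd5)) my_buffer[5] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd6)) my_buffer[6] <= newvalue[31:0];
    if (condition_a && (sel[2:0] == 3'd7)) my_buffer[7] <= newvalue[31:0];
  end

  // Read side, included only so that the array contents are observable.
  assign rd_data = my_buffer[rd_sel];
endmodule
```

The functional behaviour is the same as my_buffer[sel[2:0]] <= newvalue[31:0] under condition_a; only the way the enable is expressed changes.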

Since Power Compiler is unable to detect feedback loops that go through several levels of
hierarchy, that limitation can be used to steer its understanding of the code and to work around
the corresponding shortcomings of its code analysis:
• Bitwidth limit may be overcome by grouping small registers together in a sub-module
where they receive a global enable as in the example below:
    module regbundle(control_a_next, control_b_next, new_conf_next, val1_next, val2_next,
                     en, clk, control_a, ...);
        input  [2:0] control_a_next;
        input        control_b_next;
        input  [1:0] new_conf_next;
        input  [3:0] val1_next;
        input  [2:0] val2_next;
        input        en;     // common enable shared by all the bundled registers
        input        clk;
        output [2:0] control_a;
        reg    [2:0] control_a;
        ...
        always @(posedge clk)
            if (en)
            begin
                control_a <= control_a_next;
                ...
            end
    endmodule

Since they receive the same enable, Power Compiler understands them as a single register
that is wider than the bitwidth limit, thus enabling its automatic gating.
• The same trick can be used to have the exact desired enable for the automatic clock
gating cell; that can be used to share a global enable that proves more power efficient,
etc.
• Manual clock gating must be instanced in a different sub-module than the registers that
use the gated clock (e.g. at the top of the module hierarchy, or in a separate sub-module
that could include all the manual clock gating cells).
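
The last point can be sketched as follows. The clk_gate module is a hypothetical behavioural stand-in for a library clock-gating cell, and all module and signal names are assumptions made for this illustration; a real design would instance the library cell instead.

```verilog
// Hypothetical behavioural stand-in for a latch-based library clock-gating cell.
module clk_gate (clk, en, gclk);
  input  clk, en;
  output gclk;
  reg    en_lat;

  // Latch transparent while clk is low, so that a change of en
  // cannot glitch gclk while clk is high.
  always @(clk or en)
    if (!clk)
      en_lat <= en;

  assign gclk = clk & en_lat;
endmodule

// The registers live in their own sub-module and only see the gated clock.
// Their load-enable feedback loop is local to this module, where automatic
// clock gating can analyze it.
module gated_regs (gclk, load, din, dout);
  input         gclk, load;
  input  [31:0] din;
  output [31:0] dout;
  reg    [31:0] dout;

  always @(posedge gclk)
    if (load)
      dout <= din;
endmodule

// The manual clock gate is instanced one level above the registers.
module top (clk, global_en, load, din, dout);
  input         clk, global_en, load;
  input  [31:0] din;
  output [31:0] dout;
  wire          gclk;

  clk_gate   u_gate (.clk(clk), .en(global_en), .gclk(gclk));
  gated_regs u_regs (.gclk(gclk), .load(load), .din(din), .dout(dout));
endmodule
```

Keeping u_gate outside gated_regs means the register sub-module contains no manually inserted gating element, which is the situation the bullet above asks for.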

4.4 Correlation between several parameters and power


The example described in paragraph 4.1 “Problem statement” exhibited the inefficiency of
Power Compiler in some cases. We have identified several areas of tool enhancements in the
previous paragraph. In the following paragraph, we will study the influence of several
parameters on different clock gating strategies: manual gating, Power Compiler gating and both,
applied to three designs with different characteristics. We will also discuss the impact of Power
Compiler on timing and area of the final design.

4.4.1 The test-cases


Three designs were used to explore clock-gating strategies:
• Design A is the simple test case that exhibited Power Compiler inefficiency. It consists
of a small state-machine and two large registers, runs at 200 MHz and is less than 1K
gates. The whole design is activated using a sub-frequency enable.
• Design B consists of ten 32-bit registers and a very simple calculation unit. Its size
is below 10K gates and it runs at 140 MHz. It is activated according to the operation to
perform.
• Design C is a very complex design with numerous manual clock-gating cells; it
represents the kind of existing design you might want to run through Power Compiler in
order to determine the efficiency of the tool on real test cases. Its size is about 90K gates,
including more than 3,000 flip-flops, 14 clock gating cells and several memories; it runs
at about 115 MHz.
Basic clock gating strategies have been evaluated on those three designs and the results are listed
in the following table:
Clock gating option                                  Design A   Design B   Design C
No clock gating                                      176 uW     3.53 mW    21.07 mW
Manually inserted clock gating only                  45 uW      3.21 mW    2.51 mW
Power Compiler clock gating only                     50 uW      2.88 mW    NA
Power Compiler applied on design with existing
  manual clock gating                                44 uW      2.87 mW    2.17 mW

We have already discussed the results in the case of design A.


In contrast, design B shows a case where Power Compiler alone is the most efficient, since
using both manual and automatic clock gating barely improves power (2.87 mW against 2.88 mW)
but costs two levels of clock gating. That design's structure differs from design A: its global
enable is generated by the same state-machine that generates the load-enables of the ten 32-bit
registers. We can expect the global enable logic to toggle as frequently as any
register-specific load command, which makes Power Compiler the best choice. The number of
registers (design A has two registers whereas design B has ten) may also be a reason why Power
Compiler is more efficient.
Design C is the typical case of a legacy design whose existing clock gating cannot easily be
replaced by Power Compiler, but that is improved when Power Compiler's second level of clock
gating is added. No result is shown for the ‘Power Compiler only’ strategy because it
would have meant taking the enables from the global clock gating cells and propagating them in
the RTL code of several dozen sub-modules, with no guarantee of achieving a lower power
consumption than with the two levels of clock gating.
In the following paragraphs, we will investigate various parameters that may have influenced the
results given above; and by making them vary, determine their impact on different strategies.

4.4.2 The module global enable activity


The global enable is the signal that gates the clock for the entire module; it is connected to the
manual clock-gating cell. As it was already identified in the example of paragraph 4.1, we can
expect that signal activity to impact the efficiency of Power Compiler vs. manual clock gating.
Let us consider variations of the global enable activity on design B and measure its impact on the
design power consumption. The table below shows the results:

Global enable activity levels:
    Low:        global enable activity ~ ½ other inputs activity
    Medium:     global enable activity ~ other inputs activity
    High:       global enable activity ~ 5x other inputs activity
    Very high:  global enable activity ~ 10x other inputs activity

Clock gating option   Low             Medium          High            Very high
None                  2105.5  100%    1335.0  100%    910.3  100%     823.2  100%
Manual only           1743.9   83%     737.2   55%    152.9   17%      34.2    4%
Power Compiler        1232.9   59%     701.5   53%    206.6   23%     106.0   13%
Both                  1482.3   70%     663.1   50%    140.5   15%      31.3    4%
(Percentages are relative to the "None" row of the same column.)

Although design B is the typical case where Power Compiler is more efficient than manual or
combined clock gating when the global enable activity is reasonably low, that test case shows
that completely different results are possible when the global enable activity becomes much
higher than the activity of the remaining input signals. The major lesson from that example is
the tremendous impact of the stimuli on the strategy to use; using Power Compiler to generate a
second level of refined clock gating seems to achieve good results. It is also important to
remember that the clock tree consumption is not included in the data; including it should
further favor the solution with two levels of clock gating over the ‘Power Compiler only’ and
‘manual only’ solutions.

4.4.3 Number of registers to be gated


Our analysis indicated that one cause of Power Compiler's poor quality of results on design A,
when used alone, is the quantity of logic driving the clock gating cell enables that toggles
whenever the global enable toggles. In order to check that assumption, let us consider a few
modifications of design A that make Power Compiler instance 6 clock gating cells instead of
the original 3.
Clock gating option                      Original             Modified
No clock gating                          172.66 uW   358%     176.21 uW   290%
Original manual clock gating              43.34 uW    90%      49.71 uW    82%
Power Compiler automatic clock gating     48.18 uW   100%      60.71 uW   100%
Both                                      41.54 uW    86%      41.73 uW    69%
(Percentages are relative to the Power Compiler automatic clock gating row of the same column.)

This is exactly our expectation: with more clock gating cells instanced by Power Compiler, there
is more logic that toggles because of the high activity global enable. It is clear that instancing
several clock gating cells with a shared global enable is not always as power efficient as the
original single clock gating cell that received that global enable.
That example also shows that the benefit of using both manual and Power Compiler clock gating
increases with the number of automatic clock gating cells. Power Compiler's additional level of
clock gating allows registers to be activated only when needed. Without that additional level,
all the registers receive their clock whenever the global enable is active, which makes them
consume unnecessary power.

4.4.4 Bitwidth limit


In some cases, the bitwidth limit, combined with Power Compiler's inability to find a common
enable condition for very small registers and drive them with a single clock gating cell, causes
Power Compiler automatic clock gating to give worse quality of results than manual clock
gating.

However, playing with the bitwidth parameter is probably not the way to solve that kind of
problem. The question here is to determine the range of bitwidth limits that gives the best
results in terms of power, area and timing.
Let us take design B, whose registers are all 32 bits wide, and vary their width from 32 down
to 1 bit. The following graph presents the power consumption reduction brought by Power
Compiler with and without original clock gating in the design.
Figure 5: Register width impact on Power Compiler clock gating efficiency

[Graph: gain (%), from -30 to 40, versus register width (32, 24, 16, 12, 10, 9, 8, 7, 6, 5, 4,
3, 2, 1), with four curves: no original clock gating (power), no original clock gating (area),
with original clock gating (power), with original clock gating (area).]

The point of inversion is around 2 for the area and between 2 and 1 for the power. Let us now
consider design C that is a very big design with over 3,000 flip-flops and a wide variety of
register widths. The following graph shows the results.

Figure 6: Bitwidth parameter impact on a large design

[Graph: number of clock gating cells (45 to 127), power (2.12 to 2.25 mW), area (85,779 to
88,108 gates) and timing (8.52 to 8.60 ns) versus bitwidth limit (2, 4, 5, 6, 8, 10, 16).]

For that example, the inversion point is not as obvious as before but it seems the bitwidth limit 6
is the best choice.
We have seen two examples, one with more and more registers gated (practical case) and a
second one with width variations. The register width variation results can be explained with the
library characteristics regarding power and area: the integrated clock gating cell in our library is
exactly twice as large as the multiplexer, hence the inversion point of the area curve at bitwidth
value 2. Regarding power, it depends on the library and the activity file, and since the power
estimations were not performed with a clock tree, we cannot determine the exact limit; we can
simply expect it to be around the same value as the area limit.
Design C results show how the parameter variations influence the results on a design that was
not tuned for Power Compiler, with the same library as for design B. It is interesting to notice the
best bitwidth limit is 6, which is in the range of admitted values, but the curve variations are not
as smooth as for design B and the best value could easily be different with another reference
simulation.
There are practical implications of those results that we will describe in a later chapter.

4.4.5 Timing constraints


Regarding design timing, two aspects are interesting: the impact of timing constraints on power
results and the impact of Power Compiler automatic clock gating on timing.
The tests were performed on design B with three different timing targets: 5 ns (over-constrained),
7 ns (normal constraint) and 15 ns (relaxed constraint). Two synthesis stages were used: a
preliminary RTL compile followed by an incremental compile with power reduction. Results after
the second stage are presented below.
Clock gating option                      5 ns                7 ns                No timing constraints
                                         Power     Timing    Power     Timing    Power     Timing
No clock gating                          6.03 mW   5.79 ns   3.47 mW   7.02 ns   2.36 mW   NA
Original manual clock gating             6.51 mW   5.79 ns   3.09 mW   7.01 ns   1.94 mW
Power Compiler automatic clock gating    5.99 mW   5.69 ns   2.73 mW   7.01 ns   1.55 mW
Both                                     6.00 mW   5.70 ns   2.67 mW   7.01 ns   1.43 mW

The above results clearly show that over-constraining the design has a very negative impact on
the overall power consumption, regardless of the strategy. Moreover, the strategy that gives the
best results depends on the timing constraints.
When the timing target is not achievable, the removal of one level of multiplexers by Power
Compiler has a positive impact on timing: the worst-case path is 0.10 ns shorter (5.69 ns
against 5.79 ns). However, power consumption is not improved, since the power target has a lower
priority than the timing target.

4.4.6 Area
Many syntheses were performed on the designs presented above. The averaged area gain is
plotted in the following graph. The three designs behaved differently regarding area, which
explains the local variations of the graph. However, a global trend is visible: the area
reduction grows with the number of automatic clock gating cells. That effect is caused by the
replacement of multi-bit multiplexers by a single clock gate, as well as by the relaxed timing
on the data paths.
Figure 7: Area gain with Power Compiler clock gating

[Graph: area gain (logarithmic scale, 10 to 100000) versus number of automatic clock gating
cells.]

4.4.7 The incremental power optimization
After every power analysis performed on test cases, an incremental optimization with power
reduction was performed. The results are summarized in the following table:
Gain (%)    Basic    Manual    Power Compiler    Both
Minimum     -1.52     0.76     -3.28              1.02
Maximum      8.78    12.57     13.67             13.42
Average      5.27     6.87      7.35              7.28

Several observations can be drawn from the figures in that table. The average gain lies between
5% and 7%, depending on the clock gating strategy. On our test cases, Power Compiler clock
gating allowed a better power optimization during the second synthesis stage. It is important
to understand that the achieved improvements are only as real as the simulation used to
calculate the logic activity is representative. The minimum and maximum figures show that the
power improvement of that second stage can exceed 10%; in some cases, however, a degradation is
observed. Those cases happened when the design was over-constrained: instead of improving
power, the incremental synthesis stage focused on improving timing. Clearly, over-constraining
the design prevents efficient power optimization.
On several test cases, multiple power optimizations were performed successively, using an
updated activity file every time. They all showed the same results as the original power
optimization, with less than 1‰ difference. The first power optimization gives very interesting
results, but no additional improvement can be expected from any further power optimization.

5 Power Compiler usage recommendations and strategy


In the previous paragraph, we analyzed the effects of several parameters. It is now time to
use that information to find the best strategy and quickly achieve the lowest power
consumption for our design. The objective below is to give advice that ranges from RTL coding
to Power Compiler parameters; power reduction techniques at the process or transistor level, or
at the system level, are not discussed.

5.1 Recommended design flow


The design flow we have presented in paragraph 2.3 requires a few adjustments and restrictions
to guarantee a correct power optimization:
• The preliminary compile and incremental compile can be performed with timing over-
constraints but it is recommended to relax those timing constraints and have a positive
slack for the power incremental optimization.
• Build a very accurate simulation that will be used to back-annotate the activity of all the
design cells. It is essential for that simulation to reflect reality since all the following
strategy evaluations and decisions will depend on the power estimations made with that
simulation. We have seen in the previous paragraph that the best strategy may vary
depending on the simulation.

• Write or modify the RTL, set up Power Compiler options and elaborate with or without
automatic clock gating according to the design:
Rewriting the RTL of existing designs to account for Power Compiler limitations should
only be performed on simple designs. Do not consider modifying the RTL of large
designs with existing manual clock gating: run the synthesis flow both ways (with and
without automatic clock gating from Power Compiler), and evaluate the power of both
netlists to select the best strategy. This is very quick to set up and only costs CPU
runtime. Little improvement should be expected on designs that already have well-tuned
manual clock gating.
Having both manual clock gating and Power Compiler automatic clock gating means two
levels of clock gating cells in the clock tree. Those cells usually have a significant impact
on the tree insertion delay, and the project rules regarding module insertion delay
might prevent the use of two levels.
When there are very few registers in the design (5 or fewer), having both manual clock
gating and automatic clock gating gives very little power gain compared to manual or
automatic only, for a large increase of the module insertion delay. Deciding between
manual and automatic clock gating depends on the design structure and global enable
activity. Both solutions should be compared (reworking the design to replace manual
clock gating with automatic clock gating should be relatively easy since we are talking
about a simple design with very few registers).
• Tune the RTL code and Power Compiler parameters to drive automatic clock gating
insertion and run the flow to compare results with power analysis. The simulation to
evaluate power has to stimulate every part of the design with its expected activity to
allow the most efficient tuning. The best choices for register grouping and bitwidth limit
may be slightly different depending on the simulation.
• When the clock gating cells are made of discrete gates, use Power Compiler within
Physical Compiler to have them placed very close to each other. The manual clock gating
cells will need hard bounds to be explicitly written by the designer.
• Perform Power Compiler optimization once the design timing and area targets have been
achieved.
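The comparison of both strategies recommended above can be sketched as a short dc_shell Tcl loop. This is only an illustration under stated assumptions, not the exact flow of paragraph 2.3: the design name, file names, constraint script and SAIF instance path are hypothetical placeholders, and each command and option should be checked against the tool release in use.

```tcl
# Sketch only: every name below is a hypothetical placeholder.
set DESIGN   my_block          ;# top-level module of the block under study
set RTL      {my_block.v}      ;# RTL file list
set SAIF     my_block.saif     ;# activity from the reference simulation
set INSTANCE tb/dut            ;# instance path of the design in the testbench

foreach auto_cg {0 1} {
    remove_design -all
    analyze -format verilog $RTL
    if {$auto_cg} {
        # Let Power Compiler replace register feedback loops with gating cells
        elaborate $DESIGN -gate_clock
    } else {
        elaborate $DESIGN
    }
    source constraints.tcl     ;# hypothetical script: clocks, I/O delays, loads
    compile
    # Back-annotate the switching activity, then estimate power for this netlist
    read_saif -input $SAIF -instance_name $INSTANCE
    redirect power_auto_cg_${auto_cg}.rpt { report_power }
}
```

Both reports are produced with the same activity file, so their difference reflects only the clock gating strategy.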

5.2 RTL coding tips


It is possible to steer Power Compiler's understanding and have it generate exactly the automatic
clock gating we want by using the design hierarchy, as explained in paragraph 4.3.2 “Limited
code understanding”. For example, you can find the common factor of the load-enables of
several small registers below the bitwidth limit and group them in a separate sub-module to force
Power Compiler to gate them.
When writing or modifying the RTL, questions arise about the best choice for the lowest power.
Ideally, the synthesis flow is set up and a good simulation exists to compare possibilities; all
that is then needed is time to write the different options, go through the flow and select the one
that gives the lowest power. If the flow is not ready, or there is a lack of time, the best option
has to be selected a priori. Below are a few guidelines to help you make the right decision.

• When there are very few registers to be gated, a single level of clock gating is sufficient:
use manual clock gating when the global enable activity is much greater than that of the
specific load-enable signals of the registers (typically a sub-frequency generated via a
synchronous enable for the module as a whole); use Power Compiler automatic clock
gating in other cases, especially when the global enable is generated from the same state-
machines as the specific load-enables.
• For larger designs (more than 10 separate registers), a combination of manual and
automatic clock gating gives the best results. Do not forget to exclude the global enable
that drives the manual clock-gating cell from the load-enable of every register.
• The global enable logic can be kept very simple and coarse; rely on Power Compiler for
the refined clock gating.
• Never leave registers without clock gating. The impact of that assertion on the RTL code
depends on the clock gating strategy. With a global clock gating inserted manually,
regardless of the use of Power Compiler automatic clock gating, nothing particular has to
be done. But, if Power Compiler is used as the only level of clock gating, it is essential to
tune the RTL and group registers that would not have been gated otherwise (because of
the bitwidth limit, setup or enable conditions that are not all met).
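One way to check that no register was left ungated is to ask Power Compiler for a clock gating report after elaboration. The lines below are a minimal sketch, assuming a dc_shell Tcl session in which the design has already been elaborated with automatic clock gating; `my_block` is a hypothetical design name, and the report options should be checked against the tool release in use.

```tcl
current_design my_block        ;# hypothetical design name
# Summary of gated versus ungated registers
report_clock_gating
# Registers left ungated (bitwidth limit, setup or enable conditions not met):
# these are the candidates for regrouping in the RTL
report_clock_gating -ungated
```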

5.3 Power Compiler automatic clock gating parameters


Before trying any value for the bitwidth limit, a first quick calculation can be performed to
determine the number of multiplexers whose combined area equals that of the clock-gating cell.
That value depends entirely on the library. It should be considered the minimum width to use
when grouping registers during the RTL tuning. It can also be used as a starting point for the
first Power Compiler attempt, and slight increases can be tried to evaluate the trend.
When it is not possible to determine that information from the library, a value between 6 and 8
can be used instead.
Since the impact of the bitwidth limit might vary slightly according to the design and its
reference simulation, it is recommended to try several values to find the optimal one.
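As a rough illustration of that first calculation, assume that each register bit without clock gating holds its value through one 2-to-1 feedback multiplexer. The symbols for the multiplexer area and the clock-gating cell area are introduced here for illustration only; the actual values have to be read from the library.

```latex
% A_{MUX}: area of one 2-to-1 feedback multiplexer (assumed symbol)
% A_{CG}:  area of one clock-gating cell (assumed symbol)
N_{\min} = \left\lceil \frac{A_{CG}}{A_{MUX}} \right\rceil
% Hypothetical example: A_{CG} = 11 and A_{MUX} = 2 area units give
% N_{\min} = \lceil 5.5 \rceil = 6 bits.
```

This break-even only accounts for area; the power-optimal limit still has to be confirmed with the reference simulation.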
That bitwidth limit can be different for every sub-module if their elaboration is performed
separately. For a limited number of sub-modules (10 or less), the parameter setup and its
corresponding synthesis script can be written fairly quickly.
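A per-module setup of that kind could look like the sketch below. It assumes the RTL has already been analyzed in a dc_shell Tcl session; the sub-module names and limits are hypothetical, and the command options should be checked against the tool release in use.

```tcl
# Hypothetical sub-module names and bitwidth limits; tune them per library
# and per reference simulation.
set BITWIDTH_LIMITS {
    regfile_ctrl 6
    dma_fifo     8
    irq_mask     7
}
foreach {module limit} $BITWIDTH_LIMITS {
    # The style has to be set before the elaboration it applies to
    set_clock_gating_style -sequential_cell latch -minimum_bitwidth $limit
    elaborate $module -gate_clock
}
```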
When a single level of clock gating is planned to be used, special care must be taken if it is a
Power Compiler automatic clock gating level since some registers might be left without any
clock-gating cell due to the bitwidth limit and/or other violations. It is then necessary to group
them in a separate sub-module with a global load-enable to have them gated by Power Compiler.

5.4 Clock gate synthesis constraints


The number of levels of clock gating cells on a clock tree has to be limited because it lengthens
the insertion delay which has a power impact, hampers the clock tree generation and also may
cause setup violations on the enable of the clock gating cells. That last issue is explained below:

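Figure 8: Clock gate enable setup constraint
[Schematic: flip-flops clocked with the total insertion delay tTOT drive combinatorial logic that
feeds the enable of a clock gating cell; that cell receives its clock with the clock gate insertion
delay tCG; the insertion delay delta is t∆ = tTOT - tCG.]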

The clock gating cell enable is calculated from flip-flops that receive a clock with tTOT insertion
delay. That enable is latched by the same root clock after tCG insertion delay, which arrives t∆
earlier than tTOT. The time available to the enable logic is therefore shorter by t∆ than what
Synopsys assumes by default. Unfortunately, t∆ is unknown during synthesis since it depends on
the clock tree structure.
If several levels of clock gating are used in the design, t∆ becomes larger for the cells that are
closer to the root clock pin. Moreover, the larger t∆ is, the more efficient clock gating is, which
is our goal. So, during synthesis, we need to account for a large t∆ value.
The easiest solution is to have synthesis consider the tTOT insertion delay for all the design
flip-flops, which we assume is known at synthesis time (it is usually project-dependent; if it is
unknown, a value greater than the expected real value can be used), and to force the tCG
insertion delay to zero for all the clock gating cell latches. The effect is an over-constraint of
the enable logic that guarantees no trouble after clock tree insertion.
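The constraint can be written out as a small budget calculation. The symbols for the clock period, the enable path delay and the setup time of the clock gating latch are introduced here for illustration, and clock uncertainty is ignored.

```latex
% T: clock period, t_{su}: setup time of the clock gating latch,
% t_{enable}: delay of the enable logic (all three are assumed symbols)
% Real budget after clock tree insertion: launch at t_{TOT}, capture at T + t_{CG}
t_{enable} \le T + t_{CG} - t_{TOT} - t_{su} = T - t_{\Delta} - t_{su}
% Budget seen by synthesis when t_{CG} is forced to zero
t_{enable} \le T - t_{TOT} - t_{su}
% Since t_{CG} \ge 0, t_{\Delta} = t_{TOT} - t_{CG} \le t_{TOT}:
% the synthesis budget is never looser than the real one.
```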
If those over-constraints cannot be met, they can be relaxed, at the cost of creating new
insertion delay constraints for the clock tree elaboration: a non-zero tCG means the
corresponding clock gating cells must receive their clocks with an insertion delay greater than
tCG; if tTOT is not fixed, the final clock tree insertion delay will have to be smaller than tTOT.
Below is a short script example that implements the clock gating constraints as described above:
# Root clock as seen by the design flip-flops
create_clock -period $CLKPER -name CLK [get_ports {<clock port list>}]
# Clocks at the output and input pins of every clock gating cell
create_clock -period $CLKPER -name POST_CG_CLK [get_pins -hierarchical "*clk_gate*/clkout"]
create_clock -period $CLKPER -name PRE_CG_CLK [get_pins -hierarchical "*clk_gate*/clkin"]
set_clock_uncertainty $UNCERTAINTY [all_clocks]
# Flip-flops see the full insertion delay tTOT; clock gating latches see tCG = 0
set_clock_latency $INSERTION_DELAY [get_clocks {CLK POST_CG_CLK}]
set_clock_latency 0 [get_clocks {PRE_CG_CLK}]
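If the over-constraint cannot be met, the relaxation described above amounts to giving the pre-gate clock a non-zero latency. The sketch below reuses the clock names of the script; TCG_MIN is a hypothetical variable holding the insertion delay that the clock tree must then guarantee at the clock gating cells.

```tcl
set TCG_MIN 0.3   ;# hypothetical value, in library time units
# The enable paths gain TCG_MIN of timing budget; in exchange, the clock tree
# must deliver the clock to every clock gating cell no earlier than TCG_MIN.
set_clock_latency $TCG_MIN [get_clocks {PRE_CG_CLK}]
```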

6 Conclusions and Recommendations
In this paper, we have presented Power Compiler, discussed several areas of enhancement
regarding automatic clock-gating insertion, and proposed techniques to improve the power
savings using Power Compiler. We have also analyzed the behavior of the tool for different
design cases and compared power results for different strategies: Power Compiler automatic
clock gating, manually inserted clock gating, or both.
We have seen the results depend on numerous parameters such as the module global enables
activity, the number of registers to be automatically gated by Power Compiler and their size.
Although it was possible to draw guidelines from those results, the only entirely reliable strategy
is to try the different possibilities and compare them with a reference simulation.
The Power Compiler parameters can all be tried (with or without automatic clock gating, and the
bitwidth limit variations). The other choices, such as the smart grouping of small registers or
the determination of the most efficient global enable, can be time-consuming since they require
RTL code modifications at several hierarchy levels. Nevertheless, the guidelines can be used to
limit the number of possibilities.
A last word regarding Power Compiler expected features. Ideally, this paper should become
unnecessary in the future and Power Compiler would manage the whole clock gating structure by
itself. The only designer task would be the generation of the most representative simulation for
Power Compiler to find the lowest power clock gating with as many levels as needed.
During the elaboration stage, we could expect Power Compiler to determine the exact feedback
loop of all registers regardless of the hierarchy and mark it. Then, during clock tree compiling
within Physical Compiler, Power Compiler could adjust the clock gating according to the back-
annotated activity, adding or removing levels of clock gating, finding common enable factors,
sharing or splitting clock gating cells, etc. A runtime issue might arise when searching for
the optimal clock gating tree; to solve it, it should be possible to freeze the structure found
by Power Compiler for future synthesis runs.
Synopsys is already looking at introducing some of these enhancements in its future releases.

7 References
[1] How To Successfully Use Gated Clocking in an ASIC Design, by Darren Jones at SNUG Boston
2002.
