1996 - Power Minimization in IC Design Principles and Applications
Massoud Pedram
Department of EE-Systems
University of Southern California
Los Angeles CA 90089
Abstract
Low power has emerged as a principal theme in today’s electronics indus-
try. The need for low power has caused a major paradigm shift in which power
dissipation is as important as performance and area. This article presents an
in-depth survey of CAD methodologies and techniques for designing low power
digital CMOS circuits and systems and describes the many issues facing design-
ers at architectural, logic and physical levels of design abstraction. It reviews
some of the techniques and tools that have been proposed to overcome these diffi-
culties and outlines the future challenges that must be met to design low power,
high performance systems.
1. Introduction
In the past, the major concerns of the VLSI designer were area, performance,
cost and reliability; power considerations were mostly of secondary
importance. In recent years, however, this has begun to change and, increasingly,
power is being given comparable weight to area and speed. Several factors have
contributed to this trend. Perhaps the primary driving factor has been the remark-
able success and growth of the class of personal computing devices (portable
desktops, audio- and video-based multimedia products) and wireless communica-
tions systems (personal digital assistants and personal communicators) which
demand high-speed computation and complex functionality with low power con-
sumption.
tricity consumed and therefore, the less the impact on global environment, the
less the office noise (due to elimination of a fan from the desktop), and the less
stringent the environment/office power delivery and cooling requirements.
The motivations for reducing power consumption differ from application to
application. In the class of micro-powered battery-operated, portable applica-
tions, such as cellular phones and personal digital assistants, the goal is to keep
the battery lifetime and weight reasonable and the packaging cost low. Power lev-
els below 1-2 W, for instance, enable the use of inexpensive plastic packages. For
high performance, portable computers, such as laptop and notebook computers,
the goal is to reduce the power dissipation of the electronics portion of the system
to a point which is about half of the total power dissipation (including that of dis-
play and hard disk). Finally, for high performance, non-battery operated systems,
such as workstations, set-top computers and multimedia information processing
and communication systems, the overall goal of power minimization is to reduce
system cost (cooling, packaging and energy bill) and ensure long-term circuit
reliability. These different requirements impact how power optimization is
addressed and how much the designer is willing to sacrifice in cost or perfor-
mance to obtain lower power dissipation.
Our goal in writing this paper is to provide background and outlook for
people interested in using or developing low power design methodologies and
techniques. Even though we tried to be complete, some research work might have
been unintentionally left out. In addition, the description of various techniques
may be perceived as uneven at times because of the amount of coverage given to
certain topics; this is mainly due to our experience in using these methods for
building our power optimization and synthesis system, POSE.
The paper is organized as follows. First, we describe sources of power dis-
sipation in CMOS circuits and degrees of freedom in the low power design space.
We then present an in-depth survey (and in many cases analysis) of power estima-
tion and minimization techniques and describe some of the frontiers of the
research currently being pursued. We conclude by summarizing the major low
power design challenges that lie ahead.
\[ P = 0.5\, C_L\, V_{dd}^{2}\, E(sw)\, f_{clk} \tag{1} \]
where CL is the physical capacitance at the output of the node, Vdd is the supply
voltage, E(sw) (referred to as the switching activity) is the average number of out-
put transitions per 1/fclk time, and fclk is the clock frequency. The product of
E(sw) and fclk which is the number of transitions per second, is referred to as the
transition density in [101].
The term dynamic power dissipation refers to the sum of short circuit and
capacitive dissipations. Using the concept of equivalent short-circuit capacitance
described above, the dynamic power dissipation can be calculated using equation
(1) if we add CSC to CL. Short-circuit currents in CMOS circuits can be made
small with appropriate circuit design techniques. In most of this article, we will
thus focus on capacitive power dissipation.
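As a worked illustration of equation (1), the following minimal Python sketch evaluates the capacitive power of a single node. The function name and the numeric values (50 fF load, 3.3 V supply, activity of 0.5, 100 MHz clock) are assumptions chosen for illustration and are not taken from the text. An equivalent short-circuit capacitance can be folded in through the optional c_sc argument.

```python
def capacitive_power(c_load, vdd, e_sw, f_clk, c_sc=0.0):
    """Equation (1): P = 0.5 * (C_L + C_SC) * Vdd^2 * E(sw) * f_clk, in watts."""
    return 0.5 * (c_load + c_sc) * vdd ** 2 * e_sw * f_clk


# Hypothetical node: 50 fF load, 3.3 V supply, 0.5 transitions per cycle, 100 MHz.
p_node = capacitive_power(50e-15, 3.3, 0.5, 100e6)
print(f"{p_node * 1e6:.2f} uW")
```

Because the supply voltage enters quadratically, evaluating the same node at half the supply voltage returns one quarter of the power.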
3.1. Voltage
Because of its quadratic relationship to power, voltage reduction offers the
most effective means of minimizing power consumption; a factor of two reduc-
tion in supply voltage gives a factor of four decrease in power consumption. Fur-
thermore, this power reduction is a global effect that is experienced throughout
the entire design. In some cases designers are thus willing to accept increased
physical capacitance or circuit activity in exchange for reduced voltage. Unfortunately, we
pay a speed penalty for supply voltage reduction, with delays drastically increas-
ing as Vdd approaches the threshold voltage Vt of the devices. This tends to limit
the useful range of Vdd to a minimum of two to three times Vt.
One approach to reduce the supply voltage without loss in throughput is to
modify the Vt of the devices. Reducing the Vt allows the supply voltage to be
scaled down without loss in speed. The limit of how low the Vt can go is set by
the requirement to set adequate noise margins and control the increase in sub-
threshold leakage currents. The optimum Vt must be determined based on the cur-
rent gain of the CMOS gates at low supply voltage regime and control of the
leakage currents. Since the inverse threshold slope (S) of a MOSFET is invariant
with scaling [36], for every 80-100 mV (based on the operating temperature)
reduction in Vt, the subthreshold current will be increased by one order of magnitude.
As a rule, the “off-current” should remain two to three orders of
magnitude smaller than the “on-current”. This tends to limit Vt to about 0.3 V for
room temperature operation of CMOS circuits.
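The leakage argument above can be sketched numerically. The snippet assumes a purely exponential subthreshold characteristic with inverse slope S (in mV per decade), so that lowering Vt by S multiplies the off-current by ten. The function names and the example currents are hypothetical.

```python
def leakage_increase(delta_vt_mv, slope_mv_per_decade):
    """Factor by which subthreshold current grows when Vt is lowered by delta_vt_mv."""
    return 10.0 ** (delta_vt_mv / slope_mv_per_decade)


def off_current_acceptable(i_on, i_off, min_decades=2.0):
    """Rule of thumb: off-current at least min_decades orders below on-current."""
    return i_on >= i_off * 10.0 ** min_decades


# Hypothetical device: lowering Vt by 200 mV at S = 100 mV/decade.
factor = leakage_increase(200.0, 100.0)
print(factor, off_current_acceptable(1e-3, 1e-7 * factor))
```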
Another important concern in the low Vdd - low Vt regime is the fluctuation
in Vt. Basically, delay changes by 3x for a ΔVt of ±0.15 V when Vdd
equals 1 V. Such a large variation in nominal delay values cannot be tolerated.
This sets a major limitation on how low Vdd can go unless the Vt fluctuation is
cancelled by circuit techniques such as the self-adjusting threshold scheme that
reduces the Vt fluctuation to ±0.05 V when Vdd equals 1 V [69].
tions.
(a)                   (b)                   (c)
i    j                i    j                i    j
0→0  0→0  0→0         0→1  0→0  0→0         0→0  0→0  0→0
0→1  0→1  0→1*        0→1  0→1  0→1*        0→0  1→1  0→0
1→0  1→0  1→0*        0→1  1→0  0→0         0→1  0→1  0→1*
1→1  1→1  1→1         1→0  0→0  0→0         0→1  1→0  0→0
                      1→0  0→1  0→0         1→0  1→0  1→0*
                      1→0  1→0  1→0*        1→0  0→1  0→0
                      1→1  0→0  0→0         1→1  0→0  0→0
                      1→1  0→1  0→1*        1→1  1→1  1→1
                      1→1  1→0  1→0*
can thus account for the hazards in the circuit (see Figure 1). A real delay model
significantly increases the computational requirements of the power estimation
techniques while improving the accuracy of the estimates.
[Figure: waveform panel labeled "Real Delay Model", logic levels 0 and 1]
from Cadence Design) can be adapted to report power dissipation of the circuits
under user-specified input sequences. These techniques rely on macromodels
built for the gates in the ASIC library as well as detailed gate-level timing analy-
sis to produce power estimates quickly. Their accuracy depends heavily on the
quality of the macromodels, the glitch filtering scheme used and the accuracy of
physical capacitances provided at the gate level. The execution time is 3-4 orders
of magnitude shorter than SPICE [115]. Similarly, switch-level simulators (such
as IRSIM [125]) can be easily modified to report the switched capacitance (and
thus dynamic power dissipation) during a simulation run. Switch-level simulation
techniques are in general much faster than circuit-level simulation techniques, but
are not as accurate or versatile.
Most of the high level power prediction tools use profiling and simulation
to address data dependencies. Important statistics include the number of opera-
tions of a given type, the number of bus, register and memory accesses, and the
number of I/O operations executed within a given period [25] [72]. Instruction
level simulation or behavioral simulators are easily adapted to produce this infor-
mation.
Estimation of the average energy consumption per operation (cycle of
activity) in asynchronous (clockless) control circuits that use a two-phase signal-
ing protocol for request/acknowledge handshaking is described in [71]. The pro-
posed method requires pre-calculation of energy consumption per output
transition for a small set of predefined macro gates. Estimation of the average
energy consumption per external signal transition in a speed-independent asyn-
chronous control circuit is presented in [7]. The proposed method is simulative in
nature, but only requires a small number of input patterns proportional to the size
of the high-level specification for the circuit.
power can be dissipated by the cell. With the relevant parameters set according to
the user’s specs, a SPICE circuit simulation is invoked to accurately obtain the
power dissipation of each vector. During logic simulation, Aspen monitors the
transition count of each cell and computes the total power consumption as the
sum of the power dissipation for all cells in the power vector path.
\[ N > \left( \frac{t_{\alpha/2}\, \sigma}{\epsilon\, \eta} \right)^{2} \tag{2} \]
where tα/2 is defined so that the area to its right under the standard normal distri-
bution curve is equal to α/2. In estimating the total power consumption of the cir-
cuit, the convergence time of the MCS method is short when the error bound is
loose or the confidence level is low. Note however that the MCS method may
converge to a premature (thus wrong) power estimate if the sample density does
not follow a normal distribution (that is, if T was too small). Additionally, this
method does not handle spatial correlations at the circuit inputs.
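The stopping criterion of equation (2) can be sketched as follows. The sketch assumes that σ and η denote the standard deviation and mean of the sampled power values and that ε is the tolerated relative error at confidence level 1 − α; these readings, like the function name, are assumptions made for illustration. The quantile t_{α/2} is taken from the standard normal distribution, as in the definition above.

```python
import math
from statistics import NormalDist


def min_samples(sigma, eta, eps, alpha):
    """Smallest integer N with N > (t_{alpha/2} * sigma / (eps * eta))^2."""
    # Area to the right of t under the standard normal curve equals alpha / 2.
    t = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    bound = (t * sigma / (eps * eta)) ** 2
    return math.floor(bound) + 1


# Hypothetical sample statistics: sigma = 1, eta = 10, 5% error, 95% confidence.
print(min_samples(1.0, 10.0, 0.05, 0.05))
```

Tightening the error bound or raising the confidence level increases the required N, which matches the observation that convergence is fast when the bound is loose.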
Stopping criteria to obtain a specified switching activity accuracy at all
individual nodes in a circuit are proposed in [167]. In this case, the convergence
rate, which is determined by the “low-activity” nodes in the circuit, becomes very
slow. This problem is addressed by replacing the percentage error bound for these
nodes by an absolute error bound, thus allowing possibly large percentage error
on these nodes. The overall error however remains small because the contribution
of these nodes to the total power dissipation of the circuit is small.
The MCS method has been extended to finite state machines in [104] and
[30] where it is shown that choices of initial states and the length of warm-up
periods are critical for generating accurate power estimates. In general, the simu-
lation time for finite state machines is significantly higher than that for combina-
tional circuits of comparable size.
The issue of obtaining run-time and a priori estimates of the number of
input patterns for a specified accuracy is discussed in [53]. These estimates are
derived through the definition of a set of multinomial random variables and a set
of functions based on the parameters of these random variables.
single switching activity value which represents the average switching activity on
each input bit. In [73], a more detailed model is presented where it is projected
that data in the datapath of a digital system can be divided into two regions: the
Least Significant Bits (LSB) which act as uncorrelated white noise and the Most
Significant Bits (MSB) which correspond to sign bits and exhibit strong temporal
dependence. The power model thus uses two capacitance values and requires two
input switching activity values corresponding to the LSB and MSB regions. Both
models ignore the spatial correlations among bits of the same input or across bits
of different inputs.
A parametric model is described in [137], where the power dissipation of
the various components of a typical processor architecture is expressed as a
function of a set of primary parameters. The technique suffers from an abundance
of parameters, requires a lot of fine-tuning for specific architectures, and is sensi-
tive to mismatches in the modeling assumptions. A power estimation program
which combines analytical and stochastic techniques to provide fast and relatively
accurate power estimates at the system level is presented in [92].
Word-level behavior of a data input can be properly captured by its proba-
bility density function (pdf). Similarly, spatial correlation between two data
inputs can be captured by their joint pdf. This observation is used in [27][28] to
develop a probabilistic technique for behavioral level power prediction which
consists of four steps: 1) Building the joint pdf of the input variables of a data
flow graph (DFG) based on the given input vectors, 2) Computing the joint pdf
for any combination of internal arcs in the DFG, 3) Calculating the switching
activity at the inputs of each functional block or register in the DFG using the
joint pdf of the inputs and the data representation format which determines the
(bit-level) Hamming distances of (word-level) data values, 4) Estimating the
power dissipation of each functional block using the input statistics obtained in
step 3 and the library characterization data that gives the physical capacitance
information for each module in the library. This method is very robust, but suffers
from the worst-case complexity of joint pdf computation and inaccuracies associ-
ated with the library characterization data.
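Step 3 can be illustrated with a small sketch. It assumes that the joint distribution of consecutive word values at a block input is available as a discrete table mapping (previous value, next value) pairs to probabilities, and that words are stored in two's complement with a fixed bit width. Both the table layout and the function name are hypothetical.

```python
def expected_toggles(joint_pmf, width):
    """Expected number of bit transitions per word, from a joint pmf of
    (previous value, next value) pairs in two's complement of the given width."""
    mask = (1 << width) - 1
    total = 0.0
    for (prev, nxt), prob in joint_pmf.items():
        # Hamming distance between the two bit-level representations.
        total += prob * bin((prev ^ nxt) & mask).count("1")
    return total


# Hypothetical 4-bit input that alternates between -1 and 0 half of the time.
pmf = {(-1, 0): 0.25, (0, -1): 0.25, (0, 0): 0.5}
print(expected_toggles(pmf, 4))
```

Multiplying the result by the characterized capacitance per toggling input bit would give the switched capacitance used in step 4.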
An information theoretic approach is described in [89] and [103] which
relies on information theoretic measures of activity (for example, entropy) to
devise fast, yet accurate, power estimation at the algorithmic and structural
behavioral levels. The following summarizes the approach in [89]. Entropy char-
acterizes the uncertainty of a sequence of applied vectors and thus, intuitively, is
related to switching activity. Indeed, it is shown that, under the temporal indepen-
dence assumption, the average switching activity of a bit is upper-bounded by one
half of its entropy. For control circuits and random logic, given the statistics of
the input stream and having some information about the structure and functional-
ity of the circuit, the output entropy per bit is calculated as a function of the input
entropy per bit and a structure- and function-dependent information scaling fac-
tor. For dataflow graphs, the output entropy is calculated using a compositional
technique which has linear complexity in terms of the circuit size. Next the aver-
age entropy per circuit line is calculated and used as an estimate of the average
switching activity per signal line. This is then used to estimate the power dissipa-
tion of the module. A major advantage of this technique is that it is not simulative
and is thus very fast, yet it produces accurate power estimates.
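The bound relating switching activity and entropy can be checked numerically for a single bit. The sketch assumes a temporally independent bit with signal probability p, for which the switching activity is 2p(1 − p) (a 0→1 or a 1→0 transition between two independent samples). The function names are illustrative only.

```python
import math


def bit_entropy(p):
    """Binary entropy in bits of a signal that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def switching_activity(p):
    """Transition probability of a temporally independent bit."""
    return 2.0 * p * (1.0 - p)


# Compare activity against half the entropy on a grid of signal probabilities.
grid = [k / 100.0 for k in range(101)]
worst_margin = min(bit_entropy(p) / 2.0 - switching_activity(p) for p in grid)
print(worst_margin)
```

The margin is never negative on the grid and vanishes at p = 0.5, where both the activity and half the entropy equal 0.5.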
The above techniques apply to datapaths. Behavioral power prediction
models have also been proposed for the controller circuitry in [74][72]. These
techniques provide quick estimation of the power dissipation in a control circuit
based on the knowledge of its target implementation style (that is, precharged
pseudo-NMOS or dynamic PLA), the number of inputs, outputs, states, and so on.
The estimates can be made more accurate by introducing empirical parameters
that are determined by curve fitting and least-squares fit error analysis on real
data.
cuit inputs representing the signal probabilities of these inputs. Then, for each
internal circuit line, they compute algebraic expressions involving these vari-
ables. These expressions represent the signal probabilities for these lines. While
the algorithm is simple and general, its worst-case time complexity is exponential.
Approximate signal probability calculation techniques have been proposed in
[48] [126] [35] [130] and [70].
In [21], an exact procedure based on Ordered Binary-Decision Diagrams
(OBDDs) [16] is described which is linear in the size of the corresponding func-
tion graph (the size of the graph, however, may be exponential in the number of
circuit inputs). In this method, which is known as the OBDD-based method, the
signal probability at the output of a node is calculated by first building an OBDD
corresponding to the global function of the node (i.e., function of the node in
terms of the circuit inputs) and then performing a postorder traversal of the
OBDD using equation:
Figure 2 Computing the signal probability using OBDDs.
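A minimal sketch of a postorder signal-probability computation on a decision diagram follows. It assumes the standard Shannon-expansion rule prob(f) = p(x) prob(f with x = 1) + (1 − p(x)) prob(f with x = 0) and independent circuit inputs. The node layout (tuples of variable, low child, high child, with integers 0 and 1 as terminals) and all names are hypothetical and chosen only for illustration.

```python
def signal_probability(node, p_input, cache=None):
    """Postorder traversal of a decision diagram; each node is visited once."""
    if cache is None:
        cache = {}
    if isinstance(node, int):          # terminal node 0 or 1
        return float(node)
    key = id(node)
    if key not in cache:
        var, low, high = node
        p = p_input[var]
        cache[key] = (p * signal_probability(high, p_input, cache)
                      + (1.0 - p) * signal_probability(low, p_input, cache))
    return cache[key]


# Diagram for f = (x1 AND x2) OR x3 with variable order x1, x2, x3.
n3 = ("x3", 0, 1)
n2 = ("x2", n3, 1)
n1 = ("x1", n3, n2)
print(signal_probability(n1, {"x1": 0.5, "x2": 0.5, "x3": 0.5}))
```

The cache keyed on node identity is what keeps the cost linear in the number of diagram nodes, since shared subgraphs such as n3 are evaluated only once.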
lines and ignoring higher order correlation terms is described. The correlation
coefficient of two signals i and j is defined as:
\[ C(i, j) = \frac{\mathrm{prob}(i \wedge j)}{\mathrm{prob}(i)\, \mathrm{prob}(j)} \tag{5} \]
The correlation coefficients of signal i and complement signal j, comple-
ment signal i and signal j, etc. are defined similarly. Ignoring higher order correla-
tion coefficients, it is assumed that C(i,j,k) = C(i,j) C(i,k) C(j,k). The signal
probability of g is thus approximated by:
NOT gate: \( \mathrm{prob}(g) = 1 - \mathrm{prob}(i) \)
OR gate: \( \mathrm{prob}(g) = 1 - \prod_{i \in inputs} \left( 1 - \mathrm{prob}(i) \right) \cdot \prod_{j > i} C(\bar{i}, \bar{j}) \)
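For a two-input OR gate the pairwise approximation can be sketched as below. The sketch assumes that the coefficient entering the OR formula is the one between the complemented inputs, computed as in equation (5) from the probability that both inputs are 0; for two inputs the result is then exact. Function names and the example probabilities are hypothetical.

```python
def correlation(p_joint, p_a, p_b):
    """Equation (5): joint probability divided by the product of the marginals."""
    return p_joint / (p_a * p_b)


def not_probability(p_i):
    return 1.0 - p_i


def or2_probability(p_i, p_j, p_both_zero):
    """prob(g) = 1 - (1 - prob(i)) (1 - prob(j)) C(not i, not j)."""
    c_bar = correlation(p_both_zero, 1.0 - p_i, 1.0 - p_j)
    return 1.0 - (1.0 - p_i) * (1.0 - p_j) * c_bar


# Both inputs driven by the same signal (fully correlated) versus independent inputs.
print(or2_probability(0.5, 0.5, 0.5), or2_probability(0.5, 0.5, 0.25))
```

Ignoring the correlation would report 0.75 in both cases, whereas the fully correlated case really has an output probability of 0.5.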
The various transition probabilities can be computed exactly using the OBDD
representation of the logic function of x in terms of the circuit inputs.
In [87], the authors also describe a mechanism for propagating the transi-
tion probabilities and correlation coefficients through the circuit which is more
efficient because there is no need to build the global function of each node in
Figure 3 A Markov chain model for representing temporal correlations (states 0 and 1 with transition probabilities p00, p01, p10 and p11).
terms of the circuit inputs. The loss in accuracy is often small while the computa-
tional saving is significant. They then extend the model to account for spatio-tem-
poral correlations. The mathematical foundation of this extension is a four state
time-homogeneous Markov chain where each state represents some assignment of
binary values to two lines x and y and each edge describes the conditional proba-
bility for going from one state to the next. The computational requirement of this
extension is however high because it is linear in the product of the number of
nodes and number of paths in the OBDD representation of the Boolean function
in question. A practical method using local OBDD constructions is described by
the authors.
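The two-state chain of Figure 3 already shows how temporal correlation changes the switching activity of a single line. The sketch below assumes that p01 and p10 are the conditional probabilities of moving from 0 to 1 and from 1 to 0 in one time step, and that the chain is in its stationary regime. The function name and example values are hypothetical.

```python
def stationary_and_activity(p01, p10):
    """Stationary probability of state 1 and the per-step transition probability."""
    p1 = p01 / (p01 + p10)             # balance: prob(0) * p01 = prob(1) * p10
    p0 = 1.0 - p1
    activity = p0 * p01 + p1 * p10     # prob(0 -> 1) + prob(1 -> 0)
    return p1, activity


# Hypothetical line that tends to hold its value (small p01 and p10).
p1, activity = stationary_and_activity(0.1, 0.3)
print(p1, activity, 2.0 * p1 * (1.0 - p1))
```

For this example the signal probability is 0.25 and the activity is 0.15, well below the value 0.375 that a temporally independent model with the same signal probability would predict.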
This work has been extended to handle highly correlated input streams
using the notions of conditional independence and isotropy of signals [88]. Based
on these notions, it is shown that the relative error in calculating the signal proba-
bility of a logic gate using pairwise correlation coefficients can be bounded from
above.
The above techniques target average power dissipation. In some applica-
tions, peak power dissipation should also be estimated. In [37], a technique for
finding the two-vector input sequence that leads to maximum power dissipation in
a combinational circuit is described. More recently, a technique is presented that
computes the multiple-vector input sequence that leads to maximum average
power dissipation in a finite state machine [86].
Estimation under a Real Delay Model
The above methods only account for steady-state behavior of the circuit
and thus ignore hazards and glitches. This section reviews some techniques that
examine the dynamic behavior of the circuit and thus estimate the power dissipa-
tion due to hazards and glitches.
In [47], the exact power estimation of a given combinational logic circuit is
carried out by creating a set of symbolic functions that represent Boolean condi-
tions for all values that a node x in the circuit can assume at different time
instances under a pair of input vectors. The inputs to the created symbolic functions
are the circuit input lines at time instances 0⁻ and ∞. Each symbolic function
is the EXOR of the characteristic functions describing the logic values of node x
at two consecutive time instances (see Figure 4 for an example symbolic network
constructed under a unit delay model). The output of the EXOR gate evaluates to
one exactly when node x makes a transition between the two time instances. Sum-
ming the signal probabilities of these symbolic functions gives the average
switching activity at x. The process, which has to be repeated for all gates in the
circuit, is known as symbolic simulation. The major disadvantage of this estimation
method is that, for medium to large circuits, the symbolic formulae
become too large to build. However, for circuits to which the method is applicable,
and subject to the error introduced by the imperfect logic-level glitch propagation
scheme, its estimates can serve as a basis for comparison
among different approximation schemes.
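The idea of XOR-ing the values of a node at consecutive time instances can be illustrated on a toy circuit. The sketch below uses brute-force enumeration of input vector pairs as a stand-in for the symbolic construction, and assumes a unit delay model, equiprobable input pairs, and a hypothetical circuit y = a AND (NOT a) whose output is 0 in steady state but can glitch.

```python
from itertools import product


def output_trace(a_before, a_after, steps=3):
    """Values of y = a AND n, with n = NOT a, under unit gate delays."""
    n = 1 - a_before                   # steady state before the input changes
    y = a_before & n                   # always 0 in steady state
    trace = [y]
    for _ in range(steps):
        # Both gates evaluate on the values of the previous time instance.
        n, y = 1 - a_after, a_after & n
        trace.append(y)
    return trace


def expected_switching(steps=3):
    """Average number of output transitions over all equiprobable input pairs."""
    total = 0.0
    for a_before, a_after in product((0, 1), repeat=2):
        trace = output_trace(a_before, a_after, steps)
        # XOR of consecutive values is 1 exactly when the node makes a transition.
        total += 0.25 * sum(u ^ v for u, v in zip(trace, trace[1:]))
    return total


print(output_trace(0, 1), expected_switching())
```

A zero delay model would report no activity at y, while the unit delay enumeration captures the 0→1→0 glitch caused by a rising input.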
\[ E_x(sw) = \sum_{t \in eventlist(x)} \left( \mathrm{prob}\left( x^{t}_{0 \to 1} \right) + \mathrm{prob}\left( x^{t}_{1 \to 0} \right) \right). \tag{7} \]
Given such waveforms at the circuit inputs and with some convenient parti-
tioning of the circuit, the authors examine every sub-circuit and derive the corre-
sponding waveforms at the internal circuit nodes. In [100], an efficient
probabilistic simulation technique is described that propagates transition wave-
forms at the circuit primary inputs throughout the circuit and thus estimates the
total power consumption (ignoring signal correlations due to the reconvergent
fanout nodes).
A tagged probabilistic simulation approach is described in [150] that cor-
rectly accounts for reconvergent fanout and glitches. The key idea is to break the
set of possible logical waveforms at a node n into four groups, each group being
characterized by its steady state values (i.e., values at time instance 0⁻ and ∞).
Figure 5 Probability waveforms (panels: logic waveforms w1 to w4, each with probability 0.25; probability waveform; tagged waveforms w00, w01, w10 and w11).
Next, each group is combined into a probability waveform with the appropriate
steady-state tag (see Figure 5). Given the tagged probability waveforms at the
input of node n, it is then possible to compute the tagged probability waveforms
at its output. The correlation between probability waveforms at the inputs is
approximated by the correlation between the steady state values of these lines,
which is in turn calculated efficiently by describing the node function in terms of
some set of intermediate variables in the circuit. This approach requires signifi-
cantly less memory and runs much faster than symbolic simulation, yet achieves
very high accuracy, e.g., the average error in aggregate power consumption is
about 10%.
\[ D(y) = \sum_{i=1}^{n} P\left( \frac{\partial y}{\partial x_i} \right) D(x_i) \tag{8} \]
where y is the output of a node, xi’s are the inputs of the node, and the Boolean
difference of function y with respect to xi gives all combinations for which y
depends on xi. This equation, which can be thought of as a first-order Taylor poly-
nomial approximation of D(y), does not take simultaneous input switching into
account. The accuracy of transition density propagation equation can be improved
by using higher-order Boolean difference terms as in [29] [93] or by using a con-
ceptual low-pass filter to reduce the hazard count in the above equation as in
[102]. A major source of error is the assumption that xi’s are independent. This
assumption is however incorrect because xi’s tend to become correlated due to
reconvergent fanout structures in the circuit. This problem is solved by describing
y in terms of the circuit inputs, which are still assumed to be independent. In this
case, the accuracy is improved, but calculation of the Boolean difference terms
becomes very expensive. A compromise between accuracy and efficiency can be
reached by describing y in terms of some set of intermediate variables in the circuit.
One such technique, which relies on circuit partitioning and computation caching
using OBDDs, is described in [64].
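Equation (8) can be sketched for a small gate by enumerating the Boolean difference directly. The sketch assumes independent inputs with known signal probabilities and transition densities, and it uses exhaustive enumeration in place of OBDDs, which is only practical for gates with a few inputs. All names are hypothetical.

```python
from itertools import product


def boolean_difference_prob(func, index, probs):
    """Probability that flipping input `index` changes the output of func."""
    n = len(probs)
    others = [k for k in range(n) if k != index]
    total = 0.0
    for values in product((0, 1), repeat=len(others)):
        x = [0] * n
        weight = 1.0
        for k, v in zip(others, values):
            x[k] = v
            weight *= probs[k] if v else 1.0 - probs[k]
        x[index] = 0
        f0 = func(x)
        x[index] = 1
        f1 = func(x)
        if f0 != f1:                   # output depends on the input here
            total += weight
    return total


def transition_density(func, probs, densities):
    """Equation (8): D(y) = sum over i of P(dy/dx_i) * D(x_i)."""
    return sum(boolean_difference_prob(func, i, probs) * densities[i]
               for i in range(len(probs)))


and2 = lambda x: x[0] & x[1]
print(transition_density(and2, [0.5, 0.5], [1.0, 2.0]))
```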
considerably more difficult than that for combinational circuits for two reasons:
1) The probability of the circuit being in each of its possible states has to be cal-
culated; 2) The present state line inputs of the FSM are strongly correlated (that
is, they are temporally correlated due to the machine behavior as represented in
its State Transition Graph description and they are spatially correlated because of
the given state encoding).
\[ \mathrm{prob}(S_j) = \sum_{S_i \in instates(S_j)} \mathrm{prob}(S_i)\, \mathrm{prob}\langle S_j | S_i \rangle \tag{9} \]
where instates(Sj) is the set of fanin states of Sj in the STG. Given K states, we
obtain K equations out of which any one equation can be derived from the
remaining K - 1 equations. We have a final equation:
\[ \sum_{j} \mathrm{prob}(S_j) = 1. \tag{10} \]
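A small numerical sketch of equations (9) and (10) follows. Instead of solving the linear system directly, it repeatedly applies equation (9) to a normalized probability vector (power iteration), which is a substitute technique that converges only when the chain has a unique stationary distribution and is aperiodic. The matrix entry trans[i][j] is assumed to hold the conditional probability of moving from state S_i to state S_j; the two-state example is hypothetical.

```python
def state_probabilities(trans, iterations=1000):
    """Stationary state probabilities of an STG given its transition matrix."""
    k = len(trans)
    prob = [1.0 / k] * k               # starts normalized, as equation (10) requires
    for _ in range(iterations):
        # Equation (9): prob(S_j) = sum over fanin states of prob(S_i) * prob<S_j|S_i>.
        prob = [sum(prob[i] * trans[i][j] for i in range(k)) for j in range(k)]
    return prob


# Hypothetical two-state machine; each row sums to 1.
trans = [[0.9, 0.1],
         [0.5, 0.5]]
print(state_probabilities(trans))
```

Because every row of the matrix sums to 1, each update preserves the normalization of equation (10).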
\[ \begin{aligned} ps_1 &= f_1(pi, ps_1, ps_2, \ldots, ps_n) \\ &\;\;\vdots \\ ps_n &= f_n(pi, ps_1, ps_2, \ldots, ps_n) \end{aligned} \tag{11} \]
where psj denotes the state bit probability shared by the jth next state bit at the output
and the jth present state bit at the input of the FSM, and the fj’s are nonlinear
algebraic functions. The fixed point (or zero) of this system of equations
can be found using the Picard-Peano (or Newton-Raphson) iteration [80].
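The fixed-point view of system (11) can be sketched with a plain Picard iteration, in which the state bit probabilities are repeatedly substituted into the right-hand side until they stop changing. The one-bit machine used here (next state = input XOR present state, with independent input of probability pi) is a hypothetical example, and convergence of the iteration is assumed rather than guaranteed in general.

```python
def picard_fixed_point(f, ps0, tol=1e-12, max_iter=10000):
    """Iterate ps <- f(ps) until successive iterates differ by less than tol."""
    ps = list(ps0)
    for _ in range(max_iter):
        nxt = f(ps)
        if max(abs(a - b) for a, b in zip(nxt, ps)) < tol:
            return nxt
        ps = nxt
    raise RuntimeError("Picard iteration did not converge")


def toggle_fsm(pi):
    """Hypothetical one-bit FSM: next state = input XOR present state."""
    return lambda ps: [pi * (1.0 - ps[0]) + (1.0 - pi) * ps[0]]


print(picard_fixed_point(toggle_fsm(0.3), [0.0]))
```

For this machine the iteration settles at a state bit probability of 0.5 for any input probability strictly between 0 and 1.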
Increasing the number of variables or the number of equations in the above
system results in increased accuracy [148]. For a wide variety of examples, it is
shown that the approximation scheme is within 1-3% of the exact method, but is
orders of magnitude faster for large circuits. Previous sequential switching activ-
ity estimation methods exhibit significantly greater inaccuracies.
In [36], two CMOS device and voltage scaling scenarios are described, one
optimized for the highest speed and one trading off high performance for signifi-
cantly lower power (the speed of the low power case in one generation is about
the same as the speed of the high-performance case of the previous generation,
with greatly reduced power consumption). It is shown that the low power scenario
is very close to the constant electric-field (ideal) scaling theory. It is shown that a
7x improvement in speed and over two orders of magnitude improvement in
power-delay product (mW/MIPS) are expected by scaling of (bulk) CMOS down
to sub-0.1 micron region compared with high performance 0.6 micron devices at
5 volts. This paper also presents a discussion of how high the electric field in a
transistor channel can go without impacting the long term device reliability, while
at the same time achieving high performance and low power. Next the
speed/standby current trade-off is addressed, dealing with the issue of non-scal-
ability of the threshold voltage.
The status of the silicon-on-insulator (SOI) approach to scaled CMOS is also
reviewed in [36], showing the potential for 3x savings in power compared to
the bulk case at the same speed. The performance improvement of SOI compared
to bulk CMOS is mainly due to the reduction of parasitic capacitances and body
effect. Also, in partially depleted device designs, the floating body effect can give
rise to a sharper subthreshold slope (< 60 mV/dec) at high drain bias, which effec-
tively reduces the threshold voltage and can actually improve the performance at
a given standby current. In addition, CMOS on SOI offers significant reduction in
soft error rate, latch-up elimination, and simpler isolation which results in
reduced wafer fabrication steps. The main challenges are the availability of low
cost wafers with low defect density at high volumes, floating body effects on the
device and circuit operation, and heat dissipation through the buried oxide.
of the energy that is delivered from the power supply may be cycled back to the
power supply [3]; a given task may be partitioned between various hardware
modules or programmable processors or both so as to reduce the system-level
power consumption; memory optimizing transformations can be used to minimize
communications to and from the global memory modules [166]; and software
may be compiled so as to minimize the power dissipation when it is executed on a
given hardware platform [143].
In many synchronous applications much power is dissipated by the clock.
The clock is the only signal that switches all the time and it usually has to drive a
very large clock tree. Moreover in many cases the switching of the clock causes a
lot of additional unnecessary gate activity. For that reason, circuits are being
developed with controllable clocks. This means that from the master clock other
clocks are derived that can be slowed down or stopped completely with respect to
the master clock, based on certain conditions. The circuit itself is partitioned in
different blocks and each block is clocked with its own (derived) clock. The
power savings that can be achieved this way are very application dependent, but
can be significant.
In [141], the authors introduce a technique for saving power in the clock
tree by stopping the clock fed into idle modules. Sections of the clock tree are
turned on or off by gating the clock signals during the active or idle times of the
clocked elements as follows. Associated with every node of the clock tree is the
activity pattern, which is a binary string of 1’s and 0’s representing the active/idle
status of the node in each time slot. The leaves of the clock tree are sinks and their
activities are found from the high level description of the system. The activity
patterns of the internal nodes of the clock tree are computed successively by per-
forming bitwise OR operation on the activity patterns of their children. Signifi-
cant power savings have been reported.
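The bottom-up computation of activity patterns can be sketched as follows. Activity patterns are stored as integers whose bits stand for time slots (1 for active, 0 for idle), and the clock tree is given as a mapping from each internal node to its children. The tree shape, node names and patterns are hypothetical.

```python
def activity_patterns(children, leaf_patterns, root):
    """Activity pattern of every node: bitwise OR of the patterns of its children."""
    patterns = dict(leaf_patterns)

    def visit(node):
        if node not in patterns:       # internal node not yet computed
            value = 0
            for child in children[node]:
                value |= visit(child)
            patterns[node] = value
        return patterns[node]

    visit(root)
    return patterns


# Hypothetical clock tree with three sinks and four time slots.
children = {"root": ["a", "b"], "a": ["s1", "s2"], "b": ["s3"]}
leaves = {"s1": 0b1100, "s2": 0b0110, "s3": 0b0001}
result = activity_patterns(children, leaves, "root")
print({name: format(p, "04b") for name, p in result.items()})
```

A branch whose pattern has a 0 in some time slot drives only idle sinks in that slot, so its clock can be gated off there.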
Asynchronous architectures use event-driven handshaking that requests
operations to execute only when they are needed, thereby systematically perform-
ing what can be considered optimal gated clocking. The disadvantage is that the
handshaking control overhead has traditionally limited performance and margin-
ally increased area. For some applications, such as a compact digital cassette
error corrector chip set, the performance requirements are easily met and the
low-power advantages of completely asynchronous design have yielded an energy
savings of up to a factor of five compared with synchronous counterparts [10]. In
addition, the ongoing project to implement a fully compatible low-power asyn-
results in reduced internal memory size, and the design of digital and analog cir-
cuits optimized for low supply voltages.
A number of other power saving techniques have been applied at the algorithm
and system level. The interested reader is referred to [91] for a recent review of
power optimization techniques at this level.
solved optimally in polynomial time using a max-cost flow algorithm. This algo-
rithm accounts for the switched capacitance in a hardware-shared design due to
transitions between values of signals in the same iteration of a loop.
In [120] and [121], simulation and profiling are used to construct switched
capacitance matrices for each type of library module. Entry (i,j) of this matrix
represents the switched capacitance for the instance i of the module when its
input j changes. The proposed module and register binding algorithms are based
on heuristic or integer linear programming techniques for solving the same prob-
lems. In [122], an iterative improvement algorithm for performing concurrent
scheduling, clock selection, resource allocation and binding with the aim of
reducing power consumption in synthesized datapath circuits is presented.
Results show that a sizeable reduction in power is possible.
Capacitances for the I/O and the global busses are significantly larger than
those for the internal circuitry. It is therefore essential to develop techniques for
reducing the activity on the I/O pins and the busses. An instruction encoding and
scheduling scheme based on Gray coding is presented in [136] which minimizes
switching activity in the instruction unit (and the address bus) of a
high-performance microprocessor. A Bus-Invert method for minimizing the activity
on I/O pins is proposed in [134]. The idea is to add an extra line to the bus which
indicates whether the value being transmitted is the true or the complemented form
of the intended value. Depending on the value transmitted in the previous cycle, a
decision is made to transmit either the true or the complemented value on the bus
so as to minimize the bit activity on the bus.
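The Bus-Invert idea can be illustrated with a short Python sketch. The decision rule used here (invert when more than half of the data lines would otherwise toggle) is the commonly quoted majority rule and is an assumption of this example, as are the function names and the all-zero initial bus state; the text above only states that the decision depends on the previously transmitted value.

```python
def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_bit) pairs for the given data words."""
    mask = (1 << width) - 1
    prev_bus = 0                               # assumed initial state of the bus lines
    encoded = []
    for w in words:
        w &= mask
        distance = bin(w ^ prev_bus).count("1")    # lines that would toggle
        if 2 * distance > width:               # more than half: send the complement
            bus, inv = (~w) & mask, 1
        else:
            bus, inv = w, 0
        encoded.append((bus, inv))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if inv else bus for bus, inv in encoded]


words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA]         # hypothetical 8-bit traffic
encoded = bus_invert_encode(words, 8)
decoded = bus_invert_decode(encoded, 8)
```

With this rule, no more than half of the data lines toggle in any cycle, at the cost of one extra line that may itself toggle.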
Another low power I/O encoding method based on transition signalling
(instead of the usual level signalling) and limited-weight codes, is also described
in the same reference. These methods resulted in an average reduction of 25% in
power dissipation under a binomial distribution of the distance between
consecutive patterns. Methods to implement low-activity arithmetic units based
on the one-hot residue coding of the input operands are presented in [31]. A CMOS
implementation of a direct digital frequency synthesizer for frequency-hopped
spread spectrum communication systems using this technique resulted in an almost
2X reduction in the power-delay product compared to a conventional,
fully-encoded design.
g1 = 1 ⇒ f = 1 (12)
g2 = 1 ⇒ f = 0 (13)
Figure 7 A precomputation architecture for sequential circuits.
(Labels in the figure: R1, R2, R3, A, f, g1, g2, En.)
produce the optimal retiming solution because the retiming of a single node can
dramatically change the switching activity of many other nodes in the circuit.
The authors report that the power dissipated by the 3-stage pipelined circuits
obtained by retiming for low power with a delay constraint is about 8% less
than that obtained by retiming for the minimum number of flip-flops given a delay
constraint.
Synthesis of FSMs with Gated Clock
A technique for automatic synthesis of FSMs with gated clocks that
reduces the power dissipation is presented in [8]. The idea is to modify the
flip-flop based FSM architecture by adding a new activation signal whose pur-
pose is to selectively stop the local clock for the FSM when the machine is idle
and does not perform state transitions. The activation function is implemented in
the form of a combinational logic block that uses as its inputs the primary inputs
and the state lines of the FSM. Applying this technique to some FSM circuits has
resulted in large power savings.
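A simple way to picture the activation function is to collect the (state, input) pairs for which the machine stays in its current state. The following Python sketch does this for a tiny, hypothetical FSM; it assumes a Moore-type machine (outputs depend only on the state), so that holding the state during a self-loop does not change the observable behavior. It is an illustration of the idea, not the synthesis procedure of [8].

```python
def idle_conditions(transitions):
    """Return the set of (state, input) pairs that are self-loops.

    transitions maps (state, input) -> next_state (hypothetical representation).
    """
    return {(s, x) for (s, x), nxt in transitions.items() if nxt == s}


def clock_enable(transitions, state, x):
    """Activation signal: the local clock is needed only when the state changes."""
    return (state, x) not in idle_conditions(transitions)


# Hypothetical two-state machine with a one-bit input.
transitions = {
    ("IDLE", 0): "IDLE",
    ("IDLE", 1): "BUSY",
    ("BUSY", 0): "BUSY",
    ("BUSY", 1): "IDLE",
}

idle = idle_conditions(transitions)
```

In hardware, the set of idle conditions would be implemented as combinational logic over the primary inputs and state lines, as described above; the savings depend on how often these conditions occur and on the cost of that extra logic.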
State Assignment
State assignment of a finite state machine (which is the process of assign-
ing binary codes to the states) has a significant impact on the area of its final logic
implementation. In the past, many researchers have addressed the encoding prob-
lem for minimum area of two-level or multi-level logic implementations. These
techniques can be modified to minimize the power dissipation. One approach is to
minimize the switching activity on the present state lines of the machine by giv-
ing minimum-distance (ideally uni-distance) codes to states with high transition
frequencies to one another [124]. In [49], a fully implicit encoding algorithm for
reducing the average number of bit changes per state transition is presented.
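The objective behind these encodings, the expected number of state-bit changes per cycle, is easy to evaluate for a given code assignment. The sketch below is a minimal illustration with a hypothetical four-state machine and made-up transition probabilities; it only computes the cost of a given encoding and does not reproduce the algorithms of [124] or [49].

```python
def hamming(x, y):
    """Number of bit positions in which two codes differ."""
    return bin(x ^ y).count("1")


def expected_bit_changes(codes, trans_prob):
    """Average number of state bits that toggle per cycle.

    codes maps state -> integer code; trans_prob maps (s, t) -> probability
    that the transition s -> t is taken in a cycle (hypothetical inputs).
    """
    return sum(p * hamming(codes[s], codes[t]) for (s, t), p in trans_prob.items())


# Hypothetical machine: A<->B transitions are frequent, C<->D are rare.
trans_prob = {("A", "B"): 0.4, ("B", "A"): 0.4, ("C", "D"): 0.1, ("D", "C"): 0.1}

naive = {"A": 0b00, "B": 0b11, "C": 0b01, "D": 0b10}       # frequent pair at distance 2
low_power = {"A": 0b00, "B": 0b01, "C": 0b11, "D": 0b10}   # frequent pair at distance 1

cost_naive = expected_bit_changes(naive, trans_prob)
cost_low_power = expected_bit_changes(low_power, trans_prob)
```

Giving uni-distance codes to the frequently exchanged states halves the cost in this example; as noted in the text, this measure says nothing about the power consumed in the combinational logic.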
The above formulation however ignores the power consumption in the
combinational logic that implements the next state and output logic functions. In
an attempt to account for power consumption in the combinational part of the
FSM, the authors of [106] minimize a linear combination of the number of state
bits that change every cycle and the number of literals in a multi-level logic
implementation of the FSM using a genetic local search algorithm. A more effec-
tive approach is presented in [149] where the complexity of the combinational
logic resulting from the state assignment is considered by modifying the objective
functions used in the conventional encoding schemes such as NOVA [162] and
JEDI [84] to achieve lower power dissipation. Experimental results on a large
number of benchmark circuits show 10% and 17% power reductions for two-level
and multi-level logic implementations compared to NOVA and JEDI, respectively.
Multi-Level Network Optimization
Network don’t cares can be used for minimizing the intermediate nodes in
a Boolean network [127]. Two multi-level network optimization techniques for
low power are described in [131] and [57]. The main difference between the pro-
cedure in [127] and the low power procedures is in the cost function used during
the two-level logic minimization. The new cost function minimizes a linear com-
bination of the number of product terms and the switched capacitance. In addi-
tion, the authors of [57] consider how changes in the global function of an
internal node affect the switching activity (and thus, the power consumption) of
nodes in its transitive fanout. The paper presents a greedy, yet effective, network
optimization procedure as summarized below.
The procedure presented in [57] proceeds in a reverse topological fashion
from the circuit outputs to the circuit inputs simplifying fanouts of a node before
reaching that node. Once a node n is simplified, the procedure propagates those
don’t care conditions which could only increase (or decrease) the signal probabil-
ity of that node if its current signal probability is greater than (less than or equal
to) 0.5. This will ensure that as nodes in the transitive fanin of n are being simpli-
fied, the switching activity of n will not increase beyond its value when node n
was optimized. Power consumption in a combinational logic circuit has been
reduced by some 10% as a result of this optimization.
The above restriction on the construction of the observability don’t care (ODC)
sets may be overly constraining for the resynthesis process. In [79], a node
simplification procedure is presented that identifies good candidates for
resynthesis, that is, nodes where a local
change in their activity plus the change in activity throughout their transitive
fanout nodes, reduces the power consumption in the circuit. Both (delay-indepen-
dent) functional activity and (delay-dependent) spurious activity are considered.
The node simplification process itself consists of using the appropriate don’t
care set to minimize the area cost of the node. In [165] and [59], this procedure is
modified to minimize the power cost of the node. First, consider an example that
illustrates the difference between minimizing area and power cost of the node.
Assume a, b and c are uncorrelated signals with p(a) = 0.9, p(b) = p(c) = 0.5 and
the following two-level implementations of node f:
F1 = a.b + b.c
F2 = a.b + a'.b.c
where a' denotes the complement of a. Under the temporal independence assumption, we obtain:
P(F1) = E(a) + 2E(b) + E(c) + E(a.b) + E(b.c) + E(F1) = 3.04
P(F2) = 2E(a) + 2E(b) + E(c) + E(a.b) + E(a'.b.c) + E(F2) = 2.89
where P(f) denotes the power cost (that is, switched capacitance) of function f and
all its inputs. This example shows that implementation F2 provides a better power
solution in spite of including a non-prime implicant. Even though the implemen-
tation for a non-prime implicant requires more literals and more transistors, over-
all, less power is consumed.
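The two power costs can be checked numerically. The sketch below assumes that the switching activity of a signal with probability p is E = 2p(1 - p) (temporal independence), that the inputs are independent, and that the second product term of F2 is a'.b.c, with a' the complement of a, so that F1 and F2 realize the same function b.(a + c). Under these assumptions the script gives values of about 3.05 and 2.90, which agree with the figures quoted above to within rounding.

```python
def activity(p):
    """Switching activity of a signal with probability p under temporal independence."""
    return 2.0 * p * (1.0 - p)


pa, pb, pc = 0.9, 0.5, 0.5

p_ab = pa * pb                         # product term a.b
p_bc = pb * pc                         # product term b.c (prime implicant)
p_nabc = (1.0 - pa) * pb * pc          # product term a'.b.c (non-prime implicant)
p_f = pb * (1.0 - (1.0 - pa) * (1.0 - pc))   # f = b.(a + c)

# Each input is counted once per product term it feeds, as in the text.
P_F1 = (activity(pa) + 2 * activity(pb) + activity(pc)
        + activity(p_ab) + activity(p_bc) + activity(p_f))
P_F2 = (2 * activity(pa) + 2 * activity(pb) + activity(pc)
        + activity(p_ab) + activity(p_nabc) + activity(p_f))
```

The saving comes from the term a'.b.c, which is almost never true when p(a) = 0.9 and therefore rarely switches, whereas b.c switches often.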
This observation motivates the definition of power prime implicants
(PPIs) in [147] and [59]. A PPI is an implicant whose power cost is strictly less
than the power cost of all implicants that contain it. PPIs thus define the set of all
implicants that are sufficient and necessary for obtaining a minimum power solu-
tion. Given a function f and its don’t care set, an algorithm for generating the set
of all PPIs of f is presented in [59]. Using this set, the minimum power solution
for a two-level function is then generated by solving a minimum covering prob-
lem. The main difficulty in generating a minimum power solution is that com-
pared to a minimum area solution which only requires prime implicants, more
implicants need to be considered while solving the covering problem. An upper
bound on the expected number of PPIs that will be generated, is derived in [59].
This average-case analysis shows that assuming uniformly distributed values for
the input signal probabilities, the number of power prime implicants of a function
is linearly proportional to the number of prime implicants of the function where
the proportionality constant is < 4/3 times the number of inputs to the function.
Figure 9 Two decompositions with equal literal counts but different power.
Function: f = ab + ac + bc (inputs a, b, c)
Config. A: g = a + b, f = cg + ab, PA = 2Ea + 2Eb + Ec + Eg + Ef
Config. B: h = b + c, f = ah + bc, PB = Ea + 2Eb + 2Ec + Eh + Ef
of divisors in terms of their literal saving factors).
Path Balancing
To reduce spurious activity in a circuit, the delays of all true paths that
converge at each gate must be roughly balanced. This is because balancing path
delays leads to nearly simultaneous switching on input signals to a gate, and thus
eliminates possible hazards at the output of the gate (see Figure 9). This in turn reduces
the average power dissipation in the circuit. Path balancing can be achieved
before technology mapping by selective collapsing and logic decomposition or
after technology mapping by delay insertion and pin reordering.
The rationale behind selective collapsing is that by collapsing the fanins of
a node into that node, the arrival time at the output of the node can be changed.
Logic decomposition and extraction can be performed so as to minimize the level
difference between the inputs of nodes which are driving high capacitive nodes.
Additionally by inserting variable-delay buffers in a circuit, the delays of all
paths in the circuit can be made equal. The key issue in delay insertion is to use
the minimum number of delay elements to achieve the maximum reduction in
spurious switching activity. Path delays may sometimes be balanced by appropri-
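Figure 11 Clustering solutions with equal logic depth but different power.
(Panels: Clustering A and Clustering B, each realizing f1 from inputs a, b, c, d, e.
Switching activities shown beside the nodes: 0.43, 0.49, 0.49 in Clustering A and
0.43, 0.38, 0.38 in Clustering B.)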
Two example clustering solutions are shown in Figure 11 where the solution
on the left is obtained by Lawler’s algorithm while the solution on the right
corresponds to the power- and delay-optimal clustering solution (the maximum
cluster size is seven). In this example, all input activities are set to 0.5 and the numbers
shown beside the nodes represent their switching activities obtained by symbolic
simulation of the Boolean network. Both solutions have a depth of two. However,
the power cost (switched capacitance) of inter-cluster lines in Clustering A is 1.3
while that in Clustering B is 0.65. Experimental results indicate that, on average,
a 25% improvement in power dissipation of multi-level Boolean circuits is
obtained without any increase in circuit delay (assuming that the physical
capacitance on inter-cluster lines is much higher than the capacitance on intra-cluster
lines).
Technology Decomposition
This is the problem of converting a set of Boolean equations (or a Boolean
network) to another set (or another network) consisting of only two-input NAND
and inverter gates. It is difficult to come up with a NAND decomposed network
which will lead to a minimum power implementation after technology mapping
since gate loading and mapping information are unknown at this stage. Neverthe-
less, it has been observed that a decomposition scheme which minimizes the sum
of the switching activities at the internal nodes of the network, is a good starting
point for power-efficient technology mapping.
[Figure: AND decomposition of a P-type dynamic gate with inputs a, b, c, d, where
p(a) = 0.3, p(b) = 0.4, p(c) = 0.7 and p(d) = 0.5.
Config. A: EA(sw) = p(ab) + p(abc) + p(abcd) = 0.246
Config. B: EB(sw) = p(ab) + p(cd) + p(abcd) = 0.512]
Given the switching activity value at each input of a complex node, a pro-
cedure for AND decomposition of the node is described in [151] which minimizes
the total switching activity in the resulting two-input AND tree under a
zero-delay model. The principle is to inject high switching activity inputs into the
decomposition tree as late as possible. The decomposition procedure (which is
similar to Huffman’s algorithm for constructing a binary tree with minimum aver-
age weighted path length) is optimal for dynamic CMOS circuits and produces
very good results for static CMOS circuits. An example is shown in Figure 12
where the input signal with the highest switching activity (that is, signal d) is
injected last in the decomposition tree in configuration A, thus yielding lower
power dissipation for this configuration.
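The totals for the two configurations of the example can be reproduced with a small evaluator for two-input AND trees. The sketch assumes independent inputs with p(a) = 0.3, p(b) = 0.4, p(c) = 0.7 and p(d) = 0.5, and takes the switching activity of each AND output to be its probability of being 1, which matches the dynamic-gate expressions given with the figure. The tuple-based tree representation is an assumption of the example, and the code evaluates given trees rather than implementing the decomposition procedure of [151].

```python
def tree_activity(tree, prob):
    """Return (output probability, summed activity of all AND nodes) for a tree.

    A tree is either an input name or a pair (left, right) denoting a
    two-input AND of two subtrees (hypothetical representation).
    """
    if isinstance(tree, str):              # primary input: no internal node
        return prob[tree], 0.0
    left_p, left_sum = tree_activity(tree[0], prob)
    right_p, right_sum = tree_activity(tree[1], prob)
    p = left_p * right_p                   # AND of independent signals
    return p, left_sum + right_sum + p


prob = {"a": 0.3, "b": 0.4, "c": 0.7, "d": 0.5}
config_a = ((("a", "b"), "c"), "d")        # chain: ((a.b).c).d
config_b = (("a", "b"), ("c", "d"))        # balanced: (a.b).(c.d)

_, total_a = tree_activity(config_a, prob)
_, total_b = tree_activity(config_b, prob)
```

The balanced tree is penalized by the node c.d, whose probability of 0.35 is far larger than that of any node in the chain.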
In general, the low power technology decomposition procedure reduces the
total switching activity in the networks by 5% over the conventional balanced tree
decomposition method.
A different technology decomposition technique is described in [99]. This
technique, which again exploits Huffman’s algorithm, aims at minimizing the
total number of transitions in the binary decomposed tree (including glitches).
Under a non-zero delay model and with certain assumptions about spacing of the
input arrival times and lack of buffers, the paper presents an optimal algorithm for
achieving a minimum transition count decomposition. The paper however ignores
the probabilistic nature of logic transitions at the inputs.
Technology Mapping
This is the problem of binding a set of logic equations (or a Boolean
network) to the gates in some target cell library. A successful and efficient solution
to the minimum area mapping problem was suggested in [66] and implemented in
programs such as DAGON and MIS. The idea is to reduce technology mapping to
DAG covering and to approximate DAG covering by a sequence of tree coverings
which can be performed optimally using dynamic programming.
The problem of minimizing the average power consumption during tech-
nology mapping is addressed in [151],[142] and [81]. The general principle is to
hide nodes with high switching activity inside the gates where they drive smaller
load capacitances (see Figure 13).
The approach presented in [151] consists of two steps. In the first step,
power-delay curves (that capture power consumption versus arrival time
trade-off) at all nodes in the network are computed. In the second step, the map-
ping solution is generated based on the computed power-delay curves and the
required times at the primary outputs. For a NAND-decomposed tree, subject to
load calculation errors, this two-step approach finds the minimum power mapping
satisfying any delay constraint if such a solution exists. Compared to a
technology mapper that minimizes the circuit delay, this procedure leads to an average
reduction of 18% in power consumption at the expense of a 16% increase in area
without any degradation in performance.
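A power-delay curve can be thought of as a list of non-inferior (arrival time, power) points. The sketch below shows, under that simplified view, how dominated points might be pruned and how a point could be selected for a given required time. The function names and sample points are hypothetical, and the sketch leaves out the per-node dynamic programming over library matches that an actual mapper such as the one in [151] performs.

```python
def prune_dominated(points):
    """Keep only non-inferior (arrival_time, power) points.

    A point is dropped if another point is no later and consumes no more power.
    """
    frontier = []
    best_power = float("inf")
    for t, p in sorted(points):            # increasing arrival time, then power
        if p < best_power:                 # strictly better power than all earlier points
            frontier.append((t, p))
            best_power = p
    return frontier


def pick_min_power(frontier, required_time):
    """Return the lowest-power point meeting the required time, or None."""
    feasible = [pt for pt in frontier if pt[0] <= required_time]
    return min(feasible, key=lambda pt: pt[1]) if feasible else None


# Hypothetical candidate mappings at one node: (arrival time, power).
points = [(3, 10), (4, 7), (4, 9), (5, 8), (6, 4)]
frontier = prune_dominated(points)
choice = pick_min_power(frontier, required_time=5)
```

Keeping only the frontier is what makes the two-step scheme workable: the first step stores a small curve per node, and the second step reads off the cheapest point that still meets timing.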
Generally speaking, the power-delay mapper reduces the number of high
switching activity nets at the expense of increasing the number of low switching
activity nets. In addition, it reduces the average load on the nets. By taking these
two steps, this mapper minimizes the total weighted switching activity and hence
tively constant irrespective of the number of NMOS transistors that are on. There-
fore the power cost for a product (AND) term is given by:
$$V_{dd} \cdot I_{dc} \cdot \mathrm{prob}^{0}_{AND} \qquad (14)$$
$$\frac{V_{dd}^{2} \cdot f}{2} \cdot \left( \sum_{i=1}^{k} C_i E_i(sw) + C_{AND}\,\mathrm{prob}^{0}_{AND} + 2 C_{clock} \right) \qquad (15)$$
where Ci is the gate capacitance seen by the ith input of the AND term, CAND is
the load capacitance that the AND term is driving, and Cclock is the load capaci-
tance of the precharge and evaluate transistors that the clock drives and f is the
clock frequency.
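As a numerical illustration of expression (15), the following sketch evaluates the power cost of one AND term. All component values (supply voltage, frequency, capacitances, activities and the probability term) are hypothetical and chosen only to show the orders of magnitude involved; the function name is likewise an assumption of the example.

```python
def dynamic_and_term_power(vdd, f, input_caps, input_activities,
                           c_and, prob_and, c_clock):
    """Evaluate (Vdd^2 * f / 2) * (sum_i Ci*Ei(sw) + C_AND*prob_AND + 2*C_clock).

    Capacitances are in farads, vdd in volts, f in hertz; the result is in watts.
    """
    switched = sum(c * e for c, e in zip(input_caps, input_activities))
    switched += c_and * prob_and + 2.0 * c_clock
    return 0.5 * vdd ** 2 * f * switched


# Hypothetical three-input product term.
power = dynamic_and_term_power(
    vdd=3.3,
    f=50e6,
    input_caps=[10e-15, 10e-15, 10e-15],
    input_activities=[0.5, 0.5, 0.2],
    c_and=50e-15,
    prob_and=0.25,
    c_clock=20e-15,
)
```

With these numbers the clock term 2*C_clock accounts for well over half of the switched capacitance, which shows why the clock load appears explicitly in the cost.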
Fortunately, in both cases it has been shown that the optimum two-level
cover will consist of only prime implicants [147] [60]. The resulting minimization
problems can be solved exactly by changing the cost function used in the
Quine-McCluskey procedure or the Espresso heuristic minimizer [14]. In general,
optimization for power resulted in a 5% increase in the number of cubes of the
function while reducing the power by an average of 11%.
cluster tree that captures the connectivity among modules. The optimal floorplan
topology, block shapes and room assignments, and pin positions (or block orien-
tations) are determined during a preorder traversal of this tree [169] [109]. The
two dimensional shape function curves can be indexed by the power cost, that is,
for each distinct power dissipation value, one shape function is built. These
indexed shape functions can then be used during the preorder traversal to com-
pute the optimal power solution which also leads to minimum chip area (see [26]
for details).
Placement refers to the process of assigning locations to gates in a circuit
netlist. Placement algorithms can be easily modified to minimize the power dissi-
pation. For example, a common placement algorithm for small-cell ICs is to for-
mulate the problem as a constrained mathematical programming problem and
then solve it in two phases: global optimization and slot assignment [145] [68].
The objective function is the sum of squares of net lengths while the constraints
are center-of-mass and/or path-based timing constraints. The only change needed
in the low power formulation is to use the sum of squares of switched capaci-
tances as the objective function during each phase [158]. With this modification,
an average power reduction of 8% has been obtained compared to the minimum
net length solution without any increase in circuit delay.
Global and Detailed Routing
Global routing produces routing trees for all nets in the circuit so as to min-
imize the interconnect length and/or chip area. The routing trees for multi-termi-
nal nets are often constructed as Rectilinear Spanning or Steiner trees. In routing
a single net to achieve lower power dissipation, the goal is to minimize the physi-
cal capacitance which coincides with the minimum length objective used in con-
ventional routing. When routing a collection of nets in fixed-size routing channels
(e.g., Gate Array or FPGA layouts), in variable-width routing channels (e.g.,
Standard Cell layout) or in general area (e.g., General Cell layouts), the
difference between minimizing the total physical capacitance and the total switched
capacitance comes to the surface. In the following, Standard Cell layout will be used
as an example.
Both sequential [123] and parallel [32] [77] routing algorithms for routing
in Standard Cell layouts have been proposed. Sequential routing algorithms can
be modified to produce a minimum-power routing solution by simple net weighting
where the net weights are derived from the switching activity values of the driver
gates. Nets with higher weights are given priority during routing and thus tend to
assume their shortest possible routes. In contrast, low activity nets may encounter
blockages, congestion, etc. and thus tend to assume longer lengths than is ideally
possible. Alternatively, one can modify the feedthrough insertion and net segment
assignment steps in the parallel global routers to generate tree connections with
smaller lengths for nets that are driven by gates with higher switching rates [159].
Experimental results have produced only marginal improvements in power dissi-
pation. This is because global routing is a complex process where the net lengths
and channel congestion are dictating the routing solution for each net; an extra
weighting factor for the nets can only produce a sizeable difference in the final
result if net activities (especially on large nets where global routers have many
options to route them) are drastically different. This condition was not met in the
examples attempted in [159].
Detailed routing produces the wiring geometries and layer assignments
within a routing channel, switchbox or general area. To reduce power dissipation
during detailed routing, one can give high priority to active nets in using the
available routing resources (e.g., tracks, layers). Power dissipation due to
cross-talk can be minimized by ensuring that wires carrying high activity signals
are placed sufficiently far from the other wires.
Transistor and Gate Sizing
If performance were not a design constraint, design for low (capacitive)
power would be achieved by using minimum-sized gate versions everywhere. The
gate sizing problem is thus to find a minimum power solution subject to meeting a
given delay constraint.
An efficient approach to continuous (generator-based) gate sizing for low
power is to linearize the path-based timing constraints and use a linear program-
ming solver to find the global optimum solution [11]. This work has been
extended to handle setup and hold time constraints in [139]. The drawbacks of
this approach are the omission of slope factor (input ramp time) for input wave-
forms from the delay model and use of a simple power dissipation model that
ignores short-circuit current. The LP-based cell selection algorithm can be easily
extended to account for the short-circuit power dissipation as described in [111].
A heuristic technique for discrete (library-based) gate sizing for minimum
power subject to a given delay constraint is described in [140]. The idea is to start
with minimum-sized gate versions, and then size up gates along the paths with
negative slacks (that is, critical paths) so as to satisfy the constraints while
increasing the switched capacitance of the circuit minimally. Alternatively, one
may start with the fastest possible design and then size down the gates along the
paths with positive slack (compared to the given delay constraint) so as to
maximize the reduction in switched capacitance. Another technique, presented in [82],
starts with a circuit that satisfies the timing constraint and sizes down certain
gates (which are not necessarily on the non-critical paths) to reduce the power
dissipation. The shortcoming of these approaches is their greedy nature, which
leads to sizing one gate at a time.
The discrete gate sizing problem is a special case of the technology mapping
problem and thus the dynamic programming technique can be applied to build the
power-delay trade-off curves during a postorder traversal of the circuit and then
perform the gate selection during a preorder traversal so as to satisfy the delay
constraints while minimizing the switched capacitance.
In [12], the problem of transistor sizing in a static CMOS layout to minimize
the capacitive plus short-circuit power dissipation is addressed. It is shown that the
power-optimal size for the transistors in a gate that is driving a given load, can be
larger than minimum size. The authors next derive the power-delay optimal sizes
for these transistors and present a greedy algorithm for calculating the optimal
power sizing subject to a given delay constraint for all gates in a circuit. This
algorithm starts by doing an initial power-optimal transistor sizing on each gate.
If the power-minimal layout satisfies the delay constraint, the process is termi-
nated; otherwise, the power-delay optimal sizing is applied to gates on the critical
paths until the timing target is met.
These researchers have reported about 15-20% reduction in total power
dissipation as a result of cell selection or transistor sizing.
Transistor Reordering
In general, library gates have pins that are functionally equivalent, which
means that inputs can be permuted on those pins without changing the function of the
gate output. These equivalent pins may have different input pin loads and pin
dependent delays. It is well known that the signal to pin assignment in a CMOS
logic gate has a sizeable impact on the propagation delay through the gate [63].
If we ignore the parasitic (internal) power dissipation due to charging and
discharging of source/drain to bulk diffusion capacitances inside a CMOS logic
gate, it becomes self-evident that high switching activity inputs should be
matched with pins that have low input capacitance [81]. This scheme is however
not very effective as in the semi-custom libraries, the difference in pin capaci-
tances for logically equivalent pins is small. The parasitic power dissipation var-
ies in turn as a function of the switching activities and pin assignment of the input
signals (see [153] and [83] for details of the parasitic power calculation model).
To find the minimum power pin assignment for a gate that accounts for this
internal power dissipation, one must solve a difficult optimization problem as
formulated in [153]. As the number of functionally equivalent pins in a typical
semi-custom library is not greater than six, it is feasible to exhaustively enumerate
all pin permutations to find the minimum power pin assignment.
One can also use heuristics. For example, one such rule assigns the input signal
with the largest probability of assuming a controlling value (zero for NMOS and
one for PMOS) to the transistor near the output terminal of the gate (for
series-connected transistors in the pull-up or pull-down blocks of a logic gate)
[111]. The rationale is that this transistor will switch off more frequently, thus
blocking the internal nodes from non-productive charging and discharging.
Another rule is presented in [114] where the input that has the highest switching
activity when all other inputs are set to their non-controlling values (one for
NMOS and zero for PMOS in series-connected transistors) is directed to the input
closest to the output terminal. The rationale is that assigning such a signal closest
to the Vdd and ground terminals would lead to large power dissipation. The
authors of [132] derive similar rules to those mentioned above and point out that
if there is a conflict between the two rules, then the transistor ordering should be
determined by the ratio of the probability of assuming a controlling value to the
probability of making transitions; that is, the input with the highest ratio will be
placed closest to the output terminal. Experimental results show that about 5%
power reduction can be achieved by transistor ordering.
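The ratio rule attributed to [132] amounts to a sort. The sketch below orders the inputs of a series-connected stack by the ratio of the probability of the controlling value to the transition probability, with the highest ratio placed closest to the output. The signal names, the numbers and the handling of a zero transition probability are assumptions of the example.

```python
def order_inputs(signals):
    """Return input names ordered from closest to the output to farthest.

    signals maps name -> (prob_controlling, transition_prob), both hypothetical.
    """
    def ratio(item):
        _, (p_ctrl, p_trans) = item
        # A signal that never switches gets the highest priority (assumption).
        return p_ctrl / p_trans if p_trans > 0 else float("inf")

    ranked = sorted(signals.items(), key=ratio, reverse=True)
    return [name for name, _ in ranked]


signals = {
    "x": (0.8, 0.2),   # ratio 4.0: mostly at its controlling value, rarely switches
    "y": (0.5, 0.5),   # ratio 1.0
    "z": (0.1, 0.3),   # ratio about 0.33
}
order = order_inputs(signals)
```

Here x is placed next to the output terminal, since it is the input most likely to hold the stack off while switching the least.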
In general, pin permutation for minimum delay produces results that are
very different from those obtained for minimum power. Therefore, pin permuta-
tion for low power should take place on non-critical gates.
Wire and Driver Sizing
Wire and/or driver sizing are often needed to reduce the interconnect delay
on time-critical nets. Wire sizing however tends to increase the load on the driver
and hence increase the power dissipation. A simultaneous wire and driver sizing
approach can reduce the interconnect delay with only a small increase in the
power dissipation. The approach in [33] uses the properties of monotonicity, sep-
arability and dominance (which apply to Elmore delay) to determine the optimal
wire sizing solution. The delay is measured using the distributed Elmore delay
model and power estimations include both capacitive and short circuit power
components. Experimental results show that for the same delay constraint, this
approach reduces the power by about 10% when compared to the conventional
method of driver sizing only. Another optimal buffer and wire sizing approach
based on convex programming techniques which avoids the monotonicity and
separability assumptions of the delay model is presented in [94]. This method can
be easily extended to determine the optimal gate size and wire widths so as to
minimize the power dissipation instead of the area required for the circuit layout.
Super Buffer Design
A super buffer is a chain of inverters designed to drive a large capacitive
load with minimal signal propagation time [63]. A power-optimal buffer sizing
technique applicable to the design of super buffers at high speed is presented
in [170]. This work is based on an analytic relationship among signal delay,
power dissipation, driver size and interconnect load which is in turn derived from
the I-V characteristics of CMOS transistors. This work shows that optimal-power
sizing requires a variable tapering (scaling) factor for the inverter chain.
Clock Tree Generation
The clock is the fastest and most heavily loaded net in a digital system. Ideally,
clock signals should have minimum rise/fall times, specified duty cycles and zero
skew. Power dissipation of the clock net contributes a large fraction of the total
power consumption in a digital circuit [39]; thus, it is also desirable to minimize
the total capacitive load seen by the clock source.
Many zero-skew clock routing algorithms have been proposed. In one
approach, a chain of drivers is introduced at the source and zero-skew is achieved
by wire extending or sizing [146] [171]. In another approach, buffers are inserted
at internal points in the clock tree [168] for satisfying rise/fall time constraints
and for minimizing the area of the clock net. The rationale is that instead of
increasing wire widths and lengths to reduce the skew which will result in
increased power dissipation, one can use a balanced buffer insertion scheme to
partition a large clock tree into a small number of subtrees with minimum wire
widths. In [163] a technique for low power clock synthesis that simultaneously
inserts buffers and generates the clock tree topology is presented. The main
advantage of this approach is that by judicious buffer insertion, one can reduce
the total wire length needed to achieve zero-skew clock tree. Experimental results
show improvements in terms of area, rise/fall times and power dissipation com-
pared to the case where buffers are inserted into clock tree as a postprocessing
step. The paper also demonstrates that inserting buffers at internal nodes of the
clock tree leads to better results compared to inserting buffers at the root of the
clock tree only.
Zero-skew is imposed to ensure correct circuit operation. In practice, cir-
cuits function correctly within a tolerable clock skew. The objective of low power
clock routing is thus to minimize the load on the clock drivers (and hence the
clock tree length) subject to meeting a tolerable clock skew. Algorithms for mini-
mum cost bounded skew clock and Steiner tree routing are described in [34] and
[55].
Power Distribution
As the supply voltage is reduced, the noise margins are diminished; thus, even a
small voltage drop in the power distribution network may have a relatively large impact on
the circuit speed. Careful power distribution is thus becoming more important at
lower supply voltages. In [164], a technique for concurrent topology design and
wire sizing in power distribution networks is presented. The objective is to mini-
mize the layout area while limiting the average current density to avoid elec-
tromigration-induced reliability problems and large resistive voltage drops. This
technique is based on the observation that when two sinks do not draw currents at
the same time, narrow wires can be used for power distribution to those sinks,
thus reducing the layout area. The authors report up to 30% area saving compared
with the star connection scheme.
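The observation about sinks that do not draw current at the same time can be illustrated with a deliberately simplified sizing rule. It is not the topology design algorithm of [164]. The sketch assumes that wire width is proportional to the largest current a segment must carry in any time slot. The current profiles, the limit j_max and the function names are hypothetical.

```python
def required_width(current, j_max):
    """Minimum wire width for a given current, assuming a fixed allowable
    current per unit of width (j_max is an assumed technology limit)."""
    return current / j_max

def shared_segment_current(profiles):
    """Largest total current on a segment shared by several sinks, where
    each profile lists one sink's current draw per time slot."""
    return max(sum(slot) for slot in zip(*profiles))

# Two sinks that are never active in the same time slot (hypothetical data).
sink_a = [4.0, 0.0, 0.0, 4.0]
sink_b = [0.0, 3.0, 3.0, 0.0]
j_max = 2.0  # allowable current per unit width, arbitrary units

# Sizing for the sum of the individual peaks ignores the timing information.
naive_width = required_width(max(sink_a) + max(sink_b), j_max)
# Sizing for the worst simultaneous draw exploits the non-overlap.
aware_width = required_width(shared_segment_current([sink_a, sink_b]), j_max)

print(f"sum-of-peaks width: {naive_width}, timing-aware width: {aware_width}")
```

In this toy case the shared segment can be narrower because the two peaks never coincide. A real flow would also have to check resistive voltage drop along the wire, which this sketch omits.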
6. Challenges Ahead
The need for lower power systems is being driven by many market segments.
There are several approaches to reducing power; however, the approach with the
highest return on investment is designing for low power. Unfortunately,
designing for low power adds another dimension to the already complex design
problem: the design has to be optimized for power as well as performance and
area.
Optimizing the three axes necessitates a new class of power conscious
CAD tools. The problem is further complicated by the need to optimize the design
for power at all design phases. The successful development of new power con-
scious tools and methodologies requires a clear and measurable goal. In this
context, research should strive to reduce power by 5-10x in three years
through design and tool development.
It is worthwhile to enumerate the major challenges that, in our view [116],
must be addressed to keep power dissipation within bounds in
future generations of digital integrated circuits.
• A low voltage/low threshold technology and circuit design approach,
targeting supply voltages around 1 Volt and operating with reduced
thresholds.
• Low power interconnect, using advanced technology, reduced swing or
reduced activity approaches.
• Introduction of low-power system synchronization approaches, using
either self-timed or locally synchronous approaches.
• Dynamic power management techniques, varying supply voltage and
execution speed according to activity measurements. This can be
achieved by partitioning the design into sub-circuits whose energy
levels can be independently controlled and by powering down
sub-circuits which are not in use.
• Moving the work to less energy constrained parts of the system, for
example, by performing the task on fixed stations rather than mobile
sites, by using asymmetric communication protocols, or unbalanced
data compression schemes.
• Application specific processing. This might rely on the increased use
of application specific circuits or application or domain specific processors.
Examples include implementing the most energy-consuming
operations in hardware, choosing a processor with an instruction set,
datapath width and functional units best suited to the algorithm, mapping
functions to hardware so that inter-chip communication is
reduced, and using a suitable memory hierarchy.
• A move toward self-adjusting and adaptive circuit architectures that
can respond quickly and efficiently to environmental changes as
well as to varying data statistics.
• An integrated design methodology - including synthesis and compila-
tion tools. This might require the progression to higher level program-
ming and specification paradigms (e.g. data flow or object oriented
programming).
• Development of power conscious techniques and tools for behav-
ioral synthesis, logic synthesis and layout optimization. The key
7. Acknowledgment
This paper could not have been written without the help of students in the
CAD group at USC whose research works and papers provided good summaries
of some sections of this survey. I am also indebted to professors P. Beerel and
C-Y. Tsui and my students C-S. Ding and S. Iman for their careful readings of the
manuscript and suggestions to improve parts of it. I would also like to thank the
anonymous reviewer whose comments made the paper more balanced in the treatment
of some ideas. This work was performed in part under ARPA contract No.
F33615-95-C1627, NSF NYI award No. MIP-9457392 and SRC contract No.
94-DJ-559.
8. References
[1] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou. " Precomputa-
tion-based sequential logic optimization for low power. " In Proceedings of the 1994
International Workshop on Low Power Design, pages 57-62, April 1994.
[2] B. S. Amrutur and M. Horowitz. " Techniques to reduce power in fast wide memories.
" In Proceedings of the Symposium on Low Power Electronics, pages 92-93, October
1994.
[3] W. C. Athas, L. J. Svensson, J. G. Koller, N. Tzartzanis and E. Chou. " Low-power digital
systems based on adiabatic-switching principles. " IEEE Transactions on VLSI Systems,
2(4):398-407, December 1994.
[4] W. Athas. " Energy-recovery CMOS. " In Low Power Design Methodologies. J. Rabaey
capacitive loads. " In Proceedings of the 1995 IEEE Symposium on Low Power Elec-
tronics, pages 60-61, October 1995.
[52] N. Hedenstierna and K. Jeppson. " CMOS circuit speed and buffer optimization. " IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
6(3):270-281, March 1987.
[53] A. M. Hill and S-M. Kang. " Determining accuracy bounds for simulation-based
switching activity estimation. " In Proceedings of the 1995 International Symposium on
Low Power Design, pages 215-220, April 1995.
[54] M. Horowitz, T. Indermaur and R. Gonzalez. " Low-power digital design. " In Proceed-
ings of the 1995 IEEE Symposium on Low Power Electronics, pages 8-11, October
1995.
[55] D. J. Huang, A. B. Kahng and C. W. Tsao. " On the bounded-skew clock and Steiner
tree problems. " In Proceedings of the 32nd Design Automation Conference, pages
508-513, June 1995.
[56] C. X. Huang, B. Zhang, A-C. Deng and B. Swirski. " The design and implementation of
PowerMill. " In Proceedings of the 1995 International Symposium on Low Power
Design, pages 105-110, April 1995.
[57] S. Iman and M. Pedram. " Multi-level network optimization for low power. " In Pro-
ceedings of the IEEE International Conference on Computer Aided Design, pages
372–377, November 1994.
[58] S. Iman and M. Pedram. " Logic extraction and decomposition for low power. " In Pro-
ceedings of the 32nd Design Automation Conference, pages 433-438, June 1995.
[59] S. Iman and M. Pedram. " Two level logic minimization for low power. " In Proceedings
of the IEEE International Conference on Computer Aided Design, pages 372–377,
November 1994.
[60] S. Iman, C. Y. Tsui and M. Pedram. " PLA minimization for low power VLSI designs. "
CENG Technical Report, Dept. of EE-Systems, University of Southern California, April
1995.
[61] K. Itoh, K. Sasaki and Y. Nakagome. " Trends in low-power RAM circuit technologies.
" Proceedings of IEEE, 83(4):524-543, April 1995.
[62] S. M. Kang. " Accurate simulation of power dissipation in VLSI circuits. " IEEE Jour-
nal of Solid State Circuits, 21(5):889–891, October 1986.
[63] S. M. Kang and Y. Leblebici. CMOS Digital Integrated Circuits: Analysis and Design.
McGraw-Hill Companies, Inc. 1996.
[64] B. Kapoor. " Improving the accuracy of circuit activity measurement. " In Proceedings
of the 1994 International Workshop on Low Power Design, pages 111-116, April 1994.
[65] B. W. Kernighan and S. Lin. " An efficient heuristic procedure for partitioning graphs. "
Bell System Technical Journal, 49(2):291-307, February 1970.
[66] K. Keutzer. " DAGON: Technology mapping and local optimization. " In Proceedings
of the 24th Design Automation Conference, pages 341–347, June 1987.
[67] S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi. " Optimization by simulated annealing.
" Science, 220(4598):671-680, May 1983.
[68] J. M. Kleinhans, G. Sigl, F. M. Johannes and K. J. Antreich. " GORDIAN: VLSI place-
ment by quadratic programming and slicing optimization. " IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 10(3):356-365, March
1991.
[69] T. Kobayashi and T. Sakurai. " Self-adjusting threshold-voltage scheme for low voltage
high speed operation. " Proceedings of CICC, pages 271-274, May 1994.
[70] B. Krishnamurthy and I. G. Tollis. " Improved techniques for estimating signal proba-
bilities. " IEEE Transactions on Computers, 38(7):1245–1251, July 1989.
[71] P. Kudva and V. Akella. " A technique for estimating power in asynchronous circuits. "
In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems, pages 166-175, 1994.
[72] N. Kumar, S. Katkoori, L. Rader and R. Vemuri. " Profile-driven behavioral synthesis
for low power VLSI systems. " To appear.
[73] P.E. Landman and J. Rabaey. " Power estimation for high level synthesis. " In Proceed-
ings of the European Conference on Design Automation, pages 361-366, February
1993.
[74] P.E. Landman and J. Rabaey. " Activity-sensitive architectural power analysis for con-
trol path. " In Proceedings of the 1995 International Symposium on Low Power Design,
pages 93-98, April 1995.
[75] L. Lavagno, P. M. McGeer, A. Saldanha, A. L. Sangiovanni-Vincentelli. " Timed Shan-
non Circuits: a power-efficient design style and synthesis tool. " In Proceedings of the
32nd Design Automation Conference, pages 254-260, June 1995.
[76] E. L. Lawler, K. N. Levitt and J. Turner. " Module clustering to minimize delay in
digital networks. " IEEE Transactions on Computers, pages 45-57, January 1969.
[77] K. W. Lee and C. Sechen. " A new global router for row-based layout. " In Proceedings
of the IEEE International Conference on Computer Aided Design, pages 180-183,
November 1988.
[78] C. E. Leiserson, F. M. Rose and J. B. Saxe. " Optimizing synchronous circuitry by
retiming. " In Proceedings of the Third Caltech Conference on VLSI, pages 23-36,
March 1983.
[79] C. Lennard and A. R. Newton. " An estimation technique to guide low power resynthe-
sis algorithms. " In Proceedings of the 1995 International Symposium on Low Power
Design, pages 227-232, April 1995.
[80] H. M. Lieberstein. " A Course in Numerical Analysis. " Harper & Row Publishers,
1968.
[81] B. Lin and H. De Man. " Low-power driven technology mapping under timing
constraints. " In Proceedings of the International Conference on Computer Design, pages
421-427, October 1993.
[82] H-R. Lin and T-T. Hwang. " Power reduction by gate sizing with path-oriented slack
calculation. " In Proceedings of the 1st Asia-Pacific Design Automation Conference,
pages 7-12, August 1995.
[83] J. Y. Lin, T. C. Liu and W. Z. Shen. " A cell-based power estimation in CMOS combina-
[130] S.C. Seth, L. Pan, and V.D. Agrawal. " PREDICT - Probabilistic estimation of digital
circuit testability. " In Proceedings of the Fault Tolerant Computing Symposium, pages
220–225, June 1985.
[131] A. A. Shen, A. Ghosh, S. Devadas, and K. Keutzer. " On average power dissipation and
random pattern testability of CMOS combinational logic networks. " In Proceedings of
the IEEE International Conference on Computer Aided Design, pages 402-407,
November 1992.
[132] W-Z. Shen, J-Y. Lin and F-W. Wang. " Transistor reordering rules for power reduction
in CMOS gates. " In Proceedings of the 1st Asia-Pacific Design Automation Confer-
ence, pages 1-5, August 1995.
[133] C. Small. " Shrinking devices put the squeeze on system packaging. " EDN, vol. 39, no.
4, pages 41-46, Feb. 17, 1994.
[134] M. Stan and W. Burleson. " Limited-weight codes for low power I/O. " In Proceedings
of the 1994 International Workshop on Low Power Design, pages 209-214, April 1994.
[135] A. Stratakos, R. W. Brodersen, and S. R. Sanders. " High-efficiency low-voltage DC-DC
conversion for portable applications. " 1994 International Workshop on Low-Power
Design, pages 105-110, April 1994.
[136] C-L. Su, C-Y. Tsui and A. M. Despain. " Low power architecture design and compila-
tion techniques for high-performance processors. " In CompCon’94 Digest of Technical
Papers, pages 489-498, February 1994.
[137] C. Svensson and D. Liu. " A power estimation tool and prospects of power savings in
CMOS VLSI chips. " In Proceedings of the 1994 International Workshop on Low
Power Design, pages 171-176, April 1994.
[138] C. Svensson and D. Liu. " Low power circuit techniques. " In Low Power Design Meth-
odologies. J. Rabaey and M. Pedram (Editors). Kluwer Academic Publishers, pages
38-64, 1996.
[139] Y. Tamiya, Y. Matsunaga and M. Fujita. " LP based cell selection with constraints of
timing, area and power consumption. " In Proceedings of the IEEE International
Conference on Computer Aided Design, pages 378-381, November 1994.
[140] C-H. Tan and J. Allen. " Minimization of power in VLSI circuits using transistor sizing,
input ordering and statistical power estimation. " In Proceedings of the 1994 Interna-
tional Workshop on Low Power Design, pages 75-80, April 1994.
[141] G. E. Tellez, A. Farrahi and M. Sarrafzadeh. " Activity-driven clock design for low
power circuits. " In Proceedings of the IEEE International Conference on Computer
Aided Design, pages 62-65, November 1995.
[142] V. Tiwari, P. Ashar, and S. Malik. " Technology mapping for low power. " In Proceed-
ings of the 30th Design Automation Conference, pages 74-79, June 1993.
[143] V. Tiwari, S. Malik and A. Wolfe. " Power analysis of embedded software: a first step
towards software power minimization. " IEEE Transactions on VLSI Systems, 2(4):437-445,
December 1994.
[144] V. Tiwari, S. Malik and P. Ashar. " Guarded evaluation: Pushing power management to
logic synthesis/design. " In Proceedings of the 1995 International Symposium on Low
Power Design, pages 221-226, April 1995.
[145] R. S. Tsay, E. S. Kuh and C. P. Hsu. " PROUD: A sea-of-gates placement algorithm. "
In Proceedings of the IEEE International Conference on Computer Aided Design, pages
318-323, November 1988.
[146] R. S. Tsay. " An exact zero-skew clock routing algorithm. " IEEE Transactions on Com-
puter-Aided Design of Integrated Circuits and Systems, 12(3):242-249, March 1993.
[147] C-Y. Tsui. Power analysis and optimization for CMOS circuits. PhD Dissertation,
Computer Engineering, University of Southern California, 1994.
[148] C-Y. Tsui, J. Monteiro, M. Pedram, S. Devadas, A. M. Despain and B. Lin. " Power
estimation in sequential logic circuits. " IEEE Transactions on VLSI Systems,
3(3):404-416, September 1995.
[149] C-Y. Tsui, M. Pedram, C-H. Chen, and A. M. Despain. " Low power state assignment
targeting two- and multi-level logic implementations. " In Proceedings of the IEEE
International Conference on Computer Aided Design, pages 82–87, November 1994.
[150] C-Y. Tsui, M. Pedram, and A. M. Despain. " Efficient estimation of dynamic power dis-
sipation under a real delay model. " In Proceedings of the IEEE International Confer-
ence on Computer Aided Design, pages 224–228, November 1993.
[151] C-Y. Tsui, M. Pedram, and A. M. Despain. " Technology decomposition and mapping
targeting low power dissipation. " In Proceedings of the 30th Design Automation Con-
ference, pages 68–73, June 1993.
[152] C-Y. Tsui, M. Pedram, and A. M. Despain. " Exact and approximate methods for calcu-
lating signal and transition probabilities in FSMs. " In Proceedings of the 31st Design
Automation Conference, pages 18–23, June 1994.
[153] C-Y. Tsui, M. Pedram, and A. M. Despain. " Power efficient technology decomposition
and mapping under an extended power consumption model. " IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 13(9), September 1994.
[154] A. Tyagi. " Hercules: A power analyzer of MOS VLSI circuits. " In Proceedings of the
IEEE International Conference on Computer Aided Design, pages 530–533, November
1987.
[155] S. Turgis, N. Azemard and D. Auvergne. " Explicit evaluation of short circuit power
dissipation for CMOS logic structures. " In Proceedings of the 1995 International Sym-
posium on Low Power Design, pages 129-134, April 1995.
[156] T. Uchino, F. Minami, T. Mitsuhashi and N. Goto. " Switching activity analysis using
Boolean approximation method. " In Proceedings of the IEEE International Conference
on Computer Aided Design, pages 20-25, November 1987.
[157] H. Vaishnav and M. Pedram. " PCUBE: A performance driven placement algorithm for
low power designs. " In Proceedings of the European Design Automation Conference,
pages 72-77, September 1993.
[158] H. Vaishnav and M. Pedram. " Delay optimal partitioning targeting low power VLSI
circuits. " In Proceedings of the IEEE International Conference on Computer Aided
Design, November 1995.
[159] H. Vaishnav. Optimization of Post-Layout Area, Delay and Power Dissipation. Ph.D.
Dissertation, Computer Engineering, University of Southern California, August 1995.
[160] P. Van Oostende, P. Six, J. Vandewalle and H. De Man. " Estimation of typical
power of synchronous CMOS circuits using a hierarchy of simulators. " IEEE Journal
of Solid State Circuits, 28(1):26-39, January 1993.
[161] H. J. M. Veendrick. " Short-circuit dissipation of static CMOS circuitry and its impact
on the design of buffer circuits. " IEEE Journal of Solid State Circuits, 19:468–473,
August 1984.
[162] T. Villa and A. Sangiovanni-Vincentelli. " NOVA: State assignment of finite state
machines for optimal two-level logic implementations. " IEEE Transactions on Com-
puter-Aided Design of Integrated Circuits and Systems, 9: 905-924, September 1990.
[163] A. Vittal and M. Marek-Sadowska. " Power optimal buffered clock tree design. " In
Proceedings of the 32nd Design Automation Conference, pages 497-502, June 1995.
[164] A. Vittal and M. Marek-Sadowska. " Power distribution topology design. " In Proceed-
ings of the 32nd Design Automation Conference, pages 503-507, June 1995.
[165] S. B. K. Vrudhula and H-Y. Xie. " Techniques for CMOS power estimation and logic
synthesis for low power. " In Proceedings of the 1994 International Workshop on Low
Power Design, pages 21-26, April 1994.
[166] S. Wuytack, F. Catthoor, F. Franssen, L. Nachtergaele and H. De Man. " Global com-
munication and memory optimizing transformations for low power systems. " In Pro-
ceedings of the 1994 International Workshop on Low Power Design, pages 203-208,
April 1994.
[167] M. Xakellis and F. Najm. " Statistical estimation of switching activity in digital circuits.
" In Proceedings of the 31st Design Automation Conference, pages 728-733, June 1994.
[168] J. G. Xi and W-M. Dai. " Buffer insertion and sizing under process variations for low
power. " In Proceedings of the 32nd Design Automation Conference, pages 491-496,
June 1995.
[169] G. Zimmermann. " A new area and shape function estimation technique for VLSI lay-
out. " In Proceedings of the 25th Design Automation Conference, pages 60-65, June
1988.
[170] D. Zhou and X. Y. Liu. " Optimal drivers for high speed low power ICs. " To appear,
1995.
[171] Q. Zhu, W. M. Dai and J. G. Xi. " Optimal sizing of high speed clock network based on
distributed and transmission line models. " In Proceedings of the IEEE International
Conference on Computer Aided Design, pages 628-633, November 1993.