Robust Chip Level CTS
I. Introduction
A system-on-a-chip (SoC) design can be defined as an IC designed by stitching together multiple standalone VLSI designs to provide full functionality for an application [1]. SoC designs have become increasingly common, and the trend is expected to continue in the future [2]. An attractive feature of SoC designs is the ability to reuse a given sub-component in multiple chips. The level of reuse can differ from IP to IP. This paper uses the word IP to denote the individual sub-blocks used in SoC designs. They are also referred to as cores in some literature [1]. At one extreme of the reuse spectrum are hard-IPs, where the exact transistor-level layout is reused in several designs. At the other end are soft-IPs, which go through the physical design/timing closure process from scratch so as to integrate the IP with the rest of the chip. This paper defines a soft-IP as one for which the netlist is available but physical information is not present as a part of the IP.
Most SoC physical design closure is done in a hierarchical
fashion [1]. In such a methodology, different IPs should be
integrated along with the glue logic to complete the chip-level
Manuscript received March 10, 2010; revised June 30, 2010 and October 8,
2010; accepted December 20, 2010. Date of current version May 18, 2011.
This work was supported in part by NSF, SRC, and the IBM Faculty Award.
This paper was recommended by Associate Editor Y.-W. Chang.
A. Rajaram is with Magma Design Automation, Austin, TX 78731 USA
(e-mail: [email protected]).
D. Z. Pan is with the Department of Electrical and Computer Engineering,
University of Texas, Austin, TX 78712 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2011.2106852
0278-0070/$26.00 © 2011 IEEE
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 6, JUNE 2011
Fig. 1. Simple chip-level CTS example. The black circles represent the clock
root for each IP.
Fig. 2. Even for identical nominal skews, Case A is better than Case B
because of lesser clock divergence and hence lesser skew variation.
Fig. 4.
know the sum of clock delays that is not shared by the given
sink pairs. However, when considering more than one sink pair, such a direct measurement of divergence is not correct because not all sink pairs are equally critical from a timing perspective. For example, when considering all four sinks of Fig. 4, there are six potential sink pairs, and thus we need to consider the relative importance of each pair while calculating the divergence for the entire clock tree.
The relative importance of the different sink pairs can be represented by pairwise weights proportional to the timing criticality of the path between the sink pair. If there is no valid timing path between a pair of sinks, then the corresponding weight is zero. This concept can be easily extended as more clock sinks and timing paths are added.
Similarly, clock divergence at the chip-level can be measured
as the weighted sum of clock divergence between the clock
trees of the different IPs. The weight used for a pair of IPs will
be proportional to the timing criticality of all the paths between
the pair. Please note that the timing criticality information can
be obtained directly from the timing analysis usually done
with ideal clocks just before CTS. The actual weights might
be made proportional to either the worst negative slack or the
total negative slack of all paths between the given pair of IPs.
Thus, for a given chip-level clock tree with N IPs, the value
of divergence can be expressed as
$$\text{divergence} = \sum_{i,j} W_{i,j} \left( D_i^F + D_j^F - 2 D_{i,j}^C \right). \tag{1}$$
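As a minimal illustration of (1), the sketch below evaluates the weighted divergence for a handful of IPs. The function name `chip_divergence`, the dictionary layout, and all numeric values are assumptions made for this example only; the sketch also assumes the sum runs over unordered IP pairs and that a missing weight means there is no timing path between the pair.

```python
from itertools import combinations

def chip_divergence(full_delay, common_delay, weight):
    """Weighted clock divergence over all unordered IP pairs.

    full_delay[i]        : clock delay from the chip root to IP i (D_i^F)
    common_delay[(i, j)] : clock delay shared by IPs i and j (D_{i,j}^C)
    weight[(i, j)]       : timing criticality of the paths between i and j (W_{i,j})
    Pair keys are tuples in sorted order; a missing weight counts as zero.
    """
    total = 0.0
    for i, j in combinations(sorted(full_delay), 2):
        w = weight.get((i, j), 0.0)
        if w == 0.0:
            continue  # no valid timing path between this pair
        unshared = full_delay[i] + full_delay[j] - 2.0 * common_delay[(i, j)]
        total += w * unshared
    return total

# Illustrative numbers only (not taken from the paper).
full_delay = {"A": 1.0, "B": 1.2, "C": 0.9}
common_delay = {("A", "B"): 0.8, ("A", "C"): 0.3, ("B", "C"): 0.3}
weight = {("A", "B"): 2.0, ("B", "C"): 1.0}
result = chip_divergence(full_delay, common_delay, weight)
```

Here the pair (A, B) contributes 2.0 x (1.0 + 1.2 - 1.6) and the pair (B, C) contributes 1.0 x (1.2 + 0.9 - 0.6), while (A, C) is skipped because it has no weight.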
$$\sum_{i,j} W_{i,j} \left( D_i + D_j - 2 D_{i,j} \right). \tag{2}$$
any CTS is done on them. For example, this step may be done
after the floor-planning stage of the chip design and before
the timing closure of the individual IPs starts. We restrict the
possible clock pin locations to the mid points of one of the
four sides of each IP. This minimizes the distance between the
clock pin and the farthest register and can result in reduced
clock tree delay. When the flop distribution is not uniform
within a given IP or when there are multiple clocks present in
a given IP, we locate each clock pin such that it divides the
sink distribution it drives into roughly two equal halves, either
in the horizontal or vertical direction. Under this assumption,
the clock pin assignment problem can be formulated as follows:
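$$\text{Minimize} \quad \sum_{i,j} \sum_{p,q} x_i^p \, x_j^q \, W_{i,j} \cdot \text{Top Level Dist}(B_i^p, B_j^q)$$
$$\text{s.t.} \quad \sum_{p} x_i^p = 1, \qquad x_i^p \in \{0, 1\} \tag{3}$$
where $1 \le i, j \le N$, $i \ne j$; $1 \le p \le 4$; $1 \le q \le 4$.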
In the above equations:
1) $i$ and $j$ denote IP numbers;
2) $p$ and $q$ denote one of the four pin locations on a given IP;
3) the binary variable $x_i^p$ represents whether a given pin location $p$ is selected for IP $i$;
4) $B_i^p$ denotes the IP $i$ with pin location at $p$;
5) $W_{i,j}$ denotes the criticality of the paths between IPs $i$ and $j$;
6) Top Level Dist($B_i^p$, $B_j^q$) represents the Manhattan distance between pin location $p$ of $B_i$ and $q$ of $B_j$.
The constraints that each variable $x_i^p$ is either 0 or 1 and that the variables for a given IP sum to exactly 1 ensure that exactly one pin location is selected for each IP. The cost function being minimized is the weighted sum of distances between the clock pins of all IP pairs, where the weight is the criticality of the paths between a given IP pair. Minimizing the distance between two pins directly increases the chances of clock delay sharing between the two IPs. The only variables in the above optimization problem are $x_i^p$, and since they can only take values of 0 or 1, the problem is a 0-1 quadratic programming problem. Though this problem is NP-hard, efficient heuristics are available to solve it [18]. It may be noted here that, though prior work [19] solves a similar problem, its formulation is not suitable when different IP pairs have different criticality values.
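To make the objective concrete, the following sketch evaluates the weighted pin-to-pin distance for a tiny floorplan and finds the best assignment by exhaustive search. It is not the heuristic of [18]: brute force is swapped in purely for illustration and is only viable for a handful of IPs, since it visits all 4^N assignments. The helper names (`side_midpoints`, `assignment_cost`, `best_assignment`), the pin ordering (bottom, right, top, left), and the floorplan and weights are all assumptions of this example; each unordered IP pair is counted once.

```python
from itertools import product

def side_midpoints(x, y, w, h):
    """Candidate clock pins: mid-points of the bottom, right, top, and left sides."""
    return [(x + w / 2, y), (x + w, y + h / 2), (x + w / 2, y + h), (x, y + h / 2)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def assignment_cost(choice, pins, weight):
    """Weighted Manhattan distance between the selected pins of every IP pair."""
    n = len(pins)
    return sum(weight[i][j] * manhattan(pins[i][choice[i]], pins[j][choice[j]])
               for i in range(n) for j in range(i + 1, n))

def best_assignment(pins, weight):
    """Exhaustive search; choosing one index per IP enforces the one-pin-per-IP constraint."""
    return min(product(range(4), repeat=len(pins)),
               key=lambda c: assignment_cost(c, pins, weight))

# Hypothetical floorplan: (x, y, width, height) of three IPs.
ips = [(0, 0, 2, 2), (4, 0, 2, 2), (0, 4, 2, 2)]
pins = [side_midpoints(*ip) for ip in ips]
# Hypothetical symmetric criticality matrix: IP0-IP1 is the most critical pair.
weight = [[0, 3, 1], [3, 0, 0], [1, 0, 0]]
best = best_assignment(pins, weight)
best_cost = assignment_cost(best, pins, weight)
```

In this example the search places the pin of IP0 on its right side and the pin of IP1 on its left side, so the most critical pair ends up with facing pins, which is the behavior the weighting is meant to encourage.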
A. Impact of Pin Assignment on Delay at the IP-Level
The above formulation ignores the impact of clock pin
assignment on the IP-level clock tree delays, which might
end up increasing the overall delay or even clock divergence.
However, the formulation can be made to account for IP-level clock tree delays by introducing an additional weighting term of the form $K_i^p$ that denotes the criticality of assigning the pin location $p$ for IP $i$ with regard to the IP-level clock tree. For example, if all four sides are equally acceptable for the IP-level CTS of IP $i$, then the value of $K_i^p$ will be identical for all four values of $p$. If, on the other hand, we want to make a particular pin location more likely, we can increase the corresponding scaling factor. The relative values for these
Fig. 7.
Fig. 8.
three buffer configurations be (4.9, 8), (5, 10), and (5.3, 12), respectively. This means that, for very similar delays in the nominal corner, these three buffer configurations have significantly different delays in the slow corner. This can happen since they are driving different interconnect lengths. Now, let the S_P sub-tree of Fig. 7 have delays of (19.7, 38) and the S_Q sub-tree have delays of (25, 50). Since S_Q has the higher delay, we have to recursively add buffer configurations on top of S_P until the delays of the two sub-trees are very close. Let us assume here that the skew requirement for the algorithm to stop is 0 ps. Since the skew between S_P and S_Q is not zero, the algorithm enters the while loop of Fig. 7. We now have to select the best of the three available buffer configurations to add to S_P. We iterate through the buffer configurations and find the one that, when added to S_P, brings its CCDR closest to that of S_Q. Among the three configurations, adding the buffer configuration BCC gives S_P delays of (19.7 + 5.3, 38 + 12) = (25, 50), which not only brings the CCDR of S_P to the same value as that of S_Q, but also brings the skew down to 0 ps. Thus, after adding the buffer configuration BCC to S_P, its delay is identical to that of S_Q. At this point, the algorithm exits the while loop in Fig. 7 and proceeds to physically merge the two sub-trees.
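The selection step in this example can be reproduced with a few lines of code. The sketch below is only an approximation of one iteration of the loop in Fig. 7, under two stated assumptions: CCDR is taken to be the ratio of slow-corner delay to nominal-corner delay, and the names BCA and BCB for the first two configurations are invented here by analogy with BCC. Delays are (nominal, slow) pairs as in the text.

```python
def ccdr(delay):
    """ASSUMED definition: slow-corner delay divided by nominal-corner delay."""
    nominal, slow = delay
    return slow / nominal

def add_config(delay, config):
    """Delay of a sub-tree after a buffer configuration is placed on top of it."""
    return (delay[0] + config[0], delay[1] + config[1])

def pick_config(sub_tree, target, configs):
    """Choose the configuration that brings the sub-tree's CCDR closest to the target's."""
    return min(configs,
               key=lambda name: abs(ccdr(add_config(sub_tree, configs[name])) - ccdr(target)))

# (nominal, slow) delays from the worked example; BCA and BCB are hypothetical names.
configs = {"BCA": (4.9, 8.0), "BCB": (5.0, 10.0), "BCC": (5.3, 12.0)}
s_p, s_q = (19.7, 38.0), (25.0, 50.0)

best = pick_config(s_p, s_q, configs)
new_s_p = add_config(s_p, configs[best])
skew = tuple(abs(a - b) for a, b in zip(new_s_p, s_q))
```

With these numbers the third configuration is selected, and the residual skew is zero in both corners up to floating-point rounding, so a loop with a 0 ps stopping criterion would terminate after this single addition.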
V. Chip-Level CTS Algorithms
In this section, we discuss four different chip-level CTS
algorithms with varying degrees of complexity. Please note
that only the dynamic programming based algorithm is newly
proposed in this paper. The other three algorithms are simple
modifications of existing CTS works used for comparison.
A. Single-Corner Approach
This algorithm is a direct application of existing CTS algorithms to the CCTS problem in which the delays from only one corner are used. The algorithm recursively merges the sub-tree nodes that are nearest neighbors, in a manner similar to that of well-known CTS algorithms [13]–[16]. If a given sub-tree cannot be merged with any other sub-tree without violating the slew limits, a buffer is added on top of the sub-tree to extend its possible merging region. The buffer sizes for merging two sub-trees are chosen so as to reduce the total buffer area added. The results from this approach will be used as the baseline for the rest of the algorithms.
B. Multicorner Approach
This approach is identical to the single-corner approach with one key difference: the consideration of multicorner skews. During the process of merging two sub-trees, the method described in Fig. 7 is used instead of using only one corner delay. At each step, the sub-trees that are closest to each other are merged recursively until only one sub-tree remains. The results from this approach will be used for the cost versus benefit analysis of multicorner skew reduction.
C. Greedy CCTS Algorithm
This algorithm is a simple modification of the work
of [20] in which every sub-tree merger is done to minimize
the cost (wirelength or buffer area) of that merger. In our
Fig. 9. Dynamic programming based approach to chip-level CTS. The substeps are highlighted and explained separately in Figs. 10 and 11.
Fig. 10. Procedure to pick valid pairs for merger from a given set of sub-trees.
2) Pre Eliminate Procedure: In the Pre Eliminate procedure shown in Fig. 10, the objective is to return only the
smallest number of valid pairs for next round of merger
without impacting the quality of result. This is done by taking
advantage of three key properties of the CCTS problem listed
below.
a) First, any merger between two old sub-trees can be
eliminated. This is because their merger would have
been already considered when at least one of them was
a new sub-tree. In other words, considering a merger of
two old sub-trees simply means we are doing the same
work again. This property is used in the first If condition
in line 3 of Fig. 10.
b) Second, we can eliminate any sub-tree pair that has even one IP in common. This is because the presence of an IP in a sub-tree means that the IP has already been physically merged with another IP in the solution, so any other sub-tree containing that same IP cannot be physically merged with the given sub-tree. This property is used in the second If condition in line 3 of Fig. 10.
c) Third, any merger between sub-trees that are too far apart, either in terms of delay or distance between their roots, is likely to be sub-optimal when other alternatives with better delay or distance matching are available. This property is used in the calculation of PreElim Cost in Fig. 10. This cost measures the desirability of a merger between any two sub-trees that do not overlap. It is proportional to the physical distance between the roots of the sub-trees (dist(S_i, S_j)) and to the delay difference between the sub-trees (del dist(S_i, S_j)). It is also inversely proportional to the number of critical timing interactions between the IPs in the two sub-trees. This last effect is captured by
$$C(S_i, S_j) = \sum_{p,q} W(p, q) \tag{6}$$
Fig. 11.
The Pre Eliminate procedure uses two user-defined parameters that are explained next. The first parameter is used as a weighting factor between the delay difference and the physical distance between the roots of the sub-trees. It is set to the average length of interconnect that can be driven per unit of delay using a given set of buffers and a given technology under the maximum slew constraint, and is measured in distance per unit delay. The other parameter used in Fig. 10 is an integer that controls how many potential pairs are allowed per sub-tree. It can be any integer of at least 1; in our experiments, it was set to 2. It may be noted that in the preliminary version of this paper [12], we used two other comparable parameters with similar motivation that directly controlled the maximum allowed delay difference and distance difference between the sub-trees. However, based on our experiments on a large number of test-cases, we find the new parameters much easier to set, without any need to tune them for individual test-cases. Thus, using the three properties of CCTS mentioned above, the Pre Eliminate procedure selects only a few of the best sub-tree pairs for consideration during the next round of mergers.
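The three properties can be combined into a short filter. The sketch below is a loose approximation, not the procedure of Fig. 10: the exact form of PreElim Cost is not given in the text, so the expression `(dist + dist_per_delay * del_dist) / (1 + C)` is an assumption that merely grows with distance and delay mismatch and shrinks with criticality. The function name, the dictionary fields, the parameter names `dist_per_delay` and `pairs_per_subtree`, and the rule of keeping the cheapest pairs of each sub-tree are likewise hypothetical.

```python
from itertools import combinations

def pre_eliminate(subtrees, weight, dist_per_delay, pairs_per_subtree):
    """Return candidate (index, index) pairs for the next round of mergers."""
    scored = []
    for a, b in combinations(range(len(subtrees)), 2):
        sa, sb = subtrees[a], subtrees[b]
        if not (sa["is_new"] or sb["is_new"]):
            continue  # property (a): two old sub-trees were already considered
        if sa["ips"] & sb["ips"]:
            continue  # property (b): sub-trees sharing an IP cannot be merged
        dist = abs(sa["root"][0] - sb["root"][0]) + abs(sa["root"][1] - sb["root"][1])
        del_dist = abs(sa["delay"] - sb["delay"])
        # Criticality between the two sub-trees, in the spirit of (6).
        crit = sum(weight.get(frozenset((p, q)), 0.0)
                   for p in sa["ips"] for q in sb["ips"])
        # ASSUMED cost form; the exact expression of Fig. 10 is not reproduced here.
        cost = (dist + dist_per_delay * del_dist) / (1.0 + crit)
        scored.append((cost, a, b))
    kept = set()
    for s in range(len(subtrees)):
        # property (c): keep only the cheapest few pairs that involve sub-tree s
        mine = sorted(t for t in scored if s in (t[1], t[2]))
        kept.update((a, b) for _, a, b in mine[:pairs_per_subtree])
    return sorted(kept)

# Illustrative data only.
subtrees = [
    {"ips": frozenset("A"), "root": (0, 0), "delay": 1.0, "is_new": True},
    {"ips": frozenset("B"), "root": (2, 0), "delay": 1.0, "is_new": True},
    {"ips": frozenset("C"), "root": (10, 0), "delay": 1.0, "is_new": False},
    {"ips": frozenset("AC"), "root": (5, 0), "delay": 1.5, "is_new": False},
]
weight = {frozenset("AB"): 1.0}
pairs = pre_eliminate(subtrees, weight, dist_per_delay=4.0, pairs_per_subtree=1)
```

In this example the pair of sub-trees 0 and 3 is dropped because both contain IP A, the pair 2 and 3 is dropped because both are old, and the distant pair 0 and 2 loses out to cheaper alternatives.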
3) Post Eliminate Procedure: The objective of the post-elimination procedure of Fig. 11 is to compare all the existing sub-trees and weed out any inferior solutions. A sub-tree P is inferior if there exists another sub-tree Q that covers the same set (or a super-set) of the clock pins covered by sub-tree P, but has the same or lower merging cost. Two sub-trees that drive different sets of IPs will never be directly compared for elimination as one cannot fully replace the other. Once the inferior solutions are identified, they are removed from the list of active sub-trees that will be considered for the next round of sub-tree mergers. This is shown in steps 2 to 4 of Fig. 11.
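The inferiority test is a dominance check and can be sketched as follows. This is a minimal illustration rather than the procedure of Fig. 11: the function name and the `(ips, cost)` tuple layout are assumptions, and the sketch follows the stated definition that allows a super-set to dominate. Replacing `>=` with `==` would restrict the comparison to sub-trees that cover identical sets.

```python
def prune_dominated(subtrees):
    """Drop each sub-tree that another one covers (same IPs or a super-set) at no higher cost.

    subtrees: list of (ips, cost) tuples, where ips is a frozenset of IP names.
    """
    # Larger sets first, cheaper first among equal sizes, so that any
    # potential dominator is examined before the sub-trees it dominates.
    order = sorted(subtrees, key=lambda s: (-len(s[0]), s[1]))
    kept = []
    for ips, cost in order:
        dominated = any(k_ips >= ips and k_cost <= cost for k_ips, k_cost in kept)
        if not dominated:
            kept.append((ips, cost))
    return kept

# Illustrative data only.
candidates = [
    (frozenset("AB"), 5.0),
    (frozenset("AB"), 4.0),
    (frozenset("A"), 1.0),
    (frozenset("B"), 4.5),
    (frozenset("C"), 2.0),
]
survivors = prune_dominated(candidates)
```

Checking only against the sub-trees already kept is sufficient because dominance is transitive: whatever dominates an eliminated sub-tree also dominates everything that sub-tree would have dominated.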
In addition to the above straightforward pruning, the procedure of Fig. 11 executes another pruning that is a bit more subtle. This is shown in steps 5 to 7 of Fig. 11. This procedure uses a user-defined integer parameter that represents the maximum number of independent and complete solutions that can be present in the current set of sub-trees. We first sort all the current sub-trees in descending order of the number of IPs in them. When two sub-trees have the same number of IPs, we sort them in ascending order of cost. The final sorted list of valid sub-trees represents how close each sub-tree is to the final complete solution to be chosen. For example, the top-most sub-tree has the maximum number of IPs below it with the least cost. Given this sorted list of sub-trees, we move down the list from top to bottom to select a list of sub-trees that can be used to get one complete and independent solution to the CCTS problem. This step is repeated until the total number of independent solutions reaches the user-defined maximum or the list of potential full solutions runs out. Any sub-tree that is not present in any of these top complete solutions is added to the list of eliminated sub-trees. The sub-trees eliminated by the Post Eliminate procedure are dropped from subsequent iterations of the algorithm in Fig. 9.
The second pruning procedure drastically reduces the overall runtime with little impact on the final results. This is because a sub-tree that is not a part of the top final solutions can be eliminated with little risk as long as the maximum number of retained solutions is sufficiently large. For example, in our experiments, we set this parameter to 200. However, keeping that sub-tree in the solution pool takes up exponentially higher runtime since it may add quite a few new solutions in the subsequent iterations without actually leading to any better results. It may be noted that in the preliminary version of this paper [12], this last pruning method was not employed. As a result, the runtime of the original algorithm does not scale as well as the new algorithm with respect to the number of IPs in the CCTS problem.
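One plausible reading of this second pruning step is sketched below. It is an interpretation, not the procedure of Fig. 11: the text does not say how successive solutions are made independent, so this sketch assumes each new solution is seeded from the next position in the sorted list and is completed greedily with non-overlapping sub-trees. The function name, the parameter name `max_solutions`, and the data are hypothetical.

```python
def top_solutions_prune(subtrees, all_ips, max_solutions):
    """Keep only sub-trees that appear in one of the first few complete solutions.

    subtrees: list of (ips, cost) tuples, where ips is a frozenset of IP names.
    Returns (kept, eliminated) as lists in the original order.
    """
    # Descending number of IPs, then ascending cost, as described in the text.
    order = sorted(range(len(subtrees)),
                   key=lambda k: (-len(subtrees[k][0]), subtrees[k][1]))
    keep, found = set(), 0
    for start in range(len(order)):
        if found >= max_solutions:
            break
        covered, chosen = set(), []
        # ASSUMED greedy rule: walk down the list and take every sub-tree
        # that does not overlap the ones already chosen.
        for k in order[start:]:
            ips = subtrees[k][0]
            if not (ips & covered):
                chosen.append(k)
                covered |= ips
        if covered == all_ips:  # a complete solution covers every IP
            keep.update(chosen)
            found += 1
    kept = [subtrees[k] for k in sorted(keep)]
    eliminated = [subtrees[k] for k in range(len(subtrees)) if k not in keep]
    return kept, eliminated

# Illustrative data only.
pool = [
    (frozenset("ABC"), 9.0),
    (frozenset("AB"), 4.0),
    (frozenset("BC"), 5.0),
    (frozenset("A"), 1.0),
    (frozenset("B"), 1.0),
    (frozenset("C"), 1.0),
]
kept, eliminated = top_solutions_prune(pool, set("ABC"), max_solutions=2)
```

With a limit of two solutions, the first solution is the single sub-tree covering all three IPs and the second combines the sub-tree over A and B with the one over C; the remaining sub-trees are eliminated.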
VI. Practical Considerations in CCTS
A. Generalization of Pin Assignment Algorithm
In Section III, the 0-1 quadratic programming problem was formulated assuming that the clock pins can be located only at the mid-points of the four sides. In the most generic case, a given IP can have multiple candidate clock-pin locations on each of the four sides and also candidate locations on top of the IP. This situation can be easily handled by introducing two constant weight factors for each candidate location. The first factor should account for the estimated IP clock delay for each candidate location. This factor should increase in proportion to the estimated delay of the IP clock tree for the candidate pin location. The second factor should consider the
potential routing layer difference that might arise when clock-pin locations on top of the IP are considered. Also, another straightforward modification that can be made to the method proposed in (3) is that the ranges of the variables p and q, which index the candidate pin locations, should be changed to account for the new candidate pin locations. Thus, the original formulation in Section III is applicable generally.

TABLE I
Key Test-Case Generation Parameters

Parameter              Value
Chip size              0.25 cm² to 6.25 cm²
No. of IPs             10–130
Aspect ratio           0.7–1.3
Hard-IP probability    0.2
Slew limit range       90–110 ps
Technology             65 nm
B. Consideration of Blockages
A key requirement of any chip-level CTS algorithm is that it works in the presence of blockages. All the algorithms presented in our approach to the CCTS problem can be applied even to chips with blockages. For example, the clock pin assignment algorithm can be made blockage-aware by measuring the distance between any two candidate pin locations using a blockage-aware global router instead of a Manhattan estimate. Similarly, the multicorner sub-tree balancing heuristic of Fig. 7 can be modified to use the global-router-based distance instead of the Manhattan distance. Since the dynamic programming algorithm internally uses the multicorner heuristic, it can also be used in the presence of blockages.
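The difference between the two distance measures can be seen on a small grid. The sketch below uses a breadth-first search as a stand-in for a blockage-aware global router, which is a deliberate simplification: a real router also models layers, congestion, and wire resources. The function name `routed_distance` and the grid encoding (1 marks a blocked cell) are assumptions of this example.

```python
from collections import deque

def routed_distance(grid, src, dst):
    """Length of the shortest 4-neighbour path that avoids blocked cells, or None."""
    rows, cols = len(grid), len(grid[0])
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == dst:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), d + 1))
    return None  # destination unreachable

# A blockage spans the middle of the row that connects the two pins.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
src, dst = (1, 0), (1, 4)
manhattan = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
routed = routed_distance(grid, src, dst)
```

Here the Manhattan estimate is 4 grid units while the detour around the blockage takes 6, so a pin assignment or balancing step based on the Manhattan estimate would underestimate the wire needed between these two pins.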
VII. Experimental Results
A. Test-Case Generation
To test the effectiveness of our algorithms, we need several chip-level SoC test-cases. Since obtaining test-cases from actual SoC chips is not feasible for us and since there is no known CCTS work in the literature, we generate random test-cases using the data available on SoC chips in the literature [1]–[4].
1) Defining SoC Chip Physical Attributes: First, we define reasonable ranges for the following variables: chip size, number of IPs, size range of the IPs, aspect ratio range for IPs, and chip density. Using these, we generate random chip-level floorplans such that the chip size, number of IPs, and so on are all within the selected ranges. We also make sure that the chip density (the fraction of the chip covered by all the IPs) is within limits and that there are no overlaps between the IPs. Each IP is marked as a hard or soft IP randomly with probabilities of 0.2 and 0.8, respectively. We would like to note here that the relative probabilities of hard and soft IPs were chosen based on our prior experience with SoC chips. We were unable to find any previous work from which we could choose this number.
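A floorplan generator of this kind could look roughly like the sketch below. The paper does not describe its generator in detail, so rejection sampling is an assumed technique here, and the function name, the parameter names, the IP area range, and the chip dimensions are hypothetical; only the 0.2 hard-IP probability and the 0.7 to 1.3 aspect ratio range come from the text. The chip density check is omitted for brevity.

```python
import random

def generate_floorplan(num_ips, chip_w, chip_h, area_range, aspect_range,
                       hard_prob=0.2, seed=0, max_tries=10000):
    """Place non-overlapping random rectangles (IPs) inside the chip outline."""
    rng = random.Random(seed)
    ips, tries = [], 0
    while len(ips) < num_ips and tries < max_tries:
        tries += 1
        area = rng.uniform(*area_range)
        aspect = rng.uniform(*aspect_range)  # width divided by height
        w = (area * aspect) ** 0.5
        h = area / w
        if w > chip_w or h > chip_h:
            continue
        x = rng.uniform(0, chip_w - w)
        y = rng.uniform(0, chip_h - h)
        overlaps = any(x < o["x"] + o["w"] and o["x"] < x + w and
                       y < o["y"] + o["h"] and o["y"] < y + h for o in ips)
        if overlaps:
            continue  # reject this candidate and draw a new one
        ips.append({"x": x, "y": y, "w": w, "h": h,
                    "hard": rng.random() < hard_prob})
    return ips

# Hypothetical 2 cm x 2 cm chip with ten small IPs.
ips = generate_floorplan(10, 2.0, 2.0, area_range=(0.01, 0.05),
                         aspect_range=(0.7, 1.3), seed=1)
```

Rejection sampling is simple but slows down as the chip fills up, so a generator targeting high chip densities would likely need a constructive placement instead.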
2) Generating Timing Criticality Data: To generate realistic timing criticality information between IP pairs, we consider how the chip-level floorplan is done. A key objective of the floorplanning step is to ensure that IPs that interact heavily are located close to each other. However, when the interaction between the IPs becomes complex, placing all the IPs that interact right next to each other becomes impossible. Also, IPs that are very far away from each other rarely have a significant number of critical paths between them. To closely
TABLE II
Clock Divergence, Delay, Skew, Buffer Area (BA), and Wire Length (WL) Results for the Test-Cases in Table IV

          CCTS    Divergence (s)    Max. Delay (ns)   Skew (% of Delay)           BA (nm2)  WL (m)    CPU
TC   PAM  Alg.    NN    SS    FF    NN    SS    FF    NN     SS     FF     Worst  X 1e6     X 1e6     (s)
TC1  RND  1CA     0.13  0.16  0.10  2.44  3.00  1.98  3.95   0.91   6.16   6.16   32.32     163.27    1
TC1  RND  MCA     0.13  0.16  0.10  2.41  2.99  1.96  2.17   2.64   2.42   2.64   32.34     163.33    1
TC1  RND  MC-GRD  0.11  0.12  0.10  2.41  2.99  1.96  2.25   2.38   2.63   2.63   32.45     163.66    8
TC1  RND  MC-DyP  0.12  0.14  0.09  2.41  2.99  1.96  2.45   2.94   2.38   2.94   32.43     163.62    10
TC1  QP   1CA     0.12  0.15  0.09  2.44  2.99  1.97  4.22   1.08   6.18   6.18   32.25     163.06    1
TC1  QP   MCA     0.12  0.15  0.09  2.41  2.99  1.96  2.43   2.48   2.49   2.49   32.30     163.15    2
TC1  QP   MC-GRD  0.11  0.12  0.10  2.41  2.99  1.97  1.99   2.57   3.24   3.24   32.35     163.39    7
TC1  QP   MC-DyP  0.11  0.13  0.09  2.42  3.00  1.97  2.11   2.88   2.23   2.88   32.35     163.36    10
TC2  RND  1CA     0.50  0.63  0.40  1.79  2.22  1.42  6.46   3.07   8.72   8.72   10.99     55.83     2
TC2  RND  MCA     0.52  0.65  0.42  1.83  2.29  1.48  3.36   4.39   4.71   4.71   11.07     55.98     3
TC2  RND  MC-GRD  0.48  0.55  0.43  1.78  2.23  1.43  3.55   4.47   4.76   4.76   11.21     56.47     28
TC2  RND  MC-DyP  0.38  0.47  0.30  1.77  2.22  1.43  3.79   4.11   4.23   4.23   11.21     56.40     54
TC2  QP   1CA     0.52  0.66  0.41  1.79  2.22  1.42  6.13   2.36   8.72   8.72   10.97     55.76     1
TC2  QP   MCA     0.53  0.67  0.43  1.93  2.42  1.56  4.67   5.03   6.39   6.39   11.04     55.88     2
TC2  QP   MC-GRD  0.45  0.51  0.41  1.77  2.22  1.42  3.06   5.49   4.79   5.49   11.15     56.25     23
TC2  QP   MC-DyP  0.35  0.44  0.29  1.78  2.24  1.44  3.09   4.02   3.64   4.02   11.15     56.19     50
TC3  RND  1CA     0.67  0.83  0.53  0.65  0.80  0.52  8.77   5.29   11.58  11.58  2.89      14.28     2
TC3  RND  MCA     0.70  0.86  0.57  0.65  0.80  0.53  11.83  10.75  11.69  11.83  2.93      14.36     2
TC3  RND  MC-GRD  0.57  0.65  0.51  0.65  0.82  0.54  11.45  12.00  13.93  13.93  3.12      14.98     39
TC3  RND  MC-DyP  0.50  0.63  0.41  0.66  0.83  0.54  12.36  14.20  13.39  14.20  3.03      14.65     48
TC3  QP   1CA     0.66  0.83  0.54  0.65  0.80  0.53  12.47  8.23   17.07  17.07  2.90      14.27     1
TC3  QP   MCA     0.60  0.75  0.49  0.63  0.79  0.51  10.36  13.22  11.61  13.22  2.93      14.31     2
TC3  QP   MC-GRD  0.56  0.63  0.50  0.63  0.79  0.51  13.33  15.27  13.58  15.27  3.07      14.80     33
TC3  QP   MC-DyP  0.49  0.60  0.39  0.66  0.82  0.54  10.75  11.05  12.74  12.74  3.03      14.64     51
TC4  RND  1CA     1.36  1.71  1.10  0.81  1.00  0.65  10.62  6.14   12.61  12.61  6.42      32.63     3
TC4  RND  MCA     1.43  1.77  1.16  0.91  1.13  0.76  7.53   7.46   11.73  11.73  6.48      32.75     5
TC4  RND  MC-GRD  1.19  1.35  1.06  0.81  1.01  0.66  9.27   9.26   10.28  10.28  6.86      34.04     61
TC4  RND  MC-DyP  1.02  1.27  0.83  0.81  1.01  0.66  8.08   8.32   10.31  10.31  6.68      33.38     84
TC4  QP   1CA     1.39  1.74  1.11  0.81  1.00  0.65  8.27   5.22   11.46  11.46  6.40      32.62     2
TC4  QP   MCA     1.36  1.69  1.11  0.86  1.08  0.70  9.62   11.52  11.26  11.52  6.45      32.71     4
TC4  QP   MC-GRD  1.20  1.37  1.07  0.80  1.01  0.65  7.11   9.75   8.96   9.75   6.78      33.79     57
TC4  QP   MC-DyP  1.04  1.30  0.85  0.84  1.06  0.68  14.24  14.57  15.43  15.43  6.65      33.24     86
TC5  RND  1CA     3.67  4.61  2.92  1.28  1.59  1.04  6.85   4.12   11.14  11.14  9.32      43.78     4
TC5  RND  MCA     3.46  4.29  2.80  1.34  1.66  1.09  6.49   7.17   6.79   7.17   9.38      43.90     4
TC5  RND  MC-GRD  3.18  3.60  2.84  1.37  1.71  1.12  6.46   7.17   7.18   7.18   10.08     46.13     139
TC5  RND  MC-DyP  2.27  2.84  1.84  1.30  1.62  1.05  6.52   7.41   6.49   7.41   9.81      45.13     148
TC5  QP   1CA     3.59  4.53  2.86  1.34  1.66  1.09  8.21   4.91   13.85  13.85  9.33      43.79     3
TC5  QP   MCA     3.52  4.38  2.86  1.43  1.78  1.16  6.11   6.64   6.39   6.64   9.40      43.92     5
TC5  QP   MC-GRD  3.20  3.63  2.87  1.28  1.60  1.05  8.20   9.62   9.04   9.62   10.01     45.94     116
TC5  QP   MC-DyP  2.19  2.74  1.76  1.31  1.64  1.07  6.15   7.01   6.40   7.01   9.68      44.76     240
TC6  RND  1CA     6.42  8.02  5.17  1.14  1.40  0.92  8.04   5.16   11.45  11.45  29.35     141.97    6
TC6  RND  MCA     6.30  7.83  5.10  1.06  1.33  0.86  8.77   9.52   10.20  10.20  29.42     142.10    8
TC6  RND  MC-GRD  4.94  5.63  4.40  1.05  1.32  0.85  9.88   10.81  9.98   10.81  31.14     147.73    254
TC6  RND  MC-DyP  5.10  6.35  4.14  1.07  1.34  0.87  6.83   8.16   8.26   8.26   30.29     144.71    488
TC6  QP   1CA     6.11  7.64  4.92  1.06  1.33  0.86  7.11   5.03   11.35  11.35  29.32     141.86    4
TC6  QP   MCA     5.96  7.42  4.82  1.04  1.32  0.84  10.36  11.89  10.46  11.89  29.40     142.01    6
TC6  QP   MC-GRD  5.59  6.45  4.93  1.04  1.30  0.84  9.73   11.23  10.35  11.23  31.03     147.34    218
TC6  QP   MC-DyP  5.28  6.59  4.26  1.06  1.33  0.85  7.51   8.39   9.54   9.54   30.12     144.13    421
TABLE III
Average Values for Different Metrics for Six Test-Cases Shown in Table II Along With Average and Normalized Results of All the 100 Test-Cases Used

                             CCTS       Divergence (s)    Delay  Skew (% of Delay)           BA (nm2)  WL (m)   CPU
TC                      PAM  Algorithm  NN    SS    FF    NN     NN     SS     FF     Worst  X 1e6     X 1e6    (s)
Avg. (6 TCs)            RND  1CA        2.12  2.66  1.70  1.35   6.57   3.35   9.31   9.31   15.21     75.29    3
Avg. (6 TCs)            RND  MCA        2.09  2.59  1.69  1.37   5.32   5.80   6.38   6.38   15.27     75.40    4
Avg. (6 TCs)            RND  MC-GRD     1.74  1.98  1.56  1.35   5.65   6.18   6.47   6.47   15.81     77.17    88
Avg. (6 TCs)            RND  MC-DyP     1.57  1.95  1.27  1.34   5.34   6.05   5.90   6.05   15.57     76.32    139
Avg. (6 TCs)            QP   1CA        2.07  2.59  1.66  1.35   6.73   3.49   10.06  10.06  15.20     75.22    2
Avg. (6 TCs)            QP   MCA        2.02  2.51  1.63  1.39   5.89   6.68   6.62   6.68   15.25     75.33    4
Avg. (6 TCs)            QP   MC-GRD     1.85  2.12  1.64  1.33   5.61   7.17   6.80   7.17   15.73     76.92    76
Avg. (6 TCs)            QP   MC-DyP     1.58  1.97  1.27  1.35   5.59   6.36   6.32   6.36   15.50     76.06    143
Avg. (100 TCs)          RND  1CA        1.85  2.32  1.48  1.37   6.67   3.63   9.95   9.95   13.68     68.57    3
Avg. (100 TCs)          RND  MCA        1.87  2.32  1.51  1.41   5.47   6.25   6.53   6.53   13.76     68.77    4
Avg. (100 TCs)          RND  MC-GRD     1.57  1.79  1.40  1.36   5.45   6.44   6.53   6.53   14.16     70.03    121
Avg. (100 TCs)          RND  MC-DyP     1.35  1.67  1.09  1.36   5.90   6.52   6.93   6.93   14.00     69.44    133
Avg. (100 TCs)          QP   1CA        1.83  2.30  1.46  1.37   6.95   3.72   10.33  10.33  13.67     68.51    3
Avg. (100 TCs)          QP   MCA        1.84  2.29  1.49  1.41   5.69   6.41   6.60   6.60   13.74     68.65    4
Avg. (100 TCs)          QP   MC-GRD     1.57  1.79  1.40  1.36   5.45   6.44   6.53   6.53   14.16     70.03    119
Avg. (100 TCs)          QP   MC-DyP     1.30  1.62  1.06  1.37   6.37   7.16   7.28   7.28   13.95     69.28    121
% impr. w.r.t. 1CA RND  RND  MC-DyP     27.1  27.8  26.2  0.19   40.6   34.4   30.3   40.6   -2.33     -1.27
% impr. w.r.t. 1CA RND  QP   MC-DyP     29.5  30.1  28.6  0.46   35.9   28.0   26.8   35.9   -1.97     -1.03
sets are identical in every respect other than the pin locations, a direct comparison of the results from these two sets will indicate the impact of the clock pin placement method. Also, we run each of the four CCTS algorithms on all test-cases irrespective of their clock pin placement method. This will be used to compare the relative effectiveness of the four CCTS algorithms.
Table II gives detailed results of six representative test-cases out of the 100 test-cases we have generated. Table III gives the average results for the six test-cases used in Table II along with the average results of all the 100 test-cases generated. The last two rows of Table III give the percentage improvement of the different parameters with respect to the baseline values from the single-corner random pin assignment (1CA RND) method. A positive number in these rows implies a reduction in value. Please note that we have used the worst values of the 1CA RND skew to normalize all the other values in these rows. Some of the acronyms used in Tables II and III are explained next. TC denotes the test-case for the results. PAM denotes the pin assignment method used in the test-case. This can be either the quadratic-programming (QP) based method or the random pin assignment method (RND). The four CCTS algorithms described earlier are abbreviated as: single-corner approach (1CA), multicorner approach (MCA), multicorner greedy algorithm (MC-GRD), and multicorner dynamic programming based algorithm (MC-DyP). The divergence values given are the weighted sum of clock divergence between all IP pairs. The weights are proportional to the timing criticality of all the paths between the IP pairs. Please note that in Table II, all metrics except skew are absolute values. Skew in a given corner is given as a percentage of the delay in the corresponding corner. Since the delay values in the slowest corner (SS) and the fastest corner (FF) can be quite different, we believe normalizing the absolute skew in each corner by the corresponding delay tells us how significant the skew is in that corner. We call this skew value as
TABLE IV
Characteristics of the 100 Random Test-Cases Generated with a Representative Six

TC        No. of IPs  No. of Flops  X Size (cm)  Y Size (cm)  Aspect Ratio  Max IP Del (ns)  Min IP Del (ns)
TC1       14          589 824       2.02         2.66         0.76          2.41             1.08
TC2       30          184 320       1.63         1.76         0.93          1.77             0.27
TC3       48          48 512        0.98         0.91         1.07          0.63             0.11
TC4       63          119 296       1.48         1.21         1.22          0.79             0.14
TC5       90          146 432       1.31         1.82         0.72          1.29             0.15
TC6       126         521 216       2.67         2.28         1.17          1.04             0.34
Avg(6)    62          268 267       1.68         1.77         0.98          1.32             0.35
Avg(100)  56          279 777       1.74         1.84         0.96          1.48             0.35
Fig. 13. Divergence and skew variation are directly correlated. (a) Absolute
values. (b) Normalized values.
Acknowledgment