
Clock Distribution: Shmuel Wimer

The document discusses clock distribution in integrated circuits. It begins by describing the basic architecture of a clock system, which includes a clock generator that receives an external clock and distributes a global clock signal. Local drivers and gates then drive physical clocks to clocked elements. The document then discusses global clock generation using techniques like phase locked loops (PLL) to compensate for distribution delays. It also discusses synchronous chip interfaces that use PLLs to synchronize clocks between chips. The remainder of the document details the components and operation of PLLs and delay locked loops (DLLs) and various clock distribution network topologies like trees, grids, and spines. It concludes by discussing factors that influence clock skew and jitter.

Uploaded by

Kaushal Panchal

Clock Distribution

Shmuel Wimer
Bar Ilan Univ. Eng. Faculty
Technion, EE Faculty

July 2010

Clock System Architecture


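[Figure: clock system block diagram. Inside the chip, the external clock (ext_clk) feeds the clock generator, whose global clock (gclk) goes through the clock distribution to buffers and gaters that drive the physical clocks clk1, clk2 and clk3 of the clocked elements.]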

Chip receives external clock through I/O pad.


Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Local drivers and gaters drive the physical clocks to clocked elements.

Global Clock Generation


The clock generator receives the external clock signal and produces the global clock distributed across the die.
A large skew occurs between the external clock and the physical clocks at the clocked elements due to the delay of the distribution network (wires, buffers, gaters).
Therefore, data at the clocked elements is no longer in sync with data at the I/O pins.
A Phase-Locked Loop (PLL) compensates for this delay.
A PLL can also perform frequency multiplication to obtain the required on-chip frequencies.

Synchronous Chip Interface with PLL


Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be
synchronized to the common clock.
A PLL produces the global clock of chip B such that it is in sync
with the external clock.
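[Figure: chip A drives CLKout and exchanges Dout/Din with chip B. In chip B the incoming clock (CLKin, ext_clk) is the ref_clk input of the PLL; the PLL output clk_out drives gclk, and the distributed clock clk is returned to the PLL as fdbk_clk.]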

How PLL Works?

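[Figure: PLL block diagram. ref_clk and fdbk_clk (with dividers N and M) enter the phase detector; its Up and Down outputs switch the charge-pump currents I; the loop filter (R, C) produces Vctrl, which controls the voltage-controlled oscillator generating clk_out.]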

Phase Frequency Detector (PFD)


[Figure: PFD built from two flip-flops whose D inputs are tied to 1 and whose clock inputs are A and B. Output Q_A means "B should go faster"; output Q_B means "B should go slower". Both flip-flops share the CLR signal.]
The two flip-flops receive the signals at their clock inputs (one is usually the reference and the other is the sampled signal).
The output of the leading flip-flop is 1 for the duration of the lead.
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.

What happens when the reference and the sampled signals are shifted copies of each other?
[Figure: waveforms of A (reference), B (sampled), Q_A (sampled should go faster) and Q_B (sampled should go slower).]
The spikes at Q_B result from the delay of the AND gate driving the CLR inputs of the flip-flops and from the internal delay from CLR to Q.

What happens when the reference and the sampled signals have different frequencies?
[Figure: waveforms of A (reference), B (sampled), Q_A (sampled should go faster) and Q_B (sampled should go slower).]
Q_B (sampled should go slower) is 1 more often than Q_A, since rising edges of B occur more often than rising edges of A.
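The PFD behaviour described on these slides can be sketched as a small event-driven model. This is a minimal sketch, not the author's circuit: it assumes an ideal reset with zero AND-gate and CLR-to-Q delay (so the spikes mentioned above do not appear), and the function name `pfd_pulses` and the edge lists are hypothetical.

```python
def pfd_pulses(ref_edges, fb_edges):
    """Two-flip-flop PFD with an ideal (zero-delay) reset.

    ref_edges, fb_edges: sorted rising-edge times of A (reference) and B (sampled).
    Returns (faster_time, slower_time): total time Q_A and Q_B were high.
    """
    events = sorted([(t, "A") for t in ref_edges] + [(t, "B") for t in fb_edges])
    q_a = q_b = False
    rise_a = rise_b = 0.0
    faster = slower = 0.0
    for t, source in events:
        # A rising edge sets the corresponding flip-flop (its D input is tied to 1).
        if source == "A" and not q_a:
            q_a, rise_a = True, t
        elif source == "B" and not q_b:
            q_b, rise_b = True, t
        # When both outputs are 1, the AND gate clears both flip-flops.
        if q_a and q_b:
            faster += t - rise_a
            slower += t - rise_b
            q_a = q_b = False
    return faster, slower


# B lags A by 2 time units: only Q_A ("B should go faster") pulses.
lag_result = pfd_pulses([0, 10, 20], [2, 12, 22])
# B has a shorter period than A: only Q_B ("B should go slower") pulses.
fast_result = pfd_pulses([0, 10, 20, 30], list(range(0, 33, 4)))
```

In the second case extra rising edges of B arrive while Q_B is already 1 and are ignored, which is why Q_B stays high for most of the time.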

Charge Pump
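[Figure: charge pump driven by the PFD. The flip-flop clocked by CLK_ref produces "faster" and controls switch Sup; the flip-flop clocked by CLK_fdbk produces "slower" and controls switch Sdn; two current sources Icp charge or discharge the Vctrl node.]
Converts the PFD error (digital) to charge (analog), which then controls the PLL VCO.
Charge is proportional to the PFD pulse widths: $Q_{cp} = I_{up} t_{faster} - I_{dn} t_{slower}$.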

Current Mirror
[Figure: N-type current mirror (N1, N2, between Iin/Iout and Vss) and P-type current mirror (P1, P2, between Vcc and Iin/Iout).]
The charge pump consists of current mirrors, which are sources of constant current.
Device N1 is in saturation since its gate is connected to high voltage. Ids (= Iin) depends only on Vgs.
Vgs is the same in N2, hence Iout = Iin.
This is an ideal current source with infinite output impedance, since Iout is independent of the N2 load; a change in the output voltage doesn't affect Iout.
The current mirror works similarly for P transistors.

How Charge Pump Works?


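[Figure: charge pump schematic between Vcc and Vss. A P-type current mirror on top and an N-type current mirror at the bottom each carry current I; the load R determines the current through the current mirror. Vout sits between two switches: the upper one is annotated "switch open when faster = 1" and the lower one "switch open when slower = 1".]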

Faster Mode: Vout → Vcc

[Figure: charge pump with faster=1 and slower=0; capacitor C at Vout is pulled toward Vcc.]

Slower Mode: Vout → Vss

[Figure: charge pump with faster=0 and slower=1; capacitor C at Vout is pulled toward Vss.]

Loop Filter
[Figure: loop filter built from R and C, referenced to Vcc, with a buffered Vctrl output.]
A differential amplifier connected as a unity-gain follower is used.

Voltage Controlled Oscillator (VCO)


A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.

[Figure: ring of inverters whose output Vout is fed back to the first inverter.]
If $t_{inv}$ is the delay of an inverter and $n$ is the number of inverters, the oscillation frequency is $f = 1 / (2 n t_{inv})$.
Frequency can be controlled by the number of inverters and by the supply voltage of the inverters (a higher voltage yields faster inverters).
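As a quick numeric illustration of the ring oscillator frequency $f = 1 / (2 n t_{inv})$, here is a minimal sketch. The function name and the 20 ps inverter delay are assumptions chosen for the example, not values from the slides.

```python
def ring_oscillator_frequency(n_inverters, t_inv):
    """Oscillation frequency f = 1 / (2 * n * t_inv) of an n-stage ring oscillator."""
    if n_inverters < 1 or n_inverters % 2 == 0:
        # An even number of inverters settles into a stable state instead of oscillating.
        raise ValueError("a ring oscillator needs an odd number of inverters")
    return 1.0 / (2 * n_inverters * t_inv)


# Hypothetical example: ring of 5 inverters, 20 ps per inverter -> about 5 GHz.
f_example = ring_oscillator_frequency(5, 20e-12)
```

Slowing the inverters (for example by lowering their supply voltage, which is what Vctrl does in a VCO) increases `t_inv` and lowers the frequency.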

Components of VCO

[Figure: VCO components: a ring of 5 inverters supplied between Vctrl and Vss, a level converter from the Vctrl-Vss swing to the Vcc-Vss swing, and buffering for driving clk_out.]

Delay Locked Loop (DLL)


It is a variant of the PLL that uses a voltage-controlled delay line rather than an oscillator.
It adjusts phase only. Frequency multiplication is impossible.
It is simpler than a PLL, less sensitive to Vctrl noise, and requires a simpler loop filter.
It is very difficult to design PLLs and DLLs correctly. It requires expertise in control systems and analog circuit design.

How DLL Works?

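[Figure: DLL block diagram. ref_clk and fdbk_clk enter the phase detector; its Up and Down outputs switch the charge-pump currents I; the loop filter (R, C) produces Vctrl, which controls the voltage-controlled delay line that delays ref_clk to produce clk_out.]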

Delay Line
Each element either delays the signal from input to output or passes it undelayed, according to its control bit.
[Figure: chain from In to Out of delay elements $2^{n-1}, 2^{n-2}, \ldots$, selected by control bits $D_{n-1}, D_{n-2}, \ldots, D_0$.]
Connecting $n$ delay elements in series, each of delay $2^i$, $n-1 \ge i \ge 0$, makes it possible to set the delay of the line to any value from 0 to $2^n - 1$ delay units with an $n$-bit control word $D_{n-1}, D_{n-2}, \ldots, D_0$.
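The binary-weighted delay line can be sketched in a few lines. This is an illustrative model only; the function name `delay_line` and the `unit_delay` parameter are assumptions.

```python
def delay_line(control_word, unit_delay=1.0):
    """Total delay of a binary-weighted delay line.

    control_word: bits [D_{n-1}, ..., D_0]; element i adds 2**i units when D_i is 1.
    """
    n = len(control_word)
    total_units = 0
    for i, bit in zip(range(n - 1, -1, -1), control_word):
        if bit not in (0, 1):
            raise ValueError("control bits must be 0 or 1")
        total_units += bit * 2 ** i
    return total_units * unit_delay


# A 3-bit control word reaches every delay from 0 to 2**3 - 1 = 7 units.
reachable = sorted(
    delay_line([d2, d1, d0]) for d2 in (0, 1) for d1 in (0, 1) for d0 in (0, 1)
)
```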

Clock Distribution Networks


Tree Clock Network (Unconstrained)


[Figure: unconstrained clock tree driving leaves clk1, clk2, clk3, ..., clkn.]
No constraints imposed on buffers and wires.
Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within a large design.
Tools aim at minimizing the variance of clock delays.

If $T_{CLK_i}$ is the clock delay at leaf $i$, the variance expression
$\sum_{i=1}^{n} \left( T_{CLK_i} - \frac{1}{n} \sum_{j=1}^{n} T_{CLK_j} \right)^2$
should be minimized.
Serpentine routing or extra buffers may be introduced to obtain a small variance.
Constraints on power can be imposed by limiting the number and size of clock buffers and the width of wires.
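The quantity a synthesis tool tries to minimize can be evaluated directly from the leaf delays. The sketch below assumes the expression is the sum of squared deviations from the mean leaf delay; the function name and the sample delays (in ps) are hypothetical.

```python
def delay_variance_sum(leaf_delays):
    """Sum of squared deviations of the leaf clock delays from their mean."""
    n = len(leaf_delays)
    mean = sum(leaf_delays) / n
    return sum((t - mean) ** 2 for t in leaf_delays)


balanced = delay_variance_sum([100.0, 100.0, 100.0])   # perfectly balanced tree
unbalanced = delay_variance_sum([98.0, 102.0])         # 4 ps spread between two leaves
```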

Clock Distribution with Grids


Grid feeds flops directly, no local buffers
Clock driver tree spans height of chip
Internal levels shorted together
Low skew but high power

[Figure: grid with clock drivers on the perimeter.]

[Figure: grid with clock drivers on the grid points.]

Delay and Skew in Grid Distribution

DEC's Alpha Microprocessor Clocking

DEC Alpha 21264 Microprocessor Clock Distribution

Clock Distribution with Spines

Intel's Pentium4 Clock Distribution

Clock Distribution with Trees


RC-Tree: each branch is individually routed to balance RC delay
H-Tree: recursive pattern to distribute signals uniformly with equal delay over area
More skew but less power

Clock H-Tree

[Figure: H-tree spanning a chip / functional block / IP, from the clock / PLL at the root to the sequential elements at the leaves.]

IBM / Motorola PowerPC Clock Distribution

Delay Calculation

We use the Elmore delay model. Subtrees are modeled as capacitive loads.
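The Elmore delay of an RC tree can be sketched as follows: the delay to a node is the sum, over every resistor on the path from the root, of that resistance times the total capacitance it drives (the whole downstream subtree lumped as a capacitive load). The tree encoding, the function name and the normalized R and C values are assumptions made for this illustration.

```python
def elmore_delays(tree):
    """Elmore delay from the driver to every node of an RC tree.

    tree: dict node -> (parent, r, c), where r is the resistance of the branch
    from parent to node and c is the capacitance at node. The root has parent None.
    """
    children = {}
    for node, (parent, _, _) in tree.items():
        children.setdefault(parent, []).append(node)

    def subtree_cap(node):
        # The whole subtree below a branch acts as one lumped capacitive load.
        return tree[node][2] + sum(subtree_cap(ch) for ch in children.get(node, []))

    delays = {}

    def visit(node, upstream_delay):
        _, r, _ = tree[node]
        delays[node] = upstream_delay + r * subtree_cap(node)
        for ch in children.get(node, []):
            visit(ch, delays[node])

    for root in children.get(None, []):
        visit(root, 0.0)
    return delays


# Hypothetical symmetric fork: both leaves see the same delay, hence zero skew.
symmetric = elmore_delays({
    "s": (None, 1.0, 1.0),
    "a": ("s", 2.0, 3.0),
    "b": ("s", 2.0, 3.0),
})
# Heavier load on leaf b unbalances the tree and introduces skew.
unbalanced_tree = elmore_delays({
    "s": (None, 1.0, 1.0),
    "a": ("s", 2.0, 3.0),
    "b": ("s", 2.0, 5.0),
})
```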

Clock Skew and Jitter


Ideally, the clock should arrive simultaneously at all sequential circuits.
In practice it arrives at different times. The differences are called clock skews.
Skews result from path mismatches, process variations and ambient conditions affecting the physical clocks.
Most systems distribute a global clock and then use local clock gaters located near the clocked elements.
Clock skew consists of the following components:

Systematic skew is the portion existing under nominal conditions. It can be minimized by appropriate design.
Random skew is caused by process variations such as device channel length, oxide thickness, threshold voltage, and wire thickness, width and spacing. It can be measured on silicon and adjusted by delay components.
Drift is caused by time-dependent environmental variations, occurring relatively slowly. Compensation for those must take place periodically.
Jitter is rapid clock changes, caused by power noise and clock generator jitter. It cannot be compensated.

Factors affecting clock skew, Intel 1998, 0.25 µm.

Skew, Clock Cycle and Design Margins

Clock jitter is of the same order as skew, but far more difficult to compensate.

Skew Modeling
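[Figure: from the point of divergence after the clock generator, one clock path of $m$ stages (delay Tclk1) and one of $n$ stages (delay Tclk2) reach the clocked elements on either side of block CL.]
$T_{CLK1} = \sum_{i=1}^{m} T_i$, $T_{CLK2} = \sum_{i=1}^{n} T_i$
$C_l$: capacitive load, $I_d$: drive current, $V_{CC}$: voltage swing.
Stage delay: $T = C_l V_{CC} / I_d$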

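Consider a small change in the stage delay $T = C_l V_{CC} / I_d$, taking the linear term:
$\Delta T = \frac{\partial T}{\partial C_l} \Delta C_l + \frac{\partial T}{\partial V_{CC}} \Delta V_{CC} + \frac{\partial T}{\partial I_d} \Delta I_d = \frac{V_{CC}}{I_d} \Delta C_l + \frac{C_l}{I_d} \Delta V_{CC} - \frac{C_l V_{CC}}{I_d^2} \Delta I_d = \frac{C_l V_{CC}}{I_d} \left( \frac{\Delta C_l}{C_l} + \frac{\Delta V_{CC}}{V_{CC}} - \frac{\Delta I_d}{I_d} \right)$
The standard deviation of the stage delay is $\sigma$, which is around 5%.
If clock buffer delays are normally distributed and independent of each other, then $\sigma_{T_{CLK1}} = \sqrt{m}\,\sigma$ and $\sigma_{T_{CLK2}} = \sqrt{n}\,\sigma$.
Skew is $T_{skew} = |T_{CLK2} - T_{CLK1}|$, with $\sigma_{skew} = \sqrt{m+n}\,\sigma$.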
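The square-root growth of skew with path depth can be checked numerically. The sketch below assumes independent, normally distributed stage delays with a common standard deviation, and compares the closed form $\sqrt{m+n}\,\sigma$ with a seeded Monte Carlo estimate of the spread of the signed difference $T_{CLK2} - T_{CLK1}$. Function names, the stage delay of 100 and the trial count are assumptions for the example.

```python
import math
import random


def skew_sigma(m, n, sigma_stage):
    """Closed-form standard deviation of the skew between an m-stage and an n-stage path."""
    return math.sqrt(m + n) * sigma_stage


def simulate_skew_sigma(m, n, t_stage, sigma_stage, trials=20000, seed=0):
    """Monte Carlo estimate of the standard deviation of T_CLK2 - T_CLK1."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        t_clk1 = sum(rng.gauss(t_stage, sigma_stage) for _ in range(m))
        t_clk2 = sum(rng.gauss(t_stage, sigma_stage) for _ in range(n))
        diffs.append(t_clk2 - t_clk1)
    mean = sum(diffs) / trials
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / trials)


# Hypothetical numbers: stage delay 100 units with sigma of 5% (5 units), m=4, n=5.
analytic = skew_sigma(4, 5, 5.0)
simulated = simulate_skew_sigma(4, 5, 100.0, 5.0)
```

Only the stages after the point of divergence contribute; stages shared by both paths cancel in the difference.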

Clock Distribution Switching Power


Consider an $m$-level clock tree.
Let $C_{l\_m}$ be the total load of its far-end driven sequential elements.
Assuming a fixed fanout $k$ of each clock tree buffer, the dynamic power is:
$P_{CLK\_m} = C_{l\_m} V_{CC}^2 f$,
$P_{CLK\_m-j} = \frac{C_{l\_m}}{k^j} V_{CC}^2 f$, $0 \le j \le m-1$.

Summing over the whole clock tree:
$P_{CLK} = \sum_{j=0}^{m-1} C_{l\_m-j} V_{CC}^2 f = V_{CC}^2 f \sum_{j=0}^{m-1} \frac{C_{l\_m}}{k^j} = V_{CC}^2 f C_{l\_m} \frac{1 - (1/k)^m}{1 - 1/k}$
How much of the power is consumed by the far-end drivers of the clock tree?
$\frac{P_{CLK\_m}}{P_{CLK}} = \frac{1 - 1/k}{1 - (1/k)^m}$

Given the number of sequential elements in a block, at least 50% of the switching power is consumed by the far-end drivers (when the clock tree is binary, k=2). This share approaches 1 rapidly as k grows.
Example: Assume a block with $2^{14}$ sequential elements and H-tree clock distribution. Then k=4 and m=7. The far-end drivers consume about 75% of the clock tree switching power, while adding the next upper level of drivers brings it to more than 90%.
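The numbers in the example can be reproduced with a short calculation. The sketch assumes, as on the slides, that the load driven at each level shrinks by the fanout $k$ per level going up the tree, so the power of level $j$ above the far end is proportional to $1/k^j$; the function name is hypothetical.

```python
def level_power_fractions(k, m):
    """Share of the clock-tree switching power per level, far-end level first.

    Level j (j = 0 is the far end) drives a load proportional to 1 / k**j.
    """
    weights = [1.0 / k ** j for j in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


# H-tree example from the slide: 2**14 sequential elements, k = 4, m = 7.
h_tree = level_power_fractions(4, 7)
far_end_share = h_tree[0]                 # about 0.75
two_lowest_levels = h_tree[0] + h_tree[1]  # about 0.94
binary_far_end = level_power_fractions(2, 10)[0]  # slightly above 0.5
```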

Active Clock De-Skewing


Compensates for process variability, temperature gradients and imperfect design.
Can be implemented for global fixes (small HW overhead) or local fixes (high HW overhead).
Can be used at testing for a one-time fix (variability occurring during manufacturing), or dynamically, concurrently with chip operation.
Its implementation is a difficult design challenge.

Intel's Pentium2 De-Skewing System

1998, 450 MHz clock
0.25 µm process
60 ps skew without fix
15 ps skew with fix
Two clock spines for two clock regions.
A phase detector detects relative shifts.
The clock of a region is shifted by a delay line.

The delay line consists of two cascaded inverters.
Each has a programmable load consisting of eight parallel P-N gate capacitors.
The shift register stores a thermometer code for load programming in steps of 12 ps.

Intel's IA64 Itanium1 De-Skewing System

2000, 800 MHz clock
0.18 µm process
28 ps skew with fix
4x increase without fix
30 independent de-skew regions.
[Figure: de-skew clusters fed from the PLL.]
Each cluster is driven from a global H-tree.
The delay circuits in the de-skew regions are similar to those of Pentium3, with 20-bit registers.

Proposal for H-Tree Clock De-Skew


Hierarchical Approach

If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.

Proposal for H-Tree Clock De-Skew


Mesh Approach
The clock is distributed by an H-tree, but de-skew takes place through phase detection between neighboring leaves.
A delay buffer accepts phase inputs from its 4 neighbors and then decides whether to increase, decrease or keep its delay.

Clock Characteristics of Commercial Processors


Data-Driven Clock Gating: Motivation


Clocking consumes 30% to 70% of dynamic power.
Clock enabling is easier at high design levels but harder at the logic and gate levels.
Clock enabling is easier in register files and data paths, but harder in control.
Designers are conservative, leaving a lot of hidden disabling opportunities on the table.
Aug 2011

Data-Driven Clock Gating

Industrial Block: DSP

22467 FFs, test bench of 240373 CLK cycles.
CLK is enabled for 10% of the period, and within it FFs are toggling at only 1.6% of the CLK pulses!

Industrial Block: Networking Control

37155 FFs, test bench of 6301 CLK cycles.
CLK is enabled for 20% of the period, and within it FFs are toggling at only 1.3% of the CLK pulses!

How Adaptive Gating Works?

[Figure: a flip-flop with data input D, output Q and clock clk, and its adaptively gated version in which the signal clk_en enables the clock.]

Joint Gating of k Flip-Flops


Hardware Overhead Vs. Power Savings


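[Figure: enable waveforms over 10 clock cycles. The two individual enables (en1, en2) are 1 0 0 0 1 0 0 0 0 1 and 1 0 0 0 0 1 0 0 0 1, each leaving clk idle 70% of the time. Their joint enable, en_joint = 1 0 0 0 1 1 0 0 0 1, leaves the gated clock clk_g idle only 60% of the time.]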
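The tradeoff on this slide can be reproduced from the three bit patterns: joining two enables under one gater saves hardware, but the joint enable is the OR of the individual ones, so the clock is idle less often. A minimal sketch follows; the helper names are hypothetical, and which pattern belongs to en1 and which to en2 is an assumption.

```python
def joint_enable(*enables):
    """Joint enable of several flip-flops: bitwise OR of their individual enables."""
    return [int(any(bits)) for bits in zip(*enables)]


def idle_fraction(enable):
    """Fraction of clock cycles in which the gated clock is idle (enable is 0)."""
    return enable.count(0) / len(enable)


en1 = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
en2 = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
en_joint = joint_enable(en1, en2)
```

Here each flip-flop alone would be idle 70% of the time, while the shared gated clock is idle 60% of the time; the difference is the power-saving price paid for sharing one gater.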

Adaptive Gating in Sequential CKT


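[Figure: sequential circuit with flip-flops FF (D1/Q1, D2/Q2, D3/Q3) separated by combinational logic CL; the enabling signals of k flip-flops are combined (O) and passed through a latch to gate clk into clk_g.]
There's a serious timing overhead (discussed later).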

k fan-out clock-tree

[Figure: clock tree with levels 0, 1, 2, ...; annotations $k = 2^K$ and a leaf count given in terms of $n$, $2^N$ and $2^K$.]

[Figure: clock gater variants in which the enables en1, ..., enk are combined into en_joint, which gates clk to produce clk_g; one variant shows a backward connection of the enabling signal.]

[Figure: H-tree spanning a chip / functional block / IP, from the clock / PLL to the sequential elements.]

What is the Optimal Clock Gater Fan-out?


There's a tradeoff between hardware overhead and the number of saved clock pulses (power savings).
The FFs' activities and their correlations are key.
Worst-case assumption: all FFs are toggling independently of each other.

The Optimal Clock Gater Fan-out


$k$: number of flip-flops, $q$: probability that an FF has D = Q.
$c_{net\_saving} = q^k (c_{FF} + c_w) - \frac{c_{latch}}{k} - (1 - q)(c_w + c_{OR})$
Here $c_{net\_saving}$ is the net saving at a leaf flip-flop, $q^k$ is the gater's disabling probability, $c_{latch}/k$ is the latch overhead amortized over $k$ FFs, and $1 - q$ is the switching probability of the FF enabling signal.
Differentiating with respect to $k$:
$q^k \ln q \, (c_{FF} + c_w) + \frac{c_{latch}}{k^2} = 0$
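The optimal fan-out can also be found by direct search over integer values of $k$. The sketch assumes the net-saving expression reads $q^k (c_{FF} + c_w) - c_{latch}/k - (1-q)(c_w + c_{OR})$; all capacitance values and $q$ below are made-up numbers for illustration, not data from the slides.

```python
def net_saving(k, q, c_ff, c_w, c_latch, c_or):
    """Net saving per leaf flip-flop when k flip-flops share one clock gater."""
    return q ** k * (c_ff + c_w) - c_latch / k - (1 - q) * (c_w + c_or)


def best_fanout(q, c_ff, c_w, c_latch, c_or, k_max=64):
    """Integer k in 1..k_max that maximizes the net saving (exhaustive search)."""
    return max(
        range(1, k_max + 1),
        key=lambda k: net_saving(k, q, c_ff, c_w, c_latch, c_or),
    )


# Hypothetical normalized costs: FF clock load 1.0, wire 0.5, latch 2.0, OR gate 0.2.
params = dict(q=0.9, c_ff=1.0, c_w=0.5, c_latch=2.0, c_or=0.2)
k_opt = best_fanout(**params)
```

A small $k$ amortizes the latch poorly, while a large $k$ makes it unlikely that all $k$ flip-flops are idle together, so the saving peaks at a moderate fan-out.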

Timing Implications
[Figure: timing diagram over the clock period TC showing clk, clk_g, Q1, D2 and D3, annotated with tpA, tpcq_latch, tpcq_FF, tpd_logic, tpX, tpO, tsetup_latch and tsetup_FF.]

Timing Constraints
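clk_g: $t_{pcq\_FF} + t_{pd\_logic} + t_{setup\_FF} \le T_C$
clk: $t_{pA} + t_{pcq\_latch} + t_{pcq\_FF} + t_{pd\_logic} + t_{pX} + t_{pO} + t_{setup\_latch} \le T_C$
Combined: $t_{pcq\_FF} + t_{pd\_logic} + T \le T_C$, where
$T = \max \{ t_{setup\_FF},\ t_{pA} + t_{pcq\_latch} + t_{pX} + t_{pO} + t_{setup\_latch} \}$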
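The two constraints can be checked for a given set of delays with a few lines of code. This is a sketch only: the dictionary keys mirror the slide's symbols, and every delay value (in ps) is an invented number for illustration.

```python
def min_clock_period(t):
    """Smallest clock period T_C satisfying both timing constraints.

    t: dict of delays keyed by the slide's symbol names.
    """
    # Constraint labeled clk_g: flip-flop to flip-flop path.
    path_clk_g = t["tpcq_FF"] + t["tpd_logic"] + t["tsetup_FF"]
    # Constraint labeled clk: path through the gating logic into the latch.
    path_clk = (
        t["tpA"] + t["tpcq_latch"] + t["tpcq_FF"] + t["tpd_logic"]
        + t["tpX"] + t["tpO"] + t["tsetup_latch"]
    )
    return max(path_clk_g, path_clk)


# Hypothetical delays in picoseconds.
delays = {
    "tpcq_FF": 50, "tpd_logic": 400, "tsetup_FF": 30,
    "tpA": 20, "tpcq_latch": 40, "tpX": 30, "tpO": 40, "tsetup_latch": 25,
}
t_c_min = min_clock_period(delays)
gating_overhead = t_c_min - (delays["tpcq_FF"] + delays["tpd_logic"] + delays["tsetup_FF"])
```

With these numbers the gating path dominates, so the gating logic costs 125 ps of cycle time relative to the ungated design.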

Optimal Flip-flop k-size Grouping


Given $n$ flip-flops and $m + 1$ clock cycles:
$\mathbf{a} = (a_1, \ldots, a_m)$ is the activity (toggling) of a flip-flop.
$\|a_i \oplus a_j\|$ is the number of redundant clock pulses incurred by jointly clocking $FF_i$ and $FF_j$.
$1 - \|a_i \oplus a_j\| / m$ measures the activity correlation between $FF_i$ and $FF_j$.

Activity Correlation of FFs

[Figure: activity correlation of FF pairs.]
Avoid placing these pairs in the same group.

FF Pairwise Activity Model


$G(V, E, w)$: FF pairwise activity graph.
$v_i \in V$ corresponds to $FF_i$.
$e_{ij} = (v_i, v_j) \in E$ is an FF pairing.
$a_i | a_j$ is the joint toggling.
$w(e_{ij}) = \|a_i \oplus a_j\|$ is the number of redundant clock pulses, hence a waste.
$E' \subseteq E$: vertex matching.

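Total power over the matching $E'$:
$P = 2 \sum_{e_{ij} \in E'} \|a_i | a_j\| = \sum_{v_i \in V} \|a_i\| + \sum_{e_{ij} \in E'} \left[ \left( \|a_i | a_j\| - \|a_i\| \right) + \left( \|a_i | a_j\| - \|a_j\| \right) \right]$
$= \sum_{v_i \in V} \|a_i\| + \sum_{e_{ij} \in E'} \|a_i \oplus a_j\| = \sum_{v_i \in V} \|a_i\| + \sum_{e_{ij} \in E'} w(e_{ij})$
The first sum is the essential power and the second is the waste.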

[Figure: overlapping activities $a_i$ and $a_j$, with the non-overlapping part $a_i \oplus a_j$ marked.]
Optimal FF pairing ($k = 2$) is solved in polynomial time by minimum-weight perfect graph matching.
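The pairing problem can be illustrated on a toy example. Note that the sketch below is an exhaustive search, exponential in the number of flip-flops, and stands in for a real polynomial-time minimum-weight perfect matching algorithm only to keep the example self-contained. The activity vectors and names are invented.

```python
def pair_waste(a_i, a_j):
    """Redundant clock pulses when two FFs share a gater: number of 1s in a_i XOR a_j."""
    return sum(x ^ y for x, y in zip(a_i, a_j))


def best_pairing(activity):
    """Exhaustive minimum-weight perfect matching over a dict name -> activity bits."""
    names = sorted(activity)
    if len(names) % 2:
        raise ValueError("pairing needs an even number of flip-flops")

    def search(remaining):
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best = None
        for idx, partner in enumerate(rest):
            cost, pairs = search(rest[:idx] + rest[idx + 1:])
            cost += pair_waste(activity[first], activity[partner])
            if best is None or best[0] > cost:
                best = (cost, [(first, partner)] + pairs)
        return best

    return search(names)


# Hypothetical toggling vectors over 6 clock cycles.
pair_activity = {
    "ff1": [1, 0, 0, 1, 0, 0],
    "ff2": [0, 1, 0, 0, 1, 0],
    "ff3": [1, 0, 0, 1, 0, 1],
    "ff4": [0, 1, 0, 0, 1, 1],
}
pairing_cost, pairing = best_pairing(pair_activity)
```

Flip-flops with highly correlated toggling (ff1 with ff3, ff2 with ff4) end up together, wasting only 2 clock pulses in total.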

What happens when k > 2?

Is repeated perfect matching optimal?

[Figure: eight flip-flops with activities $a_1, \ldots, a_8$, first paired as $a_1 | a_2$, $a_3 | a_4$, $a_6 | a_5$ and $a_7 | a_8$.]

No! Here is the optimal 4-size grouping.

Finding the Optimal k-size FFs Groups


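$H(V, E, w)$: $k$-uniform weighted hypergraph.
For $\mathbf{v} \subseteq V$ with $|\mathbf{v}| = k$, $e_{\mathbf{v}} = \{ v_u : u \in \mathbf{v} \} \in E$ is a hyperedge, and $|E| = \binom{n}{k}$.
$\bigcup_{u \in \mathbf{v}} a_u$ is the joint toggling of the FF group.
$w(e_{\mathbf{v}}) = \sum_{v \in \mathbf{v}} \left\| a_v \oplus \bigcup_{u \in \mathbf{v}} a_u \right\|$ is the redundant clocking, resulting in power waste.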

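Total power, where $E' \subseteq E$ is the chosen set of hyperedges:
$P = \sum_{e_{\mathbf{v}} \in E'} k \left\| \bigcup_{u \in \mathbf{v}} a_u \right\| = \sum_{v_i \in V} \|a_i\| + \sum_{e_{\mathbf{v}} \in E'} \sum_{v \in \mathbf{v}} \left\| a_v \oplus \bigcup_{u \in \mathbf{v}} a_u \right\| = \sum_{v_i \in V} \|a_i\| + \sum_{e_{\mathbf{v}} \in E'} w(e_{\mathbf{v}})$
Minimizing $\sum_{e_{\mathbf{v}} \in E'} w(e_{\mathbf{v}})$ is called MIN_CLK_GATE.
A Minimum Weight Set Partitioning (Exact Covering) algorithm can solve it, but Exact Covering is NP-hard.
FF grouping does not cross clock domain boundaries. Hence grouping works within the scope of a clock domain.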
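For very small instances MIN_CLK_GATE can be solved by trying every partition into groups of size $k$. The sketch below does exactly that; it is an exhaustive search meant only to illustrate the hyperedge weight and the objective, not a practical algorithm, and the activity vectors and names are invented.

```python
from itertools import combinations


def group_waste(group, activity):
    """w(e_v): for each FF in the group, the 1s in (a_v XOR joint toggling), summed."""
    joint = [int(any(bits)) for bits in zip(*(activity[v] for v in group))]
    return sum(sum(x ^ y for x, y in zip(activity[v], joint)) for v in group)


def best_grouping(activity, k):
    """Exhaustive minimum-waste partition of the flip-flops into groups of size k."""
    names = sorted(activity)
    if len(names) % k:
        raise ValueError("number of flip-flops must be a multiple of k")

    def search(remaining):
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best = None
        for others in combinations(rest, k - 1):
            group = (first,) + others
            left = [v for v in rest if v not in others]
            cost, groups = search(left)
            cost += group_waste(group, activity)
            if best is None or best[0] > cost:
                best = (cost, [group] + groups)
        return best

    return search(names)


# Hypothetical toggling vectors over 6 clock cycles.
group_activity = {
    "ff1": [1, 0, 0, 1, 0, 0],
    "ff2": [0, 1, 0, 0, 1, 0],
    "ff3": [1, 0, 0, 1, 0, 1],
    "ff4": [0, 1, 0, 0, 1, 1],
}
cost_k2, groups_k2 = best_grouping(group_activity, 2)
cost_k4, groups_k4 = best_grouping(group_activity, 4)
```

For $k = 2$ the group weight reduces to the pairwise XOR weight. Putting all four flip-flops under one gater raises the waste from 2 to 10 pulses, in exchange for fewer gaters.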

Distribution of clock domain size (DSP)

Distribution of clock domain size (Network Control)