Technical Reviewers
Jim Frenzel, University of Idaho
Zhao Zhang, Iowa State University
DRAM Circuit Design
Fundamental and High-Speed Topics
Brent Keeth
R. Jacob Baker
Brian Johnson
Feng Lin
IEEE PRESS
WILEY-INTERSCIENCE
A John Wiley & Sons, Inc., Publication
Copyright © 2008 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic format. For information about Wiley products, visit our web site at
www.wiley.com.
ISBN 978-0-470-18475-2
10 9 8 7 6 5 4 3 2 1
To Susi, John, Katie, Gracie, Dory, and Faith (B.K.)
To Julie, Kyri, and Josh (J.B.)
To Cassandra, Parker, Spencer, and Dolpha
To Hui, Luan, and Conrad (F.L.)
To Leah and Nicholas (M.M.)
Preface
are available in technical journals and symposium digests. The book introduces the reader to DRAM theory, history, and circuits in a systematic, tutorial fashion. The level of detail varies, depending on the topic. In most cases, however, our aim is merely to introduce the reader to a functional element and illustrate it with one or more circuits. After gaining familiarity with the purpose and basic operation of a given circuit, the reader should be able to tackle more detailed papers on the subject.
The second half of the book is completely devoted to advanced concepts pertaining to state-of-the-art high-speed and high-performance DRAM memory. The two halves of the book are worlds apart in content. This is intentional and serves to rapidly advance the reader from novice to expert through the turning of a page.
The book begins in Chapter 1 with a brief history of DRAM device evolution from the first 1Kbit device to the 64Mbit synchronous devices. This chapter introduces the reader to basic DRAM operation in order to lay a foundation for more detailed discussion later. Chapter 2 investigates the DRAM memory array in detail, including fundamental array circuits needed to access the array. The discussion moves into array architecture issues in Chapter 3, including a design example comparing known architecture types to a novel, stacked digitline architecture. This design example should prove useful, for it delves into important architectural trade-offs and exposes underlying issues in memory design. Chapter 4 then explores peripheral circuits that support the memory array, including column decoders and redundancy. The reader should find Chapter 5 very interesting due to the breadth of circuit types discussed. This includes data path elements, address path elements, and synchronization circuits. Chapter 6 follows with a discussion of voltage converters commonly found on DRAM designs. The list of converters includes voltage regulators, voltage references, VCC/2 generators, and voltage pumps.
Chapter 7 introduces the concept of high-performance memory and underlying market forces. Chapter 8 examines high-speed DRAM memory architectures, including a discussion of performance-cost trade-offs. Chapter 9 takes a look at the input circuit path of high-performance DRAM, while Chapter 10 takes a complementary look at the output circuit path. Chapter 11 dives into the complicated world of high-performance timing circuits, including delay-locked loops and phase-locked loops. Chapter 12 ties it all together by tackling the difficult subject of control logic design. This is an especially important topic in high-performance memory chips. Chapter 13 looks at power delivery and examines methods to improve performance through careful analysis and design of the power delivery network. Finally, Chapter 14 discusses future work in high-performance memory. We wrap up the book with the Appendix, which directs the reader to a detailed list of papers from major conferences and journals.
Brent Keeth
R. Jacob Baker
Brian Johnson
Feng Lin
Chapter 1
An Introduction to DRAM
Dynamic random access memory (DRAM) integrated circuits (ICs) have existed for more than thirty years. DRAMs evolved from the earliest kilobit (Kb) generation to the gigabit (Gb) generation through advances in both semiconductor process and circuit design technology. Tremendous advances in process technology have dramatically reduced feature size, permitting ever higher levels of integration. These increases in integration have been accompanied by major improvements in component yield to ensure that overall process solutions remain cost-effective and competitive. Technology improvements, however, are not limited to semiconductor processing. Many of the advances in process technology have been accompanied or enabled by advances in circuit design technology. In most cases, advances in one have enabled advances in the other. In this chapter, we introduce some fundamentals of the DRAM IC, assuming that the reader has a basic background in complementary metal-oxide semiconductor (CMOS) circuit design, layout, and simulation [1].
address input buffer. The input buffers that drive the row (R) and column (C) decoders in the block diagram have two purposes: to provide a known input capacitance on the address input pins and to detect the input address signal at a known level so as to reduce timing errors. The level VTRIP, an idealized trip point around which the input buffers slice the input signals, is important due to the finite transition times on the chip inputs (Figure 1.3). Ideally, to avoid distorting the duration of the logic zeros and ones, VTRIP should be positioned at a known level relative to the maximum and minimum input signal amplitudes. In other words, the reference level should change with changes in temperature, process conditions, input maximum amplitude (VIH), and input minimum amplitude (VIL). Having said this, we note that the input buffers used in first-generation DRAMs were simply inverters.
Continuing our discussion of the block diagram shown in Figure 1.1, we see that five address inputs are connected through a decoder to the 1,024-bit memory array in both the row and column directions. The total number of addresses in each direction, resulting from decoding the 5-bit word, is 32. The single memory array is made up of 1,024 memory elements laid out in a square of 32 rows and 32 columns. Figure 1.4 illustrates the conceptual layout of this memory array. A memory element is located at the intersection of a row and a column.
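Where the text describes decoding the 5-bit row and column words, a minimal sketch (not from the book; the bit ordering is an assumption made for illustration) of how a 10-bit address selects one cell in the 32 x 32 array might look like this:

# Minimal sketch (illustrative only): a 10-bit address selects one of 1,024
# cells in a 32 x 32 array, with 5 bits going to the row decoder and 5 bits
# to the column decoder. Treating the upper bits as the row is an assumption.

ROW_BITS = COL_BITS = 5            # 2**5 = 32 rows and 32 columns

def decode(address: int) -> tuple[int, int]:
    """Split a 10-bit address into (row, column) for the 32 x 32 array."""
    assert 0 <= address < 1 << (ROW_BITS + COL_BITS)
    row = (address >> COL_BITS) & ((1 << ROW_BITS) - 1)   # upper 5 bits
    col = address & ((1 << COL_BITS) - 1)                 # lower 5 bits
    return row, col

if __name__ == "__main__":
    print(decode(0b10110_01101))   # -> (22, 13)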
Figure 1.2 1,024-bit DRAM pin connections.
1.1.1.4 A Note on the Power Supplies. The voltage levels used in the 1k DRAM are unusual by modern-day standards. In reviewing Figure 1.2, we see that the 1k DRAM chip uses two power supplies: VDD and VSS. To begin, VSS is a greater voltage than VDD: VSS is nominally +5 V, while VDD is -12 V. The value of VSS was set by the need to interface to logic circuits that were implemented using transistor-transistor logic (TTL). The 17 V difference between VDD and VSS was necessary to maintain a large signal-to-noise ratio in the DRAM array. We discuss these topics in greater detail later in the book. The VSS power supply used in modern DRAM designs, at the time of this writing, is generally zero; the VDD is in the neighborhood of 1.5 V.
1.1.1.5 The 3-Transistor DRAM Cell. One of the interesting circuits used in the 1k DRAM (and a few of the 4k and 16k DRAMs) is the 3-transistor DRAM memory cell shown in Figure 1.8. The column- and rowlines shown in the block diagram of Figure 1.1 are split into Write and Read line pairs. When the Write rowline is HIGH, M1 turns ON. At this point, the data present on the Write columnline is passed to the gate of M2, and the information voltage charges or discharges the input capacitance of M2. The next, and final, step in writing to the mbit cell is to turn OFF the Write rowline by driving it LOW. At this point, we should be able to see why the memory is called dynamic: the charge stored on the input capacitance of M2 will leak off over time.
If we want to read out the contents of the cell, we begin by first precharging the Read columnline to a known voltage and then driving the Read rowline HIGH. Driving the Read rowline HIGH turns M3 ON and allows M2 either to pull the Read columnline LOW or to leave the precharged voltage of the Read columnline unchanged. (If M2's gate is a logic LOW, then M2 will be OFF, having no effect on the state of the Read columnline.) The main drawback of using the 3-transistor DRAM cell, and the reason it is no longer used, is that it requires two pairs of column and rowlines and a large layout area. Modern 1-transistor, 1-capacitor DRAM cells use a single rowline, a single columnline, and considerably less area.
greater than 5 V to turn the NMOS access devices fully ON (more on this later), and the substrate held at a potential less than zero. For voltages outside the supply range, charge pumps are used (see Chapter 6). The move from NMOS to CMOS, at the 1Mb density level, occurred because of concerns over speed, power, and layout size. At the cost of process complexity, complementary devices improved the design.
1.1.2.1 Multiplexed Addressing. Figure 1.9 shows a 4k DRAM block diagram, while Figure 1.10 shows the pin connections for a 4k chip. Note that compared to the block diagram of the 1k DRAM shown in Figure 1.1, the number of address input pins has decreased from 10 to 6, even though the memory size has quadrupled. This is the result of using multiplexed addressing, in which the same address input pins are used for both the row and column addresses. The row address strobe (RAS*) input clocks the address present on the DRAM address pins A0 to A5 into the row address latches on its falling edge. The column address strobe (CAS*) input clocks the input address into the column address latches on its falling edge.
Figure 1.11 shows the timing relationships between RAS*, CAS*, and the address inputs. Note that tRC is still (as indicated in the last section) the random cycle time for the DRAM, indicating the maximum rate at which we can write to or read from a DRAM. Note too how the row (or column) address must be present on the address inputs when RAS* (or CAS*) goes LOW. The parameters tRAS and tCAS indicate how long RAS* or CAS* must remain LOW after clocking in a row or column address. The parameters tASR, tRAH, tASC, and tCAH indicate the setup and hold times for the row and column addresses, respectively.
[Figures 1.9 and 1.10: 4k DRAM block diagram (address buffers, row and column decoders, and two 2k cell arrays) and pin connections.]
[Figure: the 1-transistor, 1-capacitor mbit cell, showing the digitline (also called the columnline or bitline), the wordline (rowline), and the storage capacitor.]
The more bitlines we use in an array, the longer the delay through the wordline (Figure 1.13).
[Figure 1.13: a wordline driven from the row address decoder and wordline driver, loaded by the distributed gate capacitance along its length.]
If we drive the wordline on the left side of Figure 1.13 HIGH, the signal will take a finite time to reach the end of the wordline (the wordline on the right side of Figure 1.13). This is due to the distributed resistance/capacitance structure formed by the resistance of the polysilicon wordline and the capacitance of the MOSFET gates. The delay limits the speed of DRAM operation. To be precise, it limits how quickly a row can be opened and closed. To reduce this RC time, a polycide wordline is formed by adding a silicide, for example, a mixture of a refractory metal such as tungsten with polysilicon, on top of the polysilicon. Using a polycide wordline has the effect of reducing the wordline resistance. Also, additional drivers can be placed at different locations along the wordline, or the wordline can be stitched at various locations with metal.
The limitation on the number of additional wordlines can be understood by realizing that adding more wordlines to the array adds more parasitic capacitance to the bitlines. This parasitic capacitance becomes important when sensing the value of the data charge stored in the memory element. We'll discuss this in more detail in the next section.
1.1.2.3 Memory Array Size. A comment is in order about memory array size and how addressing can be used for setting word and page size. (We'll explain what this means in a moment.) If we review the block diagram of the 4k DRAM shown in Figure 1.9, we see that two 2k-DRAM memory arrays are used. Each 2k memory array is composed of 64 wordlines and 32 bitlines for 2,048 memory elements/address locations per array. In the block diagram, notice that a single bit, coming from the column decoder, can be used to select data, via the bitlines, from Array0 or Array1.
From our discussion earlier, we can open a row in Array0 while at the same time opening a row in Array1 by simply applying a row address to the input address pins and driving RAS* LOW. Once the rows are open, it is a simple matter of changing the column address to select different data associated with the same open row from either array. If our word size is 1 bit, we could define a page as being 64 bits in length (32 bits from each array). We could also define our page size as 32 bits with a 2-bit word for input/output. We would then say that the DRAM is a 4k DRAM organized as 2k x 2. Of course, in the 4k DRAM, in which the number of bits is small, the concepts of page reads or page size aren't too useful. We present them here simply to illustrate the concepts. Let's consider a more practical and modern configuration.
Suppose we have a 64-Meg DRAM organized as 16 Meg x 4 (4 bits input/output) using 4k row address locations and 4k column address locations (12 bits or pins are needed for each 4k of addressing). If our (sub) memory array size is 256kbits, then we have a total of 256 memory arrays on our DRAM chip. We'll assume that there are 512 wordlines and 512 bitlines (digitlines), so that the memory array is logically square. (However, physically, as we shall see, the array is not square.) Internal to the chip, in the address decoders, we can divide the row and column addresses into two parts: the lower 9 bits for addressing the wordlines/bitlines in a 256k memory array and the upper 3 bits for addressing one of the 64 "group-of-four" memory arrays (6 bits total coming from the upper 3 bits of the row and column addresses).
Our 4-bit word comes from the group-of-four memory arrays (one bit from each memory array). We can define a page of data in the DRAM by realizing that when we open a row in each of the four memory arrays, we are accessing 2k of data (512 bits/array x 4 arrays). By simply changing the column address without changing the row address, and thus without opening another group-of-four wordlines, we can access the 2k "page" of data. With a little imagination, we can see different possibilities for the addressing. For example, we could open 8 group-of-four memory arrays with a row address and thus increase the page size to 16k, or we could use more than one bit at a time from an array to increase word size.
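As a concrete illustration of this addressing arithmetic, here is a minimal sketch (not from the book; the exact bit-field ordering is an assumption made for illustration):

# Hypothetical illustration of the 64-Meg (16 Meg x 4) example above.
# Assumption: the upper 3 bits of each 12-bit address select the array group,
# and the lower 9 bits address wordlines/digitlines inside a 256k sub-array.

ROW_BITS = COL_BITS = 12          # 4k row and 4k column addresses
INTRA_BITS = 9                    # 512 wordlines, 512 digitlines per sub-array
WORD_WIDTH = 4                    # one bit from each array in a group of four

def split(addr: int) -> tuple[int, int]:
    """Return (upper 3 bits, lower 9 bits) of a 12-bit address."""
    return addr >> INTRA_BITS, addr & ((1 << INTRA_BITS) - 1)

row_group, wordline = split(0xABC & 0xFFF)
col_group, digitline = split(0x123 & 0xFFF)

group_of_four = (row_group << 3) | col_group   # one of 64 groups (6 bits)
page_bits = (1 << INTRA_BITS) * WORD_WIDTH     # 512 bits/array x 4 arrays

print(group_of_four, wordline, digitline, page_bits)   # page_bits == 2048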
1.1.2.4 Refreshing the DRAM. Refreshing the DRAM is accomplished by sequentially opening each row in the DRAM. (We'll discuss how the DRAM cell is refreshed in greater detail later in the book.) If we use the 64-Meg example in the last section, we need to supply 4k row addresses to the DRAM by changing the external address inputs from 000000000000 to 111111111111 while clocking the addresses into the DRAM using the falling edge of RAS*. In some DRAMs, an internal row address counter is present to make the DRAM easier to refresh. The general specification for 64-Meg DRAM Refresh is that all rows must be refreshed at least every 64 ms, which is an average of 15.6 us per row. This means that if the Read cycle time tRC is 100 ns (see Figure 1.11), it will take 4,096 · 100 ns or
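A minimal sketch (not from the book) of the refresh arithmetic quoted above, using the 64 ms specification, 4,096 rows, and a 100 ns cycle time:

# Hypothetical worked example of the refresh numbers quoted above.
REFRESH_PERIOD = 64e-3     # all rows must be refreshed every 64 ms
ROWS = 4096                # 4k row addresses in the 64-Meg example
T_RC = 100e-9              # assumed Read cycle time of 100 ns

avg_interval = REFRESH_PERIOD / ROWS          # time budget per row
burst_refresh_time = ROWS * T_RC              # time to walk every row once
overhead = burst_refresh_time / REFRESH_PERIOD

print(f"{avg_interval * 1e6:.1f} us per row")           # ~15.6 us
print(f"{burst_refresh_time * 1e6:.1f} us total")       # ~409.6 us
print(f"{overhead * 100:.2f}% of the time refreshing")  # ~0.64%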
held LOW and CAS* is toggled, the internal address counter is incremented, and the sequential data appears on the output of the DRAM. The term nibble mode comes from limiting the number of CAS* cycles to four (a nibble).
[Figure: RAS*/CAS* Read timing diagram, showing the row and column addresses, tRAS, tCAH, tRAC, and the appearance of data on DOUT.]
[Figure 1.17: 64Mb SDRAM pin assignments, including DQ0-DQ15, the address and bank-address inputs, CLK, CKE, CS*, RAS*, CAS*, WE*, the DQM pins, and the VDD/VDDQ and VSS/VSSQ supply pins.]
Using the numbers from the previous paragraph, this means that a 64Mb DDR SDRAM with an input/output word size of 16 bits will transfer data to and from the memory controller at 400-572 MB/s.
Figure 1.18 shows the block diagram of a 64Mb SDRAM with 16-bit I/O. Note that although CLK is now used for transferring data, we still have the second-generation control signals CS*, WE*, CAS*, and RAS* present on the part. (CKE is a clock enable signal which, unless otherwise indicated, is assumed HIGH.) Let's discuss how these control signals are used in an SDRAM by recalling that in a second-generation DRAM, a Write was executed by first driving WE* and CS* LOW. Next, a row was opened by applying a row address to the part and then driving RAS* LOW. (The row address is latched on the falling edge of RAS*.) Finally, a column address was applied and latched on the falling edge of CAS*. A short time later, the data applied to the part would be written to the accessed memory location.
For the SDRAM Write, we change the syntax of the descriptions of what's happening in the part. However, the fundamental operation of the DRAM circuitry is the same as that of the second-generation DRAMs. We can list these syntax changes as follows:
1. The memory is segmented into banks. For the 64Mb memory of Figure 1.17 and Figure 1.18, each bank has a size of 16Mb (organized as 4,096 row addresses [12 bits] x 256 column addresses [8 bits] x 16 bits [16 DQ I/O pins]). As discussed earlier, this is nothing more than a simple logic design of the address decoder (the banks can be laid out so that they are physically in the same area). The bank selected is determined by the addresses BA0 and BA1.
2. In second-generation DRAMs, we said, "We open a row," as discussed earlier. In an SDRAM, we now say, "We activate a row in a bank." We do this by issuing an active command to the part. Issuing an active command is accomplished on the rising edge of CLK with a row/bank address applied to the part while CS* and RAS* are LOW and CAS* and WE* are held HIGH.
3. In second-generation DRAMs, we said, "We write to a location given by a column address," by driving CAS* LOW with the column address applied to the part and then applying data to the part. In an SDRAM, we write to the part by issuing the Write command to the part (the control-signal levels for the active and Write commands are sketched after this list). Issuing a Write command is accomplished on the rising edge of CLK with a column/bank address applied to the part: CS*, CAS*, and WE* are held LOW, and RAS* is held HIGH.
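The control-signal encodings described in items 2 and 3 can be collected into a small lookup, as in the sketch below (illustrative only and not from the book; it covers just these two commands, not the full SDRAM command set).

# Hypothetical encoding of the two SDRAM commands described above.
# Each entry lists (CS*, RAS*, CAS*, WE*) sampled on the rising edge of CLK;
# "L" = LOW, "H" = HIGH.

COMMANDS = {
    "ACTIVE (activate a row in a bank)": ("L", "L", "H", "H"),
    "WRITE (write a column in a bank)":  ("L", "H", "L", "L"),
}

def decode(cs, ras, cas, we):
    """Return the command name matching the sampled control levels, if any."""
    for name, levels in COMMANDS.items():
        if levels == (cs, ras, cas, we):
            return name
    return "other/NOP"

print(decode("L", "L", "H", "H"))   # ACTIVE
print(decode("L", "H", "L", "L"))   # WRITE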
[Figure 1.18: functional block diagram of the 64Mb SDRAM, showing the control logic driven by CS*, RAS*, CAS*, WE*, CKE, and CLK, and the column address counter/latch.]
Command (function)                            CS*  RAS*  CAS*  WE*  DQM   Addr  DQs    Notes
Command inhibit (NOP)                          H    X     X     X    X     X     X      -
No operation (NOP)                             L    H     H     H    X     X     X      -
PRECHARGE (deactivate row in bank or banks)    L    L     H     L    X     Code  X      5
Auto-Refresh or self-refresh
  (enter self-refresh mode)                    L    L     L     H    X     X     X      6, 7
Write inhibit/output Hi-Z                      -    -     -     -    H     -     Hi-Z   8
Notes
1. CKE is HIGH for all commands shown except for self-refresh.
2. A0-A11 define the op-code written to the mode register.
3. A0-A11 provide the row address, and BA0, BA1 determine which bank is made active.
4. A0-A9 (x4), A0-A8 (x8), or A0-A7 (x16) provide the column address; A10 HIGH enables the auto PRECHARGE feature (nonpersistent), while A10 LOW disables the auto PRECHARGE feature; BA0, BA1 determine which bank is being read from or written to.
5. A10 LOW: BA0, BA1 determine the bank being precharged. A10 HIGH: all banks are precharged, and BA0, BA1 are "don't care."
6. This command is Auto-Refresh if CKE is HIGH and Self-Refresh if CKE is LOW.
7. Internal Refresh counter controls row addressing; all inputs and I/Os are "don't care" except for CKE.
8. Activates or deactivates the DQs during Writes (zero-clock delay) and Reads (two-clock delay).
Pipeline stages in the data path can also be helpful when synchronizing output data to the system clock. CAS latency refers to a parameter used by the SDRAM to synchronize the output data from a Read request with a particular edge of the system clock. A typical Read for an SDRAM with CAS latency set to three is shown in Figure 1.19. SDRAMs must be capable of reliably functioning over a range of operating frequencies while maintaining a specified CAS latency. This is often accomplished by configuring the pipeline stage to register the output data to a specific clock edge, as determined by the CAS latency parameter.
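As a rough illustration (not from the book; the operating frequencies are assumed), the relationship between the clock period, the CAS latency, and when read data is registered can be sketched as:

# Hypothetical sketch: when does read data appear for a given CAS latency?
# Registering data CL clock edges after the READ command is a simplification
# of the pipeline behavior described above.

def data_ready_ns(clock_mhz: float, cas_latency: int) -> float:
    """Time from READ command to output data, in nanoseconds."""
    t_ck = 1e3 / clock_mhz          # clock period in ns
    return cas_latency * t_ck

for f in (100, 125, 143):           # assumed operating frequencies (MHz)
    print(f, "MHz, CL=3 ->", round(data_ready_ns(f, 3), 1), "ns")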
[Figure 1.19: SDRAM Read with a CAS latency of three, showing CLK, the addresses, and the output data Q.]
[Figure: mode register definition. Address inputs A11-A0 load mode register bits M11-M0, which select the write burst mode (WB), operating mode, CAS* latency (M6-M4), burst type (BT, M3), and burst length (M2-M0).]
Burst length
M2  M1  M0    M3 = 0       M3 = 1
0   0   0     1            1
0   0   1     2            2
0   1   0     4            4
0   1   1     8            8
1   0   0     Reserved     Reserved
1   0   1     Reserved     Reserved
1   1   0     Reserved     Reserved
1   1   1     Full page    Reserved

M3  Burst type
0   Sequential
1   Interleaved

M6  M5  M4    CAS* latency
0   0   0     Reserved
0   0   1     Reserved
0   1   0     2
0   1   1     3
1   0   0     Reserved
1   0   1     Reserved
1   1   0     Reserved
1   1   1     Reserved
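Tying the mode-register fields above together, the following sketch (illustrative only; it models just the burst length, burst type, and CAS* latency fields shown in the tables) assembles an op-code such as would be driven on A0-A11:

# Hypothetical packing of the mode register fields tabulated above into the
# low-order bits of the op-code loaded through A0-A11.

BURST_LENGTH = {1: 0b000, 2: 0b001, 4: 0b010, 8: 0b011}   # M2..M0
CAS_LATENCY = {2: 0b010, 3: 0b011}                          # M6..M4

def mode_register(burst_len: int, interleaved: bool, cas_latency: int) -> int:
    m2_m0 = BURST_LENGTH[burst_len]
    m3 = 1 if interleaved else 0
    m6_m4 = CAS_LATENCY[cas_latency]
    # M11-M7 (write burst mode, op mode, reserved) are left at zero here.
    return (m6_m4 << 4) | (m3 << 3) | m2_m0

print(bin(mode_register(burst_len=8, interleaved=False, cas_latency=3)))
# 0b110011 -> CAS latency 3, sequential bursts of 8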
Q = +(VCC/2) · C    (1.1)

where C is the capacitance value in farads. Conversely, storing a logic zero in the cell requires a capacitor with a voltage of -VCC/2 across it. Note that the stored charge on the mbit capacitor for a logic zero is

Q = -(VCC/2) · C    (1.2)
The charge is negative with respect to the VCC/2 common node voltage in this state. Various leakage paths cause the stored capacitor charge to slowly deplete. To return the stored charge and thereby maintain the stored data state, the cell must be refreshed. The required refreshing operation is what makes DRAM memory dynamic rather than static.
[Figure: an mbit cell (access transistor M1 and storage capacitor) connected to a wordline (rowline) and to one of the digitlines D0-D3.]
Vsignal = (Vcell · Cmbit) / (Cdigit + Cmbit)    (1.3)

A Vsignal of 200 mV is yielded from a design in which Vcell = 2.2 V, Cmbit = 20 fF, and Cdigit = 200 fF.
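A quick numeric check of Equation (1.3) with the example values just quoted (a sketch, not part of the book):

# Charge-sharing signal on the digitline, per Equation (1.3) above.
V_CELL = 2.2        # volts stored on the mbit capacitor (example value)
C_MBIT = 20e-15     # mbit (cell) capacitance, farads
C_DIGIT = 200e-15   # digitline capacitance, farads

v_signal = V_CELL * C_MBIT / (C_DIGIT + C_MBIT)
print(f"Vsignal = {v_signal * 1e3:.0f} mV")   # -> 200 mV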
After the cell has been accessed, sensing occurs. Sensing is essentially the amplification of the digitline signal, or the differential voltage between the digitlines. Sensing is necessary to properly read the cell data and refresh the mbit cells. (The reason for forming a digitline pair now becomes apparent.) A simplified sense amplifier circuit consists of a cross-coupled NMOS pair and a cross-coupled PMOS pair. The sense amplifiers also appear like a pair of cross-coupled inverters in which ACT and NLAT* provide power and ground. The NMOS pair, or Nsense-amp, has a common node labeled NLAT* (for Nsense-amp latch).
Similarly, the Psense-amp has a common node labeled ACT (for Active pull-up). Initially, NLAT* is biased to VCC/2, and ACT is biased to VSS, or signal ground. Because the digitlines D1 and D1* are both initially at VCC/2, the Nsense-amp transistors are both OFF. Similarly, both Psense-amp transistors are OFF. Again, when the mbit is accessed, a signal develops across the digitline pair. While one digitline contains charge from the cell access, the other digitline does not but serves as a reference for the Sensing operation. The sense amplifiers are generally fired sequentially: the Nsense-amp first, then the Psense-amp. Although designs vary at this point,
the higher drive of NMOS transistors and better VTH matching give Nsense-amps better sensing characteristics, and thus a lower error probability, compared to Psense-amps.
Waveforms for the Sensing operation are shown in Figure 1.28. The Nsense-amp is fired by bringing NLAT* (Nsense-amp latch) toward ground. As the voltage difference between NLAT* and the digitlines (D1 and D1*) approaches VTH, the NMOS transistor whose gate is connected to the higher voltage digitline begins to conduct. This conduction occurs first in the subthreshold region and then in the saturation region as the gate-to-source voltage exceeds VTH, and it causes the low-voltage digitline to discharge toward the NLAT* voltage. Ultimately, NLAT* will reach ground, and the digitline will be brought to ground potential. Note that the other NMOS transistor will not conduct: its gate voltage is derived from the low-voltage digitline, which is being discharged toward ground. In reality, parasitic coupling between digitlines and limited subthreshold conduction by the second transistor result in a temporary voltage drop on the high digitline, as seen in Figure 1.28.
Sometime after the Nsense-amp fires, ACT will be brought toward Vcc
to activate the Psense-amp, which operates in a complementary fashion to
the Nsense-amp. With the low-voltage digitline approaching ground, there
is a strong signal to drive the appropriate PMOS transistor into conduction.
This conduction, again moving from subthreshold to saturation, charges the
ing digitlines are accessed through additional CSEL lines that correspond to
different column address locations.
1. Initially, both RAS* and CAS* are HIGH. All bitlines in the DRAM are driven to VCC/2, while all wordlines are at 0 V. This ensures that all of the mbits' access transistors in the DRAM are OFF.
2. A valid row address is applied to the DRAM and RAS* goes LOW. While the row address is being latched, on the falling edge of RAS*, and decoded, the bitlines are disconnected from the VCC/2 bias and allowed to float. The bitlines at this point are charged to VCC/2, and they can be thought of as capacitors.
3. The row address is decoded and applied to the wordline drivers. This forces only one rowline in at least one memory array to VCCP. Driving the wordline to VCCP turns ON the mbits attached to this rowline and causes charge sharing between the mbit capacitance and the capacitance of the corresponding bitline. The result is a small perturbation (upwards for a logic one and downwards for a logic zero) in the bitline voltages.
4. The next operation is Sensing, which has two purposes: (a) to determine whether a logic one or zero was written to the cell and (b) to refresh the contents of the cell by restoring a full logic zero (0 V) or one (VCC) to the capacitor. Following the wordlines going HIGH, the Nsense-amp is fired by driving NLAT*, via an n-channel MOSFET, to ground. The inputs to the sense amplifier are two bitlines: the bitline we are sensing and the bitline that is not active (a bitline that is still charged to VCC/2, an inactive bitline). Pulling NLAT* to ground results in one of the bitlines going to ground. Next, the ACT signal is pulled up to VCC, driving the other bitline to VCC. Some important notes:
(a) It doesn't matter if a logic one or logic zero was sensed, because the inactive and active bitlines are pulled in opposite directions.
(b) The contents of the active cell, after opening a row, are restored to full voltage levels (either 0 V or VCC). The entire DRAM can be refreshed by opening each row.
Now that the row is open, we can write to or read from the DRAM. In either case, it is a simple matter of steering data to or from the active array(s) using the column decoder. When writing to the array, buffers set the new logic voltage levels on the bitlines. The row is still open because the wordline remains HIGH. (The row stays open as long as RAS* is LOW.)
When reading data out of the DRAM, the values sitting on the bitlines are transmitted to the output buffers via the I/O MOSFETs. To increase the speed of the reading operation, this data, in most situations, is transmitted to
REFERENCES
[1] R. J. Baker, CMOS: Circuit Design, Layout, and Simulation, 2nd ed.
Hoboken, NJ: John Wiley & Sons, Inc., and IEEE Press, 2005.
[2] Micron Technology, Inc., Synchronous DRAM datasheet, 1999.
Chapter 2
The DRAM Array
Figure 2.1, is essentially under the control of process engineers, for every
aspect of the mbit must meet stringent performance and yield criteria.
[Figure: mbit cell layout, showing the Metal1 digitline, the poly3 cellplate, the N+ active area, and the poly1 wordlines.]
It is easier to explain the 8F2 designation with the aid of Figure 2.3. An imaginary box drawn around the mbit defines the cell's outer boundary. Along the x-axis, this box includes one-half digitline contact feature, one wordline feature, one capacitor feature, one field poly feature, and one-half poly space feature, for a total of four features. Along the y-axis, this box contains two one-half field oxide features and one active area feature, for a total of two features. The area of the mbit is therefore

4F · 2F = 8F2    (2.2)
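For a sense of scale, a small sketch (not from the book; the feature sizes are assumed purely for illustration) evaluating the 8F2 cell area at a few feature sizes:

# Hypothetical cell-area numbers for an 8F^2 mbit at assumed feature sizes.
CELL_FACTOR = 8   # 4F x 2F, per Equation (2.2)

for feature_nm in (130, 90, 50):                 # assumed feature sizes F
    area_um2 = CELL_FACTOR * (feature_nm * 1e-3) ** 2
    print(f"F = {feature_nm} nm -> cell area = {area_um2:.3f} um^2")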
Ideally, a twisting scheme equalizes the coupling terms from each digitline to all other digitlines, both true and complement. If implemented properly, the noise terms cancel or at least produce only common-mode noise, to which the differential sense amplifier is more immune.
Each digitline twist region consumes valuable silicon area. Thus, design engineers resort to the simplest and most efficient twisting scheme to get the job done. Because the coupling between adjacent metal lines is inversely proportional to the line spacing, the signal-to-noise problem gets increasingly worse as DRAMs scale to smaller and smaller dimensions. Hence, the industry trend is toward use of more complex twisting schemes on succeeding generations [6][7].
Figure 2.8 presents a process cross section for the buried capacitor mbit depicted in Figure 2.2, and Figure 2.9 shows an SEM image of the buried capacitor mbit. This type of mbit, employing a buried capacitor structure, places the digitline physically above the storage capacitor [10]. The digitline is constructed from either metal or polycide, while the digitline contact is formed using a metal or polysilicon plug technology. The mbit capacitor is formed with polysilicon (poly2) as the bottom plate, an oxide-nitride-oxide (ONO) dielectric, and a sheet of polysilicon (poly3). This top sheet of polysilicon becomes a common node shared by all mbit capacitors. The capacitor shape can be simple, such as a rectangle, or complex, such as concentric cylinders or stacked discs. The most complex capacitor structures are the topic of many DRAM process papers [11][12][13].
[Figure 2.8: buried capacitor mbit process cross section, showing the digitline, ONO dielectric, poly3 cellplate, storage node, field poly, wordline, field oxide, N+ active area, and P-substrate.]
has no more than VCC/2 across it for either stored logic state: a logic one at +VCC/2 or a logic zero at -VCC/2.
Two other basic mbit configurations are used in the DRAM industry. The first, shown in Figures 2.10, 2.11, and 2.12, is referred to as a buried digitline or capacitor-over-bitline (COB) cell [14][15]. The digitline in this cell is almost always formed of polysilicon rather than of metal.
As viewed from the top, the active area is normally bent or angled to accommodate the storage capacitor contact that must drop between digitlines. An advantage of the buried digitline cell over the buried capacitor cell of Figure 2.8 is that its digitline is physically very close to the silicon surface, making digitline contacts much easier to produce. The angled active area, however, reduces the effective active area pitch, constraining the isolation process even further. In buried digitline cells, it is also very difficult to form the capacitor contact. Because the digitline is at or near minimum pitch for the process, insertion of a contact between digitlines can be difficult.
Figures 2.13 and 2.14 present a process cross section of the third type of mbit used in the construction of DRAMs. Using trench storage capacitors, this cell is accordingly called a trench cell [12][13]. Trench capacitors are formed in the silicon substrate, rather than above the substrate, after etching deep holes into the wafer. The storage node is a doped polysilicon plug, which is deposited in the hole following growth or deposition of the capacitor dielectric. Contact between the storage node plug and the transistor drain is usually made through a poly strap.
With most trench capacitor designs, the substrate serves as the common-node connection to the capacitors, preventing the use of a +VCC/2 bias and thinner dielectrics. The substrate is heavily doped around the capacitor to reduce resistance and improve the capacitor's CV characteristics. A real
which permit four I/O transistors to share a single column select (CSEL) control signal. DRAM designs employing two or more metal layers run the column select lines across the arrays in either Metal2 or Metal3. Each column select can activate four I/O transistors on each side of an array to connect four digitline pairs (columns) to peripheral data path circuits. The I/O transistors must be sized carefully to ensure that instability is not introduced into the sense amplifiers by the I/O bias voltage or remnant voltages on the I/O lines.
Although designs vary significantly as to the numerical ratio, I/O transistors are designed to be two to eight times smaller than the Nsense-amplifier transistors. This is sometimes referred to as the beta ratio. A beta ratio between five and eight is considered standard; however, it can only be verified with silicon. Simulations may fail to adequately predict sense amplifier instability, although theory would predict better stability with a higher beta ratio and better Write times with a lower beta ratio. During a Write, the sense amplifier remains ON and must be overdriven by the Write driver (see Section 1.2.2).
The Nsense amplifier consists of cross-coupled NMOS transistors and drives the low-potential digitline to ground. Similarly, the Psense amplifier consists of cross-coupled PMOS transistors and drives the high-potential digitline to VCC.
of nonaccessed mbits becomes zero when the digitlines are latched. This results in high subthreshold leakage for a stored one level because the full VCC exists across the mbit transistor while its VGS is held to zero. Stored zero levels do not suffer from prolonged subthreshold leakage: any amount of cell leakage produces a negative VGS for the transistor. The net effect is that a stored one level leaks away much faster than a stored zero level. One's-level retention, therefore, establishes the maximum Refresh period for most DRAM designs. Boosted sense ground extends Refresh by reducing subthreshold leakage for stored ones. This is accomplished by guaranteeing a negative gate-to-source bias on nonaccessed mbit transistors. The benefit of extended Refresh from these designs is somewhat diminished, though, by the added complexity of generating boosted ground levels and the problem of digitlines that no longer equilibrate at VCC/2 V.
2.2.6 Configurations
Figure 2.19 shows a sense amplifier block commonly used in double- or triple-metal designs. It features two Psense amplifiers outside the isolation transistors, a pair of EQ/bias (EQb) devices, a single Nsense amplifier, and a single I/O transistor for each digitline. Because only half of the sense amplifiers for each array are on one side, this design is quarter pitch, as are the designs in Figures 2.20 and 2.21. Placement of the Psense amplifiers outside the isolation devices is necessary because a full one level (VCC) cannot pass through unless the gate terminal of the ISO transistors is driven above VCC. EQ/bias transistors are placed outside of the ISO devices to permit continued equilibration of digitlines in arrays that are isolated. The I/O transistor gate terminals are connected to a common CSEL signal for four adjacent digitlines. Each of the four I/O transistors is tied to a separate I/O
to permit writing a full logic one into the array mbits. The triple Nsense amplifier is suggestive of PMOS isolation transistors; it prevents full zero levels from being written unless the Nsense amplifiers are placed adjacent to the arrays. In this more complicated style of sense amplifier block, using three Nsense amplifiers guarantees faster sensing and higher stability than a similar design using only two Nsense amplifiers. The inside Nsense amplifier is fired before the outside Nsense amplifiers. However, this design will not yield a minimum layout, an objective that must be traded off against performance needs.
The sense amplifier block of Figure 2.21 can be considered a reduced configuration. This design has only one Nsense-amp and one Psense-amp, both of which are placed within the isolation transistors. To write full logic levels, either the isolation transistors must be depletion-mode devices or the gate voltage must be boosted above VCC by at least one VTH. This design still uses a pair of EQ/bias circuits to maintain equilibration on isolated arrays.
Only a handful of designs operate with a single EQ/bias circuit inside the isolation devices, as shown in Figure 2.22. Historically, DRAM engineers tended to shy away from designs that permitted digitlines to float for extended periods of time. However, as of this writing, at least three manufacturers in volume production have designs using this scheme.
A sense amplifier design for single-metal DRAMs is shown in Figure 2.23. Prevalent on 1-Meg and 4-Meg designs, single-metal processes conceded to multi-metal processes at the 16-Meg generation. Unlike the sense amplifiers shown in Figures 2.19, 2.20, 2.21, and 2.22, single-metal sense amps are laid out at half pitch: one amplifier for every two array digitlines. This type of layout is extremely difficult and places tight constraints on process design margins. With the loss of Metal2, the column select signals are not brought across the memory arrays. Generating column select signals locally for each set of I/O transistors requires a full column decode block.
2.2.7 Operation
A set of signal waveforms is illustrated in Figure 2.24 for the sense amplifier of Figure 2.19. These waveforms depict a Read-Modify-Write cycle (Late Write) in which the cell data is first read out and then new data is written back. In this example, a one level is read out of the cell, as indicated by D0 rising above D0* during cell access. A one level is always +VCC/2 in the mbit cell, regardless of whether it is connected to a true or complement digitline. The correlation between mbit cell data and the data appearing at the DRAM's data terminal (DQ) is a function of the data topology and the presence of data scrambling. Data or topo scrambling is implemented at the circuit level: it ensures that the mbit data state and the DQ logic level are in agreement. An mbit one level (+VCC/2) corresponds to a logic one at the DQ, and an mbit zero level (-VCC/2) corresponds to a logic zero at the DQ terminal.
Writing specific data patterns into the memory arrays is important to DRAM testing. Each type of data pattern identifies the weaknesses or sensitivities of each cell to the data in surrounding cells. These patterns include solids, row stripes, column stripes, diagonals, checkerboards, and a variety of moving patterns. Test equipment must be programmed with the data topology of each type of DRAM to correctly write each pattern. Often the tester itself guarantees that the pattern is correctly written into the arrays, unscrambling the complicated data and address topology as necessary to write a specific pattern. On some newer DRAM designs, part of this task is implemented on the DRAM itself, in the form of a topo scrambler, such that the mbit data state matches the DQ logic level. This implementation somewhat simplifies tester programming.
Again, there is not one set of timing waveforms that covers all design options. The sense amps of Figures 2.19 through 2.23 all require slightly different signals and timings. Various designs actually fire the Psense amplifier prior to or coincident with the Nsense amplifier. This obviously places greater constraints on the Psense amplifier design and layout, but these constraints are balanced by potential performance benefits. Similarly, the sequence of events as well as the voltages for each signal can vary. There are almost as many designs for sense amplifier blocks as there are DRAM design engineers. Each design reflects various influences, preconceptions, technologies, and levels of understanding. The bottom line is to maximize yield and performance and minimize everything else.
In conjunction with the wordline voltage rising from ground to VCCP, the gate-to-source capacitance of M1 provides a secondary boost to the boot node. The secondary boost helps to ensure that the boot voltage is adequate to drive the wordline to a full VCCP level.
The layout of the boot node is very important to the bootstrap wordline driver. First, the parasitic capacitance of the boot node, which includes routing, junction, and overlap components, must be minimized to achieve maximum boot efficiency. Second, charge leakage from the boot node must be minimized to ensure adequate VGS for transistor M1 such that the wordline remains at VCCP for the maximum RAS* LOW period. Low leakage is often achieved by minimizing the source area of M3 or by using donut gate structures that surround the source area, as illustrated in Figure 2.27.
The bootstrap driver is turned OFF by first driving the PHASE0 signal to ground. M1 remains ON because node B1 cannot drop below VCC - VTH; M1 substantially discharges the wordline toward ground. This is followed by the address decoder turning OFF, bringing DEC to ground and DEC* to VCC. With DEC* at VCC, transistor M2 turns ON and fully clamps the wordline to ground. A voltage level translator is required for the PHASE signal because it operates between ground and the boosted voltage VCCP. For a global row decode configuration, this requirement is not much of a burden. For a local row decode configuration, however, the requirement for level translators can be very troublesome. Generally, these translators are placed either in the array gap cells at the intersection of the sense amplifier blocks and row decode blocks or distributed throughout the row decode block itself. The translators require both PMOS and NMOS transistors and must be capable of driving large capacitive loads. Layout of the translators is exceedingly difficult, especially because the overall layout needs to be as small as possible.
pass gate, or a combination thereof. With any type of logic, however, the primary objectives in decoder design are to maximize speed and minimize die area. Because a great variety of methods have been used to implement row address decoder trees, it is next to impossible to cover them all. Instead, we will give an insight into the possibilities by discussing a few of them.
Regardless of the type of logic with which a row decoder is implemented, the layout must completely reside beneath the row address signal lines to constitute an efficient, minimized design. In other words, the metal address tracks dictate the die area available for the decoder. Any additional tracks necessary to complete the design constitute wasted silicon. For DRAM designs requiring global row decode schemes, the penalty for inefficient design may be insignificant; however, for distributed local row decode schemes, the die area penalty may be significant. As with mbits and sense amplifiers, time spent optimizing row decode circuits is time well spent.
2.3.7 Predecoding
The row address lines (shown as RA1-RA2) can be either true and complement or predecoded. Predecoded address lines are formed by logically combining (ANDing) addresses as shown in Table 2.1.
RA1  RA2  Value  PR(0)  PR(1)  PR(2)  PR(3)
0    0    0      1      0      0      0
1    0    1      0      1      0      0
0    1    2      0      0      1      0
1    1    3      0      0      0      1
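A minimal sketch (not from the book; the predecode line names are illustrative) of the AND-style predecoding in Table 2.1:

# Hypothetical predecode of two row address bits (RA1 = LSB, RA2 = MSB)
# into four one-hot lines, mirroring Table 2.1.

def predecode(ra1: int, ra2: int) -> list[int]:
    """Return [PR(0), PR(1), PR(2), PR(3)]; exactly one line is asserted."""
    value = (ra2 << 1) | ra1
    return [1 if value == i else 0 for i in range(4)]

for ra2 in (0, 1):
    for ra1 in (0, 1):
        print(ra1, ra2, predecode(ra1, ra2))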
2.4 DISCUSSION
We have briefly examined the basic elements required in DRAM row decoder blocks. Numerous variations are possible. No single design is best for all applications. As with sense amplifiers, the design depends on technology and on performance and cost trade-offs.
REFERENCES
[1] K. Itoh, "Trends in megabit DRAM circuit design," IEEE Journal of Solid-State Circuits, vol. 25, pp. 778-791, June 1990.
[2] D. Takashima, S. Watanabe, H. Nakano, Y. Oowaki, and K. Ohuchi, "Open/folded bit-line arrangement for ultra-high-density DRAMs," IEEE Journal of Solid-State Circuits, vol. 29, pp. 539-542, April 1994.
[3] H. Hidaka, Y. Matsuda, and K. Fujishima, "A divided/shared bit-line sensing scheme for ULSI DRAM cores," IEEE Journal of Solid-State Circuits, vol. 26, pp. 473-477, April 1991.
[4] M. Aoki, Y. Nakagome, M. Horiguchi, H. Tanaka, S. Ikenaga, J. Etoh, Y. Kawamoto, S. Kimura, E. Takeda, H. Sunami, and K. Itoh, "A 60-ns 16-Mbit CMOS DRAM with a transposed data-line structure," IEEE Journal of Solid-State Circuits, vol. 23, pp. 1113-1119, October 1988.
[5] R. Kraus and K. Hoffmann, "Optimized sensing scheme of DRAMs," IEEE Journal of Solid-State Circuits, vol. 24, pp. 895-899, August 1989.
[6] T. Yoshihara, H. Hidaka, Y. Matsuda, and K. Fujishima, "A twisted bitline technique for multi-Mb DRAMs," in 1988 IEEE ISSCC Digest of Technical Papers, pp. 238-239.
[7] Y. Oowaki, K. Tsuchida, Y. Watanabe, D. Takashima, M. Ohta, H. Nakano, S. Watanabe, A. Nitayama, F. Horiguchi, K. Ohuchi, and F. Masuoka, "A 33-ns 64Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 26, pp. 1498-1505, November 1991.
Chapter 3
Array Architectures
This chapter presents a detailed description of the two most prevalent array architectures under consideration for future large-scale DRAMs: the aforementioned open digitline and folded digitline architectures.
Pd = VCC² · N · (Cd + Cc) / (2P)  watts    (3.1)

On a 256-Mbit DRAM in 8k (rows) Refresh, there are 32,768 (2^15) active columns during each Read, Write, or Refresh operation. The active array current and power dissipation for a 256-Mbit DRAM appear in Table 3.2 for a 90 ns Refresh period (-5 timing) at various digitline lengths. The budget for the active array current is limited to 200 mA for this 256-Mbit design. To meet this budget, the digitline cannot exceed a length of 256 Mbits.
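A small sketch (not from the book; VCC and the capacitance values are assumptions chosen only to exercise Equation (3.1)) of the active-array power and current calculation:

# Hypothetical evaluation of Equation (3.1) for the 256-Mbit example above.
VCC = 3.3            # volts (assumed)
N = 32768            # active columns (2**15) in 8k Refresh
C_D = 150e-15        # digitline capacitance, farads (assumed)
C_C = 30e-15         # cell capacitance, farads (assumed)
PERIOD = 90e-9       # 90 ns Refresh period (-5 timing)

power = VCC**2 * N * (C_D + C_C) / (2 * PERIOD)   # watts
current = power / VCC                              # amps drawn from VCC

print(f"P_d = {power:.2f} W, I = {current * 1e3:.0f} mA")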
Wordline length, as described in Section 2.1, is limited by the maximum allowable RC time constant of the wordline. To ensure an acceptable access time for the 256-Mbit DRAM, the wordline time constant should be kept below 4 nanoseconds. For a wordline connected to N Mbits, the total resistance and capacitance follow:

Rtotal = (N · PDL · RS) / WLW    (3.2)

Ctotal = N · CWL    (3.3)

where PDL is the digitline pitch, WLW is the wordline width, RS is the sheet resistance of the wordline, and CWL is the wordline capacitance in an 8F2 Mbit cell.
[Table 3.2: digitline capacitance, active current (mA), and power dissipation (mW) versus digitline length (Mbits) for the 256-Mbit design.]
Table 3.3 contains the effective wordline time constants for various wordline lengths. As shown in the table, the wordline length cannot exceed 512 Mbits (512 digitlines) if the wordline time constant is to remain under 4 nanoseconds.
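To illustrate the limit just described, the following sketch (not from the book; the pitch, width, sheet resistance, and per-cell capacitance values are assumptions) evaluates Equations (3.2) and (3.3) and the resulting lumped RC product:

# Hypothetical wordline time-constant check using Equations (3.2) and (3.3).
P_DL = 0.6e-6       # digitline pitch, meters (assumed)
WL_W = 0.3e-6       # wordline width, meters (assumed)
R_S = 4.0           # polycide sheet resistance, ohms/square (assumed)
C_WL = 0.6e-15      # wordline capacitance per Mbit, farads (assumed)

for n in (256, 512, 1024):                 # Mbits on the wordline
    r_total = n * P_DL * R_S / WL_W        # Equation (3.2)
    c_total = n * C_WL                     # Equation (3.3)
    tau_ns = r_total * c_total * 1e9
    print(f"{n:5d} Mbits: R = {r_total/1e3:.1f} kohm, "
          f"C = {c_total*1e15:.0f} fF, RC = {tau_ns:.2f} ns")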
The open digitline architecture does not support digitline twisting because the true and complement digitlines, which constitute a column, are in separate array cores. Therefore, no silicon area is consumed for twist regions. The 32-Mbit array block requires a total of two hundred fifty-six 128kbit array cores in its construction. Each 32-Mbit block represents an address space comprising a total of 4,096 rows and 8,192 columns. A practical configuration for the 32-Mbit block is depicted in Figure 3.2.
Size calculations for this configuration are summarized in Table 3.4. Essentially, the overall height of the 32-Mbit block can be found by summing the height of the row decode blocks (or stitch regions) together with the product of the digitline pitch and the total number of digitlines. Accordingly,

Height = (TR · HLDEC) + (TDL · PDL)    (3.4)

where TR is the number of local row decoders, HLDEC is the height of each decoder, TDL is the number of digitlines including redundant and dummy lines, and PDL is the digitline pitch. Similarly, the width of the 32-Mbit block is found by summing the total width of the sense amplifier blocks with the product of the wordline pitch and the number of wordlines. This bit of math yields

Width = (TSA · WAMP) + (TWL · PWL6)    (3.5)

where TSA is the number of sense amplifier strips, WAMP is the width of the sense amplifiers, TWL is the total number of wordlines including redundant and dummy lines, and PWL6 is the wordline pitch for the 6F2 Mbit.
Table 3.4 contains calculation results for the 32-Mbit block shown in Figure 3.2. Although overall size is the best measure of architectural efficiency, a second popular metric is array efficiency. Array efficiency is determined by dividing the area consumed by functionally addressable Mbits by the total die area. To simplify the analysis in this text, peripheral circuits are ignored in the array efficiency calculation. Rather, the calculation considers only the 32-Mbit memory block, ignoring all other factors. With this simplification, the array efficiency for a 32-Mbit block is given as

Efficiency = (100 · 2^25 · AreaMbit) / Area32Mbit  percent    (3.6)

where 2^25 is the number of addressable Mbits in each 32-Mbit block and AreaMbit is the area of a single Mbit. The open digitline architecture yields a calculated array efficiency of 51.7%.
Unfortunately, the ideal open digitline architecture presented in Figure 3.2 is difficult to realize in practice. The difficulty stems from an interdependency between the memory array and sense amplifier layouts, in which each array digitline must connect to one sense amplifier and each sense amplifier must connect to two array digitlines.
Table 3.4 Open digitline (local row decode): 32-Mbit size calculations.
With the presence of an additional conductor, such as Metal3, the sense amplifier layout and either a full or hierarchical global row decoding scheme are made possible. A full global row decoding scheme using wordline stitching places great demands on metal and contact/via technologies; however, it represents the most efficient use of the additional metal. Hierarchical row decoding using bootstrap wordline drivers is slightly less efficient. Wordlines no longer need to be strapped with metal on pitch, and, thus, process requirements are relaxed significantly [5].
For a balanced perspective, both global and hierarchical approaches are analyzed. The results of this analysis for the open digitline architecture are summarized in Tables 3.5 and 3.6. Array efficiencies for global and hierarchical row decoding calculate to 60.5% and 55.9%, respectively, for the 32-Mbit memory blocks, based on data from these tables.
Table 3.5 Open digitline (dummy arrays and global row decode): 32-Mbit size calculations.
Table 3.6 Open digitline (dummy arrays and hierarchical row decode): 32-Mbit size calculations.
guardband live digitlines. These photo effects are pronounced at the edges
of large repetitive structures such as the array cores.
Sense amplifier blocks are placed on both sides of each array core. The sense amplifiers within each block are laid out at quarter pitch: one sense amplifier for every four digitlines. Each sense amplifier connects through isolation devices to columns (digitline pairs) from both adjacent array cores. Odd columns connect on one side of the core, and even columns connect on the opposite side. Each sense amplifier block is therefore connected to only odd or even columns and is never connected to both odd and even columns within the same block. Connecting to both odd and even columns requires a half-pitch sense amplifier layout: one sense amplifier for every two digitlines. While half-pitch layout is possible with certain DRAM processes, the bulk of production DRAM designs remain quarter pitch due to the ease of laying them out. The analysis presented in this chapter is accordingly based on quarter-pitch design practices.
The location of row decode blocks for the array core depends on the number of available metal layers. For one- and two-metal processes, local row decode blocks are located at the top and bottom edges of the core. For three- and four-metal processes, global row decodes are used. Global row decodes require only stitch regions or local wordline drivers at the top and bottom edges of the core [7]. Stitch regions consume much less silicon area than local row decodes, substantially increasing array efficiency for the DRAM. The array core also includes digitline twist regions running parallel to the wordlines. These regions provide the die area required for digitline twisting. Depending on the particular twisting scheme selected for a design
(see Section 2.1), the array core needs between one and three twist regions. For the sake of analysis, a triple twist is assumed, as it offers the best overall noise performance and has been chosen by some DRAM manufacturers for select advanced large-scale applications [8]. Because each twist region constitutes a break in the array structure, it is necessary to use dummy wordlines. For this reason, there are 16 dummy wordlines (2 for each array edge) in the folded array core rather than 4 dummy wordlines as in the open digitline architecture.
There are more Mbits in the array core for folded digitline architectures than there are for open digitline architectures. Larger core size is an inherent feature of folded architectures, arising from the very nature of the architecture. The term folded architecture comes from the fact that folding two open digitline array cores one on top of the other produces a folded array core. The digitlines and wordlines from each folded core are spread apart (double pitch) to allow room for the other folded core. After folding, each constituent core remains intact and independent, except for the Mbit changes (8F2 conversion) necessary in the folded architecture. The array core size doubles because the total number of digitlines and wordlines doubles in the folding process. It does not quadruple, as one might suspect, because the two constituent folded cores remain independent: the wordlines from one folded core do not connect to Mbits in the other folded core.
Digitline pairing (column formation) is a natural outgrowth of the folding process; each wordline only connects to Mbits on alternating digitlines. The existence of digitline pairs (columns) is the one characteristic of folded digitline architectures that produces superior signal-to-noise performance. Furthermore, the digitlines that form a column are physically adjacent to one another. This feature permits various digitline twisting schemes to be used, as discussed in Section 2.1, further improving signal-to-noise performance.
Similar to the open digitline architecture, digitline length for the folded
digitline architecture is again limited by power dissipation and minimum
cell-to-digitline capacitance ratio. For the 256-Mbit generation, digitlines
were restricted from connecting to more than 256 cells (128 Mbit pairs).
The analysis used to arrive at this quantity is similar to that for the open dig
itline architecture. (Refer to Table 3.2 to view the calculated results of
power dissipation versus digitline length for a 256-Mbit DRAM in 8k
Refresh.) Wordline length is again limited by the maximum allowable RC
time constant of the wordline.
Contrary to an open digitline architecture in which each wordline con
nects to Mbits on each digitline, the wordlines in a folded digitline architec
ture connect to Mbits only on alternating digitlines. Therefore, a wordline
can cross 1,024 digitlines while connecting to only 512 Mbit transistors.
The wordlines have twice the overall resistance, but only slightly more
capacitance because they run over field oxide on alternating digitlines.
Table 3.7 presents the effective wordline time constants for various word
line lengths for a folded array core. For a wordline connected to N Mbits,
the total resistance and capacitance follow:
RWL = 2 · N · PDL · Rw    (3.7)

CWL = N · CWL8    (3.8)

where PDL is the digitline pitch, Rw is the wordline resistance per unit length, and CWL8 is the wordline capacitance contributed by each 8F2 Mbit cell. As shown in Table 3.7, the wordline length cannot exceed 512 Mbits (1,024 digitlines) for the wordline time constant to remain under
4 nanoseconds. Although the wordline connects to only 512 Mbits, it is two
times longer (1,024 digitlines) than wordlines in open digitline array cores.
The folded digitline architecture therefore requires half as many row decode
blocks or wordline stitching regions as the open digitline architecture.
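As a rough illustration of this wordline budget, the following Python sketch (not from the original text) evaluates the lumped RC time constant for several wordline lengths. The per-Mbit resistance and capacitance values are placeholder assumptions and should be replaced with the process values behind Tables 3.3 and 3.7.

```python
# Hypothetical per-Mbit wordline parameters; the real numbers depend on the
# polycide sheet resistance and the 8F^2 cell dimensions.
R_PER_MBIT = 10.0      # ohms of wordline resistance added per attached Mbit (assumed)
C_PER_MBIT = 1.0e-15   # farads of wordline capacitance added per attached Mbit (assumed)

def wordline_time_constant(n_mbits, folded=True):
    """Estimate the lumped wordline RC time constant for a row of n_mbits.

    In a folded array the wordline crosses two digitlines per attached Mbit,
    so its resistance roughly doubles while the capacitance grows only
    slightly (the extra length runs over field oxide).
    """
    r_total = R_PER_MBIT * n_mbits * (2.0 if folded else 1.0)
    c_total = C_PER_MBIT * n_mbits
    # Elmore approximation for a distributed RC line: tau ~ R*C/2.
    return 0.5 * r_total * c_total

for n in (256, 512, 1024):
    tau_ns = wordline_time_constant(n) * 1e9
    print(f"{n:5d} Mbits: tau = {tau_ns:.2f} ns "
          f"({'OK' if tau_ns < 4.0 else 'exceeds 4 ns limit'})")
```

With these placeholder values, 512 Mbits stays under the 4 ns limit while 1,024 Mbits does not, mirroring the conclusion above.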
The 32-Mbit array block shown in Figure 3.6 includes size estimates for
the various pitch cells. The layout was generated where necessary to arrive
at the size estimates. The overall size for the folded digitline 32-Mbit block
can be found by again summing the dimensions for each component.
Accordingly,
Height32Mbit = (TR · HRDEC) + (TDL · PDL)    (3.9)
where TR is the number of row decoders, HRDEC is the height of each
decoder, TDL is the number of digitlines including redundant and dummy,
and PDL is the digitline pitch. Similarly,
Width32Mbit = (TSA · WAMP) + (TWL · PWL8) + (TTWIST · WTWIST)    (3.10)

where TSA is the number of sense amplifier strips, WAMP is the width of the sense amplifiers, TWL is the total number of wordlines including redundant and dummy, PWL8 is the wordline pitch for the 8F2 Mbit, TTWIST is the total number of twist regions, and WTWIST is the width of the twist regions.
Table 3.8 shows the calculated results for the 32-Mbit block shown in
Figure 3.6. In this table, a double-metal process is used, which requires
local row decoder blocks. Note that Table 3.8 for the folded digitline archi
tecture contains approximately twice as many wordlines as does Table 3.5
for the open digitline architecture. The reason for this is that each wordline
in the folded array only connects to Mbits on alternating digitlines, whereas
each wordline in the open array connects to Mbits on every digitline. A
folded digitline design therefore needs twice as many wordlines as a compa
rable open digitline design.
Array efficiency for the 32-Mbit memory block from Figure 3.6 is again
found by dividing the area consumed by functionally addressable Mbits by
the total die area. For the simplified analysis presented in this text, the
peripheral circuits are ignored. Array efficiency for the 32-Mbit block is therefore given as

Efficiency = 100 · (2^25 · PDL · 2 · PWL8) / (Height32Mbit · Width32Mbit) percent    (3.11)

which yields 59.5% for the folded array design example.
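The bookkeeping of Equations (3.9) through (3.11) can be captured in a few lines of Python. The pitch-cell dimensions and element counts below are illustrative assumptions, not the Table 3.8 values, so the printed efficiency only approximates the 59.5% quoted above.

```python
# Folded digitline 32-Mbit size bookkeeping, Eqs. (3.9)-(3.11).
# All values below are placeholder assumptions for a 0.3 um (F) process.
P_DL, P_WL8 = 0.6e-6, 0.6e-6     # digitline and 8F^2 wordline pitch (m), assumed 2F
H_RDEC, W_AMP, W_TWIST = 100e-6, 45e-6, 8e-6   # pitch-cell sizes (m), assumed

T_R, T_DL = 18, 8704             # row decode strips, digitlines incl. redundant/dummy (assumed)
T_SA, T_WL, T_TWIST = 9, 8512, 24  # sense amp strips, wordlines, twist regions (assumed)

height = T_R * H_RDEC + T_DL * P_DL                        # Eq. (3.9)
width = T_SA * W_AMP + T_WL * P_WL8 + T_TWIST * W_TWIST    # Eq. (3.10)
efficiency = 100 * (2**25 * P_DL * 2 * P_WL8) / (height * width)  # Eq. (3.11)
print(f"height = {height*1e6:,.0f} um, width = {width*1e6:,.0f} um, "
      f"efficiency = {efficiency:.1f}%")
```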
Figure 3.6 Folded digitline architecture 32-Mbit array block with pitch cell size estimates (8,512 wordlines; 256-Kbit array cores; row decode strips and 45 µm sense amplifier strips).
Table 3.8 Folded digitline (local row decode): 32-Mbit size calculations.
for a 256-Mbit DRAM. Finally, we compare the results achieved with the
new architecture to those obtained for the open digitline and folded digitline
architectures from Section 3.1.
Wordline pitch is effectively relaxed for the plaid 6F2 Mbit of the
bilevel digitline architecture. The Mbit is still built using the minimum pro
cess feature size of 0.3 µm. The relaxed wordline pitch stems from struc
tural differences between a folded digitline Mbit and an open digitline or
plaid Mbit. There are essentially four wordlines running across each folded
digitline Mbit pair compared to two wordlines running across each open
digitline or plaid Mbit pair. Although the plaid Mbit is 25% shorter than a
folded Mbit (three versus four features), it also has half as many wordlines,
effectively reducing the wordline pitch. This relaxed wordline pitch makes
layout of the wordline drivers and the address decode tree much easier. In
fact, both odd and even wordlines can be driven from the same row decoder
block, thus eliminating half of the row decoder strips in a given array block.
This is an important distinction, as the tight wordline pitch for folded digit
line designs necessitates separate odd and even row decode strips.
Sense amplifier blocks are placed on both sides of each array core. The
sense amplifiers within each block are laid out at half pitch: one sense
amplifier for every two Metal1 digitlines. Each sense amplifier connects
through isolation devices to columns (digitline pairs) from two adjacent
array cores. Similar to the folded digitline architecture, odd columns con
nect on one side of the array core, and even columns connect on the other
side. Each sense amplifier block is then exclusively connected to either odd
or even columns, never to both.
Unlike a folded digitline architecture that uses a local row decode block
connected to both sides of an array core, the bilevel digitline architecture
uses a local row decode block connected to only one side of each core. As
stated earlier, both odd and even rows can be driven from the same local
row decoder block with the relaxed wordline pitch. Because of this feature,
the bilevel digitline architecture is more efficient than alternative architec
tures. A four-metal DRAM process allows local row decodes to be replaced
by either stitch regions or local wordline drivers. Either approach could sub
stantially reduce die size. The array core also includes the three twist
regions necessary for the bilevel digitline architecture. The twist region is
larger than that used in the folded digitline architecture, owing to the com
plexity of twisting digitlines vertically. The twist regions again constitute a
break in the array structure, making it necessary to include dummy word
lines.
As with the open digitline and folded digitline architectures, the bilevel
digitline length is limited by power dissipation and a minimum cell-to-digit-
line capacitance ratio. In the 256-Mbit generation, the digitlines are again
restricted from connecting to more than 256 Mbits (128 Mbit pairs). The
analysis to arrive at this quantity is the same as that for the open digitline
architecture, except that the overall digitline capacitance is higher. The
bilevel digitline runs over twice as many cells as the open digitline with the
digitline running in equal lengths in both Metal2 and Metal1. The capaci
tance added by the Metal2 component is small compared to the already
present Metal1 component because Metal2 does not connect to Mbit transis
tors. Overall, the digitline capacitance increases by about 25% compared to
an open digitline. The power dissipated during a Read or Refresh operation
is proportional to the digitline capacitance (Cd), the supply (internal) volt
age (Vcc), the external voltage (Vccx), the number of active columns (N),
and the Refresh period (P). It is given as
Power ≈ (N · CD · (VCC/2) · VCCX) / P    (3.12)
On a 256-Mbit DRAM in 8k Refresh, there are 32,768 (2^15) active col
umns during each Read, Write, or Refresh operation. Active array current
and power dissipation for a 256-Mbit DRAM are given in Table 3.11 for a
90 ns Refresh period (-5 timing) at various digitline lengths. The budget for
active array current is limited to 200 mA for this 256-Mbit design. To meet
this budget, the digitline cannot exceed a length of 256 Mbits.
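A hedged sketch of the digitline-length budget behind Equation (3.12) and Table 3.11 follows; the capacitance-per-Mbit and supply values are assumptions chosen only to show the scaling, not the design's actual numbers.

```python
# Rough check of the digitline-length budget of Eq. (3.12).
V_CC, V_CCX = 2.5, 3.3      # internal and external supplies (V), assumed
N_COLS = 32768              # active columns per operation (2**15, from the text)
T_RC = 90e-9                # Refresh (row cycle) period, -5 timing (s)
C_PER_MBIT = 1.0e-15        # digitline capacitance added per attached Mbit (F), assumed

def active_array_current(mbits_per_digitline):
    """Average current drawn from VCCX to restore N_COLS digitline pairs."""
    c_digitline = C_PER_MBIT * mbits_per_digitline
    charge_per_column = c_digitline * V_CC / 2    # each digitline swings VCC/2
    return N_COLS * charge_per_column / T_RC

for n in (128, 256, 512):
    i_ma = active_array_current(n) * 1e3
    p_mw = active_array_current(n) * V_CCX * 1e3
    print(f"{n:3d} Mbits/digitline: I = {i_ma:6.1f} mA, P = {p_mw:7.1f} mW "
          f"({'within' if i_ma <= 200 else 'exceeds'} 200 mA budget)")
```

With these assumed values, 256 Mbits per digitline stays within the 200 mA budget while 512 Mbits does not, consistent with the conclusion above.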
Wordline length is again limited by the maximum allowable (RC) time
constant of the wordline. The calculation for bilevel digitline is identical to
that performed for open digitline due to the similarity of array core design.
These results are given in Table 3.3. Accordingly, if the wordline time con
stant is to remain under the required 4-nanosecond limit, the wordline
length cannot exceed 512 Mbits (512 bilevel digitline pairs).
Table 3.11 Active current and power versus bilevel digitline length.
sions for a 32-Mbit array block could be calculated. The diagram for a 32-
Mbit array block using the bilevel digitline architecture is shown in Figure
3.13. This block requires a total of one hundred twenty-eight 256kbit array
cores. The 128 array cores are arranged in 16 rows and 8 columns. Each 4-
Mbit vertical section consists of 512 wordlines and 8,192 bilevel digitline
pairs (8,192 columns). Eight 4-Mbit strips are required to form the complete
32-Mbit block. Sense amplifier blocks are positioned vertically between
each 4-Mbit section.
Row decode strips are positioned horizontally between every array
core. Only eight row decode strips are needed for the sixteen rows of array cores
because each row decode strip contains wordline drivers for both odd and even rows. The
32-Mbit array block shown in Figure 3.13 includes pitch cell layout esti
mates. Overall size for the 32-Mbit block is found by summing the dimen
sions for each component.
As before,
Height32Mbit = (TR · HRDEC) + (TDL · PDL)    (3.13)
where TR is the number of bilevel row decoders, HRDEC is the height of each
decoder, TDL is the number of bilevel digitline pairs including redundant and
dummy, and PDL is the digitline pitch. Also,
Width32Mbit = (TSA · WAMP) + (TWL · PWL6) + (TTWIST · WTWIST)    (3.14)
where TSA is the number of sense amplifier strips, WAMP is the width of the
sense amplifiers, TWL is the total number of wordlines including redundant
and dummy, PWL6 is the wordline pitch for the plaid 6F2 Mbit, Ttwist is the
total number of twist regions, and WTWIST is the width of the twist regions.
Table 3.12 shows the calculated results for the bilevel 32-Mbit block
shown in Figure 3.13. A three-metal process is assumed in these calculations, which requires the use of local row decoders. Array efficiency for the bilevel digitline 32-Mbit array block is given as

Efficiency = 100 · (2^25 · PDL · 2 · PWL6) / (Height32Mbit · Width32Mbit) percent    (3.15)

which yields 63.1% for this design example.
With Metal4 added to the bilevel DRAM process, the local row decoder
scheme can be replaced by a global or hierarchical row decoder scheme.
The addition of a fourth metal to the DRAM process places even greater
demands on process engineers. Regardless, an analysis of 32-Mbit array
block size was performed assuming the availability of Metal4. The results
of the analysis are shown in Tables 3.13 and 3.14 for the global and hierar
chical row decode schemes. Array efficiency for the 32-Mbit memory block
using global and hierarchical row decoding is 74.5% and 72.5%, respec
tively.
The 32-Mbit array block size data from Sections 3.1 and 3.2 is summarized in Table 3.15
for the open digitline, folded digitline, and bilevel digitline architectures.
From Table 3.15, it seems that overall die size (32-Mbit area) is a better
metric for comparison than array efficiency. For instance, the three-metal
folded digitline design using hierarchical row decodes has an area of
34,089,440 µm² and an efficiency of 70.9%. The three-metal bilevel digit
line design with local row decodes has an efficiency of only 63.1% but an
overall area of 28,732,296 µm². Array efficiency for the folded digitline is
higher. This is misleading, however, because the folded digitline yields a die
that is 18.6% larger for the same number of conductors.
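The 18.6% figure follows directly from the two areas quoted above, as the short calculation below confirms.

```python
# Quick arithmetic behind the Table 3.15 comparison quoted above.
folded_hier_3metal = 34_089_440    # um^2, three-metal folded, hierarchical row decode
bilevel_local_3metal = 28_732_296  # um^2, three-metal bilevel, local row decode

ratio = folded_hier_3metal / bilevel_local_3metal
print(f"folded/bilevel area ratio = {ratio:.3f} "
      f"-> the folded die is {100*(ratio - 1):.1f}% larger")   # ~18.6%
```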
Table 3.15 also illustrates that the bilevel digitline architecture always
yields the smallest die area, regardless of the configuration. The smallest
folded digitline design at 32,654,160 µm² and the smallest open digitline
design at 29,944,350 µm² are still larger than the largest bilevel digitline
design at 28,732,296 µm². It is also apparent that both the bilevel and open
digitline architectures need at least three conductors in their construction.
The folded digitline architecture still has a viable design option using only
two conductors. The penalty of using two conductors is a much larger die
size, a full 41% larger than the three-metal bilevel digitline design.
Number of sense amplifier strips (TSA): 9
Width of sense amplifiers (WAMP): 65 µm
REFERENCES
address and the corresponding match data could be presented at the same
time. There was no need for the fire-and-cancel operation: the match data
was already available.
Therefore, the column decoder fires either the addressed column select
or the redundant column select in synchrony with the clock. The decode tree
is similar to that used fbr the CMOS wordline driver; a pass transistor was
added so that a decoder enable term could be included. This term allows the
tree to disconnect from the latching column select driver while new address
terms flow into the decoder. A latching driver was used in this pipeline
implementation because it held the previously addressed column select
active with the decode tree disconnected. Essentially, the tree would discon
nect after a column select was fired, and the new address would flow into
the tree in anticipation of the next column select. Concurrently, redundant
match information would flow into the phase term driver along with CA 45
address terms to select the correct phase signal. A redundant match would
then override the normal phase term and enable a redundant phase term.
Operation of this column decoder is shown in Figure 4.4. Once again,
deselection of the old column select CSEL<0> and selection of a new col
umn select RCSEL<1> are enveloped by EQIO. Column transition timing is
under the control of the column latch signal CLATCH*. This signal shuts
OFF the old column select and enables firing of the new column select.
Concurrent with CLATCH* firing, the decoder is enabled with decoder
enable (DECEN) to reconnect the decode tree to the column select driver.
After the new column select fires, DECEN transitions LOW to once again
isolate the decode tree.
Figure 4.4 Operation of the pipelined column decoder (CSEL, RCSEL<1:3>, RED<1:3>, CLATCH*, and RESET* waveforms; 0.5 ns scale).
wordline rather than the normal wordline will be fired. This pretest capabil
ity permits all of the redundant wordlines to be tested prior to any laser pro
gramming.
Fuse banks or redundant elements, as shown in Figure 4.5, are physi
cally associated with specific redundant wordlines in the array. Each ele
ment can fire only one specific wordline, although generally in multiple
subarrays. The number of subarrays that each element controls depends on
the DRAM's architecture, refresh rate, and redundancy scheme. It is not
uncommon in 16-Meg DRAMs for a redundant row to replace physical
rows in eight separate subarrays at the same time. Obviously, the match cir
cuits must be fast. Generally, firing of the normal row must be held off until
the match circuits have enough time to evaluate the new row address. As a
result, time wasted during this phase shows up directly on the part's row
access (tRAC) specification.
(4.1)
The column address match (CAM) signals from all of the predecoded
addresses are combined with static logic gates to create a column match
(CMAT*) signal for the column fuse block. The CMAT* signal, when
active, cancels normal CSEL signals and enables redundant RCSEL signals,
as described in Section 4.1. Each column fuse block is active only when its
corresponding enable fuse is blown. The column fuse block usually con
tains a disable fuse for the same reason as a row redundant block: to repair a
redundant element. Column redundant pretest is implemented somewhat
differently in Figure 4.6 than row redundant pretest here. In Figure 4.6, the
bottom fuse terminal is not connected directly to ground. Rather, all of the
signals for the entire column fuse block are brought out and programmed
either to ground or to a column pretest signal from the test circuitry.
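The match-and-cancel behavior described above can be summarized behaviorally. The sketch below is a Python abstraction with illustrative names, not the laser-fuse-and-static-gate implementation of Figure 4.6.

```python
# Behavioral sketch of column redundancy match logic (illustrative only).
class ColumnFuseBlock:
    def __init__(self, programmed_address, enable_blown=True, disable_blown=False):
        self.programmed_address = programmed_address  # failing column's address
        self.enable_blown = enable_blown     # enable fuse blown -> element in use
        self.disable_blown = disable_blown   # disable fuse blown -> element retired

    def match(self, column_address):
        """CMAT* analog: True cancels the normal CSEL and fires the RCSEL."""
        if not self.enable_blown or self.disable_blown:
            return False
        return column_address == self.programmed_address

def decode_column(column_address, fuse_blocks):
    for index, block in enumerate(fuse_blocks):
        if block.match(column_address):
            return {"CSEL": None, "RCSEL": index}    # redundant column select fires
    return {"CSEL": column_address, "RCSEL": None}   # normal column select fires

blocks = [ColumnFuseBlock(0x1A3)]
print(decode_column(0x1A3, blocks))   # {'CSEL': None, 'RCSEL': 0}
print(decode_column(0x0F0, blocks))   # {'CSEL': 240, 'RCSEL': None}
```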
During standard part operation, the pretest signal is biased to ground,
allowing the fuses to be read normally. However, during column redundant
pretest, this signal is brought to VCC, which makes the laser fuses appear to
be programmed. The fuse/latch circuits latch the apparent fuse states on the
next RAS* cycle. Subsequent column accesses allow the redundant column
elements to be pretested by addressing their pre-programmed match
addresses.
The method of pretesting just described always uses the match circuits
to select a redundant column. It is a superior method to that described for
the row redundant pretest because it tests both the redundant element and its
match circuit. Furthermore, as the match circuit is essentially unaltered dur
ing redundant column pretest, the test is a better measure of the obtainable
DRAM performance when the redundant element is active.
Obviously, the row and column redundant circuits that are described in
this section are only one embodiment of what could be considered a wealth
of possibilities. It seems that all DRAM designs use some alternate form of
redundancy. Other types of fuse elements could be used in place of the laser
fuses that are described. A simple transistor could replace the laser fuses in
either Figure 4.5 or Figure 4.6, its gate being connected to an alternative
fuse element. Furthermore, circuit polarity could be reversed and non-pre-
decoded addressing and other types of logic could be used. The options are
nearly limitless. Figure 4.7 shows a SEM image of a typical set of poly
fuses.
REFERENCE
Global Circuitry
and Considerations
In this chapter, we discuss the circuitry and design considerations associ
ated with the circuitry external to the DRAM memory array and memory
array peripheral circuitry. We call this global circuitry.
and from the DRAM controller. In either circuit, VTT and VREF are set to
Vcc/2.
From this topology, we can see that a differential input buffer should be
used: an inverter won't work. Some examples of fully differential input
buffers are seen in Figure 5.3, Figure 5.4, and Figure 5.5 [1].
Figure 5.3 is simply a CMOS differential amplifier with an inverter output to generate valid CMOS logic levels. Common-mode noise on the diff-amp inputs is, ideally, rejected while amplifying the difference between the input signal and the reference signal. The diff-amp input common-mode range, say a hundred mV, sets the minimum input signal amplitude (centered around VREF) required to cause the output to change states. The speed of this configuration is limited by the diff-amp biasing current. Using a large current will increase input receiver speed and, at the same time, decrease amplifier gain and reduce the diff-amp's input common-mode range.
The input buffer of Figure 5.3 requires an external biasing circuit. The
circuit of Figure 5.4 is self-biasing. This circuit is constructed by connecting
the gate of the NMOS device to the gates of the PMOS load transistors. This
circuit is simple and, because of the adjustable biasing connection, poten
tially very fast. The output inverter is needed to ensure that valid output
logic levels are generated.
Both of the circuits in Figures 5.3 and 5.4 suffer from duty-cycle distor
tion at high speeds. The PULLUP delay doesn't match the PULLDOWN
x4 part, 4 Mbits per data pin. For each configuration, the number of array
sections available to an input buffer must change. By using Data Write
muxes that permit a given input buffer to drive as few or as many Write
driver circuits as required, design flexibility is easily accommodated.
DW<n>    DIN*<n> (x16)    DIN*<n> (x8)    DIN*<n> (x4)
DW<0>    0                0               0
DW<1>    1                1               1
DW<2>    2                2               0
DW<3>    3                3               1
DW<4>    4                4               0
DW<5>    5                5               1
DW<6>    6                6               0
DW<7>    7                7               1
(Bond-option select signals: OPTX16, OPTX8, OPTX4.)
Figure 5.6 Data Write mux.
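A behavioral view of the Data Write mux follows. The x16 and x4 routings mirror the table above, while the x8 routing and the use of a lookup table in place of the OPTX16/OPTX8/OPTX4 bond-option signals are simplifying assumptions, not the actual design.

```python
# Illustrative Data Write mux routing per part width (hypothetical mapping).
WRITE_MUX_MAP = {
    16: [0, 1, 2, 3, 4, 5, 6, 7],   # every DIN* buffer drives its own Write driver
    8:  [0, 1, 2, 3, 4, 5, 6, 7],   # as read from the table above (possibly simplified)
    4:  [0, 1, 0, 1, 0, 1, 0, 1],   # two active buffers each drive four Write drivers
}

def route_write_data(din, width):
    """Return DW<0:7> driven into the array for the selected part width."""
    return [din[src] for src in WRITE_MUX_MAP[width]]

print(route_write_data([1, 0, 1, 1, 0, 0, 1, 0], width=4))  # -> [1, 0, 1, 0, 1, 0, 1, 0]
```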
For example, if I/O = 1.25 V and I/O* = 1.23 V, then I/O becomes VCC
and I/O* goes to zero when CLK transitions HIGH. Using positive feedback
makes the HFF sensitive and fast. Note that HFFs can be used at several
locations on the I/O lines due to the small size of the circuit.
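The HFF's clocked, regenerative behavior can be modeled in a few lines. The rail value is an assumption, and the model ignores offset and metastability.

```python
# Minimal behavioral model of a helper flip-flop (HFF).
V_CC = 2.5  # assumed internal rail (V)

def hff(io, io_bar, clk_rising):
    """Return (I/O, I/O*) after the HFF fires; hold the inputs if CLK is idle."""
    if not clk_rising:
        return io, io_bar
    # Positive feedback drives the higher input to VCC and the other to ground,
    # no matter how small the initial difference is.
    return (V_CC, 0.0) if io > io_bar else (0.0, V_CC)

print(hff(1.25, 1.23, clk_rising=True))   # -> (2.5, 0.0), as in the text's example
```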
In the output driver circuit, two NMOS transistors are placed in series with VCCX to reduce sub
strate injection currents. Substrate injection currents result from impact ion
ization, occurring most commonly when high drain-to-source and high
gate-to-source voltages exist concurrently. These conditions usually occur
when an output driver is firing to VCCX, especially for high-capacitance
loads, which slow the output transition. Two transistors in series reduce this
effect by lowering the voltages across any single device. The output stage is
tristated whenever both signals PULLUP and PULLDOWN are at
ground.
The signal PULLDOWN is driven by a simple CMOS inverter, whereas
PULLUP is driven by a complex circuit that includes voltage charge pumps.
The pumps generate a voltage to drive PULLUP higher than one VTH above
VCCX. This is necessary to ensure that the series output transistors drive the
pad to Vccx. The output driver is enabled by the signal labeled OE. Once
enabled, it remains tristated until either DQ or DQ* fires LOW. If DR fires
LOW, PULLDOWN fires HIGH, driving the pad to ground through M3. If
DR* fires LOW, PULLUP fires HIGH, driving the pad to Vccx through Ml
and M2.
The output latch circuit controls the output driver operation.
As the name implies, it contains a latch to hold the output data state. The
latch frees the DCSA or HFF and other circuits upstream to get subsequent
data for the output. It is capable of storing not only one and zero states, but
also a high-impedance state (tristate). It offers transparent operation to
allow data to quickly propagate to the output driver. The input to this latch is
connected to the DR<n> signals coming from either the DCSAs or Data
Read muxes. Output latch circuits appear in a variety of forms, each serving
the needs of a specific application or architecture. The data path may con
tain additional latches or circuits in support of special modes such as burst
operation.
address path, namely, the row address buffer, CBR counter, predecode logic,
array buffers, and phase drivers.
The counters from each row address buffer are cascaded together to form a CBR
ripple counter. The Q output of one stage feeds the CLK* input of a subse
quent stage. The first register in the counter is clocked whenever RAS* falls
while in a CBR Refresh mode. By cycling through all possible row address
combinations in a minimum number of clock cycles, the CBR ripple counter provides a
simple means of internally generating Refresh addresses. The CBR counter
drives through a mux into the inverter latch of the row address buffer. This
mux is enabled whenever CBR address latch (CBRAL) is LOW. Note that
the signals RAL and CBRAL are mutually exclusive in that they cannot be
LOW at the same time. For each and every DRAM design, the row address
buffer and CBR counter designs take on various forms. Logic may be
inverted, counters may be more or less complex, and muxes may be
replaced with static gates. Whatever the differences, however, the function
of the input buffer and its CBR counter remains essentially the same.
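For illustration, a behavioral CBR ripple counter might look like the following sketch; the 12-bit width and the all-zero reset state are assumptions.

```python
# Behavioral CBR ripple counter: each row address buffer holds one stage, and
# a stage toggles the next when it ripples through zero.
class CBRRippleCounter:
    def __init__(self, bits=12):
        self.q = [0] * bits          # one bit per row address buffer (assumed width)

    def ras_fall(self):
        """Toggle stage 0 on RAS* falling; carry ripples while stages fall to 0."""
        for i in range(len(self.q)):
            self.q[i] ^= 1
            if self.q[i] == 1:       # no falling edge on this Q, ripple stops here
                break
        return self.address()

    def address(self):
        return sum(bit << i for i, bit in enumerate(self.q))

ctr = CBRRippleCounter(bits=12)
print([ctr.ras_fall() for _ in range(8)])   # 1, 2, 3, ...; all 2**12 rows are visited
```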
addressing is related to these Refresh rates for the 16Mb example. In this
example, the 2k Refresh rate would be more popular because it has an equal
number of row and column addresses or square addressing.
Refresh rate is also determined by backward compatibility, especially in
personal computer designs. Because a 4Mb DRAM has less memory space
than a 16Mb DRAM, the 4Mb DRAM should naturally have fewer address
pins. To sell 16Mb DRAMs into personal computers that are designed for
4Mb DRAMs, the 16Mb part must be configured with no more address pins
than the 4Mb part. If the 4Mb part has eleven address pins, then the 16Mb
part should have only eleven address pins, hence 2k Refresh. To trim cost,
most PC designs keep the number of DRAM address pins to a minimum.
Although this practice holds cost down, it also limits expandability and
makes conversion to newer DRAM generations more complicated owing to
resultant backward compatibility issues.
Refresh rate    Row addresses    Column addresses    Row address bits    Column address bits
4K              4,096            1,024               12                  10
2K              2,048            2,048               11                  11
1K              1,024            4,096               10                  12
could be included for combining the addresses with enabling signals from
the control logic or for making odd/even row selection by combining the
addresses with the odd and even address signals. Regardless, the resulting
signals ultimately drive the decode trees, making speed an important issue.
Buffer size and routing resistance, therefore, become important design
parameters in high-speed designs because the wordline cannot be fired until
the address tree is decoded and ready for the PHASE signal to fire.
equilibration of the I/O lines. As we learned in Section 5.1.4, the I/O lines
need to be equilibrated to VCC - VTH prior to a new column being selected by
the CSEL<n> lines. EQIO* is the signal used to accomplish this equilibra
tion.
The second signal generated by the equilibration driver is called equili
brate sense amp (EQSA). This signal is generated from address transitions
occurring on all of the column addresses, including the least significant
addresses. The least significant column addresses are not decoded into the
column select lines (CSEL). Rather, they are used to select which set of I/O
lines is connected to the output buffers. As shown in the schematic, EQSA is
activated regardless of which address is changed because the DCSAs must
be equilibrated prior to sensing any new data. EQIO, on the other hand, is
not affected by the least significant addresses because the I/O lines do not
need equilibrating unless the CSEL lines are changed. The equilibration
driver circuit, as shown in Figure 5.19, uses a balanced NAND gate to com
bine the pulses from each ATD circuit. Balanced logic helps ensure that the
narrow ATD pulses are not distorted as they progress through the circuit.
The column addresses are fed into predecode circuits, which are very
similar to the row address predecoders. One major difference, however, is
that the column addresses are not allowed to propagate through the part until
the wordline has fired. For this reason, the signal Enable column (ECOL) is
gated into the predecode logic as shown in Figure 5.20. ECOL disables the
predecoders whenever it is LOW, forcing the outputs all HIGH in our exam
ple. Again, the predecode circuits are implemented with simple static logic
gates. The address signals emanating from the predecode circuits are buff
ered and distributed throughout the die to feed the column decoder logic
blocks. The column decoder elements are described in Section 4.1.
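One plausible static-gate reading of the ECOL gating is sketched below; the 2-bit address grouping and the active-low output polarity are assumptions consistent with the description above, not a transcription of Figure 5.20.

```python
# Column address predecoder gated by ECOL: each output is a NAND of ECOL with
# one decoded term, so all outputs are forced HIGH while ECOL is LOW.
def predecode_2bit(ca, ecol):
    """Predecode a 2-bit column address field into four active-low lines."""
    decoded = [1 if (ca & 0b11) == i else 0 for i in range(4)]
    # NAND(ECOL, term): disabled (ECOL LOW) -> all HIGH; enabled -> selected term LOW.
    return [int(not (ecol and term)) for term in decoded]

print(predecode_2bit(0b10, ecol=True))   # -> [1, 1, 0, 1]
print(predecode_2bit(0b10, ecol=False))  # -> [1, 1, 1, 1]
```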
As DRAM clock speeds continue to increase, the skew becomes the domi
nating concern, outweighing the RDLL disadvantage of longer time to
acquire lock.
This section describes an RSDLL (register-controlled symmetrical
DLL), which meets the requirements of DDR SDRAM. (Read/Write
accesses occur on both rising and falling edges of the clock.) Here, symmet
rical means that the delay line used in the DLL has the same delay whether
a HIGH-to-LOW or a LOW-to-HIGH logic signal is propagating along the
line. The data output timing diagram of a DDR SDRAM is shown in Figure
5.23. The RSDLL increases the valid output data window and diminishes
DQS to DQ skew by synchronizing both the rising and falling edges of the
DQS signal with the output data DQ.
Figure 5.22 shows the block diagram of the RSDLL. The replica input
buffer dummy delay in the feedback path is used to match the delay of the
input clock buffer. The phase detector (PD) compares the relative timing of
the edges of the input clock signal and the feedback clock signal, which
comes through the delay line and is controlled by the shift register. The out
puts of the PD, shift-right and shift-left, control the shift register. In the sim
plest case, one bit of the shift register is HIGH. This single bit selects a
point of entry for CLKIn into the symmetrical delay line. (More on this later.)
When the rising edge of the input clock is within the rising edges of the out
put clock and one unit delay of the output clock, both outputs of the PD,
shift-right and shift-left, go LOW and the loop is locked.
As one input of the delay element switches from HIGH to LOW, the other switches from LOW to HIGH. An added benefit of the two-
NAND delay element is that two point-of-entry control signals are now
available. The shift register uses both to solve the possible problem caused
by the POWERUP ambiguity in the shift register.
Figure 5.24 Phase detector: CLKIn is compared against CLKOut and CLKOut plus one unit delay to generate the shift-right and shift-left signals for the shift register.
From right to left, the first LOW-to-HIGH transition in the shift register
sets the point of entry into the delay line. The input clock passes through the
tap with a HIGH logic state in the corresponding position of the shift regis
ter. Because the Q* of this tap is equal to a LOW, it disables the previous
stages; therefore, the previous states of the shift register do not matter
(shown as "don't care" in Figure 5.25). This control mechanism guaran
tees that only one path is selected. This scheme also eliminates POWERUP
concerns because the selected tap is simply the first, from the right, LOW-
to-HIGH transition in the register.
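The point-of-entry rule just described reduces to a short scan of the shift register, as the following sketch shows (the bit ordering is an assumption for illustration).

```python
# Select the delay-line entry tap: the first LOW-to-HIGH transition scanning
# the shift register from the right; bits to the left of it are "don't care".
def entry_tap(shift_register):
    """shift_register[0] is the rightmost bit; return the selected tap index."""
    previous = 0                          # treat the position right of bit 0 as LOW
    for i, bit in enumerate(shift_register):
        if previous == 0 and bit == 1:
            return i                      # first 0 -> 1 transition from the right
        previous = bit
    return None                           # no tap selected (register all LOW)

#                 rightmost ---------------> leftmost
print(entry_tap([0, 0, 1, 1, 0, 1]))   # -> 2; bits above index 2 are don't care
```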
external clock and the feedback clock, is used. This provides enough time
for the shift register to operate and the output waveform to stabilize before
another decision by the PD is implemented. The unwanted side effect of
this delay is an increase in lock time. The shift register is clocked by com
bining the shift-left and -right signals. The power consumption decreases
when there are no shift-left or -right signals and the loop is locked.
Another concern with the phase-detector design is the design of the flip
flops (FFs). To minimize the static phase error, very fast FFs should be
used, ideally with zero setup time.
Also, the metastability of the flip-flops becomes a concern as the loop
locks. This, together with possible noise contributions and the need to wait,
as just discussed, before implementing a shift-right or -left, may make it
more desirable to add more filtering in the phase detector. Some possibili
ties include increasing the divider ratio of the phase detector or using a shift
register in the phase detector to determine when a number of, say,
four shift-rights or shift-lefts have occurred. For the design in Figure 5.26, a
divide-by-two was used in the phase detector due to lock-time require
ments.
Figure 5.27 Measured lock time versus input frequency (133 to 200 MHz).
Figure 5.28 Measured delay per stage versus Vcc and temperature.
Figure 5.29 Measured ICC (DLL current consumption) versus input frequency.
5.3.6 Discussion
In this section we have presented one possibility for the design of a
delay-locked loop. While there are others, this design is simple, manufac
turable, and scalable.
In many situations the resolution of the phase detector must be
decreased. A useful circuit to determine which one of two signals occurs
earlier in time is shown in Figure 5.30. This circuit is called an arbiter. If SI
occurs slightly before S2 then the output SOI will go HIGH, while the out
put SO2 stays LOW. If 52 occurs before SI, then the output SO2 goes HIGH
and SOI remains LOW. The fact that the inverters on the outputs are pow
ered from the SR latch (the cross-coupled NAND gates) ensures that SOI
and SO2 cannot be HIGH at the same time. When designed and laid out cor
rectly, this circuit is capable of discriminating tens of picoseconds of differ
ence between the rising edges of the two input signals.
The arbiter alone is not capable of controlling the shift register. A
simple logic block to generate shift-right and shift-left signals is shown in
Figure 5.31. The rising edge of SO1 or SO2 is used to clock two D-latches
so that the shift-right and shift-left signals may be held HIGH for more than
one clock cycle. Figure 5.31 uses a divide-by-two to hold the shift signals
valid for two clock cycles. This is important because the output of the arbi
ter can have glitches coming from the different times when the inputs go
back LOW. Note that using an arbiter-based phase detector alone can result
in an alternating sequence of shift-right, shift-left. We eliminated this prob
lem in the phase-detector of Figure 5.24 by introducing the dead zone so
that a minimum delay spacing of the clocks would result in no shifting.
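One way to combine the arbiter with the filtering discussed above is sketched below. The "same decision twice in a row" rule stands in for the divide-by-two and is illustrative rather than the circuit of Figure 5.31.

```python
# Arbiter-based phase detector with simple filtering (behavioral sketch).
def arbiter(t_s1, t_s2):
    """Return 'SO1' if S1 rises first, else 'SO2' (ties resolved arbitrarily)."""
    return "SO1" if t_s1 <= t_s2 else "SO2"

def filtered_shifts(edge_pairs):
    """Emit a shift only after two identical arbiter decisions in a row."""
    pending, shifts = None, []
    for t_s1, t_s2 in edge_pairs:
        decision = arbiter(t_s1, t_s2)
        if decision == pending:
            shifts.append("shift-right" if decision == "SO1" else "shift-left")
            pending = None          # restart the count after issuing a shift
        else:
            pending = decision
    return shifts

# Alternating early/late edges produce no shifting; two consistent decisions do.
print(filtered_shifts([(0.0, 0.1), (0.1, 0.0), (0.0, 0.1), (0.0, 0.2)]))
```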
Figure 5.33 shows how inserting transmission gates (TGs) that are con
trolled by the shift register allows the insertion point to vary along the line.
When C is HIGH, the feedback clock is inserted into the output of the delay
stage. The inverters in the stage are isolated from the feedback clock by an
additional set of TGs. We might think, at first glance, that adding the TGs in
Figure 5.33 would increase the delay significantly; however, there is only a
single set of TGs in series with the feedback before the signal enters the
line. The other TGs can be implemented as part of the inverter to minimize
their impact on the overall cell delay. Figure 5.34 shows a possible inverter
implementation.
REFERENCES
Voltage Converters
In this chapter, we discuss the circuitry for generating the on-chip voltages
that lie outside the supply range. In particular, we look at the wordline
pump voltage and the substrate pumps. We also discuss voltage regulators
that generate the internal power supply voltages.
maximum operating voltage that the process can reliably tolerate. For
example, a 1 Gb DRAM built in a 45 nm CMOS process with a 14 Å thick
gate oxide can operate reliably with an internal supply voltage not exceed
ing 1.2 V. If this design had to operate in a 3.3 V system, an internal voltage
regulator would be needed to convert the external 3.3 V supply to an inter
nal 1.2 V supply. For the same design operating in a 1.2 V system, an inter
nal voltage regulator would not be required. Although the actual operating
voltage is determined by process considerations and reliability studies, the
internal supply voltage is generally proportional to the minimum feature
size. Table 6.1 summarizes this relationship.
Process (minimum feature size)    Internal VCC
90 nm                             1.5 V
72 nm                             1.35 V
45 nm                             1.2 V
32 nm                             1.0 V
All DRAM voltage regulators are built from the same basic elements: a
voltage reference, one or more output power stages, and some form of con
trol circuit. How each of these elements is realized and combined into the
overall design is the product of process and design limitations and the
design engineer's preferences. In the paragraphs that follow, we discuss
each element, overall design objectives, and one or more circuit implemen
tations.
DRAM's operating range and ensure data retention during low-voltage con
ditions.
The second region exists whenever Vccx is in the nominal operating
range. In this range, Vcc flattens out and establishes a relatively constant
supply voltage to the DRAM. Various manufacturers strive to make this
region absolutely flat, eliminating any dependence on VCCX. We have found,
however, that a moderate amount of slope in this range for characterizing
performance is advantageous. It is critically important in a manufacturing
environment that each DRAM meet the advertised specifications, with some
margin for error. A simple way to ensure these margins is to exceed the
operating range by a fixed amount during component testing. The voltage
slope depicted in Figure 6.1 allows this margin testing to occur by establishing a direct dependence of VCC on VCCX.
The third region shown in Figure 6.1 is used for component burn-in.
During burn-in, both the temperature and voltage are elevated above the
normal operating range to stress the DRAMs and weed out infant failures.
Again, if there were no VCcx and Vcc dependency, the internal voltage could
not be elevated. A variety of manufacturers do not use the monotonic curve
shown in Figure 6.1. Some designs break the curve as shown in Figure 6.2,
producing a step in the voltage characteristics. This step creates a region in
which the DRAM cannot be operated. We will focus on the more desirable
circuits that produce the curve shown in Figure 6.1.
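The three-region characteristic can be summarized as a piecewise function; the breakpoints and slopes below are illustrative assumptions only, not the values of any particular design.

```python
# Piecewise sketch of the three-region VCC(VCCX) characteristic of Figure 6.1.
def vcc_internal(vccx, v1=3.0, v2=6.0, v_reg=3.0, slope2=0.05, slope3=1.0):
    """Monotonic internal supply versus external supply (assumed breakpoints)."""
    if vccx <= v1:                                   # Region 1: buses shorted, VCC = VCCX
        return vccx
    if vccx <= v2:                                   # Region 2: nearly flat; the small slope
        return v_reg + slope2 * (vccx - v1)          #   permits margin testing above spec
    region2_top = v_reg + slope2 * (v2 - v1)
    return region2_top + slope3 * (vccx - v2)        # Region 3: burn-in, VCC tracks VCCX again

for v in (2.0, 3.0, 4.0, 5.0, 6.0, 7.0):
    print(f"VCCX = {v:.1f} V -> VCC = {vcc_internal(v):.2f} V")
```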
To design a voltage reference, we need to make some assumptions
about the power stages. First, we will assume that they are built as unbuf
fered, two-stage, CMOS operational amplifiers and that the gain of the first
stage is sufficiently large to regulate the output voltage to the desired accu
racy. Second, we will assume that they have a closed loop gain of Av. The
value of Av influences not only the reference design, but also the operating
characteristics of the power stage (to be discussed shortly). For this design
example, assume Av = 1.5. The voltage reference circuit shown in
Figure 6.3 can realize the desired VCc characteristics shown in Figure 6.4.
This circuit uses a simple resistor and a PMOS diode reference stack that is
buffered and amplified by an unbuffered CMOS op-amp. The resistor and
diode are sized to provide the desired output voltage and temperature char
acteristics and the minimum bias current. Note that the diode stack is pro
grammed through the series of PMOS switch transistors that are shunting
the stack. A fuse element is connected to each PMOS switch gate. Unfortu
nately, this programmability is necessary to accommodate process varia
tions and design changes.
Two PMOS diodes (M3 and M4) connected in series with the op-amp output terminal provide the necessary burn-in characteristics in Region 3. In normal operation, the diodes are OFF. As VCCX is increased into the burn-in range, VCCX will eventually exceed VREF by two diode drops, turning ON the PMOS diodes and clamping VREF at two diode drops below VCCX. The clamping action
will establish the desired burn-in characteristics, keeping the regulator
monotonic in nature.
In Region 2, voltage slope over the operating range is determined by the
resistance ratio of the PMOS reference diode Ml and the bias resistor RI.
Slope reduction is accomplished by either increasing the effective PMOS
diode resistance or replacing the bias resistor with a more elaborate current
source as shown in Figure 6.5. This current source is based on a VTH refer
enced source to provide a reference current that is only slightly dependent
on Vccx [1]. A slight dependence is still necessary to generate the desired
voltage slope.
The voltage reference does not actually generate Region 1 characteris
tics for the voltage regulator. Rather, the reference ensures a monotonic
transition from Region 1 to Region 2. To accomplish this task, the reference
must approximate the ideal characteristics for Region 1, in which
VCc = Vccx- The regulator actually implements Region 1 by shorting the
VCc and Vccx buses together through the PMOS output transistors found in
each power stage op-amp. Whenever Vccx is below a predetermined voltage
VI, the PMOS gates are driven to ground, actively shorting the buses
together. As Vccx exceeds the voltage level VI, the PMOS gates are
released and normal regulator operation commences. To ensure proper
DRAM operation, this transition needs to be as seamless as possible.
isolate these buses can result in speed degradation for the DRAM because
high-current spikes in the array cause voltage cratering and a corresponding
slowdown in logic transitions.
The standby op-amp is similar to the other power op-amps except that the bias current is greatly reduced. This amplifier provides
the necessary supply current to operate the VCCP and VBB voltage pumps.
The final element in the voltage regulator is the control logic. An exam
ple of this logic is shown in Figure 6.9. It consists primarily of static CMOS
logic gates and level translators. The logic gates are referenced to VCC.
Level translators are necessary to drive the power stages, which are refer
enced to VCcx levels. A series of delay elements tune the control circuit rel
ative to Psense-amp activation (ACT) and RAS* (RL*) timing. Included in
the control circuit is the block labeled VCcx level detector [1]. The reference
generator generates two reference signals, which are fed into the compara
tor, to determine the transition point V1 between Region 1 and Region 2
operation for the regulator. In addition, the boost amp control logic block is
shown in Figure 6.9. This circuit examines the VBB and VCcp control signals
to enable the boost amplifier whenever either voltage pump is active.
6.2.1 Pumps
Voltage pump operation can be understood with the assistance of the
simple voltage pump circuit depicted in Figure 6.10. For this positive pump
circuit, imagine, for one phase of a pump cycle, that the clock CLK is
HIGH. During this phase, node A is at ground and node B is clamped to
Vcc-Vth by transistor Ml. The charge stored in capacitor Cl is then
QC1 = C1 · (VCC - VTH)    (6.1)
During the second phase, the clock CLK will transition LOW, which brings
node A HIGH. As node A rises to Vcc, node B begins to rise to VCc +
(VCC - VTH), shutting OFF transistor M1. At the same time, as node B rises
one VTH above VOUT, transistor M2 begins to conduct. The charge from
capacitor Cl is transferred through M2 and shared with the capacitor CLOAD.
This action effectively pumps charge into CLOAD and ultimately raises the voltage VOUT. During subsequent clock cycles, the voltage pump continues to deliver charge to CLOAD until the voltage VOUT equals 2VCC - 2VTH, one VTH below the peak voltage occurring at node B. A simple, negative
voltage pump could be built from the circuit of Figure 6.10 by substituting
PMOS transistors for the two NMOS transistors shown and moving their
respective gate connections.
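The charge-transfer behavior of this simple pump is easy to simulate. The capacitor values below are arbitrary, and the threshold voltage is treated as constant (no body effect), so the sketch only illustrates how VOUT ratchets toward its 2VCC - 2VTH limit.

```python
# Cycle-by-cycle sketch of the simple positive pump of Figure 6.10.
V_CC, V_TH = 2.5, 0.7          # assumed supply and (constant) threshold voltage
C1, C_LOAD = 10e-12, 100e-12   # pump and load capacitances (F), assumed

def pump(cycles):
    v_out = 0.0
    for _ in range(cycles):
        v_b = (V_CC - V_TH) + V_CC        # node B boosted when node A is driven HIGH
        if v_b > v_out + V_TH:            # diode-connected M2 conducts
            # Charge sharing between C1 (at v_b) and C_LOAD (at v_out), with M2
            # cutting off once VOUT comes within one VTH of node B.
            v_shared = (C1 * v_b + C_LOAD * v_out) / (C1 + C_LOAD)
            v_out = min(v_shared, v_b - V_TH)
        else:
            break                          # pump can no longer raise VOUT
    return v_out

print(f"after 100 cycles: VOUT = {pump(100):.2f} V "
      f"(limit 2*VCC - 2*VTH = {2*V_CC - 2*V_TH:.2f} V)")
```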
Schematics for actual VCCP and VBB pumps are shown in Figures 6.11
and 6.12, respectively. Both of these circuits are identical except for the
changes associated with the NMOS and PMOS transistors. These pump cir
cuits operate as two-phase pumps because two identical pumps are working
in tandem. As discussed in the previous paragraph, note that transistors Ml
and M2 are configured as switches rather than as diodes. The drive signals
for these gates are derived from secondary pump stages and the tandem
pump circuit. Using switches rather than diodes improves pumping effi
ciency and operating range by eliminating the VTH drops associated with
diodes.
charging the load capacitors. The oscillator is enabled and disabled through
the signal labeled REGDIS*. This signal is controlled by the voltage regula
tor circuit shown in Figure 6.14. Whenever REGDIS* is HIGH, the oscilla
tor is functional, and the pump is operative. Examples of VCCP and VBB
pump regulators are shown in Figures 6.14 and 6.15, respectively.
Figure 6.13 Ring oscillator.
Figure 6.14 VCCP regulator.
The voltage translators in the VCCP regulator of Figure 6.16 are built from a current source and MOS diodes. The reference voltage supply VDD is translated
down by one threshold voltage (VTH) by sinking reference current from a
current mirror through a PMOS diode. The pumped voltage supply VCcp is
similarly translated down by sinking the same reference current with a
matching current mirror through a diode stack. The diode stack consists of a
PMOS diode, matching that in the VDD reference translator, and a pseudo-
NMOS diode. The pseudo-NMOS diode is actually a series of NMOS tran
sistors with a common gate connection.
Transistor quantity and sizes included in the pseudo-NMOS diode are
mask-programmable. The voltage drop across this pseudo-NMOS diode
determines the regulated voltage for VCCP. Accordingly,

VCCP = VDD + Vndiode    (6.2)
The voltage dropped across the PMOS diode does not affect the regulated
voltage because the reference voltage supply VDD is translated through a
matching PMOS diode. Both translated voltages are fed into a comparator
stage, enabling the pump oscillator whenever the translated VCcp voltage
falls below the translated VDD reference voltage. The comparator hysteresis
dictates the amount of ripple present on the regulated VCCP supply.
The VBB regulator in Figure 6.17 operates in a similar fashion to the
VCCP regulator of Figure 6.16. The primary difference lies in the voltage
translator stage. For the VBB regulator, this stage translates the pumped volt
age VBB and the reference voltage Vss up within the input common mode
range of the comparator. The reference voltage Vss is translated up by one
threshold voltage (VTH) by sourcing a reference current from a current mir
ror through an NMOS diode. The regulated voltage VBB is similarly trans
lated up by sourcing the same reference current with a matching current
mirror through a diode stack. This diode stack contains an NMOS diode that
matches that used in translating the reference voltage Vss. The stack also
contains a mask-adjustable, pseudo-NMOS diode. The voltage across the
pseudo-NMOS diode determines the regulated voltage for VBB such that
VBB = -Vndiode    (6.3)
6.3 DISCUSSION
In this chapter, we introduced the popular circuits used on a DRAM for volt
age generation and regulation. Because this introduction is far from exhaus
tive, we include a list of relevant readings and references in the Appendix
for those readers interested in greater detail.
REFERENCES
[1] R. Baker, CMOS: Circuit Design, Layout, and Simulation, 2nd ed. Hoboken,
NJ: IEEE Press, 2005.
[2] B. Keeth, "Control circuit responsive to its supply voltage level," U.S. Patent 5,373,227, issued December 13, 1994.
Chapter
7
An Introduction to
High-Speed DRAM
Historically, the semiconductor industry has been the slave to two masters:
cost and performance. Cost, or more specifically cost reduction, is a matter
of pure economic survival for most companies in this industry. It is cost
after all that drives the industry to aggressively shrink process technology
year after year. Memory chips, due to their status as a commodity product,
are highly sensitive to cost. This fact fosters additional incentives for mem
ory companies to push their process technology development faster than
most other segments of the semiconductor industry. The historical record is
littered with the names of companies that produced memory products at one
time or another, but subsequently dropped out of the marketplace. A key
contributor to this attrition is the ability or inability of any given company
to remain cost competitive. Market forces, not supplier controls, determine
the basic selling price for commodity memory, most notably the balance
or imbalance in supply and demand. This leaves cost as the only component
of a balance sheet that can be controlled. Control costs and survive. Lose
control and fade away.
167
168 Chap. 7 An Introduction to High-Speed DRAM
the record straight, the remaining chapters deal entirely with high-perfor
mance DRAM.
Now, DRAM may not be what first comes to mind when you hear the
phrase high performance. We agree with that sentiment as images of race
cars and fighter aircraft flash through our minds. But, when we come back
to reality and refocus our attention on the semiconductor industry, DRAM
memory devices still don't come to mind. Rather, we are more likely to con
sider microprocessors. Clearly, these devices are on the leading edge of high
performance, and they benefit from the use of the latest transistor, metal,
and packaging technology in the quest for speed and processing muscle.
Memory, on the other hand, tends to limp along with slower transistors,
a minimum of metal layers, and inexpensive packaging. The goal at DRAM
companies has always been to meet the necessary speed targets while keep
ing manufacturing costs to an absolute minimum. This doesn't mean that
the DRAM processes are not highly advanced. They clearly are advanced.
The real difference between microprocessor and DRAM process technology
has been the focus of process development. While the focus for micropro
cessors has been performance, the focus in the memory sector has always
been cost. For DRAM, this focus has become a myopic quest for smaller
and smaller process geometries, with the ultimate goal of yielding more die
per wafer and, hence, a lower cost than the competition.
Only recently has the need for higher performance DRAM become a
pressing concern. There are two sectors of the memory industry that are
driving this issue. The first is high-performance desktop systems; the sec
ond is high-end graphics cards. Both of these sectors are extremely compet
itive, and they are highly prized for the pricing margins and marketing
mileage available to the best in class. With processor speeds racing into the
multi-gigahertz range, there are greater incentives for the memory sub
system to try and keep pace. Pricing pressures are lower in this sector,
which gives DRAM manufacturers more latitude with their process technol
ogy, design architectures, and packaging in order to realize desired perfor
mance gains. These performance gains and the underlying technology
eventually trickle down, over time, to the higher volume desktop market,
increasing performance levels for the mainstream market but ultimately
raising the bar again in the high-performance desktop market.
DRAM devices developed for the graphics card markets follow a simi
lar path, except at a more accelerated pace. High-end graphics cards create a
need for very high-performance DRAM devices. Given lower pricing pres
sure, DRAM manufacturers can improve their process technology, designs,
and packaging to significantly increase performance. Ultimately, these
improvements trickle down to other portions of the graphics market and
even into those main memory devices being developed for the high-perfor
mance desktop market.
tion yields the best system performance, but reality being what it is, won't
result in acceptable device yields for the memory manufacturer. Generally,
both tcK and CAS latency incrementally improve with successive genera
tions. Both memory manufacturers and system builders manage device per
formance, and the inevitable improvements, through a system known as
speed grading. Basically, a speed grade is a set of timing parameters associ
ated with device operation at a specific data bus frequency. At higher bus
frequencies, these timing parameters become more demanding and harder
to meet. Fortunately for the memory industry, devices that are unable to
meet all of the timing parameters for a specific speed grade can be down-
binned to a slower speed grade and sold into other market segments. Speed
grades are not only important to ensure that components from different sup
pliers are compatible, but also to support down-binning strategies that
improve overall production yields.
Synchronous DRAM (SDRAM), despite notable improvements in per
formance, represents only an evolution in technology. When SDRAM first
appeared, manufacturers essentially converted existing asynchronous
designs by merely changing I/O circuits to accommodate the synchronous
interface. Even the device specifications remained rooted in asynchronous
behavior. The real advantage gained by synchronous operation occurred in
the output timing specs. Asynchronous designs had very limited data rates
due to the fact that the memory controller could not issue a new data request
(Read operation) until it successfully captured data from the previous Read.
The controller drove CAS* LOW to initiate the Read operation and held
CAS* LOW until it was able to capture the Read data. This meant that col
umn cycle times included not just the DRAM access delay, but also the I/O
transfer delays, including flight time. Accordingly, column cycle time was
not optimized, because the controller was precluded from issuing a subse
quent read command until it had captured data from the current read com
mand.
Extended Data Out (EDO) asynchronous DRAMs reduced column
cycle time by hiding some of the I/O transfer delay. While traditional asyn
chronous devices, such as Page Mode (PM) or Fast Page Mode (FPM), only
drive Read data while CAS* is LOW, EDO devices continue to fire Read
data after CAS* transitions HIGH. This allows the memory controller to
begin to cycle the column for a subsequent Read while concurrently captur
ing data from the current Read. This seemingly small change allowed a rea
sonable gain in system performance.
SDRAM represents another step in the steady evolution of DRAM tech
nology towards higher performance. The synchronous interface specifica
tion is important, as it essentially divides the Read operation into two
separate components: the column access phase and the data output phase.
Essentially, the data output operation was pushed into a separate clock cycle
from that of the array access. This meant that device speed and the associ
ated clock frequency were now only limited by how fast the memory array
could perform column accesses. Synchronization ultimately increased per
formance because the data output operation became a scheduled event, iso
lated from the array access.
Double data rate (DDR)-style DRAM devices and their faster cousins,
graphics double data rate (GDDR) devices, represent an evolution in tech
nology compared to synchronous DRAM (SDRAM). For each Read com
mand, they drive two bits of data for every clock cycle: one bit on the rising
edge and a second bit on the falling edge. This simple change, dating from
technology developed in the early 1980s, effectively doubles data transfer
rates to and from the DRAM [1]. The various permutations of DDR and
GDDR, such as DDR2, DDR3, GDDR2, GDDR3, and GDDR4, encompass
evolutionary advances in technology to achieve higher overall data transfer
rates. These advances are enabled by changes in burst length, improvements
in signaling technology and packaging, and significant advances in circuit
design.
REFERENCE
Figure 8.1 a) Standard DDR2 die photo and b) high-speed GDDR3 die photo.
Chapter
8
High-Speed Die Architectures
This leaves us with the array data path, Read data path, and any limitations
caused by the locality of circuit blocks.
Peak bandwidth per pin = Burst Length / tCCD    (8.1)
Hence, a shorter column cycle time yields a higher data rate (bandwidth).
Note that the column cycle time is expressed as tCCD in DRAM device data
sheets. This relationship leads to the conclusion that higher bandwidths
require faster column cycle times, which brings us right back to the array
data path. Historically, column cycle time hasn't ventured below the 5 ns
mark for the majority of designs. In fact, burst length typically increases
from one interface technology to the next to maintain the column cycle time
above 5 ns. Table 8.1 shows the relationship between column cycle time,
burst length, and bandwidth for a variety of DRAM technologies.
Table 8.1 Burst length, minimum column cycle time, and peak bandwidth for a variety of DRAM technologies.

Technology    Burst Length    Min Column Cycle Time (tCCD)    Peak Bandwidth
SDRAM         1               6 ns                            167 Mbps/pin
DDR           2               5 ns                            400 Mbps/pin
DDR2          4               5 ns                            800 Mbps/pin
DDR3          8               5 ns                            1.6 Gbps/pin
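As a quick numerical check of Equation (8.1), the short Python sketch below evaluates peak per-pin bandwidth from burst length and tCCD; the values are simply those of Table 8.1 and the code is illustrative only.

# Peak per-pin bandwidth implied by Equation (8.1): bandwidth = burst length / tCCD.
# Illustrative sketch only; values taken from Table 8.1.

def peak_bandwidth_bps(burst_length, t_ccd_ns):
    """Return peak per-pin bandwidth in bits/s for a given burst length and
    minimum column cycle time (tCCD) in nanoseconds."""
    return burst_length / (t_ccd_ns * 1e-9)

table_8_1 = [
    ("SDRAM", 1, 6.0),
    ("DDR",   2, 5.0),
    ("DDR2",  4, 5.0),
    ("DDR3",  8, 5.0),
]

for tech, bl, t_ccd in table_8_1:
    bw = peak_bandwidth_bps(bl, t_ccd)
    print(f"{tech:6s} BL={bl}  tCCD={t_ccd} ns  ->  {bw/1e6:.0f} Mbps/pin")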
To begin, we'll review what happens within a DRAM array data path
during a typical column cycle. As shown in Figure 8.2, the array data path
consists of a column decode circuit; column select lines; I/O transistors,
which connect each sense amplifier to local I/O lines; global I/O lines; and
helper flip-flop (HFF) circuits. In a column cycle, as shown in Figure 8.3,
the column select line fires HIGH, and the I/O and HFF equilibration cir
cuits are released. This permits the sense amplifier to drive data onto the
local and global I/O lines, thus forcing a voltage differential to develop on
the lines. This differential signal drives into the HFF, shown in Figure 8.4,
which is fired to both amplify and latch the data for subsequent handling by
the Read data path. Finally, the I/O equilibration circuit fires, and the col
umn select line drives LOW. In a GDDR3 device, all of this action takes
place in as little as 2.5 ns.
Figure 8.3 Array data path 2.5 ns column cycle waveforms (0.5 ns/division).
To maximize the performance of the array data path, all of its compo
nents must be carefully designed for high-speed operation. The column
decoder circuit and the associated redundancy logic must be designed to
cycle very quickly. This may require adapting different types of logic styles
into the design and modifying how redundant columns are handled. Tradi
tional column redundancy techniques push out column cycle time to allow
for address matching and decoder delays. Alternative schemes, such as col
umn steering or data suppression, may need to be considered in place of the
more popular address suppression technique. Also, the global column select
lines, which connect the column decoder to the sense amplifier I/O gates,
must be carefully constructed to minimize the RC time constant. Finally, the
I/O transistors must be properly sized to strike an optimal balance between
minimum loading on the column select line (via smaller gate area) and max
imum Read/Write performance (via low source/drain resistance). Necessary
compromises such as this are common in the high-speed design realm.
times. One good technique for managing the spatial nature of this problem is to design the signal nets with identical flight times. For instance, referring
back to Figure 8.2, the flight times of the column select lines (CSEL) must
match the flight times of the various signals controlling the helper flip
flops. This requires us to carefully manage loading, wire widths, and drivers
so as to match both the propagation delay and the rise/fall times of these
critically timed signals. Any carelessness or shortcut taken in this regard
pushes out column cycle times. Likewise, all of the timing control circuits
that drive the column decoder and array data path must be similar in struc
ture and design to ensure that they track over temperature and voltage varia
tions. This attention to detail is necessary so that the timing relationships,
necessary for fast cycle times, are consistent despite anticipated changes in
the environment.
Column cycle time can be significantly affected by architectural choices
and by the construction of the array data path. These choices include array
block size, array partitioning, column decoder configuration and location,
global column select design, HFF configuration and location, routing metal
utilization, local I/O line length, global I/O line configuration, and any deci
sion to share resources between array banks. Let's walk through a simple set
of examples to see the impact that each of these choices may have on speed.
Start by examining Figure 8.5, which shows a simplified block diagram
of a memory array and an array data path. This example is typical of that
found in a DDR or DDR2 SDRAM device. It consists of a pair of array
blocks that are each subdivided by sense amplifier strips into four array
cores. Each array block adjoins column and row decoder blocks. At each
wordline and column select intersection we gain access to four bits of data,
two bits per sense amplifier strip. In this example, the HFF block is shared
between two array blocks. The shared HFF block necessitates the addition
of multiplexers onto each HFF so that the I/O lines can be selected from one
array block or the other. These multiplexers increase loading on the I/O
lines and invariably impact cycle time. The sizing of these multiplexers as
well as of the I/O equilibration devices is critical in order to keep column
cycle times minimized.
amplifier strips, which simplifies their design and reduces their effective
loading on the I/O lines. These changes to the I/O lines and multiplexers
support reduced column cycle time and higher bandwidth performance. All
of this results from a simple change to the array architecture.
have wordline activation restrictions imposed upon them by either the specification or by simple power constraints. Typically, most designs use some
combination of multiple wordline and multiple column activation to satisfy
data access requirements.
Figure 8.9 shows a simplified floor plan for a high-speed GDDR3 DRAM
device. Graphics memory devices, typified by GDDR3, typically have 32
DQ pins (X32). GDDR3 and GDDR4 standards, as defined by JEDEC
(Joint Electron Device Engineering Council), divide these 32 DQ pins into
four byte groups with eight DQ pins in each group. As shown in
Sec. 8.2 Architectural Features: Bandwidth, Latency, and Cycle Time 185
Figure 8.10, the GDDR3 package is pinned out so that there is a byte group
located in each of the four corners, with the command and address pins
(C/A) located in the center. This unique feature permits the floor plan of Fig
ure 8.9 to be divided into four quadrants, where each quadrant services a
separate byte group.
To help facilitate discussion, one of the four identical byte groups is cir
cled in Figure 8.9. Each byte group consists of a 32-bit data path that oper
ates on four bits of data for each of the eight DQ pins associated with the
group. The byte groups are fairly compact owing to the very pinout that
gives rise to them. Package pinouts for SDRAM, DDR, DDR2, and DDR3,
on the other hand, locate the DQ pins and C/A pins on opposite ends of the
package. While this arrangement simplifies module routing, it penalizes the
on-die data path by stretching it out along the entire length of the die. The
compact GDDR3 byte group of Figure 8.9 produces shorter routing lengths
and better timing control to enable faster cycle times and higher overall per
formance. The GDDR3 device implements two parallel pad rows. This fea
ture permits all address, control, and timing circuits to exist between the
pads. This in turn allows these circuits to be centered between all the byte
groups, while keeping address and control timing tight and signal routing
short.
Figure 8.10 GDDR3 package ballout: a DQ byte group in each of the four corners, with command/address pins in the center.
Additionally, Read data path components sit outside the two pad rows
allowing the overall byte group to be even more compact since all of the cir
cuits associated with that byte group are in close proximity to the array and
the array data path. It should be apparent by now that cycle time and speed
are improved by keeping the layout very compact and the associated routes
as short as possible. For a GDDR3 device, as shown in Figure 8.9, the
highly optimized placement of bond pads, array, array data path, and Read
data path produce very short cycle times and excellent overall performance.
Most of these changes can be sorted into one of two cost classifications:
minor and major. Minor changes or low-hanging fruit are simple design
changes and optimizations that result in minor improvements to Read
latency. Major changes basically trade off die size for significant improve
ments in Read latency. Let's discuss each of these cost classifications in
turn.
Latency is a by-product of delay accumulation that can be broadly clas
sified as gate delay, analog delay, or synchronization delay. While a large
number of circuits are called into action by a DRAM Read operation, only a
portion of these circuits actually impact Read latency. These particular cir
cuits form the Read latency critical timing path. All three delay types are
usually present in this critical timing path. However, the critical timing path
can actually change to a new path as individual circuits are tuned to mini
mize delay. This result stems from the fact that Read latency is determined by the longest, not the shortest, electrical path involved in a Read opera
tion.
For the sake of discussion, we will define the Read latency critical cir
cuit path for an exemplary device as comprised of C/A receivers; C/A cap
ture circuits; a command decoder; a column address path; column
redundancy; a column decoder; an array data path, including HFF; a Read
data path, including pipeline and serialization circuits; and an output driver.
This list of circuits is fairly representative of a critical timing path for most
SDRAM designs in production and is characterized by gate delays, synchro
nization points, and various wire and analog delays. Of these delays, analog
delays and the synchronization points are the most important.
Synchronization (sync) points are unavoidable in an SDRAM Read
latency control path. This is due to the fact that latency is defined in clock
cycles, not in nanoseconds. So, at the very least, the final stage must be syn
chronized to the clock. Additional sync points can help overcome timing
uncertainties and assist with latency management. Sync points can hide sig
nificant timing variability. However these results come at a cost. Each sync
point adds a clock cycle to Read latency. At shorter tCK (higher frequency),
this may be a small concern, but as tcK pushes out at lower frequencies,
these clock cycles can translate into large latencies.
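As a rough illustration of this tCK dependence, the following sketch converts an assumed number of sync points into nanoseconds of added Read latency at a few assumed clock periods; both the counts and the clock periods are hypothetical.

# Illustrative only: each synchronization point costs one clock cycle of Read
# latency, so its cost in nanoseconds grows directly with tCK.

def sync_latency_ns(num_sync_points, t_ck_ns):
    return num_sync_points * t_ck_ns

for t_ck in (1.25, 2.5, 5.0):          # hypothetical clock periods in ns
    for syncs in (1, 2, 3):            # hypothetical sync-point counts
        print(f"tCK={t_ck:4.2f} ns, {syncs} sync point(s): "
              f"{sync_latency_ns(syncs, t_ck):5.2f} ns added latency")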
So, let's return to the topic of Read latency and minor design changes.
The first consideration when designing the critical timing path is striking the
proper balance of delay types. Better performance is normally achieved by
minimizing the number of sync points to only those required to ensure
deterministic latency. Additional sync points generally contribute to higher
tCK dependency. The real goal is to produce the lowest Read latency possi
ble, independent of frequency. The memory array and array data path are
highly asynchronous. The timing of events within the array data path is gen
Input Circuit Paths
9.1 INTRODUCTION
For synchronous DRAM (SDRAM), clocks and data are transmitted
together with a fixed relationship: they are source synchronous (SS). This
essentially means that the clock edges used to launch transmitted data are
available at the receive end for capture. While double data rate (DDR)
SDRAM uses both clock edges to latch data, C/A uses only a single edge
(single data rate or SDR). The relationships between the clock, data, C/A,
and reference voltage (Vref) for DDR SDRAM are shown in Figure 9.1
[1-3]. Because data is running twice as fast as C/A, additional clocks called
data strobes can be included in the I/O interface and dedicated for high
speed data capture. In other words, strobes can be treated as a special type
of data as they are bundled and tightly linked (source synchronous) with
data. Strobes save power because they are active only in the presence of
data, unlike clocks, which are active continuously. Furthermore, strobes,
either single-ended or differential, provide predictable timing relationships
when synchronized with system clocks (CLK/CLK*). High-speed data cap
ture is more reliable with dedicated strobes than with non-source-synchro-
nous system clocks.
Clock and data alignment are also critical for data capture. Clocks and strobes (DQS) are usually center aligned with incoming C/A and data, respectively. Center alignment creates the largest timing margin for the capture latches. Timing parameters, such as data setup (tS) and hold (tH), are illustrated in Figure 9.1 along with the input voltage margins (VIH/VIL). As data rates continue to increase (from 133 Mb/s for SDR to 1,600 Mb/s for DDR3) and supply voltages scale down (from 3.3 V for SDR to 1.5 V for DDR3), the valid data windows (or data "eyes") shrink due to clock jitter,
duty-cycle distortion, inter-symbol interference (ISI), simultaneous switch
ing noise (SSN), impedance mismatch, reference noise, and supply noise, to
name a few. To address these signal integrity (SI) issues, we investigate sev
eral circuit techniques in Chapter 10, such as on-die termination (ODT) and
data bus inversion (DBI).
Figure 9.1 a) Setup and hold timing diagram for C/A.
Figure 9.2 Input receiver delay variations based on a) signaling levels and b) slew rate.
receiver, as shown in Figure 9.4. Notice that in order to get a better voltage
margin, not all of the devices are biased in the saturation region. For fully
differential clock receivers, the reference input (Vref) is replaced by the
complementary clock input (CLK*).
The placement (locality) of input paths can greatly impact the length
and shape of the route and should be floor-planned in the early design
phases. Shielding further improves matching by eliminating in-phase or out-
of-phase coupling. However, shielding also increases the overall layout area
and capacitive loading.
Three routing styles, with and without shielding, are seen in Figure 9.6:
1. Size maze: Efficient for adding length to horizontal routes. (Style can
be rotated for vertical routes.)
2. Trombone: For adding length to vertical routes.
3. Accordion: For adding length, but requires excessive corners.
Figure 9.6 Matched routing styles: a) size maze, b) trombone, and c) accordion.
In order to properly size clock drivers, logical effort [4] can be evalu
ated based on logic style, loading, and propagation delay. Extra buffer
stages and routes are added on each data or C/A path to match the
clock/strobe path. Input path delay, denoted d in Figure 9.7, represents the
delay from the input pins to the capture latches, including the receiver,
buffer, and matched path delays. One measurement for the quality of
matched routing is the difference in the input path delays. For well-matched input paths, less than 10 ps timing skew can be achieved across simulation corners with lay
out back-annotation. Less than 10 ps skew across the data bus was con
firmed in silicon for a GDDR3 part. However, when matched routing is
applied, a trade-off must be made between capture timing and the power
and layout area consumed by the routes.
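As a rough illustration of the sizing exercise mentioned above, the sketch below applies standard logical-effort relations to a hypothetical clock buffer chain; the capacitance values and target stage effort are assumptions, not figures from any particular design.

# Rough logical-effort sizing sketch for an inverter buffer chain (hypothetical
# capacitances). Delay is expressed in units of the process's unit inverter
# delay; parasitic delay is ignored for simplicity.
import math

def buffer_chain(c_in, c_load, target_stage_effort=4.0):
    """Estimate the number of stages and per-stage fanout for driving c_load from c_in."""
    path_effort = c_load / c_in                       # electrical effort (inverters: g = 1)
    n = max(1, round(math.log(path_effort) / math.log(target_stage_effort)))
    stage_effort = path_effort ** (1.0 / n)           # effort borne by each stage
    return n, stage_effort

n, f = buffer_chain(c_in=5e-15, c_load=400e-15)       # 5 fF source driving a 400 fF route
print(f"{n} stages, fanout ~{f:.2f} per stage, normalized delay ~{n * f:.1f}")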
From a layout point of view, matched routing is rarely accomplished by
automated tools due to its complexity and restrictions. Before preparing full
custom layout, a preliminary matched-route design must be budgeted and
floor-planned. Basic route topologies, routing distances, materials, dimen
sions, buffers, and loads must be determined and tuned to obtain the desired
signal quality. The initial design phase relies on using parasitic resistor and
capacitor (PRC) models to estimate routing parasitics, which enable dimen
sional and electrical tuning of the routes. Pad order and pad pitch also have
a direct impact on the route cells. An automated tool is helpful for checking
the final quality of matched routing by comparing conductors, topologies,
bends, and sizes. SPICE back-annotated simulation provides timing verifi
cation at the final design phase.
As shown in Figure 9.12b, a clock path that includes a clock divider and
Write clock tree, is not strictly matched to the input paths. The Write clock
tree is routed to both the command and data paths, which makes it longer
than the input matched network. However, the Write command must be
placed precisely in order to hit the setup and hold window in the Write pipe
line sequencer, which is controlled by the Write strobe. So, clock domain
crossing and timing adjustment may require extra buffers and delays in the
input matched paths, which increase latency, power, area, and sensitivity to
PVT variations. Data training can provide better alignment between strobes
and clocks during initialization. This open-loop "de-skewing" exercise conducted by the memory controller can help meet tDQSS, tDSS, and tDSH specifications. This is referred to as Write leveling.
Instead of training or matching, an on-die timing adjustment circuit
(i.e., DLL) can be used. With the input C/A DLL shown in Figure 9.12a, the
Write clock tree delay can be hidden by synchronizing CACLK with the out
puts (CAFB) of the clock tree. Capture latch delay can also be modeled in
the feedback path. The C/A DLL provides flexible tuning over PVT varia
tions and permits on-die Write leveling. However, the DLL also adds some
timing uncertainty compared to strictly matched routing. There are also
concerns about extra power, area, and "wake-up" time.
tCLK,path = tRX + tDLL + tTree    (9.1)

where tRX is the delay of the clock receiver; tDLL, the PLL/DLL delay; and tTree, the delay of the clock distribution network (CDN). The receive clock path delay can also be rounded up in terms of
unit intervals (UI). Therefore, the clock edge used to capture the data is n-
bit-shifted compared to the edge that launches it at the transmit site (TX).
When the clock and data are out of phase, such an n·UI delay amplifies jitter at certain noise frequencies. To reduce the impact of the n·UI input path delay, clock path circuits should be insensitive to PVT variations and efforts should be made to minimize this delay. Step size and adjustment range also steer
the timing circuit design toward an analog approach.
One problem associated with traditional CMOS logic is its sensitivity to
PVT variations, especially to variations in process and supply voltages. A
plot of delay versus PVT corners for two standard NAND gates is shown in
Figure 9.16. The delay can vary from 103 ps (fast process-FF, 1.5 V, and
Figure 9.17 Current-mode logic: a) CML buffer and its I-V characteristics.
Table 9.1 First-order delay, power, and energy-delay comparison of CMOS and CML buffer chains (N stages, load C per stage).

                     CMOS                                      CML
Delay (D)            D = N·C·VDD / [k·(VDD − Vt)²]             D = N·R·C
Power (P)            P = N·C·VDD²·f                            P = N·I·VDD
Energy-delay (ED)    ED = N²·C²·VDD³·f / [k·(VDD − Vt)²]       ED = N²·R·C·I·VDD
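To see why the CML delay tracks supply voltage so well, the sketch below evaluates the first-order delay expressions from the table above at a few supply points; all device parameters (stage count, load, k, Vt, R) are hypothetical, chosen only for illustration.

# First-order delay models from the comparison above (hypothetical parameters).
# CMOS gate delay depends strongly on VDD; CML delay is set by R and C only.

def cmos_delay(n, c, vdd, k, vt):
    """Delay of n CMOS stages, each driving load c, with saturation-current factor k."""
    return n * c * vdd / (k * (vdd - vt) ** 2)

def cml_delay(n, r, c):
    """Delay of n CML stages with load resistor r and load capacitance c."""
    return n * r * c

N, C = 10, 50e-15                      # hypothetical: 10 stages, 50 fF per stage
K, VT = 2e-3, 0.45                     # hypothetical NMOS parameters
R = 500.0                              # hypothetical CML load resistor (ohms)

for vdd in (1.35, 1.5, 1.65):
    print(f"VDD={vdd:4.2f} V: CMOS {cmos_delay(N, C, vdd, K, VT)*1e12:6.1f} ps, "
          f"CML {cml_delay(N, R, C)*1e12:6.1f} ps (supply independent)")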
A simulation setup for both CMOS and CML clock trees is shown in Figure 9.18. Two buffers drive two parasitic RC models (PRC) for metal routes, each with a distance of 1,500 µm, a width of 0.5 µm, and a space of 1 µm in an 80 nm DRAM process. A resistor is used for the CML load with improved bandwidth. The CML buffer is biased to get a constant swing of approximately 400 mV at a 1.35 V VDD. For the CMOS inverters, the width of the PMOS is 50 µm and of the NMOS, 25 µm. A minimum length is applied to all devices except the tail NMOS (MN3 in Figure 9.17a) for the CML buffer, which has twice the minimum length. The input clocks (CLKin and CLKinb) are differential and run at 1 GHz. Rise and fall times for the CML and CMOS buffers are 135 ps and 108 ps, respectively, to maintain reasonable duty cycle and signal quality. The simulation is set up at a typical process with PRC, 1.35 V, and 95°C corners.
Figure 9.18 Simulation setup for CML (top) and CMOS (bottom) clock trees.
Figure 9.20 a) CML and b) CMOS clock trees
under process and temperature variations @ 1.35 V
Indeed, the CML clock tree is immune to voltage and process varia
tions, as predicted in Table 9.1. Near-constant delay makes the CML buffer
ideal for designing a critical timing path, especially when using a clock
alignment scheme. (Timing circuits using the CML buffer are discussed in
Chapter 11.) The CML buffer is also a natural fit for clock receiver circuits,
as shown in Figure 9.3b. Extensive use of CML logic makes input circuits
capable of handling much higher data rates and more aggressive scaling.
So, what are the obstacles preventing designers from using CMLs in
DRAM timing paths as opposed to using simple CMOS inverters?
REFERENCES
Output Circuit Paths
ZC = √[(R + j2πfL) / (G + j2πfC)]    (10.1)

lim f→∞ ZC = √(L/C)    (10.2)
The skin effect is a phenomenon in which resistance increases as frequency increases because the signal is pushed into a very thin layer at the conductor surface. The thickness of this layer is inversely proportional to the square root of frequency (f). The skin resistance (RS) of the transmission line is given in Equation (10.3) along with the AC resistance (RAC), where μ is the permeability, σ is the conductivity, and W and l are the width and length of the conductor, respectively.

RS = √(πfμ/σ),  RAC = (l/W)·RS    (10.3)
At the frequency that some of the point-to-point DDR systems are run
ning, the skin depth is down to a few tenths of a mil (0.0002 inches) for the
conducting layer, which, in turn, increases the AC resistance. The skin
effect is the main reason that the resistance generally cannot be ignored in
calculating the characteristic impedance of a transmission line. The high
frequency loss due to skin effect can be compensated through equalization
to reduce inter-symbol interference (ISI).
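As a numerical illustration of Equation (10.3), the sketch below computes skin depth and AC resistance for a copper conductor; the trace length and width are hypothetical, chosen only to show the frequency trend.

# Skin depth and AC resistance versus frequency for a copper trace, per the
# relationships behind Equation (10.3). Trace dimensions are hypothetical.
import math

MU0   = 4 * math.pi * 1e-7      # permeability of free space (H/m)
SIGMA = 5.8e7                   # copper conductivity (S/m)

def skin_depth_m(freq_hz):
    return 1.0 / math.sqrt(math.pi * freq_hz * MU0 * SIGMA)

def r_ac_ohms(freq_hz, length_m, width_m):
    """AC resistance when current crowds into one skin depth of the conductor."""
    return length_m / (SIGMA * width_m * skin_depth_m(freq_hz))

length, width = 0.1, 150e-6     # 10 cm long, 150 um wide trace (hypothetical)
for f in (100e6, 1e9, 5e9):
    d = skin_depth_m(f)
    print(f"f={f/1e9:4.1f} GHz: skin depth {d*1e6:5.2f} um "
          f"({d/25.4e-6:5.3f} mil), R_AC {r_ac_ohms(f, length, width):5.2f} ohm")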
When a transmission line is terminated at both ends with its characteris
tic impedance (Z0), the line is said to be matched, as shown in Figure 10.2.
Being matched means that no reflections should occur on the line, which is
desirable for high-frequency signal transmission. Without termination and
impedance control, reflections on the line distort the signal, which, if the
distortion is large enough, prevents reliable detection by the receiver.
In Figure 10.2, at the driver side, the matched impedance (Z0) is equal
to the output driver impedance (e.g., 18 ohm) plus the termination imped
ance (e.g., 45 ohm). On-die termination (ODT) can provide a matched
impedance on the receiver end. An ODT pin is added to DDR2 parts, as
shown in Figure 10.3, to control the termination impedance as needed. For
DDR2, ODT is required only for the data, data mask, and strobes. For clock
Figure 10.4 Typical ODT configurations for DRAM (two modules populated): a) Writes and b) Reads.
Figure 10.5 Comparisons between SSTL and ODT for dual-ranked module.
Figure 10.6 ZQ calibration control block diagram: comparator, digital filter and SAR logic, control logic, code generator, and VREF generator.
same orientation and power buses should be applied to all drivers. To avoid
the physical limitations caused by tuning transistors, a minimum-sized tran
sistor may need to be stacked (connected in series) to provide better resolution. Because of the mobility difference, a different number of legs can be used in parallel for pull-up or pull-down calibrations. For instance, two pull-up legs versus one pull-down leg results in a 120 Ω RCAL,DN compared to a 240 Ω RCAL,UP. It is also important to isolate critical analog circuits such as the comparator from the digital control logic. To minimize noise and cross talk, the analog signals VREF, VZQ, and VTERM must be shielded and decoupled. To reduce errors, VZQ and VTERM should be tapped in the middle of the connec
tion (small circles in Figure 10.6). Finally, layout extraction and simulation
are essential to ensure that the design meets the specs across the PVT cor
ners.
Now, following design and layout, let's talk about how to characterize the output impedance. The output drive impedance can easily be approximated with the dV/dI curves during Test and Characterization. The curve is obtained by sweeping VOUT from 0 to VDDQ and measuring IOUT. An example of GDDR3 output impedance versus temperature is plotted in Figure 10.9, with both simulated and silicon data running at VDDQ = 1.8 V and at typical process corners. To extract the driver impedance from the curves, divide the delta V by the delta I in the output driver's operating range. The pull-up impedance shows a tighter spread than the pull-down impedance in a two-step ZQ calibration. Nonlinearity is also observed when VOUT is approaching 0 for the pull-down impedance or VDDQ for the pull-up impedance.
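The dV/dI extraction itself can be sketched in a few lines of Python; the I-V sweep points below are hypothetical, standing in for measured or simulated data.

# Sketch of extracting output-driver impedance from a swept I-V curve as dV/dI
# over the driver's operating range. The I-V points below are hypothetical.

def impedance_from_iv(v_pts, i_pts, v_lo, v_hi):
    """Approximate driver impedance as delta-V / delta-I between two sweep points."""
    lo = min(range(len(v_pts)), key=lambda k: abs(v_pts[k] - v_lo))
    hi = min(range(len(v_pts)), key=lambda k: abs(v_pts[k] - v_hi))
    return (v_pts[hi] - v_pts[lo]) / (i_pts[hi] - i_pts[lo])

# Hypothetical pull-down sweep: VOUT from 0 V toward VDDQ, measured IOUT in amps.
v_sweep = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
i_sweep = [0.0, 0.0048, 0.0095, 0.0141, 0.0185, 0.0226, 0.0262]

z_pd = impedance_from_iv(v_sweep, i_sweep, v_lo=0.4, v_hi=0.8)
print(f"Approximate pull-down impedance: {z_pd:.1f} ohm")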
Figure 10.9 GDDR3 output impedance: pull-down and pull-up I-V curves (left) and pull-up/pull-down impedance versus temperature for a 40 Ω target (right).
v = LP · di/dt    (10.8)
b) Signal-ground normalized loop impedance without Cfat (left) and with Cfat (right).
Moreover, by staggering the turn-on times of the three legs of the output driver, shown in Figure 10.16, peak switching currents are reduced, resulting in less ringing and lower SSN. Gate delay or RC delay can be used to create a small delta t, the value of which depends on the targeted rise/fall times and data rate. The range of t can be easily tailored from 30 ps to
Table 10.1 Pin RLC results for TSOP and FBGA packages.
Figure 10.17 Lead-frame-based (TSOP) packages on the left versus fine ball-grid-array-based (FBGA) board-on-chip (BOC) packages on the right.
Table 10.2 Average data eye parameters of four lanes with or without DBI.
b) Without Cfat, the signal return path is shifted between VDDQ and VSSQ.
Cross talk (with a 100 nF decoupling capacitor):
1. When the victim trace is close to the aggressor, cross talk is the combination of SRPS (Signal Return Path Shift) noise and coupling noise.
2. When the victim is far away from the aggressor (>60 mil), cross talk is due only to the SRPS noise.
(Waveform annotations for trace T5: 1191 mV without Cfat and 304 mV with Cfat = 250 pF at Lp = 0.5 nH; 1280 mV without Cfat and 337 mV with Cfat = 250 pF at Lp = 2 nH.)
protect input gate oxide. The GCNMOS device is for the power clamp. ESD
diodes (ESDPN and ESDNP) serve as primary current clamps for the I/O
supplies. A large decoupling capacitor (FatCap) between power and ground
can shunt some ESD stress current and improve ESD performance.
The output driver transistor can be built to act like a stand-alone ESD
circuit or as a high-speed driver. If we lay out the driver to meet the high
current requirements that ESD requires (i.e., 3 amps), then this would
impact the resolution of impedance matching and high-speed performance
of the driver. Instead, ESD and the driver can be separated into two different
circuits. The tuning transistors for impedance and ODT control would be
chosen to meet the desired LSB without any ESD considerations. The ESD
circuit would be then added in parallel as the best possible ESD solution.
This approach can maximize the performance of both circuits independently. However, it works only up to a certain bandwidth, around 3 GHz.
Beyond that, the drivers are built as both driver and ESD circuit in an
attempt to allocate design space for the ESD circuit function.
As mentioned before, CIO reduction is beneficial for high-speed operations. However, bidirectional I/O with push-pull drivers and ODT circuits adds more capacitance per I/O pad. From the DDR2 CIO spec, only 2.5-3 pF is allowed, which leaves little room for the ESD circuits. On the other hand, the diodes for the primary power and ground current clamps can add up to more than 0.3 pF. If the ESD circuits are too small, then there will be a yield penalty, as Assembly and Test still generate static electricity through their back-end process and handling. ESD circuits using a breakdown device like an SCR can reduce capacitance with a turn-on time limitation. A modified SCR for twin-well technology is reported in [3] with a maximum 0.45 pF CIO reduction. A diode-triggered SCR (DTSCR) [6] can achieve 500 ps turn-on time with only 0.2 pF pin capacitance.
Noise coupling from the ESD diodes among different power domains
must be addressed. When the power bus networks are broken into sections,
edges. The outputs of P2S are fed into pre-drivers and finally output drivers
with proper slew-rate and impedance controls.
The overall output timing is denoted as tOUT, which is from the internal Read clock (i.e., the DLL and clock distribution network) to the DQ at the pad. The output path is generally modeled in the timing control circuit (i.e.,
DLL) and latency control logic to maintain clock synchronization and the
desired programmable CAS latency. Several timing parameters can be
derived from or related to the output timing:
• tAC as DQ output access time from the external system clocks CLK/CLK*
• tDQSCK as Read strobe (DQS) access time from CLK/CLK*
• tDQSQ as DQS-to-DQ skew
The first two parameters are linked to the performance of the on-die
timing control circuits, like DLL. Due to power and area constraints, the
model used in the DLL is not a straight copy of the output path. Instead, a
scaled down version of the output path is used, especially for pre-drivers
and output drivers. Moreover, the noisy DQ supplies (VDDQ/VSSQ) cannot be mapped into the output model, which could result in excessive timing jitter. On the other hand, DQS-DQ skew is a measure of the matched routing of the output paths and of SSO-induced jitter. For a high-speed DRAM interface, tDQSQ is more stringent than tAC and tDQSCK since strobes are used to capture the data at the memory controller. DQS is generally treated as a special data signal having a one-zero toggle pattern. For DDR2-800, the maximum tDQSQ is only 175 ps, which is half of the tDQSCK spec. For GDDR4, tDQSQ is reduced to only 80 ps.
A circuit block diagram for a P2S is shown in Figure 10.25. Four identi
cal latches can be used with outputs wire OR'd together for multiplexing
four bits of data. Multiplexer fan-in can seriously limit the bandwidth of the
output path [7]. In Figure 10.26, a dynamic latch with domino logic can be
used for the Read latch. The dynamic latch has speed advantages over the
static latch because the load for input (D) is only one NMOS transistor (WN39). The internal sample clock CLKsp is single phase with light loading. A keeper circuit bleeds a small current to counter the leakage through an internal high-impedance node, d*. Sizing the output stage of this latch
and buffer stage in the P2S is critical to maintaining good duty cycle.
Data is first sampled by the rising edges of multi-phase internal Read
clocks, which run at lower speed for less distortion and better clock distri
bution. A clock sampling window for bit 0 is shown in Figure 10.27, where
C0 is at logic "1" and C90 is at logic "0." C0 and C90 are connected to the CLK1
and CLK2 signals of the Read latch, respectively. Clock routes and loads are
matched inside the P2S to maintain the phase relationship. Implementation
of the phase generator is discussed in Chapter 11.
An output enable (OE) is generated from the latency control logic to
start the serialization. The OE signal passes through the same latch and
selects either the latch output, ODT, or tri-state signal. Then proper pull-up
or pull-down signals are generated and fed into the pre-drivers and output drivers.
Some speed limitation is imposed by the P2S because the data is serial
ized and running full-speed from this point. Due to the bandwidth limitation
of the CMOS gates, it is difficult to push the bit time below 400ps (2.5Gbps
data rate) without encountering some signal distortion. One way to handle
this is to push the serialization out to the output drivers [8]. This output-
multiplexed architecture requires multiple copies of the output driver, which
is power- and area-hungry. Current-mode logic (CML) is another solution
for the output path when the speed is beyond 2.5Gbps. With reduced volt
age swing and fully differential signaling, CML offers superior performance
over traditional CMOS logic with the added cost of constant current draw
and a more complicated design. A CML-based output path [7] is shown in
Figure 10.28. Pass transistors controlled by different phases of the clock are
used for input multiplexing. True and complementary signals are required
for CML, which effectively doubles the logic.
Figure 10.28 CML-based output paths [7].
REFERENCES
Timing Circuits
As mentioned in Chapter 5, timing circuits, such as those used for clock
synchronization, are an important class of circuits. In this chapter, we
explore various types of timing circuits and their implementation. Among
these, delay-locked loop circuits (DLL), both digital and analog, are cov
ered in detail due to their simplicity and wide use in high-speed DRAM.
11.1 INTRODUCTION
Timing circuits, in general, are associated with a clock. A system clock,
which is generated by the memory controller, synchronizes the input and
output (I/O) operations of the DRAM. It makes sense then that two of the
biggest obstacles for high-speed I/O operations are clock-related: clock
skew and jitter. Clock skew is any timing difference between the external
system clock and the internal DRAM operating clocks. A propagation delay
from the I/O circuits and local buffers (to drive long interconnects for clock
distribution) can skew the internal clock relative to the external clock.
Delay mismatch among different clock paths and duty-cycle distortion can
impact the timing budget even further. Dynamic clock jitter, on the other
hand, can be introduced by supply noise; cross talk; and voltage and tem
perature (VT) variations. Timing variations (both skew and jitter) along the
critical data path could cross several clock cycles as speeds increase making
it difficult to predict and track latency. To reduce such variations, timing cir
cuits such as clock synchronization or de-skewing circuitry are generally
used. These play an increasingly important role in high-speed DRAM cir
cuit design.
To achieve better performance, the critical timing path of the memory
device from input to output must be carefully analyzed and modeled. The
critical timing path includes the input path (Chapter 9), output path
(Chapter 10), clock distribution networks, and one or more de-skewing cir
cuits. Latency control (Chapter 12) and power delivery (Chapter 13) must
also be considered when designing the timing circuits. Although high-speed
circuit design for de-skewing is the focus of this chapter, each piece along
the critical timing path inevitably interacts with the core sync circuits and
contributes to the overall timing budget. Several critical timing specs, such
as tAC, tDQSQ, tDQSS, CAS latency, data setup and hold, etc., can be deter
mined through the analysis. Therefore, the design of clock synchronization circuits is integral to the overall system; design modifications and
improvements are necessary to accommodate increasing clock speed.
So, how is the design of synchronization circuits impacted by increas
ing clock frequencies? First, the timing margins shrink with a shorter clock
cycle time. In turn, the clock and data alignment must be more precise and
tolerate less jitter. Because both the rising and falling edges are utilized,
duty-cycle distortion jeopardizes the accuracy of data processing. Second,
clock domain crossing becomes more of a problem due to timing uncer
tainty arising from clock skew and short cycle time. With the inherently
slow transistors and minimum number of metal layers of a DRAM process,
clock and power delivery is a challenge: ground bounce, voltage bumps,
and noise from cross coupling degrade timing. Last but not least, impedance
mismatch and inter-symbol interference (ISI) degrade signaling, which
necessitates on-die calibration and training. Advanced packaging technol
ogy must be considered and modeled in the design flow to analyze timing
and power delivery. Since cost remains the primary concern for DRAM
manufacturers, design trade-offs must be made between cost and perfor
mance, simplicity, and scalability.
Historically, sync circuitry first appeared in DDR SDRAM, in which
the clock frequency was approximately 66-133 MHz. A simple digital
delay-locked loop was used to synchronize the clock with output data.
High-performance GDDR4 DRAMs have pushed speeds up to multi-giga
hertz range, while scaling cycle times down to less than 1 ns. Achieving
such high speeds requires enormous complexity in the sync circuit and sys
tem design. However, variability in the lock time, tuning range, resolution,
jitter, power, and layout can greatly impact overall system performance.
Based on various design choices, sync circuitry can be implemented
using digital or analog, open or closed loop methodologies or any combina
tion thereof. The two most common sync circuits in use today are the delay-
locked loop (DLL) and the phase-locked loop (PLL). The DLL is widely
used in memory interfaces for its simplicity and good jitter performance. An
alternative approach is the PLL, which is usually found in communication
systems and microprocessors for clock data recovery and frequency synthe
sis. Both the DLL and PLL are closed-loop systems with a reference and
feedback signal. The biggest difference between a DLL and PLL is that a
DLL merely passes the reference signal through a variable delay line, while
a PLL generates a new signal that is both phase- and frequency-locked to
the reference signal.
Although both the DLL and PLL are good for clock de-skewing, the
PLL is rarely used in high-speed DRAM due to the following:
1. Increased complexity
2. Stability issues inherent in a higher order system
3. Need for frequency locking
4. Longer lock time
Unlike the PLL, jitter in a DLL is small and easy to control due to the
lack of jitter circulation, which occurs in a PLL oscillator. However, any
input jitter may directly pass to the DLL output without any filtering. With a
low-jitter reference signal, a well-designed DLL can achieve the same jitter
performance as a PLL for clock synchronization.
An analog approach generally includes a voltage-controlled delay line (VCDL) and a charge-pump phase detector to generate a control signal [3]. However, an all-digital DLL (ADDLL) can be implemented for greater scalability and portability across DRAM process generations. Because logic gate delay remains fairly constant as feature size and supply voltage scale, an all-digital design can be reused with little modification and is also easy to test and debug, an issue which cannot be over-emphasized for modern DRAM design. Good stability over process, voltage, and temperature (PVT) variations is another plus because it is a 0th-order system without integration. A register-controlled DLL [1, 2] is discussed in Chapter 5, and ADDLL design is addressed in detail in Section 11.2 of this chapter.
A higher performance DRAM generally supports a variety of power
saving modes. To resume normal operation after exiting power-saving
modes, the sync circuit must be relocked, resulting in some start-up delay.
Lock time of a sync circuit becomes an important design issue, and digital
implementations usually surpass analog counterparts in this area. Some
open-loop architectures such as synchronous mirror delay (SMD) [5] and
ring counter delay (RCD) can achieve lock within several cycles although
they lack fine timing resolution. A combination of DLL and SMD, called
measure-controlled delay (MCD), has been proposed to take advantage of
quick initialization with a closed-loop structure for better voltage and tem
perature tracking after initial lock. Voltage and temperature variations intro
duced by the power saving in DRAM can greatly impact critical timing
needed for high-speed operation. Power-saving modes may need to be
traded for better overall timing performance.
As mentioned, clock and power delivery are big challenges for high
speed DRAM circuit design. To maintain proper slew rate and duty cycle,
the external clock must be shaped and re-driven over a long distance inter
nally. Clock power for the distribution network can represent as much as
70% of the entire timing path and even higher as the speed increases. One
solution to this problem is scaling down the internal clock frequency by
using a clock divider. Except for the I/O circuits, a single-phase, low-speed
clock signal can be distributed globally across the die with less attenuation
and power. However, the high-speed interface still processes data in full
data rate (twice the clock frequency), and an edge recovery circuit must be
added to transform the internally divided-down clock into multi-phase
clocks that include ali missing edge information. A mixed-mode DLL with
an analog phase generator can be configured to suit this purpose. It is obvi
ous that increased speeds demand more complex clocking system design.
Layout is another challenge fbr high-speed clock synchronization
design. Full custom layout is generally required for better matching, rout
ing, sizing, and shielding. The delay line (for delay adjustment) is without
question the most important piece as it carries a clock signal. For mixed
mode layout, partitioning sensitive analog blocks from digital circuits with
proper isolation, bussing, and placement is key for better performance.
A circuit design is not complete without built-in test capability. Design-
for-test (DFT) is essential for clocking in high-speed DRAM products.
Wafer-level probing is a challenge due to power delivery, high-speed clock
and signal delivery, and availability of limited Probe resources. Trouble
shooting can be extremely time-consuming if not impossible. Along with
circuit design, we address various test issues related to timing circuits.
Figure 11.1 Timing parameters of a typical all-digital DLL used in a memory system.
where tIN is the delay of the clock receiver and input buffer, and tADJ is the delay reserved for tuning after initial lock due to voltage and temperature variations. Parameter tDLL includes the DLL buffer delays and a fine adjustment delay, if any. The delay of the data output path tOUT includes the DQ latches and DQ drivers. For low-speed operations, the cycle time (tCK) is usually greater than tINTRI. As speed increases, tCK may be a fraction of tINTRI, and
tight synchronization is needed to reduce timing skew. In addition to the for
ward path, a DLL loop also includes a feedback path. An I/O model takes an
input from the clock tree and generates an output signal fed back to the PD.
The I/O model tries to track the delay of input and output paths and can be
defined as
tmodel = tIN + tOUT    (11.2)
When the DLL is locked, the total forward path delay, tTOTAL, is equal to m times the clock period (tCK), where m is an integer greater than zero. The relationship can be given as

tTOTAL = tINTRI + tLOCK = m · tCK    (11.3)

where tLOCK is the insertion delay of the delay line when the DLL is locked. Assuming a digital delay line with a total of N stages, where the delay per stage is tD (unit delay) and n is the number of delay stages involved in the delay line when the loop is locked, the tuning range can be given as

tLOCK = n · tD    (11.4)
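Combining Equations (11.3) and (11.4) gives a quick estimate of how many delay stages are engaged at lock. The sketch below does exactly that; the values of tINTRI, tCK, and tD are hypothetical.

# Delay stages needed at lock, from t_TOTAL = t_INTRI + t_LOCK = m * t_CK and
# t_LOCK = n * t_D (Equations 11.3 and 11.4). All numbers are hypothetical.
import math

def stages_at_lock(t_intri_ns, t_ck_ns, t_d_ps):
    """Smallest m and corresponding stage count n that satisfy the lock condition."""
    m = math.ceil(t_intri_ns / t_ck_ns)          # smallest integer multiple of tCK
    t_lock_ns = m * t_ck_ns - t_intri_ns         # delay the line must insert
    n = math.ceil(t_lock_ns / (t_d_ps * 1e-3))   # unit delays needed
    return m, n, t_lock_ns

m, n, t_lock = stages_at_lock(t_intri_ns=8.7, t_ck_ns=2.5, t_d_ps=100)
print(f"m={m} cycles, insert {t_lock:.2f} ns -> n={n} delay stages")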
Generally, it is important to know the timing difference caused by pro
cess, voltage, and temperature (PVT) variations. Table 11.1 and Table 11.2
present data based on an exemplary timing analysis. A digital delay line
with fine and coarse tuning is shown as part of the DLL intrinsic delay in
Table 11.1. Different implementations produce different DLL intrinsic
delay. The clock input buffer also includes a clock divider for high-speed
operation. The sum of the intrinsic forward path delay is shown as the total
delay.
The data in Table 11.1 show the sensitivity of major timing blocks ver
sus PVT variations. The clock input path, clock distribution, and data output
path represent 60% of the overall forward path delay at approximately 20%
for each. Larger die size and additional data pins require a larger clock dis
tribution network and produce longer delay. The ADDLL intrinsic delay
accounts for the remaining 40% of the forward-path delay. Worst-case PVT
variation could introduce 60% timing skew (five cycles of variation) for the
total forward delay after lock for a 1GHz external clock.
Given a typical process, the voltage and temperature (VT) variations
depicted in Table 11.2 produce up to 2.1 ns (40%) timing skew (1.6 ns is
due to the 600 mV supply change). For every 100 mV drop in the nominal
1.5 V supply, the total forward path delay (excluding DLL delay) can
change by 360 ps, almost an entire bit time at a 2.5GBit/s data rate. Com
pared to lowering the voltage, increasing the voltage has less impact on
delay. On the other hand, a 100℃ temperature change may vary the delay
by half a nanosecond with higher voltages creating even greater temperature
sensitivity.
Table 11.2 Timing data under voltage and temperature variations with a typical process.

                            Clock Input   Clock          Data Output
                            Path          Distribution   Path          Total
Delay change (ps)/+100°C    157           130            183           470
Delay change (ps)/−0.1 V    123           109            128           359
Delay change (ps)/+0.1 V    −59           −57            −64           −180
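As a back-of-the-envelope use of Table 11.2, the sketch below linearly extrapolates the per-block sensitivities to an assumed voltage and temperature excursion; the excursion values are hypothetical, and a linear extrapolation only approximates the behavior described in the text.

# Back-of-the-envelope skew estimate using the per-block sensitivities of
# Table 11.2 (ps per +100 C and ps per -0.1 V); the excursions are hypothetical.

SENS_PER_100C   = {"clock input": 157, "clock distribution": 130, "data output": 183}
SENS_PER_M100MV = {"clock input": 123, "clock distribution": 109, "data output": 128}

def forward_path_skew_ps(delta_t_c, delta_v_mv):
    """Approximate total forward-path delay change (excluding the DLL), in ps,
    by linear extrapolation of the per-step sensitivities."""
    temp_term = sum(SENS_PER_100C.values()) * (delta_t_c / 100.0)
    volt_term = sum(SENS_PER_M100MV.values()) * (-delta_v_mv / 100.0)
    return temp_term + volt_term

# Example: +85 C temperature rise together with a 300 mV supply droop.
print(f"Estimated skew: {forward_path_skew_ps(delta_t_c=85, delta_v_mv=-300):.0f} ps")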
Simulation Corners*            Schematic (Fixed Load)    Layout (Extraction)
Fast: 1.9 V, FF, −40°C         83                        81

*(TT, typical NMOS and PMOS; FF, fast corner; SS, slow corner.)
The capacitive load (CL) of one delay element (half of the delay stage) is the sum of the output capacitance, the input capacitance of the next delay element, and the interconnect capacitance. Because the CMOS NAND gate has four transistors versus two in the CMOS inverter, CL is comparably larger in the NAND-based delay stages. The size of the delay element also determines the value of CL, which should be a compromise between speed, power dissipation, and area. For high-speed, low-voltage applications, each delay stage should provide enough gain (or drive) to prevent the signal transition slope or slew rate from being too sluggish. The transition time, rise time (tPLH) + fall time (tPHL), should be less than 40% of the cycle time for
better signal quality and lower attenuation. Unlike analog implementations,
which change delay by varying current for every stage, the all-digital
approach preserves slew rate of the clock signal regardless of the delay
range provided by the DLL.
Making a symmetrical transition (tPHL = tPLH) is very important for low duty-cycle distortion. A seemingly small mismatch (10 ps) in the delay stage can result in significant distortion at the DLL output. Because the mobilities of P devices and N devices are significantly different, it's not easy to match the rise and fall times over process variation. Stresses seen during burn-in may result in negative bias temperature instability (NBTI), causing
device mismatch and producing excessive duty-cycle distortion. Duty-cycle
amplification along the critical timing path can make such mismatch even
N = tCK,slow / tD,fast    (11.5)
While both frequency and process are factored in during initial lock, the
delay line still needs to be able to track voltage and temperature (VT) varia
tion. As shown in Table 11.2, the intrinsic delay can vary as much as 40% at
lower voltage and higher temperature. The delay line should have sufficient
tuning range to cover such variations. Some other applications may require
a frequency change over a long period of time (called frequency slewing)
for EMI reduction. Depending on the initial lock and lock point, the delay
line needs to accommodate such frequency, voltage, and temperature (FVT)
changes at the cost of extra intrinsic delay, power, and area.
The initial phase relationship between the internal clock and the feed
back signal is uncertain at power up because of delay variations. After
power up and reset, the delay line should be set to known initial conditions,
which establish a unique clock propagation path. In general, we want n from
Equation (11.4) as small as possible in order to reduce power dissipation
and increase noise immunity. Therefore tLOCK should be limited to one clock cycle (tCK) after lock, even though the delay line could provide a delay range more than two times the clock period (in some literature, it's referred
N = tCK,slow / (tD,fast · φ)    (11.6)

where φ is the phase-shifting factor. For 180°, φ equals 2. Using the previous example, tCK = 5 ns and tD = 100 ps, the worst-case number of delay stages needed is reduced from 50 to 25. The worst-case insertion delay is cut in half by simply adjusting the phase of the input signal. No spare delays are needed with this arrangement.
Figure 11.5 Input phase selection to improve the efficiency of the delay line.
Pavg = n · CL · VDD² · fCLK    (11.7)

where n is the number of delay stages when the DLL is locked, CL is the capacitance associated with each delay element, VDD is the power supply voltage, and fCLK is the operating frequency. Supply voltage is the biggest knob with regard to power reduction. This reduction may cost some speed and, due to smaller voltage swings, increase sensitivity to supply noise (Table 11.2). Reducing the clock frequency with an internal clock divider is another way to save power. If the clock frequency and supply voltage are
fixed, the only design choices remaining are n and CL. Since Pavg is proportional to the number of active delay stages (n), power can be reduced by a factor of φ with appropriate input phase selection. To further save power when the DLL is not in use (during active power down, for example), a disable signal can be applied to stop the clock that is fed into the delay line. The DLL, however, must reacquire lock after the clock is re-enabled.
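Equation (11.7) is easy to exercise numerically. The sketch below uses a hypothetical element capacitance, supply voltage, and clock frequency to show how input phase selection, by reducing n, scales the delay-line power.

# Average delay-line power from Equation (11.7): P = n * C_L * VDD^2 * f_CLK.
# The element capacitance and operating point below are hypothetical.

def delay_line_power_mw(n_stages, c_l_f, vdd, f_clk_hz):
    return n_stages * c_l_f * vdd**2 * f_clk_hz * 1e3

C_L, VDD, F_CLK = 60e-15, 1.5, 1e9        # hypothetical: 60 fF/element, 1.5 V, 1 GHz

for label, n in (("no phase selection (phi=1)", 40), ("180-degree selection (phi=2)", 20)):
    print(f"{label}: n={n}, P = {delay_line_power_mw(n, C_L, VDD, F_CLK):.2f} mW")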
Figure 11.6 A clock path with clock dividers and four-phase input signals.
Jp-p = ±tD/2,  JRMS = tD/√12    (11.8)
For a traditional delay line made up of logic gates, the minimum gate
delay (an inverter) is approximately 50 ps. To preserve signal polarity, two
inverters are needed, and peak-to-peak jitter will not be any better than 100
ps. This jitter is evident at the device outputs whenever the DLL makes an
adjustment.
To improve resolution for high-speed operations, a finer delay element
can be used, which in turn reduces jitter. Two examples of a fine delay ele
ment are shown in Figure 11.7: one with load adjustment and the other with
path selection. By adding or removing a small load capacitor, controlled by
a shift register, the delay can be varied dependent upon the selected load
capacitance and the driver. A pair of N and P MOSFETs with their source
and drain tied together, respectively, can serve as load capacitors. Delay res
olution is determined by the unit load capacitance. The delay will be fairly
constant over PVT variation, and the intrinsic delay is small (basically two
inverters for eight stages in Figure 11.7). For the path selection delay ele
ment, a delta delay will exist between the slightly different-sized inverter
strings. This topology offers better delay tracking over PVT variations. If
connected in series, the fine delay line introduces a large intrinsic delay,
which is not good. An interleaved architecture using both interpolation
methods is commonly implemented.
Figure 11.8 Phase mixing by adjusting slew rate and driving strength.
The tuning range of the fine delay should cover the delay variation of
one coarse delay stage in either direction: up or down. The step size (unit delay) of the fine delay line (tFD) is a fraction of the coarse unit delay (tD). A separate loop is required to perform fine delay adjustment in addition to
the conventional DLL loop, referred to as a coarse loop. Dual-loop architec
ture provides an efficient way to achieve wide lock range (via a coarse loop)
and tight synchronization (via a fine loop). Smooth transition between the
coarse and fine adjustments is a challenge and may cause a large phase jump
or discontinuity of the DLL output.
One concern is that the resolution of a DLL is limited by the unit delay
stage, which needs to be factored into the PD design to prevent further
action if the phase error is within a certain boundary (sometimes referred to
as a window detector). Loop dynamics, including lock time, tracking time,
and stability are determined in large part by the phase detector design.
11.2.3.1 Lock time Lock time, measured in clock cycles, is another
important design aspect. Two situations must be considered:
1. After a DLL is reset and enabled, how long will it take to initially
lock the loop before a valid Read command can be issued?
2. After initial lock (if the lock is lost due to voltage and temperature
drift), how long will it take to regain the lock condition?
To answer these questions, we must determine the sampling frequency at
which a digital phase detector (DPD) will make DLL adjustments. For a
high-speed, low-voltage memory system, the intrinsic forward path delay
(tINTRI) could be several times the clock period, as shown in Table 11.1 and
Table 11.2. The loop must be allowed to settle down after a delay adjust
ment (the input signal propagates through the delay path and feeds back into
the DPD, sometimes referred to as loop settling time). To achieve this, the
sampling frequency fS must not exceed

fS = 1 / (k · tCK)    (11.9)

where the sampling factor k can be chosen according to

k = INT(tINTRI,slow / tCK,fast)even + k0    (11.10)

The subscript slow refers to a slow-corner condition, while tCK,fast refers to the highest operating frequency. The function INT(x)even rounds x to the closest even integer; for example, x = 0.6 gives 0, and x = 1.4 gives 2. The k0 can be either 1 or 2. The side effect of this sampling factor is an
increase in the lock time by a factor of k.
Making adjustments beyond the desired sampling frequency causes sta
bility problems for a feedback system. Because the sampling frequency is
determined by the highest operating frequency, it may be too pessimistic for
low-speed operations. Dynamically adapting the sampling frequency based
on clock frequency and intrinsic delay is possible by measuring the loop
delay under varying PVT conditions. As part of the DPD, a digital filter like
an up/down counter or shift register can be used to control the sampling fre
quency.
With the help of the previous discussion on lock range and input phase
selection, the worst-case lock time can be determined by
Sec. 11.2 All-Digital Clock Synchronization Design 269
tLOCK,worst = (N · k / φ) · tCK    (11.11)

where N is the number of total effective delay stages and φ is the phase
selection factor. If N = 50, k = 5, and φ = 1, it will take 250 clock cycles to
initially lock the loop, which may be unacceptable (DDR and DDRII
SDRAM usually specify 200 clock cycles for initial lock time). With input
phase selection, the lock time can be reduced to 125 cycles (with 180° phase
shift) or 65 cycles (with 90° phase shift, φ = 4).
Another way to reduce lock time is to perform segment selection
(grouping several delay stages into a segment) prior to locating the entry
delay stage. After the segment is located, the final entry point can be found
within that segment. The number of delay stages grouped into a segment (S)
can be determined by
S = INT(tCK,fast / (2 · tD,slow))    (11.12)
For example, assume that the fast clock period is 2 ns and the delay per
stage at the slow corner is 200 ps, then S is equal to 5. The lock time with seg
ment selection is given by (worst-case):
tLOCK,worst = (N/S + S) · (k / φ) · tCK    (11.13)
For the same parameters used in the previous example, the new lock time
calculates to 75 clock cycles, which is four times faster than the lock time
without the segment and phase selections. For a high-speed system, the ratio
of tcK to S is small, somewhat reducing the benefits of segment selection.
Initial lock time can also be reduced by performing a binary search with a
successive approximation register, or SAR. Local feedback (bypassing the
clock tree and I/O model delays) can speed up the initial adjustment by
making continuous changes to a dummy delay line. It's also possible to
make loop adjustments proportional to the magnitude of the phase error
without impacting loop stability. This would require an adaptive PD capable
of selecting big (several stages) or small (one-stage) adjustments based
upon phase error.
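A few lines of Python reproduce the lock-time arithmetic above, using the reconstructed Equations (11.11) and (11.13) and the N = 50, k = 5 example from the text (the helper names are ours):

def lock_time_cycles(n_stages, k, phi=1):
    """Worst-case initial lock time in clock cycles: N * k / phi (Equation (11.11))."""
    return n_stages * k / phi

def lock_time_with_segments(n_stages, seg_size, k, phi=1):
    """Worst case with segment selection: (N/S + S) * k / phi (Equation (11.13))."""
    return (n_stages / seg_size + seg_size) * k / phi

print(lock_time_cycles(50, 5))              # 250 cycles with no phase or segment selection
print(lock_time_cycles(50, 5, phi=2))       # 125 cycles with 180-degree input phase selection
print(lock_time_with_segments(50, 5, 5))    # 75 cycles with S = 5 segment selection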
After initial lock, a loop can lose lock due to excessive frequency, volt
age, and temperature (FVT) variations. Generally, a DLL can dynamically
track out these variations and adjust delay to relock the loop. Depending on
the sampling frequency and the magnitude of the phase error, the tracking
time may vary. After initial lock and during memory Read operations, it's
undesirable to make a big delay change that appears as timing jitter at the
outputs. Rather, small adjustment steps, which improve jitter performance,
tend to extend tracking time. While sync circuits are designed to track long
term FVT variation, proper filtering is important to ensure loop stability.
11.2.3.2 Circuit implementations For a coarse delay line, the best
resolution is a two-gate delay. The delay line is primarily used to ensure
adequate tuning range and proper lock. A PD block diagram used for coarse
adjustment is shown in Figure 11.11.
Figure 11.11 Coarse phase detection: CLKDLL is compared with CLKIn using a sample clock (k = 2); the outcomes are shift-right (less delay), shift-left (more delay), or no change (adjusted).
The FPD is active only when the coarse loop is locked. The coarse PD
can take over the control to correct large phase error while the fine loop is
active. Extra time is needed to lock a fine loop. Similar searching algorithms
to those previously described can be used if a large number of fine delay
stages are used.
The regular delay line gets adjusted only during initialization and won't
change after initial lock. The coarse delays in the dual-delay lines provide
the proper VT tuning range at the cost of extra intrinsic delay, power, and
area consumption.
Figure 11.15 Dual-path fine and coarse adjustments with smooth transitions.
edge information for the internal clock must be recovered for full-speed I/O
operation. Multiple phases from the phase generator can represent the exter
nal full-speed clock and are used to time the output data path accordingly.
An analog DLL with a voltage-controlled delay line (VCDL) is generally
suited for such applications. Combining both digital and analog implemen
tations, a mixed-mode DLL is investigated in this section with a focus on
analog DLL design.
As the timing window gets smaller and smaller for high-speed interfaces,
the focus shifts somewhat from just output timing to both input and output
timing. The entire data path, from command address capturing to data out
put at specific CAS latency, must be taken into account for floor planning
and timing analysis. Design changes based on interactions between the data
path and timing circuits are possible. These may include the need for more
than one DLL, depending on the timing budget, flexibility, power, and area
constraints.
Fewer delay stages make the ADL smaller than a DDL, but with compara
ble tuning range for clock synchronization. Phase interpolation between two
adjacent taps is generally required to achieve full-range (360°) coverage and
precise de-skewing. For high-speed DRAM, analog implementations
become the only choice in some applications.
11.3.1.1 Unit delay Choosing a proper unit delay is the first step in
building a delay line. An analog control signal (e.g., a bias voltage) is gener
ally applied to vary a current (Itail) used to charge or discharge internal node
capacitance (CN). A unit delay (tD) can be defined as the time it takes an
internal node to be charged or discharged by the bias current Itail.
tD = (CN / Itail) · ΔV    (11.14)
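A quick numerical illustration of Equation (11.14); the capacitance, swing, and bias current below are arbitrary example values, not design targets:

c_n = 50e-15       # internal node capacitance CN, 50 fF (assumed)
delta_v = 0.6      # voltage excursion on the node, 0.6 V (assumed)
i_tail = 100e-6    # bias current Itail, 100 uA (assumed)

t_d = c_n * delta_v / i_tail                   # Equation (11.14)
print(f"unit delay tD = {t_d * 1e12:.0f} ps")  # 300 ps for these numbers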
Figure 11.18 Single-ended (left) and fully differential (right) analog unit delays.
It is usually difficult to control the bias voltage over process, voltage, and
temperature variation. Self-bias techniques [3], shown in Figure 11.19, can
be applied to the differential delay buffer.
The control voltage (Vctrl) is fed from the phase detector and loop filter. The
bias voltage Vbp for the loads can be generated from Vctrl with the aid of an
op-amp. A replica delay is used to generate Vbn for the current sinks. The
same current is mirrored inside the delay buffer to produce equal rise and
fall times. For a self-bias fully differential buffer, the unit delay can be
given as
tD = CN / (k · (VDD − Vctrl − |VTP|))    (11.15)
The unit delay (measured from the crossing points of the differential
signals) versus control voltage at different process and temperature corners
is plotted in Figure 11.20 for a clock speed of 625 MHz and a supply voltage
of 1.2 V. A slow process corner (SS) and lower temperature make
the delay tuning range smaller due to limited voltage margin (higher thresh
old voltage). The signal may get severely attenuated due to sluggish transi
tions and limited gain. A general rule of thumb is to keep the unit delay less
than 15% of the cycle time (tCK) at worst PVT corners. A fast process corner
(FF) and high temperature (110°C) may cause problems for low-speed oper
ation because the delay is entering a nonlinear operating region.
Figure 11.20 Unit delay versus control voltage (Vctrl) at different corners.
Figure 11.21 Supply noise sensitivity for both a) analog and b) digital delays.
11.3.1.2 Gain and lock range The gain for the self-bias differential
delay line is given as [3]
G = n · CN / Itail = n · CN / (k · (VDD − Vctrl − |VTP|)²)    (11.16)
where n is the number of delay stages and Itail is the bias current. The output
from each delay stage can swing from Vctrl to VDD. The system may not be
stable when Vctrl approaches VDD − VTP (minimum swing at the out
put). Furthermore, Vctrl must not exceed a specific threshold determined by
the process and operating conditions in order to keep the differential input
transistors in the saturation region.
The lock range of the voltage-controlled delay line (VCDL) is n · tD,
where tD is the unit delay defined in Equation (11.15). Although the number
of stages (n) for an analog delay line is fixed, there is a minimum requirement
to keep the delay per stage (tD) within the linear region. If the transition time
is comparable to the bit time (half cycle), the output may not reach full
swing, and the signal could get attenuated so badly that no valid output
would be generated. The rule of thumb is to pick n greater than eight.
Locking to a different clock phase can be used to eliminate this problem,
just like the input phase selection used in the digital delay line. Instead of lock
ing to a 360° phase (a full tCK of the VCDL), the differential feedback
signals can be swapped to achieve 180° (0.5 tCK) locking.
With the same clock frequency, the effective delay per stage is 50%
less, and the transition time is always less than the bit time, which provides
better noise margin. Figure 11.22 shows two lock regions under different
configurations: an eight-stage VCDL locking to 360° and a six-stage VCDL
locking to a 180° phase. It is obvious that the 180° lock has better noise
margin and more symmetric transitions (tPLH = tPHL). However, duty-cycle
distortion from the input must be addressed to achieve precise phase align
ment in this case.
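The per-stage delay requirement for the two locking modes can be checked with a short Python sketch (625 MHz is the simulation clock quoted earlier; the stage counts follow the text and Figure 11.22):

t_ck = 1.0 / 625e6                             # 1.6 ns clock period at 625 MHz

def per_stage_delay(t_ck, n_stages, lock_fraction=1.0):
    """Delay each stage must provide when the VCDL spans lock_fraction of tCK."""
    return lock_fraction * t_ck / n_stages

print(f"{per_stage_delay(t_ck, 8) * 1e12:.0f} ps")       # 360-degree lock, 8 stages: 200 ps
print(f"{per_stage_delay(t_ck, 8, 0.5) * 1e12:.0f} ps")  # 180-degree lock, same 8 stages: 100 ps (50% less)
print(f"{per_stage_delay(t_ck, 6, 0.5) * 1e12:.0f} ps")  # 180-degree lock, 6 stages (Figure 11.22): ~133 ps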
There is also an upper limit for the number of VCDL stages. Covering the lock range under dif
ferent PVT corners generally requires a longer delay line at the cost of larger
area and more power. A comparison between eight- and sixteen-stage delay
lines is plotted in Figure 11.23 with regard to slow simulation corners and
the linear operating region. The lock range can vary from 425–980 MHz (eight
stages) to 220–500 MHz (16 stages). Fewer delay stages are desirable for
high-speed operation. To keep the gain constant, an adaptive scheme can be
applied based on the frequency and PVT corners. The minimum VCDL
delay can then be expressed as

tVCDL,min = n · tD,min    (11.17)
Figure 11.22 Lock range and delay versus control voltage for two different VCDLs.
11.3.1.3 Noise and jitter The VCDL with current-mode logic (CML)
has better supply noise rejection than the digital delay line with CMOS
logic. With 150 mV lower supply voltage (-10%), the CML delay line (five
stages with loads) has a delay variation of only +2.1%, while the CMOS inverter
chain (four inverters with loads) has +12.9% variation. Under the same con
ditions, the level translation from CML (differential) to CMOS (single
ended, CML2CMOS) has 10% delay variation. Depending on the output
swing, common-mode range, and PVT corners, the CML2CMOS may gen
erate duty-cycle distortion at the output. This is not an issue when only the
rising edges of the multiple phases from the VCDL are used for clocking.
For an all-digital implementation, supply noise contributes 76% of total tim
ing variation (Table 11.2). A timing path with extensive CML delay ele
ments is a good alternative for high-speed DRAM design.
To improve the jitter performance and noise immunity, several design
and layout techniques for the VCDL are addressed as follows.
• Be careful when selecting the delay element. A device with longer
channel length (L) has better matching and noise performance at the
cost of speed. Current loads and sinks usually use longer L devices.
Devices with lower threshold voltage offer reduced phase noise and
improved voltage margin.
• Remember that large tail currents improve speed and reduce
noise and gain (Equation (11.16)) at the cost of power. Trade-offs must
be made between speed, lock range, gain, power, tracking bandwidth,
and the number of delay stages.
• Match the layout. Differential inputs use an interdigitated or common
centroid layout [8]. Minimize parasitic capacitance and interconnects.
Group all identical circuits into a cell for better matching and LVS (lay
out versus schematic). Keep every delay cell oriented the same, and use
the same layer of interconnects. Identical circuits can be used for load
and voltage-level matching at the cost of power.
Isolate and shield all analog signals, e.g., Vctrl. Shield all high
speed matched clocks. Put guard rings around critical analog blocks
(VCDL, bias generator, etc.). Use quiet and isolated power buses for
critical analog circuits. Avoid digital routes across sensitive analog
blocks.
Placement is another key consideration. Keep the clock output buffers (digital, high
speed) far away from the CML delay line (VCDL) and level translators
(CML2CMOS). Identify and partition the circuits into high-speed
clocks, digital control logic, and sensitive analog circuits. Dedicated
power pins for analog circuits are imperative for better performance.
A charge-pump phase detector (CPPD) is shown in Figure 11.24. Based on the outputs of the phase detector, a current (Ipump)
can be pumped into or out of the loop filter (LF), which is a simple capaci
tor for the DLL. When the loop is locked, there is no net change in charge
within the LF, and Vctrl remains relatively constant. The CPPD shown in
Figure 11.24 is prone to charge-sharing problems. A fully differential
charge-pump with a unity-gain source follower [4] is useful and shown in
Figure 11.25. A current-steering technique [6] can also be used to reduce
ripple at the LF output and improve jitter performance. To avoid a “dead
zone” [3-4] when the phase is close to lock, keep the up and down pulses
from the PD active with the same duration across PVT varia
tions. The best resolution available from a CPPD depends on the control
voltage ripple, loop capacitance, matching, etc., and is usually 10–20 ps for
all practical purposes.
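To see where a 10–20 ps resolution limit comes from, it helps to estimate the control-voltage ripple produced by a single correction; the pump current, loop capacitance, and pulse width below are illustrative assumptions, and multiplying the ripple by the VCDL gain of Equation (11.16) gives the corresponding delay step:

i_pump = 20e-6       # charge-pump current, 20 uA (assumed)
c_lf = 20e-12        # loop-filter capacitance CLF, 20 pF (assumed)
t_pulse = 100e-12    # minimum up/down pulse width near lock, 100 ps (assumed)

delta_v = i_pump * t_pulse / c_lf              # dV = Ipump * dt / CLF
print(f"control-voltage ripple per update ~ {delta_v * 1e6:.0f} uV")   # 100 uV here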
Phase comparisons can be made on both rising and falling edges to
reduce lock time, provided there is little duty-cycle distortion. The lock time for
the CPPD is usually longer compared to the digital DLL, depending on
Ipump and CLF. A large loop capacitor usually improves loop stability (a
first-order system with a pole provided by the LF) but slows down the lock
ing procedure. Based on initial conditions and clock frequency, a typical
lock time for an analog DLL is approximately 50–100 cycles. If no clock is
running during power-saving modes, there is a “run-away” problem for the
bias control signal (Vctrl), which is generally floated. A digital-to-analog
converter (DAC) can be used to store the bias information and reduce the
relock time after exiting power-saving modes. The extra complexity, power,
and layout area must be traded against performance gains.
A potential “false lock” problem can exist with the VCDL, in which the total
VCDL delay equals a multiple of clock cycles. False-lock detection is
needed for the CPPD to provide correct phase generation. Initializing the
VCDL with minimum delay is another way to prevent false lock at the cost
of longer lock times under certain conditions. A “fast-down” mode is useful
to speed up the initial phase acquisition by ignoring the CPPD until a certain
phase relationship is reached. Loop stability must be ensured with regard to
lock time. When a digital DLL (DDLL, for synchronization) and an analog
phase generator (APG, for edge recovery) are both used in a mixed-mode
clock synchronization system, a lock signal can be generated to start the
DDLL synchronization when the phase difference of the APG is within a
certain timing window. The overall lock time for the mixed-mode DLL is
longer compared to each individual DLL.
The loop response time for the APG is approximately one cycle because
the feedback is taken from the output of the last delay stage. Except for the
CML delay line, the single-ended to differential converter, CML2CMOS
logic and clock buffers remain subject to supply noise. Voltage and temper
ature variations are tracked quickly, and tracking time is usually within two
cycles.
For the mixed-mode DLL in Figure 11.16, any clock manipulation in
the DDLL, like input phase selection and MCD locking, impacts the input
signal to the APG, and, therefore, changes the loop dynamics of the APG. To
ensure smooth transitions and stable operations, the CPPD is blocked from
any control voltage changes, and transitions are controlled by the digital
loop. The “block” function is also good for test and debug, when an open
loop operation is required to manually adjust the control signal and charac
terize the VCDL.
The slave VCDL in the analog DLL has the same structure as the main VCDL in the APG.
The phase control logic can compare the feedback signal with a reference
signal (CLKIn) and generate control signals fbr phase selection and interpo
lation. Compared to an all-digital implementation, the analog dual-loop
DLL has less supply sensitivity due to the CML delay elements. If the glo
bal clock distribution can also be implemented with CML buffers, the sup
ply noise sensitivity can be further reduced. Intrinsic delay can also be
decreased with CML buffers, which help to reduce loop response time and
latency. A similar scheme has also been chosen for high-speed data captur
ing and per-pin de-skewing for gigahertz links (receiver and transmitter
designs) [10].
(Block diagram: phase selection and interpolation, slave VCDL, charge-pump PD, and local clock tree to the output path.)
Internal mismatch between the clock and strobe distribution paths degrades
the timing even further. Known relationships between a decoded Write
command and captured data and a decoded Read command and output data
must be established by clock synchronization circuits. Read-Write latency
control circuits can be timed by the respective sync circuits. Clock domain
crossing, i.e., command clock to data strobe or command clock to output
(Read) clock, must be circumvented for various PVTF corners.
Figure 11.28 Dual-loop DLL for command, address (C/A) and data capture.
An analog DLL can be used to hide clock distribution delay and gener
ate quadrature clocks for data capturing, if needed. In Figure 11.28, internal
clocks are center-aligned with received data by phase selection and interpo
lation. C/A and data are delivered directly to capture latches after they are
received. A training sequence may be required to properly align the data
with internally generated clocks. Loop dynamics like tracking, filtering, set
tling time, etc., must be carefully analyzed. For a feedback system, instanta
neous jitter or skew cannot be corrected by a DLL; therefore, jitter caused
by the supply noise has the biggest short-term timing impact. Voltage regu
lation may be helpful for a sensitive timing path. Differential signaling is a
plus for better supply noise rejection. As the data rate continues to increase,
very precise de-skewing for both input and output may be inevitable. A
mixed-mode DLL is a good basis for high-speed clock synchronization
design.
Simulation data for a DDLL and an MDLL are compared in Table 11.5
with a typical process, 1.5V supply voltage, and 85℃ temperature. The
external clock is at 800 MHz. A full-speed clock runs in the digital DLL
without an APG, and a half-speed clock with two APGs was used in the
mixed-mode DLL. Even with two analog phase generators (APG) in the
MDLL clock trunk, the power is less than that of the DDLL with full-speed
clocks. For the digital DLL itself, a 40% power saving is achieved. However,
lock time for the MDLL increases significantly due to analog phase locking.
Table 11.5 Lock time and power consumption for DDLL and MDLL at 1.5 V TT, 85℃.
If the part is down-binned to a lower speed grade, full-speed clocking can
resume and one of the analog phase generators (APG) can be used as an
analog duty-cycle corrector (DCC). Only two phases are generated by the
APG: clock and the 180° phase-shifted clock. Compared to a digital DCC
(another DDLL, 70% area), an APG is small (30%) and has less intrinsic
delay. Better supply noise rejection and quick tracking make an APG a good
DCC for clock frequencies anywhere between 400 and 800 MHz.
Finally, a 90° phase-shifted data strobe can be generated by the MDLL.
The strobe can be center aligned with the output data rather than edge
aligned, which makes it easy to capture data at the controller site. The APG
should provide a 45° phase shift for a divide-by-two clock or a 90° phase
shift for a full-speed clock. If an MDLL is also used for input timing, an
internal 90° phase-shifted clock can be generated for data capture.
REFERENCES
Array access delay characteristics of the DRAM always limit device perfor
mance.
Technology   Burst Length   Min Column Cycle Time (tCCD)   Peak Bandwidth
SDRAM        1              6 ns                           167 Mbps/pin
DDR          2              5 ns                           400 Mbps/pin
DDR2         4              5 ns                           800 Mbps/pin
DDR3         8              5 ns                           1.6 Gbps/pin
tD ∝ (CL · VDD) / (VDD − VTH)^α    (12.1)

where CL is the load capacitance, VDD is the supply voltage, VTH is the threshold
voltage, and α is the velocity saturation index, which is close to one for a
short-channel process.
In some cases, the problem of high VTHs has been overcome by devel
oping process technology with multiple gate oxide thicknesses and doping
profiles. By providing an option for thinner gate oxide devices in the
DRAM process, performance improvements have been realized for I/O cir
cuits and peripheral logic interface circuits. Even with process improve
ments, more aggressive logic styles, such as dynamic logic, may be
necessary to overcome the process delay limitations caused by relatively
high-VTH devices. Thinning gate oxides and more aggressive logic styles
have helped open the door to high-performance DRAM devices. But
device scaling has not led to the performance gains seen with process scal
ing of logic processes for modern microprocessors [6].
Because of process scaling, changes in interconnect characteristics are a
second major difficulty faced by the DRAM logic designer. Figure 12.2
shows how scaling has resulted in a change in the aspect ratio of the inter
connect metals on a single layer. Qualitatively, we must consider that the
capacitance is proportional to the dielectric constant, k, and wire height, H,
and inversely proportional to the spacing, S. Using low-k dielectrics to
reduce sidewall capacitance is some
times an option for more expensive logic processes, but this has not been a
cost-effective option for DRAM at this time. Please refer to [7] and [8] for a
quantitative treatment of this topic. Wire scaling results in increased capaci
tive coupling, which in turn increases the potential for noise. Under worst-
case switching conditions, when adjacent wires switch in opposite directions, the
effective coupling capacitance is even larger. The qualitative relationship is

C ∝ (k · H) / S    (12.2)
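A minimal sketch of the proportionality in Equation (12.2): halving the spacing while keeping the dielectric and wire height fixed doubles the sidewall coupling (the numbers are normalized, arbitrary units):

def coupling_cap(k_dielectric, height, spacing):
    """Relative sidewall coupling capacitance, C proportional to k*H/S (Equation (12.2))."""
    return k_dielectric * height / spacing

old = coupling_cap(k_dielectric=3.9, height=1.0, spacing=1.0)
new = coupling_cap(k_dielectric=3.9, height=1.0, spacing=0.5)   # spacing halved by scaling
print(f"coupling capacitance increases by {new / old:.1f}x")    # 2.0x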
There are other more indirect effects of scaling that impact logic circuit
design. When the process scales, voltage is often scaled proportionally. But the
VTHs of the DRAM process have not changed proportionally with scaling,
which has therefore led to a higher ratio of threshold voltage to supply volt
age. Referring to Equation (12.1), when VDD is reduced there is an increase
in delay, tD. Voltage scaling can affect noise immunity for dynamic circuits,
but this vulnerability is often offset by the relatively high VTHs of the
DRAM process. We see that characteristics of a DRAM process are a hin
drance to high-performance logic. But we must also consider how DRAM
performance specifications are dictated by the access delays of the DRAM
array. Next, we will consider specifications for array operation and how
these specifications affect the circuit topology decisions for DRAM logic
design.
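The effect of a fixed VTH under voltage scaling can be made concrete with Equation (12.1); the supply and threshold values below are only representative of the trend, not of any particular DRAM process:

def gate_delay(c_load, v_dd, v_th, alpha=1.0):
    """Relative gate delay, tD proportional to CL*VDD / (VDD - VTH)**alpha (Equation (12.1))."""
    return c_load * v_dd / (v_dd - v_th) ** alpha

v_th = 0.7                                   # threshold that does not scale (assumed)
d_old = gate_delay(1.0, 2.5, v_th)           # older, higher supply voltage
d_new = gate_delay(1.0, 1.8, v_th)           # scaled supply, same VTH
print(f"relative delay grows by {d_new / d_old:.2f}x when VDD drops from 2.5 V to 1.8 V")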
Figure 12.3 a) EDO Page mode and b) SDRAM Read cycles.
The maximum access time for the device is 12.9 ns. This is expressed as a maxi
mum because the first valid data out of the device in the SDRAM can still
come any time before the maximum specified access time. The correspond
ing timing parameter of 13 ns for the EDO device reflects approxi
mately the same access delay as is shown for the SDRAM device.
When we look at a column Read operation for a high-perfbrmance
DRAM, such as the data sheet for a GDDR3 device shown in Figure 12.4,
we see that column access latency has remained relatively constant over the
generational changes of the DRAM protocol. The clock period shown in
Figure 12.4 is 1.25ns; and, with a CAS latency of 10, the maximum column
access time is 12.5 ns. The lack of improved access time is partially because
the transistor speed benefits of process scaling are often tempered by volt
age scaling and increased interconnect coupling. The sum of the effects of
interconnect scaling with simultaneous increases in area because of increas
ing array density has caused most DRAM timing parameters to remain con
stant.
Another observation from Figure 12.4 is that the short burst length
(BL), which is translated into column cycle time (tCCD) by Equation (12.3)
for a double-data rate (DDR) device,

tCCD = (BL / 2) · tCK    (12.3)
allows several Read commands to accumulate before the first data burst is
output from the device. In Figure 12.4, with a CAS latency of 10 and a Read
command occurring every second clock cycle, the result is 5 possible Read
commands pipelined in the device before the first Read command com
pletes. This is in contrast to early SDRAM designs where each data access
stood alone because the C/A clock frequency was chosen to match the char
acteristic delays of the memory device. Later in this chapter we discuss how
large delay variations for array column accesses due to PVT variation can
influence the logic design, especially in the Read data path.
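Equation (12.3) and the pipelining observation from Figure 12.4 can be checked in a few lines of Python (BL = 4, CL = 10, and tCK = 1.25 ns follow the GDDR3 example; issuing a Read every tCCD is an assumption of the sketch):

def t_ccd_ns(burst_length, t_ck_ns):
    """Column cycle time per Equation (12.3): tCCD = (BL/2) * tCK."""
    return (burst_length / 2) * t_ck_ns

bl, t_ck, cl = 4, 1.25, 10
print(f"tCCD = {t_ccd_ns(bl, t_ck):.2f} ns")          # 2.50 ns, i.e., a Read every 2 clocks
reads_in_flight = cl // (bl // 2)                      # commands issued before the first data
print(f"Reads pipelined before the first data burst: {reads_in_flight}")   # 5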
When we consider PVT variation, the pulse width out of the circuit in
Figure 12.7 can vary by 100% over the full operating PVT range. How does
the delay-chain logic style fit in with high-performance DRAMs? The
answer is that this logic style is still widely used in the peripheral array
logic region. As we stated at the beginning of this chapter, when synchro
nous DRAM came into existence, the synchronous interface became a
wrapper around the existing DRAM logic. Thus, we have defined the
peripheral logic interface as the high-performance, synchronous section of
logic that interfaces to the original DRAM peripheral logic. Many DRAM
designs still use some form of delay-chain logic in the peripheral interface
region of the DRAM die. This method can prove effective when directly
interfacing to the array, but as performance demands grow and process con
tinues to scale, this style of logic has limitations that must be considered.
Some risks when using delay-chain logic in the peripheral logic inter
face region include questionable delay tracking reliability when operating
over a wide range of clock frequencies as well as increased process variabil
ity as a result of scaling. The latter results in reliability issues related to sta
tistical delay variation relative to anticipated delay values. Another
important consideration when using delay-chain logic is the fact that the
heuristic nature of the design approach for this style of logic often leads to
disjointed circuitry with a lack of procedural clarity for the design process
flow. What often happens is that delay-chain logic designs are passed
between different designers and the logic is patched to the point that the
original functionality is difficult to discern. When delay-chain logic is used
in the peripheral logic interface area, the circuits must be “tuned” to operate
for a given process, voltage, and frequency. Being forced to tune these cir
cuits limits the reusability of this style of logic. This is not to say that this
style of logic is without merit when applied judiciously in the peripheral
array logic region. The greatest strength of using this logic in the array
peripheral region is that the delay chains exhibit some correlation to PVT
variation between the array control signals and the delay characteristics of
the array itself. Another advantage to this style of logic is the ability to
adjust the delay stages using metal mask changes so that post-silicon timing
adjustments can be easily made.
Next, we will consider a dynamic logic methodology. This method
solves some of the problems of delay variation and is well suited to the
asynchronous timing specifications of DRAM array accesses. Two more
important advantages for applying dynamic logic in the peripheral logic
interface is that first, this method provides a rigorous design procedure for
implementing logic functions and second, the logic implementation is only
limited by gate delay and is robust when properly designed.
ing the output of the dynamic gate and driving a LOW-to-HIGH transition
to the input of the next domino gate. Monotonicity must be enforced
between levels of logic fbr reliable timing and robust operation.
One problem we face in the DRAM interface is that signals coming
from the input circuit path are not monotonic, as we have defined here. Sig
nals such as CAS* and RAS* transition at the command/address clock (C/A
clock) frequency and can transition both HIGH or LOW between each eval
uation phase of the clock. If we are to use domino-style logic in the periph
eral logic interface, we must first convert the control signals, such as CAS*
and RAS*, from a static format referenced to the C/A clock to monotonic
signals compatible with the domino logic style.
Figure 12.9 shows a circuit that converts signals from the static logic
domain to the dynamic logic domain [26], If we look at the input latch cir
cuit examples from Chapter 9, we see that the output from these circuits is
generally valid for an entire clock cycle. The circuit in Figure 12.9 can be
used to convert from the full-cycle static domain into the monotonic, half
cycle valid domain. This circuit operates by creating a pulse from the clock
edge through the inverter string driven by the clock signal. When CLK is
LOW, the outputs Q and Qb are precharged to a low value through the p-
channel devices. When CLK transitions HIGH, the data input is evaluated
during the pulse created by the overlap of the HIGH time of the clock signal
and the delayed transition of the clock through the inverter chain. An impor
tant aspect of the static-to-dynamic converter circuit is that the output is
considered dual-rail. What this means is that if the input data to the con
verter, D, is LOW, then the output signal Qb will transition HIGH for one
half of a clock cycle. Likewise, if the input D is HIGH during evaluation,
then the output Q will transition HIGH while the output Qb remains in the
Precharge state (LOW). Single-rail domino circuits cannot simply be cascaded to perform
arbitrary combinatorial logic because the non-inverting outputs of domino gates are not function
ally complete. Static hazards can occur where, for instance, a non-mono
tonic transition from HIGH to LOW could be skewed relative to the
Precharge clock such that the gate evaluates to the incorrect state. Once the
dynamic logic gate incorrectly evaluates, it cannot recover until the next
evaluation cycle. The dual-rail output from the circuit in Figure 12.9 gener
ates both true and complementary outputs that are both monotonically ris
ing. This allows us to exploit the skew-tolerance [1] of domino logic and
secure functional completeness for implementing the command decoder
logic.
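The dual-rail convention can be summarized with a small behavioral model (Python, purely illustrative; the real circuit is the transistor-level converter of Figure 12.9): both rails precharge LOW while CLK is LOW, and exactly one rail pulses HIGH during evaluation depending on the captured data.

def static_to_dynamic(d, clk_high):
    """Behavioral model of the dual-rail converter output for one clock phase.

    Returns (q, qb): both LOW during precharge, exactly one HIGH during evaluation,
    so each rail only ever makes a monotonic LOW-to-HIGH transition per cycle.
    """
    if not clk_high:                       # CLK LOW: precharge phase
        return 0, 0
    return (1, 0) if d else (0, 1)         # CLK HIGH: evaluate, dual-rail one-hot

for clk in (0, 1):
    for d in (0, 1):
        print(f"CLK={clk} D={d} -> (Q, Qb) = {static_to_dynamic(d, clk)}")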
The inset of Figure 12.9 serves as a point of reference for using the
static-to-dynamic circuit in the DRAM input path. Figure 12.9a shows the
connections of the single-ended input receiver circuit used to detect the
incoming signal, followed by Figure 12.9b, the input capture latch, and then
followed by Figure 12.9c, the static-to-dynamic converter circuit. This
example would be common for the command and control input signals such
as RAS* and CAS*.
At this point we should consider the question: Why use dynamic logic
in the command input path of a high-performance DRAM? Part of the
answer to that question is related to the process limitations discussed at the
beginning of this chapter. Remember that DRAM processes implement as
few metal layers as possible, making global clock distribution very difficult.
Not only is it difficult in a DRAM process to distribute a well-matched
clock, but using a global clock results in associated cycle delay dependen
cies on array access time, thereby limiting the usefulness of a globally dis
tributed clock. Recall that an important timing parameter for DRAM
performance is column access latency, which we have also referred to as
CAS latency (CL). Because column access latency is constant across the
range of operating frequencies for a device, we must not allow the clock
period to become part of the array access latency equation. Figure 12.10 is a
diagram showing how internally synchronous operation would detrimen
tally allow the clock period to become part of the access latency. By syn
chronizing the command and address logic, we could potentially force a
violation of the specified array access time. In Figure 12.10, the command
and address are captured by the input capture latches (Chapter 9). If the fol
lowing clock edge is used to align a partially decoded command (assuming
that the total command decode plus synchronization overhead exceeds a
clock cycle), the decode delay will be increased by the difference in clock
period delay between the two operating frequencies shown. In this example,
a device is specified with an operating frequency range of 800 MHz to
1.25 GHz. If we allow a clock cycle to occur between the command capture
clock edge and one stage of the partial command decode, a delay differ
ence of 900 ps will be added between the minimum frequency and maxi
mum frequency operating points. Because of this difference, if devices are
binned with this frequency dependency, many potentially high-performance
devices may not meet the latency specifications.
One obvious solution for maintaining constant CAS latency across the
operating frequency range is to use a combinatorial circuit fbr the decode
operation and delay the input clock to a second set of retiming latches. The
second set of retiming latches are required fbr realigning the decoder output
to account fbr timing skew and hazards through the input and decode path;
this configuration is illustrated in Figure 12.11. Without realignment of the
decoder output, signal skew can cause false decoder transitions to appear in
the command and timing logic paths and possibly lead to data corruption in
the DRAM array. One problem with this solution is that the clock must be
delayed based on the maximum skew and delay through the decode path. If
the delayed clock does not correlate with the delay through the input and
decode paths, then, potentially, a high-performance device may not meet
latency requirements at the highest operating frequency. For example, a
device might meet timing requirements through the decode path but,
because the clock is delayed based on worst-case timing expectations (with
timing margin built in, of course), then any performance gains in the decode
path may be masked by the delayed clock. This would be a greater concern
for those devices operating right at the margin, since the delayed clock may
prevent the device from binning at a higher performance point.
Skew between decoded signals is not the only problem with using com
binatorial decode circuits for command and timing control. For high-perfor
mance devices with extremely low clock periods, the decode path could
also have multi-cycle delay paths at higher clock frequencies but still satisfy
single-cycle timing at lower clock frequencies. This problem gets back to
the design and use of the same DRAM device over a wide range of clock
frequencies. In this situation, a false decode could occur if the incorrect
clock edge were used to retime the decoder output. And, finally, using the simplistic
method of a combinatorial decoder causes greater difficulty for final timing
verification of the design. This is true because the global or, in some cases,
multiple local clocks that are used for retiming the decoder output must be
comprehensively verified for all combinations of decoder outputs.
12.2.6 Testability
One disadvantage of using domino logic is the issue of testability.
Because domino logic is dynamic, there are possibilities that at extremely
low frequencies, such as the clock rates used during array testing early in
the manufacturing process, charge could leak from the dynamic nodes and
cause functional failures.
Now that we have examined some of the reasons for utilizing domino
logic for portions of the timing and control circuits in the peripheral logic
interface, we more specifically look at some of the issues related to com
mand and timing control of array accesses and why we might want to make
particular choices in logic implementations. As stated at the beginning of
this chapter, these solutions are only one method to accomplish the task of
meeting the industry standard specifications for a high-performance
DRAM. The specifications are the thread of commonality that ties together
the operation of DRAM between multiple vendors. The descriptions that
follow are not meant as a design guide, but rather an illustration of the prob
lems posed by the synchronous interface surrounding what has essentially
remained an unchanged asynchronous DRAM. In the next section, we
examine how domino-style logic can be implemented in the command
decode and address register logic. Following that, we look at the data side of
the DRAM and examine the Read and Write data paths and how the vari
able latency of array accesses are made to fit within a deterministic timing
framework established by the global clock.
The outputs of the address registers are routed to the peripheral array logic and are
decoded by the row and column decoders. The outputs of the address regis
ters are not synchronized to a global clock but are instead bundled with sig
nals that enable the peripheral array logic circuits. The control and timing
signals are bundled and carefully routed in conjunction with the address sig
nals. This enables correct timing when activating the row and column
decoder circuits in the peripheral array logic. In particular, row address reg
isters are often an extension of the peripheral array logic, with the style of
logic resembling the delay-chain style described earlier in this chapter. The
outputs of the row address register circuits are often referred to as bank con
trol logic because each bank is determined by the division of row address
decoders to sections of the DRAM array that can be independently
accessed. We will not focus on bank control logic as its implementation
depends on array architecture and is largely unchanged for high-perfor
mance DRAM, except for the case of reduced latency DRAM described in
Chapter 8.2. Also, see Chapters 2, 3, and 4 for details on how row and col
umn decode and control logic operate in the peripheral array logic area of
the DRAM die.
The remainder of this section describes our choices for command
decoder logic style and operation for a high-performance DRAM. We also
discuss some architectural decisions that must be made with regard to the
column address registers that are important to achieving low column access
latency and how increased I/O bandwidth requirements affect the way that
column addresses are handled.
One of the advantages of using domino logic for generating the decode
and timing signals is that if a device suffers logic timing issues, they can
usually be overcome, and correct operation is attained, by simply reducing
the clock frequency. In Figure 12.16 we see that reducing the clock fre
quency increases the pulse widths, which results in more hold time for the
final decoder logic stage. Reducing clock frequency has advantages for syn
chronous static logic as well, but recall that the advantages of domino logic
include reduced reliance on clock distribution networks, skew tolerance,
and a reduction in timing overhead.
The example just described is for a simple, single-phase clock implemen
tation. The pulse width of the signal Read is approximately one half of a
clock period wide. For extremely high frequency, we may be forced to
divide the clock frequency to get wider pulse widths on the decoder output
signals. A wider pulse width on the decoder outputs helps ensure better sig
nal integrity when distributing the signal over long distances. For this situa
tion, each gate would require two phases of dynamic logic with the last
static gate in the logic path implementing a merge function so that a single
decode signal would be generated with a full-cycle pulse width. Clock fre
quency division is only possible when the array access cycle times are mul
tiple command clock cycles. For instance, if the burst length (BL)
specification were a minimum of four for a DDR device, then, given that the
command clock and data strobe frequencies are equal, the column cycle
time, as determined by Equation (12.3), would be two clock cycles. Clock
frequency division under these conditions would be possible. A frequency
divided decode logic path and timing diagram is shown in Figure 12.17.
Depending on the circuits used in the input circuit path, we may be able
to take advantage of the receiver circuits for generating monotonic signals.
The single-ended signaling protocol is the most common signaling protocol
for monolithic DDR DRAM. This signaling protocol applies a reference
voltage on one input of the receiver differential pair, with the target signal
connected to the opposite input. An example of this topology is shown in
Figure 12.18a. The latch is separated from the receiver, which
allows matched delay interconnect distribution for CLK and DATA. For
high-performance DRAM, a receiver sense-amp latch (SA latch), shown in
Figure 12.18b, can be used as a combination input detection circuit and
latch. The SA latch circuit tightly couples the analog detection circuit with a
static latch. The detection circuit of the SA latch naturally outputs a dual
rail, monotonic signal to toggle the inputs of a set-reset latch. If these sig
nals are buffered out of the SA circuit, both the latch stage and single-ended
to double-ended conversion circuits (shown in Figure 12.9) are bypassed.
As a side note, the SA latch can also be used for differential signaling. This
signaling technique has disadvantages when applied to DRAM because of
increased pin count. For differential signaling, channel bandwidth must be
The column Read latency can be analyzed by dividing the delay into two components: array access latency and command
decode latency. Array access latency is the delay measured from the
enabling of the column address decoder to the arrival of valid Read data
from the helper flip-flops (Chapter 5). We designate this delay as tCAA. The
command decode latency is a measure of the sum of the input circuit delay,
the command decoder delay, and the address register logic. We designate
the total command decode delay as tCMD. Considering these definitions, in
order to minimize array access latency from the perspective of the control
logic, we must distribute the address to the peripheral array logic and enable
the column address decoder with the minimum possible delay.
By partially decoding the command, a monotonic signal that has a pulse
width of one half of the valid address period is generated. This allows the
use of transparent latches for the address register. The latches are transpar
ent when the CASQ* signal, qualified by the chip-select, is HIGH. The tim
ing of the latch signal aligns with the address transitions from the input
capture latches so that the address signals start to transition on the address
distribution path following the tDQ (data to output) delay of the latch circuit.
The timing relationship between the latch signal and the captured addresses
is advantageous since the latch signal is one half of an address cycle wide
while the address signals are one clock cycle wide; thus, the critical timing
parameter is tCP as shown in Figure 12.19.
Figure 12.19 Read column address register timing for transparent latches.
The column address for Writes must be handled differently from the
Read column address. For Write commands we must wait until the Write
data is available before the column address decoder can be enabled. Figure
12.20 is a timing diagram showing how the arrival of Write data is delayed
by the Write latency specification, tWL, and the burst length, BL, for data
arrival at the Write drivers in the peripheral array logic. Notice that in this
example, data is not available to be retired to the array before the end of the
first burst of four data bits. While the device waits to receive all of the Write
data, Write commands and addresses continue to be received. This differs
from the column address for Reads where the address is retired as quickly as
possible to achieve minimum array access latency. That is why there is a
mux in the address path shown in Figure 12.19 to illustrate that the Write
address originates from a different register than the Read address.
Figure 12.20 Write data timing relationship to Write command and address.
FD = ((BL/2) · tCK + tWL,Max + tTR) / ((BL/2) · tCK)
   = 1 + tWL,Max / ((BL/2) · tCK) + tTR / ((BL/2) · tCK)    (12.4)
Ignoring the transfer overhead, tTR, we find that for our example in Fig
ure 12.20, FD = 3. Of course, we cannot ignore the transfer overhead, which
is partially dependent on the address FIFO design chosen.
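Using the reconstruction of Equation (12.4) above, a quick Python check reproduces FD = 3 when the transfer overhead is ignored; the Write latency of four clocks is an assumption chosen to match the Figure 12.20 example, not a value stated in the text:

import math

def fifo_depth(write_latency_clks, burst_length, transfer_overhead_clks=0):
    """Write-address FIFO depth per the reconstructed Equation (12.4)."""
    col_cycle_clks = burst_length / 2                      # Equation (12.3), in clocks
    return math.ceil(1 + (write_latency_clks + transfer_overhead_clks) / col_cycle_clks)

print(fifo_depth(write_latency_clks=4, burst_length=4))    # 3, matching the example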
One possible Write address FIFO design is a simple asynchronous FIFO
[14], [15]. The advantage of using an asynchronous pipeline design is that
there is very little overhead in the circuit implementation and, if the design
is carefully chosen, the forward latency through the pipeline can be mini
mized. Another advantage to the asynchronous pipeline design is compati
bility with the signaling protocol for decoded addresses as examined earlier
in this section. If we generate a decoded Write command similar to the
decoded Read command in Figure 12.15, we can use the generated WRITE
signal to load the Write address FIFO directly from the column address reg
ister. Figure 12.21 is a top-level block diagram showing this configuration.
The timing parameter, tFL, is the forward latency through the asynchronous
address FIFO. This is the sum of delay for each latch controller stage in the
FIFO. This delay, summed with the address distribution delay, must be less
than the minimum Write data arrival time shown in Figure 12.20.
There are many details surrounding the design of the Write address
FIFO. If an asynchronous FIFO is used, the throughput must be considered,
especially when the FIFO is at capacity. Also, the asynchronous channel
communication between latch stages must be considered for compatibility
with the logic-signaling environment in which the FIFO is placed. A careful
examination of the references provided should edify the reader on these
issues with details of the FIFO design and implementation that are beyond
the scope of this text. Primarily, we must be aware of how the DRAM spec
ification affects the address distribution logic in the peripheral logic inter
face area of the DRAM device.
In the next section, we examine one method for enabling the data access
control logic once the command and address are fully received. This
method involves a two-wire interface with the control signals bundled with
the address and data signal busses. We illustrate how the asynchronous
nature of the DRAM array affects the Read and Write I/O data paths.
Figure 12.22 Two-wire bundled signal interface for data accesses.
Notice in Figure 12.22 how colCyc transitions are timed when there is a
Read access versus a Write access. The R/W^ signal is included to indicate
when a Read or Write occur. Remember that the DRAM array really should
be considered a Read-only device when a column address is enabled. Writes
are a result of overdriving the array sense amplifiers through Write driver
circuits. For Reads, the access delay is partially determined by the com
mand decode delay and address distribution logic. Writes are partially
delayed by the Write data latency and burst length. This is shown in Figure
12.22.
The dataValid signal is sent from the peripheral array logic to the
peripheral logic interface whenever Read data is driven from the HFF cir
cuits. This signal is very important to the operation of the Read data path in
the peripheral logic interface because tCAA will vary with changes in tem
perature and voltage. This signal must be considered fully asynchronous
without reference to any clock signal generated near the array interface. On
the other hand, the colCyc signal does have an established cycle time based
on the command clock. This becomes a more important distinction when we
examine the Read data path in Section 12.4.
At this point, we have examined how process can determine which
logic style we might choose for implementing DRAM access functions. We
have also illustrated an application of domino-style logic and how exploit
ing the gate-delayed potential of domino logic allows us to maintain con
stant data access latency across a wide range of clock frequencies. Further,
we have examined how the DRAM specification can affect how the column
address is treated for both Read and Write accesses. But there are many
details that have been left out of our discussion of DRAM peripheral logic
implementation. The delay-chain logic style implemented in the peripheral
array logic is more often an art form than an academically rigorous logic
style. There is also the wordline control logic that we have mostly ignored.
These sections of logic are important but the details do not contribute
greatly in our discussion of some of the unique problems encountered in
high-performance DRAM logic design. Therefore, the remainder of this
chapter is devoted to examining the data I/O circuit paths between the
peripheral array logic and the data input and output paths. This will include
an examination of the Write data serial-to-parallel conversion circuits
(demultiplexors) and the Read data FIFO circuits, which are necessary to
decouple variations in array access latency versus the synchronous Read
latency timing specifications of the DRAM.
Some have proposed DRAM with interfaces that are based on data packet
protocols [16], which would potentially eliminate the need to track Write
data latency. At the present time, a packet-based DRAM interface is
intended for niche applications. The packet-based interface is attractive
because of narrow input ports and, hence, low pin count. However, the per
pin data rates are necessarily higher in order to match the bandwidth avail
able from alternative devices with wider input ports. There has not been
widespread use of packet-based DRAM, so this protocol is not discussed in
this text. Following the discussions of Write latency timing, we examine, in
greater detail, one method for demultiplexing the data driven from the input
circuit path and how data is assembled and eventually driven to the periph
eral array logic circuits. The discussion about demultiplexing will also con
sider issues related to clock frequency division that, for a DRAM, is often
advantageous when dealing with data rates greater than 2 Gb/s.
Input receivers are often not self-biased but instead require a biased front end
(Chapter 9). For some generations of DDR DRAM, the data strobe is bi
directional and center terminated (the signal is terminated to the middle of
the input signal swing range). Also, if the input data receivers are biased to
an active state, the receivers constantly consume power. Write latency
allows time between the Write command and data so that the receivers can
be enabled only during valid Write data periods. On the other hand, with
supply terminated signaling (high-performance GDDR designs implement
supply terminated signaling), the receivers can be turned on with little
power consumption because if the bus is idle, the termination resistors pull
the input signal to the supply voltage. Also, with a supply-terminated data
strobe, there is less likelihood of false strobe toggles propagating into the
device due to noise occurring around the switching point of the receiver.
With a center-terminated bus, there is a very high probability of false tog
gles on the data strobe if the receiver is enabled during idle bus states,
which, depending on how the logic is implemented, could lead to data and
logic failures in the capture and demultiplexing logic. Overall, Write
latency is a timing mechanism that increases the timing margin between
decoding the Write command and the arrival of input Write data, and it also
allows time to enable and disable the data strobe and data input receivers for
more reliable, lower power operation.
Without getting distracted by the difficulties posed by specification-
related problems such as termination or protocol, the fundamental design
problems related to Write data latency for high-performance DDR DRAM
are associated with the separate clocking domains of the command clock
and Write data strobe (WDQS). There is some relief in that the WDQS is
specified to have a bounded skew (tDQSS) relative to the command clock.
Referring back to Figure 12.13, we see that connected to the command
decoder is a block labeled Write latency sequencer. This block essentially
represents a counter that counts command clock cycles in order to align tim
ing signals generated from the command clock domain to converge with the
arrival of data from the inputs of the device all timed to Write latency. The
problem with this configuration is that the data I/O are often located a long
physical distance from the command decoder, so that sending a signal
across the die to synchronize the completion of the demultiplexing function
requires very careful, customized design.
Figure 12.26 defines signals that we discuss fbr the remainder of this
section. Signals Write[0] and Write[2] are driven from the Write sequencer
logic. These signals are used for completing the sequence between tracking
the deterministic Write latency in the command clock domain and the
arrival of data indicated by valid transitions of the WDQS signal. We con
sider the Write latency sequencer to be part of the command decoder cir
cuitry. The timing diagram in Figure 12.26 shows the relationship between
Write[0,2] and the assemblage of Write data within the demultiplexor block
of Figure 12.24. Also, Figure 12.27 is a diagram showing the physical rela
tionship between the data input circuit paths and the command decode and
latency timing circuits.
Figure 12.26 Write[0,2] definition.
The Write[0,2] signals (these signals are labeled 0,2 because they are only
generated from phases 0 and 2 of CMDCLK[0:3]) are then retimed at the
control logic for the demultiplexor logic circuits. The reason Write[0,2] is
retimed at the demultiplexor circuits with the command domain clock is
that CMDCLK[0:3] is continuous, whereas WDQS is a burst strobe that
only toggles when data is driven into the device (Figure 12.25).
Figure 12.27 Relationship between data input path and command decode path.
Data is captured on both edges of WDQS, so the rising-edge and falling-edge capture latches each operate at one half of the input data rate. In our example data path, there is a fre
quency-divided, four-phase clock. This clock is distributed to all of the data
bits associated with the WDQS signal and operates at one quarter of the data
rate. Each phase of the clock is used to capture either the rising edge data
from the capture latches (Phases 0 and 2) or the falling edge data from the
capture latches (Phases 1 and 3). At this set of latches, the data rate is down
to one quarter of the original data rate at the DRAM inputs. The
demultiplexing circuit then assembles the bits into a parallel set of data that
is driven, along with the wDatLat signal, to the peripheral array logic,
where it is latched and held until the data is driven on the I/O lines of the
DRAM array logic.
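As a quick illustration of the capture-and-demultiplex sequence just described, the sketch below models, in Python, how bits captured on the rising and falling edges of WDQS are reassembled into four-bit parallel words at one quarter of the input data rate. The function name and list-based modeling are assumptions for illustration, not the circuit itself.

```python
# Behavioral sketch of the 1:4 Write-data demultiplexing described above (illustrative only).
def demux_write_burst(ddr_bits):
    """ddr_bits: data sampled on successive WDQS edges (rising, falling, rising, ...).
    Returns four-bit parallel words at one quarter of the input data rate."""
    rising = ddr_bits[0::2]     # rising-edge capture latches, steered by clock phases 0 and 2
    falling = ddr_bits[1::2]    # falling-edge capture latches, steered by clock phases 1 and 3
    words = []
    for i in range(0, len(rising), 2):
        # Reassemble burst order before the parallel transfer flagged by wDatLat.
        words.append((rising[i], falling[i], rising[i + 1], falling[i + 1]))
    return words

print(demux_write_burst([0, 1, 2, 3, 4, 5, 6, 7]))   # [(0, 1, 2, 3), (4, 5, 6, 7)]
```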
There are two primary choices when determining the optimum circuit
topology for demultiplexing across a data word. Figure 12.30 illustrates
these two choices. The first example is shown in Figure 12.30a. Here, the
control and timing logic is centered within the data byte with the timing and
control signals globally distributed to each nibble on either side of the con
trol logic block. If the distribution delay of the timing and data select signals
(DSel) is less than the column cycle time, then this is a viable configuration.
Notice that the latency timing signals (Write[0,2]), which are inputs to the
control logic, are timed to coincide with the input WDQS. The input data is
clocked into each of the demultiplexor circuits by the distributed WDQS
signal. For each burst, the control logic at the center of the data byte drives
control and timing signals to each of the demultiplexor circuits concurrently.
For this case, the demultiplexor circuits operate independently when
capturing data but the timing of the signals distributed from the centralized
control logic block retimes the data driven from each of the demultiplexor
circuits.
Margin must be built into the output timing of wDatLat from this circuit to compensate for
any timing skew between each bit within the byte. Part of the compensation
must account for tDQSS. Typically, tDQSS is +/-0.5 of a bit time (often
referred to as a unit interval). If tDQSS violates this specification, then the
latency timing circuits may align to the incorrect starting bit of a burst. The
tDQSS compensation is handled by the timing domain crossing that is indi
cated by the convergence of the Write[0,2] signals with the data strobe dis
tribution generated in the WDQS timing domain. By tracking the status of
the demultiplexing circuits relative to the arrival of the Write[0,2] signals,
tDQSS can be compensated within the demultiplexor tracking circuits. The
primary difference between the circuit topologies in Figure 12.30 is that the
centralized logic in Figure 12.30a generates all of the timing and control
signals for each of the demultiplexor circuits and also generates the bundled
latching signal wDatLat. In Figure 12.30b, on the other hand, the control and tim
ing logic is repeated within each of the demultiplexor circuits, with only the
latency control logic and the wDatLat logic contained in the centralized circuit
block. The circuit of Figure 12.30b can operate at a higher frequency
because there are no globally distributed control signals, but this circuit suf
fers from increased power and area relative to the circuit of Figure 12.30a.
Now that we have examined some of the details of Write latency track
ing and the global data capture topology of the demultiplexor circuits, we
turn our attention to the design of an individual demultiplexor circuit. To
illustrate the demultiplexor operation, a simple flip-flop based circuit is
shown in Figure 12.31. The circuit in Figure 12.31 is an example of only
part of what is required for a full demultiplexor solution. We see from the
timing diagram in the figure that the purpose of the demultiplexor circuit is
to successively expand the data valid window at each pipeline stage until
the data can be safely transferred to the array logic. Because the contents of
the flip-flops are over-written on successive Write cycles, this solution is
only viable if data is transferred to a retiming circuit within three bit times
(between WDS[3] and WDS[2] in this example) using the wDatLat signal.
Remember from the previous discussion that wDatLat results from the con
vergence of the Write latency timing signals sent from the command decode
clock domain, with the timing signals generated by the central control cir
cuit in the Write strobe timing domain.
One problem that is not obvious from Figure 12.31 is that with a
divided Write strobe, there are two possible relationships between the
receipt of a Write command and the output state of the Write strobe fre
quency divider. The timing diagram in Figure 12.32 illustrates this case.
This diagram is taken from a GDDR DRAM data sheet and shows the case
where two Write commands are issued with an extra command cycle
inserted beyond the minimum possible Write cycle time. This Write com-
mand sequence is often referred to as a nonconsecutive Write operation.
First, consider that the alignment between the WDS[0:3] divided strobes is
the same as that shown in Figure 12.31. Then, when the extra command
cycle shown in Figure 12.32 is inserted between the Write commands, an
extra toggle occurs on the WDQS signal, and invalid data is loaded into the
data capture latches.
Figure 12.33 shows how the divided strobe alignment changes between
the first set of Write data and the second set of Write data. Focusing on the
WDS[0:3] signals, for the first Write, WDS[0] aligns with the first rising
edge of WDQS for the data burst (burst of four, for this example). Because
of the extra cycle inserted between the Write commands, WDS[2] aligns
with the first rising edge of the WDQS for the second data burst. Because
the divided strobe signals operate at one half of the command clock
frequency, the command-to-WDQS alignment can assume two different
states. In order to correctly track Write data latency, and drive the correct
data out of the demultiplexor circuit, we need a method to detect the two
possible command-to-data alignment states for a divided WDQS. This is the
reason that the Write latency timing signals are paired as Write[0,2].
Following a system reset or power cycle, the WDQS divider and the
Write latency control logic in the command decoder are reset to a known
condition. Depending on the logic or clocking implementation at the Write
decoder and demultiplexor logic, one of the latency timing signals fires for
the first Write command and establishes an alignment relationship between
the command decoder logic and the output of the WDQS divider for the first
Write command. For all subsequent Writes, the correct Write[0,2] signal
will align accordingly with the WDS[0,2] signals to correctly time the out
put of the demultiplexor circuit. We should note that if toggles occur on
WDQS that do not correspond to a Write command, for instance, because of
overshoot or undershoot on WDQS due to signal integrity issues, then
untracked toggles may occur on WDS[0:3]. Under these conditions, the
demultiplexor circuits must be designed to track the correct Write[0,2] to
WDS[0,2] alignment, regardless of starting conditions. This is a specific
problem the designer must evaluate, yet solutions to this problem are not
fundamental to this discussion.
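The alignment-tracking requirement can be summarized with a small sketch. The function below, with hypothetical names, simply selects whichever latency pulse, Write[0] or Write[2], coincides with the active divided-strobe phase, WDS[0] or WDS[2]; the real circuits must do this continuously and recover from untracked strobe toggles, as noted above.

```python
# Illustrative sketch (hypothetical names) of resolving the two possible
# command-to-data alignments for a divide-by-two Write strobe.
def select_alignment(write0, write2, wds0, wds2):
    """Return the divided-strobe phase that starts the burst, or None if no Write."""
    if write0 and wds0:
        return "WDS[0]"    # burst begins on the WDS[0] phase
    if write2 and wds2:
        return "WDS[2]"    # burst begins on the WDS[2] phase (nonconsecutive-Write case)
    return None

# Two Writes separated by an odd number of command cycles land on opposite phases.
print(select_alignment(write0=True,  write2=False, wds0=True,  wds2=False))   # WDS[0]
print(select_alignment(write0=False, write2=True,  wds0=False, wds2=True))    # WDS[2]
```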
Figure 12.33 Divided Write strobe alignment for nonconsecutive Write commands.
The demultiplexor control circuit aligns the Write[0,2] latency timing signals for
Write commands with the WDS[0,2] signals to mux the correct data into the
demultiplexor circuit path. The block labeled "Pls" represents centralized
logic to control the timing of the outputs Q0-3 of the circuit to align the data
for transfer to the peripheral array logic. The wDatLat signal is used to
accommodate the data transfer to the peripheral array logic. By examining
the timing diagram in Figure 12.33, we see that the second set of Write data
aligns to the complementary WDS signals relative to the first Write com
mand because of the nonconsecutive Write condition.
Figure 12.35 Peripheral logic interface Read and Write data paths.
Figure 12.35 illustrates the relationship between the Read and Write
pipes and how the data rate of the I/O circuit paths translates to the array column
cycle time. Another feature shown in this figure is separated Read and Write
busses in the peripheral logic interface region. The separated Read/Write
busses allow concurrency for Reads and Writes, which may be necessary
for tight Read-to-Write or Write-to-Read data timing. Also consider that the
Read data array accesses are controlled by the timing of the command clock
domain, while Write data array accesses are timed to the WDQS timing
domain. If we consider tDQSS combined with variations in the address access
delay for Reads (tAA), there could be overlap on a shared Read data and
Write data bus in the peripheral logic interface region. A timing diagram
later in this discussion better illustrates this possibility. Another notable
feature of Figure 12.35 is the address bus with the bundled signal, colCyc*.
Whenever a Read command is received, the address and colCyc* signal are
immediately driven to the peripheral array logic to get the lowest possible
array access latency. For Write data, the colCyc* signal is fired based on the
wDatLat signal that comes at the end of the data demultiplexor operation
(Section 12.4). Recall that Read data accesses
are initiated as quickly as possible to get the lowest possible array access
latency, while Write data accesses require that the column address is held
until all of the data of a burst is fully captured in the demultiplexor circuits.
Let us look at an example of a Write-to-Read access to see how the
array column cycle time is maintained for data accesses. Figure 12.36 is a
timing diagram showing a series of column accesses. In this example, there
is a Write command followed by a Read command, which results in a mini
mum array column address cycle time at the peripheral array logic. We need
to be aware that a colCyc* initiated by the wDatLat signal must have timing
similar to a colCyc* initiated for a Read command originating from the
command decoder. By having similar timing relative to the launching event,
we can maintain constant column cycle time between Read and Write data
accesses. In the timing diagram, Read latency (CL) is four command clock
periods, while Write latency (WL) is CL - 1 = 3. At the end of the Write data
burst, the wDatLat signal transitions. As outlined in Section 12.4, the wDat-
Lat serves two functions. The first is to retime data in a set of latches in the
peripheral array logic. The second is to enable the firing of the colCyc* sig
nal to cycle data accesses from the array. Write data is then overdriven
through the sense amps and, ultimately, to the Mbit.
Again referring to Figure 12.36, a Read command is issued at the end of
Write data. For this example, the Read command occurs two command
clock periods, which is one column cycle period, from the end of the Write
data. We see that the tight timing of the Read command relative to the end
of Write data results in close to a minimum column cycle period transition
on the colCyc* signal between the Write and Read accesses (labeled tWTR in
the figure). The chip architect must decide between a shared data bus or two
unidirectional busses for interconnect between the peripheral logic interface
region and the peripheral array logic for Read and Write data. Variables
such as bus loading, interconnect routing, column cycle time, data through
put, Read and Write access concurrency, die area, and any other perfor
mance specification-related issues must be considered when making this
decision.
Figure 12.36 Timing diagram for internal data bus utilization.
Whether the array access completes with the minimum or maximum access delay, the data must arrive at the synchronization circuits
ahead of the expiration of the programmed CL value.
For the remainder of this section, we will examine the two main compo
nents of Read data timing managed within the peripheral logic interface.
First, we look at the purpose of the Read pipes as well as the performance
characteristics necessary for the Read pipe circuit block. After looking at
the Read pipe circuitry, we examine synchronous Read latency tracking for
a high-performance DRAM and look at an example circuit used for tracking
Read latency over various operating conditions.
In our example, the array data path delivers an eight-bit data Prefetch
for each DQ. The eight-bit Prefetch is delivered to the Read data FIFO bun
dled with the dataValid signal, which is discussed in Section 12.3.3 and is
shown in Figure 12.22. The data is then reduced to a four-bit width through
the Read data FIFO and delivered to the data serializer. The output data seri
alizer reduces the data from four bits down to singular output data as well as
retimes the data to the output data clock. There are a few functional details
about the FIFO that need to be clarified. One is that the Read data FIFO
must be capable of multiplexing and sequencing the correct four bits of data
through the output data path. For some DRAM specifications, there is what
is called critical word first. This feature uses a lower column address bit to
indicate which nibble is output first in an 8n burst (a choice between 0-3 or
4-7). This feature is a part of the DDR SDRAM specification and is some
times referred to as burst ordering.
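As an illustration of this nibble-level burst ordering, the sketch below uses a single low-order column address bit to choose which nibble of the eight-bit prefetch is returned first. The function and argument names are hypothetical.

```python
# Sketch of critical-word-first nibble ordering for an 8n prefetch (illustrative only).
def order_burst(prefetch8, ca_nibble_select):
    """prefetch8: the eight prefetched bits d0..d7; ca_nibble_select: assumed low-order
    column address bit choosing whether bits 0-3 or 4-7 are output first."""
    assert len(prefetch8) == 8
    if ca_nibble_select:
        return prefetch8[4:8] + prefetch8[0:4]   # nibble 4-7 first, then 0-3
    return prefetch8[0:4] + prefetch8[4:8]       # nibble 0-3 first, then 4-7

data = list("abcdefgh")
print(order_burst(data, ca_nibble_select=0))   # a b c d e f g h
print(order_burst(data, ca_nibble_select=1))   # e f g h a b c d
```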
There is another multiplexing function necessary in the Read data
FIFO. Most device specifications call for commands to be referenced to the
positive transition of the command clock. When the clock is frequency
divided, there are two phases of the output data clock referenced by the pos
itive edge of the command clock. For our example, phO and ph2 are consid
ered proxies for a positive transition on the command clock. When data is
routed from the peripheral array logic to the Read data FIFO, there is no
output clock phase information included. A combination of CL and Read
352 Chap. 12 Control Logic Design
command alignment to the four-phase output data clock results in two pos
sible data-to-clock alignments. Figure 12.38 shows an example of a noncon-
secutive Read access where the first Read forces the first data bit of an
access to align to output data clock, ph2 and the second Read forces the first
data bit to align to output data clock, phO. The remaining bits are also multi
plexed to the correct clock phase so that the phase order for the first access
is 2, 3, 0, 1 and the second output phase order is 0, 1, 2, 3.
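The phase rotation just described can be captured in a one-line rule, sketched below under the assumption that the first bit of a burst may only land on the ph0 or ph2 proxies of the positive command clock edge.

```python
# Illustrative sketch of mapping a four-bit Read burst onto the four-phase output clock.
def phase_order(first_phase):
    """Return the output-clock phase sequence for a four-bit burst starting on ph0 or ph2."""
    assert first_phase in (0, 2), "commands are referenced to positive command clock edges"
    return [(first_phase + i) % 4 for i in range(4)]

print(phase_order(2))   # [2, 3, 0, 1] -- first Read in Figure 12.38
print(phase_order(0))   # [0, 1, 2, 3] -- second Read in Figure 12.38
```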
The external command clock provides a common timing reference for all DRAMs in a system. The side effects of
advancing the timing of the output clock for on-die Read latency tracking
will become more apparent later in the discussion about Read latency (CL)
tracking.
Figure 12.39 Data alignment to multi-phase output clock through the Read data FIFO.
To get a better idea of how the FIFO must perform in the DRAM Read
data path, let's examine the diagram in Figure 12.40. Here, we see a contin
uous stream of consecutive Read commands. The boxes represent relative
percentages of the overall CL time for each operation of a Read command.
Notice that for the “Fast" operating condition, the time allotted for the FIFO
is the majority of the CL delay. Because the CL specification is based on
worst-case timing conditions, if the device happens to perform better than
the data sheet specification indicated for CL, there is no way for the system
to take advantage of the improved array performance. The reason for this is
that there may be several DRAMs in the system that perform worse or bet
ter. Therefore, the FIFO must match the throughput and latency of the array
accesses with the expected data arrival based on the programmed CL value
for all of the devices. The device data sheet specifies the minimum and
maximum value for CL. The value of CL is based on the acceptable worst
case performance of any one DRAM in the system. For the "Slow" operating
condition, the FIFO period is much less than CL and, for this case, we
are concerned with the length of the delay through the FIFO circuitry (tFL).
Figure 12.40 Read data FIFO timing for consecutive Read commands (CL = 10) under fast and slow operating conditions.
The peripheral array logic does not have storage available for data and, fur-
thermore, if array accesses are purposely delayed, then a violation of mini
mum column cycle time will occur for subsequent Reads during
consecutive burst Read operations. The second condition is that the delay
through the Read data FIFO is not large enough to significantly impact
access delay and, therefore, prevent the device from meeting the minimum
Read latency specified by the data sheet. For this case, the delay through the
FIFO needs to be minimized.
With these two criteria in mind, we consider an example for a high-per-
formance device. This example is taken from the specifications for a
GDDR3 high-performance graphics DRAM. The maximum data rate is 1.6
GT/s, CL=10 and BL=4. The example is illustrated through a simple, step-
by-step calculation procedure that shows some of the necessary characteris
tics of the Read data FIFO when evaluating circuit design options for the
FIFO implementation:
1. First, because we need to deliver four bits of data fbr each column
cycle, we know that the column cycle time will be 2.5ns. Next, the
serializer circuit requires 4 bits of data at its input, one bit to align
with each phase of the Read data output clock. Therefore, the cycle
time for each bit at the output of the FIFO is:
tbit = 4/DR = 2.5 ns    (12.5)
where DR = 1.6 GT/s.
2. Next, we need to determine the maximum input-to-output delay (tFL,max),
otherwise known as forward latency, allowed through the FIFO
in order to meet the minimum programmed CL value:
6 ns > tAA > 3.5 ns      array access delay (max., min.)
tCMD = 1 ns              command to colCyc* delay
tSUS = 0.5 ns            data setup time requirement at the serializer
                         input relative to the Read data output clock
2.8 ns > tIO > 1.5 ns    asynchronous I/O delay (max., min.)
tFL,max = (CL x tCK) - tAA,max - tCMD - tSUS - tIO,max = 2.2 ns    (12.6)
The values for all of the timing parameters (except tCK) in the tFL
calculation are for the worst-case operating corner. The value tCK is
the minimum period for the command clock frequency. The defini
tions for tAA and tIO show the wide range of delay over which these
timing variables can vary.
3. Depending on the chosen circuit architecture, the preceding FIFO
design parameters are dependent on the FIFO depth. Given a series
of consecutive Read accesses, the FIFO depth is determined by the
number of column accesses that can be loaded into the FIFO before
the first access in the series is consumed. Read FIFO depth must be
calculated using the longest possible latency and the shortest tAA and tIO
delay. This timing difference will determine the maximum required
FIFO depth (FD) to maintain constant data throughput. We derive
the equation for determining the FIFO depth by looking at the "fast
operating" conditions in Figure 12.40. The FIFO depth is based on
the peak number of column accesses that can be simultaneously
loaded into the FIFO (for DDR I/O):
FD = 3.6    (12.7)
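The numbers in steps 1 through 3 can be reproduced with the short calculation below. The 1.25 ns clock period (command clock at half the 1.6 GT/s data rate) and the exact form of the FIFO-depth expression are assumptions, chosen only to be consistent with the 2.5 ns, 2.2 ns, and 3.6 values quoted in the text.

```python
# Worked numbers for the GDDR3 example (1.6 GT/s, CL = 10, BL = 4), following
# Eqs. (12.5)-(12.7). The FIFO-depth expression is an assumed reconstruction.
DR   = 1.6e9       # data rate, transfers per second
t_ck = 1.25e-9     # assumed command clock period for DDR I/O at 1.6 GT/s
CL   = 10          # programmed Read latency, in command clocks
t_cc = 4 / DR      # column cycle time: four bits per column cycle -> 2.5 ns

# Eq. (12.6): maximum FIFO forward latency at the worst-case (slow) corner.
t_aa_max, t_aa_min = 6e-9, 3.5e-9     # array access delay range
t_cmd              = 1e-9             # command-to-colCyc* delay
t_sus              = 0.5e-9           # data setup at the serializer input
t_io_max, t_io_min = 2.8e-9, 1.5e-9   # asynchronous I/O delay range
t_fl_max = CL * t_ck - t_aa_max - t_cmd - t_sus - t_io_max     # -> 2.2 ns

# Eq. (12.7): FIFO depth from the fastest possible data arrival (assumed form).
FD = (CL * t_ck - t_aa_min - t_cmd - t_io_min) / t_cc + 1      # -> 3.6

print(f"t_cc = {t_cc*1e9:.2f} ns, t_fl_max = {t_fl_max*1e9:.2f} ns, FD = {FD:.1f}")
```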
Because lower-performance DRAMs typically have latencies of two or three cycles, with col
umn cycle times of two clock periods, the latency-to-column cycle time
ratio is close to 1. However, for high-performance DRAM, such as a
GDDR4 device, latencies grow to 10 or 12 cycles and column cycle times
are four clock periods, leading to latency-to-column cycle time ratios of 2-3.
Worse yet, for GDDR3 devices, the column cycle time is two clock periods
and the latency is 10 or higher, leading to latency-to-column cycle time ratios of 5 or greater.
The result is that for high-performance devices there is a requirement for
deeper data FIFOs. A traditional pointer-based FIFO can be used for deep
FIFOs. The problem with these circuits is that output data muxing require
ments can increase the output delay of the FIFO so that data cannot cycle at
the output clock cycle time.
Another option is to implement the FIFO as a set of series connected
latches controlled by asynchronous protocol controllers [14, 25]. This type
of FIFO implementation reduces the FIFO design complexity and can
increase performance at the output of the FIFO; but because the FIFO is a
series connection of latches, this type of design suffers from higher tFL and,
hence, longer latency. This text does not discuss specific FIFO implementa
tions. The designer must choose the implementation based on the calculated
requirements.
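For reference, a pointer-based FIFO reduces, in a behavioral model, to a circular buffer with independent load and unload pointers, as sketched below. This is an illustration only; it omits the gray-coded pointers and the output muxing whose delay the preceding paragraph warns about.

```python
# Minimal behavioral sketch of a pointer-based FIFO (illustrative only).
class PointerFIFO:
    def __init__(self, depth):
        self.buf = [None] * depth
        self.depth = depth
        self.wr = 0    # load pointer, advanced as column accesses arrive from the array
        self.rd = 0    # unload pointer, advanced in the output clock domain

    def load(self, word):
        self.buf[self.wr % self.depth] = word
        self.wr += 1

    def unload(self):
        word = self.buf[self.rd % self.depth]
        self.rd += 1
        return word

fifo = PointerFIFO(depth=4)
for burst in ["bits 0-3", "bits 4-7"]:
    fifo.load(burst)
print(fifo.unload(), "|", fifo.unload())    # bits 0-3 | bits 4-7
```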
For high-performance DRAM, synchronization at the output of the
FIFO can be complicated by the data rate. There are two primary alterna
tives: either the circuit design meets the cycle time requirements through
specific circuit implementation techniques, or the width of the data path at
the FIFO output is increased at the cost of design complexity. Figure 12.41
shows an example of both approaches. The difference between the two
approaches is that for the low-cycle time approach, an aggressive circuit
implementation using dynamic logic techniques must be considered. For the
increased data path width approach, a method for toggling between the data
sets must be implemented and the data must be muxed correctly through the
FIFO circuits. This muxing would be in addition to, or a modification of,
the clock-phase-based muxing shown in Figure 12.39. Another consider
ation when increasing the data width is the increased clock loading and
complexity of the serialization circuit.
Figure 12.41 FIFO output synchronization solutions for multi-phase output clock.
The input and output circuit path delays can span multiple clock cycles. Add to this the fact that the delay variation through the
I/O paths is directly impacted by process, voltage and temperature, which
can cause the delay to vary across clock period boundaries. What we find is
that achieving deterministic Read data latency is not a simple matter of
counting clock cycles. In the next section, we investigate difficulties related
to deterministic Read latency tracking. We also investigate how the inclu
sion of clock synchronization logic (Chapter 11) on the DRAM simulta
neously complicates CL tracking and accommodates synchronization and
clock domain crossing between the command clock and the output data
clock domains.
Moving forward to DDR DRAM devices and beyond, we find a clock alignment circuit
on the DRAM generating an output clock domain separate from the com
mand clock domain. The output clock domain is advanced relative to the
command clock by the output circuit path delay, similar to the diagram in
Figure 12.42. The clock alignment circuit is used to align the output data to
the clock edge corresponding to CL and, at the same time, minimize tAC to
predominantly short-term components of output clock jitter.
Figure 12.42 Effect of synchronization circuit on output data timing (DLL example).
The clock alignment circuit also tracks the relative delay variations between the two clock domains with
long-term changes in voltage and temperature.
The method for CL tracking that we are about to examine can be diffi
cult to explain and, depending on how well it is explained, difficult to
understand. To facilitate our understanding of this method, we will break
the function of the circuits into two parts. The first part will explain how the
relative phase relationship between C/A CLK and RdCLK is comprehended.
By determining the phase relationship between the two clock domains, we
can establish a basis for tracking delay variations (resulting in phase align
ment changes) that occur between the two clock domains. The second part
will examine how variations in the phase relationship between the two clock
domains are tracked and, ultimately, how we determine the precise internal
RdCLK clock cycle from which to fire the OE signal so that data is correctly
aligned CL cycles from the receipt of a Read command.
To summarize, for high-performance DRAM, precise CL control
depends on the inclusion of a clock alignment circuit. Without the clock
alignment circuit, simply counting the correct number of command clocks
as a basis for firing the data transmitter circuits will result in a large time
variation between the anticipated CL and actual CL. This variation is a
result of the asynchronous input and output circuit path distribution delay.
Therefore, if cycle-accurate CL is part of the protocol specification, then a
clock alignment circuit is a necessity.
If we refer to Figure 12.42, we see that there is already a circuit in the
clock alignment topology that tracks the phase difference between the C/A
CLK and the RdCLK domains. This circuit is the I/O feedback model used
in the clock alignment circuit. The absolute phase difference between the
C/A CLK and the RdCLK is given by:
θ = ((tin + tout)/tCK) × 2π rad    (12.8)
If tin + tout > tCK, then the absolute phase difference is greater than a sin
gle clock period. Given this situation, the method used to establish the phase
difference between the clock domains must account for this absolute differ
ence and not just the fractional phase difference that might be present. Fig
ure 12.43 is a circuit configuration that can synchronize a signal between
the two clock domains.
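A quick numeric illustration of Eq. (12.8), using assumed path delays, shows how the absolute phase difference can exceed one clock period and why both the whole and fractional parts must be accounted for.

```python
# Numeric illustration of Eq. (12.8) with assumed delays (not values from the text).
import math

t_in, t_out = 1.4e-9, 1.6e-9    # assumed input and output circuit path delays
t_ck        = 1.25e-9           # assumed command clock period

theta = (t_in + t_out) / t_ck * 2 * math.pi          # absolute phase difference, radians
whole, fraction = divmod((t_in + t_out) / t_ck, 1.0)

print(f"theta = {theta:.2f} rad "
      f"({int(whole)} full clock periods plus {fraction:.2f} of a period)")
```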
For CL tracking, the decoded Read command is synchronized to the command clock following the command decoder. Keep in mind, of course, from the
previous discussion, that the Read command sent to the array is not retimed
to any clock; but the Read command used fbr CL tracking is synchronized
because CL tracking is, inherently, a synchronous operation.
If synchronization points are present in the CL tracking logic, and these
retiming circuits logically delay signals or logic used for CL timing, then
we must account for these cycles when tracking CL. Continuing with our
CL tracking methodology, Figure 12.45 is again a modification of earlier
diagrams to show a calculation that is used to determine a load value for the
C/A CLK clock domain counter. Because the synchronous initialization of
the counters accounts for the I/O delay induced phase difference between
the C/A CLK and RdCLK domains, by preloading one of the counters with
an offset based on the compensated programmed CL value, the delay
between corresponding values of the two counters will reflect the internal
CL timing necessary to ensure correct external CL timing. The internal CL
timing is given by:
tCL,int = (CL - SP)tCK - (tin + tout)    (12.9)
A value captured from the C/A CLK counter therefore appears tCL,int later at the output of the RdCLK counter. With the counters configured in
this fashion, we are now internally tracking CL, which is defined relative to
the external command clock domain.
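Plugging representative numbers into Eq. (12.9) gives a feel for the internal timing budget. The CL and SP values follow the example in Figure 12.47; the input and output path delays below are assumptions.

```python
# Internal CL timing from Eq. (12.9), with assumed I/O path delays.
CL, SP = 10, 3          # programmed Read latency and synchronization-point cycles
t_ck   = 1.25e-9        # assumed command clock period
t_in   = 1.4e-9         # assumed input circuit path delay
t_out  = 1.6e-9         # assumed output circuit path delay

t_cl_int = (CL - SP) * t_ck - (t_in + t_out)
print(f"internal CL timing = {t_cl_int*1e9:.2f} ns "
      f"= {t_cl_int/t_ck:.2f} command clock periods")
```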
Figure 12.45 Load value for C/A CLK clock domain counter.
As shown in Figure 12.46, each time a Read command is decoded, the signal Read is
generated and used to load the C/A CLK counter value into the FIFO. Once
the RdCLK counter attains the current output value of the FIFO, a digital
comparator indicates a match and signals back to the FIFO that the value
has been consumed. At the same time, the comparator indicates to the OE
timing circuit that the transmit circuits must be enabled on the next cycle.
The OE timing circuit drives the OE signal to the data serializers in the out
put circuit paths for the length of the Read burst. The sequence of events
required to enable the output data path is shown in Figure 12.47. Note that a
synchronous transfer of the OE signal from the digital comparator to the
serializer would cost one SP. Likewise, the synchronous transfer from the
OE timer circuit to the serializer would cost another SP. Gray-code counters
are used for the clock domain counters [22]. By using gray-code counters,
false compares are eliminated from the digital comparator and FIFO loads
are more reliable because only one bit transitions on each clock cycle. This
configuration is illustrated in Figure 12.46.
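The gray-code property relied on here is easy to demonstrate: converting a binary count with g = b xor (b >> 1) guarantees that successive values differ in exactly one bit, which is what makes the cross-domain compare and FIFO load safe. A minimal Python demonstration follows.

```python
# Successive gray-code values differ in exactly one bit position.
def binary_to_gray(n):
    return n ^ (n >> 1)

for count in range(8):
    print(f"{count}: {binary_to_gray(count):03b}")
# Adjacent lines of the output differ in a single bit, so a value sampled while the
# counter is changing can never be mistaken for a count far from the true one.
```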
Figure 12.47 Timing diagram of output data path enable (CL = 10, SP = 3; tCL,int = (CL - SP)tCK - (tin + tout)).
The preceding discussion of the Read latency tracking circuit is but one
example of how we might track Read latency. Another proposal [21] is very
similar but uses tokens in connected shift registers rather than counters to
track the difference between clock domains. This method has merit except
that there is higher clock loading and potentially longer output delay due to
a large number of mux terms if a wide range of latency values is required.
For high-performance DRAM, the absolute value of CL increases greatly
relative to SDRAM and early DDR DRAM and, because the device must
often be designed to cover a wide range of data rates, the number of CL val
ues that are valid is greatly increased over less demanding DRAM perfor
mance levels. Another advantage of the method described in this text is that
the C/A CLK and RdCLK are not required to be physically routed to the
same area of the die. The use of a FIFO allows the clock routing for the sep
arate clock domains to remain in their respective areas of the die.
REFERENCES
[1] D. Harris, Skew Tolerant Circuit Design. San Francisco, CA: Morgan
Kaufman Publishers, 2001.
[2] John P. Uyemura, CMOS Logic Circuit Design. Norwell, MA: Kluwer
Academic Publishers, 1999.
[3] Y. Taur, “CMOS scaling and issues in sub-0.25um systems,” in Design of
High-Performance Microprocessor Circuits, ed. by A. Chandrakasan, W. J.
Bowhill, and F. Fox. Piscataway, NJ: IEEE Press, 2001.
[4] T. Sakurai and A. R. Newton, “Alpha-power law MOSFET model and its
application to CMOS inverter delay and other formulas,” IEEE Journal of
Solid-State Circuits, vol. 25, pp. 584-593, April 1990.
[5] K. Chen and C. Hu, “Performance and scaling in deep submicron CMOS,”
IEEE Journal of Solid-State Circuits, vol. 33, pp. 1586-1589, Oct. 1998.
[6] R. K. Krishnamurthy, A. Alvandpour, S. Mathew, M. Anders, V. De, and
S. Borkar, “High-performance, low-power, and leakage-tolerance challenges for
sub-70nm microprocessor circuits,” in Proc. 28th European ESSCIRC, 2002,
pp. 315-321.
latency control,” IEEE Journal of Solid-State Circuits, vol. 40, pp. 223-232,
Jan. 2005.
[22] D. D. Gajski, Principles of Digital Design. Saddle River, NJ: Prentice-Hall,
Inc., 1997.
[23] T. Zhang and S. Sapatnekar, “Simultaneous shield and buffer insertion for
crosstalk noise reduction in global routing,” IEEE Transactions on Very Large
Scale Integration, vol. 15, pp. 624-635, June 2007.
[24] C. Svenssen and J. Yuan, “High-speed CMOS circuit technique,” IEEE
Journal of Solid-State Circuits, vol. 24, pp. 62-70, Feb. 1989.
[25] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev,
Logic Synthesis of Asynchronous Controllers and Interfaces. Berlin: Springer-
Verlag, 2002.
[26] H. Partovi, “Clocked storage elements,” in Design of High-Performance
Microprocessor Circuits, ed. by A. Chandrakasan, W. J. Bowhill, and F. Fox.
Piscataway, NJ: IEEE Press, 2001.
Chapter 13
Power Delivery
An interesting thing happens on the way to high performance. Tolerance for
supply noise and voltage gradients drops off a cliff. This increased sensitiv
ity to supply voltage arises from two different factors. The first is that the
timing margins for high-frequency designs are very thin, leaving less mar
gin on the table for supply variation. The second is that the inevitable scal
ing of supply voltages creates a need for better power delivery networks just
to maintain the status quo. After all, 100 mV of supply noise on a 1-volt
supply is a much greater problem than 100 mV of supply noise on a 2.5-volt
supply. In this chapter, we explore the power delivery problem in greater
detail and ways to mitigate it.
As an obvious outcome of moving toward higher clock frequencies and
faster data rates, on-die timing gets squeezed. This is especially true in and
around high-speed I/O, pipeline, and control circuits. As timing budgets
shrink, the margins available for supply variation also shrink. This leads to
the inevitable conclusion that to reduce these variations circuit designs must
either tolerate more power supply variation or improve the power delivery
network. Typically the most timing-sensitive circuits are involved in I/O
operations, such as capturing input data and driving output data. Data cap
ture circuits often involve clock receivers, DLL or PLL blocks, clock distri
bution networks, and input buffers and capture latches. The actual path from
the clock and data pins to the capture latches can be very long, on the order
of multiple clock cycles in electrical length. A long electrical path, of
course, leads to a greater sensitivity to power supply noise and variation.
The delay sensitivity of typical CMOS logic to supply voltage is between
two and ten times greater than that of fully differential circuits, such
as current mode logic (CML) [1]. As a result, highly sensitive timing cir
cuits should use as much differential logic as possible. Furthermore, CML
For full-chip simulation, the power and ground connections of the design must
be mapped to the PDN model. This eliminates the ideal power and
ground connections that we generally deal with in simulation space, replac
ing them with connections to the nonidealized power grid. Elimination of
ideal power connections significantly increases simulation time because
even small amounts of current moving around the network can impact large
segments of the voltage grid. Localized disturbances of the supply grid
impact the circuits in a variety of ways. Notably, transistor and
gate behavior depend on local conditions. As a result, some circuits speed
up while others slow down. Signaling levels, within a circuit and
between blocks, are affected by power supply noise and gradients. Changes
in the common mode voltage from one area of the die to another impact
both signaling and clock distribution. In essence, your simulation will exhibit
more real-world behavior than ever before. As a result, you will be able to
observe power supply voltages throughout the PDN under a variety of oper
ating conditions and simulation vectors. Most importantly, you can see
Figure 13.5 d) GDDR3 third substrate layer, e) GDDR3 bottom substrate layer, and f) GDDR3 package layer stack.
The solution was to add three pairs of Vdd and Vss bond pads within each of the Read clock trees.
These new bond pads were readily supported by the FCIP packaging tech
nology of this design. Simulation of the revised design yielded the improve
ment shown in Figure 13.7 and labeled "After." As shown, the power
supply noise and sag were reduced to only 50 mV, which met our internal
design goal for this voltage supply.
Figure 13.8 once again shows the metal3 layout for the GDDR3 device.
In this view, however, the Vdd/Vss power pads, which were added to sup
port the Read clock trees, are circled to show their relative locations.
Figure 13.8 GDDR3 metal3 layout with clock tree power pads.
Figure 13.9 Vdd and Vss waveforms for HFF block before tuning.
Figure 13.10 Vdd and Vss waveforms for HFF block after tuning (10 ns/division).
Chapter 14
Future Work in High-Performance Memory
The semiconductor industry will continue its inevitable march forward,
deriving ever greater benefits through additional scaling and higher
levels of integration. Barring the unforeseen, device density, power, and
speed will continue to follow their historical trends.
We gain insight into what the future may hold for DRAM by looking at
the past and at various trends. For instance, Figure 14.1 is a graph showing
a trend line fbr DRAM data rate per pin by year. This data shows steady
gains from 1995 through 2010 with the data rate doubling approximately
every three years. While any data beyond present day is somewhat specula
tive, it is based on recent trending and extrapolated historical data. Figure
14.1 shows that device performance, as it relates to I/O bandwidth, will
continue to advance beyond the end of the decade. This is both exciting and
foreboding for designers. It is exciting because engineers relish a challeng
ing design project, yet foreboding because the trend presented in Figure
14.1 represents a never-ending series of design problems that become
increasingly more difficult.
While higher bandwidths are a perceived benefit, unfortunately this
benefit comes at a cost. The power dissipated by these high-strung designs
also appears to rise with each new generation. Figure 14.2 shows a trend
line of power dissipation for a variety of recent DRAM interface technolo
gies. The data shown is based on the IDD4 specification for each technol
ogy. IDD4 measures supply current for back-to-back page Reads and/or
back-to-back page Writes at minimum column cycle time. This specifica
tion examines power at the maximum sustainable device bandwidth.
Despite the fact that supply voltages drop with each new technology, power
is nevertheless going up. The fact that the trend line rises is not surprising
since it is tightly correlated to I/O bandwidth, internal clock frequencies,
and the data pre-fetch size, all of which are on the increase.
Figure 14.1 Trend line for DRAM data rate per pin.
Figure 14.2 Trend line of IDD4-based power dissipation for recent DRAM interface technologies: 256 Mbit SDRAM (167 MHz), 512 Mbit DDR (200 MHz), 1 Gbit DDR2 (333 MHz), and 1 Gbit DDR3 (333 MHz).
Most every computer system in use today and those being designed for
future use are highly power constrained. This is at odds with the device
power dissipation depicted in Figure 14.2. As a result, one should expect
that future technologies will adopt more sophisticated power management
schemes to keep device and system power within budget. These schemes
will strive to reduce both active and standby power to absolute minimum
levels through a variety of mechanisms. The most obvious is to shut
down circuits not needed for ongoing operations. The degree to which cir
cuits can be shut down depends on our tolerance for the delays we encoun
ter when turning these circuits back on. Nothing comes for free.
Another issue pertaining to power is leakage current. All transistors
leak when they are turned off, some more, some less. Historically, DRAM
transistors are designed to have low leakage, especially the array Mbit tran
sistors, since leakage impacts data retention and Refresh. However, the
quest for higher performance and high-bandwidth I/O is creating pressure to
build better transistors. Better, in this case, means higher drive and lower
threshold voltages. Unfortunately, what leads to higher performance also leads to higher
leakage current. So, in order to keep the array leakage current low while
increasing peripheral transistor performance, DRAM manufacturers will be
forced to add process complexity. This will translate to more masks, more
process steps, and higher cost, which is just the opposite of what is needed
in a highly competitive commodity market. Nevertheless, if the perfor
mance goals are to be realized, then added process complexity is the price to
be paid.
Without a doubt, the most notable and predictable feature of DRAM
technology is memory density and its associated growth over the years.
Advances in memory density are fueled by a perpetual scaling of the under
lying CMOS process technology. Figure 14.3 shows a ten-year timeline of
historical DRAM shipments by density. Although the cadence in recent
years appears to have tapered off a bit, memory density has essentially dou
bled every two to three years. Accordingly, Figure 14.4 shows a forecast of
shipments by density through the end of 2010. As expected, the lower den
sities continue to taper off while the higher densities, starting at 512 Mb,
grow in share until they peak and tail off. There's nothing surprising about
this data since it merely follows the historical trend.
Increased memory density does create some problems for high-perfor
mance DRAM devices. There are obvious architectural ramifications, as
discussed earlier in Chapter 8. Large blocks of memory can lead to long
wire routes, cycle time problems, and longer latency. Not exactly what you
expect in a high-performance part. Typical first-generation parts of a spe
cific memory density will not achieve the same level of performance as
their lower density counterparts. It usually takes one or two process shrinks
before performance levels rise to targeted levels.
Figure 14.3 Ten-year timeline of historical DRAM shipments by density.
Figure 14.4 Forecast of DRAM shipments by density (percent share by calendar quarter) through 2010.
So, what will DRAM devices look like in the future? Given the histori
cal trends, they should continue to grow in density and speed, burn more
power, and operate at lower and lower supply voltages. But will that neces
sarily be the case? It's reasonable to expect these trends to hold for the fore
seeable future, but they are not sustainable indefinitely. Recent
transformations in the microprocessor world should serve to illustrate this
point. For years, the personal computer world looked like a horse race
Supplemental Reading
In this second edition of DRAM Circuit Design, we introduce the reader to a
variety of topics—from introductory to advanced. However, we may not
have covered specific topics to the reader's satisfaction. For this reason, we
have compiled a list of supplemental readings from major conferences,
journals, and books categorized by subject. It is our hope that unanswered
questions are addressed by the authors of these readings, who are experts in
the field of CMOS design.
DRAM Cells
[47] C. G. Sodini and T. I. Kamins, “Enhanced capacitor for one-transistor
memory cell,” IEEE Transactions on Electron Devices, vol. ED-23,
pp. 1185-1187, October 1976.
DRAM Sensing
[62] N. C.-C. Lu and H. H. Chao, “Half-VDD bit-line sensing scheme in CMOS
DRAMs,” IEEE Journal of Solid-State Circuits, vol. 19, pp. 451-454,
August 1984.
[63] P. A. Layman and S. G. Chamberlain, <4A compact thermal noise model for
the investigation of soft error rates in MOS VLSI digital circuits/' IEEE
Journal of Solid-State Circuits, vol. 24, pp. 79-89, February 1989.
[64] R. Kraus, ^Analysis and reduction of sense-amplifier offset,n IEEE Journal
ofSolid-State Circuits, vol. 24, pp. 1028-1033, August 1989.
[65] R. Kraus and K. Hoffmann, ^Optimized sensing scheme of DRAMs,n IEEE
Journal of Solid-State Circuits, vol. 24, pp. 895-899, August 1989.
[66] H. Hidaka, Y. Matsuda, and K. Fujishima, “A divided/shared bit-line sensing
scheme for ULSI DRAM cores,” IEEE Journal of Solid-State Circuits, vol.
26, pp. 473-478, April 1991.
[67] T. Nagai, K. Numata, M. Ogihara, M. Shimizu, K. Imai, T. Hara, M.
Yoshida, Y. Saito, Y. Asao, S. Sawada, and S. Fuji, “A 17-ns 4-Mb CMOS
DRAM,“ IEEE Journal of Solid-State Circuits^ vol 26, pp. 1538-1543,
November 1991.
[68] T. N. Blalock and R. C. Jaeger, <4A high-speed sensing scheme for IT
dynamic RAMs utilizing the clamped bit-line sense amplifier,'' IEEE
Journal of Solid-State Circuits^ vol. 27, pp. 618-625, April 1992.
[69] M. Asakura, T. Ooishi, M. Tsukude, S. Tomishima, T. Eimori, H. Hidaka,
Y. Ohno, K. Arimoto, K. Fujishima, T. Nishimura, and T. Yoshihara, **An
experimental 256-Mb DRAM with boosted sense-ground scheme,n IEEE
Journal of Solid-State Circuits^ vol. 29, pp. 1303-1309, November 1994.
[70] T. Kirihata, S. H. Dhong, L. M. Terman, T. Sunaga, and Y. Taira, “A variable
precharge voltage sensing,” IEEE Journal of Solid-State Circuits, vol. 30,
pp. 25-28, January 1995.
[71] T. Hamamoto, Y. Morooka, M. Asakura, and H. Ozaki, uCell-plate-line/bit-
line complementary sensing (CBCS) architecture for ultra low-power
DRAMs/5 IEEE Journal of Solid-State Circuits, vol, 31, pp. 592-601, April
1996.
[72] T. Sunaga, t4A full bit prefetch DRAM sensing circuit,IEEE Journal of
Solid-State Circuits, vol. 31, pp. 767-772, June 1996.
DRAM SOI
[76] S. Kuge, F. Morishita, T. Tsuruda, S. Tomishima, M. Tsukude, T. Yamagata,
and K. Arimoto, “SOI-DRAM circuit technologies for low power high speed
multigiga scale memories,” IEEE Journal of Solid-State Circuits, vol. 31, pp.
586-591, April 1996.
[77] K. Shimomura, H. Shimano, N. Sakashita, F. Okuda, T. Oashi,
Y. Yamaguchi, T. Eimori, M. Inuishi, K. Arimoto, S. Maegawa, Y. Inoue,
S. Komori, and K. Kyuma, 4tA 1-V 46-ns 16-Mb SOI-DRAM with body
control technique,IEEE Journal of Solid-State Circuits, vol. 32,
pp. 1712-1720, November 1997.
Embedded DRAM
[78] T. Sunaga, H. Miyatake, K. Kitamura, K. Kasuya, T. Saitoh, M. Tanaka,
N. Tanigaki, Y. Mori, and N. Yamasaki, “DRAM macros for ASIC chips,“
IEEE Journal of Solid-State Circuits, vol. 30, pp. 1006-1014, September
1995.
Redundancy Techniques
[79] H. L. Kalter, C. H. Stapper, J. E. Barth, Jr., J. Di Lorenzo, C. E. Drake,
J. A. Fifield, G. A. Kelley, Jr., S. C. Lewis, W. B. van der Hoeven, and
J. A. Yankosky, <4A 50-ns 16-Mb DRAM with a 10-ns data rate and on-chip
ECC," IEEE Journal ofSolid-State Circuits, vol. 25, pp. 1118-1128, October
1990.
[80] M. Horiguchi, J. Etoh, M. Aoki, K. Itoh, and T. Matsumoto, UA flexible
redundancy technique for high-density DRAMs,“ IEEE Journal of Solid
State Circuits, vol. 26, pp. 12-17, January 1991.
DRAM Testing
[83] T. Ohsawa, T. Furuyama, Y. Watanabe, H. Tanaka, N. Kushiyama,
K. Tsuchida, Y. Nagahama, S. Yamano, T. Tanaka, S. Shinozaki, and
K. Natori, **A 60-ns 4-Mbit CMOS DRAM with built-in selftest function,M
IEEE Journal ofSolid-State Circuits, vol. 22, pp. 663-668, October 1987.
[84] P. Mazumder, ^Parallel testing of parametric faults in a three-dimensional
dynamic random-access memory/, IEEE Journal of Solid-State Circuits,
vol. 23, pp. 933-941, August 1988.
[85] K. Arimoto, Y. Matsuda, K. Furutani, M. Tsukude, T. Ooishi, K. Mashiko,
and K. Fujishima, “A speed-enhanced DRAM array architecture with
embedded ECC,“ IEEE Journal of Solid-State Circuits, vol. 25, pp. 11-17,
February 1990.
[86] T. Takeshima, M. Takada, H. Koike, H. Watanabe, S. Koshimaru, K. Mitake,
W. Kikuchi, T. Tanigawa, T. Murotani, K. Noda, K. Tasaka, K. Yamanaka,
and K. Koyama, “A 55-ns 16-Mb DRAM with built-in self-test function
using microprogram ROM,“ IEEE Journal of Solid-State Circuits, vol. 25,
pp. 903-911, August 1990.
[87] T. Kirihata, Hing Wong, J. K. DeBrosse, Y. Watanabe, T. Hara, M. Yoshida,
M. R. Wordeman, S. Fujii, Y. Asao, and B. Krsnik, “Flexible test mode
approach for 256-Mb DRAM,” IEEE Journal of Solid-State Circuits,
vol. 32, pp. 1525-1534, October 1997.
[88] S. Tanoi, Y. Tokunaga, T. Tanabe, K. Takahashi, A. Okada, M. Itoh,
Y. Nagatomo, Y. Ohtsuki, and M. Uesugi, <4On-wafer BIST of a 200-Gb/s
failed-bit search fbr 1-Gb DRAM,“ IEEE Journal of Solid-State Circuits,
vol. 32, pp. 1735-1742, November 1997.
Synchronous DRAM
[89] T. Sunaga, K. Hosokawa, Y. Nakamura, M. Ichinose, A. Moriwaki,
S. Kakimi, and N. Kato, “A full bit prefetch architecture for synchronous
DRAMs,M IEEE Journal of Solid-State Circuits^ vol. 30, pp. 998-1005,
September 1995.
Low-Voltage DRAM
[91] K. Lee, C. Kim, D. Yoo, J. Sim, S. Lee, B. Moon, K. Kim, N. Kim, S. Yoo,
J. Yoo, and S. Cho, “Low voltage high speed circuit designs for Giga-bit
DRAMs,n in 1996 Symposium on VLSI Circuits^ June 1996, p. 104.
[92] M. Saito, J. Ogawa, K. Gotoh, S. Kawashima, and H. Tamura, ^Technique
for controlling effective in multi-Gbit DRAM sense amplifier,in 1996
Symposium on VLSI Circuits, June 1996, p. 106.
[93] K. Gotoh, J. Ogawa, M. Saito, H. Tamura, and M. Taguchi, 4*A 0.9 V sense
amplifier driver for high-speed Gb-scale DRAMs/' in 1996 Symposium on
VLSI Circuits, June 1996, p. 108.
[94] T. Hamamoto, Y. Morooka, T. Amano, and H. Ozaki, “An efficient charge
recycle and transfer pump circuit for low operating voltage DRAMs,“ in
1996 Symposium on VLSI Circuits. June 1996, p. 110.
[95] T. Yamada, T. Suzuki, M. Agata, A. Fujiwara, and T. Fujita, ''Capacitance
coupled bus with negative delay circuit for high speed and low power
(lOGB/s < 500mW) synchronous DRAMs,“ in 1996 Symposium on VLSI
Circuits, June 1996, p. 112.
High-Speed DRAM
[96] S. Wakayama, K. Gotoh, M. Saito, H. Araki, T. S. Cheung, J. Ogawa, and
H. Tamura, u10-ns row cycle DRAM using temporal data storage buffer
architecture,in 1998 Symposium on VLSI Circuits, June 1998, p. 12.
[97] Y. Kato, N. Nakaya, T. Maeda, M. Higashiho, T. Yokoyama, Y. Sugo,
F. Baba, Y. Takemae, T. Miyabo, and S. Saito, ^Non-precharged bit-line
sensing scheme for high-speed low-power DRAMs,“ in 1998 Symposium on
VLSI Circuits, June 1998, p. 16.
[98] S. Utsugi, M. Hanyu, Y. Muramatsu, and T. Sugibayashi, “Non-
complementary rewriting and serial-data coding scheme for shared-sense
amplifier open-bit-line DRAMs,” in 1998 Symposium on VLSI Circuits, June
1998, p. 18.
[99] Y. Sato, T. Suzuki, T. Aikawa, S. Fujioka, W. Fujieda, H. Kobayashi,
H. Ikeda, T. Nagasawa, A. Funyu, Y. Fujii, K. I. Kawasaki, M. Yamazaki,
and M. Taguchi, “Fast cycle RAM (FCRAM); a 20-ns random row access,
High-Performance DRAM
[104] T. Kono, T. Hamamoto, K. Mitsui, and Y. Konishi, “A precharged-capacitor-
assisted sensing (PCAS) scheme with novel level controlled for low power
DRAMs,M in 1999 Symposium on VLSI Circuits^ June 1999, p. 123.
[105] H. Hoenigschmid, A. Frey, J. DeBrosse, T. Kirihata, G. Mueller, G Daniel,
G Frankowsky, K. Guay, D. Hanson, L. Hsu, B. Ji, D. Netis, S. Panaroni,
C. Radens, A. Reith, D. Storaska, H. Terletzki, O. Weinfiirtner, J. Alsmeier,
W. Weber, and M. Wordeman, 7F2 cell and bitline architecture featuring
tilted array devices and penalty-free vertical BL twists fbr 4Gb DRAMs” in
1999 Symposium on VLSI Circuits, June 1999, p. 125.
[106] S. Shiratake, K. Tsuchida, H. Toda, H. Kuyama, M. Wada, F. Kouno,
T. Inaba, H. Akita, and K. Isobe, UA pseudo multi-bank DRAM with
categorized access sequence/' in 1999 Symposium on VLSI Circuits, June
1999, p. 127.
[107] Y. Kanno, H. Mizuno, and T. Watanabe, WA DRAM system for consistently
reducing CPU wait cycles/* in 1999 Symposium on VLSI Circuits^ June
1999, p. 131.
[108] S. Perissakis, Y. Joo, J. Ahn, A. DeHon, and J. Wawrzynek, ^Embedded
DRAM for a reconfigurable array,“ in 1999 Symposium on VLSI Circuits,
June 1999, p. 145.
High-Performance Logic
[111] D. Harris, Skew Tolerant Circuit Design. San Francisco, CA: Morgan
Kaufman Publishers, 2001.
[112] W. J. Dally and J. W. Poulton, Digital Systems Engineering. Cambridge, UK:
Cambridge University Press, 1998.
[113] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, pp.
720-738, June, 1989.
[114] C. Svenssen and J. Yuan, ^High-speed CMOS circuit technique,IEEE
JSSC, vol. 24, pp 62-70, Feb. 1989.
[115] A. Chandrakasan, W. J. Bowhill, and F. Fox, High-performance
Microprocessor Circuits. Piscataway, NJ: IEEE Press, 2001.
[116] I. Sutherland and S. Fairbanks, t4GasP: a minimal FIFO control,Async
2001, pp. 46-53, March 2001.
[117] C. J. Myers, Asynchronous Circuit Design. Hoboken, New Jersey: John
Wiley & Sons, Inc, 2001.
locked loop with infinite phase capture ranges,M in IEEE Int. Solid-State
Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2000, vol. 469, pp. 350-351.
[122] J.-B. Lee, K.-H. Kim, C. Yoo, S. Lee, O.-G. Na, C.-Y. Lee, H.-Y. Song, J.-S.
Lee, Z.-H. Lee, K.-W. Yeom, H.-J. Chung, L-W. Seo, M.-S. Chae, Y.-H.
Choi, and S,-I. Cho, ^Digitally-controlled DLL and I/O circuits for 500
Mb/s/pin xl6 DDR SDRAM,“ in IEEE Int, Solid-State Circuits Conf.
(ISSCC) Dig. Tech. Papers, Feb. 2001, vol. 431, pp. 68-69.
[123] T. Matano, Y. Takai, T. Takahashi, Y. Sakito, I. Fujii, Y. Takaishi,
H. Fujisawa, S. Kubouchi, S. Narui, K. Arai, M. Morino, M. Nakamura,
S. Miyatake, T. Sekiguchi, and K. Koyama, “A 1-Gb/s/pin 512-Mb DDRII
SDRAM using a digital DLL and a slew-rate-controlled output buffer,“
IEEE Journal on Solid-State Circuits, vol. 38, pp. 762-768, May 2003.
[124] J.-T. Kwak, C.-K. Kwon, K.-W. Kim, S.-H. Lee, and J.-S. Kih, “Low cost
high performance register-controlled digital DLL for 1 Gbps x32 DDR
SDRAM,” in Symposium VLSI Circuits Dig. Tech. Papers, 2003, pp.
283-284.
[125] H. H. Chang and S. I. Liu, “A wide-range and fast-locking all-digital cycle-
controlled delay-locked loop,“ IEEE Journal on Solid-State Circuits, vol.
40, pp. 661-670, Mar. 2005.
[126] J. S. Wang, Y. W. Wang, C. H. Chen, and Y. C. Liu, “An ultra-low power
fast-lock-in small-jitter all-digital DLL," in IEEE Int, Solid-State Circuits
Conf. (ISSCC) Dig. Tech. Papers, Feb. 2005, pp. 422-433.
[127] S. R. Han and S. I. Liu, “A 500-MHz-1.25-GHz fast-locking pulse-width
control loop with presettable duty cycle,” IEEE Journal of Solid-State
Circuits, vol. 39, pp. 463-468, March 2004.
[128] M. Mota and J. Christiansen, “A four-channel self-calibrating high-
resolution time to digital converter,” in Proc. IEEE Int. Conf. Electronics,
Circuits, and Systems, 1998, pp. 409-412.
[129] Y. J. Wang, S. K. Kao, and S. I. Liu, ^All-digital delay-locked loop/pulse-
width-control loop with adjustable duty cycles,'' IEEE Journal on Solid
State Circuits, vol. 41, June 2006.
[130] T. Hsu, B. Shieh, and C. Lee, “An all-digital phase-locked loop (ADPLL)-
based clock recovery circuit,,5 IEEE Journal on Solid-State Circuits, vol. 34,
pp. 1063-1073, Aug. 1999.
[131] J. Dunning, G. Garcia, J. Lundberg, and E. Nuckolls, “An all-digital phase-
locked loop with 50-cycle lock time suitable for high-performance
microprocessors,” IEEE Journal of Solid-State Circuits, vol. 30, pp.
412-422, Apr. 1995.
[132] T. Saeki, K. Minami, H. Yoshida, and H. Suzuki, “A direct-skew-detect
synchronous mirror delay for application-specific integrated circuits,” IEEE
Journal of Solid-State Circuits, vol. 34, pp. 372-379, Mar. 1999.
[133] K. Sung, B. D. Yang, and L. S. Kim, “Low power clock generator based on
an area-reduced interleaved synchronous mirror delay scheme,” in Proc.
IEEE Int. Symposium Circuits and Systems, 2002, pp. 671-674.
[134] Y.-H. Koh and O.-K. Kwon, “A fast-locking digital delay line with duty
conservation/' in Proc. IEEE Asia-Pacific Conf. Circuits and Systems, 1998,
pp. 287-290.
[135] T. Saeki, H. Nakamura, and J. Shimizu, “A 10 ps jitter 2 clock cycle lock
time CMOS digital clock generator based on an interleaved synchronous
mirror delay scheme,” in Symposium VLSI Circuits Dig. Tech. Papers, 1997,
pp. 109-110.
[136] D. Shim, D.-Y. Lee, S. Jung, C.-H. Kim, and W. Kim, “An analog
synchronous mirror delay for high-speed DRAM application,” IEEE Journal
of Solid-State Circuits, vol. 34, pp. 484-493, April 1999.
[137] Y. M. Wang and J. S. Wang, “A low-power half-delay-line fast skew
compensation circuit,” IEEE Journal of Solid-State Circuits, vol. 39, June
2004.
[138] Y. Jung, S. Lee, D. Shim, W. Kim, C. Kim, and S. Cho, “A dual-loop delay-
locked loop using multiple voltage-controlled delay lines,” IEEE Journal of
Solid-State Circuits, vol. 36, pp. 784-791, May 2001.
[139] J. Kim, S. Lee, T. Jung, C. Kim, S. Cho, and B. Kim, “A low-jitter mixed-
mode DLL for high-speed DRAM applications,” IEEE Journal of Solid-
State Circuits, vol. 35, pp. 1430-1436, Oct. 2000.
[140] G. Dehng, J. Lin, and S. Liu, “A fast-lock mixed-mode DLL using a 2-b SAR
algorithm,” IEEE Journal of Solid-State Circuits, vol. 36, pp. 1464-1471,
Oct. 2001.
[141] B. Razavi, Design of Analog CMOS Integrated Circuits. New York:
McGraw-Hill, 2001.
[142] T. H. Lee, The Design of CMOS Radio-Frequency Integrated Circuits.
Cambridge, U.K.: Cambridge Univ. Press, 1998.
[143] T. Gabara, W. Fischer, J. Harrington, and W. Troutman, “Forming damped
LRC parasitic circuits in simultaneously switched CMOS output buffers,”
IEEE Journal of Solid-State Circuits, vol. 32, pp. 407-418, March 1997.
[144] M. Dolle, “Analysis of simultaneous switching noise,” in Int. Symposium
Circuits Systems, May 1995, pp. 904-907.
[145] K. Kim et al., “A 1.4 Gb/s DLL using 2nd order charge-pump scheme with
low phase/duty error for high-speed DRAM application,” in IEEE Int. Solid-
State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2004, pp. 212-523.
[146] S. Lee et al., “A 1.6 Gb/s/pin double data rate SDRAM with wave-pipelined
CAS latency control,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, 2004, pp. 210-213.
Glossary
circuit includes a bias network, which helps to set and hold the equilibration
level to a known voltage (generally VDD/2) prior to Sensing.
Feature Size Generally refers to the minimum realizable process dimension.
In the context of DRAM design, however, feature size equates to a
dimension that is half of the digitline or wordline layout pitch.
Folded DRAM Array Architecture A DRAM architecture that uses non-
crosspoint-style memory arrays in which a memory cell is placed only at
alternating wordline and digitline intersections. Digitline pairs, for
connection to the sense amplifiers, consist of two adjacent digitlines from a
single memory array. For layout efficiency, each sense amplifier connects
to two adjacent memory arrays through isolation transistor pairs.
FPM, Fast Page Mode A second-generation memory technology permitting
consecutive Reads from an open page of memory, in which the column
address could be changed while CAS* was still LOW.
Gate-Delay Logic In this text, gate-delay logic refers to a section of domino
logic that is first made functionally complete through a conversion of
static signals to dual-rail, monotonic signals. Following conversion, the
logic signals are used as precharge/evaluate terms to downstream domino
gates. This style of logic is used in the command decoder and address
register logic to avoid clock period dependencies for data access latency.
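As an illustration of the dual-rail, monotonic signaling that gate-delay logic relies on, the short Python sketch below models a static-to-dual-rail conversion feeding a domino AND gate. It is a behavioral sketch only; the function names and the AND example are ours, not circuits from this text.

    # Behavioral sketch only (not a circuit from this text).
    # A static bit becomes a (true, complement) rail pair: both rails sit low
    # during precharge, and exactly one rail rises during evaluate, so a
    # downstream domino gate never sees a falling input.
    def to_dual_rail(static_bit, evaluate):
        if not evaluate:
            return (0, 0)                     # precharged state
        return (1, 0) if static_bit else (0, 1)

    def domino_and(a, b):
        # Dual-rail AND: the true rail fires only if both true rails fire;
        # the false rail fires if either false rail fires.
        a_t, a_f = a
        b_t, b_f = b
        return (a_t & b_t, a_f | b_f)

    x = to_dual_rail(1, evaluate=True)
    y = to_dual_rail(0, evaluate=True)
    print(domino_and(x, y))                   # (0, 1): the AND result is logic 0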
Helper Flip-Flop, HFF A positive feedback (regenerative) circuit for
amplifying the signals on the I/O lines.
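A toy behavioral model (ours, not a circuit from this text) captures the regenerative idea: a small I/O-line differential is repeatedly amplified and clamped to the rails until the pair reaches full logic levels.

    # Toy model of regenerative (positive-feedback) amplification; illustrative
    # only, with assumed supply and gain values.
    def hff_regenerate(io, io_b, vdd=1.8, gain=4.0, steps=8):
        clamp = lambda v: min(vdd, max(0.0, v))
        for _ in range(steps):
            diff = io - io_b
            io = clamp(vdd / 2 + gain * diff / 2)
            io_b = clamp(vdd / 2 - gain * diff / 2)
        return io, io_b

    # A 50 mV split around mid-rail resolves to roughly (VDD, 0 V).
    print(hff_regenerate(0.925, 0.875))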
I/O Devices MOSFET transistors that connect the array digitlines to the
I/O lines (through the sense amplifiers). Read and Write operations
from/to the memory arrays always occur through I/O devices.
Intrinsic Delay Also called intrinsic forward path delay. The intrinsic
delay represents the propagation delay of the critical timing path from
inputs (e.g., CLK) to outputs (e.g., DQs), excluding the insertion delay
from timing adjustment circuits (e.g., DLL).
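One way to see why the intrinsic delay matters: in a DLL-based interface, the loop inserts just enough delay that the intrinsic delay plus the insertion delay spans a whole number of clock cycles, aligning the DQs with CLK. The sketch below is illustrative only; the function name and the numbers are assumed, not taken from this text.

    import math

    # Illustrative only: compute the insertion delay a DLL must add so that
    # (intrinsic delay + insertion delay) equals an integer number of tCK.
    def dll_insertion_delay(t_intrinsic_ns, t_ck_ns):
        n_cycles = math.ceil(t_intrinsic_ns / t_ck_ns)  # smallest whole-cycle target
        return n_cycles * t_ck_ns - t_intrinsic_ns

    # Assumed numbers: 2.1 ns intrinsic delay at tCK = 1.25 ns (800 MHz clock)
    # -> target of 2 cycles (2.5 ns), so the DLL inserts about 0.4 ns.
    print(dll_insertion_delay(2.1, 1.25))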
Isolation Devices MOSFET transistors that isolate array digitlines from
the sense amplifiers.
Matched Routing A layout technique in which related signals are routed with
matched lengths and loading so that their original timing relationship is
preserved after signal distribution and placement.
Mbit, Memory Bit A memory cell capable of storing one bit of data. In
modern DRAMs, the mbit consists of a single MOSFET access transistor
and a single storage capacitor. The gate of the MOSFET connects to the
wordline or rowline, while the source and drain of the MOSFET connect
to the storage capacitor and the digitline, respectively.
Index
SSTL. See Stub series terminated logic
Stability, 47, 51, 118, 154, 165, 253, 268, 270-271, 284-285, 290
Static column mode, 7, 13-14
Stub series terminated logic (SSTL), 112-113, 222-223
Subarray, 104, 107, 189-190
Subthreshold leakage, 48-49, 58
Successive-approximation register
tASR. See Row setup time
tAWD. See Address to Write delay time
tCAC. See Access time from CAS*
tCAH. See Column hold time
tCCD. See Column cycle time
tCK. See Clock cycle time
tDQSCK. See Read strobe (DQS) access time from CLK/CLK*
tDQSQ. See DQS-DQ skew