Design Techniques For CMOS 1694104127
ABSTRACT Wireline receivers continue to target higher data rates, posing great challenges at circuit
and architecture levels. Governed by tradeoffs among speed, power consumption, and channel loss (CL),
receiver designs can benefit from new methods that push the performance envelope. This paper presents
a number of techniques that allow non-return-to-zero data rates as high as 40 and 56 Gb/s in 45-nm and
28-nm CMOS technologies, respectively. The prototypes operate with a CL of 19-25 dB and a bit error
rate of less than $10^{-12}$.
INDEX TERMS Continuous-time linear equalizer (CTLE), demultiplexers (DMUX), equalization,
feedforward, SERDES, serial links.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 2. (a) Impedance discontinuity along a channel. (b) Resulting notch in the frequency response.
FIGURE 3. Impulse response of a lossy channel.
reports the following values for a section corresponding to a 1-in trace: L1 = 77.25 nH, C1 = 30.9 pF (such that the characteristic impedance, $Z_0 = \sqrt{L_1/C_1} = 50\ \Omega$), R1 = 5.55 Ω, R2 = 150 mΩ, L2 = 468.9 pH, R3 = 2 kΩ, C3 = 200 fF, R4 = 100 Ω, and C4 = 80 fF. The trace simulations in [13] suggest a reasonable agreement with this model. Additional RL and RC branches can be included so as to refine the model. Fig. 1(b) plots the magnitude response of a channel consisting of 12 such sections, displaying a loss of 21 dB at 28 GHz. In this paper, the term “loss” will refer to that at the Nyquist rate.

We also wish to study the effect of impedance discontinuities on the link performance. We observe that such a nonideality can lead to deep notches in the channel frequency response. As an example, consider the scenario depicted in Fig. 2(a), where Zp denotes a parasitic impedance at some point along the channel, e.g., at a connector, but the link is otherwise ideal. Since the impedance seen to the right of node X is equal to Z0, we note that Zp||Z0 is transformed by the transmission line on the left to create Zin. The impedance rotation by a length of L1 can move Zp||Z0 to a high Zin, thus lowering the power delivered by Vin to the line and causing a notch in the frequency response [Fig. 2(b)]. In the time domain, the data experiences reflection at node X, a benign effect if RS = Z0. In other words, even though the reflection is absorbed on the TX side, the removal of the signal energy by the discontinuity still demands compensation.

The frequency-domain view of the channel proves useful for the design of circuits such as continuous-time linear equalizers (CTLEs). For discrete-time structures, on the other hand, a time-domain perspective becomes necessary. For example, decision-feedback equalizers (DFEs) are designed according to the impulse response of the channel. Plotted in Fig. 3 is such a response, where TB denotes the unit interval (UI), i.e., the bit or symbol period. The precursor at −TB and the postcursors at TB, 2TB, etc., introduce intersymbol interference (ISI).

B. RECEIVER ARCHITECTURES
In the past decade, two general RX architectures have become common [1], [2], [3], [4], [5], [6], [7], [8], [9]. In “analog” receivers, equalization and clock and data recovery (CDR) occur in the analog domain. Fig. 4(a) illustrates this approach, which is better suited to NRZ data. A CTLE provides some high-frequency boost so as to partially compensate for the channel, and the result is applied to a DFE for further equalization. In addition, a CDR circuit senses the data and generates a clock with proper frequency and phase values for driving the DFE and the data demultiplexer (DMUX). Even though this architecture incorporates latches in the DFE, the CDR, and the DMUX, it is still considered an analog solution as most of its building blocks are crafted by analog designers.

The second architecture employs an analog-to-digital converter (ADC) and delegates some of the functions to the digital domain [Fig. 4(b)]. Called “ADC-based” receivers, such systems are suited to PAM4 data, especially for channel losses (CLs) greater than 20 dB. They do incorporate a CTLE in the front end so as to provide a boost of 10–20 dB, thus relaxing the ADC resolution to some extent. The ADC

FIGURE 1. (a) One section of a scalable channel model. (b) Loss profile for 12 cascaded stages.
FIGURE 9. Extensive use of feedforward in a CTLE.
FIGURE 10. (a) CTLE frequency response for different configurations: (1) no feedforward, (2) with Gmf1, (3) with Gmf1 and Gmf2, and (4) with all feedforward paths, and (b) corresponding responses for the channel-CTLE cascade.
it leads to its broadened counterpart at the output. Thus, p(t) − αp(t − TB) produces less ISI. Implementing the operation as shown in Fig. 11(b), we write $Y = (1 - \alpha z^{-1})X$ and

FIGURE 11. (a) Illustration of FFE, (b) its implementation, and (c) its effect on long runs.
FIGURE 13. Five sources of error in a DFE.
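The two-tap FFE relation Y = (1 − αz⁻¹)X can be illustrated with a short Python sketch. The UI-spaced pulse response and the value α = 0.5 below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not taken from this work.

```python
def ffe_two_tap(x, alpha):
    """Apply y[n] = x[n] - alpha * x[n-1], i.e., Y = (1 - alpha * z^-1) X."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y

def isi_ratio(pulse):
    """Sum of the |non-main cursors| divided by the |main cursor|."""
    main = max(abs(v) for v in pulse)
    return (sum(abs(v) for v in pulse) - main) / main

# Hypothetical UI-spaced pulse response with a long postcursor tail (assumption).
pulse = [0.0, 1.0, 0.5, 0.25, 0.125, 0.0]

equalized = ffe_two_tap(pulse, alpha=0.5)
print(equalized)             # [0.0, 1.0, 0.0, 0.0, 0.0, -0.0625]
print(isi_ratio(pulse))      # 0.875
print(isi_ratio(equalized))  # 0.0625
```

In this toy case the subtraction of the delayed, scaled pulse removes most of the postcursor tail and leaves only a small residual cursor.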
That is, if CB is not much less than CA, then the circuit also displays an infinite impulse response (IIR) tap equal to CB/(CA + CB).

V. DFE DESIGN
DFE architectures have been studied extensively. For most, the loop around the first tap must “close” in 1 UI regardless of the clock rate/data rate ratio. (In “unrolled” or “speculative” topologies, a loop consisting of a multiplexer still dictates a 1-UI timing budget [19].)

A. EYE OPENING CONSIDERATIONS
The eye height observed at the DFE summing junction must be large enough to satisfy the target bit error rate (BER), e.g., $10^{-12}$. As shown in Fig. 13, five imperfections must be discounted from this height. These include 1) VOS1: the CTLE and summer dc offsets; 2) VOS2: the flipflop (FF) input-referred offset; 3) Vn1: the CTLE and summer noise;

Footnote 2. Additionally, the FF kickback noise and hysteresis become problematic in some implementations.

B. PROPOSED DFE TOPOLOGIES
A number of circuit and architecture techniques can improve the performance of high-speed DFEs. We begin by applying the concept of charge steering to summation and latching in a half-rate/quarter-rate environment. Consider the topology shown in Fig. 14(a), where half-rate data streams Dodd and Deven drive the summers and 2-to-1 DMUXs. The quarter-rate outputs of each DMUX are then multiplexed, scaled, and subtracted from the input data in the other path. The DMUX and MUX stages utilize the quadrature phases of the quarter-rate clock, generated by a ÷2 circuit that receives the half-rate clock. Illustrated in Fig. 14(b), the circuit implementation employs charge-steering differential pairs for the summer, the latch, and the MUX/tap 1 combination [16]. Moreover, the summer exploits RC degeneration so as to provide a few dB of boost.
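The cross-coupling between the odd and even paths can be sketched behaviorally in Python. This is only a one-tap, bit-level model under stated assumptions: the tap weight h1 = 0.6, the bit pattern, and the function names are hypothetical, and the quarter-rate DMUX/MUX stages and all circuit-level effects of Fig. 14 are omitted.

```python
def slicer(value):
    """Binary decision: +1 or -1."""
    return 1 if value >= 0 else -1

def full_rate_dfe(samples, h1):
    """Reference one-tap DFE: subtract h1 times the previous decision."""
    decisions, summed, prev = [], [], 0
    for x in samples:
        s = x - h1 * prev
        prev = slicer(s)
        summed.append(s)
        decisions.append(prev)
    return decisions, summed

def half_rate_dfe(samples, h1):
    """Odd/even paths: each subtracts h1 times the other path's last decision."""
    last = {"even": 0, "odd": 0}
    decisions, summed = [], []
    for n, x in enumerate(samples):
        path, other = ("even", "odd") if n % 2 == 0 else ("odd", "even")
        s = x - h1 * last[other]
        last[path] = slicer(s)
        summed.append(s)
        decisions.append(last[path])
    return decisions, summed

H1 = 0.6                                   # hypothetical first postcursor
bits = [1, -1, -1, 1, 1, 1, -1, 1]         # hypothetical data pattern
received = [b + H1 * (bits[n - 1] if n > 0 else 0) for n, b in enumerate(bits)]

print(full_rate_dfe(received, H1)[0] == half_rate_dfe(received, H1)[0])
print(min(abs(v) for v in received), min(abs(v) for v in half_rate_dfe(received, H1)[1]))
```

In this model the half-rate arrangement produces the same decisions as a full-rate loop, because the most recent decision always resides in the other path.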
A remarkable attribute of this architecture is its relaxed first-tap timing budget. In a conventional loop, we must have tCK−Q + tMUX + tsum + tsetup < 1 UI, where the four terms, respectively, denote the flipflop clock-to-Q delay, the MUX delay, the summing node delay, and the FF setup time. In the charge-steering realization, on the other hand, we have tCK−Q < 1 UI, where tCK−Q is the delay from CK1 to the output of the latch [16]. This constraint does not include a setup time because, in contrast to continuous-time current-mode latches, here the input data need not propagate to the precharged drain nodes of the MUX before this stage is clocked.

It is possible to reach a similar timing budget by injecting the feedback signal into the output of the first latch in the FF [22]. But this is not possible in the half-rate architecture of Fig. 14(a).

In addition to charge steering, we investigate greater interactions between the CTLE and the DFE to open the eye further. In contrast to conventional cascades, wherein the CTLE drives the DFE unilaterally and only at one port, we can envision some feedforward and feedback paths between the two [12]. Depicted in Fig. 15(a) is a full-rate example: we allow a high-pass feedforward branch, G(s), to inject the CTLE output into the summing junction. Furthermore, we create a high-pass feedback branch, H(s), that returns the slicer output to Dsum (Loop 2). If G(s) = αs and H(s) = βs, we have

$$D_{sum}(n) = (1 + \alpha s)D_{in} - (h_1 + \beta s)D_{out}(n - 1). \qquad (8)$$

The high-frequency boost thus imparted to Din and Dout improves the performance, a point that can be verified in the time domain as well. From the waveforms shown in Fig. 15(b), we observe that αdDin/dt and βdDout/dt pulsate only on the data edges. Upon adding these derivatives to the summer output, we note that the rise and fall times are shortened. If two consecutive bits are the same, Dsum exhibits a kink due to βdDout/dt (e.g., at t = t3), a benign effect as the kink occurs at bit boundaries.

The proposed feedforward and feedback techniques readily lend themselves to circuit implementation. As shown in Fig. 16(a), dDin/dt is available at node P within the CTLE and travels through Gm stages to reach the summing junctions. For dDout/dt, we first multiplex the quarter-rate outputs

FIGURE 14. (a) Half-rate/quarter-rate DFE and (b) its charge-steering implementation.
FIGURE 15. (a) Use of high-pass branches in a DFE and (b) associated waveforms.
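The edge-sharpening effect of adding α·dDin/dt to the summer output can be checked with a simple time-domain sketch in Python. It assumes, purely for illustration, that the data edge at the summing junction follows a one-pole step response; the time constant of 10 ps and the weight α = 5 ps are hypothetical values, not taken from this work.

```python
import math

def step_response(tau, t):
    """Unit-step response of a one-pole low-pass with time constant tau."""
    return 1.0 - math.exp(-t / tau)

def boosted_response(tau, alpha, t):
    """x(t) + alpha * dx/dt for the same step, using the analytic derivative."""
    derivative = math.exp(-t / tau) / tau
    return step_response(tau, t) + alpha * derivative

def time_to_reach(level, waveform, t_stop, steps=20000):
    """First time on a uniform grid over [0, t_stop] where waveform(t) >= level."""
    for k in range(steps + 1):
        t = t_stop * k / steps
        if waveform(t) >= level:
            return t
    return None

TAU = 10e-12    # hypothetical summing-node time constant (assumption)
ALPHA = 5e-12   # hypothetical feedforward weight, in seconds (assumption)

plain = time_to_reach(0.9, lambda t: step_response(TAU, t), 100e-12)
boosted = time_to_reach(0.9, lambda t: boosted_response(TAU, ALPHA, t), 100e-12)
print("time to 90% without feedforward:", plain)
print("time to 90% with feedforward:   ", boosted)
```

With these numbers the edge reaches 90% of its final value after about τ·ln 5 rather than τ·ln 10. Only the feedforward term of (8) is modeled here; the feedback term βs would require the decision waveform as well.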
FIGURE 16. (a) Use of a high-pass signal within the CTLE. (b) Use of the same node for high-pass DFE feedback. (c) Addition of second tap.
FIGURE 17. (a) Modified charge-steering latch and (b) improved summer.
in Fig. 17(a), where a cascode pair, M5–M6, and two cross-coupled pairs, M3–M4 and M7–M8, boost the output voltage swings [11]. These transistors play the following roles: the first pair isolates X and Y from the large capacitance at P and Q, raising the voltage gain from Vin to these nodes; the second pair also increases this gain by means of regeneration; the third pair restores the high level at P or Q to VDD, avoiding the CM drop observed in Fig. 5(a).

The second method relates to the DFE summing node itself. As illustrated in Fig. 17(b), we attach two cross-coupled pairs to this interface, thus increasing the eye height by 50% [12]. The continuous-time CM drop caused by I1 at A and B is less than 20 mV in the 18-ps evaluation mode of the 56-Gb/s RX.

We quantify the improvements afforded by some of our proposed techniques for the 56-Gb/s RX in the presence of a CL of 25 dB. Fig. 18 illustrates the incremental improvements due to each concept. The eye width increases from 18.5 to 25 ps, and the eye height from 55 to 200 mV.

VI. 40-Gb/s AND 56-Gb/s RECEIVERS
The 40-Gb/s and 56-Gb/s NRZ RX examples reported here operate with a CL of 19–25 dB at the Nyquist frequency. The former’s architecture is shown in Fig. 19 [11]. A single CTLE stage drives DMUX1, the DTLE, and the DFE, which consists of two summers, latches L1–L8, and MUX1–MUX4. The retimed and demultiplexed return-to-zero (RZ) data is converted to NRZ as described in [14].

The CDR utilizes the signals processed by DMUX1 and the DFE to reduce the number of latches that it requires [11]. Specifically, XOR3 measures the phase difference between Dodd and Deven, while XOR1 and XOR2 generate a constant-width pulse on Vref for each data transition. The resulting difference, Verr − Vref, uniquely represents the phase error regardless of the data pattern.

The 56-Gb/s RX is depicted in Fig. 20 [12]. (For simplicity, the second DFE tap is not shown.) In this case, the higher speed is accommodated by driving the CDR from node Q in the CTLE so that CCDR negligibly affects the signal path’s bandwidth. The data presented to the CDR thus displays a high-pass spectrum, but it still allows locking [12].

The half-rate PD requires quadrature clocks at 28 GHz, a condition fulfilled by simply delaying the output of a differential LC oscillator by a self-biased inverter. It is shown that this stage’s delay variability does not affect the PD gain significantly [12].

VII. EXPERIMENTAL RESULTS
This section presents the measured results for the 40-Gb/s and 56-Gb/s NRZ receivers. The prototypes have been mounted directly on printed-circuit boards and tested on a high-speed probe station. Unless otherwise stated, all measurements are carried out with a 1-V supply at the full data rate and with a pseudo-random bit sequence (PRBS) pattern of $2^7 - 1$. Fig. 21 depicts a test setup example for characterizing receivers. A BER tester (BERT) generates NRZ data, which is then subjected to a lossy channel such as M8049A.
The result drives the device under test (DUT) and the output is captured by an oscilloscope. The recovered clock too is monitored on a spectrum analyzer.

FIGURE 22. 40-Gb/s RX die photograph.
FIGURE 23. (a) Two channel profiles used in measurements and (b) received eye diagram (with the gray profile).
FIGURE 24. (a) RX output eye diagram at 10 Gb/s and (b) equalizer bathtub curves.
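For readers who wish to reproduce a PRBS pattern of length 2^7 − 1 in simulation, a minimal Python sketch of a seven-bit linear-feedback shift register follows. The text does not state which generator polynomial the BERT uses; the commonly used x^7 + x^6 + 1 is assumed here, and the seed and function name are likewise hypothetical.

```python
def prbs7(seed=0x7F, length=127):
    """Generate PRBS7 bits with feedback from the two oldest register bits.

    Assumes the commonly used polynomial x^7 + x^6 + 1; the seed must be
    nonzero, otherwise the register stays at zero forever.
    """
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be nonzero")
    bits = []
    for _ in range(length):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return bits

sequence = prbs7(length=254)
print(len(sequence), sum(sequence[:127]))
```

The sequence repeats every 127 bits, and each period contains 64 ones and 63 zeros, as expected of a maximal-length sequence. Mapping the bits to ±1 levels gives an NRZ stimulus for a behavioral channel or equalizer model.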
We begin with the RX path measurements using an external clock. The output data at 10 Gb/s is depicted in Fig. 24(a) and the equalizer bathtub curve in Fig. 24(b). The horizontal eye opening is 0.28 UI. Part of the eye closure arises from the PRBS generator’s 8-ps rms jitter. Also shown is the bathtub curve for an input data rate of 20 Gb/s and the gray loss profile in Fig. 23(a), demonstrating that charge-steering circuits can accommodate a wide range of frequencies.

TABLE 1. Performance summary and comparison for 40-Gb/s RX.
The complete RX is characterized for jitter generation,
transfer, and tolerance while it equalizes the dispersed data.
The CDR bandwidth is set to 20 MHz unless otherwise
stated. Fig. 25(a) and (b) plots the recovered clock spectrum
and waveform, respectively. For phase noise measurements,
the 20-GHz clock is divided by 2 off-chip, yielding the
profile illustrated in Fig. 26. The integrated jitter amounts
to 515 fs rms from 100 Hz to 1 GHz.
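The conversion from a phase noise profile to an integrated rms jitter can be sketched in Python as follows. The sketch uses the standard relation between single-sideband phase noise L(f) and rms jitter, with straight-line interpolation on log-log axes between the listed offsets. The profile below is hypothetical and is not the measured data of Fig. 26, and the function names are illustrative.

```python
import math

def integrate_power_law(f1, s1, f2, s2):
    """Integrate S(f) from f1 to f2, assuming a straight line on log-log axes."""
    log_ratio = math.log(f2 / f1)
    a = math.log(s2 / s1) / log_ratio + 1.0   # slope + 1
    if a == 0.0:
        return s1 * f1 * log_ratio            # exact 1/f segment
    return s1 * f1 * math.expm1(a * log_ratio) / a

def rms_jitter(freqs_hz, pn_dbc_hz, f_carrier_hz):
    """RMS jitter in seconds from single-sideband phase noise in dBc/Hz."""
    linear = [10.0 ** (p / 10.0) for p in pn_dbc_hz]
    area = 0.0
    for i in range(len(freqs_hz) - 1):
        area += integrate_power_law(freqs_hz[i], linear[i],
                                    freqs_hz[i + 1], linear[i + 1])
    phase_rms_rad = math.sqrt(2.0 * area)     # factor 2 accounts for both sidebands
    return phase_rms_rad / (2.0 * math.pi * f_carrier_hz)

# Hypothetical offsets (Hz) and phase noise levels (dBc/Hz) -- assumptions only.
offsets = [1e2, 1e4, 1e6, 1e8, 1e9]
levels = [-60.0, -95.0, -110.0, -125.0, -140.0]

# 10 GHz corresponds to the 20-GHz recovered clock after the off-chip divide-by-2.
print(rms_jitter(offsets, levels, 10e9))
```

An ideal divide-by-2 lowers the phase noise in dBc/Hz but leaves the jitter in seconds unchanged, which is why the carrier frequency passed to the function should match the clock on which the profile was actually measured.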
Fig. 27 shows the measured jitter transfer and tolerance for different CDR bandwidths. The latter improves as the BW increases, reaching 0.45 UIpp at 5 MHz with 19 dB of CL. (The maximum jitter amplitude of 20 UI is dictated by the equipment.)

Table 1 summarizes and compares the performance.

B. 56-Gb/s RX
This RX has been fabricated in TSMC’s 28-nm technology. Fig. 28 shows the die with an active area of 250 μm × 275 μm. The tests are carried out with Keysight’s boards, M8049A-002 and M8049A-003, which, along with a 30-in cable, provide the loss profiles plotted in Fig. 29. To these losses at 28 GHz, we add 1.7 dB to account for the probes and the interconnects.

We first report the RX performance while the CDR is disabled and an external 28-GHz clock is used. In this
FIGURE 25. (a) Measured recovered clock spectrum and (b) its waveform.
FIGURE 26. Measured phase noise of recovered clock.
FIGURE 28. 56-Gb/s RX die photograph.
FIGURE 31. Eye diagrams (a) received from the channel and (b) at RX output.
FIGURE 32. (a) Recovered clock waveform and (b) its spectrum.