Design Techniques For CMOS 1694104127

The document discusses techniques for designing CMOS wireline receivers up to 56 Gbps. It presents concepts that allow non-return-to-zero data rates as high as 40 and 56 Gbps in 45nm and 28nm CMOS technologies. The designs operate with a channel loss of 19-25 dB and a bit error rate below 10^-12.


Received 8 March 2023; revised 10 May 2023; accepted 5 June 2023.

Date of publication 28 June 2023; date of current version 2 August 2023.


Digital Object Identifier 10.1109/OJSSCS.2023.3290551

Design Techniques for CMOS Wireline NRZ Receivers Up To 56 Gb/s
BEHZAD RAZAVI (Fellow, IEEE)
(Invited Paper)
Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA 90095, USA
CORRESPONDING AUTHOR: B. RAZAVI (e-mail: [email protected])
This work was supported in part by Realtek Semiconductor and in part by Oracle.

ABSTRACT Wireline receivers continue to target higher data rates, posing great challenges at circuit and architecture levels. Governed by tradeoffs among speed, power consumption, and channel loss (CL), receiver designs can benefit from new methods that push the performance envelope. This paper presents a number of techniques that allow non-return-to-zero data rates as high as 40 and 56 Gb/s in 45-nm and 28-nm CMOS technologies, respectively. The prototypes operate with a CL of 19–25 dB and a bit error rate of less than $10^{-12}$.

INDEX TERMS Continuous-time linear equalizer (CTLE), demultiplexers (DMUX), equalization, feedforward, SERDES, serial links.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

118 VOLUME 3, 2023

I. INTRODUCTION

The growing demand for greater throughput rates in data centers and edge computing presents significant challenges to physical layer designers. Wireline transceivers have been under intense development [1], [2], [3], [4], [5], [6], [7], [8], [9], targeting speeds as high as 224 Gb/s. This trend is also accompanied by issues regarding the power consumption, both in absolute value (which dictates packaging and heat removal costs) and as the amount of energy per bit (which determines the efficiency of serialization and hence the number of lanes).

This paper serves as a companion to [10] and describes receiver (RX) design techniques that can improve the achievable data rate while saving power. The ideas are presented in the context of 40-Gb/s [11] and 56-Gb/s [12] receivers operating with non-return-to-zero (NRZ) data. Realized in 45-nm and 28-nm technologies, respectively, the designs demonstrate concepts that can lead to higher speeds in more advanced process nodes.

II. GENERAL CONSIDERATIONS

A. CHANNEL CHARACTERIZATION

The design of a wireline receiver is dictated by the properties of the channel that precedes it. Imperfections, such as loss and impedance discontinuities, "distort" the data as it travels through the channel, requiring that the RX provide sufficient compensation for successful data recovery.¹ We must therefore employ a reasonably realistic channel model in our RX design efforts.

¹ The transmitter also offers a modest amount of compensation for the channel.

A given channel can be modeled by an electromagnetic field simulator or a network analyzer, with the results typically expressed as S-parameters. In transceiver design, however, we prefer a scalable model so that the link behavior can be assessed for different amounts of loss. The scalability proves especially critical to the design of RX building blocks as it reveals the limits of their performance.

Copper media, such as printed-circuit-board traces, suffer from three nonidealities: 1) loss due to skin effect; 2) loss due to the dielectric underneath or surrounding the signal line; and 3) impedance discontinuities arising from connectors and line cards. The first two require a frequency-dependent model, as exemplified by the section shown in Fig. 1(a) [13]. Obtained empirically from simulations of 50-Ω traces on FR4 boards, this scalable representation accounts for skin effect by $R_2$ and $L_2$ (at high frequencies, the series resistance rises from $R_1 \| R_2$ to $R_1$) and dielectric loss by $R_3$ and $R_4$. As an example, [13]


reports the following values for a section corresponding to a 1-in trace: $L_1 = 77.25$ nH, $C_1 = 30.9$ pF (such that the characteristic impedance, $Z_0 = \sqrt{L_1/C_1} = 50\ \Omega$), $R_1 = 5.55\ \Omega$, $R_2 = 150$ mΩ, $L_2 = 468.9$ pH, $R_3 = 2$ kΩ, $C_3 = 200$ fF, $R_4 = 100\ \Omega$, and $C_4 = 80$ fF. The trace simulations in [13] suggest a reasonable agreement with this model. Additional RL and RC branches can be included so as to refine the model. Fig. 1(b) plots the magnitude response of a channel consisting of 12 such sections, displaying a loss of 21 dB at 28 GHz. In this paper, the term "loss" will refer to that at the Nyquist frequency.

FIGURE 1. (a) One section of a scalable channel model. (b) Loss profile for 12 cascaded stages.

We also wish to study the effect of impedance discontinuities on the link performance. We observe that such a nonideality can lead to deep notches in the channel frequency response. As an example, consider the scenario depicted in Fig. 2(a), where $Z_p$ denotes a parasitic impedance at some point along the channel, e.g., at a connector, but the link is otherwise ideal. Since the impedance seen to the right of node X is equal to $Z_0$, we note that $Z_p \| Z_0$ is transformed by the transmission line on the left to create $Z_{in}$. The impedance rotation by a length of $L_1$ can move $Z_p \| Z_0$ to a high $Z_{in}$, thus lowering the power delivered by $V_{in}$ to the line and causing a notch in the frequency response [Fig. 2(b)]. In the time domain, the data experiences reflection at node X, a benign effect if $R_S = Z_0$. In other words, even though the reflection is absorbed on the TX side, the removal of the signal energy by the discontinuity still demands compensation.

FIGURE 2. (a) Impedance discontinuity along a channel. (b) Resulting notch in the frequency response.

The frequency-domain view of the channel proves useful for the design of circuits such as continuous-time linear equalizers (CTLEs). For discrete-time structures, on the other hand, a time-domain perspective becomes necessary. For example, decision-feedback equalizers (DFEs) are designed according to the impulse response of the channel. Plotted in Fig. 3 is such a response, where $T_B$ denotes the unit interval (UI), i.e., the bit or symbol period. The precursor at $-T_B$ and the postcursors at $T_B$, $2T_B$, etc., introduce intersymbol interference (ISI).

FIGURE 3. Impulse response of a lossy channel.

B. RECEIVER ARCHITECTURES

In the past decade, two general RX architectures have become common [1], [2], [3], [4], [5], [6], [7], [8], [9]. In "analog" receivers, equalization and clock and data recovery (CDR) occur in the analog domain. Fig. 4(a) illustrates this approach, which is better suited to NRZ data. A CTLE provides some high-frequency boost so as to partially compensate for the channel, and the result is applied to a DFE for further equalization. In addition, a CDR circuit senses the data and generates a clock with proper frequency and phase values for driving the DFE and the data demultiplexer (DMUX). Even though this architecture incorporates latches in the DFE, the CDR, and the DMUX, it is still considered an analog solution as most of its building blocks are crafted by analog designers.

The second architecture employs an analog-to-digital converter (ADC) and delegates some of the functions to the digital domain [Fig. 4(b)]. Called "ADC-based" receivers, such systems are suited to PAM4 data, especially for channel losses (CLs) greater than 20 dB. They do incorporate a CTLE in the front end so as to provide a boost of 10–20 dB, thus relaxing the ADC resolution to some extent. The ADC


output drives a digital processor performing equalization and data detection. The result also drives a CDR loop containing a phase detector (PD), a digitally controlled oscillator (DCO), and a phase interpolator (PI), which delivers the ADC's sampling clock(s). This RX architecture consumes substantial power in the ADC, the digital processor, and the clock generation and distribution network.

FIGURE 4. (a) Analog and (b) ADC-based receiver architectures.

This paper focuses on analog NRZ receivers. For extra-short-reach or medium-reach links (with a CL of less than 20 dB), this architecture draws markedly less power, an important advantage because a given system contains many more such links than long-reach channels.

C. CHOICE OF CIRCUIT TOPOLOGIES

The analog and mixed-signal processing required in high-speed receivers can be realized by means of current-mode differential and regenerative pairs, but at the cost of significant static power consumption. For most of the operations beyond the CTLE, it is possible to employ "charge steering" [14].

Depicted in Fig. 5(a), a basic charge-steering differential stage replaces the tail current source with a "charge source" consisting of $C_T$, $S_1$, and $S_2$, and also the load resistors with precharge switches $S_3$ and $S_4$. The output nodes are first tied to $V_{DD}$ while $C_T$ is discharged. Next, X and Y are released, and $C_T$ switches into the tail node. The charge then flows from $M_1$, $M_2$, and their drain capacitances, amplifying the input and ceasing when $V_P$ reaches about one threshold below the input common-mode (CM) level. The circuit can serve as an amplifier and/or a latch. A key difference between charge-steering and integrating stages, e.g., that shown in Fig. 5(b), is that, by design, the former does not allow $V_X$ and $V_Y$, and hence $V_X - V_Y$, to collapse to zero whereas the latter does. Thus, the timing margins are more relaxed for charge steering. Moreover, this style can operate across a much wider speed range with no adjustment. Charge steering has been used in a multitude of RX and TX designs to save power [11], [12], [13], [14], [15].

FIGURE 5. (a) Basic charge-steering stage and (b) its integrating counterpart.

D. LINEARITY REQUIREMENTS

The generation of NRZ data in transmitters does not dictate any linearity for their front end unless feedforward equalization is used. In NRZ receivers, on the other hand, some linearity is necessary before the data is sliced by the DFE because channel properties manifest themselves in the received signal amplitude. This issue proves important because we wish to amplify the input so as to maximize the eye height but must also be mindful of nonlinearity.

We investigate this point by considering the simple model shown in Fig. 6(a), where the RX front end is represented by a constant gain, $k$, and a static nonlinear stage [16]. Let us examine the impulse response of the entire chain, noting that $h_{in}(t)$ is that of the channel, which is then amplified by a factor of $k$. The result is subjected to compressive nonlinearity and exhibits a main cursor equal to $h_0$ and a first postcursor equal to $h_1$. In other words, nonlinearity equivalently raises the normalized postcursor level.

The nonlinearity is modeled by $y = \alpha_1 x + \alpha_3 x^3$ and thus has an input 1-dB compression point $A_{1dB} = \sqrt{0.145|\alpha_1/\alpha_3|}$. Suppose $h_{in}(t)$ in Fig. 6(b) contains a main cursor equal to $\beta_m A_{1dB}$ and a first postcursor given by $\beta_1 A_{1dB}$. It can be shown that [16]

$$\frac{h_1}{h_0} = \left(\frac{\beta_1}{\beta_m}\right)^3 \frac{(\beta_m/\beta_1)^2 - 0.145\beta_m^2}{1 - 0.145\beta_m^2}. \tag{1}$$
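
As a rough numerical illustration of (1), the following Python sketch evaluates the output postcursor ratio for a fixed input ratio while the normalized main cursor grows. It assumes the memoryless cubic model stated above with compressive $\alpha_3$, under which (1) is algebraically the same as $(\beta_1/\beta_m)(1 - 0.145\beta_1^2)/(1 - 0.145\beta_m^2)$. The function name `postcursor_ratio` and the input ratio of 0.3 are hypothetical choices made only for this example.

```python
def postcursor_ratio(beta_m: float, beta_1: float) -> float:
    """Return h1/h0 at the output of the cubic stage, in the form of (1).

    beta_m and beta_1 are the input main cursor and first postcursor,
    both normalized to the input 1-dB compression point A_1dB.
    """
    if not 0.0 < beta_1 <= beta_m:
        raise ValueError("expected 0 < beta_1 <= beta_m")
    den = 1.0 - 0.145 * beta_m ** 2
    if den <= 0.0:
        raise ValueError("beta_m lies beyond the useful range of the cubic model")
    num = (beta_m / beta_1) ** 2 - 0.145 * beta_m ** 2
    return (beta_1 / beta_m) ** 3 * num / den


if __name__ == "__main__":
    input_ratio = 0.3  # hypothetical beta_1 / beta_m set by the channel
    for beta_m in (0.25, 0.5, 1.0, 1.5):
        ratio = postcursor_ratio(beta_m, input_ratio * beta_m)
        print(f"beta_m = {beta_m:4.2f}: h1/h0 = {ratio:.3f} (input {input_ratio})")
```

For small $\beta_m$ the output ratio stays close to the input ratio, and it grows as the main cursor approaches and exceeds the compression point, which is the trend the text describes.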

As the front-end gain and hence $\beta_m$ increases, $h_1/h_0$ exceeds the input ratio, $\beta_1/\beta_m$. According to the findings in [16], this effect manifests itself if $\beta_m$ reaches $1.5A_{1dB}$.

FIGURE 6. (a) Receiver containing nonlinearity and (b) effect of nonlinearity on postcursors.

E. CHOICE OF CLOCK RATE

The simplest, most compact receivers operate with a full-rate clock, i.e., one whose frequency is equal to the input data rate. However, the generation and distribution of clocks at high speeds present formidable challenges. For this reason, we opt for half-rate or quarter-rate architectures, at the cost of doubling or quadrupling the hardware, respectively. An immediate consequence is that the CTLE in Fig. 4(a) now sees a greater load capacitance. As a compromise, we select half-rate clocking in the front end.

Half-rate clocking also becomes a natural choice in transceivers where the TX employs such a clock for its last multiplexer stage and the RX utilizes this clock along with phase interpolation to implement the CDR loop.

III. CTLE DESIGN

The CTLE in Fig. 4(a) must provide a high boost factor so as to 1) increase the eye opening at the DFE summing junction and 2) deliver a sufficient swing to the CDR, thus ensuring an adequate PD gain, lock range, and loop bandwidth (BW). We begin with the basic stage shown in Fig. 7(a), and note that the output pole, $\omega_0$, should preferably lie above $\omega_p = 1/(R_S C_S)$ [Fig. 7(b)], allowing the circuit to provide its maximum boost factor, $A_2/A_1 = 1 + g_m R_S/2$. In fact, $\omega_0$ must exceed approximately $2.5\omega_p$ [17], a daunting challenge at high speeds that dictates the use of inductive peaking.

FIGURE 7. (a) Basic CTLE stage, (b) its frequency response, and (c) eye diagram showing inner and outer heights.

The design of the basic CTLE stage entails a tradeoff between the low-frequency gain, $g_m R_D/(1 + g_m R_S/2)$ (also called the "dc" gain), and the boost factor. For the output eye depicted in Fig. 7(c), a greater $R_S$ reduces the outer height, $H_1$, while raising the inner height, $H_2$. An optimum can therefore be achieved for the latter as dictated by the channel. We typically target a low-frequency gain of around 0 dB, thereby facing a boost factor bound of about 6 dB per stage due to the limited voltage headroom.

For higher boost factors, we cascade multiple CTLE stages, bearing in mind the proportional rise in the power consumption and the reduction in the bandwidth. For $n$ identical stages, we have [18]

$$BW_{tot} = BW_0 \sqrt[m]{2^{1/n} - 1} \tag{2}$$

where $BW_0$ denotes the bandwidth of one stage and $m = 4$ for second-order stages. A cascade of two thus suffers from a 20% bandwidth shrinkage, i.e., $\omega_0$ in Fig. 7(b) falls by this amount. For these reasons, typical front-end designs, comprising a CTLE and possibly a variable-gain amplifier, contain no more than three stages.
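
The bandwidth shrinkage predicted by (2) can be checked with a few lines of Python. This is only a sketch of the formula as read here, $BW_{tot}/BW_0 = (2^{1/n} - 1)^{1/m}$ with $m = 4$ for second-order stages; the function name `cascade_bandwidth_ratio` is a hypothetical helper, not something defined in the paper.

```python
def cascade_bandwidth_ratio(n: int, m: int = 4) -> float:
    """Return BW_tot / BW_0 for n identical cascaded stages, following (2).

    m = 4 corresponds to second-order stages, as stated in the text.
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive integers")
    return (2.0 ** (1.0 / n) - 1.0) ** (1.0 / m)


if __name__ == "__main__":
    for n in (1, 2, 3):
        ratio = cascade_bandwidth_ratio(n)
        print(f"n = {n}: BW_tot/BW_0 = {ratio:.3f} "
              f"({100.0 * (1.0 - ratio):.1f}% shrinkage)")
```

With $n = 2$ the ratio is about 0.80, matching the 20% shrinkage quoted above, and a third stage reduces the bandwidth further.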


The boost factor limitations outlined above call for additional high-frequency equalization techniques. We propose the concept of "feedforward" in this regard [12]. Illustrated in Fig. 8(a), the idea is to create a high-pass branch that contributes boost with negligible voltage headroom consumption. Transistors $M_3$ and $M_4$ and inductors $L_1$ and $L_2$ play such a role. The overall response is quantified as

$$\frac{V_{out}}{V_{in}} = -\frac{g_{m1,2}(R_D + L_D s)}{1 + g_{m1,2}\left(\dfrac{R_S}{2} \,\Big\|\, \dfrac{1}{2C_S s}\right)} - g_{m3,4} L_D s \tag{3}$$

where $L_1 = L_2 = L_D$ and the capacitances at the drains are neglected for now. The second term on the right-hand side represents the zero created by feedforward.

FIGURE 8. (a) CTLE using feedforward and (b) its frequency response.

At high frequencies, source degeneration in Fig. 8(a) vanishes and the fraction on the right-hand side of (3) approaches $-g_{m1,2}(R_D + L_D s)$, yielding $V_{out}/V_{in} \approx -g_{m1,2} R_D - (g_{m1,2} + g_{m3,4}) L_D s$. This implies that feedforward raises the apparent value of $L_D$ and could be simply avoided by making $L_D$ larger. The key point, however, is that the load capacitance, $C_L$, constrains the value of $L_D$ if the output pole must lie above the Nyquist frequency. Thus, feedforward provides greater flexibility in shaping the frequency response.

We now consider the capacitances at the drains in Fig. 8(a) and sketch the responses created by the two paths. As shown in Fig. 8(b), the feedforward path is designed such that it dominates as the main path's response reaches a plateau at $\omega_{p1}$. The feedforward path should take over for $\omega > \omega_{p1} = (1 + g_{m1,2} R_S/2)/(R_S C_S)$; i.e., we must have $g_{m3,4} L_D \omega_{p1} < g_{m1,2} R_D$ and hence

$$g_{m3,4} < \frac{g_{m1,2} R_D R_S C_S}{(1 + g_{m1,2} R_S/2) L_D}. \tag{4}$$

The advantages of feedforward become more pronounced if it is applied to both stages of a CTLE. As illustrated in Fig. 9, we exploit all three possible feedforward paths. The stage consisting of $G_{m1}$, $G_{mf1}$, and its RL load is identical to the circuit shown in Fig. 8(a), and so is the stage formed by $G_{m2}$, $G_{mf2}$, and its RL load. The values of $G_{mf1}$ and $G_{mf2}$ follow (4). The path consisting of $G_{mf3}$ and $L_{D2}$ manifests itself as the rest of the circuit approaches a flat response.

FIGURE 9. Extensive use of feedforward in a CTLE.

The performance of CTLEs must be studied in both frequency and time domains. Owing to the significant effect of layout parasitics, we report simulation results for only extracted circuits. The inductors are modeled by RLC networks obtained from Cadence's EMX tool. We also include the input capacitances of the stages fed by the CTLE, namely, the CDR and the DFE. In the frequency domain, we perform two tests and study 1) the stand-alone CTLE and 2) the channel-CTLE cascade. Fig. 10(a) plots the proposed CTLE response as feedforward paths are added to the circuit. We observe that feedforward increases the boost factor by about 7 dB but it also lowers the corresponding frequency. Whether or not this result is acceptable is determined by additional tests. As depicted in Fig. 10(b), we cascade the channel profile of Fig. 1(b) with the CTLE. Notably, the overall response becomes flatter as feedforward branches are inserted, but the 3-dB bandwidth decreases to some extent. The ultimate test examines the eye diagram at the summing junction of the DFE with and without these branches. As explained in Section V-C, the three paths increase the eye height from 55 to 160 mV and the eye width from 18.5 to 20.5 ps.

IV. DISCRETE-TIME LINEAR EQUALIZATION

The notion of boosting high-frequency components can be pursued in the time domain as well. As illustrated in Fig. 11(a), a pulse experiencing the channel's loss is broadened and introduces ISI at $t = T_B = 1$ UI. If this pulse is shifted by 1 UI, scaled by a factor of $\alpha$, and negated,


it leads to its broadened counterpart at the output. Thus, $p(t) - \alpha p(t - T_B)$ produces less ISI. Implementing the operation as shown in Fig. 11(b), we write $Y = (1 - \alpha z^{-1})X$ and recognize that this "feedforward equalizer" (FFE) yields

$$Y = (1 - \alpha)X + \alpha(1 - z^{-1})X \tag{5}$$

where $0 < \alpha < 1$. The input is therefore subjected to two effects.
1) It is scaled by a factor of $1 - \alpha$, suffering from attenuation and displaying smaller low-frequency swings. This can be seen by applying a long sequence of ONEs and noting that they settle to a smaller amplitude [Fig. 11(c)].
2) The input is differentiated and scaled by a factor of $\alpha$, thereby benefiting from high-frequency amplification. The boost factor is equal to $(1 + \alpha)/(1 - \alpha)$. A greater $\alpha$ translates to both a higher "dc" loss and a larger boost factor.

FIGURE 10. (a) CTLE frequency response for different configurations: (1) no feedforward, (2) with $G_{mf1}$, (3) with $G_{mf1}$ and $G_{mf2}$, and (4) with all feedforward paths, and (b) corresponding responses for the channel-CTLE cascade.

FIGURE 11. (a) Illustration of FFE, (b) its implementation, and (c) its effect on long runs.

In TX design, the unit delays necessary for FFE are readily realized by flipflops as the NRZ data can be processed nonlinearly before the final summation point. FFE can also be formed in the analog domain in receivers. We call such a circuit a "discrete-time linear equalizer" (DTLE) [11]. Unlike TX FFEs, however, RX DTLEs process dispersed data and must provide some linearity so as to preserve the channel profile information. That is, they cannot rely on flipflops. Depicted in Fig. 12(a) is a DTLE example where the 1-UI delay is formed by a two-stage passive sampler. If $C_A \gg C_B$, the circuit delays $x(t)$ and scales it by a factor of $\alpha$, but $\alpha$ itself can be realized by ratioing $C_B$ with respect to $C_A$.

As explained in Section II-E, we prefer half-rate operation so as to ease the generation and distribution of clocks. This points to the topology shown in Fig. 12(b) [11], where both DMUX$_1$ and the DTLE are driven by a half-rate clock, $CK_{1/2}$. The odd and even data produced by DMUX$_1$ are delayed by 1 UI, scaled, and injected into the DFE's summing junctions.

We make two remarks. First, DMUX$_1$ in Fig. 12(b) must perform sampling and can thus be merged with the first stage of the DTLE. This leads to the implementation shown in Fig. 12(c), where two-stage sampling is performed in the odd path by $S_3$, $S_4$, and the charge-steering stage, $M_1$–$M_2$, which injects the result into the DFE summing junction. In addition, the charge-steering regenerative pair consisting of $M_3$ and $M_4$ provides a gain of 6 dB. The nonlinearity introduced by this pair is studied in [11].
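
The two effects attributed to the FFE in (5) can be reproduced with a short discrete-time sketch. The Python below applies $y[n] = x[n] - \alpha x[n-1]$ to a long run of ONEs and to an alternating pattern; it is an idealized behavioral model rather than the DTLE circuit, and the names `ffe` and `boost_db` as well as the value $\alpha = 0.25$ are assumptions made only for this illustration.

```python
import math


def ffe(samples, alpha):
    """Apply y[n] = x[n] - alpha * x[n-1], assuming a zero initial state."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    previous = 0.0
    output = []
    for sample in samples:
        output.append(sample - alpha * previous)
        previous = sample
    return output


def boost_db(alpha):
    """Boost factor (1 + alpha) / (1 - alpha), expressed in dB."""
    return 20.0 * math.log10((1.0 + alpha) / (1.0 - alpha))


if __name__ == "__main__":
    alpha = 0.25  # hypothetical tap weight
    long_run = ffe([1.0] * 6, alpha)        # settles to 1 - alpha
    toggling = ffe([1.0, -1.0] * 3, alpha)  # settles to +/-(1 + alpha)
    print("long run of ONEs:", long_run)
    print("alternating data:", toggling)
    print(f"boost factor: {boost_db(alpha):.2f} dB")
```

The long run settles to $1 - \alpha$ while the alternating pattern swings to $\pm(1 + \alpha)$, so raising $\alpha$ trades low-frequency amplitude for boost, as noted in the list above.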


FIGURE 12. (a) Discrete-time linear equalization, (b) half-rate RX using DTLEs ($C_A$ in odd branch tracks while DMUX$_1$ produces $D_{odd}$), and (c) charge-steering realization.

Second, the DTLE transfer function emerges as

$$H(z) = 1 - \alpha \frac{\dfrac{C_A}{C_A + C_B} z^{-1}}{1 - \dfrac{C_B}{C_A + C_B} z^{-2}}. \tag{6}$$

That is, if $C_B$ is not much less than $C_A$, then the circuit also displays an infinite impulse response (IIR) tap equal to $C_B/(C_A + C_B)$.

V. DFE DESIGN

DFE architectures have been studied extensively. For most, the loop around the first tap must "close" in 1 UI regardless of the clock rate/data rate ratio. (In "unrolled" or "speculative" topologies, a loop consisting of a multiplexer still dictates a 1-UI timing budget [19].)

A. EYE OPENING CONSIDERATIONS

The eye height observed at the DFE summing junction must be large enough to satisfy the target bit error rate (BER), e.g., $10^{-12}$. As shown in Fig. 13, five imperfections must be discounted from this height. These include 1) $V_{OS1}$: the CTLE and summer dc offsets; 2) $V_{OS2}$: the flipflop (FF) input-referred offset; 3) $V_{n1}$: the CTLE and summer noise; 4) $V_{n2}$: the FF input-referred noise; and 5) $V_{sen}$: the FF sensitivity. We define $V_{sen}$ as the input difference that allows the FF output to reach roughly 90% of its full swing in 1 UI [19] so that the first tap, $h_1$, completely switches.² If an eye monitor is available in the system, then $V_{OS1}$ and $V_{OS2}$ can be canceled. The BER is expressed as

$$\mathrm{BER} \approx \frac{1}{2} Q\left(\frac{V_{pp}/2 - V_{OS} - V_{sen}}{\sqrt{\overline{V_n^2}}}\right) \tag{7}$$

where $Q$ denotes the Gaussian Q function, $V_{pp}$ denotes the differential eye opening, $V_{OS}$ denotes the total offset (with or without cancellation), and $\sqrt{\overline{V_n^2}}$ denotes the total rms noise referred to the summing junction. An error rate of $10^{-12}$ demands that the argument of the Q function exceed 7.

² Additionally, the FF kickback noise and hysteresis become problematic in some implementations.

FIGURE 13. Five sources of error in a DFE.

In the absence of an eye monitor, $V_{OS}$ in (7) must remain sufficiently small by proper design. For example, with an eye opening of 200 mV$_{pp}$ and a total noise of 5 mV$_{rms}$, the offset must not exceed 65 mV (if $V_{sen}$ is neglected). In practice, we would confine the $3\sigma$ offset to about 30 mV to leave a margin for the sensitivity and other imperfections.

The horizontal eye opening determines how much clock jitter and phase offset the equalizer can tolerate. The acceptable eye width depends, to some extent, upon the height: the greater the latter, the more the clock phase can depart from the center. This relationship is formulated in [13].

B. PROPOSED DFE TOPOLOGIES

A number of circuit and architecture techniques can improve the performance of high-speed DFEs. We begin by applying the concept of charge steering to summation and latching in a half-rate/quarter-rate environment. Consider the topology shown in Fig. 14(a), where half-rate data streams $D_{odd}$ and $D_{even}$ drive the summers and 2-to-1 DMUXs. The quarter-rate outputs of each DMUX are then multiplexed, scaled, and subtracted from the input data in the other path. The DMUX and MUX stages utilize the quadrature phases of the quarter-rate clock, generated by a ÷2 circuit that receives the half-rate clock. Illustrated in Fig. 14(b), the circuit implementation employs charge-steering differential pairs for the summer, the latch, and the MUX/tap 1 combination [16]. Moreover, the summer exploits RC degeneration so as to provide a few dB of boost.
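
Returning to the eye-opening budget of (7), the offset allowance quoted in Section V-A can be reproduced numerically. The sketch below assumes that $Q$ is the Gaussian tail function, computed here from the standard-library `math.erfc`, and that all quantities are expressed in volts. The helper names `q_function`, `ber`, and `max_offset` are hypothetical and introduced only for this example.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def ber(v_pp, v_os, v_sen, v_n_rms):
    """BER estimate in the form of (7); all arguments in volts."""
    return 0.5 * q_function((v_pp / 2.0 - v_os - v_sen) / v_n_rms)


def max_offset(v_pp, v_sen, v_n_rms, q_arg=7.0):
    """Largest offset that keeps the Q-function argument at q_arg."""
    return v_pp / 2.0 - v_sen - q_arg * v_n_rms


if __name__ == "__main__":
    # Example from the text: 200 mVpp eye, 5 mVrms noise, V_sen neglected.
    limit = max_offset(0.200, 0.0, 0.005)
    print(f"maximum offset: {limit * 1e3:.1f} mV")
    print(f"BER at that offset: {ber(0.200, limit, 0.0, 0.005):.2e}")
    print(f"BER with a 30-mV offset: {ber(0.200, 0.030, 0.0, 0.005):.2e}")
```

The first line of output corresponds to the 65-mV bound stated in the text. Confining the offset to 30 mV leaves a large margin in the Q-function argument for the sensitivity and other imperfections.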


FIGURE 14. (a) Half-rate/quarter-rate DFE and (b) its charge-steering implementation.

A remarkable attribute of this architecture is its relaxed first-tap timing budget. In a conventional loop, we must have $t_{CK-Q} + t_{MUX} + t_{sum} + t_{setup} < 1$ UI, where the four terms, respectively, denote the flipflop clock-to-Q delay, the MUX delay, the summing node delay, and the FF setup time. In the charge-steering realization, on the other hand, we have $t_{CK-Q} < 1$ UI, where $t_{CK-Q}$ is the delay from $CK_1$ to the output of the latch [16]. This constraint does not include a setup time because, in contrast to continuous-time current-mode latches, here the input data need not propagate to the precharged drain nodes of the MUX before this stage is clocked.

It is possible to reach a similar timing budget by injecting the feedback signal into the output of the first latch in the FF [22]. But this is not possible in the half-rate architecture of Fig. 14(a).

In addition to charge steering, we investigate greater interactions between the CTLE and the DFE to open the eye further. In contrast to conventional cascades, wherein the CTLE drives the DFE unilaterally and only at one port, we can envision some feedforward and feedback paths between the two [12]. Depicted in Fig. 15(a) is a full-rate example: we allow a high-pass feedforward branch, $G(s)$, to inject the CTLE output into the summing junction. Furthermore, we create a high-pass feedback branch, $H(s)$, that returns the slicer output to $D_{sum}$ (Loop 2). If $G(s) = \alpha s$ and $H(s) = \beta s$, we have

$$D_{sum}(n) = (1 + \alpha s) D_{in} - (h_1 + \beta s) D_{out}(n - 1). \tag{8}$$

FIGURE 15. (a) Use of high-pass branches in a DFE and (b) associated waveforms.

The high-frequency boost thus imparted to $D_{in}$ and $D_{out}$ improves the performance, a point that can be verified in the time domain as well. From the waveforms shown in Fig. 15(b), we observe that $\alpha\, dD_{in}/dt$ and $\beta\, dD_{out}/dt$ pulsate only on the data edges. Upon adding these derivatives to the summer output, we note that the rise and fall times are shortened. If two consecutive bits are the same, $D_{sum}$ exhibits a kink due to $\beta\, dD_{out}/dt$ (e.g., at $t = t_3$), a benign effect as the kink occurs at bit boundaries.

The proposed feedforward and feedback techniques readily lend themselves to circuit implementation. As shown in Fig. 16(a), $dD_{in}/dt$ is available at node P within the CTLE and travels through $G_m$ stages to reach the summing junctions. For $dD_{out}/dt$, we first multiplex the quarter-rate outputs


of the latches so as to obtain full-rate data [Fig. 16(b)]. This topology can be viewed as a direct 4-to-1 MUX, except that it is driven by overlapping quadrature phases. It is shown that charge steering still delivers nonoverlapping charge packets to this output. We then inject the result into node P, granting $L_D$ the task of differentiation. The strength of the injection, i.e., $\beta$, is defined by the amount of charge that each MUX branch draws.

The second DFE tap is accommodated by adding secondary latches to each quarter-rate arm, multiplexing their outputs, and injecting the results into each summing node and node P [Fig. 16(c)].

FIGURE 16. (a) Use of a high-pass signal within the CTLE. (b) Use of the same node for high-pass DFE feedback. (c) Addition of second tap.

One may wonder how precisely one must control the timing alignment of the data that returns to node P in Fig. 16(c). In this work, no adjustment has been included as simulations reveal that this timing is no more critical than that of the main tap. If an eye monitor is present, one can adjust this path's delay for optimum performance.

In contrast to IIR DFEs [20], [21], the proposed method returns the shaped signal to the DFE input rather than to its summing junction. According to the foregoing analysis and simulations, this approach yields a greater eye opening.

FIGURE 17. (a) Modified charge-steering latch and (b) improved summer.

FIGURE 18. Eye height and width improvement due to proposed techniques (A: original design; B: CTLE feedforward 1; C: CTLE feedforward 1 and 2; D: CTLE feedforward 1, 2, and 3; E: DFE high-pass feedback branch; F: DFE high-pass input branch; and G: cross-coupled pair at the summing junction).

C. REFINEMENTS

We incorporate additional circuit techniques to further improve the DFE's performance, striving to maximize the NRZ eye opening at its summing junctions. First, we modify the basic charge-steering latch of Fig. 5 as shown


FIGURE 19. Overall architecture of 40-Gb/s RX.

in Fig. 17(a), where a cascode pair, M5–M6, and two cross-coupled pairs, M3–M4 and M7–M8, boost the output voltage swings [11]. These transistors play the following roles: the first pair isolates X and Y from the large capacitance at P and Q, raising the voltage gain from V_in to these nodes; the second pair also increases this gain by means of regeneration; the third pair restores the high level at P or Q to V_DD, avoiding the CM drop observed in Fig. 5(a).

The second method relates to the DFE summing node itself. As illustrated in Fig. 17(b), we attach two cross-coupled pairs to this interface, thus increasing the eye height by 50% [12]. The continuous-time CM drop caused by I1 at A and B is less than 20 mV in the 18-ps evaluation mode of the 56-Gb/s RX.

We quantify the improvements afforded by some of our proposed techniques for the 56-Gb/s RX in the presence of a CL of 25 dB. Fig. 18 illustrates the incremental improvements due to each concept. The eye width increases from 18.5 to 25 ps, and the eye height from 55 to 200 mV.

VI. 40-Gb/s AND 56-Gb/s RECEIVERS
The 40-Gb/s and 56-Gb/s NRZ RX examples reported here operate with a CL of 19–25 dB at the Nyquist frequency. The former's architecture is shown in Fig. 19 [11]. A single CTLE stage drives DMUX1, the DTLE, and the DFE, which consists of two summers, latches L1–L8, and MUX1–MUX4. The retimed and demultiplexed return-to-zero (RZ) data is converted to NRZ as described in [14].

The CDR utilizes the signals processed by DMUX1 and the DFE to reduce the number of latches that it requires [11]. Specifically, XOR3 measures the phase difference between D_odd and D_even, while XOR1 and XOR2 generate a constant-width pulse on V_ref for each data transition. The resulting difference, V_err − V_ref, uniquely represents the phase error regardless of the data pattern.

The 56-Gb/s RX is depicted in Fig. 20 [12]. (For simplicity, the second DFE tap is not shown.) In this case, the higher speed is accommodated by driving the CDR from node Q in the CTLE so that C_CDR negligibly affects the signal path's bandwidth. The data presented to the CDR thus displays a high-pass spectrum, but it still allows locking [12].

The half-rate PD requires quadrature clocks at 28 GHz, a condition fulfilled by simply delaying the output of a differential LC oscillator by a self-biased inverter. It is shown that this stage's delay variability does not affect the PD gain significantly [12].

VII. EXPERIMENTAL RESULTS
This section presents the measured results for the 40-Gb/s and 56-Gb/s NRZ receivers. The prototypes have been mounted directly on printed-circuit boards and tested on a high-speed probe station. Unless otherwise stated, all measurements are carried out with a 1-V supply at the full data rate and with a pseudo-random bit sequence (PRBS) pattern of 2^7 − 1. Fig. 21 depicts a test setup example for characterizing receivers. A BER tester (BERT) generates NRZ data, which is then subjected to a lossy channel such as M8049A.


FIGURE 20. Overall architecture of 56-Gb/s RX.

FIGURE 21. Test setup example.

FIGURE 22. 40-Gb/s RX die photograph.

The result drives the device under test (DUT) and the output is captured by an oscilloscope. The recovered clock too is monitored on a spectrum analyzer.
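
As an illustration of the 2^7 − 1 PRBS pattern used in these measurements, the following minimal sketch generates such a sequence with a 7-bit linear-feedback shift register. The text does not state which generator polynomial the BERT uses; the sketch assumes the common x^7 + x^6 + 1 choice, and the function name `prbs7` and the default seed are hypothetical.

```python
def prbs7(n_bits, seed=0x7F):
    """Return n_bits of a PRBS7 sequence as a list of 0/1 values.

    Assumption: generator polynomial x^7 + x^6 + 1 (a common PRBS7 choice);
    the paper does not specify the polynomial used by the test equipment.
    """
    if not 0 < seed < 128:
        raise ValueError("seed must be a nonzero 7-bit value")
    state = seed
    out = []
    for _ in range(n_bits):
        # Feedback bit is the XOR of the two most significant register stages.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        out.append(new_bit)
        # Shift left by one and insert the feedback bit at the LSB.
        state = ((state << 1) | new_bit) & 0x7F
    return out


if __name__ == "__main__":
    seq = prbs7(254)
    print("repeats after 127 bits:", seq[:127] == seq[127:])
    print("ones per period:", sum(seq[:127]))
```

The sequence repeats every 127 bits and contains 64 ones and 63 zeros per period, so its longest run is only seven bits. A receiver tested with this pattern is therefore exercised less severely at low frequencies than with longer PRBS patterns.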

A. 40-Gb/s RX
Realized in TSMC's 45-nm technology, the 40-Gb/s RX die is shown in Fig. 22 and occupies an active area of about 110 μm × 175 μm. Another version accepting an external clock has also been fabricated to permit the characterization of the equalizer. We first employ a channel having the black profile shown in Fig. 23(a) and producing the eye in Fig. 23(b).


FIGURE 23. (a) Two channel profiles used in measurements and (b) received eye diagram (with the gray profile).

FIGURE 24. (a) RX output eye diagram at 10 Gb/s and (b) equalizer bathtub curves.
TABLE 1. Performance summary and comparison for 40-Gb/s RX.

We begin with the RX path measurements using an external clock. The output data at 10 Gb/s is depicted in Fig. 24(a) and the equalizer bathtub curve in Fig. 24(b). The horizontal eye opening is 0.28 UI. Part of the eye closure arises from the PRBS generator's 8-ps rms jitter. Also shown is the bathtub curve for an input data rate of 20 Gb/s and the gray loss profile in Fig. 23(a), demonstrating that charge-steering circuits can accommodate a wide range of frequencies.
The complete RX is characterized for jitter generation,
transfer, and tolerance while it equalizes the dispersed data.
The CDR bandwidth is set to 20 MHz unless otherwise
stated. Fig. 25(a) and (b) plots the recovered clock spectrum
and waveform, respectively. For phase noise measurements,
the 20-GHz clock is divided by 2 off-chip, yielding the
profile illustrated in Fig. 26. The integrated jitter amounts
to 515 fs rms from 100 Hz to 1 GHz.
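
The text does not describe how the integrated jitter is obtained from the phase noise profile. A minimal sketch of the standard relation, in which the rms jitter equals sqrt(2 ∫ L(f) df) / (2π f_0), is given below. The function name `integrated_jitter_rms` and the sample profile are hypothetical and are not the measured data of Fig. 26.

```python
import math


def integrated_jitter_rms(offsets_hz, pn_dbc_hz, f_carrier_hz):
    """RMS jitter in seconds from single-sideband phase noise samples.

    offsets_hz   : increasing offset frequencies in Hz
    pn_dbc_hz    : phase noise L(f) in dBc/Hz at those offsets
    f_carrier_hz : frequency of the clock on which L(f) was measured
    """
    if len(offsets_hz) != len(pn_dbc_hz) or len(offsets_hz) < 2:
        raise ValueError("need two or more matching (offset, noise) points")
    # Convert dBc/Hz to a linear power ratio per hertz.
    lin = [10.0 ** (p / 10.0) for p in pn_dbc_hz]
    # Trapezoidal integration in the linear domain; a measured profile
    # should be sampled densely for this to be accurate.
    area = 0.0
    for i in range(1, len(offsets_hz)):
        df = offsets_hz[i] - offsets_hz[i - 1]
        area += 0.5 * (lin[i] + lin[i - 1]) * df
    phase_rms_rad = math.sqrt(2.0 * area)  # factor 2 counts both sidebands
    return phase_rms_rad / (2.0 * math.pi * f_carrier_hz)


if __name__ == "__main__":
    # Hypothetical flat profile, for illustration only.
    jitter = integrated_jitter_rms([0.0, 1e6], [-120.0, -120.0], 10e9)
    print(f"{jitter * 1e15:.1f} fs rms")
```

Ideal frequency division scales the phase noise together with the carrier frequency, so the jitter in seconds is unchanged, provided `f_carrier_hz` is the frequency of the clock on which the profile was actually measured.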
Fig. 27 shows the measured jitter transfer and tolerance for different CDR bandwidths. The latter improves as the BW increases, reaching 0.45 UIpp at 5 MHz with 19 dB of CL. (The maximum jitter amplitude of 20 UI is dictated by the equipment.)

Table 1 summarizes and compares the performance.

FIGURE 25. (a) Measured recovered clock spectrum and (b) its waveform.

FIGURE 26. Measured phase noise of recovered clock.

FIGURE 27. Measured jitter transfer and tolerance.

B. 56-Gb/s RX
This RX has been fabricated in TSMC's 28-nm technology. Fig. 28 shows the die with an active area of 250 μm × 275 μm. The tests are carried out with Keysight's boards, MS8049A-002 and MS8049A-003, which, along with a 30-in cable, provide the loss profiles plotted in Fig. 29. To these losses at 28 GHz, we add 1.7 dB to account for the probes and the interconnects.

FIGURE 28. 56-Gb/s RX die photograph.

We first report the RX performance while the CDR is disabled and an external 28-GHz clock is used. In this measurement, Keysight's M8040A BERT has the capability to emulate a 2-tap TX FFE in the data applied to the channel. Fig. 30 plots the bathtub curves for two cases: 1) for channel A, which has a loss of 25 dB, and no FFE and 2) for channel B, which has a loss of 30 dB, while the BERT implements an FFE function of the form −0.2 + 0.8z^−1. The horizontal eye openings are 0.4 and 0.33 UI, respectively.
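
To make the effect of the FFE function −0.2 + 0.8z^−1 concrete, the sketch below applies these two taps to an NRZ symbol stream and estimates the resulting high-frequency boost. It is an idealized symbol-rate model, not a description of the BERT's implementation; the function names `ffe` and `boost_db` are hypothetical, and symbols before the start of the stream are assumed to be zero.

```python
import math


def ffe(symbols, taps=(-0.2, 0.8)):
    """Apply a symbol-spaced FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:  # samples before the stream start count as zero
                acc += coeff * symbols[n - k]
        out.append(acc)
    return out


def boost_db(taps=(-0.2, 0.8)):
    """Gain at the Nyquist frequency relative to the gain at dc, in dB."""
    dc_gain = abs(sum(taps))
    # At Nyquist the input alternates sign on every symbol.
    nyquist_gain = abs(sum(c * (-1) ** k for k, c in enumerate(taps)))
    return 20.0 * math.log10(nyquist_gain / dc_gain)


if __name__ == "__main__":
    nrz = [1, 1, 1, -1, -1, 1]  # NRZ symbols mapped to +1/-1
    print([round(v, 2) for v in ffe(nrz)])
    print(f"boost = {boost_db():.2f} dB")
```

In this model, a symbol followed by a transition is transmitted with a magnitude of 1.0, whereas repeated symbols are reduced to 0.6. The taps thus provide roughly 4.4 dB of boost at the Nyquist frequency relative to dc.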
FIGURE 29. Measured CL profiles.

We next present results with the CDR enabled. Shown in Fig. 31 are the outputs of channel A and the RX. The BER is less than 10^−12. Fig. 32 plots the recovered clock waveform and spectrum for a CDR noise-shaping bandwidth of 50 MHz. The phase noise profile of Fig. 33 reaches a 100-MHz offset, at which it is equal to −124.4 dBc/Hz. For greater offsets, we measure the phase noise directly from the spectrum, which falls to −128 dBc/Hz at 14-GHz offset. The integrated jitter from 100 Hz to 14 GHz amounts to 100 fs rms.



FIGURE 30. Measured RX bathtub curves.

FIGURE 32. (a) Recovered clock waveform and (b) its spectrum.

FIGURE 33. Measured phase noise of recovered clock.

FIGURE 31. Eye diagrams (a) received from the channel and (b) at RX output.

Fig. 34 plots the CDR jitter transfer for different CLs, obtained by cascading different sections of Keysight's board and different cable lengths. For the 25-dB loss case, the 3-dB BW is around 55 MHz, consistent with the VCO noise-shaping BW observed in Fig. 32(b). The high-pass nature of the CDR input data leads to some peaking for low loss values, but it enables the CDR to achieve bandwidths as high as 25 MHz for a CL of 30 dB.

FIGURE 34. Measured jitter transfer.

Fig. 35 plots the measured CDR jitter tolerance for a loss of 25 dB, yielding a value of 1.1 UIpp at 5 MHz and exceeding the CEI-56G-VSR mask. Table 2 summarizes and compares the performance.

FIGURE 35. Measured jitter tolerance.

TABLE 2. Performance summary and comparison for 56-Gb/s RX.

VIII. CONCLUSION
High-speed wireline receivers present a multitude of challenges, especially for greater CLs. This paper describes methods that improve the performance of CTLEs and DFEs and proposes concepts such as discrete-time linear equalization and charge steering. Collectively, these techniques lead to 40-Gb/s and 56-Gb/s receivers with low power consumption.

ACKNOWLEDGMENT
The author gratefully acknowledges the TSMC University Shuttle Program for chip fabrication.

REFERENCES
[1] Y. Segal et al., “A 1.41pJ/b 224Gb/s PAM-4 SerDes receiver with 31dB loss compensation,” in ISSCC Dig. Tech. Papers, Feb. 2022, pp. 114–115.
[2] T. Ali et al., “A 460mW 112Gbps DSP-based transceiver with 38dB loss compensation for next generation data centers in 7nm FinFET technology,” in ISSCC Dig. Tech. Papers Slide Supplements, Feb. 2020, pp. 118–120.
[3] P. Upadhyaya et al., “A fully adaptive 19-to-56Gb/s PAM-4 wireline transceiver with a configurable ADC in 16nm FinFET,” in ISSCC Dig. Tech. Papers, Feb. 2018, pp. 108–110.
[4] T. Ali et al., “6.4 A 180mW 56Gb/s DSP-based transceiver for high-density IOs in data center switches in 7nm FinFET technology,” in ISSCC Dig. Tech. Papers, Feb. 2019, pp. 118–120.
[5] J. Im et al., “6.1 A 112Gb/s PAM-4 long-reach wireline transceiver using a 36-way time-interleaved SAR-ADC and inverter-based RX analog front-end in 7nm FinFET,” in ISSCC Dig. Tech. Papers, Feb. 2020, pp. 116–118.
[6] J. Im et al., “A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET,” IEEE J. Solid-State Circuits, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
[7] T. Shibasaki et al., “A 56-Gb/s receiver front-end with a CTLE and 1-tap DFE in 20-nm CMOS,” in VLSI Circuits Symp. Dig., Jun. 2014, pp. 1–2.
[8] A. Roshan-Zamir et al., “A 56-Gb/s PAM4 receiver with low-overhead techniques for threshold and edge-based DFE FIR- and IIR-tap adaptation in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 3, pp. 672–684, Mar. 2019.
[9] A. Cevrero et al., “6.1 A 100Gb/s 1.1 pJ/b PAM-4 RX with dual-mode 1-tap PAM-4/3-tap NRZ speculative DFE in 14nm CMOS FinFET,” in ISSCC Dig., Feb. 2019, pp. 112–113.
[10] B. Razavi, “Design techniques for high-speed wireline transmitters,” IEEE Open J. Solid-State Circuits Soc., vol. 1, pp. 53–66, 2021.
[11] A. Manian and B. Razavi, “A 40-Gb/s 14-mW CMOS wireline receiver,” IEEE J. Solid-State Circuits, vol. 52, no. 9, pp. 2407–2421, Sep. 2017.
[12] A. Atharav and B. Razavi, “A 56-Gb/s 50-mW NRZ receiver in 28-nm CMOS,” IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 54–67, Jan. 2022.
[13] S. Gondi and B. Razavi, “Equalization and clock and data recovery techniques for 10-Gb/s CMOS serial links,” IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 1999–2011, Sep. 2007.
[14] J. W. Jung and B. Razavi, “A 25-Gb/s 5-mW CMOS CDR/deserializer,” IEEE J. Solid-State Circuits, vol. 48, no. 3, pp. 684–697, Mar. 2013.
[15] Y. Chang, A. Manian, L. Kong, and B. Razavi, “An 80-Gb/s 40-mW wireline PAM4 transmitter,” IEEE J. Solid-State Circuits, vol. 53, no. 8, pp. 2214–2226, Aug. 2018.
[16] J. W. Jung and B. Razavi, “A 25 Gb/s 5.8 mW CMOS equalizer,” IEEE J. Solid-State Circuits, vol. 50, no. 2, pp. 515–526, Feb. 2015.
[17] B. Razavi, “The design of an equalizer—Part I [the analog mind],” IEEE Solid-State Circuits Mag., vol. 13, no. 4, pp. 7–160, 2021.
[18] R. P. Jindal, “Gigahertz-band high-gain low-noise AGC amplifiers in fine-line NMOS,” IEEE J. Solid-State Circuits, vol. 22, no. 4, pp. 512–521, Aug. 1987.
[19] S. Ibrahim and B. Razavi, “Low-power CMOS equalizer design for 20-Gb/s systems,” IEEE J. Solid-State Circuits, vol. 46, no. 6, pp. 1321–1336, Jun. 2011.
[20] B. Kim, Y. Liu, T. O. Dickson, J. F. Bulzacchelli, and D. J. Friedman, “A 10-Gb/s compact low-power serial I/O with DFE-IIR equalization in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3526–3538, Dec. 2009.
[21] O. Elhadidy and S. Palermo, “A 10-Gb/s 2-IIR-Tap DFE receiver with 35 dB loss compensation in 65-nm CMOS,” in Symp. VLSI Circuits Dig., Jun. 2013, pp. C272–C273.
[22] Y. Lu and E. Alon, “Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
