
PCI Express 3.0 Overview

Jasmin Ajanovic
Sr. Principal Engineer
Intel Corp.

HotChips - Aug 23, 2009


Agenda

PCIe Architecture Overview

PCIe 3.0 Electrical Optimizations

PCIe 3.0 PHY Encoding and Challenges

New PCIe Protocol Features

Summary & Call to Action

2
PCI Express* (PCIe) Interconnect

Physical Interface
  Point-to-point full-duplex
  Differential low-voltage signaling
  Embedded clocking
  Scalable width & frequency
  Supports connectors and cables

Protocol
  Load/Store architecture
  Fully packetized split-transaction
  Credit-based flow control
  Virtual Channel mechanism

Advanced Capabilities
  Enhanced Configuration and Power Management
  RAS: CRC Data Integrity, Hot Plug, advanced error logging/reporting
  QoS and Isochronous support

IO Trends
  Increase in IO Bandwidth
  Reduction in Latency
  Energy Efficient Performance
  Emerging Applications
    Virtualization
    Optimized interaction between Host & IO
    Examples: Graphics, Math, Physics, Financial & HPC Apps.

New Generations of PCI Express Technology

3
PCIe Technology Roadmap

[Chart: aggregate bandwidth (GB/sec) vs. year, 1999-2013, for PCI/PCI-X, PCIe Gen1 @ 2.5GT/s, and PCIe Gen2 @ 5GT/s, with callouts for I/O Virtualization / Device Sharing and for Gen3 (8GT/s signaling; Atomic Ops, Caching Hints; lower latencies, improved PM; enhanced software model). Dotted line indicates projected numbers.]

           Raw Bit Rate   Link BW   BW/lane/way   BW x16
PCIe 1.x   2.5GT/s        2Gb/s     ~250MB/s      ~8GB/s
PCIe 2.0   5.0GT/s        4Gb/s     ~500MB/s      ~16GB/s
PCIe 3.0   8.0GT/s        8Gb/s     ~1GB/s        ~32GB/s

Based on x16 PCIe channel. All dates, timeframes and products are subject to change without further notification.

Continuous Improvement: Doubling Bandwidth & Improving Capabilities Every 3-4 Years!

4
PCIe 3.0 Electrical Interface

5
PCIe 3.0 Electrical Requirements

Compatibility with PCIe 1.x, 2.0

2x payload performance bandwidth over PCIe 2.0

Similar cost structure (i.e. no significant cost adders)

Preserve existing data clocked and common clock architecture support

Maximum reuse of HVM ingredients
  FR4, reference clocks, etc.

Strive for similar channel reach in high-volume topologies
  Mobile: 8", 1 connector
  Desktop: 14", 1 connector
  Server: 20", 2 connectors

6
PCIe Gen3 Solution Space

[Plots: equalization sweeps of eye height (V) across Tx EQ, Rx CTLE (1st order), and Rx DFE tap settings for a 14" client channel and a 20" server channel; 8GT/s lands in the pass region while 10GT/s largely fails. Source: Intel Corporation]

Solution space exists to satisfy 8GT/s client and server channel requirements
Power, channel loss and distortion are much worse at 10GT/s
Similar findings by PCI-SIG members corroborated the Intel analysis

PCI-SIG approved 8GT/s as the PCIe 3.0 bit rate

CTLE = Continuous Time Linear Equalizer
DFE = Decision Feedback Equalizer

7
Enabling Factors for 8G

Scrambling permits a 2x payload rate increase w.r.t. Gen2 with an 8 GT/s data rate
  Scrambling eliminates the 25% coding overhead of 8b/10b
  8G chosen over 10G due to eye margin considerations
More capable Tx de-emphasis
  One post-cursor tap and one pre-cursor tap (2.5 and 5G have 1 post-cursor tap)
  Six selectable presets cover most equalization requirements
  Finer Tx equalization control available by adjusting coefficients
Receiver equalization
  1st-order LE (linear eq.) is assumed as the minimum Rx equalization
  Designs may implement more complex Rx equalization to maximize margins
  Back channel allows the Rx to select fine-resolution Tx equalization settings
BW optimizations for Tx, Rx PLLs and CDR
  PLL BW reduced, CDR (Clock Data Recovery) jitter tracking increased
  CDR BW > 10 MHz, PLL BW 2-4 MHz

8
PCIe 3.0 Encoding/Signaling

9
Problem Statement

PCI Express* (PCIe) 3.0 data rate decision: 8 GT/s
  High Volume Manufacturing channels for clients/servers
  Same channels and lengths for backwards compatibility
  Low power and ease of design - avoid using complicated receiver equalization, etc.

Requirement: Double Bandwidth from Gen 2
  PCIe 1.0a data rate: 2.5 GT/s
  PCIe 2.0 data rate: 5 GT/s
  Doubled the bandwidth from Gen 1 to Gen 2 by doubling the data rate
  For Gen 3, the data rate gives us a 60% boost in bandwidth; the rest comes from encoding
  Replace 8b/10b encoding with a scrambling-only encoding scheme when operating at the PCIe 3.0 data rate

Double B/W: encoding efficiency 1.25 x data rate 1.6 = 2X

Challenge: New Encoding Scheme to cover 256 data plus 12 K-codes with 8 bits
10
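As a quick sanity check of the 2X arithmetic above, the sketch below multiplies it out in C (an illustration; it assumes the 128/130 lane-level framing described on the following slides, so the encoding factor comes out as 128/130 rather than the rounded 1.25):

#include <stdio.h>

/* Gen2: 5 GT/s with 8b/10b (8 payload bits per 10 wire bits).
 * Gen3: 8 GT/s with 128/130 framing (128 payload bits per 130). */
int main(void) {
    double gen2 = 5.0e9 * 8.0 / 10.0;     /* ~4.00 Gb/s payload per lane */
    double gen3 = 8.0e9 * 128.0 / 130.0;  /* ~7.88 Gb/s payload per lane */
    printf("Gen2 %.2f Gb/s, Gen3 %.2f Gb/s, ratio %.2fx\n",
           gen2 / 1e9, gen3 / 1e9, gen3 / gen2);  /* prints ratio ~1.97x */
    return 0;
}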
New Encoding Scheme

Two levels of encapsulation
  Lane level (mostly 128/130)
  Packet level, to identify packet boundaries
    Points to where the next packet begins

Scrambling only (no 8b/10b)
  Additive scrambling to provide edge density
  Data packets (TLP/DLLP/LIDL) scrambled
  Ordered Sets mostly not scrambled
  Electrical Idle Exit Ordered Set resets the scrambler (Recovery/Config)

[Diagram: packet-level encapsulation (STP, DLLP, LIDL) carried inside 130-bit lane-level blocks across lanes Ln0-Ln3. Source: Intel Corporation]

Scrambling with two levels of encapsulation

11
Mapping of bits on a x1 Link

[Diagram: a 130-bit block on a x1 link. The 2-bit sync header is transmitted first (Time = 0 to 2 UI), followed by Symbol 0 (2 to 10 UI) through Symbol 15 (starting at 122 UI), each 8-bit symbol sent bit 0 first; the 16 symbols form the 128-bit payload block. Transmit and receive views show the MSB/LSB orientation of each symbol.]

12
Mapping of bits on a x4 Link

[Diagram: on a x4 link, each lane carries its own 2-bit sync header (Time = 0 to 2 UI) followed by 16 symbols; the block's symbols are striped round-robin across lanes, so Lane 0 carries Symbols 0, 4, ... 60; Lane 1 carries Symbols 1, 5, ... 61; Lane 2 carries Symbols 2, 6, ... 62; Lane 3 carries Symbols 3, 7, ... 63.]

13
P-Layer Encapsulation: TLP

[STP token layout (4 bytes): bits 3:0 = (1111), bits 7:4 = Len[3:0], bits 14:8 = Len[10:4], bit 15 = P, bits 19:16 = Seq No[11:8], bits 23:20 = Frame CRC[3:0], bits 31:24 = Seq No[7:0]; followed by the TLP payload (same format as 2.0) and the 4B LCRC (same format as 2.0). Len[10:0]: length of the TLP in DWs; Frame CRC[4:0]: check bits covering Length[10:0]; P: Frame Parity; no END.]

Length is known from the first 3 Symbols
  First 4 bits are 1111 (bits[3:0] = 4'b1111)
  Bits 14:4 hold the length of the TLP in DWs (valid values: 5 to 1031)*
  Bits 15 and 23:20 are check bits covering the TLP Length field
    Primitive polynomial (x^4 + x + 1) protects the 15-bit field
    Provides a double bit flip detection guarantee (length 11 bits + CRC 4 bits)
    Odd parity covers the 15 bits (length 11 bits + CRC 4 bits)
    Guaranteed detection of triple bit errors (over 16 bits)
Sequence Number occupies bits 19:16 and 31:24
TLP payload starts from the 4th Symbol position (same as 2.0)
No explicit END - check the 1st Symbol after the TLP for an implicit END vs. an explicit EDB => ensures triple bit flip detection
All Symbols are scrambled/de-scrambled

*Note: Valid values with a TLP Prefix are 5 to ~1039 (max value depends on the type of TLP Prefix)
14
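To make the length-field protection concrete, here is a minimal sketch of a 4-bit CRC over the 11-bit length using the primitive polynomial named above, x^4 + x + 1. The register and bit-ordering conventions are assumptions for illustration, not the spec's exact definition:

#include <stdio.h>
#include <stdint.h>

/* Bitwise CRC-4 with generator x^4 + x + 1 over an 11-bit value. */
static uint8_t crc4_len(uint16_t len11) {
    uint8_t reg = 0;
    for (int i = 10; i >= 0; i--) {           /* MSB-first over Len[10:0] */
        int fb = ((len11 >> i) & 1) ^ ((reg >> 3) & 1);
        reg = (uint8_t)((reg << 1) & 0xF);
        if (fb)
            reg ^= 0x3;                       /* x + 1 taps; x^4 is implicit */
    }
    return reg;
}

int main(void) {
    printf("CRC4(6) = 0x%X\n", crc4_len(6)); /* e.g. a 6-DW TLP */
    return 0;
}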
P-Layer Encapsulation: DLLP

[Layout (8 Symbols): Symbol 0 = SDP (11110000), Symbol 1 = (10101100), Symbols 2-5 = DLLP payload (same format as 2.0), Symbols 6-7 = LCRC (same format as 2.0).]

Preserves the DLLP layout of the 2.0 spec
  First Symbol is F0h
  Second Symbol is ACh
  Next 4 Symbols (2 through 5) are the DLLP payload
  Next 2 Symbols (6 and 7): LCRC (identical to 2.0)
No explicit END
All Symbols are scrambled/de-scrambled

15
Ex: TLP/DLLP/IDLs in x8

[Diagram: a x8 link over two 130-bit blocks. After the sync characters, Symbol 0 carries the STP + Seq No (STP: 1111, Len, TLP CRC + P) and the start of a 7-DW TLP; Symbols 1-2 carry TLP Header DWs 1-3 and 1 DW of data; Symbol 3 carries the LCRC (1 DW) followed by an SDP (11110000, 10101100) and the start of a DLLP; Symbol 4 finishes the DLLP payload and LCRC, padded with logical idles (LIDL = 00000000); Symbol 5 is all LIDLs; Symbol 6 starts a second, 23-DW TLP that straddles into the next block, whose Symbols 0-1 carry the remaining data DWs, its LCRC, and trailing LIDLs.]

16
TLP Transmission in a X4 Link

[Diagram: a 6-DW TLP (3-DW header h0..h11, 1-DW data d0..d3, 1-DW LCRC L0..L3) plus its sequence number Q[11:0] from the Link Layer enters the framing logic. Framer output: STP S[3:0] = Fh; length L[10:0] = 006h; length CRC C[3:0] = Fh; parity P = 0b; Sync = 01b. The framed bytes pass through per-lane scramblers and are striped across the link: Lane 0 carries S[3:0]/L[7:10], then h0, h4, h8, d0, L0; Lane 1 carries L[0:6]/C[3], then h1, h5, h9, d1, L1; Lane 2 carries C[0:2]/P/Q[11:8], then h2, h6, h10, d2, L2; Lane 3 carries Q[7:0], then h3, h7, h11, d3, L3.]

17
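The byte-to-lane order in the diagram reduces to a simple round-robin over the framed packet. A minimal sketch (packet contents are placeholders; scrambling and sync headers are elided):

#include <stdio.h>

#define LANES 4

/* Byte k of the framed packet (4B STP token + 3-DW header + 1-DW data
 * + 1-DW LCRC = 24 bytes) goes to lane k mod 4, matching the diagram. */
static void stripe_x4(const unsigned char *bytes, int n) {
    for (int k = 0; k < n; k++)
        printf("Lane %d <- byte %2d (0x%02X)\n", k % LANES, k, bytes[k]);
}

int main(void) {
    unsigned char framed_tlp[24] = {0};       /* placeholder contents */
    stripe_x4(framed_tlp, (int)(sizeof framed_tlp));
    return 0;
}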
PCIe 3.0 Protocol Extensions

18
Protocol Extensions

[Diagram: CPU with system memory behind a coherent system I/F, Root Complex with host memory, and a PCI Express device with local memory.]

Performance Improvements
  TLP Processing Hints: hints to optimize system resources and performance
  TLP Prefix: mechanism to extend TLP headers for TLP Processing Hints, MR-IOV, and future extensions
  ID-Based Ordering: transaction-level attribute/hint to optimize ordering within the RC and memory subsystem
  Extended Tag Enable Default: permits the default for the Extended Tag Enable bit to be Function-specific

Software Model Improvements
  Atomic Operations: new atomic transactions to reduce synchronization overhead
  Page Request Interface: mechanism in ATS 1.1 for a device to request that faulted pages be made available (not covered)

Communication Model Enhancements
  Multicast: mechanism to transfer common data or commands from one source to multiple recipients

Power Management
  Dynamic Power Allocation: support for dynamic power operational modes through a standard configuration mechanism
  Latency Tolerance Reporting: Endpoints report service latency requirements for improved platform power mgmt
  Optimized Buffer Flush/Fill: mechanisms for devices to align DMA activity for improved platform power mgmt

Configuration Enhancements
  Resizable BAR: mechanism to support BAR size negotiation
  Internal Error Reporting: extends AER to report component-internal errors and record multiple error logs

19
TLP Processing Hints (TPH)

20
Transaction Processing Hints

Background:
  Small IO caches are implemented in server platforms (LLC/RC cache)
  Ineffective without info about the intended use of IO data

[Diagram: CPU cores with LLC/RC cache, Root Complex, Host Memory; an accelerator with local memory attached via PCI Express.]

Feature:
  TPH = hints on a transaction basis
  Allocation & temporal reuse
  More direct CPU <-> IO collaboration
  Covers control structures (headers, descriptors) and data payloads

Benefits:
  Reduced access latencies
  Improved data retention/allocation
  Reduced memory & QPI BW/power
  Avoids data copies
  New applications: comm adapters for HPC and DB clusters, computational accelerators

[Chart: change in CPU miss rate with TPH as a function of cache size.]

Provides stronger coupling between the Host cache/memory hierarchy and IO

21
Basic Device Writes

[Diagram: CPU with caches, Memory, Root Complex, and a PCI Express* Device; numbered steps trace the flow.]

Device Writes, Host Reads:
1. Device Writes DMA Data
2. Snoop System Caches; Write Back (Memory)
3. Device Write (Memory)
4. Notify Host (Optional)
5. Software Reads DMA Data
6. Host Read Completed

Transaction flow does not take full advantage of System Resources
  System Caches
  System Interconnect

22
Device Writes with TPH

[Diagram: same platform, with the write carrying a hint and steering tag so data lands in a system cache.]

Device Writes, Host Reads:
1. Device Writes DMA Data (Hint, Steering Tag)
2. Snoop System Caches
3. Interrupt Host (Optional)
4. Software Reads DMA Data
5. Host Read Completed

Effective use of System Resources
  Reduce access latency to system memory
  Reduce memory & system interconnect BW & power

Applicable data: data structures (of interest), control structures (descriptors), headers for packet processing, data payload (copies)

23
Basic Device Reads

[Diagram: CPU with caches, Memory, Root Complex, and a PCI Express* Device; numbered steps trace the flow.]

Host Writes, Device Reads:
1. Software Writes DMA Data
2. Command Write to Device (Optional)
3. Device Performs Read
4. Snoop System Caches
5. Write Back to Memory
6. Device Read Completed

Transaction flow does not take full advantage of System Resources
  System Caches
  System Interconnect

24
Device Reads with TPH

[Diagram: same platform, with the read carrying a hint and steering tag so data is served from a system cache.]

Host Writes, Device Reads:
1. Software Writes DMA Data
2. Command Write to Device (Optional)
3. Device Performs Read (Hint, Steering Tag)
4. Snoop System Caches
5. Device Read Completed

Effective use of System Resources
  Reduce access latency to system memory
  Reduce memory & system interconnect BW & power

Applicable data: data structures (of interest), control structures (descriptors), headers for packet processing, data payload (copies)

25
Atomic Operations (AtomicOps)

26
Synchronization

Atomic Read-Modify-Write

[Diagram: CPU, Memory, and Root Complex with an Atomic Completer Engine, plus a PCI Express* Device.]

Atomic transaction support for Host update of main memory exists today
  Useful for synchronization without interrupts
  Rich library of proven algorithms in this area

Benefit in extending existing inter-processor primitives for data sharing/synchronization to the PCIe interconnect domain
  Low-overhead critical sections
  Non-blocking algorithms for managing data structures, e.g. task lists
  Lock-free statistics, e.g. counter updates

Improve existing application performance
  Faster packet arrival rates create demand for faster synchronization

Emerging applications benefit from Atomic RMW
  Multiple Producer Multiple Consumer support
  Examples: Math, Visualization, Content Processing, etc.

27
Atomic Read-Modify-Write (RMW)

[Diagram: the Atomic Completer Engine may reside in the CPU, the Root Complex, or (optionally) the PCI Express* Device.]

Atomic RMW Operation:
1. Device issues RMW (optional Hint, ST)
2. Read initial value
3. Write new value [Atomic Completer performs FetchAdd, Swap or CAS]
4. Return initial value

Request    Description
FetchAdd   Data(Addr) = Data(Addr) + AddData
Swap       Data(Addr) = SwapData
CAS        If (CompareData == Data(Addr)) then Data(Addr) = SwapData

28
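The table's three operations, restated as a C sketch of the completer semantics (the read-modify-write is shown sequentially here; a real Atomic Completer performs it atomically):

#include <stdint.h>

/* Each operation returns the initial value, per step 4 above. */
uint64_t fetch_add(uint64_t *addr, uint64_t add_data) {
    uint64_t initial = *addr;
    *addr = initial + add_data;
    return initial;
}

uint64_t swap(uint64_t *addr, uint64_t swap_data) {
    uint64_t initial = *addr;
    *addr = swap_data;
    return initial;
}

uint64_t cas(uint64_t *addr, uint64_t compare_data, uint64_t swap_data) {
    uint64_t initial = *addr;
    if (initial == compare_data)
        *addr = swap_data;
    return initial;
}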
Power Management Enhancements

Dynamic Power Allocation (DPA)
Optimized Buffer Flush/Fill (OBFF)
Latency Tolerance Reporting (LTR)
29
Dynamic Power Allocation

Background
  PCIe 1.x provided standard Device & Link-level Power Management
  PCIe 2.0 adds mechanisms for dynamic scaling of Link width/speed
  No architected mechanism for dynamic control of device thermal/power budgets

Problem Statement
  Devices consume an increasing share of the system power & thermal budget
    Emerging 300W Add-In Cards
  New customer & regulatory operating requirements
    On-going industry-wide efforts, e.g. ENERGY STAR* compliance
  Battery life/enclosure power management
    Mobile, Servers & Embedded Platforms
30
Dynamic Power Allocation (DPA)

DPA Capability
  Extends existing PCI Device PM to provide Active (D0) substates
  Up to 32 substates supported
  Dynamic control of D0 Active substates

Benefits
  Platform cost reduction
    Power/thermal management
  Platform optimizations
    Battery life (Mobile) / power (Servers)

[Chart: software-managed dynamic power allocation transitions across D0 substates D0.0 through D0.7, trading total power against performance. Source: Intel Corporation]

Enables New Platform-Level Flexibility in Power/Thermal Resource Management

31
Latency Tolerance Reporting

Problem: current platform PM policies guesstimate when devices are idle (e.g. with inactivity timers)
  Guessing wrong can cause performance issues, or even HW failures
  Worst case: PM is disabled to preserve functionality, at a cost in power
  Even the best case is not good: reluctance to power down leaves some PM opportunities on the table
  Tough balancing act between performance/functionality and power

Wanted: mechanism for the platform to tune PM based on actual device service requirements

32
Latency Tolerance Reporting (LTR)

LTR Mechanism
  PCIe Message sent by an Endpoint with its tolerable latency
  Capability to report both snooped & non-snooped values
  Terminates at the Receiver; for routing, MFDs & Switches send an aggregated message

Benefits
  Device benefit: dynamically tune platform PM state as a function of device activity level
  Platform benefit: enables greater power savings without impact to performance/functionality

Dynamic LTR
[Diagram: with Buffer 1 idle the Endpoint sends LTR (Max); when Buffer 2 becomes active it sends an activity-adjusted LTR.]

LTR enables dynamic power vs. performance tradeoffs at minimal cost impact

33
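A sketch of the aggregation a Switch or MFD might perform, assuming the aggregate sent upstream is the tightest (minimum) tolerance across its downstream ports; the exact aggregation rule and names here are illustrative, not quoted from the spec:

#include <stdint.h>

uint32_t aggregate_ltr_ns(const uint32_t *port_ltr_ns, int nports) {
    uint32_t min_ltr = UINT32_MAX;
    for (int i = 0; i < nports; i++)
        if (port_ltr_ns[i] < min_ltr)
            min_ltr = port_ltr_ns[i];
    return min_ltr;   /* value carried in the aggregated upstream LTR Message */
}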
Optimized Buffer Flush/Fill

Problem: devices do not know the power state of central resources
  Asynchronous device activity prevents optimal power management of memory, CPU, and RC internals through idle-window fragmentation

Premise: if devices knew when to talk, most could easily optimize their request patterns
  Result: the system would stay in lower power states for longer periods of time with no impact on overall performance

Optimized Buffer Flush/Fill (OBFF): a mechanism for broadcasting PM hints to devices

[Diagram: aligning device bus master/interrupt events enlarges the idle windows between them.]

Wanted: mechanism to align device activity with platform PM events

34
Optimized Buffer Flush/Fill (OBFF)

OBFF
  Notify all Endpoints of optimal windows with minimal power impact
  Solution 1: when possible, use WAKE# with new wire semantics
  Solution 2 (optional, when WAKE# is not available): use a PCIe OBFF Message

[Diagram: Root Complex signals optimal windows to the PCI Express* Device via WAKE# waveforms or OBFF Messages.]

Signaled states (WAKE# waveforms convey transition events between them):
  CPU Active: platform fully active; optimal for bus mastering and interrupts
  OBFF: platform memory path available for memory reads and writes
  Idle: platform is in a low power state

Greatest Potential Improvement When Implemented by All Platform Devices

35
Other Protocol Enhancements

ID-Based Transaction Ordering
IO Page Fault Mechanism
Resizable BAR
Multicast
36
Transaction Ordering Enhancement

Background:
  Strong ordering == unnecessary stalls
  Transactions from different Requestors carry different IDs

[Diagram: Host CPU/Mem ordering unrelated transaction streams.]

Feature:
  New Transaction Attribute bit to indicate ID-based ordering relaxation
  Permission to reorder transactions between different ID streams
  Applies to unrelated streams within: MF Devices, Root Complex, Switches

Benefits:
  Improves latency/power/BW within the memory subsystem
  Reduces transaction latencies in the system
  Mitigates overhead of IO

37
IO Page Fault Mechanism

Background:
  Emerging trend: platform virtualization
  Increases pressure on memory resources, making page pinning very expensive

[Diagram: Host CPU/Memory above a Root Complex (RC) containing a Translation Agent (TA) with its Address Translation and Protection Table (ATPT); ATS Requests/Completions flow between the Root Port (RP) and a PCIe Endpoint with an ATC.]

Feature:
  Built upon the PCIe Address Translation Services (ATS) mechanism
  Notify IO devices when IO page faults occur
  Device pause/resume on page faults
  Faulted pages requested to be made available

Benefits:
  OS/Hypervisor gets the ability to maintain overall system performance by over-committing memory allocation
  Critical for future IO Virtualization applications
  New usage: User-Mode IO for IO scaling

38
Resizable BAR & Multicast

[Diagrams: (left) Resizable BAR: CPU, Root Complex, Host Memory, and a Device whose local memory is mapped through its BAR; (right) Multicast: a Root above a PCIe Switch (P2P bridges on a virtual PCI bus) fanning out to four Endpoints, contrasting the PCIe standard address route with the multicast address route.]

BAR == Base Address Register: the PCI mechanism for mapping device memory into the system address space

Improved platform address space management -- solves current problems with gfx/accel
Multicast provides performance scaling of existing apps (e.g. multi-Gfx) -- opens new usages for PCIe in the embedded space

39
Summary

8.0 GT/s silicon design is challenging but achievable
  Double B/W: encoding efficiency 1.25 x data rate 1.6 = 2X

Next-generation PCIe Protocol Extensions deliver
  Energy Efficient Performance,
  Software Model Improvements, and
  Architecture Scalability

Specification status:
  Rev 0.5 spec delivered to PCI-SIG in Q1'09
  Rev 0.7 targeting Sept. '09 & Rev 0.9 early Q1'10

Continuous Improvement: Doubling Bandwidth & Improving Capabilities Every 3-4 Years!

40
Call to Action & References

Contribute to the evolution of the PCI Express architecture
  Review and provide feedback on the PCIe 3.0 specs
  Innovate and differentiate your products with the PCIe 3.0 industry standard

Visit:
  www.pcisig.com for PCI Express specification updates
  http://download.intel.com/technology/pciexpress/devnet/docs/PCIe3_Accelerator-Features_WP.pdf for the white paper on PCIe Accelerator Features

41
Backup

Example of an Eye As Seen At the Receiver Input Latch

[Diagram: eye of height EH and width EW, with the latch point at UI/2.]

Eye aperture defines Tj at 10^-12
Eye margins reflect CDR tracking and Rx equalization

43
Scrambling vs. 8b/10b coding

8GT/s uses scrambled data to improve signaling efficiency over the 8b10b encoding used at 2.5GT/s and 5GT/s, yielding a 2x payload data rate w.r.t. 5 GT/s
Unlike 8b10b, a maximal-length PRBS generated by an LFSR does not preserve DC balance
  The average voltage level over a constant period of time varies slowly based on the pattern of the PRBS
  In an AC-coupled system this creates a slowly changing differential offset that reduces eye height
Different PRBS polynomials have different average run lengths through their patterns and so different peak differential offsets
  There exists a best-case PRBS23 polynomial yielding minimum DC wander of ~4.5 mVPP: x^23 + x^21 + x^18 + x^15 + x^7 + x^2
The large number of taps tends to break up long runs of 0s or 1s (a common case)
Pathological matches between the PRBS and a data pattern have very low probability
  The retry mechanism changes the polynomial starting point to prevent a pathological data pattern from failing repeatedly
44
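A sketch of the PRBS23 generator named above as a Fibonacci LFSR, with the implicit +1 term assumed; seed and bit ordering are illustrative, not the spec's. Scrambling is then just an XOR of this sequence with the data stream:

#include <stdio.h>
#include <stdint.h>

static uint32_t lfsr = 0x7FFFFF;              /* any nonzero 23-bit seed */

/* Taps at bits 22, 20, 17, 14, 6, 1 correspond to the polynomial
 * x^23 + x^21 + x^18 + x^15 + x^7 + x^2 (+ 1). */
static int prbs23_next_bit(void) {
    int fb = ((lfsr >> 22) ^ (lfsr >> 20) ^ (lfsr >> 17) ^
              (lfsr >> 14) ^ (lfsr >> 6) ^ (lfsr >> 1)) & 1;
    lfsr = ((lfsr << 1) | (uint32_t)fb) & 0x7FFFFF;
    return fb;
}

int main(void) {
    unsigned char data = 0xA5, out = 0;
    for (int i = 0; i < 8; i++)               /* scramble one byte, LSB first */
        out |= (unsigned char)((((data >> i) & 1) ^ prbs23_next_bit()) << i);
    printf("0x%02X -> 0x%02X\n", data, out);
    return 0;
}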
Gen3 Signaling: Error Detection & Recovery

Framing errors are detected by the physical layer:
  The first byte of a packet is not one of the allowed sets (e.g., TLP, DLLP, LIDL)
  Sync character is not 01 or 10
  The same sync character is not present in all lanes after deskew
  CRC error in the length field of a TLP
  Ordered set is not one of the allowed encodings, or not all lanes are sending the same ordered set after deskew (if applicable)
  A 10 sync header is received after a 01 sync header without a marker packet in the 01 sync header, OR a marker packet was received in the 01 sync header and the subsequent sync header in any lane is not 10

Any framing error requires directing the LTSSM to Recovery
  Stop processing any received TLP/DLLP after the error until we get through Recovery
  Block lock acquired with EIEOS
  Scrambler reset with each EIEOS

Error Detection Guarantees
  Triple bit flip detection within each TLP/DLLP/IDL/OS
45
TLP Processing Hints (TPH)

46
TPH Mechanism

Mechanism to provide processing hints on a per-TLP basis for Requests that target Memory Space
  Enables system hardware (e.g. Root Complex) to optimize on a per-TLP basis
  Applicable to Memory Read/Write and Atomic Operations

PH[1:0]   Processing Hint        Usage Model
00        Bi-directional         Bi-directional data structure
01        Requestor              D*D*
10        Target                 DWHR, HWDR
11        Target with Priority   DWHR (Prioritized), HWDR (Prioritized)
47
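The PH[1:0] encodings from the table, restated as a small C enum (names are illustrative):

enum tph_processing_hint {
    TPH_BIDIRECTIONAL   = 0x0,  /* bi-directional data structure */
    TPH_REQUESTER       = 0x1,  /* D*D* */
    TPH_TARGET          = 0x2,  /* DWHR, HWDR */
    TPH_TARGET_PRIORITY = 0x3   /* DWHR/HWDR, prioritized */
};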
Steering Tag (ST)

[Diagrams: ST field placement in Memory Write TLPs and in Memory Read or AtomicOperation TLPs.]

ST: 8 bits defined in the header to carry system-specific Steering Tag values
Use of Steering Tags is optional; a "no preference" value is used to indicate no steering tag preference
Architected Steering Table for software to program system-specific steering tag values
48
TPH Summary

Mechanism to make effective use of the system fabric and improve system efficiency
  Reduce variability in access to system memory
  Reduce memory & system interconnect BW & power consumption

Ecosystem impact
  Software impact is under investigation - minimally, may require software support to retrieve hints from system hardware
  Endpoints take advantage only as needed; no cost if not used
  Root Complex can make implementation tradeoffs
  Minimal impact to Switches

Architected software discovery, identification, and control of capabilities
  RC support for processing hints
  Endpoint enabling to issue hints
49
ID-Based Ordering (IDO)

50
Review: PCIe Ordering Rules

[Table: PCIe transaction ordering rules, based on new 2.0 errata. "No" entries are caused by Producer/Consumer restrictions; "Yes" entries are required for deadlock avoidance.]

Maximum theoretical flexibility: all entries are Y/N
  Traditional Relaxed Ordering (RO) enables the A2 & D2 Y/N cases
  The AtomicOps ECR defines an RO-enabled C2 Y/N case
  ID-Based Ordering (IDO) enables the A2, B2, C2, & D2 Y/N cases

51
Motivation

RO works well for single-stream models where a data buffer is written once, consumed, and then recycled
  Not OK for buffers that will be written more than once, because writes are not guaranteed to complete in the order issued
  Does not take advantage of the fact that ordering doesn't need to be enforced between unrelated streams
Conventional Ordering (CO) can cause significant stalls
  Observed stalls in the 10s to 100s of ns
  Worst-case behavior may see such stalls repeatedly for a Request stream
Consider the case of a NIC or disk controller with multiple streams of writes:

[Diagram: each CO flag write serializes & adds latency to traffic from unrelated streams.]

52
IDO: Perf Optimizations for Unrelated TLP Streams

TLP Stream: a set of TLPs that all have the same originator

Optimizations are possible for unrelated TLP Streams, notably with:
  Multi-Function Device (MFD) / Root Port direct connect
  Switched environments
  Multiple RC Integrated Endpoints (RCIEs)

IDO permits passing between TLPs in different streams
  Particularly beneficial when a Translation Agent (TA) stalls TLP streams temporarily

53
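A sketch of the relaxation IDO grants, reduced to its core test: a later TLP may pass an earlier one when both opt in and they belong to different streams (different originators). This is a simplified illustration; the actual ordering rules also depend on transaction type, per the table on slide 51:

#include <stdbool.h>
#include <stdint.h>

struct tlp {
    uint16_t requester_id;  /* identifies the originating stream */
    bool     ido;           /* ID-based ordering attribute bit */
};

/* Ordering within a stream is always preserved. */
bool may_pass(const struct tlp *later, const struct tlp *earlier) {
    return later->ido && earlier->ido &&
           later->requester_id != earlier->requester_id;
}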
TLP Prefix

54
Motivation

[Diagram: TLP layout. Optional TLP Prefixes occupy the leading DWORDs (Byte 0 through Byte H-4), followed by the Header (Byte H onward), Data (Byte J onward, included when applicable), and an optional TLP Digest.]

Emerging usage models require an increase in header size to carry new information
  Examples: Multi-Root IOV, Extended TPH
The TLP Prefix mechanism extends header sizes by adding DWORDs to the front of headers

55
Prefix Encoding

[Byte 0 layouts: 1000b + Type = Local Prefix contents; 1001b + Type = End-End Prefix contents.]

Base TLP Prefix size: 1 DW
  Appended to the front of TLP headers
  TLP Prefixes can be stacked or repeated; more than one TLP Prefix is supported
Link Local: routing elements may process the TLP for routing or other purposes
  Only usable when both ends understand and are enabled to handle the Link Local TLP Prefix
  ECRC not applicable
End-End TLP Prefix
  Requires support between the Requester, Completer and routing elements
  An End-End TLP Prefix is not required, but is permitted, to be protected by ECRC
  If the underlying base TLP is protected by ECRC then the End-End TLP Prefix is also protected by ECRC
  Upper bound of 4 DWORDs (16 Bytes) for End-End TLP Prefixes
Fmt field grows to 3 bits
  New error behavior defined: undefined Fmt and/or Type values result in a Malformed TLP
  Extended Fmt Field Supported capability bit indicates support for the 3-bit Fmt
  Support is recommended for all components (independent of Prefix support)
56
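A sketch of byte-0 classification per the layout above, where 1000b marks a Local Prefix and 1001b an End-End Prefix, with the low 4 bits carrying the prefix Type (masks follow the slide's diagram; illustrative only):

#include <stdbool.h>
#include <stdint.h>

static bool is_local_prefix(uint8_t byte0)   { return (byte0 & 0xF0) == 0x80; }
static bool is_end_end_prefix(uint8_t byte0) { return (byte0 & 0xF0) == 0x90; }
static uint8_t prefix_type(uint8_t byte0)    { return byte0 & 0x0F; }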
Stacked Prefix Example:

[Packet layout: STP + Sequence #, then Link Local Prefix (TypeL1, starts at byte 0), End-End Prefix #1 (TypeE1, starts at 4), End-End Prefix #2 (TypeE2, starts at 8), the PCIe TLP Header (starts at 12), optional Payload, optional ECRC, then LCRC and END.]

Link Local is first; each End-End Prefix follows in turn, and the PCIe Header follows the last End-End Prefix

Switch routes using the Link Local Prefix and the PCIe Header
  ...and possibly additional Link Local DWORDs if more extension bits are needed
  Malformed TLP if the Switch doesn't understand them

Switch forwards End-End Prefixes unaltered
  End-End Prefixes do not affect routing
  Up to 4 DWORDs (16 Bytes) of End-End Prefix
  End-End Prefixes are optional
  A sequence of different End-End Prefixes is unordered: it affects the ECRC but does not affect meaning
  A repeated End-End Prefix sequence must be ordered, e.g. 1st Extended TPH vs. 2nd Extended TPH attribute; the meaning of this is defined by each End-End Prefix
57
Multicast

58
Multicast Motivation & Mechanism Basics

Several key applications benefit from Multicast
  Communications backplane (e.g. route table updates, support of IP Multicast)
  Storage (e.g. mirroring, RAID)
  Multi-headed graphics

PCIe architecture extended to support address-based Multicast
  New Multicast BAR to define the Multicast address space
  New Multicast Capability structure to configure routing elements and Endpoints for Multicast address decode and routing
  New Multicast Overlay mechanism in Egress Ports allows Endpoints to receive Multicast TLPs without requiring an Endpoint Multicast Capability structure

Supports only Posted, address-routed transactions (e.g., Memory Writes)
Supports both RCs and EPs as both targets and initiators
Compatible with systems employing Address Translation Services (ATS) and Access Control Services (ACS)
Multicast capability permitted at any point in a PCIe hierarchy

59
Multicast Example

[Diagram: a Multicast TLP is address-routed Upstream through a PCIe Switch (P2P bridges on a virtual PCI bus) toward the Root and fanned out to Endpoints through the Ports configured for Multicast forwarding. To be address-routed Upstream, the Upstream Port must be part of the forwarding Ports for Multicast.]

60
Multicast Memory Space

[Diagram: the Multicast address space starts at MC_Base_Address and is divided into contiguous regions of size 2^MC_Index_Position; Multicast Group 0 through Multicast Group N-1 each occupy one region of memory space.]

61
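A sketch of the address decode the diagram implies, assuming group k's region starts at MC_Base_Address + k * 2^MC_Index_Position (variable names are illustrative):

#include <stdbool.h>
#include <stdint.h>

static bool mc_decode(uint64_t addr, uint64_t mc_base, unsigned index_pos,
                      unsigned num_groups, unsigned *group_out) {
    uint64_t region = 1ULL << index_pos;              /* bytes per group */
    if (addr < mc_base || addr >= mc_base + (uint64_t)num_groups * region)
        return false;                                 /* not a Multicast hit */
    *group_out = (unsigned)((addr - mc_base) >> index_pos);
    return true;
}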
