PCIe Gen3 Overview
Jasmin Ajanovic
Sr. Principal Engineer
Intel Corp.
PCI Express* (PCIe) Interconnect

Physical Interface
- Point-to-point, full-duplex
- Differential low-voltage signaling
- Embedded clocking
- Scalable width & frequency
- Supports connectors and cables

Protocol
- Load/Store architecture
- Fully packetized, split-transaction
- Credit-based flow control
- Virtual Channel mechanism

Advanced Capabilities
- Enhanced Configuration and Power Management
- RAS: CRC data integrity, Hot Plug, advanced error logging/reporting
- QoS and isochronous support

I/O Trends
- Increase in I/O bandwidth
- Reduction in latency
- Energy-efficient performance
- Emerging applications: virtualization, optimized interaction between Host & I/O
- Examples: graphics, math, physics, financial & HPC apps

These trends drive new generations of PCI Express technology.
PCIe Technology Roadmap
- Gen3: 8 GT/s signaling
- Atomic Ops, caching hints
- Lower latencies, improved PM
- Enhanced software model
- I/O virtualization, device sharing

[Chart: aggregate bandwidth (GB/s) by PCIe generation]

All dates, timeframes, and products are subject to change without further notification.

Continuous improvement: doubling bandwidth & improving capabilities every 3-4 years!
PCIe 3.0 Electrical Interface

PCIe 3.0 Electrical Requirements
- Compatibility with PCIe 1.x, 2.0
PCIe Gen3 Solution Space
[Plots: equalization sweeps at 8 GT/s showing eye height (V) pass/fail regions]
PCIe 3.0 Encoding/Signaling
Problem Statement
- PCI Express* (PCIe) 3.0 data rate decision: 8 GT/s
- High-volume-manufacturing channels for clients/servers
- Same channels and lengths for backwards compatibility
- Low power and ease of design: avoid complicated receiver equalization, etc.

Scrambling with two levels of encapsulation
- Lane level (130 bits)
- Data packets (TLP/DLLP/IDL, STP) are scrambled
- Ordered Sets are mostly not scrambled
- The Electrical Idle Exit Ordered Set resets the scrambler (Recovery/Config)
Source: Intel Corporation
Mapping of bits on a x1 Link
[Diagram: transmit and receive Symbol ordering on a x1 Link; each Symbol's bits (7..0) are serialized onto the single lane, Symbol 0 through Symbol 15]
Mapping of bits on a x4 Link
[Diagram: the 2-bit sync header at Time = 0, followed by Symbols striped across Lanes 0-3; e.g. Lane 1 carries Symbol 1 starting at 2 UI, Symbol 5 at 10 UI, ..., Symbol 61 at 122 UI]
P-Layer Encapsulation: TLP
[Diagram: framed TLP layout]
- Len[10:0]: length of the TLP in DWs
- Frame CRC[4:0]: check bits covering Len[10:0]
- P: frame parity
- No END
*Note: valid length values are 5 to ~1039 (the maximum depends on the type of TLP Prefix)
P-Layer Encapsulation: DLLP
[Diagram: 8-Symbol framed DLLP]
- Preserves the DLLP layout of the 2.0 spec
- First Symbol is F0h
- Second Symbol is ACh
- Next 4 Symbols (2 through 5) are the DLLP layout
- Next 2 Symbols (6 and 7): LCRC (identical to 2.0)
- No explicit END
- All Symbols are scrambled/de-scrambled
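To make the layout concrete, here is a minimal C sketch that packs one framed DLLP as listed above; the 16-bit LCRC is assumed to be computed elsewhere (it is unchanged from the 2.0 spec), and its byte order in Symbols 6-7 is an illustrative assumption.

#include <stdint.h>
#include <string.h>

/* Build one framed DLLP: Symbol 0 = F0h, Symbol 1 = ACh, Symbols 2-5 carry
 * the 2.0-format DLLP body, Symbols 6-7 carry the LCRC. There is no END
 * Symbol; all 8 Symbols are subsequently scrambled on the wire. */
static void build_dllp_frame(uint8_t frame[8], const uint8_t dllp_body[4],
                             uint16_t lcrc)
{
    frame[0] = 0xF0;                    /* first framing Symbol       */
    frame[1] = 0xAC;                    /* second framing Symbol      */
    memcpy(&frame[2], dllp_body, 4);    /* Symbols 2-5: DLLP layout   */
    frame[6] = (uint8_t)(lcrc >> 8);    /* Symbols 6-7: LCRC          */
    frame[7] = (uint8_t)(lcrc & 0xFF);  /* (byte order assumed here)  */
}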
Example: TLP/DLLP/IDLs on a x8 Link
[Diagram: two 130-bit Blocks across Lanes 0-7. In the first Block, Symbol 0 carries STP + sequence number (STP: 1111, Len = TLP + CRC + P), followed by the TLP header DWs and a 7-DW TLP; a 23-DW TLP straddles into the second Block, which ends with its LCRC (1 DW) and IDL Symbols (00000000) on the remaining lanes.]
TLP Transmission on a x4 Link
[Diagram: the STP token fields (S[3:0], Q[11:0], L[10:0], C[3:0], P), the TLP header bytes (h0-h11), data bytes (d0-d3), and LCRC bytes (L0-L3) distributed across Lanes 0-3 over time, with each lane feeding its own scrambler.]
PCIe 3.0 Protocol Extensions
Protocol Extensions

Performance Improvements
- TLP Processing Hints: hints to optimize system resources and performance
- TLP Prefix: mechanism to extend TLP headers for TLP Processing Hints, MR-IOV, and future extensions
- ID-Based Ordering: transaction-level attribute/hint to optimize ordering within the RC and memory subsystem
- Extended Tag Enable Default: permits the default for the Extended Tag Enable bit to be Function-specific

Software Model Improvements
- Atomic Operations: new atomic transactions to reduce synchronization overhead
- Page Request Interface: mechanism in ATS 1.1 for a device to request that faulted pages be made available (not covered)

Communication Model Enhancements
- Multicast: mechanism to transfer common data or commands from one source to multiple recipients

Power Management
- Dynamic Power Allocation: support for dynamic power operational modes through standard configuration mechanisms
- Latency Tolerance Reporting: Endpoints report service latency requirements for improved platform power management
- Optimized Buffer Flush/Fill: mechanisms for devices to align DMA activity for improved platform power management

Configuration Enhancements
- Resizable BAR: mechanism to support BAR size negotiation
- Internal Error Reporting: extends AER to report component-internal errors and record multiple error logs

[Diagram: CPU, coherent system interface, host memory, Root Complex, PCI Express device with local memory]
TLP Processing Hints (TPH)
Transaction Processing Hints

Background:
- Small I/O caches (LLC/RC cache) are implemented in server platforms
- Ineffective without information about the intended use of the I/O data

Feature:
- TPH = hints on a per-transaction basis
- Allocation & temporal reuse
- More direct CPU <-> I/O collaboration
- Applies to control structures (headers, descriptors) and data payloads

Benefits:
- Reduced access latencies
- Improved data retention/allocation
- Reduced memory & QPI bandwidth/power
- Avoiding data copies
- New applications: comm adapters for HPC and DB clusters, computational accelerators, ...

[Chart: change in CPU miss rate with TPH. Diagram: CPU cores with LLC/RC cache, host memory, Root Complex, PCI Express accelerator with local memory]
Transaction flow does not take full advantage of system resources.
[Diagram: system caches, system interconnect, PCI Express* device]
Device Writes with TPH (Device Writes, Host Reads)
[Diagram steps: (1) device writes DMA data (Hint, Steering Tag); (2) snoop system caches; (3) interrupt host (optional); (4) software reads DMA data. CPU caches, memory, and Root Complex shown.]
Basic Device Reads (Host Writes, Device Reads)
[Diagram steps: (1) software writes DMA data; (2) command write to device (optional); (3) device performs read; (4) snoop system caches; (5) write back to memory; (6) device read completed. CPU caches, memory, and Root Complex shown.]
Device Reads with TPH (Host Writes, Device Reads)
[Diagram steps: (1) software writes DMA data; (2) command write to device (optional); (3) device performs read (Hint, Steering Tag); (4) snoop system caches; (5) device read completed. CPU caches, memory, and Root Complex shown.]
Atomic Operations (AtomicOps)
Synchronization
[Diagram: atomic read-modify-write as a synchronization primitive]
Atomic Read-Modify-Write (RMW)

[Diagram: Atomic RMW flow — (1) device issues RMW (Hint, ST) [FetchAdd, Swap, or CAS]; (2) the Atomic Completer Engine reads the initial value; (3) writes the new value; (4) returns the initial value. The Atomic Completer Engine may sit in the Root Complex cache hierarchy or at memory (optional placements shown).]

Request    Description
FetchAdd   Data(Addr) = Data(Addr) + AddData
Swap       Data(Addr) = SwapData
CAS        If (CompareData == Data(Addr)) then Data(Addr) = SwapData
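The request table maps directly onto familiar read-modify-write semantics. The C sketch below is a functional model of what the Atomic Completer does for each request type, following the four-step flow in the diagram; it is not an implementation of the completer hardware.

#include <stdint.h>

/* Functional model of the three AtomicOp request types. The Atomic Completer
 * performs the read-modify-write indivisibly; each helper returns the initial
 * value, matching step (4) "return initial value" in the flow above. */
static uint64_t atomicop_fetch_add(uint64_t *addr, uint64_t add_data)
{
    uint64_t initial = *addr;        /* (2) read initial value   */
    *addr = initial + add_data;      /* (3) write new value      */
    return initial;                  /* (4) return initial value */
}

static uint64_t atomicop_swap(uint64_t *addr, uint64_t swap_data)
{
    uint64_t initial = *addr;
    *addr = swap_data;
    return initial;
}

static uint64_t atomicop_cas(uint64_t *addr, uint64_t compare_data,
                             uint64_t swap_data)
{
    uint64_t initial = *addr;
    if (initial == compare_data)     /* only write when the compare matches */
        *addr = swap_data;
    return initial;
}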
Power Management Enhancements
- Dynamic Power Allocation (DPA)
- Optimized Buffer Flush/Fill (OBFF)
- Latency Tolerance Reporting (LTR)
Dynamic Power Allocation

Background
- PCIe 1.x provided standard device- and Link-level power management
- PCIe 2.0 added mechanisms for dynamic scaling of Link width/speed
- There is no architected mechanism for dynamic control of device thermal/power budgets

Problem Statement
- Devices consume an increasing share of the system power & thermal budget
  - Emerging 300 W add-in cards
- New customer & regulatory operating requirements
  - Ongoing industry-wide efforts, e.g. ENERGY STAR* compliance
- Battery life / enclosure power management
  - Mobile, server & embedded platforms
Dynamic Power Allocation (DPA)

DPA Capability
- Extends existing PCI device PM to provide Active (D0) substates
- Up to 32 substates supported
- Dynamic control of D0 Active substates

Benefits
- Platform cost reduction: power/thermal management
- Platform optimizations: battery life (mobile) / power (servers)

[Chart: performance and total power across D0 substates D0.0-D0.7. Source: Intel Corporation]
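As a sketch of how platform software might exercise DPA, the snippet below picks the highest-performance D0 substate that fits a given power budget, assuming (as in the chart) that substate 0 is the highest-performance, highest-power point. The array contents, units, and selection policy are illustrative assumptions, not anything defined by the specification.

#include <stdint.h>

#define DPA_MAX_SUBSTATES 32   /* up to 32 D0 substates are supported */

/* Return the lowest-numbered (highest-performance) substate whose allocated
 * power fits within the budget the platform can currently grant, or -1 if
 * none fits. Per-substate power values would come from the DPA capability. */
static int dpa_select_substate(const uint32_t *substate_power_mw,
                               unsigned nsubstates, uint32_t budget_mw)
{
    if (nsubstates > DPA_MAX_SUBSTATES)
        nsubstates = DPA_MAX_SUBSTATES;
    for (unsigned s = 0; s < nsubstates; s++)
        if (substate_power_mw[s] <= budget_mw)
            return (int)s;
    return -1;
}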
Wanted: a mechanism for the platform to tune PM based on actual device service requirements.
Latency Tolerance Reporting (LTR)

LTR Mechanism
- PCIe Message sent by an Endpoint with its tolerable latency
- Capability to report both snooped & non-snooped values
- Terminate-at-Receiver routing; an MFD or Switch sends an aggregated message

[Diagram: (1) LTR (max) reported while Buffer 1 is idle; (2) LTR (activity-adjusted) reported while Buffer 2 is active. Aligning device activity yields enlarged idle windows.]

Wanted: a mechanism to align device activity with platform PM events.
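The aggregation a Switch or multi-function device performs before sending a single upstream LTR message can be modeled as taking the tightest (smallest) tolerable latency reported below it. The sketch works in plain nanoseconds, whereas the real message encodes a value/scale pair, so treat the units and names as illustrative.

#include <stdint.h>

#define LTR_NO_REQUIREMENT UINT64_MAX   /* port has not reported a constraint */

/* Aggregate the tolerable latencies reported by downstream ports: the value
 * forwarded upstream is bounded by the most demanding (smallest) report. */
static uint64_t ltr_aggregate_ns(const uint64_t *port_latency_ns,
                                 unsigned nports)
{
    uint64_t agg = LTR_NO_REQUIREMENT;
    for (unsigned i = 0; i < nports; i++)
        if (port_latency_ns[i] < agg)
            agg = port_latency_ns[i];
    return agg;
}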
Optimized Buffer Flush/Fill (OBFF)

OBFF
- Notify all Endpoints of optimal windows with minimal power impact
- Solution 1: when possible, use WAKE# with new wire semantics

[Diagram: CPU, Root Complex, and Endpoints signaled via WAKE#]
Transaction Ordering Enhancement

Background:
- Strong ordering == unnecessary stalls
- Transactions from different Requestors carry different IDs

Feature:
- New transaction attribute bit to indicate ID-based ordering relaxation
- Permission to reorder transactions between different ID streams
- Applies to unrelated streams within MF devices, Root Complex, and Switches

Benefits:
- Improves latency/power/BW within the memory subsystem
- Reduces transaction latencies in the system
- Mitigates the overhead of ordering unrelated I/O transaction streams

[Diagram: Host CPU/memory handling unrelated transaction streams]
I/O Page Fault Mechanism

Background:
- Emerging trend: platform virtualization
- Increases pressure on memory resources, making page pinning very expensive

Feature:
- Built upon the PCIe Address Translation Services (ATS) mechanism
- Notifies I/O devices when I/O page faults occur
- Device pause/resume on page faults
- Faulted pages are requested to be made available

Benefits:
- OS/Hypervisor gains the ability to maintain overall system performance by over-committing memory allocation
- Critical for future I/O virtualization and for I/O scaling
- New usage: user-mode I/O for applications

[Diagram: Host CPU and memory, Translation Agent (TA) with Address Translation and Protection Table (ATPT), Root Complex (RC) and Root Port (RP), ATS Request/Completion over PCIe to a PCIe Endpoint with an ATC]
Resizable BAR & Multicast

[Diagram: Root Complex with CPU and host memory; a PCIe Switch built from P2P Bridges on a virtual PCI bus; a device whose BAR maps local memory]

Specification Status:
- Rev 0.5 spec delivered to the PCI-SIG in Q1'09
- Rev 0.7 targeting Sept. '09 & Rev 0.9 early Q1'10
Call to Action & References

Contribute to the evolution of the PCI Express architecture
- Review and provide feedback on the PCIe 3.0 specs
- Innovate and differentiate your products with the PCIe 3.0 industry standard

Visit:
- www.pcisig.com for PCI Express specification updates
- http://download.intel.com/technology/pciexpress/devnet/docs/PCIe3_Accelerator-Features_WP.pdf for a white paper on PCIe Accelerator Features
Backup
Example of an Eye as Seen at the Receiver Input Latch
[Diagram: eye opening spanning UI/2 on either side of the sampling point]
Scrambling vs. 8b/10b coding
- 8 GT/s uses scrambled data to improve signaling efficiency over the 8b/10b encoding used at 2.5 GT/s and 5 GT/s, yielding 2x the payload data rate w.r.t. 5 GT/s
- Unlike 8b/10b, a maximal-length PRBS generated by an LFSR does not preserve DC balance
  - The average voltage level over a constant period of time varies slowly based on the PRBS pattern
  - In an AC-coupled system this creates a slowly changing differential offset that reduces eye height
- Different PRBS polynomials have different average run lengths through their pattern and therefore different peak differential offsets
  - There exists a best-case PRBS23 polynomial yielding minimum DC wander of ~4.5 mVpp: x^23 + x^21 + x^18 + x^15 + x^7 + x^2
- The large number of taps tends to break up long runs of 0s or 1s (a common case)
  - Pathological matches between the PRBS and the data pattern have very low probability
  - The retry mechanism changes the polynomial starting point to prevent a pathological data pattern from failing repeatedly
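For illustration, below is a C model of a PRBS23 scrambler built from the tap set quoted on this slide (x^23 + x^21 + x^18 + x^15 + x^7 + x^2). Note this is the DC-wander study polynomial from the slide, not necessarily the polynomial adopted in the released PCIe 3.0 specification, and the bit/tap mapping is an assumption of the sketch.

#include <stdint.h>

/* 23-bit Fibonacci LFSR using the tap set quoted above. State bit 22 is
 * taken as the scrambler output; the feedback bit is shifted in at bit 0. */
static unsigned prbs23_next_bit(uint32_t *state)
{
    uint32_t s = *state;
    unsigned fb = ((s >> 22) ^ (s >> 20) ^ (s >> 17) ^
                   (s >> 14) ^ (s >> 6) ^ (s >> 1)) & 1u;
    *state = ((s << 1) | fb) & 0x7FFFFFu;   /* keep 23 bits of state */
    return (s >> 22) & 1u;
}

/* Scramble one data byte, LSB first, by XOR-ing it with the PRBS stream. */
static uint8_t scramble_byte(uint32_t *state, uint8_t data)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++)
        out |= (uint8_t)((((data >> i) & 1u) ^ prbs23_next_bit(state)) << i);
    return out;
}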
Gen3 Signaling: Error Detection & Recovery
- A framing error is detected by the physical layer when:
  - The first byte of a packet is not one of the allowed set (e.g., TLP, DLLP, IDL)
  - A sync character is not 01 or 10
  - The same sync character is not present in all lanes after deskew
  - There is a CRC error in the length field of a TLP
  - An ordered set is not one of the allowed encodings, or not all lanes send the same ordered set after deskew (if applicable)
  - A 10 sync header is received after a 01 sync header without a marker packet in the 01 sync header, OR a marker packet was received in the 01 sync header and the subsequent sync header in any lane is not 10
- Any framing error requires directing the LTSSM to Recovery
  - Stop processing any received TLP/DLLP after the error until we get through Recovery
  - Block lock is acquired with an EIEOS
  - The scrambler is reset with each EIEOS
- Error detection guarantee: triple bit-flip detection within each TLP/DLLP/IDL/OS
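Two of the framing checks above (each sync header must be 01b or 10b, and identical across lanes after deskew) are easy to express directly; a small C sketch, with the lane count and representation as assumptions:

#include <stdbool.h>
#include <stdint.h>

/* Validate the per-Block 2-bit sync headers: each must decode to 01b or 10b,
 * and after deskew every lane must carry the same value. Any violation is a
 * framing error that sends the LTSSM to Recovery. */
static bool sync_headers_ok(const uint8_t *sync, unsigned nlanes)
{
    for (unsigned i = 0; i < nlanes; i++) {
        if (sync[i] != 0x1 && sync[i] != 0x2)   /* not 01b or 10b */
            return false;
        if (sync[i] != sync[0])                 /* lanes disagree after deskew */
            return false;
    }
    return true;
}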
TLP Processing Hints (TPH)

TPH Mechanism
[Diagram]

Steering Tag (ST)
[Diagram: ST carried in Memory Read or AtomicOperation TLPs]
TPH Summary
- Mechanism to make effective use of the system fabric and improve system efficiency
  - Reduces variability in access to system memory
  - Reduces memory & system interconnect BW & power consumption
- Ecosystem impact
  - Software impact is under investigation; minimally, it may require software support to retrieve hints from system hardware
  - Endpoints take advantage only as needed; no cost if not used
  - Root Complex can make implementation tradeoffs
  - Minimal impact to Switches
- Architected software discovery, identification, and control of capabilities
  - RC support for processing hints
  - Endpoint enabling to issue hints
ID-Based Ordering (IDO)
Review: PCIe Ordering Rules
- The table is based on new 2.0 errata!
- "No" entries are caused by Producer/Consumer restrictions
- "Yes" entries are required for deadlock avoidance

[Table: PCIe transaction ordering rules (not reproduced)]
Motivation
- RO (Relaxed Ordering) works well for single-stream models where a data buffer is written once, consumed, and then recycled
  - Not OK for buffers that will be written more than once, because writes are not guaranteed to complete in the order issued
  - Does not take advantage of the fact that ordering doesn't need to be enforced between unrelated streams
- Conventional Ordering (CO) can cause significant stalls
  - Observed stalls in the 10s to 100s of ns
  - Worst-case behavior may see such stalls repeatedly for a Request stream
- Consider the case of a NIC or disk controller with multiple streams of writes: each CO flag write serializes & adds latency to traffic from unrelated streams
IDO: Performance Optimizations for Unrelated TLP Streams
- TLP Stream: a set of TLPs that all have the same originator
- Optimizations are possible for unrelated TLP Streams, notably with:
  - Multi-Function device (MFD) / Root Port direct connect
  - Switched environments
  - Multiple RC Integrated Endpoints (RCIEs)
- IDO permits passing between TLPs in different streams
- Particularly beneficial when a Translation Agent (TA) stalls TLP streams temporarily
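The permission IDO grants can be stated as a predicate: a TLP may pass an older TLP when the two belong to different streams (different originators) and the IDO attribute is set. The sketch below checks only that relaxation and ignores the rest of the ordering table; requiring the bit on both TLPs is a conservative assumption of the sketch.

#include <stdbool.h>
#include <stdint.h>

struct tlp {
    uint16_t requester_id;   /* originator (bus/device/function) */
    bool     ido;            /* ID-based ordering attribute bit  */
};

/* True when IDO permits the newer TLP to pass the older one: both carry the
 * IDO attribute and they come from different Requester IDs (unrelated
 * streams). All other ordering rules still apply and are not modeled here. */
static bool ido_may_pass(const struct tlp *newer, const struct tlp *older)
{
    return newer->ido && older->ido &&
           newer->requester_id != older->requester_id;
}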
TLP Prefix
Motivation
[Diagram: a TLP Prefix DW prepended ahead of Byte 0 of the existing TLP header]
Prefix Encoding
[Diagram: Prefix Byte 0 formats — bits "100" followed by 0 and a Type field for a Local Prefix, or "100" followed by 1 and a Type field for an End-End Prefix, each followed by the prefix contents]
Stacked Prefix Example
- The Link Local prefix is first: starts at byte 0 (TypeL1)
- End-End prefix #1 follows the Link Local prefix: starts at byte 4 (TypeE1)
- End-End prefix #2 follows End-End prefix #1: starts at byte 8 (TypeE2)
- The PCIe header follows End-End prefix #2: starts at byte 12
[Diagram: STP + Sequence #, Link Local Prefix, End-End Prefix #1, End-End Prefix #2, PCIe Header, ..., ECRC, LCRC]
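A short C sketch of walking stacked prefixes using the Byte-0 encoding from the previous slide (upper bits 100, then a Local/End-End selector bit, then a 4-bit Type, one DW per prefix); the function and field names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Walk the stacked TLP Prefixes at the start of a TLP and return the byte
 * offset of the PCIe header that follows them. Each prefix is one DW whose
 * Byte 0 starts with 100b; the next bit selects Link Local (0) vs End-End (1). */
static unsigned skip_tlp_prefixes(const uint8_t *tlp, unsigned len)
{
    unsigned off = 0;
    while (off + 4 <= len && (tlp[off] >> 5) == 0x4) {   /* Byte 0 = 100x tttt */
        int end_end = (tlp[off] >> 4) & 1;
        printf("%s Prefix, Type %u, starts at byte %u\n",
               end_end ? "End-End" : "Link Local", tlp[off] & 0xFu, off);
        off += 4;                                        /* next DW */
    }
    return off;   /* e.g. 12 for the three-prefix example above */
}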
Multicast
Multicast Motivation & Mechanism Basics
- Several key applications benefit from Multicast
  - Communications backplane (e.g., route table updates, support of IP Multicast)
  - Storage (e.g., mirroring, RAID)
  - Multi-headed graphics
- PCIe architecture extended to support address-based Multicast
  - New Multicast BAR to define the Multicast address space
  - New Multicast Capability structure to configure routing elements and Endpoints for Multicast address decode and routing
  - New Multicast Overlay mechanism in Egress Ports allows Endpoints to receive Multicast TLPs without requiring an Endpoint Multicast Capability structure
- Supports only Posted, address-routed transactions (e.g., Memory Writes)
- Supports both RCs and EPs as both targets and initiators
- Compatible with systems employing Address Translation Services (ATS) and Access Control Services (ACS)
- Multicast capability is permitted at any point in a PCIe hierarchy
Multicast Example
[Diagram: Multicast TLPs address-routed from the Root through a PCIe Switch of P2P Bridges; to address-route Upstream, the Upstream Port must be part of the forwarding Ports for Multicast]
Multicast Memory Space
[Diagram: starting at MC_Base_Address, the Multicast memory space is divided into contiguous regions of 2^MC_Index_Position bytes, one per Multicast Group (Group 0, 1, 2, 3, ...)]
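The address decode implied by this layout is a shift and a bound check; a minimal C sketch, with the group count and function names as illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>

/* Decode which Multicast Group (if any) an address hits, assuming groups of
 * size 2^MC_Index_Position packed contiguously starting at MC_Base_Address. */
static bool mc_decode(uint64_t addr, uint64_t mc_base_address,
                      unsigned mc_index_position, unsigned num_groups,
                      unsigned *group_out)
{
    if (addr < mc_base_address)
        return false;                                   /* below Multicast space */
    uint64_t group = (addr - mc_base_address) >> mc_index_position;
    if (group >= num_groups)
        return false;                                   /* beyond the last group */
    *group_out = (unsigned)group;
    return true;
}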