PCIe Notes
From Legacy Systems to High-Speed Innovation: A Deep Dive into PCIe Architecture
Anoushka Tripathi
Background:
What is PCI?
• PCI (Peripheral Component Interconnect) was created in the early 1990s to fix problems
with older bus systems like ISA (Industry Standard Architecture).
• Back in the day, ISA worked well for older computers (like 286 machines), but it couldn’t
keep up with faster 32-bit computers.
• ISA had problems: it was slow, didn’t have modern features like plug-and-play, and
used big connectors with lots of pins.
• PCI was developed as an open standard by an industry group called the PCI-SIG (PCI
Special Interest Group).
• PCI’s advantages: much higher bandwidth than ISA (32 bits at 33 MHz, about 133 MB/s), plug-and-play configuration instead of jumpers, and processor independence as an open standard.
• PCI-X (PCI-eXtended) was developed a few years later to improve PCI’s performance.
• The main goal was to keep PCI-X compatible with PCI, so old devices would still work.
• PCI-X was still a parallel bus, meaning all data was sent at the same time on multiple
wires, which eventually hit a speed ceiling.
• Parallel buses have a lot of limitations: high pin count, slower speeds, and issues when
trying to go faster.
• To fix this, the industry eventually moved away from parallel buses like PCI-X to serial
buses (like PCI Express).
• PCI buses allow multiple devices to be connected, but as clock speeds increase, the
number of devices that can share the bus decreases.
• With PCI-X 2.0, the bus effectively became a point-to-point system: at the higher clock
speeds, each bus segment could support only one device.
• A typical PCI system included a North Bridge (which connected the processor to the
PCI bus) and a South Bridge (connecting PCI to other peripherals like USB or audio
devices).
A PCI bus transaction takes place in three steps:
1. Request: The device wanting to send data requests use of the bus (the ability to initiate
transfers is called Bus Mastering).
2. Arbitration: The system decides which device gets to use the bus (handled by an
Arbiter).
3. Data Transfer: The device sends or receives data. Devices can insert a Wait State
(pausing the transaction) if they aren’t ready to transfer.
Reflected-Wave Signaling:
• PCI uses a trick called reflected-wave signaling to save power. Instead of fully driving a
signal, the device sends a weaker signal that bounces back and strengthens when it
reaches the other end. This helps reduce power use but also limits the number of
devices and the length of the bus.
o Parallel protocol: a protocol in which address, data, and other control information are sent on the
same clock edge, i.e., many bits in parallel across many wires.
o Serial protocol: a protocol in which address, data, and other control information are sent on
consecutive clock edges, a few bits at a time.
- At the same clock frequency, a parallel protocol with a 32-bit interface moves 32 times
more data per cycle than a serial protocol with a 1-bit data pin.
- In practice, though, parallel protocols (AXI, AHB, APB) are limited by their achievable
frequency of operation, which is why fast serial links can overtake them.
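To put rough numbers on this, here is a small sketch (an illustration added to these notes, with assumed example frequencies) showing why one fast serial lane can rival a whole 32-bit parallel bus once the serial clock is high enough:

#include <stdio.h>

int main(void) {
    /* Illustrative rates: a PCI-style 32-bit parallel bus at 66 MHz
       versus one serial lane at 2.5 GT/s with 8b/10b encoding. */
    double parallel_bps = 66e6 * 32;            /* 32 bits per clock     */
    double serial_bps   = 2.5e9 * (8.0 / 10.0); /* 8 payload bits per 10 */

    printf("parallel: %.0f MB/s\n", parallel_bps / 8.0 / 1e6); /* ~264 */
    printf("serial:   %.0f MB/s\n", serial_bps / 8.0 / 1e6);   /* ~250 */
    return 0;
}

A 32-bit bus at 66 MHz moves about 264 MB/s, while a single 2.5 GT/s lane delivers about 250 MB/s after encoding overhead — and the serial lane can keep scaling in frequency where the parallel bus cannot.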
The PCI (Peripheral Component Interconnect) bus has three main data transfer models:
Programmed I/O (PIO), Direct Memory Access (DMA), and Peer-to-Peer. Here's a
simplified explanation of each, along with some related protocols:
1. Programmed I/O (PIO):
• In the early days, PIO was common because it was easy to implement.
• How it works: The processor (CPU) directly manages the data transfer. For example,
if a PCI device wants to send data to memory, the CPU:
o Reads the data from the PCI device into its internal registers.
o Writes that data from its registers to memory.
• Drawbacks:
o It generates two bus cycles (one for reading, one for writing), making it slow.
o The CPU is busy managing data instead of performing other tasks, making it
inefficient in modern systems.
• Why it's still used: Although inefficient, PIO is necessary for software to interact
with devices, but it’s rarely used for actual data transfers today.
2. Direct Memory Access (DMA):
• How it works: A separate device, called a DMA engine, handles data transfers,
freeing up the CPU. The CPU just sets the starting address and size of the data, and
the DMA engine does the rest.
• Benefits:
o The CPU can focus on other tasks while the DMA engine transfers data
directly between the device and memory.
o It only takes one bus cycle to move a block of data, which is much more
efficient than PIO.
• Over time, devices integrated DMA functionality, allowing them to perform Bus
Mastering, meaning they can control data transfers without needing external DMA
engines.
3. Peer-to-Peer:
• How it works: One PCI device (acting as a Bus Master) can transfer data directly to
another PCI device without involving the CPU.
• Benefits: This keeps the PCI bus busy without disturbing the rest of the system,
allowing for more efficient transfers.
• Drawbacks: It’s rarely used because devices often don’t use the same data format, so
the CPU usually needs to intervene to reformat the data.
• Shared bus: Since many devices share the PCI bus, only one device can use it at a
time. Devices request access from a bus arbiter, which decides who gets the bus
next.
• Hidden arbitration: This arbitration happens in the background, without wasting
clock cycles.
• Retry: When a PCI Bus Master requests data from a device that isn’t ready yet, the
device can ask for a retry. This prevents the bus from being held up by a device that
can’t send data right away. The master has to wait and try again later.
• Disconnect: If a device can transfer some data but not all of it, it can disconnect the
transaction. This frees up the bus for other transfers until the device is ready to
continue the operation.
In short, PCI uses different methods to manage data transfers depending on the complexity
and needs of the system. DMA is preferred for efficiency, PIO is basic but inefficient, and
Peer-to-Peer is rarely used. The PCI bus has mechanisms like arbitration, retries, and
disconnects to keep data flowing smoothly without locking up resources.
PCI Inefficiencies
The PCI Retry Protocol comes into play when a PCI master (such as a processor or North
Bridge) initiates a transaction with a target device (e.g., an Ethernet device), but the target is
not ready to complete the data transfer. In this case, the target signals a retry using the
STOP# signal.
• Mechanism:
o The PCI master begins the transaction by asserting control of the PCI bus.
o If the target device cannot provide the data immediately, it can insert wait-
states (brief pauses). If it requires more than 16 clock cycles, it asserts STOP#
to signal a retry.
o The PCI master then aborts the transaction and waits for at least two clock
cycles before re-arbitrating for control of the bus.
o During the retry, other devices can use the bus, improving efficiency by
preventing long wait periods.
• Efficiency Gains:
Retrying helps avoid holding the bus in an idle state, especially when the target needs
a significant amount of time to prepare the requested data. The master keeps retrying
the transaction until the target is ready and successfully transfers the data.
The PCI Disconnect Protocol is used when the target device can transfer some, but not all,
of the requested data during a transaction.
• Mechanism:
o A PCI master initiates a transaction (e.g., a burst read from Ethernet).
o The target device transfers a portion of the data but then runs out of data to
send.
o If the target cannot provide more data within 8 clock cycles, it asserts the
STOP# signal to disconnect the transaction.
o The PCI master waits for two clock cycles and then re-arbitrates for the bus to
continue the transaction.
o The disconnect protocol allows some data to be transferred before the bus
cycle ends, unlike the retry protocol, which ends the transaction without any
data transfer.
• Efficiency Gains:
The disconnect protocol allows more efficient bus utilization since the bus can be
granted to other devices when the original master is waiting for additional data from
the target device.
Summary of Inefficiencies
Both the retry and disconnect protocols are essential to maintain PCI bus efficiency, but they
still introduce delays due to the need to repeatedly re-arbitrate and re-initiate transactions.
The inefficiency stems from:
1. Bus arbitration overhead: The master must continually re-compete for the bus,
adding delay.
2. Wait-state insertion: Wait-states can briefly stall transactions, though the
retry/disconnect mechanisms attempt to limit this.
3. Limited data transfer per cycle: In cases of retries or disconnects, the PCI master
may not transfer data immediately, requiring multiple attempts, which reduces overall
bandwidth efficiency.
In PCI (Peripheral Component Interconnect) systems, interrupt handling is achieved via four
sideband signals: INTA#, INTB#, INTC#, and INTD#. These signals are used by PCI
devices to notify the system of an interrupt request. Here's how the interrupt handling process
works:
1. Interrupt Assertion: When a PCI device needs to request an interrupt, it asserts one
of these signals. In a single-CPU system, this causes the system's interrupt controller
to assert the INTR (Interrupt Request) pin to notify the CPU of the interrupt.
2. Interrupt Processing in Single-CPU Systems:
o The interrupt controller sends the signal to the CPU via the INTR pin.
o The CPU, upon receiving this signal, must identify the source of the interrupt
by querying the devices or the controller, which takes several bus cycles. This
method is slower and less efficient, especially in systems with multiple
devices.
3. Multi-CPU Systems and APIC:
o In multi-CPU systems, handling interrupts became more complex as a single
INTR pin would not suffice. To manage this, the APIC (Advanced
Programmable Interrupt Controller) was introduced.
o The APIC model improves the interrupt handling by using a messaging
mechanism to communicate with multiple CPUs. Instead of relying on a single
INTR pin, the interrupt controller sends messages to the relevant CPUs,
allowing for more efficient interrupt handling in multi-CPU environments.
4. Legacy Interrupt Handling:
o In legacy PCI systems, identifying the source of an interrupt required multiple
bus cycles, making it inefficient.
o The APIC model significantly reduces the overhead of interrupt handling by
streamlining the process and avoiding the delays associated with querying
devices directly via the bus.
PCI error detection mechanisms involve monitoring transactions for parity errors during
address and data phases. These errors are identified through the use of the PAR (parity)
signal, which maintains even parity across the AD[31:0] and C/BE[3:0] lines. Data parity
errors are reported on PERR#, while more serious problems, such as address parity errors,
are signaled on SERR#.
In summary, PCI interrupt and error handling mechanisms are foundational to system
reliability, providing ways to detect, report, and address issues through both hardware signals
and software intervention.
PCI Express (PCIe) is a significant upgrade from the older PCI architecture, offering
improvements in performance and efficiency. Unlike its predecessor, PCI, which was a parallel
bus system, PCIe operates as a serial bus, more like InfiniBand or Fibre Channel. Although it’s
based on a serial model, PCIe is fully backward compatible with PCI software, making it easy
to integrate into existing systems.
PCIe adopts a dual-simplex connection model, which means that it has separate paths for
sending (transmit) and receiving (receive) data, allowing communication in both directions
simultaneously. This setup is technically full-duplex, but PCIe uses the term dual-simplex to
emphasize that each path is one-way.
One of the main advantages of PCIe is its software backward compatibility with older PCI
systems. Even though the hardware architecture has changed, the address spaces for memory,
IO, and configuration in PCIe remain the same. This means that software written for PCI, like
BIOS code and device drivers, can still work seamlessly with PCIe.
In PCIe, serial communication is used instead of parallel. Serial communication may seem
slower because it transmits one bit at a time, but it achieves much higher speeds, enabling it to
meet or exceed the bandwidth of parallel buses like traditional PCI. PCIe can reach speeds like
2.5 GT/s (Gigatransfers per second), 5.0 GT/s, and 8.0 GT/s, bypassing the limitations faced by
parallel designs.
Parallel bus designs face three timing problems:
1. Flight Time: This is the delay caused by the time it takes a signal to travel from the
transmitter to the receiver. In parallel buses, signals must arrive before the next clock
cycle, but as the clock speeds increase, it becomes impractical to shorten the physical
length of the traces or reduce load.
2. Clock Skew: This happens when the clock signal arrives at the sender and receiver at
different times, complicating data synchronization.
3. Signal Skew: In parallel buses, all data bits should arrive together at the receiver.
However, signal skew causes the bits to arrive at different times, requiring the system to
wait for the slowest bit, which affects overall performance.
PCIe overcomes these problems by embedding the clock within the data stream, eliminating
the need for external clock signals. This solves:
• Flight Time: It no longer matters how long the signal takes to reach the receiver because
the clock arrives with the data.
• Signal Skew: In serial communication, only one bit is transmitted at a time, eliminating
intra-lane skew. In multi-lane setups, any inter-lane skew can be corrected
automatically by the receiver.
Bandwidth Improvements
PCIe offers impressive bandwidth due to its high speed and ability to use multiple Lanes. For
example:
• Generation 1 (Gen1) has a bit rate of 2.5 GT/s, translating to 0.5 GB/s per Lane
(counting both directions).
• Generation 3 (Gen3) runs at 8.0 GT/s and reaches 2.0 GB/s per Lane by using a more
efficient encoding method called 128b/130b encoding, which improves bandwidth
without increasing the clock speed as much.
When multiple Lanes are used (like x4 or x16 Links), the bandwidth multiplies accordingly,
making PCIe highly scalable.
Differential Signals
PCIe uses differential signaling, meaning each lane sends both a positive (D+) and negative (D−)
version of the same signal. This technique doubles the pin count but offers two key advantages:
1. Improved Noise Immunity: Since both signals (D+ and D−) travel closely together, any
noise affecting one affects the other equally. The receiver, which measures the
difference between the two signals, cancels out the noise, improving signal integrity.
2. Reduced Signal Voltage: Differential signaling operates with lower voltages, reducing
power consumption and allowing faster transmission speeds.
In PCIe (Peripheral Component Interconnect Express), a common clock is not needed because
PCIe operates using a source-synchronous model. In this model, the transmitter supplies the
clock to the receiver indirectly by embedding the clock signal into the data stream. This
eliminates the need for an external or forwarded clock.
How it Works:
• Embedded Clock: Instead of sending a separate clock signal, the clock is embedded
into the data stream using 8b/10b encoding (or other encoding mechanisms like
128b/130b for later PCIe generations). This encoding ensures regular transitions in the
data stream to help the receiver recover the clock.
• Clock Recovery: The Phase-Locked Loop (PLL) in the receiver recovers the clock from
the incoming data. The PLL takes the incoming bitstream as a reference and generates a
clock signal that matches the frequency of the transmitted data. It continually
compares the incoming data’s timing (phase) with its own generated clock and adjusts
until they match (this process is called locking).
• PLL Adjustments: Since factors like temperature or voltage fluctuations can affect the
transmitter’s clock, the PLL continuously fine-tunes the recovered clock to maintain
synchronization with the transmitter.
• In parallel systems, a common clock is needed for synchronizing data transfer, but in
high-speed serial links like PCIe, clock skew and transmission delays pose significant
problems. By embedding the clock in the data, these issues are avoided, and the need
for a common clock is eliminated.
• Transition Density: For the PLL to function correctly, it needs regular transitions
(changes from 1 to 0 or 0 to 1) in the incoming bitstream to maintain phase comparison.
Without transitions, the PLL can lose synchronization. The 8b/10b encoding ensures no
more than 5 consecutive ones or zeroes appear in the data stream, preventing
synchronization loss (see the run-length sketch at the end of this list).
• After recovering the clock, the receiver uses it to latch (capture) the incoming data and
deserialize it (convert the serial data stream into parallel data).
• PCIe links also support low power states where data transmission stops. In these
states, the receiver can no longer rely on the incoming reference clock. Therefore, the
receiver needs its own internal clock to manage operations when the data stream is
inactive.
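As promised above, here is a minimal sketch (an illustration added to these notes, not part of the original) of the run-length property that 8b/10b guarantees — a correctly encoded stream never produces a run longer than 5 identical bits:

#include <stdio.h>
#include <stddef.h>

/* Longest run of identical bits (stored one bit per byte, purely for
   clarity). 8b/10b-encoded data keeps this at 5 or less, so the
   receiver's PLL always sees transitions often enough to stay locked. */
static size_t max_run(const unsigned char *bits, size_t n) {
    size_t best = 0, run = 0;
    for (size_t i = 0; i < n; i++) {
        run = (i > 0 && bits[i] == bits[i - 1]) ? run + 1 : 1;
        if (run > best) best = run;
    }
    return best;
}

int main(void) {
    unsigned char stream[] = {1,0,1,1,0,0,0,0,0,1,0,1};
    printf("longest run: %zu\n", max_run(stream, sizeof stream)); /* 5 */
    return 0;
}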
Packet-Based Protocol:
• PCIe uses a packet-based protocol to transfer data. Instead of using side-band control
signals to manage data types, as seen in parallel buses, PCIe sends data in structured
packets. The receiver identifies the packet boundaries and interprets the data based on
predefined structures.
In a PCIe system, devices are connected in a simple tree structure. This means there are no
loops or complicated connections, which makes things easier for the system to manage. The
diagram mentioned shows a typical PCIe setup, with a CPU at the top, connected to various
other devices like memory, endpoints (e.g., graphics cards), and legacy devices.
1. CPU and Root Complex:
• CPU: The brain of the computer. In the PCIe world, the CPU is at the top of the hierarchy,
meaning it controls everything beneath it.
• Root Complex (RC): This is a collection of components that connect the CPU to the
PCIe devices. Think of it as a bridge between the CPU and the PCIe world. It’s the main
hub where the CPU communicates with PCIe devices. The Root Complex usually
includes interfaces for the processor, memory (like DRAM), and the PCIe bus itself.
2. Switches:
• Switch: This is like a traffic controller for PCIe. If you have multiple devices trying to
connect to the CPU through a single PCIe Port, a switch helps manage this. It decides
where to send the data based on the destination address and routes packets to the
correct device.
3. Bridges:
• Bridge: A bridge connects different types of buses. For example, if you have an older PCI
or PCI-X device, a bridge allows it to communicate with newer PCIe systems. There are
two types:
o Forward Bridge: Lets an older PCI or PCI-X bus connect into a newer PCIe system.
o Reverse Bridge: Allows newer PCIe devices to work with older PCI systems.
4. Endpoints:
• Endpoints: These are the actual devices at the “end” of the PCIe tree, such as a
graphics card, network card, or SSD. Endpoints are the devices that send or receive
data, and they only have one port facing upward toward the Root Complex. Endpoints
are categorized into two types:
o Native PCIe Endpoints: Devices that were designed specifically for PCIe, like
modern SSDs or graphics cards. They communicate directly with the PCIe
system and use memory-mapped IO (MMIO).
o Legacy PCIe Endpoints: Older devices that originally worked with older buses
(like PCI-X) but were modified to work with PCIe. These devices may still use
older features that aren't used in newer PCIe designs, such as IO space or
locked requests.
• The system must be compatible with older PCI software and configuration schemes, so
even though PCIe is newer and faster, the topology and configuration remain somewhat
similar to older PCI systems.
• The tree structure helps keep things simple and ensures that devices can be easily
tracked and managed without complex loops.
- Upstream lane: a lane that carries traffic (packets) toward the Root Complex.
- Downstream lane: a lane that carries traffic (packets) away from the Root Complex.
- Upstream port: a port that points toward the Root Complex.
- Downstream port: a port that points away from the Root Complex.
➔ A Link consists of lanes in both the transmit and receive directions; if the transmit
direction is upstream, the receive direction is downstream.
PCIe Day 4
Configuration
- Per lane in 1 direction: 2.5 GT/s × 1 lane = 2.5 Gb/s; with 8b/10b encoding, 2.5 Gb/s × 1 byte/10 bits = 250 MB/s
- Bidirectional: 250 MB/s × 2 = 500 MB/s = 0.5 GB/s
- For 32 lanes: cumulative BW = 0.5 GB/s × 32 = 16 GB/s
- Gen2 numbers: the rate doubles to 5.0 GT/s (still 8b/10b), giving 500 MB/s per lane per direction.
- Gen3 numbers: 8.0 GT/s with 128b/130b encoding gives roughly 1 GB/s per lane per direction.
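These per-lane numbers generalize to any generation and link width. The sketch below (an added illustration, not from the original notes) computes usable one-direction bandwidth from the transfer rate and encoding overhead:

#include <stdio.h>

/* Usable bandwidth in MB/s for one direction of a link.
   rate_gt : transfer rate in GT/s (2.5, 5.0, 8.0, ...)
   payload : payload bits per symbol (8 for 8b/10b, 128 for 128b/130b)
   symbol  : total bits per symbol (10 for 8b/10b, 130 for 128b/130b)
   lanes   : link width (x1, x4, x16, ...) */
static double link_mbps(double rate_gt, int payload, int symbol, int lanes) {
    return rate_gt * 1000.0 * payload / symbol / 8.0 * lanes;
}

int main(void) {
    printf("Gen1 x1:  %6.0f MB/s per direction\n", link_mbps(2.5, 8, 10, 1));
    printf("Gen1 x32: %6.0f MB/s per direction\n", link_mbps(2.5, 8, 10, 32));
    printf("Gen3 x1:  %6.0f MB/s per direction\n", link_mbps(8.0, 128, 130, 1));
    return 0;
}

Doubling the x1 result for both directions reproduces the 0.5 GB/s figure above, and the x32 result doubles to the 16 GB/s cumulative number.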
o The Root Complex should know everything (what type of device it is, who the manufacturer is,
etc.) about each device that is connected to it.
o IO space: all transactions done to these devices are done using IO reads and writes,
➢ inefficient: each address must be accessed individually.
o Memory space: all transactions done to these devices are done using memory reads and memory writes.
➢ Configuration Space
- Same as in PCI
Configuration space registers are mapped to memory locations. Device drivers and
diagnostic software must have access to the configuration space, and operating
systems typically use APIs to allow access to device configuration space.
What is Configuration Address Space?
In the early days of computers, when you installed a new device (like a sound card or
network card), you had to manually set switches and jumpers to tell the computer how to
use it. This was like putting together a puzzle without clear instructions, and it often led to
conflicts where two devices tried to use the same memory, I/O ports, or interrupts (signals
that tell the CPU to do something). It was pretty complicated!
Later, systems got smarter with plug-and-play technology, which made it easier for the
computer to figure out how to use new devices automatically. But things really improved
with the PCI system. PCI introduced a new way to automatically manage all the resources
(like memory and I/O) that each device needed, without conflicts, thanks to Configuration
Address Space.
• When PCI was first designed, every function got 256 bytes of configuration space. These
256 bytes were enough back then because devices didn’t need a lot of extra features.
• The first part of this space, called the configuration header (the first 64 bytes), is used
to set up the basic functionality of the device. There are two types of headers:
o Type 0 header: For most devices (endpoints).
o Type 1 header: For bridge devices (which connect buses).
• The remaining part of the 256 bytes is used for optional features (like adding new
capabilities to the device, such as power management or hot-plugging support).
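For concreteness, here is a simplified C view of that 64-byte Type 0 header (field offsets follow the PCI specification; this sketch is added for illustration and omits the packing pragmas a real driver would need):

#include <stdint.h>

/* Simplified PCI Type 0 configuration header (first 64 bytes). */
typedef struct {
    uint16_t vendor_id;        /* 0x00: identifies the manufacturer      */
    uint16_t device_id;        /* 0x02: identifies the device            */
    uint16_t command;          /* 0x04: enables memory/IO decoding, etc. */
    uint16_t status;           /* 0x06: capability and error bits        */
    uint8_t  revision_id;      /* 0x08 */
    uint8_t  prog_if;          /* 0x09 */
    uint8_t  subclass;         /* 0x0A */
    uint8_t  class_code;       /* 0x0B */
    uint8_t  cache_line_size;  /* 0x0C */
    uint8_t  latency_timer;    /* 0x0D */
    uint8_t  header_type;      /* 0x0E: bit 7 = multifunction,
                                        bits 6:0 = 0 (device) / 1 (bridge) */
    uint8_t  bist;             /* 0x0F */
    uint32_t bar[6];           /* 0x10-0x27: Base Address Registers      */
    /* ... the remaining fields (subsystem IDs, expansion ROM address,
       capabilities pointer, interrupt line/pin) fill out the 64 bytes. */
} pci_type0_header;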
Extended Configuration Space (When 256 Bytes Isn't Enough):
• As PCIe evolved, new devices needed more capabilities, and the original 256 bytes
wasn’t enough anymore. So, PCIe introduced the Extended Configuration Space,
which is 4KB (or 4096 bytes) per function.
• This new space allows PCIe devices to include extra registers that give them more
powerful and flexible features, like advanced error reporting or extra power management
options. However, these extended features can only be accessed by newer software
that knows how to use them. Older systems won’t be able to see or use the extended
space, but they can still use the basic 256 bytes.
In Summary:
• Configuration Address Space is a special area of memory where each PCIe function
gets its own set of settings (called registers) that help the computer detect, configure,
and manage the device.
• Originally, each function had 256 bytes of space, but modern devices needed more
room, so PCIe expanded this to 4KB per function in the form of Extended Configuration
Space.
• This space helps avoid conflicts and allows modern computers to automatically
manage complex devices without needing manual setup.
So, PCIe's Configuration Address Space is like a "control panel" that tells the system
how to manage all the devices connected to it, making sure everything works smoothly
and without conflicts.
1. Bus:
• PCIe supports up to 256 Bus Numbers (0-255), which are assigned by configuration
software. The first bus, typically Bus 0, is associated with the Root Complex, and it
includes Virtual PCI buses and bridges (P2P bridges). These bridges can extend the bus
hierarchy, allowing additional PCIe devices to connect.
• Bus numbers are assigned through a process called depth-first search, where
configuration software starts at Bus 0 and assigns unique bus numbers to each bus it
finds.
2. Device:
• Each PCIe bus can support up to 32 devices (Device 0 to Device 31). However, due to
the point-to-point nature of PCIe, only one device is directly connected to a PCIe link,
and this device will typically have the Device Number 0.
• Devices may reside on virtual PCI buses, like those in the Root Complex or PCIe
switches, which can support multiple attached devices. Each device is expected to
implement Function 0 and may support up to eight functions (Function 0 to Function 7).
3. Function:
• A function is a logical subcomponent of a device. Each function gets its own
configuration space, and a device may implement up to eight of them (Function 0 to
Function 7).
The same three concepts, restated with an analogy:
1. Bus:
• Think of the Bus as a road that devices use to communicate with the computer. In PCIe,
up to 256 buses can be created (numbered from 0 to 255).
• The first bus, Bus 0, is usually connected to the computer's Root Complex, which is like
a central hub where communication starts.
• Sometimes, there are devices called bridges that connect one bus to another, like an
overpass connecting two roads. Each new bus created by a bridge gets its own unique
bus number. The computer assigns these numbers one by one as it explores the
system, looking for new buses to connect.
2. Device:
• Now imagine that along these buses, there are parking spots for devices, like your
graphics card or sound card. Each bus can have up to 32 devices parked on it (Device 0
to Device 31).
• However, because PCIe is point-to-point, only one device can be directly connected to
a single PCIe link (the path from the motherboard to the device). This device will usually
have the number Device 0.
• In some special cases, like when the device is connected through a virtual PCI bus (for
example, in a PCIe switch or Root Complex), you can have multiple devices attached
to the same bus.
3. Function:
• Every device on the bus can perform certain functions. Think of it like having a multi-
tool—one tool (the device) can perform multiple tasks (functions).
• Each device can have up to 8 functions (numbered from Function 0 to Function 7). For
example, a single device might handle your USB ports, network connections, and
display output all at once.
• Devices don't always use all their function slots. A device might only use Function 0 and
Function 2, skipping others. Each function has its own space in the computer’s memory,
where the computer can configure and manage it.
• The BDF (Bus, Device, Function) system is used to give every function of every device its
own address. This is kind of like giving every apartment in a building its own unique
number so that the mail carrier knows exactly where to deliver the mail.
• For example, a network card could have an address like Bus 2, Device 0, Function 1.
This tells the computer: the card sits on Bus 2, occupies Device slot 0 on that bus, and
the request targets its Function 1.
The BDF system is important because it helps the computer’s configuration software (the part
of the system that sets up your hardware) to locate every function, assign it resources, and
route requests to exactly the right place.
In a Nutshell:
• Bus: The highway that devices sit on (numbered 0 to 255).
• Device: The device parked on that highway (like your sound card).
• Function: The specific task that device performs (like handling audio).
• BDF: The address (Bus, Device, and Function) that makes sure the computer knows
exactly where to send data and instructions.
So, in PCIe, the BDF system is like a well-organized map that helps your computer keep track of
all the devices connected to it, making sure each one can do its job without stepping on another
device’s toes.
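A compact way to see BDF in code: the sketch below (illustrative only) packs bus/device/function into the 16-bit form commonly used by configuration software and prints it in the bus:device.function notation familiar from tools like lspci:

#include <stdio.h>
#include <stdint.h>

/* Pack Bus (8 bits), Device (5 bits), Function (3 bits) into 16 bits. */
static uint16_t bdf(uint8_t bus, uint8_t dev, uint8_t fn) {
    return (uint16_t)((bus << 8) | ((dev & 0x1F) << 3) | (fn & 0x07));
}

int main(void) {
    /* The network-card example: Bus 2, Device 0, Function 1. */
    uint16_t id = bdf(2, 0, 1);
    printf("BDF = 0x%04X -> %02x:%02x.%x\n",
           id, id >> 8, (id >> 3) & 0x1F, id & 0x7);  /* 02:00.1 */
    return 0;
}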
The Host-to-PCI Bridge is like the main connection point between the computer's processor
and the PCI devices. It's responsible for making sure the processor can communicate properly
with PCI devices, like a "translator" between the two.
The configuration registers for the Host-to-PCI Bridge don’t have to follow the typical
configuration mechanisms (the methods used in older PCI devices). Instead, these registers are
often placed in the memory address space, which the computer’s firmware (the underlying
software that controls hardware) knows about. So, when the processor wants to interact with
these registers, it knows where to look.
However, even though the placement is different, the layout and how the registers are used still
need to follow the standard Type 0 template. This standard comes from the PCI 2.3
specification, which means it still needs to behave like a typical PCI device in many ways.
PCIe (the updated version of PCI) expands the configuration space for each function on a
device. Let's talk about the space and what it looks like:
o The PCI-Compatible Space: This is the original 256-byte configuration area that
older PCI software can access. It's where important things like PCI Express
capability are stored.
o The Extended Configuration Space: This is where more advanced features (like
error reporting, power budgeting, or virtual channels) are stored. This space is
only accessible by modern PCIe systems using an enhanced method.
So, think of the 256 bytes as the "basic" settings area and the extra 4KB as the "advanced"
settings for new PCIe features.
In a PCIe system, the Root Complex (which is connected to the processor) is the only part of
the system allowed to make configuration requests. In other words, the Root Complex acts like
the "manager" of all configuration activities. This manager sends configuration requests (like
instructions) to the PCI devices and makes sure everything is set up properly.
Why only the Root Complex? Because allowing other devices to change the configuration could
create chaos. Imagine if every device could start changing things without permission — it would
lead to conflicts. So, only the Root Complex has this special permission.
Since only the Root Complex can send configuration requests, the requests can only flow
downstream. This means configuration requests start at the processor (through the Root
Complex) and travel to the devices below it (like PCIe devices on different buses). Devices on
the same level (peer-to-peer) cannot send configuration requests to each other.
These configuration requests are routed based on something called the BDF (Bus number,
Device number, and Function number), which tells the system exactly where the device is
located in the PCIe topology.
Most processors can’t directly make configuration read and write requests. They’re good at
handling memory and I/O requests, but they need help for configuration tasks. That’s where
the Root Complex comes in: it translates the memory and I/O requests from the processor into
configuration requests.
There are two ways to access the configuration space for a PCI or PCIe device: the legacy
I/O-indirect mechanism described here, and the enhanced memory-mapped mechanism
covered later. The legacy mechanism exists because of an old limitation:
o The problem with older systems was that there wasn’t enough I/O address space
(only 64KB available for I/O). By the time PCI came along, this space was
cluttered with many devices.
o To solve this, PCI used a technique called indirect address mapping. This
means that instead of assigning each device a separate I/O address, the system
uses one register for the target address and another register for the data being
sent to or read from that address.
In the Legacy PCI mechanism, the PCI configuration registers are accessed indirectly through
the Configuration Address Port and the Configuration Data Port. Here’s how it works:
• The processor writes the target device’s address (Bus, Device, and Function) to the
Configuration Address Port (at a specific I/O address: 0CF8h).
• Then, the processor writes or reads data from the Configuration Data Port (I/O address:
0CFC–0CFFh).
• The Root Complex checks if the target bus is within its range and, if so, initiates the
configuration read or write request.
The Configuration Address Port is an important mechanism used by the processor to access
the configuration space of PCI devices. When a processor needs to interact with the PCI
configuration space, it writes to the Configuration Address Port to specify which PCI device and
register to target. The address is written in a structured 32-bit format, which is detailed below.
1. Bits [1:0]:
o Always 00: configuration registers are accessed as aligned doublewords, so the
two lowest bits carry no address information.
2. Bits [7:2]: Register Number
o Purpose: Specifies the target dword (register number) in the configuration
space of the device.
o Value: This defines which doubleword (dword) in the PCI device’s configuration
space you want to access. There are 64 dwords in the first section of the
configuration space, meaning this field can address any of these first 64
locations.
o Limitations: This mechanism can only target the first 64 doublewords (64 DWs)
of a device’s configuration space.
3. Bits [10:8]: Function Number
o Value: Selects which function (0 to 7) within the target device is being
addressed.
4. Bits [15:11]: Device Number
o Value: This field specifies which device (0 to 31) is being targeted on the PCI bus.
A PCI bus can support up to 32 devices, each with its own unique device
number.
5. Bits [23:16]: Bus Number
o Value: Specifies the bus number (0 to 255) on which the target device resides.
In a PCI system, there can be up to 256 buses, and this field indicates the bus
where the device is located.
6. Bits [30:24]: Reserved
o Value: They must always be set to 0. They are not used in the current
specification.
7. Bit [31]: Enable
o Value: Must be set to 1 for the access to be treated as a configuration
transaction; the Host Bridge then translates subsequent Configuration Data
Port accesses into configuration requests.
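Putting the address format and the two-port mechanism together, here is a hedged sketch of a legacy configuration read on x86 (it assumes ring-0 I/O privilege and GCC-style inline assembly, as in firmware or OS development; shown for illustration, not as portable user-space code):

#include <stdint.h>

/* Port-I/O helpers wrapping the x86 OUT/IN instructions. */
static inline void outl(uint16_t port, uint32_t val) {
    __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
}
static inline uint32_t inl(uint16_t port) {
    uint32_t val;
    __asm__ volatile ("inl %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

#define CONFIG_ADDRESS 0x0CF8
#define CONFIG_DATA    0x0CFC

/* Read one dword of configuration space for bus/dev/fn at byte offset. */
static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off) {
    uint32_t addr = (1u << 31)                       /* bit 31: Enable     */
                  | ((uint32_t)bus << 16)            /* bits [23:16]: bus  */
                  | ((uint32_t)(dev & 0x1F) << 11)   /* bits [15:11]: dev  */
                  | ((uint32_t)(fn  & 0x07) << 8)    /* bits [10:8]: fn    */
                  | (off & 0xFC);                    /* bits [7:2]: dword  */
    outl(CONFIG_ADDRESS, addr);
    return inl(CONFIG_DATA);
}

/* pci_cfg_read32(4, 0, 0, 0) builds 0x80040000 -- the same value used
   in the example read sequence later in these notes. */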
Summary
• The Host Bridge in the Root Complex is responsible for connecting the CPU to different
PCI buses and devices. It contains two important registers:
o Secondary Bus Number: This represents the number of the bus directly
connected to the Host Bridge.
o Subordinate Bus Number: This defines the maximum bus number that can be
accessed downstream (below) from the Host Bridge.
• These two registers help the Host Bridge determine which bus the device you want to
communicate with is located on.
• To access any PCI device, the CPU writes a 32-bit value to the Configuration Address
Port (0CF8h). This value includes the Bus Number, Device Number, Function
Number, and the Register Number within the device that you want to access.
• When a configuration request is made, the Host Bridge checks if the Bus Number in the
request matches the range of buses it can access, which is determined by the
Secondary and Subordinate Bus Numbers.
o If the target bus is equal to the Secondary Bus Number, the Host Bridge
recognizes that the request is for a device directly connected to that bus. It
sends a Type 0 configuration request, which is a request to configure a device
directly on that bus.
o If the target bus falls between the Secondary Bus Number and Subordinate
Bus Number (but is not exactly the Secondary Bus), the Host Bridge forwards
the request as a Type 1 configuration request. This means the request is being
passed to another bus downstream, where another bridge will handle it and
possibly forward it further.
• Once the target bus and device are determined, the CPU can send read or write
requests to the Configuration Data Port (0CFCh). This is where the actual configuration
data is transferred.
o If the Configuration Address Port had bit 31 set to 1, and the Bus Number
matched the range handled by the Host Bridge, then the data in the
Configuration Data Port will be interpreted as a PCI configuration transaction.
Depending on whether it's a read or write request, the Host Bridge will either
retrieve or update the configuration of the device.
• Type 0 Configuration Request: This is used when the target device is on the same bus
as the Host Bridge's Secondary Bus Number. It means the device is directly accessible
on that bus.
• Type 1 Configuration Request: This is used when the target device is on a different bus
(downstream), and the request has to be forwarded through one or more bridges to
reach the target device.
• If the Bus Number in the request is not within the range of buses managed by the Host
Bridge (i.e., it’s greater than the Subordinate Bus Number), the request will not be
forwarded, and the Host Bridge won’t perform any configuration transaction.
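The host bridge’s routing decision can be summarized in a few lines (a simplified sketch of the rule described above; a real bridge also decodes the device, function, and register fields):

typedef enum { IGNORE, TYPE0, TYPE1 } cfg_action;

/* Decide how a bridge handles a configuration request for target_bus,
   given its Secondary and Subordinate Bus Number registers. */
static cfg_action route_cfg(unsigned target_bus,
                            unsigned secondary, unsigned subordinate) {
    if (target_bus == secondary)
        return TYPE0;   /* device lives directly on the secondary bus    */
    if (target_bus > secondary && target_bus <= subordinate)
        return TYPE1;   /* forward downstream; a lower bridge converts it */
    return IGNORE;      /* out of range: not in this bridge's subtree    */
}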
In systems with multiple Root Complexes (components that connect the CPU to the PCI
Express bus), there is a challenge when it comes to accessing the configuration space. The
Configuration Address and Data ports (used for configuring PCI devices) can be duplicated
across different Root Complexes, but there must be a system to prevent conflicts. Let’s break
this down step by step:
1. Preventing Contention:
When multiple Root Complexes are present, they share the same IO addresses for
configuration. However, to avoid contention (conflicts) between them, only one Root
Complex’s bridge is active at a time during configuration. Here's how it works:
• When the processor writes to the Configuration Address Port (the port that specifies
which PCI device and function is being accessed), only one of the Root Complexes will
respond and participate in the transaction.
2. Enumeration Process:
During enumeration (a process where the system detects and assigns addresses to PCI
devices):
• Software discovers all the buses and devices under the active Root Complex and
assigns them bus numbers.
• After the first Root Complex finishes, the second Root Complex is enabled. The software
assigns it a bus number range that does not overlap with the first Root Complex’s bus
numbers.
By ensuring the two Root Complexes have non-overlapping bus numbers, both can operate
without conflict, even though they see the same configuration requests.
• Any access to the Configuration Address Port is seen by both Root Complexes, but
only the one responsible for the target bus will process the request.
• The selected Root Complex acts as a gateway to the appropriate PCI bus:
o If the request is for a device on the Secondary Bus, it converts the request to a
Type 0 configuration access (for devices directly on its bus).
o If the request is for a bus further down the PCI hierarchy, it converts it to a Type 1
configuration access (to pass through to other buses).
With modern multi-core, multi-threaded CPUs, the old model for accessing configuration
space no longer works well. Here's why and what the spec writers did to solve this:
The old model required two steps:
1. The CPU would first write the target address (bus, device, function, and register) to the
Configuration Address Port.
2. The CPU would then perform a corresponding access to the Configuration Data
Port.
• This was fine when there was only one CPU running a single thread. But in modern
systems with multiple cores and threads, different threads might try to access
configuration space simultaneously, which could cause conflicts. For example, Thread
A might write to the Configuration Address Port, but before it can complete the
operation, Thread B could overwrite the address with a new one.
• Instead of using the old two-step IO port model, they mapped the entire configuration
space into a block of memory addresses.
• Now, accessing configuration space is done with one memory request, which directly
generates a Configuration Request on the PCI Express bus.
• Each PCI Function now gets a 4KB block of configuration space (up from the previous
256 bytes).
• Mapping configuration space for all possible PCI functions requires 256MB of address
space.
However, this is a minor issue because modern CPUs support large memory address spaces (36
to 48 bits of physical address space). In such large address spaces, 256MB is insignificant.
• The Root Complex is not required to support certain advanced behaviors, such as
configuration accesses that cross a dword boundary or locked transaction semantics.
supports them.
The issue being addressed is how to access and configure PCI devices efficiently, given the
limitations of legacy methods which used a restricted portion of the address space for
configuration. To solve this, the writers of the PCI Express specification decided to map the
entire PCI configuration space directly into memory addresses. This allows a simpler and faster
way of accessing configuration space with a single memory access command.
• Instead of using limited IO ports to access configuration registers, all PCI configuration
space is mapped into a dedicated 256MB of memory address space.
• Each PCI Function (a function is a logical subcomponent of a device) gets its own 4KB
of this address space.
• Mapping the entire configuration space into memory allows the system to send a single
memory request, which generates a Configuration Request on the PCI bus.
• The trade-off for this new approach is that it consumes 256MB of memory address
space. However, this is insignificant in modern systems where the CPU can address 36
to 48 bits of memory (a very large addressable space, far beyond 256MB).
• Thus, the new method allows for much more straightforward configuration access
without worrying about running out of IO space, but at the cost of using some of the
available memory address range.
Each PCI Function’s 4KB configuration space is mapped starting at a 4KB-aligned address
within this 256MB of memory. This means that the memory address itself now carries
information about which PCI device and function are being targeted.
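Because the address itself encodes the BDF, computing a function’s configuration address is simple arithmetic. A sketch (the base address 0xE0000000 is an assumed, platform-specific example, chosen to be consistent with the enhanced-access example later in these notes):

#include <stdint.h>

#define ECAM_BASE 0xE0000000u  /* assumed platform-specific base address */

/* Address of a configuration register under enhanced (memory-mapped)
   configuration access: each function gets a 4KB-aligned block. */
static uintptr_t ecam_addr(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off) {
    return ECAM_BASE
         | ((uintptr_t)bus << 20)           /* 256 buses                */
         | ((uintptr_t)(dev & 0x1F) << 15)  /* 32 devices per bus       */
         | ((uintptr_t)(fn  & 0x07) << 12)  /* 8 functions per device   */
         | (off & 0xFFF);                   /* 4KB of space per function */
}

/* ecam_addr(4, 0, 0, 0) == 0xE0400000: Bus 4, Device 0, Function 0,
   register 0 -- matching the enhanced-access example below. */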
When configuring devices on a PCI Express (PCIe) bus, configuration requests are used to
access a device's configuration space. There are two types of configuration requests: Type 0
and Type 1, depending on whether the target device is on the current bus or a bus further
downstream. Here's how these two types of requests work:
• A Type 0 configuration request is used when the target bus number matches the
Secondary Bus Number of the bridge. This indicates that the target device is located
directly on the bus connected to the secondary side of the bridge.
Steps:
o Devices on the secondary bus check the Device Number field to determine
which device is being targeted. In PCIe, Endpoints on an external link are always
assigned as Device 0.
o The selected device then checks the Function Number to identify which
function within the device is being accessed. Devices can have multiple
functions (for example, a network card could have a control function and a data
transfer function).
o The selected function uses the Register Number field in the request to
determine which dword (32-bit block) in the configuration space is being
accessed.
o The First Dword Byte Enable (BE) field specifies which bytes within the selected
dword are to be read or written.
• The Format (Fmt) field specifies whether the request is a read or a write.
• A Type 1 configuration request is used when the target bus number does not match the
Secondary Bus Number of the bridge, but the target bus number is within the range
specified by the bridge's Secondary Bus Number and Subordinate Bus Number. In
this case, the packet is forwarded to the bridge's secondary bus as a Type 1 request.
Steps:
o When the request reaches the bridge, the bridge compares the target bus
number with its Secondary and Subordinate Bus Numbers.
o If the target bus falls within this range, the bridge forwards the request to the
secondary bus as a Type 1 request.
o If the target bus matches the bridge’s secondary bus, the request is converted
from Type 1 to Type 0 and is forwarded to the secondary bus for devices on that
bus to process.
o If the target bus does not match the bridge's secondary bus but is within its
range, the request is passed further downstream as a Type 1 request.
• The Format (Fmt) field indicates whether the request is a read or a write.
• Type 0:
o Used when the device is on the local bus (i.e., the secondary bus of the bridge).
o Devices on the bus directly process the request by checking the device,
function, and register fields.
• Type 1:
o Bridges process these requests and forward them to the appropriate bus until
they reach the target bus.
1. Setting the Target Address:
o The code mov dx, 0CF8h sets up DX with the configuration address port (0xCF8).
o Then, mov eax, 80040000h sets the EAX register to point to Bus 4, Device 0,
Function 0, and the first DWORD (register 0, containing the Vendor ID). This
value sets the Enable bit (bit 31) along with Bus Number 4, Device Number 0,
Function Number 0, and Register Number 0.
o The out dx, eax instruction writes this value to the Configuration Address Port
(0CF8h), indicating a request to read from Bus 4, Device 0, Function 0. Since
the bus number is 4 (non-zero), this triggers a Type 1 Configuration Read
starting from Bus 0.
2. Root Complex Handling:
o The Root Complex (often referred to as the Host/PCI Bridge) receives the
configuration request from the processor. It knows the requested bus is
downstream of Bus 0, so it begins a Type 1 Configuration Read targeting Bus 4.
3. Forwarding Through Bridges:
o Device 1 on Bus 0 is a PCI-to-PCI (P2P) bridge; it checks its bus range.
Since Bus 4 is in its range (Bus 1 to Bus 4), it forwards the request downstream.
o The request then travels through Bus 1 and Bus 2, being forwarded by the
respective PCI bridges until it reaches Bus 4.
4. Reaching the Target Bus:
o Once the request reaches the bridge whose secondary bus is Bus 4, that bridge
converts the Type 1 request to a Type 0 Configuration Read, because it is now
addressing a device on its local bus (Bus 4).
o The target device (Device 0) and function (Function 0) are identified from the
configuration request.
o The Type 0 request specifies the first DWORD (register 0), which holds the
Vendor ID of the device.
o The PCI device responds with the Vendor ID (the first two bytes of that register).
5. Return of Data:
o The response packet, containing the Vendor ID, is sent back to the Root
Complex and then forwarded to the processor via the configuration data port
(0xCFC), which is read using the IN instruction.
Enhanced Configuration Access:
• Instead of using the I/O ports (0xCF8 and 0xCFC), Enhanced Configuration Access
uses memory-mapped I/O.
• In the example, the address E0400000h refers to the configuration space in memory,
allowing the processor to directly perform memory read operations.
• The processor reads from this memory location, and the Root Complex generates a
Configuration Read request, similar to the legacy access method, but triggered by a
memory read operation.
After a system reset or power-up, the configuration software scans the PCI Express (PCIe) fabric
to discover the topology of devices connected to the system. This process is called
enumeration. Here’s how it works step-by-step:
At the start, the only known device in the system is the Host/PCI bridge, which serves as the
entry point for configuration and connects the processor and the PCIe fabric. This bridge
assigns Bus 0 to its downstream side, known as the secondary bus. The rest of the topology,
including devices on other buses, is yet to be discovered.
• The processor and the Root Complex (part of the bridge) know that Bus 0 exists.
• Other buses and devices are marked as unknown ("? ?") and are yet to be identified by
the configuration software.
Enumeration involves the configuration software searching for devices on each bus by sending
Configuration Read Requests to each potential bus, device, and function combination. Here's
how this process is carried out:
o The software attempts to read the Vendor ID from each device's configuration
space.
o By reading the Vendor ID register for all possible Bus, Device, and Function
numbers, the software determines whether a device is present.
o If no device exists at the targeted Bus/Device/Function, nothing responds to the
request. The PCIe fabric handles this situation by returning a Completion
with Unsupported Request (UR) status:
▪ The upstream PCIe bridge (above the target) generates this completion.
▪ The Root Complex translates the result to all ones (FFFFh), a reserved
value indicating the absence of a device.
o The software interprets a Vendor ID of FFFFh as the device not being present.
This prevents the system from falsely reporting errors.
o Although a Master Abort error (from legacy PCI) or a UR response could be seen
as an error during normal runtime, it is expected during enumeration and not
treated as a critical issue.
o The system uses a special Unsupported Request Status bit to note these
conditions without halting the enumeration process. This is important to prevent
unnecessary error handling during this stage, as the system might not yet have
all error-handling capabilities active.
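A minimal sketch of this presence test (assuming a configuration-read helper like the ones sketched earlier; illustrative only):

#include <stdint.h>
#include <stdbool.h>

/* Assumed helper: reads one config dword for bus/dev/fn at byte offset
   (see the legacy and enhanced access sketches earlier in these notes). */
uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off);

/* A function is present iff its Vendor ID is not the reserved FFFFh
   value the Root Complex substitutes for an Unsupported Request. */
static bool pci_function_present(uint8_t bus, uint8_t dev, uint8_t fn) {
    uint16_t vendor = (uint16_t)(pci_cfg_read32(bus, dev, fn, 0x00) & 0xFFFF);
    return vendor != 0xFFFF;
}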
Another issue the software might encounter is when a device exists but is not yet ready to
respond to configuration requests. This typically happens after a system reset, during which the
device needs time to initialize.
1. Initialization Delay:
o After reset, PCIe devices require some time to initialize before they can respond
to configuration accesses. If the data rate is 5.0 GT/s or less, software must wait
100 ms before sending a configuration request.
o For higher speeds (e.g., Gen3), the wait time extends beyond 100 ms, due to the
time required for Link training and Equalization.
2. Configuration Request Retry Status (CRS):
o A device that is still initializing can answer a configuration request with CRS.
This status signals that the device is not yet ready but will be shortly.
o The Root Complex handles CRS differently depending on the system settings.
During enumeration, it may retry the request transparently, or, if CRS Software
Visibility is enabled, return a special Vendor ID value of 0001h so software
knows to poll again later.
Enumeration is crucial because it allows the system to discover the full topology of PCIe
devices and allocate resources like bus numbers, memory addresses, and I/O space. Without
enumeration, the system would not know how many devices are connected or how to
communicate with them.
To determine whether a PCIe function is an endpoint or a bridge, we use information from the
Header Type register (offset 0Eh in the PCI configuration space header). Here's how it works:
Key Points:
1. Check the Function Type:
o The lower 7 bits of the Header Type register (bits 6:0) identify the function type.
o Values:
▪ 0 = Endpoint (Type 0 header).
▪ 1 = Bridge (Type 1 header).
2. Check Multifunctionality:
o Bit 7 of the Header Type register indicates whether the device has multiple
functions: 0 = single-function device, 1 = multifunction device.
3. Enumeration Process: During enumeration, the software probes devices and functions
starting from bus 0, device 0. For each device and function found, the Vendor ID and
Header Type registers are checked to determine the nature of the function (endpoint or
bridge). If a function is found to be a bridge, its bus number registers (Primary,
Secondary, and Subordinate) are updated, and the enumeration continues downstream
from that bridge.
o If the Header Type is 1, set bus number registers for the bridge and continue
downstream.
4. Continue scanning for functions in a depth-first manner until all devices and functions
have been discovered and appropriately configured.
Once enumeration completes, each bridge’s Subordinate Bus Number register is updated
with the actual largest bus number downstream, ensuring that the PCIe hierarchy is correctly
mapped for further transactions.
This process is essential for configuring a PCIe system, as it helps organize the hierarchy of
devices and allows the host system to address each device accurately.
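To tie the pieces together, here is a condensed sketch of the depth-first scan (an added illustration, assuming the config read/write helpers sketched earlier; it omits multifunction scanning, resource allocation, and bus-number exhaustion checks, and it clears the latency-timer byte for brevity):

#include <stdint.h>

uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off);
void     pci_cfg_write32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off,
                         uint32_t val);  /* assumed helper */

/* Depth-first scan starting at 'bus'; returns the highest bus number
   found in this subtree (used to program Subordinate Bus Numbers). */
static uint8_t scan_bus(uint8_t bus, uint8_t next_bus) {
    uint8_t max_bus = bus;
    for (uint8_t dev = 0; dev < 32; dev++) {
        if ((pci_cfg_read32(bus, dev, 0, 0x00) & 0xFFFF) == 0xFFFF)
            continue;                          /* no function here        */
        uint8_t hdr = (pci_cfg_read32(bus, dev, 0, 0x0C) >> 16) & 0x7F;
        if (hdr == 1) {                        /* Type 1 header: a bridge */
            uint8_t secondary = ++next_bus;    /* assign the next number  */
            /* Program Primary/Secondary; Subordinate temporarily 0xFF so
               Type 1 requests can flow downstream during the scan. */
            pci_cfg_write32(bus, dev, 0, 0x18,
                            (0xFFu << 16) | ((uint32_t)secondary << 8) | bus);
            max_bus = scan_bus(secondary, secondary);
            next_bus = max_bus;
            /* Now set the real Subordinate Bus Number. */
            pci_cfg_write32(bus, dev, 0, 0x18,
                            ((uint32_t)max_bus << 16) |
                            ((uint32_t)secondary << 8) | bus);
        }
    }
    return max_bus;
}

/* Typical entry point after reset: scan_bus(0, 0). */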
PCIe Day 5
Architecture Overview
Even though PCIe is an advanced technology compared to the older PCI standard, it keeps a
level of compatibility with older systems. One way this is achieved is by keeping the
configuration headers for both Endpoints (like devices) and Bridges (which connect buses)
similar to those in PCI. Think of the configuration headers as the “identity card” for a device.
When older software interacts with PCIe devices, it doesn’t really see the difference between
PCI and PCIe because the basic layout of these headers remains unchanged.
Bridges in PCIe
In PCIe systems, instead of having individual bridges, you often have Switches and Root
Complexes (the Root is the interface between the CPU and PCIe). While older software might
still think it’s dealing with regular PCI bridges, in reality, it’s working with a more complex
internal setup. This setup is hidden from the software, meaning the software only sees what it
expects, even if the internal design of the Root or Switches has changed and become more
advanced.
In PCIe, each function keeps a 256-byte PCI-compatible Configuration Space (the first
portion of its 4KB space). This space is divided into sections:
• 64 bytes are reserved for the old PCI configuration (so older software can still function).
• The remaining 192 bytes are used for PCIe-specific or function-specific configurations.
This means that newer PCIe features are supported while still keeping the older system
functionality intact.
Topology Example
Consider how your computer’s PCIe system looks. At the top is the Root Complex (which
connects to the CPU), and beneath that are several PCIe devices and ports. In older systems,
the software would see this as a series of bridges. With PCIe, while the system structure is more
complex, it is made to look like the old system to the software. The Root Complex has internal
connections that act like PCI buses, even though they aren’t actually physical PCI buses.
When the system powers on, it goes through a process called enumeration, where it discovers
all the connected devices (like graphics cards or network cards) and assigns them bus numbers
and system resources. This process in PCIe works just like it did in older PCI systems. Once
enumeration is done, the software has a clear map of the devices in the system, making it easier
to manage them.
In traditional PCI systems, bridges were used to connect different buses. Each bridge would
help route data between different parts of the system. In PCIe, while the technology has
evolved, the way things look to the software hasn't changed much. The Switch in a PCIe system
works internally in a more sophisticated way but still appears to software as a collection of
bridges. These bridges are all connected by a shared bus.
By organizing PCIe Switches to look like PCI bridges to the software, it simplifies compatibility.
This setup means the software doesn't need to be rewritten or heavily modified when moving
from older PCI systems to newer PCIe systems. The software still thinks it's dealing with regular
PCI bridges, but in reality, the Switch is doing the work behind the scenes. This allows
transaction routing—which is the process of sending data between different parts of the
system—to function just like it did in the older PCI systems.
Enumeration Process
Enumeration is a key process that happens when the system starts up. During enumeration,
the configuration software scans the system, identifies all the connected devices, and assigns
them bus numbers and system resources like memory and I/O space.
In a PCIe system, even though the internal setup might be more complex with Switches and
multiple buses, the enumeration process works the same way as it did with PCI. Once the
system is done with enumeration, each device and bridge gets a bus number and can now
communicate with the rest of the system. This is important because enumeration ensures that
every device knows its place in the system and how to route its transactions.
Figure below provides a visual example of how enumeration works. Let’s walk through it in a
simplified way:
• The Root Complex (connected to the CPU) starts by having Internal Bus 0, which is a
virtual bus that connects everything below it.
• The PCI-PCI Bridges appear as different buses (Bus 1, Bus 2, etc.), and each device (like
a PCIe Endpoint) gets a bus number.
• For example, Bus 3 might have a PCIe Endpoint, which is some device like a network
card or a graphics card. Similarly, other endpoints and bridges will be on other buses,
like Bus 5, Bus 7, etc.
Once enumeration finishes, the system knows where every device is located, and
communication can happen smoothly.
To make this concept even clearer, let’s compare two different types of systems:
1. Low-Cost Consumer Desktop: In a simple desktop machine, you might have a few
PCIe ports and slots for adding things like a graphics card or a sound card. The internal
structure of this system looks very similar to how older PCI systems were organized. You
still have a Root Complex, a few PCIe Ports, and some slots for add-in cards. The
simplicity of the design makes it easier for software to manage.
2. High-End Server: A server, on the other hand, has a much more complex structure. It
might have multiple networking interfaces and many PCIe slots for connecting storage
devices or other peripherals. Even though the system design is more sophisticated, PCIe
allows these high-end systems to be managed similarly to simpler ones. In the early
days of PCIe, some even thought that PCIe could replace other networking protocols
due to its flexible architecture, but it hasn’t fully replaced them because external
networks typically use different technologies.
Another key concept is the Root Complex. The Root Complex is the starting point of the PCIe
system, where the CPU interfaces with the PCIe devices. In modern processors, especially from
companies like Intel, the Root Complex is often integrated into the CPU package itself.
For example, modern CPUs may have integrated memory controllers (for DRAM), a PCIe x16
port (for graphics cards), and routing logic. All of this logic outside the actual CPU cores is often
referred to as Uncore logic, meaning it’s part of the CPU package but not directly involved in the
processing power of the CPU cores. The Root Complex here handles all the traffic between the
PCIe devices and the CPU, and since part of it resides inside the CPU package, this integration
makes communication faster.
In Summary
• Switches in PCIe systems act like a group of bridges to the software, which helps
maintain compatibility with older software.
• Transaction routing (sending data between devices) and enumeration (discovering and
organizing devices) work the same way in PCIe as they did in PCI.
• In both low-cost and high-end systems, PCIe can be organized in a way that the system
appears simpler to the software.
• The Root Complex connects the CPU to the PCIe devices, and in modern systems, it’s
often integrated inside the CPU package, speeding up communication.
PCIe defines a layered architecture where each layer has a distinct role, and these layers
operate independently for transmit (TX) and receive (RX) traffic. This separation into layers
provides flexibility for hardware designers because it allows upgrades or modifications to one
layer without necessarily affecting the others. However, while the layers are defined for clarity, there is no requirement that a design partition its hardware in exactly this way to meet PCIe compliance.
The layers, as depicted in figure below, are crucial to understanding how PCIe transfers data
and how the different parts of a device interact with each other.
At the heart of the PCIe system is the device core, which implements the primary functionality
of the PCIe device. Depending on the type of device, the core can vary:
• If it’s an endpoint (a device like a graphics card or network card), it may contain multiple
functions (up to 8). Each of these functions has its own configuration space.
• If it’s a switch, the core contains packet routing logic and an internal bus for routing
packets to their destinations.
• If it’s a root complex, which interfaces the CPU to the PCIe devices, the root core
typically includes a virtual PCI bus 0, where embedded devices and virtual bridges
reside.
1. Transaction Layer
• It creates Transaction Layer Packets (TLPs) for transmission and decodes them upon receipt.
• It is also responsible for:
o Flow Control: Ensuring that data is sent at a rate the receiver can handle.
o Transaction Ordering: Keeping track of the correct order in which data should be processed.
Each of these functions ensures that the data flow between devices is efficient and properly
managed. The creation and processing of TLPs occur on both the transmit and receive sides of
the system.
2. Data Link Layer
The Data Link Layer ensures reliable communication between two directly connected devices:
• It handles Data Link Layer Packets (DLLPs) for error management, including the
creation and decoding of these packets.
• One of its primary responsibilities is error detection and correction using the Ack/Nak
protocol.
The Data Link Layer thus ensures that any errors in transmission are caught and corrected to
maintain data integrity.
3. Physical Layer
The Physical Layer is responsible for the actual transmission of data across the PCIe link. It
handles the conversion of data into a form that can be transmitted over the physical
connection, as well as the reception and conversion of incoming data into a usable format.
• Ordered-Set Packet Creation: On the transmit side, it creates Ordered-Sets, which are
groupings of bytes used for synchronization and control during transmission.
On the transmit side, the Physical Layer processes these packets through various steps:
• Byte Striping: Distributes the data across different lanes for parallel transmission.
• Scrambling: Randomizes the data to avoid patterns that could cause interference.
• 8b/10b Encoding (for Gen1/Gen2) or 128b/130b Encoding (for Gen3): Converts the
data into a format suitable for transmission.
• Serialization: Converts the data from parallel to serial form for transmission across the
link.
On the receive side, the Physical Layer performs the reverse operations:
• Clock and Data Recovery (CDR): Uses the incoming data stream to recover the clock
signal and ensure accurate data timing.
• Elastic Buffers: Buffer incoming data to handle any clock or data rate mismatches.
The Physical Layer also includes the Link Training and Status State Machine (LTSSM), which is
responsible for initializing and training the link to ensure it operates correctly. This process
adjusts settings like link speed and the number of active lanes to maximize data transfer
efficiency.
Imagine two PCIe devices (Device A and Device B) communicating. The layered architecture
ensures that:
• On the transmit side (TX) of Device A, the Transaction Layer creates TLPs, which then
pass through the Data Link Layer for error management, and finally, the Physical Layer
converts them into a form suitable for transmission over the PCIe link.
• On the receive side (RX) of Device B, the Physical Layer receives the serialized data,
decodes it, and passes it through the Data Link Layer for error checking, before finally
reaching the Transaction Layer, where the original data is reconstructed.
A common question is whether Switch Ports need to implement all these layers since they
primarily route packets between devices. The answer is yes—Switch Ports must implement the
full stack of layers, especially the Transaction Layer, because they need to inspect the packet
contents to determine how to route them. The Transaction Layer logic is essential for reading
packet headers, managing flow control, and maintaining transaction ordering across the PCIe
topology.
Each layer in the PCIe interface has a specific role, and it communicates with its corresponding
layer in the device at the other end of the PCIe Link. For example:
• The Transaction Layer in the transmitting device communicates with the Transaction
Layer in the receiving device.
• Similarly, the Data Link Layer and Physical Layer in the transmitter communicate with
their counterparts in the receiver.
This communication occurs through a process of packetization. The upper two layers
(Transaction Layer and Data Link Layer) organize data into packets, with each layer adding
specific information necessary for that layer’s function.
Here’s a simplified flow of how packets move through the layers in a PCIe interface:
o This packet contains details like the command type, the target address, and
other attributes (e.g., read/write operations).
o The packet is stored in a buffer, called a Virtual Channel, until it’s ready to be
passed down to the next layer.
o The Data Link Layer adds additional information for error checking.
o In the Physical Layer, the packet is encoded and transmitted across the PCIe
Link using all available lanes. Transmission is typically done using differential
signaling for robustness.
o On the receiving device, the Physical Layer decodes the incoming data from the
PCIe Link and checks for any errors.
o If no errors are found, the packet is forwarded up to the Data Link Layer.
o The Data Link Layer performs further error checks to ensure the packet is
intact. If no errors are detected, the packet is passed up to the Transaction
Layer.
o The Transaction Layer on the receiving side buffers the packet, checks for any
additional errors, and then disassembles the packet to retrieve the original
information (such as the command type, target address, etc.).
o Finally, the contents are delivered to the device core of the receiving device,
allowing the core logic to process the request (e.g., executing a command or
transferring data).
This layered communication structure ensures that each layer handles a specific function, and
by doing so, the process of transmitting and receiving data becomes modular and manageable.
Each layer performs its role (whether that’s adding/removing error checking, organizing data into
packets, or physically transmitting bits), and in doing so, the PCIe system can reliably transfer
data across devices.
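As a mental model only, the wrap-and-unwrap behavior of the layers can be sketched in a few lines of Python. The field names, framing strings, and the use of hash() as a stand-in CRC are all simplifications, not the real bit-level formats.

    # A toy model (strings and dicts, not real bit formats) of how each layer
    # wraps a packet on transmit and unwraps it on receive.
    def transaction_layer_tx(header, data):
        return {"header": header, "data": data}        # core TLP (+ optional ECRC)

    def data_link_layer_tx(tlp, seq):
        return {"seq": seq, "tlp": tlp, "lcrc": hash(str(tlp))}   # seq number + LCRC

    def physical_layer_tx(link_packet):
        return ["STP", link_packet, "END"]       # framing, then (conceptually) serialize

    def physical_layer_rx(frame):
        assert frame[0] == "STP" and frame[-1] == "END"   # find packet boundaries
        return frame[1]                                   # strip framing

    def data_link_layer_rx(pkt):
        assert pkt["lcrc"] == hash(str(pkt["tlp"]))       # link-level error check
        return pkt["tlp"]                                 # strip seq + LCRC

    tlp = transaction_layer_tx({"type": "MWr", "addr": 0x1000}, b"\x01\x02")
    wire = physical_layer_tx(data_link_layer_tx(tlp, seq=7))
    assert data_link_layer_rx(physical_layer_rx(wire)) == tlp   # original TLP recovered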
The Transaction Layer in PCIe is responsible for managing how devices communicate with each other by sending and receiving data. It handles two main tasks:
1. Sending Requests: A device initiates transactions by issuing requests (like "read" or "write" commands) to other devices connected to the PCIe system.
2. Receiving Responses: For some requests, the device that sent the request expects a
response. For example, if a device asks to read data, the device that has the data will
send a completion packet back, confirming the operation and returning the requested
data.
The transactions at this layer are communicated using Transaction Layer Packets (TLPs). TLPs handle various types of requests, which can be categorized into four groups: Memory, IO, Configuration, and Messages.
• The first three types (Memory, IO, and Configuration) were carried over from PCI and PCI-X, but Messages are specific to PCIe.
Table below lists the types of requests along with whether they are Posted or Non-Posted transactions:

Request Type                    Posted or Non-Posted
Memory Read                     Non-Posted
Memory Write                    Posted
IO Read / IO Write              Non-Posted
Configuration Read / Write      Non-Posted
Message                         Posted
• Non-Posted transactions require a completion:
o I/O Writes and Configuration Writes: Although they send data to the target, they still require a completion to confirm that the write was successful. This follows the split transaction protocol from PCI-X, where the request and completion occur in separate packets.
• Posted transactions do not:
o Memory Writes: The requester sends the write command and assumes it will be completed successfully without any explicit confirmation from the target device.
o Messages: These are control or event signaling packets that also do not require a completion packet.
Although posted transactions don’t receive a completion packet, they still use the Ack/Nak
protocol at the Data Link Layer to ensure reliable delivery. For example, even in a posted write,
the Data Link Layer will acknowledge that the packet was successfully transmitted. This is
important because it ensures data integrity without the performance overhead of completion
packets.
TLP Basics
TLPs originate in the Transaction Layer of the transmitter and terminate in the Transaction
Layer of the receiver. As the packet moves through the PCIe layers, the Data Link Layer and
Physical Layer add information to the TLP, ensuring reliable transmission and error-checking.
• The Data Link Layer adds error-checking data to ensure the packet is transmitted
correctly across the link.
• The Physical Layer handles the actual transmission of the packet over the physical
connection between devices.
Upon receiving the packet, the same layers at the receiving end verify that the data was
transmitted correctly, and the Transaction Layer at the receiver processes the packet based on
its content (e.g., memory operations or configuration commands).
1. Transaction Layer Builds the Core of the TLP
• The Transaction Layer starts the process by creating the core part of the TLP. This part includes important information, such as a header, and in some cases, data (like when writing to memory).
writing to memory).
• Header: Every TLP has a header, which acts like an address label that tells the system
where the packet is going and what type of request it is (for example, a memory read or
write).
• Data: Some packets, like a write request, will have data that needs to be transferred.
But other packets, like a read request, won’t include data—they just contain
instructions.
• Optional ECRC (End-to-End Cyclic Redundancy Check): The Transaction Layer can
also add an ECRC at the end of the packet. This is a type of error detection code that
ensures no errors occurred during transmission between the sender and receiver.
• The packet, along with the ECRC (if used), is then passed down to the next layer, the
Data Link Layer.
2. Data Link Layer Adds Sequence Number and CRC (Link-Level Error Detection)
• The Data Link Layer is responsible for ensuring the packet is transmitted error-free
across the immediate PCIe link between two devices.
• Sequence Number: The Data Link Layer adds a sequence number to the packet, which
helps the receiving device keep track of the order of packets.
• LCRC (Link Cyclic Redundancy Check): The Data Link Layer also adds another error-
checking code called the LCRC. This code allows the receiving device to check if any
errors occurred during transmission over that specific link. If there’s an error, the
receiver notifies the sender, and the packet is retransmitted.
You might wonder, if we already have LCRC for error checking, why do we need ECRC?
• LCRC only checks for errors on the link between two neighboring devices.
• ECRC, on the other hand, checks for errors across the entire path from the sender to
the final destination. This is useful because errors could happen inside the devices (like
switches or ports) that route the packet, and LCRC won’t catch those.
By having both, we make sure errors are caught at different stages, providing better protection.
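A small sketch makes the division of labor concrete. Here zlib.crc32 is only a stand-in for the actual CRC polynomials PCIe defines; the point is that the LCRC is stripped and recomputed at every hop, while the ECRC rides unchanged to the final destination.

    import zlib   # zlib.crc32 stands in for PCIe's actual CRC polynomials

    def add_ecrc(tlp):
        # Computed once by the sender; carried unchanged end to end.
        return tlp + zlib.crc32(tlp).to_bytes(4, "little")

    def add_lcrc(payload):
        # Recomputed for every hop (requester->switch, switch->completer, ...).
        return payload + zlib.crc32(payload).to_bytes(4, "little")

    def check_and_strip_lcrc(frame):
        payload, lcrc = frame[:-4], frame[-4:]
        assert zlib.crc32(payload).to_bytes(4, "little") == lcrc
        return payload          # LCRC removed at each hop; ECRC stays attached

    tlp = add_ecrc(b"MRd header + data")
    hop1 = check_and_strip_lcrc(add_lcrc(tlp))    # requester -> switch
    hop2 = check_and_strip_lcrc(add_lcrc(hop1))   # switch -> completer
    assert hop2 == tlp    # ECRC still intact for the final target to verify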
3. Physical Layer Prepares the Packet for Transmission
• The Physical Layer is responsible for converting the packet into a format that can be transmitted over the physical connection (the wires) between devices.
• Control Characters (in PCIe Gen 1 and 2): In earlier versions of PCIe (Generations 1
and 2), the Physical Layer added special control characters at the beginning and end of
the packet. These characters told the receiver how to handle the packet.
• New Encoding for PCIe Gen 3 and Beyond: In PCIe Generation 3, the control
characters were replaced with a different encoding method that packs more information
into each transmission, improving efficiency.
• The Physical Layer also encodes the packet and sends it over the PCIe link using
differential transmission (which is a way of sending signals to reduce noise and errors).
Receiving a TLP
• When the receiving device sees the incoming TLP packet, the Physical Layer is the first to handle it.
• The Physical Layer checks for special characters (like Start and End characters) that
were added during transmission. These characters tell the receiver where the packet
begins and ends.
• After verifying these control characters, the Physical Layer removes them and forwards
the rest of the packet to the next layer, the Data Link Layer.
• The Data Link Layer is responsible for ensuring the packet arrived without errors. It checks two things:
1. LCRC (Link Cyclic Redundancy Check): This checks for transmission errors between the neighboring devices (i.e., on the PCIe link).
2. Sequence Number: This confirms that no TLPs were lost or received out of order.
• If there are no errors, the Data Link Layer removes the LCRC and sequence number
fields (which were added for error detection) and passes the remaining packet to the
Transaction Layer.
• The Transaction Layer now works with the core part of the packet (the header, data,
and possibly the ECRC field). If the receiving device is a Switch, it will check the header
to see where the packet should go.
• Switches: A switch in the PCIe system is like a traffic controller. It looks at the header of
the TLP to see where the packet is headed (which device or port). The switch doesn't
modify the ECRC (End-to-End Cyclic Redundancy Check), but it can check it for errors
and report them if needed.
• If the receiving device is the target (the packet's final destination), it checks the ECRC
(if it's enabled to do so) to ensure there were no errors in the packet's journey across all
the links and devices.
• Once the ECRC (if present) is checked and there are no errors, the Transaction Layer
removes the ECRC field.
• What’s left is the header and the data (if any), which are then passed up to the Software
Layer of the device. The Software Layer is where the device processes the request or
data that was originally sent.
A non-posted transaction means that when a request is sent, the sender waits for a response.
This is commonly used in read operations, where a device requests data from memory and
expects a response with that data.
1. Requesting Data:
o An Endpoint (a device like a GPU or network card) wants to read data from the system memory.
o The request travels through the PCIe system, passing through Switches that help route the packet to the correct destination.
2. Completing the Request:
o In this example, the Root Complex (which is part of the CPU) recognizes that the request is for system memory.
o The Root Complex reads the memory at the requested address and gathers the data to be sent back to the Endpoint.
3. Returning the Data:
o The Root Complex sends the data back to the Endpoint in Completion Data (CplD) packets.
When the Endpoint made the request, it included its return address in the packet. This address
is a combination of three numbers:
• Bus number
• Device number
• Function number
These three together form the BDF (Bus, Device, Function), which tells the system where to
send the completion packets.
Additionally, the request also had a Tag. Each request is given a unique Tag to help the Endpoint
match the incoming completion packets with the correct request, especially when multiple
requests are being handled simultaneously.
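A rough sketch of the bookkeeping on the Requester side might look like this; the tag values, addresses, and the completer's ID string are all invented for illustration.

    outstanding = {}   # tag -> address of the read still in flight

    def send_read(tag, addr):
        """Remember the request; the tag travels in the request TLP."""
        outstanding[tag] = addr

    def on_completion(completer_id, tag, data):
        # The completer echoed our tag (and BDF) in the CplD header, so we
        # can pair this completion with the right outstanding request.
        addr = outstanding.pop(tag)
        print(f"read of {hex(addr)} (tag {tag}) completed by {completer_id}: {data!r}")

    send_read(tag=1, addr=0x9000_0000)
    send_read(tag=2, addr=0x9000_1000)     # several reads in flight at once
    on_completion("bus 0, dev 0, fn 0", tag=2, data=b"\xca\xfe")
    on_completion("bus 0, dev 0, fn 0", tag=1, data=b"\xbe\xef")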
Handling Errors
If an error occurs, the Completer (the device responding to the request) can indicate this by
setting specific bits in the completion status field of the packet. This lets the Endpoint know
something went wrong, though how to handle such errors depends on the software, not the
PCIe specification.
Locked Memory Reads are used in special cases where a processor needs to ensure that no
other device can access or modify a specific piece of memory while it is performing a critical
operation. This is mainly used for operations like Atomic Read-Modify-Write, which are
essential in tasks such as managing a semaphore (a variable that controls access to a shared
resource).
1. Atomic Operations:
o Imagine you have a semaphore (a variable that prevents multiple devices from
using the same resource at the same time).
o When a processor checks the semaphore (like testing if a resource is free) and
decides to change its value (like marking the resource as "in use"), no other
processor should be able to modify the semaphore during this process.
o To ensure this, the processor "locks" the memory location containing the
semaphore. This prevents other devices from accessing or modifying it until the
lock is released.
2. Race Conditions:
o Without locking, two processors could try to modify the semaphore at the same time, leading to a race condition, which could cause unpredictable results (the sketch at the end of this section illustrates the problem).
o The lock ensures that one processor completes its operation before another one
can access the same memory.
3. How a Locked Read Travels:
o The processor issues a locked memory read request (MRdLk) for the target location.
o This request travels through the PCIe system, moving through Switches and other routing devices, eventually reaching the memory or device where the data is stored.
o As the request passes through each routing device (like switches), the egress
port (the port where packets leave the switch) gets locked.
o This means no other packets can pass through that port until the locked
transaction is completed.
o When the memory device receives the locked request, it fetches the data and
sends it back in a Locked Completion (CplDLk) packet.
o The completion packet travels back to the original requester (the CPU), and the
ports unlock as the packet passes through them.
4. Handling Errors:
o If something goes wrong, like if the data cannot be fetched, the device sends a
Locked Completion without data, indicating an error.
o The status field in this completion packet will tell the requester (CPU) what
went wrong, and the lock will be cancelled. After that, the software will need to
decide how to handle the error (for example, by retrying the request or taking
some other action).
• Locked Reads are mainly a legacy feature carried over from older systems (like PCI)
because they were once used for processor buses.
• New PCIe devices are not required to support locked requests unless they are Legacy
Devices that self-identify as such. So, only certain devices like CPUs or root ports can
initiate locked transactions in modern PCIe systems.
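To see why atomicity matters, here is a deliberately unsafe test-then-set in plain Python. This is ordinary software standing in for bus-level behavior, not PCIe traffic.

    # A deliberately unsafe test-then-set: between the read and the write,
    # another processor could perform the same read and also see "free".
    memory = {"semaphore": 0}     # 0 = resource free, 1 = in use

    def try_acquire():
        old = memory["semaphore"]          # read
        # <-- without a lock, a second CPU can run here and also read 0
        if old == 0:
            memory["semaphore"] = 1        # modify-write
            return True                    # both CPUs could "succeed"
        return False

A locked read-modify-write closes that window: the location is held exclusively for the whole read-and-write, so only one processor can win.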
In computer systems, I/O (Input/Output) write transactions are a way for a processor to send data to a specific device (called an Endpoint). These transactions follow specific rules to ensure the data is delivered and confirmed:
1. Sending the Data: The processor issues an I/O Write request carrying the data and the target address.
2. Routing the Request: The packet is forwarded through the fabric until it reaches the target Endpoint.
3. Acknowledgement of Data:
Once the target device (called the Completer) receives the data, it sends back a
confirmation message (called a completion packet) to the processor.
o This confirmation does not contain any data—just a status field indicating
whether everything went okay or if there was an error.
4. Error Handling:
If an error occurred, the processor's software is responsible for fixing it.
o This is the key difference between a non-posted write and other types of writes
(like memory writes):
▪ Memory write: The processor doesn’t wait—it assumes the data will
eventually get there.
6. Why Wait?
Waiting ensures that the next operation (depending on the successful delivery of the
data) doesn’t happen too soon, avoiding errors.
7. Processor-Exclusive Writes:
Only the processor can initiate non-posted writes because it’s closely involved in
coordinating these critical steps.
In computer systems, posted writes are a fast and efficient way to send data, typically used for
memory operations. Let’s break it down:
1. No Waiting for Confirmation:
o When a device (called the Requester) sends data to memory or another device, it doesn't wait for a confirmation message (unlike non-posted writes).
o This saves time and bandwidth, making the system faster and more efficient.
2. How It Works:
o The Requester sends the data along with the memory address (this is called the
Memory Write Request or MWr).
o The data packet travels through the system, being forwarded by Switches until it
reaches the destination (called the Completer).
o Once a Switch successfully sends the data to the next step, it considers its job
done and is ready to handle the next transaction.
3. Completion:
o The transaction is considered finished for the Requester as soon as the data is sent.
o The Completer eventually receives the data and stores it in memory, officially completing the process.
4. Error Handling:
o Since the Requester doesn't wait for a response, it won't know if something goes wrong during the transaction.
o If there's an error, the Completer might log it and send a Message to the Root Complex (a central controller in PCIe) to alert the system software.
• Advantage: The links between devices are freed up sooner for other transactions.
• Disadvantage: Errors aren’t directly reported to the Requester, which could make
troubleshooting more complex.
Think of a posted write like mailing a letter:
• How it works:
o You hand the letter to the mail carrier (the Requester sends the write).
o Once the mail carrier takes it, you assume it will reach its destination (Requester doesn't wait for feedback).
o The recipient eventually receives the letter (Completer finishes the transaction).
• Pros: It’s fast and efficient because you don’t wait for confirmation.
• Cons: If the letter gets lost, you won’t know unless someone informs you later.
In addition to memory writes, messages are special transactions in PCIe used to communicate system events like errors or power management. These messages are flexible:
• Purpose: Messages replace the older "side-band signals" (extra wires for communication) by using the regular data paths, simplifying the design and reducing hardware complexity.
Quality of Service (QoS) in PCIe ensures that time-sensitive data, like video or audio streams,
is delivered on time while still supporting less urgent data, like file transfers. Here's how it works
in simple terms:
The Problem:
Imagine a video camera and a file transfer device (like a hard drive) both need to send data to
your computer's memory (DRAM).
• Video Camera Data: Needs to arrive on time, or the video will become choppy or lose
frames.
• File Transfer Data: Doesn’t care much about timing—it just needs to arrive without
errors.
QoS ensures that the video data gets priority so it arrives on time, even when the system is busy.
How QoS Works:
1. Traffic Classes (TC):
o Each packet of data is given a priority level by the software using a 3-bit field called the Traffic Class (TC). (A sketch after this list shows the basic idea.)
2. Virtual Channel Buffers:
o Each port in the system has multiple "lanes" or buffers to handle different priority packets.
o Packets are placed into the right buffer based on their Traffic Class.
3. Arbitration Logic:
o When there are multiple packets ready to be sent, the system uses rules
(arbitration logic) to decide which packet to send first based on priority.
4. Port Arbitration:
o In addition to managing buffers, the system also decides which input port gets
access to an output port when multiple ports compete for resources.
5. Guaranteed Service:
o With all these mechanisms in place, the system can guarantee that high-priority
data (like video) gets enough bandwidth and low latency, ensuring smooth
performance.
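One simple way to picture the arbitration is a strict-priority queue keyed on Traffic Class. Real PCIe arbitration policies are configurable and more nuanced; this Python sketch only shows the basic idea that a higher TC can be drained first.

    import heapq

    queue = []   # entries: (-traffic_class, arrival_order, packet)
    order = 0

    def enqueue(tc, packet):
        """Queue a packet with its Traffic Class; higher TC drains first."""
        global order
        heapq.heappush(queue, (-tc, order, packet))
        order += 1

    enqueue(0, "SCSI backup block")     # TC0: best-effort bulk data
    enqueue(7, "video frame")           # TC7: time-sensitive
    enqueue(0, "SCSI backup block 2")
    while queue:
        _, _, packet = heapq.heappop(queue)
        print("send:", packet)          # the video frame goes out first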
• Scenario: A video camera streams into memory while a SCSI disk runs a backup over the same links.
o If the video doesn't get enough bandwidth, frames are dropped, and the video looks choppy.
• Solution:
o The video packets are given a higher priority (a higher Traffic Class).
o The system ensures the camera gets enough bandwidth and fast delivery, while the SCSI data waits if needed.
o The video stays smooth, and the backup still completes—just a bit slower.
Benefits of QoS:
• Efficiency: Allows less urgent data to be handled without interfering with time-sensitive
tasks.
Transaction Ordering:
1. Within the same Traffic Class:
o Packets within the same VC (lane for data) always follow the order they arrived in unless specific "relaxed ordering" rules apply.
2. Across different Traffic Classes:
o Packets with different TCs may not follow the same rules because they don't share an ordering relationship (they're treated independently).
3. Why It Matters:
o For example, if a video packet (high-priority) is sent after a regular data packet,
the video packet might still be processed first due to its priority, but the system
ensures both packets reach their destinations without confusion.
A traffic analogy:
• Same Lane: Cars in the same lane follow the "first come, first serve" rule (strict ordering).
• Different Lanes: Cars in separate lanes don’t affect each other’s order—they can move
independently (no ordering relationship).
Flow Control
How It Works:
1. Receiver Buffers:
o The receiver has buffers (temporary storage) to hold incoming data packets (Transaction Layer Packets, or TLPs).
o These buffers can fill up if too much data arrives too quickly.
2. Credit Updates:
o The receiver constantly updates the transmitter about how much buffer space is available using Data Link Layer Packets (DLLPs).
o The transmitter keeps track of this and only sends data when there's enough space (the sketch after the analogy below shows this credit loop).
3. Guaranteed Updates:
o Unlike normal data packets (TLPs), DLLPs are small and can always be sent, even if the receiver's buffers are full. This ensures updates about available space are never delayed.
4. Automatic Management:
o All of this bookkeeping happens in hardware, without software needing to intervene.
A restaurant analogy:
• Transmitter: The waiter only sends orders the kitchen can accept.
• Receiver: The kitchen has a limited number of plates (buffers) to prepare orders.
• Flow Control: The kitchen updates the waiter when it has room for more orders (DLLPs).
• Result: Orders are sent only when there's space, avoiding chaos in the kitchen.
• Transaction Ordering: Prevents data from arriving out of order, maintaining system
reliability.
• Flow Control: Ensures smooth data flow without overloading parts of the system.
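As promised above, here is a minimal sketch of credit-based flow control, assuming a single buffer pool and made-up credit units (real PCIe tracks separate header and data credit types per virtual channel).

    # Made-up credit units; real PCIe tracks several credit types per VC.
    class Transmitter:
        def __init__(self, credits):
            self.credits = credits           # buffer space the receiver advertised

        def send(self, size):
            if size > self.credits:
                return False                 # would overflow the receiver: wait
            self.credits -= size             # spend credits on transmission
            return True

        def on_update_fc(self, freed):
            self.credits += freed            # receiver drained data, returns credits

    tx = Transmitter(credits=4)
    assert tx.send(3)                        # fits in the advertised space
    assert not tx.send(2)                    # no room left: transmission stalls
    tx.on_update_fc(3)                       # an update DLLP returns credits
    assert tx.send(2)                        # now it fits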
The Data Link Layer in PCI Express (PCIe) serves as the middle manager between the physical
layer (hardware-level signals) and the transaction layer (high-level data handling). Its primary
responsibilities include error correction, flow control, and power management.
o What’s a TLP?
Transaction Layer Packets (TLPs) carry the actual data for operations like
memory reads or writes.
▪ The sender replays the TLP from its Replay Buffer until it gets an Ack
(Acknowledgement) confirming successful reception.
2. Flow Control:
o Flow control ensures the sender does not overwhelm the receiver with too much
data.
o The receiver communicates its buffer availability through small DLLPs (Data
Link Layer Packets).
o These DLLPs act like “tickets” that tell the sender, “You’re allowed to send more
data now.”
3. Power Management:
o The Data Link Layer also manages power efficiency by communicating power
state changes through DLLPs.
• Purpose: DLLPs are special, small packets used for control purposes like
acknowledgements (Ack/Nak) and buffer space updates (flow control).
• Characteristics:
o Travel Scope: They only move between neighboring devices (e.g., from a switch
to an endpoint) and don’t propagate through the entire network.
o Assembly:
▪ The Data Link Layer builds the DLLP content and appends a CRC; the Physical Layer then adds framing and transmits it.
o Disassembly:
▪ At the receiver, the physical layer removes framing info and sends the
DLLP to the Data Link Layer.
▪ The Data Link Layer verifies the CRC for errors and takes appropriate
action (e.g., update flow control or retry TLPs).
This protocol ensures reliable communication by allowing the sender to retry in case of errors.
1. Sending TLPs:
o The sender keeps a copy of each outgoing TLP in its Replay Buffer until it
receives an Ack DLLP confirming successful delivery.
2. Receiving TLPs:
o If the receiver detects no errors, it sends an Ack DLLP back to the sender, which
then deletes the TLP from its Replay Buffer.
o If an error is detected, the receiver sends a Nak DLLP, prompting the sender to
resend the TLPs from the Replay Buffer.
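The replay mechanism can be sketched like this. The sequence numbering and the "Ack covers everything up to this number" behavior follow the description above; the packet names and the simplified Nak handling (replaying everything still unacknowledged) are illustrative.

    replay_buffer = {}    # seq -> copy of the TLP, kept until acknowledged
    next_seq = 0

    def send_tlp(tlp):
        """Keep a copy in the Replay Buffer before transmitting."""
        global next_seq
        replay_buffer[next_seq] = tlp
        next_seq += 1

    def on_ack(seq):
        # Delivery confirmed up to this sequence number: discard those copies.
        for s in [s for s in replay_buffer if s <= seq]:
            del replay_buffer[s]

    def on_nak(seq):
        # Receiver reported an error: resend what is still unacknowledged.
        for s in sorted(replay_buffer):
            print("replaying", replay_buffer[s])

    send_tlp("MRd #0")
    send_tlp("MWr #1")
    on_ack(0)      # "MRd #0" leaves the Replay Buffer
    on_nak(1)      # "MWr #1" is replayed from the Replay Buffer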
A DLLP (Data Link Layer Packet) is a small packet of data used for communication within the
Data Link Layer of the PCIe architecture. It performs important management tasks for the link
between two neighboring devices.
Basic Structure:
• 4-Byte Core: The first byte identifies the type of DLLP, and the remaining bytes carry additional information that depends on that type.
• 2-Byte CRC: This is a "Cyclic Redundancy Check" value added to the packet for error
detection.
Let’s go step-by-step through the process of how a memory read (MRd) request flows through a
system with a Requester, a Switch, and a Completer (the device holding the requested data).
Step 1a:
o The Requester's Data Link Layer appends a sequence number and LCRC to the outgoing Memory Read (MRd) TLP.
o It saves a copy of this request in its Replay Buffer (a memory area for retransmissions if needed).
o The request is sent to the Switch, which checks the packet for errors using the LCRC (Link Cyclic Redundancy Check) and the sequence number.
Step 1b:
o Finding no errors, the Switch returns an Ack DLLP to the Requester.
o On receiving this Ack, the Requester deletes the saved copy of the request from its Replay Buffer.
Step 2a:
o The Switch uses the memory address in the request to route the TLP to the
correct output port (Egress Port).
o It saves a copy of the TLP in the Egress Port’s Replay Buffer for possible
retransmission.
Step 2b:
o If no errors are found, the Completer sends an Ack DLLP back to the Switch.
o The Switch then removes the TLP copy from its Replay Buffer.
Step 3a:
o The Completer reads the requested data from its memory.
o It creates a Completion with Data TLP (CplD) containing the data and saves a copy in its Replay Buffer.
Step 3b:
o If no errors are found, the Switch sends an Ack DLLP to the Completer, which
then deletes the saved copy of the CplD from its Replay Buffer.
Step 4a:
o The Switch uses the Requester ID in the CplD to route the packet to the correct
Egress Port.
Step 4b:
o If no errors are found, the Requester sends an Ack DLLP back to the Switch.
o The Switch deletes its saved copy of the CplD from the Replay Buffer.
o The Requester checks the optional ECRC (End-to-End CRC) for additional error
detection. If no issues are found, the data is passed to the core logic for further
processing.
• Ack/Nak Protocol:
o An Ack DLLP confirms successful receipt, and the sender deletes the saved copy of the packet.
• Replay Mechanism:
o Errors are often caused by transient issues and can usually be corrected through
retransmissions.
1. Flow Control:
o The Data Link Layer manages how data flows between devices, ensuring smooth
communication.
2. Power Management:
o DLLPs are used to manage link and system power states, helping conserve
energy.
o For example, the devices might negotiate to enter low-power states during
periods of inactivity.
The Physical Layer is the foundation of the PCIe architecture. Its job is to handle the physical
transmission of data over the link between devices. It consists of two main parts:
1. Logical Physical Layer: Deals with digital logic to prepare packets for transmission and
to process incoming packets.
2. Electrical Physical Layer: Handles the actual analog signaling between devices over
the PCIe lanes.
Data from higher layers (TLPs and DLLPs) is passed to the Physical Layer for actual
transmission. Here’s how it works:
• TLPs (Transaction Layer Packets) and DLLPs (Data Link Layer Packets) are first placed
into a buffer in the Physical Layer.
• Framing characters (Start and End characters) are added to each packet to mark its
boundaries. These help the receiver detect the beginning and end of the packet during
transmission.
PCIe uses multiple lanes to transmit data simultaneously, like a multi-lane highway. Each byte
of data is:
• Striped: Split into chunks and sent across the available lanes. Each lane operates as an
independent serial path.
• At the receiver end, the bytes are reassembled into their original order.
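Byte striping is easy to picture in code. This Python sketch round-robins bytes across lanes and reassembles them at the other end; it ignores real-world details like link training, framing, and lane de-skew.

    def stripe(data, lanes):
        """Fan consecutive bytes out across the lanes, round-robin."""
        return [data[i::lanes] for i in range(lanes)]

    def unstripe(per_lane):
        """Interleave one byte from each lane back into the original order."""
        out = bytearray()
        for group in zip(*per_lane):
            out.extend(group)
        return bytes(out)

    payload = bytes(range(8))               # 8 bytes sent over a x4 link
    lanes = stripe(payload, 4)              # lane 0: 0,4  lane 1: 1,5  ...
    assert unstripe(lanes) == payload       # receiver restores the order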
Encoding is the process of converting data into a format suitable for transmission.
• 8b/10b Encoding (Gen1 and Gen2):
o Each 8-bit byte is converted into a 10-bit symbol.
o This adds some overhead but provides benefits like error detection and ensuring enough transitions for clock synchronization.
• 128b/130b Encoding (Gen3 and later):
o Uses a more efficient encoding scheme where 128 bits of data are packed into 130 bits.
o This reduces overhead compared to 8b/10b and increases efficiency for higher speeds.
• After encoding, the data is serialized: converted into a continuous stream of bits. Each lane then transmits at the generation's signaling rate:
o Gen1: 2.5 GT/s
o Gen2: 5 GT/s
o Gen3: 8 GT/s
On the receive side, the Physical Layer reverses each step:
1. Deserialization:
o The serial bit stream is converted back into a parallel stream (bytes) using a deserializer.
2. Elastic Buffer:
o The data passes through an elastic buffer, which compensates for small timing
differences between the sender and receiver clocks.
3. Decoding:
o For Gen1 and Gen2, the 10-bit symbols are decoded back into 8-bit characters
using an 8b/10b decoder.
4. Descrambling:
o The scrambling applied at the transmitter is reversed to recover the original data.
5. Byte Unstriping:
o Bytes received across multiple lanes are reassembled into a single, ordered data
stream.
6. Delivering Data:
o The final reconstructed data is sent up to the Data Link Layer for further
processing.
• Differential Signaling:
o Each lane uses a pair of wires (one positive, one negative) to transmit data. This
helps reduce noise and improves signal integrity.
• Analog Components:
o The transmit and receive circuits that drive and sense these differential signals on each lane.
Key concepts recap:
1. Framing Characters
2. Byte Striping
3. 8b/10b Encoding
4. 128b/130b Encoding
5. Elastic Buffer
The Link Training and Initialization process is like setting up a handshake between two PCIe
devices before they start exchanging data. It ensures that both devices agree on how they will
communicate, such as the number of lanes, speed, and how the physical connection is
configured. This process is fully automatic and involves several key steps.
1. Link Width Negotiation
o PCIe links can consist of 1 to 32 lanes, where each lane can transmit data independently.
o During initialization, the devices negotiate how many lanes will be used for the connection (e.g., x1, x4, x8, x16).
2. Link Speed Negotiation
o PCIe supports multiple speeds depending on the generation (e.g., Gen1: 2.5 GT/s, Gen2: 5 GT/s, Gen3: 8 GT/s).
o The training process determines the fastest speed both devices can support reliably (a small sketch after this list shows the negotiated outcome).
3. Lane Reversal
• If lanes are connected in reverse order (e.g., Lane 1 is connected to Lane 4), the training process detects and corrects it automatically.
4. Polarity Inversion
• If the positive and negative signals of a lane are swapped during connection, the training
process adjusts for it.
5. Bit Lock
• The receiver synchronizes with the transmitter's clock to accurately recover the data being sent.
6. Symbol Lock
• The receiver identifies patterns in the data stream (symbols) to determine how data is
organized.
7. Lane-to-Lane De-skew
• For multi-lane links, data traveling through different lanes may arrive slightly out of sync
due to variations in distance or signal timing. De-skewing realigns this data to ensure
everything is synchronized.
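As mentioned above, the end result of the width and speed negotiation can be summarized in a tiny sketch: both sides settle on the widest lane count and fastest rate they have in common. This models only the outcome; the real LTSSM reaches it through a sequence of training states.

    def train(dev_a, dev_b):
        """Settle on the widest common width and fastest common rate."""
        return {
            "lanes": min(dev_a["lanes"], dev_b["lanes"]),
            "gt_per_s": min(dev_a["max_gt_per_s"], dev_b["max_gt_per_s"]),
        }

    root_port = {"lanes": 16, "max_gt_per_s": 8.0}   # Gen3-capable x16 port
    endpoint = {"lanes": 4, "max_gt_per_s": 5.0}     # Gen2-capable x4 card
    print(train(root_port, endpoint))                # {'lanes': 4, 'gt_per_s': 5.0}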
AC-Coupled Connections
• What is AC-Coupling?
o The connection between the transmitter and receiver uses a capacitor in the
signal path.
o This blocks low-frequency (DC) signals and allows high-frequency (AC) signals
to pass through.
• Why AC-Coupling?
o It allows the transmitter and receiver to have different reference voltages. For
example, this is useful when the devices are far apart or in different
environments.
Impedance Matching
• Signal traces are designed to a controlled impedance so that reflections are minimized:
o 100 Ohms: Differential impedance (between the positive and negative signal lines).
Ordered Sets are special patterns of characters used by the Physical Layer for specific purposes. They are not traditional data packets and do not have Start or End characters. They are used for:
1. Link Training: training sequences exchanged while the link is brought up.
2. Clock Compensation: inserted periodically so the receiver can adjust for small clock-rate differences.
3. Power Management: signaling link power-state transitions.
Format:
• In Gen1 and Gen2, an Ordered Set begins with a special COM character followed by three or more additional characters.
• In Gen3 and later, the format of Ordered Sets changes to accommodate faster speeds and new requirements.
The Link Training process is crucial to ensure the PCIe connection works efficiently and reliably:
• It automatically negotiates the best lane width and speed for the link.
• It handles signal alignment issues, such as polarity inversion, lane reversal, and de-
skewing.
• Ordered Sets play a vital role in this process, ensuring synchronization, clock
adjustments, and power management.
Without proper Link Training and Initialization, communication between PCIe devices would be
error-prone or might not work at all. This process ensures that the physical link is stable,
optimized, and ready for data transmission.
Memory Read Request Phase
This is the first phase where a device (the Requester) sends a request to another device (the
Completer) to read some data from memory.
Step-by-step Process:
• Requester’s Device Core or Software Layer: This part prepares the request, including
the address of the memory to be read, the transaction type (what kind of request it is),
the data size (how much data to read), and additional information like traffic class and
byte enables (to define which bytes of the data are important).
• Transaction Layer: The Transaction Layer then builds a Memory Read Request (MRd) Transaction Layer Packet (TLP) using this information. This packet can be 3 or 4 Double Words (DW) long, depending on whether the address is 32-bit or 64-bit (a small sketch after this phase shows how that choice is made). It also includes the Requester ID (a unique identifier for the device making the request).
• Flow Control: Before the packet is sent, the Flow Control Logic ensures that there’s
enough space at the destination device to receive the packet. Only when there’s enough
room, the packet proceeds to the next layer.
• Data Link Layer: Here, the TLP gets a Sequence Number and a LCRC (a kind of
checksum to check for errors). A copy of the TLP with these added is stored in the
Replay Buffer.
• Physical Layer: In the Physical Layer, the TLP is prepared for transmission. It’s
converted into serial data (so it can be sent over the physical link), scrambled to prevent
interference, and encoded (using 8b/10b encoding) before being sent across the link.
• Receiver Side (Completer): The Completer receives the data, de-serializes it (turns the
serial data back into parallel form), and passes it through an elastic buffer (to deal with
timing differences between the devices). After decoding the data and removing the
start/end markers, it sends the TLP to its own Data Link Layer.
• Data Link Layer (Completer): This layer checks if there are any errors in the received
packet. If everything is okay, the Data Link Layer creates an Acknowledgment (Ack)
with the same Sequence Number and sends it back to the Requester, confirming the
packet was received.
• Requester’s Data Link Layer: Upon receiving the Ack, the Requester’s Data Link Layer
checks that the CRC (error check) is valid. If it is, the TLP is removed from the Replay
Buffer, meaning the request has been successfully acknowledged.
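As promised above, here is a tiny sketch of the 3 DW vs. 4 DW header choice. It is a simplification: real header formation sets many more fields than just the address width.

    def mrd_header_dwords(addr):
        """A 32-bit address fits the 3 DW header; 64-bit needs the 4 DW form."""
        return 3 if addr < 2**32 else 4

    print(mrd_header_dwords(0x8000_0000))       # 3 (32-bit address)
    print(mrd_header_dwords(0x1_0000_0000))     # 4 (64-bit address)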
Completion with Data Phase
This phase happens after the memory read request has been processed by the Completer and it's time to send the requested data back to the Requester.
Step-by-step Process:
• Transaction Layer (Completer): The Completer builds a Completion with Data (CplD) TLP containing the requested data, the original Requester ID (so the fabric knows where to send the data back to), and additional information like the transaction type and status of the request. This TLP is then sent to the Data Link Layer.
• Flow Control: Just like in the request phase, the Flow Control logic ensures there is
space to send the completion packet before it’s transmitted.
• Data Link Layer (Completer): The Data Link Layer adds a Sequence Number and
LCRC to the TLP, stores a copy in the Replay Buffer, and then sends it to the Physical
Layer.
• Physical Layer (Completer): The packet goes through the same process as in the
request phase—adding Start and End characters, scrambling the data, encoding it, and
then serializing it for transmission.
• Requester Side: The Requester receives the CplD TLP on the Physical Layer, de-
serializes it, decodes it, and removes the Start/End characters. It then sends the data to
the Data Link Layer.
• Data Link Layer (Requester): The Data Link Layer checks for errors in the received
packet. If everything is okay, it sends an Ack DLLP back to the Completer to confirm the
packet was received.
• Completion Confirmation: If the Ack is valid, the Requester processes the completion
data and forwards it to its Software Layer, completing the process. If there were errors,
the Requester may ask for the data to be resent.
Key Points:
• Transaction Layer: This layer is responsible for creating the memory read request and
the completion response, using TLPs to send data between devices.
• Data Link Layer: Adds extra reliability to the process by adding Sequence Numbers
and CRC checks to the packets, and ensuring proper flow control.
• Physical Layer: Handles the actual transmission of data over the physical link,
including converting data to a form suitable for transmission, such as serializing and
encoding it.
This process, from Memory Read Request to Completion with Data, ensures that devices in a
PCIe system can communicate efficiently and reliably, even over high-speed links.