CXL* Type 3 Memory Device Software Guide
Software Guide
Development Guides
August 2024
Revision 1.1
Figures
Figure 2-1 Conceptual CXL Architecture for Volatile Memory Support
Figure 2-2 Conceptual CXL Architecture for Persistent Memory Support
Figure 2-3 Basic Linux CXL PMEM Architecture
Figure 2-4 Example per CXL Host Bridge Fixed Memory Allocation
Figure 2-5 CEDT CFMWS Example - No XHB Interleaving
Figure 2-6 CEDT CFMWS Example - x2 XHB Interleave
Figure 2-7 CEDT CFMWS Example - x4 XHB Interleave
Figure 2-8 CEDT CFMWS Example - x4 XHB Interleave Across Multiple Sockets
Figure 2-9 Components for Managing Regions
Figure 2-10 Example - Volatile and Persistent x2 Interleaved Regions
Figure 2-11 Example - Region Interleaved Across 2 Switches
Figure 2-12 Example - Region Interleaved Across 2 HBs
Figure 2-13 Example - Region Interleaved Across 2 HBs and 4 Switches
Figure 2-14 Example - 2 Regions Interleaved Across 2 and 4 Devices
Figure 2-15 Example - Out of Order Devices Within a Switch or Host Bridge
Figure 2-16 Example - Out of Order Devices Across HBs (Failure Case)
Figure 2-17 Example - Valid x2 HB Root Port Device Ordering
Figure 2-18 Example - Invalid x2 HB Root Port Device Ordering
Figure 2-19 Example - Unbalanced Region Spanning x2 HB Root Ports
Figure 2-20 High-level Sequence: System Firmware and UEFI Driver Memory Partitioning
Tables
Table 1-1 Terms and Acronyms
Table 2-1 High-level Software Component Responsibilities - System Boot
Table 2-2 High-level Software Component Responsibilities - System Shutdown and Global Persistent Flush (GPF)
Table 2-3 High-level Software Component Responsibilities - Hot Add
Table 2-4 High-level Software Component Responsibilities - Managed Hot Remove
Table 2-5 System Firmware Memory Interface Summary
Table 2-6 SRAT and HMAT Content
Table 2-7 Valid x2 XHB Configuration
Table 2-8 Valid x2 XHB Configuration 2
Table 2-9 Invalid x4 XHB Configuration
Table 2-10 Valid x4 XHB Configuration
Table 2-11 Valid x4 XHB Configuration 2
Table 2-12 Invalid x8 XHB Configuration
Table 2-13 Valid x8 XHB Configuration
Table 2-14 HPA to DPA Translation 1
Table 2-16 HPA to DPA Translation 2
Table 2-18 HPA to DPA Translation - 4-way XOR
Table 2-19 HPA to DPA Translation - 8-way XOR
Table 2-20 DPA to HPA Translation
Table 2-21 HPA to DPA and DPA to HPA Translation - XOR
Table 2-22 EFI Memory Types
Table 2-23 EFI Memory Attribute
Revision 1.1 - August 2024
• Included the recommendation that Compute Express Link* (CXL*) expansion memory should be marked as Specific Purpose
• Added section 2.13.24.1. It describes the Host Physical Address (HPA) to Device Physical Address (DPA) translation algorithm when Modulo arithmetic combined with Exclusive-OR (XOR) is used
• Added section 2.13.25.1. It describes the DPA to HPA translation algorithm when Modulo arithmetic combined with XOR is used
• Added section 2.13.25.2. It describes the DPA to HPA translation algorithm when a CXL Fixed Memory Window Structure (CFMWS) references a single Host Bridge instance twice
• Updated references to match the latest CXL Specification and terminology
1.4 Abbreviations
Abbreviations used in this document that may not be found in the CXL, PCIe, ACPI, or UEFI Specifications:
Interleave Set - Collection of DPA ranges from one or more devices that make up a single HPA range. See Region.
ISP - Interleave Set Position, the position of a device within an interleave set. 0-based.
Region - CXL term for a memory interleave set. See Interleave Set.
XHB - Cross CXL Host Bridge interleave set. An interleave set that spans multiple Host Bridges. May or may not cross sockets.
HDM Decoders - The MMIO mapped HDM decoders are set up by the system firmware for known CXL volatile capacity, by the UEFI and OS driver for known CXL persistent capacity, and by the OS for hot added CXL volatile and persistent memory capacity. These registers determine what Host Physical Address (HPA) range will be mapped to the Device Physical Address (DPA) range exposed by the device. HDM decoders are found in all upstream CXL switch ports as well as in each CXL Root Complex. These HDM decoders will also need to be programmed to account for all the downstream device HDM decoder programming. HDM decoders in the CXL Root Complex determine which root port is the target of the memory transaction. Similarly, HDM decoders in an upstream switch port determine which downstream port is the target.
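As an illustration of the values involved, the minimal sketch below models what software programs into an HDM decoder: an HPA base and size, encoded interleave ways and granularity, and an ordered target list. The structure and field names are assumptions made for this sketch, not the register layout defined by the CXL Specification.

#include <stdint.h>

/* Illustrative model of the values programmed into one HDM decoder. */
struct hdm_decoder {
    uint64_t hpa_base;         /* base of the HPA range this decoder claims        */
    uint64_t hpa_size;         /* size of the HPA range                            */
    uint8_t  interleave_ways;  /* encoded interleave ways (IW)                     */
    uint8_t  interleave_gran;  /* encoded interleave granularity (IG)              */
    uint8_t  target_list[8];   /* downstream/root port IDs in interleave order     */
    uint8_t  committed;        /* nonzero once the decoder programming is locked   */
};

/* Returns nonzero if this decoder claims the given HPA. */
static int hdm_decoder_claims(const struct hdm_decoder *d, uint64_t hpa)
{
    return hpa >= d->hpa_base && (hpa - d->hpa_base) < d->hpa_size;
}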
The command interface surfaces the following to the system firmware, UEFI
and OS drivers:
Event Log - Each CXL Memory device is required to support space for at least one event record in each of the informational, warning, failure, and fatal event severity queues. Asynchronous notification of new entries in the log is done using standard PCIe Message Signaled Interrupts (MSI)/MSI-X (OS First) or Vendor Defined Message (VDM) interrupts (FW First).
CXL Switch
Provides expansion of CXL Root Ports into many downstream switch ports, allowing more CXL memory devices to be connected in a scalable way. Each switch has a set of HDM decoders in its upstream switch ports that govern which downstream switch port is the target.
CXL Root Port
CXL HW connection to the CXL memory device or CXL switch port via one or more Flex Bus lanes. Equivalent to a PCIe root port.
CXL Host Bridge
Platform specific collection of CXL root ports, equivalent to a PCIe root complex. Bus number assignments are the responsibility of the system firmware and are exposed through the CXL host bridge ACPI namespace device.
ACPI0016 CXL Host Bridge Object
Virtual SW entity implemented by the system firmware under the _SB (System Bus) scope of the ACPI tree and consumed by the OS. Each ACPI0016 object represents a single CXL root complex. Since the root of the CXL tree (the CXL root complex) is platform specific and is not presented through a PCI Base Address Register (BAR), the system firmware is responsible for generating an object to represent the collection of CXL root ports that make up a CXL host bridge. Each HB is represented by a unique ACPI0016 object under the top of the ACPI _SB device tree. There are several ACPI methods attached to this object that the system firmware will implement on behalf of the OS. This is not an exhaustive list of ACPI methods the CXL host bridge device is expected to support. Assume all standard PCI host bridge driver methods for finding and configuring HW will apply:
• _CRS – Same as PCIe host bridge method
• _OSC - Determine FW first/OS first responsibility. Follows standard PCIe
host bridge functionality with CXL additions. See the CXL _OSC section in
CXL 3.1 Specification.
• _REG – Same as PCIe host bridge method
• _BBN – Same as PCIe host bridge method
• _SEG – Same as PCIe host bridge method
• _CBR – Retrieve pointer to the CXL host bridge register block which
provides access to the HDM decoders. If the platform supports hot add of
CXL host bridges the OS can utilize this method to find the register block
after the addition of the new HB. CXL host bridge hot add is considered out
of scope for this document. CXL host bridges present at boot will not have
a _CBR method and the CEDT CXL Host Bridge Structure (CHBS) must be
utilized.
• _PXM – Same as PCIe host bridge method
CXL Bus Driver
A CXL enlightened version of a standard PCIe bus driver that consumes all ACPI0016 CXL host bridge ACPI device instances, initiates CXL bus enumeration, and understands the relationship between CXL host bridges, CXL switches, and the CXL memory devices below them.
CXL Memory Device Driver
Each physical CXL memory device surfaced by the bus driver is consumed by a separate instance of the CXL memory device driver. This driver consumes the command interface and typically exports those features through OS specific IOCTLs to allow the OS in-band management stack components to manage the device.
Get Quality of Service (QoS) throttling group DSM – Retrieve the QTG the
device should be programmed to:
CEDT
CXL Early Discovery Table - System firmware provides this ACPI table, which UEFI and OS drivers utilize to retrieve all of the CXL CHBS for the CXL host bridges present at platform boot time, a set of CXL Fixed Memory Window Structures (CFMWS), and optionally one or more CXIMS. The register block pointer in each CHBS allows the system firmware, UEFI, and OS drivers to program HDM decoders for the CXL Host Bridges. While the ACPI0017 object signals the presence of the CEDT, this table is not dependent on the ACPI0017 object since it must be available in early boot, before the ACPI device tree has been created. This is a static table created by the system firmware at platform boot time.
CHBS
ACPI CXL Host Bridge Structure – Each CXL Host Bridge instance will have a
corresponding CHBS which identifies what version the CXL host bridge supports
and a pointer to the CXL root complex register block that is needed for
programming the CXL root complex’s HDM decoder.
CFMWS
CXL Fixed Memory Window Structure - A new structure type in the CEDT that describes all the platform allocated and programmed HPA based windows where the system firmware, UEFI, and OS drivers can map CXL memory.
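As a concrete illustration, a CEDT consumer might hold the parsed contents of one CFMWS entry in a form like the sketch below. The field names and types are illustrative only; the exact structure layout and encodings are defined by the CXL Specification and the ACPI CEDT definition.

#include <stdint.h>

/* Illustrative view of one parsed CFMWS entry. */
struct cfmws_entry {
    uint64_t window_base_hpa;     /* base HPA of the fixed memory window          */
    uint64_t window_size;         /* size of the window in bytes                  */
    uint8_t  eniw;                /* encoded number of interleave ways            */
    uint8_t  hbig;                /* host bridge interleave granularity           */
    uint8_t  interleave_arith;    /* 0 = standard modulo, 1 = modulo + XOR        */
    uint16_t window_restrictions; /* volatile/persistent/type-2/type-3 bits       */
    uint16_t qtg_id;              /* QoS throttling group for this window         */
    uint32_t target_count;        /* number of host bridge targets                */
    uint32_t targets[16];         /* CXL host bridge _UIDs in interleave order    */
};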
CEDT Driver
The ACPI0017 CEDT and CFMWS will be consumed by a bus driver that concatenates the HDM decoders for each CXL host bridge into one or more regions (or interleave sets) that the bus driver surfaces to the existing Persistent Memory (PMEM)/Storage Class Memory (SCM) stacks. Since interleave sets in CXL may span multiple CXL host bridges, this driver handles XHB interleaving and presents the other drivers in the stack with a single consolidated view of each region.
Volatile Region
Each volatile region represents an HPA range that utilizes a specific collection of CXL memory devices with volatile capacity.
Memory Manager
CXL based in-band management libraries and UI components that will utilize
OS implementation specific IOCTLs and pass-through interfaces surfaced by the
CXL root port bus driver, CXL memory device driver, and the PMEM or SCM
region driver instances.
LSA - The CXL Memory device is responsible for surfacing a persistent label
storage area that the UEFI and OS drivers utilize to read and write Region
(interleave set) configuration information and Namespace configuration
information. This is required to re-assemble the persistent memory region
configuration correctly.
Poison List - The device is required to maintain a persistent poison list so the
UEFI and OS drivers can quickly determine what areas of the media contain
invalid data and must be avoided or corrected.
FADT
Fixed ACPI Description Table - Existing ACPI table that is utilized to report fixed attributes of the platform at platform boot time. For CXL, the new PERSISTENT_CPU_CACHES attribute is utilized by the platform to report whether the CPU caches are considered persistent and by the OS to set application flushing policy.
Persistent Region
Each persistent region represents an HPA range that utilizes a specific collection of CXL Memory devices with persistent capacity configured in a specific order. The configuration of each region is described in the region labels stored in the Label Storage Area (LSA), which is exposed through the command interface.
Namespaces
File Systems
Existing file system drivers subdivide each partition into one or more files and
supply standard file API and file protection for the user.
Applications
Most persistent memory aware applications make use of Ring3 libraries like PMDK to simplify the persistent memory programming model. These libraries typically make use of memory mapped files or direct device DAX to access the persistent memory. There will be additions to these libraries to surface new CXL features.
CXL Subsystem
New kernel component that utilizes the information provided through the
ACPI0017 CEDT. Provides CXL specific services for the NVDIMM BUS
component. Provides CXL specific IOCTL and SYSFS interfaces for management
of CXL devices
LIBNVDIMM BUS
LIBNVDIMM REGION
PMEM
DEVICE DAX
A simplified direct pipeline between the application and the persistent memory
namespace that bypasses the filesystem and memory mapped file usage.
FS DAX
BTT
PMEM NAMESPACE
CXL-CLI
LIBCXL
NDCTL
LIBNDCTL
The following tables describe these high-level roles and responsibilities for
major SW components in greater detail. Most of these responsibilities are
outlined in more detail in the following sections of this document.
Creating persistent memory region labels
System Firmware: N/A - System Firmware does not create persistent memory region labels.
UEFI and OS drivers: When no region labels exist in the device's LSA:
• Partition the device volatile and persistent boundary according to the device's capabilities and admin policy. If the OS and device support re-partitioning without a reboot (SetPartitionInfo with Immediate flag), UEFI and OS should assume the CDAT may have changed.
• Check the System Firmware programmed CFMWSs available and only allow configuring of persistent memory regions that match the available windows and HB interleave granularity and ways.
• Write the region labels following the CXL Specification.
• Read the region labels from each memory device and verify the requested configuration.
• Program the device for the region labels defined.

Consuming persistent memory region labels
System Firmware: N/A - System Firmware does not consume persistent memory region labels.
UEFI and OS drivers:
• Read the region labels from each memory device and verify the requested configuration.
• Program the device for the region labels defined.
Programming CXL HDM decoders for configured regions
System Firmware: Program the platform for platform specific volatile CXL capacity:
• For device HDM decoders, program the device HDM decoder global control register to enable HDM use, disabling Designated Vendor Specific Extended Capability (DVSEC) decoder use.
• Program HDM decoders in the memory device, upstream switch ports, and CXL Host Bridges to decode volatile memory.
• Utilize the CDAT DSMAS memory ranges returned by the memory device, and program HDMs aligned to those ranges.
• For immutable volatile configurations, set the fixed device configuration indicator in the CFMWS and utilize Lock on Commit to prevent others from changing these HDMs.
UEFI and OS drivers: For all volatile and persistent memory devices not configured by the system firmware:
• Track which volatile capacity has already been assigned an HPA range by the System Firmware by checking the device's HDM decoders.
• Utilize the CDAT DSMAS memory ranges returned by the memory device, the QoS Throttling Group from the platform, and the QTG from the CFMWS, and program HDMs aligned to the DSMAS ranges.
• Place the device into a fixed memory window and use that HPA range for programming HDM decoders while avoiding HPA ranges already populated by the System Firmware.
• For device HDM decoders, program the device HDM decoder global control register to enable HDM use, disabling DVSEC decoder use.
• Program HDM decoders in the memory device, upstream switch ports, and CXL Host Bridges to decode volatile and persistent memory.
HMAT and CDAT
System Firmware: For memory devices containing volatile capacity, parse the device and switch CDAT and create HMAT entries for the CPU and volatile memory proximity domains found in the SRAT.
UEFI Driver: N/A
OS Driver: For all persistent capacity, utilize memory device CDAT, switch CDATs, and Generic Port entries to calculate total BW and Latency for the path from the CXL Host Bridge to each device.
Managing regions
System Firmware: Program all platform HW for each fixed memory region that the platform supports for hot-plug of volatile or persistent memory and surface fixed memory windows through ACPI CXL Fixed Memory Window Structures (CFMWS) for HB and XHB interleaving.
UEFI Driver: Not Supported
OS Driver: Upon hot add event, for hot added volatile and persistent memory devices:
• Program HDM decoders for the memory device based on the assigned HPA range, HDM decoders for upstream switch ports, and HDM decoders for CXL Host Bridges.
• Reprogram GPF timeouts and other values that depend on the number of devices present, for CXL switches and Host Bridges.
• Utilize the CDAT DSMAS memory ranges supported by the memory device, the QoS Throttling Group (QTG) from the platform, and the QTG from the CFMWS, and program HDMs aligned to the DSMAS ranges.
• Utilize the Get QTG DSM when determining the best CFMWS to utilize when programming HDMs.
Programming event log interrupts
UEFI Driver: Not Supported
OS Driver: For hot added volatile and persistent memory devices:
• OS First: Program memory device event log interrupt steering for OS First MSI/MSI-X based on _OSC.
• FW First: Skip programming memory device event log interrupts.
HMAT and CDAT
OS Driver: Utilize memory device CDAT, switch CDATs, and CXL Host Bridge HMAT information to calculate total BW and Latency for the path from the CXL Host Bridge to the new device.
SRAT
Indicate hot pluggable proximity domains with the Memory Affinity Structure HotPluggable indicator, NonVolatile clear.

EFI memory type
If CDAT DSEMTS is provided:
• Type EfiReservedMemoryType if so specified by CDAT DSEMTS, NonVolatile clear.
• Otherwise type EfiConventionalMemory, EFI_MEMORY_SP attribute set, NonVolatile clear (even if CDAT does not specify EFI_MEMORY_SP).
The following requirements limit what the System Firmware may surface to the
UEFI and OS drivers:
• The architecture would allow mixing of any combination of the Window
Restrictions. It is the responsibility of the System Firmware to only surface
windows with Window Restrictions that the platform supports. If the
platform HW does not allow T2 and T3 CXL memory devices to utilize the
same HPA range, the System Firmware cannot report both T2 and T3
Window Restrictions in a single window.
• The System Firmware may surface windows that provide the UEFI and OS
drivers multiple options for configuring a given region. It is UEFI and OS
driver policy specific as to which possible window is utilized for configuring
the region.
• There cannot be any overlap in the HPA ranges described in any of the CFMWS instances (a validation sketch follows this list).
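A minimal sketch of a check for the no-overlap rule, taking each window's base HPA and size; this is an illustrative helper, not a required implementation.

#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if any two of the given CFMWS HPA windows overlap. */
static int cfmws_windows_overlap(const uint64_t *base, const uint64_t *size, size_t count)
{
    for (size_t i = 0; i < count; i++)
        for (size_t j = i + 1; j < count; j++)
            if (base[i] < base[j] + size[j] && base[j] < base[i] + size[i])
                return 1; /* HPA ranges intersect: invalid CEDT */
    return 0;
}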
See the following examples, which outline the intended content of the CFMWS for various CXL topologies.
2.6.1 EFI_MEMORY_MAP
The CEDT CFMWS structures are based on the resources the System Firmware has allocated and programmed in the platform HW. See the previous sections for the outline of the EFI_MEMORY_MAP content.
The following examples outline the intended content of both the EFI_MEMORY_MAP and CFMWS for various CXL topologies.
Since the CXL Fixed Memory Windows surfaced in the CEDT are based on the
CXL Host Bridge _UIDs, which must be unique across the platform, the
resulting windows are identical to the previous example. Thus, the organization
of CXL Host Bridges into sockets from a hardware perspective imposes no
restrictions on which CXL Host Bridges can be utilized in an XHB interleave. The
platform is free to choose which CXL Host Bridges it will allow to be interleaved
together without having to describe the hardware restrictions to the UEFI and
OS drivers.
HDM decoders
• Contain the HPA Base and Size programmed by the System Firmware, UEFI
and OS drivers at CXL hierarchy enumeration time.
In this example, 2 devices are interleaved together with volatile capacity and
the same 2 devices are interleaved together with persistent capacity and each
device contributes 64MB of volatile and persistent capacity, for simplicity.
Note: This is provisional/draft output and will be updated with more accurate text in
another release.
In this example, all 8 devices are interleaved together into a single region and
each device contributes 64MB of persistent capacity, for simplicity.
Note how the Switch interleave granularity is programmed to 2K and the Host
Bridge interleave granularity is programmed to 1K, so that each switch utilizes
HPA[12:11] and the Host Bridge utilizes HPA[10] to select the target port. This
allows proper distribution of the HPA in a round robin fashion across all the
ports in each HostBridge.HDM.InterleaveTargetList[ ] and
Switch.HDM.InterleaveTargetList[ ] the region is associated with. This leads to
good performance since maximum distribution of memory requests across all
HBs and switches is achieved, as shown in the resulting region data block
layout pattern at the bottom of the figure.
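To make the bit selection concrete, the small sketch below extracts the two selectors used in this example: the Host Bridge interleave granularity of 1K selects one of two root ports via HPA[10], and the switch interleave granularity of 2K selects one of four downstream ports via HPA[12:11]. The helper names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Selector extraction for this example's granularities. */
static unsigned hb_root_port(uint64_t hpa) { return (unsigned)((hpa >> 10) & 0x1); } /* HPA[10]    */
static unsigned switch_port(uint64_t hpa)  { return (unsigned)((hpa >> 11) & 0x3); } /* HPA[12:11] */

int main(void)
{
    /* Walk consecutive 1 KiB blocks to show the round-robin distribution. */
    for (uint64_t hpa = 0; hpa < 8 * 1024; hpa += 1024)
        printf("HPA offset %2llu KiB -> root port %u, switch port %u\n",
               (unsigned long long)(hpa / 1024), hb_root_port(hpa), switch_port(hpa));
    return 0;
}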
In this example, all 8 devices are interleaved together into a single region that spans multiple CXL Host Bridges and each device contributes 64 MB of persistent capacity, for simplicity.
Note how the Host Bridge interleave granularity is programmed to 2K and the platform XHB interleave granularity is pre-programmed to 1K, so that the Host Bridge utilizes HPA[11] to select the root port and the platform utilizes HPA[10] to select the Host Bridge. This allows proper distribution of the HPA in a round robin fashion across all the Host Bridges in the CFMWS.InterleaveTargetList[ ] the region is associated with.
Note: This is provisional/draft output and will be updated with more accurate text in
another release.
/sys/bus/cxl/devices/root0
├── address_space0 // CFMWS0
│ ├── devtype // cxl_address_space
│ ├── end // 1TB
│ ├── start // 0
│ ├── supports_pmem
│ ├── supports_type3
│ └── uevent
...
├── devtype // cxl_root
├── dport0 -> ../LNXSYSTM:00/LNXSYBUS:00/ACPI0016:00 // Host Bridge A
├── port1 // Port 0
│ ├── decoder1.0 // HDM 0
│ ├── devtype // cxl_port
│ ├── dport0 -> ../../pci0000:34/0000:34:00.0 // Port 0 Address
│ ├── subsystem -> ../../../bus/cxl
│ ├── uevent
│ └── uport -> ../../LNXSYSTM:00/LNXSYBUS:00/ACPI0016:00 // HB assignment
├── port2 // Port 1
In this example, all 8 devices are interleaved together into a single region that
spans multiple CXL Host Bridges and multiple Switches, and each device
contributes 64 MB of persistent capacity, for simplicity.
In this example, 4 devices are interleaved together into region 0, 2 of the same
devices are also interleaved together into region 1, and each device contributes
64 MB of persistent capacity, for simplicity. Note that a second set of
region labels is utilized to describe the second interleave and that HDM
decoder 1 on the device, switches, and the CXL Host Bridge is utilized to
program the second HPA range.
In this example, 4 devices are interleaved together into region 0 on the same CXL host bridge and each device contributes 64 MB of persistent capacity, for simplicity. However, the devices are plugged into the CXL root ports of the CXL switch in a random order compared to the ordering specified by the region label position field found on each device. This could happen if the system the devices were originally plugged into failed and the devices were moved to an identical system for data recovery purposes, but the original ordering was shuffled in the process. Since the CXL switch and CXL host bridge HDMs contain a target list that specifies the order in which the root ports are interleaved, there is no need to reject this configuration. The UEFI and OS drivers program the CXL switch or CXL host bridge HDM target lists to match the position of the device specified in the region labels stored in the device's LSA without the need for a configuration change.
If all the devices are plugged into the same CXL host bridge, either directly or
through a CXL switch as shown in the example case here, the target list in the
CXL switch HDM decoder or CXL host bridge HDM decoder can be programmed
to fix the ordering. See the next examples for additional constraints when
verifying device ordering across CXL host bridges.
In this example, 4 devices are interleaved together into region 0 with 2 devices
on one host bridge and 2 devices on the next host bridge using an XHB region.
Because platform HW may restrict ordering across host bridges and the OS
cannot re-configure it, the ordering of the devices across host bridges now
matters and adds an additional responsibility when checking for valid regions.
The suggested algorithm for software to check for proper device connection to
each host bridge is outlined in Section 2.13.14.
Figure 2-16 Example - Out of Order Devices Across HBs (Failure Case)
The suggested algorithm for software to check for proper device connection to
each root port on the host bridge is outlined in Section 2.13.15.
The suggested algorithm for software to check for proper device connection to
each root port on the host bridge is outlined in Section 2.13.15.
The following figure outlines the basic system firmware, OS and device steps
for handling, detecting, and reporting a dirty shutdown, the device phases, and
timeline.
The following proximity domains may appear in the system firmware generated
SRAT:
The following general rules apply to the System Firmware generated HMAT:
• HMAT returns best-case latency (unloaded) L and best-case bandwidth B between pairs of proximity domains described in the SRAT.
• Since the SRAT does not contain proximity domains for persistent memory
capacity, the HMAT will not contain performance information related to
persistent memory capacity.
• If a path P is made up of subpaths P1, P2, …, Pn:
  L(P) = L(P1) + L(P2) + … + L(Pn), where L is the latency
  B(P) = MIN[B(P1), B(P2), …, B(Pn)], where B is the bandwidth
• Subpaths can be one of two types:
  • External Link - L and B can be computed based on knowledge of the link width, frequency, and optionally retimer count. This is described in further detail below.
  • Internal Link:
    • Internal to a CXL device - Device CDAT returns L and B of each subpath.
    • Internal to a CXL switch - Switch CDAT returns L and B of each subpath.
    • Internal to the CPU - System firmware gets L and B from the CPU datasheet. The OS infers this from the HMAT.
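The rules above reduce to a simple fold over subpaths: sum the latencies and take the minimum bandwidth. The sketch below uses illustrative type and field names and units; the actual values would come from CDAT or platform data.

#include <stdint.h>

/* Per-subpath performance, as taken from device/switch CDAT or platform data. */
struct subpath_perf {
    uint64_t latency_ps;    /* unloaded latency L of this subpath   */
    uint64_t bandwidth_mbs; /* best-case bandwidth B of this subpath */
};

/* L(P) = L(P1) + ... + L(Pn);  B(P) = MIN[B(P1), ..., B(Pn)]. */
static struct subpath_perf path_perf(const struct subpath_perf *sub, int n)
{
    struct subpath_perf p = { 0, UINT64_MAX };
    for (int i = 0; i < n; i++) {
        p.latency_ps += sub[i].latency_ps;
        if (sub[i].bandwidth_mbs < p.bandwidth_mbs)
            p.bandwidth_mbs = sub[i].bandwidth_mbs;
    }
    return p;
}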
Figure 2-27 SRAT and HMAT Example: Latency Calculations for 1 Socket System
with no Memory Present at Boot
In this example, a one socket system has native DDR attached memory, a CXL
Type 2 accelerator device with volatile memory, and a CXL Type 3 memory
expander device with volatile memory attached. This shows example latency
calculations the system firmware performs to build the latency portion of the
HMAT.
In this example, the same one socket system as the previous example has
native DDR attached memory, a CXL Type 2 accelerator device with volatile
memory, and a CXL Type 3 memory expander device with volatile memory
attached. This shows example bandwidth calculations the system firmware
performs to build the bandwidth portion of the HMAT.
In this example, a two-socket system has two CXL Type 2 accelerator devices
with volatile memory, and two CXL Type 3 memory expander devices with
volatile memory attached. This shows example latency calculations the system
firmware performs to build the latency portion of the HMAT.
Figure 2-32 SRAT and HMAT Example: Latency Calculations for 1 Socket System
with Persistent Memory and Hot Added Memory
Note that intermediate CXL switches between CXL memory device and the CXL
host bridge do not play a part in determining proper region configuration. If the
connected devices in the region span multiple CXL host bridges, the OS and
UEFI drivers must verify proper device ordering across host bridges. If the
correct devices are connected to each host bridge, the order the devices are
connected within the host bridge does not matter, as shown in the out of order
memory region examples.
Here are some example region configurations and how this algorithm allows
SW to determine correct device placement on each host bridge. This XHB
algorithm does not need to check device ordering within a host bridge or
intermediate switch that might be present. Those are covered in the following
section for CXL host bridge root port ordering checks.
Example - x4 XHB with an x8 device region, 2 devices on each HB, and different host bridge and device Interleave Granularity. CFMWS.ENIW = 2 (x4), CFMWS.HBIG = 2 (1K), Dev.RegionLabel.NLabel = 8, device Interleave Granularity = 0 (256B).
Invalid configuration: The CFMWS.IG will send 1K down each HB, so with 256B for the device IG there would need to be 1K/256B = 4 devices on each HB, which is 16 devices and is > Dev.RegionLabel.NLabel. This is caught by the following test in the previous flow:
If ((2^(CFMWS.IG - Dev.RegionLabel.IG) * (2^CFMWS.ENIW)) > Dev.RegionLabel.NLabel) //invalid configuration
Valid x4 XHB configuration examples, with 2 devices on each HB and different host bridge and device Interleave Granularity, are shown in the Valid x4 XHB Configuration tables.
Example - x8 XHB with an x8 device region, 1 device on each HB, and different host bridge and device Interleave Granularity. CFMWS.ENIW = 3 (x8), CFMWS.HBIG = 2 (1K), Dev.RegionLabel.NLabel = 8, device Interleave Granularity = 0 (256B).
Invalid configuration: The CFMWS.IG will send 1K down each HB, so with 256B for the device IG there would need to be 1K/256B = 4 devices on each HB, which is 32 devices and is > Dev.RegionLabel.NLabel. This is caught by the following test in the previous flow:
If ((2^(CFMWS.IG - Dev.RegionLabel.IG) * (2^CFMWS.ENIW)) > Dev.RegionLabel.NLabel) //invalid configuration
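The test quoted in these examples can be restated directly in code. The sketch below assumes the encoded (power-of-two exponent) forms of the CFMWS and region label values shown above; the function name is illustrative.

#include <stdint.h>

/*
 * Restatement of the check:
 *   2^(CFMWS.IG - Dev.RegionLabel.IG) * 2^(CFMWS.ENIW) > Dev.RegionLabel.NLabel
 */
static int xhb_config_invalid(uint8_t cfmws_ig, uint8_t cfmws_eniw,
                              uint8_t label_ig, uint16_t label_nlabel)
{
    uint32_t required_devices = (1u << (cfmws_ig - label_ig)) << cfmws_eniw;
    return required_devices > label_nlabel;
}

/* Invalid x4 XHB example above: IG = 2 (1K), ENIW = 2 (x4), device IG = 0 (256B),
 * NLabel = 8  ->  4 * 4 = 16 > 8, so the configuration is rejected. */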
The following sections outline examples of the previous CXL host bridge root
port configuration checks required for persistent memory regions.
Example of an invalid region spanning 2 HB root ports that this check would
catch:
Example of an invalid region spanning 2 HB root ports that this check would
catch:
Here are some example HPA to DPA translations referencing the steps in the previous flow.
Configuration description:
• See Memory Region Example - Region interleaved across 2 CXL switches for configuration: 1 HB, 2 switches, 4 devices on each HB, x8 pmem interleave set
• See Memory Region Example - Region interleaved across 2 HBs for configuration: x2 XHB interleave, 4 devices on each HB, x8 pmem interleave set
Step 4: HB extract InterleaveWays
• HB A: Extract 1 bit (IW) from HPA starting at bit 10 (IG+8) = 1b
• HB B: Extract 2 bits (IW) from HPA starting at bit 11 (IG+8) = 10b
Step 12: Device calculate DPAOffset
• Device 6: Remove 3 bits (Device.IW) from HPAOffset starting at bit 10 (Device.IG + 8), shift upper address bits right 3 bits = 0x400
• Device 6: Remove 3 bits (Device.IW) from HPAOffset starting at bit 10 (Device.IG + 8), shift upper address bits right 3 bits = 0x400
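Step 12's device calculation can be written compactly: drop the Device.IW interleave bits that start at bit Device.IG + 8 and concatenate the remaining upper and lower bits. The helper below is a sketch for power-of-two interleave ways only; the name and parameterization are assumptions of this illustration.

#include <stdint.h>

/* Device view: HPA offset -> DPA offset for power-of-two interleave ways.
 * ig and iw are the device's encoded interleave granularity and ways. */
static uint64_t hpa_offset_to_dpa_offset(uint64_t hpa_offset, unsigned ig, unsigned iw)
{
    unsigned shift = ig + 8;                              /* first interleave bit     */
    uint64_t low   = hpa_offset & ((1ull << shift) - 1);  /* bits below the selector  */
    uint64_t high  = hpa_offset >> (shift + iw);          /* bits above the selector  */
    return (high << shift) | low;
}

/* For instance, an HPA offset of 0x2400 with ig = 2, iw = 3 (remove 3 bits
 * starting at bit 10) yields 0x400, matching the Device 6 result above. */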
Step 2: XHB Platform HW forward request to host bridge
• Platform HW: Forward request to CFMWS.TargetList[ 1 ] = HB B
• Platform HW: Forward request to CFMWS.TargetList[ 3 ] = HB D (with Standard Modulo arithmetic, this would have been CFMWS.TargetList[ 1 ], i.e. HB B)
Step 4: HB extract InterleaveWays
• HB B: Extract 1 bit (IW) from HPA starting at bit 9 (IG+1) = 0
• HB D: Extract 1 bit (IW) from HPA starting at bit 8 (IG) = 0
CEDT may contain multiple CFMWS entries with XOR Modulo arithmetic. In
such a case, the number of CXIMS entries must match the number of CFMWS
entries with XOR Modulo arithmetic and they must be in the same order. The
CXIMS entry number N is associated with the N’th CFMWS entry with XOR
Modulo arithmetic. For example, a CEDT instance may include 4 CHBS entries,
3 CFMWS entries and 2 CXIMS entries as shown below.
Assume that CFMWS0 and CFMWS2 have the XOR Modulo Arithmetic flag set.
In this case, CXIMS0 describes the XOR calculation associated with CFMWS0
range and CXIMS1 describes the XOR calculation associated with CFMWS2.
Since CFMWS1 uses standard modulo arithmetic, it is not associated with any
CXIMS entry.
Here is an example that illustrates the difference between Standard and XOR
Arithmetic: With Standard Modulo arithmetic, 4-way interleaving at 256B
granularity may use HPA[9:8] as the interleave selector. HPA[9] is the selector
MSb and HPA[8] is the selector LSb. With a host that implements Modulo
arithmetic combined with XOR, Interleave selector LSb may be computed by
XORing HPA[8] with HPA[11], HPA[17] and HPA[25]. HPA[9] may be XORed
with HPA[12], HPA[18] and HPA[26] while calculating Interleave Selector MSb.
CFMWS.Base = 0
CFMWS.WindowSize = 4 GB
CFMWS.ENIW = 2 (x4)
CFMWS.HBIG = 0 (256B)
The following table shows HPA to DPA conversion for this 4-way interleaving
configuration. It may be used to cross-check software implementations. Note
the ISP corresponding to HPA=2080 is different due to XOR. Without XOR, it
would have been 0.
HPA         Interleave Selector  ISP  DPA
248         0                    0    248
272         1                    1    16
528         2                    2    16
800         3                    3    32
1056        0                    0    288
1312        1                    1    288
1568        2                    2    288
1824        3                    3    288
2080        1                    1    544
2336        0                    0    544
2592        3                    3    544
4112        2                    2    1040
6176        3                    3    1568
131088      1                    1    32784
262160      2                    2    65552
393248      3                    3    98336
393496      2                    2    98328
393520      2                    2    98352
1775813747  0                    0    443953523
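The XOR selector computation for this 4-way, 256B example can be sketched directly from the bit assignments above (selector LSb = HPA[8]^HPA[11]^HPA[17]^HPA[25], selector MSb = HPA[9]^HPA[12]^HPA[18]^HPA[26]; the two selector bits at bit 8 are then removed to form the DPA). This is illustrative code written against the values above, not a normative algorithm.

#include <stdint.h>

static unsigned hpa_bit(uint64_t v, unsigned n) { return (unsigned)((v >> n) & 1); }

/* Interleave selector for the 4-way, 256B XOR example above. */
static unsigned xor_selector_4way(uint64_t hpa)
{
    unsigned lsb = hpa_bit(hpa, 8) ^ hpa_bit(hpa, 11) ^ hpa_bit(hpa, 17) ^ hpa_bit(hpa, 25);
    unsigned msb = hpa_bit(hpa, 9) ^ hpa_bit(hpa, 12) ^ hpa_bit(hpa, 18) ^ hpa_bit(hpa, 26);
    return (msb << 1) | lsb;
}

/* DPA: drop the two selector bits at bit 8. */
static uint64_t dpa_4way_256b(uint64_t hpa)
{
    return ((hpa >> 10) << 8) | (hpa & 0xFF);
}

/* Cross-checks against the table above: HPA 2080 -> selector 1, DPA 544;
 * HPA 272 -> selector 1, DPA 16; HPA 2336 -> selector 0, DPA 544. */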
The following table shows HPA to DPA conversion for this 8-way interleaving
configuration. It may be used to cross-check software implementations. All the
numbers are in decimal format.
248 0 0 0 248
272 1 0 2 16
528 2 0 4 16
800 3 0 6 32
2064 1 0 2 272
4112 2 0 4 528
6176 3 0 6 800
131088 1 0 2 16400
262160 2 0 4 32784
393248 3 0 6 49184
393496 2 0 4 49176
393520 2 0 4 49200
932162229 3 0 6 116520373
1957525275 1 1 3 244690459
2342899463 1 1 3 292862215
245769350 0 1 1 30721158
1971096847 0 1 1 246386959
581610561 1 1 3 72701249
4235060509 1 1 3 529382429
1529057420 2 0 4 191132300
148712089 1 0 2 18589081
2754367011 3 1 7 344295715
Note that when the IW bits are added back into the address, software needs to insert the correct number for that device, which is equivalent to that device's RegionLabel.Position or associated ISP. If the device does not implement an LSA, the ISP is the position of the device in the interleave set, calculated by walking the CXL tree depth-first.
Here are some example DPA to HPA translations referencing the steps in the previous flow.
When Modulo arithmetic combined with XOR is used, the extraction of these HPA bits involves two steps.
Algorithm:
Step 1: Calculate HPA1 using the algorithm used when Standard Modulo arithmetic is employed.
Step 2: XOR certain middle order bits in HPA1 with the Interleave Selector bits in HPA1 to undo the XOR operation that occurred during HPA to DPA conversion. If 0 < IW < 8, the interleave selector bits are HPA1[IG+IW+7:IG+8]. If 8 < IW < 0Bh, the interleave selector bits are HPA1[IG+IW-1:IG+8]. There shall be one XORMAP array element in the CXIMS corresponding to each HPA bit that needs to be recovered using this XOR operation.
CFMWS.Base = 0
CFMWS.WindowSize = 6 GB
CFMWS.HBIG = 0 (256B)
Per the specification, each CXIMS entry is variable size. XORMAP[0] is the first entry. Since 12 = 4 * 3, the power-of-two portion of the interleave is x4, so there must be 2 XORMAP entries in the CXIMS associated with this CFMWS entry, one for each x2 interleave (one per XOR-computed selector bit).
• CXIMS.XORMAP[0] = 0x02020900 (indicates XOR of A8,A11,A17,A25)
• CXIMS.XORMAP[1] = 0x04041200 (Indicates XOR of A9,A12,A18,A26)
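Each XORMAP value is a bitmask of the HPA bits whose XOR produces one interleave selector bit, so step 2 reduces to a parity computation: flip the selector bit of the provisional HPA1 whenever the parity of the other HPA1 bits selected by that XORMAP is 1. The sketch below illustrates that calculation; it is not an exact reproduction of the flow's step 2.c.

#include <stdint.h>

/* Parity (XOR of all bits) of v. */
static unsigned parity64(uint64_t v)
{
    v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
    v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
    return (unsigned)(v & 1);
}

/* Undo the XOR arithmetic on a provisional HPA1. Each XORMAP's lowest set bit
 * is the interleave selector bit it governs; the remaining set bits are the
 * higher-order HPA bits folded into that selector during HPA to DPA conversion. */
static uint64_t apply_xormaps(uint64_t hpa1, const uint64_t *xormap, int count)
{
    for (int i = 0; i < count; i++) {
        uint64_t sel_bit = xormap[i] & (~xormap[i] + 1);   /* isolate lowest set bit */
        if (parity64(hpa1 & (xormap[i] & ~sel_bit)))
            hpa1 ^= sel_bit;                               /* flip the selector bit  */
    }
    return hpa1;
}

/* For the XORMAP values above: { 0x02020900, 0x04041200 }. */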
For this example, the first iteration in step 2.c is expanded below.
The following table shows the HPA to DPA and DPA to HPA transformation for this configuration. All numbers below are in decimal.
If a CXL Host Bridge appears twice in the list in consecutive positions, that means the hardware will send 2*256 consecutive bytes out of every 4*256 bytes to that HB and the device below it.
From the device perspective, that essentially means a 2-way interleave (IW=1) at 512B granularity (IG=1). During the DPA to HPA translation, the interleaving scheme as seen by the device, along with the position of the device in that interleave set, shall be used.
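One way to express the device-view parameters: if a host bridge occupies k consecutive entries of the window's target list, the device sees the window's interleave ways reduced, and the granularity exponent increased, by log2(k). The sketch below uses illustrative names and the encoded (exponent) forms of the values.

#include <stdint.h>

/* Device-view interleave parameters when one host bridge occupies 'repeats'
 * consecutive CFMWS target list entries (repeats must be a power of two).
 * eniw and hbig are the encoded window values; IG 0 = 256B, 1 = 512B, ... */
struct device_view { uint8_t iw; uint8_t ig; };

static struct device_view device_interleave_view(uint8_t eniw, uint8_t hbig, unsigned repeats)
{
    unsigned shift = 0;
    while ((1u << shift) < repeats)
        shift++;                                  /* shift = log2(repeats) */
    struct device_view v = { (uint8_t)(eniw - shift), (uint8_t)(hbig + shift) };
    return v;
}

/* Example above: ENIW = 2 (x4), HBIG = 0 (256B), HB listed twice
 * -> device sees IW = 1 (x2) at IG = 1 (512B). */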
The EFI “specific purpose memory” attribute was added to help Linux handle
performance-differentiated memory. System firmware/bios should apply this
attribute to all volatile CXL memory.
When Linux sees the EFI_MEMORY_SP attribute, it avoids using the subject
memory during boot, and never uses it for kernel data structures. Instead, the
memory becomes available as one or more devdax devices (for example,
/dev/dax0.0, which can be memory mapped by applications to access the
memory).
Starting in Linux 6.3, the default memory policy is to wait until after boot and
then online the CXL memory as system-ram. This policy can be overridden via
the “memhp_default_state=offline” kernel command line argument, or with
build-time kernel configuration options. Thus the default behavior (system-
ram) can be overridden to devdax mode by modifying the kernel command line
and rebooting once.
The daxctl utility can be used to perform online conversion between system-
ram and device-dax modes – with the caveat that conversion from system-ram
to devdax mode cannot always be guaranteed to work (because it is not always
possible to relocate all contents of the memory). It is for this reason that Linux
needs the EFI_MEMORY_SP attribute to prevent onlining CXL memory as
system-ram during boot.
The attribute has no effect on operating systems that ignore EFI_MEMORY_SP; such operating systems will see the memory as conventional memory.
Note that the CDAT DSEMTS can override the type to EFI_RESERVED_TYPE
(which must be respected by firmware/bios), but the CDAT DSEMTS should not