LP 1619
Dave Feisthammel
Hussein Jammal
Laurentiu Petre
Starting in Azure Stack HCI version 21H2, GPUs can be included in an Azure Stack HCI cluster to provide
GPU acceleration to workloads running in clustered virtual machines. This document discusses the basic
prerequisites of this capability and how to configure GPUs for use by clustered virtual machines running the
Azure Stack HCI operating system.
GPU acceleration is provided using Discrete Device Assignment (DDA), also known as GPU pass-through,
which allows you to dedicate one or more physical GPUs to a virtual machine. Clustered virtual machines can
take advantage of GPU acceleration and clustering capabilities such as high availability via failover. Although
Live Migration of a virtual machine that has one or more GPUs assigned to it is not currently supported, virtual
machines can be automatically restarted and placed where GPU resources are available in the event of a
failure.
Note: Microsoft does not yet support GPU partitioning (GPU-P) in the Azure Stack HCI operating system.
Currently, it is only possible to assign the entire GPU from the host to a single virtual machine using DDA.
This document provides instructions and examples to configure GPUs for use by the Azure Stack HCI
operating system. We include information for installing GPU device drivers, configuring Windows Admin
Center to manage GPUs, creating GPU Pools, and assigning virtual machines to GPUs in a pool.
ThinkAgile MX3520 Appliance (MT 7D5R) and ThinkAgile MX Certified Node (MT 7Z20) based on the
ThinkSystem SR650 Rack Server:
• ThinkSystem NVIDIA Tesla T4 16GB PCIe Passive GPU (Feature Code B4YB)
• ThinkSystem NVIDIA A2 16GB PCIe Gen4 Passive GPU (Feature Code BP05)
• ThinkSystem NVIDIA A10 24GB PCIe Gen4 Passive GPU (Feature Code BFTZ)
• ThinkSystem NVIDIA A30 24GB PCIe Gen4 Passive GPU (Feature Code BJHG)
• ThinkSystem NVIDIA A100 40GB PCIe Gen4 Passive GPU (Feature Code BEL5)
ThinkAgile MX1020 Appliance (MTs 7D5S and 7D5T) and MX1021 Certified Node (MTs 7D1B and 7D2U)
based on the ThinkSystem SE350 Edge Server:
• ThinkSystem NVIDIA Tesla T4 16GB PCIe Passive GPU (Feature Code B4YB)
• ThinkSystem NVIDIA A2 16GB PCIe Gen4 Passive GPU (Feature Code BP05)
Note: Since a GPU consumes the only available PCIe slot in an SE350 Edge Server,
only 4 SSD or NVMe devices can be configured in any SE350 that includes a GPU.
• ThinkSystem NVIDIA Tesla T4 16GB PCIe Passive GPU (Feature Code B4YB)
• ThinkSystem NVIDIA A2 16GB PCIe Gen4 Passive GPU (Feature Code BQZT)
ThinkAgile MX3530 Appliance (MT 7D6B) and ThinkAgile MX3531 Certified Node (MT 7D66) based on
the ThinkSystem SR650 V2 Rack Server:
• ThinkSystem NVIDIA Tesla T4 16GB PCIe Passive GPU (Feature Code B4YB)
• ThinkSystem NVIDIA A2 16GB PCIe Gen4 Passive GPU (Feature Code BP05)
• ThinkSystem NVIDIA A10 24GB PCIe Gen4 Passive GPU (Feature Code BFTZ)
• ThinkSystem NVIDIA A30 24GB PCIe Gen4 Passive GPU (Feature Code BJHG)
• ThinkSystem NVIDIA A40 48GB PCIe Gen4 Passive GPU (Feature Code BEL4)
• ThinkSystem NVIDIA A100 40GB PCIe Gen4 Passive GPU (Feature Code BEL5)
Note: The ThinkSystem NVIDIA Quadro RTX 6000 24GB PCIe Passive GPU is supported only with the
Windows Server 2019 operating system. Because it does not support Windows Server 2022 or the Azure
Stack HCI operating system, it is not included in the lists above.
All currently supported GPUs for ThinkAgile MX solutions are NVIDIA GPUs. Therefore, the driver can be
downloaded directly from NVIDIA at the following URL:
https://www.nvidia.com/Download/index.aspx?lang=en-us
For NVIDIA GPUs, in addition to the standard GPU driver, an additional INF file must be installed on the host
systems. This INF file tells Hyper-V how to correctly reset the GPU during VM reboots, which ensures that
the GPU is in a clean state when the VM boots. More information and a link to download the required INF
files are available from NVIDIA at the following URL:
https://docs.nvidia.com/datacenter/tesla/gpu-passthrough/index.html#introduction
1. At the SConfig screen, type 15 and press Enter to leave SConfig and enter Windows PowerShell.
2. At the PowerShell prompt, navigate to the directory that contains the GPU driver installation file.
3. On the first node only, create a directory into which you will copy the extracted driver installation files.
This allows the extraction process to be skipped on the other nodes in the cluster when installing the
driver. In our example, the directory containing the downloaded driver installer is named
“C:\ClusterStorage\hj\Drivers” and the directory that will be used for the extracted installation files is
named “C:\ClusterStorage\hj\Drivers\GPUs”.
4. Once the directory has been created, run the downloaded installer from PowerShell, as shown in the
following example screenshot (a command-line sketch is also provided after these steps).
6. The installer will take a minute or so to extract its files to the Extraction path specified and then
present the main driver installation window, as shown in the following screenshot.
The driver can be installed on additional nodes from this directory without waiting for content to be
extracted. Simply run “setup.exe” from this location.
8. Back in the driver installation window, click through the wizard to install the driver. For general use,
the Express installation option is suitable.
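The following PowerShell sketch summarizes steps 3 and 4 on the first node, plus the subsequent installation
on the remaining nodes. The installer file name below is a placeholder; substitute the actual NVIDIA driver
package you downloaded.

# First node only: create the directory for the extracted driver files
New-Item -ItemType Directory -Path "C:\ClusterStorage\hj\Drivers\GPUs" -Force | Out-Null

# First node only: launch the downloaded installer and, when prompted, set the
# Extraction path to C:\ClusterStorage\hj\Drivers\GPUs
# ("nvidia-driver-package.exe" is a placeholder file name)
Set-Location "C:\ClusterStorage\hj\Drivers"
.\nvidia-driver-package.exe

# Remaining nodes: install directly from the already-extracted files
Set-Location "C:\ClusterStorage\hj\Drivers\GPUs"
.\setup.exe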
After the GPU driver has been installed on all nodes, the required INF file must also be installed on all nodes
to ensure that Hyper-V can correctly reset the GPU during VM reboots.
1. Download the ZIP archive that contains the INF files from NVIDIA at the following URL:
https://docs.nvidia.com/datacenter/tesla/gpu-passthrough/index.html#introduction
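After the archive has been downloaded, the INF files can be extracted and installed from PowerShell on each
node. A minimal sketch, in which the archive file name and extraction directory are assumptions for
illustration:

# Extract the downloaded archive (file name is a placeholder)
Expand-Archive -Path "C:\ClusterStorage\hj\Drivers\nvidia-passthrough-inf.zip" -DestinationPath "C:\ClusterStorage\hj\Drivers\INF"

# Install the INF file(s) on this host; repeat on every node in the cluster
pnputil /add-driver "C:\ClusterStorage\hj\Drivers\INF\*.inf" /install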
After the appropriate INF files have been installed on all nodes, Windows Admin Center (WAC) can be
configured to manage the GPUs.
To verify that all GPUs have been identified in WAC, connect to the cluster in WAC and then use the left
navigation pane to select the GPUs extension. Each node should be shown, including installed GPUs by
name. The following screenshot shows our 4-node cluster with an NVIDIA A30 24GB PCIe Gen4 Passive
GPU installed in each of the nodes.
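As a cross-check outside of WAC, PowerShell on each node can confirm that the operating system sees the
GPU. A minimal sketch:

# List display-class devices present on this node; the NVIDIA GPU should appear by name
Get-PnpDevice -Class Display -PresentOnly | Format-Table FriendlyName, Status, InstanceId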
GPU Pools are created to allow GPUs to be assigned to specific virtual machines. To create a GPU Pool,
follow these steps:
1. In WAC, with the GPU extension showing, click the GPU pools heading, and then click the Create
GPU pool button.
The environment is now ready for GPUs to be assigned to individual virtual machines.
1. In WAC, with the GPU extension showing, click the + Assign VM to pool button to open the VM
assignment screen. If the + Assign VM to pool button is not visible, it is likely hidden behind the
ellipsis (“…”). The Type of assignment cannot be changed, since only DDA is currently supported by
Microsoft. For Server, choose the node that hosts the virtual machine needing the GPU assignment
(in our example “hci-node1.contoso.com”). For GPU pool, choose the appropriate GPU pool (in our
example “GPUPool01”). For Virtual machine, choose an appropriate virtual machine that is running
on the selected host (in our example “GPU-VM4”). In the Advanced area, set the High memory
mapped IO space (in MB) setting to “66560” and choose whether or not to check the Configure
offline action to force shutdown checkbox, depending on your needs. Once all settings have been
specified, click Assign.
For more about high memory mapped IO space and other considerations for using DDA, see the
Microsoft reference article here:
https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment
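For reference, the Microsoft article above documents the DDA cmdlets that WAC drives on your behalf. The
following sketch shows what an equivalent assignment looks like in plain PowerShell on a standalone Hyper-V
host; the VM name and GPU selection match the example above, and in a cluster the WAC workflow should
be used so that the assignment is tracked by the GPU pool.

# Configure MMIO space on the VM (the value matches the WAC example above)
Set-VM -VMName "GPU-VM4" -HighMemoryMappedIoSpace 66560MB -LowMemoryMappedIoSpace 3GB -GuestControlledCacheTypes $true

# Find the GPU's PCIe location path, then disable and dismount it from the host
$gpu = Get-PnpDevice -FriendlyName "NVIDIA*" -PresentOnly | Select-Object -First 1
$locationPath = ($gpu | Get-PnpDeviceProperty -KeyName DEVPKEY_Device_LocationPaths).Data[0]
Disable-PnpDevice -InstanceId $gpu.InstanceId -Confirm:$false
Dismount-VMHostAssignableDevice -Force -LocationPath $locationPath

# Assign the dismounted GPU to the virtual machine
Add-VMAssignableDevice -LocationPath $locationPath -VMName "GPU-VM4"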
1. Access the virtual machine by whatever method is suitable. For our example, we simply use Failover
Cluster Manager to connect to the virtual machine.
2. To verify that the GPU driver is not yet installed in the virtual machine, open Device Manager. With no
GPU device driver installed, an item “Display Controller” should be listed under “Other devices”.
Looking at the Properties for this device will show that no driver is installed.
3. Copy the GPU driver installer to the virtual machine and run it, exactly as was done for the host.
4. After the driver is installed, check Device Manager again to confirm that the GPU is now listed by name
under Display adapters.
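As an alternative check from inside the virtual machine, a short PowerShell sketch (nvidia-smi is installed
together with the NVIDIA driver):

# The GPU should now be listed as a display device with a working driver
Get-PnpDevice -Class Display -PresentOnly | Format-Table FriendlyName, Status

# The NVIDIA management utility reports the GPU model and driver version
nvidia-smi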
Lenovo Press document: Microsoft Storage Spaces Direct (S2D) Deployment Guide
https://lenovopress.com/lp0064
Lenovo Press document: ThinkAgile MX1021 on SE350 Azure Stack HCI (S2D) Deployment Guide
https://lenovopress.com/lp1298
Lenovo Press document: How to Deploy Azure Stack HCI clusters via Microsoft Windows Admin Center
https://lenovopress.com/lp1524
Lenovo Press document: Lenovo Certified Configurations for Microsoft Azure Stack HCI – V1 Servers
https://lenovopress.com/lp0866
Lenovo Press document: Lenovo Certified Configurations for Microsoft Azure Stack HCI – V2 Servers
https://lenovopress.com/lp1520
References in this document to Lenovo products or services do not imply that Lenovo intends to make them
available in every country.
Lenovo, the Lenovo logo, ThinkSystem, ThinkCentre, ThinkVision, ThinkVantage, ThinkPlus and Rescue and
Recovery are trademarks of Lenovo.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.
Intel, Intel Inside (logos), and Pentium are trademarks of Intel Corporation in the United States, other
countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
All customer examples described are presented as illustrations of how those customers have used Lenovo
products and the results they may have achieved. Actual environmental costs and performance
characteristics may vary by customer.
Information concerning non-Lenovo products was obtained from a supplier of these products, published
announcement material, or other publicly available sources and does not constitute an endorsement of such
products by Lenovo. Sources for non-Lenovo list prices and performance numbers are taken from publicly
available information, including vendor announcements and vendor worldwide homepages. Lenovo has not
tested these products and cannot confirm the accuracy of performance, capability, or any other claims related
to non-Lenovo products. Questions on the capability of non-Lenovo products should be addressed to the
supplier of those products.
All statements regarding Lenovo future direction and intent are subject to change or withdrawal without notice,
and represent goals and objectives only. Contact your local Lenovo office or Lenovo authorized reseller for the
full text of the specific Statement of Direction.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive
statement of a commitment to specific levels of performance, function or delivery schedules with respect to
any future products. Such commitments are only made in Lenovo product announcements. The information is
presented here to communicate Lenovo’s current investment and development activities as a good faith effort
to help with our customers' future planning.
Performance is based on measurements and projections using standard Lenovo benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the
storage configuration, and the workload processed. Therefore, no assurance can be given that an individual
user will achieve throughput or performance improvements equivalent to the ratios stated here.
Any references in this information to non-Lenovo websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this Lenovo product and use of those websites is at your own risk.