Performance tuning guide for Cisco UCS M6 servers
Cisco public
This document explains the BIOS settings that are valid for the Cisco Unified Computing System™ (Cisco UCS®)
M6 server generation of the following servers: Cisco UCS B200 M6 Blade Server, X210c M6 Compute Node,
C220 M6 Rack Server, and C240 M6 Rack Server. All servers use third-generation (3rd Gen) Intel® Xeon®
Scalable processors. The document describes how to optimize the BIOS settings to meet requirements for the
best performance and energy efficiency for the Cisco UCS M6 generation of blade and rack servers.
With the release of the 3rd Gen Intel Xeon Scalable processor family (architecture code named Ice Lake), Cisco
released sixth-generation Cisco UCS servers to take advantage of the increased number of cores, higher
memory speeds, and PCIe 4.0 features of the new processors, thus benefiting CPU-, memory-, and I/O-
intensive workloads.
Understanding the BIOS options will help you select appropriate values to achieve optimal system performance.
This document does not discuss the BIOS options for specific firmware releases of Cisco UCS M6 servers. The
settings demonstrated here are generic.
Cisco UCS servers with standard settings already provide an optimal ratio of performance to energy efficiency.
However, through BIOS settings you can further optimize the system for higher performance at the cost of
energy efficiency. Basically, this optimization operates all the components in the system at the maximum speed
possible and prevents the energy-saving options from slowing down the system. In general, optimization to
achieve greater performance is associated with increased consumption of electrical power. This document
explains how to configure the BIOS settings to achieve optimal computing performance.
Performance tuning is difficult and general recommendations are problematic. This document tries to provide
insights into optimal BIOS settings and OS tunings that have an impact on overall system performance. This
document does not provide generic rules of thumb (or values) to be used for performance tuning. Fine-tuning
of the parameters described requires a thorough understanding of the enterprise workloads and the
Cisco UCS platform on which they run.
Processor settings
This section describes processor options you can configure.
Enhanced Intel SpeedStep Technology
You can specify whether the processor uses Enhanced Intel SpeedStep Technology, which allows the system to
dynamically adjust processor voltage and core frequency. This technology can result in decreased average
power consumption and decreased average heat production.
Intel Turbo Boost Technology
Intel Turbo Boost is especially useful for latency-sensitive applications and for scenarios in which the system is
nearing saturation and would benefit from a temporary increase in the CPU speed. If your system is not running
at this saturation level and you want the best performance at a utilization rate of less than 90 percent, you
should disable Intel SpeedStep to help ensure that the system is running at its stated clock speed at all times.
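A quick way to see how the OS is driving these P-state features is the cpufreq scaling governor. The sketch below assumes a Linux host; the sysfs path is an assumption about typical Linux deployments and may be absent in VMs and containers, in which case the function returns None.

```python
# Illustrative check (assumes Linux; the cpufreq sysfs path may be absent in
# VMs and containers, in which case None is returned): read the scaling
# governor, which shows how the OS is driving SpeedStep P-states.
from pathlib import Path

def scaling_governor(cpu: int = 0):
    """Return the cpufreq governor for a CPU, or None if cpufreq is unavailable."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor")
    return path.read_text().strip() if path.exists() else None

if __name__ == "__main__":
    print(scaling_governor())  # e.g. "performance" or "powersave" on a tuned host
```

A host tuned for maximum performance typically reports the "performance" governor here.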
CPU performance
Intel Xeon processors have several layers of cache. Each core has a tiny Layer 1 cache, sometimes referred to
as the Data Cache Unit (DCU), that has 32 KB for instructions and 32 KB for data. Slightly bigger is the Layer 2
cache, with 256 KB shared between data and instructions for each core. In addition, all cores on a chip share a
much larger Layer 3 cache, which is about 10 to 45 MB in size (depending on the processor model and number
of cores).
The prefetcher settings provided by Intel primarily affect the Layer 1 and Layer 2 caches on a processor core
(Table 1). You will likely need to perform some testing with your individual workload to find the combination that
works best for you. Testing on the Intel Xeon Scalable processor has shown that most applications run best
with all prefetchers enabled. See Tables 2 and 3 for guidance.
Option | Description
CPU performance | Sets the CPU performance profile for the server. This can be one of the following:
● Enterprise/HPC: All prefetchers are enabled. This is the platform-default setting for M6 servers.
● High throughput: The DCU IP prefetcher is enabled, and all other prefetchers are disabled.
● Custom: Allows users to choose the desired prefetcher settings depending on workloads.
● Platform default: The BIOS uses the value for this attribute contained in the BIOS defaults for the server type and vendor.

Prefetcher configuration | Example workloads
All enabled | HPC benchmarks, web server, analytical database, virtualization, and relational database systems
DCU-IP enabled; all others disabled | SPECjbb2015 benchmark and certain server-side Java application-server applications
Hardware prefetcher
The hardware prefetcher prefetches additional streams of instructions and data into the Layer 2 cache upon
detection of an access stride. This behavior is more likely to occur during operations that sort sequential data,
such as database table scans and clustered index scans, or that run a tight loop in code.
You can specify whether the processor allows the Intel hardware prefetcher to fetch streams of data and
instructions from memory into the unified second-level cache when necessary.
Adjacent-cache-line prefetcher
The adjacent-cache-line prefetcher always prefetches the next cache line. Although this approach works well
when data is accessed sequentially in memory, it can quickly litter the small Layer 2 cache with unneeded
instructions and data if the system is not accessing data sequentially, causing frequently accessed instructions
and code to leave the cache to make room for the adjacent-line data or instructions.
You can specify whether the processor fetches cache lines in even or odd pairs instead of fetching just the
required line.
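The cache-capacity trade-off described above can be sketched in a toy model. This is an illustration, not a hardware simulation: it models the adjacent-line prefetcher as pulling in the other 64-byte line of each aligned 128-byte pair and counts which lines end up in cache.

```python
# Toy model (not a hardware simulation): the adjacent-cache-line prefetcher
# pulls in the other 64-byte line of each aligned 128-byte pair. Sequential
# scans reuse those lines; large strides waste cache capacity on them.
LINE = 64  # bytes per cache line

def lines_loaded(addresses, adjacent_prefetch):
    """Return the set of cache-line indices loaded for the given byte addresses."""
    loaded = set()
    for addr in addresses:
        line = addr // LINE
        loaded.add(line)
        if adjacent_prefetch:
            loaded.add(line ^ 1)  # the even/odd partner line of the 128-byte pair
    return loaded

sequential = range(0, 1024, 8)     # a sequential table scan, 8-byte reads
strided = range(0, 65536, 4096)    # a 4 KB stride, one read per page

# Sequential access: every prefetched partner line is one the scan needs anyway.
same = lines_loaded(sequential, True) == lines_loaded(sequential, False)
# Strided access: every prefetched partner line is wasted Layer 2 capacity.
wasted = lines_loaded(strided, True) - lines_loaded(strided, False)
```

In this model the sequential scan wastes nothing, while the strided scan doubles its cache footprint, which is exactly why this prefetcher helps some workloads and hurts others.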
DCU streamer prefetcher
This prefetcher is a Layer 1 data cache prefetcher. It detects multiple loads from the same cache line that occur
within a time limit. Making the assumption that the next cache line is also required, the prefetcher loads the next
line in advance to the Layer 1 cache from the Layer 2 cache or the main memory.
● Disabled: The processor does not try to anticipate cache read requirements and fetches only explicitly
requested lines.
● Enabled: The DCU prefetcher analyzes the cache read pattern and prefetches the next line in the cache if
it determines that it may be needed.
DCU-IP prefetcher
You can specify whether the processor uses the DCU-IP prefetch mechanism to analyze historical cache
access patterns and preload the most relevant lines in the Layer 1 cache.
LLC prefetch
The setting for this BIOS option can be either of the following:
● Disabled: The LLC prefetcher is disabled. The other core prefetchers are not affected.
● Enabled: The core prefetcher can prefetch data directly to the LLC.
Note: If you change this option, you must power the server off and on before the setting takes effect.
UPI power management
Intel UPI power management is used to conserve power on a platform. Low power mode reduces the UPI
frequency and bandwidth. This option is recommended to save power; however, UPI power management is not
recommended for high-frequency, low-latency, virtualization, and database workloads.
This BIOS option controls the link L0p Enable and link L1 Enable values.
L1 saves the most power but has the greatest impact on latency and bandwidth. L1 allows a UPI link to
transition to the full-link-down state; it is the deepest power-saving state.
L0p allows a partial-link-down state, in which a subset of the lanes remains awake.
UPI link frequency
UPI link frequency determines the rate at which the UPI processor interconnect link operates. If a workload is
highly Nonuniform Memory Access (NUMA) aware, sometimes lowering the UPI link frequency can free more
power for the cores and result in better overall performance.
Sub-NUMA clustering
The setting for this BIOS option can be either of the following:
● Disabled: The LLC is treated as one cluster when this option is disabled.
● Enabled: The LLC capacity is used more efficiently, and latency is reduced as a result of the core and
integrated memory controller proximity. This setting may improve performance on NUMA-aware
operating systems.
Note: When SNC is selected, the operating system discovers each physical CPU socket as two NUMA
nodes, except for 3rd Gen Intel Xeon Scalable processors with fewer than 12 cores, for which SNC is not
supported. Refer to your OS documentation to determine whether SNC is supported.
LLC dead line
Values for the LLC dead line BIOS option can be either of the following:
● Disabled: If this option is disabled, dead lines will be dropped from the LLC. This setting provides better
utilization in the LLC and prevents the LLC from evicting useful data.
● Enabled: If this option is enabled, the processor determines whether to keep or drop dead lines. By
default, this option is enabled.
Processor C1E
Enabling the C1E option allows the processor to transition to its minimum frequency upon entering the C1 state.
This setting does not take effect until after you have rebooted the server. When this option is disabled, the CPU
continues to run at its maximum frequency in the C1 state. Users should disable this option to perform
application benchmarking.
You can specify whether the CPU transitions to its minimum frequency when entering the C1 state.
● Disabled: The CPU continues to run at its maximum frequency in the C1 state.
● Enabled: The CPU transitions to its minimum frequency. This option saves the maximum amount of
power in the C1 state.
Processor C6 report
The C6 state is a power-saving halt and sleep state that a CPU can enter when it is not busy. Unfortunately, it
can take some time for the CPU to leave these states and return to a running condition. If you are concerned
about performance (for all but latency-sensitive single-threaded applications), and if you can do so, disable
anything related to C-states.
You can specify whether the BIOS sends the C6 report to the operating system. When the OS receives the
report, it can transition the processor into the lower C6 power state to decrease energy use while maintaining
optimal processor performance.
Package C-state limit
You can specify the amount of power available to the server components when they are idle.
● C0/C1 State: When the CPU is idle, the system slightly reduces power consumption. This option requires
less power than C0 and allows the server to return quickly to high-performance mode.
● C2 State: When the CPU is idle, the system reduces power consumption more than with the C1 option.
This option requires less power than C1 or C0, but the server takes slightly longer to return to high-
performance mode.
● C6 Nonretention: When the CPU is idle, the system reduces power consumption more than with the C3
option. This option saves more power than C0, C1, or C3, but the system may experience performance
problems until the server returns to full power.
● C6 Retention: When the CPU is idle, the system reduces power consumption more than with the C3
option. This option consumes slightly more power than the C6 Nonretention option, because the
processor is operating at Pn voltage to reduce the package’s C-state exit latency.
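On a running Linux system you can see which of these C-states the OS can actually request through the cpuidle sysfs tree. This is a sketch under the assumption of a Linux host; the tree may be absent in VMs and containers, in which case an empty list is returned.

```python
# Illustrative check (assumes Linux; the cpuidle sysfs tree may be absent in
# VMs and containers, in which case an empty list is returned): list the
# C-states the OS can request, which the package C-state limit constrains.
from pathlib import Path

def cpuidle_states(cpu: int = 0):
    """Return the C-state names (e.g. POLL, C1, C1E, C6) exposed for a CPU."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle")
    if not base.exists():
        return []
    return [d.joinpath("name").read_text().strip() for d in sorted(base.glob("state*"))]

if __name__ == "__main__":
    print(cpuidle_states())
```

If deeper states such as C6 appear here despite a restrictive BIOS limit, check that the setting took effect after a full power cycle.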
When the operating system requests CPU core C1 state, system hardware automatically changes the request to
the core C6 state.
Note: This BIOS feature is applicable only to Cisco UCS C-Series Rack Servers.
Memory settings
You can use several settings to optimize memory performance.
Always set the memory reliability, availability, and serviceability (RAS) configuration to Maximum Performance
for systems that require the highest performance and do not require memory fault-tolerance options.
Most modern operating systems, particularly virtualization hypervisors, support NUMA because in the latest
server designs a processor is attached to a memory controller; therefore, in a two-socket system half the
memory belongs to one processor, and half belongs to the other. If a core needs to access memory attached to
another processor, a longer latency period is needed to access that part of memory. Operating systems and hypervisors
recognize this architecture and are designed to reduce such trips. For hypervisors such as those from VMware
and for modern applications designed for NUMA, keep this option enabled.
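The local-versus-remote cost can be captured in a toy latency model. The numbers below are illustrative assumptions, not measured values for any Cisco UCS platform; they only show why NUMA-aware placement of memory next to the core matters.

```python
# Toy model with assumed numbers (illustrative only, not measured values):
# a remote-node access pays an interconnect hop on top of local DRAM latency.
LOCAL_NS = 80           # assumed local DRAM access latency
REMOTE_PENALTY_NS = 60  # assumed extra cost of crossing the socket interconnect

def access_latency_ns(core_node: int, memory_node: int) -> int:
    """Latency for a core on one NUMA node touching memory on another node."""
    penalty = REMOTE_PENALTY_NS if core_node != memory_node else 0
    return LOCAL_NS + penalty

# A NUMA-aware OS schedules the thread where its memory lives (node 0 -> node 0).
local = access_latency_ns(0, 0)   # local access
remote = access_latency_ns(0, 1)  # remote access pays the interconnect penalty
```

A NUMA-aware OS or hypervisor keeps most accesses on the local path, which is why the NUMA-optimized option should stay enabled for such workloads.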
The Integrated Memory Controller (IMC) BIOS option controls the interleaving between the integrated memory
controllers. There are two integrated memory controllers per CPU socket in an x86 server running Intel Xeon
Scalable processors. If integrated memory controller interleaving is set to 2-way, addresses will be interleaved
between the two integrated memory controllers. If integrated memory controller interleaving is set to 1-way,
there will be no interleaving.
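The effect of interleaving can be sketched as an address-mapping function. The 256-byte granularity and the contiguous 1-way split below are assumptions for illustration; the actual mapping is platform-specific.

```python
# Sketch (the 256-byte granularity and the 1-way address split are assumptions
# for illustration; actual mappings are platform-specific): how 2-way
# interleaving spreads consecutive chunks across a socket's two IMCs.
CHUNK = 256  # assumed interleave granularity in bytes

def imc_for_address(addr: int, ways: int) -> int:
    """Return which of the two IMCs (0 or 1) serves a physical address."""
    if ways == 1:
        # 1-way: no interleaving; each IMC owns one contiguous half (assumed split).
        return 0 if addr < 2**37 else 1
    # 2-way: alternate chunk by chunk, so streaming reads draw on both controllers.
    return (addr // CHUNK) % 2

# A streaming read over 1 KB alternates between both controllers with 2-way:
pattern = [imc_for_address(a, 2) for a in range(0, 1024, CHUNK)]
```

With 2-way interleaving a streaming access alternates between both controllers and can use their combined bandwidth; with 1-way, a stream confined to one half of memory is served by a single controller.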
Adaptive Double Device Data Correction (ADDDC) is a memory RAS feature that enables dynamic mapping of
failing DRAM by monitoring corrected errors and taking action before uncorrected errors can occur and cause
an outage. It is now enabled by default.
After ADDDC sparing remaps a memory region, the system can incur marginal memory latency and bandwidth
penalties in memory-bandwidth-intensive workloads that target the affected region. Cisco recommends
scheduling proactive maintenance to replace a failed DIMM after an ADDDC RAS fault is reported.
Patrol scrub
You can specify whether the system actively searches for, and corrects, single-bit memory errors even in
unused portions of the memory on the server.
● Disabled: The system checks for memory Error-Correcting Code (ECC) errors only when the CPU
reads or writes a memory address.
● Enabled: The system periodically reads and writes memory searching for ECC errors. If any errors
are found, the system attempts to fix them. This option may correct single-bit errors before they
become multiple-bit errors, but it may adversely affect performance when the patrol-scrub process
is running.
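Why scrubbing matters can be shown with a simplified model. This is not the actual ECC algorithm: it only encodes the SEC-DED property that one flipped bit per word is correctable while two are not, and shows how a scrub pass between faults prevents accumulation.

```python
# Simplified model (not the actual ECC algorithm): SEC-DED ECC corrects one
# flipped bit per word but only detects two. Periodic scrubbing corrects
# single-bit errors before a second flip can make the word uncorrectable.
def scrub(error_bits):
    """Patrol-scrub pass: rewrite (correct) every word holding a single-bit error."""
    for i, errs in enumerate(error_bits):
        if errs == 1:
            error_bits[i] = 0

def is_uncorrectable(error_bits):
    return any(errs >= 2 for errs in error_bits)

memory = [0, 0, 0, 0]   # per-word count of flipped bits
memory[2] += 1          # a transient fault flips one bit in word 2
scrub(memory)           # a scrub cycle runs before the next fault
memory[2] += 1          # a second flip hits the same word later
survived = not is_uncorrectable(memory)  # each flip was corrected on its own

no_scrub = [0, 0, 0, 0]
no_scrub[2] += 1
no_scrub[2] += 1        # without scrubbing, the flips accumulate in one word
failed = is_uncorrectable(no_scrub)
```

The scrubbed path stays correctable while the unscrubbed path ends with a two-bit (uncorrectable) word, which is the failure mode patrol scrub is designed to prevent.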
Fan policy
Fan policy enables you to control the fan speed to reduce server power consumption and noise levels. Prior to
fan policy, the fan speed increased automatically when the temperature of any server component exceeded the
set threshold. To help ensure that the fan speeds were low, the threshold temperatures of components were
usually set to high values. Although this behavior suited most server configurations, it did not address the
following situations:
● Maximum CPU performance: For high performance, certain CPUs must be cooled substantially below the
set threshold temperature. This cooling requires very high fan speeds, which results in increased power
consumption and noise levels.
● Low power consumption: To help ensure the lowest power consumption, fans must run very slowly and,
in some cases, stop completely on servers that allow this behavior. But slow fan speeds can cause
servers to overheat. To avoid this situation, you need to run fans at a speed that is moderately faster
than the lowest possible speed.
● Balanced: This is the default policy. This setting can cool almost any server configuration, but it may not
be suitable for servers with PCI Express (PCIe) cards, because these cards overheat easily.
● Low Power: This setting is well suited for minimal-configuration servers that do not contain any PCIe
cards.
● High Power: This setting can be used for server configurations that require fan speeds ranging from 60
to 85 percent. This policy is well suited for servers that contain PCIe cards that easily overheat and have
high temperatures. The minimum fan speed set with this policy varies for each server platform, but it is
approximately in the range of 60 to 85 percent.
● Maximum Power: This setting can be used for server configurations that require extremely high fan
speeds ranging between 70 and 100 percent. This policy is well suited for servers that contain PCIe
cards that easily overheat and have extremely high temperatures. The minimum fan speed set with this
policy varies for each server platform, but it is approximately in the range of 70 to 100 percent.
● Acoustic: The fan speed is reduced to lower noise levels in acoustic-sensitive environments. Rather than
regulating energy consumption and preventing component throttling as in other modes, the Acoustic
option can allow short-term throttling to achieve the lower noise level. Applying this fan control policy
might result in short, transient performance impacts.
Acoustic mode is available only on the Cisco UCS C220 M6, C240 M6, and C240 SD M6 servers.
● For standalone Cisco UCS C-Series M6 servers, configure the fan policy through the Cisco® Integrated
Management Controller (IMC) console or Cisco IMC Supervisor. From the Cisco IMC web console, choose
Compute > Power Policies > Configured Fan Policy.
● For Cisco UCS managed C-Series M6 servers, this policy is configurable using power control policies
under Servers > Policies > root > Power Control Policies > Create Fan Power Control Policy > Fan Speed
Policy.
● For Cisco Intersight™ managed C-Series M6 servers, the fan control policy is defined in Intersight via the
Thermal policy using the Fan Control Mode object.
● For UCS B-series and X-series servers, the fan speeds are dynamically adjusted based on the resource
usage.
BIOS token | Platform default | Possible values
Processor configuration
Package C-state limit | C0/C1 State | No Limit, Auto, C0/C1 State, C2, C6 Retention, and C6 Nonretention
Energy and performance BIOS configuration | Balanced Performance | Performance, Balanced Performance, Balanced Power, and Power
UPI link speed* | Auto | Auto, 9.6 GT/s, 10.4 GT/s, and 11.2 GT/s
Memory configuration
Memory RAS configuration | ADDDC Sparing | Mirror Mode 1LM, ADDDC Sparing, Partial Mirror Mode 1LM, and Maximum Performance
Note: BIOS tokens marked with an asterisk (*) are available and supported only on M6 servers with 3rd Gen
Intel Xeon Scalable processors.
The following sections provide BIOS recommendations for these workload types:
● CPU-intensive workloads
● I/O-intensive workloads
● Energy-efficient workloads
● Low-latency workloads
CPU-intensive workloads
For CPU-intensive workloads, the goal is to distribute the work for a single job across multiple CPUs to reduce
the processing time as much as possible. To do this, you need to run portions of the job in parallel. Each
process, or thread, handles a portion of the work and performs the computations concurrently. The CPUs
typically need to exchange information rapidly, requiring specialized communication hardware.
CPU-intensive workloads generally benefit from processors that achieve the maximum turbo frequency for any
individual core at any time. Processor power management settings can be applied to help ensure that any
component frequency increase can be readily achieved.
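The split-and-compute-concurrently pattern described above can be sketched in a few lines. Threads are shown for simplicity; real CPU-intensive HPC codes use processes or MPI ranks to occupy many cores.

```python
# Minimal sketch of the idea above (threads shown for simplicity; real HPC
# codes use processes or MPI ranks): split one job into chunks and compute
# the partial results concurrently, then combine them.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """One worker's share of the job: sum a half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n: int, workers: int = 4) -> int:
    """Distribute summing 0..n-1 across workers, one chunk per worker."""
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

Each worker handles one chunk of the range, and the partial results are combined at the end, mirroring how such jobs are distributed across cores or nodes.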
I/O-intensive workloads
I/O-intensive optimizations are configurations that depend on maximum throughput between I/O and memory.
Processor utilization–based power management features that affect performance on the links between I/O and
memory are disabled.
Low-latency workloads
Workloads that require low latency, such as financial trading and real-time processing, require servers to
provide a consistent system response. Low-latency workloads are for customers who demand the least amount
of computational latency for their workloads. Maximum speed and throughput are often sacrificed to lower
overall computational latency. Processor power management and other management features that might
introduce computational latency are disabled.
To achieve low latency, you need to understand the hardware configuration of the system under test. Important
factors affecting response times include the number of cores, the processing threads per core, the number of
NUMA nodes, the CPU and memory arrangements in the NUMA topology, and the cache topology in a NUMA
node. BIOS options are generally independent of the OS, and a properly tuned low-latency operating system is
also required to achieve deterministic performance.
Table 5. BIOS recommendations for CPU-intensive, I/O-intensive, energy-efficient, and low-latency workloads
BIOS tokens | BIOS values (platform defaults) | CPU intensive | I/O intensive | Energy efficient | Low latency
Processor configuration
Intel Hyper-Threading Technology | Enabled | Platform default | Platform default | Platform default | Platform default
Intel Virtualization Technology | Enabled | Platform default | Platform default | Platform default | Disabled
CPU performance | Custom | Platform default | Platform default | Platform default | Enterprise
DCU IP prefetcher | Enabled | Platform default | Platform default | Disabled | Platform default
Intel VT for Directed I/O | Enabled | Platform default | Platform default | Platform default | Disabled
Enhanced CPU performance* | Disabled | Auto | Platform default | Platform default | Platform default
Intel Turbo Boost Technology | Enabled | Platform default | Platform default | Platform default | Disabled
Energy Efficient turbo | Disabled | Platform default | Platform default | Enabled | Enabled
Package C-state limit | C0/C1 State | Platform default | Platform default | C6 Nonretention | Platform default
UPI prefetch | Auto | Enabled | Platform default | Platform default | Platform default
XPT prefetch | Auto | Enabled | Platform default | Platform default | Platform default
UPI link enablement | Auto | 1 | Platform default | Platform default | Platform default
UPI power management | Disabled | Enabled | Platform default | Platform default | Platform default
UPI link speed | Auto | Platform default | Platform default | Platform default | Platform default
Sub-NUMA clustering | Disabled | Enabled | Platform default | Platform default | Platform default
Uncore frequency scaling | Enabled | Disabled | Platform default | Platform default | Platform default
LLC dead line | Enabled | Disabled | Platform default | Platform default | Platform default
Memory configuration
NUMA optimized | Enabled | Platform default | Platform default | Platform default | Platform default
IMC interleaving | Auto | 1-way Interleave | Platform default | Platform default | Platform default
Memory RAS configuration | ADDDC Sparing | Maximum Performance | Platform default | Platform default | Platform default
Memory refresh rate | 2x Refresh | 1x Refresh | Platform default | Platform default | 1x Refresh
● Enhanced CPU Performance (marked * in Table 5) is currently available only on C-Series servers;
support is expected to be extended to the B-Series and X-Series platforms in a later firmware release.
● Default BIOS options are generally selected to produce the best overall performance for typical
workloads. However, typical workloads differ from end-user to end-user. Therefore, the default
settings may not be the best choices for your specific workloads.
Relational database (OLTP) workloads
These database systems are often decentralized to avoid single points of failure. Spreading the work over
multiple servers can also support greater transaction processing volume and reduce response time. In a
virtualized environment, when the OLTP application uses a direct I/O path, make sure that the Intel VT for
Directed I/O option is enabled. By default, this option is enabled.
Virtualization workloads
Intel Virtualization Technology provides manageability, security, and flexibility in IT environments that use
software-based virtualization solutions. With this technology, a single server can be partitioned and can be
projected as several independent servers, allowing the server to run different applications on the operating
system simultaneously. It is important to enable Intel Virtualization Technology in the BIOS to support
virtualization workloads.
CPUs that support hardware virtualization allow the processor to run multiple operating systems in
virtual machines. This feature involves some overhead because a virtual operating system performs
somewhat slower than the native OS.
Big data analytics is the use of advanced analytics techniques on very large, diverse big data sets that include
structured, semistructured, and unstructured data, from any source. These data sets can be defined as ones
whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with
low latency. In addition, new capabilities include real-time streaming analytics and impromptu, iterative
analytics on enormous data sets.
An analytics database is specifically designed to support Business Intelligence (BI) and analytics applications,
typically as part of a data warehouse or data mart. This feature differentiates it from an operational,
transactional, or OLTP database, which is used to process transactions such as order entry and other
business applications.
The SAP HANA platform is a flexible, data source–independent, in-memory data platform that allows you to
analyze large volumes of data in real time. Using the database services of the SAP HANA platform, you can
store and access data in memory and in columnar form. SAP HANA allows OLTP and online analytical processing
(OLAP) on one system, without the need for redundant data storage or aggregates. Using the application
services of the SAP HANA platform, you can develop applications, run your custom applications built on SAP
HANA, and manage your application lifecycles.
For more information about SAP HANA, see the SAP help portal: http://help.sap.com/hana/.
HPC workloads
Computing clusters include a head node that provides a single point for administering, deploying, monitoring,
and managing the cluster. Clusters also have an internal workload management component, known as the
scheduler, that manages all incoming work items (referred to as jobs). Typically, HPC workloads require large
numbers of nodes with nonblocking MPI networks so that they can scale. Scalability of nodes is the single most
important factor in determining the achieved usable performance of a cluster.
HPC requires a high-bandwidth I/O network. When you enable Direct Cache Access (DCA) support, network
packets go directly into the Layer 3 processor cache instead of the main memory. This approach reduces the number of HPC I/O cycles
generated by HPC workloads when certain Ethernet adapters are used, which in turn increases system
performance.
Table 6. BIOS recommendations for relational database, virtualization, big data analytics, analytics database, and HPC workloads
BIOS tokens | BIOS values (platform defaults) | Relational database | Virtualization | Big data analytics | Analytics database | HPC
Processor configuration
Adjacent cache line prefetcher | Enabled | Platform default | Platform default | Platform default | Platform default | Platform default
DCU streamer prefetch | Enabled | Platform default | Platform default | Platform default | Platform default | Platform default
Intel VT for Directed I/O | Enabled | Platform default | Platform default | Disabled | Disabled | Disabled
Intel Turbo Boost Technology | Enabled | Platform default | Platform default | Platform default | Platform default | Platform default
Energy Efficient turbo | Disabled | Platform default | Enabled | Enabled | Enabled | Platform default
Package C-state limit | C0/C1 State | Platform default | Platform default | Platform default | Platform default | Platform default
UPI link enablement | Auto | Platform default | Platform default | Platform default | Platform default | Platform default
UPI link speed | Auto | Platform default | Platform default | Platform default | Platform default | Platform default
Memory configuration
Note:
● Enhanced CPU Performance (marked * in Table 6) is currently available only on C-Series servers;
support is expected to be extended to the B-Series and X-Series platforms in a later firmware release.
Use the following Linux tools to measure maximum turbo frequency and power states:
● cpupower monitor: The cpupower monitor command reports the processor topology, frequency, and idle
power-state statistics. The command is forked, and statistics are printed upon the command's
completion, or statistics are printed periodically. The command implements independent processor
sleep-state and frequency counters; some are retrieved from kernel statistics, and some are read
directly from hardware registers. Use this command to list the available monitors:
cpupower monitor -l
Refer to the following resources for more information about OS and hypervisor performance-tuning
recommendations:
● Microsoft Windows and Hyper-V server platform tuning is straightforward: set the power policy to High
Performance. See: https://docs.microsoft.com/en-us/windows-server/administration/performance-tuning/additional-resources
● VMware ESXi tuning is similar: set the power policy to High Performance. See:
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/vsphere-esxi-vcenter-server-70-performance-best-practices.pdf
● For Citrix XenServer, run xenpm set-scaling-governor performance. See:
https://support.citrix.com/article/CTX200390
● For Red Hat Enterprise Linux, set the cpupower governor to performance. See:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/index
● For SUSE Linux Enterprise Server, set the cpupower governor to performance. See:
https://documentation.suse.com/sles/15-SP2/pdf/book-sle-tuning_color_en.pdf