Section 04 Server Maintenance Course Guide
SERVER CONCEPTS: SECTION 04 SERVER MAINTENANCE COURSE GUIDE
Table of Contents
Resources
Supporting Resources: Server Maintenance
Certification Journey Map
Example: Generic two-rack datacenter system with server maintenance in place.
Image labels: Management Servers, Compute Servers, Database Servers, Data
Protection Appliance; Monitor and Maintain Server Health.
• Hot swap (identified with an orange tab) and cold swap (identified with a blue
tab) server components that fail.
• Swap out or isolate servers that fail.
• Maintain clean and cool server environments.
• Perform server firmware and driver updates.
Regular server maintenance counters problems such as dust accumulation and
confirms that monitoring software is functioning well, which protects long-term
server health and functionality. Server maintenance also helps to save money on
repairs by preventing damage and the complete replacement of a server system.
Server Components
Image labels: Expansion Risers, Expansion Cards, Processors, Memory.
All servers (tower, rack, and modular) use the same basic component
configuration as a desktop system. However, a server serves a different purpose
than a desktop: it is designed to support multiple users, which is why a server
has more CPU and memory capacity than a desktop. A server is able to run 24x7
and can be managed remotely through iDRAC.
The server components include a system board, enhanced central processing unit
(CPU), expansion cards, enhanced memory, system fans, and hard drives.
The components that are displayed in the image are covered in detail in this topic.
Server Memory
Servers prepare programs that are used for any system unit on a network, while
client systems are responsible for their own operations. The memory for each is
different. Servers use larger memory capacity and bandwidth to cope with multiple
CPU processing loads and performing operations to simultaneously support client
workstations. Memory risers are present as a dynamic architecture to support a
longer equipment provision life cycle. The dynamic architecture provides staged
pathways for upgrading, and ultimately simplifies server maintenance.
Image labels: PowerEdge Server workload with Client Data In and Data Out,
12x NVMe Drives, and DDR4 NVDIMMs (not to scale).
Server systems run on an Error-Correcting Code (ECC) memory type while client
systems run on non-ECC memory. The ECC memory system tests and corrects
any errors in memory without interrupting the other server operations. ECC also
makes corrections without the processor or user being aware.
DDR4 Server Memory (image labels: Data In, Data M bits, Code K bits, Compare,
Error Signal).
Note: 1 byte = 8 bits. ECC uses a Hamming-code-style function that
allows the correction of a single-bit error in each word. The idea is
that every 64-bit value is stored together with an 8-bit check code
that is computed from it. ECC can detect 2-bit errors but cannot fix
them. ECC DIMMs implement this by adding 8 bits per word, and the
extra chips to hold them, to the memory module/DIMM.
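The note above can be illustrated with a small sketch. It assumes a textbook extended Hamming code (single-error correction, double-error detection); real memory controllers may use a different code, and the function names encode, decode, and data_bits_of are illustrative only. For 64 data bits the sketch stores 72 bits, which matches the 8 extra bits described in the note.

```python
import random


def encode(data_bits):
    """Return an extended Hamming codeword; index 0 holds the overall parity bit."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # number of Hamming check bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)             # positions 1..n, plus position 0
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):               # fill each check bit at position 2**i
        mask = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & mask:
                parity ^= code[pos]
        code[mask] = parity
    code[0] = sum(code[1:]) % 2      # overall parity enables 2-bit detection
    return code


def decode(code):
    """Return (status, codeword); status is 'ok', 'corrected', or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok", code
    if overall == 1 and syndrome <= n:
        fixed = code[:]
        fixed[syndrome] ^= 1         # syndrome names the flipped position
        return "corrected", fixed
    return "uncorrectable", code


def data_bits_of(code):
    """Extract the data bits from a codeword."""
    return [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]


random.seed(0)
word = [random.randint(0, 1) for _ in range(64)]
stored = encode(word)
print(len(stored), "bits stored for 64 data bits")

single = stored[:]
single[10] ^= 1                      # one flipped bit
print(decode(single)[0])

double = single[:]
double[20] ^= 1                      # a second flipped bit
print(decode(double)[0])
```

A single flipped bit is located by the syndrome and repaired, while two flipped bits leave the overall parity even with a nonzero syndrome, so the error is reported but not fixed.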
Memory Technology
Random access memory (RAM) stores data that the CPU can quickly access, read,
and write. Double Data Rate (DDR) is a form of Dynamic RAM (DRAM) that is widely
used in server memory technology.
DDR is known for its low-power requirements and high-speed data transfer rate.
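As a rough worked example of the data transfer rate, the theoretical peak bandwidth of one memory channel is the transfer rate multiplied by the bus width. The sketch below assumes a standard 64-bit data bus per channel (ECC bits excluded) and uses DDR4-2666 as an example speed grade; the function name is illustrative, and sustained bandwidth in practice is lower than this peak.

```python
def peak_bandwidth_gb_s(mega_transfers_per_second, bus_width_bits=64):
    """Theoretical peak bandwidth of one memory channel in GB/s (10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_second * 1e6 * bytes_per_transfer / 1e9


ddr4_2666 = peak_bandwidth_gb_s(2666)
print(f"DDR4-2666, one 64-bit channel: {ddr4_2666:.1f} GB/s")
```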
Memory Voltage
Memory Types
Memory Comparison
The table highlights the differences in memory features across the three
generations of Dell PowerEdge servers.
RAM Size 1 X 4 GB 1 X 8 GB 1 X 8 GB
System Board
Image labels (PowerEdge R740): Expansion Slot, Storage Controller, Processor
Sockets, SSD/HDD.
A system board is the main circuit board of a server system that connects and
governs the interactions between system components. It is similar to a
"motherboard" in a desktop or laptop unit, with different features and functionalities.
The major components on the server system board include central processing unit
(CPU) sockets, memory socket, storage controller, a supporting circuitry known as
the chipset, a hard disk, and an expansion slot for connecting other hardware.
Hard Disk: The hard disks are connected to the server backplane that is
connected to the storage controller. The storage controller is connected to or is
embedded in the system board.
PowerEdge R740
Bandwidth, clock speed, and the number of processor cores all contribute to
processor performance. A server system processor performs for longer periods at
100% sustained loads.
Expansion Card
Image: expansion cards in a PowerEdge R740.
An expansion card is a printed circuit board (PCB) that enhances the functionality
of a server. Depending on the server generation, different types of PCI technology
are supported for expansion cards.
Expansion cards are installed in the expansion slots or expansion riser slots of a
server. A connector is used to create an electronic interface between the server
system board and the expansion card. Some examples of expansion cards are
video graphics cards, network cards, and storage controller cards.
• Redundant Array of Independent Disks (RAID) and Host Bus Adapters (HBA)
Modules
• General Purpose Computing on Graphics Processing Units (GPGPU)
• Network Interface Card (NIC) and Converged Network Adapters (CNA)
• Host Channel Adapters (HCA)
A host bus adapter (HBA and eHBA) is an expansion card that plugs inside a slot
on a server system board (such as PCIe). The HBA connects the host to the
storage or network devices and delivers fast, reliable non-RAID Input/Output (I/O).
Image labels: Heatsink, Battery, PCIe Connector.
GPGPU
Image labels: Cooling Fan, PCIe Connector.
Fast context switching hides memory latency. When a memory fetch is issued while
processing one subset of data elements, that subset is set aside. Another subset
that is not waiting on a memory reference replaces the subset that is waiting.
GPUs use the chip area for hundreds of individual processing elements that
simultaneously run a single instruction stream on multiple data elements.
In large enterprise companies, main servers have (at least) two adapters – Fibre
Channel Host Bus Adapter (FC HBA) and Ethernet Network Interface
Card (Ethernet NIC). The adapters connect to the storage network (Fibre Channel)
and system network (Ethernet). Converged Network Adapters (CNA) combine the
functionality of both adapters into one.
The diagram shows both the traditional setup with FC HBA and NIC as well as the
CNA and Fibre Channel over Ethernet (FCoE) setup. In the first diagram, the
server requires two separate adapters to connect to the Ethernet-based system
network and the FC-based storage network.
The setup in the second diagram requires one adapter (CNA), which carries both
Ethernet traffic and FCoE traffic on a single cable. This cable connects to one of
the Ethernet ports in the converged switch that has both Ethernet and Fibre
Channel ports. This converged switch converts the FCoE traffic into Fibre Channel
traffic to be sent to the FC SAN over the Fibre Channel network. Computer network
traffic is directly sent to the LAN over the Ethernet network.
Traditional setup with FC HBA/NIC and a CNA/Fibre Channel over Ethernet (FCoE) setup.
Image labels (traditional setup): Server with NIC and FC HBA, Ethernet Switch
Ports, Fibre Channel Switch Ports.
Image labels (new setup): Server with CNA, Ethernet/FCoE Switch Ports, Fibre
Channel Switch Ports.
Dell 79DJ3 Mellanox ConnectX-3 56Gbps Single Port QSFP Host Channel Adapter
1: The PowerEdge RAID Controller (PERC) 10 series consists of the H345, H740P,
H745, H745P MX, and H840 cards.
The PERC 10 family of storage controller cards has the following characteristics:
• The auto Configure RAID 0 feature creates a single drive RAID 0 on each hard
drive that is in the ready state.
• A non-RAID disk is a single disk to the host, and not a RAID volume. The only
supported cache policy for non-RAID disks is Write-Through.
• Physical disk power management is a power-saving feature of PERC 10 series
cards. The feature allows disks to be spun down based on disk configuration
and I/O activity.
• FastPath is a feature that improves application performance by delivering high
I/O operations per second (IOPS) for solid state drives (SSD). The Dell PERC 10
series supports FastPath.
Tip: For more information about PERC 10.6 follow this link:
https://www.dell.com/support/manuals/en-us/poweredge-rc-h840/perc10_ug_pub/overview?guid=guid-ecf11753-0ae0-4122-b875-d909905059ae
PERC 11.1
The PERC 11 controller introduces many new features that boost performance,
such as support for the PCIe Gen4 host interface and upgraded DDR4 8 GB
2666 MT/s cache memory. However, the greatest addition to this generation of
technology is NVMe hardware RAID support, which is available on the H755N
front, H755 MX, and H755 adapter form factors.
1: The PERC 11 series consists of many different adapters: the PERC H755
adapter, PERC H755 front SAS, PERC H755N front NVMe, PERC H750 adapter SAS,
PERC H755 MX adapter, PERC H355 adapter SAS, PERC H355 front SAS, and
PERC H350 adapter SAS cards.
• A non-RAID disk is a single disk that is connected to the host that is not part of
a RAID volume. The only supported cache policy for non-RAID disks is Write-
Through.
• Opal Security Management of Opal SED drives requires security key
management support. The security key, which is set in the Opal drives and used
as an authentication key to lock and unlock them, can be generated by IT
administrators using the application software or the Integrated Dell Remote
Access Controller (iDRAC).
• Hardware RoT (Root-of-Trust) builds a chain of trust by authenticating all the
firmware components before they execute. Hardware RoT also permits the
authenticated firmware to run and be flashed.
• Disk roaming occurs when a hard drive is moved from one cable connection or
backplane slot to another on the same controller.
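The chain-of-trust idea in the list above can be sketched in a few lines. This is a simplified illustration, not how the PERC hardware implements it: a real hardware Root-of-Trust relies on cryptographic signatures anchored in immutable hardware, whereas this sketch only compares SHA-256 digests against a trusted list. The stage names and image contents are hypothetical.

```python
import hashlib


def verify_chain(stages, trusted_digests):
    """Return the names of stages allowed to run, stopping at the first mismatch."""
    allowed = []
    for name, image in stages:
        digest = hashlib.sha256(image).hexdigest()
        if trusted_digests.get(name) != digest:
            break                    # an unauthenticated stage halts the chain
        allowed.append(name)
    return allowed


# Hypothetical firmware stages and their trusted digests.
images = [
    ("bootloader", b"stage-1 image"),
    ("controller-firmware", b"stage-2 image"),
]
trusted = {name: hashlib.sha256(image).hexdigest() for name, image in images}

print(verify_chain(images, trusted))

tampered = [images[0], ("controller-firmware", b"modified image")]
print(verify_chain(tampered, trusted))
```

Each stage is checked before it is allowed to run, so a modified later stage is rejected even though the earlier stage was authentic.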
Tip: For more information about PERC 11.1 follow this link:
https://www.dell.com/support/manuals/en-us/perc-h755/perc11_ug/dell-technologies-poweredge-raid-controller-11?guid=guid-d64f78f6-d10c-4228-ae3f-f8e455ec9d04
The Open Compute Project (OCP) cards are network cards that connect to the PCI
bus. They are physically smaller than the Industry Standard Architecture (ISA)
expansion cards and often connect to a dedicated connector on the system board.
The OCP card was introduced with the PowerEdge 15G servers.
Important: The OCP and the NDC cards are not hot-swappable
components.
Server Management
iDRAC
iDRAC UI Dashboard.
The Integrated Dell Remote Access Controller (iDRAC) improves the overall
availability of Dell servers. The iDRAC enables users to deploy, update, monitor,
and maintain servers from any location.
SupportAssist Enterprise
Dell SupportAssist Enterprise at work monitoring and reacting to a PowerEdge MX7000 Modular
System hardware issue (image labels: SupportAssist Enterprise, email notification).
3 When issues arise, alerts are issued, possibly before the user is aware something
is wrong. A support case is opened automatically.
4 Proactive monitoring happens 24 x 7 x 365. Dell technical support contacts the
and Dell Technologies before they occur. Support cases are created on behalf of
the customer when issues are predicted.
OpenManage Enterprise
Image labels: Storage Device, PowerEdge MX7000 Modular Platform, Network Device.
- OME Power Manager
- OME Update Manager
- OME Integration with ServiceNow, VMware, and Microsoft
- OpenManage Micro Focus Operations Bridge Manager (Connect)
• Monitor health status and events for Dell PowerEdge racks, towers, modular
servers, or PowerVault MD and ME storage systems, or third-party
infrastructure.
• Provide hardware-level control and management for the PowerEdge server,
blade system, and internal storage arrays.
• Link and launch element management interfaces, such as iDRAC, Chassis
Management Controller (CMC), OME-Modular (OME-M), SC, and EQL group
manager.
Both in-band and out-of-band methods require a network protocol that is configured
on the managed device.
In-Band Management
In-Band example (image labels: Management Station, Server, Managed Devices).
• Messaging
• Inventory
Out-Of-Band Management
Out-Of-Band example (image labels: Management Station, iDRAC, Server, Managed
Devices).
iDRAC Service Module (iSM) is a lightweight, optional software agent that users
can install on PowerEdge servers. iSM is an OS-resident process that expands
iDRAC management into supported host operating systems.
iSM Pre-Installation
Install the iSM using the iDRAC option, or by downloading the file from the support
site and installing it in the server operating system. Before the iSM is installed,
the iDRAC reports an error in the iSM setup section.
Installation Verification
After the iSM is installed, the iDRAC reports that iSM is installed and running.
The iDRAC Virtual Console enables users to access the local console remotely in
either graphic or text mode. Using virtual console, you can control an iDRAC-
enabled server. Use the keyboard, video, and mouse on the local management
station to control the corresponding devices on a remotely managed system. Users
can run up to six simultaneous virtual console sessions.
Users can use the virtual console with virtual media to perform remote software
installations.
Power Distribution
Functionality of a PSU
The input power supply⁷ takes AC power from the power socket and converts it to
DC power, then the PSU distributes the power throughout the server. There are
also DC PSUs that do not require conversion.
Redundant PSUs
PowerEdge R740 - Rear View (image labels: Power, Hot Swap, Fan, Pull Handle).
6 Input power supplies are rated by the wattage of power they produce for the system.
7 Some input power supplies can accept DC input as well.
Important: There are also non-hot swap capable PSUs sold in a few
of the Dell PowerEdge servers.
The Power Configuration panel to configure redundant power supplies for servers
is found in the iDRAC9 (and newer versions) interface. The type of power supply
configuration or the redundancy mode depends on the server chassis and the
number of PSUs. When the primary PSU fails, a redundant power supply provides
the necessary power supply to minimize the risk of a complete server shutdown.
Grid Redundancy
In grid redundancy mode, the hot spare⁸ feature is disabled. The power factor
correction (PFC) is disabled by default, to reduce power consumption when the
system is on standby. However, if a single PSU fails, the power drops down. The
grid redundancy configuration is also known as 1 + 1 configuration.
8 When the hot spare feature is enabled, one of the redundant PSUs is switched to
the sleep state. The active PSU supports 100 percent of the system load, thus
operating at higher efficiency. The PSU in the sleep state monitors the output
voltage of the active PSU. If the output voltage of the active PSU drops, the PSU in
the sleep state returns to an active output state.
9 These failures may originate in the input power grid, the cabling, or a PSU itself.
iDRAC Power Configuration page in the Configuration section of the user interface.
Non-Redundant
iDRAC Power Configuration page in the Configuration section of the user interface.
When a system is configured for Grid Redundancy the PSUs are divided into grids:
PSUs in slots 1, 2, and 3 are in the first grid while PSUs in slots 4, 5, and 6 are in
the second grid. The system management manages power so that if there is a
failure of either grid, the system continues to operate without any degradation. Grid
Redundancy also tolerates failures of individual PSUs.
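The grid split described above can be expressed as a simple capacity check. This is a minimal sketch under stated assumptions: the slot-to-grid mapping comes from the text, while the PSU wattages, the load figure, and the function name are hypothetical, and real system management performs a more detailed power budget.

```python
GRID_1_SLOTS = {1, 2, 3}
GRID_2_SLOTS = {4, 5, 6}


def survives_grid_loss(psu_watts_by_slot, load_watts):
    """True if either grid alone can carry the full system load."""
    grid_1 = sum(w for slot, w in psu_watts_by_slot.items() if slot in GRID_1_SLOTS)
    grid_2 = sum(w for slot, w in psu_watts_by_slot.items() if slot in GRID_2_SLOTS)
    # The weaker grid decides whether losing the other one is tolerable.
    return min(grid_1, grid_2) >= load_watts


# Hypothetical chassis: six 3000 W PSUs, one per slot.
psus = {slot: 3000 for slot in range(1, 7)}
print(survives_grid_loss(psus, load_watts=8000))
print(survives_grid_loss(psus, load_watts=10000))
```

With all six slots populated, each grid supplies 9000 W in this example, so an 8000 W load survives the loss of a grid while a 10000 W load does not.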
Server Facility
Image labels: AC Power Supply, UPS with Battery Power Supply, Servers and
Switches (On/Off).
An unexpected power failure causes issues such as data loss or internal hardware
problems. Recovering the data takes time, energy, and money, and the data may
still be impossible to recover.
Dell offers UPS devices for small to medium size organizations, such as the
SmartUPS 1500 SMARTCONNECT 120V RM device.
The iDRAC with Lifecycle Controller is embedded on several server models but not all.
The Integrated Dell Remote Access Controller (iDRAC) improves the overall
availability of PowerEdge servers. The iDRAC enables users to deploy, update,
monitor, and maintain servers from any location.
The iDRAC9 uses a Nuvoton dual-core ARM A-9 processor @ 800 MHz, with a
512 KB L2 cache, and 8 GB NAND memory.
The iDRAC with Lifecycle Controller is embedded within the Dell PowerEdge
servers.
The iDRAC Service Module (iSM) monitors information from the operating system.
2: The Storage tab provides details on the storage components, including
information about controllers, hard drives, virtual disks, and enclosures.
4: The Maintenance tab includes the Lifecycle log, job queue, system update,
system event log, troubleshooting, diagnostics, and SupportAssist.
5: The iDRAC Settings tab includes information about the iDRAC itself,
Connectivity, Services, and Users.
• The System tab provides system information, iDRAC details, and an at-a-glance
status of the systems. More details about the system are accessed from the
tabs inside this section.
• The Storage tab provides details on the storage components. Summary
information and information about controllers, hard drives, virtual disks, and
enclosures are accessed from here.
• The Configuration tab is where settings for items such as power management,
virtual console, licenses, systems, storage configuration, BIOS, and server
configuration profile may be configured.
• The Maintenance tab includes the Lifecycle log, job queue, system update,
system event log, troubleshooting, diagnostics, and SupportAssist.
• The iDRAC tab displays the details of the iDRAC settings. Configuration of the
network settings, IPv4 settings, and the iDRAC service module options for
connectivity, services, and users are also available.
Performing a Shutdown
Properly shutting down the server consists of allowing all current operations to
complete, disconnecting current connections, stopping services, and powering off.
An immediate server shutdown can cause the loss of unsaved data.
The iDRAC9 dashboard provides the option to do a graceful shutdown.
Perform a cold boot of the system using the Graceful Shutdown drop-down menu.
• Use a Linux graphic user interface (GUI) like Gnome Desktop or Ubuntu Mate
to select the power off from the menu options.
• Use the $ sudo shutdown, $ sudo reboot, or the $ sudo
poweroff command to power down the Linux system.
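The shutdown command from the list above can also be prepared from a script. The sketch below only builds and prints the command line and does not execute anything; the helper name build_shutdown_command and the example message are illustrative, and it assumes the common Linux form of shutdown that accepts -h or -r, a time such as now or +5, and an optional broadcast message.

```python
import shlex


def build_shutdown_command(delay_minutes=0, reboot=False, message=None):
    """Build a 'sudo shutdown' argument list without running it."""
    action = "-r" if reboot else "-h"            # reboot, or halt and power off
    when = "now" if delay_minutes == 0 else f"+{delay_minutes}"
    command = ["sudo", "shutdown", action, when]
    if message:
        command.append(message)                  # broadcast to logged-in users
    return command


command = build_shutdown_command(delay_minutes=5, message="Planned maintenance")
print(shlex.join(command))
```

Giving users a delay and a message lets current operations complete before the server powers off. To actually run the command, an administrator could pass the list to subprocess.run on the target system.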
Administrators can shut down ESXi through CLI commands, from the Direct
Console User Interface (DCUI), or from the vSphere client and web client.
• Learn more about the shutdown methods from the VMware Customer Connect
online support knowledge base article.
Server Cooling
Most server design revolves around cooling the main components: power supplies,
processors (CPU and GPU), and memory. For this reason, the chassis has
memory shrouds, processor heatsinks, and power supply fans. GPUs provide their
own cooling fans. The airflow of the server chassis is like that of a client system.
The only difference is that the server chassis emits a greater heat load¹¹.
Image labels: Air Shroud, Cooling Fans, Heatsinks.
If servers are unprotected from heat, then the servers slow down or work differently
than expected. The ideal temperature for the data center depends on the quantity
of servers and amount of heat emitted. Operating within the ideal temperature
range is critical for performance.
Hot aisle containment (HAC) guides the hot air (red arrows) into a Computer Room
Air Handler (CRAH), which cools the flow and recirculates it as cool air (blue
arrows).
12 Many data centers start out as a few racks in a server room, adding more
equipment over time. Overall, data center HVAC management can become difficult.
One method of improving air flow is to use the Hot aisle/Cold aisle layout. Cold air
is routed to the intake in the rack front. Hot air exhaust exits the rack rear and is
routed to cooling equipment. A computer room administrator can choose either a
Computer Room Air Conditioner (CRAC) or Computer Room Air Handler (CRAH) to
route the air.
1: Hot air exits the rack and is sent to the CRAC or CRAH.
2: Cold air is pumped from the CRAC or CRAH to the rack intake.
3: Hot air exits the rack and is sent to the CRAC or CRAH.
Maintenance Tasks
1: Create a disaster recovery (DR) plan. Do both sites host workloads? How often
does the data replicate? Create a planned site-to-site failover.
2: Examine the power design. Can the server survive a spontaneous failure of
power supplies, Uninterruptible Power Supply (UPS), or building circuits?
3: Verify access of all tools. If the server is not onsite, check availability. Ensure the
iDRAC and operating system tools are available.
4: Review logs for any issues, including but not limited to the iDRAC, Lifecycle
Controller, System Event Log (SEL), PERC TTY, and operating system event logs.
Enable and configure alerting, including operating system event log forwarding,
SMTP, syslog, and other native utilities such as the iDRAC.
5: Verify the backup plan. Does the workload require a full, incremental, or
differential backup? How often are backups completed? Can they successfully be
restored? What are the Recovery Point Objective (RPO) and Recovery Time
Objective (RTO)?
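The RPO question in task 5 can be checked against a backup history. This is a minimal sketch: it treats the RPO as the longest tolerable gap between restorable backups, and the timestamps, RPO values, and function names are hypothetical examples.

```python
from datetime import datetime, timedelta


def worst_backup_gap(backup_times, now):
    """Longest interval without a backup, including the time since the last one."""
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times, now, rpo):
    """True if no gap between backups exceeds the Recovery Point Objective."""
    return worst_backup_gap(backup_times, now) <= rpo


# Hypothetical backup completion times.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 2, 0, 0),
    datetime(2024, 1, 3, 6, 0),
]
now = datetime(2024, 1, 3, 12, 0)

print(worst_backup_gap(backups, now))
print(meets_rpo(backups, now, rpo=timedelta(hours=24)))
print(meets_rpo(backups, now, rpo=timedelta(hours=36)))
```

The check only covers how often backups complete. Whether a backup can be restored within the RTO still has to be verified with an actual restore test.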
Dell PowerEdge Modular systems are all-in-one chassis platforms that provide
compute servers, network I/O modules, and storage devices. A modular system
has unique environmental requirements to function as the all-encompassing
platform. IT administrators must be aware of the modular system's unique
specifications for rack mounting, plus its cooling and heating needs.
Modular systems are also maintained differently than rack servers because of the
varied components. Some modular systems like the PowerEdge MX7000 function
as a multi-chassis platform cabled together to provide a comprehensive solution.
PowerEdge FX Series
The FX2 includes a network switch, eight PCI Express (PCIe) expansion slots for a
cost-optimized, entry-level option.
FX2 FX2s
PowerEdge VRTX
The VRTX offers dual SD cards for redundant hypervisors, hot-plug and swappable
HDDs, and many RAID options tailored to specific needs, including an optional
PERC for RAID controller failover inside the chassis. Optional hot-plug and
swappable power supply units and fans are also available.
These improvements allow users to avoid the additional time and cost of training
that is related to new management solutions.
• No compromise on performance
• Versatile shared storage
• Integrated networking and flexible I/O
• Seamless management integration
PowerEdge M1000e
The M1000e uses ultraefficient power supplies with large variable-speed fans to
cool the entire chassis while using less power.
PowerEdge MX7000
The PowerEdge MX7000 chassis hosts disaggregated blocks of server and storage
to create on-demand resources. Shared power, cooling, networking, I/O, and in-
chassis management provide outstanding efficiencies.
PCIe Slots: 8 PCIe low- 3 full height not applicable not applicable
profile slots 5 low-profile
I/O module 3 full width 1 full height 6 full height 4 full width
(IOM) 2 half-width
Quantity:
MX9002m modules
The Management Module (MM) essentially controls the overall chassis power,
cooling, and physical user interfaces such as the front panel.
MX7000 supports two MX9002m modules for redundancy. At least one MX9002m
is required to power on the system.
IT administrators investigate and resolve server issues when they are identified to
avoid server downtime or data loss.
PowerEdge server hardware troubleshooting steps help users take logical and
systematic steps towards reviewing, diagnosing, and identifying operational or
technical faults in the server. Users review the PowerEdge server replacement
procedure after isolating a damaged server component to complete the
troubleshooting task.
Image labels (PowerEdge R640): iDRAC Maintenance/Troubleshooting Page and
iDRAC Logs; Cold Swap: removable expansion riser 1B; Hot Swap: removable
cooling fan.
Since 14G, the Liquid Crystal Display (LCD) panel is an optional feature on some
PowerEdge servers. The LCD displays system information, status, and error
messages to indicate whether the system is functioning correctly or requires
attention. The LCD panel is used to configure or view the system iDRAC IP
address.
• The LCD backlight turns off in standby mode. To turn on the LCD backlight,
press any of the LCD front panel buttons.
• If an error is detected while the system is connected to a power source, the
LCD turns amber. The error detection happens regardless of whether the system
is turned on or off.
To review the tasks and steps involved in completing the MX7000 Left Ear LCD
Panel simulation job aid, download the job aid document from the on-demand
resources section. Or click the Configuring the Left Control Display (LCD) Panel
Job Aids link to review the task and steps online.
Some PowerEdge servers are not delivered with a full LCD panel, so they have a
set of light-emitting diode (LED) indicators instead. The front panel page in the
iDRAC web interface helps administrators view the system ID LED status as well
as the LCD panel information. To get started, administrators select System >
Overview > Front Panel.
The Live Front Panel Feed section displays the current front panel status.
Solid blue: Indicates that the system is turned on, the system is healthy,
and system ID mode is not active. Press the system health and system ID button
to switch to system ID mode.
Blinking blue: Indicates that the system ID mode is active. Press the system
health and system ID button to switch to system health mode.
Blinking amber: Indicates that the system is experiencing a fault. Check the
system event log or the LCD panel, if available on the bezel, for specific error
messages.
The iDRAC user interface front panel feature is used to remotely view the LED
status on the server front panel.
Tip: When the system is operating (indicated by the blue health icon
on the LED front panel), both Hide Error and Un-Hide Error are
grayed-out. Only rack and tower servers can hide and unhide errors.
Quick Sync 2
PowerEdge R640
OpenManage Mobile (OMM) and left control panel on a PowerEdge R640 server with the Quick
Sync 2 indicator.
Another optional maintenance feature for 14G and above PowerEdge servers is
Quick Sync 2. Using Quick Sync 2 with OpenManage Mobile (OMM),
administrators can configure, monitor, and troubleshoot 14G and above
PowerEdge servers and the MX7000 Modular system chassis.
The terms hot swap and cold swap indicate the replacement of system components
when the system is running or shut down, respectively.
Hot Swap
Hot swap components are identified with an orange tab. Hot swap is the
replacement of a hard drive, system fans, power supply, or system devices while
the server remains in operation. When hot swappable devices fail, server devices
continue to function independently while the defective device is replaced.
1. Press the orange release tab and lift the cooling fan to disconnect the fan from
the connector on the system board.
Note: Ensure not to tilt or rotate the cooling fan while removing it from the
system.
Cold Swap
Cold swap components are identified with a blue tab. Cold swap is the process of
installing, connecting, or uninstalling a server device while the server is turned off.
1. Disconnect the cooling fan cable that is connected on the system board
connector or the power interposer board (PIB).
2. Holding the blue touch point, lift the cooling fan out of the fan cage.
Note: Ensure not to tilt or rotate the cooling fan while removing it from the system.
Review the chart for iDRAC9 Maintenance section capabilities that are based on
server troubleshooting and maintenance scenarios:
• Access a record of alerts in the iDRAC user interface: Maintenance > System
Event Log
• View scheduled server firmware update jobs: Maintenance > Job Queue
• View the most recent crash screen that displays events leading to the system
crash: Maintenance > Troubleshooting
Event Logs
Server event logs are used to identify the cause of a problem that continues
despite basic troubleshooting of the server system. The Integrated Dell Remote
Access Controller (iDRAC) displays the server event logs. The event logs provide a
short explanation of system events that occurred. The event log descriptions are
beneficial for troubleshooting.
• Lifecycle Controller (LCC) logs: LCC logs provide the history of changes that
are related to components installed on the managed system.
• System Event Logs (SEL): Record when a system event occurs on a managed
system. This SEL¹³ entry is also available in the LCC log. To get started with
SEL through iDRAC9, go to Maintenance > System Event Log.
View and export the Lifecycle Controller log entries from the Maintenance > Lifecycle Log page in
iDRAC.
13 The SEL offers a filtered version of the LC log containing system events.
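Once log entries have been exported, a short script can narrow them down to the events that matter for troubleshooting. The sketch below assumes a hypothetical record layout (dictionaries with timestamp, severity, and message fields) and the severity names Informational, Warning, and Critical; the sample entries are invented for illustration and are not real iDRAC output.

```python
from collections import Counter

SEVERITY_ORDER = {"Informational": 0, "Warning": 1, "Critical": 2}


def entries_at_or_above(entries, minimum="Warning"):
    """Keep only entries whose severity is at least the given minimum."""
    threshold = SEVERITY_ORDER[minimum]
    return [e for e in entries if SEVERITY_ORDER[e["severity"]] >= threshold]


# Hypothetical exported log entries.
sample_log = [
    {"timestamp": "2024-01-03T10:00:00", "severity": "Informational",
     "message": "Firmware update job completed."},
    {"timestamp": "2024-01-03T10:05:00", "severity": "Warning",
     "message": "Fan speed is above the expected range."},
    {"timestamp": "2024-01-03T10:07:00", "severity": "Critical",
     "message": "Power supply lost input power."},
]

for entry in entries_at_or_above(sample_log):
    print(entry["timestamp"], entry["severity"], entry["message"])

print(Counter(entry["severity"] for entry in sample_log))
```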
Simulation Activity: The boot capture option helps a user view the
video recording of the last three boot cycles. A boot cycle video logs
the sequence of events for a boot cycle. The video log is an effective
tool in troubleshooting system errors.
Navigate through the guided walk-through to learn how to play a boot
capture video from iDRAC.
To review the tasks and steps involved in completing the Play a Boot Capture
Video File simulation job aid, download the job aid document from the on-demand
resources section. Or click the Playing a Boot Capture Video Job Aid link to review
the task and steps online.
Boot capture files listed under the Troubleshooting tab in the iDRAC.
To configure the boot capture video settings, select one of the following options
and click Apply.
The boot capture timestamp is the time that the boot capture sequence is
completed. The boot capture completion is either when the boot capture file size
has reached 2 MB or when the host system is rebooted.
The list displays the active boot capture file. While the update is in progress, click
Refresh to view the latest timestamp for the boot capture file. You can play the files
directly from iDRAC or save them to a location on your system.
POST Code, Intrusion, and Last Crash Screen are troubleshooting tools that the
iDRAC provides. Each tool provides a report when a system event occurs.
1: The POST Code option is a view of the last system POST code (in hexadecimal)
before booting the operating system of the managed system. The POST code
helps to detect pre-video errors, report fatal errors, and analyze the system failures
during BIOS POST, particularly a No POST No Video situation. The fatal error
codes are used to report all the fatal POST errors.
2: The Intrusion option is related to the chassis intrusion switch and provides
information about whether the server cover is removed or not seated correctly. A
server cover that is unseated can lead to the system overheating and bring about
potential shutdown issues.
3: The Last Crash Screen option provides information about the events leading to
the system crash. This information is saved in the iDRAC memory and is remotely
accessible. The Last Crash Screen feature is available with iDRAC Express and
Enterprise licenses.
The last crash screen capture requires the user to install OpenManage Server
Administrator (OMSA).
POST Codes are the progress indicators from the system BIOS indicating various
stages of the boot sequence. The POST Code option helps view the last system
POST code (in hexadecimal) before booting the operating system of the managed
system. The POST code helps to detect pre-video errors, report fatal errors, and
analyze the system failures during BIOS POST, particularly the No POST No Video
situations. The fatal error codes are used to report all the fatal POST errors.
Intrusion provides the status of the chassis intrusion probes. The Intrusion option is
related to the chassis intrusion switch and provides information about whether
the server cover is removed or not seated correctly. An improperly seated server
cover can lead to the system overheating and, in turn, potential shutdown issues.
The last crash screen capture is available only with the Windows operating system,
and the user must have installed OpenManage Server Administrator (OMSA). The
last crash screen capture does not work with Linux or ESXi operating systems. If the
Windows operating system fails, the Last Crash Screen feature displays a blue
screen.
• arp - Displays the contents of the Address Resolution Protocol (ARP) table.
ARP entries may not be added or deleted.
• ifconfig - Displays the contents of the network interface table.
• netstat - Displays the contents of the routing table. If the optional interface
number is provided in the text field to the right of the netstat option, netstat
also displays information about the traffic across the interface, buffer usage,
and other network interface details.
• ping <IP Address> - Verifies that the destination IPv4 address is reachable
from iDRAC with the current routing-table contents by sending an Internet
Control Message Protocol (ICMP) echo packet to the destination IP address.
• gettracelog - Displays the iDRAC trace log. It may take a few seconds to
return the trace log. The command gettracelog -i returns the number of
records in the trace log. The -A option returns the trace log without the record
numbers.
• ping6 <IPv6 Address> - Verifies that the destination IPv6 address is
reachable from iDRAC with the current routing-table contents.
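The choice between ping and ping6, and the gettracelog options described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names reachability_command and trace_log_command are hypothetical helpers, not part of iDRAC, and the sketch only builds the command text that would be entered in the diagnostics console. It does not contact an iDRAC.

```python
import ipaddress


def reachability_command(address: str) -> list[str]:
    """Build the diagnostics command for testing whether an address is reachable.

    Hypothetical helper: picks ping for IPv4 and ping6 for IPv6, as in the list above.
    """
    ip = ipaddress.ip_address(address)  # raises ValueError for a malformed address
    tool = "ping" if ip.version == 4 else "ping6"
    return [tool, str(ip)]


def trace_log_command(count_only: bool = False,
                      without_record_numbers: bool = False) -> list[str]:
    """Build a gettracelog command using the -i and -A options described above."""
    command = ["gettracelog"]
    if count_only:
        command.append("-i")  # return only the number of records in the trace log
    if without_record_numbers:
        command.append("-A")  # return the trace log without record numbers
    return command


if __name__ == "__main__":
    print(" ".join(reachability_command("192.168.0.120")))
    print(" ".join(reachability_command("fe80::1")))
    print(" ".join(trace_log_command(count_only=True)))
```

Validating the address before building the command catches typing mistakes early, which helps when a failed ping would otherwise be misread as a network problem.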
iDRAC diagnostics screen showing Serial Data Logs and BIOS Live Scanning.
Serial Data Logs enables you to retrieve the system serial data for operating
system debugging.
NOTE: Serial Data Logs is a licensed feature and is available only with the iDRAC
Datacenter license.
BIOS Live Scanning enables you to scan the system BIOS once POST is completed.
This task can be run once or can be set up on a schedule.
NOTE: BIOS Live Scanning is a licensed feature and is available only with the iDRAC
Datacenter license. This feature is available only on select iDRAC9 x5 systems.
1. Automatic: Use the iDRAC Service Module (iSM) that automatically invokes
the operating system collector tool.
2. Manual: Run the operating system collector tool on the server operating system
to export the operating system and application data.
Troubleshooting iDRAC
Figure: Server rear view showing the iDRAC port (1GB server management) and the
iDRAC web UI.
The iDRAC is responsible for system profile settings and out-of-band management.
System conditions can cause the iDRAC to become unresponsive. When the
iDRAC becomes unresponsive, resetting the iDRAC back to factory defaults may
help to resolve the issue.
The web interface or the iDRAC BIOS enables users to reset the iDRAC to its
default settings.
• Reset iDRAC configuration to default all – Resets to factory settings and resets
the default username and password to root and calvin.
• Reset iDRAC configuration to default all – Resets to factory settings and resets
the default username and password to the shipping value.
iDRAC Reset and Reset iDRAC to Default Settings are listed under the Diagnostics option.
The Reset iDRAC option performs a reset and does not lose any settings.
The Reset iDRAC to Default Settings gives you the following options:
NOTE: You can perform a hard or soft iDRAC restart without turning off the server.
- Hard restart—On the server, press and hold the ID LED button for 15 s.
To review the tasks and steps involved in completing the Reset iDRAC to Default
Settings simulation job aid, download the job aid document from the on-demand
resources section. Or click the Reset iDRAC to Default Settings Job Aid link to
review the task and steps online.
Management and
Benefits
Planning
Configuration management deals with the server specifications and related product
features and specifications. Configuration management maximizes server
performance at all utilization levels and workload types.
• Server configuration
• Server documentation
• Power consumption
• Booting time
• Server redundancy
• Performance
There is no perfect formula to standardize all settings across all servers. Best
practices seek to achieve standardization according to server roles, administration
policies, and procedures.
Patch Management
Choose the best time to schedule the fix. Ensure that updates are
installed outside of working hours to minimize disruption to
business workflows.
Patch management is the process of ensuring that the most recent updates are
applied to all software components. Patch management covers applications and
services such as server operating systems and databases, as well as server tools
like Internet Explorer and Adobe Flash.
The use of patch management tools is crucial to maintain the productivity and
integrity of work.
Patch management also provides an overview of the network health and the
urgency of a needed fix.
Use WSUS servers to distribute patches and updates to clients and servers.
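The guidance to install updates outside of working hours can be sketched as a small scheduling check. This is a minimal illustration, assuming working hours of 09:00 to 17:00, Monday through Friday. Those values and the function names in_working_hours and next_patch_window are hypothetical and are not taken from WSUS or any patch management tool.

```python
from datetime import datetime, time

# Assumed working hours for illustration only; adjust to the site's policy.
WORK_START = time(9, 0)
WORK_END = time(17, 0)
WORK_DAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4


def in_working_hours(moment: datetime) -> bool:
    """Return True if the moment falls inside the assumed working hours."""
    return moment.weekday() in WORK_DAYS and WORK_START <= moment.time() < WORK_END


def next_patch_window(now: datetime) -> datetime:
    """Return the earliest time at or after now that is outside working hours."""
    if not in_working_hours(now):
        return now  # already outside working hours, so patching can start
    # Otherwise wait until the end of the current working day.
    return now.replace(hour=WORK_END.hour, minute=WORK_END.minute,
                       second=0, microsecond=0)


if __name__ == "__main__":
    print(next_patch_window(datetime(2024, 1, 3, 10, 30)))
```

A real schedule would also account for maintenance policies, time zones, and change-approval processes; the sketch covers only the working-hours rule.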
Figure: Updates from Microsoft Update pass over the Internet and through the
firewall to the WSUS server.
Resources
The training topics below support the concepts and features that are discussed in
this training. Click the provided links for more information.
• Supported Memory Configuration Guide for PowerEdge Servers
(VC, ODC)
PowerEdge
(C) - Classroom
(VC) - Virtual Classroom
(ODC) - On Demand Course
2S
Two socket form factor. Used to identify the family of servers. PowerEdge servers
can have 1S, 2S, or 4S. See the PowerEdge rack server portfolio page for details.
AI
Artificial Intelligence (AI) is the designing and building of intelligent agents that
receive percepts from the environment and act to affect that environment.
BOSS
Dell Technologies Boot Optimized Storage Solution (BOSS). RAID solution card that
is designed for booting a server's operating system.
bus interface
A bus interface is a communication system that transfers data between
components inside a system, or among systems.
CMC
Chassis Management Controller (CMC) is a hardware and software solution for
managing multiple Dell blade chassis.
configuration
Configuration is the specifications that make the IT enterprise environment systems
work.
connector
A connector is a jack, plug, or card edge that helps connect the device to a port.
CPU
DIMM
Dual In-line Memory Module. DIMMs are available in varying capacities. All
DIMMs in a cache must have the same capacity.
DL
Deep Learning (DL) is a form of Machine Learning which uses Artificial Neural
Networks.
HCI
Hyperconverged infrastructure (HCI) combines compute, virtualization, storage,
and networking in a single cluster.
Hot-swap
Hot-swap means the removal and replacement of an electronic device or module
without powering down or shutting down the system.
HPC
High performance computing (HPC) is the ability to process data and perform
complex calculations at high speeds.
HW RAID
Form of RAID. The motherboard or a separate RAID card handles the processing.
iDRAC
The Integrated Dell Remote Access Controller (iDRAC) is designed for secure local
and remote server management and helps IT administrators deploy, update, and
monitor PowerEdge servers.
IDSDM
Redundant SD-card module for embedded hypervisors. PowerEdge servers can
boot to the hypervisor out-of-the-box. The embedded hypervisor is mirrored across
dual SD cards using an integrated hardware controller.
IEEE 802.3
IEEE 802.3 is a collection of standards from the Institute of Electrical and
Electronics Engineers (IEEE). The 802.3 working group sets these standards, which
define the physical layer and the Media Access Control (MAC) sublayer of the data
link layer for Ethernet.
InfiniBand
A computer networking communications standard used in high-performance
computing.
IoT
The Internet of Things (IoT) describes the network of physical objects that are
embedded with sensors, software, and other technologies for the purpose of
connecting and exchanging data with other devices and systems over the Internet.
(Wikipedia)
iSM
The iDRAC Service Module (iSM) is optional software provided by the Integrated
Dell Remote Access Controller (iDRAC). The iSM provides additional information
using RACADM CLI, Redfish, Web Service Management (WSMan), and User
Interface (UI). The iSM integrates with the iDRAC SupportAssist collection.
Lifecycle Controller
LRDIMM
Load-Reduced DIMM. Has higher densities than RDIMMs. Uses a memory buffer
chip to reduce the load on the server memory bus.
ML
Machine Learning (ML) is an application of AI where systems use data to learn how
to respond, rather than being explicitly programmed.
MT/s
Mega-Transfers per Second (MT/s). Measurement of bus and channel speed in
millions of data transfers per second.
Multicasting
Multicasting involves sending the same message to many endpoints such as in a
video conferencing facility.
NVDIMM
Non-Volatile DIMM
NVMe
Non-Volatile Memory Express (NVMe). Communications interface for PCIe-based
SSDs. Used to increase efficiency and performance.
OCP
Open Compute Project (OCP) is an organization that shares designs of data center
products and best practices among companies. OCP designs and projects include
server designs, data storage, rack designs, and open networking switches. Read
more information about the organization by going to www.opencompute.org.
OME
OpenManage Enterprise (OME) is the one-to-many management console used to
discover and manage up to 8,000 devices regardless of the form factor.
OMSA
Dell OpenManage Server Administrator (OMSA) is an in-band, one-to-one software
application that can manage and monitor the health of one server.
PCH
Platform controller hub (PCH) controls certain data paths and support functions
used in conjunction with Intel CPUs.
PCIe
Peripheral component interconnect express (PCIe) is an interface standard for
connecting high-speed components.
PERC
PowerEdge RAID Controller (PERC). Family of controllers that enhance
performance, increase reliability, add fault tolerance, and simplify management.
proactive contact
A support case is automatically opened, diagnostic information is proactively sent
to Dell Technologies, and technical support begins troubleshooting. A support case
supports Windows, Linux, VMware, and Hyper-V environments.
RAID
Redundant Array of Independent Disks (RAID). RAID controllers combine multiple
physical server hard drives into one or more virtual drives to improve
data efficiency and protection.
RDIMM
Registered DIMM. Dual in-line memory module (DIMM) with improved reliability.
SAS
SAS (serial-attached SCSI) is a type of SCSI that uses serial signals to transfer
data, instructions, and information. SAS drives are dual ported.
SATA
SATA (Serial Advanced Technology Attachment) uses serial signals to transfer
data, instructions, and information. SATA drives have only a single port.
SDS
SNAP I/O
Balances I/O performance. CPUs share one adapter, which prevents data from
traversing the inter-processor link when accessing remote memory.
SP
A service provider (SP) is a company that provides its subscribers access to the
internet.
STP cable
Shielded Twisted Pair (STP) Ethernet cable that is commonly used for high-speed
networks. A metallic substance shields STP. An additional metal foil wraps each set
of twisted wire pairs together.
SupportAssist
SupportAssist Enterprise is for users that require monitoring of up to 15,000 server,
storage, and networking devices.
UDIMM
Unregistered or unbuffered DIMM. UDIMMs do not have an onboard register as
seen with an RDIMM. UDIMMs are typically used in desktops and laptops.
UEFI boot
Unified Extensible Firmware Interface (UEFI). UEFI secure boot prevents systems
from booting from unsigned or unauthorized preboot device firmware, applications,
and operating system boot loaders. Without secure boot enabled, systems are
vulnerable to malware corrupting the startup process. UEFI is a firmware interface
that connects the firmware to the operating system. UEFI initializes the hardware
components and starts the operating system.
UTP cable