DEAD Recovery Solution For 10GigEthr 03 Iocxgbe Adapters 5900-5001

The document outlines a recovery solution for 10GigEthr-03 adapters that enter a DEAD state due to hardware or firmware errors, detailing both automatic and manual recovery methods. Automatic recovery is supported for various configurations, while manual recovery is introduced for DIO interfaces in the B.11.31.2201 release. The document also specifies installation requirements, supported adapters, and limitations of the recovery solutions.

DEAD recovery solution for 10GigEthr-03 (iocxgbe) adapters

1| HewlettPackardEnterprise
Table of Contents

Executive Summary
Problem Statement
Requirements
Deliverables for auto recovery
Supported adapters
Limitations
Limitation for non-DIO interfaces (Automatic recovery)
Accessing and Installing from the Software Depot
UE and Non-UE cases
Overview of Automatic Recovery
Manual Recovery for non-DIO interfaces
Time taken for recovery of non-DIO interfaces
Tunable
Manual recovery of DIO adapter on Blade server
Detailed explanation of DIO manual recovery
Logging
HPE recommended HA configuration
References

Executive Summary
The 10GigEthr-03 adapters have been observed to experience hardware or firmware errors,
such as an unrecoverable error (UE), that make the LAN controller unresponsive and
eventually force the HP-UX 10 Gigabit Ethernet driver (iocxgbe) into the DEAD state. Such
errors have little diagnostic value, but they do affect customer deployments, which
must reboot the entire system for the LAN controllers to recover. This
document describes a scheme that attempts to return such DEAD adapters to a usable
state without a full system reboot. It replaces the document formerly titled
"Automatic recovery of the 10GigEthr-03(iocxgbe) adapters in DEAD state scenarios".

There are two recovery methods: automatic and manual.

The automatic recovery solution works even when the interface is configured under APA or
VLAN. Interfaces configured under a vswitch and exported to a guest (HPVM/vPAR) as an AVIO
device can also be auto-recovered without any impact on the existing configuration. However,
automatic recovery does not work for interfaces exported to guests as DIO devices. Before
version B.11.31.2201 of IOCXGBE, a system reboot was still required to recover such
interfaces from the DEAD state. Version B.11.31.2201 adds a manual recovery feature for DIO
interfaces on blade servers. With this solution, DEAD DIO interfaces can also be recovered
without a system reboot.

Problem Statement
When the 10GigEthr-03 adapter encounters a hardware or firmware issue such as an
unrecoverable error (UE), the HP-UX iocxgbe network driver is forced into the DEAD state.
When a LAN controller chip encounters this problem, all the ports of that controller are
affected and moved to the DEAD state.
Automatic adapter recovery is a method of returning such a DEAD network I/O card to its normal
working state without any user intervention and without rebooting the server. It restores all
the affected functions of a configured port, whether used as a LAN device or as an FCoE device,
to the normal working state.

Requirements
Automatic recovery
The B.11.31.1609 release of the HP-UX 10GigEthr-03 (iocxgbe) driver introduces an automatic
recovery feature for iocxgbe adapters in the DEAD state.
The 10GigEthr-03 software bundle contains the iocxgbe driver with updates that support the
recovery mechanism and changes to handle UE scenarios.
In addition to the 10GigEthr-03 bundle, the iocxgbe driver requires other patches to
enable auto recovery.
The latest version of the 11.31 Networking Commands cumulative patch, PHNE_44564, which
delivers a daemon and a startup script that handle NIC recovery, must be installed.
The latest version of the LAN cumulative patch, PHNE_44540, which delivers the modified
driver interface layer for DEAD event handling, must also be installed to support Automatic
adapter recovery.
The latest version of the FCoE driver as of September 2016 should be installed for interfaces
that support FCoE as well as 10 Gigabit Ethernet.

Manual recovery for DIO


The B.11.31.2201 release of the HP-UX 10GigEthr-03 (iocxgbe) driver introduces a manual
recovery feature for iocxgbe adapters in the DEAD state when the interface is used in
DIO mode. In addition to the 10GigEthr-03 bundle, the iocxgbe driver requires other
patches to enable this feature. The latest version of the 11.31 Networking Commands
cumulative patch, PHNE_44903, which delivers a daemon and a start-up script that handle
NIC recovery, must be installed. The latest version of the LAN cumulative patch,
PHNE_44904, which delivers the modified driver interface layer for the manual recovery feature,
must also be installed to support manual adapter recovery. For more details, see the
"Manual recovery of DIO adapter on Blade server" section later in this document, which
describes the limitations of this feature as well as installation and operation details.

Deliverables for auto recovery


The following should be installed on the server to support the feature:
• Latest version of the 10GigEthr-03 (iocxgbe) driver (B.11.31.1609)
Depot: 10GigEthr-03_B.11.31.1609_HP-UX_B.11.31_IA.depot
• Patch Name: PHNE_44540 Patch Description: LAN cumulative patch
Depot: PHNE_44540.depot
• Patch Name: PHNE_44564 Patch Description: 11.31 Networking Commands
cumulative patch
Depot: PHNE_44564.depot
With the installation of the mentioned patches and depots, the Automatic adapter recovery
feature is enabled by default.

IMPORTANT: The following problem was fixed in later patch releases, that is,
'PHNE_44588 - LAN Cumulative patch' and 'PHNE_44655 - Networking Commands patch' or
later:
QXCR1001522640: lsof hangs when nicrecd daemon is running.

Please note that these patches are not included in the Fusion 2018 release and need
to be installed separately.

Supported adapters
This solution works on all variants of the adapter, that is, the LOM, Mezzanine, Stand-up and
Combo form factors.

Supported adapters for auto recovery:


This feature is supported with the following network adapters:
• LOM
  o NC553i - Dual Port FlexFabric 10Gb BL8X0c-FCoE-LOM i4
• Mezzanine
  o NC553m 10Gb 2-port FlexFabric Converged Network Adapter
  o NC552m Dual-Port Flex-10 10GbE BL-c Adapter
• Stand-up cards
  o AT111A HP PCIe 2-port CNA
  o AT118A HP Integrity NC552SFP 2P 10GbE Adapter
  o AT094A HP PCIe 2p 8Gb FC and 2p 1/10GbE Adapter
Supported adapters for manual DIO recovery:
The network adapters that support this feature are described in the "a) Overview of the
DIO manual recovery solution" section later in this document.

Limitations
The recovery solutions for non-DIO and DIO interfaces have the following limitations.
Limitation for non-DIO interfaces (Automatic recovery)
• Card firmware version 4.9.416.12 or later is required. Ensure that the adapter
  is updated with the latest firmware version before proceeding with the installation. Recovery
  is not supported on firmware versions earlier than 4.9.416.12: when an
  adapter with older firmware becomes unresponsive, the driver detects it immediately and
  marks the interface as DEAD, but it does not recover the interface.
• Currently, the Automatic adapter recovery feature is not supported for DIO devices. If an
  interface has any of its functions exported to an HPVM guest as a DIO device, the recovery
  fails. The VSP needs to be rebooted to recover the DEAD interface.
• At most five recovery attempts are made for a particular adapter within 24 hours.
  For example, if a particular card becomes unresponsive frequently, there will be up to five
  attempts to recover the interface within a 24-hour period.
  Beyond that, either the card is considered faulty and needs to be replaced, or the
  user needs to try manual recovery.
• The display order of interfaces in lanscan/nwmgr can change after recovery. However, the
  PPA/interface name (suffix) does not change.
Limitation for DIO interfaces (Manual recovery for DIO)
Refer to 'b) Limitations of DIO manual recovery' later in this document.
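
The firmware requirement above can be checked with a small comparison helper. The following is a minimal sketch, not part of any HPE tool: the function names version_ge and fw_from_line and the variables MIN_FW_AUTO and MIN_FW_DIO are hypothetical. The two minimum versions are the ones quoted in this document for automatic recovery and for DIO manual recovery.

```shell
# Hypothetical helper (not an HPE command): succeeds when dotted version $1
# is greater than or equal to dotted version $2, comparing field by field.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Minimum firmware versions quoted in this document.
MIN_FW_AUTO="4.9.416.12"   # automatic recovery (non-DIO)
MIN_FW_DIO="4.9.416.17"    # manual recovery for DIO

# Extract the version from a line such as "Firmware version: 4.9.416.12".
fw_from_line() {
    printf '%s\n' "$1" |
        sed -n 's/.*[Ff]irmware version:[[:space:]]*\([0-9.]*\).*/\1/p'
}
```

On a real system the input line would come from the firmware query shown later in this document (nwmgr -q info -c lan<ppa> | grep "Firmware"); here it is passed in as a string so the sketch can be tried anywhere.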

Accessing and Installing from the Software Depot


For manual recovery of DIO interfaces, skip this section and read from the "c) Deliverables for DIO
manual recovery" section later in this document.

The Automatic adapter recovery feature can be accessed and installed on supported systems
from the HPE Software Depot.
To install the driver bundle and other dependent patches on the server with a single reboot,
follow these steps:
Step 1: Log in to the server as root.
Step 2: Back up the server before installing the product.
Step 3: Download all the depots into a directory (say /tmp).
Step 4: Verify that the depots are downloaded correctly using the following command:
swlist -d -s /tmp/<depot_filename>
For example:
# swlist -d -s /tmp/10GigEthr-03_B.11.31.1609_HP-UX_B.11.31_IA.depot
# 10GigEthr-03 B.11.31.1609 PCIe 10 GbE;Supptd
HW=580151/610609/613431B21,NC551/552/553,AT094/111/118A

# swlist -d -s /tmp/PHNE_44540.depot
# PHNE_44540 1.0 LAN cumulative patch

# swlist -d -s /tmp/PHNE_44564.depot
# PHNE_44564 1.0 Networking commands cumulative patch

Step 5: Create a new directory (say /tmp/hpe_patches).


Step 6: Copy all the downloaded depots into this directory using the swcopy command:
swcopy -s /tmp/<depot_filename> \* @ /target_directory
For example:
# swcopy -s /tmp/10GigEthr-03_B.11.31.1609_HP-UX_B.11.31_IA.depot \* @ /tmp/hpe_patches

# swcopy -s /tmp/PHNE_44540.depot \* @ /tmp/hpe_patches

# swcopy -s /tmp/PHNE_44564.depot \* @ /tmp/hpe_patches

The /tmp/hpe_patches directory is now a mega bundle that contains all the downloaded depots.
Installing the mega bundle effectively installs all the depots together on the server.
Step 7: Install the /tmp/hpe_patches bundle using the swinstall tool:
swinstall -s /tmp/<depot_filename>
For example:
# swinstall -x autoreboot=true -s /tmp/hpe_patches/ \*
Important: Use this command on the server where the product is to be installed, running
standalone; do not perform this step over the network. When installation is complete, the
server reboots.
Step 8: To verify that the 10 Gigabit Ethernet driver installation was successful, use this
command:
what /stand/vmunix /stand/current/mod/* | grep iocxgbe
For example:
# what /stand/vmunix /stand/current/mod/* | grep iocxgbe
/stand/current/mod/iocxgbe:
iocxgbe_ilan Version: 3 Sep 14 2016
iocxgbe Revision: IOCXGBE_B.11.31.WR1609 Sep 14 2016
$Revision: iocxgbe: B.11.31.1609_LR
/stand/current/mod/iocxgbe.prep:
#
Step 9: To verify that the patch installation was successful, use this command:
swverify <patch_name>
For example:
# swverify PHNE_44540
# swverify PHNE_44564
Ensure that there are no errors displayed in the output.
Step 10: Verify that the auto recovery daemon is active by using this command:
ps -ef | grep nicrecd
For example,
# ps -ef | grep nicrecd
root 14389 1 0 16:29:46 ? 0:00 /usr/sbin/nicrecd
#

Important: On an HPVM configuration, the auto recovery related patches should be installed
ONLY on the host/VSP.
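
The command sequence in Steps 4 to 7 can be previewed before anything is run. The sketch below only prints the commands; it does not execute swlist, swcopy or swinstall. The function name print_install_plan and the variables DEPOT_DIR and STAGING_DIR are hypothetical, the paths mirror the examples above, and mkdir -p is one assumed way of creating the directory in Step 5.

```shell
# Dry-run sketch: prints the commands from Steps 4 to 7 instead of running them.
DEPOT_DIR=/tmp                 # where the depots were downloaded (Step 3)
STAGING_DIR=/tmp/hpe_patches   # combined depot directory (Step 5)

print_install_plan() {
    # "$@" is the list of depot file names downloaded in Step 3.
    for depot in "$@"; do
        printf 'swlist -d -s %s/%s\n' "$DEPOT_DIR" "$depot"
    done
    # Assumption: the document only says "create a new directory".
    printf 'mkdir -p %s\n' "$STAGING_DIR"
    for depot in "$@"; do
        printf 'swcopy -s %s/%s \\* @ %s\n' "$DEPOT_DIR" "$depot" "$STAGING_DIR"
    done
    printf 'swinstall -x autoreboot=true -s %s/ \\*\n' "$STAGING_DIR"
}

print_install_plan \
    10GigEthr-03_B.11.31.1609_HP-UX_B.11.31_IA.depot \
    PHNE_44540.depot \
    PHNE_44564.depot
```

Printing the plan first makes it easy to review the depot list, since the final swinstall command reboots the server.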

UE and Non-UE cases


When a card starts functioning erratically and becomes unresponsive, one of two things
can happen. Either the UE bits are set in the card firmware, indicating that an
unrecoverable error has occurred, or the firmware stops responding to the management and
statistics commands that the driver posts periodically.
In the former case, the interface is marked DEAD due to UE.

In the latter case, the driver mailbox commands time out, eventually resulting in a data path
transmit engine stall. The driver can no longer send or receive packets, and the
interface is marked DEAD due to command timeout.

The Automatic recovery feature attempts to recover the interfaces from all such DEAD cases.

The following are examples of the messages logged in the syslog when an iocxgbe adapter
enters the DEAD state due to UE:
Sep 23 15:28:19 hpsmbl3 vmunix: iocxgbe0/6725, 1474883899.771878 iocxgbe_mark_card_dead called from iocxgbe_watchdog_detected_dead
Sep 23 15:28:20 hpsmbl3 vmunix: iocxgbe0/1649, 1474883900.170283 iocxgbe_watchdog: uerr lo 0x4000000 0x4000000, hi 0x1080 0x0
Sep 23 15:28:20 hpsmbl3 vmunix: iocxgbe0/1653, 1474883900.170287 iocxgbe_watchdog: uerr hi bit 7 set: PMEM
Sep 23 15:28:20 hpsmbl3 vmunix: iocxgbe0/1665, 1474883900.170291 iocxgbe_watchdog: Reboot needed to recover device
Sep 23 15:28:20 hpsmbl3 vmunix: iocxgbe1/1649, 1474883900.171572 iocxgbe_watchdog: uerr lo 0x4000000 0x4000000, hi 0x1080 0x0
Sep 23 15:28:20 hpsmbl3 vmunix: iocxgbe1/1653, 1474883900.171575 iocxgbe_watchdog: uerr hi bit 7 set: PMEM
Sep 23 15:28:20 hpsmbl3 vmunix: iocxgbe1/1665, 1474883900.171579 iocxgbe_watchdog: Reboot needed to recover device
Sep 23 15:28:31 hpsmbl3 vmunix: iocxgbe1/1618, 1474883911.171575 iocxgbe_watchdog: uerr lo 0x4000020 0x4000000, hi 0x801080 0x0
Sep 23 15:28:31 hpsmbl3 vmunix: iocxgbe1/1622, 1474883911.171580 iocxgbe_watchdog: uerr lo bit 5 set: MPU
Sep 23 15:28:31 hpsmbl3 vmunix: iocxgbe1/1634, 1474883911.171585 iocxgbe_watchdog: Reboot needed to recover device
Sep 23 15:28:31 hpsmbl3 vmunix: iocxgbe1/1649, 1474883911.171588 iocxgbe_watchdog: uerr lo 0x4000020 0x4000000, hi 0x801080 0x0
Sep 23 15:28:31 hpsmbl3 vmunix: iocxgbe1/1653, 1474883911.171591 iocxgbe_watchdog: uerr hi bit 23 set: NETC
Sep 23 15:28:31 hpsmbl3 vmunix: iocxgbe1/1665, 1474883911.171595 iocxgbe_watchdog: Reboot needed to recover device

The following are examples of the messages logged in the syslog when an iocxgbe adapter
enters the DEAD state due to command timeout:

Sep 23 11:42:57 hpsmbl3 vmunix: iocxgbe36/6725, 1474611177.435287 iocxgbe_mark_card_dead called from iocxgbe_complete_cmd_timeout.
Or
Sep 23 11:42:57 hpsmbl3 vmunix: iocxgbe36/6725, 1474611177.435287 iocxgbe_mark_card_dead called from iocxgbe_complete_cmd.
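
The two DEAD causes can be told apart from the caller named in the iocxgbe_mark_card_dead message. The following is a minimal sketch based only on the example messages above: dead_cause and dead_instance are hypothetical helper names, and the mapping from caller to cause is inferred from those examples rather than from a documented message format.

```shell
# Hypothetical helper: classify one syslog line by the caller that follows
# "iocxgbe_mark_card_dead called from". Mapping inferred from the examples.
dead_cause() {
    case "$1" in
        *"iocxgbe_mark_card_dead called from iocxgbe_watchdog_detected_dead"*)
            echo "UE" ;;
        *"iocxgbe_mark_card_dead called from iocxgbe_complete_cmd"*)
            # Covers both iocxgbe_complete_cmd and iocxgbe_complete_cmd_timeout.
            echo "command timeout" ;;
        *)
            echo "unknown" ;;
    esac
}

# Hypothetical helper: print the driver instance (for example iocxgbe36)
# named right after "vmunix: " in the message.
dead_instance() {
    printf '%s\n' "$1" |
        sed -n 's/.*vmunix: \(iocxgbe[0-9][0-9]*\)\/.*/\1/p'
}
```

Each function takes a single complete syslog line as its argument, so a wrapped message has to be joined back onto one line before it is classified.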

Overview of Automatic Recovery


HP-UX NIC Recovery Daemon & Start-up script:
A new daemon for adapter recovery (/usr/sbin/nicrecd) is introduced. It automatically
detects DEAD state scenarios and attempts to recover the interface that is in the DEAD
state.
A new startup script (/sbin/rc2.d/S307hpnicrecovery, a soft link to /sbin/init.d/hpnicrecovery) is
introduced to administer the start/stop operations of the daemon. The daemon is
launched through this soft link during system boot.
The startup script can start and stop the daemon in line with init script conventions.
Both files are installed with the latest PHNE_44564.

With the installation of the mentioned patches and depots, the automatic adapter recovery feature is
enabled by default. This means that the NIC recovery daemon is always running on that
HP-UX server. The daemon sleeps on an ioctl to the driver interface layer until it is
unblocked by a driver event.
When the 10GigEthr-03 (iocxgbe) driver detects that a particular interface is not responding, it
immediately marks the interface as DEAD and notifies the upper layers by sending an event.
The daemon, awakened by the driver event, confirms that the interface is in the DEAD state and
attempts to recover the interface by issuing a driver API call.
Once the recovery attempt is completed, a success or failure message is logged in the syslog
and the daemon goes back to sleep.

After the card has been recovered, a recovery message is logged in syslog, for example:
Sep 23 11:44:16 hpsmbl3.in.rdlabs.hpecorp.net nicrecd[1353]: Interface: 42/0/1/0/0/0 recovered successfully.

If the recovery does not succeed, the adapter has a persistent error condition. A
failure message is logged in syslog, for example:
Sep 23 11:44:16 hpsmbl3.in.rdlabs.hpecorp.net nicrecd[1353]: Interface: 42/0/1/0/0/0 recovery failed.

The auto recovery daemon also handles cases where multiple adapters enter the DEAD state
simultaneously.

If a particular adapter becomes DEAD frequently, there will be up to five attempts to
recover that interface within 24 hours.
Beyond that, a manual NIC recovery operation is required to restore the card.
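
The limit of five attempts within 24 hours can be pictured as a simple window check. The sketch below is illustrative only: the document does not describe how nicrecd counts attempts, so the sliding-window interpretation, the function name recovery_allowed and the variables MAX_ATTEMPTS and WINDOW_SECONDS are all assumptions.

```shell
# Illustrative only: NOT the actual nicrecd logic, which is not documented here.
MAX_ATTEMPTS=5          # attempts allowed per adapter
WINDOW_SECONDS=86400    # 24 hours

# Usage: recovery_allowed NOW TS1 TS2 ...   (all values are epoch seconds)
# Succeeds if fewer than MAX_ATTEMPTS earlier attempts fall inside the window.
recovery_allowed() {
    now=$1
    shift
    recent=0
    for ts in "$@"; do
        if [ $((now - ts)) -lt "$WINDOW_SECONDS" ]; then
            recent=$((recent + 1))
        fi
    done
    [ "$recent" -lt "$MAX_ATTEMPTS" ]
}
```

With four attempts in the last day a fifth is still allowed; once five attempts fall inside the window, the check fails, which corresponds to the point where manual recovery or card replacement is needed.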

NOTE:
HPE recommends replacing a card that shows such erratic behaviour, because there is a high
probability that the I/O card is defective. If a user still intends to use the same adapter,
the manual recovery option remains available.
However, a manual recovery attempt may also fail if there is a persistent error condition.

Manual Recovery for non-DIO interfaces


This feature is available as a command under the existing "lanadmin"
framework for the HP-UX network drivers.
To recover from the DEAD state manually, follow these steps:
1. List all the available iocxgbe interfaces on the server along with the <ppa> using the
following command:
nwmgr -S iocxgbe
Note the <ppa> number or <ClassInstance> of the interface that is in the DOWN state.
For example,
# nwmgr -S iocxgbe
Name/          Interface Station        Sub-    Interface  Related
ClassInstance  State     Address        system  Type       Interface
============== ========= ============== ======= ========== =========
lan0           DOWN      0xA0B3CC1CAF28 iocxgbe 10GBASE-KR
lan1           DOWN      0xA0B3CC1CAF2C iocxgbe 10GBASE-KR
lan2           DOWN      0xA0B3CC1CAF2A iocxgbe 10GBASE-KR
lan3           DOWN      0xA0B3CC1CAF2E iocxgbe 10GBASE-KR
lan4           DOWN      0xA0B3CC1CAF2B iocxgbe 10GBASE-KR
lan5           DOWN      0xA0B3CC1CAF2F iocxgbe 10GBASE-KR
lan6           DOWN      0xA0B3CC1CAF20 iocxgbe 10GBASE-KR
lan7           DOWN      0xA0B3CC1CAF22 iocxgbe 10GBASE-KR
lan8           UP        0x10604B353B6C iocxgbe 10GBASE-KR
lan9           UP        0x10604B353B70 iocxgbe 10GBASE-KR
#

2. Execute the following command to confirm the driver state of the interface that is
DOWN:
nwmgr -q info -c lan<ppa> | grep "Driver state"
For example,
#nwmgr -q info -c lan1 | grep "Driver state"
Driver state: IOCXGBE_STATE_DEAD
#
#nwmgr -q info -c lan5 | grep "Driver state"
Driver state: IOCXGBE_STATE_DEAD
#

3. Use the following command to view all the related functions/hardware
paths of the iocxgbe interfaces:
lanscan
For example,
#lanscan
Hardware      Station        Crd Hdw   Net-Interface NM MAC   HP-DLPI DLPI
Path          Address        In# State Name PPA      ID Type  Support Mjr#
0/0/0/7/0/0/0 0xA0B3CC1CAF28 0   DOWN  lan0 snap0    1  ETHER Yes     119
0/0/0/7/0/0/1 0xA0B3CC1CAF2C 1   DOWN  lan1 snap1    2  ETHER Yes     119
0/0/0/7/0/0/2 0xA0B3CC1CAF2A 2   DOWN  lan2 snap2    3  ETHER Yes     119
0/0/0/7/0/0/3 0xA0B3CC1CAF2E 3   DOWN  lan3 snap3    4  ETHER Yes     119
0/0/0/7/0/0/4 0xA0B3CC1CAF2B 4   DOWN  lan4 snap4    5  ETHER Yes     119
0/0/0/7/0/0/5 0xA0B3CC1CAF2F 5   DOWN  lan5 snap5    6  ETHER Yes     119
0/0/0/7/0/0/6 0xA0B3CC1CAF20 6   DOWN  lan6 snap6    7  ETHER Yes     119
0/0/0/7/0/0/7 0xA0B3CC1CAF22 7   DOWN  lan7 snap7    8  ETHER Yes     119
0/0/0/3/0/0/0 0x10604B353B6C 8   UP    lan8 snap8    9  ETHER Yes     119
0/0/0/3/0/0/1 0x10604B353B70 9   UP    lan9 snap9    10 ETHER Yes     119
#

4. Execute the following command to confirm the firmware version of the DEAD interface:
nwmgr -q info -c lan<ppa> | grep "Firmware"
For example,
#nwmgr -q info -c lan0 | grep "Firmware"
Firmware version: 4.9.416.12
#

5. Execute the following command to recover the interface that is in the DEAD state:
lanadmin -x recover <ppa>
For example,
# lanadmin -x recover 1

WARNING!!
Driver will reset the entire interface.
This means that ALL the lan/FCoE ports that are associated with this interface on the adapter
will be reset.
Recovery is not supported on an interface with one or more ports/functions exported to the
guest as a DIO device.
In such a case please abort this operation and reboot the VSP to recover the interface.
Do you really want to proceed? (y/n) [n]: y

Hardware path : 0/0/0/7/0/0/0

Suspending h/w path 0/0/0/7/0/0/0
Suspending h/w path 0/0/0/7/0/0/1
Suspending h/w path 0/0/0/7/0/0/2
Suspending h/w path 0/0/0/7/0/0/3
Suspending h/w path 0/0/0/7/0/0/4
Suspending h/w path 0/0/0/7/0/0/5
Suspending h/w path 0/0/0/7/0/0/6
Suspending h/w path 0/0/0/7/0/0/7

Reset Success!!
Attempting recovery of the interface(s)...

Resuming h/w path 0/0/0/7/0/0/0
Resuming h/w path 0/0/0/7/0/0/1
Resuming h/w path 0/0/0/7/0/0/2
Resuming h/w path 0/0/0/7/0/0/3
Resuming h/w path 0/0/0/7/0/0/4
Resuming h/w path 0/0/0/7/0/0/5
Resuming h/w path 0/0/0/7/0/0/6
Resuming h/w path 0/0/0/7/0/0/7

Recovery attempt completed.

6. Execute the following command to verify that the interface has recovered successfully:
nwmgr -S iocxgbe
For example,
# nwmgr -S iocxgbe
Name/          Interface Station        Sub-    Interface  Related
ClassInstance  State     Address        system  Type       Interface
============== ========= ============== ======= ========== =========
lan0           UP        0xA0B3CC1CAF28 iocxgbe 10GBASE-KR
lan1           UP        0xA0B3CC1CAF2C iocxgbe 10GBASE-KR
lan2           UP        0xA0B3CC1CAF2A iocxgbe 10GBASE-KR
lan3           UP        0xA0B3CC1CAF2E iocxgbe 10GBASE-KR
lan4           UP        0xA0B3CC1CAF2B iocxgbe 10GBASE-KR
lan5           UP        0xA0B3CC1CAF2F iocxgbe 10GBASE-KR
lan6           UP        0xA0B3CC1CAF20 iocxgbe 10GBASE-KR
lan7           UP        0xA0B3CC1CAF22 iocxgbe 10GBASE-KR
lan8           UP        0x10604B353B6C iocxgbe 10GBASE-KR
lan9           UP        0x10604B353B70 iocxgbe 10GBASE-KR
#

7. Verify the driver state of the recovered interface using the following command:
nwmgr -q info -c lan<ppa> | grep "Driver state"
Note that all the interfaces corresponding to a DEAD card will be restored.
For example,
#nwmgr -q info -c lan1 | grep "Driver state"
Driver state: IOCXGBE_STATE_ONLINE
#
#nwmgr -q info -c lan5 | grep "Driver state"
Driver state: IOCXGBE_STATE_ONLINE
#

A successful attempt at manual recovery will restore the card.
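
Because one recover command restores every port on the failed chip, it helps to pick a single candidate interface per controller. The sketch below reads lanscan-style rows and prints the first DOWN interface for each hardware path prefix. It rests on an assumption drawn from the example output above, namely that ports of one LAN controller differ only in the last component of the hardware path. The function name first_down_per_controller is hypothetical.

```shell
# Reads lanscan-style rows on stdin. For each hardware path prefix (the path
# minus its last component) prints the prefix and the first interface name
# whose hardware state is DOWN.
# Assumption: ports of one LAN controller share that prefix.
first_down_per_controller() {
    awk '$4 == "DOWN" && $1 ~ /^[0-9]+(\/[0-9]+)+$/ {
        prefix = $1
        sub(/\/[0-9]+$/, "", prefix)
        if (!(prefix in seen)) {
            seen[prefix] = 1
            print prefix, $5
        }
    }'
}
```

A DOWN state in lanscan does not by itself mean the driver is DEAD, so the candidate still has to be confirmed with the "Driver state" query from step 2 before lanadmin -x recover is issued for it.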

NOTE:
• When a DEAD state is reported, all the ports of that particular adapter are affected
  and marked as DEAD. Attempt recovery by choosing one of the DEAD
  ports belonging to the failed LAN controller chip. A single recover command is
  sufficient to recover all the ports belonging to the failed chip.
• Recovery does not proceed when a port is online, offline or suspended. It should be
  attempted only on a DEAD port.
• Two or more manual recoveries, whether on the same or on different interfaces,
  cannot run simultaneously on a system. A second or later attempt fails until the
  currently executing command has finished.
• Before attempting the manual recovery:
  o The user need not stop the hpvmnet (vswitch) associated with the interface prior to the
    operation.
  o The user need not remove the interface from the APA failover group prior to the
    operation.
  o If the interface is a member of an APA failover group, after the recovery is done,
    check whether the interface is available in the failover group by running:
    nwmgr -v -S apa -c lan90x
    (lan90x is the interface name of the failover group)

Time taken for recovery of non-DIO interfaces


The following tables list the time taken by the 10GigEthr-03 (iocxgbe) cards to recover from the
DEAD state with the adapter recovery feature.
Time taken for Auto recovery:
NOTE: Measured from the moment the driver is marked DEAD to the time the driver reaches the
ONLINE/OFFLINE state.
Adapter in VC mode        Approximately 120 seconds
Adapter in non-VC mode    Approximately 60 seconds

Time taken for Manual recovery:
NOTE: Measured from the moment the user executes the command to the time the driver reaches the
ONLINE/OFFLINE state.
Adapter in VC mode        Approximately 90 seconds
Adapter in non-VC mode    Approximately 30 seconds

NOTE:
VC stands for Virtual Connect, which is used by blade servers.
For non-blade servers, only the "non-VC mode" figures apply.

Tunable
To enable or disable the automatic recovery feature, the 10GigEthr-03 driver startup
configuration file /etc/rc.config.d/hpiocxgbeconf provides the following parameter:
HP_IOCXGBE_AUTO_RECOVERY= <yes/no> #default yes
To disable the auto recovery feature, edit the iocxgbe startup
configuration file so that it contains:
HP_IOCXGBE_AUTO_RECOVERY=no
Ensure that there are no spaces at the beginning of the line.

Auto recovery is enabled by default, even if the tunable parameter
"HP_IOCXGBE_AUTO_RECOVERY" is missing from the configuration file.
However, irrespective of the tunable value, the NIC recovery daemon process is always
alive in sleeping mode.
Setting the tunable to 'no' does not stop the NIC recovery daemon; it only prevents recovery
from being attempted on a DEAD iocxgbe I/O card.

The line "HP_IOCXGBE_AUTO_RECOVERY=..." will be missing if the
/etc/rc.config.d/hpiocxgbeconf file was modified before the 10GigEthr-03 driver was updated to
version B.11.31.1609. To disable the automatic recovery feature in such a case, add the
line "HP_IOCXGBE_AUTO_RECOVERY=no" manually.
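
The default behaviour described above (enabled unless the file explicitly says "no") can be expressed as a small check. This is a minimal sketch under stated assumptions: the function name auto_recovery_enabled is hypothetical, only the exact lowercase value "no" is treated as disabling recovery, and the last matching line in the file is taken to win. How the driver startup script itself interprets other values is not described in this document.

```shell
# Sketch: succeeds if auto recovery would be enabled according to the file.
# Assumptions: only the lowercase value "no" disables recovery, and the last
# HP_IOCXGBE_AUTO_RECOVERY line in the file takes effect.
auto_recovery_enabled() {
    conf=${1:-/etc/rc.config.d/hpiocxgbeconf}
    # The line must start in column 1, as the document requires.
    value=$(sed -n 's/^HP_IOCXGBE_AUTO_RECOVERY=\([A-Za-z]*\).*/\1/p' \
        "$conf" 2>/dev/null | tail -n 1)
    [ "$value" != "no" ]
}
```

A missing parameter yields an empty value, so the check reports "enabled", which matches the documented default.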

Manual recovery of DIO adapter on Blade server


The Automatic adapter recovery feature is not supported for DIO devices. If an interface has any of
its functions exported to an HPVM guest as a DIO device, auto recovery fails. Before IOCXGBE
version B.11.31.2201, the VSP had to be rebooted to recover the DEAD interface, which affected
customer deployments that had to reboot the entire system for the LAN controllers to recover
from such an error.

IOCXGBE B.11.31.2201 adds a feature to recover DEAD interfaces manually in DIO
cases, avoiding the system reboot. The sections below describe a scheme that attempts to return
such DEAD adapters to a usable state without a full system reboot on blade servers.

NOTE: This feature does not support Superdome 2 and other non-blade servers.

Detailed explanation of DIO manual recovery


a) Overview of DIO manual recovery solution
b) Limitations of DIO manual recovery
c) Deliverables for DIO manual recovery
d) Accessing and Installing from the Software Depot
e) hpnicrecovery dio_recover and resume option
f) Manual DIO recovery
g) Manual DIO recovery and auto recovery of non-DIO interfaces execution

a) Overview of DIO manual recovery solution


• The solution is built on top of the existing automatic recovery solution.
• A new option, "dio_recover", is added to the auto recovery hpnicrecovery script.
• The manual recovery script has to be run from the VSP.
• This solution works only for the LOM and Mezzanine variants of the adapter in
  NIC personality mode.
• Non-blade servers are not supported.
• Recovery can be attempted on the following cards:
  o LOM
    - NC553i - Dual Port FlexFabric 10Gb BL8X0c-FCoE-LOM i4
  o Mezzanine
    - NC553m 10Gb 2-port FlexFabric Converged Network Adapter
    - NC552m Dual-Port Flex-10 10GbE BL-c Adapter
• This solution also works when the interface is configured under APA, VLAN or vswitch,
  or exported to a guest (HPVM/vPAR) as an AVIO device, although interfaces
  that are not used as DIO can be recovered by the auto recovery solution.
• Only a single recovery should be attempted, by choosing any one of the DEAD ports
  belonging to the failed LAN controller chip. This is sufficient to recover all the ports
  belonging to the failed chip. Take care not to run more than one
  recovery attempt in parallel on the failed PPAs.

b) Limitations of DIO manual recovery:


• Card firmware version 4.9.416.17 or later is required.
• Recovery is not performed for LOM/Mezzanine adapters whose personality setting is
  FCoE. The personality setting of the adapter is FCoE by default. Therefore, for this
  solution to work, the setting has to be changed to NIC from the Device Manager at
  boot time.
• Unclaimed port detection: detection of a port in the unclaimed state on a guest is not
  supported. If the primary port is in the unclaimed state on the guest, the manual recovery
  script will still execute and the VSP will encounter an MCA (Machine Check Abort).
  Therefore, follow the recovery script instructions carefully and, before
  running the recovery script, double-check that all the DIO interfaces are in the CLAIMED
  state on all the guests.
• DEAD port detection: for example, if all the ports belonging to the dead adapter are
  exported to a guest, there is currently no way to detect the DEAD state. Make
  sure that you perform the recovery operation on a DEAD card; otherwise it may impact
  your system.
• Recovery proceeds only if the guests to which the dead adapter ports are exported are in
  the ON state. If the guests are in the OFF/EFI shell state, remove the dead adapter ports
  from the corresponding guest and keep them in the DIO pool.
• Recovery proceeds only when the port is in the dead state. It does not proceed if the
  specified port(s) is/are in the online/offline/suspended state.
• Recovery should not be attempted on a VSP with a single CPU, as it can cause CPU
  starvation issues. Add an extra CPU back to the VSP from the VM pool and bind the recovery
  operation to it.
• Recovery is supported only on systems with HPVM version 6.5 (latest HPVM patches as of
  January 2022) installed.

c) Deliverables for DIO manual recovery


The following should be installed on the server to support the feature:
• Latest version of the 10GigEthr-03 (iocxgbe) driver (B.11.31.2201)
Depot: 10GigEthr-03_B.11.31.WR2201_HP-UX_B.11.31_IA.depot
• Patch Name: PHNE_44904 Patch Description: LAN cumulative patch
Depot: PHNE_44904.depot
• Patch Name: PHNE_44903 Patch Description: 11.31 Networking Commands
cumulative patch
Depot: PHNE_44903.depot

With the installation of the mentioned patches and depots, the Manual DIO recovery
feature is available for use.

d) Accessing and Installing from the Software Depot


The manual DIO recovery feature can be accessed and installed on supported systems from the HPE Software Depot.
To install the driver bundle and the dependent patches on the server with a single reboot, follow these steps:
Step 1: Log in to the server as root.
Step 2: Back up the server before installing the product.
Step 3: Download all the depots into a directory (for example, /tmp).
Step 4: Verify that the depots were downloaded correctly using the following command:
swlist -d -s /tmp/<depot_filename>
For example:
# swlist -d -s /tmp/10GigEthr-03_B.11.31.WR2201_HP-UX_B.11.31_IA.depot
# 10GigEthr-03 B.11.31.2201 PCIe 10 GbE;Supptd HW=580151/610609/613431B21,NC551/552/553,AT094/111/118A

# swlist -d -s /tmp/PHNE_44904.depot
# PHNE_44904 1.0 LAN cumulative patch
# swlist -d -s /tmp/PHNE_44903.depot
# PHNE_44903 1.0 Networking commands cumulative patch

Step 5: Create a new directory (say /tmp/hpe_patches).


Step 6: Copy all the downloaded depots into this directory using the swcopy command:
swcopy -s /tmp/<depot_filename> \* @ /target_directory
For example:
# swcopy -s /tmp/10GigEthr-03_B.11.31.WR2201_HP-UX_B.11.31_IA.depot \* @ /tmp/hpe_patches

# swcopy -s /tmp/PHNE_44903.depot \* @ /tmp/hpe_patches

# swcopy -s /tmp/PHNE_44904.depot \* @ /tmp/hpe_patches

The “hpe_patches” directory is now a mega bundle that contains all the downloaded depots. Installing the mega bundle effectively installs all the depots together on the server.
Step 7: Install the /tmp/hpe_patches bundle using the swinstall tool:
swinstall -s /tmp/<depot_filename>
For example:
# swinstall -x autoreboot=true -s /tmp/hpe_patches/ \*

Important: Run this command on the server where the product is to be installed, running standalone; do not perform this step over the network. When the installation is complete, the server reboots.
Step 8: To verify that the 10 Gigabit Ethernet driver installation was successful, use this command:
what /stand/vmunix /stand/current/mod/* | grep iocxgbe
For example:
# what /stand/vmunix /stand/current/mod/* | grep iocxgbe
/stand/current/mod/iocxgbe:
iocxgbe_ilan Version: 3 Jan 11 2022
iocxgbe Revision: IOCXGBE_B.11.31.WR2201 Jan 11 2022
$Revision: iocxgbe: B.11.31.2201_LR
/stand/current/mod/iocxgbe.prep:
#
Step 9: To verify that the patch installation was successful, use these commands:
swverify <patch_name>
For example:
# swverify PHNE_44903
# swverify PHNE_44904
Ensure that there are no errors displayed in the output.
Step 10: Verify that the auto recovery daemon is active by using this command:
ps -ef | grep nicrecd
For example:
# ps -ef | grep nicrecd
root 14389 1 0 16:29:46 ? 0:00 /usr/sbin/nicrecd -a
#
Step 11: Verify that the manual DIO recovery option (dio_recover) is listed by hpnicrecovery, by using this command:
# /sbin/init.d/hpnicrecovery
Usage: hpnicrecovery start|stop|restart|recover|resume|dio_recover
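
The check in Step 11 can also be scripted. The following is a minimal sketch, assuming the usage text has the single-line form shown above; the helper name usage_has_option is hypothetical and not part of hpnicrecovery.

```shell
#!/bin/sh
# Hypothetical helper: reads hpnicrecovery usage text on stdin and succeeds
# if the "Usage:" line lists the given option among its |-separated choices.
usage_has_option() {
    awk -v opt="$1" '
        /^Usage:/ {
            n = split($NF, choices, "|")
            for (i = 1; i <= n; i++)
                if (choices[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# On the VSP the text would come from running the script with no arguments:
#   /sbin/init.d/hpnicrecovery 2>&1 | usage_has_option dio_recover
# Here the usage line from Step 11 is fed in directly.
echo "Usage: hpnicrecovery start|stop|restart|recover|resume|dio_recover" |
    usage_has_option dio_recover && echo "dio_recover is available"
```

The option is matched exactly rather than as a substring, so an older script that offers only recover is not mistaken for one that offers dio_recover.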

e) hpnicrecovery dio_recover and resume option


# /sbin/init.d/hpnicrecovery dio_recover -help

Default Usage:
hpnicrecovery dio_recover

Specific Usage:
hpnicrecovery dio_recover <VM_name> <Guest_HW_path>

VM_name : Guest Name of the DEAD interface


Guest_HW_path : DEAD interface Hardware path in the Guest which given in <VM_name>

For more detailed explanation, refer to the appendix "Detailed explanation of DIO manual recovery" in the White
Paper

# /sbin/init.d/hpnicrecovery resume -help

Usage:
hpnicrecovery resume <VSP/HOST HW_Path of the interface>

VSP/HOST HW_Path : Suspended state interface hardware path for dead recovery

For more detailed explanation, refer to the appendix "Detailed explanation of DIO manual recovery" in the White
Paper

f) Manual DIO recovery


This feature is made available in the form of a command under the hpnicrecovery script.
To recover from the DEAD state manually, execute the following command and select the dead
adapter from the list.
# /sbin/init.d/hpnicrecovery dio_recover

Shutting down NIC recovery daemon:


Done.
Recovery daemon is not running now.

Existing LAN/FC Configuration details of the machine:


===================================================

ID Owner Host-hardware-path Guest-hardware-path Description


=== ==== ================== =================== ============
1 host 0/0/0/3/0/0/0(lan0) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded NIC
2 host 0/0/0/3/0/0/1(lan1) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
3 host(fc) 0/0/0/3/0/0/3 - HP 10Gb PCIe 2-port Embedded FCoE Adapter
4 host 0/0/0/3/0/0/5(lan2) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
5 host 0/0/0/3/0/0/7(lan3) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
6 host 0/0/0/4/0/0/0(lan4) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
7 host 0/0/0/4/0/0/1(lan5) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
8 host(fc) 0/0/0/4/0/0/3 - HP 10Gb PCIe 2-port Embedded FCoE Adapter
9 host 0/0/0/4/0/0/5(lan6) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
10 host 0/0/0/4/0/0/7(lan7) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
11 host 0/0/0/5/0/0/0(lan8) - HP NC552m 2p 10GbE BL-c Mezzanine Adapter
12 host 0/0/0/5/0/0/1(lan9) - HP NC552m 2p 10GbE BL-c Mezzanine Adapter
13 dio 0/0/0/5/0/0/2 - HP NC552m 2p 10GbE BL-c Mezzanine Adapter
14 vm1 0/0/0/5/0/0/3 0/0/4/0(lan3) HP NC552m 2p 10GbE BL-c Mezzanine Adapter
15 vm2 0/0/0/5/0/0/4 0/0/4/0(lan2) HP NC552m 2p 10GbE BL-c Mezzanine Adapter
16 dio 0/0/0/5/0/0/5 - HP NC552m 2p 10GbE BL-c Mezzanine Adapter
17 vm3 0/0/0/5/0/0/6 0/0/4/0(lan2) HP NC552m 2p 10GbE BL-c Mezzanine Adapter
18 host 0/0/0/5/0/0/7(lan15) - HP NC552m 2p 10GbE BL-c Mezzanine Adapter
19 host 0/0/0/7/0/0/0(lan16) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
20 host 0/0/0/7/0/0/1(lan17) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
21 host(fc) 0/0/0/7/0/0/2 - HP 613433-001 10Gb PCIe 2-port OCm11102-F-HP FCoE
Mezzanine Adapter
22 host(fc) 0/0/0/7/0/0/3 - HP 613433-001 10Gb PCIe 2-port OCm11102-F-HP FCoE
Mezzanine Adapter
23 host 0/0/0/7/0/0/4(lan18) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
24 host 0/0/0/7/0/0/5(lan19) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
25 host 0/0/0/7/0/0/6(lan20) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
26 host 0/0/0/7/0/0/7(lan21) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
27 host 0/0/0/9/0/0/0(lan22) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
28 host 0/0/0/9/0/0/1(lan23) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
29 host(fc) 0/0/0/9/0/0/2 - HP 613433-001 10Gb PCIe 2-port OCm11102-F-HP FCoE
Mezzanine Adapter
30 host(fc) 0/0/0/9/0/0/3 - HP 613433-001 10Gb PCIe 2-port OCm11102-F-HP FCoE
Mezzanine Adapter
31 host 1/0/0/3/0/0/0(lan24) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
32 host 1/0/0/3/0/0/1(lan25) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
33 host 1/0/0/3/0/0/3(lan26) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
34 host 1/0/0/3/0/0/5(lan27) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
35 host 1/0/0/3/0/0/7(lan28) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC

36 host 1/0/0/4/0/0/0(lan29) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
37 host 1/0/0/4/0/0/1(lan30) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
38 host(fc) 1/0/0/4/0/0/3 - HP 10Gb PCIe 2-port Embedded FCoE Adapter
39 host 1/0/0/4/0/0/5(lan31) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
40 host 1/0/0/4/0/0/7(lan32) - HP Dual Port CNA 10Gb BL8X0c i4 Embedded CNIC
49 vpar8 1/0/0/7/0/0/0 0/0/0/4/0(lan1) HP NC553m 2p 10GbE BL-c Mezzanine Adapter
50 vm1 1/0/0/7/0/0/1 0/0/2/0(lan1) HP NC553m 2p 10GbE BL-c Mezzanine Adapter
51 vm2 1/0/0/7/0/0/2 0/0/2/0(lan1) HP NC553m 2p 10GbE BL-c Mezzanine Adapter
52 host 1/0/0/7/0/0/3(lan44) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
53 host 1/0/0/7/0/0/4(lan45) - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
54 vpar5 1/0/0/7/0/0/5 0/0/0/4/0(lan1) HP NC553m 2p 10GbE BL-c Mezzanine Adapter
55 dio 1/0/0/7/0/0/6 - HP NC553m 2p 10GbE BL-c Mezzanine Adapter
56 vpar7 1/0/0/7/0/0/7 0/0/0/4/0(lan1) HP NC553m 2p 10GbE BL-c Mezzanine Adapter
57 host 1/0/0/9/0/0/0(lan49) - HP NC552m 2p 10GbE BL-c Mezzanine Adapter
58 host 1/0/0/9/0/0/1(lan50) - HP NC552m 2p 10GbE BL-c Mezzanine Adapter

Above table shows the information of all the LAN ports distributed across
VMs/host/DIO pool on the system.

Compare the Guest H/W path of the dead adapter mentioned in the table.
Select the ID of the interface that should be recovered : 49 <-

Existing Configuration details of the selected adapter:

===================================================

Owner Host-hardware-path Guest-hardware-path


====== =================== ====================
vpar8 1/0/0/7/0/0/0 0/0/0/4/0(lan1)
vm1 1/0/0/7/0/0/1 0/0/2/0(lan1)
vm2 1/0/0/7/0/0/2 0/0/2/0(lan1)
host 1/0/0/7/0/0/3 (lan44) -
host 1/0/0/7/0/0/4 (lan45) -
vpar5 1/0/0/7/0/0/5 0/0/0/4/0(lan1)
dio 1/0/0/7/0/0/6 -
vpar7 1/0/0/7/0/0/7 0/0/0/4/0(lan1)

Above table shows the information of all the ports belonging


to the selected adapter distributed across VMs/host.

Verifying selected card is in dead state or not :

Selected adapter is on dead state.

Please verify hpvmstatus shows all the owner VM/Vpar guests listed above are in ON state
Also S/W state of each port is in CLAIMED from ioscan output
running on each VM/Vpar guest

Confirm all ports are in CLAIMED state? (y/n) :y

Dead Adapter Recovery Operation starting ::

Suspending h/w path 1/0/0/7/0/0/0 in vpar8 (0/0/0/4/0(lan1))


Done.
Suspending h/w path 1/0/0/7/0/0/1 in vm1 (0/0/2/0(lan1))
Done.
Suspending h/w path 1/0/0/7/0/0/2 in vm2 (0/0/2/0(lan1))
Done.
Suspending h/w path 1/0/0/7/0/0/3 in HOST
Done.
Suspending h/w path 1/0/0/7/0/0/4 in HOST

Done.
Suspending h/w path 1/0/0/7/0/0/5 in vpar5 (0/0/0/4/0(lan1))
Done.
Suspending h/w path 1/0/0/7/0/0/6 in DIO pool
Done.
Suspending h/w path 1/0/0/7/0/0/7 in vpar7 (0/0/0/4/0(lan1))
Done.

Attempting Bus reset for DEAD Adapter...


Done.

Resuming the Dead interface(s)...


NOTE: It takes a while to resume each interface on each guest/VSP. Be patient.
It takes much longer than making them suspended above. Wait here for enough time.

Resuming h/w path 1/0/0/7/0/0/0 in vpar8 (0/0/0/4/0(lan1))


Done.
Resuming h/w path 1/0/0/7/0/0/1 in vm1 (0/0/2/0(lan1))
Done.
Resuming h/w path 1/0/0/7/0/0/2 in vm2 (0/0/2/0(lan1))
Done.
Resuming h/w path 1/0/0/7/0/0/3 in HOST
Done.
Resuming h/w path 1/0/0/7/0/0/4 in HOST
Done.
Resuming h/w path 1/0/0/7/0/0/5 in vpar5 (0/0/0/4/0(lan1))
Done.
Recovering port 1/0/0/7/0/0/6 in DIO pool
Done.
Resuming h/w path 1/0/0/7/0/0/7 in vpar7 (0/0/0/4/0(lan1))
Done.

Manual Recovery operation succeeded.


Verify the status in 'nwmgr -g -q info -c lan<ppa>' on VSP/guest.

NOTE:
• When a DEAD state is reported, all the ports of that adapter are affected and marked as DEAD. Attempt recovery by choosing one of the DEAD ports belonging to the failed LAN controller chip. Executing a single recover command is sufficient to recover all the ports belonging to that failed chip.
• The solution cannot verify the DEAD state of the adapter in all cases, and recovery may proceed even when a port is online/offline/suspended, so select the dead adapter from the list carefully. Recovery on a non-dead card might badly affect your system, so attempt the manual DIO recovery only on a dead adapter.
• Follow the instructions very carefully; wrong input may badly affect your system and may even end in a system crash.
• Two or more simultaneous manual recoveries, whether on the same or on different interfaces, cannot be executed on a system. The second or any later attempt fails until the currently executing command has finished.
• Before attempting the manual recovery, check the syslog and /var/opt/hprecovery/hprecoverylog to confirm that auto recovery failed because of a DIO port, and make a note of the dead adapter.
ERROR: Interface 1/0/0/7/0/0/1 is in Guest OS DIO pool!!
Nov 07 06:43:46 hpbl1-s2 nicrecd[26732]: Interface: 1/0/0/7/0/0 recovery failed
Nov 07 06:43:46 hpbl1-s2:if some of the port(s) on the same NIC hardware are used for DIO,
Nov 07 06:43:46 hpbl1-s2:nicrecd cannot recover. Try manual recocvery by: /sbin/init.d/hpnicrecovery dio_recover
• Get admin support if the manual DIO recovery attempt fails.
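
The log check described in the note above can be reduced to a one-line filter. The following is a minimal sketch, assuming the failure lines have the format of the sample shown in the note; the helper name failed_interfaces is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and prints the hardware path
# from each nicrecd "Interface: <path> recovery failed" line.
failed_interfaces() {
    sed -n 's|.*nicrecd\[[0-9]*\]: Interface: \([0-9/]*\) recovery failed.*|\1|p'
}

# On the VSP, the input would be the syslog or the recovery log named above, e.g.:
#   failed_interfaces < /var/opt/hprecovery/hprecoverylog
# Here the sample lines from the note are used instead.
failed_interfaces <<'EOF'
ERROR: Interface 1/0/0/7/0/0/1 is in Guest OS DIO pool!!
Nov 07 06:43:46 hpbl1-s2 nicrecd[26732]: Interface: 1/0/0/7/0/0 recovery failed
Nov 07 06:43:46 hpbl1-s2:if some of the port(s) on the same NIC hardware are used for DIO,
EOF
```

The path printed is that of the adapter whose automatic recovery failed. It still has to be matched by hand against the table that dio_recover displays.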

g) Manual DIO recovery and auto recovery of non-DIO interfaces execution
Manual DIO recovery and auto recovery of non-DIO interfaces cannot run simultaneously. Before starting the DIO manual recovery command, check whether an automatic recovery operation is running for a DEAD interface on the VSP. If it is, wait for the automatic recovery operation to complete; its status can be checked in the /var/opt/hprecovery/hprecoverylog file.

In the following example, the user runs the dio_recover command while automatic recovery is running in the background, so the dio_recover command exits with a warning message.

# /sbin/init.d/hpnicrecovery dio_recover

Warning !!

Auto recovery script is running with pid:25355


Please try dio_recover later, after auto recovery is done.
Moniter the recovery log from /var/opt/hprecovery/hprecoverylog to see
if auto recovery is completed.

Logging
The following are examples of messages logged in the syslog when an iocxgbe adapter encounters a DEAD state:

Sep 23 11:42:57 hpsmbl3 vmunix: iocxgbe36/6725, 1474611177.435287 iocxgbe_mark_card_dead called from iocxgbe_watchdog_detected_dead.

Or

Sep 23 11:42:57 hpsmbl3 vmunix: iocxgbe36/6725, 1474611177.435287 iocxgbe_mark_card_dead called from iocxgbe_complete_cmd_timeout.

Or

Sep 23 11:42:57 hpsmbl3 vmunix: iocxgbe36/6725, 1474611177.435287 iocxgbe_mark_card_dead called from iocxgbe_complete_cmd.
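
DEAD-state events of this kind can be pulled out of a syslog with a short filter. The following is a minimal sketch, assuming each message occupies a single syslog line in the layout of the samples in this section; the helper name dead_events is hypothetical, and the meaning of the iocxgbeNN/NNNN tag is not interpreted, only echoed.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog text on stdin and, for each
# "iocxgbe_mark_card_dead called from <caller>." message, prints the
# timestamp, the iocxgbe tag from the line, and the calling routine.
dead_events() {
    awk '/iocxgbe_mark_card_dead called from/ {
        tag = "unknown"
        for (i = 1; i <= NF; i++)
            if ($i ~ /^iocxgbe[0-9]+\/[0-9]+,$/) { tag = $i; sub(/,$/, "", tag) }
        caller = $NF
        sub(/\.$/, "", caller)
        print $1, $2, $3, tag, caller
    }'
}

# Sample input in the layout of the syslog samples in this section.
dead_events <<'EOF'
Sep 23 11:42:57 hpsmbl3 vmunix: iocxgbe36/6725, 1474611177.435287 iocxgbe_mark_card_dead called from iocxgbe_watchdog_detected_dead.
EOF
```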

Auto recovery for non-DIO interface related logging:

• The following is an example of a message logged in the syslog when the daemon detects a DEAD state:
May 29 14:54:40 hpbl1-s3 vmunix: iocxgbe31/1467, 1432891480.871588 Driver will attempt to auto-recover from this condition.

• Once the recovery attempt is completed, a success or failure message is logged in the syslog:

May 29 14:56:03 hpbl1-s3.in.rdlabs.hpecorp.net nicrecd[5273]: Interface: 0/0/0/5/0/0 recovered successfully.

Or

Nov 23 00:15:39 hpbl1-s2 nicrecd[22880]: Interface: 0/0/0/5/0/0 recovery failed
Nov 23 00:15:39 hpbl1-s2:if some of the port(s) on the same NIC hardware are used for DIO,
Nov 23 00:15:39 hpbl1-s2:nicrecd cannot recover. Try manual recocvery by: /sbin/init.d/hpnicrecovery dio_recover

• The following messages are examples of what is logged in the syslog if an interface has reported the DEAD state more than five times within 24 hours:

Sep 21 16:34:27 lansec724 nicrecd[14389]: CAUTION: 0/0/0/3/0/0 going to dead state too frequently!!
Sep 21 16:34:27 lansec724 nicrecd[14389]: It could be faulty. Please replace the part and continue...
Sep 21 16:34:27 lansec724 nicrecd[14389]: NIC Recovery operation failed at Interface 0/0/0/3/0/0.
Sep 21 16:34:27 lansec724 nicrecd[14389]: Manual NIC Recovery operation may be attempted.

• The auto recovery daemon start message during system boot up:
Start NIC auto-recovery daemon.................................... OK

• The auto recovery daemon stop message during system shutdown:
Stop NIC auto-recovery daemon.................................... OK

• Sample message displayed by the auto recovery daemon during system boot up without the LAN cumulative patch PHNE_44540 installed:
Start NIC auto-recovery daemon.................................... N/A

• Sample message displayed by the auto recovery daemon in rc.log during system boot up without the LAN cumulative patch PHNE_44540 installed:

Failed to start NIC recovery daemon!!
Please ensure the latest LAN cumulative patch is installed.

• Sample message displayed when the user attempts to manually recover an interface that is NOT in DEAD state:

# lanadmin -x recover 8

WARNING!!
Driver will reset the entire interface.
This means that ALL the lan/FCoE ports that are associated with this interface on the
adapter will be reset.
Recovery is not supported on an interface with one or more ports/functions exported to
the guest as a DIO device.
In such a case please abort this operation and reboot the VSP to recover the interface.
Do you really want to proceed? (y/n) [n]: y

ERROR: Interface not in dead state!!

• Sample message displayed when the user attempts to manually recover an interface on a guest DIO device:

# lanadmin -x recover 10

WARNING!!
Driver will reset the entire interface.
This means that ALL the lan/FCoE ports that are associated with this interface on the
adapter will be reset.
Recovery is not supported on an interface with one or more ports/functions exported to
the guest as a DIO device.
In such a case please abort this operation and reboot the VSP to recover the interface.
Do you really want to proceed? (y/n) [n]: y

ERROR: Cannot attempt recover on a DIO guest interface!!

#
• Sample message displayed when the user attempts to manually recover an interface on the host when some of its ports are exported to the guest DIO pool:

# lanadmin -x recover 21

WARNING!!
Driver will reset the entire interface.
This means that ALL the lan/FCoE ports that are associated with this interface on the
adapter will be reset.
Recovery is not supported on an interface with one or more ports/functions exported to
the guest as a DIO device.
In such a case please abort this operation and reboot the VSP to recover the interface.
Do you really want to proceed? (y/n) [n]:y

Hardware path : 46/0/1/2/0/0/1

ERROR: Interface is in Guest OS DIO pool!!

DIO manual recovery related logging:


DIO manual recovery execution messages are logged in the following log file:
/var/opt/hprecovery/hprecovery_dio_log

• Once the DIO manual recovery attempt is completed, a success message is logged in the syslog:

Nov 26 09:08:56: hpnicrecovery[8796] Interface: 0/0/0/5/0/0/x DEAD Adapter recovered successfully

• Sample message displayed when the user attempts to manually recover an interface that is NOT in DEAD state:
# /sbin/init.d/hpnicrecovery dio_recover

Verifying selected card is in dead state or not ...

Warning !!

Selected Card is not a DEAD card...


Do not try recovery operation on Non-DEAD card...

• Sample message displayed when the user attempts to manually recover an interface on a guest DIO device:

# /sbin/init.d/hpnicrecovery dio_recover

Warning !!
Manual recovery script not supported on Guest
If iocxgbe DIO interface goes DEAD,
Run manual recovery script from VSP

Different scenarios of DIO manual recovery operation logging:

I. Recovery operation on a non-dead adapter:

Here the dio_recover command is executed on the VSP and a non-DEAD adapter is selected from the list of ports shown in the command output:

# /sbin/init.d/hpnicrecovery dio_recover

Verifying selected card is in dead state or not ...

Warning !!

Selected Card is not a DEAD card...


Do not try recovery operation on Non-DEAD card...

II. Recovery operation on an adapter with FCoE personality enabled

Here the dio_recover command is executed on the VSP and an adapter with FCoE personality is selected from the list of ports shown in the command output:

# /sbin/init.d/hpnicrecovery dio_recover

ERROR !!

FCoE personality card recovery operation not supported.


Recovery solution has a limitation.
Please review "b) Limitations of DIO manual recovery" in White Paper.
In this case, the system reboot is required to recover from DEAD state.
To make manual recovery possible in future, change the personality mode setting to
"NIC" mode from the Device Manager during the boot time of the VSP.

III. Recovery operation when a guest to which a dead adapter port is exported is in OFF/EFI state.
Here the dio_recover command is executed on the VSP and a DEAD adapter is selected from the list of ports shown in the command output, but the guest “vm2”, to which dead adapter port “1/0/0/7/0/0/2” is exported, is in OFF/EFI state.

# /sbin/init.d/hpnicrecovery dio_recover

1/0/0/7/0/0/2 belonging vm2 is on OFF state

Delete the corresponding port from vm2 and keep it in DIO pool

Also find all ports which belongs to VM/Vpar guest OFF/EFI shell state from the
below list,
Delete the corresponding port from those VM/Vpar guest and keep it in DIO pool
Then rerun the script ...

Note:
In this case, port “1/0/0/7/0/0/2” must be removed from vm2 with hpvmmodify -d for the duration of the recovery. After the recovery is completed, it can be re-added to vm2 with hpvmmodify -a.

IV. Recovery operation when a dead adapter port is in UNCLAIMED state on a guest.

Here the dio_recover command is executed on the VSP and a DEAD adapter is selected from the list of ports shown in the command output, but one of the dead adapter ports, “1/0/0/7/0/0/2”, which is exported to guest “vm2” as port “0/0/2/0”, is in UNCLAIMED state.

# hpvmdevinfo | grep "1/0/0/7/0/0/"


vm1 lan [0,2,0xEA513CF1781D] hwpath 1/0/0/7/0/0/1 0/0/2/0 (lan1)
vm2 lan [0,2,0xC2E71B9CD4EB] hwpath 1/0/0/7/0/0/2 0/0/2/0 (lan1)
vpar5 lan [0,4,0x3A8CA29FB414] hwpath 1/0/0/7/0/0/5 0/0/0/4/0 (lan1)
vpar7 lan [0,4,0x7AD4E37EE336] hwpath 1/0/0/7/0/0/7 0/0/0/4/0 (lan1)
vpar8 lan [0,4,0xD2587F8BF718] hwpath 1/0/0/7/0/0/0 0/0/0/4/0 (lan1)

Example: one of the ports is in UNCLAIMED state on the vm2 guest:

vm2# ioscan -kfnC lan


Class I H/W Path Driver S/W State H/W Type Description
===================================================================
lan 0 0/0/0/0 igssn CLAIMED INTERFACE HP IGSSN PCI AVIO LAN Adapter
lan 1 0/0/2/0 UNCLAIMED UNKNOWN PCIe Ethernet (15900060)
lan 2 0/0/4/0 iocxgbe CLAIMED INTERFACE HP NC552m 2p 10GbE BL-c Mezzanine Adapter

# /sbin/init.d/hpnicrecovery dio_recover

Verifying selected card is in dead state or not …

Selected adapter is on dead state.

Please verify hpvmstatus shows all the owner VM/Vpar guests listed above are in ON state
Also S/W state of each port is in CLAIMED from ioscan output
running on each VM/Vpar guest

Confirm all ports are in CLAIMED state? (y/n) :n

Warning !!

IF any of port is not in claimed state or port belonging VM/Vpar guest is in OFF/EFI shell state
Then Delete the corresponding port from those VM/Vpar guest and keep it in DIO pool
And rerun the script ...

Note:
In this case, port “1/0/0/7/0/0/2” must be removed from vm2 with hpvmmodify -d for the duration of the recovery. After the recovery is completed, it can be re-added to vm2 with hpvmmodify -a.
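
Because the script cannot detect UNCLAIMED ports by itself, the ioscan output on each guest has to be inspected before answering the CLAIMED confirmation prompt. The following is a minimal sketch of that inspection, assuming the column layout of the ioscan -kfnC lan output shown in this scenario; the helper name unclaimed_lan_paths is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "ioscan -kfnC lan" output on stdin and prints
# the H/W path (third column) of every lan entry whose S/W state is UNCLAIMED.
unclaimed_lan_paths() {
    awk '$1 == "lan" {
        for (i = 4; i <= NF; i++)
            if ($i == "UNCLAIMED") { print $3; break }
    }'
}

# On a guest, the input would come from:  ioscan -kfnC lan | unclaimed_lan_paths
# Here the sample output from this scenario is used.
unclaimed_lan_paths <<'EOF'
Class I H/W Path Driver S/W State H/W Type Description
lan 0 0/0/0/0 igssn CLAIMED INTERFACE HP IGSSN PCI AVIO LAN Adapter
lan 1 0/0/2/0 UNCLAIMED UNKNOWN PCIe Ethernet (15900060)
lan 2 0/0/4/0 iocxgbe CLAIMED INTERFACE HP NC552m 2p 10GbE BL-c Mezzanine Adapter
EOF
```

Any path printed is a guest hardware path. The matching host hardware path can then be looked up in the hpvmdevinfo output, as in the example above.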

V. Recovery operation when a guest to which a dead adapter port is exported is in the booting stage.

Here the dio_recover command is executed on the VSP and a DEAD adapter is selected from the list of ports shown in the command output, but the guest “vm2”, to which dead adapter port “1/0/0/7/0/0/2” is exported, is in the booting stage.

# /sbin/init.d/hpnicrecovery dio_recover

Warning !!

Failed to suspend the h/w path:1/0/0/7/0/0/2

Verify the S/W state of this port is in UNCLAIMED on Guest


Run "ioscan -kfnH " on "vpar2"

If it is on unclaimed state,
Delete the port only from the Guest and keep it in DIO pool
Then rerun the script..

Note:
In this case, port “1/0/0/7/0/0/2” must be removed from vm2 with hpvmmodify -d for the duration of the recovery. After the recovery is completed, it can be re-added to vm2 with hpvmmodify -a.

VI. Recovery operation on a dead adapter followed by manually resuming the adapter ports.
Here the dio_recover command is executed on the VSP and a DEAD adapter is selected from the list of ports shown in the command output, but the recovery operation exits before completing successfully and shows the message below.

# /sbin/init.d/hpnicrecovery dio_recover

Resuming h/w path 1/0/0/7/0/0/0 in vpar8 (0/0/0/4/0(lan1))


Done.
Resuming h/w path 1/0/0/7/0/0/1 in vm1 (0/0/2/0(lan1))
Done.

WARNING !!
In this condition do not rerun script again,
it may panic the VSP, So manually resume
each port of the DEAD adapter mentioned below;

Interface are not yet recovered fully..


Some of Interface are not yet resumed.
on either of VSP or the guest owning the interface

Resume the interfaces one by one manually


from hardware path /0 to /x
if the interface is in host/DIO/Guest
Run the resume command only from VSP
Please use below command:

/sbin/init.d/hpnicrecovery resume <VSP/HOST hardware path>

Owner Host-hardware-path Guest-hardware-path


====== =================== ====================
vpar8 1/0/0/7/0/0/0 0/0/0/4/0(lan1)
vm1 1/0/0/7/0/0/1 0/0/2/0(lan1)
dio 1/0/0/7/0/0/2 -
host 1/0/0/7/0/0/3 (lan44) -
host 1/0/0/7/0/0/4 (lan45) -
vpar5 1/0/0/7/0/0/5 0/0/0/4/0(lan1)
dio 1/0/0/7/0/0/6 -
vpar7 1/0/0/7/0/0/7 0/0/0/4/0(lan1)
#

Note:
In this case, ports “1/0/0/7/0/0/0” and “1/0/0/7/0/0/1” were resumed successfully. To complete the recovery operation, resume the remaining ports manually using the resume option.

# /sbin/init.d/hpnicrecovery resume 1/0/0/7/0/0/2

Make sure 1/0/0/7/0/0/2 corresponding port is on suspend state

Confirm? (y/n) :y
Recovering port 1/0/0/7/0/0/2 in DIO pool
Done.

# /sbin/init.d/hpnicrecovery resume 1/0/0/7/0/0/3

Make sure 1/0/0/7/0/0/3 corresponding port is on suspend state


Confirm? (y/n) :y
Resuming h/w path 1/0/0/7/0/0/3
Done.

# /sbin/init.d/hpnicrecovery resume 1/0/0/7/0/0/4

Make sure 1/0/0/7/0/0/4 corresponding port is on suspend state


Confirm? (y/n) :y
Resuming h/w path 1/0/0/7/0/0/4
Done.

# /sbin/init.d/hpnicrecovery resume 1/0/0/7/0/0/5

Make sure 1/0/0/7/0/0/5 corresponding port is on suspend state


Confirm? (y/n) :y
Resuming h/w path 1/0/0/7/0/0/5
Done.

# /sbin/init.d/hpnicrecovery resume 1/0/0/7/0/0/6

Make sure 1/0/0/7/0/0/6 corresponding port is on suspend state


Confirm? (y/n) :y
Recovering port 1/0/0/7/0/0/6 in DIO pool
Done.

# /sbin/init.d/hpnicrecovery resume 1/0/0/7/0/0/7

Make sure 1/0/0/7/0/0/7 corresponding port is on suspend state


Confirm? (y/n) :y
Resuming h/w path 1/0/0/7/0/0/7
Done.

HPE recommended HA configuration


The automatic adapter recovery mechanism is not transparent to the user: it may take up to two minutes to recover a DEAD adapter, and there may be network downtime during that time.
HPE strongly recommends an HA/APA configuration on the server. It minimizes the impact of a DEAD card by failing traffic over seamlessly to a different, healthy adapter.
Do not use the same interface for both standby and active. Be sure that they are not linked to the same interconnect.

References
For more information on HP-UX 10 Gigabit Ethernet drivers, see http://www.hpe.com/info/10-gigabit-ethernet-docs
