A CH121 V3 server running VMware ESXi 6 uses an MZ910 NIC and attaches OceanStor storage as the back-end storage device over an iSCSI link. Alarms reported lost connectivity to the storage device, and the server froze.
Host OS with Support Pack – VMware ESXi 6.0
Guest/VM OS Details:
Adapter(s) Under Test - MZ910
Driver version installed – elxnet 10.4.370.0, lpfc: 10.4.203.5
Firmware – 10.2.590.0
CH121 Firmware Version:
Network/SAN Configuration - Running @ FCoE
According to the driver logs, on PF0 the firmware did not respond to the GetPportStats ioctl for 32 sec, after which the driver issued a reset.
2016-04-25T04:17:16.059Z cpu27:32950)WARNING: elxnet:elxnet_asyncWorldWait:3655: [vmnic0] GetPportStats: Checkpoint 3 (36 sec) No resp for MCC cmd opcode: 0x12, subsystem:0x3,timeout:0, req_len:656
2016-04-25T04:17:16.059Z cpu27:32950)WARNING: elxnet:elxnet_asyncWorldWait:3673: [vmnic0] GetPportStats: MCC cmd timed out. opcode:0x12, subsystem:0x3, timeout:0, req_len:656
2016-04-25T04:17:16.059Z cpu27:32950)WARNING: elxnet: elxnet_generateUE:54: [vmnic0] Injecting fatal error for post-mortem dump
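When many such warnings have to be reviewed, the fields can be pulled out of each line with a short script. The following is a minimal sketch, assuming the line layout seen in the excerpt above; the function name parse_elxnet_warning and the field names are illustrative only and are not part of the elxnet driver or of any VMware tool. The number after "cpuNN:" is assumed to be the world ID.

```python
import re

# Layout assumed from the excerpt above:
# <timestamp> cpu<N>:<world>)WARNING: elxnet:<function>:<source line>: [<nic>] <message>
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"WARNING:\s*elxnet:\s*(?P<func>\w+):(?P<src_line>\d+):\s*"
    r"\[(?P<nic>\w+)\]\s*(?P<message>.*)$"
)
# key:value pairs carried in the MCC timeout messages
FIELD_RE = re.compile(r"(opcode|subsystem|timeout|req_len):\s*(0x[0-9a-fA-F]+|\d+)")


def parse_elxnet_warning(line):
    """Return a dict of fields for one elxnet WARNING line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # int(value, 0) accepts both hexadecimal ("0x12") and decimal ("656") text
    record["fields"] = {key: int(value, 0)
                        for key, value in FIELD_RE.findall(record["message"])}
    return record


sample = ("2016-04-25T04:17:16.059Z cpu27:32950)WARNING: elxnet:elxnet_asyncWorldWait:3673: "
          "[vmnic0] GetPportStats: MCC cmd timed out. opcode:0x12, subsystem:0x3, "
          "timeout:0, req_len:656")
record = parse_elxnet_warning(sample)
print(record["nic"], record["func"], record["fields"])
```

Grouping the parsed records by NIC and opcode makes it easier to see whether the timeouts are limited to one port, as they are here for vmnic0.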
Prior to the ioctl timeout, the VMkernel re-initiated the Tx queues about 774 times in roughly 13 minutes, i.e. about once per second (between 2016-04-25T04:03:47 and 2016-04-25T04:16:39).
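The "once per second" figure follows directly from the two timestamps and the count quoted above. A small check, using only those three values:

```python
from datetime import datetime

# Values quoted in the analysis above
start = datetime.fromisoformat("2016-04-25T04:03:47")
end = datetime.fromisoformat("2016-04-25T04:16:39")
reinit_count = 774

window_seconds = (end - start).total_seconds()   # length of the observation window
rate_per_second = reinit_count / window_seconds  # Tx queue re-initiations per second

print(f"window: {window_seconds:.0f} s, rate: {rate_per_second:.2f} per second")
```

The window is 12 min 52 s, so 774 events correspond to almost exactly one event per second.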
The driver and the VMkernel kept restarting the firmware; however, because the failure was in the firmware, the driver itself was working as expected. We therefore moved our focus to the firmware dump analysis.
In the FW code, a procedure checks the TX pipeline and initializes it periodically. There are 24 transmitting threads, and the procedure visits each one of them and initializes it. There is a corner case in which one of the threads can be left in the disabled state; if traffic or a mailbox command then needs to use that thread, a stall can occur. The fix in FW is to ensure that all threads are in the enabled state before the initialization procedure exits.
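The corner case and the fix can be illustrated with a toy model. This is only a sketch and not the firmware code: the class and function names are hypothetical, and the "interrupted" flag merely stands in for whatever condition leaves a thread disabled, which the analysis above does not specify.

```python
NUM_TX_THREADS = 24  # number of transmitting threads, from the analysis above


class TxThread:
    def __init__(self):
        self.enabled = True


def reinit_thread(thread, interrupted=False):
    """Disable the thread, re-initialize it, then enable it again."""
    thread.enabled = False
    if interrupted:
        return  # hypothetical corner case: exit before the thread is re-enabled
    thread.enabled = True


def periodic_init_buggy(threads, interrupted_index=None):
    """Visit every thread; one of them may be left disabled."""
    for index, thread in enumerate(threads):
        reinit_thread(thread, interrupted=(index == interrupted_index))


def periodic_init_fixed(threads, interrupted_index=None):
    """Same pass, but verify every thread is enabled before leaving."""
    periodic_init_buggy(threads, interrupted_index)
    for thread in threads:
        if not thread.enabled:
            thread.enabled = True


def submit(threads, index):
    """Traffic or a mailbox command assigned to a disabled thread stalls."""
    return "ok" if threads[index].enabled else "stall"


threads = [TxThread() for _ in range(NUM_TX_THREADS)]
periodic_init_buggy(threads, interrupted_index=5)
print(submit(threads, 5))   # request lands on the disabled thread
periodic_init_fixed(threads, interrupted_index=5)
print(submit(threads, 5))   # thread was re-enabled before the pass returned
```

In the model, a request that lands on the disabled thread after the buggy pass stalls, while the final check in the fixed pass guarantees that no thread is left disabled when the pass returns.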
The current occurrence rate is roughly one in thousands of MZ910 cards. This is a FW code issue, but it happens very rarely. The probability could in principle be estimated from the fact that 24 threads are initialized periodically and that, with some small probability, one of them is left in the disabled state. If traffic or a command assigned to that thread arrives at that moment, the issue can occur. It is rare because the next periodic initialization clears the disable bit.
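One way to make this reasoning concrete is the rough model below. It is an assumption-laden sketch, not a measured result: q (the chance that one initialization pass leaves a thread disabled) and the number of requests arriving before the next pass are unknown inputs, and requests are assumed to be spread uniformly and independently over the 24 threads.

```python
def stall_probability_per_cycle(q, requests_per_cycle, threads=24):
    """Estimate the chance that one initialization cycle ends in a stall.

    q                  -- assumed probability that a pass leaves one thread disabled
    requests_per_cycle -- assumed number of requests arriving before the next pass
    threads            -- number of transmitting threads (24 per the analysis above)
    """
    # Probability that at least one request is assigned to the disabled thread
    p_hit = 1.0 - (1.0 - 1.0 / threads) ** requests_per_cycle
    return q * p_hit


# Placeholder inputs for illustration only; these are not measured values.
print(stall_probability_per_cycle(q=1e-9, requests_per_cycle=10))
```

Under this model the stall probability per cycle can never exceed q, so the overall rarity is driven mainly by how seldom a pass leaves a thread disabled.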
The fix in FW is to ensure that the threads are in the enabled state before the initialization procedure exits. Please upgrade the MZ910 firmware to 10.2.590.1; the download link is below: