Per the driver logs,
on PF0 the firmware did not respond to the GetPportStats ioctl for 32 sec,
after which the driver issued a reset.
cpu27:32950)WARNING: elxnet:elxnet_asyncWorldWait:3655: [vmnic0] GetPportStats:
Checkpoint 3 (36 sec) No resp for MCC cmd opcode: 0x12,
cpu27:32950)WARNING: elxnet:elxnet_asyncWorldWait:3673: [vmnic0] GetPportStats:
MCC cmd timed out. opcode:0x12, subsystem:0x3, timeout:0, req_len:656
cpu27:32950)WARNING: elxnet: elxnet_generateUE:54:
Injecting fatal error for post-mortem dump
Prior to the ioctl timeout, the VMkernel re-initiated the Tx queues about
774 times in roughly 13 minutes, i.e. about once per second (between
2016-04-25T04:03:47 and 2016-04-25T04:16:39).
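
The rate quoted above can be checked directly from the two timestamps. The
snippet below is only an arithmetic check; the count of 774 is taken from the
analysis as an approximate figure, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamps and count as quoted in the analysis.
first = datetime.fromisoformat("2016-04-25T04:03:47")
last = datetime.fromisoformat("2016-04-25T04:16:39")
reinit_count = 774  # approximate number of Tx queue re-initiations

window_sec = (last - first).total_seconds()
sec_per_reinit = window_sec / reinit_count

print(f"window: {window_sec:.0f} s, one re-init every {sec_per_reinit:.2f} s")
# prints "window: 772 s, one re-init every 1.00 s"
```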
The driver and VMkernel kept trying to restart the firmware; however,
because the failure is in the firmware, the driver is working as expected. We
therefore moved our focus to the firmware dump analysis.
In the FW code, a procedure periodically checks the TX pipeline and
re-initializes it. There are 24 transmit threads, and the procedure visits each
one in turn and initializes it. There is a
corner case in which one of the threads can be left in the disabled state; if
traffic or a mailbox command is then assigned to that thread, a stall can
occur. The fix in FW is to ensure that all threads are in the enabled state
before the initialization procedure returns.
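
The corner case and the fix can be illustrated with a small model. This is
only a sketch and not the actual FW code: the real trigger for the corner case
is not described here, so the hypothetical skipped_index parameter simply
stands in for "whatever causes one thread to miss its re-enable step". All
names (TxThread, init_pipeline_buggy, init_pipeline_fixed, dispatch_ok) are
invented for illustration.

```python
NUM_TX_THREADS = 24  # number of transmit threads, per the analysis


class TxThread:
    def __init__(self):
        self.enabled = True


def init_pipeline_buggy(threads, skipped_index=None):
    """Re-initialize every thread; one thread may be left disabled."""
    for i, thread in enumerate(threads):
        thread.enabled = False      # thread is disabled while being initialized
        if i == skipped_index:      # hypothetical stand-in for the corner case
            continue                # re-enable step is missed
        thread.enabled = True


def init_pipeline_fixed(threads, skipped_index=None):
    """Same procedure, plus a final check before leaving initialization."""
    init_pipeline_buggy(threads, skipped_index)
    for thread in threads:
        if not thread.enabled:      # the fix: never return with a disabled thread
            thread.enabled = True


def dispatch_ok(threads, index):
    """True if work assigned to this thread can proceed, False if it would stall."""
    return threads[index].enabled


threads = [TxThread() for _ in range(NUM_TX_THREADS)]
init_pipeline_buggy(threads, skipped_index=5)
print(dispatch_ok(threads, 5))   # False: work for thread 5 would stall
init_pipeline_fixed(threads, skipped_index=5)
print(dispatch_ok(threads, 5))   # True
```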
The current occurrence rate is about 1 in thousands of MZ910 units. This is a
FW code issue, but it happens very rarely. The probability
can in principle be estimated from the fact that there are 24 threads which are
initialized periodically and that, with some small probability, one of them is
left in the disabled state. If traffic or a command assigned to that thread
arrives at that moment, we may hit the issue. Because the next periodic
initialization clears the disable bit, the exposure window is short, which
makes the issue rare.
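
One way to put this reasoning into a formula is sketched below. It is a rough
model under stated assumptions, not a measured result: work is assumed to
arrive as a Poisson process spread evenly over the 24 threads, and a stuck
thread is assumed to stay disabled until the next periodic initialization. The
parameters p_stuck, arrival_rate_hz and init_period_sec are hypothetical
inputs; none of their values are known from the logs or the dump.

```python
import math

NUM_TX_THREADS = 24  # from the analysis; everything else is an assumption


def hit_probability(p_stuck, arrival_rate_hz, init_period_sec,
                    n_threads=NUM_TX_THREADS):
    """Estimated chance that one init cycle leads to a stall.

    p_stuck          -- assumed chance that a cycle leaves one thread disabled
    arrival_rate_hz  -- assumed total rate of traffic/mailbox requests
    init_period_sec  -- assumed time until the next init clears the disable bit
    """
    per_thread_rate = arrival_rate_hz / n_threads
    # Chance that at least one request lands on the stuck thread in the window.
    p_work_in_window = 1.0 - math.exp(-per_thread_rate * init_period_sec)
    return p_stuck * p_work_in_window


# Example call with made-up inputs, only to show how the pieces combine.
estimate = hit_probability(p_stuck=1e-9, arrival_rate_hz=100.0,
                           init_period_sec=1.0)
print(estimate)
```

Under this model the result can never exceed p_stuck, so the overall rate is
bounded by how often the initialization leaves a thread disabled.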