Fault Symptom: After deployment, controller B was reported as having lost communication and could no longer start work.
Network: 8G FC SAN
The system cannot monitor controller (Controller Enclosure CTE0, controller B). Error code: --.
The communication between controller (A) and (B) is abnormal in enclosure (controller enclosure CTE0), but the system can continue to work. Error code: 0x4000cf12.
1. The customer connected the faulty controller (B) directly to a laptop with a serial cable and found that the controller was still alive but booted straight into "minisystem" mode. In this situation, a remote session with the customer is needed to check the controller status.
2. During the remote session, we executed "showbootmode" and found that the controller was in rescue mode, which means the controller had encountered 3 abnormal resets within 30 minutes. We tried to restore the boot mode to normal (by executing "restorebootmode" and restarting the controller), but the controller reset 3 more times and entered rescue mode again.
3. We exported the system log of controller B with an SFTP tool and found that the controller had encountered a panic reset. The stack dump is below:
panic reason:stack-protector: Kernel stack is corrupted in: ffffffbfa2f496c0
CPU 9 <pid:0:0:swapper/9>
[<ffffffbfa2795ae0>] kbox_dump_backtrace+0x0/0x120 [kbox]
[<ffffffbfa2789978>] kbox_show_trace+0x1e0/0x2a0 [kbox]
[<ffffffbfa278a1b0>] kbox_show_task_kernel_info+0x1c8/0x4d8 [kbox]
[<ffffffbfa278a52c>] kbox_print_specified_tasks+0x6c/0x88 [kbox]
[<ffffffbfa278324c>] kbox_panic_notifier_callback+0x24c/0x348 [kbox]
[<ffffffbfa2f496c0>] IOC_ProcessMultiIstsIob+0x7d0/0x830 [unflowlevel_ioc]
[<ffffffbfa2f3d5e4>] IOC_ProcessRespQueue+0x7a4/0x1330 [unflowlevel_ioc]
4. According to the stack information, the software bug is related to the SmartIO card driver. We therefore removed the SmartIO card from controller B, restored the boot mode to normal again, and rebooted the controller. Controller B then started working again.
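The rescue-mode trigger mentioned in step 2 (3 abnormal resets within 30 minutes) can be pictured as a sliding-window check. The C sketch below is only an illustration of that rule: the structure `reset_history`, the function `record_abnormal_reset`, and the use of seconds as the time unit are hypothetical names and assumptions, not the actual firmware logic. It also assumes timestamps are passed in non-decreasing order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESET_LIMIT 3u            /* abnormal resets, as stated in step 2 */
#define WINDOW_SECS (30u * 60u)   /* 30 minutes, as stated in step 2 */

/* Hypothetical record of the most recent abnormal reset times, in seconds. */
struct reset_history {
    uint64_t times[RESET_LIMIT];
    size_t   count;               /* number of valid entries, at most RESET_LIMIT */
};

/* Record an abnormal reset at time `now` (assumed non-decreasing).
 * Returns true when the latest RESET_LIMIT resets all fall within
 * WINDOW_SECS, i.e. when this sketch would switch to rescue mode. */
static bool record_abnormal_reset(struct reset_history *h, uint64_t now)
{
    if (h->count == RESET_LIMIT) {
        /* Drop the oldest entry to make room for the new one. */
        for (size_t i = 1; i < RESET_LIMIT; i++) {
            h->times[i - 1] = h->times[i];
        }
        h->count--;
    }
    h->times[h->count++] = now;

    return h->count == RESET_LIMIT && (now - h->times[0]) <= WINDOW_SECS;
}
```

Under these assumptions, a controller that panics on every boot reaches the limit quickly, which matches the observed behaviour: restoring the boot mode alone does not help while the cause of the panic is still present.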
In the SmartIO card driver of V3 storage, following the SCSI protocol, we parse only 96 bytes of sense data for a SCSI command, but the driver copies the sense data received from the remote device directly, regardless of its length. Storage from some other vendors, such as IBM DS series storage, may send sense data longer than 96 bytes.
In this case, if the customer connects Huawei storage and another vendor's (IBM) storage to the same FC fabric and the SmartIO card ports work in INI and TGT mode, V3 storage can establish an FC session with the other vendor's storage because of the SmartVirtualization feature. The remote storage may then send "abnormal" sense data to V3 storage and cause a panic reset.
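The root cause above is an unbounded copy into a buffer sized for 96 bytes, which is consistent with the "stack-protector: Kernel stack is corrupted" panic reason in the dump. The C sketch below shows the general defensive pattern of clamping the copy length to the buffer size. It is not the actual driver code or the actual fix: the function `copy_sense_bounded` and the constant `SENSE_BUF_LEN` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SENSE_BUF_LEN 96u   /* number of sense bytes the driver parses, per the text */

/* Hypothetical helper: copy sense data received from a remote device into a
 * fixed-size local buffer, never writing more than SENSE_BUF_LEN bytes.
 * Returns the number of bytes actually copied; any excess is discarded. */
static size_t copy_sense_bounded(uint8_t dst[SENSE_BUF_LEN],
                                 const uint8_t *src, size_t src_len)
{
    assert(dst != NULL);
    if (src == NULL) {
        return 0;
    }

    /* Clamp to the destination size instead of trusting the remote length. */
    size_t n = src_len < SENSE_BUF_LEN ? src_len : SENSE_BUF_LEN;
    memcpy(dst, src, n);
    return n;
}
```

With a pattern like this, sense data longer than 96 bytes from a remote array would be truncated rather than written past the end of the local buffer.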
Any of the operations below can solve the problem; choose the option that best fits your situation.
Option 1: Upgrade the system software. This issue was first found in V300R005C00SPC300, V300R005C01 and V300R006C00, and was fixed in V300R006C00SPC100 and V300R003C20SPC200.
Option 2: Change the FC fabric configuration to isolate Huawei storage from the other vendor's storage by zoning on the FC switch. This option does not work if the customer wants to use SmartVirtualization to migrate data from the other vendor's storage.
Option 3: Similar to option 2, change the SmartIO card port mode from INI and TGT to TGT only. (CLI command format: change port fc port_mode=TGT port_id=xxxxx)