LSI3108 RAID card reset on RH2288H V3

Publication Date: 2018-02-21
Issue Description

The customer found that disk I/O occasionally hangs.

 

Handling Process

1. Check the SEL and FDM logs. No alert or error message was found.

2. Check the OS log. It contains the following error messages:

Jan  4 04:24:30 srv5431 kernel: megaraid_sas 0000:01:00.0: 1811 (568322617s/0x0020/DEAD) - Fatal firmware error: Line 2001 in ../../bbu/onfi.c
Jan  4 04:24:30 srv5431 kernel: megaraid_sas 0000:01:00.0: Iop2SysDoorbellIntfor scsi0
Jan  4 04:24:30 srv5431 kernel: megaraid_sas 0000:01:00.0: Found FW in FAULT state, will reset adapter scsi0.
Jan  4 04:24:30 srv5431 kernel: megaraid_sas 0000:01:00.0: resetting fusion adapter scsi0.
Jan  4 04:24:40 srv5431 kernel: megaraid_sas 0000:01:00.0: Waiting for FW to come to ready state
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: FW now in Ready state
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: Current firmware maximum commands: 928#011 LDIO threshold: 0
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: FW supports sync cache#011: No
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: Init cmd success
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: firmware type#011: Extended VD(240 VD)firmware
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: controller type#011: MR(2048MB)
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: Online Controller Reset(OCR)#011: Enabled
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: Secure JBOD support#011: Yes
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: Reset successful for scsi0.
Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: 1814 (568322619s/0x0020/CRIT) - Controller encountered a fatal error and was reset

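These messages confirm that the I/O hang was caused by a RAID controller reset: the firmware was found in the FAULT state at 04:24:30 and the reset completed at 04:25:14, so I/O on scsi0 was stalled for about 44 seconds.

3. Check the release notes of the latest driver release from LSI. They contain an entry describing this issue.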
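
The log check in step 2 can be automated. The following is a minimal Python sketch, not a vendor tool: it scans syslog lines for the two megaraid_sas messages shown above ("Found FW in FAULT state" and "Reset successful") and reports how long each reset lasted. The function names, the regular expression, and the default year are assumptions made for illustration. Syslog timestamps carry no year, so the year must be supplied by the caller.

```python
import re
from datetime import datetime

# Message fragments copied from the log excerpt in this article.
FAULT_MARKER = "Found FW in FAULT state"
RESET_OK_MARKER = "Reset successful"

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <host> kernel: megaraid_sas <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+megaraid_sas\s+(.*)$"
)


def parse_time(stamp, year):
    """Parse a syslog timestamp; the year is not in the log and must be given."""
    normalized = " ".join(stamp.split())  # collapse the padded day field
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def find_resets(lines, year=2018):
    """Return a list of (fault_time, recovered_time, seconds) tuples."""
    resets = []
    fault_time = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        when = parse_time(match.group(1), year)
        message = match.group(2)
        if FAULT_MARKER in message:
            fault_time = when
        elif RESET_OK_MARKER in message and fault_time is not None:
            resets.append((fault_time, when, (when - fault_time).total_seconds()))
            fault_time = None
    return resets


# Two lines taken from the log excerpt above.
sample = [
    "Jan  4 04:24:30 srv5431 kernel: megaraid_sas 0000:01:00.0: "
    "Found FW in FAULT state, will reset adapter scsi0.",
    "Jan  4 04:25:14 srv5431 kernel: megaraid_sas 0000:01:00.0: "
    "Reset successful for scsi0.",
]

for fault, recovered, seconds in find_resets(sample):
    print(f"reset started {fault:%H:%M:%S}, finished {recovered:%H:%M:%S}, "
          f"stalled {seconds:.0f} s")
```

To scan a whole log file, pass an open file object instead of the sample list, for example `find_resets(open(path))`, where `path` points to the OS log on your system.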

 

Root Cause

1. There is a race condition when the ONFI (Open NAND Flash Interface) interrupt status is checked for the idle condition, which ultimately results in a timeout. At run time, the ONFI engine is accessed whenever the cache switches between “Dirty” and “Non-Dirty”.

2. How often the issue occurs depends entirely on how soon the timing window is hit in which both processors access the ONFI module at the same time. The issue may be hit regularly if the rate at which the cache is switched to Dirty by incoming I/Os is equal to, or a multiple of, the rate at which I/Os complete and data, including parity, is committed to the disks.

3. Any server product that uses the LSI3108 may encounter this issue, but only if the firmware version is lower than 4.660.00-8102.
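
To check whether a given controller falls in the affected range, its firmware version can be compared with 4.660.00-8102. The Python sketch below assumes that firmware versions always consist of numeric fields separated by "." and "-" and that they are ordered field by field from left to right. That ordering rule is an assumption, not something stated in the release note. The function names and the example versions other than 4.660.00-8102 are made up for illustration.

```python
import re

FIXED_VERSION = "4.660.00-8102"  # first fixed version according to this article


def version_key(version):
    """Split a version such as '4.660.00-8102' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version.strip()))


def is_affected(version, fixed=FIXED_VERSION):
    """Return True if the version sorts lower than the fixed version."""
    return version_key(version) < version_key(fixed)


# Hypothetical version strings, used only to exercise the comparison.
for candidate in ("4.650.00-6121", "4.660.00-8102", "4.680.00-8290"):
    state = "affected" if is_affected(candidate) else "not affected"
    print(f"{candidate}: {state}")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when fields have different widths, for example 4.70 compared with 4.660.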

Solution

Upgrade the LSI 3108 firmware to version 4.660.00-8102. Download link: http://support.huawei.com/enterprise/en/software/22785464-SW1000284121

 

END