Controller is faulty

Publication Date: 2017-04-20
Issue Description

Two OceanStor 2200 V3 storage devices reported the controller failure alarm (alarm ID: 0xF00CF005), and the error code was 0x4000cf3b.

2017-01-25
18:32:19    0xF00CF005F    Major   None    Controller (Controller Enclosure CTE0, controller B, item 03057201, SN 210305720110GB000107) is faulty. Error code: 0x4000cf3b.

2017-01-30
22:20:33    0xF00CF005F    Major   None    Controller (Controller Enclosure CTE0, controller B, item 03057201, SN 210305720110GC000075) is faulty. Error code: 0x4000cf3b.



Handling Process

Working Principle of the Slow Disk Alarm

The SAS driver calculates the average I/O service time every 30 minutes. If the average for a period exceeds 100 ms, that period is considered a slow period. If the slow periods within 24 hours add up to 21 hours (42 of the 48 periods), the slow disk alarm is reported (error code: 0x4000cf3b).
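The reporting rule can be sketched in Python as follows. This is only an illustration, not the driver's code: the function name is hypothetical, and the sketch assumes that "within 24 hours" means a sliding window over the most recent 48 periods, which the article does not state.

```python
from collections import deque

PERIOD_MINUTES = 30
SLOW_THRESHOLD_MS = 100.0
WINDOW_PERIODS = 24 * 60 // PERIOD_MINUTES  # 48 periods in 24 hours
ALARM_PERIODS = 21 * 60 // PERIOD_MINUTES   # 42 periods make up 21 hours


def should_report_slow_disk(avg_service_times_ms):
    """Return True if 42 or more of any 48 consecutive periods are slow.

    avg_service_times_ms: per-period average I/O service times in ms,
    oldest first. The sliding window is an assumption of this sketch.
    """
    window = deque(maxlen=WINDOW_PERIODS)  # keeps only the last 48 flags
    for avg_ms in avg_service_times_ms:
        window.append(avg_ms > SLOW_THRESHOLD_MS)
        if sum(window) >= ALARM_PERIODS:
            return True
    return False
```

With this sketch, 42 slow periods followed by 6 normal ones trigger the alarm, while 41 slow periods do not.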




Calculation formula:

(Total time of a period – Total idle time in the period)/Number of I/Os in the period

As shown in the preceding figure, if, for example, two I/Os occur in a 30-minute period, their average I/O service time is [30 minutes – (Idle1 + Idle2 + Idle3)]/2. If this value exceeds 100 ms, the period is a slow period.
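The formula can be worked through with a short Python sketch. The function names and the idle-time values are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the system logs.

```python
PERIOD_MS = 30 * 60 * 1000   # one 30-minute period, in milliseconds
SLOW_THRESHOLD_MS = 100.0


def average_io_service_time_ms(idle_times_ms, io_count, period_ms=PERIOD_MS):
    """(Total time of a period - total idle time in the period) / number of I/Os."""
    if io_count <= 0:
        raise ValueError("io_count must be positive")
    busy_ms = period_ms - sum(idle_times_ms)
    return busy_ms / io_count


def is_slow_period(idle_times_ms, io_count):
    """A period is slow when its average I/O service time exceeds 100 ms."""
    return average_io_service_time_ms(idle_times_ms, io_count) > SLOW_THRESHOLD_MS


# Two I/Os and three idle intervals (illustrative values only).
normal = [600_000, 599_900, 600_000]   # 100 ms busy -> average 50 ms
slow = [600_000, 599_700, 600_000]     # 300 ms busy -> average 150 ms
print(average_io_service_time_ms(normal, 2), is_slow_period(normal, 2))
print(average_io_service_time_ms(slow, 2), is_slow_period(slow, 2))
```

Because the idle time is subtracted from a fixed period length, any error in the idle-time accounting feeds directly into the computed average, which is why the analysis below focuses on how idle time is counted.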


Analysis of System Logs
Based on the analysis of system logs, the SAS driver confirmed that the system responded slowly to I/Os, because the average I/O service time exceeded 100 ms. Therefore, the slow disk alarm was reported.

1. The system disk logs contained no I/O timeout or SMART exception information, so the slow disk alarm was not caused by a hardware fault of the system disk.

2. The system logs showed that the total idle time recorded for a period exceeded the total time of the period (30 minutes). Code review confirmed that a logic defect in the software caused too much idle time to be counted.




As shown in the preceding figure, when the system disk has only a few I/Os, the value of DeltaIdle is greater than the actual average I/O service time (IO1 + IO2). Therefore, the average I/O service time calculated from [30 minutes – (Idle1 + Idle2 + Idle3 + … + DeltaIdle)]/(I/O count) exceeds 100 ms, which is abnormal. When this occurs in periods totaling 21 hours (42 periods), the controller failure alarm is reported.

Root Cause

When the system disk of a controller had only a few I/Os, the SAS driver incorrectly determined a slow disk, causing the system to report a false controller failure alarm.

Solution

V300R005C00SPH301 resolves this issue. Upgrade the storage system to V300R005C00SPH301.
