OceanStor S2600T V100R005C10-The communication between controllers is abnormal, but the system is functioning properly. The error code is [2]

Publication Date: 2016-06-23
Issue Description

ISM reported the following alarm on an OceanStor S2600T:

The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].

Alarm Information
Level: Major
Occurred At: 2016-05-08 06:31:57
Details: The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].
Handling Process

1. Ask the customer to collect the logs.

2. Physical link analysis:

We checked the logs and found no error codes on the physical links.

3. Configuration file analysis:

On this S2600T, the RAID group is RAID 5 built from 11 4 TB NL-SAS disks. NL-SAS and SATA disks are known to have a relatively high failure rate, so we do not recommend configuring RAID 5 on NL-SAS disks. We also checked LUN ownership and found that the LUNs are owned by both controller A and controller B. An NL-SAS or SATA disk has only one I/O access interface, so when LUNs owned by both controllers access the same disk, I/O piles up on the storage array. We therefore recommend that the customer assign the LUNs to a single controller.

 

4. Storage log analysis:

1) We checked the events in the storage log and found that CPU usage was high on 8 May.

//2016-05-08 01:41:52    0x12020e0085    Infor    None    CPU (99)% usage on controller(1) is too high.    None

//2016-05-08 06:40:05    0x12020e0085    Infor    None    CPU (99)% usage on controller(1) is too high.    None

//2016-05-08 07:30:12    0x12020e0085    Infor    None    CPU (99)% usage on controller(1) is too high.    None

//2016-05-08 09:37:07    0x12020e0085    Infor    None    CPU (99)% usage on controller(0) is too high.    None

//2016-05-08 10:28:19    0x12020e0085    Infor    None    CPU (99)% usage on controller(0) is too high.    None

2) The high CPU usage events were reported after the customer enabled performance recording.

2016-05-08 01:37:01    0x200e02110007    Infor    None    hpadmin:192.168.16.75 set the performance statistics policy (statistic-period 5 second, archiving-switch-status 1, statistic-days 7) successfully.    None.

2016-05-08 01:37:20    0x200e02110001    Infor    None    hpadmin:192.168.16.75 set the status of the performance statistics switch to (switch-status 1) successfully.

3) The storage system then repeatedly reported the abnormal heartbeat (inter-controller communication) alarm, which in most cases recovered within a few seconds.

2016-05-08 06:31:57    0xe0204004a    Major    2016-05-08 06:32:02    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 07:39:13    0xe0204004a    Major    2016-05-08 07:39:14    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 08:28:56    0xe0204004a    Major    2016-05-08 08:29:01    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 08:29:18    0xe0204004a    Major    2016-05-08 08:29:19    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 10:47:25    0xe0204004a    Major    2016-05-08 10:47:27    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 10:57:59    0xe0204004a    Major    2016-05-08 10:58:02    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 12:20:48    0xe0204004a    Major    2016-05-08 12:20:49    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 12:34:26    0xe0204004a    Major    2016-05-08 12:34:32    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

2016-05-08 12:34:32    0xe0204004a    Major    2016-05-08 14:51:19    The communication between controllers is abnormal, but the system is functioning properly. The error code is [2].    Step 1 Replace controller A. If the fault persists, replace controller B. If the fault persists, go to step 2.

4) The storage array received many abort (ABTS) commands from the host and accepted them (BLS_ACC). This means that I/O was being processed slowly.

//[2016-05-08 12:27:10]---->ABTS:0x81010200 0x00010607 0x00090000 0x0a000000 0x00058117 0x00000000

//[2016-05-08 12:27:10]<----BLS_ACC(0x0):0x84010607 0x00010200 0x00980000 0x00000000 0x00058553 0x00000000 0x00000000 0x00058117 0x0000ffff

When the customer shut down all hosts at 13:00, the heartbeat and CPU usage returned to normal. At 14:26:57 the storage array was restarted.

2016-05-08 14:26:57    0x200e020e0016    Infor    None    admin:192.168.17.74 restarted the system successfully.    None.

After the reboot finished and the customer restarted the VMs, we checked disk usage on controllers A and B and found that it was up to 95%.

According to the customer, only 50 VMs are running on the system at present and another 50 VMs have not been started. If the customer runs more than 100 VMs on this system, the already high disk usage will cause I/O to pile up on the storage array, and CPU usage will rise. If performance recording is enabled at the same time, it consumes additional CPU resources and CPU usage rises to 100%. The heartbeat link is then blocked and data cannot be synchronized between the two controllers.
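The alarm timing in the log excerpts above can be cross-checked with a short script. The following is only an illustrative sketch: it copies the occurred/recovered timestamps of the 0xe0204004a entries from the log, computes how long each alarm stayed open, and checks that every alarm was raised after the performance statistics switch was turned on at 01:37:20. The names `ALARMS`, `STATS_ENABLED`, `alarm_durations` and `all_after` are hypothetical helpers introduced here, not part of any Huawei tool.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (occurred, recovered) pairs copied from the 0xe0204004a log entries.
ALARMS = [
    ("2016-05-08 06:31:57", "2016-05-08 06:32:02"),
    ("2016-05-08 07:39:13", "2016-05-08 07:39:14"),
    ("2016-05-08 08:28:56", "2016-05-08 08:29:01"),
    ("2016-05-08 08:29:18", "2016-05-08 08:29:19"),
    ("2016-05-08 10:47:25", "2016-05-08 10:47:27"),
    ("2016-05-08 10:57:59", "2016-05-08 10:58:02"),
    ("2016-05-08 12:20:48", "2016-05-08 12:20:49"),
    ("2016-05-08 12:34:26", "2016-05-08 12:34:32"),
    ("2016-05-08 12:34:32", "2016-05-08 14:51:19"),
]

# Time at which the performance statistics switch was set to 1 (from the log).
STATS_ENABLED = "2016-05-08 01:37:20"


def parse(ts):
    """Parse a log timestamp into a datetime object."""
    return datetime.strptime(ts, FMT)


def alarm_durations(alarms):
    """Return the open time of each alarm in whole seconds."""
    return [int((parse(rec) - parse(occ)).total_seconds()) for occ, rec in alarms]


def all_after(alarms, reference):
    """Return True if every alarm was raised after the reference time."""
    ref = parse(reference)
    return all(parse(occ) > ref for occ, _ in alarms)


if __name__ == "__main__":
    for (occ, _), secs in zip(ALARMS, alarm_durations(ALARMS)):
        print(f"{occ}  open for {secs} s")
    print("All alarms after statistics were enabled:", all_after(ALARMS, STATS_ENABLED))
```

With these timestamps, eight of the nine alarms cleared within 6 seconds, while the last one (raised at 12:34:32) stayed open until 14:51:19, which is after the hosts were shut down and the array was restarted.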

Root Cause

The root cause is heavy service load on the NL-SAS disks, which led to frequent I/O timeouts. The hosts then retried the read/write operations, so I/O piled up on the storage array and CPU usage rose. At the same time, the customer had enabled the performance recording function, which consumed additional CPU resources. Together, these caused the inter-controller communication alarm.

Solution

Recommendations:

1.    The RAID group is currently RAID 5 built from 11 disks. NL-SAS disks are known to have a relatively high failure rate, and if one disk fails, reconstruction on a 4 TB NL-SAS disk takes more than 24 hours. There is a certain probability that another disk fails during that time. We therefore recommend that the customer change the RAID level from RAID 5 to RAID 6.

2.    We recommend that the customer change the LUN ownership to a single controller.

http://support.huawei.com/ehedex/hdx.do?docid=DOC1000005747&lang=en&idPath=7919749|7941815|21430818|21462742|7345891&clientWidth=1350&browseTime=1462847376131

3.    We recommend that the customer disable performance recording.

4.    We recommend that the customer install the V100R005C01SPH905 hot patch, which optimizes the I/O handling process. It is scheduled for release in early June. (The storage system currently runs V1R5C01, which cannot be upgraded to V200R002C30.)

5.    Because the service load is heavy and disk usage is up to 95%, we recommend that the customer migrate VMs or expand capacity. The performance of NL-SAS disks is low; for VM workloads we recommend SAS or SSD disks.
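The trade-off behind recommendation 1 can be illustrated with simple arithmetic. The sketch below is not a sizing tool. It assumes that the same 11 disks of 4 TB each would be reused for RAID 6, uses nominal decimal capacities (1 TB = 1,000,000 MB), and ignores hot spares and formatting overhead. The function names are hypothetical.

```python
def usable_capacity_tb(disks, disk_tb, parity_disks):
    """Nominal usable capacity: total capacity minus the parity equivalent."""
    return (disks - parity_disks) * disk_tb


def implied_rebuild_rate_mb_s(disk_tb, hours):
    """Average rate needed to rewrite one whole disk in the given time."""
    return disk_tb * 1_000_000 / (hours * 3600)


if __name__ == "__main__":
    disks, disk_tb = 11, 4  # values from this case

    # RAID 5 keeps one disk's worth of parity and tolerates one disk failure.
    raid5 = usable_capacity_tb(disks, disk_tb, parity_disks=1)
    # RAID 6 keeps two disks' worth of parity and tolerates two disk failures.
    raid6 = usable_capacity_tb(disks, disk_tb, parity_disks=2)

    print(f"RAID 5 usable: {raid5} TB, RAID 6 usable: {raid6} TB")
    print(f"Capacity given up for the second parity: {raid5 - raid6} TB")

    # A rebuild that takes more than 24 hours means the average rate is below this.
    rate = implied_rebuild_rate_mb_s(disk_tb, hours=24)
    print(f"Rebuild rate at exactly 24 h: {rate:.1f} MB/s")
```

Under these assumptions, moving from RAID 5 to RAID 6 costs one disk of nominal capacity (40 TB down to 36 TB) in exchange for surviving a second disk failure during the reconstruction window of more than 24 hours.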

Suggestions
NA
