OceanStor 6800 V3 storage reported that controllers could not be monitored and that three controllers were isolated

Publication Date: 2016-06-16
Issue Description

While an OceanStor 6800 V3 was in use, the storage system reported that three controllers were isolated and could not be monitored.

Alarm Information

The alarm information is as follows:

Handling Process
Three controllers (B, C, and D) underwent a self-healing reset within a few minutes of each other. After the controllers restarted, the system could not be loaded because the management board had failed, so the three controllers could not rejoin the system. On site, the entire storage system was powered down, all faulty management boards were replaced, and the storage system was powered on again. All controllers then returned to normal.
Root Cause

Analysis shows that the three controllers were isolated because controllers B, C, and D each underwent a self-healing reset at nearly the same time. The log is as follows:

The latest NO.1 reset: localorcmostime=1453740225, ji=244711951, reason=failure recovery reset

Desktime=2016-01-26-00:43:45

The latest NO.1 reset: localorcmostime=1453740137, ji=245402264, reason=failure recovery reset

Desktime=2016-01-26-00:42:17

The latest NO.1 reset: localorcmostime=1453740374, ji=244748102, reason=failure recovery reset

Desktime=2016-01-26-00:46:14
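
The timestamps in the log above can be cross-checked with a short script. The following is a minimal Python sketch, not a vendor tool: it assumes that localorcmostime is a Unix epoch value in seconds and that Desktime is local time at UTC+8, an offset inferred only from the fact that the two fields then agree. The names LOG, PATTERN, UTC8 and parse_resets are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone

# The three reset records quoted in the log above.
LOG = """\
The latest NO.1 reset: localorcmostime=1453740225, ji=244711951, reason=failure recovery reset
The latest NO.1 reset: localorcmostime=1453740137, ji=245402264, reason=failure recovery reset
The latest NO.1 reset: localorcmostime=1453740374, ji=244748102, reason=failure recovery reset
"""

PATTERN = re.compile(r"localorcmostime=(\d+), ji=(\d+), reason=(.+)")

# Assumption: Desktime is local time at UTC+8 (inferred from the log values).
UTC8 = timezone(timedelta(hours=8))


def parse_resets(text):
    """Return (epoch, local datetime, reason) tuples sorted by epoch."""
    resets = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            epoch = int(match.group(1))
            local = datetime.fromtimestamp(epoch, tz=UTC8)
            resets.append((epoch, local, match.group(3).strip()))
    return sorted(resets)


resets = parse_resets(LOG)
for epoch, local, reason in resets:
    print(local.strftime("%Y-%m-%d-%H:%M:%S"), reason)

# Seconds between the earliest and the latest of the three resets.
spread = resets[-1][0] - resets[0][0]
print("spread in seconds:", spread)
```

Under these assumptions the converted times match the Desktime values in the log, and the three resets fall within 237 seconds of each other.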

 

The self-healing process is as follows:

•  The host multipathing software periodically sends an INQ (inquiry) command to the controller to query the location information of the storage port. The command arrives at the TGT module of the storage system. TGT acquires a kernel lock and performs the query through an internal interface. At the same time, an internal monitoring thread of the storage system periodically acquires the same kernel lock to query the device status.

•  When the management board becomes abnormal, the internal monitoring thread takes longer to obtain the device status, so it is late in releasing the kernel lock. TGT keeps waiting for the monitoring thread to release the lock. The system detects that TGT has occupied the CPU for a long time, treats this as an abnormality, and triggers self-healing, which resets the controller.
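
The lock contention described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the storage firmware: a Python threading.Lock stands in for the kernel lock, a sleep stands in for the slow device-status query, and a lock-acquire timeout stands in for the system check that triggers self-healing. All names (device_lock, monitor_thread, tgt_handle_inquiry, simulate) and the timing values are hypothetical.

```python
import threading
import time

device_lock = threading.Lock()        # stands in for the shared kernel lock
monitor_has_lock = threading.Event()  # lets the model order the two threads


def monitor_thread(query_seconds):
    """Monitoring thread: holds the lock while it queries device status."""
    with device_lock:
        monitor_has_lock.set()
        time.sleep(query_seconds)  # a faulty management board makes this slow


def tgt_handle_inquiry(watchdog_limit):
    """TGT: needs the same lock to answer the host's INQ command."""
    acquired = device_lock.acquire(timeout=watchdog_limit)
    if acquired:
        device_lock.release()
        return "served"
    # Simplification: the timeout models the system deciding that TGT
    # has been stuck for too long.
    return "self-healing reset"


def simulate(query_seconds, watchdog_limit):
    """Run one monitor query and one INQ that arrives while it is running."""
    monitor_has_lock.clear()
    worker = threading.Thread(target=monitor_thread, args=(query_seconds,))
    worker.start()
    monitor_has_lock.wait()  # the INQ arrives after the monitor took the lock
    outcome = tgt_handle_inquiry(watchdog_limit)
    worker.join()
    return outcome


healthy = simulate(query_seconds=0.01, watchdog_limit=0.5)
faulty = simulate(query_seconds=0.5, watchdog_limit=0.1)
print("healthy board:", healthy)
print("faulty board:", faulty)
```

In this model a fast status query lets TGT obtain the lock in time, while a slow query makes TGT wait past the limit, which corresponds to the reset described in the second step.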

 

On V3 series storage, the management board stores the storage system image, and each controller boots the system from the management board at power-on. Because the management board had failed, the controllers could not load the system again after they restarted, and they were therefore isolated.

Solution
Power down the entire storage system, replace the management board, and then power on the storage system to recover all controllers.
Suggestions

On V3 series storage, the management board stores the storage system image, and each controller boots the system from the management board at power-on. If the management board fails, a controller that restarts cannot load the system again and becomes isolated. In this situation, the isolated controllers can be recovered only by replacing the management board.
