The replacement controller failed to start

Publication Date: 2016-06-30
Issue Description

The customer reported that one controller (controller A) of an OceanStor 18500 was faulty. A spare controller was requested and installed as a replacement, but the replacement controller failed to start.

Alarm Information

After logging in to controller B, engineers found the following error message on controller B:

Handling Process

Engineers logged in to controller B and confirmed that controller A failed to start because it did not pass the configuration check items. By checking the cable connections on site, engineers found that the root cause was incorrect cabling between the disk enclosures in loop 0 and controller B. Controller A was correctly connected to disk enclosures DAE000, DAE001, and DAE002 through SAS cables. Controller B, however, was connected to the three disk enclosures in reverse connection mode. That is, SAS expansion port P0 on controller B should have been connected to a PRI port on interface module B of DAE000, but was actually connected to a PRI port on interface module B of DAE002. In the following figure, red links indicate the incorrect cable connections.


In the SAS cable connection scenario illustrated in the preceding figure, controller A identified DAE000 as the disk enclosure where the coffer disks reside. However, the first disk enclosure in loop 0 connected to controller B was DAE002, so controller B identified DAE002 as the coffer disk enclosure. After the replacement controller A synchronized its version and powered on again, the storage system checked the coffer disk enclosure and found that the information was inconsistent, because the actual coffer disk enclosure was DAE000. Controller A therefore failed to start, which protects the system and preserves information consistency.
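The check described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the storage system's actual startup logic: the function names are hypothetical, each controller is assumed to treat the first enclosure on its loop as the coffer disk enclosure, and the position of DAE001 in each chain is assumed.

```python
from typing import Dict, List, Optional


def first_enclosure(chain: List[str]) -> Optional[str]:
    """Return the enclosure a controller sees first on its loop, if any."""
    return chain[0] if chain else None


def check_coffer_consistency(recorded: str, chains: Dict[str, List[str]]) -> List[str]:
    """Return the controllers whose first enclosure differs from the recorded one.

    `recorded` stands for the coffer disk enclosure stored in the DB
    (hypothetical representation; the real record format is not shown here).
    """
    return [
        ctrl
        for ctrl, chain in sorted(chains.items())
        if first_enclosure(chain) != recorded
    ]


# Cabling as described in this case: controller A forward, controller B reversed.
loop0 = {
    "A": ["DAE000", "DAE001", "DAE002"],
    "B": ["DAE002", "DAE001", "DAE000"],
}

mismatched = check_coffer_consistency("DAE000", loop0)
print(mismatched)  # ['B']
```

With both controllers cabled in forward connection mode, the same function returns an empty list, which corresponds to the state at first startup.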

DAE000 served as the coffer disk enclosure because when the storage system started for the first time, the cable connections were correct. Both controllers A and B were connected to disk enclosures in loop 0 in forward connection mode. Therefore, DAE000 was identified as the coffer disk enclosure and recorded into the DB.

The following figure shows some system log information, where node Id = 0 indicates controller A, node Id = 1 indicates controller B, loopNum = 0 indicates loop 0, wwn indicates the WWN of a disk enclosure, and userFrameId = 0x100 indicates that disk enclosure 0x5486276ff745f03f is DAE000. It can be inferred from the log information that DAE000 had been correctly connected to controllers A and B before disk enclosure DAE002 was added.

Later, disk enclosure DAE002 was added to loop 0. During that operation, SAS expansion port P0 on controller B was disconnected, even though the connection to this port does not need to change when a disk enclosure is added to loop 0. This indicates that the cables were probably reconnected incorrectly at that time. See the following log information:

2015-07-06 09:42:04 DST    0xf0060002    Major    2015-07-06 09:42:19 DST    The SAS expansion port (Controller Enclosure ENG0, SAS interface module B0, port number P0) is disconnected.

2015-07-06 09:42:26 DST    0x100f0ce000b    Infor    None    Disk Enclosure(ID DAE002) has been inserted.
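A short script can make the timing relationship between these two events explicit. The sketch below assumes that columns are separated by runs of two or more spaces, as in the excerpt above, and that the first column is the event time; the meaning of the fourth column is not stated in this article, so it is read but not interpreted. `parse_line` is a hypothetical helper, not part of any Huawei tool.

```python
import re
from datetime import datetime

LOG_LINES = [
    "2015-07-06 09:42:04 DST    0xf0060002    Major    2015-07-06 09:42:19 DST    "
    "The SAS expansion port (Controller Enclosure ENG0, SAS interface module B0, "
    "port number P0) is disconnected.",
    "2015-07-06 09:42:26 DST    0x100f0ce000b    Infor    None    "
    "Disk Enclosure(ID DAE002) has been inserted.",
]


def parse_line(line):
    """Split one event line into its columns (assumed layout, see lead-in)."""
    fields = re.split(r"\s{2,}", line.strip())
    stamp, event_id, level, _fourth, description = fields
    # Drop the trailing time zone label ("DST") before parsing the timestamp.
    when = datetime.strptime(stamp.rsplit(" ", 1)[0], "%Y-%m-%d %H:%M:%S")
    return {"time": when, "event_id": event_id, "level": level,
            "description": description}


events = [parse_line(line) for line in LOG_LINES]
disconnect = next(e for e in events if "is disconnected" in e["description"])
insertion = next(e for e in events if "has been inserted" in e["description"])

# Seconds between the port disconnection and the enclosure insertion.
gap_seconds = (insertion["time"] - disconnect["time"]).total_seconds()
print(gap_seconds)  # 22.0
```

The 22-second gap shows that port P0 on controller B was unplugged immediately before DAE002 appeared, which is consistent with the cable being moved during the expansion.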

Root Cause
When DAE002 was added on 2015-07-06, the SAS cable from SAS expansion port P0 on controller B was removed from its port on coffer disk enclosure DAE000 and connected to a port on DAE002. As a result, when the replacement controller A started, the storage system found that controller A identified DAE000 as the coffer disk enclosure, whereas controller B identified DAE002 because of the reverse connection. The startup check detected this inconsistency, and the startup failed.

The preferred solution is to change the cable connections non-disruptively. However, controller A cannot work properly at the moment, so controller B is processing all services, and it is the SAS cabling to controller B that is incorrect and must be rectified. Changing the cable connections in this state would interrupt host services. Therefore, this solution cannot be implemented.

The second preferred solution is to install a patch without interrupting services. The patch would temporarily skip the coffer disk enclosure check during startup, and the cable connections would be changed after controller A is working properly. However, a patch requires time-consuming test verification, whereas the customer allowed only a three-hour change window. Therefore, the patch solution is not suitable either.

The last choice is to suspend services and change cable connections.

The following figure shows the networking change diagram provided for the customer. The SAS cable connection between SAS expansion port P0 on controller B and the port on DAE002 as well as the SAS cable connection between DAE002 and DAE000 are adjusted. For details, see the red lines in the following figure:



Connect the related SAS cables between the controller enclosure and the disk enclosures as shown in the Huawei guide.