The customer reported that one controller (controller A) of an OceanStor 18500 was faulty. A spare controller was requested and installed as a replacement, but the replacement controller failed to start.
After logging in to controller B, engineers found the following error message on controller B:
Engineers logged in to controller B and confirmed that controller A failed to start because it did not pass the configuration check. By checking cable connections at the site, engineers found that the root cause was incorrect connections between the disk enclosures in loop 0 and controller B. Controller A was correctly connected to disk enclosures DAE000, DAE001, and DAE002 through SAS cables. However, controller B was connected to the three disk enclosures through SAS cables in reverse connection mode. That is, SAS expansion port P0 on controller B should have been connected to a PRI port on interface module B of DAE000 but was actually connected to a PRI port on interface module B of DAE002. In the following figure, red links indicate the incorrect cable connections.
In the SAS cable connection scenario illustrated in the preceding figure, controller A identified DAE000 as the disk enclosure where coffer disks reside. However, the first disk enclosure in loop 0 connected to controller B was DAE002. Therefore, controller B identified DAE002 as the disk enclosure where coffer disks reside. After the replacement controller A synchronized its version and powered on again, the storage system checked for the coffer disk enclosure and found that the information was inconsistent, because the actual coffer disk enclosure was DAE000. The storage system therefore prevented controller A from starting, in order to protect the system and ensure information consistency.
DAE000 served as the coffer disk enclosure because when the storage system started for the first time, the cable connections were correct. Both controllers A and B were connected to disk enclosures in loop 0 in forward connection mode. Therefore, DAE000 was identified as the coffer disk enclosure and recorded into the DB.
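The check described above can be illustrated with a minimal Python sketch. It is not the actual firmware logic. It assumes only what this case states: each controller treats the first disk enclosure it sees in loop 0 as the coffer disk enclosure, and startup is refused when that enclosure differs from the one recorded in the DB. The names Controller, loop0_chain, and coffer_check are hypothetical, and the full reverse order seen by controller B is inferred from the reverse connection mode.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Controller:
    name: str
    # Enclosures in the order this controller reaches them on loop 0 (hypothetical model).
    loop0_chain: List[str]

    def detected_coffer_enclosure(self) -> str:
        # Assumption from the case: the first enclosure on loop 0 is taken as the coffer enclosure.
        return self.loop0_chain[0]


def coffer_check(recorded: str, controllers: List[Controller]) -> Dict[str, str]:
    """Return {controller name: detected enclosure} for controllers that disagree with the DB record."""
    return {
        c.name: c.detected_coffer_enclosure()
        for c in controllers
        if c.detected_coffer_enclosure() != recorded
    }


# DAE000 was recorded in the DB when the system first started with correct cabling.
recorded_coffer = "DAE000"
ctrl_a = Controller("A", ["DAE000", "DAE001", "DAE002"])  # forward connection
ctrl_b = Controller("B", ["DAE002", "DAE001", "DAE000"])  # reverse connection (the fault)

mismatches = coffer_check(recorded_coffer, [ctrl_a, ctrl_b])
startup_allowed = not mismatches
print(mismatches, startup_allowed)
```

With the cabling from this case, the sketch flags controller B as seeing DAE002 and does not allow startup. If controller B's chain is changed to the forward order, the mismatch disappears.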
The following figure shows some system log information, where node Id = 0 indicates controller A, node Id = 1 indicates controller B, loopNum = 0 indicates loop 0, wwn indicates the WWN of a disk enclosure, and userFrameId = 0x100 indicates that disk enclosure 0x5486276ff745f03f is DAE000. It can be inferred from the log information that DAE000 had been correctly connected to controllers A and B before disk enclosure DAE002 was added.
At a later date, disk enclosure DAE002 was added to disk enclosure loop 0. During that operation, SAS expansion port P0 on controller B was disconnected, even though the connection to this port does not need to be changed when a disk enclosure is added to loop 0. The cables may therefore have been reconnected incorrectly at that time. See the following log information:
2015-07-06 09:42:04 DST 0xf0060002 Major 2015-07-06 09:42:19 DST The SAS expansion port (Controller Enclosure ENG0, SAS interface module B0, port number P0) is disconnected.
2015-07-06 09:42:26 DST 0x100f0ce000b Infor None Disk Enclosure(ID DAE002) has been inserted.
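To show how the two log entries above can be correlated, the following Python sketch extracts the leading timestamp, event ID, and severity from each line and computes the interval between the port disconnection and the enclosure insertion. The field layout (timestamp, time zone label, event ID, severity, remaining text) is an assumption inferred from these two lines only, not a documented log format, and the function name parse_event is hypothetical.

```python
import re
from datetime import datetime

LOG_LINES = [
    "2015-07-06 09:42:04 DST 0xf0060002 Major 2015-07-06 09:42:19 DST The SAS expansion port "
    "(Controller Enclosure ENG0, SAS interface module B0, port number P0) is disconnected.",
    "2015-07-06 09:42:26 DST 0x100f0ce000b Infor None Disk Enclosure(ID DAE002) has been inserted.",
]

# Assumed layout: "<date> <time> <tz> <event id> <severity> <rest of line>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ (0x[0-9a-fA-F]+) (\S+) (.*)$"
)


def parse_event(line: str) -> dict:
    """Split one log line into its leading timestamp, event ID, severity, and remaining text."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    timestamp, event_id, severity, text = match.groups()
    return {
        "time": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "event_id": event_id,
        "severity": severity,
        "text": text,
    }


events = [parse_event(line) for line in LOG_LINES]
disconnect = next(e for e in events if "is disconnected" in e["text"])
insertion = next(e for e in events if "has been inserted" in e["text"])

# Interval between the P0 disconnection and the DAE002 insertion, in seconds.
gap_seconds = (insertion["time"] - disconnect["time"]).total_seconds()
print(f"DAE002 inserted {gap_seconds:.0f} s after port P0 was disconnected")
```

A short interval between the two events is consistent with port P0 on controller B having been recabled during the same expansion operation.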
The preferred solution is to change cable connections non-disruptively. However, controller A cannot work properly at the moment and controller B is processing all services. In addition, it is the SAS cable connection to controller B that is incorrect and must be rectified. If cable connections were changed at this time, host services would definitely be affected. Therefore, this solution cannot be implemented.
The second-choice solution is to install a patch without interrupting services. The patch would temporarily disable the coffer disk enclosure check during the startup process. After controller A works properly, the cable connections would be changed. However, a patch requires time-consuming test verification, whereas the customer allows only a three-hour change window. Therefore, the patch solution is not suitable either.
The last choice is to suspend services and change cable connections.
The following figure shows the networking change diagram provided for the customer. The SAS cable connection between SAS expansion port P0 on controller B and the port on DAE002 as well as the SAS cable connection between DAE002 and DAE000 are adjusted. For details, see the red lines in the following figure: