The Storage System Is Unavailable or Its Reliability Is Degraded Due to a Switch Loop Storm

Issue Description
Product and version: CSS V100R001C00.
The following alarms are displayed on the ISM interface: "A node exits", "The reliability of the system is degraded", "Services go offline", and "The storage system used by the MDS service is unavailable". Nodes in the storage domain cannot be pinged.
Alarm Information
Handling Process
Step 1     Disconnect network cables that form a loop.
Step 2     Run the telnet command on the ISM server to log in to switches.
Step 3     In the Telnet session, run the display stp brief command to check the status of the switch network ports. If the status of all switch network ports is FORWARDING, as shown in Figure 1-1, the fault is removed.

Figure 1-1 Network port status
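
The check in Step 3 can also be run against a saved copy of the command output. The following is a minimal sketch, assuming the output of display stp brief was captured to a text file and that each port row has the columns MSTID, Port, Role, STP State, and Protection. The function name and the column layout are assumptions and may not match the switch model in use.

```shell
#!/bin/sh
# Hypothetical helper: list ports whose STP state is not FORWARDING.
# Assumed row layout: <MSTID> <Port> <Role> <STP State> <Protection>
# Rows whose first field is not numeric (for example the header) are skipped.
non_forwarding_ports() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 4 && $4 != "FORWARDING" { print $2, $4 }' "$1"
}

# Usage: non_forwarding_ports stp_brief.txt
# Empty output means every listed port is in the FORWARDING state.
```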

Step 4     Run the stp enable command to enable switch loop suppression.
Step 5     Run the display stp command to verify that STP has been enabled, as shown in Figure 1-2.

Figure 1-2 STP service mode

Step 6     Run the save command to save the configuration.
Step 7     Run the ping command on the ISM server to check whether nodes can be pinged. If communication with any node is abnormal, restart the abnormal nodes, starting with the MDSs and then the OSNs.
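
When the storage domain has many nodes, the ping check in Step 7 can be looped. The sketch below is one possible way to do it, assuming a Linux-style ping that accepts the -c option. The function name, the PING_CMD variable, and the addresses in the usage comment are placeholders, not values from this case.

```shell
#!/bin/sh
# Hypothetical helper: ping each node once and print the nodes that do not answer.
# PING_CMD can be overridden; "ping -c 1" is an assumed default.
PING_CMD=${PING_CMD:-"ping -c 1"}

unreachable_nodes() {
    for node in "$@"; do
        # $PING_CMD is intentionally unquoted so that its options are split into words.
        if ! $PING_CMD "$node" >/dev/null 2>&1; then
            echo "$node"
        fi
    done
}

# Usage (placeholder addresses, replace with the real node IP addresses):
# unreachable_nodes 192.0.2.11 192.0.2.12
```

Any node printed by the function is a candidate for the restart described in Step 7.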
Step 8     Check the storage domain status on the ISM interface. If the status is Recovery or Normal, the storage domain is available. If the status is Faulty, wait 10 minutes. If the status is still Faulty after 10 minutes, contact technical support engineers.
Step 9     Check the status of wushan CA. If the status is Faulty, remount the client.
1.         Run the umount /mnt/CA command to unmount the client.
2.         Run the mount.wfs -t wlite128.20.181.28: /mnt/CA command to mount the client again. The command arguments specify the IP address of the active MDS, the IP address of the standby MDS, the name of the related storage domain (ok), the namespace (default), and the client mount point (/mnt/CA).
Step 10     Verify that the status of both the storage domain and the client is Normal. If the storage domain or the client remains abnormal, contact technical support engineers. Otherwise, log in to the client and run the following commands:
touch testfile
echo "123456" >> testfile
cat testfile

Check whether an error message is displayed. If no error message is displayed, the fault is removed. If an error message is displayed, contact technical support engineers.
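
The read/write check in Step 10 can be wrapped in a small function so that the result is reported as an exit status instead of being read from the screen. This is a minimal sketch: the function name is hypothetical, and it overwrites the test file with > rather than appending with >> so that the comparison is deterministic.

```shell
#!/bin/sh
# Hypothetical helper: write a marker to a file under the mount point and read it back.
# Returns 0 if the content read back matches the marker, non-zero otherwise.
verify_mount_rw() {
    dir=$1
    file="$dir/testfile"
    touch "$file" || return 1
    # Overwrite (not append) so that the file holds exactly one marker line.
    echo "123456" > "$file" || return 1
    [ "$(cat "$file")" = "123456" ]
}

# Usage: verify_mount_rw /mnt/CA && echo "read/write OK"
```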

Root Cause
Alarms about node exit and reliability degradation are generated, and communication among nodes is abnormal. These symptoms may be caused by switch faults or loops.
The indicators and network ports of the switches were checked and found to be normal. The network cable connections on the switches were then checked, and a switch loop was found. The loop storm caused the network to break down.