Frequent Active/Standby Management Nodes Switchover Caused by Aggregation Switch ARP Suppression

Publication Date: 2015-03-10
Issue Description
Switchovers occur frequently between the active and standby management nodes (the CRM, OMM, and ESC nodes, or the VRM node). The watchdog log records communication failures between the management nodes and the gateway, even though the gateway can be pinged from the management nodes.
When you ping the gateway from a management node, no packet loss occurs. However, the storage latency varies, and a switchover occurs every 2 to 3 minutes.
Detailed information is as follows:

• The watchdog on the CRM, OMM, and ESC nodes or the VRM node detects communication failures between the nodes and the gateway.
• When you ping the gateway on the CRM, OMM, and ESC nodes or the VRM node, no packet is lost.
• Based on the analysis of packets captured on the network, a large number of ARP packets are sent from the IP address of the faulty node, but the gateway does not respond to the ARP requests.

Handling Process
1. Configure the aggregation switches as described in "ARP Sending Rate from Same Source over Threshold" in the GalaX8800 Product Documentation.

2. Disable the related ARP traffic control policies configured on the aggregation switches and adjust the value for traffic control based on the live network conditions.
Root Cause
When communication between a management node and a CNA node fails, the management node sends two ARP broadcast packets every second. If the management node cannot communicate with three CNA nodes in a cluster, it sends more than six ARP broadcast packets every second.

However, the default value of arp speed-limit source-ip maximum is 5 on S9300 switches running V100R001, V100R002, or V100R006. That is, if a source IP address sends more than 5 ARP broadcast packets per second for a period of time, the S9300 switch suppresses the ARP broadcast packets and does not respond to the ARP requests. Other switches have similar ARP suppression mechanisms.
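The comparison between the ARP sending rate and the switch limit can be sketched as follows. This is a minimal illustration, not the product's logic: it assumes the rate grows linearly at two packets per second per unreachable CNA node (the text only states "more than six" for three nodes), and the function names are hypothetical.

```python
# Figures taken from the text above.
ARP_PER_UNREACHABLE_NODE = 2  # ARP broadcasts per second per unreachable CNA node
DEFAULT_SPEED_LIMIT = 5       # default "arp speed-limit source-ip maximum" value


def arp_rate(unreachable_nodes):
    """Estimated ARP broadcasts per second (assumed linear model, lower bound)."""
    return unreachable_nodes * ARP_PER_UNREACHABLE_NODE


def exceeds_limit(rate, limit=DEFAULT_SPEED_LIMIT):
    """True if the per-source rate is above the switch limit."""
    return rate > limit


if __name__ == "__main__":
    for nodes in range(1, 5):
        rate = arp_rate(nodes)
        state = "suppressed" if exceeds_limit(rate) else "allowed"
        print(f"{nodes} unreachable node(s): {rate} ARP/s, {state}")
```

Under this model, two unreachable CNA nodes stay below the default limit, while three are enough to exceed it. Passing a larger `limit` shows the effect of raising the threshold on the aggregation switch.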

When the gateway's ARP entry on a management node ages out, the node must resolve the gateway address again, but the switch no longer answers its ARP requests, so the watchdog detects communication failures between the node and the gateway. The watchdog sends a ping packet to the gateway every 2 seconds. If it does not receive a response within 1 second, it considers the ping failed.

If five consecutive pings fail, the watchdog considers the gateway faulty. Each failed attempt takes 3 seconds (the 2-second interval plus the 1-second timeout), so this takes 3 x 5 = 15 seconds. After those 15 seconds without a successful ping, the system automatically performs an active/standby switchover for each management node.
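The watchdog timing can be sketched as follows. This is a simplified model under stated assumptions, not the actual watchdog code: every attempt is charged the full 3 seconds (2-second interval plus 1-second timeout), including successful ones, and the function name is hypothetical.

```python
# Timing values taken from the text above.
PING_INTERVAL_S = 2   # seconds between pings
PING_TIMEOUT_S = 1    # seconds to wait for a reply
MAX_FAILURES = 5      # consecutive failures before the gateway is declared faulty


def seconds_until_switchover(ping_results):
    """Return the elapsed seconds at which a switchover would be triggered.

    ping_results is an iterable of booleans (True = reply received).
    Returns None if five consecutive failures never occur.
    """
    consecutive_failures = 0
    elapsed = 0
    for ok in ping_results:
        # Simplification: each attempt costs interval + timeout.
        elapsed += PING_INTERVAL_S + PING_TIMEOUT_S
        if ok:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == MAX_FAILURES:
                return elapsed
    return None


if __name__ == "__main__":
    print(seconds_until_switchover([False] * 5))
```

A single successful ping resets the failure counter, so in this model only an uninterrupted run of five failures leads to a switchover.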
