Plan and configure reliability for spine nodes in M-LAG networking. For details, see Figure 2-3 and Table 2-3.
Figure 2-3 Common fault points of spine nodes
Table 2-3 Impact analysis of common faults of spine nodes and recommended deployment solutions
No.
|
Fault Scenario
|
Impact Analysis
|
Recommended Deployment Solution
|
1
|
Device fault
|
- When Spine1 restarts, service traffic is quickly switched to Spine2.
- After Spine1 restarts and reconnects to the network:
  - M-LAG member interfaces go Up after a default delay of 240s. Before they go Up, the uplink network converges first, and traffic from the PEs to Spine1 is forwarded through the peer-link. Spine1 needs to learn a large number of ARP and MAC address entries, so packet loss lasts a long time.
  - After the M-LAG member interfaces go Up, the outbound interfaces in the ARP and MAC address entries must be changed from the peer-link interfaces to the M-LAG member interfaces. Packet loss occurs during this update.
|
- Configure the interface on the spine node connected to the PE to go Up after a delay of 360s. (If the uplink and downlink interfaces are not on the same card, adjust the delay to account for the additional registration time that the card hosting the downlink interface needs compared with the card hosting the uplink interface. In most cases, card registration time ranks as follows, longest first: 100GE > 40GE > 10GE > GE.)
- With this delay, the uplink interface goes Up later than the M-LAG member interface. Until then, traffic from the server that reaches Spine1 is forwarded to Spine2 through the backup link. After the uplink interface goes Up, route switching is performed.
|
Card fault
|
- If the uplink and backup link are on the same card, a card fault causes all uplink paths to fail. As a result, the traffic from the downlink interface is interrupted.
- If a spine node has multiple cards installed but the two server leaf nodes in an M-LAG are connected to the same card on the spine node, traffic is switched to the backup spine node when that card fails. After the card fault is rectified, the speed of traffic switchback is limited by ARP/MAC learning performance, which causes packet loss.
|
When a spine node has multiple cards installed:
- Deploy the uplink and backup link on different cards if there is only one uplink, and deploy the uplinks on different cards if there are multiple uplinks.
- Connect the two server leaf nodes in an M-LAG to the same spine node through interfaces on different cards.
|
2
|
PE fault
|
When the link between a spine node and PE fails, traffic is quickly switched to the backup path. After the link recovers, traffic is quickly switched back to the original route. However, convergence on the PE side may be slow.
|
On the spine node, configure OSPF to advertise routes with the maximum cost for a specified period after the OSPF interface goes Up. This reduces packet loss caused by slow convergence on the PE side. This configuration is not required when the number of routes is small or convergence on the PE side is fast.
|
3
|
Link fault between a spine node and a firewall or LB
|
A firewall or LB is dual-homed to the spine nodes. When one link fails, traffic is quickly switched to the other link. After the faulty link recovers, traffic is quickly switched back.
|
-
|
4
|
DAD link fault
|
The DAD link takes effect only when a device or peer-link fails. In normal cases, the DAD link functions as the backup path and does not carry traffic.
|
- The DAD link does not carry traffic, and traffic forwarding is not affected when the DAD link fails.
- Configure the interfaces on the DAD link as reserved interfaces.
|
5
|
Peer-link down (all member links are faulty)
|
- If the peer-link fails and the fault is detected over the DAD link, the uplink and downlink interfaces of the backup spine node enter the Error-Down state, and traffic is quickly switched to the other device in the group.
- After the peer-link fault is rectified, all Error-Down interfaces go Up after a delay of 240s. The uplink and downlink interfaces go Up at the same time, but services are interrupted for a long time due to the convergence time difference between uplink routes and downlink ARP entries.
|
- Deploy the peer-link on different cards to improve reliability and reduce the occurrence of peer-link faults.
- Configure the M-LAG member interfaces to go Up after a delay of 240s and set the recovery interval to 10s. During the delay, the outbound interfaces in forwarding entries are the peer-link interfaces. After the delay ends, the forwarding entries are updated for one M-LAG member interface every 10s, which improves network convergence performance.
|
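
The timers recommended in Table 2-3 can be sanity-checked with a small planning sketch. The Python below is an illustrative model only, not device configuration or actual device behavior. The names `RecoveryPlan`, `member_update_times`, and `uplink_after_members` are hypothetical. The sketch assumes that the first M-LAG member interface is updated exactly when the 240s delay expires, and that one further interface follows every 10s. The final check compares the staggered member updates with the 360s uplink delay from row 1, which is a planning convenience and not a rule stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Timer values taken from Table 2-3 (all in seconds)."""
    up_delay_s: int = 240      # delay before M-LAG member interfaces go Up
    interval_s: int = 10       # recovery interval between member interfaces
    uplink_delay_s: int = 360  # delay before the PE-facing interface goes Up


def member_update_times(plan: RecoveryPlan, member_count: int) -> list[int]:
    """Return when each member interface is updated, in seconds after recovery starts.

    Assumption: the first interface is updated as soon as the up delay
    expires, and one more interface follows every interval_s seconds.
    """
    return [plan.up_delay_s + i * plan.interval_s for i in range(member_count)]


def uplink_after_members(plan: RecoveryPlan, member_count: int) -> bool:
    """Check that the uplink delay expires only after the last member update."""
    times = member_update_times(plan, member_count)
    return not times or plan.uplink_delay_s > times[-1]


if __name__ == "__main__":
    plan = RecoveryPlan()
    print(member_update_times(plan, 4))
    print(uplink_after_members(plan, 4))
```

Under these assumptions, a node with many M-LAG member interfaces may finish its staggered updates later than the uplink delay, in which case the uplink delay in the plan would need to be reviewed.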