Packets captured at the media port of the SGSN contained a large number of duplicates, which caused TCP on the SGSN to malfunction and triggered many packet retransmissions.
The possible causes of the large number of duplicate packets were as follows:
1. TCP packets were discarded on the service layer.
2. The carrier network sent duplicate TCP packets.
Huawei further identified the cause of the issue:
1. Because the duplicate ACK packets had the same IP identification field, Huawei suspected that the duplicate packets were generated during transmission and that the issue was most probably caused by a fault on the carrier network.
2. Checked the traffic transmit and receive paths and found that they were different, as indicated by the red and blue lines respectively. Traffic is commonly transmitted and received over different paths on live networks, so Huawei did not focus the analysis on this aspect.
3. Checked the forwarding entries for the traffic receive path and found that NE40E-1 did not have the MAC forwarding table for the SGSN.
Continued to analyze how the traffic was forwarded on NE40E-1 and NE40E-2:
(1) VLANIF 2020 for NE40E-1 and NE40E-2 and the SGSN belonged to the same network segment, which meant that they were in the same broadcast domain. Therefore, when the return traffic arrived at NE40E-2, the MAC address of the SGSN was encapsulated into the packets based on the ARP entry and then the packets were transmitted to Eth-trunk 2.
(2) When return traffic arrived at NE40E-1, the packets were forwarded at Layer 2 based on the MAC forwarding table.
(3) Checked and found that NE40E-1 did not have the MAC forwarding entry corresponding to the SGSN, which caused the packets to be duplicated and sent to three ports. As a result, a lot of duplicate TCP packets were received on the SGSN.
4. Analyzed why NE40E-1 did not have the MAC forwarding entry.
(1) A MAC forwarding entry was learned based on a Layer 2 packet's source MAC address and ingress port after the packet was sent downstream.
(2) When a packet was transmitted to NE40E-1, it was terminated at the MAC layer and then forwarded at Layer 3. Therefore, the MAC forwarding entry corresponding to the packet could not be created on NE40E-1.
(3) The SGSN, however, sent an ARP request packet every second for resolution, and NE40E-1 should have broadcast the ARP request packet to NE40E-2 over Eth-trunk 2. This broadcast should also have allowed NE40E-1 to learn the MAC forwarding entry, but the learning failed.
(4) Checked the ARP entries on NE40E-1 and NE40E-2 and found that the ARP entry timestamp was 20 minutes on NE40E-1 but kept decreasing on NE40E-2. Huawei concluded that the ARP request packet was not broadcast by NE40E-1 over Eth-trunk 2.
(5) Performed an Ethernet packet debugging test on NE40E-1 and found that the ARP request packet sent by the SGSN was a unicast packet whose destination MAC address was the virtual MAC address of the VRRP group. The unicast ARP request packet was therefore terminated at NE40E-1 without being sent out at Layer 2. As a result, NE40E-1 failed to learn the MAC forwarding entry corresponding to the SGSN.
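The forwarding behavior analyzed in steps 3 and 4 can be illustrated with a minimal Python sketch. It is a toy model, not NE40E code: the class `L2Bridge`, the port names `port-a` to `port-c`, and the MAC labels are hypothetical, and the model assumes that a source MAC address is learned only when a frame is forwarded at Layer 2 and that a unicast frame with an unknown destination is flooded to every port except the ingress port.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class L2Bridge:
    """Toy Layer 2 forwarding plane of one VLAN (hypothetical model)."""
    ports: List[str]
    mac_table: Dict[str, str] = field(default_factory=dict)  # MAC -> port

    def terminate(self, src_mac: str, in_port: str) -> List[str]:
        # Frame addressed to the router itself (for example the VRRP
        # virtual MAC): handed to Layer 3, nothing is sent out at Layer 2,
        # and in this model no MAC forwarding entry is learned.
        return []

    def forward(self, src_mac: str, dst_mac: str, in_port: str) -> List[str]:
        # Frame forwarded at Layer 2: learn the source MAC on the ingress port.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is not None:
            return [out_port]
        # Unknown unicast: flood to all other ports in the VLAN.
        return [p for p in self.ports if p != in_port]


# Hypothetical port layout for NE40E-1 in VLAN 2020.
ne40e_1 = L2Bridge(ports=["Eth-Trunk2", "port-a", "port-b", "port-c"])

# Unicast ARP request from the SGSN to the VRRP virtual MAC: terminated.
terminated = ne40e_1.terminate("sgsn-mac", "port-a")

# Return traffic from NE40E-2 toward the SGSN: no entry, so it is flooded.
flooded = ne40e_1.forward("ne40e-2-mac", "sgsn-mac", "Eth-Trunk2")

# A frame from the SGSN that is forwarded at Layer 2 creates the entry.
learned = ne40e_1.forward("sgsn-mac", "ne40e-2-mac", "port-a")

# The same return traffic is now sent to a single port.
unicast = ne40e_1.forward("ne40e-2-mac", "sgsn-mac", "Eth-Trunk2")
```

In this sketch `flooded` contains three ports, which mirrors the duplication seen on the SGSN, while `unicast` contains only the port on which the SGSN MAC address was learned.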
Huawei ran the mac-address aging-time 1200 command (value in seconds) to change the aging time for MAC forwarding entries to be longer than the aging time for ARP entries (20 minutes). When ARP entries on NE40E-2 approached the aging time limit, NE40E-2 sent unicast ARP request packets and the SGSN sent back ARP reply packets, which NE40E-1 then sent out at Layer 2. NE40E-1 updated its MAC forwarding table as a result. This method ensured that NE40E-1 always had the MAC forwarding entry for the SGSN.
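The timer relationship behind this workaround can be checked with a short Python sketch. The function `mac_entry_gaps` is hypothetical, and the model is a simplification: it assumes that the MAC forwarding entry is refreshed at a fixed interval (each time an ARP reply from the SGSN passes through NE40E-1 at Layer 2) and expires a fixed aging time after the last refresh. The 300-second aging time and the 1190-second refresh interval are illustrative values, not taken from the case.

```python
from typing import List, Tuple


def mac_entry_gaps(mac_aging_s: int, refresh_interval_s: int,
                   horizon_s: int) -> List[Tuple[int, int]]:
    """Return the (start, end) windows in which the MAC entry is absent.

    Assumes a refresh at t = 0, refresh_interval_s, 2 * refresh_interval_s,
    and so on, with the entry expiring mac_aging_s after each refresh.
    """
    gaps = []
    t = 0
    while t < horizon_s:
        expiry = t + mac_aging_s
        next_refresh = t + refresh_interval_s
        # The entry is missing from its expiry until the next refresh.
        if expiry < next_refresh and expiry < horizon_s:
            gaps.append((expiry, min(next_refresh, horizon_s)))
        t = next_refresh
    return gaps


# Illustrative short aging time: the entry is absent most of each cycle.
short_aging = mac_entry_gaps(mac_aging_s=300, refresh_interval_s=1190,
                             horizon_s=3600)

# Aging time of 1200 s with a refresh shortly before the ARP entry ages out.
long_aging = mac_entry_gaps(mac_aging_s=1200, refresh_interval_s=1190,
                            horizon_s=3600)
```

Under these assumptions `long_aging` is empty, because every refresh arrives before the entry expires. Return traffic that arrives during any window listed in `short_aging` would be flooded.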
The following methods are available to resolve the issue:
1. Enhance the ARP detection function for the SGSN. Specifically, add some broadcast ARP request packets to maintain the MAC forwarding entries on routers.
2. Change the aging time for MAC forwarding entries on NE40E-1 to be longer than 20 minutes.
3. Configure routing policies so that return traffic is terminated only on the VRRP master.
The following summary helps to better understand Layer 2 and Layer 3 forwarding at an Ethernet port:
1. If a Layer 3 port receives a packet and finds the destination MAC address of the packet is the port's own MAC address, the Layer 3 port searches in its FIB table based on the IP address and then forwards or sends the packet upstream. If the port finds that the destination MAC address of the packet is not the port's own MAC address, it discards the packet.
2. If a Layer 2 port receives a packet and finds the destination MAC address of the packet is the port's own MAC address, it sends the packet upstream. If the destination MAC address of the packet is not the port's own MAC address, the port forwards the packet at Layer 2 based on the MAC forwarding table. An NE40E learns a MAC forwarding entry based on the source MAC address only after the packet is successfully forwarded at Layer 2.
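The two rules above can be written as a small decision function. This is a sketch in Python for illustration only: the function `handle_frame`, its mode strings, and the action names are hypothetical, and the flooding branch for a destination that is missing from the MAC forwarding table reflects the behavior observed in this case rather than a statement in the summary.

```python
from typing import Dict, Optional, Tuple


def handle_frame(port_mode: str, port_mac: str, dst_mac: str,
                 mac_table: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (action, out_port) for a frame received on an Ethernet port."""
    if port_mode == "l3":
        if dst_mac == port_mac:
            # Look up the FIB by IP address, then forward or send upstream.
            return ("fib_lookup", None)
        return ("discard", None)
    if port_mode == "l2":
        if dst_mac == port_mac:
            return ("send_upstream", None)
        out_port = mac_table.get(dst_mac)
        if out_port is not None:
            return ("l2_forward", out_port)
        # No matching entry: the frame is flooded, as seen on NE40E-1.
        return ("l2_flood", None)
    raise ValueError(f"unknown port mode: {port_mode}")


# Placeholder MAC labels and port name for illustration.
table = {"sgsn-mac": "port-a"}
l3_hit = handle_frame("l3", "own-mac", "own-mac", table)
l3_miss = handle_frame("l3", "own-mac", "sgsn-mac", table)
l2_known = handle_frame("l2", "own-mac", "sgsn-mac", table)
l2_unknown = handle_frame("l2", "own-mac", "other-mac", table)
```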