The Softswitch Services Attached to the NE40 Were Interrupted Due to the Loop of Some Packets

Publication Date: 2012-07-27
Issue Description
1. The networking is as follows:
As shown in the Appendix, the entire bearer network features a biplane architecture. NQ_UMG8900 accesses the NGN bearer network through NQ_NE40A and NQ_NE40B. The SoftX3000 accesses the NGN bearer network through XJ_NE40A and XJ_NE40B. In addition, E200-1 and E200-2 are Eudemon200 firewall devices, on which the Layer 3 routing function is enabled.
2. Symptom of the fault
The customer reported that all the services attached to NQ_UMG8900 were interrupted and that users could not make phone calls.
During message tracing on the NQ_UMG8900 side, the SoftX3000 and NQ_UMG8900 could ping each other successfully, but the SoftX3000 still showed NQ_UMG8900 as detached.
 
Alarm Information
1. The SoftX3000 continuously displayed alarms indicating that NQ_UMG8900 was detached.
2. NQ_UMG8900 displayed a large number of alarms indicating that the ServiceChange registration packets were being delivered repeatedly.
 
Handling Process
1. Analysis of the type and delivery frequency of the service packets
The exchanged messages between the SoftX3000 and the UMG8900 were as follows:
(1) The SoftX3000 sent the Audit message to the UMG8900 at a rate of 100 messages per second.
(2) The UMG8900 sent the ServiceChange message to the SoftX3000 at a rate of one message per 20 seconds. If the SoftX3000 did not return a response, the UMG8900 retransmitted the message up to three times.
2. Verification of the direction in which the packets were discarded
The engineer then checked whether the packets were discarded on the way from the SoftX3000 to the UMG8900, or on the way from the UMG8900 to the SoftX3000. Through the message tracing function of the softswitch NMS, the engineer found that:
A. The SoftX3000 sent the Audit packet to the UMG8900, received the ServiceChange message from the UMG8900, and returned a response.
B. The UMG8900 did not receive any packet from the SoftX3000, neither the Audit packet nor the response to the ServiceChange packet. Because the UMG8900 did not receive the response to the ServiceChange packet, it delivered the ServiceChange message three more times.
Therefore, the engineer deduced that the packets were discarded on the way from the SoftX3000 to the UMG8900.
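The comparison of the two message traces can be sketched as a set difference. This is only an illustration of the reasoning, not the NMS tracing function itself: the function name `find_lost_direction` and the message labels are hypothetical, and each trace is assumed to be reducible to a set of sent and a set of received message identifiers.

```python
def find_lost_direction(trace_a, trace_b):
    """Compare what each endpoint sent with what the peer received.

    Each trace is a dict with two sets of message identifiers:
    {"sent": {...}, "received": {...}} (hypothetical format).
    """
    lost_a_to_b = trace_a["sent"] - trace_b["received"]
    lost_b_to_a = trace_b["sent"] - trace_a["received"]
    return {"a_to_b": sorted(lost_a_to_b), "b_to_a": sorted(lost_b_to_a)}


# Hypothetical message labels mirroring findings A and B above.
softx_trace = {
    "sent": {"Audit-1", "Audit-2", "ServiceChange-1-Reply"},
    "received": {"ServiceChange-1"},
}
umg_trace = {
    "sent": {"ServiceChange-1"},
    "received": set(),  # the UMG8900 received nothing from the SoftX3000
}

result = find_lost_direction(softx_trace, umg_trace)
print(result)
```

With these sample traces, every message sent by the SoftX3000 is missing on the UMG8900 side, while nothing sent by the UMG8900 is missing, which points to the SoftX3000-to-UMG8900 direction.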
3. Confirming the general location where the packet was discarded
The Audit messages from the SoftX3000 to the UMG8900 were sent at a high rate, approximately 100 pps. The engineer used this rate as a reference to confirm where the packets were discarded.
(1) Checking the forwarding path of the packets sent from the SoftX3000 to the UMG8900:
The engineer identified the forwarding path by running the tracert command as follows: SoftX3000 -> XJ_3528B -> E200-2 -> XJ_NE40B -> NQ_NE40A -> NQ_UMG8900
(2) Checking the counts on the interface to confirm the general location where the packets were discarded:
A. The engineer started from the UMG8900. Through repeated comparisons of the counts on the interface, the engineer found that the packets from XJ_NE40B to NQ_NE40A increased slowly, at a rate of 10 packets per second at most (suppose the interval for entering the command was 100 ms). Therefore, the engineer deduced that the packets sent from the SoftX3000 at a rate of 100 pps were discarded before reaching NQ_NE40A. That is, the packets were discarded in the section XJ_3528B -> E200-2 -> XJ_NE40B.
B. The engineer found no obvious increase in the discarded packet counts on the interface of XJ_3528B. Therefore, the packets were not discarded on XJ_3528B. (Note: The S3528 features a pure CPU forwarding architecture, so its discarded packet counts are reliable.)
C. Because the NE40 features the NP forwarding architecture, its discarded packet counts may be inaccurate. The best way to check whether packets were discarded on E200-2 or XJ_NE40B is to capture packets.
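The counter comparison in step (2) amounts to turning two counter readings into a rate and looking for the first point on the path where that rate falls well below the expected 100 pps. The following is a minimal sketch under that assumption; the counter values, the 50% threshold, and the function names are hypothetical and do not correspond to any device command.

```python
def packet_rate(count_start, count_end, interval_s):
    """Packets per second between two readings of an interface counter."""
    return (count_end - count_start) / interval_s


def first_drop_segment(path_rates, expected_pps, tolerance=0.5):
    """Return (upstream_device, device) for the first device whose observed
    rate is below expected_pps * tolerance, or None if no drop is visible.

    path_rates is a list of (device, pps) ordered from sender to receiver.
    """
    previous = None
    for device, pps in path_rates:
        if pps < expected_pps * tolerance:
            return (previous, device)
        previous = device
    return None


# Hypothetical readings taken 10 seconds apart on two devices.
rates = [
    ("XJ_3528B", packet_rate(5000, 6000, 10)),  # 100.0 pps
    ("NQ_NE40A", packet_rate(700, 800, 10)),    # 10.0 pps
]
segment = first_drop_segment(rates, expected_pps=100)
print(segment)
```

Only devices with trustworthy counters are listed in the sample, so the result brackets the suspect section rather than naming a single device, which is why packet capture is still needed for the final step.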
(3) Capturing packets to ascertain the exact location where packets were discarded:
By capturing packets between XJ_NE40B and E200-2, the engineer found a large number of looped packets whose TTL kept changing:
XJ_NE40B forwarded the MGCP packets with 10.63.4.106 (the address of NQ_UMG8900) as the source IP address and 10.63.5.1 (the address of the SoftX3000) as the destination IP address to E200-2, but E200-2 returned the packets to XJ_NE40B. The packets bounced back and forth between the two devices and were finally discarded when TTL reached 254. Therefore, NQ_UMG8900 could not send the service packets to the SoftX3000, and the service was interrupted.
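One way to spot such a loop in a capture is to look for the same IP packet appearing several times on the link with a steadily decreasing TTL. The sketch below assumes the capture has already been exported to simple records of (source, destination, IP identification, fragment offset, TTL); the record format, the IP identification values, and the TTL values are hypothetical, and only the two IP addresses come from this case.

```python
from collections import defaultdict


def find_looping_packets(records, min_sightings=3):
    """Group capture records by packet identity and report those seen
    repeatedly with strictly decreasing TTL, a typical loop signature.

    records: iterable of (src, dst, ip_id, frag_offset, ttl) in capture order.
    """
    ttls_by_packet = defaultdict(list)
    for src, dst, ip_id, frag_offset, ttl in records:
        ttls_by_packet[(src, dst, ip_id, frag_offset)].append(ttl)

    looping = {}
    for key, ttls in ttls_by_packet.items():
        decreasing = all(a > b for a, b in zip(ttls, ttls[1:]))
        if len(ttls) >= min_sightings and decreasing:
            looping[key] = ttls
    return looping


# Hypothetical capture on the link between XJ_NE40B and E200-2.
capture = [
    ("10.63.4.106", "10.63.5.1", 4711, 0, 250),
    ("10.63.4.106", "10.63.5.1", 4711, 0, 249),
    ("10.63.5.1", "10.63.4.106", 99, 0, 253),  # seen once, not looping
    ("10.63.4.106", "10.63.5.1", 4711, 0, 248),
    ("10.63.4.106", "10.63.5.1", 4711, 0, 247),
]
loops = find_looping_packets(capture)
print(loops)
```

The fragment offset is part of the key so that different fragments of one datagram, which share an IP identification value, are not mistaken for repeated sightings of the same packet.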
4. Analysis of the firewall
By checking the ICMP statistics on E200-2, the engineer found that E200-2 had returned 31827 ICMP packets for expired TTL during the 26 hours (1,560 minutes) that the service was interrupted. This is 20.4 discarded packets per minute on average, which matched the symptom reported by the customer that about 20 packets were discarded per minute when NQ_UMG8900 failed.
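The average quoted above can be checked with a short calculation. The variable names are illustrative; the figures are the ones given in this case.

```python
ttl_expired_icmp_count = 31827   # ICMP TTL-expired packets returned by E200-2
outage_hours = 26                # duration of the service interruption

outage_minutes = outage_hours * 60
discards_per_minute = ttl_expired_icmp_count / outage_minutes

print(outage_minutes)                 # 1560
print(round(discards_per_minute, 1))  # 20.4
```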
Further analysis by the R&D staff showed that the softswitch packets were looped because the Eudemon200 failed to process packet fragments properly. The problem can be avoided by disabling the fast forwarding function on the interface of the Eudemon200.
5. Recovery of the service
There are two methods to recover the service:
(1) Disable the fast forwarding function on the interface of the Eudemon200 to recover the service.
(2) Switch the service to another plane for fast recovery because the network features the biplane architecture.
Because the forwarding performance of the device will be affected after the fast forwarding function is disabled, the engineer decided to recover the service with the second method. 
 
Root Cause
1. Because the SoftX3000 and NQ_UMG8900 could successfully ping each other, routes on the devices must be configured correctly.
2. Because NQ_UMG8900 stayed detached when viewed from the SoftX3000, the engineer deduced that softswitch service packets were discarded on a certain device on the bearer network.
Further analysis showed that the ping test involved only ICMP packets, whereas the softswitch service packets were TCP or UDP packets. A successful ping therefore did not prove that the service packets were forwarded correctly: only the TCP and UDP packets were discarded.
3. It is not difficult to explain why ping succeeds while the service is abnormal. However, when the tracert command cannot help, the fault is hard to locate. Because of the large traffic volume on the network and the many possible faulty points, a single packet capture scheme can hardly locate the cause.
In this case, it is necessary to analyze the type and characteristics of the service packets, especially their delivery direction and frequency, before selecting a proper method to locate the faulty point. Remember that sharpening your ax will not delay your job of cutting wood.
 
Suggestions
When the bearer network features the biplane architecture, you can usually recover the service quickly by switching the service to the other plane. However, not all bearer networks feature a symmetrical biplane architecture. In that case, you need to quickly locate the specific section of the forwarding path where the fault occurs, and then recover the service by a partial switchover. In this case, for example, you can shut down the interface between the SoftX3000 and XJ_NE40B to recover the service quickly.
 
