Packet loss caused by the SERDES open error alarm

Publication Date: 2014-06-20
Issue Description

On Aug 21 2013, packet loss occurred on the 10G link between RLU13MSU1. After traffic was switched to another link, it recovered.
The device reported the following alarms.

 

=====================================================

  ===============display alarm all===============

=====================================================

----------------------------------------------------------------------------

Index  Level      Date      Time                        Info

 

1      Emergency  13-08-21  09:42:36    LPU 6 is failed, RXPowLowWarn of XFP2

                                        ALARM of PIC0 is abnormal

2      Error      13-08-12  12:50:20    The cfcard: storage media of MPU10(Ent

                                        ity) exceeded the prealarm threshold

3      Error      13-08-12  12:36:50    The cfcard: storage media of MPU9(Enti

                                        ty) exceeded the prealarm threshold

4      Alert      13-08-12  12:28:12    The address76,channel2  temperature se

                                        nsor of SlotID 6(entity) exceed upper

                                        minor limit

5      Alert      13-08-12  12:17:22    LPU3 is failed, SERDES No.0 channel No

                                        .3 interface open error

6      Critical   13-08-12  12:11:59    The air filter is failed, Maybe it is

                                        not cleaned as scheduled. Please clean

                                         it and run the reset dustproof run-ti

                                        me command

---------------------------------------------------------------------------- 
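In a longer alarm list the SERDES entries are easy to miss, because the Info column wraps across several lines. The following is a minimal Python sketch for rejoining the wrapped entries and filtering them. It works on saved text, not on the device. It assumes the Info column is 38 characters wide, which is inferred from where the sample lines wrap and not taken from any documentation. The function names are hypothetical.

```python
import re

# Two entries copied from the alarm list above (blank lines removed).
SAMPLE = """\
4      Alert      13-08-12  12:28:12    The address76,channel2  temperature se
                                        nsor of SlotID 6(entity) exceed upper
                                        minor limit
5      Alert      13-08-12  12:17:22    LPU3 is failed, SERDES No.0 channel No
                                        .3 interface open error
"""

INFO_WIDTH = 38  # assumption: inferred from where the sample lines wrap
HEAD = re.compile(r"^(\d+)\s+(\w+)\s+(\S+)\s+(\S+)\s+(.*)$")

def parse_alarms(text):
    """Group wrapped lines into one record per alarm index."""
    alarms = []
    for line in text.splitlines():
        if not line.strip():
            continue  # the saved output may contain blank lines
        m = HEAD.match(line)
        if m:
            index, level, date, time, info = m.groups()
            alarms.append({"index": int(index), "level": level,
                           "date": date, "time": time, "parts": [info]})
        elif alarms:
            # A line that does not start with an index continues the Info text.
            alarms[-1]["parts"].append(line.strip())
    for alarm in alarms:
        # Pad each piece back to the column width so a space lost at a
        # line end is restored, then join the pieces into one string.
        parts = alarm.pop("parts")
        alarm["info"] = "".join(p.ljust(INFO_WIDTH) for p in parts).rstrip()
    return alarms

def serdes_alarms(text):
    """Return only the alarms whose Info text mentions SERDES."""
    return [a for a in parse_alarms(text) if "SERDES" in a["info"]]

for alarm in serdes_alarms(SAMPLE):
    print(alarm["index"], alarm["date"], alarm["time"], alarm["info"])
```

If the column width differs on another software version, only INFO_WIDTH should need adjusting.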



Handling Process

The output below shows that LPU3 detected CRC errors on SERDES No.0 channel No.3 (the CRCError entry under IPE port), which links to the SFU14 board.

=============================================================== 

  ===============display switch-port lpu 3===============

===============================================================

SlotID:3  SERDES interface port status:

OPI port: status[SERDES interface NO.,portNO.]OUT/IN

 ON[0, 0]O  ON[0, 1]O  ON[0, 2]O  ON[0, 3]O

 

IPE port: status[SERDES interface NO.,portNO.]OUT/IN

 ON[0, 0]I  ON[0, 1]I  ON[0, 2]I  CRCError[0, 3]I
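The abnormal entry is easy to overlook in plain text. The following is a minimal sketch for picking it out, assuming the output has been saved as text and that every entry has the form STATUS[serdes, port]O/I as in the sample above. abnormal_ports is a hypothetical helper, not a device or vendor API.

```python
import re

# Text copied from the display switch-port lpu 3 output above.
SAMPLE = """\
SlotID:3  SERDES interface port status:
OPI port: status[SERDES interface NO.,portNO.]OUT/IN
 ON[0, 0]O  ON[0, 1]O  ON[0, 2]O  ON[0, 3]O
IPE port: status[SERDES interface NO.,portNO.]OUT/IN
 ON[0, 0]I  ON[0, 1]I  ON[0, 2]I  CRCError[0, 3]I
"""

# One entry looks like STATUS[serdes, port]DIRECTION, where DIRECTION is O or I.
# The header lines do not match because their brackets do not start with digits.
ENTRY = re.compile(r"(\w+)\[(\d+),\s*(\d+)\]([OI])")

def abnormal_ports(text):
    """Return (status, serdes, port, direction) for every entry that is not ON."""
    found = []
    for line in text.splitlines():
        for status, serdes, port, direction in ENTRY.findall(line):
            if status != "ON":
                found.append((status, int(serdes), int(port), direction))
    return found

print(abnormal_ports(SAMPLE))
```

For the sample this reports a single entry: CRCError on SERDES 0, port 3, in the IN direction.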

After checking our code, we found a software bug that contributed to the problem.

The SERDES should be isolated when it reports an error, so that the SFU 3+1 backup takes over and no service is affected. In this case, however, when the hardware fault occurred, a small software bug caused the isolation to fail, so traffic was still sent through this SERDES. In the egress direction, when LPU3 received packet cells from SFU14, it detected CRC errors and dropped the errored packets, which caused the packet loss.
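The effect of a failed isolation can be illustrated with a toy model. This is only a sketch under simplifying assumptions: cells are spread round-robin over four fabric links, and every cell that crosses the faulty link fails the CRC check and is dropped. The real load distribution and the relation between cells and packets on the device are not described here, so the numbers are illustrative and not measured. All names except SFU14 are placeholders.

```python
def delivered_cells(num_cells, links, faulty_links, isolate):
    """Count cells that pass the receiver's CRC check.

    Cells are spread round-robin over the links in use. A cell that
    crosses a faulty link is treated as corrupted and dropped.
    """
    # With working isolation, faulty links are taken out of service.
    in_use = [l for l in links if not (isolate and l in faulty_links)]
    delivered = 0
    for cell in range(num_cells):
        link = in_use[cell % len(in_use)]
        if link not in faulty_links:
            delivered += 1
    return delivered

# Four fabric links for the 3+1 scheme. Only SFU14 is named in the text;
# the other three labels are placeholders.
LINKS = ["fabric-1", "fabric-2", "fabric-3", "SFU14"]
FAULTY = {"SFU14"}

print(delivered_cells(1000, LINKS, FAULTY, isolate=False))  # isolation failed
print(delivered_cells(1000, LINKS, FAULTY, isolate=True))   # isolation works
```

In this model, a failed isolation loses every cell sent over the faulty link, while a working isolation moves all traffic to the three healthy links and loses nothing.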

 

The faulty SERDES link has two ends: SFU14 sends and LPU3 receives. It is hard to tell which end is faulty, so to fix the fault seen on LPU3 we need to bring spare parts for both SFU14 and LPU3 to the site for replacement.




Root Cause

There is a SERDES link error on either SFU14 or LPU3, which caused packet loss in the egress direction on LPU3.

Solution

 

Bring spare parts for both SFU14 and LPU3 to the site. Replace SFU14MPU10 first, then check whether the fault clears. If it does not, SFU14 can be ruled out, and LPU3 should be replaced next.

Suggestions

The software bug mentioned above will be fixed in a future patch, which is planned for release as V600R005SPH023 on Sep 30 2013.

END