Root Cause of an AR2240 Becoming Unreachable and Recovering Only After a Power Cycle

Publication Date:  2019-07-22  |   Views:  579  |   Downloads:  0  |   Author:  SU1001006543  |   Document ID:  EKB1001358606

Issue Description


The engineer reported that an AR2240 at the headquarters was faulty. As a result, the interconnections between the headquarters and 29 branches were interrupted for about 43 minutes. The sequence of events was as follows:
09:48:08: The customer logged in to the AR2240 through SSH.
09:49:31: The customer switched from the AR2240 to the HP router through Telnet.
09:51:16: The customer copied the configuration on the HP router, but the AR router stopped responding. The customer verified through the NMS that the AR router was offline.
10:34:46: The customer powered off the AR router and then started it again. After the AR started successfully, services were recovered.
10:52:02: The customer switched from the AR router to the HP router. The configuration was normal, and no AR exception occurred.

The customer's headquarters has a primary data center (MK) and a disaster recovery data center (DR). When the primary DC is unavailable, the DR DC takes over. Two AR2240 core routers, one at MK and one at DR, function as egress gateways that connect the headquarters to the branches.
The customer has 430 branches. Each branch has two routers. Most branches use private lines from different carriers, so that each branch connects to both core routers at the headquarters through GRE tunnels.
At some branches, only private lines provided by a single carrier are available. These branches connect to only one core router at the headquarters through GRE tunnels. When the specific core router is faulty, services of the branches connected to this core router are affected. The fault at EWB was caused by the core router EWBMK2R-01, and services at 29 branches were affected.
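
The effect of single homing described above can be illustrated with a minimal sketch. It assumes a simple map from each branch to the set of headquarters core routers it reaches. The branch names and the router name DR-core are hypothetical; only EWBMK2R-01 comes from this case.

```python
from typing import Dict, Set


def affected_branches(homing: Dict[str, Set[str]], failed_router: str) -> Set[str]:
    """Return the branches whose only headquarters uplink is the failed router."""
    return {
        branch
        for branch, routers in homing.items()
        if routers == {failed_router}
    }


# Hypothetical topology: branch -> core routers reached through GRE tunnels.
homing = {
    "branch-A": {"EWBMK2R-01", "DR-core"},  # dual homed, survives the fault
    "branch-B": {"EWBMK2R-01"},             # single homed to the failed router
    "branch-C": {"DR-core"},                # single homed to the other router
}

print(sorted(affected_branches(homing, "EWBMK2R-01")))  # ['branch-B']
```

Applied to a real inventory, a check of this kind would list the branches that need a second link before the dual-homing recommendation can take effect.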

Handling Process

According to the following logs, the core router encountered an unexpected reset at 09:51:22, and the customer powered off and started the core router at 10:34:46. After the core router was started, services were successfully recovered.

The MPU frame[0] board[11]'s  reset total 10, detailed information:
--  1. 10/06   02:34:46, Reset No.: 10
       Reason: Board have been pulled out or no power
--  2. 10/06  09:51:22, Reset No.: 9
       Reason: Reset selfboard because of find exception
--  3. 02/21   00:03:52, Reset No.: 8
       Reason: Reset by user command
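
The reset record for the power cycle shows 02:34:46, while the timeline gives 10:34:46. The following sketch reconciles the two and derives the outage length. It assumes that the 02:34:46 entry was stamped in UTC (the device had not yet loaded its time zone configuration) and that the other times are UTC+08:00, as the log offsets suggest.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the customer's local time zone is UTC+08:00, per the log offsets.
LOCAL = timezone(timedelta(hours=8))

# Reset No. 9: unexpected reset, recorded in local time.
reset_at = datetime(2017, 10, 6, 9, 51, 22, tzinfo=LOCAL)

# Reset No. 10: power cycle, assumed to be recorded in UTC.
power_cycle_utc = datetime(2017, 10, 6, 2, 34, 46, tzinfo=timezone.utc)

power_cycle_local = power_cycle_utc.astimezone(LOCAL)
outage = power_cycle_local - reset_at

print(power_cycle_local.strftime("%H:%M:%S"))  # 10:34:46
print(outage)                                  # 0:43:24
```

Under these assumptions the gap between the two resets is 43 minutes 24 seconds, which matches the reported interruption of about 43 minutes.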


 




Logs recorded during the restart of the AR:

Oct 6 2017 09:51:09+08:00 EWBMK2R-01 OSPF/4/AGELSA:OID 1.3.6.1.2.1.14.16.2.13: An LSA is aged. (LsdbAreaId=0.0.0.0, LsdbType=3, LsdbLsid=172.27.19.64, LsdbRouterId=192.168.188.95, ProcessId=1, RouterId=192.168.188.96, InstanceName=)
Oct 6 2017 09:51:16+08:00 EWBMK2R-01 OSPF/4/AGELSA:OID 1.3.6.1.2.1.14.16.2.13: An LSA is aged. (LsdbAreaId=0.0.0.0, LsdbType=3, LsdbLsid=172.27.19.64, LsdbRouterId=192.168.188.95, ProcessId=1, RouterId=192.168.188.96, InstanceName=)

---The unexpected reset failed, and the core router was successfully restarted through power-off.

Oct 6 2017 02:37:17+00:00 Huawei %%01GTL/4/STATECHANGED(l)[0]:License state changed from NULL to Default.
Oct 6 2017 02:37:29+00:00 Huawei %%01QOS/6/INIT_OK(l)[1]:Succeed in mqc initializtion.
Oct 6 2017 02:37:33+00:00 Huawei %%01ADP_MSTP/5/SET_PORT_INSTANCE(l)[2]:Vlanlist has been bound on instance 0 on iochip slot 0.
Oct 6 2017 02:38:08+00:00 Huawei %%01ADP_MSTP/5/SET_PORT_INSTANCE(l)[3]:Vlanlist has been bound on instance 0 on iochip slot 0.
Oct 6 2017 02:38:11+00:00 Huawei %%01IFNET/4/LINK_STATE(l)[4]:The line protocol IP on the interface LoopBack0 has entered the UP state.
Oct 6 2017 10:39:35+08:00 EWBMK2R-01 LLDP/4/ADDCHGTRAP:OID 1.3.6.1.4.1.2011.5.25.134.2.5 Local management address is changed. (LocManIPAddr=127.0.0.1).

---The core router was started successfully.
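
The log excerpt mixes +08:00 and +00:00 timestamps, so entries are easier to compare after conversion to a single time zone. The following is a minimal sketch of such a conversion. The header pattern is inferred only from the sample lines in this article and is not a complete description of the device's log format; the function name parse_header is hypothetical.

```python
import re
from datetime import datetime, timezone

# Header layout inferred from the sample lines in this article only.
HEADER_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}) +"
    r"(?P<host>\S+) +(?:%%\d{2})?"
    r"(?P<module>\w+)/(?P<severity>\d)/(?P<mnemonic>\w+)"
)


def parse_header(line: str) -> dict:
    """Split a log line header into fields and add a UTC timestamp."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log header: {line!r}")
    fields = match.groupdict()
    # %z accepts offsets written with a colon on Python 3.7 and later.
    stamp = datetime.strptime(fields["ts"], "%b %d %Y %H:%M:%S%z")
    fields["utc"] = stamp.astimezone(timezone.utc)
    return fields


last_before = parse_header(
    "Oct 6 2017 09:51:16+08:00 EWBMK2R-01 OSPF/4/AGELSA:OID "
    "1.3.6.1.2.1.14.16.2.13: An LSA is aged."
)
first_after = parse_header(
    "Oct 6 2017 02:37:17+00:00 Huawei %%01GTL/4/STATECHANGED(l)[0]:"
    "License state changed from NULL to Default."
)

print(last_before["module"], first_after["module"])  # OSPF GTL
print(first_after["utc"] - last_before["utc"])       # 0:46:01
```

With both entries in UTC, the silence between the last log before the reset and the first log after the power cycle is about 46 minutes.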


 


System kernel debugging information:

<4>    a80000041fc43e00 a80000041fc43e08 7fffffffffffffff 0000000000000002
<4>    0000000000000000 ffffffff8110d8e0 0000000100000000 a80000041fc38000
<4>    ffffffff811605c0 a80000041fc43e10 a80000041fc43e10 ffffffff8111064c
<4>        ...
<4>Call Trace:
<4>[<ffffffff8110dc70>] schedule+0x220/0x700
<4>[<ffffffff8110e590>] schedule_timeout+0x1b8/0x2e0
<4>[<ffffffff8110d8e0>] wait_for_common+0xb8/0x190
<4>[<ffffffff8116cbc0>] do_fork+0x130/0x420
<4>[<ffffffff8113a684>] _sys_clone+0x2c/0x40
<4>[<ffffffff811036a4>] handle_sysn32+0x44/0x84


 


According to the system kernel log information recorded after the restart, CPU exception alarms were generated. Huawei suspected that the abnormal reset of the device was caused by CPU exceptions.

Software Analysis
  Reboot exception
1. Because the AR was powered off after the exception occurred, the call stack information for the reset was lost, so the reset cause cannot be determined directly.
2. The same operations were repeated in the lab (logging in to the AR through SSH and then configuring another device through redirected Telnet), but the problem was not reproduced.
3. The list of known problems in the R5 history was checked, and no related problem was found.
4. The CPU exception alarms recorded in the log indicate that further hardware analysis is needed.

  Failure to reboot
The device had been running for a long period of time, and a large number of file system fragments had accumulated in memory. When the device encounters an unexpected restart, the fragmented system data must be written back to the SD card. If this write-back times out, the reset fails, and the device must be powered off and on again to restart.

Hardware Analysis
After receiving the faulty device, R&D engineers checked the signals of key components and repeatedly powered on and off the device. The device ran properly, and the fault was not replicated.
When the device was subjected to electrical overstress in a lab environment, the logical status of the CPU latch on the SRU could become abnormal, causing an unexpected restart.
  Key component signal detection
During the startup, signals of the SD card, CPU power supply, watchdog, DDR, NandFlash, and peripheral components were captured. No abnormal signal was found.

Simulation of an electrical overstress environment
Various types of electrical stress were applied to reproduce scenarios in which a device encounters an unexpected reset.
Generally, in addition to the protection capabilities of its components, a device is grounded to improve its protection capabilities. Within the protection specifications, the logical status of the latch seldom becomes abnormal. However, under external electrical overstress, for example, surge, electrical fast transient/burst (EFT/B), or electrostatic discharge (ESD), components can become abnormal. Due to the electrical stress on the live network, the logical status of the latch became abnormal, but no permanent damage was caused. Instead, only CPU instructions were disordered, causing the device exception.
 Voltage dip test
The test simulated a power supply switchover or a transient change in supply voltage while services were running, to check whether such an event triggers a hardware reset.

The test result showed that when the voltage dropped to 40% and lasted for about 500 ms, an unexpected restart was triggered.
 EFT/B
Purpose: Superimpose rapidly changing transient interference signals on the power supply. The interference signals enter the circuits through the AC power supply. The device contains a large number of digital circuits, which are more sensitive to pulse interference. Interference signals that reach the digital circuits are superimposed directly on them and trigger inductive coupling, making the digital circuits abnormal.
Rapidly changing electrical pulses were superimposed on the AC power supply. The device ran normally at the levels covered by normal protection standards.

In the pulse test within the range of ±4 kV, the device ran normally. When the pulse voltage reached ±8 kV, the power supply was unstable, and the device encountered an unexpected restart.

ESD overstress test
Purpose: Simulate the process of generating static electricity on objects and test the device protection capabilities when the objects are in contact with or close to the device.
With the device grounded, an ESD test was performed on backplane services using a 24GE card.


Under normal grounding protection, the device was normal when the ESD level reached ±8 kV. When the ESD level reached ±12 kV, the device might encounter an unexpected restart.
Conclusion: According to the hardware test result, when the device encountered an external electrical overstress, for example, EFT/B or ESD overstress, an unexpected restart was triggered.
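
The EFT/B and ESD results above can be summarized as a small lookup. This is only a sketch of the reported levels, not a protection specification. The class and function names are hypothetical, and the behavior between or beyond the tested levels is not stated in the test results, so the sketch labels those ranges accordingly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressResult:
    test: str
    passed_kv: float  # level up to which the device ran normally in the test
    failed_kv: float  # level at which an unexpected restart was observed


# Levels taken from the EFT/B and ESD results reported above.
RESULTS = {
    "EFT/B": StressResult("EFT/B", passed_kv=4.0, failed_kv=8.0),
    "ESD": StressResult("ESD", passed_kv=8.0, failed_kv=12.0),
}


def classify(test: str, level_kv: float) -> str:
    """Place a stress level relative to the two levels reported for a test."""
    result = RESULTS[test]
    level = abs(level_kv)  # the tests were run at both polarities
    if level <= result.passed_kv:
        return "ran normally in the test"
    if level >= result.failed_kv:
        # Assumption: levels above the failing level are at least as harmful.
        return "unexpected restart observed or likely"
    return "between tested levels, outcome not reported"


print(classify("EFT/B", -4))  # ran normally in the test
print(classify("ESD", 10))    # between tested levels, outcome not reported
print(classify("ESD", 12))    # unexpected restart observed or likely
```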

Root Cause

Because of a defect in the reset mechanism, a device that has been running for a long time may fail to complete a software restart after an abnormal reboot. This problem is resolved in the latest software version.
The CPU became abnormal due to an external electrical overstress, triggering unexpected resets of the device.
Some branches did not have backup links, featuring low network reliability. When the headquarters site became abnormal, switchover could not be performed at the branches.

Solution


1. Upgrade the device software version to V200R007C00SPCc00 and install the patch V200R007SPH015.
2. It is recommended that the devices at branches be dual homed to different core routers at the headquarters.
3. It is recommended that the AR2240 be replaced with the AR3260-SRU200 with two SRUs.
4. Check the equipment room environment.
  Overall policies
(1) Do not cascade power cables, and connect rack power distribution units (RPDUs) to power sockets separately on the wall. Isolate low-voltage cables from high-voltage cables to prevent interference.
(2) Ensure that the ground cables of the devices and cabinets are properly connected, and ensure that the ESD and surge discharge channels are available.
(3) Arrange the cables on the connectors of the front and rear panels neatly when they are not used, and avoid exposure of connectors.
  Device installation requirements
(1) If multiple routers are installed in the cabinet, it is recommended that at least 1 U (44.45 mm) space be kept between two routers.


(2) Connect ground cables.
a Use a Phillips screwdriver to remove the M4 screw from the ground point on the rear panel of the router. Keep the M4 screw in an appropriate place for later use.
b Align the M4 lug of the ground cable with the tapped hole on the ground point, and then secure the ground cable with the M4 screw.
c Connect the M6 lug of the ground cable to a ground point on the desk, wall, or cabinet where the router is installed.


Device layout
(1) Spacing between cabinets: Reserve sufficient space for maintenance in the front and rear of the cabinet, while considering the air volume of devices. Determine the spacing based on the power density of the equipment room and the air volume required for heat dissipation of the devices. In most cases, the minimum spacing in the rear of a cabinet is 0.6 m. The recommended spacing is 1.2 m. The minimum spacing in the front of a cabinet is 1 m. The recommended spacing is 1.2 m (for example, the spacing between cabinets in high-density equipment rooms is 1.8 m). The minimum spacing on two sides of a cabinet is 0.6 m. The recommended spacing is 1.2 m.
(2) Layout of devices in the cabinet: When deploying devices in a cabinet, ensure that the center of gravity of the cabinet is low after device deployment. Deploy heavy or large-sized devices as close as possible to the bottom of the cabinet. Reserve sufficient space in the cabinet to facilitate device maintenance without hindering the routing of high-voltage and low-voltage cables. For example, if there are only a small number of IT devices in an equipment room, deploy them in the same cabinet from the bottom to the top, starting from 3 U devices to 2 U devices, such as the UPS, servers, network devices, and distribution frames.
(3) Grounding: All IT cabinets must be grounded, and ground cables cannot be connected in series. IT devices that require grounding must be grounded, and the grounding resistance must be less than or equal to 1 ohm.
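
The minimum spacing and grounding values above lend themselves to a simple site survey check. The following sketch assumes survey data is available as one record per cabinet. The record layout, field names, and the cabinet name CAB-01 are hypothetical; the limits are the minimum values stated above, not the recommended ones.

```python
from dataclasses import dataclass
from typing import List

# Minimum values stated in the device layout requirements above.
MIN_FRONT_M = 1.0
MIN_REAR_M = 0.6
MIN_SIDE_M = 0.6
MAX_GROUND_OHM = 1.0


@dataclass
class CabinetSurvey:
    """Hypothetical survey record for one cabinet."""
    name: str
    front_m: float     # clearance in front of the cabinet, in meters
    rear_m: float      # clearance behind the cabinet, in meters
    side_m: float      # clearance on each side, in meters
    ground_ohm: float  # measured grounding resistance, in ohms


def violations(cabinet: CabinetSurvey) -> List[str]:
    """Return the requirements that a cabinet does not meet."""
    issues = []
    if cabinet.front_m < MIN_FRONT_M:
        issues.append(f"front clearance {cabinet.front_m} m < {MIN_FRONT_M} m")
    if cabinet.rear_m < MIN_REAR_M:
        issues.append(f"rear clearance {cabinet.rear_m} m < {MIN_REAR_M} m")
    if cabinet.side_m < MIN_SIDE_M:
        issues.append(f"side clearance {cabinet.side_m} m < {MIN_SIDE_M} m")
    if cabinet.ground_ohm > MAX_GROUND_OHM:
        issues.append(f"grounding resistance {cabinet.ground_ohm} ohm > {MAX_GROUND_OHM} ohm")
    return issues


survey = CabinetSurvey("CAB-01", front_m=1.2, rear_m=0.5, side_m=0.6, ground_ohm=1.5)
for issue in violations(survey):
    print(f"{survey.name}: {issue}")
```

For the sample record, the check reports the rear clearance and the grounding resistance as out of range.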
  Cabinet cabling
(1) Separation of high-voltage and low-voltage cables: High-voltage and low-voltage cables must be separated in a cabinet.
(2) Cable arrangement: Arrange cables according to the cabling mode of devices, and ensure that the front and rear doors of the cabinet can be opened and closed normally after the cables of installed devices in the cabinet are connected.
(3) Isolation measures: Lay out power cables inside cable troughs or bridge racks far away from computer signal cables. Do not lay out the low-voltage and high-voltage cables side by side. If side-by-side layout cannot be avoided, take corresponding measures to isolate them.
  RPDU installation
RPDUs are designed especially for power distribution for electrical devices installed in cabinets. The following figure shows an RPDU.


Use RPDUs with indicators but no switch or switch reset button to minimize unintentional outage. RPDUs are connected to power sockets using industrial connectors. The number of output ports on an RPDU is determined as required. An RPDU with 10 or more sockets is recommended. The socket type must conform to local mandatory standards. When 32 A and higher-current RPDUs are used, circuit breakers can be provided. The circuit breakers must have operation protection covers.
Installation requirements:
(1) Each IT equipment room must have at least one power distribution cabinet (box). The UPS output must pass through this power distribution cabinet (box). Sufficient space needs to be reserved for expansion of 2 to 4 switches in a power distribution cabinet (box) while the current IT design requirements are met.
(2) RPDU fixing: RPDUs must be secured to cabinets if the cabinets are available, and cannot be placed randomly in the cabinets.
(3) Installation position: RPDUs must be fixed at positions above the ground. It is recommended that a horizontal RPDU be installed within a height of 6 U in the rear of a cabinet, and a vertical RPDU on the left or right rear side of a cabinet. The specific installation position depends on the cabinet structure and layout structure of IT devices. The installation positions of RPDUs should be consistent in the same equipment room. In addition, RPDU installation should not affect installation, maintenance, and heat dissipation of IT devices. Low-voltage and high-voltage cables must be separated.
(4) Power connection: Each RPDU is connected to a switch or an output socket of the UPS in a power distribution box. The power cable is connected to a waterproof socket with the IP rating not lower than IP44. The live wire, neutral wire, and ground cable are connected in compliance with local national standards. Three-phase RPDUs must be grounded.
(5) Cascading: Each cabinet must use an independent RPDU, and RPDUs cannot be cascaded.