Publication Date: 2019-07-22 | Views: 579 | Downloads: 0 | Author: SU1001006543 | Document ID: EKB1001358606
The engineer reported that an AR2240 at the headquarters was faulty. As a result, the interconnections between the headquarters and 29 branches were interrupted for about 43 minutes. The sequence of events was as follows:
09:48:08: The customer logged in to the AR2240 through SSH.
09:49:31: The customer switched from the AR2240 to the HP router through Telnet.
09:51:16: The customer copied the configuration on the HP router, but the AR router stopped responding. The customer confirmed through the NMS that the AR router was offline.
10:34:46: The customer powered off the AR router and then powered it on again. After the AR router started successfully, services recovered.
10:52:02: The customer switched from the AR router to the HP router. The configuration was normal, and no AR exception occurred.
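The reported interruption of about 43 minutes can be cross-checked against the timeline above. The following is a minimal sketch, assuming the outage starts when the AR router stopped responding (09:51:16) and ends when the customer powered it off and on (10:34:46). The function name `outage_duration` is illustrative only, and actual service recovery would be slightly later because the device still had to boot.

```python
from datetime import datetime, timedelta

def outage_duration(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# Timestamps taken from the timeline above (assumed start and end of the outage).
duration = outage_duration("09:51:16", "10:34:46")
minutes, seconds = divmod(int(duration.total_seconds()), 60)
print(f"{minutes} min {seconds} s")  # 43 min 30 s
```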
At the customer's headquarters, a primary DC (MK) and a disaster recovery DC (DR) are deployed. When the primary DC is unavailable, the DR DC takes over. Two AR2240 core routers functioning as egress gateways, one at MK and one at DR, connect the headquarters and branches.
The customer has 430 branches, each with two routers. Most branches use private lines from different carriers to connect to both core routers at the headquarters through GRE tunnels.
At some branches, only private lines from a single carrier are available. These branches connect to only one core router at the headquarters through GRE tunnels. When that core router is faulty, services of the branches connected only to it are affected. The fault at EWB was caused by the core router EWBMK2R-01, and services at 29 branches were affected.
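The dependency between branch homing and the failure impact can be sketched as a simple lookup. This is an illustration under stated assumptions, not the customer's actual inventory: the branch names and the second core router name `core-DR` are hypothetical, and only `EWBMK2R-01` comes from the case.

```python
from typing import Dict, List, Set

def affected_branches(homing: Dict[str, Set[str]], failed_router: str) -> List[str]:
    """Return the branches whose only core router is failed_router."""
    return sorted(
        branch
        for branch, routers in homing.items()
        if routers == {failed_router}
    )

# Hypothetical sample data; only EWBMK2R-01 is a name from the case.
homing = {
    "branch-A": {"EWBMK2R-01", "core-DR"},  # dual-homed, survives a single failure
    "branch-B": {"EWBMK2R-01"},             # single-homed to the failed router
    "branch-C": {"core-DR"},                # single-homed to the other router
}

print(affected_branches(homing, "EWBMK2R-01"))  # ['branch-B']
```

Running such a check against the real tunnel inventory for each core router would list the branches that need a second link.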
According to the following logs, the core router encountered an unexpected reset at 09:51:22, and the customer powered off and started the core router at 10:34:46. After the core router was started, services were successfully recovered.
Logs recorded during the restart of the AR:
System kernel debugging information:
According to the system kernel log information recorded after the restart, CPU exception alarms were generated. Huawei suspected that the abnormal reset of the device was caused by CPU exceptions.
1. The AR device was powered off after the exception occurred, so the reset call stack information was lost and the reset cause could not be determined directly.
2. The same operations were repeated in the lab (logging in to the AR through SSH and configuring through redirected Telnet), but the problem was not reproduced.
3. The list of known problems in the R5 history was checked, and no related problem was found.
4. Given the CPU exception alarms recorded in the log, further hardware analysis was needed.
Failure to reboot
The device had been running for a long time, and a large number of file system fragments had been generated in memory. When the device encounters an unexpected restart, the fragmented system data must be written back to the SD card. If this write-back times out, the reset fails, and the device must be powered off and on again to restart.
After receiving the faulty device, R&D engineers checked the signals of key components and repeatedly powered on and off the device. The device ran properly, and the fault was not replicated.
When the device was subjected to electrical overstress in a lab environment, the logical status of the CPU latch on the SRU could become abnormal, causing an unexpected restart.
Key component signal detection
During the startup, signals of the SD card, CPU power supply, watchdog, DDR, NandFlash, and peripheral components were captured. No abnormal signal was found.
Simulation of an electrical overstress environment
Various types of electrical stress were simulated to reproduce scenarios in which a device encounters an unexpected reset.
Generally, in addition to the protection built into its components, a device is grounded to improve its protection capabilities. Within the protection specifications, the logical status of the latch seldom becomes abnormal. However, under external electrical overstress, for example, a surge, electrical fast transient/burst (EFT/B), or electrostatic discharge (ESD), components can become abnormal. On the live network, electrical stress made the logical status of the latch abnormal without causing permanent damage. Only the CPU instructions were disordered, which caused the device exception.
Voltage dip test
A hardware reset was simulated in a scenario where services were running and either the power supply was switched or the supply voltage changed transiently.
The test result showed that an unexpected restart was triggered when the voltage dropped to 40% and stayed there for about 500 ms.
EFT/B test
Purpose: Superimpose fast-changing transient interference signals on the power supply. The interference signals enter the circuits through the AC power supply. The device contains a large number of digital circuits, which are more sensitive to pulse interference. Interference signals that enter the digital circuits are superimposed directly on them and coupled in by induction, making the digital circuits abnormal.
Fast-changing electrical pulses were superimposed on the AC power supply, and the device ran normally within the normal protection standards.
In the pulse test within the range of ±4 kV, the device ran normally. When the pulse voltage reached ±8 kV, the power supply was unstable, and the device encountered an unexpected restart.
ESD overstress test
Purpose: Simulate the process of generating static electricity on objects and test the device protection capabilities when the objects are in contact with or close to the device.
In the grounded scenario, an ESD test was performed on backplane services using a 24GE card.
Under normal grounding protection, the device was normal when the ESD level reached ±8 kV. When the ESD level reached ±12 kV, the device might encounter an unexpected restart.
Conclusion: According to the hardware test result, when the device encountered an external electrical overstress, for example, EFT/B or ESD overstress, an unexpected restart was triggered.
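The pulse and ESD results above can be summarized in a small lookup that classifies a stress level against what was observed. This is a sketch under assumptions: the pulse test is taken to be the EFT/B test named in the conclusion, levels between the two reported points are treated as not covered, and levels above the reported restart level are assumed to be at least as harmful. The names `OBSERVED_KV` and `classify` are illustrative.

```python
from typing import Dict, Tuple

# (highest level at which the device ran normally,
#  level at which a restart was observed), both in kV, from the results above.
OBSERVED_KV: Dict[str, Tuple[float, float]] = {
    "EFT/B": (4.0, 8.0),
    "ESD": (8.0, 12.0),
}

def classify(test: str, level_kv: float) -> str:
    """Classify a stress level against the reported test results."""
    normal_up_to, restart_at = OBSERVED_KV[test]
    level = abs(level_kv)  # the tests were applied with both polarities
    if level <= normal_up_to:
        return "normal"
    if level >= restart_at:
        # Assumption: anything at or above the reported restart level is risky.
        return "restart possible"
    return "not covered by the reported tests"

print(classify("EFT/B", -4))  # normal
print(classify("ESD", 10))    # not covered by the reported tests
print(classify("ESD", 12))    # restart possible
```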
Because of a problem in the reset mechanism, a software restart may fail when a device that has been running for a long time restarts abnormally. This problem is resolved in the latest software version.
The CPU became abnormal due to an external electrical overstress, triggering unexpected resets of the device.
Some branches had no backup links, so network reliability was low. When the headquarters site became abnormal, these branches could not switch over to another link.
1. Upgrade the device software version to V200R007C00SPCc00 and install the patch V200R007SPH015.
2. It is recommended that the devices at branches be dual homed to different core routers at the headquarters.
3. It is recommended that the AR2240 be replaced with the AR3260-SRU200 with two SRUs.
4. Check the equipment room environment.
(1) Do not cascade power cables. Connect each rack power distribution unit (RPDU) separately to a power socket on the wall. Isolate low-voltage cables from high-voltage cables to prevent interference.
(2) Ensure that the ground cables of the devices and cabinets are properly connected, and ensure that the ESD and surge discharge channels are available.
(3) Arrange the cables on the connectors of the front and rear panels neatly when they are not used, and avoid exposure of connectors.
Device installation requirements
(1) If multiple routers are installed in the cabinet, it is recommended that at least 1 U (44.45 mm) space be kept between two routers.
(2) Connect ground cables.
a Use a Phillips screwdriver to remove the M4 screw from the ground point on the rear panel of the router. Keep the M4 screw in an appropriate place for later use.
b Align the M4 lug of the ground cable with the tapped hole on the ground point, and then secure the ground cable with the M4 screw.
c Connect the M6 lug of the ground cable to a ground point on the desk, wall, or cabinet where the router is installed.
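The 1 U spacing recommendation in item (1) translates directly into rack space. The sketch below assumes the router height in rack units is supplied by the reader; the 2 U value in the example is a placeholder, not a figure from this case, and `rack_space_u` is an illustrative name.

```python
U_MM = 44.45  # height of one rack unit in millimetres, as stated above

def rack_space_u(router_count: int, router_height_u: int, gap_u: int = 1) -> int:
    """Rack units needed for routers stacked with gap_u of free space between neighbors."""
    if router_count <= 0:
        return 0
    return router_count * router_height_u + (router_count - 1) * gap_u

# Example: three routers of an assumed height of 2 U each, with 1 U gaps.
units = rack_space_u(router_count=3, router_height_u=2)
print(units, "U =", round(units * U_MM, 2), "mm")
```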
(1) Spacing between cabinets: Reserve sufficient space for maintenance in the front and rear of the cabinet, while considering the air volume of devices. Determine the spacing based on the power density of the equipment room and the air volume required for heat dissipation of the devices. In most cases, the minimum spacing in the rear of a cabinet is 0.6 m. The recommended spacing is 1.2 m. The minimum spacing in the front of a cabinet is 1 m. The recommended spacing is 1.2 m (for example, the spacing between cabinets in high-density equipment rooms is 1.8 m). The minimum spacing on two sides of a cabinet is 0.6 m. The recommended spacing is 1.2 m.
(2) Layout of devices in the cabinet: When deploying devices in a cabinet, keep the cabinet's center of gravity low. Deploy heavy or large devices as close as possible to the bottom of the cabinet. Reserve sufficient space in the cabinet to facilitate device maintenance without hindering the cabling of high-voltage and low-voltage cables. For example, if there are only a small number of IT devices in an equipment room, deploy them in the same cabinet from the bottom to the top, starting with 3 U devices and then 2 U devices, such as the UPS, servers, network devices, and distribution frames.
(3) Grounding: All IT cabinets must be grounded, and ground cables cannot be connected in series. IT devices that require grounding must be grounded, and the grounding resistance must be less than or equal to 1 ohm.
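The minimum clearances and the grounding limit above lend themselves to a simple site-survey check. The following sketch uses the minimum values from items (1) and (3); the function and constant names are hypothetical, and the measured values in the example are made up for illustration.

```python
from typing import List

# Minimum clearances in metres and the grounding limit in ohms, from the requirements above.
MIN_CLEARANCE_M = {"front": 1.0, "rear": 0.6, "side": 0.6}
MAX_GROUND_RESISTANCE_OHM = 1.0

def check_cabinet(front_m: float, rear_m: float, side_m: float, ground_ohm: float) -> List[str]:
    """Return a list of findings; an empty list means all checked limits are met."""
    findings = []
    measured = {"front": front_m, "rear": rear_m, "side": side_m}
    for position, minimum in MIN_CLEARANCE_M.items():
        if measured[position] < minimum:
            findings.append(
                f"{position} clearance {measured[position]} m is below the {minimum} m minimum"
            )
    if ground_ohm > MAX_GROUND_RESISTANCE_OHM:
        findings.append(
            f"grounding resistance {ground_ohm} ohm exceeds {MAX_GROUND_RESISTANCE_OHM} ohm"
        )
    return findings

# Made-up survey values: the rear clearance and the grounding resistance fail.
for finding in check_cabinet(front_m=1.2, rear_m=0.5, side_m=0.6, ground_ohm=1.5):
    print(finding)
```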
(1) Separation of high-voltage and low-voltage cables: High-voltage and low-voltage cables must be separated in a cabinet.
(2) Cable arrangement: Arrange cables according to the cabling mode of devices, and ensure that the front and rear doors of the cabinet can be opened and closed normally after the cables of installed devices in the cabinet are connected.
(3) Isolation measures: Lay out power cables inside cable troughs or bridge racks far away from computer signal cables. Do not lay out the low-voltage and high-voltage cables side by side. If side-by-side layout cannot be avoided, take corresponding measures to isolate them.
RPDUs are designed especially for power distribution for electrical devices installed in cabinets. The following figure shows an RPDU.
Use RPDUs with indicators but no switch or switch reset button to minimize unintentional outage. RPDUs are connected to power sockets using industrial connectors. The number of output ports on an RPDU is determined as required. An RPDU with 10 or more sockets is recommended. The socket type must conform to local mandatory standards. When 32 A and higher-current RPDUs are used, circuit breakers can be provided. The circuit breakers must have operation protection covers.
(1) Each IT equipment room must have at least one power distribution cabinet (box). The UPS output must pass through this power distribution cabinet (box). Sufficient space needs to be reserved for expansion of 2 to 4 switches in a power distribution cabinet (box) while the current IT design requirements are met.
(2) RPDU fixing: RPDUs must be secured to cabinets if the cabinets are available, and cannot be placed randomly in the cabinets.
(3) Installation position: RPDUs must be fixed at positions above the ground. It is recommended that a horizontal RPDU be installed within a height of 6 U in the rear of a cabinet, and a vertical RPDU on the left or right rear side of a cabinet. The specific installation position depends on the cabinet structure and layout structure of IT devices. The installation positions of RPDUs should be consistent in the same equipment room. In addition, RPDU installation should not affect installation, maintenance, and heat dissipation of IT devices. Low-voltage and high-voltage cables must be separated.
(4) Power connection: Each RPDU is connected to a switch or an output socket of the UPS in a power distribution box. The power cable is connected to a waterproof socket with the IP rating not lower than IP44. The live wire, neutral wire, and ground cable are connected in compliance with local national standards. Three-phase RPDUs must be grounded.
(5) Cascading: Each cabinet must use an independent RPDU, and RPDUs cannot be cascaded.
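Some of the RPDU rules above can be checked mechanically against a power plan. This is a minimal sketch with a hypothetical data model (`Rpdu`, `check_rpdus`, and all sample values are illustrative); it covers only cascading, fixing, and the socket-count recommendation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rpdu:
    name: str
    cabinet: str
    fed_from: str   # for example "distribution-box", "ups", or another RPDU's name
    sockets: int
    secured: bool   # True if the RPDU is fixed to the cabinet

def check_rpdus(rpdus: List[Rpdu]) -> List[str]:
    """Return rule violations and recommendation misses for an RPDU plan."""
    names = {r.name for r in rpdus}
    issues = []
    for r in rpdus:
        if r.fed_from in names:
            issues.append(f"{r.name}: cascaded from {r.fed_from}")
        if not r.secured:
            issues.append(f"{r.name}: not secured to the cabinet")
        if r.sockets < 10:
            issues.append(f"{r.name}: fewer than 10 sockets (recommendation only)")
    return issues

# Hypothetical plan: rpdu-2 breaks all three checked rules.
plan = [
    Rpdu("rpdu-1", "cab-1", "distribution-box", sockets=12, secured=True),
    Rpdu("rpdu-2", "cab-2", "rpdu-1", sockets=8, secured=False),
]
for issue in check_rpdus(plan):
    print(issue)
```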