According to the following logs, the core router encountered an unexpected reset at 09:51:22, and the customer powered off and started the core router at 10:34:46. After the core router was started, services were successfully recovered.
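As a quick check on the timeline, the interval between the unexpected reset and the manual power cycle can be computed from the two timestamps above. This is a minimal sketch that assumes both timestamps fall on the same day; the variable names are illustrative only and do not come from the device logs.

```python
from datetime import datetime

# Timestamps quoted in the summary above (same calendar day assumed).
reset_time = datetime.strptime("09:51:22", "%H:%M:%S")
power_cycle_time = datetime.strptime("10:34:46", "%H:%M:%S")

# Interval during which the router stayed down before the manual power cycle.
outage = power_cycle_time - reset_time
outage_seconds = int(outage.total_seconds())
minutes, seconds = divmod(outage_seconds, 60)

print(f"Time from reset to power cycle: {minutes} min {seconds} s ({outage_seconds} s)")
```

The result covers only the time up to the power cycle; the boot time needed before services recovered is not included.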
Logs recorded during the restart of the AR:
System kernel debugging information:
According to the system kernel log information recorded after the restart, CPU exception alarms were generated. Huawei suspected that the abnormal reset of the device was caused by CPU exceptions.
1. Because the AR was powered off after the exception occurred, the reset call stack information was lost, so the reset cause cannot be determined directly.
2. The same operations were repeated in the lab (logging in to the AR through SSH and configuring it through Telnet redirection), but the problem was not reproduced.
3. The list of known problems in the R5 history was checked, and no related problems were found.
4. The CPU exception alarms recorded in the log indicate that further hardware analysis is needed.
Failure to reboot
The device had been running for a long period of time, and a large number of file system fragments had accumulated in memory. When the device encounters an unexpected restart, the fragmented system data needs to be written back to the SD card. If this write-back times out, the reset fails, and the device must be powered off and on again to restart.
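The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, the per-fragment write time, and the timeout value are hypothetical and do not reflect the actual firmware behavior or its real parameters.

```python
def reset_outcome(dirty_fragments: int,
                  write_ms_per_fragment: float,
                  timeout_ms: float) -> str:
    """Toy model: does the write-back to the SD card finish before the timeout?

    All parameters are hypothetical illustration values, not firmware constants.
    """
    writeback_ms = dirty_fragments * write_ms_per_fragment
    if writeback_ms <= timeout_ms:
        return "reset completes"
    return "reset fails; manual power cycle required"


# Few fragments: the write-back fits within the (assumed) timeout.
print(reset_outcome(dirty_fragments=200, write_ms_per_fragment=2.0, timeout_ms=5000))

# Long uptime, many fragments: the write-back exceeds the (assumed) timeout.
print(reset_outcome(dirty_fragments=50_000, write_ms_per_fragment=2.0, timeout_ms=5000))
```

The point of the model is only that write-back time grows with the amount of fragmented data, so a device with a long uptime is more likely to exceed a fixed timeout.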
After receiving the faulty device, R&D engineers checked the signals of key components and repeatedly powered on and off the device. The device ran properly, and the fault was not replicated.
When the device was subjected to electrical overstress in a lab environment, there was a possibility that the logical status of the CPU latch on the SRU became abnormal, causing an unexpected restart.
Key component signal detection
During the startup, signals of the SD card, CPU power supply, watchdog, DDR, NandFlash, and peripheral components were captured. No abnormal signal was found.
Simulation of an electrical overstress environment
Various types of electrical stress were applied to reproduce the conditions under which a device encounters an unexpected reset.
Generally, in addition to relying on the protection capabilities of its components, a device is grounded to improve its protection. Within the protection specifications, the logical status of the latch seldom becomes abnormal. However, under external electrical overstress, for example, surge, electrical fast transient/burst (EFT/B), or electrostatic discharge (ESD), components can become abnormal. Due to the electrical stress on the live network, the logical status of the latch became abnormal, but no permanent damage was caused. Instead, only CPU instructions were disordered, which caused the device exception.
Voltage dip test
A hardware reset was simulated in the scenario where services were running and the power supply was switched over or the supply voltage changed transiently.
The test result showed that when the voltage dropped to 40% and the dip lasted for about 500 ms, an unexpected restart was triggered.
EFT/B test
Purpose: Superimpose fast-changing transient interference signals on the power supply. The interference signals enter the circuits through the AC power supply. The device contains a large number of digital circuits, which are particularly sensitive to pulse interference. Interference signals that reach the digital circuits are superimposed directly on them and also couple in inductively, making the digital circuits abnormal.
Fast-changing electrical pulses were superimposed on the AC power supply, and the device ran normally within the standard protection levels.
In the pulse test within the range of ±4 kV, the device ran normally. When the pulse voltage reached ±8 kV, the power supply became unstable, and the device encountered an unexpected restart.
ESD overstress test
Purpose: Simulate the process of generating static electricity on objects and test the device protection capabilities when the objects are in contact with or close to the device.
With the device grounded, an ESD test was performed on backplane services using a 24GE card.
Under normal grounding protection, the device ran normally when the ESD level reached ±8 kV. When the ESD level reached ±12 kV, the device might encounter an unexpected restart.
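The levels reported for the pulse test on the AC power supply (labeled EFT/B here) and for the ESD test can be collected in a small table to show the margin between normal operation and a restart. This is a sketch for illustration only: the data structure and the `margin` helper are hypothetical, and only the levels stated in this report are encoded. Intermediate levels were not reported, so the actual threshold lies somewhere between the two values in each pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressResult:
    test: str       # test name as used in this report
    level_kv: int   # magnitude of the ± test level, in kV
    outcome: str    # "normal" or "restart" (for ESD at 12 kV: restart was possible)


# Only the levels stated in the report are listed here.
RESULTS = [
    StressResult("EFT/B", 4, "normal"),
    StressResult("EFT/B", 8, "restart"),
    StressResult("ESD", 8, "normal"),
    StressResult("ESD", 12, "restart"),
]


def margin(test: str) -> tuple[int, int]:
    """Return (highest level with normal operation, lowest level with a restart)."""
    normal = max(r.level_kv for r in RESULTS if r.test == test and r.outcome == "normal")
    restart = min(r.level_kv for r in RESULTS if r.test == test and r.outcome == "restart")
    return normal, restart


for name in ("EFT/B", "ESD"):
    ok_kv, restart_kv = margin(name)
    print(f"{name}: normal up to ±{ok_kv} kV, restart observed at ±{restart_kv} kV")
```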
Conclusion: According to the hardware test results, when the device encountered an external electrical overstress, such as EFT/B or ESD overstress, an unexpected restart was triggered.