Huawei Server Maintenance Manual 09

NCM Bandwidth Decrease Alarm

Problem Description

Table 5-250 Basic information

Item                    Information
Source of the Problem   KunLun 9016
Intended Product        KunLun 9008/9016/9032
Release Date            2017-08-16
Keyword                 NCM chip, bandwidth, decrease

Symptom

The alarm "The NCM chip bandwidth is decreased by 50% or the ECC error count exceeds the threshold" is displayed on the CMC WebUI.

Key Process and Cause Analysis

The alarm indicates that the bandwidth of the QPI links on the node controller module (NCM) chip has decreased, which may mean that the NCM is faulty. Collect the alarm information in one-click mode for further analysis.

Use the alarm information to determine which NCM generated the alarm. In this case, the alarm source is SCE2 and the event body is NCM1, so the NCM reporting the alarm is located on BPU B of SCE2. The SCE2-BPUB\dump_info-2-3\LogDump\strategy_log file shows a QPI bandwidth decrease at the time the alarm was generated, as shown in Figure 5-319.

Figure 5-319 Strategy_log

The affected port is QPI0 (each NCM provides four QPI ports). Figure 5-320 shows the topology between the NCM QPI ports and the CPU board modules.

Figure 5-320 QPI topology

QPI0 of NCM1 connects to CPU board module 3. A QPI link alarm may be caused by a fault in the NCM, the CPU board module, or the backplane. If possible, perform the following steps to locate the faulty component. (The backplane is not considered in this case because it is a passive component and is unlikely to fail during operation.)
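The list of candidate components can be expressed as a small lookup. This is a minimal Python sketch assuming a hypothetical `QPI_TOPOLOGY` table and `suspect_components` helper; only the NCM1/QPI0 to CPU board module 3 link comes from this case, and the remaining links would have to be read from Figure 5-320 for the actual system.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table: (NCM, QPI port) -> CPU board module number.
# Only this entry is taken from the case above; the other links must be read
# from the QPI topology figure (Figure 5-320) for the actual system.
QPI_TOPOLOGY: Dict[Tuple[str, str], int] = {
    ("NCM1", "QPI0"): 3,
}


def suspect_components(ncm: str, port: str) -> List[str]:
    """Return the components to check for a QPI link bandwidth alarm.

    The backplane is deliberately excluded: it is passive and unlikely
    to fail during operation.
    """
    key = (ncm, port)
    if key not in QPI_TOPOLOGY:
        raise KeyError(f"no topology entry for {ncm} {port}")
    return [ncm, f"CPU board module {QPI_TOPOLOGY[key]}"]


print(suspect_components("NCM1", "QPI0"))  # ['NCM1', 'CPU board module 3']
```

Raising an error for an unknown port, rather than guessing, keeps the sketch from reporting a CPU board module that the topology does not confirm.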

  1. Move CPU board module 3 to one of the CPU board module slots (1, 2, 5, or 6) connected to NCM2. If the alarm event body changes to NCM2, CPU board module 3 is faulty. Otherwise, NCM1 is faulty.
  2. Reinstall CPU board module 3 in its original slot, and swap NCM1 and NCM2. If the alarm event body changes to NCM2 (the module that was NCM1 before the swap), NCM1 is faulty. Otherwise, NCM2 is faulty.

If the preceding steps cannot be performed onsite, replace the suspected NCM and CPU board module.
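The two swap tests and the onsite fallback amount to a small decision procedure. The following Python sketch mirrors the rules exactly as the procedure states them. The function names and the `FALLBACK` tuple are hypothetical; each input is the alarm event body an engineer would read from the CMC WebUI after the swap, and `None` stands for a test that could not be performed onsite.

```python
from typing import List, Optional

# Suspects to replace when a swap test cannot be performed onsite.
FALLBACK = ("NCM1", "CPU board module 3")


def step1_move_cpu_board(event_body: Optional[str]) -> List[str]:
    """Step 1: CPU board module 3 moved to a slot (1, 2, 5, or 6) connected to NCM2.

    event_body is the alarm event body observed after the move, or None if
    the test could not be performed.
    """
    if event_body is None:
        return list(FALLBACK)
    if event_body == "NCM2":
        # The alarm followed the CPU board module to the other NCM.
        return ["CPU board module 3"]
    return ["NCM1"]


def step2_swap_ncms(event_body: Optional[str]) -> List[str]:
    """Step 2: CPU board module 3 back in its slot, NCM1 and NCM2 swapped.

    event_body is the alarm event body observed after the swap, or None if
    the test could not be performed.
    """
    if event_body is None:
        return list(FALLBACK)
    if event_body == "NCM2":
        # The alarm followed the module that was installed as NCM1.
        return ["NCM1"]
    # Rule as stated in step 2 of the procedure.
    return ["NCM2"]


print(step1_move_cpu_board("NCM1"))  # ['NCM1']
print(step2_swap_ncms("NCM2"))       # ['NCM1']
print(step1_move_cpu_board(None))    # ['NCM1', 'CPU board module 3']
```

Each step is a separate function because the tests are run one swap at a time, with the hardware restored between them.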

Conclusion and Solution

NCM1 is faulty. After NCM1 is replaced, the alarm is cleared.

Experience

None

Note

None

Updated: 2019-02-25

Document ID: EDOC1000041338
