Problems Related to Interface and Management Modules of the S5600T Storage

Publication Date: 2015-09-28
Issue Description

2.1 Description
On September 23, 2014, a SAS interface module of an S5600T failed to work. However, after being inserted into another slot, the SAS interface module worked properly. After an engineer replaced the controller enclosure, the problem was resolved.
On December 15, 2014, the customer found that the management network port of an S5600T could not be pinged. Reinserting the management module multiple times and finally replacing it did not resolve the problem. However, after the storage array was restarted, the symptom disappeared.
On February 10, 2015, a SAS interface module of another S5600T failed to work.

2.2 Recovery Process
The problem related to the SAS interface module that occurred on September 23, 2014 was resolved by replacing the controller enclosure.
The management network port problem found on December 15, 2014 was not resolved by reinserting the management module multiple times or by replacing it. It was resolved after the storage system was restarted.
The SAS interface module problem that occurred on another S5600T on February 10, 2015 has not been resolved yet.

Alarm Information

3.1 Analysis of the Problem That Occurred on September 23, 2014

3.1.2 Log Analysis
Analysis of the device logs revealed the following events.
1. In September 2014, the SAS driver found that the PCI register of the SAS chip on interface module B1 was abnormal, and then reported the chip abnormality event to trigger the PCIe recovery process.
Related logs are as follows:
2014-09-23 22:46:43    0xe02040037    Infor    None    The recovery process of the PCIE on controller ([B]) started. This process temporarily disconnected the ports.    None
[2014-09-23 22:46:43][1272943995][540004100459][ERR][Card 2 scrapad 0 0xffffffff scrapad 1 0xffffffff scrapad 2 0xffffffff scrapad 3 0xffffffff][SAS_INI][PMS_Chec.cState,5752][sas_event]
[2014-09-23 22:46:43][1272943995][54000410054c][ERR][Card 2 pcie error detected, hotplug to recover][SAS_INI][PMS_Chec.cState,5764][sas_event]
[2014-09-23 22:46:43][1272943995][5400040b006f][ERR][Device notify chip error!8001 11F8  [b:25 s:00 f:00]][PCIE_AER][PCIEAER_.hipErr,768][sas_event]
The PCIe recovery process failed, and the interface module reported the failure. The cause of the PCIe recovery failure was that the rate negotiation between the SAS interface module and the PCIe bridge chip on the controller failed.
2014-09-23 22:47:04    0xe01ff0003    Major    2014-11-19 10:34:47 DST    The interface module (controller enclosure 0, SAS interface module B1) is faulty. The error code is (0). Therefore, links to the ports on the interface module are down, and services are interrupted.
[2014-09-23 22:47:03][1272943995][5400040a008b][ERR][Link training error occurs.][PCIE_HP][PCIEHP_C.Status,921][PCIEHP_WorkThre]
[2014-09-23 22:47:03][1272943995][5400040a0175][ERR][Check the link status failed.][PCIE_HP][PCIEHP_A.Device,3988][PCIEHP_WorkThre]
2. On November 19, 2014, the customer removed the SAS interface module and reinserted it into its original slot (slot 1) multiple times. However, the alarm indicator on the interface module remained red.
The interface module failed to initialize during power-on. The cause of the initialization failure was the same as before: the PCIe rate negotiation failed.
Related logs are as follows:
[2014-11-19 09:35:24][3013498881][5400040a008b][ERR][Link training error occurs.][PCIE_HP][PCIEHP_C.Status,921][PCIEHP_WorkThre]
[2014-11-19 09:35:24][3013498881][5400040a0175][ERR][Check the link status failed.][PCIE_HP][PCIEHP_A.Device,3988][PCIEHP_WorkThre]
3. The interface module was reinserted into the same slot multiple times. However, the problem persisted.
4. Log information indicated that the SAS interface module was reinserted into slot 1 multiple times but the problem was not resolved. After the module was inserted into slot 4, the problem was resolved.
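The bracketed log lines quoted above can be split into fields for filtering. The following is a minimal sketch, not a vendor tool. The field names (sequence, event_code, location, thread) are assumptions inferred from the sample lines, and the splitter tracks bracket depth because the message field can itself contain brackets, as in the "Device notify chip error" line.

```python
from typing import List, NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative guesses based on the sample log lines.
    timestamp: str
    sequence: str
    event_code: str
    level: str
    message: str
    module: str
    location: str
    thread: str


def split_bracket_fields(line: str) -> List[str]:
    """Split '[a][b [c]][d]' into ['a', 'b [c]', 'd'], keeping nested brackets."""
    fields: List[str] = []
    depth = 0
    start = 0
    for i, ch in enumerate(line):
        if ch == "[":
            if depth == 0:
                start = i + 1  # first character inside a top-level field
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                fields.append(line[start:i])
    return fields


def parse_line(line: str) -> LogRecord:
    """Parse one eight-field bracketed log line into a LogRecord."""
    fields = split_bracket_fields(line.strip())
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return LogRecord(*fields)


SAMPLE_LINE = (
    "[2014-09-23 22:46:43][1272943995][5400040b006f][ERR]"
    "[Device notify chip error!8001 11F8  [b:25 s:00 f:00]]"
    "[PCIE_AER][PCIEAER_.hipErr,768][sas_event]"
)

record = parse_line(SAMPLE_LINE)
print(record.module, "|", record.message)
```

With records in this form, the link training failures can be selected by checking for level "ERR" and module "PCIE_HP".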

3.2 Analysis of the Problem That Occurred on December 15, 2014

3.2.2 Log Analysis
Analysis of the storage array logs showed that the internal network port (82574) of controller A and the internal switch chip of the management module could not link up, as indicated by the status of the internal management network port.
Because the internal management network port could not link up, communication with the external network was impossible.

3.3 Analysis of the Problem That Occurred on February 10, 2015

3.3.2 Log Analysis
Through analysis of the EVENT log file, it is confirmed that the system reported a PCIe recovery event at 2015-02-10 12:31:05.
2015-02-10 12:31:05 DST    0xe02040037    Infor    None    The recovery process of the PCIE on controller ([B]) started. This process temporarily disconnected the ports.    None
The symptom recorded by the EVENT log file is the same as the scenario where the backplane caused the failure of a SAS interface module in September 2014. The preliminary conclusion is that the PCIe channel of the SAS interface module is abnormal. Then, the EVENT log file is further analyzed.
[2014-09-23 22:46:43][6966613192][5400040b009d[ERROR][device [8086:340c](b:0 d:5 f:0)\t][PCIEAER_AerPrintError][PCIE_AER^]
[2014-09-23 22:46:43][6966613192][5400040b009e[ERROR][error status/mask=00000020/00000000\t][PCIEAER_AerPrintError][PCIE_AER^]
[2014-09-23 22:46:43][6966613192][5400040b0095[ERROR][( 5) Surprise Down (First)][PCIEAER_AerPrintErrorHelper][PCIE_AER^]
[2014-09-23 22:46:43][6966613192][5400040a0156[ERROR][Device is pull out of system without handled.][PCIEHP_DeleteDev][PCIE_HP^]
[2014-09-23 22:46:43][6966613192][5400040a0157[INFO][Remove the device with bus:device(4b:0) out of the system.][PCIEHP_DeleteDev][PCIE_HP^]
[2014-09-23 22:46:43]pcieport 0000:4c:04.0: PCI INT A disabled
[2014-09-23 22:46:43][6966613193][54000410054c[ERROR ][Card 0 pcie error detected, hotplug to recover][PMS_CheckSpcState][SAS_INI^]
[2014-09-23 22:46:43][6966613193][5400040b006f[ERROR][Create PCIEAER module's polling thread failed.8001][PCIEAER_CreateThread][PCIE_AER^]
These logs confirm that this PCIe error was reported because the PCIe module of the controller found that the 8624 PCIe bridge chip on the controller was abnormal. The fault cause is entirely different from the cause of the SAS interface module problem found in September 2014.
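The "error status/mask" line in the log above can be decoded bit by bit. The sketch below assumes that the logged value is the PCIe AER Uncorrectable Error Status register. This is an assumption, although the "( 5) Surprise Down" line in the same log agrees with it. Only a subset of bit names is listed, and the function name is hypothetical.

```python
import re
from typing import List

# Subset of bit positions in the PCIe AER Uncorrectable Error Status register.
# Bit 5 matches the "( 5) Surprise Down" line printed in the log.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    14: "Completion Timeout",
    18: "Malformed TLP",
    20: "Unsupported Request",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_aer_status(message: str) -> List[str]:
    """Return the names of the unmasked error bits found in a log message."""
    match = STATUS_RE.search(message)
    if match is None:
        return []
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    active = status & ~mask  # drop bits that the mask register suppresses
    names = []
    for bit in range(32):
        if active & (1 << bit):
            names.append(AER_UNCORRECTABLE_BITS.get(bit, f"bit {bit}"))
    return names


print(decode_aer_status("error status/mask=00000020/00000000"))
```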

Handling Process

3.1 Analysis of the Problem That Occurred on September 23, 2014

3.1.3 Possible Causes
The following figure shows the PCIe device topology of a controller.
Figure 3-2 Connection between PCIe bridge chips of controller B and external components


The problem was that the PCIe link rate negotiation between the interface module in slot 1 and PCIe bridge 0 failed. There are three possible causes:
 The interface module failed.
 The PCIe bridge chip failed.
 The link between the interface module and the PCIe bridge chip failed.
As shown in the preceding figure, the interface module worked properly after being inserted into slot 4. Therefore, the SAS interface module is not the source of the problem.
3.1.4 Fault Injection Experiment in the Lab
In a Huawei lab, the same environment as the site is deployed. Related faults are injected to the PCIe signals of slot 1 in the backplane of the controller enclosure. The following table provides the experiment results.
Fault Injection | Result
Backplane signal short-circuit | The same symptom as that observed at the site occurred.
Backplane signal interruption | The same symptom as that observed at the site occurred.
PCIe bridge chip failure | The PCIe recovery process was initiated. The external link was interrupted. However, the MESSAGE log file contained information about bridge chip abnormality, whereas the information was not found at the site.

Log information obtained in backplane signal fault injection experiments (signal short-circuit and signal interruption) is the same as that obtained at the site.
[2015-02-15 13:22:11][1272943995][540004100459][ERR][Card 2 scrapad 0 0xffffffff scrapad 1 0xffffffff scrapad 2 0xffffffff scrapad 3 0xffffffff][SAS_INI][PMS_Chec.cState,5752][sas_event]
[2015-02-15 13:22:11][1272943995][54000410054c][ERR][Card 2 pcie error detected, hotplug to recover][SAS_INI][PMS_Chec.cState,5764][sas_event]
[2015-02-15 13:22:11][1272943995][5400040b006f][ERR][Device notify chip error!8001 11F8  [b:25 s:00 f:00]][PCIE_AER][PCIEAER_.hipErr,768][sas_event]
.....
[2015-02-15 13:22:31][3013498881][5400040a008b][ERR][Link training error occurs.][PCIE_HP][PCIEHP_C.Status,921][PCIEHP_WorkThre]
[2015-02-15 13:22:31][3013498881][5400040a0175][ERR][Check the link status failed.][PCIE_HP][PCIEHP_A.Device,3988][PCIEHP_WorkThre]
Therefore, it can be confirmed that the cause of the symptom at the site is that the PCIe hardware signal link of the faulty controller enclosure is abnormal. The specific cause of the fault can be analyzed after the faulty controller enclosure is returned to the R&D department.
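The comparison between the site logs and the lab logs can be sketched as follows. The timestamp and the second bracketed field differ from run to run, so the sketch removes them and compares what remains. The function name and the choice to ignore exactly these two fields are assumptions made for illustration. The log lines are the ones quoted in this document.

```python
import re
from typing import FrozenSet, Iterable

# Leading "[timestamp][second field]" prefix, which varies between runs.
_PREFIX = re.compile(r"^\[[^\]]*\]\[[^\]]*\]")


def signature(lines: Iterable[str]) -> FrozenSet[str]:
    """Return the set of log lines with the run-specific prefix removed."""
    return frozenset(_PREFIX.sub("", ln.strip()) for ln in lines if ln.strip())


SITE_LINES = [
    "[2014-09-23 22:46:43][1272943995][540004100459][ERR][Card 2 scrapad 0 0xffffffff scrapad 1 0xffffffff scrapad 2 0xffffffff scrapad 3 0xffffffff][SAS_INI][PMS_Chec.cState,5752][sas_event]",
    "[2014-09-23 22:46:43][1272943995][54000410054c][ERR][Card 2 pcie error detected, hotplug to recover][SAS_INI][PMS_Chec.cState,5764][sas_event]",
    "[2014-09-23 22:46:43][1272943995][5400040b006f][ERR][Device notify chip error!8001 11F8  [b:25 s:00 f:00]][PCIE_AER][PCIEAER_.hipErr,768][sas_event]",
    "[2014-09-23 22:47:03][1272943995][5400040a008b][ERR][Link training error occurs.][PCIE_HP][PCIEHP_C.Status,921][PCIEHP_WorkThre]",
    "[2014-09-23 22:47:03][1272943995][5400040a0175][ERR][Check the link status failed.][PCIE_HP][PCIEHP_A.Device,3988][PCIEHP_WorkThre]",
]

LAB_LINES = [
    "[2015-02-15 13:22:11][1272943995][540004100459][ERR][Card 2 scrapad 0 0xffffffff scrapad 1 0xffffffff scrapad 2 0xffffffff scrapad 3 0xffffffff][SAS_INI][PMS_Chec.cState,5752][sas_event]",
    "[2015-02-15 13:22:11][1272943995][54000410054c][ERR][Card 2 pcie error detected, hotplug to recover][SAS_INI][PMS_Chec.cState,5764][sas_event]",
    "[2015-02-15 13:22:11][1272943995][5400040b006f][ERR][Device notify chip error!8001 11F8  [b:25 s:00 f:00]][PCIE_AER][PCIEAER_.hipErr,768][sas_event]",
    "[2015-02-15 13:22:31][3013498881][5400040a008b][ERR][Link training error occurs.][PCIE_HP][PCIEHP_C.Status,921][PCIEHP_WorkThre]",
    "[2015-02-15 13:22:31][3013498881][5400040a0175][ERR][Check the link status failed.][PCIE_HP][PCIEHP_A.Device,3988][PCIEHP_WorkThre]",
]

print("same signature:", signature(SITE_LINES) == signature(LAB_LINES))
```

A set comparison ignores ordering and repetition, which suits a check of whether the same error events appear in both runs.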

 

3.2 Analysis of the Problem That Occurred on December 15, 2014

3.2.3 Possible Causes
Figure 3-4 Link topology of the management network port


The preceding figure is a schematic diagram of the internal hardware topology of the management network port. The internal network port cannot be linked up due to the following possible causes:
 The 6161 switch chip on the management module is abnormal.
After the faulty management module is replaced with a functional one, the problem still persists. Therefore, this possible cause is excluded.
 Backplane links are abnormal.
According to the following picture taken on site, there are no obvious bent pins on the backplane.
Figure 3-5 Pins on the backplane


Therefore, the backplane is unlikely to be the cause of the problem. However, the internal lines of the backplane may still be abnormal.
 The 82574 NIC chip on controller A is abnormal.
The 82574 NIC chip may be abnormal.
To sum up, it is uncertain whether the backplane or the 82574 NIC chip on controller A is abnormal. Therefore, it is recommended that a spare controller and a spare backplane be taken to the site. Replace the controller first. If the eth2 link-up problem still persists, replace the backplane.
On February 13, 2015, the customer powered off the storage array. After the storage array was powered on again, the symptom still existed. Therefore, it can be confirmed that the fault source is the abnormal 82574 NIC chip on controller A. The cause of the chip abnormality was that the chip had a soft failure.
In the following figure, the left part shows the status register of the abnormal 82574 chip, and the right part shows the status register of a normal chip.
Figure 3-6 Status register comparison between an abnormal chip and a normal chip


After excluding irrelevant register values, it is found that when the fault occurred, some specific bits in the PHY status register (1000BASE-T Status Register) of the network port chip were changed unexpectedly. PHY and MAC modules are integrated into the 82574 chip. The PHY module is a necessary module used by the port to ensure underlying link signal transmission and link rate negotiation. The 1000BASE-T Status Register is used to indicate the configuration and operating status of the PHY module. For details about the PHY status register, see the following figure.
Figure 3-7 Description of the PHY status register in the 82574 data sheet


When the network port chip became faulty, bits 12 to 14 in the corresponding PHY status register were changed unexpectedly. If the 82574 chip is working properly, the values of bits 12 and 13 are 1 (indicating that the local and peer receivers work properly), and the value of bit 14 is 0 (indicating that the chip is in Slave mode). When the network port chip became faulty, the values of the three bits were changed unexpectedly, indicating that the local PHY module was abnormal. The three bits are read-only, so software cannot write them. Therefore, it can be confirmed that the problem is not caused by the upper-layer network port driver.
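The check on bits 12 to 14 described above can be sketched as follows. The assignment of bit 13 to the local receiver and bit 12 to the remote receiver follows the standard IEEE 802.3 layout of the 1000BASE-T Status Register and is an assumption here, because the text does not say which bit is which. The function names and the sample values are hypothetical and are not taken from the register dump in the figure.

```python
from typing import Dict

# Bit positions in the 1000BASE-T Status Register (assumed IEEE 802.3 layout).
BIT_REMOTE_RX_OK = 12
BIT_LOCAL_RX_OK = 13
BIT_MASTER_RESOLVED = 14  # 1 = resolved as Master, 0 = resolved as Slave


def decode_1000t_status(value: int) -> Dict[str, bool]:
    """Decode bits 12 to 14 of a 16-bit register value."""
    return {
        "remote_receiver_ok": bool((value >> BIT_REMOTE_RX_OK) & 1),
        "local_receiver_ok": bool((value >> BIT_LOCAL_RX_OK) & 1),
        "resolved_as_master": bool((value >> BIT_MASTER_RESOLVED) & 1),
    }


def matches_healthy_pattern(value: int) -> bool:
    """True when bits 12 and 13 are 1 and bit 14 is 0, as described in the text."""
    return ((value >> 12) & 0b111) == 0b011


# Hypothetical register values used only to exercise the functions.
for sample in (0x3000, 0x4000):
    print(hex(sample), decode_1000t_status(sample), matches_healthy_pattern(sample))
```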
3.2.4 Fault Injection Experiment in the Lab
Figure 3-8 Interaction principle of the 82574 chip


As shown in the preceding figure, the 82574 chip of a controller involves the following signals: 1.9 V power signal, 3.3 V power signal, 100 MHz clock signal, and 25 MHz clock signal. The following table provides the fault injection results obtained in the lab.
Fault Injection Type | Experiment Result | Register Status of the 82574 Chip
10% increase in 1.9 V voltage | The chip hangs and cannot work. | The register status cannot be read.
10% decrease in 1.9 V voltage | The chip hangs and cannot work. | The register status cannot be read.
10% increase in 3.3 V voltage | The chip hangs and cannot work. | The register status cannot be read.
10% decrease in 3.3 V voltage | The chip hangs and cannot work. | The register status cannot be read.
200 ppm increase in 100 MHz clock frequency | The management network port is interrupted intermittently many times. | The PHY register status of the chip on the management network port keeps changing.
200 ppm increase in 25 MHz clock frequency | The management network port cannot link up, which is the same as the symptom observed at the site. | The values of some bits in the PHY register are changed unexpectedly.
50 ppm increase in 25 MHz clock frequency | In the scenario where the management network port keeps sending and receiving packets, the fault occurs when the system keeps running for about two hours, which is the same as the symptom observed at the site. After the system is restarted, the fault no longer exists. | The values of some bits in the PHY register are changed unexpectedly.

According to the experiment results, it can be confirmed that the onsite fault is caused by abnormal 25 MHz clock signals on the controller.
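For scale, the ppm offsets used in the clock experiments can be converted to absolute frequency shifts. The following is a minimal arithmetic sketch, and the function name is chosen here for illustration.

```python
def ppm_offset_hz(nominal_hz: float, ppm: float) -> float:
    """Return the absolute frequency shift for a given parts-per-million offset."""
    return nominal_hz * ppm / 1_000_000


# The three clock experiments: (nominal frequency in Hz, injected offset in ppm).
for nominal_hz, ppm in ((100_000_000, 200), (25_000_000, 200), (25_000_000, 50)):
    shift = ppm_offset_hz(nominal_hz, ppm)
    print(f"{nominal_hz / 1e6:.0f} MHz + {ppm} ppm = shift of {shift:.0f} Hz")
```

A 50 ppm offset on the 25 MHz clock is therefore a shift of only 1250 Hz, which matches the observation that this fault took about two hours to appear under continuous traffic rather than occurring immediately.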
When a controller is working, the following cases may cause frequency deviation in 25 MHz clock signals:
 Capacitors C1798 and C1799 fail.
 Resistor R2465 is abnormal.
 The X2 oscillation clock fails.
Capacitor and resistor failures are typical hard failures that will not be recovered after system restart. Therefore, the first two cases can be excluded.
An abnormal X2 oscillation clock causes only a minor frequency deviation, and the logic of the PHY module of the 82574 chip becomes abnormal only when the frequency deviation accumulates to the upper limit that the 82574 chip can tolerate. Restarting a controller resets the X2 oscillation clock, which returns the frequency deviation to zero. That is why the symptom does not emerge immediately after system power-on.
Conclusion: When the fault occurred, the X2 oscillation clock next to the 82574 chip of the controller became abnormal. As a result, frequency deviation occurred. Then, some bit values in the underlying PHY register of the 82574 chip were changed unexpectedly. After that, the chip failed to work, and the management network port could not be pinged.

 

3.3 Analysis of the Problem That Occurred on February 10, 2015

3.3.3 Possible Causes
Figure 3-9 Internal PCIe connection between a controller and an interface module


The following cases can possibly trigger the PCIe recovery process:
 The SAS driver finds that the PCIe register of the SAS interface module is abnormal.
 The PCIe module of the controller finds that the register of the 8624 chip on the controller is abnormal.
 The PCIe link of a backplane connector is abnormal.
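According to the log analysis in the preceding sections, the PCIe error was reported by the PCIe module of the controller rather than by the SAS driver, and the logs differ from those of the backplane fault in September 2014. The first and third cases can therefore be excluded. It can be confirmed that the 8624 PCIe bridge chip on the controller is abnormal. The controller must be replaced.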




Root Cause

4.1 Root Cause of the Problem That Occurred on September 23, 2014
In the backplane of the controller enclosure, the connector of slot 1 that transmits PCIe hardware link signals encountered a short circuit or signal interruption. As a result, the SAS interface module failed to negotiate a PCIe link rate.


4.2 Root Cause of the Problem That Occurred on December 15, 2014
The X2 oscillation clock next to the internal management chip 82574 of the controller was abnormal. As a result, the 25 MHz frequency provided for the chip deviated. As the frequency deviation accumulated to the upper limit that the 82574 chip can tolerate, some bit values in the underlying PHY register of the 82574 chip were changed unexpectedly. After that, the chip failed to work, the internal link-up failed, and the management network port could not be pinged.

 
4.3 Root Cause of the Problem That Occurred on February 10, 2015
The 8624 PCIe bridge chip on the controller was abnormal. The upper-layer software found that the chip register was abnormal and then triggered PCIe recovery. The SAS chip on the interface module performed a soft reset. As a result, external SAS links were interrupted, and each disk had only one link.

Solution

5.1 Solution to the Problem That Occurred on September 23, 2014
Replace the backplane.


5.2 Solution to the Problem That Occurred on December 15, 2014
It is recommended that you replace controller A.


5.3 Solution to the Problem That Occurred on February 10, 2015
It is recommended that you replace controller B.
Perform preventive maintenance on all storage systems at the site to eliminate risks.
