HGST 10TB disk failed on RH2288H V3 server

Publication Date:  2018-03-23
Issue Description

RH2288H V3 server with 14*10TB HDD

iBMC version: V276

BIOS version: V387

RAID controller: LSI3108 Ver. B

With the 10 TB disks installed, the system is severely unstable: the RAID controller drops multiple disks, seemingly at random, which destroys any RAID set in use and leaves the dropped drives in a foreign state.

Alarm Information

The SEL log contains many alerts. Multiple disks are reported as failed at the same time (asserted at 15:27) and are restored together shortly afterwards (deasserted at 15:28).

ID   Severity  Event Type  Event Description         Generation Time  Status      Event Code  Suggestion
717  Major     Disk        The disk DiskD failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
716  Major     Disk        The disk Disk7 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
715  Major     Disk        The disk Disk6 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
714  Major     Disk        The disk Disk5 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
713  Major     Disk        The disk Disk3 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
712  Major     Disk        The disk Disk2 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
711  Major     Disk        The disk DiskC failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
710  Major     Disk        The disk Disk11 failure.  3/19/2018 15:28  Deasserted  0x02000008  N/A
709  Major     Disk        The disk Disk9 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
708  Major     Disk        The disk Disk4 failure.   3/19/2018 15:28  Deasserted  0x02000008  N/A
707  Major     Disk        The disk DiskD failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
706  Major     Disk        The disk DiskC failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
705  Major     Disk        The disk Disk11 failure.  3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
704  Major     Disk        The disk Disk9 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
703  Major     Disk        The disk Disk7 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
702  Major     Disk        The disk Disk6 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
701  Major     Disk        The disk Disk5 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
700  Major     Disk        The disk Disk4 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
699  Major     Disk        The disk Disk3 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
698  Major     Disk        The disk Disk2 failure.   3/19/2018 15:27  Asserted    0x02000007  Replace the faulty disk.
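
Grouping the SEL entries by timestamp is a quick way to confirm that the disks fail and recover together rather than one after another. The following is a minimal sketch, assuming the SEL log has been exported as plain text with one event per line and whitespace-separated fields as in the listing above. The regular expression and the function name group_by_time are illustrative only and are not part of any iBMC tool.

```python
import re
from collections import defaultdict

# Matches one disk failure event line; fields may be separated by any
# amount of whitespace. This pattern is an assumption based on the
# listing above, not an official SEL export format.
EVENT_RE = re.compile(
    r"^(?P<id>\d+)\s+(?P<severity>\w+)\s+(?P<type>\w+)\s+"
    r"The disk (?P<disk>\w+) failure\.\s+"
    r"(?P<time>\d+/\d+/\d+\s+\d+:\d+)\s+"
    r"(?P<status>Asserted|Deasserted)\s+(?P<code>0x[0-9A-Fa-f]+)"
)


def group_by_time(lines):
    """Return {timestamp: {"Asserted": set_of_disks, "Deasserted": set_of_disks}}."""
    groups = defaultdict(lambda: {"Asserted": set(), "Deasserted": set()})
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # skip the header row, blank lines, and other events
        when = " ".join(match["time"].split())  # normalise spacing
        groups[when][match["status"]].add(match["disk"])
    return groups


# Three entries copied from the SEL listing above.
sample = [
    "708 Major Disk The disk Disk4 failure. 3/19/2018 15:28 Deasserted 0x02000008 N/A",
    "700 Major Disk The disk Disk4 failure. 3/19/2018 15:27 Asserted 0x02000007 Replace the faulty disk.",
    "698 Major Disk The disk Disk2 failure. 3/19/2018 15:27 Asserted 0x02000007 Replace the faulty disk.",
]

groups = group_by_time(sample)
for when, by_status in sorted(groups.items()):
    print(when, {status: sorted(disks) for status, disks in by_status.items()})
```

A timestamp whose "Asserted" set holds many disks at once suggests a common cause rather than independent drive failures.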

Handling Process

1. Checked the SEL log and the FDM log: no hardware failure was found other than the disk failure alerts.

2. Checked "LSI_RAID_Controller_Log": it contained only information about PD (physical disk) insert events.

3. Checked "RAID_Controller_Info": the status of every disk was normal. For example:

ID                                       : 10
Device Name                              : Disk10
Manufacturer                             : HGST
Serial Number                            : 2TG5AZRD
Model                                    : HUH721010ALE600
Firmware Version                         : T2JC
Health Status                            : Normal
Firmware State                           : JBOD
Power State                              : Spun Up
Media Type                               : HDD
Interface Type                           : SATA
Interface Speed                          : 6.0 Gbps
Link Speed                               : 12.0 Gbps
Drive Temperature                        : 25
Capacity                                 : 9.095 TB
Hot Spare                                : None
Rebuild in Progress                      : No
Patrol Read in Progress                  : No
Remnant Media Wearout                    : N/A
SAS Address(0)                           : 500e004aaaaaaa0a
SAS Address(1)                           : 0000000000000000
Location State                           : Off

Media Error Count                        : 0
Prefail Error Count                      : 0
Other Error Count                        : 0
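
With 14 disks, checking each record by hand is tedious. The sketch below shows one way to parse a drive record like the one above and verify its health fields. It assumes the "Key : Value" layout shown in this article; the function names parse_drive_info and is_healthy are hypothetical helpers, not part of any vendor tool.

```python
def parse_drive_info(text):
    """Parse 'Key : Value' lines into a dict; lines without a colon are ignored."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_healthy(info):
    """True if the record reports Normal health and all error counters are zero.

    A missing counter is treated as zero, which is an assumption made
    for this sketch.
    """
    counters = ("Media Error Count", "Prefail Error Count", "Other Error Count")
    return info.get("Health Status") == "Normal" and all(
        int(info.get(name, "0")) == 0 for name in counters
    )


# Fields copied from the Disk10 record above.
record = """\
Device Name                              : Disk10
Manufacturer                             : HGST
Model                                    : HUH721010ALE600
Firmware Version                         : T2JC
Health Status                            : Normal
Media Error Count                        : 0
Prefail Error Count                      : 0
Other Error Count                        : 0
"""

info = parse_drive_info(record)
print(info["Device Name"], info["Firmware Version"], is_healthy(info))
```

The same parsed dictionary also exposes the "Model" and "Firmware Version" fields, which is useful for listing the drives that share a firmware version.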

4. We installed disks of a different model (6 TB) in the same server and the issue disappeared, which points to a disk firmware issue.

5. All of the affected disks are HGST disks (none are from Seagate). The release notes for the newer HGST disk firmware version (T3C0) describe a matching issue:

Change to add a nullptr check for read command's setup handling after conflict check.
Detailed Description:
A null pointer access problem was discovered by an internal test. The location of this problem is inside a setup handling for read commands. This setup handling is to setup the first LBA for the drive to start from. In some particular cases there is no need to do a media access, however, the firmware still assumes there is always a disk LBA to start with, which resulted in a null pointer access attempt. The fix is to add a check for no disk access cases.
Failing Conditions:
There is a host read command or an internal read command whose LBA range is overlapping with the cached customer data(of previous internal write command's) and no need to do a disk access at all.
System Error:
Self-Initiated Reset 0x0716
Root Cause:
Code Error
Additional Root Cause Details:
There are some cases that a read command's execution has no need to access the media because the newest customer data are in the cache, and there is handling to re-adjust the start location for read commands after this conflict check, in which the firmware didn't consider the non-media access cases. This would result in a nullptr access.
Fix Description:
Add a nullptr check in the read command start location setup/re-adjust handling after the read range conflict check handling.
Drive Recovery:
None Needed
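
The class of bug described in the release note can be illustrated with a toy model. The sketch below is not the drive firmware: every name in it (MediaExtent, ReadCommand, start_location_unchecked, start_location_checked) is hypothetical, and Python's None stands in for a null pointer. It only shows why a read served entirely from cache breaks code that assumes a media location always exists, and how a simple check avoids that.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaExtent:
    first_lba: int  # first LBA to read from the media


@dataclass
class ReadCommand:
    lba: int
    # None when the whole range is served from cache, so no media
    # access is needed (hypothetical field for illustration).
    media: Optional[MediaExtent]


def start_location_unchecked(cmd: ReadCommand) -> int:
    # Assumes a media extent always exists; raises AttributeError when
    # cmd.media is None, analogous to a null pointer access.
    return cmd.media.first_lba


def start_location_checked(cmd: ReadCommand) -> Optional[int]:
    # Analogue of the described fix: handle the no-media-access case first.
    if cmd.media is None:
        return None
    return cmd.media.first_lba


cached_read = ReadCommand(lba=100, media=None)

try:
    start_location_unchecked(cached_read)
except AttributeError:
    print("unchecked: failed on a fully cached read")

print("checked:", start_location_checked(cached_read))
```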

Root Cause

Disk firmware version T2JC contains a bug: a null pointer access in the read command setup handling causes the drive to reset itself (Self-Initiated Reset 0x0716), during which the disk cannot be accessed.


Solution

Upgrade the disk firmware to version T3C0. The disk firmware patch and the upgrade guide are attached.