1. Collect fault symptoms. The I/O write failures occur randomly and are not tied to a specific hard disk.
2. In the OS, run the RAID controller card query command to obtain the alarm information. The Other Error Count is high, so the BMC and OS logs need to be analyzed.
3. Collect the BMC and OS logs.
3.1. Check the SEL recorded in the BMC logs. The server did not report any abnormal disk information.
3.2. Search for Other Error in the smart files in the disk directory of the OS log. Ten hard disks show abnormal statistics.
3.3. Locate the sasraidlog file in the raid directory of the OS log (log file names vary with the RAID controller card model). The log shows I/O timeout records for multiple hard disks and for the hard disk backplane.
3.4. Collect the OS log again one day later and check the Other Error Count of the hard disks. The Other Error Count continues to increase rapidly.
4. Replace the hard disk backplane, RAID controller card, and SAS cable.
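The search in step 3.2 can be scripted when many disks are involved. The Python sketch below assumes that the collected smart files contain lines such as "Other Error Count: 12" or "Other Error Count = 12"; the exact format depends on the RAID controller card tool and is an assumption here. The function names and the directory argument are hypothetical and used only for illustration.

```python
import re
from pathlib import Path

# Assumed line format, e.g. "Other Error Count: 12" or "Other Error Count = 12".
OTHER_ERROR_RE = re.compile(r"Other Error Count\s*[:=]\s*(\d+)", re.IGNORECASE)


def other_error_counts(text):
    """Return every Other Error Count value found in one log text."""
    return [int(m.group(1)) for m in OTHER_ERROR_RE.finditer(text)]


def scan_disk_logs(disk_dir):
    """Map each file under disk_dir to its highest nonzero Other Error Count."""
    results = {}
    for path in sorted(Path(disk_dir).rglob("*")):
        if not path.is_file():
            continue
        counts = other_error_counts(path.read_text(errors="ignore"))
        if counts and max(counts) > 0:
            results[path.name] = max(counts)
    return results
```

Calling scan_disk_logs on the disk directory of an unpacked OS log would list only the files whose counter is nonzero, which makes it easier to see how many disks are affected.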
The SAS link is not functioning properly, so communication between the hard disks and the system is abnormal. As a result, I/O command delivery times out and the Other Error Count becomes high.
1. Other Error is caused by hard disk resets that are triggered by I/O timeouts on the SAS link.
2. It is recommended that you collect the Other Error Count over a specified period. For servers where the Other Error Count increases sharply, you are advised to replace the hard disk backplane, SAS cables, and RAID controller card.
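The comparison over a period can be sketched as follows. This is a minimal Python example, assuming the Other Error Count of each disk has already been read from two log collections into dictionaries. The slot names, the sample values, and the threshold are hypothetical; the case itself does not state what increment counts as high.

```python
def other_error_increments(earlier, later):
    """Return the per-disk growth in Other Error Count between two collections."""
    # A disk missing from the earlier collection is treated as starting at 0.
    return {disk: later[disk] - earlier.get(disk, 0) for disk in later}


def disks_over_threshold(earlier, later, threshold):
    """Return the disks whose increment is at least the given threshold."""
    increments = other_error_increments(earlier, later)
    return sorted(disk for disk, inc in increments.items() if inc >= threshold)


# Hypothetical values from two collections taken one day apart.
day1 = {"slot0": 10, "slot1": 3}
day2 = {"slot0": 250, "slot1": 3, "slot2": 40}
suspects = disks_over_threshold(day1, day2, threshold=100)
```

Comparing increments rather than absolute values separates a counter that is still growing from one left over from an earlier, already resolved fault.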