ES3000 V2 Reports Error: SEU Fault

Publication Date: 2016-09-09
Issue Description
The following error was reported during an ES3000 inspection:

        Average EC:             209
        Max bad block rate:     0.080%
        Event log:              1 error(s)
        Health:                 Fatal

# hio_info -d /dev/hioa
hioa    Size(GB):               1204
        Max size(GB):           1204
        Serial number:          030PXS10D2000011
        Driver version:
        Bridge firmware version:        228
        Controller firmware version:    228
        Battery firmware version:       105
        Battery  status:        Warning
        Run time (sec.):        73346200
        Total  IO  read:        4067017862
        Total  IO write:        4815334486
        Total  read(MB):        87612080
        Total write(MB):        178392684
        IO timeout:             0
        R/W error:              0
        Max bit flip:           8

# hio_log -d /dev/hioa
2014-07-20 03:57:38 <0x93> hioa controller 0: SEU fault
Handling Process
1. Power the server off and then on again so that the system starts normally.

2. Back up the data on the ES3000 if required. Skip this step if the data does not need to be saved.

3. Run the following command to delete the data on the SSD:

hio_cleardata -d /dev/hioa

4. Run the hio_clear command to clear the log information, as follows:

1) cd /usr/local/hio
2) tar -xvf toolsd
3) /usr/local/hio/hio_clear -d /dev/hioa -il
// Note: the "-" in the commands of steps 3 and 4 is a half-width (ASCII) hyphen. The log-deletion option consists of the lowercase letters i and l.

Result after clearing:

5. After rebooting the system, check the status with the hio_info command. If the health status shows OK, the recovery succeeded.
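
The check in step 5 can be scripted. The following is a minimal sketch, assuming the health status appears on a "Health:" line laid out as in the Issue Description. The function name hio_health is hypothetical and is not part of the ES3000 tools.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ES3000 tools): read hio_info output
# on stdin and print the value of the first "Health:" field.
# Assumes the "Health:   <value>" layout shown in the Issue Description.
hio_health() {
    awk -F: '/^[ \t]*Health:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# On a real system (command and device name as used in this article):
#   hio_info -d /dev/hioa | hio_health
# Self-contained demonstration using the excerpt from this article:
sample='        Event log:              1 error(s)
        Health:                 Fatal'
printf '%s\n' "$sample" | hio_health    # prints: Fatal
```

After a successful recovery, the same pipeline would be expected to print OK instead of Fatal.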

Root Cause
This fault is caused by a soft error in the FPGA. Soft errors are a well-known phenomenon across the industry and commonly affect RAM devices.

The causes of FPGA soft errors are as follows:

1. Soft errors are inherent to all semiconductor devices, RAM in particular. They cause a transient bit flip, not permanent damage.

2. The FPGA is built on SRAM, which is susceptible to soft errors.

3. An FPGA soft error occurs when a neutron from cosmic rays strikes a RAM cell and flips the bit. The error is cleared once the configuration is reloaded.

4. The ES3000 performs SEU detection to ensure data correctness and consistency. A dedicated internal engine scans the entire FPGA configuration space and reports an error immediately when one is found.

5. The failure rate given by the vendor is one event per 65 years per chip. Statistics collected across all delivered units show that the current SEU rate is within the target for the FPGA.
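
To put the vendor figure in item 5 in perspective, the following sketch estimates the chance that a single chip sees at least one SEU within a given run time. It assumes that events follow a Poisson process with a mean of one event per 65 years; this model is an assumption made for illustration and is not stated by the vendor. The function name seu_probability is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: probability of at least one SEU on one chip.
# Assumes a Poisson process (illustrative assumption, not a vendor statement).
#   $1 = run time in seconds
#   $2 = mean number of years between events for one chip
seu_probability() {
    awk -v secs="$1" -v mtbe="$2" 'BEGIN {
        years = secs / (365 * 24 * 3600)          # seconds to years
        printf "%.4f\n", 1 - exp(-years / mtbe)   # P(at least one event)
    }'
}

# Run time from the hio_info output in this article, vendor figure of 65 years.
seu_probability 73346200 65    # prints: 0.0351
```

Under these assumptions, a card with the run time shown in the Issue Description (about 2.3 years) has roughly a 3.5% chance of having seen at least one SEU, so isolated reports across a large installed base are expected.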
Reboot the server and delete the ES3000 log information.