ECC Warning (0X0C00FFFF) and CAT Error (0X0700FFFF)

Publication Date: 2015-07-31
Issue Description

Customer X, running CH222 blade servers in an E9000 chassis, received a correctable ECC warning (0X0C00FFFF) for one DIMM and a CAT error (0X0700FFFF) reported for both CPUs at the same time. The errors occurred on only one blade server, in slot 14.

  

Solution

The customer tried restarting the server, both with a normal restart and by removing the blade from the chassis and inserting it into a different slot. The ECC warning disappeared for several hours but kept recurring. The CPU errors were related to this memory error; nothing was wrong with the mainboard.

A CAT error can be caused by the mainboard, a memory DIMM, or a CPU. In this case, however, the CAT error was reported for both CPUs, and it is highly unlikely that two CPUs become faulty at the same time.

Next Step:  

If another error message is reported alongside the CAT error (in this case, the ECC warning), it is much better to check and fix that related error first; fixing it can also clear the CAT error.
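The triage order above, together with the reasoning about a CAT error on both CPUs, can be sketched as a small Python helper. This is a minimal illustration under stated assumptions, not a Huawei tool: the `Alarm` class, the `suspect_component` function, and the component labels ("CPU1", "CPU2", "DIMM_A") are hypothetical; only the two event codes come from this case.

```python
from __future__ import annotations

from dataclasses import dataclass

# Event codes taken from this case; any other code is out of scope here.
ECC_WARNING = 0x0C00FFFF  # correctable ECC warning for a DIMM
CAT_ERROR = 0x0700FFFF    # CAT error reported for a CPU


@dataclass
class Alarm:
    code: int
    component: str  # hypothetical label, e.g. "CPU1" or "DIMM_A"


def suspect_component(alarms: list[Alarm]) -> str | None:
    """Return what to investigate first on one blade (hypothetical helper)."""
    cat_cpus = sorted({a.component for a in alarms if a.code == CAT_ERROR})
    ecc_dimms = [a.component for a in alarms if a.code == ECC_WARNING]

    # Another error accompanies the CAT error: fix the related error first.
    if cat_cpus and ecc_dimms:
        return ecc_dimms[0]
    # CAT error on both CPUs at once: two simultaneous CPU faults are
    # unlikely, so look at the mainboard or memory instead.
    if len(cat_cpus) >= 2:
        return "mainboard or DIMM"
    # CAT error on a single CPU: isolate it with the single-CPU test.
    if len(cat_cpus) == 1:
        return f"single-CPU test on {cat_cpus[0]}"
    # Only an ECC warning: the reported DIMM is the first suspect.
    if ecc_dimms:
        return ecc_dimms[0]
    return None
```

For the alarms seen in this case, `suspect_component([Alarm(ECC_WARNING, "DIMM_A"), Alarm(CAT_ERROR, "CPU1"), Alarm(CAT_ERROR, "CPU2")])` points at `"DIMM_A"`, matching the decision to replace the DIMM first.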

If the CAT error is reported for only one CPU, it is fine to remove that CPU and run tests with only the known-good CPU installed. Install the good CPU in each of the two mainboard sockets in turn and observe whether the CAT error recurs. If it does not, both the mainboard and the good CPU are OK. Then repeat the test with the suspect CPU in each socket. If the server reports the CAT error in both sockets, the CPU must be replaced.
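The interpretation of the single-CPU test can be written as a small decision function. This is a hedged sketch: the socket labels and the `interpret_swap_test` name are hypothetical, and any result pattern the procedure does not cover (for example, the good CPU failing in one socket) is returned as "inconclusive" rather than guessed.

```python
# Hypothetical labels for the two CPU sockets on the blade.
SOCKETS = ("CPU1", "CPU2")


def interpret_swap_test(good_cpu_cat: dict, suspect_cpu_cat: dict) -> str:
    """Interpret the single-CPU isolation test.

    Each argument maps a socket label to True if a CAT error was reported
    while that CPU was the only one installed, in that socket.
    Raises KeyError if a socket was not tested.
    """
    good_fails = [good_cpu_cat[s] for s in SOCKETS]
    suspect_fails = [suspect_cpu_cat[s] for s in SOCKETS]

    # The known-good CPU runs clean in every socket: the mainboard is OK.
    mainboard_ok = not any(good_fails)
    if mainboard_ok and all(suspect_fails):
        # Suspect CPU fails in every socket: the fault follows the CPU.
        return "replace suspect CPU"
    if mainboard_ok and not any(suspect_fails):
        return "no fault reproduced"
    # Any other pattern is not covered by this procedure.
    return "inconclusive"
```

Requiring a result for every socket makes an incomplete test fail loudly instead of silently reading as a pass.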

Because I was sure the CPUs were not faulty, I replaced only the DIMM reported in the ECC warning with a new one.

After the replacement and a restart, the system automatically cleared the CAT error and the ECC warning, and the blade server was back online.
