RH2288 server hangs with no display and no response

Publication Date: 2014-12-31
Issue Description
Model: RH2288 H V2. The server intermittently hangs: it shows no display and does not respond to any keystroke. The only option is to hard reset the server, which leads to production downtime.
Alarm Information
 Memory (DIMM030) faulty

86  Minor 12/26/2014 5:44 12/26/2014 5:44 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted
85  Minor 12/25/2014 19:26 12/25/2014 19:26 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C80FFFF Deasserted
84  Minor 12/25/2014 15:01 12/25/2014 15:01 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted
83  Minor 12/23/2014 17:29 12/23/2014 17:29 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C80FFFF Deasserted
82  Ok 12/23/2014 10:16 12/23/2014 10:16 System Boot / Restart Initiated(SysRestart) System restart.cause unknown.command from ch #0 1D0700FF Asserted
81  Minor 12/22/2014 16:34 12/22/2014 16:34 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted
80  Ok 12/22/2014 16:32 12/22/2014 16:32 System Boot / Restart Initiated(SysRestart) System restart.power-up via power pushbutton.command from ch #0 1D0703FF Asserted
Handling Process
1. Collect the hardware logs from the iMana system.
2. Check the logs for abnormal alerts.
3. In this case, the issue was observed with one memory DIMM:
     "Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted"
4. Find the BOM code for the corresponding component (in this case, DIMM030, BOM code 06200139).
5. Arrange a replacement.
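Steps 2 and 3 can be sketched as a small script that scans the exported event list for ECC threshold warnings and counts how often each DIMM asserted one. This is only an illustration: it assumes the log is available as plain text in the same layout as the Alarm Information section, and the names SAMPLE_LOG, ECC_EVENT and ecc_assert_counts are made up for this example, not part of any iMana tool.

```python
import re
from collections import Counter

# Event lines copied from the Alarm Information section of this case.
SAMPLE_LOG = """\
86  Minor 12/26/2014 5:44 12/26/2014 5:44 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted
85  Minor 12/25/2014 19:26 12/25/2014 19:26 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C80FFFF Deasserted
84  Minor 12/25/2014 15:01 12/25/2014 15:01 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted
83  Minor 12/23/2014 17:29 12/23/2014 17:29 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C80FFFF Deasserted
82  Ok 12/23/2014 10:16 12/23/2014 10:16 System Boot / Restart Initiated(SysRestart) System restart.cause unknown.command from ch #0 1D0700FF Asserted
81  Minor 12/22/2014 16:34 12/22/2014 16:34 Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted
80  Ok 12/22/2014 16:32 12/22/2014 16:32 System Boot / Restart Initiated(SysRestart) System restart.power-up via power pushbutton.command from ch #0 1D0703FF Asserted
"""

# Assumed line layout: event ID, severity, two timestamps, sensor name,
# event text, 8-digit hex event code, and Asserted/Deasserted status.
ECC_EVENT = re.compile(
    r"^(?P<event_id>\d+)\s+\S+\s+.*?Memory \((?P<dimm>DIMM\d+)\)\s+"
    r"Warning\(number of correctable ECC errors reached threshold\)\s+"
    r"(?P<code>[0-9A-F]{8})\s+(?P<status>Asserted|Deasserted)\s*$"
)


def ecc_assert_counts(log_text):
    """Count asserted ECC threshold warnings per DIMM label."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ECC_EVENT.match(line.strip())
        if match and match.group("status") == "Asserted":
            counts[match.group("dimm")] += 1
    return counts


if __name__ == "__main__":
    for dimm, count in ecc_assert_counts(SAMPLE_LOG).items():
        print(f"{dimm}: {count} asserted ECC threshold warning(s)")
```

A DIMM that asserts the warning repeatedly, as DIMM030 does in this log, is the candidate for the BOM code lookup in step 4.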
Root Cause
The DIMM has an ECC mechanism that handles errors in the memory hardware. However, if the number of correctable errors reaches the threshold, the DIMM must be replaced, as it may intermittently cause the system to hang.

  "Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted "
Solution
Replace the faulty DIMM. In this case, its BOM code is 06200139.
Suggestions
1. Check the OS-level logs for the hang issue to see whether there is any pointer to the hardware.
2. Check the hardware logs and find the corresponding hardware failure event. In this case it is not a complete failure of the DIMM but the warning "Memory (DIMM030) Warning(number of correctable ECC errors reached threshold) 0C00FFFF Asserted".
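For suggestion 1, a rough way to look for hardware pointers in OS logs is a keyword scan. The sketch below assumes a Linux host, which this case does not state, and the keyword list is only a guess at strings that Linux kernels commonly print for memory and machine-check errors. The names HARDWARE_KEYWORDS and find_hardware_hints are hypothetical.

```python
# Keywords are assumptions: lowercase fragments of messages that Linux
# kernels commonly log for memory or machine-check errors. Adjust them
# for the operating system actually in use.
HARDWARE_KEYWORDS = ("edac", "machine check", "hardware error")


def find_hardware_hints(lines, keywords=HARDWARE_KEYWORDS):
    """Return (line_number, text) pairs for lines mentioning any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()  # case-insensitive comparison
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits
```

The function accepts any iterable of lines, so an open log file can be passed to it directly. Any hits are only hints and should be compared by timestamp with the events in the hardware log.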
