PSOD happened in RH2288H V3

Publication Date: 2016-04-19
Issue Description

A PSOD (purple screen of death) occurred on an RH2288H V3 server running VMware ESXi.

Handling Process

1. The BMC log contains many corrected errors reported against DIMM030. A typical entry:

CPU:0 (socket:CPU1)    core:Uncore    LogType:MCA BANK8    (HA1)    MCA mode:Corrupt Data Containment/Poison

Error type:Corrected Errors

MCACODE:0x0091 (Data Read Error: {Memory-read-error}_CHANNEL{1}_ERR)

MSCODE:0x0001 (Memory Read Error)

HA1 generated an eMCA event

Address Mode:Segment Offset    Address: 0xf0dfc70c0(Dimm slot:0 [DIMM030])

CMCI Threshold: 1    Count:1(No Tracking)
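
To confirm that the errors concentrate on one module, the DIMM labels in an exported BMC log can be tallied. The following is a minimal sketch, assuming the log has been saved as plain text and that each entry carries a "Dimm slot:N [DIMMxxx]" field as in the entry above. The function name and the sample text are illustrative and are not part of any BMC tool.

```python
import re
from collections import Counter

# Matches the DIMM label in fields such as
# "Address: 0xf0dfc70c0(Dimm slot:0 [DIMM030])".
DIMM_LABEL = re.compile(r"Dimm slot:\d+\s*\[(DIMM\d+)\]")


def count_entries_per_dimm(bmc_text):
    """Return a Counter mapping each DIMM label to the number of entries naming it.

    Every entry that names a DIMM is counted, whatever its error type.
    """
    return Counter(DIMM_LABEL.findall(bmc_text))


# Sample text copied from the BMC entry shown above.
sample = """Error type:Corrected Errors
Address Mode:Segment Offset    Address: 0xf0dfc70c0(Dimm slot:0 [DIMM030])
CMCI Threshold: 1    Count:1(No Tracking)
"""

for label, count in count_entries_per_dimm(sample).most_common():
    print(label, count)
```

A single label that dominates the tally points to that module as the first candidate for replacement.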

2. Before the PSOD occurred, the ESX logs show "MCA error detected via CMCI" messages, which are caused by the corrected errors on the DIMM.

2016-04-14T13:29:20.004Z cpu15:34240)World: 14302: VC opID hostd-f2cc maps to vmkernel opID 3b1e5c2f

2016-04-14T13:29:31.836Z cpu3:34927)MCE: 1118: cpu3: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2016-04-14T13:29:31.836Z cpu3:34927)MCE: 222: cpu3: bank8: status=0xcc0000c000010091: (VAL=1, OVFLW=1, UC=0, EN=0, PCC=0, S=0, AR=0), ECC=no, Addr:0x50727200 (valid), Misc:0x1f0dfc78c0 (valid)

2016-04-14T13:29:31.836Z cpu3:34927)MCE: 231: cpu3: bank8: MCA recoverable error (CE): "Memory Controller Read Error on Channel 1."

2016-04-14T13:29:36.791Z cpu15:2351579)World: 14302: VC opID hostd-a78f maps to vmkernel opID c2d7e57


2016-04-14T14:58:25.174Z cpu3:3669627)MCE: 222: cpu3: bank8: status=0x8c00004000010091: (VAL=1, OVFLW=0, UC=0, EN=0, PCC=0, S=0, AR=0), ECC=no, Addr:0x504a4a00 (valid), Misc:0x1f0dfe6340 (valid)

2016-04-14T14:58:25.174Z cpu3:3669627)MCE: 231: cpu3: bank8: MCA recoverable error (CE): "Memory Controller Read Error on Channel 1."

2016-04-14T14:58:40.002Z cpu30:2351577)World: 14302: VC opID hostd-d8c5 maps to vmkernel opID 1f3374b6
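
The status values in the "MCE: 222" lines can be decoded by hand to check that they describe corrected errors only. The sketch below assumes the architectural IA32_MCi_STATUS bit layout from the Intel Software Developer's Manual (VAL in bit 63 down to AR in bit 55, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0). The function name is illustrative, and the S, AR, and count fields are only meaningful on processors that implement them.

```python
def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its main fields."""

    def bit(n):
        return (status >> n) & 1

    mca_code = status & 0xFFFF
    fields = {
        "VAL": bit(63),    # register contains a valid error
        "OVFLW": bit(62),  # an earlier error was overwritten
        "UC": bit(61),     # uncorrected error
        "EN": bit(60),     # error reporting enabled
        "MISCV": bit(59),  # MISC register valid
        "ADDRV": bit(58),  # ADDR register valid
        "PCC": bit(57),    # processor context corrupt
        "S": bit(56),      # signaled via machine-check exception
        "AR": bit(55),     # recovery action required
        "corrected_count": (status >> 38) & 0x7FFF,
        "mscode": (status >> 16) & 0xFFFF,
        "mcacode": mca_code,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC,
    # where MMM is the operation (001 = read) and CCCC is the channel.
    if mca_code & 0xFF80 == 0x0080:
        fields["mem_op"] = (mca_code >> 4) & 0x7
        fields["channel"] = mca_code & 0xF
    return fields


# Status value copied from the 13:29:31 log line above.
decoded = decode_mci_status(0xCC0000C000010091)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

For the value above, the decoded flags agree with what ESXi printed (VAL=1, OVFLW=1, UC=0, EN=0, PCC=0), and the MCA code 0x0091 decodes to a memory read on channel 1, which matches MCACODE in the BMC entry.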

3. The ESX logs also contain the following PSOD backtrace.

2016-04-14T16:42:55.637Z cpu14:3524812)@BlueScreen: PCPU 0: no heartbeat (2/2 IPIs received)

2016-04-14T16:42:55.637Z cpu14:3524812)Code start: 0x41803a000000 VMK uptime: 15:08:32:42.016

2016-04-14T16:42:55.637Z cpu14:3524812)0x4124e331d950:[0x41803a08d1c9]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x412400000010

2016-04-14T16:42:55.638Z cpu14:3524812)0x4124e331d9c0:[0x41803a08d375]Panic_WithBacktrace@vmkernel#nover+0x59 stack: 0x0

2016-04-14T16:42:55.638Z cpu14:3524812)0x4124e331da40:[0x41803a37819d]Heartbeat_DetectCPULockups@vmkernel#nover+0x70d stack: 0x4124e331dab

2016-04-14T16:42:55.638Z cpu14:3524812)0x4124e331daf0:[0x41803a0a8a30]Timer_BHHandler@vmkernel#nover+0x1b8 stack: 0x0

2016-04-14T16:42:55.639Z cpu14:3524812)0x4124e331db80:[0x41803a02e966]BH_DrainAndDisableInterrupts@vmkernel#nover+0x5a stack: 0x4100128520

2016-04-14T16:42:55.639Z cpu14:3524812)0x4124e331dbc0:[0x41803a064277]IDT_IntrHandler@vmkernel#nover+0x1af stack: 0x4124e331dce8

2016-04-14T16:42:55.639Z cpu14:3524812)0x4124e331dbd0:[0x41803a0f2064]gate_entry@vmkernel#nover+0x64 stack: 0x0
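
To relate the corrected errors to the PSOD, the two kinds of events can be pulled out of the log and placed on one timeline. This is a minimal sketch, assuming the log lines have the "timestamp cpuN:world)message" shape shown in the excerpts above. The function name and the returned keys are illustrative.

```python
import re
from datetime import datetime

# Splits "2016-04-14T13:29:31.836Z cpu3:34927)MCE: 231: ..." into
# timestamp and message.
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s+cpu\d+:\d+\)(.*)$"
)


def summarize(log_lines):
    """Count corrected-error (CE) reports logged up to the first PSOD line."""
    ce_times = []
    psod_time = None
    for line in log_lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip blank or differently formatted lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        message = match.group(2)
        if "MCA recoverable error (CE)" in message:
            ce_times.append(stamp)
        elif "@BlueScreen:" in message and psod_time is None:
            psod_time = stamp
    before = [t for t in ce_times if psod_time is None or t <= psod_time]
    return {
        "ce_before_psod": len(before),
        "last_ce": max(before) if before else None,
        "psod_time": psod_time,
    }


# Lines copied from the excerpts above.
sample_lines = [
    '2016-04-14T13:29:31.836Z cpu3:34927)MCE: 231: cpu3: bank8: MCA recoverable error (CE): "Memory Controller Read Error on Channel 1."',
    '2016-04-14T14:58:25.174Z cpu3:3669627)MCE: 231: cpu3: bank8: MCA recoverable error (CE): "Memory Controller Read Error on Channel 1."',
    "2016-04-14T16:42:55.637Z cpu14:3524812)@BlueScreen: PCPU 0: no heartbeat (2/2 IPIs received)",
]
result = summarize(sample_lines)
print(result["ce_before_psod"], result["last_ce"], result["psod_time"])
```

Run against the full vmkernel log rather than these three lines, the same function shows how many corrected errors were reported in the hours leading up to the heartbeat panic.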

4. Based on the VMware logs above, this is a known VMware issue. The link to the VMware KB article is:





Root Cause

The PSOD was caused by a known bug in VMware ESXi 5.5.0 (build-2403361), which was triggered by the large number of corrected errors on DIMM030.


Solution

1. Replace DIMM030.

2. Install the VMware patch that fixes the bug, as described in the following VMware KB article: