Almost every night, while the backup was running, the server froze for about 5 minutes and did not respond. Errors and alarms were seen on the backup server. After the freeze the server recovered on its own, without any intervention from an engineer. The main problem was that all VMs were completely unavailable during these 5 minutes.
Server software: Microsoft Windows Server 2012 R2
Backup software: Microsoft DPM
The following Windows alarms were logged (no iBMC alarms were generated):
I/O operation at logical block for disk 1
Reset to device \Device\RaidPort2
Driver detected a controller error on \Device\RaidPort2
Error in the RAID controller logs:
Controller encountered a fatal error and was reset
First we analyzed the logs and saw many events indicating RAID controller restarts:
1) "Controller encountered a fatal error and was reset" was seen in dump_info\LogDump\LSI_RAID_Controller_Log, collected via 1-click info collection
2) "Controller encountered a fatal error and was reset" was seen in raid\sasraidlog.txt, collected via the info collect tool
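The log check above can be sketched as a small script that counts how often the fatal-reset message appears in a collected log. This is only a minimal illustration, not the tool used in this case: the function names are hypothetical, and the sample lines around the message are made up, because the real log layout is not shown here.

```python
# Message text taken from the RAID controller logs in this case.
FATAL_RESET = "Controller encountered a fatal error and was reset"


def count_fatal_resets(log_text: str) -> int:
    """Count log lines that contain the fatal-reset message (case-insensitive)."""
    needle = FATAL_RESET.lower()
    return sum(1 for line in log_text.splitlines() if needle in line.lower())


def count_fatal_resets_in_file(path: str) -> int:
    """Read a collected log file and count fatal-reset lines (hypothetical helper)."""
    with open(path, "r", errors="replace") as fh:
        return count_fatal_resets(fh.read())


if __name__ == "__main__":
    # Made-up sample text; only the reset message itself comes from the case.
    sample = (
        "some unrelated event\n"
        "Controller encountered a fatal error and was reset\n"
        "another unrelated event\n"
        "Controller encountered a fatal error and was reset\n"
    )
    print(count_fatal_resets(sample))
```

Running the same count on both collected files (the 1-click dump and sasraidlog.txt) gives a quick view of how frequent the resets are before and after a change.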
We replaced the RAID controller. The error occurred less often, but it still occurred.
We asked the customer to test with different numbers of VMs on the hosts. With 5-6 VMs the server did not have the problem; with 8 VMs the problem occurred.
Because the customer had 3-4 such servers with the same software, we also asked for logs from a "good" server and analyzed them to compare with the problem server.
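The comparison between the "good" and the problem server could be sketched as follows. This is an assumption-laden illustration, not the procedure actually used: the host names, the helper names, and the sample log text are hypothetical, and only the reset message comes from this case.

```python
from typing import Dict, List

# Message text taken from the RAID controller logs in this case.
FATAL_RESET = "Controller encountered a fatal error and was reset"


def resets_per_host(logs: Dict[str, str]) -> Dict[str, int]:
    """Map each host name to the number of log lines containing the reset message."""
    needle = FATAL_RESET.lower()
    return {
        host: sum(1 for line in text.splitlines() if needle in line.lower())
        for host, text in logs.items()
    }


def hosts_with_resets(logs: Dict[str, str]) -> List[str]:
    """Return the hosts whose logs contain at least one fatal reset, sorted by name."""
    counts = resets_per_host(logs)
    return sorted(host for host, n in counts.items() if n > 0)


if __name__ == "__main__":
    # Hypothetical host names and made-up log text for illustration only.
    logs = {
        "good-server": "some unrelated event\n",
        "problem-server": (
            "Controller encountered a fatal error and was reset\n"
            "some unrelated event\n"
            "Controller encountered a fatal error and was reset\n"
        ),
    }
    print(resets_per_host(logs))
    print(hosts_with_resets(logs))
```

A host that shows no resets under a similar load is a useful baseline: differences in firmware version or VM count between it and the problem host are then the first things to check.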
After a detailed analysis, R&D confirmed that it was a firmware problem: the RAID controller was unable to process such a large amount of data.
The root cause was old firmware that could not process all the data during backup, so the RAID controller restarted. Upgrading the firmware to the latest version solved the problem.
Suggestion: upgrade the firmware before replacing the RAID card.