Faulty symptom: The customer runs a Hyper-V HA cluster with CSV and received the following error during VM backups through System Center DPM:
The device, \Device\Harddisk9\DR9, has a bad block.
The VMs are running without problems. The customer tried to relocate one VM, which returns this error every day, to other storage, but the same error also occurs when copying the data. However, the chkdsk command output within the VM shows a clean file system:
chkdsk c:\ClusterStorage\Volume3 /scan
The type of the file system is NTFS.
Stage 1: Examining basic file system structure ...
37120 file records processed.
File verification completed.
65 large file records processed.
0 bad file records processed.
Stage 2: Examining file name linkage ...
37328 index entries processed.
Index verification completed.
0 unindexed files scanned.
0 unindexed files recovered.
Stage 3: Examining security descriptors ...
Security descriptor verification completed.
104 data files processed.
Windows has scanned the file system and found no problems.
No further action is required.
4194173 MB total disk space.
2072578404 KB in 13713 files.
11268 KB in 106 indexes.
0 KB in bad sectors.
234223 KB in use by the system.
65536 KB occupied by the log file.
2169931 MB available on disk.
4096 bytes in each allocation unit.
1073708543 total allocation units on disk.
555502570 allocation units available on disk
Version information: V300R002C10SPC200
1. Analyzing the message log shows many DIF errors like the following:
[2017-04-13 00:55:03][43785410.442410] [1500000fa0003][WARN][DIF_WARNING_CHECK_DIF: LBA error: expected(3937552128),actual(2020383104),block(472),difPtr(ffff88045b608c00),type(0),ctrlFlag(0xb20000),mid(1),lbaLevel(1).][DIF][isDifLbaCorrect,702][CSD_7]
[2017-04-13 00:55:03][43785410.442434] [1500000fa0003][WARN][DIF_WARNING_CHECK_DIF_FAIL_BLOCK: data(size=512): 0x73 65 3c 2f 59 65 61 72...3e 30 3c 2f 69 6e 74 3e; dif: 0x9d af 13 02 78 6c 99 80.][DIF][isDifLbaCorrect,702][CSD_7]
[2017-04-13 00:55:03][43785410.442460] [1500000c80336][ERR][Check dif of read req failed. LUN id(2), opcode(1212417), LBA(3937551656), len(512).][LUN][doDifVerify,14973][CSD_7]
[2017-04-21 20:03:34][44544544.968745] [1500000fa0003][WARN][DIF_WARNING_CHECK_DIF: LBA error: expected(2297460354),actual(3937455106),block(122),difPtr(ffff880590b3b390),type(0),ctrlFlag(0xb20000),mid(1),lbaLevel(1).][DIF][isDifLbaCorrect,702][CSD_10]
[2017-04-21 20:03:34][44544544.968761] [1500000fa0003][WARN][DIF_WARNING_CHECK_DIF_FAIL_BLOCK: data(size=512): 0x40 00 00 00 00 00 00 00...00 00 00 00 00 00 00 00; dif: 0x9a 05 13 02 ea b0 cc 02.][DIF][isDifLbaCorrect,702][CSD_10]
These DIF errors are unusual: many of them are LBA errors, where the LBA recorded in the DIF does not match the expected LBA, while the user data itself is consistent. We can therefore confirm that this is a software bug.
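To make the LBA mismatch easier to see, the following minimal sketch parses the expected/actual fields and decodes the 8 DIF bytes from the two log records quoted above. It assumes the DIF bytes follow the standard T10 layout (2-byte guard tag, 2-byte application tag, 4-byte reference tag, big-endian); the log itself does not state this, and the helper names `parse_lba_error` and `decode_dif` are hypothetical. Under that assumption, the decoded reference tag equals the `actual` value in both records. In the first record, `expected` also equals the request LBA from the ERR line plus the block index (3937551656 + 472 = 3937552128); that is an observation from this log only, not documented behavior.

```python
import re

# Fields copied from the log records above; all other fields are omitted.
RECORDS = [
    ("expected(3937552128),actual(2020383104),block(472)",
     "dif: 0x9d af 13 02 78 6c 99 80."),
    ("expected(2297460354),actual(3937455106),block(122)",
     "dif: 0x9a 05 13 02 ea b0 cc 02."),
]

LBA_RE = re.compile(r"expected\((\d+)\),actual\((\d+)\),block\((\d+)\)")
DIF_RE = re.compile(r"dif: 0x((?:[0-9a-f]{2} ?){8})")


def parse_lba_error(text):
    """Return (expected, actual, block) from a DIF_WARNING_CHECK_DIF record."""
    return tuple(int(g) for g in LBA_RE.search(text).groups())


def decode_dif(text):
    """Split the 8 DIF bytes into tags (assumed T10 layout, big-endian)."""
    raw = bytes.fromhex(DIF_RE.search(text).group(1))
    return {
        "guard": int.from_bytes(raw[0:2], "big"),
        "app_tag": int.from_bytes(raw[2:4], "big"),
        "ref_tag": int.from_bytes(raw[4:8], "big"),
    }


for lba_text, dif_text in RECORDS:
    expected, actual, block = parse_lba_error(lba_text)
    dif = decode_dif(dif_text)
    print(f"block {block}: expected LBA {expected}, logged actual {actual}, "
          f"decoded ref tag {dif['ref_tag']} "
          f"(matches actual: {dif['ref_tag'] == actual})")
```

In both records the stored reference tag is a valid-looking LBA that belongs to a different location, which fits data left over from an earlier use of the block rather than random corruption.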
1. The affected VM resides on a thin LUN. The thin LUN uses two trees, a reclaim tree and an allocate tree. After data is deleted, the freed data blocks are placed in the reclaim tree, formatted, and then moved to the allocate tree. Because of a software bug, reclaimed blocks are sometimes moved directly to the allocate tree without being formatted.
If such a block is not fully used, the host may detect data inconsistency in the backup scenario. For example, if the host writes 1 KB of data and receives a 4 KB unformatted block from the allocate tree, 3 KB of the block is not in use.
When the host reads the 1 KB of data, the storage verifies the data consistency of the whole block using DIF. Because the unused 3 KB was never formatted, the DIF verification fails and the storage returns an I/O error to the host.
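The mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the storage system's actual implementation: the class `ThinLun`, the flag `skip_format_bug`, and the use of plain lists in place of the reclaim and allocate trees are all hypothetical. The model also assumes that each 512-byte sector carries a reference tag equal to its LBA and that a formatted sector always passes verification.

```python
SECTOR = 512
SECTORS_PER_BLOCK = 8  # 8 x 512 bytes = 4 KB block, as in the example above


class DifError(OSError):
    """Raised when a sector's stored reference tag does not match its LBA."""


class ThinLun:
    def __init__(self, skip_format_bug=False):
        self.reclaim = []    # stand-in for the reclaim tree
        self.allocate = []   # stand-in for the allocate tree
        self.mapping = {}    # block-aligned LBA -> block (list of sectors)
        self.skip_format_bug = skip_format_bug

    def delete(self, base_lba):
        """Free a block: reclaim list, format, then allocate list."""
        self.reclaim.append(self.mapping.pop(base_lba))
        while self.reclaim:
            block = self.reclaim.pop()
            if not self.skip_format_bug:
                # Formatting clears stale data and tags (None = formatted).
                block[:] = [None] * SECTORS_PER_BLOCK
            self.allocate.append(block)

    def write(self, lba, sectors):
        """Write sectors starting at lba (must not cross a block boundary)."""
        base = lba - lba % SECTORS_PER_BLOCK
        if base not in self.mapping:
            fresh = [None] * SECTORS_PER_BLOCK
            self.mapping[base] = self.allocate.pop() if self.allocate else fresh
        block = self.mapping[base]
        for i, data in enumerate(sectors):
            block[lba - base + i] = (data, lba + i)  # (payload, reference tag)

    def read(self, lba, count):
        """Verify the whole block, then return the requested sectors."""
        base = lba - lba % SECTORS_PER_BLOCK
        block = self.mapping[base]
        for i, sector in enumerate(block):
            if sector is not None and sector[1] != base + i:
                raise DifError(f"LBA error: expected({base + i}),"
                               f"actual({sector[1]}),block({i})")
        off = lba - base
        return [s[0] if s else b"\0" * SECTOR for s in block[off:off + count]]


def demo(bug):
    lun = ThinLun(skip_format_bug=bug)
    lun.write(800, [b"old"] * 8)   # fill one 4 KB block at LBA 800
    lun.delete(800)                # block is recycled for reuse
    lun.write(1600, [b"new"] * 2)  # 1 KB write reuses the recycled block
    try:
        return lun.read(1600, 2)
    except DifError as err:
        return str(err)


print("format step applied:", demo(bug=False))
print("format step skipped:", demo(bug=True))
```

With the format step applied, the 1 KB read succeeds. With the step skipped, the six stale sectors still carry reference tags from the block's previous location, so whole-block verification fails even though the 1 KB the host actually wrote is intact.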
1. Upgrade the storage system to V300R002C10SPC200 + V300R002C10SPH206.