Storage cannot start working when thin pool space is used up

Publication Date:  2016-03-10
Issue Description

S2600T storage, system version V100R005C00SPC900.

Thin LUNs are used for a VMware platform.

After services were interrupted, the storage system was powered off and back on. However, it stayed in the power-on phase: login from ISM was impossible, and login from the CLI always prompted "system not ready".

Alarm Information

Since logs cannot be collected from ISM or Toolkit, historical alarms cannot be checked.

Handling Process

1. Log in to the controller through an SFTP tool such as WinSCP and download all the logs from the coffer_log directory.
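A minimal sketch with a command-line SFTP client (WinSCP's GUI achieves the same; the controller IP and account below are placeholders):

    # Connect to the controller's management IP (placeholder address and user)
    sftp admin@192.168.128.101
    # In the sftp session, recursively download the coffer log directory
    get -r /OSM/coffer_log ./coffer_log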

2. Check the power-on logs in /OSM/coffer_log/log/his_debug/log_debug and get the latest power-on (powon) log.
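For example, once the logs are on a local Linux machine, the newest power-on archive can be found by timestamp (the local layout here mirrors the controller's directory):

    # List the history power-on logs, newest first
    ls -lt coffer_log/log/his_debug/log_debug/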


3. The compressed history powon file contains two directories: cur_debug and nvram. First, check the "messages" file in "cur_debug"; it holds the system logs from before the storage reset.
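A sketch of extracting the archive, assuming it is a gzipped tar (the file name is illustrative; adjust to the actual archive type and name):

    # Extract the latest power-on archive
    tar -xzf powon_log_20160310.tgz
    # The result contains the two directories mentioned above: cur_debug/ and nvram/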

The file contains many repeated error logs showing that the thin pool space was used up and the system could not even allocate metadata for the thin LUN.
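A hedged way to surface those entries; the exact error strings are not reproduced here, so the search pattern is only a guess to be adjusted to the real message text:

    # Scan the pre-reset system log for space/allocation failures
    grep -iE "no space|alloc|pool" cur_debug/messages | tail -50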

Second, open the "log_debug.txt" file in the "nvram" directory. It contains the power-on logs, from which we can check why the system cannot start working.

Search for the keyword "[SYS]" to check the latest power-on step.
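For example, assuming the archive was extracted to the current directory:

    # Show the power-on milestones logged by the system module;
    # the last lines reveal the step at which startup stalled
    grep -F "[SYS]" nvram/log_debug.txt | tail -20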

From this output, we can see that the system has already successfully mounted the coffer and registered the VAULT. It has therefore evidently hung on the next step (flushing dirty data in the VAULT). Combined with the previous analysis, we know the thin pool ran out of disk space and the system cannot flush the dirty data in the VAULT.

Root Cause

The system can start working only after all dirty data in the VAULT has been successfully flushed. However, there is no space left in the thin pool, so the flushing cannot complete. The storage therefore keeps flushing dirty data without ever finishing, effectively a deadlock.
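A schematic sketch of the hang, purely illustrative (the function names are invented and do not correspond to actual firmware code):

    # Startup cannot finish: every flush needs new thin-pool blocks
    while vault_has_dirty_data; do
        allocate_thin_pool_space || continue   # always fails: the pool is full
        flush_dirty_block                      # never reached
    done
    start_services                             # never reached either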

Solution

1. Pull out some member disks of the thin pool so that the RAID group becomes faulty. The system then skips the dirty-data flushing procedure and starts working.

Please note: pull out only non-coffer disks; do not pull out any coffer disk. The number of disks to pull out depends on the RAID level: for example, 2 for RAID 5 and 3 for RAID 6.

2. After the system successfully starts working, plug the disks pulled out in step 1 back in, then revive the disks and RAID LUN to recover the thin LUNs' status.

Please note: since there is no free capacity in the thin pool, do not start services at this time.

3. Add new disks to the storage system and use them to expand the thin pool whose space was used up.

4. Start services on the hosts.

Suggestions

1. Pay attention to alarms from the storage system and expand capacity in time. All services will be interrupted when the thin pool space is used up.

2. Do not restart the storage system without authorization when it is in an abnormal status.

3. If you are not an experienced Huawei engineer, implement the recovery solution under R&D guidance.

END