One hard disk failed in an MD software RAID environment (SUSE 11)

Publication Date: 2013-09-14
Issue Description
Many of our servers run SUSE 11 with MD software RAID. If a hard disk fails, the software RAID becomes degraded and the file system is in danger. This case shows how to fix it.
Alarm Information
Here is the alarm example:
Handling Process
1. First, find out how many disks are in this md array. Every member disk of a software RAID carries a complete copy of the RAID metadata, so the RAID information can be read from any member disk. In this example we remember that sdb is in this RAID, so we read the whole RAID information from sdb. If you don't remember any of the disk names, you can try the disks one by one. This command only reads the metadata and does no harm to the disk.
Use the command: mdadm -E <disk_name>

From the command output, we can see that:
  •  There are 5 disks in this RAID and one is faulty.
  •  sdb/sdc/sdd/sde are working properly now.
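If the member disks are not known, the "try them one by one" approach from step 1 can be sketched as a small loop. This is only a sketch: the function name list_md_members and the MDADM override variable are hypothetical, and the device list in the example is an assumption that must be adapted to the actual server. It relies on mdadm --examine (the long form of -E) returning a non-zero exit status when a device has no md superblock.

```shell
#!/bin/sh
# MDADM can be overridden (for example for a dry run); defaults to the real tool.
MDADM="${MDADM:-mdadm}"

# Print each candidate device that carries an md superblock.
# --examine (-E) only reads the metadata; it does not write to the disk.
list_md_members() {
    for dev in "$@"; do
        if "$MDADM" --examine "$dev" >/dev/null 2>&1; then
            echo "$dev"
        fi
    done
}

# Example (run as root); the device list below is an assumption:
#   list_md_members /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

Running mdadm -E on one of the devices it prints then shows the full array details, such as the RAID level and the number of member disks.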

2. Now we have enough information to re-assemble the RAID and make it active again. This is done with the command "mdadm --assemble <MD_Name> <disks>".

In this example, we use the four working disks (sdb/sdc/sdd/sde) to re-assemble the RAID. It can then be mounted successfully and the data is still there.
This is not the end. The RAID is still in degraded mode because the faulty disk has not yet been replaced with a new disk.
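The re-assembly in step 2 could be wrapped as follows. This is a minimal sketch: assemble_md and the MDADM override variable are hypothetical names, and the /dev/ paths in the example assume that the array is md1 and that the surviving members are the four disks named in this case.

```shell
#!/bin/sh
# MDADM can be overridden (for example for a dry run); defaults to the real tool.
MDADM="${MDADM:-mdadm}"

# Assemble an array from its surviving member disks.
# $1 = md device, remaining arguments = member disks.
assemble_md() {
    md="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: assemble_md <md_device> <disk>..." >&2
        return 2
    fi
    "$MDADM" --assemble "$md" "$@"
}

# Example (run as root); the device names are taken from this case:
#   assemble_md /dev/md1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
#   cat /proc/mdstat    # the array should be listed as active but degraded
```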

3. Remove the faulty disk and replace it with a new disk. This step must be done very carefully. Make sure you don't replace the wrong disk; if you do, it might be a disaster.
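One way to reduce the risk of pulling the wrong drive is to look up the persistent names of the faulty device before touching the hardware. On most Linux systems these names contain the drive model and serial number, which can be compared with the label on the physical disk. The sketch below is an assumption, not part of the original procedure: disk_ids and BY_ID_DIR are hypothetical names, and it presumes that udev populates /dev/disk/by-id on the server.

```shell
#!/bin/sh
# Directory with persistent device links; overridable for testing.
BY_ID_DIR="${BY_ID_DIR:-/dev/disk/by-id}"

# Print the by-id names that point at the given device node.
disk_ids() {
    target=$(readlink -f "$1") || return 1
    for link in "$BY_ID_DIR"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            basename "$link"
        fi
    done
}

# Example; replace <faulty_disk> with the device identified in step 1:
#   disk_ids /dev/<faulty_disk>
```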

4. We assume that the new disk is now in place and that it is sdf. Add it to the RAID, and the RAID will be rebuilt. The command "mdadm <MD_Name> -a <disk_name>" does this.

In this example, after sdf is successfully added to md1, you can use the command "cat /proc/mdstat" to check the current rebuild progress. Once the rebuild is finished, the problem is resolved.
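Step 4 can be sketched as two small helpers: one that adds the new disk and one that extracts the rebuild percentage from /proc/mdstat. The names add_disk, rebuild_percent and MDADM are hypothetical, and the parser assumes that the progress line has the usual "recovery = NN.N%" (or "resync = NN.N%") form.

```shell
#!/bin/sh
# MDADM can be overridden (for example for a dry run); defaults to the real tool.
MDADM="${MDADM:-mdadm}"

# Add a replacement disk to the array; the rebuild then starts.
# $1 = md device, $2 = new disk. --add is the long form of -a.
add_disk() {
    "$MDADM" "$1" --add "$2"
}

# Print the rebuild percentage found in mdstat-format text.
# Prints nothing when no recovery or resync is in progress.
rebuild_percent() {
    sed -n -E 's/.*(recovery|resync) *= *([0-9.]+)%.*/\2/p' "${1:-/proc/mdstat}"
}

# Example (run as root); device names are taken from this case:
#   add_disk /dev/md1 /dev/sdf
#   while [ -n "$(rebuild_percent)" ]; do
#       echo "rebuild at $(rebuild_percent)%"
#       sleep 60
#   done
```

If several arrays are rebuilding at the same time, rebuild_percent prints one line per array, so the loop above is only suitable for a single rebuilding array.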

Root Cause
If one disk fails, the software RAID is degraded.
Suggestions
  • The whole reconstruction procedure should be guided by a qualified engineer. Make sure you know what you are doing before you run each command.
  • Before you re-assemble the software RAID, make sure that you have enough information about the member disks.
  • The step "Replace the faulty disk with a new disk" must be done very carefully.