How to recover a failed software RAID1 system

Jephe Wu - http://linuxtechres.blogspot.com

Objective: use mdadm and grub-install to replace a failed hard disk in a software RAID1 array
Environment: HP lp1000r server running Fedora Core 3, with two hard disks (sda and sdb) set up as software RAID1 at OS installation time. One day, sdb failed with many I/O errors.

Concept:
Shut down the server, replace the failed sdb hard disk with a new one, and reboot the system. Partition the new hard disk, then use mdadm to hot-add
the new partitions to the mdX devices. Finally, install grub on the new hard disk.


Steps:
1. After installing the new hard disk in the server, partition it by copying the partition table from sda
# sfdisk -d /dev/sda | sfdisk /dev/sdb

Run partprobe to inform the OS of the partition table changes:
# partprobe -s
/dev/sda: msdos partitions 1 2 3
/dev/sdb: msdos partitions 1 2 3
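
Before adding the new partitions to the arrays, it can help to confirm that sdb really ended up with the same layout as sda. The following is a minimal sketch, assuming msdos partition tables and the `sfdisk -d` dump format where each partition line contains `start=`. The function names layout_lines and same_layout are made up for this example and are not part of sfdisk.

```shell
# layout_lines: read an `sfdisk -d` dump on stdin and print only the
# partition entries, with the disk name stripped and whitespace removed,
# so that dumps taken from two different disks can be compared.
layout_lines() {
    grep 'start=' | sed -e 's#^/dev/[a-z]*##' -e 's/[[:space:]]//g'
}

# same_layout DISK_A DISK_B: succeed only if both disks report at least
# one partition and their partition entries are identical.
same_layout() {
    a=$(sfdisk -d "$1" | layout_lines)
    b=$(sfdisk -d "$2" | layout_lines)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (as root):
#   same_layout /dev/sda /dev/sdb && echo "layouts match"
```

The comparison is purely textual, so it only tells you that start, size and type of every partition agree. It does not look at the contents of the partitions.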


2. Hot-add the partitions to the md devices
a. Show the current settings
cat /proc/mdstat
mdadm -D /dev/md0
cat /etc/mdadm.conf
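
To see at a glance which arrays are running with a missing member, the status flags in /proc/mdstat can be filtered. This is a rough sketch, assuming the usual mdstat layout in which each array's `blocks` line ends with a flag group such as `[UU]` (all members up) or `[U_]` (one member missing). The function name degraded_arrays is hypothetical.

```shell
# degraded_arrays: read /proc/mdstat-style text on stdin and print the
# name of every array whose status flags contain "_" (a missing member).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                    # remember the current array
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name }   # flags like [U_] or [_U]
    '
}

# Usage:
#   degraded_arrays < /proc/mdstat
```

With sdb removed, every array that had a partition on sdb should be listed. After the rebuild in the next step finishes, the command should print nothing.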


b. Hot-add the partitions to the md devices
mdadm /dev/md1 -a /dev/sdb1   # md1 is for /boot, 100 MB
mdadm /dev/md0 -a /dev/sdb2   # md0 is for /, 1 GB
mdadm /dev/md2 -a /dev/sdb3   # md2 is for the other partitions, 18 GB; we use LVM on md2 for /usr, /var, /home etc.

Note: wait for the rebuild percentage to reach completion. Right after you add a device, mdadm -D shows it only as a spare; then the rebuild starts. Once the rebuild has finished, the "removed" and "spare" entries disappear and the device is shown as "active sync".

You can also use cat /proc/mdstat to monitor the rebuild progress.
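
Watching the rebuild can also be scripted. The sketch below assumes the usual progress line in /proc/mdstat, of the form `recovery = 12.6% (...)` (or `resync = ...`), printed under the array being rebuilt. The function names rebuild_progress and wait_for_rebuild, and the 30 second default interval, are choices made for this example.

```shell
# rebuild_progress: read /proc/mdstat-style text on stdin and print
# "<array> <percent>" for every array that is currently rebuilding.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /(recovery|resync) *=/ {
            # the percentage is the field right after the standalone "="
            for (i = 2; i < NF; i++)
                if ($i == "=") print name, $(i + 1)
        }
    '
}

# wait_for_rebuild [SECONDS]: print the progress every SECONDS (default 30)
# until /proc/mdstat no longer mentions a recovery or resync.
wait_for_rebuild() {
    while grep -Eq '(recovery|resync) *=' /proc/mdstat; do
        rebuild_progress < /proc/mdstat
        sleep "${1:-30}"
    done
}

# Usage:
#   wait_for_rebuild 60 && echo "rebuild finished"
```

Waiting for the loop to end before moving on to the grub step keeps you from installing the boot loader onto a partition that is still being synchronized.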

3. Install grub

You might encounter the error 'md0 does not have a corresponding BIOS drive' when running the command 'grub-install /dev/sdb'.

Why do you get this error?
When you run grub-install /dev/sdb, it first consults /boot/grub/device.map, because grub only knows disks as hd0, hd1 etc., which are the first and second hard disks found.

When installing grub on sdb, grub-install also needs the files under /boot, so it looks in /etc/mtab to find out which device holds /boot. That device is /dev/md1 (see step 2b above), and md devices have no entry in device.map. To work around this, edit the /boot line in /etc/mtab and change /dev/md1 to /dev/sda1, or to /dev/sdb1 once the hot-add and rebuild have finished.

# grub-install /dev/sdb
or
# grub-install hd1
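
The lookup described above can be reproduced by hand to check whether grub-install is likely to complain. This is only an illustrative sketch, assuming /etc/mtab lines of the form "device mountpoint ..." and device.map lines of the form "(hdN) /dev/sdX". The function names boot_device and grub_drive are made up for this example and are not part of grub.

```shell
# boot_device: read mtab-style text on stdin and print the device
# that is mounted on /boot.
boot_device() {
    awk '$2 == "/boot" { print $1 }'
}

# grub_drive DEVICE: read device.map-style text on stdin and print the
# grub drive name mapped to DEVICE. Fails if no line maps DEVICE.
grub_drive() {
    awk -v dev="$1" '$2 == dev { print $1; found = 1 } END { exit !found }'
}

# Usage:
#   boot_device < /etc/mtab                     # e.g. /dev/md1
#   grub_drive /dev/sdb < /boot/grub/device.map
#   grub_drive /dev/md1 < /boot/grub/device.map || echo "no BIOS drive for md1"
```

If grub_drive fails for the device that boot_device reports, grub has no BIOS drive to map it to, which is the situation the error message describes.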



4. How to monitor software RAID for failures

 
nohup mdadm --monitor --mail=[user] --delay=[checking_time_in_second] /dev/md[X] &


References:
1. http://radu.rendec.ines.ro/howto/raid1.html
2. http://en.wikipedia.org/wiki/Mdadm
3. http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array