10.12.2009

Linux Software RAID Nagios Monitoring & Failure Simulation

It's always nice to monitor your technology to make sure it is working properly. It's nicer still to know that when it fails you will be alerted to what happened, so sometimes you need to test the failure scenarios. This is an overview, using Debian Lenny, of how to use the Linux mdadm tool to create, break and rebuild a software RAID 1 (mirror) device. Throughout the process you will see how the check_linux_raid plugin from the Nagios Plugins project reacts.

First I installed the Nagios plugins and mdadm:
>apt-get install mdadm nagios-plugins


I used fdisk to create two partitions of equal size, /dev/sdc1 and /dev/sdc2 (in this case 40 MB each; I know, I'm generous). Both sit on the same disk, which is fine for a test but gives no real redundancy.

I then used the following command to create a RAID 1 array from the two partitions. I happened to call it md0:
>mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdc2


Created an ext3 filesystem on my array:
>mkfs.ext3 /dev/md0


Mounted the new file system in an arbitrary location:
>mount /dev/md0 /media/raid


The array is now active and happy. Here's how to verify:
>mdadm --detail /dev/md0
/dev/md0:
...
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 34 1 active sync /dev/sdc2


Now I will run the first test of the Nagios plugin. You can give it the device name if you like, or it can detect your devices for you. As you can see, all is OK (exit status 0):
>/usr/lib/nagios/plugins/check_linux_raid
OK md0 status=[UU].
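To make the status string less mysterious, here is a rough sketch of the kind of check involved. It assumes the member map ([UU], [U_]) comes from /proc/mdstat, where each U is a working member and each _ is a missing one. The function name check_mdstat and the sample input are made up for illustration; this is not the plugin's actual code.

```shell
#!/bin/sh
# Sketch only: check_mdstat is a hypothetical helper, not the real plugin.
# It reads /proc/mdstat-style text on stdin and prints one line per array.
check_mdstat() {
    awk '
        /^md[0-9]+ :/ { dev = $1 }              # remember the array name
        dev != "" && match($0, /\[[U_]+\]/) {   # member map such as [UU] or [U_]
            map = substr($0, RSTART, RLENGTH)
            state = "OK"
            if (index(map, "_") > 0) { state = "CRITICAL"; bad = 1 }
            printf "%s %s status=%s\n", state, dev, map
            dev = ""
        }
        END { exit (bad ? 2 : 0) }              # Nagios: 0 = OK, 2 = CRITICAL
    '
}

# Made-up sample input (the block count is illustrative).
printf '%s\n' 'md0 : active raid1 sdc2[1] sdc1[0]' '      40064 blocks [2/2] [UU]' | check_mdstat
printf '%s\n' 'md0 : active raid1 sdc2[2](F) sdc1[0]' '      40064 blocks [2/1] [U_]' | check_mdstat \
    || echo "exit status: $?"
```

The first sample prints an OK line and returns 0. The second prints a CRITICAL line and returns 2, which is the exit status Nagios treats as critical. On a live system the input would be `check_mdstat < /proc/mdstat`.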


Now I'm going to set one of my block devices as faulty:
>mdadm /dev/md0 -f /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md0


To verify the array is degraded we'll look at the details again:
>mdadm --detail /dev/md0
/dev/md0:
...
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

UUID : 3103faf6:34bec8ae:2689ae2c:f32f1e02 (local to host buenos-aires)
Events : 0.8

Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 0 0 1 removed

2 8 34 - faulty spare /dev/sdc2

You'll notice /dev/sdc2 is now marked as a faulty spare.
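If you would rather pick out the failed member from a script than by eye, a small filter over the detail output is enough. This is a minimal sketch: faulty_members is a hypothetical name, and it assumes the device path is the last field on each member line, as in the output above.

```shell
#!/bin/sh
# Sketch only: faulty_members is a hypothetical helper.
# It reads `mdadm --detail` text on stdin and prints each member whose
# state column contains the word "faulty" (the device path is the last field).
faulty_members() {
    awk '/faulty/ { print $NF }'
}

# Sample input copied from the degraded array shown above.
faulty_members <<'EOF'
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 0 0 1 removed

2 8 34 - faulty spare /dev/sdc2
EOF
```

With the sample input this prints /dev/sdc2. On the real array you would feed it with `mdadm --detail /dev/md0 | faulty_members`.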


Another run of the Nagios plugin returns exit code 2 and reports the service as critical:
>/usr/lib/nagios/plugins/check_linux_raid
CRITICAL md0 status=[U_].


I now remove the faulty device from the array completely (note that it must be marked faulty first, as in the earlier step):
>mdadm /dev/md0 -r /dev/sdc2
mdadm: hot removed /dev/sdc2


Testing the plugin again, the array is still reported as critical:
>/usr/lib/nagios/plugins/check_linux_raid
CRITICAL md0 status=[U_].


Now I can re-add the device to the array (since there's really nothing wrong with it). Once the array has rebuilt, run the plugin again and it will report OK:
>mdadm /dev/md0 -a /dev/sdc2
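On a 40 MB mirror the rebuild finishes almost immediately, but on a larger array you may want to wait for it before testing again. The following is a minimal sketch that polls /proc/mdstat, assuming a rebuild shows up there as a progress line containing "recovery" or "resync". The helper names and the 5-second default interval are my own choices, not anything mdadm or Nagios provides.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical names.
# rebuilding succeeds while the given mdstat-style file shows a
# recovery or resync progress line.
rebuilding() {
    grep -Eq 'recovery|resync' "$1"
}

# wait_for_rebuild polls the file (default /proc/mdstat) until no rebuild
# is reported. The 5-second default interval is an arbitrary choice.
wait_for_rebuild() {
    file=${1:-/proc/mdstat}
    interval=${2:-5}
    while rebuilding "$file"; do
        sleep "$interval"
    done
}

# Usage on the real system, followed by the plugin check:
#   wait_for_rebuild && /usr/lib/nagios/plugins/check_linux_raid
```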


You can read more about Linux software RAID here. Or about Nagios here.
