Linux Software RAID Nagios Monitoring & Failure Simulation

October 12, 2009

It's always nice to monitor your technology to make sure it is working properly. It's nicer still to know that it will alert you when it fails, so sometimes you need to test the failure scenarios. This is an overview, using Debian Lenny, of how to use the Linux mdadm tool to create, degrade and rebuild a software RAID 1 (mirror) device. Throughout the process you will see how the check_linux_raid plugin from the Nagios Plugins project responds.

First I installed the Nagios plugins and mdadm:

>apt-get install mdadm nagios-plugins

I used fdisk to create two partitions of equal size, /dev/sdc1 and /dev/sdc2 (40 MB each in this case; I know, I'm generous). Both sit on the same physical disk, which is fine for a test but gives no real redundancy.

I then used the following command to create a RAID 1 array from the two partitions. I happened to call it md0:

>mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdc2

Created an ext3 filesystem on my array:

>mkfs.ext3 /dev/md0

Mounted the new filesystem in an arbitrary location (the mount point, here /media/raid, must already exist):

>mount /dev/md0 /media/raid

The array is now active and happy. Here's how to verify:

>mdadm --detail /dev/md0
/dev/md0:
...
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 34 1 active sync /dev/sdc2

Now I will run the first test of the Nagios plugin. You can give it the device name if you like, or it can detect your devices for you. As you can see, everything is OK (exit status 0):

>/usr/lib/nagios/plugins/check_linux_raid
OK md0 status=[UU].
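
Nagios derives the service state from the plugin's exit status rather than from the text it prints: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. As a rough sketch of how to see that status from a shell, the helpers below run any plugin and print its exit status next to the matching state name. The names nagios_state_name and run_check are made up for this example and are not part of Nagios or the plugins package:

```shell
#!/bin/sh
# Hypothetical helper: translate a plugin exit status into the state name
# Nagios associates with it (0-3 are the only defined values).
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNDEFINED" ;;   # outside the 0-3 range
    esac
}

# Hypothetical helper: run the plugin given as arguments, let its own output
# through, then print the exit status with the matching state name.
run_check() {
    status=0
    "$@" || status=$?
    echo "exit status $status ($(nagios_state_name "$status"))"
    return "$status"
}
```

With those defined, the check above could be run as:

>run_check /usr/lib/nagios/plugins/check_linux_raid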

Now I'm going to set one of my block devices as faulty:

>mdadm /dev/md0 -f /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md0

To verify the array is degraded we'll look at the details again:

>mdadm --detail /dev/md0
/dev/md0:
...
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

UUID : 3103faf6:34bec8ae:2689ae2c:f32f1e02 (local to host buenos-aires)
Events : 0.8

Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 0 0 1 removed

2 8 34 - faulty spare /dev/sdc2

You'll notice /dev/sdc2 is now marked as a faulty spare.
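
If you would rather not read through the whole listing each time, the interesting parts can be filtered out. The following is only a sketch that assumes the output layout shown above. Both function names are hypothetical, and both read `mdadm --detail` output on standard input:

```shell
#!/bin/sh
# Hypothetical helper: print the rows of the device table that are NOT in
# the "active sync" state (for example "removed" or "faulty spare").
# Device rows are recognised by their two leading numeric columns.
unhealthy_members() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && !/active sync/ { print }'
}

# Hypothetical helper: succeed (exit 0) if the "State :" line of the
# output mentions "degraded".
array_is_degraded() {
    grep -Eq '^[[:space:]]*State[[:space:]]*:.*degraded'
}
```

For example:

>mdadm --detail /dev/md0 | unhealthy_members
>mdadm --detail /dev/md0 | array_is_degraded && echo "md0 is degraded"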

Another run of the Nagios plugin returns an exit code of 2 and reports the service as critical:

>/usr/lib/nagios/plugins/check_linux_raid
CRITICAL md0 status=[U_].
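
The [UU] and [U_] strings in the plugin output have the same form as the member map the kernel shows in /proc/mdstat, where U is a working member and _ is a missing or failed one. The following is a simplified sketch of that kind of check, not the plugin's actual code. The name mdstat_check is hypothetical, and the optional file argument is only there so the function can be tried against a saved copy of /proc/mdstat:

```shell
#!/bin/sh
# Simplified sketch, NOT the plugin's real logic: scan an mdstat-format file
# (default /proc/mdstat) and flag any array whose member map contains "_".
# Exits 2 if at least one array is degraded, 0 otherwise.
mdstat_check() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        match($0, /\[[U_]+\]/) {               # member map such as [UU] or [U_]
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_")) {
                bad = 1
                state = "CRITICAL"
            } else {
                state = "OK"
            }
            print state, name, "status=" map
        }
        END { exit (bad ? 2 : 0) }
    ' "${1:-/proc/mdstat}"
}
```

Called with no argument it reads the live /proc/mdstat:

>mdstat_check; echo $?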

I now remove the faulty device from the array completely (note that it must first be marked faulty, as in the previous step):

>mdadm /dev/md0 -r /dev/sdc2
mdadm: hot removed /dev/sdc2

Testing the plugin again shows the array is still reported as critical:

>/usr/lib/nagios/plugins/check_linux_raid
CRITICAL md0 status=[U_].

Now I can re-add the device to the array (since there's really nothing wrong with it). Once the array has rebuilt, you can test again and the Nagios plugin will report OK.
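
The re-add step can be sketched as follows. This assumes the partition is put back with mdadm's -a (--add) option, the counterpart of the -f and -r options used earlier, since the exact command is not shown above. The helper names (run, readd_member, rebuild_in_progress) are hypothetical, and by default the sketch only prints the mdadm command instead of running it:

```shell
#!/bin/sh
# Print commands instead of running them unless DRY_RUN=0 is set, so the
# sketch can be tried safely on a machine without the test array.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: the removed partition goes back into the array with -a (--add).
# Arguments: the array device, then the member to add.
readd_member() {
    array=$1
    member=$2
    run mdadm "$array" -a "$member"
}

# Succeed (exit 0) while the given mdstat-format file (default /proc/mdstat)
# shows a recovery or resync in progress.
rebuild_in_progress() {
    grep -Eq 'recovery|resync' "${1:-/proc/mdstat}"
}
```

To actually run it against the array from this walkthrough, and then wait for the rebuild before testing the plugin again:

>DRY_RUN=0 readd_member /dev/md0 /dev/sdc2
>while rebuild_in_progress; do sleep 2; done
>/usr/lib/nagios/plugins/check_linux_raid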