Linux Software RAID Nagios Monitoring & Failure Simulation
It's always nice to monitor your technology to make sure it's working properly; it's even nicer to know that when something fails, you'll be alerted to what happened, so sometimes you need to test those failure scenarios. This is just an overview, using Debian Lenny, of how to use the Linux mdadm tool to create, degrade, and rebuild a software RAID 1 (mirror) device. Throughout the process you'll see the output of the check_linux_raid plugin from the Nagios Plugins project.
First I installed the Nagios plugins and mdadm:
>apt-get install mdadm nagios-plugins
I used fdisk to create two partitions, /dev/sdc1 and /dev/sdc2, of equal size (in this case 40 MB each; I know, I'm generous).
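For reference, here's a rough sketch of that partitioning step (the exact cylinder/sector numbers depend on your disk, and setting the partition type to fd, "Linux raid autodetect", is conventional but not strictly required):
>fdisk /dev/sdc
Inside fdisk, create the two small primary partitions with n, optionally change their type to fd with t, and write the table out with w. You can then confirm the layout:
>fdisk -l /dev/sdc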
I then used the following command to create a RAID 1 array from the two partitions. I happened to call it md0:
>mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdc2
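If you also want the array assembled automatically at boot, the usual extra step on Debian (not needed for this test) is to append its definition to the mdadm config:
>mdadm --detail --scan >> /etc/mdadm/mdadm.conf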
Created an ext3 filesystem on my array:
>mkfs.ext3 /dev/md0
Mounted the new file system in an arbitrary location:
>mount /dev/md0 /media/raid
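The mount point has to exist first; if /media/raid isn't already there, create it beforehand:
>mkdir -p /media/raid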
The array is now active and happy; here's how to verify:
>mdadm --detail /dev/md0
/dev/md0:
...
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       34        1      active sync   /dev/sdc2
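As a quick alternative to --detail, you can glance at /proc/mdstat, which is what check_linux_raid actually parses; the block count below is just illustrative, but a healthy mirror shows [2/2] [UU]:
>cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc2[1] sdc1[0]
      40896 blocks [2/2] [UU]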
Now I'll run the first test of the Nagios plugin. You can give it a device name if you like, or it can detect your devices for you. As you can see, everything is OK (exit status 0):
>/usr/lib/nagios/plugins/check_linux_raid
OK md0 status=[UU].
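To have Nagios watch this on its own, rather than you running the plugin by hand, you'd wire it into your configuration. Here's a minimal sketch assuming the common NRPE setup (the check_raid command name and the service description are just examples; buenos-aires is the host from this walkthrough). On the monitored host, in nrpe.cfg:
command[check_raid]=/usr/lib/nagios/plugins/check_linux_raid
And on the Nagios server, a service definition along these lines:
define service{
        use                     generic-service
        host_name               buenos-aires
        service_description     Software RAID
        check_command           check_nrpe!check_raid
        }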
Now I'm going to set one of my block devices as faulty:
>mdadm /dev/md0 -f /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md0
To verify the array is degraded, we'll look at the details again:
>mdadm --detail /dev/md0
/dev/md0:
...
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 3103faf6:34bec8ae:2689ae2c:f32f1e02 (local to host buenos-aires)
         Events : 0.8

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       0       0         1      removed

       2       8       34        -      faulty spare   /dev/sdc2
You'll notice /dev/sdc2 is now marked as a faulty spare.
Another check of the Nagios plugin returns an exit code of 2, marking the service as critical:
>/usr/lib/nagios/plugins/check_linux_raid
CRITICAL md0 status=[U_].
I now remove the faulty device from the array completely (note that it must be marked faulty first, as in the previous step):
>mdadm /dev/md0 -r /dev/sdc2
mdadm: hot removed /dev/sdc2
Test the plugin again; the array is still marked as critical:
>./check_linux_raid
CRITICAL md0 status=[U_].
Now I can re-add the device to the array (since there's really nothing wrong with it). Once it rebuilds, you can test again and it will show as OK on the Nagios plugin:
>mdadm /dev/md0 -a /dev/sdc2
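On a 40 MB mirror the resync is nearly instant, but on a real disk you can watch it progress (while the rebuild is running, check_linux_raid will typically report WARNING rather than OK):
>watch cat /proc/mdstat
Once the resync finishes, the plugin is happy again:
>/usr/lib/nagios/plugins/check_linux_raid
OK md0 status=[UU].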
You can read more about Linux software RAID here, or about Nagios here.