
Monday, October 7, 2013

Nagios monitoring for Amazon SQS queue depth

I have found that a bunch of messages stacking up in my SQS queues can be the first sign of something breaking. Several things can cause messages to stack up in a queue: I have seen malformed messages, slow servers, and dead processes all cause this at different times. To monitor queue depth I wrote this Nagios check/plugin. The check queries the SQS API to find the number of messages in each queue, then compares that count to the warning and critical thresholds.

This check is written in Python and uses the boto library. It includes perfdata output so you can graph the number of messages in each queue. The AWS API for SQS matches queue names by prefix, so you can monitor a group of queues with one check if they share a common name prefix. The way I use this is to define several individual checks using the complete, explicit queue name, plus a catch-all check on a prefix with higher thresholds that will pick up any queues that have been added. Make sure the user that will be running this Nagios check has a .boto file; the credentials only need read permissions.
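
To make the threshold logic concrete, here is a minimal sketch of what the check does, assuming boto 2.x and a ~/.boto credentials file. The queue prefix, region, and thresholds are just example values; the real plugin in my repository handles argument parsing and error cases:

# Minimal sketch of the queue-depth logic (boto 2.x assumed).
# Prefix, region, and thresholds below are example values.
import sys
import boto.sqs

region = 'us-east-1'
prefix = 'example_name'    # matches any queue name starting with this prefix
warn, crit = 150, 300

conn = boto.sqs.connect_to_region(region)
status, messages, perfdata = 0, [], []

for queue in conn.get_all_queues(prefix=prefix):
    depth = queue.count()  # approximate number of visible messages
    perfdata.append('%s=%d;%d;%d' % (queue.name, depth, warn, crit))
    if depth >= crit:
        status = max(status, 2)
        messages.append('%s CRITICAL %d msgs' % (queue.name, depth))
    elif depth >= warn:
        status = max(status, 1)
        messages.append('%s WARNING %d msgs' % (queue.name, depth))
    else:
        messages.append('%s OK %d msgs' % (queue.name, depth))

print('%s | %s' % (', '.join(messages), ' '.join(perfdata)))
sys.exit(status)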

Some queues are more time-sensitive than others, which is the case for my setup. For time-sensitive queues I set the warning and critical counts to low values; less time-sensitive queues get higher thresholds. This screenshot shows an example of that:


Config


Here is the command definition I use for Nagios:

# 'check_sqs_depth' command definition
define command{
        command_name    check_sqs_depth
        command_line    /usr/lib/nagios/plugins/check_sqs_depth.py --name '$ARG1$' --region '$ARG2$' --warn '$ARG3$' --crit '$ARG4$'
        }

And here is the service definition I'm using for one explicitly named queue:

define service{
        use                     generic-service
        host_name               sqs.us-east-1
        service_description     example_name SQS Queue
        contact_groups          admins,admins-page,sqs-alerts
        check_command           check_sqs_depth!example_name!us-east-1!150!300
        }
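
For the catch-all check mentioned above, a second service definition along these lines can be used. The prefix and the higher thresholds here are just example values:

define service{
        use                     generic-service
        host_name               sqs.us-east-1
        service_description     Catchall SQS Queues
        contact_groups          admins,sqs-alerts
        check_command           check_sqs_depth!example!us-east-1!500!1000
        }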


Code


The code is available in my GitHub nagios-checks repository here: https://github.com/matt448/nagios-checks, and I have posted it as a gist below. My Git repository will have the most up-to-date version.


Saturday, May 25, 2013

Monitor S3 file ages with Nagios


I have started using Amazon S3 storage for a couple of different things, like static image hosting and storing backups. My backup scripts tar and gzip files and then upload the tarball to S3. Since I don't have a central backup system to alert me of failed backups or to delete old backups, I needed to handle those tasks manually. S3 has built-in lifecycle settings, which I do use, but as with everything AWS they don't always work perfectly. As for alerting on failed backups, I decided to handle that by watching the age of the files stored in the S3 bucket. I ended up writing a Nagios plugin that can monitor both the minimum and maximum age of files stored in S3. In addition to monitoring the age of backup files, I think this could also be useful if you use an S3 bucket as a temporary storage area for batch processing: in that case old files would indicate a missed file or possibly a damaged file that couldn't be processed.

I wrote this in my favorite new language, Python, and used the boto library to access S3. The check looks at every file stored in a bucket and compares each file's last_modified property against the supplied minimum and/or maximum. The check can be used for a minimum age, a maximum age, or both. You will need to create a .boto file in the home directory of the user executing the Nagios check, with credentials that have at least read access to the S3 bucket.
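
As a rough illustration of the idea, here is a minimal sketch of the age logic, again assuming boto 2.x and a ~/.boto file with read access. The bucket name and thresholds are example values, and the handling of the two thresholds is simplified compared to the real plugin:

# Minimal sketch of the S3 file-age logic (boto 2.x assumed).
# Bucket name and thresholds below are example values.
import sys
from datetime import datetime
import boto

bucket_name = 'myimportantdata'
min_age_hours = 24    # newest file should be no older than this (backups still running)
max_age_hours = 720   # oldest file should be no older than this (old backups cleaned up)

conn = boto.connect_s3()
bucket = conn.get_bucket(bucket_name)
now = datetime.utcnow()

ages = []
for key in bucket.list():
    # last_modified on listed keys looks like '2013-05-25T12:34:56.000Z'
    modified = datetime.strptime(key.last_modified, '%Y-%m-%dT%H:%M:%S.%fZ')
    ages.append((now - modified).total_seconds() / 3600.0)

if not ages:
    print('CRITICAL - bucket %s is empty' % bucket_name)
    sys.exit(2)

newest, oldest = min(ages), max(ages)
if newest > min_age_hours or oldest > max_age_hours:
    print('CRITICAL - newest file %.1fh old, oldest file %.1fh old' % (newest, oldest))
    sys.exit(2)

print('OK - newest file %.1fh old, oldest file %.1fh old' % (newest, oldest))
sys.exit(0)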

The check_s3_file_age.py file is available in my GitHub nagios-checks repository here: https://github.com/matt448/nagios-checks.

To use this with NRPE add an entry something like this:

command[check_s3_file_age]=/usr/lib/nagios/plugins/check_s3_file_age.py --bucketname myimportantdata --minfileage 24 --maxfileage 720

Here is output from --help:

./check_s3_file_age.py --help

usage: check_s3_file_age.py [-h] --bucketname BUCKETNAME
                            [--minfileage MINFILEAGE]
                            [--maxfileage MAXFILEAGE] [--listfiles] [--debug]

This script is a Nagios check that monitors the age of files that have been
backed up to an S3 bucket.

optional arguments:
  -h, --help            show this help message and exit
  --bucketname BUCKETNAME
                        Name of S3 bucket
  --minfileage MINFILEAGE
                        Minimum age for files in an S3 bucket in hours.
                        Default is 0 hours (disabled).
  --maxfileage MAXFILEAGE
                        Maximum age for files in an S3 bucket in hours.
                        Default is 0 hours (disabled).
  --listfiles           Enables listing of all files in bucket to stdout. Use
                        with caution!
  --debug               Enables debug output.


I am a better sysadmin than I am a programmer, so please let me know if you find bugs or see ways to improve the code. The best way to do this is to submit an issue on GitHub.

Here is sample output in Nagios: