
Tuesday, December 31, 2013

Simple way to integrate Nagios with Slack messaging

At work we recently switched messaging applications from Skype to a new platform called Slack, which launched in August 2013. I have read it is similar to Campfire, but I've never used that platform so I can't really compare the two. Slack is much more useful than a basic chat client like Skype: you can share files, easily search message history for text or files, and integrate with third-party applications, and it is private to just your team or company. Slack has quite a few preconfigured integrations plus the ability to create your own custom integrations. First we set up the GitHub integration, which dumps all of our commit messages into a channel. Next we set up the Trello integration to dump card changes from our main board into another channel. Then I went to set up the Nagios integration and ran into problems. They have a prebuilt integration for Nagios, but I could not get it to work. It would post alert messages into the channel, but the messages contained no information:


I mucked with their provided Perl script quite a bit, but I simply could not get it to work; it just kept posting empty messages. Being impatient and a do-it-yourselfer, I set about finding another way to accomplish this. Looking through the list of integrations I noticed a custom one called Incoming WebHooks, which is an easy way to get messages from external sources posted into Slack. The simplest way to use Incoming WebHooks is to post the message to Slack's API with curl. I wrote a little bash script that provides a detailed Nagios alert, a link back to the Nagios web page and conditional emojis! Each alert level (OK, WARNING, CRITICAL and UNKNOWN) gets its own emoji icon. Here are some example messages in my Slack client:


Here is my bash script that posts to Slack; I placed it in /usr/local/bin.
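In case the embedded gist doesn't load, here is a minimal sketch of the idea. The webhook URL, channel name, script name and emoji choices are placeholders, and it assumes enable_environment_macros=1 in nagios.cfg so the NAGIOS_* variables are available to the script:

#!/bin/bash
# Sketch: post a Nagios service alert to Slack via an Incoming WebHook.
# The webhook URL, Nagios URL and channel below are placeholder values.

SLACK_WEBHOOK_URL="https://yourteam.slack.com/services/hooks/incoming-webhook?token=YOURTOKEN"
NAGIOS_URL="https://nagios.example.com/nagios3"
CHANNEL="#ops"

# Conditional emoji per alert level
case "$NAGIOS_SERVICESTATE" in
    OK)       EMOJI=":white_check_mark:" ;;
    WARNING)  EMOJI=":warning:" ;;
    CRITICAL) EMOJI=":exclamation:" ;;
    *)        EMOJI=":question:" ;;
esac

# Build the payload with the alert details and a link back to Nagios
PAYLOAD="payload={\"channel\": \"${CHANNEL}\", \"username\": \"nagios\", \"text\": \"${EMOJI} ${NAGIOS_SERVICESTATE}: ${NAGIOS_HOSTNAME}/${NAGIOS_SERVICEDESC} - ${NAGIOS_SERVICEOUTPUT} (<${NAGIOS_URL}|view in Nagios>)\"}"

# --data-urlencode implies a POST and URL-encodes the JSON for us
curl -s --data-urlencode "${PAYLOAD}" "${SLACK_WEBHOOK_URL}"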

Here are the Nagios config lines I added to commands.cfg:
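If the embed doesn't render, the definitions look roughly like this (the command and script names here are my own placeholders; since the script reads NAGIOS_* environment variables, enable_environment_macros must be on in nagios.cfg):

# Sketch of the Slack notification commands (names are placeholders)
define command{
        command_name    notify-service-by-slack
        command_line    /usr/local/bin/nagios_to_slack.sh
        }

define command{
        command_name    notify-host-by-slack
        command_line    /usr/local/bin/nagios_to_slack.sh
        }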

And finally the lines I added to contacts.cfg:
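Again, roughly: a contact wired to the Slack notification commands, which you would then add to whatever contactgroup your services notify (all names here are placeholders):

define contact{
        contact_name                    slack
        alias                           Slack Channel
        service_notification_period    24x7
        host_notification_period       24x7
        service_notification_options   w,u,c,r
        host_notification_options      d,r
        service_notification_commands  notify-service-by-slack
        host_notification_commands     notify-host-by-slack
        }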

I'm not sure why Slack's prebuilt Nagios integration didn't work for me, but I really like what I came up with: no Perl modules to install, and the only outside dependency is curl. It's also pretty easy to modify the info in the alert message by adding or removing NAGIOS_ environment variables in the curl statement.

Monday, October 7, 2013

Nagios monitoring for Amazon SQS queue depth

I have found that a bunch of messages stacking up in my SQS queues can be the first sign of something breaking. Several things can cause messages to stack up; I have seen malformed messages, slow servers and dead processes all cause it at different times. To monitor queue depth I wrote this Nagios check/plugin. The check simply queries the SQS API for the number of messages in each queue, then compares the count to the warning and critical levels.

This check is written in Python and uses the boto library. It includes perfdata output so you can graph the number of messages in the queue. The AWS API for SQS matches queue names by prefix, so you can monitor a bunch of queues with one check if their names share a common prefix. The way I use this is with several individual checks on the complete, explicit queue names, plus a catch-all prefix match set to higher thresholds that picks up any queues added later. Make sure the user running this Nagios check has a .boto file; the check only requires read permissions.

Some queues may be more time sensitive than others; that is the case in my setup. For time-sensitive queues I set the warning and critical counts to low values, and less time-sensitive queues get higher counts. This screenshot is an example of that:


Config


Here is the command definition I use for Nagios:

# 'check_sqs_depth' command definition
define command{
        command_name    check_sqs_depth
        command_line    /usr/lib/nagios/plugins/check_sqs_depth.py --name '$ARG1$' --region '$ARG2$' --warn '$ARG3$' --crit '$ARG4$'
        }

and here is the service definition I'm using

define service{
        use                     generic-service
        host_name               sqs.us-east-1
        service_description     example_name SQS Queue
        contact_groups          admins,admins-page,sqs-alerts
        check_command           check_sqs_depth!example_name!us-east-1!150!300
        }


Code


The code is available on my github nagios-checks repository here: https://github.com/matt448/nagios-checks and I have posted it as a gist below. My git repository will have the most up-to-date version.
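If the gist doesn't render, this condensed sketch shows the core of the approach with boto (argument parsing and error handling trimmed; credentials come from the .boto file mentioned above):

#!/usr/bin/env python
# Condensed sketch of the check: name, region, warn and crit come from
# the command line, and 'name' acts as a queue-name prefix.
import sys
import boto.sqs

name, region = sys.argv[1], sys.argv[2]
warn, crit = int(sys.argv[3]), int(sys.argv[4])

conn = boto.sqs.connect_to_region(region)
queues = conn.get_all_queues(prefix=name)
if not queues:
    print('UNKNOWN: no queues match prefix %s' % name)
    sys.exit(3)

status, exitcode, perfdata = 'OK', 0, []
for q in queues:
    count = q.count()  # ApproximateNumberOfMessages
    perfdata.append('%s=%d;%d;%d' % (q.name, count, warn, crit))
    if count >= crit:
        status, exitcode = 'CRITICAL', 2
    elif count >= warn and exitcode < 2:
        status, exitcode = 'WARNING', 1

# Everything after the pipe is perfdata so the counts can be graphed
print('SQS %s | %s' % (status, ' '.join(perfdata)))
sys.exit(exitcode)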


Saturday, May 25, 2013

Monitor S3 file ages with Nagios


I have started using Amazon S3 storage for a couple different things, like static image hosting and storing backups. My backup scripts tar and gzip files and then upload the tarball to S3. Since I don't have a central backup system to alert me about failed backups or to delete old backups, I needed to handle those tasks myself. S3 has built-in lifecycle settings, which I do utilize, but as with everything AWS it doesn't always work perfectly. As for alerting on failed backups, I decided to handle that by watching the age of the files stored in the S3 bucket. I ended up writing a Nagios plugin that can monitor both the minimum and maximum age of files stored in S3. Beyond backup files, I think this could also be useful for monitoring file ages if you use an S3 bucket as a temporary storage area for batch processing; in that case old files would indicate a missed file or possibly a damaged file that couldn't be processed.

I wrote this in my favorite new language, Python, and used the boto library to access S3. The check looks through every file stored in a bucket and compares each file's last_modified property against the supplied min and/or max. The check can enforce a minimum age, a maximum age or both. You will need to create a .boto file in the home directory of the user executing the Nagios check, with credentials that have at least read access to the S3 bucket.
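The heart of the check is a loop over the bucket's keys comparing last_modified timestamps. Here is a condensed sketch of that logic (argparse and the Nagios output plumbing are trimmed, and the min/max semantics below are one plausible reading; the real plugin on github is authoritative):

from datetime import datetime, timedelta
import boto

def file_age_exitcode(bucketname, minfileage=0, maxfileage=0):
    """Sketch: return a Nagios exit code based on file ages in hours."""
    conn = boto.connect_s3()  # credentials come from the user's .boto file
    bucket = conn.get_bucket(bucketname)
    now = datetime.utcnow()
    newest = oldest = None

    for key in bucket.list():
        # last_modified is a string like '2013-05-25T04:23:07.000Z'
        modified = datetime.strptime(key.last_modified,
                                     '%Y-%m-%dT%H:%M:%S.%fZ')
        newest = modified if newest is None else max(newest, modified)
        oldest = modified if oldest is None else min(oldest, modified)

    if newest is None:
        return 3  # UNKNOWN: bucket is empty
    # minfileage: the newest file must be recent (did a backup run lately?)
    if minfileage and now - newest > timedelta(hours=minfileage):
        return 2  # CRITICAL
    # maxfileage: the oldest file must not linger (are old backups purged?)
    if maxfileage and now - oldest > timedelta(hours=maxfileage):
        return 1  # WARNING
    return 0  # OK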

The check_s3_file_age.py file is available on my github nagios-checks repository here: https://github.com/matt448/nagios-checks.

To use this with NRPE add an entry something like this:

command[check_s3_file_age]=/usr/lib/nagios/plugins/check_s3_file_age.py --bucketname myimportantdata --minfileage 24 --maxfileage 720

Here is output from --help:

./check_s3_file_age.py --help

usage: check_s3_file_age.py [-h] --bucketname BUCKETNAME
                            [--minfileage MINFILEAGE]
                            [--maxfileage MAXFILEAGE] [--listfiles] [--debug]

This script is a Nagios check that monitors the age of files that have been
backed up to an S3 bucket.

optional arguments:
  -h, --help            show this help message and exit
  --bucketname BUCKETNAME
                        Name of S3 bucket
  --minfileage MINFILEAGE
                        Minimum age for files in an S3 bucket in hours.
                        Default is 0 hours (disabled).
  --maxfileage MAXFILEAGE
                        Maximum age for files in an S3 bucket in hours.
                        Default is 0 hours (disabled).
  --listfiles           Enables listing of all files in bucket to stdout. Use
                        with caution!
  --debug               Enables debug output.


I am a better sysadmin than I am a programmer, so please let me know if you find bugs or see ways to improve the code. The best way to do that is to submit an issue on github.

Here is sample output in Nagios:

Monday, March 18, 2013

Template Nagios check for a JSON web service

I wrote two different custom Nagios checks for work last week and realized I could make a useful template out of them. After writing the first check I was able to reuse most of the code for the second; the only changes had to do with the data returned. So I decided to turn this into a generic template that I can reuse in the future. The check first verifies that the web service is responding correctly and then checks various data returned in JSON format. While writing the template I found a really cool service (www.jsontest.com) that let me code against something available to anyone who wants to try out this Nagios check before customizing it. This is the first time I have used Python's argparse module and I have to say it is fantastic: it makes adding command line arguments very easy and the result looks professional.

My github repo can be found here: https://github.com/matt448/nagios-checks

Here is the code in a gist:
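If the gist doesn't render, this stripped-down sketch shows the shape of the template, using date.jsontest.com as the test service (the 'date' field checked at the end is just an example to replace with your own service's data):

#!/usr/bin/env python
# Stripped-down sketch of the template: hit a JSON web service, verify
# the response, then check fields in the returned data.
import argparse
import json
import sys
import urllib2

parser = argparse.ArgumentParser(
    description='Nagios check template for a JSON web service')
parser.add_argument('--url', default='http://date.jsontest.com',
                    help='URL of the JSON web service to check')
args = parser.parse_args()

# First verify the web service is responding correctly
try:
    response = urllib2.urlopen(args.url, timeout=10)
except Exception as e:
    print('CRITICAL: could not reach %s (%s)' % (args.url, e))
    sys.exit(2)

if response.getcode() != 200:
    print('CRITICAL: HTTP %d from %s' % (response.getcode(), args.url))
    sys.exit(2)

data = json.loads(response.read())

# Then check the returned JSON; customize from here down for your service
if 'date' not in data:
    print('WARNING: response is missing the expected "date" field')
    sys.exit(1)

print('OK: service responding, date=%s' % data['date'])
sys.exit(0)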

Wednesday, March 13, 2013

Nagios file paths on Ubuntu and simple backup script


This is more of a note to myself than anything else, but it might be helpful to others. Here are the config and data directories for Nagios when installed from packages on Ubuntu 12.04.


Config files
----------------------
/etc/nagios3/
/etc/nagios3/conf.d
/etc/nagios-plugins/config
/etc/nagios

Plugin executables
---------------------
/usr/lib/nagios/plugins

Graphing (pnp4nagios)
----------------------
/usr/share/pnp4nagios/html
/var/lib/pnp4nagios/perfdata

Other
-----------------------
/var/lib/nagios
/var/lib/nagios3



Here is a very simple backup script for Nagios on Ubuntu 12.04:
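In case the gist doesn't render, it amounts to something like this, tarring up the directories listed above (the backup destination and 30-day retention are placeholders to adjust):

#!/bin/bash
# Sketch: tar up the Nagios config and data directories listed above.
BACKUP_DIR=/backup/nagios          # placeholder destination
DATE=$(date +%Y-%m-%d)

mkdir -p "${BACKUP_DIR}"

tar czf "${BACKUP_DIR}/nagios-backup-${DATE}.tar.gz" \
    /etc/nagios3 \
    /etc/nagios-plugins/config \
    /etc/nagios \
    /var/lib/nagios \
    /var/lib/nagios3

# Prune backups older than 30 days (placeholder retention)
find "${BACKUP_DIR}" -name 'nagios-backup-*.tar.gz' -mtime +30 -delete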