Monday, October 7, 2013

Nagios monitoring for Amazon SQS queue depth

I have found that messages stacking up in my SQS queues can be the first sign of something breaking. Several things can cause this: I have seen malformed messages, slow servers and dead processes all do it at different times. So to monitor the queue depth I wrote this Nagios check / plug-in. The check queries the SQS API for the count of messages in each queue, then compares that count to the warning and critical levels.
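
The comparison step can be sketched on its own, separate from the AWS calls. This is a minimal illustration rather than the plug-in's actual code: the function names are hypothetical, and returning UNKNOWN when no queue matched is an assumption of this sketch (the script later in this post reports OK in that case).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_depth(depth, warn, crit):
    """Return the Nagios state for a single queue's message count."""
    if depth >= crit:
        return CRITICAL
    if depth >= warn:
        return WARNING
    return OK


def overall_state(depths, warn, crit):
    """Return the worst state across all queues.

    Assumption of this sketch: an empty list of depths is UNKNOWN.
    """
    if not depths:
        return UNKNOWN
    # CRITICAL (2) > WARNING (1) > OK (0), so max() picks the worst state.
    return max(classify_depth(d, warn, crit) for d in depths)
```

Calling `overall_state([3, 160, 20], 150, 300)` returns the WARNING code, since one queue is at or above the warning level and none has reached critical.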

This check is written in Python and uses the boto library. It includes perfdata output so you can graph the number of messages in the queue. The AWS API for SQS does wildcard (prefix) matching of queue names, so you can monitor a bunch of queues with one check if their names share a common prefix. The way I use this is to have several individual checks using the complete explicit name of each queue, plus a catch-all check using a prefix and higher thresholds that will catch any queues that have been added. Make sure you have a .boto file for the user that will be running this Nagios check. It only requires read permissions.
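
To make the perfdata part concrete, here is a small sketch of how per-queue counts can be turned into the `label=value;warn;crit` form that Nagios graphs from. The helper names and the sample queue names are hypothetical, and the formatting is slightly tidier than what the script in this post emits.

```python
def perfdata(depths, warn, crit):
    """Format {queue_name: depth} as space-separated Nagios perfdata items."""
    return " ".join(
        "%s=%d;%d;%d" % (name, depths[name], warn, crit)
        for name in sorted(depths)
    )


def plugin_output(status_text, depths, warn, crit):
    """Join the human-readable status and the perfdata with a pipe."""
    return status_text + "|" + perfdata(depths, warn, crit)
```

Everything before the `|` is what shows up in the Nagios status column; everything after it is handed to whatever graphing add-on you use.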

Some queues are more time sensitive than others, which is the case in my setup. For time-sensitive queues I set the warning and critical counts to low values. Less time-sensitive queues get higher values. This screenshot is an example of that:


Config


Here is the command definition I use for Nagios:

# 'check_sqs_depth' command definition
define command{
        command_name    check_sqs_depth
        command_line    /usr/lib/nagios/plugins/check_sqs_depth.py --name '$ARG1$' --region '$ARG2$' --warn '$ARG3$' --crit '$ARG4$'
        }

And here is the service definition I'm using:

define service{
        use                     generic-service
        host_name               sqs.us-east-1
        service_description     example_name SQS Queue
        contact_groups          admins,admins-page,sqs-alerts
        check_command           check_sqs_depth!example_name!us-east-1!150!300!
        }
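
Nagios splits the check_command value on "!" and substitutes the pieces into the $ARG1$, $ARG2$, ... macros of the command definition. The sketch below mimics that substitution in Python to show the command line that ends up being run. `expand_check_command` is a hypothetical helper written for this illustration; real Nagios macro expansion handles many more cases.

```python
def expand_check_command(check_command, command_line):
    """Return (command_name, expanded_command_line).

    Simplified model of Nagios argument macros: the text after each "!"
    becomes $ARG1$, $ARG2$, and so on.
    """
    parts = check_command.split("!")
    name, args = parts[0], parts[1:]
    for position, value in enumerate(args, start=1):
        command_line = command_line.replace("$ARG%d$" % position, value)
    return name, command_line


name, line = expand_check_command(
    "check_sqs_depth!example_name!us-east-1!150!300!",
    "/usr/lib/nagios/plugins/check_sqs_depth.py --name '$ARG1$' "
    "--region '$ARG2$' --warn '$ARG3$' --crit '$ARG4$'",
)
print(line)
```

The trailing "!" in the service definition only produces an empty fifth argument, which the command definition never references.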


Code


The code is available in my nagios-checks repository on GitHub: https://github.com/matt448/nagios-checks and I have posted it as a gist below. My git repository will have the most up-to-date version.


#!/usr/bin/python
##########################################################
#
# Written by Matthew McMillan, matthew.mcmillan@gmail.com
#
# Requires the boto library and a .boto file with read
# permissions to the queues.
#
from __future__ import print_function

import sys
import argparse
import boto
import boto.sqs


def printUsage():
    print()
    print("Example: ", sys.argv[0], "--name myqueue --region us-east-1 --warn 10 --crit 20")
    print()


# Parse command line arguments
parser = argparse.ArgumentParser(
    description='This script is a Nagios check that '
                'monitors the number of messages in Amazon SQS '
                'queues. It requires a .boto file in the user\'s '
                'home directory and AWS credentials that allow '
                'read access to the queues that are to be monitored.')
parser.add_argument('--name', dest='name', type=str, required=True,
                    help='Name of SQS queue. This can be a wildcard match. '
                         'For example a name of blah_ would match blah_1, '
                         'blah_2, blah_foobar. To monitor a single queue, enter '
                         'the exact name of the queue.')
parser.add_argument('--region', dest='region', type=str, default='us-east-1',
                    help='AWS Region hosting the SQS queue. '
                         'Default is us-east-1.')
parser.add_argument('--warn', dest='warn', type=int, required=True,
                    help='Warning level for queue depth.')
parser.add_argument('--crit', dest='crit', type=int, required=True,
                    help='Critical level for queue depth.')
parser.add_argument('--debug', action='store_true', help='Enable debug output.')
args = parser.parse_args()

# Assign command line args to variable names
queueName = args.name
sqsRegion = args.region
warnDepth = args.warn
critDepth = args.crit

if critDepth <= warnDepth:
    print()
    print("ERROR: Critical value must be larger than warning value.")
    printUsage()
    sys.exit(2)

qList = []
depthList = []
statusMsgList = []
statusMsg = ""
msgLine = ""
perfdataMsg = ""
warnCount = 0
critCount = 0
exitCode = 3

# Make SQS connection
conn = boto.sqs.connect_to_region(sqsRegion)
rs = conn.get_all_queues(prefix=queueName)

# Loop through each queue and get message count
# Push the queue name and depth to lists
for qname in rs:
    # Queue id looks like /<account id>/<queue name>; keep the name
    namelist = str(qname.id).split("/")
    qList.append(namelist[2])
    depthList.append(int(qname.count()))

if args.debug:
    print()
    print('========== Queue List =============')
    print(qList)
    print('==================================')
    print()

# Build status message and check warn/crit values
for index in range(len(qList)):
    if depthList[index] >= warnDepth and depthList[index] < critDepth:
        warnCount += 1
    if depthList[index] >= critDepth:
        critCount += 1
    msgLine = qList[index] + ":" + str(depthList[index])
    statusMsgList.append(msgLine)

# Set exit code based on number of warnings and criticals
if warnCount == 0 and critCount == 0:
    statusMsgList.insert(0, "OK - Queue depth (")
    exitCode = 0
elif warnCount > 0 and critCount == 0:
    statusMsgList.insert(0, "WARNING - Queue depth (")
    exitCode = 1
elif critCount > 0:
    statusMsgList.insert(0, "CRITICAL - Queue depth (")
    exitCode = 2
else:
    statusMsgList.insert(0, "UNKNOWN - Queue depth (")
    exitCode = 3

# Build status message output
for msg in statusMsgList:
    statusMsg += msg + " "

# Build perfdata output
for index in range(len(qList)):
    perfdataMsg += qList[index] + "=" + str(depthList[index]) + ";" + str(warnDepth) + ";" + str(critDepth) + "; "

# Finalize status message
statusMsg += ") [W:" + str(warnDepth) + " C:" + str(critDepth) + "]"

# Print final output for Nagios
print(statusMsg + "|" + perfdataMsg)

# Exit with appropriate code
sys.exit(exitCode)
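
The script above uses the original boto library. For setups that have moved to boto3, the queue lookup could be sketched roughly as follows. This is an assumption-laden illustration and not part of the check: `queue_depths` is a hypothetical helper, and `client` is expected to behave like `boto3.client("sqs", ...)`.

```python
def queue_depths(client, prefix):
    """Return {queue_name: approximate message count} for matching queues.

    `client` is assumed to expose the boto3 SQS client methods
    list_queues and get_queue_attributes.
    """
    response = client.list_queues(QueueNamePrefix=prefix)
    depths = {}
    # The "QueueUrls" key is missing from the response when nothing matches.
    for url in response.get("QueueUrls", []):
        attrs = client.get_queue_attributes(
            QueueUrl=url,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        # The queue name is the last path segment of the queue URL.
        name = url.rstrip("/").split("/")[-1]
        depths[name] = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    return depths


# Usage (requires boto3 and AWS credentials with read access):
#   import boto3
#   print(queue_depths(boto3.client("sqs", region_name="us-east-1"), "blah_"))
```

Passing the client in as a parameter keeps the function easy to exercise with a stub object, without touching AWS.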
