Readable Nagios Log Timestamps

If you’re like me and most humans, the Nagios logfile timestamp (a unix timestamp) isn’t terribly useful when trying to grep through the logs and correlate events:

# head -2 nagios.log
[1350360000] LOG ROTATION: DAILY
[1350360000] LOG VERSION: 2.0

Here’s a nifty Perl one-liner that you can pipe your logs through:
perl -pe 's/(\d+)/localtime($1)/e'
to get nicer output like:

# head -2 nagios.log | perl -pe 's/(\d+)/localtime($1)/e'
[Tue Oct 16 00:00:00 2012] LOG ROTATION: DAILY
[Tue Oct 16 00:00:00 2012] LOG VERSION: 2.0
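
The one-liner rewrites the first run of digits on each line, which is fine for normal log entries but would also mangle a line that has no leading timestamp. A slightly stricter variant, offered only as a sketch, anchors the match to a bracketed 9 or 10 digit epoch value at the start of the line. The readable_line name is just an illustrative helper, not part of any Nagios tooling:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a leading "[epoch]" stamp to a readable local time.
# Lines that do not start with such a stamp are returned unchanged.
sub readable_line {
    my ($line) = @_;
    # Concatenation forces scalar context, so localtime returns a string.
    $line =~ s/^\[(\d{9,10})\]/'[' . localtime($1) . ']'/e;
    return $line;
}

# Filter STDIN to STDOUT, e.g.:  ./readable.pl < nagios.log
while (my $line = <STDIN>) {
    print readable_line($line);
}
```

The converted time is in the local timezone of the machine running the filter, so output will differ between hosts in different zones.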

Nagios Check Plugin for Rsnapshot Backups

In a previous post, I described how I do Secure rsnapshot backups over the WAN via SSH. While my layout of rsnapshot configuration files, data, and log files is a bit esoteric, I monitor all this with a Nagios check plugin that runs on my backup host. It assumes that the output of rsnapshot is written to a text log file, one file per host, at a path that matches /path_to_log_directory/log_HOSTNAME_YYYYMMDD-HHMMSS.log, where HOSTNAME is the name of the host and YYYYMMDD-HHMMSS is a datestamp (actually, the script just finds the newest file matching log_HOSTNAME_*.log in that directory). To obtain correct timing of the runs, which rsnapshot doesn’t offer, it also assumes that you trigger rsnapshot through a wrapper script that runs it once per host in a loop, writing a per-host log file with start and finish markers added, like:

for h in <LIST OF HOSTNAMES>
do
    LOGFILE="/mnt/backup/rsnapshot/logs/log_${h}_`date +%Y%m%d-%H%M%S`.txt"
    echo "# Starting backup at `date` (`date +%s`)" >> "$LOGFILE"
    /usr/bin/rsnapshot -c /etc/rsnapshot-$h.conf daily &>> "$LOGFILE"
    echo "# Finished backup at `date` (`date +%s`)" >> "$LOGFILE"
done

The check_rsnapshot.pl plugin uses utils.pm from Nagios, as well as Getopt::Long, File::stat, File::Basename, File::Spec and Number::Bytes::Human. This was one of my first Perl plugins, but seems to be rather acceptable. It makes the following checks based on the rsnapshot log:

  1. Backup run in the last X seconds (warning and crit thresholds)
  2. Maximum time from start to finish (warning and crit thresholds)
  3. Minimum size of backup (warning and crit thresholds)
  4. Minimum number of files in backup (warning and crit thresholds)
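
The first two checks depend only on the marker lines that the wrapper script writes. As a rough sketch of that idea (not the actual check_rsnapshot.pl code), the following reads the epoch values out of the "# Starting backup" and "# Finished backup" lines and maps age and duration onto a Nagios state. The function names, the sample log, and the threshold values are all made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the epoch values from the wrapper's "# Starting backup" and
# "# Finished backup" marker lines (undef for a marker that is missing).
sub backup_times {
    my (@lines) = @_;
    my ($start, $finish);
    for my $line (@lines) {
        $start  = $1 if $line =~ /^# Starting backup at .*\((\d+)\)\s*$/;
        $finish = $1 if $line =~ /^# Finished backup at .*\((\d+)\)\s*$/;
    }
    return ($start, $finish);
}

# Map age and duration onto a Nagios state, using warning/critical
# limits given in seconds.
sub check_run {
    my ($now, $limits, @lines) = @_;
    my ($start, $finish) = backup_times(@lines);
    return ('CRITICAL', 'no completed run found in log')
        unless defined $start && defined $finish;

    my $age      = $now - $finish;
    my $duration = $finish - $start;
    my $state    = 'OK';
    $state = 'WARNING'
        if $age >= $limits->{age_warn} || $duration >= $limits->{dur_warn};
    $state = 'CRITICAL'
        if $age >= $limits->{age_crit} || $duration >= $limits->{dur_crit};
    return ($state, "last run finished ${age}s ago and took ${duration}s");
}

# Sample log contents and thresholds, invented for illustration.
my @log = (
    "# Starting backup at Tue Oct 16 00:00:00 EDT 2012 (1350360000)\n",
    "(rsnapshot output would appear here)\n",
    "# Finished backup at Tue Oct 16 00:30:00 EDT 2012 (1350361800)\n",
);
my %limits = (age_warn => 86400, age_crit => 172800,
              dur_warn => 3600,  dur_crit => 7200);

# A fixed "now" keeps the example reproducible; a real check would use time().
my ($state, $message) = check_run(1350365400, \%limits, @log);
print "RSNAPSHOT $state: $message\n";
```

The size and file-count checks would need to parse rsnapshot's own output, which this sketch does not attempt.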

Combined with check_file_age checks on a number of files that are included in the backups and that I know are modified before each backup run, this seems to handle monitoring quite well for me. I certainly preferred running Bacula and using my MySQL-based check_bacula_job.php, but as I’m now backing up 4 machines to my desktop, I no longer have a need for Bacula (or tapes).

The script itself can be found at github.

Script to Chart Intervals Between Problem and Recovery from Nagios/Icinga Log Files

At work, we use Icinga (a fork of Nagios) for monitoring. We have a few services which are restarted or otherwise poked by event handlers, but the recovery takes a while – so we often get paged for problems which recover in a few minutes. I wrote a small Perl script that greps through the archived log files for a given regex (service and/or host name), calculates the time from problem to recovery, and graphs those times.

The script is called nagios_log_problem_interval.pl and can be downloaded from my github. Below is some sample output; the number of minutes from problem to recovery is along the Y axis and the count is along the X axis:


> nagios_log_problem_interval.pl --archivedir=/var/icinga/archive --match=myhost --backtrack=10
myhost;HTTP
Count
1:########(8)
2:##(2)
3:#(1)
4:##(2)
5:#######(7)
6:(0)
7:(0)
8:#(1)
9:(0)
10:(0)
11:#(1)
12:(0)
13:#(1)
14:(0)
15:(0)
16-29:(0)
30-59:(0)
60+:(0)
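
The core of the calculation can be sketched in a few lines. This is an illustration rather than the real script's logic: it assumes the standard "SERVICE ALERT: host;service;STATE;TYPE;attempt;output" log line, treats the first non-OK state as the start of a problem, and treats the next OK for the same host and service as the recovery. The problem_intervals name and the sample log lines are invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the problem-to-recovery intervals, in whole minutes, found in a
# list of Nagios/Icinga log lines.
sub problem_intervals {
    my (@lines) = @_;
    my (%problem_since, @minutes);
    for my $line (@lines) {
        next unless $line =~ /^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);(\w+);/;
        my ($time, $key, $state) = ($1, "$2;$3", $4);
        if ($state ne 'OK') {
            # Remember only the first failure of an outage.
            $problem_since{$key} = $time unless defined $problem_since{$key};
        }
        elsif (defined $problem_since{$key}) {
            my $began = delete $problem_since{$key};
            push @minutes, int(($time - $began) / 60);
        }
    }
    return @minutes;
}

# Invented sample log lines.
my @log = (
    "[1350360000] SERVICE ALERT: myhost;HTTP;CRITICAL;SOFT;1;Connection refused\n",
    "[1350360060] SERVICE ALERT: myhost;HTTP;CRITICAL;HARD;3;Connection refused\n",
    "[1350360300] SERVICE ALERT: myhost;HTTP;OK;HARD;3;HTTP OK\n",
    "[1350370000] SERVICE ALERT: myhost;HTTP;WARNING;SOFT;1;Slow response\n",
    "[1350370070] SERVICE ALERT: myhost;HTTP;OK;SOFT;2;HTTP OK\n",
);

# Count how many outages fell into each minute bucket and chart them.
my %count;
$count{$_}++ for problem_intervals(@log);
for my $m (sort { $a <=> $b } keys %count) {
    printf "%d:%s(%d)\n", $m, '#' x $count{$m}, $count{$m};
}
```

Unlike the sample output above, this sketch prints only the buckets that actually occur and does not group the longer intervals into ranges.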

World of Warcraft Realm Status Check Plugin for Nagios

My wife Jackie (Syrilia) is an avid World of Warcraft player (it’s an MMORPG with over 10 million players). They have weekly server maintenance/update windows every Tuesday morning – total downtime. The length is never really fixed, so I looked around to see if there was a logical way to notify when the servers came back up.

I managed to find a World of Warcraft Realm status check plugin on Nagios Exchange, but it was written to a now-discontinued API. It was also last modified in 2008, and I can’t seem to get in contact with the author, Scott A’Hearn (webmaster@scottahearn.com) – that email returns undeliverable, there’s no email link on the site that his domain now redirects to, and the domain scottahearn.com is a (eek) private registration in WHOIS, so I don’t really have any way of finding contact information. Regardless, I’ve modified the script to use the new Blizzard REST API and it’s now working. Of course, this is pulling from Blizzard’s data feed, not doing any actual monitoring itself, and be warned that they impose query limits (at the moment, their docs say 3,000 requests per day for anonymous access; to be nice to them, I only check on Tuesdays from 3am-4pm, when I’m most concerned about it). The updated source code is shown below, but the most up-to-date version will always live at
https://github.com/jantman/nagios-scripts/blob/master/check_wow.pl. If you want, you can also see a diff of my changes to Scott’s original version on github.

#!/usr/bin/perl -w
#
# World of Warcraft Realm detector plugin for Nagios
#
# Written by Scott A'Hearn (webmaster@scottahearn.com), version 1.2, Last Modified: 07-21-2008
#
# Modified by Jason Antman <jason@jasonantman.com> 02-22-2012, to cope with the change from
# the deprecated worldofwarcraft.com XML feed to the BattleNet JSON API.
#
# Usage: ./check_wow -r <realm_name>
#
# Description:
#
# This plugin will check the status of a World of Warcraft realm, based 
# on input from the battle.net JSON realm status API.
#
# Output:
#
# If the realm is up, the plugin will
# return an OK state with a message containing the status of the realm as well 
# as some extended information such as type (PvP, PvE, etc) and population.  
# If the realm is down, the plugin will return a CRITICAL state with a message
# containing the status of the realm as well as any available extended 
# information such as type (PvP, PvE, etc) and population. If the realm is
# shown as currently having a queue, a WARNING state will be returned.
#
#
# If the requested realm is not found, the plugin will
# return an UNKNOWN state with an appropriate warning message.
#
# If there is an invalid [or no] response from the battle.net server,
# the plugin will return an UNKNOWN state.
#
# $HeadURL: http://svn.jasonantman.com/public-nagios/check_wow.pl $
# $LastChangedRevision: 13 $
#
# Changelog:
# 2012-02-22 Jason Antman <jason@jasonantman.com> (version 1.3):
#     * modified for new BattleNet JSON API
#     * added WARNING output if realm has queue
#
# 2008-07-21 Scott A'Hearn <webmaster@scottahearn.com> (version 1.2):
#     * version on Nagios Exchange
#
 
# use modules
use strict;				# good coding practices
use Getopt::Long;			# command-line option parsing
use LWP;				# external content retrieval
use JSON;                               # JSON for API reply
use lib  "/usr/lib/nagios/plugins";	# nagios plugins
use utils qw(%ERRORS &print_revision &support &usage );	# nagios error and message libraries
use Data::Dumper;                       # debugging
 
# init global vars
use vars qw($PROGNAME);	$PROGNAME="check_wow";
my ($ver_string, $browser, $jsonurl, $raw_json, $opt_V, $opt_h, $opt_r, $decoded) = (undef, undef, undef, undef, undef, undef, undef, undef);
$jsonurl = "http://us.battle.net/api/wow/realm/status?realm=";
$ver_string = "1.3";
 
# init subs
sub print_help ($$);
sub print_usage ($);
 
# define command-line option handling
Getopt::Long::Configure('bundling');
GetOptions(
	"V"   => \$opt_V, "version"	=> \$opt_V,
	"h"   => \$opt_h, "help"	=> \$opt_h,
	"r=s" => \$opt_r, "realm=s"	=> \$opt_r);
 
# show version info, exit
if ($opt_V) {
	print_revision($PROGNAME, $ver_string);
	exit $ERRORS{'OK'};
}
 
# show help, exit
if ($opt_h) {
	print_help($PROGNAME, $ver_string);
	exit $ERRORS{'OK'};
}
 
# get first command-line param
$opt_r = shift unless ($opt_r);
 
# if no command-line param passed, show usage/help, exit
if (! $opt_r) {
	print_usage($PROGNAME);
	exit $ERRORS{'UNKNOWN'};
}
 
# new browser object, with agent
$browser = LWP::UserAgent->new();
$browser->agent("check_wow/$ver_string");
 
# retrieve JSON from WoW site
$jsonurl .= $opt_r;
$raw_json = $browser->request(HTTP::Request->new(GET => $jsonurl));
 
if ($raw_json->is_success) {
	# if success, process
	$raw_json = $raw_json->content;
} else {
	# otherwise, fail UNKNOWN
	print "UNKNOWN - Realm '$opt_r' status not received.\n";
	exit $ERRORS{'UNKNOWN'};
}
 
$decoded = decode_json $raw_json;
 
if($decoded->{realms}[0]->{status} != 1) {
    print "CRITICAL - Realm ".$decoded->{realms}[0]->{name}." Down (".$decoded->{realms}[0]->{type}.", population: ".$decoded->{realms}[0]->{population}.")\n";
    exit $ERRORS{'CRITICAL'};
} elsif($decoded->{realms}[0]->{queue} != 0) {
    print "WARNING - Realm ".$decoded->{realms}[0]->{name}." Has Queue (".$decoded->{realms}[0]->{type}.", population: ".$decoded->{realms}[0]->{population}.")\n";
    exit $ERRORS{'WARNING'};
} else {
    print "OK - Realm ".$decoded->{realms}[0]->{name}." Up (".$decoded->{realms}[0]->{type}.", population: ".$decoded->{realms}[0]->{population}.")\n";
    exit $ERRORS{'OK'};
}
 
# usage function
sub print_usage ($) {
        my ($PROGNAME) = @_;
	print "Usage:\n";
	print "  $PROGNAME [-r | --realm <realm>]\n";
	print "  $PROGNAME [-h | --help]\n";
	print "  $PROGNAME [-V | --version]\n";
}
 
# help function
sub print_help ($$) {
        my ($PROGNAME, $ver_string) = @_;
	print_revision($PROGNAME, $ver_string);
	print "Copyright (c) 2008 Scott A'Hearn, 2012 Jason Antman\n\n";
	print_usage($PROGNAME);
	print "\n";
	print "  <realm> Standard World of Warcraft realm name, case sensitive.\n";
	print "\n";
	# support();
}
 
# end

Nagios Check Plugin for Linode Monthly Bandwidth Usage

Since I have most of my public-facing stuff hosted with Linode, and I have a monthly bandwidth cap (albeit one that I’ll probably never come close to), I decided that it would be a good idea to add my monthly bandwidth usage to my monitoring system. Luckily, Linode offers this (their billing view of it – which is, of course, what I’m concerned about) via their API, and it’s very nicely implemented in Michael Greb’s WebService::Linode Perl (CPAN) module.

Using Michael’s Perl module, I wrote check_linode_transfer.pl (github link) as a Nagios check plugin. It seems to be working fine for me, and runs with the embedded perl interpreter, though it may not be 100% up to par with the Nagios plugin spec (for one, I used utils.pm instead of Nagios::Plugin). About the only thing unusual is that I store my API keys in a perl module, so you’ll need to create something like this in your plugin directory (usually /usr/lib/nagios/plugins):

package api_keys;
 
require Exporter;
@ISA = qw(Exporter);
@EXPORT_OK = qw($API_KEY_LINODE);
 
$API_KEY_LINODE = "yourApiKeyGoesHere";
 
1;

The latest version of the plugin will always be available at https://github.com/jantman/nagios-scripts/blob/master/check_linode_transfer.pl. The current version is also below. It’s free for anyone to use under the terms of GNU GPLv3, though I would really like it if any changes/patches/updates are sent back to me for inclusion in the latest version.

#! /usr/bin/perl -w
 
# check_linode_transfer.pl Copyright (C) 2012 Jason Antman <jason@jasonantman.com>
#
# Define your Linode API key as $API_KEY_LINODE in api_keys.pm in the plugin library directory
#  a sample should be included in this distribution.
#
# This plugin requires WebService::Linode from CPAN, with a patch - add the following to the end of sub _error{} in Linode/Base.pm:
#  $self->{err} = $err; $self->{errstr} = $errstr;
# Also - bug in WebService::Linode::Base docs, example, line 3 should be:
#  my $data = $api->do_request( api_action => 'domains.list' );
# not:
#  my $data = $api->do_request( action => 'domains.list' );
#
##################################################################################
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty
# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# you should have received a copy of the GNU General Public License
# along with this program (or with Nagios);  if not, write to the
# Free Software Foundation, Inc., 59 Temple Place - Suite 330,
# Boston, MA 02111-1307, USA
#
##################################################################################
#
# The latest version of this plugin can always be obtained from:
#  $HeadURL$
#  $LastChangedRevision$
#
 
use strict;
use English;
use Getopt::Long;
use vars qw($PROGNAME $REVISION);
use lib "/usr/lib/nagios/plugins";
use utils qw (%ERRORS &print_revision &support);
use api_keys qw($API_KEY_LINODE);
use WebService::Linode;
use Data::Dumper;
 
sub print_help ();
sub print_usage ();
 
my ($opt_c, $opt_w, $opt_h, $opt_V, $opt_s, $opt_S, $opt_l, $opt_H);
my ($result, $message);
 
$PROGNAME="check_linode_transfer.pl";
$REVISION='1.0';
 
$opt_w = 60;
$opt_c = 80;
 
Getopt::Long::Configure('bundling');
GetOptions(
    "V"   => \$opt_V, "version"	=> \$opt_V,
    "h"   => \$opt_h, "help"	=> \$opt_h,
    "w=f" => \$opt_w, "warning=f" => \$opt_w,
    "c=f" => \$opt_c, "critical=f" => \$opt_c
);
 
if ($opt_V) {
	print_revision($PROGNAME, $REVISION);
	exit $ERRORS{'OK'};
}
 
if ($opt_h) {
	print_help();
	exit $ERRORS{'OK'};
}
 
$result = 'OK';
 
my $api = new WebService::Linode(apikey => $API_KEY_LINODE, nowarn => 1);
my $data = $api->do_request( api_action => 'account.info' );
if(! $data) {
    $result = "UNKNOWN";
    print "LINODE TRANSFER $result: ".$api->{errstr}."\n";
    exit $ERRORS{$result};
}
 
my ($used, $pool, $pct) = ($data->{TRANSFER_USED}, $data->{TRANSFER_POOL}, 0);
 
# guard against division by zero if the API reports an empty pool
$pct = $pool ? ($used / $pool) * 100 : 0;
 
if($pct >= $opt_c){
    $result = "CRITICAL";
}
elsif($pct >= $opt_w){
    $result = "WARNING";
}
 
print "LINODE TRANSFER $result: $pct"."%"." of monthly bandwidth used ($used / $pool GB)|usedBW=$used; totalBW=$pool\n";
exit $ERRORS{$result};
 
sub print_usage () {
	print "Usage:\n";
	print "  $PROGNAME [-w <percent>] [-c <percent>]\n";
	print "  $PROGNAME [-h | --help]\n";
	print "  $PROGNAME [-V | --version]\n";
}
 
sub print_help () {
	print_revision($PROGNAME, $REVISION);
	print "Copyright (c) 2012 Jason Antman\n\n";
	print_usage();
	print "\n";
	print "  <percent>  Percent of network transfer used\n";
	print "\n";
	support();
}

Nagios / Icinga Configuration Highlighting with GeSHi

As you may know from previous posts, this blog (WordPress-powered) and a few MediaWiki sites that I have use the excellent PHP-based GeSHi syntax highlighter. Today I was writing a post that includes some Icinga (Nagios) configuration snippets. After a quick search, I found a Nagios language file for GeSHi on GitHub. Thanks very much to Albéric de Pertat (adepertat) for writing this and providing it to the public.

Sending AOL Instant Messenger (AIM) Messages from a Perl Script

I’ve been doing some work on icinga (a Nagios fork) and wanted to implement notification via AOL Instant Messenger (AIM), since I’m almost always signed on when I’m at a computer. Unfortunately, most of the scripts that I could find use Net::AIM::TOC which implements a now-defunct protocol. So, I found Perl’s Net::OSCAR and James Nonnemaker’s script, and decided to rework them into something a bit more full-featured.

The below script sends a single IM to a single contact via the command line (using a specified AIM username and password). It’s intended to be a Nagios notification script (using the configurations shown below), but could be used for any purpose. The most up-to-date version of the script will be available at: github.com/jantman/public-nagios/master/send_aim.pl

#!/usr/bin/perl
 
#
# Script to send AIM messages from the command line
#
# Copyright 2012 Jason Antman <http://blog.jasonantman.com> <jason@jasonantman.com>
# based on the simple version (C) 2008 James Nonnemaker / james[at]ustelcom[dot]net 
#    found at: <http://moo.net/code/aim.html>
#
# The canonical, up-to-date version of this script can be found at:
#  <http://svn.jasonantman.com/public-nagios/send_aim.pl>
#
# For updates, news, etc., see:
#  <http://blog.jasonantman.com/2012/02/sending-aim-messages-from-a-perl-script/>
#
# $HeadURL$
# $LastChangedRevision$
#
 
use strict;
use warnings;
use Net::OSCAR qw(:standard);
use Getopt::Long;
 
my ($screenname, $passwd, $ToSn, $Msg);
my $VERSION = "r17";
 
my $result = GetOptions ("screenname=s" => \$screenname,
		      "password=s"   => \$passwd,
		      "to=s"         => \$ToSn);
 
if(! $screenname || ! $passwd || ! $ToSn) {
    print "send_aim.pl $VERSION by Jason Antman <jason\@jasonantman.com>\n\n";
    print "USAGE: send_aim.pl --screenname=<sn> --password=<pass> --to=<to_screenname>\n\n";
    # bail out: without credentials and a recipient there is nothing to send
    exit 1;
}
 
# slurp message from STDIN
my $holdTerminator = $/;
undef $/;
$Msg = <STDIN>;
$/ = $holdTerminator;
my @lines = split /$holdTerminator/, $Msg;
$Msg = "init";
$Msg = join $holdTerminator, @lines;
 
my $oscar = Net::OSCAR->new();
$oscar->loglevel(0);
$oscar->signon($screenname, $passwd);
 
$oscar->set_callback_snac_unknown(\&snac_unknown);
$oscar->set_callback_im_ok (\&log_out);
$oscar->set_callback_signon_done (\&do_it);
 
while (1) {
    $oscar->do_one_loop();
}
 
sub do_it {
    $oscar->send_im($ToSn, $Msg);
}
 
sub log_out {
    $oscar->signoff;
    exit;
}
 
sub snac_unknown {
    my($oscar, $connection, $snac, $data) = @_;
    # just use this to override the default snac_unknown handler, which prints a data dump of the packet
}

The command line usage is pretty simple – it takes the message to send on stdin and parameters for the sender’s screen name and password, and the recipient’s screen name, like:

echo -e "Hello\nworld\n" | send_aim.pl --screenname=mySN --password=myPass --to=recipientSN

The Icinga configs that I used for this are as follows. I just used the default Icinga 1.6 notify by email commands, since AIM should handle the full length fine.

# host notification command
define command{
        command_name    notify-host-by-aim
        command_line    /usr/bin/printf "%b" "***** Icinga *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/lib/nagios/plugins/notification/send_aim.pl --screenname=mySN --password=myPass --to=$CONTACTADDRESS1$
}
 
# service notification command
define command{
        command_name    notify-service-by-aim
        command_line    /usr/bin/printf "%b" "***** Icinga *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/lib/nagios/plugins/notification/send_aim.pl --screenname=mySN --password=myPass --to=$CONTACTADDRESS1$
}
 
# example contact
define contact{
        contact_name                    joeadmin
        alias                           Joe Admin
        use                             generic-with-AIM-contact
        email                           joeadmin@example.com
        pager                           5555555555@vtext.com
        address1                        joeAdminSN ; AIM screen name
}

Nagios check scripts

Last week I added some of my Nagios check scripts to my nagios-scripts GitHub repository. Perhaps they’ll be of some use to some other people…

  • check_1wire_temps.php – quick and dirty, built for one specific application, but a good starting place for checking Dallas 1-wire temperatures via OWFS.
  • check_802dot11.php – A script to check various things in the IEEE-802DOT11 MIB, written for Ubiquiti APs (SNMP).
  • check_frogfoot.php – A script to check some stuff from FROGFOOT-MIB, also written for Ubiquiti APs (SNMP).
  • check_asterisk_iaxpeers – a Python check script to parse the output of rasterisk for IAX peer status and latency (includes perf data output).
  • check_bacula_job.php – A script to connect to the Bacula database and make sure a specified job terminated OK and was run on schedule.
  • check_docsis – A script to check status and various metrics for cable modems implementing the DOCSIS MIB (SNMP). Works with (at least) the Motorola SurfBoard modems used by Cablevision (which use 192.168.100.1 on the LAN side).
  • check_syslog_age.php – A PHP script which checks (recursively) that the newest file under a directory is no more than X seconds old. I use this for checking my centralized syslog server, which has logs separated out in /var/log/HOSTS/hostname.

Update 2011-01-31 – the check_syslog_age.php script was updated today to handle an error condition where stat() calls in PHP fail on files larger than 2GB on 32-bit systems.
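
For anyone who would rather have the same idea in Perl, here is a minimal sketch of the recursive newest-file check using the core File::Find module. It is not a port of check_syslog_age.php: the function names, the two-argument command line, and the choice of CRITICAL as the only failure state are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return the modification time of the newest regular file under $dir,
# or undef if no regular file is found.
sub newest_mtime {
    my ($dir) = @_;
    my $newest;
    find(sub {
        return unless -f $_;
        my $mtime = (stat _)[9];    # reuse the stat done by -f
        $newest = $mtime if !defined $newest || $mtime > $newest;
    }, $dir);
    return $newest;
}

# Turn the newest mtime into a Nagios state name.
sub age_state {
    my ($newest, $now, $max_age) = @_;
    return 'UNKNOWN' unless defined $newest;
    return ($now - $newest) > $max_age ? 'CRITICAL' : 'OK';
}

# Usage: check_newest.pl <directory> <max age in seconds>
if (@ARGV == 2) {
    my ($dir, $max_age) = @ARGV;
    my %exit_code = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);
    my $state = age_state(newest_mtime($dir), time(), $max_age);
    print "SYSLOG AGE $state\n";
    exit $exit_code{$state};
}
```

Keeping the state decision in its own function, separate from the directory walk, makes it easy to try out thresholds without touching the filesystem.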

Parsing Nagios status.dat in PHP

If you’re just looking for the script or PHP module, you can get them at: http://github.com/jantman/php-nagios-xml.

A while ago (back in late 2008), I wrote a PHP script that parses the Nagios status.dat file into an associative array. My original use was to output XML which was then read by another script on another server and used for a small custom GUI. It’s a very simple PHP script that just takes the path of the status.dat file (which, obviously, must be readable by the user running the script).

At that time, I was using Nagios v2. Since then, I’ve moved to Nagios v3, and have updated the script to include the ability to parse v3 status.dat files, as well as a function to detect the version of a status file. I also refactored the code so that the parsing functions are all contained in a single file (statusXML.php.inc) which is safe to include in other scripts. The actual statusXML.php file now just includes examples of how to call all of the functions and output XML (though it is equally useful to output the serialized array, or use it directly).
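
The file format itself is simple: a series of named blocks, each holding key=value lines between braces. The sketch below shows that parsing idea in Perl, to match the other scripts on this page. It is not a translation of statusXML.php.inc, and the parse_status name and the sample data are invented. The sample uses the v3 block names (hoststatus, servicestatus); looking at the version field of the info block is one way a version check could work, but that is an assumption, not a description of what the PHP code does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse status.dat-style lines into a list of blocks, each a hash with
# the block type and a hash of its key=value fields.
sub parse_status {
    my (@lines) = @_;
    my (@blocks, $current);
    for my $line (@lines) {
        if ($line =~ /^\s*(\w+)\s*\{\s*$/) {
            $current = { type => $1, fields => {} };   # "hoststatus {"
        }
        elsif ($line =~ /^\s*\}\s*$/) {
            push @blocks, $current if $current;        # closing brace
            undef $current;
        }
        elsif ($current && $line =~ /^\s*([^=\s]+)=(.*?)\s*$/) {
            $current->{fields}{$1} = $2;               # value may contain "="
        }
    }
    return @blocks;
}

# Invented sample data in the Nagios v3 layout.
my @sample = split /^/, <<'END';
info {
    version=3.2.3
    }

hoststatus {
    host_name=web1
    current_state=0
    plugin_output=PING OK - rta=0.5ms
    }

servicestatus {
    host_name=web1
    service_description=HTTP
    current_state=2
    }
END

# Print one line per host or service status block.
for my $block (parse_status(@sample)) {
    next unless $block->{type} =~ /^(?:host|service)status$/;
    my $f = $block->{fields};
    my $service = defined $f->{service_description}
        ? $f->{service_description} : '(host)';
    printf "%s %s state=%s\n", $f->{host_name}, $service, $f->{current_state};
}
```

A real status.dat would be read from disk with open() and passed in line by line; the inline sample just keeps the example self-contained.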

Since I posted my script online, two people have been kind enough to send back their modifications:

Both of these generous contributions have been included in my GitHub repository as of the current commit. Unfortunately, due to my delay in putting my Nagios3 code into svn, both of these contributions are Nagios v2 only.

As time permits, I plan on merging Artur’s changes into the current version of statusXML.php.inc. Unfortunately, C isn’t one of my strong points, but I plan on also updating Whitham’s PHP module code to work with Nagios3 as soon as possible.

Stay tuned for updates, and thanks to both gentlemen for contributing their work. I’m always interested in hearing how people are using my code, and how they are making it better.

Also: While I added this project to Nagios Exchange, and plan on adding it to Monitoring Exchange, I don’t always keep those sites up to date (I can’t access Nagios Exchange right now, and who knows if I’ll have time to update it tomorrow). I strongly recommend directly checking out from Git at https://github.com/jantman/php-nagios-xml.

pnp4nagios, CentOS 5.3 and pcre

I started testing out the pnp4nagios tool to incorporate graphs of performance data into Nagios. Despite what Klein and Sellens suggest (p. 57), I really don’t want separate tools for monitoring and trending. Cacti already handles UPS metrics, switch ports, router traffic, etc. For everything else – system load, etc. – I see no reason to have two checks run rather than just one (Nagios).

There was a CentOS package for the older pnp4nagios 0.4.x, but I opted to build and install the new 0.6.x from source. Unfortunately, I hit one snag – it requires PCRE compiled with support for Unicode properties, and I couldn’t find any package for CentOS compiled with that option. So, with a simple edit of the %configure macro in the SPEC file, I built one. Unfortunately, I wasn’t working in a real build environment – just on one of my web servers – so I only built the .i386 version, but you can feel free to build from the source rpm.