Nagios Check Plugin for Rsnapshot Backups

In a previous post, I described how I do secure rsnapshot backups over the WAN via SSH. While my layout of rsnapshot configuration files, data, and log files is a bit esoteric, I monitor all of it with a Nagios check plugin that runs on my backup host. It assumes that the output of rsnapshot is written to a text log file, one file per host, at a path that matches /path_to_log_directory/log_HOSTNAME_YYYYMMDD-HHMMSS.log, where HOSTNAME is the name of the host and YYYYMMDD-HHMMSS is a datestamp (actually, the script just finds the newest file matching log_HOSTNAME_*.log in that directory). To get correct timing of the runs, which rsnapshot doesn’t offer, it also assumes that you trigger rsnapshot through a wrapper script that runs it once per host, in a loop, with per-host log files and some extra logging information added, like:

for h in <LIST OF HOSTNAMES>
do
    LOGFILE="/mnt/backup/rsnapshot/logs/log_${h}_`date +%Y%m%d-%H%M%S`.log"
    echo "# Starting backup at `date` (`date +%s`)" >> "$LOGFILE"
    /usr/bin/rsnapshot -c /etc/rsnapshot-$h.conf daily &>> "$LOGFILE"
    echo "# Finished backup at `date` (`date +%s`)" >> "$LOGFILE"
done

The check_rsnapshot.pl plugin uses utils.pm from Nagios, as well as Getopt::Long, File::stat, File::Basename, File::Spec and Number::Bytes::Human. This was one of my first Perl plugins, but seems to be rather acceptable. It makes the following checks based on the rsnapshot log:

  1. Backup run in the last X seconds (warning and crit thresholds)
  2. Maximum time from start to finish (warning and crit thresholds)
  3. Minimum size of backup (warning and crit thresholds)
  4. Minimum number of files in backup (warning and crit thresholds)
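
As a rough illustration of the first two checks, here is a minimal Python sketch. It is not the Perl plugin itself. It assumes the log contains the "# Starting backup at ..." and "# Finished backup at ..." marker lines written by the wrapper script above, and it assumes the age of a run is measured from its finish time. The function names and threshold parameters are hypothetical.

```python
import re

# Matches the marker lines written by the wrapper script, e.g.
# "# Starting backup at Tue Feb 21 02:00:01 EST 2012 (1329807601)"
MARKER_RE = re.compile(r"^# (Starting|Finished) backup at .* \((\d+)\)\s*$")

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes


def parse_run_times(log_text):
    """Return (start_epoch, finish_epoch); either may be None if missing."""
    times = {"Starting": None, "Finished": None}
    for line in log_text.splitlines():
        match = MARKER_RE.match(line)
        if match:
            times[match.group(1)] = int(match.group(2))
    return times["Starting"], times["Finished"]


def threshold_state(value, warn, crit):
    """Map a 'lower is better' value onto a Nagios state."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_run(log_text, now, age_warn, age_crit, dur_warn, dur_crit):
    """Check how long ago the backup finished and how long it took."""
    start, finish = parse_run_times(log_text)
    if start is None or finish is None:
        return UNKNOWN  # run still in progress, or log is incomplete
    age_state = threshold_state(now - finish, age_warn, age_crit)
    duration_state = threshold_state(finish - start, dur_warn, dur_crit)
    return max(age_state, duration_state)  # the worst state wins
```

Returning the worst of the individual states is the usual convention for a plugin that combines several thresholds into one result.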

Combined with check_file_age checks on a number of files that are included in the backups and that I know are modified before each backup run, this seems to handle monitoring quite well for me. I certainly preferred running Bacula and using my MySQL-based check_bacula_job.php, but as I’m now backing up 4 machines to my desktop, I no longer have a need for Bacula (or tapes).

The script itself can be found on GitHub.

World of Warcraft Realm Status Check Plugin for Nagios

My wife Jackie (Syrilia) is an avid World of Warcraft player (it’s an MMORPG with over 10 million players). They have weekly server maintenance/update windows every Tuesday morning – total downtime. The length is never really fixed, so I looked around to see if there was a logical way to get notified when the servers came back up.

I managed to find a World of Warcraft Realm status check plugin on Nagios Exchange, but it was written for a now-discontinued API. It was also last modified in 2008, and I can’t seem to get in contact with the author, Scott A’Hearn (webmaster@scottahearn.com): that email returns undeliverable, there’s no email link on the site that his domain now redirects to, and the domain scottahearn.com is a (eek) private registration in WHOIS, so I don’t really have any way of finding contact information. Regardless, I’ve modified the script to use the new Blizzard REST API and it’s now working. Of course, this pulls from Blizzard’s data feed rather than doing any actual monitoring itself, and be warned that they impose query limits (at the moment, their docs say 3,000 requests per day for anonymous access; to be nice to them, I only check on Tuesdays from 3am-4pm, when I’m most concerned about it). The updated source code is shown below, but the most up-to-date version will always live at https://github.com/jantman/nagios-scripts/blob/master/check_wow.pl. If you want, you can also see a diff of my changes to Scott’s original version on GitHub.

#!/usr/bin/perl -w
#
# World of Warcraft Realm detector plugin for Nagios
#
# Written by Scott A'Hearn (webmaster@scottahearn.com), version 1.2, Last Modified: 07-21-2008
#
# Modified by Jason Antman <jason@jasonantman.com> 02-22-2012, to cope with the change from
# the deprecated worldofwarcraft.com XML feed to the BattleNet JSON API.
#
# Usage: ./check_wow -r <realm_name>
#
# Description:
#
# This plugin will check the status of a World of Warcraft realm, based 
# on input from the battle.net JSON realm status API.
#
# Output:
#
# If the realm is up, the plugin will
# return an OK state with a message containing the status of the realm as well 
# as some extended information such as type (PvP, PvE, etc) and population.  
# If the realm is down, the plugin will return a CRITICAL state with a message
# containing the status of the realm as well as any available extended 
# information such as type (PvP, PvE, etc) and population. If the realm is
# shown as currently having a queue, a WARNING state will be returned.
#
#
# If the requested realm is not found, the plugin will
# return an UNKNOWN state with an appropriate warning message.
#
# If there is an invalid [or no] response from the battle.net server,
# the plugin will return an UNKNOWN state.
#
# $HeadURL: http://svn.jasonantman.com/public-nagios/check_wow.pl $
# $LastChangedRevision: 13 $
#
# Changelog:
# 2012-02-22 Jason Antman <jason@jasonantman.com> (version 1.3):
#     * modified for new BattleNet JSON API
#     * added WARNING output if realm has queue
#
# 2008-07-21 Scott A'Hearn <webmaster@scottahearn.com> (version 1.2):
#     * version on Nagios Exchange
#
 
# use modules
use strict;				# good coding practices
use Getopt::Long;			# command-line option parsing
use LWP;				# external content retrieval
use JSON;                               # JSON for API reply
use lib  "/usr/lib/nagios/plugins";	# nagios plugins
use utils qw(%ERRORS &print_revision &support &usage );	# nagios error and message libraries
use Data::Dumper;                       # debugging
 
# init global vars
use vars qw($PROGNAME);	$PROGNAME="check_wow";
my ($ver_string, $browser, $jsonurl, $raw_json, $opt_V, $opt_h, $opt_r, $decoded) = (undef, undef, undef, undef, undef, undef, undef, undef);
$jsonurl = "http://us.battle.net/api/wow/realm/status?realm=";
$ver_string = "1.3";
 
# init subs
sub print_help ($$);
sub print_usage ($);
 
# define command-line option handling
Getopt::Long::Configure('bundling');
GetOptions(
	"V"   => \$opt_V, "version"	=> \$opt_V,
	"h"   => \$opt_h, "help"	=> \$opt_h,
	"r=s" => \$opt_r, "realm=s"	=> \$opt_r);
 
# show version info, exit
if ($opt_V) {
	print_revision($PROGNAME, $ver_string);
	exit $ERRORS{'OK'};
}
 
# show help, exit
if ($opt_h) {
	print_help($PROGNAME, $ver_string);
	exit $ERRORS{'OK'};
}
 
# get first command-line param
$opt_r = shift unless ($opt_r);
 
# if no command-line param passed, show usage/help, exit
if (! $opt_r) {
	print_usage($PROGNAME);
	exit $ERRORS{'UNKNOWN'};
}
 
# new browser object, with agent
$browser = LWP::UserAgent->new();
$browser->agent("check_wow/$ver_string");
 
# retrieve JSON from WoW site
$jsonurl .= $opt_r;
$raw_json = $browser->request(HTTP::Request->new(GET => $jsonurl));
 
if ($raw_json->is_success) {
	# if success, process
	$raw_json = $raw_json->content;
} else {
	# otherwise, fail UNKNOWN
	print "UNKNOWN - Realm '$opt_r' status not received.\n";
	exit $ERRORS{'UNKNOWN'};
}
 
# decode the reply; a malformed body should not crash the plugin
$decoded = eval { decode_json($raw_json) };

# return UNKNOWN if the reply contains no realm data at all
if (ref($decoded) ne 'HASH' || ref($decoded->{realms}) ne 'ARRAY' || ! @{$decoded->{realms}}) {
    print "UNKNOWN - No realm data for '$opt_r' in API reply.\n";
    exit $ERRORS{'UNKNOWN'};
}

my $realm = $decoded->{realms}[0];
my $details = "(".$realm->{type}.", population: ".$realm->{population}.")";

if(! $realm->{status}) {
    print "CRITICAL - Realm ".$realm->{name}." Down $details\n";
    exit $ERRORS{'CRITICAL'};
} elsif($realm->{queue}) {
    print "WARNING - Realm ".$realm->{name}." Has Queue $details\n";
    exit $ERRORS{'WARNING'};
} else {
    print "OK - Realm ".$realm->{name}." Up $details\n";
    exit $ERRORS{'OK'};
}
 
# usage function
sub print_usage ($) {
        my ($PROGNAME) = @_;
	print "Usage:\n";
	print "  $PROGNAME [-r | --realm <realm>]\n";
	print "  $PROGNAME [-h | --help]\n";
	print "  $PROGNAME [-V | --version]\n";
}
 
# help function
sub print_help ($$) {
        my ($PROGNAME, $ver_string) = @_;
	print_revision($PROGNAME, $ver_string);
	print "Copyright (c) 2008 Scott A'Hearn, 2012 Jason Antman\n\n";
	print_usage($PROGNAME);
	print "\n";
	print "  <realm> Standard World of Warcraft realm name, case sensitive.\n";
	print "\n";
	# support();
}
 
# end
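
For readers who don’t speak Perl, the state mapping in the script above can be sketched in Python. This is only an illustration under assumptions: the reply shape (a "realms" list whose entries carry "name", "type", "population", "status" and "queue") is taken from the fields the Perl code reads, the function name is hypothetical, and any sample replies fed to it are made up rather than captured from the API.

```python
import json

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes


def realm_state(raw_json):
    """Return (exit_code, message) for the first realm in an API reply."""
    try:
        realms = json.loads(raw_json).get("realms", [])
    except (ValueError, AttributeError):
        # not JSON at all, or JSON that is not an object
        return UNKNOWN, "UNKNOWN - could not parse realm status reply"
    if not realms:
        return UNKNOWN, "UNKNOWN - no realm data in reply"

    realm = realms[0]
    detail = "(%s, population: %s)" % (realm.get("type"), realm.get("population"))
    if not realm.get("status"):
        return CRITICAL, "CRITICAL - Realm %s Down %s" % (realm.get("name"), detail)
    if realm.get("queue"):
        return WARNING, "WARNING - Realm %s Has Queue %s" % (realm.get("name"), detail)
    return OK, "OK - Realm %s Up %s" % (realm.get("name"), detail)
```

A caller would print the message and exit with the code, which is all Nagios needs from a plugin.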

Nagios Check Plugin for Linode Monthly Bandwidth Usage

Since I have most of my public-facing stuff hosted with Linode, and I have a monthly bandwidth cap (albeit one that I’ll probably never come close to), I decided that it would be a good idea to add my monthly bandwidth usage to my monitoring system. Luckily, Linode offers this (their billing view of it – which is, of course, what I’m concerned about) via their API, and it’s very nicely implemented in Michael Greb’s WebService::Linode Perl (CPAN) module.

Using Michael’s Perl module, I wrote check_linode_transfer.pl (GitHub link) as a Nagios check plugin. It seems to be working fine for me, and runs with the embedded Perl interpreter, though it may not be 100% up to par with the Nagios plugin spec (for one, I used utils.pm instead of Nagios::Plugin). About the only unusual thing is that I store my API keys in a Perl module, so you’ll need to create something like this in your plugin directory (usually /usr/lib/nagios/plugins):

package api_keys;
 
require Exporter;
@ISA = qw(Exporter);
@EXPORT_OK = qw($API_KEY_LINODE);
 
$API_KEY_LINODE = "yourApiKeyGoesHere";
 
1;

The latest version of the plugin will always be available at https://github.com/jantman/nagios-scripts/blob/master/check_linode_transfer.pl. The current version is also below. It’s free for anyone to use under the terms of GNU GPLv3, though I would really like it if any changes/patches/updates are sent back to me for inclusion in the latest version.

#! /usr/bin/perl -w
 
# check_linode_transfer.pl Copyright (C) 2012 Jason Antman <jason@jasonantman.com>
#
# Define your Linode API key as $API_KEY_LINODE in api_keys.pm in the plugin library directory
#  a sample should be included in this distribution.
#
# This plugin requires WebService::Linode from CPAN, with a patch - add the following to the end of sub _error{} in Linode/Base.pm:
#  $self->{err} = $err; $self->{errstr} = $errstr;
# Also - bug in WebService::Linode::Base docs, example, line 3 should be:
#  my $data = $api->do_request( api_action => 'domains.list' );
# not:
#  my $data = $api->do_request( action => 'domains.list' );
#
##################################################################################
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty
# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# you should have received a copy of the GNU General Public License
# along with this program (or with Nagios);  if not, write to the
# Free Software Foundation, Inc., 59 Temple Place - Suite 330,
# Boston, MA 02111-1307, USA
#
##################################################################################
#
# The latest version of this plugin can always be obtained from:
#  $HeadURL$
#  $LastChangedRevision$
#
 
use strict;
use English;
use Getopt::Long;
use vars qw($PROGNAME $REVISION);
use lib "/usr/lib/nagios/plugins";
use utils qw (%ERRORS &print_revision &support);
use api_keys qw($API_KEY_LINODE);
use WebService::Linode;
use Data::Dumper;
 
sub print_help ();
sub print_usage ();
 
my ($opt_c, $opt_w, $opt_h, $opt_V, $opt_s, $opt_S, $opt_l, $opt_H);
my ($result, $message);
 
$PROGNAME="check_linode_transfer.pl";
$REVISION='1.0';
 
$opt_w = 60;
$opt_c = 80;
 
Getopt::Long::Configure('bundling');
GetOptions(
    "V"   => \$opt_V, "version"	=> \$opt_V,
    "h"   => \$opt_h, "help"	=> \$opt_h,
    "w=f" => \$opt_w, "warning=f" => \$opt_w,
    "c=f" => \$opt_c, "critical=f" => \$opt_c
);
 
if ($opt_V) {
	print_revision($PROGNAME, $REVISION);
	exit $ERRORS{'OK'};
}
 
if ($opt_h) {
	print_help();
	exit $ERRORS{'OK'};
}
 
$result = 'OK';
 
my $api = new WebService::Linode(apikey => $API_KEY_LINODE, nowarn => 1);
my $data = $api->do_request( api_action => 'account.info' );
if(! $data) {
    $result = "UNKNOWN";
    print "LINODE TRANSFER $result: ".$api->{errstr}."\n";
    exit $ERRORS{$result};
}
 
my ($used, $pool, $pct) = ($data->{TRANSFER_USED}, $data->{TRANSFER_POOL}, 0);
 
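# guard against a missing or zero pool, which would otherwise divide by zero
if(! $pool) {
    $result = "UNKNOWN";
    print "LINODE TRANSFER $result: transfer pool not reported by API\n";
    exit $ERRORS{$result};
}

$pct = ($used / $pool) * 100;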
 
if($pct >= $opt_c){
    $result = "CRITICAL";
}
elsif($pct >= $opt_w){
    $result = "WARNING";
}
 
print "LINODE TRANSFER $result: $pct"."%"." of monthly bandwidth used ($used / $pool GB)|usedBW=$used; totalBW=$pool\n";
exit $ERRORS{$result};
 
sub print_usage () {
	print "Usage:\n";
	print "  $PROGNAME [-w <percent>] [-c <percent>]\n";
	print "  $PROGNAME [-h | --help]\n";
	print "  $PROGNAME [-V | --version]\n";
}
 
sub print_help () {
	print_revision($PROGNAME, $REVISION);
	print "Copyright (c) 2012 Jason Antman\n\n";
	print_usage();
	print "\n";
	print "  <percent>  Percent of network transfer used\n";
	print "\n";
	support();
}
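
The threshold logic in the plugin above is simple enough to sketch in a few lines of Python. This is an illustrative sketch, not a port: the function name is hypothetical, the defaults mirror the plugin’s 60/80 percent thresholds, and the performance data is emitted as a single percentage in the 'label'=value[UOM];warn;crit;min;max form, which is an assumption about what your grapher would want rather than what the Perl plugin prints.

```python
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def transfer_check(used_gb, pool_gb, warn_pct=60.0, crit_pct=80.0):
    """Return (exit_code, output_line) for monthly transfer usage."""
    if not pool_gb or pool_gb <= 0:
        # avoid dividing by zero when no pool is reported
        return 3, "LINODE TRANSFER UNKNOWN: transfer pool not reported"

    pct = used_gb * 100.0 / pool_gb
    if pct >= crit_pct:
        code = 2
    elif pct >= warn_pct:
        code = 1
    else:
        code = 0

    line = (
        "LINODE TRANSFER {0}: {1:.1f}% of monthly bandwidth used "
        "({2} / {3} GB)|used_pct={1:.1f}%;{4:g};{5:g};0;100"
    ).format(STATES[code], pct, used_gb, pool_gb, warn_pct, crit_pct)
    return code, line
```

Rounding the percentage before printing keeps the status line readable, and putting the thresholds into the performance data lets graphing tools draw the warning and critical lines.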

Nagios / Icinga Configuration Highlighting with GeSHi

As you may know from former posts, this blog (WordPress-powered) and a few MediaWiki sites that I have use the excellent PHP-based GeSHi syntax highlighter. Today I was writing a post that includes some Icinga (Nagios) configuration snippets. After a quick search, I found a Nagios language file for GeSHi on GitHub. Thanks very much to Albéric de Pertat (adepertat) for writing this and providing it to the public.

Sending AOL Instant Messenger (AIM) Messages from a Perl Script

I’ve been doing some work on Icinga (a Nagios fork) and wanted to implement notification via AOL Instant Messenger (AIM), since I’m almost always signed on when I’m at a computer. Unfortunately, most of the scripts that I could find use Net::AIM::TOC, which implements a now-defunct protocol. So, I found Perl’s Net::OSCAR and James Nonnemaker’s script, and decided to rework them into something a bit more full-featured.

The below script sends a single IM to a single contact via the command line (using a specified AIM username and password). It’s intended to be a Nagios notification script (using the configurations shown below), but could be used for any purpose. The most up-to-date version of the script will be available at: github.com/jantman/public-nagios/master/send_aim.pl

#!/usr/bin/perl
 
#
# Script to send AIM messages from the command line
#
# Copyright 2012 Jason Antman <http://blog.jasonantman.com> <jason@jasonantman.com>
# based on the simple version (C) 2008 James Nonnemaker / james[at]ustelcom[dot]net 
#    found at: <http://moo.net/code/aim.html>
#
# The canonical, up-to-date version of this script can be found at:
#  <http://svn.jasonantman.com/public-nagios/send_aim.pl>
#
# For updates, news, etc., see:
#  <http://blog.jasonantman.com/2012/02/sending-aim-messages-from-a-perl-script/>
#
# $HeadURL$
# $LastChangedRevision$
#
 
use strict;
use warnings;
use Net::OSCAR qw(:standard);
use Getopt::Long;
 
my ($screenname, $passwd, $ToSn, $Msg);
my $VERSION = "r17";
 
my $result = GetOptions ("screenname=s" => \$screenname,
		      "password=s"   => \$passwd,
		      "to=s"         => \$ToSn);
 
if(! $screenname || ! $passwd || ! $ToSn) {
    print "send_aim.pl $VERSION by Jason Antman <jason\@jasonantman.com>\n\n";
    print "USAGE: send_aim.pl --screenname=<sn> --password=<pass> --to=<to_screenname>\n\n";
    # without the required options there is nothing to send, so stop here
    exit 1;
}
 
# slurp the whole message from STDIN (undefining $/ locally reads it in one go)
$Msg = do { local $/; <STDIN> };
 
my $oscar = Net::OSCAR->new();
$oscar->loglevel(0);
$oscar->signon($screenname, $passwd);
 
$oscar->set_callback_snac_unknown(\&snac_unknown);
$oscar->set_callback_im_ok (\&log_out);
$oscar->set_callback_signon_done (\&do_it);
 
while (1) {
    $oscar->do_one_loop();
}
 
sub do_it {
    $oscar->send_im($ToSn, $Msg);
}
 
sub log_out {
    $oscar->signoff;
    exit;
}
 
sub snac_unknown {
    my($oscar, $connection, $snac, $data) = @_;
    # just use this to override the default snac_unknown handler, which prints a data dump of the packet
}

The command line usage is pretty simple – it takes the message to send on stdin and parameters for the sender’s screen name and password, and the recipient’s screen name, like:

echo -e "Hello\nworld\n" | send_aim.pl --screenname=mySN --password=myPass --to=recipientSN

The Icinga configs that I used for this are as follows. I just used the default Icinga 1.6 notify by email commands, since AIM should handle the full length fine.

# host notification command
define command{
        command_name    notify-host-by-aim
        command_line    /usr/bin/printf "%b" "***** Icinga *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/lib/nagios/plugins/notification/send_aim.pl --screenname=mySN --password=myPass --to=$CONTACTADDRESS1$
}
 
# service notification command
define command{
        command_name    notify-service-by-aim
        command_line    /usr/bin/printf "%b" "***** Icinga *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/lib/nagios/plugins/notification/send_aim.pl --screenname=mySN --password=myPass --to=$CONTACTADDRESS1$
}
 
# example contact
define contact{
        contact_name                    joeadmin
        alias                           Joe Admin
        use                             generic-with-AIM-contact
        email                           joeadmin@example.com
        pager                           5555555555@vtext.com
        address1                        joeAdminSN ; AIM screen name
}

Nagios check scripts

Last week I added some of my Nagios check scripts to my nagios-scripts GitHub repository. Perhaps they’ll be of some use to some other people…

  • check_1wire_temps.php – quick and dirty, built for one specific application, but a good starting place for checking Dallas 1-wire temperatures via OWFS.
  • check_802dot11.php – A script to check various things in the IEEE-802DOT11 MIB, written for Ubiquiti APs (SNMP).
  • check_frogfoot.php – A script to check some stuff from FROGFOOT-MIB, also written for Ubiquiti APs (SNMP).
  • check_asterisk_iaxpeers – a Python check script to parse the output of rasterisk for IAX peer status and latency (includes perf data output).
  • check_bacula_job.php – A script to connect to the Bacula database and make sure a specified job terminated OK and was run on schedule.
  • check_docsis – A script to check status and various metrics for cable modems implementing the DOCSIS MIB (SNMP). Works with (at least) the Motorola SurfBoard modems used by Cablevision (which use 192.168.100.1 on the LAN side).
  • check_syslog_age.php – A PHP script which checks (recursively) that the newest file under a directory is no more than X seconds old. I use this for checking my centralized syslog server, which has logs separated out in /var/log/HOSTS/hostname.

Update 2011-01-31 – the check_syslog_age.php script was updated today to handle an error condition where stat() calls in PHP fail on files larger than 2GB on 32-bit systems.
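
The idea behind check_syslog_age.php (walk a directory tree, find the newest file, compare its age to a limit) can be sketched in Python as follows. This is an illustration of the technique, not a translation of the PHP script; the function names and the single critical threshold are assumptions.

```python
import os
import time


def newest_mtime(root):
    """Return the newest modification time of any file under root, or None."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def check_age(root, max_age_seconds, now=None):
    """Return (exit_code, message) using standard Nagios exit codes."""
    now = time.time() if now is None else now
    newest = newest_mtime(root)
    if newest is None:
        return 3, "UNKNOWN - no files found under %s" % root
    age = int(now - newest)
    if age > max_age_seconds:
        return 2, "CRITICAL - newest file under %s is %d seconds old" % (root, age)
    return 0, "OK - newest file under %s is %d seconds old" % (root, age)
```

Skipping files whose stat() call fails is a deliberate choice here, so that a single problem file does not turn the whole check into an error.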

Parsing Nagios status.dat in PHP

If you’re just looking for the script or PHP module, you can get them at: http://github.com/jantman/php-nagios-xml.

A while ago (back in late 2008), I wrote a PHP script that parses the Nagios status.dat file into an associative array. My original use was to output XML which was then read by another script on another server and used for a small custom GUI. It’s a very simple PHP script that just takes the path of the status.dat file (which, obviously, must be readable by the user running the script).

At that time, I was using Nagios v2. Since then, I’ve moved to Nagios v3, and have updated the script to include the ability to parse v3 status.dat files, as well as a function to detect the version of a status file. I also refactored the code so that the parsing functions are all contained in a single file (statusXML.php.inc) which is safe to include in other scripts. The actual statusXML.php file now just includes examples of how to call all of the functions and output XML (though it is equally useful to output the serialized array, or use it directly).
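
To give a feel for what the parsing involves, here is a minimal Python sketch of the same idea. It is not the PHP code from the repository. It assumes the usual status.dat layout, where each block opens with a line such as "hoststatus {", contains one key=value pair per line, and closes with a line containing only "}"; the function name is hypothetical.

```python
def parse_status(text):
    """Parse status.dat text into {block_type: [{key: value, ...}, ...]}."""
    blocks = {}
    current = None  # the dict for the block being read, if any
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if current is None:
            if line.endswith("{"):
                block_type = line[:-1].strip()
                current = {}
                blocks.setdefault(block_type, []).append(current)
        elif line == "}":
            current = None
        else:
            # split on the first "=" only, since values may contain "="
            key, _sep, value = line.partition("=")
            current[key] = value
    return blocks
```

The result is a plain nested structure, so it can be serialized to XML or JSON, or used directly, much like the associative array described above.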

Since I posted my script online, two people have been kind enough to send back their modifications:

Both of these generous contributions have been included in my GitHub repository as of the current commit. Unfortunately, due to my delay in putting my Nagios3 code into svn, both of these contributions are Nagios v2 only.

As time permits, I plan on merging Artur’s changes into the current version of statusXML.php.inc. Unfortunately, C isn’t one of my strong points, but I plan on also updating Whitham’s PHP module code to work with Nagios3 as soon as possible.

Stay tuned for updates, and thanks to both gentlemen for contributing their work. I’m always interested in hearing how people are using my code, and how they are making it better.

Also: While I added this project to Nagios Exchange, and plan on adding it to Monitoring Exchange, I don’t always keep those sites up to date (I can’t access Nagios Exchange right now, and who knows if I’ll have time to update it tomorrow). I strongly recommend directly checking out from Git at https://github.com/jantman/php-nagios-xml.

pnp4nagios, CentOS 5.3 and pcre

I started testing out the pnp4nagios tool to incorporate graphs of performance data into Nagios. Despite what Klein and Sellens suggest (p. 57), I really don’t want separate tools for monitoring and trending. Cacti already handles UPS metrics, switch ports, router traffic, etc. For everything else – system load, etc. – I see no reason to have two checks run rather than just one (Nagios).

There was a CentOS package for the older pnp4nagios 0.4.x, but I opted to build and install the new 0.6.x from source. Unfortunately, I hit one snag – it requires PCRE compiled with support for Unicode properties, and I couldn’t find any package for CentOS compiled with that option. So, with a simple edit of the %configure macro in the SPEC file, I built one. Unfortunately, I wasn’t working in a real build environment – just on one of my web servers – so I only built the .i386 version, but you can feel free to build from the source rpm.

Nagios and check plugins run as root

No matter how much we may not like it, and no matter how insecure it can potentially be, we occasionally have to run Nagios check scripts (written in scripting languages) as root. (On a side note, this method is also used for my MultiBindAdmin project’s DNS file push). Here’s how to do it:

  1. Write your check script in the language of your choice and test as root.
  2. Grab setuid-prog.c from GitHub.
  3. Uncomment the #define for FULL_PATH and change the string to the full path to your script.
  4. Be sure your script is owned by root and has permissions of at most 755.
  5. Compile setuid-prog.c:
    gcc -o {check_script_name}-wrapper setuid-prog.c
  6. Put the resulting binary in your plugin directory.
  7. Assuming your checks run as user nagios and group nagios, chown the binary to root:nagios and chmod 4755.

This allows the use of the SUID bit with scripts.

Use at your own risk. I only recommend this on systems where the Nagios account is strongly authenticated, and where ALL users are trusted.

Nagios check_by_ssh and NAT

At a remote location, I have a number of machines to monitor but only one IP (dynamic on a residential connection). Most of my remote monitoring with Nagios uses check_by_ssh. Previously, I’d used one host for Nagios to SSH to, and then chained together another check_by_ssh to reach the remote hosts. Unfortunately, this means nothing past the one first host can get monitored if the first host is down. All of the other hosts (everything is behind NAT) have SSH visible externally on different ports.

SSH itself doesn’t like one IP/hostname with SSH on different ports – host key verification will fail, as the SSH client only looks at the address that it’s connecting to, not the port number. Normally, this is bypassed by using a .ssh/config file like:

Host foo1
        Hostname foo.example.com
        HostKeyAlias foo1
        CheckHostIP no
        Port 22
        User nagios
 
Host foo2
        Hostname foo.example.com
        HostKeyAlias foo2
        CheckHostIP no
        Port 222
        User nagios
 
Host foo3
        Hostname foo.example.com
        HostKeyAlias foo3
        CheckHostIP no
        Port 10022
        User nagios

And then you SSH using the “Host” named in the config file, not the actual hostname.
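
With more than a handful of hosts behind the same address, writing these stanzas by hand gets tedious. A minimal Python sketch that generates them from an alias-to-port mapping might look like this; the function name and its parameters are hypothetical, and the output simply mirrors the layout of the example config above.

```python
def ssh_config_stanzas(real_hostname, alias_ports, user="nagios"):
    """Build one ssh_config 'Host' stanza per alias, sorted by alias name."""
    template = (
        "Host {alias}\n"
        "        Hostname {hostname}\n"
        "        HostKeyAlias {alias}\n"
        "        CheckHostIP no\n"
        "        Port {port}\n"
        "        User {user}\n"
    )
    return "\n".join(
        template.format(alias=alias, hostname=real_hostname, port=port, user=user)
        for alias, port in sorted(alias_ports.items())
    )
```

Calling ssh_config_stanzas("foo.example.com", {"foo1": 22, "foo2": 222, "foo3": 10022}) produces three stanzas equivalent to the ones listed above, ready to be written to the config file.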

Unfortunately, the only way to get check_by_ssh to do this was a bit messy, and required defining a bunch of extra macros for each host:

./check_by_ssh -o Hostname=foo.example.com -o HostKeyAlias=foo2 -o CheckHostIP=no -o Port=222 -o User=nagios -H foo.example.com -C uptime

So, I made a quick little patch for check_by_ssh.c (patched against the released nagios-plugins-1.4.14) :

--- check_by_ssh_ORIG.c 2009-10-22 14:12:15.000000000 -0400
+++ check_by_ssh.c      2009-10-22 14:32:26.000000000 -0400
@@ -181,6 +181,7 @@
                {"skip", optional_argument, 0, 'S'}, /* backwards compatibility */
                {"skip-stdout", optional_argument, 0, 'S'},
                {"skip-stderr", optional_argument, 0, 'E'},
+               {"ssh-config", required_argument, 0, 'F'},
                {"proto1", no_argument, 0, '1'},
                {"proto2", no_argument, 0, '2'},
                {"use-ipv4", no_argument, 0, '4'},
@@ -198,7 +199,7 @@
                        strcpy (argv[c], "-t");
 
        while (1) {
-               c = getopt_long (argc, argv, "Vvh1246fqt:H:O:p:i:u:l:C:S::E::n:s:o:", longopts,
+               c = getopt_long (argc, argv, "Vvh1246fqt:H:O:p:i:u:l:C:S::E::n:s:o:F:", longopts,
                                 &option);
 
                if (c == -1 || c == EOF)
@@ -221,7 +222,7 @@
                                timeout_interval = atoi (optarg);
                        break;
                case 'H':                                                                       /* host */
-                       host_or_die(optarg);
+                 /* host_or_die(optarg); */     /* commented out 2009-10-22 by jantman for ssh config file use */
                        hostname = optarg;
                        break;
                case 'p': /* port number */
@@ -299,6 +300,12 @@
                        else
                                skip_stderr = atoi (optarg);
                        break;
+               /* added 2009-10-22 by jantman for ssh -F option (config file) */
+               case 'F':                                                                       /* ssh config file */
+                       comm_append("-F");
+                       comm_append(optarg);
+                       break;
+               /* END added 2009-10-22 by jantman */
                case 'o':                                                                       /* Extra options for the ssh command */
                        comm_append("-o");
                        comm_append(optarg);
@@ -404,6 +411,8 @@
   printf ("    %s\n", _("Ignore all or (if specified) first n lines on STDERR [optional]"));
   printf (" %s\n", "-f");
   printf ("    %s\n", _("tells ssh to fork rather than create a tty [optional]. This will always return OK if ssh is executed"));
+  printf (" %s\n", "-F");
+  printf ("    %s\n", _("path to ssh config file [optional]"));
   printf (" %s\n","-C, --command='COMMAND STRING'");
   printf ("    %s\n", _("command to execute on the remote machine"));
   printf (" %s\n","-l, --logname=USERNAME");

It works fine. The only problem is that I disabled the check that the given hostname/IP is valid, so instead of getting a nice “Invalid hostname/address – foobar” error, you’ll get the usual “Remote command execution failed: ssh: foobar: Name or service not known” error (though it will still give an exit code of 3). I had to do this because check_by_ssh was checking for a valid hostname itself, though SSH needs to be passed the “Host” alias as defined in the config file.

With the patch, we now have something nice and clean like:

./check_by_ssh -H foo1 -F /home/nagios/.ssh/config -l nagios -i /home/nagios/.ssh/id_dsa -C uptime

Which only adds the “-F” flag to what I was already using, and is safe to use for all hosts.

When I get a chance, I’ll figure out a way to gracefully deal with the host aliases (“fake hostnames”) and submit a patch. Most likely, I’ll add another option so that you have to specify both the actual hostname (so it can check that it exists) and the alias used in the config file (perhaps “-a”?)