Nagios Check Plugin for Rsnapshot Backups

In a previous post, I described how I do Secure rsnapshot backups over the WAN via SSH. While my layout of rsnapshot configuration files, data, and log files is a bit esoteric, I monitor all this with a Nagios check plugin that runs on my backup host. It assumes that the output of rsnapshot is written to a text log file, one file per host, at a path that matches /path_to_log_directory/log_HOSTNAME_YYYYMMDD-HHMMSS.log where HOSTNAME is the name of the host, and YYYYMMDD-HHMMSS is a datestamp (actually, the script just finds the newest file matching log_HOSTNAME_*.log in that directory). In order to obtain correct timing of the runs, which rsnapshot doesn’t offer, it assumes that you trigger rsnapshot through a wrapper script, which runs it once per host (for example, inside a loop) with per-host log files and some logging information added, like:

for h in <LIST OF HOSTNAMES>
do
    LOGFILE="/mnt/backup/rsnapshot/logs/log_${h}_`date +%Y%m%d-%H%M%S`.log"
    echo "# Starting backup at `date` (`date +%s`)" >> "$LOGFILE"
    /usr/bin/rsnapshot -c /etc/rsnapshot-$h.conf daily &>> "$LOGFILE"
    echo "# Finished backup at `date` (`date +%s`)" >> "$LOGFILE"
done

The check_rsnapshot.pl plugin uses utils.pm from Nagios, as well as Getopt::Long, File::stat, File::Basename, File::Spec and Number::Bytes::Human. This was one of my first Perl plugins, but seems to be rather acceptable. It makes the following checks based on the rsnapshot log:

  1. Backup run in the last X seconds (warning and crit thresholds)
  2. Maximum time from start to finish (warning and crit thresholds)
  3. Minimum size of backup (warning and crit thresholds)
  4. Minimum number of files in backup (warning and crit thresholds)
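
As a rough illustration of checks 1 and 2 (this is not the code from check_rsnapshot.pl), the timing can be recovered from the "# Starting" and "# Finished" marker lines that the wrapper loop writes. The function names below are hypothetical, and the sketch assumes the epoch value is the last parenthesised number on each marker line:

```shell
# Sketch only: helper names are hypothetical, not taken from check_rsnapshot.pl.

# Print the epoch from the last "# Starting ..." or "# Finished ..." marker.
# $1 = marker word (Starting or Finished), $2 = log file
marker_epoch() {
    sed -n "s/^# $1 backup at .*(\([0-9][0-9]*\))\$/\1/p" "$2" | tail -n 1
}

# Print the run time in seconds; fail if either marker is missing.
backup_duration() {
    start=$(marker_epoch Starting "$1")
    finish=$(marker_epoch Finished "$1")
    [ -n "$start" ] && [ -n "$finish" ] || return 1
    echo $((finish - start))
}

# Map a "bigger is worse" value to a Nagios exit code
# (0 = OK, 1 = WARNING, 2 = CRITICAL).
# $1 = value, $2 = warning threshold, $3 = critical threshold
threshold_state() {
    if [ "$1" -ge "$3" ]; then
        echo 2
    elif [ "$1" -ge "$2" ]; then
        echo 1
    else
        echo 0
    fi
}
```

The age of the last run can be derived the same way, by subtracting the "Finished" epoch from the output of date +%s and passing the result to threshold_state.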

Together with check_file_age checks on a number of files that are included in the backups and that I know are modified before each backup run, this seems to handle monitoring quite well for me. I certainly preferred running Bacula and using my MySQL-based check_bacula_job.php, but as I’m now backing up 4 machines to my desktop, I no longer have a need for Bacula (or tapes).

The script itself can be found at github.

Script to Chart Intervals Between Problem and Recovery from Nagios/Icinga Log Files

At work, we use Icinga (a fork of Nagios) for monitoring. We have a few services which are restarted or otherwise poked by event handlers, but the recovery takes a while – so we often get paged for problems which recover in a few minutes. I wrote a small perl script that greps through the archived log files for a given regex (service and/or host name) and then calculates the time from problem to recovery and graphs those times.

The script is called nagios_log_problem_interval.pl and can be downloaded from my github. Below is some sample output; the number of minutes from problem to recovery runs along the Y axis and the count runs along the X axis:


> nagios_log_problem_interval.pl --archivedir=/var/icinga/archive --match=myhost --backtrack=10
myhost;HTTP
Count
1:########(8)
2:##(2)
3:#(1)
4:##(2)
5:#######(7)
6:(0)
7:(0)
8:#(1)
9:(0)
10:(0)
11:#(1)
12:(0)
13:#(1)
14:(0)
15:(0)
16-29:(0)
30-59:(0)
60+:(0)
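
For readers who want the gist without the Perl, the core calculation can be sketched with awk. This is only an approximation of the idea, not the logic of nagios_log_problem_interval.pl: it relies on the standard log line layout "[epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;output", treats the first non-OK alert (soft or hard) as the start of a problem, and ignores HOST ALERT lines. The function name is hypothetical.

```shell
# problem_intervals LOGFILE PATTERN
# Prints one line per problem/recovery pair: "host;service minutes"
problem_intervals() {
    awk -v pat="$2" -F';' '
        /SERVICE ALERT:/ && $0 ~ pat {
            # $1 looks like "[1330000000] SERVICE ALERT: myhost"
            ts = substr($1, 2, index($1, "]") - 2)
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)
            key = host ";" $2
            if ($3 != "OK" && !(key in began)) {
                began[key] = ts          # first sign of trouble
            } else if ($3 == "OK" && (key in began)) {
                printf "%s %d\n", key, int((ts - began[key]) / 60)
                delete began[key]        # wait for the next problem
            }
        }' "$1"
}
```

Piping the second column through sort -n | uniq -c gives the per-minute counts that the histogram is drawn from.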

A Collection of Great Links on Monitoring, SysAdmin, Scaling, etc.

I’ve had a bunch of tabs open in my browser for a while – stuff that I read, thought was wonderful, and wanted to comment on. At risk of letting it pile up forever, here’s a collection of links that I thought were really interesting or insightful…

  • MongoDB is Fantastic for Logging – I was looking into some log storage ideas, and came by this post (on the MongoDB blog) about why Mongo is well-suited to storing logs.
  • Sensu – a Ruby-based cloud-oriented monitoring system. It uses AMQP/RabbitMQ to communicate between the clients and server, which is a really big part of what I think monitoring should be.
  • High Scalability – this is one of the few blogs I follow on a regular basis. Some really wonderful stuff, and great food for thought.
  • Everything Sysadmin: Fear of Rebooting – A great article on Tom Limoncelli’s blog about why we fear rebooting machines and why this is bad – moreover, why we should reboot often.
  • The Netflix Tech Blog: Fault Tolerance in a High Volume, Distributed System – This is a really, really cool post from Netflix about how latency increases in a single subsystem can bring down their whole API in seconds, and how they combat this. Really cool stuff.
  • Ars Technica – Exclusive: a behind-the-scenes look at Facebook release engineering – Ars Technica is more or less “mainstream media” to me, but this is a really interesting writeup on Facebook’s release engineering process, albeit at a higher level. Specifically, it talks about their automation, phased rollouts, rollbacks, and how they release the Facebook codebase as a single giant binary, sent out via BitTorrent.
  • Monitoring Sucks blog posts (github) – The “monitoring sucks” movement really speaks to me, having worked extensively with Nagios, Cacti, and similar technologies. Specifically, having rolled out monitoring in a variety of “weird” scenarios (a lot of monitoring devices or whole networks behind NAT, on dynamic IP connections, or otherwise unreachable from a central server), I’ve felt a lot of pain in the current way of doing things. There are a lot of really good thoughts linked here, especially the “wonderland” series by Patrick Debois and the “Latency sucks” series by Lindsay Holmwood. This really got me thinking about my ideal monitoring system, which among other things, would integrate the “alerting” functions of Nagios with graphing/trending and correlation, would be based on some sort of message queue architecture (that supports multiple levels of proxies that could gracefully support NAT and multiple hops), and would be configured almost totally on the originating “client” (unlike the pain of distributed Nagios/Icinga).
  • Mike Brittain – Metrics Driven Engineering at Etsy (3.2MB PDF) – presentation slides. I’d love to see the video. Some really good ideas about putting the science back into being a SysAdmin. Also mentions a few tools I really want to play around with (including ganglia, graphite, logster and StatsD). Also mentions adding PHP memory usage and time to Apache logs, which I can’t believe I never thought of.
  • Some really thoughtful posts from R. I. Pienaar on Thinking about monitoring frameworks and Composable Architectures. Some really good stuff, but what else would you expect from someone like this.

World of Warcraft Realm Status Check Plugin for Nagios

My wife Jackie (Syrilia) is an avid World of Warcraft player (it’s an MMORPG with over 10 million players). They have weekly server maintenance/update windows every Tuesday morning – total downtime. The length is never really fixed, so I looked around to see if there was a logical way to notify when the servers came back up.

I managed to find a World of Warcraft Realm status check plugin on Nagios Exchange, but it was written to a now-discontinued API. It was also last modified in 2008, and I can’t seem to get in contact with the author, Scott A’Hearn (webmaster@scottahearn.com) – that email returns undeliverable, there’s no email link on the site that his domain now redirects to, and the domain scottahearn.com is a (eek) private registration in WHOIS, so I don’t really have any way of finding contact information. Regardless, I’ve modified the script to use the new Blizzard REST API and it’s now working. Of course, this is pulling from Blizzard’s data feed, not doing any actual monitoring itself, and be warned that they impose query limits (at the moment, their docs say 3,000 requests per day for anonymous access; to be nice to them, I only check on Tuesdays from 3am-4pm, when I’m most concerned about it). The updated source code is shown below, but the most up-to-date version will always live at
https://github.com/jantman/nagios-scripts/blob/master/check_wow.pl. If you want, you can also see a diff of my changes to Scott’s original version on github.

#!/usr/bin/perl -w
#
# World of Warcraft Realm detector plugin for Nagios
#
# Written by Scott A'Hearn (webmaster@scottahearn.com), version 1.2, Last Modified: 07-21-2008
#
# Modified by Jason Antman <jason@jasonantman.com> 02-22-2012, to cope with the change from
# the deprecated worldofwarcraft.com XML feed to the BattleNet JSON API.
#
# Usage: ./check_wow -r <realm_name>
#
# Description:
#
# This plugin will check the status of a World of Warcraft realm, based 
# on input from the battle.net JSON realm status API.
#
# Output:
#
# If the realm is up, the plugin will
# return an OK state with a message containing the status of the realm as well 
# as some extended information such as type (PvP, PvE, etc) and population.  
# If the realm is down, the plugin will return a CRITICAL state with a message
# containing the status of the realm as well as any available extended 
# information such as type (PvP, PvE, etc) and population. If the realm is
# shown as currently having a queue, a WARNING state will be returned.
#
#
# If the requested realm is not found, the plugin will
# return an UNKNOWN state with an appropriate warning message.
#
# If there is an invalid [or no] response from the battle.net server,
# the plugin will return an UNKNOWN state.
#
# $HeadURL: http://svn.jasonantman.com/public-nagios/check_wow.pl $
# $LastChangedRevision: 13 $
#
# Changelog:
# 2012-02-22 Jason Antman <jason@jasonantman.com> (version 1.3):
#     * modified for new BattleNet JSON API
#     * added WARNING output if realm has queue
#
# 2008-07-21 Scott A'Hearn <webmaster@scottahearn.com> (version 1.2):
#     * version on Nagios Exchange
#
 
# use modules
use strict;				# good coding practices
use Getopt::Long;			# command-line option parsing
use LWP;				# external content retrieval
use JSON;                               # JSON for API reply
use lib  "/usr/lib/nagios/plugins";	# nagios plugins
use utils qw(%ERRORS &print_revision &support &usage );	# nagios error and message libraries
use Data::Dumper;                       # debugging
 
# init global vars
use vars qw($PROGNAME);	$PROGNAME="check_wow";
my ($ver_string, $browser, $jsonurl, $raw_json, $opt_V, $opt_h, $opt_r, $decoded) = (undef, undef, undef, undef, undef, undef, undef, undef);
$jsonurl = "http://us.battle.net/api/wow/realm/status?realm=";
$ver_string = "1.3";
 
# init subs
sub print_help ($$);
sub print_usage ($);
 
# define command-line option handling
Getopt::Long::Configure('bundling');
GetOptions(
	"V"   => \$opt_V, "version"	=> \$opt_V,
	"h"   => \$opt_h, "help"	=> \$opt_h,
	"r=s" => \$opt_r, "realm=s"	=> \$opt_r);
 
# show version info, exit
if ($opt_V) {
	print_revision($PROGNAME, $ver_string);
	exit $ERRORS{'OK'};
}
 
# show help, exit
if ($opt_h) {
	print_help($PROGNAME, $ver_string);
	exit $ERRORS{'OK'};
}
 
# get first command-line param
$opt_r = shift unless ($opt_r);
 
# if no command-line param passed, show usage/help, exit
if (! $opt_r) {
	print_usage($PROGNAME);
	exit $ERRORS{'UNKNOWN'};
}
 
# new browser object, with agent
$browser = LWP::UserAgent->new();
$browser->agent("check_wow/$ver_string");
 
# retrieve JSON from WoW site
$jsonurl .= $opt_r;
$raw_json = $browser->request(HTTP::Request->new(GET => $jsonurl));
 
if ($raw_json->is_success) {
	# if success, process
	$raw_json = $raw_json->content;
} else {
	# otherwise, fail UNKNOWN
	print "UNKNOWN - Realm '$opt_r' status not received.\n";
	exit $ERRORS{'UNKNOWN'};
}
 
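# decode_json dies on malformed input, so trap that and report UNKNOWN;
# also report UNKNOWN when the reply contains no matching realm
$decoded = eval { decode_json($raw_json) };
if (ref($decoded) ne 'HASH' || ref($decoded->{realms}) ne 'ARRAY' || !@{$decoded->{realms}}) {
    print "UNKNOWN - Realm '$opt_r' not found in API response.\n";
    exit $ERRORS{'UNKNOWN'};
}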
 
if($decoded->{realms}[0]->{status} != 1) {
    print "CRITICAL - Realm ".$decoded->{realms}[0]->{name}." Down (".$decoded->{realms}[0]->{type}.", population: ".$decoded->{realms}[0]->{population}.")\n";
    exit $ERRORS{'CRITICAL'};
} elsif($decoded->{realms}[0]->{queue} != 0) {
    print "WARNING - Realm ".$decoded->{realms}[0]->{name}." Has Queue (".$decoded->{realms}[0]->{type}.", population: ".$decoded->{realms}[0]->{population}.")\n";
    exit $ERRORS{'WARNING'};
} else {
    print "OK - Realm ".$decoded->{realms}[0]->{name}." Up (".$decoded->{realms}[0]->{type}.", population: ".$decoded->{realms}[0]->{population}.")\n";
    exit $ERRORS{'OK'};
}
 
# usage function
sub print_usage ($) {
        my ($PROGNAME) = @_;
	print "Usage:\n";
	print "  $PROGNAME [-r | --realm <realm>]\n";
	print "  $PROGNAME [-h | --help]\n";
	print "  $PROGNAME [-V | --version]\n";
}
 
# help function
sub print_help ($$) {
        my ($PROGNAME, $ver_string) = @_;
	print_revision($PROGNAME, $ver_string);
	print "Copyright (c) 2008 Scott A'Hearn, 2012 Jason Antman\n\n";
	print_usage($PROGNAME);
	print "\n";
	print "  <realm> Standard World of Warcraft realm name, case sensitive.\n";
	print "\n";
	# support();
}
 
# end
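
The query-limit advice above (only checking on Tuesdays between 3am and 4pm) could be enforced in a Nagios time period, or in a small wrapper around the plugin. The following is a minimal sketch of the wrapper approach; the function name is hypothetical, and it assumes the day and hour come from date +%u (1 = Monday) and date +%H:

```shell
# in_maintenance_window DAY_OF_WEEK HOUR
# DAY_OF_WEEK as printed by `date +%u` (1 = Monday ... 7 = Sunday),
# HOUR as printed by `date +%H` (00-23).
# Succeeds only on Tuesdays from 03:00 up to, but not including, 16:00.
in_maintenance_window() {
    hour=${2#0}    # strip one leading zero, so "08" becomes "8"
    [ "$1" -eq 2 ] && [ "$hour" -ge 3 ] && [ "$hour" -lt 16 ]
}
```

A wrapper would then call in_maintenance_window "$(date +%u)" "$(date +%H)" and only run check_wow when it succeeds.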

Nagios Check Plugin for Linode Monthly Bandwidth Usage

Since I have most of my public-facing stuff hosted with Linode, and I have a monthly bandwidth cap (albeit one that I’ll probably never come close to), I decided that it would be a good idea to add my monthly bandwidth usage to my monitoring system. Luckily, Linode offers this (their billing view of it – which is, of course, what I’m concerned about) via their API, and it’s very nicely implemented in Michael Greb’s WebService::Linode Perl (CPAN) module.

Using Michael’s Perl module, I wrote check_linode_transfer.pl (github link) as a Nagios check plugin. It seems to be working fine for me, and runs with the embedded perl interpreter, though it may not be 100% up to par with the Nagios plugin spec (for one, I used utils.pm instead of Nagios::Plugin). About the only thing unusual is that I store my API keys in a perl module, so you’ll need to create something like this in your plugin directory (usually /usr/lib/nagios/plugins):

package api_keys;
 
require Exporter;
@ISA = qw(Exporter);
@EXPORT_OK = qw($API_KEY_LINODE);
 
$API_KEY_LINODE = "yourApiKeyGoesHere";
 
1;

The latest version of the plugin will always be available at https://github.com/jantman/nagios-scripts/blob/master/check_linode_transfer.pl. The current version is also below. It’s free for anyone to use under the terms of GNU GPLv3, though I would really like it if any changes/patches/updates are sent back to me for inclusion in the latest version.

#! /usr/bin/perl -w
 
# check_linode_transfer.pl Copyright (C) 2012 Jason Antman <jason@jasonantman.com>
#
# Define your Linode API key as $API_KEY_LINODE in api_keys.pm in the plugin library directory
#  a sample should be included in this distribution.
#
# This plugin requires WebService::Linode from CPAN, with a patch - add the following to the end of sub _error{} in Linode/Base.pm:
#  $self->{err} = $err; $self->{errstr} = $errstr;
# Also - bug in WebService::Linode::Base docs, example, line 3 should be:
#  my $data = $api->do_request( api_action => 'domains.list' );
# not:
#  my $data = $api->do_request( action => 'domains.list' );
#
##################################################################################
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty
# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# you should have received a copy of the GNU General Public License
# along with this program (or with Nagios);  if not, write to the
# Free Software Foundation, Inc., 59 Temple Place - Suite 330,
# Boston, MA 02111-1307, USA
#
##################################################################################
#
# The latest version of this plugin can always be obtained from:
#  $HeadURL$
#  $LastChangedRevision$
#
 
use strict;
use English;
use Getopt::Long;
use vars qw($PROGNAME $REVISION);
use lib "/usr/lib/nagios/plugins";
use utils qw (%ERRORS &print_revision &support);
use api_keys qw($API_KEY_LINODE);
use WebService::Linode;
use Data::Dumper;
 
sub print_help ();
sub print_usage ();
 
my ($opt_c, $opt_w, $opt_h, $opt_V, $opt_s, $opt_S, $opt_l, $opt_H);
my ($result, $message);
 
$PROGNAME="check_linode_transfer.pl";
$REVISION='1.0';
 
$opt_w = 60;
$opt_c = 80;
 
Getopt::Long::Configure('bundling');
GetOptions(
    "V"   => \$opt_V, "version"	=> \$opt_V,
    "h"   => \$opt_h, "help"	=> \$opt_h,
    "w=f" => \$opt_w, "warning=f" => \$opt_w,
    "c=f" => \$opt_c, "critical=f" => \$opt_c
);
 
if ($opt_V) {
	print_revision($PROGNAME, $REVISION);
	exit $ERRORS{'OK'};
}
 
if ($opt_h) {
	print_help();
	exit $ERRORS{'OK'};
}
 
$result = 'OK';
 
my $api = new WebService::Linode(apikey => $API_KEY_LINODE, nowarn => 1);
my $data = $api->do_request( api_action => 'account.info' );
if(! $data) {
    $result = "UNKNOWN";
    print "LINODE TRANSFER $result: ".$api->{errstr}."\n";
    exit $ERRORS{$result};
}
 
my ($used, $pool, $pct) = ($data->{TRANSFER_USED}, $data->{TRANSFER_POOL}, 0);
 
$pct = ($used / $pool) * 100;
 
if($pct >= $opt_c){
    $result = "CRITICAL";
}
elsif($pct >= $opt_w){
    $result = "WARNING";
}
 
print "LINODE TRANSFER $result: $pct"."%"." of monthly bandwidth used ($used / $pool GB)|usedBW=$used; totalBW=$pool\n";
exit $ERRORS{$result};
 
sub print_usage () {
	print "Usage:\n";
	print "  $PROGNAME [-w <percent>] [-c <percent>]\n";
	print "  $PROGNAME [-h | --help]\n";
	print "  $PROGNAME [-V | --version]\n";
}
 
sub print_help () {
	print_revision($PROGNAME, $REVISION);
	print "Copyright (c) 2012 Jason Antman\n\n";
	print_usage();
	print "\n";
	print "  <percent>  Percent of network transfer used\n";
	print "\n";
	support();
}

pnp4nagios, CentOS 5.3 and pcre

I started testing out the pnp4nagios tool to incorporate graphs of performance data into Nagios. Despite what Klein and Sellens suggest (p. 57), I really don’t want separate tools for monitoring and trending. Cacti already handles UPS metrics, switch ports, router traffic, etc. For everything else – system load, etc. – I see no reason to have two checks run rather than just one (Nagios).

There was a CentOS package for the older pnp4nagios 0.4.x, but I opted to build and install the new 0.6.x from source. Unfortunately, I hit one snag – it requires PCRE compiled with support for Unicode properties, and I couldn’t find any package for CentOS compiled with that option. So, with a simple edit of the %configure macro in the SPEC file, I built one. Unfortunately, I wasn’t working in a real build environment – just on one of my web servers – so I only built the .i386 version, but you can feel free to build from the source rpm.
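
For anyone repeating the rebuild, the SPEC edit amounts to adding PCRE's --enable-unicode-properties option to the %configure line. The helper below is a sketch under that assumption; the function name is hypothetical, and the exact layout of the %configure line in your SPEC file may differ:

```shell
# add_ucp_flag SPECFILE
# Adds --enable-unicode-properties right after %configure,
# unless the flag is already present somewhere in the file.
add_ucp_flag() {
    if grep -q -e '--enable-unicode-properties' "$1"; then
        return 0
    fi
    sed 's/^%configure/%configure --enable-unicode-properties/' "$1" > "$1.new" &&
        mv "$1.new" "$1"
}
```

After installing the source rpm and running this against its pcre SPEC file, rpmbuild -ba on that file produces the packages; pcretest -C on the result should then list Unicode properties support.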

Nagios check_by_ssh and NAT

At a remote location, I have a number of machines to monitor but only one IP (dynamic on a residential connection). Most of my remote monitoring with Nagios uses check_by_ssh. Previously, I’d used one host for Nagios to SSH to, and then chained together another check_by_ssh to reach the remote hosts. Unfortunately, this means nothing past the one first host can get monitored if the first host is down. All of the other hosts (everything is behind NAT) have SSH visible externally on different ports.

SSH itself doesn’t like one IP/hostname with SSH on different ports – host key verification will fail, as the SSH client only looks at the address that it’s connecting to, not the port number. Normally, this is bypassed by using a .ssh/config file like:

Host foo1
        Hostname foo.example.com
        HostKeyAlias foo1
        CheckHostIP no
        Port 22
        User nagios
 
Host foo2
        Hostname foo.example.com
        HostKeyAlias foo2
        CheckHostIP no
        Port 222
        User nagios
 
Host foo3
        Hostname foo.example.com
        HostKeyAlias foo3
        CheckHostIP no
        Port 10022
        User nagios

And then you SSH using the “Host” named in the config file, not the actual hostname.

Unfortunately, the only way to get check_by_ssh to do this was a bit messy, and required defining a bunch of extra macros for each host:

./check_by_ssh -o Hostname=foo.example.com -o HostKeyAlias=foo2 -o CheckHostIP=no -o Port=222 -o User=nagios -H foo.example.com -C uptime

So, I made a quick little patch for check_by_ssh.c (patched against the released nagios-plugins-1.4.14) :

--- check_by_ssh_ORIG.c 2009-10-22 14:12:15.000000000 -0400
+++ check_by_ssh.c      2009-10-22 14:32:26.000000000 -0400
@@ -181,6 +181,7 @@
                 {"skip", optional_argument, 0, 'S'}, /* backwards compatibility */
                 {"skip-stdout", optional_argument, 0, 'S'},
                 {"skip-stderr", optional_argument, 0, 'E'},
+               {"ssh-config", optional_argument, 0, 'F'},
                 {"proto1", no_argument, 0, '1'},
                 {"proto2", no_argument, 0, '2'},
                 {"use-ipv4", no_argument, 0, '4'},
@@ -198,7 +199,7 @@
                         strcpy (argv[c], "-t");
 
         while (1) {
-               c = getopt_long (argc, argv, "Vvh1246fqt:H:O:p:i:u:l:C:S::E::n:s:o:", longopts,
+               c = getopt_long (argc, argv, "Vvh1246fqt:H:O:p:i:u:l:C:S::E::n:s:o:F:", longopts,
                                  &option);
 
                 if (c == -1 || c == EOF)
@@ -221,7 +222,7 @@
                                 timeout_interval = atoi (optarg);
                         break;
                 case 'H':                                                                       /* host */
-                       host_or_die(optarg);
+                 /* host_or_die(optarg); */     /* commented out 2009-10-22 by jantman for ssh config file use */
                         hostname = optarg;
                         break;
                 case 'p': /* port number */
@@ -299,6 +300,12 @@
                         else
                                 skip_stderr = atoi (optarg);
                         break;
+               /* added 2009-10-22 by jantman for ssh -F option (config file) */
+               case 'F':                                                                       /* ssh config file */
+                       comm_append("-F");
+                       comm_append(optarg);
+                       break;
+               /* END added 2009-10-22 by jantman */
                 case 'o':                                                                       /* Extra options for the ssh command */
                         comm_append("-o");
                         comm_append(optarg);
@@ -404,6 +411,8 @@
    printf ("    %s\n", _("Ignore all or (if specified) first n lines on STDERR [optional]"));
    printf (" %s\n", "-f");
    printf ("    %s\n", _("tells ssh to fork rather than create a tty [optional]. This will always return OK if ssh is executed"));
+  printf (" %s\n", "-F");
+  printf ("    %s\n", _("path to ssh config file [optional]"));
    printf (" %s\n","-C, --command='COMMAND STRING'");
    printf ("    %s\n", _("command to execute on the remote machine"));
    printf (" %s\n","-l, --logname=USERNAME");
It works fine. The only problem is that I disabled the check that the given hostname/IP is valid, so instead of getting a nice “Invalid hostname/address – foobar” error, you’ll get the usual “Remote command execution failed: ssh: foobar: Name or service not known” error (though it will still give an exit code of 3). I had to do this because check_by_ssh was checking for a valid hostname itself, though SSH needs to be passed the “Host” alias as defined in the config file.

With the patch, we now have something nice and clean like:

./check_by_ssh -H foo1 -F /home/nagios/.ssh/config -l nagios -i /home/nagios/.ssh/id_dsa -C uptime

Which only adds the “-F” flag to what I was already using, and is safe to use for all hosts.

When I get a chance, I’ll figure out a way to gracefully deal with the host aliases (“fake hostnames”) and submit a patch. Most likely, I’ll add another option so that you have to specify both the actual hostname (so it can check that it exists) and the alias used in the config file (perhaps “-a”?)

Cable Management, Power Measurements, Major Outage, Cacti

So, once again, still really busy. But a few new things.

First, my racks both at home and at the apartment are atrocious. They have no cable management at all. Both started with 1-3 machines, and no real plans for upgrades (since they’re just my personal/development machines). Unfortunately, the “rack” (a metal workshop shelving unit) at home now has 8 machines and a host of ancillary equipment. The one at the apartment – an actual 42U rack – has 5 plus a few switches, rackmount KMM, etc. They’re both a jumble of wires in the back. Unfortunately, it seems like cable management hardware is *expensive*. $30 for a 2U metal blank with a few plastic split D-rings, or almost $40 for a 2-meter vertical hunk of plastic channel with slits in the sides? So, I’ve been vaguely considering what it will take to fabricate some cable management hardware of my own. Probably just building something out of rack blanks for the horizontal off of the switches, and buying some sort of vertical channel for power and networking/KVM. Man, those KVM cables sure do take up a lot of space. Also at the moment, at home my power is all coming directly out of two UPSs, whereas at the apartment it’s straight from mains off of a surge suppressor. I’d like to buy another UPS for the apartment from RefurbUPS.com, where I got the ones at home, and also add a PDU at home and a vertical power strip at the apartment.

Also, at the apartment, the roommates and I have had some discussion lately about how much power the machines draw. This mainly stemmed from our plans to move this June, into a rented house with two more people. This seems to be falling through, so I don’t have to worry about moving and re-cabling everything, but I’m still interested in finding out how much power is being drawn. Granted, my UPSs at home give me a more-or-less good idea of power consumption, but I’d like to know in detail. The ideal solution would be a clamp ammeter around the mains line to the equipment – one with a serial interface. Unfortunately, I can’t seem to find such a thing, short of a digital multimeter left on all the time. So, I guess I’ll be looking around, and if I can’t find anything specific, maybe I’ll work on a microcontroller that can read 1-200mV in 1mV increments, and use it with an inductive clamp ammeter (usual output for them is 1mV per A).
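
To make the arithmetic concrete: at 1mV per A, a clamp reading in millivolts is numerically the current in amps, and multiplying by the line voltage gives apparent power in VA (real watts will be lower, depending on power factor). A small sketch, with a hypothetical function name and the line voltage supplied by the caller:

```shell
# clamp_va MILLIVOLTS LINE_VOLTS [MV_PER_AMP]
# Converts a clamp ammeter reading to apparent power (VA).
# MV_PER_AMP defaults to 1, the usual clamp output mentioned above.
clamp_va() {
    awk -v mv="$1" -v volts="$2" -v scale="${3:-1}" \
        'BEGIN { printf "%.0f\n", (mv / scale) * volts }'
}
```

For example, clamp_va 3 120 prints 360: a 3mV reading means 3A, which at 120V is 360VA.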

So, on Monday I got into work and couldn’t access my mailserver. Weird. I never even got any Nagios alerts. I checked Nagios and… nothing. As in no connection. I SSH’d home and pinged both boxes, but nothing. The switch showed the mail server totally offline, and the Nagios box connected but with ZERO data out. I reset the counters and waited. Still nothing. After an hour or so of poking around, I determined that both devices were on the same 6-port group on the switch, and nothing else there was up either. So, after five long hours, I got someone back home to switch the cables. Still nothing. On a hunch, I asked her to check the mail server (the “new” Sun Blade 150) and, sure enough, it wasn’t powered on. A click of the power button, and the mail server was back online. Along with an ominous last email from Nagios, stating that the UPS running my switch lost power, and 6 minutes later, was going down hard. Then quiet.

I don’t usually have power outages. So I’ll admit, when I added some of the new machines, I committed a high sin – I “never got around” to setting up everything power-wise. I also have the switch running off of an old BackUPS 500VA unit, USB, without automatic self-tests. As a result of all this:

  1. The little UPS powering the switch only held out for 6-7 minutes. As a result, once that died, the bigger units didn’t even matter, as all hope was lost. This needs to be on a bigger UPS – maybe one of the 1000VA’s until it gets its own.
  2. APCupsd requires a network to initiate shutdown, so the rest of the machines came down hard (as confirmed by looking through log files).
  3. The SunBlade was never set up to power on after power interruption, so it just sat there like a brick.

Most disturbingly, while my Nagios/monitoring box is up (according to the switch, power draw figures from the UPS, and the lights, as confirmed by someone on-site), it’s dead. No ping, nothing out. I’ll have to look into it, but it made me realize that this really is my only way of analyzing problems. That needs to stop.

Maybe one day I’ll have the money for a nice SmartUPS RT or even a Symmetra – though getting 208V into my basement is even more of a dream than spending $4000 on a UPS.

Also, I decided (after all this) to set up graphing of UPS data (load, voltage in and out, temp, capacity, run time, etc.). While I haven’t gotten around to setting up Zenoss yet, I did a quick (well, 4 hours later I’m done configuring it) Cacti installation on my web server (I should already have it running on the monitoring box, but who knows what that will look like when I get home). I also dropped a Cacti host template in CVS for the AP9605 PowerNet SNMP card in my UPSs.

Update, Eventum/MySQLTicketing Integration

Well I know I haven’t updated in a while. I have a whole bunch of links that I’d like to comment on, but things have been horribly busy. You can find the links in my “1-toblog” folder on del.icio.us (prefixed with “1-” so it shows up at the beginning of my bookmark menu).

In monitoring land, I’ve paused my Hyperic HQ VM as I wasn’t too pleased with how the features panned out. I was invited to beta test Groundwork Open Source 5.2b, but I’m not crazy about the open-ness of a non-public Beta, and am honestly not that intrigued by the small feature set (though, admittedly, they do need more documentation on the F/OSS version). I’d still like to try them all, especially Zenoss Core, but I’m pretty busy with class, and things are heating up at work and with a few consulting projects.

In my “spare time” (read: staying up until 5 AM and somehow still getting up for work at 9) I’ve been working on something that’s been bugging me for a while – getting Nagios to automatically open and update tickets in Eventum, the ticketing system that I (and MySQL) use. The general idea is to use a “glue” script, written in PHP (Eventum’s native language). It will (hopefully) keep track of which hosts/services it has opened tickets for (and what the ticket ID is), and decide from that whether to open a new ticket or, if one already exists for that host/service, update it. It should also handle changes to assigned user/group, update categories, priorities, etc. This will all be based on a DB table that maps problem severities and hosts/services to the users, groups, categories, and priorities that they should be assigned.
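
The open-or-update decision is simple enough to sketch outside of PHP. The version below is only an illustration of the bookkeeping, not Eventum code: it stands in a tab-separated state file for the DB table, and both function names and the file layout ("host;service<TAB>ticket_id") are hypothetical.

```shell
# ticket_action STATEFILE HOST SERVICE
# Prints "update <ticket_id>" if a ticket is already tracked for this
# host/service pair, otherwise prints "open".
ticket_action() {
    id=$(awk -F'\t' -v key="$2;$3" '$1 == key { print $2; exit }' "$1")
    if [ -n "$id" ]; then
        echo "update $id"
    else
        echo "open"
    fi
}

# record_ticket STATEFILE HOST SERVICE TICKET_ID
# Remembers which ticket was opened for a host/service pair.
record_ticket() {
    printf '%s;%s\t%s\n' "$2" "$3" "$4" >> "$1"
}
```

The glue script would call ticket_action for each notification, create or update the issue accordingly, and call record_ticket after opening a new one.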

The biggest problem is that I’m not a whiz at object-oriented PHP, and like any good OO program, Eventum is broken down into dozens of objects, classes, and files. With the help of the Xdebug debugging extension for PHP, which prints full debugging output including stack and function call traces, I’ve been able to *finally* – after about four hours of work – write a simple little 15-line script that uses ONLY existing Eventum classes, unmodified (except for a separate init.php with some stuff commented out), which gets a list of users assigned to an issue. From here, it shouldn’t be difficult to get full issue information and then, hopefully, add and update issues.

I have a basic description of the project on my wiki, and the current (development, so could be broken) source code in CVS, which can be seen through ViewVC on my site.

Stay Tuned!

F/OSS Monitoring Comparison – Hyperic Part I

So, I’ve made some headway on the comparison. I have Hyperic installed and partly configured, albeit without email alerts yet. I’ve found some serious features that I need missing, but I’m going to give it a full run before I move on to another.

The full text, updated a few times a day, is available on my wiki. Here’s a bit of an excerpt:

Part I – Installation

  1. Set up a Xen virtual machine running OpenSuSE 10.3 base packages. (3 hours, some server problems, some Xen problems, and some time learning Xen administration from the CLI)
  2. Download hyperic-hq-installer-3.2.0-607-x86-linux.tgz from Hyperic and extract.
  3. Browse to http://support.hyperic.com/confluence/display/DOC/Full+Installation+Guide
  4. cd into hyperic-hq-installer and run ./setup.sh -full
    1. The installation can’t be run as root (though I assumed it would need root privileges).
    2. I selected to install all 3 components – Server, Shell, and Agent.
    3. Well, whoops! Sorta stupid to not allow installation as root, when the default location to install to is /home/hyperic. How do they expect an arbitrary user to install there? Even worse, it appears that the default OpenSuSE 10.3 installation doesn’t come with sudo (!!!!) so I can’t try that.
    4. As root, create /home/hyperic and chown to my user.
    5. Repeat the above steps (well, hopefully not all of them).
    6. Default ports for everything – web GUI on 7080, HTTPS web GUI on 7443, jnp service on 2099, mbean server on 9093.
    7. Change domain names in default URLs to logical ones for my test environment (no real DNS, just IPcop hosts, so devel-hyperic1.localdomian). I hope that I can change these later, or even better that absolute paths aren’t used too much, as this will screw with my idea of using SSH port forwarding for remote access.
    8. Leave the default SMTP server alone and change it later – I don’t even have mail running here at the apartment.
    9. Use the built-in PostgreSQL database with default port of 9432.
    10. Go with the defaults for everything after this.
    11. Everything runs nicely, and then it tells you to login to another terminal as root and run a script. I’m not sure I like this method, but I guess it works. Login and do it.
    12. How will it start the builtin database? As my user???? Yup. postgres is running as my user. Wonderful. Nothing in the install document mentioned user creation. Was this just assumed? Because in the naive world I live in, most installer scripts (think Nagios) create a user for you, or tell you to.
    13. Setup script complete. A few instructions follow…
  5. Run /home/hyperic/server-3.2.0/bin/hq-server.sh start… as my user. Note to self: set up a user for Postgres and Hyperic. Believe it or not, it booted – but followed with the message, “Login to HQ at: http://127.0.0.1:7080/”
  6. Browsed to http://devel-hyperic1:7080 and was greeted by a startup page, saying that the server was 18% finished booting. My, I yearn for little C binaries and a PHP frontend.
  7. Page turns blank and stops there. I refresh, and get a login page. I enter my username and password, and get a little message box where the “invalid password” box usually is – says “Server is still booting”. This is over a minute later. I’m happy to see Apache/Coyote1.1, but would like to be able to get into Hyperic in less time than it takes the machine to boot to a graphical login screen (ok, granted, I’m running XFCE). In SuSE’s YaST Xen Monitor, I see that the VM is at 45% of its 464MB RAM, and 90% CPU – with 8.5% consumed by dom0.
  8. CPU usage for the VM drops to 1% and I login again. BAM! Hyperic HQ. Aside from the fact that it shows NO resources… oh… start the Agent.
  9. Start the Agent on the VM running Hyperic. It asks me for the server IP address. What, no DNS? I enter the IP as it is… for now. I keep everything at defaults, including using the hqadmin username and password. Successfully started.
  10. BAM! In Dashboard, I see the auto-discovered host with the right hostname, as well as Tomcat, Agent, JBoss, and PostgreSQL. Amazing! Click “Add to Inventory”.
  11. Check out the “Resources” -> “Browse” screen. It knows this machine is OpenSuSE 10.3, and I see my four services (listed above). Of course, no metrics yet, but I see the correct IP, gateway, DNS, vendor (SuSE), kernel version, RAM, architecture, and CPU speed.
  12. Looking through the “Inventory” screen, I see everything – NICs and MACs, running servers and one service (a CPU resource). What more could a man want in…let’s see.. just over an hour!
  13. I really *love* the “Views” screen which, even out-of-the-box, allows “Live Exec” information from cpuinfo, df, ifconfig, netstat, top, who, and more.
  14. Well, it’s 03:35, and I have work and class tomorrow. I think it’s time to give Part I a rest. But first…
  15. Go to the “Platform” page for my one machine and… YES! Graphs are starting to appear!
  16. Following the suggestion here, I enable log and config tracking on the platform for /var/log/warn and /etc/hosts, respectively.
  17. Before I call it a night (now 03:42), I stop back at the downloads page and grab the Linux x86 Agent for the dom0 machine, hoping to get some physical information as well. While I’m at it, I grab the Linux AMD64 Agent to try on my laptop. I create “hyperic” users on each system. On the base Xen server, I give it a shot and get “Unable to register agent: Error communicating with agent: Unauthorized”. Same thing on the laptop.
  18. Did a little reading here. As for keeping all of the defaults, it turns out that both clients had firewalls blocking TCP port 2144. I opened it up on both, and also set the IP address (that the server uses to contact the client) to the correct ones. Voilà! Now I have 3 clients connected, and gathering data for the next ~16 hours until I have time to check it out again.
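
The firewall problem in the last step is the kind of thing a quick TCP probe would have caught before the "Unauthorized" error. Below is a minimal sketch, assuming plain Python with only the standard library; the function name is made up for illustration, and the host and port values in the note after it are just the ones mentioned above (7080 and 7443 for the server GUI, 2144 for the agent).

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running something like port_open("devel-hyperic1", 7080) from a client, and port_open(client_ip, 2144) from the server, would show which side a firewall is blocking. A filtered port typically shows up as a timeout rather than an immediate refusal, so the call can take the full timeout to return.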

More to come in Part II tomorrow – actually doing something with Hyperic. For
now (04:08), time to sleep.


Part II – Configuration

Unfortunately, I haven’t had much time to play with Hyperic in the two days
since installation. The most I’ve really done is set up Agents on my laptop,
desktop, and the host machine (both dom0 and domU for Hyperic), so that they
start to collect data.

While I found a lot of upsetting stuff in the features list (see below), I
decided to go ahead and add some other devices. On the network at the
apartment, I have two manageable switches (a Linksys and a 3Com) – which pretty
much make up the sum of non-host equipment. I also have an IPcop box, though I
assume the standard Linux Agent will handle that. The one item missing that I
have at home is my set of APC SmartUPS UPSs with SNMP cards, but I guess I’ll
just have to skip them for this review.

First, I went in and added a platform (Resources->Browse, Tools Menu->Add
Platform) for the 3Com switch (a SuperStack II Switch 3300). It showed
successful creation – but nothing else. I went in and entered the SNMP
community string, IP, and version (1). In about a minute or so, I started to
see metrics – Availability, IP Forwards, IP In Receives, and IP In Received per
Second. While it’s quite basic, that’s good for a starting point. The Network
Device Platform documentation
(http://support.hyperic.com/confluence/display/DOCSHQ30/Network+Device+platform)
lists lots of metrics that can be enabled, but I’d also like telnet
availability and – my big one, since I use a “cute” (crappy) IPcop
installation for local DNS – a dig on DNS to make sure the entry is there. In
the Monitor screen, I was able to enable a bunch of additional metrics (by
clicking on the “Show All Metrics” link), though there’s also no way (that I
can find) to monitor the status of individual ports.
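
The DNS check I have in mind is simple enough to sketch. This is not a Hyperic plugin and not a real dig: it is a minimal Python stand-in that asks the system resolver (rather than querying a specific DNS server, as dig would) whether a name resolves, and optionally whether it resolves to the address I expect. The function name and the hostnames in the note are hypothetical.

```python
import socket
from typing import Callable, Optional


def dns_entry_ok(
    hostname: str,
    expected_ip: Optional[str] = None,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname resolves (and matches expected_ip, if given)."""
    try:
        ip = resolver(hostname)
    except OSError:
        # socket.gaierror (name not found) is a subclass of OSError.
        return False
    return expected_ip is None or ip == expected_ip
```

A call such as dns_entry_ok("switch1.localdomain", "192.168.0.2") would fail either when the entry has vanished from the IPcop hosts list or when it points somewhere unexpected. The resolver argument only exists so the check can be exercised without a live DNS server.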

Next, I browsed through the “Administration” pages, set up a few users, and
started setting *way* more default metrics for various platforms, services,
and servers. While I don’t have mail running yet, that will come this
weekend. While I added a lot of things as “Default On”, I still need to go
back and add more things in the templates as Indicators.

I also added some escalations, though they’re quite simple – you can notify HQ
users or “other users” by email or SMS, write to SysLog, or suppress alerts
for 0 minutes to 24 hours. Hopefully I’ll also find a plugin for Asterisk
integration. One striking omission is user groups. Also, the concept of
“Roles” (maybe their idea of groups?) is only available in the Enterprise
version.

At this point, I also notice one other major issue, though perhaps I’ll find a
solution in my experimentation – there doesn’t seem to be a way to set up default
alerts for metrics. If they have all of this platform, server, and service
information defined as default templates, why not just have a way to assign
default users (and groups) to these objects, and have default alerts
generated?

In terms of Apache 2.2 monitoring, out-of-the-box, nothing worked. No metrics
at all. Firstly, Hyperic requires the mod_status module. Personally, I’d
rather handle all of that through a backend, like Nagios. Secondly, it got the
pidfile and apache2ctl paths wrong. Furthermore, it has no “smart” checking for resources – while my Apache 2.2 resource config was clearly wrong (wrong PID file path, no mod_status), Hyperic didn’t flag the misconfiguration and simply showed the resource as “Down”.
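
For context on what mod_status actually provides: when the module is enabled and its handler is mapped (conventionally at /server-status), appending ?auto to the URL returns a plain-text "Key: value" listing that is easy to parse. The sketch below, in Python, parses that text into a dict. It only illustrates the format, it is not how Hyperic's plugin works, and the function name is made up.

```python
from typing import Dict, Union


def parse_server_status(text: str) -> Dict[str, Union[float, str]]:
    """Parse mod_status '?auto' output ('Key: value' per line) into a dict."""
    metrics: Dict[str, Union[float, str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or anything without a colon
        key, value = key.strip(), value.strip()
        try:
            metrics[key] = float(value)  # numeric fields such as BusyWorkers
        except ValueError:
            metrics[key] = value  # non-numeric fields such as Scoreboard
    return metrics
```

The text could be fetched with urllib.request.urlopen from whatever URL the status handler is mapped to on a given server. A backend check would then alert on fields like BusyWorkers or IdleWorkers, and could report a missing status page separately from a dead Apache.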

After that, I set up a bunch of alerts for things that I thought would be off-kilter a lot (like WARN log entries on my laptop, high memory usage on some stressed machines, etc.) as well as log and config file monitoring and alerts for them. While I didn’t have mail working yet, I figured I might as well get that stuff running.

On the Xen dom0 host that runs the Hyperic VM (box called xenmaster1), I wasn’t able to add config file tracking for any of the /etc/xen/ files. At this point I notice some serious shortcomings – not only is it not possible to define a template of alerts for a given platform/server/service, it’s also impossible to define a template for alerts. I also noticed that it’s not possible to define groups of contacts. This wasn’t much of a problem for my test installation – the alerts are only going to my roommate and me – but it would surely be an issue in any larger setting.

At this point in configuration, I come to a make-or-break point. With some of these shortcomings, I really need a way to call a script with alert information when an alert is generated – whether it’s to dial out through Asterisk or just automatically create a ticket for the problem.

Adding alerts is a cumbersome process. You have to browse to a page for a specific metric – which means going to the page for a specific platform, server, or service – and then opening the page for that metric. The actual alert creation takes up two pages – one for the metric, threshold, and time-based criteria, and a second for who to alert. This means that to add alerts for a machine, you need to view the platform page as well as the services and servers pages, and each metric therein.

I’ll be posting some more in the days to come. From a post at the Hyperic Forums, I was able to find out that a Xen plugin is in the works, but for the Open Source version, the only way to trigger a script is to send an email and have it handled by a filter such as Procmail.
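
The email-to-script route is at least workable. As a rough sketch of the receiving end, here is a small Python function that a mail filter could feed a raw message to, pulling out the pieces a ticketing or dial-out script would need. The layout of Hyperic's alert emails is an assumption here (I haven't seen one yet, since mail isn't running), so this only extracts generic headers and the plain-text body; the function name is hypothetical.

```python
import email
from typing import Dict


def summarize_alert(raw_message: str) -> Dict[str, str]:
    """Extract sender, subject, and plain-text body from a raw email message."""
    msg = email.message_from_string(raw_message)
    body = ""
    # walk() visits the message itself and, for multipart mail, every sub-part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            body = payload.decode("utf-8", errors="replace").strip()
            break
    return {
        "sender": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "body": body,
    }
```

With Procmail, a recipe matching the alert sender or subject would pipe the message to a script that calls summarize_alert(sys.stdin.read()) and hands the result to the ticket or Asterisk logic. Parsing host and metric names out of the subject would depend on the real alert format.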