Python script to check a list of URLs for return code, and final return code if redirected

Every once in a while I need to add a bunch of redirects in Apache. Here’s a handy, dead simple Python script which takes a list of URLs on STDIN, and for each one prints out either the response code, or, if the response is a redirect, the response code of what is redirected to. Pretty useful when you’ve just added a bunch of redirects and want to make sure none of them 404.

The latest source of this script lives at https://github.com/jantman/misc-scripts/blob/master/check_url_list.py.

#!/usr/bin/env python
"""
Script to check a list of URLs (passed on stdin) for response code, and for response code of the final path in a series of redirects.
Outputs (to stdout) one tab-separated line per URL whose final response code is not 200: the number of times the URL appeared in the input, the response code, and the URL.
 
Progress for every URL checked is reported on STDERR.
 
Copyright 2013 Jason Antman  all rights reserved
This script is distributed under the terms of the GPLv3, as per the
LICENSE file in this repository.
 
The canonical version of this script can be found at:
 
"""
 
import sys
import urllib2
 
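def get_url_nofollow(url):
    # NOTE: despite the name, urllib2.urlopen() follows redirects, so the
    # code returned is that of the final URL in the redirect chain.
    try:
        response = urllib2.urlopen(url)
        code = response.getcode()
        return code
    except urllib2.HTTPError as e:
        # 4xx/5xx responses are raised as HTTPError; report their status code
        return e.code
    except Exception:
        # anything else (DNS failure, refused connection, malformed URL) is reported as 0
        return 0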
 
def main():
    urls = {}
 
    for line in sys.stdin.readlines():
        line = line.strip()
        if not line:
            # skip blank lines instead of reporting them as failed URLs
            continue
        if line not in urls:
            sys.stderr.write("+ checking URL: %s\n" % line)
            urls[line] = {'code': get_url_nofollow(line), 'count': 1}
            sys.stderr.write("++ %s\n" % str(urls[line]))
        else:
            urls[line]['count'] = urls[line]['count'] + 1
 
    for url in urls:
        if urls[url]['code'] != 200:
            print "%d\t%d\t%s" % (urls[url]['count'], urls[url]['code'], url)
 
if __name__ == "__main__":
    main()
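
The script above targets Python 2 (urllib2 and the print statement). Here is a minimal Python 3 sketch of the same idea, using only the standard library. The names final_status and summarize, and the fetch parameter, are made up for this example and are not part of the script on GitHub.

```python
"""Hypothetical Python 3 sketch of the URL checker above (standard library only)."""
import urllib.error
import urllib.request


def final_status(url):
    """Return the status code of the last response, after any redirects."""
    try:
        # urlopen() follows redirects, so this is the code of the final URL.
        with urllib.request.urlopen(url) as response:
            return response.getcode()
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses are raised as HTTPError and carry the status code.
        return e.code
    except Exception:
        # DNS failure, refused connection, malformed URL: report 0 like the original.
        return 0


def summarize(lines, fetch=final_status):
    """Count duplicate URLs and return (count, code, url) rows for non-200 results."""
    seen = {}
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        if url not in seen:
            seen[url] = {"code": fetch(url), "count": 1}
        else:
            seen[url]["count"] += 1
    return [(v["count"], v["code"], u) for u, v in seen.items() if v["code"] != 200]
```

To use it as a filter, loop over summarize(sys.stdin) and print each row tab-separated. The fetch parameter only exists so the counting logic can be exercised without touching the network.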

Modern (0.10.x+) NodeJS RPMs on CentOS/RHEL 5 and 6

I posted back in January about RPM Spec Files for nodejs 0.9.5 and v8 on CentOS 6. In that post I also said that I was unable to get recent NodeJS to build on CentOS 5 because of a long chain of dependencies including node-gyp, v8, http-parser, glibc, etc. Well, I have good news for both distro versions.

On the CentOS/RHEL 6 side, thanks to a lot of work by T. C. Hollingsworth and others, NodeJS 0.10.5 is currently in the official EPEL repositories. They seem to be keeping the packages pretty current, but if you need newer, you can always grab the SRPMs from EPEL and build the newer versions. This is great, because it means I no longer need to maintain the spec files and do my own builds. I don’t think I really did anything to help get this package in EPEL, other than ping a few people and comment on a few tickets.

For CentOS/RHEL 5, I finally have packages, but they’re not exactly pretty. The dependency solving issues still stand; they’re rooted at the dependency of node-gyp which requires the v8 C++ JavaScript library, and is required to compile shared object addons. The best solution that I (and a few others) could find is simply not to build node-gyp, and not to have support for addons or package any addons; we just have the binaries that NodeJS’s Makefile creates, and everything else is interpreted. A coworker found https://github.com/kazuhisya/nodejs-rpm which contains a configure patch and specfile for a dead-simple CentOS 5/6 RPM of NodeJS 0.10.9, which essentially just uses EPEL’s python26 packages to power the NodeJS build process, configures and uses the Makefile’s make binary command to spit out a NodeJS binary tarball, and then packages that. That whole process is way out of line with the Fedora Packaging Guidelines, and it also only dumps out nodejs, nodejs-binary and nodejs-debuginfo packages, so I can’t just substitute in a different package name in my puppet manifests (which install nodejs, nodejs-devel and npm packages). So I forked that repository and made some changes to the specfile: I gave the package name a prefix (“cmgd_”, since that’s where I work these days) and some warnings in the description, to make it abundantly clear that these packages are very far from what you find in EPEL and other repositories, and broke npm and the devel files out into their own subpackages. Hopefully this spec file will be of use to someone else who also has the unfortunate need of supporting recent NodeJS on CentOS 5. If there’s enough interest, I’ll consider building the packages and putting them in a repository somewhere.

You can see the NodeJS 0.10.9 on CentOS 5 spec file, a patch, and the READMEs at https://github.com/jantman/nodejs-rpm-centos5. Patches and/or pull requests are greatly appreciated, especially from anyone who wants to make the spec file more Fedora guidelines compliant.

Script to easily rebuild a SRPM

Between RHEL/CentOS 5 and 6 the default RPM compression format was changed to xz. As such, trying to build a recent Fedora or Cent6 SRPM on Cent5 will error out with a message like error: unpacking of archive failed on file foo;51a4c2a5: cpio: MD5 sum mismatch because the older rpm on CentOS 5 can’t read packages built with the newer defaults.

Here’s a quick and dirty little script to use rpm2cpio to rebuild a SRPM using the host’s native RPM compression. The latest version will live at https://github.com/jantman/misc-scripts/blob/master/rebuild_srpm.sh

#!/bin/bash
#
# Script to rebuild a SRPM 1:1, useful when you want to build a RHEL/CentOS 6
# SRPM on a RHEL/CentOS 5 system that doesn't support newer compression (cpio: MD5 sum mismatch)
#
# by Jason Antman <jason@jasonantman.com>
# The latest version of this script will always live at:
# <https://github.com/jantman/misc-scripts/blob/master/rebuild_srpm.sh>
#
 
if [[ -z "$1" || "$1" == "-h" || "$1" == "--help" ]]
then
    echo "USAGE: rebuild_srpm.sh <srpm> [output directory]"
    exit 1
fi
 
if [[ -z "$2" ]]
then
    OUTDIR=`pwd`
else
    OUTDIR="$2"
fi
 
if [[ ! -e "$1" ]]
then
    echo "ERROR: SRPM file not found: $1"
    exit 1
fi
 
if ! which rpmbuild &> /dev/null
then
    echo "rpmbuild could not be found. please install. (sudo yum install rpm-build)"
    exit 1
fi
 
if ! which rpm2cpio &> /dev/null
then
    echo "rpm2cpio could not be found. please install. (sudo yum install rpm)"
    exit 1
fi
 
# resolve the SRPM to an absolute path, since we change directories below
SRPM="$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
TEMPDIR=`mktemp -d`
STARTPWD=`pwd`
 
echo "Rebuilding $SRPM..."
 
# copy srpm into tempdir
cp $SRPM $TEMPDIR
 
pushd $TEMPDIR &>/dev/null
 
# setup local build dir structure
mkdir -p rpm rpm/BUILD rpm/RPMS rpm/SOURCES rpm/SPECS rpm/SRPMS rpm/RPMS/athlon rpm/RPMS/i\[3456\]86 rpm/RPMS/i386 rpm/RPMS/noarch rpm/RPMS/x86_64
 
# setup rpmmacros file
cat /dev/null > $TEMPDIR/.rpmmacros
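# write to the temporary macros file passed to rpmbuild below, not to the user's ~/.rpmmacros
echo "%_topdir        $TEMPDIR/rpm" >> $TEMPDIR/.rpmmacros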
 
echo "Extracting SRPM..."
pushd $TEMPDIR/rpm/SOURCES/ &>/dev/null
rpm2cpio $SRPM | cpio -idmv &>/dev/null
popd &>/dev/null
 
# build the SRPM from the spec and sources
# we're just building a SRPM so we can ignore dependencies
echo "Rebuilding SRPM..."
NEW_SRPM=`rpmbuild -bs --nodeps --macros=$TEMPDIR/.rpmmacros $TEMPDIR/rpm/SOURCES/*.spec | grep "^Wrote: " | awk '{print $2}'`
 
echo "Copying to $OUTDIR"
cp $NEW_SRPM $OUTDIR/
 
echo "Wrote file to $OUTDIR/`basename $NEW_SRPM`"
 
# cleanup
cd $STARTPWD
rm -Rf $TEMPDIR
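
The script relies on rpmbuild printing a "Wrote: <path>" line for the new SRPM, and picks the path out with grep and awk. Purely as an illustration, that parsing step could be sketched in Python like this. The helper name wrote_paths is hypothetical.

```python
def wrote_paths(rpmbuild_output):
    """Return the path from every 'Wrote: <path>' line in rpmbuild's output."""
    paths = []
    for line in rpmbuild_output.splitlines():
        if line.startswith("Wrote: "):
            # Same effect as: grep "^Wrote: " | awk '{print $2}'
            # (like awk, this keeps only the second whitespace-separated field)
            fields = line.split()
            if len(fields) > 1:
                paths.append(fields[1])
    return paths
```

Like the awk version, this truncates any path containing spaces, which is unlikely for an SRPM in a mktemp directory but worth knowing about.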

Environment Variable Substitution in Apache httpd Configs

I’ve been configuring Apache httpd for over a decade, from a single personal web server to web farms running thousands of vhosts. In most of the “real” environments I’ve worked in, we’ve had some variation of production, stage/test/QA and development hosts; and usually some method of managing configurations between them, whether it’s source control or generating them from template. And in all of these environments, there has invariably been drift between the configurations in the various environments, whether it’s because of poor tools to maintain a unified configuration or many of those emergency redirect requests that make it into production but are never backported. This is made all the worse because everywhere I’ve worked, the real difference between what production and other environments should be is really just a string replacement in Apache configurations – /prod/ to /test/ or www.example.com to www.dev.example.com or something along those lines.

Well, a few days ago I was having a discussion with some co-workers that dovetailed into this topic, and when I started some research, I found (finally, after using httpd for years) that the Apache httpd 2.2 configuration file syntax documentation states that httpd supports environment variable interpolation anywhere in the config files (and httpd 2.4 supports it with Defines as well).

Yup, that’s right. All those different Apache configs I’ve worked with for years that define separate vhosts, document roots, rewrite targets, ServerAliases, etc. for www.example.com and www.qa.example.com and www.dev.example.com really only had to be www.${ENV_URL_PART}example.com, and set ENV_URL_PART in the init script or sysconfig file. (Of course this all assumes that you have your different environments served by different httpd instances, which you do, of course…)

For me, this is a very big deal. It means that finally, instead of maintaining separate sets of configs for different environments which are (theoretically, except for those emergencies) kept identical by hand, or updating templates and then re-generating each environment’s configs, we can finally follow the same commit/merge/promotion-between-environments workflow that we use for other production code and Puppet configuration. It also means that those pesky little rewrites and other minor tweaks will make it all the way back to development environments.

So, here’s a little example of how this would work in reality. Let’s assume that we have 3 main environments, prod, qa and dev (though this should work for N environments) and that domains are prefixed with “qa.” or “dev.” for the respective internal environments. We set environment variables before httpd is started, on a per-host basis, depending on what environment that host is in. On RedHat based systems, we’d add the variables to /etc/sysconfig/httpd for production:

HTTPD_ENV_NAME="prod"
HTTPD_ENV_URL_PART=""

or for QA:

HTTPD_ENV_NAME="qa"
HTTPD_ENV_URL_PART="qa."

Those variables will now be available to httpd within the configurations (and also to any applications or scripts that have access to the web server’s environment variables).

Now let’s look at an example vhost configuration file that uses the environment variables:

<VirtualHost *:80>
ServerName example.com
ServerAlias www.example.com
# Aliases including proper environment name
ServerAlias www.${HTTPD_ENV_NAME}.example.com ${HTTPD_ENV_NAME}.example.com
 
ErrorLog /var/log/httpd/example.com-error_log
CustomLog /var/log/httpd/example.com-access_log combined
 
DocumentRoot /sites/example.com/${HTTPD_ENV_NAME}/
 
# Environment-specific configuration, if we absolutely need it:
Include /etc/httpd/sites/${HTTPD_ENV_NAME}/env.conf
 
<Location "/testrewrite">
RewriteEngine on
RewriteRule /foobar/.* http://www.${HTTPD_ENV_URL_PART}example.com/baz/ [R=302,L]
</Location>
 
</VirtualHost>

Every instance of ${HTTPD_ENV_NAME} will be replaced with the value set in the sysconfig file, and likewise with every instance of ${HTTPD_ENV_URL_PART}. This way, we can have one set of configurations and use our normal source control branch/promotion process to both test and promote changes through the environments along with application code, and ensure that any straight-to-production emergency changes (everyone has customer-ordered rewrites like that, right?) make it back to development and qa.

One caveat is that, if the environment variable is not defined, the ${VAR_NAME} will be left as a literal string in the configuration file. There doesn’t seem to be any way to protect against this in httpd 2.2, other than making sure the variables are set before the server starts (and maybe setting logical default values, like an empty string, in your init script which should be overridden by the sysconfig file).
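
Since an undefined variable silently stays in the config as a literal, a small pre-start check can help. Here is a rough Python sketch, assuming variables are referenced only in the ${NAME} form used above. It approximates the behavior described in this post and is not httpd's own parser; the function names are made up for the example.

```python
import os
import re

# Matches ${NAME} references as used in the vhost example above.
VAR_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def undefined_vars(config_text, env=None):
    """Return the sorted names referenced as ${NAME} but absent from env."""
    env = os.environ if env is None else env
    return sorted({name for name in VAR_REF.findall(config_text) if name not in env})


def substitute(config_text, env=None):
    """Replace ${NAME} with its value; leave undefined references as literals."""
    env = os.environ if env is None else env
    return VAR_REF.sub(lambda m: env.get(m.group(1), m.group(0)), config_text)
```

A variable set to an empty string (like HTTPD_ENV_URL_PART in production) counts as defined here, so only names that are missing entirely get reported.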

If you’re running httpd 2.4+, you can turn on mod_info and browse to http://servername/server-info?config to dump the current configuration, which will show the variable substitution.

Fedora Init Script Specification Summary

I’ve been deploying some new software lately (specifically selenesse, which combines Selenium and fitnesse, and xvfb). None of these seem to come with init scripts to run as daemons, and the quality of the few Fedora/RedHat/CentOS init scripts I was able to find was quite poor. The Fedora project has a Specification for SysV-style Init Scripts in their Packaging wiki, which specifies what a Fedora/RedHat/CentOS init script should look like, in excruciating detail. What follows is an overview of the more important points, which I’m using to develop or modify the scripts I’m currently working on.

  • Scripts must be put in /etc/rc.d/init.d, not in the /etc/init.d symlink. They should have 0755 permissions.
  • Scripts must have a Fedora-style chkconfig header (“chkconfig:”, “description:” lines), and may have an LSB-style header (BEGIN INIT INFO/END INIT INFO). See Initscript template.
  • Scripts must make use of a lockfile in /var/lock/subsys/, and the name of the lockfile must be the same as the name of the init script. (There is a technical reason for this relating to how sysv init terminates daemons at shutdown). The lockfile should be touched when the daemon successfully starts, and removed when it successfully stops.
  • Init scripts should not depend on any environment variables set outside the script. They should operate gracefully with an empty/uninitialized environment (or only LANG and TERM set and a CWD of /, as enforced by service(8)), or with a full environment if they are called directly by a user.
  • Required actions – all of the following actions are required, and have specific definitions:
    • start: starts the service
    • stop: stops the service
    • restart: stop and restart the service if the service is already running, otherwise just start the service
    • condrestart (and try-restart): restart the service if the service is already running, if not, do nothing
    • reload: reload the configuration of the service without actually stopping and restarting the service (if the service does not support this, do nothing)
    • force-reload: reload the configuration of the service and restart it so that it takes effect
    • status: print the current status of the service
    • usage: by default, if the initscript is run without any action, it should list a “usage message” that has all actions (intended for use)
  • There are specified exit codes for status actions and non-status actions.
  • They must “behave sensibly”. I’ve found this to be one of the biggest problems with homegrown init scripts. If servicename start is called while the service is already running, it should simply exit 0. Likewise if the service is already stopped. Init scripts must not kill unrelated processes. I don’t know how many times I’ve seen scripts that kill every java or python process on a machine.

I intend to use this as a quick checklist when developing or evaluating init scripts for RedHat/Fedora based systems. In my experience, the biggest problems with most init scripts revolve around poor handling of PID files and lockfiles, mainly:

  • Killing processes other than the one that the script started (i.e. killing all java or python processes), usually because the PID isn’t tracked at start.
  • Starting a second instance of the subsystem because lockfiles aren’t used, or the status function is broken.
  • Returning improper exit codes.
  • Either explicitly relying on environment variables (and therefore breaking when called through service(8)), or conversely, not cleaning/resetting environment variables that are used by dependent code or processes.
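
As a rough illustration of the status logic behind several of these points, here is a Python sketch (a real init script would of course be shell). It assumes the LSB status exit codes: 0 for running, 1 for dead with a PID file present, 2 for dead with a lockfile present, and 3 for not running. The function names and the paths passed in are hypothetical.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def status_code(pidfile, lockfile, is_running=pid_is_running):
    """Map PID file and lockfile state to an LSB-style status exit code."""
    pid = read_pid(pidfile)
    if pid is not None and is_running(pid):
        return 0  # service is running
    if os.path.exists(pidfile):
        return 1  # dead, but the PID file is still there
    if os.path.exists(lockfile):
        return 2  # dead, but the lockfile is still there
    return 3      # not running
```

Tracking the one PID written at start is what lets a stop action signal only that process, rather than every java or python process on the machine.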

Readable Nagios Log Timestamps

If you’re like me and most humans, the Nagios logfile timestamp (a unix timestamp) isn’t terribly useful when trying to grep through the logs and correlate events:

# head -2 nagios.log
[1350360000] LOG ROTATION: DAILY
[1350360000] LOG VERSION: 2.0

Here’s a nifty Perl one-liner that you can pipe your logs through:
perl -pe 's/(\d+)/localtime($1)/e'
to get nicer output like:

# head -2 nagios.log
[Tue Oct 16 00:00:00 2012] LOG ROTATION: DAILY
[Tue Oct 16 00:00:00 2012] LOG VERSION: 2.0
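
If Perl isn’t handy, roughly the same conversion can be sketched in Python. This is only an illustration: the name humanize and its tz parameter are made up, and unlike the one-liner (which rewrites the first run of digits anywhere on the line) it only touches a bracketed timestamp at the start of the line.

```python
import re
from datetime import datetime

# Nagios log lines start with a bracketed epoch timestamp, e.g. "[1350360000] ..."
STAMP = re.compile(r"^\[(\d+)\]")


def humanize(line, tz=None):
    """Replace a leading [epoch] with a readable time (local time when tz is None)."""
    def repl(match):
        when = datetime.fromtimestamp(int(match.group(1)), tz)
        return "[" + when.strftime("%a %b %d %H:%M:%S %Y") + "]"
    return STAMP.sub(repl, line, count=1)
```

To use it as a filter, write humanize(line) back to stdout for each line read from stdin. Lines without a leading timestamp pass through unchanged.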

Some questions from a tech interview with a big Internet company

A while back, I did a technical phone screen with a big online “social” company (I won’t say who, but they’re a household name, growing fast, and doing cool things; that doesn’t leave too many options). I rarely remember to write down interview questions, but I was cleaning out my desk this morning and came across a ripped-out sheet of notebook paper with a handful of the interview questions written on it. Most of them weren’t terribly difficult, or terribly unusual for competent technical interviewers, but since I happen to actually have the list written down, I thought I’d share it. I don’t remember why the programming questions are all Python; likely, I was asked to choose between Python (which I’ve used, though not lately), Ruby (which I can barely muddle my way through reading on a good day), and something else I don’t know. Here are some of them…

  • What is an inode? What does it store?
  • What is a hard link?
  • What is the difference between a hard link and a soft link?
  • What is a list in Python?
  • Name some data structures that you’d use in Python. Describe them, and tell me why you would use them.
  • How would you list all the man pages containing the keyword “date”?
  • If the chmod binary had its permissions set to 000, how would you fix it?

Dumping all Macros from an RPM Spec File

I’ve been doing a lot of RPM packaging lately, and on different (and very old) distros and versions. Sometimes I lose track of all of the macros used in specfiles (_bindir _sbindir dist _localstatedir, etc). There’s no terribly easy way to dump a list of all of the available macros. There is, however, a bit of a kludge. Insert the following code in your specfile before the %prep or %setup lines:

%dump
exit 1

The %dump macro will dump all defined macros to STDERR. The exit 1 will prevent rpmbuild from going on and trying to build the package. If you want to view the output nicely, you can pipe it through a pager like less: rpmbuild -ba filename.spec 2>&1 | less.

Just make sure to remove those two lines when you want to actually build the package.

Project – Storing and Analyzing Apache httpd Logs from Many Hosts

I’ve recently started casual work on a side-project to collect, store, and analyze Apache logs from a bunch of servers – for the initial implementation, I’m looking to handle about 15M access_log lines per day (that works out to about 174 lines/second assuming an even distribution, which there certainly isn’t). Here is a selection of links that I’ve been using for ideas and inspiration, both for the technical side (data collection, transport, storage and analysis) and visualization:

  • RRDtool – RRDtool Gallery – I’m starting a graphing/log analysis project, and looked here for some inspiration for my proof-of-concept code
  • Creating pretty graphs with RRDTOOL from Girish Venkatachalam.
  • There’s some good information on RRDtool’s “Aberrant Behavior Detection” (Holt-Winters prediction, deviation and failure detection) on the rrdtool, rrdgraph_examples and rrdcreate documentation pages, but unfortunately no anchors to link directly to.
  • Cube – “Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets. Cube is built on MongoDB and available under the Apache License on GitHub.”
  • Cubism.js – “Cubism.js is a D3 plugin for visualizing time series. Use Cubism to construct better realtime dashboards, pulling data from Graphite, Cube and other sources. Cubism is available under the Apache License on GitHub.” The demo on that page looks pretty cool.
  • Highcharts Demo Gallery – JS chart/graph library. It requires a paid license for commercial use (though it’s a bit unclear to me whether an internal ops dashboard would fall under this license provision) so I probably wouldn’t go with this one. They have some cool charts, including a dynamic line chart updating every second, a scatter plot and a nice zoomable time-series graph, though IMHO it’s not as nice as the Google Chart Tools (formerly Google Visualization) annotated timeline.
  • [ HOWTO ] Graphing Holt-Winters Predictive Analysis – Cacti forums
  • dygraphs – an impressive permissive-license JS chart library dedicated to visualizing dense time-series data. Developed by Google and now used by them (Google Correlate, Google Latitude) as well as NASA, 10gen and others. There are some very cool demos on that main page, and also on the tests page.
  • Graphite, JMXTrans, Ganglia, Logster, Collectd, say what ? « Planet DevOps
  • Visage
  • kgorman/mongo_graph – a tool to pull data from MongoDB and put it in RRD files
  • drraw – a perl-based graphing frontend (web UI) for RRDtool
  • etsy/logster · GitHub – Etsy’s Python tool to maintain a pointer on a log file, and parse at a regular rate feeding the data into a tool like Graphite or Ganglia.
  • cebailey59/charcoal – a Sinatra app that allows creation of dashboards from Graphite, collectd, or any other service that creates images from URL calls.
  • etsy/dashboard – some examples of how Etsy builds monitoring dashboards.
  • GDash – Graphite Dashboard | R.I.Pienaar – a Sinatra dashboard app for Graphite, using Twitter bootstrap for visualization.
  • paperlesspost/graphiti – a Ruby and JavaScript front-end for Graphite.
  • Graphite Screenshots – just two, but they get the idea across pretty well.
  • Graylog2 – a centralized log management application with a powerful web interface. Stores logs in ElasticSearch (which is built on Lucene, a Java-based index and search server) and statistics/graphs in MongoDB. It does analytics, alerting, monitoring/graphing and searching all through a web interface, and accepts log data via syslog, AMQP and GELF (its own log format). Java server and Ruby on Rails web UI.
  • Logstash – another centralized log project that stores and indexes logs, with search via a web UI. “Ship any event to anywhere over any protocol.” Takes many inputs including files, syslog, AMQP, Flume, STOMP, HTTP and even twitter, performs a number of filters including timestamp checks, parsing, dropping, joins, etc, and then sends logs back on an output including AMQP, Graylog2 GELF, STOMP, MongoDB, ElasticSearch, syslog, WebSockets and to Nagios. One particularly cool feature is its “file” input, which continuously tails a file and claims to be log rotation safe. Just cool.
  • jordansissel’s Logstash intro slides.
  • Kibana – an alternative interface for Logstash and ElasticSearch that allows searching, graphing and analysis of log data stored in Logstash.
  • Pivotal Labs: Talks – Metrics Metrics Everywhere (Coda Hale)
  • PaperlessPost – @quirkey’s talk on metrics – very good high level stuff, but slides only
  • paperlesspost/graphiti – graphiti, a JS/Ruby frontend for Graphite that does graphs, dashboards, and point-in-time snapshots of graphs. Lots of functionality.
  • Redis – a distributed key/value store that’s really popular with the cool kids. Another Redis Use Case: Centralized Logging • myNoSQL
  • Charcoal – a Sinatra (Ruby) dashboard app (ready for use on Heroku but usable anywhere). Graphite-oriented but will work with any tool that generates images from URLs.
  • etsy/logster – etsy’s Logster tool, which keeps a tail on log files, parses them, and ships metrics to Graphite or Ganglia.

Some PowerDNS Links and Interesting Features

At $WORK we lost a disk in the RAID1 of one of our external nameservers, and it rekindled an occasional discussion of migration from ISC BIND to PowerDNS. PowerDNS has separate authoritative and recursive servers, and doesn’t seem to natively support views or split-horizon the way BIND does, but it has some really cool features including very mature database backends, load balancing, Lua scripting support to modify how recursive queries are answered, and geolocation or IP-range based query results.

While this project is still just casual research, I thought I’d share some of the useful links and information I’ve found:

PowerDNS Front-ends:

  • JPowerAdmin – One of the two most popular, a GPLv3 Java (JBoss SEAM) based web UI with a RESTful API, with support for “multiple” database backends. Sponsored by Nicmus, Inc. Online demo (demo:demo). Looks nice, simple UI, but no support for split-horizon.
  • PowerAdmin – the other most popular, though it seems to be undergoing a large overhaul at the moment. Has full support for most of PowerDNS’s features, written in PHP, supports “large” databases, fine-grained user permissions, RFC validation, zone templates. Online demo (demo:demo). I don’t really like that it manages the SOAs as full text (without any templating, dropdowns or default values), and that it doesn’t prepopulate default values for TTL in the new record form, but it looks like a good starting place for someone (like me) who’s handy with PHP.
  • pdns-gui – PowerDNS GUI – Google Project Hosting – PHP/MySQL GUI. Online demo. Handles templates nicely but won’t scale to too many of them. Window-based UI is visually pleasing but will probably be a problem for big zones.
  • powerdns-webinterface – PowerDNS Webinterface – Google Project Hosting – A nice but relatively simplistic UI written in PHP. It has some nice features like multi-user authentication (and logging, though I haven’t looked into how detailed it is), automatic SOA serial update, automatic PTR creation, etc. Unfortunately not geared towards people with lots of domains and multiple records; it has only one template for new domains (and no way to update domains created from a template), no easy filtering, and still treats SOA like a single text record.
  • ZoneAdmin | SourceForge.net and Project website – Maybe not the fastest tool to use in bulk, but a nice, relatively intuitive and full-featured admin tool. Online demo (demo:demo).

Some links on PowerDNS split-horizon

It looks to me that split-horizon is going to be the hardest part for us, at least to also have a web UI to manage it. It looks like with PowerDNS, the most common way to run split horizon DNS (views) is to run two separate sets of servers or instances, either on different boxes or multi-homed; one for internal and one for external. While that sounds like quite a bit of overhead beyond what BIND does, the real problem is finding a web UI that supports it; I don’t care if it’s in two separate databases, but what I want is a logical (web UI) view that has zones made up of resource names (i.e. the leftmost column in a zone file) with one or two RRs (type, ttl, priority, value) – one for each view. That’s the real catch – all of our machines are in private IP space behind a firewall, so I need to be able to manage the internal and external records on one screen. While it’s not exactly scalable, and the code stagnated quite a bit once I got it to a point that was usable for me, this was the main goal of my MultiBIND Admin project.
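
To make that data model concrete, here is a small Python sketch of one resource name carrying at most one record per view. Everything in it (the class names, the view labels “internal” and “external”, the default TTL, and the zone-file-like output) is an assumption made for illustration, not how PowerDNS or MultiBIND Admin actually stores data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    """One RR as described above: type, ttl, priority, value."""
    rtype: str
    value: str
    ttl: int = 3600  # arbitrary default, chosen for this example
    priority: Optional[int] = None


@dataclass
class Name:
    """One resource name with at most one record per view."""
    name: str
    views: Dict[str, Record] = field(default_factory=dict)


def zone_lines(names: List[Name], view: str) -> List[str]:
    """Render the records visible in one view as zone-file-like lines."""
    lines = []
    for entry in names:
        rec = entry.views.get(view)
        if rec is None:
            continue  # this name does not exist in the requested view
        prio = "" if rec.priority is None else "%d " % rec.priority
        lines.append("%s %d IN %s %s%s" % (entry.name, rec.ttl, rec.rtype, prio, rec.value))
    return lines
```

With a structure like this, a single screen can show each name once with its internal and external values side by side, while each view can still be exported to its own database or server instance.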