Logging OpenSSH SFTP Transactions

I just came across a really handy post on David Busby‘s blog: Enable logging in the SFTP subsystem – Oneiroi. From OpenSSH 4.4 on, you can pass arguments to Subsystem calls, and the sftp subsystem supports logging to an aribtrary syslog facility and priority. Simply adding a line like:

Subsystem       sftp    /usr/libexec/openssh/sftp-server -f LOCAL5 -l INFO

and the appropriate lines to your syslog config will give you a handy transfer log like:

Jul 16 09:22:25 hostname sftp-server[2058]: session opened for local user jantman from [A.B.C.D]
Jul 16 09:22:26 hostname sftp-server[2058]: open "/home/jantman/temp/sftp_test" flags WRITE,CREATE,TRUNCATE mode 0666
Jul 16 09:22:45 hostname sftp-server[2058]: close "/home/jantman/temp/sftp_test" bytes read 0 written 1464813
Jul 16 09:23:08 hostname sftp-server[2058]: session closed for local user jantman from [A.B.C.D]
Jul 16 09:27:50 hostname sftp-server[2309]: session opened for local user jantman from [A.B.C.D]
Jul 16 09:27:50 hostname sftp-server[2309]: open "/home/jantman/temp/sftp_test" flags READ mode 0666
Jul 16 09:27:54 hostname sftp-server[2309]: close "/home/jantman/temp/sftp_test" bytes read 1464813 written 0
Jul 16 09:27:54 hostname sftp-server[2309]: session closed for local user jantman from [A.B.C.D]

If you have syslog write these logs to their own file, remember to setup log rotation for them.

Unfortunately, I’m not aware of any way to log SCP file transfers.

Apache httpd – logging for sites with and without load balancing

There are a few unfortunate places where I have an Apache httpd server serving multiple vhosts, some behind a F5 BigIp load balancer and some with direct traffic. For sites behind the LB, the remote IP/host will always show up as the LB’s IP/host, not that of the actual client. Using the default configuration with LogFormat directives in httpd.conf, this means that either we need to define log formats per-vhost or lose the client IP in one of our scenarios (LB or no LB).

I came by a simple solution to this on Emmanuel Chantréau‘s blog, and here is my condensed version of it. It sets an environment variable (“bigip-request”) if the BIOrigClientAddr request header is set (this header holds the client’s IP; it’s the BigIp proprietary version of the X-Forwarded-For header. You could easily substitute that more standard header in the following snippet) and then sets the “combined” LogFormat based on that variable – a version using BIOrigClientAddr if it is set, and a version using the normal “%h” remote host otherwise.

In httpd.conf:

# set the "bigip-request" env variable to "1" if there is a BIOrigClientAddr header in the request                                                                                                   
SetEnvIf BIOrigClientAddr . bigip-request
# we'll use this following LogFormat (BIOrigClientAddr in place of remote host) as "combined" IF the bigip-request env variable is set                                                                     
LogFormat "%{BIOrigClientAddr}i %l %u %t %v \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined_lb
# else we'll use this one (remote host IP address) as "combined" IF the bigip-request env variable is NOT set                                                                                   
LogFormat "%h %l %u %t %v \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

And then in our vhost configuration:

# use this log format if we're behind an LB
CustomLog logs/<%= domain %>_access_log combined env=!bigip-request
# or this format if we're not
CustomLog logs/<%= domain %>_access_log combined_lb env=bigip-request

Some Thoughts on Choosing a New WordPress Theme

I think I’m going to choose a new theme for my blog. The current theme is iNove (albeit an older version with some custom modifications), and I feel like it looks a bit messy and has gotten a bit cluttered, so it’s time to find something new. I like the 2-column layout, and have a few other things I’m looking for – specifically, aside from something with advanced features like lots of widget support and hooks, something that has good visual separation between different posts and widgets. I also really want something, if possible, with relative column widths. My current home and work desktops both have dual monitors, and the minimum resolution I have on one screen is 1920×1080. When I look at my blog in a maximized window, about half the screen width is wasted with empty space. So, ideally, I’d like a theme that’s based on relative widths, probably with a “min-width” property so it wouldn’t get compressed to an absurdly narrow width on small screens.

I use Google Analytics (as noted in the privacy policy) for visitor statistics on this blog (more about that in a moment). So, I took a peek at the breakdown of visitors by screen resolution, and saw that for the past year, 94% of the 27,500 visits had a screen width of 1024px or more (and the majority of the others looked like mobile device resolutions, so they’d probably zoom the page correctly). So, my first gut reaction was to assume that I could use a theme approximately 1000px wide. Unfortunately, there’s two main problems with that: first, as mentioned by Chris Coyier on CSS-Tricks.com, just because someone has a given screen resolution doesn’t mean their browser window (let alone the viewport) is that size. As a matter of fact, I usually have my main browser window set at about 80% of the width of one of my monitors, with my instant messaging client Pidgin taking up the rest of the space. So there’s one inaccuracy. There’s a potentially much greater inaccuracy in my stats as well, which I’m going to discuss in a separate post.

Inaccuracies in Google Analytics for Website Stats

I use Google Analytics for visitor stats on this blog. Not because I’m trying to direct-market to my readers or become Big Brother, but for a number of simple reasons:

  • It’s simple – no software for me to update, and nothing that needs to run on my server and read through giant log files every night. Google does it all for me.
  • It gives a lot more information than I can get from just web server logs.
  • Because of Google’s “big brother” tracking, and the vast number of sites that they track people on, I can tell things I’d have no other way of knowing, like how long someone stayed on my page.
  • They tell me useful stats like which search keywords brought the most people to my site and which posts are the most popular, which I keep in mind when writing new stuff and updating older posts.
  • They tell me information about client operating system and browser version, which I think tells quite a bit about my audience.
  • As far as I know, they’re pretty good at filtering out anything other than an actual human visitor.
  • They tell me stats that have no real use to me, but are just cook – like what countries my visitors are from, what type of Internet connection they’re on, their screen resolution, etc.

Obviously not for google, but for me, all of these stats are totally anonymous – I just get percentages or numbers of visits, it’s not like I can see all of the details per-IP address. The most important aspect to me is just the ease of use – I sign up and put a little snippet of code on my pages, and I get an amazing dashboard interface with all of this information. Nothing to install and update on my server, and (most importantly, since I’m now running everything of mine on one virtualized server) no massive program to run as a cron job that has to read all my server log files.

Last week I was talking with a couple of my co-workers, specifically about the stats that I get from Google Analytics. While I know it’s not uncommon to run NoScript especially among the more security- and privacy-conscious groups of people, I was a bit disturbed to hear that they all block Google’s tracking code in their browsers via NoScript. I assume there’s also a percentage of people who still just turn off JavaScript alltogether (although I can’t imagine how they use the modern Web), and many who use the Google Analytics Opt-Out feature. So, especially with as technical an audience as I have, I guess that means I’m likely missing a large number of visitors in my stats. On one hand, I want to respect the privacy of my visitors, and respect their desire to opt-out of advanced tracking. On the other hand, since I no longer parse web server logs for statistics, these privacy-conscious visitors aren’t even showing up in what I think of as my monthly visit count, or in my information on what posts and search keywords are most popular, which I only use for “good” purposes – to make my blog more useful. So that’s a bit of a conundrum.

I’ll admit that I do run Google AdSense Ads on my blog, and I’m sure there are some people who block the ads. On one hand, that upsets me a bit; I run this blog to try and share information that I find or learn with others, and the hosting costs aren’t insignificant. If I can get paid to just show some ads, to try and help offset the cost of running the site, I think that’s good. And if other people can help support the site by just letting the ads stay on the page, why not? On the other hand, my hosting costs $50/month (granted the server also handles all of my email, and a whole bunch of other sites). I’ve been participating in Google AdSense since March 5, 2010 (two years and two weeks), and my “estimated earnings” are currently $80. The payout is in $100 increments. So, I haven’t seen a cent from it in two years, so I’ve given up being concerned with it. If you want to be nice, and find my posts interesting, click on one of the ads. Unfortunately, unless I get famous, the ads aren’t going to come close to offsetting even part of the cost of running the site.

Google itself says, “In order for Google Analytics to record a visit, the visitor must have JavaScript, images, and cookies enabled for your website.” There seems to be some buzz about this on the ‘net, and I’ve seen a number of (translated; original post in Dutch) posts advocating building a request to a Google Analytics GIF manually on the server side, and then including it as an image element inside a <noscript> tag in the page. While this is probably one of the nicer solutions (and much more likely to reduce double-counting), it doesn’t capture any of the advanced data (screen resolution, etc.) that JavaScript-based Analytics does, and more importantly, I imagine that many of the Ad blocking extensions also block traffic to google-analytics.com, so this is an incremental improvement at best. There are also quite a few posts about how to make a request to Analytics purely server-side. This has a few disadvantages as well; it bypasses the google domain blacklisting problem that the client-side image has, but it also means you lose the client IP address (and therefore geolocation), and that you double-track any user who allows javascript (so you need a separate profile, and then need to average out the results). It also means that you track every search engine and bot that crawls your site, and possibly every person who clicks a link and then hits “back” before the page finishes loading. I found another blogger who commented about the wide disparity he saw between Google Analytics, the StatsPress WordPress plugin, and AWstats (a server-side log file analyzer).

So what’s the solution?

Most of the options have a down side, but I’m looking for something that’s the best I can reasonably do. As much as I’d rather not, I’m going to look into self-hosted alternatives to Google Analytics (a self-hosted JavaScript-based stats provider), in the hopes that NoScript users will be more friendly to scripts coming from my own domain, and sending requests to my own domain, than ones from Google or other major trackers. I don’t think I want to try anything that parses web server logs as a primary approach, as I don’t think I could ever get a meaningful comparison to Google Analytics or something else JavaScript-based.

I did have one other idea which I think is interesting, though a bit of an overhead. I could have Apache (or, more likely, a Perl script called in the Apache configuration) generate a random string for each request, and save it in an Apache environment variable. The environment variable would then be added to a field in the server logs, and also added (via PHP or whatever else generates the pages server-side) as a custom parameter for the JS tracking code, enabling page hits to be correlated between the JS tracking and the server logs. Assuming the JS tacking backend stores its data in a sane format (and as raw data, not just aggregated), and at the cost of a serious performance penalty, a server-side statistics program like AWstats or Webalizer could be patched to lookup the unique identifier in the JS stats data store, and ignore all hits which were tracked that way.

I’m going to start by looking into self-hosted open source alternatives to Google Analytics, which I’ll post about sometime hopefully soon.

Petit for Log Analysis

I recently discovered the petit program for log analysis. It’s a simple tool to pull out useful information from syslog logs in a variety of ways. I’ve only used it a few times so far, mainly on logs from problems I’ve already solved but didn’t know the cause of at first. So far, it’s proven quite useful. Here are a few examples:

  • petit --wordcount /var/log/messages – displays ordered count of words appearing in the log. My first step, especially if “warning”, “error” or “fatal” shows up near the top…
  • petit --hash --fingerprint /var/log/messages – hashes the log, removes filters (such as numerics, datestamp), and displays count of matching lines. Absolutely wonderful for web error logs, as it removes client IP addresses, line numbers, etc.
  • petit --mgraph /var/log/messages – graph messages per minute for the first hour of the log (ASCII of course)
  • petit --hgraph /var/log/messages – same as above, but messages per hour for the first day
  • Petit will also read from stdin with the –Xgraph options, so you can cat logfile | grep word | petit --mgraph

Just one note – this tool appears to work only on standard syslog formatted logs. If some non-datestamped lines managed to work their way into the log (i.e. someone used echo >> logfile instead of logger), it will choke.

Many thanks to Scott McCarty for this wonderful tool!

php-suhosin syslog issues

I just installed php-suhosin 0.9.29 from EPEL on a CentOS 5.6 box. I’m running a whole bunch of name-based vhosts in Apache, and have a bunch of web apps, so I opted to run suhosin in simulation mode (don’t actually block anything, but log errors) and have it log via syslog to a single file. Unfortunately, when I configured this, the syslog messages started showing up in the wrong place, apparently with the wrong facility and priority. After some roundabout debugging (at first assuming syslogd to be the problem), I determined that, for whatever really strange reason (perhaps an incorrect syslog.h on the EPEL box that built the suhosin package?) the LOG_* constants were incorrect. I looked up the correct integer values in /usr/include/sys/syslog.h and the following configuration directives accomplished the task correctly:

suhosin.log.syslog.facility = 128
; 128 = LOG_LOCAL0
 
suhosin.log.syslog.priority = 5
; 5 = LOG_NOTICE

This one line puts suhosin into simulation mode, where it only logs errors instead of enforcing on them:

suhosin.simulation = On

MySQL General Query Log and Awk Trimming Columns

So today I started to implement a number of complex regex-based rules and templates to get rsyslog to parse ISC DHCPd logs in realtime. Unfortunately one of my templates must have been wrong, because I started seeing some errors about a field that cannot be blank in /var/log/messages. Unfortunately, rsyslog doesn’t log the query that raised the error, or the name of the template, or anything else useful – just that there was a database error. With over a dozen new templates, this didn’t help much. But the following technique did:

MySQLd has a General Query Log that can be activated by addling a line like:
log=/tmp/query.log
to /etc/my.cnf under the [mysqld] section. This will log *all* queries to the specified log file, even if they resulted in an error or did not manipulate data.

I couldn’t find documentation on the log file format, but I observed that each line appeared to begin with some whitespace, then a number (perhaps a connection or section number, or maybe some sort of query ordering number), then the word “Query”, then the query. The following awk expression prints everything from the third column on, dropping the first two columns (the number and “Query”):
awk '{ for (i=2; i<=NF; i++) printf "%s ", $i; printf "\n"; }'
For easy analysis, the output of that can be piped into sort and then uniq.

rsyslog on CentOS5

Update July 2011 – We’ve been using rsyslog for our centralized syslog infrastructure at work (site being redesigned at the moment) for about a year now. As a result, our colleagues at Rutgers University Open System Solutions have been nice enough to include rsyslog in their koji build system. Updated packages of rsyslog for CentOS 5 x86 and x86_64 are available here. Please be aware that they have some dependencies of specific RU versions. It’s probably best if you download the source RPM and build it yourself using rpmbuild. The current 5.6.2 source package includes a CentOS 5 spec file and other related scripts. This RPM repository is only open to the public as a courtesy, so please download once at most and distribute it to your servers yourself.

Having finally setup my storage server (I know it’s not much, but for me starting with 1TB is wonderful), I actually got around to redoing my centralized logging infrastructure. Here’s a small summary of what I have to handle:

  • logs from 15 hosts at 2 locations, including Cisco devices and a mix of syslog and syslog-ng Linux boxen.
  • Remote location logs forwarded to one server at remote location, then to centralized log server via SSH port forwarding.
  • 48-hour retention of full iptables border firewall logs (~ 3GB/day).
  • Future plans to have all logs stored in MySQL.

I’d previously had most of the boxes logging to an older host with low disk space, but had to discontinue this due to lack of storage. Having assessed the options, and with definite plans to log to a database, I decided to go with rsyslog for the centralized host.

Unfortunately, stable rsyslog is up to 4.4.x, with 5.x in development, and the newest package I could find for CentOS was 2.something. So, I set about building it from source. It was a *very* difficult build on my machine (CentOS 5.3, 2.6.18-128.el5 #1 SMP i686). Unfortunately, I don’t have an RPM build environment setup, but here’s how I accomplished it:

  1. I f not already done, yum install rpmforge-release
  2. yum install gnutls-devel gnutls libatomic_ops-devel gcc43 java-1.6.0-openjdk-devel
  3. Download rsyslog-4.5.2 (currently Beta). Extract, cd into the directory.
  4. To test the performance difference from MySQL, yum install php-pgsql postgresql postgresql-devel postgresql-libs postgresql-server
  5. Edit the “configure” script, add -DHAVE_ATOMIC_BUILTINS to the DEFS line (28936 in this version).
  6. export CC=/usr/bin/gcc43
  7. export CCDEPMODE=”depmode=gcc4″
  8. export CFLAGS=”-O3 -march=i686″
  9. ./configure –enable-mysql –enable-omtemplate –enable-gnutls –enable-pgsql
  10. make && make install

For my system, I used the /etc/sysconfig/rsyslog and /etc/init.d/rsyslog from the 2.x rsyslog RPMs, with some modifications as follows:

/etc/sysconfig/rsyslog (comments have been removed to save space):

# -c4   version 4 compatibility mode
# -x     disable DNS for remote messages (don't want it to hang if DNS is down
# -4     IPv4 only
SYSLOGD_OPTIONS="-c4 -x -4"

/etc/init.d/rsyslog (comments have been removed to save space):

#!/bin/bash
#
# rsyslog        Starts rsyslogd.
#
#
# chkconfig: - 12 88
# description: Syslog is the facility by which many daemons use to log \
# messages to various system log files.  It is a good idea to always \
# run rsyslog.
### BEGIN INIT INFO
# Provides: $syslog
# Required-Start: $local_fs $network $remote_fs
# Required-Stop: $local_fs $network $remote_fs
# Default-Stop: 0 1 2 3 4 5 6
# Short-Description: Enhanced system logging and kernel message trapping daemons
# Description: Rsyslog is an enhanced multi-threaded syslogd supporting,
#              among others, MySQL, syslog/tcp, RFC 3195, permitted
#              sender lists, filtering on any message part, and fine
#              grain output format control.
### END INIT INFO
 
# Source function library.
. /etc/init.d/functions
 
RETVAL=0
 
start() {
        [ -x /sbin/rsyslogd ] || exit 5
 
        # Do not start rsyslog when sysklogd is running
        if [ -e /var/run/syslogd.pid ] ; then
                echo $"Shut down sysklogd before you run rsyslog";
                exit 1;
        fi
 
        # Source config
        if [ -f /etc/sysconfig/rsyslog ] ; then
                . /etc/sysconfig/rsyslog
        else
                SYSLOGD_OPTIONS="-m 0"
        fi
 
        if [ -z "$SYSLOG_UMASK" ] ; then
              SYSLOG_UMASK=077;
        fi
        umask $SYSLOG_UMASK
 
        echo -n $"Starting system logger: "
        daemon rsyslogd $SYSLOGD_OPTIONS
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && touch /var/lock/subsys/rsyslog
        return $RETVAL
}
stop() {
        echo -n $"Shutting down system logger: "
        killproc rsyslogd
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/rsyslog
        return $RETVAL
}
reload()  {
    RETVAL=1
    syslog=`cat /var/run/rsyslogd.pid 2>/dev/null`
    echo -n "Reloading system logger..."
    if [ -n "${syslog}" ] && [ -e /proc/"${syslog}" ]; then
        kill -HUP "$syslog";
        RETVAL=$?
    fi
    if [ $RETVAL -ne 0 ]; then
        failure
    else
        success
    fi
    echo
    return $RETVAL
}
rhstatus() {
        status rsyslogd
}
restart() {
        stop
        start
}
 
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  restart)
        restart
        ;;
  reload|force-reload)
        reload
        ;;
  status)
        rhstatus
        ;;
  condrestart)
        [ -f /var/lock/subsys/rsyslog ] && restart || :
        ;;
  *)
        echo $"Usage: $0 {start|stop|restart|reload|force-reload|condrestart}"
        exit 2
esac
 
exit $?

My rsyslog.conf file:

# uncomment next line for debugging, use graphvis to see the graph
#$GenerateConfigGraph /root/rsyslog-graph.dot
$ModLoad imklog
$ModLoad imtcp
$ModLoad imudp
$ModLoad imuxsock
 
# template
$template RemoteHost,"/var/log/HOSTS/%HOSTNAME%/%$YEAR%/%$MONTH%/%$DAY%/%syslogfacility-text%.log"
# used for Cisco, vanilla syslog when we can't parse host name
$template RemoteFromHost,"/var/log/HOSTS/%FROMHOST%/%$YEAR%/%$MONTH%/%$DAY%/%syslogfacility-text%.log"
 
# NOTE - we can't bind UDP to a ruleset, so it enters the local RuleSet
#   and has to be dealt with here
 
$RuleSet local
 
# for cisco, vyatta - doesn't send hostname, need to use IP manually
:fromhost, isequal, "192.168.0.99" ?RemoteFromHost
:fromhost, isequal, "192.168.0.99" ~
:fromhost, isequal, "192.168.0.103" ?RemoteFromHost
:fromhost, isequal, "192.168.0.103" ~
:fromhost, isequal, "192.168.0.97" ?RemoteFromHost
:fromhost, isequal, "192.168.0.97" ~
:fromhost, isequal, "192.168.0.111" ?RemoteFromHost
:fromhost, isequal, "192.168.0.111" ~
# anything from a remote host gets logged as such
:source, isequal, "" ?RemoteHost
:source, isequal, "" ~
 
#
# LOCAL LOGGING
#
 
kern.*                                                 /var/log/messages
*.info;mail.none;authpriv.none;cron.none                /var/log/messages
authpriv.*                                              /var/log/secure
mail.*                                                  -/var/log/maillog
cron.*                                                  /var/log/cron
*.emerg                                                 *
uucp,news.crit                                          /var/log/spooler
local7.*                                                /var/log/boot.log
 
# use the local RuleSet as default
$DefaultRuleset local
 
#
# BEGIN centralized logging stuff added 2009-09-16 by jantman
#
 
# define ruleset for remote logging
$RuleSet remote
 
*.* ?RemoteHost
 
# bind ruleset to tcp listener
$InputTCPServerBindRuleset remote
# and activate it:
$InputTCPServerRun 5000
 
$UDPServerRun 514
$UDPServerRun 5000
 
#
# END remote logging
#