Environment Variable Substitution in Apache httpd Configs

I’ve been configuring Apache httpd for over a decade, from a single personal web server to web farms running thousands of vhosts. In most of the “real” environments I’ve worked in, we’ve had some variation of production, stage/test/QA and development hosts, and usually some method of managing configurations between them, whether it’s source control or generating them from templates. In all of these environments, there has invariably been drift between the configurations, whether because of poor tools for maintaining a unified configuration or because of those emergency redirect requests that make it into production but are never backported. This is made all the worse because, everywhere I’ve worked, the only real difference between production and the other environments is a string replacement in the Apache configurations: /prod/ to /test/, or www.example.com to www.dev.example.com, or something along those lines.

A few days ago I was having a discussion with some co-workers that led into this topic, and when I did some research I found (finally, after using httpd for years) that the Apache httpd 2.2 configuration file syntax documentation states that httpd supports environment variable interpolation anywhere in the config files (and httpd 2.4 supports it with Defines as well).

Yup, that’s right. All those different Apache configs I’ve worked with for years that define separate vhosts, document roots, rewrite targets, ServerAliases, etc. for www.example.com and www.qa.example.com and www.dev.example.com really only had to be www.${ENV_URL_PART}example.com, and set ENV_URL_PART in the init script or sysconfig file. (Of course this all assumes that you have your different environments served by different httpd instances, which you do, of course…)

For me, this is a very big deal. It means that finally, instead of maintaining separate sets of configs for different environments which are (theoretically, except for those emergencies) kept identical by hand, or updating templates and then re-generating each environment’s configs, we can finally follow the same commit/merge/promotion-between-environments workflow that we use for other production code and Puppet configuration. It also means that those pesky little rewrites and other minor tweaks will make it all the way back to development environments.

So, here’s a little example of how this would work in reality. Let’s assume that we have 3 main environments, prod, qa and dev (though this should work for N environments) and that domains are prefixed with “qa.” or “dev.” for the respective internal environments. We set environment variables before httpd is started, on a per-host basis, depending on what environment that host is in. On RedHat based systems, we’d add the variables to /etc/sysconfig/httpd for production:

HTTPD_ENV_NAME="prod"
HTTPD_ENV_URL_PART=""

or for QA:

HTTPD_ENV_NAME="qa"
HTTPD_ENV_URL_PART="qa."

Those variables will now be available to httpd within the configurations (and also to any applications or scripts that have access to the web server’s environment variables).

Now let’s look at an example vhost configuration file that uses the environment variables:

<VirtualHost *:80>
ServerName example.com
ServerAlias www.example.com
# Aliases including proper environment name
ServerAlias www.${HTTPD_ENV_NAME}.example.com ${HTTPD_ENV_NAME}.example.com
 
ErrorLog /var/log/httpd/example.com-error_log
CustomLog /var/log/httpd/example.com-access_log combined
 
DocumentRoot /sites/example.com/${HTTPD_ENV_NAME}/
 
# Environment-specific configuration, if we absolutely need it:
Include /etc/httpd/sites/${HTTPD_ENV_NAME}/env.conf
 
<Location "/testrewrite">
RewriteEngine on
RewriteRule /foobar/.* http://www.${HTTPD_ENV_URL_PART}example.com/baz/ [R=302,L]
</Location>
 
</VirtualHost>

Every instance of ${HTTPD_ENV_NAME} will be replaced with the value set in the sysconfig file, and likewise with every instance of ${HTTPD_ENV_URL_PART}. This way, we can have one set of configurations and use our normal source control branch/promotion process to both test and promote changes through the environments along with application code, and ensure that any straight-to-production emergency changes (everyone has customer-ordered rewrites like that, right?) make it back to development and qa.
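To make the substitution concrete, here is a minimal Python sketch that emulates it on two lines from the vhost above. This is only an illustration of the behaviour described in this post, not httpd's actual parser; the render function and the PROD/QA dictionaries are names I made up for the example.

```python
import re

# Matches ${NAME} references, the syntax used in the vhost example above.
VAR_PATTERN = re.compile(r"\$\{([^}]+)\}")


def render(config_text, env):
    """Replace each ${NAME} with env[NAME]; leave undefined names untouched."""
    def substitute(match):
        name = match.group(1)
        return env.get(name, match.group(0))
    return VAR_PATTERN.sub(substitute, config_text)


RULE = "RewriteRule /foobar/.* http://www.${HTTPD_ENV_URL_PART}example.com/baz/ [R=302,L]"
DOCROOT = "DocumentRoot /sites/example.com/${HTTPD_ENV_NAME}/"

# Hypothetical stand-ins for the values set in /etc/sysconfig/httpd.
PROD = {"HTTPD_ENV_NAME": "prod", "HTTPD_ENV_URL_PART": ""}
QA = {"HTTPD_ENV_NAME": "qa", "HTTPD_ENV_URL_PART": "qa."}

if __name__ == "__main__":
    for env in (PROD, QA):
        print(render(DOCROOT, env))
        print(render(RULE, env))
```

With the production values the rewrite target becomes www.example.com, and with the QA values it becomes www.qa.example.com, from the same source line.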

One caveat is that, if the environment variable is not defined, the ${VAR_NAME} will be left as a literal string in the configuration file. There doesn’t seem to be any way to protect against this in httpd 2.2, other than making sure the variables are set before the server starts (and maybe setting logical default values, like an empty string, in your init script which should be overridden by the sysconfig file).
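One way to guard against this is a pre-flight check run before httpd starts. The following is a rough Python sketch under some assumptions: it only looks at the process environment (so on 2.4 it would not know about variables set with Define), it skips comment lines, and the undefined_variables function is a hypothetical helper, not part of any Apache tooling.

```python
import os
import re
import sys

VAR_PATTERN = re.compile(r"\$\{([^}]+)\}")


def undefined_variables(config_text, env):
    """Return sorted names referenced as ${NAME} on non-comment lines but missing from env."""
    referenced = set()
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        referenced.update(VAR_PATTERN.findall(line))
    return sorted(name for name in referenced if name not in env)


if __name__ == "__main__":
    # Usage (hypothetical): python check_vars.py /etc/httpd/conf.d/*.conf
    missing = set()
    for path in sys.argv[1:]:
        with open(path) as handle:
            missing.update(undefined_variables(handle.read(), os.environ))
    if missing:
        print("undefined: " + ", ".join(sorted(missing)))
        sys.exit(1)
```

Run from the init script after the sysconfig file has been sourced, a non-zero exit status could be used to refuse to start the server.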

If you’re running httpd 2.4+, you can turn on mod_info and browse to http://servername/server-info?config to dump the current configuration, which will show the variable substitution.

Tools for watching apache httpd and memcached

Recently I was working on a code release on a site running PHP on Apache httpd, and using memcached. Without getting into specifics, we had a number of issues that involved both Apache and memcached, and little visibility into them, as the site was running on an older server without much monitoring in place. I started looking around for simple tools that could provide a bit more insight without many dependencies (as the machine is a relatively minimalist install). Here are some of the options I found:

  • memcache-top – A top-like script that pulls stats from memcached instances and can show per-instance, total and average usage %, hit rate, number of connections, time to run the stats query, evictions, gets, sets, and read and write amounts. Best of all, it’s a very small Perl script that requires only IO::Socket and Time::HiRes. Here’s a small example of the output:
    memcache-top v0.6       (default port: 11211, color: on, refresh: 3 seconds)
    
    INSTANCE                USAGE   HIT %   CONN    TIME    EVICT   GETS    SETS    READ    WRITE
    127.0.0.1:11211         86.6%   99.4%   115     0.6ms   0.0     4114    1669    1.3M    24.2M
    127.0.0.1:11212         85.5%   59.9%   2       0.4ms   0.0     0       0       90      8055
    
    AVERAGE:                86.0%   79.6%   58      0.5ms   0.0     2057    834     682.4K  12.1M
    
    TOTAL:          0.9GB/  1.0GB           117     1.0ms   0.0     4114    1669    1.3M    24.2M
    
  • damemtop is also a nice top-like memcached tool. On the positive side, you can specify any column from “stats”, “stats items” or “stats slabs” in the configuration file, and can choose between average or one-second snapshots for each column. On the down side, it requires the YAML and AnyEvent Perl modules, so it has some uncommon dependencies.
    damemtop: Tue Jun 26 14:02:24 2012 [sort: hostname asc] [delay: 3s]
    hostname           all_version  all_fill_rate  hit_rate  evictions  curr_items  curr_connections   cmd_get  cmd_set  bytes_written  bytes_read  get_hits  get_misses
    TOTAL:
    NA                 NA           NA             NA        NA         NA          NA                 87       32       491,735        30,894      86        1
    AVERAGE:
    NA                 NA           86.00%         99.00%    NA         NA          NA                 43       16       122,933        7,723       43        1
    10.200.1.78:11211  1.2.6        86.63%         98.04%    0          0           -1.00204024880524  51       19       386,492        21,613      50        1
    10.200.1.78:11212  1.2.6        85.46%         NA        0          0           0                  0        0        11,373         31          0         0
    10.200.1.79:11211  1.2.6        87.31%         100.00%   0          0           -1.00204024880524  36       13       82,479         9,219       36        0
    10.200.1.79:11212  1.2.6        85.08%         NA        0          0           0                  0        0        11,389         31          0         0
    loop took: 0.305617094039917
    

I’m still looking around for something for apache that uses mod_status and isn’t too verbose; ideally I’d like to be able to watch memcached, apache response codes/times, and apache mod_status all in the same terminal window.
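For reference, the numbers both memcached tools show come from memcached's text-protocol "stats" command, which replies with "STAT name value" lines followed by "END". Below is a minimal Python sketch of pulling those counters and computing a hit rate as get_hits / (get_hits + get_misses). I'm assuming that formula is what the tools use (it matches the 50 hits / 1 miss = 98.04% row in the damemtop output above); the function names and the default host and port are my own choices for the example.

```python
import socket


def parse_stats(raw):
    """Parse the text reply to memcached's "stats" command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_rate(stats):
    """Return get_hits / (get_hits + get_misses) as a percentage, or None if no gets."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return 100.0 * hits / total if total else None


def fetch_stats(host="127.0.0.1", port=11211, timeout=2.0):
    """Send "stats" to a memcached instance and read until the END marker."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_stats(data.decode("ascii", "replace"))
```

Calling hit_rate(fetch_stats()) in a loop with a short sleep would give a very bare-bones version of the hit-rate column, with nothing beyond the standard library.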

Apache httpd – logging for sites with and without load balancing

There are a few unfortunate places where I have an Apache httpd server serving multiple vhosts, some behind an F5 BigIp load balancer and some with direct traffic. For sites behind the LB, the remote IP/host will always show up as the LB’s IP/host, not that of the actual client. Using the default configuration with LogFormat directives in httpd.conf, this means that we either need to define log formats per-vhost or lose the client IP in one of our scenarios (LB or no LB).

I came across a simple solution to this on Emmanuel Chantréau’s blog, and here is my condensed version of it. It sets an environment variable (“bigip-request”) if the BIOrigClientAddr request header is set. This header holds the client’s IP; it’s the BigIp proprietary version of the X-Forwarded-For header, and you could easily substitute that more standard header in the following snippet. The access log format is then chosen based on that variable: a version using BIOrigClientAddr if it is set, and a version using the normal “%h” remote host otherwise.

In httpd.conf:

# set the "bigip-request" env variable to "1" if there is a BIOrigClientAddr header in the request
SetEnvIf BIOrigClientAddr . bigip-request
# this LogFormat (BIOrigClientAddr in place of remote host) is used when the bigip-request env variable is set
LogFormat "%{BIOrigClientAddr}i %l %u %t %v \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined_lb
# and this one (remote host IP address) is used when the bigip-request env variable is NOT set
LogFormat "%h %l %u %t %v \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

And then in our vhost configuration:

# use this log format if we're NOT behind an LB (no BIOrigClientAddr header)
CustomLog logs/<%= domain %>_access_log combined env=!bigip-request
# or this format if we are
CustomLog logs/<%= domain %>_access_log combined_lb env=bigip-request
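The selection logic of the SetEnvIf and CustomLog directives can be modelled in a few lines of Python. This is only a sketch for reasoning about which format applies to a given request: logged_client is a made-up name, and unlike real HTTP header matching, the dictionary lookup here is case-sensitive.

```python
def logged_client(headers, remote_host):
    """Return (format_name, first_log_field) the way the two CustomLog lines select them."""
    # SetEnvIf BIOrigClientAddr . bigip-request
    # The regex "." needs at least one character, so an empty header does not count.
    bigip_request = bool(headers.get("BIOrigClientAddr"))
    if bigip_request:
        # CustomLog ... combined_lb env=bigip-request
        return "combined_lb", headers["BIOrigClientAddr"]
    # CustomLog ... combined env=!bigip-request
    return "combined", remote_host
```

Because the two env= conditions are exact opposites, every request is written to the access log exactly once, in one format or the other.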

Apache catchall vhost

As mentioned in one of my recent posts, I occasionally have to set up catchall pages in Apache. The general idea is usually that I either want a vhost that serves one page for any conceivable request, or that I moved something and want to alert the visitor, but provide a formula-based link to the new content. Assuming you have mod_rewrite, this is relatively simple.

In your vhost configuration (or .htaccess), you just need three lines:

RewriteEngine on
RewriteCond %{REQUEST_URI} !^/index\.php
RewriteRule ^ /index.php [L]

This will internally rewrite every request for the vhost to /index.php (the URL in the visitor’s browser does not change). Within your PHP script, you can access the actual request URI through $_SERVER["REQUEST_URI"]. The script that I’m currently using for an internal page is:

$newServer = "http://foo.example.com:12345";
 
if($_SERVER["REQUEST_URI"] == "/" || $_SERVER["REQUEST_URI"] == "/index.php")
  {
    header("Location: ".$newServer);
    exit; // stop here so nothing else is sent after the redirect header
  }
else
  {
    // escape the request URI before echoing it back into the page
    $newURL = htmlspecialchars($newServer.$_SERVER["REQUEST_URI"], ENT_QUOTES);
    echo '<html><head><title>Page Moved</title>';
    echo '<META HTTP-EQUIV="refresh" CONTENT="5;URL='.$newURL.'">';
    echo '</head><body>';
    echo '<p>The page you are looking for is best found at:</p>';
    echo '<p><strong><a href="'.$newURL.'">'.$newURL.'</a></strong></p>';
    echo '<p>You will be automatically redirected after 5 seconds. If this does not happen, click the link above.</p>';
    echo '</body></html>';
  }

This script takes two distinct actions:

  • If the requested URL is / or /index.php, it transparently redirects to a different URL (and port).
  • Otherwise, it displays a “page moved to” message and uses a Meta-Refresh to redirect after 5 seconds.
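The two branches can also be written as a small pure function, which makes them easy to test outside of a web server. This Python sketch mirrors the decision logic of the PHP script and escapes the request URI before placing it in the page; catchall_response and its return shape are hypothetical, chosen only for the example.

```python
import html


def catchall_response(request_uri, new_server="http://foo.example.com:12345"):
    """Return (status, headers, body) mirroring the two branches of the PHP script."""
    if request_uri in ("/", "/index.php"):
        # PHP's header("Location: ...") sends a 302 unless told otherwise.
        return 302, {"Location": new_server}, ""
    new_url = html.escape(new_server + request_uri, quote=True)
    body = (
        "<html><head><title>Page Moved</title>"
        '<meta http-equiv="refresh" content="5;URL=' + new_url + '">'
        "</head><body>"
        '<p><a href="' + new_url + '">' + new_url + "</a></p>"
        "</body></html>"
    )
    return 200, {"Content-Type": "text/html"}, body
```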

New web server, WP optimization

Tonight, more or less on a whim, I moved my blog from my older (dual 1GHz Pentium III Coppermine, 1GB RAM, 10k RPM SCSI disks, Compaq Proliant DL360 G1, OpenSuSE 10.2 32-bit) web server to my newer one (dual 1.4GHz Pentium III, 2GB RAM, 10k RPM SCSI disks, HP Proliant DL360 G2, CentOS 5.3 32-bit). I did some profiling with ab (ApacheBench), and just moving from one server to the other got some serious performance gains (I was profiling with runs of 1000 requests total, 10 concurrent requests). I also added the W3 Total Cache WordPress plugin, which got the numbers to look even better!

As a side note, this was all done pretty quickly (moving the database and tarball for the vhost, installing the plugin, changing DNS), so please give me a heads-up if you experience any problems.

The numbers are rather impressive:

                     Total Time (s)   RPS         Avg. Connection Time (ms)
Old Server           1192.252         838.75      11,893
New Server           569.121          1757.09     5,667
Default W3tc Config  23.754           42,098.44   237
Tuned W3tc           12.281           81,428.76   122

All tests were performed on my workstation, a Dell Precision 470, two dual-core Xeons at 2.8 GHz, 2GB RAM, 16GB swap, OpenSuSE 11.1 64-bit. This was on the same LAN and subnet as the servers, with the workstation connected via a 1Gbps copper Ethernet link and the web-serving interfaces of the servers connected via 100Mbps (There’s a trunk in between, from the gigabit aggregation switch to the 100Mbps distribution switch).

Apache2 – list Name-Based Virtual Hosts

Here’s a little tidbit that I never knew until I had an Apache2 name-based virtual host problem: httpd -S lists the vhosts that are being served by Apache, and how they were parsed from the config files.

The output on one of my servers looks something like:

[root@web2 vhosts.d]# httpd -S
VirtualHost configuration:
wildcard NameVirtualHosts and _default_ servers:
_default_:443          web2.jasonantman.com (/etc/httpd/vhosts.d/ssl-host.conf:7)
*:80                   is a NameVirtualHost
         default server www.jasonantman.com (/etc/httpd/vhosts.d/000-default.conf:1)
         port 80 namevhost www.jasonantman.com (/etc/httpd/vhosts.d/000-default.conf:1)
         port 80 namevhost rackman.jasonantman.com (/etc/httpd/vhosts.d/rackman.jasonantman.com.conf:1)
         port 80 namevhost whatismyip.jasonantman.com (/etc/httpd/vhosts.d/whatismyip.jasonantman.com.conf:1)
Syntax OK

This is quite useful in debugging vhost problems, especially those pesky times when a request that should go to a specific vhost is being served by the default (in my case at this time, I had two ServerName directives instead of a ServerName and a ServerAlias).
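If you have a lot of vhosts, the namevhost lines are easy to post-process. Here is a rough Python sketch that parses output in the format shown above and reports any port/name pair that appears more than once. The regular expression is based only on the sample output in this post, so treat it as an assumption; other httpd versions may print additional or differently formatted lines.

```python
import re
from collections import Counter

# Matches lines such as:
#   port 80 namevhost www.jasonantman.com (/etc/httpd/vhosts.d/000-default.conf:1)
NAMEVHOST = re.compile(r"^\s*port (\d+) namevhost (\S+) \((.+):(\d+)\)\s*$")


def parse_namevhosts(output):
    """Return a list of (port, server_name, config_file, line_number) tuples."""
    vhosts = []
    for line in output.splitlines():
        match = NAMEVHOST.match(line)
        if match:
            port, name, path, lineno = match.groups()
            vhosts.append((int(port), name, path, int(lineno)))
    return vhosts


def duplicate_names(vhosts):
    """Return (port, server_name) pairs that appear more than once."""
    counts = Counter((port, name) for port, name, _, _ in vhosts)
    return sorted(key for key, count in counts.items() if count > 1)
```

Feeding it the captured output of httpd -S gives a quick list of which config file and line each name came from.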

Apache holding strong, IIS declining

According to the latest (June 2009) Netcraft web server survey, the Free/Open Source Apache web server is now hosting 50.46% of all active web sites surveyed (about 38 Million). Microsoft’s IIS server is at 28.05% (or about 29 Million) – a 7.64% decline from IIS’s May 2009 statistics. Interestingly, Google holds 12.2%; presumably most of that is their own content or content generated by their applications.

This is nothing new – both Free/Open Source software and Unix-related stuff have always had a stronger share in the server (and Internet) market than Microsoft products. And, despite all of Microsoft’s FUD, it’s clear that Apache is still more popular than IIS by a large margin – probably in no small part due to the extensibility and scalability of Apache, and its security record (just take a look at the difference in system calls).

The real shining example, however, comes from looking at the stats on the Internet’s million busiest sites – 66.26% running Apache and only 18.77% running IIS, which has been constant for the better part of the last year. That says quite a bit about the stability and scalability of Apache. Not to mention that a lot of the really big sites run their own custom-modified versions of Apache which may or may not be identified as Apache in a survey.

WordPress Installation, Finished

Found this from a month and a half ago, waiting as a draft:

So, I mostly finished the WordPress installation. I got everything for WordPress up and running, tested my Blogger URL redirection script and then switched over my subdomain redirection.

The blogger redirection takes two parts, but is in fact quite simple. First, I went into the directory where the Blogger content had lived – /srv/www/htdocs/blog and moved everything in there into another directory, out of the way. I then created a .htaccess in the directory like:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /blog/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* index.php [L]
</IfModule>

All this does is use mod_rewrite to serve blog/index.php up for every page request. In index.php, I handle the important URL forms for Blogger – archives, tags, feeds, and posts – and redirect to the appropriate place. For archives, I just parse out the year and month from the Blogger URL and redirect to the proper page for WP. The feed is straight redirection. The tags (“labels” in Blogger parlance) are pulled out of the URL, have spaces (after urldecode()) replaced with dashes and are redirected to the right tag for WP.

The posts, on the other hand, were a bit more difficult. My solution ended up being parsing the post name out of the URL. When I used the import tool, WP kept the original Blogger URLs in the wp_postmeta table with a meta_key of “blogger_permalink”. I just looked for a Blogger permalink matching the title from the Blogger URL, found the corresponding post ID and redirected to the proper new WP URL.

The code for index.php, for me, looks something like:

<?php
// redirect old Blogger URLs in /blog to new WordPress in /wp
// strip only the leading "/blog" (str_replace would also mangle "/blogger.html")
$request = mysql_real_escape_string(preg_replace('#^/blog#', '', $_SERVER['REQUEST_URI']));
 
// handle constant stuff like feeds and top-level pages
// TODO
if(strpos($request, "_archive.html") !== false)
{
    // redirect to an archive
    $request = substr($request, strpos($request, "/", 1)+1);
    $ary = explode("_", $request);
    $redirect_to = "http://blog.jasonantman.com/".$ary[0]."/".$ary[1]."/";
    header("Location: ".$redirect_to);
    die();
}
elseif(strpos($request, "labels/") !== false)
{
    // redirect to a tag page
    $redirect_to = substr($request, strpos($request, "labels/")+7);
    $redirect_to = str_replace(".html", "", $redirect_to);
    $redirect_to = urldecode($redirect_to);
    $redirect_to = str_replace(" ", "-", $redirect_to);
    $redirect_to = "http://blog.jasonantman.com/tags/".strtolower($redirect_to)."/";
    header("Location: ".$redirect_to);
    die();
}
elseif(strpos($request, "/blogger.html") !== false) // strpos() returns 0 (falsy) for a match at the start
{
    // redirect to main blog
    header("Location: http://blog.jasonantman.com/");
    die();
}
elseif(strpos($request, "/atom.xml") !== false) // strpos() returns 0 (falsy) for a match at the start
{
    // redirect to new feed
    header("Location: http://blog.jasonantman.com/feed/");
    die();
}
 
// handle the posts, months, tags, etc.
$fail = false;
$redirect_to = "";
$conn = mysql_connect(); // no "or die" here, so the failure handling below can run
if(! $conn)
{
    error_log("SCRIPT ".$_SERVER['PHP_SELF'].": "."Unable to connect to MySQL.");
    $fail = true;
}
$select = mysql_select_db('wordpress');
if(! $select)
{
    error_log("SCRIPT ".$_SERVER['PHP_SELF'].": "."Unable to select DB wordpress.");
    $fail = true;
}
$query = "SELECT m.meta_key,m.meta_value,p.post_name,p.post_date FROM wp_postmeta AS m LEFT JOIN wp_posts AS p ON m.post_id=p.ID WHERE m.meta_key='blogger_permalink' AND m.meta_value='".$request."';";
$result = mysql_query($query);
if(! $result)
{
    error_log("SCRIPT ".$_SERVER['PHP_SELF'].": "."Error in query: ".$query." ERROR: ".mysql_error());
    $fail = true;
}
if(! $result || mysql_num_rows($result) < 1)
{
    // couldn't find an appropriate page
    // TODO: find a better way... for now just redirect to the month page
    $ary = explode("/", $request);
    if(count($ary) > 3)
    {
        $redirect_to = "http://blog.jasonantman.com/".$ary[1]."/".$ary[2]."/";
    }
    else
    {
        $redirect_to = "http://blog.jasonantman.com/";
    }
}
else
{
    $row = mysql_fetch_assoc($result);
    $redirect_to = "http://blog.jasonantman.com/".date("Y", strtotime($row['post_date']))."/".date("m", strtotime($row['post_date']))."/".$row['post_name'];
}
 
if($fail)
{
    // redirect to main page with 302
    Header( "Location: http://blog.jasonantman.com/" ); // implicit 302
}
else
{
    // redirect to the post or month
    Header( "HTTP/1.1 301 Moved Permanently" );
    Header( "Location: ".$redirect_to );
}
 
?>
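The archive and label mappings are the easiest parts to reason about in isolation, so here they are again as a Python sketch of the same string handling. The example URLs in the usage note are hypothetical (I'm assuming Blogger archive pages are named like 2009_06_01_archive.html), and the function names are made up for this illustration.

```python
from urllib.parse import unquote_plus

NEW_BASE = "http://blog.jasonantman.com/"


def archive_redirect(request):
    """Map a Blogger archive URL such as /2009_06_01_archive.html to the year/month page."""
    filename = request.rsplit("/", 1)[-1]
    year, month = filename.split("_")[:2]
    return NEW_BASE + year + "/" + month + "/"


def label_redirect(request):
    """Map a Blogger label URL such as /labels/Some%20Tag.html to the WordPress tag page."""
    label = request.split("labels/", 1)[1]
    label = label.replace(".html", "")
    # unquote_plus is the closest match to PHP's urldecode() (it also decodes "+").
    label = unquote_plus(label).replace(" ", "-")
    return NEW_BASE + "tags/" + label.lower() + "/"
```

For example, archive_redirect("/2009_06_01_archive.html") yields the /2009/06/ page, and label_redirect("/labels/Free%20Software.html") yields /tags/free-software/.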

So, it now looks like I’m pretty much done with setup, and even get to keep my links. The one interesting problem that will crop up is due to the fact that, at the moment, I’m hosting off of a dynamically IPed residential internet connection, so I’m at http://jantman.dyndns.org:10011. The problem lies in the fact that Blogger used this for its URIs and permalinks, and it seems that (though http://blog.jasonantman.com uses a 302, not a 301, to redirect) Google, Technorati, etc. have indexed my site with this hostname and port instead of the redirecting subdomain. Normally this wouldn’t be a problem, but I plan on soon moving to a business hosting account with 5 static IPs and port 80 open. Which means that soon the subdomain will become “real”… and all of those pesky dyndns.org:10011 links will be obsolete. The only fix I can think of is this: once I make the switch to static IP and port 80 (which will also include moving all of my subdomains to name-based virtual hosts), I’ll have to craft RewriteRules or redirect rules to replace http://jantman.dyndns.org:10011/wp/ with http://blog.jasonantman.com/, update DynDNS with my new static IP, and keep a default vhost listening on 10011 to provide rule-based redirection to the new subdomain. Eek.