Random Links for Wednesday, October 24th

Some random interesting links from Slashdot for today:

SysAdmin Links of The Day

A few links that I’ve had in my “mention in a blog post” category for a while:

Instagram – Scaling a Startup

If you keep up with news on the ‘net, you may have heard about photo-sharing startup Instagram, which was purchased by Facebook for $1 billion just 9 days after releasing its Android app. Not only is it built mostly on open source software (pretty much a given for web startups these days), but the team has dealt with scaling issues like a million new users in 12 hours, and they’re talking about it. There’s the slide deck from a talk that co-founder Mike Krieger gave (on TechCrunch and Scribd), a High Scalability article about it, and an earlier High Scalability article that gives an overview of the company and some details of what they’re running and how. Instagram Engineering also has a tumblr account with a bunch of cool posts, like Keeping Instagram up with over a million new users in twelve hours (which specifically mentions statsd, dogslow, PGFouine, node2dm and some database stuff) and What Powers Instagram: Hundreds of Instances, Dozens of Technologies, which talks about their OS and hosting (Ubuntu 11.04 on EC2), load balancing (nginx, DNS and Amazon Elastic Load Balancer), Django, Redis, Solr, Munin, etc.

This is a really cool company, doing some really cool stuff, at a really large scale, and growing fast.

On another note, I’m continuing my attempt to read all of the excellent Puppet articles on Brice Figureau’s (aka masterzen) blog. It’s taking a while, as it’s really good, in-depth information that I want to remember, but I’d highly recommend it for anyone working with Puppet.

A Collection of Great Links on Monitoring, SysAdmin, Scaling, etc.

I’ve had a bunch of tabs open in my browser for a while – stuff that I read, thought was wonderful, and wanted to comment on. Rather than risk letting it pile up forever, here’s a collection of links that I thought were really interesting or insightful…

  • MongoDB is Fantastic for Logging – I was looking into some log storage ideas, and came across this post (on the MongoDB blog) about why Mongo is well-suited to storing logs.
  • Sensu – a Ruby-based cloud-oriented monitoring system. It uses AMQP/RabbitMQ to communicate between the clients and server, which is a really big part of what I think monitoring should be.
  • High Scalability – this is one of the few blogs I follow on a regular basis. Some really wonderful stuff, and great food for thought.
  • Everything Sysadmin: Fear of Rebooting – A great article on Tom Limoncelli’s blog about why we fear rebooting machines and why this is bad – moreover, why we should reboot often.
  • The Netflix Tech Blog: Fault Tolerance in a High Volume, Distributed System – This is a really, really cool post from Netflix about how latency increases in a single subsystem can bring down their whole API in seconds, and how they combat this.
  • Ars Technica – Exclusive: a behind-the-scenes look at Facebook release engineering – Ars Technica is more or less “mainstream media” to me, but this is a really interesting writeup on Facebook’s release engineering process, albeit at a higher level. Specifically, it talks about their automation, phased rollouts, rollbacks, and how they release the Facebook codebase as a single giant binary, sent out via BitTorrent.
  • Monitoring Sucks blog posts (github) – The “monitoring sucks” movement really speaks to me, having worked extensively with Nagios, Cacti, and similar technologies. Specifically, having rolled out monitoring in a variety of “weird” scenarios (a lot of monitoring devices or whole networks behind NAT, on dynamic IP connections, or otherwise unreachable from a central server), I’ve felt a lot of pain in the current way of doing things. There are a lot of really good thoughts linked here, especially the “wonderland” series by Patrick Debois and the “Latency sucks” series by Lindsay Holmwood. This really got me thinking about my ideal monitoring system, which, among other things, would integrate the “alerting” functions of Nagios with graphing/trending and correlation, would be based on some sort of message queue architecture (with multiple levels of proxies that could gracefully handle NAT and multiple hops), and would be configured almost totally on the originating “client” (unlike the pain of distributed Nagios/Icinga).
  • Mike Brittain – Metrics Driven Engineering at Etsy (3.2MB PDF) – presentation slides. I’d love to see the video. Some really good ideas about putting the science back into being a SysAdmin. It mentions a few tools I really want to play around with (including ganglia, graphite, logster and StatsD), as well as adding PHP memory usage and time to Apache logs, which I can’t believe I never thought of.
  • Some really thoughtful posts from R. I. Pienaar on Thinking about monitoring frameworks and Composable Architectures. Some really good stuff, but what else would you expect from someone like this.
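
Since StatsD comes up in several of the links above, here is a minimal sketch of why it is so easy to experiment with: a client just sends a short plain-text line over UDP. The metric names below are made up for illustration, and the host and port (127.0.0.1:8125, the usual StatsD default) are assumptions about where a daemon might be listening.

```python
import socket


def format_metric(name, value, kind="c", sample_rate=None):
    """Build a StatsD line such as 'web.logins:1|c'.

    kind is 'c' (counter), 'ms' (timer) or 'g' (gauge).
    """
    line = f"{name}:{value}|{kind}"
    if sample_rate is not None:
        # Optional sampling hint, e.g. '|@0.1' for 10% sampling.
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected or awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, purely for illustration.
counter_line = format_metric("web.logins", 1)
timer_line = format_metric("web.render_time", 320, "ms")
```

Calling send_metric(counter_line) would push the counter to a local daemon. Because it is UDP, the application is not held up waiting on the monitoring system, which is a big part of the appeal.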

Petit for Log Analysis

I recently discovered the petit program for log analysis. It’s a simple tool to pull out useful information from syslog logs in a variety of ways. I’ve only used it a few times so far, mainly on logs from problems I’ve already solved but didn’t know the cause of at first. So far, it’s proven quite useful. Here are a few examples:

  • petit --wordcount /var/log/messages – displays ordered count of words appearing in the log. My first step, especially if “warning”, “error” or “fatal” shows up near the top…
  • petit --hash --fingerprint /var/log/messages – hashes the log, filtering out variable fields (such as numerics and datestamps), and displays a count of matching lines. Absolutely wonderful for web error logs, as it removes client IP addresses, line numbers, etc.
  • petit --mgraph /var/log/messages – graph messages per minute for the first hour of the log (ASCII of course)
  • petit --hgraph /var/log/messages – same as above, but messages per hour for the first day
  • Petit will also read from stdin with the --Xgraph options (--mgraph, --hgraph), so you can cat logfile | grep word | petit --mgraph
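
To make the hashing idea concrete, here is a rough Python sketch of the general technique: strip the datestamp, mask the numbers, and count what is left. This is not petit’s actual implementation, and the function names and sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Matches a classic syslog datestamp such as "Oct 24 10:01:02 ".
SYSLOG_PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s+")


def fingerprint(line):
    """Strip the leading datestamp and mask every run of digits with '#'."""
    line = SYSLOG_PREFIX.sub("", line.rstrip("\n"))
    return re.sub(r"\d+", "#", line)


def count_fingerprints(lines):
    """Return (count, fingerprint) pairs, most frequent first."""
    counts = Counter(fingerprint(l) for l in lines if l.strip())
    return [(n, fp) for fp, n in counts.most_common()]


# Invented sample data: two timeouts that differ only in numbers.
sample = [
    "Oct 24 10:01:02 web1 httpd[1234]: client 10.0.0.5 timeout after 30s",
    "Oct 24 10:01:09 web1 httpd[1240]: client 10.0.0.77 timeout after 31s",
    "Oct 24 10:02:00 web1 sshd[99]: session opened for user deploy",
]

for n, fp in count_fingerprints(sample):
    print(n, fp)
```

The two timeout lines collapse into a single entry once the PIDs, IP addresses and durations are masked, which is exactly why this kind of output is so handy for noisy web error logs.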

Just one note – this tool appears to work only on standard syslog-formatted logs. If some non-datestamped lines managed to work their way into the log (e.g. someone used echo >> logfile instead of logger), it will choke.
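
One way around that is to weed out the stray lines before handing the log to petit. The following is a small sketch of that idea, assuming the classic "Mon DD HH:MM:SS" syslog datestamp. The regex and function name are my own, not anything petit provides.

```python
import re

# Assumes the traditional syslog datestamp, e.g. "Oct 24 10:01:02 host ...".
SYSLOG_LINE = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s")


def split_syslog(lines):
    """Separate datestamped syslog lines from everything else."""
    good, bad = [], []
    for line in lines:
        (good if SYSLOG_LINE.match(line) else bad).append(line)
    return good, bad


# Invented sample: the middle line was appended with echo, not logger.
sample = [
    "Oct 24 10:01:02 web1 cron[42]: job started\n",
    "backup finished OK\n",
    "Oct  4 03:00:00 web1 cron[43]: job finished\n",
]
good, bad = split_syslog(sample)
```

Writing the good lines to a temporary file (or to stdout, piped into petit) keeps the analysis working, and the bad list shows what was dropped.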

Many thanks to Scott McCarty for this wonderful tool!

Firefox QR Code Generator

Just a quick little tip: I happened upon the Mobile Barcoder Firefox add-on the other day. It generates QR Code barcodes for text or links, right in your browser. While my Droid3 has a full keyboard, sometimes I still want to quickly send links from my desktop browser session to my phone. Firefox Sync helps a lot but is a bit slow on the phone (since I usually have 100+ tabs open between all of my desktop Firefox sessions), and email is an option but a bit slower.

There are two caveats about this add-on though:

  1. The feature to generate a QR code for the URL of the current page shows up in the status bar, which isn’t shown in modern versions of Firefox. You’ll need to enable the Add-on bar.
  2. It uses the http://mobilecodes.nokia.com/ server to generate barcodes, so it’s dependent on connectivity and that service.