Instagram – Scaling a Startup

If you keep up with news on the ‘net, you may have heard about photo sharing startup Instagram, which was purchased by Facebook for $1 billion just 9 days after they released their Android app. Well, not only are they based mostly on open source software (well, yeah, pretty much a given for web startups these days), but they’ve dealt with scaling issues like a million new users in 12 hours, and they’re talking about it. The slide deck from a talk that co-founder Mike Krieger gave is up on TechCrunch and Scribd, along with a High Scalability article about it, and an earlier High Scalability article that gives an overview of the company and some details of what they’re running and how. Instagram Engineering also has a Tumblr account, with a bunch of cool posts like “Keeping Instagram up with over a million new users in twelve hours” (which specifically mentions statsd, dogslow, PGFouine, node2dm and some database stuff) and “What Powers Instagram: Hundreds of Instances, Dozens of Technologies”, which talks about their OS and hosting (Ubuntu 11.04 on EC2), load balancing (nginx, DNS and Amazon Elastic Load Balancer), Django, Redis, Solr, Munin, etc.

This is a really cool company, doing some really cool stuff, at a really large scale, and growing fast.

On another note, I’m continuing my attempt to read all of the excellent Puppet articles on Brice Figureau’s (aka masterzen) blog. It’s taking a while, as it’s really good, in-depth information that I want to remember, but I’d highly recommend it for anyone working with Puppet.

A Collection of Great Links on Monitoring, SysAdmin, Scaling, etc.

I’ve had a bunch of tabs open in my browser for a while – stuff that I read, thought was wonderful, and wanted to comment on. Rather than letting them pile up forever, here’s a collection of links that I thought were really interesting or insightful…

  • MongoDB is Fantastic for Logging – I was looking into some log storage ideas, and came across this post (on the MongoDB blog) about why Mongo is well-suited to storing logs.
  • Sensu – a Ruby-based cloud-oriented monitoring system. It uses AMQP/RabbitMQ to communicate between the clients and server, which is a really big part of what I think monitoring should be.
  • High Scalability – this is one of the few blogs I follow on a regular basis. Some really wonderful stuff, and great food for thought.
  • Everything Sysadmin: Fear of Rebooting – A great article on Tom Limoncelli’s blog about why we fear rebooting machines and why this is bad – moreover, why we should reboot often.
  • The Netflix Tech Blog: Fault Tolerance in a High Volume, Distributed System – This is a really, really cool post from Netflix about how latency increases in a single subsystem can bring down their whole API in seconds, and how they combat this. Really cool stuff.
  • Ars Technica – Exclusive: a behind-the-scenes look at Facebook release engineering – Ars Technica is more or less “mainstream media” to me, but this is a really interesting writeup on Facebook’s release engineering process, albeit at a higher level. Specifically, it talks about their automation, phased rollouts, rollbacks, and how they release the Facebook codebase as a single giant binary, sent out via BitTorrent.
  • Monitoring Sucks blog posts (github) – The “monitoring sucks” movement really speaks to me, having worked extensively with Nagios, Cacti, and similar technologies. Specifically, having rolled out monitoring in a variety of “weird” scenarios (a lot of monitoring devices or whole networks behind NAT, on dynamic IP connections, or otherwise unreachable from a central server), I’ve felt a lot of pain in the current way of doing things. There are a lot of really good thoughts linked here, especially the “wonderland” series by Patrick Debois and the “Latency sucks” series by Lindsay Holmwood. This really got me thinking about my ideal monitoring system, which among other things, would integrate the “alerting” functions of Nagios with graphing/trending and correlation, would be based on some sort of message queue architecture (that supports multiple levels of proxies that could gracefully support NAT and multiple hops), and would be configured almost totally on the originating “client” (unlike the pain of distributed Nagios/Icinga).
  • Mike Brittain – Metrics Driven Engineering at Etsy (3.2MB PDF) – presentation slides. I’d love to see the video. Some really good ideas about putting the science back into being a SysAdmin. Also mentions a few tools I really want to play around with (including ganglia, graphite, logster and StatsD). Also mentions adding PHP memory usage and time to Apache logs, which I can’t believe I never thought of.
  • Some really thoughtful posts from R. I. Pienaar on “Thinking about monitoring frameworks” and “Composable Architectures”. Some really good stuff, but what else would you expect from someone like this?
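
To make the “ideal monitoring system” idea from the Monitoring Sucks item a bit more concrete, here’s a very rough Python sketch of the shape I have in mind: checks are defined on the client, results go out as messages, and a proxy forwards them toward a central consumer. Everything in it is an assumption for illustration only: the in-process queue.Queue objects stand in for a real message broker, and the names (run_checks, relay, web01, site-proxy) and the message fields are made up, not taken from any of the tools linked above.

```python
import json
import queue
import time


def run_checks(host, checks):
    """Run the checks configured on this client; return one JSON message per check."""
    messages = []
    for name, check in checks.items():
        ok, value = check()
        messages.append(json.dumps({
            "host": host,
            "check": name,
            "status": "ok" if ok else "critical",
            "value": value,
            "timestamp": time.time(),
            "hops": [host],  # path the message has taken so far
        }))
    return messages


def relay(name, inbound, outbound):
    """Forward every waiting message one hop onward, recording this relay's name."""
    while not inbound.empty():
        message = json.loads(inbound.get())
        message["hops"].append(name)
        outbound.put(json.dumps(message))


def demo():
    # Hypothetical checks, defined entirely on the client side.
    checks = {
        "disk_free_pct": lambda: (False, 4),
        "load_1m": lambda: (True, 0.3),
    }
    # Stand-ins for a broker queue at the remote site and one at the central server.
    site_queue, central_queue = queue.Queue(), queue.Queue()

    for message in run_checks("web01", checks):
        site_queue.put(message)

    # The proxy only needs outbound connectivity, which is what makes NAT tolerable.
    relay("site-proxy", site_queue, central_queue)

    results = []
    while not central_queue.empty():
        results.append(json.loads(central_queue.get()))
    return results


if __name__ == "__main__":
    for result in demo():
        print(result["host"], result["check"], result["status"], result["hops"])
```

The point of the sketch is that the central side never has to reach into the client’s network or know its checks in advance; it just consumes whatever arrives, and alerting and graphing could both hang off the same stream of messages.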