Leap Year Windows Azure Cloud Outage

I haven’t talked about Microsoft in quite a while (mainly because I don’t follow mainstream tech news as much anymore), but I happened by a very interesting post on the Windows Azure blog the other day. It’s a very detailed postmortem of the major outage of the Windows Azure cloud service which occurred from 4:00 PM PST on February 28th through 2:15 AM on March 1st. Before I get into any of the details, I should say that it really is a nice, well-done post. And the fact that they’re willing to do such a detailed, public postmortem – and admit the failures that they did – is a step in the right direction for Microsoft (a company that I don’t particularly care for, to put it lightly).

I’m going to gloss over the majority of the post, though I highly recommend that anyone interested in running web-scale services, specifically highly available ones, read it. The general overview (really just the points that are germane to my discussion) is as follows:

An agent running inside the guest VM instances (i.e. domU) communicates with a counterpart on the host OS (i.e. dom0) over an encrypted channel, authenticated by certificate. The certs are generated and passed from the guest to the host when the guest instance is first initialized, which means when an app is first deployed, scaled out, OS updated, or when an app is reinitialized on a new host. This cert was generated for a 1-year validity period, by adding 1 to the integer year – hence, the generation process failed on February 29th of a leap year, as the cert end date wasn’t valid. When the cert generation failed, the guest agent essentially stopped cold.

The host agent waited for a 25 minute timeout, then re-initialized the guest and started over. After three of these failures, the host assumes there’s a hardware error (since the guest would have reported a more specific error otherwise), declares itself in an error state, and tries to move its current workload over to another host. That re-initializes the guests on the new host, thereby causing a chain reaction of failures in this case.

Skip forward the 2-1/2 hours it took them to identify the problem, and a further 2-1/2 hours to get a fix ready. They fast-tracked their fix to 7 clusters that had already been in the process of a software update, but ended up with those clusters in an inconsistent state with incompatibilities between the guest and host networking subsystems, bringing down previously-unaffected instances on these clusters.

This whole scenario offers a few important points on both the development and operations sides:

Inputs need error checking, and errors need to be raised. So the first problem here was the failed cert generation. I’ll leave alone the fact that, in my opinion, doing math on the integer year of a date is a high school or college programming mistake, and never should have been made by someone doing platform coding for a major company (believe it or not, 25% of years are leap years </sarcasm>). If whatever code was generating the cert was smart enough to check the cert end date validity and error out, that error should have been pushed up the stack to somewhere where it could be handled – or, at least, sent to a central log server that does error trending.
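To make the "math on the integer year" problem concrete, here is a minimal sketch in Python. It is not Microsoft's code (the postmortem doesn't publish it), and both function names are made up for illustration. The first function reproduces the class of bug; the second clamps February 29th to February 28th, which is only one of several reasonable policies.

```python
import calendar
from datetime import date


def add_one_year_naive(start: date) -> date:
    # Illustrative only: bump the integer year and keep month/day as-is.
    # For a February 29 start date the result does not exist, so
    # date.replace() raises ValueError.
    return start.replace(year=start.year + 1)


def add_one_year_safe(start: date) -> date:
    # Same calendar day next year, clamped to the last valid day of the
    # month (so Feb 29 becomes Feb 28). Clamping is an assumed policy;
    # rolling over to March 1 would be equally defensible.
    target_year = start.year + 1
    last_day = calendar.monthrange(target_year, start.month)[1]
    return start.replace(year=target_year, day=min(start.day, last_day))
```

Calling add_one_year_naive(date(2012, 2, 29)) raises ValueError, which is at least a loud failure; the real lesson is that the caller has to catch that error and report it somewhere useful.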

Secure communications when provisioning need an insecure error path. This is somewhat connected to the previous point. If the normal process of creating a new instance and communicating errors up the stack relies on certs and authentication or encryption, there should be some method of communicating errors with that process either up the stack, or to a separate event correlation/trending system. Errors with a certificate-based system are not unusual, and even something as simple as a vastly incorrect time set on the guests could have caused this same problem. In environments where management/control communication between levels of a system are encrypted or authenticated, there should be some way for lower levels of the system to deliver a meaningful error message “somewhere”. Even if this is just a syslog server or web service that listens for errors and can escalate a warning when the numbers spike, it’s a useful alarm and debugging tool.

Autonomous systems shouldn’t lightly assume hardware failures. It’s arrogance for a host system to assume that just because it can’t instantiate new guests, a hardware failure exists. This entire incident is a perfect example that, at least if hardware error indicators are properly monitored, it’s more likely for a software problem to be falsely identified as a hardware problem than the other way around. All of my points are somewhat related, but I can think of many more reasons why a new guest can’t be instantiated that are software-related rather than hardware-related.

Autonomous control mechanisms need historical trending, and need to call for help if this looks wrong. These host systems tried to instantiate new guests three times, waiting 25 minutes in between, and then declared themselves bad and tried to migrate guests to other hosts. From what I understand, Microsoft got it right in having a “kill switch” that prevented further migration of guests. What they didn’t have right was reporting of autonomous actions (guest migration) to a central location that performs trending. The 25 minute timeout with three attempts is a great safety feature, but if the status of guest creation actions was reported to a central server, it would have been much more quickly apparent that 100% of guest creations in the past, say, 10 minutes, had failed – across all clusters. I know plenty of shops that do little, if any, real-time analysis and historical comparisons of their log data. But when systems are designed to perform self-healing and autonomous actions, it’s imperative that these actions are tracked in near-real-time, compared to historical averages, and that deviation from a baseline is identified and escalated to humans.
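As a rough sketch of what that kind of trending could look like, here is a small sliding-window failure-rate monitor in Python. Everything in it (the class name, the baseline rate, the 3x escalation factor, the minimum event count) is an assumption for illustration, not something taken from the Azure postmortem.

```python
from collections import deque
from typing import Deque, Tuple


class FailureRateMonitor:
    """Tracks recent guest-creation outcomes and flags abnormal failure rates."""

    def __init__(self, window_seconds: float, baseline_rate: float,
                 min_events: int = 20) -> None:
        self.window_seconds = window_seconds  # e.g. 600 for "the past 10 minutes"
        self.baseline_rate = baseline_rate    # historical failure rate, 0.0 to 1.0
        self.min_events = min_events          # don't alarm on a tiny sample
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, succeeded: bool) -> None:
        # Append the new outcome, then drop anything older than the window.
        self._events.append((timestamp, succeeded))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self) -> float:
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_escalate(self, factor: float = 3.0) -> bool:
        # Escalate to a human when the recent rate is well above baseline.
        if len(self._events) < self.min_events:
            return False
        return self.failure_rate() > self.baseline_rate * factor
```

Each host would report every creation attempt to a central instance of something like this; with a 2% baseline, a window where every single attempt fails trips the alarm as soon as the minimum sample size is reached, long before three 25-minute timeouts have elapsed.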

Release procedures are more, not less, important when the sky is falling. The extended downtime of the last seven clusters was because of an improperly QA’ed update that was pushed out bypassing the normal release and testing procedures. As a matter of fact, it was so poorly QA’ed that the update totally broke networking for the guest VMs, and was still pushed out. I’m sure this was more of a management/executive decision than one made by the actual engineers, but organizations (even management) need to understand that when the sky is falling, services are down, and everybody is stressed, it’s more likely for mistakes and oversights to happen, and this is when a proper, well-documented QA and release procedure (including phased rollout) is most important. Failure to follow these procedures results in exactly what happened in this case – making an already bad problem much worse.

Even I can’t blame Microsoft specifically for all this (though the whole thing would have been avoided if they just represented timestamps as integers like the rest of us…), but it is a good opportunity for us all to learn from a major incident at a “pretty well known” company.


How to Find Network Settings in various operating systems

Since I’m occasionally asked these things, here’s how to find some commonly needed network information in various operating systems – for now, Windows, Mac OS X and Linux, as well as Android and iOS (iPhone/iPad/etc.). My assumption is that the people running BSD, Solaris, etc. (and yes, all of those have visited my blog) know this stuff. I won’t go into descriptions of what these “strange” things are.

First off, I know that most desktop computer users are used to doing everything graphically. If you know what you want to do, the command line is a lot faster. There’s no reason to fear it. Watching a cooking show might be wonderful if you have no idea how to cook a meal, but it’s not very efficient if you just need the list of ingredients.

To start, here’s how to get a command prompt:

  • Windows: For XP and before, Start -> Run -> type “cmd”, click Ok. For Vista, Start -> type “cmd”, click it.
  • Mac OS X: Applications -> Utilities -> Terminal
  • Linux/Unix: Konsole, Xterm, whatever else you use, or just drop to command line/runlevel 3

In the following examples, anything in monospace font should be typed exactly as is at the command prompt. Note: some of this may need to be run as Administrator/root. If you’re using Windows Vista or newer, once “cmd” appears under Programs, right-click it and select Run as Administrator. On Mac or Linux, you may have to run the command with sudo, and you may have to specify its absolute (full) path.

Default Gateway – on a simple home network, this is the IP address of your router.

  • Windows: route PRINT, look for the line beginning with “Default Gateway:”
  • Mac OS X: route get default, look for the “gateway:” line.
  • Linux: sudo /sbin/route, look for the line beginning with “default”; the address will be in the “Gateway” column. If your system uses iproute2, ip route show.
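If you ever need to pull the gateway out of that output in a script, a minimal Python sketch for the iproute2 case might look like the following. The function name is made up, and it assumes the usual "default via <address> dev <interface>" line format; it only handles the first such line.

```python
from typing import Optional


def default_gateway(ip_route_output: str) -> Optional[str]:
    # Scan the text printed by `ip route show` for the default route and
    # return the address that follows "via", or None if there isn't one.
    for line in ip_route_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "default" and fields[1] == "via":
            return fields[2]
    return None


# Illustrative sample text, not captured from a real machine.
SAMPLE_OUTPUT = (
    "default via 192.168.1.1 dev eth0\n"
    "192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50\n"
)
```

On a real system you would feed it the output of the command itself, for example subprocess.run(["ip", "route", "show"], capture_output=True, text=True).stdout.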

MAC Address – The (more or less) globally unique address of your computer’s network adapter. Each network adapter (wired, wireless, etc.) has its own. Looks like xx-xx-xx-xx-xx-xx or xx:xx:xx:xx:xx:xx or xxxxxx:xxxxxx where each “x” is a number from 0 to 9 or a letter from a to f.

  • Windows: ipconfig /all, look for the name of your network connection and then the indented line starting with “Physical Address”.
  • Mac OS X: ifconfig, look for your network adapter (en0 is wired ethernet, en1 is your AirPort); the address will be on the line starting with “ether”.
  • Linux: ifconfig, look for “HWaddr” for the right interface.
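Since the same MAC address can show up in any of the three formats above depending on the tool, here is a small Python sketch (the function name is my own) that normalizes them to lowercase colon-separated form so they can be compared.

```python
import re


def normalize_mac(raw: str) -> str:
    # Strip the "-" and ":" separators used by the formats listed above,
    # then check that exactly 12 hexadecimal digits remain.
    digits = re.sub(r"[-:]", "", raw.strip()).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a MAC address: {raw!r}")
    # Re-join as six colon-separated pairs.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

With this, "00-1A-2B-3C-4D-5E" (Windows style) and "001a2b:3c4d5e" both come out as "00:1a:2b:3c:4d:5e".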

WAN (Internet or External) IP Address – Go to whatismyip.jasonantman.com.

Ping another host – A ping test shows (simple explanation) how long it takes packets to get from your computer to another. (For you Warcraft players, this isn’t the same as the ping times shown in-game, and you can’t ping the realm servers).

  • Windows: ping -t IPaddress, the -t makes it run until you type Control-C to stop it.
  • Everything else: ping IPaddress, then Ctrl-C (or whatever your OS uses) to stop it.

I’ll update this with more when I get time…

How to make software distribution secure

We were seeing some strange behavior with Mac client machines on the network lately, specifically with DNS queries (I’d guess that a lot of it has to do with Bonjour), but the discussion touched on the DNS Changer trojan for Mac. I’d really never heard about it before, and after some basic reading, it really got me thinking about the state of software packaging, updates, and distribution. Granted, some of my observations would require sweeping changes to how packaging is handled (even on the *nixes), and would require buy-in from more than just the vendor and distributor (well, I guess MS can probably pressure ISVs to do whatever they want), but this seems to be the only way to keep appliancization from becoming the solution to security issues. I’ve written about this before, and a while ago in respect to Linux, but here’s my current take on what needs to be done to software packaging to allow our machines to stay secure, no matter what OS they run.

  1. Allow packages to be installed as a user. This is a mammoth task under Windows or Mac, but still an issue under Linux. The DNS Changer trojan is a case in point – there’s no reason a “video codec” would need to be installed system-wide, and if that were simply installed user-specific, the malicious installer would never have the privileges to change system-wide DNS settings. This is also a big issue under Linux. Yum, apt, rpm, etc. should (if run as a non-root user) install packages in a user-local path under /home by default. Of course, this would mean many things would need to change in order to cope – perhaps even a change to the LSB spec.
  2. Warn about inconsistencies on package installation. The package installation program should warn a user (whether installing packages system-wide or local to a user) if the package is going to modify system-wide files, i.e. files not specifically placed by that package and that package only.
  3. Real package management for Windows and Mac – It’s about time that Apple and Microsoft admit that people without billions in funding can come up with good ideas. Get rid of these Installer programs (the many many different ones). Each OS should pick a package format and develop a yum-like (or, even better, zypper-like) package management program that understands repositories. I don’t know how they’d cope with the pervasive license keys and DRM in the non-nix world, but I’m sure they could figure out a way that still allowed sane package management. The idea here is that vendors run repositories and are responsible for their GPG keys, so trojans claiming to be an update to a given vendor’s software would be rejected. Also, isn’t it about time that you can update all your software on Windows or Mac through one tool?
  4. Filesystem-based IDS for Windows and Mac – Assuming it will take a while to get everyone onboard with the packaging idea, and noting that users of these OSes like installing applications from arbitrary sources, there should be an OS-level feature to audit all filesystem changes made by untrusted/unsigned applications, and a way to alert the user to these changes if they appear suspicious (essentially what Spybot Search & Destroy / TeaTimer do, but built into the OS).
  5. Vendor support of packaging/repositories – Along with the idea of repositories, vendors should have a trust or signing system for ISVs’ signing keys. If users are installing arbitrary software, making them trust an arbitrary key won’t do anything to improve security. Microsoft and Apple need to run a CA that signs the package signing keys of their ISVs. They also – and here’s the big one – need to have a parallel framework for “independent developers”. I.e. something that doesn’t cost any money for the packagers, and allows them to at least give a “this person is who they say they are” message.
  6. Finally, Make package management pervasive – Have a real push to apply the packaging and signing keys standard to all software for the OS.
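To illustrate point 2 (warning about inconsistencies), here is a hedged Python sketch of the check an installer could run before touching the disk. The ownership table is a stand-in for whatever database a real package manager keeps; the function name and data layout are invented for the example.

```python
from typing import Iterable, List, Mapping, Optional


def paths_needing_warning(package: str,
                          files_to_write: Iterable[str],
                          existing: Mapping[str, Optional[str]]) -> List[str]:
    # `existing` maps every path already on the system to the package
    # that owns it, or None if no package owns it (a hypothetical layout).
    # A path deserves a warning if it already exists and was not placed
    # by this package and this package only.
    warnings = []
    for path in files_to_write:
        if path in existing and existing[path] != package:
            warnings.append(path)
    return sorted(warnings)
```

A “video codec” package that wants to write /etc/resolv.conf would be flagged here, while new files and files it already owns would pass silently.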

On a final note, applicable to both the current state of Linux packaging and my ideas about Mac and Windows… DNS is the ideal method of key distribution (granted, yes, this just means that the security of the packager’s DNS records, and their servers and signing key, is just more of an issue). But even with Yum and Zypper, it seems to me to be logical that the packager’s public key should be stored in a DNS record (or at a URL stored in a DNS TXT record). That way, it wouldn’t be up to an end user to import and trust a key, they’d just have to trust the repository (i.e. software.adobe.com) and the package manager would pull down the key and verify that package X in software.adobe.com is, in fact, signed by the software.adobe.com key.
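Here is a minimal Python sketch of that idea, with loud caveats: the TXT record format ("pkgkey-sha256=...") is something I made up, it publishes a fingerprint of the key rather than the key itself, and the DNS lookup is passed in as a plain function because the standard library has no TXT resolver. A real implementation would also want DNSSEC, or the scheme is only as trustworthy as the network path.

```python
import hashlib
from typing import Callable, Iterable

# Hypothetical record format; no existing standard is implied.
TXT_PREFIX = "pkgkey-sha256="


def key_fingerprint(public_key_bytes: bytes) -> str:
    # SHA-256 of the raw key material, as a hex string.
    return hashlib.sha256(public_key_bytes).hexdigest()


def key_is_published(repo_host: str,
                     public_key_bytes: bytes,
                     lookup_txt: Callable[[str], Iterable[str]]) -> bool:
    # `lookup_txt` is an injected function returning the TXT records for
    # a hostname; in practice it would wrap a DNS library of your choice.
    expected = TXT_PREFIX + key_fingerprint(public_key_bytes)
    return any(record.strip() == expected for record in lookup_txt(repo_host))
```

The package manager would call key_is_published() for the repository hostname before trusting a key that signed package X, so the user only has to decide whether they trust the repository name.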

PC vs Mac

Begin shameless rant…

When I read the “system requirements” for hardware these days, and see “PC or Mac”, I cringe. Surely someone who’s developing the hardware should understand the horrible inaccuracy of this.

The term “PC”, or Personal Computer, is used to refer to any hardware that is (at this point, a derivative of) an IBM PC architecture clone. This generally means Intel x86 (compatible) systems.

In 2005, Apple announced that it was moving from PowerPC to Intel-based computers, and the first Intel Macs shipped in 2006. Since that transition, all new Apple (Mac, iMac, MacBook, etc.) computers have been PCs.

Similarly, PC refers just to the hardware, not the operating system. An Intel-based computer running Linux is a PC.

If you mean “Requires Microsoft Windows or Mac”, say that. I don’t know whether it’s more disturbing to see this on the box of a piece of (I assume, engineered) hardware or in a tutorial or how-to supposedly written by someone who knows something about technology.

Links for 2008-02-23

Some links for today:

Microsoft’s new promises on interoperability, open standards, etc. – somewhat ironic given the Office Open XML debacle on “standards”. And Red Hat’s worries about it. (Ars Technica)

Groklaw’s lengthy analysis of the promises.

Pakistan removed from the Internet, causes global YouTube outage.

A Guardian article on the WikiLeaks debacle – perhaps the biggest affront to the First Amendment this year.

An InformationWeek article about some guys from BlackHat D.C. who said that they will be able to crack GSM encryption in under 30 minutes with $1,000 of technology or 30 seconds with $100,000 (FPGAs – maybe a cluster of PS3s?)

A Princeton University blog about cold boots possibly able to crack the Windows BitLocker system.

Yay! Firefox has hit its 500 Millionth download!!! And there was much rejoicing…

An ArsTechnica article on Internet Explorer, what should be done to fix it, and how there can still be a non-standards-compliant browser.

Jeremy’s Blog – the mind behind LinuxQuestions.org provides a recap of the 2007 LQ Members’ Choice awards. Some interesting winners were VirtualBox for virtualization package, Debian for server distro, Knoppix for Live Distro, Eclipse for IDE/Web Development Environment, Python for language of the year, and – much to my chagrin – vi/vim for editor.

A LinuxJournal article on What’s Next for Open Source and Public Media.

LinuxInsider – EU taking Microsoft’s promises with a grain of salt, noting that MS has made “at least four similar statements” in the past.

Chris Siebenmann – Where the risk is with virtualization (and iSCSI) and Wireless, machine rooms, and the Asus eeePC.

IBM DeveloperWorks – OOXML: What’s the big deal? – outlining the technical objections to OOXML as a standard. Linked from a rootprompt.org article mentioning that “OOXML is essentially a complete replication of every chunk of data that a Microsoft Office application might possibly save in a file”.

Slashdot YRO – a guy who got his stock photos stolen, entered into a long legal battle, and won.

Microsoft’s Windows Vista Capable lawsuit granted class-action status.

A Washington Post article on Hans Reiser’s Geek Defense strategy.

A Slashdot post linking to news that Apple sent a cease-and-desist order to the Hymn Project, which produces software to remove DRM from iTunes songs. Apple had their ISP remove all download links. (I guess the only solution is for us all to buy bandwidth right from a NSP…)

Yahoo’s shareholders are suing it for not gobbling up the Microsoft deal.

Comcast getting sued AGAIN for P2P filtering.

A leaked RIAA training video for prosecutors, going so far as to say that IP piracy can lead to arrests for drugs, weapons, or terrorism. It also includes instructions on how to get a RIAA investigator certified as a court expert.

A New York Times article on – gasp – women using the Internet. Linked from Tom Limoncelli’s blog.

Why hasn’t Linux caught up to Windows?

Those of us who are involved in the Linux community are often frustrated by the lack of widespread acceptance of Linux. Granted, I haven’t used all of the newest “desktop” distributions (‘distros’), but I know that my choice – openSuSE – is far from being ready to compete with Windows for the novice user market. From the first few screens of the installation, it’s clear that this isn’t something for the uninitiated. However, to get off on a short tangent, openSuSE has also severely hampered access to the command-line-only, text-mode installation, which I need in order to install on many of my servers.

Granted, it will take a lot of work for Linux to retain its strong points and still be user-friendly for the non-technical user. However, there are three main points that I see as being the biggest problems for new users, all of which, coincidentally, some people would bill as strong points of Linux. They all have to do directly with some of the founding principles of Linux – interoperability and choice.

A) Packaging.
Searching for a package for a Linux system goes something like this: figure out what package format your distro uses, figure out the distro version and architecture, and then start checking the online repositories. If it’s something simple, you may be able to use your distro-specific maintenance program to automatically upgrade it. If not, you can sift through the myriad online repositories for packages that fit your package manager (RPM, Apt, etc.) and your distro/architecture. If you have no luck there, find the package’s homepage, and hope someone has contributed packages for your distro and architecture – usually a hit-or-miss situation. Last but not least, when all else has failed, you choose either to compile from source yourself, or give up. Compiling from source not only requires some knowledge of your system, Linux, and the compilation sequence used by the software (hopefully the generic GNU-style ./configure, make, make install and not some more esoteric scheme), but also requires a whole slew of tools to be installed on your system – make, gcc, autoconf, and many others, depending on the package. While it’s not practical for people with limited resources, homogeneous environments, or novice users, I operate in a largely heterogeneous environment – i586/compatible systems running SuSE 9.3-10.2 – and therefore maintain a dedicated system for compilation, if merited.

All of this complexity just enforces the novice’s idea that there is not much software available for Linux, as many novices are limited (due to technical knowledge) to the packages that come with their OS.

While there are a few schemes to standardize all of this, the real solution is quite complex, and would be based on a single package system to be adopted by all distros (beginning with the main ones). Such a system should have the following features:
1) Ability to work easily with all distros
2) A main configuration file which can define which directories to use – i.e. /etc, /bin, etc.
3) Support for both simple, novice-oriented interfaces and expert-level configuration
4) Multiple interfaces, including command-line, text/ncurses, GTK, and other graphical subsystems
5) A generalized package format that is non-distro-specific
6) Integration with an online master-list of repositories
7) Ability to search, download, and install packages from these repositories
8) Automatic update ability
9) Ability to mine the repositories for updates, and display a list on screen or emailed to a user account
10) Very good tools for easy compilation from source.

Some of these ideas would be incorporated in the tool itself, and some as add-on modules.

The features that I, as administrator of a largely heterogeneous network of about 10 machines, would most like to see are:
1) Truly automatic updates via list – select which packages can be automatically updated, and run a cron job nightly to check for any updates for those packages and automatically get and install them.
2) LAN-based updating – A single server on the LAN maintains a list (perhaps gathered via an automatic tool) of ALL packages installed on ALL LAN machines. Each night the configured clients will update this list over the network, and then the master server will download all available updates for all packages. Once this is complete, it will send a message to all LAN machines, which will then update their software from the central repository on the LAN. This would, in effect, automatically keep all LAN machines 1) on the same version of each package and 2) totally up-to-date.
Kernel updates would be done manually, but should have an option for the administrator to push the update to all machines.

B) Distro-specific tools, filesystem layout, etc.
This is not only a barrier for novice users, but for experienced users as well. If you do a search online for Linux training, you will surely come by a number of certifications – NCLE, RHCE, etc. The many distinct certifications – offered by each Linux vendor and independent training companies – underscore the inherent differences in Linux distributions. While I’m perfectly comfortable working with SuSE Linux (by Novell), if I were to sit down in front of a Gentoo system, I would probably be totally lost.

While the LSB project (http://www.linux-foundation.org/en/LSB) has aimed to provide compatibility between distros, there are three main points which must still be addressed:
1) The organization of filesystems on different distros, specifically the directory tree and default locations for certain components, still differs. In the interest of usability, the Linux directory tree should be standardized, so that locations of programs, files, etc. will be identical across distributions.
2) An effort needs to be made to make administration as similar as possible across all distros. This means that program names, functionality, location, etc. should be standardized as much as possible.
3) It seems that each distro has its own administration tool – YaST for SuSE, and others for other distros. An effort needs to be made to develop a tool encompassing all of the features in one, distro-neutral form. Webmin (www.webmin.com) has done this wonderfully in a web-based interface, but attention should be focused on a text-mode console version as well.

C) GUI
Perhaps the biggest hurdle for novices using Linux, and the biggest development challenge, is general ease of use. While the above two points may fall into this category, I am specifically referring to the general, day-to-day use of the operating system.

While I will not begin to suggest solutions, the main problems that I see are as follows:
1) The stability and security of Linux must be kept intact, unlike distros such as Lindows.
2) There must remain a way for advanced users to perform advanced tasks.
3) As much of the inner workings should be hidden from the end-user as possible, unless specifically requested.
4) A good system would have a field added to a user’s GECOS data specifying their level of “novice-ness” – i.e. allowing a dumbed-down interface for users while retaining a full interface with Expert features for those who want it.
5) “Mysterious” things such as file permissions should be hidden from novice-level users when not absolutely needed.
6) There must be a strong integration with “anti-mistake” tools and DWIM technology. The system itself should manage file permissions in a way that grants only the minimum needed access.
7) There should be good, strong mistake detection, specifically in terms of catching a user’s inadvertent changing of file permissions, deleting required files, etc.
8) Tools should be built so that the novice user is never required to login as root or run a root shell.
9) Perhaps, and I’m sure this is controversial, the root account should be given either CLI-only access, or should not have X running by default, so as to discourage novice users from running day-to-day tasks as root.

I’m sure I’ve missed a lot, and have also probably mentioned a number of things that are already in place. However, the bottom line is that Linux has to be able to achieve the ease of use and interoperability (between distros) that Windows currently has, while retaining the extensibility, advanced features, security, and stability that make Linux what it is.