A few links that I’ve had in my “mention in a blog post” category for a while:
- Summary of Windows Azure Service Disruption on Feb 29th, 2012 - Windows Azure - Site Home - MSDN
  - A very detailed and interesting post on what caused the Windows Azure cloud outage on February 29th, 2012. IMHO many of these failures were predictable: the bulk of the outage came down to code that either did not check its inputs for validity or did not handle the invalid case properly.
- High Scalability - Google: Taming the Long Latency Tail - When More Machines Equals Worse Results
- Everything Sysadmin: Fear of Rebooting
- Everything Sysadmin: Using statements of “Undeniable Value”