Cloud Hosting != Unbreakable
When Microsoft launched its cloud-based operating system last October, the company branded it “Azure,” presumably as a reference to the blue skies that hold these clouds.
According to Tier1 Research’s A. Piraino, Azure suffered a 22-hour outage this weekend when a (speculatively) software-related glitch caused instances to suddenly stop responding. While Microsoft has yet to release the results of a root cause analysis, one can envision a NOC with stacks of monitors displaying Blue Screens of Death. Or rather, Azure Screens of Death.
To be fair, Microsoft Azure is still in “Technology Preview,” which is to say, pre-production. And other cloud computing platforms have suffered similar outages in their infancy. Amazon Web Services suffered a seven-hour outage in July from faulty load balancers. Google systems were brought down twice in the past six months.
The problem isn’t that the architecture doesn’t work as planned. The problem is that no amount of planning will cover every situation that can, and will, occur. Failures of critical components become huge issues in virtualized applications, because so many more (virtual) instances depend on the services of those components. Though a system can have n levels of redundancy built into it, ultimately, there is no such thing as a completely unbreakable system.
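To put rough numbers on that, here is a minimal Python sketch of the arithmetic. It assumes replicas fail independently of one another, which real systems rarely manage, and every figure in it (99% per replica, 99.9% for the shared component) is a made-up illustration rather than a measurement of any provider. The function names are hypothetical too.

```python
HOURS_PER_YEAR = 8760


def parallel_availability(replica_availability: float, n: int) -> float:
    """Availability of n replicas when any single one can serve requests.

    Assumes replica failures are independent (an idealization).
    """
    return 1.0 - (1.0 - replica_availability) ** n


def with_shared_dependency(replica_availability: float, n: int,
                           shared_availability: float) -> float:
    """Same replicas, but all of them need one shared component to work
    (for example a load balancer or a storage array)."""
    return shared_availability * parallel_availability(replica_availability, n)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative, assumed figures only.
    for n in (1, 2, 3, 10):
        alone = parallel_availability(0.99, n)
        shared = with_shared_dependency(0.99, n, 0.999)
        print(f"n={n:2d}  replicas only: {downtime_hours_per_year(alone):8.4f} h/yr"
              f"   with shared component: {downtime_hours_per_year(shared):8.4f} h/yr")
```

Under these assumptions, adding replicas shrinks the first column toward zero very quickly, while the second column never drops below the downtime of the shared component itself. The redundancy is real, but the ceiling is set by whatever every instance has in common.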
Folks tend to get excited about cloud computing because they envision a future of virtualized applications zipping around in a grid computing infrastructure, never failing, never dying. Even in more traditional environments, people get excited about centralized storage, and the joys of instant snapshots and multiple layers of redundancy. Though these technologies are exciting and bring with them new avenues for innovation, uniform architectures share uniform faults. Diversity in architecture is an important consideration when you’re building fault tolerance into your system.
Another important consideration is this: the more power we place in the hands of an administrator, the more damage he can do when he goofs. And he will goof. We all goof once in a while. Take, for example, Flexiscale, who ate a five-day outage because of one such goof. The more we consolidate technology, the more vulnerable we are if something that should never happen, happens.
The takeaway is simple. Take the promises of new technology with a grain of salt. And even if the skies are blue, pack an umbrella just in case.