BAARF

15 Apr 2009: Edited from its original form for clarity… and a stab at humor. -Jim

I’m a card-carrying member (so to speak) of BAARF, a little online group dedicated to dispelling the myth that RAID5, or any variant thereof, is a good compromise between capacity and fault tolerance. The reason I bring this up is that I had two hard drives fail earlier today (on separate machines), one of them in a RAID5 array (it’s not mine). The RAID5 box is still rebuilding, one hard drive failure away from data oblivion. Please, for the love of all that is sacred in storage, don’t trust your data to RAID5, or even RAID6, which is not a whole lot better.

Also, it makes me sad that someone would dedicate some very nice 15K RPM SAS drives to a RAID5 array, presumably to offset the characteristically low IOPS performance of any RAID3/4/5 variant. Listen folks: you can have good IOPS as well as high capacity with other RAID levels, namely RAID10, which offers the best compromise of both worlds. I won’t go into too many details here; the page linked below has a number of good reference write-ups. The gist is that dedicating resources to parity management (the calculating, reading, and writing of parity data) is a practice that sucks and deserves a swift boot into tech obscurity along with floppy drives and modems.
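For a rough feel of what parity management costs, here is a back-of-the-envelope sketch using the common "write penalty" rule of thumb (each logical write costs about 2 disk operations on RAID10, 4 on RAID5, and 6 on RAID6). The function names, the 175 IOPS per-drive figure, and the 50/50 read/write mix are my own assumptions for illustration, not measurements from the array above.

```python
# Rule-of-thumb disk operations per logical write (an approximation,
# not a vendor specification).
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def effective_iops(level, disks, iops_per_disk, write_fraction):
    """Estimate the host-visible IOPS of an array for a given write mix."""
    raw = disks * iops_per_disk
    penalty = WRITE_PENALTY[level]
    # Reads cost one disk operation each; writes cost `penalty` operations.
    return raw / ((1 - write_fraction) + write_fraction * penalty)


def usable_disks(level, disks):
    """Number of disks' worth of capacity left after redundancy."""
    if level == "raid10":
        return disks // 2
    if level == "raid5":
        return disks - 1
    if level == "raid6":
        return disks - 2
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical array: 8 drives at an assumed 175 IOPS each, 50% writes.
    for level in ("raid10", "raid5", "raid6"):
        iops = effective_iops(level, 8, 175, 0.5)
        print(f"{level}: ~{iops:.0f} IOPS, {usable_disks(level, 8)} disks usable")
```

Under those assumptions RAID5 keeps more raw capacity, but it gives up a large share of the spindles' IOPS on a write-heavy workload, which is exactly what those 15K drives were bought for.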

You may join the fight, or not. Either way, enough is enough.

Rackspace Delves into Cloud Computing Marketplace

Rackspace recently delved into the cloud computing arena with its Mosso division. Mosso delivers online presence in an application-as-a-service model and mimics Amazon Web Services and Microsoft Azure in delivering high-availability platforms on which to run services. With Amazon and Microsoft, however, shell access remains limited and system-level access takes a back seat to stability. Mosso, in contrast, also offers provider-provisioned virtual server instances, which don’t require tackling a new learning curve. One can surmise that as the cloud marketplace matures, more providers will adopt this model. Ultimately, cloud computing will provide the availability and performance we want, without sacrificing the control we need.

Join the discussion at the Open Cloud Manifesto site.

Conficker Update

Update: An excellent resource list is available at the Internet Storm Center.

The headline at dailymail.co.uk read “April Fool’s Day computer virus is activated… but fails to cause internet chaos.”

I guess the rumors were unfounded. However, it’s important to note that the virus is still rampant, and so is speculation about the potential uses of such a huge botnet. Some surmise that it might be used to DDoS the crap out of some poor server(s). It might also be used to crack passwords or encryption. Check out http://downadup.org to read more and for removal tools. It’s also a good idea to prepare your network for the possibility of an attack. Don’t be a soft target.

Here are a couple of (read: non-comprehensive) ideas on how to not be a soft target:

  • Backup, backup, backup
    • Have systems ready to leap into action if necessary, and keep at least one form of backup offline in case of worst-case scenarios.
    • If you don’t already have a backup strategy in place, it’s time to implement one.
  • Control access to your critical services
    • Enforce strong passwords – or better yet, employ multi-factor authentication. PPP is a strong candidate for the thrifty.
    • Audit your users – does that guy who quit last year still have an active user account? Do your non-administrative users have access to critical servers?
    • Use fail2ban or iptables to detect and drop password-guessing attacks – even with 10 million+ IPs to choose from, it’s not easy to crack a password/one-time password combination when you only get 3 tries per IP.
  • Watch your traffic (not really a botnet vulnerability, but good practice in general):
    • Control your legacy services – seriously, it’s time to retire telnet and other services that transmit passwords in cleartext.
    • https > http – especially when it comes to passwords. Don’t allow users the ability to transmit passwords over http.
  • etc…
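To make the "3 tries per IP" idea concrete, here is a minimal sketch of the bookkeeping involved. This is not how fail2ban is actually implemented; the class name, the 10-minute window, and the example addresses are my own assumptions, and a real setup would hand the ban off to the firewall rather than keep it in memory.

```python
from collections import defaultdict, deque


class FailedLoginTracker:
    """Ban an IP after too many failed logins inside a sliding time window."""

    def __init__(self, max_tries=3, window_seconds=600):
        self.max_tries = max_tries            # 3 tries per IP, as in the list above
        self.window_seconds = window_seconds  # assumed window, tune to taste
        self._failures = defaultdict(deque)   # ip -> timestamps of recent failures
        self.banned = set()

    def record_failure(self, ip, now):
        """Record a failed login at time `now` (seconds); return True if banned."""
        if ip in self.banned:
            return True
        attempts = self._failures[ip]
        attempts.append(now)
        # Forget failures that have aged out of the window.
        while attempts and now - attempts[0] > self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_tries:
            self.banned.add(ip)
        return ip in self.banned


if __name__ == "__main__":
    tracker = FailedLoginTracker()
    for second in (0, 10, 20):
        banned = tracker.record_failure("203.0.113.7", second)
        print(f"t={second}s banned={banned}")
```

Timestamps are passed in explicitly so the logic is easy to test; in practice they would come from parsing the auth log.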

I’ve hardly compiled a comprehensive list, and I welcome comments for other good practices, but the most important takeaway is to be cognizant of your security stance. Don’t make it easy for the bad guys.

WebHostingTalk Hacked Hardcore but Still Online

This week, WebHostingTalk suffered a major compromise and “the offsite backup, the onsite backup and the operational data were destroyed by the attacker.” Read the company’s thread about the event here.

This isn’t a news blog. I’m not so plugged in that I can write about things as they happen. That’s not my impetus. My impetus is to publish articles about ideas that I believe are sound principles when planning your systems architecture. One such principle is not betting the house on a single technology, and WHT was a prime example of the benefits of such planning. After suffering an attack like that, WHT is online and serving traffic. How? They had DVD backups. Try erasing that. Yes, they’ve only managed to restore up to October 2008, and that was likely a result of corrupted media or backup operator error, but their site isn’t down. It’s an arduous process, writing to permanent media. And it’s an arduous process restoring from it. But in a last-ditch scenario like this, it served WHT well. And it proves my point. Crossing architectural boundaries when backing up data is a necessity, and that means keeping at least one form of backup offline.

Would your business be able to take a hit like WHT’s and keep its doors open?

Cloud Hosting != Unbreakable

When Microsoft launched their cloud-based operating system last October, they branded it “Azure,” I suppose as a reference to the blue skies that supposedly hold these clouds.

According to Tier1 Research’s A. Piraino, Azure suffered a 22-hour outage this weekend when a (speculatively) software-related glitch caused instances to suddenly stop responding. While Microsoft has yet to release the results of a root cause analysis, one can envision a NOC with stacks of monitors displaying Blue Screens of Death. Or rather, Azure Screens of Death.

To be fair, Microsoft Azure is still in “Technology Preview,” which is to say, pre-production. And other cloud computing platforms have suffered similar outages in their infancy as well. Amazon Web Services suffered a seven-hour outage in July from faulty load balancers. Google systems were brought down twice in the past six months.

The problem isn’t that the architecture doesn’t work as planned. The problem is that no amount of planning will cover every situation that can, and will, occur. Failures of critical components become huge issues in virtualized applications, because that many more (virtual) instances require the services of those components. Though a system can have n levels of redundancy built into it, ultimately, there is no such thing as a completely unbreakable system.
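A quick bit of arithmetic shows why piling on redundancy doesn’t get you to "unbreakable." The sketch below uses the textbook availability formulas and assumes failures are independent, which is generous, since shared software and shared architecture tend to fail together. The 99% and 99.9% figures are made-up numbers for illustration.

```python
def parallel(availability, n):
    """Availability of n redundant copies, assuming independent failures."""
    return 1 - (1 - availability) ** n


def series(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    # Three redundant hosts at an assumed 99% each...
    hosts = parallel(0.99, 3)
    # ...all depending on one shared component at an assumed 99.9%.
    system = series(hosts, 0.999)
    print(f"redundant hosts: {hosts:.6f}")
    print(f"whole system:    {system:.6f}")
```

However many replicas sit behind it, the whole system can never be more available than the one shared component they all depend on.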

Folks tend to get excited about cloud computing because they envision a future of virtualized applications zipping around in a grid computing infrastructure, never failing, never dying. Even in more traditional environments, people get excited about centralized storage, and the joys of instant snapshots and multiple layers of redundancy. Though these technologies are exciting and bring with them new avenues for innovation, uniform architectures share uniform faults. Diversity in architecture is an important consideration when you’re building fault tolerance into your system.

Another important consideration is this: the more power we place in the hands of an administrator, the more damage he can do when he goofs. And he will goof. We all goof once in a while. Take, for example, Flexiscale, who ate a five-day outage because of one such goof. The more we consolidate technology, the more vulnerable we are if something that should never happen, happens.

The takeaway is simple. Take the promises of new technology with a grain of salt. And even if the skies are blue, pack an umbrella just in case.
