Disaster Avoidance instead of Disaster Recovery

Posted by Gary Dunlap on Friday, May 21, 2010

A recent white paper from a company called ZeroNines talks about "The Disaster of Disaster Recovery." The basic premise is that even with the latest disaster recovery solutions there will be a certain amount of downtime and a certain amount of data loss. The words "acceptable amount" come to mind here. ZeroNines operates in the international banking sector, where the acceptable amount is NONE, but most of the rest of us live in another reality where the cost of a "one to many" multiple replicated system would be a little out of our league. So we have set up our virtualization and the hot sites. We monitor the "heartbeats" and our heads rest easily on the pillow at night. I want to pose two questions here: 1) What if you don't have a "state-of-the-art" disaster recovery solution? Some folks don't, you know. 2) What if your disaster recovery solution fails? Please don't say "That can't happen," because you darn well know it can.

Let's look at some recent real-world disaster scenarios. These things actually happened and were chronicled in IT Management magazine's "Top 8 Datacenter Disasters of 2007":

Hosting firm Rackspace US Inc. suffered back-to-back outages in just 36 hours. The first outage was caused by a "mechanical failure" in the company's Dallas datacenter on Sunday, November 11, 2007. Customers experienced "intermittent service interruptions" and a team of more than 100 techs was deployed to find and fix the problem. Then, on the following Monday evening, a pickup truck struck a utility pole and brought down the transformer feeding the datacenter. Emergency generators kicked in and operated as intended, and Rackspace transferred its power to its secondary utility power system and brought its chilling units back online. However, the utility shut down power in order to give emergency workers safe access to the downed transformer. Temperatures rose within the datacenter. Rackspace shut down selected servers in order to avoid overheating all of them.

A power outage hit downtown San Francisco on July 24, knocking out 365 Main Inc. - a 227,000 square-foot facility and datacenter development company. At least three of 365 Main's eight co-location centers were knocked out. Among the Web sites that went down for a few hours were giants like Craigslist, GameSpot, Yelp, Technorati, Typepad and Netflix. Power was restored after 45 minutes. 365 Main later estimated that between 20 and 40 percent of its customers were affected. The company ultimately attributed the disaster to backup generators made by the Dutch firm Hitec, which failed to kick in. It seems that an incorrect setting in one tiny generator component prevented the component's memory from resetting properly.

In the Rackspace scenario, the first outage was a mechanical failure. Mechanical failures happen, but past experience tells us that they become more likely when maintenance is less than perfect. There is really no excuse for the second outage. Obviously there was only enough emergency power (generators) to run the IT equipment and not the chillers or some other means of keeping the data center cool. I would imagine that the phrase "But that would never happen" was uttered many times prior to this incident. (Side note: "But that would never happen" and variations thereof are the most dangerous words ever uttered.)

In the 365 Main instance (I hope the irony in their name isn't lost on anyone), the generators failed to start. In the life safety/medical area, failure testing is required annually. Failure testing means simulating a real failure: cutting the commercial power, for instance. It might not have uncovered this particular issue, but you could certainly say with conviction that "We checked and tested everything possible."

The point here is simple: while spending millions upon millions of dollars on disaster recovery, doesn't it make sense to spend as much on trying to avoid having the disaster? How many have seen the Ford commercial where the young lady squeals with glee as her SUV parallel parks itself? That is undoubtedly the scariest thing on TV. Not insisting on rigorous preventative maintenance and not testing emergency systems under real-world conditions is kind of like that. Just take your hands off the wheel and trust that the technology will work! WHAT IF IT DOESN'T? Let's think about what happens if that thing that will never happen... HAPPENS! Like the secondary power grid going down or the generator not starting.

Here's another scenario of a disaster where everything worked:

No customers were affected by a fire that broke out in an electrical room at a Terremark data center in Culpeper, Va., early on April 30. Terremark spokesman Xavier Gonzalez said electrical failover systems were activated and there were no interruptions in power supply to the computer rooms. "Our redundant power systems worked and kicked in properly," he said, explaining that one of the data centers on the company's NAP of the Capital Region campus was switched to generator power. "Everything continued to work properly there." The fire was caused by an electrical fault in a medium-voltage room, something Gonzalez said was a relatively common occurrence in data centers. The fire damaged one electrical-gear cabinet before it was extinguished by members of the Terremark team together with the local fire department. Firemen were dispatched on site after a fire alarm was activated around 12:30 a.m. EDT and employees at the data center confirmed that there was a fire. According to Gonzalez, it was extinguished fairly quickly. "At the end of the day, the design of the facility and the redundant power systems worked properly," the spokesman said. "That's a good sign for us."
There are currently two active 50,000 square foot data centers at NAP of the Capital Region, a campus that has room for a total of five such facilities. Terremark began construction of the third facility in April.

This is an excerpt of an article from TechFlash.

Disaster Recovery and Business Continuation are kind of like parachutes or those floating cushions under your seat on an airplane: you certainly want them and won't fly without them, but you really, really don't want to have to use them. Instead, let's make sure the plane has all its PMs, that the tires have plenty of air in them and that the pilot doesn't even get to have strong mouthwash. Granted, this cannot completely eliminate the need for the safety devices, but it certainly lengthens the odds on their use. After all, can you be 100% certain that the chute will open?

If you would like to read the ZeroNines white paper, go here to download it.
