Saturday, March 14, 2009

New Thinking in Disaster Recovery Strategies

Over the last few years there has been a lot of discussion in the industry about the various aspects of Business Continuity but the primary focus has centered on two areas:

  • High Availability
  • Disaster Recovery

In regard to Disaster Recovery, the majority of the discussion has focused on how you get from your primary business operations center to an alternate location.  But what if you couldn't go to a single alternate location and instead needed to do what I describe as "Distributed-DR"?  The difference in a Distributed-DR strategy is that instead of cutting over to a single DR datacenter, if you have multiple small Remote Office/Branch Offices (ROBOs) you distribute your primary datacenter in small pieces across the ROBOs, making it more practical to have a real-world DR Plan.
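To make the idea concrete, here is a minimal sketch of the placement problem Distributed-DR implies: carving the primary datacenter's workloads into pieces and spreading them across ROBO sites by spare capacity. Everything here is hypothetical and illustrative; the function, site names, and sizes are mine, not any product's.

```python
def distribute(workloads, sites):
    """Greedy sketch: place each workload (largest first) on the ROBO
    site with the most spare capacity.

    workloads: {workload_name: size_gb}
    sites:     {site_name: capacity_gb}
    Returns    {site_name: [workload_names]}
    """
    placement = {site: [] for site in sites}
    spare = dict(sites)
    for name, size in sorted(workloads.items(), key=lambda kv: -kv[1]):
        best = max(spare, key=spare.get)  # site with most room left
        if spare[best] < size:
            raise ValueError(f"no ROBO site can hold {name} ({size} GB)")
        placement[best].append(name)
        spare[best] -= size
    return placement

# Illustrative only: three workloads spread over two branch offices.
plan = distribute({"mail": 50, "files": 80, "db": 120},
                  {"robo1": 150, "robo2": 150})
```

A real plan would also weigh WAN links, staff, and application affinity at each site, but the shape of the problem is the same: the primary datacenter becomes a set of pieces, each small enough for a branch office to absorb.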

One of the more interesting things I ran into while modeling this in our Advanced Technologies lab was the impact it has on one of the other quintessential problems in DR planning and execution: bandwidth.  When we think about protecting a company's data there are two elements: Recovery Point Objective (RPO) and Recovery Time Objective (RTO).  In the simplest terms, RPO defines the amount of data you are willing to lose, versus RTO, which defines how long you are willing to be out of business.
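The RPO definition can be turned into simple arithmetic: the worst-case loss is everything that changed since the last replication point. A back-of-the-envelope sketch (the function name and numbers are mine, purely illustrative):

```python
def data_at_risk_gb(change_rate_gb_per_hour, rpo_hours):
    """Worst-case data loss for a given RPO: every change made since
    the last replicated recovery point is exposed."""
    return change_rate_gb_per_hour * rpo_hours

# Illustrative: a shop changing 10 GB/hour with a 4-hour RPO
# stands to lose up to 40 GB in a disaster.
exposure = data_at_risk_gb(10, 4)
```

The same framing works in reverse: if the business can only tolerate losing, say, 5 GB, the RPO must shrink until the change rate times the RPO window fits under that ceiling.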

I have consulted for many companies over the years, and when we discussed contingency plans for disaster recovery I would ask how much they were willing to lose, and for how long.  The answer was "as little as possible" and "as near zero downtime as possible," or what I call a "0/0" DR Plan.  I started calling them "No-No" DR Plans because as soon as the client got the estimate for what it would cost to meet their objectives, the immediate response was "No way, no how" can we pay that….  I have long asserted that given enough money anything is possible, and in the DR business I generally find this to be true.  The challenge is finding the breakpoint between what it costs to achieve a 0/0 plan versus the business value of data loss or inaccessibility. 

One of the first reasons to back away from a 0/0 DR Plan is the relative cost of the bandwidth necessary to replicate the data between the primary and alternate datacenters.  Another complicating factor is the availability of high-speed circuits; I've been in a number of locations where it can be difficult to get circuits larger than a DS-3 due to carrier or infrastructure limitations. 
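This constraint is easy to check on the back of an envelope: to hold a given RPO, the link must drain the changed data within the replication window. A quick sketch (the function and the 200 GB/day figure are illustrative assumptions; the DS-3 payload rate of roughly 44.7 Mbit/s is standard):

```python
DS3_MBPS = 44.736  # nominal DS-3 line rate in Mbit/s

def min_link_mbps(change_gb, window_hours):
    """Sustained link speed (Mbit/s) needed to replicate `change_gb`
    of changed data within `window_hours` (decimal GB -> Mbit)."""
    megabits = change_gb * 8_000          # 1 GB = 8,000 Mbit
    return megabits / (window_hours * 3600)

# Illustrative: 200 GB of daily change replicated over 24 hours
# needs roughly 18.5 Mbit/s sustained -- a DS-3 can carry it,
# but halve the window or double the change rate and it gets tight.
needed = min_link_mbps(200, 24)
```

The point of the exercise is that the breakpoint between a 0/0 plan and an affordable one often falls out of this one division: shrink the RPO window and the required circuit grows until it is no longer available, or no longer affordable.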

Business Continuity, inclusive of both High Availability and Disaster Recovery, is as much about physics as it is about methodical planning.  Theories in technology are immensely entertaining to discuss but yield remarkably little in the way of profits.  Any really good theory, and a lot of crazy theories, need to be modeled and tested against a real-world set of data. 

The concept of Distributed-DR addresses one of the key challenges of DR by allowing the distribution of data in the direction it makes sense and the re-aggregation of data where it makes sense, or so the theory goes.  All of this sounds good on the whiteboard, but the proof is in the lab and in the real world. 

Meanwhile, back in the DataCore AT Lab, we needed to model a company that would be a fair representation of a real-world organization and the virtual infrastructure needed to support it.  What does that mean?  One of my favorite quips is that it's better to under-promise and over-deliver than the other way around.  That said, to say that we may have overbuilt Demo Company, Inc. with 16 servers and 25 desktops for a company of 25 employees is probably true, but it provides a representative sample of what is common practice in the industry today and allows us to measure the scalability of this solution.