For years, we've operated under the assumption that excellent disaster recovery (DR) requires two fully redundant data centers, with "flip of a switch" failover from one to the other. Toward that end, our DCO team has worked with our vendors to design a state-of-the-art replication system that keeps both data centers in sync and ready to fill in for each other as needed.
Here's the problem: The cost of that kind of infrastructure is gigantic (essentially double the cost of a single data center). A small or mid-sized SaaS company may spend $1M per year to keep a single data center running. Maintaining two data centers with enough capacity that one can take on the load of the other at a moment's notice essentially increases the cost to $2M, or maybe a bit less. That can have a devastating effect on the company's Cost of Revenue (which, for a SaaS company, includes both DCO costs and Support/Help Desk expenses).
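To make that arithmetic concrete, here is a rough sketch in Python. Only the $1M single-data-center figure comes from the paragraph above. The $10M revenue, the $500K support cost, and the $900K cost of the second data center (standing in for "a bit less" than double) are hypothetical numbers chosen purely for illustration.

```python
def cost_of_revenue_pct(revenue, dco_cost, support_cost):
    """Cost of Revenue (DCO + Support/Help Desk) as a percentage of revenue."""
    return 100.0 * (dco_cost + support_cost) / revenue


# The $1M figure is from the text; everything else is a hypothetical assumption.
ANNUAL_REVENUE = 10_000_000   # hypothetical
SUPPORT_COST = 500_000        # hypothetical
SINGLE_DC_COST = 1_000_000    # from the text
SECOND_DC_COST = 900_000      # hypothetical: "a bit less" than a full second $1M

one_dc = cost_of_revenue_pct(ANNUAL_REVENUE, SINGLE_DC_COST, SUPPORT_COST)
two_dc = cost_of_revenue_pct(
    ANNUAL_REVENUE, SINGLE_DC_COST + SECOND_DC_COST, SUPPORT_COST
)

print(f"One data center:  {one_dc:.1f}% of revenue")
print(f"Two data centers: {two_dc:.1f}% of revenue")
```

Under those assumed numbers, Cost of Revenue goes from 15% to 24% of revenue. Your own figures will differ, but the shape of the problem is the same.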
We learned recently that NetSuite, one of the leading on-demand ERP vendors (and one that we are intimately familiar with), has operated in a single data center since its inception. In mid-2007, they announced (as a prelude to their IPO in December 2007) that they would be expanding to a second data center in 2008. I can't find any evidence that they have done so. In fact, the "Cautionary Note" in the press release announcing their record Q2 2008 results warns that "unexpected disruptions of service at the Company's data center may occur".
Some key questions have to be answered to decide whether a second data center is necessary:
- How redundant is the infrastructure in your data center? Have you eliminated, to the extent possible, any single points of failure? (That's not cheap--but it's a lot cheaper than a second data center.)
- How much do you trust the facility that you are in? Have they demonstrated the ability to absorb a power failure without impacting you? Do they have strong bandwidth peering relationships?
- Do you have a comprehensive backup and validation process? Are you certain that your backups are good (i.e., do you restore and test each backup right after it's made)? Do you move your backups off-site frequently?
- In the case of a truly catastrophic event (major earthquake, fire, flood), how long can your customers wait to get back online? Would your customer base revolt if they were offline for 24-48 hours, or could that be absorbed? Do you have a documented and tested DR plan to recover within that time frame?
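On the backup question in particular, "restore and test" can start with something as simple as confirming that a restored copy is byte-for-byte identical to what was backed up. The following is a minimal sketch, assuming file-based backups; the function names are hypothetical, and a real validation process would go further (for example, loading the restored data into a database and running sanity queries against it).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source_path, restored_path):
    """True if the restored file has exactly the same contents as the source file."""
    return sha256_of(source_path) == sha256_of(restored_path)
```

A check like this only proves that the backup can be read back intact. It says nothing about whether you can rebuild a working system from it within the 24-48 hour window, which is what the tested DR plan is for.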