Disaster recovery – quite straightforward, isn’t it?
Well – it might be, at least for IT environments that aren’t too complicated. The catch is that very few IT environments are uncomplicated; most of them tend to be complex in one way or another. Many disaster recovery solutions, whether hardware or software based, are very competent. Solutions that work directly in VMware ESXi/vCenter, for example, are usually easy to set up: prepare the target environment, install the software and its agents, and start replicating. Basically. Many solutions also make it easy to “push the big red button”, that is, to invoke the disaster recovery.
Pitfall #1 – consistent replicas
You want to be sure that you can actually use the DR environment once you have decided to invoke it, right? The server operating systems and their built-in functions are usually not a problem, but what about the applications? There’s probably a database server in there: SQL Server, DB2, Oracle or MySQL perhaps? Some DR solutions have VSS integrations that let you safely replicate Microsoft SQL Server. However, there’s no VSS integration available for applications like MySQL, DB2, Lotus Domino and such. You have to handle these application servers differently, for example by quiescing the application before each replication snapshot, to make sure that the DR replica is consistent and OK to use.
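As a minimal sketch of what “handling it differently” could look like for MySQL: hold a global read lock while the snapshot is taken, then release it. FLUSH TABLES WITH READ LOCK and UNLOCK TABLES are standard MySQL statements; everything else is an assumption of this sketch, including that your DR product lets you run a hook around its snapshot (the hypothetical take_snapshot callable) and that cursor is a DB-API cursor from whichever MySQL driver you use.

```python
from contextlib import contextmanager


@contextmanager
def quiesced(cursor):
    """Hold a global MySQL read lock for the duration of the with-block.

    `cursor` is assumed to be a DB-API cursor on a MySQL connection.
    The lock only lasts while this same session stays open.
    """
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        yield
    finally:
        # Always release the lock, even if the snapshot fails.
        cursor.execute("UNLOCK TABLES")


def snapshot_consistently(cursor, take_snapshot):
    """Run the hypothetical `take_snapshot` hook while writes are blocked."""
    with quiesced(cursor):
        return take_snapshot()
```

Writes are blocked while the lock is held, so the snapshot step should be quick. Whether this alone is sufficient depends on the storage engine and on how the snapshot is taken, so treat it as a starting point to test, not a guarantee.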
Pitfall #2 – access to the DR environment
So you have a disaster recovery solution that replicates automatically from the main production site, and you have taken measures to make sure that the DR replicas are consistent. Good! What happens when you decide to push the red button? Well, you break the replication and have the DR site fire up the virtual machines, verify it all and let the users into the environment. I would claim that it probably won’t be that easy unless you have prepared carefully. If your office(s) reach the production site through a WAN operator, it might be a good idea to add the DR site to the WAN as well. Otherwise you might keep VPN concentrator(s) standing by and have the office(s) connect through a site-to-site VPN, or simply use client VPN for the users if that’s possible. Whatever you decide, document it well and test it regularly.
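The “test regularly” part can be partly automated. Below is a minimal sketch, assuming that a plain TCP connect from an office network to each DR-site service is a meaningful first check; the host names in the usage comment are hypothetical placeholders, not real systems.

```python
import socket


def check_endpoints(endpoints, timeout=3.0):
    """Return {(host, port): True/False} based on a TCP connect attempt."""
    results = {}
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            results[(host, port)] = False
    return results


# Usage (hypothetical DR-site services, replace with your own):
# check_endpoints([("dr-gateway.example.com", 443),
#                  ("dr-fileserver.example.com", 445)])
```

A successful connect only shows that the network path is open. It says nothing about whether the application behind the port works, so it complements a real DR test and does not replace one.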
Pitfall #3 – circumstances tend to change
The main disadvantage of a disaster recovery environment is that it’s very rarely used live. Actually, let’s hope it never has to be. Even if you did a perfect job setting it all up, testing and documenting it, the fact that the DR site isn’t used live will work against you. As time goes by and the DR site stays cold, the business circumstances, and thus the technical circumstances, in the production environment tend to change. You might have added VLANs, firewall zones and such to the network. You might have added or upgraded applications that now really can’t run on anything less than flash storage, while the DR site is built on SAS or even SATA. I believe the best approach is to test the DR environment live at least twice a year, and to assess it at least quarterly.
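A quarterly assessment gets easier if the comparison itself is mechanical. Here is a minimal sketch, assuming you can export a simple inventory of each site as a dictionary; the keys and values below are made up for illustration and mirror the examples above (VLANs, firewall zones, storage tier).

```python
def find_drift(production, dr_site):
    """Return {key: (production_value, dr_value)} for every key that differs."""
    drift = {}
    for key in sorted(set(production) | set(dr_site)):
        prod_value = production.get(key)  # None if the key is missing
        dr_value = dr_site.get(key)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


# Illustrative inventories, not real data.
production = {
    "vlans": {10, 20, 30},
    "firewall_zones": {"dmz", "internal"},
    "storage_tier": "flash",
}
dr_site = {
    "vlans": {10, 20},
    "firewall_zones": {"dmz", "internal"},
    "storage_tier": "sas",
}

for key, (prod_value, dr_value) in find_drift(production, dr_site).items():
    print(f"{key}: production={prod_value!r} dr={dr_value!r}")
```

With these inventories the check flags the storage tier and the VLAN list, and leaves the firewall zones alone because they match.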
Pitfall #4 – the decisions might give you a headache
Let’s say you now have your DR environment, and that you evaluate and test it regularly. What you also have to make clear to yourself and your organization is how the actual DR decisions are made. Let’s be clear about an obvious fact: invoking DR costs money. It takes man-hours to verify the environment, redirect communications and so on. That might not be an issue; after all, the main reason for having a DR environment is that IT downtime costs a lot of money, isn’t it?
One challenge is that we as humans tend to have faith: faith that the problems that shut down the IT environment will eventually be solved. So we tend to wait for that to happen. Because even though there might be clear procedures for DR invocation, there will probably be no procedure for failing back. In the “best” case, the primary datacenter has burned down and won’t be available for a long time, but what if it’s just the storage or the network that’s unavailable?
I would recommend setting time limits that are as exact as possible. For example, if there’s no estimate of when the problem will be solved, how long would you troubleshoot before saying “now is the time to invoke DR”? I have a background as an operations engineer and I know for sure that we tend to be quite stubborn and don’t give up easily. That’s good in some cases, but it’s a challenge in others, like when a decision has to be made.
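Writing the time limit down as an explicit rule takes the stubbornness out of the moment. The sketch below is one possible rule, not a standard: invoke once the agreed troubleshooting window has passed, or earlier if the estimated fix already lies beyond that window. The function and parameter names are hypothetical, and the two-hour limit in the usage comment is only an example.

```python
from datetime import datetime, timedelta


def should_invoke_dr(outage_start, now, max_troubleshooting, estimated_fix=None):
    """Return True when the agreed time limit says "invoke DR now".

    Assumed rule for this sketch:
    - the troubleshooting window has run out, or
    - there is an estimate and it lies beyond the window.
    """
    deadline = outage_start + max_troubleshooting
    if now >= deadline:
        return True
    return estimated_fix is not None and estimated_fix > deadline


# Usage: outage began at 08:00, the agreed limit is two hours.
# should_invoke_dr(datetime(2024, 1, 1, 8, 0),
#                  datetime(2024, 1, 1, 9, 0),
#                  timedelta(hours=2))
```

The value of the rule is less in the code than in agreeing on max_troubleshooting before the outage, while nobody is under pressure.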
Pitfall #5 – publicly available services
The world changes. It might be that today you do most of your business through e-mail and phone. You might have an e-commerce site, but if it’s unavailable, that’s not really a disaster. As time goes by, the e-commerce channel might become more important to your business. It might also get more and more integrations: to suppliers, distributors, customer channel(s) and so on. As stated above, you need to evaluate regularly and also make risk assessments. Not to forget, you might become more and more dependent on your ordinary public IP addresses. Some integrating parties may want to lock down their integrations to specific public IP addresses. In extreme cases, you might want to plan for re-routing your ordinary public IP addresses as part of the DR invocation plan.
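To see how exposed you are to IP lock-downs, you can check the DR site’s public address against what each integrating party has allowed. This is a minimal sketch that assumes you keep a record of those allowlists yourself; the partner names are hypothetical and the addresses come from the ranges reserved for documentation.

```python
import ipaddress


def partners_blocking(egress_ip, partner_allowlists):
    """Return the partners whose allowlist does not cover `egress_ip`."""
    address = ipaddress.ip_address(egress_ip)
    blocked = []
    for partner, networks in partner_allowlists.items():
        allowed = any(address in ipaddress.ip_network(net) for net in networks)
        if not allowed:
            blocked.append(partner)
    return blocked


# Illustrative record of what each partner has allowed.
allowlists = {
    "supplier-a": ["203.0.113.0/24"],
    "distributor-b": ["203.0.113.10/32", "198.51.100.0/24"],
}

# Hypothetical public address used by the DR site.
print(partners_blocking("198.51.100.7", allowlists))
```

Every partner the function returns is an integration that would stop working after DR invocation unless the allowlist is extended in advance or the ordinary addresses are re-routed.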
Pitfall #6 – Internet of Things devices
You might or might not be there today, but more and more businesses rely on IoT devices. It might be classic SCADA systems that control production, or just systems collecting statistics that after a while become crucial to your business in one way or another. Can these devices access the DR environment after the red button has been pushed, and vice versa? Is there a need for more complex communication solutions between the primary environment and the DR site?
Although I have listed some pitfalls that might cause a fair amount of headache, I’d say there are more that I have left out here. Every environment is unique, which obviously makes it difficult to set any standard other than pure baselines. You shouldn’t be afraid of planning a DR site and environment; I believe it’s very relevant for many corporations. After all, a disaster has a tendency to wait somewhere around the corner. Don’t forget to get help when assessing and planning. You’re probably the one who knows all there is to know about your IT environment, but help from the outside, drawing on completely different experiences, can bring in interesting aspects that you haven’t thought about.