(This is part 2 of an ongoing series exploring IT at the BC Libraries Cooperative.)
Ah, Vancouver. To be surrounded by ocean, mountains, mild climate, expensive lattés (and high rent!). It’s home to many of the Co-op’s staff as well as our main server hosting facility located at Peer1.
It’s also home to potential for very large earthquakes. While no one likes to think about them, the reality is that one could happen, and if it does, we need a plan for how to resume services for our members as quickly as possible.
Our disaster recovery plan has a number of components:
- Critical data is backed up and stored outside of the quake zone;
- In the event of a disaster that renders our current infrastructure unreachable, we have potential targets in place to restore services;
- Numerous mechanisms are in place for staff to stay connected for the duration of an emergency;
- Documented dry runs of recovery exercises are used to test our plan.
Let’s go through each of these in turn.
1. Remote backups of all critical data outside of the quake zone
In its early days, the Co-op was like most organizations. We used tape drives to do backups of our systems, and shipped our combination of daily and weekly backups to a secure, off-site location. That secure, off-site location was… on Vancouver Island! Not exactly out of the quake zone in case of a massive event. There was yet another issue: if a massive earthquake took place, where would we ship the tapes in order to restore our systems?
So, while this backup system was not completely inadequate, we knew there were weaknesses in this plan that we were not in a rush to test. Two developments in the past two years helped us develop a new plan. The larger one was signing a historic agreement with BCNet which meant that the Co-op (and any BC member who can reach one of BC’s transit exchanges) could connect to the ultra-fast Canada-wide research network. Since our main hosting facility is within a few hundred feet of the Vancouver Transit Exchange and BCNet’s own main facility in Vancouver, we were able to cross-connect easily to BCNet’s network with a 100 Mbps connection.
The second development was locating a partner outside of the Lower Mainland’s quake zone to whom we could ship our data. That partner is Laurentian University – about as far from the quake zone as we can get! Laurentian agreed to put a storage server in their racks for us and connect their end to the network (in their case, the network is called ‘Orion,’ the Ontario portion of the CANARIE network).
Since then (approximately February 2015) we’ve been storing 10-14 days’ worth of nightly backups on this server. And, just in case, we still maintain tape backups for both long-term audit purposes and any immediate data restoration needs.
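A rolling retention window of this kind can be sketched in a few lines of Python. This is only an illustration, not our actual backup tooling: the function name `backups_to_prune`, the 14-day default, and the example dates are all assumptions made for the sake of the example.

```python
from datetime import date, timedelta


def backups_to_prune(backup_dates, today, keep_days=14):
    """Return the backup dates that fall outside the retention window.

    Hypothetical helper: keeps the most recent `keep_days` calendar days
    (counting today) and reports everything older, oldest first.
    """
    # Anything dated on or before this cutoff is outside the window.
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in backup_dates if d <= cutoff)


# Example with made-up dates: 20 nightly backups ending on an arbitrary day.
today = date(2017, 1, 20)
nightly = [today - timedelta(days=n) for n in range(20)]
expired = backups_to_prune(nightly, today)
kept = [d for d in nightly if d not in expired]
```

With a 14-day window and 20 nightly backups, the sketch keeps the 14 most recent and flags the 6 oldest for removal.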
2. Potential restore targets
Once we knew our data was safely outside the quake zone, the next question we faced was: in the event of a disaster which renders our current hosting facility inoperable, how do we resume services?
Initially, we turned to our hosting provider, Peer1, since they are a large company with facilities across Canada, offer cloud hosting, and we have an existing contract with them. More recently, we have been researching other options and are leaning towards restoring to Amazon’s EC2 node in Montreal. This option would be fast, cheap, and FIPPA compliant. Recent tests show we can restore there from Laurentian at around 17 MB/s, meaning we can ship the majority of our critical data there in a couple of hours. (As an aside – we only rely on backups for critical data and configurations. Using configuration management, a topic for another post, we can quickly pull the open-source software back down from existing sources and reconfigure it once installed.)
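As a rough back-of-the-envelope check on the restore figure, the sketch below converts a sustained transfer rate into a data volume and back. Only the 17 MB/s rate comes from the tests mentioned above; the function names, the use of decimal units (1 GB = 1000 MB), and the two-hour window are assumptions for illustration.

```python
def gb_transferred(hours: float, rate_mb_per_s: float) -> float:
    """Gigabytes moved in `hours` at a sustained rate (decimal units)."""
    return hours * 3600 * rate_mb_per_s / 1000


def transfer_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move `data_gb` gigabytes at a sustained rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return data_gb * 1000 / rate_mb_per_s / 3600


RESTORE_RATE_MB_S = 17  # observed rate quoted in the text

# Volume that fits in an assumed two-hour window at that rate.
two_hour_volume_gb = gb_transferred(2, RESTORE_RATE_MB_S)

# Going the other way: time to move that same volume.
hours_needed = transfer_hours(two_hour_volume_gb, RESTORE_RATE_MB_S)
```

At 17 MB/s, two hours works out to roughly 122 GB, which gives a sense of scale for "the majority of our critical data." Real transfers rarely hold a constant rate, so this should be read as an optimistic estimate.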
3. Staff Backchannels
Data and restore targets are well and good, but we need people to do the work to restore services too. A lucky accident of the Co-op being a mostly virtual workplace is that we already have a number of online communication tools integrated into our workflow. A second lucky accident is that a main tool for technical staff is IRC, a globally-distributed chat network that would be unaffected by a local disaster. Finally, another lucky accident occurred when we moved from our old Voice over Internet Protocol (VoIP) phone provider to our own VoIP server – we left the configuration for the old provider on the phones and are able to invoke it as a backup system if our own servers were to go down. This means we can maintain phone service as well as IRC connections, critical if we were trying to coordinate a disaster recovery at a distance.
4. Disaster Recovery Test Run
To paraphrase Helmuth von Moltke, “No plan survives contact with implementation.” Having a plan is critical, but we wanted to test it before we had to do so under fire.
To this end, the Co-op held the first of many planned “Disaster Recovery Test Runs” in October 2016. Over a 10-hour period, Co-op staff attempted to resurrect as many critical services as possible using only backup data and remote cloud hosts.
The Test Run served its purposes and was a success. The good news was we learned that our backups and configuration could reliably resurrect our core ILS hosting service, Sitka, in less than a day and in a brand new environment. Even better, we were able to document the critical sequence needed to restore services and identify gaps in our configuration backups; closing those gaps will make any future restoration faster and easier.
Disaster Recovery is another one of those classic IT roles that often goes unacknowledged – you hope you never need it, and until you do, it’s hard to understand the effort and costs that have gone into it. But when you need it – boy are you glad you did it!