3 Steps For Handling Cloud Service Outages: Community Lessons to Alleviate Pain Points
In part 1 of this blog series, I considered the case of a widely publicized learning management system outage and some of the general lessons to draw from this scenario, as the research and education community increasingly relies upon externally provided IT services. I discussed several potential lessons to consider, and I'd also like to offer a few concrete steps the community could take to prepare for potential outages and thus alleviate some of the headaches they can cause.
Step 1: Develop your own “Service Disaster Recovery Plan” instead of relying upon an SLA, and push for meaningful SLAs from the get-go
Contractual Service Level Agreements are necessary and useful, but they have limitations: for a significant outage, an SLA likely won't yield much compensation, and it may not guarantee resolution of the outage either. SLAs need to be read carefully to ensure there aren’t any tricky provisions that would further reduce your credits or set up the wrong kind of financial incentives for the service provider.
For cloud services you deploy and support, I’d recommend developing and documenting your own “Service Disaster Recovery Plan.” This plan should capture exactly what you will do when the service goes down, in escalating time intervals. For instance, if the service is reported unavailable, the first step might be to monitor the service provider’s status alert site or to make contact with a Tier 2 support resource at the service provider. It would be a good idea to document the relevant URLs, contact phone numbers and emails, or where the up-to-date information will be found. For the case where the service is down for an extended period of time, the plan should be more robust and should ultimately describe your approach for moving to an alternate service (whether a homegrown option, another cloud-based offering, or something in between).
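One way to keep the escalating time intervals unambiguous is to write them down as data. The sketch below is only an illustration: the thresholds and action descriptions are hypothetical placeholders, not recommendations, and a real plan would substitute its own stages, URLs, and contacts.

```python
from datetime import timedelta

# Hypothetical escalation stages: (time since the outage began, action to take).
# Replace these with the intervals and actions from your own plan.
ESCALATION_PLAN = [
    (timedelta(minutes=0), "Check the provider's status alert site"),
    (timedelta(hours=1), "Contact Tier 2 support at the provider"),
    (timedelta(hours=24), "Brief leadership and publish workarounds"),
    (timedelta(days=3), "Begin moving to the alternate service"),
]


def actions_due(outage_duration):
    """Return every action whose time threshold has been reached."""
    return [
        action
        for threshold, action in ESCALATION_PLAN
        if outage_duration >= threshold
    ]


if __name__ == "__main__":
    # Two hours into an outage, the first two stages apply.
    for action in actions_due(timedelta(hours=2)):
        print(action)
```

Keeping the stages in one ordered list makes it easy to review the plan with staff and to adjust a threshold without touching anything else.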
Even more important, perhaps, is the contract itself. In the world of cloud services, 'contract is king': you don't own the hardware or software, so all you own is the contract. You need to coordinate with your attorney to ensure the SLA is as strong as possible. Even though the SLA may not ultimately be meaningful in terms of preventing service reliability issues or giving you appropriate compensation for a significant outage, you should absolutely include language that lets you out of the agreement after various thresholds of service outage.
For instance, a few years ago I talked with a campus that pointed out that a service being down for several days over the course of several months wouldn't make a financial dent in terms of the SLA service credits but would be a significant operational problem. Rather than leaving it to lawyers to argue later whether that constituted a 'material breach' of the agreement, cloud contracts should include an 'out clause' if the service has repeated or significant outages. This shouldn't impact a cloud service provider's ability to recognize revenue, and if they intend to meet their SLA commitments they shouldn't be substantively concerned about it either.
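To make an 'out clause' concrete, it can help to express the thresholds as a simple rule you can check against your outage log. The following is a minimal sketch under stated assumptions: the 48-hour single-outage limit, the 72-hour cumulative limit, and the 90-day window are hypothetical numbers chosen for illustration, and the actual thresholds would come from your negotiated agreement.

```python
from datetime import date, timedelta

# Hypothetical contract thresholds; real values come from your agreement.
MAX_SINGLE_OUTAGE = timedelta(hours=48)
MAX_CUMULATIVE_OUTAGE = timedelta(hours=72)
WINDOW = timedelta(days=90)


def out_clause_triggered(outages, today):
    """Check a log of (start_date, duration) outages against the thresholds.

    Only outages that started within WINDOW of `today` are counted.
    """
    recent = [
        duration
        for start, duration in outages
        if timedelta(0) <= today - start <= WINDOW
    ]
    if any(duration >= MAX_SINGLE_OUTAGE for duration in recent):
        return True
    return sum(recent, timedelta()) >= MAX_CUMULATIVE_OUTAGE


if __name__ == "__main__":
    # Three outages, none long enough alone, but 74 hours in total.
    log = [
        (date(2024, 1, 10), timedelta(hours=30)),
        (date(2024, 2, 5), timedelta(hours=24)),
        (date(2024, 3, 1), timedelta(hours=20)),
    ]
    print(out_clause_triggered(log, date(2024, 3, 15)))
```

The cumulative check is what captures the campus scenario above: several shorter outages that would each earn only a small service credit can still add up to grounds for exiting the agreement.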
Step 2: Push for source code escrow in addition to data backup
Even though having access to your data and external backups of it is often more pressing than service availability, at least in the short term, the Service Disaster Recovery Plan should include a “service escrow” component. You should consider how you would spin up a new service in parallel to make effective use of that data. Software-as-a-service is at its core 'software,' so there will be source code that you could potentially compile and run yourself if needed. Some SaaS providers, especially those with roots in the open source community, may be comfortable with contractual provisions around source code escrow, especially if you are willing to pay or split the cost of the escrow. If they routinely maintain an open source code base, you may already have access to what you'd need to get the service running yourself. Will your staff know how to do that? Which of your developers and/or technical staff would take the lead on that project? I recommend that these questions be addressed as part of your Service Disaster Recovery Plan. Ultimately, your plan combined with these elements should bring you as close as financially possible to being able to fail over from one service to another.
For Infrastructure and Platform as a Service, you may have a tougher road, but the principles are the same. If you have applications running on cloud infrastructure-as-a-service that suddenly went offline, how would you react? Would you move them to another availability zone or data center in the same cloud infrastructure? Another cloud infrastructure service? On premises? Would your apps need to be rewritten in order to make this move? How long would it take to transfer the data? How you’d move your service, who would do it, and how you’d pay for it are best thought about and documented well in advance.
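The data transfer question can be roughed out ahead of time with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second), and the `efficiency` parameter is a hypothetical fudge factor for protocol overhead and contention, not a measured value.

```python
def transfer_hours(data_tb, link_gbps, efficiency=0.7):
    """Estimate hours needed to move data_tb terabytes over a link.

    efficiency is the assumed fraction of the link's nominal rate
    that the transfer actually sustains (hypothetical default).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    total_bits = data_tb * 1e12 * 8
    seconds = total_bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # 10 TB over a 1 Gbps link at an assumed 70% sustained throughput.
    print(f"{transfer_hours(10, 1.0):.1f} hours")
```

Even a rough figure like this is useful in the plan, because it tells you whether a migration is a matter of hours or of days before any application work begins.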
Step 3: Rely on people and communication
Everyone, from your frontline help desk staff to senior leadership, needs to be in the loop as you implement a Service Disaster Recovery Plan. Users will be frustrated and impatient because, in all likelihood, the service will have been down for a few days or weeks before you migrate to another service or implement another intensive fallback solution.
Service Disaster Recovery Plans should include sample communication responses and templates that can be rapidly customized and implemented as you work through various stages of the plan. Expect that many in your end-user community will be understanding if you give them clear, consistent, and reliable information, so the messaging should include temporary work-around details, as well as your own commitment for when an alternative service would be up and running. Considering that the external service provider will likely be communicating as well, there is potential to create confusion. Although I don't believe it's productive to point fingers or assign blame in the midst of an outage, if you are passing along an update from the cloud service provider you should say so and identify it as such. To minimize confusion and frustration, you may also want to be transparent with your users and let them know that you're implementing an internal Service Disaster Recovery Plan in addition to what they may see or hear from the service provider. Ultimately, your end users may be very appreciative that you make the decision to migrate to another service even if it means they go without service in the short term.
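As a small illustration of a template that can be customized quickly, here is a sketch using Python's standard `string.Template`. The wording, the field names, and the sample values are all hypothetical; the one idea it tries to capture is that every message carries an explicit source label, so users can tell a campus update from one relayed from the provider.

```python
from string import Template

# Hypothetical message template; adapt the wording to your own campus.
UPDATE = Template(
    "[$source] $service status as of $time: $status. "
    "Workaround: $workaround. Next update by $next_update."
)


def render_update(source, **fields):
    """Fill in the template; raises KeyError if a field is missing."""
    return UPDATE.substitute(source=source, **fields)


if __name__ == "__main__":
    print(
        render_update(
            "Campus IT",
            service="LMS",
            time="2:00 PM",
            status="unavailable",
            workaround="submit assignments by email",
            next_update="4:00 PM",
        )
    )
```

Using `substitute` rather than `safe_substitute` is deliberate here: a message with a missing field fails loudly instead of going out to users with a stray placeholder in it.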
One of the advantages of the Internet2 NET+ program is the set of additional contractual provisions that enable Internet2's NET+ Program Advisory Group to terminate the master agreements with providers and “Sunset” services. For a NET+ service with consistent SLA violations or problems severely impacting many campuses, this would likely lead to this group of CIOs and community leaders terminating the program, which could have a more significant impact on a cloud service provider than individual campuses exiting standalone contracts. We’ve attempted to bring community leverage to bear on the cloud service providers participating in NET+.
The program does not have a mechanism for Service Disaster Recovery Plans, since these are likely campus-specific and are not artifacts we can require from external service providers. However, we are considering including in NET+ Service Validation requirements that companies explain in detail how to remove data from their services and that campuses develop model plans that could be adapted by other program participants. We would also welcome additional community dialog on this topic, and we seek to provide a mechanism for information sharing among schools that have developed these kinds of plans for cloud services.
Just as we strive to make the NET+ contract templates the community’s “click through” agreements and have effectively flipped the model of accepting commercial agreements, I wonder if we could achieve something similar to improve service reliability with SLAs and Service Disaster Recovery Plans--by letting providers know the criteria we would use to determine when to terminate a service.