Three Lessons for Combatting Cloud Services Outage Headaches
The recent outage of a learning management system at multiple universities left faculty members and students worried that they would not be able to finish finals. Inside Higher Ed's recent article on the topic prompted me to reflect on downtime, SLAs, and “exit strategy” issues as they relate to deploying cloud services in higher education.
The outage underscores how much reliance is being placed on external IT service providers and cloud services, as well as the difficulties that arise when something doesn’t work as intended.* A UC Davis student’s post on the campus Reddit page highlighted this dependence from the IT consumer perspective. The student wrote that “some of my classes don't even have textbooks -- the materials are all posted in PDF or links on SmartSite. We're doing midterms and are a couple of weeks away from finals!”
Whether through cloud “as-a-service” offerings, hosted solutions, or other services, the higher education community is relying on external IT providers, so this is likely a common issue among students, faculty, researchers, staff, and IT practitioners. While an outage of any kind is frustrating, this particular case offers universities the opportunity to reflect and prepare:
Lesson #1: An SLA only goes so far
Most of us are used to Service Level Agreements (SLAs) with service providers, cloud-based or otherwise. Every reputable external service should offer one with commitments regarding response time, uptime, downtime, and scheduled and emergency maintenance, among other provisions. These are extremely useful in providing a set of expectations about how the service should perform, and they demonstrate contractually the kind of confidence the service provider has in its offering and capabilities.
However, I would recommend that campuses not expect much from credits or monetary compensation in the event the SLA isn't met. For example, a learning management system being unavailable for a day would be a critical incident for any campus and in all likelihood would lead to compensation for the amount of downtime (a day would amount to ~3% of the monthly service charge), but that compensation wouldn't account for the lost productivity, aggravation, or inconvenience. If the annual service cost is $100,000, a day's outage would give you about $250 in credit, and typically that's all it is: a service credit, not a refund.
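To make the arithmetic concrete, here is a minimal sketch of a pro-rated credit calculation. It assumes a 30-day billing month and a credit strictly proportional to downtime; real SLAs define their own formulas, and the function name is hypothetical. Under those assumptions a full day comes to about 3.3% of the monthly charge, or roughly $278, which is the same order of magnitude as the rounded ~3% and ~$250 figures above.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month


def prorated_credit(annual_cost: float, outage_hours: float) -> float:
    """Service credit for an outage, pro-rated against one month's charge.

    Hypothetical formula for illustration; an actual SLA may use tiers,
    caps, or exclusions instead of straight pro-rating.
    """
    monthly_charge = annual_cost / 12
    return monthly_charge * (outage_hours / HOURS_PER_MONTH)


credit = prorated_credit(100_000, 24)
print(f"One-day outage credit: ${credit:.2f}")  # One-day outage credit: $277.78
```

Whichever rounding is used, the credit is a tiny fraction of the annual cost and bears no relation to what a day of lost teaching and learning is worth to the campus.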
Some SLAs can be especially tricky by measuring responsiveness in addition to (or sometimes in place of) downtime. That is, the service provider’s response to an incident and their remediation work could minimize the credit they are obligated to provide. In some extreme cases, the Internet2 NET+ team has seen SLAs from commercial service providers that proposed basing credits on the average outage length during a month. Paradoxically, that would give a service provider that has had a single lengthy outage a financial incentive to have several more short-term outages before the end of the month (e.g. outages of 24 hours, 1 hour, and 1 hour would lead to a credit for 8.67 hours, or for that $100k annual service, about $100). While several short-term outages might mean the provider has restored partial access, and their responsiveness is certainly important, each outage should be considered separately. Furthermore, the speed of the response or the time actually being devoted to the outage should be covered separately from calculations of downtime and service credits.
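The averaging problem is easier to see with numbers. The sketch below compares a credit based on average outage length with one that counts each outage separately. It assumes a $100,000 annual service, a 30-day month, and a flat hourly rate; the function names are hypothetical and not taken from any real SLA.

```python
# Assumption: $100,000 per year, billed monthly, 30-day month, flat hourly rate.
HOURLY_RATE = 100_000 / 12 / (30 * 24)


def credit_on_average(outage_hours: list[float]) -> float:
    """Credit when the SLA pays on the average outage length in the month."""
    return HOURLY_RATE * sum(outage_hours) / len(outage_hours)


def credit_per_outage(outage_hours: list[float]) -> float:
    """Credit when every outage is counted separately and summed."""
    return HOURLY_RATE * sum(outage_hours)


one_long = [24]
long_plus_short = [24, 1, 1]

print(f"{credit_on_average(one_long):.2f}")         # 277.78
print(f"{credit_on_average(long_plus_short):.2f}")  # 100.31
print(f"{credit_per_outage(long_plus_short):.2f}")  # 300.93
```

Under the averaging scheme, two extra hours of downtime cut the provider's obligation from about $278 to about $100, whereas counting each outage separately raises it to about $301, which is the direction the incentive should point.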
Lesson #2: Data access is key
When an external service goes down for an extended period, there's certainly pain for an end user who relies on that service to complete work. However, irrevocably losing data stored in that service is potentially much more catastrophic, and it’s far easier to continue operations in the short term when data backups are readily available and usable.
I’ve experienced this a few times at the airport when an air carrier's computer systems were down, and the scenario has helped me think through the various contingencies with cloud service availability. Receiving handwritten boarding passes was only possible because the airline had a trusty old-fashioned printout of the flight manifest. They had access to the key data and knew who was supposed to be on the flight and in which seats, so they could deal (however painfully) with IT service unavailability. It's much the same with a cloud service: you and your users can likely work around a service being unavailable, but you need access to core datasets in order to do so.
In the case of a learning management system or other service for teaching and learning, it’s generally a good idea to think about access to data at the enterprise and end user levels. Best practice at the enterprise level would involve “snapshot” backups of an entire system at daily (or perhaps more frequent) intervals. Although end user behavior is impossible to control fully, you can send regular communications and messaging to your users reminding them to keep their own backups. You might also consider investing in tools and potentially other cloud services in order to facilitate end user backups and data storage in multiple locations.
Lesson #3: Communication needs to be managed and coordinated
When an externally provided service becomes unavailable, the help desk is going to be inundated, as are multiple communication channels. The service provider itself will likely be fielding numerous inquiries from other customers and potentially from the end users themselves.
Because cloud services are generally multi-tenant, an outage impacting your campus will probably be affecting others as well, and a responsible cloud service provider will be providing updates and status reports. On the surface these are good things, but it’s important to take into account that outages may not be evenly distributed across a multi-tenant environment. Your end users may be hearing from the provider that the issue is “resolved” yet still be experiencing a problem.
Furthermore, in this age of social media and smartphones, some of your end users may reach out to the cloud provider directly for support, for updates, or to vent their frustrations. What happens if they do? There’s additional possibility for misinformation or confusion: perhaps the cloud provider sends those users to a local helpdesk, gives a boilerplate reply, or doesn’t reply at all. Understanding the support and helpdesk model for your campus, as well as how externally provided services fit into it, is crucial.
One of the reasons I enjoy my work with the Internet2 NET+ program is that issues like these, although painful, can start to be alleviated by leaning on the community. Using validated and trusted cloud services vetted and developed by the community, and having access to resources and a network of allies, helps us all consider and plan for outages BEFORE they become a headache.
In part 2 of this blog series, I offer a few potential solutions to these issues, to enable campuses to plan their cloud strategies and implementations of externally provided IT services.
* In this case, the standard “cloud” terminology may not be totally accurate: the solution in this particular scenario was described as “hosted,” so it may not be a true multi-tenant software-as-a-service application.