Tips for Managing Cloud Infrastructure Total Cost of Ownership
After attending the AWS Public Sector Symposium in Washington DC last week, one thing stood out as something our community is collectively struggling to address: TCO, or Total Cost of Ownership, in the cloud. How do we know how much we will spend? How much will this "cloud thing" cost and save? How do we budget for it?
Two sessions were particularly relevant to these questions: a panel session on TCO that included Ryan Frazier, Director of Infrastructure Customer & Project Services at Harvard, and a Cost Management Lessons presentation from JR Storment, Co-Founder and Chief Customer Officer of Cloudability, a cloud management and monitoring company. These two presentations took different approaches but ultimately came to the same conclusion: manage costs; don't predict them.
This was novel, at least to me as someone who has previously budgeted IT expenditures on campus, and the points resonated. Since this was an AWS symposium, my framing focuses on AWS, but it applies to many other IaaS solutions as well. Top guidance from the speakers included:
- The cost of a given instance continues to decrease over time.
- A newer generation of the same instance type is typically more efficient for a given workload than an older one.
- A newer generation of the same instance type is, in many cases, cheaper.
- You can scale your application up to a larger instance type as it grows.
- You can optimize the number and types of instances for the application's current or future condition.
- If your usage is predictable, you can pre-buy an instance size and type (a reserved instance) to save even more, not to mention spot instances.
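To make the pricing points above concrete, here is a back-of-the-envelope sketch. All hourly rates and the reservation discount are made-up numbers for the sake of the arithmetic; check the current AWS price list for real figures.

```python
# Hypothetical illustration of the instance-pricing points above.
# Every rate and discount below is invented for illustration only.

HOURS_PER_MONTH = 730  # the commonly used monthly hour count

def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of running one instance continuously at a given hourly rate."""
    return hourly_rate * hours

# Suppose an older-generation instance costs $0.10/hour on demand,
# and its newer-generation equivalent costs $0.085/hour.
old_gen = monthly_cost(0.10)
new_gen = monthly_cost(0.085)

# Pre-buying (reserving) might knock a further hypothetical 30%
# off the on-demand rate.
reserved_new_gen = monthly_cost(0.085 * 0.70)

print(f"old generation, on demand: ${old_gen:.2f}/month")
print(f"new generation, on demand: ${new_gen:.2f}/month")
print(f"new generation, reserved:  ${reserved_new_gen:.2f}/month")
```

Even in this toy example, simply tracking generations and reserving predictable capacity cuts the monthly bill by roughly 40 percent, before any rightsizing.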
Compare this to a traditional, under-utilized, refreshed-every-three-years, static on-prem environment and you can see why thinking will rapidly shift. Even if an application is stateful, you can still optimize a campus-scale application in the cloud to the typical patterns of campus life (8am-5pm, Monday to Friday, with spikes at the beginning and end of semesters and during class registration, and minimal summer traffic) in ways that could never be done previously. On premises, where all campus applications follow this pattern, we have to build for the peak, with no real benefit to scaling back at the trough. Apply this thinking to development and test environments alone, and you have significant cost savings over what is running today.
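The dev/test savings claim is easy to quantify. A minimal sketch of the arithmetic, assuming an environment that only needs to run during the campus business hours mentioned above:

```python
# Back-of-the-envelope savings from running a dev/test environment only
# during campus business hours (8am-5pm, Monday to Friday) instead of
# 24/7. The absolute dollar amount depends on your instance rates; the
# fraction does not.

HOURS_PER_WEEK = 24 * 7           # 168
BUSINESS_HOURS_PER_WEEK = 9 * 5   # 8am-5pm, Mon-Fri = 45

business_fraction = BUSINESS_HOURS_PER_WEEK / HOURS_PER_WEEK
savings = 1 - business_fraction

print(f"business-hours schedule uses {business_fraction:.0%} of 24/7 hours")
print(f"potential compute savings:   {savings:.0%}")
```

A scheduled dev/test environment runs about 27 percent of the hours an always-on one does, which is the kind of saving no static on-prem refresh cycle can capture.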
Some cultural challenges do present themselves. In my experience, decentralized IT units would really prefer to manage their own workloads. There are also the previously hidden costs of power, cooling, and space, now exposed and charged. These conversations also bring an application risk assessment to the forefront, with questions like "What is a reasonable level of DR?" or "What is an acceptable level of downtime, considering Recovery Point and Recovery Time Objectives?" If these questions are not answered, an application can be both over-engineered and over-protected, making it more difficult to manage in the long run.
This leads me to some concluding (and largely transposed) advice on how to tackle these challenges.
- Start with a small but prominent and visible production workload. You are looking for a success to point to, one that all stakeholders (business office, academic and student affairs leadership, and IT) can build upon.
- Don't tie a move to the cloud to staffing changes, constraints, or retraining. IT should advocate for a move that will eventually free us from mundane maintenance.
- Work to develop ballpark estimates with a large month-to-month margin, rather than a "thou shalt spend" budget for infrastructure, keeping in mind that your unit cost should go down over time even if your spend does not.
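The distinction between spend and unit cost in that last point deserves a concrete illustration. A toy calculation, with invented figures and "hosted applications" as the unit of work (it could equally be enrolled students or requests served):

```python
# Toy illustration: total spend can rise while cost per unit of work
# falls. All figures are invented for illustration.

year1 = {"spend": 100_000, "units": 50}   # e.g. 50 hosted applications
year2 = {"spend": 120_000, "units": 80}   # spend grew 20%, workload grew 60%

unit_cost_y1 = year1["spend"] / year1["units"]
unit_cost_y2 = year2["spend"] / year2["units"]

print(f"year 1: ${unit_cost_y1:,.0f} per application")
print(f"year 2: ${unit_cost_y2:,.0f} per application")
```

Here the bill grew, but each application got 25 percent cheaper to run, which is exactly the trend a "manage, don't predict" posture is meant to surface.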
We have to remember that we are here to enable the University mission—to substantially increase capacities to advance research, science, and scholarship. The cloud is a means to refocus our efforts on the overall community successes.