Elephant In The Room: What’s The TCO For An OpenStack Cloud?

A few months ago, for our own internal use, we started a project to calculate what it costs to run an OpenStack-based private cloud: more specifically, the total cost of ownership (TCO) over the years of its useful life. We found the exercise to be complex and time-consuming, as we had to gather all of the inputs, decide on assumptions, vet the model and inputs, etc. So, in addition to results, we’re offering up a few lessons we learned along the way, which will hopefully save you a scar or three when you want to create your own TCO model.

Ultimately, we wanted answers to three layers of cost:

  1. What is the most cost-effective method for acquiring and running OpenStack?
  2. How does OpenStack compare financially to non-OpenStack alternatives?
  3. How should we prioritize technical improvements to provide financial improvements?

After an exhausting survey of cloud TCO research, we found that none of the cost models we could get our hands on was complete enough for our needs: some did not break out costs by year, some did not include all of the relevant costs, and none addressed potential economies of scale. We needed a realistic, objective, and holistic view, not hand-picked marketing results, and along the way we found a few suggestions that helped us get there, whatever the technology.

Since we could not find anything both comprehensive and transparent, we created one, and used the opportunity to go a few steps further by adding more dimensions: full accounting impact across cash flow, income statement, and balance sheet. The additional complexity made the model harder to understand and consume. Further, we needed the model not only to spit out projections, but to be a reliable way to compare options and support decision making throughout the life of a cloud as options and assumptions change. So, we decided to create a tool rather than just a single TCO figure, to allow easy comparisons and conversations with financial teams and lines of business.

To help us view the data objectively, we relied as much as possible on industry data. Making assumptions was inevitable, since not all of the required data is available, but we made as few as possible and verified the model and results with a number of reputable and trusted organizations and individuals in both finance and IT.

What is the most cost-effective method for acquiring and running OpenStack?

If you’re considering or even already running OpenStack, we imagine you’re asking yourself a question like, “I have a smart team, why can’t we just support the upstream code ourselves?” As Red Hat provides commercially supported open source software, we can talk all day about the value of supported open source software, including the direct impact on OpenStack, but we also want to address the direct costs, the line items in your budget. To get to these costs and answer our questions, we shaped the model to analyze two different acquisition and operation methods for OpenStack:

  • Self-supported upstream OpenStack
  • Commercially supported OpenStack

image04

As the model shows, the self-supported upstream use of OpenStack, with the least expensive software acquisition cost, ends up the most expensive, which may seem counter-intuitive. Why? Because of the cost of people and operations.

All of the costs of a dedicated team* running the cloud (salaries, hiring, training, loaded costs, benefits, raises, etc.) are a large chunk of the total costs, regardless of the underlying technology. With a commercially supported OpenStack distribution, you only need to staff the operations of your cloud, rather than also employing the software engineers, QA team, etc., needed to support both your cloud and the code. We expect that you will need to hire fewer people as your cloud grows, and that the savings would exceed the incremental cost of the software subscription. Your alternative is this:

image10

Taking our analysis a step further, we also explored the financial impact of increasing the level of automation in an OpenStack cloud with a Cloud Management Platform (CMP). Why? Because most companies’ experience shows** that managing complex systems usually doesn’t go according to plan. However, if automation is appropriately implemented, it can lower the TCO of any complex system.

CMP is a term coined by Gartner to describe a class of software encompassing many of the overlaid operations we think of in a mature cloud: self-service, service catalogs, chargeback, automation, orchestration, etc. In some respects, a CMP augments any cloud infrastructure engine, like OpenStack, with the capabilities necessary for enterprise-level use.

Our model shows coupling a CMP with OpenStack for automation can be significantly less expensive than either using and supporting upstream code, or using a commercial distribution. Why? As with the commercial distribution, our model shows that you would need to hire fewer people as your cloud grows, and the savings can potentially dwarf the incremental software subscription cost. The combined costs are drawn from Red Hat Cloud Infrastructure, which includes the Red Hat CloudForms CMP and Red Hat Enterprise Linux OpenStack Platform.

image05

One of the sets of industry data we used to help create an unbiased model came from an organization named Computer Economics, Inc. They study IT staffing ratios and all kinds of similar things. They found that the average organization, with the average amount of automation, supports 53 operating system instances (a mix of physical and virtual) per system administrator. They also found that the average organization with a high level of automation supports 100 instances per admin.

So, in our scenario, with the cloud expected to double in size next (and every) year, you have a few options. You can double your cloud staff (good luck with that), double the load on your administrators (and watch them leave for new jobs), or invest in IT automation.

The aforementioned study shows that high levels of automation can nearly double the number of OS instances supported. While automation can reduce the cost curve for hiring, and make your cloud admins’ lives easier, we’re in a financial discussion. Automation only makes financial sense if it lowers the cost per VM. Which is exactly what we found:
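To make the staffing ratios concrete, here is a minimal sketch of how headcount scales when the instance count doubles every year. The 53 and 100 instances-per-admin ratios are the ones quoted above; the starting count of 500 instances and the function names are hypothetical, chosen only for illustration, and are not taken from our model.

```python
import math

# Instances-per-admin ratios quoted above from Computer Economics, Inc.
AVERAGE_AUTOMATION = 53
HIGH_AUTOMATION = 100


def admins_needed(instances: int, instances_per_admin: int) -> int:
    """Smallest whole number of admins that can cover the instance count."""
    return math.ceil(instances / instances_per_admin)


def staffing_plan(start_instances: int, years: int, instances_per_admin: int) -> list:
    """Admins needed each year when the instance count doubles annually."""
    return [
        admins_needed(start_instances * 2**year, instances_per_admin)
        for year in range(years)
    ]


if __name__ == "__main__":
    START = 500  # hypothetical starting instance count, not from the model
    for label, ratio in (("average", AVERAGE_AUTOMATION), ("high", HIGH_AUTOMATION)):
        print(label, staffing_plan(START, 4, ratio))
```

Under these assumptions, the gap in required headcount between the two ratios roughly doubles each year along with the cloud, which is why the staffing ratio matters more the faster you grow.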

image01

In order to compare the costs and advantages of automation more closely, we looked inward (it was an internal study after all). We compared the fully loaded costs (hardware, software, and people) for one VM of our commercial distribution of OpenStack, Red Hat Enterprise Linux OpenStack Platform (RHELOSP), with those of our Red Hat Cloud Infrastructure (RHCI), which includes both RHELOSP and our CMP, Red Hat CloudForms.

Looking at the waterfall chart above, we start with the fully loaded cost of one VM provided by RHELOSP, $5,340, and want to compare it with the similarly loaded cost for RHCI. The RHCI software costs an additional $53 per VM under these density assumptions, which increases the cost to $5,393. Next, we factor in the $1,229 saved through automation by hiring fewer people as your cloud grows, which gives a loaded cost of $4,164 per VM for RHCI. Under our model, using a CMP with OpenStack produced automation savings of over $1,200 per VM, or $1,176 net of the additional software cost.
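The waterfall arithmetic can be checked in a few lines. This is only a sketch that replays the per-VM figures quoted above; the function and constant names are ours for illustration and are not part of the model itself.

```python
# Per-VM figures quoted in the waterfall walkthrough above (USD per VM).
RHELOSP_LOADED_COST = 5_340   # fully loaded cost of one RHELOSP VM
RHCI_EXTRA_SOFTWARE = 53      # additional RHCI software cost per VM
AUTOMATION_SAVINGS = 1_229    # savings from hiring fewer people


def rhci_loaded_cost(base: int, extra_software: int, automation_savings: int) -> int:
    """Start from the base cost, add the software, subtract the automation savings."""
    return base + extra_software - automation_savings


with_software = RHELOSP_LOADED_COST + RHCI_EXTRA_SOFTWARE
rhci_cost = rhci_loaded_cost(RHELOSP_LOADED_COST, RHCI_EXTRA_SOFTWARE, AUTOMATION_SAVINGS)
net_savings = RHELOSP_LOADED_COST - rhci_cost

print(f"After adding RHCI software: ${with_software:,}")    # $5,393
print(f"After automation savings:   ${rhci_cost:,}")        # $4,164
print(f"Net savings per VM:         ${net_savings:,}")      # $1,176
```

Note the difference between the gross automation savings ($1,229) and the net savings once the extra $53 of software is counted ($1,176).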

Moving from just an average level of automation to a high level of automation, our model showed a significant improvement in costs as you grow: the extra cost of automation can be dwarfed by the potential savings. High automation only means moving from the median to the 75th percentile, so our model shows that there’s a lot of headroom for improvement above and beyond even what we show.

At $1,200+ savings per-VM per-year, automation has the potential to quickly add up to millions in savings once you’ve reached even moderate scale.
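As a rough sketch of how quickly this adds up, the snippet below multiplies the $1,200 per-VM annual figure quoted above by a VM count. The 2,000-VM example and the helper names are hypothetical, and the calculation assumes the per-VM savings stay constant at every scale, which the model itself does not claim.

```python
import math

SAVINGS_PER_VM_PER_YEAR = 1_200  # lower bound quoted above, in USD


def annual_savings(vm_count: int) -> int:
    """Total yearly savings, assuming a flat per-VM saving at any scale."""
    return vm_count * SAVINGS_PER_VM_PER_YEAR


def vms_to_reach(target_savings: int) -> int:
    """Smallest VM count whose yearly savings meet the target."""
    return math.ceil(target_savings / SAVINGS_PER_VM_PER_YEAR)


print(vms_to_reach(1_000_000))   # 834 VMs for the first million per year
print(annual_savings(2_000))     # 2400000 for a hypothetical 2,000-VM cloud
```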

That kind of benefit is one of the many reasons why Red Hat recently acquired Ansible. And given that Ansible is easy to use, Ansible tools can not only improve the TCO through automation, but can also help customers achieve those savings faster.

How do OpenStack and non-OpenStack compare financially?

As we said, we also wanted the model to be useful for comparing different market alternatives, but for the comparison to be meaningful, it needed to be apples-to-apples. Competitive private-cloud technology available on the market at the time of our research provided much more than just the cloud infrastructure engine, so we decided to compare OpenStack plus the CMP against commercial bundles made of a hypervisor plus CMP, which is what Red Hat customers and prospects ask us to do most of the time.

In the model, we conservatively assume that the level of automation is exactly the same. If you have data you are willing to share which supports or refutes this, please let us know.

As we expected, the model showed us that an OpenStack-based private cloud, even augmented by a CMP, costs less than a non-OpenStack-based counterpart. The model shows savings of $500 per VM, increasing to $700 and beyond as the number of VMs grows and the cloud matures over time.

image06

However, the question is: is the $500-700+ in savings per-VM worth the risk of bringing in a new technology? To find the financial answer, we had to consider how these savings add up.

image02

As the chart shows, by the time you have even a moderate-sized cloud, the total annual cost savings of OpenStack with a CMP can exceed two million dollars. We are aware that it’s common business practice to apply discounts to retail prices, but to keep the comparison as objective as possible, we used the list prices disclosed by every vendor we evaluated in our research. Because our competitors were not keen on sharing their discount rates, list prices are the only objective comparison we can make. We estimate that a small portion of these savings comes from increased VM density (which we’ll talk about later), but the majority is in software costs.

With this in mind, if you take a look at these numbers, and think about the software discounts you’ve negotiated with your vendors, you’ll have a reasonable idea of what this would look like for you. And as a reminder, these are just for the exponential growth model starting from a small base. We’ll wager there are any number of you reading this who have already well exceeded these quantities and are accumulating savings even faster than we show here.

We also recommend looking at the total costs over the life of a project. In fact, when we look at the accumulated savings over the life of your private cloud, we notice something rather striking.

image03

Our model showed that it really doesn’t matter what your discount level is: if you plan on any production-scale deployment, OpenStack with a CMP can potentially save you millions of dollars over the life of your private cloud.

How should we prioritize technical improvements to provide financial improvements?

In order to move from one-time decisions to deliberate on-going improvements, you need the “why” of the model as well as the outputs. By the time we finished building and vetting our TCO model, we made a number of interesting, and sometimes surprising, discoveries:

Cost per VM is the most important financial metric

For most of this post, we’ve been focusing on cost per VM. Despite the necessity in budgeting, total costs are simply not instructive. Here’s an example of the total annual costs over six years, for one of the many private cloud scenarios we considered:

image08

A typical approach in TCO calculations is looking at the annual costs, but this metric alone isn’t particularly helpful in the analysis of a private cloud, with or without OpenStack. In private clouds, we can’t get away from the fact that we are providing a service, and what our Lines of Business or customers consume is a unit, like a VM or container. Hence, we believe that it’s much more significant to look at the annual per-VM cost.

image07

In the same scenario we showed with the rapidly increasing total costs, the VM cost has dropped by more than half from the first year to the third. That dramatic improvement is impossible to see in the total costs curve. Without accounting for the VM costs, you’d miss that the total costs are increasing because of more usage, and that you’re getting more for your dollar every year. Increasing growth while increasing cost efficiency is a good problem to have.

In other words, we recommend using VM Cost as your main metric because it shows how good you are at reducing the cost of what you provide. Total Cost does not distinguish between cost improvement and usage growth.

The hardware impact on total spend is marginal

We’ve woven in analysis of two of the three main cost components related to acquiring and running OpenStack, and financially comparing OpenStack and non-OpenStack alternatives. Our model shows that the selection of private cloud software choices has the potential to save you millions of dollars. The investment in automation similarly shows the potential to save additional millions of dollars. Either or both of these can save an organization a lot of money, despite the additional expenses. But, so far, we’ve only hinted at hardware costs.

Some of our readers may be surprised at the results: hardware is a large and easily identifiable cost, so if you can cut the amount of hardware, in theory you can save a lot of money. Our model suggests that it’s not really the case.

image09

We asked the model how costs change across a large range of VM densities: 10, 15, 20, and 30 VMs per server, with no other changes. The numbers show very little difference in costs even across this large range of densities.

If we start with an average density of say 15 VMs per server and (unrealistically) double it to 30, we see a savings of around $350 per VM. Not a trivial amount, and one that adds up quickly at scale, but these amounts are before the costs of any software and the effort to make this monumental jump in efficiency.

If we make a more realistic (but still really big) stretch to a ⅓ increase in density, from 15 VMs per server up to 20 VMs per server, the model indicates $175 in savings per VM before the cost of software and effort. This is tiny compared to the $1,200 or more in savings per VM through automation in the same scenarios.
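To see why density moves the needle so little, consider a simplified sketch in which the hardware cost per VM is just the annual cost of a server divided by the number of VMs it hosts. Under that assumption, both savings figures above ($350 and $175) are consistent with an annual per-server cost of about $10,500. That number is back-derived from the quoted savings for illustration only; it is not an input published from our model, and the function names are hypothetical.

```python
def hardware_cost_per_vm(annual_server_cost: float, vms_per_server: int) -> float:
    """Simplified assumption: server cost is spread evenly across its VMs."""
    return annual_server_cost / vms_per_server


def density_savings(annual_server_cost: float, old_density: int, new_density: int) -> float:
    """Per-VM hardware savings from raising VM density on the same server."""
    return (hardware_cost_per_vm(annual_server_cost, old_density)
            - hardware_cost_per_vm(annual_server_cost, new_density))


# Back-derived illustration value, NOT a figure from the model.
IMPLIED_SERVER_COST = 10_500

for density in (10, 15, 20, 30):
    print(density, round(hardware_cost_per_vm(IMPLIED_SERVER_COST, density), 2))

print(density_savings(IMPLIED_SERVER_COST, 15, 30))  # 350.0
print(density_savings(IMPLIED_SERVER_COST, 15, 20))  # 175.0
```

Because per-VM hardware cost falls as 1/density, each additional step in density buys less than the one before it, while the effort to get there keeps growing.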

Never neglect your hardware costs, but don’t start there for cost improvements: it’s unlikely to provide the biggest bang for your buck.

Lowering VM costs will increase usage and total costs

Our model shows that the more you lower the VM costs for the same service, the more you will increase your total costs. There’s a direct causal effect: the less expensive this service is, the more people want to use it.

Here’s a different example from our industry to illustrate the point. 1943 saw the beginning of construction of the ENIAC, the first electronic general-purpose computer, which cost about $500,000. In 2015 dollars, that’s well over $6,000,000. Today, servers cost less than 1/100th of that, and we buy millions of them every year. We now spend much, much more on IT than the first IT organizations did supporting those early giant beasts and, yet, our unit costs are significantly lower.

Based on this awareness, we looked at the market numbers for consumption of servers and VMs from IDC, and ran some calculations: for every 1% you reduce your VM cost, you should expect to see a 1.2% increase in total cost, due to a 2.24% increase in consumption. This seems counterintuitive, but the increase in total costs is due to your success. You’ve reduced the costs to your customers, so they’re buying more. Once again, your reduction in VM cost is directly increasing the demand for the services of your cloud.
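The relationship between those three percentages is simple multiplication: total cost is unit cost times quantity. The sketch below replays the figures quoted above; the function names are ours for illustration.

```python
def total_cost_change(unit_cost_change: float, consumption_change: float) -> float:
    """Fractional change in total cost, where total = unit cost * quantity."""
    return (1 + unit_cost_change) * (1 + consumption_change) - 1


def demand_elasticity(unit_cost_change: float, consumption_change: float) -> float:
    """Ratio of consumption change to price change (magnitude)."""
    return abs(consumption_change / unit_cost_change)


# Figures quoted above: a 1% cut in VM cost, a 2.24% rise in consumption.
change = total_cost_change(-0.01, 0.0224)
print(f"{change:.2%}")  # 1.22%, i.e. roughly the 1.2% increase in total cost

# An elasticity above 1 means demand grows faster than price falls.
print(round(demand_elasticity(-0.01, 0.0224), 2))  # 2.24
```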

IT, and in particular IT components like servers and VMs, have “elastic demand curves,” broadly meaning that reducing prices leads to greater utilization and greater total cost. If increased efficiency causing higher total costs comes as a surprise to you, you’re not the only one.

Track all of your costs to prioritize efforts

Tracking the costs of as many components as possible enables you to prioritize improvements over time even as your cloud matures, your staff gets better and better at running it, and demands change from your customers. In order to build a tool around our TCO model, we had to decide which costs we wanted to track and model together. Our model accounts for all hardware, software, and personnel required to operate a private cloud. Each and every one of them is a potential lever in affecting how your costs change over time.

image00

The levers built into the model include: VM density affecting hardware spend, IT automation for personnel costs, and software choices for software costs. Among these three, the model addresses all of the major costs of acquiring and operating a private cloud, with the exception of data center facilities. Given the low impact of hardware and density changes on costs, we assumed that data center facility costs would largely be the same across technologies, so they were not a focus of this model. However, should you have great data center cost information you’d like to contribute, please let us know, as we strive to increase the completeness and accuracy of our model.

The model suggests IT automation should be the first item on your to-do list.

Considering the timeframe increases model accuracy

Even though building a cloud can be quick, getting the most from its operation is a journey: staff will learn along the way, corporate functions will have to adjust, and business demands for new technologies and faster IT response will only increase.

Per-VM costs are inseparable from timing. You’re buying hardware, hiring people, buying software, suffering talent loss, refreshing hardware, and buying still more to support growth. All of these costs can, and usually do, hit your budget differently every year. If you’re buying software licenses, you have a large upfront cost and maintenance. If your staff gets promoted, gets raises, and sometimes takes new jobs, these will affect salaries, hiring, and training costs. Some you can plan for, some you can’t.

Put another way, if next year you provide the exact same quality of service, to the exact same customers, in the exact same quantity, with the exact same technology, there’s still a very real chance your costs will not be the same as they are this year.

We’re showing costs and cost changes over six years, but we modelled out to ten to find out when the costs start flattening out.

If you want your TCO model to be a tool for ongoing decision making, you need to look not only at costs, but at how costs change over time.

The cloud growth curve doesn’t affect the TCO

One of the nice things about creating a flexible model is that it allows you to try all sorts of hypotheses and inputs. While absolute costs depend on the success and speed of your private cloud adoption, one of our surprising discoveries is that relative costs are not dependent upon your adoption curve. None of the advice the model provides is affected by the growth curve.

This means you can get started even when you are unsure of how quickly your private cloud is going to take off. This also makes the particular growth model we discussed here a lot less important. Our examples have VM count doubling every year, which is the most common customer story you hear during IT conference keynotes. But the advice is equally applicable no matter what your particular growth model is.

Technical conversations with Lines of Business (LOBs) are frustrating for both sides: they often can’t provide the information you need in order to deliver a thoughtful architecture and plan, and, for any number of reasons, you can’t provide accurate costs and changes to costs over time. With a good TCO model, these conversations can get unbelievably easier for both sides of the table: you can model different scenarios, provide ranges of pricing, and help your LOBs work through priorities. Invest the required time in an accurate TCO model, and you’ll not only make these conversations easier, but you’ll have the tools in place to add financial input into your designs even as the services you provide change over time.

If you’re interested in expanding on what we’ve built, please let us know.

Erich Morisse
Management Strategy Director
@emorisse

Massimo Ferrari
Management Strategy Director
@crosslogic


* If you think that you can run a cloud by leveraging existing IT Ops, think again. Research published by Gartner shows that not creating a dedicated team is one of the primary reasons for the failure of cloud projects: Climbing the Cloud Orchestration Curve 

** Velocity 2012: Richard Cook, “How Complex Systems Fail”
The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
Complex Adaptive Systems: 1 Complexity Theory
Systemantics
What You Should Know About Megaprojects and Why: An Overview