IT Monitoring is Terrible: We Can Fix it with Machine Learning

In IT operations, we need to know when something isn’t working. But humans are simply bad at identifying anomalies over time.

FIGURE 1: Rapidly decreasing accuracy, after Mackworth & Taylor 1963.

A typical person’s ability to identify an anomaly that we know to look for can drop by more than half in the first 30 minutes on duty [1]. If that’s not bad enough, when unaided by technology, it can take us up to four times as long to recognize one [2]. We’re actually terrible at this elementary IT requirement of identifying when things go wrong, and that’s before we get to the ugly case of looking for problems we don’t expect. Combining automation with anomaly detection powered by machine learning (ML) may be the only chance we have to successfully identify and respond to the rising swell of data in IT.

In this blog post, we’ll talk about how the biology of the human brain impacts IT operations, how we can augment our teams with ML applications, and finish with two concrete examples of these applications: one offered as a service today by Red Hat, and another which (as far as I can tell) is a novel approach to assisting Root Cause Analysis with ML.

Monitoring is a Human Problem, and We Can’t Fix It Alone

Our brains are great at recognizing patterns [3]. We’re so good that sometimes we see them where there aren’t any. If you’ve ever seen a cloud that looked like a cat, or a rock that looked like your cousin, you know this is just regular human brain stuff. Our brains get used to emerging patterns very quickly through a physiological process called habituation – our brains come to expect the pattern. It helps us spend fewer cycles understanding what’s going on around us. In fact, when what’s going on around us isn’t radically changing, habituation reduces the attention we pay to the “signals,” in Signal Detection Theory lingo, from the pattern.

In the case of IT monitoring, we’re inundated with “unwanted signals” – signals that indicate everything is OK and can be ignored by operators. These unwanted signals play a valuable role in letting the monitoring systems know that the services (and the monitoring solution itself) are performing as designed, but they are detrimental to human processing. Eventually, the human brain adjusts to the idea of receiving a large number of signals, which becomes the expected pattern. We then pay less attention to whether a signal means OK or PROBLEM. This habituation means we require more effort over time to identify exceptions to the expected, and it makes us slower at recognizing exceptions too. That’s long-winded, so let’s use an example:

Construction starts hammering away next door. It’s very loud, so of course you notice immediately. Over the rest of the day, you grow used to (habituated to) the noise of the hammering. When it stops for the evening, it takes a minute, but you notice that the hammering has stopped. It’s not immediate like when the hammering started. If you’d like to try out your own attention skills, here’s a 60-second selective attention test from Daniel Simons.

This predestined loss in attention, the vigilance decrement, is magnified when we’re looking for rare problems – in IT, like those that cause unplanned downtime.

Work Smarter, Not Harder.

I hate that phrase. Said out loud, it’s too often a cop-out. It means: not enough budget, not enough headcount – go do the impossible again. In other words, keep working harder. Do you remember the good old days when our teams and budgets grew at the same rate as the work we had to get done? Me neither.

IT ops teams are asked to support ever-larger environments (more containers than VMs, more functions than containers, etc.), and also more types of things (application frameworks, development languages, etc.). This growth in scale and complexity makes support an increasingly daunting effort. So, we’re left with that despicable phrase: work smarter, not harder. When it comes to preventing errors, and especially in the world of overwhelming data that we live in, we need a systematic change to monitoring. Research shows that, rather than only relying on operators’ attention, a systematic approach can be superior for creating highly reliable operations.

With the velocity of complexity in IT, we clearly need that new systematic approach. We need a different approach that scales IT operations by accounting for natural human variability. Our customers often use automation to build quality into IT processes. That helps, and we’ve seen spectacular results. But if we want the next big jump in improvement, automation is only half of the solution. Since we can only respond when we find an anomaly, how can we get better at recognizing anomalies if we can’t even keep up with the incoming data?

As soon as ops sits down to a shift, their ability to find something odd quickly decreases across almost every dimension: they miss more, they wrongly flag good things as bad, they grow less confident in their decisions, and they take longer to make each decision.

Enter artificial intelligence. The availability of machine learning-based anomaly detection is the start of a new way to support operations. Through machine learning, operations teams can provide a higher level of service by identifying and eliminating more anomalies, including rarer ones, earlier and with greater accuracy.
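To make the idea concrete, here’s a minimal sketch of one classic anomaly-detection technique, a rolling z-score, which flags points that deviate sharply from recent history. This is an illustration only, not the method any particular product uses, and the metric series is invented:

```python
import math
from collections import deque

def rolling_zscore_anomalies(values, window=30, threshold=3.0):
    """Flag indices whose value deviates from the mean of the
    preceding `window` observations by more than `threshold`
    standard deviations (or at all, if recent history is flat)."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            std = math.sqrt(sum((x - mean) ** 2 for x in history) / window)
            dev = abs(v - mean)
            if dev > threshold * std and dev > 0:
                anomalies.append(i)
        history.append(v)
    return anomalies

# A flat metric with a single spike: only the spike is flagged.
series = [100.0] * 60
series[45] = 500.0
print(rolling_zscore_anomalies(series))  # [45]
```

Unlike a human operator, this loop never habituates: point number ten thousand gets exactly the same scrutiny as point number one.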

Finding the anomalies is the first step, but that alone won’t solve the problem. You have to know what the anomaly means. Machine learning and advanced analytics can help with that, too. Let’s go through two examples of locating anomalies and helping provide information about what’s going on: one Red Hat provides as SaaS, and another you can build for yourself.

Red Hat Insights

A couple of years ago, we released Red Hat Insights, a predictive service which identifies anomalies, helps you understand the causes, and helps automate fixes before the causes become problems. If you subscribe to Insights, it uses a tiny [4] bit of metadata to identify the causes of pending outages in real time. With the data from well over 15 years of resolving support cases, we are able to train Insights to provide both descriptive explanations of the problems and prescriptive remedies. To take it a step further and make operators’ lives a little easier, we recently extended Insights with the ability to remediate identified problems with automation. As more and more customers use Insights for risk mitigation and automated issue resolution, the additional information enables Insights to become smarter every day, and enables more informed actions by operations.

Connect automation with machine learning to identify and resolve problems before you know to look for them.

Use machine learning to help us diagnose software.

Red Hat Insights provides exact, automatable actions to resolve the complex interactions that lead to downtime. We can also use machine learning to assist in identifying other types of software problems, and reduce the time required to discover the root cause by narrowing down where to look first to a few educated predictions – without having to pore over logs by hand. We can use machine learning to aid operators in root cause analysis by suggesting a possible dependency chain that led to the breakdown – a diagnostic map.

Applications and platforms responsible for the deployment and management of many things (VMs, containers, microservices, functions, etc.) are increasingly providing maps of the things under their control in order to provide operators with context. The example below shows the topology of container interactions in a Kubernetes cluster on Red Hat’s container platform, OpenShift. This works well for platforms that create the topologies, but what about trying to determine the topology for applications we don’t know or control?

FIGURE 2: CloudForms managing a Kubernetes Cluster in OpenShift

Turning one minute of laptop CPU into a diagnostic map.

System logs on Linux (and its *nix-based cousins) are great sources of information about what isn’t working well, but a single entry rarely provides much context beyond the program or subsystem that generated it. In today’s world of massively interconnected systems, unless an operator already has experience with the observed problem, a single log entry is rarely enough to understand its root cause. However, even when we’ve never seen the problem before, we can use machine learning to build a diagnostic map and narrow down where to look first for root causes. Here’s an example.

FIGURE 3: Diagnostic Map

Figure 3 represents part of a machine learning-derived diagnostic map of Linux programs, built from log entries in syslog. Each circle is a program that logged events in syslog. An arrow suggests an influence relationship: the program at the arrow’s tail impacts the behavior of the program at its head. Now you have a picture of the entire system of events that led to the problematic behavior that brought you to look at the events in the first place: you have context.
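As a sketch of how such a map could be derived, here’s a deliberately simple co-occurrence heuristic: if program A repeatedly logs just before program B, suggest an influence edge from A to B. The program names, the five-second lag, and the count threshold are all hypothetical choices, and real approaches would be considerably more sophisticated:

```python
from collections import Counter
from datetime import datetime, timedelta

def influence_edges(events, lag=timedelta(seconds=5), min_count=3):
    """Hypothetical diagnostic-map heuristic: if program A logs
    within `lag` before program B at least `min_count` times,
    suggest the edge A -> B.  `events` is a list of
    (timestamp, program) pairs, assumed sorted by time."""
    pairs = Counter()
    for i, (t_a, prog_a) in enumerate(events):
        for t_b, prog_b in events[i + 1:]:
            if t_b - t_a > lag:
                break  # events are sorted; nothing later is in range
            if prog_b != prog_a:
                pairs[(prog_a, prog_b)] += 1
    return [edge for edge, n in pairs.items() if n >= min_count]

# Toy syslog stream: the network flaps, and NFS errors follow each time.
t0 = datetime(2018, 1, 1)
events = []
for k in range(3):
    events.append((t0 + timedelta(minutes=k), "NetworkManager"))
    events.append((t0 + timedelta(minutes=k, seconds=2), "nfs-client"))
print(influence_edges(events))  # [('NetworkManager', 'nfs-client')]
```

Even this crude version points the operator at the network before the filesystem, which is exactly the kind of "where to look first" hint a diagnostic map provides.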

Diagnostic maps can reduce cognitive overload, and identify what’s important.

With more deployment types, frameworks, and rapidly evolving applications, the interdependencies of the things we support are exploding in number. The best IT operators can debug only some of these problems quickly. However, when we use ML to help identify problems and generate diagnostic maps, we can reduce time to resolution even for problems we haven’t seen before.

Not only are graphs like this valuable as a troubleshooting tool, they can also be tied into monitoring systems to help operators identify and prioritize the right alerts. When something big happens, like a cloud region going down, we’re flooded with alerts. In this case, getting an alert from your monitoring systems that every application, every container, and every VM is down doesn’t add any new information to help resolve the problem. However, each and every alert takes cognitive effort to process and decide whether it’s important. In alert floods like this, the human brain becomes overwhelmed and stops processing new alerts. If you’ve ever felt overwhelmed by the amount of email in your inbox, that’s a small version of the same principle.

With an understanding of dependencies, you can gate alerts: you don’t need any more alerts about the applications being down if the VM they’re running on is also down. But knowing a new VM is down may be essential.
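Here’s a minimal sketch of that kind of gating, assuming we already have a hosting-dependency map (the component names are made up):

```python
def gate_alerts(alerts, runs_on):
    """Dependency-aware alert gating: suppress an alert if anything
    it (transitively) runs on is also alerting.  `runs_on` maps a
    component to the component hosting it."""
    down = set(alerts)
    surfaced = []
    for a in alerts:
        parent = runs_on.get(a)
        suppressed = False
        while parent is not None:
            if parent in down:
                suppressed = True  # the host is down; this alert adds nothing
                break
            parent = runs_on.get(parent)
        if not suppressed:
            surfaced.append(a)
    return surfaced

# Two apps on one VM: when all three alert, only the VM alert surfaces.
runs_on = {"app1": "vm1", "app2": "vm1", "vm1": "host1"}
print(gate_alerts(["app1", "app2", "vm1"], runs_on))  # ['vm1']
```

The operator sees one actionable alert instead of three, and an app alert whose VM is healthy still comes through untouched.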

Artificial Intelligence (AI) is a rapidly evolving field, and its use in IT operations even more so. But it’s far more than academic, and we’re beginning to see emerging market categories of use. Red Hat already uses it to offer Insights, the service identifying and resolving infrastructure issues before your teams know about them. We also saw an emerging example of using AI to assist in root cause analysis. The field is just getting started, and these are just two of many exciting directions in which it may evolve.

We’ve seen that it’s essentially impossible for people to watch for anomalies at any level approaching business critical; humans just aren’t wired for it. The good news is machine learning is good at it: both finding anomalies, and helping your teams figure out where to look to solve them. And, you’re not alone in this need.

If you have a substantial investment in any software, call those vendors and ask what tools they have to help your teams identify, diagnose, and solve problems with their software. If you’re feeling like you want to push a little harder, ask what they’re doing to help solve problems where their software is only one piece of the puzzle.

Erich Morisse
Director, Management Strategy
@emorisse


  1. Jane F. Mackworth, “Vigilance and Attention,” Penguin Books, Ltd. 1970 

  2. Mackworth, N. H. (1948). The breakdown of vigilance during prolonged visual search. Quarterly Journal of Experimental Psychology, vol. 1, pp.6-21 

  3. Jeff Hawkins’ “On Intelligence” is a great and accessible read on the topic. 

  4. Less than 5% of the data you provide for a single support case. 

How to Simplify Systems Integration in an Enterprise Environment

Getting heterogeneous systems to talk to each other can be a make-or-break exercise, especially in the hybrid cloud infrastructures critical to supporting digital transformation processes.

More and more of the current generation of IT professionals are involved in this transformation, and they have a growing expectation of a frictionless enterprise IT experience.
Seamless integration between IT systems can be a critical ingredient in reshaping an infrastructure after the experience offered by modern consumer-grade public cloud services.

My goal in this blog post is to describe how new tools and practices can improve your ability to realize, scale and maintain the integration of infrastructure components over time.

System integration is a well-known practice, but, with the sole purpose of describing how and where the aforementioned tools can be used, I’ll frame it as a two-way process, each direction characterized by a different approach:

  • Decision-Oriented: to “know something”, and use this information to make decisions
  • Action-Oriented: to “do something”, as a reaction to an event or a made decision

Depending on your business needs and the nature of the systems involved, you may be able to use just one of these approaches, or both combined in a two-way integration process.

In this context we’ll see how:

Cloud Management Platforms to support Decision-Oriented Integrations

Decision-Oriented Integrations are aimed at collecting data about your IT environment and making decisions about it.

While you can use ad-hoc connectors or rely on APIs to collect the data, you still need a place where that information can be analyzed, correlated, compared against a set of policies, and possibly used to finally generate recommendations.

When your IT environment is a hybrid cloud infrastructure, an ideal place to perform these tasks is a cloud management platform (CMP) like Red Hat CloudForms.

There are various reasons why you may want to collect metrics from infrastructure systems. As an example, you may want to correlate them to understand whether an application is performing within contracted service levels, and then decide to deprovision the resources or reallocate them to meet your business needs. This could be the case for a set of customer-facing services that your company is offering. You may also want to use the collected information in a different way, to decide whether a customer should be offered a discount or a different kind of service.

To better explain this concept, I will use smart home appliances as a metaphor. Imagine that your IT environment is a centralized smart home system. Now imagine that the smart home system is able to use GPS data from your smartphone to understand when you leave the apartment and trigger an “away” mode (e.g., turning off the smart lights, adjusting the smart thermostat, etc.). To achieve this result, the centralized system must be able to integrate all the different home appliances, collect their data, correlate it, and decide whether or not you’ve left your apartment and what elements must be adjusted as a result.

Some energy providers already collect and analyze energy usage data coming from smart meters or smart thermostats, to tailor heating and cooling settings to specific buildings or suggest better energy plans to their customers.

Similarly, a CMP can correlate the current utilization of your systems (e.g. physical clusters or Infrastructure-as-a-Service (IaaS) regions) with the utilization trend of your workloads to decide the optimal placement. Furthermore, it can collect usage data from all the tiers of a service, to determine whether to scale up or down one or more components.

To provide this unified governance, a CMP uses native integration points or APIs to ingest information, alerts, and metrics from systems such as physical clusters, IaaS engines, virtual machines, etc.

Then, the CMP can correlate and evaluate them against technical, business and compliance policies.

CloudForms can collect and analyze capacity and utilization data (e.g. CPU usage and states, disk & network I/O, memory usage, running VMs and hosts, etc.) from many different engines in your virtual infrastructure. The smart placement capability can use this data to understand the optimal placement for new workloads. The same information can be used to determine if a service is reaching its capacity and, according to conditions pre-defined by the organization, scale all the needed components.
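As a toy illustration of the placement idea (not CloudForms’ actual algorithm, and with invented cluster names and utilization figures), imagine choosing the cluster whose projected utilization stays lowest after adding a new workload:

```python
def best_placement(clusters, demand):
    """Pick the cluster whose projected CPU utilization stays lowest
    after adding `demand` (a fraction of cluster capacity), skipping
    any cluster the workload would overcommit."""
    candidates = {
        name: used + demand
        for name, used in clusters.items()
        if used + demand <= 1.0  # would fit without overcommit
    }
    if not candidates:
        return None  # no cluster can take the workload
    return min(candidates, key=candidates.get)

# Current CPU utilization per cluster; the new workload needs 10%.
clusters = {"cluster-a": 0.80, "cluster-b": 0.45, "cluster-c": 0.95}
print(best_placement(clusters, 0.10))  # cluster-b
```

A real CMP would weigh memory, storage, I/O, utilization trends, and business policy alongside raw CPU headroom, but the shape of the decision is the same.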

IT automation to simplify Action-Oriented Integrations

Action-Oriented Integrations are designed to enable systems to act on an event or a decision.

In this scenario, IT automation could reduce the complexity by acting as an integration broker. The more flexible and powerful the tool, the easier it is to scale and diversify the integration.

I will use the behind the scenes of a self-service portal as an example of integrations aimed at triggering actions.

The goal of a self-service portal is to empower users to deploy a set of predefined and sometimes preconfigured assets (from VMs and containers all the way up to business services), to support the business.

The challenge is to compose these assets by tapping into many different kinds of resource pools provided by bare-metal server environments, network or storage arrays, virtualized and containerized environments, public and on-premises IaaS cloud environments, etc.

Let’s go back to our smart home analogy.

Imagine that your resource pool is made of all the smart lights, smart thermostats, and the few sensors and smart appliances you have in your smart home environment. Ideally, you should be able to issue a command through an interface (e.g., a smartphone app or a voice-recognition device) and trigger a predefined and preconfigured scenario (e.g. “chill out evening”, which means “dim the lights and turn on the TV”).

IT automation tools, like Red Hat Ansible automation, offer many advantages which can make them critical to managing a large-scale IT environment. One of them is to provide a single integration layer to deal with, removing the need to enable and maintain many different integration points.

In the smart home analogy, consumer-oriented automation, provided by companies like IFTTT and Zapier, similarly offers the ability to integrate and govern a wide range of systems through a single, common language.

If two systems are not designed to work together (e.g. your self-service portal does not support your network devices), you must write ad-hoc code to integrate them. This means that you have to figure out how your endpoints communicate and code your way through the self-service portal. Repeat this for any other unsupported system in your IT environment and maintain your integration code throughout the many software and hardware updates that your endpoints will have before their decommissioning.

Back to our analogy now. How do you tie together your smart lights, which support Apple HomeKit, and your smart TV, which only supports Amazon Alexa? Once again, an intermediary like IFTTT can solve that complexity and incompatibility, allowing users to combine devices and services in workflows unique to their specific needs. Furthermore, as user needs evolve (e.g., they buy a different smart TV, or a new device to monitor plant watering is introduced into the smart home environment), the integration layer enables them to adapt and scale existing workflows, extend control to new components, and adapt to replacement smart devices.

Red Hat Ansible automation, used as the integration layer, offers an easy-to-use, human-readable language to abstract a set of actions throughout the IT stack. To further reduce the effort of extending and maintaining such a large integration surface, Ansible adopts a modular strategy where both the community and certified partners continuously contribute support for new technologies.

First collecting data with CloudForms to make a decision, and then acting on that decision with Ansible as a broker to execute commands, is a good example of Decision-Oriented and Action-Oriented integrations combined in a single process.

Open source to reduce the long-term risk of integration

So far, I have discussed how tools like IT automation engines and CMPs can help mitigate integration complexity in an IT environment that grows in scale and diversity.

However, maintaining multiple integrations can bring a completely different set of challenges.

What if you are the first to attempt a certain integration?

What’s the advantage of contributing to an open source community in this scenario?

Anyone who has worked in IT for a large organization has likely written some custom integration code at least once, probably more than once. Both open and closed solutions start from the same point: the initial cost of writing code, whether that’s the cost of a system integrator or of your own developers.

Open source opens a unique opportunity, though: by submitting this code to the project maintainers, your integration effort is evaluated in terms of stability, scalability, and readability, and then potentially enters the downstream product. The business advantage is the possibility of reducing the recurring cost of maintaining the integration you built.

The factors that determine whether this will happen are the same ones that define the success of a project: how critical the technology is to the community, and how broad and active that community is.

If we look at Decision-Oriented integrations, one of the biggest challenges you may have is to harmonize all the different data sets.

Different systems collect data with different logic, and mostly for internal consumption; you have to extract what you want and process it through a lowest common denominator to make it usable with other data sets.

To make things even harder, the data formats you have to reconcile will mutate over time.

The Open Banking Standard is an example of open source practices used to mitigate this particular integration challenge by promoting community-driven API standardization.

If we look at Action-Oriented integrations, one of the main challenges is to consistently support all the integration points over time. This means being able to evolve at the same pace as both the technology you are integrating and the business requirements served by these integrations.

Using Red Hat Ansible as the integration layer, you have access to more than 1,000 modules to integrate with the most disparate of systems, from infrastructure components, such as networking, up to application servers and databases, and more than 12,000 roles* to support an enormous range of use cases, from provisioning and configuration to remediation.

This broad range of components (which reduces the initial complexity of integration) is the result of a large and active community, which dramatically increases the chances of having a return on your contribution.

In summary, you integrate IT systems for two main reasons:

  • to collect data about your IT environment and make decisions (Decision-Oriented Integrations)
  • to act on an event or a decision (Action-Oriented Integrations)

and, to mitigate the complexity of your integration effort, your strategy should consider:

  • leveraging tools like Cloud Management Platforms and IT Automation engines as integration layers, to more easily manage an IT environment as it grows in scale and diversity
  • adopting open source to potentially reduce the costs of maintaining the existing integration points over time

Massimo Ferrari
Management Strategy Director
@crosslogic


*Ansible Roles are a standardized file structure to organize and group together a series of tasks and the related metadata, variables, etc. Grouping content by roles also eases reuse and sharing with other users.

The Business Value of Automation

Many enterprises have now reached a stage where managing the scale of their IT infrastructures is becoming a challenge second only to increasing the speed of provisioning. A broad spectrum of technologies, solutions, processes, and skillsets is available to help manage a large scale environment. Among management technologies, automation is one of the most powerful.

Automation is not merely a technology choice. First and foremost, it’s a business choice.
Without automation, supporting the growth of your business can be increasingly complex, to the point of becoming impossible beyond a certain scale. If it’s true that software is eating the world, as the well-known venture capitalist Marc Andreessen said in 2011, and if it’s true that every company is becoming a technology company, as the Chief of Research at Gartner said in 2013, then automation becomes a must-have tool in the hands of the business, not just of the IT organization.

Here are four key reasons why automation is critical to managing a large-scale IT environment:

Less waste: automation optimizes IT operations

To grow your business, you can either offer more services or focus on expanding the capabilities of an existing one, providing multiple service levels or various degrees of customization.
Most likely, an enterprise organization makes both choices, multiple times throughout its lifespan. Accordingly, the IT environment evolves over time, starting from a fairly simple environment with limited capacity and ending as a large-scale jungle of multiple languages, platforms and architectures that must be supported for many years.

To describe the evolution of the IT environment, I will use a maturity model that we introduced in the blog post “How to manage the cloud journey?”, built around two dimensions: scale and complexity. According to our model, if your business is successful, the IT infrastructure that sustains your organization should grow both in scale, hosting increasingly more workloads, and in complexity, hosting an increasingly diversified set of workloads.

To support such evolution, you have two options: hiring at the same pace as your infrastructure grows, or empowering your IT organization with a new set of tools that can scale their operational capabilities. Doing nothing isn’t really a viable option, because you can’t expect to manage the growing scale and complexity with a mostly flat number of people.

To give you more context, I will use the research on the TCO for a private cloud based on OpenStack that we published last year. Some of the data we used in this research comes from the Server Support Staffing Ratios report published by Computer Economics, Inc. The study found that a large organization*, with an average level of automation, supports 46 operating system instances (mix of physical and virtual) per system administrator while the same large organization, with high levels of automation, supports 101 instances per admin.

In our research, we assumed the number of workloads doubles each and every year. Hypothetically, in a similar situation, if you decide not to invest in automation, you would have to double your operations staff as well.
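Using the Computer Economics ratios quoted above (46 instances per admin with average automation, 101 with high automation) and a hypothetical starting point of 500 instances, the staffing gap compounds quickly as workloads double each year:

```python
import math

def admins_needed(instances, per_admin):
    """Admins required to support a given number of OS instances."""
    return math.ceil(instances / per_admin)

# 46 instances/admin (average automation) vs. 101 (high automation),
# per the Computer Economics ratios; 500 starting instances is invented.
instances = 500
for year in range(4):
    avg = admins_needed(instances, 46)
    high = admins_needed(instances, 101)
    print(f"year {year}: {instances} instances -> "
          f"{avg} admins (average) vs {high} (high automation)")
    instances *= 2
```

After three doublings, the averagely automated shop needs 87 admins against the highly automated shop’s 40: the same roughly two-to-one gap, but in far larger absolute numbers.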

Matching a similar pace from a hiring perspective can be extremely hard, if not impossible, due to a number of factors: limited OPEX budget, slow hiring process, scarcity of skilled resources on the market, and more.

In the second scenario, by empowering your organization with new tools, assuming the right tools are identified, you can enable IT operations to get more of the existing job done in the same amount of working hours. The new tools must be designed to execute the tasks in a more efficient way, through increased ease of use, higher flexibility to adapt to the use cases, faster computation, or a mix of all these things. For example, automation can help a team to deploy applications more often, and deploy them faster; fix issues at a broader scale, and fix them faster.

Highways are a good analogy to explain the concept. When the population in a geographical area grows, the government is forced to develop the infrastructure to support the additional cars. In some places, these highways must be equipped with toll gateways to regulate access, and toll gateways must be operated by humans, each performing thousands of repetitive operations per day. In turn, the newly built highways attract even more citizens to the region, and more cars onto the road. The government can deal with the spike in traffic either by adding more gateways and hiring new people to manage them, or it can make the existing highways more efficient by implementing automated barriers and automated access systems like E-ZPass.

Electronic tolls let more cars onto the highway in the same amount of time: by eliminating the brief stop at the toll and the interaction between the driver and the human operator or the cash machine, automation allows each car to move through the toll at higher speed, in less time.

In the same way, IT automation can help your Operations team to manage more workloads over the same working hours, reducing the need to hire more staff to support the growth of the infrastructure.

Less complexity: automation orchestrates sophisticated services

Beyond a certain point, success often leads to a diversification in the market demand, as your popularity attracts a broader audience. In an enterprise environment, this dynamic implies that some of the Lines of Business (LoBs) that you are serving may eventually start requesting services that are outside the planned offering or far more complex than originally anticipated.

For example, after a cloud computing environment designed to offer a few highly standardized services becomes highly successful, the organization may experience a rising number of requests to serve complex custom services. Each of those custom services includes many application tiers that must be coordinated in terms of provisioning, configuration, sequential system updates and patching, migration (if necessary), and retirement. The risk is that this newly introduced complexity, paired with the scale you reached, lessens productivity if not properly managed.

Let’s use a different analogy to explain how automation can simplify by orchestrating complex systems: the automatic transmission in a car. In modern cars, the transmission is managed by a computer which operates the gearbox and the clutch, coordinating them with engine, brakes, wheels and many other components. Automating the gear shifting task reduces the amount of work required to drive because it simplifies the entire process, for example by removing the need to monitor the tachometer.

The simplification introduced by the automated transmission is particularly useful when a certain aspect of driving must be repeated over and over. For example, thanks to automation, the continual operation of accelerating, stopping and re-accelerating when stuck in traffic during rush hour doesn’t require the driver to shift gears endless times.

Like the automatic transmission in your car, an automation tool is designed to deal with a large number of moving parts at the same time, taking care of repetitive tasks and keeping all the pieces together while delivering predictable results.

Moreover, to comply with regulations and reduce emissions, car manufacturers have introduced more efficient gearboxes made of eight or even ten gears. Manually shifting that many gears not only would be complex and distracting, but it would also almost certainly result in highly inefficient driving.

In the same way, automation tools can simplify your experience of deploying and maintaining applications of increasing complexity, from the multi-tier service composition to the configuration of ancillary components such as networking and firewalls.

Fewer mistakes: automation reduces human error

Even the most talented member of your IT operations team is human, and humans are prone to mistakes. The larger and more complex an environment, the higher the chances for mistakes.
For example, large-scale environments easily put the IT organization under time pressure and stress. The psychological stress comes from the realization that a task, even a simple one, can’t be completed across all the managed machines in the allotted time without a high probability of errors.

It must also be considered that growing complexity leads to more intricate operations that need to be performed. Highly complex tasks require constant focus and precision – skills that not all team members may possess.

In the previous section, I mentioned how automation can more easily coordinate the operations performed on a series of application tiers that compose a sophisticated business service. The challenge is not just about dealing with many moving parts (the “what”), but also dealing with the highly complex configurations that apply to each tier and define the relationship between the tiers (the “how”).

Back to our car analogy: the Autopilot feature recently introduced by Tesla is a good example of how automation can assist humans and help them avoid mistakes. Driving is a complex yet manageable task for humans. Automation here, while not yet perfect, can be twice as reliable as a human driver.

Less uncertainty: automation prepares you for the future

So far I’ve talked about the value of automation in facing today’s challenges, but automation can do more than that. Automation can also better equip organizations to face the uncertainty of the future.

As an abstraction layer interconnecting many elements of enterprise IT and operating at scale with minimal effort, automation can be seen as an extensible platform that can evolve and adapt to market changes.

Automation as a platform builds upon the foundational elements you already have in your computing environment, simplifying the evolution of existing services and the creation of completely new ones.

For example, automation can simplify the deployment of existing applications across new public and private cloud infrastructures that don’t exist today.

In another example, automation can make it easier to combine new IT components, like a new Identity and Access Management (IAM) service, with existing ones, to help create new offerings at a fraction of the time otherwise required to re-engineer the whole stack from scratch.

Let’s take the automatic gearbox analogy to another level. Think about a highway with a long queue of cars progressing at a very low, tedious speed. To reduce the annoyance of driving in queues, several manufacturers introduced a technology called Adaptive Cruise Control (ACC). ACC uses information from sensors like radars and cameras to instruct the car’s control systems to automatically follow the vehicle in front, adjusting speed and stopping when necessary. The development of such technology, just as with the Autopilot, has been made possible by the automation of several car components, pioneered by the automatic transmission. In this example, the automatic transmission is a key building block for car innovation. As car manufacturers introduce more and more new features, the automatic transmission acts as both enabler and actuator of many new capabilities.

In summary, automation is not just a great tool to deal with today’s market demands, but it can also be a fundamental building block to help sustain the growth and evolution of your business tomorrow. However, as I said at the beginning of this post, automation is just one of the many technological, operational, and cultural elements that you may need to introduce in your organization as part of a digital transformation journey. Automation alone is not enough.

Massimo Ferrari
Management Strategy Director
@crosslogic


*With an IT operational budget of $20 million or greater.


How to Manage the Cloud Journey?

By now, it’s hopefully clear that Red Hat is very serious about Management, with a continual commitment and a constant look at the big picture.

OUR COMMITMENT

Over the years, we expressed our commitment to become a key IT management player in a number of ways:

This week we further express our commitment in additional ways:

FIGURE: Top orchestration products based on expected usage within the next year.
We have grown and evolved our portfolio to enable and support our customers in their march toward a Frictionless IT, shaping our offering around its core principles, like ease of use. Three examples:

  • Our newest offering, Insights, features a Software-as-a-Service delivery model to minimize the cost of entry.
  • Ansible, already considered one of the easiest products to use among IT automation and configuration management tools on the market, has grown in popularity as an easy way to manage containers.
  • CloudForms ships as a single virtual appliance, where some competitors still want you to set up and configure 6-12 systems to deploy their cloud management platform.

But what is the big picture? Why these specific products? What is guiding our decisions in terms of management portfolio growth?

THE BIG PICTURE

Part of the answer to the above questions is in the cloud maturity model below, shaped after the dimensions of scale and complexity.
FIGURE: The cloud journey maturity model, shaped by scale and complexity.
If your cloud project is successful, it will likely grow in scale as more and more Lines of Business trust your IT organization to host their applications. This is not just common sense: we saw the relationship between success and scale first hand by working with many customers worldwide, with further validation coming from a highly detailed TCO analysis that we published recently.

Popularity has a side effect. As more LoBs approach the private or hybrid cloud you built, business demand will also likely start to diversify, and your IT organization may be asked to host not just a great variety of greenfield applications, but also brownfield ones that were not designed to run on IaaS and PaaS clouds. We saw this over and over in conversations with clients, and we heard it even more on stage at industry events like the OpenStack Summit, from now very experienced early adopters.

This diversification increases the complexity of the cloud environment, both in terms of complexity of applications to deploy and manage, and in terms of integration with IT systems outside the cloud environment.

So the question is: “How do you manage that growing complexity as you evolve in your cloud journey?”
FIGURE: The cloud journey maturity model, mapped to products.
You begin the cloud journey by asking yourself, “Can we move faster?” The first step to answering that question consists of deploying a cloud engine. Your decision to adopt an IaaS or PaaS cloud engine depends on many factors, including cultural fit, readiness to standardize the computing stack at a certain level, preference for working with virtual machines or containers, and much more.

Your cloud engine of choice comes with its own set of management tools, which are perfectly fine to address the needs of an IT organization up to a certain level of complexity. Beyond that level, which varies from organization to organization, your business will likely require more sophisticated support, just like a landlord who expands a real estate business by embracing services like Airbnb and beyond.

As you grow in scale, the first management solutions that you may want to consider are the ones that can preserve the health of your growing IT environment, enabling you to run at scale. For this stage of maturity, Red Hat offers Insights and Satellite. Insights can proactively identify configuration issues and security vulnerabilities before they become critical, generating an appropriate remediation plan*. Satellite, in turn, can deploy trusted software content and security patches at scale, enabling IT Ops to fix whatever issues have been identified, by Insights or by the IT organization manually.

As the complexity increases along the way, and your cloud environment is requested to serve and host increasingly diverse applications, you may want to consider an IT automation solution to help maximize the efficiency of your cloud. For this stage of maturity, Red Hat offers Ansible. Ansible is capable of automating the provisioning and configuration of the components of a multi-tier application, including the underlying resources that serve the application, like networking.

As both scale and complexity reach their peak, at the most advanced stage of maturity in your cloud journey, you may want to consider a cloud management platform to govern the private cloud environment side by side with public clouds and pre-existing server virtualization environments in a coherent way. For this stage of maturity, Red Hat offers CloudForms. CloudForms provides a single pane of glass to control, automate and keep compliant a truly hybrid IT environment composed of VMware, Microsoft, Amazon, Google, and Red Hat technologies**.

In other words, we are investing in the Red Hat Management portfolio to support our customers at every stage of the maturity model described here, both when they follow the adoption path described so far and when they have more ad-hoc business needs to address***.

There’s much more that we can do, and that we’ll do, to empower IT organizations in their digital transformation. So, as usual, stay tuned for more.

Alessandro Perilli
GM, Management Strategy
@giano


* Red Hat Insights can do this with both IaaS and PaaS clouds. In fact, the new version of the platform that we just announced introduces support for containers, OpenStack-based private clouds, and KVM-based server virtualization environments.

** At the same time this post goes live, we announce the official support for Google Cloud Platform, alongside a remarkable number of other new capabilities and improvements.

*** Do we expect all customers to follow the exact adoption path described in this blog post? No. In fact, many customers start adopting some of our management solutions much earlier in their cloud journey. That is why we leverage CloudForms in both our IaaS and PaaS cloud engines, and why we launched the Ansible Container project, as a way to support IT organizations that want to work with containers from the earliest maturity stages.

Open Source for Business People

Thanks to the effort of companies like Red Hat, Google, Netflix and many others, it’s safe to say that open source is no longer a mystery in today’s IT organizations. However, many struggle to understand the nuances that make a huge difference between vendors commercially supporting the same open source technologies.

Should the general public have any interest in understanding those nuances? A few years ago the answer would have been “no.” Today, however, understanding those nuances is critical to selecting the right business partner when an IT organization wants to adopt open source.
As more vendors offer commercial support for various projects, from Linux to OpenStack to Kubernetes, understanding the real difference between vendor A and vendor B becomes critical for CIOs and IT Directors.

At Red Hat, we have a TL;DR answer to the question “What makes you different from vendor XYZ?” Our short answer: we have more experience supporting open source projects, and we participate in and nurture open source communities in a way most other industry players simply don’t.
This is a true statement, but what does it actually mean? How does it translate into a competitive advantage that a CIO can appreciate when selecting the best business partner to support her/his agenda? Today, I’ll try to provide the long version of that answer, with some simplifications, in a way that is hopefully easy for business-oriented people to understand.

To narrate this story, let’s take as an example a fictitious open source project that we’ll call “Project-O,” and divide the story into three chapters:

Chapter 1: Innovation brings instability

At any given time during the lifecycle of Project-O, any individual in the world can contribute a piece of code to:

  • introduce, complete or fix a feature (innovate)
  • improve performance (optimize)
  • increase reliability (stabilize)

To serve the business, we need to innovate and optimize. To protect the business, we need to stabilize. The continuous tension between these two needs drives hundreds or thousands of code contributions to Project-O at any given time. The bigger the project, and the larger the community supporting it, the more code is submitted at any given time.

Let’s use an analogy: if Project-O is an existing house, each code contribution is a renovation proposal. Imagine having hundreds or thousands of renovation proposals per day.

Just like renovation proposals, new code, especially code that introduces new features, can be written in a very conservative way or in a very disruptive way:

  • It’s conservative code when its adoption doesn’t break other parts of Project-O. In other words, the individual who wrote the code has been mindful of backwards compatibility.
    In our analogy, it’s when a renovation proposal doesn’t force the house owner to demolish existing walls or do some other major intervention to accommodate the proposed changes. Imagine painting a guest room.
  • It’s disruptive code when adoption breaks other parts of Project-O and requires some major reworking.
    In our analogy, it’s when a renovation proposal requires the house owner to make drastic changes to the plumbing system in the only bathroom. It can be done, of course, but it implies temporary instability and disruption inside the house.

Obviously, the more conservative the code, the fewer chances there are to innovate. And vice versa.

When an individual wants to improve Project-O, he or she has to submit the proposed code to a group of individuals, called “maintainers,” who govern the project and have the mandate to review the quality and impact of the code before accepting it.

A maintainer has the right to reject the code for various reasons (we’ll explain this in full in Chapter 3), and needs to make a fundamentally binary choice: requesting strong backwards compatibility or allowing disruptive code.

In our analogy, the maintainer is the house owner who has to carefully evaluate the pros and cons of each renovation proposal before approving or rejecting it.

If the house owner wants an amazing new wing of the house, he has to be ready to tear down walls, rework the plumbing system, and deal with a fair amount of redesign. In similar fashion, the maintainer that wants to innovate and quickly evolve Project-O has to allow more disruptive code and deal with the implications of that disruption.

To address the business demand, especially in a highly competitive market like the one we have today, the maintainer has no choice but to allow disruptive code wherever possible*.
How a vendor deals with that disruption makes all the difference, and can truly define its competitive advantage. This is where things get nuanced and interesting.

Chapter 2: Instability is exponentially difficult to manage in large projects

As we said, the larger the community behind an open source project, the larger the number of code contributions submitted at any given time. In other words, the number of things you can renovate in a standard apartment is far smaller than the number of things you can renovate in a castle.

Let’s say that Project-O is a fairly complicated open source project, equivalent to a hotel in our analogy. For the maintainer of Project-O, the challenge is to accept enough code contributions to keep the project innovative, but not so many that the community is overwhelmed by the number of things to fix at the same time. Imagine renovating the rooms in one wing, rather than all of them at once.

When many functionalities of Project-O break simultaneously because of too many code contributions, the difficulty of fixing them all together in a reasonable amount of time grows exponentially. The problem is that the market cannot wait forever for Project-O to become stable enough to use again. The innovation provided by the newly contributed code must be delivered within a reasonable amount of time to be competitively useful. Usually, large enterprises struggle to adopt a new version of Project-O if stable releases arrive more often than every six months. At the same time, those enterprises won’t wait years for a new stable release of Project-O.

Again, it would be as if the hotel owner in our analogy approved 10,000 renovation proposals, all executed at the same time, each one breaking existing parts of the hotel. Imagine upgrading the electrical, plumbing, and heating systems while remodeling the restaurant, all at once. Fixing the resulting disruption would be so incredibly difficult as to render the hotel completely unusable for an excessive amount of time.

For these reasons, the maintainer sets goals and deadlines for accepting code contributions. Once the deadline passes, no more code contributions are applied, and the community works to stabilize the new version of Project-O enough to be usable.

However, “usable” doesn’t necessarily mean “tested” and “certified as reliable.” It’s the difference between “I tried to run the code a dozen times and every time it worked” and “I ran the code thousands of times, under the most disparate conditions, and I know it will always work in the conditions I tested.” This is where competing vendors can make a business out of an open source technology that is fundamentally free for the entire world to access and use.

So, at a certain point, the maintainer freezes code contributions for Project-O. Subsequently, competing vendors look at all submitted code contributions and decide how much of it should be commercially supported** after their own extensive QA testing.
Because of this, the open source version of Project-O, called “upstream,” is not necessarily identical to the commercially supported version of Project-O provided by vendor A, which in turn is not necessarily identical to the version of Project-O provided by vendor B. There are small and big differences between these three versions, as they represent three discrete states of the same open source project.

Vendor A and vendor B need to decide how much of Project-O they want to commercially support, balancing the need for innovation (addressed by the newly disruptive code accepted by the maintainer) against the exponential complexity of fixing the instability caused by that innovation.

Chapter 3: How vendors manage instability is their competitive advantage

At this point, you may think that the differentiation between vendor A and vendor B is in how savvy or smart they are in “making the cut,” in how many new code contributions to Project-O they decide to support at any given time. In reality, that is only partially relevant. What really differentiates the two vendors is how they deal with the instability caused by the newly contributed code.
To manage this instability, each vendor can leverage up to three resources:

  • Deep knowledge
  • Special tooling
  • Strong credibility

Deep knowledge
When much of the newly contributed code is disruptive in nature, many things can break at the same time within Project-O. Sometimes the new code breaks dependencies in a domino effect that is very complicated to fully understand. Fixing all broken dependencies quickly and effectively requires a broad and deep knowledge of all aspects of Project-O. It’s like the hotel owner who knows the property inside and out through many years of renovations, and has a very clear idea of all the areas, obvious and non-obvious, that the changes in a renovation plan will affect.

This is why vendors involved in the open source world make a big deal of statistics like the number of contributions to any given project, like the ones captured by Stackalytics. Knowing how much and how broadly a vendor contributes to an open source project may seem a superficial and sometimes misleading metric, but it’s meant to measure how deep that vendor’s knowledge is. The deeper the knowledge, the more skilled the vendor is at managing the instability created by disruptive code.

Special tooling
No matter how deep the knowledge available, at the end of the day a vendor is an organization made of people, and people make mistakes. Human error is unavoidable. Hence, to mitigate the risk of human error, some vendors develop special internal tooling that assists humans in understanding the impact of the instability created by newly contributed code, and in applying the necessary changes across the board to make Project-O as stable as possible, as quickly as possible.

Without deep knowledge of Project-O, it can be impossible to develop and maintain any special tooling. So, human capital is the biggest asset a vendor involved in open source has.

Strong credibility
Through deep knowledge and/or specialized tooling, a vendor can identify and fix the broken dependencies in open source code faster than its competitors, but there’s one last challenge: submitting the patches to the maintainers and making sure each part of Project-O is fixed, so the newly contributed code works in time for its release. If vendors get fixes accepted back “upstream,” they don’t have to maintain those fixes alone. But for the fixes to be accepted, vendors have to prove their code helps Project-O, not just themselves.

Back to our analogy: the hotel owner accepted a certain number of renovation proposals to build a new wing, and compiled them into a renovation plan. The plan is ambitious, and the contractors executing it will break the current plumbing system in the process. Nonetheless, the plan must be completed within 3 weeks or the hotel will not remain competitive enough to justify the renovation plan itself. The contractor building the new wing breaks the plumbing system, as expected, and must ask for modifications from the contractor that owns that system. The owner of the plumbing system is willing to help, of course, but to comply he has to review the new wing project and the proposed changes to the plumbing system, and, if he agrees with them, order new pipes. The whole process would normally take 5 weeks, enough to compromise the whole renovation plan.

The only way to save the day is if the contractor building the new wing has strong credibility in plumbing: credibility so strong that the requested modifications to the plumbing system are accepted without question, and the pipes are ordered with express delivery. In other words, the owner of the plumbing system trusts the wing builder so much that a further review is not necessary.

Such credibility is not granted lightly in the open source world. Few individuals are granted that sort of trust, and it is earned over years of continuous contribution of new code and demonstrated deep knowledge.

It is thanks to the amazing open source contributors who decide to join a vendor that the vendor is able to fix broken dependencies in a timely way. In fact, contrary to what you might assume, highly trusted open source contributors are not easily hired and retained through standard HR practices. They independently decide to join and stay with a vendor primarily because they believe in the mission of that vendor and in how that vendor conducts business.

So, in summary, the difference between two vendors operating in the open source world boils down to how capable they are in managing the instability caused by innovation. That differentiation is very subtle and hard to appreciate for anybody until it’s time to face the instability.

Alessandro Perilli
GM, Management Strategy
@giano


* The deeper you go into the computing stack, all the way down to the kernel of the operating system, the less disruptive code is allowed, to avoid compromising the reliability of mission-critical systems and their capability to integrate with a well-established ecosystem of ISVs and IHVs. That’s why it’s much harder to innovate at the lowest levels of the stack.

** Commercially supporting open source software means that the vendor performs QA testing to verify code stability, provides technical support in case something doesn’t work, issues updates and patches for security and functionality improvements, and certifies integration with third-party software and hardware components.

Elephant In The Room: What’s The TCO For An OpenStack Cloud?

A few months ago, for our own internal use, we started a project to calculate what it costs to run an OpenStack-based private cloud; more specifically, the total cost of ownership (TCO) over the years of its useful life. We found the exercise to be complex and time consuming, as we had to gather all of the inputs, decide on assumptions, vet the model and inputs, and so on. So, in addition to the results, we’re offering up a few lessons we learned along the way, which hopefully can save you a scar or three when you want to create your own TCO model.

Ultimately, we wanted answers to three layers of cost:

  1. What is the most cost effective method for acquiring and running OpenStack?
  2. How does OpenStack compare financially to non-OpenStack alternatives?
  3. How should we prioritize technical improvements to provide financial improvements?

Following an exhaustive survey of cloud TCO research, we found that none of the cost models we could get our hands on were complete enough for our needs: some did not break out costs by year, some did not include all of the relevant costs, and none addressed potential economies of scale. We needed a realistic, objective, and holistic view, not hand-picked marketing results, whatever the technology.

Since we could not find anything both comprehensive and transparent, we created our own model, and used the opportunity to go a few steps further by adding an additional dimension: the full accounting impact across the cash flow statement, income statement, and balance sheet. The additional complexity made the model harder to understand and consume. Further, we needed the model not only to spit out projections, but to be a reliable way to compare options and support decision making throughout the life of a cloud, as options and assumptions change. So we decided to create a tool rather than just a TCO spreadsheet, for easy comparisons and for conversations with financial teams and lines of business.

To help us view the data objectively, we relied as much as possible on industry data. Making assumptions was inevitable, as not all of the required data is available, but we made as few as possible and verified the model and results with a number of reputable and trusted organizations and individuals in both finance and IT.

What is the most cost effective method for acquiring and running OpenStack?

If you’re considering or even already running OpenStack, we imagine you’re asking yourself a question: “I have a smart team, why can’t we just support the upstream code ourselves?” As a vendor of commercially supported open source software, we can talk all day about the value of supported open source, including its direct impact on OpenStack, but we also want to address the direct costs, the line items in your budget. To get to these costs and answer our questions, we shaped the model to analyze two different acquisition and operation methods for OpenStack:

  • Self-supported upstream OpenStack
  • Commercially supported OpenStack

FIGURE: TCO comparison of self-supported upstream OpenStack and commercially supported OpenStack.

As the model shows, self-supported upstream OpenStack, despite having the least expensive software acquisition cost, ends up being the most expensive overall, which may seem counter-intuitive. Why? Because of the cost of people and operations.

All of the costs of a dedicated team* running the cloud (salaries, hiring, training, benefits, raises, and other loaded costs) are a large chunk of the total, regardless of the underlying technology. With a commercially supported OpenStack distribution, you only need to staff the operations of your cloud, rather than also the software engineers, QA team, and so on needed to support the code itself. We expect that you would need to hire fewer people as your cloud grows, and that the savings would exceed the incremental cost of the software subscription. Your alternative is this:

FIGURE: The self-supported alternative.
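To make that reasoning concrete, here is a minimal sketch of the trade-off. Every figure below (team sizes, the loaded cost per person, the subscription price) is a hypothetical assumption for illustration, not a number from our model:

```python
# Illustrative only: every figure below is an assumption, not a model output.
LOADED_COST_PER_PERSON = 150_000  # hypothetical salary + benefits + overhead

def annual_cost(ops_staff, engineering_staff, subscription):
    """Yearly cost of a cloud: people plus any software subscription."""
    return (ops_staff + engineering_staff) * LOADED_COST_PER_PERSON + subscription

# Self-supported upstream: no subscription, but you also staff the engineers
# and QA who support the code itself, not just the cloud's operations.
self_supported = annual_cost(ops_staff=6, engineering_staff=4, subscription=0)

# Commercially supported: a subscription replaces that in-house engineering.
commercial = annual_cost(ops_staff=6, engineering_staff=0, subscription=300_000)

print(self_supported, commercial)  # 1500000 1200000
```

Under these assumed numbers, the subscription costs less than the engineering headcount it replaces, which is exactly the dynamic the model surfaced.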

Taking our analysis a step further, we also explored the financial impact of increasing the level of automation in an OpenStack cloud with a Cloud Management Platform (CMP). Why? Because most companies’ experience shows** that managing complex systems usually doesn’t go according to plan. However, if automation is appropriately implemented, it can lower the TCO of any complex system.

CMP is a term coined by Gartner to describe a class of software encompassing many of the overlaid operations we think of in a mature cloud: self-service, service catalogs, chargeback, automation, orchestration, etc. In some respects, a CMP is an augmentation of any cloud infrastructure engine, like OpenStack, necessary to provide enterprise-level capabilities.

Our model shows coupling a CMP with OpenStack for automation can be significantly less expensive than either using and supporting upstream code, or using a commercial distribution. Why? As with the commercial distribution, our model shows that you would need to hire fewer people as your cloud grows, and the savings can potentially dwarf the incremental software subscription cost. The combined costs are drawn from Red Hat Cloud Infrastructure, which includes the Red Hat CloudForms CMP and Red Hat Enterprise Linux OpenStack Platform.

FIGURE: TCO of OpenStack combined with a cloud management platform.

One of the sets of industry data we used to help create an unbiased model came from an organization named Computer Economics, Inc. They study IT staffing ratios and similar metrics. They found that the average organization, with an average amount of automation, supports 53 operating system instances (a mix of physical and virtual) per system administrator. They also found that the average organization with a high level of automation supports 100 instances per admin.

So, in our scenario, with the cloud expected to double in size next (and every) year, you have a few options. You can double your cloud staff (good luck with that), double the load on your administrators (and watch them leave for new jobs), or invest in IT automation.
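The staffing math behind those options can be sketched directly from the two ratios in the study (53 and 100 instances per admin); the starting fleet size below is an assumption for illustration:

```python
import math

# Ratios from the Computer Economics study cited above.
AVG_RATIO = 53    # OS instances per admin, average automation
HIGH_RATIO = 100  # OS instances per admin, high automation

def admins_needed(instances, ratio):
    """Administrators required to support a fleet at a given staffing ratio."""
    return math.ceil(instances / ratio)

instances = 500  # assumed starting fleet, doubling every year (illustrative)
for year in range(1, 5):
    avg = admins_needed(instances, AVG_RATIO)
    high = admins_needed(instances, HIGH_RATIO)
    print(f"Year {year}: {instances} instances -> {avg} admins (avg) vs {high} (high)")
    instances *= 2
```

With the fleet doubling yearly, the gap between the two staffing curves widens every year, which is why the ratio, not the absolute headcount, is what automation really changes.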

The aforementioned study shows that high levels of automation can nearly double the number of OS instances supported per administrator. While automation can reduce the hiring curve and make your cloud admins’ lives easier, this is a financial discussion: automation only makes financial sense if it lowers the cost per VM. Which is exactly what we found:

FIGURE: Per-VM cost waterfall, RHELOSP vs. RHCI.

In order to compare the costs and advantages of automation more closely, we looked inward (it was an internal study, after all). We compared the fully loaded costs (hardware, software, and people) per VM of our commercial distribution of OpenStack, Red Hat Enterprise Linux OpenStack Platform (RHELOSP), with those of Red Hat Cloud Infrastructure (RHCI), which includes both RHELOSP and our CMP, Red Hat CloudForms.

Looking at the waterfall chart above, we start with a fully loaded cost of $5,340 per VM for RHELOSP, and want to compare the similarly loaded costs for RHCI. The RHCI software costs an additional $53 per VM under these density assumptions, which increases the cost to $5,393. Next, factoring in the $1,229 per-VM savings through automation, from hiring fewer people as your cloud grows, we arrive at a loaded cost of $4,164 per VM for RHCI. Under our model, using a CMP with OpenStack resulted in net savings of nearly $1,200 per VM.
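The waterfall arithmetic can be written out directly. The per-VM figures come from the model above; only the fleet size used to extrapolate total savings is an assumption for illustration:

```python
# Per-VM figures from the waterfall; only the fleet size is an assumption.
rhelosp_cost = 5_340        # fully loaded cost per VM, RHELOSP alone
cmp_uplift = 53             # incremental RHCI (CloudForms) software cost per VM
automation_savings = 1_229  # per-VM savings from hiring fewer people

rhci_cost = rhelosp_cost + cmp_uplift - automation_savings
net_savings_per_vm = rhelosp_cost - rhci_cost

print(rhci_cost)            # 4164: loaded cost per VM with RHCI
print(net_savings_per_vm)   # 1176: automation savings net of the software uplift

# At even moderate scale, the per-VM savings add up quickly:
fleet = 1_000               # assumed VM count, purely illustrative
print(net_savings_per_vm * fleet)  # 1176000 per year
```

Note that the $53 software uplift is almost noise next to the $1,229 automation savings: the people costs, not the software, dominate the outcome.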

Moving from just an average level of automation to a high level, our model showed a significant improvement in costs as you grow: the extra cost of automation can be dwarfed by the potential savings. High automation here only means moving from the median to the 75th percentile, so our model suggests there’s a lot of headroom for improvement above and beyond even what we show.

At roughly $1,200 in savings per VM per year, automation has the potential to quickly add up to millions once you’ve reached even moderate scale.

That kind of benefit is one of the many reasons why Red Hat recently acquired Ansible. And given that Ansible is easy to use, Ansible tools can not only improve TCO through automation, but can also help customers achieve those savings faster.

How do OpenStack and non-OpenStack compare financially?

As we said, we also wanted the model to be useful for comparing different market alternatives, but for the comparison to be useful, it needed to be apples-to-apples. Competitive private-cloud technology available on the market at the time of our research provided much more than just the cloud infrastructure engine, so we decided to compare OpenStack plus a CMP against commercial bundles made of a hypervisor plus a CMP, which is what Red Hat customers and prospects ask us to do most of the time.

In the model, we conservatively assume that the level of automation is exactly the same. If you have data you are willing to share which supports or refutes this, please let us know.

As we expected, the model showed us that an OpenStack-based private cloud, even augmented by a CMP, costs less than a non-OpenStack-based counterpart. The model shows savings of $500 per VM, increasing to $700 and more as the number of VMs grows and the cloud matures over time.

image06

However, the question is: is the $500-700+ in savings per-VM worth the risk of bringing in a new technology? To find the financial answer, we had to consider how these savings add up.

image02

As the chart shows, by the time you have even a moderately sized cloud, the total annual cost savings of OpenStack with a CMP can exceed two million dollars. We are aware that it’s common business practice to apply discounts to retail prices, but to keep the comparison as objective as possible, we used the list prices disclosed by every vendor we evaluated in our research. Because our competitors were not real keen on sharing their discount rates, the only objective comparison we can make is between these list prices. We estimate that a small portion of this savings comes through increased VM density (which we’ll talk about later), but the majority is in software costs.

With this in mind, if you take a look at these numbers, and think about the software discounts you’ve negotiated with your vendors, you’ll have a reasonable idea of what this would look like for you. And as a reminder, these are just for the exponential growth model starting from a small base. We’ll wager there are any number of you reading this who have already well exceeded these quantities and are accumulating savings even faster than we show here.

We also recommend looking at the total costs over the life of a project. In fact, when we look at the accumulated savings over the life of your private cloud, we notice something rather striking.

image03

Our model showed that it really doesn’t matter what your discount level is: if you plan on any production scale, OpenStack with a CMP can potentially save you millions of dollars over the life of your private cloud.

How should we prioritize technical improvements to provide financial improvements?

In order to move from one-time decisions to deliberate ongoing improvements, you need the “why” of the model as well as its outputs. By the time we finished building and vetting our TCO model, we had made a number of interesting, and sometimes surprising, discoveries:

Cost per VM is the most important financial metric

For most of this post, we’ve been focusing on cost per VM. Despite their necessity in budgeting, total costs are simply not instructive. Here’s an example of the total annual costs over six years, for one of the many private cloud scenarios we considered:

image08

A typical approach in TCO calculations is looking at the annual costs, but this metric alone isn’t particularly helpful in the analysis of a private cloud, with or without OpenStack. In private clouds, we can’t get away from the fact that we are providing a service, and what our Lines of Business or customers consume is a unit, like a VM or container. Hence, we believe that it’s much more significant to look at the annual per-VM cost.

image07

In the same scenario we showed with the rapidly increasing total costs, the per-VM cost has dropped by more than half from the first year to the third. That dramatic improvement is impossible to see in the total cost curve. Without accounting for per-VM costs, you’d miss that total costs are increasing because of greater usage, while you’re getting more for your dollar every year. Increasing growth while increasing cost efficiency is a good problem to have.

In other words, we recommend using VM Cost as your main metric because it shows how good you are at reducing the cost of what you provide. Total Cost does not distinguish between cost improvement and usage growth.
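A toy example (with hypothetical numbers, not figures from our model) shows how the per-VM metric surfaces an improvement that the total-cost curve hides:

```python
# Hypothetical annual figures: total spend grows, but usage grows faster.
total_cost = {1: 1_000_000, 2: 1_600_000, 3: 2_200_000}
vm_count   = {1: 200,       2: 500,       3: 1_000}

for year in sorted(total_cost):
    per_vm = total_cost[year] / vm_count[year]
    print(f"Year {year}: total ${total_cost[year]:,} -> ${per_vm:,.0f} per VM")
```

Total cost more than doubles over the three years, yet the unit cost falls from $5,000 to $2,200 per VM, which is exactly the pattern described above.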

The hardware impact on total spend is marginal

We’ve woven in analysis of two of the three main cost components related to acquiring and running OpenStack, and financially comparing OpenStack and non-OpenStack alternatives. Our model shows that the selection of private cloud software choices has the potential to save you millions of dollars. The investment in automation similarly shows the potential to save additional millions of dollars. Either or both of these can save an organization a lot of money, despite the additional expenses. But, so far, we’ve only hinted at hardware costs.

Some of our readers may be surprised at the results: hardware is a large and easily identifiable cost, so if you can cut the amount of hardware, in theory you can save a lot of money. Our model suggests that’s not really the case.

image09

We asked the model how costs change across a large range of VM densities: 10, 15, 20, and 30 VMs per server, with no other changes. The numbers show very little difference in costs even across this large range of densities.

If we start with an average density of, say, 15 VMs per server and (unrealistically) double it to 30, we see savings of around $350 per VM. Not a trivial amount, and one that adds up quickly at scale, but these amounts are before the costs of any software and the effort needed to make this monumental jump in efficiency.

If we make a more realistic (but still really big) stretch to a ⅓ increase in density, from 15 VMs per server up to 20, the model indicates $175 in savings per VM before the cost of software and effort. This is tiny compared to the $1,200 or more in per-VM savings through automation in the same scenarios.
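To see why density moves the needle so little, here’s a sketch of the per-VM hardware cost at each density. The $10,500 annual server cost is a hypothetical figure, chosen so the deltas line up with the model’s $175 and $350 results:

```python
server_cost = 10_500  # hypothetical fully loaded annual cost per server

def hw_cost_per_vm(density):
    """Hardware cost per VM at a given VMs-per-server density."""
    return server_cost / density

baseline = hw_cost_per_vm(15)
for density in (20, 30):
    saving = baseline - hw_cost_per_vm(density)
    print(f"{density} VMs/server saves ${saving:,.0f} per VM vs 15")
```

Because the hardware cost per VM shrinks as 1/density, each further density increase buys less and less savings.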

Never neglect your hardware costs, but don’t start there for cost improvements: it’s unlikely to provide the biggest bang for your buck.

Lowering VM costs will increase usage and total costs

Our model shows that the more you lower the VM costs for the same service, the more you will increase your total costs. There’s a direct causal effect: the less expensive this service is, the more people want to use it.

Here’s a different example from our industry to further prove the point. 1943 saw the beginning of construction of ENIAC, the first electronic general-purpose computer, which cost about $500,000. In 2015 dollars, that’s well over $6,000,000. Today, servers cost less than 1/100th of that, and we buy millions of them every year. We now spend much, much more on IT than the first IT organizations did supporting those early giant beasts and, yet, our unit costs are significantly lower.

Based on this awareness, we looked at the market numbers for consumption of servers and VMs from IDC and ran some calculations: for every 1% you reduce your VM cost, you should expect to see a 1.2% increase in total cost, due to a 2.24% increase in consumption. This seems counterintuitive, but the increase in total costs is due to your success: you’ve reduced the costs to your customers, so they’re buying more. Once again, your reduction in VM cost directly increases the demand for your cloud’s services.
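Assuming the relationship stays roughly linear for small changes (an assumption on our part), the elasticity arithmetic works out like this:

```python
ELASTICITY = 2.24  # % consumption increase per 1% VM cost reduction (from the IDC-based calculation)

def total_cost_change(cost_cut_pct):
    """Percent change in total cost after cutting the per-VM cost by cost_cut_pct %."""
    price_factor = 1 - cost_cut_pct / 100                 # each VM is cheaper...
    demand_factor = 1 + ELASTICITY * cost_cut_pct / 100   # ...but you run more of them
    return (price_factor * demand_factor - 1) * 100

print(f"{total_cost_change(1):.1f}% total cost change for a 1% VM cost cut")
```

A 1% unit-cost cut times a 2.24% consumption bump nets out to the ~1.2% total-cost increase quoted above.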

IT, and in particular IT components like servers and VMs, have “elastic demand curves,” broadly meaning that reducing prices leads to greater utilization and greater total spend. If increased efficiency causing higher total costs comes as a surprise to you, you’re not the only one.

Track all of your costs to prioritize efforts

Tracking the costs of as many components as possible enables you to prioritize improvements over time, even as your cloud matures, your staff gets better and better at running it, and the demands from your customers change. In order to build a tool around our TCO model, we had to decide which costs to track and model together. Our model accounts for all hardware, software, and personnel required to operate a private cloud. Each and every one of them is a potential lever affecting how your costs change over time.

image00

The levers built into the model include: VM density affecting hardware spend, IT automation for personnel costs, and software choices for software costs. Between the three of these, the model addresses all of the major costs of acquiring and operating a private cloud, with the exception of data center facilities. With the low impact on costs of hardware and changes to density, we assumed that datacenter facility costs will largely be the same across technologies and were not a focus of this model. However, should you have great data center cost information you’d like to contribute, please let us know, as we strive to increase the completeness and accuracy of our model.
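The three levers can be combined into one illustrative per-VM cost function. All coefficients here are hypothetical placeholders, not figures from the model:

```python
def cost_per_vm(density, instances_per_admin, sw_cost_per_vm,
                server_cost=10_500, admin_cost=120_000):
    """Sum the three levers: density -> hardware, automation -> personnel,
    software choice -> software. All inputs are hypothetical."""
    hardware = server_cost / density
    personnel = admin_cost / instances_per_admin
    return hardware + personnel + sw_cost_per_vm

base = cost_per_vm(15, 53, 1_000)        # average automation
automated = cost_per_vm(15, 100, 1_000)  # high automation
print(f"Automation lever alone: ${base - automated:,.0f} per VM")
```

Even with these made-up coefficients, pulling only the automation lever dwarfs what the density lever achieved in the earlier comparison, which is why automation tops the to-do list.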

The model suggests IT automation should be the first item on your todo list.

Considering the timeframe increases model accuracy

Even though building a cloud can be quick, getting the most from its operation is a journey: staff will learn along the way, corporate functions will have to adjust, and business demands for new technologies and faster IT response will only increase.

Per-VM costs are inseparable from timing. You’re buying hardware, hiring people, buying software, suffering talent loss, refreshing hardware, and buying still more to support growth. All of these costs can, and usually do, hit your budget differently every year. If you’re buying software licenses, you have a large upfront cost and maintenance. If your staff gets promoted, gets raises, and sometimes takes new jobs, these will affect salaries, hiring, and training costs. Some you can plan for, some you can’t.

Put in another way, if next year, you provide the exact same quality of service, to the exact same customers, in the exact same quantity, with the exact same technology, there’s still a very real chance your costs will not be the same as they are this year.

We’re showing costs and cost changes over six years, but we modelled out to ten to find out when the costs start flattening out.

If you want your TCO model to be a tool for ongoing decision making, you need to not only look at costs, but how costs change over time.

The cloud growth curve doesn’t affect the TCO

One of the nice things about creating a flexible model is that it allows you to try all sorts of hypotheses and inputs. While absolute costs depend on the success and speed of your private cloud adoption, one of our surprising discoveries is that relative costs are not dependent upon your adoption curve. None of the advice the model provides is affected by the growth curve.

This means IT organizations can get started even when unsure of how quickly their private cloud is going to take off. This also makes the particular growth model we discussed here a lot less important. Our examples have VM count doubling every year, which is the most common customer story you hear during IT conference keynotes, but the advice is equally applicable no matter what your particular growth curve is.

Having technical conversations with Lines of Business (LOBs) is frustrating for both sides: they often can’t provide the information you need in order to produce a thoughtful architecture and plan, and, for any number of reasons, you can’t provide accurate costs and cost changes over time. With a good TCO model, these conversations get considerably easier for both sides of the table: you can model different scenarios, provide ranges of pricing, and help your LOBs work through priorities. Invest the required time in an accurate TCO model, and you’ll not only make these conversations easier, but you’ll have the tools in place to add financial input into your designs even as the services you provide change over time.

If you’re interested in expanding on what we’ve built, please let us know.

Erich Morisse
Management Strategy Director
@emorisse

Massimo Ferrari
Management Strategy Director
@crosslogic


* If you think that you can run a cloud by leveraging existing IT Ops, think again. Research published by Gartner shows that not creating a dedicated team is one of the primary reasons for the failure of cloud projects: Climbing the Cloud Orchestration Curve 

** Velocity 2012: Richard Cook, “How Complex Systems Fail”
The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
Complex Adaptive Systems: 1 Complexity Theory
Systemantics
What You Should Know About Megaprojects and Why: An Overview

Why Red Hat is extending management support to Microsoft private and public clouds

Today, Red Hat announced an unprecedented partnership with Microsoft, focused on mutually supporting a number of technologies and platforms. Among others, we announced the upcoming support of Microsoft Azure and System Center in our cloud management platform CloudForms.

Over the past 18 months, I’ve seen Microsoft’s public cloud evolve and mature, and the market interest grow, to the point that supporting Azure side by side with Amazon Web Services has been the number one request from many of our enterprise customers planning to, or currently building a hybrid cloud. These customers are demanding a single pane of management glass to consistently orchestrate the lifecycle of their applications across the two leading cloud platforms.

We are working hard to satisfy that demand, not only because it’s the best for our customers, but also because it aligns with one of the core principles shaping Red Hat Management portfolio: multi-vendor support.

CloudForms can already orchestrate and govern a broad range of server virtualization and private IaaS cloud engines from multiple vendors, not just Red Hat. This agreement with Microsoft extends our range to Microsoft cloud offerings. As we said for the Ansible acquisition, our enterprise customers have complex heterogeneous IT environments and don’t want IT organizations to create redundant management silos, or embrace single vendor stacks if it’s not the best for their business.

Red Hat aims to be our customer’s most trusted business partner in their journey to cloud computing. The first step has been recognizing that there’s no single cloud technology, product, platform, or vendor that can solve all problems in the most efficient way. Now, the second step is to enable our customers to consume the cloud technologies that fit their business goals and corporate culture in the most frictionless way.

This is what you should expect from us as result of today’s announcement: an easy and streamlined approach to manage workloads deployed across Microsoft Azure and Amazon Web Services, Microsoft System Center Virtual Machine Manager and VMware vSphere, and of course Red Hat Enterprise Virtualization and Red Hat Enterprise Linux OpenStack Platform.

This seamless management will extend to all CloudForms capabilities, including self-service provisioning and lifecycle management, policy-based orchestration, show back and chargeback, configuration management and drift analysis, capacity planning and reporting.

There are many more steps we plan to take to fully enable an enterprise hybrid cloud. Expect big things from Red Hat.

Alessandro Perilli
GM, Management Strategy
@giano

When And Why OpenStack Needs A Cloud Management Platform

At Red Hat we are seeing more and more organizations choosing OpenStack for the next step in their cloud journey. Very often, this transformation journey is marked by three main evolutionary stages:

  1. Build a server virtualization environment for scale-up workloads
  2. Extend the server virtualization environment with an Infrastructure-as-a-Service (IaaS) cloud for scale-out workloads
  3. Unify and enforce enterprise-grade governance for both server virtualization and IaaS cloud environments

Different companies stop at different stages of this maturity model, depending on the business needs and the maturity of their IT organization. As the environments in stage 1 and stage 2 grow in size and complexity, companies can reach an operational scale that requires more sophisticated management tools than the ones provided out of the box by server virtualization and IaaS cloud engines.

A Cloud Management Platform (CMP) offers an additional layer to govern a complex server virtualization environment or IaaS cloud as needed by a large-scale end user organization.

In fact, despite being a powerful and flexible IaaS cloud engine, OpenStack doesn’t offer the full range of management capabilities that some organizations may be looking for, such as:

  • Capacity & Performance Management
  • Configuration & Change Management
  • Chargeback
  • Orchestration

OpenStack does a great job of providing the instrumentation for the aforementioned capabilities – think of the metering APIs that OpenStack Telemetry (Ceilometer) offers, or the orchestration templates that you can define with OpenStack Orchestration (Heat) – but the management tools that it provides on top of that instrumentation don’t meet the needs of every organization.

To better understand why a CMP is so important at a certain operational scale, let’s use an analogy: professional property renting.

When you think about the management tools that IT organizations use at each stage of our maturity model, think of:

The Virtual Infrastructure Manager for the amateur landlord

As we said before, at this stage an organization has in place a server virtualization environment and its management console like, for example, Red Hat Enterprise Virtualization Management. The organization is an amateur landlord.

Let’s say that you own one or more apartments that you want to rent. All of them are ideally located in the same city but different in size, finishes, prestige of the location, etc. You want to rent them as long as you can, carefully selecting the best possible occupant for each. You want to keep things simple: long term, fixed price contracts, personally track every change in each apartment and, if something bad happens, you personally work with the occupant to determine responsibility and find a solution.

Your apartments are unique, lovely, hand cared for, just like VMs in a server virtualization environment.

However, you don’t get the most from your properties because this simple, non-automated way of doing business is slow rather than agile, reactive rather than proactive, and gives an unbalanced level of attention to each asset. For example, if one of your tenants starts acting unpredictably and against the law, evicting them can become a nightmare, distracting you from managing all the other apartments. Or, if a growth opportunity knocks at the door, you need time to carefully plan a new property acquisition, select tenants, and so on, which will likely make you miss the opportunity window.

This way of doing business is perfectly fine as long as your ambitions as a landlord (or your scalability needs as an IT organization) remain contained. If your ambitions or needs grow, maybe due to a highly competitive market, you need better tools to manage your property portfolio (or your application portfolio) in a more efficient and operationally scalable way.

The IaaS Cloud Manager for the Airbnb-enabled landlord

At this stage an organization has in place an Infrastructure as a Service (IaaS) engine like, for example, Red Hat Enterprise Linux OpenStack Platform. The organization is an Airbnb-enabled landlord.

If the number of apartments you want/need to manage grows, maybe due to early success and increasing market demand, you feel the need for a tool like Airbnb. Airbnb maximizes your capability to address the market demand and minimize the friction in the renting process in many ways. It offers a wonderfully designed website that lists your properties on a map, showing photos of the rooms and furniture, giving guidance about the services around the apartments, and providing a complete booking service that your potential tenants can use in a self-service way.

Airbnb enables you to easily manage different contract options (monthly, weekly, daily), rent a single room or the entire apartment, open and close the calendar for availability instantaneously and, more importantly, gives you the flexibility to change your mind whenever you want (and offers up to $1M host protection insurance). Airbnb exposes a rating for each property, encouraging landlords to offer a consistent experience for every apartment. Services like Airbnb can help the real estate market grow by increasing competition, pushing landlords to invest more in their properties as revenues come in quicker and with less friction.

In the same way, OpenStack offers your lines of business a self-service portal they can leverage to provision what they need, gives you the flexibility to build instance flavours offering different lease times, amounts of resources, and pre-baked images, and grants you the flexibility to introduce or retire those flavours as needed. The usage model encourages users to standardize on the OS/middleware offering, consequently increasing predictability and efficiency in terms of maintenance, hardware resources, purchasing, etc.

Landlords embrace tools like Airbnb to manage their properties because they want to be agile and catch new business opportunities. To do so, they agree to cut their emotional bond with each individual apartment. IT departments are driven by similar logic, and agree to move from pet VMs to cattle instances.

The CMP for the professional property manager

At this stage an organization may have deployed a Cloud Management Platform (CMP) like, for example, Red Hat CloudForms, to govern both the server virtualization environment and IaaS cloud. The organization is a professional property manager.

Let’s say that the agility offered by a tool such as Airbnb makes you feel confident to serve hundreds or even thousands of tenants and manage many more properties. This last step in your career as a landlord introduces a completely new set of needs and the complexity is so high that you cannot do everything by yourself. At this point, a tool like Airbnb can’t fulfill all your needs because it’s not designed to serve landlords at scale:

  • managing bookings, cancellations, and changes at scale can’t be done with a spreadsheet; you need a professional booking system, plus some level of automation to manage your capacity while supervising the performance of each property.
  • for each tenant you need to inventory the stay, consumption, reimbursements, etc., and offer transparent billing. This requires a professional chargeback process.
  • for every booking of every property you need to arrange cleaning, supplies, access, etc. When the numbers start rising, this can become a massive effort, impossible to fulfill manually. You need to orchestrate all the external services connected to your estates: professional cleaning for both the property and bed linens, for example; suppliers of things like soap, toilet paper, and coffee; someone to distribute the keys; and so on.
  • every time a tenant leaves, you have to check that everything is OK. You need to plan minor and major maintenance activities, changes, and improvements for every single property, and even evaluate the opportunity to buy new ones!

Operational Burden

Exactly like in our analogy, a CMP introduces a set of critical management capabilities to enhance and augment what OpenStack can do out of the box. Additionally, and critically, a CMP can unify the self-service provisioning experiences across both the server virtualization environment and the IaaS cloud that it manages side by side.

Cloud Management Platform
Following these principles, a CMP like Red Hat CloudForms has capacity planning capabilities that enable IT organizations to know which OpenStack availability zone has enough resources to deploy new instances. For example, capacity planning can tell you that a single instance of a web server with 2 vCPUs and 2GB of memory can be safely deployed in zone A, but that if you plan to scale it out at a certain point in time, zone B is a better choice because it provides the additional resources needed.
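At its core, that zone-selection logic reads like a simple headroom check. This is only a sketch with hypothetical capacity numbers, not the CloudForms API:

```python
# Hypothetical free capacity per availability zone (not real CloudForms data).
zones = {
    "A": {"vcpus": 4,  "mem_gb": 6},
    "B": {"vcpus": 40, "mem_gb": 64},
}

def fits(zone, vcpus, mem_gb, count=1):
    """Can `count` instances of this flavor be placed in the zone?"""
    free = zones[zone]
    return free["vcpus"] >= vcpus * count and free["mem_gb"] >= mem_gb * count

print(fits("A", 2, 2))           # one 2 vCPU / 2 GB web server fits in A
print(fits("A", 2, 2, count=4))  # scaling out to 4 instances exceeds A
print(fits("B", 2, 2, count=4))  # zone B has the headroom
```

A real capacity planner also forecasts future demand rather than just checking current headroom, which is exactly what the next capability adds.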

It provides performance analysis capabilities to monitor and forecast the utilization of instances, hosts, and providers. For example, it can track the average load of the physical hosts over time, suggesting when to add more hardware to support the increasing demand for resources.

In combination with Ansible (which Red Hat recently acquired), CloudForms offers automation capabilities allowing administrators to create orchestration and configuration workflows for the deployment, setup, and retirement of instances. For example, deploying a web server hosting a public website will require your firewall to open a number of ports, and your router to set up NAT on a public IP to grant access to the Internet audience.

Moreover, CloudForms’ change management and policy enforcement capabilities keep the entire environment in compliance, tracking modifications and enforcing specific configurations or patch installations on instances and hosts. For example, if one of the tenants configures an instance in its domain in a way that opens a potential security breach, CloudForms will automatically restore a safe state.

Last but not least, CloudForms’ chargeback capabilities allow IT organizations to charge for OpenStack instance allocation and usage based on a number of different criteria. For example, you can account for the utilization of a specific instance by the minute, hour, or day, or at a fixed price, depending on the kind of workload it is going to support.
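Mechanically, chargeback reduces to rate times metered usage. The rates below are hypothetical; in CloudForms they would come from the configured rate plans:

```python
# Hypothetical chargeback rates (USD); real rates live in CloudForms rate plans.
RATES = {"per_hour": 0.12, "per_day": 2.50, "fixed_monthly": 60.00}

def charge(kind, units=1):
    """Bill a workload: metered rates multiply by usage, fixed rates don't."""
    if kind == "fixed_monthly":
        return RATES[kind]
    return RATES[kind] * units

print(f"${charge('per_hour', 730):.2f}")  # a VM metered hourly for ~a month
print(f"${charge('fixed_monthly'):.2f}")  # a fixed-price workload
```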

So, in summary: some organizations may find the management engines that come out of the box with traditional server virtualization or Infrastructure-as-a-Service engines a perfect fit for their business needs. However, for those organizations planning to build a large-scale, enterprise-grade private or hybrid cloud, a CMP offers a governance layer that allows them to reach an operational scale that would be impossible to manage otherwise.

Massimo Ferrari
Management Strategy Director
@crosslogic

Why did Red Hat acquire Ansible?

Today, we announced a definitive agreement to acquire Ansible, a popular IT automation tool launched in early 2013. Like in any acquisition, customers and partners will likely have a number of questions, so let me get straight to the point and cover the top three questions I anticipate:

Why an IT automation tool?

Automation helps IT organizations address the increasing demand for speed and simplicity coming from the lines of business (LOB) across a wide range of key initiatives, including:

  • Support for cloud-native applications through the deployment of Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) clouds
    Next-generation applications require next-generation computing environments, like scale-out IaaS and PaaS clouds. The deployment of these cloud environments (e.g. OpenStack) can be challenging due to their inherent complexity and the relative maturity of the underlying technology.

    IT automation tools can help to dramatically speed up cloud deployments while drastically reducing human errors associated with manual intervention.

  • Agile application development through the DevOps practice
    Next-generation applications are developed following new methodologies, like DevOps, and new patterns, like the microservices architecture. Supporting the continuous delivery advocated by the DevOps methodology requires a toolchain that empowers developers to release early and often. In turn, the application update frequency depends on how fast, simple, and efficient the DevOps tools in the toolchain are.

    IT automation tools are a critical addition to any DevOps toolchain, as they can apply a large number of changes to complex application architectures, and to a large number of application instances, in a very short amount of time.

  • Service orchestration through IT process automation
    The ultimate ambition of IT organizations worldwide is to offer their LOB a fully automated provisioning of entire application stacks, through virtual machines (VMs) or containers, or “service orchestration”. It’s an ambition as old as the private cloud, and yet, the industry struggles to make it a reality. The problem is that orchestration and automation are two incredibly challenging processes, because of the myriad of moving parts to coordinate, and the lack of standardized interfaces to programmatically coordinate them.

    Red Hat CloudForms, our cloud management platform, is best in class at orchestrating the whole lifecycle of an enterprise application (from provisioning to retirement) according to configuration and compliance policies. However, a great orchestration engine still depends on last-mile automation to compose each tier of the application. The more flexible and powerful the IT automation engine, the more complex the applications that can be provisioned.

Our customers already use Red Hat solutions in conjunction with various IT automation tools. With this acquisition, we want to offer that type of integration through the world-class Red Hat support and certification that makes open source consumable for the enterprise (exactly the same way we do for OpenStack and every other product in our portfolio).

Why Ansible?

We see in Ansible a perfect alignment with the core principles that shape Red Hat’s management, both at the product level and at the portfolio level.

At the product level, Ansible matches Red Hat’s desire to deliver a frictionless design and a modular architecture through open development:

  • Ansible is simple to use.
    A quick Google search will reveal an overwhelmingly consistent sentiment about Ansible’s low learning curve and its simpler manageability. As we work to deliver the Frictionless IT that our customers need to address the demand of current and future generations, this focus on “simple” is critical.

    How simple? Let me give you two examples.
    First: Ansible’s “playbooks” are written in human-readable YAML, which makes the automation workflows easier to both write and maintain.
    Second: Ansible is agentless, using standard SSH connectivity to execute automation workflows, making it much easier to blend into an existing enterprise IT environment and its intricate operational framework.

  • Ansible is modular.
    At the time of writing, Ansible ships with 400+ modules, which can be invoked at will to extend the product’s capabilities beyond its core feature set and intent. This is a critical capability that we want to offer in all Red Hat management products to support our customers as their needs evolve in terms of the maturity, complexity, and scale of their IT.

    How modular? Let me give you one example.
    Ansible’s modules span from managing images in the OpenStack Image Service (Glance), to managing Linux containers, to collecting data from an F5 Big-IP application delivery controller.

  • Ansible is a very popular open source project.
    The Ansible community is incredibly active, with members contributing to both the core technology and the modules that ship with it. We believe that supporting and nurturing great open source communities is the only way to guarantee a continuous stream of innovation, and it’s what makes Red Hat so special.

    How popular? Let me share some telling examples.
    First: Ansible has almost 13,000 stars and almost 4,000 forks on GitHub.
    Second: according to RedMonk, the number of mentions of Ansible in the Hacker News community is skyrocketing.
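To make “simple” and “modular” concrete, here is a minimal, hypothetical playbook (the inventory group and package names are mine, not from the post). It installs and starts an Apache web server using two of Ansible’s stock modules, and it runs over plain SSH with no agent on the target hosts:

```yaml
# site.yml -- an illustrative sketch, not a tested production playbook.
# "webservers" is a hypothetical inventory group.
---
- name: Install and start an Apache web server
  hosts: webservers        # reached over standard SSH; no agent required
  become: yes              # escalate privileges on the target hosts
  tasks:
    - name: Ensure the httpd package is present
      yum:
        name: httpd
        state: present

    - name: Ensure the httpd service is running and enabled at boot
      service:
        name: httpd
        state: started
        enabled: yes
```

Running it is a single command, `ansible-playbook -i inventory site.yml`, and the YAML reads close enough to plain English that the workflow largely documents itself.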

At the portfolio level, Ansible matches Red Hat’s desire to support a multi-tier architecture, provide multi-layer consistency, and deliver multi-vendor support:

  • Ansible supports multi-tier deployments.
    Ansible is designed to support the deployment and configuration of a multi-tier application, through VMs and containers. This means that organizations can automatically provision different components of the same application on the tier that is most efficient to run them: scale-up workloads on bare metal and server virtualization engines, scale-out workloads on IaaS cloud engines and PaaS cloud engines. We do not believe in “one size fits all” approaches and we are committed to supporting the broadest range of infrastructure and platform engines possible.

    How far does Ansible’s multi-tier support go? Here’s an example.
    Ansible can manage VMs and guest OSes in a VMware vSphere server virtualization environment, deploy and manage instances in an OpenStack IaaS cloud, and deploy applications inside an OpenShift PaaS cloud, all at the same time.

  • Ansible brings consistency at multiple layers of the architecture.
    Ansible can be used to programmatically manipulate every layer of a computing architecture, from the infrastructure to the application, and for every use case, from orchestration to deployment to configuration. As I said at the beginning of this post, Red Hat is committed to enabling the provisioning of entire application stacks in the easiest possible way, and management consistency is a great way to keep things easy.

    How far does Ansible’s multi-layer support go? Here’s an example.
    Ansible can automate everything, from the configuration of the network, storage, and compute layers (e.g. OpenStack instances), through the OS and middleware (e.g. Red Hat JBoss Middleware), up to the application layer.

  • Ansible supports heterogeneous IT environments.
    Ansible can automate the configuration of a broad range of technologies from many vendors, not just Red Hat. Our enterprise customers have complex heterogeneous IT environments and the last thing we want is for customers to create redundant management silos, or embrace single vendor stacks if it’s not the best for their business.

    How far does Ansible’s multi-vendor support go? I have two final examples for you.
    First: Ansible supports both Linux and Windows environments, performing equally well whether configuring an Apache2 web server or an application pool on Microsoft IIS.
    Second: through its modules, Ansible empowers IT organizations to manage a wide range of ISV and IHV technologies, from F5 Big-IP and Citrix NetScaler network controllers to Amazon Web Services and Google clouds.
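The multi-tier and multi-vendor claims above can be sketched in a single playbook with two plays, one targeting a Linux web tier and one targeting a Windows tier. This is an illustrative sketch rather than a tested configuration; the inventory group names and the application pool name are hypothetical:

```yaml
# multi_vendor.yml -- illustrative only; inventory groups are hypothetical.
---
- name: Configure Apache on the Linux tier
  hosts: linux_web          # e.g. Debian/Ubuntu hosts reached over SSH
  become: yes
  tasks:
    - name: Ensure the Apache2 package is present
      apt:
        name: apache2
        state: present

- name: Configure an IIS application pool on the Windows tier
  hosts: windows_web        # Windows hosts managed through Ansible's Windows support
  tasks:
    - name: Ensure the application pool exists and is started
      win_iis_webapppool:
        name: ExampleAppPool
        state: started
```

Because each play carries its own `hosts` target, a single run can configure heterogeneous tiers side by side, which is exactly the property that keeps one automation tool from fragmenting into per-vendor silos.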

How does Ansible fit Red Hat’s management strategy?

If you’ve read this far, you already have a pretty good idea of how Ansible will augment and complement Red Hat’s current management portfolio:

  • Red Hat CloudForms will continue to offer overall orchestration and policy enforcement across all architectural tiers we support, within the corporate boundaries and on public clouds.
  • Ansible will automate the provisioning and configuration of infrastructure resources and applications within each architectural tier, as requested through the CloudForms self-service provisioning portal. This will include deploying Red Hat Satellite agents on bare metal machines when the use case requires it.
  • Red Hat Satellite will continue to enable the provisioning and configuration of Red Hat systems (and security patches and software updates) within each architectural tier, as defined by the Ansible automation workflows.

Red Hat Open Management Platform

Red Hat customers will be able to adopt any of the three as a standalone product, but we’ll work hard to tighten the integration between them so that they work great together.

We are very excited to have the Ansible team joining the Red Hat family and we can’t wait to put the product in the hands of our customers.

Alessandro Perilli
GM, Management Strategy
@giano

Towards a Frictionless IT (whether you like it or not)

With the term Frictionless IT, Red Hat means an enterprise IT that just works, reshaped after the experience offered by modern consumer-grade public cloud services, which business users are growing to expect.

What does Frictionless IT have to do with Red Hat and the IT organisations that we serve? Simple: if we don’t start moving towards Frictionless IT, we all risk irrelevance.

Current generations of IT professionals are experiencing a growing disconnect between Enterprise IT and Personal IT.

  • Enterprise IT remains reliable, but it is in most cases slow to procure, complex to use, and overall frustrating. Think about your expense report system.
  • Personal IT is evolving into a set of services that are instantly available, incredibly easy to understand, and blazing fast at executing the tasks they are supposed to execute. Think about Gmail, Dropbox, Evernote, IFTTT, and the plethora of other public cloud services that we all interact with on a daily basis through our phones, tablets, and laptops.

The first problem with this split brain between Personal and Enterprise IT is that our brain is exactly the same, inside and outside the office. Any interaction with this emerging Personal IT raises the bar for what the IT experience should be. The more we use Gmail, Dropbox, Evernote and IFTTT in our personal life, the more our expectations grow for a similar experience at work. We wonder more and more, “if my Personal IT is such a breeze to use, why does my Enterprise IT have to be miserable?”

The second problem is that current generations can endure frustrating Enterprise IT only because that’s all that they have experienced for decades. New generations will not be so forgiving. The kids in college today, and those who just started their first job in a new, exciting startup, are growing used to only one kind of IT experience: the frictionless one.

At some point in the near future, these kids will land more reliable and less stressful jobs in large enterprises. It will not be just one or two individuals with a different set of expectations joining a typical bank or insurance company. It will be a whole generation that permeates every department of an end user organisation, from marketing to engineering, with a completely different set of demands and expectations. The overwhelming majority of IT organisations, and the traditional solution providers that support them, are completely unprepared to meet that demand.

At Red Hat, we recognise this challenge. In it we see an opportunity to simplify enterprise software in many dimensions, from the user interface to the underlying architecture, through not only the technology, but also aspects like documentation, licensing and much more.

We believe that at least three ingredients are necessary to meet the demand for frictionless IT:

Ease of use

A key enabler for a Frictionless IT is a smooth user experience (UX). The user experience is defined by the quality of the interaction between the human and the system, and it takes place whenever you deploy, integrate, customize, and use enterprise systems. Intelligent installers and self-contained binaries, simplified back-end architectures, supported out-of-the-box plug-ins, modular front ends, consistent UIs, and even coherent documentation all contribute to improving the quality of the UX. However, very few organisations in the world look at these aspects from a holistic standpoint and take a user-centric approach. For example, the user interface (UI), in both commercial off-the-shelf and custom-made applications, is one of the most overlooked aspects of enterprise software.

If you think that investing in a state-of-the-art UI is unnecessary, or not worth the effort, think again. The primary reason why some public cloud offerings become overnight successes on a planetary scale is their intuitive UI. In our Personal IT we are already getting used to intuitiveness, and the broad market offering sustains the demand for it. We have already reached the point where, if an app on our smartphones is too complex to use in the first few minutes, we simply delete it and download an alternative. There’s no second chance for the app that is not frictionless.

Now let’s go back to the upcoming generation of technology consumers. Even the most technical among them may never have built a computer by screwing a motherboard to the case (like many of us did, including me), used a command prompt, or plugged in a network cable. Those users will expect that installing software will be as frictionless as deploying a virtual appliance, that plugging in a cable will be as frictionless as drawing a line on a service catalog UI, and so on.

If the IT organisations of tomorrow don’t deliver that kind of ease of use, future generations of business users will simply circumvent them, even more than they do today, by relying on external cloud service providers. To meet the expectations of future generations, the UX in enterprise software has to dramatically improve.

Red Hat understands the challenge, and we are working hard to influence the open source projects that we support in the short and long term. For example, our commercial cloud management platform, CloudForms, comes as a single virtual appliance; this is in contrast to other cloud management platforms that may require 6 to 9 different tools (not all of them available as virtual appliances). We consider this a prime example of the effort we put into engineering more frictionless enterprise solutions.

Speed

A second key enabler for a Frictionless IT is speed. If the interface is pretty but you still need 20 steps (or 20 weeks) to get the job done, it’s not frictionless. We already know that speed deeply influences the UX, to the point of impacting search engine rankings, thanks to the enormous research conducted on aspects like loading time in web development. And yet, it took the industry a long time to realize that the same human brain that doesn’t tolerate a very slow page load very likely won’t tolerate a very slow enterprise IT experience.

Speed has become an increasingly important factor in the last five years, to the point that the industry constantly mentions agility as the most desired attribute for business and development models. Of course, agility is not just speed, but speed is a very big part of it. This is one of the many reasons why, for example, we are seeing a shift of interest from virtual machines (VMs) to application containers.

Operating system and application virtualization are as old as (and in some cases, older than) hardware virtualization. More than ten years ago, the emerging virtualization industry was rich with technology startups focused on all three approaches. As we know, eventually the mainstream audience preferred VMs over what we used to call operating system partitions and application layers, but today we are experiencing a second coming of the latter technologies because customers’ business needs are changing and evolving, as they always do.

Ten years ago, IT organizations’ primary challenge was modernizing the data center while maximizing the ROI on existing hardware equipment, and hardware virtualization brilliantly helped to accomplish the goal. Today, IT organizations’ primary challenge is addressing the business demand as fast as possible, because there’s now a competitor that never existed before: the public cloud provider. Application containers can be deployed in seconds rather than the minutes needed for VMs, significantly shrinking the reaction time for a variety of scenarios, including scaling out a web application to address an unexpected traffic peak and avoiding a fatally slow loading time.

Red Hat understands the challenge. This is why, for example, we invested so heavily in application containers, introducing enterprise support for the Docker format across a growing number of our products (Red Hat Enterprise Linux 7 first, then OpenShift 3, and soon CloudForms 4).

Application containers are just one example (and to be fair, they have more virtues than just speed of deployment); we constantly look at solutions that can dramatically increase operational speed.

Integration

A third enabler for Frictionless IT is seamless integration between enterprise products and the ancillary services necessary to make them work or unlock their full potential. No successful software or hardware comes without a certain degree of integration with the existing enterprise IT environment, but the extent of that integration makes or breaks the UX, in turn impacting users’ productivity.

Integration can happen at the back-end level and at the front-end level. The latter is rarely considered, so I’ll focus on that in this post. To clarify the deeply underestimated importance of front-end integration, I always use the analogy of the smart calendar.

In preparation for a business meeting, we typically check a couple of apps on our smartphones: the calendar app, to know when, where, and how we need to meet; and the map app, to know how to get there. In a perfect world, especially if the business meeting is a delicate negotiation with parties we’re meeting for the first time, we might want to check at least another couple of apps: LinkedIn, to learn more about the people we are going to meet; and Twitter, to learn more about what those people have to say about topics that may be relevant to the negotiation. Of the four, it is the last two apps that could provide the intelligence necessary to successfully close the negotiation. But because the information is spread across so many different apps, which dramatically increases the friction, we limit ourselves to checking the first two, the indispensable ones. Crucially, because of the friction, we skip the information that could be most valuable for the meeting, which deeply impacts our effectiveness.

Thankfully, there’s now a better way. A wave of so-called smart calendar apps is emerging (and rapidly being acquired), their biggest value being the ability to blend the front ends of the aforementioned four apps into a single, consistent UI that dramatically reduces friction. If you have ever tried smart calendars like Tempo or Sunrise, you have an idea.

Enterprise IT has to follow the same path: improve integration to minimize the friction (which in this case can appear as a steep learning curve) and maximise the productivity of the enterprise audience.

Red Hat understands the challenge, and we are working hard to influence the open source projects we support in the short and long term. For example, we are working within the ManageIQ community, the upstream project behind CloudForms, to develop a coherent UI allowing our customers to manage side by side virtual machines and containers in a consistent fashion:

CloudForms managing Kubernetes

CloudForms managing Containers

Ease of use, speed, and integration are key ingredients to dramatically improve the enterprise software (and hardware) UX. But what’s different from the past, you might ask? User experience has been considered a key differentiator since the late 1960s by companies like IBM, and there are plenty of ROI calculators showing that UX has a quantifiable impact on business. The difference is that now enterprise users have choice, and enterprise IT organizations have competitors. And the choice is incredibly broad and incredibly accessible. If IT organisations fail to deliver Frictionless IT, lines of business (LoB) will simply go elsewhere and get the job done with whichever of the many available tools is most convenient (in terms of simplicity, not cost).

A LoB doesn’t care about security, compliance, and integration issues, nor do they trouble themselves with the politics driving the IT organization’s choice of one solution over another. A LoB only wants to get the job done within the deadline. And if the corporate policies get in the way, they will often be circumvented. In turn, if the corporate policies get circumvented and the tools that empower a LoB are provided by external cloud service providers, in the long term the role of the IT organisation will become less relevant. To stay relevant in the eyes of upcoming generations, both vendors and their clients must recognise the ongoing transformation, anticipate the upcoming demand, and adapt.

It’s great to see how some vendors are starting to realise the need for Frictionless IT. For example, during last week’s Red Hat Summit 2015, our long-time partner SAP demonstrated a growing awareness of the need for simplicity.

On our side, we are working to deliver the most frictionless products that the open source communities, supported by Red Hat’s expertise and vision, can offer. We have a long way to go, but we are confident that this is the right path to walk. Stay tuned for more on this front.

 

Alessandro Perilli
GM, Management Strategy
@giano