The tools you use will differ if you’re running a Platform as a Service (PaaS), running a mobile application, or serving as a bank. Nevertheless, there are some core Key Performance Indicators (KPIs) that should be common to all of these environments. This article will take you through the KPIs you should either evaluate or revisit and examine what you should consider when measuring success.
Asset Management is not just about knowing how many Microsoft Windows servers you have or how many instances of Nginx you’re running. Sure, inventories and statistics are helpful, but the more mature your DevOps practices are, the less important the inventory becomes. After all, as you start trading in pets for cattle, your servers, containers, and serverless functions cease to be individual components and become pools of resources executing code.
This KPI is more about how much of your environment is managed through automation and how effectively it is controlled. It doesn’t matter whether your tools lean procedural, such as Chef or Ansible, or declarative, such as Puppet or Terraform; what matters is how effectively you use them. These tools allow you to automatically provision bare-metal or cloud assets and have them join a load balancer and start receiving traffic. However, you can achieve similar results with Kubernetes or Swarm clusters, or by chaining lambda functions. Automation is really about controlling assets to achieve your business goals.
Lastly, optimizing your assets is key not only for controlling costs, but also for being able to properly measure workloads. Whether you use on-premises virtualization or cloud workloads, spending time streamlining your processing cycles down to the minimum hardware required can make a significant difference in monthly cost, as can assessing your “bang for the buck” when calculating scaling. Once you’re fully automating your configuration and metrics collection, you can analyze your metrics and daily utilization and tune your setup until you hit the cost/size sweet spot for scaling.
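To make the “bang for the buck” comparison concrete, here is a minimal Python sketch. The instance sizes, prices, and capacities are hypothetical placeholders, not real provider figures; in practice the capacity numbers would come from your own load tests. It also simplifies by assuming you provision for peak load all month.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass(frozen=True)
class InstanceOption:
    name: str            # hypothetical size label
    hourly_cost: float   # assumed price in dollars per hour
    capacity_rps: float  # requests/sec one instance sustained in your load tests


def monthly_cost(option: InstanceOption, peak_rps: float) -> float:
    """Monthly cost of running enough instances of this size to serve peak_rps."""
    count = math.ceil(peak_rps / option.capacity_rps)
    return count * option.hourly_cost * HOURS_PER_MONTH


def cheapest_option(options, peak_rps: float) -> InstanceOption:
    """Pick the size with the lowest monthly cost at the given peak load."""
    return min(options, key=lambda o: monthly_cost(o, peak_rps))


# Illustrative numbers only.
options = [
    InstanceOption("small", 0.05, 100),
    InstanceOption("medium", 0.10, 220),
    InstanceOption("large", 0.20, 400),
]
best = cheapest_option(options, peak_rps=900)
print(best.name, round(monthly_cost(best, 900), 2))
```

With these made-up figures the smallest size wins because it wastes the least capacity at peak. Different measurements can easily reverse that result, which is why the comparison is worth rerunning as your metrics change.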
Here are some thoughts to keep in mind when measuring Asset Management:
- How much of your network and how many of your servers, containers, services, etc. are automated? The higher the percent of automation, the less time you’ll spend fixing and maintaining your infrastructure and the more time you’ll spend innovating.
- How many servers or applications are snowflakes? If there are snowflakes, is there a strong business reason for them not to be automated? If you are not cloud native, shrink the number of snowflakes as much as possible. If you are cloud native, most popular tools covered in this KPI allow you to fully integrate your solutions into automation management. If applications simply cannot be automated, perhaps it’s time to retire them and build the next-generation application.
- Are all assets under control through automation using tools (i.e., percent of visibility)? If they are not, there will always be an incident that comes back to haunt you. In the cloud, this can also lead to hidden costs.
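The questions above reduce to a couple of numbers you can track over time. The following Python sketch assumes a simple inventory in which each asset is flagged as managed or as a snowflake; the `Asset` type and the sample names are hypothetical, and in a real setup the list would come from your automation tooling or cloud provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    managed: bool            # True if provisioned and configured through automation
    snowflake: bool = False  # True if hand-built and not reproducible


def automation_coverage(assets) -> float:
    """Percent of assets under automation; 0.0 for an empty inventory."""
    if not assets:
        return 0.0
    return 100.0 * sum(a.managed for a in assets) / len(assets)


def snowflakes(assets):
    """Names of assets that still need a business justification or a retirement plan."""
    return [a.name for a in assets if a.snowflake]


# Illustrative inventory only.
inventory = [
    Asset("web-1", managed=True),
    Asset("web-2", managed=True),
    Asset("db-1", managed=True),
    Asset("legacy-erp", managed=False, snowflake=True),
]
print(f"{automation_coverage(inventory):.0f}% automated")
print("snowflakes:", snowflakes(inventory))
```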
Should you be monitoring all of your resources? Yes. Should you also filter out the noise? Absolutely. The only thing worse than not monitoring anything is monitoring everything without proper filtering. Implementing good monitoring should be coupled with using the information you acquire.
Resources (servers, applications, and other tools) provide information about your environment. They create logs, metrics, and statistics. They tell an overarching story that must be understood in order to make better decisions for the business. Capturing logs, parsing them, and using tools such as ELK, Datadog, or Sumo Logic gives you a large toolkit that can not only capture and glean useful information from a tremendous amount of data, but can also help you make decisions and perform predictive analyses. You might learn that your stack is operating inefficiently or not scaling fast enough during peak utilization periods.
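As a small illustration of turning raw logs into a metric, the Python sketch below parses access-log lines and computes a server error rate. It assumes lines in Common Log Format and uses a plain regular expression rather than any of the tools named above; the sample lines are invented.

```python
import re

# Matches Common Log Format: client, identity, user, [timestamp], "request", status, size.
LINE_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str):
    """Return the named fields of one log line, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


def server_error_rate(lines) -> float:
    """Fraction of parseable requests that returned a 5xx status."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    if not parsed:
        return 0.0
    return sum(p["status"] >= 500 for p in parsed) / len(parsed)


# Invented sample lines; the third is noise and gets skipped.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:55:37 +0000] "POST /api/orders HTTP/1.1" 503 512',
    'not a log line',
]
print(server_error_rate(sample))
```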
Taking this further, proper monitoring requires a tremendous amount of effort. Most tools provide canned reports and capture high-level metrics, such as percent CPU utilization. Refining this monitoring over time and making better use of your tools to gather the kind of information you want is time well spent. When measuring your monitoring, ask yourself these questions:
- Are you getting hundreds or thousands of alerts that go ignored? If so, you are doing something wrong. “Good” alerts are actionable. Other information may be helpful during analysis but should not send alerts.
- Are you resolving recurring alerts with automation? If not, why not?
- Are monitoring agents deployed to all of your servers? If not, you are losing insight. If your PaaS tools and other providers have integrations, you should be using them and parsing out unneeded data. Splunk and Datadog are examples of platforms that have integration hooks for many AWS services, allowing you to simply plug in and start getting insights.
- Are you using tools like Logstash to pull and grok application, database, and other logs in your stack for meaningful data? If not, you’re limiting your insight into the service’s health.
- Are your disparate systems aggregating into a common tool? Make your life easy and facilitate this aggregation. Otherwise, you may spend valuable time combining information across systems or pulling information from data warehouses to get your insight. Let the tools do that work for you.
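One way to put a number on the first question is to measure how many alerts actually led to action and which rules only produce noise. This Python sketch assumes you can export alert history with a flag for whether anyone, or any automation, responded; the `Alert` type, rule names, and threshold are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    rule: str       # name of the alerting rule that fired
    acted_on: bool  # True if a person or an automated runbook responded


def actionable_ratio(alerts) -> float:
    """Fraction of alerts that led to action; 0.0 when there is no history."""
    if not alerts:
        return 0.0
    return sum(a.acted_on for a in alerts) / len(alerts)


def noisy_rules(alerts, min_count: int = 3):
    """Rules that fired at least min_count times and were never acted on."""
    fired = Counter(a.rule for a in alerts)
    acted = {a.rule for a in alerts if a.acted_on}
    return sorted(r for r, n in fired.items() if n >= min_count and r not in acted)


# Illustrative history only.
history = [
    Alert("disk_full", True),
    Alert("cpu_high", False),
    Alert("cpu_high", False),
    Alert("cpu_high", False),
    Alert("disk_full", True),
]
print(actionable_ratio(history))
print(noisy_rules(history))
```

Rules that show up in the noisy list are candidates for demotion to dashboards or reports, while recurring rules that are always acted on the same way are candidates for automated remediation.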
Elad Ishay is the Head of DevOps at Alcide.