Where to start with DevOps Metrics?

Now that more and more teams and organizations are moving towards a DevOps way of working, I get asked the question “What should we measure?” a lot. To be honest, I find this a very hard question. The main reason is that metrics are always a point of discussion and a trigger for behavior change, and not necessarily for the better.

But on the other hand, it is an understandable question and as long as you realize that metrics influence behavior and that DevOps is about learning and improving, I believe that metrics are a powerful mechanism to achieve this.

In this post I’ll talk about monitoring and metrics. Becoming an autonomous DevOps team also means that you as a team are responsible for both building and running the application. Building an application and running an application are two completely different things, and they require different metrics. This post gives some guidance that is not, and never will be, complete; it is merely meant as a starting point. DevOps is all about continuous learning and implementing the right things to make your application successful. Therefore there is no standard recipe for success, nor a complete list of metric requirements that you have to fulfill. You need to gain insight into your application, its users, their usage and behavior, and use that as new input to make the product better.

Primarily, you can split the work of a DevOps team into two parts: building the product and running the product. To structure this article, I have split it into a Build and a Run section, each discussing several metrics and questions.

Metrics when building software

What are you monitoring during the build process of your application? And what should you take into consideration?

This article gives a nice overview of what needs to be done in order to create a product of good quality. Looking at these dimensions, some questions that are easy to ask (but hard to answer) can be deduced.

  • Software has to be deployable and it has to satisfy the minimum functionality
    • How often are you deploying?
    • Has it deployed successfully?
    • Are the tests successful?
    • Can users work with the product? What are the drop-off rates?
  • Software needs to be performant and secure
    • Test results from load/stress/penetration testing
    • Credential scans
    • Basic Security
  • Software needs to be easily usable
    • Duration of sessions
    • Incident count
    • Support Questions
    • Drop-off rate
    • Interaction sequences
  • Software needs to be useful.
    • Are features used?
    • Investment. ROI?
    • How often are features used?
    • Is it worth the investment?
  • Software needs to be successful
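
The first two questions in the list above (how often you deploy, and whether deployments succeed) can be answered with very little data. The sketch below is only an illustration: the `Deployment` record, the `deployment_stats` helper and the sample dates are hypothetical, and in practice the records would come from your own release pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date        # calendar day the deployment ran
    succeeded: bool  # True if the deployment completed successfully


def deployment_stats(deployments: list[Deployment]) -> dict[str, float]:
    """Return the average deployments per week and the share that succeeded."""
    if not deployments:
        return {"per_week": 0.0, "success_rate": 0.0}
    first = min(d.day for d in deployments)
    last = max(d.day for d in deployments)
    # Count the observed period inclusively, so a single day counts as 1 day.
    days = (last - first).days + 1
    successes = sum(1 for d in deployments if d.succeeded)
    return {
        "per_week": len(deployments) / (days / 7),
        "success_rate": successes / len(deployments),
    }


# Hypothetical sample: four deployments spread over a 14-day window.
sample = [
    Deployment(date(2024, 1, 1), True),
    Deployment(date(2024, 1, 4), False),
    Deployment(date(2024, 1, 9), True),
    Deployment(date(2024, 1, 14), True),
]
stats = deployment_stats(sample)
print(stats)
```

Tracking these two numbers per sprint or per month is usually more informative than a single snapshot, because the trend shows whether the team is improving.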

Questions for the Product Owner

  • Do you have a clear description of the product you own?
  • What is the vision of this product?
  • Who is the customer?
  • How do you measure success?
    • e.g. number of users
    • increased usage over time
    • no outages
    • NPS scores?
  • When is the product successful?
  • What is your feature timeline (6 months)?
  • How often is each feature being used?
  • What feature sets tend to be used by the same people?
  • Which features are your engaged users using most?
  • Where do users get stuck and abandon the product?
  • How long are users spending on each feature?
  • Who abandons it and who keeps using it?

Questions regarding progress

  • How long does it take to fix a bug (MTTR)?
    • Are we getting better at fixing bugs and releasing fixes?
  • How are you progressing on the feature timeline?
    • Are features well defined?
    • Is there focus on what should be done?
    • Are features actually getting closed?
    • Do we deliver on promises?
  • How often do you deliver a feature?
    • Are we deploying features?
  • How is your sprint progress? Velocity?
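
As a rough illustration of the MTTR question above, the following sketch averages the time between a bug being reported and its fix being released. The function name and the sample timestamps are made up for the example; real data would come from your issue tracker.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(bugs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time between a bug being reported and its fix being released."""
    if not bugs:
        raise ValueError("need at least one resolved bug")
    # Sum the individual repair times, starting from a zero timedelta.
    total = sum((fixed - reported for reported, fixed in bugs), timedelta())
    return total / len(bugs)


# Hypothetical (reported, fixed) pairs: repair times of 6, 24 and 3 hours.
resolved_bugs = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 3, 9, 0)),
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 13, 0)),
]
mttr = mean_time_to_repair(resolved_bugs)
print(mttr)
```

Computing this value per month or per sprint and comparing the results is one way to answer whether the team is getting better at fixing bugs.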

Questions for development teams

  • What is the trend of technical debt?
  • What is the trend of Unit Tests/Coverage/Lines of code?
  • Build breaks? Release breaks?
  • Activity on the project?

Metrics for Running software

The most important thing to remember is that all decisions you make based on these metrics should be grounded in an SLA or NFR. If you need to guarantee 100% uptime, there are other thresholds and other metrics you need to take care of.

Basics to cover

  • Is my product/service up?
  • Availability check, ping request etc.
  • Health, Performance, Usage and Costs
  • What are the thresholds you agreed on with your customer/stakeholder?
    • Response times
    • Resolution time
    • Max Load
  • Do people use the platform? How many? Is that on target?
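
To make the link between an availability target and a concrete threshold tangible, here is a small sketch. Both helper functions and the 99.9% target are assumptions for illustration; the numbers you use should come from your own SLA or NFR.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - availability_pct) / 100.0


def measured_availability_pct(check_results: list[bool]) -> float:
    """Share of availability checks (for example ping requests) that succeeded, in percent."""
    if not check_results:
        raise ValueError("no check results")
    return 100.0 * sum(check_results) / len(check_results)


# Hypothetical target of 99.9% over 30 days, and 1000 checks of which one failed.
budget = downtime_budget_minutes(99.9)
measured = measured_availability_pct([True] * 999 + [False])
print(f"allowed downtime: {budget:.1f} min, measured availability: {measured:.1f}%")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, while a 100% target leaves none, which is why the agreed target determines which thresholds and metrics you need.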

Again, this is not a complete list of metrics that you should implement, but if you do not know where to start, these are metrics you should at least consider. Based on best practices found on various websites, blogs and books, there are a number of categories that you should take care of.

Category               Description
Latency                The time it takes to service a request
Traffic & Utilization  A measure of how much demand is being placed on your system
Errors                 The rate of requests that fail
Saturation             The part of the system which is most constrained
Utilization            The amount of resources that is used

Below you’ll find some example questions and metrics you can think of in every category.

Latency

  • How long does it take to serve a request?
  • What happens if you have 10 users / 20 users/ 2000 users etc.?
  • How many concurrent users can you serve before the system goes down or is impacted?
  • Are you serving HTTP 200 responses or error pages?
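
When answering how long it takes to serve a request, an average can hide slow outliers, so percentiles are often reported alongside it. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the `percentile` helper and the sample latencies are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 130, 99, 102]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"mean={mean_ms} ms, p50={p50} ms, p95={p95} ms")
```

In this sample the mean is 326 ms while the median is 102 ms, which shows how a single slow request distorts the average. Repeating the measurement at different user counts gives a first impression of how latency behaves under load.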

Traffic

  • How many requests are you serving per second per service?
  • How many requests are failing per second / minute?
  • What is the trend in request traffic?
  • What is the expectation? Do you exceed or underutilize?
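
A minimal sketch of the first two traffic questions, assuming you can export a request log as (service, failed) pairs for a known time window. The `traffic_summary` function, the service names and the counts are all hypothetical.

```python
from collections import Counter


def traffic_summary(log: list[tuple[str, bool]], window_s: float) -> dict[str, dict[str, float]]:
    """Requests and failures per second for each service over a window of window_s seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    totals = Counter(service for service, _failed in log)
    failures = Counter(service for service, failed in log if failed)
    return {
        service: {
            "requests_per_s": totals[service] / window_s,
            # A Counter returns 0 for services without failures.
            "failures_per_s": failures[service] / window_s,
        }
        for service in totals
    }


# Hypothetical 10-second window: 20 checkout requests (2 failed), 50 search requests.
request_log = [("checkout", i < 2) for i in range(20)] + [("search", False)] * 50
summary = traffic_summary(request_log, window_s=10)
print(summary)
```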

Utilization

Monitor compute resources such as:

  • CPUs: sockets, cores, hardware threads (virtual CPUs)
  • Memory: capacity
  • Network interfaces
  • Storage devices: I/O, capacity
  • Controllers: storage, network cards
  • Interconnects: CPUs, memory, I/O

Questions you should ask

  • What is the trend?
  • What is the average?
  • At what levels do errors or unexpected behavior start occurring?

Errors

  • The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”).
  • How many errors do you see?
  • What type of errors?
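
The three kinds of failure described above (explicit, implicit and by policy) can be told apart with a few checks per request. This is a sketch under assumptions: the `RequestResult` fields, the `classify` function and the one-second limit are illustrative, and the content check itself is left to your application.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    status: int        # HTTP status code returned
    content_ok: bool   # whether the response body passed a content check
    duration_s: float  # time taken to serve the request, in seconds


def classify(result: RequestResult, max_duration_s: float = 1.0) -> str:
    """Label a request as 'explicit', 'implicit' or 'policy' error, or 'ok'."""
    if result.status >= 500:
        return "explicit"   # e.g. an HTTP 500
    if not result.content_ok:
        return "implicit"   # e.g. an HTTP 200 with the wrong content
    if result.duration_s > max_duration_s:
        return "policy"     # slower than the committed response time
    return "ok"


def error_rate(results: list[RequestResult], max_duration_s: float = 1.0) -> float:
    """Share of requests that count as an error of any type."""
    if not results:
        return 0.0
    errors = sum(1 for r in results if classify(r, max_duration_s) != "ok")
    return errors / len(results)


# Hypothetical sample with one request of each kind.
results = [
    RequestResult(200, True, 0.2),
    RequestResult(500, False, 0.1),
    RequestResult(200, False, 0.3),
    RequestResult(200, True, 1.4),
]
labels = [classify(r) for r in results]
rate = error_rate(results)
print(labels, rate)
```

Counting the labels separately answers both how many errors you see and what type they are.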

Saturation

How “full” your service is.

  • CPU load
  • Memory load
  • Disk Load
  • Queues for the webserver
  • Limits (e.g. SQL Free has X GB of disk space)

Thresholds

It is great that you can measure a lot of things, but more important are the thresholds you put in place.

  • When should somebody act?
  • What should the action be?
  • Is the follow-up a manual action or automated?
  • Who is the owner/who gets called?
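
One way to keep the answers to these questions together is to record them next to the threshold itself. The sketch below is an assumption about how that could look, not a reference to any particular monitoring tool; the `Threshold` fields, metric names, limits and owners are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str      # name of the measured value
    limit: float     # act when the measurement exceeds this value
    action: str      # what should be done
    owner: str       # who gets called
    automated: bool  # True if the follow-up is automated, False if manual


def breached_thresholds(
    thresholds: list[Threshold], measurements: dict[str, float]
) -> list[Threshold]:
    """Return the thresholds whose metric was measured and exceeds its limit."""
    return [
        t
        for t in thresholds
        if t.metric in measurements and measurements[t.metric] > t.limit
    ]


# Hypothetical thresholds and current measurements.
thresholds = [
    Threshold("cpu_load_pct", 85.0, "scale out the web tier", "platform team", True),
    Threshold("disk_used_pct", 90.0, "clean up or extend the disk", "on-call engineer", False),
]
measurements = {"cpu_load_pct": 91.5, "disk_used_pct": 72.0}
breached = breached_thresholds(thresholds, measurements)
for t in breached:
    kind = "automated" if t.automated else "manual"
    print(f"{t.metric} above {t.limit}: {t.action} ({kind}, owner: {t.owner})")
```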