Where to start with DevOps Metrics?

September 5, 2018 by Rene van Osnabrugge

Now that more and more teams and organizations are moving towards a DevOps way of working, I get asked the question “What should we measure?” a lot. To be very honest, I find this a very hard question. The main reason is that metrics are always a point of discussion and a trigger for behavior change, and not always for the better.

But on the other hand, it is an understandable question and as long as you realize that metrics influence behavior and that DevOps is about learning and improving, I believe that metrics are a powerful mechanism to achieve this.

In this post I’ll talk about monitoring and metrics. Becoming an autonomous DevOps team also means that you, as a team, are responsible for both building and running the application. Building an application and running one are two completely different things, and they require different metrics. This post gives some guidance that is not, and never will be, complete; it is merely meant as a starting point. DevOps is all about continuous learning and implementing the right things to make your application successful. Therefore there is no standard recipe for success, nor a complete list of metric requirements that you have to fulfill. You need to gain insight into your application, users, usage and behavior, and use that as new input to make the product better.

Primarily, you can split the work of a DevOps team into two parts: building the product and running the product. To structure this article, I split it into a Build and a Run section, each discussing several metrics and questions.

Metrics when building software

What are you monitoring during the build process of your application? And what things should you take into consideration?

This article gives a nice overview of what needs to be done in order to create a product with good quality. Looking at these dimensions, you can deduce some questions that are easy to ask but hard to answer.

Software has to be deployable and it has to satisfy the minimum functionality
- How often are you deploying?
- Has it deployed successfully?
- Are tests successful?
- Can users work with the product? What are the drop-off rates?
Software needs to be performant and secure
- Test results from load/stress/penetration testing
- Credential scans
- Basic Security
Software needs to be easily usable
- Duration of sessions
- Incident count
- Support Questions
- Drop-off rate
- Interaction sequences
Software needs to be useful.
- Are features used?
- Investment. ROI?
- How often are features used?
- Is it worth the investment?
Software needs to be successful
- Cost vs. Revenue
- Impact Mapping
- Feature Injection
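The first group of questions above (how often are you deploying, and did it deploy successfully?) is a good place to start, because the numbers are easy to derive from your release history. Below is a minimal sketch, assuming you can export deployments as a date plus a success flag. The `Deployment` type and the sample `history` are hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date        # when the deployment ran
    succeeded: bool  # True when it finished without a failure or rollback


def deployments_per_week(deployments: list[Deployment]) -> float:
    """Average number of deployments per 7-day period in the observed span."""
    if not deployments:
        return 0.0
    days = [d.day for d in deployments]
    span_days = (max(days) - min(days)).days + 1  # inclusive span
    return len(deployments) / (span_days / 7)


def success_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that succeeded (0.0 when there are none)."""
    if not deployments:
        return 0.0
    return sum(d.succeeded for d in deployments) / len(deployments)


# Hypothetical sample data covering two weeks.
history = [
    Deployment(date(2018, 9, 3), True),
    Deployment(date(2018, 9, 5), False),
    Deployment(date(2018, 9, 10), True),
    Deployment(date(2018, 9, 16), True),
]

print(deployments_per_week(history))  # 2.0 deployments per week
print(success_rate(history))          # 0.75
```

The absolute numbers matter less than how they move over time, so it is worth tracking them per sprint or per month.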

Questions for the Product Owner

Do you have a clear description of the product you own?
What is the vision of this product?
Who is the customer?
How do you measure success?
- e.g. number of users
- increased usage over time
- no outages
- NPS scores?
When is the product successful?
What is your feature timeline (6 months)?
How often is each feature being used?
What feature sets tend to be used by the same people?
Which features are your engaged users using most?
Where do users get stuck and abandon the product?
How long are users spending on each feature?
Who abandons it and who keeps using it?

Questions regarding progress

How long does it take to fix a bug (MTTR)?
- Are we getting better at fixing bugs and releasing fixes?
How are you progressing on the feature timeline?
- Are features well defined?
- Is there focus on what should be done?
- Are features actually getting closed?
- Do we deliver on promises?
How often do you deliver a feature?
- Are we deploying features?
How is your sprint progress? Velocity?
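To make the MTTR question concrete: if you know when each bug was reported and when its fix was released, the mean time to repair is simply the average of the differences. The sketch below assumes you can get those two timestamps per bug out of your work item tracking system. The sample sprints are made up.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(bugs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time between a bug being reported and its fix being released.

    Each entry is a (reported, fixed) pair.
    """
    if not bugs:
        raise ValueError("need at least one resolved bug")
    total = sum((fixed - reported for reported, fixed in bugs), timedelta())
    return total / len(bugs)


# Hypothetical data for two consecutive sprints.
last_sprint = [
    (datetime(2018, 8, 1, 9), datetime(2018, 8, 3, 9)),   # 2 days
    (datetime(2018, 8, 6, 9), datetime(2018, 8, 10, 9)),  # 4 days
]
this_sprint = [
    (datetime(2018, 8, 20, 9), datetime(2018, 8, 21, 9)),  # 1 day
    (datetime(2018, 8, 22, 9), datetime(2018, 8, 23, 9)),  # 1 day
]

print(mean_time_to_repair(last_sprint))  # 3 days, 0:00:00
print(mean_time_to_repair(this_sprint))  # 1 day, 0:00:00
```

Comparing the value per sprint answers the follow-up question of whether you are getting better at fixing bugs and releasing fixes.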

Questions for development teams

What is the trend of technical debt?
What is the trend of Unit Tests/Coverage/Lines of code?
Build breaks? Release breaks?
Activity on the project?

Metrics for Running software

The most important thing to remember is that every decision you make based on these metrics should be grounded in an SLA (service level agreement) or NFR (non-functional requirement). If you need to guarantee 100% uptime, you need different thresholds and different metrics than when some downtime is acceptable.

Basics to cover

Is my product/service up?
Availability check, ping request etc.
Health, Performance, Usage and Costs
What are the thresholds you agreed on with your customer/stakeholder?
- Response times
- Resolution time
- Max Load
Do people use the platform? How many? Is that on target?
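As an illustration of how the "is my product up?" check relates to the agreed thresholds, here is a small sketch. It assumes you record the outcome of every availability check as a boolean and that your SLA states an availability target such as 99.9%. Both function names are my own and not taken from any monitoring product.

```python
def availability(check_results: list[bool]) -> float:
    """Fraction of availability checks (pings, health probes) that succeeded."""
    if not check_results:
        raise ValueError("no checks recorded")
    return sum(check_results) / len(check_results)


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1 - target) * period_days * 24 * 60


# Hypothetical month: 1000 checks, one of which failed.
checks = [True] * 999 + [False]
sla_target = 0.999  # assumed SLA of 99.9% availability

print(availability(checks) >= sla_target)    # True: the target is just met
print(allowed_downtime_minutes(sla_target))  # roughly 43 minutes per 30 days
```

Translating the percentage into minutes makes the conversation with your stakeholder a lot easier: "99.9%" sounds abstract, while "about 43 minutes of downtime per month" is something people can reason about.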

Again, this is not a complete list of metrics that you should implement, but if you do not know where to start, these are metrics you should at least consider. Based on best practices found on different websites, blogs and books, there are a number of categories that you should take care of.

Category	Description
Latency	The time it takes to service a request
Traffic	A measure of how much demand is being placed on your system
Errors	The rate of requests that fail
Saturation	The part of the system which is most constrained
Utilization	The amount of resources that is used

Below you’ll find some example questions and metrics you can think of in every category.

Latency

How long does it take to serve a request?
What happens if you have 10 users / 20 users/ 2000 users etc.?
How many concurrent users can you serve before the system goes down or is impacted?
Are you serving HTTP 200 responses or error pages?
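When you answer "how long does it take to serve a request?", be careful with averages: a few very slow requests can hide behind a reasonable-looking mean, or distort it. Looking at percentiles gives a better picture. Below is a minimal sketch using the nearest-rank method. The latency samples are invented for the example.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; one request was very slow.
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 130, 99, 115]

print(mean(latencies_ms))            # 327.3
print(percentile(latencies_ms, 50))  # 105
print(percentile(latencies_ms, 95))  # 2300
```

Here the typical request takes about 105 ms, while the mean suggests more than 300 ms and the 95th percentile exposes the outlier. Running the same calculation under 10, 20 and 2000 users shows where the latency starts to degrade.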

Traffic

How many requests are you serving per second per service?
How many requests are failing per second / minute?
What is the trend in request traffic?
What is the expectation? Do you exceed or underutilize?

Utilization

Monitor compute resources such as:

CPUs: sockets, cores, hardware threads (virtual CPUs)
Memory: capacity
Network interfaces
Storage devices: I/O, capacity
Controllers: storage, network cards
Interconnects: CPUs, memory, I/O

Questions you should ask

What is the trend?

What is the average?

At what levels do errors or unexpected behavior start occurring?
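The first two questions (what is the trend, what is the average?) can be answered with very little math. The sketch below assumes you have a series of utilization samples taken at a fixed interval, and it fits a straight line through them with least squares. The CPU numbers are hypothetical.

```python
from statistics import mean


def trend_per_sample(values: list[float]) -> float:
    """Least-squares slope of the values against their sample index.

    A positive result means utilization is growing from sample to sample.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical CPU utilization in percent, one sample per day.
cpu_percent = [40.0, 42.0, 44.0, 46.0, 48.0]

print(mean(cpu_percent))              # 44.0
print(trend_per_sample(cpu_percent))  # 2.0 percentage points per day
```

An average of 44% looks comfortable, but a steady growth of 2 percentage points per day tells you that you have only a few weeks before the resource is fully used.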

Errors

The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”).
How many errors do you see?
What type of errors?
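The three kinds of failure described above (explicit, implicit and by policy) can be told apart with a few simple checks. This is a sketch under my own assumptions: the `Response` type, the `content_ok` flag (the result of whatever content check you run) and the one-second default are illustrative and not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int        # HTTP status code
    content_ok: bool   # True when the body passed your content check
    duration_s: float  # time taken to serve the request, in seconds


def classify(response: Response, max_duration_s: float = 1.0) -> str:
    """Return 'explicit', 'implicit', 'policy' or 'ok' for one response."""
    if response.status >= 500:
        return "explicit"   # the server reported the failure itself
    if not response.content_ok:
        return "implicit"   # success status, but the wrong content
    if response.duration_s > max_duration_s:
        return "policy"     # correct, but slower than what was promised
    return "ok"


def error_rate(responses: list[Response], max_duration_s: float = 1.0) -> float:
    """Fraction of responses that count as an error of any type."""
    if not responses:
        return 0.0
    failed = sum(classify(r, max_duration_s) != "ok" for r in responses)
    return failed / len(responses)


# Hypothetical responses, one of each kind.
sample = [
    Response(200, True, 0.2),   # ok
    Response(500, True, 0.1),   # explicit
    Response(200, False, 0.3),  # implicit
    Response(200, True, 1.5),   # policy
]

print([classify(r) for r in sample])  # ['ok', 'explicit', 'implicit', 'policy']
print(error_rate(sample))             # 0.75
```

Counting only the HTTP 500s would report an error rate of 25% here, while from the user's point of view three out of four requests went wrong.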

Saturation

How “full” your service is.

CPU load
Memory load
Disk Load
Queues for the webserver
Limits (e.g. SQL Free has X GB of disk space)

Thresholds

It is great that you can measure a lot of things, but the thresholds you put in place are more important.

When should somebody act?
What should the action be?
Is the follow-up a manual action or is it automated?
Who is the owner/who gets called?
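One way to force yourself to answer these four questions is to write every threshold down together with its action and owner. The sketch below is only an illustration: the `ThresholdRule` type, the metric names and the team names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    metric: str      # name of the measurement the rule applies to
    limit: float     # somebody should act when the value exceeds this
    action: str      # what the action should be
    owner: str       # who owns it / who gets called
    automated: bool  # True when the follow-up is automated


def breached(rules: list[ThresholdRule],
             measurements: dict[str, float]) -> list[ThresholdRule]:
    """Return the rules whose metric is present and above its limit."""
    return [
        rule for rule in rules
        if rule.metric in measurements and measurements[rule.metric] > rule.limit
    ]


# Hypothetical rules and current measurements.
rules = [
    ThresholdRule("cpu_percent", 80.0, "scale out web tier", "platform team", True),
    ThresholdRule("disk_used_percent", 90.0, "clean up old logs", "ops on-call", False),
    ThresholdRule("p95_latency_ms", 1000.0, "investigate slow requests", "dev team", False),
]
current = {"cpu_percent": 85.0, "disk_used_percent": 60.0, "p95_latency_ms": 1200.0}

for rule in breached(rules, current):
    kind = "automated" if rule.automated else "manual"
    print(f"{rule.metric}: {rule.action} ({kind}, owner: {rule.owner})")
```

If you cannot fill in the action or the owner for a rule, that is usually a sign that the metric is not worth alerting on yet.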
