In 2024 I created the LEAD podcast with my good friend Geert van der Cruijsen. In this podcast we explore the various aspects of building an engineering culture. We made quite a few episodes, with and without guests, and I thought it would be a great idea to share some of these stories, combined with my own insights, on this blog. Not all the credit goes to me. It goes to Geert as well, and to our guests. And of course to Xebia, the company I work for, for making this possible.
In one of our episodes, Geert and I talked about something that’s been bothering us for a while. We see more and more companies hiring SREs. But when you look closely, they often just mean “operations” people. It feels like another hype term, like microservices, agile, or scrum, being slapped on job titles without a real understanding of what it means.
So we decided to go back to basics. What is an SRE really? Where did the role come from? And how does it relate to DevOps?
The original idea behind SRE
The term SRE stands for Site Reliability Engineer and was coined by Benjamin Treynor Sloss at Google in the early 2000s. Back then, Google was growing fast and traditional operations couldn’t keep up. They needed something different. Not just people who reboot servers or fix tickets, but engineers. Real engineers who could write code, understand systems, and optimize for scale, performance, and availability.
In that sense, an SRE is not just another ops person. It’s someone who engineers reliability into your systems. Someone who brings the same level of software engineering discipline to operations work.
DevOps vs. SRE: conflict or complement?
Geert and I have always liked the DevOps mantra: you build it, you run it. It makes teams more accountable, more aware of quality, and forces them to think about security, reliability, and infrastructure as part of building software.
But let’s be honest. As systems grow more complex and scale to millions of users, expecting every DevOps team to have deep reliability expertise is a stretch. That’s where an SRE can add real value. Not as a replacement for DevOps, but as a supporting role. Sometimes embedded in teams, sometimes working across teams to solve broader reliability challenges.
Google even uses a model where teams can “earn” an SRE. You need to show that your service is mature, valuable, and built to a certain standard before an SRE will work with you. And they don’t just take over your incident queue. At least half of their time is reserved for engineering project work: eliminating toil, redesigning systems, and making reliability part of the architecture.
The key: appropriate reliability
One thing we both stressed is the word “appropriate.” Too many companies aim for 99.999% uptime without asking if it’s actually needed. We’ve seen teams gold-plate systems that don’t need it, or apply a single SLA number across the entire stack without thinking it through.
In one example, a company had software used by workers replacing windshields. Each device held about four hours of work, so the workers could simply carry on if the backend went down for an hour or two. It didn’t matter. But the company still wanted 99.9% uptime across the board. That’s just unnecessary.
So before you think about hiring SREs or defining SLAs, ask yourself: what level of reliability is really appropriate? And for which parts of the system?
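To make "appropriate" a bit more concrete, it can help to translate an uptime percentage into actual minutes of allowed downtime. The snippet below is a minimal sketch in Python. The helper name `allowed_downtime_minutes` and the 30-day window are assumptions for illustration, not something from the episode.

```python
def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits over a period.

    Hypothetical helper: the 30-day default window is an assumption.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


# Compare a few common targets over a 30-day window.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Over 30 days, 99.9% leaves roughly 43 minutes of downtime, while 99.999% leaves about 26 seconds. Seeing the numbers side by side makes it easier to ask which parts of the system really need the stricter target.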
SLAs, SLOs, and SLIs: the real value
This brought us to a core part of SRE thinking: the difference between SLAs, SLOs, and SLIs.
- SLA (Service Level Agreement) is the external contract, often with legal or financial implications.
- SLO (Service Level Objective) is the internal goal for reliability, agreed between the teams and the business.
- SLI (Service Level Indicator) is what you actually measure, like response time or success rate.
The magic is in combining these. SLOs give you clear, actionable targets, and they let you create error budgets: the amount of unreliability the objective still allows (with a 99.9% SLO, that is the remaining 0.1%). Error budgets help balance reliability and innovation. If your service performs better than the objective, you can use the remaining budget to experiment. If it drops below the objective, you pause new features and fix reliability first.
This turns reliability into a shared responsibility, not a constant battle between “ops” and “business.”
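As a rough illustration of how an SLI, an SLO, and an error budget fit together, here is a small Python sketch. It assumes a request-based SLI (the share of successful requests in a window). The `ErrorBudget` class, its field names, and the example numbers are all hypothetical, and a real setup would normally take these counts from a monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model of a request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that did not meet the SLI criteria

    @property
    def sli(self) -> float:
        """Measured success rate (the SLI)."""
        if self.total_requests == 0:
            return 1.0  # assumption: no traffic counts as fully reliable
        return (self.total_requests - self.failed_requests) / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """How many failures the SLO tolerates in this window."""
        return self.total_requests * (1 - self.slo_target)

    @property
    def remaining(self) -> float:
        """Budget left; zero or negative means the budget is spent."""
        return self.allowed_failures - self.failed_requests

    def decision(self) -> str:
        """Simplified policy from the text: experiment while budget remains."""
        if self.remaining > 0:
            return "budget left: room to ship and experiment"
        return "budget spent: pause features and fix reliability"


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
burned = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

for budget in (healthy, burned):
    print(f"SLI={budget.sli:.4%} remaining={budget.remaining:.0f} -> {budget.decision()}")
```

The policy here is deliberately binary. In practice teams often agree on softer steps, such as slowing down releases as the budget runs low, but the idea is the same: the number drives the conversation, not opinions.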
So, should you hire an SRE?
Maybe. But only if you understand what you’re looking for. Don’t just rename your operations team and expect magic. An SRE should be an engineer, not a ticket closer. And you don’t need an SRE team to apply SRE practices. Any mature DevOps team can start using SLIs and SLOs to create alignment and clarity.
So next time you see an SRE vacancy, ask: do we really want an SRE, or are we just looking for ops with a cooler title?
That’s what we talked about in this episode. And honestly, I think it’s one of the more important conversations we’ve had. Because how we talk about roles affects how we hire, how we structure teams, and how we build reliable systems.
Let’s not turn SRE into another buzzword. Let’s use it for what it was meant to be: engineering reliability into our systems in a smart, intentional way.
The original episode
If you want to hear the full conversation, you can listen to the original episode of the LEAD podcast.



