Because I worked as an SRE and SRE manager at a company that is widely seen to do SRE "right", I often get asked for advice on how to start an SRE team. This question is commonly asked by people who tried to start an SRE team but failed, usually because they got no traction in their organization. This regularly leads to self-doubt and prompts (more) reading of SRE books and SRE blogs to learn about the finer points of how to do SRE.
This is completely unnecessary.
I am pretty certain that if you want to start an SRE function in your company, you understand SRE well enough already. But, something else is going on: Your organization just is not ready yet…
The failure by well-intended engineers to start an SRE team typically comes about because the organizations where they are trying to do that are just not interested enough in reliability. This surely sounds strange, doesn’t it? Isn't reliability a good thing? Doesn't everyone want to have reliable systems? Isn't having a reliable system better than having a system with undefined or unknown reliability properties?
The answers to all of these questions are usually (but not always) a resounding "yes". But, for many executives, “reliability” is a word like "environment": Nobody is against it per se. Everyone is for the environment and everyone is for reliability, but actually suffering some inconvenience to bring more of it about is too much. Sure, if it were free, everyone would like to have a big helping of reliability please. But unfortunately reliability does not come for free, and it is rather inconvenient in other ways to boot.
Although, if you think reliability is too expensive and inconvenient, try unreliability for a while…
Starting an SRE function is hard because it requires you to make people care about the big picture in a rational way. Unfortunately, we are not good at that; in fact every election cycle is a showcase of how incredibly bad we are at this as a species.
The killer rationale (and hence ineffective argument) for an SRE function is that it is the quickest, cheapest and overall most effective way to reach a defined level of reliability.
Q: Is it though?
A: Yes!
Q: But aren’t SREs expensive?
A: Yes, but if you want to reach a particular goal, what can be cheaper and more effective than having a group of highly trained professionals working on it? They don't mess about, they know what they are doing, and they are not distracted by other responsibilities. The alternative is that you have a bunch of distracted amateurs on the case. That approach usually gets you a worse result and it will take longer. That's a dumb approach because in most cases you have to do the work anyway!
That is the killer argument for most organizations: You are going to have to engage in SRE work anyway. Even organizations with a suboptimal reliability posture want to have some sort of release process, alerting, regional safety, no single points of failure, stable configuration systems, load balancing, request throttling, and other good things. That’s what you want, that’s what you need, and that’s what you are going to be working on; the only thing you get to choose is the modality of working on it.
Most organizations need some level of reliability, but the people who are in a position to bring that about in a reasonable way (typically mid-level executives) are usually not incentivized to do the right thing. Instead, their incentives typically revolve around feature releases, and headcount assigned to SREs comes out of the overall engineering budget and hence delays feature releases…
Q: Or does it?
A: Not necessarily!
Q: Why not?
A: Because the fact that you have to do (some of that) work anyway means that it will instead be done by software engineers, and this translates into lower velocity for feature building. But I am getting ahead of myself.
Time for a side story: Once I worked for an organization where the senior director asked a group of his managers and me to weigh in on the question of how to do SRE: Should we have a few dedicated SREs per feature team or should we have a dedicated SRE team? Seems like a good question, doesn’t it? Actually it is not; it is a completely irrelevant question. I wrote back to him that it didn't matter. Instead, what matters is the organizational will and capacity to fence off a few engineers and have them work exclusively on reliability OKRs. If you can do that, then it doesn't matter at all how you do that. I've seen both models work and I've seen both models fail. But I have never seen any model work in an organization that fundamentally wasn't able to partition its headcount effectively towards the goal of improving reliability. So I told him that we were one of those organizations that are so obsessed with delivering on short-term feature work at the cost of nearly everything else that the question was irrelevant. Even if we hired SREs, we would be tasking them with feature work before the month was out. In this mode of working we would never be able to get SRE work done, regardless of the model.
This was not a popular answer.
But unfortunately it does go to the heart of the matter: If you cannot get the organization to start an SRE team that usually means that they are not interested in bringing about reliability in the cheapest and most effective way because they don't care about that. It’s not that they don’t care at all, it’s just that they care more about other things. In most organizations, the things people care about come from the top, because that is where the incentives come from. It is almost impossible to make people care about something from the bottom up, unless everyone votes with their feet, and that rarely happens.
Engineers complain a lot about having to engage in large amounts of suboptimal fire fighting, but they rarely quit over it. Some even (secretly) like it.
Another side story: At one organization I worked for, we cared a lot about security because we had had a large security incident and the CEO had to go to customers and apologize in person. He then said "never ever make me have to do that again at any cost" and hence we had a big and expensive security program, complete with headcount. Until reliability becomes that important to the CEO, nothing much is going to move.
Of course people pay lip service to reliability and so they will do something, anything, that will give them something to point at during reviews. This typically comes down to hiding the SRE work in the software engineering teams who will then by and large not do it as they are living under pressure to deliver features. Then when something does go wrong, there is highly ineffective fire fighting, followed by non-existent learning, and maybe a few small projects to plug the largest gaps, which will get deprioritized over time as the memory of the event fades.
These teams drift into failure: Whatever can go wrong usually doesn't and then everyone learns the wrong thing. Having a good SRE function means having a bunch of pessimists on staff who permanently worry about everything that could potentially go wrong, preventing that drift into the place where the bad things happen.
Hiding reliability work in the existing OKRs of the software engineering teams doesn't work. It gives worse results at more effort. It leads to slow progress and an abundance of fire fighting. Additionally, the work is typically not represented accurately in planning, which means that any time spent on reliability work interferes with feature work. This makes managers worry about "velocity": “How can we increase the speed of development?” Well, one thing you can do is to make sure that you accurately plan for the things that the engineers actually have to do and give any work that they are not good at and don’t really want to do, to a bunch of specialists.
Even if you have very low reliability goals, it is always more efficient to have it done by dedicated professionals. Even if it is only one person. Not too long ago I was in a team that had one SRE and he brought about tremendous improvements in our application's reliability and overall reliability posture. How did he do that? By making a table of reasons why our service failed in production, sorting that table by decreasing frequency, and then solving each problem in turn.
It really can be that simple.
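To show just how simple, here is a minimal sketch of that table in Python. The incident labels are made up for illustration and `rank_failure_causes` is a hypothetical helper name; in practice the list would come from your own postmortems or ticket system.

```python
from collections import Counter

# Hypothetical incident log: one root-cause label per production failure.
# These labels are invented for illustration only.
incidents = [
    "bad config push",
    "out of memory",
    "bad config push",
    "expired certificate",
    "out of memory",
    "bad config push",
]


def rank_failure_causes(causes):
    """Return (cause, count) pairs sorted by decreasing frequency."""
    return Counter(causes).most_common()


# Print the table; the top row is the problem to solve first.
for cause, count in rank_failure_causes(incidents):
    print(f"{count:3d}  {cause}")
```

Work the list from the top, re-run it after every fix, and the ranking tells you what to do next.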
In some cases you might find it hard to start an SRE team because the organization correctly identified that they need almost no reliability. It's rare, but it happens and it has to do with where you are in the company or service lifecycle. Early in the lifecycle, features will get you customers. Later on, reliability keeps your customers. Where are you? What do you need most? Having a super reliable service with no customers serves literally nobody, so it makes more sense to work on features. However, eventually you need some reliability to keep the customers you have. “Cost of switching” also plays into this. If your customers face a very high barrier to switching to a competitor, reliability is less important because even if your service is down frequently, where are your customers going to go?
Cable companies have figured this out; in a large part of the US they are a monopoly and it doesn't matter how bad they are because you have literally nowhere to go.
So if you are having a hard time starting an SRE team, first make sure you are not kicking a dead whale across the beach: Your organization might just not care enough about reliability, which is probably caused by the executives not being incentivized to provide it. People do whatever you pay them to do; if that is delivering features at all cost, that is what they will do.
So don’t doubt yourself. It’s not you, it's them!
It's also related to scale. The Google-style approach to SRE makes the amount of work of running a reliable service scale with service complexity, not service size. It's 10x as many different binaries that hurts, not 10x as many servers. Old-school manual system administration scales with the service size, so it's not so much the 10x as many different binaries that hurts, it's the 10x as many servers to run them on.
Companies in their early stages have high service complexity relative to scale, so burning dedicated headcount trying to automate how it's run is more effort than just running it manually. At some point you reach a scale vs complexity tipping point when SRE makes more sense.
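The tipping point can be illustrated with a toy model. Everything here is an assumption made for the sake of the sketch: the linear cost shapes, the function names, and especially the per-server and per-binary hour figures, which are invented and not measured anywhere.

```python
def weekly_effort_hours(servers, binaries,
                        hours_per_server=0.5, hours_per_binary=20.0):
    """Toy model with made-up coefficients.

    Manual administration is assumed to scale with the number of servers;
    the automation-heavy SRE approach is assumed to scale with the number
    of distinct binaries (service complexity).
    """
    manual = hours_per_server * servers
    automated = hours_per_binary * binaries
    return manual, automated


def sre_pays_off(servers, binaries, **coefficients):
    """True once manual effort exceeds the effort of the automated approach."""
    manual, automated = weekly_effort_hours(servers, binaries, **coefficients)
    return manual > automated


# Early stage: 10 binaries on 50 servers, so running it by hand is cheaper.
print(sre_pays_off(servers=50, binaries=10))
# Later: the same 10 binaries on 1000 servers, and automation wins.
print(sre_pays_off(servers=1000, binaries=10))
```

The exact numbers do not matter; the point is only that one cost grows with servers and the other with binaries, so at some ratio the lines cross.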