My first job was as a systems programmer on the IBM mainframes of a large Dutch bank. This was a golden era for system people, because security had not been invented yet and few people knew anything about computers. We could basically do whatever we wanted and nobody could question us. The only thing we cared about was that the nightly batch jobs that processed the bank’s transactions and printed statements had to finish before some cutoff time (I believe it was around 4am). The bank’s software engineers worked towards that same goal, but at the end of the day we (the system programmers) didn’t care much if the batch jobs failed or not, as long as it was not our fault.
The Internet has changed this situation drastically. The software engineers are still responsible for core business logic, but complex and complicated systems are now directly serving users and customers, and to make that all happen there is a lot more infrastructure involved. For a significant part, this infrastructure is directly in the path of serving user requests: Reverse proxies, load balancers, firewalls, rate limiters, edge caches, you name it. And so the system people, who are responsible for building, configuring, and running that infrastructure, are now directly involved in reliability as perceived by the user. If the bank’s nightly batch runs failed, that was highly annoying to us, but most bank customers did not notice unless they were waiting at their mailbox for a printed statement. In the Internet era, if your stuff is down, that makes the front page of the newspaper.
For the younger readers: Newspapers are a thing of the past. Imagine a set of printed substacks that get delivered to your house every morning by some 12-year-old.
The realization that system reliability was now critical to the user experience drove the creation of an entirely new sub-profession: The Site Reliability Engineer (SRE).
Google is, probably correctly, credited with the invention of modern Site Reliability Engineering. When I was asked to interview for an SRE role at Google in 2006, nobody really knew what an SRE was and the recruiters had to explain what an SRE did and what the profile of the people they wanted to hire for this job looked like.
They were mostly wrong, but it was a valiant attempt.
Almost twenty years later, SRE has become a mainstay of many big IT organizations and SRE principles have been applied to many different fields. At Google we had SRE teams for big services (like Maps, YouTube, and Google Flights, to name a few that I worked on), but we also applied SRE principles to backups, to repairing network hardware, and to running machine learning pipelines.
To be honest, when I joined SRE, we really had little theoretical or philosophical understanding of what we were doing. We knew our job was to make sure the systems were “up” (whatever that meant), and because of the scale of things, that meant we had to use automated solutions to do most of the heavy lifting, but we didn’t have a strong theoretical framework for what we did, why we did it, or how we did it. We just made it up as we went along. As we got more experienced, we also evolved a deeper understanding of what we were about, and over the decades a progression of insights led to successive refinements of the SRE model. As part of that evolution, we developed many (now standard) approaches for how to build and manage huge systems in a reliable way.
In the meantime, other companies thought it would be cool to have SREs too, and many traditional system administrators and operations team members got a free career upgrade by rebranding themselves as Site Reliability Engineers. By now, there are literally thousands of people calling themselves site reliability engineers, although, to be brutally honest, I come across “site reliability engineers” whose job descriptions look much more like that of a traditional system administrator or system operator.
Not that there is anything wrong with that…
Over the years, the leading SRE teams developed patterns for dealing with a wide variety of problems that could impact reliability. Examples of these patterns are canarying, hermetic configurations, and something called “N+2”. When used and implemented correctly, these patterns work incredibly well, and many of them were adopted throughout the industry. But, here is the point: Each of these patterns was designed for a particular class of systems that share certain attributes in their design.
It seems to me there is a whole generation of “SREs” who learned the patterns but who do not truly understand the principles. They will chastise you (or their LinkedIn audience) for not doing canarying, even though systems exist for which canarying is not a great idea (cue messages from this crowd telling me that I am an idiot and that you should always have canaries). They will tell you that developers should never have production access, that an application should not go into production unless a thorough Production Readiness Review has been done, or that the application needs to be owned in production by a separate SRE team.
To be sure, these are definitely good things to keep in mind for a lot of environments, but they are not the gospel. Every well-known SRE pattern was designed for a particular class of systems and sought to influence outcomes for a particular set of goals. Your system and your environment might be different, and you might have different goals.
A simple example: Depending on the lifecycle of your company, it might be more important that you build and launch the right thing quickly than that you build the thing right. If you are in a hyper-competitive space, there is a huge advantage to being the first mover, and for your business, velocity might be more important than the level of reliability that many SRE patterns help establish. It does not help anyone if you have a super reliable application that nobody uses because your competitor has all the users, even though their application needs constant care and feeding by their staff. Remember: Quality is fitness for use.
Here is a more complex example: AI models get trained in clusters of machines that collectively run the gradient descent algorithm.
This is a huge simplification, but bear with me for a bit…
The nodes constantly exchange data and they work in lock-step. Unfortunately, that also means that if one node fails mid-step, all nodes in the entire cluster need to restart that very same step. Now imagine you have a new version of the training binary that you want to try out. Your junior SRE told you that you should always canary your code, so you stop the entire cluster at the end of the current step, update one or more canary nodes, and restart the cluster. So far, so good.
Let’s assume the canary fails: the step crashes, or you figure out in some other way that the new binary is not working as expected. Simple! You revert the canary nodes and then restart the cluster from a suitable checkpoint. One or more steps were lost (depending on the nature of the failure and its detection), but not a lot of harm was done, as that’s the risk you take with new software.
Instead, imagine the canary succeeds. Once you figure that out, you have to stop the rest of the cluster at a suitable step, upgrade the remaining nodes, and start the cluster again at the next step. Great, canarying worked! But did it?
Let’s now imagine that we did not canary at all. Instead, we just upgraded all the nodes in the cluster with the new software. If the new software fails, you figure that out, revert all the nodes, and restart from some checkpoint. But if the new software works, you just keep the cluster stepping and there is no need for additional work! Upgrading all the nodes immediately, instead of selecting a few canary nodes, has the same effect as using canaries if the new software is faulty, but works out much better if the new software is fine, because you don’t have to stop the cluster again for the full rollout!
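If it helps to see the bookkeeping laid out, here is a minimal sketch in Python of the trade-off just described. The strategy names and the cost model (counting deliberate full-cluster stops and checkpoint rollbacks) are my own simplification for illustration, not anything taken from a real training stack:

```python
# A toy accounting of the trade-off described above. The strategy names and
# the cost model are invented for illustration; the point is only to compare
# how often the lock-step cluster gets interrupted under each rollout strategy.

def rollout_cost(strategy: str, new_binary_works: bool) -> dict:
    """Count deliberate full-cluster stops and checkpoint rollbacks."""
    stops = 1      # both strategies stop the cluster once to install the new binary
    rollbacks = 0
    if new_binary_works:
        if strategy == "canary":
            stops += 1   # canary looks good: stop again to upgrade the remaining nodes
    else:
        rollbacks += 1   # the failing step forces a revert plus a checkpoint restore
    return {"cluster_stops": stops, "checkpoint_rollbacks": rollbacks}


for works in (True, False):
    for strategy in ("canary", "all_at_once"):
        print(f"{strategy:12s} new binary works={works}: {rollout_cost(strategy, works)}")
```

Under this toy model the failure cases cost roughly the same either way, but the canary strategy pays for an extra full-cluster stop exactly when the new binary turns out to be fine, which is the outcome you are hoping for.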
As this (I will admit somewhat contrived) example shows, canaries as a pattern are not suitable for every possible system; it really depends on the situation. Examples like this abound in every system and in every organization. You really need to think deeply about what reliability means for the business and how you can make that happen.
Site reliability engineering is in essence a really simple proposition. All we do is ruthlessly drive a few numbers, using all the insights we have about how computers work, to make sure that we meet or improve these numbers. It is not mindlessly applying some patterns you learned in the past.
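To make “a few numbers” concrete: in many shops those numbers are service level objectives, and the arithmetic behind them is deliberately simple. A hedged sketch, with an availability target and request counts that are entirely made up:

```python
# A minimal sketch of the kind of number an SRE ruthlessly drives. The target
# and the request counts below are invented for illustration.

slo_target = 0.999                # say we promise that 99.9% of requests succeed
total_requests = 120_000_000      # requests served this quarter (made up)
failed_requests = 84_000          # requests that failed (made up)

availability = 1 - failed_requests / total_requests
error_budget = (1 - slo_target) * total_requests   # failures we are allowed to "spend"
budget_left = error_budget - failed_requests

print(f"availability:     {availability:.5f} (target {slo_target})")
print(f"error budget:     {error_budget:,.0f} failed requests")
print(f"budget remaining: {budget_left:,.0f} failed requests")
```

Everything else, canaries, readiness reviews, N+2, you name it, is just a tool that may or may not help you move numbers like these for your particular system.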