Note: This is a rewritten and extended version of a Wednesday Wisdom article that appeared on 3/8/2023 inside the walled garden of some intranet far, far away.
“People love toil and uphill battles”, said no one, ever. But, despite the fact that nobody ever said that, the situation on the ground in many companies might easily lead you to believe that this statement is, in fact, true.
In 2007, I was offered the opportunity to start a new team that was going to look after a service that was in dire straits. It was a popular service, very popular even, but its infrastructure was perennially on fire, with the pager ringing off the hook and people constantly fighting to keep the service up. At the same time, the software engineering team believed that the sources of their problems were external and that little could be done. “Oh, if only we had dedicated machines instead of having to share our machines with other services.” Or: “Oh, if only we had a private global file system (GFS) cell instead of having to store our data in a shared cell that is also used by other services.” Not being blessed with either dedicated machines or a private GFS cell (Urs wisely said “no” every time they asked), the team engaged in heroics instead to rescue the service from the brink of a meltdown, multiple times per week.
Some years later, I joined a super popular service where the weekly release and deployment was a right-regular “tour de force”. Every Tuesday, a motley crue of engineers would gather in a specially reserved meeting room and start the process of releasing and deploying the latest and greatest code. This involved kicking off a build and fixing any build problems. When they (finally) had a package, the new code would be deployed to a handful of staging machines that received user traffic. People would then scour the logs and dashboards for errors and other anomalies. When they found some (note: “when”, not “if”), they would track down the offending code, contact the software engineers who were to blame, and goad them into writing and merging a fix. Rinse and repeat, until such time as they had a version that was mostly fine. Every week, that release process ran well into the wee hours, with pizzas and beer shipped in to support the troops.
One service that I was involved with ran in many datacenters, but one datacenter was the designated “leader”, which was where all the “write” operations happened. There was no process for transferring leadership to another datacenter because that was never needed. Until, of course, it was. When Hurricane Sandy threatened to flood the leader datacenter, a heroic effort by the most knowledgeable engineers on the service transferred leadership to another datacenter and the service was saved. An equally heroic, but slightly better planned, effort transferred leadership back once the hurricane had passed. Both during the emergency leadership switch and during the switch-back, people worked through the night to make sure the site stayed up and running.
Many years and companies later, I joined a team that maintained a very complicated monolithic piece of software that sat at the center of the (user-visible) control plane of our main service. Each day, dozens of people merged changes into the codebase, so come release time it was impossible to predict whether the whole thing would work or not.
Fortunately, there were tests.
Unfortunately, the tests were mostly useless.
If the tests passed, that didn’t mean that the code worked; if the tests failed, that didn’t mean that the code was broken. To shepherd the release process we had a release czar rotation, and it was the release czar’s job to harass all the people who had changes in the upcoming release to either fix their tests or declare that the code was fine even though the tests failed. There was considerable reluctance on the side of the developers to fix the tests, and an equal amount of reluctance to vouch for the code in the absence of a passing test. It was expected that the release czar would harass everyone continuously to get the approvals in and to guide the release and deployment. It was a terrible job, a hero’s job.
Once, I joined the SRE team of a popular website that we had acquired. One of the components of the site was an API server that gave access to the internal data store. Every now and then, the API server would crash under an unreasonable number of requests. Whenever that happened, some ops hero would track down the root cause. More often than not, it would turn out to be some batch job that sent a flood of requests to the API server because, well, why not? The ops hero would locate and contact the owners of the batch job and get them to lower the request rate by either reducing the number of workers or stopping the job entirely.
The first time this happened during my oncall shift I asked people: “What is going on here? We have perfectly fine server-side rate limiting technology; why are we not using that?” The answer turned out to be a mix of ignorance, a perceived lack of time, and a lack of caring. People didn’t know this was possible, didn’t have time to do it, and didn’t care to make the changes required to link in the rate limiting libraries, figure out what reasonable rate limits would be for various types of clients, and set up the required configurations. Apparently it was easier to just deal with the outages than it was to put in the work to deal with this class of problems once and for all. Plus, with heroes oncall, who would need to do that? There was always some hero just a pager away to deal with anything.
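The story doesn’t depend on any particular rate limiting library, but for readers who have never linked one in, here is a minimal sketch of the general idea: a server-side token bucket per client type, with the limits coming from configuration. Everything in it (the client types, the numbers, the function names) is made up for illustration; a real service would key buckets per client identity and load the limits from its config system.

```python
import time

# Hypothetical per-client-type limits (requests per second); in a real service
# these would live in configuration, as described above.
RATE_LIMITS = {"interactive": 100.0, "batch": 5.0}

class TokenBucket:
    """Classic token bucket: refill continuously, spend one token per request."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate           # tokens added per second
        self.capacity = burst      # maximum burst size
        self.tokens = burst        # start full so well-behaved clients are not penalized
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per client type; a real server would key this per client identity.
buckets = {kind: TokenBucket(rate, burst=2 * rate) for kind, rate in RATE_LIMITS.items()}

def handle_request(client_type: str) -> int:
    """Return an HTTP-style status code: 429 when the client is over its budget."""
    bucket = buckets.get(client_type, buckets["batch"])  # unknown clients get the strictest limit
    return 200 if bucket.allow() else 429

if __name__ == "__main__":
    # A flood of "batch" requests quickly starts seeing 429s instead of crashing the server.
    print([handle_request("batch") for _ in range(20)])
```

The point is not the twenty-odd lines of code; it is that once limits like these are in place, a runaway batch job gets told to slow down instead of taking the API server with it, and nobody needs to be paged.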
Heroics are exciting and most people seem to be of the opinion that it is better to have an exciting job than a boring one. Better (apparently) to be the hero who regularly saves the day than to be the engineer who does a bunch of “boring” work, thereby solving problems before they blow up. Because: Where’s the fun in running complicated services if there are no car chases and explosions? We gotta have car chases and explosions, because otherwise how are you going to recognize the heroes?
The problem with heroism is that it is addictive. Heroes get praise. Heroes are valued. Heroes save the day. But, praise the heroes for their efforts and before you know it you have a bunch of self-important dopamine junkies who have become incapable of doing the slow and methodical work that does away with the need for heroics.
That this is not just my observation is apparent from a reaction to a recent Wednesday Wisdom article, where a reader complained: “There aren’t enough incentives [ed.] for folks to spend time to actually improve the code quality via proactive reviews. People rather become rockstars by fixing buggy code in production than catching it in credit-less code review.”
The problem with heroes is that heroics don’t scale; instead: Boring scales. Call me old-fashioned (or boring) if you will, but when I am in charge of running some service, I want my days to be uneventful. The last thing I want to do is to run a site that constantly needs a couple of wizards on-call to deal with things that are completely foreseeable but that the site’s designers didn’t take into account and have no plans for.
None of the problems that I described in the anecdotes in this article are beyond solid engineering. It might not be flashy, it might not be attention-drawing, but these are all solvable problems. Indeed, all of them were eventually solved by good old-fashioned engineering: slow, methodical, but solid work that significantly improved the services and everyone’s lives in the process.
To paraphrase Bertolt Brecht: Unhappy the team that needs heroes. The heroes might think they need to be there to deal with unforeseeable circumstances, but in fact a need for heroes usually indicates a badly designed and implemented system.