Throwing blood against metal

Manual work doesn't scale

Dec 27, 2023


In Europe, we look at American “fully automatic” cars with some disdain: Surely the joy of manually shifting gears and the opportunity this gives to optimize the car’s performance is vastly superior. And what about the opportunity it affords for a “push start”? Clearly this is the way to go!

A friend of mine used to drive a fully automatic BMW. When confronted about this, he replied that, as a software engineer, he was of the opinion that it was the machine’s responsibility to do as much as possible all by itself. Why manually manage the state of the gearbox when there is a perfectly workable solution that does this automatically? His insight permanently changed my mind on this topic and I am currently driving a Chevrolet Trailblazer without feeling the slightest affront to my manhood for not having to shift gears.

During one trip, while barreling down a Dutch freeway, my American wife, bless her heart, asked me why I was continuously messing around with that stick located between our seats 🙂

Many engineering teams also seem to subscribe to the approach of doing as much manual work as possible.

This is not great.

First of all: Manual work means mistakes. I once reviewed a postmortem where the resolution was delayed because, even though the fix was merged into the codebase, the CI/CD pipeline contained a manual step and apparently everybody had forgotten about it. Consequently, the release containing the fix languished in the pipeline until, a few weeks later, an engineer looked at the pipeline as part of the monthly release and “quickly” unblocked it.

Many things went wrong here, but I will call out two obvious ones that go to the topic of this article: 1) A manual step in an otherwise fully automatic CI/CD pipeline, and 2) No alert when a release was blocked in the manual step.
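To make the second point concrete, here is a minimal sketch of a check that flags releases that have been sitting in one pipeline stage for too long. Everything in it is an assumption for illustration: the `PendingRelease` record, the stage names, the release names, and the 24-hour threshold are hypothetical, and a real pipeline would feed this from its own API and send the result to a pager rather than print it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingRelease:
    """Hypothetical record of a release waiting in a pipeline stage."""
    name: str
    stage: str
    entered_stage_at: datetime


def find_stalled(releases, now, max_wait=timedelta(hours=24)):
    """Return the releases that have waited in their stage longer than max_wait."""
    return [r for r in releases if now - r.entered_stage_at > max_wait]


# Made-up example data: one release stuck for two weeks, one moving normally.
now = datetime(2023, 12, 27, 12, 0)
releases = [
    PendingRelease("fix-123", "manual-approval", now - timedelta(days=14)),
    PendingRelease("feature-9", "canary", now - timedelta(hours=2)),
]

stalled = find_stalled(releases, now)
for r in stalled:
    waited = now - r.entered_stage_at
    print(f"ALERT: {r.name} has been in stage '{r.stage}' for {waited}")
```

Run on a schedule, a check like this turns a forgotten manual step into a page after a day instead of a surprise after a few weeks.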

Second: Manual work only grows over time as the service increases in scale and complexity. I regularly come across teams with oncall shifts that are on fire. Quite often the expectation in engineering teams is that the oncall engineer spends their entire week attending to the service, looking at dashboards, and tracking the ticket queue, their inbox, the Slack channel, IRC, what have you. Seen in the light of the insight of my BMW-driving friend mentioned earlier, this is clearly insane. We are software engineers, for crying out loud. Our entire job is to automate things. The purpose of oncall should not be to manually service the machine; it should be to relax and play pool until the machine warns us that something really out of the ordinary has happened.

Whenever I come across yet another team with an oncall shift that is leaking flames left and right, I tell people that I used to be oncall for the core of YouTube and that this typically took about 25% of my time.

I’ll admit that this is a bit of an exaggeration, but not as much as you’d think. Sure, there were very busy weeks, during the Olympics for instance, but there were also weeks where I did almost nothing aside from keeping an eye on the release. And at least the goal was to do as little as possible. And we played a lot of pool.

The typical response of an engineering team where the oncall is on fire is to add a second oncall.

In one word: No!

By now I have lost count of the number of times I had to step in to prevent this madness. What is the end state here? Everybody oncall all of the time? Interestingly enough, the teams that are more than ready to add a second oncall often cannot find the time and priority to solve the pain points with engineering work. Apparently we do have time to engage in manual operations but not to make that manual work go away. I am not sure I understand the logic here, but it happens all the time.

A former colleague of mine calls this “throwing blood against metal”: using manual operations to close the gap between the service level the system can achieve and the service level we want to provide. It’s a terrible practice…

Many SRE teams suffer from the problem that other engineers think that they are the team that does all the manual operations for them. Nothing could be further from the truth. SRE teams should act as software engineering teams that automate away all operational tasks. Until we have achieved that, we might do some manual work to keep the service going, but we are not the ops department!

For instance: Whenever our release automation failed, people were routinely flabbergasted that I would not take over the release myself. “I think you misunderstand our jobs”, I used to reply. “We are not the people who do the release, we are the people who build the robot that does the release. When the robot breaks, we do not take over the robot’s work! Instead, we fix the robot.”

“But what about the release?”, people would ask. I would then point them to the service levels where I promised ten releases per quarter and two “no questions asked” emergency releases. I then would ask them if they wanted to use one of their two emergency releases to make up for the breakdown of the release robot. This got people thinking because now there was an actual cost to them for requesting the release.
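The bookkeeping behind such a service level is trivial to automate. The sketch below assumes the numbers from the paragraph above (ten regular releases and two emergency releases per quarter); the `ReleaseBudget` class and its `spend` method are hypothetical names made up for this illustration, not a description of any tool we actually used.

```python
from dataclasses import dataclass


@dataclass
class ReleaseBudget:
    """Releases still available to a client team this quarter (hypothetical)."""
    regular: int = 10
    emergency: int = 2

    def spend(self, emergency: bool = False) -> bool:
        """Deduct one release of the given kind; return False if none are left."""
        kind = "emergency" if emergency else "regular"
        left = getattr(self, kind)
        if left == 0:
            return False
        setattr(self, kind, left - 1)
        return True


# A team asks for three emergency releases in one quarter.
budget = ReleaseBudget()
granted = [budget.spend(emergency=True) for _ in range(3)]
print(granted, budget)
```

The third request is refused, which is exactly the moment where the conversation about cost starts.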

Always make sure that whatever people can request of you has a cost for them as well. Otherwise there is no barrier to them asking you, and demand is infinite…

There are many other problems with throwing blood against metal. The work is usually uninspiring and not fun. If you do a lot of it, it leads to stress and people leaving the team. It doesn’t scale. Solving the myriad causes of manual work might not be glamorous, but it beats getting paged to manually change a configuration file or delete old log files because the hard disk is full.
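Take the full hard disk: a page like that can usually be replaced by a few lines of code on a timer. What follows is a minimal sketch under stated assumptions: logs are plain `*.log` files in a single directory, age is judged by modification time, and a 14-day retention is acceptable. The `delete_old_logs` function is hypothetical, and in practice a standard tool such as logrotate is probably the better answer.

```python
import os
import tempfile
import time
from pathlib import Path


def delete_old_logs(log_dir, max_age_days=14, now=None):
    """Delete *.log files older than max_age_days; return the deleted names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    for path in sorted(Path(log_dir).glob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return deleted


# Demo in a throwaway directory so that nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "old.log"), Path(tmp, "new.log")
    old.write_text("stale")
    new.write_text("fresh")
    a_month_ago = time.time() - 30 * 86400
    os.utime(old, (a_month_ago, a_month_ago))  # backdate the old file

    removed = delete_old_logs(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(removed, remaining)
```

Scheduled daily, something like this makes the disk-full page go away for good, which is the whole point: fix the robot, not the symptom.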

Make this your new year’s resolution: No more blood against metal!