Toil blows.
I have been part of many teams that were drowning in repetitive manual work and nobody in these teams ever said that it was cool and that they liked doing it. Rather there are usually complaints all around and toil often features quite prominently in periodic “employee happiness” surveys. However, and this is the weird thing, people keep doing it!
The problem with eliminating toil is that it is real work and there is no other way to do it than to actually do it. A lot of teams resemble a lumberjack with a blunt ax who is trying to cut down a large forest: It’s a lot of work and it’s not going very fast, but she doesn’t have time to sharpen her ax, because there are all these trees to fell!
Something that everyone can do to reduce toil is to take a toilful process, make improvements where one can, and resist making it worse than it already is.
A while ago I was working in an organization where there was an abundance of toil. Oncall was typically on fire and there were a lot of regular processes that required manual wrangling. For instance the weekly release process contained the following steps:
Deploy the new binary to the staging environment.
Track down everyone who had new code in this release (there was a script for this).
Ask these people to verify that their code was working correctly.
Have them sign off on the release in a Wiki page.
It was the release manager’s job to harass the new code owners until they signed off.
The release manager also had to check every major team’s integration test runs in the staging environment and then follow up with those teams whose tests were failing until they had either:
Fixed their test (rare).
Fixed some code (very rare).
Declared that the test failures did not impact the release (very common).
As you can imagine, all of this was a never-ending hassle which typically required multiple follow-ups with the stragglers.
This, my friends, is toil.
Did I already tell you that I hate toil?
On my first shift as release manager I completely turned the tables: I ran the script to identify the new code owners and then pinged them to say that:
Their new code was now in the staging environment.
It would go to production on Thursday.
Unless they informed me that there was a showstopper.
I did something similar for the teams with failing tests. They received a message from me saying that their tests were failing on the new release but that without notification to the contrary I would assume that this was because the tests were bad, not because the release was bad.
So instead of taking on the responsibility to get a sign-off from every new code owner and from every team owning a failing test, I informed them (once) about their responsibility to tell me if their code didn’t work. I then rewrote the release documentation to enshrine this as the new policy and actively campaigned for everyone in my team to adopt it.
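To make the inversion concrete, here is a minimal sketch of the "inform once, silence means consent" policy. Everything in it is an assumption for illustration: the `Release` record, the owner names, and the message wording are hypothetical, and the real process used an existing script and chat pings rather than this code.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Release:
    version: str
    push_date: date
    # Owners with new code in this release (hypothetical stand-in for the
    # output of the "who has new code" script).
    code_owners: list[str]
    # Showstoppers reported by owners, keyed by owner; empty means silence.
    showstoppers: dict[str, str] = field(default_factory=dict)


def opt_out_notice(release: Release, owner: str) -> str:
    """Build the single message an owner receives; no sign-off is requested."""
    return (
        f"{owner}: your new code is in staging as part of {release.version}. "
        f"It goes to production on {release.push_date:%A %d %B} "
        "unless you tell me about a showstopper before then."
    )


def notify_once(release: Release) -> list[str]:
    """Return exactly one notice per owner (deduplicated, order preserved)."""
    unique_owners = list(dict.fromkeys(release.code_owners))
    return [opt_out_notice(release, owner) for owner in unique_owners]


def should_push(release: Release) -> bool:
    """Push unless somebody actively reported a showstopper."""
    return not release.showstoppers


# Hypothetical usage: "alice" has two changes but gets only one message.
release = Release("2024-w07", date(2024, 2, 15), ["alice", "bob", "alice"])
notices = notify_once(release)
push_without_objection = should_push(release)

release.showstoppers["bob"] = "login flow broken in staging"
push_with_objection = should_push(release)
```

The point of the design is that the release manager's work is bounded: one message per owner, and the burden of speaking up sits with the people who know the code.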
There was a lot of hesitation among my colleagues, who, despite complaining about “ops overload” all the time, were surprisingly hesitant to shed this particular toil: “What if there is a bug in the code and the code or test owner did not check it and we push it into production?” My answer: “Then we will roll back and tell the code/test owners that they are muppets and that they need to do a better job in the future.”
As this example shows, once added, toil is very hard to remove. This sad fact should really inform every attempt to add manual work to any process: You are probably signing up for this work until the heat death of the universe. Quite often a particular manual step (toil) was added because of an incident in the past. This creates the perception that removing that toil will increase risk. “Surely if we have more people actively looking at things and signing off, we have a lower risk of pushing a bad release?”
No, we do not. Or at least, not always…
As a software engineer I had been on the receiving end of the release manager pings to check my new code in the staging environment. More often than not, I did not actually validate my code but just signed off on the Wiki page in the confidence that my code worked; confidence that I had built up using unit tests and running tests in the pre-staging environment. And in the cases where there was some actual risk of things breaking, I would have tracked the release and validated my code before receiving the release manager’s pings.
Assuming for a moment that my behavior is typical, in most cases the release manager pings did not actually add any safety to the process; they were just useless toil; they were risk management theater.
Even if removing a manual process does add risk, the best answer is usually not to keep the manual process around, but instead to remove or automate it, keep track of how often the risk materializes, and slap a service level objective on that number. We are not in the business of eliminating all bad outcomes; we are playing a risky game and we just need to win it often enough.
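As a rough illustration of putting a service level objective on that number, the sketch below counts how many releases had to be rolled back and compares the rate to a target. The history and the 5% target are made-up numbers, and the function names are mine; pick whatever threshold your organization can actually live with.

```python
def bad_release_rate(rolled_back: list[bool]) -> float:
    """Fraction of releases that had to be rolled back (True = rolled back)."""
    if not rolled_back:
        return 0.0
    return sum(rolled_back) / len(rolled_back)


def within_slo(rolled_back: list[bool], max_bad_rate: float) -> bool:
    """True if the observed rollback rate is at or below the objective."""
    return bad_release_rate(rolled_back) <= max_bad_rate


# Hypothetical year of weekly releases: 52 pushes, 2 of them rolled back.
history = [False] * 50 + [True] * 2
rate = bad_release_rate(history)       # 2 / 52, a bit under 4%
ok_at_5_percent = within_slo(history, 0.05)
ok_at_2_percent = within_slo(history, 0.02)
```

If the rate stays within the objective, the removed manual step was not buying much safety. If it does not, that is the signal to invest in automation or better tests, not to bring the sign-off ritual back.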
Sometimes organizations just can’t get it together to make people available to engage in the work required to drive down toil.
I once got pulled into a discussion among senior managers and directors on how to organize our engineers for better reliability of the site. “Would it be better to build a separate SRE-like team for the entire organization or should we instead embed a bunch of engineers into every team with a mission to increase reliability for that team’s components and services?” My (unpopular) answer: “It doesn't really matter”.
What matters is whether you have the organizational capability to ring-fence a bunch of people and make them work on OKRs related to improving reliability and reducing toil. Only if you have that capability can we have a discussion on how to organize that, but we do need to have that other discussion first. Are we really able to dedicate some people to this problem? Or will they get pulled into feature work or the business priority du jour at the first bump in the road?
The only way to reduce toil and increase reliability is to actually do the work.
Sometimes the resistance against making the site more reliable comes from a very unlikely corner…
A few years ago I worked on a well-known Internet site with many rough edges when it came to reliability and toil. The site employed an ops team of engineers who were tasked with keeping the site up.
By the way: NEVER EVER have an ops team. Nomen est omen: The purpose of an ops team is to do ops. This automatically implies manual work. Have an engineering team instead, whose job it is to build things what do the work.
In one team I was on I used to tell people: “We are not the people what do the release. We are the people what build the robot what does the release. If the robot fails, we repair it, we do not then actually do the release manually.”
They did a marvelous job. All these people knew the site inside and out and had amazing foo with the operating system, the programming language, and the tools required to quickly build and push out fixes. There were lots of incidents and the engineers in the ops team regularly saved the day. They were heroes.
However, to paraphrase Bertolt Brecht: “Unhappy the team that is in need of heroes.”
One problem with heroics is that it makes the heroes feel good and they will actively resist any changes that will call for fewer heroics and which might “reduce” them to desk jockeys. Saving the day gives a tremendous short-term satisfaction; much more than just sitting at your desk and doing the tedious work of improving the service, implementing automation, or getting rid of spammy alerts.
Ops heroes are adrenaline junkies.
One of the things that regularly happened at that site was that one of our core components would get a flood of RPCs from a batch job and croak under the weight, leading to a site-wide outage; we would effectively be DDoSing ourselves. The oncall hero would use their amazing foo to track down the source of the traffic surge, shut it down (using their site-wide administrator privileges), and then chastise the writers of the offending code.
Hero came to save the day, total disaster averted. Everybody happy.
These heroics were of course completely unnecessary. The larger organization that owned the site had a pretty decent global rate limiting solution which was not difficult to integrate with a service. Doing that was a bit of a hassle because it meant writing some code, setting up the global configuration, figuring out limits, and writing some dashboards and alerts. None of this was difficult, but it was actual work. To top it off, some of these steps required getting code reviews from a difficult-to-work-with central team that nobody ever really wanted to talk to. It was not shiny work, a bit tedious even, definitely not heroic, but once it was done, the problem went away definitively.
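For readers who have never wired up rate limiting, here is a minimal sketch of the idea using a per-client token bucket. This is not the global rate limiting solution mentioned above, whose API I am not describing here; it is a single-process stand-in, and the class names, client names, and limits are all hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """One bucket per calling client, so a noisy batch job only hurts itself."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, client: str) -> bool:
        bucket = self.buckets.get(client)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client] = bucket
        return bucket.allow()


# Hypothetical usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
limiter = PerClientLimiter(rate=2.0, capacity=3.0, clock=lambda: fake_now[0])

batch_results = [limiter.allow("batch-job") for _ in range(5)]
frontend_ok = limiter.allow("web-frontend")   # unaffected by the batch flood

fake_now[0] = 1.0                             # one second later: 2 tokens refilled
batch_after_refill = [limiter.allow("batch-job") for _ in range(3)]
```

With something like this in front of the core component, the batch job gets rejected requests instead of the whole site getting an outage, and nobody needs administrator privileges or amazing foo at 3 AM.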
We need to take a cue from English 1970s punk band The Stranglers: No More Heroes.
The Stranglers are one of only two bands whose concerts left me with some hearing loss for a day or so. I now have acoustically correct ear plugs that knock about 25 dB off the entire spectrum, linearly.
So to eliminate toil you have to engage in meaningful projects that eliminate toil. There is no other way. And do not put your faith in heroes.
100% agreed with the post.
> Rather there are usually complaints all around and toil often features quite prominently in periodic “employee happiness” surveys.
For me, this is a pet peeve. Quit complaining and get to work fixing. I find most people are happy to complain and few will grab a shovel and do the work, and it's endlessly frustrating. Maybe it's a signal of seniority though? (Folks that have been around the block just roll up their sleeves and get to work? Not interested in title seniority, just folks that have figured out how to deal with this stuff.)
Have you found "release manager" to be common as organizations scale? I despise the practice and prefer an ASAP CD model, but perhaps it's inevitable once team size grows?
> As this example shows, once added, toil is very hard to remove.
IMO, it's the responsibility of the leaders (no title necessary) to make an active effort to remove this, maybe by force. Even those capable may not feel the ownership to do so, but at some point, that ownership either needs to be a) delegated by a leader or b) taken by someone who dares.
Another case of "Culture eats strategy for breakfast". Unfortunately, I re-learned these things too late.