The loneliness of the long distance runbook
Runbooks are programs for execution engines of highly variable quality...
Note: This is a reprint of the first Wednesday Wisdom ever from the days before it was a public newsletter on Substack. It was originally published on 10/20/2021 on a virtual private network in a galaxy far, far, away. I am reposting this classic article now because I needed to refer to it from an internal-eyes-only document :-)
Many years ago, I was managing a group of Site Reliability Engineers in charge of running an (at that time) popular social network. Our internal documentation was a bit of a mess, so we got a tech writer in to help us patch it up. She started by shadowing the oncall rotation. After a week or two, during our first progress meeting, she spoke the classic words: “Jos, I have been looking over people’s shoulders for the past couple of weeks. Everybody is doing something different and nobody follows the runbooks.”
That sounds terrible, doesn’t it? So we looked into what the problem was. Were the runbooks an inaccurate mess that was not worth the hard disk sectors it was stored on? Were the engineers a band of raving anarchists who considered runbooks an unconstitutional limitation of their freedom as engineers? We researched the problem and found that although the runbooks were not entirely accurate, by and large they were just fine. <ominous tone> There was something else going on…
Most of the aberrant behavior was on display during the weekly release. Every week we would release a new version of the entire service by updating many thousands of machines with the latest and greatest software. Because of the complexity of the service, this was not a trivial process. The runbook was written to deal with every eventuality that we could think of: A region outage, dealing correctly with intra-service dependencies, rolling out during peak traffic, canarying our changes on a small subset of machines, you name it.
However, we found out that the engineers executing the procedure always had contextual information that allowed for shortcuts. For instance, they would know that this release did not contain changes to the static content, so that release step could be skipped. Or they knew that we were at a traffic trough in the diurnal cycle so they could increase the level of cluster parallelism. The engineers running the procedure reduced waiting times between steps, reordered steps for a more optimal running time, and used newer features of the underlying cloud system that hadn’t been available yet when the runbook was written. In other words: They adapted the procedure to their own custom knowledge, experience, skills, and goals.
There is obviously not any one single safe and correct release procedure for a service like this. On top of that, different engineers like to do things differently. They all understood what they were doing though and instead of following the letter of the runbook, they (mostly) followed its spirit instead. Engineers who were newer and less familiar with the system followed the runbook a bit more diligently, whereas experienced engineers went totally rogue and released quickly and safely in completely unorthodox and novel ways.
I am looking at you, frob!
In a subsequent life, I was involved with a large travel website. Airlines would regularly call us to say that they had made a mistake in creating the ticket pricing rules and could we please inject a hot patch into the system to correct that on the double while the real fix was grinding its way through proper channels. We had a runbook for that which included instructions on how to write a small snippet of LISP code and copy that into a file that we would inject into our pricing engine’s virtual machines.
Yes, the flight pricing engine was written in LISP :-)
The runbook contained instructions on how to copy and paste from a template, make the appropriate changes, and submit the final snippet into our version control system. People typically did that faithfully, but, as it turned out, often a bit too faithfully. The runbook was written in Google Docs and quotes in Google Docs are often “smart quotes”, which have different character codes than regular “dumb quotes”. Consequently, these smart quotes were not recognized as string delimiters by the LISP compiler. When that compiler, which was embedded in the virtual machine, saw the smart quotes, it completely gave up on compiling the patch and terminated the entire process instead. You can imagine the havoc that wreaked: The power of copy and paste combined with massively parallel code injection brought down a significant part of the fleet every time someone made that particular mistake.
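A check along the following lines could catch that mistake before a snippet ever reaches version control. This is a minimal Python sketch and not the tooling we actually had: the function name `find_smart_quotes` and the sample patch text are made up for illustration. It only reports typographic quotes and does not rewrite them, because silently changing characters in someone's code is its own kind of risk.

```python
# Typographic ("smart") quote characters that word processors tend to insert.
SMART_QUOTES = {
    "\u201c": "left double quotation mark",
    "\u201d": "right double quotation mark",
    "\u2018": "left single quotation mark",
    "\u2019": "right single quotation mark",
}


def find_smart_quotes(text):
    """Return a (line, column, character) tuple for every smart quote in text.

    Lines and columns are 1-based so they match what an editor shows.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SMART_QUOTES:
                hits.append((line_no, col, ch))
    return hits


if __name__ == "__main__":
    # Hypothetical patch snippet, pasted straight from a document editor.
    patch = "(set-fare \u201cAMS-JFK\u201d 499)"
    for line_no, col, ch in find_smart_quotes(patch):
        print(f"line {line_no}, column {col}: {SMART_QUOTES[ch]}")
```

Run as a pre-submit step, a non-empty result would be a reason to reject the snippet and send the author back to a plain-text editor.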
At this point I am awarding a free Mars bar to the first person to tell me that the configuration change process that injected this snippet of code should be canaried on a single machine before going out to the entire fleet.
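For those still working on their Mars bar answer, the idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: `rollout`, `apply_patch`, and `is_healthy` are hypothetical names, and a real health check would need to wait long enough for a crashing process to actually show up as unhealthy.

```python
def rollout(machines, apply_patch, is_healthy, canary_count=1):
    """Patch a few canary machines first; touch the rest only if they survive.

    apply_patch and is_healthy are callables supplied by the caller
    (hypothetical hooks into whatever deployment system is in use).
    """
    canaries = machines[:canary_count]
    rest = machines[canary_count:]

    for machine in canaries:
        apply_patch(machine)

    failed = [m for m in canaries if not is_healthy(m)]
    if failed:
        # Stop here: the blast radius is limited to the canaries.
        raise RuntimeError(f"canary failed on {failed}; fleet rollout aborted")

    for machine in rest:
        apply_patch(machine)
    return canaries, rest
```

With a snippet that kills the process, the canary fails its health check, the exception is raised, and the rest of the fleet never sees the patch.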
Runbooks are a sore point in our profession. People complain about them all the time: Either they are not there and then we don’t know what to do. Or they are there and then they don’t get followed. Or they do get followed but since they are imperfect they cause major mayhem.
Runbooks are programs for an imperfect execution engine of highly variable quality. People are generally very bad at reading and equally bad at following instructions to the letter. When interrupted, we lose context. When we have executed the same runbook a number of times, we think we know it already and stop following it (I myself caused an incident by doing exactly that only two months ago). We are lazy, and when the runbook contains commands that seem ready-made for copying and pasting verbatim, we will do so.
When things go wrong we blame the runbook. When it’s not there, we criticize its absence. When it’s there and contains errors, we criticize its imperfection. When it is there and perfect, we wonder why it’s not a shell script. Clearly, when you are a runbook, you cannot win.
This might well be an unsolvable problem. Writing runbooks is maybe akin to cleaning your house: A permanent chore that needs constant attention; you’re never done and it’s never done perfectly. Don’t do it and you live in a mess. Work on it regularly and you are still not happy with the combination of effort and result (but at least there are no dust bunnies floating around).
My preference is to treat runbooks not as a program for a human resource machine but as a piece of documentation on how the system works and a suggested approach for accomplishing some effect. Runbooks should definitely not give the impression that they can be executed by people without any knowledge of the underlying system. They should preferably not contain commands that are copy/pastable if these commands require editing before execution. Like your garden shed, they should be expected to be a mess and should be overhauled regularly.
In any environment that is not completely static, runbooks are always out of date. There is unfortunately nothing to be done about that; just live with it!
Boss tip: Document the date of the last overhaul at the top of the runbook as a visual indicator for the likelihood of its correctness.
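If the date is written in a fixed format, a small script can turn that visual indicator into a number. The following is a sketch under assumptions of my own: the header line `Last overhauled: YYYY-MM-DD` and the function name `days_since_overhaul` are invented for this example, not a convention from any particular tool.

```python
import re
from datetime import date

# Assumed header convention: a line of the form "Last overhauled: 2021-10-20".
HEADER = re.compile(r"^Last overhauled:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE)


def days_since_overhaul(runbook_text, today):
    """Return the age of the runbook in days, or None if it has no date header."""
    match = HEADER.search(runbook_text)
    if match is None:
        return None
    overhauled = date.fromisoformat(match.group(1))
    return (today - overhauled).days
```

A runbook that returns `None` has never admitted to being overhauled at all, which is arguably the more worrying answer.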
The less frequently the runbook is executed, the more likely it is to be incorrect. This is not only true for paper runbooks, but also for the scripts that might support or replace them. This is especially true if these scripts (as is commonly the case) are not tied into the main codebase and hence do not benefit from the various correctness checks that build and test procedures enforce. At one big tech company I worked for, we had a combination of runbooks and scripts to set up new clusters. We only executed this procedure a small handful of times per year, which meant that every time we wanted to use it, we had to first update the runbook and scripts to the latest and greatest changes in the environment.
Updating a runbook is a common action item resulting from a postmortem. However, if you didn’t analyze how the runbook got out of date in the first place and why its incorrectness went unnoticed by the person following it, you have not solved the real root cause and are likely to end up in the same postmortem in a few months.
But most importantly: Do not expect miracles from the poor runbook! It is really doing all it can already…
Was it really me? Funny, I don't remember this at all! And yes, runbooks are a blessing and a curse, as you say. I agree with your conclusion that they are helpful as introductions to the system overall, but definitely should not be used as copy-and-paste sources verbatim.