An old friend, who is a bit more direct and outspoken than I am, taught me that every time you say "of course", you also need to say "again", and the other way round. So don't just say: "The bus was late again this morning", but instead say: "Of course, the bus was late again this morning."
The pair "Of course, again" serves to indicate that something went wrong for the umpteenth time and, all things considered, that could really be expected and will happen again. It is an expression that serves to show the overall importance of the law of conservation of misery. "Of course, again" also serves to document the fact that we are really slow learners and are liable to make the same mistake over and over again, even when there is plenty of evidence that doing or not doing the thing is known to lead to bad outcomes.
Here is an "of course, again" item: Launching on the day before a (long) weekend, such as on a Friday, on the Thursday before Good Friday, or on the Wednesday before Thanksgiving. We have known for decades now that this is a bad idea for all of the obvious reasons, but we keep doing it.
Q: "Why did nobody see that we had a 100% error rate for three days?"
A: "Because, of course, we launched on a Friday again."
I am in multiple postmortems a year where launching on a day before a (long) weekend is the most important reason why a particular alarm or error report went unnoticed for days.
Why do we keep doing that? We know that it is a bad idea. We have ample experience that it is a bad idea. We understand that it is a bad idea. Many people will tell you that it is a bad idea. But we keep doing it. Why?
Here is another one that keeps coming back: Lack of obvious alarms. During a recent postmortem, one of my non-technical executives said: "I don't understand anything of what goes on here, but I will notice that in all of the tech postmortems I see, one consistent action item seems to be that we need more alarms."
I then explained "of course, again" to him.
Whenever you write a software service you write two things: The thing itself and the operational system around it. We are already terrible at writing the actual software, but the automated management system around it is usually the pits. I know it is hard to develop a good set of alarms, but that is why we hire the smartest people on the planet. Not spending time on that also shows a lack of insight into where your system spends most of its time. It might have taken you months to build, but it will have to run mostly unattended for years. That's a problem that is almost as big as writing the service in the first place. If you do not budget at least 10% of the development time for creating alarms and other features of the automated management system, you are doing it wrong. We know this. But then, in more postmortems than I care to remember, a 100% error rate of course went unnoticed again because we missed an alarm for it.
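To make the "obvious alarm" concrete, here is a minimal sketch of an error-rate check. Everything in it is an assumption for illustration: the `Window` shape, the `should_alarm` name, and the threshold values are hypothetical and not taken from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts seen during one evaluation window (hypothetical shape)."""
    total: int
    errors: int


def should_alarm(window: Window, threshold: float = 0.05, min_requests: int = 20) -> bool:
    """Return True when the error rate in the window exceeds the threshold.

    The default threshold and minimum request count are illustrative
    assumptions, not recommendations.
    """
    if window.total < min_requests:
        # Too little traffic to judge an error rate; a real setup would
        # probably want a separate "no traffic" alarm for this case.
        return False
    return window.errors / window.total > threshold


# A window in which every request failed trips the alarm.
print(should_alarm(Window(total=100, errors=100)))
```

The check itself is trivial, which is rather the point: the hard part is remembering to write it, wire it to a pager, and pick thresholds that match how the service actually behaves.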
A lot of "of course, again" items stem from time pressure, sloppiness, and the fact that most things that can go wrong, usually don't. Unfortunately, when something that can go wrong doesn’t, people learn the wrong lesson.
We launch often on (virtual) Fridays and mostly that does not lead to problems. If it did, we would stop doing that pretty quickly. But it doesn't and that plays right into our inability to estimate probabilities well and act on them. We make terrible decisions all the time, but mostly nothing bad happens, which means that we keep making these terrible decisions. Then when something does happen and we unravel the chain of events, it turns out that one or more of these terrible decisions contributed significantly to the problem. But instead of learning that we should probably not do the terrible thing, we learn that it was actually very convenient to do the terrible thing and that it doesn't go wrong that often.
Two other “of course, again” items that we just can’t seem to lose the hang of are testing in production and bypassing the test environment to go straight to production. I regularly have to intervene to stop people from doing either of these things, even though we know that it regularly leads to incredible problems. The reasons for this behavior are that testing is hard and that we have a lot of faith in our ability to get things right the first time. The first thing is obviously true, but it is a mystery to me where the faith in our abilities comes from. Every seasoned engineer has stories up and down the wazoo of bypassing tests and wreaking havoc in production.
Reliability is a mindset and it really needs a curmudgeon like me to continuously draw attention to the myriad of details that could go wrong and actively try to prevent them from happening, quite often to the annoyance of others. Some time ago, I was put in charge of a high-pressure project that needed to be delivered right before a so-called “frozen period” started. When discussing the planning, I pointed out that the frozen period started on a Monday and we should never launch on a Friday, so we should plan to launch on the Thursday before the frozen period at the latest. This led to audible groans from the group, who saw their already tight timeline reduced by another day.
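The planning rule above is simple enough to write down. The following is a minimal sketch, assuming that a "safe" launch day is a working day that is itself followed by a working day; the function names and the `holidays` parameter are hypothetical, and the dates are only an example.

```python
from datetime import date, timedelta


def is_working_day(day: date, holidays: frozenset = frozenset()) -> bool:
    """Monday to Friday, excluding any listed holidays."""
    return day.weekday() < 5 and day not in holidays


def is_safe_launch_day(day: date, holidays: frozenset = frozenset()) -> bool:
    """Assumption: safe means people are at work both on the day and the day after."""
    next_day = day + timedelta(days=1)
    return is_working_day(day, holidays) and is_working_day(next_day, holidays)


def latest_launch_day(freeze_start: date, holidays: frozenset = frozenset()) -> date:
    """Walk backwards from the day before the freeze until a safe day is found."""
    day = freeze_start - timedelta(days=1)
    while not is_safe_launch_day(day, holidays):
        day -= timedelta(days=1)
    return day


# A freeze that starts on Monday 2024-06-03 pushes the launch back to Thursday.
print(latest_launch_day(date(2024, 6, 3)))  # 2024-05-30
```

Passing a set of holidays covers the other cases mentioned earlier: with Thanksgiving in `holidays`, the Wednesday before it is rejected as well.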
High quality comes from making the right decision in every instance; when we don't do that, and things go wrong, you might run into me, pointing out that of course you did <X> again.
Oh. My. Word. Do people really check things into production without thorough testing? More than once in their lifetime?
Then - yes, I have checked things into production right before a long weekend, but with the expectation that we have monitoring in place starting immediately, and that we have staffed with the expectation that extra attention is needed and more attention might be required if an issue indeed crops up. It can give a lighter production-level experience, possibly flushing out issues at a relatively lower level of system stress. It can be beneficial. It can be easier to craft and implement a fix more thoughtfully during the lighter hours of a weekend than during the heat of a weekday. But ya gotta staff for it.