(Like this article? Read more Wednesday Wisdom!)
Some time ago, I was reviewing an incident in which a flurry of traffic had overloaded our system and had caused processing delays which, in turn, resulted in measurable business impact. When we got to the section describing the diagnosis, the incident review document stated that the engineer who was oncall had identified the root cause very quickly because this exact thing had happened five times already this year.
FIVE TIMES!
I was flabbergasted and asked out loud if we were clowns. I will gladly accept that I am maybe somewhat sensitive when it comes to rate limiting and scalability, but shirley if your system has been overwhelmed four times already, you might want to consider that there is maybe something that you need to do about that?
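For what it's worth, the kind of protection that keeps a flurry of traffic from turning into a processing backlog does not have to be fancy. Here is a minimal token-bucket sketch in Python; the class name and the capacity/refill numbers are illustrative assumptions of mine, not anything from the actual system in the incident.

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter: admit a request only if a token is available."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained admission rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed or queue the request instead of silently falling behind


# Hypothetical usage: allow bursts of 100 requests, 50 requests/second sustained.
limiter = TokenBucket(capacity=100, refill_per_sec=50)
if not limiter.allow():
    pass  # return 429, drop, or enqueue; anything but unbounded backlog
```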
It is a sad fact of life that bad, but well known, things lurk in all systems.
A decade-and-a-half ago, I managed a team that was in charge of some critical monitoring infrastructure. One of the things we owned was the global alert manager, which was configured through a global configuration file (meaning: there was one logical copy of the file for all alert managers in the world). As part of the (automated) rollout procedure for that configuration file, we dry-ran the new version of the file with the “blessed” version of the binary that was running in production as part of a suite of integration tests. This step sometimes found errors in the new file or bugs in the code (the configuration language was very complicated).
One bad day, we pushed a new version of the config file and immediately brought down half of the world's alert managers because of a bug in the config parsing code. Our deployment procedure hadn't caught this because the bug was not yet present in the blessed production version of the binary and had already been fixed at HEAD. Consequently, the bug escaped both the unit tests and the integration tests. Unfortunately, it was present in the not-yet-blessed version of the binary that we were rolling out to production and with which we had already updated half of the data centers.
During the incident review there was a lively debate about whether and how this eventuality should be fixed. One of the engineers was vehemently against any followup actions: this bug was so rare, he said, that it would probably never happen again. To be fair, it had taken well over a decade of this binary's existence for this failure scenario to materialize, but I am strongly of the opinion that if something happened once, it will shirley happen again. At the end of the day, I had to wield the managerial hammer to push through an action item to fix this latent but very real problem.
Nine months later another instance of this scenario happened, but now the offending config file was stopped in deployment because of the extra checks we had implemented.
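The extra check was conceptually simple: don't dry-run the new config only against the blessed binary, dry-run it against every binary version it will actually meet in production, including the one currently being rolled out. Here is a sketch of that idea; the binary paths, the `--dry-run` flag, and the file name are hypothetical placeholders, not the real tool's interface.

```python
import subprocess

# Hypothetical paths and flags; the real alert manager's interface looked different.
BLESSED_BINARY = "/releases/alertmanager-blessed"
CANDIDATE_BINARY = "/releases/alertmanager-candidate"


def dry_run(binary: str, config_path: str) -> bool:
    """Ask the given binary to parse the config without serving; non-zero exit means rejection."""
    result = subprocess.run([binary, "--dry-run", "--config", config_path])
    return result.returncode == 0


def config_is_safe_to_push(config_path: str) -> bool:
    # The original procedure only checked the blessed binary, which is how a config
    # that only broke the in-flight candidate binary slipped through.
    return all(dry_run(b, config_path) for b in (BLESSED_BINARY, CANDIDATE_BINARY))


if __name__ == "__main__":
    if not config_is_safe_to_push("global_alerts.cfg"):
        raise SystemExit("config rejected by at least one binary version; aborting push")
```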
Letting bad things lie around unsolved is a terrible pattern. I have been in umpteen teams where bad things were either happening all the time (but fixed manually every time) or known to lurk in the code (waiting to explode). Typically these ticking time bombs were shrugged off by the team. "We know that this is bad and should be fixed, but nobody has the time for it."
However, as my dear old mother used to say, "Time is priority".
It sounds better in Dutch, where it rhymes: “Tijd is prioriteit”. Don't worry about the question why a word ending in "ijd" rhymes with a word ending in "eit", it just does, okay?
Even if the ticking time bombs are small, it is bad to just let them be. For instance, many teams suffer from a large volume of spammy pages or tickets. These are typically dealt with easily, but not addressing them comes with two major problems:
First of all it makes it very hard to separate the signal from the noise. Even if your oncall engineer diligently attends to every incoming page or ticket, it does take up time and attention. Consequently, if there is a serious alert in the endless stream of mostly useless signals, it will take considerable time to get to it. In postmortems this is often represented as the time it took to learn about the problem; quite often it turns out that there was a page or ticket, but it was buried in a deluge of other pages and/or tickets that were not problematic.
Suppressing the noise does wonders here; in a world where every signal is meaningful and serious, it will get the right attention immediately.
The second problem with spammy alert signals is that they desensitize people to the signal source. If the pager is ringing off the hook all the time and there is mostly nothing going on, people will stop paying attention, and when an important signal does come in, it will not get the right attention soon enough.
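Finding the worst offenders does not require anything sophisticated: count, per alert, how often it fired versus how often someone actually had to do something. A small sketch of that idea follows; the alert log format here is made up for illustration, so adapt it to whatever your paging tool can export.

```python
from collections import Counter

# Hypothetical alert log: (alert_name, was_actionable) pairs, e.g. exported
# from your paging tool for the last quarter.
alert_log = [
    ("disk_almost_full", False),
    ("disk_almost_full", False),
    ("queue_backlog", True),
    ("disk_almost_full", False),
    ("queue_backlog", False),
]

fired = Counter(name for name, _ in alert_log)
actionable = Counter(name for name, was_actionable in alert_log if was_actionable)

# Rank alerts by how rarely they need a human: prime candidates for tuning or deletion.
for name in sorted(fired, key=lambda n: actionable[n] / fired[n]):
    ratio = actionable[name] / fired[name]
    print(f"{name}: fired {fired[name]}x, actionable {ratio:.0%} of the time")
```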
If you let bad situations exist for too long they become the new normal. The scientific term for this is "normalization of deviance". This term was coined by Dr. Diane Vaughan, who defined it as a phenomenon in which “individuals and teams deviate from what is known to be an acceptable performance standard until the adopted way of practice becomes the new norm”.
If you are in a team that rubber stamps code reviews, then that becomes the new norm and bad code will slip into your code base. If it is okay to let tickets fill up the ticket queue and then at the end of the week mark them as obsolete in one fell swoop, everyone will start doing it and tickets that are important will get swept along into the bin without attention. If the pager is ringing constantly but nothing is really amiss, you will either silence your pager or just send an acknowledgement and continue doing what you were doing.
This is especially true if what you were doing is sleeping.
Normalization of deviance often occurs because everything that can go wrong usually doesn't and then people learn the wrong thing.
At my flight school they sometimes do drills where the chief flight instructor hides pens in places where they will be found by a good pre-flight check. If you go out for the pre-flight check and then come back with fewer pens than they know they hid, you get a serious talking to from your CFI.
Pre-flight checks lend themselves well to normalization of deviance because, especially in a flight school, the planes are typically in good condition and there is rarely anything wrong. That said, I have found (minor) things wrong, and in flying even minor deviations from the standard can have significant (lethal) consequences.
A surprising number of accidents happen because of a lack of fuel, which I guarantee you is easily checked and part of every pre-flight checklist. However, almost every time I did a pre-flight check before a flying lesson, there was enough fuel in the plane. “Almost” being the operative word here.
Pointing out the normalization of deviance typically doesn't make you very popular. Why would it? The team is just motoring along and here you are telling them something they know already but that they are all conveniently sweeping under the rug because addressing it is work and might interfere with promises on feature delivery dates and other work that seems more important.
One way in which I often try to get some support for addressing these issues is by describing what it would look like if a really bad thing happened which we didn't catch because of normalization of deviance. "You do know that if we have to get up in front of <SVP> and explain what happened, we will look like clowns, right? Are we a troupe of clowns? Are we?"
Also, depending on the severity of the issue, you might end up being unemployed clowns.
Addressing deviant situations is of course work that needs to be prioritized against all other work that could be done. It is not necessarily the case that every problem you identify needs to be structurally solved; in the end it is always a tradeoff that needs to take the opportunity cost of solving the problem into account. But it is up to you to recognize the normalization of deviance and make sure that it is adequately analyzed and then presented as important work that needs to be done, complete with a good cost/benefit analysis. The powers that be can then decide to do it or not.
But then at least you won't look like a clown when the excrement hits the Dyson Air Purifier.
This essay made me wonder about normalization of deviance in groups of people. There are certainly the effects of bad examples (learning to rubber stamp code reviews, or to close unresolved tickets), but these are somewhat easy to detect and mitigate (at least as a manager who is paying attention).
I find learned helplessness (and/or weaponized incompetence) a trickier situation to detect/diagnose, because things are functioning on the surface: one person picks up the slack for another person. Again, even if you are paying attention as a manager, this isn’t always easy to spot: nothing is broken, nobody is complaining. Vacations taken by the person who “carries the water” can bring some light to the problem, but it takes a coincidence of a vacation and an incident to shine a light on the problem at the team scale.
What really takes a lot of attention and luck to detect is when a whole team/org picks up the slack for another. Whole teams don’t go on extended vacations at once, and the situation can be perpetuated for a very long time. Sometimes there is churn and burnout that can be a hint, but sometimes the team that picks up the slack just figures out how to be more effective or adds automation to help things. At the same time, the dysfunctional team continues to shirk responsibility. Moreover, those teams start to invent unnecessary things to justify their existence.
I have seen the above as a problem between different software teams, and I find it hard to mitigate. One accepted tension in the industry that seems similar is the tension between software engineers and production engineers (or site reliability engineers). What is your experience with this? Have you seen it? How do you deal with it?
Here's a 2-minute audio version of "Normalization of deviance" from Wednesday Wisdom, converted using the Recast app.
https://app.letsrecast.ai/r/bb5cb9e1-4331-42b9-b949-7a49eb280db1