This is the first article in a series called “Postmortem blues”. It contains wisdoms related to the practice of blameless retrospectives.
(Like this article? Read more Wednesday Wisdom!)
For decades now, I have been involved in writing and reviewing postmortems. If you are reading this, you probably know what these are, but on the off-chance that you don’t, these are analytical documents you write after an incident. The goal of a postmortem is to analyze thoroughly what happened, learn from it, and then define follow-up actions to ensure that this class of problems does not happen again.
Right off the bat: Too many postmortems try to ensure that this exact problem does not happen again. Instead, try to focus on preventing this entire class of problems.
I am not exactly sure when postmortems became hip and happening, but I first encountered them when I joined Google in the early 2000s. The practice immediately made sense to me: Surely if something went wrong, you would want to learn from what happened and then ensure that you change your ways? With the benefit of hindsight I should have been surprised that this totally logical and sensible practice was not established a lot earlier.
Our industry is by no means the first or only one that does postmortems. I am not sure where the practice originated, but I know that the aviation industry is quite big on them.
If you think you write decent postmortems, read this one from the NTSB about the helicopter crash that killed American basketball star Kobe Bryant.
Two weeks ago I was in the lucky circumstance that I could attend a lecture by NASA flight director Allison Bolinger (a.k.a. Athena Flight), who described the safety culture at NASA. One of the things she touched upon was the postmortem after a launch test of Apollo 1 went south and killed three astronauts. Some of the lessons from that postmortem are still operating mantras at NASA today.
An important section in every postmortem document is the “Lessons learned” chapter, where the authors collect and summarize everything they learned from the incident and the subsequent analysis.
More often than not, this is (or should be) a very difficult section to write, because this is the section where you showcase your ignorance and/or lack of skill. A good "Lessons learned" section is brutally honest because what needs to be represented there is completely counter to our modern success culture: How awesome we were not, all the things we knew not, and all the things we didn’t know how to do.
It's one of the reasons why postmortems need to be blameless. If people knew that whatever they wrote down in a postmortem would come back to haunt them, they would not be honest, which would not be helpful to the overall learning process. That would lead to a situation where whatever just happened would surely happen again, and it would promote a culture where people take as few risks as possible, which in turn would hurt velocity.
One day, during my first job, I came into the office to find the data center in disarray and one of my more experienced colleagues looking beaten up and tired. Here is what happened: For some reason (I honestly can't remember why, but it doesn’t matter) they had decided to move all files from a number of projects from disk to tape.
To understand the rest of this story you need to know that files on the classic IBM mainframe had names like "X.Y.Z". There were no directories, but by using the components in the name you could create some structure. For instance the file that contained my job control templates was called "VISSERJ.WORK.CNTL" and the file that contained the account balances was called "CCR.PROD.RC1" or something like that.
CCR stood for Coöperatieve Centrale Raiffeisenbank, one of the predecessors of the (merged) bank I was working for.
Most file utilities supported wildcards, where * stood for a single qualifier and ** stood for multiple matching qualifiers. So my job control file would be matched by VISSERJ.WORK.* and all my files would be matched by VISSERJ.**.
My colleague needed to move all files of some projects to tape, and so he used one of the more powerful file copying utilities to perform a DUMP DELETE operation, which would move the files to tape and then delete them from disk. Unfortunately, he made a mistake. Instead of specifying the list of files to be copied and deleted as "PROJ1.**,PROJ2.**,PROJ3.**" he wrote "PROJ1.**,PROJ2,**,PROJ3.**".
Note the comma (instead of a period) after PROJ2.
The utility moved all files from PROJ1 to tape, then complained that file PROJ2 did not exist, and then, because the stray ** on its own matched every file on the system, started to move all files from all disks connected to the mainframe to tape, deleting them after the copy.
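To make the failure mode concrete, here is a minimal Python sketch of the simplified wildcard rules above and of what the comma typo does to the file list. This is just an illustration, not the actual mainframe utility's matching logic, and the dataset names beyond the ones in the story are made up:

```python
import re

# Toy translation of the simplified wildcard rules from the story:
# '*' matches a single qualifier, '**' matches any number of qualifiers,
# so a bare '**' entry matches every dataset on the system.
def pattern_to_regex(pattern: str) -> str:
    parts = []
    for qualifier in pattern.split("."):
        if qualifier == "**":
            parts.append(r".+")        # any number of qualifiers
        elif qualifier == "*":
            parts.append(r"[^.]+")     # exactly one qualifier
        else:
            parts.append(re.escape(qualifier))
    return r"\.".join(parts)

def matches(pattern: str, dataset: str) -> bool:
    return re.fullmatch(pattern_to_regex(pattern), dataset) is not None

datasets = ["PROJ1.WORK.CNTL", "PROJ3.PROD.DATA", "VISSERJ.WORK.CNTL", "CCR.PROD.RC1"]

intended = "PROJ1.**,PROJ2.**,PROJ3.**".split(",")
typo     = "PROJ1.**,PROJ2,**,PROJ3.**".split(",")   # 'PROJ2' and '**' become separate entries

for label, patterns in (("intended", intended), ("typo", typo)):
    selected = [d for d in datasets if any(matches(p, d) for p in patterns)]
    print(f"{label}: {selected}")
# intended: only the PROJ* datasets
# typo: every dataset, because the stray '**' matches everything
```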
At first everyone was just surprised that the process took so long and was eating so many blank tapes, but eventually alarms went off as production batch runs started to fail because their input files were not there anymore.
When they realized what was going on, my colleague halted the rogue DUMP DELETE job and started the recovery. That process took the entire night (and then some), because they conferred with the batch production managers about which files they wanted back first, in order to salvage as many of the service level objectives as possible.
The next day, the head of the data center was shouting in our boss's office, demanding that my colleague be fired. My boss, a sterling guy, provided excellent top cover for his report, and the bank enjoyed a few more years of excellent service from him, until he sadly and unexpectedly passed away, much too early, three years later.
What can we learn from this incident?
The obvious and wrong lesson is that we use very powerful utilities that are easy to configure incorrectly. That is totally true and it is what bit us in the arse, but, and this is important, we knew this already! A lesson learned is something you didn't yet know! The better lesson is: “DF/DSS supports ** as a wildcard pattern for all files on the entire system? Who knew?!” Another good lesson is: “We really shouldn’t issue commands that are this impactful without review.” This last lesson might seem obvious today, but it was not at all obvious back in 1989.
When reviewing postmortems I come across a lot of "Lessons learned" sections that say things like: "We need better documentation" or “We need to improve our integration tests.”
These are totally lame lessons.
Surely, the fact that your documentation was terrible was not news to you before the incident happened. You probably knew your documentation was terrible, but decided not to spend any time on it. Same for your integration tests. They always suck. These are not “lessons learned”; these are items you knew about but hoped you could safely ignore.
Although hope is not a strategy, it is an often used tactic.
Most teams I have ever worked with (with the exception of the SRE teams I was in) were unfortunately not “operationally excellent”. That makes sense, because they also have other things to do, like writing the software. However, it also means that they are typically not well positioned to handle an incident. They probably know that, and to the extent that they don't, it usually takes only one incident followed by some honest reflection to learn it.
“Lessons learned” is a confrontational section of the postmortem document because, if you are truly honest, there are often not a lot of good lessons to learn. More often than not, the causes of the incident, or of the slow detection, mitigation, and resolution, are grounded in things that were amiss but that you already knew about.
Maybe postmortems should have a section called: “Things that were suboptimal but that we already knew about.”
As an aside: Your service level objectives need to match your operational maturity. I review postmortems where I question whether there should be a postmortem at all, because the incident, while certainly not great, does not seem to exceed what reasonable service level objectives for the service would allow. The service level objectives you can offer need to take your operational maturity into account.
A good lesson learned is something that you truly did not know. And then, when describing the lesson, you need to reflect on the question of whether, all things considered, you could reasonably have been expected to know it.
Clearly for some reasonable definition of "reasonably" 🙂
One incident that I personally caused had its roots in me not understanding some settings in Google's internal cloud platform. Back in those days (and maybe still), if you started a job with a few tasks, you needed to configure the job not to kill the entire job if one task unexpectedly failed. I truly did not know that and therefore hadn't configured a critical production job properly. Consequently, when one task of that critical job (the front-end proxy) failed unexpectedly, all tasks got stopped by the platform, taking the entire service down with them.
Could I have known that? Sure, it was documented. Did I know it? Well, no; it was buried somewhere deep down in the documentation. Should I have known it? The jury is still out on that one 🙂. Suffice it to say that this behavior made it onto a subsequent "Surprising defaults" page that got highlighted in the documentation...
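For illustration only, here is a toy Python model of the two policies. It is emphatically not Google's platform or its configuration syntax; the flag name kill_job_on_task_failure is made up, purely to show the difference between the default behavior and the behavior I needed:

```python
import random

# Toy model of the two policies. The flag name is invented; it is not any
# real platform's configuration syntax.
def run_job(num_tasks: int, kill_job_on_task_failure: bool) -> dict:
    tasks = {i: "RUNNING" for i in range(num_tasks)}
    failed = random.randrange(num_tasks)        # one task dies unexpectedly
    tasks[failed] = "FAILED"
    if kill_job_on_task_failure:
        # The surprising default in the story: one failure takes the whole job down.
        for t, state in tasks.items():
            if state == "RUNNING":
                tasks[t] = "KILLED"
    else:
        tasks[failed] = "RESTARTING"            # only the failed task is touched
    return tasks

print(run_job(4, kill_job_on_task_failure=True))    # entire service gone
print(run_job(4, kill_job_on_task_failure=False))   # one task restarts, service stays up
```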
Personally, I think that “We knew the risk, took it, and then it unfortunately materialized” is not a good lesson learned. It should be documented in the postmortem somewhere, but I don't think it's a lesson you learned.
I propose the "Wait! What now?" test for evaluating whether a lesson learned is a good one. If someone who is familiar with the service reads the analysis and then says (or thinks) "Wait! What now?", that's probably an indicator of a good lesson, because it points to something that a reasonable engineer did not know upfront and was surprised by.
Some years ago I was involved in a postmortem around an issue where a CA certificate of one of the most trusted certificate authorities on the Internet expired. Obviously, they had issued a new certificate, but because of a bug in older versions of OpenSSL, a lot of binaries that were confronted with both the expired and the new certificate in their trust stores would incorrectly conclude that a valid certificate presented by some server was not properly signed by a trusted third party.
Bugs, especially in common libraries, often pass the "Wait! What now?" test. What do you mean OpenSSL contains a critical bug of this magnitude? Isn't literally the entire Internet using this? Who knew!? Lesson learned!
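As an aside, spotting this kind of time bomb in advance is not hard. Here is a minimal sketch, assuming the third-party cryptography package and an example bundle path, that scans a PEM trust store for CA certificates that have expired or are about to:

```python
from datetime import datetime, timedelta

from cryptography import x509   # third-party package: pip install cryptography

BUNDLE = "/etc/ssl/certs/ca-certificates.crt"   # example path; point at your own bundle
WARN_WINDOW = timedelta(days=90)

def pem_blocks(text: str):
    """Yield each PEM certificate block in a concatenated bundle."""
    begin, end = "-----BEGIN CERTIFICATE-----", "-----END CERTIFICATE-----"
    for chunk in text.split(begin)[1:]:
        yield begin + chunk[: chunk.index(end) + len(end)]

def main() -> None:
    now = datetime.utcnow()
    with open(BUNDLE, encoding="utf-8", errors="ignore") as f:
        bundle = f.read()
    for pem in pem_blocks(bundle):
        cert = x509.load_pem_x509_certificate(pem.encode("ascii"))
        not_after = cert.not_valid_after        # naive UTC expiry timestamp
        if not_after < now:
            status = "EXPIRED"
        elif not_after < now + WARN_WINDOW:
            status = "expires soon"
        else:
            continue
        print(f"{status}: {cert.subject.rfc4514_string()} (not after {not_after:%Y-%m-%d})")

if __name__ == "__main__":
    main()
```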
Good lessons learned revolve around things that you didn't know, maybe could have known, but couldn't reasonably have been expected to know. It is a relative standard, as it takes the people who “didn't know” as the starting point of the analysis. The fact that other people knew doesn't matter. If you are an engineer writing your first service on top of a distributed key/value store and you run into a problem because of a hot partition, you learn something about the importance of good key design. That's a good lesson learned.
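To make the hot partition example a bit more tangible, here is a toy Python sketch (not modeled on any particular key/value store) of how keys that all share one low-cardinality value pile up on a single partition, while higher-cardinality keys spread the load:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """Hash-partition a key, roughly how many stores decide where data lives."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

# Naive design: every event in the same minute shares one partition key.
naive_keys = ["events:2024-06-05T10:15"] * 1000
# Better design: lead with something high-cardinality, e.g. a user id.
better_keys = [f"user{i % 500}:events:2024-06-05T10:15" for i in range(1000)]

for label, keys in (("naive", naive_keys), ("better", better_keys)):
    load = Counter(partition_for(k) for k in keys)
    print(label, dict(sorted(load.items())))
# naive: all 1000 writes land on a single partition (a hot partition)
# better: the writes spread across all partitions
```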
"We were negligent" is not a good lesson to have learned. It might be true and it might feature big in the root cause, but it's not a lesson you learned, unless you reasonably could not have known the thing you were (also) negligent about.
When the Internet started becoming popular, my Dutch company made some good money teaching people about what the Internet was and how it worked. After one of these courses an IT manager walked up to me and confessed that he had just learned from me that the firewall needed to be between their internal network and the Internet, not just at the end of a blind alley hanging somewhere to the side of the Internet router. The Internet was new and most companies were not even using TCP/IP yet, so that manager was forgiven for that. He didn’t know and he couldn’t reasonably have known at the time.
Fortunately he learned that lesson before someone else found out :-).
Good postmortems prompt hard questions about why you sucked as a team and allowed the incident to happen. Soft lessons learned like "We need more tests" or "We need to improve our documentation" do nothing to actually improve the situation.
My favorite band, Marillion, per usual summed it up best: Be Hard On Yourself.