(Like this article? Read more Wednesday Wisdom!)
This week’s Wednesday Wisdom will try to sell you a very simple idea: Production Readiness Reviews should be done continuously during software development, not just after development is complete, right before the launch. If you know this already, I just saved you ten minutes of your life. I suggest you use that time productively by reading this paper, which investigates various myths and mythconceptions related to software development.
For those of you still with us, thank you, and don’t forget to subscribe!
One of the big frustrations of SRE teams all over the world is that they are regularly on the receiving end of bad software that is “thrown over the wall” for the SRE team to take care of. It is a sad fact of life that software development teams are often only, or in any case mostly, concerned with implementing functional requirements. Unfortunately, this usually leaves little or no time to think about the non-functional requirements which would allow the software to run well in the post-apocalyptic hellscape that is the production environment.
Most SRE teams deal with this by trying to keep the doors shut when it comes to accepting responsibility for a new service. When that is not possible, SREs usually engage the software development team in a Production Readiness Review (PRR) to ascertain the quality of the thing they are about to accept the pager for. The outcome of that review is usually a number of action items that the development team needs to complete before the SRE team accepts responsibility for the service or component. It’s a bit of a game: The SRE team tries to get as much done as possible while they still have some leverage, whereas the SWE team tries to get away with doing as little as possible.
The idea of a production readiness review gave rise to standard checklists and templates that are completed as part of that review. I wrote a production readiness review template at Google SRE in Zürich in the early 2000s, which was also used and extended by others. These templates naturally become huge and unwieldy, as sections are added for every conceivable operational modality, storage system, and standard piece of infrastructure that a service could potentially use. On top of that, the template also grows to encompass proactive defenses against every incident that ever happened. In that sense, the PRR template is a bit like the preflight checklist I use when I go flying: Literally every item on that list is there because somebody crashed a plane and the root cause analysis indicated a defect related to that particular item.
Unwieldy templates are a problem because many people feel a need to write something, anything, in answer to every question, even when it doesn’t make a lot of sense to do so. For instance, if your service consists only of AWS Lambda functions, the questions about your kernel patching and release strategy are not really relevant, because the kind people of AWS take care of that for you (for a small fee). Similarly, if your only input mechanism is an SQS message queue, you really don’t need to worry about throttling your callers, because SQS supports a practically infinite number of message inserts per second (for standard queues).
PRR template writers could make their templates easier to use if every section or question came with a list of situations where it did or did not apply. And perhaps also throw in some standard answers that the hapless software engineer charged with filling out the template can just mark with a ✓.
Annoying though big and unwieldy templates are, the biggest problem with most production readiness reviews is that they often uncover issues whose remediations should really have been thought of during the software development process, because they are hard to fix after the fact.
Let’s look at dark and incremental launches, for example. Most SRE teams are naturally averse to big bang launches. We know of one big bang that went quite well, but most others come with serious risks. To level-set: A big bang launch is a launch where you make your entire service or feature available to the entire population in a single release. If your stuff works well: Amazing! Everyone can use the new shiny and everybody is happy. However, if there is a problem, everybody is unhappy and you are left with a lot of egg on your face.
For that reason, SRE teams like to do incremental launches. There are lots of variants of that idea, but they typically involve launching the feature surreptitiously (without anyone knowing, a so-called dark launch) or only to a very small number of users (limited by some user attribute such as country of origin or maybe if their int64 userid % 100 <= 2). Incremental launches allow you to test-drive the software and get data about your new feature’s performance without exposing lots of users to any problems that might still lurk in there.
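To make that concrete, here is a minimal sketch of such a launch gate, assuming a stable 64-bit user id; the class name and the configurable percentage are illustrative, not taken from any particular launch system:

```java
// Minimal sketch of a percentage-based rollout gate (illustrative names).
// Hashing the user id decouples the rollout bucket from whatever meaning
// the raw id carries (e.g. older users having lower ids).
public final class RolloutGate {
    private final int percentage; // 0..100: share of users who get the feature

    public RolloutGate(int percentage) {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be in [0, 100]");
        }
        this.percentage = percentage;
    }

    public boolean isEnabledFor(long userId) {
        // Math.floorMod keeps the bucket non-negative even for negative hashes.
        int bucket = Math.floorMod(Long.hashCode(userId), 100);
        return bucket < percentage;
    }
}
```

The point of building this in early is that ramping up from 1% to 5% to 100% then becomes a configuration change instead of a code change.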
Or let’s look at throttling incoming traffic and rate limiting outgoing traffic. If you have a synchronous RPC-like API that is supposed to return results with relatively low latency, you are well advised to implement a system of throttling where overactive upstream callers get pushback if they send more traffic than you can handle. Mutatis mutandis, you are equally well advised not to overwhelm any downstream dependencies with more traffic than they can handle, by implementing some form of rate limiting.
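A common building block for both sides of this coin is a token bucket. Here is a minimal, single-process sketch (the names and numbers are illustrative; a real service would want per-caller buckets and possibly a distributed implementation):

```java
import java.util.concurrent.TimeUnit;

// Minimal token-bucket sketch (illustrative, not production-grade).
// On the server side it throttles overactive callers (reject when empty,
// e.g. with HTTP 429 or RESOURCE_EXHAUSTED); on the client side it rate
// limits outgoing traffic (wait when empty) to protect dependencies.
public final class TokenBucket {
    private final double capacity;      // maximum burst size
    private final double refillPerNano; // steady-state rate
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / TimeUnit.SECONDS.toNanos(1);
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Take a token if one is available; otherwise the caller should back off.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens < 1.0) {
            return false;
        }
        tokens -= 1.0;
        return true;
    }
}
```

The twenty lines above are not the hard part; the hard part is that the call sites where tryAcquire() belongs are scattered all over the system, which is exactly why this is so painful to retrofit.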
None of these examples is rocket science, but almost no software development team builds these affordances into their system designs.
That is a problem, because when these items come up during a production readiness review, you are too late in the project to make the changes required to implement any of these good ideas. So what are you going to do now? Block the launch? Or chalk them up as “good idea, we should really implement that when we have some time”, which typically means either never or after the first major incident where the lack of this feature took a prominent spot in the postmortem.
There are many other examples of non-functional requirements that really need to be integrated into the overall system design in order to be useful. Observability is another example. Good dashboards and good alarms require good instrumentation, which in turn means that the right metrics with the right labels need to be collected at the right moments in the processing flow. Again, not rocket science, but it needs to be done consistently throughout the code. Same for logging and good exception handling. If you don’t do any of that during the software development phase, it is almost impossible to do it right after the fact.
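For instance, here is a minimal sketch of what “the right metrics at the right moments” looks like in code, assuming the Prometheus Java simpleclient; the metric names and the outcome labels are illustrative:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;

// Minimal instrumentation sketch (metric names and labels are illustrative).
// The pattern: count every request by outcome and time every request.
public final class InstrumentedHandler {
    private static final Counter REQUESTS = Counter.build()
        .name("requests_total")
        .help("Requests handled, by outcome.")
        .labelNames("outcome")
        .register();

    private static final Histogram LATENCY = Histogram.build()
        .name("request_latency_seconds")
        .help("Request latency in seconds.")
        .register();

    public void handle(Runnable work) {
        Histogram.Timer timer = LATENCY.startTimer();
        try {
            work.run();
            REQUESTS.labels("ok").inc();
        } catch (RuntimeException e) {
            REQUESTS.labels("error").inc();
            throw e; // count the failure, but don't swallow the exception
        } finally {
            timer.observeDuration();
        }
    }
}
```

The pattern itself is trivial; the work is in applying it consistently at every significant point in the processing flow, which is precisely why it has to happen during development.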
I once wrote a Perl script to trawl through our Java sources to insert an “ex.printStackTrace();” statement in all empty catch {} blocks, so that we had at least some idea that something had gone wrong.
During postmortems, I often ask why no system for incremental launches was implemented. The typical answer is that this was not possible, which is of course not true. The real answer is usually that nobody ever thought about it, and consequently the entire system design doesn’t afford it. Even when it was called out during the production readiness review, we were no longer in a position to do anything about it in the time left to us.
This problem comes with a simple solution: Start the production readiness review in parallel with writing your design doc, and continue filling it out during software development!
Really, it stands to reason. By going through the production readiness review template during the design and build phases, those phases will continuously be influenced by the non-functional requirements that will inevitably come up during the pre-launch review, conveniently at a time when you can still do something about them. Doing that work during the design and build phases also means that the actual review will be mostly a NOP, because you arrive with a mature document and great answers to all the questions. This approach turns the production readiness template from a checklist into a to-do list that is used during development.
Engaging with production readiness right at the start of software development will hopefully reduce the number of postmortem meetings where the software developers look like clowns because they have not implemented an industry-standard practice that addresses situations we know happen all the time in production. It might also reduce the number of postmortem documents in which I read things like “Lessons learned: APIs should have validations on the arguments and reject invalid requests” (true example). That’s not a lesson learned, that’s a miss. And there is really only one miss I like, and that’s my darling daughter.