(Like this article? Read more Wednesday Wisdom!)
Some years ago I worked for a well-known Internet site on an SRE team that had a strong “fix-forward” mentality. If a bug was found during the rollout of a new software version, we would pull in some software developers to debug the issue in production and code up a fix. Once that fix was in, we would fire up the build process to produce a new package version and then roll that out. If we discovered another bug (maybe introduced by the fix), we rinsed and repeated until a more-or-less stable version was running in production.
Of course there is a time and place for everything, and fixing forward can be a good strategy. In some cases, depending on the exact circumstances, it might be the only strategy available to you (although preventing yourself from having to go through a one-way door is a laudable goal for every process).
An obvious problem with fixing forward is that your site is wholly or partially broken for a (potentially) long time while you figure out what is going on and attempt to cobble together a fix. If this concerns your personal web site with five unique visitors per week, that’s just fine, but if we are talking about the world’s premier site in its space, it is not. At least I think it is not.
The main problem with fixing forward is that it does not scale. As your site becomes bigger and better, your entire build and deploy process inevitably becomes bigger and slower. And fixing forward is hard to do if every iteration takes hours…
Builds tend to get slower because, instead of compiling and linking a few dozen files on your laptop, your build process might have evolved to compile thousands of files into multiple binary packages on some overloaded build farm and then run a large number of slow, potentially flaky, and hard-to-parallelize tests. It is not uncommon for complex builds to take over an hour to complete. That’s an hour or more that your site is performing suboptimally, eating into your service level objectives.
Deploy processes also rarely get faster as sites evolve. For your small site, deploying might mean copying a new binary to a handful of machines and then restarting some processes. But before you know it, you need to upload your binary packages to a depot, sync them across regions, canary the new software version on a small number of tasks in a single region, look at dashboards, do a bit of manual QA, and then roll out to a thousand machines or more. Such deploy processes commonly take hours, if not days.
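To make that concrete, here is a hedged sketch of such a multi-stage rollout written down as a script. Every helper in it (upload_to_depot, sync_regions, start_canary, and so on) is a hypothetical placeholder that just logs what a real implementation would do; it is not any particular shop’s tooling.

```python
# A hedged sketch of a multi-stage rollout, not a real deployment tool.
# Every helper below is a hypothetical placeholder that only logs what a
# real implementation would do.
import time

REGIONS = ["us-east", "us-west", "eu-west", "asia-east"]

def upload_to_depot(version: str) -> None:
    print(f"uploading packages for {version} to the depot")

def sync_regions(version: str) -> None:
    print(f"syncing {version} to {', '.join(REGIONS)}")

def start_canary(version: str, region: str, tasks: int) -> None:
    print(f"canarying {version} on {tasks} tasks in {region}")

def dashboards_look_healthy(version: str) -> bool:
    print(f"checking dashboards for {version}")
    return True  # in real life: error rates, latency graphs, manual QA sign-off

def rollout(version: str, region: str) -> None:
    print(f"rolling out {version} to all tasks in {region}")

def release(version: str) -> None:
    upload_to_depot(version)
    sync_regions(version)
    start_canary(version, region=REGIONS[0], tasks=10)
    time.sleep(1)  # stand-in for hours of canary soak time
    if not dashboards_look_healthy(version):
        raise RuntimeError("canary unhealthy; halt the release")
    for region in REGIONS:
        rollout(version, region)

if __name__ == "__main__":
    release("v2024.07.1")
```

Even in this toy form you can see where the hours go: the canary soak and the region-by-region rollout are serialized on purpose, because you want a chance to stop before the damage is global.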
At another site I worked on, we had a documented, official deployment process that worked a bit like the one described in the previous paragraph. At some point we acquired the services of a tech writer to update our team’s documentation. She shadowed a few releases and then had a meeting with me in which she said the following magic words: “Nobody follows the official process and everyone does something different.”
What was going on was that engineers used their individual knowledge of the architecture, and their situational knowledge of the changes in each release, to optimize the deployment process for increased speed and a reduced need for manual care and feeding.
As our site grew and grew, the build and deploy process had changed in the ways outlined above. Care had been taken to keep the build process somewhat fast and scripts had been written to speed up the deployment process, but overall things were not getting faster and fixing forward became very painful. Additionally, the entire engineering organization had grown tremendously, so finding the right developer to put a fix together became a problem in itself. At some point, deployment day (Tuesdays) started at 9 am sharp in a war room with over a dozen people involved and typically ran well into the evening, with pizzas and beer thrown in to keep morale high.
Then we moved out of the original data centers onto a globally distributed cloud platform with almost unlimited capacity. Unfortunately the cloud platform offered a control plane that was incompatible with a lot of the tricks we had used to make deployment fast.
Here is something we did that no longer worked on the cloud platform: in the old data centers we ran natively on a bunch of Linux machines. For a release we would first run a massively parallel rsync to stage the new binary onto every machine. Then we started the new server process, which opened port 80 with SO_REUSEPORT so that multiple processes could listen on the same port. At that point there would be two processes serving requests and the kernel would distribute incoming connections between them. We then told the old process to finish all in-flight requests and terminate.
This allowed us to carry out a deployment without ever losing capacity, at the cost of higher memory usage while two server processes were active. We could parallelize all of these steps across machines, so the whole deployment took only as long as the slowest machine needed to go through them.
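For the curious, here is a minimal sketch of the SO_REUSEPORT half of that trick in Python, assuming a plain blocking TCP server; handle_connection() is a hypothetical placeholder rather than our actual server code, and binding port 80 of course requires the appropriate privileges.

```python
# A minimal sketch of the SO_REUSEPORT half of the trick (Linux only).
# handle_connection() is a hypothetical placeholder, not the real server code,
# and binding port 80 requires root or CAP_NET_BIND_SERVICE.
import socket

def handle_connection(conn: socket.socket) -> None:
    # Placeholder: read the request, write a response, close the connection.
    conn.close()

def serve(port: int = 80) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets the new process bind the same port while the old
    # process is still serving; the kernel spreads incoming connections
    # across all listeners on that port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(128)
    while True:
        conn, _addr = sock.accept()
        handle_connection(conn)

if __name__ == "__main__":
    serve()
```

Start the new binary with a listener like this while the old one is still running and the kernel splits new connections between them; once the old process has drained its in-flight requests, it can simply exit.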
As part of the move to the cloud platform we had put some engineers on the task of redoing the entire release process. They did a lot of amazing work and built a multi-stage (and multi-day) release pipeline that mostly guaranteed that new versions made it to production without too many problems. Fixing forward was still the order of the day, but because of the cloud platform and the new release pipeline it was simultaneously more painful and less frequently needed.
Then one fine day I was in charge of the weekly release. The moment I started the rollout my pager went off because of an increased number of errors on the site. I engaged in some debugging and tracked it down to a few lines of Python deep in our application server code. Unfortunately it was not at all obvious what the fix was; this was something the original developers would have to take a look at. However, because of increased geographic diversity in the ever-growing engineering team, the original developers of that piece of code were nine time zones away and fast asleep. Instead of attempting a fix forward, I decided to halt the release and roll the site back to the last known good version.
Cue outrage.
This was the first time in the site’s decade-long history that a release had been rolled back, and people were shocked. The main reproach I got was that this release contained a lot of code by a lot of engineers, and they now had to wait a week for their changes to go live.
My answer: The users deserve an error-free site. I am an SRE, I fight for the users.
A lot of the criticism I got was related to deadlines: “We have been working on this feature for months and now your rollback means that we are missing our deadline.” I should probably write an entire article about this particular train of thought, but in short: your lack of planning is not my emergency. If you have been working for months on a plan that included a one-week window for the final release of your feature, without any contingency, then I am afraid your plan is not very good.
Escalations ensued, engineering vice presidents got involved, and our sister team on the other side of the globe volunteered to do an out-of-band release to mitigate this “disaster”. But over time sanity prevailed and rolling back became the default option. At the end of the day, nobody wanted to go back to the stress and chaos of fixing forward potentially complex bugs in an ever-growing application.
It is completely obvious to me that for grown-up sites with grown-up release processes, rolling back instead of fixing forward is the right thing to do. There are many problems with fixing forward, the two major ones being that a) it exposes users to a broken site while you figure out what is going on and how to fix it, and b) the pressure to come up with a fix fast does not lead to high-quality code and often introduces new problems.
Fixing forward is seductive, though: in the eyes of the software engineer, the job has been done well and any remaining problem is only a small impediment that is easily fixed.
Rolling back often feels like a failure, but really it isn’t: It is the correct response to an unknown problem and quite often the fastest way to restore a site to health.
Repeat after me: The users deserve an error-free site.
Here’s a 3-minute audio version of “Fixing backward” from Wednesday Wisdom, converted using the Recast app.
https://app.letsrecast.ai/r/db511086-0632-4878-97a6-2e686f81ffe2