We need more agile system design


Because a forty page design doc does not help you build better systems

Jos Visser

Feb 19, 2025

(Like this article? Read more Wednesday Wisdom! No time to read? No worries! This article is also available as a podcast).

As some of you know, my parents owned and operated a bar and restaurant.

It still exists, but under new ownership. If you are in the neighbourhood, please drop by, ask for Wim or Gaby, and tell them Jos says hi.

We had ten seats at the bar, sixty seats in the restaurant, and one hundred seats on the patio outside. Walking in and out with all of the orders and dirty dishes was a significant hassle for the waiting staff, so eventually we hired a local handyman to build an outside bar.

By the way, I got stories about that local handyman that you wouldn’t believe, including his wife breaking down the door with an axe, involvement with illegal radio stations, running off with the daughter of one of our waitresses, and eventually drinking himself to death in the hospital where he was treated for alcohol-induced liver damage. The same hospital where my daughter was born. It’s fun to be from a small community… 🙂

One fine day the handyman rocked up with wood, straw (for the thatched roof), some tools, and got to work. Within days a functioning bar stood firmly in place, complete with running water and electricity. A few months later, the town sent us a letter saying that we had erected a permanent structure without a permit. We could either tear it down or belatedly start the permit process.

One of the requirements for the permit was that we file the plans with the town. One problem though: There were no plans! Local handyman didn’t do plans, he just built stuff. The town engineers were flabbergasted: surely it was impossible to build something without a design and without plans? So, in what I later learned was a time-honored tradition of all building professions, we created the design post facto.

It is of course an axiom of our profession that you should not start building a system without a detailed design doc that has gone through review and that is maybe approved by the local high priests of the system design cult.

In the olden days we used to call these high priests “architects”, but decades of experience with total muppets who shouldn’t be allowed to design a binary tree have tainted that term somewhat.

I have written about the folly of the approval process before, but now the time has come to comment on the system design cargo cult as a whole.

Whenever you are going to build something, it is a good idea to have a good idea about what you are going to build and how you are going to build it. Not seldom you will want to write that idea down, share it with a few people, and maybe get some input. So far, so good. Writing your idea down helps you think it through, sharing your ideas with people helps drive alignment (and while we are using BS business speak I might as well note that it could also unlock synergies), and getting input helps improve your ideas because, let’s face it, you might be an insufferable know-it-all, but you really do not know it all.

So far so good, you have some ideas, you write them down, and share them with a few people in order to facilitate a (hopefully brief) discussion.

It is a sad fact however that many systems are designed by … uhh, let’s kindly call them “people with fewer skills and less experience than needed for the task”. That is sad because every time you code something, you create lock-ins. Some of these lock-ins are weak and we don’t care about them, but some lock-ins are strong and changing your mind about them later will require a big effort. Examples of big lock-ins are externally visible APIs and everything that is related to the core data model (often expressed through these APIs).

Lock-ins are not the only problem though. Sometimes a system is designed with faulty premises in mind. I am reminded of an example very early in my career when a group of developers built a new system for processing payment input, all based around the idea that it was perfectly fine to read from and write to so-called VSAM key-sequenced data sets (don’t ask) from multiple processes at the same time. Nothing could be further from the truth, but by the time they figured that out, so much code had been written that it was going to be very expensive to back out from that false premise. So, instead, one of my colleagues wrote a hack that sent all read and write operations to a separate process that would do all of the actual I/O. Thus, as far as the operating system was concerned, all I/O came from a single process, preventing data corruption.
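For the curious, the shape of that hack can be sketched in a few lines of Python. This is an illustration only, not the original mainframe code: it uses threads and an in-memory dict instead of processes and VSAM, and the class name `SingleWriterStore` and its methods are made up for this example.

```python
import queue
import threading


class SingleWriterStore:
    """Funnels every read and write through one worker thread.

    Hypothetical sketch: callers never touch the data directly, so the
    underlying store only ever sees a single accessor.
    """

    def __init__(self):
        self._data = {}  # only the worker thread touches this dict
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self):
        # The worker handles requests strictly one at a time.
        while True:
            op, key, value, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "write":
                self._data[key] = value
                reply.put(None)
            else:  # "read"
                reply.put(self._data.get(key))

    def _call(self, op, key=None, value=None):
        # Each request carries its own reply queue so callers can block
        # until the worker has processed their operation.
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, key, value, reply))
        return reply.get()

    def write(self, key, value):
        self._call("write", key, value)

    def read(self, key):
        return self._call("read", key)

    def close(self):
        self._call("stop")
        self._worker.join()
```

Any number of callers can use `write` and `read` concurrently; the queue serializes them. It works, but note what it is: a patch over a wrong premise, added because backing out had become too expensive.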

In modern times, I have repeatedly dealt with people who were blissfully unaware of how key-value stores actually work, or who wanted to write their own cross-region distributed cache which, according to them, would have better characteristics than Bigtable replication. Surely they were smarter than the Bigtable developers, and the network storms that plagued Bigtable and sometimes made its replication less than amazing would not hurt them one bit.

Getting input on your design ideas is a useful process for preventing potentially costly mistakes and I am a big fan of it. However, in many companies the design review process has gone too far.

It often starts well intentioned: Some senior engineers organize a meeting where people can bring design documents for review and to get input. Pretty soon though, things get out of hand. The documents that are presented are of course imperfect, so templates spring into being in order to ensure that the design doc writers do not forget important factors such as where the logging will go or what the backup strategy is. It is the nature of these templates that they only grow and refine over time, so quite quickly the template is full of sections that do not pertain to the vast majority of systems that are being designed, making conforming to the template a non-trivial amount of work. Please note: Design docs are not launch checklists!

Then some senior engineers, drunk on their own experience, lobby for the design process to become compulsory. This is wholeheartedly embraced by senior managers who are always on the lookout for risk reduction and who don’t mind being able to point to an “approved” design when the inevitable blame allocation phase that sits at the end of every project comes to pass. This part of the story leads to even more finicky design reviews, because who wants to be on the hook for a design unless they have gone through it with a fine-tooth comb and all of their points have been addressed?

All of this does not do wonders for velocity. At one company I worked for, the situation had become so dire that it became almost impossible to launch something. This gave rise to a hilarious video and a project called “Trainwreck” that sought to remove as many hurdles as possible, including mandatory reviews of all kinds.

To make matters worse though, the entire design review process does not even guarantee better systems!

Whenever we are building something, be it code or an entire system, we only have an inkling about what we should build and what sort of day-to-day environment the system will be exposed to. We tend to overestimate some requirements and underestimate others. We are blissfully unaware of some aspects of the system that will prove to be non-trivial to deal with. We have badly hidden desires to use certain technological solutions. And, to top it off, the actual requirements, which we don’t understand well to begin with, change while we are building the system. Demands change, user behaviors change, needs change, and fast-changing technology kills our carefully designed systems by taking some failure modes off the table and introducing new ones. We want to build for the ages, but everything we build requires constant attention to change to the latest and greatest insights about what we need.

So, we need to design differently, maybe a bit more … uhhh, agile.

A lot of bad things have been said about agile methodologies, most of them not without basis in fact. But it is not that the core tenets of the agile manifesto are necessarily wrong; they are just hard to implement correctly, especially in complicated organizations that are subject to all sorts of competing pressures that are not necessarily aligned with the goal of effective and efficient system and software development.

My own personal brand of agile aligns with Amazon’s philosophy around one-way doors and two-way doors. Whenever you make a decision, it is useful to think about that decision in terms of these doors. Does this fork in the road represent a two-way door, meaning a door that you can easily get back through again, or is it a one-way door, one that you cannot go back through without a lot of trouble, effort, and expense?

The “strong lock-ins” that I talked about before are one-way doors. Take API design: The moment you put your API out there and people start using it, it will be difficult to change it. Sure, you can put out a v2 of the API, but you need to support the v1 API as well and in my experience it is hard and time-consuming to move everyone to v2. On the other hand, which thread pool library to use is a two-way door. Sure, it’s a bit of a hassle to refactor, but typically the effort involved is limited to a few days of work, code reviews, and making sure the unit tests pass again.
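To make the API case concrete, here is a minimal sketch of what “supporting v1 as well” tends to look like. Everything in it is hypothetical and invented for illustration: the user record, the field names, the route paths, and the idea that v2 split a single `name` field in two.

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: int
    given_name: str
    family_name: str


# Made-up in-memory data standing in for a real data store.
_USERS = {1: User(1, "Ada", "Lovelace")}


def get_user_v2(user_id: int) -> dict:
    """Current API shape: the name is split into two fields."""
    user = _USERS[user_id]
    return {
        "id": user.user_id,
        "given_name": user.given_name,
        "family_name": user.family_name,
    }


def get_user_v1(user_id: int) -> dict:
    """Legacy API shape, kept alive as an adapter on top of v2."""
    v2 = get_user_v2(user_id)
    return {"id": v2["id"], "name": f'{v2["given_name"]} {v2["family_name"]}'}


# Both versions stay routable for as long as any client still calls v1.
ROUTES = {"/v1/users": get_user_v1, "/v2/users": get_user_v2}
```

The adapter itself is cheap. The expensive part is that `get_user_v1` cannot be deleted until the last client has moved, and every future change to the data model has to keep it working.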

When using a more agile system design approach you need to have a very good gut instinct for which decisions are one-way doors, which ones are two-way doors, and which ones are just dumb. The dumb decisions should be avoided, the two-way door ones merit a bit of thought (but not too much), and the one-way door decisions are the ones that really need some discussion. Take backup as an example. I think there are few people who think it is okay not to have backups, but unless you are doing something really wonky, I don’t need a page or two in a design document about your backup strategy; it’s usually pretty obvious what needs to be done and if your initial strategy turned out to be a bit naive, that is probably something that can be mopped up later at a reasonably low cost. Same goes for load balancing. I will be the first to admit that even in 2025 it is still not entirely obvious how to load balance depending on the vagaries of the traffic involved, but this is something that I trust you to figure out as you go along. With modern cloud primitives, swapping out one load balancing strategy for another is a bit of a hassle, but often does not require a redesign of your service.

The problem with gut instincts is that it takes lots of experience to develop good ones. The agile system design methodology I am promoting here should therefore take the experience of the designer into account. Somewhat junior designer? More oversight and more feedback necessary. Principal engineer? A chat at a whiteboard will probably suffice.
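If you wanted to write this heuristic down, it could look something like the sketch below. Treat it as one possible encoding and nothing more: the two enums, the three seniority levels, and the exact wording of each outcome are assumptions made for this example, not a prescribed process.

```python
from enum import Enum


class Door(Enum):
    ONE_WAY = "one-way"  # expensive to reverse, e.g. a public API
    TWO_WAY = "two-way"  # cheap to reverse, e.g. a thread pool library


class Seniority(Enum):
    JUNIOR = 1
    SENIOR = 2
    PRINCIPAL = 3


def review_depth(door: Door, seniority: Seniority) -> str:
    """Suggest how much review a decision deserves (illustrative only)."""
    if door is Door.TWO_WAY:
        # Reversible decisions merit a bit of thought, but not too much.
        if seniority is Seniority.JUNIOR:
            return "quick peer feedback"
        return "just build it"
    # One-way doors always get discussed; the format scales with experience.
    if seniority is Seniority.PRINCIPAL:
        return "whiteboard chat"
    return "short written proposal and review"
```

For example, `review_depth(Door.ONE_WAY, Seniority.JUNIOR)` returns `"short written proposal and review"`, while a principal engineer facing the same decision gets `"whiteboard chat"`.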

My goal here is to unleash people who know what they are doing and who have proven to be good at getting stuff done the right way. If you don’t yet entirely know what you need to build, there is little value in spending a lot of time getting a detailed design in place because, well, you don’t yet know what you need to build. Traditionalists might retort that one of the points of a thorough upfront system design is to unearth these requirements, but that is waterfall thinking! Often (not always, but often), the reason you don’t know exactly yet what you need to build is that nobody knows and that thoughts about what is needed will only start to crystallize once we give the people something to work with. Now where have we heard that one before?

For all of the problems in its implementation, the tenets behind agile software development are still going strong and nobody is clamoring for a return to the good old days of the waterfall. It is time we start making system design more agile too.
