Prelude
In the 1990s a friend of mine worked in the steel cable industry. At the time the European steel cable industry was in decline, suffering from overcapacity and from Asian competitors that could undercut them on price. One not-so-fine day his employer went bankrupt and everyone got sent home with a cardboard box filled with family photos and other office paraphernalia.
Two days later my friend got a call from the administrator who had been appointed to settle the bankruptcy. The administrator summoned him back to work to help unwind the company’s assets. My friend gave him a friendly “Eff you”, after which he received a call from the Dutch social insurance company (which, in Holland, takes over the salary payments of bankrupt companies), telling him to report to work on the double or face an immediate loss of income. The day thereafter my friend was back in the office.
The next thing that happened was that my friend called me. Turned out the now-bankrupt company had two Sun Unix workstations for CAD/CAM purposes but all the valuable drawings had been removed by the company’s system administrator, probably out of spite for being laid off.
Free legal advice: Never ever do that. This system administrator got off lightly, but this course of action opens you up to substantial civil liability and maybe a criminal indictment.
The bankruptcy administrator wanted the drawings back because they represented valuable intellectual property which he could possibly sell. Since they did not understand the “tar” command and I did, my help was requested. I readily agreed to get these files back in return for the opportunity to buy these two workstations out of the bankruptcy at a discounted rate. It clearly pays off to know the “tar” command…
The bankrupt company’s assets were all disposed of quickly. One thing I clearly remember was that this company had a massive machine for creating equally massive steel cables (I forgot the name for it, but apparently there is a technique where you first twist steel wire into a cable and then twist these cables into an even bigger cable). That machine was bought by a consortium of European steel cable manufacturers and then destroyed, lest the machine be bought by an Asian company and shipped there to add to the competition.
It always struck me as sad that a system can get itself into a state where the willful destruction of capital is the profitable thing to do. There is a lesson in there somewhere, but I don’t quite know what it is yet.
The meat
So far none of this has been particularly topical for the average audience of these articles, but I promise it is getting better from here on.
One of the things that this particular friend taught me is that quality is fitness for use. He would regale me with stories about how it does not make sense to sell a cable that can tow an oil rig to a fisherman who needs to pull a fishing net, because the extra strength is worth nothing for the fisherman’s use case (and he will probably never pull an oil rig). In another example he explained that they would sell cables to tow something from A to B, after which the cable was cut loose to sink to the bottom of the ocean. That cable clearly only needed to be good enough to survive that one trip.
Apparently getting the biggest bad-ass strongest cable is not always the right thing to do.
The concept that good enough is good enough has stuck with me, especially because software engineers often overindex on building things that are way better than they need to be. The world’s hard disks are littered with pieces of code that were never taken to the max of their capabilities. This means that time and money were wasted building capabilities that were unnecessary.
I am not talking about building a product for which there turned out to be no demand; that’s a business failure, not an engineering one. What I am talking about is building a steel cable that can be used to tow oil rigs around for fifty years even though we know that we are going to use it once to tow a very small boat.
The big problem here is that it is often very hard to figure out up front how good something needs to be.
We all know of temporary solutions that are still running and causing problems because of their crappy nature. The fact that this sometimes happens leads many people to err on the side of caution: Make everything as good as it can be because that is always better than the opposite, right?
Wrong!
Building something isn’t free and the fact that it is possible to be mistaken about the future does not absolve you from the responsibility to make reasonable and informed decisions about how good something needs to be before you start designing and building it.
From the waterfall method we inherited a solid focus on requirements; these days when I am involved in building something we are all worrying about the requirements. This wasn’t necessarily the case in the past, when we often just started to see where things would go. I am happy to report that today we usually also worry about important non-functional requirements, like how many requests per second the new system must be able to process, at what latency, and how much data we will need to store.
Getting the requirements right is hard, but non-functional requirements are even harder to get right because that requires 20/20 vision into the future. Figuring out the right quality to strive for requires you to know how much traffic you are going to see, how much data you are going to process, how often you are going to do maintenance, how often you are going to change the underlying platform, how much downtime your business will be able to accept, and so forth.
This is especially hard when it comes to the necessary quality of software engineering. What is the right programming language? Data model? Database abstraction? Should we use an ORM? (Answer: No.) Are we ever going to change the operating system? Container runtime? Database? Message queuing system?
I lived through the wrong answer to that last question. Years ago I was contracting for a bank and we were going to use a message queuing system to exchange data between various systems. The message queue system of choice at this customer was IBM’s MQ Series, which made a lot of sense because a) it was a bank, b) IBM was one of their most critical IT vendors, and c) we also needed to talk to the mainframe (where MQ Series was supported). However, the powers that were decided that we could not build a dependency on MQ Series into our software and so they set about creating a universal message queue wrapper which would theoretically allow us to switch from MQ Series to another message queuing system without any impact on our software.
Cue (pun intended) almost unlimited suckage.
For reasons that I won’t go into right now, writing a generic message queue wrapper that allows swapping out the implementation for a different one is very hard. So hard in fact that you are not going to get it right, probably ever. Writing the first usable version of the wrapper took a considerable amount of time and money. On top of that, using that wrapper was not easy because of mistakes in the APIs (abstraction leaks), bugs, suboptimal documentation, and lack of internal support. Then, to make matters worse, for all the obvious reasons we never actually switched the underlying implementation. The whole wrapper business was just a huge waste of time and money.
Of course in an ideal world it would be awesome to have an amazing implementation-independent wrapper for any dependency so that you are not locked into any actual implementation. This would allow you to change implementations for better ones or cheaper ones when they come along. Unfortunately this world is practically impossible to bring about (look at J2EE if you disagree). Plus you can ask yourself whether you actually need that implementation independence in the environment that you are in.
Here is some help in answering the last question: Your environment is not as complex and demanding as you think it is. Really not. Also: Your system will not live long enough to require transparently switching out fundamental infrastructure.
The architects designed a system that was better than it needed to be. Just using MQ Series directly from our software would have been cheaper, easier, and faster. It would have been more than good enough for our problem.
We rarely discuss how good something needs to be. What is the expected lifetime of the thing we are going to build? How often will it need to be touched by developers after it has gone into production? How often is this configuration changed and what is the lead-time to get the new configuration into production? Is this the basis of an entirely new subsystem or a one-off to bridge the time to an entirely new solution? Will we ever change the fundamental infrastructure? Is the eventuality that we want to protect ourselves from ever really going to happen? And if it does, what is the cost of dealing with it then instead of now?
Here is another important question: Can you actually afford to do a good job?
In steel cables it is obvious that a better cable costs more money. In software engineering it is equally obvious that a better system is more expensive, but strangely enough that seems to be less considered. If a steel cable that was designed to pull a mini submarine (<150 tons) snaps when used to pull an oil rig (of give-or-take 30,000 tons), the steel cable engineer will not be sad. However, if the bank suddenly decides to throw in their lot with Oracle AQ and I then have to make changes to my system, people are probably going to consider me a muppet because surely I would have foreseen this and built a layer of abstraction that would make that easy to do?
Yes, I did consider that, but solving for that eventuality is not free. To build something that hides important platform services, or is scalable, or easy to maintain, or easy to configure, or which allows seamless updates, or canarying, or whatever, requires time and money to design, implement, and test. And not only that, the extra complexity is an ongoing burden for future maintenance and thus is not only more expensive today, but will continue to incur extra costs all the way until the cows come home.
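To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. It is only an illustration: the `Option` class, the `expected_cost` function, and every number in it are made up, and the probability that the feared change ever happens is a guess, not a measurement. The point is merely that the upfront cost and the yearly upkeep are certain, while the cost of changing later is only paid if the change actually comes.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of handling a possible future change (all figures hypothetical)."""
    name: str
    upfront_cost: float         # design, implementation, and testing today
    yearly_upkeep: float        # extra maintenance burden per year
    cost_if_change_hits: float  # work needed if the feared change actually happens


def expected_cost(option: Option, years: float, p_change: float) -> float:
    """Expected total cost over the system's lifetime.

    p_change is the estimated probability that the change happens at all
    during those years.
    """
    return (option.upfront_cost
            + option.yearly_upkeep * years
            + p_change * option.cost_if_change_hits)


# Made-up numbers in person-days, for illustration only.
wrapper = Option("generic wrapper", upfront_cost=200, yearly_upkeep=30,
                 cost_if_change_hits=20)
direct = Option("use the queue directly", upfront_cost=20, yearly_upkeep=5,
                cost_if_change_hits=150)

for opt in (wrapper, direct):
    cost = expected_cost(opt, years=5, p_change=0.1)
    print(f"{opt.name}: {cost:.0f} person-days")
```

With these invented inputs the direct option comes out far cheaper, and it stays cheaper even if you set `p_change` to 1.0. Your own numbers will differ; the exercise is in writing them down at all.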
Which brings me back to the question: Can you actually afford to do the best possible job?
In the past I have described Google as the “Germany of engineering” (when compared to some other companies I worked at): Everything is about ten times better than it needs to be and it takes them ages to get anything done, but once it is there it is totally amazing. Now Google, with over $100 billion cash in hand, can obviously afford whatever they want. But can you?
Although, even Google is starting to become frugal as they are cutting down on the number of staplers in their company. Maybe they’ll also stop having three teams to solve the same problem :-)
For many startups the question is not whether whatever they put out is easy to maintain or can handle 100 million users; it’s whether they can make it to Friday. In that environment speed of execution and cost are of the essence. To optimize for both requires a shared understanding of the quality they are building and consequently what they should and can expect of it.
When having that discussion it pays off not to apply a “one size fits all” approach: There is room for varying quality across the system. For instance I am personally in favor of always applying high standards to things that I consider foundational, such as the data model (at least third normal form), the APIs, the protocol buffer definitions, and the overall module structure. In my opinion even if you do not yet implement the full complexity of the system, getting these bits right is important because they are very hard to change afterwards. However that is just my opinion in the abstract; given the facts of a case at hand, reasonable people might disagree.
Whenever you build something, you have to make a conscious decision on how good it needs to be and then document that decision to prevent misunderstandings later. It’s okay to be wrong about the future, but don’t weave a steel cable to pull an oil rig if what you really need is to hang a very small painting.