I am going to try something relatively wild this week, namely making some comments on a topic that is currently in the news. I promise this is not because I just got delayed, rerouted, delayed again, downgraded, upgraded, laid over, and then, to top it off, delayed some more. Though that was annoying for sure, the brand-new six-hour layover at IAD in my changed itinerary did give me the opportunity to have dinner with my sisters-in-law, whom I do not get to see nearly enough. So it was a mixed bag, I guess.
I typically don’t comment on current events because if there is one thing that dabbling in law has taught me, it is that it really pays off to spend some time thinking about and analyzing all the facts, and a) that typically takes a lot of time, and b) we really don’t have the facts at the moment. I have read a ton about the CrowdStrike IT disaster, but nothing that describes exactly what the problem was and where the process that was meant to ensure software quality failed. We need to wait until we know more. Unfortunately, news outlets feel a pressing need to fill pages and airtime, so instead of rational analysis and commentary we get hopeless drivel and endless repeats of the few available facts, which are barely enough to fill half a page. It is also an open question for me whether we will ever figure out exactly what happened, because companies are understandably not keen to air their dirty laundry.
In the meantime, many pundits on LinkedIn give free advice to CrowdStrike about how to mend the error of their ways. I don’t know anything about CrowdStrike, but I assume they are not muppets, and I have been involved in enough outages to know that it is typically a case of the stars all aligning to bring about an event with a very low probability. Unfortunately, given the massive volumes we are dealing with these days, events with very low probability happen every day.
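To make that concrete, here is a back-of-the-envelope sketch. Every number in it is invented purely for illustration (the fleet size, the update cadence, the per-update failure probability are all assumptions, not CrowdStrike figures); the point is only that multiplying a tiny probability by a huge number of trials gives you something that happens routinely.

```python
# Back-of-the-envelope: how a "one in a billion" failure becomes routine at scale.
# All numbers below are invented for illustration; none are real CrowdStrike figures.
p_failure = 1e-9          # assumed chance that a single update on a single machine goes wrong
machines = 10_000_000     # assumed fleet size
updates_per_day = 10      # assumed updates pushed per machine per day

trials_per_day = machines * updates_per_day
# Probability that at least one machine somewhere in the fleet fails on a given day
p_at_least_one = 1 - (1 - p_failure) ** trials_per_day
print(f"P(at least one failure today) = {p_at_least_one:.1%}")   # roughly 9.5%
```

With these made-up numbers, a one-in-a-billion event bites somewhere in the fleet about every ten days; make the fleet or the cadence ten times bigger and it becomes an everyday occurrence.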
We do know enough for customers of CrowdStrike to maybe start rethinking their software strategies. I must admit that it was a bit of a surprise to me how many things rely on Windows systems. Even the notice boards in airports apparently run on Windows these days.
Which is kind of interesting because these notice boards are all full screen; none of them have actual windows on them.
The first thing that comes to mind when I think about that is that Windows is an incredibly complicated system for such limited functionality. I do understand the economics of the situation though; Microsoft’s “Windows Everywhere” strategy has been remarkably effective in ensuring that even a digital notice board can be run in a cost-effective way using low-end commodity hardware and a standard desktop operating system. However, the cost of “Windows Everywhere” is that we have created an incredibly large monoculture and, as we know from history, viruses and pests love monocultures, often with disastrous results. The CrowdStrike disaster happened because we sought to defend ourselves against the dangers of a monoculture, but unfortunately our defense turrets shot us in the foot.
Adhering to the monoculture leads to low startup costs for your solution, but there are hidden costs that only come out in the long run. We are paying those costs right now, as the total cost of ownership catches up with what was in store for us all along.
For many of these systems, I wonder if the cost of this disaster is lower than the cost of the events against which CrowdStrike is supposed to defend. In other words, for how many systems was the cure worse than the disease?
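One way to frame that question is as an expected-cost comparison. The sketch below uses entirely made-up probabilities and damage figures (the breach rate, outage rate, and dollar amounts are all assumptions); only the structure of the trade-off matters.

```python
# Hypothetical expected-cost comparison: is the cure worse than the disease?
# All numbers are invented; only the structure of the trade-off matters.
p_breach_per_year = 0.02      # assumed yearly chance of a damaging intrusion without the agent
cost_of_breach = 5_000_000    # assumed damage of such an intrusion (dollars)
p_bad_update_per_year = 0.01  # assumed yearly chance the protective agent itself takes the fleet down
cost_of_outage = 20_000_000   # assumed damage of a fleet-wide outage (dollars)

expected_loss_without_agent = p_breach_per_year * cost_of_breach        # $100,000
expected_loss_caused_by_agent = p_bad_update_per_year * cost_of_outage  # $200,000

# If the second number exceeds the first, the cure is worse than the disease.
print(f"Expected annual loss without the agent: ${expected_loss_without_agent:,.0f}")
print(f"Expected annual loss caused by the agent: ${expected_loss_caused_by_agent:,.0f}")
```

For systems like an airport notice board, where the damage from an intrusion is modest but the damage from a fleet-wide outage is very visible, the second number can easily dominate.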
The CrowdStrike IT disaster was bound to happen. It was not bound to be CrowdStrike and it was not bound to be here and now, but this was coming for us because of a huge dependency on one single type of system. As we have learned from agriculture, it takes deliberate policies to prevent monocropping, and many governments influence (or outright dictate) what farmers grow where, so as to make sure that a single pest cannot ruin an entire country’s agricultural output. Even within a single crop you need genetic variation; it’s true for potatoes and it is true for computer systems.
And even for bananas.
Many in the Unix and Linux fan club will use this disaster as an argument in their rants that Windows just sucks and that if the whole world ran Linux, none of this would happen. This is of course BS. If the whole world ran Linux, there would be a huge cybercrime effort to hack Linux systems, everyone would run CrowdStrike for Linux (some people already do), and we’d be in the same boat. Personally, I prefer Unix-like operating systems, but there is nothing in Linux or FreeBSD that defends specifically against what seems to be the problem here.
As usual, the Patrician said it best: “Strength comes from diversity, alloys are stronger than steel.”
The CrowdStrike disaster also got me thinking again about one of my older policy ideas: How to structure product liability for software.
Software is unique in that no amount of unfitness for use seems to trigger standard product liability laws and practices. If I go to the Home Depot and buy a $15 hammer, they are on the hook for that thing not functioning and for any damage caused if the hammerhead flies off and hits someone. Also, if I go buy a truck for some purpose and the seller assures me that the thing can do what I need it to do and then it turns out it doesn’t, I can get my money back.
One of the best words in the Dutch legal vocabulary describes this concept perfectly: “Dwaling”. It doesn’t have a direct English translation, but the word is related to the verb for getting lost: Verdwalen. If you are in a state of “dwaling” you are lost with respect to the qualities of the product. It might be your fault or it might be the seller’s fault, but lost you are…
Back to software: no amount of damage caused by bugs that are clearly attributable to the vendor seems to trigger any of the standard product liability patterns. Every first-year law school student is hammered with law and cases on the topic of product liability. This has led to a world where pretty much everything I buy that is not software is amazing. Cars pretty much do not need any maintenance in their first few years and seldom break down anymore (and most definitely not in their warranty period). Every electrical device I buy has been designed and tested for safety to the point that I haven’t had a fuse blow because of a short circuit in decades. Neither do people get electrocuted very often. Overall product safety (in the Western world) is really excellent these days.
A lot of things have come together to create this world. For instance, governments set test standards and have safety regulations. If you want to bring a product to market that contains an electronic circuit, there are lots of hoops to jump through and most of these hoops ensure that random n00bs can use that product safely. Same in the food industry: I started my working life as a dishwasher in my dad’s restaurant, where I climbed the ladder to sous-chef and eventually chef. My dad was a hard-nosed small businessman and he feared almost nothing and nobody, and especially not the bailiffs who would come if tax bills were not paid on time (again). But he did fear the inspectors of the Dutch food safety authority, who had the power to close the place down in a heartbeat if there was anything wrong. Consequently, food scares have become so rare that if one happens, it makes national news.
In product safety, the backstop of regulations is legal liability. In general, “any or all parties along the chain of manufacture of any product” can be held liable for damage caused by that product. This has led to many high-profile legal cases where manufacturers were found liable for faulty products, sometimes costing them billions.
Software seems to be completely exempt from this reality. The only legal cases I see around software quality are contract disputes where a vendor needs to develop or implement some IT system for some business and they cannot get it done in time or according to specs. But even that is hit and miss; in one of the biggest scandals of our time (the UK Post Office’s Horizon scandal) there will probably not be a lawsuit even though Fujitsu clearly messed up.
Software vendors escape liability all the time, even when their software is clearly faulty and leads to actual damage for their users. Software vendors will claim that this is partly because of the unique nature of software. Most working systems are an amalgamation of software from different sources and when a problem occurs, whose fault is that? In the CrowdStrike example, is it really CrowdStrike’s fault? Or Microsoft’s? Or the user’s? We don’t know yet, but if this were a regular physical product problem, many government departments and lawyers would be stretching and doing warm-up exercises for an ultra-marathon of legal proceedings.
I think the “amalgamation” argument is partly bogus. My car is just as much an amalgamation of components from different vendors. But when the battery explodes, General Motors cannot tell me it is not their problem because they sourced the battery from somewhere else. There is a difference, however: in software, the customer/user is often the one who creates the amalgamation. In the car example: if I replace my car battery with something my nephew concocted in our shed and that battery explodes, that is clearly not General Motors’ problem. Then again, if I replace the audio system in the car and the original battery explodes, that is the car manufacturer’s problem. The problem with a lot of software is that the equivalent of putting a nice sheepskin cover on the steering wheel can make the engine blow up, and in my view that definitely should be the software vendor’s problem.
A lack of liability concerns has definitely motivated very fast innovation because many vendors just ship it when the unit tests (if they have any) are green. Most software quality measures I have seen put in place were motivated by brand risk concerns or lost ad placement revenue, and not by fear of liability. The ability of vendors to push bugs and then fix forward meant that there was no time lost in designing and building products for inherent product safety.
There is a popular (and old) joke that goes like this: At a computer expo Bill Gates reportedly compared the computer industry with the auto industry and stated “If GM had kept up with technology like the computer industry has, we would all be driving twenty-five dollar cars that got 1000 miles/gallon.” General Motors then addressed this comment by releasing the following statement: “Yeah, but would you want your car to crash twice a day?”
This exchange never happened, but it bears thinking about. Software has reached the point where, in many places, like the airport notice boards, we do not get any advantages of rapid innovation anymore but we do bear the costs of it. Product liability causes most manufacturers to be super careful when releasing new things because of the potentially huge costs of shipping a faulty product. And when they accidentally do, there are recalls and customers get their money back, which is expensive, but often not as expensive as being liable for any damages.
In my opinion we should start thinking seriously about liability for software along the lines of traditional product liability. It will slow down the rate of development, but really, most software I use on a daily basis has matured to the point that the features we had ten years ago were already more than good enough to last me the rest of my life. Personally I’d happily trade an AI-powered chatty paperclip on my desktop for software that just works without deleting my data.
Good read, thank you. Is this you shaking your fist at a cloud, or do you have an idea up your sleeve next Wednesday for how to replace monoculture (or "oligoculture") with a robust multiculture?
Strawman: within a decade, AI tools will be powerful enough that we can ask them to burp up a bespoke operating / software system, to detailed specifications, for whichever snowflake application we want, when we want it. Even one-offs will be best practices all the way down.
(Maybe interesting prior art: https://cryptome.org/cyberinsecurity.htm for which the principal author, Dan Geer, was fired from the consultancy company he worked for)
I came across this article where Microsoft essentially is pointing the finger at the EU. The TL;DR I got from it was that because of the EU regulating (not sure if this is the right word) Windows and Windows Defender, Microsoft had to open up its kernel to 3rd parties. I am curious to hear your thoughts on this.
https://www.forbes.com/sites/davidphelan/2024/07/22/crowdstrike-outage-microsoft-blames-eu-while-macs-remain-immune/