Scale is the only problem left

But the effort curve is not linear…


One of the topics that is on my mind a lot these days is the discussion that is going on in Europe about its dependency on US hyperscalers and the perceived need for a “sovereign cloud”. I have many thoughts about why Europe is unable to get a decent-sized cloud or AI business off the ground, but these thoughts are more opinion than wisdom and so are a better fit for my other publication: Thursday Thoughts.


An interesting aspect of this discussion is the stream of posts on LinkedIn by people who claim that there is no need to use US hyperscalers at all because “There is <European company XYZ> which is really good” or “You can just do it all yourself using open source tools like NextCloud”. What these posters either conveniently ignore or do not seem to understand is the problem of scale.

Until quite recently, technology was not powerful enough to solve even a single organization’s set of administrative and business problems. At the start of my career, I worked in the IT organization of a large Dutch bank and we would regularly have to tell internal customers that what they wanted was not possible, not even on our whopping four-core mainframe with its equally whopping 64 MB of internal memory and our impressive array of 1.8 GB disks. Fortunately, Gordon Moore came to the rescue and by now there is pretty much no administrative or business problem that cannot be solved using the computers, networks, and storage devices that a reasonable amount of money can buy.

Why, then, do so many organizations still have IT problems? For instance, the Dutch IRS is suffering from perpetual problems with its IT infrastructure, which are apparently even endangering the continuation of tax collection.

At first sight this sounds amazing, but I guarantee you that if that happens the joy will last only five minutes because any advanced Western country’s economy is intricately interlinked with the government’s cash flow.

In a similar vein, a project to automate the Dutch court system’s core business processes failed grandiosely. And lest you think that this is a government-only problem: Shipyard Damen Naval recently got itself into financial trouble because of problems with its design software that are delaying the delivery of six frigates for the German navy.

None of these examples are about things that are fundamentally impossible. The tax code is intricate, but not beyond expression in computer code. The court system’s internal workings might be arcane, but we have been doing courts since before the Romans so we kind of know how to do that. And the design software for naval frigates is surely not simple, but humanity has built complicated things before using similar software. These organizations are not struggling to fit their problem into their computers’ and networks’ physical limitations.

Instead, the problem at hand is scale.

For years, I ran our company’s email server using approximately the same software that the large email providers of the day used. At first this was easy, but pretty soon it became a hassle as we got more employees and also more spam, more email viruses, and more phishing attempts. As our email infrastructure became more complicated and more load bearing for our company, we started getting occasional reports from customers asking why we hadn’t responded to their emails. When that happened, it usually turned out that for <insert technical reason here>, the email hadn’t made it to the intended recipient’s inbox. Eventually, we ditched our own email solution and took out a subscription with a big US hyperscaler. Since that fateful decision we have had zero problems with email and have spent zero effort keeping that email flowing. The big American hyperscaler isn’t exactly cheap, but it is cheaper than me or one of my colleagues spending an hour a week keeping email going and definitely cheaper than missing a business opportunity because an email didn’t arrive.

The problem, again, is scale.

Running your own email infrastructure for an occasional email is not a problem. But running a reliable and secure email service for a larger group of people is a problem. And running an email service for thousands of people or more is a really big problem. It stands to reason that more users and more requirements require more effort, but unfortunately that effort curve is not linear. The scale versus effort curve has a weird shape with all sorts of inflection points, each of which represents an abrupt step in size and complexity that requires a vast increase in the effort needed to evolve the solution to a place where further growth is possible.

Let’s look at running IT services for a company. Running a few services on a single computer is usually a small effort. But, as the environment grows, you immediately hit the first inflection point as you need to go from one to two computers. The reason for the extra effort is that most of the techniques you use to keep a single computer running do not work well once you have two computers. Suddenly you have a software and configuration file distribution problem. Or you might need load balancing, failover, or replication. These are solvable problems though and if you know what you are doing (always a good idea), you’ll get this up and running soon enough.

The tools and techniques that you instituted for two computers will stand you in good stead as you grow to more computers, but there comes a point where they don’t suffice anymore and you need to come up with something completely different. It might happen at ten computers, maybe at fifty, but depending on the exact circumstances, there will come a time when your solution doesn’t scale anymore. At this point, you might need to start thinking about using Kubernetes. Or maybe your shell scripts to distribute software need to be replaced by Spinnaker. And your PostgreSQL database might need to be replaced by some distributed database or key/value store. These solutions will help you grow some more, but eventually there might even come a time when these tools no longer work and you have reached another inflection point.

At one of my employers we reached a point where we needed to run a large and tightly connected set of services at a scale that defied Kubernetes. What are you going to do then? What’s the alternative? It is of course possible to design and run a cluster that allows for more machines than Kubernetes’ measly 5,000 nodes: Facebook does it, Google does it, AWS does it. But it is not exactly trivial.

These inflection points happen in every growth curve and represent the fact that every solution has an applicable range around a sweet spot. The entire curve is made up of these ranges, but the solutions are often fundamentally different for each range and often do not form a continuous spectrum. When you reach the upper limit of what a solution can do, you need to start doing something else altogether. At the scale of US hyperscalers there is stuff happening that is completely incomprehensible to almost everyone who hasn’t worked in these environments. As I used to tell potential candidates: “At our scale, everything is difficult”.

Way back when I was at Google interviewing candidates for SRE roles, one of the things that we were looking for was whether people understood the problems of scale. For that reason, our interview questions were often of the nature: “Please design a system to keep a set of configuration files in sync across ten million machines distributed in 100 datacenters, one of which is on the moon.” That is the kind of question that separates the professionals from the novices. To design a system for that problem size, you need a completely new way of thinking and probably nothing you ever worked with works for that scale. Not the software, not even the architecture or design patterns. It is also a good sort of question to weed out candidates who just cannot deal with that scale.

One candidate kept muttering to themselves: “That is really a lot of machines. … Wow, …. That's a lot. Of machines…”

One of the issues at large scale is that it exacerbates a lot of failure modes that you can just brush under the carpet at the lower end of the curve. When you have ten million machines, every possible problem occurs. Machines improbably freeze for two minutes and then wake up as if nothing happened. There will be CRC-defying bit errors in RAM and data transfer. Some machines will inexplicably run outdated versions of the software. Some others will mysteriously have a data rate of 1200 baud (despite having a 10 Gbps ethernet card). Operating at scale means that you need to take every eventuality into account.
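To put rough numbers on this, here is a minimal sketch in Python. The per-machine probability of one in a million per day is an assumption picked for illustration, not a measured figure, the function names are hypothetical, and the model assumes machines fail independently of each other.

```python
import math


def expected_events(machines: int, p_per_machine: float) -> float:
    """Expected number of machines hit per day for a given per-machine daily probability."""
    return machines * p_per_machine


def probability_at_least_one(machines: int, p_per_machine: float) -> float:
    """Chance that at least one machine is hit in a day, assuming independent failures."""
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(machines * math.log1p(-p_per_machine))


if __name__ == "__main__":
    P = 1e-6  # assumed: a "one in a million" event per machine per day
    for fleet in (10, 1_000, 10_000_000):
        print(
            f"{fleet:>10,} machines: "
            f"expected {expected_events(fleet, P):.5f} per day, "
            f"P(at least one) = {probability_at_least_one(fleet, P):.5f}"
        )
```

Under these assumed numbers, a ten-machine shop would see the event roughly once every 270 years, which is why it can be brushed under the carpet. The ten-million-machine fleet should expect about ten of them every single day.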

Operating at scale also means that you need to think about your requirements differently. Things that are relatively easy to do for a small group of users in a single location are physically impossible to pull off for a large group of users who are distributed across the world. Consistency is often the first thing to go out of the window. Latency is usually next, given that the speed of light is a mere foot per nanosecond.
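The latency floor is simple arithmetic. The sketch below is a rough illustration only: the function name is hypothetical, the 10,000 km distance is an arbitrary example, and the two-thirds factor for light in optical fiber is a common approximation rather than an exact value. Real networks add routing, queuing, and processing time on top of this floor.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3                   # assumed: light in fiber travels at about 2/3 c
EARTH_MOON_KM = 384_400                # average Earth-Moon distance


def min_round_trip_ms(distance_km: float, medium_factor: float = 1.0) -> float:
    """Lower bound on round-trip time in milliseconds over a straight path."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * medium_factor)
    return 2 * one_way_seconds * 1000


if __name__ == "__main__":
    print(f"10,000 km, vacuum: {min_round_trip_ms(10_000):.1f} ms")
    print(f"10,000 km, fiber:  {min_round_trip_ms(10_000, FIBER_FACTOR):.1f} ms")
    print(f"Earth to Moon:     {min_round_trip_ms(EARTH_MOON_KM):.1f} ms")
```

Under these assumptions, two users 10,000 km apart cannot get a round trip faster than about 67 ms in vacuum, or about 100 ms through fiber, no matter how much money is spent. A datacenter on the moon is at least two and a half seconds away, round trip.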

Scale also impacts organizational processes. Figuring out a new way of sharing documents within your team is easy, but as the group of people grows and grows, it becomes an impossible project as a larger group makes it exponentially harder to figure out the requirements, prioritize them appropriately, and reach consensus. Just like a larger technological base makes every rare event a regular occurrence, a larger user base makes every human condition a fact of life that you have to deal with. I have had the privilege to work in the HR department of a large tech company with millions of employees across dozens of countries. When you have that many employees, everything happens. People with duplicate social security numbers, people with the same name and same date of birth, people without an address, people without an email address or cell phone, people without a last name. You name it, we had it! The problems at that scale were such that no standard HR system worked for us and so we kept ourselves quite busy building our own.

Scale makes everything hard. Remember that when someone proposes to rewrite all of the government’s COBOL programs in a few months or wants to build a European hyperscaler from scratch. With the current state of technology the problems might be physically solvable, but that does not at all mean it will be easy.

