(Like this article? Read more Wednesday Wisdom! No time to read? No worries! This article will also become available as a podcast on Thursday)
I started my professional life as an MVS Systems Programmer (a sort of SRE avant la lettre) at a sizable Dutch bank. The team I was in took care of the core computing platform of the bank, which consisted of no fewer than three computers. Yes, you read that right. Three computers! These three machines ran all of the core ledger processing, all batch bank transactions, all mortgage calculations, and all of the loan administration for the entire bank. In the second year of my career, we upgraded the “big” computer in our primary datacenter to a whopping six CPUs and 64MB of RAM! Now we were truly playing with power!
In parallel, I dabbled in software development on personal computers. My home PC had a single 4.77 MHz CPU, a 20 MB hard disk, and 512 KB of RAM. Even then, 20 MB of hard disk space was not an awful lot, so eventually I installed a compressing disk driver (Stacker) that increased the virtual capacity of my disk to about 38 MB, at the cost of some CPU speed and with the added bonus of hopelessly corrupting my file system approximately once per year.
In the late 1990s I went through a similar phase when I migrated my Linux laptop to ReiserFS. Because of improvements in the block management layer, ReiserFS was better suited for systems with lots of small files. This gave me more free disk space, but it also provided the feature of hopelessly corrupting my file system about once per year.
ReiserFS went out of fashion when Hans Reiser killed his lovely wife Nina and, after a grueling trial in which he behaved like the world-class a-hole we all knew him to be, went to prison. The book about this saga is a must-read for anyone who was involved with Linux in the early 2000s. By the way, Hans seems to be a changed man these days, as witnessed by a letter he wrote to the Linux community earlier this year.
For most of my career, computers were so underpowered that whatever we wanted to do was either not possible or required really smart solutions. Consequently, I spent a lot of time either explaining to customers that whatever they wanted was not possible or concocting all sorts of ways to do whatever it was they wanted. This was not merely an issue of underpowered hardware; the software we worked with was equally underpowered. For instance, for one piece of code I had to write on the mainframe I wanted a basic regular expression parser. Since none was available in any of the libraries, I rolled my own in S/370 assembler. Super fun, but you can imagine what this does for development velocity in terms of story points per second. Not that we knew what story points were, but that is a topic for another day 🙂.
When I started working, a lot of (now) basic computer science had not been invented yet. For instance, the Paxos algorithm was only submitted for review in 1989, a year after I graduated from college. So in those days, if we needed distributed consensus, we were straight out of luck, or we needed a shared device such as a channel-to-channel controller (CTC) or a shared SCSI disk so that we could abuse the SCSI reserve command to turn it into a distributed locking service.
Pro tip: Do not forget to change the SCSI bus address on one of the systems’ controllers from 7 to 6 and set the dip switches on the shared disk to address 5.
A lot of very bright people wrote a lot of incredible software for these underpowered machines with little or no help from libraries or the operating system. It was all very impressive.
But computers became cheaper and more powerful, and in the mid-1990s we reached a point where they were powerful enough to do pretty much all of the things that we wanted done. It was a joyful time, and if someone needed an 80 GB database we just laughed at it! Then the Internet came and the problems we needed to solve became global and distributed. Our hitherto more than sufficient computers and networks were once again underpowered for the problems at hand. Fortunately, we had even more smart people on the problem by now, so solutions kept up. We got global file systems, distributed lock managers, key/value stores, data streaming platforms, and eventually globally distributed ACID-style databases.
A problem with solutions for complex problems is that they are complicated themselves and therefore seldom easy to use. In the 1990s we spent a lot of time helping people figure out their data models and optimize their SQL queries. In a similar vein, the 2000s saw us help people design key formats for their key-value stores and deal with the intricacies of multi-master replication, ordering guarantees, load balancing, and once-and-only-once delivery semantics.
The super smart systems that were developed in the 2000s and 2010s allowed for globally distributed services to be built from the still relatively underpowered machines that were available at the time. So even though life was already much better, we were still in the mode of doing very smart things to create services that were more demanding than what the hardware could comfortably support.
Fortunately, Gordon Moore (the “inventor” of Moore’s law) continued to come round every Christmas to give us bigger CPUs, faster CPUs, more memory, faster memory, bigger disks, faster disks, and faster network interconnects. And to top it off, the scale of the problems didn’t grow anymore, because once you have solved globally distributed services, where is the problem going to grow into? The size of the planet and the speed of light are good upper bounds for anything that we need to do.
So we now find ourselves in the happy space where computing capacity continues to grow but the scale of the problems really doesn't anymore.
Bar training AI models of course, but that is an entirely different kettle of fish that most people will never have to deal with.
So what are we going to do with all of that extra computing power? The answer: Build dumber solutions faster! That might sound a tad counterintuitive, so let me explain…
Take Kafka as an example. I love Kafka; it is a great data streaming solution that was developed by many smart people to offer a great service, namely distributing data from producers to consumers in more-or-less real time, while doing its best to do the right thing when it comes to buffering data, dealing with outages, preventing applications from seeing the same message twice (to the extent possible), and, most importantly, dealing with massive amounts of data. All that good work comes at a price, namely non-trivial complexity in the Kafka API. The amazing book “Kafka: The Definitive Guide” (which can be downloaded for free from Confluent’s website) spends almost 100 pages explaining how to reliably read and write data from and to Kafka. In the course of those 100 pages it does a deep dive into Kafka’s internals because, in order to do this right, you need to understand brokers, sharding, partitions, offsets, commits, rebalancing, acknowledgments, and other fun topics. All of this complexity gets you something amazing though: real-time, high-volume data streaming using a fleet of cheap and small computers.
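To give a flavor of what those 100 pages buy you, here is a minimal sketch of a “reliable” consumer using the confluent-kafka Python client. The broker address, topic, and group id are invented for illustration, and error handling is abridged; the point is how many correctness knobs (manual offset commits, rebalance callbacks, where new groups start reading) you already have to think about in the simplest possible loop.

```python
# A minimal sketch of a reliable Kafka consumer (confluent-kafka Python client).
# Broker, topic, and group id are hypothetical; error handling is abridged.
from confluent_kafka import Consumer, KafkaError

def process(payload: bytes) -> None:
    # Stand-in for the real business logic.
    print(f"got {len(payload)} bytes")

consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",  # hypothetical broker
    "group.id": "invoice-processors",      # hypothetical consumer group
    "enable.auto.commit": False,           # we commit offsets ourselves...
    "auto.offset.reset": "earliest",       # ...and decide where a new group starts
})

def on_rebalance(consumer, partitions):
    # Called when partitions are (re)assigned to this consumer after a rebalance.
    print(f"assigned partitions: {[p.partition for p in partitions]}")

consumer.subscribe(["invoices"], on_assign=on_rebalance)

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue  # end of partition, not a real error
            raise RuntimeError(msg.error())
        process(msg.value())
        # Commit only after processing, so a crash means re-delivery, not data loss.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```

And this is just the consumer side; the producer has its own set of knobs (acks, idempotence, retries) before you can trust that a message actually landed.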
But, do we really need all of this advanced technology?
A large part of Kafka’s complexity comes from the fact that it can deal with “slow” network interfaces, even slower hard disks, and systems with limited memory. Those are not the computers we typically have these days. If I go to AWS today, I can easily get a machine with 72 cores, 512 GB of RAM, 15 TB of SSD, and 25 Gbit/s of networking. I can probably implement all of my data streaming needs on that machine alone, using a Postgres database and a few smart queries to find all the data that I haven’t seen yet. True, it is not as advanced as Kafka, but it is a lot simpler to use and probably adequate for most use cases.
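As a rough illustration of that idea (a sketch, not a drop-in Kafka replacement), here is what “Postgres plus a few smart queries” might look like. All table, column, and connection names are invented, and psycopg2 is just one convenient client: producers append rows to an append-only table, and each consumer remembers the highest id it has processed and asks for everything newer.

```python
# A sketch of "data streaming" with plain Postgres: an append-only events table
# plus a per-consumer offsets table. All names are hypothetical.
import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id      BIGSERIAL PRIMARY KEY,              -- monotonically increasing "offset"
    payload TEXT NOT NULL,                      -- JSON, CSV, whatever your messages are
    ts      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE IF NOT EXISTS consumer_offsets (
    consumer TEXT PRIMARY KEY,
    last_id  BIGINT NOT NULL DEFAULT 0
);
"""

def publish(conn, payload: str) -> None:
    # Producers just insert a row; the sequence hands out the ordering.
    with conn.cursor() as cur:
        cur.execute("INSERT INTO events (payload) VALUES (%s)", (payload,))
    conn.commit()

def consume_batch(conn, consumer_name: str, batch_size: int = 100):
    # "Find all the data I haven't seen yet", then advance my offset.
    with conn.cursor() as cur:
        cur.execute("SELECT last_id FROM consumer_offsets WHERE consumer = %s",
                    (consumer_name,))
        row = cur.fetchone()
        last_id = row[0] if row else 0
        cur.execute("SELECT id, payload FROM events WHERE id > %s ORDER BY id LIMIT %s",
                    (last_id, batch_size))
        batch = cur.fetchall()
        if batch:
            cur.execute(
                """INSERT INTO consumer_offsets (consumer, last_id) VALUES (%s, %s)
                   ON CONFLICT (consumer) DO UPDATE SET last_id = EXCLUDED.last_id""",
                (consumer_name, batch[-1][0]))
    conn.commit()
    return batch

conn = psycopg2.connect("dbname=streaming")  # hypothetical connection string
with conn.cursor() as cur:
    cur.execute(SCHEMA)
conn.commit()
```

No brokers, no rebalancing, no partition math: offsets are just a column, and the database’s transactions give you the delivery guarantees. It will not scale to Kafka-sized workloads, but on one big machine it doesn’t have to.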
Or take file drop boxes. Since time immemorial we have had this pattern where someone drops a file into a directory (e.g. using FTP) and some task then needs to pick it up and process it. It is easy to poll the directory regularly, but to make processing snappy the polling frequency had to be high, and for the longest time that was a waste of valuable resources. So instead, we implemented all sorts of cleverness using the inotify(7) API or its equivalents. Unfortunately, inotify has some annoying edge cases and it is surprisingly tricky to get right.
But with today’s fast computers, we can just poll the directory once per second using a Python script and not really notice any slowdown. The fast CPU makes the Python script fast enough, the fast SSD obviates the need to move disk heads around like a mofo, and lots of memory makes Python’s terrible memory usage a non-issue.
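For what it’s worth, the entire “dumb” solution can look something like this (the directory layout and the processing step are made up for the example):

```python
# A deliberately dumb file drop box: poll the directory once per second,
# process anything new, and move it out of the way. Paths are hypothetical.
import shutil
import time
from pathlib import Path

INCOMING = Path("/srv/dropbox/incoming")
DONE = Path("/srv/dropbox/done")

def process(path: Path) -> None:
    # Stand-in for the real work (parse, import, forward, ...).
    print(f"processing {path.name} ({path.stat().st_size} bytes)")

while True:
    for path in sorted(INCOMING.glob("*")):
        if path.is_file():
            process(path)
            shutil.move(str(path), str(DONE / path.name))  # don't process it twice
    time.sleep(1)  # once per second is plenty on modern hardware
```

(In real life you would also want to make sure the upload has actually finished before processing the file, but that problem exists for the clever inotify solution too.)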
Really fast computers allow for dumb solutions that serve most use cases well. So whenever someone calls for an advanced solution (most likely because they want to put this technology on their resume), look at the problem carefully to see if a dumb and straightforward solution doesn’t get you there as well!