Did you know there is also a podcast of this episode? You can find it on all high quality podcast platforms. Cannot find the podcast on your platform of choice? Ping me!
A side effect of Big Tech’s longevity, their unconstrained hiring from 2010 to 2022, and the layoffs since 2023, is that there are now a lot of people around whose resumes feature a stint at a “FAANG”. By and large, these are “big machine” people.
I count myself lucky that when I joined Google in 2006, not everything was amazing yet. We built binaries with carefully crafted Makefiles, we ran services on machines operated by the “babysitter” platform, and compiling and testing happened on an underpowered cluster of machines that was so overloaded that you had no hope in hell of getting anything done once the offices on the west coast of the US woke up and started submitting jobs. Bigtable was only just becoming available and Spanner was nothing more than a glint in the eye of Jeff Dean. Don’t get me wrong, it was already much better than anything I had ever seen, but compared to the Google technology of 2010 and later, it was kinda crappy…
Google went on to create one of the most phenomenal internal compute platforms that human ingenuity can imagine and build; a hyper advanced platform that can run services at almost unparalleled scale with a matching, and equally almost unparalleled, reliability and security. This “reduced” most Google engineers from people who build and run advanced infrastructure to people who operate services on top of a big machine. AWS, Facebook, and others built similar platforms, often modeled after what Google had wrought first.
Side story: Somewhere in 2018 I was the victim in a “Wheel of Misfortune” exercise.
For those of you not in the know, “Wheel of Misfortune” is a failure exercise in which a “dungeon master” presents a production problem that a victim then has to solve. The victim is not allowed to use their computer, but must instead instruct a “displayer” (who has a laptop and is projecting) on where to go, which buttons to press, and which commands to execute.
When the dungeon master presented their problem, I started giving instructions to the displayer: Go here, execute this command, check this log file, enter this expression into “Nebgua” (a browser plugin for constructing URLs for our monitoring system; the acronym stood for: “Never Edit a Borgmon Graph URL again” 🙂). My younger colleagues watched in disbelief and amazement. He is going where? He is doing what? He is issuing which commands? Eventually, another colleague intervened: “No, he is not trolling you, that is how we used to do things back in the olden days.”
My younger colleagues had turned into big machine people and the only thing they knew was how to operate the big machine.
An obvious problem with being a big machine person is that if the big machine is broken, it takes detailed knowledge about the big machine’s internals to figure out what to do now. It is for that reason that in every year’s disaster recovery training, I included a task for someone to roll out a new release without using our CI/CD system, pretending that some disaster had taken it out. Come to think of it, quite a lot of “DiRT training” exercises consisted of scenarios where a part of the big machine was “broken”…
Both for older and newer engineers, a common problem with big machines is that they stop these engineers from doing what they want to do, how they want to do it. I have literally spent hours scratching my head and figuring out how to get a big machine to do something that I knew perfectly well how to do using the basic tools available on the operating system, but I just couldn’t find the button, config file, dropdown box, or whatever UI element that the big machine’s product managers figured I should use for my operation. Half of the time, I eventually had to come to the conclusion that the big machine did not let me do whatever it is that I wanted to do, despite that being a completely valid state transition in the underlying system.
I do not necessarily like big machines very much…
Another problem with big machines is that they turn many engineers into big machine people. What I mean by that is that their entire expectation of the environment that they are operating in is formed by the availability of a big machine for Everything(™). When these people move to another company, they typically find a dearth of big machines and their tendency is to a) complain about it (saying things like “at Google…”) and then b) build one. Big machine people want to build big machines because they think that big machines are what every company needs and should want, regardless of the state of the company, regardless of the span of control of the people inside the company, regardless of the company’s financial possibilities, and regardless of the time horizon on which the company operates.
Many companies, including all startups and most scaleups, cannot afford a big machine or do not have the mindset to build and run one.
Building and running a big machine costs oodles of money and takes oodles of time too. Time and money that a company might not have or that, all things considered, it might want to spend somewhere else. Nothing wrong with that: many startups just try to make it to the end of the quarter and many scaleups are racing towards the break-even point in the most cost-effective way possible. A big machine might not help with that.
In my career, I have come across many big machine people in not-quite-so-big-machine companies. As is natural for them, whenever they see a problem, they want to fix it by building a big machine. They go on to write a big design doc, organize a big review meeting, and then try to assemble a (relatively big) team to realize the big machine in a big project that takes a big amount of time. These projects often fail in a big way because eventually someone figures out that a group of people is working on something that does not promise any value in the short term.
Many startups and scaleups do not have the stamina to build a big machine. That’s not because they are bad people, but because building a big machine requires a big sponsor and the organizations in these small and growing companies are often continuously in motion, with sponsors moving around and new executives coming in who do not see the value of the big machine. You really need a big, stable company to build a big machine.
When people ask me what working at Google was like, I often answer that Google is the Germany of engineering: Everything takes a lot of time to build, is hugely expensive, and when it is ready it is about three times as good as it needs to be. Google excels at building big machines.
For reference: When people ask me what working at Facebook was like, I answer that they are the France of engineering. Whenever they need to build something really good, like the TGV or the Ariane rocket, they are right on it. But most of the time they’re like: “Mweh, let’s finish this quickly and drink more wine” 🙂.
I am, at heart, a big machine person and I regularly need to pull myself away from the brink of writing a big design doc because I realize that right now we are not the people who build big machines for things that are not in our core activity of servicing user requests. I also regularly remind colleagues whom I suspect of being big machine people that they are better off proposing a simpler solution that we can build quickly and then later extend when it gets traction.
That last sentence is key. We might not want to build a big machine right here right now, but we need big thinking to make sure that the small machine we build is good enough and can be extended to become a slightly bigger machine. The absence of big machines does not mean terrible chaos; instead, it should mean a swarm of smaller machines that are just good enough to solve the problems we have and that we don’t want to live with.
Big machine thinking, combined with big execution, gave us the F-35. Big thinking but small execution gave us drones. There are definitely use cases for big machines, but as both Ukraine and Iran are showing us right now, a swarm of crappy drones can be quite effective. This also holds for CI/CD, quotas, rate limiting, and load balancing: A big machine works well for these problem areas, but I have seen many small machines that cost a tenth as much to build work almost as well and usually more than well enough. Remember, quality is fitness for use!
So, unless you are in a big machine company, resist your natural inclinations to design and build a big machine. Instead: Consider thinking big, but executing small!











