The unbearable lightness of being an infra engineer
Plumbing is essential and underappreciated
(Like this article? Read more Wednesday Wisdom!)
I was talking to my friend D. the other day. After we had gossiped about the company she works for and the people she works with, she asked me the following question: “Do you ever hear any frustration from people who are working on infrastructure and who would rather be software engineers?” After some back and forth, the question seemed to originate in the infrequently expressed but, in my experience, pervasive differences in appreciation for system engineers and for software engineers. Whence doth that come? And is it justified?
Given the very deep technology stacks we are working with, it makes sense that there is specialization between software engineers and system engineers because it is practically impossible to be an expert in everything that is going on, from the outermost software that directly interacts with the end user, to the firmware that runs in the FPGA on the NIC. Also, there seem to be different aptitudes involved as people naturally seem to gravitate to different parts of the stack.
In some cases, the increased need for specialization leads to a deeper than necessary divide between the system engineers and the software engineers. I started my career as a systems programmer on an IBM mainframe (a sort of SRE avant la lettre) and I was always amazed at how infrequently we talked with our software engineering colleagues and how little they seemed to know about the system their software ran on.
This regularly led to disasters, typically when the software engineers built programs that made incorrect assumptions about how the system worked. In one memorable disaster, a group of software engineers had spent months building a system that assumed that it was safe to write to VSAM files from multiple address spaces (processes) concurrently. Hint: It is not. This didn’t show up in their puny unit and integration tests, but crashed and corrupted our current accounts system on its first night in production.
Things have not gotten much better since then. At Google I once spent half an hour explaining to a software engineer that it is not possible to execute a two-phase commit between three data centers on different continents within 150 milliseconds and that this was not a problem with our network, but rather with a design issue in the universe called “the speed of light”.
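For the skeptics, here is a back-of-the-envelope sketch of that argument. It assumes that light in glass fiber travels at roughly two thirds of its vacuum speed, that the cables follow a perfectly straight path, and that a two-phase commit costs two full round trips to the farthest participant (prepare and vote, then commit and acknowledge). The distances are made up for illustration; real cable routes are longer and real servers need time to think.

```python
# Lower bound on two-phase commit latency between far-apart data centers.
# All distances are illustrative assumptions, not measurements.

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber moves at about 2/3 c


def one_way_ms(distance_km: float) -> float:
    """Time in milliseconds for a signal to cover distance_km of ideal fiber."""
    return distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR) * 1000


def two_phase_commit_floor_ms(distances_km: list[float]) -> float:
    """Two round trips to the farthest participant, ignoring all processing time."""
    round_trip = 2 * one_way_ms(max(distances_km))
    return 2 * round_trip


# Hypothetical coordinator with participants 6,000 km and 9,000 km away.
floor = two_phase_commit_floor_ms([6_000, 9_000])
print(f"Best case: {floor:.0f} ms against a 150 ms budget")
```

Even under these generous assumptions the best case lands well above 150 milliseconds, and no amount of network tuning changes that.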
My guess is that the deterioration of the appreciation for the system administrator set in when most organizations started running turn-key systems.
The early adopters of computers had to build their own systems, their own networks, and their own software. Hence, there was a need to have high-powered system engineers as well as high-powered software engineers, because without both of these there was really no way you’d get to an end-to-end solution. That changed in the late eighties and got going for realz in the nineties, when many organizations started buying complete systems that they didn’t build or program themselves.
In your mind, travel back to 1994, and imagine for a moment that you are the CIO of a mid-sized Dutch town that needs to do its financial and civic administration. You are obviously not going to buy your IT infrastructure in bits and pieces and assemble them into a working system on which you are then going to develop the necessary software. Instead, you are going to buy a medium-sized HP9000 preloaded with an Oracle RDBMS and the appropriate application software, after which you’re off to the races. The “only” thing you need to do (once the system is installed by the vendor) is keep it running.
Enter the “modern” system administrator. These were people with an aptitude for working with computers and who were trained to perform the daily operational tasks, such as running the backups, keeping the user administration, troubleshooting the printers, and (sometimes) upgrading the system software. These were usually not highly paid professionals. Once, at a cocktail party, I met the IT-manager of one of these mid-sized towns and he stated outright that their city couldn’t afford to hire great people for their IT-department. When I asked why not, he said: “Our town is this big, so the mayor earns this. That means that the city manager earns this; I, as a head of a major city department, earn this, and then my employees can earn at most this.” We were now squarely in the range of median incomes in the Netherlands, so I could only concur with him: “You’re right, nobody I know who is any good will take a job at that salary.” That is kinda sad because, apart from the fact that a well run IT-department is vital for the services the city offers its citizens, there were enough more or less interesting challenges for that city to solve, such as setting up highly available Unix clusters with shared SCSI disks and crappy logical volume managers (not for the timid) and running the X.25 network that connected all cities and the national government.
Once, during one of the more advanced HP-UX courses that I used to teach at the HP Education center in Amstelveen, I met a very bright young lad who was attending the course. I asked where he worked and it turned out to be this no-name town in the middle of nowhere in the Netherlands. When I asked him why he worked there, his answer was: “The training budget! They pay peanuts, but I don’t have a college degree and their training budget is nearly unlimited. I plan to be working there for two to three years and I am taking every course I can.”
The status of “system administrator” as a low-level hands-on operative was cemented in the Windows NT era, when lots of companies needed people to run their file servers, printer servers, domain servers, and whatnot servers. Information technology, once the domain of a handful of people with deep expertise, became a much broader field, with jobs and opportunities for a much wider range of expertise. This also kicked off a hausse in certifications. The pinnacle of this was the Microsoft MCSE program. These certifications did not come with high standing. Among ourselves we used to joke that MCSE stood for: “Must Consult Someone Experienced”.
This Dilbert comic really says it all…
At the start of the era, system administrators were still expected to write computer programs. In the Unix Fundamentals courses I used to teach, about ⅓ of the course was spent teaching people how to write shell scripts to automate common system administrative tasks. Armed with that knowledge, the good ones would teach themselves Perl and write more complex programs (that were typically indistinguishable from a baud rate error). That culture went out of the Windows with NT, which came neither with a decent built-in programming language, nor a system architecture that lent itself well to automation by its users. System administrators became power users of the system, instead of engineers building the system.
At the time my colleagues and I were so frustrated about this trend that we came up with the tagline “Niet klikken, maar denken!” (Not clicking, but thinking) for our company.
While all of this was going on, the Linux revolution was brewing. A merry band of hackers reinvented Unix and soon Linux systems found their way into a wide range of environments that required both the system and software engineering skills of yore. When the Internet started happening, Linux became the operating system of choice because of the unparalleled possibilities it offered to build cost-effective clusters of servers that were optimized for Internet use cases. Building these systems required many of the skills of the systems programmers from the olden days and so that job was reinvented too. Thus, Site Reliability Engineering was invented; an engineering specialty that sought to design and build systems (hardware and software) for optimal reliability at the lowest possible cost.
Google, who was first to define Site Reliability Engineering as a separate role, hired two types of Site Reliability Engineers: People with a system administrator profile who could also code really well and people with a software engineering profile who also knew a lot about Unix/Linux and networking. The former were called System Engineer (SE) SREs and the latter were Software Engineer (SWE) SREs. Due to the way that Google’s internal HR systems worked, the SE-SREs were placed on the O job ladder and the SWE-SREs were placed on the T job ladder. And, here comes a crucial fact: “Regular” Google software engineers were also on the T ladder.
So if you were a senior SE-SRE, you were deemed to be an O5; if you were a senior SWE-SRE, you were a T5, same as a “regular” senior software engineer at Google. During the normal course of an SRE team’s work, there was no difference between an SE-SRE and a SWE-SRE. I managed teams that had both types in them and it never made any difference, as everyone automatically gravitated to the work that a) needed doing, b) for which they had the skills, and c) in which they were interested.
Because of the well-known vagaries of any recruiting and hiring process, whether people were hired as an SE-SRE or a SWE-SRE was often somewhat random. If you were a “pure” SWE (meaning: No systems knowledge to speak of), you were interviewed as a regular Google SWE and you became a SWE-SRE, but if you had Linux on your resume and knew how to remove a file called “-f”, there was a non-zero probability that you would be interviewed and hired as an SE-SRE, even though you might be an algorithms wizard.
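For readers who have never met the “-f” puzzle: the catch is that rm reads a bare -f as its force option and not as a file name. The sketch below shows the two usual ways out, assuming a Unix-like system with a standard rm on the PATH. The helper name try_rm is made up for this illustration, and everything happens in a throwaway scratch directory.

```python
import subprocess
import tempfile
from pathlib import Path


def try_rm(args: list[str]) -> bool:
    """Create a file named '-f' in a scratch directory, run rm with the
    given arguments there, and report whether the file is gone."""
    with tempfile.TemporaryDirectory() as scratch:
        target = Path(scratch) / "-f"
        target.touch()
        # check=False: we only care about the file, not rm's exit status.
        subprocess.run(["rm", *args], cwd=scratch, check=False)
        return not target.exists()


naive = try_rm(["-f"])             # rm sees an option and no file operand
double_dash = try_rm(["--", "-f"])  # "--" ends option parsing
dot_slash = try_rm(["./-f"])        # the path no longer starts with a dash

print(naive, double_dash, dot_slash)
```

The naive attempt leaves the file in place, while both the double dash and the ./ prefix remove it.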
As I said, whether you were an SE-SRE or a SWE-SRE didn’t really matter in your daily work, but it mattered greatly in two important ways: 1) It determined which promotion committee your promotion packet went to and 2) it determined which other roles you could transfer into.
As an SE-SRE your promotion went to a committee consisting of people from the more operational roles in the company (if memory serves me right, these were the networking engineers, data center operations specialists, et cetera). If that promotion packet contained a lot of great software that you had written, that was certainly appreciated, but it didn’t necessarily match the job ladder description for the next O-level. Mutatis mutandis, if you were a SWE-SRE, your promotion packet would go to a software engineering committee and if your packet contained lots of really great systems and infrastructure work but was a bit light on lines of code, that was not looked favorably upon. During my time at Google I have seen quite a few frustrated engineers who had done great next-level work, but unfortunately not next-level work that matched nicely with the specifics of their job ladder (O or T).
You might also want to read: A promotional article, which talks specifically about setting yourself up for promotion by looking closely at the job ladder and choosing projects that match the things found there.
Things were even worse when you wanted to switch jobs. If you were a SWE-SRE, you could switch to every other SWE role in the company, regardless of the product or project (some exceptions applied for highly specific teams). Some of my SWE-SRE colleagues ended up working as regular software engineers for Google Travel, Chrome, or Google Maps. But if you were an SE-SRE who wanted to switch to a SWE-role, you needed to reinterview, even if, all things considered, your slotting as an SE-SRE had been random and regardless of the fact that you might have produced oodles of high-quality code.
You might wonder why this is such a big deal since these engineers were already in Google and had often already written a lot of code. How hard could it be to reinterview? The answer is that it was perceived to be humiliating to reinterview for a job that you were practically speaking already doing. Also, as we all know, there is a random luck element in interviewing and sometimes these SE-SREs failed the interview for the same reason that good engineers fail interviews all the time. Interviewing has become a mess, so people were not jonesing to submit themselves to that.
To be fair: Attempts were made to make this process easier and fairer, but inertia is the force that holds the universe together and, as Bones McCoy so aptly put it: The bureaucratic mentality is the only constant in the universe.
All of this made a lot of SE-SREs a bit grumpy and undoubtedly made a lot of them feel underappreciated, even if they had no desire to be sent to the protocol buffer mines as a “regular” SWE.
There is of course no need for that. From my own experience I can say that, even though underappreciated, working with infrastructure is extremely complicated and rewarding and requires very deep expertise. In essence this is the plumbing problem: We all expect it to be there and we all expect it to work well. We rarely give it any thought, until it is not working and then we are up to our eyeballs in shit. The problem of the infrastructure engineer is that if they do a great job, people mostly don’t notice because everything just works, and then these people wonder how hard it can be to put all of that together…
Answer: Very hard!
We definitely need a re-appreciation of infrastructure work. The complexity of today’s infrastructure is simply mind-boggling. The IBM mainframe MVS operating system that I used to work with is definitely not a walk in the park, but building a working system from today’s cloud primitives is at another level altogether. Even if you understand every component in isolation (quite a feat already), orchestrating all of these components into a coherent whole using the system management and Infrastructure as Code tools we have today is just nuts.
As an example: I have been fighting for three months now (though not full-time) setting up an instance of some self-hosted application on Azure using Terraform. It’s like nuclear fusion: The principle of it is easy enough to understand, but new problems pop up at every corner. Much like nuclear fusion, I have been proclaiming that it is really almost there now for weeks already. In my 1:1s with my manager we call that project: “The gift that keeps on giving…”
And let’s not forget cloud networking. Really, the less said about that, the better, but it is a bloody miracle that any packet ever gets to where it needs to be. Can I please have my grandmother’s TCP/IP back?
Anyway, to get back to D.’s question: Yes, I have definitely seen an underappreciation of infrastructure-related work and that apparently drives some of the infrastructure engineers to want to be software engineers. I don’t quite get that, because it’s not that all software engineering is fun; an incredible number of software engineers are down in the JSON mines or writing tests for old and undocumented codebases, both of which I associate more with forced labor in a penal colony than with a fun job in the tech industry. It may beat coding in Clipper, but not by a lot.
Actually, come to think of it, I didn’t mind Clipper at all. Maybe I would rather code in Clipper (or its open source version, Harbour) than in JavaScript, but it might be that there are some rose-tinted glasses at work…
Infra engineers, please don’t give up. Your work is important and appreciated! Also, all things considered, if you really want to write code, can you please write me a better Terraform or, even more important, a better Helm? You’ll be doing me and lots of people a big favor, I promise you!
God, I need better terraform. And no, writing code to generate terraform isn't better (looking at you, Pulumi and cdktf)
GCL/BCL, while weird, indeed beats any JSON or YAML approach by far.
We already had .ini files, so what's the point of yaml, really?