(Like this article? Read more Wednesday Wisdom! No time to read? No worries! This article will also become available as a podcast on Thursday)
Note: There is a bubble in the podcast production pipeline. I am traveling and am improvising to get a mobile studio together that will do justice to my dulcet tones. The plan is to release both last week’s article and this one tomorrow. Thank you for your attention; we will now continue with our scheduled programming.
I have an unfortunate fondness for hands-on work, especially for hands-on work that clears out tech debt. This is clearly a character deficiency that precludes me from reaching higher levels of the corporate hierarchy because, given a choice between attending a meeting about the management philosophy for the new year and fixing some spammy alerts, I will pretty much always choose the latter. Not because I do not value management philosophy, I really do, but because over time I have found most meetings about management philosophy to be useless and non-spammy alerts to be very useful.
I have never had to look far to find hands-on work. One of the unfortunate aspects of computer science is that when we are planning to build something, we have pretty much no idea what we are going to do or exactly how we are going to do it. Sure, we write design docs, hold meetings, and make technology choices, but we invariably build a wobbly pile of software that has many warts and which frequently oozes dark fluids out of the many cracks and crevices up and down the pile. It cannot be helped; that is the state of our profession. The remarkable absence of detailed ideas on what we are going to build is neatly matched by an abundance of time pressure. As a consequence, we just start doing things and make it all up as we go along.
That is what the agile methodology is, by the way: admitting that we are just making it up as we go along, packaged in a cool-sounding name complete with scrumbags, because “we are winging it based on what people told us most recently” doesn’t sound like something that should cost $500 an hour.
As a result we incur tech debt. Weird term really, tech debt. ChatGPT tells me that debt is “an obligation where one party (the debtor) borrows money or resources from another party (the creditor) with the promise to repay it in the future, often with interest.” In the case of tech debt, what did we borrow? And from whom? Did we borrow a bad design from future me and now I want it back with interest?
Anyway, since we are making it up as we go along, we are making myriads of bad decisions, sometimes knowingly, often unknowingly, and sometimes realizing that we will eventually need to address these. Depending on the size and location of the bad decision, the fix might either be a cool project to rewrite some part of the system or it might be a heap of hands-on work that requires the application of liberal amounts of polyurethane foam and flex seal liquid (two substances that, were they magically to disappear from the universe, would be found to hold all buildings together).
There is a nice correlation between the worlds of tech and finance related to whose problem fixing these bad decisions is and who will benefit from it. My dear old father always used to say: “If you are 50 thousand guilders in debt, you have a problem. If you are 50 million guilders in debt, the bank has a problem.” This relates nicely to the wonderful world of tech debt: If your distributed global workflow engine cannot cut it, that will become a coveted project that will most likely lead to people getting promoted. But if your bad alert setup sends a flood of pages that have no tangible resolution, that is for the poor suckers in the oncall rotation to deal with.
So tech debt is a thing and every technical organization I know has plenty of it. How should we deal with it?
A common approach is to do nothing. Not entirely nothing though, because there will be complaining in abundance, but at the end of that, nothing will be done and the only tangible result of sorts will be an hour spent in the operational meeting complaining about the wildfires that are raging up and down the tech stack. Teams that do nothing eventually drown in tech debt, and that most commonly results in the oncall being on fire. Inevitably, this leads to calls for two people being oncall simultaneously to deal with the onslaught, a strategy I call: Throwing blood against metal. This way lies madness, but it is surprising how long this kind of works, as long as there are fresh recruits that can be thrown at the machine. Executives will pursue this strategy for as long as it “works”, where “working” is loosely defined as there being no visible short-term impact to the bottom line (even though the attrition rate of team members may be atrocious).
Teams that do nothing rely on the goodness of individual engineers to relentlessly toil in the trenches, trying to improve things. Managers love these people because it allows them to get away with doing nothing for a longer period of time. Unfortunately, this love typically does not translate to official appreciation for the people fighting the good fight; they quite frequently find they cannot get promoted because they are not doing “next-level work”. Managers do not like drowning in shit, but that does not mean they appreciate the shit shovelers.
If you want to meaningfully split infinitives, uhhh, I mean address tech debt, that work will have to become OKRs for the team.
The best approach I have ever seen for dealing with tech debt was while I was at Google SRE.
I apologize for bringing the G-word up (again). One of the most annoying aspects of ex-Googlers must surely be that every sixth sentence starts with: “At Google, …” But, it must be said, Google did a lot of things right in their time. I am afraid it is no longer exactly that time, though I am sure there are many, many areas where Google is still a shining light compared to all other companies.
Some inspired Google SRE executive, probably Urs or Ben Treynor, recognized that tech debt was a thing and that tech debt does not get discharged fully in tech bankruptcy. So what they did was make grunt work an official expectation at every engineering level. This automatically meant that you could not get promoted without doing your fair share of grunt work, which meant that it started showing up in OKRs for teams and individuals. This ultimately meant that a lot of grunt work got done.
It cannot be overstated how well this system worked. It is of course completely obvious to the rationally inclined that it did, but it is remarkable that I have not seen this exact program in existence before or since. Making clearing tech debt a job expectation requires acknowledging that tech debt exists and that it is crucial for the longevity of the tech platform that it be cleared. These are two insights that are much rarer than you would expect. I have been in teams that were suffering from near tech-bankruptcy but that would rather discuss the “two persons oncall” solution than acknowledge that they might need to consistently (and quarter over quarter) spend 15% of their headcount dealing with the mess. This sad state of affairs usually has its origin in relentless pressure from the organization to build features. None of this withstands five seconds of rational analysis, but hey, if the world we live in valued rational analysis, it would be a vastly different world.
A hidden advantage of distributing grunt work evenly across the team is that it serves to improve knowledge acquisition and sharing. One of the reasons I like doing grunt work is that I find that it deepens my knowledge about the tech stack. Digging into the system to solve real problems or to extract metrics is a great way to figure out how it actually hangs together and that knowledge eventually helps me to design and implement the next generation of systems. Without grunt work, you might have a good theoretical and high-level overview of the system, but only grunt work can give you in-depth hands-on knowledge.
When it comes to solving tech debt and acquiring deep knowledge about the system, there is no substitute for grunt work! Tech organizations should do a better job recognizing this and have a plan in place to discharge the tech debt in a way that means that everyone does their fair share.
> So what they did was make grunt work an official expectation at every engineering level
IIRC Schrep (https://www.linkedin.com/in/schrep/) at Facebook also introduced a similar concept, dubbing it "Better Engineering", later renaming it to "Engineering Excellence". Every engineer had to do some BE work and show how their work affected the quality of the code base or systems in general.
Opinions are split as to how much that work made an actual difference come performance review time, but FWIW it was one of the semi-mandatory topics one had to address in their self-review (AKA performance axis).
"ChatGPT tells me that debt is". Graeber has you covered here: "Debt: The First 5,000 Years"