0:00
/

Does code review still matter?

Code reviews are like testing and daily flossing. And what about AI?

Did you know that Wednesday Wisdom is also a podcast! Find it on Apple Podcasts or on Spotify. Did you know that there is also a custom GPT called Midweek Muse that has access to all of Wednesday Wisdom? Did you know that all Wednesday Wisdom videos are also available on YouTube?

Quite a lot of Wednesday Wisdom articles start with the sentence: “When I joined Google in 2006…” followed by some anecdote. For this week’s article, I thought it prudent not to break with that tradition, so here we go…

When I joined Google in 2006, I considered myself a decent C++ software engineer, on account of having written some C++ code and having taught many C++ courses, including courses on how to code using the Microsoft Foundation Classes (which we nicknamed the “Microsoft Frustration Classes”). Google had a rigorous system of code review and because I did not have “readability” in C++ yet, I submitted my first changelist to a “readability reviewer”.

The concept of code readability at Google meant that (among other things) code written by someone without “readability” in that language was subject to stricter code review and approval requirements. If you had readability in the language, your code needed only an “LGTM” (Looks Good To Me) from a random colleague, but if you did not have readability, you needed an explicit approval from a colleague who did have readability in that language.

Brag: I have readability in C++, Python, Sawzall, Go, and Borgmon.

Like I said, I considered myself a good software engineer with a decent understanding of C++ already, so I was full of unwarranted confidence. When I got the changelist back, it came with dozens of comments, some of which were nitpicky (#naming and comment style), but quite a lot were actually very good and either went straight to correctness of any code that would depend on my code (e.g. declaring a private copy constructor using an internal preprocessor macro) or instructed me to use functions or classes from our homegrown “google3” library, instead of writing my own crappy implementations. It was a chastising moment, but since one of my mottos is “Adopt, adapt, and improve”, I quickly got with the program.

I was initially quite impressed with the code review requirements and quite a few bugs in my code were found because of diligent code review by colleagues who were experts in their fields.

My favorite code review comment of all time came on a five-line Borgmon changelist where the reviewer had written: “This changelist is amazing. You would expect it to work; it probably should work. But it doesn’t, and here is why…”, followed by an A4 size explanation of some arcane consequences of the polling interval, how time series line up, and how builtin functions would evaluate in the face of missing data points.

But here is what I quickly found out: Code review does not often comport well with velocity. In environments where high velocity is more important than five nines of reliability, the requirement that even a one-line code change needs to get reviewed sometimes adds days to the timeline of closing a ticket or solving a bug. Days that you do not have.

You would think that a one-line change would be easy to review, but here is the problem with code reviews: The approver becomes part of the set of people who take the blame if something catastrophic happens because of this code. By approving a change, you take responsibility for that change and really, in most companies, what is the incentive to take that risk? Better if someone else reviews it who has a better understanding of the system under change or who has an in-depth understanding of the well-known vagaries of the “Infrastructure as Code” setup. Receiving good code reviews is amazing, but giving good reviews is time consuming and exposes you to mostly unnecessary risk.

This is all fine and dandy if you are developing some system as a team and you are responsible as a team regardless, but if you are an atom writing some code, how do you get a random team member to sign on to take responsibility for your crappy code which might break the system? In a high-functioning team, this is easy to figure out, but in the dystopian post-apocalyptic chaos that is most organizations, this is a problem, which is why many people need to go begging for approvals on Slack in order to make progress on their projects.

It is a well-known fact that if you use a metric to measure some underlying behavior and then make meeting that metric an official goal, people start gaming the metric, because that is usually much easier than engaging in the downstream behaviors that then may or may not drive a “good” value for the metric. Something similar happens with code reviews: Configuring your version control system not to accept any changes without an approval arising from code review, makes people seek the approval but not necessarily the review. Many teams I know have a tradition of “stamps”: Meaningless approvals given without review so that the change can be merged and escape velocity reached. It is a good example of “form without function”: The configuration of the version control system might make the compliance people and auditors happy, but it is often without any practical merit.

Obviously, in an environment like that, removing the approval requirement would make total sense but this never happens. First of all, every rule has an ambassador and a fan club, and so every proposal to get rid of code reviews will run into stiff opposition from the people who truly believe that the problem to be fixed is that people do not do meaningful code reviews, despite the fact that this has not worked so far. The second problem is that the official existence of “approval after code review” is a “Cover Your Ass” measure; if the site goes down or some malicious code makes its way into the code base, the absence of an official code review policy will make you look like a troupe of clowns and might even bear down regulatory scrutiny as code review is usually legally required for systems that touch money or medical data.

So there we are, we should do code reviews, we want code reviews, but practically speaking very few people do meaningful code reviews.

The question on how (or if) to do code reviews is exacerbated by AI programming tools. As I wrote in an earlier Wednesday Wisdom article, I have written about five lines of code this year so far, though I merged tens of thousands of lines of code that were attributed to me. What used to be the coding part of my job has been replaced by talking to Codex and marveling at how much decent code it wrote in the space of mere minutes.

The first question to ask is whether I should look at the code that was generated. The easy answer is: “Oh my God, of course”, since it feels bad to implicitly trust a metric ton of code that was generated by a program without detailed input from a human. When I first started working with AI programming tools, I diligently looked at all the code in some detail, but as my confidence in the quality of the generated code increased, my attention waned because, as a friend of mine so eloquently puts it: “Laziness is the hidden energy that powers the universe”.

To be honest, I sometimes still look at the code in some detail, but only when I feel that the topic is so complicated or subtle that I consider it hard for the model to generate the correct code from a few lines of instructions and the existing code base. This review sometimes leads to small changes and on rare occasions it leads to me throwing away the entire branch and starting anew with improved instructions.

Here is an example of the latter: I recently worked on a piece of code that does something interesting in our Azure environment. Obviously (or maybe not, but anyway), we live in a multi-cloud world, so one of the tasks on my plate is to adapt the code to support our other cloud providers, starting with GCP. I started by writing a design doc. Right off the bat, this is already a decision that is not as obvious as it seems, because why not just point the model at the existing code and tell it to support GCP? Water under the bridge though, because I had already completed the design doc before that thought entered into my head. I pointed the coding tool to the existing code, gave it the design doc, and set it to work. It created a plan, which I lightly reviewed, and then it started plowing through the steps in the plan.

When I looked at the code it had generated, I saw lots of “if GCP” statements, which I did not like at all because it is not very elegant and would turn the code base into an incredible mess once I would add support for additional cloud providers. So, I deleted the entire branch and started over by first instructing the bot to extract all Azure specific code into a module with an abstract interface and then add support for GCP. That went much better already.

Or did it?

One of the things I am wondering about is why I care about that potential for spaghetti. Surely, it would make it really hard for any human to keep track of what is going on, but would it also make things hard for the bot? Is any human ever going to look at that code to debug or improve?

AI coding tools do not only excel at writing code, they also excel at pointing out what is going on when given a pointer to the code and a log file. Whenever I don’t understand something my code is doing, my first action is to throw the log file into the AI tool and have it explain to me why I am seeing what I am seeing. Given that the tool does the correlation between the log file and the code, (and does that quite well) am I really ever going to look at this code again with any amount of concentrated focus? Or is the bot really the only thing that will ever look at this code? Interesting question to which the answer is entirely not obvious.

On the off chance that I would ever want to make sense of the code, I wrote some generic instructions for the bot to always generate helpful comments and to use the programming language’s features for constants to refer to magic values using a descriptive name. That definitely helps me to make sense of the code, but does it matter for the bot? Given that our AI tools are based on large language models, maybe more language in the codebase helps activate the right artificial neurons, but does it? I really don’t know… Maybe I am just polluting the context with extra information that the tool does not need…

After I am happy with the code, it goes to the version control system where some human will need to approve it before I can merge. Now we get to the second set of questions: Should my colleagues really be expected to review the code and if so, what exactly should they review? Should they review the output of the bot, the intent (what was I trying to do), or the input (in other words: The prompts and maybe the ensuing discussion)? Again, I don’t know…

I recently got a code review from a colleague on a Codex-generated PR where they pointed out an invalid assumption of the nature “if a is set to b, then c must be set too”. Turned out not to be true, because, as I learned, when a is set to b there is a default for c that makes sense and that you might want to use too. Now the thing is, I cannot remember if I introduced that assumption into the context of the bot or if this is something the bot came up with all by itself. This is of interest, because it is an open question whether a review of my prompt would have triggered this particular comment from my colleague.

So in other words, code reviews, which were already under pressure, might have to change dramatically. When reviewing a human’s code, there is usually a fair amount of attention to subtle coding errors, like off-by-one mistakes and memory management errors like use-after-free. The bot probably gets these right 100 times out of 99, but it might be liable to make certain semantic errors because of lack of context.

Speaking of context: There is the interesting fact that an LLM integrated into the code review tool sometimes finds errors in code that was generated by that same LLM! That’s kinda weird, right, but it goes to the importance of the “right” context.

We surely live in interesting times.

Instead of reviewing code, review Wednesday Wisdom every week!

Discussion about this video

User's avatar

Ready for more?