Over the years, I have often come across questions for help in the support channel that go like this: “Hey, I am getting <this> error message. Is something broken?” How I react to these questions is highly dependent on my mood. Sometimes, rarely, I might add, I feel benevolent and dig into the problem and try to provide the requested support. However, more often than not, I get really annoyed about the obvious absence of useful information and the lack of debugging done by the person asking the question.
In more than one company my level of annoyance rose to such levels that I wrote a Wiki page called “How to ask questions, the smart way” and I would respond to underspecified questions with a link to this Wiki page. The last time I did that (write that page, I mean), it was pointed out to me that its title was somewhat insulting and so I changed it to something like: “How to ask questions the right way”. The point of the article stayed the same though: The better your questions, the more likely it is that you get help.
Seasoned Internet professionals will recognize the relationship to this original article by Eric S. Raymond and, to be honest, I copied more than one suggestion from that page. Eric’s well-known status as a bit of an idiot notwithstanding, that original article says a lot of good things.
There is a lot to be said about asking questions the smart way, but my Wiki articles (stored solidly on Intranets behind corporate firewalls) and Eric’s original article say it all really, so that is not what I want to talk about today. Instead, I want to talk about a major professional skill that will serve everyone in good stead, namely debugging. I often get questions from students or junior engineers on what they should study in order to advance their career, and every time that I do not answer: “Debugging!” I do the person asking the question a disservice.
The ability to debug is important for no fewer than three reasons.
Firstly, debugging skills allow you to solve your own problems. It is always better to be in charge of your own destiny and if you can unblock yourself, you are in charge of your own timeline. There is incredible value in that because if you have to ask others for help continuously, you are not going to be very effective. Also, you will be debugging your entire career. Junior engineers might think they run into problems all the time because they are not very experienced yet, but this is far from the truth. You will be debugging until the day you retire. Only yesterday I was debugging a problem where a managed disk I had attached to an Azure Linux Virtual Machine using Terraform was not available on the VM after it booted.
Answer: Ordering. Terraform creates the VM resource first and then Azure proceeds to boot it. The disk gets attached later and so it shows up in the machine’s device list (lsblk) when the machine is already running. In AWS that has not happened to me. Maybe it’s a race condition where I have been consistently winning the race there but not in Azure.
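One possible way to cope with this kind of ordering problem is to have a boot-time script wait until the disk actually shows up before trying to use it. The following is a minimal sketch, not what was actually deployed: the function name, the timeouts, and the device path in the usage note are all made up for illustration.

```python
import os
import time


def wait_for_device(path, timeout_s=120.0, poll_s=2.0,
                    exists=os.path.exists, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until `path` exists.

    Returns True as soon as the path is present, or False once
    `timeout_s` seconds have passed without it appearing.
    The `exists`, `sleep`, and `clock` parameters are only there so the
    function can be exercised without real devices or real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A provisioning script could call `wait_for_device("/dev/sdc")` (a hypothetical path; check `lsblk` on the VM for the real one) and only format or mount the disk when the call returns True.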
Second, debugging is how you learn things. More than half of the arcane things I know, I know because I debugged some problem that led to a “Wait, what now?” moment.
Here is an example: One day, I added a SoundBlaster SCSI card and a SCSI disk to my personal computer that was dual boot between MsDos and OS/2 Warp (yes, I am that old).
“Wait, what now?”, I hear you think, a SoundBlaster SCSI card? What is that? Way back in the day, personal computers did not come with decent hardware to generate high quality sound and so there was a healthy market in add-on cards that added that capability. SoundBlaster was the leading company in that market. One fine day, SoundBlaster decided to bring out a sound card that also had a SCSI interface on it. Why? Nobody knows, but of course I bought one :-)
My built-in hard disk had two partitions on it, a primary partition that contained MsDos and an extended partition that contained OS/2. After I added the SCSI disk, I couldn’t boot OS/2 anymore. When I disconnected the SCSI disk, it booted again. W the TF? I spent hours figuring out what was going on there. This debugging was not made any easier by the total absence of the Internet, which I couldn’t connect to because it didn’t exist yet. Eventually, after many experiments, I figured out what was going on. It turned out that MsDos and OS/2 assign drive letters in a different order!
In MsDos, drive letters get assigned by enumerating the disks and then the partitions on the disk. So the first partition on the first disk is C:, the second partition on the first disk is D:, and the first (primary) partition on the second (SCSI) disk is E:. In OS/2 this works differently: The primary partition on the first disk is C:, the primary partition of the second disk is D:, and the second partition on the first disk becomes E:. When OS/2 booted, the SCSI disk driver wasn’t loaded yet, so the second partition on the first disk was D:. Then the SCSI disk driver was loaded and OS/2 remapped the drive letters. Since OS/2 was booted from the second partition on the first disk, the drive letter of the OS partition changed and you can imagine how much havoc that can wreak.
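The two orderings described above can be modeled in a few lines. This is a simplified sketch of the description in this article only, not of the real MsDos or OS/2 logic; the partition names and both function names are invented for illustration.

```python
from string import ascii_uppercase


def _assign(partitions):
    # Hand out C:, D:, E:, ... in the given order.
    return {name: f"{letter}:"
            for letter, name in zip(ascii_uppercase[2:], partitions)}


def letters_disk_by_disk(disks):
    """Walk disk by disk, partition by partition."""
    return _assign([name for disk in disks for name, _kind in disk])


def letters_primaries_first(disks):
    """All primary partitions first, then everything else."""
    primaries = [n for disk in disks for n, kind in disk if kind == "primary"]
    others = [n for disk in disks for n, kind in disk if kind != "primary"]
    return _assign(primaries + others)


# Hypothetical layout matching the story: one internal disk, then a SCSI disk.
internal = [("msdos", "primary"), ("os2", "extended")]
scsi = [("scsi", "primary")]

print(letters_disk_by_disk([internal, scsi]))
# {'msdos': 'C:', 'os2': 'D:', 'scsi': 'E:'}
print(letters_primaries_first([internal]))
# {'msdos': 'C:', 'os2': 'D:'}
print(letters_primaries_first([internal, scsi]))
# {'msdos': 'C:', 'scsi': 'D:', 'os2': 'E:'}
```

The last two calls show the problem: with the primaries-first ordering, the `os2` partition is D: as long as the second disk is invisible and becomes E: the moment it appears.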
This is of course a completely random factoid and you might wonder how useful it is to know this. But the point is, debugging like that teaches you lots of things, not just the solution. I learned a ton about disk drives and partitions, driver loading order and its consequences, debugging operating system failures during critical phases of boot, et cetera. Every subsequent debugging I did of vaguely similar problems became much easier.
Finding the root cause after many hours of debugging also breeds confidence. Whenever I debug something now, I am confident that I am going to find the root cause. It might take a long time, but I know that I eventually will figure it out.
Third, debugging allows you to ask better questions and get better help. When you haven’t debugged, your question can at best be: “I am doing <this> and I am getting <that> error”. Mind you, it would already be great if you could specify <this>, because asking questions the right way means that you really need to explain to people exactly what it is that you are doing. But when you have debugged the problem and still don’t know what is going on, you are able to give even more relevant information. For instance, you would be able to ask: “I am doing <this> and getting <that> error. I checked the contents of ~/.aws/credentials and it contains a valid access key and secret.” That is already a much better question and it narrows down the possible root causes. Or you might end up asking another question altogether, like “Why can’t my laptop verify the certificate it gets back from kms-fips.us-east-2.amazonaws.com?” which might be the symptom underlying the error. Just making stuff up here, but hopefully you get what I am driving at.
At Google, when we interviewed Site Reliability Engineers, we explicitly interviewed for the presence of debugging skills. Probably not a shocker, but most people cannot debug worth sh*t. I would often ask the same question, one that I had debugged myself over many frustrating hours at a previous job. The scenario is that you are an engineer working on a web application that contains a feature to download reports in PDF. Some customers are getting the following error message:
Unable to download.
Internet Explorer was unable to open this site. The requested site is either unavailable or cannot be found. Please try again later
Go debug!
The root cause was this Internet Explorer bug with the ominous title: “Internet Explorer file downloads over SSL do not work with the cache control headers.”
Wait, what now? How dumb is that? Yes, Internet Explorer was that dumb!
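To make the bug a bit more concrete, here is a small sketch of a check that flags download responses likely to trigger it. It rests on an assumption that is not spelled out above: that the failure is triggered by `no-cache` or `no-store` in `Cache-Control` (and by `Pragma: no-cache`) on responses served over TLS. The function name is hypothetical.

```python
def risky_for_ie_download(headers, over_tls=True):
    """Return the response headers that are assumed to break IE downloads.

    `headers` is a plain dict of response header names to values.
    Assumption: only TLS responses are affected, and the triggers are
    no-cache / no-store in Cache-Control and no-cache in Pragma.
    """
    if not over_tls:
        return []
    problems = []
    for name, value in headers.items():
        directives = {d.strip().lower() for d in value.split(",")}
        lowered = name.lower()
        if lowered == "cache-control" and directives & {"no-cache", "no-store"}:
            problems.append(f"{name}: {value}")
        elif lowered == "pragma" and "no-cache" in directives:
            problems.append(f"{name}: {value}")
    return problems


response = {
    "Content-Type": "application/pdf",
    "Cache-Control": "no-store, must-revalidate",
}
print(risky_for_ie_download(response))
# ['Cache-Control: no-store, must-revalidate']
```

A check like this is only useful once you already suspect the headers, which is exactly why the interesting part of the exercise is how you get to that suspicion.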
Like most interview questions, I am totally not interested in whether you actually find the answer or not. What I am interested in is if you show some solid debugging skills.
First of all: Do you have a good mental model of how the system hangs together? As you can see, the question is hopelessly underspecified, so the candidate needs to ask me questions about the infrastructure. Are there load balancers, reverse proxies, firewalls, or web caches involved? What are the web and application servers that are used? I would provide all of this information on request, but the candidate needs to ask the questions, thereby showing a decent grasp of how applications like this hang together. Then, can you methodically work through that infrastructure to isolate the problem in a specific layer? Can you formulate a logical hypothesis on what the problem could be and then come up with an experiment to test that hypothesis? Can you use tools like logs and TCP traffic capture to help you? How are you dealing with TLS in the traffic capture?
When I was debugging the original problem I wrote a tool called sslsmurf to help me debug HTTPS traffic. Total man-in-the-middle attack. Later, somebody wrote the Charles reverse proxy for the same goal.
Turns out most candidates cannot logically reason about the system and methodically work their way through it. Instead, they use a “Monte Carlo” approach of proposing random problems: “Maybe it is the DNS?” “Well, yes, maybe it is. What is your hypothesis and how are you going to test it?” When it turned out it was not that, they would propose: “Maybe the firewall is configured incorrectly?” Well, yes, maybe it is…
Especially in an interview setting this is a terrible approach. In real life, you might randomly find the problem and that’s your lucky day. But in an interview setting, with the interviewer in the role of the dungeon master, you will never find the root cause this way, because even when you randomly mention the actual root cause of the original problem, I will not give it to you. Your job as the candidate is to talk me into a corner where I cannot deny that you found the problem without violating something I said earlier 🙂.
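The methodical alternative to the Monte Carlo approach can be sketched as a list of layers, each with an explicit hypothesis and an experiment that tests it, walked in the order a request travels. This is only an illustration of the idea: the layer names, the hypotheses, and the lambda experiments below are stand-ins, not real probes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Check:
    layer: str
    hypothesis: str
    experiment: Callable[[], bool]  # True means the layer behaves as expected


def isolate(checks: List[Check]) -> Tuple[Optional[str], List[Tuple[str, str, bool]]]:
    """Run the experiments in order and stop at the first layer that misbehaves.

    Returns the name of that layer (or None if every check passed) plus a log
    of what was tested, so the findings can go straight into a help request.
    """
    log = []
    for check in checks:
        healthy = check.experiment()
        log.append((check.layer, check.hypothesis, healthy))
        if not healthy:
            return check.layer, log
    return None, log


# Stand-in experiments; real ones would inspect logs, traffic captures, etc.
checks = [
    Check("dns", "the hostname resolves for affected customers", lambda: True),
    Check("load balancer", "requests reach the web servers", lambda: True),
    Check("application", "the server returns 200 with a PDF body", lambda: True),
    Check("browser", "the browser saves the response it received", lambda: False),
]

culprit, log = isolate(checks)
print(culprit)
# browser
```

The value is less in the code than in the log it produces: every entry is a hypothesis that was actually tested, which is what a random guess never gives you.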
Debugging is a methodical treasure hunt: You need to start with a map and then follow all the clues. It is a bug hunt where you need to corner the bug in a location from which it cannot escape. When debugging software problems I often end up adding print statements to the code to be able to validate exactly what the flow and state of the data is at a given point, in order to narrow down what is going on. Obviously, modern software stacks make this harder, but I am not above forking a GitHub module and adding debugging logic to help me narrow down the problem. In highly parallel multi-threaded programs where problems are often triggered only by very specific sets of circumstances, you might even have to instrument the program with extra debugging threads and bring it into production to have it spy on what is going on. My former colleague and friend Bubble (ask him about this nickname, it’s a fun story) once spent weeks debugging an intermittent segmentation fault in a high-QPS C++ binary. Like most good bugs, the segmentation fault did not happen directly after the statement that introduced the problem, but some random number of instructions later as a result of some good old-fashioned memory corruption.
Root cause: One particular method in a C++ class would sometimes delete the method’s receiver using “delete this”. If the caller didn’t know and would continue to access the memory after calling the offending method, memory corruption would ensue. Here is a three step program to prevent problems like these: 1) Never use “delete this”, 2) Test your C++ binaries under an address sanitizer, 3) Switch to Rust.
It is of course totally okay to ask for help when you are faced with an intractable problem. But if you ask a (badly formed) question and I give you the answer, what have you learned other than that I am the guy to ask hard questions to? It is true, I am the guy to ask hard questions to, but I don’t scale very well so for an organization as a whole that is not a great lesson learnt. If you do the debugging work involved to unblock yourself and then ask for help and I give you the answer, you have achieved three things: You learnt something, you made me more efficient (because it is easier for me to answer well-specified detailed questions), and my answer will actually teach you something.
So, learn how to debug. It might seem slow and inefficient to spend time on something if there are other people in the organization who could answer the question faster, but that is a short-term solution only. Debugging is how you learn your craft and it is better for the organization if there are more people who are better at their craft. And how do you think I learnt how to debug your problem? Not by asking someone else all the difficult questions!
If you're going to build anything of more than passable complexity it's probably time to accept you're going to spend more time debugging than creating. Nearly a decade ago my manager, the great John Kennedy (not that one ;-) ), encouraged me to read this: https://debuggingrules.com/. It's amazing in every way and the suitably cliparted poster has been sat on my wall ever since. Seems it's behind an email wall now but I still highly recommend it.