(Like this article? Read more Wednesday Wisdom!)
Some time ago I was involved in a post-mortem that hinged on a subtle bug that had surprising depth in terms of coding and data structure design. It made me think about how we make very small decisions every day that influence the robustness of our code.
Here is the scenario: Our code had to calculate some bonus amount that depended on how long you had worked for us. We had a table that indicated the bonus amount for a range of months. The bonus started at zero, but if you had been with us for 3-6 months you started qualifying for some bonus amount N; after 6-9 months the bonus increased to K (with K > N) and so forth. After two years you would reach the maximum bonus amount.
The question is: Given a tenure in months and that table, calculate the bonus amount.
This is of course so simple that it can hardly function as an interview question. That said, we still managed to screw it up, though not in the way you might guess.
The first thing to do here is create the table with the bonus amounts per range of months. The developers had chosen to implement this as follows:
public class Bucket {
int startMonth;
Integer endMonth; // Note: Integer, not int!
double bonus;
}
This is in Java and the Java proficient among the readership will notice that the endMonth is not an int (a native data type), but an Integer (an object), which (and this will become significant) can be null.
The reason for this design decision was that you need a way to encode the fact that after two years the bonus amount does not change anymore. The last bucket in the table would have a starting month (say 21) and no endMonth (null). Then the code what did the calculation would interpret this missing endMonth as infinite.
A function to calculate the bonus amount given an array of these buckets could look like this:
public static double getBonus(List<Bucket> list, int month) {
return list.stream()
.filter(el -> month >= el.startMonth
&& (el.endMonth == null || el.endMonth >= month))
.findFirst()
.map(el -> el.bonus)
.get();
}
Simple. We could ask any newly graduated candidate to write this.
However, there are some subtle issues in this code already. The first one is that we need to check whether the endMonth is null in order to deal correctly with the last entry in the list. The second issue is the comparison between month and endMonth. Does this have to be < or <=? In other words, is endMonth inclusive or not? There is no way to know this without additional documentation in the form of a comment or a design doc. You better write some unit tests for this.
The nitpicks under us will also remark on the order of the comparisons. In this case we check if month >= el.startMonth and if el.endMonth > month. Maybe it’s me, but I find that a bit hard to read and when editing this article it confused me already. Typically, I prefer to write el.startMonth <= month && month <= el.endMonth to make it obvious that we are testing whether month is in a range.
To top it off, this code does a get() on an Optional<Bucket> without testing if we actually found a bucket. If there is no matching bucket, this code will throw an exception, which is not terrible, but not amazing either. It is also not entirely obvious from glancing at the code, so again, you better write a comment to inform the casual reader.
Another issue is that this data structure does not guarantee two things that are probably important. Firstly, there should be no gaps in the table, but this data structure does not guarantee that. For instance you could have a bucket with startMonth 12 following a bucket where the endMonth is 10. What is supposed to happen in month 11? What happens in case of a gap is dependent on how you write the bonus determination function. For instance you could have written it like this:
public static double getBonus2(List<Bucket> list, int month) {
return list.stream()
.filter(el -> el.endMonth == null || el.endMonth >= month)
.findFirst()
.map(el -> el.bonus)
.get();
}
Because this version of the code does not use startMonth there can be no gaps, but now the list of buckets has to be sorted or we get incorrect results. Again: Better write some unit tests for this and check the input table for this (now) important property.
Another (minor) issue with this code is that the data structure does not guarantee that the bonus amounts increase with longer tenure. There is also no protection against the buckets overlapping.
In the post-mortem that I was in, the code had not correctly determined the bonus amount and that had led to pay defects. However, this was not because of a mistake in the algorithm, but because of a subtle problem with the table. We had serialized and deserialized the table a few times as it was making its way through the infrastructure and somewhere in that process the endMonth “null” had its value changed to 0 (courtesy of some serialization library)! The function we used to determine the bonus amount was not resistant against that input and hence had returned an incorrect bonus amount. Cue post-mortem meeting..
The big problem with the design is that it does not invite robust implementations of the functions that use the table.
In my view a better implementation of the table would be this one:
class Bucket2 {
int periodInMonths; // Length of this period in months.
double bonusIncrease; // Bonus top up from the previous period.
}
You will notice that there is no startMonth or endMonth anymore. Also, instead of a bonus amount, we only capture the bonus increases. The list has to be sorted in order to make any sense, but that sorting has to be done by the person creating the table. In this data structure there can be no gaps and no overlaps. There is also no special last entry; once the bonus increases run out, we have the bonus.
Finding the bonus can be done as follows:
public static double getBonus2(List<Bucket2> list, int month) {
double bonus = 0;
Iterator<Bucket2> iter = list.iterator();
while (iter.hasNext() && month > 0) {
Bucket2 el = iter.next();
month -= el.periodInMonths;
bonus += el.bonusIncrease;
}
return bonus;
}
Notice that this method is O(n). Algorithm purists might say that in the original implementation you can binary search through the list to find the right bucket, which is of course true. However we are focusing on correctness here, not on algorithmic efficiency. Also, in order to do a binary search efficiently you need to be able to index the list efficiently, which is not guaranteed for the List interface (which might be implemented through a linked list).
By getting rid of the special value null to encode the end of the table and by encoding the period and bonus as increases from the previous entry, we ensure that whomever processes the table has fewer edge cases to deal with and that there is a smaller risk surface for the kind of null → 0 bugs that created the production problem that made me think about these matters in the first place.
While programming you make small but important decisions like this all the time and each of them really deserves some thought about what could go wrong. The original design looks fine and probably would pass muster in an interview and during code review, but it also made the production problem possible.
This is why I get few code reviews because I harp on issues like this a lot :-)
Another of my pet peeves is the overuse of strings to contain data of all sorts of structured types, like dates.
Recently I was involved in a project where we had date fields stored as strings in no fewer than three different formats: “yyyy-mm-dd” (a.k.a. sane), “mm-dd-yyyy” (a.k.a. insane; don’t ask :-), and “mm/dd/yyyy” (US). We had to compare these dates all the time to figure out if one was before the other or not. If you do not immediately convert the string to a LocalDate object (or one of its friends, Java's date/time libraries are really terrible) you are asking for trouble. Before we did that I had already one unit test I had written fail on me with following error message:
org.opentest4j.AssertionFailedError: expected: <03/01/2023> but was: <2023-03-01>
One of the problems here is that the specific trouble that this pattern causes might not come for the original software developer. If you write code like this:
... findFirstManagerChangeBefore(String date) {...}
you probably know what format the date argument is in because your head is in the code. But how is the code reviewer going to know? And what about that hapless L3 who is going to implement a minor feature request next year?
Programming is full of pitfalls like this. I read a Slack thread yesterday which asked the following insightful question: “If we call this API and we get the country back, that's a 3-letter ISO-3166-1 alpha3 (3 letter) country code, right? Great question, but who knows? Kinda important to know though :-)
Paying attention to these kinds of details while specifying an API and creating data structures will help your colleagues write robust code and prevent bugs. Taking the time to make the right decisions here is worthwhile and will reduce the number of post-mortem meetings that I have to be in. And in the end, that is really all that matters to me 🙂
Sooo, since you took the liberty of lambasting Python the other day...
It appears that the real root cause for this mess you got into might be that the design of your programming platform of choice (or lack therof more like, perhaps some kind of “wisdom” of the crowds deal?) is fraught with amateur-level mistakes (such as nullable references, as seen here; or method dispatch by string name #owmyperformance) and outright bizarre ventures into pointlessness (such as checked exceptions or that whole “dependency injection” nonsense). General Derringer, from the movie War Games, would have some choice words for you in the fifth act... And no, putting the onus on your colleagues to waltz around your platform's pitfalls is not a solution — because it doesn't scale, for starters.