(Like this article? Read more Wednesday Wisdom!)
Here is something I often wonder about: how is it that the same humanity that can build something as hideously complex as an airplane, a submarine, or a cruise ship fails to correctly plan a simple software project?
This question was prompted by two things that happened this week:
I went on a cruise with my daughter.
The Guardian ran a news article about the UK city of Birmingham being technically bankrupt.
What do both of these have to do with project planning, you might ask? Fair question; let me expound.
Whenever I go on a cruise I am amazed at the size and complexity of cruise ships. These are humongous devices full of intricate details that all have to be done well, because otherwise the thing won’t sail, won’t go where it is supposed to go, or won’t have food on board. There are many thousands of things that need to go right and happen more or less on time in order to get this whole exercise together. The same goes for airplanes. The fact that we can construct something as complicated as an A321neo (which is what I am sitting in as I am writing this) and get it in the air as safely as we do is nothing short of a miracle.
Then, on the other hand, we have software projects. The aforementioned Guardian article about the sad state of Birmingham’s finances contained a number of reasons why they are in such dire straits, among which are budget overruns in the implementation of an Oracle system for payments (I guess: Oracle Financials) and HR (I guess: PeopleSoft). This project ended up costing more than six times the planned budget and currently clocks in at £130m (about $166m at the time of writing, but with Brexit continuing to deliver on its promises, I expect it to be at parity soon).
Of course the city’s executives did not mention the cost of their Oracle implementation, but instead shamefully pointed to their obligation to pay women equally to men, thereby showing that the desire to return to the 1950s seems to be present on both sides of the Atlantic Ocean.
Source: A Birmingham statement indicating that “In June the Council announced that it had a potential liability relating to Equal Pay claims in the region of £650m to £760m, with an ongoing liability accruing at a rate of £5m to £14m per month.”
Incidentally, for less than their new Oracle system, Birmingham could have bought an A321neo, which currently has a list price of about $121m. With the remaining $45m they could have their pick of many luxury accessories as well!
Anyway, here we are: We are a species that can build and operate cruise ships and construct safe airplanes with approximately 340,000 parts, but we cannot plan the implementation of a new payments and HR system.
Fortunately for Birmingham, they are not alone. In about 2012, the Dutch court system figured out that their entire process consisted of pushing paper documents around and they thought that maybe computers could help with that. The resulting KEI project was originally budgeted at around €7m and was finally canceled in 2020 when it had cost, wait for it… €220m, though apparently software development had “only” cost about €100m.
W the flying F?
Don’t we know any better?
Turns out we do!
Next year will be the 50th anniversary of the Bible of Software Engineering: Fred Brooks’s timeless classic “The Mythical Man-Month”.
The reason it is called the “Bible” of software engineering is that, much like the other book, everybody quotes it, some people have read it, and only a handful of people actually live by it…
Brooks’s book contains many wisdoms that, the aforementioned sidebar notwithstanding, are not as widely known as they should be. His best known statement is probably that “adding more manpower to an already late project makes it later”, which is one that I see breached on an almost weekly basis in discussions about how to meet launch dates. A less well known, but in my opinion particularly astute, observation is that there is a lower limit to the number of errors in a complex system. According to Brooks, any attempt to push the number of observed errors below that limit will result in the introduction of new errors.
I have a similar theory about the rate of errors in a distributed system. Simple measures like retries reduce the error rate by dealing with the odd lost packet, but then mount a spectacular DDoS attack on the downstream system in case of a systemic failure. With excessive retrying you have traded an error here and there for a huge spike of errors somewhere else. As a result, the area under the error curve stays pretty much the same.
(Please note that Brooks does not say there is an upper limit to the number of errors; something we probably all know from personal experience).
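To make the retry trade-off concrete, here is a minimal Python sketch; `call_downstream` is a hypothetical stand-in for any flaky RPC. An uncapped retry loop turns every failing caller into several callers exactly when the downstream system is already in trouble; capping attempts and backing off with jitter bounds that amplification, though, per the theory above, the errors themselves do not go away:

```python
import random
import time

def call_downstream(request):
    # Hypothetical stand-in for a flaky RPC: fails 30% of the time.
    if random.random() < 0.3:
        raise IOError("downstream unavailable")
    return f"ok: {request}"

def call_with_retries(request, max_attempts=3, base_delay=0.1):
    """Retry with a hard attempt cap, exponential backoff, and jitter.

    A naive loop that retries forever makes every caller a participant
    in a self-inflicted DDoS once the downstream system has a systemic
    failure. The cap and the backoff bound the extra load; the jitter
    prevents synchronized retry waves.
    """
    for attempt in range(max_attempts):
        try:
            return call_downstream(request)
        except IOError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error, stop piling on.
            # Exponential backoff with full jitter.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```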
One thing that Brooks does not touch upon is suboptimal project planning, though he does cover “progress tracking” and observes that large software projects get to be one year late one day at a time, as incremental slippages accumulate to produce a large overall delay.
Fast backward to 1987 and a compulsory undergraduate class in Operations Research. One of the topics in this class was the Critical Path Method (CPM, not to be confused with CP/M, the operating system).
CPM is in essence a bit of graph theory used for project planning. The basics are quite simple: Your project is a Directed Acyclic Graph (DAG) of activities, each of which has one or more predecessors that it depends on and an expected duration. Working backwards from the desired launch date, you can calculate the “latest start date” for each activity, which is the last day on which this activity can start without delaying the launch. Similarly, once you start working, you can also calculate the “earliest start date”, which is the first possible date an activity can start on, assuming that all of its predecessors end at their earliest end date.
People who hang out at LeetCode on a regular basis will have few problems writing the code to calculate these dates.
In such a graph, you can calculate the critical path: the sequence of activities (a path through the graph) in which an increase in the duration of any activity by even a single day immediately moves the earliest start date of the terminal node in the graph. In other words: if any of these activities gets delayed, the entire project gets delayed.
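For the LeetCode crowd mentioned above, here is a minimal Python sketch of both passes plus the critical path itself. The activity names and durations are made up for illustration; slack (latest start minus earliest start) is zero exactly on the critical path:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def critical_path(activities):
    """CPM core: activities maps name -> (duration, [predecessor names]).

    Returns the project duration, the earliest and latest start dates,
    and the set of critical (zero-slack) activities.
    """
    deps = {name: preds for name, (_, preds) in activities.items()}
    order = list(TopologicalSorter(deps).static_order())

    # Forward pass: an activity can start once its slowest predecessor
    # has finished at its earliest finish (earliest start + duration).
    es = {}
    for name in order:
        _, preds = activities[name]
        es[name] = max((es[p] + activities[p][0] for p in preds), default=0)
    project_end = max(es[n] + activities[n][0] for n in activities)

    # Backward pass: an activity must finish by its most impatient
    # successor's latest start; latest start = latest finish - duration.
    succs = {name: [] for name in activities}
    for name, (_, preds) in activities.items():
        for p in preds:
            succs[p].append(name)
    ls = {}
    for name in reversed(order):
        dur, _ = activities[name]
        latest_finish = min((ls[s] for s in succs[name]), default=project_end)
        ls[name] = latest_finish - dur

    critical = {n for n in activities if es[n] == ls[n]}
    return project_end, es, ls, critical

# Made-up example: durations in days.
acts = {
    "design":   (10, []),
    "backend":  (30, ["design"]),
    "frontend": (20, ["design"]),
    "testing":  (10, ["backend", "frontend"]),
}
end, es, ls, crit = critical_path(acts)
# end == 50; crit == {"design", "backend", "testing"}: frontend has
# 10 days of slack, everything else delays the launch day-for-day.
```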
My personal to-do list prioritization scheme is based on the Critical Path Method. It works as follows: I make sure never to be in a critical path. If I find out I am, I immediately start working on the tasks that get me out of the critical path as soon as possible.
As a 21-year-old hacker I looked at CPM and it immediately made lots of sense to me. I had next to no practical experience in multi-person software engineering projects, but this methodology seemed to be the only way to make sure that something, anything, gets done on time.
Consequently, like many other things I think make lots of sense, I have never seen it used consistently or effectively in any project I ever worked on.
This is mind-blowing to me. We suffer tremendously from delays and we know from Brooks (and experience) that projects get delayed one day at a time. Maintaining the DAG of activities and their expected durations on a day-by-day basis, nipping delays in the bud as they happen and compensating for them at the earliest opportunity, seems like table stakes in project management to me. Instead, I see spreadsheets with linear lists of activities and expected durations, but no way to determine what depends on what and what can be done in parallel. For this reason, a surprising number of projects find out that a launch is delayed only days or weeks before the prospective launch date.
This lack of proper planning also manifests itself in rushing the final few phases of the software engineering lifecycle, which are typically testing, developing operational control, and the launch itself. I have regularly been in discussions with irate software engineers who accused me of delaying their launch because we needed two weeks to do some proper launch hygiene after they merged their final commit. These discussions inevitably ended with me explaining that if it took them six months to code it, I could take two weeks to launch it, and if they wanted it launched earlier, they should have coded faster.
Once you have your DAG of project activities, there are all sorts of other good things you can calculate, like the impact of losing a person on the project, or the effects of changing expected durations. For that reason, all decent project management software packages support the critical path method in one form or another.
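To illustrate with the `critical_path` sketch from earlier: a what-if analysis amounts to copying the activity table, changing one number, and rerunning the computation.

```python
# What if the frontend work slips by 15 days? Rerun the sketch above.
acts_slipped = dict(acts)
acts_slipped["frontend"] = (35, ["design"])
end2, _, _, crit2 = critical_path(acts_slipped)
# end2 == 55: the slip ate frontend's 10 days of slack and then some,
# and the critical path now runs through "frontend" instead of "backend".
```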
I haven’t quite figured out why we are so averse to using a decent planning method. I know planning and estimating is hard, but that is why they pay us the big bucks. I know the critical path method sounds terribly like waterfall and seems at odds with the agile project methodology, but most organizations don’t want to do proper agile anyway. Sure, they want to pay lip service to agile, and have sprints and all that, but they also want you to tell them what you are going to launch a year from now.
So instead we plod on using the almighty spreadsheet and the power of its built-in SUM() function. And consequently, projects will continue to be late and surprisingly more expensive than originally thought. Of course the Critical Path Method is no silver bullet. There are lots of other things that also need to be done right, but if your core planning math does not afford early warning about delays, then you have no hope in hell of ever getting your project out of the door on time.