(Like this article? Read more Wednesday Wisdom!)
EHLO. This week’s Wednesday Wisdom comes with a call for reader participation! Do you have a story of some awesome work you did that never saw the light of day? Or of a great design doc you wrote that somebody butchered in the implementation? Tell me and it might feature in an upcoming edition of Wednesday Wisdom! Doesn’t have to be long, a single paragraph is enough.
One of the frustrations that every software engineer has to deal with is that a lot of the great work we do never sees the light of day. Lots of reasons for that, but today I want to tell you a story about some great software I wrote that was a political failure and that got lost between the cogs of team reshuffling.
The story starts in 2010 when I stepped down from managing Google’s Production Monitoring team to become an individual contributor again (a story I mentioned in an earlier episode of Wednesday Wisdom). After a brief team search, I joined GTape SRE, which was the team at Google that ran the tape backup infrastructure. “Wait, what now?”, I hear you think. “Google backs up to tape? What is this, the 1980s?” Yes they do. Or at least: We did. I am not well positioned to talk about what Google is doing at the moment, but back then we were backing up every bit of data onto tape because, all things considered, that was the cheapest option. On top of that, it is really easy to move tapes off-site for secure storage. So yes, maybe it was 1980s-style tech, but not everything that happened in the 80s was bad 🙂.
To do them backupz, every region of the Google cloud was outfitted with a large number of massive tape libraries that could house dozens of tape drives and thousands of tapes. This was a 24x7 business: The drives were spinning constantly, because Google has a lot of data and more is being added all the time. In fact, the drives were spinning so much that we kept breaking MTBF records (in a bad way), because those mean times are calculated for a typical customer who spins the tape drives only a fraction of the time, whereas we kept them busy around the clock.
To keep track of what data ended up on what tape, we had a media database that was stored in Bigtable. Obviously it would be bad if we lost that information (even though the contents of that table could in theory be restored by reading every tape’s catalog, something we’d rather not do). So, in order to safeguard that data, we replicated the table to other regions and of course we backed it up on tape. The media management table was one big table that contained the information about all the world’s tapes; that one table was replicated into every region using Bigtable’s amazing multi-master replication.
The problem with multi-master replication is that it is asynchronous and out of order. This means that when you are reading a Bigtable that is being replicated into by another Bigtable, you do not see a consistent view of the data. We dealt with this problem by accessing a region’s tape records only from that region. To be a bit more precise: Every region had a primary Bigtable cell that contained all the data and the GTape software in that region accessed only its own records and only through that Bigtable cell. This solved the problem of consistency, because Bigtable has some narrow consistency guarantees as long as the reader and the writer are accessing the same cell.
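For the readers who like to see this sort of thing in code: here is a minimal sketch of that access pattern. Everything in it is invented for illustration (the class names, the cell paths, the --primary_cell flag); the real GTape code and the real Bigtable client API look nothing like this. The only point it makes is that every reader and writer in a region goes through the same, flag-configured primary cell.

```cpp
// A minimal sketch of the "one primary cell per region" access pattern.
// All names and paths are made up for illustration.
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Stand-in for a single Bigtable cell (one replica of the media database).
// Replication to the other cells is asynchronous and out of order, so only
// reads and writes against the *same* cell see a consistent view.
class BigtableCell {
 public:
  explicit BigtableCell(std::string path) : path_(std::move(path)) {}
  void Write(const std::string& row, const std::string& value) { rows_[row] = value; }
  std::string Read(const std::string& row) const {
    auto it = rows_.find(row);
    return it == rows_.end() ? "" : it->second;
  }
 private:
  std::string path_;
  std::map<std::string, std::string> rows_;
};

// Every GTape process in a region is started with the same (hypothetical)
// --primary_cell flag, so they all agree on which cell to talk to.
class MediaDb {
 public:
  explicit MediaDb(const std::string& primary_cell_path) : cell_(primary_cell_path) {}
  void RecordBackup(const std::string& barcode, const std::string& catalog) {
    cell_.Write("tape/" + barcode, catalog);
  }
  std::string LookupTape(const std::string& barcode) const {
    return cell_.Read("tape/" + barcode);
  }
 private:
  BigtableCell cell_;  // Never touch another region's records or another cell.
};

int main() {
  MediaDb db("/bigtable/zurich/gtape-media");  // value of --primary_cell
  db.RecordBackup("ZRH001234", "full backup of corpus X, 2010-11-03");
  std::cout << db.LookupTape("ZRH001234") << "\n";
}
```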
However, disasters and Bigtable maintenance happen, so every now and then we found ourselves in the situation that a Bigtable cell was going to be unavailable for a few hours (or days). To deal with that we had a manual procedure to switch a region to another primary Bigtable cell. Here was that procedure (see the sketch right after the list):
Do not start any new backup or restore jobs in the affected region.
Wait for existing jobs to finish. This could take hours!
Wait for replication to settle. This typically takes minutes.
Restart all GTape software in the affected region to use another Bigtable cell as their primary (set through command line flags).
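To make the shape of that procedure a bit more concrete, here is what it might have looked like if you squinted and wrote it down as a program. The function names below are hypothetical stand-ins for what were really manual checks and flag changes; only the four steps themselves come from the actual procedure.

```cpp
// A rough sketch of the old, manual failover procedure written as if it were
// a runbook-driving program. Everything here is illustrative, not real GTape code.
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical hooks into the regional GTape control plane (stubbed out so
// that the sketch actually runs; the real checks were done by humans).
void PauseNewJobs(const std::string& region) {
  std::cout << "No new backup/restore jobs in " << region << "\n";
}
bool AnyJobsRunning(const std::string&) { return false; }              // stub: already drained
bool ReplicationSettled(const std::string&, const std::string&) { return true; }  // stub
void RestartWithPrimaryCell(const std::string& region, const std::string& cell) {
  std::cout << "Restarting GTape in " << region << " with --primary_cell=" << cell << "\n";
}

void ManualFailover(const std::string& region, const std::string& old_cell,
                    const std::string& new_cell) {
  PauseNewJobs(region);                              // step 1: stop scheduling jobs
  while (AnyJobsRunning(region)) {                   // step 2: drain in-flight jobs (hours!)
    std::this_thread::sleep_for(std::chrono::minutes(5));
  }
  while (!ReplicationSettled(old_cell, new_cell)) {  // step 3: let replication settle (minutes)
    std::this_thread::sleep_for(std::chrono::seconds(30));
  }
  RestartWithPrimaryCell(region, new_cell);          // step 4: flip the flag and restart
}

int main() {
  ManualFailover("europe-zrh", "/bigtable/zrh/gtape", "/bigtable/dub/gtape");
}
```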
This was obviously a hassle and, as stated, took hours, causing a significant bubble in the backup and restore pipeline. Because of that, we typically did not switch a region back to its original primary Bigtable cell once the maintenance was over. Over time this led to a situation where the graph of which GTape regions were using which Bigtable cells was a complete nightmare of high latency and uneven load distribution: Some regions used Bigtable cells that were far, far away; some Bigtable cells served as primary for multiple regions; and other Bigtable cells were not used as a primary by anyone while their “local” region’s GTape software was doing long-haul TCP to talk to its primary Bigtable somewhere else. It was a mess.
Recently returned to the exalted status of an individual contributor, I was clamoring for a cool software project and I thought that this was something I could take on. I looked at the software architecture, thought about it for a bit, and wrote a design doc for an intricate system that could switch a region’s primary Bigtable cell in flight. The trick was to make all of the GTape processes in the region realize that a failover was about to happen and coordinate among themselves to halt all writes to the (old) Bigtable cell, wait for replication to settle, and then all release and continue writing to the new Bigtable cell. The details are interesting, intricate, mostly secret, and not at all obvious from the patent I wrote on this.
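Because the real design is mostly secret, all I can offer here is a generic, made-up sketch of the barrier-style coordination the previous paragraph describes: a shared phase that every participant watches, a drain phase in which every writer checks in, a settle phase for replication, and a switch phase in which everyone repoints to the new cell without a restart. The Coordinator class and all the names are inventions for illustration; in the real system the shared state machine lived in Chubby, as I mention below.

```cpp
// A deliberately generic sketch of the in-flight primary-cell switch.
// Coordinator stands in for shared state that every GTape process watches.
#include <iostream>
#include <set>
#include <string>

enum class Phase {
  kSteady,    // normal operation against the current primary cell
  kDraining,  // every writer halts writes to the old cell and checks in
  kSettling,  // wait for replication from the old cell to catch up
  kSwitched,  // everyone resumes, now writing to the new primary cell
};

class Coordinator {
 public:
  explicit Coordinator(int participants) : participants_(participants) {}
  Phase phase() const { return phase_; }
  void StartFailover() { phase_ = Phase::kDraining; }
  void ReportDrained(const std::string& who) {
    drained_.insert(who);
    if (static_cast<int>(drained_.size()) == participants_) phase_ = Phase::kSettling;
  }
  void ReplicationSettled() { phase_ = Phase::kSwitched; }
 private:
  int participants_;
  std::set<std::string> drained_;
  Phase phase_ = Phase::kSteady;
};

// Each GTape process polls (or watches) the coordinator and reacts.
void Participate(Coordinator& coord, const std::string& name,
                 std::string& my_primary_cell, const std::string& new_cell) {
  if (coord.phase() == Phase::kDraining) {
    coord.ReportDrained(name);   // this process has stopped writing to the old cell
  } else if (coord.phase() == Phase::kSwitched) {
    my_primary_cell = new_cell;  // repoint to the new cell, no restart required
  }
}

int main() {
  Coordinator coord(/*participants=*/2);
  std::string cell_a = "/bigtable/old", cell_b = "/bigtable/old";
  coord.StartFailover();
  Participate(coord, "scheduler", cell_a, "/bigtable/new");
  Participate(coord, "librarian", cell_b, "/bigtable/new");
  coord.ReplicationSettled();    // once replication has caught up
  Participate(coord, "scheduler", cell_a, "/bigtable/new");
  Participate(coord, "librarian", cell_b, "/bigtable/new");
  std::cout << cell_a << " " << cell_b << "\n";  // both now /bigtable/new
}
```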
However, while I was thinking about this, there was an alternative idea doing the rounds: Some people in the software engineering team proposed switching from Bigtable to another (terrible) storage infrastructure that promised globally consistent transactions (not Spanner, which is amazing but which was not yet available at the time). Tensions ran so high that one of the software engineers threatened to leave the team unless they could work on the alternative storage solution.
I mostly ignored the political fighting around this and started coding. Since I had until recently been a manager, it took me some time before I could focus on anything for more than 30 minutes again, but I got there in the end and had a lot of very happy moments coding multi-threaded C++ and coordinating multiple independent GTape systems using a distributed state machine in Chubby. Eventually I was done, and in an SRE team meeting I showed my assembled colleagues how to switch our test GTape region to another Bigtable cell. It took one command invocation and five minutes. They were impressed! Even better, all the code was already in production, so the only thing we needed to do to activate it was a few flag flips. Truly, some of my best technical work ever.
Unfortunately, I had not paid any attention to the political turmoil around me. The team wasn’t doing very well and there were a lot of unresolved conflicts with our sister team in New York. Right when my software was just about ready to go into production, the entire functional area (storage) got defragmented into a smaller number of locations, and Zürich was not on the list anymore. So “my” SRE team got moved to Dublin and the software engineering team got moved to New York. Since nobody in their right mind would move from Zürich to either Dublin or New York, this meant that entirely new teams were being spun up with entirely new people. At the same time it was obvious that we needed a whole new vision for our tape backup architecture, for which I had some ideas, but since I was not moving to either Dublin or New York, I quickly found myself homeless and looking for a new team.
My software was never used; for all of its technical prowess, the problem it solved was not annoying enough for the new teams working on GTape, and the software was poorly understood and loved by nobody except the people who were no longer on the team. After a few weeks, some staff software engineer in the new team deleted the code from the repository.
So, a total technical win, and I got a patent out of it. But also: Total political failure on my side to make sure that all the stakeholders were on board and committed to the goal. I had promised my GTape SRE manager (who ended up moving to the US and leaving the company) that I would give a tech talk about this project, in which I explained to an audience of some of the finest engineers in the world what the problem was and how I had solved it. There was applause for the technical solution, and a bout of laughter followed by stunned silence when I explained the fate of this beautiful piece of software engineering.
Fortunately, that is not the fate of all of my decent software. When I joined Confluent I did some halfway decent work on a piece of software that deletes unused cloud resources. The last thing I heard is that it is still going strong. You win some, you lose some.
Reminder: Do you have a story of some awesome work you did that never saw the light of day? Or of a great design doc you wrote that somebody butchered in the implementation? Tell me and it might feature in an upcoming edition of Wednesday Wisdom! Doesn’t have to be long, a single paragraph is enough.
And a fine solution tape is: https://gmail.googleblog.com/2011/02/
Thank you for taking on the challenge, and thank you for your contributions to the service and the team, even if they weren’t overtly recognized at the time.
Fun story: After leading what may have been the largest tape backup service in the world at the time, and thus gaining unique experience and perspective, I joined a growing social network company back in 2012 to lead other areas of "data infrastructure". I was chatting with our VP/Global Head of Engineering in my first week, and he said to me "I'd rather gouge out my eyes with a spoon than use tape". Fast forward half a decade, I'd moved on to a different opportunity, and I get an outreach from their head of capacity and engineering asking if I'd meet to consult on my experiences with tape backup at planet scale. The more things change, the more they stay the same.