
Terrible Terraform

Infrastructure as Code is a software engineering problem

(Like this article? Read more Wednesday Wisdom! No time to read? No worries! This article is also available as a podcast).

This week: A weird mix of a rant and some wisdom, in seven acts.

Act 1: A somewhat off-topic introduction but I promise it is going to make sense eventually

Many years ago, I worked on a project for the Dutch state telco to automate that telco’s system management and monitoring. The project was run by Hewlett-Packard, so of course we used HP OpenView Operations Center (OpC) to manage alerts and notifications.

This project also featured the worst production software I ever wrote: The integration between OpenView and the telco’s pager network. A year after the project was delivered, one of my colleagues came up to me and said: “Did you write that pager module? It’s the worst thing I have ever seen!” “I did, and it is,” I agreed. “But it was late and there were deadlines. It works though, and that’s the only good thing I will say about it.”

Part of the project was detecting whether critical system and application processes were running and sending out alarms if they weren’t. The functional specifications we had received contained a big list of processes for the various Sun and HP Unix systems in the telco’s production network. Looking at that list, I decided that instead of writing dozens of crappy shell scripts, it was time for a generic solution. So I fired up the C compiler and wrote something called the “Process Monitor”, which took a specification file with some details about the processes to be watched (command line, user id, parent process) and then did the needful. Because the process monitor was written in C, it could use the kernel’s system calls to get the process list, which was more effective, more efficient, and more reliable than the “ps -ef | grep” shell scripts that were in the original design. Also, because it was a daemon, it could more easily remember the past, which is helpful when sending out alerts.

Everyone thought this was a cool idea and the process monitor was rolled out in the telco’s network. It had some limitations though, so we got some interns to design and write an even more generic agent for monitoring system resources: The System Resource Availability Monitor, or SRAM. SRAM could not only watch processes but also report on file system status, log file content, and other critical elements of the system. SRAM contained a cute configuration language that was used to create pipelines of monitored resources, events, and filters that could get a signal, apply thresholds, and maybe generate an alarm.

The experience of co-writing SRAM gave me my first insight into how incredibly hard it is to create a good domain specific language. As the tool evolved, we added features to deal with ever-evolving use cases that seemed like a match for SRAM. This often required new features in the configuration language. Inevitably, we would run into problems with earlier decisions that were made in simpler times. The language grew, became more subtle, and before we knew it there were all sorts of dependencies on how the runtime engine of SRAM actually went about its business. To make SRAM do what you wanted, you often needed to understand the internals of SRAM, like the order in which resources were scraped and how events propagated down the pipelines. So, instead of a tool that made it easy to do complicated things, we had made a tool that made it simple to do easy things and complex to do complicated things.

I have written about the trouble with domain specific languages (DSLs) here, and I believe it is the fate of all DSLs to become Turing complete, but in the worst possible way.

Act 2: A primer on Terraform

In the last few years I have done quite a lot of work with Terraform.

For those who are blissfully unaware of Terraform: It is a tool that allows you to describe your (cloud) infrastructure in a domain specific language (called HCL) and then Terraform goes off and makes sure that your actual infrastructure matches your description, creating, modifying, and deleting virtual machines, networks, load balancers, DNS entries, queues, and whatnot in the process. Terraform is extensible through plugins. Actually, Terraform itself does not know how to do anything! Every action requires some plugin or other. There are plugins for AWS, Azure, GCP, and many other systems and services.
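To make that concrete, here is a minimal sketch of an HCL configuration (names and the AMI id are illustrative placeholders):

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

# A declarative description of a virtual machine; the aws provider
# plugin translates this into the AWS API calls needed to make
# reality match the description.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI id
  instance_type = "t3.micro"
}
```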

Since HCL cannot do anything, its sole purpose is to prepare the input to the plugins. For instance, the “aws_instance” resource in the “aws” plugin (provider) knows how to call the AWS API to create, modify, and delete a virtual machine. As you can imagine, you need to specify a bazillion attributes to do so, some of which are the output of another object, for example an “aws_vpc” object that is used to create a virtual network, the id of which you need in your virtual machine specification. Unfortunately, the id of the virtual network is only known after its creation, so Terraform needs to create and modify things in the right order to make sure that these dependencies are honored. Terraform can mostly detect the right order because the output arguments of one resource are used as input arguments of another resource. However, as we will see below, some plugins are not very well designed and they thwart Terraform’s efforts to do so, leading to nasty runtime errors. That problem is solved by manually adding dependencies to the configuration.
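A sketch of how such a dependency looks in practice (illustrative values; a real instance takes a subnet rather than the VPC id directly, but the mechanism is the same):

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "main" {
  # The VPC's id only exists after creation; referencing it here is
  # how Terraform infers that the VPC must be created first.
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.main.id

  # When a plugin hides a dependency from Terraform, you can spell
  # it out by hand:
  depends_on = [aws_vpc.main]
}
```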

For all practical purposes, the input to the providers can be thought of as a JSON object and so HCL is nothing more than a fancy way to create these JSON objects. In that fashion, it operates in about the same way as Google’s GCL and many other configuration languages, none of which are particularly better or worse, though I would say that GCL’s fancy way to do late binding would be helpful for Terraform, while accepting that it creates phenomenal problems too…
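To make the point concrete: Terraform in fact accepts an equivalent JSON syntax directly (in .tf.json files).

```hcl
# This HCL block...
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# ...is, for all practical purposes, just this JSON (which Terraform
# will happily read from a .tf.json file):
#
#   {"resource": {"aws_vpc": {"main": {"cidr_block": "10.0.0.0/16"}}}}
```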

In the simplest possible case, you specify all of the values for your new virtual machine as constants in HCL, with maybe a few references to other resources as inputs. However, that approach does not work for anything but the simplest setups. In our field, as soon as you need more than one of anything, you will want to templatize it. HCL supports modules, which are reusable snippets of configuration, complete with input arguments and output values.
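A minimal sketch of such a module (directory layout and names are illustrative):

```hcl
# modules/webstack/variables.tf: the module's input arguments
variable "instance_count" {
  type    = number
  default = 1
}

# modules/webstack/main.tf: the reusable snippet of configuration
resource "aws_instance" "app" {
  count         = var.instance_count
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"
}

# modules/webstack/outputs.tf: the module's output values
output "instance_ids" {
  value = aws_instance.app[*].id
}
```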

Act 3: Production setups are really complicated

Every environment I have ever worked in grew from small beginnings and eventually became really complicated. Size is one problem, but subtle variations between otherwise identical copies of things are another.

Let’s spin up a simple web application consisting of a web server, an application server, and a database. Setting this up with Terraform is no problem at all: A virtual network, a few virtual machines with external disks (for non-ephemeral data), some network ACLs, a database, a DNS entry, and we’re pretty much there. Next, we are going to replicate this setup across the world a few times in different regions. With modules that’s no problem at all: We create a Terraform module for the stack and then instantiate that module for different regions. However, in some regions we might want a slightly different layout of the infrastructure, with maybe a different number of instances to deal with more or less load, and perhaps a larger or smaller database. Fortunately, modules can have parameters, so we can instantiate the module with different arguments and the module will adapt to our requirements.
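A sketch of that per-region instantiation, reusing the hypothetical webstack module from the previous sketch (provider aliases are the idiomatic way to point each instance at a different region):

```hcl
provider "aws" {
  alias  = "eu_west"
  region = "eu-west-1"
}

provider "aws" {
  alias  = "us_east"
  region = "us-east-1"
}

module "eu_west" {
  source         = "./modules/webstack"
  providers      = { aws = aws.eu_west }
  instance_count = 4
}

module "us_east" {
  source         = "./modules/webstack"
  providers      = { aws = aws.us_east }
  instance_count = 10 # more load in this region
}
```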

Later we figure out that we can also instantiate our test environment using that same module! This ensures that the test environment is mostly identical to our production environment. But, for cost reasons, we actually want the web and application server on the same virtual machine, and instead of using the cloud provider’s costly serverless database we want to run our own PostgreSQL database on a t2.nano instance (at $0.0058 per hour). Our current modules do not support that specific use case, so ….

It is time for refactoring!
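What that refactoring often ends up looking like: a module flag that swaps the managed database for a self-hosted one (a sketch with hypothetical names; the count trick is the standard HCL idiom for conditional resources):

```hcl
variable "use_managed_db" {
  type    = bool
  default = true
}

# The costly managed database, only in environments that want it...
resource "aws_db_instance" "managed" {
  count             = var.use_managed_db ? 1 : 0
  engine            = "postgres"
  instance_class    = "db.r5.large"
  allocated_storage = 20
  # ...plus the bazillion other required attributes, elided here
}

# ...and a cheap self-hosted PostgreSQL box everywhere else.
resource "aws_instance" "self_hosted_db" {
  count         = var.use_managed_db ? 0 : 1
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t2.nano"               # the $0.0058/hour option
}
```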

Act 4: Writing IaC is software engineering!

The fundamental problem we have is that we need to generate a DAG of JSON objects that form the input to the plugins and from which Terraform can determine the order in which these plugins need to be called (which is a topological sort of that DAG). Within that DAG there is a fair amount of duplication, which we don’t want to spell out in the input configuration.

This is a software engineering problem and we need a decent programming language to turn the input into the output. Ideally, that programming language supports not only expressions, but also selection (if), repetition (for, while), code reuse (procedures, functions, templates), and inheritance (to express that something is just like something else but with slightly different behavior). And because we want the transformations to be correct, we obviously also want strong typing and the ability to write unit tests.
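For contrast, here is how HCL approximates selection and repetition: as expressions and meta-arguments grafted onto a declarative language rather than proper statements (a sketch with illustrative names):

```hcl
# "if" becomes a ternary expression inside a count meta-argument:
resource "aws_instance" "bastion" {
  count         = var.need_bastion ? 1 : 0
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"
}

# "for" becomes a for_each over a map:
resource "aws_route53_record" "per_region" {
  for_each = var.region_ips # e.g. { eu = "10.0.0.1", us = "10.1.0.1" }

  zone_id = var.zone_id
  name    = "${each.key}.example.com"
  type    = "A"
  ttl     = 300
  records = [each.value]
}
```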

Most (if not all) domain specific languages fall short here, and to the extent that they do support such features, these get added over time, which makes the language a bit of a mess and creates terrible problems with versioning and backward compatibility. At one past employer we had a wide variety of Terraform configurations, each targeting a different Terraform version. It took a principal engineer six months to harmonize that mess and standardize on a single Terraform version and the same plugin versions across all configurations. This process included making changes to the HCL “code” because some features that the configurations depended on had been removed in later versions of Terraform.

As your infrastructure grows and becomes more complicated, the Terraform code needs to evolve and that requires refactoring: Data models change, module abstractions need to be redone, and the production rules that turn your input parameters into the input for the plugins need to be adapted. Unfortunately, there are few tools available that make this easy. There are some open source tools (tfedit, tfmigrate, tfupdate; disclaimer: I have no direct experience with these tools) and of course there are IDE plugins (VSCode, JetBrains). These editor plugins are helpful, but overall I have found writing, and especially refactoring, Terraform code an ordeal compared to writing and refactoring code written in programming languages with great tool support such as C/C++, Golang, and Rust.

I am not mentioning Python on purpose, because it is terrible too when it comes to refactoring.

Writing Infrastructure as Code is software engineering and that means that you need a decent language and decent tools. HCL is not that language and there are virtually no tools to support development and refactoring other than the most basic ones.

Act 5: Plugin hell

HCL cannot do anything itself and Terraform depends on plugins to make it go voom. For every interaction with an infrastructure hosting service you need a specific plugin; even some basic functionality, like generating random numbers, is implemented as plugins.

If I counted correctly, over 300 versions of the AWS Terraform plugin had been released as of early April 2025. Each version changed something, added something, deprecated something, and is only compatible with a specific range of Terraform versions. Because of this, every big Terraform installation lives in its own specific plugin version hell: you want to upgrade to a certain plugin version because it adds a new feature, but you cannot because your Terraform version is too old, and you cannot easily upgrade your Terraform version because that means also having to upgrade other plugins, at which point this becomes a major refactoring operation. When confronted with this, I typically do not engage in the major refactoring but instead hack around it, which is faster, but creates technical debt.
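The standard defensive move is to pin everything, so that any upgrade is a deliberate act (versions below are illustrative):

```hcl
terraform {
  # Pin the core version...
  required_version = ">= 1.5.0, < 2.0.0"

  # ...and every plugin.
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # any 5.x release from 5.40 on, but not 6.0
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6"
    }
  }
}
```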

To recap: Terraform depends on plugins to do the needful and the plugins call the underlying infrastructure (usually: Cloud) API to create, modify, and delete infrastructure objects. It would of course be helpful if Terraform could warn you that particular changes are not possible. But, ultimately, only the underlying APIs know which changes in the infrastructure are supported. Ideally, the plugins would know this too, so that they can warn about it during the planning phase.

Unfortunately, the plugins are decidedly not always ideal, so you regularly run into the problem that a plugin blesses a change that then gets denied by the underlying API when the plan is applied. At that point you are halfway through making all the required changes, with some of them already applied and some not yet, which can easily leave your infrastructure in a non-working state. Worse, since the change you want to make is not possible, you have to come up with another way to do what you want, and that might take some time to figure out.

Sometimes the plugins know that a particular attribute change is not possible, but they still allow it because what they can do is destroy the underlying resource and then recreate it with the changed attribute in place. The plugins should warn about this, and to be honest, they mostly do, but in a large Terraform execution plan this message is easy to overlook, and then you only find out later what happened, which can lead to data loss. You can set a flag on an object that specifies that it cannot be destroyed by Terraform, but doing this consistently requires a high level of discipline. Additionally, this flag is set in the Terraform configuration (through a meta-attribute of every object), so when you do need to destroy that object you have to change the configuration to remove the flag, run the config, and then not forget to edit the configuration again to set the flag back after all is said and done. This is particularly cumbersome when Terraform is not run by you, but by some CI agent.
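The flag in question is the prevent_destroy lifecycle meta-argument; a sketch, with the other required attributes elided:

```hcl
resource "aws_db_instance" "prod" {
  engine            = "postgres"
  instance_class    = "db.r5.large"
  allocated_storage = 100
  # ...other required attributes elided

  lifecycle {
    # Any plan that would destroy this resource now fails outright.
    # To actually destroy it, you must first edit this to false,
    # apply, and remember to set it back afterwards.
    prevent_destroy = true
  }
}
```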

Act 6: State & refactoring

The way Terraform deals with state is a blessing and a curse.

Keeping state is essential for Terraform’s job because through its state file it remembers everything it did last time. When you ask Terraform to apply a configuration, it consults the state file and then it knows what exists, or rather, what should exist. From the current and the desired state, it can figure out what needs to be done to make reality match the new desired configuration.

Unfortunately, Terraform’s state is a DAG that matches the module structure and when you start refactoring, the DAG changes and objects move, even when the number of objects, their attributes, and their dependencies do not change! Terraform has some features for dealing with these moves, but all things considered it is about as convenient as peeling an egg with your winter mittens on.
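The main such feature is the moved block (available since Terraform 1.1), which tells the state machinery that an object changed address during a refactor instead of being destroyed and recreated:

```hcl
# After pulling the web server into a module, record the move so
# Terraform updates its state instead of planning destroy-and-create:
moved {
  from = aws_instance.web
  to   = module.webstack.aws_instance.web
}
```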

Act 7: So…

All things considered: To Terraform or not to Terraform?

Here is the problem: For all of its terribleness, and notwithstanding the fact that there are definitely things that could have been done better, HashiCorp did most things right with Terraform and there is not a lot I can fault them for. Sure, I would want a better language than HCL. Sure, I would love a better underlying data model that could deal better with refactoring. Sure, I would like it if JetBrains would create an actual IDE for Terraform. But, to be honest, a lot of what is terrible about Terraform is in the plugins, which are in turn bound to whatever the underlying APIs make possible.

The version hell of Terraform versions and plugin versions that most Terraform users have to deal with is an inescapable consequence of the speed of development in our field. The first versions get released early to get feedback and practical experience. Then, as we learn, we make changes, but the bear of backward compatibility also starts rearing its ugly head. Cloud providers add new features, which prompts changes in the plugins, which always lag but eventually catch up. I have given some thought to creating a better core engine that would still depend on the rich ecosystem of plugins that exists, and I might get to that once I retire, but I really wonder if I can do much better than Terraform, given the inherent complexity of the problem and the fact that a lot of the problems stem from the plugins.

The sad fact is that production infrastructure configurations are just complicated, and their evolution over time makes any attempt at automation difficult because it is a permanently moving target. As software engineers we know that changes in requirements are hard to deal with, so it should not surprise us that breaking some axioms of your IaC solution requires a lot of work.

What is more surprising is that we steadfastly refuse to treat IaC as a full-blown software engineering problem and that we stick to using bad languages, substandard tools, and crappy runtimes as our solution. Really, any sizable SRE or DevOps team using Terraform should probably have at least two people dedicated to keeping it going and making it easy to use for the rest.

Wednesday Wisdom. Free! Subscribe! You probably won’t regret it.
